Final Project - Neural Radiance Field

Filip Malm-Bägén



Introduction

This project is about implementing neural radiance fields (NeRFs) to reconstruct 3D scenes from 2D images by sampling and querying rays to integrate color information. The goal is to build efficient and robust methods for ray generation, sampling, and rendering to ensure high-quality results while minimizing overfitting.

Part 1: Fit a Neural Field to a 2D Image

The main task was to fit a neural field F that maps 2D pixel coordinates {u, v} to RGB color values {r, g, b}. To achieve this, I implemented a Multi_Layer_Perceptron class with the following architecture:

MLP Architecture
2D MLP Architecture
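
As an illustration, a minimal PyTorch sketch of such a class is shown below. The hidden width matches the base configuration (256 channels), but the exact number of layers is an assumption read off the diagram rather than the exact code.

```python
import torch
import torch.nn as nn

class Multi_Layer_Perceptron(nn.Module):
    """Maps positionally encoded 2D coordinates to RGB values in [0, 1]."""

    def __init__(self, in_dim: int, hidden_dim: int = 256, num_layers: int = 4):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers += [nn.Linear(hidden_dim, 3), nn.Sigmoid()]  # RGB output in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```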

The inputs to the network were augmented using Sinusoidal Positional Encoding (PE) using get_PE, which expands the input dimensionality from 2 to 4 * L + 2 based on the frequency level L. To train on large images, I had to implement a dataloader class, RandomPixelSampler, which samples N random pixels per iteration and returns both their normalized coordinates and RGB values.
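
A sketch of these two pieces is shown below. get_PE produces the 4 * L + 2 output dimensionality described above (the original coordinates are kept alongside the sin/cos features); the internals of RandomPixelSampler are my own assumption about how such a sampler could be structured.

```python
import numpy as np
import torch

def get_PE(x: torch.Tensor, L: int) -> torch.Tensor:
    """Sinusoidal positional encoding: (N, 2) coords in [0, 1] -> (N, 4L + 2)."""
    feats = [x]
    for i in range(L):
        feats.append(torch.sin((2.0 ** i) * np.pi * x))
        feats.append(torch.cos((2.0 ** i) * np.pi * x))
    return torch.cat(feats, dim=-1)

class RandomPixelSampler:
    """Samples N random pixels per iteration from an (H, W, 3) image."""

    def __init__(self, image: np.ndarray, n_samples: int):
        self.image = torch.from_numpy(image).float()  # assumed already scaled to [0, 1]
        self.H, self.W = image.shape[:2]
        self.n_samples = n_samples

    def sample(self):
        ys = torch.randint(0, self.H, (self.n_samples,))
        xs = torch.randint(0, self.W, (self.n_samples,))
        coords = torch.stack([xs / self.W, ys / self.H], dim=-1)  # normalized (u, v)
        colors = self.image[ys, xs]                               # ground-truth RGB
        return coords, colors
```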

During the last step, I experimented with different hyperparameters to see how the results were affected. As seen in the images, the base configuration performed the best. The reduced frequency made the fox look smoother and more cartoonish, which is no surprise, since a lower L makes the network less sensitive to high-frequency details. The wider configuration and the higher learning rate performed the worst, both resulting in black images, which the PSNR graph confirms. However, this was not the case for the palace image, where the wider configuration performed the best, likely due to the increased complexity of that image.

Neural Field Network Configurations and Hyperparameters

Configuration   Channel Size   Learning Rate   L    Note
Base            256            1e-2            10   -
Reduced Freq    256            1e-2            3    Reduced frequency
Wider           512            1e-2            10   Double channel size
Higher LR       256            1e-1            10   Higher learning rate
Original fox image
Base configuration on fox
Reduced frequency on fox
Wider on fox
Higher learning rate on fox

The images show iterations 1, 20, 100, 500, 1000, and 2000.

PSNR fox
PSNR for different configurations for fox
Original Palace of Fine Arts image
Base configuration on palace
Reduced frequency on palace
Wider on palace
Higher learning rate on palace
PSNR palace
PSNR for different configurations for palace

The reconstruction quality was evaluated using the Peak Signal-to-Noise Ratio (PSNR), calculated as PSNR = 10 * log10(1 / MSE), which assumes pixel values normalized to [0, 1].
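
In code this is a one-liner over the mean squared error; a minimal sketch, assuming predictions and targets are tensors with values in [0, 1]:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """PSNR in dB for images normalized to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```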

Part 2: Fit a Neural Radiance Field from Multi-view Images

This part involves using a Neural Radiance Field (NeRF) to represent a 3D scene by learning a mapping from position and view direction to color and density: F: {x, y, z, d} → {r, g, b, σ}. Using multi-view calibrated images of a Lego scene (200x200 resolution) and their corresponding camera poses, the task aims to perform inverse rendering. The provided data includes camera-to-world matrices for training, validation, and test cameras.

Part 2.1: Create Rays from Cameras

To render the 3D scene, I implemented functions to convert pixel coordinates into rays, each defined by an origin (r_o) and a normalized direction (r_d). Two coordinate transformations are involved: one function transforms camera coordinates to world coordinates using the camera-to-world matrix, and another transforms pixel coordinates to camera coordinates using the intrinsic matrix and a pixel depth. For ray generation, the ray origin is the camera's translation vector, and the direction is obtained by normalizing the difference between the world coordinate of a depth-1 point and the origin. These transformations are implemented with batched matrix multiplications for efficiency.
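
A sketch of this pipeline is given below; the function names (transform, pixel_to_camera, pixel_to_ray) mirror the steps described above, but their exact signatures and the camera conventions (pixel-center offsets, axis flips) are simplifying assumptions.

```python
import torch

def transform(c2w: torch.Tensor, x_c: torch.Tensor) -> torch.Tensor:
    """Camera-to-world: rotate and translate (N, 3) camera-space points."""
    R, t = c2w[..., :3, :3], c2w[..., :3, 3]
    return x_c @ R.transpose(-1, -2) + t

def pixel_to_camera(K: torch.Tensor, uv: torch.Tensor, s: float) -> torch.Tensor:
    """Back-project (N, 2) pixel coordinates to camera space at depth s."""
    uv_h = torch.cat([uv, torch.ones_like(uv[..., :1])], dim=-1)   # homogeneous pixels
    return s * (uv_h @ torch.inverse(K).transpose(-1, -2))

def pixel_to_ray(K: torch.Tensor, c2w: torch.Tensor, uv: torch.Tensor):
    """Return ray origins and normalized directions for a batch of pixels."""
    r_o = c2w[..., :3, 3].expand(uv.shape[0], 3)                   # camera center
    x_w = transform(c2w, pixel_to_camera(K, uv, s=1.0))            # world point at depth 1
    r_d = x_w - r_o
    r_d = r_d / torch.norm(r_d, dim=-1, keepdim=True)              # unit direction
    return r_o, r_d
```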

Part 2.2: Sampling

Next, I developed the ray sampling methods. I trained the model with a batch size of 10k rays, generated by randomly sampling 10k pixels globally across the training set of 100 images. To accelerate training, all rays and pixel coordinates were precomputed at the start. To render the 3D scene, each ray was discretized into sampled points along its path, which allows querying those points and integrating their colors to determine the final color rendered at a particular pixel. Using uniform sampling, I generated points along each ray as t = np.linspace(near, far, n_samples), with near=2.0, far=6.0, and n_samples=64. The 3D coordinates for these points were calculated as x = r_o + r_d * t, where r_o is the ray origin and r_d the ray direction. During training I added small perturbations, t = t + np.random.rand(*t.shape) * t_width, so that over the course of training all locations along the ray get sampled rather than only the fixed bin centers.
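
A sketch of this sampling step is given below; the function name sample_along_rays is illustrative, and t_width is assumed to be the spacing between adjacent samples.

```python
import numpy as np

def sample_along_rays(r_o: np.ndarray, r_d: np.ndarray,
                      near: float = 2.0, far: float = 6.0,
                      n_samples: int = 64, perturb: bool = True) -> np.ndarray:
    """Discretize rays into n_samples 3D points each: (N, 3) -> (N, n_samples, 3)."""
    t = np.linspace(near, far, n_samples)                        # (n_samples,)
    t = np.broadcast_to(t, (r_o.shape[0], n_samples)).copy()     # (N, n_samples)
    if perturb:
        t_width = (far - near) / n_samples
        t = t + np.random.rand(*t.shape) * t_width               # jitter during training
    return r_o[:, None, :] + r_d[:, None, :] * t[..., None]      # x = r_o + r_d * t
```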

Part 2.3: Putting the Dataloader All Together

I ran the code to verify that I had implemented everything correctly.

100 Randomly Sampled Rays

Part 2.4: Neural Radiance Field

The Neural Radiance Field was implemented as a deep neural network that maps spatial coordinates and viewing directions to color and density values. This network was enhanced to handle higher-dimensional inputs (3D position and view direction vectors) and outputs (RGB colors plus density), compared to the MLP from part 1. The complete network architecture is illustrated below:

3D MLP Architecture
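
A condensed PyTorch sketch of such a network is given below. The split into a density head and a view-dependent color head follows the standard NeRF design; the exact layer counts and the placement of the skip connection are assumptions and may differ from the diagram.

```python
import torch
import torch.nn as nn

class NeRF_MLP(nn.Module):
    """Maps encoded 3D position and view direction to (rgb, density)."""

    def __init__(self, pos_dim: int, dir_dim: int, hidden: int = 256):
        super().__init__()
        # First stage processes the encoded position only.
        self.stage1 = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Second stage re-injects the encoded position (skip connection).
        self.stage2 = nn.Sequential(
            nn.Linear(hidden + pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())  # sigma >= 0
        self.feature = nn.Linear(hidden, hidden)
        # Color head conditions on the encoded view direction.
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),                        # rgb in [0, 1]
        )

    def forward(self, x_enc: torch.Tensor, d_enc: torch.Tensor):
        h = self.stage1(x_enc)
        h = self.stage2(torch.cat([h, x_enc], dim=-1))
        sigma = self.density_head(h)
        rgb = self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return rgb, sigma
```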

Part 2.5: Volume Rendering

Volume rendering integrates color values along each ray to produce the final pixel color. At each sampled point along a ray, the network predicts a color and a density value. These values are then combined using a numerical approximation of the volume rendering equation: colors are accumulated front to back along each ray, the density values (σ) determine the opacity at each point, each color is weighted by the transmittance (how much light reaches that point without being absorbed), and the distance intervals (δᵢ) between sampled points scale the opacities.

The implementation uses PyTorch's torch.cumprod for efficient calculation of transmittance values. The distance intervals δᵢ are derived from the sampling points generated earlier in the pipeline. This numerical approximation enables efficient parallel computation across all rays in a batch.

\[\begin{align} \hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \text{ where } T_i=\exp \left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}\]
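
A sketch of this discrete sum using torch.cumprod is shown below; the function name volrend is illustrative, and a constant step size between samples is assumed for simplicity.

```python
import torch

def volrend(sigmas: torch.Tensor, rgbs: torch.Tensor, step_size: float) -> torch.Tensor:
    """Discrete volume rendering: blend per-sample colors into one pixel color.

    sigmas: (N_rays, n_samples, 1) densities; rgbs: (N_rays, n_samples, 3) colors.
    Returns (N_rays, 3) rendered colors.
    """
    deltas = torch.full_like(sigmas, step_size)                   # delta_i between samples
    alphas = 1.0 - torch.exp(-sigmas * deltas)                    # 1 - exp(-sigma_i * delta_i)
    # T_i = prod_{j<i} (1 - alpha_j): exclusive cumulative product of transparencies.
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alphas                                      # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights * rgbs).sum(dim=1)                            # sum over samples along the ray
```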

I trained using the Adam optimizer with learning_rate = 1e-3, batch_size = 10_000, and iterations = 2500. The images below visualize the training process.

Depth Render 0
Depth Render 10
Depth Render 20
Depth Render 40
Final Render 0
Final Render 10
Final Render 20
Final Render 40

The following plot shows the PSNR curve over the training iterations. The PSNR is still steadily increasing, and I believe the model could achieve a PSNR greater than 30 if trained for more iterations.

PSNR NeRF
PSNR for NeRF

The final image is the rendered Lego scene. The rendering quality is quite good, and the model has learned to represent the scene well. Next to it is the depth image, which represents the distance from the camera to the object; the darker the pixel, the closer it is to the camera. The difference is that the depth rendering integrates only the density values along each ray, while the color rendering uses both the predicted colors and the densities.

NeRF Render
NeRF Depth

This webpage design was partly made using generative AI models.