Part 0 · Camera Calibration and 3D Scanning
We calibrate the camera and visualize all training views in viser as frustums to verify
intrinsic/extrinsic parameters and scene coverage.
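As a rough reference for how the frustums get into the scene, here is a minimal sketch. show_frustums is a hypothetical helper, and the add_camera_frustum call follows recent viser releases (where scene methods live on server.scene), so the exact signature may differ slightly from the version I used.

```python
import numpy as np
import viser
import viser.transforms as vtf

def show_frustums(server, images, Ks, c2ws, scale=0.05):
    """Hypothetical helper: add one frustum per training view."""
    for i, (img, K, c2w) in enumerate(zip(images, Ks, c2ws)):
        H, W = img.shape[:2]
        fov_y = 2.0 * np.arctan2(H / 2.0, K[1, 1])       # vertical field of view from fy
        server.scene.add_camera_frustum(
            f"/cameras/frustum_{i:03d}",
            fov=fov_y,
            aspect=W / H,
            scale=scale,
            wxyz=vtf.SO3.from_matrix(c2w[:3, :3]).wxyz,  # rotation as a wxyz quaternion
            position=c2w[:3, 3],                         # camera center in world coordinates
            image=img,                                   # thumbnail rendered inside the frustum
        )

server = viser.ViserServer()
# show_frustums(server, images, Ks, c2ws)  # images, Ks (3x3), c2ws (4x4) come from calibration
```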
Camera Frustums in Viser
Part 1 · Fit a Neural Field to a 2D Image
Fit an MLP-based coordinate network to reconstruct 2D images from continuous pixel coordinates using positional encoding.
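To make the setup concrete, here is a minimal sketch of the positional encoding and coordinate MLP in PyTorch. The class names, depth, width, and number of frequencies below are placeholders, not the exact configuration reported in the figures.

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """gamma(x) = [x, sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^(L-1) pi x), cos(2^(L-1) pi x)]"""
    def __init__(self, num_freqs: int):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)

    def forward(self, x):                                     # x: (..., D) in [0, 1]
        xb = x[..., None, :] * self.freqs[:, None]            # (..., L, D)
        enc = torch.cat([xb.sin(), xb.cos()], dim=-1)         # (..., L, 2D)
        return torch.cat([x, enc.flatten(-2)], dim=-1)        # (..., D + 2*L*D)

class Field2D(nn.Module):
    """Pixel coordinate (u, v) -> RGB, with placeholder layer sizes."""
    def __init__(self, num_freqs=10, width=256, depth=4):
        super().__init__()
        self.pe = PositionalEncoding(num_freqs)
        in_dim = 2 + 2 * num_freqs * 2
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers += [nn.Linear(d, 3), nn.Sigmoid()]             # RGB in [0, 1]
        self.mlp = nn.Sequential(*layers)

    def forward(self, uv):
        return self.mlp(self.pe(uv))
```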
Model Architecture
Training Progression · Provided Test Image
PSNR: -- dB
PSNR: -- dB
Drag the slider to see reconstructions at different training steps.
Positional Encoding & Width
Comparison across two maximum positional-encoding frequencies and two network widths.
PSNR Curve
Part 2 · Fit a Neural Radiance Field from Multi-view Images (Lego)
Implement a NeRF-style volumetric renderer for the Lego scene, including ray sampling, MLP-based density/color prediction, and hierarchical rendering from calibrated multi-view images.
Implementation Overview
- Compute rays from the cameras
  I generate pixel-aligned rays using my own get_rays() function. It first maps pixel coordinates into camera space via pixel_to_camera(), then transforms them into world coordinates using the camera-to-world matrix. Each ray consists of:
  - ray_o – the camera center extracted from c2w
  - ray_d – a normalized direction pointing through the pixel
  (See sketch 1 after this list.)
- Sample points along each ray
  I implement stratified sampling in sample_along_rays(). For each ray, I draw N_samples depth values between near and far. With perturbation enabled, samples are jittered within each interval to encourage anti-aliasing and reduce bias. The output is:
  - pts – 3D sample points in world coordinates
  - deltas – interval lengths used for volumetric integration
  (See sketch 2 after this list.)
- Predict color and density using an MLP
  Each sample point is fed into my custom NeRFMLP. The network takes:
  - the 3D position of each point (with positional encoding)
  - the viewing direction, for view-dependent appearance effects
  and outputs:
  - σ – volume density at each point
  - rgb – view-dependent color prediction
  (See sketch 3 after this list.)
- Combine samples into the final pixel color
  I implement volumetric rendering in volume_render(), following NeRF's rendering equation:
  - compute transmittance along the ray
  - accumulate colors weighted by opacity
  - produce a single RGB value for each pixel
  (See sketch 4 after this list.)
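Sketch 1 · a minimal get_rays()/pixel_to_camera() in the spirit of step 1. The exact signatures and the camera convention here (OpenCV-style intrinsics, pixels back-projected onto the z = 1 plane) are illustrative assumptions.

```python
import torch

def pixel_to_camera(K, uv):
    """Back-project pixel coordinates (u, v) onto the z = 1 plane in camera space."""
    uv_h = torch.cat([uv, torch.ones_like(uv[..., :1])], dim=-1)  # (N, 3) homogeneous pixels
    return uv_h @ torch.linalg.inv(K).T                           # (N, 3) camera-space directions

def get_rays(K, c2w, uv):
    """One (ray_o, ray_d) pair per pixel, expressed in world coordinates."""
    x_cam = pixel_to_camera(K, uv)
    R, t = c2w[:3, :3], c2w[:3, 3]
    dirs = x_cam @ R.T                                   # rotate directions into the world frame
    ray_d = dirs / dirs.norm(dim=-1, keepdim=True)       # unit directions through each pixel
    ray_o = t.expand_as(ray_d)                           # camera center extracted from c2w
    return ray_o, ray_d
```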
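Sketch 2 · stratified sampling as described in step 2; the jitter scheme is simplified and near/far are passed in by the caller.

```python
import torch

def sample_along_rays(ray_o, ray_d, near, far, n_samples=64, perturb=True):
    """Stratified depths per ray; returns world-space sample points and interval lengths."""
    t = torch.linspace(near, far, n_samples, device=ray_o.device)  # (n_samples,)
    t = t.expand(ray_o.shape[0], n_samples).clone()                # (n_rays, n_samples)
    if perturb:
        t = t + torch.rand_like(t) * (far - near) / n_samples      # jitter inside each bin
    pts = ray_o[:, None, :] + t[..., None] * ray_d[:, None, :]     # (n_rays, n_samples, 3)
    deltas = torch.diff(t, dim=-1, append=t[:, -1:] + 1e10)        # last interval ~ infinite
    return pts, deltas
```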
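Sketch 3 · a trimmed-down stand-in for NeRFMLP with the interface described in step 3. pos_dim and dir_dim are the sizes of the positionally encoded inputs, and the layer counts and widths are illustrative rather than the ones I actually trained.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """(encoded position, encoded view direction) -> (sigma, rgb), simplified."""
    def __init__(self, pos_dim, dir_dim, width=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)        # density depends on position only
        self.feature = nn.Linear(width, width)
        self.rgb_head = nn.Sequential(               # color also conditions on the view direction
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))       # keep densities non-negative
        rgb = self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return sigma, rgb
```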
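Sketch 4 · the rendering step in code; shapes assume per-ray batches of samples as produced by sketch 2 and densities/colors from sketch 3.

```python
import torch

def volume_render(sigma, rgb, deltas):
    """NeRF quadrature: alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i = prod_{j<i}(1 - alpha_j),  C = sum_i T_i * alpha_i * c_i."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)             # (n_rays, n_samples)
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = trans * alpha                                          # per-sample contribution
    return (weights[..., None] * rgb).sum(dim=-2)                    # (n_rays, 3) pixel colors
```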
Rays & Samples Visualization
Training Progression (Lego)
Train PSNR: -- dB · Val PSNR: -- dB · Loss: --
Drag the slider to browse NeRF predictions at different iterations.
Validation PSNR Curve
Spherical Rendering Video
Part 2.6 · Training NeRF with My Own Data
Novel View GIF
Code & Hyperparameter Adjustments
Hyperparameter Tuning
I experimented with several key NeRF hyperparameters to make the custom scene both stable and efficient to train. First, I tightened the depth range to near = 0.001 and far = 0.5, which better matches the small physical scale of my capture and avoids wasting samples on empty space. I then swept the number of points sampled along each ray (n_samples) over 32, 64, and 128. With 32 samples the reconstruction was noticeably noisy, while 64 samples produced sharp geometry and clean colors; increasing to 128 slightly improved details but made training significantly slower. Finally, I set the total number of optimization steps to 5000, which was enough for the training PSNR to saturate and for the rendered novel views to look visually consistent without overfitting.
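For reference, here are the final settings collected in one place. The key names are my own shorthand, not actual flags in my training script.

```python
# Final settings for the custom capture (names are shorthand, not real script flags).
config = dict(
    near=0.001,      # tight depth bounds matching the small physical scale of the capture
    far=0.5,
    n_samples=64,    # 32 was noticeably noisy, 128 slightly sharper but much slower
    max_iters=5000,  # enough for training PSNR to saturate without overfitting
)
```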
Training Loss Curve
Intermediate Renders
Drag the slider to browse NeRF reconstructions at different training steps.
Summary
When I first started working on this, I really thought I could achieve something close to the Lego example. I later realized it is far more difficult than I imagined. I feel like I tried almost every possible combination at every stage, yet every step turned out to be far more sensitive and error-prone than expected.
During image capture, I tested all kinds of setups:
• 6 tags as calib_images + 1 tag (same size) for object_images
• 6 tags as calib_images + 6 tags for object_images
• using the 6-tag object_images directly for calibration
• trying a single large tag for calibration
My conclusion: larger tags work much better. They’re more stable, easier to detect, and much less sensitive to lighting or background noise. Small tags get messed up by shadows, reflections, or noisy textures extremely easily.
I also hit a ton of issues while shooting photos. Eventually I realized the object's distance doesn't need to change at all; it's the tag's viewpoint that needs diversity, otherwise the pose solve becomes unstable. Downsampling is a must as well, because full-resolution noise makes tag detection much worse.
If possible, find a place with uniform lighting and a clean background. My desk has complicated wood grain patterns, so the detector kept hallucinating tag IDs that didn’t exist. On top of that, the desk sits between a lamp and a window, so I had to constantly avoid shadows and reflections. I basically spent the whole shooting process tiptoeing around these problems just to prevent PnP from exploding.
As for implementation details, the part that consumed the most time was visualization in Viser. My camera poses were always flipped or chaotic—sometimes all flipped in one direction (which can be fixed with a scale), sometimes half of them flipped and the other half not (which needs manual axis correction in code), and sometimes just totally inconsistent. In those cases, my final conclusion is simple: the images are the problem. Blurry shots, misdetected tags, or extreme viewing angles all lead to unstable pose estimation. That’s something I only understood after wrestling with it for days.
In the final rendering stage, I also ran into another big issue: because the captured viewpoints were too limited, NeRF didn’t have enough angular coverage. As a result, the rendered output had obvious artifacts—especially those “floating blurry layers” that look like ghost surfaces. Later I finally understood that this isn’t the model’s fault. If the training views don’t constrain the space enough, NeRF simply starts hallucinating. The less information you give it, the more it invents.
I kept trying right up until the deadline, but I still think the result is not good enough. I plan to try again when I have time. If anyone reading this page is willing to visit my GitHub, check my code, and point out areas I can improve, I would be very grateful.