A: The Power of Diffusion Models!
Part 0 · Play with the Model using Your Own Text Prompts!
In this part, I create my own text prompts, use the course HuggingFace cluster to generate
4096-dim prompt embeddings (.pth), and then visualize a few generations using different
num_inference_steps. I also fix a random seed that will be reused for all
subsequent parts.
Prompt List & Embeddings
Example Prompt Embedding
Key: a high quality photo
Shape: torch.Size([1, 77, 4096])
Value: tensor([[[-0.0423, -0.0605, 0.0012, ..., -0.0445, -0.0551, 0.0456],
[-0.0925, 0.0050, -0.0171, ..., -0.0464, 0.0248, -0.1573],
[-0.2024, 0.0070, 0.0139, ..., -0.0425, -0.0241, -0.1090],
...,
[ 0.0936, 0.3098, -0.1141, ..., 0.1418, -0.2241, 0.2144],
[ 0.0936, 0.3098, -0.1141, ..., 0.1418, -0.2241, 0.2144],
[ 0.0936, 0.3098, -0.1141, ..., 0.1418, -0.2241, 0.2144]]],
dtype=torch.float16)
Seed: seed = 48
Steps: num_inference_steps = 20, 50, 100
Prompt 1: a high quality photo of a corgi riding a skateboard in downtown Tokyo at night
Prompt 2: a hyper-realistic portrait of an astronaut discovering an alien flower
Prompt 3: a watercolor painting of the UC Berkeley campanile during golden hour
Interestingly, the image quality does not improve monotonically with more denoising steps, contrary to what I expected.
For Prompt 1 (the corgi in downtown Tokyo), the outputs at the different step counts look similar in quality.
For Prompt 2, the 100-step sample captures the flower details most clearly, while the 20-step sample produces something barely recognizable as a flower.
For Prompt 3 (the UC Berkeley Campanile), the 50-step output again provides the best balance between structure
and watercolor smoothness, whereas the 20-step version loses the Berkeley-specific
features and resembles a generic clock tower, and the 100-step version over-emphasizes the 'golden hour' lighting.
Part 1 · Sampling Loops
1.1 Implementing the Forward Process
I implement the forward step using the DDPM formula
\[
x_t \;=\; \sqrt{\bar\alpha_t}\, x_0 \;+\; \sqrt{1 - \bar\alpha_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).
\]
Here, \(\bar\alpha_t\) (alpha_bar[t]) is taken from alphas_cumprod, and the noise is
sampled from a standard Gaussian. This produces a progressively noisier image as
t increases.
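Below is a minimal sketch of this forward step, assuming alphas_cumprod is the scheduler's cumulative alpha product (e.g. stage_1.scheduler.alphas_cumprod):

import torch

def forward(im, t, alphas_cumprod):
    # DDPM forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar = alphas_cumprod[t]
    noise = torch.randn_like(im)
    x_t = torch.sqrt(alpha_bar) * im + torch.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise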
1.2 Classical Gaussian Denoising
I then try to denoise these images using Gaussian blur
(torchvision.transforms.functional.gaussian_blur), tuning kernel size and
sigma by hand for each timestep.
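For reference, the classical baseline is just a low-pass filter; the kernel size and sigma below are illustrative values, not the ones I settled on for every timestep:

import torchvision.transforms.functional as TF

def gaussian_denoise(x_t, kernel_size=7, sigma=2.0):
    # Classical baseline: blur away high-frequency noise (and, unavoidably, detail).
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)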
1.3 One-Step Denoising with DeepFloyd
Using the pretrained stage_1.unet conditioned on
"a high quality photo", I estimate the noise and reconstruct the clean image
in a single reverse step for each noisy input. Below I compare the original, the noisy,
and the one-step estimate.
Compared to Gaussian blur, the diffusion model preserves much sharper edges and more realistic textures, especially at higher noise levels where classical filtering severely washes out the structure.
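A sketch of the one-step estimate; estimate_noise is a hypothetical wrapper around the prompt-conditioned stage_1.unet call that returns the predicted noise, and the reconstruction simply inverts the forward equation:

import torch

def one_step_denoise(x_t, t, alphas_cumprod, estimate_noise):
    eps = estimate_noise(x_t, t)  # hypothetical wrapper around stage_1.unet
    alpha_bar = alphas_cumprod[t]
    # Solve x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps for x_0.
    x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar) * eps) / torch.sqrt(alpha_bar)
    return x0_hat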
1.4 Iterative Denoising
I construct strided_timesteps from 990 down to 0 with stride 30, call
stage_1.scheduler.set_timesteps, and implement
iterative_denoise(im_noisy, i_start). Starting from the Campanile noised at
strided_timesteps[10], I show every 5th denoising step and compare the final estimate
against the one-step denoised and Gaussian-blurred results.
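A sketch of the loop, under the same assumptions as above (estimate_noise stands in for the prompt-conditioned UNet call; the added-noise term of the reverse step is omitted for brevity):

import torch

def iterative_denoise(im_noisy, i_start, strided_timesteps, alphas_cumprod, estimate_noise):
    x_t = im_noisy
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev
        beta_t = 1.0 - alpha_t

        eps = estimate_noise(x_t, t)
        # Clean-image estimate, as in one-step denoising.
        x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
        # DDPM posterior mean: blend the clean estimate with the current noisy image.
        x_t = (torch.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * x_t
    return x_t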
Intermediate Iterative Denoising (every 5th step)
Final Comparison
1.5 Unconditional Sampling from Noise
By setting i_start = 0 and initializing im_noisy with Gaussian
noise, I can generate samples purely from the diffusion prior, conditioned only on
"a high quality photo".
1.6 Classifier-Free Guidance (CFG)
I then implement iterative_denoise_cfg, which runs the UNet twice (conditional
and unconditional) and combines the noise estimates with a guidance scale
w > 1. Below are five samples using CFG with the same text prompt.
With CFG turned on, the images become much more faithful to the text prompt, with stronger semantic structure at the cost of reduced diversity.
During experimentation, I observed that diffusion models tend to generate human portraits even when the prompt does not explicitly specify people. This occurs because phrases like “a high quality photo” introduce strong semantic priors learned from the training data, where high-quality images are disproportionately human-centered.
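For reference, the noise combination itself is a one-liner (a sketch of the standard CFG rule described above; eps_cond and eps_uncond come from the conditional and null-prompt UNet passes):

def cfg_combine(eps_cond, eps_uncond, scale=7.0):
    # Extrapolate from the unconditional estimate towards the conditional one;
    # scale > 1 strengthens prompt adherence at the cost of diversity.
    return eps_uncond + scale * (eps_cond - eps_uncond)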
1.7 · Image-to-Image Translation & Inpainting
Using the SDEdit-style procedure, I start from real (or hand-drawn) images, add noise to a chosen level, and denoise them with CFG to project them back onto the natural image manifold. I then extend this idea to inpainting, where a mask specifies which region is allowed to change.
I apply iterative_denoise_cfg to the Campanile image with different starting
indices i_start ∈ {1, 3, 5, 7, 10, 20}, using the prompt
"a high quality photo".
Campanile
A photo of a bus taken in San Francisco
The little yellow ducks in Victoria Harbour, Hong Kong
1.7.1 · Editing Hand-Drawn and Web Images
I repeat the same SDEdit procedure on one web image and two hand-drawn sketches, again
varying i_start over the same set of values.
Web picture: A cat with an orange on its head
Hand-Drawn 1
Hand-Drawn 2
I first tried a generic prompt that does not explicitly specify color or rendering style. As i_start increases, the diffusion process gradually deviates from the original hand-drawn input, but the outputs drift toward portrait-style photos. I then tried a low CFG_SCALE: the strong pull toward portraits disappears, but the outputs keep a black-and-white, low-saturation appearance.
This behavior suggests that, when starting from a grayscale hand-drawn sketch, the model prioritizes structural consistency over introducing new color information, especially under generic prompts. Without strong color or style constraints, the diffusion model tends to default to line-dominant or desaturated outputs that are consistent with the input distribution.
To encourage colorization, I also experimented with a more descriptive prompt such as “a cute children’s book illustration of a cheerful orange fruit with tiny leaves, drawn in a playful watercolor style,” along with higher CFG scales. Increasing the CFG scale strengthened prompt adherence, but the model still injected little color, indicating that color information is hard to introduce when the edit starts from a predominantly grayscale hand-drawn input.
1.7.2 · Inpainting
I implement inpaint by combining the noisy reconstruction with the original
image outside the mask at each diffusion step. Below I show the Campanile inpainting
result and two additional examples on my own images, all using the prompt "a high quality photo".
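The modification is one extra line per reverse step; in this sketch, mask is 1 where new content may be synthesized and 0 where the original must be kept, and forward is the noising helper from Part 1.1:

def inpaint_step(x_t_denoised, original, mask, t, alphas_cumprod):
    # Outside the mask, overwrite the current estimate with the original image
    # noised to the same timestep, so only the masked region actually changes.
    original_noised, _ = forward(original, t, alphas_cumprod)
    return mask * x_t_denoised + (1.0 - mask) * original_noised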
Campanile Inpainting
My Own Images 1 [Irregular Region]: The sign of an ice cream shop shot at Santa Monica Beach
I masked out a small, irregular region over the sign. The model essentially restored this area to the color of the surrounding sky.
My Own Images 2 [Big Region]: The setting sun blocked by the house
This is a sunset I captured at my residence. The buildings blocked part of the view, so the original shot didn't look very good. To my surprise, however, the diffusion model redrew the buildings as mountains.
1.7.3 · Text-Conditional Image-to-Image Translation
Finally, I repeat the SDEdit procedure but change the conditioning prompt to one of my custom prompts from Part 0. The resulting images gradually trade off between preserving the original content and matching the new text. Here are the prompts I used and the results:
Prompt 1: "a tall shimmering silver Christmas tree decorated with glowing white lights, crystal ornaments, and a bright star on top, elegant winter holiday atmosphere"
Prompt 2: "a soft cinematic scene with the calm and wise atmosphere of an elderly white-haired wizard reminiscent of Dumbledore, with gentle magical lighting and a warm enchanted glow"
Prompt 3: "a serene Icelandic seascape at dusk with a small boat crossing the dark Arctic waters beneath a vivid aurora borealis, cold atmospheric lighting"
Part 1.8 · Visual Anagrams
I implement visual_anagrams, which denoises an image using two different
prompts in opposite orientations, averages the noise estimates, and creates a single image
that looks like one concept upright and another concept when flipped upside down.
Implementation
At each denoising step, I:
- Run the UNet on y_t with prompt A to get the noise estimate ε_A.
- Run the UNet on flip(y_t) with prompt B to get ε_B, then flip it back.
- Average the two estimates to obtain ε = (ε_A + flip⁻¹(ε_B)) / 2.
- Use ε in the DDPM reverse step.
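A sketch of the per-step noise estimate, with estimate_noise again standing in for the prompt-conditioned UNet call and torch.flip providing the vertical flip:

import torch

def anagram_noise(y_t, t, estimate_noise, embed_A, embed_B):
    # Prompt A sees the image upright; prompt B sees it flipped upside down.
    eps_A = estimate_noise(y_t, t, embed_A)
    eps_B = estimate_noise(torch.flip(y_t, dims=[-2]), t, embed_B)
    eps_B = torch.flip(eps_B, dims=[-2])  # flip the estimate back
    return 0.5 * (eps_A + eps_B)          # average the two estimates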
Part 1.9 · Hybrid Images via Factorized Diffusion
I follow the Factorized Diffusion idea to create hybrid images: I denoise with two different prompts, then combine low frequencies from one noise estimate with high frequencies of the other (using Gaussian blur and a residual).
Hybrid Construction
For prompts A and B, I compute noise estimates ε_A and ε_B.
Then:
low_A = gaussian_blur(ε_A, ksize=33, sigma=2)
high_B = ε_B - gaussian_blur(ε_B, ksize=33, sigma=2)
ε_hybrid = low_A + high_B
I plug ε_hybrid into the reverse diffusion update to obtain a single hybrid image.
B: Flow Matching from Scratch!
Part 1: Training a Single-Step Denoising UNet
I first implement an unconditional UNet and train it as a single-step denoiser on MNIST. Given a noisy image \(x = x_0 + \sigma \varepsilon\) with Gaussian noise level \(\sigma\), the model learns a mapping \(f_\theta(x) \approx x_0\) by minimizing an L2 reconstruction loss.
1.1 Implementing the UNet
1.2 Noising Process Visualization
For a fixed clean digit \(x_0\) I add Gaussian noise with different standard deviations \(\sigma \in \{0, 0.2, 0.4, 0.6, 0.8, 1.0\}\) and visualize how the digit gradually degrades into noise.
1.2.1 Training the Denoiser
I train the UNet with hidden dimension \(D = 128\) on the MNIST training set for 5 epochs. I use batch size 256, the Adam optimizer with learning rate \(1\times 10^{-4}\), and resample fresh noise for each batch (at \(\sigma = 0.5\)) so the network sees different noisy versions of the same digits across epochs.
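A compact sketch of the training loop (the UNet is passed in as model; dataset loading uses standard torchvision utilities):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, epochs=5, batch_size=256, lr=1e-4, device="cuda"):
    ds = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(ds, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for x0, _ in loader:
            x0 = x0.to(device)
            noise = torch.randn_like(x0)           # fresh noise every batch
            x = x0 + sigma * noise                 # noisy input
            loss = ((model(x) - x0) ** 2).mean()   # L2 reconstruction loss
            opt.zero_grad(); loss.backward(); opt.step()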
1.2.2 Out-of-Distribution Noise Levels
The denoiser is trained on \(\sigma = 0.5\). I now test it on digits corrupted by out-of-distribution noise levels, keeping the underlying digit fixed and varying \(\sigma\).
1.2.3 Denoising Pure Noise
I repeat the training procedure but now feed pure Gaussian noise \(x \sim \mathcal{N}(0, I)\) as input and still ask the UNet to predict a clean digit. The model converges to mapping random noise to blurry, digit-like blobs that resemble “average” shapes over the dataset.
Pure-noise denoising behavior. When the UNet is trained to map pure Gaussian noise directly to clean MNIST digits,
the generated samples quickly converge to a few blurry,
prototype-like digit shapes rather than diverse, sharp digits.
Many outputs look very similar to each other and often resemble an “average” digit rather than a specific example from the dataset.
Why?
This happens because, under an MSE loss, the network is effectively learning the conditional expectation of the training images given noise.
Since the input is just random noise with no meaningful structure,
the model learns to largely ignore the noise and regress toward the global centroid (or a small set of centroids) of the MNIST training distribution,
leading to low-diversity, smoothed digit templates instead of varied realistic samples.
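In equations (a short derivation, with \(x\) the pure-noise input and \(x_0\) a training digit): the MSE-optimal predictor is
\[
f^\star(x) \;=\; \arg\min_f \, \mathbb{E}\,\|f(x) - x_0\|^2 \;=\; \mathbb{E}[x_0 \mid x] \;=\; \mathbb{E}[x_0],
\]
where the last equality holds because \(x\) is independent of \(x_0\). The best the network can do is output (approximately) the mean digit, which is exactly the blurry template we observe.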
Part 2: Training a Flow Matching Model
In the second part I switch to flow matching. Instead of directly predicting a clean image, a time-conditioned UNet \(f_\theta(x_t, t)\) predicts the velocity field that transports a noisy sample from Gaussian noise towards the MNIST data manifold.
2.1 Time-Conditioned UNet
I inject a scalar time \(t \in [0,1]\) into the UNet via two FCBlocks, which modulate the bottleneck feature (“unflatten”) and the first upsampling block (“up1”). The hidden dimension is set to \(D = 64\). The time scalar is normalized to [0, 1] before being embedded.
2.2 Training the Time-Conditioned UNet
For each step, I sample a training image \(x_1\), Gaussian noise \(x_0 \sim \mathcal{N}(0, I)\), and a time \(t \sim \mathcal{U}(0, 1)\), then construct an interpolated sample \(x_t = (1 - t) x_0 + t x_1\) and compute the target flow \(v_t(x_t) = x_1 - x_0\). The UNet is trained to minimize \(\|f_\theta(x_t, t) - v_t(x_t)\|^2\) with Adam (lr \(= 1 \times 10^{-4}\)) and an exponential LR scheduler with \(\gamma = 0.1\).
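A sketch of one training step under these definitions (the time-conditioned UNet is passed in as model; its exact call signature here is an assumption):

import torch

def flow_matching_step(model, x1, opt, device="cuda"):
    # x1: clean MNIST batch; x0: Gaussian noise; t ~ U(0, 1) per sample.
    x1 = x1.to(device)
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1, 1, 1, device=device)
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
    target = x1 - x0                     # target velocity v_t(x_t)
    pred = model(x_t, t.view(-1, 1))     # f_theta(x_t, t)
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()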
2.3 Sampling from the Time-Conditioned UNet
To sample, I start from pure noise \(x_{t=0} \sim \mathcal{N}(0, I)\) and integrate the learned vector field forward in time using a fixed grid of timesteps. After 1, 5, and 10 epochs, recognizable digits gradually emerge.
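Sampling is plain forward-Euler integration of the learned field on a uniform time grid (a sketch; num_steps and the model call signature are illustrative):

import torch

@torch.no_grad()
def sample(model, num_steps=50, shape=(16, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)       # x at t = 0 is pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0], 1), i * dt, device=device)
        x = x + dt * model(x, t)                # Euler step along the velocity field
    return x                                    # x at t = 1 approximates a data sample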
Part 2 (continued): Class-Conditioned Flow Matching
To improve sample quality and gain control over which digit is generated, I extend the UNet with class conditioning. A one-hot vector \(\mathbf{c} \in \mathbb{R}^{10}\) encodes the digit label and modulates the same layers as the time embedding via additional FCBlocks. Classifier-free guidance is implemented by randomly dropping the class vector 10% of the time during training.
2.4 Adding Class-Conditioning to UNet
Training follows the same flow-matching objective as before, but now the UNet also receives the class embedding. I keep the same optimizer and learning-rate schedule as for the time-only model.
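The class conditioning itself is a one-hot vector that is zeroed out for roughly 10% of the batch, so the same network also learns the unconditional field (a sketch):

import torch
import torch.nn.functional as F

def make_class_embedding(labels, num_classes=10, p_uncond=0.1):
    # One-hot encode the digit labels, then randomly drop the whole vector
    # for ~10% of the samples (classifier-free guidance training).
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep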
2.5 Training the UNet
2.6 Class-Conditioned Sampling
During sampling I use classifier-free guidance with a guidance scale \(\gamma\). For each digit \(y \in \{0, \dots, 9\}\) I generate four independent samples. Compared to the time-only model, digits are sharper and modes are better separated.
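At sampling time, each Euler step uses two UNet passes and the same extrapolation trick as in Part A (a sketch; the three-argument model call and the default guidance scale are assumptions):

import torch

@torch.no_grad()
def guided_velocity(model, x, t, c, gamma=5.0):
    # Conditional pass with the class vector, unconditional pass with zeros,
    # then extrapolate: u = u_uncond + gamma * (u_cond - u_uncond).
    u_cond = model(x, t, c)
    u_uncond = model(x, t, torch.zeros_like(c))
    return u_uncond + gamma * (u_cond - u_uncond)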
Learning Rate Scheduler Ablation
I also trained the class-conditioned model without the exponential learning rate scheduler. To compensate, I used an initial learning rate of \(1 \times 10^{-3}\) and trained for more epochs. This keeps the optimization stable while still allowing the model to converge. The final samples are slightly blurrier but qualitatively similar to the version with the scheduler.