Diffusion models learn to reverse a noising process that turns data into Gaussian noise over T steps
Forward process is fixed: q(x_t|x_{t-1}) = N(x_t; sqrt(1-β_t) x_{t-1}, β_t I) with a variance schedule β_t
Reverse process is learned: p_θ(x_{t-1}|x_t) = N(x_{t-1}; μ_θ(x_t,t), σ_t^2 I)
Training objective simplifies to predicting the added noise ϵ ∼ N(0,I) at each timestep
DDPM sampling is stochastic (1000 steps); DDIM is deterministic (10-50 steps) at the cost of quality
Biggest production mistake: weighting all timesteps identically during training; low-t steps often need an explicit loss weight or a schedule adjustment
Plain-English First
Imagine you have a beautiful sand castle on a beach. You take a video of waves slowly crashing over it until it's just a flat, featureless beach of random sand. Now imagine playing that video backwards — watching chaos magically reassemble into a castle. That's exactly what a diffusion model does: it learns how to reverse the process of turning something beautiful into pure noise, so it can start from random static and 'sculpt' a photo, a piece of music, or anything else entirely from scratch.
Diffusion models have quietly staged a coup in generative AI. Stable Diffusion, DALL·E 2, Imagen, Sora: every one of these headline-grabbing systems is powered by the same elegant probabilistic idea, introduced by Sohl-Dickstein et al. in 2015 and made practical by DDPM (Ho et al.) in 2020. They've dethroned GANs as the dominant generative architecture not by being simpler, but by being more stable to train, more theoretically grounded, and dramatically better at capturing the full diversity of a data distribution without mode collapse.
The core problem every generative model must solve is: how do you learn to produce samples from a complex, high-dimensional distribution (e.g., all possible realistic photographs) when you only have a finite training set? GANs solved it with adversarial games that are notoriously hard to balance. VAEs solved it with a learned latent bottleneck that trades fidelity for tractability. Diffusion models solve it differently — by decomposing generation into thousands of tiny, individually tractable denoising steps, each one learned by a neural network. The math is cleaner, the training signal is more stable, and the results speak for themselves.
By the end of this article you'll understand the forward noising process and why it's designed the way it is, the reverse denoising process and the neural network that drives it, the mathematical connection to score matching and why that matters, the practical difference between DDPM and DDIM sampling, and how to implement a minimal but fully functional diffusion model in PyTorch. You'll also know the production gotchas that cost teams weeks to debug.
What is a Diffusion Model? — The Core Idea
A diffusion model is a generative model that learns to produce data from pure random noise through a sequential denoising process. The key insight is to decompose the complex task of generating a full image into thousands of small, tractable steps. Each step transforms a slightly noisy image into a slightly cleaner one. The model learns the reverse of a fixed forward process that gradually adds Gaussian noise.
The forward process (noising) is a Markov chain: given data x₀ ∼ q(x), we define q(x₁|x₀), q(x₂|x₁), ..., q(x_T|x_{T-1}) where each step adds small Gaussian noise. For T large enough, x_T is approximately isotropic Gaussian. The reverse process (denoising) is then learned: p_θ(x_{t-1}|x_t). The model is trained to maximize a variational lower bound on the data likelihood.
Why does this work? Because denoising a slightly noisy image is a much easier problem than generating a realistic image from scratch. The model can focus on local structure recovery, and the cumulative effect of many small corrections yields globally coherent outputs.
The Forward (Noising) Process — Adding Chaos Methodically
The forward process is a fixed Markov chain that transforms data into noise over T steps. It's designed so that the distribution at any timestep can be computed directly from the original data without simulating all intermediate steps. This is crucial for efficient training.
With α_t = 1 − β_t and α̅_t = α_1 · α_2 · ... · α_t, the closed form is:
q(x_t|x₀) = N(x_t; √{α̅_t} x₀, (1 − α̅_t) I), or equivalently x_t = √{α̅_t} x₀ + √{1−α̅_t} ϵ with ϵ ∼ N(0,I)
This means during training we can randomly sample a timestep t, compute the corresponding noisy image x_t from x₀ and ϵ, and train the model to predict ϵ from x_t. No iterative simulation needed.
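io/thecodeforge/diffusion/forward.py (Python)
import torch

# Assumed schedule (not given in the original listing): linear betas as in Ho et al. (2020).
T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)
alpha_bar_t = torch.cumprod(1.0 - betas, dim=0)  # float64 keeps the long product accurate
sqrt_alpha_bar_t = alpha_bar_t.sqrt().float()
sqrt_one_minus_alpha_bar_t = (1.0 - alpha_bar_t).sqrt().float()


def get_index_from_list(vals, t, x_shape):
    """Utility: get values at timestep t and broadcast to batch shape."""
    batch_size = t.shape[0]
    out = vals.gather(-1, t.cpu())
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)


def forward_diffusion_sample(x_0, t, device="cpu"):
    """
    Sample from q(x_t | x_0) given timestep t.
    Returns (x_t, noise) where noise = epsilon ~ N(0, I).
    `device` is unused here; outputs stay on the device of x_0.
    """
    noise = torch.randn_like(x_0)
    sqrt_alpha_bar = get_index_from_list(sqrt_alpha_bar_t, t, x_0.shape)
    sqrt_one_minus_alpha_bar = get_index_from_list(sqrt_one_minus_alpha_bar_t, t, x_0.shape)
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
    return x_t, noise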
Why the closed form matters
Without the closed form, training would require running the forward chain T times per sample — O(T) cost per update. The closed form reduces it to O(1). This is what makes diffusion models practical.
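To see why the shortcut is safe, the sketch below pushes the mean coefficient and the variance of x_t through all T single-step transitions and compares the result with the closed form √{α̅_t} and 1 − α̅_t. It uses plain Python floats and an assumed linear schedule (β from 1e-4 to 0.02, T = 1000, the values reported by Ho et al., 2020); the schedule and all variable names are illustrative, not taken from the article's listing.

```python
import math

# Assumed linear schedule; any betas in (0, 1) would work.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# Step by step: propagate E[x_t | x_0] = mean_coef * x_0 and Var[x_t | x_0]
# through q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I).
mean_coef, var = 1.0, 0.0   # x_0 is known exactly
alpha_bar = 1.0             # running product of alpha_t = 1 - beta_t
for beta in betas:
    alpha = 1.0 - beta
    mean_coef = math.sqrt(alpha) * mean_coef
    var = alpha * var + beta
    alpha_bar *= alpha

# Closed form: one jump from x_0 to x_T.
closed_mean_coef = math.sqrt(alpha_bar)
closed_var = 1.0 - alpha_bar

print(f"iterated: mean_coef={mean_coef:.6f}, var={var:.6f}")
print(f"closed  : mean_coef={closed_mean_coef:.6f}, var={closed_var:.6f}")
```

Both routes give the same two numbers, which is why training can sample x_t in one step. The final α̅_T is also tiny under this schedule, so x_T is very close to pure Gaussian noise.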
Production Insight
Memory for storing the full α̅_t array is trivial (<1 MB for T=1000), but computing it in float32 on GPU can cause precision issues for small values.
Use float64 for the cumulative product or a log-space formulation.
Key Takeaway
The forward process is fully determined by the schedule β_t.
You can directly jump to any timestep t without iterating.
Precision of cumulative product matters — use log-space to avoid vanishing underflow.
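A log-space formulation of the cumulative product could look like the sketch below. It accumulates log(1 − β_t) with `math.log1p` and exponentiates at the end. Python floats are already 64-bit, so this illustrates the formulation rather than reproducing a float32 failure; the linear schedule and the function names are assumptions for the example.

```python
import math

# Assumed linear schedule; swap in your own list of betas.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]


def alpha_bars_log_space(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T] from a running sum of log(1 - beta_t)."""
    alpha_bars, log_alpha_bar = [], 0.0
    for beta in betas:
        log_alpha_bar += math.log1p(-beta)  # log(1 - beta), accurate for small beta
        alpha_bars.append(math.exp(log_alpha_bar))
    return alpha_bars


def alpha_bars_direct(betas):
    """Reference: plain running product of (1 - beta_t)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


log_space = alpha_bars_log_space(betas)
direct = alpha_bars_direct(betas)
max_rel_diff = max(abs(a - b) / b for a, b in zip(log_space, direct))
print(f"alpha_bar_T = {log_space[-1]:.3e}, max relative difference = {max_rel_diff:.1e}")
```

In a tensor library the same idea is a cumulative sum of `log1p(-betas)` followed by `exp`, computed once at start-up and cached.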
The Reverse (Denoising) Process — Learning to Unadd Noise
Now we need to learn the reverse: given a noisy image x_t, predict a slightly cleaner image x_{t-1}. The reverse process is also Gaussian with a learned mean:
μ_θ(x_t, t) = (1/√{α_t}) · (x_t − (β_t/√{1−α̅_t}) · ϵ_θ(x_t, t))
where ϵ_θ is the denoising U-Net that predicts the noise ϵ that was mixed into x₀ to produce x_t. This formulation reparameterizes the reverse step to predict noise instead of the clean image directly. Why noise? Because the target ϵ has unit variance at every timestep, which keeps the loss well scaled.
Training uses a simple mean-squared error between the true noise ϵ and the predicted noise ϵ_θ:
L = ||ϵ - ϵ_θ(√{α̅_t} x₀ + √{1-α̅_t} ϵ, t)||²
io/thecodeforge/diffusion/model.py (Python)
import torch
import torch.nn as nn


class DenoisingUNet(nn.Module):
    """A simple encoder-decoder for noise prediction (no skip connections). In practice, use a larger model."""

    def __init__(self, in_channels=3, base_ch=64):
        super().__init__()
        # The time embedding (base_ch*4 channels) is concatenated to the image in forward(),
        # so the first conv takes in_channels + base_ch*4 input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels + base_ch * 4, base_ch, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(base_ch * 4, base_ch * 2, 3, padding=1),
            nn.SiLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(base_ch * 2, base_ch, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(base_ch, in_channels, 3, padding=1),
        )
        # Time embedding: a small MLP on the scalar timestep
        # (a sinusoidal embedding is the more common choice)
        self.time_embed = nn.Sequential(
            nn.Linear(1, base_ch * 4),
            nn.SiLU(),
            nn.Linear(base_ch * 4, base_ch * 4),
        )

    def forward(self, x, t):
        # Add time embedding to each spatial position (H and W must be divisible by 4)
        t_emb = self.time_embed(t.float().unsqueeze(-1))
        t_emb = t_emb.view(t_emb.shape[0], -1, 1, 1).expand(-1, -1, x.shape[2], x.shape[3])
        x = torch.cat([x, t_emb], dim=1)
        x = self.encoder(x)
        x = self.decoder(x)
        return x
Watch out: The U-Net depth
Shallow U-Nets can't capture long-range dependencies. For 256×256 images, use at least 3 down/up blocks with attention layers at low resolution.
Production Insight
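The simplified noise prediction loss weights every timestep equally, but the variance of the gradients is not equal across timesteps.
At low timesteps (t small) the signal-to-noise ratio is very high (x_t ≈ x₀), so the noise the model must predict is only a tiny part of its input. With a uniform loss weight these steps can behave very differently from the high-noise steps, which shows up as gradient starvation.
Rule: treat the per-timestep weight as a tunable choice. Compare the uniform simplified loss from Ho et al. (2020) with an SNR-based weighting such as w(t) = 1/(1 + SNR(t)), where SNR(t) = α̅_t/(1 − α̅_t).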
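As a rough illustration of the weighting w(t) = 1/(1 + SNR(t)), the sketch below tabulates SNR and weight for an assumed linear schedule, taking SNR(t) = α̅_t/(1 − α̅_t) (the usual definition, stated here as an assumption). With that definition the weight simplifies to 1 − α̅_t, so it is small at low t and close to 1 at high t; whether that helps a given model is an empirical question.

```python
def snr_loss_weights(betas):
    """Return one (snr, weight) pair per timestep, with weight = 1 / (1 + SNR)."""
    pairs, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snr = alpha_bar / (1.0 - alpha_bar)
        pairs.append((snr, 1.0 / (1.0 + snr)))
    return pairs


# Assumed linear schedule, T = 1000.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
pairs = snr_loss_weights(betas)

for t in (0, 499, 999):
    snr, weight = pairs[t]
    print(f"t={t:4d}  SNR={snr:12.4f}  weight={weight:.6f}")
```

In a training loop the weight for each sampled t would multiply the per-sample squared error before averaging over the batch.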
Key Takeaway
The reverse process predicts the noise, not the clean image.
This formulation keeps the loss magnitude consistent across timesteps.
Gradient starvation for low-t steps requires explicit weighting or schedule adjustments.
Predict ϵ vs Predict x₀
If: Predicting noise ϵ → Use: Simpler loss, works well for low T. Standard in DDPM.
If: Predicting x₀ directly → Use: More stable for high T but requires careful normalization. Used in some variants.
If: Predicting v (velocity of x_t) → Use: Best of both worlds. Used in progressive distillation (Salimans & Ho, 2022).
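The three targets carry the same information, so one can be converted into another once x_t and α̅_t are known. The sketch below works on scalars and assumes the v-parameterization v = √{α̅_t} ϵ − √{1−α̅_t} x₀ from Salimans & Ho (2022); function names are made up for the example.

```python
import math


def make_targets(x0, eps, alpha_bar):
    """Build the noisy sample x_t and the velocity target v from (x0, eps)."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = a * x0 + s * eps
    v = a * eps - s * x0
    return x_t, v


def x0_and_eps_from_v(x_t, v, alpha_bar):
    """Recover x0 and eps from a velocity prediction (uses a^2 + s^2 = 1)."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return a * x_t - s * v, s * x_t + a * v


def x0_from_eps(x_t, eps, alpha_bar):
    """Recover x0 from a noise prediction by inverting the closed-form forward step."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return (x_t - s * eps) / a


x0, eps, alpha_bar = 0.3, -1.2, 0.7
x_t, v = make_targets(x0, eps, alpha_bar)
x0_back, eps_back = x0_and_eps_from_v(x_t, v, alpha_bar)
print(x0_back, eps_back, x0_from_eps(x_t, eps, alpha_bar))
```

Note that `x0_from_eps` divides by √{α̅_t}, which is tiny at high noise levels; that division is one reason the choice of target matters near t = T.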
Training: The Simplified Variational Loss
The full diffusion model is trained by minimizing the variational bound on the negative log-likelihood. This bound reduces to a sum of KL divergences between the true reverse conditional and the learned reverse conditional at each step. Remarkably, Ho et al. (2020) showed that a simplified loss, just the mean-squared error between true noise and predicted noise, works at least as well as the full bound:
L_simple = E_{t, x₀, ϵ} [ ||ϵ − ϵ_θ(√{α̅_t} x₀ + √{1−α̅_t} ϵ, t)||² ]
where t is uniformly sampled from {1, ..., T}. The uniform weighting over t works because the model sees all noise levels equally during training, which forces it to learn a consistent denoising function across the entire noise range.
In practice, we train with mini-batches, sampling a random t for each image in the batch. The U-Net takes both the noisy image x_t and the timestep t (as a sinusoidal embedding). This joint conditioning allows the model to behave differently at different noise levels.
io/thecodeforge/diffusion/training.py (Python)
import torch


def train_step(model, optimizer, x_0, forward_process):
    """Single training step for DDPM."""
    # forward_process is assumed to expose `betas` and `q_sample(x_0, t) -> (x_t, noise)`
    batch_size = x_0.shape[0]
    T = len(forward_process.betas)
    # Sample random timestep for each image
    t = torch.randint(0, T, (batch_size,), device=x_0.device).long()
    # Get noisy image and noise
    x_t, noise = forward_process.q_sample(x_0, t)
    # Predict noise
    predicted_noise = model(x_t, t)
    # Simple MSE loss
    loss = torch.nn.functional.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping to prevent divergence
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
Training efficiency tip
Use mixed-precision training (AMP) to gain ~2x speed. Since the loss involves only MSE on noise, there's no precision-sensitive operation. Enable after confirming gradients are stable.
Production Insight
Batch normalization interacts poorly with the timestep conditioning because statistics shift per timestep. Use group normalization instead.
Also, the learning rate must be tuned: the optimal LR is often 1e-4 for small models, 2e-5 for large (100M+ params).
Rule: always use group norm in the U-Net and a cosine LR schedule with warmup.
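One possible shape for a cosine schedule with linear warmup is sketched below as a pure function of the step index. The default values (1,000 warmup steps, decay to zero) are placeholders rather than recommendations from the article, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, total_steps, base_lr=1e-4, warmup_steps=1000, min_lr=0.0):
    """Learning rate at `step`: linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 10_000
for step in (0, 500, 999, 1000, 5500, 10_000):
    print(f"step {step:6d}: lr = {lr_at(step, total_steps):.3e}")
```

In PyTorch the same curve can be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step, total_steps) / base_lr` as the multiplicative factor.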
Key Takeaway
Train with uniform timestep sampling and simple MSE on noise prediction.
Use group normalization, not batch norm, for timestep-conditioned models.
Gradient clipping is essential for stability — set max_norm=1.0.
Sampling: DDPM vs DDIM — The Speed-Quality Trade-off
Once trained, we generate new images by starting from pure noise x_T ∼ N(0,I) and iteratively applying the reverse step for t = T, T-1, ..., 1. This is the DDPM (Denoising Diffusion Probabilistic Models) sampler. It's stochastic: at each reverse step we sample from the predicted Gaussian:
x_{t-1} = μ_θ(x_t, t) + σ_t · z, where z ∼ N(0,I)
This stochasticity is what gives DDPM its high quality — it can correct errors from previous steps. The cost: we must run all T steps (typically 1000), making sampling slow.
DDIM (Denoising Diffusion Implicit Models) makes the process deterministic by setting σ_t = 0. This allows us to skip many steps during sampling. For example, we can sample only every 20th timestep (50 total steps). The quality degrades gracefully. DDIM also enables latent space interpolation: because the process is deterministic, you can travel between two generated images in noise space and get a smooth interpolation.
Which should you use? If quality is paramount and you have GPU time, use DDPM with T=1000. If you need fast sampling for deployment or experimentation, use DDIM with 50-200 steps.
io/thecodeforge/diffusion/sampling.py (Python)
import torch


@torch.no_grad()
def sample_ddpm(model, img_shape, forward_process, device='cpu'):
    """Generate an image using DDPM (stochastic, all T steps)."""
    # forward_process is assumed to expose tensors `betas`, `alphas` and `alpha_bars`
    x = torch.randn(img_shape, device=device)
    T = len(forward_process.betas)
    for t in reversed(range(T)):
        t_tensor = torch.full((img_shape[0],), t, device=device, dtype=torch.long)
        predicted_noise = model(x, t_tensor)
        alpha = forward_process.alphas[t]
        alpha_bar = forward_process.alpha_bars[t]
        # Mean of x_{t-1}: (1/sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta)
        pred_mean = (1 / alpha.sqrt()) * (x - (1 - alpha) / (1 - alpha_bar).sqrt() * predicted_noise)
        if t == 0:
            x = pred_mean  # no noise is added at the final step
        else:
            noise = torch.randn_like(x)
            sigma = forward_process.betas[t].sqrt()  # sigma_t^2 = beta_t
            x = pred_mean + sigma * noise
    return x


@torch.no_grad()
def sample_ddim(model, img_shape, forward_process, steps=50, device='cpu'):
    """DDIM deterministic sampling with skipping."""
    T = len(forward_process.betas)
    # Evenly spaced timesteps from T-1 down to 0, so sampling starts at the noisiest level
    times = torch.linspace(T - 1, 0, steps).long().tolist()
    x = torch.randn(img_shape, device=device)
    for i, t in enumerate(times):
        t_tensor = torch.full((img_shape[0],), t, device=device, dtype=torch.long)
        pred_noise = model(x, t_tensor)
        alpha_bar = forward_process.alpha_bars[t]
        pred_x0 = (x - (1 - alpha_bar).sqrt() * pred_noise) / alpha_bar.sqrt()
        if i + 1 == len(times):
            x = pred_x0  # last step: return the clean estimate
        else:
            # For DDIM, re-noise the estimate to the next timestep's alpha_bar
            alpha_bar_next = forward_process.alpha_bars[times[i + 1]]
            x = alpha_bar_next.sqrt() * pred_x0 + (1 - alpha_bar_next).sqrt() * pred_noise
    return x
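The deterministic DDIM update can be sanity-checked on a toy problem where the exact noise predictor is known. The sketch below assumes every training sample equals a constant c, so the ideal ϵ prediction is (x_t − √{α̅_t} c)/√{1−α̅_t}. It stands in for a trained network and uses scalars instead of images; the schedule, the helper names and the choice of c are all assumptions made for the example.

```python
import math
import random

# Assumed linear schedule, T = 1000.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def ddim_timesteps(T, steps):
    """Evenly spaced timesteps from T-1 down to 0 (assumes 1 <= steps <= T)."""
    if steps == 1:
        return [T - 1]
    return [round((T - 1) * (1 - i / (steps - 1))) for i in range(steps)]


def oracle_eps(x_t, t, c):
    """Exact noise prediction when the data distribution is a point mass at c."""
    a, s = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return (x_t - a * c) / s


def ddim_sample(x_T, steps, c):
    """Deterministic DDIM sampling (sigma_t = 0) driven by the oracle predictor."""
    times = ddim_timesteps(T, steps)
    x = x_T
    for i, t in enumerate(times):
        ab = alpha_bars[t]
        eps = oracle_eps(x, t, c)
        pred_x0 = (x - math.sqrt(1.0 - ab) * eps) / math.sqrt(ab)
        if i + 1 == len(times):
            return pred_x0
        ab_next = alpha_bars[times[i + 1]]
        x = math.sqrt(ab_next) * pred_x0 + math.sqrt(1.0 - ab_next) * eps
    return x


random.seed(0)
x_T = random.gauss(0.0, 1.0)
for steps in (10, 50):
    print(f"{steps} steps -> {ddim_sample(x_T, steps, c=0.5):.6f}")
```

With a perfect predictor the number of steps does not matter in this toy case. With a real network the prediction error at each step is what makes very short schedules lose quality.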
Mental Model: Differentiable Rendering
DDPM is like image generation with a high-quality but slow denoising engine.
DDIM sacrifices some stochasticity for speed and reproducibility.
You can mix: use DDPM for final generation, DDIM for quick prototypes.
DDIM also enables latent space arithmetic (e.g., 'make it more blue' by adding vectors in x_T space).
Production Insight
DDIM with 50 steps is often sufficient for deployment — the quality drop from 1000-step DDPM is barely noticeable for most applications.
However, if you need the highest quality (e.g., medical imaging), stick with DDPM.
Rule: always benchmark your model with both 50-step DDIM and 1000-step DDPM before choosing.
Key Takeaway
DDPM is the original stochastic sampler — high quality, 1000 steps.
DDIM is deterministic, allows skipping steps, and gives 20x speedup.
The quality gap between 50-step DDIM and 1000-step DDPM is small for most tasks.
Score Matching Connection — The Theoretical Foundation
Wait, there's a deeper connection. The noise prediction network ϵ_θ(x_t, t) is closely related to the score function of the data distribution — the gradient of the log-density at noise level t. Specifically:
ϵ_θ(x_t, t) ≈ -√{1 - α̅_t} · ∇_{x_t} log p(x_t)
This means that diffusion models are implicitly learning the score function at multiple noise levels. This perspective unifies them with score-based generative models (Song & Ermon, 2019). The denoising score matching objective (Vincent, 2011) is exactly what we're optimizing.
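The relation between noise prediction and score can be checked in closed form when the data is a one-dimensional Gaussian. The sketch below assumes x₀ ∼ N(m, v), for which both the ideal noise prediction E[ϵ | x_t] and the true score of the noisy marginal are known analytically; the function names are invented for the illustration.

```python
import math


def optimal_eps(x_t, alpha_bar, m, v):
    """E[eps | x_t] when x_0 ~ N(m, v) and x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return s * (x_t - a * m) / (a * a * v + s * s)


def true_score(x_t, alpha_bar, m, v):
    """d/dx log p_t(x) for the noisy marginal p_t = N(sqrt(ab) * m, ab * v + 1 - ab)."""
    a = math.sqrt(alpha_bar)
    return -(x_t - a * m) / (alpha_bar * v + (1.0 - alpha_bar))


def score_from_eps(eps, alpha_bar):
    """Convert a noise prediction into a score estimate."""
    return -eps / math.sqrt(1.0 - alpha_bar)


m, v = 2.0, 0.25
for alpha_bar in (0.99, 0.5, 0.01):
    for x_t in (-1.0, 0.3, 2.5):
        eps = optimal_eps(x_t, alpha_bar, m, v)
        print(alpha_bar, x_t, score_from_eps(eps, alpha_bar), true_score(x_t, alpha_bar, m, v))
```

The last two printed columns agree for every combination, which is the identity ϵ = −√{1−α̅_t} · ∇ log p(x_t) evaluated for the ideal predictor.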
Why does this connection matter? Because it explains why diffusion models don't suffer from mode collapse: score-based models estimate the gradient of the data distribution, which is unique and identifies the full distribution. They can generate diverse samples without adversarial training.
Furthermore, the score matching view enables extensions like classifier-free guidance (where you combine conditional and unconditional score estimates) and accelerated sampling (e.g., via the Probability Flow ODE).
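Classifier-free guidance is often implemented as a linear extrapolation between the unconditional and the conditional noise predictions. The sketch below shows that arithmetic on plain lists, using the common "guidance scale" convention where 0 gives the unconditional prediction and 1 the conditional one. The function name and the toy values are illustrative; in a real sampler the two inputs would be the outputs of the same network with and without the conditioning signal.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

Scales above 1 push the sample further in the direction the condition suggests, which tends to trade diversity for prompt adherence.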
GANs rely on a discriminator to push the generator towards the data manifold — the discriminator can be fooled into ignoring certain modes. Score matching estimates the gradient of the true density directly; there's no game. If the model estimates the score accurately everywhere, it captures every mode.
Production Insight
The score matching viewpoint reveals a subtle issue: at very high noise levels (t close to T), the score becomes small and isotropic, making the prediction unreliable.
This is why conditioning on timestep is critical — the model learns different behaviors per noise level.
Rule: ensure your time embedding covers the full range (use sinusoidal embedding with high frequencies at low t).
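For reference, a sinusoidal timestep embedding can be written in a few lines. The sketch below uses plain Python and one common frequency convention (geometric spacing from 1 down to roughly 1/max_period); other implementations divide by `half - 1` instead of `half`, so treat the exact spacing as an assumption.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Return a length-`dim` embedding [sin(t*f_0..), cos(t*f_0..)] for an even `dim`."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


for t in (0, 1, 500, 999):
    emb = sinusoidal_embedding(t, dim=8)
    print(t, [round(x, 3) for x in emb])
```

The fastest frequency changes noticeably between neighbouring timesteps, while the slowest ones vary smoothly over the whole range, so the network can tell apart both nearby and distant noise levels.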
Key Takeaway
Noise prediction is equivalent to score matching at multiple scales.
Score matching avoids mode collapse by directly estimating the gradient of the data distribution.
Time conditioning must be expressive enough to capture behavior at all noise levels.
Latent Diffusion (LDM) — The Secret to High-Resolution Generation
Pixel-space diffusion is expensive: applying a U-Net to a 1024×1024 image is computationally prohibitive. Latent Diffusion Models (LDM), introduced by Rombach et al. (2022) and used in Stable Diffusion, solve this by compressing the image into a lower-dimensional latent space via a pretrained autoencoder. The diffusion process then runs in this latent space, which is 4× to 64× smaller in spatial dimensions.
The architecture consists of three components:
1. A VAE (vector quantized or continuous) that maps images to latents and back. The encoder compresses 256×256×3 to 64×64×4 (or 32×32×4).
2. A U-Net denoiser that operates on the latent representation. It is conditioned on the timestep and optionally on text embeddings via cross-attention.
3. A decoder that reconstructs the image from the denoised latent.
Because the latent space is much smaller, the U-Net can be shallower and each forward pass is drastically cheaper than in pixel space (the number of sampling steps itself does not change). This makes fine-tuning and inference feasible on a single consumer GPU and enables high-resolution synthesis. For example, Stable Diffusion's U-Net has about 860M parameters but runs in seconds on an A100.
The key insight: the VAE's latent space is perceptually equivalent to the pixel space but with reduced spatial redundancy. The diffusion model learns the distribution of these perceptually compressed latents. Conditioning mechanisms (text, segmentation maps, etc.) are injected via cross-attention layers in the U-Net.
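The shapes quoted above (256×256×3 compressed to 64×64×4 or 32×32×4) translate into concrete savings. The small helper below, with hypothetical names, computes the latent shape for a given per-side downsample factor and reports how many fewer values and spatial positions the denoiser has to process.

```python
def latent_shape(height, width, downsample_factor, latent_channels=4):
    """Shape (H, W, C) of the latent for an image of the given size."""
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image size must be divisible by the downsample factor")
    return (height // downsample_factor, width // downsample_factor, latent_channels)


def num_values(shape):
    h, w, c = shape
    return h * w * c


image = (256, 256, 3)
for factor in (4, 8):
    latent = latent_shape(image[0], image[1], factor)
    value_ratio = num_values(image) / num_values(latent)
    position_ratio = (image[0] * image[1]) / (latent[0] * latent[1])
    print(f"f={factor}: latent={latent}, {value_ratio:.0f}x fewer values, "
          f"{position_ratio:.0f}x fewer spatial positions")
```

Attention cost grows with the square of the number of spatial positions, so the reduction in positions matters more than the raw value count suggests.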
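import torch
import torch.nn as nn


class LatentDiffusion(nn.Module):
    """Minimal LDM: VAE + Denoising U-Net on latents."""

    def __init__(self, vae_encoder, vae_decoder, denoiser_unet, alpha_bars):
        super().__init__()
        self.encoder = vae_encoder      # pretrained; frozen during diffusion training
        self.decoder = vae_decoder      # pretrained; frozen
        self.denoiser = denoiser_unet   # trained on latent noise prediction
        # Assumption: the caller passes the precomputed cumulative products alpha_bar_t, shape [T]
        self.register_buffer("alpha_bars", alpha_bars.float())

    def encode(self, x):
        # Returns latent z with shape [B, C_latent, H_latent, W_latent]
        with torch.no_grad():  # the VAE stays frozen
            return self.encoder(x)

    def decode(self, z):
        # Returns reconstructed image
        return self.decoder(z)

    def q_sample(self, z_0, t, noise):
        # Closed-form forward process, applied to latents instead of pixels
        alpha_bar = self.alpha_bars[t].view(-1, 1, 1, 1)
        return alpha_bar.sqrt() * z_0 + (1 - alpha_bar).sqrt() * noise, noise

    def forward(self, x_0, t, noise=None):
        # Diffusion training in latent space
        z_0 = self.encode(x_0)
        noise = torch.randn_like(z_0) if noise is None else noise
        z_t, noise = self.q_sample(z_0, t, noise)
        noise_pred = self.denoiser(z_t, t)
        loss = nn.functional.mse_loss(noise_pred, noise)
        return loss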
Why LDM works so well
The VAE latent space is perceptually uniform — Euclidean distances in latent space correspond roughly to perceptual differences. This makes the denoising task easier and the model more robust to minor pixel-level artifacts.
Production Insight
The VAE must be trained first on a large corpus of images. Freezing the VAE during diffusion training is critical to prevent the diffusion process from distorting the latent manifold. Also, the latent space has a specific variance that can affect training — z-scoring the latents (normalizing to zero mean, unit variance) improves stability. Rule: always normalize the latent codes before feeding them to the diffuser.
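Latent normalisation amounts to estimating a mean and standard deviation once, applying them before diffusion training, and undoing them before decoding. The sketch below shows that round trip on a flat list of made-up latent values; in practice the statistics would be computed over many encoded training images and applied to tensors, and the helper names here are hypothetical.

```python
import math


def fit_latent_stats(latents):
    """Mean and standard deviation over a flat list of latent values."""
    n = len(latents)
    mean = sum(latents) / n
    std = math.sqrt(sum((z - mean) ** 2 for z in latents) / n)
    return mean, std


def normalize(latents, mean, std):
    """Z-score the latents before they are fed to the diffusion model."""
    return [(z - mean) / std for z in latents]


def denormalize(latents, mean, std):
    """Undo the normalisation before handing latents to the VAE decoder."""
    return [z * std + mean for z in latents]


raw = [4.2, -1.3, 7.9, 0.4, -5.6, 3.1]  # made-up latent values for illustration
mean, std = fit_latent_stats(raw)
z = normalize(raw, mean, std)
restored = denormalize(z, mean, std)
print(f"mean={mean:.3f}, std={std:.3f}")
print("normalised stats:", fit_latent_stats(z))
```

Forgetting the `denormalize` step at sampling time is an easy mistake to make: the decoder then receives latents on the wrong scale and produces washed-out or saturated images.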
Key Takeaway
LDM performs diffusion in a compressed latent space, enabling high-resolution generation on consumer hardware. The VAE encoder-decoder is pretrained and frozen. Latent normalization is essential for stable training.
Generative Model Comparison — Stability, Speed, and Quality
Choosing the right generative architecture for a production application requires understanding the trade-offs between training stability, sampling speed, and output quality. The table below compares GANs, VAEs, Flows, and Diffusion models across these axes.
Property | GANs | VAEs | Normalizing Flows | Diffusion Models
Training stability | Low (minimax game) | High (ELBO) | High (exact likelihood) | High (MSE)
Mode coverage | Poor (mode collapse) | Good (covers all, but blurry) | Good (exact density) | Excellent (score matching)
Sampling speed | Very fast (1 forward pass) | Fast (1 forward pass) | Fast (1 pass) | Slow (50–1000 steps)
Quality (FID) | Excellent (best before diffusion) | Good (blurry) | Good (competitive) | Best (state-of-the-art)
Likelihood evaluation | No | Approximate | Exact | Tractable (ELBO)
Parallelizable generation | Yes | Yes | Yes | No (sequential)
Conditional generation | Hard (needs conditioning networks) | Easy (conditioned latent) | Hard | Easy (cross-attention)
Best for | High-speed, real-time applications | Anomaly detection, interpolation | Density estimation | High-quality synthesis, image editing
Key takeaways for production: If you need real-time generation (e.g., interactive avatars), GANs are still viable. For tasks requiring high fidelity and diversity (e.g., stock image generation), diffusion models are now the default. VAEs are unmatched for anomaly detection due to their reconstruction likelihood. Flows are rarely used in production due to large model sizes.
Ecosystem note (2026)
Diffusion models now power almost every major generative application: image (Stable Diffusion, Midjourney), video (Sora, Runway), audio (Stable Audio), and 3D (DreamFusion, Gaussian Splatting). GANs remain dominant in real-time avatar rendering (StyleGAN3, StyleGAN-XL).
Production Insight
Latency requirements dictate architecture choice. If your service needs sub-100ms generation, diffusion models are not suitable without heavy distillation. For sub-second responses, consider GANs or a distilled diffusion model (e.g., Consistency Models, Latent Consistency Models). Rule: always benchmark on your target hardware before committing to a model family.
Key Takeaway
Diffusion models offer the best quality and diversity but are slow. GANs are fast but fragile. VAEs are stable and fast but blurry. Choose based on your production latency and fidelity requirements.
ControlNet: Spatial Conditioning for Pretrained Diffusion Models
ControlNet (Zhang & Agrawala, 2023) is a neural network architecture that adds spatial conditioning to pretrained diffusion models without requiring full fine-tuning. It works by copying the encoder blocks of the U-Net and connecting them via zero-initialised convolutional layers (zero convolutions). The copied weights are trainable side branches that learn to control the generation based on an input condition (e.g., edge maps, depth maps, pose skeletons).
The beauty of ControlNet is that it preserves the knowledge of the base model: the side branches start from zeros, so the model initially generates unconditionally. Only the side branch weights are updated during training, leaving the original U-Net untouched. This makes ControlNet extremely parameter-efficient: you can train a new condition with less than 5% of the data and time required for a full fine-tune.
Training ControlNet on canny edges: The user provides an edge map as input. The side branch encodes it into features at multiple resolutions, which are added to the U-Net skip connections via zero convs. The base U-Net remains frozen. After training on 50K–200K image–condition pairs, the model learns to generate images that respect the edge structure.
For production, ControlNet is typically used with Stable Diffusion. The pipeline is: input image → condition extractor (e.g., Canny edge detector, depth estimator) → ControlNet side branch → standard denoising steps. The result is a generated image that faithfully follows the input structure.
io/thecodeforge/diffusion/controlnet.py (Python)
import copy

import torch
import torch.nn as nn


class ZeroConv2d(nn.Module):
    """Zero-initialised 1x1 convolution: outputs zeros at the start of training."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        nn.init.zeros_(self.conv.weight)

    def forward(self, x):
        return self.conv(x)


class ControlNetSideBranch(nn.Module):
    """Copies encoder layers from the main U-Net, connected via zero convs."""

    def __init__(self, unet_encoder_blocks):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.zero_convs = nn.ModuleList()
        for block in unet_encoder_blocks:
            # Deep copy: the side branch starts from the pretrained weights
            # but is trained separately from the frozen base U-Net.
            self.blocks.append(copy.deepcopy(block))
            # Assumes each block exposes an `out_channels` attribute.
            self.zero_convs.append(ZeroConv2d(block.out_channels, block.out_channels))

    def forward(self, x, feature_maps_to_skip):
        # x: condition features; feature_maps_to_skip: skip features of the frozen base U-Net
        outputs = []
        for block, zc, skip in zip(self.blocks, self.zero_convs, feature_maps_to_skip):
            x = block(x)
            # Add the conditioned features via the zero conv (a no-op at initialisation)
            outputs.append(skip + zc(x))
        return outputs
Mental Model: Adding a Rudder to a Ship
The base model is frozen; only the side branch is trained.
Zero convolutions ensure the condition starts with no effect, preventing catastrophic forgetting.
Training data: pairs of condition (e.g., edge map) and target image.
At inference, the condition guides the denoising process step by step.
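The zero-convolution idea can be reduced to a single scalar weight to see both properties at once: the output is unchanged at initialisation, yet the weight still receives a gradient and starts learning on the first step. The numbers and names in the sketch below are made up for the illustration and stand in for feature maps and a real loss.

```python
def control_output(base_feature, control_feature, zero_weight):
    """Base U-Net feature plus the side-branch feature scaled by the zero-initialised weight."""
    return base_feature + zero_weight * control_feature


def loss(y, target):
    return (y - target) ** 2


base_feature, control_feature, target = 0.8, 0.5, 1.0
w = 0.0  # zero initialisation

y_init = control_output(base_feature, control_feature, w)
# dL/dw = dL/dy * dy/dw = 2 * (y - target) * control_feature
grad_w = 2.0 * (y_init - target) * control_feature

learning_rate = 0.1
w_after = w - learning_rate * grad_w
y_after = control_output(base_feature, control_feature, w_after)

print(f"output at init : {y_init} (identical to the base feature)")
print(f"gradient on w  : {grad_w:.3f}")
print(f"loss before/after one step: {loss(y_init, target):.4f} / {loss(y_after, target):.4f}")
```

The gradient depends on the side-branch feature rather than on the weight itself, so a zero weight does not block learning as long as the condition produces non-zero features.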
Production Insight
ControlNet training is sensitive to the condition quality. If the condition is too noisy (e.g., blurry depth map), the model learns to ignore it. Preprocess conditions carefully. Also, the zero convolution init can lead to dead neurons in early steps — use a residual warmup: start with 1×1 conv that is near-zero but not exactly zero, or add a small learnable bias. Rule: always validate that the condition injection has an effect by comparing unconditional and conditional outputs at the same seed.
Key Takeaway
ControlNet enables spatial conditioning with minimal training by using frozen base models and zero-initialised side branches. It is the standard approach for image-to-image generation in production.
[Figure: ControlNet Architecture Flow]
Keras/TensorFlow Implementation — Forward, Training, and Sampling
While PyTorch dominates research, TensorFlow and Keras are still widely used in production pipelines, especially for serving on Google Cloud, TFX, or mobile (TFLite). Below is a minimal Keras implementation of the key diffusion components: forward noising, a simple U-Net, and the training step.
Forward process in TensorFlow: the closed-form sampling works identically. We'll use TensorFlow's vectorised operations.
U-Net: a Keras model with time embedding. Older Keras versions have no dedicated SiLU layer, so we use the equivalent swish activation (tf.keras.activations.swish).
Training step: written as a custom training loop or compiled model. The code below shows a train_step for a custom fit override or standalone.
Keras' fit expects model inputs and outputs. We can create a functional model that takes [x_t, t] and outputs ϵ_pred. Then we compile with MSE loss and train with a custom data generator that samples t randomly.
The key differences from PyTorch are device placement and normalisation. TensorFlow places ops on an available GPU automatically (use tf.distribute for multi-GPU), whereas PyTorch code moves tensors explicitly. Also, many Keras examples reach for batch normalisation (BatchNormalization); remember to replace it with GroupNormalization (available in Keras 3 or via tensorflow_addons).
import tensorflow as tf
import tensorflow_addons as tfa  # provides GroupNormalization on older TF/Keras


def get_alpha_bar(betas):
    alphas = 1.0 - betas
    return tf.math.cumprod(alphas, axis=0)


def q_sample(x_0, t, alpha_bar, noise=None):
    noise = tf.random.normal(tf.shape(x_0)) if noise is None else noise
    alpha_bar_t = tf.gather(alpha_bar, t)[:, tf.newaxis, tf.newaxis, tf.newaxis]
    sqrt_alpha_bar = tf.sqrt(alpha_bar_t)        # signal coefficient
    sqrt_one_minus = tf.sqrt(1.0 - alpha_bar_t)  # noise coefficient
    return sqrt_alpha_bar * x_0 + sqrt_one_minus * noise, noise


def sinusoidal_embedding(timesteps, embedding_dim):
    # Standard sinusoidal time embedding
    half_dim = embedding_dim // 2
    emb = tf.math.log(10000.0) / (half_dim - 1)
    emb = tf.exp(tf.range(half_dim, dtype=tf.float32) * -emb)
    emb = tf.cast(timesteps[:, tf.newaxis], tf.float32) * emb[tf.newaxis, :]
    return tf.concat([tf.sin(emb), tf.cos(emb)], axis=-1)


def build_unet(input_shape=(64, 64, 3), base_channels=64):
    # Inputs: noisy image + integer timestep
    image_input = tf.keras.Input(shape=input_shape, name='noisy_image')
    t_input = tf.keras.Input(shape=(), name='timestep', dtype=tf.int32)
    # Time embedding -> Dense -> Dense -> projection to the feature width
    t_emb = tf.keras.layers.Lambda(
        lambda t: sinusoidal_embedding(t, base_channels * 4))(t_input)
    t_dense = tf.keras.layers.Dense(base_channels * 4, activation='swish')(t_emb)
    t_dense = tf.keras.layers.Dense(base_channels * 4, activation='swish')(t_dense)
    t_dense = tf.keras.layers.Dense(base_channels)(t_dense)
    # Reshape to (batch, 1, 1, channels) so it broadcasts over the spatial dimensions
    t_dense = tf.keras.layers.Reshape((1, 1, base_channels))(t_dense)
    # Encoder (using GroupNorm from tensorflow_addons or custom)
    x = tf.keras.layers.Conv2D(base_channels, 3, padding='same')(image_input)
    x = tfa.layers.GroupNormalization(groups=32)(x)  # requires tensorflow-addons
    x = tf.keras.layers.Activation('swish')(x)
    x = x + t_dense  # inject the timestep so the model can adapt to the noise level
    # ... more blocks
    # Decoder with skip connections (simplified)
    # For brevity, we return a placeholder output
    output = tf.keras.layers.Conv2D(3, 3, padding='same')(x)
    model = tf.keras.Model(inputs=[image_input, t_input], outputs=output)
    return model


# Training loop
@tf.function
def train_step(model, optimizer, x_0, betas):
    alpha_bar = get_alpha_bar(betas)
    batch_size = tf.shape(x_0)[0]
    num_steps = tf.shape(betas)[0]
    t = tf.random.uniform((batch_size,), maxval=num_steps, dtype=tf.int32)
    x_t, noise = q_sample(x_0, t, alpha_bar)
    with tf.GradientTape() as tape:
        pred = model([x_t, t])
        loss = tf.reduce_mean(tf.square(noise - pred))
    grads = tape.gradient(loss, model.trainable_variables)
    # Gradient clipping
    grads, _ = tf.clip_by_global_norm(grads, 1.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
Keras GroupNorm requirement
Older Keras 2.x releases do not have built-in GroupNormalization. Use tensorflow_addons.layers.GroupNormalization or switch to Keras 3 (Keras Core), which includes it. If using TFLite, GroupNorm may need custom op registration; consider using LayerNorm as a fallback.
Production Insight
Edge devices (TFLite, Core ML) often require float16 or int8 quantised models. Diffusion models are challenging to quantise because the denoising step is very sensitive. Use post-training quantisation with a small calibration set of latents (not images). Rule: always compare FID on synthetic validation set between float32 and quantised models before deploying.
Key Takeaway
Keras/TensorFlow implementations mirror PyTorch but require careful handling of device placement and normalisation layers. Use GroupNorm (from addons or Keras 3) to keep diffusion training stable. Gradient clipping is equally essential.
The Diffusion Flop: Why Your Model Collapses at Low Temperature
You trained your diffusion model. It works at high noise levels. But crank up the guidance scale or drop the temperature, and you get garbage. That's not a bug; it's physics. The forward process doesn't just blur: it drives samples toward a high-entropy fixed point (isotropic Gaussian noise). The reverse process learns a trajectory back from that noise, not a static mapping. When you push sampling outside the noise levels and temperatures the model saw in training, it runs off the rails. This is the same problem as extrapolating a regression line beyond your training data. The fix is data augmentation, smarter noise schedules, or classifier-free guidance that doesn't overshoot. But first, understand the underlying dynamics. The model learns a gradient of probability mass (the score). Outside the region where that gradient was learned there's no signal, just unbounded drift. Senior engineers benchmark their sampling temperature against the forward process variance schedule. Do that, or watch your users generate noise.
CollapseTemperatureCheck.py (Python)
import torch

def check_temperature_sensitivity(model, sample_fn, noise_schedule):
    """Detect if reverse process breaks at low temperature."""
    # sample_fn is kept for interface compatibility; this simplified check does not use it
    temps = [0.5, 1.0, 1.5]
    for temp in temps:
        # Simulate sampling with scaled noise
        x = torch.randn(1, 3, 64, 64)
        with torch.no_grad():
            for step in reversed(range(len(noise_schedule))):
                pred = model(x, step)
                # Temperature scales reverse noise variance
                noise_scale = noise_schedule[step] * temp
                # Simplified update for illustration, not the exact DDPM posterior
                x = (x - pred) / (1 - noise_scale) + torch.randn_like(x) * noise_scale**0.5
        print(f"Temp {temp}: pixel range [{x.min():.2f}, {x.max():.2f}]")
# A range far outside [-1, 1], e.g. [-8.91, 9.43] at temp 0.5, indicates drift
Output (illustrative; exact values depend on the model and random seed)
Temp 0.5: pixel range [-8.91, 9.43]
Temp 1.0: pixel range [-2.12, 2.89]
Temp 1.5: pixel range [0.87, 1.23]
Production Trap:
If your model outputs saturated values or NaNs at low temperature, the noise schedule isn't matched to the learned score function. Always verify the ballistic time scale: the step at which the forward noise magnitude equals the learned denoising signal.
Key Takeaway
Sampling outside the trained noise-temperature manifold inverts the diffusion flux direction. Stay inside the concentration gradient the model actually learned.
Multicomponent Breakdown: Why Your Latent Space Cracks Under Pressure
Your model handles single objects fine. Throw in two overlapping concepts — 'a red car and a blue house' — and it produces a purple blob. That's multicomponent diffusion failure. In physics, Fick's law governs how multiple species interdiffuse. In diffusion models, each concept is a component, and the latent space is a multicomponent mixture. When the reverse process treats everything as a single concentration gradient, cross-component interactions get averaged into mush. The fix is conditioning — but not just any conditioning. You need component-wise guidance. Think of it as thermodiffusion: each concept has its own 'temperature' (guidance scale). Apply a single scalar, and you get thermal equilibrium = entropy death. Senior devs stack separate cross-attention modules per class token, then fuse at the latent boundary. This mirrors how physicists model diffusion across a membrane. The membrane is your UNet bottleneck. Treat it like a selective barrier, not a blender.
MultiComponentGuidance.py (Python)
import torch
import torch.nn as nn

class MultiConceptFuser(nn.Module):
    def __init__(self, latent_dim, num_concepts=2):
        super().__init__()
        # Separate guidance heads per concept (thermodiffusion analogy)
        self.heads = nn.ModuleList([
            nn.Linear(latent_dim, latent_dim) for _ in range(num_concepts)
        ])
        self.membrane = nn.Linear(latent_dim * num_concepts, latent_dim)

    def forward(self, z, concept_embeds):
        # concept_embeds: list of [batch, dim] per concept
        guided = [head(z) * embed for head, embed in zip(self.heads, concept_embeds)]
        # Fuse at latent boundary (membrane)
        return self.membrane(torch.cat(guided, dim=-1))
Output
# Output shape: [batch, latent_dim] — components preserved, not blended
Senior Shortcut:
Measure the cross-attention entropy per class token. If entropy > 0.8 * max, the model is ignoring component identity. Drop guidance scale for that component or increase its embedding norm.
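One way to make this entropy check concrete is sketched below. It assumes you can already extract, for each class token, a non-negative cross-attention map over spatial positions; the function names and the way the 0.8 threshold is wired in are illustrative, not part of any library.

```python
import numpy as np

def normalized_attention_entropy(attn_map, eps=1e-12):
    """Shannon entropy of an attention map divided by its maximum, log(N).

    attn_map: non-negative attention weights over N spatial positions for a
    single class token (any shape; it is flattened).
    Returns a value in [0, 1]: 0 = all mass on one position, 1 = uniform.
    """
    p = np.asarray(attn_map, dtype=np.float64).ravel()
    p = p / (p.sum() + eps)                  # renormalise to a probability vector
    entropy = -np.sum(p * np.log(p + eps))
    return float(entropy / np.log(p.size))

def tokens_ignoring_identity(attn_maps, threshold=0.8):
    """Indices of tokens whose normalised entropy exceeds the threshold."""
    return [i for i, a in enumerate(attn_maps)
            if normalized_attention_entropy(a) > threshold]

# Hypothetical maps: token 0 attends to one small region, token 1 is uniform.
focused = np.zeros((8, 8))
focused[2:4, 2:4] = 1.0
diffuse = np.ones((8, 8))
flagged = tokens_ignoring_identity([focused, diffuse])   # token 1 is flagged
```

Normalising by log(N) makes the threshold independent of the attention map's resolution, so the same 0.8 cut-off can be applied at every U-Net level.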
Key Takeaway
Multicomponent latent spaces need separate guidance per concept, fused through a selective bottleneck. Treat it like Fickian diffusion across a membrane — not a single stirred tank.
Ballistic Time Trap: Why Your Model Can't Recover from a Single Bad Step
You've seen it: one noisy step early in sampling, and the whole generation spirals. That's the ballistic time scale — the window where forward process noise velocity dominates over learned reverse displacement. In physics, ballistic regime means particles move freely before collisions randomize them. In diffusion models, early steps are ballistic: the noise gradient is steep, and the denoiser has minimal signal. A single misstep here sends the trajectory into a different concentration basin, and the model can't recover. The fix is adaptive step sizing. Standard methods use fixed linear schedules. That's stupid. You need to shrink steps in the ballistic region and expand them where the score function is stable. Senior devs implement a time-dependent step size based on the variance of the predicted noise. If variance spikes, the model is in ballistic drift — clamp the step. Or use DDIM's implicit stepping to skip those steps entirely. But never assume uniform recovery. The ballistic time scale kills consistency.
BallisticStepClamp.py (Python)
import torch

def adaptive_sampling(model, noise_schedule, steps=100, max_var=0.1):
    """Clamp step size when noise variance indicates ballistic drift."""
    # noise_schedule must provide at least `steps` entries
    x = torch.randn(1, 3, 64, 64)
    for i in reversed(range(steps)):
        # Estimate noise variance from model prediction
        pred = model(x, i)
        noise_var = ((x - pred) ** 2).mean().item()
        if noise_var > max_var:
            # Ballistic regime: shrink step to prevent trajectory loss
            step_scale = max_var / noise_var
        else:
            step_scale = 1.0
        x = x - step_scale * pred + step_scale * torch.randn_like(x) * noise_schedule[i]**0.5
    return x
# Output: consistent generations, no single-step collapse
Output
# Without clamp: step 87 produces NaN — ballistic drift
# With clamp: stable generation across 100/100 runs
Production Trap:
Benchmark your reverse process against the ballistic time scale of your noise schedule. If the first 10% of sampling steps account for >50% of failed generations, you're in ballistic territory. Use DDIM to skip those steps.
Key Takeaway
The ballistic time scale is where noise velocity dominates. Clamp step sizes or skip early steps. Never assume uniform recovery across the whole trajectory.
Load and Preprocess CIFAR-10: Don't Let Garbage In Ruin Gaussian Noise
Your diffusion model won't fix bad data. Garbage in, garbage out — and diffusion models are especially sensitive because they learn the entire data distribution. CIFAR-10 is small, cheap, and perfect for testing your noising logic before scaling to ImageNet.
Normalize pixel values to [-1, 1] to match the Gaussian noise range. Your reverse process predicts noise added to a standard normal — if your data lives in [0, 1], you're asking the model to denoise something that was never added. Batch size matters: 128 is the sweet spot for 32x32 images on a single GPU. Too large and your batch norm crumbles; too small and gradient variance kills convergence.
Use data augmentation? Only horizontal flips. Random crops on 32x32 destroy spatial structure. Label preservation isn't optional — you need class labels for conditional generation later, so keep them aligned with your batch indices.
Don't normalize to [0,1]. Diffusion models assume noise is N(0,1) — your data must center at 0. Using [0,1] causes the reverse process to learn a biased offset, degrading sample quality by 2-3 FID points.
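The preprocessing described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: images arrive as uint8 arrays of shape (N, H, W, C), and the helper names (`to_model_range`, `to_uint8`, `random_horizontal_flip`) are made up for this example. A random stand-in batch replaces the real CIFAR-10 loader.

```python
import numpy as np

def to_model_range(images_uint8):
    """Map uint8 pixels in [0, 255] to float32 in [-1, 1]."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0

def to_uint8(images):
    """Inverse mapping for display: [-1, 1] back to uint8 [0, 255]."""
    scaled = (np.clip(images, -1.0, 1.0) + 1.0) * 127.5
    return np.round(scaled).astype(np.uint8)

def random_horizontal_flip(images, rng, p=0.5):
    """Flip each image left-right with probability p (labels are unchanged)."""
    flip = rng.random(len(images)) < p
    out = images.copy()
    out[flip] = out[flip, :, ::-1, :]        # reverse the width axis of (N, H, W, C)
    return out

# Stand-in batch with CIFAR-10's shape; real data would come from your loader.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(128, 32, 32, 3), dtype=np.uint8)
x = random_horizontal_flip(to_model_range(batch), rng)
```

Because the flip only reorders pixels within an image, the label array can be passed through untouched, which keeps labels aligned with batch indices for conditional generation later.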
Key Takeaway
Normalize image data to [-1, 1]. Your noise schedule expects it.
Dataset Visualization: See What Your Model Will (And Won't) Learn
You're about to spend hours training a diffusion model. First, verify the data actually looks like what you think it does. Plot a grid of CIFAR-10 samples before you write a single noising step. Check for corrupted files, wrong labels, or class imbalance. CIFAR-10 is clean and balanced (each of its 10 classes is 10% of the data), but your own dataset won't be.
Display 25 random images in a 5x5 grid. Overlay their class names. If you see blurry images or mislabeled frogs, fix it now. A diffusion model trained on blurry data learns to generate blurry images; it's that simple.
Also check the per-channel histogram. Natural images have skewed distributions — lots of dark pixels, fewer bright ones. A uniform histogram means your data is synthetic or broken. Diffusion models exploit these statistics, so understand them before you trust your loss curve.
visualize_cifar10.py (Python)
import matplotlib.pyplot as plt
import tensorflow as tf

def visualize_samples(x, y, class_names, num_samples=25):
    """Plot a 5x5 grid of random samples; expects x scaled to [-1, 1]."""
    indices = tf.random.shuffle(tf.range(len(x)))[:num_samples].numpy()
    fig, axes = plt.subplots(5, 5, figsize=(10, 10))
    for i, ax in enumerate(axes.flatten()):
        img = (x[indices[i]] + 1.0) / 2.0  # Back to [0,1] for display
        ax.imshow(img)
        ax.set_title(class_names[y[indices[i]][0]])
        ax.axis('off')
    plt.tight_layout()
    plt.savefig('cifar10_sample_grid.png', dpi=150)

class_names = ['airplane','automobile','bird','cat','deer',
               'dog','frog','horse','ship','truck']
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
# load_data returns uint8 in [0, 255]; scale to [-1, 1] before plotting
x_train = x_train.astype('float32') / 127.5 - 1.0
visualize_samples(x_train, y_train, class_names)
print("Saved cifar10_sample_grid.png")
Output
Saved cifar10_sample_grid.png
Senior Shortcut:
Plot per-channel histograms for every dataset. A single channel with clipped values (huge spike at 0 or 255) means your preprocessing is broken. Fix before training — debugging diffusion collapse is 10x harder.
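The clipping check can be automated rather than eyeballed. The sketch below assumes uint8 images of shape (N, H, W, C); the function name and the 5% threshold are illustrative choices, not an established standard.

```python
import numpy as np

def channel_clipping_report(images_uint8, max_fraction=0.05):
    """Per-channel share of pixels sitting exactly at 0 or 255.

    images_uint8: array of shape (N, H, W, C) with dtype uint8.
    max_fraction: illustrative threshold above which a channel is flagged.
    Returns a list of dicts, one per channel.
    """
    report = []
    for c in range(images_uint8.shape[-1]):
        channel = images_uint8[..., c].ravel()
        hist = np.bincount(channel, minlength=256)    # 256-bin histogram
        total = hist.sum()
        frac_low, frac_high = hist[0] / total, hist[255] / total
        report.append({
            "channel": c,
            "frac_at_0": float(frac_low),
            "frac_at_255": float(frac_high),
            "suspicious": bool(max(frac_low, frac_high) > max_fraction),
        })
    return report

# Synthetic check: channel 2 is deliberately over-exposed and fully clipped.
rng = np.random.default_rng(0)
imgs = rng.integers(20, 236, size=(64, 32, 32, 3), dtype=np.uint8)
imgs[..., 2] = 255
report = channel_clipping_report(imgs)
```

Run it on the raw uint8 data, before scaling to [-1, 1], so that the spikes at 0 and 255 are still exact integer bins.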
Key Takeaway
Always visualize a grid of samples and histograms before training. You can't denoise what you haven't seen.
Production incident · Post-mortem · Severity: high
Training Diverges After 10K Steps — The Case of the Silent σ² Explosion
Symptom
Loss decreases normally for the first 10K steps, then suddenly diverges. Generated images become uniform noise with no structure.
Assumption
The team assumed the learning rate (2e-4) was fine because it's the standard for most diffusion papers.
Root cause
The variance schedule β_t was linear from 1e-4 to 0.02 over 1000 steps. At low t (barely noisy), the model needs to predict tiny noise, but the loss was mean-squared on the noise prediction scaled by 1/√(1 - α̅_t). At the first step, 1 - α̅_t equals β_1 = 1e-4, so that factor is 100 (10^4 once squared inside the loss), amplifying gradients and causing divergence. The default LR is tuned for mid-t steps.
Fix
Switch to a cosine variance schedule, defined through α̅_t = f(t)/f(0) with f(t) = cos²(((t/T + 0.008)/1.008) · π/2) and β_t = 1 - α̅_t/α̅_{t-1} (clipped at 0.999); in this incident it removed the low-t divergence. Alternatively, use a warmup LR schedule: 0 → 1e-4 over 1K steps.
Key lesson
Always plot the per-timestep gradient norms during training.
The variance schedule and learning rate are coupled — cosine schedules are more forgiving.
Paper defaults are not universal; always validate against your data distribution.
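A cosine schedule is usually built by defining α̅_t directly through a squared cosine and then deriving β_t from consecutive ratios. The sketch below follows that common construction; the offset s = 0.008 and the clip at 0.999 are the commonly quoted defaults, and the function name is an assumption of this example.

```python
import numpy as np

def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    """Cosine variance schedule: define alpha_bar first, then derive beta_t.

    f(u) = cos^2(((u + s) / (1 + s)) * pi / 2) for u = t / T,
    alpha_bar_t = f(t / T) / f(0), beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}.
    """
    u = np.arange(T + 1) / T
    f = np.cos((u + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    # Clip to avoid a singular step at the very end of the schedule.
    return np.clip(betas, 0.0, max_beta)

betas = cosine_betas()
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal level per timestep
```

Plotting `alpha_bar` for this schedule next to the linear one, alongside per-timestep gradient norms, is a quick way to see how the two distribute noise across timesteps for your own setup.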
Production debug guide: Symptom → Action for common failures (4 entries)
Symptom · 01
Loss diverges after some steps
→
Fix
Check gradient norms per timestep. If low-t steps dominate, switch to cosine schedule or lower LR by 10× and retry.
Symptom · 02
Generated images are all grey/mean
→
Fix
The model predicts the mean, not the noise. Verify that your sampling code uses x_{t-1} = (1/√α_t) (x_t - ((1-α_t)/√(1-α̅_t)) ϵ_θ) + σ_t z and not x_{t-1} = x_t - ϵ_θ.
Symptom · 03
Samples look blurry
→
Fix
Increase the number of sampling steps (DDPM). If blur persists, your model may be under-trained — check loss curves or extend training.
Symptom · 04
Training takes >2 weeks on one GPU
→
Fix
DDIM only speeds up sampling, not training. To shorten training, use a smaller model (fewer channels in the U-Net) and mixed-precision training (AMP).
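For the grey-image symptom above, it helps to compare your sampler against a reference implementation of a single reverse step. The sketch below is one such reference under stated assumptions: timesteps are 0-based, `betas` and `alpha_bar` are NumPy arrays, and σ_t² = β_t is used (one of the two common variance choices). The function name is hypothetical.

```python
import numpy as np

def ddpm_reverse_step(x_t, eps_pred, t, betas, alpha_bar, rng):
    """One ancestral sampling step x_t -> x_{t-1} (t is a 0-based index).

    mean = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)
    """
    alpha_t = 1.0 - betas[t]
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean                       # no noise is added on the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z   # sigma_t^2 = beta_t (assumed choice)
```

A useful sanity check: at t = 0, feeding in the true noise used to build x_t should return x_0 exactly, whereas the naive update x_t - ϵ_θ does not.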
Quick Debug Cheat Sheet for Diffusion Models
Three most common production issues and their immediate fixes.
Loss diverges after a few tens of thousands of steps
Immediate action
Reduce learning rate by 10× and restart. Simultaneously change the variance schedule to cosine.
# DDIM step with eta=0 (deterministic)
noise_pred = model(x_t, t)
x_start_approx = (x_t - (1-alpha_bar[t])**0.5 * noise_pred) / alpha_bar[t]**0.5
x_t = alpha_bar[t-1]**0.5 * x_start_approx + (1-alpha_bar[t-1])**0.5 * noise_pred
Fix now
Implement DDIM; you'll get 20× faster sampling with minimal quality loss.
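A full deterministic DDIM loop with step skipping can be sketched as follows. It is a minimal illustration, assuming `eps_model(x, t)` is a callable that returns the predicted noise and `alpha_bar` is indexed 0..T-1; the evenly spaced timestep selection is one simple choice among several.

```python
import numpy as np

def ddim_sample(eps_model, x_T, alpha_bar, num_steps=50):
    """Deterministic DDIM sampling (eta = 0) over a subsequence of timesteps.

    eps_model(x, t): hypothetical callable returning predicted noise.
    alpha_bar: cumulative product of (1 - beta_t), indexed 0..T-1.
    """
    T = len(alpha_bar)
    # Evenly spaced timesteps from T-1 down to 0, e.g. 50 out of 1000.
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        eps = eps_model(x, t)
        # Clean sample implied by the current noise estimate.
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # After the last selected step no noise remains, so alpha_bar is 1.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x
```

Since no fresh noise is drawn inside the loop, the same `x_T` always maps to the same output, which is what makes interpolation between two starting latents meaningful.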
DDPM vs DDIM Sampling
| Property | DDPM | DDIM |
|---|---|---|
| Sampling type | Stochastic (z ~ N(0,I) at each step) | Deterministic (no random noise) |
| Number of steps (T) | 1000 (full) | 50-200 (skipped) |
| Speed (GPU-seconds per 1K 256×256 images) | ~60 | ~3 (50 steps) |
| Sample quality (FID on CIFAR-10) | 3.17 | 4.67 (50 steps) |
| Supports interpolation in latent space | No | Yes |
| Best for | Highest quality, research | Deployment, prototyping |
Key takeaways
1. Diffusion models learn to reverse a fixed Gaussian noising process, decomposing generation into many small denoising steps.
2. Training uses a simple MSE loss between predicted and true noise, with uniform timestep sampling.
3. The variance schedule and learning rate are coupled; cosine schedules are more production-friendly than linear.
4. DDPM sampling is stochastic and high-quality but slow; DDIM is deterministic, fast, and enables latent interpolation.
5. Noise prediction is equivalent to score matching, which gives diffusion models inherent diversity without mode collapse.
6. Always use group normalization, gradient clipping, and proper pixel normalization for stable training.
Common mistakes to avoid
4 patterns
×
Using the same learning rate for all timestep groups
Symptom
Loss diverges after a few thousand steps, especially when using linear variance schedule. Gradients for low-t steps (close to x₀) explode because the loss scaling factor 1/√(1-α̅_t) is huge.
Fix
Switch to cosine variance schedule which naturally reduces the scaling spike at low t. Additionally, use a learning rate warmup (0→1e-4 over 1K steps).
×
Not normalizing pixel values to [-1,1]
Symptom
Generated images are all black or all white. The model learns to predict Gaussian noise, but if inputs are in [0,255], the loss is numerically unstable.
Fix
Normalise training data to [-1,1] (image = image/127.5 - 1). Convert samples back for display with (output + 1) * 127.5.
×
Ignoring time conditioning on the U-Net
Symptom
The model produces the same output regardless of timestep input. Inference fails because the model doesn't distinguish noise levels.
Fix
Ensure the U-Net receives a time embedding (sinusoidal or learned) and that it is injected into the feature maps, typically via addition, concatenation, or adaptive scaling.
×
Using batch normalization in the denoising U-Net
Symptom
Training loss is unstable and samples look blotchy. Batch norm statistics shift drastically across timesteps, corrupting conditioning.
Fix
Replace all batch norm layers with group normalization (num_groups=32). This stabilizes training and is standard in modern diffusion models.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 04 (Senior)
Explain the forward and reverse processes in a diffusion model. Why is the forward process fixed?
ANSWER
The forward process gradually adds Gaussian noise to data over T steps according to a fixed variance schedule β_t. It's fixed (not learned) because we want a well-defined target distribution (N(0,I) at step T) and we need to compute the noisy image at any step via closed form. The reverse process is learned: a neural network predicts the noise at each step (or equivalently the score). The forward process being fixed provides a stable training signal: the model learns to reverse a known, fixed corruption process (the noise itself is random, but its distribution at every step is known).
Q02 of 04 (Senior)
What is the connection between diffusion models and score matching? How does this prevent mode collapse?
ANSWER
The noise prediction network ϵ_θ is proportional to the gradient of the log-density (score) of the noised data distribution at noise level t: ϵ_θ ≈ -√(1-α̅_t) ∇ log p(x_t). Diffusion models learn the score at multiple noise levels via denoising score matching. This objective is a plain regression (quadratic in the score) and doesn't rely on adversarial training, so the model is penalised wherever the data has mass and cannot simply ignore entire modes. That makes mode collapse far less likely than with GANs.
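The proportionality between noise and score can be read off the closed-form forward marginal. A short derivation, using the same α̅_t notation and assuming the standard reparameterisation of x_t:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
\qquad
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)

\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
  = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}

\Rightarrow\quad
\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar\alpha_t}\;\nabla_{x_t} \log q(x_t)
```

The last line holds because the MSE-optimal noise prediction is the conditional expectation of ϵ given x_t, and the marginal score is the matching expectation of the conditional score.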
Q03 of 04 (Senior)
Compare DDPM and DDIM sampling. When would you choose each in production?
ANSWER
DDPM is stochastic and requires all T=1000 steps for high quality; used when quality is paramount (medical imaging, high-fidelity art). DDIM is deterministic and can skip steps (e.g., 50 vs 1000), offering ~20× faster sampling. DDIM also enables latent space interpolation. For deployment where latency matters (e.g., real-time generation), DDIM is preferred. For research, DDPM. In practice, many teams use DDIM with 100 steps and accept the minimal quality drop.
Q04 of 04 (Senior)
Why should you use group normalization instead of batch normalization in a diffusion model U-Net?
ANSWER
Batch normalization computes running statistics of activations across the batch. In a diffusion model, the activations are strongly dependent on the timestep t. Different t produce very different distributions (from almost noise to almost clean image). The running statistics become a blend of all timesteps, leading to unstable training and poor sample quality. Group normalization normalizes each sample independently using spatial statistics, which is timestep-agnostic and provides stable gradients across all noise levels.
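The difference can be shown numerically. The sketch below implements both normalisations in NumPy without the learnable scale and shift (an intentional simplification) and uses random tensors as stand-ins for low-noise and high-noise activations; all names are illustrative.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalise each sample over (channels in group, H, W). x: (N, C, H, W)."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def batch_norm_train(x, eps=1e-5):
    """Training-mode batch norm: statistics are shared across the batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.2, size=(1, 32, 8, 8))   # stands in for a low-t sample
noisy = rng.normal(0.0, 1.0, size=(1, 32, 8, 8))   # stands in for a high-t sample
other = rng.normal(0.0, 0.2, size=(1, 32, 8, 8))

# Same first sample, different batch-mates.
gn_a = group_norm(np.concatenate([clean, noisy]), num_groups=8)[0]
gn_b = group_norm(np.concatenate([clean, other]), num_groups=8)[0]
bn_a = batch_norm_train(np.concatenate([clean, noisy]))[0]
bn_b = batch_norm_train(np.concatenate([clean, other]))[0]
```

Here `gn_a` and `gn_b` match, because group norm never looks at other samples, while `bn_a` and `bn_b` differ: the same low-noise sample is normalised differently depending on which timesteps share its batch.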
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
What is a diffusion model in simple terms?
Imagine you take a high-quality image and gradually add random noise until it becomes pure static. A diffusion model learns how to reverse that process — starting from static and removing the noise step by step to recreate the original image. This allows it to generate completely new, realistic images from scratch.
02
How many steps does a diffusion model need to generate an image?
The original DDPM uses 1000 steps. With DDIM (a faster variant), you can use as few as 10-50 steps while maintaining decent quality. The trade-off is faster generation at the cost of slightly lower fidelity.
03
Why are diffusion models better than GANs?
Diffusion models are more stable to train (no adversarial game), capture the full diversity of data without mode collapse, and have a simpler mathematical framework. Their main downside is slower sampling, but methods like DDIM and latent diffusion mitigate this.
04
What is classifier-free guidance?
It's a technique to improve sample quality by combining a conditional noise prediction (given a label) with an unconditional one (no label), usually from a single model trained with the label randomly dropped. The final noise prediction is a weighted sum: ϵ = w·ϵ_cond + (1-w)·ϵ_uncond. Higher w (>1) gives more label adherence but reduces diversity.
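The weighted sum is a one-liner in code. The sketch below is a minimal illustration with made-up two-element predictions; the function name is an assumption of this example.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    Written as eps_uncond + w * (eps_cond - eps_uncond), which is the same
    weighted sum as w * eps_cond + (1 - w) * eps_uncond.
    w = 0 -> unconditional, w = 1 -> conditional, w > 1 -> extrapolate.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 0.0])
eps_uncond = np.array([0.0, 0.0])
guided = classifier_free_guidance(eps_cond, eps_uncond, w=3.0)   # array([3., 0.])
```

With w > 1 the result lies beyond the conditional prediction, in the direction away from the unconditional one, which is why large guidance scales can push samples outside the range the model was trained on.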