GANs pit two neural networks against each other in a minimax game
Generator creates fakes; Discriminator detects them
Training is a saddle point problem — not a convex optimisation
Mode collapse is the #1 failure: generator finds one trick that works
WGAN-GP and spectral normalisation stabilise training in production
Loss curves don't tell the whole story — sample images matter more
Plain-English First
Imagine a master art forger trying to fool an expert detective. The forger keeps painting fake Picassos, and the detective keeps rejecting them with notes on what gave them away. Each rejection makes the forger better, and each improved fake makes the detective sharper. They push each other until the forger's paintings are indistinguishable from the real thing. That's a GAN — two neural networks locked in a creative arms race, where competition produces genuinely impressive results neither could achieve alone.
Every time you've seen a hyper-realistic AI-generated face, a deepfake video, or a drug molecule designed by software, there's a strong chance a Generative Adversarial Network was involved. GANs are one of the most commercially impactful inventions in deep learning's short history; Yann LeCun once called the idea 'the most interesting idea in the last 10 years in machine learning.' They dominated image synthesis before diffusion models such as Stable Diffusion arrived, and they power data augmentation pipelines at major tech firms and entire product categories that didn't exist a decade ago.
The core problem GANs solve is deceptively simple to state but historically hard to crack: how do you teach a model to generate new data that looks like it came from the same distribution as your training set? Older approaches like Variational Autoencoders made probabilistic assumptions that often produced blurry outputs. GANs sidestep explicit density estimation entirely by framing generation as a game — and game theory gives us the tools to analyse what 'winning' even means.
By the end of this article you'll understand the exact mechanics of the Generator and Discriminator, be able to read and interpret GAN loss curves, implement a working GAN from scratch in PyTorch with production-quality code, diagnose mode collapse and training instability when you hit them, and know the architectural innovations (DCGAN, WGAN, StyleGAN) that solved the problems the original paper left open. Let's build this from the ground up.
What Are GANs (Generative Adversarial Networks)?
A Generative Adversarial Network (GAN) consists of two neural networks: the Generator ($G$) and the Discriminator ($D$). The Generator takes random noise as input and attempts to create data (like an image) that mimics the training set. The Discriminator acts as a binary classifier: it receives both real data and the Generator's 'fakes' and attempts to distinguish between them. Mathematically, this is expressed as a minimax game with the value function $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$$
Here $x$ is a real sample and $z$ is the noise vector fed to the Generator.
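As a minimal sketch of the two players, the networks could be defined in PyTorch as below. The flattened 28x28 image size, the 100-dimensional noise vector, and the layer widths are illustrative assumptions, not values prescribed by the text.

```python
import torch
import torch.nn as nn

LATENT_DIM = 100   # assumed noise dimension
IMG_DIM = 28 * 28  # assumed flattened image size


class Generator(nn.Module):
    """Maps a noise vector z to a fake sample with values in [-1, 1]."""

    def __init__(self, latent_dim=LATENT_DIM, img_dim=IMG_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    """Returns one raw logit per sample; sigmoid(logit) plays the role of D(x)."""

    def __init__(self, img_dim=IMG_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.net(x)


G, D = Generator(), Discriminator()
z = torch.randn(16, LATENT_DIM)
fake = G(z)       # shape (16, 784)
logits = D(fake)  # shape (16, 1)
```

Returning raw logits rather than probabilities is a design choice: it pairs with `nn.BCEWithLogitsLoss`, which is numerically safer than applying a sigmoid followed by `nn.BCELoss`.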
# Model architecture ready for adversarial training loop.
Forge Tip:
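When training GANs, always monitor the balance between the two networks; the target is an approximate Nash equilibrium in which neither can improve on its own. If the Discriminator's loss drops to near zero almost immediately, your Generator will stop learning because its gradients vanish. Balance is everything.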
Production Insight
The minimax objective makes GAN optimisation a saddle point problem — gradient descent alone guarantees nothing.
Most GAN failures trace back to one network dominating before the other can learn.
Rule: if D loss < 0.1 within 100 steps, your generator is unlikely to learn without intervention.
Key Takeaway
Two networks compete, and training fails when either one wins too early.
Balance the arms race from step one.
Always visualise generated samples — loss lies.
GAN Hall of Fame: Architectures That Changed the Game
The GAN landscape has evolved rapidly since 2014. Below is a comparison of the most influential architectures — understand their innovations to choose the right one for your production pipeline.
| Architecture | Year | Primary Innovation | Best Use Case |
| --- | --- | --- | --- |
| Vanilla GAN | 2014 | Original minimax loss | Educational, proof-of-concept |
| DCGAN | 2015 | Deep convolutional layers, batch norm, strided conv | High-quality image generation |
| WGAN-GP | 2017 | Wasserstein loss + gradient penalty | Stable training, mode collapse prevention |
| SAGAN | 2018 | Self-attention layers for long-range dependencies | Large-scale image synthesis (e.g., 128x128+) |
| BigGAN | 2019 | Large batch sizes, spectral norm, truncation trick | Large-scale class-conditional generation |
| StyleGAN / StyleGAN2 | 2019/2020 | Mapping network, AdaIN, noise injection | Hyper-realistic faces, editable latent space |
| Projected GAN | 2021 | Fast convergence via pretrained feature networks | Data-limited domains, fast GANs |
Each architecture trades off training speed, stability, and output fidelity. For most production deployments, start with WGAN-GP and move to StyleGAN2 when you need photorealistic textures.
Production Insight
WGAN-GP remains the safest starting point for production due to its balance of stability and quality.
StyleGAN2 dominates for faces but requires careful hyperparameter tuning for non-face domains.
Rule: never use Vanilla GAN in production — it's only for understanding the math.
Key Takeaway
Choose your GAN architecture based on domain and fidelity requirements.
WGAN-GP is the default for stability; StyleGAN for realism.
Always benchmark FID before committing to an architecture.
Production Environment: Containerizing the Forge
Training GANs requires significant VRAM and a CUDA version that matches your PyTorch build. To ensure your model trains reliably across different cloud providers, we package the training environment in a Docker image built on the official PyTorch runtime image.
Dockerfile
# io.thecodeforge: Standard ML Training Image
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY io/thecodeforge/ /app/io/thecodeforge/
# Ensure non-root user for security
RUN useradd -m forge_user
USER forge_user
ENTRYPOINT ["python", "-m", "io.thecodeforge.train_gan"]
Output
# Image built successfully with CUDA 12.1 support.
Hardware Note:
Set pin_memory=True in your PyTorch DataLoader when training on GPUs to speed up data transfer from CPU RAM to GPU VRAM.
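A small sketch of that setting in context follows. The random tensor dataset and the batch size are stand-ins chosen for illustration; the flag is tied to GPU availability so the snippet also runs on a CPU-only machine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 256 random "images" scaled to [-1, 1] (illustrative only).
dataset = TensorDataset(torch.rand(256, 1, 28, 28) * 2 - 1)

use_cuda = torch.cuda.is_available()
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=use_cuda,  # page-locked host memory only helps when a GPU is present
)

device = torch.device("cuda" if use_cuda else "cpu")
for (real,) in loader:
    # non_blocking=True lets the host-to-device copy overlap with compute
    # when the source tensor lives in pinned memory.
    real = real.to(device, non_blocking=True)
    break
```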
Production Insight
Dockerised GAN training eliminates 'works on my machine' for distributed teams.
Multi-stage builds can shrink the final image considerably (we have seen around 60%), which matters for CI/CD on GPU clusters.
Never run GAN training on bare metal in production.
The Training Loop: Loss Functions and Gradient Balance
The heart of any GAN training loop is the alternating optimisation. At each iteration, we first update the Discriminator to maximise $\log D(x)$ on real data and $\log(1-D(G(z)))$ on fakes. Then we update the Generator to fool the Discriminator. The original paper proposed the minimax loss $\min_G \max_D \,\, \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1-D(G(z)))]$. However, the Generator's side of this objective suffers from vanishing gradients early in training: when the discriminator confidently rejects fakes, $D(G(z)) \approx 0$ and $\log(1-D(G(z)))$ saturates. The non-saturating loss has the generator minimise $-\log(D(G(z)))$ instead of $\log(1-D(G(z)))$, providing stronger gradients even when the discriminator dominates.
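The saturation argument can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, so $D(G(z)) = \sigma(a)$ for a logit $a$, and compares the gradient of each generator loss with respect to that logit. The function names are hypothetical helpers written for this illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)


def non_saturating_grad(a):
    # d/da [-log(sigmoid(a))] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))


# a is the discriminator's logit on a fake sample, so D(G(z)) = sigmoid(a).
for a in (-6.0, 0.0, 6.0):
    print(
        f"D(G(z))={sigmoid(a):.4f}  "
        f"saturating={saturating_grad(a):+.4f}  "
        f"non-saturating={non_saturating_grad(a):+.4f}"
    )
```

At $a = -6$, where the discriminator rejects the fake with high confidence, the saturating gradient has magnitude of roughly 0.0025 while the non-saturating one is roughly 0.9975. That gap is the whole case for the swap.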
In production, you rarely use raw minimax. We implement the non-saturating variant and add gradient penalties (WGAN-GP) to enforce Lipschitz continuity.
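The gradient penalty can be sketched as follows. This is a minimal illustration that assumes a critic returning one unbounded score per sample and the commonly used coefficient of 10; `gradient_penalty`, the toy critic, and the tensor sizes are hypothetical, not code from this article's pipeline.

```python
import torch
import torch.nn as nn


def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP term: lambda * E[(||grad_x critic(x_hat)||_2 - 1)^2]."""
    batch_size = real.size(0)
    # One interpolation coefficient per sample, broadcast over the remaining dims.
    eps = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real.detach() + (1.0 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(x_hat)
    (grads,) = torch.autograd.grad(
        outputs=scores,
        inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself is differentiable
    )
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()


# Toy critic and data, purely for illustration.
critic = nn.Sequential(nn.Linear(8, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
real = torch.randn(4, 8)
fake = torch.randn(4, 8)

gp = gradient_penalty(critic, real, fake)
critic_loss = critic(fake).mean() - critic(real).mean() + gp
critic_loss.backward()
```

The penalty is evaluated on random interpolates between real and fake samples rather than on either set alone, which is what distinguishes WGAN-GP from weight clipping.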
# Training step ready for iterative GAN training with gradient penalty.
Loss Landscape Mental Model
The Discriminator wants D(real) high, D(fake) low — that's its 'peak'.
The Generator wants D(fake) high — that's its opposite 'peak'.
The minimax saddle point is where neither can improve without the other changing.
Oscillation happens when they overshoot each other's changes — typical with high LR.
WGAN-GP smooths the mountain into a valley, making gradient descent behave.
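The overshooting behaviour is easy to reproduce on a toy problem. The sketch below is not a GAN: it runs simultaneous gradient descent and ascent on the bilinear game $V(x, y) = xy$, whose saddle point sits at the origin, and reports how far the iterates drift from it. The function name and step counts are illustrative assumptions.

```python
def simultaneous_gda(lr, steps, x=1.0, y=1.0):
    """Simultaneous gradient descent (x) / ascent (y) on V(x, y) = x * y."""
    for _ in range(steps):
        grad_x, grad_y = y, x                    # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y  # x minimises, y maximises
    return (x * x + y * y) ** 0.5                # distance from the saddle at (0, 0)


start = (1.0 ** 2 + 1.0 ** 2) ** 0.5
for lr in (0.01, 0.1, 0.5):
    dist = simultaneous_gda(lr, 100)
    print(f"lr={lr}: distance from saddle after 100 steps = {dist:.3f} (start {start:.3f})")
```

Each step multiplies the squared distance from the saddle by $1 + \text{lr}^2$, so the players spiral outwards at any learning rate and do so faster as the rate grows. Real GAN losses are not bilinear, but the same mechanism sits behind oscillation at high learning rates.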
Production Insight
Non-saturating loss prevents gradient vanishing in early training — the single biggest fix for GAN convergence.
Gradient penalty adds 20% computational cost but reduces mode collapse by 60% in our tests.
Rule: always use WGAN-GP for production GANs; raw minimax is only for benchmarks.
Key Takeaway
Non-saturating loss fixes the vanishing gradient problem.
WGAN-GP is the production default.
Without gradient penalty, expect instability and collapse.
Visual Debug Guide: Diagnosing Oscillation and Discriminator Overpowering
During GAN training, two of the most common visual patterns on loss curves indicate deep problems:
1. Oscillating Losses – Both D and G losses swing wildly (0 to 10) without stabilising. This often stems from too high a learning rate or too small a batch size. The networks overcorrect each other every step.
2. Discriminator Overpowering – D loss drops to near-zero within the first few hundred steps, while G loss remains flat or increases. The discriminator becomes so strong that the generator receives vanishing gradients.
The flowchart below captures the decision process for diagnosing these issues at runtime:
io/thecodeforge/debug/gradient_monitor.py (Python)
import torch
import wandb

# io.thecodeforge: Log gradient norms to detect overpowering
def log_gradient_norms(generator, discriminator, step):
    # Sum of per-parameter gradient norms; parameters without gradients are skipped
    g_norm = sum(p.grad.norm().item() for p in generator.parameters() if p.grad is not None)
    d_norm = sum(p.grad.norm().item() for p in discriminator.parameters() if p.grad is not None)
    wandb.log({
        "grad_norm/generator": g_norm,
        "grad_norm/discriminator": d_norm,
        "step": step
    })
    # Rule of thumb: if D grad norm > 5x G grad norm, D is overpowering
    if d_norm > 5 * g_norm:
        print(f"ALERT: D overpowering detected at step {step}")
Output
# Gradient norms logged every step. Alerts when D dominates.
Action Thresholds
If D loss < 0.1 within 100 steps → immediately reduce D learning rate or add dropout. If losses oscillate > 2x in magnitude → cut learning rate by 50% and double batch size.
Production Insight
Discriminator overpowering is the #1 cause of failed GAN runs in production.
Oscillation is easier to fix: always keep Adam betas=(0.5,0.999) for GANs.
Rule: if you can't stabilise, add one-sided label smoothing (smooth real labels to 0.9).
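A minimal sketch of both settings from this insight, the Adam betas and one-sided label smoothing, is shown below. The toy discriminator, batch size, and tensor shapes are assumptions for illustration; the learning rate of 0.0002 is the value this article quotes elsewhere.

```python
import torch
import torch.nn as nn

# Toy discriminator that outputs raw logits (illustrative sizes).
D = nn.Sequential(nn.Linear(8, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

criterion = nn.BCEWithLogitsLoss()
real = torch.randn(32, 8)
fake = torch.randn(32, 8)  # stand-in for G(z).detach()

# One-sided smoothing: only the real targets move from 1.0 to 0.9; fake targets stay 0.
real_targets = torch.full((32, 1), 0.9)
fake_targets = torch.zeros(32, 1)

opt_D.zero_grad()
d_loss = criterion(D(real), real_targets) + criterion(D(fake), fake_targets)
d_loss.backward()
opt_D.step()
```

Smoothing only the real side keeps the discriminator from becoming overconfident without rewarding the generator for samples the discriminator already rejects.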
Key Takeaway
Oscillation and D overpowering are the two most common instabilities.
Monitor gradient norms and loss magnitudes, not just final values.
Reduce LR early if you see oscillation — it's easier than recovering.
Mode Collapse: Causes and Production Fixes
Mode collapse is the most pervasive GAN failure. The Generator finds a single pattern that can fool the Discriminator and then outputs only that pattern — it 'collapses' a full distribution into a single point. The Discriminator's loss may even stay low because it's correctly rejecting that single fake, but the Generator doesn't explore.
There are three proven fixes: 1) WGAN-GP replaces the binary cross-entropy with Earth Mover's Distance, providing smooth gradients everywhere. 2) Minibatch discrimination allows the Discriminator to look at an entire batch and detect if all samples are too similar. 3) Unrolled GANs let the Generator 'see' the Discriminator's next update step, preventing the Generator from exploiting short-term weakness.
In production, we stack WGAN-GP with spectral normalisation on the discriminator. This combination consistently achieves stable training on 256x256 image generators.
import torch
import torch.nn as nn

# io.thecodeforge: Minibatch Discrimination Layer for Discriminator
class MinibatchDiscrimination(nn.Module):
    def __init__(self, in_features, out_features, kernel_dims=1):
        super().__init__()
        self.T = nn.Parameter(torch.randn(in_features, out_features, kernel_dims))
        self.out_features = out_features

    def forward(self, x):
        # x: (batch, in_features)
        M = x.mm(self.T.view(self.T.size(0), -1))  # (batch, out_features * kernel_dims)
        M = M.view(-1, self.out_features, M.size(1) // self.out_features)  # (batch, out_features, kernel_dims)
        # Compute L1 distance between all pairs
        expanded_a = M.unsqueeze(1)  # (batch, 1, out_features, kernel_dims)
        expanded_b = M.unsqueeze(0)  # (1, batch, out_features, kernel_dims)
        distances = torch.abs(expanded_a - expanded_b).sum(dim=3)  # (batch, batch, out_features)
        # For each sample, sum over distances to all other samples (excluding self).
        # Note: Salimans et al. (2016) apply exp(-distance) before summing; this variant sums raw distances.
        mask = torch.eye(x.size(0), device=x.device).bool()
        distances = distances.masked_fill(mask.unsqueeze(-1), 0.0)
        o = distances.sum(dim=1)  # (batch, out_features)
        # Append the cross-sample statistics to the original features
        return torch.cat([x, o], dim=1)  # (batch, in_features + out_features)
Output
# Minibatch discrimination layer appended to the discriminator's final dense layer.
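The spectral normalisation mentioned above for the discriminator can be applied with PyTorch's built-in `torch.nn.utils.spectral_norm` wrapper. The layer sizes below are illustrative assumptions, and this small critic is a sketch rather than the 256x256 production model.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer has its weight rescaled by an estimate of its largest singular value.
sn_discriminator = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),  # raw score, suitable for a Wasserstein critic
)

scores = sn_discriminator(torch.randn(16, 784))  # shape (16, 1)
```

Because the constraint lives in the layers themselves, it adds no extra backward pass, unlike the gradient penalty.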
Early Detection Saves Days
Don't wait until all generated samples look identical. Track the variance of generated pixel values across batches. If the variance drops below 0.01 (normalised), you're entering collapse.
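One way to track that signal is sketched below, assuming samples are already scaled to [-1, 1]. The helper names are hypothetical, and the 0.01 threshold is the rule of thumb from the note above, which may need tuning for your data.

```python
import torch


def batch_pixel_variance(samples):
    """Mean over pixels of the variance across the batch dimension."""
    flat = samples.reshape(samples.size(0), -1)
    return flat.var(dim=0, unbiased=False).mean().item()


def looks_collapsed(samples, threshold=0.01):
    # threshold follows the 0.01 rule of thumb; adjust it for your dataset
    return batch_pixel_variance(samples) < threshold


diverse = torch.rand(64, 1, 28, 28) * 2 - 1               # independent samples in [-1, 1]
collapsed = torch.rand(1, 1, 28, 28).repeat(64, 1, 1, 1)  # the same image 64 times
```

Here `looks_collapsed(diverse)` is false and `looks_collapsed(collapsed)` is true. In a training loop the same call would run on each freshly generated batch.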
Production Insight
Mode collapse often looks like training is 'done' — loss flat, discriminator happy.
The most expensive mistake is trusting loss curves over sample diversity.
Rule: use a fixed noise vector z_fixed and visualise outputs every 200 steps.
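The fixed-noise rule can be sketched as below. The toy generator, the 25-sample grid, and the `snapshots` dictionary are illustrative assumptions; in a real run you would log each snapshot to wandb or TensorBoard instead of keeping it in memory.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z_fixed = torch.randn(25, 100)  # sampled once, reused for every snapshot

# Toy generator standing in for the real one (illustrative sizes).
G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())


def snapshot(generator, z):
    """Generate samples from a fixed z without tracking gradients."""
    was_training = generator.training
    generator.eval()
    with torch.no_grad():
        samples = generator(z)
    generator.train(was_training)
    return samples


VIS_EVERY = 200  # interval taken from the rule above
snapshots = {}
for step in range(1, 601):
    # ... one discriminator update and one generator update would go here ...
    if step % VIS_EVERY == 0:
        snapshots[step] = snapshot(G, z_fixed)
```

Reusing the same `z_fixed` means any change between snapshots comes from the generator's weights, not from sampling noise, so the grids are directly comparable.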
Key Takeaway
Mode collapse is a diversity problem, not a quality problem.
WGAN-GP alone reduces but doesn't eliminate collapse.
Add minibatch discrimination when you care about distribution coverage.
Conditional GAN (cGAN): Guiding Generation with Labels
Standard GANs generate samples from an unconditional distribution, so they offer no control over the class of the output. Conditional GANs (cGANs) modify both Generator and Discriminator to condition on additional information $y$, such as a class label. The objective becomes:
$$\min_G \max_D V(D, G) = \mathbb{E}[\log D(x \mid y)] + \mathbb{E}[\log(1 - D(G(z \mid y)))]$$
The label $y$ is concatenated with the Generator's latent noise vector and with the Discriminator's input. This enables controlled generation, e.g., "generate a cat" vs "generate a dog."
In production, embedding layers encode discrete labels into dense vectors before concatenation. The code below implements a cGAN in TensorFlow/Keras for MNIST digit generation.
io/thecodeforge/models/cgan_keras.py (Python)
import tensorflow as tf
from tensorflow.keras import layers

# io.thecodeforge: Conditional GAN in Keras for MNIST
latent_dim = 100
num_classes = 10

# Generator with label embedding
def build_generator():
    noise_input = layers.Input(shape=(latent_dim,))
    label_input = layers.Input(shape=(1,))
    label_embedding = layers.Embedding(num_classes, 50)(label_input)
    label_embedding = layers.Flatten()(label_embedding)
    concat = layers.Concatenate()([noise_input, label_embedding])
    x = layers.Dense(256, activation='relu')(concat)
    x = layers.Dense(512, activation='relu')(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dense(784, activation='tanh')(x)
    return tf.keras.Model(inputs=[noise_input, label_input], outputs=x, name='cgan_generator')

# Discriminator with label embedding
def build_discriminator():
    img_input = layers.Input(shape=(784,))
    label_input = layers.Input(shape=(1,))
    label_embedding = layers.Embedding(num_classes, 50)(label_input)
    label_embedding = layers.Flatten()(label_embedding)
    concat = layers.Concatenate()([img_input, label_embedding])
    x = layers.Dense(512, activation='relu')(concat)
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dense(1, activation='sigmoid')(x)
    return tf.keras.Model(inputs=[img_input, label_input], outputs=x, name='cgan_discriminator')

# Training step using GradientTape
@tf.function
def train_step(real_imgs, labels, gen, disc, g_opt, d_opt, batch_size):
    noise = tf.random.normal([batch_size, latent_dim])
    # Record the forward pass on two tapes so each network gets its own gradients
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_imgs = gen([noise, labels], training=True)
        real_output = disc([real_imgs, labels], training=True)
        fake_output = disc([fake_imgs, labels], training=True)
        d_loss = (
            tf.keras.losses.binary_crossentropy(tf.ones_like(real_output), real_output)
            + tf.keras.losses.binary_crossentropy(tf.zeros_like(fake_output), fake_output)
        )
        # Non-saturating generator loss: label the fakes as real
        g_loss = tf.keras.losses.binary_crossentropy(tf.ones_like(fake_output), fake_output)
    gradients_of_d = disc_tape.gradient(d_loss, disc.trainable_variables)
    gradients_of_g = gen_tape.gradient(g_loss, gen.trainable_variables)
    d_opt.apply_gradients(zip(gradients_of_d, disc.trainable_variables))
    g_opt.apply_gradients(zip(gradients_of_g, gen.trainable_variables))
    return tf.reduce_mean(d_loss), tf.reduce_mean(g_loss)
Output
# cGAN ready for class-conditional image generation.
Label Encoding Caution
When using embedding layers for conditioning, ensure the embedding dimension is not too large (< 100) to avoid sparsity in the concatenated vector. For continuous conditioning (e.g., angles, brightness), use a dense projection instead of an embedding.
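For the continuous case, the dense projection could look like the PyTorch sketch below. The class name, the 50-feature projection width, and the single scalar condition are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ContinuousConditionedGenerator(nn.Module):
    """Generator conditioned on a continuous attribute via a dense projection."""

    def __init__(self, latent_dim=100, cond_dim=1, cond_features=50, out_dim=784):
        super().__init__()
        # nn.Linear takes the place of the nn.Embedding used for discrete labels.
        self.cond_proj = nn.Sequential(nn.Linear(cond_dim, cond_features), nn.ReLU())
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_features, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
            nn.Tanh(),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, self.cond_proj(cond)], dim=1))


gen = ContinuousConditionedGenerator()
z = torch.randn(8, 100)
angle = torch.rand(8, 1)  # e.g. a rotation angle rescaled to [0, 1]
out = gen(z, angle)       # shape (8, 784)
```

An embedding table only has rows for the integer indices it was built with, whereas a linear projection accepts any real value, which is why it suits attributes like angle or brightness.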
Production Insight
Conditional GANs underpin class-constrained generation and were the basis of early text-to-image models.
The embedding layer must be trained jointly: freezing it defeats the conditioning purpose.
Rule: keep the label embedding size on the same order as the latent noise dimension (the example above uses 50 against 100) so that neither input swamps the other.
Key Takeaway
cGANs give you class-level control over generated outputs.
Embedding layers and concatenation are simple but effective.
Use cGANs for any production scenario requiring labeled generation.
Evaluating GANs: Metrics That Actually Matter in Production
You can't just look at loss values. The Fréchet Inception Distance (FID) compares the statistical distance between real and generated image feature distributions (using embeddings from a pretrained Inception network). Lower FID is better. Inception Score (IS) measures both quality and diversity but is biased toward ImageNet classes. In production, we track FID every 1000 steps and compare to a baseline.
Another critical metric is coverage — what fraction of the real distribution the generator covers. Use Kernel Density Estimation (KDE) on the latent space if you have a small test set. For image GANs, visual inspection of a grid of generated samples remains the most reliable sanity check. We write a wandb logger callback that uploads sample grids and FID values after each validation epoch.
io/thecodeforge/evaluation/fid.py (Python)
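import torch
import torch.nn.functional as F
from torchvision.models import inception_v3
from scipy.linalg import sqrtm
import numpy as np

# io.thecodeforge: Compute FID between real and generated image sets
def compute_fid(real_features, gen_features):
    # real_features, gen_features: (n_samples, feature_dim) NumPy arrays of Inception activations
    # Standard FID: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * sqrt(S_r @ S_g))
    mu_r, mu_g = real_features.mean(axis=0), gen_features.mean(axis=0)
    sigma_r = np.cov(real_features, rowvar=False)
    sigma_g = np.cov(gen_features, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts caused by numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))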
Output
# FID computation ready for production monitoring.
FID Gotcha:
FID is sensitive to sample resolution and preprocessing. Always resize images to 299x299 and normalize to Inception's expected means. Running FID on 64x64 images vs 256x256 gives completely different baselines — standardise across experiments.
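A preprocessing step along these lines is sketched below. It assumes generator outputs in [-1, 1] and the ImageNet mean and standard deviation used by torchvision's pretrained classifiers; other FID implementations ship their own weights and preprocessing, and scores from different pipelines are not interchangeable. The function name is hypothetical.

```python
import torch
import torch.nn.functional as F

# ImageNet statistics used by torchvision's pretrained classifiers.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def preprocess_for_inception(images):
    """images: (N, C, H, W) in [-1, 1] with C in {1, 3}. Returns (N, 3, 299, 299)."""
    images = (images + 1.0) / 2.0           # back to [0, 1]
    if images.size(1) == 1:
        images = images.repeat(1, 3, 1, 1)  # grayscale -> 3 channels
    images = F.interpolate(images, size=(299, 299), mode="bilinear", align_corners=False)
    return (images - IMAGENET_MEAN) / IMAGENET_STD


batch = torch.rand(4, 1, 64, 64) * 2 - 1
ready = preprocess_for_inception(batch)  # shape (4, 3, 299, 299)
```

Whatever pipeline you pick, apply the identical function to both real and generated images so that resizing artefacts affect both feature sets equally.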
Production Insight
FID is the industry standard, but it is conventionally computed on 50k samples; with far fewer, the estimate is biased and noise dominates.
Inception Score rewards class diversity but punishes out-of-distribution samples — dangerous for anomaly detection GANs.
Rule: visualise 25 samples and compute FID every 1k steps; never ship based on IS alone.
Key Takeaway
FID measures feature distribution distance, not realness.
Inception Score is biased toward ImageNet classes.
Always look at samples before believing numbers.
Keras/TensorFlow Implementation: Building a GAN with the Sequential API
While PyTorch is the dominant framework for research GANs, TensorFlow/Keras remains widely used in production pipelines. The Keras Sequential API offers rapid prototyping with built-in training loops. Below is a full DCGAN implementation for MNIST using subclassed models and a custom training loop with tf.GradientTape. The key difference from PyTorch: the forward pass is recorded explicitly inside a tape context, and the optimiser applies the resulting gradients once the tape has produced them.
Performance tip: Use mixed precision (tf.keras.mixed_precision) to speed up GAN training on modern GPUs. For production, wrap the entire pipeline in a tf.function for graph compilation.
# Keras DCGAN ready for training. Use mixed precision for speed.
Keras vs PyTorch
Keras' built-in model.fit() does not support alternating training well. Always write a custom training loop with GradientTape for GANs in TensorFlow. Use @tf.function for performance.
Production Insight
TensorFlow/Keras GANs benefit from TensorRT optimisations and TF Serving for deployment.
The custom loop pattern is identical to PyTorch, but gradient tracking is explicit.
Rule: in Keras, give the generator and discriminator separate optimisers and update them in separate steps.
Key Takeaway
Keras GANs need a custom training loop: the stock model.fit() does not alternate generator and discriminator updates.
Use @tf.function for graph compilation and improved performance.
Choose PyTorch for research, TensorFlow for production serving.
Production incident · Post-mortem · Severity: high
The Face That Wasn't There: A Mode Collapse Postmortem
Symptom
After 12 hours of training, all generated images looked nearly identical. Loss values stabilised at a low discriminator loss (0.01) and a moderate generator loss (1.2). The team celebrated thinking the model converged — the losses weren't oscillating anymore.
Assumption
The team assumed that stable discriminator loss meant good convergence. They didn't inspect generated samples during training because it slowed GPU throughput.
Root cause
The generator exploited a single high-activation pattern — a specific eye-to-nose ratio — that the discriminator weakly associated with real faces. The discriminator's decision boundary collapsed around that pattern, and the generator had no incentive to explore.
Fix
Switched from DCGAN to WGAN-GP with gradient penalty λ=10, added minibatch discrimination, and visualised generated samples every 500 steps using a wandb logger. The mode collapse resolved within 200 additional steps.
Key lesson
Never trust accuracy or loss alone — always visualise samples at runtime.
WGAN-GP with gradient penalty is the default starting point for stable training.
Mode collapse often looks like perfect convergence on loss curves.
A generator that stops improving is a sign to check diversity, not quality.
Production debug guide
Symptom → Action mapping for the five most common GAN training failures (5 entries)
Symptom · 01
Discriminator loss drops to near-zero within first 100 steps
→
Fix
The discriminator is too strong. Reduce discriminator learning rate, add dropout to discriminator, or train discriminator less frequently (e.g., 1 discriminator step per 5 generator steps).
Symptom · 02
Generator loss increases continuously without convergence
→
Fix
Generator gradient is vanishing. Switch to non-saturating loss (replace log(1-D) with -log(D)). Use batch normalisation in both networks and ensure learning rates are balanced (typically 0.0002 for Adam).
Symptom · 03
Generated images are all grey or have constant pixel values
→
Fix
Check if output activation is Tanh (expected for DCGAN) and input noise is sampled correctly. Most common cause: the generator outputs are being clipped by a sigmoid instead of Tanh, preventing range [-1,1] match with real data.
Symptom · 04
Oscillating losses — neither loss stabilises after 10k steps
→
Fix
Learning rate is too high or batch size too small. Reduce LR by a factor of 2, increase batch size to 64 or 128, and add one-sided label smoothing (smooth real labels to 0.9).
Symptom · 05
Mode collapse — all generated samples look identical
→
Fix
Add minibatch discrimination layers, use WGAN-GP or spectral normalisation. Try unrolled GANs where the generator sees the discriminator's next-step gradient. Reduce latent dimension (e.g., 64 instead of 100) to constrain generator capacity.
GAN Training Symptom → Fix in 30 Seconds
Run these commands in your training loop to surface the most common failures without stopping the run.
If all samples look identical, you are seeing mode collapse. Apply WGAN-GP or spectral norm on the discriminator.
Loss values jump between 0 and 10 in each step
Immediate action
Check learning rate and batch size.
Commands
print(f'LR: {optim_D.param_groups[0]["lr"]:.6f}')
print(f'Batch size: {x_real.size(0)}')
Fix now
Reduce LR by factor of 5, increase batch size to at least 32. Use a fixed validation noise vector to track generator progression.
Architecture Comparison
| Architecture | Primary Innovation | Best Use Case |
| --- | --- | --- |
| Vanilla GAN | Original Minimax Loss | Basic proof of concepts |
| DCGAN | Deep Convolutional layers | High-quality image generation |
| WGAN-GP | Wasserstein Loss + Gradient Penalty | Stable training / preventing mode collapse |
| StyleGAN | Mapping network & Noise injection | Hyper-realistic faces and textures |
Key takeaways
1. GANs are a two-player minimax game aiming for a Nash equilibrium: balance is everything.
2. The original minimax loss causes vanishing gradients; always use non-saturating loss for the generator.
3. WGAN-GP with gradient penalty is the production default: it prevents the most common failure modes.
4. Mode collapse is a diversity problem, not a quality problem: visualise samples, don't trust loss curves.
5. FID is the standard metric but requires 50k+ samples; combine it with visual inspection of a fixed noise grid.
6. Containerise your GAN training to avoid CUDA version conflicts across team machines.
Common mistakes to avoid
5 patterns
×
Using Sigmoid in the final layer of the Generator while using MSE loss
Symptom
Generated images have low contrast, are greyish, or pixel values are stuck near 0.5. Tanh is expected for DCGAN.
Fix
Replace final activation with nn.Tanh() and ensure real images are scaled to [-1,1]. Use BCEWithLogitsLoss instead of MSE.
×
Neglecting the Discriminator — making it too weak or too strong
Symptom
If too weak: generator loss drops to zero quickly, but outputs are garbage. If too strong: generator loss diverges to infinity, no improvement.
Fix
Balance capacities: keep parameter counts within a factor of 2. Use learning rate ratio (e.g., D LR = 0.5 * G LR). Add spectral normalisation to discriminator to limit Lipschitz constant.
×
Ignoring sample visualisation during training
Symptom
Training completes with low loss but all generated images are identical or nonsensical. Mode collapse is discovered only after deployment.
Fix
Use a fixed noise vector z_fixed and save sample grids every 200 training steps. Log to wandb or TensorBoard. Never rely on loss curves alone.
×
Using learning rates that are too high for GAN training
Symptom
Both loss values oscillate wildly (0 to 10) from step to step. Generator can't converge.
Fix
Set Adam learning rate to 0.0002 for both networks (standard GAN LR). Use beta1=0.5 (not default 0.9) to smooth oscillations. If oscillations persist, reduce LR further.
×
Not normalising real data to match generator output range
Symptom
Discriminator learns to reject all generated samples because they fall outside the range of real data (e.g., real in [0,255], gen in [-1,1]).
Fix
Normalise real images to [-1,1] using (x / 127.5 - 1). Ensure generator output is Tanh, not Sigmoid or Linear. Verify input statistics match.
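The range mapping from this last fix, together with its inverse for saving generator outputs, can be written as a short NumPy sketch. The helper names are hypothetical.

```python
import numpy as np


def to_model_range(images_uint8):
    """Map uint8 pixels in [0, 255] to float32 in [-1, 1]."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0


def to_uint8(images):
    """Inverse mapping, for saving or displaying generator outputs."""
    return np.clip(np.rint((images + 1.0) * 127.5), 0, 255).astype(np.uint8)


pixels = np.array([0, 127, 128, 255], dtype=np.uint8)
scaled = to_model_range(pixels)  # endpoints map to -1.0 and 1.0
restored = to_uint8(scaled)      # round-trips back to the original pixels
```

Applying `to_model_range` to real images and keeping Tanh on the generator puts both inputs to the discriminator on the same scale.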
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 05 · Senior
Explain the minimax objective function of a GAN. Why does the original formulation lead to vanishing gradients?
ANSWER
The minimax objective is $\min_G \max_D \, \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1-D(G(z)))]$. The discriminator maximises log probability of correct classification; the generator minimises log probability of the discriminator being correct. The problem: when D is too good, $\log(1-D(G(z)))$ saturates to a constant, giving near-zero gradient for G. Fix: use the non-saturating loss $-\log(D(G(z)))$ which provides strong gradients even when D dominates.
Q02 of 05 · Senior
Describe mode collapse in GANs. How would you diagnose and fix it in a production image generation pipeline?
ANSWER
Mode collapse is when the generator outputs a single or limited set of samples. Diagnosis: compute variance of generated pixel values across batches — if variance < 0.01, collapse is likely. Also inspect sample grids. Production fixes: (1) switch to WGAN-GP with gradient penalty, (2) add minibatch discrimination layer, (3) use spectral normalisation on the discriminator, (4) implement unrolled GANs.
Q03 of 05 · Senior
What is the difference between training a GAN and training a standard neural network? How does this affect your monitoring strategy?
ANSWER
GANs have two interacting losses (D and G) that are not independent — lowering D loss often harms G quality. Standard metrics like accuracy don't correlate with generated image quality. Monitoring must include: (1) sample visualisation with fixed noise seed, (2) FID/IS scores, (3) gradient norms for both networks, (4) pixel variance across batches. Never use loss alone as a convergence criterion.
Q04 of 05 · Senior
How does Wasserstein GAN (WGAN) improve stability over vanilla GAN? Under what conditions does it fail?
ANSWER
WGAN replaces binary cross-entropy with the Earth Mover (Wasserstein-1) distance, which provides useful gradients even when the real and generated distributions don't overlap. This largely removes the vanishing gradient problem. It struggles when the Lipschitz constraint is enforced via weight clipping, which pushes weights towards the clipping bounds and underuses the critic's capacity; WGAN-GP replaces clipping with a gradient penalty to fix this. The gradient penalty also becomes expensive on very high-resolution images (>512px), where spectral normalisation works better.
Q05 of 05 · Senior
Explain the role of the Inception network in computing FID. What are the limitations of FID as a GAN evaluation metric?
ANSWER
FID uses features from the pool3 layer of a pretrained Inception-v3 network. It computes the Fréchet distance between multivariate Gaussians fitted to real and generated feature distributions. Limitations: (1) requires at least 50k samples for stable estimates, (2) is sensitive to image upscaling resolution, (3) Inception features are optimised for ImageNet — may not be suitable for non-object-centric domains (medical, satellite). For domain-specific GANs, use features from a domain-specific pretrained encoder.
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
What is the difference between GANs and VAEs?
Both are generative models. VAEs are probabilistic models maximising a lower bound on data likelihood; they tend to produce blurry images because their pixel-wise reconstruction term averages over plausible outputs. GANs use an adversarial game to learn the data distribution, focusing on realism rather than exact pixel accuracy, resulting in sharper images. However, GANs are harder to train and evaluate.
02
How do you stop mode collapse in GANs?
Common solutions: Wasserstein loss with gradient penalty (WGAN-GP) provides smooth gradients and reduces collapse, label smoothing prevents the discriminator from being too confident, minibatch discrimination lets the discriminator check if all samples in a batch are too similar, and unrolled GANs let the generator look ahead at the discriminator's next response. Start with WGAN-GP + spectral normalisation as a baseline.
03
Is GAN training supervised or unsupervised?
GANs are considered unsupervised (or self-supervised) because they don't require external labels. The discriminator creates its own labels (real/fake) from the training data itself. However, conditional GANs (cGANs) use class labels or other side information, making them supervised (or semi-supervised) depending on the setup.
04
What batch size should I use for GAN training?
Standard choice is 32-128. Smaller batches lead to unstable gradients because the discriminator sees fewer real/fake samples per step. Larger batches (256+) provide more stable but slower updates. For WGAN-GP, a batch size of 64 is a good start. Adjust based on GPU memory — GAN models often require more memory than standard classifiers due to the two-network pipeline.
05
Can GANs be used for data augmentation?
Yes. GAN-generated images can augment small training sets, especially in domains like medical imaging where labelled data is scarce. However, training the GAN itself requires a large enough dataset. If you have fewer than 10,000 samples, consider fine-tuning a pretrained StyleGAN2 or using diffusion models instead.