Mid-level 5 min · March 06, 2026

CNN Batch Norm Inference Bug — Why Validation Error Doubled

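Validation accuracy dropped from 94.2% to 86.1% on the same test set after the training checkpoint was converted to a frozen inference graph.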

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • CNNs learn spatial hierarchies of features via shared-weight kernels sliding over input
  • Convolution layers extract local patterns; pooling downsamples and adds translation invariance
  • Receptive field size grows with depth — critical for understanding what each layer sees
  • A 3×3 convolution with stride 1 and padding 'same' preserves spatial dimensions
  • Batch norm during inference uses running statistics, not batch statistics; predictions break if the layers are left in training mode
  • Biggest mistake: treating convolution as black box without reasoning about kernel size and stride implications on memory and latency
Plain-English First

Imagine you're looking for Waldo in a crowd. You don't stare at the whole page at once — your eyes scan small patches, looking for his red-and-white stripes, then his glasses, then his hat. A CNN does exactly this: it slides a tiny inspection window across an image, learning to recognise simple patterns first (edges, colours), then combines those into complex ones (eyes, faces, whole objects). The network builds a hierarchy of clues, just like your brain does.

Every time your phone unlocks with your face, every time a radiologist's AI flags a tumour, every time a self-driving car spots a stop sign — a Convolutional Neural Network is doing the heavy lifting. CNNs are the backbone of modern computer vision, and despite transformers making headlines, CNNs remain the go-to architecture for real-time, resource-constrained visual tasks. Understanding them deeply is not optional for any serious ML engineer.

The core problem CNNs solve is spatial invariance with parameter efficiency. A fully-connected network applied to a 224×224 RGB image would need 150,528 input neurons connected to every neuron in the next layer — that's hundreds of millions of parameters before you've done anything useful. Worse, if the same cat appears in the top-left vs the bottom-right of two photos, a dense network treats them as completely different inputs. CNNs solve both problems with a single elegant idea: share weights across space.

By the end of this article you'll be able to reason about receptive field growth through a network, choose the right pooling strategy for a given task, diagnose training pathologies like dead filters and gradient saturation, and make informed decisions about architecture trade-offs (depth vs width, stride vs pooling) that affect production inference latency. This is the article you wish existed when you first tried to go beyond 'run the tutorial and hope it works'.

The Convolution Operation: What's Really Happening Under the Hood

A convolution is not a full dot product over the entire input. It's a sliding window — a kernel of weights (e.g., 3×3×3 for an RGB input) slides across the spatial dimensions, element-wise multiplies and sums, producing a feature map. For a single filter, you get one 2D map per kernel. Stack multiple filters to capture different features.

The output size is governed by three hyperparameters: kernel size, stride, and padding. Without padding, the spatial dimensions shrink after each convolution. 'Same' padding adds zeros around the input so output size matches input size. 'Valid' padding means no padding — you lose border pixels.

The number of parameters per layer is: (kernel_height × kernel_width × input_channels + 1) × num_filters. The +1 is the bias, one per filter. Stacking 64 3×3 filters on an input with 64 channels: (3 × 3 × 64 + 1) × 64 = 36,928 parameters, far fewer than a dense layer connecting 64-channel 224x224 feature maps.

conv_basic.py (Python)
import torch
import torch.nn as nn

# Define a single convolutional layer in PyTorch
conv_layer = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=64,    # number of filters
    kernel_size=3,      # 3x3 kernel
    stride=1,
    padding=1,          # 'same' padding: output spatial size = input size
    bias=True
)

x = torch.randn(1, 3, 224, 224)  # batch=1, channels=3, height=224, width=224
y = conv_layer(x)
print(f"Output shape: {y.shape}")  # torch.Size([1, 64, 224, 224])
print(f"Parameters: {sum(p.numel() for p in conv_layer.parameters())}")  # 64*3*3*3 + 64 = 1792
Output
Output shape: torch.Size([1, 64, 224, 224])
Parameters: 1792
Mental Model: Cross-Correlation, Not Convolution
  • True convolution flips the kernel (180° rotation) before sliding. CNNs skip the flip because learning the weights makes it equivalent.
  • Skipping the flip is not a speed-up; the flip is simply unnecessary when the kernel weights are learned.
  • The sliding dot product captures local spatial correlations efficiently.
  • Multiple filters learn different features: one might detect horizontal edges, another vertical.
  • Filters are learned via backprop — you don't design them manually.
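The cross-correlation point can be checked numerically. The sketch below assumes PyTorch's `torch.nn.functional.conv2d` and a random 5x5 single-channel input (both illustrative choices, not from the article); it compares the framework result with a hand-written sliding dot product and with a true convolution that flips the kernel first.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(1, 1, 5, 5)    # (batch, channels, height, width)
kernel = torch.randn(1, 1, 3, 3)   # (out_channels, in_channels, kH, kW)

# What deep learning frameworks call "convolution": the kernel slides unflipped
framework_out = F.conv2d(image, kernel)

# Manual cross-correlation: multiply each 3x3 patch by the kernel and sum
manual_out = torch.zeros(1, 1, 3, 3)
for i in range(3):
    for j in range(3):
        patch = image[0, 0, i:i + 3, j:j + 3]
        manual_out[0, 0, i, j] = (patch * kernel[0, 0]).sum()

# True convolution = cross-correlation with the kernel rotated by 180 degrees
true_conv = F.conv2d(image, torch.flip(kernel, dims=[2, 3]))

matches_cross_correlation = torch.allclose(framework_out, manual_out, atol=1e-5)
matches_true_convolution = torch.allclose(framework_out, true_conv, atol=1e-5)
print(matches_cross_correlation, matches_true_convolution)  # True False
```

For a learned kernel the distinction is harmless: the network can simply learn the flipped weights.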
Production Insight
A common production trap: using a single large kernel (e.g., 7x7) instead of stacking three 3x3 convolutions. Three 3x3 layers have the same receptive field as one 7x7 but use about 45% fewer parameters (27C² vs 49C² weights for C input and output channels) and introduce more nonlinearity. Prefer small kernels with depth.
If you're deploying on edge devices, the memory access pattern of convolution matters more than FLOPs. Im2Col + GEMM is fast on GPU but inefficient on mobile CPUs. Use depthwise separable convolutions for efficient mobile models.
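The parameter comparison between one 7x7 layer and three stacked 3x3 layers is plain arithmetic. This sketch assumes equal input and output channel counts (`C = 64` is an illustrative value) and ignores biases; `conv_weights` is a hypothetical helper name.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` in and out, ignoring biases."""
    return kernel_size * kernel_size * channels * channels

C = 64  # illustrative channel count
single_7x7 = conv_weights(7, C)        # 49 * C^2
stacked_3x3 = 3 * conv_weights(3, C)   # 27 * C^2
saving = 1 - stacked_3x3 / single_7x7

print(single_7x7, stacked_3x3)         # 200704 110592
print(f"{saving:.0%} fewer weights")   # 45% fewer weights
```

The stack needs about 55% of the weights of the single large kernel, which is a saving of roughly 45%, and the ratio does not depend on C.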
Key Takeaway
Convolution = parameter-efficient local feature extraction via shared-weight sliding windows.
Output size = (W - F + 2P) / S + 1 — memorize this formula; you'll use it constantly.
Prefer 3x3 kernels stacked over larger single kernels for both theoretical and practical reasons.
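The output-size formula is easy to turn into a helper. The sketch below uses integer division for the floor that frameworks apply when the stride does not divide evenly; `conv_output_size` is a hypothetical name, and the inputs are the sizes used elsewhere in this article.

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size for input width w, kernel f, padding p, stride s."""
    # Integer division implements the floor used when the stride does not divide evenly
    return (w - f + 2 * p) // s + 1

print(conv_output_size(224, 3, p=1, s=1))  # 224: 3x3, stride 1, padding 1 keeps the size
print(conv_output_size(224, 3, p=0, s=1))  # 222: 'valid' padding trims a 1-pixel border
print(conv_output_size(28, 3, p=1, s=2))   # 14: stride 2 halves the resolution
```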

Receptive Fields: How Deep Does Your Network See?

Every neuron in a convolutional layer has a region of the input image that influences it — its receptive field. For the first layer, it's simply the kernel size (e.g., 3x3). As you stack layers, the receptive field grows linearly with depth for regular convolutions, but faster with dilation.

The receptive field at layer L follows the recurrence RF_L = RF_{L-1} + (kernel_size - 1) × stride_product, where stride_product is the product of the strides of all previous layers. If VGG16's 13 conv layers (all 3x3, stride 1) were stacked without pooling, the RF would be only (3-1) × 13 + 1 = 27. The stride-2 pooling layers between the conv blocks double stride_product each time, so every later conv adds more input pixels than the one before it. Applying the recurrence layer by layer, in network order, gives an RF of 212x212 at VGG16's final pooling layer.

Why does this matter in production? If your objects are large, you need a large RF. Using too many small filters without downsampling may never capture global context. Conversely, for segmenting small objects, too much downsampling loses detail — you need dilated convolutions or skip connections.

receptive_field.py (Python)
def receptive_field(layers):
    rf = 1
    stride_product = 1
    for kernel, stride, dilation in layers:
        effective_kernel = (kernel - 1) * dilation + 1
        rf = rf + (effective_kernel - 1) * stride_product
        stride_product *= stride
    return rf

# Example: VGG16 in forward order. Conv blocks of 2, 2, 3, 3, 3 layers (3x3, stride 1),
# each followed by a 2x2 max pool with stride 2. Layer order matters for the result.
conv, pool = (3, 1, 1), (2, 2, 1)
all_layers = []
for convs_in_block in (2, 2, 3, 3, 3):
    all_layers += [conv] * convs_in_block + [pool]
print(f"Receptive field: {receptive_field(all_layers)}")  # 212
Output
Receptive field: 212
Production Note: Receptive Field Mismatch
If your inference dataset has objects that are significantly larger or smaller than the receptive field of the final layer, the network will struggle. Always compute the effective RF and compare to the scale of objects in your images.
Production Insight
When deploying an object detector (e.g., YOLO) trained on 416x416 images, the RF at the detection head is fixed. If production images are resized to 416 but contain tiny objects (e.g., defects on a circuit board), the RF might be too large — small features get subsumed. Solution: use a feature pyramid network (FPN) that combines multiple RF scales.
For semantic segmentation, dilated convolutions (e.g., atrous conv with rate=2,4,8) are common to increase RF without losing resolution. But the full-resolution feature maps that dilation preserves are memory-hungry: if you see OOM on an 8GB GPU, downsample earlier or reduce the number of dilated branches.
Key Takeaway
Receptive field = the input region a neuron sees. Compute it layer by layer.
Match RF to target object size — too small misses context, too large loses detail.
Dilated convolutions are the production trick to enlarge RF without downsampling.
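A minimal sketch of the dilation trick, assuming PyTorch and an illustrative stack of three 3x3 convolutions with rates 1, 2, 4 (the channel count and input size are arbitrary). Setting padding equal to the dilation rate keeps the spatial size, while the receptive field grows by (kernel - 1) × dilation per layer.

```python
import torch
import torch.nn as nn

# Three 3x3 convs with dilation rates 1, 2, 4; padding = dilation keeps the spatial size
rates = [1, 2, 4]
net = nn.Sequential(
    *[nn.Conv2d(8, 8, kernel_size=3, dilation=r, padding=r) for r in rates]
)

x = torch.randn(1, 8, 64, 64)
out = net(x)
print(out.shape)  # torch.Size([1, 8, 64, 64]): resolution is preserved

# With stride 1 everywhere, each layer adds (kernel - 1) * dilation to the RF
rf_dilated = 1 + sum((3 - 1) * r for r in rates)
rf_plain = 1 + sum((3 - 1) * 1 for _ in rates)
print(rf_dilated, rf_plain)  # 15 7
```

The same three layers see a 15x15 region instead of 7x7, with no pooling and no extra parameters.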

Pooling: Trade-offs Between Downsampling and Information Loss

Pooling reduces spatial dimensions — typically by taking the max or average over a 2x2 window with stride 2. Max pooling retains the most activated feature, average pooling retains overall distribution. Both impart local translation invariance (small shifts don't change the pooled output by much).

But pooling costs: you lose spatial resolution, which can hurt tasks requiring precise localization (segmentation, keypoint detection). Global average pooling (GAP) before the final layer is a common replacement for fully-connected layers — it reduces parameters and is less prone to overfitting. However, GAP throws away all spatial info; for tasks needing spatial output you must use up-convolution or transposed convolutions.

In production, strided convolutions (stride=2) can replace pooling entirely. Strided convolutions are learnable and often yield better performance than fixed pooling. But they increase compute and may cause checkerboard artifacts if not handled carefully.

pooling_demo.py (Python)
import torch
import torch.nn as nn

# Max pooling vs stride convolution
input_map = torch.randn(1, 64, 28, 28)

maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
conv_stride = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

out_pool = maxpool(input_map)
out_conv = conv_stride(input_map)
print(f"MaxPool output: {out_pool.shape}")   # [1, 64, 14, 14]
print(f"Strided Conv output: {out_conv.shape}")  # [1, 64, 14, 14]
print(f"Conv has {sum(p.numel() for p in conv_stride.parameters())} params")
# Conv has 64*64*3*3 + 64 = 36928 params, maxpool has 0 params
Output
MaxPool output: torch.Size([1, 64, 14, 14])
Strided Conv output: torch.Size([1, 64, 14, 14])
Conv has 36928 params
Pooling Gotcha: Translation Invariance vs Equivariance
Max pooling destroys exact position information. If your task requires knowing exactly where an object is (e.g., landing point estimation), prefer strided convolutions or atrous spatial pyramid pooling (ASPP).
Production Insight
A medical imaging team trained a CNN to segment tumors. They used three 2x2 max pooling layers, reducing a 512x512 input to 64x64 feature maps. The segmentation masks lost fine boundaries — the model couldn't produce pixel-accurate edges. The fix: replace the max pooling with stride-2 convolutions and add skip connections (U-Net style), bringing output resolution to 128x128 before upsampling. This added 2ms per inference but improved Dice score from 0.82 to 0.93.
Production rule: if your output is spatial (segmentation, depth estimation), avoid heavy pooling. Use atrous convolutions or learnable downsampling.
Key Takeaway
Pooling trades spatial resolution for translation invariance and parameter reduction.
Replace with strided convs if you need precise localization.
Global average pooling before the classification head is standard and reduces overfitting.
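To make the global average pooling point concrete, here is a sketch comparing a GAP classification head with a flatten-plus-dense head. The 512x7x7 feature shape and the 10 classes are illustrative assumptions, and `count_params` is a hypothetical helper.

```python
import torch
import torch.nn as nn

num_classes = 10                        # illustrative
features = torch.randn(4, 512, 7, 7)    # (batch, channels, H, W) from a backbone

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),            # average each channel's 7x7 map to one value
    nn.Flatten(),                       # (4, 512, 1, 1) -> (4, 512)
    nn.Linear(512, num_classes),
)
fc_head = nn.Sequential(
    nn.Flatten(),                       # (4, 512, 7, 7) -> (4, 25088)
    nn.Linear(512 * 7 * 7, num_classes),
)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

logits = gap_head(features)
print(logits.shape)                                 # torch.Size([4, 10])
print(count_params(gap_head), count_params(fc_head))  # 5130 250890
```

The GAP head has roughly 49x fewer parameters here, at the cost of discarding where in the 7x7 map each feature fired.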

Training Pitfalls: Dead Filters, Gradient Saturation & Learning Rate Schedules

Training a CNN is still finicky. Three common pathologies:
  1. Dead ReLU: Neurons that never fire (output zero for all inputs). They stop learning because their gradient is zero. This often happens with too high a learning rate or poor weight initialization. Fix: use LeakyReLU (alpha=0.01) or PReLU.
  2. Vanishing gradients: In very deep networks, gradients shrink toward zero in the lower layers. This plagued pre-BatchNorm era CNNs. BatchNorm and residual connections (ResNet) address this by maintaining gradient flow.
  3. Learning rate mismatch: A global LR may be too high for some layers (especially pretrained backbones) and too low for randomly initialized classifier heads. Use discriminative learning rates (e.g., low LR for the base, 10x for the head).
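Discriminative learning rates map directly onto optimizer parameter groups in PyTorch. The sketch below uses a small stand-in model whose `backbone` and `head` names are hypothetical; the base rate of 1e-4 is an illustrative value, not a recommendation.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in model: `backbone` plays the pretrained part, `head` the new classifier
model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    "head": nn.Linear(16, 10),
})

base_lr = 1e-4  # illustrative value
optimizer = optim.Adam([
    {"params": model["backbone"].parameters(), "lr": base_lr},       # gentle updates
    {"params": model["head"].parameters(), "lr": base_lr * 10},      # 10x for the head
])

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

Schedulers such as `ReduceLROnPlateau` scale each group's rate independently, so the 10x ratio between head and backbone is preserved as the rates decay.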

In production, you'll often freeze the backbone (set requires_grad=False) and only train the head if you have limited data. But freezing BatchNorm layers is critical — they must stay in eval mode.

train_cnn.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
# Freeze backbone
for param in model.parameters():
    param.requires_grad = False
# Replace classifier for 10 classes
model.fc = nn.Linear(512, 10)
# Only train the head
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

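# Important: set model.eval() so BatchNorm keeps its running stats, then switch the head to train
model.eval()
model.fc.train()

criterion = nn.CrossEntropyLoss()  # build the loss once, not on every step

# `dataloader` is assumed to be a torch.utils.data.DataLoader yielding (images, labels)
for epoch in range(10):
    for images, labels in dataloader:
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1}: loss={loss.item():.2f}")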
Output
Epoch 1: loss=2.14, Epoch 5: loss=0.45, Epoch 10: loss=0.12
Mental Model: Gradient Highway
  • Batch norm normalises activations, reducing internal covariate shift and keeping gradients in healthy ranges.
  • Residual shortcuts (x + F(x)) let gradients flow directly through the skip path, avoiding vanishing in deep stacks.
  • Without these, a 20-layer CNN would be nearly untrainable.
  • During inference, batch norm bypasses batch statistics; using running stats preserves the learned distribution.
Production Insight
A team fine-tuned ResNet50 for drone imagery classification. They forgot to freeze BatchNorm layers and trained with batch size 8. The running mean/variance updated erratically on each mini-batch, causing the backbone to forget its pretrained features. Validation accuracy never exceeded 55% (pretrained baseline was 78% frozen). Fix: set model.eval() and only enable training on the custom classifier. Accuracy jumped to 92% in 3 epochs.
Also track learning rate: if loss oscillates, reduce LR by factor 10. Use ReduceLROnPlateau scheduler tied to validation loss plateau.
Key Takeaway
Dead ReLU → switch activation. Vanishing gradient → add skip connections or normalization layers. Freeze BN during fine-tuning, always.
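When the rest of the network has to stay in training mode (for example, with dropout in the head), BatchNorm can be frozen on its own. The sketch below is one way to do it, assuming PyTorch; `freeze_batchnorm` is a hypothetical helper name and the tiny model is a stand-in.

```python
import torch
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def freeze_batchnorm(model):
    """Put every BatchNorm layer in eval mode and stop training its scale and shift."""
    for module in model.modules():
        if isinstance(module, BN_TYPES):
            module.eval()                      # use running stats, stop updating them
            for param in module.parameters():
                param.requires_grad = False    # freeze gamma and beta

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
model.train()             # model.train() flips BN back to training mode...
freeze_batchnorm(model)   # ...so re-apply the freeze after every model.train() call

bn = model[1]
stats_before = bn.running_mean.clone()
model(torch.randn(4, 3, 16, 16))
stats_unchanged = torch.equal(stats_before, bn.running_mean)
print(stats_unchanged)  # True: the forward pass did not touch the running statistics
```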

Architecture Decisions: Depth vs Width & Stride vs Pooling

When designing a CNN, you face two fundamental trade-offs:
  • Depth vs width: Deeper networks (more layers) can learn more complex features but are harder to optimize (solved by ResNets). Wider networks (more filters per layer) capture more features at a single scale but increase parameters quadratically. Rule of thumb: depth > width for general vision tasks; width matters more for fine-grained classification.
  • Stride vs pooling: Both reduce spatial dimensions. Strided convolutions are learnable and often give better accuracy. They evaluate the kernel only at the output positions, so a stride-2 convolution costs roughly a quarter of the FLOPs of the same convolution at stride 1, but unlike pooling it adds parameters and multiply-accumulate work. Pooling is parameter-free and cheaper. In production, use strided convs for backbones when accuracy matters; pooling for lightweight models.

Also consider: depthwise separable convolutions (MobileNet, Xception) factorize a standard conv into depthwise (spatial filtering per channel) and pointwise (1x1 across channels). This reduces parameters by 8-9x for a 3x3 conv, ideal for mobile deployment. But on GPU, depthwise convs are less optimized than standard convs, so you might not see speed gains — always profile.

depthwise.py (Python)
import torch.nn as nn

# Standard 3x3 conv, 64->128 channels
standard_conv = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False)

# Depthwise separable: depthwise + pointwise
depthwise_conv = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, groups=64, bias=False),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1, stride=1, bias=False)  # pointwise
)

# Parameter count
standard_params = sum(p.numel() for p in standard_conv.parameters())
depthwise_params = sum(p.numel() for p in depthwise_conv.parameters())
print(f"Standard conv parameters: {standard_params}")  # 64*128*3*3 = 73728
# Depthwise separable: 64*3*3 + 64*128 = 576 + 8192 = 8768
print(f"Depthwise separable parameters: {depthwise_params}")  # 8768
Output
Standard conv parameters: 73728
Depthwise separable parameters: 8768
Production Rule: Know Your Hardware
Depthwise convolutions are slower than standard on many GPUs because of low arithmetic intensity. Always benchmark the actual layer speed on your target hardware (CPU, GPU, NPU) before choosing an architecture.
Production Insight
A popular cloud inference service deployed MobileNetV2 for real-time object detection. On an NVIDIA T4 GPU, MobileNetV2 was actually slower than ResNet18 for batch size 1, despite having far fewer FLOPs. The reason: the depthwise convolutions were memory-bound on that GPU, while ResNet's standard convolutions achieved near-peak compute utilization. They switched to ResNet18 and applied quantization (INT8) to meet latency SLA, reducing latency from 12ms to 6ms.
Key lesson: FLOPs are not latency. Profile on your production hardware with your batch size.
Key Takeaway
Deeper is generally better than wider — but pair with skip connections. Prefer strided convs over pooling for accuracy, pooling for speed. Depthwise convs are not always faster — always profile on target hardware.
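"Always profile" can start as small as the sketch below, which times a standard and a depthwise separable layer on CPU. The helper name `cpu_latency_ms`, the 56x56 input, and the warm-up and run counts are illustrative assumptions. On GPU, wall-clock timing like this is only meaningful if you call `torch.cuda.synchronize()` around the timed region or use CUDA events.

```python
import time
import torch
import torch.nn as nn

def cpu_latency_ms(module, x, warmup=5, runs=20):
    """Median forward-pass time in milliseconds on CPU."""
    module.eval()
    timings = []
    with torch.no_grad():
        for _ in range(warmup):              # warm-up runs are not timed
            module(x)
        for _ in range(runs):
            start = time.perf_counter()
            module(x)
            timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]

x = torch.randn(1, 64, 56, 56)
standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, groups=64, bias=False),  # depthwise
    nn.Conv2d(64, 128, 1, bias=False),                       # pointwise
)
standard_ms = cpu_latency_ms(standard, x)
separable_ms = cpu_latency_ms(separable, x)
print(f"standard:  {standard_ms:.3f} ms")
print(f"separable: {separable_ms:.3f} ms")
```

The numbers depend entirely on the machine, which is the point: the ratio you measure may look nothing like the ratio of parameter counts.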

Deployment Gotchas: Model Size, Latency & Quantization

Getting a CNN into production is an engineering challenge beyond training. Three critical areas:
  1. Model size: A ResNet50 checkpoint is ~98 MB (float32). On memory-constrained devices, this is too large. Use quantization (INT8 reduces it to ~25 MB) or pruned models. Also consider exporting to ONNX or TensorRT for optimized inference.
  2. Latency: First inference (cold start) often includes model loading and CUDA kernel compilation. Warm up by running a dummy batch after loading. For edge devices, use TensorFlow Lite or Core ML. Batch size tuning: smaller batches reduce throughput but improve latency per request. For real-time workloads, use batch size 1 with model parallelism.
  3. Reproducibility: Floating-point results are not deterministic across GPUs. If you need deterministic results (e.g., for medical imaging), set torch.backends.cudnn.deterministic = True, torch.backends.cudnn.benchmark = False, and torch.manual_seed(0), but this may slow down training.

deploy_cnn.py (Python)
import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
model.eval()

# Export to TorchScript for production
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('resnet50_traced.pt')

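# Post-training dynamic quantization: only nn.Linear weights become INT8.
# For ResNet50 that is just the final fc layer; conv layers need static quantization.
import torch.quantization as quant
quantized_model = quant.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
torch.jit.save(torch.jit.script(quantized_model), 'resnet50_int8.pt')

# Profile GPU latency with CUDA events (model and input must be on the same device)
gpu_model = traced_model.cuda()
gpu_input = example_input.cuda()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(10):  # warm-up so one-time setup cost is not measured
        gpu_model(gpu_input)
    start.record()
    output = gpu_model(gpu_input)
    end.record()
torch.cuda.synchronize()
print(f"Inference latency: {start.elapsed_time(end):.2f} ms")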
Output
Inference latency: 4.57 ms
The Warm-Up Trick
Run 10-20 dummy inferences after loading the model to trigger CUDA kernel compilation and cuDNN autotuning. Measure latency after warm-up; cold start can be 5x slower.
Production Insight
A self-driving startup deployed a semantic segmentation CNN on an embedded NVIDIA Xavier. They noticed the first frame took 350ms, then subsequent frames took 50ms. The issue: model loading and cuDNN autotune ran on first inference. Fix: after loading the model, run a dummy input of the same size through the full pipeline in a warm-up step. Then the autonomous driving stack could process 30 fps reliably.
Also watch for memory fragmentation: repeatedly allocating tensors of varying sizes can fragment GPU memory. PyTorch's caching allocator handles most of this, but long-running services may still need manual intervention, such as keeping input sizes fixed or tuning the allocator through the PYTORCH_CUDA_ALLOC_CONF environment variable.
Key Takeaway
Quantize for memory, warm up for latency, profile on production hardware. Always test with the exact batch size and input dimensions you'll use in production.
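The model-size arithmetic behind "quantize for memory" is parameters times bytes per parameter. A rough sketch, assuming a single stand-in layer rather than a full network; `weights_size_mb` is a hypothetical helper, and it ignores buffers, quantization scales, and file-format overhead.

```python
import torch.nn as nn

def weights_size_mb(model, bytes_per_param=4):
    """Approximate size of the parameters alone, in MB (10^6 bytes)."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param / 1e6

layer = nn.Conv2d(64, 128, kernel_size=3, bias=False)   # 73,728 weights
fp32_mb = weights_size_mb(layer, bytes_per_param=4)     # float32: 4 bytes each
int8_mb = weights_size_mb(layer, bytes_per_param=1)     # INT8: 1 byte each
print(f"float32: {fp32_mb:.3f} MB, int8: {int8_mb:.3f} MB")  # float32: 0.295 MB, int8: 0.074 MB
```

The 4x ratio is the theoretical ceiling; a real quantized checkpoint lands close to it only if every large layer is actually quantized.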
● Production incident · POST-MORTEM · severity: high

Batch Norm Inference Bug: The 1 Line That Doubled Validation Error

Symptom
Validation accuracy dropped from 94.2% to 86.1% on the same test set after converting the training checkpoint to a frozen inference graph.
Assumption
The team assumed that batch normalization layers behave identically in training and inference. They used the same code path for both.
Root cause
Batch normalization layers were left in training mode. During inference, they computed batch statistics instead of using stored running mean/variance, and the statistics diverged at larger batch sizes, shifting the feature distributions.
Fix
Call model.eval() in PyTorch or set training=False on all BatchNorm layers before inference. Also freeze BN layers when fine-tuning a pretrained model with small batch sizes.
Key lesson
  • Always verify model mode (train vs eval) before deployment — batch norm is silently wrong in train mode.
  • Running statistics are an exponential moving average accumulated during training; in eval mode they are fixed and do not depend on the inference batch size.
  • Test your frozen graph with the exact batch size you'll use in production.
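The failure mode is easy to reproduce in isolation. The sketch below, assuming PyTorch, pushes the same sample through a `BatchNorm2d` layer alongside two different sets of batch-mates; the shapes and the shifted second batch are illustrative. In training mode the sample's output changes with its neighbours, and in eval mode it does not.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(4)
sample = torch.randn(1, 4, 8, 8)
# The same sample placed in two batches with very different statistics
batch_a = torch.cat([sample, torch.randn(7, 4, 8, 8)])
batch_b = torch.cat([sample, torch.randn(7, 4, 8, 8) * 5 + 3])

def first_output(layer, batch):
    """Output for the first item in the batch (our fixed sample)."""
    with torch.no_grad():
        return layer(batch)[0]

bn.train()   # training mode: normalise with the current batch's mean and variance
train_differs = not torch.allclose(first_output(bn, batch_a), first_output(bn, batch_b))

bn.eval()    # eval mode: normalise with the stored running statistics
eval_matches = torch.allclose(first_output(bn, batch_a), first_output(bn, batch_b))

print(train_differs, eval_matches)  # True True
```

A check like this makes a cheap deployment test: run one fixed input through the exported model at two batch sizes and assert the outputs agree.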
Production debug guide: common failure modes during CNN training and how to fix them fast (5 entries)
Symptom · 01
Loss plateaus early at a high value (e.g., cross-entropy of 2.3 for 10 classes)
Fix
Check if weights are initialised properly. Use Kaiming He init for ReLU, Xavier for tanh. Also verify gradients aren't vanishing — plot gradient histograms per layer.
Symptom · 02
Validation accuracy oscillates or diverges after a few epochs
Fix
Reduce learning rate. Use learning rate schedulers (step decay, cosine annealing). Also check for exploding gradients — apply gradient clipping with max_norm=1.0.
Symptom · 03
Network outputs the same class for all inputs (dead filters)
Fix
Many ReLU units may be stuck at zero. Replace dead ReLUs with LeakyReLU or PReLU. Also lower the learning rate and check weight initialisation, the two usual causes of dead units.
Symptom · 04
Training loss decreases well but validation loss increases (overfitting)
Fix
Add dropout after fully-connected layers, increase weight decay (L2 regularisation), use data augmentation, or reduce model capacity (fewer filters).
Symptom · 05
Memory OOM during training with large images
Fix
Reduce batch size, use gradient accumulation, lower input resolution, or use mixed precision training (FP16). Check GPU memory with nvidia-smi.
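Several of the fixes above (Kaiming init, a cosine learning rate schedule, gradient clipping with max_norm=1.0) fit in one short training loop. This is a sketch on random data with a stand-in model; the layer sizes, learning rate, and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 8 * 8, 10)
)
# Kaiming (He) initialisation for conv layers followed by ReLU
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
criterion = nn.CrossEntropyLoss()

images, labels = torch.randn(16, 3, 8, 8), torch.randint(0, 10, (16,))
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    # Rescale all gradients so their global L2 norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()   # cosine decay of the learning rate, once per epoch

clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
final_lr = optimizer.param_groups[0]["lr"]
```

Clipping goes after `backward()` and before `optimizer.step()`; placed anywhere else it has no effect on the update.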
★ Quick CNN Debug Cheat Sheet: three common CNN production issues and the exact incantation to fix them
Batch norm causing wrong predictions at inference
Immediate action
Set model.eval() / model.train(False)
Commands
model.eval() # PyTorch; tf.keras.backend.set_learning_phase(0) in TF1
Verify that dropout is also disabled: model.training is False
Fix now
Wrap inference call in torch.no_grad(): with torch.no_grad(): output = model(input)
Single image inference takes >100ms on GPU (latency too high)
Immediate action
Check model input resolution and number of operations (FLOPs)
Commands
torchinfo.summary(model, input_size=(1, 3, 224, 224)) # Reports total mult-adds and parameter count
Profile with torch.profiler or NVIDIA Nsight Systems to find bottleneck layers
Fix now
Downsample input to 128x128 if acceptable, or switch to a lighter backbone like MobileNetV3-Small
Training OOM with batch size 16 on 16GB GPU
Immediate action
Reduce batch size to 8 or 4 and restart training
Commands
print(torch.cuda.memory_summary()) to see what is allocated; torch.cuda.empty_cache() only releases cached, unused blocks and will not fix a genuine OOM
Check for accidental large intermediate tensors (e.g., from attention in CNN+Transformer hybrids)
Fix now
Implement gradient accumulation: run batch size 4 for 4 steps to simulate batch 16. Accumulate gradients and step optimizer every N steps.
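Gradient accumulation can be verified against the full-batch gradient. The sketch below uses a stand-in linear model on random data (illustrative assumptions); it first computes the gradient from all 16 samples, then rebuilds it from 4 micro-batches of 4.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs, targets = torch.randn(16, 10), torch.randint(0, 2, (16,))

# Reference: gradient from the full batch of 16
criterion(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()
optimizer.zero_grad()

accum_steps = 4  # 4 micro-batches of 4 samples simulate one batch of 16
micro_batches = zip(inputs.chunk(accum_steps), targets.chunk(accum_steps))
for step, (x, y) in enumerate(micro_batches):
    # Divide so the summed gradients equal the gradient of the mean loss over 16 samples
    loss = criterion(model(x), y) / accum_steps
    loss.backward()                       # gradients add up in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        accumulated_grad = model.weight.grad.clone()
        optimizer.step()                  # one optimizer step per 16 samples
        optimizer.zero_grad()

grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print(grads_match)  # True
```

The equivalence holds for this model; with BatchNorm in training mode, each micro-batch is normalised with its own statistics, so accumulation is no longer identical to a true batch of 16.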
CNN Layer Types Comparison
Layer Type | Parameters (64→128, 3x3) | Multiply-accumulates (28x28 input) | Best Use Case
Standard Conv2d | 73,728 | ~57.8M | General purpose, accurate
Depthwise Separable | 8,768 | ~6.9M | Mobile/edge, low latency
MaxPool (2x2, stride 2) | 0 | 0 (comparisons only) | Fast downsampling, translation invariance
Conv2d with stride 2 | 73,728 | ~14.5M (14x14 output) | Learnable downsampling, accuracy
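Compute cost for these layers can be derived from the shapes: each output value of a convolution needs (input channels per group) × k × k multiply-accumulates. A sketch under that definition, with `conv_macs` as a hypothetical helper and the 64→128 channel, 28x28 configuration from the comparison:

```python
def conv_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """Multiply-accumulates for a conv layer producing a c_out x h_out x w_out map."""
    return (c_in // groups) * c_out * k * k * h_out * w_out

standard = conv_macs(64, 128, 3, 28, 28)
separable = (
    conv_macs(64, 64, 3, 28, 28, groups=64)   # depthwise: one 3x3 filter per channel
    + conv_macs(64, 128, 1, 28, 28)           # pointwise: 1x1 across channels
)
strided = conv_macs(64, 128, 3, 14, 14)       # stride 2 on a 28x28 input gives 14x14

print(standard, separable, strided)  # 57802752 6874112 14450688
```

The separable layer needs about 8.4x fewer multiply-accumulates than the standard one, and the stride-2 layer a quarter, because it produces a quarter of the output positions.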

Key takeaways

1. Convolution = weight sharing across space: dramatically fewer parameters than dense layers, and translation invariance by design.
2. Receptive field must match object scale in your production data: compute it!
3. Prefer 3x3 kernels stacked; prefer strided convs over pooling for accuracy; depthwise convs are not free.
4. Batch norm is silently wrong in training mode: always call model.eval() before inference.
5. FLOPs ≠ latency. Always profile on target hardware with production batch size and input shape.
6. Quantize for deployment: INT8 shrinks models ~4x with minimal accuracy loss on most CNNs.

Common mistakes to avoid

Using large kernel sizes (7x7) instead of stacked 3x3

Symptom
Model has too many parameters and overfits; inference is also slower because of the larger memory footprint
Fix
Replace a 7x7 with three 3x3 convolutions (or a 5x5 with two). Same receptive field, fewer parameters, more non-linearity

Forgetting to switch BatchNorm to eval mode during inference

Symptom
Validation accuracy drops mysteriously; predictions depend on batch size
Fix
Call model.eval() before inference. Verify by running one fixed input at two batch sizes and checking that the outputs match.

Overusing pooling for downsampling in spatial tasks (segmentation)

Symptom
Output masks are blocky and lack fine details
Fix
Replace max pooling with strided convolutions or use dilated convolutions to preserve resolution

Not computing receptive field before training on different-scale objects

Symptom
Network fails to detect small or large objects despite good validation on the original dataset
Fix
Compute the RF after the final convolutional layer. If it does not match the object scale, change the architecture or use an FPN

Assuming FLOPs correlate with latency on the target hardware

Symptom
A low-FLOP model such as MobileNetV2 is slower than expected on GPU
Fix
Profile actual latency with representative batch size and hardware. Use hardware-specific optimizations like INT8 quantization
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Why do we prefer small convolutional kernels (e.g., 3x3) over larger one...
Q02 · SENIOR
Explain how batch normalization behaves differently during training and ...
Q03 · SENIOR
Describe a scenario where using max pooling would be detrimental, and wh...
Q04 · SENIOR
What is the receptive field of a neuron in a deep CNN, and why does it m...
Q05 · SENIOR
How would you reduce the latency of a CNN model to run on a mobile devic...
Q01 of 05 · SENIOR

Why do we prefer small convolutional kernels (e.g., 3x3) over larger ones?

ANSWER
Three 3x3 convolutions have the same receptive field as one 7x7 but use about 45% fewer parameters (3 × (3 × 3 × C × C) = 27C² vs 7 × 7 × C × C = 49C²) and introduce three non-linear activation functions instead of one, increasing representational power. Stacking smaller kernels also allows deeper architectures.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is a Convolutional Neural Network in simple terms?
02. Why are CNNs better than fully-connected networks for images?
03. What is the difference between convolution and cross-correlation?
04. How do I choose between max pooling and average pooling?
05. What is the effect of batch normalization on training and inference?