Intermediate 11 min · May 28, 2026

Optimizers Decoded: SGD, Momentum, RMSprop, Adam for Production ML

Q: When should I use SGD over Adam?

Use SGD with momentum when you have a large dataset and want better generalization, especially for vision tasks (e.g., ResNet). Adam is a safe default for NLP and transformer models, but it can overfit on smaller datasets.

Q: Why does Adam sometimes fail to converge?

Adam's adaptive learning rates can cause the effective step size to become too large in later stages, leading to divergence. This is often fixed by reducing the learning rate, using learning rate warmup, or switching to SGD with momentum after a few epochs.

Q: What is the role of the learning rate in these optimizers?

The learning rate controls step size. For SGD, it's critical and must be tuned carefully. Momentum, RMSprop, and Adam have additional hyperparameters (beta1, beta2, epsilon) that interact with the learning rate, but the base learning rate remains the most important knob.

Q: How do I choose the momentum coefficient?

A common default is 0.9 for momentum and 0.9/0.999 for Adam's betas. For Momentum, values between 0.8 and 0.99 work well; higher values smooth more but risk overshooting. Start with 0.9 and tune via grid search or Bayesian optimization.

Master SGD, Momentum, RMSprop, and Adam optimizers.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

Production tested · Last updated: July 15, 2026


Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts


⚡Quick Answer

SGD is the simplest optimizer: it updates weights using the gradient of a single sample or small mini-batch, which is cheap but noisy.
Momentum accelerates SGD by adding a fraction of the previous update to the current one, smoothing oscillations.
RMSprop adapts learning rates per parameter by dividing by a running average of squared gradients, handling sparse data well.
Adam combines Momentum and RMSprop: it uses both first and second moment estimates with bias correction.
In production, Adam is the default starter, but SGD with momentum often yields better generalization for vision tasks.
Learning rate tuning is critical: too high diverges, too low stalls; use schedulers or warmup.

✦ Definition~90s read

What Are Optimizers?

Optimizers are algorithms that update model parameters to minimize a loss function. SGD, Momentum, RMSprop, and Adam are first-order gradient-based optimizers that differ in how they use gradient information (e.g., momentum, adaptive learning rates) to improve convergence speed and stability.


Plain-English First

Think of optimizers as hikers descending a mountain. SGD takes a step based on the steepest direction from one random point—fast but wobbly. Momentum adds a rolling ball effect, smoothing the path. RMSprop adjusts step size per slope, so steep areas get smaller steps. Adam is the Swiss Army knife: it combines momentum and adaptive steps, making it reliable for most terrains.

Training a neural network without understanding your optimizer is like flying a plane with a broken altimeter. The optimizer is the core loop that turns loss into learning, and choosing wrong can waste days of GPU time or produce a model that fails in production. While AutoML and hyperparameter search tools have matured, they still rely on a solid foundation: knowing when to use SGD, Momentum, RMSprop, or Adam.

Each optimizer has a distinct mathematical personality. SGD is the purest form, but its high variance can stall convergence. Momentum smooths the ride. RMSprop adapts per-parameter learning rates, handling sparse gradients elegantly. Adam fuses both ideas and has become the default for many practitioners, yet it's not a silver bullet—it can overfit or fail to generalize on certain architectures.

This article dissects each optimizer from first principles, shows you the math behind the scenes, and—more importantly—gives you production-tested heuristics for choosing and debugging them. We'll walk through a real incident where a misconfigured Adam caused a model to diverge silently, costing a team 48 hours of debugging.

By the end, you'll not only know the formulas but also how to diagnose optimizer issues in your training pipeline, tune learning rates systematically, and avoid common pitfalls that trip up even senior engineers.

The Optimization Problem: Why Gradient Descent Needs Help

At its core, training a neural network is an optimization problem: find the set of weights w that minimizes a loss function L(w) over the training data. The canonical approach is gradient descent, which iteratively moves w in the direction of the negative gradient of L. For a dataset with n samples, the true gradient is ∇L(w) = (1/n) Σ ∇L_i(w). Computing this exactly at every step requires a full pass over the entire dataset, which is prohibitively expensive when n is in the millions or billions. This is the computational bottleneck that forces us to seek approximations.

Even if we could compute the full gradient cheaply, vanilla gradient descent suffers from fundamental geometric limitations. The loss landscape of a deep network is high-dimensional and non-convex, riddled with saddle points, plateaus, and ravines. In a ravine—where the curvature is much steeper in one direction than another—gradient descent oscillates across the steep walls, making painfully slow progress along the shallow floor. The learning rate η must be small enough to avoid divergence in the steep direction, which further slows convergence in all directions. This is not a theoretical edge case; it is the norm in practice.

Furthermore, the full gradient is deterministic: given the same starting point, you will follow the same path. This determinism is a liability because the optimizer can get stuck in a sharp local minimum or at a saddle point, where the gradient is zero in every direction. There the gradient provides no information about which way to go to escape, and the algorithm halts. These issues (computational cost, pathological curvature, and deterministic stagnation) are the reasons why plain full-batch gradient descent is rarely used in production for deep learning.

The solution is a family of algorithms that address these weaknesses through two key innovations: stochasticity and adaptive learning rates. Stochasticity, introduced by using mini-batches, provides noisy gradient estimates that can help escape sharp minima and saddle points. Adaptive methods adjust the learning rate per parameter, effectively normalizing the gradient signal to handle ravines and varying curvatures. The optimizers we will cover—SGD, Momentum, RMSprop, and Adam—are the standard tools that build on these ideas, each adding a layer of sophistication to overcome the fundamental limitations of naive gradient descent.

io/thecodeforge/optimizers/vanilla_gd_demo.pyPYTHON

import numpy as np

def vanilla_gd(grad_func, w_init, lr=0.01, n_iters=100):
    w = w_init.copy()
    path = [w.copy()]
    for i in range(n_iters):
        grad = grad_func(w)
        w -= lr * grad
        path.append(w.copy())
    return w, np.array(path)

# Example: quadratic bowl with different curvatures
A = np.array([[1.0, 0.0], [0.0, 100.0]])  # steep in dim 2
def grad_quad(w):
    return 2 * A @ w

w_init = np.array([5.0, 1.0])
w_opt, path = vanilla_gd(grad_quad, w_init, lr=0.01, n_iters=50)
print(f"Final weights: {w_opt}")
print(f"Path shape: {path.shape}")

Output

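Final weights: [1.8208 1.0] (rounded; the first weight shrinks by a factor of 0.98 per step, while the second flips sign every step and never decays)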

Path shape: (51, 2)

Mental Model

The Ravine Problem

Think of a ravine: steep sides (high curvature) but a shallow slope along the valley floor. Gradient descent oscillates across the sides, making little progress forward. This is the core geometric challenge that Momentum and adaptive methods solve.
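To make the ravine concrete, here is a minimal sketch, assuming a diagonal quadratic loss L(w) = a_1·w_1² + a_2·w_2² with curvatures 1 and 100. For such a loss, one gradient-descent step multiplies coordinate i by (1 - 2·η·a_i), so that coordinate converges only when the factor has magnitude below 1. The helper name contraction_factors is hypothetical and used only for illustration.

```python
def contraction_factors(curvatures, lr):
    """Per-coordinate multiplier applied by one GD step on L(w) = sum(a_i * w_i**2).

    The gradient of a_i * w_i**2 is 2 * a_i * w_i, so one step maps
    w_i -> (1 - 2 * lr * a_i) * w_i.
    """
    return [1.0 - 2.0 * lr * a for a in curvatures]


curvatures = [1.0, 100.0]  # shallow valley floor, steep wall
for lr in (0.001, 0.005, 0.01, 0.02):
    factors = contraction_factors(curvatures, lr)
    status = ["converges" if abs(f) < 1.0 else "does not converge" for f in factors]
    print(f"lr={lr}: factors={factors}, {status}")
```

The steep coordinate dictates the largest usable learning rate, and any rate small enough to be safe there leaves the shallow coordinate shrinking by only a percent or two per step.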

📊 Production Insight

Never use full-batch gradient descent for deep learning. The computational cost is prohibitive, and the deterministic path leads to poor generalization. Always use mini-batches (typically 32-512 samples) to inject stochasticity and enable vectorized hardware utilization.

🎯 Key Takeaway

Vanilla gradient descent is computationally expensive, struggles with pathological curvature (ravines), and can get stuck in sharp minima or saddle points. These limitations motivate the development of stochastic and adaptive optimizers.

thecodeforge.io

Optimizers Adam Rmsprop Momentum

Stochastic Gradient Descent (SGD): The Foundation and Its Limitations

Stochastic Gradient Descent (SGD) replaces the full gradient with an estimate computed from a randomly selected mini-batch of data. The update rule is w := w - η * (1/m) Σ ∇L_i(w), where m is the mini-batch size. This simple change yields dramatic computational savings: each iteration costs O(m) instead of O(n), and m is typically 32-512 while n can be millions. The stochasticity also provides a regularizing effect, helping the optimizer escape sharp local minima that full-batch GD would get trapped in. In practice, SGD with a well-tuned learning rate and learning rate schedule can achieve state-of-the-art generalization, often outperforming more complex adaptive methods on large-scale tasks like image classification.

However, SGD is not without its own set of problems. The gradient estimate is noisy, with variance proportional to the variance of the gradients within the mini-batch. This noise causes the loss to fluctuate rather than decrease monotonically, making convergence diagnostics harder. More critically, SGD inherits the ravine problem from GD: it still oscillates in directions of high curvature because the learning rate is global. A single learning rate η must be chosen for all parameters, which is a poor match for loss landscapes where different dimensions have vastly different scales. This forces practitioners to use small learning rates and decay schedules, slowing convergence.

Another major limitation is the sensitivity to the learning rate and its schedule. Too high a learning rate causes divergence; too low leads to painfully slow progress. The optimal learning rate often changes during training, requiring manual tuning of decay schedules (e.g., step decay, exponential decay, or cosine annealing). This hyperparameter sensitivity is a significant practical burden. Furthermore, SGD can plateau on saddle points where the gradient is near zero in all directions, as the noise alone may not be sufficient to escape.
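The three decay schedules named above can be sketched in a few lines. This is a minimal illustration in plain Python, not any framework's implementation; the function names and the example constants (drop every 30 steps, decay rate 0.03) are assumptions chosen for readability.

```python
import math


def step_decay(base_lr, step, drop_every, factor=0.1):
    """Multiply the learning rate by `factor` every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)


def exponential_decay(base_lr, step, decay_rate):
    """Shrink the learning rate smoothly: base_lr * exp(-decay_rate * step)."""
    return base_lr * math.exp(-decay_rate * step)


def cosine_annealing(base_lr, step, total_steps, min_lr=0.0):
    """Follow half a cosine wave from base_lr (step 0) down to min_lr (last step)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 75, 100):
    print(step,
          round(step_decay(0.1, step, drop_every=30), 6),
          round(exponential_decay(0.1, step, decay_rate=0.03), 6),
          round(cosine_annealing(0.1, step, total_steps=100), 6))
```

Step decay changes the rate abruptly, while the exponential and cosine variants change it smoothly; cosine annealing additionally reaches a chosen minimum exactly at the final step.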

Despite these limitations, SGD remains a foundational optimizer because it is simple, well-understood, and often generalizes better than adaptive methods. The key is that the noise in the gradient updates acts as an implicit regularizer, biasing the solution toward flatter minima which tend to generalize better. This property is not shared by all adaptive methods, which can converge to sharper minima. In production, SGD with momentum (covered next) is often preferred over plain SGD, but understanding the base case is essential for diagnosing optimization issues.

io/thecodeforge/optimizers/sgd_implementation.pyPYTHON

import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Vanilla SGD update."""
    for key in params:
        params[key] -= lr * grads[key]
    return params

# Simulate training a simple linear model
np.random.seed(42)
w = {'weight': np.array([0.5, -0.2]), 'bias': np.array([0.1])}
X = np.random.randn(100, 2)
y = X @ np.array([2.0, -3.0]) + 0.5 + 0.1 * np.random.randn(100)

for epoch in range(10):
    # Mini-batch of size 16
    idx = np.random.choice(100, 16, replace=False)
    X_batch, y_batch = X[idx], y[idx]
    y_pred = X_batch @ w['weight'] + w['bias']
    loss = np.mean((y_pred - y_batch)**2)
    grad_w = 2 * X_batch.T @ (y_pred - y_batch) / 16
    grad_b = 2 * np.mean(y_pred - y_batch)
    w = sgd_update(w, {'weight': grad_w, 'bias': grad_b}, lr=0.1)
    print(f"Epoch {epoch+1}, Loss: {loss:.4f}, Weight: {w['weight']}")

Output

(Exact values depend on the mini-batches drawn, so none are reproduced here. Each line has the form "Epoch N, Loss: ..., Weight: [...]". Expect the loss to start on the order of 10 and typically fall below 1, noisily, over the 10 steps as the weights move toward [2.0, -3.0].)

⚠ SGD Learning Rate Sensitivity

SGD's performance is highly sensitive to the learning rate. Near the stability threshold, a small change can mean the difference between convergence and divergence. Always use learning rate schedules (e.g., step decay, cosine annealing) in production.

📊 Production Insight

SGD with a well-tuned learning rate schedule often generalizes better than Adam on large-scale tasks like ImageNet training. However, it requires more hyperparameter tuning. Start with a learning rate of 0.01 and use a validation set to find the optimal range, then apply a cosine decay schedule.

🎯 Key Takeaway

SGD is computationally efficient and provides implicit regularization through gradient noise, but it suffers from global learning rate sensitivity, oscillation in ravines, and slow convergence on plateaus. It is the foundation upon which all modern optimizers are built.

Momentum: Escaping Local Minima and Smoothing the Ride

Momentum addresses SGD's oscillation problem by accumulating a velocity vector that dampens oscillations and accelerates progress in consistent directions. The update rule introduces a velocity term v that is a decaying average of past gradients: v := βv + (1-β)∇L(w), then w := w - ηv. The momentum coefficient β (typically 0.9) controls how much of the past gradient direction is retained. Think of it as a ball rolling down a hill: it gains speed in directions of consistent slope and resists direction changes, smoothing out the noisy path of SGD.

The effect on the ravine problem is dramatic. In a ravine, the gradient oscillates across the steep direction, but the velocity accumulates in the shallow direction because the gradient component along the valley floor is consistently signed. The oscillations cancel out in the velocity average, while the consistent signal builds up. This allows the optimizer to take larger effective steps in the relevant direction without diverging in the steep direction. In practice, Momentum can converge 2-10x faster than vanilla SGD on many problems.

Momentum also helps escape local minima and saddle points. The accumulated velocity can carry the optimizer through small bumps in the loss landscape, much like a ball rolling over a small hill. At a saddle point, where the gradient is zero, the velocity term provides a non-zero update that pushes the optimizer away, preventing stagnation. This is a significant practical advantage over vanilla SGD, which would halt at such points.

However, Momentum introduces its own hyperparameter (β) and can overshoot if the momentum is too high. In ravines, a high β can cause the optimizer to build up too much speed and oscillate out of the valley. The standard value of 0.9 works well in most cases, but tuning is sometimes necessary. Nesterov Accelerated Gradient (NAG) is a variant that computes the gradient at the lookahead position (w - ηβv), providing a correction that reduces overshooting. In practice, NAG often converges slightly faster and is preferred in some frameworks, though the difference is marginal for most deep learning tasks.
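The difference between classical momentum and the Nesterov look-ahead can be sketched on a one-dimensional quadratic. This is a minimal illustration under stated assumptions: the loss is L(w) = curvature · w², the velocity uses the averaged form from the text, and the function name run_momentum is hypothetical.

```python
def run_momentum(curvature, lr=0.01, beta=0.9, steps=100, nesterov=False):
    """Minimise L(w) = curvature * w**2 from w = 1.0 and return the final w.

    Uses the averaged-velocity form from the text:
        v = beta * v + (1 - beta) * grad,   w = w - lr * v
    With nesterov=True the gradient is taken at the look-ahead point
    w - lr * beta * v instead of at w.
    """
    w, v = 1.0, 0.0
    for _ in range(steps):
        point = w - lr * beta * v if nesterov else w
        grad = 2.0 * curvature * point
        v = beta * v + (1.0 - beta) * grad
        w -= lr * v
    return w


for name, flag in (("classical", False), ("nesterov", True)):
    print(name, abs(run_momentum(curvature=100.0, nesterov=flag)))
```

In this toy setting both variants oscillate around the minimum, but the look-ahead gradient damps the oscillation faster, so the Nesterov run should end much closer to zero after the same number of steps. Real networks will not show such a clean gap.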

io/thecodeforge/optimizers/momentum_implementation.pyPYTHON

import numpy as np

def momentum_update(params, grads, velocities, lr=0.01, beta=0.9):
    """SGD with Momentum update."""
    for key in params:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        params[key] -= lr * velocities[key]
    return params, velocities

# Simulate training on a ravine-like loss
np.random.seed(42)
w = {'weight': np.array([5.0, 1.0])}
velocities = {'weight': np.zeros(2)}
A = np.array([[1.0, 0.0], [0.0, 100.0]])  # steep in dim 2

def grad_quad(w):
    return 2 * A @ w['weight']

for step in range(20):
    grad = {'weight': grad_quad(w)}
    w, velocities = momentum_update(w, grad, velocities, lr=0.01, beta=0.9)
    print(f"Step {step+1}: w = {w['weight']}")

Output

Step 1: w = [4.990 0.800]

Step 2: w = [4.971 0.460]

Step 3: w = [4.944 0.062]

Step 4: w = [4.910 -0.309]

Step 5: w = [4.869 -0.580]

Step 6: w = [4.823 -0.709]

...

(Values rounded to three decimals. The steep second coordinate overshoots zero and swings back with shrinking amplitude, while the shallow first coordinate accelerates steadily.)

🔥Momentum as a Low-Pass Filter

Momentum acts as a low-pass filter on the gradient signal, smoothing out high-frequency oscillations while preserving low-frequency trends. This is why it excels in ravines: the oscillations are high-frequency, the consistent slope is low-frequency.
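The filtering effect can be checked numerically. The sketch below feeds a synthetic gradient sequence (a small constant slope of 0.1 plus a sign-flipping component of magnitude 1) through the velocity average; the sequence and the helper name ema_velocity are illustrative assumptions, not data from a real model.

```python
def ema_velocity(grads, beta=0.9):
    """Return the momentum velocity after feeding a gradient sequence."""
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g
    return v


# Gradient = small consistent slope (0.1) plus a large sign-flipping part (+/-1).
steady, wobble = 0.1, 1.0
grads = [steady + wobble * (-1) ** t for t in range(200)]

print("largest raw gradient magnitude:", max(abs(g) for g in grads))
print("velocity after 200 steps:", ema_velocity(grads))
```

For a component that flips sign every step, the average attenuates its amplitude by (1-β)/(1+β), about 0.05 for β=0.9, while the constant slope passes through unchanged. The velocity therefore stays within roughly 0.05 of the true slope even though individual gradients are ten times larger.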

📊 Production Insight

Always use Momentum (or NAG) over vanilla SGD. Set β=0.9 as default. For very deep networks or RNNs, consider Nesterov momentum for slightly better convergence. Momentum is especially critical for training convolutional networks where the loss landscape is highly non-isotropic.

🎯 Key Takeaway

Momentum accelerates convergence in consistent directions and dampens oscillations by accumulating a velocity of past gradients. It helps escape local minima and saddle points, and is a standard improvement over vanilla SGD in production.


RMSprop: Adaptive Learning Rates for Non-Stationary Gradients

RMSprop (Root Mean Square Propagation) addresses the fundamental limitation of a global learning rate by adapting the learning rate per parameter. It maintains a running average of the squared gradients: v_t := β v_{t-1} + (1-β) (∇L(w_t))², where the square is element-wise. The update then becomes w := w - (η / (√v_t + ε)) * ∇L(w_t). Parameters with large gradients (steep directions) get a smaller effective learning rate, while parameters with small gradients (shallow directions) get a larger one. This normalizes the gradient signal and largely tames the ravine problem, because the optimizer takes similarly sized steps in all directions.

The key insight is that the gradient magnitudes vary not only across parameters but also over time. In deep learning, the scale of gradients can change dramatically during training, especially when moving from one region of the loss landscape to another. RMSprop's adaptive scaling handles this non-stationarity gracefully. The decay factor β (typically 0.9 or 0.99) controls the window over which the squared gradients are averaged. A smaller β makes the adaptation more responsive to recent changes, while a larger β provides a more stable estimate.
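A quick way to see the normalization is to feed RMSprop a constant gradient and watch the step size settle. This is a minimal sketch for a single scalar parameter; the function name rmsprop_steady_step is hypothetical, and a constant gradient is a simplifying assumption that real training never satisfies exactly.

```python
import math


def rmsprop_steady_step(grad, lr=0.01, beta=0.9, eps=1e-8, steps=200):
    """Feed a constant gradient into RMSprop and return the size of the last update."""
    v = 0.0
    update = 0.0
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad ** 2
        update = lr * grad / (math.sqrt(v) + eps)
    return update


for grad in (10.0, 0.1, 0.001):
    print(f"grad={grad}: step={rmsprop_steady_step(grad):.6f}")
```

Once the running average has caught up, v is close to grad², so the update is close to lr regardless of whether the gradient is 10 or 0.001. That is the sense in which RMSprop equalizes step sizes across parameters.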

RMSprop was proposed by Geoffrey Hinton in his Coursera lectures (it was never formally published) and has become a standard optimizer for recurrent neural networks (RNNs) and sequence models. RNNs are notorious for having exploding or vanishing gradients over long sequences. RMSprop's per-parameter scaling helps mitigate exploding gradients by reducing the learning rate for parameters with large gradients, while the moving average prevents the scaling from becoming too extreme. In practice, RMSprop often converges faster than SGD with Momentum on problems with highly non-stationary objectives, such as training GANs or reinforcement learning agents.

However, RMSprop is not without drawbacks. The adaptive learning rate can sometimes become too small, effectively stopping learning for certain parameters. The ε term (typically 1e-8) provides numerical stability but can interact poorly with very small gradients. Additionally, RMSprop does not incorporate momentum, so it can still oscillate in directions where the gradient sign changes frequently, though the adaptive scaling reduces the amplitude. In practice, combining RMSprop with momentum (as in Adam) often yields better results, but RMSprop remains a solid choice for problems where gradient scales vary widely.

io/thecodeforge/optimizers/rmsprop_implementation.pyPYTHON

import numpy as np

def rmsprop_update(params, grads, cache, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop update."""
    for key in params:
        cache[key] = beta * cache[key] + (1 - beta) * grads[key]**2
        params[key] -= lr * grads[key] / (np.sqrt(cache[key]) + eps)
    return params, cache

# Simulate on a problem with varying gradient scales
np.random.seed(42)
w = {'weight': np.array([1.0, 1.0])}
cache = {'weight': np.zeros(2)}

# Loss with different gradient scales over time
for step in range(20):
    # Simulate non-stationary gradients: first dim has large grad, second small
    grad = {'weight': np.array([10.0 * np.sin(step/2), 0.1 * np.cos(step/3)])}
    w, cache = rmsprop_update(w, grad, cache, lr=0.1, beta=0.9)
    print(f"Step {step+1}: w = {w['weight']}, cache = {cache['weight']}")

Output

Step 1: w = [1.000 0.684], cache = [0.000 0.001]

Step 2: w = [0.684 0.461], cache = [2.298 0.002]

Step 3: w = [0.406 0.294], cache = [9.149 0.002]

...

(Values rounded to three decimals. Although the two gradients differ in scale by about 100x, both coordinates move by steps of similar size. The cache is a running average, so it stays bounded by the largest squared gradient.)

💡RMSprop for RNNs and GANs

RMSprop is particularly effective for recurrent neural networks and generative adversarial networks, where gradient scales can vary dramatically over time. The adaptive learning rate helps stabilize training in these non-stationary settings.

📊 Production Insight

Use RMSprop as a default for RNNs and sequence models. Set β=0.9 and ε=1e-8. For GAN training, RMSprop often provides more stable convergence than Adam. Monitor the cache values; if they become too large, consider gradient clipping to prevent the effective learning rate from vanishing.

🎯 Key Takeaway

RMSprop adapts the learning rate per parameter based on the root mean square of past gradients, solving the ravine problem by normalizing gradient scales. It excels in non-stationary settings like RNNs and GANs but lacks momentum, which can lead to oscillations in some cases.

Adam: The Best of Both Worlds (and Its Hidden Pitfalls)

Adam (Adaptive Moment Estimation) combines the first-moment (momentum) estimate of SGD with Momentum with the per-parameter adaptive learning rates of RMSprop, and adds bias correction for both moment estimates. The update rule is: m_t = β1 m_{t-1} + (1-β1) g_t, v_t = β2 v_{t-1} + (1-β2) g_t^2, then m_hat = m_t / (1-β1^t), v_hat = v_t / (1-β2^t), and θ_t = θ_{t-1} - η * m_hat / (sqrt(v_hat) + ε). Default hyperparameters (β1=0.9, β2=0.999, ε=1e-8) work well across many tasks, but they are not universal.
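The update rule translates almost line for line into code. The sketch below is a from-scratch illustration for a single scalar parameter, not how any framework implements Adam; the function name adam_step and the toy loss θ² are assumptions made for the example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise L(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.1)
    if t in (1, 10, 50, 200):
        print(f"step {t}: theta = {theta:.4f}")
```

One property worth noticing: on the very first step the bias-corrected moments reduce to m_hat = g and v_hat = g², so the update has magnitude of about lr no matter how large or small the gradient is.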

The hidden pitfalls of Adam are subtle but critical in production. First, Adam can fail to converge to the optimal solution even in some convex settings, because its effective learning rate is not guaranteed to be non-increasing: when v_t shrinks, the step size grows again, and rare but informative large gradients are forgotten too quickly (the AMSGrad variant was proposed to address this). Second, the ε term is often too small; in mixed-precision training (FP16), ε=1e-8 is below the smallest positive FP16 value (about 6e-8) and rounds to zero, so a tiny v_hat can lead to division by zero. Third, Adam's per-parameter learning rates can lead to poorer generalization than SGD with momentum, especially in vision tasks where finding flat minima matters. Fourth, in the first few steps the bias-corrected update has a magnitude of roughly η for every parameter regardless of gradient size, which can destabilize training if the learning rate is too high; this is one reason warmup is commonly used.

In practice, Adam is the go-to for transformers, NLP, and generative models where sparse gradients and noisy objectives are common. For computer vision, SGD with momentum often outperforms Adam on validation accuracy, though AdamW (Adam with decoupled weight decay) bridges this gap. The key insight: Adam is not a silver bullet—it trades generalization for training speed and stability. Always monitor validation metrics, not just training loss, and consider switching to SGD after a warmup phase if generalization is poor.

A common production mistake is using Adam with weight decay implemented as L2 regularization (adding λ||θ||² to the loss). This couples weight decay with the adaptive learning rates, leading to suboptimal regularization. AdamW fixes this by decoupling weight decay: θ_t = θ_{t-1} - η (m_hat / (sqrt(v_hat) + ε) + λ θ_{t-1}). This simple change often yields better generalization and is now standard in most frameworks.
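The coupling problem shows up even in a single update. The sketch below is a deliberately simplified illustration: it looks only at Adam's first step for one scalar parameter (where bias correction gives m_hat = g and v_hat = g²) and assumes the data loss is flat, so that only the decay term acts. The helper name first_adam_update is hypothetical.

```python
import math


def first_adam_update(grad, lr, eps=1e-8):
    """Size of Adam's very first update: after bias correction m_hat = grad and
    v_hat = grad**2, so the update is lr * grad / (|grad| + eps)."""
    return lr * grad / (math.sqrt(grad ** 2) + eps)


theta, lr = 1.0, 0.001
loss_grad = 0.0  # pretend the data loss is flat so only the decay term acts

for lam in (0.01, 0.1):
    # L2 regularisation: the decay term joins the gradient and gets rescaled by Adam.
    theta_l2 = theta - first_adam_update(loss_grad + lam * theta, lr)
    # AdamW: Adam sees only the loss gradient; decay is applied separately.
    theta_adamw = theta - first_adam_update(loss_grad, lr) - lr * lam * theta
    print(f"lambda={lam}: L2 shrink={theta - theta_l2:.6f}, "
          f"AdamW shrink={theta - theta_adamw:.6f}")
```

With L2 regularization the shrinkage is about lr for both values of λ, because Adam's normalization cancels the size of the decay term. With decoupled decay the shrinkage scales with λ as intended.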

io/thecodeforge/optimizers/adam_demo.pyPYTHON

import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
model = nn.Linear(10, 1)

# Adam with default params
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# Training loop
for epoch in range(10):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = nn.MSELoss()(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch}: loss = {loss.item():.4f}')

# AdamW with decoupled weight decay
optimizer_adamw = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

Output

(Values vary from run to run. Each line has the form "Epoch N: loss = ...". Because each step draws fresh random inputs and targets that are unrelated to each other, the loss stays near its initial value, typically a little above 1, rather than decreasing steadily.)

⚠ Adam's ε is not a free parameter

In FP16 training, the default ε=1e-8 is smaller than the smallest positive FP16 value (about 6e-8) and rounds to zero, so dividing by a near-zero sqrt(v_hat) can produce inf or NaN updates. Set ε to at least 1e-4 in that setting.
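The underflow is easy to reproduce without a GPU. This is a small sketch using the half-precision ("e") format of Python's standard struct module to round a value to FP16 and back; it assumes ε is actually stored or used in FP16, which depends on how a given mixed-precision setup handles optimizer state. The helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for eps in (1e-8, 1e-7, 1e-4):
    print(f"eps={eps}: stored in FP16 as {to_fp16(eps)!r}")
```

The default 1e-8 comes back as exactly zero, which removes the safety margin ε is supposed to provide, while 1e-4 survives as a small positive number.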

📊 Production Insight

For transformer-based models (BERT, GPT), Adam with β2=0.98 and weight decay 0.01 is a common starting point. Always use AdamW over Adam for weight decay—it's a drop-in replacement that improves generalization. Monitor the effective step size (η / sqrt(v_hat)) to detect if Adam is decaying too fast.

🎯 Key Takeaway

Adam combines momentum and adaptive learning rates, but watch for convergence issues and poor generalization. Use AdamW for decoupled weight decay. Default hyperparameters are not optimal for all tasks—tune β2 and ε for your specific problem.

Production Heuristics: Choosing the Right Optimizer for Your Task

Choosing an optimizer in production is not about picking the 'best' one—it's about matching optimizer properties to task characteristics. For computer vision (CNNs, ResNets, YOLO), SGD with momentum (Nesterov variant) is still the gold standard. Use a learning rate of 0.1 (scaled by batch size), momentum 0.9, and a cosine annealing schedule. This yields better generalization than Adam on ImageNet-scale tasks. For NLP transformers (BERT, GPT, T5), AdamW is dominant: learning rate 1e-4 to 5e-5, β1=0.9, β2=0.98, weight decay 0.01, with linear warmup and decay.
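The "linear warmup and decay" schedule mentioned for transformers can be sketched as a plain function of the step count. The warmup length and total step count below are illustrative assumptions, and linear_warmup_decay is a hypothetical helper rather than a framework API.

```python
def linear_warmup_decay(step, warmup_steps, total_steps, peak_lr):
    """Ramp linearly from 0 to peak_lr over warmup_steps, then linearly back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 500, 1000, 5500, 10000):
    print(step, linear_warmup_decay(step, warmup_steps=1000, total_steps=10000, peak_lr=1e-4))
```

The rate starts at zero, reaches the peak exactly at the end of warmup, and returns to zero at the final step.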

For reinforcement learning, the choice depends on the algorithm. Policy gradient methods (PPO, A2C) typically use Adam with a smaller learning rate (3e-4) and gradient clipping (max norm 0.5). DQN variants often use RMSprop with momentum or Adam for the online network; the target network is not trained by an optimizer at all but is periodically copied (or slowly averaged) from the online network. For generative adversarial networks (GANs), Adam with β1=0.5 (instead of 0.9) is common to reduce oscillation, since the lower momentum helps stabilize the two-player game.

For time series and recurrent models (LSTMs, GRUs), SGD with momentum or RMSprop can work as well as or better than Adam in some setups, so compare them on validation data rather than assuming Adam wins. Use a learning rate of 0.01 with gradient clipping (max norm 1.0) as a starting point. For graph neural networks (GNNs), Adam is standard, but use weight decay (1e-5 to 1e-4) to prevent overfitting on small graphs.

A production heuristic: start with AdamW for any new task, run a short hyperparameter sweep (learning rate, weight decay), then compare with SGD+momentum on a validation set. If AdamW's validation loss is within 5% of SGD's, use AdamW for faster convergence. If SGD is significantly better, switch. For large-scale distributed training, SGD with momentum is attractive because it keeps less optimizer state (one momentum buffer per parameter instead of Adam's two moment buffers) and scales to very large batches with layer-wise techniques such as LARS; LAMB plays the same role for Adam-style optimizers.

io/thecodeforge/optimizers/optimizer_selector.pyPYTHON

import torch
import torch.nn as nn
import torch.optim as optim

def get_optimizer(model, task_type, lr=None):
    if task_type == 'vision':
        lr = lr or 0.1
        return optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)
    elif task_type == 'nlp':
        lr = lr or 1e-4
        return optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.98), weight_decay=0.01)
    elif task_type == 'rl':
        lr = lr or 3e-4
        return optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    elif task_type == 'gan':
        lr = lr or 2e-4
        return optim.Adam(model.parameters(), lr=lr, betas=(0.5, 0.999))
    else:
        lr = lr or 1e-3  # honor a caller-supplied lr instead of silently ignoring it
        return optim.Adam(model.parameters(), lr=lr)

# Example usage
model = nn.Linear(10, 1)
optimizer = get_optimizer(model, 'nlp', lr=5e-5)
print(f'Optimizer: {type(optimizer).__name__}, lr={optimizer.param_groups[0]["lr"]}')

Output

Optimizer: AdamW, lr=5e-05

💡Start with AdamW, then compare with SGD

AdamW gives fast convergence and is robust to hyperparameters. If validation metrics plateau, switch to SGD with momentum for potentially better generalization.

📊 Production Insight

In production pipelines, always log the optimizer type and hyperparameters with each run. Use a learning rate finder (an LR range test) to estimate a good starting point. For distributed training, SGD with momentum is often easier to scale because it keeps only one state buffer per parameter, which reduces memory and checkpoint size compared with Adam's two.

🎯 Key Takeaway

Match optimizer to task: SGD+momentum for vision, AdamW for NLP, Adam with β1=0.5 for GANs. Start with AdamW for new tasks, then compare with SGD. Log optimizer configs for reproducibility.

Debugging Optimizer Failures: A Systematic Approach

When training diverges or fails to converge, the optimizer is often the first suspect, but the root cause is usually elsewhere. A systematic debugging approach starts with the loss curve. If loss is NaN or inf, check for exploding gradients (gradient norm > 1e4) and use gradient clipping (max norm 1.0) to contain them. If loss barely moves from the start, check for vanishing gradients (gradient norm < 1e-8). If loss oscillates wildly, the learning rate is too high: reduce it by 10x. If loss plateaus early, the learning rate is too low or the optimizer is stuck near a saddle point: try increasing the learning rate or switching to Adam.
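Clipping by global norm rescales all gradients together so that their combined L2 norm does not exceed a threshold. The sketch below shows the idea on plain Python lists; it is an illustration of the technique, with the helper name clip_by_global_norm and the small 1e-6 stabilizer chosen as assumptions. In PyTorch the equivalent built-in is torch.nn.utils.clip_grad_norm_.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm


grads = [[3.0, 4.0], [12.0]]          # combined norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Because every component is multiplied by the same factor, the gradient direction is preserved and only its length changes. Gradients already below the threshold pass through untouched.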

Next, verify that gradients are flowing correctly. Use torch.autograd.set_detect_anomaly(True) to catch NaN gradients. Check that every parameter received a gradient after backward(): for name, param in model.named_parameters(): if param.grad is None: print(f'{name} has no gradient'). A gradient that exists but is all zeros is also a warning sign. Common causes: dead ReLUs (try LeakyReLU), an incorrect loss function, or frozen layers. For transformers, check that the attention mask is correct; a common bug is masking out all tokens, which leads to zero gradients.

If gradients are fine but loss doesn't decrease, check the learning rate schedule. A learning rate that is too high can cause divergence; too low can cause slow convergence. Use a learning rate finder (an LR range test sweeping from 1e-7 to 10) to identify the optimal range. Also check that weight decay is not too high: weight decay > 0.1 can suppress learning. For Adam, check that ε is not too small (especially in FP16) and that β2 is not too close to 1, which makes v_hat adapt too slowly to changes in gradient scale.
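A learning rate sweep of this kind can be sketched as follows. This is a simplified stand-in, not a full range test: a real one raises the rate continuously during training and records the loss after each batch, whereas here the "loss after one step" comes from a toy quadratic L(w) = w² starting at w = 1. The helper names lr_sweep and pick_lr, and the rule of backing off by 10x from the best rate, are assumptions for illustration.

```python
def lr_sweep(min_lr=1e-7, max_lr=10.0, num_steps=100):
    """Learning rates spaced evenly on a log scale from min_lr to max_lr."""
    ratio = (max_lr / min_lr) ** (1.0 / (num_steps - 1))
    return [min_lr * ratio ** i for i in range(num_steps)]


def pick_lr(lrs, losses, back_off=10.0):
    """Rule of thumb (an assumption here): take the rate with the lowest
    recorded loss and divide it by back_off to stay clear of instability."""
    best = min(range(len(losses)), key=lambda i: losses[i])
    return lrs[best] / back_off


lrs = lr_sweep()
# Toy stand-in for "train one step and record the loss": L(w) = w**2 from w = 1.
losses = [(1.0 - 2.0 * lr) ** 2 for lr in lrs]
print(f"suggested lr: {pick_lr(lrs, losses):.4f}")
```

Spacing the candidates on a log scale matters because useful learning rates span several orders of magnitude; a linear grid would waste almost all of its points at the high end.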

Finally, check the data pipeline. If the optimizer is correct but loss is erratic, the data might be corrupted (e.g., wrong labels, unnormalized inputs). Use a small subset of data (e.g., 10 batches) to overfit—if the model can't reach near-zero loss on a tiny dataset, the optimizer or model architecture is wrong. If it overfits but fails on the full dataset, the issue is data quality or distribution shift. Always normalize inputs to zero mean and unit variance per feature.

io/thecodeforge/optimizers/debug_optimizer.py · PYTHON

import torch
import torch.nn as nn
import torch.optim as optim

def debug_training(model, dataloader, optimizer, num_batches=10):
    """Run a few training steps and report loss, gradient norm, and NaN gradients.

    This makes a single pass over the first `num_batches` batches. For a real
    overfit test, call it repeatedly on the same small subset.
    """
    criterion = nn.MSELoss()
    model.train()
    for batch_idx, (x, y) in enumerate(dataloader):
        if batch_idx >= num_batches:
            break
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()

        # Flag any parameter whose gradient contains NaN
        for name, param in model.named_parameters():
            if param.grad is not None and torch.isnan(param.grad).any():
                print(f'NaN gradient in {name}')

        # Global L2 norm over all parameter gradients
        total_norm = 0.0
        for p in model.parameters():
            if p.grad is not None:
                total_norm += p.grad.norm().item() ** 2
        total_norm = total_norm ** 0.5
        print(f'Batch {batch_idx}: loss={loss.item():.4f}, grad_norm={total_norm:.4f}')

        optimizer.step()

# Example usage
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(20)]
debug_training(model, data, optimizer)

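Output (format only; the numbers are illustrative and differ between runs, since the data and initialization are random)

Batch 0: loss=1.2345, grad_norm=2.3456
Batch 1: loss=1.1234, grad_norm=2.1234
Batch 2: loss=1.0456, grad_norm=1.9876
...
Batch 9: loss=0.4567, grad_norm=0.9876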

🔥Overfit test: first debugging step

If your model can't overfit 10 batches to near-zero loss, the optimizer or architecture is broken. If it overfits but fails on full data, the issue is data quality.
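A minimal sketch of the overfit test itself: train repeatedly on the same small, fixed subset and check that the loss approaches zero. The helper name `overfit_small_subset`, the epoch count, and the synthetic linear targets are assumptions chosen so that the toy model is actually able to fit the data.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def overfit_small_subset(model, batches, optimizer, epochs=200):
    """Train on the same few batches for many epochs; return the final mean loss."""
    criterion = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for x, y in batches:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        losses = [criterion(model(x), y).item() for x, y in batches]
    return sum(losses) / len(losses)


torch.manual_seed(0)
true_w = torch.randn(10, 1)  # synthetic targets the linear model can represent
batches = [(x, x @ true_w) for x in (torch.randn(32, 10) for _ in range(10))]

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.05)
final_loss = overfit_small_subset(model, batches, optimizer)
print(f"final loss on the memorized subset: {final_loss:.6f}")
```

Note that the targets must be learnable for this test to mean anything: with pure random-noise targets, a small model cannot drive the loss to zero no matter how healthy the optimizer is.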

📊 Production Insight

In production, log gradient norms and parameter norms every N steps. Set up alerts for NaN gradients or gradient norms > 1e4. Use gradient accumulation to simulate larger batch sizes without memory issues—this often stabilizes training with Adam.
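Gradient accumulation and gradient-norm monitoring can be combined in one step. The sketch below assumes equally sized micro-batches and a mean-reduced loss; `accumulate_gradients` is a hypothetical helper, and the 1e4 alert threshold is the one quoted above.

```python
import torch
import torch.nn as nn


def accumulate_gradients(model, micro_batches, criterion):
    """Accumulate gradients over micro-batches; return the global gradient norm."""
    model.zero_grad()
    for x, y in micro_batches:
        # Divide by the number of micro-batches so the summed gradient matches
        # the gradient of the mean loss over one large batch.
        loss = criterion(model(x), y) / len(micro_batches)
        loss.backward()  # gradients add up in param.grad across calls
    total_sq = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
    return total_sq.sqrt().item()


torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Four micro-batches of 8 samples stand in for one batch of 32.
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(4)]
grad_norm = accumulate_gradients(model, micro_batches, criterion)
if grad_norm > 1e4:
    print(f"ALERT: gradient norm {grad_norm:.1f} exceeds 1e4")
optimizer.step()
print(f"accumulated grad_norm={grad_norm:.4f}")
```

The optimizer steps once per group of micro-batches, so its moment estimates see the same averaged gradient they would see with the larger batch.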

🎯 Key Takeaway

Debug systematically: check loss curve, gradient flow, learning rate, and data pipeline. Use overfit test on small data. Log gradient norms and set alerts for anomalies.

Advanced Topics: Learning Rate Schedules, Warmup, and AdamW

Learning rate schedules are critical for production training. The most common are step decay (reduce LR by a factor γ every N epochs), exponential decay (LR = LR0 * exp(-k * epoch)), and cosine annealing (LR = LR_min + 0.5 * (LR_max - LR_min) * (1 + cos(π * epoch / T))). Cosine annealing with warm restarts (SGDR) is popular for computer vision: it cycles the LR from high to low, allowing the model to escape sharp minima. For transformers, a linear warmup followed by inverse square root decay is standard: LR = LR_max * min(step / warmup_steps, (warmup_steps / step)^0.5).
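The schedule formulas above translate directly into plain Python. This is a sketch for checking values by hand; the function names and the default values for `gamma`, `every`, and `k` are illustrative assumptions.

```python
import math


def step_decay(lr0, epoch, gamma=0.1, every=30):
    """Multiply the LR by gamma once every `every` epochs."""
    return lr0 * gamma ** (epoch // every)


def exponential_decay(lr0, epoch, k=0.05):
    """LR = LR0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)


def cosine_annealing(lr_max, epoch, total_epochs, lr_min=0.0):
    """LR = LR_min + 0.5 * (LR_max - LR_min) * (1 + cos(pi * epoch / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))


def warmup_inverse_sqrt(lr_max, step, warmup_steps):
    """Linear warmup to lr_max, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return lr_max * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: cosine LR = {cosine_annealing(1e-3, epoch, 100):.6f}")
```

Both branches of `warmup_inverse_sqrt` equal `lr_max` at `step == warmup_steps`, so the schedule is continuous at the end of warmup.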

Warmup is essential for Adam and AdamW, especially with large learning rates. Early in training, the second moment estimate v_t rests on a short gradient history, so it is noisy (and biased towards zero before bias correction), which can produce large, erratic effective step sizes. A linear warmup (LR increases from 0 to LR_max over warmup_steps) prevents early divergence. For SGD, warmup is less critical but can help with very large batch sizes (e.g., 8192) by gradually increasing the LR to avoid early instability.
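One way to see why early Adam steps can be large relative to the gradient: on the very first step the bias-corrected estimates are m_hat = g and v_hat = g^2, so the update is approximately lr * sign(g) whatever the gradient's magnitude. The sketch below checks this with `torch.optim.Adam`; the helper name `first_adam_step_size` is hypothetical.

```python
import torch


def first_adam_step_size(grad_value, lr=1e-3):
    """Return how far a single parameter moves on Adam's first step."""
    w = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([w], lr=lr)
    w.grad = torch.tensor([grad_value])  # inject a gradient of the chosen scale
    optimizer.step()
    return abs(w.item())


for g in (1e-3, 1.0, 1e3):
    print(f"gradient {g:g}: first step size {first_adam_step_size(g):.6f}")
```

A gradient of 0.001 and a gradient of 1000 both move the weight by about the full learning rate, which is why capping the LR during the first steps matters.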

AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the adaptive learning rates: the decay term is applied directly to the weights instead of being added to the gradient and rescaled by 1/sqrt(v_hat). The update is: θ_t = θ_{t-1} - η * (m_hat / (sqrt(v_hat) + ε) + λ * θ_{t-1}). This simple change improves generalization, and AdamW is now widely available and a common default for transformer training (PyTorch's optim.AdamW, Hugging Face's Transformers). For fine-tuning large language models, use AdamW with weight decay 0.01 and no weight decay for LayerNorm and bias terms (put those parameters in a separate parameter group with weight_decay=0.0).
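Excluding biases and LayerNorm parameters from weight decay is done with parameter groups. The sketch below uses a simple heuristic (any tensor with one dimension or fewer is treated as a bias or normalization parameter); the helper name `split_decay_groups` and that heuristic are assumptions, not a PyTorch API.

```python
import torch.nn as nn
import torch.optim as optim


def split_decay_groups(model, weight_decay=0.01):
    """Build two parameter groups: weights with decay, biases/norms without."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Heuristic: 1-D tensors are biases and normalization scales/shifts.
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 1))
groups = split_decay_groups(model)
optimizer = optim.AdamW(groups, lr=1e-3)

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: {len(group['params'])} tensors, weight_decay={group['weight_decay']}")
```

Per-group settings override the optimizer-level default, so the second group trains with no decay even though AdamW's own default for `weight_decay` is nonzero.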

Advanced techniques include learning rate rewarming (cyclical schedules), layer-wise adaptive learning rates (e.g., LARS, LAMB), and gradient centralization. LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends AdamW with layer-wise normalization of the update, enabling training with batch sizes up to 65536. For very large models (e.g., GPT-3), use AdamW with gradient checkpointing and mixed precision. Gradient checkpointing cuts activation memory; the optimizer state (two moment buffers per parameter) is a separate and often dominant cost that checkpointing does not reduce.

io/thecodeforge/optimizers/advanced_schedules.py · PYTHON

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Warmup for 1000 steps, then cosine decay
warmup_scheduler = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=10000)
scheduler = SequentialLR(optimizer, schedulers=[warmup_scheduler, cosine_scheduler], milestones=[1000])

# Training loop
for step in range(11000):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = nn.MSELoss()(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    if step % 1000 == 0:
        current_lr = scheduler.get_last_lr()[0]
        print(f'Step {step}: LR = {current_lr:.6f}, loss = {loss.item():.4f}')

Output (LR values follow the schedule; loss is shown as <varies> because the targets are random noise, so it hovers near 1.0 and differs between runs)

Step 0: LR = 0.000011, loss = <varies>
Step 1000: LR = 0.001000, loss = <varies>
Step 2000: LR = 0.000975, loss = <varies>
Step 3000: LR = 0.000904, loss = <varies>
Step 4000: LR = 0.000794, loss = <varies>
Step 5000: LR = 0.000654, loss = <varies>
Step 6000: LR = 0.000500, loss = <varies>
Step 7000: LR = 0.000345, loss = <varies>
Step 8000: LR = 0.000206, loss = <varies>
Step 9000: LR = 0.000095, loss = <varies>
Step 10000: LR = 0.000024, loss = <varies>

Mental Model

Warmup prevents early divergence in Adam

In the first steps, Adam's second moment estimate is built from very few gradients, so the per-parameter step sizes are noisy and can be large. Warmup allows the second moment estimate to stabilize before applying the full LR. Without warmup, large LRs can cause immediate divergence.

📊 Production Insight

For large-scale training (e.g., 1000+ GPUs), consider the LAMB optimizer with linear warmup and cosine decay. Set weight decay to 0.01 for all parameters except biases and LayerNorm (put those in a separate parameter group with weight_decay=0.0). Use gradient checkpointing to reduce activation memory, and budget separately for the optimizer state: Adam-style optimizers keep two moment buffers per parameter, which is a major memory consumer for large models.

🎯 Key Takeaway

Use warmup for Adam/AdamW to stabilize early training. Cosine annealing with warm restarts helps escape sharp minima. AdamW decouples weight decay for better generalization. For large batch training, use LAMB. Always exclude biases and LayerNorm from weight decay.

● Production incident · POST-MORTEM · severity: high

The Silent Divergence: How Adam's Defaults Cost 48 GPU Hours

Symptom

Training loss decreased normally for 10 epochs, then validation loss started increasing while training loss continued to drop (overfitting + divergence).

Assumption

The team assumed Adam's adaptive learning rates would automatically handle the learning rate, so they used the default 1e-3 from the paper.

Root cause

Adam divides each update by the running RMS of recent gradients, so as gradients shrink late in training the effective per-parameter step size does not shrink with them; with a high base learning rate, the update overshoots the optimum, causing divergence.

Fix

Reduced learning rate to 1e-4 and added linear warmup for the first 10% of steps. Also enabled gradient clipping at max_norm=1.0.

Key lesson

Never trust default learning rates for Adam on deep models; always start with a lower value (1e-4) and use warmup.
Monitor both training and validation loss; a diverging validation loss with decreasing training loss is a red flag.
Gradient clipping is cheap insurance against exploding gradients, especially with adaptive optimizers.
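The fix from this incident (LR of 1e-4, linear warmup over the first 10% of steps, clipping at max_norm=1.0) can be sketched as follows. The toy model, random data, step count, and the choice to hold the LR constant after warmup are assumptions; the post-mortem does not say what schedule followed the warmup.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

total_steps = 200
warmup_steps = int(0.1 * total_steps)  # first 10% of training

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# Linear warmup from lr/warmup_steps up to lr, then constant (assumed).
scheduler = LambdaLR(optimizer, lr_lambda=lambda s: min(1.0, (s + 1) / warmup_steps))

lrs = []
for step in range(total_steps):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip after backward() and before step() so the update uses clipped gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()

print(f"first LR: {lrs[0]:.2e}, LR after warmup: {lrs[warmup_steps - 1]:.2e}")
```

The order of calls matters: clipping before backward() has nothing to clip, and clipping after step() has no effect on the update that was just applied.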

Production debug guide

Systematic steps to diagnose and fix optimizer-related failures (4 entries)

Symptom · 01

Loss is NaN after a few steps

→

Fix

Check learning rate (too high), gradient norms (exploding), or data pipeline (NaN in inputs). Reduce LR and enable gradient clipping. If you use Adam in FP16, raise epsilon above the default 1e-8, which underflows to zero in half precision.

Symptom · 02

Loss plateaus early at a high value

→

Fix

Learning rate may be too low. Try increasing LR by 10x, or use a learning rate finder (LR range test). Also check whether momentum is too high, which causes overshoot.

Symptom · 03

Validation loss increases while training loss decreases

→

Fix

Overfitting or optimizer divergence. Reduce LR, increase regularization (weight decay), or switch from Adam to SGD with momentum for better generalization.

Symptom · 04

Training is extremely slow (no convergence after many epochs)

→

Fix

Check if gradients are vanishing (e.g., in deep networks). Use batch normalization, increase LR, or switch to Adam, whose per-parameter scaling compensates for small gradients (its default beta2=0.999 averages over a long gradient history).

★ Quick Debug Cheat Sheet for Optimizers

Immediate actions for common optimizer problems during training

Loss diverges to NaN

Immediate action

Stop training, reduce learning rate by 10x, enable gradient clipping.

Commands

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

Fix now

Reduce LR to 1e-4 and clip gradients at 1.0.
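The two commands above only help if clipping sits between backward() and step(). A guarded training step might look like the sketch below; `safe_step` is a hypothetical helper, and skipping a batch on a non-finite loss is one possible policy, not the only one.

```python
import torch
import torch.nn as nn


def safe_step(model, optimizer, criterion, x, y, max_norm=1.0):
    """Run one training step; skip the update if loss or gradients are non-finite.

    Returns (loss, pre-clip gradient norm), or None when the step was skipped.
    """
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    if not torch.isfinite(loss):
        return None  # bad batch or diverged weights: do not update
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    if not torch.isfinite(grad_norm):
        optimizer.zero_grad()
        return None
    optimizer.step()
    return loss.item(), float(grad_norm)


torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

result = safe_step(model, optimizer, criterion, torch.randn(32, 10), torch.randn(32, 1))
print("healthy batch:", result)

nan_inputs = torch.full((32, 10), float("nan"))
print("NaN batch:", safe_step(model, optimizer, criterion, nan_inputs, torch.randn(32, 1)))
```

Counting how often the step is skipped is worth logging too: an occasional skip points at the data, while a run of skips usually means the weights have already diverged.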

Optimizer Comparison: SGD, Momentum, RMSprop, Adam

Optimizer	Update Rule Core	Adaptive LR	Momentum	Best Use Case
SGD	w = w - lr * g	No	No	Simple convex problems, baseline
Momentum	v = mu * v - lr * g; w = w + v	No	Yes (mu=0.9)	Vision models, smooth convergence
RMSprop	v = beta * v + (1 - beta) * g^2; w = w - lr * g / sqrt(v + eps)	Yes	No	RNNs, non-stationary objectives
Adam	m = beta1 * m + (1 - beta1) * g; v = beta2 * v + (1 - beta2) * g^2; w = w - lr * m / (sqrt(v) + eps)	Yes	Yes (beta1=0.9)	Default for NLP, transformers
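The Adam row in the table shows only the core of the rule and leaves out bias correction. A minimal sketch of the full step, checked against `torch.optim.Adam` on a fixed sequence of gradients, might look like this; `adam_step` is a hypothetical name and the random gradients are stand-ins for real ones.

```python
import torch


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)           # correct the bias towards zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)
    return w, m, v


torch.manual_seed(0)
w0 = torch.randn(5)
grads = [torch.randn(5) for _ in range(5)]

# Manual implementation
w, m, v = w0.clone(), torch.zeros(5), torch.zeros(5)
for t, g in enumerate(grads, start=1):
    w, m, v = adam_step(w, g, m, v, t)

# Reference: PyTorch's Adam fed the same gradients
p = w0.clone().requires_grad_(True)
reference = torch.optim.Adam([p], lr=1e-3)
for g in grads:
    p.grad = g.clone()
    reference.step()

print("max difference vs torch.optim.Adam:", (w - p.detach()).abs().max().item())
```

Dropping the two bias-correction lines makes the first few updates much smaller than intended, because m and v both start at zero.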

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/optimizers/vanilla_gd_demo.py	def vanilla_gd(grad_func, w_init, lr=0.01, n_iters=100):	The Optimization Problem
io/thecodeforge/optimizers/sgd_implementation.py	def sgd_update(params, grads, lr=0.01):	Stochastic Gradient Descent (SGD)
io/thecodeforge/optimizers/momentum_implementation.py	def momentum_update(params, grads, velocities, lr=0.01, beta=0.9):	Momentum
io/thecodeforge/optimizers/rmsprop_implementation.py	def rmsprop_update(params, grads, cache, lr=0.01, beta=0.9, eps=1e-8):	RMSprop
io/thecodeforge/optimizers/adam_demo.py	model = nn.Linear(10, 1)	Adam
io/thecodeforge/optimizers/optimizer_selector.py	def get_optimizer(model, task_type, lr=None):	Production Heuristics
io/thecodeforge/optimizers/debug_optimizer.py	def debug_training(model, dataloader, optimizer, num_batches=10):	Debugging Optimizer Failures
io/thecodeforge/optimizers/advanced_schedules.py	from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR	Advanced Topics

Key takeaways

SGD is the simplest but requires careful learning rate tuning and can be slow to converge.

Momentum accelerates SGD by dampening oscillations, but a high momentum coefficient can overshoot minima.

RMSprop adapts learning rates per parameter, making it robust for non-stationary objectives and sparse gradients.

Adam combines momentum and RMSprop, but its adaptive learning rates can lead to poor generalization on some tasks.

Always monitor loss curves and gradient norms; a diverging loss often means learning rate is too high or optimizer is misconfigured.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the difference between SGD and Adam in terms of update rule and ...

Q02 · JUNIOR

How does momentum help SGD, and what is a typical value for the momentum...

Q03 · SENIOR

Describe a scenario where RMSprop would outperform Adam, and why.

Q01 of 03 · SENIOR

Explain the difference between SGD and Adam in terms of update rule and convergence behavior.

ANSWER

SGD updates parameters using the gradient of a mini-batch scaled by a fixed learning rate. It can oscillate in ravines and requires careful tuning. Adam computes adaptive learning rates for each parameter using running averages of gradients and squared gradients (first and second moments) with bias correction. This allows Adam to converge faster on noisy or sparse gradients, but it may generalize worse than SGD on some tasks due to the adaptive step sizes.
