Understanding Loss Functions and Gradient Descent Visually
- Loss functions measure how wrong your model's predictions are — lower loss means better predictions
- Gradient descent finds the best model parameters by following the slope downhill on the loss surface
- Learning rate controls step size — too large and you overshoot the minimum, too small and training takes forever
- MSE (Mean Squared Error) penalizes large errors heavily; MAE (Mean Absolute Error) treats all errors equally
- Production rule: monitor loss curves during training — a flat training loss with rising validation loss means overfitting
- Biggest mistake: assuming lower training loss always means a better model — it often means overfitting
Production Debug Guide
Common signals from loss curves and what they mean.

Loss is NaN after a few epochs:

```python
# Inspect per-layer gradient norms to find where gradients explode
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f'{name}: grad_norm={p.grad.norm().item():.4f}')

# Fix: clip the global gradient norm before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

Loss plateaus early and stops improving:

```python
# Check the learning rate and model capacity
print(f'Current LR: {optimizer.param_groups[0]["lr"]}')
print(f'Total trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad)}')
```

Training loss is near zero but validation accuracy is poor:

```python
# Measure the generalization gap
print(f'Train loss: {train_loss:.4f}, Val loss: {val_loss:.4f}, Gap: {val_loss - train_loss:.4f}')

# Fix: add dropout to the model
self.dropout = nn.Dropout(p=0.3)
```
Every machine learning model learns by minimizing a loss function. The loss function quantifies prediction error — it is the single number that tells the optimizer whether the model is getting better or worse. Gradient descent is the algorithm that navigates the loss landscape to find parameter values that produce the smallest error. Without understanding these two concepts, model training is a black box you cannot debug.
The loss function is a design choice, not a fixed constant. Different loss functions produce different models from the same data. MSE aggressively penalizes outliers because it squares the error. MAE is robust to them because it takes the absolute value. Cross-entropy is designed for classification because it measures divergence between predicted probabilities and true labels. Choosing the wrong loss function silently degrades model performance in ways that are difficult to diagnose after the fact.
Gradient descent has its own set of hyperparameters that determine whether training converges, diverges, or oscillates. The learning rate is the most critical — it controls how far you step downhill at each iteration. Getting it wrong means the model either never learns or blows up entirely. This article walks through all of this visually, with code you can run, loss curves you can interpret, and production mistakes you can avoid.
What Loss Functions Actually Measure
A loss function takes two inputs — the model's prediction and the true value — and returns a single number that represents how wrong the prediction is. Lower loss means better predictions. The model's entire training process is an attempt to find parameter values that minimize this number across all training examples.
The three most common loss functions for regression are MSE, MAE, and Huber loss. They differ in how they penalize errors of different sizes. MSE squares the error, so a prediction that is off by 10 gets a loss of 100 — large errors dominate the total loss. MAE takes the absolute value, so the same error of 10 gets a loss of 10 — all errors contribute proportionally. Huber loss acts like MSE for small errors and like MAE for large errors, giving you smooth gradients near zero without outlier sensitivity.
The choice of loss function is not academic. It directly shapes what the model learns. If your data contains outliers and you use MSE, the model will warp its predictions toward fitting those outliers because their squared errors dominate the gradient signal. Switch to Huber loss and the same model on the same data learns to ignore the outliers and fit the majority pattern instead.
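One way to see this concretely: for a constant prediction, the MSE-minimizing value is the mean of the targets, while the MAE-minimizing value is the median. A quick sketch with made-up numbers containing one outlier:

```python
import numpy as np

# Targets with one extreme outlier (illustrative data)
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# The constant prediction minimizing MSE is the mean -- dragged toward the outlier
mse_optimal = y.mean()

# The constant prediction minimizing MAE is the median -- ignores the outlier
mae_optimal = np.median(y)

print(f'MSE-optimal constant: {mse_optimal}')
print(f'MAE-optimal constant: {mae_optimal}')
```

The mean gets dragged to 22 by the single outlier; the median stays at 3, matching the majority pattern.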
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_loss_comparison():
    """Plot MSE, MAE, and Huber loss side by side to show how each
    penalizes prediction errors differently."""
    errors = np.linspace(-5, 5, 300)
    mse = errors ** 2
    mae = np.abs(errors)
    delta = 1.0
    huber = np.where(
        np.abs(errors) <= delta,
        0.5 * errors ** 2,
        delta * (np.abs(errors) - 0.5 * delta)
    )

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    axes[0].plot(errors, mse, linewidth=2, color='blue')
    axes[0].set_title('MSE: Quadratic Penalty')
    axes[0].set_xlabel('Prediction Error')
    axes[0].set_ylabel('Loss')
    axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
    axes[0].annotate('Large error = huge loss', xy=(3, 9), xytext=(1, 15),
                     arrowprops=dict(arrowstyle='->', color='red'),
                     fontsize=10, color='red')

    axes[1].plot(errors, mae, linewidth=2, color='green')
    axes[1].set_title('MAE: Linear Penalty')
    axes[1].set_xlabel('Prediction Error')
    axes[1].set_ylabel('Loss')
    axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
    axes[1].annotate('Constant slope regardless of error size', xy=(3, 3),
                     xytext=(0.5, 4.5),
                     arrowprops=dict(arrowstyle='->', color='red'),
                     fontsize=10, color='red')

    axes[2].plot(errors, huber, linewidth=2, color='purple')
    axes[2].set_title('Huber: Best of Both')
    axes[2].set_xlabel('Prediction Error')
    axes[2].set_ylabel('Loss')
    axes[2].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    axes[2].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
    axes[2].axvline(x=delta, color='orange', linestyle=':', alpha=0.7,
                    label=f'delta={delta}')
    axes[2].axvline(x=-delta, color='orange', linestyle=':', alpha=0.7)
    axes[2].legend()
    axes[2].annotate('Quadratic near zero, linear far away', xy=(2.5, 2),
                     xytext=(0.5, 4),
                     arrowprops=dict(arrowstyle='->', color='red'),
                     fontsize=10, color='red')

    fig.suptitle('Loss Function Shapes — How They Penalize Errors',
                 fontsize=14, fontweight='bold')
    fig.tight_layout()
    fig.savefig('loss_comparison.png', dpi=300, bbox_inches='tight')
    plt.close(fig)
    print('Saved loss_comparison.png')

plot_loss_comparison()
```
Gradient Descent: Walking Downhill
Gradient descent is the algorithm that minimizes the loss function. At each step it computes the gradient — the slope of the loss with respect to each model parameter — then updates parameters in the direction that reduces loss. The gradient tells you which way is uphill; you step the opposite way.
The update rule is simple: new_weight = old_weight - learning_rate × gradient. The gradient points toward steeper loss, so subtracting it moves the weight toward lower loss. The learning rate scales how big each step is. Repeat this across all parameters for many iterations and the model converges toward the minimum loss.
The code below shows this on the simplest possible loss function — a single-variable quadratic with a known minimum at w=3. You can watch the weight converge step by step and see exactly how the gradient drives the update. The same principle applies when there are millions of parameters; the math is identical, just scaled up.
```python
class GradientDescentVisualizer:
    """Step-by-step gradient descent on a simple loss function."""

    def __init__(self, learning_rate=0.1):
        self.lr = learning_rate
        self.history = []

    def loss_function(self, w):
        """Simple quadratic loss: L(w) = (w - 3)^2. Minimum is at w = 3."""
        return (w - 3) ** 2

    def gradient(self, w):
        """Derivative of loss: dL/dw = 2(w - 3)"""
        return 2 * (w - 3)

    def step(self, w):
        """One gradient descent step: w_new = w - lr * gradient"""
        grad = self.gradient(w)
        w_new = w - self.lr * grad
        self.history.append({
            'w': w,
            'loss': self.loss_function(w),
            'grad': grad
        })
        return w_new

    def optimize(self, w_init=0.0, n_steps=20):
        """Run gradient descent and print each step."""
        w = w_init
        print(f"{'Step':<6} {'w':<10} {'Loss':<10} {'Gradient':<12} {'Update'}")
        print('-' * 55)
        for i in range(n_steps):
            loss = self.loss_function(w)
            grad = self.gradient(w)
            update = -self.lr * grad
            print(f"{i:<6} {w:<10.4f} {loss:<10.4f} {grad:<12.4f} {update:+.4f}")
            if abs(grad) < 0.0001:
                print(f"\nConverged at step {i} with w={w:.6f}")
                break
            w = self.step(w)
        return w

# Compare learning rates
print('=== Learning Rate = 0.1 (good) ===')
opt = GradientDescentVisualizer(learning_rate=0.1)
opt.optimize(w_init=0.0, n_steps=15)

print('\n=== Learning Rate = 0.9 (too large, oscillates) ===')
opt2 = GradientDescentVisualizer(learning_rate=0.9)
opt2.optimize(w_init=0.0, n_steps=10)

print('\n=== Learning Rate = 0.01 (too small, slow) ===')
opt3 = GradientDescentVisualizer(learning_rate=0.01)
opt3.optimize(w_init=0.0, n_steps=15)
```
- Gradient = the slope of the loss function at your current position.
- Positive gradient means loss increases to the right — step left.
- Negative gradient means loss increases to the left — step right.
- Step size = learning rate multiplied by gradient magnitude.
- At the minimum, gradient is zero — no slope, no update, training stops.
Learning Rate: The Critical Hyperparameter
The learning rate controls how far you step at each iteration. It is the most important hyperparameter in gradient descent — more important than model architecture, batch size, or number of epochs for determining whether training succeeds at all.
Too small and training takes thousands of epochs to converge, wasting compute and time. Too large and the loss diverges to infinity within a handful of steps, producing a model that outputs garbage. The right learning rate converges quickly and reliably to a good minimum.
The visualization below simulates gradient descent on a simple quadratic with four different learning rates. You can see exactly how each one behaves — the too-small rate barely moves, the good rate converges smoothly, the large rate oscillates, and the too-large rate explodes. This same behavior happens on real models with real data; the only difference is that you cannot see the loss surface directly and must rely on loss curves to diagnose the problem.
```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_training(learning_rate, n_steps=50, w_init=0.0):
    """Simulate gradient descent on L(w) = (w-3)^2."""
    w = w_init
    losses = []
    weights = []
    for _ in range(n_steps):
        loss = (w - 3) ** 2
        grad = 2 * (w - 3)
        w = w - learning_rate * grad
        if abs(w) > 1e6:  # Diverged
            losses.append(float('inf'))
            weights.append(w)
            break
        losses.append(loss)
        weights.append(w)
    return weights, losses

def plot_learning_rates():
    """Visualize how different learning rates affect convergence."""
    learning_rates = {
        'Too Small (0.001)': 0.001,
        'Good (0.1)': 0.1,
        'Large (0.45)': 0.45,
        'Too Large (0.95)': 0.95
    }
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    colors = ['#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']

    for idx, (label, lr) in enumerate(learning_rates.items()):
        ax = axes[idx // 2, idx % 2]
        weights, losses = simulate_training(lr, n_steps=30)
        valid_losses = [l for l in losses if l != float('inf')]
        steps = range(len(valid_losses))
        ax.plot(steps, valid_losses, linewidth=2, color=colors[idx])
        ax.set_title(f'LR = {lr} — {label.split("(")[0].strip()}')
        ax.set_xlabel('Step')
        ax.set_ylabel('Loss')
        ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        ax.grid(True, alpha=0.3)
        if valid_losses:
            final_loss = valid_losses[-1]
            ax.annotate(f'Final loss: {final_loss:.4f}',
                        xy=(len(valid_losses) - 1, final_loss),
                        fontsize=9, color=colors[idx])

    fig.suptitle('Learning Rate Impact on Training Convergence',
                 fontsize=14, fontweight='bold')
    fig.tight_layout()
    fig.savefig('learning_rates.png', dpi=300, bbox_inches='tight')
    plt.close(fig)
    print('Saved learning_rates.png')

plot_learning_rates()
```
- Start with a very small learning rate (1e-7) and increase exponentially each batch.
- Plot loss vs. learning rate on a log scale (this is the LR range test or LR finder).
- The optimal learning rate is at the steepest downward slope — just before loss starts to increase.
- In practice, use Adam optimizer — it adapts the learning rate per parameter automatically.
- If training loss oscillates, reduce by 3-10x. If it plateaus from the start, increase by 3x.
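A simplified version of the LR range test idea can be sketched on the same quadratic loss used earlier. Note the hedge: a real LR finder increases the rate batch-by-batch on an actual model; this illustrative sketch just sweeps candidate rates independently and compares final losses:

```python
import numpy as np

def lr_range_test(lrs, n_steps=10, w_init=0.0):
    """For each candidate learning rate, run a few gradient descent
    steps on L(w) = (w - 3)^2 and record the final loss."""
    results = {}
    for lr in lrs:
        w = w_init
        for _ in range(n_steps):
            w = w - lr * 2 * (w - 3)   # gradient step on the quadratic
            if abs(w) > 1e6:           # diverged -- stop early
                break
        results[lr] = (w - 3) ** 2
    return results

candidates = np.logspace(-4, 0.3, 12)   # sweep from ~1e-4 up past 1.0
results = lr_range_test(candidates)
best_lr = min(results, key=results.get)
for lr, loss in results.items():
    print(f'lr={lr:.5f} final_loss={loss:.6g}')
print(f'Best candidate: {best_lr:.4f}')
```

On this toy surface the sweep lands near the analytically optimal rate, while the tiny rates barely move and rates past 1.0 blow up.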
Batch vs. Stochastic vs. Mini-Batch Gradient Descent
Gradient descent has three variants that differ in how many data points are used to compute each gradient update. The trade-off is between gradient accuracy and computational cost per step.
Batch gradient descent computes the gradient using the entire training set. The gradient is exact — no noise — so updates are smooth and convergence is monotonic. But on a dataset with millions of rows, computing one update is extremely slow. You wait a long time between parameter updates.
Stochastic gradient descent (SGD) uses a single randomly selected data point per update. Each update is fast but the gradient estimate is noisy — one sample is a poor approximation of the true gradient. The noise causes the parameter trajectory to zigzag erratically, though over many steps it trends toward the minimum.
Mini-batch gradient descent splits the difference. You sample a batch (typically 32-256 examples), compute the gradient on that batch, and update parameters. The gradient estimate is good enough to be useful, the computation is fast enough to be practical, and the batch fits neatly into GPU memory for parallel computation. This is what everyone uses in practice.
```python
import numpy as np

class GradientDescentVariants:
    """Demonstrate the three gradient descent variants."""

    @staticmethod
    def batch_gradient_descent(X, y, lr=0.01, n_epochs=100):
        """Uses ALL data points for each update.
        Smooth convergence but slow per step."""
        w = np.random.randn(X.shape[1]) * 0.01
        losses = []
        for epoch in range(n_epochs):
            predictions = X @ w
            error = predictions - y
            loss = np.mean(error ** 2)
            gradient = (2 / len(y)) * (X.T @ error)
            w = w - lr * gradient
            losses.append(loss)
        return w, losses

    @staticmethod
    def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=100):
        """Uses ONE data point per update.
        Noisy updates but fast per step."""
        w = np.random.randn(X.shape[1]) * 0.01
        losses = []
        for epoch in range(n_epochs):
            indices = np.random.permutation(len(y))
            epoch_loss = 0
            for i in indices:
                prediction = X[i] @ w
                error = prediction - y[i]
                epoch_loss += error ** 2
                gradient = 2 * X[i] * error
                w = w - lr * gradient
            losses.append(epoch_loss / len(y))
        return w, losses

    @staticmethod
    def mini_batch_gradient_descent(X, y, lr=0.01, batch_size=32, n_epochs=100):
        """Uses a mini-batch per update.
        Best balance of speed and stability."""
        w = np.random.randn(X.shape[1]) * 0.01
        losses = []
        for epoch in range(n_epochs):
            indices = np.random.permutation(len(y))
            epoch_loss = 0
            n_batches = 0
            for start in range(0, len(y), batch_size):
                batch_idx = indices[start:start + batch_size]
                X_batch = X[batch_idx]
                y_batch = y[batch_idx]
                predictions = X_batch @ w
                error = predictions - y_batch
                epoch_loss += np.mean(error ** 2)
                gradient = (2 / len(y_batch)) * (X_batch.T @ error)
                w = w - lr * gradient
                n_batches += 1
            losses.append(epoch_loss / n_batches)
        return w, losses

# --- Quick demo ---
np.random.seed(42)
X = np.random.randn(500, 3)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + np.random.randn(500) * 0.1

print('Batch GD final loss:',
      GradientDescentVariants.batch_gradient_descent(X, y, lr=0.01, n_epochs=50)[1][-1])
print('SGD final loss:',
      GradientDescentVariants.stochastic_gradient_descent(X, y, lr=0.001, n_epochs=50)[1][-1])
print('Mini-batch GD final loss:',
      GradientDescentVariants.mini_batch_gradient_descent(X, y, lr=0.01, batch_size=32, n_epochs=50)[1][-1])
```
The Loss Landscape: Local Minima, Saddle Points, and Plateaus
Real loss landscapes are not simple bowls. They have local minima (valleys that are not the deepest), saddle points (points where the gradient vanishes but the surface curves up in some directions and down in others, so they are not true minima), and plateaus (large flat regions where gradients are near zero and training stalls). Gradient descent can get stuck in any of these.
Local minima are points where the loss is lower than all nearby points but not the lowest possible value globally. In low-dimensional problems, local minima can trap gradient descent permanently. In high-dimensional neural network loss surfaces, research has shown that most local minima have loss values close to the global minimum — so getting stuck is less catastrophic than it sounds.
Saddle points are a bigger practical problem. At a saddle point the gradient is zero — the algorithm thinks it has converged — but the point is a minimum in some directions and a maximum in others. In a space with millions of dimensions, saddle points vastly outnumber true minima. Momentum-based optimizers (SGD with momentum, Adam) help by building up velocity that carries the optimizer through saddle points rather than stalling on them.
Plateaus are extended flat regions where the gradient is very small but nonzero. The model is not stuck, but progress is painfully slow. Adaptive learning rate methods like Adam increase the effective step size when gradients are small, helping the optimizer cross plateaus faster.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)

def plot_loss_landscape():
    """Visualize a complex loss landscape with local minima and saddle points."""
    x = np.linspace(-5, 5, 200)
    y = np.linspace(-5, 5, 200)
    X, Y = np.meshgrid(x, y)
    # Himmelblau's function — four local minima, one saddle point
    Z = (X**2 + Y - 11)**2 + (X + Y**2 - 7)**2

    fig = plt.figure(figsize=(16, 5))

    # 3D surface
    ax1 = fig.add_subplot(131, projection='3d')
    ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8,
                     antialiased=True, rcount=100, ccount=100)
    ax1.set_xlabel('Parameter w1')
    ax1.set_ylabel('Parameter w2')
    ax1.set_zlabel('Loss')
    ax1.set_title('3D Loss Landscape')
    ax1.view_init(elev=35, azim=45)

    # Contour plot with minima marked
    ax2 = fig.add_subplot(132)
    contour = ax2.contour(X, Y, Z, levels=50, cmap='viridis')
    ax2.clabel(contour, inline=True, fontsize=7)
    ax2.set_xlabel('Parameter w1')
    ax2.set_ylabel('Parameter w2')
    ax2.set_title('Contour View (Top-Down)')
    # Mark the four minima of Himmelblau's function
    minima = [
        (3.0, 2.0),
        (-2.805118, 3.131312),
        (-3.779310, -3.283186),
        (3.584428, -1.848126)
    ]
    for mx, my in minima:
        ax2.plot(mx, my, 'r*', markersize=15)
        ax2.annotate(f'({mx:.1f}, {my:.1f})', xy=(mx, my),
                     xytext=(mx + 0.5, my + 0.5),
                     arrowprops=dict(arrowstyle='->', color='red'),
                     fontsize=8, color='red')

    # Gradient descent paths from different starting points
    ax3 = fig.add_subplot(133)
    ax3.contour(X, Y, Z, levels=50, cmap='viridis', alpha=0.5)
    ax3.set_xlabel('Parameter w1')
    ax3.set_ylabel('Parameter w2')
    ax3.set_title('GD Paths — Different Start → Different Minimum')
    starts = [(-4, -4), (4, -4), (-1, 4), (1, 1)]
    colors = ['red', 'blue', 'orange', 'magenta']
    for start, color in zip(starts, colors):
        path_x, path_y = [start[0]], [start[1]]
        wx, wy = float(start[0]), float(start[1])
        lr = 0.001
        for _ in range(500):
            # Analytical gradients of Himmelblau's function
            grad_x = (4 * wx * (wx**2 + wy - 11) + 2 * (wx + wy**2 - 7))
            grad_y = (2 * (wx**2 + wy - 11) + 4 * wy * (wx + wy**2 - 7))
            wx -= lr * grad_x
            wy -= lr * grad_y
            path_x.append(wx)
            path_y.append(wy)
        ax3.plot(path_x, path_y, '-', color=color, linewidth=1.5,
                 alpha=0.8, label=f'Start {start}')
        ax3.plot(path_x[0], path_y[0], 'o', color=color, markersize=8)
        ax3.plot(path_x[-1], path_y[-1], 's', color=color, markersize=10)
    ax3.legend(fontsize=7, loc='upper left')

    fig.suptitle('Loss Landscape — Multiple Minima, Saddle Points, Plateaus',
                 fontsize=14, fontweight='bold')
    fig.tight_layout()
    fig.savefig('loss_landscape.png', dpi=300, bbox_inches='tight')
    plt.close(fig)
    print('Saved loss_landscape.png')

plot_loss_landscape()
```
- Each starting point follows the local gradient — it cannot see the global landscape.
- In the visualization, four different starts converge to four different minima of the same function.
- In practice, this is why random seed affects final model accuracy.
- Modern initializers (Xavier, He, Kaiming) set starting weights in regions where gradients flow well.
- The real enemies in high-dimensional spaces are saddle points (zero gradient, not a minimum) and plateaus (near-zero gradient, extremely slow progress).
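As a rough sketch of what those initializers do: He initialization draws weights with standard deviation sqrt(2 / fan_in), which keeps activation variance roughly constant through ReLU layers. The layer sizes below are illustrative:

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He (Kaiming) normal initialization: std = sqrt(2 / fan_in),
    sized for layers followed by ReLU activations."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

W = he_init(fan_in=512, fan_out=256)
print(f'shape={W.shape}, std={W.std():.4f}, target={np.sqrt(2 / 512):.4f}')
```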
Monitoring Loss Curves During Training
The training and validation loss curves are your primary diagnostic tool during model training. Their shape reveals whether the model is learning, overfitting, underfitting, or failing to converge. Reading these curves correctly saves hours of debugging and prevents shipping broken models.
You need both curves plotted together. Training loss alone is actively misleading — a model that memorizes the training set has near-zero training loss but terrible generalization. The validation loss tells you how the model performs on data it has never seen. The gap between the two curves is the overfitting signal: a small gap means the model generalizes well, a large gap means it is memorizing.
Every training run you ship to production should have its loss curves saved as artifacts. When a model degrades in production six months later, those curves are the first thing you pull up to understand what happened during training.
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_loss_curves():
    """Visualize common loss curve patterns and their meanings."""
    np.random.seed(42)
    epochs = np.arange(1, 51)
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Good training: both curves decrease and converge
    train_good = 2.0 * np.exp(-0.08 * epochs) + 0.1 + np.random.normal(0, 0.02, 50)
    val_good = 2.2 * np.exp(-0.07 * epochs) + 0.12 + np.random.normal(0, 0.03, 50)
    axes[0, 0].plot(epochs, train_good, label='Training Loss', linewidth=2)
    axes[0, 0].plot(epochs, val_good, label='Validation Loss', linewidth=2)
    axes[0, 0].set_title('✅ Good: Both Decrease and Converge')
    axes[0, 0].legend()
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].annotate('Small gap = good generalization', xy=(40, 0.15),
                        fontsize=9, color='green')

    # 2. Overfitting: train decreases, val increases
    train_over = 2.0 * np.exp(-0.1 * epochs) + 0.05 + np.random.normal(0, 0.01, 50)
    val_over = np.concatenate([
        2.2 * np.exp(-0.06 * epochs[:20]) + 0.15,
        0.3 + 0.02 * (epochs[20:] - 20) + np.random.normal(0, 0.03, 30)
    ])
    axes[0, 1].plot(epochs, train_over, label='Training Loss', linewidth=2)
    axes[0, 1].plot(epochs, val_over, label='Validation Loss', linewidth=2)
    axes[0, 1].axvline(x=20, color='red', linestyle='--', alpha=0.7,
                       label='Overfitting starts')
    axes[0, 1].set_title('⚠️ Overfitting: Val Loss Diverges')
    axes[0, 1].legend()
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('Loss')
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].annotate('Stop here (early stopping)', xy=(20, val_over[19]),
                        xytext=(25, 0.6),
                        arrowprops=dict(arrowstyle='->', color='red'),
                        fontsize=9, color='red')

    # 3. Underfitting: both stay high
    train_under = 1.5 - 0.005 * epochs + np.random.normal(0, 0.03, 50)
    val_under = 1.6 - 0.004 * epochs + np.random.normal(0, 0.04, 50)
    axes[1, 0].plot(epochs, train_under, label='Training Loss', linewidth=2)
    axes[1, 0].plot(epochs, val_under, label='Validation Loss', linewidth=2)
    axes[1, 0].set_title('❌ Underfitting: Both Stay High')
    axes[1, 0].legend()
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Loss')
    axes[1, 0].grid(True, alpha=0.3)
    axes[1, 0].annotate('Model lacks capacity — needs more parameters',
                        xy=(30, 1.4), fontsize=9, color='orange')

    # 4. Learning rate too high: oscillation
    train_osc = 1.0 + 0.5 * np.sin(0.5 * epochs) * np.exp(-0.02 * epochs) \
        + np.random.normal(0, 0.05, 50)
    val_osc = 1.1 + 0.6 * np.sin(0.5 * epochs + 0.3) * np.exp(-0.02 * epochs) \
        + np.random.normal(0, 0.07, 50)
    axes[1, 1].plot(epochs, train_osc, label='Training Loss', linewidth=2)
    axes[1, 1].plot(epochs, val_osc, label='Validation Loss', linewidth=2)
    axes[1, 1].set_title('🔄 LR Too High: Oscillation')
    axes[1, 1].legend()
    axes[1, 1].set_xlabel('Epoch')
    axes[1, 1].set_ylabel('Loss')
    axes[1, 1].grid(True, alpha=0.3)
    axes[1, 1].annotate('Reduce LR by 10x or switch to Adam', xy=(30, 1.3),
                        fontsize=9, color='purple')

    fig.suptitle('Loss Curve Patterns — Reading Your Model\'s Training Story',
                 fontsize=14, fontweight='bold')
    fig.tight_layout()
    fig.savefig('loss_curves.png', dpi=300, bbox_inches='tight')
    plt.close(fig)
    print('Saved loss_curves.png')

plot_loss_curves()
```
- Good: both curves decrease and converge with a small gap. Model is learning well. No action needed.
- Overfitting: training loss decreases but validation loss increases after a certain epoch. Model memorized training data. Add regularization, dropout, or get more data. Use early stopping.
- Underfitting: both curves stay high and barely decrease. Model lacks capacity. Increase model complexity, train longer, or check that input features carry enough signal.
- Oscillation: both curves jump up and down without settling. Learning rate is too high. Reduce learning rate by 10x or switch to Adam optimizer.
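The overfitting pattern is what early stopping automates: halt when validation loss has stopped improving. A minimal sketch (the patience and threshold values are illustrative, and the curves below are synthetic):

```python
def should_stop_early(val_losses, patience=5, min_delta=1e-4):
    """Return True if validation loss has not improved by at least
    min_delta in the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Overfitting-shaped curve: improves for 10 epochs, then rises
curve = [1.0 - 0.05 * i for i in range(10)] + [0.5 + 0.02 * i for i in range(10)]
print(should_stop_early(curve))       # no recent improvement: stop
print(should_stop_early(curve[:8]))   # still improving: keep training
```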
Advanced Optimizers: Beyond Vanilla Gradient Descent
Vanilla gradient descent has known limitations: it stalls at saddle points because the gradient is zero, it oscillates in narrow valleys because the gradient direction alternates, and it uses the same learning rate for all parameters even when some need large updates and others need small ones. Modern optimizers solve each of these problems.
SGD with momentum adds a velocity term that accumulates gradient direction over time. If the gradient consistently points in one direction, momentum builds up and the optimizer moves faster. If the gradient oscillates, the velocity averages out the zigzag. This is directly analogous to a ball rolling downhill — it builds speed on consistent slopes and dampens jitter.
RMSProp adapts the learning rate per parameter by dividing the update by a running average of recent gradient magnitudes. Parameters with consistently large gradients get smaller effective learning rates; parameters with small gradients get larger ones. This balances the update scale across parameters with different gradient magnitudes.
Adam combines momentum and adaptive learning rates into a single optimizer. It maintains both a first-moment estimate (momentum) and a second-moment estimate (RMSProp-style adaptation), with bias correction to handle the initial epochs where both estimates are biased toward zero. Adam with default parameters (lr=0.001, beta1=0.9, beta2=0.999) works well on the vast majority of deep learning problems without manual tuning.
```python
import numpy as np

class OptimizerComparison:
    """Compare optimizer behaviors on a simple loss surface."""

    @staticmethod
    def sgd(w, grad, lr):
        """Vanilla SGD: w = w - lr * grad."""
        return w - lr * grad

    @staticmethod
    def sgd_momentum(w, grad, velocity, lr, momentum=0.9):
        """SGD with momentum: accumulates velocity in consistent gradient
        directions. Momentum helps escape saddle points and dampens
        oscillation in narrow valleys by smoothing the update direction."""
        velocity = momentum * velocity - lr * grad
        w = w + velocity
        return w, velocity

    @staticmethod
    def rmsprop(w, grad, cache, lr, decay=0.9, epsilon=1e-8):
        """RMSProp: adapts learning rate per parameter.
        Parameters with large recent gradients get smaller effective
        learning rates; parameters with small recent gradients get
        larger effective learning rates."""
        cache = decay * cache + (1 - decay) * grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + epsilon)
        return w, cache

    @staticmethod
    def adam(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        """Adam: combines momentum (beta1) and adaptive learning rate (beta2).
        Most widely used optimizer in deep learning. Works well with
        default hyperparameters on most problems."""
        m = beta1 * m + (1 - beta1) * grad          # First moment
        v = beta2 * v + (1 - beta2) * grad ** 2     # Second moment
        m_hat = m / (1 - beta1 ** t)                # Bias correction
        v_hat = v / (1 - beta2 ** t)                # Bias correction
        w = w - lr * m_hat / (np.sqrt(v_hat) + epsilon)
        return w, m, v

# --- Quick comparison on L(w) = (w-3)^2 ---
def compare_optimizers(n_steps=50):
    """Run all four optimizers on the same quadratic and compare."""
    results = {}

    # Vanilla SGD
    w, losses = 0.0, []
    for _ in range(n_steps):
        losses.append((w - 3) ** 2)
        w = OptimizerComparison.sgd(w, 2 * (w - 3), lr=0.1)
    results['SGD'] = losses

    # SGD + Momentum
    w, vel, losses = 0.0, 0.0, []
    for _ in range(n_steps):
        losses.append((w - 3) ** 2)
        w, vel = OptimizerComparison.sgd_momentum(
            w, 2 * (w - 3), vel, lr=0.05, momentum=0.9)
    results['SGD+Momentum'] = losses

    # RMSProp
    w, cache, losses = 0.0, 0.0, []
    for _ in range(n_steps):
        losses.append((w - 3) ** 2)
        w, cache = OptimizerComparison.rmsprop(
            w, 2 * (w - 3), cache, lr=0.1)
    results['RMSProp'] = losses

    # Adam
    w, m, v, losses = 0.0, 0.0, 0.0, []
    for t in range(1, n_steps + 1):
        losses.append((w - 3) ** 2)
        w, m, v = OptimizerComparison.adam(
            w, 2 * (w - 3), m, v, t, lr=0.5)
    results['Adam'] = losses

    for name, losses in results.items():
        print(f'{name:15s} final_loss={losses[-1]:.6f} '
              f'steps_to_0.01={next((i for i, l in enumerate(losses) if l < 0.01), "never")}')

compare_optimizers()
```
- SGD with momentum: best final accuracy for well-tuned problems (image classification with ResNets). Requires careful LR tuning and a schedule.
- Adam: best default choice for getting started. Works well out of the box. Use lr=0.001.
- AdamW: Adam with decoupled weight decay. Better regularization than Adam. The default for transformers and large language models.
- RMSProp: historically preferred for RNNs and reinforcement learning. Less common now that Adam exists.
- Rule of thumb: start with Adam (lr=0.001). Switch to SGD+momentum only if you need the last 1% of accuracy and have time to tune the learning rate schedule.
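The Adam-vs-AdamW distinction above comes down to where weight decay is applied. A single-step numpy sketch of the difference (simplified, with illustrative hyperparameters; real implementations update state across many steps):

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                 eps=1e-8, wd=0.01):
    """Classic Adam with L2 regularization: decay is folded into the
    gradient, so it gets rescaled by the adaptive denominator."""
    grad = grad + wd * w
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied directly to the weights,
    decoupled from the adaptive gradient scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w - lr * wd * w, m, v

w, grad = 2.0, 0.5
w_l2, _, _ = adam_l2_step(w, grad, m=0.0, v=0.0, t=1)
w_dw, _, _ = adamw_step(w, grad, m=0.0, v=0.0, t=1)
print(f'Adam+L2: {w_l2:.6f}  AdamW: {w_dw:.6f}')
```

Folding decay into the gradient lets the adaptive denominator shrink it; applying it directly to the weights keeps regularization strength independent of gradient history, which is why AdamW regularizes more predictably.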
🎯 Key Takeaways
- Loss functions quantify prediction error — the model's entire training process is an attempt to minimize this single number.
- MSE penalizes large errors quadratically (outlier-sensitive). MAE treats all errors linearly (outlier-robust). Huber combines both behaviors.
- Gradient descent follows the negative slope of the loss surface to find parameter values that minimize error.
- Learning rate controls step size — too large causes divergence, too small causes painfully slow convergence.
- Always plot both training and validation loss together — the gap between them is the overfitting signal.
- Adam optimizer is the practical default — it adapts learning rate per parameter and handles most problems with default settings.
- Mini-batch gradient descent (32-256 samples) balances gradient quality with computational efficiency and GPU utilization.
Interview Questions on This Topic
- Explain the difference between MSE and MAE loss functions. When would you use each? (Junior)
- What happens when the learning rate is too large? How would you diagnose and fix it? (Mid-level)
- Your training loss is 0.01 but validation loss is 0.85. What is happening and how do you fix it? (Mid-level)
- Explain why Adam optimizer is preferred over vanilla SGD in most deep learning applications. (Senior)
Frequently Asked Questions
What is the difference between a loss function and a metric?
A loss function is what the model optimizes during training — it must be differentiable so that gradients can be computed for backpropagation. A metric is what you use to evaluate model performance for human decision-making — it does not need to be differentiable. For example, you might train a classifier with binary cross-entropy loss (differentiable, smooth, well-behaved for optimization) but report F1 score as your evaluation metric (not differentiable, but directly meaningful to stakeholders). Loss functions drive learning. Metrics drive deployment decisions. They are related but serve fundamentally different purposes.
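A toy sketch of the split, on synthetic data: the model is trained on the differentiable binary cross-entropy loss, then judged on the non-differentiable accuracy metric (accuracy standing in for F1 to keep the example short):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic labels

# Train a logistic model by gradient descent on the LOSS (cross-entropy)
w = np.zeros(2)
for _ in range(300):
    p = 1 / (1 + np.exp(-X @ w))            # sigmoid predictions
    grad = X.T @ (p - y) / len(y)           # gradient of mean BCE
    w -= 0.5 * grad

p = np.clip(1 / (1 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # optimized
acc = np.mean((p > 0.5) == y)                            # reported
print(f'BCE loss: {bce:.4f}  accuracy: {acc:.2%}')
```

Gradient descent never sees the accuracy; it only ever moves along the loss gradient, yet the metric improves because the two are aligned.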
Can gradient descent find the global minimum?
In general, no — gradient descent finds a local minimum, and which one it reaches depends on the starting point (random initialization) and the optimization trajectory (learning rate, batch size, optimizer choice). However, in high-dimensional spaces like neural networks with millions of parameters, research has shown that most local minima have loss values very close to the global minimum — the practical difference is negligible. The real problems are saddle points (where the gradient is zero but it is not a minimum in all directions) and plateaus (where gradients are near zero and progress stalls). Momentum-based optimizers like Adam help escape both by building velocity that carries through these flat regions.
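The saddle-point behavior can be seen on the textbook surface f(x, y) = x^2 - y^2, whose origin is a saddle. In this sketch (a toy surface, not a real network), plain gradient descent starting a hair off the flat ridge escapes slowly, while momentum accumulates velocity and escapes in far fewer steps:

```python
def escape_steps(use_momentum, lr=0.1, beta=0.9, max_steps=10_000):
    """Count steps until |y| exceeds 1 on f(x, y) = x^2 - y^2,
    starting just off the saddle ridge at (1, 1e-6)."""
    x, y = 1.0, 1e-6
    vx = vy = 0.0
    for step in range(1, max_steps + 1):
        gx, gy = 2 * x, -2 * y          # gradient of f
        if use_momentum:
            vx = beta * vx - lr * gx    # velocity accumulates
            vy = beta * vy - lr * gy
            x, y = x + vx, y + vy
        else:
            x, y = x - lr * gx, y - lr * gy
        if abs(y) > 1.0:                # escaped the saddle region
            return step
    return max_steps

plain = escape_steps(use_momentum=False)
mom = escape_steps(use_momentum=True)
print(f'Vanilla GD escapes in {plain} steps, momentum in {mom} steps')
```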
How do I choose the right batch size?
Start with 32 — it is the most widely validated default and works well for most problems. Smaller batches (8, 16) add more gradient noise per step, which can act as implicit regularization and help the model find flatter minima that generalize better, but they are slower per epoch because you cannot fully utilize GPU parallelism. Larger batches (128, 256, 512) give smoother gradient estimates and better GPU throughput but may converge to sharper minima that generalize worse. If you increase batch size, increase the learning rate proportionally (linear scaling rule) to compensate for the reduced noise. For very large models on large datasets, batch sizes of 256-1024 are common. The batch size rarely matters as much as learning rate and architecture — tune those first.
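The linear scaling rule mentioned above is a one-liner; the base values here are illustrative:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: grow the learning rate proportionally with
    batch size to compensate for the reduced gradient noise."""
    return base_lr * new_batch / base_batch

for bs in (32, 64, 128, 256):
    print(f'batch={bs:4d} -> lr={scaled_lr(0.001, 32, bs):.4f}')
```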
What is gradient clipping and when should I use it?
Gradient clipping limits the maximum magnitude of gradients during backpropagation. If the total gradient norm exceeds a threshold (commonly 1.0), all gradients are scaled down proportionally so the norm equals the threshold. This prevents a single large gradient from causing a catastrophic parameter update that destabilizes training.
Use gradient clipping when training recurrent neural networks (which are prone to exploding gradients due to repeated multiplication through time steps), when using large learning rates, when training with mixed precision (FP16), or whenever you observe NaN loss values during training. The most common implementation: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). Place it after loss.backward() and before optimizer.step(). Gradient clipping is cheap, safe, and worth adding to any training pipeline as a defensive measure.
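The clipping behavior described here is easy to mirror in plain numpy (a sketch of the semantics, not PyTorch's implementation):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """If the combined L2 norm of all gradients exceeds max_norm,
    scale every gradient down proportionally; otherwise leave them as-is."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f'before={norm:.1f}, after={clipped_norm:.4f}')
```

Note that all gradients are scaled by the same factor, so the update direction is preserved; only its magnitude is capped.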
Why does my model train fine on one dataset but explode on another?
Different datasets have different scale, noise, and outlier characteristics that interact with your loss function and learning rate. A learning rate of 0.001 that works perfectly on normalized data with values in [-1, 1] can cause divergence on unnormalized data with values in [0, 1000000] because the raw gradient magnitudes are orders of magnitude larger. Similarly, MSE loss on clean data produces well-behaved gradients, but MSE on data with extreme outliers produces gradient spikes that destabilize training. The fix: always normalize your input features, check for outliers and extreme values before training, and use gradient clipping as a safety net. Your training pipeline should be robust to dataset characteristics, not tuned to one specific dataset.
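A minimal normalization sketch (synthetic numbers standing in for an unnormalized dataset): fit the statistics on the training split only and reuse them for every other split, so no information leaks from validation into training:

```python
import numpy as np

def standardize(train, other):
    """Normalize using statistics from the TRAINING split only;
    applying them unchanged to validation/test avoids leakage."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + 1e-8   # guard against zero-variance features
    return (train - mu) / sigma, (other - mu) / sigma

rng = np.random.default_rng(0)
raw_train = rng.normal(loc=500_000, scale=120_000, size=(1000, 3))
raw_val = rng.normal(loc=500_000, scale=120_000, size=(200, 3))
train_n, val_n = standardize(raw_train, raw_val)
print(f'train mean~{train_n.mean():.3f}, std~{train_n.std():.3f}')
```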
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.