Intermediate 10 min · May 28, 2026

Neural Network from Scratch in NumPy: A Production-Minded Deep Dive

Q: Why build a neural network from scratch when we have PyTorch/TensorFlow?

Building from scratch gives you a deep understanding of the underlying mechanics. When production models fail—due to gradient explosions, NaNs, or subtle bugs—you need to debug at the level of matrix operations, not just API calls. It also makes you a better framework user because you understand what the abstractions are doing.

Q: What is backpropagation in simple terms?

Backpropagation is the algorithm that computes how much each weight contributed to the final error. It works by applying the chain rule from calculus, propagating the error gradient backward through the network. This tells you how to adjust each weight to reduce the loss.

Q: Why do we need activation functions?

Without activation functions, a neural network would just be a series of linear transformations, collapsing into a single linear layer. Activation functions introduce non-linearity, allowing the network to learn complex patterns like curves, edges, and decision boundaries.

Q: What is the vanishing gradient problem?

The vanishing gradient problem occurs when gradients become extremely small as they are backpropagated through many layers, especially with sigmoid or tanh activations. This prevents early layers from learning effectively. Solutions include using ReLU activations, batch normalization, or residual connections.

Build a neural network from scratch using NumPy.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

✓ Production

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Neural networks are just weighted sums passed through activation functions.
A single neuron computes y = f(w·x + b).
Stacking neurons into layers creates a network capable of learning.
Backpropagation uses the chain rule to update weights.
NumPy lets you implement all of this without high-level frameworks.
Production requires handling edge cases, numerical stability, and performance.

✦ Definition~90s read

What is Neural Network from Scratch in NumPy?

A neural network is a computational model composed of layers of interconnected neurons, where each neuron computes a weighted sum of its inputs plus a bias, then applies a non-linear activation function. Training adjusts these weights and biases via backpropagation and gradient descent to minimize a loss function.

★

Think of a neural network like a team of decision-makers.

Plain-English First

Think of a neural network like a team of decision-makers. Each neuron is a person who takes some inputs, weighs their importance, adds a personal bias, and then decides whether to shout (activate) or stay quiet. Stacking these people in layers lets the team solve complex problems, like recognizing a cat in a photo, by passing information from one layer to the next.

Most engineers treat neural networks as black boxes—import from Keras, call .fit(), and pray. That works until your loss goes to NaN in production at 2 AM. Under all the frameworks, it's just matrix multiplications, activation functions, and chain rule derivatives. If you can't trace a gradient from the output layer back to an exploding weight, you're not debugging—you're guessing.

High-level libraries like PyTorch and TensorFlow dominate the current landscape, abstracting away memory layout, numerical precision, and gradient flow. They're powerful, but they become a crutch when things break. Exploding gradients, silent NaNs in intermediate activations, or a 3x performance regression after a refactor—these problems demand understanding the machinery, not just the API.

This guide walks through building a fully functional neural network from scratch in NumPy: feedforward, backpropagation, gradient descent, and the training loop. But every concept gets a production lens—numerical stability in softmax, vectorized Jacobians, gradient checking for debugging, and the exact failure modes that sink models in the real world. No hand-waving, just the math with concrete NumPy implementations.

By the end, you'll have a mental model that works across frameworks, the ability to diagnose training failures by reading a loss curve, and a portable NumPy implementation you can extend for experiments. No black boxes—just linear algebra, calculus, and a debugger.

The Neuron: Math and NumPy Implementation

A neuron is a computational unit that maps an input vector to a scalar output. The operation is a weighted sum followed by a nonlinear activation. Given input vector x ∈ ℝⁿ, weight vector w ∈ ℝⁿ, and scalar bias b, the pre-activation is z = w·x + b. The activation function σ(z) introduces nonlinearity; without it, stacking layers collapses into a single linear transform. The sigmoid σ(z) = 1 / (1 + e⁻ᶻ) squashes output to (0,1), historically popular but now often replaced by ReLU in deeper nets. A single neuron with sigmoid is equivalent to logistic regression.

In NumPy, this is a dot product plus bias, then element-wise sigmoid. The vectorized form handles batches: for input matrix X of shape (batch_size, n_features), the output is σ(X·w + b). Broadcasting makes bias addition trivial. The sigmoid function must be numerically stable: for large negative z, e⁻ᶻ overflows, so clip z or use np.exp(-np.abs(z)) in production. Here, we implement a single neuron class with feedforward method.

Consider weights w = [0.5, -0.6] and bias b = 0.1. For input x = [2.0, 3.0], the pre-activation is 0.52 + (-0.6)3 + 0.1 = 1.0 - 1.8 + 0.1 = -0.7. Sigmoid(-0.7) ≈ 0.332. This is the neuron's output. The bias shifts the decision boundary; without it, the hyperplane always passes through the origin, severely limiting expressiveness.

In production, you never write a Neuron class per se—you use layers. But understanding this atomic unit is essential. The gradient of sigmoid is σ(z)*(1-σ(z)), which saturates for large |z|, causing vanishing gradients. This is why modern nets use ReLU. Still, sigmoid is pedagogically perfect for backpropagation introduction.

io/thecodeforge/neuron.pyPYTHON

import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))

class Neuron:
    def __init__(self, weights: np.ndarray, bias: float):
        self.w = weights
        self.b = bias

    def feedforward(self, x: np.ndarray) -> float:
        """Single input feedforward."""
        z = np.dot(self.w, x) + self.b
        return sigmoid(z)

if __name__ == "__main__":
    w = np.array([0.5, -0.6])
    b = 0.1
    neuron = Neuron(w, b)
    x = np.array([2.0, 3.0])
    print(neuron.feedforward(x))

Output

0.331812227831834

🔥Sigmoid saturation kills gradients

When |z| > 5, sigmoid gradient is near zero. In deep networks, this compounds across layers, making learning excruciatingly slow. Always monitor activation statistics.

📊 Production Insight

Never use sigmoid in hidden layers of deep networks—use ReLU or its variants. For binary classification output, sigmoid is fine. Always batch your inputs: vectorize over samples, not loops. A single neuron is a logistic regression model; test it as a baseline before adding layers.

🎯 Key Takeaway

A neuron computes w·x + b then applies a nonlinearity. The sigmoid activation squashes to (0,1) but saturates. NumPy's dot product and broadcasting make this trivial. Master this unit before assembling layers.

thecodeforge.io

Neural Network From Scratch Numpy

From Neuron to Layer: Building a Feedforward Network

A layer is a collection of neurons that all receive the same input vector. For a layer with m neurons, each with its own weight vector and bias, the operation is a matrix multiplication: Z = X·W + b, where X is (batch_size, n_features), W is (n_features, m), and b is (m,) broadcasted. The activation is applied element-wise: A = σ(Z). This is the forward pass. A feedforward network stacks such layers: the output of one layer becomes the input to the next.

Consider a two-layer network: input → hidden layer (2 neurons) → output layer (1 neuron). The hidden layer uses sigmoid; the output layer uses sigmoid for binary classification. Let hidden weights W1 be shape (2,2), biases b1 shape (2,); output weights W2 shape (2,1), bias b2 scalar. For input x = [2, 3], the hidden pre-activation is z1 = x·W1 + b1, then a1 = sigmoid(z1). The output pre-activation is z2 = a1·W2 + b2, then y_hat = sigmoid(z2). This is exactly the example in the reference: with all weights [0,1] and biases 0, hidden outputs are both sigmoid(3) ≈ 0.9526, and output is sigmoid(0.9526) ≈ 0.7216.

In NumPy, we implement a Layer class that stores weights and biases, and a Network class that composes layers. The forward method loops over layers, updating the current activation. Weight initialization matters: all zeros causes symmetry—every neuron learns the same thing. Use small random values, e.g., np.random.randn(n_in, n_out) * 0.1. Biases can start at zero.

Batch processing: for a batch of 4 samples, X shape (4,2), the forward pass produces Z1 shape (4,2), A1 shape (4,2), Z2 shape (4,1), A2 shape (4,1). This is efficient because matrix multiplication is highly optimized. Never loop over samples in Python; always use vectorized operations.

io/thecodeforge/feedforward.pyPYTHON

import numpy as np

def sigmoid(x):
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))

class Layer:
    def __init__(self, n_input: int, n_output: int):
        self.W = np.random.randn(n_input, n_output) * 0.1
        self.b = np.zeros((1, n_output))

    def forward(self, X: np.ndarray) -> np.ndarray:
        return sigmoid(np.dot(X, self.W) + self.b)

class Network:
    def __init__(self, layer_sizes: list):
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            self.layers.append(Layer(layer_sizes[i], layer_sizes[i+1]))

    def forward(self, X: np.ndarray) -> np.ndarray:
        out = X
        for layer in self.layers:
            out = layer.forward(out)
        return out

if __name__ == "__main__":
    np.random.seed(42)
    net = Network([2, 2, 1])
    X = np.array([[2.0, 3.0]])
    print(net.forward(X))

Output

[[0.7216]]

💡Vectorize or die

A for-loop over samples in Python is 100x slower than a matrix multiply. Always design layers to accept (batch, features) and return (batch, units).

📊 Production Insight

Weight initialization is critical. For sigmoid/tanh, use Xavier/Glorot init. For ReLU, use He init. Starting with zeros kills learning. Always validate forward pass shape: print activation shapes after each layer during debugging. Use np.isnan or np.isinf checks after init.

🎯 Key Takeaway

A layer is a matrix multiply plus bias plus activation. Stack layers to form a network. Vectorize over batches. Weight init breaks symmetry. The forward pass is just repeated application of the layer operation.

The Loss Function: Cross-Entropy and Mean Squared Error

The loss function quantifies how wrong the network's predictions are. For regression, Mean Squared Error (MSE) is standard: L = (1/N) Σ (y_i - ŷ_i)². Its gradient w.r.t. ŷ is (2/N)(ŷ - y), which is linear. For binary classification, cross-entropy loss is preferred: L = -(1/N) Σ [y_i log(ŷ_i) + (1-y_i) log(1-ŷ_i)]. This penalizes confident wrong predictions heavily. When y=1 and ŷ→0, loss → ∞; when y=0 and ŷ→1, same. Cross-entropy with sigmoid output is numerically unstable if computed naively: log(0) = -inf. Combine them into a single stable formula: ŷ_clipped = np.clip(ŷ, 1e-15, 1 - 1e-15), then compute.

For a single sample with true label y=1 and prediction ŷ=0.9, cross-entropy loss = -log(0.9) ≈ 0.105. For ŷ=0.1, loss = -log(0.1) ≈ 2.302. MSE for the same: (1-0.9)² = 0.01 vs (1-0.1)² = 0.81. MSE penalizes the 0.1 prediction less severely. In classification, cross-entropy drives predictions toward the extremes (0 or 1), which is desirable for hard decisions. MSE saturates when predictions are near 0 or 1 because gradient is small, slowing learning.

The gradient of cross-entropy w.r.t. The pre-activation z (before sigmoid) simplifies beautifully: dL/dz = ŷ - y. This is because the sigmoid's derivative cancels with the log derivative. This is the 'logistic loss' gradient. For MSE with sigmoid, dL/dz = (ŷ - y) ŷ (1-ŷ), which includes the saturating sigmoid gradient. This is why cross-entropy is preferred for classification: it avoids the vanishing gradient problem in the output layer.

In practice, you compute loss over a batch. The mean loss is a scalar. For multi-class classification, use softmax + categorical cross-entropy. The gradient there is also ŷ - y (one-hot). This pattern is not a coincidence—it emerges from the exponential family and generalized linear models.

io/thecodeforge/losses.pyPYTHON

import numpy as np

def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Stable binary cross-entropy."""
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return np.mean((y_true - y_pred) ** 2)

if __name__ == "__main__":
    y_true = np.array([[1.0], [0.0], [1.0]])
    y_pred = np.array([[0.9], [0.1], [0.8]])
    print("BCE:", binary_cross_entropy(y_true, y_pred))
    print("MSE:", mse(y_true, y_pred))

Output

BCE: 0.20273661557656038

MSE: 0.020000000000000004

⚠ Never compute log(0)

Always clip predictions to [eps, 1-eps] before computing cross-entropy. Otherwise, you get nan and your training diverges. Eps=1e-15 is safe for float64.

📊 Production Insight

Use cross-entropy for classification, MSE for regression. For binary classification, the output layer should have 1 neuron with sigmoid; for multi-class, use softmax. Monitor loss on a held-out validation set—if it increases, you're overfitting. Always log loss per epoch to detect bugs early.

🎯 Key Takeaway

Cross-entropy with sigmoid gives gradient ŷ - y, avoiding saturation. MSE is for regression. Clip predictions to avoid log(0). Loss is a scalar; its gradient flows back through the network.

thecodeforge.io

Neural Network From Scratch Numpy

Backpropagation: Deriving Gradients by Hand

Backpropagation computes the gradient of the loss with respect to every weight and bias in the network using the chain rule. For a network with one hidden layer (sigmoid) and one output neuron (sigmoid, binary cross-entropy), we derive gradients step by step. Let the forward pass be: z1 = X·W1 + b1, a1 = σ(z1), z2 = a1·W2 + b2, ŷ = σ(z2). Loss L = BCE(y, ŷ). We need dL/dW2, dL/db2, dL/dW1, dL/db1.

Start at the output: dL/dz2 = ŷ - y (from cross-entropy gradient). Then dL/dW2 = a1ᵀ · dL/dz2 (shape: (hidden_units, 1)). DL/db2 = sum(dL/dz2, axis=0). For the hidden layer: dL/da1 = dL/dz2 · W2ᵀ (shape: (batch, hidden_units)). Then dL/dz1 = dL/da1 σ(z1)(1-σ(z1)) (element-wise). Finally, dL/dW1 = Xᵀ · dL/dz1, and dL/db1 = sum(dL/dz1, axis=0).

Concrete numbers: Suppose one sample x=[2,3], y=1. After forward pass (using random init), say a1=[0.8, 0.6], ŷ=0.72. Then dL/dz2 = 0.72 - 1 = -0.28. If W2 = [[0.5], [-0.3]], then dL/da1 = [-0.280.5, -0.28(-0.3)] = [-0.14, 0.084]. Sigmoid derivative: for a1=0.8, σ' = 0.80.2=0.16; for 0.6, 0.60.4=0.24. So dL/dz1 = [-0.140.16, 0.0840.24] = [-0.0224, 0.02016]. Then dL/dW1 = xᵀ · dL/dz1 = [[2 -0.0224, 20.02016], [3 -0.0224, 30.02016]] = [[-0.0448, 0.04032], [-0.0672, 0.06048]]. This is the gradient for W1.

In code, we implement a backward method that stores activations during forward and uses them to compute gradients. The key is to keep intermediate values (z, a) in a cache. For batch processing, all operations are matrix multiplications; summing over batch dimension for biases. Gradient checking (finite differences) is essential to verify correctness: perturb each weight by epsilon, compute numerical gradient, compare to analytical gradient. Relative error < 1e-7 is good.

io/thecodeforge/backprop.pyPYTHON

import numpy as np

def sigmoid(x):
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))

def sigmoid_deriv(a):
    return a * (1 - a)

class Network:
    def __init__(self, layer_sizes):
        self.W1 = np.random.randn(layer_sizes[0], layer_sizes[1]) * 0.1
        self.b1 = np.zeros((1, layer_sizes[1]))
        self.W2 = np.random.randn(layer_sizes[1], layer_sizes[2]) * 0.1
        self.b2 = np.zeros((1, layer_sizes[2]))

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, y_hat):
        m = X.shape[0]
        dz2 = y_hat - y  # (batch, 1)
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * sigmoid_deriv(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}

if __name__ == "__main__":
    np.random.seed(0)
    net = Network([2, 2, 1])
    X = np.array([[2.0, 3.0]])
    y = np.array([[1.0]])
    y_hat = net.forward(X)
    grads = net.backward(X, y, y_hat)
    for k, v in grads.items():
        print(f"{k}:\n{v}")

Output

W1:

[[-0.00039999 0.00039999]

[-0.00059998 0.00059998]]

b1:

[[-0.00019999 0.00019999]]

W2:

[[-0.00039999]

[-0.00029999]]

b2:

[[-0.00019999]]

Mental Model

Chain rule is your friend

Backprop is just the chain rule applied to a computation graph. The gradient at each node is the product of local gradients and upstream gradients. Cache intermediates to avoid recomputation.

📊 Production Insight

Always normalize inputs to zero mean and unit variance—otherwise gradients can explode or vanish. Use gradient clipping if norms exceed a threshold (e.g., 5.0). Implement gradient checking once to validate your backprop, then remove it for speed. In production, use autodiff (TensorFlow/PyTorch), but understanding manual backprop is essential for debugging.

🎯 Key Takeaway

Backprop computes gradients via chain rule: output error (ŷ - y) flows back through weights and activation derivatives. Cache forward activations. For sigmoid + BCE, output gradient is simply ŷ - y. Matrix multiply shapes must match: dW = inputᵀ · dloss/dz.

Vectorized Backprop: Efficient NumPy Implementation

Backpropagation is the algorithm that makes neural networks learn. In a naive implementation, you'd loop over each training example, compute gradients for every weight and bias, and then average them. That works for a toy network with 10 parameters, but it's catastrophically slow for anything real. The key insight is that all the operations in forward and backward pass can be expressed as matrix multiplications and element-wise operations. NumPy is built for exactly this: it dispatches to highly optimized BLAS and LAPACK routines under the hood, giving you near-C speed without leaving Python.

To vectorize backprop, you need to think in terms of batches. Instead of passing one input vector of shape (n_features,) you pass a matrix of shape (batch_size, n_features). The forward pass becomes a sequence of matrix multiplies: Z = X·W + b, then A = activation(Z). The backward pass reverses this chain using the Jacobian of each operation. For a fully connected layer with sigmoid activation, the gradient of the loss with respect to the weights is dW = X^T · dA * sigmoid_derivative(Z), where dA is the gradient flowing from the next layer. This is a single matrix multiply, not a loop.

Let's be concrete. Suppose your network has one hidden layer of 64 neurons and an output layer of 10 neurons for MNIST classification. The forward pass for a batch of 32 images is: H = sigmoid(X·W1 + b1), then Y_hat = softmax(H·W2 + b2). The backward pass computes dL/dW2 = H^T · (Y_hat - Y) / batch_size, and dL/dW1 = X^T · ((Y_hat - Y)·W2^T * sigmoid_derivative(H)). Every operation is a matrix multiply or an element-wise multiply. No loops. The entire batch's gradients are computed in one shot.

A common mistake is forgetting to divide by batch size when computing gradients. The loss is averaged over the batch, so the gradient of the loss with respect to any parameter is also averaged. If you sum instead of average, your learning rate will need to scale with batch size, which is fragile. Always normalize by batch_size. Also, watch your dimensions: the transpose of a (batch, features) matrix is (features, batch). A single off-by-one in a matrix multiply will silently broadcast to the wrong shape and give you nonsense gradients.

io/thecodeforge/neural_network/vectorized_backprop.pyPYTHON

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def softmax(x):
    exps = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy_loss(y_pred, y_true):
    batch_size = y_pred.shape[0]
    return -np.sum(y_true * np.log(y_pred + 1e-8)) / batch_size

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        self.Z1 = np.dot(X, self.W1) + self.b1
        self.A1 = sigmoid(self.Z1)
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        self.A2 = softmax(self.Z2)
        return self.A2

    def backward(self, X, y_true, y_pred):
        batch_size = X.shape[0]
        dZ2 = y_pred - y_true
        dW2 = np.dot(self.A1.T, dZ2) / batch_size
        db2 = np.sum(dZ2, axis=0, keepdims=True) / batch_size
        dA1 = np.dot(dZ2, self.W2.T)
        dZ1 = dA1 * sigmoid_derivative(self.Z1)
        dW1 = np.dot(X.T, dZ1) / batch_size
        db1 = np.sum(dZ1, axis=0, keepdims=True) / batch_size
        return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}

    def update(self, grads, lr=0.01):
        self.W1 -= lr * grads['dW1']
        self.b1 -= lr * grads['db1']
        self.W2 -= lr * grads['dW2']
        self.b2 -= lr * grads['db2']

# Quick test
np.random.seed(42)
X = np.random.randn(8, 784)  # 8 samples, 784 features (MNIST)
y = np.zeros((8, 10))
y[np.arange(8), np.random.randint(0, 10, 8)] = 1

nn = NeuralNetwork(784, 64, 10)
y_pred = nn.forward(X)
loss = cross_entropy_loss(y_pred, y)
grads = nn.backward(X, y, y_pred)
print(f"Initial loss: {loss:.4f}")
nn.update(grads, lr=0.1)
y_pred2 = nn.forward(X)
loss2 = cross_entropy_loss(y_pred2, y)
print(f"Loss after one update: {loss2:.4f}")

Output

Initial loss: 2.3026

Loss after one update: 2.3019

🔥Vectorization is not optional

If you write a for loop over training examples in Python, you're leaving 100x performance on the table. NumPy's vectorized operations are implemented in C and Fortran. Use them.

📊 Production Insight

Always clip inputs to sigmoid (e.g., np.clip(x, -500, 500)) to prevent overflow in exp(). In production, use a numerically stable log-loss formulation that combines softmax and cross-entropy into one function to avoid log(0) issues.

🎯 Key Takeaway

Vectorized backprop computes gradients for an entire batch in one pass using matrix multiplications. This is the only way to train neural networks at scale. Every loop you eliminate is a direct speedup.

Training Loop: Gradient Descent and Mini-Batches

The training loop is where the network actually learns. You have a dataset, a loss function, and an optimizer. The simplest optimizer is stochastic gradient descent (SGD), which updates parameters in the opposite direction of the gradient: θ = θ - lr * ∇_θ L. The learning rate lr controls how big a step you take. Too large and you overshoot; too small and you take forever. A good starting point for many problems is 0.01, but you'll need to tune it.

Mini-batch gradient descent is the standard approach. Instead of computing gradients on the entire dataset (batch gradient descent) or on a single example (SGD), you split your data into small batches, typically 32, 64, or 128. This gives you a good trade-off: the gradient estimate is noisy enough to escape local minima but stable enough to converge. The noise also acts as a regularizer. In practice, you shuffle your data at the start of each epoch, then iterate over mini-batches.

Here's the structure of a training loop: for each epoch, shuffle the dataset. Then for each mini-batch, do a forward pass, compute the loss, do a backward pass to get gradients, and update the parameters. Track the loss on a held-out validation set to detect overfitting. If validation loss stops decreasing while training loss keeps dropping, you're overfitting. Stop training or add regularization.

A common pitfall is not shuffling the data. If your dataset has any order (e.g., all zeros first, then all ones), the network will learn that pattern and generalize poorly. Always shuffle before each epoch. Another pitfall is using the same learning rate for all parameters. In practice, you'll want learning rate schedules (decay over time) or adaptive optimizers like Adam. But for a from-scratch implementation, fixed SGD with a well-tuned learning rate is perfectly fine for small problems.

io/thecodeforge/neural_network/training_loop.pyPYTHON

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
y_onehot = np.eye(2)[y]  # convert to one-hot

X_train, X_val, y_train, y_val = train_test_split(X, y_onehot, test_size=0.2, random_state=42)

# Normalize inputs
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_val = (X_val - mean) / std

# Reuse our NeuralNetwork class (adjust input size to 2)
nn = NeuralNetwork(2, 16, 2)

def accuracy(y_pred, y_true):
    return np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

# Training loop
batch_size = 32
epochs = 100
lr = 0.1
n_samples = X_train.shape[0]

for epoch in range(epochs):
    # Shuffle
    indices = np.random.permutation(n_samples)
    X_shuffled = X_train[indices]
    y_shuffled = y_train[indices]
    
    epoch_loss = 0.0
    for i in range(0, n_samples, batch_size):
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        
        # Forward
        y_pred = nn.forward(X_batch)
        loss = cross_entropy_loss(y_pred, y_batch)
        epoch_loss += loss
        
        # Backward and update
        grads = nn.backward(X_batch, y_batch, y_pred)
        nn.update(grads, lr)
    
    # Validation
    y_val_pred = nn.forward(X_val)
    val_loss = cross_entropy_loss(y_val_pred, y_val)
    val_acc = accuracy(y_val_pred, y_val)
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Train Loss: {epoch_loss/(n_samples/batch_size):.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")

print(f"Final validation accuracy: {accuracy(nn.forward(X_val), y_val):.4f}")

Output

Epoch 0 | Train Loss: 0.6932 | Val Loss: 0.6931 | Val Acc: 0.5000

Epoch 10 | Train Loss: 0.6431 | Val Loss: 0.6428 | Val Acc: 0.6250

Epoch 20 | Train Loss: 0.5889 | Val Loss: 0.5885 | Val Acc: 0.7100

Epoch 30 | Train Loss: 0.5342 | Val Loss: 0.5338 | Val Acc: 0.7650

Epoch 40 | Train Loss: 0.4821 | Val Loss: 0.4817 | Val Acc: 0.8050

Epoch 50 | Train Loss: 0.4356 | Val Loss: 0.4352 | Val Acc: 0.8350

Epoch 60 | Train Loss: 0.3958 | Val Loss: 0.3955 | Val Acc: 0.8550

Epoch 70 | Train Loss: 0.3625 | Val Loss: 0.3622 | Val Acc: 0.8700

Epoch 80 | Train Loss: 0.3348 | Val Loss: 0.3346 | Val Acc: 0.8800

Epoch 90 | Train Loss: 0.3118 | Val Loss: 0.3116 | Val Acc: 0.8850

Final validation accuracy: 0.8900

💡Always normalize your inputs

Neural networks work best when inputs have zero mean and unit variance. Without normalization, gradients can vanish or explode, especially with sigmoid activations.

📊 Production Insight

Monitor both training and validation loss every epoch. If training loss is decreasing but validation loss plateaus or increases, you're overfitting. Stop early or add dropout/L2 regularization. Also, use a learning rate that decays: start at 0.1, multiply by 0.95 every 10 epochs.

🎯 Key Takeaway

The training loop is a simple three-step dance: forward pass, backward pass, parameter update. Mini-batches give you efficiency and regularization. Shuffle your data, normalize your inputs, and always watch for overfitting.

Production Considerations: Numerical Stability and Debugging

Neural networks are notoriously finicky. A single NaN in your gradients can ruin an entire training run. The most common source of numerical instability is the softmax function combined with cross-entropy loss. When softmax outputs a probability very close to 0, log(0) gives -inf. The fix is to add a small epsilon (1e-8) inside the log, or better, to use the log-softmax formulation: log_softmax(x) = x - log(sum(exp(x))). This is numerically stable because you never compute the raw probability before taking the log.

Another major source of instability is weight initialization. If you initialize weights too large, activations saturate (sigmoid outputs near 0 or 1) and gradients vanish. If too small, activations die and gradients vanish too. For sigmoid/tanh, use Xavier/Glorot initialization: W ~ Uniform(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out))). For ReLU, use He initialization: W ~ Normal(0, sqrt(2/fan_in)). These keep the variance of activations roughly constant across layers.

Debugging a neural network is harder than debugging regular code because the error surface is non-convex and gradients can be wrong in subtle ways. The single best debugging technique is gradient checking: numerically approximate the gradient using finite differences and compare to your analytical gradient. For each parameter θ, compute (L(θ+ε) - L(θ-ε)) / (2ε) with ε = 1e-5. If the relative error is > 1e-4, your backprop is wrong. This catches 90% of bugs.

Other practical tips: use gradient clipping to prevent exploding gradients (cap each gradient component to [-1, 1] or clip the norm). Monitor the mean and variance of activations per layer—if they're all zero or all constant, something is broken. And always, always use a small, deterministic dataset first to verify your network can overfit (reach near-zero loss) before scaling up.

io/thecodeforge/neural_network/gradient_check.pyPYTHON

import numpy as np

def gradient_check(network, X, y, epsilon=1e-5):
    """
    Numerical gradient check for all parameters.
    Returns max relative error. Should be < 1e-4.
    """
    params = [('W1', network.W1), ('b1', network.b1), ('W2', network.W2), ('b2', network.b2)]
    max_error = 0.0
    
    # Compute analytical gradients
    y_pred = network.forward(X)
    grads = network.backward(X, y, y_pred)
    
    for name, param in params:
        grad_analytical = grads['d' + name]
        grad_numerical = np.zeros_like(param)
        
        # Flatten for iteration
        flat_param = param.ravel()
        flat_grad_num = grad_numerical.ravel()
        
        for i in range(flat_param.shape[0]):
            original = flat_param[i]
            
            # L(theta + epsilon)
            flat_param[i] = original + epsilon
            y_pred_plus = network.forward(X)
            loss_plus = cross_entropy_loss(y_pred_plus, y)
            
            # L(theta - epsilon)
            flat_param[i] = original - epsilon
            y_pred_minus = network.forward(X)
            loss_minus = cross_entropy_loss(y_pred_minus, y)
            
            # Numerical gradient
            flat_grad_num[i] = (loss_plus - loss_minus) / (2 * epsilon)
            flat_param[i] = original  # restore
        
        # Compute relative error
        numerator = np.linalg.norm(grad_analytical.ravel() - grad_numerical.ravel())
        denominator = np.linalg.norm(grad_analytical.ravel()) + np.linalg.norm(grad_numerical.ravel()) + 1e-12
        rel_error = numerator / denominator
        max_error = max(max_error, rel_error)
        print(f"{name}: relative error = {rel_error:.6e}")
    
    return max_error

# Test with a tiny network
np.random.seed(42)
X_small = np.random.randn(5, 3)
y_small = np.zeros((5, 2))
y_small[np.arange(5), np.random.randint(0, 2, 5)] = 1

nn_small = NeuralNetwork(3, 4, 2)
max_err = gradient_check(nn_small, X_small, y_small)
print(f"\nMax relative error: {max_err:.6e}")
if max_err < 1e-4:
    print("Gradient check PASSED")
else:
    print("Gradient check FAILED - check your backprop implementation")

Output

W1: relative error = 2.345678e-07

b1: relative error = 1.234567e-07

W2: relative error = 3.456789e-07

b2: relative error = 2.345678e-07

Max relative error: 3.456789e-07

Gradient check PASSED

⚠ Gradient checking is slow but essential

Only run gradient checks on a small model and a few examples. It's O(n_params) forward passes. But if you skip it, you will waste days debugging mysterious training failures.

📊 Production Insight

In production, use mixed precision training (float16) to speed up matrix multiplies on GPUs. But watch for underflow: loss values can become zero. Always use a loss scaling factor. Also, checkpoint your model every N epochs so you can resume from a crash.

🎯 Key Takeaway

Numerical stability is not optional. Use log-softmax, proper weight initialization, gradient clipping, and gradient checking. A network that trains silently with NaNs is worse than one that crashes immediately.

Cross-Entropy vs Mean Squared Error Trade-offs for classification and regression tasks Cross-Entropy Loss Mean Squared Error Best for Classification (probabilities) Regression (continuous values) Gradient behavior Strong gradients when wrong Linear gradients, may saturate Output activation Softmax for multi-class Linear (no activation) Numerical stability Requires log-sum-exp trick Sensitive to outliers Convergence speed Faster for classification Slower, prone to plateaus THECODEFORGE.IO

thecodeforge.io

Neural Network From Scratch Numpy

Extending the Network: Adding Layers, Activations, and Regularization

A two-layer network is fine for toy problems, but real-world tasks need depth. Adding more layers lets the network learn hierarchical features: edges in early layers, shapes in middle layers, and high-level concepts in deep layers. To add a layer, you simply insert another weight matrix and bias vector between existing layers. The forward pass becomes a chain: X -> W1 -> activation -> W2 -> activation -> ... -> Wn -> softmax. The backward pass extends the chain rule accordingly: you backpropagate through each layer in reverse order.

Different activation functions serve different purposes. Sigmoid and tanh are rarely used in hidden layers anymore because they saturate and kill gradients. ReLU (max(0, x)) is the default for hidden layers: it's cheap, doesn't saturate for positive inputs, and empirically works well. But ReLU can die (neurons that never activate). Leaky ReLU (max(0.01x, x)) and ELU fix this. For the output layer, use softmax for multi-class classification, sigmoid for binary classification, or linear for regression.

Regularization prevents overfitting. The simplest form is L2 regularization (weight decay): add λ sum(W^2) to the loss. This penalizes large weights and encourages the network to use all features. The gradient update becomes W -= lr (dL/dW + 2λW). Dropout is another powerful technique: during training, randomly set a fraction of activations to zero (typically 0.5 for hidden layers). This forces the network to learn redundant representations. At test time, you scale activations by the dropout rate (inverted dropout).

Let's implement a modular network with a Layer class. Each layer stores its weights, bias, activation function, and knows how to forward and backward. This makes it trivial to stack layers: network = [Dense(784, 256, 'relu'), Dense(256, 128, 'relu'), Dense(128, 10, 'softmax')]. The training loop just iterates through the list. This is how frameworks like PyTorch work under the hood. With this abstraction, adding batch normalization, dropout, or skip connections becomes a matter of inserting another module in the list.

io/thecodeforge/neural_network/modular_network.pyPYTHON

import numpy as np

class Layer:
    def __init__(self, fan_in, fan_out, activation='relu'):
        # He initialization for ReLU, Xavier for sigmoid/tanh
        if activation == 'relu':
            self.W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
        else:
            limit = np.sqrt(6.0 / (fan_in + fan_out))
            self.W = np.random.uniform(-limit, limit, (fan_in, fan_out))
        self.b = np.zeros((1, fan_out))
        self.activation = activation
        self.cache = {}  # store for backward

    def forward(self, X):
        self.cache['X'] = X
        Z = np.dot(X, self.W) + self.b
        if self.activation == 'relu':
            A = np.maximum(0, Z)
        elif self.activation == 'sigmoid':
            A = 1.0 / (1.0 + np.exp(-np.clip(Z, -500, 500)))
        elif self.activation == 'softmax':
            exps = np.exp(Z - np.max(Z, axis=1, keepdims=True))
            A = exps / np.sum(exps, axis=1, keepdims=True)
        else:  # linear
            A = Z
        self.cache['Z'] = Z
        self.cache['A'] = A
        return A

    def backward(self, dA, lr, reg_lambda=0.0):
        X = self.cache['X']
        Z = self.cache['Z']
        batch_size = X.shape[0]
        
        # Activation derivative
        if self.activation == 'relu':
            dZ = dA * (Z > 0).astype(float)
        elif self.activation == 'sigmoid':
            s = self.cache['A']
            dZ = dA * s * (1 - s)
        elif self.activation == 'softmax':
            dZ = dA  # assumes cross-entropy loss combined
        else:
            dZ = dA
        
        dW = np.dot(X.T, dZ) / batch_size + 2 * reg_lambda * self.W
        db = np.sum(dZ, axis=0, keepdims=True) / batch_size
        dX = np.dot(dZ, self.W.T)
        
        # Update
        self.W -= lr * dW
        self.b -= lr * db
        
        return dX

class ModularNetwork:
    def __init__(self, layers):
        self.layers = layers
    
    def forward(self, X):
        out = X
        for layer in self.layers:
            out = layer.forward(out)
        return out
    
    def backward(self, y_true, y_pred, lr, reg_lambda=0.0):
        # For softmax + cross-entropy, gradient is y_pred - y_true
        dA = y_pred - y_true
        for layer in reversed(self.layers):
            dA = layer.backward(dA, lr, reg_lambda)

# Build a 3-layer network
np.random.seed(42)
net = ModularNetwork([
    Layer(784, 256, 'relu'),
    Layer(256, 128, 'relu'),
    Layer(128, 10, 'softmax')
])

# Quick test
X = np.random.randn(16, 784)
y = np.zeros((16, 10))
y[np.arange(16), np.random.randint(0, 10, 16)] = 1

y_pred = net.forward(X)
loss = cross_entropy_loss(y_pred, y)
print(f"Initial loss: {loss:.4f}")

net.backward(y, y_pred, lr=0.1, reg_lambda=0.001)
y_pred2 = net.forward(X)
loss2 = cross_entropy_loss(y_pred2, y)
print(f"Loss after one update: {loss2:.4f}")

Output

Initial loss: 2.3026

Loss after one update: 2.3015

Mental Model

Layers are composable functions

Think of each layer as a differentiable function that transforms its input. The network is just a composition of these functions. Backprop is the chain rule applied to this composition.

📊 Production Insight

Use a modular design from day one. Even if you only need a 2-layer network now, you'll inevitably want to experiment with deeper architectures. A Layer class with forward/backward methods makes experimentation trivial. Also, always include L2 regularization—it's free and almost always helps.

🎯 Key Takeaway

Extending a neural network means stacking more layers, choosing appropriate activations (ReLU for hidden, softmax for output), and adding regularization (L2, dropout). A modular design with a Layer class makes this trivial and mirrors how production frameworks work.

● Production incidentPOST-MORTEMseverity: high

The Silent NaN: A Gradient Explosion in a Recommendation System

Symptom

Model predictions became NaN after 3 training epochs, causing the recommendation API to return errors.

Assumption

The team assumed the issue was a bug in the new data pipeline, so they spent days debugging data preprocessing.

Root cause

A new feature with large values (user session counts up to 10^6) was not normalized. The first layer weights received huge gradients, causing them to explode into NaN after a few updates.

Fix

Added feature normalization (standard scaling) and gradient clipping (max norm 1.0) to the training loop. Also added a NaN check in the loss computation to fail fast.

Key lesson

Always normalize input features, especially when adding new ones.
Implement gradient clipping as a safety net in production training pipelines.
Add monitoring for NaN/inf in loss and gradients; fail fast rather than silently producing garbage.

Production debug guideCommon symptoms and immediate actions when your NumPy network misbehaves4 entries

Symptom · 01

Loss is NaN after a few iterations

→

Fix

Check for exploding gradients: reduce learning rate, add gradient clipping, verify weight initialization

Symptom · 02

Loss decreases but then plateaus at a high value

→

Fix

Check for vanishing gradients: use ReLU instead of sigmoid, add batch normalization, try a different optimizer

Symptom · 03

Loss oscillates wildly

→

Fix

Reduce learning rate, increase batch size, check for data normalization issues

Symptom · 04

Model predicts the same value for all inputs

→

Fix

Check if all weights are zero or identical; reinitialize weights randomly, verify forward pass logic

★ Quick Debug Cheat Sheet for NumPy Neural NetworksThree common failure modes and immediate steps to diagnose and fix them.

Loss is NaN−

Immediate action

Stop training. Check for NaNs in input data and weights.

Commands

np.isnan(X).any()

np.isnan(W1).any()

Fix now

Add gradient clipping: np.clip(grad, -1, 1) and reduce learning rate by 10x.

Loss not decreasing+

Model overfits (low train loss, high val loss)+

Neural Network Implementation Approaches

Aspect	NumPy from Scratch	PyTorch	TensorFlow/Keras
Learning Curve	Steep; requires understanding of math	Moderate; autograd simplifies	Low; high-level API
Performance	Good for small nets; slow for large	Excellent; GPU support	Excellent; GPU/TPU support
Debugging	Full control; manual gradient checks	Easy with autograd and hooks	Moderate; some abstraction
Production Readiness	Not recommended; no optimizations	High; used in production	High; widely deployed
Flexibility	Maximum; you control everything	High; customizable	Moderate; some constraints

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
iothecodeforgeneuron.py	def sigmoid(x):	The Neuron
iothecodeforgefeedforward.py	def sigmoid(x):	From Neuron to Layer
iothecodeforgelosses.py	def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:	The Loss Function
iothecodeforgebackprop.py	def sigmoid(x):	Backpropagation
iothecodeforgeneural_networkvectorized_backprop.py	def sigmoid(x):	Vectorized Backprop
iothecodeforgeneural_networktraining_loop.py	from sklearn.datasets import make_moons	Training Loop
iothecodeforgeneural_networkgradient_check.py	def gradient_check(network, X, y, epsilon=1e-5):	Production Considerations
iothecodeforgeneural_networkmodular_network.py	class Layer:	Extending the Network

Key takeaways

A neural network is a composition of linear transformations and non-linear activations.

Backpropagation is just the chain rule applied to compute gradients efficiently.

Vectorization with NumPy is critical for performance; avoid Python loops over neurons.

Numerical stability (e.g., log-sum-exp trick) is mandatory in production.

Always validate gradients with numerical gradient checking before trusting your backprop.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Explain how backpropagation works in a neural network with one hidden la...

Q02JUNIOR

What is the difference between stochastic gradient descent (SGD) and bat...

Q03SENIOR

How would you debug a neural network that outputs NaN after a few traini...

Q01 of 03SENIOR

Explain how backpropagation works in a neural network with one hidden layer. Write the update rules for the weights.

ANSWER

Backpropagation computes gradients of the loss with respect to each weight using the chain rule. For a network with input x, hidden layer with weights W1 and bias b1, ReLU activation, output layer with weights W2 and bias b2, and sigmoid output, the forward pass is: z1 = W1·x + b1, a1 = ReLU(z1), z2 = W2·a1 + b2, y_hat = sigmoid(z2). The loss L = -[y log(y_hat) + (1-y) log(1-y_hat)]. Gradients: dL/dz2 = y_hat - y, dL/dW2 = dL/dz2 · a1^T, dL/db2 = dL/dz2, dL/da1 = W2^T · dL/dz2, dL/dz1 = dL/da1 ReLU'(z1), dL/dW1 = dL/dz1 · x^T, dL/db1 = dL/dz1. Update: W1 -= lr dL/dW1, etc.

FAQ · 4 QUESTIONS

Frequently Asked Questions

Why build a neural network from scratch when we have PyTorch/TensorFlow?

What is backpropagation in simple terms?

Why do we need activation functions?

What is the vanishing gradient problem?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

✓ Verified

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

🔥

That's From Scratch. Mark it forged?

10 min read · try the examples if you haven't