
Neural Networks Explained: Architecture, Training, and Real-World Use

In Plain English 🔥
Imagine you're teaching a child to recognise dogs. You don't hand them a rulebook — you show them thousands of pictures and say 'dog' or 'not dog' until they just get it. A neural network learns exactly the same way: you feed it examples, it makes guesses, you tell it how wrong it was, and it quietly adjusts itself until the guesses get really good. The 'network' part just means thousands of tiny decision-makers (neurons) passing signals to each other, the same way your brain cells do.

Every time Netflix recommends a show you end up binge-watching, or your phone unlocks just by looking at you, a neural network made that call. These models now power fraud detection at banks, real-time translation across 100+ languages, and cancer-detection tools that outperform radiologists in controlled studies. Neural networks aren't a niche research topic — they're the backbone of most production AI you interact with daily.

The problem they solve is deceptively simple to state: some patterns are too complex for humans to describe with explicit rules. You can't write an if-else tree that reliably identifies a cat in every possible photo — different lighting, angles, breeds, and backgrounds make hand-coded rules collapse. Neural networks sidestep that entirely. Instead of you writing the rules, the network learns them from data by adjusting millions of internal parameters until its predictions match reality closely enough to be useful.

By the end of this article you'll understand exactly what a neuron computes, why networks need multiple layers, how weights get updated during training, and you'll have a fully working neural network built from scratch in Python — no ML frameworks, just NumPy. You'll walk away able to explain forward propagation, backpropagation, and activation functions to both a colleague and an interviewer.

What a Single Neuron Actually Computes (And Why That's Not Enough)

A single artificial neuron does something embarrassingly simple: it takes a list of numbers (inputs), multiplies each one by a corresponding weight, adds them all up, adds a bias value, then squishes the result through an activation function. That's it.

The weights represent how important each input is. A neuron predicting house price might receive square footage and number of bedrooms as inputs — if it learns that square footage matters more, that weight will be larger. The bias lets the neuron shift its output independently of the inputs, like adjusting the baseline before the data even arrives.

So why isn't one neuron enough? Because a single neuron thresholds one weighted sum, so no matter which activation you choose, its decision boundary is a single straight line (a hyperplane in higher dimensions). It can separate dogs from cats only if the two classes are linearly separable — and almost nothing in the real world is. You need layers of neurons so the network can learn curved, jagged, non-linear boundaries by composing many simple decisions together. Each layer learns to see a more abstract version of the previous layer's output.

single_neuron.py · PYTHON
import numpy as np

def sigmoid(raw_output):
    """Squishes any number into the range (0, 1).
    Values far below zero approach 0; far above zero approach 1.
    This lets the neuron express 'confidence' as a probability."""
    return 1 / (1 + np.exp(-raw_output))

def single_neuron_forward(inputs, weights, bias):
    """Simulates one forward pass through a single neuron."""
    # Dot product: multiply each input by its weight, then sum them all.
    # This is the core of what a neuron does before activation.
    weighted_sum = np.dot(inputs, weights) + bias

    # The activation function decides: how 'fired up' is this neuron?
    activated_output = sigmoid(weighted_sum)
    return weighted_sum, activated_output

# --- Example: predicting whether a house is 'expensive' ---
# Input features: [square_footage_scaled, num_bedrooms_scaled]
house_inputs   = np.array([0.85, 0.60])   # normalised to 0-1 range
initial_weights = np.array([0.40, 0.35])  # learned during training
bias_term       = -0.20                   # shifts the decision threshold

raw, output = single_neuron_forward(house_inputs, initial_weights, bias_term)

print(f"Weighted sum (before activation): {raw:.4f}")
print(f"Sigmoid output (prediction):      {output:.4f}")
print(f"Interpretation: {output*100:.1f}% confidence the house is expensive")
▶ Output
Weighted sum (before activation): 0.3500
Sigmoid output (prediction): 0.5866
Interpretation: 58.7% confidence the house is expensive
🔥
Why Normalise Inputs? — Neural networks train dramatically faster when input features are scaled to a similar range (0–1 or mean=0, std=1). Without normalisation, features with large raw values (like square footage in the thousands) dominate the weight updates and the network can take hundreds of extra epochs to converge — or never converge at all.

Stacking Layers: How Depth Creates Intelligence

A single neuron learns one linear combination of inputs. Stack them into a layer and you get many different linear combinations simultaneously. Stack multiple layers and something remarkable happens: each layer's output becomes the next layer's input, so later layers learn to recognise combinations of combinations — patterns built on top of patterns.
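Why the activation function matters for depth can be shown directly: without it, stacking layers buys nothing, because a chain of matrix multiplications collapses into a single matrix. A minimal sketch (biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function: each is just a matrix multiply.
W1 = rng.standard_normal((4, 2))   # layer 1: 2 inputs -> 4 neurons
W2 = rng.standard_normal((1, 4))   # layer 2: 4 neurons -> 1 output

x = rng.standard_normal((2, 5))    # 5 sample points, 2 features each

# Passing data through both layers...
two_layer_output = W2 @ (W1 @ x)

# ...is identical to one layer whose weight matrix is the product W2·W1.
collapsed_output = (W2 @ W1) @ x

print(np.allclose(two_layer_output, collapsed_output))  # True
```

This is exactly why a network with only linear activations reduces to linear regression however deep it is — the non-linearity between layers is what makes depth count.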

In an image-recognition network, the first layer might learn to detect edges. The second layer combines edges into shapes. The third combines shapes into object parts (an ear, a wheel). The final layer combines parts into categories ('cat', 'car'). Nobody told it to do this — it discovers this hierarchy automatically because that structure is genuinely useful for reducing prediction error.

This is the core intuition behind deep learning: depth allows the network to build increasingly abstract representations of data. Shallow networks (one or two layers) can theoretically approximate any function, but they'd need an astronomically wide layer to do it. Depth is the practical shortcut: a few layers of 64 neurons each can represent functions that a single hidden layer would need vastly more neurons to match.

Any layer between the input and the output is called a hidden layer — hidden because you never directly observe its activations. It's the network's internal scratchpad.

neural_network_from_scratch.py · PYTHON
import numpy as np

np.random.seed(42)  # Makes results reproducible — critical for debugging

# ── Activation functions ──────────────────────────────────────────────────────

def sigmoid(z):
    """Maps any real number to (0, 1). Good for output layers in binary tasks."""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Gradient of sigmoid — needed during backpropagation.
    Note: we pass the pre-activation value z, not the activated output."""
    s = sigmoid(z)
    return s * (1 - s)  # Derivative of sigmoid is elegantly self-referential

def relu(z):
    """Rectified Linear Unit: max(0, z). Fast, simple, works great in hidden layers."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Gradient of ReLU: 1 where input > 0, else 0."""
    return (z > 0).astype(float)

# ── Network initialisation ────────────────────────────────────────────────────

def initialise_network(layer_sizes):
    """Creates weight matrices and bias vectors for a network of any depth.

    layer_sizes: list of integers, e.g. [2, 4, 4, 1] means
                 2 inputs -> hidden layer of 4 -> hidden layer of 4 -> 1 output.

    We use He initialisation (weights scaled by sqrt(2/n_inputs)) because
    it keeps gradients from vanishing or exploding in deep ReLU networks.
    """
    parameters = {}

    for layer_idx in range(1, len(layer_sizes)):
        fan_in   = layer_sizes[layer_idx - 1]  # neurons feeding INTO this layer
        fan_out  = layer_sizes[layer_idx]       # neurons IN this layer

        # He initialisation: variance = 2 / fan_in
        parameters[f"W{layer_idx}"] = np.random.randn(fan_out, fan_in) * np.sqrt(2 / fan_in)
        parameters[f"b{layer_idx}"] = np.zeros((fan_out, 1))  # biases start at zero

    return parameters

# ── Forward propagation ───────────────────────────────────────────────────────

def forward_pass(input_data, parameters, num_layers):
    """Passes input through every layer and returns predictions + a cache.

    The cache stores pre-activation values (Z) and activations (A) for each
    layer — we'll need them during backpropagation to compute gradients.
    """
    cache      = {"A0": input_data}  # A0 is just the raw input
    current_A  = input_data

    for layer_idx in range(1, num_layers + 1):
        W = parameters[f"W{layer_idx}"]
        b = parameters[f"b{layer_idx}"]

        # Linear step: Z = W·A_prev + b
        Z = np.dot(W, current_A) + b
        cache[f"Z{layer_idx}"] = Z  # save for backprop

        # Activation step: use ReLU for hidden layers, sigmoid for output
        if layer_idx < num_layers:
            current_A = relu(Z)      # hidden layers stay non-linear with ReLU
        else:
            current_A = sigmoid(Z)   # output layer gives probability [0,1]

        cache[f"A{layer_idx}"] = current_A

    prediction = current_A  # final layer output
    return prediction, cache

# ── Loss function ─────────────────────────────────────────────────────────────

def binary_cross_entropy_loss(predictions, true_labels):
    """Measures how wrong our predictions are for a binary classification task.

    Cross-entropy penalises confident wrong predictions very heavily —
    far more than mean squared error would. This is WHY we use it for
    classification rather than regression losses.
    """
    epsilon = 1e-15  # prevents log(0), which is undefined

    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    loss = -np.mean(
        true_labels * np.log(predictions) +
        (1 - true_labels) * np.log(1 - predictions)
    )
    return loss

# ── Backpropagation ───────────────────────────────────────────────────────────

def backward_pass(predictions, true_labels, cache, parameters, num_layers):
    """Computes gradients for every weight and bias using the chain rule.

    Backpropagation works BACKWARDS from the output error, distributing
    blame to each weight based on how much it contributed to the mistake.
    """
    gradients   = {}
    num_samples = true_labels.shape[1]

    # Gradient of the loss with respect to the output-layer ACTIVATION.
    # (The sigmoid part of the chain rule is applied in the loop below via
    # sigmoid_derivative — combined, the two simplify to predictions - true_labels.)
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)  # avoid division by zero
    dA_output = -(true_labels / predictions) + (1 - true_labels) / (1 - predictions)

    current_dA = dA_output

    for layer_idx in reversed(range(1, num_layers + 1)):
        Z      = cache[f"Z{layer_idx}"]
        A_prev = cache[f"A{layer_idx - 1}"]
        W      = parameters[f"W{layer_idx}"]

        # Gradient through activation function
        if layer_idx == num_layers:
            dZ = current_dA * sigmoid_derivative(Z)
        else:
            dZ = current_dA * relu_derivative(Z)

        # Gradients for weights, biases, and the previous layer's activation
        dW             = np.dot(dZ, A_prev.T) / num_samples
        db             = np.sum(dZ, axis=1, keepdims=True) / num_samples
        dA_prev        = np.dot(W.T, dZ)

        gradients[f"dW{layer_idx}"] = dW
        gradients[f"db{layer_idx}"] = db
        current_dA                  = dA_prev  # propagate error to previous layer

    return gradients

# ── Weight update (Gradient Descent) ─────────────────────────────────────────

def update_weights(parameters, gradients, learning_rate, num_layers):
    """Nudges every weight in the direction that reduces loss.

    learning_rate controls step size — too large and we overshoot the minimum,
    too small and training takes forever. 0.01 is a safe starting point.
    """
    for layer_idx in range(1, num_layers + 1):
        parameters[f"W{layer_idx}"] -= learning_rate * gradients[f"dW{layer_idx}"]
        parameters[f"b{layer_idx}"] -= learning_rate * gradients[f"db{layer_idx}"]
    return parameters

# ── Full training loop ────────────────────────────────────────────────────────

def train(input_data, true_labels, layer_sizes, learning_rate=0.01, num_epochs=1000):
    """Ties everything together: initialise, forward, compute loss, backward, update."""
    num_layers = len(layer_sizes) - 1
    parameters = initialise_network(layer_sizes)

    for epoch in range(num_epochs):
        predictions, cache = forward_pass(input_data, parameters, num_layers)
        loss               = binary_cross_entropy_loss(predictions, true_labels)
        gradients          = backward_pass(predictions, true_labels, cache, parameters, num_layers)
        parameters         = update_weights(parameters, gradients, learning_rate, num_layers)

        if epoch % 200 == 0:
            print(f"Epoch {epoch:>5} | Loss: {loss:.4f}")

    return parameters

# ── XOR problem: the classic test for multi-layer networks ────────────────────
# XOR cannot be solved by a single neuron (it's not linearly separable).
# A two-layer network solves it easily — great proof that depth adds power.

XOR_inputs  = np.array([[0, 0, 1, 1],
                        [0, 1, 0, 1]])  # shape: (2 features, 4 samples)
XOR_outputs = np.array([[0, 1, 1, 0]])   # shape: (1 output, 4 samples)

# Architecture: 2 inputs -> 4 hidden neurons -> 4 hidden neurons -> 1 output
architecture = [2, 4, 4, 1]

print("=== Training Neural Network on XOR Problem ===")
trained_params = train(
    input_data   = XOR_inputs,
    true_labels  = XOR_outputs,
    layer_sizes  = architecture,
    learning_rate= 0.1,
    num_epochs   = 1001
)

# Final predictions
final_preds, _ = forward_pass(XOR_inputs, trained_params, len(architecture) - 1)
print("\n=== Final Predictions ===")
for i in range(4):
    input_pair = XOR_inputs[:, i]
    predicted  = final_preds[0, i]
    expected   = XOR_outputs[0, i]
    print(f"Input: {input_pair} | Expected: {expected} | Predicted: {predicted:.4f}")
▶ Output
=== Training Neural Network on XOR Problem ===
Epoch 0 | Loss: 0.7193
Epoch 200 | Loss: 0.6821
Epoch 400 | Loss: 0.4912
Epoch 600 | Loss: 0.1823
Epoch 800 | Loss: 0.0621
Epoch 1000 | Loss: 0.0287

=== Final Predictions ===
Input: [0 0] | Expected: 0 | Predicted: 0.0312
Input: [0 1] | Expected: 1 | Predicted: 0.9701
Input: [1 0] | Expected: 1 | Predicted: 0.9698
Input: [1 1] | Expected: 0 | Predicted: 0.0289
⚠️
Pro Tip: Why XOR is the Perfect Test Case — XOR is not linearly separable: you literally cannot draw a single straight line to separate the 0s from the 1s in 2D space. A single-neuron model will never solve it. If your multi-layer network solves XOR cleanly (predictions above 0.95 for 1s and below 0.05 for 0s), you've confirmed that backpropagation and your layer structure are both working correctly. It's the unit test of neural network implementations.

Backpropagation Demystified: How the Network Learns from Its Mistakes

Backpropagation sounds intimidating but it's just the chain rule from calculus applied systematically. Here's the intuition: after every forward pass, the network has a prediction and you have the truth. The difference is the error. Backpropagation asks a simple question for every single weight: 'If I nudge this weight slightly, does the error go up or down?'

That question is answered by computing a gradient — a number that tells you both the direction and steepness of the error landscape with respect to that weight. If the gradient is positive, increasing the weight increases error, so you decrease it. If negative, you increase it. You adjust every weight proportionally to its gradient, scaled by the learning rate.

The 'backward' part is crucial. You start at the output error and propagate blame backwards through each layer using the chain rule — the error at layer 3 depends on layer 2's weights, which depend on layer 1's weights, and so on. This is why caching the intermediate values during the forward pass matters: you need them to compute those gradients efficiently without re-running the whole network.

Gradient descent is the engine; backpropagation is just the efficient method for computing the gradients that gradient descent needs.
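A practical way to trust a backprop implementation is a finite-difference check: nudge each weight by a tiny amount, re-run the forward pass, and see whether the loss changes at the rate the analytic gradient predicts. Here is a self-contained sketch on a single sigmoid neuron — the helper names and numbers are illustrative, not part of the network above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_for_weights(weights, inputs, target):
    """Binary cross-entropy of a single sigmoid neuron (no bias, for brevity)."""
    prediction = sigmoid(np.dot(weights, inputs))
    return -(target * np.log(prediction) + (1 - target) * np.log(1 - prediction))

inputs  = np.array([0.85, 0.60])
weights = np.array([0.40, 0.35])
target  = 1.0

# Analytic gradient: for sigmoid + cross-entropy it simplifies to (p - y) * x
prediction    = sigmoid(np.dot(weights, inputs))
analytic_grad = (prediction - target) * inputs

# Numerical gradient: central difference, one weight at a time
h = 1e-6
numeric_grad = np.zeros_like(weights)
for i in range(len(weights)):
    bumped_up, bumped_dn = weights.copy(), weights.copy()
    bumped_up[i] += h
    bumped_dn[i] -= h
    numeric_grad[i] = (loss_for_weights(bumped_up, inputs, target)
                       - loss_for_weights(bumped_dn, inputs, target)) / (2 * h)

print("analytic:", analytic_grad)
print("numeric: ", numeric_grad)
print("close?  ", np.allclose(analytic_grad, numeric_grad, atol=1e-6))  # True
```

If the two gradients disagree, the bug is almost always in the backward pass, since the numerical estimate only depends on the forward pass being correct.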

gradient_descent_visualised.py · PYTHON
import numpy as np

# ── Demonstrating gradient descent on a simple 1D loss surface ────────────────
# Imagine weight_value is ONE weight in the network.
# The loss function below is a simplified stand-in for real cross-entropy loss.
# Goal: find the weight value that minimises loss.

def simplified_loss(weight_value):
    """A bowl-shaped loss surface. Minimum is at weight_value = 2.0.
    In real networks, the loss surface is millions of dimensions — same idea."""
    return (weight_value - 2.0) ** 2 + 0.5

def loss_gradient(weight_value):
    """The derivative of the loss with respect to the weight.
    Tells us which direction to nudge the weight to reduce loss.
    For (w-2)^2 + 0.5, the derivative is 2*(w-2)."""
    return 2 * (weight_value - 2.0)

# Start with a random weight far from optimal
current_weight = -3.0
learning_rate  = 0.15
num_steps      = 20

print("=== Gradient Descent — Watching a Weight Find Its Optimal Value ===")
print(f"{'Step':>5} | {'Weight':>10} | {'Loss':>10} | {'Gradient':>10}")
print("-" * 46)

for step in range(num_steps):
    current_loss     = simplified_loss(current_weight)
    gradient         = loss_gradient(current_weight)

    # The core update rule: move OPPOSITE to the gradient
    # Negative gradient means loss decreases as weight increases — so increase weight
    current_weight  -= learning_rate * gradient  # w = w - lr * dL/dw

    if step % 4 == 0 or step == num_steps - 1:
        print(f"{step:>5} | {current_weight:>10.4f} | {current_loss:>10.4f} | {gradient:>10.4f}")

print(f"\nFinal weight: {current_weight:.4f} (target was 2.0)")
print(f"Final loss:   {simplified_loss(current_weight):.6f} (minimum is 0.5)")
▶ Output
=== Gradient Descent — Watching a Weight Find Its Optimal Value ===
Step | Weight | Loss | Gradient
----------------------------------------------
0 | -1.5000 | 25.5000 | -10.0000
4 | 1.1597 | 1.9412 | -2.4010
8 | 1.7982 | 0.5831 | -0.5765
12 | 1.9516 | 0.5048 | -0.1384
16 | 1.9884 | 0.5003 | -0.0332
19 | 1.9960 | 0.5000 | -0.0114

Final weight: 1.9960 (target was 2.0)
Final loss: 0.500016 (minimum is 0.5)
⚠️
Watch Out: The Vanishing Gradient Problem — When networks get very deep (10+ layers), gradients can shrink to near-zero as they propagate backward through sigmoid activations, because sigmoid's derivative maxes out at 0.25. By the time the gradient reaches early layers, it's effectively zero and those weights stop learning. This is why modern deep networks use ReLU in hidden layers (derivative is either 0 or 1, never shrinks) and techniques like batch normalisation and residual connections.
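The shrinkage described in the warning is easy to quantify: the backward pass multiplies in one activation derivative per layer, and sigmoid's derivative never exceeds 0.25, so the product decays geometrically with depth. A quick sketch of the best case, where every sigmoid layer contributes its maximum 0.25:

```python
# Sigmoid's derivative s(z)*(1-s(z)) peaks at 0.25 (when z = 0), so a gradient
# flowing back through n sigmoid layers is scaled by AT MOST 0.25**n.
max_sigmoid_grad = 0.25

for depth in [2, 5, 10, 20]:
    attenuation = max_sigmoid_grad ** depth
    print(f"{depth:>2} sigmoid layers: gradient scaled by at most {attenuation:.2e}")
```

Ten sigmoid layers already scale the gradient by less than one millionth even in the best case. ReLU's derivative is exactly 1 for positive inputs, so along an active ReLU path the same product stays at 1.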
| Aspect | Sigmoid Activation | ReLU Activation |
| --- | --- | --- |
| Formula | 1 / (1 + e^-z) | max(0, z) |
| Output range | Strictly (0, 1) | [0, +infinity) |
| Best used in | Output layer — binary classification | Hidden layers — almost everything else |
| Vanishing gradient risk | High — derivative maxes out at 0.25 | Low — derivative is 1 for all positive inputs |
| Computational cost | Moderate (needs exp()) | Extremely cheap (just a max operation) |
| Dead neuron problem | No — always has a gradient | Yes — neurons stuck at 0 stop learning |
| Training speed | Slower convergence | Faster convergence in practice |
| When to pick it | You need a probability output | Default choice for hidden layers |

🎯 Key Takeaways

  • A neuron computes a weighted sum of inputs plus a bias, then passes the result through an activation function — depth stacks these simple operations into powerful non-linear representations.
  • Backpropagation is just the chain rule applied backwards from output error to input weights, computing how much each weight contributed to the mistake so gradient descent can correct it.
  • ReLU in hidden layers and sigmoid only at the output is the default architecture choice for binary classification — not arbitrary, but grounded in the vanishing gradient problem.
  • The XOR problem proves mathematically that depth matters: no single-layer network can solve it, but two hidden layers crack it easily — this is your intuition check for why we stack layers.

⚠ Common Mistakes to Avoid

  • Mistake 1: Not normalising input features — Symptom: loss oscillates wildly or explodes to NaN in the first few epochs — Fix: scale all inputs to have mean ≈ 0 and standard deviation ≈ 1 using sklearn's StandardScaler, or manually subtract the mean and divide by std. Features on different scales cause weight updates to be massively uneven.
  • Mistake 2: Using sigmoid activation in hidden layers — Symptom: deep network trains slowly, early layers barely update, accuracy plateaus prematurely — Fix: switch hidden layer activations to ReLU. Only use sigmoid in the final output layer for binary classification. Sigmoid's gradient ceiling of 0.25 kills learning speed in multi-layer networks.
  • Mistake 3: Setting the learning rate too high — Symptom: loss decreases for a few epochs then suddenly jumps to a huge number or NaN — Fix: start with 0.001 or 0.01. If loss explodes, divide learning rate by 10. A good diagnostic: plot loss vs epochs. Healthy training shows a smooth, monotonically decreasing curve. Spikes mean your learning rate is too aggressive.
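For Mistake 1, the fix is only a few lines of NumPy if you'd rather not pull in scikit-learn. A minimal sketch — the feature matrix here is made up for illustration:

```python
import numpy as np

# Hypothetical raw features: [square_footage, num_bedrooms] on very different scales
raw_features = np.array([[1850.0, 3.0],
                         [2400.0, 4.0],
                         [ 980.0, 2.0],
                         [3100.0, 5.0]])

# Standardise each column to mean 0, standard deviation 1
feature_means = raw_features.mean(axis=0)
feature_stds  = raw_features.std(axis=0)
standardised  = (raw_features - feature_means) / feature_stds

print("Column means after scaling:", standardised.mean(axis=0))  # ~[0, 0]
print("Column stds after scaling: ", standardised.std(axis=0))   # [1, 1]
```

One caveat: compute the means and stds on the training set only, and reuse those same statistics to scale validation and test data — fitting fresh statistics on test data leaks information.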

Interview Questions on This Topic

  • Q: Explain backpropagation without using the phrase 'chain rule'. What is it actually doing to each weight, and why does the direction of traversal matter?
  • Q: Why does a neural network with no activation functions (or only linear ones) reduce to simple linear regression no matter how many layers you add?
  • Q: A colleague says their deep network's early layers aren't learning anything — the gradients are basically zero. What are three possible causes and how would you diagnose each one?

Frequently Asked Questions

How many hidden layers does a neural network need?

For most tabular data problems, 1–3 hidden layers is sufficient. More layers help when learning hierarchical structure (like in images or audio). Adding layers beyond what the problem requires usually just adds training time and overfitting risk without improving accuracy. Start shallow and add depth only if your model underfits.

What's the difference between deep learning and machine learning?

Machine learning is the broad field of algorithms that learn from data — it includes decision trees, SVMs, linear regression, and neural networks. Deep learning is specifically the subset that uses neural networks with many layers (typically more than two hidden layers). All deep learning is machine learning, but most machine learning is not deep learning.

Why do neural networks need so much data compared to traditional ML models?

A neural network with millions of parameters needs enough examples to constrain all those parameters to meaningful values — with too little data, it memorises the training examples instead of learning the underlying pattern (overfitting). Traditional models like decision trees have far fewer parameters, so they can generalise from smaller datasets. As a rough guide, aim for at least 10x more training examples than you have parameters.
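To make the parameters-vs-examples ratio concrete, here is a small sketch that counts the parameters in a fully connected network: each layer contributes fan_in × fan_out weights plus fan_out biases. The second architecture is a hypothetical MNIST-sized example for comparison:

```python
def count_parameters(layer_sizes):
    """Total number of weights + biases in a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

print(count_parameters([2, 4, 4, 1]))        # the tiny XOR network: 37 parameters
print(count_parameters([784, 128, 64, 10]))  # a small image classifier: 109386
```

Even a modest image network crosses 100,000 parameters, which is why the rough 10x guideline pushes deep learning toward large datasets.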

🔥
TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
