Build an Autograd Engine from Scratch: Micrograd to Production
Learn how to build a reverse-mode autograd engine from scratch using Python.
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
- Autograd engines compute gradients automatically via reverse-mode automatic differentiation over a dynamic computation graph.
- The core data structure is a DAG where each node stores data, gradient, and operation children.
- Backpropagation recursively applies the chain rule from output to inputs, accumulating gradients.
- Micrograd implements this in ~100 lines of Python for scalar values, enough to train small neural nets.
- Production autograd (PyTorch, TensorFlow) extends this with tensor operations, GPU support, and graph optimizations.
- Understanding autograd internals is critical for debugging gradient issues and optimizing training pipelines.
Think of an autograd engine as a smart accountant that tracks every mathematical operation you perform. When you run a calculation, it builds a family tree of operations. Later, when you ask 'how does changing this input affect the final result?', it walks backward through that tree, applying simple rules at each step to compute the answer. This is exactly how neural networks learn from their mistakes.
Every modern deep learning framework—PyTorch, TensorFlow, JAX—relies on automatic differentiation to train neural networks. Yet most developers treat autograd as a black box, calling .backward() without understanding what happens under the hood. In 2026, with models growing larger and training pipelines more complex, this ignorance is a liability. Gradient vanishing, exploding gradients, and incorrect gradient accumulation are common bugs that require deep understanding of the autograd mechanism.
Building an autograd engine from scratch is the single best way to demystify backpropagation. The canonical implementation, Andrej Karpathy's micrograd, does this in about 100 lines of Python for scalar values. It's minimal, educational, and captures the essence of reverse-mode autodiff. But micrograd is a toy—it doesn't handle tensors, batching, or GPU acceleration. Understanding its limitations is as important as understanding its design.
This article walks through building a micrograd-like engine, then extends the discussion to production-grade autograd systems. We'll cover the DAG construction, the backward pass, gradient accumulation, and common pitfalls. You'll learn why PyTorch's autograd is more complex than micrograd, and how to debug gradient issues in real-world models.
By the end, you'll have a working autograd engine and the mental model to reason about gradient flow in any framework. This is not just academic—it's the foundation for debugging training failures, implementing custom operations, and optimizing memory usage in large-scale ML systems.
What is an Autograd Engine? Core Concepts and Why Build One
An autograd engine is the computational backbone of modern deep learning frameworks. It automates the calculation of gradients—the partial derivatives of a scalar loss with respect to every parameter in a model—using reverse-mode automatic differentiation. Instead of manually deriving and coding gradients for each operation, you define a forward pass that builds a directed acyclic graph (DAG) of operations, and the engine traverses this graph backward to compute gradients via the chain rule. This is the same mechanism powering PyTorch's autograd and TensorFlow's GradientTape, but at a fundamental level it operates on scalar values, making it an ideal teaching tool.
Why build one from scratch? Because understanding autograd demystifies how gradients flow through neural networks. When you implement the core logic—tracking operations, storing gradients, and accumulating them during backpropagation—you internalize why gradient descent works and how frameworks like PyTorch handle it efficiently. The micrograd library, for example, achieves this in roughly 100 lines of Python, proving that the concept is simpler than it appears. Building your own engine gives you the confidence to debug gradient issues, optimize custom layers, and appreciate the engineering behind production systems.
At its heart, autograd relies on two key ideas: (1) every tensor (or Value object) stores its data, its gradient (initialized to zero), and references to its children (the inputs that produced it); (2) the backward pass computes gradients by applying the chain rule in topological order, ensuring each node's gradient is fully accumulated before moving to its parents. This approach scales to arbitrary computation graphs, from a single neuron to deep networks with millions of parameters.
In production, autograd engines are highly optimized—they use C++ backends, kernel fusion, and memory-efficient gradient checkpointing. But the scalar version you'll build here captures the essence: a DAG where each node knows how to compute its local derivative and propagate it backward. This foundation is what enables training loops to work with a simple call to .backward(), and it's what you'll implement step by step.
The Value Object: Data, Gradient, and Children
The Value object is the atomic unit of an autograd engine. It wraps a scalar number and three critical components: data (the actual value), grad (the gradient, initialized to 0.0), and _prev (a set of parent Values that produced it). Additionally, it stores a _backward function—a closure that computes the local gradient contribution during backpropagation. This design mirrors how PyTorch's Tensor stores data and grad, but at a scalar level, making the mechanics transparent.
When you create a Value, its gradient is zero because no backward pass has run. The _prev set tracks the computation graph: for a = Value(2.0) and b = Value(3.0), c = a + b will have _prev = {a, b}. This set is essential for topological sorting, ensuring that when you call backward(), the engine processes nodes in the correct order—from the output back to the inputs. The _op string (like '+' or '*') is optional but useful for debugging and visualization.
The _backward function is where the magic happens. For addition, _backward sets the gradient of each parent to the current node's gradient (since d(c)/da = 1 and d(c)/db = 1). For multiplication, it uses the product rule: if c = a * b, then d(c)/da = b and d(c)/db = a. These local derivatives are then accumulated into each parent's grad field. This accumulation is crucial because a node may contribute to multiple outputs (e.g., a parameter used in many neurons), and gradients must sum.
In practice, you'll never manually set _backward for basic operations—you'll define them in the operator overloads (__add__, __mul__, etc.). But understanding the structure is key: every Value is a node in a DAG, and its grad is the partial derivative of the final output with respect to that node. This is the same concept as PyTorch's .grad attribute, and it's why you can call .backward() on a scalar loss and then inspect .grad on any parameter.
optimizer.zero_grad()). In production, forgetting this leads to gradient accumulation across batches, which is rarely desired. For scalar engines, it's automatic on creation, but in loops you must reset manually.Building the Computation Graph: Operations and Topological Order
The computation graph is a directed acyclic graph (DAG) where nodes are Values and edges represent operations. Every time you perform an operation like addition or multiplication, the engine creates a new Value node that records its parents (the inputs) and the operation that produced it. This graph is built dynamically during the forward pass—there's no separate graph definition phase. This dynamic nature is what makes frameworks like PyTorch flexible: you can use Python control flow (if statements, loops) and the graph adapts automatically.
To compute gradients, the engine must traverse the graph in reverse topological order. Topological order means processing nodes such that all children (outputs) are processed before their parents (inputs). For a simple graph like c = a + b, the order is [c, a, b]. For more complex graphs with shared nodes (e.g., a used in multiple operations), topological sorting ensures that when you compute the gradient for a, all contributions from its children have already been accumulated. This is critical because the chain rule requires summing gradients from all paths.
The algorithm for topological sort is straightforward: perform a depth-first search (DFS) from the output node, visiting children first, then adding the current node to a list. This yields a reverse topological order. In code, you'll implement a helper function that recursively visits _prev sets and appends nodes to a list. This list is then iterated in reverse during backward(), calling each node's _backward function. This is exactly what PyTorch's autograd does internally, albeit with a highly optimized C++ implementation.
A common pitfall is forgetting to handle non-differentiable operations (like ReLU at x=0) or operations that break differentiability (like integer indexing). In a scalar engine, you can define subgradients for ReLU (0 at x<0, 1 at x>=0) or use a smooth approximation. For production, frameworks handle these edge cases with custom gradients or stop-gradient operations. The key takeaway: the graph is a DAG, and its topological order guarantees correct gradient propagation.
Implementing the Backward Pass: Chain Rule and Gradient Accumulation
The backward pass is where autograd earns its keep. Starting from the scalar output (usually a loss), it computes the gradient of that output with respect to every node in the graph by applying the chain rule recursively. The algorithm is: (1) set the gradient of the output node to 1.0 (since d(output)/d(output) = 1), (2) traverse the graph in reverse topological order, and (3) for each node, call its _backward function, which accumulates gradients into its parents. This is reverse-mode automatic differentiation, and it computes all gradients in a single forward-backward pass, regardless of the number of inputs.
Gradient accumulation is the key detail. When a node has multiple children (e.g., a parameter used in two different operations), its gradient is the sum of the gradients from each child. This is because the total derivative of the output with respect to that node is the sum of partial derivatives along all paths. In code, this means _backward functions use += to add to parent gradients, not =. For example, if a is used in both c = a * b and d = a + e, then during backward, a.grad will receive contributions from both c and d. This is exactly what PyTorch does, and it's why you must zero gradients before each training iteration.
The chain rule for a simple operation like multiplication: if c = a b, then d(output)/da = b d(output)/dc. So _backward for multiplication sets self.grad += other.data out.grad and other.grad += self.data out.grad. For addition, it's even simpler: self.grad += out.grad and other.grad += out.grad. For more complex operations like ReLU, the derivative is 1 if input > 0, else 0 (subgradient at 0). These local derivatives are the building blocks of any neural network.
A complete backward() method on the Value class orchestrates this: it calls topological_sort to get the order, sets the output's gradient to 1.0, then iterates in reverse, calling each node's _backward. This is the same pattern used in micrograd and PyTorch's Tensor.backward(). After calling backward(), every node's .grad contains the partial derivative of the output with respect to that node. You can then use these gradients in an optimizer like SGD to update parameters.
torch.no_grad() for inference to avoid building the graph. For custom operations, implement backward manually to ensure correctness and efficiency.Extending to Neural Networks: Neuron, Layer, and MLP Classes
The scalar autograd engine gives us gradients, but we need higher-level abstractions to build actual neural networks. The Neuron class encapsulates a weighted sum followed by a nonlinearity. For a neuron with n inputs, we store n weights and a bias, all as Value objects. The forward pass computes w·x + b, then applies tanh (or ReLU). This is the computational unit that, when composed, forms layers and multi-layer perceptrons (MLPs).
The Layer class holds a list of neurons. Its forward pass concatenates each neuron's output into a list. For an MLP, we stack layers: the first layer maps input dimension to hidden size, subsequent layers map hidden to hidden, and the final layer maps to output dimension. Each layer uses the same activation, except the output layer often omits it for regression or uses sigmoid for binary classification.
Critically, because every operation uses our Value class, the entire network's forward pass builds a DAG. When we call backward() on the final loss, gradients flow through every weight and bias automatically. This is the magic: we never write gradient formulas for layers or activations—the autograd engine handles it. The code is shockingly short: Neuron (~10 lines), Layer (~10 lines), MLP (~10 lines).
This design mirrors PyTorch's nn.Module hierarchy but at scalar granularity. Each neuron's weights are independent Value objects, so the graph is fine-grained. For a 2-16-16-1 network, we have (216 + 16) + (1616 + 16) + (16*1 + 1) = 337 scalar parameters, each with its own gradient. This is educational but utterly impractical beyond toy problems.
parameters() method provides a uniform interface for optimization. This pattern—composable modules with a parameters() method—is the same pattern used in PyTorch's nn.Module.backward() computes all gradients. The code is minimal (~30 lines total) but demonstrates the core pattern behind all deep learning frameworks.Training a Binary Classifier: From Autograd to SGD
With the autograd engine and neural network classes, we can train a binary classifier end-to-end. The classic demo uses the moons dataset from sklearn—two interleaving half-circles that are not linearly separable. A 2-layer MLP with 16 hidden units can learn a nonlinear decision boundary.
The training loop is straightforward: for each epoch, compute the forward pass for all examples, calculate a loss, call backward() to get gradients, then update parameters with SGD. The loss function is key: we use the SVM max-margin hinge loss. For each sample, we compute yi (2 model(xi) - 1) where yi is ±1, then take the ReLU of (1 - that value). This encourages the model to produce outputs with magnitude at least 1 for correct classifications. We add L2 regularization by summing the squares of all parameters multiplied by a small coefficient (e.g., 1e-4).
The update rule is simple: for each parameter p, p.data -= learning_rate * p.grad. After each update, we zero the gradients by setting p.grad = 0.0. Without this, gradients accumulate across batches. The learning rate is typically 0.1 to 1.0 for this toy problem. Training for 100 epochs with batch gradient descent (all 100 samples at once) converges to a clean decision boundary.
This is the minimal viable training loop. It demonstrates the entire ML pipeline: data preparation, model definition, loss computation, gradient calculation via autograd, and parameter update via SGD. The same pattern scales to millions of parameters with tensor operations and mini-batches, but the conceptual core is identical.
optimizer.step().optimizer.zero_grad() handles this. Always zero before backward, not after.Limitations of Scalar Autograd: Why Production Systems Need Tensors
Our scalar autograd engine is elegant but fundamentally limited. Each operation creates a new Value object, and the DAG grows linearly with the number of scalar operations. For a single forward pass of a 2-16-16-1 MLP with 100 samples, we create ~100 (216 + 1616 + 161) ≈ 100 * 304 = 30,400 Value objects. Each object stores data, gradient, children, and operation type. Memory overhead is enormous: each Python object has ~56 bytes overhead plus the float (24 bytes) and list references. A single training iteration might use 10+ MB for the graph alone.
Performance is worse. Every arithmetic operation involves Python function calls, dynamic dispatch, and list appends. Compare to PyTorch's tensor operations: a single matrix multiply (w @ x) uses optimized BLAS (Basic Linear Algebra Subprograms) routines written in C/Fortran. For a 16x16 matrix multiply, PyTorch processes 256 multiplications and 240 additions in a single kernel call. Our scalar engine does 496 separate Python operations, each with overhead.
Numerical precision is also a concern. Our engine uses Python floats (64-bit), which is fine. But production systems need mixed precision (FP16/BF16) for memory bandwidth and throughput. Tensor frameworks support automatic mixed precision (AMP) with loss scaling. Our scalar engine cannot do this without rewriting everything.
Finally, GPU acceleration is impossible. GPUs execute thousands of threads in parallel on tensor operations. A single matrix multiply can saturate the GPU's compute units. Our scalar engine runs on one CPU core. For any real dataset (e.g., ImageNet with 1.2M images), training would take years instead of hours.
Production Autograd: Graph Optimization, GPU Kernels, and Debugging
Production autograd systems like PyTorch's autograd engine are vastly more sophisticated than our scalar implementation. They operate on tensors, not scalars. The computation graph is built lazily or eagerly depending on the mode. In eager mode (PyTorch default), operations are recorded in a tape (Wengert list) rather than a tree. This tape stores the forward results and a pointer to the backward function (grad_fn). Memory is reused: tensors can be freed once their gradients are computed.
Graph optimization is critical. PyTorch's JIT compiler (TorchScript) and the newer Dynamo system can trace the computation graph and apply optimizations: operator fusion (combining multiple operations into one kernel), constant folding, and dead code elimination. For example, a sequence of (add, relu, add) can be fused into a single kernel, reducing memory bandwidth and launch overhead. The XLA compiler (used in JAX and TensorFlow) goes further: it compiles the entire computation graph into a single optimized executable, often fusing across backward and forward passes.
GPU kernels are the backbone of production training. CUDA kernels for matrix multiplication (cuBLAS), convolutions (cuDNN), and attention (FlashAttention) are hand-tuned by NVIDIA engineers. They exploit tensor cores (specialized hardware for matrix multiply-accumulate), shared memory, and warp-level primitives. Writing efficient kernels requires understanding memory coalescing, occupancy, and instruction-level parallelism. Libraries like Triton make this more accessible, but the bar is high.
Debugging production autograd systems is a distinct skill. Common issues: gradient explosion (use gradient clipping), vanishing gradients (use proper initialization and activation functions), and NaN gradients (check for division by zero or log of zero). Tools like torch.autograd.set_detect_anomaly(True) can identify the exact operation that produced NaN. For performance debugging, use PyTorch Profiler to identify bottlenecks: kernel launch overhead, memory transfers, or inefficient operations. Profile-guided optimization often reveals that data loading (CPU→GPU transfer) is the bottleneck, not compute.
The Vanishing Gradient That Killed Our Recommender Model
- Always verify gradient flow through custom operations using torch.autograd.gradcheck.
- Add gradient norm monitoring to detect vanishing/exploding gradients early in training.
- Treat autograd as a critical system component—test it with unit tests that check gradient shapes and values.
torch.nn.utils.clip_grad_norm_. If norms are zero, trace backward graph for broken gradient flow. If norms are huge, apply gradient clipping.torch.autograd.set_detect_anomaly(True). This will pinpoint the exact operation that produced NaN. Common causes: division by zero, log of negative, exp overflow.requires_grad=True on all parameters. Check that optimizer's parameter list matches model parameters. Use torch.autograd.grad to manually compute gradients for specific parameters.loss.backward() and then optimizer.step() in the correct order. Use torch.cuda.empty_cache() if using GPU. Profile memory with torch.cuda.memory_summary().param.grad is Noneparam.requires_gradKey takeaways
Common mistakes to avoid
4 patternsOverwriting gradients instead of accumulating
+= when adding gradient contributions, and zero gradients before each backward pass.Forgetting to zero gradients before backward
optimizer.zero_grad() or manually set .grad = 0 for all parameters before each backward pass.Incorrect topological sort order
Numerical instability in gradient computation
Interview Questions on This Topic
Explain how reverse-mode automatic differentiation works and why it's preferred for training neural networks.
Frequently Asked Questions
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
That's From Scratch. Mark it forged?
13 min read · try the examples if you haven't