Build GPT from Scratch: A Production-Grade Walkthrough
Implement a GPT from scratch in PyTorch: tokenization, attention, training loop, and scaling.
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
- GPT is a decoder-only transformer that predicts next tokens via self-attention.
- nanoGPT trains a 124M parameter GPT-2 on OpenWebText in ~4 days on 8xA100.
- Core components: token embedding, positional encoding, multi-head attention, feed-forward, layer norm.
- Training loop: data loading, forward pass, loss computation, backward pass, optimizer step.
- Key hyperparameters: block size, n_layer, n_head, n_embd, learning rate, batch size.
- Production concerns: memory management, gradient accumulation, mixed precision, checkpointing.
Think of GPT as a supercharged autocomplete. It reads a sequence of words (or characters) and learns patterns from massive text data to predict what comes next. The 'attention' mechanism lets it focus on relevant parts of the input, like a reader scanning back to earlier sentences for context.
In 2026, building a GPT from scratch isn't just an academic exercise—it's a core skill for any ML engineer working with language models. Understanding the internals lets you fine-tune, debug, and scale models beyond what off-the-shelf APIs offer. The canonical nanoGPT repo by Andrej Karpathy provides a clean, minimal implementation that reproduces GPT-2 (124M) on a single node, making it the perfect starting point for serious developers.
This article walks through every component of a GPT: tokenization, embedding, multi-head self-attention, feed-forward networks, layer normalization, and the training loop. We'll reference nanoGPT's ~300-line model.py and train.py, but go deeper into production considerations like memory profiling, gradient accumulation, and checkpointing strategies.
You'll learn not just how to code a GPT, but how to debug it when training diverges, optimize throughput on GPUs, and avoid common pitfalls that waste compute. By the end, you'll have a mental model that scales from a character-level Shakespeare model to a 1.3B-parameter GPT-2-style model.
This is not a beginner tutorial. You need working knowledge of PyTorch, transformers, and basic deep learning. We'll assume you've read the Attention Is All You Need paper and understand backpropagation. If not, start with Karpathy's 'Neural Networks: Zero to Hero' series first.
Introduction: Why Build GPT from Scratch in 2026
By 2026, GPTs are commodity infrastructure. You don't build one to beat OpenAI — you build one to own your stack, control your data, and ship models that fit in a single GPU for under $500. The era of 'just call the API' is over for production systems that need predictable latency, zero data leakage, and custom tokenization for domain-specific corpora like legal documents, medical records, or codebases with proprietary syntax. Building from scratch gives you surgical control over every parameter, from embedding dimension to attention head count, and lets you deploy on edge devices or air-gapped environments where no cloud API reaches.
This walkthrough implements a decoder-only transformer that mirrors GPT-2's architecture at 124M parameters, the smallest of the released GPT-2 configurations. We use PyTorch 2.x with torch.compile, Flash Attention kernels, and the tiktoken BPE tokenizer. The final model trains on OpenWebText in about 4 days on a single 8x A100 node, reproducing GPT-2's loss curve. But more importantly, you'll understand every line of code: tokenization, embeddings, causal self-attention, feedforward blocks, layer normalization, weight tying, and the training loop with cosine decay and gradient clipping.
Why 2026 specifically? Because hardware has shifted: high-end consumer GPUs ship with 24GB or more of VRAM, Flash Attention is built into PyTorch through scaled_dot_product_attention, and low-precision formats (FP8, INT4) are widely supported. A character-level or small GPT is a weekend project on a single RTX 4090; the full 124M OpenWebText reproduction is a different budget (about 4 days on 8x A100, as noted above), so expect it to take far longer on one consumer card. The knowledge you gain transfers directly to scaling laws, mixture-of-experts, and multi-modal architectures. If you can build GPT from scratch, you can debug most transformer-based systems in production.
This is not a tutorial for beginners. You need working knowledge of PyTorch, backpropagation, and basic NLP. We skip the 'what is attention' hand-waving and go straight to tensor shapes, masking logic, and numerical stability. Every code block is runnable and tested against PyTorch 2.5. Let's build.
Embedding Layer: Token and Positional Embeddings
The embedding layer converts token IDs (integers) into dense vectors. GPT-2 uses a learned token embedding matrix of shape (vocab_size, n_embd) where n_embd = 768 for the 124M model. Each token ID indexes a row in this matrix, producing a vector of size 768. This is a simple lookup: no computation, just memory access. The embedding matrix is shared with the output projection layer (weight tying) to reduce parameters and improve training stability.
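Positional embeddings are also learned, not sinusoidal. GPT-2 uses a separate learned embedding of shape (block_size, n_embd) where block_size = 1024. Each position index (0 to 1023) selects a row of this table, and that position vector is added element-wise to the token embedding. This gives the model a sense of order without hard-coding any particular notion of distance. The sum of token and positional embeddings passes through dropout and then enters the first transformer block. There is no separate layer norm on the embeddings: in the pre-norm design, each block applies its own LayerNorm to the input of its sub-layers.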
In code, we implement a combined Embedding module that stores both token and position embeddings. The forward pass takes token IDs of shape (batch, seq_len) and returns embeddings of shape (batch, seq_len, n_embd). We use PyTorch's nn.Embedding with padding_idx=None (no padding token in GPT-2). The position indices are generated on the fly as torch.arange(seq_len, device=x.device).
Weight tying is implemented by setting the output linear layer's weight equal to the token embedding weight. This is done after model initialization: model.lm_head.weight = model.transformer.wte.weight. This halves the embedding parameter count and empirically improves convergence. For the 124M model, the embedding layer accounts for 50,257 * 768 ≈ 38.6M parameters, about 31% of total parameters.
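The embedding lookup and weight tying described above can be sketched as follows. This is a minimal illustration, not nanoGPT's actual code: the class name GPTEmbedding and the toy sizes are assumptions chosen so the example runs quickly on CPU, while wte/wpe follow the naming used in the text.

```python
import torch
import torch.nn as nn


class GPTEmbedding(nn.Module):
    """Token + learned positional embeddings (hypothetical module name)."""

    def __init__(self, vocab_size=50257, block_size=1024, n_embd=768, dropout=0.0):
        super().__init__()
        self.block_size = block_size
        self.wte = nn.Embedding(vocab_size, n_embd)  # token embedding table
        self.wpe = nn.Embedding(block_size, n_embd)  # learned position table
        self.drop = nn.Dropout(dropout)

    def forward(self, idx):
        # idx: (batch, seq_len) integer token IDs
        _, seq_len = idx.shape
        assert seq_len <= self.block_size, "sequence longer than block_size"
        pos = torch.arange(seq_len, device=idx.device)  # (seq_len,)
        x = self.wte(idx) + self.wpe(pos)  # position vectors broadcast over batch
        return self.drop(x)  # (batch, seq_len, n_embd)


# Toy sizes (assumption) so the sketch runs in milliseconds.
emb = GPTEmbedding(vocab_size=100, block_size=16, n_embd=32)
lm_head = nn.Linear(32, 100, bias=False)
lm_head.weight = emb.wte.weight  # weight tying: both modules share one Parameter

tokens = torch.randint(0, 100, (4, 16))
hidden = emb(tokens)       # (4, 16, 32)
logits = lm_head(hidden)   # (4, 16, 100)
```

Because nn.Linear stores its weight as (out_features, in_features), the LM head weight has the same (vocab_size, n_embd) shape as the embedding table, which is what makes the direct assignment possible.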
Multi-Head Self-Attention: Implementation with Causal Masking
Multi-head self-attention is the core of the transformer. For each head, we compute queries (Q), keys (K), and values (V) from the input via learned linear projections. The attention scores are Q @ K^T / sqrt(d_k) where d_k = n_embd / n_head. For GPT-2 124M, n_embd=768 and n_head=12, so d_k=64. The scores are masked to prevent attending to future tokens (causal masking) by setting upper-triangular entries to -inf before softmax. The softmax output is then multiplied by V to produce the head's output. All heads are concatenated and projected back to n_embd.
Causal masking is implemented as an additive float mask of shape (1, 1, seq_len, seq_len) where mask[i,j] = 0 if i >= j else -inf. We add this mask to the attention scores before softmax. In practice, we use torch.triu with diagonal=1 to create the mask. For efficiency, we use torch.nn.functional.scaled_dot_product_attention with is_causal=True, which dispatches to a Flash Attention kernel when one is available. The kernel fuses the QK^T matmul, masking, softmax, and the multiplication by V, so the full (seq_len, seq_len) attention matrix is never materialized and attention memory drops from O(n^2) to O(n). The Q, K, V projections remain ordinary linear layers outside the kernel.
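To make the masking concrete, the sketch below builds the additive mask with torch.triu, runs attention by hand, and compares the result with scaled_dot_product_attention(is_causal=True). The tensor sizes are arbitrary toy values, and the comparison assumes PyTorch 2.0 or newer, where that function exists.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 8, 16  # batch, heads, seq_len, head_dim (toy sizes)
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Additive mask: 0 on and below the diagonal, -inf strictly above it.
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)  # (B, H, T, T)
weights = torch.softmax(scores + mask, dim=-1)     # mask broadcasts over B and H
manual = weights @ v                               # (B, H, T, D)

# Fused path: same math, without materializing the (T, T) weights.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
max_diff = (manual - fused).abs().max().item()
```

After softmax, every weight above the diagonal is exactly zero, so position i never mixes in values from positions j > i.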
The attention mechanism has O(n^2 * d_k) complexity per head, where n is sequence length. For GPT-2's block_size=1024, this is manageable. But for longer sequences (e.g., 8k), Flash Attention is essential. Our implementation falls back to manual attention for clarity but includes a flag to use Flash Attention when available.
In code, we implement a single attention head as a module, then combine multiple heads in MultiHeadAttention. The forward pass: (1) project input to Q, K, V for all heads simultaneously using a single linear layer, (2) split into heads, (3) compute attention with causal mask, (4) concatenate heads, (5) final projection. We include dropout on attention weights and output for regularization.
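The five steps above might look like the following module. It is a sketch under assumptions rather than the article's original listing: the names CausalSelfAttention, c_attn, c_proj, and the use_flash flag are illustrative, and all heads are computed in one batched pass instead of one module per head.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12, block_size=1024, dropout=0.0, use_flash=True):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.dropout = dropout
        self.use_flash = use_flash and hasattr(F, "scaled_dot_product_attention")
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # (1) Q, K, V for all heads at once
        self.c_proj = nn.Linear(n_embd, n_embd)      # (5) final output projection
        self.attn_drop = nn.Dropout(dropout)
        self.resid_drop = nn.Dropout(dropout)
        # Boolean lower-triangular mask, only needed on the manual path.
        tril = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("causal", tril.view(1, 1, block_size, block_size), persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # (2) split into heads: (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        if self.use_flash:
            # (3) fused causal attention
            y = F.scaled_dot_product_attention(
                q, k, v, dropout_p=self.dropout if self.training else 0.0, is_causal=True
            )
        else:
            # (3) manual causal attention, kept for clarity
            att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
            att = att.masked_fill(~self.causal[:, :, :T, :T], float("-inf"))
            att = self.attn_drop(torch.softmax(att, dim=-1))
            y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # (4) concatenate heads
        return self.resid_drop(self.c_proj(y))


torch.manual_seed(0)
attn = CausalSelfAttention(n_embd=32, n_head=4, block_size=16).eval()
x = torch.randn(2, 10, 32)
y_fused = attn(x)
attn.use_flash = False
y_manual = attn(x)

# Causality check: perturbing the last token must not change earlier outputs.
x_perturbed = x.clone()
x_perturbed[:, -1] += 1.0
y_perturbed = attn(x_perturbed)
```

Keeping both paths behind one flag makes it easy to confirm that the fused kernel and the readable version agree before relying on the fast one.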
Transformer Block: Attention, Feed-Forward, Layer Norm, and Residuals
The transformer block is the fundamental building unit of GPT. Each block consists of two sub-layers: multi-head causal self-attention and a position-wise feed-forward network (FFN). Both sub-layers are wrapped with residual connections and preceded by layer normalization (pre-norm). The pre-norm formulation, where LayerNorm is applied before the sub-layer rather than after, has become standard in GPT-style models because it stabilizes training at depth. The residual path allows gradients to flow directly through the stack, mitigating vanishing gradient problems even with 12, 24, or 48 blocks.
Multi-head attention splits the embedding dimension into h heads, each of dimension d_k = d_model / h. For each head, we compute queries Q, keys K, and values V via learned linear projections. The attention scores are computed as softmax(QK^T / sqrt(d_k) + M), where M is a causal mask that sets all future positions to -inf. This ensures position i can only attend to positions j ≤ i. The mask is typically implemented as a lower-triangular matrix filled with 0s in the lower triangle and -inf in the upper triangle. After computing attention, the heads are concatenated and projected back to d_model.
The feed-forward network is a simple two-layer MLP with a GELU activation in between. The typical GPT-2 configuration uses an inner dimension of 4 × d_model. For d_model=768, the FFN expands to 3072 and then projects back to 768. This expansion-contraction pattern allows the model to learn complex non-linear transformations. The GELU activation is approximated as 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))), though modern implementations often use the exact erf-based version, 0.5 * x * (1 + erf(x / sqrt(2))).
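As a quick check on how close the two GELU variants are, the scalar sketch below evaluates both forms with the standard library only. The function names and the sampled range [-5, 5] are arbitrary choices for illustration.

```python
import math


def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # Tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


xs = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
```

The gap between the two stays small over this range, which is why checkpoints trained with one variant usually tolerate being evaluated with the other, although matching the training-time activation is the safer choice.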
Residual connections are critical: each sub-layer's output is added to its input. If we denote the input to a block as x, the output is x + Attention(LayerNorm(x)) + FFN(LayerNorm(x + Attention(LayerNorm(x)))). This additive structure means the model can learn to ignore sub-layers by learning near-zero weights, effectively reducing depth if needed. In practice, we initialize the output projection of each sub-layer with a small weight (e.g., N(0, 0.02)) and often use a scaling factor of 1/sqrt(2 * num_layers) to keep activations stable.
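A compact pre-norm block matching the formula above could be written as shown here. Treat it as a sketch: the attention is inlined so the example stays self-contained, the attribute names are assumptions, and the initialization uses a standard deviation of 0.02 with the residual projections scaled by 1/sqrt(2 * n_layer).

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Pre-norm block: x = x + Attn(LN(x)); x = x + MLP(LN(x))."""

    def __init__(self, n_embd=768, n_head=12, n_layer=12):
        super().__init__()
        self.n_head = n_head
        self.ln_1 = nn.LayerNorm(n_embd)
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        self.attn_proj = nn.Linear(n_embd, n_embd)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)      # expand to 4 x n_embd
        self.mlp_proj = nn.Linear(4 * n_embd, n_embd)  # project back

        # Input-side layers: std 0.02.
        for lin in (self.c_attn, self.c_fc):
            nn.init.normal_(lin.weight, mean=0.0, std=0.02)
        # Layers that write into the residual stream: std scaled down with depth.
        for lin in (self.attn_proj, self.mlp_proj):
            nn.init.normal_(lin.weight, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
        for lin in (self.c_attn, self.attn_proj, self.c_fc, self.mlp_proj):
            nn.init.zeros_(lin.bias)

    def attend(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.attn_proj(y.transpose(1, 2).contiguous().view(B, T, C))

    def forward(self, x):
        x = x + self.attend(self.ln_1(x))                        # attention sub-layer
        x = x + self.mlp_proj(F.gelu(self.c_fc(self.ln_2(x))))   # feed-forward sub-layer
        return x


torch.manual_seed(0)
block = Block(n_embd=32, n_head=4, n_layer=12)
x = torch.randn(2, 10, 32)
y = block(x)
target_std = 0.02 / math.sqrt(2 * 12)
proj_std = block.attn_proj.weight.std().item()
```

With the small residual-side initialization, a freshly created block is close to the identity function, which is one reason deep stacks of these blocks train stably from the first step.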
Layer normalization computes mean and variance across the feature dimension (not the sequence dimension). For an input x of shape (batch, seq_len, d_model), LayerNorm computes μ = mean(x, dim=-1) and σ² = var(x, dim=-1) (the biased estimate, dividing by d_model), then outputs γ * (x - μ) / sqrt(σ² + ε) + β, where γ and β are learnable parameters of size d_model. The epsilon (typically 1e-5) sits inside the square root and prevents division by zero. This normalization is crucial for training stability, especially when using FP16 or BF16 mixed precision.
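The computation can be checked against PyTorch's own implementation. The helper below is a hand-written sketch (layer_norm_manual is a hypothetical name); note that PyTorch uses the biased variance and places ε inside the square root.

```python
import torch
import torch.nn.functional as F


def layer_norm_manual(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the last (feature) dimension only.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta


torch.manual_seed(0)
x = torch.randn(2, 5, 8)        # (batch, seq_len, d_model)
gamma = torch.rand(8) + 0.5     # stand-in for the learned scale
beta = torch.randn(8)           # stand-in for the learned shift
ours = layer_norm_manual(x, gamma, beta)
ref = F.layer_norm(x, (8,), weight=gamma, bias=beta, eps=1e-5)
```

Each (batch, position) row is normalized independently, so the result for one token never depends on other tokens in the sequence or on other sequences in the batch.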
The Full GPT Model: Stacking Blocks and the Language Modeling Head
The full GPT model is a stack of N transformer blocks (typically 12 for GPT-2 small, 24 for medium, 36 for large, 48 for XL) followed by a language modeling head. The input pipeline starts with token embeddings and position embeddings, which are summed to produce the initial hidden state. There is no segment embedding (unlike BERT) because GPT is a unidirectional decoder-only model. The token embeddings are a learned lookup table of size vocab_size × n_embd, and the position embeddings are a learned lookup table of size block_size × n_embd.
After the embedding layer, the hidden state passes through the stack of transformer blocks. Each block maintains the same dimensionality (n_embd) throughout. After the final block, a layer normalization is applied, followed by a linear projection (the LM head) that maps from n_embd to vocab_size. This produces logits of shape (batch, seq_len, vocab_size). During training, we compute cross-entropy loss between these logits and the target tokens (shifted by one position). During inference, we sample from the logits to generate the next token.
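Sampling during inference can be sketched as a short loop. The generate function below is an illustration with assumed parameter names (temperature, top_k); it accepts any callable that maps token IDs of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size), and the toy model here just returns random logits.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                 # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature  # logits for the last position
        if top_k is not None:
            k = min(top_k, logits.size(-1))
            kth = torch.topk(logits, k).values[:, [-1]]   # k-th largest logit per row
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (batch, 1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


torch.manual_seed(0)
vocab_size = 50


def toy_model(idx):
    # Hypothetical stand-in for a trained GPT: random logits of the right shape.
    return torch.randn(idx.size(0), idx.size(1), vocab_size)


prompt = torch.zeros((2, 1), dtype=torch.long)
out = generate(toy_model, prompt, max_new_tokens=5, block_size=8, top_k=10)
```

Lower temperatures sharpen the distribution toward the most likely token, while top_k removes the long tail of unlikely tokens before sampling.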
The weight tying trick, used in the original Transformer paper and in GPT-2, shares the weight matrix between the token embedding layer and the LM head. This reduces the number of parameters by vocab_size × n_embd (e.g., ~38.6M for GPT-2 small with vocab_size=50257 and n_embd=768). The original Transformer also multiplies the embedding output by sqrt(d_model); GPT-2 and nanoGPT do not, and feed the summed token and position embeddings into the residual stream unscaled.
The model also includes dropout layers for regularization. In the original GPT-2, dropout is applied to the embedding layer (with rate 0.1) and to the output of each attention sub-layer (also 0.1). However, many modern implementations (including nanoGPT) set dropout to 0 during pretraining on large datasets, as the regularization from large-scale data is sufficient. Dropout is more commonly used during fine-tuning on smaller datasets.
For the GPT-2 124M parameter configuration, the architecture is: vocab_size=50257, block_size=1024, n_embd=768, n_head=12, n_layer=12, bias=True. The total parameter count is approximately 124M, which includes token embeddings (50257 × 768 ≈ 38.6M), position embeddings (1024 × 768 ≈ 0.8M), transformer block weight matrices (12 × (attention: 4 × 768² + MLP: 2 × 768 × 3072) ≈ 85M), and block layer norms (12 × 2 × 768 × 2 ≈ 37K). The bias terms and the final layer norm add a small fraction. With weight tying, the LM head adds no parameters of its own.
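The arithmetic can be reproduced with a few lines of plain Python. The function below is a counting sketch (its name and the returned keys are made up for this example); it includes the position embeddings and the final layer norm, and assumes the LM head is tied to the token embedding so it contributes nothing extra.

```python
def gpt2_param_count(vocab_size=50257, block_size=1024, n_embd=768, n_layer=12, bias=True):
    wte = vocab_size * n_embd                 # token embedding (shared with LM head)
    wpe = block_size * n_embd                 # learned position embedding
    attn_w = 4 * n_embd * n_embd              # QKV (3 * n_embd^2) + output projection
    mlp_w = 2 * n_embd * (4 * n_embd)         # expand + project back
    attn_b = (3 * n_embd + n_embd) if bias else 0
    mlp_b = (4 * n_embd + n_embd) if bias else 0
    ln_size = n_embd * (2 if bias else 1)     # LayerNorm weight (+ bias)
    per_block = attn_w + mlp_w + attn_b + mlp_b + 2 * ln_size
    blocks = n_layer * per_block
    final_ln = ln_size
    return {
        "token_embedding": wte,
        "position_embedding": wpe,
        "blocks": blocks,
        "final_layer_norm": final_ln,
        "total": wte + wpe + blocks + final_ln,
    }


counts = gpt2_param_count()
embedding_share = counts["token_embedding"] / counts["total"]
```

Calling the same function with n_layer, n_embd, and block_size changed is a cheap way to estimate model size before allocating any GPU memory.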
Training Loop: Data Loading, Loss Computation, Backpropagation, and Optimization
The training loop for GPT follows the standard autoregressive language modeling setup. Data is preprocessed into a flat array of token IDs (typically using tiktoken for BPE or a simple character-level encoding for small experiments). The data loader samples random contiguous chunks of length block_size from this array, creating input-target pairs where the target is the input shifted by one position. For example, if the input sequence is [t0, t1, ..., t_{n-1}], the target is [t1, t2, ..., t_n]. This is implemented efficiently by memory-mapping the token array and using random offsets to avoid loading the entire dataset into memory.
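A batch sampler along these lines might look like the following. It is a simplified sketch: a 1-D tensor stands in for the memory-mapped token array, and get_batch with its generator argument is an assumed helper, not a fixed API.

```python
import torch


def get_batch(data, block_size, batch_size, generator=None):
    # data: 1-D LongTensor of token IDs (stand-in for a memory-mapped array).
    # Highest valid start leaves room for block_size inputs plus one shifted target.
    starts = torch.randint(len(data) - block_size, (batch_size,), generator=generator).tolist()
    x = torch.stack([data[i : i + block_size] for i in starts])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in starts])
    return x, y


data = torch.arange(1000)  # fake "tokens": token t is followed by token t + 1
rng = torch.Generator().manual_seed(0)
x, y = get_batch(data, block_size=8, batch_size=4, generator=rng)
```

Because the fake data is a simple count, the target at every position is the input plus one, which makes the one-position shift easy to verify by eye.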
Loss computation uses cross-entropy between the predicted logits and the target tokens. The loss is averaged over all non-padding tokens (padding tokens are masked with ignore_index=-1 in the cross-entropy function). For a batch of B sequences of length T, the loss is: L = -1/(BT) Σ_b Σ_t log P(t_{b,t+1} | t_{b,≤t}). This is equivalent to the negative log-likelihood of the next token given all previous tokens. The perplexity, often reported as a metric, is exp(L).
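The loss formula can be verified numerically. The sketch below uses random logits and targets (toy sizes, one position marked with -1 to exercise ignore_index) and recomputes the same value by hand from log-softmax.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 6, 11  # batch, sequence length, vocab size (toy values)
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
targets[0, -1] = -1  # pretend this position should not contribute to the loss

# Library call: flatten (B, T) into one axis of B*T predictions.
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), ignore_index=-1)
perplexity = math.exp(loss.item())

# Manual version: mean negative log-probability of the target token.
log_probs = F.log_softmax(logits, dim=-1)
keep = targets != -1
picked = log_probs.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
manual = -picked[keep].mean()

# Sanity reference: uniform predictions give a loss of ln(V).
uniform_loss = F.cross_entropy(torch.zeros(4, V), torch.tensor([0, 1, 2, 3]))
```

The uniform case is a useful reference point at the start of training: with GPT-2's vocabulary, an untrained model should start near ln(50257), roughly 10.8.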
Backpropagation computes gradients of the loss with respect to all model parameters. The AdamW optimizer is the standard choice for training GPTs. AdamW decouples weight decay from the gradient-based update: instead of adding an L2 penalty to the loss, it shrinks the weights directly at each step. By convention, weight decay is applied only to the 2-D weight matrices and not to biases or layer norm parameters; this is configured through optimizer parameter groups rather than being a property of AdamW itself. The typical configuration is: learning_rate=3e-4 to 6e-4 for the 124M model (nanoGPT defaults to 6e-4), β1=0.9, β2=0.95, weight_decay=0.1, and epsilon=1e-8. A cosine learning rate schedule with linear warmup is used: the learning rate linearly increases from 0 to max_lr over the first few thousand steps (e.g., 2000), then follows a cosine decay to a minimum value (typically 10% of max_lr).
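The warmup-plus-cosine schedule is easy to express as a pure function of the step. In this sketch the function name lr_at and the decay_steps argument are assumptions, and max_lr is set to 1.0 so the output reads as a multiplier of whatever peak learning rate you choose.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, decay_steps):
    # Phase 1: linear warmup from 0 to max_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Phase 3: hold at the floor once the decay window has passed.
    if step >= decay_steps:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


steps = (0, 1000, 2000, 6000, 10000)
schedule = [
    lr_at(s, max_lr=1.0, min_lr=0.1, warmup_steps=2000, decay_steps=10000)
    for s in steps
]
```

In a training loop, the returned value would typically be written into each entry of optimizer.param_groups under the "lr" key before the optimizer step.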
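Gradient clipping is essential to prevent exploding gradients. The typical threshold is max_grad_norm=1.0. After computing gradients via loss.backward(), we call torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm). This computes the global L2 norm over all parameter gradients and, if it exceeds the threshold, rescales every gradient by the same factor so the global norm equals the threshold; the gradient direction is preserved. When a GradScaler is in use, call scaler.unscale_(optimizer) first so clipping sees the true gradient magnitudes. Without clipping, a single outlier batch can destabilize the entire training run.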
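One optimizer step with parameter groups and clipping could be sketched as below. The tiny model, the inflated loss, and the learning rate are placeholders chosen only so that clipping visibly triggers; the split by tensor dimension is one common convention for deciding which parameters receive weight decay.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 4))

# Decay only matrices (dim >= 2); biases and LayerNorm parameters are 1-D.
decay = [p for p in model.parameters() if p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.dim() < 2]
optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,  # placeholder value for the sketch
    betas=(0.9, 0.95),
    eps=1e-8,
)
max_grad_norm = 1.0

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = 100.0 * nn.functional.mse_loss(model(x), y)  # inflated so clipping triggers
optimizer.zero_grad(set_to_none=True)
loss.backward()

# Returns the global norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
optimizer.step()
```

Logging the returned pre-clip norm every step is a cheap way to spot instability before it shows up in the loss.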
The evaluation loop runs periodically (e.g., every 1000 iterations) on a held-out validation set. It computes the loss without gradient computation (torch.no_grad()) and reports validation loss/perplexity. This is used for checkpoint selection: we save the model whenever validation loss improves. The training loop also logs metrics (loss, learning rate, gradient norm) to a dashboard like Weights & Biases or TensorBoard for monitoring.
Production Considerations: Mixed Precision, Gradient Accumulation, Checkpointing, and Debugging
Mixed precision training (FP16 or BF16) is essential for training large GPT models efficiently. Modern GPUs (A100, H100) have dedicated Tensor Cores that provide 2-4x throughput for FP16/BF16 operations compared to FP32. The standard approach uses automatic mixed precision (torch.autocast, historically exposed as torch.cuda.amp) with a GradScaler to prevent small gradient values from underflowing to zero in FP16 during backpropagation. The scaler multiplies the loss by a scale factor before backward, then divides the gradients by the same factor before the optimizer step. If gradients overflow (become inf/nan), the scaler skips the step and reduces the scale. BF16 is preferred over FP16 when available because it has the same exponent range as FP32, eliminating the need for loss scaling in many cases.
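A BF16 training step with autocast might look like this sketch. It assumes a device where BF16 autocast is supported (CPU works; on CUDA it needs a GPU with BF16 support), uses a single linear layer as a stand-in model, and omits the GradScaler because BF16 generally does not need loss scaling; an FP16 run would add one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 10).to(device_type)  # stand-in for the GPT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 16, device=device_type)
targets = torch.randint(0, 10, (8,), device=device_type)

# Matmul-heavy ops run in BF16 inside the context; parameters stay FP32.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits = model(x)
    # Compute the loss in FP32 for a more accurate softmax.
    loss = F.cross_entropy(logits.float(), targets)

optimizer.zero_grad(set_to_none=True)
loss.backward()   # gradients land on the FP32 parameters
optimizer.step()
```

Only the forward pass and loss sit inside the autocast context; backward and the optimizer step run outside it and operate on the FP32 master weights.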
Gradient accumulation allows training with effective batch sizes larger than what fits in GPU memory. Instead of computing the gradient over one large batch, we accumulate gradients over multiple micro-batches. For example, to achieve an effective batch size of 512 with micro-batch size 16, we accumulate gradients over 32 steps. The loss for each micro-batch is divided by the number of accumulation steps to keep the gradient magnitude consistent. This is implemented by calling loss.backward() on each micro-batch without zeroing gradients, then calling optimizer.step() after the accumulation is complete. Gradient accumulation is transparent to the optimizer: it sees the sum of gradients, which is equivalent to the gradient of the full batch.
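The equivalence claim can be checked directly on a toy problem. The sketch below compares the gradient from one full batch of 32 with the gradient accumulated over 4 micro-batches of 8; the linear model and MSE loss are stand-ins chosen to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss_fn = nn.MSELoss()  # mean reduction over the batch

# Reference: gradient of the full batch in one pass.
model.zero_grad(set_to_none=True)
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 8, each loss divided by the number of steps.
accum_steps = 4
model.zero_grad(set_to_none=True)
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients add up in .grad; no zeroing between micro-batches
accum_grad = model.weight.grad.clone()
# optimizer.step() and zero_grad() would follow here, once per accumulated batch.
```

The match relies on equally sized micro-batches and a mean-reduced loss; with uneven micro-batches, the division would need to weight each one by its share of the tokens.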
Checkpointing strategy is critical for long training runs (days to weeks). Save checkpoints at regular intervals (e.g., every 1000 steps) and always keep the best model based on validation loss. A checkpoint should include: model state_dict, optimizer state_dict, scheduler state_dict, current step, and best validation loss. This allows resuming training from any checkpoint. Use a naming convention like 'ckpt_{step}_{val_loss:.4f}.pt' and implement a retention policy (e.g., keep last 5 checkpoints plus best). For distributed training, only save from rank 0 to avoid file corruption.
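Saving and resuming with the fields listed above could be sketched as follows. The helper names save_checkpoint and load_checkpoint are hypothetical, the model is a stand-in, and the retention policy and rank-0 guard are left out to keep the example short.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, scheduler, step, best_val_loss):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "step": step,
            "best_val_loss": best_val_loss,
        },
        path,
    )


def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["step"], ckpt["best_val_loss"]


def make_training_state():
    model = nn.Linear(4, 4)  # stand-in for the GPT
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    return model, optimizer, scheduler


torch.manual_seed(0)
model, optimizer, scheduler = make_training_state()
model(torch.randn(2, 4)).sum().backward()
optimizer.step()
scheduler.step()

with tempfile.TemporaryDirectory() as tmp:
    step, val_loss = 1000, 3.2871
    path = os.path.join(tmp, f"ckpt_{step}_{val_loss:.4f}.pt")
    save_checkpoint(path, model, optimizer, scheduler, step, val_loss)
    fresh_model, fresh_opt, fresh_sched = make_training_state()
    resumed_step, resumed_best = load_checkpoint(path, fresh_model, fresh_opt, fresh_sched)
```

Restoring the optimizer state matters as much as the weights: without AdamW's moment estimates, a resumed run typically shows a visible loss bump.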
Debugging training issues requires systematic monitoring. Log these metrics every step: training loss, learning rate, gradient norm, and scale factor (for AMP). Watch for these red flags: (1) Loss not decreasing after 1000 steps → check learning rate, data loading, or model initialization. (2) Gradient norm suddenly spiking to >10 → reduce learning rate or tighten gradient clipping (lower max_grad_norm). (3) Loss going to NaN → check for numerical instability in attention (use torch.nn.functional.scaled_dot_product_attention which is numerically stable), or reduce learning rate. (4) Validation loss rising while training loss keeps falling → overfitting; increase dropout or reduce model size.
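For the NaN case, a small helper that names the offending tensors is often more useful than a bare boolean. The function below is a sketch (find_nonfinite is a hypothetical name) that checks each parameter and its gradient individually.

```python
import torch
import torch.nn as nn


def find_nonfinite(model):
    """Return names of parameters whose values or gradients contain NaN/inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append(name)
        elif p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name + ".grad")
    return bad


model = nn.Linear(4, 2)
clean = find_nonfinite(model)

# Simulate a corrupted parameter.
with torch.no_grad():
    model.bias[0] = float("nan")
corrupted = find_nonfinite(model)
```

Running a check like this every few hundred steps, and stopping the run when it returns anything, avoids spending compute on a model that can no longer recover.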
For distributed training across multiple GPUs, use PyTorch's DistributedDataParallel (DDP). The key is to split the batch across GPUs and synchronize gradients during backward. By default DDP all-reduces gradients on every backward call, so with gradient accumulation you wrap all but the last micro-batch in model.no_sync() to accumulate locally and synchronize only once per optimizer step. The effective batch size is micro_batch_size × gradient_accumulation_steps × num_gpus. For example, with micro_batch_size=8, accumulation_steps=4, and 8 GPUs, the effective batch size is 256. DDP adds communication overhead, but for models up to 1.5B parameters on 8 GPUs within a single node, the overhead is usually small compared to compute time.
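The interaction between DDP and gradient accumulation can be sketched without launching multiple processes. The helper below only relies on the wrapped model exposing a no_sync() context manager, which DistributedDataParallel does; the helper names and the fake wrapper class are assumptions used to make the example runnable on one machine.

```python
import contextlib


def effective_batch_size(micro_batch_size, accumulation_steps, num_gpus):
    return micro_batch_size * accumulation_steps * num_gpus


def grad_sync_context(model, micro_step, accumulation_steps):
    """Skip gradient all-reduce on every micro-step except the last one."""
    is_last = micro_step == accumulation_steps - 1
    if hasattr(model, "no_sync") and not is_last:
        return model.no_sync()
    return contextlib.nullcontext()  # plain module, or last micro-step: sync normally


class FakeDDP:
    """Hypothetical stand-in that counts how often synchronization is skipped."""

    def __init__(self):
        self.skipped = 0

    @contextlib.contextmanager
    def no_sync(self):
        self.skipped += 1
        yield


fake = FakeDDP()
accumulation_steps = 4
for micro_step in range(accumulation_steps):
    with grad_sync_context(fake, micro_step, accumulation_steps):
        pass  # forward pass and loss.backward() would go here

total_batch = effective_batch_size(micro_batch_size=8, accumulation_steps=4, num_gpus=8)
```

With 4 accumulation steps, synchronization is skipped 3 times and happens once, which cuts the all-reduce traffic per optimizer step to a quarter of the naive version.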
The Silent Divergence: Training a 1.3B GPT-2 on a Single Node
- Always validate numerical stability with mixed precision training by monitoring for NaNs in activations and gradients.
- Use built-in PyTorch layers when possible; they are battle-tested for edge cases.
- Implement early stopping on NaN detection to avoid wasting compute on corrupted runs.
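Two quick checks when a run goes bad:
- any(torch.isnan(p).any() for p in model.parameters()) reports whether any parameter already contains NaN (torch.isnan takes a tensor, so it has to be applied per parameter).
- torch.autograd.set_detect_anomaly(True) makes backward raise at the operation that produced the NaN; it slows training, so enable it only while debugging.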
Common mistakes to avoid
- Forgetting causal masking in self-attention
- Incorrect weight initialization
- Not using gradient clipping
- Overfitting on small datasets
Interview Questions on This Topic
Explain the multi-head self-attention mechanism in GPT. How does it differ from the original transformer?