Senior 7 min · March 06, 2026

Transformer Positional Encoding — Flat Prediction Fix

Transformers are permutation-invariant by default.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Transformer replaces recurrence with self-attention: it processes all tokens in parallel, so training parallelises across the sequence in a way RNNs cannot
  • Scaled dot-product attention: softmax(Q·K^T / √d_k)·V — divide by √d_k prevents softmax saturation. Missing it = gradients vanish
  • Multi-head attention: h parallel heads (d_k = d_model/h) learn different patterns (syntax, coreference, local context)
  • Positional encoding is mandatory — Transformer is permutation-invariant without it. Omit it = model treats sequence as bag-of-words
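  • Flash Attention: reduces attention memory from O(n²) to O(n) by never materialising the full score matrix. At n=100k, a single n×n matrix is about 40 GB in FP32 (20 GB in FP16) per head. Use PyTorch 2.0+'s scaled_dot_product_attention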
  • Production killer: missing positional encoding → model trains but predicts flat outputs. Always add: x = x + pe[:, :seq_len]
Plain-English First

Imagine you're trying to understand the sentence 'The trophy didn't fit in the bag because it was too big.' To know what 'it' refers to — the trophy — your brain doesn't read every word with equal focus. It zooms in on 'trophy' and 'big' and connects them. The Transformer does exactly this: for every word it processes, it asks 'which other words in this sentence should I pay the most attention to right now?' and builds its understanding by weighting those relationships. No step-by-step reading required — it looks at the whole sentence at once, like a photograph rather than a film strip.

In 2017, eight researchers from Google Brain, Google Research, and the University of Toronto published a 15-page paper that quietly made recurrent neural networks obsolete. 'Attention Is All You Need' introduced the Transformer architecture, and within a few years it became the backbone of GPT, BERT, T5, DALL-E, Whisper, and virtually every state-of-the-art model in language, vision, audio, and protein folding. If you work in ML, this paper is not optional reading: it is the constitution of modern deep learning.

Before Transformers, sequence models like LSTMs and GRUs processed tokens one at a time, left to right. That sequential dependency meant you couldn't parallelise training across time steps, and long-range dependencies decayed badly across hundreds of tokens. The Transformer killed recurrence entirely. By replacing recurrence with self-attention, it achieved parallelism across the entire sequence and made long-range dependency a first-class citizen.

Here's the thing most tutorials skip: the paper isn't just theory you read once and file away. Every line — from the scaling factor to the learning rate schedule to the label smoothing — encodes a hard-won production lesson. The teams that deploy Transformers without internalising those details don't get elegant training curves. They get OOM crashes, flat predictions, and models that look great on validation but fail in the wild. This article makes sure you're not one of them.

Scaled Dot-Product Attention and Multi-Head Mechanics

The core operation is scaled dot-product attention. Given queries Q, keys K, values V (all matrices of shape [seq_len, d_k]), the attention output is softmax(Q·K^T / √d_k) · V. The division by √d_k prevents the dot products from growing too large, which would push the softmax into regions of extremely small gradients (saturation).

Multi-head attention: instead of one attention operation in d_model dimensions, project Q, K, V down to h lower-dimensional heads (each of dimension d_k = d_model / h), compute attention in parallel on each head, then concatenate and project back up. Each head learns different relationship types: some heads focus on local syntax (adjacent words), others on long-range dependencies, others on coreference (pronoun resolution).

In practice, h=8 for the base model (d_model=512, d_k=64). The computational cost is similar to single-head attention at full dimensionality because the heads together span the same total width: h · d_k = d_model. Multi-head attention does add an output projection after concatenation, which costs O(d_model²) per token.

In the paper's head-count ablation, which held total compute constant, 8 and 16 heads gave the best translation quality; single-head attention was 0.9 BLEU worse, and quality also dropped off with too many heads (32). Don't chase more heads: the real wins come from better scaling, not more parallel subspaces.

One subtlety: the scaling factor. With d_k=64, √d_k=8, the dot products shrink by 8x. Without that, the variance of logits scales linearly with d_k. For d_k=512, logits have variance ~512, which pushes softmax nearly one-hot. Gradients become minuscule — your loss doesn't move. Production teams often forget this when increasing d_model and keeping n_heads constant (d_k grows).
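The variance argument can be checked numerically. This is a minimal sketch that assumes query and key components are independent standard-normal values (an idealisation, not a measurement from a trained model); `logit_variance` is a hypothetical helper name.

```python
import math
import torch

def logit_variance(d_k, n_samples=20000, seed=0):
    """Empirical variance of q·k before and after dividing by sqrt(d_k)."""
    gen = torch.Generator().manual_seed(seed)
    # Assumption: components are i.i.d. N(0, 1)
    q = torch.randn(n_samples, d_k, generator=gen)
    k = torch.randn(n_samples, d_k, generator=gen)
    raw = (q * k).sum(dim=-1)        # unscaled dot products
    scaled = raw / math.sqrt(d_k)    # what the softmax should actually see
    return raw.var().item(), scaled.var().item()

for d_k in (64, 512):
    raw_var, scaled_var = logit_variance(d_k)
    print(f"d_k={d_k}: raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Under that assumption the raw variance tracks d_k while the scaled variance stays near 1, which is why the softmax input stays in a well-behaved range regardless of head width.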

A production reality: we once debugged a model where training loss flatlined at 4.3 for three days. The team tried different optimizers, learning rates, everything. The fix was one line: adding / math.sqrt(self.d_k) before softmax. The scaling factor was present in the paper's pseudocode but missing in the implementation. Three days of compute, gone. That's the kind of paper detail that separates working models from broken ones.

io/thecodeforge/ml/transformer_attention.py · Python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V: [batch, n_heads, seq_len, d_k]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        return torch.matmul(attn, V)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # 1. Linear projections
        Q = self.W_q(query)
        K = self.W_k(key)
        V = self.W_v(value)

        # 2. Split into heads: [batch, seq_len, n_heads, d_k] -> [batch, n_heads, seq_len, d_k]
        Q = Q.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # 3. Attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)

        # 4. Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # 5. Final projection
        return self.W_o(attn_output)
Why Multi-Head? Different Heads Learn Different Patterns
  • Head 1: focuses on the previous token (local context)
  • Head 2: attends to the subject of the sentence (for pronoun resolution)
  • Head 3: spreads attention across the whole sentence equally (global context)
  • Head 4: focuses on object of the verb (dependency parsing)
  • In BERT, different heads specialise in different linguistic phenomena automatically through training.
Production Insight
The scaling factor √d_k prevents the dot products from saturating the softmax. With d_k=64, scaling factor=8.
Without scaling, the variance of each QK^T entry is d_k (for query and key components with zero mean and unit variance), pushing softmax gradients to near-zero.
Rule: Always use attn = (Q @ K.T) / sqrt(d_k) before softmax. Forgetting scaling causes training instability and loss not decreasing.
Key Takeaway
Scaled dot-product attention: divide by √d_k before softmax to prevent saturation. Without scaling, attention becomes nearly one-hot, gradients vanish.
Multi-head attention splits d_model into h heads (each d_k = d_model/h), computes attention in parallel, concatenates results.
Rule: Each head operates independently; total compute is same as single-head attention due to dimensionality reduction.
Choosing Number of Attention Heads
If: d_model = 512, general machine translation
Use: h=8 (d_k=64). Original paper's sweet spot. Good balance of capacity and compute.
If: d_model = 1024, larger model
Use: h=16 (d_k=64) or h=32 (d_k=32). Smaller d_k per head can degrade performance. Keep d_k >= 32.
If: Memory-constrained (e.g., mobile)
Use: Fewer heads (h=4) with the same d_k. With d_model=512 and d_k=64, h·d_k drops from 512 to 256, which halves the Q/K/V/output projection parameters. Accept a small quality loss.
If: Very long sequences (8k+), need speed
Use: Reduce heads to 4 and use Flash Attention. At fixed d_k, fewer heads reduce total QKV projection compute.

Positional Encoding — Giving Order to the Permutation-Invariant Transformer

The Transformer's attention mechanism is permutation-invariant in the sense that matters here (strictly speaking, permutation-equivariant): reordering the input tokens reorders the outputs the same way, but each token's attention distribution over the other tokens is unchanged. This is a problem because language is fundamentally ordered: 'dog bites man' and 'man bites dog' have opposite meanings. Positional encodings add information about each token's position in the sequence.
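The claim is easy to verify on a toy example. The sketch below uses a single attention head with random projection weights (hypothetical stand-ins for trained parameters) and compares outputs for an input and a shuffled copy, first without and then with a sinusoidal position signal.

```python
import math
import torch

torch.manual_seed(0)
d_model, seq_len = 16, 5
# Hypothetical random weights standing in for trained W_q, W_k, W_v
W_q, W_k, W_v = (torch.randn(d_model, d_model) / math.sqrt(d_model) for _ in range(3))

def self_attention(x):
    """Single-head self-attention over x: [seq_len, d_model]."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    weights = torch.softmax(q @ k.T / math.sqrt(d_model), dim=-1)
    return weights @ v

def sinusoidal_pe(n, d):
    """Sinusoidal position table of shape [n, d]."""
    pos = torch.arange(n, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

x = torch.randn(seq_len, d_model)
perm = torch.tensor([4, 2, 0, 3, 1])

# Without positions: shuffling the input only shuffles the output rows
out, out_perm = self_attention(x), self_attention(x[perm])
rows_follow_perm = torch.allclose(out_perm, out[perm], atol=1e-5)
pooled_same = torch.allclose(out.mean(0), out_perm.mean(0), atol=1e-5)

# With positions: the same shuffle now changes the pooled representation
pe = sinusoidal_pe(seq_len, d_model)
pooled_a = self_attention(x + pe).mean(0)
pooled_b = self_attention(x[perm] + pe).mean(0)
pooled_differs_with_pe = not torch.allclose(pooled_a, pooled_b, atol=1e-5)

print(rows_follow_perm, pooled_same, pooled_differs_with_pe)
```

Any order-insensitive readout (mean pooling here) therefore sees identical features for both orderings unless a position signal is added first.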

The original paper used sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Each sine/cosine pair has a different wavelength, forming a geometric progression from 2π to 10000·2π. This allows the model to attend to relative positions, because for any fixed offset k the encoding at position pos+k can be represented as a linear function of the encoding at pos.
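The "linear function" property follows from the angle-addition identities: each (sin, cos) pair at position pos+k is a 2×2 rotation of the pair at pos, with a rotation angle that depends only on k. A small sketch that builds this matrix and checks it numerically (`shift_matrix` is a hypothetical helper, not something from the paper):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal table [max_len, d_model] plus the per-pair angular frequencies."""
    position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(position * freqs)
    pe[:, 1::2] = torch.cos(position * freqs)
    return pe, freqs

def shift_matrix(k, freqs):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos) for every pos."""
    d_model = 2 * freqs.numel()
    m = torch.zeros(d_model, d_model, dtype=torch.float64)
    for i, w in enumerate(freqs.tolist()):
        c, s = math.cos(w * k), math.sin(w * k)
        m[2 * i, 2 * i], m[2 * i, 2 * i + 1] = c, s
        m[2 * i + 1, 2 * i], m[2 * i + 1, 2 * i + 1] = -s, c
    return m

pe, freqs = sinusoidal_pe(max_len=50, d_model=16)
k = 7
M = shift_matrix(k, freqs)

# Apply the same matrix to every position and compare with the table shifted by k
shifted = pe[:-k] @ M.T
max_err = (shifted - pe[k:]).abs().max().item()
print(f"max reconstruction error for offset {k}: {max_err:.2e}")
```

The matrix depends on the offset k but not on pos, which is what lets a linear layer express "k tokens away" independently of absolute position.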

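Sinusoidal encodings are not learned, so they are defined for any position, including sequence lengths longer than those seen during training; the paper hypothesised that this would help the model extrapolate. Learned positional embeddings (trainable parameters) often perform better on fixed-length tasks but have no entry for positions beyond their table size, so they cannot generalise to longer sequences.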

The encoding is added directly to the input embeddings: x = embedding + positional_encoding. Not concatenated — addition preserves the embedding dimension, concatenation would double it.

Here's the trap: if you forget positional encoding, the model still trains. Loss goes down. But you'll see flat predictions on any task requiring order. In the time-series incident described in this article's post-mortem, the model literally predicted the mean of training values for every time step. The fix was adding positional encoding, not changing the model size or learning rate.

Another production pitfall: using learned embeddings and then hitting a sequence length longer than the pre-defined max_len during inference. The embedding matrix has fixed size; out-of-range positions throw IndexError. Always validate inference sequences against the embedding table size.
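One way to make that failure explicit is to validate the sequence length before the embedding lookup. The sketch below is an illustration only: `LearnedPositionalEmbedding` is a hypothetical module name, and raising `ValueError` is one possible policy (truncating or chunking the input are alternatives).

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Learned position table with an explicit length check (illustrative sketch)."""

    def __init__(self, max_len, d_model):
        super().__init__()
        self.max_len = max_len
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        seq_len = x.size(1)
        if seq_len > self.max_len:
            # Fail with a clear message instead of an opaque index error in the lookup
            raise ValueError(
                f"sequence length {seq_len} exceeds positional table size {self.max_len}"
            )
        positions = torch.arange(seq_len, device=x.device)
        return x + self.pos_emb(positions)  # broadcasts over the batch dimension

layer = LearnedPositionalEmbedding(max_len=8, d_model=4)
ok = layer(torch.zeros(2, 8, 4))
print(ok.shape)
```

Placing the check inside the module means every caller gets the same error message, including batch inference paths that never touched the training data loader.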

Modern practice has largely moved to Rotary Position Embedding (RoPE), used in Llama, Mistral, and GPT-NeoX. RoPE applies rotation matrices to Q and K based on position — it encodes relative position directly into the attention computation rather than adding a fixed vector. It extrapolates better than learned embeddings and has become the default for new LLM implementations.

io/thecodeforge/ml/positional_encoding.py · Python
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix [max_len, d_model]
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        x = x + self.pe[:, :x.size(1), :]  # Add positional encoding to input embeddings
        return self.dropout(x)

# Check that different positions have distinct encodings
pos_enc = PositionalEncoding(512, max_len=10)
print("Position 0 encoding (first 8 dims):", pos_enc.pe[0, 0, :8])
print("Position 1 encoding (first 8 dims):", pos_enc.pe[0, 1, :8])
assert not torch.allclose(pos_enc.pe[0, 0, :8], pos_enc.pe[0, 1, :8])
Positional Encoding Must Be Added, Not Concatenated
Concatenating positional encodings doubles the embedding dimension, changing the model capacity and breaking the projection layers. Always add: x = x + pe. Addition preserves the dimension and lets the model optionally ignore positional info if not needed — concatenation forces it to be used.
Production Insight
Without positional encoding, the Transformer cannot distinguish between 'hello world' and 'world hello'.
Sinusoidal encodings are defined for any position, so they can be computed for sequences longer than those seen during training (as long as the precomputed max_len buffer covers them), which is useful for variable-length inputs.
Rule: For fixed sequence length tasks (e.g., 512-token BERT), learned embeddings often outperform sinusoidal. For variable length, use sinusoidal.
Key Takeaway
The Transformer is permutation-invariant without positional encodings — your model cannot tell order without explicit position information.
Sinusoidal encodings are not learned, enabling extrapolation to longer sequences at inference time.
Rule: Add positional encodings to embeddings (element-wise addition) not concatenate. Concatenation increases dimension and breaks the model's projection layers.
Sinusoidal vs Learned Positional Embeddings
If: Variable sequence length (e.g., document summarization, translation)
Use: Sinusoidal encodings. They extrapolate to unseen lengths without extra parameters.
If: Fixed maximum length (e.g., BERT 512 tokens, GPT-2 1024)
Use: Learned embeddings. They can encode position-specific patterns better (e.g., beginning-of-sentence bias).
If: Need to encode relative positions (e.g., T5, Transformer-XL)
Use: Relative position bias or Rotary Position Embedding (RoPE). Sinusoidal can approximate relative but RoPE is more direct.

Encoder-Decoder Stack and Masking

The original Transformer has an encoder (processes input sequence) and a decoder (generates output sequence). Each encoder layer has multi-head self-attention (no masking) + feed-forward network (FFN). Each decoder layer has masked self-attention (prevents looking at future tokens) + cross-attention (attends to encoder output) + FFN.

The encoder sees the entire input sequence simultaneously. Self-attention is unmasked — every token can attend to every other token in the input.

The decoder is autoregressive: when generating token i, it can only attend to positions 0..i-1. This is enforced with a causal mask: scores strictly above the diagonal are set to -inf before softmax, which gives future tokens zero attention weight.
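A minimal sketch of that mask in isolation, using random scores so the effect on the softmax output is visible. `causal_mask` is a hypothetical helper; here True means "may attend", and the inverse is filled with -inf.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    """Boolean [seq_len, seq_len] mask: True where query i may attend to key j (j <= i)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

torch.manual_seed(0)
seq_len = 5
scores = torch.randn(1, 1, seq_len, seq_len)   # [batch, heads, query, key]
allowed = causal_mask(seq_len)

# Blocked positions get -inf, so softmax assigns them exactly zero weight
weights = F.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
print(weights[0, 0])
```

Each row still sums to 1, but all the probability mass sits on or below the diagonal; the first query can only attend to itself.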

Cross-attention in the decoder uses the encoder output as K and V, and the decoder's previous layer output as Q. This allows the decoder to focus on different parts of the input sequence for each generated output token.

The feed-forward network (FFN) is a simple two-layer MLP with ReLU: FFN(x) = max(0, xW1 + b1)W2 + b2. It operates per token independently (no interaction across positions). This gives the model additional capacity to transform the attention output before the next layer.

One mistake I've seen in production code: applying the causal mask to the cross-attention. Cross-attention needs no causal mask (only a padding mask, if the encoder batch is padded): the decoder can attend to any encoder position, including those 'ahead' in the encoder sequence. The causal mask only applies to decoder self-attention. Mixing them up leads to artificially constrained generation.

Another production issue: when using KV cache for inference, the mask changes shape. During training, the mask is [seq_len, seq_len]. During inference with KV cache, only the new token's query is computed, so the mask is [1, cached_len+1] and allows every position: everything in the cache is already in the past, so there is nothing left to hide. Getting this shape wrong causes either information leak or all tokens generating identical output.
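The equivalence between full causal attention and a cached single-token step can be checked directly. This is a single-head sketch with random tensors and no projections, so the function names and shapes are illustrative assumptions rather than a production cache.

```python
import math
import torch
import torch.nn.functional as F

def full_causal_attention(q, k, v):
    """Training-style attention: q, k, v are [seq_len, d] and the mask is [seq_len, seq_len]."""
    n, d = q.shape
    scores = q @ k.T / math.sqrt(d)
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return F.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1) @ v

def cached_step(q_new, k_cache, v_cache):
    """Inference-style step: q_new is [1, d]; the caches already include the new token's K, V."""
    scores = q_new @ k_cache.T / math.sqrt(q_new.size(-1))   # [1, cached_len + 1]
    # No masking needed: every cached position is in the past or is the token itself
    return F.softmax(scores, dim=-1) @ v_cache

torch.manual_seed(0)
q, k, v = (torch.randn(6, 8) for _ in range(3))
full = full_causal_attention(q, k, v)

# Replay generation one token at a time, growing the cache by slicing
steps = [cached_step(q[t:t + 1], k[:t + 1], v[:t + 1]) for t in range(6)]
stepwise = torch.cat(steps, dim=0)
print(torch.allclose(full, stepwise, atol=1e-6))
```

Row t of the masked training computation matches the cached step at time t, which is why the cached path can drop the square mask entirely.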

io/thecodeforge/ml/transformer_stack.py · Python
import torch
import torch.nn as nn
# Assumes MultiHeadAttention (defined in transformer_attention.py above) is in scope

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-LN (more stable than the original Post-LN): normalise once, reuse for Q, K, V
        h = self.norm1(x)
        x = x + self.dropout(self.self_attn(h, h, h, mask))
        # feed_forward already ends with dropout, so no second dropout is applied here
        x = x + self.feed_forward(self.norm2(x))
        return x

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.cross_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model), nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, causal_mask=None, cross_mask=None):
        # Masked self-attention: normalise once and reuse for Q, K, V
        h = self.norm1(x)
        x = x + self.dropout(self.self_attn(h, h, h, causal_mask))
        # Cross-attention: Q from the decoder, K and V from the encoder output
        x = x + self.dropout(self.cross_attn(self.norm2(x), encoder_output, encoder_output, cross_mask))
        # Feed-forward (feed_forward already ends with dropout, so none is added here)
        x = x + self.feed_forward(self.norm3(x))
        return x
Causal Mask Must Be Applied in Decoder Self-Attention
If the causal mask is missing or applied incorrectly, the decoder will see future tokens during training, making the task trivial (it can copy the answer). The model will have artificially low loss but fail completely at inference when future tokens are not available.
Production Insight
LayerNorm placement matters: the original Transformer used Post-LN (norm after the residual addition). Modern implementations use Pre-LN (norm at the start of each residual branch, before the sublayer) for better training stability.
Pre-LN reduces gradient vanishing and allows higher learning rates. The paper used Post-LN with learning rate warmup; Pre-LN often eliminates the need for warmup.
Rule: Use Pre-LN for new Transformer implementations. It's more stable and converges faster, especially with deep models (>12 layers).
Key Takeaway
Encoder: unmasked self-attention + feed-forward. Decoder: masked self-attention + cross-attention + feed-forward.
Causal mask (upper triangle of -inf) prevents decoder from attending to future tokens during training.
Rule: Cross-attention uses encoder output as K and V, decoder output as Q. No causal mask is applied (the encoder is fully visible); only a padding mask, if the batch is padded.
LayerNorm Placement: Pre-LN vs Post-LN
If: Model depth < 12 layers, you want to exactly replicate original paper
Use: Post-LN (norm after residual addition). Requires learning rate warmup and careful tuning.
If: Model depth > 12 layers or you want stable training without warmup
Use: Pre-LN (norm before sublayers). Dominant in modern LLMs (GPT-2/GPT-3, Llama; BERT, by contrast, kept Post-LN). More tolerant of high learning rates.
If: You're experiencing gradient explosion or vanishing in deep model
Use: Switch to Pre-LN and add gradient clipping (max norm 1.0). Post-LN becomes unstable beyond 24 layers.

Training Dynamics: Residual Connections, LayerNorm, and Dropout

The Transformer's depth (6 encoder and 6 decoder layers in both the base and big configurations; the big model is wider, not deeper) requires careful architectural choices to enable gradient flow. The paper uses three key components: residual connections, layer normalization, and dropout.

Residual connections (skip connections): each sublayer output is added to its input: output = LayerNorm(x + Sublayer(x)). This lets gradients flow directly through the network, preventing vanishing gradients in deep stacks. Without residuals, a 6-layer Transformer would be nearly untrainable.

Layer Normalization: normalizes activations across the feature dimension (d_model). Unlike BatchNorm, LayerNorm is independent of batch size and works for variable-length sequences. The original paper placed LayerNorm after the residual addition (Post-LN), but modern practice places it before (Pre-LN).

Dropout: applied to the output of each sublayer (before addition) and to attention weights. Dropout rate 0.1 is standard. Insufficient dropout causes overfitting within 2-3 epochs on small datasets; too much dropout (>0.3) slows convergence.

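Learning rate schedule: the paper increases the learning rate linearly over 4000 warm-up steps, then decays it proportionally to the inverse square root of the step count: lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). For d_model=512 this peaks at about 7e-4. Pre-LN often makes warmup unnecessary.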
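Evaluating the paper's schedule formula, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5), is a quick way to see its shape and peak. A minimal sketch (`noam_lr` is a hypothetical function name; the defaults are the base-model values):

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the paper's warm-up / inverse-square-root schedule."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(f"step {step:>6}: lr = {noam_lr(step):.2e}")

peak = noam_lr(4000)
```

With these defaults the peak at step 4000 is roughly 7.0e-4, and quadrupling the step count after warmup halves the rate (step 16000 gives about 3.5e-4).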

Here's a practical insight: if you see training loss spike around step 4000 (the peak of warmup), the learning rate is too high. Reduce peak LR or extend warmup to 8000 steps. If loss plateaus and refuses to drop, you likely have one of two issues: Post-LN with insufficient warmup, or missing scaling factor in attention.

Also, label smoothing of 0.1 was used in the paper. It keeps the model from becoming overconfident; the paper notes that it hurts perplexity but improves accuracy and BLEU. Many modern LLMs skip label smoothing, but for translation it helped.
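In PyTorch, label smoothing is available through the `label_smoothing` argument of `nn.CrossEntropyLoss` (PyTorch 1.10+). A small sketch with hand-picked logits (illustrative values, not from any trained model) shows how it penalises an overconfident but correct prediction:

```python
import torch
import torch.nn as nn

# One example, three classes; the model is very confident in the correct class 0
logits = torch.tensor([[10.0, 0.0, 0.0]])
target = torch.tensor([0])

plain_loss = nn.CrossEntropyLoss()(logits, target).item()
smoothed_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target).item()

print(f"plain: {plain_loss:.4f}, smoothed: {smoothed_loss:.4f}")
```

The plain loss is close to zero, while the smoothed loss stays well above it because some target mass is spread over the other classes, which discourages pushing logits to extremes.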

For production training at scale, mixed precision (FP16/BF16) is standard. The original paper used FP32, but modern implementations use automatic mixed precision (AMP) for 2x throughput with negligible accuracy loss. The one gotcha: loss scaling in FP16 can overflow if the loss spikes during warmup. BF16 (if your hardware supports it) eliminates this problem entirely and is now the default for LLM training.

io/thecodeforge/ml/transformer_training_dynamics.py · Python
import torch
import torch.nn as nn
# Assumes MultiHeadAttention (defined in transformer_attention.py above) is in scope

# Example: Pre-LN vs Post-LN comparison
class EncoderLayerPostLN(nn.Module):
    """Original Post-LN implementation"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x)))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

class EncoderLayerPreLN(nn.Module):
    """Modern Pre-LN implementation"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        h = self.norm1(x)  # normalise once, reuse for Q, K, V
        x = x + self.dropout(self.self_attn(h, h, h))
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x
Residual Connections as Gradient Highways
  • Without residuals, the gradient must pass through N attention + FFN layers. Each layer compresses the gradient; after 12 layers it can vanish.
  • Residuals let the gradient 'skip' layers. The network can choose to rely on the shortcut or the transformed path.
  • Pre-LN places LayerNorm on the residual branch, keeping the main path clean. This is why Pre-LN works better with deep models.
  • In practice, removing residual connections from even a 6-layer Transformer causes training to diverge.
Production Insight
Dropout too low (0.0) causes overfitting within 3 epochs on small datasets like WMT En-De.
Learning rate warmup of 4000 steps is critical for Post-LN; Pre-LN can use constant LR 1e-4 from step 0.
Rule: If loss spikes after warmup, reduce peak LR or increase warmup steps. If loss plateaus high, increase learning rate or check for vanishing gradients (Pre-LN fixes this).
Key Takeaway
Residual connections enable gradient flow through depth; LayerNorm stabilizes activations; Dropout prevents overfitting.
Pre-LN (norm before sublayer) is now standard over Post-LN (norm after) for stable training and higher learning rates.
Rule: For any Transformer with 12+ layers, use Pre-LN. Post-LN requires careful learning rate warmup and tuning.
Training Configuration for Transformer
If: Dataset < 1M examples (academic benchmarks)
Use: Dropout 0.1, label smoothing 0.1, weight decay 0.01. Use Pre-LN with constant LR 1e-4.
If: Dataset > 10M examples (web-scale)
Use: Reduce dropout to 0.0 or very small (0.05). Use Pre-LN with cosine decay LR from 3e-4 to 1e-5.
If: Model depth > 24 layers (large LLM)
Use: Pre-LN exclusively. Add gradient clipping (max norm 1.0). Consider using Post-LN with extra tuning only for reproduction.
● Production incident · POST-MORTEM · severity: high

The Positional Encoding That Wasn't

Symptom
Training loss dropped normally. Validation loss on held-out sequences was also low. But when deployed to predict future time steps, the model produced completely flat predictions (average of all training values). The model had learned to ignore order entirely, predicting the same output regardless of input sequence order.
Assumption
The team assumed the Transformer's self-attention mechanism would naturally capture positional information because the input sequence is fed in order. They didn't know that self-attention is permutation-invariant: swapping two tokens produces the same attention distribution. Without positional encodings, the model cannot tell the difference between [a,b,c] and [c,b,a].
Root cause
The Transformer has no built-in concept of token position. The formula Attention(Q,K,V) = softmax(QK^T/√d_k)V is permutation-equivariant: swapping two tokens in the input sequence swaps the same rows in Q, K, V and in the output, but each token's attention weights over the other tokens remain unchanged. The model learned to rely on content alone, ignoring the order of time steps. For forecasting, this meant the model treated the history as an unordered set of values, so it could not tell a rising sequence from a falling one.
Fix
1. Added sinusoidal positional encoding before the first encoder layer: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
2. Verified that the encoding was being added, not concatenated (a common implementation mistake).
3. Tested with shuffled sequences: with encoding, the model's output changed; without encoding, output remained identical.
4. Switched to learned positional embeddings for better performance (trainable parameters).
5. Added an assertion that input positions range from 0 to seq_len-1 before adding to embeddings.
6. Re-ran training with a positional encoding sanity check: feed reversed input and confirm output differs.
Key lesson
  • The Transformer is permutation-invariant without positional encodings. Your model cannot tell order without explicit position information.
  • Sinusoidal encodings (original paper) are not learned. They work for unseen sequence lengths but may underperform learned embeddings on fixed-length tasks.
  • Always add positional encodings to input embeddings, not concatenate. Concatenation doubles the dimension, breaking the projection layers.
  • Test positional invariance: shuffle input tokens during validation and verify that output changes (or doesn't, depending on task).
  • If you see flat predictions in a sequence task, check positional encoding presence first — not model capacity.
Production debug guide: symptom → action mapping for common Transformer failures in production ML systems (5 entries)
Symptom · 01
Training loss low, validation loss good, but production predictions are nonsensical
Fix
Check if positional encoding is applied. Feed permuted inputs during validation and see if outputs change. Add explicit position IDs to forward pass and verify they're used.
Symptom · 02
Training takes 10x longer than reported in paper with same parameters
Fix
Check how the decoder is trained. With a causal mask and teacher forcing, all positions train in one parallel pass; a token-by-token loop during training forfeits that parallelism. Also check the attention implementation: quadratic O(n²) memory means longer sequences crash; use flash attention.
Symptom · 03
Cross-attention not working — decoder copies encoder input without attending
Fix
Cross-attention K and V come from encoder, Q from decoder. Check that you're not accidentally using decoder self-attention for cross-attention. Also verify the mask is not incorrectly applied to cross-attention.
Symptom · 04
Model overfits dramatically after 2-3 epochs
Fix
Dropout likely missing or too small (default 0.1 recommended). Check LayerNorm placement: should be before attention/FFN (Pre-LN) for stable training, not after (Post-LN).
Symptom · 05
Inference memory exceeds training memory for same sequence length
Fix
KV caching for autoregressive generation is not implemented. Implement a KV cache: store previous K, V from each decoder layer, compute only the new token's Q, K, V, and append the new K, V to the cache. This reduces the attention cost of each generation step from O(n²) to O(n).
★ Transformer Debug Cheat Sheet: fast diagnostics for Transformer issues in production ML deployments.
Output independent of input order — model treats sequence as bag-of-words
Immediate action
Check positional encoding addition
Commands
grep -n 'positional_encoding' model.py
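python -c "from positional_encoding import PositionalEncoding; pe = PositionalEncoding(512, max_len=10).pe; print(pe[0, 0, :4], pe[0, 1, :4])"  # assumes positional_encoding.py (above) is on the import path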
Fix now
Add positional encoding to input embeddings: x = x + pos_enc[:, :seq_len]. Use sinusoidal for variable length, learned for fixed length.
Training OOM: CUDA out of memory on 8K sequence
Immediate action
Check if attention is O(n²) and sequence length is large
Commands
nvidia-smi --query-gpu=memory.used --format=csv
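No extra package is needed for PyTorch's built-in kernels: call F.scaled_dot_product_attention(q, k, v, is_causal=True). It has no `enable_flash` argument; to force the flash backend in recent PyTorch releases, wrap the call in `with torch.nn.attention.sdpa_kernel(SDPBackend.FLASH_ATTENTION):`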
Fix now
Replace manual attention with torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+). Use flash attention kernel: memory O(n) not O(n²).
Loss not decreasing: model not learning
Immediate action
Check if causal mask is applied correctly
Commands
grep -n 'mask' model.py
python test_mask.py --visualize
Fix now
The causal mask must block the upper triangle (key positions > query position i). Shape: (seq_len, seq_len). Apply it to the scores before softmax: with a lower-triangular mask of ones, scores = scores.masked_fill(mask == 0, -1e9).
NaN loss after first iteration
Immediate action
Check for numerical stability: softmax with large logits, divide by sqrt(d_k)
Commands
grep -n 'softmax' model.py
python -c "..." check: any(torch.isnan(p).any() for p in model.parameters())
Fix now
Add scores = scores / math.sqrt(d_k) before softmax. Use torch.nn.init.xavier_uniform_ for weights, not random normal.
Decoder sees future tokens during training (impossibly good loss)
Immediate action
Verify that causal mask blocks future positions
Commands
python -c "import torch; attn = torch.randn(1,8,8); mask = torch.triu(torch.ones(8,8), diagonal=1).bool(); attn = attn.masked_fill(mask, -1e9); print('Future masked?', (attn[0,0,1] == -1e9).item())"
grep -n 'triu' model.py
Fix now
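Create a mask of blocked positions: blocked = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool(), then scores = scores.masked_fill(blocked, float('-inf')) before softmax. For the mask == 0 convention used by MultiHeadAttention above, pass torch.tril(torch.ones(seq_len, seq_len)) instead.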
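Several of the cheat-sheet fixes point at PyTorch's built-in attention. As a sanity check before swapping it in, the sketch below compares `F.scaled_dot_product_attention` (PyTorch 2.0+) against manual causal attention on random tensors. Which kernel actually runs (flash, memory-efficient, or plain math) depends on hardware, dtype, and PyTorch version, so this checks numerical agreement only, not memory use.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_heads, seq_len, d_k = 2, 4, 16, 32
q, k, v = (torch.randn(batch, n_heads, seq_len, d_k) for _ in range(3))

# Manual attention: materialises the full [seq_len, seq_len] score matrix
blocked = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
manual = F.softmax(scores.masked_fill(blocked, float("-inf")), dim=-1) @ v

# Built-in attention: is_causal=True applies the same lower-triangular masking internally
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_diff = (manual - fused).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

If the two paths agree on a small case like this, the built-in call can replace the manual score/softmax/matmul sequence without changing model behaviour, apart from dropout handling, which is passed via `dropout_p`.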
RNN/LSTM vs Transformer vs Linear Attention
| Aspect | RNN / LSTM | Transformer (Self-Attention) | Flash Attention / Linear Attention |
| --- | --- | --- | --- |
| Time complexity per token | O(1) (processes one token at a time, state from previous) | O(n) (attends to all previous tokens) | Linear attention: O(1) with kernel approximation. Flash Attention: still O(n), exact |
| Total training complexity | O(n) sequential: cannot parallelise across time steps | O(n²) compute, O(n²) memory | Linear attention: O(n) compute, O(n) memory. Flash Attention: O(n²) compute, O(n) memory |
| Parallelisation across time steps | Impossible (sequential recurrence) | Full parallel (all tokens processed simultaneously) | Full parallel |
| Long-range dependency (n=1000) | Exponential decay (gradient vanishes or explodes) | Direct connection via attention (no distance penalty) | Direct connection (approximate for linear attention, exact for Flash Attention) |
| Positional encoding needed? | No (sequential by design) | Yes (permutation-invariant without it) | Yes |
| Memory for n=100k | O(1) state size, O(n) activations | O(n²) attention matrix: about 20 GB per head in FP16 (40 GB in FP32) | O(n) memory using tiling (Flash Attention) or kernel approximation (linear attention) |
| Inference KV caching cost | O(1) per new token (state carries forward) | O(n) per new token (must attend to all previous) | O(1) per new token with recurrent formulation (linear attention) |
| Example pretrained models | ELMo, Seq2Seq with Luong attention | GPT, BERT, T5, Llama, Claude | RWKV, RetNet, Hyena |

Key takeaways

1. The Transformer replaces recurrence with self-attention, enabling full sequence parallelism and O(1) path length between any two tokens.
2. Scaled dot-product attention: softmax(Q·K^T / √d_k)·V. Scaling prevents softmax saturation and vanishing gradients.
3. Multi-head attention runs h parallel attention heads in low-dimensional subspaces (d_k = d_model/h), then concatenates their outputs.
4. Positional encoding is mandatory: without it, the Transformer is permutation-invariant and cannot distinguish token order.
5. The encoder uses unmasked self-attention; the decoder uses causally masked self-attention to prevent future-token leakage, plus cross-attention over the encoder output.
6. Standard attention is O(n²) memory; Flash Attention reduces this to O(n) via tiling, enabling 100k+ token contexts.
7. Residual connections and Pre-LN are essential for training deep Transformers stably; most new models use Pre-LN rather than the original Post-LN.
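For the last takeaway, a Pre-LN block can be sketched as follows. The class name `PreLNBlock` and the sizes are illustrative assumptions; the point is only the ordering: LayerNorm is applied before each sub-layer, and the residual path itself stays un-normalised.

```python
import torch
from torch import nn

class PreLNBlock(nn.Module):
    """Encoder-style block with LayerNorm applied before each sub-layer (Pre-LN)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)                                   # normalise first
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                                    # residual around attention
        x = x + self.ffn(self.norm2(x))                     # residual around feed-forward
        return x

block = PreLNBlock()
x = torch.randn(2, 10, 64)   # (batch, seq_len, d_model)
y = block(x)
```

In the Post-LN layout the normalisation would instead wrap the sum, as in norm(x + sublayer(x)).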

Common mistakes to avoid

5 patterns

Omitting positional encoding — model can't distinguish token order

Symptom
Training loss converges, but the model fails on order-dependent tasks (translation, question answering). On sequence classification it may still learn some patterns, but it underperforms.
Fix
Add x = x + positional_encoding(seq_len, d_model) before first encoder/decoder layer. Use sinusoids for variable-length, learned embeddings for fixed-length.
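The effect is easy to reproduce in isolation. In this sketch (single head, Q = K = V = x, and a random table `pos` standing in for any positional encoding), reversing the token order leaves the mean-pooled attention output unchanged until position information is added.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 6, 32

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Unmasked single-head self-attention with Q = K = V = x."""
    scores = (x @ x.transpose(-2, -1)) / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(seq_len, d_model)        # token embeddings
perm = torch.tensor([5, 4, 3, 2, 1, 0])  # reversed token order
pos = torch.randn(seq_len, d_model)      # stand-in for a positional encoding table

# Without positions: pooled output is the same for both orderings
plain = self_attention(x).mean(dim=0)
plain_shuffled = self_attention(x[perm]).mean(dim=0)

# With positions: the two orderings are now distinguishable
with_pos = self_attention(x + pos).mean(dim=0)
with_pos_shuffled = self_attention(x[perm] + pos).mean(dim=0)
```

Running the same comparison on a full model with shuffled inputs is a quick way to confirm that the positional encoding is actually wired in.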

Not scaling dot products by 1/√d_k before softmax — gradients vanish

Symptom
Loss decreases very slowly or not at all. Attention entropy is too low (nearly one-hot) because logits are large in magnitude.
Fix
Divide scores by math.sqrt(d_k) before softmax: scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k). For query and key components with unit variance, this keeps the variance of the logits near 1 regardless of d_k.
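The variance claim can be checked numerically with the standard library alone. This sketch assumes query and key components drawn independently from a unit normal, which is an idealisation of what happens at initialisation.

```python
import math
import random
import statistics

random.seed(0)
d_k = 64

def random_dot(d: int) -> float:
    """Dot product of two random vectors with independent N(0, 1) components."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d))

raw = [random_dot(d_k) for _ in range(5000)]
scaled = [s / math.sqrt(d_k) for s in raw]

var_raw = statistics.variance(raw)        # grows with d_k (about d_k)
var_scaled = statistics.variance(scaled)  # stays near 1
print(f"unscaled variance: {var_raw:.1f}, scaled variance: {var_scaled:.2f}")
```

With unscaled logits of standard deviation around √d_k, softmax tends to put almost all weight on one key, which is the near one-hot pattern described in the symptom.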

Using causal mask in encoder or cross-attention by mistake

Symptom
Model underperforms because encoder cannot see full input context (only left context). Cross-attention artificially limits what decoder can see from encoder.
Fix
Encoder self-attention: no causal mask (padding mask only). Decoder cross-attention: no causal mask (the full encoder output is visible, apart from padding). Decoder self-attention: causal mask.
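The three cases map onto three calls. This sketch uses random tensors as stand-ins for already-projected encoder and decoder states, and omits padding masks for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, d_k = 1, 2, 16
src_len, tgt_len = 7, 5

enc = torch.randn(batch, heads, src_len, d_k)  # stand-in for projected encoder states
dec = torch.randn(batch, heads, tgt_len, d_k)  # stand-in for projected decoder states

# Encoder self-attention: bidirectional, no causal mask
enc_self = F.scaled_dot_product_attention(enc, enc, enc)

# Decoder self-attention: causal, position i sees only positions <= i
dec_self = F.scaled_dot_product_attention(dec, dec, dec, is_causal=True)

# Cross-attention: Q from the decoder, K and V from the encoder, no causal mask
cross = F.scaled_dot_product_attention(dec, enc, enc)
```

Note that the cross-attention output has the decoder's length, not the encoder's: one output row per query.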

Forgetting to apply mask to attention scores before softmax

Symptom
Decoder attends to future tokens during training, making loss artificially low. Model fails at inference when future tokens are unavailable.
Fix
scores = scores.masked_fill(mask == 0, -1e9), then attn = softmax(scores). After softmax, the large negative score gives the masked positions a weight of effectively zero.
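One reason to prefer a large finite value such as -1e9 over -inf: if a row ends up fully masked (for example a padding-only row), softmax over all -inf returns NaN, while -1e9 degrades to a uniform row. A small sketch with a hand-written keep-mask:

```python
import torch

scores = torch.tensor([[0.5, 1.0, 2.0],
                       [0.3, 0.1, 0.7]])
# Keep-mask: row 0 may attend only to position 0; row 1 is fully masked
mask = torch.tensor([[1, 0, 0],
                     [0, 0, 0]])

w_finite = torch.softmax(scores.masked_fill(mask == 0, -1e9), dim=-1)
w_inf = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)

print(w_finite)  # row 1 becomes uniform
print(w_inf)     # row 1 becomes NaN
```

A causal mask alone never produces a fully masked row, because the diagonal is always kept; the issue appears when causal and padding masks are combined.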

Using standard attention on long sequences without Flash Attention — OOM

Symptom
CUDA out of memory for sequence length > 4k. Memory grows quadratically with sequence length.
Fix
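Use F.scaled_dot_product_attention (PyTorch 2.0+), which dispatches to a Flash Attention kernel when the device and dtype support it; enable_flash=True is a flag of the torch.backends.cuda.sdp_kernel context manager, not of the function itself. For still longer sequences, use a dedicated long-context attention implementation or a linear attention variant.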
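The quadratic growth is simple arithmetic. This sketch counts only one materialised score matrix for a single head and batch element, so real usage with several heads, a batch dimension, and saved activations will be higher.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes for one (seq_len x seq_len) score matrix: single head, single batch element."""
    return seq_len * seq_len * bytes_per_element

for n in (4_096, 8_192, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {gib:,.2f} GiB per head in FP16")
```

Doubling the sequence length quadruples this figure, which is why a model that fits comfortably at 4k can run out of memory at 8k.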
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What is the difference between self-attention, cross-attention, and caus...
Q02 · SENIOR
Why does the Transformer use multi-head attention instead of a single at...
Q03 · SENIOR
Why does the Transformer use sinusoidal positional encodings instead of ...
Q04 · SENIOR
Explain the time and memory complexity of standard attention and how Fla...
Q05 · SENIOR
What happens if you remove residual connections from a Transformer?
Q01 of 05 · SENIOR

What is the difference between self-attention, cross-attention, and causal attention in Transformers?

ANSWER
Self-attention computes attention between positions within the same sequence (Q, K, and V all come from the same input). Cross-attention computes attention between two different sequences: Q comes from one sequence (e.g., the decoder), K and V from another (e.g., the encoder). Causal attention is a form of self-attention in which position i can attend only to itself and earlier positions (positions ≤ i), enforced by a mask that sets the scores of future positions to -inf before softmax. The encoder uses self-attention without causal masking (fully bidirectional). The decoder uses causally masked self-attention (autoregressive) plus cross-attention over the encoder output.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What is the difference between causal masking and padding masking in Transformers?
02. How do you set up learning rate warmup for Transformers?
03. Why does the Transformer need a high learning rate warmup but RNNs don't?
04. What is the advantage of Flash Attention over standard attention?
05. What are the best practices for dropout in Transformers?
06. How many layers does the original Transformer have, and how does depth affect modern LLMs?