Question Answering Transformers: Last Chars Bug
Extractive QA answers drop last 2-5 chars from subword offset bug.
20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.
- Extractive QA: given context + question, predict start and end token positions of the answer span within the context.
- BERT adds two classification heads: start_logits and end_logits. Loss = cross-entropy on both.
- SQuAD2.0 adds unanswerable questions: model must predict start=0, end=0 when no answer exists.
- Performance: base BERT does ~200 QPS on A100 at 128 seq_len, ~40 QPS at 512 seq_len.
- Production failure: tokenizer subword splitting ("jumping" → ["jump", "##ing"]) causes answer span misalignment, typically clipping or shifting the answer by 2-5 characters.
- Biggest mistake: aligning predictions to raw text without offset mapping — answers come back with extra spaces or wrong characters.
Imagine you hand a really well-read librarian a specific page from a book, then ask them a question. Instead of re-reading the whole library, they scan just that page, underline the answer, and hand it back in seconds. That's extractive question answering — the model gets a context passage and a question, then figures out exactly which words in that passage ARE the answer. It doesn't make anything up; it just finds the right underline.
Every time you ask Google a question and get a highlighted snippet, or query an enterprise chatbot about a policy document and get a crisp sentence back, you're watching a QA Transformer do its job. These systems run inside medical record search engines, legal document tools, customer support bots, and developer documentation assistants. They're not research toys anymore — they're infrastructure.
The core problem QA Transformers solve is that traditional keyword search returns documents, not answers. A user who types 'what is the maximum file upload size' doesn't want ten blue links — they want '25 MB'. Extractive QA bridges that gap by treating the problem as: given a context string and a question string, predict the start token and end token of the answer span within the context. That framing turns a fuzzy language problem into two classification heads on top of a contextual encoder.
By the end of this article you'll understand exactly how BERT's dual span-prediction heads work internally, how to fine-tune a QA model on SQuAD2.0 from scratch with real code, how to handle impossible questions and long contexts that exceed the 512-token window, and what will actually bite you when you ship this to production. We'll cover confidence thresholding, sliding-window chunking, quantization trade-offs, and the subtle tokenizer alignment bug that ruins more QA systems than any model choice does.
How Question Answering Transformers Actually Extract Answers
Question answering transformers are models that locate a span of text within a given context to answer a natural language question. The core mechanic is a single shared encoder: the question and the context are concatenated into one input sequence, every context token can attend to the question, and a final layer predicts start and end token positions for the answer span. This is fundamentally a span extraction task, not generation: the answer must exist verbatim in the context.
In practice, the cost of full self-attention grows quadratically, O(n²), with input length, and the maximum input size is typically 384–512 tokens. The output is a pair of logits for each token, one start logit and one end logit, each converted to probabilities via a softmax over the sequence. The answer is the span with the highest joint score of start and end tokens. A common constraint is that the end token must not come before the start token, enforced by masking invalid combinations.
Use this approach when answers are known to be contained in a document or passage, such as in FAQ systems, legal document review, or customer support ticket triage. It matters because it provides exact, verifiable answers with no hallucination risk — unlike generative models that may invent facts. The trade-off is that it cannot answer questions requiring synthesis or information not present in the provided context.
How Extractive QA Works — Span Prediction on BERT
Extractive question answering frames the problem as finding a contiguous span of tokens in the context that answers the question. BERT-based models solve this by adding two classification heads on top of the encoder: a start head and an end head.
Architecture breakdown: The input format is [CLS] question [SEP] context [SEP]. BERT produces a contextualised embedding for every token in the sequence. The start head is a linear layer that maps each token's embedding to a logit score — how likely this token is to be the start of the answer. The end head does the same for end positions. During training, the loss is the sum of cross-entropy on start positions and cross-entropy on end positions.
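The training objective described above can be sketched in plain Python. This is a minimal illustration, not the library implementation: the helper names `log_softmax` and `span_loss` and the toy logits are assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def span_loss(start_logits, end_logits, true_start, true_end):
    """Cross-entropy on the start position plus cross-entropy on the end position."""
    loss_start = -log_softmax(start_logits)[true_start]
    loss_end = -log_softmax(end_logits)[true_end]
    return loss_start + loss_end

# Toy logits for a 5-token sequence (illustrative values, not model output).
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.1, 0.1, 0.5, 3.5, 0.2]

good = span_loss(start_logits, end_logits, true_start=2, true_end=3)
bad = span_loss(start_logits, end_logits, true_start=0, true_end=4)
print(f"loss when the gold span matches the logit peaks: {good:.3f}")
print(f"loss when the gold span is elsewhere:            {bad:.3f}")
```

The loss is small when the gold start and end tokens carry the largest logits and grows as probability mass sits elsewhere, which is all the two heads are trained to fix.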
During inference, you compute all (start, end) pairs where start ≤ end, sum their start_logit + end_logit, and pick the highest-scoring span. For SQuAD2.0, you also have a 'no answer' option: the [CLS] token is treated as both a start and end position, and its score is compared against the best span score. If null_score is high enough, the model outputs no answer.
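The span search described above can be sketched as follows. Treat it as a simplified illustration: `best_span` is a hypothetical helper, the `max_answer_len` cap is an assumption added to keep the search bounded, and index 0 is assumed to be the [CLS] position.

```python
def best_span(start_logits, end_logits, max_answer_len=30, null_threshold=0.0):
    """Return (start, end, score), or None when the null answer wins.

    Index 0 is assumed to be [CLS], which doubles as the 'no answer' slot.
    """
    null_score = start_logits[0] + end_logits[0]
    best = None
    for s in range(1, len(start_logits)):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    # No answer if the best span does not beat the null score by the threshold.
    if best is None or best[2] < null_score + null_threshold:
        return None
    return best

# Toy logits (illustrative values, not model output).
answerable = best_span([1.0, 0.2, 5.0, 0.1, 0.3], [1.0, 0.1, 0.4, 4.0, 6.0])
unanswerable = best_span([6.0, 0.1, 0.2], [6.0, 0.3, 0.1])
print(answerable)    # (2, 4, 11.0)
print(unanswerable)  # None
```

The double loop is quadratic in sequence length; production code usually restricts the search to the top few start and end candidates, but the selection rule is the same.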
A critical detail often overlooked: BERT's position embeddings only go up to 512. If your context is longer than 512 tokens, the encoder has no way to distinguish tokens beyond the limit. The tokenizer physically truncates the input. This is why sliding-window approaches are necessary for long documents.
Without return_offsets_mapping=True, you cannot recover the original character positions from token indices. Subword tokens (e.g., '##ing') break simple concatenation. Always store the offset mapping during tokenisation and use it to extract answers from the raw context string, not from token strings.
Fine-Tuning a QA Model — From BERT to SQuAD2.0
Fine-tuning a pre-trained BERT for QA is surprisingly straightforward because the span heads are just a thin linear layer on top of the pre-trained encoder. The key is preparing your data in the exact format the model expects: a question, a context, and a start position + end position.
Dataset format: For each example in SQuAD2.0, you have a context, a question, and either an answer dict with text and answer_start, or is_impossible: true. The answer_start is the character offset of the answer within the context. During preprocessing, you tokenise the question+context pair, then locate which token indices correspond to the answer's character range. This is where the offset_mapping comes in: the start token is the one whose character span contains answer_start, and the end token is the one whose character span contains the answer's last character, at answer_start + len(answer) - 1.
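A minimal sketch of that character-to-token alignment, assuming the offsets are (start_char, end_char) pairs for the context tokens only. The helper name `char_span_to_token_span` and the hand-written offsets are illustrative assumptions; in a real pipeline the offsets come from the tokenizer and the token indices are shifted by the question and special tokens.

```python
def char_span_to_token_span(offsets, answer_start, answer_text):
    """Map a character-level answer to (start_token, end_token), inclusive.

    Returns None if the answer does not line up with any tokens,
    for example because it was truncated out of the window.
    """
    answer_end = answer_start + len(answer_text)  # exclusive
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start <= answer_start < tok_end:
            start_token = i
        if tok_start < answer_end <= tok_end:
            end_token = i
    if start_token is None or end_token is None:
        return None
    return start_token, end_token

context = "The patient was diagnosed with diabetes mellitus."
# Hand-written toy offsets; the subword split of "mellitus" is invented
# for illustration and is not the output of a real tokenizer.
offsets = [(0, 3), (4, 11), (12, 15), (16, 25), (26, 30),
           (31, 39), (40, 43), (43, 48), (48, 49)]

answer_text = "diabetes mellitus"
answer_start = context.index(answer_text)
span = char_span_to_token_span(offsets, answer_start, answer_text)
print(span)  # (5, 7)
```

When the function returns None for an answerable example, the usual choice is to label that window as unanswerable rather than guess a position.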
For unanswerable questions, the answer should be the [CLS] token (index 0) for both start and end. The model learns to output high start_logits[0] and end_logits[0] when there's no answer.
Training hyperparameters: Learning rate 3e-5, batch size 8-16 (depending on GPU memory), 2-3 epochs. BERT-base fits on a single 16GB GPU with batch size 8 at 384 sequence length. For longer contexts (512), reduce batch size to 4-6.
Critical detail: SQuAD2.0's unanswerable questions are balanced almost 50/50. If your domain has a different ratio (e.g., medical QA where every query should have an answer), you'll need to reweight the null loss or adjust the threshold. Fine-tuning on imbalanced null labels can cause the model to either always answer (false positives) or never answer (false negatives).
Handling Long Contexts — Sliding Windows and Longformer
BERT's maximum input length is 512 tokens. For many production QA tasks — legal documents, research papers, medical records — your context can be thousands of tokens long. You have three options, each with trade-offs.
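Option 1: Sliding Window Chunking. Split the context into overlapping chunks of up to 384 tokens. Be explicit about what "stride" means in your tooling: Hugging Face tokenizers use stride for the number of tokens that consecutive chunks share, while other code uses it for the step between chunk starts. Run QA inference on each chunk independently, then aggregate the answers. For each chunk, you get a (start, end, score) triple. Take the highest score across all chunks as your final answer. This keeps the BERT architecture untouched. The cost: inference time scales linearly with the number of chunks. A 2000-token document with a 128-token step becomes ~14 chunks, so roughly 14x slower; with a 128-token overlap it is closer to 9.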
Option 2: Use Longformer or BigBird. These architectures replace BERT's full attention (O(n²)) with sparse attention patterns (O(n)). Longformer-base supports up to 4096 tokens, BigBird up to 4096. They're fine-tuned on SQuAD-like tasks and can be dropped in as replacements. Performance is slightly lower than BERT on short contexts but far better on long ones. Memory usage is still high — 4096 tokens on Longformer-base uses ~16GB VRAM.
Option 3: Semantic Chunking. Instead of fixed-size windows, split on sentence boundaries or paragraphs. Retrieve the most relevant chunks using BM25 or a retriever (e.g., DPR), then run QA only on the top-k chunks. This reduces inference cost dramatically but adds a retrieval component (and retrieval errors).
Production recommendation: Start with sliding window for simplicity. If your dataset has many very long contexts (>2000 tokens) and latency is a concern, implement Longformer. For extremely long documents (e.g., entire contracts), use retrieval + QA.
- Each window is a complete BERT input: question + context chunk.
- Overlap (stride) ensures answer near a chunk boundary isn't missed.
- You get an answer candidate + confidence score from each window.
- Final answer = candidate with highest confidence across all windows.
- Cost: #windows × base_inference_time. A 2000-token document needs roughly 9-14 windows at 384 tokens per window, depending on how far each window advances.
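The windowing and aggregation steps in the list above can be sketched without any model. The names `make_windows` and `pick_best` are hypothetical, and the figure of 360 context tokens per window is an assumption (384 minus room for the question and special tokens).

```python
def make_windows(num_context_tokens, window_size, overlap):
    """Return (start, end) token ranges covering the context, end exclusive.

    window_size is the number of context tokens that fit in one input;
    overlap is how many tokens consecutive windows share.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_context_tokens)
        windows.append((start, end))
        if end >= num_context_tokens:
            break
        start += step
    return windows

def pick_best(candidates):
    """Pick the highest-scoring (answer_text, score) pair across all windows."""
    return max(candidates, key=lambda c: c[1])

# 128 shared tokens between neighbours: each window advances 232 tokens.
print(len(make_windows(2000, 360, overlap=128)))  # 9
# Advancing only 128 tokens per window (232 shared tokens).
print(len(make_windows(2000, 360, overlap=232)))  # 14

# Illustrative per-window candidates, not real model output.
print(pick_best([("25 MB", 7.1), ("upload size", 3.4), ("25", 5.0)]))
```

The two counts show why the meaning of stride matters: the same document costs 9 or 14 forward passes depending on whether 128 is the overlap or the step.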
Production QA — Latency, Quantization, and Confidence Thresholds
Shipping a QA model to production requires more than just accuracy. Latency, memory, and decision thresholds determine whether your system is usable.
Approximate throughput in queries per second (the BERT-base figures are for an A100):
- BERT-base (seq_len=128): 200-250 QPS
- BERT-base (seq_len=512): 40-50 QPS
- DistilBERT-base (seq_len=512): 80-100 QPS
- INT8-quantized BERT (seq_len=512): 120-150 QPS
Memory footprint: BERT-base in FP32 is 440MB. FP16 halves to 220MB. INT8 quantisation reduces to ~110MB with 1-2% accuracy loss on SQuAD. For CPU inference, ONNX Runtime with int8 quantisation runs BERT at 10-20ms per 128-token query.
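The footprint figures follow from simple arithmetic on the parameter count. The sketch below assumes roughly 110 million parameters for BERT-base and counts weights only, ignoring activations and runtime overhead.

```python
PARAMS_BERT_BASE = 110_000_000  # approximate parameter count (assumption)

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weights_megabytes(num_params, dtype):
    """Size of the weights alone in MB (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: ~{weights_megabytes(PARAMS_BERT_BASE, dtype):.0f} MB")
```

Real INT8 deployments often keep some layers in higher precision, so the measured file size can land somewhat above the one-byte-per-parameter estimate.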
Confidence thresholds: The model's raw logit scores are not calibrated probabilities. You need to tune a threshold on your dev set to decide whether to return an answer or say "I don't know". For each example, compute score_diff = max_span_score - null_score. Plot precision/recall vs threshold to find the operating point that matches your use case. For a medical QA system where false negatives are dangerous, set a low threshold (return answers even if some are noisy). For a fact-checking system, set a high threshold (only answer when very confident).
Time to first token vs total latency: With generative models you can stream partial output, and it is tempting to expect the same here for very long contexts. But BERT-style extractive QA is not autoregressive: the model sees the whole input at once and produces all start and end logits in a single forward pass. There's no streaming. You pay the full latency on every query.
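A threshold sweep over score_diff can be sketched like this. It is deliberately simplified: `precision_recall_at` is a hypothetical helper, the dev set is toy data, and the metric only checks whether the system answered answerable questions, not whether the extracted text was correct.

```python
def precision_recall_at(threshold, examples):
    """examples: list of (score_diff, has_answer) pairs from a dev set.

    The system answers when score_diff >= threshold. Precision is the share
    of returned answers that belonged to answerable questions; recall is the
    share of answerable questions that received an answer.
    """
    answered = [has for diff, has in examples if diff >= threshold]
    total_answerable = sum(1 for _, has in examples if has)
    true_pos = sum(1 for has in answered if has)
    precision = true_pos / len(answered) if answered else 1.0
    recall = true_pos / total_answerable if total_answerable else 0.0
    return precision, recall

# Toy dev set: (score_diff, question is answerable). Values are illustrative.
dev = [(5.2, True), (3.1, True), (0.4, True), (-0.5, True),
       (1.0, False), (-2.0, False), (-4.5, False), (2.5, False)]

for threshold in (-1.0, 0.0, 1.5, 3.0):
    p, r = precision_recall_at(threshold, dev)
    print(f"threshold={threshold:+.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold pushes recall up at the expense of precision, which is the trade-off the medical and fact-checking examples sit at opposite ends of.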
GPU vs CPU: If your QPS is under 5 and latency tolerance is >200ms, CPU inference with ONNX Runtime is fine (and cheaper). For >50 QPS, use GPU.
Why You Still Need a Retriever After Training the Model
You fine-tuned BERT on SQuAD. Congrats. Now try asking it a question about your internal documents. It will fail because a transformer has a maximum context window — typically 512 tokens. That's about 300 words. Your production knowledge base is 10,000 documents. You can't jam them all into one forward pass.
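This is where a retriever comes in. It is the same retrieval step that powers Retrieval-Augmented Generation (RAG), here feeding an extractive reader instead of a generator. A retriever searches your corpus for relevant passages before the QA model ever sees text. It is usually a dense retriever: an encoder turns documents into embeddings, a vector index such as FAISS or Milvus stores them, and a nearest-neighbour search returns the top-k passages most similar to the user's question.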
You don't train the retriever on the same data as your QA model. You train it to rank relevance. Common choices: DPR (Dense Passage Retrieval) or a bi-encoder like Sentence-BERT. The QA model then only needs to extract the answer from the top 3-5 retrieved passages. This keeps inference fast and context within limits.
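The retrieve-then-read flow can be illustrated with a toy retriever. To keep the sketch dependency-free it ranks passages by shared terms, which is a stand-in for a real retriever such as BM25 or DPR, not a substitute; the function names and passages are made up for the example.

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, passages, k=3):
    """Rank passages by how many distinct question terms they contain."""
    q_terms = set(tokenize(question))
    scored = [(len(q_terms & set(tokenize(p))), i) for i, p in enumerate(passages)]
    # Highest overlap first; ties keep the original passage order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [passages[i] for score, i in scored[:k] if score > 0]

passages = [
    "The maximum file upload size is 25 MB.",
    "Passwords must be rotated every 90 days.",
    "Uploads larger than the limit are rejected with an error.",
]
question = "What is the maximum file upload size?"

top = retrieve(question, passages, k=2)
for passage in top:
    print(passage)  # only these passages would be handed to the QA model
```

The QA model then runs once per retrieved passage, so k directly bounds the number of forward passes per query.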
Never drop a transformer straight into an open-domain QA system without a retriever. Latency becomes unbounded. If your model needs to read everything, it reads nothing well.
How to Handle Out-of-Scope Questions with Confidence Calibration
Your model returns a score of 0.92 for every answer. Here is the problem: a transformer does not know what it does not know. If a user asks 'Where is the secret server?', and the context is about baking recipes, the model will still produce a confident-looking answer span. That is because softmax normalizes logits across all positions. The highest-scoring span will always win, even if it is garbage.
You need a rejection mechanism. The simplest: a confidence threshold on the model's start and end logit scores. But raw logits vary per input length. What works for a 300-token context will fail for 50 tokens.
A better approach: use a calibration dataset. Collect 100 questions where the answer is definitively not in the context. Run inference and record the model's top-1 score. Set your threshold at the 95th percentile of that 'no-answer' distribution. Anything below that gets an 'I don't know' response.
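The calibration step can be sketched as follows. The nearest-rank percentile, the helper names, and the toy scores are assumptions for illustration; real scores would come from running the model on the no-answer calibration set.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]

def calibrate_threshold(no_answer_scores, pct=95):
    """Threshold below which a score looks like the no-answer distribution."""
    return percentile(no_answer_scores, pct)

def respond(answer, score, threshold):
    """Return the answer only if its score reaches the calibrated threshold."""
    return answer if score >= threshold else "I don't know"

# Toy top-1 scores from 20 questions whose answer is not in the context.
no_answer_scores = [float(i) for i in range(1, 21)]
threshold = calibrate_threshold(no_answer_scores)
print(threshold)  # 19.0

print(respond("25 MB", 22.4, threshold))
print(respond("preheat the oven", 6.3, threshold))
```

With the 95th percentile as the cut-off, about 5% of truly unanswerable questions would still slip through on data that resembles the calibration set.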
Another option: fine-tune a separate classifier on top of the [CLS] token that predicts answerability. This adds a second head that outputs 0 or 1. But you need answerability-labeled data — SQuAD 2.0 has this built in. Do not skip the 'unanswerable' examples during fine-tuning.
Production QA is not about maximizing accuracy. It is about minimizing garbage outputs. A model that says 'I don't know' builds trust. One that hallucinates a folder name loses a client.
The Medical QA System That Kept Truncating Diagnosis Answers
convert_ids_to_tokens() and then merge any token starting with '##' into the previous token.
2. For alignment to raw text, store the character offset of the first and last token of the answer span, not token indices alone.
3. Add validation: if a predicted answer doesn't appear as a substring of the original context, log a dead-letter alert and use the span from offset mapping, not token reconstruction.
- Never convert model predictions to raw text by concatenating token strings. Subword splitting will break you.
- Always use the tokenizer's offset mapping (start_char, end_char) provided by tokenizer(return_offsets_mapping=True) to map token indices back to original character positions.
- Test your QA system on examples where the answer contains rare words: those are most likely to be subword-split.
- Add a validation check: the extracted answer string must be a substring of the original context. If it isn't, fall back to offset mapping and log the mismatch.
- This bug is invisible on SQuAD because answers are usually single common words. Production data will find it immediately.
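The lessons above can be put into a small sketch. The tokens and offsets are hand-written to mimic a subword split (they are not real tokenizer output), and the function names are hypothetical.

```python
import logging

logger = logging.getLogger("qa_alignment")

def answer_from_offsets(context, offsets, start_idx, end_idx):
    """Slice the raw context from the first token's start to the last token's end."""
    start_char = offsets[start_idx][0]
    end_char = offsets[end_idx][1]
    return context[start_char:end_char]

def answer_from_tokens(tokens, start_idx, end_idx):
    """The fragile approach: join token strings with spaces."""
    return " ".join(tokens[start_idx:end_idx + 1])

def validated_answer(context, tokens, offsets, start_idx, end_idx):
    """Check the substring invariant; fall back to offsets and log on mismatch."""
    candidate = answer_from_tokens(tokens, start_idx, end_idx)
    if candidate in context:
        return candidate
    logger.warning("token reconstruction %r not found in context", candidate)
    return answer_from_offsets(context, offsets, start_idx, end_idx)

context = "The patient was diagnosed with diabetes mellitus."
# Toy subword split and offsets, invented for illustration.
tokens = ["the", "patient", "was", "diagnosed", "with",
          "diabetes", "mel", "##litus", "."]
offsets = [(0, 3), (4, 11), (12, 15), (16, 25), (26, 30),
           (31, 39), (40, 43), (43, 48), (48, 49)]

print(answer_from_tokens(tokens, 5, 7))                   # diabetes mel ##litus
print(validated_answer(context, tokens, offsets, 5, 7))   # diabetes mellitus
```

In practice it is simpler to skip token reconstruction entirely and always slice by offsets; the validation path is shown because it doubles as a cheap regression test for alignment bugs.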
Use tokenizer(return_offsets_mapping=True) to get character-aligned spans, not token-concatenated strings.
Compare start_logits[:,0] + end_logits[:,0] (null score) and the max non-null span score. Set the threshold where the precision/recall trade-off matches your use case. For medical QA, set a low threshold (answer anything rather than say no). For fact-checking, set a high threshold.
from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased'); print(tokenizer.tokenize('diabetes mellitus'))
tokens = tokenizer('diabetes mellitus', return_offsets_mapping=True); print(tokens['offset_mapping'])
Key takeaways
Use return_offsets_mapping=True and slice the original context string instead of concatenating tokens. Subword tokens will ruin your answers.
Common mistakes to avoid
Aligning answers by concatenating token strings instead of using offset mapping
Use return_offsets_mapping=True during tokenisation. Extract the answer by slicing the original context string with start_char = offset_mapping[start_idx][0], end_char = offset_mapping[end_idx][1]. Never build the answer from token strings.
Using SQuAD-v1.1 (no unanswerable questions) when your production data has impossible queries
Ignoring domain shift — using out-of-the-box SQuAD model on legal/medical data
Hardcoding null threshold at 0.0 or using raw logit comparison without tuning
Compute score_diff = max_span_score - null_score. Plot precision/recall vs threshold. Choose the threshold that maximises F1 for your use case. For safety-critical domains, favour recall (lower threshold). Document the threshold in model cards.
Not handling long contexts — truncating to 512 tokens without warning
Interview Questions on This Topic
How does BERT perform extractive question answering? Explain the architecture and loss function.
The input is formatted as [CLS] question [SEP] context [SEP]. The model produces a contextualised embedding for every token. The start head is a linear layer mapping each token's embedding to a logit (score for being the answer's start). The end head does the same for the end position. During training, we compute cross-entropy loss on both heads. The total loss is L_start + L_end, where L_start is the negative log likelihood of the true start token and L_end is the same for the true end token. For SQuAD2.0, the [CLS] token represents 'no answer': the target for unanswerable questions is start=0, end=0. During inference, we compute all valid (start, end) pairs where start ≤ end, sum their logits, and pick the highest-scoring span. We also compute the null score from [CLS] and choose no answer if best_span_score < null_score + threshold.
Frequently Asked Questions