Text Summarization: Extractive vs Abstractive – A Production Guide for ML Engineers
Learn the key differences between extractive and abstractive text summarization, with production-ready code, evaluation metrics, common pitfalls, and real-world deployment strategies for 2026.
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
- Extractive summarization selects and concatenates existing sentences from source text.
- Abstractive summarization generates novel sentences that paraphrase and condense the original content.
- Extractive methods are simpler, faster, and more faithful to source but can be disjointed.
- Abstractive methods produce more fluent summaries but risk hallucination and require heavy compute.
- Modern production systems often use hybrid pipelines: extractive pre-filtering then abstractive generation.
- Evaluation metrics like ROUGE, BERTScore, and factuality checks are critical for both approaches.
Think of extractive summarization like highlighting key sentences in a textbook—you copy the most important parts verbatim. Abstractive summarization is like explaining the chapter to a friend in your own words—you understand the meaning and rephrase it concisely. Both aim to save time, but one sticks to the original wording while the other creates new text.
Text summarization is a core feature in enterprise search, news aggregation, legal document review, and customer support. Every day, millions of summaries are generated by APIs and open-source models, yet many production systems still struggle with hallucinations, factual inconsistencies, and latency. Understanding the fundamental split between extractive and abstractive approaches is the first step to building reliable summarization pipelines.
Extractive summarization, rooted in classical NLP, treats summarization as a sentence ranking problem. It's deterministic, interpretable, and cheap to run. Abstractive summarization, powered by transformer-based language models like BART, T5, and GPT variants, generates fluent paraphrases but introduces the risk of fabricating information. The choice between them is not just academic—it directly impacts user trust and system cost.
We'll cover the algorithmic foundations, production trade-offs, evaluation pitfalls, and real-world deployment patterns. You'll learn when to use extractive, when to go abstractive, and how to combine them for robust results. We'll also dissect a production incident where abstractive summarization nearly caused a compliance failure, and provide a debug guide for common issues.
By the end, you'll have a concrete framework for building, evaluating, and debugging text summarization systems that work in production—not just in notebooks.
Fundamentals: What is Text Summarization?
Text summarization is the computational process of distilling a source document into a condensed version that preserves its most salient information. The goal is not merely to shorten text, but to produce a coherent, informative summary that captures the essence of the original. This task sits at the intersection of natural language understanding and generation, requiring models to identify key content, resolve redundancy, and maintain factual consistency. The two dominant paradigms are extractive and abstractive summarization, each with distinct algorithmic foundations and trade-offs.
Extractive summarization selects existing sentences or phrases from the source to form a summary. It treats summarization as a sentence ranking or classification problem, often using features like TF-IDF, TextRank (a graph-based algorithm), or neural sentence embeddings. The output is a subset of the original text, ensuring grammatical correctness but potentially lacking coherence when sentences are concatenated. In contrast, abstractive summarization generates novel sentences that may paraphrase or rephrase content, requiring deeper semantic understanding and language generation capabilities. This is typically approached with sequence-to-sequence (seq2seq) models, later enhanced by transformer architectures.
Mathematically, extractive methods can be framed as a binary classification per sentence: given a document D = {s1, s2, ..., sn}, predict a label yi ∈ {0,1} indicating whether sentence si is included in the summary S. Abstractive methods model the conditional probability P(S|D) directly, generating tokens sequentially. For evaluation, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares n-gram overlap between generated and reference summaries, while newer metrics like BERTScore leverage contextual embeddings for semantic similarity.
Production systems must balance compression ratio (typically 10-30% of source length) with information retention. A common pitfall is hallucination in abstractive models, where generated text includes facts not present in the source. This is especially dangerous in domains like healthcare or legal, where accuracy is paramount. The choice between extractive and abstractive approaches depends on the use case: extractive for high-precision, fact-critical applications; abstractive for more fluent, human-like summaries where some creativity is acceptable.
Extractive Summarization: Algorithms, Implementation, and Trade-offs
Extractive summarization selects a subset of sentences from the source document to form a summary. The core challenge is ranking sentences by importance and relevance. Classic algorithms include TextRank, which applies PageRank to a sentence similarity graph, and LexRank, which uses eigenvector centrality. More modern approaches use BERT embeddings to compute sentence representations and then cluster or rank them. The output is a concatenation of selected sentences, often reordered to match the original sequence for coherence.
TextRank constructs a graph where nodes are sentences and edges are weighted by sentence similarity; the original paper uses normalized word overlap, while many implementations use cosine similarity of TF-IDF vectors. The score of each node is iteratively updated: S(V_i) = (1 - d) + d * sum_{V_j in In(V_i)} [ w_{ji} / sum_{V_k in Out(V_j)} w_{jk} ] * S(V_j), where w_{ji} is the weight of the edge from V_j to V_i and d is the damping factor (typically 0.85). After convergence, the top-k sentences are selected. This unsupervised method requires no labeled data but can be sensitive to noise and may select redundant sentences.
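The update rule can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names (`tokenize`, `tfidf_vectors`, `textrank`), the smoothed IDF formula, and the fixed iteration count (used here instead of a convergence test) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption): terms in every sentence keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def textrank(sentences, d=0.85, iterations=50, top_k=2):
    """Return (top_k sentences in source order, per-sentence scores)."""
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    # Symmetric edge weights, no self-loops.
    w = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)], scores


if __name__ == "__main__":
    doc = [
        "The bank reported strong quarterly profit.",
        "Quarterly profit at the bank rose sharply.",
        "The weather in Paris was sunny.",
        "Profit growth at the bank beat forecasts.",
    ]
    summary, scores = textrank(doc)
    print(summary)
```

Selected sentences are returned in their original order, which tends to read better than ordering by score.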
Neural extractive methods treat it as a sequence labeling task. A model like BERTSUM (based on BERT) encodes sentences with [CLS] tokens and adds inter-sentence Transformer layers to capture document-level context. The output is a binary classification per sentence. Training requires labeled data (e.g., CNN/DailyMail with extracted oracle summaries). These models achieve higher ROUGE scores but are computationally expensive and require large datasets.
Trade-offs: Extractive methods guarantee grammatical correctness since they use original sentences, but they lack fluency when sentences are stitched together. They cannot paraphrase or compress beyond sentence-level selection. Redundancy is a common issue; post-processing with Maximal Marginal Relevance (MMR) can reduce it by balancing relevance and diversity. In production, extractive summarization is preferred for domains where factuality is non-negotiable, such as legal document summarization or medical report generation.
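A greedy MMR pass might look like the sketch below. It assumes relevance scores are already available (for example from a ranker) and uses token-set Jaccard overlap as a stand-in similarity; `mmr_select` and the `lam` weight are illustrative names, and a real system would more likely compare sentence embeddings.

```python
import re


def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(sentences, relevance, k=2, lam=0.7):
    """Greedy MMR: lam weights relevance, (1 - lam) weights redundancy."""
    sets = [token_set(s) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((jaccard(sets[i], sets[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]


if __name__ == "__main__":
    sents = [
        "Revenue rose 10 percent in the third quarter.",
        "Revenue rose 10 percent in the third quarter of the year.",
        "The CEO announced a new product line.",
    ]
    print(mmr_select(sents, relevance=[0.9, 0.85, 0.6]))
```

With `lam=1.0` the selection degenerates to pure relevance ranking, which makes it easy to A/B the effect of the diversity term.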
Abstractive Summarization: Sequence-to-Sequence Models and Transformers
Abstractive summarization generates novel text that paraphrases and condenses the source, requiring language generation capabilities. The dominant architecture is the sequence-to-sequence (seq2seq) model with attention, later revolutionized by the Transformer. Early seq2seq models used RNNs (LSTM/GRU) with an encoder-decoder structure, where the encoder processes the source tokens and the decoder generates the summary token by token, conditioned on the encoder's hidden states via attention. The attention mechanism computes alignment scores: e_{ij} = a(s_{i-1}, h_j), where s_{i-1} is the decoder state and h_j is encoder output. The context vector is a weighted sum of encoder states.
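To make the attention step concrete, here is a small sketch that computes alignment scores, normalizes them with softmax, and forms the context vector. It assumes a plain dot product for the alignment function a, whereas the original RNN formulations learn a; the function name `attention_context` is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # e_ij: alignment score between the decoder state and each encoder output.
    scores = [sum(s * h for s, h in zip(decoder_state, h_j))
              for h_j in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * h_j[k] for w, h_j in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


if __name__ == "__main__":
    weights, context = attention_context([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
    print(weights, context)
```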
Transformers replaced RNNs with self-attention, enabling parallel computation and better long-range dependencies. The encoder uses multi-head self-attention and feed-forward layers; the decoder uses masked self-attention and cross-attention to the encoder. Pre-trained models like BART and T5 are fine-tuned for summarization. BART combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT), trained on denoising objectives. T5 frames all tasks as text-to-text, using a unified architecture. These models achieve state-of-the-art ROUGE scores on benchmarks like CNN/DailyMail and XSum.
Training abstractive models requires large paired datasets (document-summary pairs). Loss is typically cross-entropy between predicted and target tokens. Inference uses beam search (beam width 4-8) to generate multiple candidates, selecting the one with highest log-probability. However, beam search can lead to repetitive or generic outputs; techniques like length penalty and no-repeat n-grams help. A critical issue is hallucination—generating facts not in the source. This can be mitigated by using pointer-generator networks that copy words from the source, or by incorporating factual consistency checks post-generation.
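The decoding controls mentioned above can be illustrated with a toy beam search. Everything here is a sketch under assumptions: `toy_model` is a hypothetical hand-written next-token table standing in for a real decoder, and scores are normalized as log-probability divided by length raised to `length_penalty`, which is one common convention rather than the only one.

```python
import math


def has_repeated_ngram(tokens, n):
    """True if the final n-gram of `tokens` already occurred earlier."""
    if len(tokens) < n:
        return False
    last = tuple(tokens[-n:])
    earlier = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
    return last in earlier


def beam_search(next_logprobs, bos, eos, beam_width=4, max_len=20,
                length_penalty=1.0, no_repeat_ngram=3):
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_logprobs(tokens).items():
                new_tokens = tokens + [tok]
                if has_repeated_ngram(new_tokens, no_repeat_ngram):
                    continue  # block hypotheses that repeat an n-gram
                target = finished if tok == eos else candidates
                target.append((new_tokens, score + lp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    pool = finished or beams
    # Length-normalized score: larger length_penalty favors longer outputs.
    return max(pool, key=lambda c: c[1] / (len(c[0]) ** length_penalty))[0]


def toy_model(tokens):
    """Hypothetical next-token distribution keyed on the last token only."""
    table = {
        "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
        "the": {"the": math.log(0.5), "cat": math.log(0.3), "</s>": math.log(0.2)},
        "a": {"dog": math.log(0.9), "</s>": math.log(0.1)},
        "cat": {"</s>": 0.0},
        "dog": {"</s>": 0.0},
    }
    return table[tokens[-1]]


if __name__ == "__main__":
    print(beam_search(toy_model, "<s>", "</s>", no_repeat_ngram=2))
```

In this toy table the locally most likely first token ("the") does not lead to the best full sequence, which is the situation beam search is meant to handle better than greedy decoding.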
In production, abstractive models are computationally expensive (e.g., BART-large has 400M parameters). Latency can be reduced by using distilled versions (e.g., DistilBART) or by caching encoder outputs for repeated source documents. For real-time applications, consider using a smaller model like T5-small (60M params) with acceptable quality. Always validate summaries against the source for factual consistency, especially in news or medical domains.
Hybrid Pipelines: Combining Extractive and Abstractive for Production
Hybrid pipelines leverage the strengths of both extractive and abstractive methods to produce high-quality summaries in production. The typical architecture is a two-stage process: first, an extractive model selects the most important sentences (reducing the input length), then an abstractive model rewrites and condenses those sentences into a fluent summary. This approach reduces the computational burden on the abstractive model (since it processes shorter text) and mitigates hallucination by constraining the generation to a relevant subset.
A concrete pipeline: given a long document (e.g., 1000+ words), use a BERT-based extractive model to select the top 5-10 sentences (compression ratio ~20%). These sentences are concatenated and fed into a BART abstractive model to generate a final summary of 3-5 sentences. The extractive stage acts as a filter, removing irrelevant content and reducing noise. The abstractive stage then paraphrases and compresses further. This can improve ROUGE scores by 2-5 points over pure abstractive on long documents, as shown in research (e.g., Liu & Lapata, 2019).
Implementation considerations: The extractive model can be a lightweight classifier (e.g., BERT-base with a linear head) or even a simple TextRank for speed. The abstractive model should be fine-tuned on the output of the extractive stage (i.e., train on extractive summaries paired with human-written abstracts). This ensures the model learns to process truncated inputs. In production, cache the extractive scores for repeated documents to avoid recomputation. For real-time systems, use a smaller abstractive model (e.g., DistilBART) and limit input length to 512 tokens.
Trade-offs: The hybrid approach adds latency due to two model calls, but the overall quality often justifies it. The extractive stage may discard information that the abstractive model could have used creatively, so tuning the extractive threshold is critical. A/B test different compression ratios (e.g., 10%, 20%, 30%) to find the sweet spot for your domain. Also, monitor for cascading errors: if the extractive stage misses key facts, the abstractive stage cannot recover them. Consider using a confidence threshold to fall back to pure extractive if the abstractive model's uncertainty is high.
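The control flow of such a pipeline, including the fallback, can be sketched as follows. The two model stages are passed in as callables because their real interfaces are not specified here: `score_sentence` and `abstractive_summarize` (returning a summary plus a confidence in [0, 1]) are hypothetical, as is the `min_confidence` threshold.

```python
def hybrid_summarize(sentences, score_sentence, abstractive_summarize,
                     compression_ratio=0.2, min_confidence=0.5):
    """Extract the top sentences, rewrite them, fall back if confidence is low.

    score_sentence(sentence) -> float            (hypothetical extractive scorer)
    abstractive_summarize(text) -> (str, float)  (hypothetical generator + confidence)
    """
    k = max(1, round(len(sentences) * compression_ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i]),
                    reverse=True)[:k]
    # Keep the selected sentences in source order for coherence.
    extract = " ".join(sentences[i] for i in sorted(ranked))
    summary, confidence = abstractive_summarize(extract)
    if confidence < min_confidence:
        return extract, "extractive"  # fall back to the grounded extract
    return summary, "abstractive"


if __name__ == "__main__":
    sents = [f"Sentence {i}." for i in range(10)]
    scorer = lambda s: float(s.split()[1].rstrip("."))  # toy: later is better
    generator = lambda text: ("A short rewritten summary.", 0.9)
    print(hybrid_summarize(sents, scorer, generator))
```

Returning the mode alongside the text makes it straightforward to log how often the fallback fires, which is a useful signal when tuning the compression ratio.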
Evaluation Metrics: ROUGE, BERTScore, Factuality, and Human Evaluation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) remains the de facto standard for automatic summarization evaluation. ROUGE-N measures n-gram overlap between a candidate summary and one or more reference summaries. ROUGE-1 (unigrams) and ROUGE-2 (bigrams) are common, but ROUGE-L uses longest common subsequence to capture sentence-level structure. For a candidate C and reference R, ROUGE-N recall = (count of overlapping n-grams) / (total n-grams in R). Precision and F1 are also reported. However, ROUGE correlates poorly with human judgment for abstractive summaries because it penalizes valid paraphrasing. A ROUGE-1 F1 of 0.45 on CNN/DailyMail is considered strong, but this number is meaningless without knowing the dataset and metric variant.
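The ROUGE-N definition translates directly into code. The sketch below assumes simple lowercase word tokenization and a single reference, with no stemming or stopword handling, so its numbers will not match official ROUGE toolkits exactly; `rouge_n` is an illustrative name.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between one candidate and one reference."""
    cand = ngrams(re.findall(r"\w+", candidate.lower()), n)
    ref = ngrams(re.findall(r"\w+", reference.lower()), n)
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    cand = "the cat sat on the mat"
    ref = "the cat lay on the mat"
    print(rouge_n(cand, ref, n=1))
    print(rouge_n(cand, ref, n=2))
```

In the usage example the candidate differs from the reference by one word, yet ROUGE-2 falls further than ROUGE-1 because that single substitution breaks two bigrams.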
BERTScore addresses ROUGE's lexical rigidity by computing token-level similarity using contextual embeddings from BERT. For each token in the candidate, it finds the most similar token in the reference via cosine similarity, which yields precision; the same matching from reference to candidate yields recall, and the two are combined into F1. BERTScore correlates better with human evaluation (Pearson r ~0.4-0.5 vs ROUGE's ~0.2-0.3 on common benchmarks). However, it is computationally expensive: generating embeddings for a 512-token summary takes ~50ms on a V100. In production, you might cache embeddings or use a distilled model like DistilBERT to reduce latency.
Factuality metrics are critical because abstractive models hallucinate. FactCC is a BERT-based classifier trained to detect factual consistency between source and summary. It achieves ~80% accuracy on the FactCC dataset. More recent approaches like QAFactEval use question answering: generate questions from the summary, answer them from the source, and measure answer overlap. These metrics are not perfect—they miss subtle factual errors and can be gamed. Human evaluation remains the gold standard, typically using Likert scales (1-5) for fluency, relevance, and factuality. Inter-annotator agreement (Krippendorff's alpha > 0.7) is essential. In practice, combine ROUGE for regression testing, BERTScore for model selection, and human eval for final quality gates.
Production Deployment: Latency, Throughput, and Cost Optimization
Deploying a summarization model at scale requires balancing latency, throughput, and cost. For extractive models (e.g., BERT-based sentence classifiers), latency is dominated by encoding: a DistilBERT model on a CPU takes ~100ms for a 512-token document. Throughput can reach 50 requests/second on a single T4 GPU with batch size 32. Abstractive models (e.g., BART, Pegasus) are more expensive: a BART-large model generates ~30 tokens/second on a V100, with latency of 2-5 seconds for a 100-token summary. To optimize, use mixed precision (FP16) to reduce memory and increase throughput by 1.5-2x. Quantization (INT8) can further reduce latency by 30% with minimal quality loss (ROUGE drop < 0.01).
Batching is critical. For abstractive models, dynamic batching (grouping requests by input length) avoids padding waste. Use a framework like NVIDIA Triton Inference Server or TorchServe to manage batching and model versioning. For cost, consider serverless inference (e.g., AWS SageMaker Serverless) for variable workloads, but beware of cold starts (2-5 seconds). For steady traffic, provisioned GPUs (e.g., 4x T4) are cheaper. A typical cost breakdown: BART-large on a T4 GPU costs ~$0.10/hour; at 10 requests/second, that's $0.000003 per request. Add 20% for overhead.
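The per-request figure follows from simple arithmetic, reproduced here so the inputs can be swapped for your own pricing and traffic. The function name and the way overhead is applied (as a multiplier on the base cost) are assumptions for the sketch.

```python
def cost_per_request(gpu_cost_per_hour, requests_per_second, overhead=0.20):
    """Return (base cost, cost including overhead) in dollars per request."""
    requests_per_hour = requests_per_second * 3600
    base = gpu_cost_per_hour / requests_per_hour
    return base, base * (1 + overhead)


if __name__ == "__main__":
    base, with_overhead = cost_per_request(0.10, 10)
    print(f"base: ${base:.7f}  with overhead: ${with_overhead:.7f}")
```

The result is only as good as the sustained throughput assumption: if the GPU cannot actually serve 10 requests per second for your input lengths, the cost per request rises proportionally.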
Caching is your best friend. Use a content-addressable cache (e.g., Redis) keyed by a hash of the input text. For news summarization, many articles are duplicates or near-duplicates; a cache hit rate of 30% is realistic. For streaming applications, use a sliding window cache that evicts old entries. Also, consider pre-computing summaries for popular documents (e.g., top 1000 news articles) during off-peak hours. Finally, monitor tail latency: p99 should be < 5 seconds for interactive apps. Use async processing (e.g., Celery) for non-real-time workloads.
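A content-addressable cache can be sketched as below. An in-memory dict stands in for Redis so the example stays self-contained; `SummaryCache` and the `summarize(text, method)` callable are hypothetical. Note that hashing only catches exact duplicates (after whitespace normalization), not near-duplicates.

```python
import hashlib


class SummaryCache:
    """Cache summaries under a hash of (method, normalized input text)."""

    def __init__(self, summarize):
        self._summarize = summarize  # hypothetical callable: (text, method) -> str
        self._store = {}             # stand-in for an external store such as Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(text, method):
        normalized = " ".join(text.split())  # collapse whitespace differences
        payload = f"{method}:{normalized}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text, method="extractive"):
        k = self.key(text, method)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        summary = self._summarize(text, method)
        self._store[k] = summary
        return summary


if __name__ == "__main__":
    cache = SummaryCache(lambda text, method: text[:20])
    cache.get("Some long article body ...")
    cache.get("Some  long article   body ...")  # same text, different spacing
    print(cache.hits, cache.misses)
```

Including the method in the key prevents an extractive summary from being served for an abstractive request on the same text.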
Common Pitfalls and Debugging Strategies
One of the most frequent pitfalls is the 'copy-paste' problem in extractive models: they select entire sentences verbatim, leading to summaries that are disjointed or contain redundant information. For example, a BERT-based extractor might pick two sentences that say the same thing, inflating ROUGE but confusing users. Debug by examining the attention weights: if the model attends uniformly across all sentences, it's not learning. Fix by adding a diversity penalty (e.g., penalize cosine similarity between selected sentence embeddings) or using a reinforcement learning objective that rewards non-redundancy.
Abstractive models hallucinate. A BART model might generate 'The company reported a loss of $10 million' when the source says 'profit of $10 million'. This is often due to the model relying on its pre-training knowledge rather than the source. Debug by checking the cross-attention scores: if the model ignores source tokens, it's hallucinating. Mitigate with constrained beam search (force the model to copy from the source) or use a factuality classifier as a reward during training. Another common issue is repetition: models generate 'the the the' or repeat phrases. This is a decoding problem; use repetition penalty (penalty > 1.0) or top-k sampling with k=50.
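The repetition penalty is a small transformation of the next-token logits. The sketch below follows the widely used formulation in which logits of already generated tokens are divided by the penalty when positive and multiplied by it when negative; it operates on a plain list, and the function name is an assumption.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the logits of tokens that were already generated.

    logits: list of floats indexed by token id.
    generated_ids: token ids produced so far.
    """
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty   # shrink positive logits toward zero
        else:
            adjusted[tok] *= penalty   # push negative logits further down
    return adjusted


if __name__ == "__main__":
    print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
```

A penalty of exactly 1.0 leaves the logits unchanged, which is why values above 1.0 are needed for any effect.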
Data leakage is subtle but deadly. If your training and test sets share articles from the same event (e.g., multiple news outlets covering the same story), the model memorizes rather than summarizes. Always deduplicate at the document level, not just the sentence level. Use MinHash or SimHash to detect near-duplicates. Also, watch for domain shift: a model trained on CNN/DailyMail (news) will fail on scientific papers. Debug by evaluating on a small in-domain set first. Finally, don't ignore the tokenizer: if the input exceeds the model's max length (e.g., 1024 tokens for BART), truncation loses key information. Use a sliding window approach or a Longformer model for long documents.
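Near-duplicate detection with MinHash can be sketched with the standard library alone. This is a minimal illustration: word 3-shingles, 64 hash functions derived by prefixing a seed to a SHA-1 hash, and the function names are all assumptions, and at scale you would add locality-sensitive hashing rather than compare every pair.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "The central bank raised interest rates by 25 basis points on Tuesday amid inflation concerns"
    b = "The central bank raised interest rates by 25 basis points on Wednesday amid inflation concerns"
    c = "Local football team wins the championship after a dramatic penalty shootout"
    sig = {name: minhash_signature(shingles(t)) for name, t in (("a", a), ("b", b), ("c", c))}
    print(estimated_jaccard(sig["a"], sig["b"]), estimated_jaccard(sig["a"], sig["c"]))
```

Pairs whose estimated similarity exceeds a chosen threshold should be assigned to the same split before training.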
Future Directions: Long Document Summarization, Multimodal, and Factuality Guarantees
Long document summarization (e.g., books, legal contracts, scientific papers) remains an open challenge. Current models like BART and Pegasus are limited to 1024 tokens. Approaches include sparse-attention models (e.g., Longformer, BigBird) that handle up to 4096 tokens, and hierarchical (map-reduce) methods that chunk the document, summarize each chunk, then summarize the summaries. The latter is common in production: chunk size 512 tokens with 50% overlap, then a second-level model. However, this loses cross-chunk dependencies. A promising direction is the 'sliding window' approach with memory (e.g., Transformer-XL), which reuses cached hidden states from previous chunks. Evaluation on the SCROLLS benchmark shows that Longformer achieves ROUGE-1 of 0.42 on GovReport, vs 0.38 for BART.
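The chunk-then-summarize pattern can be sketched as follows. The chunking logic is concrete; the model is abstracted as a hypothetical `summarize_chunk` callable that maps a token list to a shorter token list, and the sketch does a single second-level pass without checking whether the merged partial summaries still exceed the model's limit.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=0.5):
    """Split tokens into windows of chunk_size with fractional overlap."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def summarize_long(tokens, summarize_chunk, chunk_size=512, overlap=0.5):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    partial = [summarize_chunk(c) for c in chunk_tokens(tokens, chunk_size, overlap)]
    merged = [tok for summary in partial for tok in summary]
    return summarize_chunk(merged)


if __name__ == "__main__":
    tokens = list(range(1000))          # stand-in for token ids
    first_two = lambda chunk: chunk[:2]  # toy "summarizer"
    print(len(chunk_tokens(tokens)), summarize_long(tokens, first_two))
```

With 50% overlap every token except those at the document edges appears in two chunks, which doubles first-level compute in exchange for fewer facts being split across a chunk boundary.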
Multimodal summarization combines text, images, and video. For example, summarizing a news article with its accompanying image. Models like CLIP and Flamingo can align visual and textual representations. A typical pipeline: encode the image with a vision transformer, fuse with text embeddings via cross-attention, then decode a summary. Challenges include alignment (the image may not directly relate to the text) and evaluation (how do you measure visual relevance?). The MSMO dataset (Multimodal Summarization with Multimodal Output) is a benchmark, but it's small (300 examples). Expect more work in this area as multimodal LLMs mature.
Factuality guarantees are the holy grail. Current methods include: (1) training with a factuality reward using reinforcement learning (e.g., RLHF with factuality as a reward), (2) post-hoc verification using a separate NLI model, and (3) constrained decoding that forces the model to copy from the source. None provide guarantees. A recent approach uses 'contrastive decoding': compare the model's token probabilities with and without the source in context, and penalize tokens that stay likely without the source, since those reflect the model's prior rather than the document. This reduces hallucination by 30% on XSum. In the future, we may see 'certified' summarization using formal verification or differential privacy to bound the probability of hallucination. Until then, production systems must combine multiple techniques and accept that some errors will slip through.
The Hallucinated Compliance Report: When Abstractive Summarization Nearly Cost a Client
- High ROUGE scores do not guarantee factual accuracy; always include factuality checks in production.
- Abstractive models can hallucinate by incorrectly combining information from different parts of the source.
- A hybrid extractive-abstractive pipeline reduces hallucination risk by grounding the generation in relevant content.
python -c "from transformers import pipeline; nli = pipeline('text-classification', model='roberta-large-mnli'); print(nli({'text': 'source text', 'text_pair': 'generated summary'}))"
curl -X POST http://localhost:8000/summarize -H 'Content-Type: application/json' -d '{"text":"...", "method":"extractive"}'
Key takeaways
Common mistakes to avoid
- Using ROUGE as the sole evaluation metric
- Ignoring input length limits in abstractive models
- Assuming extractive summaries are always faithful
- Deploying abstractive models without latency budgets
Interview Questions on This Topic
Explain the difference between extractive and abstractive summarization. When would you choose one over the other?
Frequently Asked Questions