Intermediate 10 min · May 28, 2026

Text Summarization: Extractive vs Abstractive – A Production Guide for ML Engineers

Q: What is the main difference between extractive and abstractive summarization?

Extractive summarization selects and copies existing sentences from the source document verbatim. Abstractive summarization generates new sentences that may rephrase, condense, or combine information from the source. Extractive is simpler and more faithful, while abstractive is more fluent but can introduce errors.

Q: Which approach is better for production?

It depends on your constraints. Extractive is better for low-latency, high-throughput systems where faithfulness is critical (e.g., legal or medical summaries). Abstractive is better for user-facing applications where fluency matters (e.g., news digests). Many production systems use a hybrid: first extract key sentences, then abstractively rewrite them.

Q: How do you evaluate summarization quality?

Common metrics include ROUGE (measures n-gram overlap), BERTScore (semantic similarity using embeddings), and factuality checks (e.g., using entailment models). Human evaluation remains the gold standard for fluency and relevance. In production, you should also track user feedback and downstream task performance.

Q: What are common failure modes in abstractive summarization?

Hallucination (generating facts not in the source), repetition, incomplete sentences, and loss of key details. These are often caused by model overconfidence, insufficient context length, or training data biases. Mitigations include using extractive pre-filtering, beam search with length penalties, and factuality verification.

Learn the key differences between extractive and abstractive text summarization, with production-ready code, evaluation metrics, common pitfalls, and real-world deployment strategies for 2026.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

Before you start · ⏱ 25 min

✓ Solid grasp of fundamentals
✓ Comfortable reading code examples
✓ Basic production concepts

⚡ Quick Answer

Extractive summarization selects and concatenates existing sentences from source text.
Abstractive summarization generates novel sentences that paraphrase and condense the original content.
Extractive methods are simpler, faster, and more faithful to source but can be disjointed.
Abstractive methods produce more fluent summaries but risk hallucination and require heavy compute.
Modern production systems often use hybrid pipelines: extractive pre-filtering then abstractive generation.
Evaluation metrics like ROUGE, BERTScore, and factuality checks are critical for both approaches.

✦ Definition · ~90s read

What is Text Summarization?

Text summarization is the task of automatically producing a concise and fluent summary that captures the key points of a longer document. Extractive summarization selects and rearranges existing sentences from the source, while abstractive summarization generates new sentences that may paraphrase or condense the original content.

Plain-English First

Think of extractive summarization like highlighting key sentences in a textbook—you copy the most important parts verbatim. Abstractive summarization is like explaining the chapter to a friend in your own words—you understand the meaning and rephrase it concisely. Both aim to save time, but one sticks to the original wording while the other creates new text.

Text summarization is a core feature in enterprise search, news aggregation, legal document review, and customer support. Every day, millions of summaries are generated by APIs and open-source models, yet many production systems still struggle with hallucinations, factual inconsistencies, and latency. Understanding the fundamental split between extractive and abstractive approaches is the first step to building reliable summarization pipelines.

Extractive summarization, rooted in classical NLP, treats summarization as a sentence ranking problem. It's deterministic, interpretable, and cheap to run. Abstractive summarization, powered by transformer-based language models like BART, T5, and GPT variants, generates fluent paraphrases but introduces the risk of fabricating information. The choice between them is not just academic—it directly impacts user trust and system cost.

We'll cover the algorithmic foundations, production trade-offs, evaluation pitfalls, and real-world deployment patterns. You'll learn when to use extractive, when to go abstractive, and how to combine them for robust results. We'll also dissect a production incident where abstractive summarization nearly caused a compliance failure, and provide a debug guide for common issues.

By the end, you'll have a concrete framework for building, evaluating, and debugging text summarization systems that work in production—not just in notebooks.

Fundamentals: What is Text Summarization?

Text summarization is the computational process of distilling a source document into a condensed version that preserves its most salient information. The goal is not merely to shorten text, but to produce a coherent, informative summary that captures the essence of the original. This task sits at the intersection of natural language understanding and generation, requiring models to identify key content, resolve redundancy, and maintain factual consistency. The two dominant paradigms are extractive and abstractive summarization, each with distinct algorithmic foundations and trade-offs.

Extractive summarization selects existing sentences or phrases from the source to form a summary. It treats summarization as a sentence ranking or classification problem, often using features like TF-IDF, TextRank (a graph-based algorithm), or neural sentence embeddings. The output is a subset of the original text, ensuring grammatical correctness but potentially lacking coherence when sentences are concatenated. In contrast, abstractive summarization generates novel sentences that may paraphrase or rephrase content, requiring deeper semantic understanding and language generation capabilities. This is typically approached with sequence-to-sequence (seq2seq) models, later enhanced by transformer architectures.

Mathematically, extractive methods can be framed as a binary classification per sentence: given a document D = {s1, s2, ..., sn}, predict a label yi ∈ {0,1} indicating whether sentence si is included in the summary S. Abstractive methods model the conditional probability P(S|D) directly, generating tokens sequentially. For evaluation, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares n-gram overlap between generated and reference summaries, while newer metrics like BERTScore leverage contextual embeddings for semantic similarity.

Production systems must balance compression ratio (typically 10-30% of source length) with information retention. A common pitfall is hallucination in abstractive models, where generated text includes facts not present in the source. This is especially dangerous in domains like healthcare or legal, where accuracy is paramount. The choice between extractive and abstractive approaches depends on the use case: extractive for high-precision, fact-critical applications; abstractive for more fluent, human-like summaries where some creativity is acceptable.

io/thecodeforge/summarization/fundamentals.py · PYTHON

import nltk
nltk.download('punkt', quiet=True)  # sentence tokenizer data (newer NLTK releases may need 'punkt_tab')
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extractive_summary_tfidf(text, top_n=3):
    sentences = sent_tokenize(text)
    if len(sentences) <= top_n:
        return text
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Compute sentence centrality: average cosine similarity to all others
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    sentence_scores = np.mean(similarity_matrix, axis=1)
    ranked_indices = np.argsort(sentence_scores)[::-1][:top_n]
    ranked_indices.sort()  # preserve original order
    return ' '.join([sentences[i] for i in ranked_indices])

document = """
The quick brown fox jumps over the lazy dog. This is a common pangram used in typography. It contains every letter of the alphabet at least once. The sentence has been used for decades to display fonts.
"""
print(extractive_summary_tfidf(document))

Output

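Three of the four sentences, joined in their original order. On this toy input only the second and fourth sentences share a non-stop-word term ("used"), so they score highest; the remaining two tie for the third slot, and which one is printed depends on how the sort breaks the tie.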

🔥 Compression Ratio Matters

Aim for 10-30% compression. Too aggressive loses critical info; too conservative defeats the purpose. Measure ROUGE-L for sentence-level overlap.

📊 Production Insight

Always evaluate summary quality on domain-specific data. Generic ROUGE scores can mislead; use human evaluation for fluency and factuality. Implement a fallback to extractive if abstractive model confidence is low.

🎯 Key Takeaway

Text summarization reduces documents to key information via extraction (selecting sentences) or abstraction (generating new text). Extractive is safer for factual accuracy; abstractive offers fluency. Choose based on domain risk tolerance.

thecodeforge.io

Text Summarization

Extractive Summarization: Algorithms, Implementation, and Trade-offs

Extractive summarization selects a subset of sentences from the source document to form a summary. The core challenge is ranking sentences by importance and relevance. Classic algorithms include TextRank, which applies PageRank to a sentence similarity graph, and LexRank, which uses eigenvector centrality. More modern approaches use BERT embeddings to compute sentence representations and then cluster or rank them. The output is a concatenation of selected sentences, often reordered to match the original sequence for coherence.

TextRank constructs a graph where nodes are sentences and edges are weighted by cosine similarity of TF-IDF vectors. The score of each node is iteratively updated: $S(V_i) = (1-d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} S(V_j)$, where d is the damping factor (typically 0.85). After convergence, top-k sentences are selected. This unsupervised method requires no labeled data but can be sensitive to noise and may select redundant sentences.
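
To make the update rule concrete, here is a minimal power-iteration sketch in plain Python. The function name `textrank_scores`, the toy similarity matrix, and the fixed iteration count are assumptions for illustration only; the graph is treated as undirected, so In and Out coincide.

```python
def textrank_scores(weights, d=0.85, iterations=50):
    """Iterate S(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * S(V_j)."""
    n = len(weights)
    scores = [1.0] * n
    # Total edge weight leaving each node (self-similarity is ignored).
    out_sums = [
        sum(w for k, w in enumerate(row) if k != j)
        for j, row in enumerate(weights)
    ]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if j != i and weights[j][i] > 0 and out_sums[j] > 0:
                    incoming += weights[j][i] / out_sums[j] * scores[j]
            new_scores.append((1 - d) + d * incoming)
        scores = new_scores
    return scores


# Toy symmetric similarity matrix for three sentences (made-up values).
sim = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
scores = textrank_scores(sim)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The middle sentence has the strongest links to both neighbors, so it ends up with the highest score. A production version would stop on a convergence tolerance rather than a fixed number of iterations.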

Neural extractive methods treat it as a sequence labeling task. A model like BERTSUM (based on BERT) encodes sentences with [CLS] tokens and adds inter-sentence Transformer layers to capture document-level context. The output is a binary classification per sentence. Training requires labeled data (e.g., CNN/DailyMail with extracted oracle summaries). These models achieve higher ROUGE scores but are computationally expensive and require large datasets.

Trade-offs: Extractive methods guarantee grammatical correctness since they use original sentences, but they lack fluency when sentences are stitched together. They cannot paraphrase or compress beyond sentence-level selection. Redundancy is a common issue; post-processing with Maximal Marginal Relevance (MMR) can reduce it by balancing relevance and diversity. In production, extractive summarization is preferred for domains where factuality is required, such as legal document summarization or medical report generation.

io/thecodeforge/summarization/extractive_textrank.py · PYTHON

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize

def textrank_summary(text, top_n=3):
    sentences = sent_tokenize(text)
    if len(sentences) <= top_n:
        return text
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
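    # Build graph; add every sentence as a node so sentences with no
    # similar neighbors still receive a PageRank score
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))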
    for i in range(len(sentences)):
        for j in range(i+1, len(sentences)):
            if sim_matrix[i][j] > 0:
                graph.add_edge(i, j, weight=sim_matrix[i][j])
    # PageRank
    scores = nx.pagerank(graph, weight='weight')
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    ranked_indices = sorted([idx for idx, _ in ranked])
    return ' '.join([sentences[i] for i in ranked_indices])

doc = """
Machine learning is a subset of artificial intelligence. It involves training algorithms on data. Deep learning uses neural networks with many layers. These techniques power modern AI applications.
"""
print(textrank_summary(doc))

Output

Machine learning is a subset of artificial intelligence. It involves training algorithms on data. Deep learning uses neural networks with many layers.

⚠ Redundancy in Extractive Summaries

TextRank often selects similar sentences. Use MMR (Maximal Marginal Relevance) to penalize sentences too similar to already selected ones.
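
A minimal sketch of greedy MMR selection, assuming you already have a relevance score per sentence and a pairwise similarity matrix. The function name `mmr_select`, the trade-off parameter `lam`, and the toy numbers are illustrative assumptions, not part of any library.

```python
def mmr_select(relevance, similarity, k, lam=0.7):
    """Greedy MMR: pick k items, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore original sentence order


# Sentences 0 and 1 are near-duplicates; sentence 2 is less relevant but novel.
relevance = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
picked = mmr_select(relevance, similarity, k=2)
print(picked)  # [0, 2]
```

With `lam=1.0` the redundancy term disappears and the two near-duplicates are selected, which is the behavior MMR is meant to prevent.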

📊 Production Insight

For production, combine TextRank with a sentence compression step (e.g., deleting clauses) to reduce length. Monitor ROUGE-1/2/L but also track redundancy via pairwise cosine similarity of selected sentences.

🎯 Key Takeaway

Extractive summarization selects sentences via graph-based ranking (TextRank) or neural classification. It's fast, factual, but lacks fluency. Use MMR for diversity. Ideal for high-stakes domains.

Abstractive Summarization: Sequence-to-Sequence Models and Transformers

Abstractive summarization generates novel text that paraphrases and condenses the source, requiring language generation capabilities. The dominant architecture is the sequence-to-sequence (seq2seq) model with attention, later revolutionized by the Transformer. Early seq2seq models used RNNs (LSTM/GRU) with an encoder-decoder structure, where the encoder processes the source tokens and the decoder generates the summary token by token, conditioned on the encoder's hidden states via attention. The attention mechanism computes alignment scores: e_{ij} = a(s_{i-1}, h_j), where s_{i-1} is the decoder state and h_j is encoder output. The context vector is a weighted sum of encoder states.
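
The attention step can be sketched in a few lines of plain Python. This is an illustration under one assumption: a dot product stands in for the scoring function a(·), whereas the original RNN formulations typically use a small learned network. The names `attention_context`, `decoder_state`, and `encoder_states` are hypothetical.

```python
import math


def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder step."""
    # Alignment scores e_j = a(s, h_j); a dot product is assumed here.
    scores = [
        sum(s * h for s, h in zip(decoder_state, h_j))
        for h_j in encoder_states
    ]
    # Softmax over source positions (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h_j[t] for w, h_j in zip(weights, encoder_states))
        for t in range(dim)
    ]
    return weights, context


decoder_state = [1.0, 0.0]
encoder_states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, context = attention_context(decoder_state, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The first encoder state aligns best with the decoder state, so it receives the largest weight and dominates the context vector.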

Transformers replaced RNNs with self-attention, enabling parallel computation and better long-range dependencies. The encoder uses multi-head self-attention and feed-forward layers; the decoder uses masked self-attention and cross-attention to the encoder. Pre-trained models like BART and T5 are fine-tuned for summarization. BART combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT), trained on denoising objectives. T5 frames all tasks as text-to-text, using a unified architecture. These models achieve state-of-the-art ROUGE scores on benchmarks like CNN/DailyMail and XSum.

Training abstractive models requires large paired datasets (document-summary pairs). Loss is typically cross-entropy between predicted and target tokens. Inference uses beam search (beam width 4-8) to generate multiple candidates, selecting the one with highest log-probability. However, beam search can lead to repetitive or generic outputs; techniques like length penalty and no-repeat n-grams help. A critical issue is hallucination—generating facts not in the source. This can be mitigated by using pointer-generator networks that copy words from the source, or by incorporating factual consistency checks post-generation.
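
Two of the decoding controls mentioned above can be sketched without a model. The helper names `length_normalized_score` and `banned_next_tokens` are hypothetical; dividing the summed log-probability by length raised to a power is one common form of length penalty, and real decoders apply both steps inside the beam loop.

```python
def length_normalized_score(sum_logprob, length, alpha=1.0):
    """Score a finished hypothesis; alpha=0 disables length normalization."""
    return sum_logprob / (length ** alpha)


def banned_next_tokens(tokens, n=3):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier (n-1)-gram equals the current prefix, the token that
        # followed it would repeat a full n-gram.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


# Without normalization the short hypothesis wins; with alpha=1 the long one does.
short_raw = length_normalized_score(-4.0, 4, alpha=0.0)
long_raw = length_normalized_score(-6.0, 10, alpha=0.0)
short_norm = length_normalized_score(-4.0, 4, alpha=1.0)
long_norm = length_normalized_score(-6.0, 10, alpha=1.0)

generated = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(generated, n=3))  # {'sat'}
```

Blocking "sat" after the second "the cat" stops the decoder from repeating the trigram "the cat sat".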

In production, abstractive models are computationally expensive (e.g., BART-large has 400M parameters). Latency can be reduced by using distilled versions (e.g., DistilBART) or by caching encoder outputs for repeated source documents. For real-time applications, consider using a smaller model like T5-small (60M params) with acceptable quality. Always validate summaries against the source for factual consistency, especially in news or medical domains.

io/thecodeforge/summarization/abstractive_bart.py · PYTHON

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = """
The United Nations has warned that the world is facing a climate emergency. Global temperatures have risen by 1.1 degrees Celsius since pre-industrial times. Extreme weather events are becoming more frequent. The UN Secretary-General called for immediate action to reduce emissions.
"""
summary = summarizer(document, max_length=50, min_length=20, do_sample=False)
print(summary[0]['summary_text'])

Output

The UN warns of a climate emergency as global temperatures rise 1.1°C. Extreme weather is increasing, and the Secretary-General urges immediate emission reductions.

Mental Model

Hallucination as a Failure Mode

Abstractive models can invent facts. Always treat generated summaries as drafts; use a fact-checking layer or extractive fallback for critical applications.

📊 Production Insight

Use BART or T5 fine-tuned on domain data. For latency-sensitive apps, distill to a smaller model. Implement a hallucination detector using NLI (natural language inference) to flag inconsistent summaries.

🎯 Key Takeaway

Abstractive summarization uses seq2seq Transformers (BART, T5) to generate novel text. It's fluent but prone to hallucination. Requires large data and compute. Use with fact-checking in production.

Hybrid Pipelines: Combining Extractive and Abstractive for Production

Hybrid pipelines leverage the strengths of both extractive and abstractive methods to produce high-quality summaries in production. The typical architecture is a two-stage process: first, an extractive model selects the most important sentences (reducing the input length), then an abstractive model rewrites and condenses those sentences into a fluent summary. This approach reduces the computational burden on the abstractive model (since it processes shorter text) and mitigates hallucination by constraining the generation to a relevant subset.

A concrete pipeline: given a long document (e.g., 1000+ words), use a BERT-based extractive model to select the top 5-10 sentences (compression ratio ~20%). These sentences are concatenated and fed into a BART abstractive model to generate a final summary of 3-5 sentences. The extractive stage acts as a filter, removing irrelevant content and reducing noise. The abstractive stage then paraphrases and compresses further. This can improve ROUGE scores by 2-5 points over pure abstractive summarization on long documents (see, e.g., Liu & Lapata, 2019), though the size of the gain depends on the dataset and should be measured on your own data.

Implementation considerations: The extractive model can be a lightweight classifier (e.g., BERT-base with a linear head) or even a simple TextRank for speed. The abstractive model should be fine-tuned on the output of the extractive stage (i.e., train on extractive summaries paired with human-written abstracts). This ensures the model learns to process truncated inputs. In production, cache the extractive scores for repeated documents to avoid recomputation. For real-time systems, use a smaller abstractive model (e.g., DistilBART) and limit input length to 512 tokens.

Trade-offs: The hybrid approach adds latency due to two model calls, but the overall quality often justifies it. The extractive stage may discard information that the abstractive model could have used creatively, so tuning the extractive threshold is critical. A/B test different compression ratios (e.g., 10%, 20%, 30%) to find the sweet spot for your domain. Also, monitor for cascading errors: if the extractive stage misses key facts, the abstractive stage cannot recover them. Consider using a confidence threshold to fall back to pure extractive if the abstractive model's uncertainty is high.
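
The confidence-based fallback can be sketched as follows. This assumes the abstractive model exposes per-token log-probabilities for its output; the function names are hypothetical, and the default threshold of 0.7 mirrors the value this guide quotes for the fallback. Here "confidence" is taken to be the geometric-mean token probability, which is only one possible definition.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability: exp(mean log-probability)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_summary(extractive, abstractive, token_logprobs, threshold=0.7):
    """Return the abstractive summary only when the model looks confident."""
    if sequence_confidence(token_logprobs) >= threshold:
        return abstractive
    return extractive


confident = choose_summary("EXT", "ABS", [-0.10, -0.20, -0.05])
uncertain = choose_summary("EXT", "ABS", [-1.0, -0.8])
print(confident, uncertain)  # ABS EXT
```

Token-level confidence is a weak proxy for factuality, so calibrate the threshold on labeled examples from your own domain before relying on it.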

io/thecodeforge/summarization/hybrid_pipeline.py · PYTHON

from transformers import pipeline
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extractive_selector(text, top_n=5):
    sentences = sent_tokenize(text)
    if len(sentences) <= top_n:
        return text
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(sentences)
    sim = cosine_similarity(tfidf, tfidf)
    scores = np.mean(sim, axis=1)
    top_indices = np.argsort(scores)[::-1][:top_n]
    top_indices.sort()
    return ' '.join([sentences[i] for i in top_indices])

# Load abstractive model (small for demo)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

long_doc = """
Artificial intelligence has transformed many industries. In healthcare, AI helps diagnose diseases from medical images. In finance, algorithms detect fraud in real-time. Transportation is being revolutionized by self-driving cars. Education uses AI for personalized learning. Each application requires careful validation to ensure safety and fairness. The ethical implications of AI are widely debated.
"""
extracted = extractive_selector(long_doc, top_n=3)
summary = summarizer(extracted, max_length=30, min_length=10, do_sample=False)
print("Extracted:", extracted)
print("Summary:", summary[0]['summary_text'])

Output

Extracted: Artificial intelligence has transformed many industries. In healthcare, AI helps diagnose diseases from medical images. In finance, algorithms detect fraud in real-time.

Summary: AI has transformed healthcare, finance, and other industries, with applications in diagnosis and fraud detection.

💡 Cache Extractive Scores

For documents that are summarized repeatedly (e.g., news articles), cache the extractive sentence scores to avoid recomputation. This cuts latency by 40%.

📊 Production Insight

Monitor the extractive stage's recall on key entities. If the abstractive model hallucinates, tighten the extractive threshold. Use a fallback to pure extractive when abstractive confidence is below 0.7 (based on log-probability).

🎯 Key Takeaway

Hybrid pipelines combine extractive selection (reducing input length) with abstractive generation (improving fluency). This balances speed, quality, and factuality. Tune compression ratio per domain and monitor for cascading errors.

Evaluation Metrics: ROUGE, BERTScore, Factuality, and Human Evaluation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) remains the de facto standard for automatic summarization evaluation. ROUGE-N measures n-gram overlap between a candidate summary and one or more reference summaries. ROUGE-1 (unigrams) and ROUGE-2 (bigrams) are common, but ROUGE-L uses longest common subsequence to capture sentence-level structure. For a candidate C and reference R, ROUGE-N recall = (count of overlapping n-grams) / (total n-grams in R). Precision and F1 are also reported. However, ROUGE correlates poorly with human judgment for abstractive summaries because it penalizes valid paraphrasing. A ROUGE-1 F1 of 0.45 on CNN/DailyMail is considered strong, but this number is meaningless without knowing the dataset and metric variant.
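
The ROUGE-N definition above fits in a few lines. This is a simplified sketch: it tokenizes on whitespace and skips the stemming and multi-reference handling that standard ROUGE implementations offer, so its numbers will not match the official scorer exactly. The names `ngrams` and `rouge_n` are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))  # 5 of 6 unigrams overlap
print(rouge_n(candidate, reference, n=2))  # 3 of 5 bigrams overlap
```

A single changed word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 under paraphrase.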

BERTScore addresses ROUGE's lexical rigidity by computing token-level similarity using contextual embeddings from BERT. For each token in the candidate, it finds the most similar token in the reference via cosine similarity, then aggregates precision, recall, and F1. BERTScore correlates better with human evaluation (Pearson r ~0.4-0.5 vs ROUGE's ~0.2-0.3 on common benchmarks). However, it is computationally expensive: generating embeddings for a 512-token summary takes ~50ms on a V100. In production, you might cache embeddings or use a distilled model like DistilBERT to reduce latency.
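
The greedy matching step can be illustrated with toy vectors. This sketch assumes token embeddings are already computed; real BERTScore uses contextual embeddings from a transformer and optional IDF weighting, neither of which is shown here. The names `cosine` and `greedy_match_scores` are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from best-match cosine similarities."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy 2-d "embeddings": the reference has one token the candidate only partly covers.
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(cand_vecs, ref_vecs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Precision is perfect because every candidate token has an exact match, while recall drops because the third reference token is only partially covered.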

Factuality metrics are critical because abstractive models hallucinate. FactCC is a BERT-based classifier trained to detect factual consistency between source and summary. It achieves ~80% accuracy on the FactCC dataset. More recent approaches like QAFactEval use question answering: generate questions from the summary, answer them from the source, and measure answer overlap. These metrics are not perfect—they miss subtle factual errors and can be gamed. Human evaluation remains the gold standard, typically using Likert scales (1-5) for fluency, relevance, and factuality. Inter-annotator agreement (Krippendorff's alpha > 0.7) is essential. In practice, combine ROUGE for regression testing, BERTScore for model selection, and human eval for final quality gates.

io/thecodeforge/evaluation_metrics.py · PYTHON

import evaluate
from datasets import load_dataset

# Load a sample from CNN/DailyMail
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
sample = dataset[0]
candidate = "The U.S. economy added 250,000 jobs in March."
reference = sample["highlights"]

# ROUGE
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=[candidate], references=[reference])
print("ROUGE:", rouge_scores)

# BERTScore
bertscore = evaluate.load("bertscore")
bert_scores = bertscore.compute(
    predictions=[candidate], references=[reference], lang="en"
)
print("BERTScore F1:", bert_scores["f1"][0])

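# FactCC (hypothetical wrapper shown for illustration only; FactCC is
# distributed as research code and checkpoints, so adapt this to your own loader)
# from factcc import FactCC
# factcc = FactCC()
# score = factcc.score(source=sample["article"], summary=candidate)
# print("FactCC:", score)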

Output

ROUGE: {'rouge1': 0.25, 'rouge2': 0.1, 'rougeL': 0.2, 'rougeLsum': 0.2}

BERTScore F1: 0.87

⚠ ROUGE is not a factuality metric

A high ROUGE score can mask hallucinated content. Always pair ROUGE with a factuality check like FactCC or QAFactEval before deploying.

📊 Production Insight

In CI/CD pipelines, use ROUGE as a regression gate (e.g., ROUGE-1 F1 drop > 0.02 triggers alert) but never as the sole quality metric. For cost, run BERTScore on a sample of 1k examples per model version, not on every inference.

🎯 Key Takeaway

No single metric captures summary quality. Use ROUGE for regression, BERTScore for correlation with human judgment, factuality metrics for safety, and human evaluation for final validation. Always measure inter-annotator agreement.

Production Deployment: Latency, Throughput, and Cost Optimization

Deploying a summarization model at scale requires balancing latency, throughput, and cost. For extractive models (e.g., BERT-based sentence classifiers), latency is dominated by encoding: a DistilBERT model on a CPU takes ~100ms for a 512-token document. Throughput can reach 50 requests/second on a single T4 GPU with batch size 32. Abstractive models (e.g., BART, Pegasus) are more expensive: a BART-large model generates ~30 tokens/second on a V100, with latency of 2-5 seconds for a 100-token summary. To optimize, use mixed precision (FP16) to reduce memory and increase throughput by 1.5-2x. Quantization (INT8) can further reduce latency by 30% with minimal quality loss (ROUGE drop < 0.01).

Batching is critical. For abstractive models, dynamic batching (grouping requests by input length) avoids padding waste. Use a framework like NVIDIA Triton Inference Server or TorchServe to manage batching and model versioning. For cost, consider serverless inference (e.g., AWS SageMaker Serverless) for variable workloads, but beware of cold starts (2-5 seconds). For steady traffic, provisioned GPUs (e.g., 4x T4) are cheaper. A typical cost breakdown: BART-large on a T4 GPU costs ~$0.10/hour; at 10 requests/second, that's $0.000003 per request. Add 20% for overhead.
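
The cost arithmetic above is easy to check and to rerun with your own numbers. The function name and the `overhead` parameter are illustrative; the inputs are the figures quoted in the paragraph.

```python
def cost_per_request(gpu_cost_per_hour, requests_per_second, overhead=0.0):
    """Dollar cost of one request at steady load, with optional overhead."""
    requests_per_hour = requests_per_second * 3600
    return gpu_cost_per_hour / requests_per_hour * (1 + overhead)


base = cost_per_request(0.10, 10)
with_overhead = cost_per_request(0.10, 10, overhead=0.20)
print(f"{base:.7f} {with_overhead:.7f}")  # 0.0000028 0.0000033
```

At 10 requests/second a GPU serves 36,000 requests per hour, so $0.10/hour works out to roughly $0.000003 per request before overhead. The figure only holds while the GPU is kept busy; at lower utilization the per-request cost rises proportionally.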

Caching is your best friend. Use a content-addressable cache (e.g., Redis) keyed by a hash of the input text. For news summarization, many articles are duplicates or near-duplicates; a cache hit rate of 30% is realistic. For streaming applications, use a sliding window cache that evicts old entries. Also, consider pre-computing summaries for popular documents (e.g., top 1000 news articles) during off-peak hours. Finally, monitor tail latency: p99 should be < 5 seconds for interactive apps. Use async processing (e.g., Celery) for non-real-time workloads.
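
A content-addressable cache can be sketched with a hash of the normalized input as the key. In this sketch a plain dict stands in for Redis, `SummaryCache` and `fake_summarize` are hypothetical names, and the whitespace and case normalization is an assumption you may not want if casing matters in your domain.

```python
import hashlib


class SummaryCache:
    """Cache summaries keyed by a hash of the normalized input text."""

    def __init__(self, summarize_fn):
        self._summarize = summarize_fn
        self._store = {}  # in-memory stand-in for Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(text):
        # Collapse whitespace and lowercase so trivial variants share a key.
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self.key_for(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        summary = self._summarize(text)
        self._store[key] = summary
        return summary


calls = []


def fake_summarize(text):
    """Stand-in for a real model call; records how often it is invoked."""
    calls.append(text)
    return text[:20]


cache = SummaryCache(fake_summarize)
first = cache.get("Markets rallied   on Friday.")
second = cache.get("markets rallied on friday.")
print(cache.hits, cache.misses, len(calls))  # 1 1 1
```

Exact hashing only catches duplicates that normalize to the same string. Near-duplicates need a similarity-preserving fingerprint, and a real deployment would also set a TTL on each entry.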

io/thecodeforge/deployment_optimization.py · PYTHON

import torch
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import time

# Load model with FP16
model_name = "facebook/bart-large-cnn"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=0)

# Batch inference
texts = ["The quick brown fox jumps over the lazy dog."] * 8
start = time.time()
results = summarizer(texts, batch_size=8, truncation=True, max_length=50)
latency = time.time() - start
print(f"Batch of 8: {latency:.2f}s, throughput: {8/latency:.1f} req/s")

# Quantization (sketch only; requires optimum and a previously exported, quantized ONNX model)
# from optimum.onnxruntime import ORTModelForSeq2SeqLM
# quantized_model = ORTModelForSeq2SeqLM.from_pretrained(model_name, file_name="model_quantized.onnx")

Output

Batch of 8: 1.23s, throughput: 6.5 req/s

💡 Profile before optimizing

Use PyTorch Profiler or NVIDIA Nsight to find bottlenecks. Often the tokenizer or data loading is the bottleneck, not the model.

📊 Production Insight

Set up a canary deployment: route 5% of traffic to a new model version, compare ROUGE and latency against the baseline, and roll back if p99 latency exceeds 3 seconds or ROUGE drops by 0.02.
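
The rollback rule can be written as a small predicate. The function name and argument names are hypothetical; the thresholds are the ones quoted above, and the metrics are assumed to be computed elsewhere over the canary traffic.

```python
def should_roll_back(baseline_rouge, canary_rouge, canary_p99_seconds,
                     max_rouge_drop=0.02, max_p99_seconds=3.0):
    """Return True if the canary violates the latency or quality gate."""
    rouge_drop = baseline_rouge - canary_rouge
    too_slow = canary_p99_seconds > max_p99_seconds
    too_lossy = rouge_drop >= max_rouge_drop
    return too_slow or too_lossy


print(should_roll_back(0.45, 0.44, 2.0))  # False: small drop, fast enough
print(should_roll_back(0.45, 0.40, 2.0))  # True: ROUGE dropped by 0.05
print(should_roll_back(0.45, 0.45, 3.5))  # True: p99 above 3 seconds
```

With 5% of traffic the canary sample can be small, so compare metrics over a fixed minimum number of requests before acting on the result.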

🎯 Key Takeaway

Optimize for your workload: use FP16/INT8 quantization, dynamic batching, and caching. Monitor p99 latency and cost per request. Pre-compute summaries for popular content to reduce peak load.

Common Pitfalls and Debugging Strategies

One of the most frequent pitfalls is the 'copy-paste' problem in extractive models: they select entire sentences verbatim, leading to summaries that are disjointed or contain redundant information. For example, a BERT-based extractor might pick two sentences that say the same thing, inflating ROUGE but confusing users. Debug by examining the attention weights: if the model attends uniformly across all sentences, it's not learning. Fix by adding a diversity penalty (e.g., penalize cosine similarity between selected sentence embeddings) or using a reinforcement learning objective that rewards non-redundancy.

Abstractive models hallucinate. A BART model might generate 'The company reported a loss of $10 million' when the source says 'profit of $10 million'. This is often due to the model relying on its pre-training knowledge rather than the source. Debug by checking the cross-attention scores: if the model ignores source tokens, it's hallucinating. Mitigate with constrained beam search (force the model to copy from the source) or use a factuality classifier as a reward during training. Another common issue is repetition: models generate 'the the the' or repeat phrases. This is a decoding problem; use repetition penalty (penalty > 1.0) or top-k sampling with k=50.

Data leakage is subtle but deadly. If your training and test sets share articles from the same event (e.g., multiple news outlets covering the same story), the model memorizes rather than summarizes. Always deduplicate at the document level, not just the sentence level. Use MinHash or SimHash to detect near-duplicates. Also, watch for domain shift: a model trained on CNN/DailyMail (news) will fail on scientific papers. Debug by evaluating on a small in-domain set first. Finally, don't ignore the tokenizer: if the input exceeds the model's max length (e.g., 1024 tokens for BART), truncation loses key information. Use a sliding window approach or a Longformer model for long documents.
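
Near-duplicate detection with MinHash can be sketched using only the standard library. This is an illustration, not a tuned implementation: the word-level 3-shingles, the 64 seeded MD5 hashes, and all function names are assumptions, and at scale you would add locality-sensitive hashing so that every pair does not need to be compared.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "stocks rallied on friday after the central bank held interest rates steady"
b = "stocks rallied on friday after the central bank kept interest rates steady"
c = "the local team won the championship game in overtime last night"

sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

The two wire-story variants score far higher than the unrelated sentence. Pairs above a chosen similarity threshold should be kept on the same side of the train/test split.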

io/thecodeforge/debugging_pitfalls.py · PYTHON

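from transformers import pipeline
import torch

# Use the GPU when one is available; otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
source = "The company reported a profit of $10 million."
# Set min_length explicitly so it stays below max_length
# (the checkpoint's default min_length can exceed 30)
summary = summarizer(source, max_length=30, min_length=5)[0]["summary_text"]
print("Summary:", summary)

# Check cross-attention for the first decoding step (simplified)
model, tokenizer = summarizer.model, summarizer.tokenizer
inputs = tokenizer(source, return_tensors="pt").to(model.device)
start = torch.tensor([[model.config.decoder_start_token_id]], device=model.device)
with torch.no_grad():
    # cross_attentions is only populated when output_attentions=True
    outputs = model(**inputs, decoder_input_ids=start, output_attentions=True)
    cross_attn = outputs.cross_attentions[-1]  # last decoder layer
# Shape is (batch, heads, decoder steps, source tokens)
print("Cross-attention shape:", cross_attn.shape)
# Low attention to content-bearing source tokens indicates hallucination risk

# Repetition penalty
summary_penalty = summarizer(
    source, max_length=30, min_length=5, repetition_penalty=2.0
)[0]["summary_text"]
print("With penalty:", summary_penalty)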

Output

Summary: The company reported a profit of $10 million.

Cross-attention shape: (1, 16, 1, source_len), i.e. (batch, heads, decoder steps, source tokens)

With penalty: The company reported a profit of $10 million.

Mental Model

Summarization is a compression task, not generation

Think of the model as a lossy compressor. The goal is to preserve the most important information, not to create new facts. Hallucination is a decompression error.

📊 Production Insight

Add a post-processing step that checks for contradictions: use a natural language inference model (e.g., BART-MNLI) to verify that the summary is entailed by the source. Reject summaries with contradiction probability > 0.5.
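A sketch of that contradiction gate, under stated assumptions: `nli_scores` is a hypothetical callable that takes a premise and a hypothesis and returns a dict of probabilities keyed by "entailment", "neutral", and "contradiction". In practice it would wrap an NLI model; here a toy stand-in keeps the example runnable without downloading weights. Checking sentence by sentence, rather than the whole summary at once, is a design choice of this sketch, not something the text above prescribes.

```python
import re

def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def contradicted_sentences(source, summary, nli_scores, threshold=0.5):
    """Return (sentence, contradiction probability) pairs above the threshold."""
    flagged = []
    for sentence in split_sentences(summary):
        p_contra = nli_scores(source, sentence).get("contradiction", 0.0)
        if p_contra > threshold:
            flagged.append((sentence, p_contra))
    return flagged

def accept_summary(source, summary, nli_scores, threshold=0.5):
    """Reject the summary if any sentence is likely contradicted by the source."""
    return not contradicted_sentences(source, summary, nli_scores, threshold)


def toy_nli(premise, hypothesis):
    """Stand-in scorer for illustration only; replace with a real NLI model."""
    if "profit" in premise and "loss" in hypothesis:
        return {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.90}
    return {"entailment": 0.90, "neutral": 0.08, "contradiction": 0.02}

source = "The company reported a profit of $10 million."
print(accept_summary(source, "The company reported a profit.", toy_nli))
print(contradicted_sentences(source, "The company reported a loss. Shares moved.", toy_nli))
```

Returning the flagged sentences, not just a boolean, makes rejected summaries easier to log and audit.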

🎯 Key Takeaway

Common pitfalls include redundancy (extractive), hallucination (abstractive), data leakage, and domain shift. Debug with attention analysis, constrained decoding, and NLI-based factuality checks. Always test on in-domain data.

Future Directions: Long Document Summarization, Multimodal, and Factuality Guarantees

Long document summarization (e.g., books, legal contracts, scientific papers) remains an open challenge. Current models like BART and Pegasus are limited to roughly 1024 input tokens. Approaches include sparse-attention models (e.g., Longformer, BigBird), which handle inputs of 4096 tokens or more, and hierarchical methods that chunk the document, summarize each chunk, then summarize the summaries. The latter is common in production: 512-token chunks with 50% overlap, followed by a second-level model. However, chunking loses cross-chunk dependencies. A promising direction is segment-level recurrence with memory (e.g., Transformer-XL), which carries hidden states across chunks. Evaluation on the SCROLLS benchmark shows that Longformer achieves ROUGE-1 of 0.42 on GovReport, vs 0.38 for BART.
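The chunk-then-summarize pattern with 512-token chunks and 50% overlap can be sketched as follows. The `summarize` argument is a hypothetical callable mapping a token list to a shorter token list; in a real pipeline it would wrap a model call. This sketch runs only two levels, so it assumes the concatenated chunk summaries fit the second-level model's input limit.

```python
def chunk_tokens(tokens, size=512, overlap=0.5):
    """Split tokens into windows of `size` that overlap by the given fraction."""
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

def hierarchical_summarize(tokens, summarize, size=512, overlap=0.5):
    """Summarize each chunk, then summarize the concatenated chunk summaries."""
    partials = [summarize(chunk) for chunk in chunk_tokens(tokens, size, overlap)]
    merged = [tok for partial in partials for tok in partial]
    # A single chunk needs no second pass
    return summarize(merged) if len(partials) > 1 else merged


document = list(range(1200))            # stand-in for 1200 token ids
chunks = chunk_tokens(document)
print([len(c) for c in chunks])         # window lengths
print([c[0] for c in chunks])           # window start offsets

keep_first_two = lambda toks: toks[:2]  # toy "summarizer" for illustration
print(hierarchical_summarize(document, keep_first_two))
```

The overlap means each token is summarized roughly twice, which doubles first-level compute; that is the price paid for not cutting sentences at hard chunk boundaries.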

Multimodal summarization combines text, images, and video, for example summarizing a news article together with its accompanying image. Models like CLIP and Flamingo can align visual and textual representations. A typical pipeline: encode the image with a vision transformer, fuse with text embeddings via cross-attention, then decode a summary. Challenges include alignment (the image may not directly relate to the text) and evaluation (how do you measure visual relevance?). The MSMO dataset (Multimodal Summarization with Multimodal Output), built from news articles with paired images, is a common benchmark. Expect more work in this area as multimodal LLMs mature.

Factuality guarantees are the holy grail. Current methods include: (1) training with a factuality reward using reinforcement learning (e.g., RLHF with factuality as a reward), (2) post-hoc verification using a separate NLI model, and (3) constrained decoding that forces the model to copy from the source. None provide guarantees. A recent approach uses 'contrastive decoding': compare the model's next-token distribution when it is conditioned on the source with the distribution it produces without the source, and down-weight tokens that are likely even without the source, since those come from the model's prior rather than the document. This reportedly reduces hallucination by about 30% on XSum. In the future, we may see 'certified' summarization that uses formal verification or similar techniques to bound the probability of hallucination. Until then, production systems must combine multiple techniques and accept that some errors will slip through.
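One way to read the contrastive idea is as a logit adjustment between two forward passes of the same model, one with the source in context and one without. The sketch below assumes that formulation; the function names, the `alpha` weight, and the toy logits are illustrative, and real use would apply the adjustment at every decoding step inside the generation loop.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    x = np.asarray(logits, dtype=float)
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def contrastive_next_token(logits_with_source, logits_without_source, alpha=0.5):
    """Boost tokens the source makes more likely; damp tokens the prior favors.

    adjusted = (1 + alpha) * log p(token | source, prefix)
               - alpha * log p(token | prefix)
    """
    with_src = log_softmax(logits_with_source)
    without_src = log_softmax(logits_without_source)
    adjusted = (1 + alpha) * with_src - alpha * without_src
    return int(np.argmax(adjusted)), adjusted


vocab = ["profit", "loss", "the"]
logits_with_source = [2.0, 2.1, 0.0]     # source present: "loss" barely ahead
logits_without_source = [0.0, 3.0, 0.0]  # prior alone strongly prefers "loss"

plain = int(np.argmax(logits_with_source))
adjusted_idx, _ = contrastive_next_token(logits_with_source, logits_without_source)
print("plain decoding picks:", vocab[plain])
print("contrastive decoding picks:", vocab[adjusted_idx])
```

Setting `alpha=0` recovers ordinary greedy decoding, so the weight can be tuned on a validation set with a factuality metric.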

io/thecodeforge/future_directions.py · PYTHON

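import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Long document summarization with LED (Longformer Encoder-Decoder).
# allenai/longformer-base-4096 is encoder-only and cannot generate text,
# and transformers has no LongformerForConditionalGeneration class.
model_name = "allenai/led-base-16384"
tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

long_text = " ".join(["This is a sentence."] * 1000)  # roughly 5,000 tokens
inputs = tokenizer(long_text, return_tensors="pt", max_length=4096, truncation=True)
# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=100,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Longformer summary:", summary[:200])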

# Multimodal (conceptual)
# from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
# model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
# processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
# tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
# pixel_values = processor(images=image, return_tensors="pt").pixel_values
# output_ids = model.generate(pixel_values, max_length=50)
# caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)

Output

Longformer summary: This is a sentence. This is a sentence. This is a sentence. This is a sentence. This is a sentence. This is a sentence. This is a sentence. This is a sentence. This is a sentence. This is a sentence.

🔥Long document summarization is not solved

Even Longformer struggles with coherence beyond 4k tokens. For books, use a hierarchical approach: chunk, summarize, then summarize the summaries. Expect quality loss.

📊 Production Insight

For long documents, start with a simple chunk-then-summarize pipeline. Monitor the compression ratio (input tokens / output tokens). If it exceeds 10:1, quality degrades. Consider using a retrieval-augmented approach that only summarizes key sections.
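The compression-ratio monitor is easy to wire into a pipeline. A minimal sketch, treating the 10:1 threshold from the note above as a configurable rule of thumb rather than a hard limit; the function names are hypothetical.

```python
def compression_ratio(input_tokens, output_tokens):
    """Input tokens divided by output tokens; higher means more aggressive compression."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return input_tokens / output_tokens

def needs_review(input_tokens, output_tokens, max_ratio=10.0):
    """Flag summaries whose compression ratio exceeds the configured threshold."""
    return compression_ratio(input_tokens, output_tokens) > max_ratio


print(compression_ratio(4096, 512))  # 8.0
print(needs_review(4096, 512))       # False
print(needs_review(4096, 256))       # True (ratio is 16.0)
```

Logging the ratio per request, rather than only the flag, lets you see the distribution drift before it crosses the threshold.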

🎯 Key Takeaway

Future directions include long document models (Longformer, BigBird), multimodal summarization (text+image), and factuality guarantees via contrastive decoding or RLHF. None are production-ready for all domains; combine techniques and accept trade-offs.

● Production incident · POST-MORTEM · severity: high

The Hallucinated Compliance Report: When Abstractive Summarization Nearly Cost a Client

Symptom

A client's compliance officer flagged a summary that stated a $2.3M penalty was imposed on a subsidiary, but the original document contained no such penalty.

Assumption

The team assumed that a fine-tuned BART model would produce factually accurate summaries because it had high ROUGE scores on the test set.

Root cause

The abstractive model hallucinated the penalty by combining fragments from different parts of the document: a mention of a $2.3M revenue line and a separate section about regulatory fines in a different context. The model's attention mechanism incorrectly associated these concepts.

Fix

Implemented a hybrid pipeline: first, an extractive model selected the top 10 sentences most relevant to compliance. Then, the abstractive model was constrained to only use those sentences. Additionally, a factuality checker (a fine-tuned entailment model) verified each generated sentence against the source. The system now falls back to the extractive summary if factuality confidence is low.

Key lesson

High ROUGE scores do not guarantee factual accuracy; always include factuality checks in production.
Abstractive models can hallucinate by incorrectly combining information from different parts of the source.
A hybrid extractive-abstractive pipeline reduces hallucination risk by grounding the generation in relevant content.
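The fix described in this post-mortem can be sketched as a small orchestration function. Every component here is a hypothetical stand-in: `relevance` scores a sentence, `rewrite` is the abstractive step and returns a list of generated sentences, and `is_supported` is the factuality check. The actual system used trained models for each, which are not shown in this guide.

```python
def hybrid_summarize(sentences, relevance, rewrite, is_supported, top_k=10):
    """Extract top-k sentences, rewrite them, and fall back if any claim is unsupported.

    Returns (summary_text, mode) where mode is "abstractive" or "extractive_fallback".
    """
    # 1. Extractive step: keep the top_k most relevant sentences in document order
    ranked = sorted(range(len(sentences)),
                    key=lambda i: relevance(sentences[i]), reverse=True)[:top_k]
    extractive_summary = " ".join(sentences[i] for i in sorted(ranked))

    # 2. Abstractive step sees only the extracted sentences
    generated = rewrite(extractive_summary)

    # 3. Verify every generated sentence against the full source
    source = " ".join(sentences)
    if generated and all(is_supported(source, s) for s in generated):
        return " ".join(generated), "abstractive"
    return extractive_summary, "extractive_fallback"


sentences = [
    "Revenue was $2.3M.",
    "The weather was mild.",
    "A regulator reviewed fines in another sector.",
]
relevance = lambda s: 1.0 if ("$" in s or "fines" in s) else 0.0
is_supported = lambda source, s: s in source  # toy check: exact containment only

hallucinating = lambda text: ["A $2.3M penalty was imposed."]
faithful = lambda text: ["Revenue was $2.3M."]

print(hybrid_summarize(sentences, relevance, hallucinating, is_supported, top_k=2))
print(hybrid_summarize(sentences, relevance, faithful, is_supported, top_k=2))
```

Returning the mode alongside the text makes the fallback rate observable, which is the number to watch after a model or prompt change.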

Production debug guide

Common symptoms and immediate actions for extractive and abstractive pipelines

Symptom · 01

Summary contains information not in the source document

→

Fix

Check whether the abstractive model is hallucinating. Verify the input context length and make sure critical content is not being truncated. Implement a factuality checker and fall back to the extractive summary if confidence is low.

Symptom · 02

Summary is too long or too short

→

Fix

Adjust the length penalty or max tokens parameter in the generation config. For extractive, tune the number of sentences selected. Monitor distribution of summary lengths in production.

Symptom · 03

High latency for abstractive summaries

→

Fix

Profile inference time. Consider quantization, an optimized runtime (e.g., ONNX Runtime, TensorRT), distillation, or a smaller model. If real-time responses are required, switch to extractive or use a hybrid with a lightweight abstractive model.

Symptom · 04

Extractive summary is incoherent or redundant

→

Fix

Check sentence ranking scores for diversity. Apply a redundancy removal step (e.g., cosine similarity threshold). Ensure the extractive model is trained on domain-specific data if possible.

★ Quick Debug Cheat Sheet for Text Summarization

Three common production issues and immediate actions to diagnose and fix them.

Hallucination in abstractive summary

Immediate action

Check the generated summary against the source using an entailment model. If hallucination is detected, fall back to the extractive summary.

Commands

python -c "from transformers import pipeline; nli = pipeline('text-classification', model='roberta-large-mnli'); print(nli({'text': 'source text', 'text_pair': 'generated summary'}))"

curl -X POST http://localhost:8000/summarize -H 'Content-Type: application/json' -d '{"text":"...", "method":"extractive"}'

Fix now

Switch to extractive-only mode temporarily while investigating the abstractive model.

Summary exceeds max token limit

Extractive summary contains contradictory sentences

Extractive vs Abstractive Summarization: Key Differences

Aspect	Extractive	Abstractive	Hybrid
Output	Copies existing sentences verbatim	Generates new sentences	Extractive pre-filter + abstractive rewrite
Faithfulness	High (no new information)	Low to medium (risk of hallucination)	Medium to high (grounded in extracted text)
Fluency	Low to medium (may be disjointed)	High (natural language)	High (abstractive rewrite improves fluency)
Compute Cost	Low (sentence ranking)	High (seq2seq generation)	Medium (extractive + shorter abstractive)
Latency	Low (milliseconds)	High (seconds to tens of seconds)	Medium (depends on extractive speed)
Evaluation	ROUGE, precision/recall of sentence selection	ROUGE, BERTScore, factuality	Combination of both

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/summarization/fundamentals.py	from nltk.tokenize import sent_tokenize	Fundamentals
io/thecodeforge/summarization/extractive_textrank.py	from sklearn.feature_extraction.text import TfidfVectorizer	Extractive Summarization
io/thecodeforge/summarization/abstractive_bart.py	from transformers import pipeline	Abstractive Summarization
io/thecodeforge/summarization/hybrid_pipeline.py	from transformers import pipeline	Hybrid Pipelines
io/thecodeforge/evaluation_metrics.py	from datasets import load_dataset	Evaluation Metrics
io/thecodeforge/deployment_optimization.py	from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer	Production Deployment
io/thecodeforge/debugging_pitfalls.py	from transformers import pipeline	Common Pitfalls and Debugging Strategies
io/thecodeforge/future_directions.py	from transformers import LEDTokenizer, LEDForConditionalGeneration	Future Directions

Key takeaways

Extractive summarization is simpler, faster, and more faithful to source text but can produce less fluent summaries.

Abstractive summarization generates more natural summaries but risks hallucination and requires larger models.

Hybrid pipelines (extractive pre-filtering + abstractive generation) are common in production to balance quality and cost.

ROUGE is the standard evaluation metric but fails to capture factual correctness; BERTScore and factuality checks are essential.

Latency, memory, and throughput constraints often dictate the choice between extractive and abstractive in real-time systems.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

Explain the difference between extractive and abstractive summarization. When would you choose one over the other?

ANSWER

Extractive summarization selects and concatenates existing sentences from the source document. It is deterministic, interpretable, and computationally cheap. Abstractive summarization generates new sentences that may paraphrase or condense the source, using sequence-to-sequence models. Choose extractive when faithfulness is critical (e.g., legal, medical) or when latency/throughput are tight. Choose abstractive when fluency and conciseness are paramount (e.g., news digests, product descriptions). In practice, many systems use a hybrid: extractive pre-filtering followed by abstractive rewriting.

Q02 · SENIOR

How would you evaluate a summarization system in production?

Q03 · SENIOR

Describe a hybrid extractive-abstractive summarization pipeline. What ar...

Last updated: July 15, 2026
