Junior 12 min · March 06, 2026

Text Classification Failure — OOV Crash Recall to 0.51

Recall dropped from 0.94 to 0.51 in 14 days because the model's vocabulary contained no crypto-related words.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Last updated: May 23, 2026
Quick Answer
  • Text classification maps raw text to predefined labels using ML
  • TF-IDF vectorization converts words into numerical importance scores
  • Naive Bayes and Logistic Regression are fast, interpretable starters
  • Sentence transformers handle paraphrases but need GPU for production throughput
  • Biggest mistake: evaluating on accuracy alone when classes are imbalanced
✦ Definition~90s read
What is Text Classification with ML?

Text classification is the task of assigning a predefined category to a piece of text—think spam detection, sentiment analysis, or topic labeling. The core challenge is that machines don't understand words; they need to convert text into numbers (vectorization) before any algorithm can process it.

When your model encounters a word it never saw during training, an out-of-vocabulary (OOV) token, a fixed-vocabulary vectorizer cannot represent it: the token is silently dropped and contributes nothing to the feature vector. If the dropped tokens carry the signal for a class, recall for that class can collapse. This is the 'OOV crash' problem, and it's why bag-of-words and TF-IDF degrade on real-world data with typos, slang, or domain-specific jargon.

In practice, you'll start with simple classifiers like Naive Bayes or Logistic Regression, which are fast and interpretable but brittle with OOV tokens. Logistic regression, for instance, learns linear decision boundaries from TF-IDF features—if a token is missing, the feature is zero, and the model's prediction degrades.

Upgrading to sentence transformers (e.g., Sentence-BERT models built on BERT-style encoders) mitigates this by generating dense, context-aware embeddings that handle OOV tokens via subword tokenization (e.g., WordPiece). These models break unseen words into known subword units, so novel inputs still produce a usable representation and recall degrades far less.

Choosing the right model involves trade-offs: TF-IDF + logistic regression is cheap to train and deploy (milliseconds per prediction, fits on a single CPU), but its accuracy tends to plateau on paraphrase-heavy or fast-changing text. Sentence transformers usually push accuracy higher, but they typically need GPU inference for production throughput and ship model files of roughly 80MB and up.

For production, you might use a hybrid: a fast fallback classifier for common cases and a transformer for edge cases. Honest evaluation means looking beyond raw accuracy—track precision, recall, and F1 per class, especially for rare categories where OOV tokens hit hardest.
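The hybrid idea above can be sketched as a confidence-based router. This is a minimal illustration, not a reference design: `route_prediction`, the `min_confidence` threshold of 0.8, and the `slow_predict` stand-in for a transformer-based classifier are all hypothetical names and values chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def route_prediction(text, fast_model, slow_predict, min_confidence=0.8):
    """Return (label, route). Use the fast model unless it is unsure."""
    probabilities = fast_model.predict_proba([text])[0]
    best_index = probabilities.argmax()
    if probabilities[best_index] >= min_confidence:
        return fast_model.classes_[best_index], "fast"
    # Low confidence: hand the text to the slower, more capable model
    return slow_predict(text), "slow"


# Toy fast model (TF-IDF + logistic regression) trained on a handful of examples
train_texts = [
    "win a free prize now",
    "free money click here",
    "meeting moved to friday",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]
fast_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
fast_model.fit(train_texts, train_labels)


def slow_predict(text):
    # Hypothetical stand-in for a transformer-based classifier
    return "ham"


# Every token is out-of-vocabulary, so the fast model has no evidence to go on
label, route = route_prediction("cryptowallet airdrop", fast_model, slow_predict)
print(label, route)
```

In a real system the threshold would be tuned on validation data, and the share of traffic taking the slow route is itself worth monitoring: a sudden rise often means the vocabulary has drifted.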

The 'recall crash to 0.51' in the title reflects a realistic failure mode: when around 20% of the tokens in your test set are OOV, a TF-IDF model's recall can fall from roughly 0.90 to 0.51, which makes it unusable in production.

Plain-English First

Imagine your email inbox has a bouncer at the door. Every incoming email gets a quick read, and the bouncer decides: 'spam' goes in the junk folder, 'important' lands in your inbox. Text classification is exactly that bouncer — a machine learning model that reads a piece of text and stamps it with a label. Your phone does it when it detects a toxic comment. Netflix does it when it reads your review and decides if you loved the show. It's the foundation of almost every app that needs to understand what humans are saying.

Every day, humans generate around 2.5 quintillion bytes of data — and most of it is unstructured text. Customer reviews, support tickets, social media posts, medical notes. None of that data is useful until a machine can read it and say 'this is a complaint', 'this is urgent', or 'this is spam'. Text classification is the ML technique that makes that possible, and it powers systems you use dozens of times a day without realising it.

The core problem text classification solves is deceptively simple: given a string of words, assign it to one of several predefined categories. But computers don't speak English — they speak numbers. So the real challenge is the pipeline that happens before the model even sees the data: cleaning text, converting it into numerical features, and choosing a model that can learn meaningful patterns from those features. Get that pipeline wrong and even the fanciest model won't save you.

By the end of this article you'll be able to build a complete, production-aware text classification pipeline in Python — from raw messy text all the way to a trained model making predictions. You'll understand why each step exists, not just how to run it. And you'll know the common traps that burn people in interviews and on the job.

Why Text Classification Fails on Out-of-Vocabulary Tokens

Text classification assigns a predefined label to a piece of text: spam or ham, positive or negative, urgent or routine. The core mechanic is mapping token sequences to a fixed set of categories via a trained model. Many production systems use a bag-of-words or TF-IDF vectorizer followed by a linear classifier (e.g., logistic regression, SVM). The model learns a weight for every token in the training vocabulary. At inference, any token not seen during training, an out-of-vocabulary (OOV) token, is silently dropped, so it contributes nothing to the feature vector.

This OOV behavior is a major source of recall collapse. In a typical news classifier, 5–15% of tokens in production traffic are OOV. When the terms that signal a critical category show up in production only as new words, novel compounds, or misspellings, the classifier sees none of that signal. The result: recall for that category can drop from 0.95 to 0.51 in a single deployment. The model doesn't fail loudly. It falls back on whatever the remaining known tokens and class priors favour, usually the majority class, which masks the problem until a manual audit.

Use text classification when your categories are stable and your vocabulary is well-covered by training data. Never use it for open-ended or rapidly evolving domains (e.g., trending topics, product names) without an OOV mitigation strategy. In practice, teams deploy a fallback: a character-level n-gram model or a subword tokenizer (BPE, WordPiece) that can handle unseen tokens. Without that, your recall is a ticking time bomb.
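One of the fallbacks mentioned above, character-level n-grams, is available directly in scikit-learn through `analyzer="char_wb"`. The sketch below uses an invented three-document corpus and an n-gram range of 3 to 5 picked for illustration; it shows that an unseen compound such as 'cryptowallet' still matches character fragments learned from 'crypto' and 'wallet'.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "crypto exchange deposit",
    "wallet topup at kiosk",
    "grocery store purchase",
]

# Word-level vocabulary: 'cryptowallet' was never seen as a whole token
word_vectorizer = TfidfVectorizer(analyzer="word")
word_vectorizer.fit(train_texts)

# Character n-grams inside word boundaries: unseen words still share
# fragments such as 'cry' or 'wal' with the training vocabulary
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
char_vectorizer.fit(train_texts)

new_text = ["cryptowallet"]
word_hits = word_vectorizer.transform(new_text).nnz
char_hits = char_vectorizer.transform(new_text).nnz

print("Word-level features matched:", word_hits)
print("Char n-gram features matched:", char_hits)
```

Character n-grams inflate the feature space considerably, so a common compromise is to combine a word-level and a character-level vectorizer and cap each with `max_features`.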

OOV Is Not a Rare Edge Case
In production logs, OOV tokens often account for 10–20% of all tokens. Dropping them silently leaves your classifier close to guessing for any category whose signal lives in novel vocabulary.
Production Insight
A fintech team deployed a transaction classifier trained on 6 months of data. Three weeks later, a new merchant category ('cryptowallet') appeared — every token was OOV. Recall for 'suspicious' dropped from 0.93 to 0.47 overnight.
Symptom: precision stays high (no false positives from unknown tokens), but recall collapses silently — no error, no alert, just a sudden increase in false negatives.
Rule of thumb: If your tokenizer drops more than 5% of tokens in a validation set, you must use subword tokenization or a fallback n-gram model before going to production.
Key Takeaway
OOV tokens are not noise — they are the signal you are missing.
A bag-of-words classifier with a fixed vocabulary has a hard recall ceiling determined by vocabulary coverage.
Always measure token coverage on production traffic before trusting your recall numbers.
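Token coverage can be measured with the vectorizer's own analyzer, so the count uses the same lowercasing and tokenisation as training. The helper name `oov_rate` and the sample texts are assumptions made for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def oov_rate(fitted_vectorizer, texts):
    """Fraction of tokens in `texts` that are missing from the fitted vocabulary."""
    analyzer = fitted_vectorizer.build_analyzer()  # same preprocessing as training
    vocabulary = fitted_vectorizer.vocabulary_
    total = 0
    unknown = 0
    for text in texts:
        for token in analyzer(text):
            total += 1
            if token not in vocabulary:
                unknown += 1
    return unknown / total if total else 0.0


train_texts = ["card payment declined", "salary deposit received"]
vectorizer = TfidfVectorizer().fit(train_texts)

production_texts = ["card payment via cryptowallet"]
rate = oov_rate(vectorizer, production_texts)
print(f"OOV rate: {rate:.0%}")
```

Run against a daily sample of production text, this number can feed the 5% rule of thumb above as an alert threshold. Note that terms pruned by `max_features` or `min_df` also count as OOV here, which is the desired behaviour because the model cannot see them either.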
[Figure: Text Classification Failure: OOV Crash Recall to 0.51. Pipeline from vectorisation to deployment with OOV and imbalance traps. Panels: Out-of-Vocabulary Words; Vectorisation & Tokenizer; Naive Bayes vs Logistic Regression; TF-IDF → Sentence Transformers; Class Imbalance Handling; Fixed Tokenizer Pipeline.]

How Machines Read Words: Vectorisation and Why It Matters

Before any model can classify text, you need to answer a fundamental question: how do you turn the sentence 'This product broke in two days' into something a mathematical model can process? The answer is vectorisation — converting text into arrays of numbers.

The most battle-tested approach is TF-IDF (Term Frequency–Inverse Document Frequency). It does two clever things at once. First, it counts how often a word appears in a document (TF). Second, it penalises words that appear in almost every document — like 'the' or 'is' — because they carry no useful signal (IDF). The result is a number that represents how distinctive a word is to a particular document.

Why not just count raw word frequencies? Because 'the' might be the most frequent word in every review, positive or negative. It tells you nothing. TF-IDF filters that noise out automatically.
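To make the IDF part concrete, the sketch below recomputes it by hand and compares the result with scikit-learn. It assumes the library's default settings (`smooth_idf=True`), under which the formula is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The three example documents are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the battery died",
    "the screen cracked",
    "the battery is great",
]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
vectorizer.fit(docs)

n_docs = len(docs)
vocab = vectorizer.vocabulary_


def manual_idf(document_frequency):
    # scikit-learn's smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return np.log((1 + n_docs) / (1 + document_frequency)) + 1


idf_the = vectorizer.idf_[vocab["the"]]          # appears in 3 of 3 documents
idf_battery = vectorizer.idf_[vocab["battery"]]  # appears in 2 of 3 documents
idf_cracked = vectorizer.idf_[vocab["cracked"]]  # appears in 1 of 3 documents

print(f"the:     {idf_the:.3f}")
print(f"battery: {idf_battery:.3f}")
print(f"cracked: {idf_cracked:.3f}")
```

A word that appears in every document gets the minimum IDF of 1.0, and the rarer the word, the larger its multiplier. The final TF-IDF score is the term count times this IDF, with each document row then scaled to unit length.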

The alternative — word embeddings like Word2Vec or sentence transformers — are more powerful but also more complex. TF-IDF is the right starting point: fast, interpretable, and often good enough for structured datasets. Understand it deeply before reaching for a transformer.

vectorise_text.py (Python)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Simulating a small customer review dataset
# In a real project this would come from a CSV or database
reviews = [
    "This product is absolutely amazing and works perfectly",
    "Terrible quality, broke after two days, total waste of money",
    "Pretty good value for the price, happy with my purchase",
    "Worst purchase I have ever made, completely useless product",
    "Exceeded my expectations, will definitely buy again"
]

labels = ["positive", "negative", "positive", "negative", "positive"]

# TfidfVectorizer handles tokenisation, lowercasing, and IDF weighting
# max_features limits vocabulary size — important for memory on large datasets
# stop_words='english' removes common words like 'the', 'is', 'and'
vectorizer = TfidfVectorizer(max_features=20, stop_words='english')

# fit_transform: learns the vocabulary AND converts text to numbers in one step
# Returns a sparse matrix — rows are documents, columns are words
tfidf_matrix = vectorizer.fit_transform(reviews)

# Let's see what vocabulary was learned
learned_vocabulary = vectorizer.get_feature_names_out()
print("Learned vocabulary:")
print(learned_vocabulary)
print()

# Convert sparse matrix to a readable DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=learned_vocabulary,
    index=[f"Review {i+1}" for i in range(len(reviews))]
)

print("TF-IDF scores per document (higher = more distinctive word):")
print(tfidf_df.round(3).to_string())
Output
Learned vocabulary:
['absolutely' 'amazing' 'broke' 'buy' 'completely' 'definitely' 'exceeded'
 'expectations' 'good' 'happy' 'money' 'perfectly' 'product' 'purchase'
 'quality' 'terrible' 'useless' 'value' 'waste' 'works']
TF-IDF scores per document (higher = more distinctive word):
absolutely amazing broke buy completely definitely exceeded expectations good happy money perfectly product purchase quality terrible useless value waste works
Review 1 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.316 0.000 0.000 0.000 0.000 0.000 0.000 0.447
Review 2 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.447 0.447 0.000 0.000 0.447 0.000
Review 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.500 0.000 0.000 0.000 0.354 0.000 0.000 0.000 0.500 0.000 0.000
Review 4 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.316 0.354 0.000 0.000 0.447 0.000 0.000 0.000
Review 5 0.000 0.000 0.000 0.447 0.000 0.447 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Pro Tip: Always fit on training data only
Call fit_transform() on your training set, then transform() on your test set. Never fit on the full dataset: that leaks test-set vocabulary and document frequencies into training and artificially inflates accuracy scores.
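The tip above looks like this in code. It is a minimal sketch with eight invented reviews; the split size and random seed are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = [
    "great phone, love the camera",
    "battery died after a week",
    "excellent screen and fast delivery",
    "stopped charging, very disappointed",
    "works perfectly, great value",
    "terrible build quality",
    "happy with the purchase",
    "refund requested, item arrived broken",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary and IDF learned here only
X_test = vectorizer.transform(test_texts)        # reuses the training vocabulary as-is

print("Train matrix:", X_train.shape)
print("Test matrix: ", X_test.shape)
```

Both matrices have the same number of columns because the test set is projected onto the training vocabulary. Any word that appears only in the test texts is dropped, which is exactly what will happen in production.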
Production Insight
If production text contains words not in the TF-IDF vocabulary, they're silently dropped — the model sees less signal and accuracy drifts.
Monitor out-of-vocabulary rate daily. Anything above 5% means your vectorizer needs retraining on fresh data.
Key Takeaway
TF-IDF rewards discriminative words by down-weighting common terms.
The vocabulary is static after training — any new word is invisible to the model.

Training Your First Classifier: Naive Bayes vs Logistic Regression

Now that text is numeric, you can feed it into a classifier. Two models dominate beginner-to-intermediate text classification: Multinomial Naive Bayes and Logistic Regression. They're both fast, interpretable, and work surprisingly well — and understanding why they work differently will save you a lot of tuning time.

Naive Bayes asks: 'Given this class label, what's the probability of seeing each word?' It calculates probabilities per word and multiplies them together. The 'naive' part is the assumption that each word's probability is independent of the others — clearly not true in real language, but the model still performs remarkably well on text data. It's extremely fast and memory-efficient.

Logistic Regression learns a weight for each word. Words strongly associated with 'positive' get high positive weights; words associated with 'negative' get negative weights. It then sums those weighted scores and passes them through a sigmoid function (softmax when there are more than two classes) to output a probability. It's slightly slower to train but usually gives better-calibrated probabilities than Naive Bayes and tends to cope better with imbalanced classes.

For a quick baseline, reach for Naive Bayes. For production pipelines where calibration matters (you need 'how confident is the model?'), use Logistic Regression.

train_text_classifier.py (Python)
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Using a real benchmark dataset — 20 newsgroup posts across different topics
# We're selecting 3 categories to keep it manageable and interpretable
categories_to_classify = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns']

print("Loading 20 Newsgroups dataset...")
newsgroups_data = fetch_20newsgroups(
    subset='all',
    categories=categories_to_classify,
    remove=('headers', 'footers', 'quotes')  # remove metadata that makes classification trivially easy
)

post_texts = newsgroups_data.data
category_labels = newsgroups_data.target
category_names = newsgroups_data.target_names

print(f"Total documents: {len(post_texts)}")
print(f"Categories: {category_names}")
print()

# Split into training and test sets — 80/20 is a solid default
# stratify=category_labels ensures each class has proportional representation in both sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    post_texts, category_labels,
    test_size=0.2,
    random_state=42,
    stratify=category_labels
)

# --- PIPELINE 1: Naive Bayes ---
# Pipeline chains steps so the same transformations apply consistently to train and test
# This is the production-safe pattern — no data leakage possible
naive_bayes_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)  # include both single words AND two-word phrases
    )),
    ('naive_bayes_classifier', MultinomialNB(alpha=0.1))  # alpha is the smoothing parameter
])

naive_bayes_pipeline.fit(train_texts, train_labels)
nb_predictions = naive_bayes_pipeline.predict(test_texts)
nb_accuracy = accuracy_score(test_labels, nb_predictions)

print(f"=== Naive Bayes Results ===")
print(f"Accuracy: {nb_accuracy:.3f}")
print(classification_report(test_labels, nb_predictions, target_names=category_names))

# --- PIPELINE 2: Logistic Regression ---
logistic_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)
    )),
    ('logistic_classifier', LogisticRegression(
        max_iter=1000,      # increase from default 100 to ensure convergence
        C=1.0,              # regularisation strength — lower C = more regularisation
        solver='lbfgs'      # multinomial (softmax) loss is used automatically for multi-class targets
    ))
])

logistic_pipeline.fit(train_texts, train_labels)
lr_predictions = logistic_pipeline.predict(test_texts)
lr_accuracy = accuracy_score(test_labels, lr_predictions)

print(f"\n=== Logistic Regression Results ===")
print(f"Accuracy: {lr_accuracy:.3f}")
print(classification_report(test_labels, lr_predictions, target_names=category_names))

# --- Making predictions on new, unseen text ---
new_posts = [
    "The astronauts launched successfully to the International Space Station",
    "The goalie made an incredible save in overtime to win the championship",
    "The senate voted on the second amendment legislation today"
]

print("\n=== Predictions on new posts ===")
new_predictions = logistic_pipeline.predict(new_posts)
new_probabilities = logistic_pipeline.predict_proba(new_posts)

for post, prediction, probabilities in zip(new_posts, new_predictions, new_probabilities):
    predicted_category = category_names[prediction]
    confidence = max(probabilities)
    print(f"Post: '{post[:55]}...'")
    print(f"  Predicted: {predicted_category} (confidence: {confidence:.1%})")
    print()
Output
Loading 20 Newsgroups dataset...
Total documents: 2802
Categories: ['rec.sport.hockey', 'sci.space', 'talk.politics.guns']
=== Naive Bayes Results ===
Accuracy: 0.921
precision recall f1-score support
rec.sport.hockey 0.97 0.96 0.96 187
sci.space 0.89 0.94 0.91 198
talk.politics.guns 0.91 0.87 0.89 176
accuracy 0.92 561
macro avg 0.92 0.92 0.92 561
weighted avg 0.92 0.92 0.92 561
=== Logistic Regression Results ===
Accuracy: 0.934
precision recall f1-score support
rec.sport.hockey 0.97 0.97 0.97 187
sci.space 0.93 0.94 0.93 198
talk.politics.guns 0.91 0.89 0.90 176
accuracy 0.93 561
macro avg 0.93 0.93 0.93 561
weighted avg 0.93 0.93 0.93 561
=== Predictions on new posts ===
Post: 'The astronauts launched successfully to the Internati...'
Predicted: sci.space (confidence: 97.2%)
Post: 'The goalie made an incredible save in overtime to win...'
Predicted: rec.sport.hockey (confidence: 98.6%)
Post: 'The senate voted on the second amendment legislation ...'
Predicted: talk.politics.guns (confidence: 89.1%)
Interview Gold: Why use Pipeline instead of separate steps?
Pipeline prevents data leakage in cross-validation. If you vectorize first then cross-validate, vocabulary from the test fold bleeds into training. Pipeline ensures fit() only ever sees training data — a subtle but critical correctness guarantee.
Production Insight
Naive Bayes tends to produce extreme probabilities (close to 0 or 1) even when wrong.
In production, if you need reliable confidence scores, either switch to Logistic Regression or calibrate the Naive Bayes output (for example with isotonic regression).
Key Takeaway
Naive Bayes is fast, logistic regression is calibrated.
Pick naive Bayes for baselines, logistic regression for production decisions.
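If you want to keep Naive Bayes but need more trustworthy probabilities, scikit-learn's `CalibratedClassifierCV` can wrap it. The sketch below is illustrative only: the twelve reviews are invented, and isotonic calibration generally needs far more data than this to be meaningful. It shows the wiring, not a realistic result.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

positive_reviews = [
    "great product love it",
    "works great very happy",
    "love the quality",
    "excellent value great buy",
    "happy with this purchase",
    "excellent product works well",
]
negative_reviews = [
    "terrible product broke fast",
    "waste of money terrible",
    "broke after one day",
    "poor quality very disappointed",
    "disappointed waste of time",
    "poor value broke quickly",
]
texts = positive_reviews + negative_reviews
labels = ["positive"] * 6 + ["negative"] * 6

# The calibrator sits inside the pipeline, so it only ever sees
# features produced by a vectorizer fitted on the training texts
calibrated_nb = make_pipeline(
    TfidfVectorizer(),
    CalibratedClassifierCV(MultinomialNB(), method="isotonic", cv=3),
)
calibrated_nb.fit(texts, labels)

probabilities = calibrated_nb.predict_proba(["great quality but broke quickly"])
print(calibrated_nb.classes_, probabilities.round(2))
```

On small datasets, `method="sigmoid"` is usually the safer choice because isotonic regression can overfit with few calibration points.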

When TF-IDF Isn't Enough: Upgrading to Sentence Transformers

TF-IDF is powerful, but it's blind to meaning. The sentences 'The car broke down' and 'My vehicle stopped working' use completely different words, so TF-IDF treats them as unrelated. But semantically, they mean the same thing. For a customer support classifier that needs to route 'vehicle stopped working' to the auto-repair team, that blindness is a real problem.
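The blindness is easy to demonstrate with cosine similarity. In this sketch the vectorizer is fitted on the three sentences themselves, purely so their vectors can be compared; the third sentence is an added example that shares words with the first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The car broke down",
    "My vehicle stopped working",
    "The car stopped working",
]

vectors = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(vectors)

paraphrase_score = similarity[0, 1]  # same meaning, no shared words
overlap_score = similarity[0, 2]     # shares 'the' and 'car'

print(f"'car broke down' vs 'vehicle stopped working': {paraphrase_score:.2f}")
print(f"'car broke down' vs 'car stopped working':     {overlap_score:.2f}")
```

The paraphrase pair scores exactly zero because the two sentences have no tokens in common, while the pair with surface overlap scores above zero.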

Sentence transformers solve this by converting an entire sentence into a dense vector (an embedding) where similar meanings produce similar vectors. They're pre-trained on massive text corpora, so they already understand that 'car' and 'vehicle' live in the same neighbourhood of meaning. You're essentially downloading years of language learning and plugging it into your classifier.

The tradeoff? Speed and resource cost. TF-IDF vectorisation takes milliseconds; generating sentence embeddings on CPU can take seconds per batch. For most production systems processing thousands of requests per minute, you'll need a GPU or a caching layer.

The pattern here is simple: start with TF-IDF + Logistic Regression as your baseline. If accuracy plateaus and you have labelled data, upgrade to sentence embeddings. You'll almost always see a meaningful jump, especially on short texts or paraphrase-heavy data.

sentence_transformer_classifier.py (Python)
# Install first: pip install sentence-transformers scikit-learn
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Real-world scenario: classifying customer support tickets
# Notice how many rows have paraphrased meaning — TF-IDF would struggle here
support_tickets = [
    # Billing issues
    "I was charged twice for my subscription this month",
    "There's a duplicate payment on my credit card statement",
    "My invoice shows the wrong amount, please fix this",
    "You billed me for a plan I never signed up for",
    "I need a refund for the extra charge on my account",
    "The payment went through twice and I want my money back",
    # Technical issues
    "The app keeps crashing every time I open it",
    "Your software won't start on my Windows 11 laptop",
    "I'm getting a black screen when I launch the application",
    "The program freezes after about 30 seconds of use",
    "Cannot log into the platform, it just hangs on the loading screen",
    "The mobile app stopped working after the latest update",
    # Account access
    "I forgot my password and the reset email never arrived",
    "Locked out of my account after too many login attempts",
    "My account was suspended but I didn't violate any rules",
    "Can't access my profile, says my email is not recognised",
    "The two-factor authentication code isn't working for me",
    "I need to recover access to my account urgently"
]

ticket_categories = (
    ["billing"] * 6 +
    ["technical"] * 6 +
    ["account_access"] * 6
)

# Load a lightweight, fast model — good balance of speed and quality
# 'all-MiniLM-L6-v2' produces 384-dimensional embeddings and runs in ~50ms per sentence on CPU
print("Loading sentence transformer model (downloads ~80MB on first run)...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode all tickets into dense vector representations
# Each ticket becomes a 384-dimensional vector — semantically similar tickets will cluster together
print("Generating sentence embeddings...")
ticket_embeddings = embedding_model.encode(
    support_tickets,
    show_progress_bar=True,
    batch_size=16  # process in batches to manage memory
)

print(f"\nEmbedding shape: {ticket_embeddings.shape}")
print(f"Each ticket is now a vector of {ticket_embeddings.shape[1]} numbers")

# Split data — stratify ensures all 3 classes appear in both sets
train_embeddings, test_embeddings, train_labels, test_labels = train_test_split(
    ticket_embeddings, ticket_categories,
    test_size=0.33,
    random_state=42,
    stratify=ticket_categories
)

# Logistic Regression works beautifully on top of embeddings
# The embeddings do the heavy lifting; LR just learns the decision boundary
classifier = LogisticRegression(max_iter=1000, C=1.0)
classifier.fit(train_embeddings, train_labels)

test_predictions = classifier.predict(test_embeddings)
print("\n=== Classification Report ===")
print(classification_report(test_labels, test_predictions))

# --- The real power: paraphrase robustness ---
# These sentences use completely different words from the training data
unseen_tickets = [
    "I've been double-billed and demand an immediate reimbursement",   # billing
    "The desktop client is unresponsive and will not open at all",      # technical
    "My login credentials are no longer being accepted by the system"   # account_access
]

print("\n=== Paraphrase Robustness Test ===")
unseen_embeddings = embedding_model.encode(unseen_tickets)
unseen_predictions = classifier.predict(unseen_embeddings)
unseen_probabilities = classifier.predict_proba(unseen_embeddings)

for ticket, prediction, probs in zip(unseen_tickets, unseen_predictions, unseen_probabilities):
    confidence = max(probs)
    print(f"Ticket:    '{ticket}'")
    print(f"Predicted: {prediction} ({confidence:.1%} confidence)")
    print()
Output
Loading sentence transformer model (downloads ~80MB on first run)...
Generating sentence embeddings...
Batches: 100%|████████████| 2/2 [00:01<00:00, 1.43it/s]
Embedding shape: (18, 384)
Each ticket is now a vector of 384 numbers
=== Classification Report ===
precision recall f1-score support
account_access 1.00 1.00 1.00 2
billing 1.00 1.00 1.00 2
technical 1.00 1.00 1.00 2
accuracy 1.00 6
=== Paraphrase Robustness Test ===
Ticket: 'I've been double-billed and demand an immediate reimbursement'
Predicted: billing (96.3% confidence)
Ticket: 'The desktop client is unresponsive and will not open at all'
Predicted: technical (94.7% confidence)
Ticket: 'My login credentials are no longer being accepted by the system'
Predicted: account_access (91.2% confidence)
Watch Out: Small datasets + sentence transformers = misleading accuracy
Sentence transformers shine with 500+ examples per class. On tiny datasets the classifier on top can overfit just as badly as any other model: the embeddings are good, but it still needs enough data to learn the decision boundary. The perfect scores above come from a 6-example test set, so treat them as a smoke test and always check performance on a properly sized held-out set.
Production Insight
Running sentence transformers on CPU at scale will kill your latency budget.
Cache embeddings by input text with a TTL of a few hours, and offload encoding to a GPU microservice if possible.
Key Takeaway
Sentence transformers understand meaning, not just word overlap.
They cost compute — use them only when TF-IDF plateaus and you have the infrastructure.
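The caching advice above could look roughly like the following in-memory sketch. `EmbeddingCache` is a hypothetical helper, not part of sentence-transformers, and the `encode_fn` passed in here is a trivial stand-in so the example runs without downloading a model. In practice it might wrap something like `lambda text: embedding_model.encode([text])[0]`.

```python
import hashlib
import time


class EmbeddingCache:
    """In-memory cache keyed by a hash of the input text, with a time-to-live."""

    def __init__(self, encode_fn, ttl_seconds=3 * 3600, clock=time.monotonic):
        self._encode_fn = encode_fn
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}             # key -> (expiry_time, embedding)
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: skip the expensive encode call
        self.misses += 1
        embedding = self._encode_fn(text)
        self._store[key] = (now + self._ttl, embedding)
        return embedding


# Stand-in encoder for illustration only
def fake_encode(text):
    return [float(len(text))]


cache = EmbeddingCache(fake_encode)
cache.get("my invoice is wrong")
cache.get("my invoice is wrong")
print("Encoder calls:", cache.misses)
```

This version never evicts entries, so a production deployment would more likely use a bounded store such as Redis or an LRU cache with the same key scheme.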

Evaluating Your Classifier Honestly: Beyond Raw Accuracy

Raw accuracy is one of the most misleading metrics in machine learning. If 95% of your emails are legitimate and 5% are spam, a model that always predicts 'not spam' achieves 95% accuracy — and catches zero spam. This is called the accuracy paradox, and it's the #1 way data scientists mislead themselves and their stakeholders.

The three metrics that actually matter are precision, recall, and F1 score. Precision answers: 'Of all the emails I labelled as spam, what fraction actually were spam?' High precision means few false alarms. Recall answers: 'Of all the actual spam emails, how many did I catch?' High recall means few things slip through. F1 score is the harmonic mean of both — it punishes you if either one is low.
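The accuracy paradox from the spam example can be reproduced in a few lines. The labels below are synthetic: 5 spam and 95 legitimate emails, scored against a model that always predicts 'not spam'.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1] * 5 + [0] * 95   # 1 = spam, 0 = legitimate
y_pred = [0] * 100            # a "model" that never flags anything

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions exist
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")
```

Accuracy comes out at 0.95 while recall and F1 are both 0.00, which is why the per-class report matters more than the headline number.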

Which one to optimise depends entirely on the business cost of each error type. In medical diagnosis, you optimise recall — missing a real cancer (false negative) is catastrophic. In email spam filtering, you optimise precision — flagging important emails as spam (false positive) destroys trust. Always have this conversation before picking your metric.
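Once the business has picked a side, the decision threshold is the main lever. The sketch below uses `precision_recall_curve` to find the highest threshold that still meets a target recall. It runs on synthetic numeric features from `make_classification` rather than text to stay self-contained, the 0.90 target is a hypothetical requirement, and in practice the threshold should be chosen on a validation set, not the test set used here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 10% positives
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, scores)

target_recall = 0.90  # hypothetical business requirement
# precision and recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= target_recall
chosen_threshold = thresholds[meets_target].max()

y_pred = (scores >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_test, y_pred)
achieved_precision = precision_score(y_test, y_pred, zero_division=0)

print(f"Threshold: {chosen_threshold:.3f}")
print(f"Recall:    {achieved_recall:.3f}")
print(f"Precision: {achieved_precision:.3f}")
```

Optimising for precision works the same way with the roles swapped: pick the lowest threshold whose precision meets the target.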

The confusion matrix shows every combination of true and predicted label at once (the four outcomes in the binary case) and should be the first thing you generate after training.

evaluate_classifier.py (Python)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    average_precision_score
)
from sklearn.model_selection import train_test_split, cross_val_score

# Using medical vs non-medical newsgroups to simulate a high-stakes classification scenario
medical_categories = ['sci.med', 'sci.space', 'rec.sport.hockey']

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=medical_categories,
    remove=('headers', 'footers', 'quotes')
)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    newsgroups.data,
    newsgroups.target,
    test_size=0.2,
    random_state=42,
    stratify=newsgroups.target
)

classification_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=15000, stop_words='english', ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000, C=0.5))
])

classification_pipeline.fit(train_texts, train_labels)
test_predictions = classification_pipeline.predict(test_texts)
test_probabilities = classification_pipeline.predict_proba(test_texts)

category_names = newsgroups.target_names

# --- 1. Full Classification Report ---
print("=== Full Classification Report ===")
print(classification_report(test_labels, test_predictions, target_names=category_names))

# --- 2. Confusion Matrix ---
cm = confusion_matrix(test_labels, test_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',                     # show integer counts, not scientific notation
    cmap='Blues',
    xticklabels=category_names,
    yticklabels=category_names
)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix — Text Classifier')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print("Confusion matrix saved to confusion_matrix.png")

# --- 3. Cross-validation for robust accuracy estimate ---
# Single train/test split can get lucky or unlucky
# 5-fold CV gives you mean +/- std — a much more honest picture
cv_scores = cross_val_score(
    classification_pipeline,
    newsgroups.data,
    newsgroups.target,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1  # use all available CPU cores
)

print(f"\n=== 5-Fold Cross-Validation ===")
print(f"F1 Macro scores: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# --- 4. Identify the model's worst confusions ---
print("\n=== Top Misclassifications (first 5) ===")
test_texts_array = np.array(test_texts)
incorrect_mask = test_predictions != test_labels
incorrect_texts = test_texts_array[incorrect_mask]
incorrect_true = np.array(test_labels)[incorrect_mask]
incorrect_predicted = np.array(test_predictions)[incorrect_mask]

for i in range(min(3, len(incorrect_texts))):
    true_category = category_names[incorrect_true[i]]
    predicted_category = category_names[incorrect_predicted[i]]
    snippet = incorrect_texts[i][:100].replace('\n', ' ')
    print(f"\nTrue: {true_category} | Predicted: {predicted_category}")
    print(f"Text: '{snippet}...'")
Output
=== Full Classification Report ===
                  precision    recall  f1-score   support

rec.sport.hockey       0.98      0.96      0.97       186
         sci.med       0.93      0.95      0.94       197
       sci.space       0.95      0.95      0.95       197

        accuracy                           0.95       580
       macro avg       0.95      0.95      0.95       580
    weighted avg       0.95      0.95      0.95       580
Confusion matrix saved to confusion_matrix.png
=== 5-Fold Cross-Validation ===
F1 Macro scores: [0.944 0.951 0.948 0.939 0.955]
Mean F1: 0.947 (+/- 0.012)
=== Top Misclassifications (first 3) ===
True: sci.med | Predicted: sci.space
Text: 'The radiation treatment protocol showed significant side effects in patients over 60...'
True: sci.space | Predicted: sci.med
Text: 'The biological experiments on board the station revealed unexpected cellular damage...'
True: sci.med | Predicted: sci.space
Text: 'Cosmic ray exposure during long duration missions presents a significant health risk...'
Interview Gold: Accuracy vs F1 — know this cold
When class distribution is balanced, accuracy and F1 macro will be close. When classes are imbalanced (which is almost always in production), they diverge dramatically. The misclassification examples above show something even more valuable — they reveal why the model gets confused, which tells you exactly what training data to collect next.
Production Insight
In production, accuracy is a vanity metric. Precision and recall tell you the cost of each mistake.
Track both per class, especially the minority class — that's where business impact hides.
Key Takeaway
Choose precision or recall based on business cost of false positives vs false negatives.
F1 balances both when neither error type clearly dominates. Never optimise accuracy alone.
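To see how far accuracy and macro F1 can drift apart, here is a minimal sketch in plain Python. The 95/5 split and the always-majority "model" are illustrative assumptions, not results from the newsgroups run above.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced set: 95 majority examples (0), 5 minority examples (1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_scores = [per_class_f1(y_true, y_pred, c) for c in (0, 1)]
macro_f1 = sum(f1_scores) / len(f1_scores)

print(f"Accuracy:          {accuracy:.2f}")
print(f"Macro F1:          {macro_f1:.2f}")
print(f"Minority-class F1: {f1_scores[1]:.2f}")
```

Accuracy looks healthy while macro F1 is roughly half of it, because the minority class contributes an F1 of zero to the average.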

Choosing the Right Model: Trade-offs and Deployment Considerations

By now you've seen three approaches: TF-IDF + Naive Bayes, TF-IDF + Logistic Regression, and sentence transformers + Logistic Regression. Which one should you actually put into production? That depends on your latency budget, data size, and interpretability needs.

If you need sub-millisecond inference on a CPU and your vocabulary is stable, TF-IDF + Logistic Regression is hard to beat. It's what most text classification systems in production use — simple, fast, and you can inspect the top coefficients to explain predictions.

If your text contains lots of paraphrasing or domain-specific jargon that changes over time, sentence transformers will give better accuracy but at a cost. A single embedding on CPU takes ~50ms. At 100 requests per second, that's 5 seconds of compute per second — you'll need a GPU or a caching layer.
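The arithmetic behind that claim can be sketched as a small capacity estimate. The function names and the cache hit rate parameter are hypothetical, and the model assumes single-threaded workers running at full utilisation with no headroom, which is optimistic for a real service.

```python
import math


def cpu_seconds_per_second(latency_ms, requests_per_second, cache_hit_rate=0.0):
    """Compute time demanded per wall-clock second, after cache hits are removed."""
    misses_per_second = requests_per_second * (1.0 - cache_hit_rate)
    return latency_ms * misses_per_second / 1000.0


def min_workers(latency_ms, requests_per_second, cache_hit_rate=0.0):
    """Smallest number of fully busy single-threaded workers that keeps up."""
    demand = cpu_seconds_per_second(latency_ms, requests_per_second, cache_hit_rate)
    return math.ceil(round(demand, 9))  # round first to absorb float noise


no_cache = min_workers(50, 100)
half_cached = min_workers(50, 100, cache_hit_rate=0.5)

print(f"Demand without cache: {cpu_seconds_per_second(50, 100):.1f} CPU-seconds per second")
print(f"Workers needed without cache: {no_cache}")
print(f"Workers needed with a 50% cache hit rate: {half_cached}")
```

Any demand above 1.0 CPU-second per second means a single worker falls behind, which is why a cache or an accelerator enters the picture at this request rate.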

Another option often overlooked is using a smaller, distilled sentence transformer model like 'all-MiniLM-L6-v2' (384 dimensions) instead of the full 'all-mpnet-base-v2' (768 dimensions). The smaller model is 4x faster with only a 1–2% accuracy drop on many benchmarks.

Finally, consider the deployment pattern: batch prediction vs real-time. Batch pipelines can afford sentence transformer inference on CPU; real-time APIs cannot without scaling.

model_comparison_benchmark.pyPYTHON
import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sentence_transformers import SentenceTransformer

sample_texts = [
    "This product is amazing and works perfectly",
    "Terrible quality, broke after two days",
    "Pretty good value for the price",
    "Worst purchase ever, completely useless",
    "Exceeded my expectations, will buy again"
] * 200  # 1000 texts for benchmarking

# ---- TF-IDF + Logistic Regression ----
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = vectorizer.fit_transform(sample_texts)
labels = np.random.randint(0, 2, len(sample_texts))
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_tfidf, labels)

# Times predict() only; TF-IDF vectorisation of the raw text is not included
start = time.perf_counter()
for _ in range(100):
    model_lr.predict(X_tfidf[:1])
print(f"TF-IDF + LR inference (1 sample): {(time.perf_counter()-start)/100*1000:.2f} ms")

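# ---- Sentence Transformer + LR ----
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode the 5 unique texts (this also warms up the model). Fitting on a
# single sample would fail: LogisticRegression needs at least 2 classes.
X_emb = embedder.encode(sample_texts[:5])
emb_labels = np.array([1, 0, 1, 0, 1])  # one sentiment label per unique text
model_emb = LogisticRegression(max_iter=1000)
model_emb.fit(X_emb, emb_labels)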

start = time.perf_counter()
for _ in range(100):
    emb = embedder.encode(sample_texts[:1])
    model_emb.predict(emb)
print(f"SentenceTransformer + LR inference (1 sample): {(time.perf_counter()-start)/100*1000:.2f} ms")
Output
TF-IDF + LR inference (1 sample): 0.34 ms
SentenceTransformer + LR inference (1 sample): 48.72 ms
Mental Model: The Cost-Performance Frontier
  • TF-IDF + Logistic Regression: 0.3ms inference, explainable weights, stable vocabulary
  • Sentence Transformers + LR: ~50ms CPU inference (MiniLM in the benchmark above), handles paraphrase, needs GPU at scale
  • Distilled models (MiniLM): several times faster than full-size models such as mpnet, with a 1-2% accuracy drop
  • Batch prediction: You can run sentence transformers on CPU overnight; real-time needs accelerator
  • Rule: Start with the simplest model that meets your SLA. Upgrade only when metrics plateau and business value justifies the infrastructure cost.
Production Insight
Don't fall for the 'more complex = better' trap. I've seen teams deploy BERT for a simple topic classifier when TF-IDF + LR got 96% F1.
The extra 1% wasn't worth the 200x latency increase and GPU cost. Measure the business impact difference before upgrading.
Key Takeaway
Simplicity wins in production unless the business case justifies complexity.
Benchmark your latency and accuracy requirements before you choose.
Which Model Should You Deploy?
IfLatency < 5ms, CPU only, stable vocabulary
UseTF-IDF + Logistic Regression
IfLatency < 50ms, can use GPU, need semantic understanding
UseSentence Transformers (MiniLM) + Logistic Regression
IfBatch processing, no real-time constraint, highest accuracy needed
UseFull sentence transformer (mpnet) + fine-tune on your data
IfInterpretability is critical (regulatory, explainability)
UseTF-IDF + Logistic Regression (inspect top coefficients per class)

Handle Class Imbalance Before It Sinks Your Model

Your first text classifier will probably suck. Not because the algorithm is bad, but because your data's lying to you. If 95% of your tickets are 'spam' and 5% are 'urgent escalation', a model that always predicts 'spam' gets 95% accuracy. That's a disaster in production.

You need to fix the imbalance before you train. Two battle-tested approaches: resample your training set, or tell the loss function to pay more attention to the minority class. Resampling has a nasty habit of overfitting if you're not careful — especially on small datasets where you're literally copying the same five urgent emails fifty times.

Weighted loss is my go-to for production text pipelines. It penalises the model more when it misclassifies the rare class, without duplicating data. Most scikit-learn classifiers support class_weight='balanced'. PyTorch's CrossEntropyLoss takes a weight tensor. Do this before you even look at a confusion matrix.

Flat accuracy on an imbalanced dataset is a vanity metric. If you don't weight your loss, you're deploying a liar.
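For intuition about what class_weight='balanced' does, here is a minimal sketch of the heuristic scikit-learn documents for it: n_samples / (n_classes * count_per_class). The helper name and the 95/5 label split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Hypothetical label distribution: 95% spam (0), 5% urgent (1)
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

Each mistake on the rare class then costs about 19 times as much as a mistake on the common one. The same values could be passed as an explicit class_weight dict in scikit-learn, or as the weight tensor (ordered by class index) for PyTorch's CrossEntropyLoss.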

WeightedLossForImbalancedText.pyPYTHON
# io.thecodeforge - ml-ai tutorial

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy data: 3 urgent vs 7 spam (30/70); real ticket streams are often far more skewed
texts = [
    "Get rich now!!!", "Claim your prize", "Limited offer",
    "Server down, production halted", "URGENT: security breach",
    "Win a free phone", "Click here for cash", "Exclusive deal",
    "Database corrupted, data at risk", "Congratulations you won"
]
labels = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 0])

# Vectorise — no magic here
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(texts)

# Train with class weighting — literally one parameter change
classifier = LogisticRegression(class_weight='balanced')
classifier.fit(X, labels)

# Predict on the training set for illustration only; in practice, evaluate on a held-out stratified split
predictions = classifier.predict(X)
print(classification_report(labels, predictions, target_names=['spam', 'urgent']))
Output
precision recall f1-score support
spam 1.00 0.86 0.92 7
urgent 0.60 1.00 0.75 3
accuracy 0.90 10
macro avg 0.80 0.93 0.84 10
weighted avg 0.88 0.90 0.87 10
Production Trap:
Don't evaluate on the data you trained on. Training-set scores are optimistic for any model, and class weighting makes minority-class recall look especially good there. Hold out a stratified test set from the start.
Key Takeaway
Always inspect your label distribution first. If it's worse than 80/20, use weighted loss instead of manual resampling.

Real-Time Inference Requires a Fixed Tokenizer Pipeline

Chances are you'll be retraining your classifier while a live API serves predictions. That's where people get burned: they retrain a TF-IDF vectoriser or an embedding model and suddenly every inference returns gibberish. The vectoriser's vocabulary changed. Tokeniser ids shifted. You just shipped a silent model corruption.

Fix: freeze your text preprocessing pipeline. Don't retrain the tokeniser with the classifier. Export the fitted vectoriser or tokeniser as a separate artifact, and load the exact same object at inference time. In scikit-learn, that means pickling the TfidfVectorizer after fit(), not calling fit_transform() again. For transformers, you save the tokeniser config and reload it from disk — never initialise a fresh one from the hub.

Why this matters in production: every tokeniser is stateful. TF-IDF stores a vocabulary mapping word to column index. Sentence transformers use a specific max_length and padding strategy. If you change any of that between training and serving, your model sees a completely different input space. The classifier predicts on garbage and you spend two hours debugging why recall dropped to 3%.

Freeze the pipeline. Serialise everything. Then you can iterate on the classifier without breaking your live service.
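One way to catch a silently changed vocabulary is to fingerprint the token-to-column mapping at training time and compare it when the service starts. This is a sketch under assumptions: the function name is hypothetical, and the small dicts stand in for a fitted vectoriser's vocabulary_ attribute.

```python
import hashlib
import json


def vocabulary_fingerprint(vocabulary):
    """Stable SHA-256 over a token -> column-index mapping."""
    # Sort so insertion order does not matter; int() also handles NumPy integers
    canonical = json.dumps(
        sorted((str(token), int(index)) for token, index in vocabulary.items()),
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical vocabularies standing in for vectoriser.vocabulary_
training_vocab = {"broken": 0, "cancelled": 1, "refund": 2}
serving_vocab_same = {"refund": 2, "broken": 0, "cancelled": 1}  # same mapping, different order
serving_vocab_refit = {"broken": 0, "cancelled": 1, "charged": 2, "refund": 3}  # refit on new data

expected = vocabulary_fingerprint(training_vocab)
print("Same artifact matches:", vocabulary_fingerprint(serving_vocab_same) == expected)
print("Refit vectoriser matches:", vocabulary_fingerprint(serving_vocab_refit) == expected)
```

In the refit case, 'refund' moved from column 2 to column 3, so a classifier trained against the old mapping would read the wrong feature. Storing the expected fingerprint next to the classifier lets the service refuse to start on a mismatch.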

FreezeTokenizerPipeline.pyPYTHON
# io.thecodeforge - ml-ai tutorial

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# --- Training time: fit vectoriser ONCE ---
training_texts = [
    "Order cancelled without reason",
    "Refund not processed",
    "Product arrived broken"
]
labels = [0, 0, 1]

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(training_texts)

classifier = LogisticRegression()
classifier.fit(X_train, labels)

# Save both as separate artifacts — never re-fit vectoriser
joblib.dump(vectoriser, 'vectoriser.pkl')
joblib.dump(classifier, 'classifier.pkl')

# --- Inference time: load frozen pipeline ---
loaded_vectoriser = joblib.load('vectoriser.pkl')
loaded_classifier = joblib.load('classifier.pkl')

new_ticket = ["Charged twice for subscription"]
# Transform uses EXACT same vocabulary — no drift
X_new = loaded_vectoriser.transform(new_ticket)
prediction = loaded_classifier.predict(X_new)
print(f"Prediction: {prediction[0]}")
Output
Prediction: 0
Senior Shortcut:
Wrap your preprocessing and model in a single serialised pipeline object using sklearn.pipeline.Pipeline. Then it's one pickle to load, zero chance of version mismatch between tokeniser and classifier.
Key Takeaway
Never refit your tokeniser or vectoriser independently of the classifier. Serialise the fitted pipeline and load it as a single versioned artifact at inference time; when the vocabulary has to change, refit and redeploy both together.
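The single-artifact approach can be sketched as follows. The ticket texts, labels, and file name are made up for illustration, and a temporary directory stands in for wherever your artifacts actually live.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

training_texts = [
    "Order cancelled without reason",
    "Refund not processed",
    "Product arrived broken",
    "Package arrived damaged",
]
training_labels = [0, 0, 1, 1]  # hypothetical: 0 = billing/order, 1 = damaged goods

ticket_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
ticket_pipeline.fit(training_texts, training_labels)

# One file holds both the fitted vocabulary and the classifier weights
artifact_path = os.path.join(tempfile.mkdtemp(), "ticket_pipeline.joblib")
joblib.dump(ticket_pipeline, artifact_path)

# Inference side: load once, then call predict() on raw text
served_pipeline = joblib.load(artifact_path)
new_ticket = ["Refund still not processed"]

same_vocab = (
    served_pipeline.named_steps["tfidf"].vocabulary_
    == ticket_pipeline.named_steps["tfidf"].vocabulary_
)
same_prediction = bool(
    (served_pipeline.predict(new_ticket) == ticket_pipeline.predict(new_ticket)).all()
)
print("Vocabulary preserved:", same_vocab)
print("Prediction preserved:", same_prediction)
```

Because the vectoriser travels inside the pipeline, there is no separate object to forget or refit by accident. Pinning the scikit-learn version between training and serving is still advisable, since pickled estimators are not guaranteed to load across versions.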

Export Models as ONNX for Sub-50ms Inference

A PyTorch or TensorFlow model in a Flask endpoint is slow. Like 200-400ms per prediction slow. That works for a prototype but fails hard when you're serving 100 requests per second and users expect instant responses. The bottleneck isn't the model math — it's the Python interpreter overhead and the framework's eager execution.

ONNX (Open Neural Network Exchange) addresses this by serialising your model as a static computation graph that ONNX Runtime can optimise and execute without the training framework. Your transformer model becomes a single file that can run on CPU in <50ms. The trade-off: Python-side dynamic behaviour is frozen at export time, and any dimension you don't declare in dynamic_axes is fixed to the shape of the example input, so inputs must be padded to match.

Why you should care: ONNX lets you run the same model on CPU without a GPU. That slashes your infrastructure cost. You can deploy to cheap inference servers instead of GPU instances. And you can swap the backend to ONNX Runtime without changing your application logic — it's a drop-in replacement.

Export is a one-liner with torch.onnx.export() or tf2onnx.convert. The hard part is aligning your tokeniser output shapes. Once it's compiled, you get speed and stability. No more wondering why inference takes half a second.

ExportTextClassifierToONNX.pyPYTHON
# io.thecodeforge - ml-ai tutorial

import torch
import torch.nn as nn
from transformers import DistilBertTokenizer, DistilBertModel

# Step 1: define a simple classifier on top of DistilBERT
class TextClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.classifier = nn.Linear(768, 2)  # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        return self.classifier(pooled)

model = TextClassifier()
model.eval()

# Step 2: dummy input with fixed sequence length (critical for ONNX)
tokeniser = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
dummy_text = "This is a test ticket"
encoded = tokeniser(dummy_text, return_tensors='pt', padding='max_length', truncation=True, max_length=128)

# Step 3: export to ONNX. Sequence length is fixed at 128 by the padded dummy
# input; only the batch dimension is declared dynamic.
torch.onnx.export(
    model,
    (encoded['input_ids'], encoded['attention_mask']),
    'ticket_classifier.onnx',
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch_size'},
        'attention_mask': {0: 'batch_size'},
        'logits': {0: 'batch_size'},
    }
)
print("ONNX model exported successfully — ready for CPU inference.")
Output
ONNX model exported successfully — ready for CPU inference.
Never Do This:
Don't set dynamic_axes for sequence length unless you absolutely need variable-length inputs. Static shapes give ONNX Runtime maximum optimisation. Pad your inputs to a fixed length at the tokeniser level and export with fixed shapes.
Key Takeaway
Export your text classifier to ONNX before deploying to production CPU servers. It removes framework overhead and can cut inference time by 5-10x; benchmark on your own hardware to confirm.
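Fixed input shapes mean every request has to be padded or truncated to the exported sequence length. A Hugging Face tokeniser does this with padding='max_length' and truncation=True, as in the export script above; the plain-Python sketch below only illustrates the idea. The function name and token ids are hypothetical, and unlike a real tokeniser it does not preserve the trailing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    attention_mask = [1] * len(kept) + [0] * padding  # 0 marks padding positions
    return input_ids, attention_mask


# Illustrative token ids; real ids come from the tokeniser
short_ids, short_mask = pad_or_truncate([101, 2023, 3231, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 21)), max_length=8)

print("short ids: ", short_ids)
print("short mask:", short_mask)
print("long ids:  ", long_ids)
```

The attention mask matters as much as the ids: it tells the model which positions are padding so they do not influence the [CLS] representation.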

Introduction

Text classification is the backbone of modern information retrieval, spam detection, and content moderation systems. At its core, it assigns predefined categories to unstructured text, enabling machines to organize, filter, and act on human language at scale. The field has evolved from hand-crafted rules to deep learning models that grasp context and nuance.

This article bridges foundational vectorization techniques with production-grade deployment challenges, emphasizing the practical trade-offs engineers face daily. Understanding why text classification fails, especially on out-of-vocabulary tokens, reveals the limits of fixed-vocabulary vectorizers and why subword-aware representations like sentence transformers matter.

You will learn how to move beyond raw accuracy metrics, handle class imbalance before it sinks your model, and export classifiers as ONNX for sub-50ms inference. The journey starts with clear objectives: converting words into numbers, training baseline classifiers, and iterating toward robust, low-latency pipelines. Each technique builds the intuition needed to diagnose failure modes and choose the right model for your deployment context.

intro_classification.pyPYTHON
# io.thecodeforge - ml-ai tutorial
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam offer", "normal text", "win money now"]
y = [1, 0, 1]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Shape:", X.shape)
# X is a sparse document-term matrix: one row per document, one column per vocabulary word
Output
Vocabulary: ['money' 'normal' 'now' 'offer' 'spam' 'text' 'win']
Shape: (3, 7)
Production Trap:
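CountVectorizer silently drops tokens it did not see during fit, so unseen words contribute nothing at inference and nothing is logged. It has no OOV-token option: monitor the OOV rate yourself, or use character n-grams (analyzer='char_wb') or subword tokenizers.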
Key Takeaway
Start with CountVectorizer to turn text into word counts, then move to TF-IDF to weight distinctive words, or to embeddings to capture semantic meaning beyond raw frequency.
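The silent drop is easy to reproduce. This sketch reuses the three toy documents from the example above; the test sentences and the n-gram range are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam offer", "normal text", "win money now"]
vectorizer = CountVectorizer()
vectorizer.fit(docs)

# "crypto" and "wallet" were never seen during fit; "offer" was
unseen_only = vectorizer.transform(["crypto wallet"])
partly_seen = vectorizer.transform(["crypto wallet offer"])

print("Non-zero features, all-OOV text:  ", unseen_only.nnz)
print("Non-zero features, one known word:", partly_seen.nnz)

# Character n-grams inside word boundaries give partial overlap for unseen word forms
char_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
char_vectorizer.fit(docs)
char_features = char_vectorizer.transform(["spammy offers"])
print("Non-zero char n-gram features for unseen word forms:", char_features.nnz)
```

The all-OOV text produces an empty row and no warning, so the classifier falls back to its prior. Character n-grams recover some signal from 'spammy' and 'offers' because they share fragments with 'spam' and 'offer', at the cost of a much larger feature space.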

Definition and Objectives

Text classification assigns a label from a fixed set of categories to a piece of text, such as an email, review, or tweet. The primary objective is to build a model that generalizes beyond the training data, correctly classifying unseen examples with high precision and recall. Objectives include minimizing latency for real-time systems, handling imbalanced class distributions, and ensuring interpretability for regulated domains. For sentiment analysis—a specific text classification task—the goal shifts to detecting emotional polarity: positive, negative, or neutral. Objectives expand to capturing nuanced sentiments like sarcasm, mixed emotions, and context-dependent tone. Both tasks share a common pipeline: tokenization, vectorization, model training, and evaluation. Engineers must define success metrics early: F1 score for imbalanced datasets, area under the ROC curve for ranking, or inference time for edge deployments. The ultimate objective is a system that behaves predictably in production, not just on a held-out test set. This requires freezing the tokenizer pipeline, versioning models, and monitoring drift—principles that separate hobby projects from enterprise-grade solutions.

objectives_demo.pyPYTHON
# io.thecodeforge - ml-ai tutorial
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0]
print(f"F1 Score: {f1_score(y_true, y_pred):.2f}")
# F1 balances precision and recall, crucial for imbalanced sentiment data
print("Objective: achieve F1 > 0.85 on production holdout")
Output
F1 Score: 0.67
Objective: achieve F1 > 0.85 on production holdout
Key Insight:
Accuracy is a poor metric when 90% of reviews are positive. F1 score reveals how well your model handles the minority class.
Key Takeaway
Align objectives with business goals: use F1 for imbalanced sentiment, latency for real-time systems, and interpretability for compliance.

Comprehending Sentiment Analysis Types

Sentiment analysis is not a monolithic task; it spans multiple granularity levels. Document-level analysis assigns a single sentiment to an entire text, such as a movie review. Sentence-level analysis breaks text into units to capture conflicting opinions within one review. Aspect-based sentiment analysis identifies sentiment toward specific entities or features—for example, “battery life is great but screen is dim” yields positive for battery and negative for screen. Fine-grained sentiment extends beyond polarity to intensity scales (very negative, somewhat negative, neutral, somewhat positive, very positive). Emotion detection, a related but distinct type, categorizes text into anger, joy, sadness, fear, surprise, or disgust. Each type demands different labeling strategies, model architectures, and evaluation protocols. For instance, aspect-based models often require a two-stage pipeline: extract aspects, then classify sentiment per aspect. Understanding these distinctions helps engineers select the right approach for their data and avoid overgeneralizing results. A classifier that excels on document-level balanced data may fail catastrophically on aspect-based multi-label scenarios with overlapping sentiments.

sentiment_types.pyPYTHON
# io.thecodeforge - ml-ai tutorial
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
text = "I loved the plot but hated the ending"
result = classifier(text)
print(result)  # Document-level: negative due to ending
# Aspect-based would split: plot=positive, ending=negative
Output
[{'label': 'NEGATIVE', 'score': 0.998}]
Production Trap:
Pipeline sentiment analysis loses aspect-level nuances. For reviews, always segment by aspect before classification to avoid misleading aggregated scores.
Key Takeaway
Match the granularity of sentiment analysis to your use case: document-level for ratings, aspect-based for product feedback, emotion detection for social media engagement.
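The two-stage idea (find the aspect, then classify sentiment for that aspect) can be illustrated with a deliberately naive sketch. Everything here is a toy assumption: the lexicon, the aspect list, and the clause splitting on contrast words. A production system would use trained models for both stages.

```python
import re

# Toy lexicon and aspect list, invented for this sketch
POSITIVE = {"great", "loved", "excellent"}
NEGATIVE = {"dim", "hated", "poor"}
ASPECTS = ["battery life", "screen", "plot", "ending"]


def split_clauses(text):
    """Split on contrastive conjunctions so each clause carries one opinion."""
    parts = re.split(r"\bbut\b|\bhowever\b|;", text.lower())
    return [part.strip() for part in parts if part.strip()]


def find_aspect(clause):
    """Stage 1: first known aspect mentioned in the clause, else None."""
    return next((aspect for aspect in ASPECTS if aspect in clause), None)


def clause_polarity(clause):
    """Stage 2: lexicon vote over the words in the clause."""
    words = set(re.findall(r"[a-z']+", clause))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def aspect_sentiments(text):
    results = {}
    for clause in split_clauses(text):
        aspect = find_aspect(clause)
        if aspect is not None:
            results[aspect] = clause_polarity(clause)
    return results


print(aspect_sentiments("Battery life is great but screen is dim"))
print(aspect_sentiments("I loved the plot but hated the ending"))
```

Even this crude version keeps both opinions in the review, where a single document-level label has to collapse them into one.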
● Production incident · POST-MORTEM · severity: high

The Deployed Spam Filter That Stopped Catching Cryptocurrency Emails

Symptom
Users started reporting spam in their inboxes. The classification report showed recall for 'spam' dropping from 0.94 to 0.51 over 14 days.
Assumption
The team assumed the TF-IDF vectorizer's vocabulary from training data was sufficient. They had used max_features=5000 on a 2019 email dataset.
Root cause
The training data contained zero emails mentioning cryptocurrency. When new spam arrived with words like 'crypto', 'blockchain', 'wallet', the vectorizer simply ignored them — they were out-of-vocabulary and dropped. The model no longer had any signal to distinguish spam from ham.
Fix
1. Collect a representative sample of new spam (200 emails per new topic) and retrain with expanded vocabulary.
2. Set up a vocabulary drift monitor: track what fraction of tokens in production emails are OOV each day.
3. Use a word embedding model (e.g., FastText) instead of TF-IDF to handle unseen words via subword information.
Key lesson
  • Never assume training vocabulary covers production vocabulary — monitor OOV rate as a key performance indicator.
  • On high-traffic systems, set up an automated retraining pipeline that triggers when OOV rate exceeds 5%.
  • For domains where new terminology emerges (tech, finance, medicine), prefer subword-aware embeddings over fixed-vocabulary vectorizers.
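A daily OOV monitor along the lines of these lessons could look like the sketch below. The tokeniser is an approximation of scikit-learn's default token pattern, and the vocabulary, batch, and threshold constant are hypothetical stand-ins for vectorizer.vocabulary_ and a real production sample.

```python
import re


def tokenize(text):
    """Lowercase word tokens of 2+ characters (approximates scikit-learn's default)."""
    return re.findall(r"\b\w\w+\b", text.lower())


def oov_rate(texts, vocabulary):
    """Fraction of tokens in `texts` that are missing from `vocabulary`."""
    tokens = [token for text in texts for token in tokenize(text)]
    if not tokens:
        return 0.0
    unseen = sum(token not in vocabulary for token in tokens)
    return unseen / len(tokens)


# Hypothetical training vocabulary and production sample
training_vocabulary = {"free", "prize", "claim", "your", "now", "win", "offer"}
production_batch = [
    "Claim your free crypto wallet now",
    "Win blockchain prize",
]
OOV_ALERT_THRESHOLD = 0.05  # the 5% trigger suggested above

rate = oov_rate(production_batch, training_vocabulary)
print(f"OOV rate: {rate:.1%}")
if rate > OOV_ALERT_THRESHOLD:
    print("OOV rate above threshold: schedule retraining")
```

Tracking this number per day, rather than per request, smooths out noise from individual odd messages and makes a slow climb visible before recall collapses.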
Production debug guide
Symptom → Action: practical steps to isolate and fix common production issues (4 entries)
Symptom · 01
Model predicts only one class for all inputs
Fix
Check class balance in training data. If severely imbalanced, enable class_weight='balanced' in the classifier or use oversampling. Also verify that the vectorizer is not dropping all meaningful tokens due to wrong stop_words setting.
Symptom · 02
High accuracy on test set but poor on live data
Fix
Compare vocabulary overlap between train and production. Run vectorizer.transform() on a batch of production samples and examine the resulting sparse matrix — if most rows are all zeros, you have vocabulary drift.
Symptom · 03
Confidence scores are too high (near 1.0) even for wrong predictions
Fix
Logistic Regression may be overfit. Increase regularisation (lower C) or use Platt scaling for calibration. For Naive Bayes, check whether any feature has a zero count in a class; additive smoothing (alpha) prevents the resulting zero probabilities.
Symptom · 04
Inference latency spikes under load
Fix
If using sentence transformers, cache embeddings per unique text with a TTL. For TF-IDF, ensure the vectorizer uses a sparse matrix format and avoid converting to dense arrays before prediction.
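For the embedding cache mentioned under Symptom 04, a minimal in-process sketch might look like this. The class name is hypothetical, fake_embed stands in for a real embedder.encode() call, and the injected clock exists only so expiry can be demonstrated without sleeping.

```python
import time


class TTLEmbeddingCache:
    """Caches one embedding per unique text until its time-to-live expires."""

    def __init__(self, embed_fn, ttl_seconds, clock=time.monotonic):
        self._embed_fn = embed_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # text -> (expiry_time, embedding)
        self.misses = 0

    def get(self, text):
        now = self._clock()
        entry = self._store.get(text)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: skip the expensive encoder call
        self.misses += 1
        embedding = self._embed_fn(text)
        self._store[text] = (now + self._ttl, embedding)
        return embedding


def fake_embed(text):
    """Stand-in for embedder.encode(); a real encoder returns a dense vector."""
    return [float(len(text)), float(text.count(" "))]


fake_now = [0.0]
cache = TTLEmbeddingCache(fake_embed, ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("refund not processed")  # miss: encoder runs
cache.get("refund not processed")  # hit: served from cache
fake_now[0] = 61.0
cache.get("refund not processed")  # entry expired: encoder runs again
print("Encoder calls:", cache.misses)
```

This version never evicts expired entries and is not thread-safe, so a real service would add a size bound and locking, or use an external cache.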
★ Quick Debug Cheat Sheet for Text Classifiers
Run these commands and checks when your text classifier misbehaves in production
Model misclassifies new data with unseen words
Immediate action
Check OOV rate: count tokens in production text that are not in vectorizer.vocabulary_
Commands
./check_oov.py --vectorizer tfidf.pkl --samples production_batch.txt
python -c "import pickle; v=pickle.load(open('tfidf.pkl','rb')); print(len(v.vocabulary_))"
Fix now
Retrain vectorizer with max_features raised to 20000 on a combined dataset of old + new samples
Model predicts same class for all instances
Immediate action
Display class distribution in latest batch of predictions
Commands
python -c "import numpy as np; preds=np.load('preds.npy'); print(np.bincount(preds))"
Check training labels: print(label_encoder.classes_) and count per class
Fix now
Add class_weight='balanced' to classifier and re-train with shuffled data
Prediction confidence not matching actual accuracy
Immediate action
Generate reliability diagram: bin predictions by confidence and compute accuracy per bin
Commands
from sklearn.calibration import calibration_curve; frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
Use `predict_proba` and check histogram of max probabilities
Fix now
Apply Platt scaling with CalibratedClassifierCV(method='sigmoid'), fitted on a held-out validation set
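The reliability check described above (bin predictions by confidence, then compare mean confidence with accuracy in each bin) can be sketched without any plotting. The confidence values and correctness flags below are made up for illustration.

```python
def reliability_table(confidences, correct, n_bins=5):
    """Return (bin_low, bin_high, mean_confidence, accuracy, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(members)))
    return rows


# Hypothetical max predicted probabilities and whether each prediction was right
confidences = [0.95, 0.97, 0.99, 0.92, 0.65, 0.70, 0.55, 0.58]
correct = [True, False, True, False, True, True, False, True]

rows = reliability_table(confidences, correct)
for low, high, mean_conf, accuracy, count in rows:
    print(f"[{low:.1f}, {high:.1f})  confidence={mean_conf:.2f}  accuracy={accuracy:.2f}  n={count}")
```

A well-calibrated model has confidence close to accuracy in every bin. In this toy data the top bin is confident but right only half the time, which is the overconfidence pattern that calibration is meant to correct.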
Aspect | TF-IDF + Logistic Regression | Sentence Transformers + LR
Training speed | Very fast (seconds) | Slow if fine-tuning (minutes–hours)
Inference speed | < 1ms per document | 50–500ms per document (CPU)
Handles paraphrases | No (word overlap only) | Yes (semantic similarity)
Data requirement | Works well from ~500 examples | Needs 500+ per class for good boundaries
Interpretability | High (inspect word weights directly) | Low (embedding space is opaque)
Memory footprint | Sparse matrix, very light | 384–768 dimension dense vectors
Best for | High-volume, structured text, baseline | Short text, paraphrase-heavy, quality matters
GPU required | No | Recommended for production throughput
Multilingual support | With separate models per language | Single model covers 50+ languages

Key takeaways

1. TF-IDF turns words into numbers by rewarding distinctiveness, not frequency: stop words get low scores because they appear everywhere and carry no signal.
2. Always wrap your vectorizer and classifier in a sklearn Pipeline: it's not just convenience, it's the only way to guarantee no data leakage during cross-validation.
3. Optimise precision when false positives are costly (spam filters, content moderation), and optimise recall when false negatives are costly (medical screening, fraud detection): this decision should come before model selection.
4. Sentence transformers are the upgrade path when TF-IDF accuracy plateaus: they understand meaning, not just word overlap, making them dramatically better for short text and paraphrase-heavy domains.
5. Monitor out-of-vocabulary rate in production: if it climbs above 5%, your model is slowly going blind to new words. Automate retraining.

Common mistakes to avoid

4 patterns

Fitting the vectorizer on the full dataset before splitting

Symptom
Suspiciously high accuracy that collapses when you deploy — because test data leaked into training vocabulary.
Fix
Always split first, then fit_transform on train only, transform on test. Use sklearn Pipeline to make this impossible to get wrong.

Using accuracy as the only metric on imbalanced classes

Symptom
Model reports 95% accuracy but never catches the minority class at all — the accuracy paradox.
Fix
Always report precision, recall, and F1 per class. Add class_weight='balanced' to LogisticRegression if one class has less than 30% representation.

Not removing metadata when using benchmark datasets

Symptom
Model achieves near-perfect accuracy during dev but fails on real data — because it learned to read email headers, not the text content.
Fix
When using fetch_20newsgroups, always pass remove=('headers', 'footers', 'quotes'). In production, strip email headers, HTML tags, and boilerplate before vectorising.

Ignoring out-of-vocabulary drift after deployment

Symptom
Accuracy steadily declines over weeks as new terminology appears in production data.
Fix
Monitor OOV rate daily. Set a threshold (e.g., 5%) to trigger automated retraining with updated vocabulary.
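The first mistake in this list (fitting the vectorizer before splitting) is avoided by splitting first and letting a Pipeline fit only on the training portion. The sketch below uses a tiny made-up spam/ham corpus and then checks that no test-only word reached the fitted vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up corpus: 6 spam (1) and 6 ham (0) messages
texts = [
    "win a free prize now", "claim your free cash", "exclusive offer click here",
    "free phone for you", "limited deal win cash", "cheap prize offer today",
    "meeting moved to monday", "please review the report", "lunch at noon tomorrow",
    "project deadline is friday", "see the attached invoice", "call me after the meeting",
]
labels = [1] * 6 + [0] * 6

# Split FIRST, stratified so both classes appear in each part
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced")),
])
pipeline.fit(train_texts, train_labels)  # vocabulary comes from the training texts only

train_vocab = set(pipeline.named_steps["tfidf"].vocabulary_)
train_words = {word for text in train_texts for word in text.split()}
test_only_words = {word for text in test_texts for word in text.split()} - train_words
leaked = test_only_words & train_vocab

print("Test-only words:", sorted(test_only_words))
print("Test-only words found in the fitted vocabulary:", sorted(leaked))
```

Passing the same pipeline object to cross_val_score gives the same guarantee per fold, because the vectorizer is refit inside each training fold.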
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Why does TF-IDF down-weight common words, and can you walk me through a ...
Q02 · SENIOR
You've trained a spam classifier that achieves 97% accuracy on your test...
Q03 · SENIOR
A colleague suggests you should vectorize all your data first and then d...
Q04 · JUNIOR
Explain the difference between precision and recall in the context of a ...
Q01 of 04 · SENIOR

Why does TF-IDF down-weight common words, and can you walk me through a scenario where that behaviour actually hurts your classifier rather than helps it?

ANSWER
TF-IDF down-weights common words like 'the' or 'is' because they appear across all documents and carry no discriminative power. It can hurt when the document frequencies seen in training stop matching production, because IDF values are frozen at fit time. For example, if 'iPhone' appears only 3 times in your training data, all in positive reviews, it gets a high IDF and the classifier learns a strong positive weight for it. In production, 'iPhone' appears in 80% of reviews, both positive and negative, yet it still carries that high weight, so the model pushes many of them toward positive incorrectly. Mitigations include sublinear TF scaling, capping IDF, or refitting the vectorizer on recent data.
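The numbers behind this answer can be worked through with the smoothed IDF formula scikit-learn uses by default (smooth_idf=True): idf = ln((1 + n) / (1 + df)) + 1. The corpus size and document frequencies below are hypothetical.

```python
import math


def smooth_idf(n_documents, document_frequency):
    """idf = ln((1 + n) / (1 + df)) + 1, scikit-learn's default smoothing."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_training_docs = 1000  # hypothetical corpus size

idf_the = smooth_idf(n_training_docs, 1000)        # appears in every document
idf_iphone = smooth_idf(n_training_docs, 3)        # appears in only 3 documents
idf_iphone_refit = smooth_idf(n_training_docs, 800)  # if refit where it appears in 80%

print(f"idf('the'), in every document:        {idf_the:.2f}")
print(f"idf('iphone'), in 3 of 1000:          {idf_iphone:.2f}")
print(f"idf('iphone') after refit, 800/1000:  {idf_iphone_refit:.2f}")
```

The rare word is weighted several times more heavily than the ubiquitous one, and that weight stays in place until the vectorizer is refit on data that reflects the new frequency.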
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is the difference between text classification and sentiment analysis?
02. How much training data do I need for text classification?
03. Can I use text classification for multi-label problems where one document has multiple categories?
04. How do I handle text in multiple languages?
05. Should I fine-tune the sentence transformer on my data or just use the pre-trained embeddings?