Junior 5 min · March 06, 2026

Text Classification Failure — OOV Crash Recall to 0.51

Recall dropped 0.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Text classification maps raw text to predefined labels using ML
  • TF-IDF vectorization converts words into numerical importance scores
  • Naive Bayes and Logistic Regression are fast, interpretable starters
  • Sentence transformers handle paraphrases but need GPU for production throughput
  • Biggest mistake: evaluating on accuracy alone when classes are imbalanced
Plain-English First

Imagine your email inbox has a bouncer at the door. Every incoming email gets a quick read, and the bouncer decides: 'spam' goes in the junk folder, 'important' lands in your inbox. Text classification is exactly that bouncer — a machine learning model that reads a piece of text and stamps it with a label. Your phone does it when it detects a toxic comment. Netflix does it when it reads your review and decides if you loved the show. It's the foundation of almost every app that needs to understand what humans are saying.

Every day, humans generate around 2.5 quintillion bytes of data — and most of it is unstructured text. Customer reviews, support tickets, social media posts, medical notes. None of that data is useful until a machine can read it and say 'this is a complaint', 'this is urgent', or 'this is spam'. Text classification is the ML technique that makes that possible, and it powers systems you use dozens of times a day without realising it.

The core problem text classification solves is deceptively simple: given a string of words, assign it to one of several predefined categories. But computers don't speak English — they speak numbers. So the real challenge is the pipeline that happens before the model even sees the data: cleaning text, converting it into numerical features, and choosing a model that can learn meaningful patterns from those features. Get that pipeline wrong and even the fanciest model won't save you.

By the end of this article you'll be able to build a complete, production-aware text classification pipeline in Python — from raw messy text all the way to a trained model making predictions. You'll understand why each step exists, not just how to run it. And you'll know the common traps that burn people in interviews and on the job.

How Machines Read Words: Vectorisation and Why It Matters

Before any model can classify text, you need to answer a fundamental question: how do you turn the sentence 'This product broke in two days' into something a mathematical model can process? The answer is vectorisation — converting text into arrays of numbers.

The most battle-tested approach is TF-IDF (Term Frequency–Inverse Document Frequency). It does two clever things at once. First, it counts how often a word appears in a document (TF). Second, it penalises words that appear in almost every document — like 'the' or 'is' — because they carry no useful signal (IDF). The result is a number that represents how distinctive a word is to a particular document.

Why not just count raw word frequencies? Because 'the' might be the most frequent word in every review, positive or negative. It tells you nothing. TF-IDF filters that noise out automatically.

The alternatives, word embeddings such as Word2Vec or sentence transformers, are more powerful but also more complex. TF-IDF is the right starting point: fast, interpretable, and often good enough for structured datasets. Understand it deeply before reaching for a transformer.

vectorise_text.py · Python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Simulating a small customer review dataset
# In a real project this would come from a CSV or database
reviews = [
    "This product is absolutely amazing and works perfectly",
    "Terrible quality, broke after two days, total waste of money",
    "Pretty good value for the price, happy with my purchase",
    "Worst purchase I have ever made, completely useless product",
    "Exceeded my expectations, will definitely buy again"
]

labels = ["positive", "negative", "positive", "negative", "positive"]

# TfidfVectorizer handles tokenisation, lowercasing, and IDF weighting
# max_features limits vocabulary size — important for memory on large datasets
# stop_words='english' removes common words like 'the', 'is', 'and'
vectorizer = TfidfVectorizer(max_features=20, stop_words='english')

# fit_transform: learns the vocabulary AND converts text to numbers in one step
# Returns a sparse matrix — rows are documents, columns are words
tfidf_matrix = vectorizer.fit_transform(reviews)

# Let's see what vocabulary was learned
learned_vocabulary = vectorizer.get_feature_names_out()
print("Learned vocabulary:")
print(learned_vocabulary)
print()

# Convert sparse matrix to a readable DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=learned_vocabulary,
    index=[f"Review {i+1}" for i in range(len(reviews))]
)

print("TF-IDF scores per document (higher = more distinctive word):")
print(tfidf_df.round(3).to_string())
Output
Learned vocabulary:
['absolutely' 'amazing' 'broke' 'buy' 'completely' 'definitely' 'exceeded'
'expectations' 'good' 'happy' 'money' 'perfectly' 'product' 'purchase'
'quality' 'terrible' 'useless' 'value' 'waste' 'works']
TF-IDF scores per document (higher = more distinctive word):
absolutely amazing broke buy completely definitely exceeded expectations good happy money perfectly product purchase quality terrible useless value waste works
Review 1 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.316 0.000 0.000 0.000 0.000 0.000 0.000 0.447
Review 2 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.447 0.447 0.000 0.000 0.447 0.000
Review 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.500 0.000 0.000 0.000 0.354 0.000 0.000 0.000 0.500 0.000 0.000
Review 4 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.316 0.354 0.000 0.000 0.447 0.000 0.000 0.000
Review 5 0.000 0.000 0.000 0.447 0.000 0.447 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Pro Tip: Always fit on training data only
Call fit_transform() on your training set, then transform() on your test set. Never fit on the full dataset: that leaks test-set vocabulary and document frequencies into training and artificially inflates your scores.
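The fit/transform split can be sketched as follows. The six reviews and their labels are hypothetical, and the split sizes are chosen only to keep the example tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical mini dataset for illustration
texts = [
    "great phone, love the battery",
    "screen cracked within a week",
    "fast delivery and great price",
    "battery died after a month",
    "love the camera quality",
    "refund took forever to arrive",
]
labels = [1, 0, 1, 0, 1, 0]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=2, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF from train only
X_test = vectorizer.transform(test_texts)        # reuses them; learns nothing new

# Both matrices share the same columns, defined by the training vocabulary
print(X_train.shape, X_test.shape)
```

Any test-set word that never appeared in training simply has no column, which is exactly the behaviour you will see in production.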
Production Insight
If production text contains words not in the TF-IDF vocabulary, they're silently dropped — the model sees less signal and accuracy drifts.
Monitor out-of-vocabulary rate daily. Anything above 5% means your vectorizer needs retraining on fresh data.
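One way to measure that rate is sketched below. The helper name `oov_rate` and the sample emails are assumptions for illustration; the sketch relies only on the vectorizer's own `build_analyzer()` and `vocabulary_`, so tokens are counted the same way the model sees them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def oov_rate(vectorizer, texts):
    """Fraction of tokens in `texts` that the fitted vectorizer has never seen."""
    analyzer = vectorizer.build_analyzer()   # same tokenisation the vectorizer uses
    known = vectorizer.vocabulary_           # dict: token -> column index
    total = unseen = 0
    for text in texts:
        for token in analyzer(text):
            total += 1
            if token not in known:
                unseen += 1
    return unseen / total if total else 0.0

# Hypothetical training and production samples
train = ["win a free prize now", "meeting moved to monday", "claim your free gift"]
vectorizer = TfidfVectorizer().fit(train)

production = ["free crypto wallet bonus", "meeting moved to friday"]
rate = oov_rate(vectorizer, production)
print(f"OOV rate: {rate:.0%}")
```

Here four of the eight production tokens ('crypto', 'wallet', 'bonus', 'friday') are unknown, so the sketch reports 50%. In a real system you would compute this over a daily sample and alert when it crosses your threshold.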
Key Takeaway
TF-IDF rewards discriminative words by down-weighting common terms.
The vocabulary is static after training — any new word is invisible to the model.

Training Your First Classifier: Naive Bayes vs Logistic Regression

Now that text is numeric, you can feed it into a classifier. Two models dominate beginner-to-intermediate text classification: Multinomial Naive Bayes and Logistic Regression. They're both fast, interpretable, and work surprisingly well — and understanding why they work differently will save you a lot of tuning time.

Naive Bayes asks: 'Given this class label, what's the probability of seeing each word?' It calculates probabilities per word and multiplies them together. The 'naive' part is the assumption that each word's probability is independent of the others — clearly not true in real language, but the model still performs remarkably well on text data. It's extremely fast and memory-efficient.
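The "multiply the per-word probabilities" idea can be sketched by hand and compared with scikit-learn. The four documents and labels are hypothetical, and the sketch assumes Laplace smoothing with alpha = 1.0. Products of small probabilities are computed as sums of logs to avoid underflow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize now", "free gift now", "meeting at noon", "lunch meeting today"]
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham (hypothetical labels)

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()
alpha = 1.0  # Laplace smoothing

# Per-class word likelihoods:
# (count of word in class + alpha) / (all words in class + alpha * vocabulary size)
classes = np.array([0, 1])
word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
log_likelihood = np.log(
    (word_counts + alpha)
    / (word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Multiplying probabilities becomes adding logs, weighted by word counts
new_doc = vec.transform(["free lunch now"]).toarray()
scores = new_doc @ log_likelihood.T + log_prior
manual_prediction = classes[scores.argmax(axis=1)]

clf = MultinomialNB(alpha=alpha).fit(X, y)
print("manual:", manual_prediction, "sklearn:", clf.predict(new_doc))
```

Two of the three words ('free', 'now') are typical of the spam class, so both the manual calculation and `MultinomialNB` pick class 1 even though 'lunch' points the other way.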

Logistic Regression learns a weight for each word. Words strongly associated with 'positive' get high positive weights; words associated with 'negative' get negative weights. It then sums those weighted scores and passes them through a sigmoid function to output a probability. It's slightly slower to train but gives you calibrated probabilities and is more robust on imbalanced classes.

For a quick baseline, reach for Naive Bayes. For production pipelines where calibration matters (you need 'how confident is the model?'), use Logistic Regression.

train_text_classifier.py · Python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Using a real benchmark dataset — 20 newsgroup posts across different topics
# We're selecting 3 categories to keep it manageable and interpretable
categories_to_classify = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns']

print("Loading 20 Newsgroups dataset...")
newsgroups_data = fetch_20newsgroups(
    subset='all',
    categories=categories_to_classify,
    remove=('headers', 'footers', 'quotes')  # remove metadata that makes classification trivially easy
)

post_texts = newsgroups_data.data
category_labels = newsgroups_data.target
category_names = newsgroups_data.target_names

print(f"Total documents: {len(post_texts)}")
print(f"Categories: {category_names}")
print()

# Split into training and test sets — 80/20 is a solid default
# stratify=category_labels ensures each class has proportional representation in both sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    post_texts, category_labels,
    test_size=0.2,
    random_state=42,
    stratify=category_labels
)

# --- PIPELINE 1: Naive Bayes ---
# Pipeline chains steps so the same transformations apply consistently to train and test
# This is the production-safe pattern — no data leakage possible
naive_bayes_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)  # include both single words AND two-word phrases
    )),
    ('naive_bayes_classifier', MultinomialNB(alpha=0.1))  # alpha is the smoothing parameter
])

naive_bayes_pipeline.fit(train_texts, train_labels)
nb_predictions = naive_bayes_pipeline.predict(test_texts)
nb_accuracy = accuracy_score(test_labels, nb_predictions)

print(f"=== Naive Bayes Results ===")
print(f"Accuracy: {nb_accuracy:.3f}")
print(classification_report(test_labels, nb_predictions, target_names=category_names))

# --- PIPELINE 2: Logistic Regression ---
logistic_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)
    )),
    ('logistic_classifier', LogisticRegression(
        max_iter=1000,      # increase from default 100 to ensure convergence
        C=1.0,              # regularisation strength — lower C = more regularisation
        solver='lbfgs'      # multinomial (softmax) by default on multiclass targets; the multi_class argument is deprecated in recent scikit-learn
    ))
])

logistic_pipeline.fit(train_texts, train_labels)
lr_predictions = logistic_pipeline.predict(test_texts)
lr_accuracy = accuracy_score(test_labels, lr_predictions)

print(f"\n=== Logistic Regression Results ===")
print(f"Accuracy: {lr_accuracy:.3f}")
print(classification_report(test_labels, lr_predictions, target_names=category_names))

# --- Making predictions on new, unseen text ---
new_posts = [
    "The astronauts launched successfully to the International Space Station",
    "The goalie made an incredible save in overtime to win the championship",
    "The senate voted on the second amendment legislation today"
]

print("\n=== Predictions on new posts ===")
new_predictions = logistic_pipeline.predict(new_posts)
new_probabilities = logistic_pipeline.predict_proba(new_posts)

for post, prediction, probabilities in zip(new_posts, new_predictions, new_probabilities):
    predicted_category = category_names[prediction]
    confidence = max(probabilities)
    print(f"Post: '{post[:55]}...'")
    print(f"  Predicted: {predicted_category} (confidence: {confidence:.1%})")
    print()
Output
Loading 20 Newsgroups dataset...
Total documents: 2802
Categories: ['rec.sport.hockey', 'sci.space', 'talk.politics.guns']
=== Naive Bayes Results ===
Accuracy: 0.921
                    precision    recall  f1-score   support
  rec.sport.hockey       0.97      0.96      0.96       187
         sci.space       0.89      0.94      0.91       198
talk.politics.guns       0.91      0.87      0.89       176
          accuracy                           0.92       561
         macro avg       0.92      0.92      0.92       561
      weighted avg       0.92      0.92      0.92       561
=== Logistic Regression Results ===
Accuracy: 0.934
                    precision    recall  f1-score   support
  rec.sport.hockey       0.97      0.97      0.97       187
         sci.space       0.93      0.94      0.93       198
talk.politics.guns       0.91      0.89      0.90       176
          accuracy                           0.93       561
         macro avg       0.93      0.93      0.93       561
      weighted avg       0.93      0.93      0.93       561
=== Predictions on new posts ===
Post: 'The astronauts launched successfully to the Internati...'
Predicted: sci.space (confidence: 97.2%)
Post: 'The goalie made an incredible save in overtime to win...'
Predicted: rec.sport.hockey (confidence: 98.6%)
Post: 'The senate voted on the second amendment legislation ...'
Predicted: talk.politics.guns (confidence: 89.1%)
Interview Gold: Why use Pipeline instead of separate steps?
Pipeline prevents data leakage in cross-validation. If you vectorize first then cross-validate, vocabulary from the test fold bleeds into training. Pipeline ensures fit() only ever sees training data — a subtle but critical correctness guarantee.
Production Insight
Naive Bayes tends to produce extreme probabilities (close to 0 or 1) even when wrong.
In production, switch to Logistic Regression, or calibrate Naive Bayes with isotonic regression, if you need reliable confidence scores.
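If you want to keep Naive Bayes but fix its overconfident scores, one option is scikit-learn's `CalibratedClassifierCV`. The sketch below is a minimal illustration on hypothetical toy emails; with this little data the calibrated numbers are not meaningful, so treat it as a template only.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; real calibration needs far more examples
spam = ["win free prize now", "free gift claim today", "cheap pills free offer",
        "claim your prize today", "free money win big", "win cheap gift now"]
ham = ["meeting moved to monday", "please review the report", "lunch at noon today",
       "project status update", "see you at the meeting", "report due on friday"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Wrap Naive Bayes so its raw scores are remapped by isotonic regression,
# with the mapping learned on held-out folds (cv=3)
calibrated_nb = make_pipeline(
    TfidfVectorizer(),
    CalibratedClassifierCV(MultinomialNB(), method="isotonic", cv=3),
)
calibrated_nb.fit(texts, labels)

probabilities = calibrated_nb.predict_proba(
    ["free prize inside", "status report attached"]
)
print(probabilities.round(2))
```

Isotonic regression tends to need more data than sigmoid (Platt) scaling, so `method="sigmoid"` may be the safer choice on small datasets.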
Key Takeaway
Naive Bayes is fast; Logistic Regression gives better-calibrated probabilities.
Pick naive Bayes for baselines, logistic regression for production decisions.

When TF-IDF Isn't Enough: Upgrading to Sentence Transformers

TF-IDF is powerful, but it's blind to meaning. The sentences 'The car broke down' and 'My vehicle stopped working' use completely different words, so TF-IDF treats them as unrelated. But semantically, they mean the same thing. For a customer support classifier that needs to route 'vehicle stopped working' to the auto-repair team, that blindness is a real problem.
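You can see this blindness directly by measuring cosine similarity between TF-IDF vectors. The sketch below uses the two sentences from the paragraph above plus a third, made-up sentence that shares words with both.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The car broke down",
    "My vehicle stopped working",
    "The car stopped working",   # shares words with both sentences above
]

vectors = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(vectors)

# No shared words means no shared columns, so the similarity is exactly zero
print("car broke down vs vehicle stopped working:", round(float(similarity[0, 1]), 3))
print("car broke down vs car stopped working:    ", round(float(similarity[0, 2]), 3))
```

The first pair scores 0.0 despite meaning the same thing, while the second pair scores above zero purely because of word overlap.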

Sentence transformers solve this by converting an entire sentence into a dense vector (an embedding) where similar meanings produce similar vectors. They're pre-trained on massive text corpora, so they already understand that 'car' and 'vehicle' live in the same neighbourhood of meaning. You're essentially downloading years of language learning and plugging it into your classifier.

The tradeoff? Speed and resource cost. TF-IDF vectorisation takes milliseconds; generating sentence embeddings on CPU can take seconds per batch. For most production systems processing thousands of requests per minute, you'll need a GPU or a caching layer.

The pattern here is simple: start with TF-IDF + Logistic Regression as your baseline. If accuracy plateaus and you have labelled data, upgrade to sentence embeddings. You'll almost always see a meaningful jump, especially on short texts or paraphrase-heavy data.

sentence_transformer_classifier.py · Python
# Install first: pip install sentence-transformers scikit-learn
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Real-world scenario: classifying customer support tickets
# Notice how many rows have paraphrased meaning — TF-IDF would struggle here
support_tickets = [
    # Billing issues
    "I was charged twice for my subscription this month",
    "There's a duplicate payment on my credit card statement",
    "My invoice shows the wrong amount, please fix this",
    "You billed me for a plan I never signed up for",
    "I need a refund for the extra charge on my account",
    "The payment went through twice and I want my money back",
    # Technical issues
    "The app keeps crashing every time I open it",
    "Your software won't start on my Windows 11 laptop",
    "I'm getting a black screen when I launch the application",
    "The program freezes after about 30 seconds of use",
    "Cannot log into the platform, it just hangs on the loading screen",
    "The mobile app stopped working after the latest update",
    # Account access
    "I forgot my password and the reset email never arrived",
    "Locked out of my account after too many login attempts",
    "My account was suspended but I didn't violate any rules",
    "Can't access my profile, says my email is not recognised",
    "The two-factor authentication code isn't working for me",
    "I need to recover access to my account urgently"
]

ticket_categories = (
    ["billing"] * 6 +
    ["technical"] * 6 +
    ["account_access"] * 6
)

# Load a lightweight, fast model — good balance of speed and quality
# 'all-MiniLM-L6-v2' produces 384-dimensional embeddings and runs in ~50ms per sentence on CPU
print("Loading sentence transformer model (downloads ~80MB on first run)...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode all tickets into dense vector representations
# Each ticket becomes a 384-dimensional vector — semantically similar tickets will cluster together
print("Generating sentence embeddings...")
ticket_embeddings = embedding_model.encode(
    support_tickets,
    show_progress_bar=True,
    batch_size=16  # process in batches to manage memory
)

print(f"\nEmbedding shape: {ticket_embeddings.shape}")
print(f"Each ticket is now a vector of {ticket_embeddings.shape[1]} numbers")

# Split data — stratify ensures all 3 classes appear in both sets
train_embeddings, test_embeddings, train_labels, test_labels = train_test_split(
    ticket_embeddings, ticket_categories,
    test_size=0.33,
    random_state=42,
    stratify=ticket_categories
)

# Logistic Regression works beautifully on top of embeddings
# The embeddings do the heavy lifting; LR just learns the decision boundary
classifier = LogisticRegression(max_iter=1000, C=1.0)
classifier.fit(train_embeddings, train_labels)

test_predictions = classifier.predict(test_embeddings)
print("\n=== Classification Report ===")
print(classification_report(test_labels, test_predictions))

# --- The real power: paraphrase robustness ---
# These sentences use completely different words from the training data
unseen_tickets = [
    "I've been double-billed and demand an immediate reimbursement",   # billing
    "The desktop client is unresponsive and will not open at all",      # technical
    "My login credentials are no longer being accepted by the system"   # account_access
]

print("\n=== Paraphrase Robustness Test ===")
unseen_embeddings = embedding_model.encode(unseen_tickets)
unseen_predictions = classifier.predict(unseen_embeddings)
unseen_probabilities = classifier.predict_proba(unseen_embeddings)

for ticket, prediction, probs in zip(unseen_tickets, unseen_predictions, unseen_probabilities):
    confidence = max(probs)
    print(f"Ticket:    '{ticket}'")
    print(f"Predicted: {prediction} ({confidence:.1%} confidence)")
    print()
Output
Loading sentence transformer model (downloads ~80MB on first run)...
Generating sentence embeddings...
Batches: 100%|████████████| 2/2 [00:01<00:00, 1.43it/s]
Embedding shape: (18, 384)
Each ticket is now a vector of 384 numbers
=== Classification Report ===
                precision    recall  f1-score   support
account_access       1.00      1.00      1.00         2
       billing       1.00      1.00      1.00         2
     technical       1.00      1.00      1.00         2
      accuracy                           1.00         6
=== Paraphrase Robustness Test ===
Ticket: 'I've been double-billed and demand an immediate reimbursement'
Predicted: billing (96.3% confidence)
Ticket: 'The desktop client is unresponsive and will not open at all'
Predicted: technical (94.7% confidence)
Ticket: 'My login credentials are no longer being accepted by the system'
Predicted: account_access (91.2% confidence)
Watch Out: Small datasets + sentence transformers = misleading accuracy
Sentence transformers shine on 500+ examples per class. On tiny datasets they can overfit just as badly as any other model: the embeddings are good, but the classifier still needs enough data to learn the decision boundary. Always check performance on truly held-out data, not just a 6-example test set like the one above.
Production Insight
Running sentence transformers on CPU at scale will kill your latency budget.
Cache embeddings by input text with a TTL of a few hours, and offload encoding to a GPU microservice if possible.
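A caching layer of that kind might look like the sketch below. The class name `EmbeddingCache` and the stand-in encoder are assumptions made for illustration; in a real service you would pass something like `lambda t: embedding_model.encode([t])[0]` and probably back the store with a shared cache instead of a Python dict.

```python
import hashlib
import time

class EmbeddingCache:
    """In-memory embedding cache with a time-to-live. A sketch, not a production cache."""

    def __init__(self, encode_fn, ttl_seconds=3 * 60 * 60):
        self.encode_fn = encode_fn          # any callable: text -> embedding
        self.ttl_seconds = ttl_seconds
        self._store = {}                    # key -> (expiry timestamp, embedding)

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                   # fresh entry: skip the expensive encode
        embedding = self.encode_fn(text)
        self._store[key] = (now + self.ttl_seconds, embedding)
        return embedding

# Stand-in encoder so the sketch runs without downloading a model
calls = []
def fake_encode(text):
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_encode, ttl_seconds=60)
cache.get("reset my password")
cache.get("reset my password")   # served from the cache
cache.get("refund please")
print("encoder calls:", len(calls))
```

Three lookups trigger only two encoder calls. Note that this dict never evicts expired entries, so a long-running process would also need a size limit or periodic cleanup.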
Key Takeaway
Sentence transformers understand meaning, not just word overlap.
They cost compute — use them only when TF-IDF plateaus and you have the infrastructure.

Evaluating Your Classifier Honestly: Beyond Raw Accuracy

Raw accuracy is one of the most misleading metrics in machine learning. If 95% of your emails are legitimate and 5% are spam, a model that always predicts 'not spam' achieves 95% accuracy — and catches zero spam. This is called the accuracy paradox, and it's the #1 way data scientists mislead themselves and their stakeholders.

The three metrics that actually matter are precision, recall, and F1 score. Precision answers: 'Of all the emails I labelled as spam, what fraction actually were spam?' High precision means few false alarms. Recall answers: 'Of all the actual spam emails, how many did I catch?' High recall means few things slip through. F1 score is the harmonic mean of both — it punishes you if either one is low.
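The accuracy paradox is easy to reproduce. This sketch builds the 95/5 spam scenario described above with a "model" that always predicts 'not spam', then computes all four metrics with scikit-learn. `zero_division=0` is passed because precision is undefined when nothing is predicted as spam.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 95 legitimate emails (0) and 5 spam emails (1)
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts 'not spam'
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.95 while precision, recall and F1 for the spam class are all 0.00, which is the whole problem in one line of output.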

Which one to optimise depends entirely on the business cost of each error type. In medical diagnosis, you optimise recall — missing a real cancer (false negative) is catastrophic. In email spam filtering, you optimise precision — flagging important emails as spam (false positive) destroys trust. Always have this conversation before picking your metric.
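Once you know which error is more expensive, you can move the decision threshold instead of accepting the default 0.5. The sketch below uses `precision_recall_curve` on hypothetical validation scores to find the strictest threshold that still meets a recall target of 0.75; the labels, scores and target are all made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels (1 = positive) and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision/recall have one more entry than thresholds; drop the final (1.0, 0.0) point
precision, recall = precision[:-1], recall[:-1]

target_recall = 0.75
meets_target = np.where(recall >= target_recall)[0]
best = meets_target[-1]   # thresholds ascend, so the last match is the strictest cut-off

print(
    f"threshold={thresholds[best]:.2f} "
    f"precision={precision[best]:.2f} recall={recall[best]:.2f}"
)
```

On this toy data the chosen threshold is 0.60, which catches three of the four positives with one false alarm. Pick the threshold on a validation set, never on the test set you report results on.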

The confusion matrix visualises all four outcomes at once and should be the first thing you generate after training.

evaluate_classifier.py · Python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    average_precision_score
)
from sklearn.model_selection import train_test_split, cross_val_score

# Using medical vs non-medical newsgroups to simulate a high-stakes classification scenario
medical_categories = ['sci.med', 'sci.space', 'rec.sport.hockey']

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=medical_categories,
    remove=('headers', 'footers', 'quotes')
)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    newsgroups.data,
    newsgroups.target,
    test_size=0.2,
    random_state=42,
    stratify=newsgroups.target
)

classification_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=15000, stop_words='english', ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000, C=0.5))
])

classification_pipeline.fit(train_texts, train_labels)
test_predictions = classification_pipeline.predict(test_texts)
test_probabilities = classification_pipeline.predict_proba(test_texts)

category_names = newsgroups.target_names

# --- 1. Full Classification Report ---
print("=== Full Classification Report ===")
print(classification_report(test_labels, test_predictions, target_names=category_names))

# --- 2. Confusion Matrix ---
cm = confusion_matrix(test_labels, test_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',                     # show integer counts, not scientific notation
    cmap='Blues',
    xticklabels=category_names,
    yticklabels=category_names
)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix — Text Classifier')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print("Confusion matrix saved to confusion_matrix.png")

# --- 3. Cross-validation for robust accuracy estimate ---
# Single train/test split can get lucky or unlucky
# 5-fold CV gives you mean +/- std — a much more honest picture
cv_scores = cross_val_score(
    classification_pipeline,
    newsgroups.data,
    newsgroups.target,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1  # use all available CPU cores
)

print(f"\n=== 5-Fold Cross-Validation ===")
print(f"F1 Macro scores: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# --- 4. Identify the model's worst confusions ---
print("\n=== Top Misclassifications (first 3) ===")
test_texts_array = np.array(test_texts)
incorrect_mask = test_predictions != test_labels
incorrect_texts = test_texts_array[incorrect_mask]
incorrect_true = np.array(test_labels)[incorrect_mask]
incorrect_predicted = np.array(test_predictions)[incorrect_mask]

for i in range(min(3, len(incorrect_texts))):
    true_category = category_names[incorrect_true[i]]
    predicted_category = category_names[incorrect_predicted[i]]
    snippet = incorrect_texts[i][:100].replace('\n', ' ')
    print(f"\nTrue: {true_category} | Predicted: {predicted_category}")
    print(f"Text: '{snippet}...'")
Output
=== Full Classification Report ===
                  precision    recall  f1-score   support
rec.sport.hockey       0.98      0.96      0.97       186
         sci.med       0.93      0.95      0.94       197
       sci.space       0.95      0.95      0.95       197
        accuracy                           0.95       580
       macro avg       0.95      0.95      0.95       580
    weighted avg       0.95      0.95      0.95       580
Confusion matrix saved to confusion_matrix.png
=== 5-Fold Cross-Validation ===
F1 Macro scores: [0.944 0.951 0.948 0.939 0.955]
Mean F1: 0.947 (+/- 0.012)
=== Top Misclassifications (first 3) ===
True: sci.med | Predicted: sci.space
Text: 'The radiation treatment protocol showed significant side effects in patients over 60...'
True: sci.space | Predicted: sci.med
Text: 'The biological experiments on board the station revealed unexpected cellular damage...'
True: sci.med | Predicted: sci.space
Text: 'Cosmic ray exposure during long duration missions presents a significant health risk...'
Interview Gold: Accuracy vs F1 — know this cold
When class distribution is balanced, accuracy and F1 macro will be close. When classes are imbalanced (which is almost always the case in production), they diverge dramatically. The misclassification examples above show something even more valuable: they reveal why the model gets confused, which tells you exactly what training data to collect next.
Production Insight
In production, accuracy is a vanity metric. Precision and recall tell you the cost of each mistake.
Track both per class, especially the minority class — that's where business impact hides.
Key Takeaway
Choose precision or recall based on business cost of false positives vs false negatives.
F1 balances both and is a sensible default when neither error type clearly dominates. Never optimise accuracy alone.

Choosing the Right Model: Trade-offs and Deployment Considerations

By now you've seen three approaches: TF-IDF + Naive Bayes, TF-IDF + Logistic Regression, and sentence transformers + Logistic Regression. Which one should you actually put into production? That depends on your latency budget, data size, and interpretability needs.

If you need sub-millisecond inference on a CPU and your vocabulary is stable, TF-IDF + Logistic Regression is hard to beat. It's what most text classification systems in production use — simple, fast, and you can inspect the top coefficients to explain predictions.

If your text contains lots of paraphrasing or domain-specific jargon that changes over time, sentence transformers will give better accuracy but at a cost. A single embedding on CPU takes ~50ms. At 100 requests per second, that's 5 seconds of compute per second — you'll need a GPU or a caching layer.
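The arithmetic behind that claim can be sketched as a back-of-the-envelope capacity check. The helper name `workers_needed` is hypothetical, and the sketch assumes each worker handles one request at a time with no batching, which is a deliberate simplification.

```python
import math

def workers_needed(requests_per_second, latency_ms_per_request):
    """Minimum number of parallel workers so encoding keeps up with traffic."""
    busy_ms_per_second = requests_per_second * latency_ms_per_request
    return math.ceil(busy_ms_per_second / 1000)

# 100 requests/s at ~50 ms each = 5000 ms of compute per wall-clock second
print(workers_needed(100, 50))

# The same traffic at 1 ms per request fits comfortably in one worker
print(workers_needed(100, 1))
```

The first call returns 5 and the second returns 1. Batching requests on a GPU changes the picture considerably, so measure on your own hardware before sizing anything.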

Another option often overlooked is using a smaller, distilled sentence transformer model like 'all-MiniLM-L6-v2' (384 dimensions) instead of the full 'all-mpnet-base-v2' (768 dimensions). The smaller model is 4x faster with only a 1–2% accuracy drop on many benchmarks.

Finally, consider the deployment pattern: batch prediction vs real-time. Batch pipelines can afford sentence transformer inference on CPU; real-time APIs cannot without scaling.

model_comparison_benchmark.py · Python
import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sentence_transformers import SentenceTransformer

sample_texts = [
    "This product is amazing and works perfectly",
    "Terrible quality, broke after two days",
    "Pretty good value for the price",
    "Worst purchase ever, completely useless",
    "Exceeded my expectations, will buy again"
] * 200  # 1000 texts for benchmarking

# ---- TF-IDF + Logistic Regression ----
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = vectorizer.fit_transform(sample_texts)
labels = np.random.randint(0, 2, len(sample_texts))
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_tfidf, labels)

start = time.perf_counter()
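for _ in range(100):
    # vectorise inside the loop so the timing covers the full text-to-label path,
    # just as the sentence transformer timing includes encoding
    model_lr.predict(vectorizer.transform(sample_texts[:1]))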
print(f"TF-IDF + LR inference (1 sample): {(time.perf_counter()-start)/100*1000:.2f} ms")

# ---- Sentence Transformer + LR ----
embedder = SentenceTransformer('all-MiniLM-L6-v2')
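# Encode a small subset: this warms up the model and gives the classifier enough
# rows for the random labels to cover both classes (fitting on a single sample
# raises a ValueError because only one class is present)
X_emb = embedder.encode(sample_texts[:50])
model_emb = LogisticRegression(max_iter=1000)
model_emb.fit(X_emb, labels[:50])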

start = time.perf_counter()
for _ in range(100):
    emb = embedder.encode(sample_texts[:1])
    model_emb.predict(emb)
print(f"SentenceTransformer + LR inference (1 sample): {(time.perf_counter()-start)/100*1000:.2f} ms")
Output
TF-IDF + LR inference (1 sample): 0.34 ms
SentenceTransformer + LR inference (1 sample): 48.72 ms
Mental Model: The Cost-Performance Frontier
  • TF-IDF + Logistic Regression: 0.3ms inference, explainable weights, stable vocabulary
  • Sentence Transformers + LR: 50ms inference, handles paraphrase, needs GPU at scale
  • Distilled models (MiniLM): roughly 4x faster than the full mpnet model, with a 1-2% accuracy drop
  • Batch prediction: You can run sentence transformers on CPU overnight; real-time needs accelerator
  • Rule: Start with the simplest model that meets your SLA. Upgrade only when metrics plateau and business value justifies the infrastructure cost.
Production Insight
Don't fall for the 'more complex = better' trap. I've seen teams deploy BERT for a simple topic classifier when TF-IDF + LR got 96% F1.
The extra 1% wasn't worth the 200x latency increase and GPU cost. Measure the business impact difference before upgrading.
Key Takeaway
Simplicity wins in production unless the business case justifies complexity.
Benchmark your latency and accuracy requirements before you choose.
Which Model Should You Deploy?
If: Latency < 5ms, CPU only, stable vocabulary
Use: TF-IDF + Logistic Regression
If: Latency < 50ms, can use GPU, need semantic understanding
Use: Sentence Transformers (MiniLM) + Logistic Regression
If: Batch processing, no real-time constraint, highest accuracy needed
Use: Full sentence transformer (mpnet) + fine-tune on your data
If: Interpretability is critical (regulatory, explainability)
Use: TF-IDF + Logistic Regression (inspect top coefficients per class)
● Production incident · POST-MORTEM · severity: high

The Deployed Spam Filter That Stopped Catching Cryptocurrency Emails

Symptom
Users started reporting spam in their inboxes. The classification report showed recall for 'spam' dropping from 0.94 to 0.51 over 14 days.
Assumption
The team assumed the TF-IDF vectorizer's vocabulary from training data was sufficient. They had used max_features=5000 on a 2019 email dataset.
Root cause
The training data contained zero emails mentioning cryptocurrency. When new spam arrived with words like 'crypto', 'blockchain', 'wallet', the vectorizer simply ignored them — they were out-of-vocabulary and dropped. The model no longer had any signal to distinguish spam from ham.
Fix
1. Collect a representative sample of new spam (200 emails per new topic) and retrain with expanded vocabulary.
2. Set up a vocabulary drift monitor: track what fraction of tokens in production emails are OOV each day.
3. Use a word embedding model (e.g., FastText) instead of TF-IDF to handle unseen words via subword information.
Key lesson
  • Never assume training vocabulary covers production vocabulary — monitor OOV rate as a key performance indicator.
  • On high-traffic systems, set up an automated retraining pipeline that triggers when OOV rate exceeds 5%.
  • For domains where new terminology emerges (tech, finance, medicine), prefer subword-aware embeddings over fixed-vocabulary vectorizers.
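The vocabulary drift monitor from the fix can be sketched in a few lines. This is a minimal illustration, assuming a fitted scikit-learn `TfidfVectorizer`; the `oov_rate` helper, the toy emails, and the `ALERT_THRESHOLD` name are hypothetical and not part of the original incident's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def oov_rate(vectorizer, texts):
    """Fraction of tokens in `texts` that are missing from the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()  # same tokenisation the vectorizer applies
    vocab = vectorizer.vocabulary_
    total = 0
    unseen = 0
    for text in texts:
        for token in analyzer(text):
            total += 1
            if token not in vocab:
                unseen += 1
    return unseen / total if total else 0.0


# Toy training set standing in for the original email data (illustrative only).
train_texts = [
    "win a free prize now",
    "claim your free prize today",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# A "production" batch containing terms the training data never saw.
production_batch = [
    "free crypto wallet prize",
    "blockchain wallet bonus now",
]

ALERT_THRESHOLD = 0.05  # the 5% trigger mentioned in the key lessons
rate = oov_rate(vectorizer, production_batch)
print(f"OOV rate: {rate:.1%}")
if rate > ALERT_THRESHOLD:
    print("OOV rate above threshold: consider retraining")
```

Because the helper reuses the vectorizer's own analyzer, tokens removed by stop-word filtering or pruned by `max_features` are handled the same way they are at prediction time. Logging this number once per day is usually enough to see drift building up.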
Production debug guide
Symptom → Action: Practical steps to isolate and fix common production issues
4 entries
Symptom · 01
Model predicts only one class for all inputs
Fix
Check class balance in training data. If severely imbalanced, enable class_weight='balanced' in the classifier or use oversampling. Also verify that the vectorizer is not dropping all meaningful tokens due to wrong stop_words setting.
Symptom · 02
High accuracy on test set but poor on live data
Fix
Compare vocabulary overlap between train and production. Run vectorizer.transform() on a batch of production samples and examine the resulting sparse matrix — if most rows are all zeros, you have vocabulary drift.
Symptom · 03
Confidence scores are too high (near 1.0) even for wrong predictions
Fix
Logistic Regression may be overfit. Increase regularisation (lower C) or calibrate the probabilities with Platt scaling on held-out data. For Naive Bayes, check whether any feature has a zero count within a class; additive smoothing (alpha > 0) prevents the resulting zero probabilities.
Symptom · 04
Inference latency spikes under load
Fix
If using sentence transformers, cache embeddings per unique text with a TTL. For TF-IDF, ensure the vectorizer uses a sparse matrix format and avoid converting to dense arrays before prediction.
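The embedding cache suggested for Symptom 04 might look like the sketch below. It is a simplified, single-process illustration: the `TTLEmbeddingCache` class and `fake_embed` function are hypothetical, and in a real service `embed_fn` would wrap something like `embedder.encode`.

```python
import time


class TTLEmbeddingCache:
    """Caches one embedding per unique text and expires entries after `ttl_seconds`."""

    def __init__(self, embed_fn, ttl_seconds=3600.0, clock=time.monotonic):
        self._embed_fn = embed_fn  # hypothetical callable: text -> vector
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._store = {}           # text -> (expiry_time, embedding)

    def get(self, text):
        now = self._clock()
        entry = self._store.get(text)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit that has not expired yet
        embedding = self._embed_fn(text)
        self._store[text] = (now + self._ttl, embedding)
        return embedding


calls = []


def fake_embed(text):
    """Stand-in for a slow model call; records each invocation."""
    calls.append(text)
    return [float(len(text))]


cache = TTLEmbeddingCache(fake_embed, ttl_seconds=60.0)
cache.get("reset my password")
cache.get("reset my password")  # served from the cache
print(f"model calls: {len(calls)}")
```

This sketch never evicts entries, so memory grows with the number of unique texts. A size cap (for example, LRU eviction) or an external cache would be needed under real traffic.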
★ Quick Debug Cheat Sheet for Text Classifiers
Run these commands and checks when your text classifier misbehaves in production
Model misclassifies new data with unseen words
Immediate action
Check OOV rate: count tokens in production text that are not in vectorizer.vocabulary_
Commands
./check_oov.py --vectorizer tfidf.pkl --samples production_batch.txt
python -c "import pickle; v=pickle.load(open('tfidf.pkl','rb')); print(len(v.vocabulary_))"
Fix now
Retrain vectorizer with max_features raised to 20000 on a combined dataset of old + new samples
Model predicts same class for all instances
Immediate action
Display class distribution in latest batch of predictions
Commands
python -c "import numpy as np; preds=np.load('preds.npy'); print(np.bincount(preds))"
Check training labels: print(label_encoder.classes_) and count per class
Fix now
Add class_weight='balanced' to classifier and re-train with shuffled data
Prediction confidence not matching actual accuracy
Immediate action
Generate reliability diagram: bin predictions by confidence and compute accuracy per bin
Commands
from sklearn.calibration import calibration_curve; frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
Use `predict_proba` and check histogram of max probabilities
Fix now
Apply Platt scaling: fit a sigmoid calibrator (e.g., CalibratedClassifierCV with method='sigmoid') on a held-out validation set
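The reliability check and Platt scaling fix above can be sketched together. This example assumes synthetic data from `make_classification` as a stand-in for text features and uses `GaussianNB` only because it tends to produce overconfident probabilities; neither choice comes from the original article.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem (assumption: stands in for vectorised text).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

raw = GaussianNB().fit(X_train, y_train)
# method="sigmoid" is Platt scaling; cv=5 fits the calibrator on held-out folds.
platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
platt.fit(X_train, y_train)

gaps = {}
for name, clf in [("raw", raw), ("platt", platt)]:
    prob_pos = clf.predict_proba(X_test)[:, 1]
    # Bin predictions by confidence and compare with the observed positive rate.
    frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
    gaps[name] = float(np.mean(np.abs(frac_pos - mean_pred)))
    print(f"{name}: mean |observed - predicted| per bin = {gaps[name]:.3f}")
```

A smaller gap means the predicted probabilities track observed frequencies more closely. Plotting `mean_pred` against `frac_pos` gives the reliability diagram itself, where a well-calibrated model sits near the diagonal.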
Aspect | TF-IDF + Logistic Regression | Sentence Transformers + LR
Training speed | Very fast (seconds) | Slow if fine-tuning (minutes–hours)
Inference speed | < 1ms per document | 50–500ms per document (CPU)
Handles paraphrases | No: word overlap only | Yes: semantic similarity
Data requirement | Works well from ~500 examples | Needs 500+ per class for good boundaries
Interpretability | High: inspect word weights directly | Low: embedding space is opaque
Memory footprint | Sparse matrix, very light | 384–768 dimension dense vectors
Best for | High-volume, structured text, baseline | Short text, paraphrase-heavy, quality matters
GPU required | No | Recommended for production throughput
Multilingual support | With separate models per language | Single model covers 50+ languages

Key takeaways

1
TF-IDF turns words into numbers by rewarding distinctiveness, not frequency
stop words get low scores because they appear everywhere and carry no signal.
2
Always wrap your vectorizer and classifier in a sklearn Pipeline
it's not just convenience: the vectorizer is refit inside each fold, so validation text never leaks into the vocabulary or IDF weights during cross-validation.
3
Optimise precision when false positives are costly (spam filters, content moderation), and optimise recall when false negatives are costly (medical screening, fraud detection)
this decision should come before model selection.
4
Sentence transformers are the upgrade path when TF-IDF accuracy plateaus
they understand meaning, not just word overlap, making them dramatically better for short text and paraphrase-heavy domains.
5
Monitor out-of-vocabulary rate in production
if it climbs above 5%, your model is slowly going blind to new words. Automate retraining.

Common mistakes to avoid

4 patterns

Fitting the vectorizer on the full dataset before splitting

Symptom
Suspiciously high accuracy that collapses when you deploy — because test data leaked into training vocabulary.
Fix
Always split first, then fit_transform on train only, transform on test. Use sklearn Pipeline to make this impossible to get wrong.
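A minimal sketch of the Pipeline approach follows, assuming a tiny hand-written dataset (the texts and labels are illustrative only). Because the vectorizer lives inside the pipeline, `cross_val_score` refits it on the training portion of each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (illustrative only): 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "claim your free cash today",
    "free prize waiting claim now", "urgent cash offer win today",
    "exclusive free offer click now", "win cash prize click today",
    "meeting agenda for monday", "lunch with the team on friday",
    "project report due monday", "team meeting moved to friday",
    "please review the project report", "agenda for the team lunch",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                # fit on each training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold calls pipeline.fit on its training split, then scores on the rest,
# so held-out text never influences the vocabulary or IDF weights.
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1_macro")
print(f"macro F1 per fold: {scores}")
```

The same `pipeline` object can then be fit once on the full training split and saved as a single artifact, which keeps the vectorizer and classifier in sync at inference time.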

Using accuracy as the only metric on imbalanced classes

Symptom
Model reports 95% accuracy but never catches the minority class at all — the accuracy paradox.
Fix
Always report precision, recall, and F1 per class. Add class_weight='balanced' to LogisticRegression if one class has less than 30% representation.
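The accuracy paradox is easy to reproduce. The sketch below assumes a synthetic 95/5 label split and uses a majority-class `DummyClassifier` as the "model"; it shows why per-class metrics are needed rather than demonstrating any particular fix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Synthetic imbalanced labels: 95 ham (0) and 5 spam (1).
y_true = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to a majority-class baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
print(f"accuracy: {accuracy:.2f}")             # looks healthy
print(f"spam recall: {minority_recall:.2f}")   # reveals the problem

# Per-class precision, recall, and F1 in one table.
print(classification_report(
    y_true, y_pred, target_names=["ham", "spam"], zero_division=0
))
```

Any real classifier should be compared against this kind of baseline: if it does not beat the majority-class recall on the minority class, the headline accuracy is not telling you much.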

Not removing metadata when using benchmark datasets

Symptom
Model achieves near-perfect accuracy during dev but fails on real data — because it learned to read email headers, not the text content.
Fix
When using fetch_20newsgroups, always pass remove=('headers', 'footers', 'quotes'). In production, strip email headers, HTML tags, and boilerplate before vectorising.
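For the production half of that fix, a rough sketch of header and HTML stripping using only the standard library is shown below. The `extract_body_text` helper and the sample message are hypothetical, and the regex-based tag removal is deliberately crude; a proper HTML parser is safer for messy real-world mail.

```python
import re
from email import message_from_string
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher (assumption: simple markup)
WS_RE = re.compile(r"\s+")


def extract_body_text(raw_email):
    """Return the message body with headers, HTML tags, and extra whitespace removed."""
    message = message_from_string(raw_email)
    # For multipart mail, walk() yields every part; keep only the leaf parts.
    parts = message.walk() if message.is_multipart() else [message]
    chunks = [part.get_payload() for part in parts if not part.is_multipart()]
    text = " ".join(chunks)
    text = TAG_RE.sub(" ", text)  # drop tags before decoding entities
    text = unescape(text)         # turn &amp; and similar entities into characters
    return WS_RE.sub(" ", text).strip()


raw = (
    "From: promo@example.com\n"
    "Subject: Big win\n"
    "X-Mailer: bulk\n"
    "\n"
    "<p>Claim your <b>free</b> prize &amp; bonus</p>\n"
)
clean = extract_body_text(raw)
print(clean)
```

Running the same cleaning function at training time and at inference time matters as much as the cleaning itself, otherwise the model sees a different distribution in production.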

Ignoring out-of-vocabulary drift after deployment

Symptom
Accuracy steadily declines over weeks as new terminology appears in production data.
Fix
Monitor OOV rate daily. Set a threshold (e.g., 5%) to trigger automated retraining with updated vocabulary.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Why does TF-IDF down-weight common words, and can you walk me through a ...
Q02 · SENIOR
You've trained a spam classifier that achieves 97% accuracy on your test...
Q03 · SENIOR
A colleague suggests you should vectorize all your data first and then d...
Q04 · JUNIOR
Explain the difference between precision and recall in the context of a ...
Q01 of 04 · SENIOR

Why does TF-IDF down-weight common words, and can you walk me through a scenario where that behaviour actually hurts your classifier rather than helps it?

ANSWER
TF-IDF down-weights common words like 'the' or 'is' because they appear across all documents and carry no discriminative power. It can hurt when word frequencies shift between training and production, because IDF values are computed once at fit time and then stay frozen. For example, if 'iPhone' appears only 3 times in your training data, all in positive reviews, it gets a high IDF and the classifier learns a strong positive weight for it. In production, 'iPhone' appears in 80% of reviews (both positive and negative), and the model incorrectly pushes them all towards positive. Mitigations include refitting the vectorizer on recent data, using sublinear TF scaling, or capping IDF.
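The mechanism in this answer can be checked directly on a fitted vectorizer. The sketch below assumes four toy reviews (illustrative only) and reads IDF values through `vocabulary_` and `idf_`; it shows that a term in every document gets the minimum IDF, a rare term gets a higher one, and neither changes when production text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training reviews (illustrative only): "iphone" appears in just one of them.
train_reviews = [
    "the battery is great",
    "the screen is great",
    "the camera is poor",
    "the iphone is great",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_reviews)


def idf_by_term(fitted_vectorizer):
    """Map each vocabulary term to its learned IDF value."""
    return {
        term: float(fitted_vectorizer.idf_[index])
        for term, index in fitted_vectorizer.vocabulary_.items()
    }


idf_at_fit = idf_by_term(vectorizer)
print(f"idf('the')    = {idf_at_fit['the']:.3f}")     # in every document: lowest IDF
print(f"idf('iphone') = {idf_at_fit['iphone']:.3f}")  # in one document: higher IDF

# Production text where "iphone" is suddenly everywhere.
production_reviews = [
    "the iphone is poor",
    "the iphone screen is poor",
    "the iphone battery is great",
]
vectorizer.transform(production_reviews)  # transform() does not update IDF
idf_after = idf_by_term(vectorizer)
print(f"idf('iphone') after transform = {idf_after['iphone']:.3f}")
```

Only calling `fit` again on fresh data changes these weights, which is why scheduled refits matter when term frequencies drift.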
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between text classification and sentiment analysis?
02
How much training data do I need for text classification?
03
Can I use text classification for multi-label problems where one document has multiple categories?
04
How do I handle text in multiple languages?
05
Should I fine-tune the sentence transformer on my data or just use the pre-trained embeddings?