Text classification maps raw text to predefined labels using ML
TF-IDF vectorization converts words into numerical importance scores
Naive Bayes and Logistic Regression are fast, interpretable starters
Sentence transformers handle paraphrases but need GPU for production throughput
Biggest mistake: evaluating on accuracy alone when classes are imbalanced
Plain-English First
Imagine your email inbox has a bouncer at the door. Every incoming email gets a quick read, and the bouncer decides: 'spam' goes in the junk folder, 'important' lands in your inbox. Text classification is exactly that bouncer — a machine learning model that reads a piece of text and stamps it with a label. Your phone does it when it detects a toxic comment. Netflix does it when it reads your review and decides if you loved the show. It's the foundation of almost every app that needs to understand what humans are saying.
Every day, humans generate around 2.5 quintillion bytes of data — and most of it is unstructured text. Customer reviews, support tickets, social media posts, medical notes. None of that data is useful until a machine can read it and say 'this is a complaint', 'this is urgent', or 'this is spam'. Text classification is the ML technique that makes that possible, and it powers systems you use dozens of times a day without realising it.
The core problem text classification solves is deceptively simple: given a string of words, assign it to one of several predefined categories. But computers don't speak English — they speak numbers. So the real challenge is the pipeline that happens before the model even sees the data: cleaning text, converting it into numerical features, and choosing a model that can learn meaningful patterns from those features. Get that pipeline wrong and even the fanciest model won't save you.
By the end of this article you'll be able to build a complete, production-aware text classification pipeline in Python — from raw messy text all the way to a trained model making predictions. You'll understand why each step exists, not just how to run it. And you'll know the common traps that burn people in interviews and on the job.
How Machines Read Words: Vectorisation and Why It Matters
Before any model can classify text, you need to answer a fundamental question: how do you turn the sentence 'This product broke in two days' into something a mathematical model can process? The answer is vectorisation — converting text into arrays of numbers.
The most battle-tested approach is TF-IDF (Term Frequency–Inverse Document Frequency). It does two clever things at once. First, it counts how often a word appears in a document (TF). Second, it penalises words that appear in almost every document — like 'the' or 'is' — because they carry no useful signal (IDF). The result is a number that represents how distinctive a word is to a particular document.
Why not just count raw word frequencies? Because 'the' might be the most frequent word in every review, positive or negative. It tells you nothing. TF-IDF filters that noise out automatically.
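To make the weighting concrete, here is a minimal hand-rolled sketch for three toy reviews. It assumes the smoothed IDF variant idf = ln((1 + N) / (1 + df)) + 1, which is scikit-learn's default, and it leaves out the L2 normalisation that TfidfVectorizer also applies. The corpus and the helper name `tf_idf` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = [
    "the product is great",
    "the product broke",
    "the delivery is slow",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

def tf_idf(tokens):
    """Raw term count times smoothed IDF (no L2 normalisation)."""
    counts = Counter(tokens)
    return {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }

scores = tf_idf(tokenised[1])  # "the product broke"
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {score:.3f}")
```

In this sketch 'the' appears in every document and receives the lowest weight, while 'broke' appears in only one and receives the highest.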
The alternatives, word embeddings like Word2Vec or sentence transformers, are more powerful but also more complex. TF-IDF is the right starting point: fast, interpretable, and often good enough for structured datasets. Understand it deeply before reaching for a transformer.
vectorise_text.py (Python)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Simulating a small customer review dataset
# In a real project this would come from a CSV or database
reviews = [
    "This product is absolutely amazing and works perfectly",
    "Terrible quality, broke after two days, total waste of money",
    "Pretty good value for the price, happy with my purchase",
    "Worst purchase I have ever made, completely useless product",
    "Exceeded my expectations, will definitely buy again"
]
labels = ["positive", "negative", "positive", "negative", "positive"]

# TfidfVectorizer handles tokenisation, lowercasing, and IDF weighting
# max_features limits vocabulary size (important for memory on large datasets)
# stop_words='english' removes common words like 'the', 'is', 'and'
vectorizer = TfidfVectorizer(max_features=20, stop_words='english')

# fit_transform: learns the vocabulary AND converts text to numbers in one step
# Returns a sparse matrix: rows are documents, columns are words
tfidf_matrix = vectorizer.fit_transform(reviews)

# Let's see what vocabulary was learned
learned_vocabulary = vectorizer.get_feature_names_out()
print("Learned vocabulary:")
print(learned_vocabulary)
print()

# Convert sparse matrix to a readable DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=learned_vocabulary,
    index=[f"Review {i+1}" for i in range(len(reviews))]
)
print("TF-IDF scores per document (higher = more distinctive word):")
print(tfidf_df.round(3).to_string())
Call fit_transform() on your training set, then transform() on your test set. Never fit on the full dataset: that leaks test-set vocabulary and document-frequency statistics into training and artificially inflates accuracy scores.
Production Insight
If production text contains words not in the TF-IDF vocabulary, they're silently dropped — the model sees less signal and accuracy drifts.
Monitor out-of-vocabulary rate daily. Anything above 5% means your vectorizer needs retraining on fresh data.
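A rough sketch of such a monitor follows. The tokeniser is a simplified stand-in (TfidfVectorizer's own tokenisation and stop-word handling differ), and `training_vocabulary` is a hypothetical set; with a fitted vectorizer you would pass `set(vectorizer.vocabulary_)` instead.

```python
import re

def oov_rate(texts, vocabulary):
    """Fraction of tokens in `texts` that are missing from `vocabulary`."""
    total = 0
    unseen = 0
    for text in texts:
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            total += 1
            if token not in vocabulary:
                unseen += 1
    return unseen / total if total else 0.0

# Hypothetical training vocabulary, for illustration only
training_vocabulary = {"free", "prize", "claim", "meeting", "invoice", "your", "now"}

production_batch = [
    "Claim your free crypto wallet now",
    "Blockchain prize meeting",
]
rate = oov_rate(production_batch, training_vocabulary)
print(f"OOV rate: {rate:.1%}")
```

Logging this number per day gives a simple drift signal that can be compared against whatever threshold you settle on.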
Key Takeaway
TF-IDF rewards discriminative words by down-weighting common terms.
The vocabulary is static after training — any new word is invisible to the model.
Training Your First Classifier: Naive Bayes vs Logistic Regression
Now that text is numeric, you can feed it into a classifier. Two models dominate beginner-to-intermediate text classification: Multinomial Naive Bayes and Logistic Regression. They're both fast, interpretable, and work surprisingly well — and understanding why they work differently will save you a lot of tuning time.
Naive Bayes asks: 'Given this class label, what's the probability of seeing each word?' It calculates probabilities per word and multiplies them together. The 'naive' part is the assumption that each word's probability is independent of the others — clearly not true in real language, but the model still performs remarkably well on text data. It's extremely fast and memory-efficient.
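The following is a toy, from-scratch sketch of that calculation, not scikit-learn's implementation. It works in log space (summing logs instead of multiplying probabilities) and uses add-alpha smoothing; the training sentences and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

training_data = [
    ("win free prize now", "spam"),
    ("free cash prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

word_counts = defaultdict(Counter)   # class -> word -> count
class_counts = Counter()             # class -> number of documents
for text, label in training_data:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # words never seen in training are skipped in this sketch
        numerator = word_counts[label][word] + alpha
        denominator = total_words + alpha * len(vocabulary)
        score += math.log(numerator / denominator)
    return score

def predict(text):
    """Pick the class with the highest log score."""
    return max(class_counts, key=lambda label: log_score(text, label))

print(predict("free prize meeting"))
```

Without smoothing, a single word that never occurred with a class would zero out that class's probability, which is why alpha matters.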
Logistic Regression learns a weight for each word. Words strongly associated with 'positive' get high positive weights; words associated with 'negative' get negative weights. It then sums those weighted scores and passes them through a sigmoid function (softmax when there are more than two classes) to output a probability. It's slightly slower to train but tends to give better-calibrated probabilities and, with class weighting, copes better with imbalanced classes.
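As a small illustration of the scoring step only (not of training), the sketch below uses hypothetical, hand-picked weights and simple word presence as features; a real model would learn the weights and multiply them by TF-IDF values.

```python
import math

# Hypothetical learned weights: positive values push towards "positive"
weights = {"amazing": 2.1, "great": 1.4, "broke": -1.9, "useless": -2.3}
bias = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def positive_probability(text):
    """Weighted sum of word weights, squashed into (0, 1) by the sigmoid."""
    z = bias + sum(weights.get(word, 0.0) for word in text.lower().split())
    return sigmoid(z)

p_good = positive_probability("amazing product great value")
p_bad = positive_probability("useless it broke")
print(f"{p_good:.3f} {p_bad:.3f}")
```

Because each word contributes an additive term, the weights can be read directly as evidence for or against a class.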
For a quick baseline, reach for Naive Bayes. For production pipelines where calibration matters (you need 'how confident is the model?'), use Logistic Regression.
train_text_classifier.py (Python)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Using a real benchmark dataset: 20 Newsgroups posts across different topics
# We're selecting 3 categories to keep it manageable and interpretable
categories_to_classify = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns']

print("Loading 20 Newsgroups dataset...")
newsgroups_data = fetch_20newsgroups(
    subset='all',
    categories=categories_to_classify,
    remove=('headers', 'footers', 'quotes')  # remove metadata that makes classification trivially easy
)
post_texts = newsgroups_data.data
category_labels = newsgroups_data.target
category_names = newsgroups_data.target_names
print(f"Total documents: {len(post_texts)}")
print(f"Categories: {category_names}")
print()

# Split into training and test sets; 80/20 is a solid default
# stratify=category_labels ensures each class has proportional representation in both sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    post_texts, category_labels,
    test_size=0.2,
    random_state=42,
    stratify=category_labels
)

# --- PIPELINE 1: Naive Bayes ---
# Pipeline chains steps so the same transformations apply consistently to train and test
# The vectorizer is only ever fitted on the data passed to fit(), which avoids leakage
naive_bayes_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)  # include both single words AND two-word phrases
    )),
    ('naive_bayes_classifier', MultinomialNB(alpha=0.1))  # alpha is the smoothing parameter
])
naive_bayes_pipeline.fit(train_texts, train_labels)
nb_predictions = naive_bayes_pipeline.predict(test_texts)
nb_accuracy = accuracy_score(test_labels, nb_predictions)
print("=== Naive Bayes Results ===")
print(f"Accuracy: {nb_accuracy:.3f}")
print(classification_report(test_labels, nb_predictions, target_names=category_names))

# --- PIPELINE 2: Logistic Regression ---
logistic_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)
    )),
    ('logistic_classifier', LogisticRegression(
        max_iter=1000,   # increase from default 100 to ensure convergence
        C=1.0,           # regularisation strength: lower C = more regularisation
        solver='lbfgs'   # fits a multinomial (softmax) model for multi-class targets;
                         # the multi_class argument is deprecated in recent scikit-learn
    ))
])
logistic_pipeline.fit(train_texts, train_labels)
lr_predictions = logistic_pipeline.predict(test_texts)
lr_accuracy = accuracy_score(test_labels, lr_predictions)
print("\n=== Logistic Regression Results ===")
print(f"Accuracy: {lr_accuracy:.3f}")
print(classification_report(test_labels, lr_predictions, target_names=category_names))

# --- Making predictions on new, unseen text ---
new_posts = [
    "The astronauts launched successfully to the International Space Station",
    "The goalie made an incredible save in overtime to win the championship",
    "The senate voted on the second amendment legislation today"
]
print("\n=== Predictions on new posts ===")
new_predictions = logistic_pipeline.predict(new_posts)
new_probabilities = logistic_pipeline.predict_proba(new_posts)
for post, prediction, probabilities in zip(new_posts, new_predictions, new_probabilities):
    predicted_category = category_names[prediction]
    confidence = max(probabilities)
    print(f"Post: '{post[:55]}...'")
    print(f"  Predicted: {predicted_category} (confidence: {confidence:.1%})")
    print()
Output
Post: 'The astronauts launched successfully to the Internati...'
  Predicted: sci.space (confidence: 97.2%)
Post: 'The goalie made an incredible save in overtime to win...'
  Predicted: rec.sport.hockey (confidence: 98.6%)
Post: 'The senate voted on the second amendment legislation ...'
  Predicted: talk.politics.guns (confidence: 89.1%)
Interview Gold: Why use Pipeline instead of separate steps?
Pipeline prevents data leakage in cross-validation. If you vectorize first then cross-validate, vocabulary from the test fold bleeds into training. Pipeline ensures fit() only ever sees training data — a subtle but critical correctness guarantee.
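The leak itself can be shown without any library. In this sketch `build_vocabulary` is a hypothetical stand-in for a vectorizer's fit() step, and the four documents are invented.

```python
def build_vocabulary(texts):
    """Stand-in for a vectorizer's fit(): collect every token it sees."""
    return {token for text in texts for token in text.lower().split()}

documents = [
    "refund for duplicate charge",
    "app crashes on launch",
    "cannot reset password",
    "blockchain wallet locked",   # will be the held-out document
]
train_docs, test_docs = documents[:3], documents[3:]

# Wrong: vocabulary fitted on everything, then split afterwards
leaky_vocabulary = build_vocabulary(documents)
# Right: vocabulary fitted on the training portion only
clean_vocabulary = build_vocabulary(train_docs)

leaked_tokens = leaky_vocabulary - clean_vocabulary
print(sorted(leaked_tokens))
```

Every token in `leaked_tokens` is something the model could only know about by peeking at held-out data. A Pipeline passed to a cross-validation helper refits the vectorizer inside each fold, so this set stays empty by construction.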
Production Insight
Naive Bayes tends to produce extreme probabilities (close to 0 or 1) even when wrong.
If you need reliable confidence scores in production, switch to Logistic Regression or calibrate the Naive Bayes output on held-out data with Platt (sigmoid) scaling or isotonic regression.
Key Takeaway
Naive Bayes is fast, logistic regression is calibrated.
Pick naive Bayes for baselines, logistic regression for production decisions.
When TF-IDF Isn't Enough: Upgrading to Sentence Transformers
TF-IDF is powerful, but it's blind to meaning. The sentences 'The car broke down' and 'My vehicle stopped working' use completely different words, so TF-IDF treats them as unrelated. But semantically, they mean the same thing. For a customer support classifier that needs to route 'vehicle stopped working' to the auto-repair team, that blindness is a real problem.
Sentence transformers solve this by converting an entire sentence into a dense vector (an embedding) where similar meanings produce similar vectors. They're pre-trained on massive text corpora, so they already understand that 'car' and 'vehicle' live in the same neighbourhood of meaning. You're essentially downloading years of language learning and plugging it into your classifier.
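"Similar vectors" is usually measured with cosine similarity. The sketch below uses hand-made 4-dimensional vectors as stand-ins for real embeddings, so the numbers are illustrative only and do not come from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real embeddings
# (a model such as all-MiniLM-L6-v2 would produce 384 dimensions).
car_broke_down = [0.9, 0.8, 0.1, 0.0]
vehicle_stopped = [0.8, 0.9, 0.2, 0.1]
invoice_wrong = [0.1, 0.0, 0.9, 0.8]

similar = cosine_similarity(car_broke_down, vehicle_stopped)
different = cosine_similarity(car_broke_down, invoice_wrong)
print(f"paraphrase pair: {similar:.2f}, unrelated pair: {different:.2f}")
```

A classifier trained on embeddings benefits from exactly this property: paraphrases land close together even when they share no words.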
The tradeoff? Speed and resource cost. TF-IDF vectorisation takes milliseconds; generating sentence embeddings on CPU can take seconds per batch. For most production systems processing thousands of requests per minute, you'll need a GPU or a caching layer.
The pattern here is simple: start with TF-IDF + Logistic Regression as your baseline. If accuracy plateaus and you have labelled data, upgrade to sentence embeddings. You'll often see a meaningful jump, especially on short texts or paraphrase-heavy data.
sentence_transformer_classifier.py (Python)
# Install first: pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Real-world scenario: classifying customer support tickets
# Notice how many rows have paraphrased meaning; TF-IDF would struggle here
support_tickets = [
    # Billing issues
    "I was charged twice for my subscription this month",
    "There's a duplicate payment on my credit card statement",
    "My invoice shows the wrong amount, please fix this",
    "You billed me for a plan I never signed up for",
    "I need a refund for the extra charge on my account",
    "The payment went through twice and I want my money back",
    # Technical issues
    "The app keeps crashing every time I open it",
    "Your software won't start on my Windows 11 laptop",
    "I'm getting a black screen when I launch the application",
    "The program freezes after about 30 seconds of use",
    "Cannot log into the platform, it just hangs on the loading screen",
    "The mobile app stopped working after the latest update",
    # Account access
    "I forgot my password and the reset email never arrived",
    "Locked out of my account after too many login attempts",
    "My account was suspended but I didn't violate any rules",
    "Can't access my profile, says my email is not recognised",
    "The two-factor authentication code isn't working for me",
    "I need to recover access to my account urgently"
]
ticket_categories = (
    ["billing"] * 6 +
    ["technical"] * 6 +
    ["account_access"] * 6
)

# Load a lightweight, fast model: a good balance of speed and quality
# 'all-MiniLM-L6-v2' produces 384-dimensional embeddings and runs in ~50ms per sentence on CPU
print("Loading sentence transformer model (downloads ~80MB on first run)...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode all tickets into dense vector representations
# Each ticket becomes a 384-dimensional vector; semantically similar tickets will cluster together
print("Generating sentence embeddings...")
ticket_embeddings = embedding_model.encode(
    support_tickets,
    show_progress_bar=True,
    batch_size=16  # process in batches to manage memory
)
print(f"\nEmbedding shape: {ticket_embeddings.shape}")
print(f"Each ticket is now a vector of {ticket_embeddings.shape[1]} numbers")

# Split data; stratify ensures all 3 classes appear in both sets
train_embeddings, test_embeddings, train_labels, test_labels = train_test_split(
    ticket_embeddings, ticket_categories,
    test_size=0.33,
    random_state=42,
    stratify=ticket_categories
)

# Logistic Regression works beautifully on top of embeddings
# The embeddings do the heavy lifting; LR just learns the decision boundary
classifier = LogisticRegression(max_iter=1000, C=1.0)
classifier.fit(train_embeddings, train_labels)
test_predictions = classifier.predict(test_embeddings)
print("\n=== Classification Report ===")
print(classification_report(test_labels, test_predictions))

# --- The real power: paraphrase robustness ---
# These sentences use completely different words from the training data
unseen_tickets = [
    "I've been double-billed and demand an immediate reimbursement",   # billing
    "The desktop client is unresponsive and will not open at all",     # technical
    "My login credentials are no longer being accepted by the system"  # account_access
]
print("\n=== Paraphrase Robustness Test ===")
unseen_embeddings = embedding_model.encode(unseen_tickets)
unseen_predictions = classifier.predict(unseen_embeddings)
unseen_probabilities = classifier.predict_proba(unseen_embeddings)
for ticket, prediction, probs in zip(unseen_tickets, unseen_predictions, unseen_probabilities):
    confidence = max(probs)
    print(f"Ticket: '{ticket}'")
    print(f"Predicted: {prediction} ({confidence:.1%} confidence)")
    print()
Output
Loading sentence transformer model (downloads ~80MB on first run)...
Ticket: 'I've been double-billed and demand an immediate reimbursement'
Predicted: billing (96.3% confidence)
Ticket: 'The desktop client is unresponsive and will not open at all'
Predicted: technical (94.7% confidence)
Ticket: 'My login credentials are no longer being accepted by the system'
Predicted: account_access (91.2% confidence)
Watch Out: Small datasets + sentence transformers = misleading accuracy
Sentence transformers shine on 500+ examples per class. On tiny datasets they can overfit just as badly as any other model — the embeddings are good, but the classifier still needs enough data to learn the decision boundary. Always check performance on truly held-out data, not just a 3-example test.
Production Insight
Running sentence transformers on CPU at scale will kill your latency budget.
Cache embeddings by input text with a TTL of a few hours, and offload encoding to a GPU microservice if possible.
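One way such a cache could look is sketched below. It is an in-memory toy under stated assumptions: `fake_embed` is a hypothetical stand-in for the real `encode()` call, nothing is evicted for size, and a shared store (for example Redis) would be the more likely choice across multiple workers.

```python
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # text -> (expiry timestamp, embedding)

    def get_or_compute(self, text, compute):
        now = time.monotonic()
        entry = self._store.get(text)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit
        value = compute(text)          # cache miss or expired entry
        self._store[text] = (now + self.ttl_seconds, value)
        return value

calls = []

def fake_embed(text):
    """Stand-in for embedding_model.encode(text); records each call."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]

cache = TTLCache(ttl_seconds=3 * 60 * 60)   # "a few hours"
first = cache.get_or_compute("app keeps crashing", fake_embed)
second = cache.get_or_compute("app keeps crashing", fake_embed)
print(len(calls))
```

The second lookup is served from the cache, so the expensive encoder runs once per unique text within the TTL window.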
Key Takeaway
Sentence transformers understand meaning, not just word overlap.
They cost compute — use them only when TF-IDF plateaus and you have the infrastructure.
Evaluating Your Classifier Honestly: Beyond Raw Accuracy
Raw accuracy is one of the most misleading metrics in machine learning. If 95% of your emails are legitimate and 5% are spam, a model that always predicts 'not spam' achieves 95% accuracy — and catches zero spam. This is called the accuracy paradox, and it's the #1 way data scientists mislead themselves and their stakeholders.
The three metrics that actually matter are precision, recall, and F1 score. Precision answers: 'Of all the emails I labelled as spam, what fraction actually were spam?' High precision means few false alarms. Recall answers: 'Of all the actual spam emails, how many did I catch?' High recall means few things slip through. F1 score is the harmonic mean of both — it punishes you if either one is low.
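The accuracy paradox and the three metrics can be reproduced in a few lines. This is a from-scratch sketch for a single positive class; returning 0.0 when a metric is undefined is a convention chosen here, and the second prediction list is made up.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="spam"):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 95 legitimate emails and 5 spam, as in the example above
true_labels = ["ham"] * 95 + ["spam"] * 5

# A "model" that always predicts ham
always_ham = ["ham"] * 100
accuracy = sum(t == p for t, p in zip(true_labels, always_ham)) / 100
print(accuracy, precision_recall_f1(true_labels, always_ham))

# A model that catches 4 of 5 spam and raises 2 false alarms
better = ["ham"] * 93 + ["spam"] * 2 + ["spam"] * 4 + ["ham"]
print(precision_recall_f1(true_labels, better))
```

The always-ham model scores 95% accuracy with zero recall, while the second model has lower accuracy on paper but actually catches spam.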
Which one to optimise depends entirely on the business cost of each error type. In medical diagnosis, you optimise recall — missing a real cancer (false negative) is catastrophic. In email spam filtering, you optimise precision — flagging important emails as spam (false positive) destroys trust. Always have this conversation before picking your metric.
The confusion matrix visualises all four outcomes at once and should be the first thing you generate after training.
evaluate_classifier.py (Python)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score

# Using medical vs non-medical newsgroups to simulate a high-stakes classification scenario
medical_categories = ['sci.med', 'sci.space', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(
    subset='all',
    categories=medical_categories,
    remove=('headers', 'footers', 'quotes')
)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    newsgroups.data,
    newsgroups.target,
    test_size=0.2,
    random_state=42,
    stratify=newsgroups.target
)

classification_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=15000, stop_words='english', ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000, C=0.5))
])
classification_pipeline.fit(train_texts, train_labels)
test_predictions = classification_pipeline.predict(test_texts)
test_probabilities = classification_pipeline.predict_proba(test_texts)
category_names = newsgroups.target_names

# --- 1. Full Classification Report ---
print("=== Full Classification Report ===")
print(classification_report(test_labels, test_predictions, target_names=category_names))

# --- 2. Confusion Matrix ---
cm = confusion_matrix(test_labels, test_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',  # show integer counts, not scientific notation
    cmap='Blues',
    xticklabels=category_names,
    yticklabels=category_names
)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix: Text Classifier')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print("Confusion matrix saved to confusion_matrix.png")

# --- 3. Cross-validation for a robust estimate ---
# A single train/test split can get lucky or unlucky
# 5-fold CV gives you mean +/- 2 std, a much more honest picture
cv_scores = cross_val_score(
    classification_pipeline,
    newsgroups.data,
    newsgroups.target,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1  # use all available CPU cores
)
print("\n=== 5-Fold Cross-Validation ===")
print(f"F1 Macro scores: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# --- 4. Identify the model's worst confusions ---
print("\n=== Top Misclassifications (first 3) ===")
test_texts_array = np.array(test_texts)
incorrect_mask = test_predictions != test_labels
incorrect_texts = test_texts_array[incorrect_mask]
incorrect_true = np.array(test_labels)[incorrect_mask]
incorrect_predicted = np.array(test_predictions)[incorrect_mask]
for i in range(min(3, len(incorrect_texts))):
    true_category = category_names[incorrect_true[i]]
    predicted_category = category_names[incorrect_predicted[i]]
    snippet = incorrect_texts[i][:100].replace('\n', ' ')
    print(f"\nTrue: {true_category} | Predicted: {predicted_category}")
    print(f"Text: '{snippet}...'")
Output
=== Full Classification Report ===
                  precision    recall  f1-score   support

rec.sport.hockey       0.98      0.96      0.97       186
         sci.med       0.93      0.95      0.94       197
       sci.space       0.95      0.95      0.95       197

        accuracy                           0.95       580
       macro avg       0.95      0.95      0.95       580
    weighted avg       0.95      0.95      0.95       580

Confusion matrix saved to confusion_matrix.png

=== 5-Fold Cross-Validation ===
F1 Macro scores: [0.944 0.951 0.948 0.939 0.955]
Mean F1: 0.947 (+/- 0.012)

=== Top Misclassifications (first 3) ===

True: sci.med | Predicted: sci.space
Text: 'The radiation treatment protocol showed significant side effects in patients over 60...'

True: sci.space | Predicted: sci.med
Text: 'The biological experiments on board the station revealed unexpected cellular damage...'

True: sci.med | Predicted: sci.space
Text: 'Cosmic ray exposure during long duration missions presents a significant health risk...'
Interview Gold: Accuracy vs F1 — know this cold
When class distribution is balanced, accuracy and F1 macro will be close. When classes are imbalanced (which is almost always the case in production), they diverge dramatically. The misclassification examples above show something even more valuable: they reveal why the model gets confused, which tells you exactly what training data to collect next.
Production Insight
In production, accuracy is a vanity metric. Precision and recall tell you the cost of each mistake.
Track both per class, especially the minority class — that's where business impact hides.
Key Takeaway
Choose precision or recall based on business cost of false positives vs false negatives.
F1 balances both and is the sensible default when neither error type clearly dominates; never optimise accuracy alone.
Choosing the Right Model: Trade-offs and Deployment Considerations
By now you've seen three approaches: TF-IDF + Naive Bayes, TF-IDF + Logistic Regression, and sentence transformers + Logistic Regression. Which one should you actually put into production? That depends on your latency budget, data size, and interpretability needs.
If you need sub-millisecond inference on a CPU and your vocabulary is stable, TF-IDF + Logistic Regression is hard to beat. It's what many production text classification systems use: simple, fast, and you can inspect the top coefficients to explain predictions.
If your text contains lots of paraphrasing or domain-specific jargon that changes over time, sentence transformers will give better accuracy but at a cost. A single embedding on CPU takes ~50ms. At 100 requests per second, that's 5 seconds of compute per second — you'll need a GPU or a caching layer.
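The arithmetic behind that claim can be turned into a rough capacity estimate. The 70% target utilisation is an assumption added here for headroom, not a figure from the measurements above, and `workers_needed` is a hypothetical helper.

```python
import math

def workers_needed(latency_seconds, requests_per_second, target_utilisation=0.7):
    """Return (busy seconds per second, workers needed to stay under the target)."""
    busy_seconds_per_second = latency_seconds * requests_per_second
    workers = math.ceil(busy_seconds_per_second / target_utilisation)
    return busy_seconds_per_second, workers

load, workers = workers_needed(latency_seconds=0.050, requests_per_second=100)
print(f"{load:.1f} s of compute per second -> at least {workers} CPU workers")
```

Five seconds of work arriving every second cannot be served by one core, which is why the options are parallel workers, batching on a GPU, or avoiding the work with a cache.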
An often-overlooked option is a smaller, distilled sentence transformer model such as 'all-MiniLM-L6-v2' (384 dimensions) instead of the larger 'all-mpnet-base-v2' (768 dimensions). The smaller model is roughly 4-5x faster, typically at the cost of a small accuracy drop (on the order of 1-2% on many benchmarks); measure both on your own data before committing.
Finally, consider the deployment pattern: batch prediction vs real-time. Batch pipelines can afford sentence transformer inference on CPU; real-time APIs cannot without scaling.
model_comparison_benchmark.py (Python)
import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

sample_texts = [
    "This product is amazing and works perfectly",
    "Terrible quality, broke after two days",
    "Pretty good value for the price",
    "Worst purchase ever, completely useless",
    "Exceeded my expectations, will buy again"
] * 200  # 1000 texts for benchmarking

# Deterministic sentiment labels (1 = positive, 0 = negative), repeated to match sample_texts
labels = np.array([1, 0, 1, 0, 1] * 200)

# ---- TF-IDF + Logistic Regression ----
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = vectorizer.fit_transform(sample_texts)
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_tfidf, labels)

# Times the classifier only; X_tfidf is already vectorised
start = time.perf_counter()
for _ in range(100):
    model_lr.predict(X_tfidf[:1])
print(f"TF-IDF + LR inference (1 sample): {(time.perf_counter()-start)/100*1000:.2f} ms")

# ---- Sentence Transformer + LR ----
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode the five unique texts: this warms up the model and gives the
# classifier both classes to train on (fitting on a single sample would fail)
X_emb = embedder.encode(sample_texts[:5])
model_emb = LogisticRegression(max_iter=1000)
model_emb.fit(X_emb, labels[:5])

start = time.perf_counter()
for _ in range(100):
    emb = embedder.encode(sample_texts[:1])
    model_emb.predict(emb)
print(f"SentenceTransformer + LR inference (1 sample): {(time.perf_counter()-start)/100*1000:.2f} ms")
Output
TF-IDF + LR inference (1 sample): 0.34 ms
SentenceTransformer + LR inference (1 sample): 48.72 ms
If: Batch processing, no real-time constraint, highest accuracy needed → Use: Full sentence transformer (mpnet) + fine-tune on your data
If: Interpretability is critical (regulatory, explainability) → Use: TF-IDF + Logistic Regression (inspect top coefficients per class)
Production incident (post-mortem, severity: high)
The Deployed Spam Filter That Stopped Catching Cryptocurrency Emails
Symptom
Users started reporting spam in their inboxes. The classification report showed recall for 'spam' dropping from 0.94 to 0.51 over 14 days.
Assumption
The team assumed the TF-IDF vectorizer's vocabulary from training data was sufficient. They had used max_features=5000 on a 2019 email dataset.
Root cause
The training data contained zero emails mentioning cryptocurrency. When new spam arrived with words like 'crypto', 'blockchain', 'wallet', the vectorizer simply ignored them — they were out-of-vocabulary and dropped. The model no longer had any signal to distinguish spam from ham.
Fix
1. Collect a representative sample of new spam (200 emails per new topic) and retrain with expanded vocabulary.
2. Set up a vocabulary drift monitor: track what fraction of tokens in production emails are OOV each day.
3. Use a word embedding model (e.g., FastText) instead of TF-IDF to handle unseen words via subword information.
Key lesson
Never assume training vocabulary covers production vocabulary — monitor OOV rate as a key performance indicator.
On high-traffic systems, set up an automated retraining pipeline that triggers when OOV rate exceeds 5%.
For domains where new terminology emerges (tech, finance, medicine), prefer subword-aware embeddings over fixed-vocabulary vectorizers.
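To illustrate why subword information helps, the sketch below extracts character n-grams in the spirit of FastText (boundary markers `<` and `>`, n-gram lengths chosen here as 3 to 5) and compares words by n-gram overlap. It is an illustration of the idea only; FastText itself learns a vector per n-gram and combines them.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of `word`, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }

def jaccard(a, b):
    """Overlap between two sets: 0.0 means disjoint, 1.0 means identical."""
    return len(a & b) / len(a | b)

seen_in_training = char_ngrams("cryptography")
new_in_production = char_ngrams("cryptocurrency")
unrelated = char_ngrams("newsletter")

print(f"{jaccard(seen_in_training, new_in_production):.2f}")
print(f"{jaccard(seen_in_training, unrelated):.2f}")
```

A word never seen in training still shares pieces such as `<cr` and `ypt` with known words, so a subword-aware model has something to work with where a fixed vocabulary has nothing.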
Production debug guide
Symptom → Action: practical steps to isolate and fix common production issues
Symptom · 01
Model predicts only one class for all inputs
→
Fix
Check class balance in training data. If severely imbalanced, enable class_weight='balanced' in the classifier or use oversampling. Also verify that the vectorizer is not dropping all meaningful tokens due to wrong stop_words setting.
Symptom · 02
High accuracy on test set but poor on live data
→
Fix
Compare vocabulary overlap between train and production. Run vectorizer.transform() on a batch of production samples and examine the resulting sparse matrix — if most rows are all zeros, you have vocabulary drift.
Symptom · 03
Confidence scores are too high (near 1.0) even for wrong predictions
→
Fix
Logistic Regression may be overfit. Increase regularisation (lower C) or calibrate the probabilities with Platt scaling. For Naive Bayes, extreme scores are expected; check that smoothing (alpha) is greater than zero so that a word never seen with a class does not drive that class's probability to zero.
Symptom · 04
Inference latency spikes under load
→
Fix
If using sentence transformers, cache embeddings per unique text with a TTL. For TF-IDF, ensure the vectorizer uses a sparse matrix format and avoid converting to dense arrays before prediction.
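For the first symptom, it helps to see what class_weight='balanced' does. The sketch reproduces the heuristic scikit-learn documents for that setting, n_samples / (n_classes * count), on a made-up 95/5 label split.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up, heavily imbalanced training labels
training_labels = ["ham"] * 950 + ["spam"] * 50
weights = balanced_class_weights(training_labels)
print(weights)
```

Each minority-class mistake is weighted far more heavily in the loss than a majority-class one, which discourages the model from collapsing to a single class.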
Quick Debug Cheat Sheet for Text Classifiers
Run these commands and checks when your text classifier misbehaves in production
Model misclassifies new data with unseen words
Immediate action
Check OOV rate: count tokens in production text that are not in vectorizer.vocabulary_
Fix now
Retrain the vectorizer on a combined dataset of old and new samples, raising max_features (for example to 20000) if the vocabulary has grown
Model predicts same class for all instances
Immediate action
Display class distribution in latest batch of predictions
Commands
python -c "import numpy as np; preds=np.load('preds.npy'); print(np.bincount(preds))"
Check training labels: print(label_encoder.classes_) and count per class
Fix now
Add class_weight='balanced' to classifier and re-train with shuffled data
Prediction confidence not matching actual accuracy
Immediate action
Generate reliability diagram: bin predictions by confidence and compute accuracy per bin
Commands
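from sklearn.calibration import calibration_curve; prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)  # binary labels, positive-class probabilities
Use `predict_proba` and check histogram of max probabilities
Fix now
Apply Platt scaling by wrapping the classifier in CalibratedClassifierCV(method='sigmoid') and fitting it on a held-out validation set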
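The reliability check described in this entry can also be sketched by hand. The confidences and correctness flags below are hypothetical values, and the equal-width binning is one simple choice among several.

```python
def reliability_bins(confidences, correct, n_bins=5):
    """Group predictions by confidence; return (mean confidence, accuracy, count) per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append((mean_conf, accuracy, len(members)))
    return rows

# Hypothetical max-probabilities and whether each prediction was right
confidences = [0.95, 0.97, 0.99, 0.92, 0.65, 0.62, 0.55, 0.58]
correct = [True, False, True, False, True, True, False, True]

rows = reliability_bins(confidences, correct)
for mean_conf, accuracy, count in rows:
    print(f"confidence {mean_conf:.2f} -> accuracy {accuracy:.2f} (n={count})")
```

A well-calibrated model has accuracy close to mean confidence in every bin; a bin with high confidence and much lower accuracy is the overconfidence this entry is about.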
Aspect               | TF-IDF + Logistic Regression            | Sentence Transformers + LR
---------------------|-----------------------------------------|-----------------------------------------------
Training speed       | Very fast (seconds)                     | Slow if fine-tuning (minutes to hours)
Inference speed      | < 1 ms per document                     | 50-500 ms per document (CPU)
Handles paraphrases  | No, word overlap only                   | Yes, semantic similarity
Data requirement     | Works well from ~500 examples           | Needs 500+ per class for good boundaries
Interpretability     | High, inspect word weights directly     | Low, embedding space is opaque
Memory footprint     | Sparse matrix, very light               | Dense vectors of 384 to 768 dimensions
Best for             | High-volume, structured text, baseline  | Short text, paraphrase-heavy, quality matters
GPU required         | No                                      | Recommended for production throughput
Multilingual support | With separate models per language       | A single multilingual model can cover 50+ languages
Key takeaways
1. TF-IDF turns words into numbers by rewarding distinctiveness, not frequency: stop words get low scores because they appear everywhere and carry no signal.
2. Always wrap your vectorizer and classifier in a sklearn Pipeline: it's not just convenience, it's the simplest way to guarantee no data leakage during cross-validation.
3. Optimise precision when false positives are costly (spam filters, content moderation), and optimise recall when false negatives are costly (medical screening, fraud detection): this decision should come before model selection.
4. Sentence transformers are the upgrade path when TF-IDF accuracy plateaus: they understand meaning, not just word overlap, making them dramatically better for short text and paraphrase-heavy domains.
5. Monitor out-of-vocabulary rate in production: if it climbs above 5%, your model is slowly going blind to new words. Automate retraining.
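As a sketch of the Pipeline takeaway, the example below cross-validates a TF-IDF plus Logistic Regression pipeline so that the vectorizer is refitted on the training part of every fold. The tiny dataset is invented and far too small for meaningful scores; only the wiring is the point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "cheap pills limited offer",
    "you won a lottery prize", "click to claim free cash",
    "meeting moved to friday", "please review the attached report",
    "lunch at noon tomorrow", "project deadline is next week",
    "can you send the invoice", "notes from the morning meeting",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # fitted inside each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(scores)
```

Because `cross_val_score` clones and fits the whole pipeline per fold, the vocabulary and IDF weights never see the hold-out fold.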
Common mistakes to avoid
4 patterns
×
Fitting the vectorizer on the full dataset before splitting
Symptom
Suspiciously high accuracy that collapses when you deploy — because test data leaked into training vocabulary.
Fix
Always split first, then fit_transform on train only, transform on test. Use sklearn Pipeline to make this impossible to get wrong.
×
Using accuracy as the only metric on imbalanced classes
Symptom
Model reports 95% accuracy but never catches the minority class at all — the accuracy paradox.
Fix
Always report precision, recall, and F1 per class. Add class_weight='balanced' to LogisticRegression if one class has less than 30% representation.
×
Not removing metadata when using benchmark datasets
Symptom
Model achieves near-perfect accuracy during dev but fails on real data — because it learned to read email headers, not the text content.
Fix
When using fetch_20newsgroups, always pass remove=('headers', 'footers', 'quotes'). In production, strip email headers, HTML tags, and boilerplate before vectorising.
×
Ignoring out-of-vocabulary drift after deployment
Symptom
Accuracy steadily declines over weeks as new terminology appears in production data.
Fix
Monitor OOV rate daily. Set a threshold (e.g., 5%) to trigger automated retraining with updated vocabulary.
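A minimal sketch of the OOV monitor follows. It assumes the vectorizer uses scikit-learn's default lowercasing and token pattern (a vectorizer configured with stop words or n-grams would need matching logic); the `vocabulary` dict stands in for `vectorizer.vocabulary_`, and the batch is invented.

```python
import re

# Mirrors the default token_pattern of scikit-learn's text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def oov_rate(texts, vocabulary):
    """Fraction of tokens in `texts` that are missing from `vocabulary`."""
    total = unknown = 0
    for text in texts:
        for token in TOKEN_PATTERN.findall(text.lower()):
            total += 1
            if token not in vocabulary:
                unknown += 1
    return unknown / total if total else 0.0


# Stand-in for vectorizer.vocabulary_ (token -> column index).
vocabulary = {"refund": 0, "order": 1, "late": 2, "my": 3}
batch = ["Refund my order", "order late again", "vape pods refund"]

rate = oov_rate(batch, vocabulary)
needs_retraining = rate > 0.05  # the 5% threshold suggested above
print(f"OOV rate: {rate:.1%}, retrain: {needs_retraining}")
```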
Interview Questions on This Topic
Q01 of 04 · SENIOR
Why does TF-IDF down-weight common words, and can you walk me through a scenario where that behaviour actually hurts your classifier rather than helps it?
ANSWER
TF-IDF down-weights common words like 'the' or 'is' because they appear in nearly every document and carry little discriminative power. It can hurt when word frequencies shift between training and production, because IDF values are frozen when the vectorizer is fitted. For example, if 'iPhone' appears only 3 times in your training data, all in positive reviews, it gets a high IDF and the classifier learns a strong positive weight for it. If 'iPhone' then appears in 80% of production reviews (both positive and negative), that stale weight pushes many of them towards the positive class incorrectly. Mitigations include refitting the vectorizer on recent data, using sublinear TF scaling, or capping IDF.
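To make the answer concrete, the sketch below computes the smoothed IDF that scikit-learn's `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`. The corpus size of 1,000 documents is an assumption added for the illustration; the source only gives the count of 3 for 'iPhone'.

```python
import math


def smoothed_idf(doc_freq, n_docs):
    """IDF with smoothing: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 1000  # assumed corpus size, for illustration only
idf_the = smoothed_idf(doc_freq=1000, n_docs=n_docs)    # appears in every document
idf_iphone = smoothed_idf(doc_freq=3, n_docs=n_docs)    # appears in 3 documents

print(f"idf('the')    = {idf_the:.2f}")
print(f"idf('iphone') = {idf_iphone:.2f}")
```

The rare term's weight is several times that of the ubiquitous one, and it keeps that value in production until the vectorizer is refitted.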
Q02 of 04 · SENIOR
You've trained a spam classifier that achieves 97% accuracy on your test set, but your client says it's missing too many spam emails in production. What metric should you have been optimising for, and how would you adjust the model?
ANSWER
The client's complaint is about recall: too many false negatives. Accuracy was misleading because the class distribution is imbalanced (most emails are not spam). I would switch to optimising recall for the spam class, possibly at the expense of precision. Adjustments: (1) Lower the decision threshold for the spam class from 0.5 to something like 0.3; this catches more spam but also increases false positives. (2) Use class_weight='balanced' during training. (3) Collect more spam examples or use oversampling. (4) If the extra false positives are acceptable, deploy the lower-threshold model and monitor user complaints. This trade is justified when the business cost of a missed spam email (a phished user) is higher than that of a false alarm (a user moving an email out of the spam folder).
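The effect of lowering the threshold can be sketched with a handful of invented spam probabilities. In practice `spam_probs` would come from `predict_proba` on a validation set; the numbers and the helper name are illustrative only.

```python
def recall_and_precision(threshold, spam_probs, is_spam):
    """Recall and precision for the spam class at a given decision threshold."""
    predicted = [p >= threshold for p in spam_probs]
    tp = sum(1 for pred, actual in zip(predicted, is_spam) if pred and actual)
    fp = sum(1 for pred, actual in zip(predicted, is_spam) if pred and not actual)
    fn = sum(1 for pred, actual in zip(predicted, is_spam) if not pred and actual)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Invented scores: four spam emails followed by four legitimate ones.
spam_probs = [0.92, 0.55, 0.41, 0.35, 0.33, 0.20, 0.10, 0.05]
is_spam = [True, True, True, True, False, False, False, False]

for threshold in (0.5, 0.3):
    recall, precision = recall_and_precision(threshold, spam_probs, is_spam)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

On this toy data the lower threshold recovers the spam that was missed at 0.5, at the price of one legitimate email being flagged.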
Q03 of 04 · SENIOR
A colleague suggests you should vectorize all your data first and then do cross-validation to save time. What's wrong with that approach, and what would you see in your metrics if you did it?
ANSWER
This causes data leakage. If you vectorize the entire corpus before cross-validation, the vectorizer's vocabulary and IDF weights are computed from every fold, including the hold-out fold that should be unseen. Each model is then trained on features that already encode information about its test fold, which makes accuracy, precision, and recall overly optimistic, by 2–5% depending on the dataset. The fix is to always use a Pipeline object, so that fit() is called only on the training split of each fold. The tell-tale sign of vectorizing first is cross-validation scores that are suspiciously high across all folds, while a separate held-out test set that was never touched by the vectorizer shows noticeably lower performance.
Q04 of 04 · JUNIOR
Explain the difference between precision and recall in the context of a medical text classifier that predicts whether a patient has a rare disease.
ANSWER
Precision: Of all patients the model flagged as having the disease, how many actually had it? High precision means few false alarms — you avoid overwhelming doctors with false positives. Recall: Of all patients who actually had the disease, how many did the model catch? High recall means you miss very few cases — crucial for a deadly disease where false negatives are catastrophic. For a rare disease (say 1% prevalence), a model that always predicts 'no disease' would have 99% accuracy but 0% recall. The trade-off is usually tuned via the decision threshold, with recall prioritised when the cost of missing a case is high (e.g., cancer).
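The 1% prevalence example can be checked with a few lines of arithmetic. The cohort of 1,000 patients is an assumed size chosen to make the numbers round.

```python
# Assumed cohort: 1,000 patients, 10 of whom (1%) have the disease.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that always predicts 'no disease'

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy looks excellent while every actual case is missed, which is why recall has to be reported alongside it for rare classes.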
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
What is the difference between text classification and sentiment analysis?
Sentiment analysis is a specific type of text classification where the categories are sentiments (positive, negative, neutral). Text classification is the broader technique — you could classify text into topics, intent, language, urgency, or any custom categories you define. Sentiment analysis just happens to be the most well-known application.
02
How much training data do I need for text classification?
For TF-IDF + Logistic Regression, you can get reasonable results with as few as 200–500 examples per class. Sentence transformers need at least 500 per class to learn a reliable decision boundary, though their pre-trained embeddings mean they generalise better from less data than training from scratch. Below 100 examples per class, consider few-shot prompting with a large language model instead.
03
Can I use text classification for multi-label problems where one document has multiple categories?
Yes, but it requires a different setup. Instead of a single classifier, you train one binary classifier per label (OneVsRestClassifier in sklearn) or use a model that natively outputs multiple labels. The evaluation metrics also change — you'd use macro-averaged F1 or hamming loss instead of standard accuracy. The preprocessing and vectorisation steps remain identical.
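A minimal multi-label sketch using `OneVsRestClassifier` is shown below. The support-ticket texts and the `billing` / `shipping` tags are invented, and the metrics are computed on the training data purely to keep the example short; a real evaluation would use a held-out split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy tickets, invented for illustration only; each can carry several tags.
texts = [
    "I was charged twice for my subscription",
    "my package never arrived",
    "charged for shipping but the parcel is lost",
    "refund the duplicate charge please",
    "the courier delivered to the wrong address",
    "invoice shows a delivery fee for a parcel that never shipped",
]
tags = [
    ["billing"], ["shipping"], ["billing", "shipping"],
    ["billing"], ["shipping"], ["billing", "shipping"],
]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)  # one 0/1 column per label

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one binary model per label
)
model.fit(texts, Y)
predictions = model.predict(texts)

loss = hamming_loss(Y, predictions)
macro_f1 = f1_score(Y, predictions, average="macro", zero_division=0)
print(binarizer.classes_, loss, macro_f1)
```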
04
How do I handle text in multiple languages?
If using TF-IDF, you need separate vectorizers per language (or a multilingual stop words list). Sentence transformers like 'paraphrase-multilingual-MiniLM-L12-v2' support 50+ languages in a single model — much easier. For mixed-language text, consider language detection first then route to the appropriate pipeline, or use a multilingual embedding model.
05
Should I fine-tune the sentence transformer on my data or just use the pre-trained embeddings?
Start without fine-tuning. Train only the logistic regression on top of frozen embeddings. If accuracy plateaus below your target, fine-tune the transformer using a small learning rate (2e-5) on your labelled data. Fine-tuning can add 2–5% accuracy but requires careful regularisation to avoid overfitting on small datasets.