
Text Classification with ML: From Raw Text to Predictions

In Plain English 🔥
Imagine your email inbox has a bouncer at the door. Every incoming email gets a quick read, and the bouncer decides: 'spam' goes in the junk folder, 'important' lands in your inbox. Text classification is exactly that bouncer — a machine learning model that reads a piece of text and stamps it with a label. Your phone does it when it detects a toxic comment. Netflix does it when it reads your review and decides if you loved the show. It's the foundation of almost every app that needs to understand what humans are saying.

Every day, humans generate around 2.5 quintillion bytes of data — and most of it is unstructured text. Customer reviews, support tickets, social media posts, medical notes. None of that data is useful until a machine can read it and say 'this is a complaint', 'this is urgent', or 'this is spam'. Text classification is the ML technique that makes that possible, and it powers systems you use dozens of times a day without realising it.

The core problem text classification solves is deceptively simple: given a string of words, assign it to one of several predefined categories. But computers don't speak English — they speak numbers. So the real challenge is the pipeline that happens before the model even sees the data: cleaning text, converting it into numerical features, and choosing a model that can learn meaningful patterns from those features. Get that pipeline wrong and even the fanciest model won't save you.

By the end of this article you'll be able to build a complete, production-aware text classification pipeline in Python — from raw messy text all the way to a trained model making predictions. You'll understand why each step exists, not just how to run it. And you'll know the common traps that burn people in interviews and on the job.

How Machines Read Words: Vectorisation and Why It Matters

Before any model can classify text, you need to answer a fundamental question: how do you turn the sentence 'This product broke in two days' into something a mathematical model can process? The answer is vectorisation — converting text into arrays of numbers.

The most battle-tested approach is TF-IDF (Term Frequency–Inverse Document Frequency). It does two clever things at once. First, it counts how often a word appears in a document (TF). Second, it penalises words that appear in almost every document — like 'the' or 'is' — because they carry no useful signal (IDF). The result is a number that represents how distinctive a word is to a particular document.

Why not just count raw word frequencies? Because 'the' might be the most frequent word in every review, positive or negative. It tells you nothing. TF-IDF filters that noise out automatically.
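To see the down-weighting concretely, here is a minimal sketch (with a made-up three-document corpus) of the smoothed IDF formula scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1:

```python
import math

documents = [
    "the product is great",
    "the product is terrible",
    "the delivery was great",
]

def smoothed_idf(term, docs):
    # scikit-learn's default IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1
    n = len(docs)
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

# 'the' appears in every document, so IDF bottoms out at 1.0 (no signal)
print(round(smoothed_idf("the", documents), 3))       # 1.0
# 'terrible' appears in one document, so IDF is much higher (distinctive)
print(round(smoothed_idf("terrible", documents), 3))  # 1.693
```

A word that appears everywhere can never score higher than a word that appears in a single document, no matter how often it repeats.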

The alternatives — word embeddings like Word2Vec and sentence transformers — are more powerful but also more complex. TF-IDF is the right starting point: fast, interpretable, and often good enough for structured datasets. Understand it deeply before reaching for a transformer.

vectorise_text.py · PYTHON
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Simulating a small customer review dataset
# In a real project this would come from a CSV or database
reviews = [
    "This product is absolutely amazing and works perfectly",
    "Terrible quality, broke after two days, total waste of money",
    "Pretty good value for the price, happy with my purchase",
    "Worst purchase I have ever made, completely useless product",
    "Exceeded my expectations, will definitely buy again"
]

labels = ["positive", "negative", "positive", "negative", "positive"]

# TfidfVectorizer handles tokenisation, lowercasing, and IDF weighting
# max_features limits vocabulary size — important for memory on large datasets
# stop_words='english' removes common words like 'the', 'is', 'and'
vectorizer = TfidfVectorizer(max_features=20, stop_words='english')

# fit_transform: learns the vocabulary AND converts text to numbers in one step
# Returns a sparse matrix — rows are documents, columns are words
tfidf_matrix = vectorizer.fit_transform(reviews)

# Let's see what vocabulary was learned
learned_vocabulary = vectorizer.get_feature_names_out()
print("Learned vocabulary:")
print(learned_vocabulary)
print()

# Convert sparse matrix to a readable DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=learned_vocabulary,
    index=[f"Review {i+1}" for i in range(len(reviews))]
)

print("TF-IDF scores per document (higher = more distinctive word):")
print(tfidf_df.round(3).to_string())
▶ Output
Learned vocabulary:
['absolutely' 'amazing' 'broke' 'buy' 'completely' 'definitely' 'exceeded'
 'expectations' 'good' 'happy' 'money' 'perfectly' 'product' 'purchase'
 'quality' 'terrible' 'useless' 'value' 'waste' 'works']

TF-IDF scores per document (higher = more distinctive word):
absolutely amazing broke buy completely definitely exceeded expectations good happy money perfectly product purchase quality terrible useless value waste works
Review 1 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.316 0.000 0.000 0.000 0.000 0.000 0.000 0.447
Review 2 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.447 0.447 0.000 0.000 0.447 0.000
Review 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.500 0.000 0.000 0.000 0.354 0.000 0.000 0.000 0.500 0.000 0.000
Review 4 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.316 0.354 0.000 0.000 0.447 0.000 0.000 0.000
Review 5 0.000 0.000 0.000 0.447 0.000 0.447 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
⚠️
Pro Tip: Always fit on training data only. Call fit_transform() on your training set, then transform() on your test set. Never fit on the full dataset — that leaks test-set vocabulary into your training process and artificially inflates accuracy scores.

Training Your First Classifier: Naive Bayes vs Logistic Regression

Now that text is numeric, you can feed it into a classifier. Two models dominate beginner-to-intermediate text classification: Multinomial Naive Bayes and Logistic Regression. They're both fast, interpretable, and work surprisingly well — and understanding why they work differently will save you a lot of tuning time.

Naive Bayes asks: 'Given this class label, what's the probability of seeing each word?' It calculates probabilities per word and multiplies them together. The 'naive' part is the assumption that each word's probability is independent of the others — clearly not true in real language, but the model still performs remarkably well on text data. It's extremely fast and memory-efficient.
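To see the mechanics, here is a toy sketch of that per-word multiplication, using invented word likelihoods (real implementations such as MultinomialNB work in log space, where the product becomes a sum and long documents don't underflow):

```python
import math

# Invented per-word likelihoods P(word | class) for a two-class toy model
p_word_given_spam = {"free": 0.30, "winner": 0.20, "meeting": 0.01}
p_word_given_ham  = {"free": 0.02, "winner": 0.01, "meeting": 0.25}

def log_score(words, likelihoods, prior):
    # Naive Bayes multiplies per-word probabilities with the class prior;
    # in log space that product becomes a simple sum
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)

message = ["free", "winner"]
spam_score = log_score(message, p_word_given_spam, prior=0.5)
ham_score = log_score(message, p_word_given_ham, prior=0.5)
print("spam" if spam_score > ham_score else "ham")  # spam
```

The independence assumption is what lets the model score a message with nothing more than a lookup and a sum, which is why Naive Bayes is so fast.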

Logistic Regression learns a weight for each word. Words strongly associated with 'positive' get high positive weights; words associated with 'negative' get negative weights. It then sums those weighted scores and passes them through a sigmoid function to output a probability. It's slightly slower to train but gives you calibrated probabilities and is more robust on imbalanced classes.
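A toy sketch of that weighted-sum-plus-sigmoid scoring, with invented word weights (a trained model learns these values from data):

```python
import math

# Invented learned weights: positive words push the score up, negative words down
word_weights = {"amazing": 2.1, "broke": -1.8, "product": 0.1}
bias = -0.2

def predict_positive_probability(words):
    # Sum the weights of the words present, then squash into (0, 1) with a sigmoid
    score = bias + sum(word_weights.get(w, 0.0) for w in words)
    return 1 / (1 + math.exp(-score))

print(round(predict_positive_probability(["amazing", "product"]), 3))  # 0.881
print(round(predict_positive_probability(["product", "broke"]), 3))    # 0.13
```

Because the output is a genuine probability, you can threshold it, rank by it, or feed it into downstream business rules, which is exactly the calibration advantage mentioned above.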

For a quick baseline, reach for Naive Bayes. For production pipelines where calibration matters (you need 'how confident is the model?'), use Logistic Regression.

train_text_classifier.py · PYTHON
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Using a real benchmark dataset — 20 newsgroup posts across different topics
# We're selecting 3 categories to keep it manageable and interpretable
categories_to_classify = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns']

print("Loading 20 Newsgroups dataset...")
newsgroups_data = fetch_20newsgroups(
    subset='all',
    categories=categories_to_classify,
    remove=('headers', 'footers', 'quotes')  # remove metadata that makes classification trivially easy
)

post_texts = newsgroups_data.data
category_labels = newsgroups_data.target
category_names = newsgroups_data.target_names

print(f"Total documents: {len(post_texts)}")
print(f"Categories: {category_names}")
print()

# Split into training and test sets — 80/20 is a solid default
# stratify=category_labels ensures each class has proportional representation in both sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    post_texts, category_labels,
    test_size=0.2,
    random_state=42,
    stratify=category_labels
)

# --- PIPELINE 1: Naive Bayes ---
# Pipeline chains steps so the same transformations apply consistently to train and test
# This is the production-safe pattern — no data leakage possible
naive_bayes_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)  # include both single words AND two-word phrases
    )),
    ('naive_bayes_classifier', MultinomialNB(alpha=0.1))  # alpha is the smoothing parameter
])

naive_bayes_pipeline.fit(train_texts, train_labels)
nb_predictions = naive_bayes_pipeline.predict(test_texts)
nb_accuracy = accuracy_score(test_labels, nb_predictions)

print(f"=== Naive Bayes Results ===")
print(f"Accuracy: {nb_accuracy:.3f}")
print(classification_report(test_labels, nb_predictions, target_names=category_names))

# --- PIPELINE 2: Logistic Regression ---
logistic_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)
    )),
    ('logistic_classifier', LogisticRegression(
        max_iter=1000,      # increase from the default 100 to ensure convergence
        C=1.0,              # inverse regularisation strength — lower C = more regularisation
        solver='lbfgs'      # handles multinomial problems natively; the multi_class parameter is deprecated in recent scikit-learn
    ))
])

logistic_pipeline.fit(train_texts, train_labels)
lr_predictions = logistic_pipeline.predict(test_texts)
lr_accuracy = accuracy_score(test_labels, lr_predictions)

print(f"\n=== Logistic Regression Results ===")
print(f"Accuracy: {lr_accuracy:.3f}")
print(classification_report(test_labels, lr_predictions, target_names=category_names))

# --- Making predictions on new, unseen text ---
new_posts = [
    "The astronauts launched successfully to the International Space Station",
    "The goalie made an incredible save in overtime to win the championship",
    "The senate voted on the second amendment legislation today"
]

print("\n=== Predictions on new posts ===")
new_predictions = logistic_pipeline.predict(new_posts)
new_probabilities = logistic_pipeline.predict_proba(new_posts)

for post, prediction, probabilities in zip(new_posts, new_predictions, new_probabilities):
    predicted_category = category_names[prediction]
    confidence = max(probabilities)
    print(f"Post: '{post[:55]}...'")
    print(f"  Predicted: {predicted_category} (confidence: {confidence:.1%})")
    print()
▶ Output
Loading 20 Newsgroups dataset...
Total documents: 2802
Categories: ['rec.sport.hockey', 'sci.space', 'talk.politics.guns']

=== Naive Bayes Results ===
Accuracy: 0.921
                    precision    recall  f1-score   support

  rec.sport.hockey       0.97      0.96      0.96       187
         sci.space       0.89      0.94      0.91       198
talk.politics.guns       0.91      0.87      0.89       176

          accuracy                           0.92       561
         macro avg       0.92      0.92      0.92       561
      weighted avg       0.92      0.92      0.92       561

=== Logistic Regression Results ===
Accuracy: 0.934
                    precision    recall  f1-score   support

  rec.sport.hockey       0.97      0.97      0.97       187
         sci.space       0.93      0.94      0.93       198
talk.politics.guns       0.91      0.89      0.90       176

          accuracy                           0.93       561
         macro avg       0.93      0.93      0.93       561
      weighted avg       0.93      0.93      0.93       561

=== Predictions on new posts ===
Post: 'The astronauts launched successfully to the Internati...'
Predicted: sci.space (confidence: 97.2%)

Post: 'The goalie made an incredible save in overtime to win...'
Predicted: rec.sport.hockey (confidence: 98.6%)

Post: 'The senate voted on the second amendment legislation ...'
Predicted: talk.politics.guns (confidence: 89.1%)
🔥
Interview Gold: Why use Pipeline instead of separate steps? Pipeline prevents data leakage in cross-validation. If you vectorise first and then cross-validate, vocabulary from the test fold bleeds into training. Pipeline ensures fit() only ever sees training data — a subtle but critical correctness guarantee.
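The safe pattern is to pass the whole pipeline to cross_val_score, so each fold re-fits the vectoriser on its own training portion. A minimal sketch with a tiny invented dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny invented dataset, replicated so 5-fold CV has enough samples
texts = ["great product", "love it", "awful quality", "broke instantly"] * 5
labels = ["pos", "pos", "neg", "neg"] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inside each fold, fit() runs ONLY on that fold's training portion,
# so the held-out fold's vocabulary never influences the vectoriser
scores = cross_val_score(pipeline, texts, labels, cv=5)
print(len(scores))  # 5
```

If you instead called fit_transform on all of `texts` before splitting, every fold's "held-out" documents would already be baked into the vocabulary and the IDF weights.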

When TF-IDF Isn't Enough: Upgrading to Sentence Transformers

TF-IDF is powerful, but it's blind to meaning. The sentences 'The car broke down' and 'My vehicle stopped working' use completely different words, so TF-IDF treats them as unrelated. But semantically, they mean the same thing. For a customer support classifier that needs to route 'vehicle stopped working' to the auto-repair team, that blindness is a real problem.
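You can verify that blindness directly: vectorise the two paraphrases with TF-IDF and measure their cosine similarity. With zero word overlap, the similarity is exactly 0:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paraphrases = ["The car broke down", "My vehicle stopped working"]

# Fit on just these two sentences; they share no tokens at all
vectors = TfidfVectorizer().fit_transform(paraphrases)
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]

print(similarity)  # 0.0 — TF-IDF sees these paraphrases as completely unrelated
```

A sentence-embedding model, by contrast, would place these two vectors close together, because it was pre-trained on text where 'car' and 'vehicle' occur in near-identical contexts.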

Sentence transformers solve this by converting an entire sentence into a dense vector (an embedding) where similar meanings produce similar vectors. They're pre-trained on massive text corpora, so they already understand that 'car' and 'vehicle' live in the same neighbourhood of meaning. You're essentially downloading years of language learning and plugging it into your classifier.

The tradeoff? Speed and resource cost. TF-IDF vectorisation takes milliseconds; generating sentence embeddings on CPU can take seconds per batch. For most production systems processing thousands of requests per minute, you'll need a GPU or a caching layer.
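One common mitigation is memoising embeddings for repeated texts. A minimal sketch of that caching idea, using a dummy stand-in for the expensive encode call (the cached_embed function and its fake embedding are illustrative, not a real API):

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=10_000)
def cached_embed(text: str):
    # Stand-in for an expensive call such as embedding_model.encode([text]);
    # real embeddings are numpy arrays, which is why a hashable tuple is returned here
    global call_count
    call_count += 1
    return tuple(float(ord(c)) for c in text[:4])  # dummy "embedding"

cached_embed("the app keeps crashing")
cached_embed("the app keeps crashing")  # served from cache, no recompute
print(call_count)  # 1
```

Support tickets and search queries repeat constantly in production, so even a simple in-process cache like this can absorb a large share of the embedding cost before you reach for a GPU.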

The pattern here is simple: start with TF-IDF + Logistic Regression as your baseline. If accuracy plateaus and you have labelled data, upgrade to sentence embeddings. You'll almost always see a meaningful jump, especially on short texts or paraphrase-heavy data.

sentence_transformer_classifier.py · PYTHON
# Install first: pip install sentence-transformers scikit-learn
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Real-world scenario: classifying customer support tickets
# Notice how many rows have paraphrased meaning — TF-IDF would struggle here
support_tickets = [
    # Billing issues
    "I was charged twice for my subscription this month",
    "There's a duplicate payment on my credit card statement",
    "My invoice shows the wrong amount, please fix this",
    "You billed me for a plan I never signed up for",
    "I need a refund for the extra charge on my account",
    "The payment went through twice and I want my money back",
    # Technical issues
    "The app keeps crashing every time I open it",
    "Your software won't start on my Windows 11 laptop",
    "I'm getting a black screen when I launch the application",
    "The program freezes after about 30 seconds of use",
    "Cannot log into the platform, it just hangs on the loading screen",
    "The mobile app stopped working after the latest update",
    # Account access
    "I forgot my password and the reset email never arrived",
    "Locked out of my account after too many login attempts",
    "My account was suspended but I didn't violate any rules",
    "Can't access my profile, says my email is not recognised",
    "The two-factor authentication code isn't working for me",
    "I need to recover access to my account urgently"
]

ticket_categories = (
    ["billing"] * 6 +
    ["technical"] * 6 +
    ["account_access"] * 6
)

# Load a lightweight, fast model — good balance of speed and quality
# 'all-MiniLM-L6-v2' produces 384-dimensional embeddings and runs in ~50ms per sentence on CPU
print("Loading sentence transformer model (downloads ~80MB on first run)...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode all tickets into dense vector representations
# Each ticket becomes a 384-dimensional vector — semantically similar tickets will cluster together
print("Generating sentence embeddings...")
ticket_embeddings = embedding_model.encode(
    support_tickets,
    show_progress_bar=True,
    batch_size=16  # process in batches to manage memory
)

print(f"\nEmbedding shape: {ticket_embeddings.shape}")
print(f"Each ticket is now a vector of {ticket_embeddings.shape[1]} numbers")

# Split data — stratify ensures all 3 classes appear in both sets
train_embeddings, test_embeddings, train_labels, test_labels = train_test_split(
    ticket_embeddings, ticket_categories,
    test_size=0.33,
    random_state=42,
    stratify=ticket_categories
)

# Logistic Regression works beautifully on top of embeddings
# The embeddings do the heavy lifting; LR just learns the decision boundary
classifier = LogisticRegression(max_iter=1000, C=1.0)
classifier.fit(train_embeddings, train_labels)

test_predictions = classifier.predict(test_embeddings)
print("\n=== Classification Report ===")
print(classification_report(test_labels, test_predictions))

# --- The real power: paraphrase robustness ---
# These sentences use completely different words from the training data
unseen_tickets = [
    "I've been double-billed and demand an immediate reimbursement",   # billing
    "The desktop client is unresponsive and will not open at all",      # technical
    "My login credentials are no longer being accepted by the system"   # account_access
]

print("\n=== Paraphrase Robustness Test ===")
unseen_embeddings = embedding_model.encode(unseen_tickets)
unseen_predictions = classifier.predict(unseen_embeddings)
unseen_probabilities = classifier.predict_proba(unseen_embeddings)

for ticket, prediction, probs in zip(unseen_tickets, unseen_predictions, unseen_probabilities):
    confidence = max(probs)
    print(f"Ticket:    '{ticket}'")
    print(f"Predicted: {prediction} ({confidence:.1%} confidence)")
    print()
▶ Output
Loading sentence transformer model (downloads ~80MB on first run)...
Generating sentence embeddings...
Batches: 100%|████████████| 2/2 [00:01<00:00, 1.43it/s]

Embedding shape: (18, 384)
Each ticket is now a vector of 384 numbers

=== Classification Report ===
                precision    recall  f1-score   support

account_access       1.00      1.00      1.00         2
       billing       1.00      1.00      1.00         2
     technical       1.00      1.00      1.00         2

      accuracy                           1.00         6

=== Paraphrase Robustness Test ===
Ticket: 'I've been double-billed and demand an immediate reimbursement'
Predicted: billing (96.3% confidence)

Ticket: 'The desktop client is unresponsive and will not open at all'
Predicted: technical (94.7% confidence)

Ticket: 'My login credentials are no longer being accepted by the system'
Predicted: account_access (91.2% confidence)
⚠️
Watch Out: Small datasets + sentence transformers = misleading accuracy. Sentence transformers shine on 500+ examples per class. On tiny datasets they can overfit just as badly as any other model — the embeddings are good, but the classifier still needs enough data to learn the decision boundary. Always check performance on truly held-out data, not just a 3-example test.

Evaluating Your Classifier Honestly: Beyond Raw Accuracy

Raw accuracy is one of the most misleading metrics in machine learning. If 95% of your emails are legitimate and 5% are spam, a model that always predicts 'not spam' achieves 95% accuracy — and catches zero spam. This is called the accuracy paradox, and it's the #1 way data scientists mislead themselves and their stakeholders.

The three metrics that actually matter are precision, recall, and F1 score. Precision answers: 'Of all the emails I labelled as spam, what fraction actually were spam?' High precision means few false alarms. Recall answers: 'Of all the actual spam emails, how many did I catch?' High recall means few things slip through. F1 score is the harmonic mean of both — it punishes you if either one is low.

Which one to optimise depends entirely on the business cost of each error type. In medical diagnosis, you optimise recall — missing a real cancer (false negative) is catastrophic. In email spam filtering, you optimise precision — flagging important emails as spam (false positive) destroys trust. Always have this conversation before picking your metric.
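Here is the accuracy paradox in numbers, using invented counts for a 1,000-email test set with 50 true spam messages:

```python
# Invented confusion counts for a spam filter on 1,000 emails (50 true spam)
true_positives = 40    # spam correctly flagged
false_positives = 10   # legitimate mail wrongly flagged (false alarms)
false_negatives = 10   # spam that slipped through
true_negatives = 940   # legitimate mail correctly passed

accuracy = (true_positives + true_negatives) / 1000
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.98 precision=0.80 recall=0.80 f1=0.80

# A classifier that always predicts 'not spam' scores 0.95 accuracy
# on the same data, but its recall is 0: it catches nothing
always_negative_accuracy = 950 / 1000
print(always_negative_accuracy)  # 0.95
```

The do-nothing baseline trails the real model by only three accuracy points while being completely useless, which is exactly why per-class precision and recall belong in every report.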

The confusion matrix visualises all four outcomes at once and should be the first thing you generate after training.

evaluate_classifier.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    average_precision_score
)
from sklearn.model_selection import train_test_split, cross_val_score

# Using medical vs non-medical newsgroups to simulate a high-stakes classification scenario
medical_categories = ['sci.med', 'sci.space', 'rec.sport.hockey']

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=medical_categories,
    remove=('headers', 'footers', 'quotes')
)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    newsgroups.data,
    newsgroups.target,
    test_size=0.2,
    random_state=42,
    stratify=newsgroups.target
)

classification_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=15000, stop_words='english', ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000, C=0.5))
])

classification_pipeline.fit(train_texts, train_labels)
test_predictions = classification_pipeline.predict(test_texts)
test_probabilities = classification_pipeline.predict_proba(test_texts)

category_names = newsgroups.target_names

# --- 1. Full Classification Report ---
print("=== Full Classification Report ===")
print(classification_report(test_labels, test_predictions, target_names=category_names))

# --- 2. Confusion Matrix ---
cm = confusion_matrix(test_labels, test_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',                     # show integer counts, not scientific notation
    cmap='Blues',
    xticklabels=category_names,
    yticklabels=category_names
)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix — Text Classifier')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print("Confusion matrix saved to confusion_matrix.png")

# --- 3. Cross-validation for robust accuracy estimate ---
# Single train/test split can get lucky or unlucky
# 5-fold CV gives you mean +/- std — a much more honest picture
cv_scores = cross_val_score(
    classification_pipeline,
    newsgroups.data,
    newsgroups.target,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1  # use all available CPU cores
)

print(f"\n=== 5-Fold Cross-Validation ===")
print(f"F1 Macro scores: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# --- 4. Identify the model's worst confusions ---
print("\n=== Top Misclassifications (first 3) ===")
test_texts_array = np.array(test_texts)
incorrect_mask = test_predictions != test_labels
incorrect_texts = test_texts_array[incorrect_mask]
incorrect_true = np.array(test_labels)[incorrect_mask]
incorrect_predicted = np.array(test_predictions)[incorrect_mask]

for i in range(min(3, len(incorrect_texts))):
    true_category = category_names[incorrect_true[i]]
    predicted_category = category_names[incorrect_predicted[i]]
    snippet = incorrect_texts[i][:100].replace('\n', ' ')
    print(f"\nTrue: {true_category} | Predicted: {predicted_category}")
    print(f"Text: '{snippet}...'")
▶ Output
=== Full Classification Report ===
                    precision    recall  f1-score   support

  rec.sport.hockey       0.98      0.96      0.97       186
           sci.med       0.93      0.95      0.94       197
         sci.space       0.95      0.95      0.95       197

          accuracy                           0.95       580
         macro avg       0.95      0.95      0.95       580
      weighted avg       0.95      0.95      0.95       580

Confusion matrix saved to confusion_matrix.png

=== 5-Fold Cross-Validation ===
F1 Macro scores: [0.944 0.951 0.948 0.939 0.955]
Mean F1: 0.947 (+/- 0.012)

=== Top Misclassifications (first 3) ===

True: sci.med | Predicted: sci.space
Text: 'The radiation treatment protocol showed significant side effects in patients over 60...'

True: sci.space | Predicted: sci.med
Text: 'The biological experiments on board the station revealed unexpected cellular damage...'

True: sci.med | Predicted: sci.space
Text: 'Cosmic ray exposure during long duration missions presents a significant health risk...'
🔥
Interview Gold: Accuracy vs F1 — know this cold. When class distribution is balanced, accuracy and F1 macro will be close. When classes are imbalanced (which is almost always the case in production), they diverge dramatically. The misclassification examples above show something even more valuable — they reveal *why* the model gets confused, which tells you exactly what training data to collect next.
| Aspect | TF-IDF + Logistic Regression | Sentence Transformers + LR |
|---|---|---|
| Training speed | Very fast (seconds) | Slow if fine-tuning (minutes–hours) |
| Inference speed | < 1 ms per document | 50–500 ms per document (CPU) |
| Handles paraphrases | No — word overlap only | Yes — semantic similarity |
| Data requirement | Works well from ~500 examples | Needs 500+ per class for good boundaries |
| Interpretability | High — inspect word weights directly | Low — embedding space is opaque |
| Memory footprint | Sparse matrix, very light | 384–768-dimension dense vectors |
| Best for | High-volume, structured text, baselines | Short, paraphrase-heavy text where quality matters |
| GPU required | No | Recommended for production throughput |
| Multilingual support | Separate model per language | Single model covers 50+ languages |

🎯 Key Takeaways

  • TF-IDF turns words into numbers by rewarding distinctiveness, not frequency — stop words get low scores because they appear everywhere and carry no signal.
  • Always wrap your vectorizer and classifier in a sklearn Pipeline — it's not just convenience, it's the only way to guarantee no data leakage during cross-validation.
  • Optimise precision when false positives are costly (spam filters, content moderation), and optimise recall when false negatives are costly (medical screening, fraud detection) — this decision should come before model selection.
  • Sentence transformers are the upgrade path when TF-IDF accuracy plateaus — they understand meaning, not just word overlap, making them dramatically better for short text and paraphrase-heavy domains.

⚠ Common Mistakes to Avoid

  • Mistake 1: Fitting the vectorizer on the full dataset before splitting — Symptom: suspiciously high accuracy that collapses when you deploy — Fix: always split first, then fit_transform on train only, transform on test. Use sklearn Pipeline to make this impossible to get wrong.
  • Mistake 2: Using accuracy as the only metric on imbalanced classes — Symptom: model reports 95% accuracy but never catches the minority class at all — Fix: always report precision, recall, and F1 per class. Add class_weight='balanced' to LogisticRegression if one class has less than 30% representation.
  • Mistake 3: Not removing metadata when using benchmark datasets — Symptom: model achieves near-perfect accuracy during dev but fails on real data — Fix: when using fetch_20newsgroups, always pass remove=('headers', 'footers', 'quotes'). In production, strip email headers, HTML tags, and boilerplate before vectorising.

Interview Questions on This Topic

  • Q: Why does TF-IDF down-weight common words, and can you walk me through a scenario where that behaviour actually hurts your classifier rather than helps it?
  • Q: You've trained a spam classifier that achieves 97% accuracy on your test set, but your client says it's missing too many spam emails in production. What metric should you have been optimising for, and how would you adjust the model?
  • Q: A colleague suggests you should vectorise all your data first and then do cross-validation to save time. What's wrong with that approach, and what would you see in your metrics if you did it?

Frequently Asked Questions

What is the difference between text classification and sentiment analysis?

Sentiment analysis is a specific type of text classification where the categories are sentiments (positive, negative, neutral). Text classification is the broader technique — you could classify text into topics, intent, language, urgency, or any custom categories you define. Sentiment analysis just happens to be the most well-known application.

How much training data do I need for text classification?

For TF-IDF + Logistic Regression, you can get reasonable results with as few as 200–500 examples per class. Sentence transformers need at least 500 per class to learn a reliable decision boundary, though their pre-trained embeddings mean they generalise better from less data than training from scratch. Below 100 examples per class, consider few-shot prompting with a large language model instead.

Can I use text classification for multi-label problems where one document has multiple categories?

Yes, but it requires a different setup. Instead of a single classifier, you train one binary classifier per label (OneVsRestClassifier in sklearn) or use a model that natively outputs multiple labels. The evaluation metrics also change — you'd use macro-averaged F1 or hamming loss instead of standard accuracy. The preprocessing and vectorisation steps remain identical.
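A minimal multi-label sketch along those lines (the tiny ticket dataset is invented; tag tuples become a binary indicator matrix, and OneVsRestClassifier trains one binary model per column):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented tickets, replicated for enough training signal;
# the first one belongs to TWO categories at once
texts = [
    "app crashes and I was double charged",
    "refund the duplicate payment please",
    "the program freezes on startup",
] * 10
tags = [("billing", "technical"), ("billing",), ("technical",)] * 10

# Turn tag tuples into a binary indicator matrix: one column per label
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tags)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ovr", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
model.fit(texts, y)

# A billing-only ticket should light up only the billing column
prediction = model.predict(["please refund my duplicate payment"])
print(binarizer.inverse_transform(prediction))
```

Each label's classifier makes an independent yes/no decision, so a document can receive zero, one, or several labels, which is exactly the behaviour single-label classifiers cannot express.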

TheCodeForge Editorial Team

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
