Text Classification with ML: From Raw Text to Predictions
Every day, humans generate around 2.5 quintillion bytes of data — and most of it is unstructured text. Customer reviews, support tickets, social media posts, medical notes. None of that data is useful until a machine can read it and say 'this is a complaint', 'this is urgent', or 'this is spam'. Text classification is the ML technique that makes that possible, and it powers systems you use dozens of times a day without realising it.
The core problem text classification solves is deceptively simple: given a string of words, assign it to one of several predefined categories. But computers don't speak English — they speak numbers. So the real challenge is the pipeline that happens before the model even sees the data: cleaning text, converting it into numerical features, and choosing a model that can learn meaningful patterns from those features. Get that pipeline wrong and even the fanciest model won't save you.
By the end of this article you'll be able to build a complete, production-aware text classification pipeline in Python — from raw messy text all the way to a trained model making predictions. You'll understand why each step exists, not just how to run it. And you'll know the common traps that burn people in interviews and on the job.
How Machines Read Words: Vectorisation and Why It Matters
Before any model can classify text, you need to answer a fundamental question: how do you turn the sentence 'This product broke in two days' into something a mathematical model can process? The answer is vectorisation — converting text into arrays of numbers.
The most battle-tested approach is TF-IDF (Term Frequency–Inverse Document Frequency). It does two clever things at once. First, it counts how often a word appears in a document (TF). Second, it penalises words that appear in almost every document — like 'the' or 'is' — because they carry no useful signal (IDF). The result is a number that represents how distinctive a word is to a particular document.
Why not just count raw word frequencies? Because 'the' might be the most frequent word in every review, positive or negative. It tells you nothing. TF-IDF filters that noise out automatically.
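To see the mechanics concretely, here is a back-of-the-envelope sketch of the classic TF-IDF formula on a three-document toy corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF variant and L2-normalises each row, so its exact numbers differ, but the intuition is identical: a word that appears in every document scores zero.

```python
import math

# Toy corpus: three tiny "documents"
docs = [
    "the product is great",
    "the product broke",
    "the service is great",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc_tokens):
    # Term frequency: raw count of the term in one document
    return doc_tokens.count(term)

def idf(term):
    # Classic IDF: log(N / document_frequency)
    doc_freq = sum(1 for d in tokenized if term in d)
    return math.log(n_docs / doc_freq)

# 'the' appears in every document, so its IDF (and therefore TF-IDF) is zero
# 'broke' appears in only one document, so it is highly distinctive
for term in ["the", "broke"]:
    score = tf(term, tokenized[1]) * idf(term)  # score within document 2
    print(f"{term!r}: tf-idf = {score:.3f}")
```

'the' scores exactly 0.000 because log(3/3) = 0, while 'broke' scores log(3) ≈ 1.099: the model keeps the distinctive word and discards the filler automatically.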
The alternatives — word embeddings like Word2Vec or sentence transformers — are more powerful but also more complex. TF-IDF is the right starting point: fast, interpretable, and often good enough for structured datasets. Understand it deeply before reaching for a transformer.
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Simulating a small customer review dataset
# In a real project this would come from a CSV or database
reviews = [
    "This product is absolutely amazing and works perfectly",
    "Terrible quality, broke after two days, total waste of money",
    "Pretty good value for the price, happy with my purchase",
    "Worst purchase I have ever made, completely useless product",
    "Exceeded my expectations, will definitely buy again"
]
labels = ["positive", "negative", "positive", "negative", "positive"]

# TfidfVectorizer handles tokenisation, lowercasing, and IDF weighting
# max_features limits vocabulary size — important for memory on large datasets
# stop_words='english' removes common words like 'the', 'is', 'and'
vectorizer = TfidfVectorizer(max_features=20, stop_words='english')

# fit_transform: learns the vocabulary AND converts text to numbers in one step
# Returns a sparse matrix — rows are documents, columns are words
tfidf_matrix = vectorizer.fit_transform(reviews)

# Let's see what vocabulary was learned
learned_vocabulary = vectorizer.get_feature_names_out()
print("Learned vocabulary:")
print(learned_vocabulary)
print()

# Convert sparse matrix to a readable DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=learned_vocabulary,
    index=[f"Review {i+1}" for i in range(len(reviews))]
)
print("TF-IDF scores per document (higher = more distinctive word):")
print(tfidf_df.round(3).to_string())
```
Learned vocabulary:
['absolutely' 'amazing' 'broke' 'buy' 'completely' 'definitely' 'exceeded'
 'expectations' 'good' 'happy' 'money' 'perfectly' 'product' 'purchase'
 'quality' 'terrible' 'useless' 'value' 'waste' 'works']
TF-IDF scores per document (higher = more distinctive word):
absolutely amazing broke buy completely definitely exceeded expectations good happy money perfectly product purchase quality terrible useless value waste works
Review 1 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.316 0.000 0.000 0.000 0.000 0.000 0.000 0.447
Review 2 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.447 0.447 0.000 0.000 0.447 0.000
Review 3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.500 0.000 0.000 0.000 0.354 0.000 0.000 0.000 0.500 0.000 0.000
Review 4 0.000 0.000 0.000 0.000 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.316 0.354 0.000 0.000 0.447 0.000 0.000 0.000
Review 5 0.000 0.000 0.000 0.447 0.000 0.447 0.447 0.447 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Training Your First Classifier: Naive Bayes vs Logistic Regression
Now that text is numeric, you can feed it into a classifier. Two models dominate beginner-to-intermediate text classification: Multinomial Naive Bayes and Logistic Regression. They're both fast, interpretable, and work surprisingly well — and understanding why they work differently will save you a lot of tuning time.
Naive Bayes asks: 'Given this class label, what's the probability of seeing each word?' It calculates probabilities per word and multiplies them together. The 'naive' part is the assumption that each word's probability is independent of the others — clearly not true in real language, but the model still performs remarkably well on text data. It's extremely fast and memory-efficient.
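A toy sketch makes those mechanics visible. The per-word probabilities and class priors below are made-up illustration numbers, not values learned from data, and real implementations such as sklearn's MultinomialNB work in log space for exactly the reason noted in the comments:

```python
import math

# Hypothetical word likelihoods P(word | class), as if estimated from
# training counts with smoothing already applied (illustration only)
word_probs = {
    "positive": {"amazing": 0.30, "broke": 0.02, "product": 0.20},
    "negative": {"amazing": 0.02, "broke": 0.35, "product": 0.18},
}
class_priors = {"positive": 0.6, "negative": 0.4}

def naive_bayes_score(words, label):
    # Multiplying many small probabilities underflows to zero,
    # so we sum log-probabilities instead of multiplying raw ones
    score = math.log(class_priors[label])
    for word in words:
        # Tiny floor for unseen words plays the role of smoothing
        score += math.log(word_probs[label].get(word, 1e-4))
    return score

document = ["product", "broke"]
scores = {label: naive_bayes_score(document, label) for label in word_probs}
print(max(scores, key=scores.get))  # → negative
```

'broke' is far more likely under the negative class, and that one word dominates the sum, which is why Naive Bayes works so well on text despite its independence assumption.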
Logistic Regression learns a weight for each word. Words strongly associated with 'positive' get high positive weights; words associated with 'negative' get negative weights. It then sums those weighted scores and passes them through a sigmoid function to output a probability. It's slightly slower to train but gives you calibrated probabilities and is more robust on imbalanced classes.
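The decision rule can be sketched in a few lines. The weights below are hypothetical, written as if already learned from training data; the point is how a weighted sum of word scores becomes a probability via the sigmoid:

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1), interpretable as a probability
    return 1 / (1 + math.exp(-z))

# Hypothetical learned weights: positive weight = evidence for 'positive'
weights = {"amazing": 2.1, "broke": -2.8, "product": 0.1}
bias = 0.2

def predict_positive_probability(words):
    weighted_sum = bias + sum(weights.get(w, 0.0) for w in words)
    return sigmoid(weighted_sum)

print(round(predict_positive_probability(["amazing", "product"]), 3))  # → 0.917
print(round(predict_positive_probability(["product", "broke"]), 3))    # → 0.076
```

This is also why Logistic Regression is so interpretable: you can rank the learned weights and read off exactly which words push a prediction in each direction.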
For a quick baseline, reach for Naive Bayes. For production pipelines where calibration matters (you need 'how confident is the model?'), use Logistic Regression.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Using a real benchmark dataset — 20 newsgroup posts across different topics
# We're selecting 3 categories to keep it manageable and interpretable
categories_to_classify = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns']

print("Loading 20 Newsgroups dataset...")
newsgroups_data = fetch_20newsgroups(
    subset='all',
    categories=categories_to_classify,
    remove=('headers', 'footers', 'quotes')  # remove metadata that makes classification trivially easy
)

post_texts = newsgroups_data.data
category_labels = newsgroups_data.target
category_names = newsgroups_data.target_names

print(f"Total documents: {len(post_texts)}")
print(f"Categories: {category_names}")
print()

# Split into training and test sets — 80/20 is a solid default
# stratify=category_labels ensures each class has proportional representation in both sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    post_texts, category_labels, test_size=0.2, random_state=42,
    stratify=category_labels
)

# --- PIPELINE 1: Naive Bayes ---
# Pipeline chains steps so the same transformations apply consistently to train and test
# This is the production-safe pattern — no data leakage possible
naive_bayes_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)  # include both single words AND two-word phrases
    )),
    ('naive_bayes_classifier', MultinomialNB(alpha=0.1))  # alpha is the smoothing parameter
])

naive_bayes_pipeline.fit(train_texts, train_labels)
nb_predictions = naive_bayes_pipeline.predict(test_texts)
nb_accuracy = accuracy_score(test_labels, nb_predictions)

print("=== Naive Bayes Results ===")
print(f"Accuracy: {nb_accuracy:.3f}")
print(classification_report(test_labels, nb_predictions, target_names=category_names))

# --- PIPELINE 2: Logistic Regression ---
logistic_pipeline = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
        max_features=10000,
        stop_words='english',
        ngram_range=(1, 2)
    )),
    ('logistic_classifier', LogisticRegression(
        max_iter=1000,  # increase from default 100 to ensure convergence
        C=1.0,          # regularisation strength — lower C = more regularisation
        solver='lbfgs'  # optimises the multinomial loss for multiclass by default
    ))
])

logistic_pipeline.fit(train_texts, train_labels)
lr_predictions = logistic_pipeline.predict(test_texts)
lr_accuracy = accuracy_score(test_labels, lr_predictions)

print("\n=== Logistic Regression Results ===")
print(f"Accuracy: {lr_accuracy:.3f}")
print(classification_report(test_labels, lr_predictions, target_names=category_names))

# --- Making predictions on new, unseen text ---
new_posts = [
    "The astronauts launched successfully to the International Space Station",
    "The goalie made an incredible save in overtime to win the championship",
    "The senate voted on the second amendment legislation today"
]

print("\n=== Predictions on new posts ===")
new_predictions = logistic_pipeline.predict(new_posts)
new_probabilities = logistic_pipeline.predict_proba(new_posts)

for post, prediction, probabilities in zip(new_posts, new_predictions, new_probabilities):
    predicted_category = category_names[prediction]
    confidence = max(probabilities)
    print(f"Post: '{post[:55]}...'")
    print(f"  Predicted: {predicted_category} (confidence: {confidence:.1%})")
    print()
```
Total documents: 2802
Categories: ['rec.sport.hockey', 'sci.space', 'talk.politics.guns']
=== Naive Bayes Results ===
Accuracy: 0.921
precision recall f1-score support
rec.sport.hockey 0.97 0.96 0.96 187
sci.space 0.89 0.94 0.91 198
talk.politics.guns 0.91 0.87 0.89 176
accuracy 0.92 561
macro avg 0.92 0.92 0.92 561
weighted avg 0.92 0.92 0.92 561
=== Logistic Regression Results ===
Accuracy: 0.934
precision recall f1-score support
rec.sport.hockey 0.97 0.97 0.97 187
sci.space 0.93 0.94 0.93 198
talk.politics.guns 0.91 0.89 0.90 176
accuracy 0.93 561
macro avg 0.93 0.93 0.93 561
weighted avg 0.93 0.93 0.93 561
=== Predictions on new posts ===
Post: 'The astronauts launched successfully to the Internati...'
Predicted: sci.space (confidence: 97.2%)
Post: 'The goalie made an incredible save in overtime to win...'
Predicted: rec.sport.hockey (confidence: 98.6%)
Post: 'The senate voted on the second amendment legislation ...'
Predicted: talk.politics.guns (confidence: 89.1%)
When TF-IDF Isn't Enough: Upgrading to Sentence Transformers
TF-IDF is powerful, but it's blind to meaning. The sentences 'The car broke down' and 'My vehicle stopped working' use completely different words, so TF-IDF treats them as unrelated. But semantically, they mean the same thing. For a customer support classifier that needs to route 'vehicle stopped working' to the auto-repair team, that blindness is a real problem.
Sentence transformers solve this by converting an entire sentence into a dense vector (an embedding) where similar meanings produce similar vectors. They're pre-trained on massive text corpora, so they already understand that 'car' and 'vehicle' live in the same neighbourhood of meaning. You're essentially downloading years of language learning and plugging it into your classifier.
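You can see why this matters with plain cosine similarity, the standard way to compare two embeddings. The four-dimensional vectors below are made up purely for illustration (real sentence transformers output 384 or more dimensions), but the comparison logic is exactly what runs on real embeddings:

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), near 0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings" for three sentences (illustration only)
emb_car_broke    = [0.90, 0.80, 0.10, 0.00]  # "The car broke down"
emb_vehicle_stop = [0.85, 0.75, 0.15, 0.05]  # "My vehicle stopped working"
emb_great_pizza  = [0.05, 0.10, 0.90, 0.80]  # "The pizza was great"

print(round(cosine_similarity(emb_car_broke, emb_vehicle_stop), 3))  # near 1.0
print(round(cosine_similarity(emb_car_broke, emb_great_pizza), 3))   # much lower
```

The two paraphrases point in almost the same direction despite sharing no words, which is the whole value proposition over TF-IDF's word-overlap view.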
The tradeoff? Speed and resource cost. TF-IDF vectorisation takes milliseconds; generating sentence embeddings on CPU can take seconds per batch. For most production systems processing thousands of requests per minute, you'll need a GPU or a caching layer.
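One common mitigation is a cache keyed on a hash of the normalised text, so repeated or near-duplicate inputs skip the expensive encode step. This sketch fakes the model call with a stand-in function (compute_embedding here is a placeholder, not a real API) to show the caching pattern itself:

```python
import hashlib

def compute_embedding(text):
    # Stand-in for the expensive call — in practice this would be
    # something like embedding_model.encode(text)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:8]]  # fake deterministic vector

_embedding_cache = {}

def cached_embedding(text):
    # Normalise before hashing so trivial whitespace/case differences
    # share one cache entry
    key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = compute_embedding(text)
    return _embedding_cache[key]

cached_embedding("The app keeps crashing")   # computed
cached_embedding("the app keeps crashing ")  # cache hit: same normalised key
print(len(_embedding_cache))  # → 1
```

In production you would back this with Redis or a database table rather than an in-process dict, but the keying idea is the same.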
The pattern here is simple: start with TF-IDF + Logistic Regression as your baseline. If accuracy plateaus and you have labelled data, upgrade to sentence embeddings. You'll almost always see a meaningful jump, especially on short texts or paraphrase-heavy data.
```python
# Install first: pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Real-world scenario: classifying customer support tickets
# Notice how many rows have paraphrased meaning — TF-IDF would struggle here
support_tickets = [
    # Billing issues
    "I was charged twice for my subscription this month",
    "There's a duplicate payment on my credit card statement",
    "My invoice shows the wrong amount, please fix this",
    "You billed me for a plan I never signed up for",
    "I need a refund for the extra charge on my account",
    "The payment went through twice and I want my money back",
    # Technical issues
    "The app keeps crashing every time I open it",
    "Your software won't start on my Windows 11 laptop",
    "I'm getting a black screen when I launch the application",
    "The program freezes after about 30 seconds of use",
    "Cannot log into the platform, it just hangs on the loading screen",
    "The mobile app stopped working after the latest update",
    # Account access
    "I forgot my password and the reset email never arrived",
    "Locked out of my account after too many login attempts",
    "My account was suspended but I didn't violate any rules",
    "Can't access my profile, says my email is not recognised",
    "The two-factor authentication code isn't working for me",
    "I need to recover access to my account urgently"
]
ticket_categories = (
    ["billing"] * 6 + ["technical"] * 6 + ["account_access"] * 6
)

# Load a lightweight, fast model — good balance of speed and quality
# 'all-MiniLM-L6-v2' produces 384-dimensional embeddings and runs in ~50ms per sentence on CPU
print("Loading sentence transformer model (downloads ~80MB on first run)...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode all tickets into dense vector representations
# Each ticket becomes a 384-dimensional vector — semantically similar tickets will cluster together
print("Generating sentence embeddings...")
ticket_embeddings = embedding_model.encode(
    support_tickets,
    show_progress_bar=True,
    batch_size=16  # process in batches to manage memory
)

print(f"\nEmbedding shape: {ticket_embeddings.shape}")
print(f"Each ticket is now a vector of {ticket_embeddings.shape[1]} numbers")

# Split data — stratify ensures all 3 classes appear in both sets
train_embeddings, test_embeddings, train_labels, test_labels = train_test_split(
    ticket_embeddings, ticket_categories, test_size=0.33, random_state=42,
    stratify=ticket_categories
)

# Logistic Regression works beautifully on top of embeddings
# The embeddings do the heavy lifting; LR just learns the decision boundary
classifier = LogisticRegression(max_iter=1000, C=1.0)
classifier.fit(train_embeddings, train_labels)

test_predictions = classifier.predict(test_embeddings)
print("\n=== Classification Report ===")
print(classification_report(test_labels, test_predictions))

# --- The real power: paraphrase robustness ---
# These sentences use completely different words from the training data
unseen_tickets = [
    "I've been double-billed and demand an immediate reimbursement",   # billing
    "The desktop client is unresponsive and will not open at all",     # technical
    "My login credentials are no longer being accepted by the system"  # account_access
]

print("\n=== Paraphrase Robustness Test ===")
unseen_embeddings = embedding_model.encode(unseen_tickets)
unseen_predictions = classifier.predict(unseen_embeddings)
unseen_probabilities = classifier.predict_proba(unseen_embeddings)

for ticket, prediction, probs in zip(unseen_tickets, unseen_predictions, unseen_probabilities):
    confidence = max(probs)
    print(f"Ticket: '{ticket}'")
    print(f"Predicted: {prediction} ({confidence:.1%} confidence)")
    print()
```
Generating sentence embeddings...
Batches: 100%|████████████| 2/2 [00:01<00:00, 1.43it/s]
Embedding shape: (18, 384)
Each ticket is now a vector of 384 numbers
=== Classification Report ===
precision recall f1-score support
account_access 1.00 1.00 1.00 2
billing 1.00 1.00 1.00 2
technical 1.00 1.00 1.00 2
accuracy 1.00 6
=== Paraphrase Robustness Test ===
Ticket: 'I've been double-billed and demand an immediate reimbursement'
Predicted: billing (96.3% confidence)
Ticket: 'The desktop client is unresponsive and will not open at all'
Predicted: technical (94.7% confidence)
Ticket: 'My login credentials are no longer being accepted by the system'
Predicted: account_access (91.2% confidence)
Evaluating Your Classifier Honestly: Beyond Raw Accuracy
Raw accuracy is one of the most misleading metrics in machine learning. If 95% of your emails are legitimate and 5% are spam, a model that always predicts 'not spam' achieves 95% accuracy — and catches zero spam. This is called the accuracy paradox, and it's the #1 way data scientists mislead themselves and their stakeholders.
The three metrics that actually matter are precision, recall, and F1 score. Precision answers: 'Of all the emails I labelled as spam, what fraction actually were spam?' High precision means few false alarms. Recall answers: 'Of all the actual spam emails, how many did I catch?' High recall means few things slip through. F1 score is the harmonic mean of both — it punishes you if either one is low.
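These definitions are just arithmetic on the four confusion-matrix cells. A worked example with hypothetical spam-filter counts shows how a healthy-looking accuracy can hide poor recall:

```python
# Confusion-matrix counts for a hypothetical spam filter on 1000 emails
true_positives = 40    # spam correctly flagged
false_positives = 10   # legitimate mail wrongly flagged
false_negatives = 60   # spam that slipped through
true_negatives = 890   # legitimate mail correctly passed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / 1000

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# → accuracy=0.93 precision=0.80 recall=0.40 f1=0.53
```

Accuracy comes out at 93% while recall is only 0.40: the filter misses more than half the spam, which is exactly the accuracy paradox described above, and the F1 score of 0.53 is the metric that actually flags the problem.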
Which one to optimise depends entirely on the business cost of each error type. In medical diagnosis, you optimise recall — missing a real cancer (false negative) is catastrophic. In email spam filtering, you optimise precision — flagging important emails as spam (false positive) destroys trust. Always have this conversation before picking your metric.
The confusion matrix visualises all four outcomes at once and should be the first thing you generate after training.
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score

# Using medical vs non-medical newsgroups to simulate a high-stakes classification scenario
medical_categories = ['sci.med', 'sci.space', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(
    subset='all',
    categories=medical_categories,
    remove=('headers', 'footers', 'quotes')
)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    newsgroups.data, newsgroups.target, test_size=0.2, random_state=42,
    stratify=newsgroups.target
)

classification_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=15000, stop_words='english', ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000, C=0.5))
])

classification_pipeline.fit(train_texts, train_labels)
test_predictions = classification_pipeline.predict(test_texts)
category_names = newsgroups.target_names

# --- 1. Full Classification Report ---
print("=== Full Classification Report ===")
print(classification_report(test_labels, test_predictions, target_names=category_names))

# --- 2. Confusion Matrix ---
cm = confusion_matrix(test_labels, test_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',  # show integer counts, not scientific notation
    cmap='Blues',
    xticklabels=category_names,
    yticklabels=category_names
)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix — Text Classifier')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print("Confusion matrix saved to confusion_matrix.png")

# --- 3. Cross-validation for robust accuracy estimate ---
# Single train/test split can get lucky or unlucky
# 5-fold CV gives you mean +/- std — a much more honest picture
cv_scores = cross_val_score(
    classification_pipeline,
    newsgroups.data,
    newsgroups.target,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1  # use all available CPU cores
)
print("\n=== 5-Fold Cross-Validation ===")
print(f"F1 Macro scores: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# --- 4. Identify the model's worst confusions ---
print("\n=== Top Misclassifications (first 3) ===")
test_texts_array = np.array(test_texts)
incorrect_mask = test_predictions != test_labels
incorrect_texts = test_texts_array[incorrect_mask]
incorrect_true = np.array(test_labels)[incorrect_mask]
incorrect_predicted = np.array(test_predictions)[incorrect_mask]

for i in range(min(3, len(incorrect_texts))):
    true_category = category_names[incorrect_true[i]]
    predicted_category = category_names[incorrect_predicted[i]]
    snippet = incorrect_texts[i][:100].replace('\n', ' ')
    print(f"\nTrue: {true_category} | Predicted: {predicted_category}")
    print(f"Text: '{snippet}...'")
```
precision recall f1-score support
rec.sport.hockey 0.98 0.96 0.97 186
sci.med 0.93 0.95 0.94 197
sci.space 0.95 0.95 0.95 197
accuracy 0.95 580
macro avg 0.95 0.95 0.95 580
weighted avg 0.95 0.95 0.95 580
Confusion matrix saved to confusion_matrix.png
=== 5-Fold Cross-Validation ===
F1 Macro scores: [0.944 0.951 0.948 0.939 0.955]
Mean F1: 0.947 (+/- 0.012)
=== Top Misclassifications (first 3) ===
True: sci.med | Predicted: sci.space
Text: 'The radiation treatment protocol showed significant side effects in patients over 60...'
True: sci.space | Predicted: sci.med
Text: 'The biological experiments on board the station revealed unexpected cellular damage...'
True: sci.med | Predicted: sci.space
Text: 'Cosmic ray exposure during long duration missions presents a significant health risk...'
| Aspect | TF-IDF + Logistic Regression | Sentence Transformers + LR |
|---|---|---|
| Training speed | Very fast (seconds) | Slow if fine-tuning (minutes–hours) |
| Inference speed | < 1ms per document | 50–500ms per document (CPU) |
| Handles paraphrases | No — word overlap only | Yes — semantic similarity |
| Data requirement | Works well from ~500 examples per class | Usable from a few dozen examples per class (embeddings are pre-trained) |
| Interpretability | High — inspect word weights directly | Low — embedding space is opaque |
| Memory footprint | Sparse matrix, very light | 384–768 dimension dense vectors |
| Best for | High-volume, structured text, baseline | Short text, paraphrase-heavy, quality matters |
| GPU required | No | Recommended for production throughput |
| Multilingual support | With separate models per language | Single model covers 50+ languages |
🎯 Key Takeaways
- TF-IDF turns words into numbers by rewarding distinctiveness, not frequency — stop words get low scores because they appear everywhere and carry no signal.
- Always wrap your vectorizer and classifier in a sklearn Pipeline — it's not just convenience, it's the only way to guarantee no data leakage during cross-validation.
- Optimise precision when false positives are costly (spam filters, content moderation), and optimise recall when false negatives are costly (medical screening, fraud detection) — this decision should come before model selection.
- Sentence transformers are the upgrade path when TF-IDF accuracy plateaus — they understand meaning, not just word overlap, making them dramatically better for short text and paraphrase-heavy domains.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Fitting the vectorizer on the full dataset before splitting — Symptom: suspiciously high accuracy that collapses when you deploy — Fix: always split first, then fit_transform on train only, transform on test. Use sklearn Pipeline to make this impossible to get wrong.
- ✕ Mistake 2: Using accuracy as the only metric on imbalanced classes — Symptom: model reports 95% accuracy but never catches the minority class at all — Fix: always report precision, recall, and F1 per class. Add class_weight='balanced' to LogisticRegression if one class has less than 30% representation.
- ✕ Mistake 3: Not removing metadata when using benchmark datasets — Symptom: model achieves near-perfect accuracy during dev but fails on real data — Fix: when using fetch_20newsgroups, always pass remove=('headers', 'footers', 'quotes'). In production, strip email headers, HTML tags, and boilerplate before vectorising.
Interview Questions on This Topic
- Q: Why does TF-IDF down-weight common words, and can you walk me through a scenario where that behaviour actually hurts your classifier rather than helps it?
- Q: You've trained a spam classifier that achieves 97% accuracy on your test set, but your client says it's missing too many spam emails in production. What metric should you have been optimising for, and how would you adjust the model?
- Q: A colleague suggests you should vectorize all your data first and then do cross-validation to save time. What's wrong with that approach, and what would you see in your metrics if you did it?
Frequently Asked Questions
What is the difference between text classification and sentiment analysis?
Sentiment analysis is a specific type of text classification where the categories are sentiments (positive, negative, neutral). Text classification is the broader technique — you could classify text into topics, intent, language, urgency, or any custom categories you define. Sentiment analysis just happens to be the most well-known application.
How much training data do I need for text classification?
For TF-IDF + Logistic Regression, you can get reasonable results with as few as 200–500 examples per class. Sentence transformers typically need less labelled data, not more: the embeddings arrive pre-trained, so the lightweight classifier on top only has to learn a decision boundary and can often do so from a few dozen examples per class. Below roughly 100 examples per class for a TF-IDF pipeline, consider few-shot prompting with a large language model instead.
Can I use text classification for multi-label problems where one document has multiple categories?
Yes, but it requires a different setup. Instead of a single classifier, you train one binary classifier per label (OneVsRestClassifier in sklearn) or use a model that natively outputs multiple labels. The evaluation metrics also change — you'd use macro-averaged F1 or hamming loss instead of standard accuracy. The preprocessing and vectorisation steps remain identical.
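As a minimal illustration of that setup (with made-up scores, not a trained model): each label gets its own independent yes/no decision, and hamming loss counts the fraction of individual label decisions that were wrong:

```python
# Hypothetical per-label probabilities from three independent binary
# classifiers (the one-classifier-per-label pattern)
label_names = ["billing", "technical", "urgent"]
ticket_scores = [0.91, 0.12, 0.78]  # one ticket, one score per label

# Each label is decided separately against a threshold, so a document
# can end up with zero, one, or several labels
predicted_labels = [name for name, score in zip(label_names, ticket_scores) if score >= 0.5]
print(predicted_labels)  # → ['billing', 'urgent']

# Hamming loss: fraction of individual label decisions that were wrong
true_matrix = [[1, 0, 1], [0, 1, 0]]       # two documents, three labels each
predicted_matrix = [[1, 0, 0], [0, 1, 0]]  # one wrong cell out of six
wrong_cells = sum(
    t != p
    for true_row, pred_row in zip(true_matrix, predicted_matrix)
    for t, p in zip(true_row, pred_row)
)
print(round(wrong_cells / 6, 3))  # → 0.167
```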