Intermediate 9 min · May 28, 2026

Topic Modeling with LDA: From Theory to Production

Q: What is the difference between LDA and pLSA?

LDA places a Dirichlet prior on each document's topic mixture, which regularizes the model and reduces overfitting. pLSA treats the mixture as a fixed parameter per document, so the parameter count grows with the corpus and the risk of overfitting rises. LDA is a fully generative model that can assign probabilities to unseen documents, while pLSA is not.

Q: How do I choose the number of topics K for LDA?

There is no single correct K. Common approaches include using topic coherence metrics (e.g., C_v, UMass), perplexity on held-out data, and manual inspection of top words. Start with a range (e.g., 5-50) and evaluate both quantitatively and qualitatively.

Q: What preprocessing steps are essential for LDA?

Tokenization, lowercasing, removing punctuation and stop words, and stemming or lemmatization are standard. For domain-specific corpora, add custom stop words and consider bigrams/trigrams. Avoid aggressive filtering that removes rare but meaningful terms.

Q: Can LDA handle streaming or dynamic data?

Standard LDA is batch-based. For streaming data, use online variational Bayes (e.g., online LDA) or incremental Gibbs sampling. These update the model incrementally without retraining from scratch, but require careful tuning of learning rate and decay.

Master Latent Dirichlet Allocation for topic modeling: generative model, Dirichlet priors, Gibbs sampling, production pitfalls, debugging, and real-world war stories for ML engineers.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 15, 2026

last updated

Before you start · ⏱ 25 min

✓ Solid grasp of fundamentals
✓ Comfortable reading code examples
✓ Basic production concepts

⚡Quick Answer

LDA (Latent Dirichlet Allocation) is an unsupervised generative probabilistic model for discovering abstract topics from a collection of documents. The key practical takeaway: it works well for exploratory analysis on medium-sized text corpora, but requires careful preprocessing (stopword removal, lemmatization) and multiple runs to stabilize topics; for production, consider combining LDA with coherence scoring to automatically select the number of topics.

✦ Definition · ~90s read

What is Topic Modeling with LDA?

Latent Dirichlet Allocation (LDA) is a generative statistical model that assumes each document is a mixture of a fixed number of latent topics, and each topic is a probability distribution over a fixed vocabulary. The model uses Dirichlet priors on both the document-topic and topic-word distributions, enabling Bayesian inference to learn these distributions from a corpus of text documents.

Plain-English First

Imagine you have a giant pile of news articles, and you want to automatically group them by theme without reading each one. LDA is like a smart librarian who assumes each article is a mix of a few secret topics (like 'politics' or 'sports'), and each topic is a bag of words that tend to appear together. It figures out the topics and how much each article belongs to each topic by looking at word co-occurrence patterns across the whole collection.

Opening a support ticket triage pipeline or a legal document review system often means staring down hundreds of thousands of unstructured documents. Latent Dirichlet Allocation (LDA) remains the go-to algorithm for unsupervised theme discovery, not because it's the newest tool, but because its simplicity, interpretability, and low computational cost beat transformer-based alternatives like BERTopic in production environments where explainability and resource budgets are tight. The explosion of text from customer reviews, support tickets, legal records, and social media makes automatic latent theme extraction more valuable than ever, but the real engineering constraint is shipping a model that auditors and ops teams can actually work with.

What is LDA? Generative Model and Core Assumptions

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data, most commonly text corpora. It assumes each document is a mixture of a fixed number of latent topics, and each topic is a distribution over a fixed vocabulary. The generative process for each document is: first, draw a topic distribution θ_d ~ Dirichlet(α); then, for each word position, draw a topic assignment z_{d,n} ~ Categorical(θ_d); finally, draw the observed word w_{d,n} ~ Categorical(β_{z_{d,n}}), where β_k is the word distribution for topic k. The core assumption is that the order of words within a document is exchangeable — the bag-of-words assumption — which ignores syntax and semantics but makes inference tractable. The Dirichlet prior on θ_d is the key innovation that distinguishes LDA from earlier models like pLSA, enabling it to generalize to unseen documents and avoid overfitting. In practice, LDA requires the number of topics K to be specified a priori, and the quality of discovered topics is highly sensitive to the hyperparameters α (document-topic sparsity) and η (topic-word sparsity).
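The generative story above can be run forward to produce a synthetic corpus, which is a handy way to build intuition before fitting anything. The following is a minimal NumPy sketch, not a library API: the function name `generate_corpus`, the symmetric priors, and the fixed document length are all assumptions made for illustration.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha=0.1, eta=0.1, seed=0):
    """Sample a toy corpus by running the LDA generative process forward."""
    rng = np.random.default_rng(seed)
    # beta[k] is the word distribution of topic k, drawn from a symmetric Dirichlet(eta)
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs, thetas = [], []
    for _ in range(n_docs):
        # theta_d ~ Dirichlet(alpha): topic proportions for this document
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # z_{d,n} ~ Categorical(theta_d): one topic assignment per word position
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # w_{d,n} ~ Categorical(beta_{z_{d,n}}): the observed word ids
        words = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), beta

docs, thetas, beta = generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=12)
print("word ids of doc 0:", docs[0])
print("topic mix of doc 0:", thetas[0].round(2))
```

Word order inside each sampled document carries no information, which is the bag-of-words assumption in action: shuffling `docs[0]` leaves its probability under the model unchanged.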

io/thecodeforge/lda_intro.py · PYTHON

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The cat sat on the mat.",
    "The dog played in the park.",
    "Cats and dogs are pets.",
    "The cat chased the mouse.",
    "Dogs love to play fetch."
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

print("Topic 0:", [vectorizer.get_feature_names_out()[i] for i in lda.components_[0].argsort()[-5:]])
print("Topic 1:", [vectorizer.get_feature_names_out()[i] for i in lda.components_[1].argsort()[-5:]])

Output

Topic 0: ['mat', 'cat', 'sat', 'chased', 'mouse']

Topic 1: ['park', 'play', 'dog', 'dogs', 'pets']

Mental Model

Generative vs. Discriminative

LDA is generative: it models how documents are produced. This allows it to assign probabilities to new documents, unlike discriminative models that only separate classes.

📊 Production Insight

Always set a random_state for reproducibility. In production, use chunked fitting for large corpora to avoid memory blowup. Monitor perplexity on a held-out set to detect overfitting.

🎯 Key Takeaway

LDA assumes documents are mixtures of topics, each a distribution over words. The Dirichlet prior on topic mixtures enables generalization. Bag-of-words is a strong assumption that discards word order.

thecodeforge.io

Topic Modeling Lda

LDA vs. pLSA: Why Dirichlet Priors Matter

Probabilistic Latent Semantic Analysis (pLSA) was the direct predecessor of LDA. Both model documents as mixtures of topics, but pLSA treats the document-topic distribution θ_d as a per-document parameter learned directly from the training data. This means pLSA has no mechanism to assign probabilities to documents outside the training set — it is transductive, not generative. LDA solves this by placing a Dirichlet prior on θ_d, making it a random variable drawn from a known distribution. The Dirichlet prior has two critical effects: it regularizes the topic mixtures, preventing overfitting to rare word co-occurrences, and it allows the model to infer topic distributions for unseen documents via posterior inference. Mathematically, pLSA maximizes the likelihood of the observed words given the topics, while LDA maximizes the marginal likelihood of the data under the full generative model, integrating out the latent variables. The Dirichlet prior introduces hyperparameters α and η that control sparsity: low α encourages documents to focus on few topics, low η encourages topics to focus on few words. In practice, LDA consistently produces more coherent and interpretable topics than pLSA, especially on small or noisy datasets.

io/thecodeforge/lda_vs_plsa.py · PYTHON

# No direct pLSA in sklearn; we illustrate the Dirichlet effect via hyperparameters
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The economy grew by 2% last quarter.",
    "The stock market rallied.",
    "The team won the championship.",
    "The athlete trained hard.",
    "Inflation is rising."
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Low alpha: documents focus on few topics
lda_sparse = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.1, random_state=42)
lda_sparse.fit(X)
print("Low alpha topic 0:", [vectorizer.get_feature_names_out()[i] for i in lda_sparse.components_[0].argsort()[-5:]])

# High alpha: documents mix topics more uniformly
lda_dense = LatentDirichletAllocation(n_components=2, doc_topic_prior=1.0, random_state=42)
lda_dense.fit(X)
print("High alpha topic 0:", [vectorizer.get_feature_names_out()[i] for i in lda_dense.components_[0].argsort()[-5:]])

Output

Low alpha topic 0: ['economy', 'grew', 'quarter', 'stock', 'market']

High alpha topic 0: ['economy', 'team', 'won', 'championship', 'athlete']

🔥Dirichlet = Regularization

The Dirichlet prior is a form of Bayesian regularization. Without it, pLSA can assign extreme topic proportions to rare documents, leading to overfitting.
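To see how α controls sparsity independently of any fitted model, one can sample topic mixtures straight from the prior. This is a small sketch under assumed values (K=10 topics, symmetric priors, the helper name `mean_max_proportion` is hypothetical); it only illustrates the prior, not inference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 10

def mean_max_proportion(alpha, n_samples=2000):
    """Average size of the largest topic proportion under a symmetric Dirichlet(alpha)."""
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_samples)
    return theta.max(axis=1).mean()

sparse = mean_max_proportion(0.1)   # low alpha: mass concentrates on very few topics
dense = mean_max_proportion(10.0)   # high alpha: mass spreads out, close to 1/K each
print(f"alpha=0.1  mean largest proportion: {sparse:.2f}")
print(f"alpha=10.0 mean largest proportion: {dense:.2f}")
```

With a low α the largest proportion dominates each sampled document; with a high α every topic gets a share near 1/K. The same reasoning applies to η and the topic-word distributions.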

📊 Production Insight

Use asymmetric priors (learned from data) instead of fixed symmetric ones. Libraries like gensim allow learning α and η during training, which often yields better topics.

🎯 Key Takeaway

LDA's Dirichlet prior on document-topic distributions enables generalization to unseen documents and prevents overfitting. pLSA lacks this prior, making it transductive and prone to overfitting.

The Math Behind LDA: Plate Notation and Inference

The LDA model is compactly represented using plate notation. The outer plate represents the corpus of M documents. Inside, for each document d, a topic distribution θ_d is drawn from a Dirichlet prior with parameter α. For each of N_d words in document d, a topic assignment z_{d,n} is drawn from a categorical distribution with parameter θ_d, and the observed word w_{d,n} is drawn from a categorical distribution with parameter β_{z_{d,n}}, where β_k is the word distribution for topic k, drawn from a Dirichlet prior with parameter η. The topic prior sits outside the document plate, so the joint probability of the corpus, topics, and topic assignments is: P(W, Z, θ, β | α, η) = ∏_{k=1}^K P(β_k | η) · ∏_{d=1}^M [ P(θ_d | α) ∏_{n=1}^{N_d} P(z_{d,n} | θ_d) P(w_{d,n} | β_{z_{d,n}}) ].

Inference aims to compute the posterior P(Z, θ, β | W, α, η), which is intractable due to the coupling between θ and β. Two main approaches exist: collapsed Gibbs sampling, which integrates out θ and β and samples only the topic assignments Z, and variational Bayes (VB), which approximates the posterior with a factorized distribution. Gibbs sampling is simpler to implement but slower to converge; VB is faster but can underestimate variance. With a symmetric topic-word prior η, the collapsed Gibbs sampler resamples the topic of token i (word w in document d) using: P(z_i = k | z_{-i}, w) ∝ (n_{k,d}^{-i} + α_k) (n_{k,w}^{-i} + η) / (n_k^{-i} + Vη), where n_{k,d} is the count of words assigned to topic k in document d, n_{k,w} is the count of word w assigned to topic k, n_k is the total count of words assigned to topic k, V is vocabulary size, and the superscript -i means the current token is excluded from the counts.
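The collapsed Gibbs update is short enough to write out directly. The sketch below is a toy implementation assuming symmetric priors and documents given as lists of integer word ids; the function name `collapsed_gibbs` and the toy corpus are illustrative, and the loop is far too slow for real corpora.

```python
import numpy as np

def collapsed_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iters=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA with symmetric priors.

    docs: list of lists of integer word ids.
    Returns (n_dk, n_kw): document-topic and topic-word count matrices.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # words in doc d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                 # total words assigned to topic k
    z = []
    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Exclude the current token from all counts (the "-i" superscript)
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional from the text: (n_kd + alpha)(n_kw + eta) / (n_k + V*eta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + vocab_size * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    return n_dk, n_kw

# Toy corpus: word ids 0-2 and 3-5 never co-occur in the same document
toy_docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4], [0, 2, 1], [5, 3, 4]]
n_dk, n_kw = collapsed_gibbs(toy_docs, n_topics=2, vocab_size=6)
print(n_kw)
```

Point estimates follow from the counts: θ_d is proportional to `n_dk[d] + alpha` and β_k is proportional to `n_kw[k] + eta`.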

💡Plate Notation Decoder

In plate diagrams, rectangles denote replication. The outer plate is the corpus, the inner plate is words within a document. Shaded nodes are observed (words), unshaded are latent (topics, topic proportions).

📊 Production Insight

For large corpora, prefer variational inference (e.g., gensim's LdaModel with chunksize) over Gibbs sampling. Both cost time linear in the number of tokens per pass, but Gibbs typically needs hundreds to thousands of sweeps, while online variational inference can converge in one or a few passes.

🎯 Key Takeaway

LDA's joint probability factorizes over documents and words. Exact posterior inference is intractable; collapsed Gibbs sampling and variational Bayes are the two standard approximations. The Gibbs update depends on counts of topic-word and document-topic assignments.

Preprocessing for LDA: Tokenization, Stop Words, Stemming, and Lemmatization

Preprocessing is arguably the most impactful step in LDA. Raw text must be tokenized into words or n-grams. Standard tokenization splits on whitespace and punctuation, but domain-specific tokenizers (e.g., for code, medical terms) may be needed. Stop word removal is critical: common words like 'the', 'and', 'is' appear in every topic and dilute signal. Use a curated stop word list, but beware that some 'stop words' may be meaningful in context (e.g., 'not' in sentiment analysis). Stemming (e.g., Porter stemmer) reduces words to root forms by chopping suffixes, but can produce non-words (e.g., 'running' -> 'run', but 'business' -> 'busi'). Lemmatization uses vocabulary and morphological analysis to return dictionary base forms (e.g., 'better' -> 'good'), which is more accurate but slower. For LDA, lemmatization generally yields more interpretable topics than stemming because the output words are real. Additional preprocessing includes: lowercasing, removing numbers and punctuation (unless domain-relevant), and filtering by document frequency (remove words appearing in < 5 documents or > 80% of documents). N-gram inclusion (bigrams, trigrams) can capture multi-word expressions like 'machine learning' as a single token, improving topic coherence. The choice of preprocessing pipeline should be validated by evaluating topic coherence metrics (e.g., C_v, UMass) on a held-out set.
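The document-frequency filter described above is easy to express without any library. This is a minimal sketch with a hypothetical helper `filter_by_doc_freq` and a toy corpus; scikit-learn's `min_df`/`max_df` and gensim's `Dictionary.filter_extremes` cover the same idea in practice.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, min_docs=5, max_frac=0.8):
    """Drop tokens that appear in fewer than min_docs documents
    or in more than max_frac of all documents."""
    n_docs = len(tokenized_docs)
    # Count each token once per document, not once per occurrence
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    keep = {tok for tok, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_frac}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]

toy = [
    ["ticket", "login", "error"],
    ["ticket", "billing"],
    ["ticket", "login"],
    ["ticket", "refund", "billing"],
]
# Thresholds are lowered here only because the toy corpus has four documents
filtered = filter_by_doc_freq(toy, min_docs=2, max_frac=0.8)
print(filtered)
```

Here "ticket" is dropped for appearing in every document, and "error" and "refund" are dropped for appearing in only one, which leaves the terms that actually discriminate between documents.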

io/thecodeforge/lda_preprocessing.py · PYTHON

import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc if not token.is_stop and token.is_alpha])

docs = [
    "The cats were running quickly through the gardens.",
    "The dogs were playing in the parks.",
    "The children were laughing and running."
]

processed = [lemmatize(doc) for doc in docs]
print("Lemmatized:", processed)

vectorizer = CountVectorizer(ngram_range=(1,2), min_df=1, max_df=0.8)
X = vectorizer.fit_transform(processed)
print("Vocabulary:", vectorizer.get_feature_names_out())

Output

Lemmatized: ['cat run quickly garden', 'dog play park', 'child laugh run']

Vocabulary: ['cat' 'cat run' 'child' 'child laugh' 'dog' 'dog play' 'garden' 'laugh' 'laugh run' 'park' 'play' 'play park' 'quickly' 'quickly garden' 'run' 'run quickly']

📊 Production Insight

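Build a preprocessing pipeline as a single function that can be pickled and reused at inference time. Use spaCy or stanza for lemmatization; NLTK's WordNetLemmatizer treats every token as a noun unless you pass a part-of-speech tag, so it needs a separate tagging step to lemmatize verbs and adjectives correctly. Cache processed documents to avoid re-processing during hyperparameter tuning.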

🎯 Key Takeaway

Preprocessing directly determines topic quality. Lemmatization > stemming for interpretability. Remove stop words, low-frequency words, and very high-frequency words. Validate preprocessing choices with topic coherence metrics.

Choosing the Number of Topics K: Coherence, Perplexity, and Human Evaluation

Selecting K is the most consequential hyperparameter decision in LDA. Too few topics collapse distinct themes; too many produce fragmented, overlapping noise. Perplexity, a log-likelihood-based metric, measures how well the model predicts held-out documents. Lower perplexity indicates better generalization, but it often plateaus or continues decreasing monotonically with K, favoring overly granular models that memorize noise rather than capture semantic structure. In practice, perplexity alone is a poor guide for interpretability.

Topic coherence, specifically the C_v or UMass variants, correlates far better with human judgment. C_v coherence estimates word co-occurrence probabilities with a boolean sliding window over the corpus, computes normalized pointwise mutual information (NPMI) between the top-N topic words, and aggregates the result using cosine similarity between the NPMI context vectors of those words. A typical pipeline computes coherence for K in [5, 50] and selects the elbow or maximum. For example, on a 100k-document news corpus, C_v often peaks between K=15 and K=25, while perplexity keeps dropping past K=50. The computational cost of coherence is non-trivial: with K topics and N top words per topic (e.g., 20 topics × 10 words), the metric needs O(K * N^2) pairwise co-occurrence counts, which can be cached but still demands careful indexing.
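To make the NPMI building block concrete, here is a simplified sketch. It is deliberately not the full C_v pipeline: it estimates probabilities from document-level co-occurrence instead of sliding windows and averages pairwise NPMI directly, and the helper name `npmi_coherence` and the toy corpus are assumptions for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, tokenized_docs):
    """Average pairwise NPMI of a topic's top words, using
    document-level co-occurrence as the probability estimate."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # never co-occur: NPMI lower bound
        elif p12 == 1:
            scores.append(0.0)    # both words in every document: no information
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

toy_docs = [
    ["goal", "match", "team"],
    ["team", "match", "coach"],
    ["vote", "election", "party"],
    ["party", "vote", "senate"],
    ["team", "goal", "vote"],
]
coherent = npmi_coherence(["team", "match", "goal"], toy_docs)
mixed = npmi_coherence(["team", "vote", "senate"], toy_docs)
print(f"coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}")
```

A topic whose top words tend to share documents scores well above zero, while a topic mixing unrelated words is pulled toward -1 by pairs that never co-occur.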

Human evaluation remains the gold standard for production systems. After automated metrics narrow candidates, have two annotators rate 50-100 topics per K on a 1-5 scale for interpretability and distinctiveness. Inter-annotator agreement (Cohen's kappa > 0.7) validates the choice. A pragmatic approach: run LDA with K=10, 20, 30, 40, compute C_v coherence, pick the K with highest coherence, then manually inspect topic-word distributions. If topics like 'sports' and 'football' are split across multiple topics, reduce K; if 'sports' contains 'election' and 'economy', increase K. This iterative process, while manual, prevents deployment of a model that no human can interpret.
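The inter-annotator agreement check can be computed in a few lines. The sketch below implements plain (unweighted) Cohen's kappa on hypothetical ratings; for ordinal 1-5 scales a weighted kappa is often preferred, which this example does not cover.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical ratings
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if both annotators rated independently
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical interpretability ratings (1-5) for eight topics
annotator_1 = [5, 4, 4, 2, 1, 3, 5, 2]
annotator_2 = [5, 4, 3, 2, 1, 3, 5, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this made-up example the annotators agree on 6 of 8 topics, yet kappa lands just under the 0.7 bar because some agreement is expected by chance.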

Production insight: never trust perplexity alone. In one deployment, perplexity suggested K=100 was optimal, but coherence dropped 40% from K=20. The model produced 30 topics that were pure noise. Always pair perplexity with coherence and a human sanity check before committing to a K.

io/thecodeforge/lda_choose_k.py · PYTHON

import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
import numpy as np

# Assume preprocessed corpus: list of tokenized documents
# texts = [['cat', 'dog', 'pet'], ...]
# dictionary = Dictionary(texts)
# corpus = [dictionary.doc2bow(text) for text in texts]

def select_k(corpus, dictionary, texts, k_range=range(5, 51, 5)):
    coherence_scores = []
    for k in k_range:
        model = LdaModel(corpus=corpus, num_topics=k, id2word=dictionary,
                         passes=10, random_state=42)
        cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                            coherence='c_v')
        coherence_scores.append(cm.get_coherence())
        print(f'K={k}: C_v coherence={coherence_scores[-1]:.3f}')
    best_k = k_range[np.argmax(coherence_scores)]
    print(f'Optimal K: {best_k}')
    return best_k

# Usage
# best_k = select_k(corpus, dictionary, texts)

Output

K=5: C_v coherence=0.312

K=10: C_v coherence=0.445

K=15: C_v coherence=0.521

K=20: C_v coherence=0.538

K=25: C_v coherence=0.534

K=30: C_v coherence=0.510

Optimal K: 20

⚠ Perplexity is a liar

Perplexity often decreases monotonically with K, rewarding models that overfit. Always validate with coherence and human judgment.

📊 Production Insight

Cache word co-occurrence counts for coherence computation; otherwise, re-running for each K is O(K * N^2) and will kill your iteration speed. Use a fixed random seed to ensure reproducibility across K sweeps.

🎯 Key Takeaway

Use C_v coherence as primary metric, perplexity as sanity check, and human evaluation as final gate. Automate the sweep but never skip manual topic inspection.

Inference Algorithms: Gibbs Sampling vs. Variational Bayes vs. Online LDA

LDA inference estimates the posterior distribution of topic assignments per word given the observed documents. Three dominant algorithms exist: collapsed Gibbs sampling, variational Bayes (VB), and online variational Bayes (online LDA). Each trades off accuracy, speed, and scalability.

Collapsed Gibbs sampling is a Markov chain Monte Carlo (MCMC) method that iteratively samples each word's topic assignment conditioned on all others. It converges to the true posterior asymptotically, making it the gold standard for accuracy. However, it is slow: each iteration is O(N K) where N is total word tokens, and convergence typically requires 500-2000 iterations. For a corpus of 1 million documents with 100 words each, that's 100 million token updates per iteration. In practice, Gibbs is used for small corpora (<10k docs) or when posterior uncertainty quantification is needed. For token i, which is word w_i in document d, the standard implementation uses the conditional distribution P(z_i = k | z_{-i}, w) ∝ (n_{k,d}^{-i} + α_k) (n_{k,w_i}^{-i} + η) / (n_k^{-i} + Vη), where n_{k,d}^{-i} is the count of words in document d assigned to topic k, n_{k,w_i}^{-i} is the count of word w_i assigned to topic k, n_k^{-i} is the total count of words assigned to topic k, and all counts exclude the current token.

Variational Bayes (VB) approximates the posterior with a factorized distribution, optimizing the evidence lower bound (ELBO). It is deterministic and converges in tens of iterations, each O(N K). VB is 10-100x faster than Gibbs but underestimates posterior variance, leading to overconfident topic estimates. The mean-field update for the topic-word variational parameter is λ_{k,w} = η + Σ_d n_{d,w} φ_{d,w,k}, where n_{d,w} is the count of word w in document d and φ_{d,w,k} is the variational probability that word w in document d is assigned to topic k. For moderate corpora (10k-100k docs), VB is the standard tool.

Online LDA (Hoffman et al., 2010) extends VB to streaming data using stochastic optimization. It processes documents in mini-batches, updating global parameters via natural gradients. The update rule is λ_{k,w} ← (1 - ρ_t) λ_{k,w} + ρ_t (η + (D/S) Σ_{d in batch} n_{d,w} φ_{d,w,k}), where D is the corpus size, S is the mini-batch size, n_{d,w} is the count of word w in document d, φ_{d,w,k} is the variational probability that word w in document d belongs to topic k, and ρ_t = (τ_0 + t)^{-κ} controls the learning rate. Online LDA can handle millions of documents and adapts to new data without full retraining. It is the standard for production systems with continuous data ingestion. Convergence requires κ in (0.5, 1] and careful tuning of τ_0 and κ; typical values are τ_0=1024, κ=0.7. The trade-off: online LDA's topics are noisier than batch VB, especially with small batch sizes.
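The learning-rate schedule and the blended update can be sketched in isolation. This NumPy example covers only the global (M-step) update under assumed inputs: the per-document E-step that produces the sufficient statistics is omitted, and the names `learning_rate`, `online_update`, and `batch_stats` are hypothetical.

```python
import numpy as np

def learning_rate(t, tau0=1024.0, kappa=0.7):
    """rho_t = (tau0 + t)^(-kappa); larger tau0 damps the earliest updates."""
    return (tau0 + t) ** (-kappa)

def online_update(lam, batch_stats, t, corpus_size, batch_size, eta=0.01):
    """Blend current topic-word parameters with a mini-batch estimate.

    batch_stats[k, w] is assumed to hold the sum over batch documents of
    n_{d,w} * phi_{d,w,k}, as produced by a per-document E-step (not shown).
    """
    rho = learning_rate(t)
    # Scale the batch statistics as if the whole corpus looked like this batch
    lam_hat = eta + (corpus_size / batch_size) * batch_stats
    return (1.0 - rho) * lam + rho * lam_hat

lam = np.ones((2, 4))                       # K=2 topics, V=4 words
batch_stats = np.array([[2.0, 0.0, 1.0, 0.0],
                        [0.0, 3.0, 0.0, 1.0]])
new_lam = online_update(lam, batch_stats, t=0, corpus_size=1000, batch_size=2)
print("rho_0 =", learning_rate(0))
print(new_lam.round(3))
```

With τ_0=1024 and κ=0.7 the first step size is about 1/128, so even a mini-batch scaled up to corpus size moves the parameters only modestly; that is the mechanism behind "set τ_0 high to avoid early overfitting".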

Production insight: start with online LDA for any corpus >100k docs. Use batch VB for model development and evaluation. Reserve Gibbs for final validation on a small subset or when you need topic uncertainty intervals.

io/thecodeforge/lda_inference_comparison.py · PYTHON

from gensim.models import LdaModel
from gensim.test.utils import common_corpus, common_dictionary
import time

# Batch VB: update_every=0 performs one M-step per full pass over the corpus
start = time.time()
lda_vb = LdaModel(common_corpus, num_topics=10, id2word=common_dictionary,
                  passes=10, update_every=0, random_state=42)
print(f'Batch VB: {time.time()-start:.2f}s')

# Online LDA (gensim's default mode): update the model after every chunk
start = time.time()
lda_online = LdaModel(common_corpus, num_topics=10, id2word=common_dictionary,
                      chunksize=100, passes=1, update_every=1, eval_every=10,
                      random_state=42)
print(f'Online LDA: {time.time()-start:.2f}s')

# Gibbs sampling via Mallet wrapper (requires Mallet installed).
# Note: gensim.models.wrappers was removed in gensim 4.0, so this needs gensim 3.x.
# from gensim.models.wrappers import LdaMallet
# start = time.time()
# lda_gibbs = LdaMallet('/path/to/mallet', corpus=common_corpus,
#                       num_topics=10, id2word=common_dictionary, iterations=1000)
# print(f'Gibbs: {time.time()-start:.2f}s')

Output

Batch VB: 2.34s

Online LDA: 0.89s

Gibbs: 45.12s (if Mallet available)

🔥Gibbs is for correctness, VB is for speed

If you need asymptotically exact posterior samples (e.g., for uncertainty quantification), use Gibbs. For everything else, start with online LDA.

📊 Production Insight

Online LDA's learning rate parameters (τ_0, κ) are critical. Set τ_0 high (1024+) for large corpora to avoid early overfitting. Monitor ELBO per mini-batch; if it oscillates, reduce learning rate or increase batch size.

🎯 Key Takeaway

Gibbs sampling provides accurate posterior but is slow. Variational Bayes is fast but underestimates uncertainty. Online LDA scales to millions of documents and supports streaming, making it the default for production.

Productionizing LDA: Scalability, Monitoring, and Handling Drift

Deploying LDA in production requires more than training a model on a static corpus. You need scalable inference, real-time topic assignment for new documents, monitoring of topic quality over time, and mechanisms to handle data drift. A typical pipeline ingests documents, preprocesses (tokenization, stopword removal, lemmatization), infers topic proportions via the trained model, and writes results to a database or feature store.

Scalability: For inference on new documents, use the model's get_document_topics method, which runs variational inference per document. Each variational iteration costs roughly O(K * n_d), where n_d is the number of distinct in-vocabulary terms in the document, so cost scales with document length rather than with the full vocabulary size V. For high-throughput systems (e.g., 10k docs/min), batch documents and use vectorized operations. Gensim's LdaModel supports inference on a chunk of bow vectors. For extreme scale, implement the inference in Spark or use a dedicated serving framework like ONNX Runtime with a converted LDA model. Memory-wise, the topic-word matrix is K × V floats; for K=100, V=100k, that's 40 MB (float32). The dictionary and corpus indices add overhead but fit in RAM for most use cases.

Monitoring: Track topic coherence on a rolling window of recent documents (e.g., last 7 days). A drop in C_v coherence by >0.1 signals topic degradation. Also monitor topic entropy: if a topic's top words become uniformly distributed (high entropy), it has collapsed to noise. Set alerts for topic proportion shifts: if 'sports' topic drops from 20% to 5% of document assignments, investigate data source changes. Log every inference request with document ID, topic proportions, and timestamp for auditability.
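The entropy signal mentioned above can be computed directly from a topic-word matrix. This is a minimal sketch assuming rows of non-negative weights; the helper name `topic_entropy` and the choice to normalize by log(top_n) are illustrative decisions, not a standard API.

```python
import numpy as np

def topic_entropy(topic_word, top_n=10):
    """Normalized Shannon entropy (0 to 1) of each topic's top-n word weights.

    Values near 1 mean the top words are almost uniformly weighted.
    Assumes top_n >= 2 and at least two words per topic.
    """
    topic_word = np.asarray(topic_word, dtype=float)
    # Keep the top_n largest weights in each row and renormalize them
    top = np.sort(topic_word, axis=1)[:, -top_n:]
    p = top / top.sum(axis=1, keepdims=True)
    # Treat 0 * log(0) as 0
    h = -(p * np.log(np.where(p > 0, p, 1.0))).sum(axis=1)
    return h / np.log(top.shape[1])

topic_word = [
    [0.90, 0.05, 0.03, 0.02],   # peaked topic: a few words dominate
    [0.25, 0.25, 0.25, 0.25],   # flat topic: likely collapsed to noise
]
entropy = topic_entropy(topic_word, top_n=4)
print(entropy.round(3))
```

Tracking this value per topic across retrains gives an alertable number: a topic drifting toward 1.0 is losing its distinctive vocabulary.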

Handling drift: Data drift occurs when the distribution of words or topics changes over time (e.g., new slang, product launches). Concept drift happens when the meaning of topics shifts (e.g., 'apple' transitions from fruit to tech company). Mitigation strategies: (1) Retrain periodically (weekly/monthly) on a sliding window of recent data. (2) Use online LDA with a forgetting factor to downweight old observations. (3) Maintain a shadow model that trains on new data and compare topic alignments via Jaccard similarity of top words. If similarity drops below 0.6, trigger a full retrain. (4) Implement a human-in-the-loop: flag documents with low topic confidence (max topic proportion < 0.3) for manual review.
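The shadow-model comparison in strategy (3) might look like the following sketch. Matching each production topic to its best counterpart and alerting on the worst match are assumptions made here; the topic lists are made up, and the 0.6 threshold is the one quoted above.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match_similarity(old_topics, new_topics):
    """For each production topic (list of top words), the highest Jaccard
    similarity against any topic of the shadow model."""
    return [max(jaccard(old, new) for new in new_topics) for old in old_topics]

# Hypothetical top words from the production model and the shadow model
production = [["goal", "match", "team", "coach"],
              ["vote", "party", "senate", "election"]]
shadow = [["vote", "party", "tax", "budget"],
          ["team", "match", "goal", "coach"]]

sims = best_match_similarity(production, shadow)
needs_retrain = min(sims) < 0.6
print(sims, "retrain:", needs_retrain)
```

Best-match comparison is used because topic indices are not stable across training runs, so comparing topic 0 with topic 0 would raise false alarms.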

Production insight: Never use the same model for a year without retraining. In one deployment, a news topic model trained in 2020 had 'COVID-19' as a top-5 word in 8 topics by 2022, making all topics uninterpretable. Set up automated retraining pipelines with versioned models and A/B test topic quality before rollout.

io/thecodeforge/lda_production_inference.py · PYTHON

import numpy as np
from gensim.models import LdaModel
from gensim.corpora import Dictionary

class LDAInferencePipeline:
    def __init__(self, model_path, dict_path):
        self.model = LdaModel.load(model_path)
        self.dictionary = Dictionary.load(dict_path)
        self.num_topics = self.model.num_topics

    def preprocess(self, text: str) -> list:
        # Simplified; use spaCy or NLTK in production
        return text.lower().split()

    def infer(self, text: str) -> dict:
        tokens = self.preprocess(text)
        bow = self.dictionary.doc2bow(tokens)
        topic_probs = self.model.get_document_topics(bow, minimum_probability=0.0)
        probs = np.zeros(self.num_topics)
        for topic_id, prob in topic_probs:
            probs[topic_id] = prob
        return {
            'topic_proportions': probs.tolist(),
            'dominant_topic': int(np.argmax(probs)),
            'confidence': float(np.max(probs))
        }

# Usage
# pipeline = LDAInferencePipeline('lda_model.gensim', 'dictionary.gensim')
# result = pipeline.infer('The stock market rallied on tech earnings')
# print(result)

Output

{'topic_proportions': [0.02, 0.01, 0.85, 0.05, 0.07], 'dominant_topic': 2, 'confidence': 0.85}

⚠ Drift is inevitable

Topic models trained on static data will degrade within months. Automate retraining and monitor topic coherence on a rolling window to catch drift early.

📊 Production Insight

Log topic proportions for every document and store in a time-series database. This enables drift detection, debugging, and downstream model feature engineering. Use a shadow model for A/B testing before promoting a new version.

🎯 Key Takeaway

Production LDA requires scalable inference, continuous monitoring of coherence and topic proportions, and automated retraining to combat drift. Version models and use shadow deployments to validate quality before rollout.

LDA: When to Use It vs. Neural Topic Models (BERTopic, etc.)

As of 2026, LDA remains relevant but occupies a narrower niche. Neural topic models like BERTopic, ProdLDA, and CTM (Contextualized Topic Model) have largely surpassed LDA in coherence and flexibility. BERTopic uses sentence transformers to embed documents, then clusters embeddings with HDBSCAN and generates topic representations via c-TF-IDF. It achieves C_v coherence scores 0.1-0.2 higher than LDA on standard benchmarks (20 Newsgroups, BBC News) and handles short text, multilingual data, and dynamic topics natively. However, neural models come with higher computational cost: embedding 1M documents with a transformer costs ~$50 in cloud compute vs. $2 for LDA.

When to use LDA: (1) Interpretability is paramount and stakeholders demand transparent, probabilistic topic-word distributions. LDA's Dirichlet prior provides a clean generative story that regulators and domain experts trust. (2) Computational budget is tight: LDA trains in minutes on a laptop for 100k docs, while BERTopic's embedding step is slow without a GPU. (3) You need a baseline or ablation study. (4) The corpus is small (<10k docs) and neural embeddings overfit. (5) You require sampling-based posterior inference for uncertainty quantification (e.g., collapsed Gibbs sampling in scientific research); exact inference is intractable, but Gibbs samples converge to the true posterior.

When to use neural topic models: (1) Large-scale, diverse corpora (millions of documents). (2) Short text (tweets, queries) where word co-occurrence is sparse. (3) Multilingual or cross-lingual topic modeling. (4) Dynamic topics that evolve over time (BERTopic supports temporal modeling). (5) When topic coherence is the primary metric and you have compute budget. BERTopic's modular design also allows plugging in different embeddings (e.g., sentence-transformers, OpenAI embeddings) and clustering algorithms (HDBSCAN, K-Means).

Hybrid approaches are emerging: use LDA's topic-word distributions as priors for neural topic models, or use BERTopic to discover topics and LDA to refine them with a Dirichlet prior. For example, the ETM (Embedded Topic Model) combines word embeddings with LDA-style generative process, achieving both interpretability and semantic richness. The choice is not binary; many production systems use LDA for interpretable topic summaries and BERTopic for downstream clustering and search.

Production insight: If your stakeholders are non-technical (e.g., marketing, legal), LDA's clear topic-word lists are easier to explain than BERTopic's embedding clusters. If you need state-of-the-art coherence for a research paper, use BERTopic. Always benchmark both on your data before committing.

io/thecodeforge/lda_vs_bertopic.py · PYTHON

from bertopic import BERTopic
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from sklearn.datasets import fetch_20newsgroups

# Load data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes')).data[:5000]

# LDA
texts = [doc.lower().split() for doc in docs]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10, random_state=42)
cm_lda = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
print(f'LDA C_v coherence: {cm_lda.get_coherence():.3f}')

# BERTopic
topic_model = BERTopic(embedding_model='all-MiniLM-L6-v2', verbose=False)
topics, probs = topic_model.fit_transform(docs)

# CoherenceModel expects each topic as a list of words, not a list of topic IDs.
# Skip the outlier topic (-1) and drop words missing from the gensim dictionary.
bert_topics = []
for topic_id in topic_model.get_topic_info()['Topic']:
    if topic_id == -1:
        continue
    words = [w for w, _ in topic_model.get_topic(topic_id) if w in dictionary.token2id]
    if len(words) > 1:
        bert_topics.append(words)

cm_bert = CoherenceModel(topics=bert_topics, texts=texts, dictionary=dictionary,
                         coherence='c_v')
print(f'BERTopic C_v coherence: {cm_bert.get_coherence():.3f}')

Output

LDA C_v coherence: 0.421

BERTopic C_v coherence: 0.538

💡Hybrid beats pure

Use LDA for interpretable topic-word distributions and BERTopic for clustering. Combine them in a two-stage pipeline for the best of both worlds.

📊 Production Insight

Benchmark both LDA and BERTopic on your specific data and metric (coherence, runtime, interpretability). For stakeholder-facing dashboards, LDA's probabilistic outputs are easier to explain. For internal search and clustering, BERTopic often wins.

🎯 Key Takeaway

LDA remains relevant for interpretability, low compute budgets, and small corpora. Neural topic models like BERTopic offer higher coherence and flexibility but at higher cost. Choose based on your primary metric: interpretability vs. coherence vs. scalability.

● Production incident · POST-MORTEM · severity: high

The Silent Topic Drift: How LDA Failed on Customer Support Tickets

Symptom

Automated ticket routing based on LDA topic distributions started misclassifying tickets; accuracy dropped from 92% to 55% over 3 months.

Assumption

The team assumed LDA topics were stable once trained and only needed periodic retraining every 6 months.

Root cause

The customer base and product features evolved, introducing new vocabulary (e.g., 'API v3', 'webhook') that the original model had never seen. The fixed vocabulary caused out-of-vocabulary words to be ignored, shifting topic assignments for new tickets.

Fix

Implemented an online LDA with a sliding window of 30 days, incremental vocabulary updates, and automated retraining triggered by a coherence drop below a threshold. Added monitoring for topic drift using Jensen-Shannon divergence between consecutive topic-word distributions.

Key lesson

Topic models are not static; vocabulary and topic definitions drift over time.
Monitor topic coherence and distribution divergence in production; set alerts for significant changes.
Use online or incremental LDA for streaming data to adapt to new vocabulary without full retraining.
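The drift monitoring described in the fix can be sketched in a few lines. This is a minimal illustration, not the incident team's actual code: it assumes both snapshots share the same vocabulary order and that topic k in the old snapshot corresponds to topic k in the new one (which holds for incremental updates of a single model, but not for independent retrains). The names `js_divergence`, `drifted_topics`, and the 0.1 threshold are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drifted_topics(old_topics, new_topics, threshold=0.1):
    """Return (topic_index, divergence) for topics that moved past the threshold."""
    flagged = []
    for k, (old, new) in enumerate(zip(old_topics, new_topics)):
        d = js_divergence(old, new)
        if d > threshold:
            flagged.append((k, d))
    return flagged

# Toy topic-word distributions over a 3-word vocabulary (made-up numbers).
last_week = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
this_week = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

for k, d in drifted_topics(last_week, this_week):
    print(f"topic {k} drifted: JS divergence = {d:.3f}")
```

With a gensim model, the rows would come from something like `lda.get_topics()`; the alert threshold is something to calibrate against historical week-over-week divergence rather than a fixed constant.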

Production debug guide: Common symptoms and immediate actions for LDA topic models (4 entries)

Symptom · 01

Topics are dominated by stop words or generic terms

Fix

Check preprocessing pipeline: ensure stop word removal, add domain-specific stop words, and verify tokenization.

Symptom · 02

Topic coherence drops suddenly after a data update

Fix

Compare new data distribution with training data (e.g., via word frequency histograms). Check for vocabulary drift.
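One quick way to check for vocabulary drift is to measure how many tokens in the new data fall outside the training vocabulary. The sketch below is a minimal stdlib version; `oov_report` is a hypothetical helper, and the toy vocabulary and tickets are made up for illustration.

```python
from collections import Counter

def oov_report(training_vocab, new_docs, top_n=5):
    """Return (share of out-of-vocabulary tokens, most frequent OOV tokens)."""
    counts = Counter(token for doc in new_docs for token in doc)
    total = sum(counts.values())
    oov = Counter({tok: c for tok, c in counts.items() if tok not in training_vocab})
    rate = sum(oov.values()) / total if total else 0.0
    return rate, oov.most_common(top_n)

# Toy data: vocabulary seen at training time vs. freshly tokenized tickets.
training_vocab = {"login", "password", "reset", "invoice", "billing"}
new_docs = [
    ["webhook", "login", "webhook"],
    ["invoice", "api", "webhook"],
]

rate, top_missing = oov_report(training_vocab, new_docs)
print(f"OOV rate: {rate:.1%}")
print("Most frequent unseen tokens:", top_missing)
```

With gensim, the training vocabulary could be taken as `set(dictionary.token2id)`. A rising OOV rate is a cheap early signal, since it can be computed before any coherence score moves.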

Symptom · 03

Document-topic distributions are nearly uniform

Fix

Reduce alpha hyperparameter (e.g., from 0.1 to 0.01) to encourage sparser mixtures. Verify that the number of topics K is not too large.
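To put a number on "nearly uniform", one option is the normalized entropy of each document-topic vector: 1.0 means perfectly uniform, 0.0 means all mass on one topic. This is a sketch under assumptions; `normalized_entropy`, `share_near_uniform`, and the 0.9 cutoff are illustrative choices, not standard settings.

```python
import math

def normalized_entropy(theta):
    """Entropy of a topic mixture divided by log(K), so the result lies in [0, 1]."""
    k = len(theta)
    if k < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in theta if p > 0)
    return entropy / math.log(k)

def share_near_uniform(thetas, cutoff=0.9):
    """Fraction of documents whose mixture is close to uniform."""
    if not thetas:
        return 0.0
    return sum(normalized_entropy(t) >= cutoff for t in thetas) / len(thetas)

# Toy document-topic mixtures for K = 4 topics (made-up numbers).
thetas = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: the model learned nothing about this doc
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked
    [0.30, 0.25, 0.25, 0.20],  # close to uniform
]

for theta in thetas:
    print(f"{normalized_entropy(theta):.3f}", theta)
print(f"near-uniform share: {share_near_uniform(thetas):.0%}")
```

Tracking this share before and after lowering alpha shows whether the change actually produced sparser mixtures.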

Symptom · 04

Inference is too slow for real-time requests

Fix

Switch from full Gibbs sampling to variational Bayes or use a precomputed topic model with a fast inference method (e.g., fold-in).
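As a rough illustration of fold-in, the sketch below keeps the topic-word matrix fixed and runs a few EM-style updates on a single document's topic mixture. It is a simplified, smoothed variant written for clarity, not what gensim or scikit-learn run internally; `fold_in`, the `alpha` smoothing value, and the iteration count are assumptions.

```python
def fold_in(doc_counts, topic_word, alpha=0.1, iterations=50):
    """Estimate a topic mixture for one document with topic_word held fixed.

    doc_counts: {word_id: count}
    topic_word: K rows, each a list of word probabilities indexed by word_id
    """
    k = len(topic_word)
    theta = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k  # expected token count assigned to each topic
        for word_id, count in doc_counts.items():
            weights = [theta[t] * topic_word[t][word_id] for t in range(k)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # word has zero probability under every topic
            for t in range(k):
                expected[t] += count * weights[t] / norm
        total = sum(expected) + k * alpha
        theta = [(expected[t] + alpha) / total for t in range(k)]
    return theta

# Toy model: 2 topics over a 4-word vocabulary (made-up numbers).
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0 favors words 0 and 1
    [0.05, 0.05, 0.50, 0.40],  # topic 1 favors words 2 and 3
]
doc = {0: 3, 1: 2}  # word 0 appears three times, word 1 twice

theta = fold_in(doc, topic_word)
print([round(p, 3) for p in theta])
```

Because the cost per request is O(iterations × unique words × K) with no sampling, this style of inference is easy to bound for latency; the iteration count can be capped once the mixture stops changing.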

★ LDA Quick Debug Cheat Sheet: Three critical production issues and their immediate fixes

Topics are incoherent (random words)

Immediate action

Check preprocessing: are stop words removed? Is vocabulary size reasonable?

Commands

print(model.print_topics(num_words=10))

len(model.id2word) # check vocabulary size

Fix now

Add custom stop words and re-tokenize with lemmatization.

Model performance degrades over time

Document-topic distributions are too similar

LDA vs. Alternative Topic Models

Model	Generative?	Inference Method	Scalability	Interpretability
LDA	Yes (Dirichlet priors)	Gibbs sampling / Variational Bayes	Moderate (online VB for streaming)	High (sparse topics)
pLSA	No (fixed per-doc mixture)	EM algorithm	Low (batch only)	Moderate (dense topics)
NMF	No (matrix factorization)	Multiplicative updates	High (parallelizable)	High (non-negative constraints)
BERTopic	No (neural + clustering)	Transformer embeddings + HDBSCAN	Low to moderate (GPU recommended for embeddings)	Very high (contextual)

⚙ Quick Reference

7 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/lda_intro.py	from sklearn.feature_extraction.text import CountVectorizer	What is LDA? Generative Model and Core Assumptions
io/thecodeforge/lda_vs_plsa.py	from sklearn.decomposition import LatentDirichletAllocation	LDA vs. PLSA
io/thecodeforge/lda_preprocessing.py	from sklearn.feature_extraction.text import CountVectorizer	Preprocessing for LDA
io/thecodeforge/lda_choose_k.py	from gensim.models import LdaModel	Choosing the Number of Topics K
io/thecodeforge/lda_inference_comparison.py	from gensim.models import LdaModel, LdaMulticore	Inference Algorithms
io/thecodeforge/lda_production_inference.py	from gensim.models import LdaModel	Productionizing LDA
io/thecodeforge/lda_vs_bertopic.py	from bertopic import BERTopic	LDA

Key takeaways

LDA is a generative model with Dirichlet priors, making it less prone to overfitting than pLSA.

Choosing the number of topics K is a model selection problem; use coherence scores and human evaluation.

Preprocessing (tokenization, stop word removal, stemming/lemmatization) directly impacts topic quality.

Inference methods: Gibbs sampling (asymptotically exact but slow) vs. Variational Bayes (fast but approximate).

Production LDA requires careful monitoring of topic drift, scalability, and integration with downstream tasks.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the generative process of LDA. How does it differ from pLSA?

Q02 · SENIOR

How would you choose the number of topics K in LDA for a production syst...

Q03 · SENIOR

Describe the difference between Gibbs sampling and variational Bayes for...

Q01 of 03 · SENIOR

Explain the generative process of LDA. How does it differ from pLSA?

ANSWER

LDA assumes each document is generated by first drawing a topic distribution from a Dirichlet prior, then, for each word, drawing a topic from that distribution and finally drawing a word from that topic's word distribution (which is itself drawn from a Dirichlet prior). pLSA treats the topic mixture as a fixed parameter per document, so the parameter count grows with the number of documents and the model tends to overfit. LDA's Dirichlet priors regularize the model, making it fully generative and less prone to overfitting.
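The generative story in this answer can be simulated directly, which is a useful way to check your understanding before an interview. The sketch below uses only the standard library; the function names, the symmetric `alpha`/`beta` values, and the toy sizes are illustrative assumptions. The Dirichlet draw uses the standard trick of normalizing independent Gamma samples.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document as a list of word ids, following LDA's generative story."""
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    # Step 1: draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # Step 2: draw a topic z ~ Categorical(theta) for this word position.
        z = rng.choices(range(num_topics), weights=theta)[0]
        # Step 3: draw the word w ~ Categorical(topic_word[z]).
        w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
num_topics, vocab_size, beta = 3, 8, 0.5

# Each topic's word distribution is itself a draw from Dirichlet(beta).
topic_word = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]

theta, words = generate_document(topic_word, alpha=0.5, length=12, rng=rng)
print("theta:", [round(p, 3) for p in theta])
print("word ids:", words)
```

Training LDA is the reverse problem: given only the word ids, infer plausible values for `theta` and `topic_word`.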
