Decision Trees — When a Timestamp Split Killed Loans
A tree split on 'application minute' dropped approval rates from 35% to 10%.
- Decision trees split data by asking yes/no questions on features, maximising purity at each step.
- Gini impurity and entropy are the two split criteria – Gini is faster, entropy slightly more balanced.
- Depth control and pruning prevent overfitting: max_depth=7 is a safe production default.
- Cost-complexity pruning (CCP) removes low-value branches post-training.
- Biggest mistake: trusting training accuracy alone – a perfect tree on training data often bombs in production.
Imagine you're playing 20 Questions to guess an animal. You ask 'Does it have fur?' then 'Does it live in water?' — each answer narrows the possibilities until you land on the answer. A decision tree does exactly that with data: it asks a series of yes/no questions about your features, following the branch that best separates your data at each step, until it reaches a confident prediction at the leaf.
If you've ever debugged a rule-based system, you'll recognise the pattern: the tree is essentially a collection of nested if-else statements. The magic is that it learns those rules automatically from labeled data — no manual rule writing required.
A decision tree doesn't just guess the questions — it picks them based on math. Each question is chosen to maximise the purity of the resulting groups. That's why trees can discover patterns you didn't know existed. They're the closest thing ML has to a human reasoning process, which is why banks, insurers, and medical systems trust them for decisions that need to be explained.
If you're deploying one in production, you'll learn why depth control and pruning separate a useful model from a memorisation machine.
Every time a bank decides whether to approve your loan, or a doctor's diagnostic tool flags a high-risk patient, or a streaming service labels content as inappropriate — there's a good chance a decision tree is somewhere in that pipeline. They're not flashy, but they're the backbone of some of the most reliable ML systems in production, and they're the building block of powerhouses like Random Forest and XGBoost.
The problem decision trees solve is deceptively simple: given a pile of labelled examples, figure out a set of rules that correctly categorises new, unseen examples. The magic is in HOW those rules are chosen. A bad algorithm might split data arbitrarily. A decision tree uses mathematical criteria — Gini impurity or information gain — to always pick the split that creates the purest, most separable groups. That's what gives it predictive power.
By the end of this article you'll understand exactly how a tree chooses where to split (and why that maths matters), how to train and visualise one in Python with real data, how to diagnose and fix overfitting with pruning and depth control, and what to say when an interviewer asks you to compare Gini impurity to entropy on the spot.
Here's the thing most tutorials skip: the real challenge isn't splitting — it's stopping. Knowing when a tree knows enough is what separates a production model from a textbook example. A perfect tree on training data is often a disaster in the wild.
If you've ever seen a model that aced homework but bombed the exam, you already know the pain of overfitting. Decision trees are the poster child for that problem — and also the solution, once you learn to control them.
What Is a Decision Tree?
A decision tree is one of the most interpretable models in machine learning. It mimics human decision-making by asking a sequence of binary questions. Each question splits the data into smaller groups until a final prediction is made. It's the go-to algorithm for tasks where you need to explain why a prediction was made — like loan approvals or medical diagnoses.
Rather than starting with a dry definition, let's see it in action. Every decision tree consists of three components: root node (first split), internal nodes (subsequent splits), and leaf nodes (final predictions). The path from root to leaf represents a rule: "if feature A <= threshold and feature B = value, then class C." This rule-based nature is why trees are so easy to debug.
Most implementations — including scikit-learn's CART algorithm — only allow binary splits. Each node asks a yes/no question. This keeps the tree interpretable and ensures that splits are computationally efficient. Multi-way splits (like in C4.5) are possible but less common in production because they can fragment data quickly.
In practice, you'll rarely train a single tree for a high-stakes production model without some form of constraint. Even simple datasets of 10,000 rows can produce trees with hundreds of nodes if left unchecked. That's why every experienced ML engineer sets max_depth and min_samples_leaf from the start.
Another nuance: the split threshold is chosen by evaluating every candidate split point (for numerical features) or every category (for categoricals, in implementations that support them natively). This is computationally cheap for small datasets but can be expensive with millions of rows, because the feature values have to be sorted before the candidate thresholds can be scanned. Older scikit-learn releases offered a 'presort' option to speed this up on small datasets.
The default max_depth=None in scikit-learn will happily grow the tree until every leaf is pure, a setting that almost never makes sense for real data. I've seen this trip up engineers who assume 'let the algorithm decide' is safe. It's not. You must set constraints.
If your dataset has millions of rows and hundreds of features, consider sampling or using an approximate split-finding algorithm such as histogram-based binning. Note that scikit-learn's 'presort' option was an exact speed-up rather than an approximation, and it has been deprecated and removed in newer versions.
For example, on a 50,000-row credit dataset, a tree with max_depth=None grew to 1,200 nodes. Setting max_depth=7 reduced it to 40 nodes with only a 2% accuracy drop. That's the kind of trade-off you need to internalise before deploying.
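A minimal sketch of that comparison, assuming a synthetic dataset from make_classification as a stand-in (it is not the credit dataset described above, so the node counts and accuracies will differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in for a tabular dataset (an assumption for illustration).
X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for depth in (None, 7):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = {
        "nodes": tree.tree_.node_count,      # total nodes, internal + leaves
        "leaves": tree.get_n_leaves(),
        "depth": tree.get_depth(),
        "train_acc": tree.score(X_train, y_train),
        "val_acc": tree.score(X_val, y_val),
    }
    print(depth, results[depth])
```

The point of printing both rows side by side is to see how much size the depth cap removes and how little validation accuracy it costs on your own data.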
One more thing: don't forget to export the tree rules for documentation. A tree with depth 7 produces at most 128 rules — manageable for a human to review. Deeper trees produce too many rules and lose the interpretability advantage.
Here's a reality check: that 40-node tree still needs a per-rule review if you're in a regulated industry. I've seen compliance teams reject a model because one rule had an unintuitive threshold like 'income > 49999.5'. Rounding thresholds to human-readable numbers is a small effort that saves days of back-and-forth.
How Decision Trees Choose Splits: Gini vs Entropy
A decision tree builds its rules by asking one question at a time. The question that best separates the data — that is, creates the purest child nodes — wins. Purity is measured mathematically. The two most common metrics are Gini impurity and entropy (information gain).
Gini impurity measures how often a randomly chosen element would be misclassified if labelled randomly according to the class distribution in a node. It ranges from 0 (pure) to 0.5 (maximally impure for binary classes). Entropy, from information theory, measures the average information content — 0 for pure nodes, 1 for maximally impure (binary). Information gain is the reduction in entropy after a split.
In practice, both give similar results. Gini is slightly faster to compute. Scikit-learn uses Gini by default. But entropy tends to produce slightly more balanced trees. The key insight: you're always picking the split that minimises impurity or maximises information gain.
Here's a pure Python example without ML libraries to see the math in action:
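A minimal sketch of the calculation, using only the standard library. The label lists are made-up toy data, and the helper names (gini, entropy, weighted_impurity) are chosen here for illustration:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Toy node: 5 approvals and 5 rejections, split into two children.
parent = ["approve"] * 5 + ["reject"] * 5
left = ["approve"] * 4 + ["reject"] * 1
right = ["approve"] * 1 + ["reject"] * 4

gini_gain = gini(parent) - weighted_impurity(left, right, gini)
info_gain = entropy(parent) - weighted_impurity(left, right, entropy)
print("parent gini:", gini(parent), "parent entropy:", entropy(parent))
print("gini reduction:", gini_gain, "information gain:", info_gain)
```

Both criteria score the same split; they only differ in how they measure the impurity that the split removes.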
But there's a subtlety: the split criterion only matters when the best splits are nearly tied. In that case, Gini and entropy can disagree on which split is better. Cross-validation is your friend — always check both if you have a tiny dataset.
I once saw a team spend three days debugging a model that performed worse with Gini than entropy. Turned out they had a tiny 500-row dataset with a tied best split. After cross-validation, both criteria gave the same test accuracy. The lesson: on large datasets, the difference is noise.
For each numeric feature, CART finds candidate split points by sorting the feature values, which costs O(n log n) per feature. With 1M rows and 100 features, that's roughly 20 million comparisons per feature (1M × log2(1M) ≈ 1M × 20), or about 2 billion in total. Gini's computational advantage comes from its simpler formula: it avoids the logarithm that entropy needs for every candidate split, and that saving is repeated for every feature. On a 1M row dataset, switching from entropy to Gini can cut training time by 15–20% with no accuracy loss.
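The sort-then-scan idea can be sketched for a single numeric feature as follows. This is an illustrative simplification for binary 0/1 labels, not scikit-learn's actual implementation; best_threshold and gini_from_counts are hypothetical helper names, and the income data is made up:

```python
import numpy as np

def gini_from_counts(pos, total):
    """Binary Gini impurity from a positive count and a node size."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)

def best_threshold(x, y):
    """Return (threshold, weighted_gini) for the best split 'x <= threshold'."""
    order = np.argsort(x, kind="stable")       # the O(n log n) step
    x_sorted, y_sorted = x[order], y[order]
    n, total_pos = len(y), int(y.sum())
    best = (None, float("inf"))
    left_pos = 0
    for i in range(1, n):                      # i = size of the left child
        left_pos += int(y_sorted[i - 1])
        if x_sorted[i] == x_sorted[i - 1]:
            continue                           # no threshold fits between equal values
        score = ((i / n) * gini_from_counts(left_pos, i)
                 + ((n - i) / n) * gini_from_counts(total_pos - left_pos, n - i))
        if score < best[1]:
            # Use the midpoint between neighbouring values as the threshold.
            best = ((x_sorted[i] + x_sorted[i - 1]) / 2.0, score)
    return best

income = np.array([20, 25, 30, 35, 60, 65, 70, 80], dtype=float)
approved = np.array([0, 0, 0, 0, 1, 1, 1, 1])
threshold, score = best_threshold(income, approved)
print(threshold, score)
```

After the sort, the scan keeps running class counts, so each candidate threshold is scored in constant time. That is why the sort dominates the cost.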
One more nuance: the split criterion selection also affects interpretability. Gini-based splits tend to be more 'aggressive' in isolating a single class, while entropy-based splits favour balanced groupings. For regulatory reporting, auditors often prefer Gini because it's easier to explain: 'the tree chooses splits that minimise misclassification probability.'
Here's a production trick: when you have a categorical feature with many levels, entropy can become computationally expensive because each log2 calculation adds up. If you see training times spike after adding a high-cardinality feature, try switching to Gini. In one case I debugged, the team had 500 categories in a 'postcode' field, and switching from entropy to Gini cut training time from 45 minutes to 32 minutes on a 500K-row dataset.
- Gini = 0 means all items same class (perfect purity).
- Gini = 0.5 means a 50/50 split (worst case for binary).
- Entropy peaks at 1 for a 50/50 split.
- Both measures are concave in the class proportions: they peak at an even class mix, so a split can never increase the weighted impurity of the children.
Overfitting in Decision Trees: Why Perfect Trees Fail in Production
A decision tree that splits until every leaf is pure has effectively memorised the training data. That's overfitting. The tree will have near-perfect training accuracy but will fail on new data because it models noise, not signal.
Common causes: no maximum depth, too few samples per leaf, splitting on high-cardinality features (like user IDs or timestamps). The tree essentially learns spurious patterns that don't generalise.
Here's a quick way to detect overfitting: compare training and validation accuracy. A gap of more than 5 points is a red flag. For trees, a gap of 10+ points is common without constraints.
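As a rough sketch, that check can be a few lines. The dataset is synthetic and accuracy_gap_points is a hypothetical helper name; the 5-point threshold is the rule of thumb from the paragraph above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def accuracy_gap_points(model, X_train, y_train, X_val, y_val):
    """Training accuracy minus validation accuracy, in percentage points."""
    return 100.0 * (model.score(X_train, y_train) - model.score(X_val, y_val))

# Noisy synthetic data, so an unconstrained tree has noise to memorise.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           flip_y=0.15, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

unconstrained = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
gap = accuracy_gap_points(unconstrained, X_train, y_train, X_val, y_val)
overfit_flag = gap > 5.0
print(f"gap = {gap:.1f} points, overfit flag = {overfit_flag}")
```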
In production, overfit trees degrade silently. Your monitoring system might show 0.99 training accuracy, but the model is rejecting valid requests because it learned non-existent patterns. The loan approval incident earlier is a textbook case, but I've seen this in fraud detection, credit scoring, and medical triage systems.
Another debugging technique: visualise the tree with plot_tree and look for deep branches that split on unusual features. A split on 'customer_id' is a dead giveaway.
Cross-validation helps, but it's not a silver bullet. If you reuse the same CV split every time during hyperparameter tuning, you risk overfitting to that split, and you still miss temporal drift. Always hold out a temporally representative test set that mirrors production conditions.
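One simple way to build a temporal hold-out is to sort by time and reserve the most recent slice. The sketch below assumes a pandas DataFrame with a timestamp column; the column name applied_at and the helper temporal_holdout are hypothetical:

```python
import pandas as pd

def temporal_holdout(df, time_col, test_fraction=0.2):
    """Split so every test row is at least as recent as every training row."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]

# Made-up applications, one per day, shuffled to mimic unordered storage.
applications = pd.DataFrame({
    "applied_at": pd.date_range("2023-01-01", periods=100, freq="D"),
    "income": range(100),
})
shuffled = applications.sample(frac=1.0, random_state=0)

train, test = temporal_holdout(shuffled, "applied_at")
print(train["applied_at"].max(), "<", test["applied_at"].min())
```

A random split would mix future rows into training, which hides exactly the drift this hold-out is meant to expose.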
In production, monitor the distribution of predicted classes. If the tree is overfit, it will often produce predictions that are concentrated on a few leaves. A sudden shift in leaf distribution without corresponding feature drift is a strong indicator of overfitting.
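A possible sketch of leaf-distribution monitoring, using the estimator's apply() method to find the leaf each sample lands in. The data is synthetic, and leaf_shares and max_share_shift are hypothetical helper names; what counts as a "sudden shift" is left for you to calibrate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def leaf_shares(tree, X):
    """Fraction of samples landing in each leaf, keyed by leaf node id."""
    ids, counts = np.unique(tree.apply(X), return_counts=True)
    return dict(zip(ids.tolist(), (counts / counts.sum()).tolist()))

def max_share_shift(reference, current):
    """Largest absolute change in any leaf's share between two windows."""
    leaves = set(reference) | set(current)
    return max(abs(reference.get(leaf, 0.0) - current.get(leaf, 0.0))
               for leaf in leaves)

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:1500], y[:1500])

reference = leaf_shares(tree, X[:1500])   # e.g. the training window
current = leaf_shares(tree, X[1500:])     # e.g. last week's traffic
shift = max_share_shift(reference, current)
print(f"largest leaf-share shift: {shift:.3f}")
```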
Don't forget to check feature importances after retraining. If a previously important feature drops off suddenly, it might be because that feature's splits were noise and the new data doesn't have them. That's a signal to revisit your feature engineering.
Here's a failure story: a UK-based fintech trained a tree on 2 years of loan data and saw 98% validation accuracy. The next quarter, approval rates dropped 40%. Root cause? The tree had a split on 'application day of week' — it turned out the training data had a Tuesday bias because they'd started collecting data on a Tuesday and the pattern was an artefact. The fix: drop time-based features and add a monotonic constraint on income.
Pruning: The Fix for Overfitting
Pruning removes branches that contribute little to generalisation. There are two main strategies: pre-pruning (stop growth early) and post-pruning (grow full tree then cut back).
Pre-pruning uses parameters like max_depth, min_samples_split, min_samples_leaf. These are hyperparameters you tune via cross-validation.
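Tuning those parameters with cross-validation might look like the sketch below. The grid values and the synthetic dataset are assumptions chosen for illustration, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           flip_y=0.1, random_state=0)

# Illustrative grid over the three pre-pruning parameters.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 20],
    "min_samples_leaf": [1, 10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```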
Post-pruning, specifically cost-complexity pruning (CCP), grows the full tree then cuts the branches whose impurity reduction is too small to justify their complexity. Scikit-learn's DecisionTreeClassifier exposes cost_complexity_pruning_path, which returns the effective alphas (ccp_alphas) together with the total leaf impurity at each one. You pick the alpha that gives the best validation score.
The alpha parameter penalises the number of leaves: higher alpha = smaller tree. Don't prune blindly — always use cross-validation or a hold-out set to choose alpha.
In practice, combine both approaches: set a reasonable max_depth (pre-pruning to avoid massive trees), then apply CCP to fine-tune. This is the standard production workflow.
Important: alpha should be chosen on a validation set or by cross-validation, but the optimal alpha might differ slightly once more data is available. After selecting alpha, retrain on the full training set with that alpha.
A common mistake is to select alpha on the training set directly — that defeats the purpose. Always use a separate validation set or cross-validation. The pruning path itself is computed from the training data, so you need independent evaluation to choose alpha.
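A sketch of that workflow, under the assumption that 5-fold cross-validation on the training data is an acceptable way to choose alpha. The dataset is synthetic, and the candidate list is thinned only to keep the example quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=15, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The pruning path is computed from the training data only.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
# Drop the last alpha (it prunes down to the root) and guard against
# tiny negative values caused by floating-point error.
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))
candidates = alphas[:: max(1, len(alphas) // 20)]

# Independent evaluation: cross-validate each candidate alpha.
cv_means = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X_train, y_train, cv=5).mean()
    for a in candidates
]
best_alpha = float(candidates[int(np.argmax(cv_means))])

# Retrain on the full training set with the selected alpha.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha,
                                    random_state=0).fit(X_train, y_train)
print(best_alpha, full_tree.get_n_leaves(), final_tree.get_n_leaves())
```

The held-out test set is never used to pick alpha; it stays available for a final, unbiased check of the pruned tree.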
CCP pruning should be re-evaluated when retraining on new data because the optimal alpha can shift with data distribution. If you retrain quarterly, recompute the pruning path each time.
One additional subtlety: pruning can sometimes make the tree too simple and increase bias. Always measure validation accuracy after pruning. If accuracy drops significantly, consider a slightly lower alpha, since a higher alpha prunes more aggressively. The sweet spot is where the reduction in variance outweighs the increase in bias.
Here's a practical trick: after pruning, manually inspect the remaining leaves. In one project, pruning eliminated 80% of leaves but left 12 leaves — each with a clear business logic. The compliance team loved it because they could explain each leaf's rule to the board.
From Single Tree to Forests: Ensemble Methods
A single decision tree suffers from high variance — small changes in training data produce very different trees. Random Forest and Gradient Boosting fix this by combining many trees.
Random Forest trains many trees on bootstrapped samples and random subsets of features, then averages their predictions. This dramatically reduces variance without increasing bias much. Gradient Boosting builds trees sequentially, each correcting the errors of the previous one.
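To see the effect on your own data, a quick cross-validated comparison is usually enough. The sketch below uses a synthetic dataset and 50 trees (both assumptions made to keep it fast), so the exact scores carry no meaning beyond illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=6,
                           flip_y=0.1, random_state=0)

tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5)

# Compare both the mean and the fold-to-fold spread.
print("tree  :", tree_scores.mean().round(3), tree_scores.std().round(3))
print("forest:", forest_scores.mean().round(3), forest_scores.std().round(3))
```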
These ensemble methods dominate tabular data competitions because they balance bias and variance well. But they sacrifice the simplicity and interpretability of a single tree. In production, you often start with a single tree for debugging, then switch to an ensemble for the final model.
However, ensembles come with trade-offs: 100 trees means 100x inference latency compared to a single tree. If your serving latency budget is under 1ms, a single pruned tree may be your only option. Always measure p99 latency before committing to an ensemble.
Also consider memory: 100 trees of depth 5 can use ~50x more memory than one tree. In memory-constrained environments (e.g., mobile), a single tree might be the only viable option.
In production, also consider the cost of serialising and loading 100 trees — larger deployment packages and longer cold start times. An ensemble of 100 trees might take 10 seconds to load from disk, while a single tree takes 0.1 seconds.
A single tree of depth 5 takes ~0.05ms per prediction. A Random Forest of 100 trees takes ~5ms, and a Gradient Boosting model of 100 trees is similar. For a <1ms SLA, consider using a single tree or a distilled model (a smaller tree trained to mimic the ensemble).
If you need both speed and accuracy, consider a hybrid: train a Random Forest, then distill it into a single shallow tree by training the tree to predict the forest's output. This is called model distillation and gives you near-forest accuracy with single-tree inference speed.
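A minimal sketch of that distillation step, assuming the student is trained on the forest's hard class predictions (training on predicted probabilities is another option not shown here). The dataset, tree count, and depth are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=6,
                           flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Teacher: the full forest, trained on the real labels.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Student: a shallow tree trained to reproduce the forest's predictions.
teacher_labels = forest.predict(X_train)
student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X_train, teacher_labels)

# Fidelity: how often the student agrees with the teacher on unseen data.
fidelity = (student.predict(X_test) == forest.predict(X_test)).mean()
print(f"fidelity to forest: {fidelity:.3f}")
print(f"forest acc: {forest.score(X_test, y_test):.3f}, "
      f"student acc: {student.score(X_test, y_test):.3f}")
```

Track fidelity and accuracy separately: a student can agree with the forest most of the time and still lose accuracy on the cases that matter.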
Here's a real example: a payment fraud detection system needed <500µs per prediction. They trained 50 trees and distilled into a single tree of depth 6. The distilled tree was 50x faster, <0.1ms, with only 0.5% accuracy drop compared to the full forest. That's a win.
- Random Forest reduces variance without increasing bias much.
- Gradient Boosting reduces both bias and variance iteratively.
- Ensembles are harder to interpret — you trade explainability for accuracy.
- In production, start with a single tree for baseline, then move to an ensemble.
Handling Categorical Features and Missing Data in Production
Real-world data is messy. Decision trees can handle categorical features natively if the implementation supports splits like "feature == value". Scikit-learn requires numerical encoding (OrdinalEncoder or OneHotEncoder). But one-hot encoding on high-cardinality categories blows up the tree depth.
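Many decision tree implementations cannot handle missing values, so you must impute before training. Some implementations (like XGBoost) learn a default direction for missing values, and recent scikit-learn releases (1.3 and later) added native missing-value support to DecisionTreeClassifier. In scikit-learn pipelines, using SimpleImputer with median or mode remains common.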
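A sketch of the imputation approach, wrapping the imputer and the tree in one pipeline so the same medians are reused at inference time. The data is randomly generated with about 10% of values blanked out, purely for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the values to simulate missing data.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

model = make_pipeline(
    SimpleImputer(strategy="median"),   # medians are learned during fit
    DecisionTreeClassifier(max_depth=5, random_state=0),
)
model.fit(X_missing, y)
preds = model.predict(X_missing)
print("learned medians:", model[0].statistics_)
```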
Feature importance from trees helps identify which features drive predictions. But be careful: correlated features can split importance arbitrarily. Always cross-check with permutation importance.
Ordinal encoding (OrdinalEncoder) imposes an artificial order. For decision trees this can still work because splits are threshold-based, but it can create splits that are hard to interpret (e.g., 'encoded_color > 2.5'). One-hot encoding (OneHotEncoder) is safer when interpretability is a concern. If you have a categorical feature with hundreds of categories, consider target encoding or use a model that handles categoricals natively, like CatBoost.
A production trick: after training, inspect the tree structure to see which categories are used in splits. If a split uses 'category_42' and that category appears only once in training, that split is pure memorisation. Prune it, drop such rare categories, or group them into an 'other' bucket.
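Grouping rare categories into an 'other' bucket can be sketched in pandas as below. The helper name group_rare, the min_count threshold, and the postcode values are all made up for illustration:

```python
import pandas as pd

def group_rare(series, min_count=2, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest.
    return series.where(~series.isin(rare), other_label)

postcodes = pd.Series(["AB1", "AB1", "CD2", "CD2", "CD2", "EF3", "GH4"])
grouped = group_rare(postcodes, min_count=2)
print(grouped.tolist())
```

In a real pipeline the set of rare categories would be learned on the training data and then reused unchanged at inference time, so unseen categories also fall into the same bucket.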
For production pipelines, persist your encoders. If you retrain and the encoder refits with different category mappings, your model silently breaks. Serialise the encoder alongside the model and load it in inference.
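One way to keep the encoder and the model together is to put both in a single scikit-learn Pipeline and serialise that object. The sketch below round-trips through pickle in memory to stay self-contained; the column names and data are hypothetical, and in practice you would write the bytes to a versioned file:

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up training data with one categorical and one numeric column.
train = pd.DataFrame({
    "employment": ["salaried", "self_employed", "salaried", "unemployed"] * 10,
    "income": [52000, 31000, 47000, 9000] * 10,
})
labels = [1, 0, 1, 0] * 10

preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["employment"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
pipeline.fit(train, labels)

# Serialise encoder + model as one artefact, then restore it.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# An unseen category is encoded as all zeros instead of raising an error.
new_rows = pd.DataFrame({"employment": ["contractor"], "income": [40000]})
print(restored.predict(new_rows))
```

Because the category mapping lives inside the fitted pipeline, a retrain produces a new artefact with its own mapping, and the two can never drift apart.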
Here's a painful real example: a team trained a tree with OneHotEncoder on 2000 categories for 'product_id'. The tree depth exploded to 15 and the model had 90% training accuracy but 30% validation accuracy. They switched to target encoding with smoothing (min_samples_leaf=50) and the depth dropped to 6, validation accuracy jumped to 80%. The fix wasn't more data — it was better encoding.
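For intuition, smoothed target encoding can be sketched by hand. This uses a simple additive-smoothing formula, which is an assumption made here and not necessarily the formula behind the min_samples_leaf parameter mentioned above; smoothed_target_encode and its smoothing argument are hypothetical names:

```python
import pandas as pd

def smoothed_target_encode(categories, target, smoothing=50):
    """Blend each category's mean target with the global mean.

    Categories with few rows are pulled strongly towards the global mean;
    categories with many rows keep a value close to their own mean.
    """
    prior = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    return weight * stats["mean"] + (1.0 - weight) * prior

# Toy data: product A is seen four times, product B only once.
product_id = pd.Series(["A", "A", "A", "A", "B"])
purchased = pd.Series([1, 1, 1, 1, 0])

encoding = smoothed_target_encode(product_id, purchased, smoothing=2)
print(encoding.to_dict())
```

To avoid target leakage, the encoding should be fitted on training folds only and then mapped onto validation and production rows.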
Interpretability: Extracting Rules from Trees for Audits
One major advantage of decision trees is you can extract explicit rules like "if age > 30 and income > 50K then approve loan." This is gold for regulated industries (finance, healthcare). Scikit-learn provides tree_.__getstate__() to dump the tree structure, or you can use export_text to get a text representation.
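A short sketch of export_text, using the bundled iris dataset as a stand-in for real loan data (an assumption made so the example is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# One indented line per node; leaves show the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```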
In production, you might need to log the decision path for each prediction. You can use the fitted estimator's decision_path(X) method to get a sparse matrix showing which nodes each sample passes through. This enables auditing individual predictions.
For regulated models, also consider storing the full tree structure once after training — it becomes a snapshot of your production logic.
A common audit question: "Why was this particular loan rejected?" With a tree, you can provide the exact rule path. If you store decision_path during inference, you can reconstruct the rules offline. This is much harder with ensembles.
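Turning a decision path into a readable rule might look like the sketch below. It again uses iris as a stand-in dataset, and rule_path is a hypothetical helper written for this example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def rule_path(clf, feature_names, sample):
    """List the tests a single sample passes on its way to a leaf."""
    t = clf.tree_
    # Node ids visited by this sample, from the root down to the leaf.
    node_ids = clf.decision_path(sample.reshape(1, -1)).indices
    steps = []
    for node, next_node in zip(node_ids[:-1], node_ids[1:]):
        # Going to the left child means the test 'feature <= threshold' held.
        op = "<=" if next_node == t.children_left[node] else ">"
        steps.append(
            f"{feature_names[t.feature[node]]} {op} {t.threshold[node]:.2f}")
    return steps

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

sample = iris.data[100]
steps = rule_path(clf, iris.feature_names, sample)
print(" AND ".join(steps), "=> class", clf.predict(sample.reshape(1, -1))[0])
```

Storing only the node ids at inference time is enough, because the rule text can be rebuilt offline from the saved tree.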
Also, when extracting rules, watch out for features with unintuitive splits (e.g., 'one-hot encoded column_25 = 0.5') — map them back to original categories for stakeholder comprehension.
Decision trees are also great for debugging ensemble models: train a shallow tree on the same data to get a global approximation of the ensemble's behaviour. This is called a surrogate model. For example, if a Random Forest rejects a loan, you can train a single tree of depth 3 to approximate the forest's decisions, giving you a human-readable explanation.
For maximum audit transparency, store the following after each training run: tree structure (as JSON), feature names, thresholds, class distribution per leaf, and the training set used. This provides a complete snapshot for regulators.
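A possible shape for that JSON snapshot is sketched below. The schema (key names such as class_distribution and n_samples) is an assumption invented for this example, and iris stands in for real training data:

```python
import json

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_to_dict(clf, feature_names):
    """Flatten a fitted tree into JSON-serialisable Python objects."""
    t = clf.tree_
    nodes = []
    for node_id in range(t.node_count):
        is_leaf = bool(t.children_left[node_id] == t.children_right[node_id])
        values = t.value[node_id][0]
        nodes.append({
            "id": node_id,
            "is_leaf": is_leaf,
            "feature": None if is_leaf else feature_names[t.feature[node_id]],
            "threshold": None if is_leaf else float(t.threshold[node_id]),
            "left": None if is_leaf else int(t.children_left[node_id]),
            "right": None if is_leaf else int(t.children_right[node_id]),
            # Normalised so the result is a distribution in every version.
            "class_distribution": (values / values.sum()).tolist(),
            "n_samples": int(t.n_node_samples[node_id]),
        })
    return {"feature_names": list(feature_names), "nodes": nodes}

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

payload = json.dumps(tree_to_dict(clf, iris.feature_names), indent=2)
restored = json.loads(payload)
print(len(restored["nodes"]), "nodes stored")
```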
I've seen a team spend two months preparing for a regulatory audit because they hadn't stored decision paths. Don't be that team. Add decision_path logging as a requirement in your model card template.
- Use export_text() and format the output into a table.
- Store the decision_path() output for each inference in a database.
Feature Importance and Interpretation: Beyond Gini Importance
Decision trees assign a feature importance score to each feature, often based on the total reduction in impurity (Gini importance) achieved by splits on that feature. It's tempting to use these scores for feature selection or to explain model behaviour. But there's a trap: correlated features split importance arbitrarily. A feature that is highly predictive but correlates with another may get low importance simply because the tree used the other feature for splits.
Permutation importance helps here: shuffle a feature's values and measure the drop in accuracy. If the feature is truly important, accuracy drops significantly. scikit-learn provides permutation_importance in sklearn.inspection. Always cross-check built-in importance with permutation importance, especially when features are correlated.
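The cross-check can be sketched as follows. The data is synthetic and deliberately contains two nearly identical features plus one noise feature, which is an assumption made to provoke the correlation effect:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                                  # feature 0: the real driver
    signal + rng.normal(scale=0.1, size=n),  # feature 1: near-duplicate of 0
    rng.normal(size=n),                      # feature 2: pure noise
])
y = (signal > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

gini_importance = tree.feature_importances_
perm = permutation_importance(tree, X_val, y_val,
                              n_repeats=5, random_state=0)
for i in range(X.shape[1]):
    print(f"feature {i}: gini={gini_importance[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Note that permutation importance is computed on the validation set, so it reflects what the model relies on for unseen data, not what it used to fit the training set.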
Another nuance: in production, you might need to explain predictions to stakeholders. For that, decision trees are great — you can extract rule paths. But for a deep tree with 50 leaves, a single rule path can be long and confusing. Consider using a shallow tree (depth ≤ 3) for explanation purposes, or use SHAP values for more rigorous explanations.
I've seen a team drop a feature based on low Gini importance and lose 3% accuracy. Permutation importance later showed that feature had high importance but was masked by a correlated feature. Always check correlation matrices before trusting tree importance.
Permutation importance is model-agnostic but can be affected by feature correlation. SHAP values provide a more granular explanation per prediction. For compliance, SHAP is often preferred over built-in importance. However, SHAP computation adds overhead — measure the time per prediction if you plan to serve SHAP explanations in real time.
For a quick sanity check, also look at the top splits of the tree. If a feature appears in the top two splits and has low Gini importance, that's a red flag. The top splits often dominate importance, so any discrepancy there is suspicious.
Here's a practical workflow I use: compute Gini importance, then for the top 5 features, run permutation importance with n_repeats=5. If the ranks differ by more than 2 positions, investigate correlation. In one case, 'credit_history_length' and 'number_of_previous_loans' had a 0.8 correlation — Gini gave one of them zero importance. Permutation importance showed both were important. The fix was to keep both and note the correlation in the model card.
The Loan Approval Tree That Rejected Everyone
- Never trust training accuracy alone — validate on a held-out set that mirrors production distribution.
- Always cap max_depth to a reasonable value (5–7 for most tabular data).
- Use pruning to remove splits that don't generalise.
- Include feature importance analysis to catch noise splits early.
- Monitor prediction distribution after deployment — a sudden shift in output rates often signals overfit rules breaking.
- Remove near-unique identifiers (timestamps, customer IDs) before training — they're noise magnets.
- Set up an alert for any drop in approval rate > 5% after a model update.
- Check tree depth with get_depth() on the fitted estimator. If >10, prune or limit depth.
Key takeaways
Common mistakes to avoid
- Leaving max_depth=None in production
- Using default min_samples_leaf=1 on large datasets
- Relying on Gini importance for feature selection when features are correlated
- Not persisting encoders with the model
- Assuming higher depth always improves accuracy
Interview Questions on This Topic
Explain how a decision tree chooses the best split. What criteria are used?
Frequently Asked Questions