
How to Visualize Machine Learning Results (Matplotlib & Seaborn)

📍 Part of: ML Basics → Topic 20 of 25
Beautiful charts and graphs every beginner should know how to create.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Every Matplotlib chart starts with fig, ax = plt.subplots() — use the object-oriented interface, always.
  • Seaborn handles DataFrame grouping and statistical estimation automatically — use it for rapid exploration, then drop down to Matplotlib for polish.
  • Confusion matrices reveal class-level failures that accuracy hides — always show both raw counts and row-normalized percentages.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Matplotlib is the foundation — every chart in Python builds on its figure/axes model
  • Seaborn wraps Matplotlib with statistical defaults and far less boilerplate code
  • Confusion matrices, ROC curves, and residual plots reveal model flaws numbers hide
  • Use fig.savefig() at 300 DPI — screen-resolution plots break in reports and slides
  • Production rule: never present raw accuracy alone — always pair with precision, recall, or error distribution
  • Biggest mistake: choosing the wrong chart type for the data relationship you want to communicate
  • Always call plt.close(fig) after saving — open figures leak memory and crash long-running pipelines
🚨 START HERE
ML Visualization Debug Cheat Sheet
Quick checks when your charts do not tell the right story or something looks suspicious.
🟡 Confusion matrix shows all predictions in one class
Immediate Action: Check class balance and prediction threshold. The model is likely predicting the majority class for every input.
Commands
print(f'Positive predictions: {y_pred.sum()} / {len(y_pred)}')
print(df['target'].value_counts(normalize=True))
Fix Now: Lower the decision threshold (e.g., from 0.5 to 0.3) and re-evaluate. If the problem persists, address class imbalance with SMOTE, class weights, or stratified sampling before retraining.
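A minimal sketch of the threshold fix, using synthetic probabilities in place of a real model's predict_proba output:

```python
import numpy as np

# Hypothetical scores standing in for model.predict_proba(X)[:, 1]
rng = np.random.default_rng(42)
y_proba = rng.uniform(0, 0.6, size=1000)  # model rarely exceeds 0.5

# Default threshold: almost no positive predictions
y_pred_default = (y_proba >= 0.5).astype(int)

# Lowered threshold recovers positive predictions
y_pred_low = (y_proba >= 0.3).astype(int)

print(f'Positives at 0.5: {y_pred_default.sum()} / {len(y_proba)}')
print(f'Positives at 0.3: {y_pred_low.sum()} / {len(y_proba)}')
```

Always re-check precision alongside recall after moving the threshold — a lower cutoff trades false negatives for false positives.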
🟡 Learning curve shows training score much higher than validation score
Immediate Action: The model is overfitting — it memorizes training data but cannot generalize.
Commands
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))
Fix Now: Increase regularization, reduce model complexity (fewer trees, shallower depth), or collect more training data. If the validation curve is still rising at maximum data size, more data will help.
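The commands above compute the curves but do not draw them. A runnable sketch of the standard plot, using a synthetic dataset and LogisticRegression purely for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering for scripts and pipelines
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation score')
# Shaded band: +/- one standard deviation across the validation folds
ax.fill_between(train_sizes,
                val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1),
                alpha=0.15)
ax.set_xlabel('Training set size')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curve')
ax.legend()
ax.grid(True, alpha=0.3)
fig.savefig('learning_curve.png', dpi=300, bbox_inches='tight')
plt.close(fig)
```

A persistent gap between the two lines signals overfitting; two low lines that converge signal underfitting.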
🟡 Feature importance plot shows one dominant feature at 95%+
Immediate Action: Check for data leakage — the dominant feature may directly encode or derive from the target variable.
Commands
print(df.corrwith(df['target']).abs().sort_values(ascending=False).head(10))
# Retrain without the suspicious feature and compare performance
model_no_leak = model.fit(X.drop(columns=['suspicious_feature']), y)
Fix Now: Remove the leaky feature and retrain. If accuracy collapses dramatically (e.g., from 99% to 60%), the original model learned nothing real — it was just memorizing the leaked signal.
Production Incident: Fraud Detection Model Degraded for 3 Weeks Because No One Plotted Predictions Over Time
A fraud detection model's precision dropped from 94% to 61% over three weeks, but the team noticed only after a quarterly review because no one was visualizing per-class metrics.
Symptom: False positive rate tripled. Customer support received a 400% spike in fraud-flag complaints from legitimate merchants. The aggregate weekly accuracy metric — reported as a single number on the team dashboard — still showed 89%, masking the class-level collapse entirely.
Assumption: The team monitored a single aggregate accuracy number in their Grafana dashboard. They assumed stability because the headline metric had not moved more than 1% in either direction. No per-class breakdown existed. No prediction distribution plot existed.
Root cause: A new merchant category code (MCC 7399) was introduced by the payment processor three weeks prior. The model had never seen this code during training. It defaulted to high suspicion scores for all transactions with the unfamiliar code, flagging legitimate purchases as fraud. The aggregate accuracy stayed high because fraud cases represent only 1% of transactions — the model's correct predictions on the other 99% of normal transactions dominated the average, drowning out the class-level failure.
Fix: Added daily confusion matrix heatmaps to the monitoring dashboard, broken down by predicted class. Implemented per-class precision and recall time-series plots with automated PagerDuty alerts when any class metric dropped below a configurable threshold for two consecutive days. Added a weekly prediction probability distribution plot (histogram of model confidence scores) to detect distribution shifts before they manifest as metric degradation.
Key Lesson
  • Never monitor a single aggregate metric — break performance down by class, by segment, and over time.
  • Confusion matrices catch class-level failures that accuracy, F1, and even AUC hide when classes are imbalanced.
  • Plot prediction probability distributions weekly to detect distribution shift before downstream metrics degrade.
  • The charts you build during model evaluation should become your production monitoring dashboards — not throwaway notebook cells.
Production Debug Guide
When your charts do not reveal what you expect — or when they reveal something you did not anticipate.
All points on a scatter plot overlap into a single blob
Use alpha transparency (alpha=0.05 to 0.2 depending on density), add jitter with np.random.normal(0, 0.1, size=len(x)), or switch to a 2D density plot with sns.kdeplot(x=x, y=y, fill=True). For very large datasets (>100K points), use datashader or hexbin plots (ax.hexbin) instead of scatter.
Bar chart error bars look identical across all groups
Check if you are plotting standard deviation on a log-scale axis, which compresses the visual differences. Switch to confidence intervals (ci=95 in Seaborn) or standard error of the mean instead of standard deviation. Also verify that your groups actually have different variances — identical error bars might be correct.
ROC curve looks perfect (AUC = 1.0) but model performs poorly in production
This is almost certainly data leakage. Check for target-derived features in your training data, duplicates spanning train and test splits, or temporal leakage where future information bleeds into training rows. A perfect ROC on held-out data means the model has access to the answer, not that it learned the pattern.
Residual plot shows a clear curved or fan-shaped pattern instead of random scatter
A curve means missing non-linearity — add polynomial features, interaction terms, or switch to a non-linear model. A fan shape (residuals widening with predicted value) means heteroscedasticity — log-transform the target variable or use weighted regression.
Saved figure looks different from the notebook display — wrong size, cut-off labels, or blank
Always save before calling plt.show(), which destroys the figure in most backends. Use fig.savefig('name.png', dpi=300, bbox_inches='tight') — the bbox_inches parameter prevents label clipping. Set figsize explicitly in plt.subplots() rather than relying on notebook defaults.
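A runnable sketch of the overplotting fixes from the first entry above, on synthetic data (the point count, alpha value, and gridsize are illustrative, not prescriptive):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt

# Synthetic correlated data dense enough to overplot
rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)
y = x * 0.5 + rng.standard_normal(20_000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Naive scatter: opaque points merge into one solid blob
axes[0].scatter(x, y, s=5)
axes[0].set_title('Opaque scatter (unreadable)')

# Alpha transparency lets density show through
axes[1].scatter(x, y, s=5, alpha=0.05)
axes[1].set_title('alpha=0.05')

# Hexbin aggregates points into density cells
hb = axes[2].hexbin(x, y, gridsize=40, cmap='viridis')
fig.colorbar(hb, ax=axes[2], label='Count')
axes[2].set_title('Hexbin density')

fig.tight_layout()
fig.savefig('overplotting.png', dpi=150, bbox_inches='tight')
plt.close(fig)
```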

Model metrics like accuracy and F1-score tell you the score. Visualizations tell you why. A confusion matrix shows exactly which classes your model confuses. A residual plot reveals systematic prediction errors that RMSE averages away. A learning curve tells you whether collecting more data will help or whether you need a fundamentally different model. These are not decorative — they are diagnostic tools.

Matplotlib provides the rendering engine. Seaborn provides statistical awareness on top of it. You need both: Matplotlib for full control over publication-quality figures, and Seaborn for rapid exploratory analysis with sensible defaults. They are not competitors — Seaborn is literally built on Matplotlib, and every Seaborn plot returns a Matplotlib axes object you can customize further.

The common mistake is treating visualization as an afterthought — something you do after the model is trained and shipped. In production, a well-designed diagnostic dashboard catches model degradation weeks before aggregate metrics move. The charts you build during evaluation become your monitoring tools after deployment. Skip them, and you are flying blind.

Matplotlib Fundamentals: Figure and Axes

Every Matplotlib chart lives inside a Figure that contains one or more Axes. The Figure is the canvas — it controls overall dimensions, background color, and file output. The Axes is the actual plot area with its own x-axis, y-axis, title, and data layers.

Understanding this hierarchy prevents 90% of the layout confusion beginners hit. When you call plt.plot(), Matplotlib implicitly creates a Figure and Axes behind the scenes. This works for quick exploration but falls apart the moment you need multiple subplots, consistent sizing, or saved files. The object-oriented interface — fig, ax = plt.subplots() — gives you explicit handles to both objects and should be your default for anything beyond throwaway exploration.

io/thecodeforge/viz/matplotlib_basics.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np


# --- Method 1: pyplot interface (quick exploration only) ---
# Implicitly creates a Figure and Axes. Fine for throwaway cells.
plt.plot([1, 2, 3], [4, 5, 6])
plt.title('Simple Line Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()


# --- Method 2: object-oriented interface (production standard) ---
# Explicitly creates Figure and Axes. Use this for everything you save.
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot([1, 2, 3], [4, 5, 6], marker='o', linewidth=2, label='Series A')
ax.set_title('Production-Ready Line Plot', fontsize=14, fontweight='bold')
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.legend()
ax.grid(True, alpha=0.3)

fig.tight_layout()
fig.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.close(fig)  # Free memory — critical in loops and pipelines


# --- Multi-panel figure: the pattern you will use most ---
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
np.random.seed(42)
data = np.random.randn(200)

# Panel 1: Distribution
axes[0, 0].hist(data, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 0].set_title('Distribution')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')

# Panel 2: Sequential scatter
axes[0, 1].scatter(np.arange(len(data)), data, alpha=0.4, s=12, color='coral')
axes[0, 1].axhline(y=0, color='black', linestyle='--', alpha=0.3)
axes[0, 1].set_title('Sequential Scatter')
axes[0, 1].set_xlabel('Index')

# Panel 3: Box plot
axes[1, 0].boxplot(data, vert=True, patch_artist=True,
                    boxprops=dict(facecolor='lightblue'))
axes[1, 0].set_title('Box Plot')

# Panel 4: Cumulative sum
axes[1, 1].plot(np.cumsum(data), color='seagreen', linewidth=1.5)
axes[1, 1].set_title('Cumulative Sum')
axes[1, 1].set_xlabel('Index')

fig.suptitle('Exploratory Data Summary', fontsize=16, fontweight='bold')
fig.tight_layout()
fig.savefig('dashboard.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Mental Model
Figure vs Axes
Think of Figure as the paper and Axes as individual charts drawn on that paper. You can have many charts on one piece of paper.
  • Figure = the full canvas. Controls overall size (figsize), background, DPI, and file saving.
  • Axes = one plot area. Has its own x-axis, y-axis, title, legend, and data layers. A Figure can hold many Axes.
  • fig, ax = plt.subplots() creates one Figure with one Axes. This is your starting point for every chart.
  • fig, axes = plt.subplots(2, 3) creates a 2×3 grid. Access individual plots with axes[row, col].
  • Always use the object-oriented interface (ax.plot, ax.set_title) for anything you save or present. The pyplot interface (plt.plot, plt.title) operates on an implicit 'current axes' that causes bugs in multi-panel figures.
📊 Production Insight
plt.show() destroys the figure object in most Matplotlib backends. If you call plt.show() then fig.savefig(), you save a blank file with no error message.
Always save before showing: fig.savefig() first, plt.show() second — or skip plt.show() entirely in automated pipelines.
Rule: in production scripts, scheduled jobs, and CI/CD pipelines, never call plt.show(). Use fig.savefig() and plt.close(fig) to render and release memory. Open figures accumulate and will eventually crash long-running processes.
🎯 Key Takeaway
Figure is the canvas, Axes is the plot. Always use the object-oriented interface.
fig, ax = plt.subplots() is your starting point for every chart — no exceptions for production code.
Save with fig.savefig('name.png', dpi=300, bbox_inches='tight') and always call plt.close(fig) afterward.

Seaborn for Statistical Visualization

Seaborn builds on Matplotlib with high-level functions that understand DataFrames natively. Pass column names directly, and Seaborn handles grouping, aggregation, statistical estimation, and legend creation automatically. Where Matplotlib requires 20 lines for a grouped bar chart with confidence intervals, Seaborn does it in 3.

The key insight is that Seaborn is not a replacement for Matplotlib — it is an accelerator for the statistical plotting patterns you use most often. Every Seaborn function returns a Matplotlib axes object, so you can always drop down to Matplotlib for fine-grained customization after Seaborn does the heavy lifting.

io/thecodeforge/viz/seaborn_basics.py · PYTHON
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


# Set Seaborn theme once at the top of your notebook or script
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.1)

# Generate example data
np.random.seed(42)
df = pd.DataFrame({
    'feature_a': np.random.randn(200),
    'feature_b': np.random.randn(200) * 2 + 1,
    'category': np.random.choice(['Class A', 'Class B', 'Class C'], 200),
    'target': np.random.choice([0, 1], 200)
})


# --- Distribution plots: understand feature spread ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(data=df, x='feature_a', hue='category', kde=True, ax=axes[0])
axes[0].set_title('Feature A Distribution by Category')

sns.boxplot(data=df, x='category', y='feature_b', ax=axes[1])
axes[1].set_title('Feature B Spread by Category')

fig.tight_layout()
fig.savefig('distributions.png', dpi=300, bbox_inches='tight')
plt.close(fig)


# --- Correlation heatmap: find feature relationships ---
fig, ax = plt.subplots(figsize=(8, 6))
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()

sns.heatmap(
    corr_matrix, annot=True, fmt='.2f', cmap='RdBu_r',
    center=0, vmin=-1, vmax=1, ax=ax,
    linewidths=0.5, square=True
)
ax.set_title('Feature Correlation Matrix')

fig.tight_layout()
fig.savefig('correlation.png', dpi=300, bbox_inches='tight')
plt.close(fig)


# --- Pair plot: explore all pairwise relationships at once ---
# Useful for small feature sets (<10 features). Slow for large ones.
pair = sns.pairplot(
    df, hue='category', diag_kind='kde',
    plot_kws={'alpha': 0.4, 's': 15}
)
pair.figure.suptitle('Pairwise Feature Relationships', y=1.02)
pair.savefig('pairplot.png', dpi=150, bbox_inches='tight')
plt.close('all')


# --- Seaborn + Matplotlib customization: the practical pattern ---
fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(data=df, x='category', y='feature_a', ax=ax, inner='quartile')

# Drop down to Matplotlib for fine-tuning
ax.set_title('Feature A Violin Plot', fontsize=14, fontweight='bold')
ax.set_xlabel('Category', fontsize=12)
ax.set_ylabel('Feature A Value', fontsize=12)
ax.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Zero baseline')
ax.legend()

fig.tight_layout()
fig.savefig('violin_customized.png', dpi=300, bbox_inches='tight')
plt.close(fig)
💡When to Use Seaborn vs Matplotlib
  • Seaborn excels at: grouped plots, statistical overlays (confidence intervals, KDE curves), automatic legend handling, DataFrame-native column references.
  • Matplotlib excels at: precise axis control, custom annotations and arrows, multi-panel layouts with unequal sizing, publication-quality formatting.
  • You can always access the underlying Matplotlib axes from any Seaborn plot: ax = sns.histplot(...); ax.set_xlim(0, 100).
  • Rule of thumb: prototype in Seaborn, polish in Matplotlib. Start fast, refine as needed.
📊 Production Insight
sns.set_theme() affects all subsequent plots globally in the current Python process. In shared notebooks or multi-team environments, this can silently change the appearance of other people's charts.
Call sns.set_theme() once at the very top of your notebook or script, and document the style choice.
For production pipelines that generate multiple report types, use matplotlib.rcParams context managers to scope style changes: with plt.rc_context({'font.size': 12}): ...
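A minimal sketch of that scoped-style pattern (the specific rcParams keys are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt

base_size = plt.rcParams['font.size']

# rcParams changes apply only inside the context manager
with plt.rc_context({'font.size': 16, 'axes.grid': True}):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3], [1, 4, 9])
    ax.set_title('Report-Styled Chart')
    inside_size = plt.rcParams['font.size']
    fig.savefig('styled_report.png', dpi=300, bbox_inches='tight')
    plt.close(fig)

# Global style is restored automatically on exit
print(f'inside: {inside_size}, after: {plt.rcParams["font.size"]}')
```

Unlike sns.set_theme(), nothing leaks: charts generated after the with block keep the process-wide defaults.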
🎯 Key Takeaway
Seaborn wraps Matplotlib with DataFrame awareness and statistical defaults — use it for exploration.
sns.histplot, sns.boxplot, sns.heatmap, and sns.pairplot cover 80% of ML visualization needs.
Every Seaborn plot returns a Matplotlib axes object — drop down to Matplotlib for final polish.

Confusion Matrix: Where Your Model Gets Confused

The confusion matrix is the single most important diagnostic chart for classification models. It shows exactly which classes your model confuses with which — information that a scalar metric like accuracy or F1 compresses into a single number and loses.

A model with 95% accuracy might be completely failing on one class. In a fraud detection system where only 2% of transactions are fraudulent, a model that predicts 'not fraud' for every single input achieves 98% accuracy while catching zero fraud. Only the confusion matrix reveals this. Always plot it. Always.

io/thecodeforge/viz/confusion_matrix.py · PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(
    y_true, y_pred, labels=None, title='Confusion Matrix'
):
    """Production-grade confusion matrix with both counts and percentages.

    Displays two panels side by side:
    - Left: raw counts (useful for understanding volume)
    - Right: row-normalized percentages (useful for understanding recall per class)

    Args:
        y_true: ground truth labels
        y_pred: predicted labels
        labels: list of class names for axis labels
        title: figure title

    Returns:
        Matplotlib Figure object (caller saves and closes).
    """
    cm = confusion_matrix(y_true, y_pred)
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left panel: raw counts
    sns.heatmap(
        cm, annot=True, fmt='d', cmap='Blues',
        xticklabels=labels, yticklabels=labels,
        ax=axes[0], linewidths=0.5
    )
    axes[0].set_xlabel('Predicted')
    axes[0].set_ylabel('Actual')
    axes[0].set_title(f'{title} (Counts)')

    # Right panel: row-normalized percentages (each row sums to 100%)
    sns.heatmap(
        cm_percent, annot=True, fmt='.1f', cmap='Blues',
        xticklabels=labels, yticklabels=labels,
        ax=axes[1], linewidths=0.5, vmin=0, vmax=100
    )
    axes[1].set_xlabel('Predicted')
    axes[1].set_ylabel('Actual')
    axes[1].set_title(f'{title} (Row %, i.e., Recall)')

    fig.tight_layout()
    return fig


# Example usage
np.random.seed(42)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2] * 20)
y_pred = np.array([0, 0, 1, 1, 1, 0, 2, 2, 2] * 20)
labels = ['Cat', 'Dog', 'Bird']

fig = plot_confusion_matrix(y_true, y_pred, labels=labels, title='Animal Classifier')
fig.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.close(fig)
⚠ Accuracy Hides Class-Level Failures
A model predicting 990 correct out of 1000 samples has 99% accuracy. But if those 10 errors are all in the fraud class (which has only 15 total samples), the model missed 67% of all fraud cases. The confusion matrix shows this immediately — the fraud row will have a large off-diagonal value. A single accuracy number never reveals this. On imbalanced datasets, accuracy is almost meaningless. The confusion matrix is not.
📊 Production Insight
Always display both raw counts and row-normalized percentages in your confusion matrix.
Raw counts mislead on imbalanced datasets because 95% of predictions naturally land in the majority class, making the diagonal look strong even when minority class recall is terrible.
Row percentages show recall per class — how much of each true class the model actually captures.
Column percentages show precision per class — of everything predicted as class X, how much is correct.
Rule: for production monitoring dashboards, plot the row-normalized version by default and provide the raw count version as a drill-down.
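Both normalizations are available directly from scikit-learn's confusion_matrix via its normalize parameter; a small sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 4 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# normalize='true' divides each row by its sum -> per-class recall
cm_recall = confusion_matrix(y_true, y_pred, normalize='true')

# normalize='pred' divides each column by its sum -> per-class precision
cm_precision = confusion_matrix(y_true, y_pred, normalize='pred')

print(np.round(cm_recall, 2))     # rows sum to 1
print(np.round(cm_precision, 2))  # columns sum to 1
```

Using the library's normalize argument avoids the manual row-sum division (and its divide-by-zero edge case when a class has no true samples).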
🎯 Key Takeaway
The confusion matrix is the most important classification diagnostic — plot it for every model, every time.
Always show both counts and row-normalized percentages. Counts tell you volume; percentages tell you recall.
Off-diagonal patterns reveal exactly which classes your model cannot distinguish and guide targeted improvements.
Confusion Matrix Interpretation
If: Diagonal cells are strong, off-diagonal cells are near zero
Then: Model separates classes well. Verify that performance is consistent across all classes — a strong overall diagonal can mask one weak class.
If: One row has high off-diagonal values (model confuses class A with class B specifically)
Then: Classes A and B share similar features. Consider feature engineering to surface distinguishing characteristics, collecting more training data for the confused class, or merging the classes if they are semantically close.
If: All predictions cluster into one class (entire column is dark, rest of matrix is blank)
Then: Model is degenerate — predicting the majority class for every input. Check class balance, lower the decision threshold, or apply class weights during training.
If: Matrix looks good on test data but deteriorates on production data
Then: Data distribution shift. Plot prediction probability distributions over time to detect when the drift started. Compare feature distributions between training data and recent production data.

ROC and Precision-Recall Curves

ROC curves plot the true positive rate against the false positive rate across all possible classification thresholds. They answer the question: as I lower the threshold to catch more positives, how many false positives do I accept?

Precision-Recall curves are more informative for imbalanced datasets because they focus exclusively on the positive class. On a dataset where only 1% of samples are positive, ROC can show an impressive AUC of 0.95 while the model's precision at useful recall levels is actually terrible. Precision-Recall curves expose this directly.

Both curves let you visualize the tradeoff space and choose the optimal threshold for your specific business requirements — something a single F1 score cannot do.

io/thecodeforge/viz/roc_pr_curves.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import (
    roc_curve, auc, precision_recall_curve, average_precision_score
)


def plot_roc_and_pr(y_true, y_proba, title='Model Evaluation'):
    """Plot ROC and Precision-Recall curves side by side.

    Both curves visualize model performance across all possible
    classification thresholds. Together they give a complete picture
    that no single metric can provide.

    Args:
        y_true: ground truth binary labels (0 or 1)
        y_proba: predicted probabilities for the positive class
        title: figure title prefix

    Returns:
        Matplotlib Figure object.
    """
    # Compute ROC curve
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_proba)
    roc_auc = auc(fpr, tpr)

    # Compute Precision-Recall curve
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_proba)
    avg_precision = average_precision_score(y_true, y_proba)

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # --- ROC Curve ---
    axes[0].plot(fpr, tpr, linewidth=2, label=f'Model (AUC = {roc_auc:.3f})')
    axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random (AUC = 0.5)')
    axes[0].fill_between(fpr, tpr, alpha=0.1)
    axes[0].set_xlabel('False Positive Rate')
    axes[0].set_ylabel('True Positive Rate (Recall)')
    axes[0].set_title(f'{title} — ROC Curve')
    axes[0].legend(loc='lower right')
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xlim([-0.02, 1.02])
    axes[0].set_ylim([-0.02, 1.02])

    # --- Precision-Recall Curve ---
    axes[1].plot(
        recall, precision, linewidth=2, color='orange',
        label=f'Model (AP = {avg_precision:.3f})'
    )
    baseline = y_true.sum() / len(y_true)
    axes[1].axhline(
        y=baseline, color='k', linestyle='--', alpha=0.5,
        label=f'Random baseline = {baseline:.3f}'
    )
    axes[1].fill_between(recall, precision, alpha=0.1, color='orange')
    axes[1].set_xlabel('Recall')
    axes[1].set_ylabel('Precision')
    axes[1].set_title(f'{title} — Precision-Recall Curve')
    axes[1].legend(loc='lower left')
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xlim([-0.02, 1.02])
    axes[1].set_ylim([0, 1.05])

    fig.tight_layout()
    return fig


# Example: imbalanced fraud detection scenario
np.random.seed(42)
y_true = np.random.choice([0, 1], size=500, p=[0.95, 0.05])
y_proba = np.clip(y_true * 0.6 + np.random.randn(500) * 0.2, 0, 1)

fig = plot_roc_and_pr(y_true, y_proba, title='Fraud Detection')
fig.savefig('roc_pr_curves.png', dpi=300, bbox_inches='tight')
plt.close(fig)
🔥 ROC vs Precision-Recall: When to Use Which
Use ROC when classes are roughly balanced — it gives a clean summary of the true-positive vs false-positive tradeoff across all thresholds. Use Precision-Recall when the positive class is rare (fraud detection, disease screening, anomaly detection, conversion prediction). On highly imbalanced data, ROC can show AUC > 0.95 because the massive true negative count keeps the false positive rate low regardless of how the model handles the rare positive class. Meanwhile, Precision-Recall reveals that the model's precision collapses to 10% at any useful recall level. Always plot both. A model can look excellent on one curve and mediocre on the other. If you only show one, you are hiding information.
📊 Production Insight
AUC and Average Precision summarize performance across all thresholds. In production, you deploy at one specific threshold.
The curve shape tells you where your operating point lives and what tradeoffs it forces. Two models with identical AUC can have very different characteristics at the threshold that matters for your business.
Rule: overlay the actual deployed threshold on the curve as a dot or vertical line. This makes it immediately clear how much performance room exists if you adjust the threshold — and what the cost of that adjustment is in the other metric.
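A sketch of that overlay for the ROC side (the 0.5 deployed threshold and the synthetic data are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic imbalanced scores, for illustration only
rng = np.random.default_rng(42)
y_true = rng.choice([0, 1], size=500, p=[0.9, 0.1])
y_proba = np.clip(y_true * 0.5 + rng.normal(0, 0.25, 500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_proba)

deployed = 0.5  # hypothetical production decision threshold
idx = int(np.argmin(np.abs(thresholds - deployed)))  # nearest curve point

fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(fpr, tpr, linewidth=2, label='ROC curve')
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random')
# The deployed operating point, marked directly on the curve
ax.scatter([fpr[idx]], [tpr[idx]], s=120, color='red', zorder=5,
           label=f'Deployed threshold = {deployed}')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve with Deployed Operating Point')
ax.legend(loc='lower right')
fig.savefig('roc_with_threshold.png', dpi=300, bbox_inches='tight')
plt.close(fig)
```

The same argmin-over-thresholds lookup works for precision_recall_curve, whose thresholds array is one element shorter than its precision and recall arrays.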
🎯 Key Takeaway
ROC curves work well for balanced classes. Precision-Recall curves are essential for imbalanced ones.
Always plot both side by side — a model can look good on one and poor on the other.
The curve shape reveals operating characteristics that a single AUC or AP number compresses away.

Residual Plots for Regression Models

Residual plots reveal systematic errors in regression models that aggregate metrics like RMSE and MAE completely hide. RMSE tells you the average error magnitude. Residual plots tell you whether those errors are random (acceptable) or structured (a sign your model is missing something).

If residuals show a pattern — a curve, a fan shape, clusters — your model is not capturing a relationship in the data. No amount of hyperparameter tuning will fix this. You need different features, a different transformation, or a different model family. The residual plot is the chart that tells you which.

io/thecodeforge/viz/residual_plots.py · PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression


def plot_regression_diagnostics(y_true, y_pred, title='Regression Diagnostics'):
    """Four-panel diagnostic plot for regression models.

    Panels:
    1. Predicted vs Actual — overall fit quality
    2. Residuals vs Predicted — detect non-linearity, heteroscedasticity
    3. Residual Distribution — check normality assumption
    4. Q-Q Plot — sensitive normality check at distribution tails

    Args:
        y_true: actual target values (numpy array)
        y_pred: predicted target values (numpy array)
        title: overall figure title

    Returns:
        Matplotlib Figure object.
    """
    residuals = y_true - y_pred

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Panel 1: Predicted vs Actual
    axes[0, 0].scatter(y_true, y_pred, alpha=0.4, s=15, color='steelblue')
    min_val = min(y_true.min(), y_pred.min())
    max_val = max(y_true.max(), y_pred.max())
    axes[0, 0].plot(
        [min_val, max_val], [min_val, max_val],
        'r--', linewidth=2, label='Perfect prediction'
    )
    axes[0, 0].set_xlabel('Actual')
    axes[0, 0].set_ylabel('Predicted')
    axes[0, 0].set_title('Predicted vs Actual')
    axes[0, 0].legend()

    # Panel 2: Residuals vs Predicted (the most important panel)
    axes[0, 1].scatter(y_pred, residuals, alpha=0.4, s=15, color='coral')
    axes[0, 1].axhline(y=0, color='r', linestyle='--', linewidth=2)
    axes[0, 1].set_xlabel('Predicted Value')
    axes[0, 1].set_ylabel('Residual (Actual - Predicted)')
    axes[0, 1].set_title('Residuals vs Predicted')

    # Panel 3: Residual Distribution
    sns.histplot(residuals, kde=True, ax=axes[1, 0], bins=30, color='steelblue')
    axes[1, 0].axvline(x=0, color='r', linestyle='--')
    axes[1, 0].set_xlabel('Residual')
    axes[1, 0].set_title(f'Residual Distribution (mean={residuals.mean():.2f})')

    # Panel 4: Q-Q plot (normality check — deviations at tails matter most)
    stats.probplot(residuals, dist='norm', plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot (Normality Check)')

    fig.suptitle(title, fontsize=14, fontweight='bold')
    fig.tight_layout()
    return fig


# Example
X, y = make_regression(
    n_samples=300, n_features=3, noise=15, random_state=42
)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

fig = plot_regression_diagnostics(y, y_pred, title='Linear Regression Diagnostics')
fig.savefig('residual_plots.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Mental Model
What Good Residuals Look Like
A well-specified model produces residuals that are random noise with no discernible pattern. If you see structure, the model is leaving signal on the table.
  • Residuals vs Predicted: random scatter centered on zero. No fan shape, no curve, no clusters.
  • Residual Distribution: approximately normal, centered at zero. Skew or heavy tails indicate the model handles some value ranges worse than others.
  • Q-Q Plot: points follow the diagonal line closely. Deviations at the tails mean the model produces more extreme errors than a normal distribution predicts.
  • If you see any pattern in the residual plot, your model is missing a signal. Add features, apply transformations, or switch model families.
📊 Production Insight
A fan-shaped residual plot — where residuals spread wider as predicted values increase — means heteroscedasticity. The model's error is not constant: it predicts well for small values and poorly for large ones, or vice versa.
This violates a core assumption of ordinary least squares and inflates confidence intervals on predictions.
Rule: apply a log transform to the target variable (np.log1p) or use weighted least squares to stabilize error variance. If the fan is severe, tree-based models handle heteroscedasticity naturally without transformation.
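The log-transform remedy can be sketched on synthetic data; the data-generating process and variable names below are illustrative, not part of this tutorial's pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic heteroscedastic data: noise grows multiplicatively with the target.
X = rng.uniform(1, 10, size=(300, 1))
y = np.exp(0.5 * X.ravel()) * rng.lognormal(mean=0.0, sigma=0.3, size=300)

# Fit on the log-transformed target, then invert predictions with expm1.
model = LinearRegression().fit(X, np.log1p(y))
y_pred = np.expm1(model.predict(X))

# On the log scale the residual spread should now be roughly constant.
log_residuals = np.log1p(y) - model.predict(X)
print(f"Log-scale residual std: {log_residuals.std():.3f}")
```

Regenerate the residual plot on the log scale after transforming; if the fan persists, a tree-based model is the next step.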
🎯 Key Takeaway
Residual plots reveal errors that RMSE hides — always generate them for regression models.
Random scatter around zero means the model is well-specified. Any pattern means missing signal.
Four diagnostic panels: predicted vs actual, residuals vs predicted, residual histogram, Q-Q plot.
Residual Pattern Diagnosis
  • Residuals show a U-shape or curve against predicted values → the model is missing a non-linear relationship. Add polynomial features (degree 2 or 3), interaction terms between features, or switch to a non-linear model like gradient boosted trees.
  • Residuals fan out (spread increases with predicted value) → heteroscedasticity. Log-transform the target variable with np.log1p(y), use weighted least squares, or switch to a model family that handles non-constant variance naturally (e.g., tree-based models).
  • Residuals are not centered at zero (consistent bias in one direction) → systematic bias. Check for a missing intercept term, incorrect feature encoding, or a target variable that needs transformation.
  • Residuals show a clear trend when plotted against time or row index → autocorrelation: your data has temporal structure that the model ignores. Add lag features, rolling statistics, or switch to a time-series model (ARIMA, Prophet, temporal neural networks).
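The lag-feature fix for autocorrelated residuals can be sketched with pandas; the series and column names here are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily series with temporal structure (a random walk).
df = pd.DataFrame({'value': np.cumsum(rng.normal(size=100))})

# Lag features and rolling statistics give the model access to recent history.
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
# Shift before rolling so the window never includes the current row (no leakage).
df['rolling_mean_7'] = df['value'].shift(1).rolling(7).mean()

# Drop the warm-up rows where the lags are undefined.
df = df.dropna().reset_index(drop=True)
print(df.head(3))
```

After retraining with these features, replot residuals against the row index to confirm the trend is gone.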

Feature Importance Visualization

Feature importance plots show which inputs drive your model's predictions. For tree-based models, importance is built in via impurity reduction. For any model, permutation importance provides a model-agnostic alternative by measuring how much accuracy drops when each feature's values are randomly shuffled.

Visualization makes these rankings immediately interpretable to non-technical stakeholders who need to understand why the model makes the decisions it does — not just what it predicts. A horizontal bar chart sorted by importance is the universal format that everyone from data scientists to product managers can read.

io/thecodeforge/viz/feature_importance.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_classification


def plot_feature_importance(
    model, feature_names, X_test, y_test, top_n=15
):
    """Plot built-in and permutation importance side by side.

    Built-in importance (Gini) is fast but biased toward high-cardinality
    features. Permutation importance is slower but model-agnostic and
    unbiased. Showing both highlights discrepancies worth investigating.

    Args:
        model: fitted sklearn estimator
        feature_names: list of feature name strings
        X_test: test features for permutation importance
        y_test: test labels for permutation importance
        top_n: number of top features to display

    Returns:
        Matplotlib Figure object.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, max(6, top_n * 0.4)))

    # Left panel: built-in importance (tree-based models only)
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        indices = np.argsort(importances)[::-1][:top_n]

        axes[0].barh(
            [feature_names[i] for i in indices][::-1],
            importances[indices][::-1],
            color='steelblue', edgecolor='black', alpha=0.8
        )
        axes[0].set_xlabel('Gini Importance (Impurity Reduction)')
        axes[0].set_title('Built-in Feature Importance')
    else:
        axes[0].text(
            0.5, 0.5, 'Not available\n(model has no feature_importances_)',
            ha='center', va='center', fontsize=12, transform=axes[0].transAxes
        )
        axes[0].set_title('Built-in Feature Importance (N/A)')

    # Right panel: permutation importance (model-agnostic)
    perm_result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
    )
    perm_mean = perm_result.importances_mean
    perm_std = perm_result.importances_std
    indices = np.argsort(perm_mean)[::-1][:top_n]

    axes[1].barh(
        [feature_names[i] for i in indices][::-1],
        perm_mean[indices][::-1],
        xerr=perm_std[indices][::-1],
        color='coral', edgecolor='black', alpha=0.8
    )
    axes[1].set_xlabel('Mean Accuracy Decrease When Shuffled')
    axes[1].set_title('Permutation Importance')

    fig.suptitle(
        'Feature Importance Comparison', fontsize=14, fontweight='bold'
    )
    fig.tight_layout()
    return fig


# Example
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
feature_names = [f'feature_{i}' for i in range(10)]
model = RandomForestClassifier(
    n_estimators=100, random_state=42
).fit(X, y)

# Demo passes training data for brevity; prefer a held-out test split.
fig = plot_feature_importance(model, feature_names, X, y)
fig.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.close(fig)
⚠ Built-in Importance Can Mislead
Gini importance (feature_importances_) in tree-based models is biased toward high-cardinality features. A feature with 1,000 unique values — like a raw ID column — will appear more important than a genuinely predictive binary feature, because the tree has more possible split points to choose from. This is a measurement artifact, not real predictive value. Always validate with permutation importance, which directly measures the accuracy cost of losing each feature and is unbiased by cardinality.
📊 Production Insight
Feature importance rankings can shift dramatically between model versions — not because the data changed, but because tree-based models have inherent randomness in split selection.
Track importance rankings over time across deployments. A feature that drops from top 3 to zero importance between versions may indicate data pipeline corruption (the column went null, changed format, or stopped updating).
Rule: store and compare feature importance snapshots as part of your model registry metadata. Unexpected ranking changes should trigger investigation before deployment, not after.
🎯 Key Takeaway
Built-in importance is fast but biased toward high-cardinality features. Permutation importance is slower but reliable and model-agnostic.
Plot both side by side — significant disagreement between them signals a cardinality bias or data leakage problem.
Track importance rankings across model versions to detect data pipeline degradation early.
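One minimal way to implement the snapshot comparison; the helper names, threshold, and example values are hypothetical:

```python
def importance_snapshot(model, feature_names):
    """Serialize importances as a name -> score dict for the model registry."""
    return dict(zip(feature_names, model.feature_importances_.round(4).tolist()))


def rank_shift(old, new, alert_threshold=2):
    """Flag features whose importance rank moved by at least the threshold."""
    old_rank = {f: r for r, f in enumerate(sorted(old, key=old.get, reverse=True))}
    new_rank = {f: r for r, f in enumerate(sorted(new, key=new.get, reverse=True))}
    return [f for f in old_rank
            if abs(old_rank[f] - new_rank.get(f, len(new_rank))) >= alert_threshold]


# Hypothetical snapshots from two model versions.
v1 = {'age': 0.40, 'income': 0.35, 'region': 0.15, 'tenure': 0.10}
v2 = {'age': 0.42, 'income': 0.02, 'region': 0.30, 'tenure': 0.26}

print(rank_shift(v1, v2))  # 'income' dropped from rank 1 to rank 3
```

In practice these dicts would be stored as registry metadata alongside each model version, and a non-empty `rank_shift` result would block deployment pending investigation.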

Learning Curves: Diagnosing Bias and Variance

Learning curves plot model performance against training set size. They answer the most fundamental question in model improvement: should I get more data, or should I change the model?

The gap between the training score and validation score at each data size reveals whether your model suffers from high bias (underfitting — both curves are low) or high variance (overfitting — training is high, validation is low). This is not an academic distinction. It directly determines whether spending three weeks collecting more data will help or be completely wasted effort.

io/thecodeforge/viz/learning_curves.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def plot_learning_curve(
    estimator, X, y, title='Learning Curve', cv=5, scoring='accuracy'
):
    """Plot learning curve showing the bias-variance tradeoff.

    The gap between training and validation curves tells you exactly
    what to fix: more data, more regularization, or a different model.

    Args:
        estimator: unfitted sklearn estimator (will be cloned internally)
        X: feature matrix
        y: target vector
        title: plot title
        cv: number of cross-validation folds
        scoring: sklearn scoring metric name

    Returns:
        Matplotlib Figure object.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        cv=cv,
        n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring=scoring
    )

    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    fig, ax = plt.subplots(figsize=(10, 6))

    # Confidence bands
    ax.fill_between(
        train_sizes, train_mean - train_std, train_mean + train_std,
        alpha=0.1, color='blue'
    )
    ax.fill_between(
        train_sizes, val_mean - val_std, val_mean + val_std,
        alpha=0.1, color='orange'
    )

    # Mean curves
    ax.plot(
        train_sizes, train_mean, 'o-', color='blue',
        linewidth=2, label='Training Score'
    )
    ax.plot(
        train_sizes, val_mean, 'o-', color='orange',
        linewidth=2, label='Validation Score'
    )

    ax.set_xlabel('Training Set Size')
    ax.set_ylabel(scoring.capitalize())
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)

    # Annotate the final gap between curves
    final_gap = train_mean[-1] - val_mean[-1]
    ax.annotate(
        f'Gap: {final_gap:.3f}',
        xy=(train_sizes[-1], (train_mean[-1] + val_mean[-1]) / 2),
        fontsize=11, fontweight='bold', color='red',
        ha='right'
    )

    fig.tight_layout()
    return fig


# Example
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)

fig = plot_learning_curve(model, X, y, title='Random Forest Learning Curve')
fig.savefig('learning_curve.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Mental Model
Reading Learning Curves
The gap between training and validation curves tells you exactly what to fix — and what not to waste time on.
  • Large gap (training high, validation low) = high variance (overfitting). Fix with: more data, stronger regularization, fewer features, simpler model.
  • Both curves low and converging together = high bias (underfitting). Fix with: more features, more complex model, less regularization. More data will NOT help here.
  • Both curves high and converging together = good fit. Model is well-calibrated for this data volume.
  • Validation curve still rising at the right edge = more data will help. Collecting additional training examples is a productive investment.
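The decision rules above can be condensed into a small helper; the thresholds are illustrative defaults, not universal constants — tune them to your metric and problem:

```python
def diagnose_learning_curve(train_score, val_score, good_score=0.85, gap_tol=0.05):
    """Map the final points of a learning curve to a bias/variance diagnosis."""
    gap = train_score - val_score
    if gap > gap_tol:
        return 'high variance (overfitting): regularize, simplify, or add data'
    if val_score < good_score:
        return 'high bias (underfitting): add features or model complexity'
    return 'good fit: model is well-matched to this data volume'


print(diagnose_learning_curve(0.99, 0.80))  # large gap -> overfitting
print(diagnose_learning_curve(0.70, 0.68))  # both low -> underfitting
print(diagnose_learning_curve(0.91, 0.89))  # both high, small gap -> good fit
```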
📊 Production Insight
Learning curves computed on a tiny subsample can be misleading about convergence behavior. If your full dataset has 1M rows but you compute the learning curve on a 5K sample, the curve might show convergence that disappears at full scale.
Always compute learning curves on a representative sample large enough to show the real convergence pattern — at least 10% of the full dataset or 10K samples, whichever is larger.
Rule: if the validation curve has not plateaued at the maximum training size, your model will measurably benefit from more training data. If it has plateaued, spending three weeks collecting more data is wasted effort — change the model instead.
🎯 Key Takeaway
Learning curves diagnose bias vs variance — the fundamental decision point for model improvement.
Large gap between curves = overfitting (needs regularization or more data). Both curves low = underfitting (needs more complexity).
The curve shape tells you whether to invest in more data or in a different model architecture.

Saving and Formatting for Production

Charts in notebooks are for exploration. Charts in reports, dashboards, presentations, and papers require consistent formatting, appropriate resolution, and accessible color choices. The gap between a notebook plot and a production-ready figure is not aesthetics — it is legibility, accessibility, and reproducibility.

A chart that looks fine on your 4K monitor becomes an unreadable blur when projected onto a conference room screen or embedded in a PDF at print resolution. This section covers the production formatting pipeline that ensures your figures survive every medium they encounter.

io/thecodeforge/viz/production_formatting.py · PYTHON
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np


def apply_production_style():
    """Apply a consistent, publication-quality style globally.

    Call this once at the top of your notebook or script.
    Overrides Matplotlib defaults with production-safe values.
    """
    mpl.rcParams.update({
        # Typography
        'font.size': 12,
        'axes.titlesize': 14,
        'axes.labelsize': 12,
        'xtick.labelsize': 10,
        'ytick.labelsize': 10,
        'legend.fontsize': 10,
        'figure.titlesize': 16,

        # Figure defaults
        'figure.figsize': (10, 6),
        'figure.dpi': 100,           # Screen display DPI
        'savefig.dpi': 300,          # Saved file DPI
        'savefig.bbox': 'tight',     # Prevent label clipping
        'savefig.pad_inches': 0.1,

        # Grid and spines
        'axes.grid': True,
        'grid.alpha': 0.3,
        'axes.spines.top': False,    # Remove top spine
        'axes.spines.right': False,  # Remove right spine

        # Lines and markers
        'lines.linewidth': 2,
        'lines.markersize': 6,
    })
    print("Production style applied.")


def save_publication(fig, filename, formats=None):
    """Save figure in multiple formats for different use cases.

    Args:
        fig: Matplotlib Figure object
        filename: base filename without extension
        formats: list of format strings. Defaults to PNG + SVG.
    """
    if formats is None:
        formats = ['png', 'svg']

    for fmt in formats:
        filepath = f"{filename}.{fmt}"
        fig.savefig(filepath, dpi=300, bbox_inches='tight', facecolor='white')
        print(f"Saved: {filepath}")


# --- Usage ---
apply_production_style()

fig, ax = plt.subplots()
colors = ['#2563eb', '#16a34a', '#dc2626']  # Blue, green, red — distinguishable
ax.bar(
    ['Model A', 'Model B', 'Model C'],
    [0.89, 0.92, 0.87],
    color=colors, edgecolor='black', alpha=0.9
)
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison — Q1 2026')
ax.set_ylim(0.80, 0.95)

# Add value labels on bars
for i, v in enumerate([0.89, 0.92, 0.87]):
    ax.text(i, v + 0.003, f'{v:.2f}', ha='center', fontweight='bold')

save_publication(fig, 'model_comparison')
plt.close(fig)
💡Accessibility in Visualizations
  • Use colorblind-safe palettes: sns.color_palette('colorblind') or the 'muted' palette. Avoid pure red/green combinations as the only differentiator.
  • Add patterns (hatching), markers, or line styles to distinguish series — not just color. ax.bar(..., hatch='//') adds visual texture.
  • Never use the 'jet' or 'rainbow' colormap for continuous data — they introduce perceptual artifacts. Use 'viridis', 'plasma', or 'cividis' instead.
  • Add direct value labels on bars and direct labels on lines instead of relying on a distant legend that requires color matching.
  • Test your charts in grayscale. If they still communicate the message, they are accessible.
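A minimal sketch of the redundant-channels advice, distinguishing series by line style and marker as well as color (the Agg backend and output filename are our choices for a headless script):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, safe for scripts and CI
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
fig, ax = plt.subplots(figsize=(8, 5))

# Redundant channels: each series differs in line style AND marker, not just
# color, so the chart survives grayscale printing and color vision deficiency.
ax.plot(x, np.sin(x), linestyle='-', marker='o', markevery=5, label='Model A')
ax.plot(x, np.cos(x), linestyle='--', marker='s', markevery=5, label='Model B')
ax.plot(x, np.sin(x / 2), linestyle=':', marker='^', markevery=5, label='Model C')

ax.legend()
fig.savefig('accessible_lines.png', dpi=300, bbox_inches='tight')
plt.close(fig)
```

Converting the saved PNG to grayscale is a quick way to verify the three series remain distinguishable.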
📊 Production Insight
PNG at 300 DPI is the standard for reports and presentations. SVG is best for web dashboards and documentation sites because it scales without pixelation and has a smaller file size for simple charts. PDF is best for print publications, LaTeX documents, and archival.
Rule: always save in at least two formats — PNG for immediate sharing and embedding, SVG or PDF for archival and web. The save_publication helper above handles this automatically.
In automated report generation pipelines, save figures to a versioned artifact directory alongside the model they evaluate. Figures and models should share the same version tag.
🎯 Key Takeaway
Apply a consistent style with mpl.rcParams at the top of every notebook or script — never rely on Matplotlib defaults.
Save at 300 DPI in PNG for reports, SVG for web, PDF for print. Always save before calling plt.show().
Use colorblind-safe palettes and direct value labels. Never rely on color alone to convey meaning.
🗂 Matplotlib vs Seaborn: When to Use Which
They are complementary tools, not competitors. Choose based on your current task.
Aspect | Matplotlib | Seaborn
Learning Curve | Steeper — more code required for statistical plots | Gentler — sensible defaults and fewer lines for common charts
Control Level | Full pixel-level control over every element | Less granular control, but faster to prototype
DataFrame Awareness | None — requires manual extraction of arrays from DataFrames | Native — pass column names directly via the data= parameter
Statistical Plots | Manual — compute confidence intervals, KDE, regressions yourself | Built-in — automatic confidence intervals, KDE, regression lines
Multi-Panel Layouts | Excellent — full control over grid spacing and sizing | Limited — pairplot and FacetGrid handle specific patterns only
Customization | Unlimited — every element is individually addressable | Good via Matplotlib axes access, but some Seaborn elements resist customization
Production Formatting | Full control via rcParams and style sheets | Inherits Matplotlib settings, adds its own theme layer via set_theme()
Best For | Final figures, custom annotations, publication-quality output | Exploratory analysis, statistical summaries, rapid prototyping

🎯 Key Takeaways

  • Every Matplotlib chart starts with fig, ax = plt.subplots() — use the object-oriented interface, always.
  • Seaborn handles DataFrame grouping and statistical estimation automatically — use it for rapid exploration, then drop down to Matplotlib for polish.
  • Confusion matrices reveal class-level failures that accuracy hides — always show both raw counts and row-normalized percentages.
  • ROC curves work for balanced data; Precision-Recall curves are essential for imbalanced data. Plot both.
  • Residual plots diagnose regression model errors that RMSE averages away — check for patterns, not just magnitude.
  • Learning curves tell you whether to invest in more data or a different model — read the gap between training and validation curves.
  • Save at 300 DPI with fig.savefig() and always call plt.close(fig) afterward to prevent memory leaks in pipelines.
  • Use perceptually uniform colormaps (viridis, plasma, cividis) — never use jet or rainbow for continuous data.

⚠ Common Mistakes to Avoid

    Using plt.plot() instead of the object-oriented ax.plot() interface
    Symptom

    Multi-panel figures break unpredictably. Titles, labels, and data end up on the wrong subplot. Saving produces blank files after calling plt.show().

    Fix

    Always use fig, ax = plt.subplots() and call methods on the ax object: ax.plot(), ax.set_title(), ax.set_xlabel(). The pyplot interface (plt.plot()) operates on an implicit 'current axes' that changes unpredictably in multi-panel figures. The object-oriented interface is explicit, debuggable, and production-safe.

    Not calling plt.close(fig) after saving
    Symptom

    Memory usage climbs steadily during training loops or report generation scripts. After generating 50–100 figures, the process crashes with a memory error or slows to a crawl.

    Fix

    Always call plt.close(fig) after fig.savefig(). Each open figure consumes memory. In loops, use plt.close('all') as a safety net. In Jupyter notebooks, this matters less because %matplotlib inline auto-closes, but it is still good practice.
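The loop pattern can be sketched as follows; the metric values and output filenames are hypothetical:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for pipelines
import matplotlib.pyplot as plt

# Hypothetical report loop: one figure per metric, closed right after saving.
metrics = {'accuracy': [0.81, 0.85, 0.88], 'f1': [0.70, 0.76, 0.79]}

for name, values in metrics.items():
    fig, ax = plt.subplots()
    ax.plot(values, 'o-')
    ax.set_title(name)
    fig.savefig(f'{name}_trend.png', dpi=300, bbox_inches='tight')
    plt.close(fig)   # release this figure's memory before the next iteration

plt.close('all')     # safety net: nothing left open at loop exit
print(f'Open figures: {len(plt.get_fignums())}')
```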

    Using the 'jet' or 'rainbow' colormap for continuous data
    Symptom

    Charts create visual artifacts — bright yellow bands appear to be boundaries or features that do not exist in the data. Colorblind viewers cannot distinguish adjacent regions. Print outputs in grayscale are completely unreadable.

    Fix

    Use perceptually uniform colormaps: 'viridis' (default), 'plasma', 'inferno', or 'cividis' (designed specifically for colorblind accessibility). For diverging data (centered around zero), use 'RdBu_r' or 'coolwarm' with center=0.

    Presenting only accuracy without a confusion matrix or error distribution
    Symptom

    Stakeholders approve a model that is 95% accurate on an imbalanced dataset. In production, it misses 60% of the minority class (the class that actually matters for the business). Nobody knew because accuracy masked the class-level failure.

    Fix

    Always present the confusion matrix alongside any aggregate metric. For regression, always include a residual plot alongside RMSE. The aggregate metric is the headline; the visualization is the evidence. If the evidence contradicts the headline, the headline is wrong.

    Saving figures at screen resolution (72–96 DPI)
    Symptom

    Charts look fine in the notebook but become pixelated and blurry when embedded in PDF reports, printed on paper, or projected in meeting rooms. Text labels become unreadable.

    Fix

    Always save with fig.savefig('name.png', dpi=300, bbox_inches='tight'). 300 DPI is the minimum for print and presentation quality. For posters or large-format prints, use 600 DPI. Set savefig.dpi in rcParams so you never forget.

    Choosing the wrong chart type for the data relationship
    Symptom

    Pie chart used for 15 categories — impossible to compare slice sizes. Line chart used for categorical data — implies a continuous trend that does not exist. Scatter plot used for 1 million points — produces an opaque blob.

    Fix

    Match the chart to the relationship: histogram or KDE for distributions, bar chart for categorical comparisons, scatter plot for bivariate correlation (with alpha for large N), line chart for trends over time or ordered sequences, heatmap for matrices and correlations. When in doubt, ask: what question should this chart answer? Then pick the chart type that answers it most directly.

Interview Questions on This Topic

  • Q: Why is a Precision-Recall curve more informative than an ROC curve for imbalanced classification problems? (Mid-level)
    ROC curves plot true positive rate against false positive rate. On imbalanced datasets where negatives vastly outnumber positives, even a large number of false positives represents a small false positive rate because the denominator (total negatives) is enormous. This makes the ROC curve look deceptively good — AUC can exceed 0.95 while the model's precision at any useful recall level is actually terrible. Precision-Recall curves focus exclusively on the positive class. Precision measures what fraction of positive predictions are correct, and recall measures what fraction of actual positives are detected. Neither metric is inflated by the large pool of true negatives. On a 1% positive rate dataset, the PR curve immediately shows that achieving 80% recall requires accepting 30% precision — a tradeoff the ROC curve hides entirely. In production, I always plot both side by side. If ROC looks excellent but PR looks mediocre, the model is benefiting from the imbalance, not from genuine discriminative power.
  • Q: Your residual plot shows a U-shaped pattern. What does this tell you about your regression model, and what would you do about it? (Mid-level)
    A U-shaped pattern in residuals plotted against predicted values means the model is missing a non-linear relationship in the data. The linear model systematically overpredicts in some value ranges and underpredicts in others — the residuals are not random, they are structured. Specifically, the model's linearity assumption is violated. The true relationship between features and target includes curvature that a straight line cannot capture. To fix it, I would try three approaches in order: first, add polynomial features (x², x³) or interaction terms between existing features and retrain the linear model. Second, apply a non-linear transformation to the target variable (log, sqrt) if the U-shape suggests multiplicative rather than additive relationships. Third, switch to a non-linear model — gradient boosted trees or a neural network — that can capture arbitrary non-linear patterns without manual feature engineering. After each change, I would regenerate the residual plot to verify the pattern has disappeared. If residuals are now randomly scattered around zero, the fix worked.
  • Q: How would you present model evaluation results to a non-technical stakeholder who needs to decide whether to deploy the model? (Junior)
    I would use three charts, each answering a specific business question: First, a bar chart comparing the model's performance against a meaningful baseline — not random chance, but the current process or heuristic the model would replace. This answers 'is the model better than what we do today?' Second, a confusion matrix with row-normalized percentages, using business-language labels (not 'Class 0' and 'Class 1'). This answers 'where does the model get it right and where does it get it wrong?' I would annotate the cost of each error type in business terms — 'of every 100 fraud cases, the model catches 85 and misses 15.' Third, a simple before-and-after impact chart showing the projected business metric change: fraud losses reduced, support tickets prevented, revenue captured. This answers 'what is the dollar impact?' I would avoid showing ROC curves, raw F1 scores, or learning curves to non-technical audiences. The stakeholder needs to understand what the model does well, where it fails, and what the business impact is. Mathematical internals create confusion, not confidence.
  • Q: Explain the difference between built-in feature importance and permutation importance. When would they disagree, and which would you trust? (Senior)
    Built-in importance (feature_importances_ in tree-based models) measures how much each feature reduces node impurity (Gini or entropy) across all trees, averaged over all splits. It is fast — computed during training with no additional cost — but biased toward high-cardinality features. A feature with 1,000 unique values gets more split opportunities than a binary feature, inflating its apparent importance even if it is less predictive. Permutation importance measures the decrease in model accuracy when a feature's values are randomly shuffled, breaking its relationship with the target. It is model-agnostic (works with any estimator), computed on held-out data, and unbiased by cardinality. But it is slower because it requires re-predicting the full test set once per feature. They disagree most when: (1) a feature has high cardinality but low predictive value — Gini importance inflates it, permutation importance does not. (2) Features are highly correlated — shuffling one correlated feature has little effect because the other carries the same signal, making both look unimportant in permutation importance. I use built-in importance for quick exploration during development. For final model validation, stakeholder reporting, and production monitoring, I trust permutation importance because it directly measures predictive contribution rather than splitting frequency. When they disagree significantly, I investigate — the disagreement itself is diagnostic information.

Frequently Asked Questions

Should I use Matplotlib or Seaborn?

Use both — they are not alternatives. Seaborn is built on top of Matplotlib, and every Seaborn plot returns a Matplotlib axes object. Use Seaborn for quick statistical plots during exploration: histograms with KDE overlays, grouped boxplots, correlation heatmaps, pair plots. Use Matplotlib for final presentation control: precise axis formatting, custom annotations, multi-panel layouts with unequal sizing, publication-quality output. The practical pattern is: prototype in Seaborn for speed, then customize with Matplotlib methods for polish.

How do I choose the right chart type for my data?

Match the chart to the relationship you want to communicate. Distribution of a single variable: histogram or KDE plot. Comparison across categories: bar chart or boxplot. Correlation between two numeric variables: scatter plot (with alpha transparency for large datasets). Trend over time or ordered sequence: line chart. Matrix of values: heatmap. For ML diagnostics specifically: confusion matrix for classification evaluation, residual plot for regression evaluation, learning curve for bias-variance diagnosis, feature importance bar chart for model interpretability, ROC or PR curve for threshold selection.

Why do my saved plots look different from what I see in the notebook?

Notebook display and file saving use different rendering backends and resolutions. The notebook renders at screen resolution (72–96 DPI) using the inline backend, while savefig uses the DPI value you specify. Additionally, the notebook may auto-adjust figure size to fit the cell width. Always use fig.savefig('name.png', dpi=300, bbox_inches='tight') with an explicit figsize in plt.subplots() to get consistent, predictable output. Test by opening the saved file directly — not by comparing to the notebook display. And always save before calling plt.show(), which destroys the figure in most backends.

How many charts should I include in a model evaluation report?

For classification: confusion matrix, ROC or Precision-Recall curve (or both for imbalanced data), and feature importance. For regression: predicted vs actual scatter, residual plot (four-panel diagnostic), and feature importance. That is 3 charts per model, each answering a specific question about model quality. Add learning curves only if actively diagnosing overfitting or underfitting. Add prediction probability distribution plots for production monitoring. Every chart must answer a specific question — if you cannot state the question the chart answers, remove it. Stakeholders need insight, not decoration.

How do I make my charts accessible to colorblind viewers?

Three rules cover most cases. First, use colorblind-safe palettes: Seaborn's 'colorblind' palette, or perceptually uniform colormaps like 'viridis' and 'cividis'. Avoid red-green as the sole differentiator — the most common color vision deficiency affects red-green perception. Second, add redundant visual channels: different line styles (solid, dashed, dotted), different markers (circle, square, triangle), or hatching patterns on bars. This way color is not the only signal. Third, add direct labels — annotate bars with their values, label lines directly instead of using a distant legend that requires color matching. Test your final figure in grayscale: if it still communicates the message, it is accessible.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
