
How to Visualize Machine Learning Results (Matplotlib & Seaborn)

📍 Part of: ML Basics → Topic 20 of 25
Beautiful charts and graphs every beginner should know how to create.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Every Matplotlib chart starts with fig, ax = plt.subplots() — use the object-oriented interface, always.
  • Seaborn handles DataFrame grouping and statistical estimation automatically — use it for rapid exploration, then drop down to Matplotlib for polish.
  • Confusion matrices reveal class-level failures that accuracy hides — always show both raw counts and row-normalized percentages.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Matplotlib is the foundation — every chart in Python builds on its figure/axes model
  • Seaborn wraps Matplotlib with statistical defaults and far less boilerplate code
  • Confusion matrices, ROC curves, and residual plots reveal model flaws numbers hide
  • Use fig.savefig() at 300 DPI — screen-resolution plots break in reports and slides
  • Production rule: never present raw accuracy alone — always pair with precision, recall, or error distribution
  • Biggest mistake: choosing the wrong chart type for the data relationship you want to communicate
  • Always call plt.close(fig) after saving — open figures leak memory and crash long-running pipelines
🚨 START HERE
ML Visualization Debug Cheat Sheet
Quick checks when your charts do not tell the right story or something looks suspicious.
🟡 Confusion matrix shows all predictions in one class
Immediate Action: Check class balance and prediction threshold. The model is likely predicting the majority class for every input.
Commands
print(f'Positive predictions: {y_pred.sum()} / {len(y_pred)}')
print(df['target'].value_counts(normalize=True))
Fix Now: Lower the decision threshold (e.g., from 0.5 to 0.3) and re-evaluate. If the problem persists, address class imbalance with SMOTE, class weights, or stratified sampling before retraining.
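A minimal sketch of the threshold fix, using synthetic probabilities in place of a real model's predict_proba output:

```python
import numpy as np

# Hypothetical scores standing in for model.predict_proba(X)[:, 1]
rng = np.random.default_rng(42)
y_proba = rng.uniform(0, 0.6, size=1000)  # model rarely exceeds 0.5

# Default threshold: almost no positive predictions
y_pred_default = (y_proba >= 0.5).astype(int)

# Lowered threshold recovers positive predictions
y_pred_low = (y_proba >= 0.3).astype(int)

print(f'Positives at 0.5: {y_pred_default.sum()} / {len(y_proba)}')
print(f'Positives at 0.3: {y_pred_low.sum()} / {len(y_proba)}')
```

Always re-check precision alongside recall after moving the threshold — a lower cutoff trades false negatives for false positives.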
🟡 Learning curve shows training score much higher than validation score
Immediate Action: The model is overfitting — it memorizes training data but cannot generalize.
Commands
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))
Fix Now: Increase regularization, reduce model complexity (fewer trees, shallower depth), or collect more training data. If the validation curve is still rising at maximum data size, more data will help.
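The commands above compute the curves but do not draw them. A runnable sketch of the standard plot, using a synthetic dataset and LogisticRegression purely for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering for scripts and pipelines
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation score')
# Shaded band: +/- one standard deviation across the validation folds
ax.fill_between(train_sizes,
                val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1),
                alpha=0.15)
ax.set_xlabel('Training set size')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curve')
ax.legend()
ax.grid(True, alpha=0.3)
fig.savefig('learning_curve.png', dpi=300, bbox_inches='tight')
plt.close(fig)
```

A persistent gap between the two lines signals overfitting; two low lines that converge signal underfitting.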
🟡 Feature importance plot shows one dominant feature at 95%+
Immediate Action: Check for data leakage — the dominant feature may directly encode or derive from the target variable.
Commands
print(df.corrwith(df['target']).abs().sort_values(ascending=False).head(10))
# Retrain without the suspicious feature and compare performance
model_no_leak = model.fit(X.drop(columns=['suspicious_feature']), y)
Fix Now: Remove the leaky feature and retrain. If accuracy collapses dramatically (e.g., from 99% to 60%), the original model learned nothing real — it was just memorizing the leaked signal.
Production Incident: Fraud Detection Model Degraded for 3 Weeks Because No One Plotted Predictions Over Time
A fraud detection model's precision dropped from 94% to 61% over three weeks, but the team noticed only after a quarterly review because no one was visualizing per-class metrics.
Symptom: False positive rate tripled. Customer support received a 400% spike in fraud-flag complaints from legitimate merchants. The aggregate weekly accuracy metric — reported as a single number on the team dashboard — still showed 89%, masking the class-level collapse entirely.
Assumption: The team monitored a single aggregate accuracy number in their Grafana dashboard. They assumed stability because the headline metric had not moved more than 1% in either direction. No per-class breakdown existed. No prediction distribution plot existed.
Root cause: A new merchant category code (MCC 7399) was introduced by the payment processor three weeks prior. The model had never seen this code during training. It defaulted to high suspicion scores for all transactions with the unfamiliar code, flagging legitimate purchases as fraud. The aggregate accuracy stayed high because fraud cases represent only 1% of transactions — the model's correct predictions on the other 99% of normal transactions dominated the average, drowning out the class-level failure.
Fix: Added daily confusion matrix heatmaps to the monitoring dashboard, broken down by predicted class. Implemented per-class precision and recall time-series plots with automated PagerDuty alerts when any class metric dropped below a configurable threshold for two consecutive days. Added a weekly prediction probability distribution plot (histogram of model confidence scores) to detect distribution shifts before they manifest as metric degradation.
Key Lesson
  • Never monitor a single aggregate metric — break performance down by class, by segment, and over time.
  • Confusion matrices catch class-level failures that accuracy, F1, and even AUC hide when classes are imbalanced.
  • Plot prediction probability distributions weekly to detect distribution shift before downstream metrics degrade.
  • The charts you build during model evaluation should become your production monitoring dashboards — not throwaway notebook cells.
Production Debug Guide
When your charts do not reveal what you expect — or when they reveal something you did not anticipate.
All points on a scatter plot overlap into a single blob
Use alpha transparency (alpha=0.05 to 0.2 depending on density), add jitter with np.random.normal(0, 0.1, size=len(x)), or switch to a 2D density plot with sns.kdeplot(x=x, y=y, fill=True). For very large datasets (>100K points), use datashader or hexbin plots (ax.hexbin) instead of scatter.
Bar chart error bars look identical across all groups
Check if you are plotting standard deviation on a log-scale axis, which compresses the visual differences. Switch to confidence intervals (ci=95 in Seaborn) or standard error of the mean instead of standard deviation. Also verify that your groups actually have different variances — identical error bars might be correct.
ROC curve looks perfect (AUC = 1.0) but model performs poorly in production
This is almost certainly data leakage. Check for target-derived features in your training data, duplicates spanning train and test splits, or temporal leakage where future information bleeds into training rows. A perfect ROC on held-out data means the model has access to the answer, not that it learned the pattern.
Residual plot shows a clear curved or fan-shaped pattern instead of random scatter
A curve means missing non-linearity — add polynomial features, interaction terms, or switch to a non-linear model. A fan shape (residuals widening with predicted value) means heteroscedasticity — log-transform the target variable or use weighted regression.
Saved figure looks different from the notebook display — wrong size, cut-off labels, or blank
Always save before calling plt.show(), which destroys the figure in most backends. Use fig.savefig('name.png', dpi=300, bbox_inches='tight') — the bbox_inches parameter prevents label clipping. Set figsize explicitly in plt.subplots() rather than relying on notebook defaults.
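A runnable sketch of the overplotting fixes from the first entry above, on synthetic data (the point count, alpha value, and gridsize are illustrative, not prescriptive):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt

# Synthetic correlated data dense enough to overplot
rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)
y = x * 0.5 + rng.standard_normal(20_000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Naive scatter: opaque points merge into one solid blob
axes[0].scatter(x, y, s=5)
axes[0].set_title('Opaque scatter (unreadable)')

# Alpha transparency lets density show through
axes[1].scatter(x, y, s=5, alpha=0.05)
axes[1].set_title('alpha=0.05')

# Hexbin aggregates points into density cells
hb = axes[2].hexbin(x, y, gridsize=40, cmap='viridis')
fig.colorbar(hb, ax=axes[2], label='Count')
axes[2].set_title('Hexbin density')

fig.tight_layout()
fig.savefig('overplotting.png', dpi=150, bbox_inches='tight')
plt.close(fig)
```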

Model metrics like accuracy and F1-score tell you the score. Visualizations tell you why. A confusion matrix shows exactly which classes your model confuses. A residual plot reveals systematic prediction errors that RMSE averages away. A learning curve tells you whether collecting more data will help or whether you need a fundamentally different model. These are not decorative — they are diagnostic tools.

Matplotlib provides the rendering engine. Seaborn provides statistical awareness on top of it. You need both: Matplotlib for full control over publication-quality figures, and Seaborn for rapid exploratory analysis with sensible defaults. They are not competitors — Seaborn is literally built on Matplotlib, and every Seaborn plot returns a Matplotlib axes object you can customize further.

The common mistake is treating visualization as an afterthought — something you do after the model is trained and shipped. In production, a well-designed diagnostic dashboard catches model degradation weeks before aggregate metrics move. The charts you build during evaluation become your monitoring tools after deployment. Skip them, and you are flying blind.

Matplotlib Fundamentals: Figure and Axes

Every Matplotlib chart lives inside a Figure that contains one or more Axes. The Figure is the canvas — it controls overall dimensions, background color, and file output. The Axes is the actual plot area with its own x-axis, y-axis, title, and data layers.

Understanding this hierarchy prevents 90% of the layout confusion beginners hit. When you call plt.plot(), Matplotlib implicitly creates a Figure and Axes behind the scenes. This works for quick exploration but falls apart the moment you need multiple subplots, consistent sizing, or saved files. The object-oriented interface — fig, ax = plt.subplots() — gives you explicit handles to both objects and should be your default for anything beyond throwaway exploration.

io/thecodeforge/viz/matplotlib_basics.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np


# --- Method 1: pyplot interface (quick exploration only) ---
# Implicitly creates a Figure and Axes. Fine for throwaway cells.
plt.plot([1, 2, 3], [4, 5, 6])
plt.title('Simple Line Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()


# --- Method 2: object-oriented interface (production standard) ---
# Explicitly creates Figure and Axes. Use this for everything you save.
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot([1, 2, 3], [4, 5, 6], marker='o', linewidth=2, label='Series A')
ax.set_title('Production-Ready Line Plot', fontsize=14, fontweight='bold')
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.legend()
ax.grid(True, alpha=0.3)

fig.tight_layout()
fig.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.close(fig)  # Free memory — critical in loops and pipelines


# --- Multi-panel figure: the pattern you will use most ---
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
np.random.seed(42)
data = np.random.randn(200)

# Panel 1: Distribution
axes[0, 0].hist(data, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 0].set_title('Distribution')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')

# Panel 2: Sequential scatter
axes[0, 1].scatter(np.arange(len(data)), data, alpha=0.4, s=12, color='coral')
axes[0, 1].axhline(y=0, color='black', linestyle='--', alpha=0.3)
axes[0, 1].set_title('Sequential Scatter')
axes[0, 1].set_xlabel('Index')

# Panel 3: Box plot
axes[1, 0].boxplot(data, vert=True, patch_artist=True,
                    boxprops=dict(facecolor='lightblue'))
axes[1, 0].set_title('Box Plot')

# Panel 4: Cumulative sum
axes[1, 1].plot(np.cumsum(data), color='seagreen', linewidth=1.5)
axes[1, 1].set_title('Cumulative Sum')
axes[1, 1].set_xlabel('Index')

fig.suptitle('Exploratory Data Summary', fontsize=16, fontweight='bold')
fig.tight_layout()
fig.savefig('dashboard.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Mental Model
Figure vs Axes
Think of Figure as the paper and Axes as individual charts drawn on that paper. You can have many charts on one piece of paper.
  • Figure = the full canvas. Controls overall size (figsize), background, DPI, and file saving.
  • Axes = one plot area. Has its own x-axis, y-axis, title, legend, and data layers. A Figure can hold many Axes.
  • fig, ax = plt.subplots() creates one Figure with one Axes. This is your starting point for every chart.
  • fig, axes = plt.subplots(2, 3) creates a 2×3 grid. Access individual plots with axes[row, col].
  • Always use the object-oriented interface (ax.plot, ax.set_title) for anything you save or present. The pyplot interface (plt.plot, plt.title) operates on an implicit 'current axes' that causes bugs in multi-panel figures.
📊 Production Insight
plt.show() destroys the figure object in most Matplotlib backends. If you call plt.show() then fig.savefig(), you save a blank file with no error message.
Always save before showing: fig.savefig() first, plt.show() second — or skip plt.show() entirely in automated pipelines.
Rule: in production scripts, scheduled jobs, and CI/CD pipelines, never call plt.show(). Use fig.savefig() and plt.close(fig) to render and release memory. Open figures accumulate and will eventually crash long-running processes.
🎯 Key Takeaway
Figure is the canvas, Axes is the plot. Always use the object-oriented interface.
fig, ax = plt.subplots() is your starting point for every chart — no exceptions for production code.
Save with fig.savefig('name.png', dpi=300, bbox_inches='tight') and always call plt.close(fig) afterward.

Seaborn for Statistical Visualization

Seaborn builds on Matplotlib with high-level functions that understand DataFrames natively. Pass column names directly, and Seaborn handles grouping, aggregation, statistical estimation, and legend creation automatically. Where Matplotlib requires 20 lines for a grouped bar chart with confidence intervals, Seaborn does it in 3.

The key insight is that Seaborn is not a replacement for Matplotlib — it is an accelerator for the statistical plotting patterns you use most often. Every Seaborn function returns a Matplotlib axes object, so you can always drop down to Matplotlib for fine-grained customization after Seaborn does the heavy lifting.

io/thecodeforge/viz/seaborn_basics.py · PYTHON
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


# Set Seaborn theme once at the top of your notebook or script
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.1)

# Generate example data
np.random.seed(42)
df = pd.DataFrame({
    'feature_a': np.random.randn(200),
    'feature_b': np.random.randn(200) * 2 + 1,
    'category': np.random.choice(['Class A', 'Class B', 'Class C'], 200),
    'target': np.random.choice([0, 1], 200)
})


# --- Distribution plots: understand feature spread ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(data=df, x='feature_a', hue='category', kde=True, ax=axes[0])
axes[0].set_title('Feature A Distribution by Category')

sns.boxplot(data=df, x='category', y='feature_b', ax=axes[1])
axes[1].set_title('Feature B Spread by Category')

fig.tight_layout()
fig.savefig('distributions.png', dpi=300, bbox_inches='tight')
plt.close(fig)


# --- Correlation heatmap: find feature relationships ---
fig, ax = plt.subplots(figsize=(8, 6))
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()

sns.heatmap(
    corr_matrix, annot=True, fmt='.2f', cmap='RdBu_r',
    center=0, vmin=-1, vmax=1, ax=ax,
    linewidths=0.5, square=True
)
ax.set_title('Feature Correlation Matrix')

fig.tight_layout()
fig.savefig('correlation.png', dpi=300, bbox_inches='tight')
plt.close(fig)


# --- Pair plot: explore all pairwise relationships at once ---
# Useful for small feature sets (<10 features). Slow for large ones.
pair = sns.pairplot(
    df, hue='category', diag_kind='kde',
    plot_kws={'alpha': 0.4, 's': 15}
)
pair.figure.suptitle('Pairwise Feature Relationships', y=1.02)
pair.savefig('pairplot.png', dpi=150, bbox_inches='tight')
plt.close('all')


# --- Seaborn + Matplotlib customization: the practical pattern ---
fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(data=df, x='category', y='feature_a', ax=ax, inner='quartile')

# Drop down to Matplotlib for fine-tuning
ax.set_title('Feature A Violin Plot', fontsize=14, fontweight='bold')
ax.set_xlabel('Category', fontsize=12)
ax.set_ylabel('Feature A Value', fontsize=12)
ax.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Zero baseline')
ax.legend()

fig.tight_layout()
fig.savefig('violin_customized.png', dpi=300, bbox_inches='tight')
plt.close(fig)
💡When to Use Seaborn vs Matplotlib
  • Seaborn excels at: grouped plots, statistical overlays (confidence intervals, KDE curves), automatic legend handling, DataFrame-native column references.
  • Matplotlib excels at: precise axis control, custom annotations and arrows, multi-panel layouts with unequal sizing, publication-quality formatting.
  • You can always access the underlying Matplotlib axes from any Seaborn plot: ax = sns.histplot(...); ax.set_xlim(0, 100).
  • Rule of thumb: prototype in Seaborn, polish in Matplotlib. Start fast, refine as needed.
📊 Production Insight
sns.set_theme() affects all subsequent plots globally in the current Python process. In shared notebooks or multi-team environments, this can silently change the appearance of other people's charts.
Call sns.set_theme() once at the very top of your notebook or script, and document the style choice.
For production pipelines that generate multiple report types, use matplotlib.rcParams context managers to scope style changes: with plt.rc_context({'font.size': 12}): ...
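A minimal sketch of that scoped-style pattern (the specific rcParams keys are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt

base_size = plt.rcParams['font.size']

# rcParams changes apply only inside the context manager
with plt.rc_context({'font.size': 16, 'axes.grid': True}):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3], [1, 4, 9])
    ax.set_title('Report-Styled Chart')
    inside_size = plt.rcParams['font.size']
    fig.savefig('styled_report.png', dpi=300, bbox_inches='tight')
    plt.close(fig)

# Global style is restored automatically on exit
print(f'inside: {inside_size}, after: {plt.rcParams["font.size"]}')
```

Unlike sns.set_theme(), nothing leaks: charts generated after the with block keep the process-wide defaults.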
🎯 Key Takeaway
Seaborn wraps Matplotlib with DataFrame awareness and statistical defaults — use it for exploration.
sns.histplot, sns.boxplot, sns.heatmap, and sns.pairplot cover 80% of ML visualization needs.
Every Seaborn plot returns a Matplotlib axes object — drop down to Matplotlib for final polish.

Confusion Matrix: Where Your Model Gets Confused

The confusion matrix is the single most important diagnostic chart for classification models. It shows exactly which classes your model confuses with which — information that a scalar metric like accuracy or F1 compresses into a single number and loses.

A model with 95% accuracy might be completely failing on one class. In a fraud detection system where only 2% of transactions are fraudulent, a model that predicts 'not fraud' for every single input achieves 98% accuracy while catching zero fraud. Only the confusion matrix reveals this. Always plot it. Always.

io/thecodeforge/viz/confusion_matrix.py · PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(
    y_true, y_pred, labels=None, title='Confusion Matrix'
):
    """Production-grade confusion matrix with both counts and percentages.

    Displays two panels side by side:
    - Left: raw counts (useful for understanding volume)
    - Right: row-normalized percentages (useful for understanding recall per class)

    Args:
        y_true: ground truth labels
        y_pred: predicted labels
        labels: list of class names for axis labels
        title: figure title

    Returns:
        Matplotlib Figure object (caller saves and closes).
    """
    cm = confusion_matrix(y_true, y_pred)
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left panel: raw counts
    sns.heatmap(
        cm, annot=True, fmt='d', cmap='Blues',
        xticklabels=labels, yticklabels=labels,
        ax=axes[0], linewidths=0.5
    )
    axes[0].set_xlabel('Predicted')
    axes[0].set_ylabel('Actual')
    axes[0].set_title(f'{title} (Counts)')

    # Right panel: row-normalized percentages (each row sums to 100%)
    sns.heatmap(
        cm_percent, annot=True, fmt='.1f', cmap='Blues',
        xticklabels=labels, yticklabels=labels,
        ax=axes[1], linewidths=0.5, vmin=0, vmax=100
    )
    axes[1].set_xlabel('Predicted')
    axes[1].set_ylabel('Actual')
    axes[1].set_title(f'{title} (Row %, i.e., Recall)')

    fig.tight_layout()
    return fig


# Example usage
np.random.seed(42)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2] * 20)
y_pred = np.array([0, 0, 1, 1, 1, 0, 2, 2, 2] * 20)
labels = ['Cat', 'Dog', 'Bird']

fig = plot_confusion_matrix(y_true, y_pred, labels=labels, title='Animal Classifier')
fig.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.close(fig)
⚠ Accuracy Hides Class-Level Failures
A model predicting 990 correct out of 1000 samples has 99% accuracy. But if those 10 errors are all in the fraud class (which has only 15 total samples), the model missed 67% of all fraud cases. The confusion matrix shows this immediately — the fraud row will have a large off-diagonal value. A single accuracy number never reveals this. On imbalanced datasets, accuracy is almost meaningless. The confusion matrix is not.
📊 Production Insight
Always display both raw counts and row-normalized percentages in your confusion matrix.
Raw counts mislead on imbalanced datasets because 95% of predictions naturally land in the majority class, making the diagonal look strong even when minority class recall is terrible.
Row percentages show recall per class — how much of each true class the model actually captures.
Column percentages show precision per class — of everything predicted as class X, how much is correct.
Rule: for production monitoring dashboards, plot the row-normalized version by default and provide the raw count version as a drill-down.
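Both normalizations are available directly from scikit-learn's confusion_matrix via its normalize parameter; a small sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 4 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# normalize='true' divides each row by its sum -> per-class recall
cm_recall = confusion_matrix(y_true, y_pred, normalize='true')

# normalize='pred' divides each column by its sum -> per-class precision
cm_precision = confusion_matrix(y_true, y_pred, normalize='pred')

print(np.round(cm_recall, 2))     # rows sum to 1
print(np.round(cm_precision, 2))  # columns sum to 1
```

Using the library's normalize argument avoids the manual row-sum division (and its divide-by-zero edge case when a class has no true samples).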
🎯 Key Takeaway
The confusion matrix is the most important classification diagnostic — plot it for every model, every time.
Always show both counts and row-normalized percentages. Counts tell you volume; percentages tell you recall.
Off-diagonal patterns reveal exactly which classes your model cannot distinguish and guide targeted improvements.
Confusion Matrix Interpretation
If: Diagonal cells are strong, off-diagonal cells are near zero
Then: Model separates classes well. Verify that performance is consistent across all classes — a strong overall diagonal can mask one weak class.
If: One row has high off-diagonal values (model confuses class A with class B specifically)
Then: Classes A and B share similar features. Consider feature engineering to surface distinguishing characteristics, collecting more training data for the confused class, or merging the classes if they are semantically close.
If: All predictions cluster into one class (entire column is dark, rest of matrix is blank)
Then: Model is degenerate — predicting the majority class for every input. Check class balance, lower the decision threshold, or apply class weights during training.
If: Matrix looks good on test data but deteriorates on production data
Then: Data distribution shift. Plot prediction probability distributions over time to detect when the drift started. Compare feature distributions between training data and recent production data.

ROC and Precision-Recall Curves

ROC curves plot the true positive rate against the false positive rate across all possible classification thresholds. They answer the question: as I lower the threshold to catch more positives, how many false positives do I accept?

Precision-Recall curves are more informative for imbalanced datasets because they focus exclusively on the positive class. On a dataset where only 1% of samples are positive, ROC can show an impressive AUC of 0.95 while the model's precision at useful recall levels is actually terrible. Precision-Recall curves expose this directly.

Both curves let you visualize the tradeoff space and choose the optimal threshold for your specific business requirements — something a single F1 score cannot do.

io/thecodeforge/viz/roc_pr_curves.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import (
    roc_curve, auc, precision_recall_curve, average_precision_score
)


def plot_roc_and_pr(y_true, y_proba, title='Model Evaluation'):
    """Plot ROC and Precision-Recall curves side by side.

    Both curves visualize model performance across all possible
    classification thresholds. Together they give a complete picture
    that no single metric can provide.

    Args:
        y_true: ground truth binary labels (0 or 1)
        y_proba: predicted probabilities for the positive class
        title: figure title prefix

    Returns:
        Matplotlib Figure object.
    """
    # Compute ROC curve
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_proba)
    roc_auc = auc(fpr, tpr)

    # Compute Precision-Recall curve
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_proba)
    avg_precision = average_precision_score(y_true, y_proba)

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # --- ROC Curve ---
    axes[0].plot(fpr, tpr, linewidth=2, label=f'Model (AUC = {roc_auc:.3f})')
    axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random (AUC = 0.5)')
    axes[0].fill_between(fpr, tpr, alpha=0.1)
    axes[0].set_xlabel('False Positive Rate')
    axes[0].set_ylabel('True Positive Rate (Recall)')
    axes[0].set_title(f'{title} — ROC Curve')
    axes[0].legend(loc='lower right')
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xlim([-0.02, 1.02])
    axes[0].set_ylim([-0.02, 1.02])

    # --- Precision-Recall Curve ---
    axes[1].plot(
        recall, precision, linewidth=2, color='orange',
        label=f'Model (AP = {avg_precision:.3f})'
    )
    baseline = y_true.sum() / len(y_true)
    axes[1].axhline(
        y=baseline, color='k', linestyle='--', alpha=0.5,
        label=f'Random baseline = {baseline:.3f}'
    )
    axes[1].fill_between(recall, precision, alpha=0.1, color='orange')
    axes[1].set_xlabel('Recall')
    axes[1].set_ylabel('Precision')
    axes[1].set_title(f'{title} — Precision-Recall Curve')
    axes[1].legend(loc='lower left')
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xlim([-0.02, 1.02])
    axes[1].set_ylim([0, 1.05])

    fig.tight_layout()
    return fig


# Example: imbalanced fraud detection scenario
np.random.seed(42)
y_true = np.random.choice([0, 1], size=500, p=[0.95, 0.05])
y_proba = np.clip(y_true * 0.6 + np.random.randn(500) * 0.2, 0, 1)

fig = plot_roc_and_pr(y_true, y_proba, title='Fraud Detection')
fig.savefig('roc_pr_curves.png', dpi=300, bbox_inches='tight')
plt.close(fig)
🔥 ROC vs Precision-Recall: When to Use Which
Use ROC when classes are roughly balanced — it gives a clean summary of the true-positive vs false-positive tradeoff across all thresholds. Use Precision-Recall when the positive class is rare (fraud detection, disease screening, anomaly detection, conversion prediction). On highly imbalanced data, ROC can show AUC > 0.95 because the massive true negative count keeps the false positive rate low regardless of how the model handles the rare positive class. Meanwhile, Precision-Recall reveals that the model's precision collapses to 10% at any useful recall level. Always plot both. A model can look excellent on one curve and mediocre on the other. If you only show one, you are hiding information.
📊 Production Insight
AUC and Average Precision summarize performance across all thresholds. In production, you deploy at one specific threshold.
The curve shape tells you where your operating point lives and what tradeoffs it forces. Two models with identical AUC can have very different characteristics at the threshold that matters for your business.
Rule: overlay the actual deployed threshold on the curve as a dot or vertical line. This makes it immediately clear how much performance room exists if you adjust the threshold — and what the cost of that adjustment is in the other metric.
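A sketch of that overlay for the ROC side (the 0.5 deployed threshold and the synthetic data are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic imbalanced scores, for illustration only
rng = np.random.default_rng(42)
y_true = rng.choice([0, 1], size=500, p=[0.9, 0.1])
y_proba = np.clip(y_true * 0.5 + rng.normal(0, 0.25, 500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_proba)

deployed = 0.5  # hypothetical production decision threshold
idx = int(np.argmin(np.abs(thresholds - deployed)))  # nearest curve point

fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(fpr, tpr, linewidth=2, label='ROC curve')
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random')
# The deployed operating point, marked directly on the curve
ax.scatter([fpr[idx]], [tpr[idx]], s=120, color='red', zorder=5,
           label=f'Deployed threshold = {deployed}')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve with Deployed Operating Point')
ax.legend(loc='lower right')
fig.savefig('roc_with_threshold.png', dpi=300, bbox_inches='tight')
plt.close(fig)
```

The same argmin-over-thresholds lookup works for precision_recall_curve, whose thresholds array is one element shorter than its precision and recall arrays.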
🎯 Key Takeaway
ROC curves work well for balanced classes. Precision-Recall curves are essential for imbalanced ones.
Always plot both side by side — a model can look good on one and poor on the other.
The curve shape reveals operating characteristics that a single AUC or AP number compresses away.

Residual Plots for Regression Models

Residual plots reveal systematic errors in regression models that aggregate metrics like RMSE and MAE completely hide. RMSE tells you the average error magnitude. Residual plots tell you whether those errors are random (acceptable) or structured (a sign your model is missing something).

If residuals show a pattern — a curve, a fan shape, clusters — your model is not capturing a relationship in the data. No amount of hyperparameter tuning will fix this. You need different features, a different transformation, or a different model family. The residual plot is the chart that tells you which.

io/thecodeforge/viz/residual_plots.py · PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression


def plot_regression_diagnostics(y_true, y_pred, title='Regression Diagnostics'):
    """Four-panel diagnostic plot for regression models.

    Panels:
    1. Predicted vs Actual — overall fit quality
    2. Residuals vs Predicted — detect non-linearity, heteroscedasticity
    3. Residual Distribution — check normality assumption
    4. Q-Q Plot — sensitive normality check at distribution tails

    Args:
        y_true: actual target values (numpy array)
        y_pred: predicted target values (numpy array)
        title: overall figure title

    Returns:
        Matplotlib Figure object.
    """
    residuals = y_true - y_pred

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Panel 1: Predicted vs Actual
    axes[0, 0].scatter(y_true, y_pred, alpha=0.4, s=15, color='steelblue')
    min_val = min(y_true.min(), y_pred.min())
    max_val = max(y_true.max(), y_pred.max())
    axes[0, 0].plot(
        [min_val, max_val], [min_val, max_val],
        'r--', linewidth=2, label='Perfect prediction'
    )
    axes[0, 0].set_xlabel('Actual')
    axes[0, 0].set_ylabel('Predicted')
    axes[0, 0].set_title('Predicted vs Actual')
    axes[0, 0].legend()

    # Panel 2: Residuals vs Predicted (the most important panel)
    axes[0, 1].scatter(y_pred, residuals, alpha=0.4, s=15, color='coral')
    axes[0, 1].axhline(y=0, color='r', linestyle='--', linewidth=2)
    axes[0, 1].set_xlabel('Predicted Value')
    axes[0, 1].set_ylabel('Residual (Actual - Predicted)')
    axes[0, 1].set_title('Residuals vs Predicted')

    # Panel 3: Residual Distribution
    sns.histplot(residuals, kde=True, ax=axes[1, 0], bins=30, color='steelblue')
    axes[1, 0].axvline(x=0, color='r', linestyle='--')
    axes[1, 0].set_xlabel('Residual')
    axes[1, 0].set_title(f'Residual Distribution (mean={residuals.mean():.2f})')

    # Panel 4: Q-Q plot (normality check — deviations at tails matter most)
    stats.probplot(residuals, dist='norm', plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot (Normality Check)')

    fig.suptitle(title, fontsize=14, fontweight='bold')
    fig.tight_layout()
    return fig


# Example
X, y = make_regression(
    n_samples=300, n_features=3, noise=15, random_state=42
)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

fig = plot_regression_diagnostics(y, y_pred, title='Linear Regression Diagnostics')
fig.savefig('residual_plots.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Mental Model
What Good Residuals Look Like
A well-specified model produces residuals that are random noise with no discernible pattern. If you see structure, the model is leaving signal on the table.
  • Residuals vs Predicted: random scatter centered on zero. No fan shape, no curve, no clusters.
  • Residual Distribution: approximately normal, centered at zero. Skew or heavy tails indicate the model handles some value ranges worse than others.
  • Q-Q Plot: points follow the diagonal line closely. Deviations at the tails mean the model produces more extreme errors than a normal distribution predicts.
  • If you see any pattern in the residual plot, your model is missing a signal. Add features, apply transformations, or switch model families.
📊 Production Insight
A fan-shaped residual plot — where residuals spread wider as predicted values increase — means heteroscedasticity. The model's error is not constant: it predicts well for small values and poorly for large ones, or vice versa.
This violates a core assumption of ordinary least squares and inflates confidence intervals on predictions.
Rule: apply a log transform to the target variable (np.log1p) or use weighted least squares to stabilize error variance. If the fan is severe, tree-based models handle heteroscedasticity naturally without transformation.
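The log-transform remedy can be sketched on synthetic data; the data-generating process and variable names below are illustrative, not part of this tutorial's pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic heteroscedastic data: noise grows multiplicatively with the target.
X = rng.uniform(1, 10, size=(300, 1))
y = np.exp(0.5 * X.ravel()) * rng.lognormal(mean=0.0, sigma=0.3, size=300)

# Fit on the log-transformed target, then invert predictions with expm1.
model = LinearRegression().fit(X, np.log1p(y))
y_pred = np.expm1(model.predict(X))

# On the log scale the residual spread should now be roughly constant.
log_residuals = np.log1p(y) - model.predict(X)
print(f"Log-scale residual std: {log_residuals.std():.3f}")
```

Regenerate the residual plot on the log scale after transforming; if the fan persists, a tree-based model is the next step.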
🎯 Key Takeaway
Residual plots reveal errors that RMSE hides — always generate them for regression models.
Random scatter around zero means the model is well-specified. Any pattern means missing signal.
Four diagnostic panels: predicted vs actual, residuals vs predicted, residual histogram, Q-Q plot.
Residual Pattern Diagnosis
  • Residuals show a U-shape or curve against predicted values → the model is missing a non-linear relationship. Add polynomial features (degree 2 or 3), interaction terms between features, or switch to a non-linear model like gradient boosted trees.
  • Residuals fan out (spread increases with predicted value) → heteroscedasticity. Log-transform the target variable with np.log1p(y), use weighted least squares, or switch to a model family that handles non-constant variance naturally (e.g., tree-based models).
  • Residuals are not centered at zero (consistent bias in one direction) → systematic bias. Check for a missing intercept term, incorrect feature encoding, or a target variable that needs transformation.
  • Residuals show a clear trend when plotted against time or row index → autocorrelation: your data has temporal structure that the model ignores. Add lag features, rolling statistics, or switch to a time-series model (ARIMA, Prophet, temporal neural networks).
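The lag-feature fix for autocorrelated residuals can be sketched with pandas; the series and column names here are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily series with temporal structure (a random walk).
df = pd.DataFrame({'value': np.cumsum(rng.normal(size=100))})

# Lag features and rolling statistics give the model access to recent history.
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
# Shift before rolling so the window never includes the current row (no leakage).
df['rolling_mean_7'] = df['value'].shift(1).rolling(7).mean()

# Drop the warm-up rows where the lags are undefined.
df = df.dropna().reset_index(drop=True)
print(df.head(3))
```

After retraining with these features, replot residuals against the row index to confirm the trend is gone.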

Feature Importance Visualization

Feature importance plots show which inputs drive your model's predictions. For tree-based models, importance is built in via impurity reduction. For any model, permutation importance provides a model-agnostic alternative by measuring how much accuracy drops when each feature's values are randomly shuffled.

Visualization makes these rankings immediately interpretable to non-technical stakeholders who need to understand why the model makes the decisions it does — not just what it predicts. A horizontal bar chart sorted by importance is the universal format that everyone from data scientists to product managers can read.

io/thecodeforge/viz/feature_importance.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_classification


def plot_feature_importance(
    model, feature_names, X_test, y_test, top_n=15
):
    """Plot built-in and permutation importance side by side.

    Built-in importance (Gini) is fast but biased toward high-cardinality
    features. Permutation importance is slower but model-agnostic and
    unbiased. Showing both highlights discrepancies worth investigating.

    Args:
        model: fitted sklearn estimator
        feature_names: list of feature name strings
        X_test: test features for permutation importance
        y_test: test labels for permutation importance
        top_n: number of top features to display

    Returns:
        Matplotlib Figure object.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, max(6, top_n * 0.4)))

    # Left panel: built-in importance (tree-based models only)
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        indices = np.argsort(importances)[::-1][:top_n]

        axes[0].barh(
            [feature_names[i] for i in indices][::-1],
            importances[indices][::-1],
            color='steelblue', edgecolor='black', alpha=0.8
        )
        axes[0].set_xlabel('Gini Importance (Impurity Reduction)')
        axes[0].set_title('Built-in Feature Importance')
    else:
        axes[0].text(
            0.5, 0.5, 'Not available\n(model has no feature_importances_)',
            ha='center', va='center', fontsize=12, transform=axes[0].transAxes
        )
        axes[0].set_title('Built-in Feature Importance (N/A)')

    # Right panel: permutation importance (model-agnostic)
    perm_result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
    )
    perm_mean = perm_result.importances_mean
    perm_std = perm_result.importances_std
    indices = np.argsort(perm_mean)[::-1][:top_n]

    axes[1].barh(
        [feature_names[i] for i in indices][::-1],
        perm_mean[indices][::-1],
        xerr=perm_std[indices][::-1],
        color='coral', edgecolor='black', alpha=0.8
    )
    axes[1].set_xlabel('Mean Accuracy Decrease When Shuffled')
    axes[1].set_title('Permutation Importance')

    fig.suptitle(
        'Feature Importance Comparison', fontsize=14, fontweight='bold'
    )
    fig.tight_layout()
    return fig


# Example
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
feature_names = [f'feature_{i}' for i in range(10)]
model = RandomForestClassifier(
    n_estimators=100, random_state=42
).fit(X, y)

# Demo passes training data for brevity; prefer a held-out test split.
fig = plot_feature_importance(model, feature_names, X, y)
fig.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.close(fig)
⚠ Built-in Importance Can Mislead
Gini importance (feature_importances_) in tree-based models is biased toward high-cardinality features. A feature with 1,000 unique values — like a raw ID column — will appear more important than a genuinely predictive binary feature, because the tree has more possible split points to choose from. This is a measurement artifact, not real predictive value. Always validate with permutation importance, which directly measures the accuracy cost of losing each feature and is unbiased by cardinality.
📊 Production Insight
Feature importance rankings can shift dramatically between model versions — not because the data changed, but because tree-based models have inherent randomness in split selection.
Track importance rankings over time across deployments. A feature that drops from top 3 to zero importance between versions may indicate data pipeline corruption (the column went null, changed format, or stopped updating).
Rule: store and compare feature importance snapshots as part of your model registry metadata. Unexpected ranking changes should trigger investigation before deployment, not after.
🎯 Key Takeaway
Built-in importance is fast but biased toward high-cardinality features. Permutation importance is slower but reliable and model-agnostic.
Plot both side by side — significant disagreement between them signals a cardinality bias or data leakage problem.
Track importance rankings across model versions to detect data pipeline degradation early.
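One minimal way to implement the snapshot comparison; the helper names, threshold, and example values are hypothetical:

```python
def importance_snapshot(model, feature_names):
    """Serialize importances as a name -> score dict for the model registry."""
    return dict(zip(feature_names, model.feature_importances_.round(4).tolist()))


def rank_shift(old, new, alert_threshold=2):
    """Flag features whose importance rank moved by at least the threshold."""
    old_rank = {f: r for r, f in enumerate(sorted(old, key=old.get, reverse=True))}
    new_rank = {f: r for r, f in enumerate(sorted(new, key=new.get, reverse=True))}
    return [f for f in old_rank
            if abs(old_rank[f] - new_rank.get(f, len(new_rank))) >= alert_threshold]


# Hypothetical snapshots from two model versions.
v1 = {'age': 0.40, 'income': 0.35, 'region': 0.15, 'tenure': 0.10}
v2 = {'age': 0.42, 'income': 0.02, 'region': 0.30, 'tenure': 0.26}

print(rank_shift(v1, v2))  # 'income' dropped from rank 1 to rank 3
```

In practice these dicts would be stored as registry metadata alongside each model version, and a non-empty `rank_shift` result would block deployment pending investigation.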

Learning Curves: Diagnosing Bias and Variance

Learning curves plot model performance against training set size. They answer the most fundamental question in model improvement: should I get more data, or should I change the model?

The gap between the training score and validation score at each data size reveals whether your model suffers from high bias (underfitting — both curves are low) or high variance (overfitting — training is high, validation is low). This is not an academic distinction. It directly determines whether spending three weeks collecting more data will help or be completely wasted effort.

io/thecodeforge/viz/learning_curves.py · PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def plot_learning_curve(
    estimator, X, y, title='Learning Curve', cv=5, scoring='accuracy'
):
    """Plot learning curve showing the bias-variance tradeoff.

    The gap between training and validation curves tells you exactly
    what to fix: more data, more regularization, or a different model.

    Args:
        estimator: unfitted sklearn estimator (will be cloned internally)
        X: feature matrix
        y: target vector
        title: plot title
        cv: number of cross-validation folds
        scoring: sklearn scoring metric name

    Returns:
        Matplotlib Figure object.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        cv=cv,
        n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring=scoring
    )

    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    fig, ax = plt.subplots(figsize=(10, 6))

    # Confidence bands
    ax.fill_between(
        train_sizes, train_mean - train_std, train_mean + train_std,
        alpha=0.1, color='blue'
    )
    ax.fill_between(
        train_sizes, val_mean - val_std, val_mean + val_std,
        alpha=0.1, color='orange'
    )

    # Mean curves
    ax.plot(
        train_sizes, train_mean, 'o-', color='blue',
        linewidth=2, label='Training Score'
    )
    ax.plot(
        train_sizes, val_mean, 'o-', color='orange',
        linewidth=2, label='Validation Score'
    )

    ax.set_xlabel('Training Set Size')
    ax.set_ylabel(scoring.capitalize())
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)

    # Annotate the final gap between curves
    final_gap = train_mean[-1] - val_mean[-1]
    ax.annotate(
        f'Gap: {final_gap:.3f}',
        xy=(train_sizes[-1], (train_mean[-1] + val_mean[-1]) / 2),
        fontsize=11, fontweight='bold', color='red',
        ha='right'
    )

    fig.tight_layout()
    return fig


# Example
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)

fig = plot_learning_curve(model, X, y, title='Random Forest Learning Curve')
fig.savefig('learning_curve.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Mental Model
Reading Learning Curves
The gap between training and validation curves tells you exactly what to fix — and what not to waste time on.
  • Large gap (training high, validation low) = high variance (overfitting). Fix with: more data, stronger regularization, fewer features, simpler model.
  • Both curves low and converging together = high bias (underfitting). Fix with: more features, more complex model, less regularization. More data will NOT help here.
  • Both curves high and converging together = good fit. Model is well-calibrated for this data volume.
  • Validation curve still rising at the right edge = more data will help. Collecting additional training examples is a productive investment.
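The decision rules above can be condensed into a small helper; the thresholds are illustrative defaults, not universal constants — tune them to your metric and problem:

```python
def diagnose_learning_curve(train_score, val_score, good_score=0.85, gap_tol=0.05):
    """Map the final points of a learning curve to a bias/variance diagnosis."""
    gap = train_score - val_score
    if gap > gap_tol:
        return 'high variance (overfitting): regularize, simplify, or add data'
    if val_score < good_score:
        return 'high bias (underfitting): add features or model complexity'
    return 'good fit: model is well-matched to this data volume'


print(diagnose_learning_curve(0.99, 0.80))  # large gap -> overfitting
print(diagnose_learning_curve(0.70, 0.68))  # both low -> underfitting
print(diagnose_learning_curve(0.91, 0.89))  # both high, small gap -> good fit
```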
📊 Production Insight
Learning curves computed on a tiny subsample can be misleading about convergence behavior. If your full dataset has 1M rows but you compute the learning curve on a 5K sample, the curve might show convergence that disappears at full scale.
Always compute learning curves on a representative sample large enough to show the real convergence pattern — at least 10% of the full dataset or 10K samples, whichever is larger.
Rule: if the validation curve has not plateaued at the maximum training size, your model will measurably benefit from more training data. If it has plateaued, spending three weeks collecting more data is wasted effort — change the model instead.
🎯 Key Takeaway
Learning curves diagnose bias vs variance — the fundamental decision point for model improvement.
Large gap between curves = overfitting (needs regularization or more data). Both curves low = underfitting (needs more complexity).
The curve shape tells you whether to invest in more data or in a different model architecture.

Saving and Formatting for Production

Charts in notebooks are for exploration. Charts in reports, dashboards, presentations, and papers require consistent formatting, appropriate resolution, and accessible color choices. The gap between a notebook plot and a production-ready figure is not aesthetics — it is legibility, accessibility, and reproducibility.

A chart that looks fine on your 4K monitor becomes an unreadable blur when projected onto a conference room screen or embedded in a PDF at print resolution. This section covers the production formatting pipeline that ensures your figures survive every medium they encounter.

io/thecodeforge/viz/production_formatting.py · PYTHON
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np


def apply_production_style():
    """Apply a consistent, publication-quality style globally.

    Call this once at the top of your notebook or script.
    Overrides Matplotlib defaults with production-safe values.
    """
    mpl.rcParams.update({
        # Typography
        'font.size': 12,
        'axes.titlesize': 14,
        'axes.labelsize': 12,
        'xtick.labelsize': 10,
        'ytick.labelsize': 10,
        'legend.fontsize': 10,
        'figure.titlesize': 16,

        # Figure defaults
        'figure.figsize': (10, 6),
        'figure.dpi': 100,           # Screen display DPI
        'savefig.dpi': 300,          # Saved file DPI
        'savefig.bbox': 'tight',     # Prevent label clipping
        'savefig.pad_inches': 0.1,

        # Grid and spines
        'axes.grid': True,
        'grid.alpha': 0.3,
        'axes.spines.top': False,    # Remove top spine
        'axes.spines.right': False,  # Remove right spine

        # Lines and markers
        'lines.linewidth': 2,
        'lines.markersize': 6,
    })
    print("Production style applied.")


def save_publication(fig, filename, formats=None):
    """Save figure in multiple formats for different use cases.

    Args:
        fig: Matplotlib Figure object
        filename: base filename without extension
        formats: list of format strings. Defaults to PNG + SVG.
    """
    if formats is None:
        formats = ['png', 'svg']

    for fmt in formats:
        filepath = f"{filename}.{fmt}"
        fig.savefig(filepath, dpi=300, bbox_inches='tight', facecolor='white')
        print(f"Saved: {filepath}")


# --- Usage ---
apply_production_style()

fig, ax = plt.subplots()
colors = ['#2563eb', '#16a34a', '#dc2626']  # Blue, green, red — distinguishable
ax.bar(
    ['Model A', 'Model B', 'Model C'],
    [0.89, 0.92, 0.87],
    color=colors, edgecolor='black', alpha=0.9
)
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison — Q1 2026')
ax.set_ylim(0.80, 0.95)

# Add value labels on bars
for i, v in enumerate([0.89, 0.92, 0.87]):
    ax.text(i, v + 0.003, f'{v:.2f}', ha='center', fontweight='bold')

save_publication(fig, 'model_comparison')
plt.close(fig)
💡Accessibility in Visualizations
  • Use colorblind-safe palettes: sns.color_palette('colorblind') or the 'muted' palette. Avoid pure red/green combinations as the only differentiator.
  • Add patterns (hatching), markers, or line styles to distinguish series — not just color. ax.bar(..., hatch='//') adds visual texture.
  • Never use the 'jet' or 'rainbow' colormap for continuous data — they introduce perceptual artifacts. Use 'viridis', 'plasma', or 'cividis' instead.
  • Add direct value labels on bars and direct labels on lines instead of relying on a distant legend that requires color matching.
  • Test your charts in grayscale. If they still communicate the message, they are accessible.
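A minimal sketch of the redundant-channels advice, distinguishing series by line style and marker as well as color (the Agg backend and output filename are our choices for a headless script):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, safe for scripts and CI
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
fig, ax = plt.subplots(figsize=(8, 5))

# Redundant channels: each series differs in line style AND marker, not just
# color, so the chart survives grayscale printing and color vision deficiency.
ax.plot(x, np.sin(x), linestyle='-', marker='o', markevery=5, label='Model A')
ax.plot(x, np.cos(x), linestyle='--', marker='s', markevery=5, label='Model B')
ax.plot(x, np.sin(x / 2), linestyle=':', marker='^', markevery=5, label='Model C')

ax.legend()
fig.savefig('accessible_lines.png', dpi=300, bbox_inches='tight')
plt.close(fig)
```

Converting the saved PNG to grayscale is a quick way to verify the three series remain distinguishable.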
📊 Production Insight
PNG at 300 DPI is the standard for reports and presentations. SVG is best for web dashboards and documentation sites because it scales without pixelation and has a smaller file size for simple charts. PDF is best for print publications, LaTeX documents, and archival.
Rule: always save in at least two formats — PNG for immediate sharing and embedding, SVG or PDF for archival and web. The save_publication helper above handles this automatically.
In automated report generation pipelines, save figures to a versioned artifact directory alongside the model they evaluate. Figures and models should share the same version tag.
🎯 Key Takeaway
Apply a consistent style with mpl.rcParams at the top of every notebook or script — never rely on Matplotlib defaults.
Save at 300 DPI in PNG for reports, SVG for web, PDF for print. Always save before calling plt.show().
Use colorblind-safe palettes and direct value labels. Never rely on color alone to convey meaning.
🗂 Matplotlib vs Seaborn: When to Use Which
They are complementary tools, not competitors. Choose based on your current task.
Aspect | Matplotlib | Seaborn
Learning Curve | Steeper — more code required for statistical plots | Gentler — sensible defaults and fewer lines for common charts
Control Level | Full pixel-level control over every element | Less granular control, but faster to prototype
DataFrame Awareness | None — requires manual extraction of arrays from DataFrames | Native — pass column names directly via the data= parameter
Statistical Plots | Manual — compute confidence intervals, KDE, regressions yourself | Built-in — automatic confidence intervals, KDE, regression lines
Multi-Panel Layouts | Excellent — full control over grid spacing and sizing | Limited — pairplot and FacetGrid handle specific patterns only
Customization | Unlimited — every element is individually addressable | Good via Matplotlib axes access, but some Seaborn elements resist customization
Production Formatting | Full control via rcParams and style sheets | Inherits Matplotlib settings, adds its own theme layer via set_theme()
Best For | Final figures, custom annotations, publication-quality output | Exploratory analysis, statistical summaries, rapid prototyping

🎯 Key Takeaways

  • Every Matplotlib chart starts with fig, ax = plt.subplots() — use the object-oriented interface, always.
  • Seaborn handles DataFrame grouping and statistical estimation automatically — use it for rapid exploration, then drop down to Matplotlib for polish.
  • Confusion matrices reveal class-level failures that accuracy hides — always show both raw counts and row-normalized percentages.
  • ROC curves work for balanced data; Precision-Recall curves are essential for imbalanced data. Plot both.
  • Residual plots diagnose regression model errors that RMSE averages away — check for patterns, not just magnitude.
  • Learning curves tell you whether to invest in more data or a different model — read the gap between training and validation curves.
  • Save at 300 DPI with fig.savefig() and always call plt.close(fig) afterward to prevent memory leaks in pipelines.
  • Use perceptually uniform colormaps (viridis, plasma, cividis) — never use jet or rainbow for continuous data.

⚠ Common Mistakes to Avoid

    Using plt.plot() instead of the object-oriented ax.plot() interface
    Symptom

    Multi-panel figures break unpredictably. Titles, labels, and data end up on the wrong subplot. Saving produces blank files after calling plt.show().

    Fix

    Always use fig, ax = plt.subplots() and call methods on the ax object: ax.plot(), ax.set_title(), ax.set_xlabel(). The pyplot interface (plt.plot()) operates on an implicit 'current axes' that changes unpredictably in multi-panel figures. The object-oriented interface is explicit, debuggable, and production-safe.

    Not calling plt.close(fig) after saving
    Symptom

    Memory usage climbs steadily during training loops or report generation scripts. After generating 50–100 figures, the process crashes with a memory error or slows to a crawl.

    Fix

    Always call plt.close(fig) after fig.savefig(). Each open figure consumes memory. In loops, use plt.close('all') as a safety net. In Jupyter notebooks, this matters less because %matplotlib inline auto-closes, but it is still good practice.
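The loop pattern can be sketched as follows; the metric values and output filenames are hypothetical:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for pipelines
import matplotlib.pyplot as plt

# Hypothetical report loop: one figure per metric, closed right after saving.
metrics = {'accuracy': [0.81, 0.85, 0.88], 'f1': [0.70, 0.76, 0.79]}

for name, values in metrics.items():
    fig, ax = plt.subplots()
    ax.plot(values, 'o-')
    ax.set_title(name)
    fig.savefig(f'{name}_trend.png', dpi=300, bbox_inches='tight')
    plt.close(fig)   # release this figure's memory before the next iteration

plt.close('all')     # safety net: nothing left open at loop exit
print(f'Open figures: {len(plt.get_fignums())}')
```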

    Using the 'jet' or 'rainbow' colormap for continuous data
    Symptom

    Charts create visual artifacts — bright yellow bands appear to be boundaries or features that do not exist in the data. Colorblind viewers cannot distinguish adjacent regions. Print outputs in grayscale are completely unreadable.

    Fix

    Use perceptually uniform colormaps: 'viridis' (default), 'plasma', 'inferno', or 'cividis' (designed specifically for colorblind accessibility). For diverging data (centered around zero), use 'RdBu_r' or 'coolwarm' with center=0.

    Presenting only accuracy without a confusion matrix or error distribution
    Symptom

    Stakeholders approve a model that is 95% accurate on an imbalanced dataset. In production, it misses 60% of the minority class (the class that actually matters for the business). Nobody knew because accuracy masked the class-level failure.

    Fix

    Always present the confusion matrix alongside any aggregate metric. For regression, always include a residual plot alongside RMSE. The aggregate metric is the headline; the visualization is the evidence. If the evidence contradicts the headline, the headline is wrong.

    Saving figures at screen resolution (72–96 DPI)
    Symptom

    Charts look fine in the notebook but become pixelated and blurry when embedded in PDF reports, printed on paper, or projected in meeting rooms. Text labels become unreadable.

    Fix

    Always save with fig.savefig('name.png', dpi=300, bbox_inches='tight'). 300 DPI is the minimum for print and presentation quality. For posters or large-format prints, use 600 DPI. Set savefig.dpi in rcParams so you never forget.

    Choosing the wrong chart type for the data relationship
    Symptom

    Pie chart used for 15 categories — impossible to compare slice sizes. Line chart used for categorical data — implies a continuous trend that does not exist. Scatter plot used for 1 million points — produces an opaque blob.

    Fix

    Match the chart to the relationship: histogram or KDE for distributions, bar chart for categorical comparisons, scatter plot for bivariate correlation (with alpha for large N), line chart for trends over time or ordered sequences, heatmap for matrices and correlations. When in doubt, ask: what question should this chart answer? Then pick the chart type that answers it most directly.

Interview Questions on This Topic

  • Q: Why is a Precision-Recall curve more informative than an ROC curve for imbalanced classification problems? (Mid-level)
    ROC curves plot true positive rate against false positive rate. On imbalanced datasets where negatives vastly outnumber positives, even a large number of false positives represents a small false positive rate because the denominator (total negatives) is enormous. This makes the ROC curve look deceptively good — AUC can exceed 0.95 while the model's precision at any useful recall level is actually terrible. Precision-Recall curves focus exclusively on the positive class. Precision measures what fraction of positive predictions are correct, and recall measures what fraction of actual positives are detected. Neither metric is inflated by the large pool of true negatives. On a 1% positive rate dataset, the PR curve immediately shows that achieving 80% recall requires accepting 30% precision — a tradeoff the ROC curve hides entirely. In production, I always plot both side by side. If ROC looks excellent but PR looks mediocre, the model is benefiting from the imbalance, not from genuine discriminative power.
  • Q: Your residual plot shows a U-shaped pattern. What does this tell you about your regression model, and what would you do about it? (Mid-level)
    A U-shaped pattern in residuals plotted against predicted values means the model is missing a non-linear relationship in the data. The linear model systematically overpredicts in some value ranges and underpredicts in others — the residuals are not random, they are structured. Specifically, the model's linearity assumption is violated. The true relationship between features and target includes curvature that a straight line cannot capture. To fix it, I would try three approaches in order: first, add polynomial features (x², x³) or interaction terms between existing features and retrain the linear model. Second, apply a non-linear transformation to the target variable (log, sqrt) if the U-shape suggests multiplicative rather than additive relationships. Third, switch to a non-linear model — gradient boosted trees or a neural network — that can capture arbitrary non-linear patterns without manual feature engineering. After each change, I would regenerate the residual plot to verify the pattern has disappeared. If residuals are now randomly scattered around zero, the fix worked.
  • Q: How would you present model evaluation results to a non-technical stakeholder who needs to decide whether to deploy the model? (Junior)
    I would use three charts, each answering a specific business question: First, a bar chart comparing the model's performance against a meaningful baseline — not random chance, but the current process or heuristic the model would replace. This answers 'is the model better than what we do today?' Second, a confusion matrix with row-normalized percentages, using business-language labels (not 'Class 0' and 'Class 1'). This answers 'where does the model get it right and where does it get it wrong?' I would annotate the cost of each error type in business terms — 'of every 100 fraud cases, the model catches 85 and misses 15.' Third, a simple before-and-after impact chart showing the projected business metric change: fraud losses reduced, support tickets prevented, revenue captured. This answers 'what is the dollar impact?' I would avoid showing ROC curves, raw F1 scores, or learning curves to non-technical audiences. The stakeholder needs to understand what the model does well, where it fails, and what the business impact is. Mathematical internals create confusion, not confidence.
  • Q: Explain the difference between built-in feature importance and permutation importance. When would they disagree, and which would you trust? (Senior)
    Built-in importance (feature_importances_ in tree-based models) measures how much each feature reduces node impurity (Gini or entropy) across all trees, averaged over all splits. It is fast — computed during training with no additional cost — but biased toward high-cardinality features. A feature with 1,000 unique values gets more split opportunities than a binary feature, inflating its apparent importance even if it is less predictive. Permutation importance measures the decrease in model accuracy when a feature's values are randomly shuffled, breaking its relationship with the target. It is model-agnostic (works with any estimator), computed on held-out data, and unbiased by cardinality. But it is slower because it requires re-predicting the full test set once per feature. They disagree most when: (1) a feature has high cardinality but low predictive value — Gini importance inflates it, permutation importance does not. (2) Features are highly correlated — shuffling one correlated feature has little effect because the other carries the same signal, making both look unimportant in permutation importance. I use built-in importance for quick exploration during development. For final model validation, stakeholder reporting, and production monitoring, I trust permutation importance because it directly measures predictive contribution rather than splitting frequency. When they disagree significantly, I investigate — the disagreement itself is diagnostic information.

Frequently Asked Questions

Should I use Matplotlib or Seaborn?

Use both — they are not alternatives. Seaborn is built on top of Matplotlib, and every Seaborn plot returns a Matplotlib axes object. Use Seaborn for quick statistical plots during exploration: histograms with KDE overlays, grouped boxplots, correlation heatmaps, pair plots. Use Matplotlib for final presentation control: precise axis formatting, custom annotations, multi-panel layouts with unequal sizing, publication-quality output. The practical pattern is: prototype in Seaborn for speed, then customize with Matplotlib methods for polish.

How do I choose the right chart type for my data?

Match the chart to the relationship you want to communicate. Distribution of a single variable: histogram or KDE plot. Comparison across categories: bar chart or boxplot. Correlation between two numeric variables: scatter plot (with alpha transparency for large datasets). Trend over time or ordered sequence: line chart. Matrix of values: heatmap. For ML diagnostics specifically: confusion matrix for classification evaluation, residual plot for regression evaluation, learning curve for bias-variance diagnosis, feature importance bar chart for model interpretability, ROC or PR curve for threshold selection.

Why do my saved plots look different from what I see in the notebook?

Notebook display and file saving use different rendering backends and resolutions. The notebook renders at screen resolution (72–96 DPI) using the inline backend, while savefig uses the DPI value you specify. Additionally, the notebook may auto-adjust figure size to fit the cell width. Always use fig.savefig('name.png', dpi=300, bbox_inches='tight') with an explicit figsize in plt.subplots() to get consistent, predictable output. Test by opening the saved file directly — not by comparing to the notebook display. And always save before calling plt.show(), which destroys the figure in most backends.

How many charts should I include in a model evaluation report?

For classification: confusion matrix, ROC or Precision-Recall curve (or both for imbalanced data), and feature importance. For regression: predicted vs actual scatter, residual plot (four-panel diagnostic), and feature importance. That is 3 charts per model, each answering a specific question about model quality. Add learning curves only if actively diagnosing overfitting or underfitting. Add prediction probability distribution plots for production monitoring. Every chart must answer a specific question — if you cannot state the question the chart answers, remove it. Stakeholders need insight, not decoration.

How do I make my charts accessible to colorblind viewers?

Three rules cover most cases. First, use colorblind-safe palettes: Seaborn's 'colorblind' palette, or perceptually uniform colormaps like 'viridis' and 'cividis'. Avoid red-green as the sole differentiator — the most common color vision deficiency affects red-green perception. Second, add redundant visual channels: different line styles (solid, dashed, dotted), different markers (circle, square, triangle), or hatching patterns on bars. This way color is not the only signal. Third, add direct labels — annotate bars with their values, label lines directly instead of using a distant legend that requires color matching. Test your final figure in grayscale: if it still communicates the message, it is accessible.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
