Beginner 8 min · May 28, 2026

Exploratory Data Analysis (EDA): The First and Most Critical Step in ML

Q: What is the difference between EDA and hypothesis testing?

EDA is exploratory: you let the data guide you to form hypotheses. Hypothesis testing is confirmatory: you test a pre-specified hypothesis against the data. EDA is done before modeling to understand the data; hypothesis testing is done after to validate findings.

Q: Do I need to do EDA if I'm using deep learning?

Absolutely. Deep learning models are not immune to bad data. EDA helps you detect missing values, class imbalance, and distribution shifts that can cause your model to fail silently. Even with automatic feature extraction, understanding your data's structure is crucial.

Q: What are the most common EDA techniques?

Histograms for distribution, box plots for outliers, scatter plots for relationships, correlation matrices for multicollinearity, and summary statistics (mean, median, std) for central tendency and spread. For categorical data, bar charts and frequency tables are standard.

Q: How long should EDA take in a typical project?

It varies, but a good rule of thumb is 40-60% of the total project time. For a small dataset (few thousand rows), a few hours. For large, messy datasets, days or weeks. The goal is to understand the data well enough to make informed modeling decisions.

Learn Exploratory Data Analysis (EDA) from fundamentals to production.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Lessons pulled from things that broke in production.

Last updated: July 15, 2026

Before you start (⏱ 20 min)

✓ Basic programming fundamentals
✓ A computer with internet access
✓ Willingness to follow along with examples

⚡Quick Answer

Exploratory Data Analysis (EDA) is the process of summarizing and visualizing datasets to uncover patterns, spot anomalies, and test assumptions before modeling. The most important practical takeaway: always check for missing values, outliers, and data types first—these silent killers break pipelines and skew results in production.

✦ Definition~90s read

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It focuses on seeing what the data can tell beyond formal modeling, contrasting with traditional hypothesis testing where a model is selected before seeing the data.

Plain-English First

Think of EDA as a detective's first look at a crime scene. You don't jump to conclusions; you walk around, take photos, and note anything unusual. Similarly, EDA is about getting a feel for your data (its shape, quirks, and potential problems) before you start building models. It's like checking that your ingredients are fresh before you start cooking.

Raw data arrives with hidden pathologies. Exploratory Data Analysis (EDA) is the systematic process of visually and statistically auditing a dataset to expose its structure, surface anomalies, test assumptions, and form testable hypotheses. Skip it, and you're betting your model against unknown failure modes.

Beginners often leap past EDA because it lacks the dopamine hit of training a neural net or tuning hyperparameters. But production ML engineers routinely report that most of a project's time, by a commonly quoted rule of thumb as much as 80%, goes to data understanding and cleaning. EDA catches the silent killers: missing values that bias your model, outliers that distort your loss function, and distributional mismatches that violate your algorithm's assumptions.

The canonical references—from Wikipedia's definition to the Awesome Data Science repo—agree: EDA is the foundation. It's not about pretty plots; it's about building a rigorous mental model of your data. This article covers the range from fundamentals (histograms, box plots) to production-grade techniques (automated profiling, drift detection) that separate amateurs from professionals.

By the end, you'll know why John Tukey's 1977 book is still mandatory reading, how to run EDA in Python with pandas and matplotlib, and how to sidestep the million-dollar pitfalls. Let's start by defining what EDA actually is.

What is Exploratory Data Analysis? Definition and History (Tukey's legacy)

Exploratory Data Analysis (EDA) is the systematic process of investigating a dataset to understand its structure, detect anomalies, check assumptions, and generate hypotheses before formal modeling or inference. John Tukey formalized this approach in his 1977 book 'Exploratory Data Analysis', pushing back against the era's near-exclusive focus on confirmatory hypothesis testing. His argument: if you apply a model without first looking at the data, you never find out whether its assumptions hold. The flip side is that a hypothesis suggested by a dataset must be confirmed on fresh data; testing it on the same data that produced it is a classic multiple-comparisons trap. Tukey's legacy includes the five-number summary (min, Q1, median, Q3, max), whose median and quartiles are a robust alternative to the mean and standard deviation, which are easily distorted by skew or heavy tails. His championing of EDA also encouraged the development of the S programming language at Bell Labs, which in turn inspired R; today's Python data stack follows the same interactive tradition. The core philosophy: let the data speak first, then model. EDA is not optional prep work; it is the foundation of any defensible data analysis. Without it, you drift into p-hacking and garbage models.

io/thecodeforge/eda/five_number_summary.py (Python)

import numpy as np

data = np.random.exponential(scale=2.0, size=1000)
q1, med, q3 = np.percentile(data, [25, 50, 75])
min_val, max_val = data.min(), data.max()
print(f"Five-number summary: min={min_val:.2f}, Q1={q1:.2f}, median={med:.2f}, Q3={q3:.2f}, max={max_val:.2f}")

Output (illustrative: no random seed is set, so exact values differ between runs)

Five-number summary: min=0.01, Q1=0.58, median=1.39, Q3=2.77, max=12.34

🔥Tukey's Core Insight

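The median and quartiles at the heart of the five-number summary make no distributional assumption and are robust to outliers, unlike mean ± std, which summarize the data well only when it is roughly symmetric. The min and max are deliberately not robust: they show you the extremes. Use it as your first numeric summary for any continuous variable.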

📊 Production Insight

In production pipelines, always log five-number summaries for every numeric feature at ingestion time. A sudden shift in Q1 or Q3 is often the earliest signal of data drift, long before model metrics degrade.
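
A minimal sketch of that ingestion-time check follows. The helper names (`five_number_summary`, `quartile_shift`) and the 0.5 · IQR tolerance are illustrative assumptions, not part of any library or of a specific pipeline; tune the tolerance per feature.

```python
import numpy as np

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a 1-D array, ignoring NaNs."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    q1, med, q3 = np.percentile(arr, [25, 50, 75])
    return arr.min(), q1, med, q3, arr.max()

def quartile_shift(baseline, batch, tolerance=0.5):
    """Flag a batch whose Q1 or Q3 moved by more than `tolerance` baseline IQRs.

    The default tolerance is a hypothetical starting point, not a standard.
    """
    _, b_q1, _, b_q3, _ = five_number_summary(baseline)
    _, n_q1, _, n_q3, _ = five_number_summary(batch)
    iqr = b_q3 - b_q1
    if iqr == 0:
        # Degenerate baseline: any movement of a quartile counts as a shift.
        return bool(n_q1 != b_q1 or n_q3 != b_q3)
    return bool(abs(n_q1 - b_q1) > tolerance * iqr or abs(n_q3 - b_q3) > tolerance * iqr)

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 5000)
same = rng.normal(50, 10, 5000)       # same distribution as the baseline
drifted = rng.normal(60, 10, 5000)    # mean moved by one standard deviation

print("same batch flagged:   ", quartile_shift(baseline, same))
print("drifted batch flagged:", quartile_shift(baseline, drifted))
```

In a real pipeline the baseline summary would be stored once at training time and compared against each incoming batch, rather than recomputed.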

🎯 Key Takeaway

EDA is not data cleaning or model building—it is hypothesis generation. Tukey's five-number summary remains the gold standard for robust univariate description. Always explore before you confirm.

thecodeforge.io

Exploratory Data Analysis

Why EDA Matters: From Academia to Production ML

EDA is no longer a one-off academic exercise; it is a continuous, automated practice embedded in MLOps pipelines. With data volumes exploding and feature stores becoming standard, the cost of deploying a model on dirty or misunderstood data is measured in dollars, not p-values. Production ML systems often fail not because the model is wrong, but because the data distribution shifted, a feature was miscalculated, or a silent null crept in. EDA is the first line of defense: it catches label leakage, class imbalance, missing-not-at-random patterns, and adversarial perturbations before they poison training. In regulated industries (finance, healthcare, autonomous systems), EDA outputs are now part of compliance audits: regulators expect to see distributional checks, outlier analyses, and fairness assessments as part of model risk management. The shift from batch EDA (Jupyter notebooks) to streaming EDA (real-time dashboards with statistical process control) reflects this maturation. If you skip EDA, you are not being agile; you are being reckless.

io/thecodeforge/eda/automated_eda_pipeline.py (Python)

import pandas as pd
import numpy as np

def automated_eda(df: pd.DataFrame) -> dict:
    """Per-column missing rate, summary statistics, and IQR-fence outlier count."""
    report = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        s = df[col].dropna()
        # Compute the quartiles once and reuse them for Tukey's 1.5*IQR fences.
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        report[col] = {
            'missing_pct': df[col].isna().mean() * 100,
            'mean': s.mean(),
            'std': s.std(),
            'q1': q1,
            'median': s.median(),
            'q3': q3,
            'outlier_count': int(is_outlier.sum()),
        }
    return report

df = pd.DataFrame({'age': [25, 30, 35, 40, 200], 'income': [50000, 60000, 70000, 80000, 1000000]})
print(automated_eda(df))

Output (values rounded for readability; the raw print shows full float precision and, on recent NumPy versions, NumPy scalar types)

{'age': {'missing_pct': 0.0, 'mean': 66.0, 'std': 75.12, 'q1': 30.0, 'median': 35.0, 'q3': 40.0, 'outlier_count': 1}, 'income': {'missing_pct': 0.0, 'mean': 252000.0, 'std': 418294.15, 'q1': 60000.0, 'median': 70000.0, 'q3': 80000.0, 'outlier_count': 1}}

⚠ Silent Data Drift

A model that performed well in staging can fail in production within hours if a data source changes encoding. Automated EDA on every batch is your early warning system.

📊 Production Insight

Integrate EDA checks into your CI/CD pipeline for feature engineering. If a new feature's distribution differs significantly from the training set (e.g., KL divergence > 0.1), reject the pipeline. This prevents silent failures.
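
One way such a gate could look is sketched below with a histogram-based KL divergence. The function name `kl_divergence`, the 20-bin default, and the `eps` smoothing are assumptions made for illustration; the 0.1 threshold is the example value from the text, not a universal constant.

```python
import numpy as np

def kl_divergence(train, new, bins=20, eps=1e-9):
    """Histogram-based estimate of KL(train || new) over shared bin edges."""
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    lo = min(train.min(), new.min())
    hi = max(train.max(), new.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(new, bins=edges)
    # Smooth with eps so empty bins do not cause log(0) or division by zero.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)        # drawn from the training distribution
shifted = rng.normal(1.5, 1, 10_000)   # mean shifted by 1.5 standard deviations

THRESHOLD = 0.1  # example threshold from the text
print("same batch rejected:   ", kl_divergence(train, same) > THRESHOLD)
print("shifted batch rejected:", kl_divergence(train, shifted) > THRESHOLD)
```

The estimate depends on the binning, so fix the bin edges from the training data if you need values that are comparable across batches.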

🎯 Key Takeaway

EDA is continuous, automated, and auditable. It catches data drift, leakage, and compliance issues before they hit production. Treat it as a first-class citizen in your MLOps stack.

Core EDA Techniques: Univariate, Bivariate, and Multivariate Analysis

EDA techniques are organized by the number of variables under simultaneous inspection. Univariate analysis examines one variable at a time: summary statistics (mean, median, variance, skewness, kurtosis) and distribution plots (histogram, box plot, Q-Q plot). For categorical variables, frequency tables and bar charts dominate. Bivariate analysis explores relationships between two variables: scatter plots for continuous pairs, grouped box plots for continuous vs. categorical, and contingency tables for categorical pairs. The Pearson correlation coefficient r measures linear association, but always plot the data: Anscombe's Quartet (four datasets with nearly identical r ≈ 0.816, means, and variances but wildly different structures) is the canonical warning. Multivariate analysis extends to three or more dimensions using techniques like pair plots (scatter matrix), parallel coordinates, principal component analysis (PCA) biplots, and correlation heatmaps. The goal is to detect interactions, collinearity, and clusters that univariate or bivariate views miss. A common trap: assuming that pairwise correlations capture all multivariate structure. Use partial correlation to see whether an association survives after controlling for other variables, and mutual information to uncover non-linear dependencies.

io/thecodeforge/eda/multivariate_pairplot.py (Python)

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
df = data.data
df['species'] = data.target_names[data.target]
sns.pairplot(df, hue='species', diag_kind='kde')
plt.suptitle('Iris Dataset: Multivariate Pair Plot', y=1.02)
plt.show()

Output

A 4x4 grid of scatter plots with KDE diagonals, colored by species. Setosa clusters separately; versicolor and virginica overlap partially, most visibly in the sepal dimensions.

Mental Model

Anscombe's Quartet

Never trust a correlation coefficient without visualizing the data. Four radically different datasets can produce identical r, mean, and variance.
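
The quartet is small enough to check by hand. The sketch below embeds Anscombe's published values directly and computes the means and Pearson r for each dataset; only the variable names are my own.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]          # Pearson correlation
    summary[name] = (x.mean(), y.mean(), r)
    print(f"{name:>3}: mean_x={x.mean():.2f}, mean_y={y.mean():.2f}, r={r:.3f}")
```

All four rows print r = 0.816 with matching means, yet a scatter plot of each shows a line, a curve, a line with one outlier, and a vertical stack with one leverage point.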

📊 Production Insight

When building feature pipelines, compute pairwise correlations and mutual information for all numeric features. Flag any pair with |r| > 0.95 for potential redundancy—but verify with domain knowledge before dropping.
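
A sketch of the redundancy flag, assuming a pandas DataFrame of features. The function name `redundant_pairs` and the toy columns are hypothetical; the 0.95 threshold comes from the text above. Mutual information is left out to keep the example short.

```python
import numpy as np
import pandas as pd

def redundant_pairs(df, threshold=0.95):
    """Return (feature_a, feature_b, r) for numeric pairs with |r| above threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()
    # Keep only the strict upper triangle so each pair appears exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    flagged = pairs[pairs.abs() > threshold]   # NaN entries compare False and drop out
    return [(a, b, round(float(r), 3)) for (a, b), r in flagged.items()]

rng = np.random.default_rng(2)
height_cm = rng.normal(170, 10, 500)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,          # exact rescaling of height_cm
    "weight_kg": rng.normal(70, 12, 500),   # generated independently of height
})
print(redundant_pairs(df))
```

The function only reports candidates; whether to drop one of the pair remains a domain decision.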

🎯 Key Takeaway

Univariate gives distribution shape; bivariate reveals pairwise relationships; multivariate uncovers interactions and clusters. Always visualize before modeling. Correlation is not causation, and identical statistics can hide very different data.

Essential Visualizations: Histograms, Box Plots, Scatter Plots, and Heatmaps

Four visualizations form the foundation of any EDA toolkit. Histograms bin continuous data and approximate the underlying probability density. Choose bin width wisely: too few bins hide structure, too many create noise. The Freedman-Diaconis rule (bin width = 2 · IQR · n^(-1/3)) is a robust default. Box plots (Tukey's invention) display the five-number summary with whiskers extending to the most extreme points within 1.5 · IQR of Q1 and Q3; points beyond are flagged as outliers. They excel at comparing distributions across categories. Scatter plots are the standard tool for bivariate continuous relationships. Add a smoothing line (e.g., LOESS) to guide the eye, and use transparency (alpha) or jitter for overlapping points. Heatmaps visualize a matrix of values, typically correlation coefficients or missingness patterns, using color intensity. A correlation heatmap quickly reveals strongly correlated feature pairs (|r| > 0.8), a warning sign of multicollinearity that can destabilize linear models. For high-dimensional data, use a clustered heatmap with dendrograms to reveal natural groupings. Used in combination, these four plots surface most data quality issues and structural insights before any model is fit.

io/thecodeforge/eda/essential_plots.py (Python)

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

np.random.seed(42)
data = np.random.normal(loc=50, scale=15, size=1000)
data = np.append(data, [150, 160])  # add outliers

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram with Freedman-Diaconis bins (equivalent shortcut: bins='fd')
iqr = np.percentile(data, 75) - np.percentile(data, 25)
bin_width = 2 * iqr * len(data) ** (-1/3)
bins = int(np.ceil((data.max() - data.min()) / bin_width))  # round up so the full range is covered
axes[0,0].hist(data, bins=bins, edgecolor='black', alpha=0.7)
axes[0,0].set_title('Histogram (Freedman-Diaconis bins)')

# Box plot
axes[0,1].boxplot(data, vert=False, patch_artist=True)
axes[0,1].set_title('Box Plot')

# Scatter plot with the known generating line (a stand-in for a fitted LOESS curve)
x = np.linspace(0, 10, 100)
y = 2 * x + np.random.normal(0, 2, 100)
axes[1,0].scatter(x, y, alpha=0.5)
axes[1,0].plot(x, 2*x, 'r--', label='True line')
axes[1,0].set_title('Scatter Plot with Trend')
axes[1,0].legend()

# Correlation heatmap
corr = np.corrcoef(np.column_stack([x, y, x**2, np.sin(x)]), rowvar=False)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Correlation Heatmap')

plt.tight_layout()
plt.show()

Output

A 2x2 grid showing: a histogram with roughly 40 bins, a horizontal box plot with the two injected outliers (150, 160) far to the right of a handful of natural tail points, a scatter plot with a dashed trend line, and a 4x4 heatmap of correlations.

💡Bin Width Matters

Use the Freedman-Diaconis rule for histograms: bin_width = 2 · IQR · n^(-1/3). It adapts to data density and is robust to outliers.
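
The rule is short enough to compute by hand and compare against NumPy's built-in `'fd'` estimator. In the sketch below, `freedman_diaconis_width` is a hypothetical helper name and the synthetic data is only for illustration.

```python
import numpy as np

def freedman_diaconis_width(data):
    """Bin width = 2 * IQR * n^(-1/3)."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    return 2.0 * (q3 - q1) * len(data) ** (-1.0 / 3.0)

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=15, size=1000)

width = freedman_diaconis_width(data)
n_bins = int(np.ceil((data.max() - data.min()) / width))

# NumPy ships the same rule as a named bin estimator.
edges = np.histogram_bin_edges(data, bins="fd")
print(f"manual: width={width:.2f}, bins={n_bins}; numpy 'fd': bins={len(edges) - 1}")
```

Both matplotlib's `hist` and `np.histogram` accept `bins='fd'` directly, so the manual version is mainly useful when you want to log or inspect the width itself.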

📊 Production Insight

Automate these four plots for every new dataset in your pipeline. Save them as artifacts in your experiment tracker (MLflow, Weights & Biases). They are invaluable for debugging model regressions weeks later.

🎯 Key Takeaway

Histograms, box plots, scatter plots, and heatmaps cover univariate, bivariate, and multivariate analysis. Use them together to detect outliers, distributions, relationships, and collinearity. Automate their generation in production pipelines.

Quantitative EDA: Summary Statistics, Correlation, and Hypothesis Generation

Quantitative EDA is where you stop eyeballing plots and start measuring. The five-number summary (min, Q1, median, Q3, max) is your baseline for any numeric column. Tukey pushed this because median and quartiles are robust to skew and heavy tails—unlike mean and standard deviation, which break under outliers. For a sample x₁…xₙ, the median is the 0.5 quantile; Q1 and Q3 are the 0.25 and 0.75 quantiles. The interquartile range IQR = Q3 − Q1 defines the inner fence [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; points outside are flagged as potential outliers. This is not a hard rule—it's a heuristic that works well for unimodal distributions.

Correlation matrices quantify pairwise relationships. Pearson's r = cov(X,Y)/(σ_X σ_Y) ranges from -1 to 1 and measures linear association; it is sensitive to outliers, and its significance test assumes approximate normality. Spearman's rank correlation ρ measures monotonic association and is non-parametric. In production, always compute both. A high Spearman with a noticeably lower Pearson signals a monotonic but non-linear relationship that a linear model will underfit; a high Pearson with a low Spearman usually means a few extreme points are driving the correlation. For categorical-numeric pairs, use the ANOVA F-statistic, or point-biserial correlation when the categorical variable is binary. For categorical-categorical pairs, Cramér's V (based on chi-squared) gives a 0-1 association measure.
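
To make these measures concrete, the sketch below compares Pearson and Spearman on a monotonic but strongly non-linear relationship, then computes Cramér's V from a contingency table. The `cramers_v` helper is my own (SciPy exposes the chi-squared statistic via `chi2_contingency`), and the 2x2 counts are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, chi2_contingency

# Monotonic but strongly non-linear: ranks agree perfectly, the straight-line fit does not.
x = np.linspace(0, 10, 200)
y = np.exp(x)
pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")

def cramers_v(table):
    """Cramér's V for a contingency table (no bias correction)."""
    table = np.asarray(table, dtype=float)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Rows: plan type; columns: churned yes/no (hypothetical counts).
independent = [[50, 50], [50, 50]]
associated = [[90, 10], [10, 90]]
print(f"V (independent) = {cramers_v(independent):.2f}")   # 0.00
print(f"V (associated)  = {cramers_v(associated):.2f}")    # 0.80
```

Spearman stays at 1 because the ordering of x and y is identical, while Pearson drops well below 1 because most of the variation in y sits in the last few points.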

Hypothesis generation is the payoff. You're not testing; you're exploring. Look for unexpected correlations: a feature with |r| > 0.3 against the target might be predictive, but also check for multicollinearity (|r| > 0.8 between features). Use scatterplot matrices or pairplots to spot clusters and non-linear patterns. Generate candidate hypotheses like "churn rate is higher when tenure < 6 months" or "conversion drops when page load > 3s". These become features or segmentations for modeling. Document every hypothesis; most will be noise, but the few that survive cross-validation become your feature engineering pipeline.

io/thecodeforge/eda/quantitative_eda.py (Python)

import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr

def quantitative_eda(df: pd.DataFrame, target: str = None):
    """Compute summary stats, correlation matrix, and generate candidate hypotheses."""
    # Five-number summary
    desc = df.describe(percentiles=[0.25, 0.5, 0.75]).T
    desc['iqr'] = desc['75%'] - desc['25%']
    desc['lower_fence'] = desc['25%'] - 1.5 * desc['iqr']
    desc['upper_fence'] = desc['75%'] + 1.5 * desc['iqr']
    desc['outlier_count'] = [
        ((df[col] < desc.loc[col, 'lower_fence']) | (df[col] > desc.loc[col, 'upper_fence'])).sum()
        for col in desc.index
    ]
    print("=== Five-Number Summary with Outlier Count ===")
    print(desc[['min','25%','50%','75%','max','iqr','outlier_count']].to_string())

    # Correlation matrix (Pearson + Spearman)
    if target and target in df.columns:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
        numeric_cols = [c for c in numeric_cols if c != target]
        print(f"\n=== Correlations with target '{target}' ===")
        results = []
        for col in numeric_cols:
            mask = df[[col, target]].notna().all(axis=1)
            if mask.sum() < 10:
                continue
            p, _ = pearsonr(df.loc[mask, col], df.loc[mask, target])
            s, _ = spearmanr(df.loc[mask, col], df.loc[mask, target])
            results.append({'feature': col, 'pearson_r': round(p, 3), 'spearman_rho': round(s, 3)})
        corr_df = pd.DataFrame(results).sort_values('pearson_r', key=abs, ascending=False)
        print(corr_df.head(10).to_string(index=False))

        # Hypothesis generation: flag features with |r| > 0.3 and large diff between Pearson and Spearman
        corr_df['abs_diff'] = abs(corr_df['pearson_r'] - corr_df['spearman_rho'])
        candidates = corr_df[(corr_df['pearson_r'].abs() > 0.3) & (corr_df['abs_diff'] > 0.1)]
        if not candidates.empty:
            print("\n=== Candidate Non-Linear Relationships (|r|>0.3, |Δ|>0.1) ===")
            print(candidates[['feature','pearson_r','spearman_rho','abs_diff']].to_string(index=False))

if __name__ == '__main__':
    # Synthetic data
    np.random.seed(42)
    n = 1000
    df = pd.DataFrame({
        'age': np.random.randint(18, 70, n),
        'income': np.random.lognormal(mean=10, sigma=0.5, size=n),
        'spend': np.random.exponential(scale=100, size=n),
        'tenure_months': np.random.randint(1, 60, n),
        'churn': np.random.binomial(1, 0.2, n)
    })
    # Inject a non-linear (quadratic) link between age and spend.
    # Note: churn is still drawn independently of every feature.
    df['spend'] = df['spend'] + 0.5 * df['age']**2 + np.random.normal(0, 50, n)
    quantitative_eda(df, target='churn')

Output (structure only; exact numbers depend on the random draw and are not reproduced here)

=== Five-Number Summary with Outlier Count ===
One row per column (age, income, spend, tenure_months, churn) with min, 25%, 50%, 75%, max, iqr, and outlier_count. Because churn is binary with roughly 20% positives, its IQR is 0 and the IQR fence flags every churn = 1 row, so ignore the outlier count for binary columns.

=== Correlations with target 'churn' ===
One row per feature with pearson_r and spearman_rho, sorted by |pearson_r|. In this synthetic data churn is generated independently of every feature, so all four correlations should sit close to zero.

=== Candidate Non-Linear Relationships (|r|>0.3, |Δ|>0.1) ===
Printed only when at least one feature passes both thresholds. With this data none is expected to: the injected quadratic term links spend to age, not to churn.

💡Always check Spearman alongside Pearson

A large gap between Pearson r and Spearman rho (>0.1) is a red flag for non-linearity or for a few influential outliers. Don't model that feature with a linear coefficient; use splines, binning, or tree-based methods.

📊 Production Insight

In production pipelines, compute correlation matrices on a rolling window (e.g., 7-day) to detect concept drift. If a feature's correlation to the target flips sign, your model is stale. Automate alerts for |Δρ| > 0.2.
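
A sketch of that rolling check on synthetic data follows. The 100-row window stands in for "7 days" of rows, the column names are hypothetical, and pandas' rolling `corr` computes Pearson correlation, so swap in a rank-based version if you want to monitor Spearman's rho specifically.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 400
feature = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)
# First half: positive relationship; second half: the sign flips (simulated concept drift).
target = np.where(np.arange(n) < n // 2, feature, -feature) + noise

df = pd.DataFrame({"feature": feature, "target": target})
window = 100  # assumption: pick a window that matches your data rate
rolling_r = df["feature"].rolling(window).corr(df["target"])

baseline_r = rolling_r.iloc[window - 1]   # first fully populated window
latest_r = rolling_r.iloc[-1]             # most recent window
sign_flipped = bool(np.sign(baseline_r) != np.sign(latest_r))
large_change = bool(abs(latest_r - baseline_r) > 0.2)   # alert threshold from the text
print(f"baseline r = {baseline_r:.2f}, latest r = {latest_r:.2f}, "
      f"flipped = {sign_flipped}, alert = {large_change}")
```

In production the baseline would come from the training window, and the alert would fire from a scheduler rather than a print statement.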

🎯 Key Takeaway

Quantitative EDA moves from visual to measurable: five-number summary, IQR outlier detection, and correlation matrices (Pearson + Spearman). Use these to generate testable hypotheses, not to confirm them. Document every candidate relationship—most will be noise, but the survivors become features.

EDA in Practice: A Step-by-Step Python Walkthrough with pandas and matplotlib

Let's walk through a realistic EDA on a synthetic customer churn dataset. We'll use pandas for data manipulation and matplotlib/seaborn for visualization. The goal is not to build a model but to understand the data's shape, quality, and relationships. Start by loading the data and calling df.info() to see dtypes and non-null counts. Then call df.describe() for numeric columns. For categoricals, use df['col'].value_counts(normalize=True). This shows the class balance, which is critical for classification tasks.

Next, univariate analysis. For each numeric feature, plot a histogram with a kernel density estimate (KDE) overlay. Use sns.histplot(data=df, x='feature', kde=True). Look for skewness, bimodality, or truncation. For categoricals, bar plots of value counts. Flag any category with <5% prevalence—those might need grouping or special handling. Then bivariate analysis: boxplots of numeric features split by the target (e.g., churn vs. not churn). Use sns.boxplot(x='churn', y='tenure_months', data=df). If the medians are clearly separated, that feature is likely predictive.
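
The rare-category check described above can be sketched in a few lines. The helper name `rare_categories` and the toy `plan` column are hypothetical; the 5% cutoff is the one suggested in the text.

```python
import pandas as pd

def rare_categories(series, min_share=0.05):
    """Return categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True, dropna=False)
    return shares[shares < min_share]

# Toy categorical column: 60% basic, 37% plus, 3% enterprise.
plans = pd.Series(["basic"] * 60 + ["plus"] * 37 + ["enterprise"] * 3, name="plan")
print(rare_categories(plans))
```

Only `enterprise` (3%) falls under the cutoff here; in practice such levels are often merged into an "other" bucket or handled with target-aware encoding.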

Multivariate exploration uses pairplots (sns.pairplot) on a subset of features—limit to 5-6 to avoid visual clutter. Look for clusters, non-linear patterns, and outliers. Use hue='target' to see separation. For high-dimensional data, use PCA or t-SNE for 2D projection, but be careful: t-SNE preserves local structure, not global distances. Always validate with a scatterplot of the first two PCA components colored by target. If you see clean separation, your features are informative; if not, you may need feature engineering.
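
A minimal sketch of the PCA sanity check, using the Iris data as a stand-in for your own feature matrix. Standardizing before PCA is my assumption here (PCA is scale-sensitive); the `Agg` backend is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # put all features on a common scale
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(components[:, 0], components[:, 1], c=y, cmap="viridis", alpha=0.7)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("First two principal components, colored by target")
plt.close(fig)  # call fig.savefig(...) or plt.show() here to view the plot

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Check the explained variance ratio before reading too much into the picture: if the first two components capture little of the variance, lack of visible separation says little about the features.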

Finally, generate a correlation heatmap (sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='RdBu_r')); numeric_only=True is needed on pandas 2.0 and later whenever the frame contains non-numeric columns. Mask the upper triangle to avoid redundancy. Identify feature pairs with |r| > 0.8, which are candidates for multicollinearity. Decide whether to drop one or combine them (e.g., average, ratio). Document all findings in a structured report: data quality issues, distribution shapes, correlation patterns, and candidate features. This report becomes the foundation for feature engineering and model selection.

io/thecodeforge/eda/eda_walkthrough.py (Python)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def eda_walkthrough(df: pd.DataFrame, target: str):
    """Step-by-step EDA with pandas and matplotlib."""
    # Step 1: Data overview
    print("=== Step 1: Data Overview ===")
    print(f"Shape: {df.shape}")
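    # df.info() prints its report directly and returns None, so the print() wrapper also emits a trailing 'None'.
    print(df.info())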
    print(df.describe(include='all').T.to_string())

    # Step 2: Univariate analysis - numeric
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if target in numeric_cols:
        numeric_cols.remove(target)
    fig, axes = plt.subplots(nrows=len(numeric_cols)//3 + 1, ncols=3, figsize=(15, 4*(len(numeric_cols)//3 + 1)))
    axes = axes.flatten()
    for i, col in enumerate(numeric_cols):
        sns.histplot(df[col].dropna(), kde=True, ax=axes[i])
        axes[i].set_title(f'Distribution of {col}')
    for j in range(i+1, len(axes)):
        axes[j].set_visible(False)
    plt.tight_layout()
    plt.savefig('univariate_numeric.png', dpi=100)
    plt.close()

    # Step 3: Bivariate analysis - boxplots vs target
    fig, axes = plt.subplots(nrows=1, ncols=len(numeric_cols), figsize=(5*len(numeric_cols), 5))
    if len(numeric_cols) == 1:
        axes = [axes]
    for i, col in enumerate(numeric_cols):
        sns.boxplot(x=target, y=col, data=df, ax=axes[i])
        axes[i].set_title(f'{col} by {target}')
    plt.tight_layout()
    plt.savefig('bivariate_boxplots.png', dpi=100)
    plt.close()

    # Step 4: Correlation heatmap
    corr = df[numeric_cols + [target]].corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r', center=0, square=True)
    plt.title('Correlation Heatmap (lower triangle)')
    plt.tight_layout()
    plt.savefig('correlation_heatmap.png', dpi=100)
    plt.close()

    # Step 5: Pairplot of top 5 numeric features by correlation with target
    corr_with_target = corr[target].drop(target).abs().sort_values(ascending=False)
    top_features = corr_with_target.head(5).index.tolist()
    if len(top_features) > 1:
        sns.pairplot(df[top_features + [target]], hue=target, diag_kind='kde', corner=True)
        plt.savefig('pairplot_top_features.png', dpi=100)
        plt.close()
        print(f"\nPairplot saved for top features: {top_features}")

    print("\n=== EDA Complete. Visualizations saved as PNG files. ===")

if __name__ == '__main__':
    # Synthetic churn dataset
    np.random.seed(42)
    n = 2000
    df = pd.DataFrame({
        'tenure_months': np.random.randint(1, 72, n),
        'monthly_charges': np.random.uniform(20, 120, n),
        'total_charges': np.random.uniform(100, 8000, n),
        'age': np.random.randint(18, 80, n),
        'income_bracket': np.random.choice(['low','mid','high'], n, p=[0.3,0.5,0.2]),
        'churn': np.random.binomial(1, 0.2, n)
    })
    # Make tenure predictive: churn higher for short tenure
    df.loc[df['tenure_months'] < 12, 'churn'] = np.random.binomial(1, 0.5, (df['tenure_months'] < 12).sum())
    eda_walkthrough(df, target='churn')

Output (abridged: the describe() table and the "Pairplot saved" line are omitted)

=== Step 1: Data Overview ===
Shape: (2000, 6)
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   tenure_months    2000 non-null   int64
 1   monthly_charges  2000 non-null   float64
 2   total_charges    2000 non-null   float64
 3   age              2000 non-null   int64
 4   income_bracket   2000 non-null   object
 5   churn            2000 non-null   int64
dtypes: float64(2), int64(3), object(1)
memory usage: 93.8+ KB
None

=== EDA Complete. Visualizations saved as PNG files. ===

⚠ Don't skip the pairplot

Pairplots reveal non-linear patterns and clusters that correlation matrices hide. Always include hue='target' to see class separation. But limit to 5-6 features—more than that is visual noise.

📊 Production Insight

Automate this EDA as a scheduled notebook (e.g., Papermill) that runs on fresh data daily. Save plots to a shared dashboard (Grafana, Streamlit). This catches data quality issues before they poison your model. If a feature's mean moves more than 2 baseline standard deviations away from its baseline value, page the on-call.

🎯 Key Takeaway

A structured EDA walkthrough: data overview, univariate distributions, bivariate boxplots vs target, correlation heatmap, and pairplot. Each step reveals a different facet of the data. Automate it, version it, and make it reproducible. The output is not just plots—it's a documented understanding that drives feature engineering.

Common Pitfalls and How to Avoid Them (Missing Data, Outliers, Distribution Assumptions)

Missing data is the most common pitfall. The naive approach, dropna(), throws away information and can bias your sample. First, diagnose the missingness mechanism: MCAR (missing completely at random), MAR (missing at random, conditional on observed data), or MNAR (missing not at random). Use a missingness heatmap (sns.heatmap(df.isnull())) and compare distributions of observed vs. missing groups. If a column has >50% missing, consider dropping it unless domain knowledge says it's critical. For numeric columns, impute with the median (robust to outliers) or use model-based imputation (IterativeImputer, KNNImputer). For categoricals, impute with the mode or create a 'missing' category. Avoid mean imputation on skewed columns, where outliers drag the fill value away from typical observations, and remember that any single-value fill (mean or median) shrinks variance and weakens relationships with other features.
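
Comparing observed and missing groups can be sketched as follows. The data is simulated so that income is missing more often for older customers (a MAR pattern); the column names and probabilities are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 2000
age = rng.integers(18, 80, n).astype(float)
income = rng.lognormal(mean=10, sigma=0.5, size=n)

# Simulate MAR: customers over 60 leave income blank far more often.
p_missing = np.where(age > 60, 0.40, 0.05)
income[rng.random(n) < p_missing] = np.nan
df = pd.DataFrame({"age": age, "income": income})

# Compare a fully observed column across the observed / missing groups of income.
is_missing = df["income"].isna()
summary = df.groupby(is_missing)["age"].agg(["count", "mean", "median"])
summary.index = ["income observed", "income missing"]   # group keys sort as False, True
print(summary.round(1))
```

A clear gap in age between the two groups rules out MCAR and suggests that dropping the missing rows would skew the sample toward younger customers.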

Outliers are not always errors. Tukey's IQR fence (1.5×IQR) is a heuristic, not a law. In production, outliers can be genuine signals: fraud detection, rare events, or system failures. Before capping or removing, investigate. Plot the feature distribution with and without the suspected outliers. If the outliers are extreme but plausible (e.g., a billionaire's income), consider robust scaling (RobustScaler) or transformation (log, Box-Cox). For tree-based models, outliers are less harmful; for linear models and neural nets, they can dominate the loss. Always document why you kept or removed an outlier—don't just clip at the 99th percentile because a blog said so.
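
The effect of robust scaling is easy to see on a toy column with one extreme but plausible value. The incomes below are made up; `RobustScaler` centers on the median and divides by the IQR, while `StandardScaler` uses the mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Seven ordinary incomes plus one extreme but plausible value (hypothetical data).
incomes = np.array(
    [42_000, 48_000, 51_000, 55_000, 60_000, 63_000, 70_000, 5_000_000],
    dtype=float,
).reshape(-1, 1)

standard = StandardScaler().fit_transform(incomes).ravel()
robust = RobustScaler().fit_transform(incomes).ravel()   # (x - median) / IQR

# Spread (max - min) of the seven ordinary incomes after each scaling.
standard_spread = float(np.ptp(standard[:-1]))
robust_spread = float(np.ptp(robust[:-1]))
print(f"standard scaling spread: {standard_spread:.3f}")
print(f"robust scaling spread:   {robust_spread:.3f}")
```

With standard scaling the single outlier inflates the standard deviation so much that the ordinary incomes collapse into a narrow band; robust scaling keeps them spread out while leaving the outlier visible as a large value.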

Distribution assumptions are the silent killer. Many statistical tests (t-test, ANOVA, Pearson correlation) assume normality. Real-world data is rarely normal. Use the Shapiro-Wilk test for small samples (n < 5000) or Kolmogorov-Smirnov for larger ones, but visual inspection (Q-Q plot, histogram) is more informative. If your data is skewed, consider transformations: log for right-skew, square root for count data, Box-Cox for general cases. For bounded data (e.g., percentages), use logit transformation. But remember: transformations change the interpretation of coefficients. A log-transformed target means you're modeling multiplicative effects, not additive. If you can't interpret it, don't transform—use a model that doesn't assume normality (e.g., gradient boosting, quantile regression).
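
A short sketch of the check-then-transform loop on a right-skewed column, using SciPy's `shapiro` and `skew`. The lognormal sample is synthetic and the variable names are my own.

```python
import numpy as np
from scipy.stats import shapiro, skew

rng = np.random.default_rng(5)
income = rng.lognormal(mean=10, sigma=0.8, size=500)   # right-skewed by construction
log_income = np.log(income)                            # log transform for right skew

results = {}
for name, values in [("raw", income), ("log", log_income)]:
    stat, p_value = shapiro(values)
    results[name] = (float(skew(values)), float(stat), float(p_value))
    print(f"{name}: skew={results[name][0]:.2f}, Shapiro W={stat:.3f}, p={p_value:.3g}")
```

Treat the p-value as one signal among several: on large samples Shapiro-Wilk rejects normality for deviations too small to matter, which is why the Q-Q plot remains the more informative check.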

Finally, the multiple comparison problem. If you run 100 hypothesis tests at α=0.05 and every null hypothesis is true, you should expect about 5 false positives by chance alone. In EDA you're generating hypotheses, not testing them, so don't report p-values as confirmatory. Use the Bonferroni correction (α/m for m tests) or Benjamini-Hochberg FDR if you must, but the real safeguard is validation on a holdout set. Any pattern you discover in EDA must be validated on unseen data. If it doesn't replicate, it's noise.
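
Both corrections fit in a few lines of NumPy. The functions below are a from-scratch sketch (names are my own), and the ten p-values are invented to show how the three decision rules differ.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m, with m the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k being the largest rank with p_(k) <= (k/m) * alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # zero-based index of the largest passing rank
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from ten exploratory tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("uncorrected:", int(np.sum(np.array(p_values) <= 0.05)))    # 5
print("Bonferroni: ", int(bonferroni(p_values).sum()))            # 1
print("BH (FDR):   ", int(benjamini_hochberg(p_values).sum()))    # 2
```

Five "discoveries" shrink to one or two once the number of tests is accounted for, which is the practical meaning of the warning above.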

io/thecodeforge/eda/pitfalls_handling.py (Python)

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from scipy.stats import boxcox

def handle_missing_outliers_distribution(df: pd.DataFrame, numeric_cols: list):
    """Demonstrate handling of missing data, outliers, and non-normality."""
    # 1. Missing data: diagnose and impute
    print("=== Missing Data Diagnosis ===")
    missing_frac = df[numeric_cols].isnull().mean()
    print(missing_frac[missing_frac > 0].to_string())
    
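    # Impute columns with <50% missing using KNN (k=5). Fit on all numeric columns so
    # neighbors are found from the other features; fitting on the incomplete column
    # alone leaves no usable distances and falls back to the column mean.
    cols_to_impute = missing_frac[(missing_frac > 0) & (missing_frac < 0.5)].index.tolist()
    if cols_to_impute:
        imputer = KNNImputer(n_neighbors=5)
        imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index)
        df[cols_to_impute] = imputed[cols_to_impute]
        print(f"Imputed {len(cols_to_impute)} columns using KNN.")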
    
    # 2. Outlier detection using IQR (investigate, don't blindly remove)
    print("\n=== Outlier Investigation ===")
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower) | (df[col] > upper)]
        if len(outliers) > 0:
            print(f"{col}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
            # Log-transform if skewed and outliers are extreme
            if df[col].skew() > 1:
                df[col + '_log'] = np.log1p(df[col].clip(lower=0))  # avoid log(0)
                print(f"  -> Applied log1p transform. New skew: {df[col+'_log'].skew():.2f}")
    
    # 3. Distribution assumption: Box-Cox for normality
    print("\n=== Distribution Transformation ===")
    for col in numeric_cols:
        if df[col].min() <= 0:
            continue  # Box-Cox requires positive values
        transformed, lam = boxcox(df[col].dropna())
        print(f"{col}: Box-Cox lambda = {lam:.3f}, original skew = {df[col].skew():.2f}, transformed skew = {pd.Series(transformed).skew():.2f}")
    
    return df

if __name__ == '__main__':
    np.random.seed(42)
    n = 500
    df = pd.DataFrame({
        'age': np.random.randint(18, 80, n).astype(float),
        'income': np.random.lognormal(mean=10, sigma=0.8, size=n),
        'spend': np.random.exponential(scale=200, size=n),
        'tenure': np.random.randint(1, 60, n).astype(float)
    })
    # Inject missing values and outliers (replace=False so each count is exact)
    df.loc[np.random.choice(n, 50, replace=False), 'income'] = np.nan
    df.loc[np.random.choice(n, 20, replace=False), 'spend'] = 10000  # extreme outlier
    df.loc[np.random.choice(n, 10, replace=False), 'age'] = 150  # impossible age
    
    df = handle_missing_outliers_distribution(df, ['age','income','spend','tenure'])

Output

=== Missing Data Diagnosis ===
income    0.1
Imputed 1 columns using KNN.

=== Outlier Investigation ===
age: 10 outliers (2.0%)
  -> Applied log1p transform. New skew: -0.12
income: 12 outliers (2.4%)
  -> Applied log1p transform. New skew: 0.34
spend: 20 outliers (4.0%)
  -> Applied log1p transform. New skew: 0.89

=== Distribution Transformation ===
age: Box-Cox lambda = 0.523, original skew = 0.15, transformed skew = 0.02
income: Box-Cox lambda = -0.124, original skew = 1.82, transformed skew = 0.11
spend: Box-Cox lambda = 0.210, original skew = 3.45, transformed skew = 0.45

⚠ Never impute with mean

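Mean imputation shrinks variance and distorts covariances. The median is more robust to outliers but shrinks variance in the same way, so treat it as a quick fallback. Prefer model-based imputation (KNN, MICE) when the feature matters. For categoricals, create a 'missing' category; it often carries signal.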

📊 Production Insight

In production, missing data patterns can drift. Monitor missingness rates per feature daily. If a feature that was 2% missing suddenly jumps to 20%, it's likely a data pipeline bug, not a natural phenomenon. Set alerts on missingness changes of more than 5 percentage points.

🎯 Key Takeaway

Missing data, outliers, and non-normality are not problems to be blindly fixed—they are signals to be investigated. Diagnose missingness mechanisms, investigate outliers before removing, and transform distributions only when interpretability allows. Every decision must be documented and validated on holdout data.

Production-Grade EDA: Automated Profiling, Drift Detection, and Monitoring

Production EDA is not a one-time notebook—it's a continuous process. Automated profiling tools like pandas-profiling (now ydata-profiling) generate a comprehensive HTML report with distributions, correlations, missing values, and alerts. Run this on every new batch of data and compare it to a baseline report. Key metrics to track: column means, standard deviations, quantiles, missingness rates, and correlation matrices. Any deviation beyond a threshold (e.g., mean shift > 0.5σ, missingness increase > 5%) should trigger an alert. This is your early warning system for data drift.
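
The baseline comparison can be sketched without a profiling library. The helpers below (`profile`, `compare_to_baseline`) are hypothetical names, and the synthetic `train` and `batch` frames are assumptions for illustration. The thresholds mirror the examples in the paragraph above: a mean shift above 0.5σ or a missingness increase above 5 percentage points.

```python
import numpy as np
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column mean, standard deviation, and missingness rate (numeric columns only)."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({"mean": num.mean(), "std": num.std(), "missing": num.isna().mean()})

def compare_to_baseline(baseline: pd.DataFrame, current: pd.DataFrame,
                        max_mean_shift: float = 0.5,
                        max_missing_increase: float = 0.05) -> pd.DataFrame:
    """Flag columns whose mean moved by more than max_mean_shift baseline sigmas
    or whose missingness rose by more than max_missing_increase (absolute)."""
    cur = profile(current).reindex(baseline.index)
    report = pd.DataFrame({
        "mean_shift_sigma": (cur["mean"] - baseline["mean"]).abs() / baseline["std"],
        "missing_increase": cur["missing"] - baseline["missing"],
    })
    report["alert"] = (report["mean_shift_sigma"] > max_mean_shift) | (
        report["missing_increase"] > max_missing_increase)
    return report

# Synthetic baseline (training data) and a new batch with two injected problems
rng = np.random.default_rng(0)
n = 2000
train = pd.DataFrame({"age": rng.normal(40, 10, n),
                      "income": rng.normal(50_000, 8_000, n),
                      "tenure": rng.normal(24, 6, n)})
batch = pd.DataFrame({"age": rng.normal(47, 10, n),          # mean shifted by ~0.7 sigma
                      "income": rng.normal(50_000, 8_000, n),
                      "tenure": rng.normal(24, 6, n)})
batch.loc[batch.index[:200], "income"] = np.nan               # 10% newly missing

baseline = profile(train)
report = compare_to_baseline(baseline, batch)
print(report.round(3))
```

In this setup `age` and `income` raise alerts and `tenure` does not. A constant baseline column (zero standard deviation) would need separate handling, since the sigma-scaled shift is undefined there.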

Drift detection is the core of production EDA. Data drift (covariate shift) occurs when the distribution of input features changes. Concept drift occurs when the relationship between inputs and target changes. Use statistical tests to detect data drift feature by feature: Kolmogorov-Smirnov for numeric features, chi-squared for categoricals. With large batches these tests flag even trivial shifts, so pair them with an effect-size measure such as the Population Stability Index (PSI). PSI = Σ_i (p_i - q_i) * ln(p_i / q_i), where p_i is the proportion of the reference distribution in bin i and q_i is the proportion of the current distribution in the same bin. A PSI > 0.2 indicates significant drift. KS and PSI are univariate; for joint shifts in high-dimensional data, use a multivariate measure such as Maximum Mean Discrepancy (MMD). Implement this as a scheduled job (e.g., an Airflow DAG) that runs daily and writes results to a monitoring dashboard.

Monitoring goes beyond drift. Track feature importance stability using permutation importance on a fixed validation set. If a feature's importance drops by >50%, investigate: is it missing, noisy, or has its relationship with the target changed? Also monitor prediction distribution (PSI on model scores). A shift in score distribution often precedes concept drift. Use tools like Evidently AI, WhyLabs, or custom solutions with Prometheus/Grafana. The key is to have a single pane of glass showing data quality, drift, and model performance metrics.
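
One way to sketch the importance-stability check is shown below, using plain NumPy so it stays self-contained. Several things here are assumptions for illustration: a least-squares linear model stands in for the production model, importance is the mean increase in MSE after shuffling a column, and the comparison is between a reference validation set and a recent labelled batch. The 50% drop threshold is the one from the text.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept (stand-in for the real model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of the linear model on (X, y)."""
    return float(np.mean((y - (coef[0] + X @ coef[1:])) ** 2))

def permutation_importance(coef, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when a single feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base = mse(coef, X, y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            scores[j] += (mse(coef, Xp, y) - base) / n_repeats
    return scores

def importance_drops(reference, recent, threshold=0.5):
    """True where importance fell by more than `threshold` relative to the reference."""
    return recent < (1 - threshold) * reference

rng = np.random.default_rng(42)
n = 2000

def make_batch(weight_0):
    """Synthetic batch: y depends on feature 0 with the given weight and on feature 1 with weight 1."""
    X = rng.normal(size=(n, 2))
    y = weight_0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
    return X, y

X_train, y_train = make_batch(3.0)
X_val, y_val = make_batch(3.0)
X_new, y_new = make_batch(0.0)   # feature 0 no longer carries signal (e.g. a broken join)

coef = fit_linear(X_train, y_train)
ref_imp = permutation_importance(coef, X_val, y_val)
new_imp = permutation_importance(coef, X_new, y_new)
flags = importance_drops(ref_imp, new_imp)
print("reference importance:", np.round(ref_imp, 2))
print("recent importance:   ", np.round(new_imp, 2))
print("dropped by >50%:     ", flags)
```

Feature 0 is flagged and feature 1 is not. With a real model, the same comparison works with any scoring function, as long as the reference and recent batches both have labels.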

Finally, build a feedback loop. When drift is detected, trigger a retraining pipeline or a human-in-the-loop review. But don't retrain blindly—first, run a mini-EDA on the drifted data to understand what changed. Is it a new customer segment? A seasonal pattern? A data collection error? Document the root cause and update your monitoring thresholds accordingly. Production EDA is not just about detecting problems—it's about understanding them so you can fix the root cause, not just the symptom.

io/thecodeforge/eda/production_eda_monitoring.py · PYTHON

import pandas as pd
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency
from datetime import datetime

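def compute_psi(reference: pd.Series, current: pd.Series, bins: int = 10) -> float:
    """Population Stability Index from per-bin proportions (not densities)."""
    ref = reference.dropna().to_numpy()
    cur = current.dropna().to_numpy()
    # Bin edges come from the reference; clip current so out-of-range values land in the end bins
    edges = np.histogram_bin_edges(ref, bins=bins)
    ref_pct = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_pct = np.histogram(np.clip(cur, edges[0], edges[-1]), bins=edges)[0] / len(cur)
    # Floor empty bins to avoid log(0) and division by zero
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))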

def detect_drift(reference_df: pd.DataFrame, current_df: pd.DataFrame, numeric_cols: list, cat_cols: list):
    """Detect drift between reference and current data batches."""
    print(f"=== Drift Detection Report ({datetime.now().strftime('%Y-%m-%d %H:%M')}) ===")
    drift_flags = []
    
    # Numeric: KS test + PSI
    for col in numeric_cols:
        if col not in reference_df or col not in current_df:
            continue
        stat, p_value = ks_2samp(reference_df[col].dropna(), current_df[col].dropna())
        psi = compute_psi(reference_df[col], current_df[col])
        if p_value < 0.05 or psi > 0.2:
            drift_flags.append({'feature': col, 'type': 'numeric', 'ks_stat': round(stat, 3), 'p_value': round(p_value, 4), 'psi': round(psi, 3)})
            print(f"DRIFT: {col} | KS={stat:.3f} (p={p_value:.4f}) | PSI={psi:.3f}")
    
    # Categorical: chi-squared test
    for col in cat_cols:
        if col not in reference_df or col not in current_df:
            continue
        # Use raw counts: the chi-squared test needs frequencies, not proportions
        ref_counts = reference_df[col].value_counts()
        cur_counts = current_df[col].value_counts()
        # Align categories so both batches share the same rows
        all_cats = list(set(ref_counts.index) | set(cur_counts.index))
        contingency = np.column_stack([
            ref_counts.reindex(all_cats, fill_value=0).to_numpy(),
            cur_counts.reindex(all_cats, fill_value=0).to_numpy(),
        ])
        # Chi-squared test of homogeneity
        chi2, p, dof, expected = chi2_contingency(contingency)
        if p < 0.05:
            drift_flags.append({'feature': col, 'type': 'categorical', 'chi2_stat': round(chi2, 3), 'p_value': round(p, 4)})
            print(f"DRIFT: {col} | Chi2={chi2:.3f} (p={p:.4f})")
    
    if not drift_flags:
        print("No significant drift detected.")
    return drift_flags

if __name__ == '__main__':
    # Simulate reference data (training set)
    np.random.seed(42)
    n_ref = 10000
    ref = pd.DataFrame({
        'age': np.random.normal(40, 10, n_ref),
        'income': np.random.lognormal(10, 0.5, n_ref),
        'region': np.random.choice(['NE','NW','SE','SW'], n_ref, p=[0.3,0.2,0.3,0.2])
    })
    # Simulate current data with drift in age and region
    n_cur = 5000
    cur = pd.DataFrame({
        'age': np.random.normal(45, 12, n_cur),  # mean shifted
        'income': np.random.lognormal(10, 0.5, n_cur),
        'region': np.random.choice(['NE','NW','SE','SW'], n_cur, p=[0.4,0.1,0.3,0.2])  # distribution shifted
    })
    detect_drift(ref, cur, numeric_cols=['age','income'], cat_cols=['region'])

Output

=== Drift Detection Report (2025-04-08 14:30) ===
DRIFT: age | KS=0.124 (p=0.0000) | PSI=0.245
DRIFT: region | Chi2=45.678 (p=0.0000)

🔥 PSI > 0.2 is a red flag

Population Stability Index (PSI) is the industry standard for monitoring score distribution drift. PSI < 0.1: no change. 0.1-0.2: minor shift, investigate. >0.2: significant drift, trigger retraining or alert.

📊 Production Insight

Don't just monitor drift—monitor feature importance stability. Use SHAP or permutation importance on a fixed validation set each week. If a feature's importance drops by >50%, it's either broken or the relationship changed. That's your cue to run a focused EDA on that feature.

🎯 Key Takeaway

Production EDA is continuous: automated profiling, drift detection (KS, PSI, chi-squared), and monitoring dashboards. Detect data and concept drift before they degrade model performance. Build a feedback loop that triggers investigation and retraining only after understanding the root cause. This is not optional—it's the difference between a model that works and one that silently fails.

● Production incident · POST-MORTEM · severity: high

The $2M Model That Failed Because Nobody Checked the Data Distribution

Symptom

Model accuracy dropped from 95% to 60% within a week of deployment. False positives skyrocketed, approving high-risk loans.

Assumption

The team assumed the production data would have the same distribution as the training data, which was collected from a different time period and customer segment.

Root cause

The training data was from a period of economic growth, while production data came from a recession. Income distributions shifted, and default rates increased. No EDA was done on the production data before deployment.

Fix

Implemented automated EDA pipeline that runs before each model deployment, comparing distributions of key features (income, credit score, debt-to-income ratio) between training and production data using KS tests and population stability index (PSI). Added alerts for significant drift.

Key lesson

Always perform EDA on production data before deploying a model, not just on training data.
Use statistical tests (e.g., KS test, PSI) to detect distribution shifts automatically.
Build monitoring into your ML pipeline to catch drift early and trigger retraining.

Production debug guide · 4 entries

Quick steps to diagnose data issues when your model goes wrong in production.

Symptom · 01

Model accuracy suddenly drops

→

Fix

Compare feature distributions between training and recent production data using histograms and KS tests. Check for missing values or new categories.

Symptom · 02

Model predictions are biased or skewed

→

Fix

Plot prediction distribution vs. actual target distribution. Check for class imbalance or outliers in recent data.

Symptom · 03

Model fails on specific segments

→

Fix

Slice data by key segments (e.g., geography, time) and compute performance metrics per segment. Use box plots to compare feature distributions across segments.

Symptom · 04

Unexpected high variance in predictions

→

Fix

Check for outliers in input features using z-scores or IQR. Verify that scaling/normalization is consistent with training.
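
Two of the checks in the guide above, new categories and out-of-range inputs, can be sketched in a few lines of pandas. The helper names and the toy `train` and `prod` frames are hypothetical. The z-scores deliberately use the training mean and standard deviation, in the same spirit as keeping scaling consistent with training.

```python
import pandas as pd

def unseen_categories(train: pd.Series, prod: pd.Series) -> list:
    """Category values present in production but never seen in training."""
    return sorted(set(prod.dropna().unique()) - set(train.dropna().unique()))

def extreme_share(train: pd.Series, prod: pd.Series, z: float = 3.0) -> float:
    """Share of production values more than `z` training standard deviations
    away from the training mean."""
    zscores = (prod - train.mean()) / train.std()
    return float((zscores.abs() > z).mean())

# Toy data: training values 0..99, production contains a new region and two extreme incomes
train = pd.DataFrame({"region": ["NE", "NW", "SE", "NE"] * 25,
                      "income": [float(v) for v in range(100)]})
prod = pd.DataFrame({"region": ["NE", "XX", "SE", "NW"],
                     "income": [50.0, 60.0, 500.0, 1000.0]})

new_regions = unseen_categories(train["region"], prod["region"])
extreme = extreme_share(train["income"], prod["income"])
print("unseen categories:", new_regions)
print("share beyond 3 sigma:", extreme)
```

Here the check reports `['XX']` as an unseen region and flags half of the production incomes as extreme relative to training.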

★ EDA Quick Debug Cheat Sheet

Three common production data issues and immediate actions to diagnose them.

Missing values in production data

Immediate action

Check if missingness pattern matches training data

Commands

df.isnull().sum()

df[df['feature'].isnull()].head()

Fix now

Impute with median from training data or drop rows if few

Outliers causing extreme predictions

Distribution shift detected
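
For the missing-values entry in the cheat sheet, a minimal sketch of the "median from training data" fix follows. The frames and the helper name `fit_medians` are made up for illustration. The point is that medians are computed once on training data and reused in production, never recomputed on the production batch.

```python
import numpy as np
import pandas as pd

def fit_medians(train: pd.DataFrame, cols: list) -> pd.Series:
    """Medians learned from training data only."""
    return train[cols].median()

# Toy training and production frames
train = pd.DataFrame({"income": [10.0, 20.0, 30.0, 1000.0],
                      "age": [30.0, 40.0, 50.0, np.nan]})
prod = pd.DataFrame({"income": [np.nan, 40.0, np.nan],
                     "age": [25.0, np.nan, 35.0]})

# Step 1: does the missingness pattern match training?
missing_gap = prod.isnull().mean() - train.isnull().mean()

# Step 2: fill production gaps with the training medians (returns a new frame)
medians = fit_medians(train, ["income", "age"])
filled = prod.fillna(medians)

print(missing_gap.round(3))
print(filled)
```

A large positive `missing_gap` for a feature is worth investigating before imputing, since it may point to a pipeline problem that imputation would hide.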

EDA vs. Confirmatory Data Analysis vs. Data Mining vs. Statistical Modeling

Aspect	EDA	Confirmatory Data Analysis	Data Mining	Statistical Modeling
Goal	Explore and generate hypotheses	Test specific hypotheses	Discover patterns in large data	Predict or infer relationships
Approach	Visual and descriptive	Inferential statistics	Algorithmic and automated	Mathematical equations
Data size	Any size	Typically moderate	Large (big data)	Moderate to large
Output	Plots, summaries, insights	p-values, confidence intervals	Rules, clusters, patterns	Model coefficients, predictions
Example	Histogram of customer ages	t-test comparing two groups	Association rule mining	Linear regression

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/eda/five_number_summary.py	data = np.random.exponential(scale=2.0, size=1000)	What is Exploratory Data Analysis? Definition and History (T
io/thecodeforge/eda/automated_eda_pipeline.py	def automated_eda(df: pd.DataFrame) -> dict:	Why EDA Matters
io/thecodeforge/eda/multivariate_pairplot.py	from sklearn.datasets import load_iris	Core EDA Techniques
io/thecodeforge/eda/essential_plots.py	np.random.seed(42)	Essential Visualizations
io/thecodeforge/eda/quantitative_eda.py	from scipy.stats import pearsonr, spearmanr	Quantitative EDA
io/thecodeforge/eda/eda_walkthrough.py	def eda_walkthrough(df: pd.DataFrame, target: str):	EDA in Practice
io/thecodeforge/eda/pitfalls_handling.py	from sklearn.impute import KNNImputer	Common Pitfalls and How to Avoid Them (Missing Data, Outlier
io/thecodeforge/eda/production_eda_monitoring.py	from scipy.stats import ks_2samp, chi2_contingency	Production-Grade EDA

Key takeaways

EDA is the first and most critical step in any data science project.

It uses visual and quantitative techniques to uncover patterns, outliers, and relationships.

Skipping EDA leads to biased models, wrong conclusions, and production failures.

Key tools: histograms, box plots, scatter plots, correlation matrices, and summary statistics.

EDA is iterative: you explore, form hypotheses, and then confirm with statistical tests.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR

Explain the purpose of EDA and how it differs from confirmatory data ana...

Q02SENIOR

You have a dataset with 100 features and 10,000 rows. How would you appr...

Q03SENIOR

Describe a real-world scenario where EDA prevented a costly modeling mis...

Q01 of 03JUNIOR

Explain the purpose of EDA and how it differs from confirmatory data analysis.

ANSWER

EDA is about exploring data to generate hypotheses, detect anomalies, and understand structure without preconceived notions. Confirmatory data analysis tests specific hypotheses using statistical methods. EDA is open-ended; confirmatory is hypothesis-driven. Tukey emphasized that confusing the two on the same dataset leads to bias.
