Beginner 7 min · April 15, 2026

Data Cleaning and Preprocessing for Absolute Beginners

Data Preprocessing — Duplicates That Inflate Metrics

Q: Should I clean data before or after splitting into train/test sets?

Inspect and remove exact duplicates before splitting — duplicates that span both sets cause data leakage where the model memorizes training examples that also appear in the test set. For transformations that learn parameters (scaling, imputation with mean/median, encoding), always split first, then fit on training data only. The correct sequence is: (1) load raw data, (2) remove exact duplicates on the full dataset, (3) split into train and test, (4) fit cleaning and preprocessing transformers on train only, (5) transform both train and test using the fitted transformers.

Q: How do I know if my data has outliers vs. just a skewed distribution?

Plot a histogram and a boxplot side by side. A skewed distribution has a long tail but continuous density — the values get sparser as you move along the tail, but there are no isolated points far from everything else. All values are plausible given the domain. Outliers are isolated points visibly disconnected from the main distribution, appearing as individual dots beyond the boxplot whiskers with no nearby values. Use domain knowledge as the final arbiter: a salary of $500K is plausible for a CEO but clearly an error for an intern position. Statistical tests like IQR or z-score flag candidates, but a human — or a well-defined business rule — decides whether each candidate is an error, a valid extreme, or a signal.

Q: Can I use the same cleaning pipeline for different datasets?

Only if the datasets share the exact same schema — same columns, same types, same expected range of category values. A pipeline fitted on one dataset encodes that dataset's specific statistics: the median for imputation, the mean and std for scaling, the known set of categories for one-hot encoding. If a new dataset has different columns, the pipeline will crash. If it has new category values the encoder has never seen, OneHotEncoder with handle_unknown='ignore' will silently zero them out, which may or may not be acceptable. Never reuse a pipeline blindly without validating that the new data's schema matches the training data's schema.

Q: What is the difference between imputing with mean vs. median?

Mean is sensitive to outliers — a single extreme value (a $10M salary among $50K–$150K salaries) pulls the mean dramatically and fills missing values with a number that represents nobody in the dataset. Median is robust — it represents the actual middle value regardless of extremes. For normally distributed data with no outliers, mean and median are nearly identical and either works. For skewed distributions (income, prices, counts, time-on-site), median is almost always the better choice. In production, default to median unless you have verified the distribution is approximately symmetric. It is the safer choice because you cannot always guarantee your data is outlier-free — and a single outlier in a future batch can silently corrupt mean-imputed values.

Q: How often should I re-inspect and re-clean production data?

Continuously, via automated pipeline gates. Data quality is not a one-time event — upstream schemas change, new data sources are added, vendor APIs modify their formats, and seasonal patterns introduce values your original cleaning logic never anticipated. At minimum, run automated inspection checks (row counts, missing percentages, value range validation, duplicate ratios) on every data batch before it enters your pipeline. Set thresholds that trigger alerts or halt processing. Re-evaluate your cleaning strategy quarterly or whenever model performance degrades — declining accuracy is often a data quality symptom, not a model architecture problem.

Duplicate records with unique IDs inflated revenue forecasts by 40%.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

Before you start⏱ 20 min

✓Basic programming fundamentals
✓A computer with internet access
✓Willingness to follow along with examples

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Data cleaning fixes errors, fills gaps, and standardizes formats before analysis or modeling
Missing values, duplicates, and inconsistent formats are the top 3 problems in raw data
Use pandas for tabular data: df.isnull().sum() reveals missingness patterns immediately
Encoding categoricals and scaling numerics are preprocessing steps, not cleaning
Production rule: automate cleaning pipelines — manual cleaning breaks on new data batches
Biggest mistake: dropping all rows with missing values instead of understanding why they are missing
Always split before fitting transformers — data leakage from preprocessing is the #1 silent killer of model reliability

✦ Definition~90s read

What is Data Cleaning and Preprocessing for Absolute Beginners?

Duplicate data is one of the most insidious problems in data preprocessing because it silently inflates metrics without raising obvious errors. When you have duplicate rows, every aggregate statistic—counts, averages, sums, even model accuracy—gets skewed.

★

Data cleaning is like washing vegetables before cooking.

For example, if a customer transaction appears twice in a sales dataset, your total revenue is artificially doubled, and any model trained on that data will learn patterns that don't exist in reality. This isn't just about cleaning; it's about ensuring your analysis and models reflect actual ground truth.

In the data preprocessing pipeline, duplicate detection typically comes after initial inspection and missing value handling but before type fixing and outlier treatment. The reason is pragmatic: duplicates are often easier to spot once you've resolved structural issues like inconsistent formatting.

Tools like pandas' duplicated() and drop_duplicates() are the workhorses here, but you must decide on your deduplication strategy—whether to keep the first occurrence, last occurrence, or none at all. For large datasets (millions of rows), hashing-based approaches or distributed frameworks like Spark become necessary.

Where this fits in the broader ecosystem: duplicate handling is a fundamental step in any data preprocessing workflow, alongside missing value imputation and outlier detection. It's especially critical in domains like e-commerce (duplicate orders), finance (duplicate transactions), and healthcare (duplicate patient records).

However, not all duplicates are bad—time series data may have legitimate repeated measurements, and some analytics tasks benefit from preserving duplicates (e.g., counting page views). The key is understanding your data's semantics before blindly removing rows.

When in doubt, always inspect the duplicates first: look at the actual rows, understand why they exist, and only then decide on removal.

Plain-English First

Data cleaning is like washing vegetables before cooking. You would not eat unwashed produce, and you should not feed raw, messy data into a model. Dirt in your data shows up as wrong predictions, broken dashboards, and decisions nobody should trust. This guide teaches you exactly how to inspect, fix, and prepare your data step by step — and more importantly, how to build a cleaning process that works automatically the next time new data arrives.

Most beginner tutorials skip data cleaning and jump straight to modeling. This is backwards. In production, 60–80% of a data scientist's time goes to cleaning and preprocessing, not model building.

Messy data produces confident but wrong results. A model trained on data with duplicates, missing values, or inconsistent formats will silently learn garbage patterns. The output looks plausible but is unreliable. You ship it, stakeholders act on it, and weeks later someone notices the numbers never made sense.

The core misconception is that cleaning is tedious but simple. It is not. Understanding why data is missing, how duplicates were created, and what format inconsistencies reveal about upstream systems requires real analytical thinking. A column full of NaN values might mean a sensor was offline, a user skipped an optional field, or an ETL job silently dropped records. Each cause demands a different response.

This guide covers every step from raw inspection through reproducible pipelines. The code is production-grade Python — not toy examples on iris or titanic. Every section includes the reasoning behind the technique, because knowing which tool to use matters far more than knowing the syntax.

Why Duplicate Data Is a Silent Metric Killer

Data preprocessing is the step where raw data is cleaned, transformed, and structured before analysis or model training. The core mechanic is detecting and removing duplicate records — rows that are identical or near-identical — because they artificially inflate counts, averages, and other aggregate metrics. A single duplicated row in a 10,000-row dataset can skew a mean by 0.01%, but in a 100-row dataset, it can shift results by 1% or more.

In practice, duplicates arise from system glitches, manual entry errors, or data merging from multiple sources. The key property to understand is that duplicates are not always exact copies — they can be partial matches (same user ID but different timestamps) or semantically identical records with different formatting. Detection typically uses hashing (O(n) with a hash set) or sorting (O(n log n)), but the real challenge is defining what counts as a duplicate in your domain.

Use duplicate removal whenever you compute statistics, train supervised models, or generate reports from transactional data. In production systems, failing to deduplicate leads to overestimated user counts, inflated conversion rates, and biased model predictions. For example, a fraud detection model trained on duplicated transactions will learn to flag normal behavior as suspicious, degrading precision by up to 30%.

⚠ Duplicate ≠ Identical

Two rows with the same user ID but different timestamps may still be duplicates if the event is idempotent — always define your dedup key based on business logic, not raw columns.

📊 Production Insight

A real-time analytics pipeline ingested duplicate click events from a retry mechanism, inflating daily active user counts by 15%.

The symptom was a sudden, unexplained spike in DAU that correlated with network retries, not actual user growth.

Rule of thumb: always deduplicate at ingestion using a unique event ID or a composite key of (user_id, event_type, timestamp_window).

🎯 Key Takeaway

Duplicates inflate metrics silently — always deduplicate before aggregation.

Define your dedup key by business semantics, not raw column equality.

Use hashing for O(n) detection; never rely on manual inspection at scale.

thecodeforge.io

Data Preprocessing Beginners Guide

Inspecting Your Data First

Never clean blindly. Start by understanding the shape, types, and distribution of your data. The first five commands you run on any new dataset reveal the scope of the problem and tell you where to focus your effort.

Inspection is not optional extra work — it is the diagnostic step that prevents you from applying the wrong fix. Imputing a column that should have been dropped, or scaling a column that is secretly categorical, are mistakes that only surface weeks later when production results look wrong.

io/thecodeforge/cleaning/inspection.pyPYTHON

import pandas as pd
import numpy as np

def inspect_dataset(filepath: str) -> pd.DataFrame:
    """First-pass data inspection. Run this before any cleaning.

    Returns the raw DataFrame so you can continue exploring interactively.
    Prints a diagnostic summary covering shape, types, missingness,
    duplicates, numeric ranges, and categorical cardinality.
    """
    df = pd.read_csv(filepath)

    print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

    print(f"\n--- Column Types ---")
    print(df.dtypes.value_counts())

    print(f"\n--- Missing Values ---")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    missing_report = pd.DataFrame({"count": missing, "percent": missing_pct})
    missing_report = missing_report[missing_report["count"] > 0].sort_values(
        "percent", ascending=False
    )
    if missing_report.empty:
        print("No missing values detected.")
    else:
        print(missing_report)

    print(f"\n--- Duplicate Rows ---")
    dup_count = df.duplicated().sum()
    print(f"Exact duplicates: {dup_count} ({dup_count / len(df) * 100:.1f}%)")

    print(f"\n--- Numeric Summary ---")
    num_desc = df.describe().round(2)
    print(num_desc)

    print(f"\n--- Categorical Columns ---")
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    for col in cat_cols:
        unique_count = df[col].nunique()
        print(f"  {col}: {unique_count} unique values")
        if unique_count <= 10:
            print(f"    Values: {df[col].unique().tolist()}")
        elif unique_count <= 50:
            print(f"    Top 5: {df[col].value_counts().head(5).to_dict()}")
        else:
            print(f"    High cardinality — consider grouping or encoding")

    return df

Mental Model

Inspect Before You Fix

Cleaning without inspection is like prescribing medicine without diagnosis — you might make things worse.

df.shape tells you volume — is this a 100-row problem or a 100M-row problem? The answer changes every downstream decision.
df.dtypes reveals implicit type mismatches. Numbers stored as strings are the #1 source of silent aggregation errors.
df.isnull().sum() shows missingness patterns — not all NaN is random. Correlated missingness signals systemic upstream failures.
df.duplicated().sum() catches records that should not exist. Even 1% duplicates can swing aggregate metrics significantly.
df.describe() exposes impossible min/max values (negative ages, future dates, prices of $0.00) that signal data corruption or upstream bugs.

📊 Production Insight

In production, data inspection should be automated as a pipeline gate — not something a human remembers to do in a notebook.

If the missing percentage exceeds a configured threshold, or row counts diverge from expectations, the pipeline should fail loudly with an alert rather than silently processing garbage.

Rule of thumb: never let uninspected data reach a model, a report, or a dashboard. The cost of a five-second automated check is negligible compared to the cost of retracting a wrong forecast.

🎯 Key Takeaway

Always inspect before cleaning.

Five commands — shape, dtypes, isnull, duplicated, describe — reveal 90% of data quality issues.

Automate inspection as a pipeline gate in production. Never trust raw data blindly, no matter how reputable the source.

Handling Missing Values

Missing values are the most common data quality issue. The critical decision is not how to fill them, but whether to fill them at all — and that depends on understanding why they are missing in the first place.

Statisticians categorize missingness into three types: MCAR (Missing Completely at Random), MAR (Missing at Random, conditional on other observed data), and MNAR (Missing Not at Random, where the value itself determines whether it is recorded). Each type demands a different strategy, and applying the wrong one introduces bias that no amount of hyperparameter tuning will fix.

The practical implication: before you write a single line of imputation code, spend five minutes checking whether missingness correlates with other columns or with the target variable. That five minutes prevents weeks of debugging a model that is learning from a biased training set.

io/thecodeforge/cleaning/missing_values.pyPYTHON

100

101

102

103

104

105

106

107

import pandas as pd
import numpy as np


class MissingValueHandler:
    """Strategies for handling missing values based on missingness type.

    Usage:
        handler = MissingValueHandler()
        report = handler.diagnose_missingness(df)
        df = handler.impute_numeric(df, 'age', strategy='median')
        df = handler.impute_categorical(df, 'city', strategy='unknown')
    """

    @staticmethod
    def diagnose_missingness(df: pd.DataFrame, target_col: str = None) -> dict:
        """Determine if missingness is random or systematic.

        If target_col is provided, checks whether missingness in each column
        correlates with the target — a signal of potential MNAR.
        """
        report = {}
        for col in df.columns:
            missing_count = df[col].isnull().sum()
            if missing_count == 0:
                continue

            missing_pct = missing_count / len(df) * 100
            entry = {
                "count": missing_count,
                "percent": round(missing_pct, 1),
                "recommendation": MissingValueHandler._recommend(
                    missing_pct, col, df
                ),
            }

            # Check correlation between missingness and target
            if target_col and target_col != col and target_col in df.columns:
                missing_mask = df[col].isnull()
                if df[target_col].dtype in ["float64", "int64"]:
                    mean_when_missing = df.loc[missing_mask, target_col].mean()
                    mean_when_present = df.loc[~missing_mask, target_col].mean()
                    if pd.notna(mean_when_missing) and pd.notna(mean_when_present):
                        diff_pct = (
                            abs(mean_when_missing - mean_when_present)
                            / abs(mean_when_present)
                            * 100
                        )
                        if diff_pct > 10:
                            entry["recommendation"] += (
                                f" WARNING: target mean differs by {diff_pct:.0f}% "
                                f"when {col} is missing — possible MNAR."
                            )

            report[col] = entry
        return report

    @staticmethod
    def _recommend(pct: float, col: str, df: pd.DataFrame) -> str:
        if pct > 50:
            return "DROP COLUMN — more than half the data is missing"
        elif pct > 20:
            return "INVESTIGATE — high missingness may indicate a systematic issue"
        elif df[col].dtype in ["float64", "int64"]:
            return "IMPUTE — use median for skewed, mean for normal distributions"
        else:
            return "IMPUTE — use mode or create an explicit 'Unknown' category"

    @staticmethod
    def impute_numeric(
        df: pd.DataFrame, col: str, strategy: str = "median"
    ) -> pd.DataFrame:
        """Fill missing numeric values. Adds a boolean indicator column."""
        indicator_col = f"{col}_was_missing"
        df[indicator_col] = df[col].isnull().astype(int)

        if strategy == "median":
            fill_value = df[col].median()
        elif strategy == "mean":
            fill_value = df[col].mean()
        elif strategy == "zero":
            fill_value = 0
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        df[col] = df[col].fillna(fill_value)
        print(f"{col}: imputed {df[indicator_col].sum()} values with {strategy} ({fill_value:.2f})")
        return df

    @staticmethod
    def impute_categorical(
        df: pd.DataFrame, col: str, strategy: str = "mode"
    ) -> pd.DataFrame:
        """Fill missing categorical values."""
        indicator_col = f"{col}_was_missing"
        df[indicator_col] = df[col].isnull().astype(int)

        if strategy == "mode":
            fill_value = df[col].mode()[0]
        elif strategy == "unknown":
            fill_value = "Unknown"
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        df[col] = df[col].fillna(fill_value)
        print(f"{col}: imputed {df[indicator_col].sum()} values with '{fill_value}'")
        return df

⚠ Three Types of Missingness — and Why It Matters

MCAR (Missing Completely at Random): missingness has no pattern. A sensor randomly drops readings. Safe to drop rows or impute with simple statistics. MAR (Missing at Random): missingness depends on other observed columns. Younger users skip the income field more often. Impute using those correlated columns — simple mean imputation introduces bias. MNAR (Missing Not at Random): missingness depends on the unobserved value itself. High-income earners refuse to report income. No imputation strategy fully fixes this — you need domain expertise, external data, or explicit modeling of the missingness mechanism. Always investigate why data is missing before choosing a strategy. The wrong choice silently biases every downstream result.

📊 Production Insight

Dropping rows with missing values can silently remove 30–80% of your dataset.

Always check df.dropna().shape versus df.shape before committing to a drop.

More importantly, check whether the dropped rows are a random subset or a biased one. If all dropped rows belong to a specific demographic, your model will systematically underperform for that group.

Rule: dropping is the last resort. Impute or flag missingness as a feature first. Add a boolean indicator column (was_missing) so the model can learn from the missingness pattern itself.

🎯 Key Takeaway

Not all missing values should be filled the same way.

Diagnose the missingness mechanism before choosing a strategy — MCAR, MAR, and MNAR demand different treatments.

Dropping rows is the nuclear option — always check how much data you lose and whether the loss is biased.

Add boolean indicator columns so the model can learn from the missingness pattern itself.

Missing Value Strategy Selection

IfColumn has more than 50% missing values

→

UseDrop the column unless it is the target variable or you have strong domain evidence that the remaining values carry critical signal.

IfNumeric column with less than 20% missing, distribution is roughly normal

→

UseImpute with mean. Add a boolean 'was_missing' indicator column to preserve the missingness signal.

IfNumeric column with less than 20% missing, distribution is skewed or has outliers

→

UseImpute with median. Median is robust to extreme values that would distort the mean.

IfCategorical column with less than 20% missing

→

UseImpute with mode or create an explicit 'Unknown' category. Unknown is often preferable because it does not assume the missing values follow the same distribution as observed values.

IfMissingness is systematic (MAR or MNAR)

→

UseInvestigate the root cause. Consider model-based imputation (KNN Imputer, IterativeImputer). Do not blindly fill — you may introduce bias that is invisible during training but devastating in production.

thecodeforge.io

Data Preprocessing Beginners Guide

Removing Duplicates

Duplicates inflate counts, skew averages, and produce overconfident models. They come in two flavors: exact duplicates (entire row identical) and fuzzy duplicates (same real-world entity, slightly different representation — like 'John Smith' vs 'john smith' vs 'J. Smith').

Exact duplicates are easy to detect and remove. Fuzzy duplicates are harder and more dangerous because they survive standard deduplication checks. A customer appearing twice with slightly different names gets counted as two customers, doubling their weight in any aggregate.

The most important principle: define your deduplication key based on business logic, not technical convenience. A surrogate ID that is unique by definition tells you nothing about whether two rows represent the same real-world entity.

io/thecodeforge/cleaning/duplicates.pyPYTHON

import pandas as pd
from difflib import SequenceMatcher


class DuplicateHandler:
    """Detect and remove duplicate records — exact and fuzzy.

    Usage:
        handler = DuplicateHandler()
        dups = handler.find_exact_duplicates(df, subset=['email', 'order_date'])
        df_clean = handler.remove_exact_duplicates(df, subset=['email', 'order_date'])
        fuzzy = handler.find_fuzzy_duplicates(df, columns=['name', 'address'])
    """

    @staticmethod
    def find_exact_duplicates(
        df: pd.DataFrame, subset: list = None
    ) -> pd.DataFrame:
        """Identify exact duplicate rows.

        Args:
            df: input DataFrame
            subset: columns to consider for duplication. None = all columns.

        Returns:
            DataFrame containing only the duplicate rows (excluding first occurrence).
        """
        dup_mask = df.duplicated(subset=subset, keep="first")
        dup_count = dup_mask.sum()
        print(f"Exact duplicates found: {dup_count} ({dup_count / len(df) * 100:.1f}%)")
        if subset:
            print(f"  Checked on columns: {subset}")
        return df[dup_mask]

    @staticmethod
    def remove_exact_duplicates(
        df: pd.DataFrame, subset: list = None, keep: str = "first"
    ) -> pd.DataFrame:
        """Remove exact duplicates, keeping one copy."""
        before = len(df)
        df_clean = df.drop_duplicates(subset=subset, keep=keep).reset_index(
            drop=True
        )
        removed = before - len(df_clean)
        print(
            f"Removed {removed} duplicate rows. {len(df_clean):,} rows remaining."
        )
        return df_clean

    @staticmethod
    def find_fuzzy_duplicates(
        df: pd.DataFrame, columns: list, threshold: float = 0.85
    ) -> list:
        """Find near-duplicate records based on string similarity.

        WARNING: O(n^2) complexity. For large datasets (>10k rows), sample
        or use blocking strategies (group by first letter, zip code, etc.)
        before calling this method.

        Args:
            df: input DataFrame
            columns: columns to concatenate for similarity comparison
            threshold: minimum similarity ratio to flag as fuzzy duplicate

        Returns:
            List of (index_i, index_j, similarity_score) tuples.
        """
        fuzzy_pairs = []
        values = df[columns].astype(str).agg(" ".join, axis=1).tolist()

        n = len(values)
        if n > 10_000:
            print(
                f"WARNING: {n:,} rows will produce {n*(n-1)//2:,} comparisons. "
                f"Consider blocking or sampling first."
            )

        for i in range(n):
            for j in range(i + 1, n):
                similarity = SequenceMatcher(None, values[i], values[j]).ratio()
                if similarity >= threshold and similarity < 1.0:
                    fuzzy_pairs.append((i, j, round(similarity, 3)))

        print(
            f"Found {len(fuzzy_pairs)} fuzzy duplicate pairs above "
            f"{threshold} threshold"
        )
        return fuzzy_pairs

💡Where Duplicates Hide

Payment processor retries create records with different transaction IDs but identical amounts and timestamps within a narrow window.
User registration forms submitted twice produce records with the same email but different auto-incremented user IDs.
Data imports from multiple sources create overlapping date ranges with slight format differences (2025-04-15 vs 04/15/2025).
CRM merges introduce near-duplicate contacts: 'Acme Corp' vs 'Acme Corporation' vs 'ACME Corp.'.
Always define deduplication keys based on business logic, not just surrogate primary keys that are unique by design.

📊 Production Insight

Duplicates often indicate upstream system bugs — not just data entry errors or CSV import accidents.

Removing duplicates without logging them hides the root cause and guarantees the problem recurs on the next data load.

Rule: always report duplicate counts, patterns, and example rows to the data engineering team. Treat recurring duplicates as incidents, not cleanup tasks. Fix the source, not just the symptom.

🎯 Key Takeaway

Exact duplicates are easy to find — fuzzy duplicates are the real threat and require domain-specific similarity logic.

Define deduplication keys based on business logic, not table primary keys.

Duplicates often signal upstream bugs — report them, investigate the source, do not just silently delete them.

Fixing Data Types and Formats

Incorrect data types cause silent errors that can persist for months. Numbers stored as strings cannot be aggregated — pandas will concatenate them instead of summing. Dates stored as strings cannot be compared or sorted chronologically. Booleans stored as 'Yes'/'No' strings require explicit conversion before any logical operation.

The root cause is almost always upstream: CSV files have no schema enforcement, APIs return everything as strings, and Excel silently interprets dates in locale-dependent formats. By the time data reaches your DataFrame, the damage is done. Your job is to detect and fix it systematically.

The most insidious variant is partial type corruption — a column that is 99.9% numeric but contains a handful of string values like 'N/A' or '#REF!'. Pandas silently stores the entire column as object dtype, and your aggregation returns wrong results with no error message.

io/thecodeforge/cleaning/types.pyPYTHON

100

import pandas as pd
import numpy as np


class TypeFixer:
    """Fix data type issues in DataFrames.

    Usage:
        fixer = TypeFixer()
        df = fixer.fix_numeric_strings(df, ['revenue', 'price'])
        df = fixer.fix_dates(df, ['order_date', 'ship_date'])
        df = fixer.fix_booleans(df, ['is_active', 'has_subscription'])
        df = fixer.fix_categoricals(df, ['country', 'status'])
    """

    @staticmethod
    def fix_numeric_strings(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert string columns that should be numeric.

        Handles currency symbols, commas, whitespace, and common
        placeholder strings like 'N/A', '-', 'null', '#REF!'.
        """
        for col in columns:
            original_nulls = df[col].isna().sum()
            cleaned = df[col].astype(str)
            cleaned = cleaned.str.replace(r"[\$\€\£\,]", "", regex=True)
            cleaned = cleaned.str.strip()
            cleaned = cleaned.replace(
                {"": np.nan, "N/A": np.nan, "n/a": np.nan,
                 "-": np.nan, "null": np.nan, "None": np.nan,
                 "nan": np.nan, "#REF!": np.nan, "#N/A": np.nan}
            )
            df[col] = pd.to_numeric(cleaned, errors="coerce")
            new_nulls = df[col].isna().sum() - original_nulls
            print(
                f"{col}: converted to numeric. "
                f"{new_nulls} new NaN values from failed coercion."
            )
        return df

    @staticmethod
    def fix_dates(
        df: pd.DataFrame, columns: list, dayfirst: bool = False,
        fmt: str = None
    ) -> pd.DataFrame:
        """Parse date columns with explicit format handling.

        Args:
            fmt: explicit strftime format string (e.g., '%Y-%m-%d').
                 If None, pandas infers the format — risky in production.
        """
        for col in columns:
            if fmt:
                df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")
            else:
                df[col] = pd.to_datetime(
                    df[col], dayfirst=dayfirst, errors="coerce"
                )
            failed = df[col].isna().sum()
            print(f"{col}: parsed as datetime. {failed} values are NaT.")
            # Sanity check: flag future dates
            future = (df[col] > pd.Timestamp.now()).sum()
            if future > 0:
                print(f"  WARNING: {future} dates are in the future.")
        return df

    @staticmethod
    def fix_booleans(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert string booleans to actual booleans."""
        true_values = {"yes", "true", "1", "y", "on", "t"}
        false_values = {"no", "false", "0", "n", "off", "f"}

        for col in columns:
            lower = df[col].astype(str).str.lower().str.strip()
            df[col] = lower.map(
                lambda x: (
                    True if x in true_values
                    else (False if x in false_values else np.nan)
                )
            )
            unmapped = df[col].isna().sum()
            print(f"{col}: converted to boolean. {unmapped} unmapped values.")
        return df

    @staticmethod
    def fix_categoricals(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert high-memory object columns to categorical dtype."""
        for col in columns:
            before_mem = df[col].memory_usage(deep=True) / 1e6
            df[col] = df[col].astype("category")
            after_mem = df[col].memory_usage(deep=True) / 1e6
            if before_mem > 0:
                reduction = (1 - after_mem / before_mem) * 100
                print(
                    f"{col}: {before_mem:.2f} MB -> {after_mem:.2f} MB "
                    f"({reduction:.0f}% reduction)"
                )
            else:
                print(f"{col}: converted to category dtype.")
        return df

🔥Memory Impact of Type Fixes

Converting a string column with 1 million rows and 50 unique values to category dtype typically reduces memory from ~50 MB to ~1 MB — a 98% reduction. For numeric columns, downcasting float64 to float32 halves memory usage with negligible precision loss for most applications. These savings compound quickly: a 20-column DataFrame can shrink from 2 GB to 200 MB, which is the difference between fitting in a Lambda function's memory or crashing at runtime.

📊 Production Insight

Date parsing failures are completely silent — NaT values do not raise exceptions, and downstream code happily filters them out or treats them as null without warning.

Always compare the count of successfully parsed dates against the total row count. If 5% of dates failed parsing, that is a data quality bug, not an acceptable loss.

Rule: specify date formats explicitly in production pipelines. Never rely on pandas auto-inference, which changes behavior across pandas versions and produces different results on ambiguous inputs like '01/02/2025'.

🎯 Key Takeaway

Wrong data types cause silent aggregation failures — pandas concatenates strings instead of summing numbers, with no error.

Clean strings before converting: strip whitespace, remove currency symbols, replace placeholders.

Convert low-cardinality object columns to category dtype for 90%+ memory savings.

Always specify explicit date formats in production. Auto-inference is a ticking time bomb.

Handling Outliers

Outliers distort model training and summary statistics. A single salary of $10M in a dataset of $50K–$150K salaries will shift the mean by thousands of dollars and stretch the scale for distance-based algorithms.

But the key decision is whether an outlier is an error (remove it), a genuine extreme value (keep or cap it), or the actual signal you are trying to detect (embrace it). Domain knowledge drives this decision, not statistical rules alone.

The IQR method and z-score are detection tools, not decision tools. They flag candidates. A human — or a well-defined business rule — decides what to do with each candidate. Blindly removing everything beyond 1.5×IQR is a recipe for deleting the most informative data points in your dataset.

io/thecodeforge/cleaning/outliers.pyPYTHON

import pandas as pd
import numpy as np


class OutlierHandler:
    """Detect and handle outliers using multiple methods.

    Usage:
        handler = OutlierHandler()
        mask, lower, upper = handler.detect_iqr(df, 'salary')
        mask = handler.detect_zscore(df, 'salary', threshold=3.0)
        df = handler.cap_outliers(df, 'salary', lower=20000, upper=500000)
        df = handler.log_transform(df, 'revenue')
    """

    @staticmethod
    def detect_iqr(
        df: pd.DataFrame, col: str, factor: float = 1.5
    ) -> tuple:
        """Flag outliers using the IQR method.

        Returns:
            Tuple of (boolean mask, lower bound, upper bound).
        """
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - factor * iqr
        upper = q3 + factor * iqr

        outlier_mask = (df[col] < lower) | (df[col] > upper)
        count = outlier_mask.sum()
        print(f"{col}: {count} outliers ({count/len(df)*100:.1f}%) outside [{lower:.2f}, {upper:.2f}]")
        return outlier_mask, lower, upper

    @staticmethod
    def detect_zscore(
        df: pd.DataFrame, col: str, threshold: float = 3.0
    ) -> pd.Series:
        """Flag outliers using z-score method."""
        mean = df[col].mean()
        std = df[col].std()
        if std == 0:
            print(f"{col}: standard deviation is 0 — all values are identical. No outliers.")
            return pd.Series(False, index=df.index)

        z_scores = np.abs((df[col] - mean) / std)
        outlier_mask = z_scores > threshold
        count = outlier_mask.sum()
        print(f"{col}: {count} outliers with |z-score| > {threshold}")
        return outlier_mask

    @staticmethod
    def cap_outliers(
        df: pd.DataFrame, col: str,
        lower: float = None, upper: float = None
    ) -> pd.DataFrame:
        """Cap outlier values to specified bounds (winsorization).

        If bounds are not specified, defaults to 1st and 99th percentiles.
        """
        if lower is None:
            lower = df[col].quantile(0.01)
        if upper is None:
            upper = df[col].quantile(0.99)

        before = df[col].copy()
        df[col] = df[col].clip(lower=lower, upper=upper)
        capped = (before != df[col]).sum()
        print(f"{col}: capped {capped} values to [{lower:.2f}, {upper:.2f}]")
        return df

    @staticmethod
    def log_transform(df: pd.DataFrame, col: str) -> pd.DataFrame:
        """Apply log1p transform to compress right-skewed distributions.

        Handles zeros safely. Does NOT handle negative values — shift first.
        """
        if (df[col] < 0).any():
            min_val = df[col].min()
            print(f"{col}: contains negative values. Shifting by {abs(min_val):.2f} before log transform.")
            df[col] = np.log1p(df[col] - min_val)
        else:
            df[col] = np.log1p(df[col])
        print(f"{col}: log1p transform applied.")
        return df

⚠ Never Blindly Remove Outliers

In fraud detection, the outliers ARE the signal. In medical data, extreme lab values may indicate critical conditions. In e-commerce, a few whales who spend 100x the average drive a disproportionate share of revenue. Always consult domain experts before removing outliers. An outlier removal step that makes your model's R² look better on paper may be destroying the most important information in the dataset. If you cannot consult an expert, cap rather than remove — at least the record and its other features are preserved.

📊 Production Insight

Outlier thresholds calculated from training data must be saved and reused identically on new data at inference time.

Recalculating thresholds per batch means different data points get flagged each time, making your cleaning pipeline non-deterministic.

Rule: compute outlier bounds once on training data, serialize them alongside your scaler and model, then apply those fixed bounds to all future data. If the distribution genuinely shifts, that is a retraining trigger, not a reason to silently recalculate thresholds.

🎯 Key Takeaway

Outliers are either errors or signals — domain knowledge decides which.

Cap extreme values instead of deleting them whenever possible. Removal discards information; capping preserves the record.

Save outlier thresholds from training data and reuse them deterministically on all future batches.

Outlier Handling Strategy

IfOutlier is clearly a data entry error (age = 500, negative price, temperature of 9999°F)

→

UseSet to NaN and treat as a missing value. Investigate and fix the upstream source that produced it.

IfOutlier is plausible but extreme (billionaire income in salary data, 100-unit purchase in a mostly single-unit store)

→

UseCap at a domain-reasonable threshold (e.g., 99th percentile) or use robust scaling (RobustScaler). Preserve the record.

IfOutlier is the signal of interest (fraud transactions, anomalous sensor readings, rare diseases)

→

UseKeep it. Use algorithms robust to outliers (tree-based models, isolation forest). Removing it defeats the purpose.

IfDistribution is heavily skewed but all values are valid (income, page views, time on site)

→

UseApply log1p or Box-Cox transformation to compress the tail instead of removing values. This preserves information while reducing the impact of extremes.

Encoding Categorical Variables

Machine learning models require numeric input. Categorical variables — country, product type, color, status — must be converted to numbers through encoding. This is a preprocessing step, not cleaning, but it is where beginners make some of their most costly mistakes.

The encoding strategy depends on two factors: cardinality (how many unique values the column has) and whether those values have a natural order. Choosing the wrong method can create thousands of useless sparse columns, introduce false ordinal relationships, or leak target information into features.

The most common mistake: one-hot encoding a column with 10,000 unique values. This creates 9,999 new binary columns, most of which are zero for any given row. Memory explodes, training crawls, and linear models become impossible to fit.

io/thecodeforge/cleaning/encoding.pyPYTHON

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class CategoricalEncoder:
    """Encode categorical variables for model consumption.

    Usage:
        encoder = CategoricalEncoder()
        df = encoder.encode_onehot(df, ['color', 'size'])
        df = encoder.encode_ordinal(df, 'education', ['high_school', 'bachelors', 'masters', 'phd'])
        df = encoder.encode_frequency(df, ['city', 'product_id'])
    """

    @staticmethod
    def encode_onehot(
        df: pd.DataFrame, columns: list, max_cardinality: int = 10
    ) -> pd.DataFrame:
        """One-hot encode low-cardinality columns.

        Skips columns that exceed max_cardinality and prints a warning.
        Uses drop_first=True to avoid multicollinearity in linear models.
        """
        for col in columns:
            cardinality = df[col].nunique()
            if cardinality > max_cardinality:
                print(
                    f"SKIPPED: {col} has {cardinality} unique values. "
                    f"Use frequency or target encoding instead."
                )
                continue

            dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
            df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
            print(f"{col}: one-hot encoded into {len(dummies.columns)} columns")
        return df

    @staticmethod
    def encode_ordinal(
        df: pd.DataFrame, column: str, order: list
    ) -> pd.DataFrame:
        """Encode ordered categories (e.g., Low < Medium < High).

        Unmapped values are set to NaN so they surface as missing rather
        than silently receiving an arbitrary integer.
        """
        mapping = {val: idx for idx, val in enumerate(order)}
        unmapped = set(df[column].dropna().unique()) - set(mapping.keys())
        if unmapped:
            print(f"WARNING: {column} has unmapped values: {unmapped}")
        df[column] = df[column].map(mapping)
        print(f"{column}: ordinally encoded with order {order}")
        return df

    @staticmethod
    def encode_frequency(
        df: pd.DataFrame, columns: list
    ) -> pd.DataFrame:
        """Replace categories with their frequency in the training data.

        Safe for high-cardinality features. Does not create new columns.
        """
        for col in columns:
            freq_map = df[col].value_counts(normalize=True).to_dict()
            df[col] = df[col].map(freq_map)
            print(f"{col}: frequency encoded ({len(freq_map)} categories)")
        return df

    @staticmethod
    def encode_label(
        df: pd.DataFrame, columns: list
    ) -> tuple:
        """Label encode — safe for tree-based models only.

        WARNING: creates false ordinal relationships for linear models.
        Returns the DataFrame and a dict of fitted LabelEncoders for
        inverse transformation and production reuse.
        """
        encoders = {}
        for col in columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            encoders[col] = le
            print(f"{col}: label encoded ({len(le.classes_)} classes)")
        return df, encoders

Mental Model

Encoding Decision Framework

The encoding method depends on two factors: cardinality and whether order matters. Get this wrong and you either explode your feature space or teach your model fake relationships.

Low cardinality (< 10 unique values), no natural order: One-hot encoding. Creates N-1 binary columns. Standard for most classification tasks.
Low cardinality, natural order exists: Ordinal encoding. Maps to integers preserving order (Low=0, Medium=1, High=2). Use for education levels, satisfaction ratings, size categories.
High cardinality (> 50 unique values): Target encoding (mean of target per category) or frequency encoding. One-hot creates too many sparse columns and kills performance.
Tree-based models (Random Forest, XGBoost, LightGBM): Label encoding works fine. Trees split on thresholds, so the arbitrary integer assignment does not create false distance relationships.
Linear models, SVMs, and neural networks: One-hot or target encoding only. Label encoding introduces false ordinal relationships (category 5 is not 'more' than category 2).

📊 Production Insight

One-hot encoding a column with 10,000 unique values creates 9,999 new columns, most of which are zero for any given row.

This explodes memory, makes linear models intractable, and adds zero predictive signal for rare categories.

Rule: always check df['col'].nunique() before encoding. If cardinality exceeds 50, use frequency encoding, target encoding, or hash encoding. If cardinality exceeds 10,000, strongly consider whether this feature should be decomposed (e.g., zip code → state) or dropped entirely.

🎯 Key Takeaway

Encoding strategy depends on cardinality and model type — there is no universal best method.

One-hot for low cardinality, frequency or target encoding for high cardinality.

Label encoding is safe for trees but creates false ordinal relationships in distance-based and linear models.

Always save the encoder or mapping. You need the exact same transformation at inference time.

Feature Scaling

Features on different scales — age in years (0–100) vs. income in dollars (0–10,000,000) — bias distance-based models and slow gradient descent convergence. KNN will treat a $1 difference in income as equivalent to a 1-year difference in age, which makes no sense.

Scaling normalizes feature ranges without changing the distribution shape. It does not remove outliers, fix missing values, or encode categories. It is a pure preprocessing step that should happen after cleaning, after encoding, and after the train/test split.

The critical rule that most beginners violate: fit the scaler on training data only. If you fit on the full dataset, the scaler's mean and standard deviation incorporate test set information, which leaks future knowledge into training. The metrics look artificially good, and the model underperforms on genuinely unseen data.

io/thecodeforge/cleaning/scaling.pyPYTHON

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler


class FeatureScaler:
    """Scale numeric features for model consumption.

    Usage:
        scaler_util = FeatureScaler()
        df_train, scaler = scaler_util.scale_standard(df_train, num_cols)
        df_test[num_cols] = scaler.transform(df_test[num_cols])  # same scaler!
    """

    @staticmethod
    def scale_standard(
        df: pd.DataFrame, columns: list
    ) -> tuple:
        """Z-score normalization: mean=0, std=1.

        Use for normally distributed features with linear models,
        logistic regression, SVMs, and PCA.
        """
        scaler = StandardScaler()
        df[columns] = scaler.fit_transform(df[columns])
        print(f"StandardScaler applied to {len(columns)} columns")
        return df, scaler

    @staticmethod
    def scale_minmax(
        df: pd.DataFrame, columns: list,
        feature_range: tuple = (0, 1)
    ) -> tuple:
        """Min-max scaling to a fixed range.

        Use for neural networks (which expect bounded inputs) and
        features with known theoretical bounds.
        """
        scaler = MinMaxScaler(feature_range=feature_range)
        df[columns] = scaler.fit_transform(df[columns])
        print(
            f"MinMaxScaler applied to {len(columns)} columns, "
            f"range={feature_range}"
        )
        return df, scaler

    @staticmethod
    def scale_robust(df: pd.DataFrame, columns: list) -> tuple:
        """Scale using median and IQR instead of mean and std.

        Use when data has outliers that would distort StandardScaler.
        Outliers still exist after scaling but no longer dominate the range.
        """
        scaler = RobustScaler()
        df[columns] = scaler.fit_transform(df[columns])
        print(
            f"RobustScaler applied to {len(columns)} columns "
            f"(outlier-resistant)"
        )
        return df, scaler

💡When to Scale and When Not To

Needs scaling: KNN, SVM, Neural Networks, PCA, Logistic Regression, Linear Regression (especially with regularization — L1/L2 penalties are scale-dependent).
Does NOT need scaling: Decision Trees, Random Forests, XGBoost, LightGBM, CatBoost, Naive Bayes. These models split on thresholds, not distances.
Always scale after the train/test split — fit the scaler on training data only, then transform both train and test with the same fitted scaler.
Save the scaler object with joblib.dump(). You must apply the exact same transformation to new data at inference time. A different scaler = different feature space = garbage predictions.
If you use an sklearn Pipeline, the scaler is automatically fitted only on training data during pipeline.fit() — this is one of the strongest reasons to use Pipelines.

📊 Production Insight

Fitting the scaler on the full dataset (including test data) causes data leakage.

The scaler computes mean and standard deviation using test data points, so the model indirectly sees test set statistics during training. This inflates accuracy by 1–5% on typical tabular datasets — enough to make a mediocre model look good in evaluation but fail in production.

Rule: fit scaler on training data only. Use scaler.transform() on test and production data. If you ever call scaler.fit_transform() on test data, you have a data leakage bug.

🎯 Key Takeaway

Scaling is required for distance-based and gradient-based models but unnecessary for tree-based models.

Use RobustScaler when outliers are present — it uses median and IQR instead of mean and std.

Always fit the scaler on training data only. Serialize it and ship it with the model to production.

Building a Reproducible Cleaning Pipeline

Manual cleaning in a notebook is fine for exploration. Production requires a reproducible pipeline that applies the exact same transformations to new data consistently, without human intervention, without remembering which cells to run in which order.

Scikit-learn's Pipeline and ColumnTransformer make this possible. They encapsulate the entire sequence of transformations — imputation, encoding, scaling — into a single object that can be fitted on training data, serialized to disk, and loaded at inference time. The pipeline remembers every parameter: the median used for imputation, the mean and std used for scaling, the category set used for encoding.

This is not optional infrastructure for large teams. It is essential even for solo projects. The moment you retrain a model or deploy it to a different environment, you need the pipeline. Without it, you will spend hours reverse-engineering what transformations were applied and in what order.

io/thecodeforge/cleaning/pipeline.pyPYTHON

import pandas as pd
import numpy as np
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer


class DataCleaningPipeline:
    """Reproducible data cleaning pipeline for production use.

    Encapsulates imputation, encoding, and scaling into a single
    serializable object. Fit once on training data, transform everything
    else with the same fitted parameters.

    Usage:
        pipeline = DataCleaningPipeline(
            numeric_cols=['age', 'income', 'score'],
            categorical_cols=['city', 'status', 'tier']
        )
        X_train_clean = pipeline.fit_transform(X_train)
        X_test_clean = pipeline.transform(X_test)
        pipeline.save('preprocessing_pipeline.pkl')

        # At inference time:
        loaded = DataCleaningPipeline.load('preprocessing_pipeline.pkl')
        X_new_clean = loaded.transform(X_new)
    """

    def __init__(self, numeric_cols: list, categorical_cols: list):
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols
        self.pipeline = self._build_pipeline()
        self._is_fitted = False

    def _build_pipeline(self) -> ColumnTransformer:
        numeric_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ])

        categorical_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("encoder", OneHotEncoder(
                handle_unknown="ignore", sparse_output=False
            )),
        ])

        preprocessor = ColumnTransformer(transformers=[
            ("numeric", numeric_transformer, self.numeric_cols),
            ("categorical", categorical_transformer, self.categorical_cols),
        ])

        return preprocessor

    def fit(self, X_train: pd.DataFrame) -> "DataCleaningPipeline":
        """Fit all transformers on training data only."""
        self.pipeline.fit(X_train)
        self._is_fitted = True
        print(f"Pipeline fitted on {len(X_train):,} rows, "
              f"{len(self.numeric_cols)} numeric + "
              f"{len(self.categorical_cols)} categorical columns.")
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """Transform new data using the fitted pipeline."""
        if not self._is_fitted:
            raise RuntimeError(
                "Pipeline is not fitted. Call .fit(X_train) first."
            )
        return self.pipeline.transform(X)

    def fit_transform(self, X_train: pd.DataFrame) -> np.ndarray:
        """Fit and transform in one step. Use on training data."""
        self.fit(X_train)
        return self.pipeline.transform(X_train)

    def save(self, filepath: str) -> None:
        """Serialize the fitted pipeline to disk."""
        if not self._is_fitted:
            raise RuntimeError(
                "Cannot save an unfitted pipeline. Call .fit() first."
            )
        joblib.dump(self, filepath)
        print(f"Pipeline saved to {filepath}")

    @staticmethod
    def load(filepath: str) -> "DataCleaningPipeline":
        """Load a previously saved pipeline."""
        pipeline = joblib.load(filepath)
        print(f"Pipeline loaded from {filepath}")
        return pipeline

⚠ Pipeline Immutability Is Non-Negotiable

Once a pipeline is fitted, it encodes the training data's statistics — the median for imputation, the mean and std for scaling, the set of known categories for encoding. Never refit the pipeline on new data after deployment. If the data distribution shifts significantly, that is a retraining trigger: retrain the entire pipeline and model together on a fresh training set, validate, and redeploy. Refitting the pipeline alone while keeping the old model creates a mismatch between preprocessing assumptions and model weights.

📊 Production Insight

Notebook-based cleaning breaks the moment new data has a column your notebook did not anticipate, or a category the encoder has never seen.

A scikit-learn Pipeline handles unseen categories gracefully with handle_unknown='ignore', and it enforces the correct fit-then-transform order by design.

Rule: always encapsulate cleaning logic in a Pipeline object. Serialize it with joblib.dump(), version it alongside your model artifacts, and load it at inference time. The pipeline and the model are a unit — they ship together or not at all.

🎯 Key Takeaway

Manual notebook cleaning does not scale and does not reproduce. Build a scikit-learn Pipeline.

Fit all transformers on training data only, then transform everything else with the fitted pipeline.

Serialize the pipeline with joblib and version it as a first-class artifact. It must travel with the model to every environment — dev, staging, and production.

Why Your Correlation Matrix is a Lie (And How to Fix It)

You just dropped a model into production. R² looks solid. In the real world, it predicts random noise. I've seen this exact disaster caused by multicollinearity you never checked. Raw data hides relationships between features. Correlation analysis exposes them. Two columns measuring the same underlying signal? Your model double-counts that signal, memorizes training noise, and falls apart on new data.

Here's the fix: generate a Pearson correlation matrix. Look for absolute values above 0.8. Those features are collinear. Drop one or combine them via PCA. Never trust a model with collinear features. It's not learning—it's overfitting to a ghost.

Your junior always skips this step because nothing looks broken. That's exactly when production burns down. Always run corr(), then heatmap it. Never deploy without this check.

correlation_check.pyPYTHON

// io.thecodeforge
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('production_orders.csv')

# Compute correlation matrix
corr_matrix = df.corr()

# Find highly correlated pairs
high_corr = [(col1, col2, corr_matrix.loc[col1, col2])
             for col1 in corr_matrix.columns
             for col2 in corr_matrix.columns
             if col1 != col2 and abs(corr_matrix.loc[col1, col2]) > 0.8]

print(f"Highly correlated pairs (>0.8): {high_corr}")
# Output: [('revenue', 'price', 0.94), ('orders', 'clicks', 0.91)]

# Visual heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations - Check Before Training')
plt.show()

Output

Highly correlated pairs (>0.8): [('revenue', 'price', 0.94), ('orders', 'clicks', 0.91)]

⚠ Production Trap:

Don't just print the matrix—actually drop or transform one of each collinear pair. I watched a team deploy a churn model that used both 'call_duration' and 'total_seconds'—identical features with different names. Predictions looked amazing in testing. Real users? Pure chaos.

🎯 Key Takeaway

Run correlation analysis before feature selection. If two features are >0.8 correlated, one has to go.

The Target Variable Distribution: Why Your Model Only Predicts the Average

Last month, a junior handed me a regression model that predicted exactly the mean for every sample. He blamed the algorithm. I pointed at the target variable histogram. It was bimodal—two peaks, completely separate populations. The model averaged them. It never learned either group.

Your target's shape dictates everything. If it's skewed right, your model overfits high-value outliers and undershoots the majority. If it's bimodal, you're encoding two separate problems into one. Always visualize the target with a histogram or density plot before you touch a single feature.

For skewed targets, apply a log or Box-Cox transformation. For bimodal ones, split the data into two separate models or use a classification threshold. Never train a regression model on a multimodal target without asking why it's split. The answer will save your deployment.

target_distribution.pyPYTHON

// io.thecodeforge
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

df = pd.read_csv('sales_data.csv')
target = 'revenue'

# Plot raw distribution
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.hist(df[target], bins=50, edgecolor='black', alpha=0.7)
plt.title(f'Raw {target} Distribution')
plt.xlabel(target)

# Check skewness
skew = df[target].skew()
print(f"Skewness: {skew:.2f}")
# Output: Skewness: 2.35 (highly right-skewed)

# Apply log transformation if skewed > 1
if abs(skew) > 1:
    df[f'{target}_log'] = np.log1p(df[target])
    plt.subplot(1, 2, 2)
    plt.hist(df[f'{target}_log'], bins=50, edgecolor='black', alpha=0.7)
    plt.title(f'Log-Transformed {target}')
    plt.xlabel(f'log({target})')

plt.tight_layout()
plt.show()

Output

Skewness: 2.35 (highly right-skewed)

🔥Quick Check:

Plot the target before any feature engineering. If your distribution has multiple peaks, stop and ask: 'Should I build two models?' If it's skewed, transform it. If you skip this, your model will break on the long tail.

🎯 Key Takeaway

Always visualize your target variable. A skewed or multimodal target guarantees a bad model. Fix it before training.

Separating Features and Target: The One-Line That Stops Silent Failures

I've debugged models where the dataset accidentally included the target column twice—once as a feature. The model hit 99% accuracy in minutes. In production? 20%. Because that duplicate column wasn't present in real inference. The 'score' didn't exist. The model panicked on missing data and output garbage.

Separate your features (X) and target (y) at the start of every pipeline. Never assume 'all columns except the last one are features.' Explicitly define them. A production system will respect an explicit drop more than an implicit assumption.

Use drop with the column name. Don't rely on positional indexing. If someone reorders columns next week (they will), your positional slice breaks silently. A named column drop breaks visibly with a KeyError. Visible failure beats silent nonsense every time.

split_features_target.pyPYTHON

// io.thecodeforge
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('customer_churn.csv')

# BAD: implicit positional split
# X = df.iloc[:, :-1]  # Breaks silently if columns reorder
# y = df.iloc[:, -1]   # Same

# GOOD: explicit column names
target_column = 'churn'  # or 'price', 'revenue', etc.
feature_columns = [col for col in df.columns if col != target_column]

X = df[feature_columns]
y = df[target_column]

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
# Output: X_train shape: (800, 12)  y_train shape: (800,)

# Sanity check: ensure target not in features
assert target_column not in feature_columns, "Target leaked into features!"

Output

X_train shape: (800, 12) y_train shape: (800,)

💡Production Trap:

Never use iloc to separate features and target. I debugged a pipeline where a data engineer added a new column to the left. The train split silently shifted, pulling the target into features. The model looked perfect for a week. Then users reported insane predictions. The root cause: a single positional index change.

🎯 Key Takeaway

Always separate features and target by explicit column name, not position. Add an assertion to catch target leakage immediately.

● Production incidentPOST-MORTEMseverity: high

Revenue Forecast Off by 40% Due to Undetected Duplicate Records

Symptom

The model predicted $14M in Q3 revenue. Actual revenue came in at $10M. The error was consistent across multiple quarters of backtesting, not a one-off spike. Finance flagged it when actuals landed — nobody caught it during model validation because the duplicates existed in both training and test sets, inflating metrics uniformly.

Assumption

The data engineering team assumed the ETL pipeline deduplicated records at ingestion. The data science team assumed the dataset was clean because it came from the data warehouse, which was treated as the single source of truth. Neither team had an independent row-count reconciliation step.

Root cause

A bug in the upstream payment processor created duplicate transaction records with different timestamps (off by 1–2 seconds) after network retries. The ETL pipeline deduplicated on transaction_id, but the retried transactions were assigned unique IDs by the payment gateway. The data warehouse contained roughly 40% duplicate revenue entries — invisible to any check that relied solely on surrogate keys.

Fix

Added a deduplication step based on a (customer_id, amount, timestamp_window_5s) composite key that matched business semantics rather than system-generated IDs. Implemented a row-count sanity check comparing warehouse totals to source system totals after every ETL run. Added automated Slack alerts when the duplicate ratio exceeds 2% or when warehouse row counts diverge from source counts by more than 0.5%.

Key lesson

Never trust upstream data without independent validation — even if it comes from your own warehouse.
Deduplicate on business-meaningful keys, not just surrogate IDs. System-generated keys can mask real-world duplicates.
Compare aggregate totals between source and destination systems as a routine sanity check, not an afterthought.
Duplicates that exist in both train and test sets inflate all evaluation metrics uniformly — you cannot catch this with a standard holdout split.

Production debug guideCommon signals that your data has cleaning problems — and exactly where to look.5 entries

Symptom · 01

Model accuracy is high on training data but terrible on new data

→

Fix

Check for data leakage first: are future-derived features present? Then check for duplicates spanning train/test splits with df.duplicated(subset=feature_cols).sum(). Finally, look for unclean features that let the model memorize noise rather than learn signal.

Symptom · 02

Descriptive statistics look wrong (e.g., negative ages, future dates, percentages above 100)

→

Fix

Run df.describe() and inspect min/max values for every numeric column. Apply domain-specific validation rules: age ∈ [0, 120], dates ≤ today, percentages ∈ [0, 100]. Flag impossible values as NaN and investigate the upstream source that produced them.

Symptom · 03

Groupby aggregations produce unexpected counts or totals

→

Fix

Check for leading/trailing whitespace and case inconsistencies in string columns. 'California', 'california', and 'California ' are three different groups to pandas. Use df['col'] = df['col'].str.strip().str.lower() before grouping.

Symptom · 04

Joins produce more rows than expected

→

Fix

You have duplicate keys in one or both tables. A many-to-many join on duplicated keys creates a Cartesian product. Check df.duplicated(subset=['key_col']).sum() on both DataFrames before joining. If duplicates are expected, use a merge indicator: pd.merge(left, right, on='key', how='left', indicator=True).

Symptom · 05

Feature distributions look completely different between training and serving data

→

Fix

This is data drift, but it can also be a cleaning artifact. Check whether the cleaning pipeline was fitted on training data and applied consistently at serving time. A common cause is a scaler or imputer being refit on serving data instead of using the saved training-time transformer.

★ Data Cleaning Debug Cheat SheetQuick checks to diagnose data quality issues when something looks off.

Unexpected NaN values appearing after type conversion−

Immediate action

Check for non-numeric strings hiding in numeric columns (e.g., 'N/A', '-', empty strings, 'null', '#REF!').

Commands

df['column'].apply(type).value_counts()

df[pd.to_numeric(df['column'], errors='coerce').isna()]['column'].unique()

Fix now

Replace non-numeric placeholders with NaN explicitly before converting: df['column'].replace(['N/A', '-', '', 'null'], np.nan, inplace=True), then pd.to_numeric(df['column'], errors='coerce').

Date parsing produces NaT or wrong dates+

Memory usage is unexpectedly high+

Value counts show suspiciously similar categories that should be the same+

Scaling Methods Comparison

Scaler	Formula	Outlier Resistant	Output Range	Best For
StandardScaler	(x − mean) / std	No	Unbounded (≈ −3 to 3 for normal data)	Normally distributed features, linear models, logistic regression, SVM
MinMaxScaler	(x − min) / (max − min)	No	[0, 1] or custom range	Neural networks, image pixel values, features with known theoretical bounds
RobustScaler	(x − median) / IQR	Yes	Unbounded	Data with outliers, skewed distributions, salary/revenue columns
MaxAbsScaler	x / max(\|x\|)	No	[−1, 1]	Sparse data (preserves zero entries), already centered features
Log Transform	log(x + 1)	Partially	[0, ∞)	Right-skewed distributions (income, counts, page views, time on site)

⚙ Quick Reference

11 commands from this guide

File	Command / Code	Purpose
iothecodeforgecleaninginspection.py	def inspect_dataset(filepath: str) -> pd.DataFrame:	Inspecting Your Data First
iothecodeforgecleaningmissing_values.py	class MissingValueHandler:	Handling Missing Values
iothecodeforgecleaningduplicates.py	from difflib import SequenceMatcher	Removing Duplicates
iothecodeforgecleaningtypes.py	class TypeFixer:	Fixing Data Types and Formats
iothecodeforgecleaningoutliers.py	class OutlierHandler:	Handling Outliers
iothecodeforgecleaningencoding.py	from sklearn.preprocessing import LabelEncoder, OneHotEncoder	Encoding Categorical Variables
iothecodeforgecleaningscaling.py	from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler	Feature Scaling
iothecodeforgecleaningpipeline.py	from sklearn.pipeline import Pipeline	Building a Reproducible Cleaning Pipeline
correlation_check.py	df = pd.read_csv('production_orders.csv')	Why Your Correlation Matrix is a Lie (And How to Fix It)
target_distribution.py	from scipy import stats	The Target Variable Distribution
split_features_target.py	from sklearn.model_selection import train_test_split	Separating Features and Target

Key takeaways

Inspect data with df.shape, df.dtypes, df.isnull().sum(), df.duplicated().sum(), and df.describe() before writing a single line of cleaning code.

Understand why data is missing before choosing an imputation strategy

the missingness mechanism (MCAR, MAR, MNAR) matters more than the imputation method.

Deduplicate on business-meaningful composite keys, not on system-generated surrogate IDs that are unique by design.

Always split data into train/test before fitting scalers, encoders, or imputers

fitting on the full dataset causes data leakage that inflates metrics by 1–5%.

Check cardinality with df['col'].nunique() before one-hot encoding. More than 50 unique values demands frequency encoding, target encoding, or decomposition.

Build a reproducible scikit-learn Pipeline that encapsulates the entire cleaning and preprocessing sequence in a single serializable object.

Serialize the preprocessing pipeline with joblib and version it alongside the model. They are a unit

shipping one without the other guarantees production failures.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

You receive a dataset where 35% of values in a critical feature column a...

Q02JUNIOR

What is the difference between data cleaning and data preprocessing?

Q03SENIOR

How would you handle a categorical feature with 10,000 unique values for...

Q04SENIOR

Why must you fit scalers on training data only and not the full dataset?

Q01 of 04SENIOR

You receive a dataset where 35% of values in a critical feature column are missing. How do you decide what to do?

ANSWER

First, I would investigate the missingness mechanism. I would check whether missingness correlates with other features (MAR) or with the target variable (MNAR) by comparing distributions of rows with and without missing values — for example, comparing the mean target value when the feature is present versus absent. If there is no significant difference, it is likely MCAR, and I can impute with median for numeric or mode for categorical data, adding a boolean indicator column to preserve the missingness signal. If missingness is systematic, I would investigate the upstream data source to understand why 35% of records lack this value. Is it a recently added field? Is it optional in the source form? Does a specific data provider fail to supply it? If more than 50% were missing, I would consider dropping the column. At 35%, I would likely impute — but I would use a method appropriate to the mechanism, potentially KNNImputer or IterativeImputer for MAR data rather than simple mean/median. I would never blindly drop rows — losing 35% of data almost always introduces more bias than imputation.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Should I clean data before or after splitting into train/test sets?

How do I know if my data has outliers vs. just a skewed distribution?

Can I use the same cleaning pipeline for different datasets?

What is the difference between imputing with mean vs. median?

How often should I re-inspect and re-clean production data?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

🔥

That's ML Basics. Mark it forged?

7 min read · try the examples if you haven't