
Data Cleaning and Preprocessing for Absolute Beginners

πŸ“ Part of: ML Basics β†’ Topic 19 of 25
Practical guide to cleaning messy data β€” the skill that actually matters most for beginners.
πŸ§‘β€πŸ’» Beginner-friendly β€” no prior ML / AI experience needed
In this tutorial, you'll learn
  • Inspect data with df.shape, df.dtypes, df.isnull().sum(), df.duplicated().sum(), and df.describe() before writing a single line of cleaning code.
  • Understand why data is missing before choosing an imputation strategy — the missingness mechanism (MCAR, MAR, MNAR) matters more than the imputation method.
  • Deduplicate on business-meaningful composite keys, not on system-generated surrogate IDs that are unique by design.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
  • Data cleaning fixes errors, fills gaps, and standardizes formats before analysis or modeling
  • Missing values, duplicates, and inconsistent formats are the top 3 problems in raw data
  • Use pandas for tabular data: df.isnull().sum() reveals missingness patterns immediately
  • Encoding categoricals and scaling numerics are preprocessing steps, not cleaning
  • Production rule: automate cleaning pipelines β€” manual cleaning breaks on new data batches
  • Biggest mistake: dropping all rows with missing values instead of understanding why they are missing
  • Always split before fitting transformers — data leakage from preprocessing is the #1 silent killer of model reliability
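The last bullet is worth making concrete. Below is a minimal pandas sketch (with a hypothetical income column and an arbitrary 4/2 split for illustration) of fitting a standardization step on training rows only, then reusing those same statistics on the held-out rows:

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, 55.0, 52.0, 90.0, 48.0, 61.0]})

# Split FIRST (a hypothetical 4/2 split; real code would shuffle)
train, test = df.iloc[:4], df.iloc[4:]

# "Fit" the transformer on training rows only: here, standardization stats
mu, sigma = train["income"].mean(), train["income"].std()

# Apply the SAME training-time statistics to the held-out rows
train_scaled = (train["income"] - mu) / sigma
test_scaled = (test["income"] - mu) / sigma
```

If you instead computed `mu` and `sigma` on the full DataFrame, information from the test rows would leak into the transform and inflate evaluation metrics.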
🚨 START HERE
Data Cleaning Debug Cheat Sheet
Quick checks to diagnose data quality issues when something looks off.
🟡 Unexpected NaN values appearing after type conversion
Immediate Action: Check for non-numeric strings hiding in numeric columns (e.g., 'N/A', '-', empty strings, 'null', '#REF!').
Commands
df['column'].apply(type).value_counts()
df[pd.to_numeric(df['column'], errors='coerce').isna()]['column'].unique()
Fix Now: Replace non-numeric placeholders with NaN explicitly before converting: df['column'] = df['column'].replace(['N/A', '-', '', 'null'], np.nan), then df['column'] = pd.to_numeric(df['column'], errors='coerce').
🟡 Date parsing produces NaT or wrong dates
Immediate Action: Inspect raw date strings for mixed formats. The value '03/04/2025' means March 4 in the US and April 3 in most of Europe. Pandas guesses, and it guesses wrong roughly half the time on ambiguous dates.
Commands
df['date_col'].head(20).tolist()
pd.to_datetime(df['date_col'], format='mixed', dayfirst=False)
Fix Now: Specify format explicitly with pd.to_datetime(df['date_col'], format='%Y-%m-%d'). If formats are genuinely mixed, parse each known format separately into sub-DataFrames and combine with pd.concat.
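A minimal sketch of that mixed-format fallback, assuming a hypothetical date_col containing two known formats. Here the per-format parses are combined with fillna rather than pd.concat, which amounts to the same thing for two formats:

```python
import pandas as pd

raw = pd.DataFrame({"date_col": ["2025-04-15", "04/16/2025", "2025-04-17"]})

# Parse each known format separately; errors='coerce' leaves NaT wherever
# the format does not match, so the two parses never overlap
iso = pd.to_datetime(raw["date_col"], format="%Y-%m-%d", errors="coerce")
us = pd.to_datetime(raw["date_col"], format="%m/%d/%Y", errors="coerce")

# Combine: take the ISO parse where it succeeded, fall back to the US parse
raw["parsed"] = iso.fillna(us)
```

Any value left as NaT after combining belongs to a format you have not yet accounted for, which is itself a useful data quality signal.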
🟡 Memory usage is unexpectedly high
Immediate Action: Check for object columns that should be categorical. A column with 10 unique string values repeated across 10 million rows stores 10 million separate string objects in memory.
Commands
df.memory_usage(deep=True).sum() / 1e6
df.select_dtypes('object').nunique()
Fix Now: Convert low-cardinality object columns to category dtype: df['col'] = df['col'].astype('category'). For numeric columns stored as float64 that only contain small integers, downcast: df['col'] = pd.to_numeric(df['col'], downcast='integer').
🟡 Value counts show suspiciously similar categories that should be the same
Immediate Action: Check for encoding issues, invisible Unicode characters, and casing differences.
Commands
df['col'].value_counts().head(20)
[repr(v) for v in df['col'].unique()[:20]]
Fix Now: Normalize with df['col'] = df['col'].str.strip().str.lower().str.normalize('NFKD'). Then map known synonyms: df['col'] = df['col'].replace({'usa': 'united states', 'u.s.': 'united states'}).
Production Incident: Revenue Forecast Off by 40% Due to Undetected Duplicate Records
A quarterly revenue model produced wildly inaccurate forecasts because duplicate transaction records inflated totals by 40%.
Symptom: The model predicted $14M in Q3 revenue. Actual revenue came in at $10M. The error was consistent across multiple quarters of backtesting, not a one-off spike. Finance flagged it when actuals landed — nobody caught it during model validation because the duplicates existed in both training and test sets, inflating metrics uniformly.
Assumption: The data engineering team assumed the ETL pipeline deduplicated records at ingestion. The data science team assumed the dataset was clean because it came from the data warehouse, which was treated as the single source of truth. Neither team had an independent row-count reconciliation step.
Root cause: A bug in the upstream payment processor created duplicate transaction records with different timestamps (off by 1–2 seconds) after network retries. The ETL pipeline deduplicated on transaction_id, but the retried transactions were assigned unique IDs by the payment gateway. The data warehouse contained roughly 40% duplicate revenue entries — invisible to any check that relied solely on surrogate keys.
Fix: Added a deduplication step based on a (customer_id, amount, timestamp_window_5s) composite key that matched business semantics rather than system-generated IDs. Implemented a row-count sanity check comparing warehouse totals to source system totals after every ETL run. Added automated Slack alerts when the duplicate ratio exceeds 2% or when warehouse row counts diverge from source counts by more than 0.5%.
Key Lesson
  • Never trust upstream data without independent validation — even if it comes from your own warehouse.
  • Deduplicate on business-meaningful keys, not just surrogate IDs. System-generated keys can mask real-world duplicates.
  • Compare aggregate totals between source and destination systems as a routine sanity check, not an afterthought.
  • Duplicates that exist in both train and test sets inflate all evaluation metrics uniformly — you cannot catch this with a standard holdout split.
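The composite-key fix from this incident can be sketched in a few lines of pandas. The column names are hypothetical, and dt.floor is used to bucket timestamps into 5-second windows:

```python
import pandas as pd

# Hypothetical transactions: "a2" is a payment-gateway retry of "a1"
# (new surrogate ID, timestamp 2 seconds later, same customer and amount)
tx = pd.DataFrame({
    "transaction_id": ["a1", "a2", "b1"],
    "customer_id": [7, 7, 9],
    "amount": [100.0, 100.0, 55.0],
    "timestamp": pd.to_datetime([
        "2025-07-01 12:00:00", "2025-07-01 12:00:02", "2025-07-01 12:05:00",
    ]),
})

# Bucket timestamps into 5-second windows, then dedupe on the business key
# rather than the surrogate transaction_id
tx["ts_bucket"] = tx["timestamp"].dt.floor("5s")
deduped = tx.drop_duplicates(subset=["customer_id", "amount", "ts_bucket"])
```

One caveat: floor-based bucketing misses retries that straddle a bucket boundary. A more robust version sorts by timestamp and compares each row to the previous one within the same (customer_id, amount) group.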
Production Debug Guide
Common signals that your data has cleaning problems — and exactly where to look.
Model accuracy is high on training data but terrible on new data → Check for data leakage first: are future-derived features present? Then check for duplicates spanning train/test splits with df.duplicated(subset=feature_cols).sum(). Finally, look for unclean features that let the model memorize noise rather than learn signal.
Descriptive statistics look wrong (e.g., negative ages, future dates, percentages above 100) → Run df.describe() and inspect min/max values for every numeric column. Apply domain-specific validation rules: age ∈ [0, 120], dates ≤ today, percentages ∈ [0, 100]. Flag impossible values as NaN and investigate the upstream source that produced them.
Groupby aggregations produce unexpected counts or totals → Check for leading/trailing whitespace and case inconsistencies in string columns. 'California', 'california', and 'California ' are three different groups to pandas. Use df['col'] = df['col'].str.strip().str.lower() before grouping.
Joins produce more rows than expected → You have duplicate keys in one or both tables. A many-to-many join on duplicated keys creates a Cartesian product. Check df.duplicated(subset=['key_col']).sum() on both DataFrames before joining. If duplicates are expected, use a merge indicator: pd.merge(left, right, on='key', how='left', indicator=True).
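A tiny reproducible example of that key-duplication explosion, plus pandas' validate= guard, which raises pandas.errors.MergeError up front instead of silently multiplying rows (table contents are made up for illustration):

```python
import pandas as pd

orders = pd.DataFrame({"key": [1, 2, 2], "order": ["A", "B", "C"]})
payments = pd.DataFrame({"key": [2, 2], "paid": [10, 20]})

# key=2 is duplicated on BOTH sides: 2 x 2 = 4 combined rows for that key,
# and key=1 has no match at all -- a 3-row left table becomes 4 rows
merged = pd.merge(orders, payments, on="key", how="inner")
print(len(merged))  # 4

# validate= turns the silent explosion into a loud, immediate error
try:
    pd.merge(orders, payments, on="key", how="inner", validate="one_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True
```

Adding validate="one_to_one" (or "one_to_many", as appropriate) to every production merge is a cheap way to encode your expectations about key uniqueness.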
Feature distributions look completely different between training and serving data → This is data drift, but it can also be a cleaning artifact. Check whether the cleaning pipeline was fitted on training data and applied consistently at serving time. A common cause is a scaler or imputer being refit on serving data instead of using the saved training-time transformer.
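One way to sketch the fix is to make the "fitted state" explicit. This is illustrative only: a real pipeline would persist a fitted sklearn transformer with joblib, but here the fitted state is just a median computed once at training time:

```python
import pandas as pd

train = pd.DataFrame({"age": [25.0, 30.0, None, 40.0]})

# "Fit" once, at training time: compute the statistic and persist it with
# the model artifact (a real pipeline would joblib.dump a fitted transformer)
stats = {"age_median": float(train["age"].median())}

# At serving time, reload the SAVED statistic; never recompute it on the
# incoming serving batch, whose distribution may differ
serving = pd.DataFrame({"age": [None, 52.0]})
serving["age"] = serving["age"].fillna(stats["age_median"])
```

The point is the asymmetry: training computes statistics, serving only applies them.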

Most beginner tutorials skip data cleaning and jump straight to modeling. This is backwards. In production, 60–80% of a data scientist's time goes to cleaning and preprocessing, not model building.

Messy data produces confident but wrong results. A model trained on data with duplicates, missing values, or inconsistent formats will silently learn garbage patterns. The output looks plausible but is unreliable. You ship it, stakeholders act on it, and weeks later someone notices the numbers never made sense.

The core misconception is that cleaning is tedious but simple. It is not. Understanding why data is missing, how duplicates were created, and what format inconsistencies reveal about upstream systems requires real analytical thinking. A column full of NaN values might mean a sensor was offline, a user skipped an optional field, or an ETL job silently dropped records. Each cause demands a different response.

This guide covers every step from raw inspection through reproducible pipelines. The code is production-grade Python — not toy examples on iris or titanic. Every section includes the reasoning behind the technique, because knowing which tool to use matters far more than knowing the syntax.

Inspecting Your Data First

Never clean blindly. Start by understanding the shape, types, and distribution of your data. The first five commands you run on any new dataset reveal the scope of the problem and tell you where to focus your effort.

Inspection is not optional extra work — it is the diagnostic step that prevents you from applying the wrong fix. Imputing a column that should have been dropped, or scaling a column that is secretly categorical, are mistakes that only surface weeks later when production results look wrong.

io/thecodeforge/cleaning/inspection.py · PYTHON
import pandas as pd
import numpy as np

def inspect_dataset(filepath: str) -> pd.DataFrame:
    """First-pass data inspection. Run this before any cleaning.

    Returns the raw DataFrame so you can continue exploring interactively.
    Prints a diagnostic summary covering shape, types, missingness,
    duplicates, numeric ranges, and categorical cardinality.
    """
    df = pd.read_csv(filepath)

    print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

    print(f"\n--- Column Types ---")
    print(df.dtypes.value_counts())

    print(f"\n--- Missing Values ---")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    missing_report = pd.DataFrame({"count": missing, "percent": missing_pct})
    missing_report = missing_report[missing_report["count"] > 0].sort_values(
        "percent", ascending=False
    )
    if missing_report.empty:
        print("No missing values detected.")
    else:
        print(missing_report)

    print(f"\n--- Duplicate Rows ---")
    dup_count = df.duplicated().sum()
    print(f"Exact duplicates: {dup_count} ({dup_count / len(df) * 100:.1f}%)")

    print(f"\n--- Numeric Summary ---")
    num_desc = df.describe().round(2)
    print(num_desc)

    print(f"\n--- Categorical Columns ---")
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    for col in cat_cols:
        unique_count = df[col].nunique()
        print(f"  {col}: {unique_count} unique values")
        if unique_count <= 10:
            print(f"    Values: {df[col].unique().tolist()}")
        elif unique_count <= 50:
            print(f"    Top 5: {df[col].value_counts().head(5).to_dict()}")
        else:
            print(f"    High cardinality — consider grouping or encoding")

    return df
Mental Model
Inspect Before You Fix
Cleaning without inspection is like prescribing medicine without diagnosis β€” you might make things worse.
  • df.shape tells you volume — is this a 100-row problem or a 100M-row problem? The answer changes every downstream decision.
  • df.dtypes reveals implicit type mismatches. Numbers stored as strings are the #1 source of silent aggregation errors.
  • df.isnull().sum() shows missingness patterns — not all NaN is random. Correlated missingness signals systemic upstream failures.
  • df.duplicated().sum() catches records that should not exist. Even 1% duplicates can swing aggregate metrics significantly.
  • df.describe() exposes impossible min/max values (negative ages, future dates, prices of $0.00) that signal data corruption or upstream bugs.
📊 Production Insight
In production, data inspection should be automated as a pipeline gate — not something a human remembers to do in a notebook.
If the missing percentage exceeds a configured threshold, or row counts diverge from expectations, the pipeline should fail loudly with an alert rather than silently processing garbage.
Rule of thumb: never let uninspected data reach a model, a report, or a dashboard. The cost of a five-second automated check is negligible compared to the cost of retracting a wrong forecast.
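A minimal sketch of such a gate. The function name validate_or_fail and its thresholds are hypothetical, chosen for illustration:

```python
import pandas as pd

def validate_or_fail(df: pd.DataFrame, max_missing_pct: float = 5.0,
                     min_rows: int = 1) -> None:
    """Hypothetical pipeline gate: raise loudly instead of processing bad data."""
    if len(df) < min_rows:
        raise ValueError(f"row count {len(df)} below minimum {min_rows}")
    missing_pct = df.isnull().mean() * 100
    offenders = missing_pct[missing_pct > max_missing_pct]
    if not offenders.empty:
        raise ValueError(f"missingness over {max_missing_pct}%: {offenders.to_dict()}")

# A batch where column 'a' is 75% missing trips the gate
batch = pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, 4]})
try:
    validate_or_fail(batch)
    halted = False
except ValueError as err:
    halted = True
    print("pipeline halted:", err)
```

In a real orchestrator (Airflow, Dagster, etc.), the raised exception fails the task and triggers the alert; nothing downstream ever sees the bad batch.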
🎯 Key Takeaway
Always inspect before cleaning.
Five commands — shape, dtypes, isnull, duplicated, describe — reveal 90% of data quality issues.
Automate inspection as a pipeline gate in production. Never trust raw data blindly, no matter how reputable the source.

Handling Missing Values

Missing values are the most common data quality issue. The critical decision is not how to fill them, but whether to fill them at all — and that depends on understanding why they are missing in the first place.

Statisticians categorize missingness into three types: MCAR (Missing Completely at Random), MAR (Missing at Random, conditional on other observed data), and MNAR (Missing Not at Random, where the value itself determines whether it is recorded). Each type demands a different strategy, and applying the wrong one introduces bias that no amount of hyperparameter tuning will fix.

The practical implication: before you write a single line of imputation code, spend five minutes checking whether missingness correlates with other columns or with the target variable. That five minutes prevents weeks of debugging a model that is learning from a biased training set.

io/thecodeforge/cleaning/missing_values.py · PYTHON
import pandas as pd
import numpy as np


class MissingValueHandler:
    """Strategies for handling missing values based on missingness type.

    Usage:
        handler = MissingValueHandler()
        report = handler.diagnose_missingness(df)
        df = handler.impute_numeric(df, 'age', strategy='median')
        df = handler.impute_categorical(df, 'city', strategy='unknown')
    """

    @staticmethod
    def diagnose_missingness(df: pd.DataFrame, target_col: str = None) -> dict:
        """Determine if missingness is random or systematic.

        If target_col is provided, checks whether missingness in each column
        correlates with the target — a signal of potential MNAR.
        """
        report = {}
        for col in df.columns:
            missing_count = df[col].isnull().sum()
            if missing_count == 0:
                continue

            missing_pct = missing_count / len(df) * 100
            entry = {
                "count": missing_count,
                "percent": round(missing_pct, 1),
                "recommendation": MissingValueHandler._recommend(
                    missing_pct, col, df
                ),
            }

            # Check correlation between missingness and target
            if target_col and target_col != col and target_col in df.columns:
                missing_mask = df[col].isnull()
                if df[target_col].dtype in ["float64", "int64"]:
                    mean_when_missing = df.loc[missing_mask, target_col].mean()
                    mean_when_present = df.loc[~missing_mask, target_col].mean()
                    if pd.notna(mean_when_missing) and pd.notna(mean_when_present):
                        diff_pct = (
                            abs(mean_when_missing - mean_when_present)
                            / abs(mean_when_present)
                            * 100
                        )
                        if diff_pct > 10:
                            entry["recommendation"] += (
                                f" WARNING: target mean differs by {diff_pct:.0f}% "
                                f"when {col} is missing — possible MNAR."
                            )

            report[col] = entry
        return report

    @staticmethod
    def _recommend(pct: float, col: str, df: pd.DataFrame) -> str:
        if pct > 50:
            return "DROP COLUMN — more than half the data is missing"
        elif pct > 20:
            return "INVESTIGATE — high missingness may indicate a systematic issue"
        elif df[col].dtype in ["float64", "int64"]:
            return "IMPUTE — use median for skewed, mean for normal distributions"
        else:
            return "IMPUTE — use mode or create an explicit 'Unknown' category"

    @staticmethod
    def impute_numeric(
        df: pd.DataFrame, col: str, strategy: str = "median"
    ) -> pd.DataFrame:
        """Fill missing numeric values. Adds a boolean indicator column."""
        indicator_col = f"{col}_was_missing"
        df[indicator_col] = df[col].isnull().astype(int)

        if strategy == "median":
            fill_value = df[col].median()
        elif strategy == "mean":
            fill_value = df[col].mean()
        elif strategy == "zero":
            fill_value = 0
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        df[col] = df[col].fillna(fill_value)
        print(f"{col}: imputed {df[indicator_col].sum()} values with {strategy} ({fill_value:.2f})")
        return df

    @staticmethod
    def impute_categorical(
        df: pd.DataFrame, col: str, strategy: str = "mode"
    ) -> pd.DataFrame:
        """Fill missing categorical values."""
        indicator_col = f"{col}_was_missing"
        df[indicator_col] = df[col].isnull().astype(int)

        if strategy == "mode":
            fill_value = df[col].mode()[0]
        elif strategy == "unknown":
            fill_value = "Unknown"
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        df[col] = df[col].fillna(fill_value)
        print(f"{col}: imputed {df[indicator_col].sum()} values with '{fill_value}'")
        return df
⚠ Three Types of Missingness — and Why It Matters
MCAR (Missing Completely at Random): missingness has no pattern. A sensor randomly drops readings. Safe to drop rows or impute with simple statistics.
MAR (Missing at Random): missingness depends on other observed columns. Younger users skip the income field more often. Impute using those correlated columns — simple mean imputation introduces bias.
MNAR (Missing Not at Random): missingness depends on the unobserved value itself. High-income earners refuse to report income. No imputation strategy fully fixes this — you need domain expertise, external data, or explicit modeling of the missingness mechanism.
Always investigate why data is missing before choosing a strategy. The wrong choice silently biases every downstream result.
📊 Production Insight
Dropping rows with missing values can silently remove 30–80% of your dataset.
Always check df.dropna().shape versus df.shape before committing to a drop.
More importantly, check whether the dropped rows are a random subset or a biased one. If all dropped rows belong to a specific demographic, your model will systematically underperform for that group.
Rule: dropping is the last resort. Impute or flag missingness as a feature first. Add a boolean indicator column (was_missing) so the model can learn from the missingness pattern itself.
🎯 Key Takeaway
Not all missing values should be filled the same way.
Diagnose the missingness mechanism before choosing a strategy — MCAR, MAR, and MNAR demand different treatments.
Dropping rows is the nuclear option — always check how much data you lose and whether the loss is biased.
Add boolean indicator columns so the model can learn from the missingness pattern itself.
Missing Value Strategy Selection
If: Column has more than 50% missing values
→ Use: Drop the column unless it is the target variable or you have strong domain evidence that the remaining values carry critical signal.
If: Numeric column with less than 20% missing, distribution is roughly normal
→ Use: Impute with mean. Add a boolean 'was_missing' indicator column to preserve the missingness signal.
If: Numeric column with less than 20% missing, distribution is skewed or has outliers
→ Use: Impute with median. Median is robust to extreme values that would distort the mean.
If: Categorical column with less than 20% missing
→ Use: Impute with mode or create an explicit 'Unknown' category. Unknown is often preferable because it does not assume the missing values follow the same distribution as observed values.
If: Missingness is systematic (MAR or MNAR)
→ Use: Investigate the root cause. Consider model-based imputation (KNNImputer, IterativeImputer). Do not blindly fill — you may introduce bias that is invisible during training but devastating in production.
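The model-based option in the last row can be sketched with scikit-learn's KNNImputer, which fills each NaN from the k most similar rows, so correlated observed columns (the MAR case) inform the fill. The tiny DataFrame below is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25.0, 30.0, np.nan, 40.0, 28.0],
    "income": [40.0, 55.0, 52.0, 90.0, 48.0],
})

# Each NaN is replaced by the mean of that column across the 2 nearest
# rows, where "nearest" is measured on the columns that ARE observed
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Here the missing age (income 52) borrows from the two rows with the closest incomes (55 and 48, ages 30 and 28), yielding 29 rather than the plain column mean. IterativeImputer (behind sklearn.experimental.enable_iterative_imputer) generalizes this by regressing each column on the others.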

Removing Duplicates

Duplicates inflate counts, skew averages, and produce overconfident models. They come in two flavors: exact duplicates (entire row identical) and fuzzy duplicates (same real-world entity, slightly different representation — like 'John Smith' vs 'john smith' vs 'J. Smith').

Exact duplicates are easy to detect and remove. Fuzzy duplicates are harder and more dangerous because they survive standard deduplication checks. A customer appearing twice with slightly different names gets counted as two customers, doubling their weight in any aggregate.

The most important principle: define your deduplication key based on business logic, not technical convenience. A surrogate ID that is unique by definition tells you nothing about whether two rows represent the same real-world entity.

io/thecodeforge/cleaning/duplicates.py · PYTHON
import pandas as pd
from difflib import SequenceMatcher


class DuplicateHandler:
    """Detect and remove duplicate records — exact and fuzzy.

    Usage:
        handler = DuplicateHandler()
        dups = handler.find_exact_duplicates(df, subset=['email', 'order_date'])
        df_clean = handler.remove_exact_duplicates(df, subset=['email', 'order_date'])
        fuzzy = handler.find_fuzzy_duplicates(df, columns=['name', 'address'])
    """

    @staticmethod
    def find_exact_duplicates(
        df: pd.DataFrame, subset: list = None
    ) -> pd.DataFrame:
        """Identify exact duplicate rows.

        Args:
            df: input DataFrame
            subset: columns to consider for duplication. None = all columns.

        Returns:
            DataFrame containing only the duplicate rows (excluding first occurrence).
        """
        dup_mask = df.duplicated(subset=subset, keep="first")
        dup_count = dup_mask.sum()
        print(f"Exact duplicates found: {dup_count} ({dup_count / len(df) * 100:.1f}%)")
        if subset:
            print(f"  Checked on columns: {subset}")
        return df[dup_mask]

    @staticmethod
    def remove_exact_duplicates(
        df: pd.DataFrame, subset: list = None, keep: str = "first"
    ) -> pd.DataFrame:
        """Remove exact duplicates, keeping one copy."""
        before = len(df)
        df_clean = df.drop_duplicates(subset=subset, keep=keep).reset_index(
            drop=True
        )
        removed = before - len(df_clean)
        print(
            f"Removed {removed} duplicate rows. {len(df_clean):,} rows remaining."
        )
        return df_clean

    @staticmethod
    def find_fuzzy_duplicates(
        df: pd.DataFrame, columns: list, threshold: float = 0.85
    ) -> list:
        """Find near-duplicate records based on string similarity.

        WARNING: O(n^2) complexity. For large datasets (>10k rows), sample
        or use blocking strategies (group by first letter, zip code, etc.)
        before calling this method.

        Args:
            df: input DataFrame
            columns: columns to concatenate for similarity comparison
            threshold: minimum similarity ratio to flag as fuzzy duplicate

        Returns:
            List of (index_i, index_j, similarity_score) tuples.
        """
        fuzzy_pairs = []
        values = df[columns].astype(str).agg(" ".join, axis=1).tolist()

        n = len(values)
        if n > 10_000:
            print(
                f"WARNING: {n:,} rows will produce {n*(n-1)//2:,} comparisons. "
                f"Consider blocking or sampling first."
            )

        for i in range(n):
            for j in range(i + 1, n):
                similarity = SequenceMatcher(None, values[i], values[j]).ratio()
                if similarity >= threshold and similarity < 1.0:
                    fuzzy_pairs.append((i, j, round(similarity, 3)))

        print(
            f"Found {len(fuzzy_pairs)} fuzzy duplicate pairs above "
            f"{threshold} threshold"
        )
        return fuzzy_pairs
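The blocking strategy mentioned in the warning above can be sketched like this: compare only rows whose normalized first character matches, which shrinks the O(n²) comparison space. First letter is a deliberately crude blocking key for illustration; real pipelines block on zip code, phone prefix, or a phonetic encoding:

```python
import pandas as pd
from difflib import SequenceMatcher

names = pd.Series(["Acme Corp", "ACME Corp.", "Beta LLC", "Acme Corporation"])
normalized = names.str.strip().str.lower()

pairs = []
# Only compare rows inside the same block (same first character), so the
# "acme" rows are never compared against "beta llc" at all
for _, block in normalized.groupby(normalized.str[0]):
    idx = block.index.tolist()
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            score = SequenceMatcher(None, block.loc[idx[a]], block.loc[idx[b]]).ratio()
            if score >= 0.7:
                pairs.append((idx[a], idx[b], round(score, 3)))
```

With b blocks of roughly equal size, blocking cuts the comparison count by about a factor of b, at the cost of missing duplicates that land in different blocks (e.g., a typo in the first letter).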
💡 Where Duplicates Hide
  • Payment processor retries create records with different transaction IDs but identical amounts and timestamps within a narrow window.
  • User registration forms submitted twice produce records with the same email but different auto-incremented user IDs.
  • Data imports from multiple sources create overlapping date ranges with slight format differences (2025-04-15 vs 04/15/2025).
  • CRM merges introduce near-duplicate contacts: 'Acme Corp' vs 'Acme Corporation' vs 'ACME Corp.'.
  • Always define deduplication keys based on business logic, not just surrogate primary keys that are unique by design.
📊 Production Insight
Duplicates often indicate upstream system bugs — not just data entry errors or CSV import accidents.
Removing duplicates without logging them hides the root cause and guarantees the problem recurs on the next data load.
Rule: always report duplicate counts, patterns, and example rows to the data engineering team. Treat recurring duplicates as incidents, not cleanup tasks. Fix the source, not just the symptom.
🎯 Key Takeaway
Exact duplicates are easy to find — fuzzy duplicates are the real threat and require domain-specific similarity logic.
Define deduplication keys based on business logic, not table primary keys.
Duplicates often signal upstream bugs — report them, investigate the source, do not just silently delete them.

Fixing Data Types and Formats

Incorrect data types cause silent errors that can persist for months. Numbers stored as strings cannot be aggregated — pandas will concatenate them instead of summing. Dates stored as strings cannot be compared or sorted chronologically. Booleans stored as 'Yes'/'No' strings require explicit conversion before any logical operation.

The root cause is almost always upstream: CSV files have no schema enforcement, APIs return everything as strings, and Excel silently interprets dates in locale-dependent formats. By the time data reaches your DataFrame, the damage is done. Your job is to detect and fix it systematically.

The most insidious variant is partial type corruption — a column that is 99.9% numeric but contains a handful of string values like 'N/A' or '#REF!'. Pandas silently stores the entire column as object dtype, and your aggregation returns wrong results with no error message.

io/thecodeforge/cleaning/types.py · PYTHON
import pandas as pd
import numpy as np


class TypeFixer:
    """Fix data type issues in DataFrames.

    Usage:
        fixer = TypeFixer()
        df = fixer.fix_numeric_strings(df, ['revenue', 'price'])
        df = fixer.fix_dates(df, ['order_date', 'ship_date'])
        df = fixer.fix_booleans(df, ['is_active', 'has_subscription'])
        df = fixer.fix_categoricals(df, ['country', 'status'])
    """

    @staticmethod
    def fix_numeric_strings(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert string columns that should be numeric.

        Handles currency symbols, commas, whitespace, and common
        placeholder strings like 'N/A', '-', 'null', '#REF!'.
        """
        for col in columns:
            original_nulls = df[col].isna().sum()
            cleaned = df[col].astype(str)
            cleaned = cleaned.str.replace(r"[$€£,]", "", regex=True)
            cleaned = cleaned.str.strip()
            cleaned = cleaned.replace(
                {"": np.nan, "N/A": np.nan, "n/a": np.nan,
                 "-": np.nan, "null": np.nan, "None": np.nan,
                 "nan": np.nan, "#REF!": np.nan, "#N/A": np.nan}
            )
            df[col] = pd.to_numeric(cleaned, errors="coerce")
            new_nulls = df[col].isna().sum() - original_nulls
            print(
                f"{col}: converted to numeric. "
                f"{new_nulls} new NaN values from failed coercion."
            )
        return df

    @staticmethod
    def fix_dates(
        df: pd.DataFrame, columns: list, dayfirst: bool = False,
        fmt: str = None
    ) -> pd.DataFrame:
        """Parse date columns with explicit format handling.

        Args:
            fmt: explicit strftime format string (e.g., '%Y-%m-%d').
                 If None, pandas infers the format — risky in production.
        """
        for col in columns:
            if fmt:
                df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")
            else:
                df[col] = pd.to_datetime(
                    df[col], dayfirst=dayfirst, errors="coerce"
                )
            failed = df[col].isna().sum()
            print(f"{col}: parsed as datetime. {failed} values are NaT.")
            # Sanity check: flag future dates
            future = (df[col] > pd.Timestamp.now()).sum()
            if future > 0:
                print(f"  WARNING: {future} dates are in the future.")
        return df

    @staticmethod
    def fix_booleans(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert string booleans to actual booleans."""
        true_values = {"yes", "true", "1", "y", "on", "t"}
        false_values = {"no", "false", "0", "n", "off", "f"}

        for col in columns:
            lower = df[col].astype(str).str.lower().str.strip()
            df[col] = lower.map(
                lambda x: (
                    True if x in true_values
                    else (False if x in false_values else np.nan)
                )
            )
            unmapped = df[col].isna().sum()
            print(f"{col}: converted to boolean. {unmapped} unmapped values.")
        return df

    @staticmethod
    def fix_categoricals(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert high-memory object columns to categorical dtype."""
        for col in columns:
            before_mem = df[col].memory_usage(deep=True) / 1e6
            df[col] = df[col].astype("category")
            after_mem = df[col].memory_usage(deep=True) / 1e6
            if before_mem > 0:
                reduction = (1 - after_mem / before_mem) * 100
                print(
                    f"{col}: {before_mem:.2f} MB -> {after_mem:.2f} MB "
                    f"({reduction:.0f}% reduction)"
                )
            else:
                print(f"{col}: converted to category dtype.")
        return df
🔥 Memory Impact of Type Fixes
Converting a string column with 1 million rows and 50 unique values to category dtype typically reduces memory from ~50 MB to ~1 MB — a 98% reduction. For numeric columns, downcasting float64 to float32 halves memory usage with negligible precision loss for most applications. These savings compound quickly: a 20-column DataFrame can shrink from 2 GB to 200 MB, which is the difference between fitting in a Lambda function's memory or crashing at runtime.
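These numbers are easy to verify directly. A small self-contained check, using a hypothetical status column of 4 unique strings over one million rows:

```python
import pandas as pd

# One million rows, only 4 unique strings: typical of a status column
col = pd.Series(["active", "churned", "trial", "paused"] * 250_000)

# deep=True counts the actual string payloads, not just the 8-byte pointers
as_object = col.memory_usage(deep=True)
as_category = col.astype("category").memory_usage(deep=True)

print(f"object:   {as_object / 1e6:.1f} MB")
print(f"category: {as_category / 1e6:.1f} MB")
```

The category version stores each row as a small integer code plus one shared table of the 4 unique strings, which is where the order-of-magnitude saving comes from.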
📊 Production Insight
Date parsing failures are completely silent — NaT values do not raise exceptions, and downstream code happily filters them out or treats them as null without warning.
Always compare the count of successfully parsed dates against the total row count. If 5% of dates failed parsing, that is a data quality bug, not an acceptable loss.
Rule: specify date formats explicitly in production pipelines. Never rely on pandas auto-inference, which changes behavior across pandas versions and produces different results on ambiguous inputs like '01/02/2025'.
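A minimal sketch of that rule, using an illustrative batch where two values do not match the expected format:

```python
import pandas as pd

# Hypothetical batch: the last two values do not match the expected format.
raw = pd.Series(["2025-01-02", "2025-02-03", "03/04/2025", "not a date"])

# Explicit format + errors="coerce": failures become NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
failed = int(parsed.isna().sum())
print(f"parsed {len(raw) - failed}/{len(raw)} dates; {failed} failed")
```

Comparing `failed` against a tolerance (e.g., 1% of rows) turns silent NaT loss into an explicit data quality check.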
🎯 Key Takeaway
Wrong data types cause silent aggregation failures — pandas concatenates strings instead of summing numbers, with no error.
Clean strings before converting: strip whitespace, remove currency symbols, replace placeholders.
Convert low-cardinality object columns to category dtype for 90%+ memory savings.
Always specify explicit date formats in production. Auto-inference is a ticking time bomb.

Handling Outliers

Outliers distort model training and summary statistics. A single salary of $10M in a dataset of $50K–$150K salaries will shift the mean by thousands of dollars and stretch the scale for distance-based algorithms.

But the key decision is whether an outlier is an error (remove it), a genuine extreme value (keep or cap it), or the actual signal you are trying to detect (embrace it). Domain knowledge drives this decision, not statistical rules alone.

The IQR method and z-score are detection tools, not decision tools. They flag candidates. A human — or a well-defined business rule — decides what to do with each candidate. Blindly removing everything beyond 1.5×IQR is a recipe for deleting the most informative data points in your dataset.

io/thecodeforge/cleaning/outliers.py · PYTHON
import pandas as pd
import numpy as np


class OutlierHandler:
    """Detect and handle outliers using multiple methods.

    Usage:
        handler = OutlierHandler()
        mask, lower, upper = handler.detect_iqr(df, 'salary')
        mask = handler.detect_zscore(df, 'salary', threshold=3.0)
        df = handler.cap_outliers(df, 'salary', lower=20000, upper=500000)
        df = handler.log_transform(df, 'revenue')
    """

    @staticmethod
    def detect_iqr(
        df: pd.DataFrame, col: str, factor: float = 1.5
    ) -> tuple:
        """Flag outliers using the IQR method.

        Returns:
            Tuple of (boolean mask, lower bound, upper bound).
        """
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - factor * iqr
        upper = q3 + factor * iqr

        outlier_mask = (df[col] < lower) | (df[col] > upper)
        count = outlier_mask.sum()
        print(f"{col}: {count} outliers ({count/len(df)*100:.1f}%) outside [{lower:.2f}, {upper:.2f}]")
        return outlier_mask, lower, upper

    @staticmethod
    def detect_zscore(
        df: pd.DataFrame, col: str, threshold: float = 3.0
    ) -> pd.Series:
        """Flag outliers using z-score method."""
        mean = df[col].mean()
        std = df[col].std()
        if std == 0:
            print(f"{col}: standard deviation is 0 — all values are identical. No outliers.")
            return pd.Series(False, index=df.index)

        z_scores = np.abs((df[col] - mean) / std)
        outlier_mask = z_scores > threshold
        count = outlier_mask.sum()
        print(f"{col}: {count} outliers with |z-score| > {threshold}")
        return outlier_mask

    @staticmethod
    def cap_outliers(
        df: pd.DataFrame, col: str,
        lower: float | None = None, upper: float | None = None
    ) -> pd.DataFrame:
        """Cap outlier values to specified bounds (winsorization).

        If bounds are not specified, defaults to 1st and 99th percentiles.
        """
        if lower is None:
            lower = df[col].quantile(0.01)
        if upper is None:
            upper = df[col].quantile(0.99)

        before = df[col].copy()
        df[col] = df[col].clip(lower=lower, upper=upper)
        # NaN != NaN evaluates True, so exclude missing values from the count.
        capped = ((before != df[col]) & before.notna()).sum()
        print(f"{col}: capped {capped} values to [{lower:.2f}, {upper:.2f}]")
        return df

    @staticmethod
    def log_transform(df: pd.DataFrame, col: str) -> pd.DataFrame:
        """Apply log1p transform to compress right-skewed distributions.

        Handles zeros safely; columns containing negative values are
        shifted up automatically before the transform.
        """
        if (df[col] < 0).any():
            min_val = df[col].min()
            print(f"{col}: contains negative values. Shifting by {abs(min_val):.2f} before log transform.")
            df[col] = np.log1p(df[col] - min_val)
        else:
            df[col] = np.log1p(df[col])
        print(f"{col}: log1p transform applied.")
        return df
⚠ Never Blindly Remove Outliers
In fraud detection, the outliers ARE the signal. In medical data, extreme lab values may indicate critical conditions. In e-commerce, a few whales who spend 100x the average drive a disproportionate share of revenue. Always consult domain experts before removing outliers. An outlier removal step that makes your model's R² look better on paper may be destroying the most important information in the dataset. If you cannot consult an expert, cap rather than remove — at least the record and its other features are preserved.
📊 Production Insight
Outlier thresholds calculated from training data must be saved and reused identically on new data at inference time.
Recalculating thresholds per batch means different data points get flagged each time, making your cleaning pipeline non-deterministic.
Rule: compute outlier bounds once on training data, serialize them alongside your scaler and model, then apply those fixed bounds to all future data. If the distribution genuinely shifts, that is a retraining trigger, not a reason to silently recalculate thresholds.
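A minimal sketch of that workflow; the bounds, file name, and salary values are illustrative:

```python
import json
import pandas as pd

# Bounds computed once on training data (e.g., from an IQR pass) and frozen.
train_bounds = {"salary": [21500.0, 187300.0]}
with open("outlier_bounds.json", "w") as f:
    json.dump(train_bounds, f)

# At inference time: load the saved bounds and apply them unchanged.
with open("outlier_bounds.json") as f:
    bounds = json.load(f)

new_batch = pd.DataFrame({"salary": [18_000.0, 95_000.0, 2_000_000.0]})
lo, hi = bounds["salary"]
new_batch["salary"] = new_batch["salary"].clip(lo, hi)
print(new_batch["salary"].tolist())  # [21500.0, 95000.0, 187300.0]
```

Because the bounds are loaded rather than recalculated, every batch is capped against the same thresholds — the pipeline stays deterministic.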
🎯 Key Takeaway
Outliers are either errors or signals — domain knowledge decides which.
Cap extreme values instead of deleting them whenever possible. Removal discards information; capping preserves the record.
Save outlier thresholds from training data and reuse them deterministically on all future batches.
Outlier Handling Strategy
If: Outlier is clearly a data entry error (age = 500, negative price, temperature of 9999°F)
→ Use: Set to NaN and treat as a missing value. Investigate and fix the upstream source that produced it.
If: Outlier is plausible but extreme (billionaire income in salary data, 100-unit purchase in a mostly single-unit store)
→ Use: Cap at a domain-reasonable threshold (e.g., 99th percentile) or use robust scaling (RobustScaler). Preserve the record.
If: Outlier is the signal of interest (fraud transactions, anomalous sensor readings, rare diseases)
→ Use: Keep it. Use algorithms robust to outliers (tree-based models, isolation forest). Removing it defeats the purpose.
If: Distribution is heavily skewed but all values are valid (income, page views, time on site)
→ Use: Apply log1p or Box-Cox transformation to compress the tail instead of removing values. This preserves information while reducing the impact of extremes.

Encoding Categorical Variables

Machine learning models require numeric input. Categorical variables — country, product type, color, status — must be converted to numbers through encoding. This is a preprocessing step, not cleaning, but it is where beginners make some of their most costly mistakes.

The encoding strategy depends on two factors: cardinality (how many unique values the column has) and whether those values have a natural order. Choosing the wrong method can create thousands of useless sparse columns, introduce false ordinal relationships, or leak target information into features.

The most common mistake: one-hot encoding a column with 10,000 unique values. This creates 9,999 new binary columns, most of which are zero for any given row. Memory explodes, training crawls, and linear models become impossible to fit.

io/thecodeforge/cleaning/encoding.py · PYTHON
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class CategoricalEncoder:
    """Encode categorical variables for model consumption.

    Usage:
        encoder = CategoricalEncoder()
        df = encoder.encode_onehot(df, ['color', 'size'])
        df = encoder.encode_ordinal(df, 'education', ['high_school', 'bachelors', 'masters', 'phd'])
        df = encoder.encode_frequency(df, ['city', 'product_id'])
    """

    @staticmethod
    def encode_onehot(
        df: pd.DataFrame, columns: list, max_cardinality: int = 10
    ) -> pd.DataFrame:
        """One-hot encode low-cardinality columns.

        Skips columns that exceed max_cardinality and prints a warning.
        Uses drop_first=True to avoid multicollinearity in linear models.
        """
        for col in columns:
            cardinality = df[col].nunique()
            if cardinality > max_cardinality:
                print(
                    f"SKIPPED: {col} has {cardinality} unique values. "
                    f"Use frequency or target encoding instead."
                )
                continue

            dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
            df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
            print(f"{col}: one-hot encoded into {len(dummies.columns)} columns")
        return df

    @staticmethod
    def encode_ordinal(
        df: pd.DataFrame, column: str, order: list
    ) -> pd.DataFrame:
        """Encode ordered categories (e.g., Low < Medium < High).

        Unmapped values are set to NaN so they surface as missing rather
        than silently receiving an arbitrary integer.
        """
        mapping = {val: idx for idx, val in enumerate(order)}
        unmapped = set(df[column].dropna().unique()) - set(mapping.keys())
        if unmapped:
            print(f"WARNING: {column} has unmapped values: {unmapped}")
        df[column] = df[column].map(mapping)
        print(f"{column}: ordinally encoded with order {order}")
        return df

    @staticmethod
    def encode_frequency(
        df: pd.DataFrame, columns: list
    ) -> pd.DataFrame:
        """Replace categories with their frequency in the training data.

        Safe for high-cardinality features. Does not create new columns.
        """
        for col in columns:
            freq_map = df[col].value_counts(normalize=True).to_dict()
            df[col] = df[col].map(freq_map)
            print(f"{col}: frequency encoded ({len(freq_map)} categories)")
        return df

    @staticmethod
    def encode_label(
        df: pd.DataFrame, columns: list
    ) -> tuple:
        """Label encode β€” safe for tree-based models only.

        WARNING: creates false ordinal relationships for linear models.
        Returns the DataFrame and a dict of fitted LabelEncoders for
        inverse transformation and production reuse.
        """
        encoders = {}
        for col in columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            encoders[col] = le
            print(f"{col}: label encoded ({len(le.classes_)} classes)")
        return df, encoders
Mental Model
Encoding Decision Framework
The encoding method depends on two factors: cardinality and whether order matters. Get this wrong and you either explode your feature space or teach your model fake relationships.
  • Low cardinality (< 10 unique values), no natural order: One-hot encoding. Creates N-1 binary columns. Standard for most classification tasks.
  • Low cardinality, natural order exists: Ordinal encoding. Maps to integers preserving order (Low=0, Medium=1, High=2). Use for education levels, satisfaction ratings, size categories.
  • High cardinality (> 50 unique values): Target encoding (mean of target per category) or frequency encoding. One-hot creates too many sparse columns and kills performance.
  • Tree-based models (Random Forest, XGBoost, LightGBM): Label encoding works fine. Trees split on thresholds, so the arbitrary integer assignment does not create false distance relationships.
  • Linear models, SVMs, and neural networks: One-hot or target encoding only. Label encoding introduces false ordinal relationships (category 5 is not 'more' than category 2).
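Target encoding with smoothing is mentioned above but not implemented in the class. Here is a hedged sketch of one common formulation — the helper name, column names, and smoothing constant are illustrative, and the mapping must be fitted on training data only:

```python
import pandas as pd

def target_encode_smoothed(train: pd.DataFrame, col: str, target: str,
                           smoothing: float = 10.0) -> dict:
    """Blend each category's target mean with the global mean, weighted
    by category frequency. Fit on training data only to avoid leakage."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    return (weight * stats["mean"] + (1 - weight) * global_mean).to_dict()

train = pd.DataFrame({
    "city": ["a", "a", "a", "b", "c"],
    "churn": [1, 1, 0, 1, 0],
})
mapping = target_encode_smoothed(train, "city", "churn", smoothing=2.0)
train["city_te"] = train["city"].map(mapping)
# Rare categories ('b', 'c') are pulled toward the global mean of 0.6.
```

The `smoothing` constant controls how strongly rare categories shrink toward the global mean; larger values mean more shrinkage.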
📊 Production Insight
One-hot encoding a column with 10,000 unique values creates 9,999 new columns, most of which are zero for any given row.
This explodes memory, makes linear models intractable, and adds zero predictive signal for rare categories.
Rule: always check df['col'].nunique() before encoding. If cardinality exceeds 50, use frequency encoding, target encoding, or hash encoding. If cardinality exceeds 10,000, strongly consider whether this feature should be decomposed (e.g., zip code → state) or dropped entirely.
🎯 Key Takeaway
Encoding strategy depends on cardinality and model type — there is no universal best method.
One-hot for low cardinality, frequency or target encoding for high cardinality.
Label encoding is safe for trees but creates false ordinal relationships in distance-based and linear models.
Always save the encoder or mapping. You need the exact same transformation at inference time.

Feature Scaling

Features on different scales — age in years (0–100) vs. income in dollars (0–10,000,000) — bias distance-based models and slow gradient descent convergence. KNN will treat a $1 difference in income as equivalent to a 1-year difference in age, which makes no sense.

Scaling normalizes feature ranges without changing the distribution shape. It does not remove outliers, fix missing values, or encode categories. It is a pure preprocessing step that should happen after cleaning, after encoding, and after the train/test split.

The critical rule that most beginners violate: fit the scaler on training data only. If you fit on the full dataset, the scaler's mean and standard deviation incorporate test set information, which leaks future knowledge into training. The metrics look artificially good, and the model underperforms on genuinely unseen data.

io/thecodeforge/cleaning/scaling.py · PYTHON
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler


class FeatureScaler:
    """Scale numeric features for model consumption.

    Usage:
        scaler_util = FeatureScaler()
        df_train, scaler = scaler_util.scale_standard(df_train, num_cols)
        df_test[num_cols] = scaler.transform(df_test[num_cols])  # same scaler!
    """

    @staticmethod
    def scale_standard(
        df: pd.DataFrame, columns: list
    ) -> tuple:
        """Z-score normalization: mean=0, std=1.

        Use for normally distributed features with linear models,
        logistic regression, SVMs, and PCA.
        """
        scaler = StandardScaler()
        df[columns] = scaler.fit_transform(df[columns])
        print(f"StandardScaler applied to {len(columns)} columns")
        return df, scaler

    @staticmethod
    def scale_minmax(
        df: pd.DataFrame, columns: list,
        feature_range: tuple = (0, 1)
    ) -> tuple:
        """Min-max scaling to a fixed range.

        Use for neural networks (which expect bounded inputs) and
        features with known theoretical bounds.
        """
        scaler = MinMaxScaler(feature_range=feature_range)
        df[columns] = scaler.fit_transform(df[columns])
        print(
            f"MinMaxScaler applied to {len(columns)} columns, "
            f"range={feature_range}"
        )
        return df, scaler

    @staticmethod
    def scale_robust(df: pd.DataFrame, columns: list) -> tuple:
        """Scale using median and IQR instead of mean and std.

        Use when data has outliers that would distort StandardScaler.
        Outliers still exist after scaling but no longer dominate the range.
        """
        scaler = RobustScaler()
        df[columns] = scaler.fit_transform(df[columns])
        print(
            f"RobustScaler applied to {len(columns)} columns "
            f"(outlier-resistant)"
        )
        return df, scaler
💡 When to Scale and When Not To
  • Needs scaling: KNN, SVM, Neural Networks, PCA, Logistic Regression, Linear Regression (especially with regularization — L1/L2 penalties are scale-dependent).
  • Does NOT need scaling: Decision Trees, Random Forests, XGBoost, LightGBM, CatBoost, Naive Bayes. These models split on thresholds, not distances.
  • Always scale after the train/test split — fit the scaler on training data only, then transform both train and test with the same fitted scaler.
  • Save the scaler object with joblib.dump(). You must apply the exact same transformation to new data at inference time. A different scaler = different feature space = garbage predictions.
  • If you use an sklearn Pipeline, the scaler is automatically fitted only on training data during pipeline.fit() — this is one of the strongest reasons to use Pipelines.
📊 Production Insight
Fitting the scaler on the full dataset (including test data) causes data leakage.
The scaler computes mean and standard deviation using test data points, so the model indirectly sees test set statistics during training. This inflates accuracy by 1–5% on typical tabular datasets — enough to make a mediocre model look good in evaluation but fail in production.
Rule: fit scaler on training data only. Use scaler.transform() on test and production data. If you ever call scaler.fit_transform() on test data, you have a data leakage bug.
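The rule in three lines — a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric feature matrix.
X = np.random.default_rng(42).normal(loc=50, scale=10, size=(200, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on TRAIN only
X_test_s = scaler.transform(X_test)        # reuse the fitted statistics

# The learned means come from training rows alone -- no test leakage.
assert np.allclose(scaler.mean_, X_train.mean(axis=0))
```

If the final assertion ever fails, something refitted the scaler after the split — exactly the bug this rule exists to prevent.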
🎯 Key Takeaway
Scaling is required for distance-based and gradient-based models but unnecessary for tree-based models.
Use RobustScaler when outliers are present — it uses median and IQR instead of mean and std.
Always fit the scaler on training data only. Serialize it and ship it with the model to production.

Building a Reproducible Cleaning Pipeline

Manual cleaning in a notebook is fine for exploration. Production requires a reproducible pipeline that applies the exact same transformations to new data consistently, without human intervention, without remembering which cells to run in which order.

Scikit-learn's Pipeline and ColumnTransformer make this possible. They encapsulate the entire sequence of transformations — imputation, encoding, scaling — into a single object that can be fitted on training data, serialized to disk, and loaded at inference time. The pipeline remembers every parameter: the median used for imputation, the mean and std used for scaling, the category set used for encoding.

This is not optional infrastructure for large teams. It is essential even for solo projects. The moment you retrain a model or deploy it to a different environment, you need the pipeline. Without it, you will spend hours reverse-engineering what transformations were applied and in what order.

io/thecodeforge/cleaning/pipeline.py · PYTHON
import pandas as pd
import numpy as np
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer


class DataCleaningPipeline:
    """Reproducible data cleaning pipeline for production use.

    Encapsulates imputation, encoding, and scaling into a single
    serializable object. Fit once on training data, transform everything
    else with the same fitted parameters.

    Usage:
        pipeline = DataCleaningPipeline(
            numeric_cols=['age', 'income', 'score'],
            categorical_cols=['city', 'status', 'tier']
        )
        X_train_clean = pipeline.fit_transform(X_train)
        X_test_clean = pipeline.transform(X_test)
        pipeline.save('preprocessing_pipeline.pkl')

        # At inference time:
        loaded = DataCleaningPipeline.load('preprocessing_pipeline.pkl')
        X_new_clean = loaded.transform(X_new)
    """

    def __init__(self, numeric_cols: list, categorical_cols: list):
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols
        self.pipeline = self._build_pipeline()
        self._is_fitted = False

    def _build_pipeline(self) -> ColumnTransformer:
        numeric_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ])

        categorical_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("encoder", OneHotEncoder(
                handle_unknown="ignore", sparse_output=False
            )),
        ])

        preprocessor = ColumnTransformer(transformers=[
            ("numeric", numeric_transformer, self.numeric_cols),
            ("categorical", categorical_transformer, self.categorical_cols),
        ])

        return preprocessor

    def fit(self, X_train: pd.DataFrame) -> "DataCleaningPipeline":
        """Fit all transformers on training data only."""
        self.pipeline.fit(X_train)
        self._is_fitted = True
        print(f"Pipeline fitted on {len(X_train):,} rows, "
              f"{len(self.numeric_cols)} numeric + "
              f"{len(self.categorical_cols)} categorical columns.")
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """Transform new data using the fitted pipeline."""
        if not self._is_fitted:
            raise RuntimeError(
                "Pipeline is not fitted. Call .fit(X_train) first."
            )
        return self.pipeline.transform(X)

    def fit_transform(self, X_train: pd.DataFrame) -> np.ndarray:
        """Fit and transform in one step. Use on training data."""
        self.fit(X_train)
        return self.pipeline.transform(X_train)

    def save(self, filepath: str) -> None:
        """Serialize the fitted pipeline to disk."""
        if not self._is_fitted:
            raise RuntimeError(
                "Cannot save an unfitted pipeline. Call .fit() first."
            )
        joblib.dump(self, filepath)
        print(f"Pipeline saved to {filepath}")

    @staticmethod
    def load(filepath: str) -> "DataCleaningPipeline":
        """Load a previously saved pipeline."""
        pipeline = joblib.load(filepath)
        print(f"Pipeline loaded from {filepath}")
        return pipeline
⚠ Pipeline Immutability Is Non-Negotiable
Once a pipeline is fitted, it encodes the training data's statistics — the median for imputation, the mean and std for scaling, the set of known categories for encoding. Never refit the pipeline on new data after deployment. If the data distribution shifts significantly, that is a retraining trigger: retrain the entire pipeline and model together on a fresh training set, validate, and redeploy. Refitting the pipeline alone while keeping the old model creates a mismatch between preprocessing assumptions and model weights.
📊 Production Insight
Notebook-based cleaning breaks the moment new data has a column your notebook did not anticipate, or a category the encoder has never seen.
A scikit-learn Pipeline handles unseen categories gracefully with handle_unknown='ignore', and it enforces the correct fit-then-transform order by design.
Rule: always encapsulate cleaning logic in a Pipeline object. Serialize it with joblib.dump(), version it alongside your model artifacts, and load it at inference time. The pipeline and the model are a unit — they ship together or not at all.
🎯 Key Takeaway
Manual notebook cleaning does not scale and does not reproduce. Build a scikit-learn Pipeline.
Fit all transformers on training data only, then transform everything else with the fitted pipeline.
Serialize the pipeline with joblib and version it as a first-class artifact. It must travel with the model to every environment — dev, staging, and production.
🗂 Scaling Methods Comparison
Choose the right scaler based on your data characteristics and model type.
Scaler | Formula | Outlier Resistant | Output Range | Best For
StandardScaler | (x − mean) / std | No | Unbounded (≈ −3 to 3 for normal data) | Normally distributed features, linear models, logistic regression, SVM
MinMaxScaler | (x − min) / (max − min) | No | [0, 1] or custom range | Neural networks, image pixel values, features with known theoretical bounds
RobustScaler | (x − median) / IQR | Yes | Unbounded | Data with outliers, skewed distributions, salary/revenue columns
MaxAbsScaler | x / max(|x|) | No | [−1, 1] | Sparse data (preserves zero entries), already centered features
Log Transform | log(x + 1) | Partially | [0, ∞) | Right-skewed distributions (income, counts, page views, time on site)

🎯 Key Takeaways

  • Inspect data with df.shape, df.dtypes, df.isnull().sum(), df.duplicated().sum(), and df.describe() before writing a single line of cleaning code.
  • Understand why data is missing before choosing an imputation strategy — the missingness mechanism (MCAR, MAR, MNAR) matters more than the imputation method.
  • Deduplicate on business-meaningful composite keys, not on system-generated surrogate IDs that are unique by design.
  • Always split data into train/test before fitting scalers, encoders, or imputers — fitting on the full dataset causes data leakage that inflates metrics by 1–5%.
  • Check cardinality with df['col'].nunique() before one-hot encoding. More than 50 unique values demands frequency encoding, target encoding, or decomposition.
  • Build a reproducible scikit-learn Pipeline that encapsulates the entire cleaning and preprocessing sequence in a single serializable object.
  • Serialize the preprocessing pipeline with joblib and version it alongside the model. They are a unit — shipping one without the other guarantees production failures.

⚠ Common Mistakes to Avoid

    ✕ Dropping all rows with any missing values
    Symptom

    Dataset shrinks from 100,000 rows to 20,000 rows after a single df.dropna() call. The model trains on a biased subset because missingness is often correlated with specific groups — younger users, mobile traffic, free-tier accounts.

    Fix

    Check df.dropna().shape first. If more than 10% of rows would be lost, impute instead. Use df.isnull().sum() to identify which columns drive the loss — often it is one or two columns dragging everything down. Add boolean indicator columns for missingness before imputing so the model can learn from the pattern.
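    A minimal sketch of the indicator-column fix; the column name and values are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000.0, np.nan, 61_000.0, np.nan, 48_000.0]})

# Record missingness BEFORE imputing so the pattern stays learnable.
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

print(df["income_missing"].tolist())  # [0, 1, 0, 1, 0]
```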

    ✕ Fitting the scaler or encoder on the full dataset before the train/test split
    Symptom

    Model evaluation metrics are artificially inflated by 1–5%. The scaler leaks test set statistics (mean, std, min, max) into the training process. The model looks great in evaluation but underperforms on genuinely unseen production data.

    Fix

    Always split first, then fit transformers on X_train only. Use scaler.transform(X_test) — never scaler.fit_transform(X_test). Wrap everything in a scikit-learn Pipeline to enforce this automatically. If you catch yourself calling .fit() on anything other than training data, stop — that is a leakage bug.

    ✕ One-hot encoding a column with thousands of unique values
    Symptom

    DataFrame explodes from 20 columns to 5,000+ columns. Memory usage spikes by 10x or more. Linear model training becomes impossibly slow or crashes. Most new columns are nearly all zeros and add no predictive signal.

    Fix

    Check cardinality with df['col'].nunique() before encoding. For features with more than 50 unique values, use frequency encoding, target encoding, or hash encoding. For features with more than 10,000 values, consider decomposing (zip code → state) or dropping the feature entirely if it does not carry enough predictive signal.

    ✕ Applying log transform to columns containing zeros or negative values
    Symptom

    RuntimeWarning: divide by zero encountered in log. Resulting values are NaN or -inf, silently corrupting the entire feature column. Downstream models train on garbage without raising an error.

    Fix

    Use np.log1p(x), which computes log(1 + x) and handles zeros safely. For negative values, shift the distribution first: np.log1p(x - x.min()). For columns that can be both positive and negative, use the Yeo-Johnson transform from sklearn.preprocessing.PowerTransformer, which handles all real values.
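    A short sketch of the Yeo-Johnson option for columns that mix negative, zero, and positive values; the sample values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# log1p alone would produce NaN for the negative entry.
x = np.array([[-250.0], [0.0], [120.0], [4_000.0], [95_000.0]])

pt = PowerTransformer(method="yeo-johnson")  # handles all real values
x_t = pt.fit_transform(x)

assert np.isfinite(x_t).all()  # no NaN or -inf corruption
```

Like any fitted transformer, the `pt` object must be saved and reused at inference time rather than refitted per batch.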

    ✕ Not saving cleaning parameters for production inference
    Symptom

    Model works perfectly in the training notebook but produces errors, NaN features, or garbage predictions on new data in production. Missing value fill values, scaler parameters, encoder mappings, and outlier thresholds are lost when the notebook session ends.

    Fix

    Serialize the entire preprocessing pipeline with joblib.dump(pipeline, 'preprocessor.pkl'). Load it at inference time with joblib.load('preprocessor.pkl'). The pipeline object stores all fitted parameters internally. Version this artifact alongside the model file — they are inseparable.

    ✕ Removing outliers using a global threshold without considering per-feature distributions
    Symptom

    Valid extreme values in one feature (CEO salary of $5M) are removed because they exceed a threshold calculated from a completely different feature (intern salary). Or: a z-score threshold of 3 flags 10% of a heavily skewed feature, removing real data.

    Fix

    Apply outlier detection per feature, not globally. Use IQR or z-score calculated on each column individually. Consider the business context: is this value plausible for this specific feature? For skewed distributions, transform first (log, Box-Cox), then apply outlier detection on the transformed scale.

Interview Questions on This Topic

  • Q (Mid-level): You receive a dataset where 35% of values in a critical feature column are missing. How do you decide what to do?
    First, I would investigate the missingness mechanism. I would check whether missingness correlates with other features (MAR) or with the target variable (MNAR) by comparing distributions of rows with and without missing values — for example, comparing the mean target value when the feature is present versus absent. If there is no significant difference, it is likely MCAR, and I can impute with median for numeric or mode for categorical data, adding a boolean indicator column to preserve the missingness signal. If missingness is systematic, I would investigate the upstream data source to understand why 35% of records lack this value. Is it a recently added field? Is it optional in the source form? Does a specific data provider fail to supply it? If more than 50% were missing, I would consider dropping the column. At 35%, I would likely impute — but I would use a method appropriate to the mechanism, potentially KNNImputer or IterativeImputer for MAR data rather than simple mean/median. I would never blindly drop rows — losing 35% of data almost always introduces more bias than imputation.
  • Q (Junior): What is the difference between data cleaning and data preprocessing?
    Data cleaning fixes quality issues in raw data: removing duplicates, correcting data entry errors, handling missing values, fixing inconsistent formats and string representations, and validating that values fall within domain-valid ranges. Data preprocessing transforms clean data into a format suitable for modeling: encoding categorical variables as numbers, scaling numeric features to comparable ranges, creating derived features through feature engineering, and splitting into train/test sets. Cleaning removes noise and errors. Preprocessing adds structure and prepares data for a specific algorithm's requirements. You must clean before you preprocess — feeding dirty data into a scaler or encoder produces well-formatted garbage. The output looks numeric and properly scaled, but the underlying values are meaningless.
  • Q: How would you handle a categorical feature with 10,000 unique values for a linear regression model? (Senior)
    One-hot encoding would create 9,999 sparse columns, making the model intractable and likely overfitting to noise. I would consider several alternatives, in order of preference:
    1. Target encoding: replace each category with the mean of the target variable for that category, computed on training data only. To prevent overfitting on rare categories, I would apply smoothing β€” blending the category mean with the global mean, weighted by category frequency.
    2. Frequency encoding: replace each category with its relative frequency in the training data. Simple, no target leakage risk, but loses signal about the category-target relationship.
    3. Feature decomposition: if the categorical represents something hierarchical (like zip code), decompose it into higher-level groupings (state, region) that have lower cardinality.
    4. Grouping rare categories: bucket categories appearing fewer than N times into an 'Other' category, reducing effective cardinality.
    I would also question whether a feature with 10,000 categories is actually useful for a linear model at all β€” sometimes a tree-based model is a better fit for the data structure.
  • Q: Why must you fit scalers on training data only and not the full dataset? (Mid-level)
    Fitting a scaler on the full dataset leaks information from the test set into the training process. The scaler computes mean and standard deviation using test data points, so when training data is transformed, those statistics incorporate knowledge of the test distribution. The model indirectly sees test set characteristics during training. This produces optimistically biased evaluation metrics β€” accuracy, F1, RMSE all look better than they will be on truly unseen data. The magnitude is typically 1–5% on tabular datasets, but I have seen cases where leakage inflated AUC by 10+ points on small, high-variance datasets. The correct approach is: split first, fit the scaler on X_train only, then call scaler.transform() on both X_train and X_test. In production, the same fitted scaler object β€” serialized with joblib β€” must be loaded and applied to new incoming data. If you use a scikit-learn Pipeline, this ordering is enforced automatically, which is one of the strongest arguments for using Pipelines.
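The split-first ordering described above can be sketched in a few lines. This is a minimal illustration, not a full pipeline β€” the `income` column and all values are made up, and the Pipeline is what enforces the fit-on-train-only rule automatically:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Hypothetical numeric feature with ~10% missing values
rng = np.random.default_rng(0)
X = pd.DataFrame({"income": rng.normal(60_000, 15_000, 200)})
X.loc[X.sample(frac=0.1, random_state=0).index, "income"] = np.nan
y = rng.integers(0, 2, 200)

# 1. Split FIRST, so no test-set statistics can leak into fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. The Pipeline guarantees transformers are fitted on train data only
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])
X_train_t = prep.fit_transform(X_train)  # learns median, mean, std from train
X_test_t = prep.transform(X_test)        # reuses the SAME fitted statistics
```

The `add_indicator=True` flag adds the boolean missingness column mentioned in the first answer, preserving the missingness signal alongside the imputed values.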

Frequently Asked Questions

Should I clean data before or after splitting into train/test sets?

Inspect and remove exact duplicates before splitting β€” duplicates that span both sets cause data leakage where the model memorizes training examples that also appear in the test set. For transformations that learn parameters (scaling, imputation with mean/median, encoding), always split first, then fit on training data only. The correct sequence is: (1) load raw data, (2) remove exact duplicates on the full dataset, (3) split into train and test, (4) fit cleaning and preprocessing transformers on train only, (5) transform both train and test using the fitted transformers.
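That five-step sequence looks like this in code. A minimal sketch with made-up data (one exact duplicate row included to show step 2):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# 1. Load raw data (hypothetical values; rows 1 and 2 are exact duplicates)
df = pd.DataFrame({
    "age": [25, 32, 32, None, 41],
    "plan": ["basic", "pro", "pro", "basic", "pro"],
})

# 2. Remove exact duplicates on the FULL dataset, before splitting
df = df.drop_duplicates()

# 3. Split into train and test
train, test = train_test_split(df, test_size=0.25, random_state=42)

# 4. Fit learned transformers on train only...
imputer = SimpleImputer(strategy="median")
train_age = imputer.fit_transform(train[["age"]])

# 5. ...then transform test with the same fitted object
test_age = imputer.transform(test[["age"]])
```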

How do I know if my data has outliers vs. just a skewed distribution?

Plot a histogram and a boxplot side by side. A skewed distribution has a long tail but continuous density β€” the values get sparser as you move along the tail, but there are no isolated points far from everything else. All values are plausible given the domain. Outliers are isolated points visibly disconnected from the main distribution, appearing as individual dots beyond the boxplot whiskers with no nearby values.

Use domain knowledge as the final arbiter: a salary of $500K is plausible for a CEO but clearly an error for an intern position. Statistical tests like IQR or z-score flag candidates, but a human β€” or a well-defined business rule β€” decides whether each candidate is an error, a valid extreme, or a signal.
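A minimal IQR-based candidate flagger, assuming illustrative salary values β€” note that it only produces candidates, and the final call still belongs to a human or a business rule:

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salaries = pd.Series([52_000, 61_000, 75_000, 58_000, 67_000,
                      71_000, 64_000, 500_000])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flags CANDIDATES only; domain knowledge decides error vs. valid extreme
candidates = salaries[(salaries < lower) | (salaries > upper)]
```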

Can I use the same cleaning pipeline for different datasets?

Only if the datasets share the exact same schema β€” same columns, same types, same expected range of category values. A pipeline fitted on one dataset encodes that dataset's specific statistics: the median for imputation, the mean and std for scaling, the known set of categories for one-hot encoding.

If a new dataset has different columns, the pipeline will crash. If it has new category values the encoder has never seen, OneHotEncoder with handle_unknown='ignore' will silently zero them out, which may or may not be acceptable. Never reuse a pipeline blindly without validating that the new data's schema matches the training data's schema.

What is the difference between imputing with mean vs. median?

Mean is sensitive to outliers β€” a single extreme value (a $10M salary among $50K–$150K salaries) pulls the mean dramatically and fills missing values with a number that represents nobody in the dataset. Median is robust β€” it represents the actual middle value regardless of extremes.

For normally distributed data with no outliers, mean and median are nearly identical and either works. For skewed distributions (income, prices, counts, time-on-site), median is almost always the better choice. In production, default to median unless you have verified the distribution is approximately symmetric. It is the safer choice because you cannot always guarantee your data is outlier-free β€” and a single outlier in a future batch can silently corrupt mean-imputed values.
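The effect of a single outlier on each statistic is easy to see numerically (salaries are illustrative):

```python
import numpy as np

# Four typical salaries plus one $10M outlier
salaries = np.array([50_000, 60_000, 70_000, 80_000, 10_000_000])

mean_fill = salaries.mean()        # dragged far above every typical value
median_fill = np.median(salaries)  # the actual middle observation
```

Here the mean is over $2M β€” a fill value that represents nobody β€” while the median stays at a typical salary.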

How often should I re-inspect and re-clean production data?

Continuously, via automated pipeline gates. Data quality is not a one-time event β€” upstream schemas change, new data sources are added, vendor APIs modify their formats, and seasonal patterns introduce values your original cleaning logic never anticipated.

At minimum, run automated inspection checks (row counts, missing percentages, value range validation, duplicate ratios) on every data batch before it enters your pipeline. Set thresholds that trigger alerts or halt processing. Re-evaluate your cleaning strategy quarterly or whenever model performance degrades β€” declining accuracy is often a data quality symptom, not a model architecture problem.
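A minimal sketch of such per-batch gate checks β€” the function name, thresholds, and the `age` range rule are all hypothetical and should come from your own baselines:

```python
import pandas as pd

def validate_batch(df: pd.DataFrame, max_missing_pct: float = 0.05,
                   max_dup_ratio: float = 0.01) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    problems = []
    if df.empty:
        problems.append("batch is empty")
        return problems
    # Missing-percentage gate across all cells
    missing_pct = df.isnull().sum().sum() / df.size
    if missing_pct > max_missing_pct:
        problems.append(f"missing ratio {missing_pct:.1%} exceeds {max_missing_pct:.0%}")
    # Duplicate-ratio gate
    dup_ratio = df.duplicated().sum() / len(df)
    if dup_ratio > max_dup_ratio:
        problems.append(f"duplicate ratio {dup_ratio:.1%} exceeds {max_dup_ratio:.0%}")
    # Hypothetical domain range rule: ages must be plausible
    if "age" in df.columns and not df["age"].dropna().between(0, 120).all():
        problems.append("age values outside [0, 120]")
    return problems

# A batch with an implausible age should be flagged before entering the pipeline
batch = pd.DataFrame({"age": [25, 34, 999], "plan": ["a", "b", "b"]})
issues = validate_batch(batch)
```

In a real pipeline these checks would run before every batch load, raising an alert or halting processing when `issues` is non-empty.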

Naren (Founder & Author)

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
