Data Cleaning and Preprocessing for Absolute Beginners
- Data cleaning fixes errors, fills gaps, and standardizes formats before analysis or modeling
- Missing values, duplicates, and inconsistent formats are the top 3 problems in raw data
- Use pandas for tabular data: df.isnull().sum() reveals missingness patterns immediately
- Encoding categoricals and scaling numerics are preprocessing steps, not cleaning
- Production rule: automate cleaning pipelines – manual cleaning breaks on new data batches
- Biggest mistake: dropping all rows with missing values instead of understanding why they are missing
- Always split before fitting transformers – data leakage from preprocessing is the #1 silent killer of model reliability
Production Debug Guide

Common signals that your data has cleaning problems, and exactly where to look:

- Unexpected NaN values appearing after type conversion: check the mix of stored types with df['column'].apply(type).value_counts(), then list the values that fail coercion with df[pd.to_numeric(df['column'], errors='coerce').isna()]['column'].unique().
- Date parsing produces NaT or wrong dates: inspect the raw strings with df['date_col'].head(20).tolist(), then parse explicitly with pd.to_datetime(df['date_col'], format='mixed', dayfirst=False).
- Memory usage is unexpectedly high: measure it with df.memory_usage(deep=True).sum() / 1e6, then find conversion candidates with df.select_dtypes('object').nunique().
- Value counts show suspiciously similar categories that should be the same: compare df['col'].value_counts().head(20) against [repr(v) for v in df['col'].unique()[:20]] to expose hidden whitespace, casing, and encoding differences.

Production Incident: run df.describe() and inspect min/max values for every numeric column. Apply domain-specific validation rules: age ∈ [0, 120], dates ≤ today, percentages ∈ [0, 100]. Flag impossible values as NaN and investigate the upstream source that produced them.

Most beginner tutorials skip data cleaning and jump straight to modeling. This is backwards. In production, 60–80% of a data scientist's time goes to cleaning and preprocessing, not model building.
Messy data produces confident but wrong results. A model trained on data with duplicates, missing values, or inconsistent formats will silently learn garbage patterns. The output looks plausible but is unreliable. You ship it, stakeholders act on it, and weeks later someone notices the numbers never made sense.
The core misconception is that cleaning is tedious but simple. It is not. Understanding why data is missing, how duplicates were created, and what format inconsistencies reveal about upstream systems requires real analytical thinking. A column full of NaN values might mean a sensor was offline, a user skipped an optional field, or an ETL job silently dropped records. Each cause demands a different response.
This guide covers every step from raw inspection through reproducible pipelines. The code is production-grade Python β not toy examples on iris or titanic. Every section includes the reasoning behind the technique, because knowing which tool to use matters far more than knowing the syntax.
Inspecting Your Data First
Never clean blindly. Start by understanding the shape, types, and distribution of your data. The first five commands you run on any new dataset reveal the scope of the problem and tell you where to focus your effort.
Inspection is not optional extra work – it is the diagnostic step that prevents you from applying the wrong fix. Imputing a column that should have been dropped, or scaling a column that is secretly categorical, are mistakes that only surface weeks later when production results look wrong.
```python
import pandas as pd
import numpy as np


def inspect_dataset(filepath: str) -> pd.DataFrame:
    """First-pass data inspection. Run this before any cleaning.

    Returns the raw DataFrame so you can continue exploring interactively.
    Prints a diagnostic summary covering shape, types, missingness,
    duplicates, numeric ranges, and categorical cardinality.
    """
    df = pd.read_csv(filepath)

    print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

    print("\n--- Column Types ---")
    print(df.dtypes.value_counts())

    print("\n--- Missing Values ---")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    missing_report = pd.DataFrame({"count": missing, "percent": missing_pct})
    missing_report = missing_report[missing_report["count"] > 0].sort_values(
        "percent", ascending=False
    )
    if missing_report.empty:
        print("No missing values detected.")
    else:
        print(missing_report)

    print("\n--- Duplicate Rows ---")
    dup_count = df.duplicated().sum()
    print(f"Exact duplicates: {dup_count} ({dup_count / len(df) * 100:.1f}%)")

    print("\n--- Numeric Summary ---")
    num_desc = df.describe().round(2)
    print(num_desc)

    print("\n--- Categorical Columns ---")
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    for col in cat_cols:
        unique_count = df[col].nunique()
        print(f"  {col}: {unique_count} unique values")
        if unique_count <= 10:
            print(f"    Values: {df[col].unique().tolist()}")
        elif unique_count <= 50:
            print(f"    Top 5: {df[col].value_counts().head(5).to_dict()}")
        else:
            print("    High cardinality - consider grouping or encoding")

    return df
```
- df.shape tells you volume – is this a 100-row problem or a 100M-row problem? The answer changes every downstream decision.
- df.dtypes reveals implicit type mismatches. Numbers stored as strings are the #1 source of silent aggregation errors.
- df.isnull().sum() shows missingness patterns – not all NaN is random. Correlated missingness signals systemic upstream failures.
- df.duplicated().sum() catches records that should not exist. Even 1% duplicates can swing aggregate metrics significantly.
- df.describe() exposes impossible min/max values (negative ages, future dates, prices of $0.00) that signal data corruption or upstream bugs.
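As a concrete companion to the last bullet, here is a minimal sketch (the column names and values are invented) that turns describe()'s suspicious min/max values into the actual rows you can hand back to the upstream team:

```python
import pandas as pd

# Hypothetical sample with two corrupt records: a negative age and a future date.
df = pd.DataFrame({
    "age": [34, 29, -3, 51],
    "signup_date": pd.to_datetime(
        ["2024-01-15", "2024-06-02", "2024-03-30", "2099-12-31"]
    ),
})

# Boolean masks convert a suspicious min/max into concrete rows to investigate.
bad_age = ~df["age"].between(0, 120)
future_date = df["signup_date"] > pd.Timestamp.now()

print(df[bad_age | future_date])
# The flagged rows go back to the upstream owner before any imputation.
```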
Handling Missing Values
Missing values are the most common data quality issue. The critical decision is not how to fill them, but whether to fill them at all – and that depends on understanding why they are missing in the first place.
Statisticians categorize missingness into three types: MCAR (Missing Completely at Random), MAR (Missing at Random, conditional on other observed data), and MNAR (Missing Not at Random, where the value itself determines whether it is recorded). Each type demands a different strategy, and applying the wrong one introduces bias that no amount of hyperparameter tuning will fix.
The practical implication: before you write a single line of imputation code, spend five minutes checking whether missingness correlates with other columns or with the target variable. That five minutes prevents weeks of debugging a model that is learning from a biased training set.
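That five-minute check can be as short as a groupby on the missingness mask. A toy sketch with invented columns, simulating the MNAR case where high earners decline to report income:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(60_000, 10_000, 1_000)

# Simulate MNAR: values above 70k are never reported.
reported = np.where(income > 70_000, np.nan, income)
df = pd.DataFrame({"income": reported, "spend": income * 0.3})

# Compare the target ('spend') when income is missing vs present.
print(df.groupby(df["income"].isna())["spend"].mean())
# A large gap between the two means is a red flag for MNAR:
# median-imputing income here would systematically understate high earners.
```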
```python
import pandas as pd
import numpy as np


class MissingValueHandler:
    """Strategies for handling missing values based on missingness type.

    Usage:
        handler = MissingValueHandler()
        report = handler.diagnose_missingness(df)
        df = handler.impute_numeric(df, 'age', strategy='median')
        df = handler.impute_categorical(df, 'city', strategy='unknown')
    """

    @staticmethod
    def diagnose_missingness(df: pd.DataFrame, target_col: str = None) -> dict:
        """Determine if missingness is random or systematic.

        If target_col is provided, checks whether missingness in each
        column correlates with the target - a signal of potential MNAR.
        """
        report = {}
        for col in df.columns:
            missing_count = df[col].isnull().sum()
            if missing_count == 0:
                continue
            missing_pct = missing_count / len(df) * 100
            entry = {
                "count": missing_count,
                "percent": round(missing_pct, 1),
                "recommendation": MissingValueHandler._recommend(
                    missing_pct, col, df
                ),
            }
            # Check correlation between missingness and target
            if target_col and target_col != col and target_col in df.columns:
                missing_mask = df[col].isnull()
                if df[target_col].dtype in ["float64", "int64"]:
                    mean_when_missing = df.loc[missing_mask, target_col].mean()
                    mean_when_present = df.loc[~missing_mask, target_col].mean()
                    if pd.notna(mean_when_missing) and pd.notna(mean_when_present):
                        diff_pct = (
                            abs(mean_when_missing - mean_when_present)
                            / abs(mean_when_present)
                            * 100
                        )
                        if diff_pct > 10:
                            entry["recommendation"] += (
                                f" WARNING: target mean differs by {diff_pct:.0f}% "
                                f"when {col} is missing - possible MNAR."
                            )
            report[col] = entry
        return report

    @staticmethod
    def _recommend(pct: float, col: str, df: pd.DataFrame) -> str:
        if pct > 50:
            return "DROP COLUMN - more than half the data is missing"
        elif pct > 20:
            return "INVESTIGATE - high missingness may indicate a systematic issue"
        elif df[col].dtype in ["float64", "int64"]:
            return "IMPUTE - use median for skewed, mean for normal distributions"
        else:
            return "IMPUTE - use mode or create an explicit 'Unknown' category"

    @staticmethod
    def impute_numeric(
        df: pd.DataFrame, col: str, strategy: str = "median"
    ) -> pd.DataFrame:
        """Fill missing numeric values. Adds a boolean indicator column."""
        indicator_col = f"{col}_was_missing"
        df[indicator_col] = df[col].isnull().astype(int)
        if strategy == "median":
            fill_value = df[col].median()
        elif strategy == "mean":
            fill_value = df[col].mean()
        elif strategy == "zero":
            fill_value = 0
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        df[col] = df[col].fillna(fill_value)
        print(
            f"{col}: imputed {df[indicator_col].sum()} values "
            f"with {strategy} ({fill_value:.2f})"
        )
        return df

    @staticmethod
    def impute_categorical(
        df: pd.DataFrame, col: str, strategy: str = "mode"
    ) -> pd.DataFrame:
        """Fill missing categorical values."""
        indicator_col = f"{col}_was_missing"
        df[indicator_col] = df[col].isnull().astype(int)
        if strategy == "mode":
            fill_value = df[col].mode()[0]
        elif strategy == "unknown":
            fill_value = "Unknown"
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        df[col] = df[col].fillna(fill_value)
        print(f"{col}: imputed {df[indicator_col].sum()} values with '{fill_value}'")
        return df
```
Before committing to a drop, compare df.dropna().shape against df.shape so you know exactly how many rows you are about to lose.

Removing Duplicates
Duplicates inflate counts, skew averages, and produce overconfident models. They come in two flavors: exact duplicates (entire row identical) and fuzzy duplicates (same real-world entity, slightly different representation – like 'John Smith' vs 'john smith' vs 'J. Smith').
Exact duplicates are easy to detect and remove. Fuzzy duplicates are harder and more dangerous because they survive standard deduplication checks. A customer appearing twice with slightly different names gets counted as two customers, doubling their weight in any aggregate.
The most important principle: define your deduplication key based on business logic, not technical convenience. A surrogate ID that is unique by definition tells you nothing about whether two rows represent the same real-world entity.
```python
import pandas as pd
from difflib import SequenceMatcher


class DuplicateHandler:
    """Detect and remove duplicate records - exact and fuzzy.

    Usage:
        handler = DuplicateHandler()
        dups = handler.find_exact_duplicates(df, subset=['email', 'order_date'])
        df_clean = handler.remove_exact_duplicates(df, subset=['email', 'order_date'])
        fuzzy = handler.find_fuzzy_duplicates(df, columns=['name', 'address'])
    """

    @staticmethod
    def find_exact_duplicates(df: pd.DataFrame, subset: list = None) -> pd.DataFrame:
        """Identify exact duplicate rows.

        Args:
            df: input DataFrame
            subset: columns to consider for duplication. None = all columns.

        Returns:
            DataFrame containing only the duplicate rows (excluding first occurrence).
        """
        dup_mask = df.duplicated(subset=subset, keep="first")
        dup_count = dup_mask.sum()
        print(f"Exact duplicates found: {dup_count} ({dup_count / len(df) * 100:.1f}%)")
        if subset:
            print(f"  Checked on columns: {subset}")
        return df[dup_mask]

    @staticmethod
    def remove_exact_duplicates(
        df: pd.DataFrame, subset: list = None, keep: str = "first"
    ) -> pd.DataFrame:
        """Remove exact duplicates, keeping one copy."""
        before = len(df)
        df_clean = df.drop_duplicates(subset=subset, keep=keep).reset_index(drop=True)
        removed = before - len(df_clean)
        print(f"Removed {removed} duplicate rows. {len(df_clean):,} rows remaining.")
        return df_clean

    @staticmethod
    def find_fuzzy_duplicates(
        df: pd.DataFrame, columns: list, threshold: float = 0.85
    ) -> list:
        """Find near-duplicate records based on string similarity.

        WARNING: O(n^2) complexity. For large datasets (>10k rows), sample
        or use blocking strategies (group by first letter, zip code, etc.)
        before calling this method.

        Args:
            df: input DataFrame
            columns: columns to concatenate for similarity comparison
            threshold: minimum similarity ratio to flag as fuzzy duplicate

        Returns:
            List of (index_i, index_j, similarity_score) tuples.
        """
        fuzzy_pairs = []
        values = df[columns].astype(str).agg(" ".join, axis=1).tolist()
        n = len(values)
        if n > 10_000:
            print(
                f"WARNING: {n:,} rows will produce {n*(n-1)//2:,} comparisons. "
                f"Consider blocking or sampling first."
            )
        for i in range(n):
            for j in range(i + 1, n):
                similarity = SequenceMatcher(None, values[i], values[j]).ratio()
                if threshold <= similarity < 1.0:
                    fuzzy_pairs.append((i, j, round(similarity, 3)))
        print(
            f"Found {len(fuzzy_pairs)} fuzzy duplicate pairs above "
            f"{threshold} threshold"
        )
        return fuzzy_pairs
```
- Payment processor retries create records with different transaction IDs but identical amounts and timestamps within a narrow window.
- User registration forms submitted twice produce records with the same email but different auto-incremented user IDs.
- Data imports from multiple sources create overlapping date ranges with slight format differences (2025-04-15 vs 04/15/2025).
- CRM merges introduce near-duplicate contacts: 'Acme Corp' vs 'Acme Corporation' vs 'ACME Corp.'.
- Always define deduplication keys based on business logic, not just surrogate primary keys that are unique by design.
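The fix for most of these sources follows one pattern: normalize the representation first, then deduplicate on the business key. A small sketch with invented records:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Acme Corp", "ACME Corp.", "Beta LLC"],
    "email": ["ops@acme.com ", "OPS@ACME.COM", "hi@beta.io"],
    "order_date": ["2025-04-15", "04/15/2025", "2025-05-01"],
})

# Normalize before comparing: casing, whitespace, date format.
df["email_key"] = df["email"].str.strip().str.lower()
df["date_key"] = df["order_date"].apply(pd.to_datetime)  # per-element parse handles mixed formats

# Dedupe on the business-meaningful composite key, not on the raw strings.
deduped = df.drop_duplicates(subset=["email_key", "date_key"], keep="first")
print(len(df), "->", len(deduped))  # 3 -> 2
```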
Fixing Data Types and Formats
Incorrect data types cause silent errors that can persist for months. Numbers stored as strings cannot be aggregated – pandas will concatenate them instead of summing. Dates stored as strings cannot be compared or sorted chronologically. Booleans stored as 'Yes'/'No' strings require explicit conversion before any logical operation.
The root cause is almost always upstream: CSV files have no schema enforcement, APIs return everything as strings, and Excel silently interprets dates in locale-dependent formats. By the time data reaches your DataFrame, the damage is done. Your job is to detect and fix it systematically.
The most insidious variant is partial type corruption β a column that is 99.9% numeric but contains a handful of string values like 'N/A' or '#REF!'. Pandas silently stores the entire column as object dtype, and your aggregation returns wrong results with no error message.
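A few lines are enough to see the failure mode (toy data):

```python
import pandas as pd

revenue = pd.Series(["100", "250", "N/A", "75"])  # one stray placeholder

print(revenue.dtype)   # object - pandas kept everything as strings
print(revenue.sum())   # string concatenation: '100250N/A75', not 425
clean = pd.to_numeric(revenue, errors="coerce")
print(clean.sum())     # 425.0, with 'N/A' surfaced as NaN instead of hidden
```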
```python
import pandas as pd
import numpy as np


class TypeFixer:
    """Fix data type issues in DataFrames.

    Usage:
        fixer = TypeFixer()
        df = fixer.fix_numeric_strings(df, ['revenue', 'price'])
        df = fixer.fix_dates(df, ['order_date', 'ship_date'])
        df = fixer.fix_booleans(df, ['is_active', 'has_subscription'])
        df = fixer.fix_categoricals(df, ['country', 'status'])
    """

    @staticmethod
    def fix_numeric_strings(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert string columns that should be numeric.

        Handles currency symbols, commas, whitespace, and common
        placeholder strings like 'N/A', '-', 'null', '#REF!'.
        """
        for col in columns:
            original_nulls = df[col].isna().sum()
            cleaned = df[col].astype(str)
            cleaned = cleaned.str.replace(r"[$€£,]", "", regex=True)
            cleaned = cleaned.str.strip()
            cleaned = cleaned.replace(
                {"": np.nan, "N/A": np.nan, "n/a": np.nan, "-": np.nan,
                 "null": np.nan, "None": np.nan, "nan": np.nan,
                 "#REF!": np.nan, "#N/A": np.nan}
            )
            df[col] = pd.to_numeric(cleaned, errors="coerce")
            new_nulls = df[col].isna().sum() - original_nulls
            print(
                f"{col}: converted to numeric. "
                f"{new_nulls} new NaN values from failed coercion."
            )
        return df

    @staticmethod
    def fix_dates(
        df: pd.DataFrame, columns: list, dayfirst: bool = False, fmt: str = None
    ) -> pd.DataFrame:
        """Parse date columns with explicit format handling.

        Args:
            fmt: explicit strftime format string (e.g., '%Y-%m-%d').
                 If None, pandas infers the format - risky in production.
        """
        for col in columns:
            if fmt:
                df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")
            else:
                df[col] = pd.to_datetime(df[col], dayfirst=dayfirst, errors="coerce")
            failed = df[col].isna().sum()
            print(f"{col}: parsed as datetime. {failed} values are NaT.")
            # Sanity check: flag future dates
            future = (df[col] > pd.Timestamp.now()).sum()
            if future > 0:
                print(f"  WARNING: {future} dates are in the future.")
        return df

    @staticmethod
    def fix_booleans(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert string booleans to actual booleans."""
        true_values = {"yes", "true", "1", "y", "on", "t"}
        false_values = {"no", "false", "0", "n", "off", "f"}
        for col in columns:
            lower = df[col].astype(str).str.lower().str.strip()
            df[col] = lower.map(
                lambda x: (
                    True if x in true_values
                    else (False if x in false_values else np.nan)
                )
            )
            unmapped = df[col].isna().sum()
            print(f"{col}: converted to boolean. {unmapped} unmapped values.")
        return df

    @staticmethod
    def fix_categoricals(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Convert high-memory object columns to categorical dtype."""
        for col in columns:
            before_mem = df[col].memory_usage(deep=True) / 1e6
            df[col] = df[col].astype("category")
            after_mem = df[col].memory_usage(deep=True) / 1e6
            if before_mem > 0:
                reduction = (1 - after_mem / before_mem) * 100
                print(
                    f"{col}: {before_mem:.2f} MB -> {after_mem:.2f} MB "
                    f"({reduction:.0f}% reduction)"
                )
            else:
                print(f"{col}: converted to category dtype.")
        return df
```
Handling Outliers
Outliers distort model training and summary statistics. A single salary of $10M in a dataset of $50K–$150K salaries will shift the mean by thousands of dollars and stretch the scale for distance-based algorithms.
But the key decision is whether an outlier is an error (remove it), a genuine extreme value (keep or cap it), or the actual signal you are trying to detect (embrace it). Domain knowledge drives this decision, not statistical rules alone.
The IQR method and z-score are detection tools, not decision tools. They flag candidates. A human – or a well-defined business rule – decides what to do with each candidate. Blindly removing everything beyond 1.5×IQR is a recipe for deleting the most informative data points in your dataset.
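The distortion is easy to quantify. A toy version of the salary example from above:

```python
import numpy as np

salaries = np.full(999, 80_000.0)                  # 999 typical salaries
with_outlier = np.append(salaries, 10_000_000.0)   # one $10M entry

print(np.mean(salaries))        # 80000.0
print(np.mean(with_outlier))    # 89920.0 - one row moved the mean by ~$9.9K
print(np.median(with_outlier))  # 80000.0 - the median barely notices
```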
```python
import pandas as pd
import numpy as np


class OutlierHandler:
    """Detect and handle outliers using multiple methods.

    Usage:
        handler = OutlierHandler()
        mask, lower, upper = handler.detect_iqr(df, 'salary')
        mask = handler.detect_zscore(df, 'salary', threshold=3.0)
        df = handler.cap_outliers(df, 'salary', lower=20000, upper=500000)
        df = handler.log_transform(df, 'revenue')
    """

    @staticmethod
    def detect_iqr(df: pd.DataFrame, col: str, factor: float = 1.5) -> tuple:
        """Flag outliers using the IQR method.

        Returns:
            Tuple of (boolean mask, lower bound, upper bound).
        """
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - factor * iqr
        upper = q3 + factor * iqr
        outlier_mask = (df[col] < lower) | (df[col] > upper)
        count = outlier_mask.sum()
        print(
            f"{col}: {count} outliers ({count / len(df) * 100:.1f}%) "
            f"outside [{lower:.2f}, {upper:.2f}]"
        )
        return outlier_mask, lower, upper

    @staticmethod
    def detect_zscore(df: pd.DataFrame, col: str, threshold: float = 3.0) -> pd.Series:
        """Flag outliers using z-score method."""
        mean = df[col].mean()
        std = df[col].std()
        if std == 0:
            print(f"{col}: standard deviation is 0 - all values are identical. No outliers.")
            return pd.Series(False, index=df.index)
        z_scores = np.abs((df[col] - mean) / std)
        outlier_mask = z_scores > threshold
        count = outlier_mask.sum()
        print(f"{col}: {count} outliers with |z-score| > {threshold}")
        return outlier_mask

    @staticmethod
    def cap_outliers(
        df: pd.DataFrame, col: str, lower: float = None, upper: float = None
    ) -> pd.DataFrame:
        """Cap outlier values to specified bounds (winsorization).

        If bounds are not specified, defaults to 1st and 99th percentiles.
        """
        if lower is None:
            lower = df[col].quantile(0.01)
        if upper is None:
            upper = df[col].quantile(0.99)
        before = df[col].copy()
        df[col] = df[col].clip(lower=lower, upper=upper)
        capped = (before != df[col]).sum()
        print(f"{col}: capped {capped} values to [{lower:.2f}, {upper:.2f}]")
        return df

    @staticmethod
    def log_transform(df: pd.DataFrame, col: str) -> pd.DataFrame:
        """Apply log1p transform to compress right-skewed distributions.

        Handles zeros safely. Does NOT handle negative values - shift first.
        """
        if (df[col] < 0).any():
            min_val = df[col].min()
            print(
                f"{col}: contains negative values. "
                f"Shifting by {abs(min_val):.2f} before log transform."
            )
            df[col] = np.log1p(df[col] - min_val)
        else:
            df[col] = np.log1p(df[col])
        print(f"{col}: log1p transform applied.")
        return df
```
Encoding Categorical Variables
Machine learning models require numeric input. Categorical variables – country, product type, color, status – must be converted to numbers through encoding. This is a preprocessing step, not cleaning, but it is where beginners make some of their most costly mistakes.
The encoding strategy depends on two factors: cardinality (how many unique values the column has) and whether those values have a natural order. Choosing the wrong method can create thousands of useless sparse columns, introduce false ordinal relationships, or leak target information into features.
The most common mistake: one-hot encoding a column with 10,000 unique values. This creates 9,999 new binary columns, most of which are zero for any given row. Memory explodes, training crawls, and linear models become impossible to fit.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder


class CategoricalEncoder:
    """Encode categorical variables for model consumption.

    Usage:
        encoder = CategoricalEncoder()
        df = encoder.encode_onehot(df, ['color', 'size'])
        df = encoder.encode_ordinal(df, 'education',
                                    ['high_school', 'bachelors', 'masters', 'phd'])
        df = encoder.encode_frequency(df, ['city', 'product_id'])
    """

    @staticmethod
    def encode_onehot(
        df: pd.DataFrame, columns: list, max_cardinality: int = 10
    ) -> pd.DataFrame:
        """One-hot encode low-cardinality columns.

        Skips columns that exceed max_cardinality and prints a warning.
        Uses drop_first=True to avoid multicollinearity in linear models.
        """
        for col in columns:
            cardinality = df[col].nunique()
            if cardinality > max_cardinality:
                print(
                    f"SKIPPED: {col} has {cardinality} unique values. "
                    f"Use frequency or target encoding instead."
                )
                continue
            dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
            df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
            print(f"{col}: one-hot encoded into {len(dummies.columns)} columns")
        return df

    @staticmethod
    def encode_ordinal(df: pd.DataFrame, column: str, order: list) -> pd.DataFrame:
        """Encode ordered categories (e.g., Low < Medium < High).

        Unmapped values are set to NaN so they surface as missing rather
        than silently receiving an arbitrary integer.
        """
        mapping = {val: idx for idx, val in enumerate(order)}
        unmapped = set(df[column].dropna().unique()) - set(mapping.keys())
        if unmapped:
            print(f"WARNING: {column} has unmapped values: {unmapped}")
        df[column] = df[column].map(mapping)
        print(f"{column}: ordinally encoded with order {order}")
        return df

    @staticmethod
    def encode_frequency(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Replace categories with their frequency in the training data.

        Safe for high-cardinality features. Does not create new columns.
        """
        for col in columns:
            freq_map = df[col].value_counts(normalize=True).to_dict()
            df[col] = df[col].map(freq_map)
            print(f"{col}: frequency encoded ({len(freq_map)} categories)")
        return df

    @staticmethod
    def encode_label(df: pd.DataFrame, columns: list) -> tuple:
        """Label encode - safe for tree-based models only.

        WARNING: creates false ordinal relationships for linear models.
        Returns the DataFrame and a dict of fitted LabelEncoders for
        inverse transformation and production reuse.
        """
        encoders = {}
        for col in columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            encoders[col] = le
            print(f"{col}: label encoded ({len(le.classes_)} classes)")
        return df, encoders
```
- Low cardinality (< 10 unique values), no natural order: One-hot encoding. Creates N-1 binary columns. Standard for most classification tasks.
- Low cardinality, natural order exists: Ordinal encoding. Maps to integers preserving order (Low=0, Medium=1, High=2). Use for education levels, satisfaction ratings, size categories.
- High cardinality (> 50 unique values): Target encoding (mean of target per category) or frequency encoding. One-hot creates too many sparse columns and kills performance.
- Tree-based models (Random Forest, XGBoost, LightGBM): Label encoding works fine. Trees split on thresholds, so the arbitrary integer assignment does not create false distance relationships.
- Linear models, SVMs, and neural networks: One-hot or target encoding only. Label encoding introduces false ordinal relationships (category 5 is not 'more' than category 2).
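Target encoding is recommended above but not implemented in the class. Here is one hedged sketch (the smoothing parameter m, the helper name, and the column names are all invented), fitted on training data only so no target information can leak in from a test set:

```python
import pandas as pd

def target_encode(train: pd.DataFrame, col: str, target: str, m: float = 10.0) -> dict:
    """Smoothed mean-target encoding: categories with few rows shrink
    toward the global target mean, limiting overfitting on rare categories."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed.to_dict()

train = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "c"],
    "churned": [1, 1, 0, 0, 0, 1],
})
mapping = target_encode(train, "city", "churned")
train["city_enc"] = train["city"].map(mapping)

# Unseen categories at inference time fall back to the global mean:
fallback = train["churned"].mean()
new_city_enc = mapping.get("d", fallback)
```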
Feature Scaling
Features on different scales – age in years (0–100) vs. income in dollars (0–10,000,000) – bias distance-based models and slow gradient descent convergence. KNN will treat a $1 difference in income as equivalent to a 1-year difference in age, which makes no sense.
Scaling normalizes feature ranges without changing the distribution shape. It does not remove outliers, fix missing values, or encode categories. It is a pure preprocessing step that should happen after cleaning, after encoding, and after the train/test split.
The critical rule that most beginners violate: fit the scaler on training data only. If you fit on the full dataset, the scaler's mean and standard deviation incorporate test set information, which leaks future knowledge into training. The metrics look artificially good, and the model underperforms on genuinely unseen data.
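Here is that rule as a runnable pattern on synthetic data: the scaler is fitted on the training split only, and the leaky alternative is noted in a comment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(50, 10, size=(200, 1))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train ONLY
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# The leaky version - scaler.fit_transform(X) on the full dataset -
# would bake test-set mean/std into the training features.
print(round(X_train_scaled.mean(), 3))  # 0.0 on train by construction
print(round(X_test_scaled.mean(), 3))   # not exactly 0: test uses train stats
```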
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler


class FeatureScaler:
    """Scale numeric features for model consumption.

    Usage:
        scaler_util = FeatureScaler()
        df_train, scaler = scaler_util.scale_standard(df_train, num_cols)
        df_test[num_cols] = scaler.transform(df_test[num_cols])  # same scaler!
    """

    @staticmethod
    def scale_standard(df: pd.DataFrame, columns: list) -> tuple:
        """Z-score normalization: mean=0, std=1.

        Use for normally distributed features with linear models,
        logistic regression, SVMs, and PCA.
        """
        scaler = StandardScaler()
        df[columns] = scaler.fit_transform(df[columns])
        print(f"StandardScaler applied to {len(columns)} columns")
        return df, scaler

    @staticmethod
    def scale_minmax(
        df: pd.DataFrame, columns: list, feature_range: tuple = (0, 1)
    ) -> tuple:
        """Min-max scaling to a fixed range.

        Use for neural networks (which expect bounded inputs) and
        features with known theoretical bounds.
        """
        scaler = MinMaxScaler(feature_range=feature_range)
        df[columns] = scaler.fit_transform(df[columns])
        print(
            f"MinMaxScaler applied to {len(columns)} columns, "
            f"range={feature_range}"
        )
        return df, scaler

    @staticmethod
    def scale_robust(df: pd.DataFrame, columns: list) -> tuple:
        """Scale using median and IQR instead of mean and std.

        Use when data has outliers that would distort StandardScaler.
        Outliers still exist after scaling but no longer dominate the range.
        """
        scaler = RobustScaler()
        df[columns] = scaler.fit_transform(df[columns])
        print(
            f"RobustScaler applied to {len(columns)} columns "
            f"(outlier-resistant)"
        )
        return df, scaler
```
- Needs scaling: KNN, SVM, Neural Networks, PCA, Logistic Regression, Linear Regression (especially with regularization – L1/L2 penalties are scale-dependent).
- Does NOT need scaling: Decision Trees, Random Forests, XGBoost, LightGBM, CatBoost, Naive Bayes. These models split on thresholds, not distances.
- Always scale after the train/test split – fit the scaler on training data only, then transform both train and test with the same fitted scaler.
- Save the scaler object with joblib.dump(). You must apply the exact same transformation to new data at inference time. A different scaler = different feature space = garbage predictions.
- If you use an sklearn Pipeline, the scaler is automatically fitted only on training data during pipeline.fit() – this is one of the strongest reasons to use Pipelines.

Always call scaler.transform() on test and production data. If you ever call scaler.fit_transform() on test data, you have a data leakage bug.

Building a Reproducible Cleaning Pipeline
Manual cleaning in a notebook is fine for exploration. Production requires a reproducible pipeline that applies the exact same transformations to new data consistently, without human intervention, without remembering which cells to run in which order.
Scikit-learn's Pipeline and ColumnTransformer make this possible. They encapsulate the entire sequence of transformations – imputation, encoding, scaling – into a single object that can be fitted on training data, serialized to disk, and loaded at inference time. The pipeline remembers every parameter: the median used for imputation, the mean and std used for scaling, the category set used for encoding.
This is not optional infrastructure for large teams. It is essential even for solo projects. The moment you retrain a model or deploy it to a different environment, you need the pipeline. Without it, you will spend hours reverse-engineering what transformations were applied and in what order.
```python
import pandas as pd
import numpy as np
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer


class DataCleaningPipeline:
    """Reproducible data cleaning pipeline for production use.

    Encapsulates imputation, encoding, and scaling into a single
    serializable object. Fit once on training data, transform everything
    else with the same fitted parameters.

    Usage:
        pipeline = DataCleaningPipeline(
            numeric_cols=['age', 'income', 'score'],
            categorical_cols=['city', 'status', 'tier']
        )
        X_train_clean = pipeline.fit_transform(X_train)
        X_test_clean = pipeline.transform(X_test)
        pipeline.save('preprocessing_pipeline.pkl')

        # At inference time:
        loaded = DataCleaningPipeline.load('preprocessing_pipeline.pkl')
        X_new_clean = loaded.transform(X_new)
    """

    def __init__(self, numeric_cols: list, categorical_cols: list):
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols
        self.pipeline = self._build_pipeline()
        self._is_fitted = False

    def _build_pipeline(self) -> ColumnTransformer:
        numeric_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ])
        categorical_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
        ])
        preprocessor = ColumnTransformer(transformers=[
            ("numeric", numeric_transformer, self.numeric_cols),
            ("categorical", categorical_transformer, self.categorical_cols),
        ])
        return preprocessor

    def fit(self, X_train: pd.DataFrame) -> "DataCleaningPipeline":
        """Fit all transformers on training data only."""
        self.pipeline.fit(X_train)
        self._is_fitted = True
        print(
            f"Pipeline fitted on {len(X_train):,} rows, "
            f"{len(self.numeric_cols)} numeric + "
            f"{len(self.categorical_cols)} categorical columns."
        )
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """Transform new data using the fitted pipeline."""
        if not self._is_fitted:
            raise RuntimeError("Pipeline is not fitted. Call .fit(X_train) first.")
        return self.pipeline.transform(X)

    def fit_transform(self, X_train: pd.DataFrame) -> np.ndarray:
        """Fit and transform in one step. Use on training data."""
        self.fit(X_train)
        return self.pipeline.transform(X_train)

    def save(self, filepath: str) -> None:
        """Serialize the fitted pipeline to disk."""
        if not self._is_fitted:
            raise RuntimeError("Cannot save an unfitted pipeline. Call .fit() first.")
        joblib.dump(self, filepath)
        print(f"Pipeline saved to {filepath}")

    @staticmethod
    def load(filepath: str) -> "DataCleaningPipeline":
        """Load a previously saved pipeline."""
        pipeline = joblib.load(filepath)
        print(f"Pipeline loaded from {filepath}")
        return pipeline
```
Serialize the fitted pipeline with joblib.dump(), version it alongside your model artifacts, and load it at inference time. The pipeline and the model are a unit: they ship together or not at all.

| Scaler | Formula | Outlier Resistant | Output Range | Best For |
|---|---|---|---|---|
| StandardScaler | (x − mean) / std | No | Unbounded (≈ −3 to 3 for normal data) | Normally distributed features, linear models, logistic regression, SVM |
| MinMaxScaler | (x − min) / (max − min) | No | [0, 1] or custom range | Neural networks, image pixel values, features with known theoretical bounds |
| RobustScaler | (x − median) / IQR | Yes | Unbounded | Data with outliers, skewed distributions, salary/revenue columns |
| MaxAbsScaler | x / max(|x|) | No | [−1, 1] | Sparse data (preserves zero entries), already centered features |
| Log Transform | log(x + 1) | Partially | [0, ∞) | Right-skewed distributions (income, counts, page views, time on site) |
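A minimal sketch (with made-up numbers, not from the article) showing why the "outlier resistant" column matters: a single extreme value inflates StandardScaler's std and squashes the normal points together, while RobustScaler's median/IQR statistics keep their spread intact.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature column with a single extreme value (hypothetical data)
X = np.array([[10.0], [12.0], [11.0], [13.0], [1000.0]])

# StandardScaler: (x - mean) / std -- both statistics are pulled by the outlier
std_scaled = StandardScaler().fit_transform(X)

# RobustScaler: (x - median) / IQR -- both statistics ignore the outlier
rob_scaled = RobustScaler().fit_transform(X)

print(std_scaled.ravel().round(3))  # normal points collapse near the mean
print(rob_scaled.ravel().round(3))  # normal points keep a usable spread
```

The robust-scaled normal points land around −1 to 0.5, a usable range for a downstream model; the standard-scaled ones are crushed into a sliver near zero.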
🎯 Key Takeaways
- Inspect data with df.shape, df.dtypes, df.isnull().sum(), df.duplicated().sum(), and df.describe() before writing a single line of cleaning code.
- Understand why data is missing before choosing an imputation strategy: the missingness mechanism (MCAR, MAR, MNAR) matters more than the imputation method.
- Deduplicate on business-meaningful composite keys, not on system-generated surrogate IDs that are unique by design.
- Always split data into train/test before fitting scalers, encoders, or imputers: fitting on the full dataset causes data leakage that inflates metrics by 1–5%.
- Check cardinality with df['col'].nunique() before one-hot encoding. A column with more than 50 unique values calls for frequency encoding, target encoding, or dimensionality reduction.
- Build a reproducible scikit-learn Pipeline that encapsulates the entire cleaning and preprocessing sequence in a single serializable object.
- Serialize the preprocessing pipeline with joblib and version it alongside the model. They are a unit: shipping one without the other guarantees production failures.
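The cardinality check above can be sketched as follows. This is one possible pattern, not the only one; the column and the 50-value threshold are illustrative, and the demo column is tiny so the frequency mapping is easy to verify by hand.

```python
import pandas as pd

# Hypothetical categorical column; imagine it had thousands of distinct values
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "NYC", "LA"]})

print(df["city"].nunique())  # always check cardinality before encoding

# Frequency encoding: replace each category with its relative frequency,
# producing one numeric column instead of one column per category
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df)
```

Here "NYC" appears in 3 of 6 rows, so every NYC row gets 0.5. Unlike one-hot encoding, the output width stays constant no matter how many categories appear.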
Interview Questions on This Topic
- Q: You receive a dataset where 35% of values in a critical feature column are missing. How do you decide what to do? (Mid-level)
- Q: What is the difference between data cleaning and data preprocessing? (Junior)
- Q: How would you handle a categorical feature with 10,000 unique values for a linear regression model? (Senior)
- Q: Why must you fit scalers on training data only and not the full dataset? (Mid-level)
Frequently Asked Questions
Should I clean data before or after splitting into train/test sets?
Inspect and remove exact duplicates before splitting: duplicates that span both sets cause data leakage where the model memorizes training examples that also appear in the test set. For transformations that learn parameters (scaling, imputation with mean/median, encoding), always split first, then fit on training data only. The correct sequence is: (1) load raw data, (2) remove exact duplicates on the full dataset, (3) split into train and test, (4) fit cleaning and preprocessing transformers on train only, (5) transform both train and test using the fitted transformers.
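The five-step sequence can be sketched end to end. The tiny DataFrame and its column names are invented for the example; only the ordering of the operations is the point.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# (1) Load raw data -- hypothetical, with one exact duplicate row
raw = pd.DataFrame({
    "age":    [25, 32, 32, 47, 51, 38],
    "income": [40, 55, 55, 80, 90, 60],
})

# (2) Remove exact duplicates on the full dataset (safe before splitting)
deduped = raw.drop_duplicates()

# (3) Split BEFORE fitting anything that learns parameters
train, test = train_test_split(deduped, test_size=0.4, random_state=0)

# (4) Fit the scaler on train only...
scaler = StandardScaler().fit(train)

# (5) ...then transform both sets with the same fitted statistics
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

Swapping steps (3) and (4) is the classic leakage bug: the scaler would then "know" the test set's mean and std before evaluation.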
How do I know if my data has outliers vs. just a skewed distribution?
Plot a histogram and a boxplot side by side. A skewed distribution has a long tail but continuous density: the values get sparser as you move along the tail, but there are no isolated points far from everything else. All values are plausible given the domain. Outliers are isolated points visibly disconnected from the main distribution, appearing as individual dots beyond the boxplot whiskers with no nearby values.
Use domain knowledge as the final arbiter: a salary of $500K is plausible for a CEO but clearly an error for an intern position. Statistical tests like IQR or z-score flag candidates, but a human (or a well-defined business rule) decides whether each candidate is an error, a valid extreme, or a signal.
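The IQR rule mentioned above can be sketched in a few lines; the salary values are hypothetical. Note that it only *flags candidates*, matching the point that the final call belongs to domain knowledge.

```python
import pandas as pd

# Hypothetical salaries: mostly typical values plus one extreme point
salaries = pd.Series([52_000, 61_000, 58_000, 75_000, 69_000, 500_000])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Candidates only -- a business rule or a human decides what they really are
candidates = salaries[(salaries < lower) | (salaries > upper)]
print(candidates.tolist())  # [500000]
```

The 1.5 multiplier is the conventional Tukey fence; tightening or loosening it trades false positives against missed candidates.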
Can I use the same cleaning pipeline for different datasets?
Only if the datasets share the exact same schema: same columns, same types, same expected range of category values. A pipeline fitted on one dataset encodes that dataset's specific statistics: the median for imputation, the mean and std for scaling, the known set of categories for one-hot encoding.
If a new dataset has different columns, the pipeline will crash. If it has new category values the encoder has never seen, OneHotEncoder with handle_unknown='ignore' will silently zero them out, which may or may not be acceptable. Never reuse a pipeline blindly without validating that the new data's schema matches the training data's schema.
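A schema check before reuse can be as simple as the sketch below. The helper name, its return convention, and the example columns are all invented for illustration; a production system might use a dedicated validation library instead.

```python
import pandas as pd

def validate_schema(new_df: pd.DataFrame,
                    expected_cols: list,
                    expected_dtypes: dict) -> list:
    """Return a list of schema problems; empty means safe to reuse the pipeline."""
    problems = []
    missing = set(expected_cols) - set(new_df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    for col, dtype in expected_dtypes.items():
        if col in new_df.columns and str(new_df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {new_df[col].dtype}")
    return problems

# Hypothetical new batch that is missing a column the pipeline was fitted on
df_new = pd.DataFrame({"age": [30], "city": ["NYC"]})
problems = validate_schema(df_new, ["age", "city", "income"], {"age": "int64"})
print(problems)
```

Running this gate before calling `pipeline.transform()` turns a confusing mid-pipeline crash (or a silent zeroed-out encoding) into an explicit, actionable error list.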
What is the difference between imputing with mean vs. median?
Mean is sensitive to outliers: a single extreme value (a $10M salary among $50K–$150K salaries) pulls the mean dramatically and fills missing values with a number that represents nobody in the dataset. Median is robust: it represents the actual middle value regardless of extremes.
For normally distributed data with no outliers, mean and median are nearly identical and either works. For skewed distributions (income, prices, counts, time-on-site), median is almost always the better choice. In production, default to median unless you have verified the distribution is approximately symmetric. It is the safer choice because you cannot always guarantee your data is outlier-free; a single outlier in a future batch can silently corrupt mean-imputed values.
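The salary example above works out like this (illustrative numbers):

```python
import pandas as pd

# Three typical salaries plus one extreme outlier (hypothetical data)
salaries = pd.Series([50_000, 80_000, 120_000, 10_000_000])

print(salaries.mean())    # 2562500.0 -- represents nobody in the data
print(salaries.median())  # 100000.0  -- a plausible typical salary
```

Imputing missing salaries with 2,562,500 would inject a value larger than every real non-outlier salary; imputing with 100,000 stays inside the typical range.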
How often should I re-inspect and re-clean production data?
Continuously, via automated pipeline gates. Data quality is not a one-time event: upstream schemas change, new data sources are added, vendor APIs modify their formats, and seasonal patterns introduce values your original cleaning logic never anticipated.
At minimum, run automated inspection checks (row counts, missing percentages, value range validation, duplicate ratios) on every data batch before it enters your pipeline. Set thresholds that trigger alerts or halt processing. Re-evaluate your cleaning strategy quarterly or whenever model performance degrades; declining accuracy is often a data quality symptom, not a model architecture problem.
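A minimal batch gate covering those checks might look like the sketch below. The function name and the threshold defaults are illustrative; real limits should come from your own data's baseline.

```python
import pandas as pd

def batch_quality_gate(df: pd.DataFrame,
                       min_rows: int = 100,
                       max_missing_pct: float = 5.0,
                       max_dup_pct: float = 1.0) -> list:
    """Run basic inspection checks on a batch; return the list of failed checks."""
    failures = []
    # Row count: a sudden drop often means an upstream extract silently broke
    if len(df) < min_rows:
        failures.append(f"row count {len(df)} below minimum {min_rows}")
    # Missingness: check the worst column against the threshold
    missing_pct = df.isnull().mean().max() * 100
    if missing_pct > max_missing_pct:
        failures.append(f"worst column missing {missing_pct:.1f}% (limit {max_missing_pct}%)")
    # Duplicates: exact duplicate rows as a fraction of the batch
    dup_pct = df.duplicated().mean() * 100
    if dup_pct > max_dup_pct:
        failures.append(f"duplicate rows {dup_pct:.1f}% (limit {max_dup_pct}%)")
    return failures

# Hypothetical bad batch: too small, one-third missing, one duplicate row
bad_batch = pd.DataFrame({"amount": [10, 10, None]})
print(batch_quality_gate(bad_batch))
```

An empty return list means the batch may proceed; a non-empty one should raise an alert or halt the pipeline, per the thresholds above.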
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.