Data cleaning fixes errors, fills gaps, and standardizes formats before analysis or modeling
Missing values, duplicates, and inconsistent formats are the top 3 problems in raw data
Use pandas for tabular data: df.isnull().sum() reveals missingness patterns immediately
Encoding categoricals and scaling numerics are preprocessing steps, not cleaning
Production rule: automate cleaning pipelines — manual cleaning breaks on new data batches
Biggest mistake: dropping all rows with missing values instead of understanding why they are missing
Always split before fitting transformers — data leakage from preprocessing is the #1 silent killer of model reliability
✦ Definition~90s read
What is Data Cleaning and Preprocessing for Absolute Beginners?
Duplicate data is one of the most insidious problems in data preprocessing because it silently inflates metrics without raising obvious errors. When you have duplicate rows, every aggregate statistic—counts, averages, sums, even model accuracy—gets skewed.
★
Data cleaning is like washing vegetables before cooking.
For example, if a customer transaction appears twice in a sales dataset, your total revenue is artificially doubled, and any model trained on that data will learn patterns that don't exist in reality. This isn't just about cleaning; it's about ensuring your analysis and models reflect actual ground truth.
In the data preprocessing pipeline, duplicate detection typically comes after initial inspection and missing value handling but before type fixing and outlier treatment. The reason is pragmatic: duplicates are often easier to spot once you've resolved structural issues like inconsistent formatting.
Tools like pandas' duplicated() and drop_duplicates() are the workhorses here, but you must decide on your deduplication strategy—whether to keep the first occurrence, last occurrence, or none at all. For large datasets (millions of rows), hashing-based approaches or distributed frameworks like Spark become necessary.
Where this fits in the broader ecosystem: duplicate handling is a fundamental step in any data preprocessing workflow, alongside missing value imputation and outlier detection. It's especially critical in domains like e-commerce (duplicate orders), finance (duplicate transactions), and healthcare (duplicate patient records).
However, not all duplicates are bad—time series data may have legitimate repeated measurements, and some analytics tasks benefit from preserving duplicates (e.g., counting page views). The key is understanding your data's semantics before blindly removing rows.
When in doubt, always inspect the duplicates first: look at the actual rows, understand why they exist, and only then decide on removal.
Plain-English First
Data cleaning is like washing vegetables before cooking. You would not eat unwashed produce, and you should not feed raw, messy data into a model. Dirt in your data shows up as wrong predictions, broken dashboards, and decisions nobody should trust. This guide teaches you exactly how to inspect, fix, and prepare your data step by step — and more importantly, how to build a cleaning process that works automatically the next time new data arrives.
Most beginner tutorials skip data cleaning and jump straight to modeling. This is backwards. In production, 60–80% of a data scientist's time goes to cleaning and preprocessing, not model building.
Messy data produces confident but wrong results. A model trained on data with duplicates, missing values, or inconsistent formats will silently learn garbage patterns. The output looks plausible but is unreliable. You ship it, stakeholders act on it, and weeks later someone notices the numbers never made sense.
The core misconception is that cleaning is tedious but simple. It is not. Understanding why data is missing, how duplicates were created, and what format inconsistencies reveal about upstream systems requires real analytical thinking. A column full of NaN values might mean a sensor was offline, a user skipped an optional field, or an ETL job silently dropped records. Each cause demands a different response.
This guide covers every step from raw inspection through reproducible pipelines. The code is production-grade Python — not toy examples on iris or titanic. Every section includes the reasoning behind the technique, because knowing which tool to use matters far more than knowing the syntax.
Why Duplicate Data Is a Silent Metric Killer
Data preprocessing is the step where raw data is cleaned, transformed, and structured before analysis or model training. The core mechanic is detecting and removing duplicate records — rows that are identical or near-identical — because they artificially inflate counts, averages, and other aggregate metrics. A single duplicated row in a 10,000-row dataset can skew a mean by 0.01%, but in a 100-row dataset, it can shift results by 1% or more.
In practice, duplicates arise from system glitches, manual entry errors, or data merging from multiple sources. The key property to understand is that duplicates are not always exact copies — they can be partial matches (same user ID but different timestamps) or semantically identical records with different formatting. Detection typically uses hashing (O(n) with a hash set) or sorting (O(n log n)), but the real challenge is defining what counts as a duplicate in your domain.
Use duplicate removal whenever you compute statistics, train supervised models, or generate reports from transactional data. In production systems, failing to deduplicate leads to overestimated user counts, inflated conversion rates, and biased model predictions. For example, a fraud detection model trained on duplicated transactions will learn to flag normal behavior as suspicious, degrading precision by up to 30%.
Duplicate ≠ Identical
Two rows with the same user ID but different timestamps may still be duplicates if the event is idempotent — always define your dedup key based on business logic, not raw columns.
Production Insight
A real-time analytics pipeline ingested duplicate click events from a retry mechanism, inflating daily active user counts by 15%.
The symptom was a sudden, unexplained spike in DAU that correlated with network retries, not actual user growth.
Rule of thumb: always deduplicate at ingestion using a unique event ID or a composite key of (user_id, event_type, timestamp_window).
Key Takeaway
Duplicates inflate metrics silently — always deduplicate before aggregation.
Define your dedup key by business semantics, not raw column equality.
Use hashing for O(n) detection; never rely on manual inspection at scale.
thecodeforge.io
Data Preprocessing Pipeline for Clean Metrics
Data Preprocessing Beginners Guide
Inspecting Your Data First
Never clean blindly. Start by understanding the shape, types, and distribution of your data. The first five commands you run on any new dataset reveal the scope of the problem and tell you where to focus your effort.
Inspection is not optional extra work — it is the diagnostic step that prevents you from applying the wrong fix. Imputing a column that should have been dropped, or scaling a column that is secretly categorical, are mistakes that only surface weeks later when production results look wrong.
io/thecodeforge/cleaning/inspection.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
import pandas as pd
import numpy as np
definspect_dataset(filepath: str) -> pd.DataFrame:
"""First-pass data inspection. Run this before any cleaning.
Returns the raw DataFrame so you can continue exploring interactively.
Prints a diagnostic summary covering shape, types, missingness,
duplicates, numeric ranges, and categorical cardinality.
"""
df = pd.read_csv(filepath)
print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"\n--- Column Types ---")
print(df.dtypes.value_counts())
print(f"\n--- Missing Values ---")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_report = pd.DataFrame({"count": missing, "percent": missing_pct})
missing_report = missing_report[missing_report["count"] > 0].sort_values(
"percent", ascending=False
)
if missing_report.empty:
print("No missing values detected.")
else:
print(missing_report)
print(f"\n--- Duplicate Rows ---")
dup_count = df.duplicated().sum()
print(f"Exact duplicates: {dup_count} ({dup_count / len(df) * 100:.1f}%)")
print(f"\n--- Numeric Summary ---")
num_desc = df.describe().round(2)
print(num_desc)
print(f"\n--- Categorical Columns ---")
cat_cols = df.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
unique_count = df[col].nunique()
print(f" {col}: {unique_count} unique values")
if unique_count <= 10:
print(f" Values: {df[col].unique().tolist()}")
elif unique_count <= 50:
print(f" Top 5: {df[col].value_counts().head(5).to_dict()}")
else:
print(f" High cardinality — consider grouping or encoding")
return df
Inspect Before You Fix
df.shape tells you volume — is this a 100-row problem or a 100M-row problem? The answer changes every downstream decision.
df.dtypes reveals implicit type mismatches. Numbers stored as strings are the #1 source of silent aggregation errors.
df.isnull().sum() shows missingness patterns — not all NaN is random. Correlated missingness signals systemic upstream failures.
df.duplicated().sum() catches records that should not exist. Even 1% duplicates can swing aggregate metrics significantly.
df.describe() exposes impossible min/max values (negative ages, future dates, prices of $0.00) that signal data corruption or upstream bugs.
Production Insight
In production, data inspection should be automated as a pipeline gate — not something a human remembers to do in a notebook.
If the missing percentage exceeds a configured threshold, or row counts diverge from expectations, the pipeline should fail loudly with an alert rather than silently processing garbage.
Rule of thumb: never let uninspected data reach a model, a report, or a dashboard. The cost of a five-second automated check is negligible compared to the cost of retracting a wrong forecast.
Key Takeaway
Always inspect before cleaning.
Five commands — shape, dtypes, isnull, duplicated, describe — reveal 90% of data quality issues.
Automate inspection as a pipeline gate in production. Never trust raw data blindly, no matter how reputable the source.
Handling Missing Values
Missing values are the most common data quality issue. The critical decision is not how to fill them, but whether to fill them at all — and that depends on understanding why they are missing in the first place.
Statisticians categorize missingness into three types: MCAR (Missing Completely at Random), MAR (Missing at Random, conditional on other observed data), and MNAR (Missing Not at Random, where the value itself determines whether it is recorded). Each type demands a different strategy, and applying the wrong one introduces bias that no amount of hyperparameter tuning will fix.
The practical implication: before you write a single line of imputation code, spend five minutes checking whether missingness correlates with other columns or with the target variable. That five minutes prevents weeks of debugging a model that is learning from a biased training set.
io/thecodeforge/cleaning/missing_values.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
import pandas as pd
import numpy as np
classMissingValueHandler:
"""Strategiesfor handling missing values based on missingness type.
Usage:
handler = MissingValueHandler()
report = handler.diagnose_missingness(df)
df = handler.impute_numeric(df, 'age', strategy='median')
df = handler.impute_categorical(df, 'city', strategy='unknown')
"""
@staticmethod
defdiagnose_missingness(df: pd.DataFrame, target_col: str = None) -> dict:
"""Determineif missingness is random or systematic.
If target_col is provided, checks whether missingness in each column
correlates with the target — a signal of potential MNAR.
"""
report = {}
for col in df.columns:
missing_count = df[col].isnull().sum()
if missing_count == 0:
continue
missing_pct = missing_count / len(df) * 100
entry = {
"count": missing_count,
"percent": round(missing_pct, 1),
"recommendation": MissingValueHandler._recommend(
missing_pct, col, df
),
}
# Check correlation between missingness and targetif target_col and target_col != col and target_col in df.columns:
missing_mask = df[col].isnull()
if df[target_col].dtype in ["float64", "int64"]:
mean_when_missing = df.loc[missing_mask, target_col].mean()
mean_when_present = df.loc[~missing_mask, target_col].mean()
if pd.notna(mean_when_missing) and pd.notna(mean_when_present):
diff_pct = (
abs(mean_when_missing - mean_when_present)
/ abs(mean_when_present)
* 100
)
if diff_pct > 10:
entry["recommendation"] += (
f" WARNING: target mean differs by {diff_pct:.0f}% "
f"when {col} is missing — possible MNAR."
)
report[col] = entry
return report
@staticmethod
def_recommend(pct: float, col: str, df: pd.DataFrame) -> str:
if pct > 50:
return"DROP COLUMN — more than half the data is missing"elif pct > 20:
return"INVESTIGATE — high missingness may indicate a systematic issue"elif df[col].dtype in ["float64", "int64"]:
return"IMPUTE — use median for skewed, mean for normal distributions"else:
return"IMPUTE — use mode or create an explicit 'Unknown' category"
@staticmethod
defimpute_numeric(
df: pd.DataFrame, col: str, strategy: str = "median"
) -> pd.DataFrame:
"""Fill missing numeric values. Adds a boolean indicator column."""
indicator_col = f"{col}_was_missing"
df[indicator_col] = df[col].isnull().astype(int)
if strategy == "median":
fill_value = df[col].median()
elif strategy == "mean":
fill_value = df[col].mean()
elif strategy == "zero":
fill_value = 0else:
raiseValueError(f"Unknown strategy: {strategy}")
df[col] = df[col].fillna(fill_value)
print(f"{col}: imputed {df[indicator_col].sum()} values with {strategy} ({fill_value:.2f})")
return df
@staticmethod
defimpute_categorical(
df: pd.DataFrame, col: str, strategy: str = "mode"
) -> pd.DataFrame:
"""Fill missing categorical values."""
indicator_col = f"{col}_was_missing"
df[indicator_col] = df[col].isnull().astype(int)
if strategy == "mode":
fill_value = df[col].mode()[0]
elif strategy == "unknown":
fill_value = "Unknown"else:
raiseValueError(f"Unknown strategy: {strategy}")
df[col] = df[col].fillna(fill_value)
print(f"{col}: imputed {df[indicator_col].sum()} values with '{fill_value}'")
return df
Three Types of Missingness — and Why It Matters
MCAR (Missing Completely at Random): missingness has no pattern. A sensor randomly drops readings. Safe to drop rows or impute with simple statistics.
MAR (Missing at Random): missingness depends on other observed columns. Younger users skip the income field more often. Impute using those correlated columns — simple mean imputation introduces bias.
MNAR (Missing Not at Random): missingness depends on the unobserved value itself. High-income earners refuse to report income. No imputation strategy fully fixes this — you need domain expertise, external data, or explicit modeling of the missingness mechanism.
Always investigate why data is missing before choosing a strategy. The wrong choice silently biases every downstream result.
Production Insight
Dropping rows with missing values can silently remove 30–80% of your dataset.
Always check df.dropna().shape versus df.shape before committing to a drop.
More importantly, check whether the dropped rows are a random subset or a biased one. If all dropped rows belong to a specific demographic, your model will systematically underperform for that group.
Rule: dropping is the last resort. Impute or flag missingness as a feature first. Add a boolean indicator column (was_missing) so the model can learn from the missingness pattern itself.
Key Takeaway
Not all missing values should be filled the same way.
Diagnose the missingness mechanism before choosing a strategy — MCAR, MAR, and MNAR demand different treatments.
Dropping rows is the nuclear option — always check how much data you lose and whether the loss is biased.
Add boolean indicator columns so the model can learn from the missingness pattern itself.
Missing Value Strategy Selection
IfColumn has more than 50% missing values
→
UseDrop the column unless it is the target variable or you have strong domain evidence that the remaining values carry critical signal.
IfNumeric column with less than 20% missing, distribution is roughly normal
→
UseImpute with mean. Add a boolean 'was_missing' indicator column to preserve the missingness signal.
IfNumeric column with less than 20% missing, distribution is skewed or has outliers
→
UseImpute with median. Median is robust to extreme values that would distort the mean.
IfCategorical column with less than 20% missing
→
UseImpute with mode or create an explicit 'Unknown' category. Unknown is often preferable because it does not assume the missing values follow the same distribution as observed values.
IfMissingness is systematic (MAR or MNAR)
→
UseInvestigate the root cause. Consider model-based imputation (KNN Imputer, IterativeImputer). Do not blindly fill — you may introduce bias that is invisible during training but devastating in production.
Removing Duplicates
Duplicates inflate counts, skew averages, and produce overconfident models. They come in two flavors: exact duplicates (entire row identical) and fuzzy duplicates (same real-world entity, slightly different representation — like 'John Smith' vs 'john smith' vs 'J. Smith').
Exact duplicates are easy to detect and remove. Fuzzy duplicates are harder and more dangerous because they survive standard deduplication checks. A customer appearing twice with slightly different names gets counted as two customers, doubling their weight in any aggregate.
The most important principle: define your deduplication key based on business logic, not technical convenience. A surrogate ID that is unique by definition tells you nothing about whether two rows represent the same real-world entity.
io/thecodeforge/cleaning/duplicates.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
import pandas as pd
from difflib importSequenceMatcherclassDuplicateHandler:
"""Detectand remove duplicate records — exact and fuzzy.
Usage:
handler = DuplicateHandler()
dups = handler.find_exact_duplicates(df, subset=['email', 'order_date'])
df_clean = handler.remove_exact_duplicates(df, subset=['email', 'order_date'])
fuzzy = handler.find_fuzzy_duplicates(df, columns=['name', 'address'])
"""
@staticmethod
deffind_exact_duplicates(
df: pd.DataFrame, subset: list = None
) -> pd.DataFrame:
"""Identify exact duplicate rows.
Args:
df: input DataFrame
subset: columns to consider for duplication. None = all columns.
Returns:
DataFrame containing only the duplicate rows (excluding first occurrence).
"""
dup_mask = df.duplicated(subset=subset, keep="first")
dup_count = dup_mask.sum()
print(f"Exact duplicates found: {dup_count} ({dup_count / len(df) * 100:.1f}%)")
if subset:
print(f" Checked on columns: {subset}")
return df[dup_mask]
@staticmethod
defremove_exact_duplicates(
df: pd.DataFrame, subset: list = None, keep: str = "first"
) -> pd.DataFrame:
"""Remove exact duplicates, keeping one copy."""
before = len(df)
df_clean = df.drop_duplicates(subset=subset, keep=keep).reset_index(
drop=True
)
removed = before - len(df_clean)
print(
f"Removed {removed} duplicate rows. {len(df_clean):,} rows remaining."
)
return df_clean
@staticmethod
deffind_fuzzy_duplicates(
df: pd.DataFrame, columns: list, threshold: float = 0.85
) -> list:
"""Find near-duplicate records based on string similarity.
WARNING: O(n^2) complexity. For large datasets (>10k rows), sample
or use blocking strategies (group by first letter, zip code, etc.)
before calling this method.
Args:
df: input DataFrame
columns: columns to concatenate for similarity comparison
threshold: minimum similarity ratio to flag as fuzzy duplicate
Returns:
Listof (index_i, index_j, similarity_score) tuples.
"""
fuzzy_pairs = []
values = df[columns].astype(str).agg(" ".join, axis=1).tolist()
n = len(values)
if n > 10_000:
print(
f"WARNING: {n:,} rows will produce {n*(n-1)//2:,} comparisons. "
f"Consider blocking or sampling first."
)
for i inrange(n):
for j inrange(i + 1, n):
similarity = SequenceMatcher(None, values[i], values[j]).ratio()
if similarity >= threshold and similarity < 1.0:
fuzzy_pairs.append((i, j, round(similarity, 3)))
print(
f"Found {len(fuzzy_pairs)} fuzzy duplicate pairs above "
f"{threshold} threshold"
)
return fuzzy_pairs
Where Duplicates Hide
Payment processor retries create records with different transaction IDs but identical amounts and timestamps within a narrow window.
User registration forms submitted twice produce records with the same email but different auto-incremented user IDs.
Data imports from multiple sources create overlapping date ranges with slight format differences (2025-04-15 vs 04/15/2025).
CRM merges introduce near-duplicate contacts: 'Acme Corp' vs 'Acme Corporation' vs 'ACME Corp.'.
Always define deduplication keys based on business logic, not just surrogate primary keys that are unique by design.
Production Insight
Duplicates often indicate upstream system bugs — not just data entry errors or CSV import accidents.
Removing duplicates without logging them hides the root cause and guarantees the problem recurs on the next data load.
Rule: always report duplicate counts, patterns, and example rows to the data engineering team. Treat recurring duplicates as incidents, not cleanup tasks. Fix the source, not just the symptom.
Key Takeaway
Exact duplicates are easy to find — fuzzy duplicates are the real threat and require domain-specific similarity logic.
Define deduplication keys based on business logic, not table primary keys.
Duplicates often signal upstream bugs — report them, investigate the source, do not just silently delete them.
Fixing Data Types and Formats
Incorrect data types cause silent errors that can persist for months. Numbers stored as strings cannot be aggregated — pandas will concatenate them instead of summing. Dates stored as strings cannot be compared or sorted chronologically. Booleans stored as 'Yes'/'No' strings require explicit conversion before any logical operation.
The root cause is almost always upstream: CSV files have no schema enforcement, APIs return everything as strings, and Excel silently interprets dates in locale-dependent formats. By the time data reaches your DataFrame, the damage is done. Your job is to detect and fix it systematically.
The most insidious variant is partial type corruption — a column that is 99.9% numeric but contains a handful of string values like 'N/A' or '#REF!'. Pandas silently stores the entire column as object dtype, and your aggregation returns wrong results with no error message.
io/thecodeforge/cleaning/types.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
import pandas as pd
import numpy as np
classTypeFixer:
"""Fix data type issues inDataFrames.
Usage:
fixer = TypeFixer()
df = fixer.fix_numeric_strings(df, ['revenue', 'price'])
df = fixer.fix_dates(df, ['order_date', 'ship_date'])
df = fixer.fix_booleans(df, ['is_active', 'has_subscription'])
df = fixer.fix_categoricals(df, ['country', 'status'])
"""
@staticmethod
deffix_numeric_strings(df: pd.DataFrame, columns: list) -> pd.DataFrame:
"""Convert string columns that should be numeric.
Handles currency symbols, commas, whitespace, and common
placeholder strings like 'N/A', '-', 'null', '#REF!'."""
for col in columns:
original_nulls = df[col].isna().sum()
cleaned = df[col].astype(str)
cleaned = cleaned.str.replace(r"[\$\€\£\,]", "", regex=True)
cleaned = cleaned.str.strip()
cleaned = cleaned.replace(
{"": np.nan, "N/A": np.nan, "n/a": np.nan,
"-": np.nan, "null": np.nan, "None": np.nan,
"nan": np.nan, "#REF!": np.nan, "#N/A": np.nan}
)
df[col] = pd.to_numeric(cleaned, errors="coerce")
new_nulls = df[col].isna().sum() - original_nulls
print(
f"{col}: converted to numeric. "
f"{new_nulls} new NaN values from failed coercion."
)
return df
@staticmethod
deffix_dates(
df: pd.DataFrame, columns: list, dayfirst: bool = False,
fmt: str = None
) -> pd.DataFrame:
"""Parse date columns with explicit format handling.
Args:
fmt: explicit strftime format string (e.g., '%Y-%m-%d').
IfNone, pandas infers the format — risky in production.
"""
for col in columns:
if fmt:
df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")
else:
df[col] = pd.to_datetime(
df[col], dayfirst=dayfirst, errors="coerce"
)
failed = df[col].isna().sum()
print(f"{col}: parsed as datetime. {failed} values are NaT.")
# Sanity check: flag future dates
future = (df[col] > pd.Timestamp.now()).sum()
if future > 0:
print(f" WARNING: {future} dates are in the future.")
return df
@staticmethod
deffix_booleans(df: pd.DataFrame, columns: list) -> pd.DataFrame:
"""Convert string booleans to actual booleans."""
true_values = {"yes", "true", "1", "y", "on", "t"}
false_values = {"no", "false", "0", "n", "off", "f"}
for col in columns:
lower = df[col].astype(str).str.lower().str.strip()
df[col] = lower.map(
lambda x: (
Trueif x in true_values
else (Falseif x in false_values else np.nan)
)
)
unmapped = df[col].isna().sum()
print(f"{col}: converted to boolean. {unmapped} unmapped values.")
return df
@staticmethod
deffix_categoricals(df: pd.DataFrame, columns: list) -> pd.DataFrame:
"""Convert high-memory object columns to categorical dtype."""for col in columns:
before_mem = df[col].memory_usage(deep=True) / 1e6
df[col] = df[col].astype("category")
after_mem = df[col].memory_usage(deep=True) / 1e6
if before_mem > 0:
reduction = (1 - after_mem / before_mem) * 100print(
f"{col}: {before_mem:.2f} MB -> {after_mem:.2f} MB "
f"({reduction:.0f}% reduction)"
)
else:
print(f"{col}: converted to category dtype.")
return df
Memory Impact of Type Fixes
Converting a string column with 1 million rows and 50 unique values to category dtype typically reduces memory from ~50 MB to ~1 MB — a 98% reduction. For numeric columns, downcasting float64 to float32 halves memory usage with negligible precision loss for most applications. These savings compound quickly: a 20-column DataFrame can shrink from 2 GB to 200 MB, which is the difference between fitting in a Lambda function's memory or crashing at runtime.
Production Insight
Date parsing failures are completely silent — NaT values do not raise exceptions, and downstream code happily filters them out or treats them as null without warning.
Always compare the count of successfully parsed dates against the total row count. If 5% of dates failed parsing, that is a data quality bug, not an acceptable loss.
Rule: specify date formats explicitly in production pipelines. Never rely on pandas auto-inference, which changes behavior across pandas versions and produces different results on ambiguous inputs like '01/02/2025'.
Key Takeaway
Wrong data types cause silent aggregation failures — pandas concatenates strings instead of summing numbers, with no error.
Convert low-cardinality object columns to category dtype for 90%+ memory savings.
Always specify explicit date formats in production. Auto-inference is a ticking time bomb.
Handling Outliers
Outliers distort model training and summary statistics. A single salary of $10M in a dataset of $50K–$150K salaries will shift the mean by thousands of dollars and stretch the scale for distance-based algorithms.
But the key decision is whether an outlier is an error (remove it), a genuine extreme value (keep or cap it), or the actual signal you are trying to detect (embrace it). Domain knowledge drives this decision, not statistical rules alone.
The IQR method and z-score are detection tools, not decision tools. They flag candidates. A human — or a well-defined business rule — decides what to do with each candidate. Blindly removing everything beyond 1.5×IQR is a recipe for deleting the most informative data points in your dataset.
In fraud detection, the outliers ARE the signal. In medical data, extreme lab values may indicate critical conditions. In e-commerce, a few whales who spend 100x the average drive a disproportionate share of revenue.
Always consult domain experts before removing outliers. An outlier removal step that makes your model's R² look better on paper may be destroying the most important information in the dataset. If you cannot consult an expert, cap rather than remove — at least the record and its other features are preserved.
Production Insight
Outlier thresholds calculated from training data must be saved and reused identically on new data at inference time.
Recalculating thresholds per batch means different data points get flagged each time, making your cleaning pipeline non-deterministic.
Rule: compute outlier bounds once on training data, serialize them alongside your scaler and model, then apply those fixed bounds to all future data. If the distribution genuinely shifts, that is a retraining trigger, not a reason to silently recalculate thresholds.
Key Takeaway
Outliers are either errors or signals — domain knowledge decides which.
Cap extreme values instead of deleting them whenever possible. Removal discards information; capping preserves the record.
Save outlier thresholds from training data and reuse them deterministically on all future batches.
Outlier Handling Strategy
IfOutlier is clearly a data entry error (age = 500, negative price, temperature of 9999°F)
→
UseSet to NaN and treat as a missing value. Investigate and fix the upstream source that produced it.
IfOutlier is plausible but extreme (billionaire income in salary data, 100-unit purchase in a mostly single-unit store)
→
UseCap at a domain-reasonable threshold (e.g., 99th percentile) or use robust scaling (RobustScaler). Preserve the record.
IfOutlier is the signal of interest (fraud transactions, anomalous sensor readings, rare diseases)
→
UseKeep it. Use algorithms robust to outliers (tree-based models, isolation forest). Removing it defeats the purpose.
IfDistribution is heavily skewed but all values are valid (income, page views, time on site)
→
UseApply log1p or Box-Cox transformation to compress the tail instead of removing values. This preserves information while reducing the impact of extremes.
Encoding Categorical Variables
Machine learning models require numeric input. Categorical variables — country, product type, color, status — must be converted to numbers through encoding. This is a preprocessing step, not cleaning, but it is where beginners make some of their most costly mistakes.
The encoding strategy depends on two factors: cardinality (how many unique values the column has) and whether those values have a natural order. Choosing the wrong method can create thousands of useless sparse columns, introduce false ordinal relationships, or leak target information into features.
The most common mistake: one-hot encoding a column with 10,000 unique values. This creates 9,999 new binary columns, most of which are zero for any given row. Memory explodes, training crawls, and linear models become impossible to fit.
io/thecodeforge/cleaning/encoding.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import pandas as pd
import numpy as np
from sklearn.preprocessing importLabelEncoder, OneHotEncoderclassCategoricalEncoder:
"""Encode categorical variables for model consumption.
Usage:
encoder = CategoricalEncoder()
df = encoder.encode_onehot(df, ['color', 'size'])
df = encoder.encode_ordinal(df, 'education', ['high_school', 'bachelors', 'masters', 'phd'])
df = encoder.encode_frequency(df, ['city', 'product_id'])
"""
@staticmethod
defencode_onehot(
df: pd.DataFrame, columns: list, max_cardinality: int = 10
) -> pd.DataFrame:
"""One-hot encode low-cardinality columns.
Skips columns that exceed max_cardinality and prints a warning.
Uses drop_first=True to avoid multicollinearity in linear models.
"""
for col in columns:
cardinality = df[col].nunique()
if cardinality > max_cardinality:
print(
f"SKIPPED: {col} has {cardinality} unique values. "
f"Use frequency or target encoding instead."
)
continue
dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
print(f"{col}: one-hot encoded into {len(dummies.columns)} columns")
return df
@staticmethod
defencode_ordinal(
df: pd.DataFrame, column: str, order: list
) -> pd.DataFrame:
"""Encode ordered categories (e.g., Low < Medium < High).
Unmapped values are set to NaN so they surface as missing rather
than silently receiving an arbitrary integer.
"""
mapping = {val: idx for idx, val inenumerate(order)}
unmapped = set(df[column].dropna().unique()) - set(mapping.keys())
if unmapped:
print(f"WARNING: {column} has unmapped values: {unmapped}")
df[column] = df[column].map(mapping)
print(f"{column}: ordinally encoded with order {order}")
return df
@staticmethod
defencode_frequency(
df: pd.DataFrame, columns: list
) -> pd.DataFrame:
"""Replace categories with their frequency in the training data.
Safefor high-cardinality features. Doesnot create new columns.
"""
for col in columns:
freq_map = df[col].value_counts(normalize=True).to_dict()
df[col] = df[col].map(freq_map)
print(f"{col}: frequency encoded ({len(freq_map)} categories)")
return df
@staticmethod
defencode_label(
df: pd.DataFrame, columns: list
) -> tuple:
"""Label encode — safe for tree-based models only.
WARNING: creates false ordinal relationships for linear models.
Returns the DataFrameand a dict of fitted LabelEncodersfor
inverse transformation and production reuse.
"""
encoders = {}
for col in columns:
le = LabelEncoder()
df[col] = le.fit_transform(df[col].astype(str))
encoders[col] = le
print(f"{col}: label encoded ({len(le.classes_)} classes)")
return df, encoders
Encoding Decision Framework
Low cardinality (< 10 unique values), no natural order: One-hot encoding. Creates N-1 binary columns. Standard for most classification tasks.
Low cardinality, natural order exists: Ordinal encoding. Maps to integers preserving order (Low=0, Medium=1, High=2). Use for education levels, satisfaction ratings, size categories.
High cardinality (> 50 unique values): Target encoding (mean of target per category) or frequency encoding. One-hot creates too many sparse columns and kills performance.
Tree-based models (Random Forest, XGBoost, LightGBM): Label encoding works fine. Trees split on thresholds, so the arbitrary integer assignment does not create false distance relationships.
Linear models, SVMs, and neural networks: One-hot or target encoding only. Label encoding introduces false ordinal relationships (category 5 is not 'more' than category 2).
Production Insight
One-hot encoding a column with 10,000 unique values creates 9,999 new columns, most of which are zero for any given row.
This explodes memory, makes linear models intractable, and adds zero predictive signal for rare categories.
Rule: always check df['col'].nunique() before encoding. If cardinality exceeds 50, use frequency encoding, target encoding, or hash encoding. If cardinality exceeds 10,000, strongly consider whether this feature should be decomposed (e.g., zip code → state) or dropped entirely.
Key Takeaway
Encoding strategy depends on cardinality and model type — there is no universal best method.
One-hot for low cardinality, frequency or target encoding for high cardinality.
Label encoding is safe for trees but creates false ordinal relationships in distance-based and linear models.
Always save the encoder or mapping. You need the exact same transformation at inference time.
Feature Scaling
Features on different scales — age in years (0–100) vs. income in dollars (0–10,000,000) — bias distance-based models and slow gradient descent convergence. KNN will treat a $1 difference in income as equivalent to a 1-year difference in age, which makes no sense.
Scaling normalizes feature ranges without changing the distribution shape. It does not remove outliers, fix missing values, or encode categories. It is a pure preprocessing step that should happen after cleaning, after encoding, and after the train/test split.
The critical rule that most beginners violate: fit the scaler on training data only. If you fit on the full dataset, the scaler's mean and standard deviation incorporate test set information, which leaks future knowledge into training. The metrics look artificially good, and the model underperforms on genuinely unseen data.
io/thecodeforge/cleaning/scaling.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
import pandas as pd
import numpy as np
from sklearn.preprocessing importStandardScaler, MinMaxScaler, RobustScalerclassFeatureScaler:
"""Scale numeric features for model consumption.
Usage:
scaler_util = FeatureScaler()
df_train, scaler = scaler_util.scale_standard(df_train, num_cols)
df_test[num_cols] = scaler.transform(df_test[num_cols]) # same scaler!"""
@staticmethod
defscale_standard(
df: pd.DataFrame, columns: list
) -> tuple:
"""Z-score normalization: mean=0, std=1.
Usefor normally distributed features with linear models,
logistic regression, SVMs, andPCA.
"""
scaler = StandardScaler()
df[columns] = scaler.fit_transform(df[columns])
print(f"StandardScaler applied to {len(columns)} columns")
return df, scaler
@staticmethod
defscale_minmax(
df: pd.DataFrame, columns: list,
feature_range: tuple = (0, 1)
) -> tuple:
"""Min-max scaling to a fixed range.
Usefor neural networks (which expect bounded inputs) and
features with known theoretical bounds.
"""
scaler = MinMaxScaler(feature_range=feature_range)
df[columns] = scaler.fit_transform(df[columns])
print(
f"MinMaxScaler applied to {len(columns)} columns, "
f"range={feature_range}"
)
return df, scaler
@staticmethod
defscale_robust(df: pd.DataFrame, columns: list) -> tuple:
"""Scale using median andIQR instead of mean and std.
Use when data has outliers that would distort StandardScaler.
Outliers still exist after scaling but no longer dominate the range.
"""
scaler = RobustScaler()
df[columns] = scaler.fit_transform(df[columns])
print(
f"RobustScaler applied to {len(columns)} columns "
f"(outlier-resistant)"
)
return df, scaler
When to Scale and When Not To
Needs scaling: KNN, SVM, Neural Networks, PCA, Logistic Regression, Linear Regression (especially with regularization — L1/L2 penalties are scale-dependent).
Does NOT need scaling: Decision Trees, Random Forests, XGBoost, LightGBM, CatBoost, Naive Bayes. These models split on thresholds, not distances.
Always scale after the train/test split — fit the scaler on training data only, then transform both train and test with the same fitted scaler.
Save the scaler object with joblib.dump(). You must apply the exact same transformation to new data at inference time. A different scaler = different feature space = garbage predictions.
If you use an sklearn Pipeline, the scaler is automatically fitted only on training data during pipeline.fit() — this is one of the strongest reasons to use Pipelines.
Production Insight
Fitting the scaler on the full dataset (including test data) causes data leakage.
The scaler computes mean and standard deviation using test data points, so the model indirectly sees test set statistics during training. This inflates accuracy by 1–5% on typical tabular datasets — enough to make a mediocre model look good in evaluation but fail in production.
Rule: fit scaler on training data only. Use scaler.transform() on test and production data. If you ever call scaler.fit_transform() on test data, you have a data leakage bug.
Key Takeaway
Scaling is required for distance-based and gradient-based models but unnecessary for tree-based models.
Use RobustScaler when outliers are present — it uses median and IQR instead of mean and std.
Always fit the scaler on training data only. Serialize it and ship it with the model to production.
Building a Reproducible Cleaning Pipeline
Manual cleaning in a notebook is fine for exploration. Production requires a reproducible pipeline that applies the exact same transformations to new data consistently, without human intervention, without remembering which cells to run in which order.
Scikit-learn's Pipeline and ColumnTransformer make this possible. They encapsulate the entire sequence of transformations — imputation, encoding, scaling — into a single object that can be fitted on training data, serialized to disk, and loaded at inference time. The pipeline remembers every parameter: the median used for imputation, the mean and std used for scaling, the category set used for encoding.
This is not optional infrastructure for large teams. It is essential even for solo projects. The moment you retrain a model or deploy it to a different environment, you need the pipeline. Without it, you will spend hours reverse-engineering what transformations were applied and in what order.
io/thecodeforge/cleaning/pipeline.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
import pandas as pd
import numpy as np
import joblib
from sklearn.pipeline importPipelinefrom sklearn.compose importColumnTransformerfrom sklearn.preprocessing importStandardScaler, OneHotEncoderfrom sklearn.impute importSimpleImputerclassDataCleaningPipeline:
"""Reproducible data cleaning pipeline for production use.
Encapsulates imputation, encoding, and scaling into a single
serializable object. Fit once on training data, transform everything
elsewith the same fitted parameters.
Usage:
pipeline = DataCleaningPipeline(
numeric_cols=['age', 'income', 'score'],
categorical_cols=['city', 'status', 'tier']
)
X_train_clean = pipeline.fit_transform(X_train)
X_test_clean = pipeline.transform(X_test)
pipeline.save('preprocessing_pipeline.pkl')
# At inference time:
loaded = DataCleaningPipeline.load('preprocessing_pipeline.pkl')
X_new_clean = loaded.transform(X_new)
"""
def__init__(self, numeric_cols: list, categorical_cols: list):
self.numeric_cols = numeric_cols
self.categorical_cols = categorical_cols
self.pipeline = self._build_pipeline()
self._is_fitted = Falsedef_build_pipeline(self) -> ColumnTransformer:
numeric_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
("encoder", OneHotEncoder(
handle_unknown="ignore", sparse_output=False
)),
])
preprocessor = ColumnTransformer(transformers=[
("numeric", numeric_transformer, self.numeric_cols),
("categorical", categorical_transformer, self.categorical_cols),
])
return preprocessor
deffit(self, X_train: pd.DataFrame) -> "DataCleaningPipeline":
"""Fit all transformers on training data only."""self.pipeline.fit(X_train)
self._is_fitted = Trueprint(f"Pipeline fitted on {len(X_train):,} rows, "
f"{len(self.numeric_cols)} numeric + "
f"{len(self.categorical_cols)} categorical columns.")
returnselfdeftransform(self, X: pd.DataFrame) -> np.ndarray:
"""Transform new data using the fitted pipeline."""ifnotself._is_fitted:
raiseRuntimeError(
"Pipeline is not fitted. Call .fit(X_train) first."
)
returnself.pipeline.transform(X)
deffit_transform(self, X_train: pd.DataFrame) -> np.ndarray:
"""Fit and transform in one step. Use on training data."""self.fit(X_train)
returnself.pipeline.transform(X_train)
defsave(self, filepath: str) -> None:
"""Serialize the fitted pipeline to disk."""ifnotself._is_fitted:
raiseRuntimeError(
"Cannot save an unfitted pipeline. Call .fit() first."
)
joblib.dump(self, filepath)
print(f"Pipeline saved to {filepath}")
@staticmethod
defload(filepath: str) -> "DataCleaningPipeline":
"""Load a previously saved pipeline."""
pipeline = joblib.load(filepath)
print(f"Pipeline loaded from {filepath}")
return pipeline
Pipeline Immutability Is Non-Negotiable
Once a pipeline is fitted, it encodes the training data's statistics — the median for imputation, the mean and std for scaling, the set of known categories for encoding. Never refit the pipeline on new data after deployment. If the data distribution shifts significantly, that is a retraining trigger: retrain the entire pipeline and model together on a fresh training set, validate, and redeploy. Refitting the pipeline alone while keeping the old model creates a mismatch between preprocessing assumptions and model weights.
Production Insight
Notebook-based cleaning breaks the moment new data has a column your notebook did not anticipate, or a category the encoder has never seen.
A scikit-learn Pipeline handles unseen categories gracefully with handle_unknown='ignore', and it enforces the correct fit-then-transform order by design.
Rule: always encapsulate cleaning logic in a Pipeline object. Serialize it with joblib.dump(), version it alongside your model artifacts, and load it at inference time. The pipeline and the model are a unit — they ship together or not at all.
Key Takeaway
Manual notebook cleaning does not scale and does not reproduce. Build a scikit-learn Pipeline.
Fit all transformers on training data only, then transform everything else with the fitted pipeline.
Serialize the pipeline with joblib and version it as a first-class artifact. It must travel with the model to every environment — dev, staging, and production.
Why Your Correlation Matrix is a Lie (And How to Fix It)
You just dropped a model into production. R² looks solid. In the real world, it predicts random noise. I've seen this exact disaster caused by multicollinearity you never checked. Raw data hides relationships between features. Correlation analysis exposes them. Two columns measuring the same underlying signal? Your model double-counts that signal, memorizes training noise, and falls apart on new data.
Here's the fix: generate a Pearson correlation matrix. Look for absolute values above 0.8. Those features are collinear. Drop one or combine them via PCA. Never trust a model with collinear features. It's not learning—it's overfitting to a ghost.
Your junior always skips this step because nothing looks broken. That's exactly when production burns down. Always run corr(), then heatmap it. Never deploy without this check.
correlation_check.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// io.thecodeforge
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('production_orders.csv')
# Compute correlation matrix
corr_matrix = df.corr()
# Find highly correlated pairs
high_corr = [(col1, col2, corr_matrix.loc[col1, col2])
for col1 in corr_matrix.columns
for col2 in corr_matrix.columns
if col1 != col2 andabs(corr_matrix.loc[col1, col2]) > 0.8]
print(f"Highly correlated pairs (>0.8): {high_corr}")
# Output: [('revenue', 'price', 0.94), ('orders', 'clicks', 0.91)]# Visual heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations - Check Before Training')
plt.show()
Don't just print the matrix—actually drop or transform one of each collinear pair. I watched a team deploy a churn model that used both 'call_duration' and 'total_seconds'—identical features with different names. Predictions looked amazing in testing. Real users? Pure chaos.
Key Takeaway
Run correlation analysis before feature selection. If two features are >0.8 correlated, one has to go.
The Target Variable Distribution: Why Your Model Only Predicts the Average
Last month, a junior handed me a regression model that predicted exactly the mean for every sample. He blamed the algorithm. I pointed at the target variable histogram. It was bimodal—two peaks, completely separate populations. The model averaged them. It never learned either group.
Your target's shape dictates everything. If it's skewed right, your model overfits high-value outliers and undershoots the majority. If it's bimodal, you're encoding two separate problems into one. Always visualize the target with a histogram or density plot before you touch a single feature.
For skewed targets, apply a log or Box-Cox transformation. For bimodal ones, split the data into two separate models or use a classification threshold. Never train a regression model on a multimodal target without asking why it's split. The answer will save your deployment.
Plot the target before any feature engineering. If your distribution has multiple peaks, stop and ask: 'Should I build two models?' If it's skewed, transform it. If you skip this, your model will break on the long tail.
Key Takeaway
Always visualize your target variable. A skewed or multimodal target guarantees a bad model. Fix it before training.
Separating Features and Target: The One-Line That Stops Silent Failures
I've debugged models where the dataset accidentally included the target column twice—once as a feature. The model hit 99% accuracy in minutes. In production? 20%. Because that duplicate column wasn't present in real inference. The 'score' didn't exist. The model panicked on missing data and output garbage.
Separate your features (X) and target (y) at the start of every pipeline. Never assume 'all columns except the last one are features.' Explicitly define them. A production system will respect an explicit drop more than an implicit assumption.
Use drop with the column name. Don't rely on positional indexing. If someone reorders columns next week (they will), your positional slice breaks silently. A named column drop breaks visibly with a KeyError. Visible failure beats silent nonsense every time.
split_features_target.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// io.thecodeforge
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('customer_churn.csv')
# BAD: implicit positional split# X = df.iloc[:, :-1] # Breaks silently if columns reorder# y = df.iloc[:, -1] # Same# GOOD: explicit column names
target_column = 'churn' # or 'price', 'revenue', etc.
feature_columns = [col for col in df.columns if col != target_column]
X = df[feature_columns]
y = df[target_column]
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
# Output: X_train shape: (800, 12) y_train shape: (800,)# Sanity check: ensure target not in featuresassert target_column notin feature_columns, "Target leaked into features!"
Output
X_train shape: (800, 12) y_train shape: (800,)
Production Trap:
Never use iloc to separate features and target. I debugged a pipeline where a data engineer added a new column to the left. The train split silently shifted, pulling the target into features. The model looked perfect for a week. Then users reported insane predictions. The root cause: a single positional index change.
Key Takeaway
Always separate features and target by explicit column name, not position. Add an assertion to catch target leakage immediately.
● Production incidentPOST-MORTEMseverity: high
Revenue Forecast Off by 40% Due to Undetected Duplicate Records
Symptom
The model predicted $14M in Q3 revenue. Actual revenue came in at $10M. The error was consistent across multiple quarters of backtesting, not a one-off spike. Finance flagged it when actuals landed — nobody caught it during model validation because the duplicates existed in both training and test sets, inflating metrics uniformly.
Assumption
The data engineering team assumed the ETL pipeline deduplicated records at ingestion. The data science team assumed the dataset was clean because it came from the data warehouse, which was treated as the single source of truth. Neither team had an independent row-count reconciliation step.
Root cause
A bug in the upstream payment processor created duplicate transaction records with different timestamps (off by 1–2 seconds) after network retries. The ETL pipeline deduplicated on transaction_id, but the retried transactions were assigned unique IDs by the payment gateway. The data warehouse contained roughly 40% duplicate revenue entries — invisible to any check that relied solely on surrogate keys.
Fix
Added a deduplication step based on a (customer_id, amount, timestamp_window_5s) composite key that matched business semantics rather than system-generated IDs. Implemented a row-count sanity check comparing warehouse totals to source system totals after every ETL run. Added automated Slack alerts when the duplicate ratio exceeds 2% or when warehouse row counts diverge from source counts by more than 0.5%.
Key lesson
Never trust upstream data without independent validation — even if it comes from your own warehouse.
Deduplicate on business-meaningful keys, not just surrogate IDs. System-generated keys can mask real-world duplicates.
Compare aggregate totals between source and destination systems as a routine sanity check, not an afterthought.
Duplicates that exist in both train and test sets inflate all evaluation metrics uniformly — you cannot catch this with a standard holdout split.
Production debug guideCommon signals that your data has cleaning problems — and exactly where to look.5 entries
Symptom · 01
Model accuracy is high on training data but terrible on new data
→
Fix
Check for data leakage first: are future-derived features present? Then check for duplicates spanning train/test splits with df.duplicated(subset=feature_cols).sum(). Finally, look for unclean features that let the model memorize noise rather than learn signal.
Run df.describe() and inspect min/max values for every numeric column. Apply domain-specific validation rules: age ∈ [0, 120], dates ≤ today, percentages ∈ [0, 100]. Flag impossible values as NaN and investigate the upstream source that produced them.
Symptom · 03
Groupby aggregations produce unexpected counts or totals
→
Fix
Check for leading/trailing whitespace and case inconsistencies in string columns. 'California', 'california', and 'California ' are three different groups to pandas. Use df['col'] = df['col'].str.strip().str.lower() before grouping.
Symptom · 04
Joins produce more rows than expected
→
Fix
You have duplicate keys in one or both tables. A many-to-many join on duplicated keys creates a Cartesian product. Check df.duplicated(subset=['key_col']).sum() on both DataFrames before joining. If duplicates are expected, use a merge indicator: pd.merge(left, right, on='key', how='left', indicator=True).
Symptom · 05
Feature distributions look completely different between training and serving data
→
Fix
This is data drift, but it can also be a cleaning artifact. Check whether the cleaning pipeline was fitted on training data and applied consistently at serving time. A common cause is a scaler or imputer being refit on serving data instead of using the saved training-time transformer.
★ Data Cleaning Debug Cheat SheetQuick checks to diagnose data quality issues when something looks off.
Unexpected NaN values appearing after type conversion−
Immediate action
Check for non-numeric strings hiding in numeric columns (e.g., 'N/A', '-', empty strings, 'null', '#REF!').
Replace non-numeric placeholders with NaN explicitly before converting: df['column'].replace(['N/A', '-', '', 'null'], np.nan, inplace=True), then pd.to_numeric(df['column'], errors='coerce').
Date parsing produces NaT or wrong dates+
Immediate action
Inspect raw date strings for mixed formats. The value '03/04/2025' means March 4 in the US and April 3 in most of Europe. Pandas guesses, and it guesses wrong roughly half the time on ambiguous dates.
Specify format explicitly with pd.to_datetime(df['date_col'], format='%Y-%m-%d'). If formats are genuinely mixed, parse each known format separately into sub-DataFrames and combine with pd.concat.
Memory usage is unexpectedly high+
Immediate action
Check for object columns that should be categorical. A column with 10 unique string values repeated across 10 million rows stores 10 million separate string objects in memory.
Commands
df.memory_usage(deep=True).sum() / 1e6
df.select_dtypes('object').nunique()
Fix now
Convert low-cardinality object columns to category dtype: df['col'] = df['col'].astype('category'). For numeric columns stored as float64 that only contain small integers, downcast: pd.to_numeric(df['col'], downcast='integer').
Value counts show suspiciously similar categories that should be the same+
Immediate action
Check for encoding issues, invisible Unicode characters, and casing differences.
Commands
df['col'].value_counts().head(20)
[repr(v) for v in df['col'].unique()[:20]]
Fix now
Normalize with df['col'] = df['col'].str.strip().str.lower().str.normalize('NFKD'). Then map known synonyms: df['col'].replace({'usa': 'united states', 'u.s.': 'united states'}).
Scaling Methods Comparison
Scaler
Formula
Outlier Resistant
Output Range
Best For
StandardScaler
(x − mean) / std
No
Unbounded (≈ −3 to 3 for normal data)
Normally distributed features, linear models, logistic regression, SVM
MinMaxScaler
(x − min) / (max − min)
No
[0, 1] or custom range
Neural networks, image pixel values, features with known theoretical bounds
RobustScaler
(x − median) / IQR
Yes
Unbounded
Data with outliers, skewed distributions, salary/revenue columns
MaxAbsScaler
x / max(|x|)
No
[−1, 1]
Sparse data (preserves zero entries), already centered features
Log Transform
log(x + 1)
Partially
[0, ∞)
Right-skewed distributions (income, counts, page views, time on site)
Key takeaways
1
Inspect data with df.shape, df.dtypes, df.isnull().sum(), df.duplicated().sum(), and df.describe() before writing a single line of cleaning code.
2
Understand why data is missing before choosing an imputation strategy
the missingness mechanism (MCAR, MAR, MNAR) matters more than the imputation method.
3
Deduplicate on business-meaningful composite keys, not on system-generated surrogate IDs that are unique by design.
4
Always split data into train/test before fitting scalers, encoders, or imputers
fitting on the full dataset causes data leakage that inflates metrics by 1–5%.
5
Check cardinality with df['col'].nunique() before one-hot encoding. More than 50 unique values demands frequency encoding, target encoding, or decomposition.
6
Build a reproducible scikit-learn Pipeline that encapsulates the entire cleaning and preprocessing sequence in a single serializable object.
7
Serialize the preprocessing pipeline with joblib and version it alongside the model. They are a unit
shipping one without the other guarantees production failures.
Common mistakes to avoid
6 patterns
×
Dropping all rows with any missing values
Symptom
Dataset shrinks from 100,000 rows to 20,000 rows after a single df.dropna() call. The model trains on a biased subset because missingness is often correlated with specific groups — younger users, mobile traffic, free-tier accounts.
Fix
Check df.dropna().shape first. If more than 10% of rows would be lost, impute instead. Use df.isnull().sum() to identify which columns drive the loss — often it is one or two columns dragging everything down. Add boolean indicator columns for missingness before imputing so the model can learn from the pattern.
×
Fitting the scaler or encoder on the full dataset before the train/test split
Symptom
Model evaluation metrics are artificially inflated by 1–5%. The scaler leaks test set statistics (mean, std, min, max) into the training process. The model looks great in evaluation but underperforms on genuinely unseen production data.
Fix
Always split first, then fit transformers on X_train only. Use scaler.transform(X_test) — never scaler.fit_transform(X_test). Wrap everything in a scikit-learn Pipeline to enforce this automatically. If you catch yourself calling .fit() on anything other than training data, stop — that is a leakage bug.
×
One-hot encoding a column with thousands of unique values
Symptom
DataFrame explodes from 20 columns to 5,000+ columns. Memory usage spikes by 10x or more. Linear model training becomes impossibly slow or crashes. Most new columns are nearly all zeros and add no predictive signal.
Fix
Check cardinality with df['col'].nunique() before encoding. For features with more than 50 unique values, use frequency encoding, target encoding, or hash encoding. For features with more than 10,000 values, consider decomposing (zip code → state) or dropping the feature entirely if it does not carry enough predictive signal.
×
Applying log transform to columns containing zeros or negative values
Symptom
RuntimeWarning: divide by zero encountered in log. Resulting values are NaN or -inf, silently corrupting the entire feature column. Downstream models train on garbage without raising an error.
Fix
Use np.log1p(x), which computes log(1 + x) and handles zeros safely. For negative values, shift the distribution first: np.log1p(x - x.min()). For columns that can be both positive and negative, use the Yeo-Johnson transform from sklearn.preprocessing.PowerTransformer, which handles all real values.
×
Not saving cleaning parameters for production inference
Symptom
Model works perfectly in the training notebook but produces errors, NaN features, or garbage predictions on new data in production. Missing value fill values, scaler parameters, encoder mappings, and outlier thresholds are lost when the notebook session ends.
Fix
Serialize the entire preprocessing pipeline with joblib.dump(pipeline, 'preprocessor.pkl'). Load it at inference time with joblib.load('preprocessor.pkl'). The pipeline object stores all fitted parameters internally. Version this artifact alongside the model file — they are inseparable.
×
Removing outliers using a global threshold without considering per-feature distributions
Symptom
Valid extreme values in one feature (CEO salary of $5M) are removed because they exceed a threshold calculated from a completely different feature (intern salary). Or: a z-score threshold of 3 flags 10% of a heavily skewed feature, removing real data.
Fix
Apply outlier detection per feature, not globally. Use IQR or z-score calculated on each column individually. Consider the business context: is this value plausible for this specific feature? For skewed distributions, transform first (log, Box-Cox), then apply outlier detection on the transformed scale.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
You receive a dataset where 35% of values in a critical feature column a...
Q02JUNIOR
What is the difference between data cleaning and data preprocessing?
Q03SENIOR
How would you handle a categorical feature with 10,000 unique values for...
Q04SENIOR
Why must you fit scalers on training data only and not the full dataset?
Q01 of 04SENIOR
You receive a dataset where 35% of values in a critical feature column are missing. How do you decide what to do?
ANSWER
First, I would investigate the missingness mechanism. I would check whether missingness correlates with other features (MAR) or with the target variable (MNAR) by comparing distributions of rows with and without missing values — for example, comparing the mean target value when the feature is present versus absent. If there is no significant difference, it is likely MCAR, and I can impute with median for numeric or mode for categorical data, adding a boolean indicator column to preserve the missingness signal.
If missingness is systematic, I would investigate the upstream data source to understand why 35% of records lack this value. Is it a recently added field? Is it optional in the source form? Does a specific data provider fail to supply it?
If more than 50% were missing, I would consider dropping the column. At 35%, I would likely impute — but I would use a method appropriate to the mechanism, potentially KNNImputer or IterativeImputer for MAR data rather than simple mean/median. I would never blindly drop rows — losing 35% of data almost always introduces more bias than imputation.
Q02 of 04JUNIOR
What is the difference between data cleaning and data preprocessing?
ANSWER
Data cleaning fixes quality issues in raw data: removing duplicates, correcting data entry errors, handling missing values, fixing inconsistent formats and string representations, and validating that values fall within domain-valid ranges.
Data preprocessing transforms clean data into a format suitable for modeling: encoding categorical variables as numbers, scaling numeric features to comparable ranges, creating derived features through feature engineering, and splitting into train/test sets.
Cleaning removes noise and errors. Preprocessing adds structure and prepares data for a specific algorithm's requirements. You must clean before you preprocess — feeding dirty data into a scaler or encoder produces well-formatted garbage. The output looks numeric and properly scaled, but the underlying values are meaningless.
Q03 of 04SENIOR
How would you handle a categorical feature with 10,000 unique values for a linear regression model?
ANSWER
One-hot encoding would create 9,999 sparse columns, making the model intractable and likely overfitting to noise. I would consider several alternatives in order of preference:
1. Target encoding: replace each category with the mean of the target variable for that category, computed on training data only. To prevent overfitting on rare categories, I would apply smoothing — blending the category mean with the global mean, weighted by category frequency.
2. Frequency encoding: replace each category with its relative frequency in the training data. Simple, no target leakage risk, but loses signal about the category-target relationship.
3. Feature decomposition: if the categorical represents something hierarchical (like zip code), decompose it into higher-level groupings (state, region) that have lower cardinality.
4. Grouping rare categories: bucket categories appearing fewer than N times into an 'Other' category, reducing effective cardinality.
I would also question whether a feature with 10,000 categories is actually useful for a linear model at all — sometimes a tree-based model is a better fit for the data structure.
Q04 of 04SENIOR
Why must you fit scalers on training data only and not the full dataset?
ANSWER
Fitting a scaler on the full dataset leaks information from the test set into the training process. The scaler computes mean and standard deviation using test data points, so when training data is transformed, those statistics incorporate knowledge of the test distribution. The model indirectly sees test set characteristics during training.
This produces optimistically biased evaluation metrics — accuracy, F1, RMSE all look better than they will be on truly unseen data. The magnitude is typically 1–5% on tabular datasets, but I have seen cases where leakage inflated AUC by 10+ points on small, high-variance datasets.
The correct approach is: split first, fit the scaler on X_train only, then call scaler.transform() on both X_train and X_test. In production, the same fitted scaler object — serialized with joblib — must be loaded and applied to new incoming data. If you use a scikit-learn Pipeline, this ordering is enforced automatically, which is one of the strongest arguments for using Pipelines.
01
You receive a dataset where 35% of values in a critical feature column are missing. How do you decide what to do?
SENIOR
02
What is the difference between data cleaning and data preprocessing?
JUNIOR
03
How would you handle a categorical feature with 10,000 unique values for a linear regression model?
SENIOR
04
Why must you fit scalers on training data only and not the full dataset?
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Should I clean data before or after splitting into train/test sets?
Inspect and remove exact duplicates before splitting — duplicates that span both sets cause data leakage where the model memorizes training examples that also appear in the test set. For transformations that learn parameters (scaling, imputation with mean/median, encoding), always split first, then fit on training data only. The correct sequence is: (1) load raw data, (2) remove exact duplicates on the full dataset, (3) split into train and test, (4) fit cleaning and preprocessing transformers on train only, (5) transform both train and test using the fitted transformers.
Was this helpful?
02
How do I know if my data has outliers vs. just a skewed distribution?
Plot a histogram and a boxplot side by side. A skewed distribution has a long tail but continuous density — the values get sparser as you move along the tail, but there are no isolated points far from everything else. All values are plausible given the domain. Outliers are isolated points visibly disconnected from the main distribution, appearing as individual dots beyond the boxplot whiskers with no nearby values.
Use domain knowledge as the final arbiter: a salary of $500K is plausible for a CEO but clearly an error for an intern position. Statistical tests like IQR or z-score flag candidates, but a human — or a well-defined business rule — decides whether each candidate is an error, a valid extreme, or a signal.
Was this helpful?
03
Can I use the same cleaning pipeline for different datasets?
Only if the datasets share the exact same schema — same columns, same types, same expected range of category values. A pipeline fitted on one dataset encodes that dataset's specific statistics: the median for imputation, the mean and std for scaling, the known set of categories for one-hot encoding.
If a new dataset has different columns, the pipeline will crash. If it has new category values the encoder has never seen, OneHotEncoder with handle_unknown='ignore' will silently zero them out, which may or may not be acceptable. Never reuse a pipeline blindly without validating that the new data's schema matches the training data's schema.
Was this helpful?
04
What is the difference between imputing with mean vs. median?
Mean is sensitive to outliers — a single extreme value (a $10M salary among $50K–$150K salaries) pulls the mean dramatically and fills missing values with a number that represents nobody in the dataset. Median is robust — it represents the actual middle value regardless of extremes.
For normally distributed data with no outliers, mean and median are nearly identical and either works. For skewed distributions (income, prices, counts, time-on-site), median is almost always the better choice. In production, default to median unless you have verified the distribution is approximately symmetric. It is the safer choice because you cannot always guarantee your data is outlier-free — and a single outlier in a future batch can silently corrupt mean-imputed values.
Was this helpful?
05
How often should I re-inspect and re-clean production data?
Continuously, via automated pipeline gates. Data quality is not a one-time event — upstream schemas change, new data sources are added, vendor APIs modify their formats, and seasonal patterns introduce values your original cleaning logic never anticipated.
At minimum, run automated inspection checks (row counts, missing percentages, value range validation, duplicate ratios) on every data batch before it enters your pipeline. Set thresholds that trigger alerts or halt processing. Re-evaluate your cleaning strategy quarterly or whenever model performance degrades — declining accuracy is often a data quality symptom, not a model architecture problem.