Senior 5 min · March 06, 2026

Feature Engineering: Leaky Aggregate Causes AUC Drop

Model AUC dropped from 94% to 61% due to lookahead bias in rolling average feature - avoid the silent accuracy drop with per-fold computation.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Feature engineering transforms raw data into representations ML models can actually learn from
  • Scaling (Min-Max, Standardization) prevents large-magnitude features from dominating distance-based models
  • One-Hot Encoding avoids false ordinal relationships in nominal categories; Label Encoding is for ordinal data only
  • Domain-driven features (ratios, flags, time deltas) often beat raw columns in predictive power
  • Production insight: data leakage from fitting scalers on full dataset is the #1 cause of metric optimism — split before transform
Plain-English First

Imagine you're baking a cake. The raw ingredients — flour, eggs, sugar — are your data. Feature engineering is the prep work: sifting the flour so it's smooth, cracking and beating the eggs, measuring sugar precisely. You could dump everything in a bowl unprocessed, but the cake would be a disaster. ML models are the same — they can't learn well from raw, messy ingredients. Feature engineering is how you prep your data so the model can actually taste the signal.

Your model is only as smart as the data you hand it. You can swap in a fancier algorithm all day long, but if your input features are raw, inconsistent, or poorly structured, accuracy will plateau and you'll spend hours debugging what feels like a model problem — but is actually a data problem. Feature engineering is the craft that separates a hobbyist notebook from a production ML system. It's the reason a simple logistic regression with great features often outperforms a deep neural network fed raw junk.

The core problem feature engineering solves is this: real-world data is collected for humans, not machines. A column labeled 'Customer Since' contains a date string. A 'Price' column spans from $1 to $10,000. A 'City' column has 500 unique text values. Raw ML algorithms see numbers, and they interpret scale literally — a city with code 499 looks mathematically 'larger' than code 1. Feature engineering bridges the gap between how humans record information and how models consume it.

By the end of this article you'll know how to normalize numerical features so large-scale columns don't bully small-scale ones, encode categorical variables without introducing false ordinal relationships, engineer entirely new features from existing ones, and bin continuous values into meaningful groups. Every technique comes with a real Python example you can run today and a clear explanation of when to reach for it.

Scaling and Normalization — Why Raw Numbers Lie to Your Model

Picture a dataset with two columns: a person's age (18–80) and their annual income ($20,000–$200,000). To you, those are just two different measurements. To a distance-based model like K-Nearest Neighbors or a gradient-based model like logistic regression, income completely dominates. The income differences are literally thousands of times larger, so the model barely 'hears' the age signal at all.

Normalization fixes this by rescaling features onto a common playing field. There are two main approaches you'll reach for constantly.

Min-Max Scaling squishes every value into a 0–1 range. It preserves the shape of the distribution but is sensitive to outliers — one rogue value at $999,999 will compress everything else near zero.

Standardization (Z-score) rescales to mean=0, std=1. It doesn't bound values between 0 and 1, but it handles outliers far more gracefully and is the go-to for models that assume normally distributed inputs like linear regression or SVMs.

The rule of thumb: use Min-Max when you know your data is bounded and clean. Use Standardization almost everywhere else.

scaling_comparison.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

# Simulate a small customer dataset with wildly different scales
customer_data = pd.DataFrame({
    'age':    [22, 35, 47, 58, 29, 63],
    'income': [32000, 75000, 120000, 95000, 48000, 180000]
})

print("=== Raw Data ===")
print(customer_data)
print(f"Income std dev: {customer_data['income'].std():.1f}")
print(f"Age std dev:    {customer_data['age'].std():.1f}\n")

# --- Min-Max Scaling ---
# Each value becomes (value - min) / (max - min)
# Result: every column lives between 0.0 and 1.0
min_max_scaler = MinMaxScaler()
min_max_scaled = pd.DataFrame(
    min_max_scaler.fit_transform(customer_data),
    columns=['age_minmax', 'income_minmax']
)
print("=== After Min-Max Scaling ===")
print(min_max_scaled.round(3))

# --- Standardization (Z-score) ---
# Each value becomes (value - mean) / std_dev
# Result: mean=0, std=1 — no fixed upper or lower bound
standard_scaler = StandardScaler()
standardized = pd.DataFrame(
    standard_scaler.fit_transform(customer_data),
    columns=['age_zscore', 'income_zscore']
)
print("\n=== After Standardization ===")
print(standardized.round(3))
print(f"\nAge z-score mean:    {standardized['age_zscore'].mean():.6f}  (should be ~0)")
print(f"Income z-score std:  {standardized['income_zscore'].std():.6f}  (should be ~1)")
Output
=== Raw Data ===
age income
0 22 32000
1 35 75000
2 47 120000
3 58 95000
4 29 48000
5 63 180000
Income std dev: 52547.1
Age std dev: 15.3
=== After Min-Max Scaling ===
age_minmax income_minmax
0 0.000 0.000
1 0.317 0.287
2 0.610 0.587
3 0.878 0.420
4 0.171 0.107
5 1.000 1.000
=== After Standardization ===
age_zscore income_zscore
0 -1.301 -1.139
1 -0.426 -0.324
2 0.393 0.529
3 1.163 0.149
4 -0.842 -0.834
5 1.012 1.619
Age z-score mean: 0.000000 (should be ~0)
Income z-score std: 1.000000 (should be ~1)
Watch Out: Fit on Train, Transform on Test
Always call .fit_transform() on your training data only, then .transform() on your test data. If you fit the scaler on the full dataset, you're leaking test statistics into training — your model secretly already 'knows' the test set's range, and your evaluation metrics become optimistically wrong.
Production Insight
A credit scoring team once used Min-Max scaling without outlier clipping on a $1M loan amount feature.
The 99th percentile was $2M but one outlier at $50M compressed all normal values into a tiny range.
Rule: always clip extreme values before Min-Max, or use robust scaling (QuantileTransformer).
Key Takeaway
Scale matters to algorithms, not to reality.
Unscaled features let large-magnitude columns dominate distance and gradient calculations.
Standardize unless you have a specific reason for bounded outputs.

Encoding Categorical Variables — Stop Lying About Order

Your model speaks math. The moment you hand it the string 'New York', it breaks. So you need to convert text categories into numbers — but the method you choose matters enormously.

The naive approach is Label Encoding: replace each category with an integer. 'Red'→0, 'Green'→1, 'Blue'→2. Fast, compact — and quietly catastrophic for nominal categories. Why? Because now your model thinks Blue (2) is twice as much as Green (1) and greater than both. You've invented a false order that wasn't in your data, and the model will learn it as signal.

Label Encoding is only safe for ordinal categories — where a real order exists: 'Low'→0, 'Medium'→1, 'High'→2 is perfectly valid because the order is real.

One-Hot Encoding is the fix for nominal categories. It creates a new binary column for each category — the model sees completely independent yes/no flags with no false ranking.

The catch: high-cardinality columns (city names, product SKUs with 10,000 unique values) will explode your feature space with OHE. In those cases, reach for Target Encoding or Frequency Encoding instead — but those are more advanced territory and carry their own leakage risks.

categorical_encoding.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# A dataset where both ordinal AND nominal categories exist
product_orders = pd.DataFrame({
    'product_color':    ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green'],  # NOMINAL — no order
    'satisfaction_level': ['Low', 'High', 'Medium', 'High', 'Low', 'Medium'],  # ORDINAL — real order
    'purchase_amount':  [25.0, 89.0, 45.0, 120.0, 30.0, 67.0]
})

print("=== Original Data ===")
print(product_orders, "\n")

# --- WRONG approach: Label Encoding a NOMINAL column ---
# This assigns Red=2, Green=0, Blue=1 — implying Green < Blue < Red
# The model will treat color differences as numerical distances. That's invented information.
bad_label_encoder = LabelEncoder()
product_orders['color_BAD_label'] = bad_label_encoder.fit_transform(product_orders['product_color'])
print("=== WRONG: Label Encoded Color (invents false order) ===")
print(product_orders[['product_color', 'color_BAD_label']])
print(f"Classes learned: {bad_label_encoder.classes_}")
print(f"Encoded as:      {list(range(len(bad_label_encoder.classes_)))}\n")

# --- RIGHT approach: One-Hot Encoding for NOMINAL columns ---
# Creates separate binary columns — no false ranking, model learns each color independently
one_hot_encoded = pd.get_dummies(
    product_orders[['product_color']],
    columns=['product_color'],
    drop_first=True   # drop one column to avoid multicollinearity (dummy variable trap)
)
print("=== CORRECT: One-Hot Encoded Color ===")
print(one_hot_encoded, "\n")

# --- RIGHT approach: Ordinal Encoding for ORDINAL columns ---
# We explicitly define the meaningful order — Low < Medium < High
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
product_orders['satisfaction_encoded'] = ordinal_encoder.fit_transform(
    product_orders[['satisfaction_level']]
).astype(int)
print("=== CORRECT: Ordinal Encoded Satisfaction (preserves real order) ===")
print(product_orders[['satisfaction_level', 'satisfaction_encoded']])
Output
=== Original Data ===
product_color satisfaction_level purchase_amount
0 Red Low 25.0
1 Blue High 89.0
2 Green Medium 45.0
3 Blue High 120.0
4 Red Low 30.0
5 Green Medium 67.0
=== WRONG: Label Encoded Color (invents false order) ===
product_color color_BAD_label
0 Red 2
1 Blue 0
2 Green 1
3 Blue 0
4 Red 2
5 Green 1
Classes learned: ['Blue' 'Green' 'Red']
Encoded as: [0, 1, 2]
=== CORRECT: One-Hot Encoded Color ===
product_color_Green product_color_Red
0 False True
1 False False
2 True False
3 False False
4 False True
5 True False
=== CORRECT: Ordinal Encoded Satisfaction (preserves real order) ===
satisfaction_level satisfaction_encoded
0 Low 0
1 High 2
2 Medium 1
3 High 2
4 Low 0
5 Medium 1
Pro Tip: The Dummy Variable Trap
When one-hot encoding, always drop one category column (use drop_first=True in pandas or drop='first' in sklearn's OneHotEncoder). If you have Red, Green, Blue columns and a row has Red=0, Green=0, you already know it's Blue. Keeping all three creates perfect multicollinearity, which destabilizes linear models and inflates feature importance scores.
Production Insight
A recommendation engine used Label Encoding on 'product_category' (200 nominal categories) and saw a 15% drop in recall.
The model learned false ordinal relationships — e.g., 'Electronics (code 50)' was treated as 'greater' than 'Clothing (code 10)'.
Rule: if the categories have no inherent order, always use One-Hot, Target, or Frequency encoding.
Key Takeaway
Label Encoding nominal data silently invents a ranking that doesn't exist.
This is one of the most common, hardest-to-spot bugs in ML pipelines.
Use One-Hot for nominal; Ordinal only when order is real and known.

Engineering New Features — Creating Signal That Wasn't There

Here's where feature engineering becomes genuinely creative. Sometimes the raw columns aren't the right level of abstraction. The real signal is hiding in a combination, a ratio, or a derived value that you have to construct yourself.

Classic examples: a dataset has house_total_sqft and num_bedrooms — but sqft_per_bedroom might be a far stronger predictor of price than either column alone. An e-commerce dataset has signup_date and first_purchase_date — neither tells the model much, but days_to_first_purchase is a powerful engagement signal.

This is also where domain expertise pays off. A data scientist who knows that churn risk spikes after 30 days of inactivity can engineer a days_since_last_login feature that a raw timestamp never captured. The model can't discover this on its own — you have to hand it the right representation.

Binning (also called discretization) is another powerful tool: converting a continuous feature into categorical buckets. Age 0–17 becomes 'Minor', 18–34 becomes 'YoungAdult', etc. This helps when the relationship between a feature and target is non-linear and step-shaped — trees handle raw continuous values fine, but linear models benefit enormously from binning in these cases.

feature_creation.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
import pandas as pd
import numpy as np

# Simulated e-commerce customer dataset
customers = pd.DataFrame({
    'customer_id':       [101, 102, 103, 104, 105, 106],
    'account_age_days':  [365, 12, 890, 45, 200, 730],
    'total_orders':      [24, 1, 87, 3, 18, 55],
    'total_spent_usd':   [1200, 35, 8700, 90, 950, 4400],
    'last_login_days_ago': [2, 45, 1, 30, 8, 15],
    'age':               [34, 22, 61, 19, 45, 38]
})

print("=== Raw Customer Features ===")
print(customers.to_string(index=False), "\n")

# --- Ratio feature: average order value ---
# This captures 'how much does this customer spend per purchase' — a better
# signal for high-value customer segmentation than raw totals alone
customers['avg_order_value'] = (
    customers['total_spent_usd'] / customers['total_orders']
).round(2)

# --- Ratio feature: order frequency ---
# Orders per day since account creation — normalizes total_orders by tenure
# A new customer with 3 orders in 12 days is MORE active than a veteran with 24 in 365 days
customers['orders_per_day'] = (
    customers['total_orders'] / customers['account_age_days']
).round(4)

# --- Binary flag feature: is the customer 'at risk'? ---
# Domain knowledge: customers inactive for 30+ days churn at 3x the rate
customers['is_churn_risk'] = (customers['last_login_days_ago'] >= 30).astype(int)

# --- Binning: age groups ---
# Linear models struggle with non-linear age effects — binning makes the
# relationship step-shaped and easier for them to capture
age_bin_edges  = [0, 25, 35, 50, 100]
age_bin_labels = ['GenZ', 'YoungAdult', 'MidCareer', 'Senior']
customers['age_group'] = pd.cut(
    customers['age'],
    bins=age_bin_edges,
    labels=age_bin_labels,
    right=True  # intervals are (left, right] — so 25 falls in GenZ, 26 in YoungAdult
)

print("=== Engineered Features ===")
engineered_cols = [
    'customer_id', 'avg_order_value', 'orders_per_day',
    'is_churn_risk', 'age_group'
]
print(customers[engineered_cols].to_string(index=False))

# Quick sanity check — customer 102 has 1 order in 12 days vs customer 101 with 24 in 365
print("\n=== Sanity Check: Order Frequency ===")
for _, row in customers.iterrows():
    print(f"  Customer {row['customer_id']}: {row['orders_per_day']:.4f} orders/day "
          f"({'high' if row['orders_per_day'] > 0.05 else 'low'} frequency)")
Output
=== Raw Customer Features ===
customer_id account_age_days total_orders total_spent_usd last_login_days_ago age
101 365 24 1200 2 34
102 12 1 35 45 22
103 890 87 8700 1 61
104 45 3 90 30 19
105 200 18 950 8 45
106 730 55 4400 15 38
=== Engineered Features ===
customer_id avg_order_value orders_per_day is_churn_risk age_group
101 50.00 0.0658 0 YoungAdult
102 35.00 0.0833 1 GenZ
103 100.00 0.0977 0 Senior
104 30.00 0.0667 1 GenZ
105 52.78 0.0900 0 MidCareer
106 80.00 0.0753 0 MidCareer
=== Sanity Check: Order Frequency ===
Customer 101: 0.0658 orders/day (high frequency)
Customer 102: 0.0833 orders/day (high frequency)
Customer 103: 0.0977 orders/day (high frequency)
Customer 104: 0.0667 orders/day (high frequency)
Customer 105: 0.0900 orders/day (high frequency)
Customer 106: 0.0753 orders/day (high frequency)
Interview Gold: Feature Engineering vs. Feature Selection
Feature engineering is creating or transforming features — it's additive work. Feature selection is deciding which of your (possibly engineered) features to keep — it's subtractive work. You do engineering first, then selection. Interviewers love this distinction because confusing the two reveals shallow understanding of the ML pipeline.
Production Insight
A ride-sharing demand model used raw 'distance' and 'time_of_day' features. Performance plateaued.
Engineered features: 'distance_to_nearest_hotspot', 'rush_hour_flag', and 'fare_per_km' improved RMSE by 18%.
Rule: raw columns are a starting point, not the end. Ratios, flags, and domain-specific composites almost always beat raw inputs.
Key Takeaway
The best engineered features come from domain knowledge, not algorithms.
A ratio, flag, or time-delta you construct from business understanding often outperforms a dozen raw columns.
Binning helps linear models capture step-shaped relationships.

Working with Date and Time Features — Turning Timestamps Into Signal

Dates and times are a classic trap. Raw timestamps like '2025-03-01 14:30:00' look human-readable but are opaque to ML models. They carry embedded patterns — hour of day, day of week, month, quarter, time since an event — that you must extract explicitly.

Common date-derived features
  • Cyclical encoding for hours, months, or days: sin/cos transforms preserve circular relationships (e.g., 23:00 and 01:00 are close in reality but far apart as raw numbers).
  • Time since a reference event: days since account created, hours since last purchase.
  • Binary flags for special periods: is_weekend, is_holiday, is_business_hours.
  • Rolling window statistics: count of events in last 7 days, average spend in last 30 days.

The key insight: raw timestamps are only useful as relative differences, never as absolute numbers. Don't feed the model '2025-03-01' as an integer — it will try to learn patterns based on arbitrary calendar position.

datetime_features.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import pandas as pd
import numpy as np

# Simulated transaction timestamps
dates = pd.to_datetime([
    '2025-01-05 08:15:00',
    '2025-01-05 23:45:00',
    '2025-03-15 14:30:00',
    '2025-07-20 03:10:00',
    '2025-11-11 12:00:00'
])

tx_df = pd.DataFrame({'timestamp': dates,
                     'amount': [50, 200, 15, 300, 100]})

# Extract basic components
tx_df['hour'] = tx_df['timestamp'].dt.hour
tx_df['day_of_week'] = tx_df['timestamp'].dt.dayofweek  # Monday=0
tx_df['month'] = tx_df['timestamp'].dt.month
tx_df['is_weekend'] = (tx_df['timestamp'].dt.dayofweek >= 5).astype(int)

# Cyclical encoding for hour (preserves 23 and 0 as close)
tx_df['hour_sin'] = np.sin(2 * np.pi * tx_df['hour'] / 24)
tx_df['hour_cos'] = np.cos(2 * np.pi * tx_df['hour'] / 24)

# Time since a reference (e.g., first transaction)
first_timestamp = tx_df['timestamp'].min()
tx_df['hours_since_first'] = (tx_df['timestamp'] - first_timestamp).dt.total_seconds() / 3600

print("=== Date/Time Features ===")
print(tx_df[['timestamp', 'hour', 'day_of_week', 'month', 'is_weekend', 'hour_sin', 'hour_cos', 'hours_since_first']])
Output
=== Date/Time Features ===
timestamp hour day_of_week month is_weekend hour_sin hour_cos hours_since_first
0 2025-01-05 08:15:00 8 6 1 0 0.866025 0.500000 0.0
1 2025-01-05 23:45:00 23 6 1 0 -0.258819 0.965926 15.5
2 2025-03-15 14:30:00 14 5 3 1 0.866025 -0.500000 1662.25
3 2025-07-20 03:10:00 3 6 7 0 0.707107 0.707107 4707.0
4 2025-11-11 12:00:00 12 1 11 0 0.000000 1.000000 7480.0
Pro Tip: Rolling Windows Need Care
When creating rolling window features (e.g., 'average order value in last 7 days'), always ensure the window looks only at past data relative to the current row. Using .rolling() with a closed='left' interval prevents lookahead bias. Never compute rolling windows on the full dataset without proper temporal ordering.
Production Insight
A demand forecasting model fed raw 'datetime' as an integer and saw a 23% MAPE. After extracting month, day_of_week, and hour sin/cos, MAPE dropped to 11%.
The model could finally learn periodic patterns like weekly seasonality.
Rule: never feed raw timestamps. Extract cyclical and relative features.
Key Takeaway
Raw timestamps are useless to models — extract cyclical components, relative deltas, and business-period flags.
Use sin/cos encoding for hours, months, and days to preserve circular continuity.
Always compute rolling windows with strict left-closed intervals to avoid lookahead.

Text Feature Engineering — Turning Free Text Into Numeric Vectors

Unstructured text — product descriptions, user reviews, support tickets — carries dense signal that raw models can't parse. You need to convert text into numeric features. The most common techniques start with simple bag-of-words representations and scale up to pretrained embeddings.

Bag-of-Words (BoW) counts word occurrences in a document. It loses word order but captures topic presence. Use CountVectorizer for raw counts or TfidfVectorizer for term frequency–inverse document frequency, which downweights common words.

TF-IDF (Term Frequency–Inverse Document Frequency) gives higher weight to rare words that distinguish documents. It's the go-to for classification tasks like spam detection or topic tagging.

Pretrained embeddings (Word2Vec, GloVe, BERT) map words to dense vectors that encode semantic meaning. They're more powerful but require more data and compute. For most production pipelines, TF-IDF on cleaned text is the best starting point — it's fast, interpretable, and strong on many problems.

text_features.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample product descriptions
documents = [
    "Stainless steel water bottle with double-wall insulation",
    "Cotton t-shirt in size medium available in blue and black",
    "Wireless bluetooth noise-cancelling headphones with 30-hour battery",
    "Organic green tea bags pack of 100"
]

# --- Bag-of-Words ---
bow = CountVectorizer(stop_words='english')
bow_matrix = bow.fit_transform(documents)
print("=== Bag-of-Words Feature Matrix (first 2 documents) ===")
print(pd.DataFrame(bow_matrix.toarray()[:2], columns=bow.get_feature_names_out()).T)
print()

# --- TF-IDF ---
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(documents)
print("=== TF-IDF Feature Matrix (first 2 documents) ===")
print(pd.DataFrame(tfidf_matrix.toarray()[:2], columns=tfidf.get_feature_names_out()).T)
Output
=== Bag-of-Words Feature Matrix (first 2 documents) ===
0 1
bottle 1.0 0.0
cotton 0.0 1.0
insulation 1.0 0.0
medium 0.0 1.0
size 0.0 1.0
steel 1.0 0.0
water 1.0 0.0
dtype: float64
=== TF-IDF Feature Matrix (first 2 documents) ===
0 1
bottle 0.577350 0.000000
cotton 0.000000 0.447214
insulation 0.577350 0.000000
medium 0.000000 0.447214
size 0.000000 0.447214
steel 0.577350 0.000000
water 0.577350 0.000000
dtype: float64
Real-World Practical: TF-IDF Baseline
Before reaching for BERT or GPT embeddings, try a TF-IDF baseline. It's deterministic, interpretable, and often matches deep models on many classification tasks. If your text corpus is small (<10k documents), pretrained embeddings may hurt performance due to domain mismatch. Always start simple.
Production Insight
A customer support ticket routing system used a custom Word2Vec model trained on 500k tickets. Accuracy was 78%.
Replacing it with TF-IDF + Logistic Regression gave 76% accuracy but reduced inference latency from 200ms to 5ms.
Rule: start with TF-IDF. Only add embeddings if you see a clear accuracy gain that justifies the cost.
Key Takeaway
Text needs feature extraction — start with TF-IDF for its speed and interpretability.
Bag-of-Words is simple but loses word order; embeddings capture semantics but need more data.
Always clean text (lowercase, remove stopwords, normalize) before vectorizing.
● Production incidentPOST-MORTEMseverity: high

The Silent Accuracy Drop: Leaky Feature Engineering in a Fraud Detection Pipeline

Symptom
Model AUC dropped from 94% to 61% within 24 hours of going live. Online metrics were consistent with random guessing. Offline metrics had never matched production behavior, but the gap was too large to ignore.
Assumption
The team assumed their careful scaling and encoding pipeline was leak-free because they used scikit-learn's Pipeline. However, they had computed a 'rolling average transaction amount' using the entire dataset's history before splitting, not within cross-validation folds.
Root cause
The rolling average feature was computed on the full dataset using pandas .expanding().mean(). This used future transactions to compute the average, creating a lookahead bias. During validation, the model saw information from the test set, inflating its perceived accuracy. In production, that future information didn't exist, so the feature was nearly useless.
Fix
Recompute all aggregate features (rolling averages, target encodings, etc.) inside each cross-validation fold using sklearn's ColumnTransformer and a custom TimeSeriesSplit. Moved the feature engineering step into the pipeline so it respected temporal order. Retrained the model, which settled at a realistic 82% AUC.
Key lesson
  • Any feature that aggregates across rows (mean, rolling window, cumulative count) must be computed per fold, not on the full dataset.
  • Always validate your feature engineering pipeline with a holdout set that is completely untouched by any aggregate statistics.
  • If offline metrics look too good for the feature set you have, suspect data leakage before celebrating.
Production debug guideWhen your model's online performance diverges from offline expectations, these are the first three areas to inspect.3 entries
Symptom · 01
Online AUC/accuracy is significantly lower than offline cross-validation scores
Fix
Check for data leakage: review any feature that uses aggregate statistics (mean, median, rolling windows). Ensure those statistics were computed only on the training fold. A quick test: run the same pipeline on a fresh holdout set that was never touched during any aggregate computation.
Symptom · 02
Model performance degrades after deploying to a new data source or region
Fix
Compare feature distributions between training data and production data. Use KS-test or population stability index for numerical features; chi-square for categorical. If distributions differ significantly, consider retraining on more representative data or adding domain-specific encodings.
Symptom · 03
One-hot encoded features cause memory or latency spikes in production
Fix
Monitor the cardinality of the categorical feature. If new categories appear in production that weren't in training, your encoder will fail or create unseen columns. Use a OneHotEncoder handle_unknown='ignore' or switch to target encoding / frequency encoding. Also consider dimensionality reduction with PCA or t-SNE on the encoded matrix.
★ Quick Debug Cheat Sheet for Feature EngineeringResolve common feature engineering pitfalls in under a minute. Each symptom has an immediate action, two diagnostic commands, and a fix.
scaler sees unseen values or NaN in production
Immediate action
Stop the inference pipeline. Check if production data has missing values or new categories.
Commands
pd.isna(production_data).sum()
production_data['category'].value_counts().index.difference(training_cats)
Fix now
Add a SimpleImputer for missing values and set handle_unknown='ignore' on OneHotEncoder.
model returns constant predictions despite varied input+
Immediate action
Check feature range and scale. Likely a feature with values far outside training range is dominating.
Commands
pd.DataFrame(production_data).describe()
plt.hist(production_feature, bins=50); plt.axvline(train_min, color='r')
Fix now
Cap extreme values using .clip(lower=train_min, upper=train_max) or retrain scaler with robust quantiles.
memory error on inference due to huge encoding matrix+
Immediate action
Check number of unique categories in the feature. One-hot encoding on 10k+ categories is the culprit.
Commands
production_data['high_card_col'].nunique()
import sys; sys.getsizeof(encoded_matrix)
Fix now
Switch to frequency encoding or binary encoding. Alternatively, use a hashing trick with HashingVectorizer.
Feature Engineering Techniques Comparison
TechniqueBest ForMain RiskTree Models Need It?Linear Models Need It?
Min-Max ScalingBounded, clean numerical data; neural netsSensitive to outliers — one extreme value compresses everything elseNo — trees split on thresholds, scale-invariantYes — gradient descent converges much faster
Standardization (Z-score)Most numerical data, especially with outliersDoesn't bound values; can confuse models expecting 0–1 inputNoYes — essential for SVMs and logistic regression
One-Hot EncodingNominal categories with low cardinality (<20 unique values)Feature explosion with high-cardinality columnsYes — trees need numeric inputYes — but watch for dummy variable trap
Ordinal EncodingCategories with a meaningful, known orderUsing it on nominal data invents false relationshipsYesYes — safe only when true order exists
Binning / DiscretizationNon-linear step relationships; noisy continuous dataLoses granularity; bin boundaries are somewhat arbitraryRarely needed — trees find splits naturallyVery useful — converts complex curve into step function
Date/Time Cyclical EncodingPeriodic features like hour, month, weekdayAssumes the period is exactly constant (e.g., 24-hour cycle)Can help with periodic splitsHighly beneficial — captures circular patterns
TF-IDF VectorizationText classification, topic modeling, search relevanceLarge vocabulary leads to sparse high-dim matrix; ignores word orderWorks well with tree ensemblesWorks well with linear classifiers

Key takeaways

1
Scale matters to algorithms, not to reality
a $100,000 income isn't 'more' than a 35-year-old age, but unscaled, your model will treat it that way. Always scale before feeding distance-based or gradient-based models.
2
Label Encoding nominal data silently invents a ranking that doesn't exist
this is one of the most common, hardest-to-spot bugs in ML pipelines. The model learns the fake ordering as real signal.
3
The best engineered features come from domain knowledge, not algorithms
a ratio, flag, or time-delta you construct from business understanding will often outperform a dozen raw columns fed to even a deep model.
4
Fit your transformers on training data only
data leakage from fitting scalers or encoders on the full dataset is the single biggest reason evaluation metrics don't match production performance.
5
Text and date features require explicit extraction
raw timestamps and text strings are opaque to models. Use TF-IDF for text, cyclical encoding for periodic dates, and rolling windows for temporal aggregations.

Common mistakes to avoid

5 patterns
×

Fitting the scaler on the full dataset before the train/test split

Symptom
Your validation accuracy looks suspiciously high and doesn't hold up in production. Offline metrics are optimistic because the scaler used test data statistics.
Fix
Always split your data first, then call .fit_transform() only on the training set and .transform() on the test set. The scaler should never 'see' test data during fitting.
×

Using Label Encoding on nominal (unordered) categorical columns

Symptom
Your linear model or KNN gives oddly poor results on a column you thought was encoded correctly. The model learns a false ordinal relationship.
Fix
Check whether a natural order genuinely exists. If 'Red < Green < Blue' is nonsense in your domain, switch to One-Hot Encoding. Save Label/Ordinal Encoding strictly for features like size ratings, survey scores, or education levels where order is real.
×

Engineering features on the full pipeline without accounting for train/test leakage

Symptom
A derived feature like 'days_to_first_purchase' is computed using aggregate statistics (e.g., mean days across all users) before the split, causing lookahead bias.
Fix
Any feature that aggregates across rows (target encoding, mean statistics, rolling averages) must be computed inside a cross-validation fold or pipeline step, not on the full dataset upfront. Use sklearn Pipeline or ColumnTransformer to ensure transformations respect data splits.
×

Using raw timestamps as integer features

Symptom
Model fails to capture seasonality; performance is poor for time-series forecasting or temporal problems.
Fix
Extract components: hour, day_of_week, month, then apply cyclical encoding (sin/cos). Create relative features like 'days_since_last_event'. Never use raw timestamps as continuous numbers.
×

Ignoring text preprocessing before vectorization

Symptom
TF-IDF matrix is huge and models perform poorly because of noise from stopwords, punctuation, and inconsistent casing.
Fix
Always lowercase, remove punctuation, strip stopwords, and consider stemming/lemmatization. Use ngram_range=(1,2) in TfidfVectorizer to capture phrases.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
You have a 'City' column with 3,000 unique city names. One-hot encoding ...
Q02SENIOR
A colleague says their model accuracy jumped from 72% to 91% after featu...
Q03SENIOR
What's the difference between feature engineering and feature selection?...
Q01 of 03SENIOR

You have a 'City' column with 3,000 unique city names. One-hot encoding would create 3,000 new columns. Walk me through at least two alternative encoding strategies and when you'd choose each.

ANSWER
For high-cardinality nominal features like city names, One-Hot is rarely the best first choice. Two alternatives: 1. Target Encoding (Mean Encoding): Replace each city with the mean of the target variable for that city (e.g., average click-through rate for that city). It compresses 3,000 columns into 1 and can capture strong predictive signal. The risk is target leakage — you must compute the encoding inside cross-validation folds, never on the full dataset. 2. Frequency Encoding: Replace each city with its frequency in the training data (how often it appears). It's simple, leakage-safe, and works well when rare categories are noisy. The downside: it loses all relationship between city and target, so it's weaker for direct prediction. Choose target encoding if the target is strongly city-dependent and you can manage cross-validation. Choose frequency encoding if you need a fast, safe baseline or if city specificity is low.
FAQ · 3 QUESTIONS

Frequently Asked Questions

01
Do I need to do feature engineering if I'm using a decision tree or random forest?
02
What's the difference between normalization and standardization?
03
How do I know which features to engineer? Is there a systematic process?
🔥

That's ML Basics. Mark it forged?

5 min read · try the examples if you haven't

Previous
Train Test Split and Cross Validation
6 / 25 · ML Basics
Next
Data Preprocessing in ML