Intermediate 8 min · March 06, 2026

Feature Engineering Basics

Feature Engineering: Leaky Aggregate Causes AUC Drop

Q: Do I need to do feature engineering if I'm using a decision tree or random forest?

Tree-based models are scale-invariant, so you can skip normalization and standardization. However, you still need to encode categorical variables into numbers because sklearn's tree implementations require numeric input. You also still benefit from engineered ratio or interaction features — trees can find splits, but they can't invent new combinations on their own.

Q: What's the difference between normalization and standardization?

Normalization (Min-Max scaling) rescales values to a fixed range, typically 0 to 1, by subtracting the minimum and dividing by the range. Standardization (Z-score scaling) rescales to mean=0 and standard deviation=1. Normalization is sensitive to outliers and works well for neural networks expecting bounded inputs. Standardization handles outliers better and is preferred for linear models and SVMs.

Q: How do I know which features to engineer? Is there a systematic process?

Start with domain knowledge — ask what derived measurements a human expert would actually look at (ratios, rates, time deltas, flags). Then look at your raw features and ask what combinations or transformations might capture non-linear relationships. After engineering candidates, use feature importance scores from a tree model, correlation analysis, or recursive feature elimination to decide what to keep. There's no single algorithm for it — it's equal parts data intuition and experimentation.

Model AUC dropped from 94% to 61% due to lookahead bias in rolling average feature - avoid the silent accuracy drop with per-fold computation..

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 27, 2026

last updated

1,713

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Feature engineering transforms raw data into representations ML models can actually learn from
Scaling (Min-Max, Standardization) prevents large-magnitude features from dominating distance-based models
One-Hot Encoding avoids false ordinal relationships in nominal categories; Label Encoding is for ordinal data only
Domain-driven features (ratios, flags, time deltas) often beat raw columns in predictive power
Production insight: data leakage from fitting scalers on full dataset is the #1 cause of metric optimism — split before transform

✦ Definition~90s read

What is Feature Engineering Basics?

Feature engineering is the process of transforming raw data into numeric representations that machine learning models can actually learn from. It's the difference between feeding a model a timestamp string and feeding it hour-of-day, day-of-week, and days-since-last-event.

★

Imagine you're baking a cake.

Models don't understand semantics—they understand numbers, and the quality of those numbers determines whether your AUC hits 0.95 or tanks to 0.55. The most common pitfall is the 'leaky aggregate': computing statistics like mean target value per group across the entire dataset before splitting train/test, which injects future information into training and gives you a false sense of performance that collapses in production.

This article covers the six critical transformations you must get right: avoiding data leaks in aggregates, proper scaling (StandardScaler vs. RobustScaler—know when each breaks), categorical encoding that doesn't imply false ordinality (target encoding vs. one-hot vs. embeddings), creating interaction features that capture nonlinear relationships, extracting cyclic signals from timestamps (sin/cos transforms for hour and month), and vectorizing free text with TF-IDF or sentence embeddings.

Skip these steps and your model learns noise; get them right and you turn garbage into signal.

Plain-English First

Imagine you're baking a cake. The raw ingredients — flour, eggs, sugar — are your data. Feature engineering is the prep work: sifting the flour so it's smooth, cracking and beating the eggs, measuring sugar precisely. You could dump everything in a bowl unprocessed, but the cake would be a disaster. ML models are the same — they can't learn well from raw, messy ingredients. Feature engineering is how you prep your data so the model can actually taste the signal.

Your model is only as smart as the data you hand it. You can swap in a fancier algorithm all day long, but if your input features are raw, inconsistent, or poorly structured, accuracy will plateau and you'll spend hours debugging what feels like a model problem — but is actually a data problem. Feature engineering is the craft that separates a hobbyist notebook from a production ML system. It's the reason a simple logistic regression with great features often outperforms a deep neural network fed raw junk.

The core problem feature engineering solves is this: real-world data is collected for humans, not machines. A column labeled 'Customer Since' contains a date string. A 'Price' column spans from $1 to $10,000. A 'City' column has 500 unique text values. Raw ML algorithms see numbers, and they interpret scale literally — a city with code 499 looks mathematically 'larger' than code 1. Feature engineering bridges the gap between how humans record information and how models consume it.

By the end of this article you'll know how to normalize numerical features so large-scale columns don't bully small-scale ones, encode categorical variables without introducing false ordinal relationships, engineer entirely new features from existing ones, and bin continuous values into meaningful groups. Every technique comes with a real Python example you can run today and a clear explanation of when to reach for it.

Why Leaky Aggregates Are a Silent Model Killer

Feature engineering is the process of transforming raw data into predictive signals. The core mechanic is simple: compute aggregate statistics (mean, sum, count) over groups defined by entity IDs or time windows. But when that aggregation leaks information from the future — for example, computing a user's average purchase amount using all transactions, including those after the prediction timestamp — you introduce target leakage. This inflates offline AUC by 10–20% while production performance collapses.

In practice, the leak happens silently because the pipeline computes aggregates on the full dataset before splitting train/test. The model sees the average of tomorrow's purchases and learns a near-perfect cheat. The key property: any aggregate that uses data from after the prediction point is a leak. Time-based splits and careful windowing (e.g., expanding window with a lag) prevent this. The correct approach is to compute aggregates only from past data relative to each row's timestamp.

Use leak-free aggregates when your features depend on historical behavior — user activity, sensor readings, transaction histories. This matters in real systems because a model that scores 0.98 AUC offline but 0.65 online is not just useless; it erodes trust and delays deployment. Every senior engineer must enforce temporal consistency in feature pipelines before any training run.

⚠ Leak is invisible in shuffled cross-validation

Standard k-fold cross-validation without time-based splitting will hide leakage — your AUC looks great, but the model memorizes the future.

📊 Production Insight

A fraud detection team used 7-day rolling average of transaction amounts computed over the entire dataset — offline AUC was 0.97, online AUC dropped to 0.68.

Symptom: model's top features were all time-aggregates with suspiciously high importance; predictions were nearly constant for recent users.

Rule: always compute aggregates using only data strictly before the prediction timestamp — use expanding windows with a lag of at least one record.

🎯 Key Takeaway

Leaky aggregates inflate offline metrics by 10–20% but fail in production — always validate with time-based splits.

Compute aggregates only from past data relative to each row's timestamp — never use the full dataset.

The most common leak source is computing group statistics before splitting train/test — enforce temporal ordering in your pipeline.

thecodeforge.io

Feature Engineering Basics

Scaling and Normalization — Why Raw Numbers Lie to Your Model

Picture a dataset with two columns: a person's age (18–80) and their annual income ($20,000–$200,000). To you, those are just two different measurements. To a distance-based model like K-Nearest Neighbors or a gradient-based model like logistic regression, income completely dominates. The income differences are literally thousands of times larger, so the model barely 'hears' the age signal at all.

Normalization fixes this by rescaling features onto a common playing field. There are two main approaches you'll reach for constantly.

Min-Max Scaling squishes every value into a 0–1 range. It preserves the shape of the distribution but is sensitive to outliers — one rogue value at $999,999 will compress everything else near zero.

Standardization (Z-score) rescales to mean=0, std=1. It doesn't bound values between 0 and 1, but it handles outliers far more gracefully and is the go-to for models that assume normally distributed inputs like linear regression or SVMs.

The rule of thumb: use Min-Max when you know your data is bounded and clean. Use Standardization almost everywhere else.

scaling_comparison.pyPYTHON

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

# Simulate a small customer dataset with wildly different scales
customer_data = pd.DataFrame({
    'age':    [22, 35, 47, 58, 29, 63],
    'income': [32000, 75000, 120000, 95000, 48000, 180000]
})

print("=== Raw Data ===")
print(customer_data)
print(f"Income std dev: {customer_data['income'].std():.1f}")
print(f"Age std dev:    {customer_data['age'].std():.1f}\n")

# --- Min-Max Scaling ---
# Each value becomes (value - min) / (max - min)
# Result: every column lives between 0.0 and 1.0
min_max_scaler = MinMaxScaler()
min_max_scaled = pd.DataFrame(
    min_max_scaler.fit_transform(customer_data),
    columns=['age_minmax', 'income_minmax']
)
print("=== After Min-Max Scaling ===")
print(min_max_scaled.round(3))

# --- Standardization (Z-score) ---
# Each value becomes (value - mean) / std_dev
# Result: mean=0, std=1 — no fixed upper or lower bound
standard_scaler = StandardScaler()
standardized = pd.DataFrame(
    standard_scaler.fit_transform(customer_data),
    columns=['age_zscore', 'income_zscore']
)
print("\n=== After Standardization ===")
print(standardized.round(3))
print(f"\nAge z-score mean:    {standardized['age_zscore'].mean():.6f}  (should be ~0)")
print(f"Income z-score std:  {standardized['income_zscore'].std():.6f}  (should be ~1)")

Output

=== Raw Data ===

age income

0 22 32000

1 35 75000

2 47 120000

3 58 95000

4 29 48000

5 63 180000

Income std dev: 52547.1

Age std dev: 15.3

=== After Min-Max Scaling ===

age_minmax income_minmax

0 0.000 0.000

1 0.317 0.287

2 0.610 0.587

3 0.878 0.420

4 0.171 0.107

5 1.000 1.000

=== After Standardization ===

age_zscore income_zscore

0 -1.301 -1.139

1 -0.426 -0.324

2 0.393 0.529

3 1.163 0.149

4 -0.842 -0.834

5 1.012 1.619

Age z-score mean: 0.000000 (should be ~0)

Income z-score std: 1.000000 (should be ~1)

⚠ Watch Out: Fit on Train, Transform on Test

Always call .fit_transform() on your training data only, then .transform() on your test data. If you fit the scaler on the full dataset, you're leaking test statistics into training — your model secretly already 'knows' the test set's range, and your evaluation metrics become optimistically wrong.

📊 Production Insight

A credit scoring team once used Min-Max scaling without outlier clipping on a $1M loan amount feature.

The 99th percentile was $2M but one outlier at $50M compressed all normal values into a tiny range.

Rule: always clip extreme values before Min-Max, or use robust scaling (QuantileTransformer).

🎯 Key Takeaway

Scale matters to algorithms, not to reality.

Unscaled features let large-magnitude columns dominate distance and gradient calculations.

Standardize unless you have a specific reason for bounded outputs.

Encoding Categorical Variables — Stop Lying About Order

Your model speaks math. The moment you hand it the string 'New York', it breaks. So you need to convert text categories into numbers — but the method you choose matters enormously.

The naive approach is Label Encoding: replace each category with an integer. 'Red'→0, 'Green'→1, 'Blue'→2. Fast, compact — and quietly catastrophic for nominal categories. Why? Because now your model thinks Blue (2) is twice as much as Green (1) and greater than both. You've invented a false order that wasn't in your data, and the model will learn it as signal.

Label Encoding is only safe for ordinal categories — where a real order exists: 'Low'→0, 'Medium'→1, 'High'→2 is perfectly valid because the order is real.

One-Hot Encoding is the fix for nominal categories. It creates a new binary column for each category — the model sees completely independent yes/no flags with no false ranking.

The catch: high-cardinality columns (city names, product SKUs with 10,000 unique values) will explode your feature space with OHE. In those cases, reach for Target Encoding or Frequency Encoding instead — but those are more advanced territory and carry their own leakage risks.

categorical_encoding.pyPYTHON

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# A dataset where both ordinal AND nominal categories exist
product_orders = pd.DataFrame({
    'product_color':    ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green'],  # NOMINAL — no order
    'satisfaction_level': ['Low', 'High', 'Medium', 'High', 'Low', 'Medium'],  # ORDINAL — real order
    'purchase_amount':  [25.0, 89.0, 45.0, 120.0, 30.0, 67.0]
})

print("=== Original Data ===")
print(product_orders, "\n")

# --- WRONG approach: Label Encoding a NOMINAL column ---
# This assigns Red=2, Green=0, Blue=1 — implying Green < Blue < Red
# The model will treat color differences as numerical distances. That's invented information.
bad_label_encoder = LabelEncoder()
product_orders['color_BAD_label'] = bad_label_encoder.fit_transform(product_orders['product_color'])
print("=== WRONG: Label Encoded Color (invents false order) ===")
print(product_orders[['product_color', 'color_BAD_label']])
print(f"Classes learned: {bad_label_encoder.classes_}")
print(f"Encoded as:      {list(range(len(bad_label_encoder.classes_)))}\n")

# --- RIGHT approach: One-Hot Encoding for NOMINAL columns ---
# Creates separate binary columns — no false ranking, model learns each color independently
one_hot_encoded = pd.get_dummies(
    product_orders[['product_color']],
    columns=['product_color'],
    drop_first=True   # drop one column to avoid multicollinearity (dummy variable trap)
)
print("=== CORRECT: One-Hot Encoded Color ===")
print(one_hot_encoded, "\n")

# --- RIGHT approach: Ordinal Encoding for ORDINAL columns ---
# We explicitly define the meaningful order — Low < Medium < High
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
product_orders['satisfaction_encoded'] = ordinal_encoder.fit_transform(
    product_orders[['satisfaction_level']]
).astype(int)
print("=== CORRECT: Ordinal Encoded Satisfaction (preserves real order) ===")
print(product_orders[['satisfaction_level', 'satisfaction_encoded']])

Output

=== Original Data ===

product_color satisfaction_level purchase_amount

0 Red Low 25.0

1 Blue High 89.0

2 Green Medium 45.0

3 Blue High 120.0

4 Red Low 30.0

5 Green Medium 67.0

=== WRONG: Label Encoded Color (invents false order) ===

product_color color_BAD_label

0 Red 2

1 Blue 0

2 Green 1

3 Blue 0

4 Red 2

5 Green 1

Classes learned: ['Blue' 'Green' 'Red']

Encoded as: [0, 1, 2]

=== CORRECT: One-Hot Encoded Color ===

product_color_Green product_color_Red

0 False True

1 False False

2 True False

3 False False

4 False True

5 True False

=== CORRECT: Ordinal Encoded Satisfaction (preserves real order) ===

satisfaction_level satisfaction_encoded

0 Low 0

1 High 2

2 Medium 1

3 High 2

4 Low 0

5 Medium 1

💡Pro Tip: The Dummy Variable Trap

When one-hot encoding, always drop one category column (use drop_first=True in pandas or drop='first' in sklearn's OneHotEncoder). If you have Red, Green, Blue columns and a row has Red=0, Green=0, you already know it's Blue. Keeping all three creates perfect multicollinearity, which destabilizes linear models and inflates feature importance scores.

📊 Production Insight

A recommendation engine used Label Encoding on 'product_category' (200 nominal categories) and saw a 15% drop in recall.

The model learned false ordinal relationships — e.g., 'Electronics (code 50)' was treated as 'greater' than 'Clothing (code 10)'.

Rule: if the categories have no inherent order, always use One-Hot, Target, or Frequency encoding.

🎯 Key Takeaway

Label Encoding nominal data silently invents a ranking that doesn't exist.

This is one of the most common, hardest-to-spot bugs in ML pipelines.

Use One-Hot for nominal; Ordinal only when order is real and known.

thecodeforge.io

Feature Engineering Basics

Engineering New Features — Creating Signal That Wasn't There

Here's where feature engineering becomes genuinely creative. Sometimes the raw columns aren't the right level of abstraction. The real signal is hiding in a combination, a ratio, or a derived value that you have to construct yourself.

Classic examples: a dataset has house_total_sqft and num_bedrooms — but sqft_per_bedroom might be a far stronger predictor of price than either column alone. An e-commerce dataset has signup_date and first_purchase_date — neither tells the model much, but days_to_first_purchase is a powerful engagement signal.

This is also where domain expertise pays off. A data scientist who knows that churn risk spikes after 30 days of inactivity can engineer a days_since_last_login feature that a raw timestamp never captured. The model can't discover this on its own — you have to hand it the right representation.

Binning (also called discretization) is another powerful tool: converting a continuous feature into categorical buckets. Age 0–17 becomes 'Minor', 18–34 becomes 'YoungAdult', etc. This helps when the relationship between a feature and target is non-linear and step-shaped — trees handle raw continuous values fine, but linear models benefit enormously from binning in these cases.

feature_creation.pyPYTHON

import pandas as pd
import numpy as np

# Simulated e-commerce customer dataset
customers = pd.DataFrame({
    'customer_id':       [101, 102, 103, 104, 105, 106],
    'account_age_days':  [365, 12, 890, 45, 200, 730],
    'total_orders':      [24, 1, 87, 3, 18, 55],
    'total_spent_usd':   [1200, 35, 8700, 90, 950, 4400],
    'last_login_days_ago': [2, 45, 1, 30, 8, 15],
    'age':               [34, 22, 61, 19, 45, 38]
})

print("=== Raw Customer Features ===")
print(customers.to_string(index=False), "\n")

# --- Ratio feature: average order value ---
# This captures 'how much does this customer spend per purchase' — a better
# signal for high-value customer segmentation than raw totals alone
customers['avg_order_value'] = (
    customers['total_spent_usd'] / customers['total_orders']
).round(2)

# --- Ratio feature: order frequency ---
# Orders per day since account creation — normalizes total_orders by tenure
# A new customer with 3 orders in 12 days is MORE active than a veteran with 24 in 365 days
customers['orders_per_day'] = (
    customers['total_orders'] / customers['account_age_days']
).round(4)

# --- Binary flag feature: is the customer 'at risk'? ---
# Domain knowledge: customers inactive for 30+ days churn at 3x the rate
customers['is_churn_risk'] = (customers['last_login_days_ago'] >= 30).astype(int)

# --- Binning: age groups ---
# Linear models struggle with non-linear age effects — binning makes the
# relationship step-shaped and easier for them to capture
age_bin_edges  = [0, 25, 35, 50, 100]
age_bin_labels = ['GenZ', 'YoungAdult', 'MidCareer', 'Senior']
customers['age_group'] = pd.cut(
    customers['age'],
    bins=age_bin_edges,
    labels=age_bin_labels,
    right=True  # intervals are (left, right] — so 25 falls in GenZ, 26 in YoungAdult
)

print("=== Engineered Features ===")
engineered_cols = [
    'customer_id', 'avg_order_value', 'orders_per_day',
    'is_churn_risk', 'age_group'
]
print(customers[engineered_cols].to_string(index=False))

# Quick sanity check — customer 102 has 1 order in 12 days vs customer 101 with 24 in 365
print("\n=== Sanity Check: Order Frequency ===")
for _, row in customers.iterrows():
    print(f"  Customer {row['customer_id']}: {row['orders_per_day']:.4f} orders/day "
          f"({'high' if row['orders_per_day'] > 0.05 else 'low'} frequency)")

Output

=== Raw Customer Features ===

customer_id account_age_days total_orders total_spent_usd last_login_days_ago age

101 365 24 1200 2 34

102 12 1 35 45 22

103 890 87 8700 1 61

104 45 3 90 30 19

105 200 18 950 8 45

106 730 55 4400 15 38

=== Engineered Features ===

customer_id avg_order_value orders_per_day is_churn_risk age_group

101 50.00 0.0658 0 YoungAdult

102 35.00 0.0833 1 GenZ

103 100.00 0.0977 0 Senior

104 30.00 0.0667 1 GenZ

105 52.78 0.0900 0 MidCareer

106 80.00 0.0753 0 MidCareer

=== Sanity Check: Order Frequency ===

Customer 101: 0.0658 orders/day (high frequency)

Customer 102: 0.0833 orders/day (high frequency)

Customer 103: 0.0977 orders/day (high frequency)

Customer 104: 0.0667 orders/day (high frequency)

Customer 105: 0.0900 orders/day (high frequency)

Customer 106: 0.0753 orders/day (high frequency)

🔥Interview Gold: Feature Engineering vs. Feature Selection

Feature engineering is creating or transforming features — it's additive work. Feature selection is deciding which of your (possibly engineered) features to keep — it's subtractive work. You do engineering first, then selection. Interviewers love this distinction because confusing the two reveals shallow understanding of the ML pipeline.

📊 Production Insight

A ride-sharing demand model used raw 'distance' and 'time_of_day' features. Performance plateaued.

Engineered features: 'distance_to_nearest_hotspot', 'rush_hour_flag', and 'fare_per_km' improved RMSE by 18%.

Rule: raw columns are a starting point, not the end. Ratios, flags, and domain-specific composites almost always beat raw inputs.

🎯 Key Takeaway

The best engineered features come from domain knowledge, not algorithms.

A ratio, flag, or time-delta you construct from business understanding often outperforms a dozen raw columns.

Binning helps linear models capture step-shaped relationships.

Working with Date and Time Features — Turning Timestamps Into Signal

Dates and times are a classic trap. Raw timestamps like '2025-03-01 14:30:00' look human-readable but are opaque to ML models. They carry embedded patterns — hour of day, day of week, month, quarter, time since an event — that you must extract explicitly.

Common date-derived features

Cyclical encoding for hours, months, or days: sin/cos transforms preserve circular relationships (e.g., 23:00 and 01:00 are close in reality but far apart as raw numbers).
Time since a reference event: days since account created, hours since last purchase.
Binary flags for special periods: is_weekend, is_holiday, is_business_hours.
Rolling window statistics: count of events in last 7 days, average spend in last 30 days.

The key insight: raw timestamps are only useful as relative differences, never as absolute numbers. Don't feed the model '2025-03-01' as an integer — it will try to learn patterns based on arbitrary calendar position.

datetime_features.pyPYTHON

import pandas as pd
import numpy as np

# Simulated transaction timestamps
dates = pd.to_datetime([
    '2025-01-05 08:15:00',
    '2025-01-05 23:45:00',
    '2025-03-15 14:30:00',
    '2025-07-20 03:10:00',
    '2025-11-11 12:00:00'
])

tx_df = pd.DataFrame({'timestamp': dates,
                     'amount': [50, 200, 15, 300, 100]})

# Extract basic components
tx_df['hour'] = tx_df['timestamp'].dt.hour
tx_df['day_of_week'] = tx_df['timestamp'].dt.dayofweek  # Monday=0
tx_df['month'] = tx_df['timestamp'].dt.month
tx_df['is_weekend'] = (tx_df['timestamp'].dt.dayofweek >= 5).astype(int)

# Cyclical encoding for hour (preserves 23 and 0 as close)
tx_df['hour_sin'] = np.sin(2 * np.pi * tx_df['hour'] / 24)
tx_df['hour_cos'] = np.cos(2 * np.pi * tx_df['hour'] / 24)

# Time since a reference (e.g., first transaction)
first_timestamp = tx_df['timestamp'].min()
tx_df['hours_since_first'] = (tx_df['timestamp'] - first_timestamp).dt.total_seconds() / 3600

print("=== Date/Time Features ===")
print(tx_df[['timestamp', 'hour', 'day_of_week', 'month', 'is_weekend', 'hour_sin', 'hour_cos', 'hours_since_first']])

Output

=== Date/Time Features ===

timestamp hour day_of_week month is_weekend hour_sin hour_cos hours_since_first

0 2025-01-05 08:15:00 8 6 1 0 0.866025 0.500000 0.0

1 2025-01-05 23:45:00 23 6 1 0 -0.258819 0.965926 15.5

2 2025-03-15 14:30:00 14 5 3 1 0.866025 -0.500000 1662.25

3 2025-07-20 03:10:00 3 6 7 0 0.707107 0.707107 4707.0

4 2025-11-11 12:00:00 12 1 11 0 0.000000 1.000000 7480.0

💡Pro Tip: Rolling Windows Need Care

When creating rolling window features (e.g., 'average order value in last 7 days'), always ensure the window looks only at past data relative to the current row. Using .rolling() with a closed='left' interval prevents lookahead bias. Never compute rolling windows on the full dataset without proper temporal ordering.

📊 Production Insight

A demand forecasting model fed raw 'datetime' as an integer and saw a 23% MAPE. After extracting month, day_of_week, and hour sin/cos, MAPE dropped to 11%.

The model could finally learn periodic patterns like weekly seasonality.

Rule: never feed raw timestamps. Extract cyclical and relative features.

🎯 Key Takeaway

Raw timestamps are useless to models — extract cyclical components, relative deltas, and business-period flags.

Use sin/cos encoding for hours, months, and days to preserve circular continuity.

Always compute rolling windows with strict left-closed intervals to avoid lookahead.

Text Feature Engineering — Turning Free Text Into Numeric Vectors

Unstructured text — product descriptions, user reviews, support tickets — carries dense signal that raw models can't parse. You need to convert text into numeric features. The most common techniques start with simple bag-of-words representations and scale up to pretrained embeddings.

Bag-of-Words (BoW) counts word occurrences in a document. It loses word order but captures topic presence. Use CountVectorizer for raw counts or TfidfVectorizer for term frequency–inverse document frequency, which downweights common words.

TF-IDF (Term Frequency–Inverse Document Frequency) gives higher weight to rare words that distinguish documents. It's the go-to for classification tasks like spam detection or topic tagging.

Pretrained embeddings (Word2Vec, GloVe, BERT) map words to dense vectors that encode semantic meaning. They're more powerful but require more data and compute. For most production pipelines, TF-IDF on cleaned text is the best starting point — it's fast, interpretable, and strong on many problems.

text_features.pyPYTHON

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample product descriptions
documents = [
    "Stainless steel water bottle with double-wall insulation",
    "Cotton t-shirt in size medium available in blue and black",
    "Wireless bluetooth noise-cancelling headphones with 30-hour battery",
    "Organic green tea bags pack of 100"
]

# --- Bag-of-Words ---
bow = CountVectorizer(stop_words='english')
bow_matrix = bow.fit_transform(documents)
print("=== Bag-of-Words Feature Matrix (first 2 documents) ===")
print(pd.DataFrame(bow_matrix.toarray()[:2], columns=bow.get_feature_names_out()).T)
print()

# --- TF-IDF ---
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(documents)
print("=== TF-IDF Feature Matrix (first 2 documents) ===")
print(pd.DataFrame(tfidf_matrix.toarray()[:2], columns=tfidf.get_feature_names_out()).T)

Output

=== Bag-of-Words Feature Matrix (first 2 documents) ===

0 1

bottle 1.0 0.0

cotton 0.0 1.0

insulation 1.0 0.0

medium 0.0 1.0

size 0.0 1.0

steel 1.0 0.0

water 1.0 0.0

dtype: float64

=== TF-IDF Feature Matrix (first 2 documents) ===

0 1

bottle 0.577350 0.000000

cotton 0.000000 0.447214

insulation 0.577350 0.000000

medium 0.000000 0.447214

size 0.000000 0.447214

steel 0.577350 0.000000

water 0.577350 0.000000

dtype: float64

🔥Real-World Practical: TF-IDF Baseline

Before reaching for BERT or GPT embeddings, try a TF-IDF baseline. It's deterministic, interpretable, and often matches deep models on many classification tasks. If your text corpus is small (<10k documents), pretrained embeddings may hurt performance due to domain mismatch. Always start simple.

📊 Production Insight

A customer support ticket routing system used a custom Word2Vec model trained on 500k tickets. Accuracy was 78%.

Replacing it with TF-IDF + Logistic Regression gave 76% accuracy but reduced inference latency from 200ms to 5ms.

Rule: start with TF-IDF. Only add embeddings if you see a clear accuracy gain that justifies the cost.

🎯 Key Takeaway

Text needs feature extraction — start with TF-IDF for its speed and interpretability.

Bag-of-Words is simple but loses word order; embeddings capture semantics but need more data.

Always clean text (lowercase, remove stopwords, normalize) before vectorizing.

The Science of Splits — How You Slice Your Data Changes Everything

Most engineers treat train/test splitting like a formality. They call sklearn.model_selection.train_test_split and move on. That's how you end up deploying a model that aces validation but falls apart in production. The split is the first and most critical piece of feature engineering.

Why? Because how you split determines what your model learns. Random splits are fine for i.i.d. data. But production data is rarely i.i.d. — it's time-ordered, grouped by user, or sampled from a distribution that shifts. If you split randomly on time-series data, you're giving your model access to future information. That's not a bug, that's a feature leak.

Here's the rule: split the way your model will be used in production. If it predicts tomorrow's sales, split by date — no random shuffling. If it recommends items per user, split by user — no user's data appearing in both train and test. Use TimeSeriesSplit or GroupKFold. And validate your splits by checking feature distributions across folds. If they drift, your features are unstable.

SplitStrategyValidation.pyPYTHON

// io.thecodeforge — ml-ai tutorial

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

# Load timestamped transaction data — real e-commerce
orders = pd.read_parquet("orders.parquet")  # columns: user_id, order_date, amount
orders.sort_values("order_date", inplace=True)  # critical for time-based

# NEVER do this for time-series: random shuffle
# X_train, X_test = train_test_split(orders, test_size=0.2, shuffle=True)  # leaks future

# Correct: time-based split preserving temporal order
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(orders):
    train, test = orders.iloc[train_idx], orders.iloc[test_idx]
    # Check feature stability — mean amount shouldn't drift significantly
    drift = abs(train["amount"].mean() - test["amount"].mean()) / train["amount"].std()
    print(f"Fold drift (standardized): {drift:.2f}")
    if drift > 0.3:
        print("Warning: feature distribution shift detected — revisit encoding.")

Output

Fold drift (standardized): 0.08

Fold drift (standardized): 0.12

Fold drift (standardized): 0.21

Fold drift (standardized): 0.09

Fold drift (standardized): 0.15

⚠ Production Trap:

Random split on time-series data is the #1 cause of over-optimistic validation scores. Your model learns to predict past patterns it has no right knowing. Always split by time — even if you lose a bit of training data.

🎯 Key Takeaway

Split your data the way your model will be used in production: time-based for forecasting, group-based for per-user models. Validate feature distributions across folds.

Missing Data Is Not a Cleanup — It's a Feature Signal

Most production pipelines treat missing values as a preprocessing inconvenience — something to impute, drop, or mask. That mindset discards information. In real-world systems, the presence or pattern of missingness often carries a stronger signal than the observed values themselves. A null in a sensor feed might indicate a failed reading, a thermal threshold exceeded, or a human operator skipping an entry. Each context creates a distinct missingness mechanism: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). If your model ignores which mechanism applies, it cannot generalize when the same missingness appears in production. Before you write a single imputation line, ask: why is this value absent? The answer might become your highest-impact feature.

Instead of hiding misses with mean imputation, add a binary indicator column per field that flags the missing record. This preserves the model's ability to learn patterns like “when temperature is null, the failure rate jumps 30%.” Keep imputation only for algorithms that cannot tolerate any nulls (e.g., linear regression or SVM). For tree-based models, let missing values remain — they already handle them natively. The key is: never drop rows that contain missing data until you have confirmed null itself is not predictive. A rule of thumb: if 5% or more of a column's values are missing, add the missing flag and retain the original nulls for any native-handling model.

missing_flag_example.pyPYTHON

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('sensor_data.csv')

# Add missing flag before any imputation
for col in ['temperature', 'pressure', 'vibration']:
    df[f'{col}_missing'] = df[col].isna().astype(int)

# Separate features and target
X = df.drop('failure', axis=1)
y = df['failure']

# Preserve actual NaN for tree-based models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# LightGBM handles NaN natively
model = LGBMClassifier()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")

Output

Test accuracy: 0.894

⚠ Production Trap:

Imputing with mean or median before adding a missing flag destroys the missingness signal. The flag becomes useless because the imputed value makes it impossible to distinguish between an actual missing value and a legitimate low value. Always create the flag column first, then impute if absolutely necessary.

🎯 Key Takeaway

Before imputing, add a binary missing flag per column — the pattern of nulls often predicts the target better than the observed values.

Scikit-Learn Pipelines — Stop Gluing Code Together in Notebooks

You're writing preprocessing code in one cell, fitting in another, transforming test data manually — and one day you'll forget to apply the same scaler to production data. That's how models silently break.

Scikit-learn's Pipeline and ColumnTransformer exist exactly for this. They chain your transforms and estimator into a single composable object. Fit it once on training data. Transform test data with the same fitted pipeline. No copy-paste errors, no train-test skew.

The real win? When you grid-search hyperparameters, the pipeline ensures every fold sees the exact same preprocessing chain. No leaking scaling parameters across folds.

Stop treating preprocessing as a script. Build pipelines. Your future self — and your production system — will thank you.

PipelineExample.pyPYTHON

// io.thecodeforge — ml-ai tutorial

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

num_cols = ['age', 'income']
cat_cols = ['education', 'city']

num_pipe = Pipeline([
    ('scaler', StandardScaler())
])

cat_pipe = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

full_pipe = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier())
])

X_train = [[25, 50000, 'bachelor', 'nyc'], [45, 120000, 'master', 'sf']]
y_train = [0, 1]
full_pipe.fit(X_train, y_train)
print(full_pipe.predict([[30, 60000, 'bachelor', 'la']]))

Output

[0]

⚠ Production Trap:

Never fit_transform test data. Use transform only. Pipelines enforce this — manual code doesn't.

🎯 Key Takeaway

Always wrap preprocessing and model into a single Pipeline — leak-proof, deployable, auditable.

Feature-Engine — Stop Writing 50-Line Wrappers for Missing Value Imputation

You're writing custom imputation logic to avoid leaking statistics. Feature-Engine wraps this into one-liners — mean, median, mode, constant, or even adding a missing indicator column. All within a scikit-learn compatible transformer.

Why does this matter? Because imputation is not a table-stakes cleanup. A missing value carries signal. A customer who won't share their income might default. Feature-Engine's AddMissingIndicator lets you capture that without extra columns of manual flagging.

It also provides categorical encoders (like WoE for logistic regressions), outlier cappers, and variable transformers — all fitting into your existing pipeline. You don't rewrite for every project.

One fit, one transform. No more spaghetti imputation code spanning three notebooks.

FeatureEngineExample.pyPYTHON

// io.thecodeforge — ml-ai tutorial

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import AddMissingIndicator

imputer = MeanMedianImputer(imputation_method='median', variables=['income'])
indicator = AddMissingIndicator(variables=['income'])

pipe = Pipeline([
    ('impute', imputer),
    ('missing_flag', indicator),
    ('clf', RandomForestClassifier())
])

X = [[50000], [None], [75000], [None]]
y = [0, 1, 0, 1]
pipe.fit(X, y)
print(pipe.predict([[60000], [None], [90000]]))

Output

[0 1 0]

💡Senior Shortcut:

AddMissingIndicator + imputation gives your model both the value and the fact it was missing — two signals for the price of one transform.

🎯 Key Takeaway

Imputation is feature engineering. Use Feature-Engine to impute and flag missingness in one pipeline step.

Conduct Exploratory Data Analysis (EDA) — See Your Data Before You Trust It

Models fail when you skip EDA. You need to understand distributions, outliers, correlations, and missing patterns before feature engineering. Start with df.describe() and df.info() to catch data types, null counts, and basic statistics. Plot histograms for numeric features — skewed distributions mislead scaling choices. Use boxplots to spot outliers that can dominate loss functions. Check correlation matrices with sns.heatmap() — high correlations between features cause multicollinearity and unstable coefficients. Pair plots reveal non-linear relationships that polynomial features can capture. Target encoding requires checking class frequencies to avoid rare category noise. Missing data patterns (MCAR, MAR, MNAR) determine imputation strategy — never impute blindly. EDA directly shapes your pipeline: log transforms for skewed data, clipping for outliers, interaction terms for visible interactions. Without EDA, you optimize garbage.

quick_eda.pyPYTHON

// io.thecodeforge — ml-ai tutorial

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
print(df.info())
print(df.describe())
print(df.isnull().sum())

sns.histplot(df['revenue'], bins=50)
plt.show()

sns.boxplot(x='category', y='price', data=df)
plt.show()

corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

Output

RangeIndex: 10000 entries, 0 to 9999

Data columns (total 8 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 revenue 10000 non-null float64

1 price 10000 non-null float64

2 category 9500 non-null object

3 age 10000 non-null int64

4 gender 9900 non-null object

5 city 10000 non-null object

6 signup 10000 non-null object

7 purchases 10000 non-null int64

dtypes: float64(2), int64(2), object(4)

Null counts:

category 500

age 0

gender 100

Skewness in revenue: 8.2 -> apply log transform

⚠ Production Trap:

Don't compute EDA on the full dataset. Split into train/val first, then run EDA only on training data. Leaking test-set statistics into feature engineering inflates validation scores and guarantees production failure.

🎯 Key Takeaway

Exploratory data analysis is not optional. It reveals transformations, encodings, and imputations your model needs before any algorithm touches the data.

Test Your Pipeline With Cross-Validation — Don't Trust a Single Holdout Score

One random train-test split can mislead you. A lucky split gives high accuracy; an unlucky one tanks your metrics. Cross-validation (CV) averages performance across multiple splits, revealing stability and variance. Use cross_val_score from scikit-learn with StratifiedKFold for classification to preserve class proportions. For time series, use TimeSeriesSplit — never shuffle temporal data. Monitor both mean and standard deviation of CV scores: high std means your pipeline is sensitive to data ordering or outliers. CV also validates feature engineering choices — leaky aggregates show inflated scores on early folds, then collapse. Always track train vs. validation scores per fold to detect overfitting. Grid search inside CV (via GridSearchCV) prevents data leakage from parameter tuning. A robust pipeline passes CV with narrow error bars. If your model's accuracy varies more than 2% across folds, your features are unstable or your data is too small.

cross_val_example.pyPYTHON

// io.thecodeforge — ml-ai tutorial

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='accuracy')

print(f'CV scores: {scores}')
print(f'Mean: {scores.mean():.3f} ± {scores.std():.3f}')

Output

CV scores: [0.87 0.91 0.85 0.89 0.88]

Mean: 0.880 ± 0.019

⚠ Production Trap:

Never use cross-validation on time series without TimeSeriesSplit. Standard CV shuffles future data into training folds, letting your model cheat by seeing tomorrow's data today. That 95% accuracy drops to 60% in production.

🎯 Key Takeaway

Cross-validation is the only honest evaluation method. A single holdout score is a gamble — CV reveals the true reliability of your feature pipeline.

● Production incidentPOST-MORTEMseverity: high

The Silent Accuracy Drop: Leaky Feature Engineering in a Fraud Detection Pipeline

Symptom

Model AUC dropped from 94% to 61% within 24 hours of going live. Online metrics were consistent with random guessing. Offline metrics had never matched production behavior, but the gap was too large to ignore.

Assumption

The team assumed their careful scaling and encoding pipeline was leak-free because they used scikit-learn's Pipeline. However, they had computed a 'rolling average transaction amount' using the entire dataset's history before splitting, not within cross-validation folds.

Root cause

The rolling average feature was computed on the full dataset using pandas .expanding().mean(). This used future transactions to compute the average, creating a lookahead bias. During validation, the model saw information from the test set, inflating its perceived accuracy. In production, that future information didn't exist, so the feature was nearly useless.

Fix

Recompute all aggregate features (rolling averages, target encodings, etc.) inside each cross-validation fold using sklearn's ColumnTransformer and a custom TimeSeriesSplit. Moved the feature engineering step into the pipeline so it respected temporal order. Retrained the model, which settled at a realistic 82% AUC.

Key lesson

Any feature that aggregates across rows (mean, rolling window, cumulative count) must be computed per fold, not on the full dataset.
Always validate your feature engineering pipeline with a holdout set that is completely untouched by any aggregate statistics.
If offline metrics look too good for the feature set you have, suspect data leakage before celebrating.

Production debug guideWhen your model's online performance diverges from offline expectations, these are the first three areas to inspect.3 entries

Symptom · 01

Online AUC/accuracy is significantly lower than offline cross-validation scores

→

Fix

Check for data leakage: review any feature that uses aggregate statistics (mean, median, rolling windows). Ensure those statistics were computed only on the training fold. A quick test: run the same pipeline on a fresh holdout set that was never touched during any aggregate computation.

Symptom · 02

Model performance degrades after deploying to a new data source or region

→

Fix

Compare feature distributions between training data and production data. Use KS-test or population stability index for numerical features; chi-square for categorical. If distributions differ significantly, consider retraining on more representative data or adding domain-specific encodings.

Symptom · 03

One-hot encoded features cause memory or latency spikes in production

→

Fix

Monitor the cardinality of the categorical feature. If new categories appear in production that weren't in training, your encoder will fail or create unseen columns. Use a OneHotEncoder handle_unknown='ignore' or switch to target encoding / frequency encoding. Also consider dimensionality reduction with PCA or t-SNE on the encoded matrix.

★ Quick Debug Cheat Sheet for Feature EngineeringResolve common feature engineering pitfalls in under a minute. Each symptom has an immediate action, two diagnostic commands, and a fix.

scaler sees unseen values or NaN in production−

Immediate action

Stop the inference pipeline. Check if production data has missing values or new categories.

Commands

pd.isna(production_data).sum()

production_data['category'].value_counts().index.difference(training_cats)

Fix now

Add a SimpleImputer for missing values and set handle_unknown='ignore' on OneHotEncoder.

model returns constant predictions despite varied input+

memory error on inference due to huge encoding matrix+

Feature Engineering Techniques Comparison

Technique	Best For	Main Risk	Tree Models Need It?	Linear Models Need It?
Min-Max Scaling	Bounded, clean numerical data; neural nets	Sensitive to outliers — one extreme value compresses everything else	No — trees split on thresholds, scale-invariant	Yes — gradient descent converges much faster
Standardization (Z-score)	Most numerical data, especially with outliers	Doesn't bound values; can confuse models expecting 0–1 input	No	Yes — essential for SVMs and logistic regression
One-Hot Encoding	Nominal categories with low cardinality (<20 unique values)	Feature explosion with high-cardinality columns	Yes — trees need numeric input	Yes — but watch for dummy variable trap
Ordinal Encoding	Categories with a meaningful, known order	Using it on nominal data invents false relationships	Yes	Yes — safe only when true order exists
Binning / Discretization	Non-linear step relationships; noisy continuous data	Loses granularity; bin boundaries are somewhat arbitrary	Rarely needed — trees find splits naturally	Very useful — converts complex curve into step function
Date/Time Cyclical Encoding	Periodic features like hour, month, weekday	Assumes the period is exactly constant (e.g., 24-hour cycle)	Can help with periodic splits	Highly beneficial — captures circular patterns
TF-IDF Vectorization	Text classification, topic modeling, search relevance	Large vocabulary leads to sparse high-dim matrix; ignores word order	Works well with tree ensembles	Works well with linear classifiers

⚙ Quick Reference

11 commands from this guide

File	Command / Code	Purpose
scaling_comparison.py	from sklearn.preprocessing import MinMaxScaler, StandardScaler	Scaling and Normalization
categorical_encoding.py	from sklearn.preprocessing import LabelEncoder, OrdinalEncoder	Encoding Categorical Variables
feature_creation.py	customers = pd.DataFrame({	Engineering New Features
datetime_features.py	dates = pd.to_datetime([	Working with Date and Time Features
text_features.py	from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer	Text Feature Engineering
SplitStrategyValidation.py	from sklearn.model_selection import TimeSeriesSplit, GroupKFold	The Science of Splits
missing_flag_example.py	from sklearn.model_selection import train_test_split	Missing Data Is Not a Cleanup
PipelineExample.py	from sklearn.pipeline import Pipeline	Scikit-Learn Pipelines
FeatureEngineExample.py	from sklearn.pipeline import Pipeline	Feature-Engine
quick_eda.py	df = pd.read_csv('data.csv')	Conduct Exploratory Data Analysis (EDA)
cross_val_example.py	from sklearn.model_selection import cross_val_score, StratifiedKFold	Test Your Pipeline With Cross-Validation

Key takeaways

Scale matters to algorithms, not to reality

a $100,000 income isn't 'more' than a 35-year-old age, but unscaled, your model will treat it that way. Always scale before feeding distance-based or gradient-based models.

Label Encoding nominal data silently invents a ranking that doesn't exist

this is one of the most common, hardest-to-spot bugs in ML pipelines. The model learns the fake ordering as real signal.

The best engineered features come from domain knowledge, not algorithms

a ratio, flag, or time-delta you construct from business understanding will often outperform a dozen raw columns fed to even a deep model.

Fit your transformers on training data only

data leakage from fitting scalers or encoders on the full dataset is the single biggest reason evaluation metrics don't match production performance.

Text and date features require explicit extraction

raw timestamps and text strings are opaque to models. Use TF-IDF for text, cyclical encoding for periodic dates, and rolling windows for temporal aggregations.

Common mistakes to avoid

5 patterns

Fitting the scaler on the full dataset before the train/test split

Symptom

Your validation accuracy looks suspiciously high and doesn't hold up in production. Offline metrics are optimistic because the scaler used test data statistics.

Fix

Always split your data first, then call .fit_transform() only on the training set and .transform() on the test set. The scaler should never 'see' test data during fitting.

Using Label Encoding on nominal (unordered) categorical columns

Symptom

Your linear model or KNN gives oddly poor results on a column you thought was encoded correctly. The model learns a false ordinal relationship.

Fix

Check whether a natural order genuinely exists. If 'Red < Green < Blue' is nonsense in your domain, switch to One-Hot Encoding. Save Label/Ordinal Encoding strictly for features like size ratings, survey scores, or education levels where order is real.

Engineering features on the full pipeline without accounting for train/test leakage

Symptom

A derived feature like 'days_to_first_purchase' is computed using aggregate statistics (e.g., mean days across all users) before the split, causing lookahead bias.

Fix

Any feature that aggregates across rows (target encoding, mean statistics, rolling averages) must be computed inside a cross-validation fold or pipeline step, not on the full dataset upfront. Use sklearn Pipeline or ColumnTransformer to ensure transformations respect data splits.

Using raw timestamps as integer features

Symptom

Model fails to capture seasonality; performance is poor for time-series forecasting or temporal problems.

Fix

Extract components: hour, day_of_week, month, then apply cyclical encoding (sin/cos). Create relative features like 'days_since_last_event'. Never use raw timestamps as continuous numbers.

Ignoring text preprocessing before vectorization

Symptom

TF-IDF matrix is huge and models perform poorly because of noise from stopwords, punctuation, and inconsistent casing.

Fix

Always lowercase, remove punctuation, strip stopwords, and consider stemming/lemmatization. Use ngram_range=(1,2) in TfidfVectorizer to capture phrases.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

You have a 'City' column with 3,000 unique city names. One-hot encoding ...

Q02SENIOR

A colleague says their model accuracy jumped from 72% to 91% after featu...

Q03SENIOR

What's the difference between feature engineering and feature selection?...

Q01 of 03SENIOR

You have a 'City' column with 3,000 unique city names. One-hot encoding would create 3,000 new columns. Walk me through at least two alternative encoding strategies and when you'd choose each.

ANSWER

For high-cardinality nominal features like city names, One-Hot is rarely the best first choice. Two alternatives: 1. Target Encoding (Mean Encoding): Replace each city with the mean of the target variable for that city (e.g., average click-through rate for that city). It compresses 3,000 columns into 1 and can capture strong predictive signal. The risk is target leakage — you must compute the encoding inside cross-validation folds, never on the full dataset. 2. Frequency Encoding: Replace each city with its frequency in the training data (how often it appears). It's simple, leakage-safe, and works well when rare categories are noisy. The downside: it loses all relationship between city and target, so it's weaker for direct prediction. Choose target encoding if the target is strongly city-dependent and you can manage cross-validation. Choose frequency encoding if you need a fast, safe baseline or if city specificity is low.

FAQ · 3 QUESTIONS

Frequently Asked Questions

Do I need to do feature engineering if I'm using a decision tree or random forest?

What's the difference between normalization and standardization?

How do I know which features to engineer? Is there a systematic process?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 27, 2026

last updated

1,713

articles · all by Naren

🔥

That's ML Basics. Mark it forged?

8 min read · try the examples if you haven't