Skip to content
Home Python Pandas DataFrame — Merge Explosion from Duplicate Keys

Pandas DataFrame — Merge Explosion from Duplicate Keys

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Python Libraries → Topic 4 of 51
A daily ETL job ballooned to 50GB and timed out due to duplicate keys in a merge.
⚙️ Intermediate — basic Python knowledge assumed
In this tutorial, you'll learn
A daily ETL job ballooned to 50GB and timed out due to duplicate keys in a merge.
  • A DataFrame is not just a 2D list — its labelled columns, typed dtypes, and named index are what make vectorised operations fast and readable.
  • Use loc for label-based and condition-based access; use iloc for position-based access. Never chain selections when assigning values — always use loc with both row and column specifiers.
  • groupby().agg() reduces your data to one row per group; groupby().transform() keeps the original shape and is the correct tool for adding group-level statistics as new columns.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Pandas DataFrame is a labelled, column-typed, index-aware 2D structure
  • Use .loc for label-based access, .iloc for position-based access — mixing them causes silent bugs
  • groupby().agg() reduces rows to groups; groupby().transform() keeps original shape for adding group stats
  • Merging without checking for duplicate keys silently creates Cartesian products
  • Chaining bracket assignments like df[mask]['col'] = value modifies a copy, not the original DataFrame
🚨 START HERE

Pandas DataFrame Quick Debug Cheat Sheet

Fire these commands first when data looks wrong.
🟡

No changes after assignment

Immediate ActionCheck if you used chained brackets
Commands
df.loc[df['region'] == 'North', 'revenue'] = 0
df.is_copy # Returns a reference or None
Fix NowAlways use .loc for assignment, never chained indexing.
🟡

Merge produced too many rows

Immediate ActionCheck for duplicate keys on right DataFrame
Commands
right_df['key'].value_counts().max()
left_df.merge(right_df, on='key', how='left', indicator=True)['_merge'].value_counts()
Fix NowDrop duplicates from the right side before merging.
🟡

Numeric column shows object dtype

Immediate ActionFind the non-numeric values
Commands
df[col].apply(type).value_counts()
pd.to_numeric(df[col], errors='coerce').isna().sum()
Fix NowUse pd.to_numeric with errors='coerce' then fill or drop NaNs.
Production Incident

Production Data Feeder Crashed After Merge Explosion

A sales analytics pipeline started producing billions of rows instead of thousands after an unnoticed duplicate key in a lookup table.
SymptomDaily ETL job that usually finishes in 10 minutes started timing out after 2 hours. The output CSV was 50GB instead of 500MB.
AssumptionThe merge was correct because the left DataFrame had 5000 rows and the right had 5000 rows, so the result should be around 5000 rows.
Root causeThe right-hand DataFrame had 3,000 duplicate customer IDs (multiple records per customer due to a historical import bug). The left join on customer_id exploded each row into 3,000 matching rows, producing 5000 × 3000 = 15 million rows per batch, multiplied across 10 batches.
FixAdded a deduplication step before the merge: customers_df = customers_df.drop_duplicates(subset='customer_id', keep='first'). Also added a row count assertion after every merge to alert if row count exceeds 110% of expected.
Key Lesson
Always check for duplicate keys on both sides of a merge before running it in production — use df['key'].value_counts() to spot duplicates.Add a row count sanity check after every merge with a threshold: if len(result) > len(left_df) * 1.1, alert and stop.Never assume a one-to-many relationship is real until you've verified uniqueness on the 'one' side.
Production Debug Guide

Three symptoms that indicate your DataFrame isn't changing the way you think it is.

Column values unchanged after assignment using chained bracketsStop using df[df['col'] > 0]['new_col'] = value. Use df.loc[df['col'] > 0, 'new_col'] = value instead.
Unexpected row count after a merge or groupbyRun df['key'].value_counts() on both DataFrames. Check for duplicates. Also verify that the join key dtype matches across both.
GroupBy or aggregation returns NaN when you expect numbersCheck column dtypes: df.dtypes. If a numeric column shows 'object', there's a non-numeric value like 'N/A' or '1,200' hiding inside. Use pd.to_numeric(column, errors='coerce') to fix.

Every data-driven Python project — from analysing sales figures to training a machine-learning model — eventually hits the same wall: raw data is messy, scattered, and weirdly shaped. You've got CSV files, database query results, and API responses all telling different stories in different formats. Pandas DataFrames exist precisely to be the one place where all that chaos comes together and starts making sense. They're the reason data scientists can do in five lines what would take fifty lines of raw Python loops.

What a DataFrame Actually Is (and Why It's Not Just a List of Lists)

A lot of beginners assume a DataFrame is just a fancy 2D list. It isn't — and that distinction matters the moment your data gets real. A DataFrame is a labelled, column-typed, index-aware structure. Every column has a name and a dtype. Every row has an index. That metadata is what lets you write df['revenue'].sum() instead of sum(row[3] for row in my_list) — it's not just cleaner syntax, it's a fundamentally different contract with your data.

When you load a CSV or query a database, Pandas infers the type of each column automatically. Integers stay integers, strings become object dtype, dates can become datetime64. This typing means Pandas can delegate heavy number-crunching to NumPy under the hood — which is why operations on millions of rows stay fast.

The index is another superpower people underuse. By default it's just 0, 1, 2… but set it to something meaningful — a customer ID, a timestamp — and suddenly lookups, joins, and time-series resampling become trivially simple.

dataframe_basics.py · PYTHON
123456789101112131415161718192021
import pandas as pd

# Build a small sales dataset manually — in real life this comes from a CSV or DB
sales_data = {
    'order_id':    [1001, 1002, 1003, 1004],
    'customer':    ['Alice', 'Bob', 'Alice', 'Carol'],
    'product':     ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'revenue':     [1200.00, 25.50, 89.99, 349.00],
    'order_date':  pd.to_datetime(['2024-01-15', '2024-01-17', '2024-01-17', '2024-01-20'])
}

df = pd.DataFrame(sales_data)

# Set order_id as the index — now we can look up orders by ID, not by position
df = df.set_index('order_id')

print(df)
print("\n--- Column dtypes ---")
print(df.dtypes)
print("\n--- Quick summary stats for revenue ---")
print(df['revenue'].describe())
▶ Output
customer product revenue order_date
order_id
1001 Alice Laptop 1200.00 2024-01-15
1002 Bob Mouse 25.50 2024-01-17
1003 Alice Keyboard 89.99 2024-01-17
1004 Carol Monitor 349.00 2024-01-20

--- Column dtypes ---
customer object
product object
revenue float64
order_date datetime64[ns]
dtype: object

--- Quick summary stats for revenue ---
count 4.000000
mean 416.122500
std 513.744859
min 25.500000
25% 74.742500
50% 219.495000
75% 560.875000
max 1200.000000
Name: revenue, dtype: float64
💡Pro Tip:
Always call df.dtypes right after loading data. If a numeric column shows object dtype, it means there's a hidden string (like '1,200' or 'N/A') lurking in the data. Catch it early — every aggregation on that column will silently fail or error later.
📊 Production Insight
object dtype on a numeric column breaks all vectorised operations silently.
Use pd.to_numeric() with errors='coerce' to surface the problem rows.
Pattern: always check dtypes after reading data — it's a 2-second sanity check that saves hours.
🎯 Key Takeaway
A DataFrame's labelled columns and typed dtypes are what make operations fast and readable.
Check dtypes on every load — object on a number column is a red flag.
Make the index meaningful early — it unlocks fast lookups and resampling.

Selecting Data the Right Way — loc, iloc, and When Each One Saves You

Data selection is where most intermediate developers have a messy relationship with Pandas. They mix bracket notation, loc, and iloc interchangeably until something breaks and they don't know why. Let's fix that.

loc is label-based. You give it index labels and column names — the human-readable identifiers. iloc is position-based. You give it integer row and column positions — like a zero-indexed 2D array. They look similar but they're entirely different tools.

The practical rule: use loc when you're thinking about your data ('give me Alice's orders'), and use iloc when you're thinking about structure ('give me the first 10 rows for a preview'). The bracket shorthand df['column'] is fine for single column access, but the moment you're selecting rows AND columns together, always use loc or iloc explicitly. Ambiguity here is a direct path to the dreaded SettingWithCopyWarning.

Boolean indexing — filtering with a condition — plugs into loc perfectly and is how you'll do the majority of real-world filtering.

dataframe_selection.py · PYTHON
123456789101112131415161718192021222324252627282930
import pandas as pd

sales_data = {
    'order_id':   [1001, 1002, 1003, 1004, 1005],
    'customer':   ['Alice', 'Bob', 'Alice', 'Carol', 'Bob'],
    'product':    ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'revenue':    [1200.00, 25.50, 89.99, 349.00, 79.95],
    'region':     ['North', 'South', 'North', 'West', 'South']
}

df = pd.DataFrame(sales_data).set_index('order_id')

# loc: label-based — give me Alice's rows, and only show product and revenue columns
alice_orders = df.loc[df['customer'] == 'Alice', ['product', 'revenue']]
print("--- Alice's orders (loc with boolean mask) ---")
print(alice_orders)

# iloc: position-based — give me rows 0 and 1, columns 0 to 2 (slice is exclusive at end)
first_two_rows_preview = df.iloc[0:2, 0:3]
print("\n--- First 2 rows preview (iloc) ---")
print(first_two_rows_preview)

# Chained boolean conditions — high-value orders in North region
high_value_north = df.loc[(df['revenue'] > 100) & (df['region'] == 'North'), :]
print("\n--- High-value North orders ---")
print(high_value_north)

# Single value lookup by known index label — fast O(1) access
order_1004_revenue = df.loc[1004, 'revenue']
print(f"\nRevenue for order 1004: £{order_1004_revenue}")
▶ Output
--- Alice's orders (loc with boolean mask) ---
product revenue
order_id
1001 Laptop 1200.00
1003 Keyboard 89.99

--- First 2 rows preview (iloc) ---
customer product revenue
order_id
1001 Alice Laptop 1200.00
1002 Bob Mouse 25.50

--- High-value North orders ---
customer product revenue region
order_id
1001 Alice Laptop 1200.00 North

Revenue for order 1004: £349.0
⚠ Watch Out:
Never chain selections like df[df['region'] == 'North']['revenue'] = 0. Pandas may operate on a copy and silently discard your change. Always use df.loc[df['region'] == 'North', 'revenue'] = 0 to guarantee you're modifying the original DataFrame.
📊 Production Insight
Chained indexing is the #1 cause of silent data loss in production Pandas code.
If you see SettingWithCopyWarning in your logs, that change probably didn't stick.
The fix is trivial: use .loc for all assignments involving filters.
🎯 Key Takeaway
loc = label-based, iloc = position-based — they are not interchangeable.
Never use chained brackets for assignment — always use .loc with row and column specifiers.
Boolean filter + loc is your default pattern for selecting subsets.

GroupBy and Aggregation — Turning Raw Rows Into Business Insights

This is where DataFrames stop being a data container and start being an analytical tool. groupby() splits your DataFrame into logical buckets — by customer, by region, by month — and then you apply a calculation to each bucket. The result is a summarised DataFrame that answers real questions.

Under the hood, groupby uses a split-apply-combine strategy: it splits the data by the group key, applies your aggregation function independently to each group, then combines the results back into a single structure. Understanding this model helps you predict what the output will look like before you run it.

The real power shows up with agg(), which lets you apply multiple different functions to different columns in a single pass — no need to run five separate groupby calls. For reporting and dashboards, this pattern is indispensable.

You can also chain groupby with transform() when you need to add a group-level statistic back as a new column on every row — for example, adding each customer's total spend alongside their individual orders.

dataframe_groupby.py · PYTHON
12345678910111213141516171819202122232425262728293031
import pandas as pd

orders = pd.DataFrame({
    'order_id':  [1001, 1002, 1003, 1004, 1005, 1006],
    'customer':  ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'],
    'region':    ['North', 'South', 'North', 'West', 'South', 'North'],
    'revenue':   [1200.00, 25.50, 89.99, 349.00, 79.95, 430.00],
    'units':     [1, 2, 1, 1, 3, 2]
})

# --- Simple groupby: total revenue per customer ---
revenue_by_customer = orders.groupby('customer')['revenue'].sum().reset_index()
revenue_by_customer.columns = ['customer', 'total_revenue']
print("--- Revenue by customer ---")
print(revenue_by_customer)

# --- Multi-metric aggregation: different stats per column in one pass ---
summary = orders.groupby('region').agg(
    total_revenue=('revenue', 'sum'),    # sum of revenue per region
    order_count=('order_id', 'count'),   # how many orders per region
    avg_units=('units', 'mean')          # average units per order per region
).reset_index()
print("\n--- Regional summary ---")
print(summary)

# --- transform: add each customer's total spend as a new column on every row ---
# Useful for calculating what percentage of a customer's total each order represents
orders['customer_total'] = orders.groupby('customer')['revenue'].transform('sum')
orders['pct_of_customer_total'] = (orders['revenue'] / orders['customer_total'] * 100).round(1)
print("\n--- Orders with customer share % ---")
print(orders[['order_id', 'customer', 'revenue', 'customer_total', 'pct_of_customer_total']])
▶ Output
--- Revenue by customer ---
customer total_revenue
0 Alice 1719.99
1 Bob 105.45
2 Carol 349.00

--- Regional summary ---
region total_revenue order_count avg_units
0 North 1719.99 3 1.333333
1 South 105.45 2 2.500000
2 West 349.00 1 1.000000

--- Orders with customer share % ---
order_id customer revenue customer_total pct_of_customer_total
0 1001 Alice 1200.00 1719.99 69.8
1 1002 Bob 25.50 105.45 24.2
2 1003 Alice 89.99 1719.99 5.2
3 1004 Carol 349.00 349.00 100.0
4 1005 Bob 79.95 105.45 75.8
5 1006 Alice 430.00 1719.99 25.0
🔥Interview Gold:
Interviewers love asking the difference between groupby().agg() and groupby().transform(). The key distinction: agg() reduces the DataFrame to one row per group. transform() keeps the original shape and broadcasts the group result back to every row — perfect for computing ratios or z-scores within groups.
📊 Production Insight
Forgetting reset_index() after groupby leaves the group key as the index — brittle for downstream code.
Always call reset_index() immediately after agg() to keep columns clean.
Transform is memory heavier than agg — use it only when you need per-row group context.
🎯 Key Takeaway
groupby().agg() reduces to one row per group — use for summaries.
groupby().transform() keeps shape — use for adding group context like percentages.
Reset index right after agg() to avoid index-dependent bugs.

Merging DataFrames — The SQL Join You Already Know, In Python

Real-world data is rarely in one place. You've got an orders table, a customers table, and a products table — all separate. Merging DataFrames is how you bring them together, and it maps directly onto SQL JOINs you may already know.

pd.merge() is the workhorse. You specify the two DataFrames, the key column(s) to join on, and the type of join: inner (only matching rows), left (all rows from the left, matched where possible), right, or outer (all rows from both). Understanding which join type to use is more important than memorising the syntax.

A common real-world pattern is a left join against a reference table — you have a list of orders and want to enrich each one with customer details, keeping all orders even if a customer record is somehow missing.

After merging, always check the row count. If you expected 500 rows and got 2000, you've hit a one-to-many relationship you didn't account for — a classic data explosion caused by duplicate keys on the right-hand DataFrame.

dataframe_merge.py · PYTHON
123456789101112131415161718192021222324252627282930
import pandas as pd

# Table 1: raw orders — just IDs, no names
orders_df = pd.DataFrame({
    'order_id':   [1001, 1002, 1003, 1004, 1005],
    'customer_id':[201,  202,  201,  203,  204],    # 204 has no customer record yet
    'revenue':    [1200.00, 25.50, 89.99, 349.00, 79.95]
})

# Table 2: customer reference data — note customer 204 is missing
customers_df = pd.DataFrame({
    'customer_id': [201, 202, 203],
    'customer_name': ['Alice', 'Bob', 'Carol'],
    'tier': ['Gold', 'Silver', 'Gold']
})

# Inner join: only orders where we have a matching customer record
inner_merged = pd.merge(orders_df, customers_df, on='customer_id', how='inner')
print("--- Inner join (drops order 1005 — no customer record) ---")
print(inner_merged)

# Left join: keep ALL orders, fill missing customer info with NaN
left_merged = pd.merge(orders_df, customers_df, on='customer_id', how='left')
print("\n--- Left join (keeps order 1005 with NaN for missing customer) ---")
print(left_merged)

# Practical check: did we unexpectedly gain or lose rows?
print(f"\nOriginal orders: {len(orders_df)} rows")
print(f"After left join:  {len(left_merged)} rows")
print(f"Rows with no customer match: {left_merged['customer_name'].isna().sum()}")
▶ Output
--- Inner join (drops order 1005 — no customer record) ---
order_id customer_id revenue customer_name tier
0 1001 201 1200.00 Alice Gold
1 1002 202 25.50 Bob Silver
2 1003 201 89.99 Alice Gold
3 1004 203 349.00 Carol Gold

--- Left join (keeps order 1005 with NaN for missing customer) ---
order_id customer_id revenue customer_name tier
0 1001 201 1200.00 Alice Gold
1 1002 202 25.50 Bob Silver
2 1003 201 89.99 Alice Gold
3 1004 203 349.00 Carol Gold
4 1005 204 79.95 NaN NaN

Original orders: 5 rows
After left join: 5 rows
Rows with no customer match: 1
⚠ Watch Out:
If your join key has duplicates on BOTH sides, Pandas creates a cartesian product for those keys — 3 matching rows on the left × 3 on the right = 9 output rows, silently. Before merging, run df['key_col'].value_counts() on both DataFrames and confirm you understand where duplicates exist.
📊 Production Insight
Row count explosion from merge is one of the most expensive silent bugs — it can bloat memory to OOM.
Always run value_counts on the key column of the right DataFrame before merging.
Add an assertion: assert len(result) <= len(left_df) * 1.1, or use indicator=True to track row origins.
🎯 Key Takeaway
Merges map SQL joins — know inner, left, right, outer.
Check key uniqueness on the 'many' side before every merge.
Row count explosion from duplicates is silent and costly — assert row counts after every merge.

Handling Missing Data — The Patterns That Save Your Pipeline

Real-world data is never clean. Missing values creep in from JSON nulls, empty cells in CSVs, and incomplete database records. Pandas represents missing data as NaN (for numeric) and None (for object columns), but they behave oddly if you're not careful.

The most common mistake is trying to check df['col'] == None or df['col'] == float('nan') — both always return False because NaN is not equal to itself by IEEE 754 definition. You must use df['col'].isna() or pd.isnull() instead.

Three strategies for handling missing data: 1. Drop rows with df.dropna() — cruel but fast. 2. Fill with a constant or computed value using df.fillna(). Common choices: column mean, median, or a placeholder like 0 or 'Unknown'. 3. Forward or backward fill (method='ffill' or 'bfill') for time series — carry the last known value forward.

The right choice depends on the business context. Dropping rows with missing key identifiers is fine. Dropping 40% of your data because a single optional column has gaps is not.

dataframe_missing.py · PYTHON
1234567891011121314151617181920212223242526272829303132
import pandas as pd

# Messy sales data with missing values
messy = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer': ['Alice', 'Bob', None, 'Carol', 'Alice'],
    'revenue':  [1200.00, None, 89.99, 349.00, None],
    'region':   ['North', 'South', None, 'West', 'North']
})

print("--- Raw messy data ---")
print(messy)

# Detect missing
print("\n--- Missing value counts ---")
print(messy.isna().sum())

# Drop rows where customer is missing (critical field)
cleaned = messy.dropna(subset=['customer'])
print("\n--- After dropping rows with missing customer ---")
print(cleaned)

# Fill missing revenue with column median (reasonable imputation)
median_revenue = messy['revenue'].median()
messy['revenue'] = messy['revenue'].fillna(median_revenue)
print(f"\n--- Filled missing revenue with median ({median_revenue}) ---")
print(messy[['order_id', 'revenue']])

# Forward fill region (carry last known region forward, if ordering makes sense)
messy['region'] = messy['region'].fillna(method='ffill')
print("\n--- After forward filling region ---")
print(messy[['order_id', 'region']])
▶ Output
--- Raw messy data ---
order_id customer revenue region
0 1001 Alice 1200.00 North
1 1002 Bob NaN South
2 1003 None 89.99 None
3 1004 Carol 349.00 West
4 1005 Alice NaN North

--- Missing value counts ---
order_id 0
customer 1
revenue 2
region 1
dtype: int64

--- After dropping rows with missing customer ---
order_id customer revenue region
0 1001 Alice 1200.00 North
1 1002 Bob NaN South
3 1004 Carol 349.00 West
4 1005 Alice NaN North

--- Filled missing revenue with median (219.495) ---
order_id revenue
0 1001 1200.00
1 1002 219.49
2 1003 89.99
3 1004 349.00
4 1005 219.49

--- After forward filling region ---
order_id region
0 1001 North
1 1002 South
2 1003 South
3 1004 West
4 1005 North
🔥Remember:
df.fillna(method='ffill') implicitly uses the previous row's value. This only makes sense if rows are ordered meaningfully — never apply forward fill on unsorted data or grouped data without resetting the index.
📊 Production Insight
Dropping rows with NaN in a critical column is okay — dropping rows with NaN in an optional column wastes data.
Always quantify the impact: df.isna().sum() / len(df) * 100 to see percentage missing.
Imputation with median is safer than mean when data has outliers.
🎯 Key Takeaway
NaN != None != 0 — use isna() to detect missing values.
Drop only when the column is business-critical; impute with context-aware methods.
Forward fill only on ordered data — never apply it blindly.
🗂 loc vs iloc: Quick Reference
Feature / Aspectloc (label-based)iloc (position-based)
How you reference rowsBy index label (e.g. 'Alice', 1004)By integer position (e.g. 0, 1, 2)
How you reference columnsBy column name (e.g. 'revenue')By integer position (e.g. 0, 1, 2)
Slice endpointInclusive — loc[0:3] gives rows 0,1,2,3Exclusive — iloc[0:3] gives rows 0,1,2
Best use caseFiltering with conditions and named lookupsPreviewing structure, position-based splits
Works after set_index?Yes — uses the new index labelsYes — always uses raw integer position
Risk of ambiguityLow — explicit labelsMedium — positions shift if rows are sorted

🎯 Key Takeaways

  • A DataFrame is not just a 2D list — its labelled columns, typed dtypes, and named index are what make vectorised operations fast and readable.
  • Use loc for label-based and condition-based access; use iloc for position-based access. Never chain selections when assigning values — always use loc with both row and column specifiers.
  • groupby().agg() reduces your data to one row per group; groupby().transform() keeps the original shape and is the correct tool for adding group-level statistics as new columns.
  • After every merge, verify the row count. A row explosion caused by duplicate join keys is one of the most common and hardest-to-spot bugs in data pipelines.
  • Missing data detection requires isna() — never use == or == None. Choose drop or fill based on business context, not convenience.

⚠ Common Mistakes to Avoid

    Chaining column assignment (df[mask]['col'] = value)
    Symptom

    Pandas silently modifies a copy instead of the original. Your change disappears with no error. The DataFrame appears unchanged in subsequent operations.

    Fix

    Always use df.loc[mask, 'col'] = value for in-place assignment. Never chain brackets when setting values.

    Forgetting reset_index() after groupby
    Symptom

    The grouped column becomes the index, which breaks downstream merges and column references by name. You might get KeyError on the grouped column.

    Fix

    Append .reset_index() immediately after .agg() or .sum() to restore a clean integer index and promote the grouped column to a regular column.

    Using == to check for NaN values
    Symptom

    df['col'] == None or df['col'] == float('nan') both return False for all rows, even those with missing data. Your filter silently does nothing.

    Fix

    Use df['col'].isna() or pd.isnull(df['col']) which are built specifically for this purpose.

Interview Questions on This Topic

  • QWhat is the difference between loc and iloc in Pandas, and can you give a scenario where using the wrong one would produce incorrect results rather than an error?Mid-levelReveal
    loc accesses rows by index label (inclusive), iloc by integer position (exclusive). If you set a custom index (e.g., customer IDs) and then use iloc to find a row by label, you'll get a different row or an error. Example: df.set_index('order_id') then df.loc[1001] returns order 1001, but df.iloc[0] returns the first row regardless of label. Using iloc when you intended loc silently returns the wrong data if the index isn't 0,1,2... No warning.
  • QExplain the split-apply-combine pattern. How does groupby().transform() differ from groupby().agg(), and when would you choose one over the other?SeniorReveal
    Split-apply-combine splits data by group key, applies a function independently to each group, then combines results. agg() reduces each group to a single value (e.g., sum, mean) — output has one row per group. transform() returns a result with the same shape as the original DataFrame, broadcasting the group statistic back to each row. Use agg() when you want a summary table; use transform() when you need to add a group-level column (e.g., percentage of total) while keeping all rows.
  • QIf you perform a merge and the resulting DataFrame has significantly more rows than either input, what has likely gone wrong and how would you diagnose it?SeniorReveal
    The most common cause is duplicate keys on the 'many' side of the join, creating a Cartesian product for those keys. Diagnose by running df['key'].value_counts().max() on both DataFrames before merging. Check if the max count > 1 on the side you assumed was unique. Also use indicator=True in merge to see which rows matched, left-only, right-only. Fix by deduplicating: df.drop_duplicates(subset='key') where appropriate.

Frequently Asked Questions

What is the difference between a Pandas Series and a DataFrame?

A Series is a single labelled column — one-dimensional, like a list with an index. A DataFrame is a collection of Series sharing the same index — it's two-dimensional, like a table. When you select one column from a DataFrame with df['revenue'], you get a Series back.

How do I handle missing values in a Pandas DataFrame?

Use df.isna() to detect them and df.isna().sum() to count them per column. To fill them use df.fillna(value) with a constant, a column mean, or a forward-fill strategy with method='ffill'. To drop rows with any missing value use df.dropna(). Choose based on whether the missingness is random or meaningful to your analysis.

Why is my Pandas operation slow on a large DataFrame?

The most common culprits are row-by-row Python loops (use vectorised column operations instead), applying Python functions with df.apply() on large datasets (use built-in Pandas or NumPy methods where possible), and operating on object dtype columns that should be numeric or categorical. Converting high-cardinality string columns to category dtype can cut memory and speed up groupby operations dramatically.

When should I use a DataFrame vs a NumPy array?

Use a DataFrame when you need labelled columns, mixed data types, missing data handling, or SQL-like operations (groupby, merge). Use a NumPy array when you're doing pure numerical computations that don't need labels — NumPy operations are typically faster and use less memory for homogeneous numeric data.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousPandas BasicsNext →Matplotlib Basics
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged