Pandas DataFrame — Merge Explosion from Duplicate Keys
- A DataFrame is not just a 2D list — its labelled columns, typed dtypes, and named index are what make vectorised operations fast and readable.
- Use
locfor label-based and condition-based access; useilocfor position-based access. Never chain selections when assigning values — always uselocwith both row and column specifiers. reduces your data to one row per group;groupby().agg()keeps the original shape and is the correct tool for adding group-level statistics as new columns.groupby().transform()
- Pandas DataFrame is a labelled, column-typed, index-aware 2D structure
- Use .loc for label-based access, .iloc for position-based access — mixing them causes silent bugs
- groupby().agg() reduces rows to groups; groupby().transform() keeps original shape for adding group stats
- Merging without checking for duplicate keys silently creates Cartesian products
- Chaining bracket assignments like df[mask]['col'] = value modifies a copy, not the original DataFrame
Pandas DataFrame Quick Debug Cheat Sheet
No changes after assignment
df.loc[df['region'] == 'North', 'revenue'] = 0df.is_copy # Returns a reference or NoneMerge produced too many rows
right_df['key'].value_counts().max()left_df.merge(right_df, on='key', how='left', indicator=True)['_merge'].value_counts()Numeric column shows object dtype
df[col].apply(type).value_counts()pd.to_numeric(df[col], errors='coerce').isna().sum()Production Incident
customers_df = customers_df.drop_duplicates(subset='customer_id', keep='first'). Also added a row count assertion after every merge to alert if row count exceeds 110% of expected.Production Debug GuideThree symptoms that indicate your DataFrame isn't changing the way you think it is.
Every data-driven Python project — from analysing sales figures to training a machine-learning model — eventually hits the same wall: raw data is messy, scattered, and weirdly shaped. You've got CSV files, database query results, and API responses all telling different stories in different formats. Pandas DataFrames exist precisely to be the one place where all that chaos comes together and starts making sense. They're the reason data scientists can do in five lines what would take fifty lines of raw Python loops.
What a DataFrame Actually Is (and Why It's Not Just a List of Lists)
A lot of beginners assume a DataFrame is just a fancy 2D list. It isn't — and that distinction matters the moment your data gets real. A DataFrame is a labelled, column-typed, index-aware structure. Every column has a name and a dtype. Every row has an index. That metadata is what lets you write df['revenue'].sum() instead of sum(row[3] for row in my_list) — it's not just cleaner syntax, it's a fundamentally different contract with your data.
When you load a CSV or query a database, Pandas infers the type of each column automatically. Integers stay integers, strings become object dtype, dates can become datetime64. This typing means Pandas can delegate heavy number-crunching to NumPy under the hood — which is why operations on millions of rows stay fast.
The index is another superpower people underuse. By default it's just 0, 1, 2… but set it to something meaningful — a customer ID, a timestamp — and suddenly lookups, joins, and time-series resampling become trivially simple.
import pandas as pd # Build a small sales dataset manually — in real life this comes from a CSV or DB sales_data = { 'order_id': [1001, 1002, 1003, 1004], 'customer': ['Alice', 'Bob', 'Alice', 'Carol'], 'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'], 'revenue': [1200.00, 25.50, 89.99, 349.00], 'order_date': pd.to_datetime(['2024-01-15', '2024-01-17', '2024-01-17', '2024-01-20']) } df = pd.DataFrame(sales_data) # Set order_id as the index — now we can look up orders by ID, not by position df = df.set_index('order_id') print(df) print("\n--- Column dtypes ---") print(df.dtypes) print("\n--- Quick summary stats for revenue ---") print(df['revenue'].describe())
order_id
1001 Alice Laptop 1200.00 2024-01-15
1002 Bob Mouse 25.50 2024-01-17
1003 Alice Keyboard 89.99 2024-01-17
1004 Carol Monitor 349.00 2024-01-20
--- Column dtypes ---
customer object
product object
revenue float64
order_date datetime64[ns]
dtype: object
--- Quick summary stats for revenue ---
count 4.000000
mean 416.122500
std 513.744859
min 25.500000
25% 74.742500
50% 219.495000
75% 560.875000
max 1200.000000
Name: revenue, dtype: float64
df.dtypes right after loading data. If a numeric column shows object dtype, it means there's a hidden string (like '1,200' or 'N/A') lurking in the data. Catch it early — every aggregation on that column will silently fail or error later.pd.to_numeric() with errors='coerce' to surface the problem rows.Selecting Data the Right Way — loc, iloc, and When Each One Saves You
Data selection is where most intermediate developers have a messy relationship with Pandas. They mix bracket notation, loc, and iloc interchangeably until something breaks and they don't know why. Let's fix that.
loc is label-based. You give it index labels and column names — the human-readable identifiers. iloc is position-based. You give it integer row and column positions — like a zero-indexed 2D array. They look similar but they're entirely different tools.
The practical rule: use loc when you're thinking about your data ('give me Alice's orders'), and use iloc when you're thinking about structure ('give me the first 10 rows for a preview'). The bracket shorthand df['column'] is fine for single column access, but the moment you're selecting rows AND columns together, always use loc or iloc explicitly. Ambiguity here is a direct path to the dreaded SettingWithCopyWarning.
Boolean indexing — filtering with a condition — plugs into loc perfectly and is how you'll do the majority of real-world filtering.
import pandas as pd sales_data = { 'order_id': [1001, 1002, 1003, 1004, 1005], 'customer': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob'], 'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'], 'revenue': [1200.00, 25.50, 89.99, 349.00, 79.95], 'region': ['North', 'South', 'North', 'West', 'South'] } df = pd.DataFrame(sales_data).set_index('order_id') # loc: label-based — give me Alice's rows, and only show product and revenue columns alice_orders = df.loc[df['customer'] == 'Alice', ['product', 'revenue']] print("--- Alice's orders (loc with boolean mask) ---") print(alice_orders) # iloc: position-based — give me rows 0 and 1, columns 0 to 2 (slice is exclusive at end) first_two_rows_preview = df.iloc[0:2, 0:3] print("\n--- First 2 rows preview (iloc) ---") print(first_two_rows_preview) # Chained boolean conditions — high-value orders in North region high_value_north = df.loc[(df['revenue'] > 100) & (df['region'] == 'North'), :] print("\n--- High-value North orders ---") print(high_value_north) # Single value lookup by known index label — fast O(1) access order_1004_revenue = df.loc[1004, 'revenue'] print(f"\nRevenue for order 1004: £{order_1004_revenue}")
product revenue
order_id
1001 Laptop 1200.00
1003 Keyboard 89.99
--- First 2 rows preview (iloc) ---
customer product revenue
order_id
1001 Alice Laptop 1200.00
1002 Bob Mouse 25.50
--- High-value North orders ---
customer product revenue region
order_id
1001 Alice Laptop 1200.00 North
Revenue for order 1004: £349.0
df[df['region'] == 'North']['revenue'] = 0. Pandas may operate on a copy and silently discard your change. Always use df.loc[df['region'] == 'North', 'revenue'] = 0 to guarantee you're modifying the original DataFrame.GroupBy and Aggregation — Turning Raw Rows Into Business Insights
This is where DataFrames stop being a data container and start being an analytical tool. splits your DataFrame into logical buckets — by customer, by region, by month — and then you apply a calculation to each bucket. The result is a summarised DataFrame that answers real questions.groupby()
Under the hood, groupby uses a split-apply-combine strategy: it splits the data by the group key, applies your aggregation function independently to each group, then combines the results back into a single structure. Understanding this model helps you predict what the output will look like before you run it.
The real power shows up with agg(), which lets you apply multiple different functions to different columns in a single pass — no need to run five separate groupby calls. For reporting and dashboards, this pattern is indispensable.
You can also chain groupby with when you need to add a group-level statistic back as a new column on every row — for example, adding each customer's total spend alongside their individual orders.transform()
import pandas as pd orders = pd.DataFrame({ 'order_id': [1001, 1002, 1003, 1004, 1005, 1006], 'customer': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'], 'region': ['North', 'South', 'North', 'West', 'South', 'North'], 'revenue': [1200.00, 25.50, 89.99, 349.00, 79.95, 430.00], 'units': [1, 2, 1, 1, 3, 2] }) # --- Simple groupby: total revenue per customer --- revenue_by_customer = orders.groupby('customer')['revenue'].sum().reset_index() revenue_by_customer.columns = ['customer', 'total_revenue'] print("--- Revenue by customer ---") print(revenue_by_customer) # --- Multi-metric aggregation: different stats per column in one pass --- summary = orders.groupby('region').agg( total_revenue=('revenue', 'sum'), # sum of revenue per region order_count=('order_id', 'count'), # how many orders per region avg_units=('units', 'mean') # average units per order per region ).reset_index() print("\n--- Regional summary ---") print(summary) # --- transform: add each customer's total spend as a new column on every row --- # Useful for calculating what percentage of a customer's total each order represents orders['customer_total'] = orders.groupby('customer')['revenue'].transform('sum') orders['pct_of_customer_total'] = (orders['revenue'] / orders['customer_total'] * 100).round(1) print("\n--- Orders with customer share % ---") print(orders[['order_id', 'customer', 'revenue', 'customer_total', 'pct_of_customer_total']])
customer total_revenue
0 Alice 1719.99
1 Bob 105.45
2 Carol 349.00
--- Regional summary ---
region total_revenue order_count avg_units
0 North 1719.99 3 1.333333
1 South 105.45 2 2.500000
2 West 349.00 1 1.000000
--- Orders with customer share % ---
order_id customer revenue customer_total pct_of_customer_total
0 1001 Alice 1200.00 1719.99 69.8
1 1002 Bob 25.50 105.45 24.2
2 1003 Alice 89.99 1719.99 5.2
3 1004 Carol 349.00 349.00 100.0
4 1005 Bob 79.95 105.45 75.8
5 1006 Alice 430.00 1719.99 25.0
groupby().agg() and groupby().transform(). The key distinction: agg() reduces the DataFrame to one row per group. transform() keeps the original shape and broadcasts the group result back to every row — perfect for computing ratios or z-scores within groups.reset_index() after groupby leaves the group key as the index — brittle for downstream code.reset_index() immediately after agg() to keep columns clean.agg() to avoid index-dependent bugs.Merging DataFrames — The SQL Join You Already Know, In Python
Real-world data is rarely in one place. You've got an orders table, a customers table, and a products table — all separate. Merging DataFrames is how you bring them together, and it maps directly onto SQL JOINs you may already know.
is the workhorse. You specify the two DataFrames, the key column(s) to join on, and the type of join: pd.merge()inner (only matching rows), left (all rows from the left, matched where possible), right, or outer (all rows from both). Understanding which join type to use is more important than memorising the syntax.
A common real-world pattern is a left join against a reference table — you have a list of orders and want to enrich each one with customer details, keeping all orders even if a customer record is somehow missing.
After merging, always check the row count. If you expected 500 rows and got 2000, you've hit a one-to-many relationship you didn't account for — a classic data explosion caused by duplicate keys on the right-hand DataFrame.
import pandas as pd # Table 1: raw orders — just IDs, no names orders_df = pd.DataFrame({ 'order_id': [1001, 1002, 1003, 1004, 1005], 'customer_id':[201, 202, 201, 203, 204], # 204 has no customer record yet 'revenue': [1200.00, 25.50, 89.99, 349.00, 79.95] }) # Table 2: customer reference data — note customer 204 is missing customers_df = pd.DataFrame({ 'customer_id': [201, 202, 203], 'customer_name': ['Alice', 'Bob', 'Carol'], 'tier': ['Gold', 'Silver', 'Gold'] }) # Inner join: only orders where we have a matching customer record inner_merged = pd.merge(orders_df, customers_df, on='customer_id', how='inner') print("--- Inner join (drops order 1005 — no customer record) ---") print(inner_merged) # Left join: keep ALL orders, fill missing customer info with NaN left_merged = pd.merge(orders_df, customers_df, on='customer_id', how='left') print("\n--- Left join (keeps order 1005 with NaN for missing customer) ---") print(left_merged) # Practical check: did we unexpectedly gain or lose rows? print(f"\nOriginal orders: {len(orders_df)} rows") print(f"After left join: {len(left_merged)} rows") print(f"Rows with no customer match: {left_merged['customer_name'].isna().sum()}")
order_id customer_id revenue customer_name tier
0 1001 201 1200.00 Alice Gold
1 1002 202 25.50 Bob Silver
2 1003 201 89.99 Alice Gold
3 1004 203 349.00 Carol Gold
--- Left join (keeps order 1005 with NaN for missing customer) ---
order_id customer_id revenue customer_name tier
0 1001 201 1200.00 Alice Gold
1 1002 202 25.50 Bob Silver
2 1003 201 89.99 Alice Gold
3 1004 203 349.00 Carol Gold
4 1005 204 79.95 NaN NaN
Original orders: 5 rows
After left join: 5 rows
Rows with no customer match: 1
df['key_col'].value_counts() on both DataFrames and confirm you understand where duplicates exist.Handling Missing Data — The Patterns That Save Your Pipeline
Real-world data is never clean. Missing values creep in from JSON nulls, empty cells in CSVs, and incomplete database records. Pandas represents missing data as NaN (for numeric) and None (for object columns), but they behave oddly if you're not careful.
The most common mistake is trying to check df['col'] == None or df['col'] == float('nan') — both always return False because NaN is not equal to itself by IEEE 754 definition. You must use df['col'].isna() or instead.pd.isnull()
Three strategies for handling missing data: 1. Drop rows with — cruel but fast. 2. Fill with a constant or computed value using df.dropna(). Common choices: column mean, median, or a placeholder like 0 or 'Unknown'. 3. Forward or backward fill (df.fillna()method='ffill' or 'bfill') for time series — carry the last known value forward.
The right choice depends on the business context. Dropping rows with missing key identifiers is fine. Dropping 40% of your data because a single optional column has gaps is not.
import pandas as pd # Messy sales data with missing values messy = pd.DataFrame({ 'order_id': [1001, 1002, 1003, 1004, 1005], 'customer': ['Alice', 'Bob', None, 'Carol', 'Alice'], 'revenue': [1200.00, None, 89.99, 349.00, None], 'region': ['North', 'South', None, 'West', 'North'] }) print("--- Raw messy data ---") print(messy) # Detect missing print("\n--- Missing value counts ---") print(messy.isna().sum()) # Drop rows where customer is missing (critical field) cleaned = messy.dropna(subset=['customer']) print("\n--- After dropping rows with missing customer ---") print(cleaned) # Fill missing revenue with column median (reasonable imputation) median_revenue = messy['revenue'].median() messy['revenue'] = messy['revenue'].fillna(median_revenue) print(f"\n--- Filled missing revenue with median ({median_revenue}) ---") print(messy[['order_id', 'revenue']]) # Forward fill region (carry last known region forward, if ordering makes sense) messy['region'] = messy['region'].fillna(method='ffill') print("\n--- After forward filling region ---") print(messy[['order_id', 'region']])
order_id customer revenue region
0 1001 Alice 1200.00 North
1 1002 Bob NaN South
2 1003 None 89.99 None
3 1004 Carol 349.00 West
4 1005 Alice NaN North
--- Missing value counts ---
order_id 0
customer 1
revenue 2
region 1
dtype: int64
--- After dropping rows with missing customer ---
order_id customer revenue region
0 1001 Alice 1200.00 North
1 1002 Bob NaN South
3 1004 Carol 349.00 West
4 1005 Alice NaN North
--- Filled missing revenue with median (219.495) ---
order_id revenue
0 1001 1200.00
1 1002 219.49
2 1003 89.99
3 1004 349.00
4 1005 219.49
--- After forward filling region ---
order_id region
0 1001 North
1 1002 South
2 1003 South
3 1004 West
4 1005 North
df.fillna(method='ffill') implicitly uses the previous row's value. This only makes sense if rows are ordered meaningfully — never apply forward fill on unsorted data or grouped data without resetting the index.df.isna().sum() / len(df) * 100 to see percentage missing.isna() to detect missing values.| Feature / Aspect | loc (label-based) | iloc (position-based) |
|---|---|---|
| How you reference rows | By index label (e.g. 'Alice', 1004) | By integer position (e.g. 0, 1, 2) |
| How you reference columns | By column name (e.g. 'revenue') | By integer position (e.g. 0, 1, 2) |
| Slice endpoint | Inclusive — loc[0:3] gives rows 0,1,2,3 | Exclusive — iloc[0:3] gives rows 0,1,2 |
| Best use case | Filtering with conditions and named lookups | Previewing structure, position-based splits |
| Works after set_index? | Yes — uses the new index labels | Yes — always uses raw integer position |
| Risk of ambiguity | Low — explicit labels | Medium — positions shift if rows are sorted |
🎯 Key Takeaways
- A DataFrame is not just a 2D list — its labelled columns, typed dtypes, and named index are what make vectorised operations fast and readable.
- Use
locfor label-based and condition-based access; useilocfor position-based access. Never chain selections when assigning values — always uselocwith both row and column specifiers. reduces your data to one row per group;groupby().agg()keeps the original shape and is the correct tool for adding group-level statistics as new columns.groupby().transform()- After every merge, verify the row count. A row explosion caused by duplicate join keys is one of the most common and hardest-to-spot bugs in data pipelines.
- Missing data detection requires
— never useisna()==or== None. Choose drop or fill based on business context, not convenience.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QWhat is the difference between
locandilocin Pandas, and can you give a scenario where using the wrong one would produce incorrect results rather than an error?Mid-levelReveal - QExplain the split-apply-combine pattern. How does
differ fromgroupby().transform(), and when would you choose one over the other?SeniorRevealgroupby().agg() - QIf you perform a merge and the resulting DataFrame has significantly more rows than either input, what has likely gone wrong and how would you diagnose it?SeniorReveal
Frequently Asked Questions
What is the difference between a Pandas Series and a DataFrame?
A Series is a single labelled column — one-dimensional, like a list with an index. A DataFrame is a collection of Series sharing the same index — it's two-dimensional, like a table. When you select one column from a DataFrame with df['revenue'], you get a Series back.
How do I handle missing values in a Pandas DataFrame?
Use to detect them and df.isna() to count them per column. To fill them use df.isna().sum()df.fillna(value) with a constant, a column mean, or a forward-fill strategy with method='ffill'. To drop rows with any missing value use . Choose based on whether the missingness is random or meaningful to your analysis.df.dropna()
Why is my Pandas operation slow on a large DataFrame?
The most common culprits are row-by-row Python loops (use vectorised column operations instead), applying Python functions with on large datasets (use built-in Pandas or NumPy methods where possible), and operating on df.apply()object dtype columns that should be numeric or categorical. Converting high-cardinality string columns to category dtype can cut memory and speed up groupby operations dramatically.
When should I use a DataFrame vs a NumPy array?
Use a DataFrame when you need labelled columns, mixed data types, missing data handling, or SQL-like operations (groupby, merge). Use a NumPy array when you're doing pure numerical computations that don't need labels — NumPy operations are typically faster and use less memory for homogeneous numeric data.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.