
Pandas Basics Explained — DataFrames, Series and Real-World Data Wrangling

In Plain English 🔥
Imagine you have a massive Excel spreadsheet full of sales data — thousands of rows, dozens of columns — and you need to find every sale over $500 from the last quarter. Doing that by hand would take hours. Pandas is your supercharged assistant that can scan, sort, filter, and summarise that entire spreadsheet in one line of Python. It turns raw, messy data into answers, fast.

Every data analyst, data scientist, and backend engineer who works with data in Python eventually hits the same wall: raw data is messy, inconsistent, and massive. CSVs from databases, JSON from APIs, Excel files from clients — none of it arrives ready to use. You need a tool that lets you load, inspect, clean, and transform that data without writing hundreds of lines of boilerplate. That tool is Pandas, the workhorse at the centre of the Python data ecosystem.

Series vs DataFrame — The Two Building Blocks You Must Know Cold

Pandas is built on two core data structures: the Series and the DataFrame. Understanding what each one is — and why both exist — saves you hours of confusion later.

A Series is a one-dimensional labelled array. Think of it as a single column from a spreadsheet, where every value has an index label attached to it. That label isn't just a row number — it's a meaningful key you can look up directly, like a Python dictionary with order preserved.
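The dictionary analogy is literal: you can build a Series straight from a dict, where the keys become the index labels. A quick sketch with made-up values:

```python
import pandas as pd

# Hypothetical data: keys become index labels, values become the data
temps = pd.Series({'mon': 21.5, 'tue': 23.0, 'wed': 19.8}, name='temp_c')

print(temps['tue'])   # dict-style label lookup: 23.0
print(temps.iloc[0])  # but insertion order is preserved too: 21.5
```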

A DataFrame is a two-dimensional labelled table — a collection of Series that all share the same index. This is your spreadsheet. Rows are observations (a customer, a transaction, a sensor reading). Columns are attributes (name, amount, timestamp). Every column in a DataFrame is literally a Series under the hood.

Why does this distinction matter? Because the operations you can perform — slicing, filtering, aggregating — behave differently depending on whether you're working with a Series or a DataFrame. Knowing which one you have at any point in your code stops you from writing bugs that Python won't warn you about.

series_vs_dataframe.py · PYTHON
import pandas as pd

# --- SERIES: A single labelled column ---
# Think of this as one column of a product inventory sheet
product_prices = pd.Series(
    [29.99, 49.99, 9.99, 99.99],
    index=['notebook', 'headphones', 'pen', 'keyboard'],  # meaningful labels, not just 0,1,2,3
    name='price_usd'
)

print('=== SERIES ===')
print(product_prices)
print(f'\nType: {type(product_prices)}')
print(f'Look up headphones price: ${product_prices["headphones"]}')

# --- DATAFRAME: Multiple columns sharing the same index ---
# Build one from a dictionary — each key becomes a column
product_data = {
    'price_usd':   [29.99, 49.99, 9.99, 99.99],
    'stock_count': [150,   43,    500,  28],
    'category':    ['stationery', 'electronics', 'stationery', 'electronics']
}

product_df = pd.DataFrame(
    product_data,
    index=['notebook', 'headphones', 'pen', 'keyboard']  # same meaningful index
)

print('\n=== DATAFRAME ===')
print(product_df)
print(f'\nType: {type(product_df)}')

# A DataFrame column IS a Series
price_column = product_df['price_usd']
print(f'\nColumn type: {type(price_column)}')  # confirms it's a Series
print(f'Most expensive item: {price_column.idxmax()} at ${price_column.max()}')
▶ Output
=== SERIES ===
notebook      29.99
headphones    49.99
pen            9.99
keyboard      99.99
Name: price_usd, dtype: float64

Type: <class 'pandas.core.series.Series'>
Look up headphones price: $49.99

=== DATAFRAME ===
            price_usd  stock_count     category
notebook        29.99          150   stationery
headphones      49.99           43  electronics
pen              9.99          500   stationery
keyboard        99.99           28  electronics

Type: <class 'pandas.core.frame.DataFrame'>

Column type: <class 'pandas.core.series.Series'>
Most expensive item: keyboard at $99.99
⚠️
Pro Tip: When you select a single column with df['column_name'], you get a Series back. When you select multiple columns with df[['col1', 'col2']], you get a DataFrame back. The double brackets are not a typo — they're you passing a list to the indexer. This trips up almost every beginner at least once.
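A minimal sketch of the bracket difference, using a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})  # toy data

single = df['a']          # one column name: returns a Series
double = df[['a']]        # a LIST of names: returns a DataFrame, even with one name
pair   = df[['a', 'c']]   # list of two names: a two-column DataFrame

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(pair.shape)             # (2, 2)
```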

Loading Real Data and Actually Understanding What You Have

The first thing you do with any dataset in the real world is load it and immediately interrogate it. Not analyse it — interrogate it. How many rows? What are the columns? Are there missing values? What are the data types? Skipping this step is how bugs hide for days.

Pandas loads data from CSV, Excel, JSON, SQL databases, and more with a single function call. That's the HOW. The WHY is that these functions don't just read the file — they infer column types, parse dates, handle encoding, and give you a structured object you can immediately start querying.

After loading, your first five lines of code should always be the same: shape, dtypes, head, info, and isnull().sum(). Together, these five tell you the size of your data, what types Pandas assigned each column, a preview of the values, a summary of memory and null counts, and exactly which columns have missing data. Think of this as a health check before you operate.

One critical thing beginners miss: Pandas infers types on load. A column full of numbers that has one stray empty cell might be loaded as float64 instead of int64. A date column loaded as a plain string won't support date arithmetic until you explicitly convert it. Knowing this saves you from mysterious errors downstream.
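Both inference surprises can be seen in a few lines. This sketch uses a made-up two-row CSV; the nullable 'Int64' dtype shown in the fix is one common way to keep integers alongside missing values:

```python
import pandas as pd
import io

# Illustrative CSV (invented for this sketch): one empty count cell, no parse_dates
raw = "sensor,count,reading_date\nA,10,2024-01-01\nB,,2024-01-02\n"
df = pd.read_csv(io.StringIO(raw))

print(df['count'].dtype)         # float64: one empty cell forced the whole column to float
print(df['reading_date'].dtype)  # object: the dates are still plain strings

# Fixes: nullable 'Int64' keeps integers alongside missing values,
# and an explicit conversion unlocks real date arithmetic
df['count'] = df['count'].astype('Int64')
df['reading_date'] = pd.to_datetime(df['reading_date'])

print(df['count'].dtype)         # Int64
print(df['reading_date'].dtype)  # datetime64[ns]
```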

load_and_inspect_data.py · PYTHON
import pandas as pd
import io

# Simulating a CSV file in memory so this example is fully self-contained
# In real life you'd use: pd.read_csv('sales_data.csv')
raw_csv = """
order_id,customer_name,order_date,amount_usd,region,is_returned
1001,Alice Martin,2024-01-15,250.00,North,False
1002,Bob Chen,2024-01-17,89.50,South,False
1003,Alice Martin,2024-02-03,,North,True
1004,Dana Patel,2024-02-11,430.00,East,False
1005,Bob Chen,2024-03-05,175.25,South,True
1006,Eve Torres,2024-03-18,610.00,West,False
"""

# parse_dates tells Pandas to convert that column to datetime, not leave it as a string
sales_df = pd.read_csv(
    io.StringIO(raw_csv),
    parse_dates=['order_date']
)

# --- STEP 1: Size ---
print('=== Shape (rows, columns) ===')
print(sales_df.shape)  # (6, 6)

# --- STEP 2: Column types ---
print('\n=== Data Types ===')
print(sales_df.dtypes)

# --- STEP 3: First look at values ---
print('\n=== First 3 Rows ===')
print(sales_df.head(3))

# --- STEP 4: Full health summary (non-null counts, memory) ---
print('\n=== Info ===')
sales_df.info()

# --- STEP 5: Where are the nulls? ---
print('\n=== Missing Values Per Column ===')
print(sales_df.isnull().sum())

# Fix: fill the missing amount with the column mean (a common real-world approach)
sales_df['amount_usd'] = sales_df['amount_usd'].fillna(sales_df['amount_usd'].mean())
print('\n=== After Filling Missing Amount ===')
print(sales_df[['order_id', 'customer_name', 'amount_usd']])
▶ Output
=== Shape (rows, columns) ===
(6, 6)

=== Data Types ===
order_id                int64
customer_name          object
order_date     datetime64[ns]
amount_usd            float64
region                 object
is_returned              bool
dtype: object

=== First 3 Rows ===
   order_id customer_name order_date  amount_usd region  is_returned
0      1001  Alice Martin 2024-01-15      250.00  North        False
1      1002      Bob Chen 2024-01-17       89.50  South        False
2      1003  Alice Martin 2024-02-03         NaN  North         True

=== Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   order_id       6 non-null      int64
 1   customer_name  6 non-null      object
 2   order_date     6 non-null      datetime64[ns]
 3   amount_usd     5 non-null      float64
 4   region         6 non-null      object
 5   is_returned    6 non-null      bool
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 422.0+ bytes

=== Missing Values Per Column ===
order_id         0
customer_name    0
order_date       0
amount_usd       1
region           0
is_returned      0
dtype: int64

=== After Filling Missing Amount ===
   order_id customer_name  amount_usd
0      1001  Alice Martin      250.00
1      1002      Bob Chen       89.50
2      1003  Alice Martin      310.95
3      1004    Dana Patel      430.00
4      1005      Bob Chen      175.25
5      1006    Eve Torres      610.00
⚠️
Watch Out: If you don't pass parse_dates=['order_date'] when reading a CSV, Pandas loads date columns as plain strings (dtype: object). You'll only discover this later when date arithmetic silently fails or throws a TypeError. Always explicitly tell Pandas which columns are dates at load time — don't rely on its auto-detection.
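A sketch of the failure mode and the fix, using an illustrative two-row CSV:

```python
import pandas as pd
import io

raw = "order_date,amount\n2024-01-15,250.0\n2024-03-18,610.0\n"

# Without parse_dates, the dates load as plain strings (dtype: object)
df = pd.read_csv(io.StringIO(raw))
print(df['order_date'].dtype)  # object

# String max()/min() happen to work lexicographically, but arithmetic does not:
try:
    span = df['order_date'].max() - df['order_date'].min()
except TypeError:
    print('Subtraction failed: these are strings, not dates')

# The fix: parse at load time (parse_dates=...) or convert afterwards
df['order_date'] = pd.to_datetime(df['order_date'])
print((df['order_date'].max() - df['order_date'].min()).days)  # 63
```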

Selecting, Filtering and Slicing Data — loc vs iloc Demystified

This is where most beginners either click with Pandas or get utterly lost. There are two indexers you'll use constantly: .loc and .iloc. They look similar, they do similar things, but they operate on completely different ideas — and mixing them up causes bugs that are painful to track down.

.iloc is positional. It speaks the language of integers: row 0, row 1, row 2. It doesn't care what your index labels are — it just counts from zero, exactly like a Python list. Use .iloc when you need 'the first 3 rows' or 'the second column'.

.loc is label-based. It speaks the language of your actual index values and column names. Use .loc when you need 'the row labelled 1003' or 'the amount_usd column'. Crucially, .loc is also how you apply boolean filters — passing a True/False mask to select only rows that meet a condition.

Boolean filtering is where Pandas becomes genuinely powerful. Instead of looping through rows with a for loop (slow, verbose), you build a condition that produces a True/False Series, then pass it to .loc. Pandas vectorises this — it runs the comparison across all rows simultaneously, which is dramatically faster on large datasets.

indexing_and_filtering.py · PYTHON
import pandas as pd
import io

raw_csv = """
order_id,customer_name,order_date,amount_usd,region,is_returned
1001,Alice Martin,2024-01-15,250.00,North,False
1002,Bob Chen,2024-01-17,89.50,South,False
1003,Alice Martin,2024-02-03,310.95,North,True
1004,Dana Patel,2024-02-11,430.00,East,False
1005,Bob Chen,2024-03-05,175.25,South,True
1006,Eve Torres,2024-03-18,610.00,West,False
"""

sales_df = pd.read_csv(io.StringIO(raw_csv), parse_dates=['order_date'])

# Set order_id as the index so .loc makes semantic sense
sales_df = sales_df.set_index('order_id')

print('=== Full DataFrame ===')
print(sales_df)

# --- iloc: positional (0-based integers, like a list) ---
print('\n=== .iloc: First 2 rows, first 3 columns (positional) ===')
print(sales_df.iloc[0:2, 0:3])  # rows 0-1, columns 0-2

# --- loc: label-based (use actual index values and column names) ---
print('\n=== .loc: Row with order_id 1004, two specific columns ===')
print(sales_df.loc[1004, ['customer_name', 'amount_usd']])

# --- Boolean filtering: the real power move ---
# Step 1: build a boolean mask — this is a Series of True/False values
high_value_mask = sales_df['amount_usd'] > 200
print('\n=== Boolean mask (True where amount > $200) ===')
print(high_value_mask)

# Step 2: pass the mask to .loc to select only matching rows
high_value_orders = sales_df.loc[high_value_mask]
print('\n=== Orders over $200 ===')
print(high_value_orders)

# Combine multiple conditions with & (AND) or | (OR)
# Each condition MUST be wrapped in parentheses — Python operator precedence gotcha!
north_high_value = sales_df.loc[
    (sales_df['amount_usd'] > 200) & (sales_df['region'] == 'North')
]
print('\n=== North region orders over $200 ===')
print(north_high_value)

# Real-world pattern: not returned AND high value — the good orders
# (~ negates a boolean Series; it's the idiomatic alternative to == False)
good_orders = sales_df.loc[
    (~sales_df['is_returned']) & (sales_df['amount_usd'] > 150)
]
print('\n=== High-value, non-returned orders ===')
print(good_orders[['customer_name', 'amount_usd', 'region']])
▶ Output
=== Full DataFrame ===
         customer_name order_date  amount_usd region  is_returned
order_id
1001      Alice Martin 2024-01-15      250.00  North        False
1002          Bob Chen 2024-01-17       89.50  South        False
1003      Alice Martin 2024-02-03      310.95  North         True
1004        Dana Patel 2024-02-11      430.00   East        False
1005          Bob Chen 2024-03-05      175.25  South         True
1006        Eve Torres 2024-03-18      610.00   West        False

=== .iloc: First 2 rows, first 3 columns (positional) ===
         customer_name order_date  amount_usd
order_id
1001      Alice Martin 2024-01-15      250.00
1002          Bob Chen 2024-01-17       89.50

=== .loc: Row with order_id 1004, two specific columns ===
customer_name    Dana Patel
amount_usd            430.0
Name: 1004, dtype: object

=== Boolean mask (True where amount > $200) ===
order_id
1001     True
1002    False
1003     True
1004     True
1005    False
1006     True
Name: amount_usd, dtype: bool

=== Orders over $200 ===
         customer_name order_date  amount_usd region  is_returned
order_id
1001      Alice Martin 2024-01-15      250.00  North        False
1003      Alice Martin 2024-02-03      310.95  North         True
1004        Dana Patel 2024-02-11      430.00   East        False
1006        Eve Torres 2024-03-18      610.00   West        False

=== North region orders over $200 ===
         customer_name order_date  amount_usd region  is_returned
order_id
1001      Alice Martin 2024-01-15      250.00  North        False
1003      Alice Martin 2024-02-03      310.95  North         True

=== High-value, non-returned orders ===
         customer_name  amount_usd region
order_id
1001      Alice Martin      250.00  North
1004        Dana Patel      430.00   East
1006        Eve Torres      610.00   West
🔥
Interview Gold: Interviewers love asking 'what's the difference between loc and iloc?' The crisp answer: .loc is label-based and inclusive on both ends of a slice, while .iloc is integer position-based and exclusive on the end (like standard Python slicing). Also note: .loc accepts a boolean Series for filtering; .iloc does not — it only understands integer positions (at most a plain boolean array, never an index-aligned Series).
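The inclusive/exclusive difference fits in two lines, shown here on a throwaway Series with the default integer index:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])  # default integer index 0..4

print(len(s.loc[0:3]))   # 4: label slice, INCLUSIVE of label 3
print(len(s.iloc[0:3]))  # 3: position slice, EXCLUSIVE of position 3
```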

GroupBy and Aggregation — Turning Raw Rows Into Business Answers

Loading and filtering data is table stakes. What actually makes Pandas indispensable is groupby — the ability to split your data into groups, apply a calculation to each group, and combine the results back into one neat summary. This is the SQL GROUP BY you already know, but available directly in Python with far more flexibility.

The mental model for groupby is split-apply-combine. Pandas splits the DataFrame into subgroups based on a column's unique values, applies a function (sum, mean, count, max, or even a custom function) to each group independently, then combines all the results back into a new DataFrame. The entire pipeline runs in one chained expression.

Where this becomes genuinely powerful is when you chain multiple aggregations together with .agg(). Instead of running five separate groupby calls, you pass a dictionary mapping each column to the aggregation(s) you want. You get a multi-statistic summary in a single pass — the kind of thing that would take a non-trivial SQL query or a dozen lines of loop-based Python.
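The dictionary style of .agg() looks like this. A sketch with illustrative data; note that the result's columns come back as a MultiIndex of (column, statistic) pairs:

```python
import pandas as pd

df = pd.DataFrame({
    'region':   ['North', 'South', 'North', 'East'],
    'amount':   [250.0, 89.5, 310.95, 430.0],
    'order_id': [1, 2, 3, 4],
})

# One pass, multiple statistics: map each column to one or more aggregations
summary = df.groupby('region').agg({
    'amount':   ['sum', 'mean'],  # two stats for amount
    'order_id': 'count',          # one stat for order_id
})
print(summary)
# Columns are a MultiIndex: ('amount','sum'), ('amount','mean'), ('order_id','count')
```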

In production data pipelines, groupby is how you go from 'here is a million-row event log' to 'here is a per-customer summary table' that a reporting dashboard or machine learning feature pipeline can actually use.

groupby_aggregation.py · PYTHON
import pandas as pd
import io

raw_csv = """
order_id,customer_name,order_date,amount_usd,region,is_returned
1001,Alice Martin,2024-01-15,250.00,North,False
1002,Bob Chen,2024-01-17,89.50,South,False
1003,Alice Martin,2024-02-03,310.95,North,True
1004,Dana Patel,2024-02-11,430.00,East,False
1005,Bob Chen,2024-03-05,175.25,South,True
1006,Eve Torres,2024-03-18,610.00,West,False
1007,Alice Martin,2024-03-22,90.00,North,False
1008,Dana Patel,2024-03-30,220.00,East,False
"""

sales_df = pd.read_csv(io.StringIO(raw_csv), parse_dates=['order_date'])

# --- Basic groupby: total revenue per region ---
# split by 'region', apply sum to 'amount_usd', combine into a Series
revenue_by_region = sales_df.groupby('region')['amount_usd'].sum()
print('=== Total Revenue by Region ===')
print(revenue_by_region.sort_values(ascending=False))

# --- Multiple aggregations in one pass with .agg() ---
# For each customer: count their orders, total spend, average order, and biggest order
customer_summary = sales_df.groupby('customer_name')['amount_usd'].agg(
    order_count='count',   # how many orders
    total_spent='sum',     # lifetime value
    avg_order='mean',      # average basket size
    largest_order='max'    # biggest single purchase
).round(2)  # tidy up the decimals

print('\n=== Customer Purchase Summary ===')
print(customer_summary.sort_values('total_spent', ascending=False))

# --- Groupby with multiple columns ---
# Revenue and return rate by region
region_stats = sales_df.groupby('region').agg(
    total_revenue=('amount_usd', 'sum'),
    order_count=('order_id', 'count'),
    returns=('is_returned', 'sum')  # True counts as 1, False as 0
)
region_stats['return_rate_pct'] = (
    (region_stats['returns'] / region_stats['order_count']) * 100
).round(1)

print('\n=== Region Performance Dashboard ===')
print(region_stats.sort_values('total_revenue', ascending=False))

# --- transform: add a group-level stat back to the original DataFrame ---
# This is powerful: add each customer's total spend as a new column on every row
sales_df['customer_lifetime_value'] = sales_df.groupby('customer_name')['amount_usd'].transform('sum')
print('\n=== Original DF with Customer LTV added ===')
print(sales_df[['customer_name', 'amount_usd', 'customer_lifetime_value']].sort_values('customer_name'))
▶ Output
=== Total Revenue by Region ===
region
North    650.95
East     650.00
West     610.00
South    264.75
Name: amount_usd, dtype: float64

=== Customer Purchase Summary ===
               order_count  total_spent  avg_order  largest_order
customer_name
Alice Martin             3       650.95     216.98         310.95
Dana Patel               2       650.00     325.00         430.00
Eve Torres               1       610.00     610.00         610.00
Bob Chen                 2       264.75     132.38         175.25

=== Region Performance Dashboard ===
        total_revenue  order_count  returns  return_rate_pct
region
North          650.95            3        1             33.3
East           650.00            2        0              0.0
West           610.00            1        0              0.0
South          264.75            2        1             50.0

=== Original DF with Customer LTV added ===
  customer_name  amount_usd  customer_lifetime_value
0  Alice Martin      250.00                   650.95
2  Alice Martin      310.95                   650.95
6  Alice Martin       90.00                   650.95
1      Bob Chen       89.50                   264.75
4      Bob Chen      175.25                   264.75
3    Dana Patel      430.00                   650.00
7    Dana Patel      220.00                   650.00
5    Eve Torres      610.00                   610.00
⚠️
Pro Tip: Know the difference between .agg() and .transform(). Use .agg() when you want a summary — a smaller DataFrame with one row per group. Use .transform() when you want to broadcast a group-level result back to every row in the original DataFrame. A classic use case for .transform() is adding a 'percentage of group total' column, where each row shows its value divided by the group's sum — all in one line.
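That percentage-of-group-total pattern, sketched on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'amount': [250.0, 750.0, 100.0],
})

# .transform('sum') returns a value for EVERY row (the group total, repeated),
# so it lines up with the original column for element-wise division
df['pct_of_region'] = df['amount'] / df.groupby('region')['amount'].transform('sum') * 100

print(df)
# North rows are 25% and 75% of their region total; the lone South row is 100%
```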
Feature / Aspect          .loc (Label-based)                           .iloc (Position-based)
How it selects rows       By index label value (e.g. 1004)             By integer position (e.g. 3)
How it selects columns    By column name string                        By integer column position
Slice end behaviour       Inclusive — df.loc[0:3] gives 4 rows         Exclusive — df.iloc[0:3] gives 3 rows
Accepts boolean mask      Yes — the primary filtering tool             Not a boolean Series — raises an error
Works after set_index()   Yes — uses the new index labels              Yes — position doesn't change
Best used when            You know the label or need to filter         You need the nth row regardless of label
Risk of confusion         Index label 3 ≠ 4th row if index is custom   Position 3 is always the 4th row

🎯 Key Takeaways

  • A DataFrame is a collection of Series — every column you touch is a Series, so Series operations are DataFrame column operations.
  • Always run shape, dtypes, head, info, and isnull().sum() immediately after loading data — skipping this health check is where most analysis bugs are born.
  • .loc uses labels and is inclusive on slice ends; .iloc uses positions and is exclusive on slice ends — mixing them up is one of the most common sources of silent data bugs in Pandas code.
  • groupby().agg() collapses rows into a summary; groupby().transform() broadcasts group-level results back onto every original row — both are essential, neither is a replacement for the other.

⚠ Common Mistakes to Avoid

  • Mistake 1: Chained indexing (df['column'][0]) instead of df.loc[0, 'column'] — Symptom: a SettingWithCopyWarning that says your assignment may not work, and it silently doesn't — Fix: always use .loc[row, column] for both reading and writing values; chained indexing may operate on a copy, not the original DataFrame.
  • Mistake 2: Forgetting parentheses around boolean conditions in multi-condition filters — Symptom: a cryptic ValueError about truth value of a Series being ambiguous — Fix: wrap each condition in its own parentheses before combining with & or |, like (df['col'] > 5) & (df['other'] == 'value'); Python's operator precedence evaluates & before >, breaking bare comparisons.
  • Mistake 3: Using df['date_col'].year instead of df['date_col'].dt.year — Symptom: AttributeError: 'Series' object has no attribute 'year' — Fix: access datetime properties through the .dt accessor on a Series, like df['order_date'].dt.year, df['order_date'].dt.month_name(); the .dt accessor is your gateway to the entire datetime API on a column.
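Mistake 1 is worth seeing in code. A sketch of the safe assignment pattern on a throwaway frame (the chained version appears only as a comment, because whether it updates the original is exactly what's unpredictable):

```python
import pandas as pd

df = pd.DataFrame({'amount': [100.0, 200.0], 'region': ['North', 'South']})

# Risky:  df['amount'][0] = 150.0   <- may write to a temporary copy
#                                      (SettingWithCopyWarning, silent no-op)
# Safe: one .loc call addresses the original frame directly
df.loc[0, 'amount'] = 150.0

print(df.loc[0, 'amount'])  # 150.0
```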

Interview Questions on This Topic

  • Q: What is the difference between .loc and .iloc in Pandas, and can you give a situation where using the wrong one would silently return incorrect data?
  • Q: Explain the split-apply-combine pattern. How does groupby().agg() differ from groupby().transform(), and when would you use each?
  • Q: What is a SettingWithCopyWarning and how do you fix it? Walk me through exactly why chained indexing is dangerous.

Frequently Asked Questions

What is the difference between a Pandas Series and a DataFrame?

A Series is a one-dimensional labelled array — essentially a single column with an index. A DataFrame is two-dimensional — a table of multiple Series that all share the same index. Every column in a DataFrame is a Series, so anything you can do to a Series you can do to a DataFrame column.

Why does Pandas use .loc and .iloc instead of just regular indexing?

Regular indexing is ambiguous when your index contains integers — Python can't tell if you mean 'label 3' or 'position 3'. Pandas separates these concepts explicitly: .loc always means label, .iloc always means position. This removes ambiguity and makes your intent clear to anyone reading your code.
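The ambiguity is easy to demonstrate with a Series whose integer labels are deliberately shuffled, so label and position disagree:

```python
import pandas as pd

# Integer index labels that do NOT match positions
s = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

print(s.loc[0])   # 'b': the row LABELLED 0
print(s.iloc[0])  # 'a': the row at POSITION 0
```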

Is Pandas fast enough for large datasets?

Pandas handles datasets up to a few hundred thousand to low millions of rows comfortably on a standard laptop. Its vectorised operations run on compiled NumPy code under the hood, making them far faster than Python loops. For datasets in the tens of millions or billions of rows, you'd graduate to Polars, Dask, or Spark — but for the vast majority of real-world analytical tasks, Pandas is more than fast enough.

🔥
TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
