Quick reference for pandas — DataFrame creation, selection, filtering, groupby, merging, missing-data handling, and common operations.
## Creating DataFrames

| Method | Example | Description |
|---|---|---|
| pd.DataFrame(dict) | pd.DataFrame({'a':[1,2],'b':[3,4]}) | From dictionary |
| pd.DataFrame(list) | pd.DataFrame([[1,2],[3,4]], columns=['a','b']) | From list of lists |
| pd.read_csv() | pd.read_csv('file.csv') | From CSV file |
| pd.read_json() | pd.read_json('file.json') | From JSON file |
| pd.read_excel() | pd.read_excel('file.xlsx') | From Excel file |
| pd.read_sql() | pd.read_sql(query, conn) | From SQL query |
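
As a sanity check on the constructors above, here is a minimal sketch (with toy data, not from the source) showing that the dict and list-of-lists forms build the same frame:

```python
import pandas as pd

# The same 2x2 frame built two ways: dict maps column -> values,
# list-of-lists gives rows and needs explicit column names.
df_from_dict = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_from_list = pd.DataFrame([[1, 3], [2, 4]], columns=['a', 'b'])
```

The readers (`read_csv`, `read_json`, `read_excel`, `read_sql`) all return a DataFrame in the same shape, so everything below applies regardless of the source.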
## Inspecting Data

| Method | Description |
|---|---|
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.shape | Tuple (rows, columns) |
| df.info() | Column types, non-null counts, memory |
| df.describe() | Summary statistics for numeric columns |
| df.dtypes | Data type of each column |
| df.columns | List of column names |
| df.index | Row index |
| df['col'].value_counts() | Count occurrences of each value in a column |
| df.nunique() | Number of unique values per column |
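
A quick sketch of the inspection calls on a small hypothetical frame:

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC'], 'age': [30, 25, 40]})

shape = df.shape                    # (rows, columns)
uniques = df['city'].nunique()      # number of distinct cities
counts = df['city'].value_counts()  # occurrences per city, descending
```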
## Selecting Data

| Method | Example | Description |
|---|---|---|
| df['col'] | df['name'] | Select single column → Series |
| df[['col1','col2']] | df[['name','age']] | Select multiple columns → DataFrame |
| df.loc[row, col] | df.loc[0, 'name'] | Label-based selection |
| df.iloc[row, col] | df.iloc[0, 1] | Integer position-based selection |
| df.loc[rows, cols] | df.loc[0:5, 'a':'c'] | Slice by label (inclusive) |
| df.iloc[rows, cols] | df.iloc[0:5, 0:3] | Slice by position (exclusive end) |
| df.at[row, col] | df.at[0, 'name'] | Fast scalar label access |
| df.iat[row, col] | df.iat[0, 1] | Fast scalar position access |
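
The inclusive/exclusive slicing difference between `loc` and `iloc` is the most common trap here; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cat'], 'age': [30, 25, 40]})

by_label = df.loc[0:1, 'name']  # label slice: INCLUSIVE end -> rows 0 and 1
by_pos = df.iloc[0:1, 0]        # position slice: EXCLUSIVE end -> row 0 only
scalar = df.at[0, 'name']       # fast scalar access by label
```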
## Filtering Rows

| Method | Example | Description |
|---|---|---|
| Boolean mask | df[df['age'] > 30] | Filter rows by condition |
| Multiple conditions | df[(df['age']>30) & (df['city']=='NYC')] | AND condition |
| OR condition | df[(df['age']<20) \| (df['age']>60)] | OR condition |
| df.query() | df.query('age > 30 and city == "NYC"') | SQL-style string filter |
| df.isin() | df[df['city'].isin(['NYC','LA'])] | Filter by list of values |
| df.between() | df[df['age'].between(20, 40)] | Filter by range (inclusive) |
| str.contains() | df[df['name'].str.contains('Ali')] | Filter by string match |
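
A sketch (sample names and cities are invented) showing that a boolean mask and `query()` express the same filter, plus the list- and range-based variants:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Ali'],
                   'age': [35, 18, 65],
                   'city': ['NYC', 'LA', 'NYC']})

masked = df[(df['age'] > 30) & (df['city'] == 'NYC')]  # boolean mask
queried = df.query('age > 30 and city == "NYC"')       # same filter as a string
name_match = df[df['name'].str.contains('Ali')]        # matches 'Alice' and 'Ali'
in_range = df[df['age'].between(18, 35)]               # bounds are inclusive
```

Note the parentheses around each condition in the mask form — `&` and `|` bind tighter than comparisons.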
## Handling Missing Data

| Method | Example | Description |
|---|---|---|
| df.isnull() | df.isnull().sum() | Count missing values per column |
| df.dropna() | df.dropna() | Drop rows with any NaN |
| df.dropna(subset) | df.dropna(subset=['age']) | Drop rows where specific col is NaN |
| df.fillna(value) | df.fillna(0) | Fill NaN with value |
| df.ffill() / df.bfill() | df['col'].ffill() | Forward / backward fill (fillna(method='ffill') is deprecated) |
| df.interpolate() | df['col'].interpolate() | Interpolate missing values |
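
A minimal sketch of the missing-data workflow on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [30.0, np.nan, 40.0],
                   'score': [1.0, 2.0, np.nan]})

missing = df.isnull().sum().sum()  # total NaN count across the frame
filled = df.fillna(0)              # replace every NaN with 0
forward = df['age'].ffill()        # carry the last valid value forward
```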
## GroupBy & Aggregation

| Method | Example | Description |
|---|---|---|
| df.groupby(col).agg() | df.groupby('city')['age'].mean() | Group and aggregate |
| Multiple aggs | df.groupby('city').agg({'age':'mean','salary':'sum'}) | Multiple aggregations |
| df.groupby().size() | df.groupby('city').size() | Count rows per group |
| df.pivot_table() | df.pivot_table(values='sales', index='city', aggfunc='sum') | Pivot table |
| pd.crosstab() | pd.crosstab(df['city'], df['gender']) | Frequency cross-table |
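
The three main groupby shapes above, sketched on invented data:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', 'LA'],
                   'age': [30, 40, 20],
                   'salary': [100, 200, 50]})

mean_age = df.groupby('city')['age'].mean()                       # one column, one agg
summary = df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})  # per-column aggs
sizes = df.groupby('city').size()                                 # rows per group
```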
## Merging & Joining

| Method | Example | Description |
|---|---|---|
| pd.merge() | pd.merge(df1, df2, on='id') | Merge on column (inner join by default) |
| left join | pd.merge(df1, df2, on='id', how='left') | Keep all left rows |
| outer join | pd.merge(df1, df2, on='id', how='outer') | Keep all rows from both |
| pd.concat() | pd.concat([df1, df2]) | Stack DataFrames vertically |
| pd.concat(axis=1) | pd.concat([df1, df2], axis=1) | Stack DataFrames horizontally |
| df.join() | df1.join(df2) | Join on index by default (pass on='col' to match a left column to df2's index) |
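
A sketch showing how the join types differ in row count — with partially overlapping toy `id` columns, inner keeps only the overlap, left keeps all left rows, and outer keeps everything:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [10, 20, 30]})

inner = pd.merge(df1, df2, on='id')               # only ids 2 and 3
left = pd.merge(df1, df2, on='id', how='left')    # all rows from df1
outer = pd.merge(df1, df2, on='id', how='outer')  # all rows from both
stacked = pd.concat([df1, df2])                   # vertical stack, no matching
```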
## Common Operations

| Method | Example | Description |
|---|---|---|
| df.sort_values() | df.sort_values('age', ascending=False) | Sort by column |
| df.rename() | df.rename(columns={'old':'new'}) | Rename columns |
| df.drop() | df.drop(columns=['col1']) | Drop column(s) |
| df.reset_index() | df.reset_index(drop=True) | Reset row index |
| df.set_index() | df.set_index('id') | Set column as index |
| df.apply() | df['col'].apply(lambda x: x*2) | Apply function to column |
| df.map() | df['col'].map({'a':1,'b':2}) | Map values using dict |
| df.astype() | df['age'].astype(float) | Cast column type |
| df.drop_duplicates() | df.drop_duplicates() | Remove duplicate rows (df.duplicated() flags them) |
| df.copy() | df2 = df.copy() | Deep copy (avoid SettingWithCopyWarning) |
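
Several of the operations above chain naturally; a sketch on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'old': [2, 1, 1]})

df = df.rename(columns={'old': 'x'})                            # rename column
df = df.sort_values('x', ascending=False).reset_index(drop=True)  # sort, clean index
deduped = df.drop_duplicates()                                  # drop repeated rows
doubled = df['x'].apply(lambda v: v * 2)                        # per-value function
```

`reset_index(drop=True)` after a sort is a common pairing — it discards the shuffled index so positional access stays intuitive.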
## Performance Tips

| Tip | Why It Helps | Example |
|---|---|---|
| Use vectorised ops over apply() | apply() runs Python loop; vectorised uses C | df['x'] * 2 not df.apply(lambda r: r['x']*2, axis=1) |
| Specify dtypes on read_csv() | Reduces memory 2–4× by avoiding object dtype | read_csv('f.csv', dtype={'id':'int32', 'flag':'bool'}) |
| Use category dtype for low-cardinality strings | Stores as integer codes — 10–100× less memory | df['status'] = df['status'].astype('category') |
| query() for filtering | Can be faster than boolean indexing on large frames (uses numexpr when installed) | df.query('age > 30 and city == "NYC"') |
| chunksize for large files | Processes file in chunks — avoids OOM | for chunk in pd.read_csv('big.csv', chunksize=100_000): |
| eval() for complex expressions | Avoids creating intermediate arrays | df.eval('C = A + B - A*B', inplace=True) |
| Avoid iterrows() / itertuples() | Row iteration is slow (itertuples() beats iterrows(), but vectorise where possible) | Use np.where(), .str methods, or apply on Series |
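
Two of these tips sketched on toy data: a vectorised multiply versus the equivalent row-wise `apply`, and the `category` dtype, which stores each string once and keeps only integer codes per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5)})

vec = df['x'] * 2                                # vectorised: one C-level op
looped = df.apply(lambda r: r['x'] * 2, axis=1)  # Python-level row loop, same result

# Low-cardinality strings as categories: values become integer codes.
status = pd.Series(['ok', 'ok', 'fail']).astype('category')
```

The two results are identical; on frames of any real size the vectorised form is orders of magnitude faster.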
## Window Functions

| Method | Purpose | Example |
|---|---|---|
| rolling(n) | Fixed-size sliding window | df['price'].rolling(7).mean() — 7-day moving avg |
| rolling(n).std() | Rolling standard deviation | Bollinger Bands, volatility |
| expanding() | Expanding window (all rows up to current) | df['sales'].expanding().sum() — cumulative total |
| ewm(span=n) | Exponentially weighted moving average | df['price'].ewm(span=12).mean() — EWMA smoothing |
| rolling().apply(fn) | Custom function over rolling window | df['x'].rolling(3).apply(lambda w: w.iloc[0]/w.iloc[-1]) |
| shift(n) | Lag / lead values | df['prev'] = df['close'].shift(1) — yesterday's close |
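
The window methods above, sketched on a tiny invented series (note `iloc` inside the rolling lambda — the window arrives as a Series, so positional access should go through `.iloc`, not `w[0]`):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

rolled = s.rolling(2).mean()      # NaN, 1.5, 2.5, 3.5 — first window incomplete
running = s.expanding().sum()     # 1, 3, 6, 10 — cumulative total
lagged = s.shift(1)               # NaN, 1, 2, 3 — previous value per row
ratio = s.rolling(2).apply(lambda w: w.iloc[0] / w.iloc[-1])  # first/last of window
```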