Quick reference for pandas — DataFrame creation, selection, filtering, groupby, merging, missing-data handling, and common operations.
## Creating DataFrames

| Method | Example | Description |
|---|---|---|
| pd.DataFrame(dict) | pd.DataFrame({'a':[1,2],'b':[3,4]}) | From dictionary |
| pd.DataFrame(list) | pd.DataFrame([[1,2],[3,4]], columns=['a','b']) | From list of lists |
| pd.read_csv() | pd.read_csv('file.csv') | From CSV file |
| pd.read_json() | pd.read_json('file.json') | From JSON file |
| pd.read_excel() | pd.read_excel('file.xlsx') | From Excel file |
| pd.read_sql() | pd.read_sql(query, conn) | From SQL query |
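
As a sanity check on the constructors above, here is a minimal sketch (with toy data, not from the source) showing that the dict and list-of-lists forms build the same frame:

```python
import pandas as pd

# The same 2x2 frame built two ways: dict maps column -> values,
# list-of-lists gives rows and needs explicit column names.
df_from_dict = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_from_list = pd.DataFrame([[1, 3], [2, 4]], columns=['a', 'b'])
```

The readers (`read_csv`, `read_json`, `read_excel`, `read_sql`) all return a DataFrame in the same shape, so everything below applies regardless of the source.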
## Inspecting Data

| Method | Description |
|---|---|
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.shape | Tuple (rows, columns) |
| df.info() | Column types, non-null counts, memory |
| df.describe() | Summary statistics for numeric columns |
| df.dtypes | Data type of each column |
| df.columns | List of column names |
| df.index | Row index |
| df['col'].value_counts() | Count occurrences of each value in a column |
| df.nunique() | Number of unique values per column |
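
A quick sketch of the inspection calls on a small hypothetical frame:

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC'], 'age': [30, 25, 40]})

shape = df.shape                    # (rows, columns)
uniques = df['city'].nunique()      # number of distinct cities
counts = df['city'].value_counts()  # occurrences per city, descending
```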
## Selecting Data

| Method | Example | Description |
|---|---|---|
| df['col'] | df['name'] | Select single column → Series |
| df[['col1','col2']] | df[['name','age']] | Select multiple columns → DataFrame |
| df.loc[row, col] | df.loc[0, 'name'] | Label-based selection |
| df.iloc[row, col] | df.iloc[0, 1] | Integer position-based selection |
| df.loc[rows, cols] | df.loc[0:5, 'a':'c'] | Slice by label (inclusive) |
| df.iloc[rows, cols] | df.iloc[0:5, 0:3] | Slice by position (exclusive end) |
| df.at[row, col] | df.at[0, 'name'] | Fast scalar label access |
| df.iat[row, col] | df.iat[0, 1] | Fast scalar position access |
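
The inclusive/exclusive slicing difference between `loc` and `iloc` is the most common trap here; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cat'], 'age': [30, 25, 40]})

by_label = df.loc[0:1, 'name']  # label slice: INCLUSIVE end -> rows 0 and 1
by_pos = df.iloc[0:1, 0]        # position slice: EXCLUSIVE end -> row 0 only
scalar = df.at[0, 'name']       # fast scalar access by label
```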
## Filtering Rows

| Method | Example | Description |
|---|---|---|
| Boolean mask | df[df['age'] > 30] | Filter rows by condition |
| Multiple conditions | df[(df['age']>30) & (df['city']=='NYC')] | AND condition |
| OR condition | df[(df['age']<20) \| (df['age']>60)] | OR condition |
| df.query() | df.query('age > 30 and city == "NYC"') | SQL-style string filter |
| df.isin() | df[df['city'].isin(['NYC','LA'])] | Filter by list of values |
| df.between() | df[df['age'].between(20, 40)] | Filter by range (inclusive) |
| str.contains() | df[df['name'].str.contains('Ali')] | Filter by string match |
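
A sketch (sample names and cities are invented) showing that a boolean mask and `query()` express the same filter, plus the list- and range-based variants:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Ali'],
                   'age': [35, 18, 65],
                   'city': ['NYC', 'LA', 'NYC']})

masked = df[(df['age'] > 30) & (df['city'] == 'NYC')]  # boolean mask
queried = df.query('age > 30 and city == "NYC"')       # same filter as a string
name_match = df[df['name'].str.contains('Ali')]        # matches 'Alice' and 'Ali'
in_range = df[df['age'].between(18, 35)]               # bounds are inclusive
```

Note the parentheses around each condition in the mask form — `&` and `|` bind tighter than comparisons.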
## Handling Missing Data

| Method | Example | Description |
|---|---|---|
| df.isnull() | df.isnull().sum() | Count missing values per column |
| df.dropna() | df.dropna() | Drop rows with any NaN |
| df.dropna(subset) | df.dropna(subset=['age']) | Drop rows where specific col is NaN |
| df.fillna(value) | df.fillna(0) | Fill NaN with value |
| df.ffill() / df.bfill() | df['col'].ffill() | Forward / backward fill (fillna(method='ffill') is deprecated) |
| df.interpolate() | df['col'].interpolate() | Interpolate missing values |
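
A minimal sketch of the missing-data workflow on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [30.0, np.nan, 40.0],
                   'score': [1.0, 2.0, np.nan]})

missing = df.isnull().sum().sum()  # total NaN count across the frame
filled = df.fillna(0)              # replace every NaN with 0
forward = df['age'].ffill()        # carry the last valid value forward
```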
## GroupBy & Aggregation

| Method | Example | Description |
|---|---|---|
| df.groupby(col).agg() | df.groupby('city')['age'].mean() | Group and aggregate |
| Multiple aggs | df.groupby('city').agg({'age':'mean','salary':'sum'}) | Multiple aggregations |
| df.groupby().size() | df.groupby('city').size() | Count rows per group |
| df.pivot_table() | df.pivot_table(values='sales', index='city', aggfunc='sum') | Pivot table |
| pd.crosstab() | pd.crosstab(df['city'], df['gender']) | Frequency cross-table |
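
The three main groupby shapes above, sketched on invented data:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', 'LA'],
                   'age': [30, 40, 20],
                   'salary': [100, 200, 50]})

mean_age = df.groupby('city')['age'].mean()                       # one column, one agg
summary = df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})  # per-column aggs
sizes = df.groupby('city').size()                                 # rows per group
```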
## Merging & Joining

| Method | Example | Description |
|---|---|---|
| pd.merge() | pd.merge(df1, df2, on='id') | Merge on column (inner join by default) |
| left join | pd.merge(df1, df2, on='id', how='left') | Keep all left rows |
| outer join | pd.merge(df1, df2, on='id', how='outer') | Keep all rows from both |
| pd.concat() | pd.concat([df1, df2]) | Stack DataFrames vertically |
| pd.concat(axis=1) | pd.concat([df1, df2], axis=1) | Stack DataFrames horizontally |
| df.join() | df1.join(df2) | Join on index by default (pass on='col' to match a left column to df2's index) |
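
A sketch showing how the join types differ in row count — with partially overlapping toy `id` columns, inner keeps only the overlap, left keeps all left rows, and outer keeps everything:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [10, 20, 30]})

inner = pd.merge(df1, df2, on='id')               # only ids 2 and 3
left = pd.merge(df1, df2, on='id', how='left')    # all rows from df1
outer = pd.merge(df1, df2, on='id', how='outer')  # all rows from both
stacked = pd.concat([df1, df2])                   # vertical stack, no matching
```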
## Common Operations

| Method | Example | Description |
|---|---|---|
| df.sort_values() | df.sort_values('age', ascending=False) | Sort by column |
| df.rename() | df.rename(columns={'old':'new'}) | Rename columns |
| df.drop() | df.drop(columns=['col1']) | Drop column(s) |
| df.reset_index() | df.reset_index(drop=True) | Reset row index |
| df.set_index() | df.set_index('id') | Set column as index |
| df.apply() | df['col'].apply(lambda x: x*2) | Apply function to column |
| df.map() | df['col'].map({'a':1,'b':2}) | Map values using dict |
| df.astype() | df['age'].astype(float) | Cast column type |
| df.drop_duplicates() | df.drop_duplicates() | Remove duplicate rows (df.duplicated() flags them) |
| df.copy() | df2 = df.copy() | Deep copy (avoid SettingWithCopyWarning) |
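
Several of the operations above chain naturally; a sketch on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'old': [2, 1, 1]})

df = df.rename(columns={'old': 'x'})                            # rename column
df = df.sort_values('x', ascending=False).reset_index(drop=True)  # sort, clean index
deduped = df.drop_duplicates()                                  # drop repeated rows
doubled = df['x'].apply(lambda v: v * 2)                        # per-value function
```

`reset_index(drop=True)` after a sort is a common pairing — it discards the shuffled index so positional access stays intuitive.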
## Performance Tips

| Tip | Why It Helps | Example |
|---|---|---|
| Use vectorised ops over apply() | apply() runs Python loop; vectorised uses C | df['x'] * 2 not df.apply(lambda r: r['x']*2, axis=1) |
| Specify dtypes on read_csv() | Reduces memory 2–4× by avoiding object dtype | read_csv('f.csv', dtype={'id':'int32', 'flag':'bool'}) |
| Use category dtype for low-cardinality strings | Stores as integer codes — 10–100× less memory | df['status'] = df['status'].astype('category') |
| query() for filtering | Can be faster than boolean indexing on large frames (uses numexpr when installed) | df.query('age > 30 and city == "NYC"') |
| chunksize for large files | Processes file in chunks — avoids OOM | for chunk in pd.read_csv('big.csv', chunksize=100_000): |
| eval() for complex expressions | Avoids creating intermediate arrays | df.eval('C = A + B - A*B', inplace=True) |
| Avoid iterrows() / itertuples() | Row iteration is slow (itertuples() beats iterrows(), but vectorise where possible) | Use np.where(), .str methods, or apply on Series |
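
Two of these tips sketched on toy data: a vectorised multiply versus the equivalent row-wise `apply`, and the `category` dtype, which stores each string once and keeps only integer codes per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5)})

vec = df['x'] * 2                                # vectorised: one C-level op
looped = df.apply(lambda r: r['x'] * 2, axis=1)  # Python-level row loop, same result

# Low-cardinality strings as categories: values become integer codes.
status = pd.Series(['ok', 'ok', 'fail']).astype('category')
```

The two results are identical; on frames of any real size the vectorised form is orders of magnitude faster.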
## Window Functions

| Method | Purpose | Example |
|---|---|---|
| rolling(n) | Fixed-size sliding window | df['price'].rolling(7).mean() — 7-day moving avg |
| rolling(n).std() | Rolling standard deviation | Bollinger Bands, volatility |
| expanding() | Expanding window (all rows up to current) | df['sales'].expanding().sum() — cumulative total |
| ewm(span=n) | Exponentially weighted moving average | df['price'].ewm(span=12).mean() — EWMA smoothing |
| rolling().apply(fn) | Custom function over rolling window | df['x'].rolling(3).apply(lambda w: w.iloc[0]/w.iloc[-1]) |
| shift(n) | Lag / lead values | df['prev'] = df['close'].shift(1) — yesterday's close |
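
The window methods above, sketched on a tiny invented series (note `iloc` inside the rolling lambda — the window arrives as a Series, so positional access should go through `.iloc`, not `w[0]`):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

rolled = s.rolling(2).mean()      # NaN, 1.5, 2.5, 3.5 — first window incomplete
running = s.expanding().sum()     # 1, 3, 6, 10 — cumulative total
lagged = s.shift(1)               # NaN, 1, 2, 3 — previous value per row
ratio = s.rolling(2).apply(lambda w: w.iloc[0] / w.iloc[-1])  # first/last of window
```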