Pandas cheat sheet for data analysts

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Why pandas for analysts

Pandas is the default library for tabular data in Python, and on analyst interviews the Python block almost always collapses into pandas: load a CSV, filter, group, join two tables, compute a metric. The operations are the same ones you write in SQL — WHERE, GROUP BY, JOIN, OVER — just expressed as method chains on a DataFrame. The load-bearing skill is translating a SQL query into pandas in your head fast enough that a live-coding screen feels boring.

This is a reference, not a tutorial. Pandas shows up most often at Stripe, Airbnb, DoorDash and Notion, where the work happens in Jupyter and the interviewer hands you a notebook with two CSVs.

Load-bearing trick: if you can recite the SQL equivalent of every pandas idiom on this page, you will pass any live-coding screen gated on "can you use pandas". The interviewer is checking translation fluency, not exotic syntax.

Loading and inspecting data

import pandas as pd

# CSV
df = pd.read_csv('orders.csv')

# Excel
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# SQL query
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host/db')
df = pd.read_sql(
    'SELECT * FROM orders WHERE created_at >= %s',
    engine,
    params=['2025-01-01'],
)

# From a dict literal
df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'revenue': [100, 250, 80],
})

Always do a sanity pass right after loading. The five lines below catch 90% of bad inputs — wrong dtype on a date column, an unexpected null rate, a row count that does not match what the data engineer promised.

df.shape           # (rows, cols)
df.dtypes          # column types
df.head(10)        # first 10 rows
df.describe()      # numeric summary
df.info()          # dtypes + non-null counts
df.isnull().sum()  # null count per column

Method cheat sheet (pandas vs SQL)

This is the table to memorize. Read it left-to-right and right-to-left without thinking, and you are ready for the live-coding round.

Operation pandas idiom SQL equivalent
Filter rows df[df['amount'] > 1000] WHERE amount > 1000
Select columns df[['user_id', 'amount']] SELECT user_id, amount
Distinct values df['city'].unique() SELECT DISTINCT city
Sort df.sort_values('amount', ascending=False) ORDER BY amount DESC
Aggregate df.groupby('city')['amount'].sum() GROUP BY city, SUM(amount)
Having s[s > 1000] after groupby HAVING SUM(amount) > 1000
Inner join a.merge(b, on='id', how='inner') INNER JOIN
Left join a.merge(b, on='id', how='left') LEFT JOIN
Window sum df.groupby('uid')['amt'].transform('sum') SUM() OVER (PARTITION BY uid)
Running total df['x'].cumsum() SUM() OVER (ORDER BY t)
Lag / Lead df['x'].shift(1) / .shift(-1) LAG(x, 1) / LEAD(x, 1)
Rank df.groupby('g')['x'].rank(method='dense') DENSE_RANK() OVER (PARTITION BY g)
Rolling avg df['x'].rolling(7).mean() AVG() OVER (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
Pivot df.pivot_table(...) CASE WHEN inside aggregate

Print this table, tape it to the side of your monitor, and you will stop second-guessing yourself mid-interview.

Filtering: loc, iloc, boolean masks

The most common form is a boolean condition in square brackets. Parentheses around each clause are mandatory — without them, Python's operator precedence will produce silently wrong results.

big_orders = df[df['amount'] > 1000]

# Multiple conditions: & (and), | (or), ~ (not)
active_big = df[(df['amount'] > 1000) & (df['status'] == 'completed')]

# Filter by value list
target_users = df[df['city'].isin(['San Francisco', 'New York'])]

# String filter
gmail = df[df['email'].str.contains('gmail', na=False)]

loc selects by label (column name, index value). iloc selects by integer position. The classic interview question is "what's the difference?" and the answer is two lines: loc includes the right edge of a slice, iloc does not.

# loc: rows matching a condition, named columns
df.loc[df['amount'] > 1000, ['user_id', 'amount']]

# iloc: first 5 rows, columns at positions 0 and 2
df.iloc[:5, [0, 2]]

groupby and aggregation

groupby is the direct analog of GROUP BY in SQL. The cleanest form is named aggregation, which gives you readable column names without a second renaming step.

# Single aggregate
df.groupby('city')['amount'].mean()

# Multiple aggregates on one column
df.groupby('city')['amount'].agg(['mean', 'median', 'count'])

# Different aggregates on different columns — preferred form
df.groupby('city').agg(
    avg_amount=('amount', 'mean'),
    total_orders=('order_id', 'count'),
    unique_users=('user_id', 'nunique'),
)

To filter groups after aggregation (the equivalent of HAVING), aggregate first, then filter the resulting Series.

city_stats = df.groupby('city')['amount'].sum()
city_stats[city_stats > 100000]

merge: joining tables

Same four join types as SQL, same gotchas. If you have ever debugged a LEFT JOIN that exploded row count, you already know the bug to look for.

# INNER JOIN
result = orders.merge(users, on='user_id', how='inner')

# LEFT JOIN
result = orders.merge(users, on='user_id', how='left')

# OUTER JOIN
result = orders.merge(users, on='user_id', how='outer')

# Keys with different names
result = orders.merge(
    users,
    left_on='customer_id',
    right_on='id',
    how='left',
)

Sanity check: after every merge, print len(left) vs len(result). If the row count went up after a left join, the right table has duplicates on the join key. This single check prevents half of the "why are my numbers wrong" Slack threads.

print(f'before: {len(orders)}, after: {len(result)}')

Use indicator=True when debugging merges — it adds a _merge column with values left_only, right_only, both, the fastest way to see what got dropped.

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

pivot_table and melt

pivot_table reshapes long data into wide — rows become columns. melt does the inverse and is what interviewers mean when they ask you to "tidy" the data.

# Average order value by city and month
pivot = df.pivot_table(
    values='amount',
    index='city',
    columns='month',
    aggfunc='mean',
    fill_value=0,
)

# Multiple aggregates
pivot = df.pivot_table(
    values='amount',
    index='city',
    aggfunc=['mean', 'sum', 'count'],
)

# Wide → long
df_long = df.melt(
    id_vars=['city'],
    value_vars=['jan', 'feb', 'mar'],
    var_name='month',
    value_name='revenue',
)

After a melt, the row count multiplies by the number of value columns — three month columns and 100 cities become 300 long-format rows.

Working with dates

The first thing to do after loading any table with a date column is convert it. If the column stays as a string, every downstream date operation either errors or silently returns garbage.

df['date'] = pd.to_datetime(df['date'])

# Explicit format — faster and safer
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

The dt accessor exposes calendar components without any extra import.

df['year']    = df['date'].dt.year
df['month']   = df['date'].dt.month
df['weekday'] = df['date'].dt.day_name()
df['week']    = df['date'].dt.isocalendar().week

resample aggregates a time series across periods, but only works on a DatetimeIndex. If you do not want to reset the index, use pd.Grouper instead — same result, less ceremony.

df = df.set_index('date')

weekly = df['revenue'].resample('W').sum()       # daily → weekly
monthly_dau = df['dau'].resample('ME').mean()    # monthly mean

# Without setting an index
df.groupby(pd.Grouper(key='date', freq='ME'))['revenue'].sum()

Frequency aliases worth memorizing: D daily, W weekly (Sunday end), ME month-end, QE quarter-end, YE year-end.

apply, map, transform

apply runs a Python function row-by-row or column-by-column. It is powerful but slow — for anything you can express as a vectorized operation, use the vectorized form and you will get a 10x to 100x speedup on real data.

def size_category(amount):
    if amount > 5000:
        return 'large'
    elif amount > 1000:
        return 'medium'
    return 'small'

df['size'] = df['amount'].apply(size_category)

# Vectorized alternative — dramatically faster
import numpy as np
df['is_big'] = np.where(df['amount'] > 5000, True, False)

map substitutes values from a dictionary, which is the cleanest way to relabel a categorical column.

status_map = {'new': 'New', 'completed': 'Completed', 'cancelled': 'Cancelled'}
df['status_label'] = df['status'].map(status_map)

transform is the one to remember for window-style work: it returns a result the same length as the input, so you can broadcast a group aggregate back onto every row.

# Each order's share of that user's total spend
df['user_total'] = df.groupby('user_id')['amount'].transform('sum')
df['share']      = df['amount'] / df['user_total']

That last block is exactly SUM(amount) OVER (PARTITION BY user_id) in SQL.

Window-style patterns

Rolling averages, running totals, lag and rank — the same window functions you write in SQL, expressed as method calls.

# 7-day rolling mean — first 6 rows are NaN, the window isn't full yet
df['revenue_7d'] = df['revenue'].rolling(window=7).mean()

# Cumulative sum — order the DataFrame by date first, or it's nonsense
df['cumulative_users'] = df['new_users'].cumsum()

# Lag: yesterday's revenue, day-over-day delta, % change
df['prev_day']     = df['revenue'].shift(1)
df['daily_change'] = df['revenue'] - df['revenue'].shift(1)
df['pct_change']   = df['revenue'].pct_change()

# Rank revenue within each city
df['rank'] = df.groupby('city')['revenue'].rank(
    ascending=False,
    method='dense',
)

The method argument on rank chooses the dialect: 'min' is RANK(), 'dense' is DENSE_RANK(), 'first' is ROW_NUMBER(). For a refresher on the SQL side, see the SQL window functions interview questions.

Common pitfalls

The mistake that breaks the most live-coding screens is forgetting parentheses around boolean conditions. Writing df[df['amount'] > 1000 & df['status'] == 'completed'] evaluates 1000 & df['status'] first because & binds tighter than > in Python, and you get a cryptic error. Wrap each comparison in its own pair of parentheses every single time.

The second classic trap is merging on a key that is not unique in the right table. A left merge with duplicates on the right multiplies left rows by the number of matches, and the resulting numbers look plausible but are silently doubled. The fix is to always print len(left) and len(result) after a merge, and if they differ, drop duplicates on the right or aggregate before merging.

A third trap is calling apply when a vectorized method exists. apply falls back to a row-by-row Python loop and can be a hundred times slower than np.where, pd.cut, or arithmetic on the underlying NumPy arrays. Reach for apply only when there is no vectorized form available.

Fourth, chained assignment on a filtered slice: writing df[df['x'] > 0]['y'] = 1 does not modify df reliably and triggers the SettingWithCopyWarning. Use df.loc[df['x'] > 0, 'y'] = 1 instead.

Finally, operating on date columns without converting them first. A column with values like '2025-01-15' is just a string until you call pd.to_datetime on it. Sorting works lexicographically — fine for ISO dates, broken for anything else — and .dt.year raises. Convert dates as the first step after loading.

If you want to drill pandas and SQL questions like these every day, NAILDD is launching with 500+ problems on exactly this pattern — filter, group, merge, window — with worked solutions in both languages.

FAQ

Pandas or SQL — which should I learn first?

SQL, by a wide margin. It is asked on every analyst interview, most production analytics work happens in a warehouse, and the optimizer does work for you that you would have to do by hand in pandas. Pandas is the second tool, for cases where the data has already been pulled and you need a flexible transformation, a quick statistical test, or a plot. In practice an analyst uses both: SQL for the query, pandas for whatever the BI tool cannot do with the result.

Is pandas faster than SQL?

For data that fits in memory — up to a few gigabytes on a laptop — pandas and SQL are roughly comparable. For larger data the warehouse wins decisively: indexes, a query optimizer, parallel execution. Pandas loads everything into a single Python process, and once data does not fit in RAM the process dies. Snowflake, BigQuery and DuckDB keep going.

How do I remember the merge parameters without looking them up?

Memorize one canonical form: left.merge(right, on='key', how='left'). That covers the majority of real cases. When the join keys have different names on the two sides, switch to left_on and right_on. When you need every row from both sides, use how='outer'. When you want to see which rows came from where, add indicator=True. Everything else is a variation on that template.

What does transform do that apply does not?

transform returns an output the same length as the input, so the result can be assigned back as a new column without any reindexing. apply on a grouped object collapses to one row per group, which is what you want for an aggregate but not for a window-style operation. Whenever you want SUM() OVER (PARTITION BY ...) from SQL, the pandas form is df.groupby(...)['col'].transform('sum').

How do I handle missing values cleanly?

Decide policy first, code second. df.isnull().sum() gives the per-column null rate. From there, you either drop with df.dropna(subset=['col']), fill with a constant via df['col'].fillna(0), fill with a group statistic via df.groupby('seg')['col'].transform('median'), or leave them in and let aggregates skip them by default. The choice depends on whether the null means "unknown", "zero", or "not applicable" — the interviewer expects you to ask, not guess.

What pandas tasks come up most often in interviews?

Five patterns dominate. Compute D1 retention: find each user's first visit date with groupby and min, then merge with the activity table on user_id and next_day. Identify users with no orders: left-merge, filter where the right-side id is null. Compute a 7-day rolling revenue: sort by date, call rolling(7).mean(). Find the top three products per category: groupby + rank + filter. Reshape wide to long with melt. If you can write all five from a blank notebook in under ten minutes each, you are ready.