Pandas vs SQL: one task, two solutions

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why compare them at all

If you sit a data analyst interview at Stripe, Airbnb, or DoorDash, expect at least one moment where the interviewer says: "you just solved this in SQL — now write the pandas version." It is not a trick. The two languages solve the same shape of problem — pick rows, group rows, join rows — and the interviewer wants to know you can flip between them without panic.

The catch is that pandas and SQL look similar but diverge on the gotchas. Group-by mutates the index in pandas but not in SQL. NaN does not equal NaN, just like NULL != NULL, but the filtering syntax differs. Joins on duplicated keys silently explode row counts in both languages. This cheat sheet is the one you want printed next to your monitor — operation-by-operation mappings plus the failure modes that get juniors stuck. We assume you know each language at the syntax level; the goal here is translation fluency.

Quick reference table

Operation SQL Pandas
Pick columns SELECT col1, col2 df[['col1', 'col2']]
Filter rows WHERE col > 5 df[df['col'] > 5]
Sort ORDER BY col DESC df.sort_values('col', ascending=False)
Limit rows LIMIT 10 df.head(10)
Distinct values SELECT DISTINCT col df['col'].drop_duplicates()
Count distinct COUNT(DISTINCT col) df['col'].nunique()
Group + aggregate GROUP BY col df.groupby('col').agg(...)
Post-aggregate filter HAVING COUNT(*) > 1 filter the groupby result
Join JOIN ... ON df.merge(other, on=...)
Stack tables UNION ALL pd.concat([a, b])
Conditional column CASE WHEN np.where() or np.select()
Window function SUM() OVER (PARTITION BY) groupby().transform()
Lag/lead LAG(col, 1) OVER (...) df.groupby(...).shift(1)
Top-N per group ROW_NUMBER() OVER (...) groupby().rank() or head()

This table is the load-bearing reference. The rest of the post unpacks the rows where people most often write the wrong thing under pressure.

SELECT and WHERE side-by-side

The basic projection:

SELECT user_id, revenue FROM orders;
df[['user_id', 'revenue']]

Gotcha: in pandas, df['col'] returns a Series, but df[['col']] returns a DataFrame. The difference matters when you chain .merge() or pass the result to a function expecting a frame. One bracket gives you a vector, two brackets give you a table of one column.

The filter:

SELECT * FROM orders WHERE revenue > 1000 AND status = 'paid';
df[(df['revenue'] > 1000) & (df['status'] == 'paid')]

Three things to remember. First, parentheses around each condition are mandatory — operator precedence in pandas is not what your brain expects. Second, & is AND, | is OR, ~ is NOT. Using Python's and/or here raises a ValueError. Third, the SQL IN operator becomes .isin():

df[df['city'].isin(['New York', 'Berlin', 'Tokyo'])]

Sorting is mostly the same idea, but pandas takes a parallel list for direction:

SELECT * FROM orders ORDER BY revenue DESC, created_at ASC;
df.sort_values(['revenue', 'created_at'], ascending=[False, True])

Limit and offset:

SELECT * FROM orders LIMIT 10 OFFSET 5;
df.iloc[5:15]

df.head(10) works as a LIMIT 10 shortcut, but for offset you reach for .iloc[] slicing. This trips people up because Python slicing is half-open: iloc[5:15] is rows 5 through 14, not 5 through 15.

GROUP BY, HAVING, and aggregates

The bread-and-butter analyst pattern:

SELECT city, COUNT(*) AS cnt, AVG(revenue) AS avg_rev
FROM orders
GROUP BY city;
df.groupby('city').agg(
    cnt=('revenue', 'count'),
    avg_rev=('revenue', 'mean'),
).reset_index()

The .reset_index() matters. By default, groupby puts the grouping key into the index, not a column — fine for inspection in a notebook, annoying when you go to merge the result downstream. Always reset the index unless you have a reason not to.

For HAVING, the trick is that pandas has no direct equivalent. You group first, then filter the result:

SELECT city, COUNT(*) AS cnt
FROM orders
GROUP BY city
HAVING COUNT(*) > 100;
result = df.groupby('city').size().reset_index(name='cnt')
result[result['cnt'] > 100]

The two-step pattern feels verbose, but it is also what the SQL planner does under the hood. Knowing this makes "translate this SQL query to pandas" feel mechanical rather than creative.

The aggregation functions themselves map cleanly:

SQL Pandas
COUNT(*) .size()
COUNT(col) .count() (skips nulls)
SUM(col) .sum()
AVG(col) .mean()
MIN/MAX(col) .min() / .max()
STDDEV(col) .std()
COUNT(DISTINCT col) .nunique()

Sanity check: .size() counts all rows including nulls. .count() counts non-null values per column. If your row count looks too low after a group-by, you are probably using .count() on a column that has nulls.

JOIN and UNION

The basic left join:

SELECT o.*, u.name
FROM orders o
LEFT JOIN users u ON o.user_id = u.id;
df.merge(users, left_on='user_id', right_on='id', how='left')

The how= parameter maps one-to-one with SQL join types:

SQL Pandas (how=)
INNER JOIN 'inner' (default)
LEFT JOIN 'left'
RIGHT JOIN 'right'
FULL OUTER JOIN 'outer'
CROSS JOIN 'cross'

When the join keys have the same name in both frames, use on= instead of left_on=/right_on=. Pandas also offers indicator=True, which adds a _merge column showing whether each row matched left-only, right-only, or both. This is a debugging superpower SQL does not have without subqueries — turn it on when you want to know why your row count doubled.

UNION is straightforward:

SELECT * FROM orders_2024
UNION ALL
SELECT * FROM orders_2025;
pd.concat([orders_2024, orders_2025], ignore_index=True)

UNION (without ALL) deduplicates. The pandas equivalent is pd.concat([...]).drop_duplicates(). The ignore_index=True matters: without it, you get the union of the original indexes, which is almost never what you want.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

CASE WHEN and window functions

Conditional logic in SQL:

SELECT *,
  CASE WHEN revenue > 1000 THEN 'high'
       WHEN revenue > 300  THEN 'mid'
       ELSE 'low'
  END AS segment
FROM orders;

For two branches, np.where is enough:

import numpy as np
df['segment'] = np.where(df['revenue'] > 1000, 'high', 'low')

For three or more, np.select is cleaner than nesting:

conditions = [df['revenue'] > 1000, df['revenue'] > 300]
choices = ['high', 'mid']
df['segment'] = np.select(conditions, choices, default='low')

Window functions are where the languages diverge most. The SQL form:

SELECT *,
  ROW_NUMBER() OVER (PARTITION BY city ORDER BY revenue DESC) AS rn
FROM orders;

In pandas, you sort first, then group, then rank:

df['rn'] = (
    df.sort_values('revenue', ascending=False)
      .groupby('city')['revenue']
      .rank(method='first', ascending=False)
      .astype(int)
)

Cumulative sum:

SELECT *, SUM(revenue) OVER (PARTITION BY city ORDER BY created_at) AS cum_rev
FROM orders;
df['cum_rev'] = (
    df.sort_values('created_at')
      .groupby('city')['revenue']
      .cumsum()
)

LAG and LEAD:

SELECT *,
  LAG(revenue, 1)  OVER (PARTITION BY user_id ORDER BY created_at) AS prev_rev,
  LEAD(revenue, 1) OVER (PARTITION BY user_id ORDER BY created_at) AS next_rev
FROM orders;
df = df.sort_values('created_at')
df['prev_rev'] = df.groupby('user_id')['revenue'].shift(1)
df['next_rev'] = df.groupby('user_id')['revenue'].shift(-1)

Load-bearing trick: in pandas, window functions are split into three steps — sort, group, transform. SQL hides this in one OVER (...) clause. If your pandas window result looks wrong, the sort is almost always the buggroupby preserves the input order, but only the input order.

When SQL wins, when pandas wins

Situation Pick SQL Pick pandas
Billions of rows yes — server-side no — OOM in memory
Quick exploration in a notebook slower yes — Jupyter native
Custom Python function per row painful yes — .apply()
Plot the result needs export step yes — direct to matplotlib
Data already in CSV / Excel / API round-trip to a DB yes — pd.read_*
Production ETL pipeline yes — dbt / Airflow rarely — fragile
Interview screen yes — asked 95% of the time yes — asked 60%

The rule of thumb I use: if the data lives in a warehouse and the result feeds a dashboard, write SQL. If the data lives in a CSV and the result feeds a model or a plot, write pandas. For interviews, you need both — the same problem, two solutions, ready in under two minutes each.

If you want a structured way to drill exactly this pattern — SQL question, then the pandas translation — NAILDD ships interview problems with both solutions side by side.

Common pitfalls

The first trap is index leakage after groupby. In pandas, df.groupby('city').sum() puts city into the index, not a column. If you try to merge that result back onto another frame on city, the merge silently fails because there is no city column. The fix is to call .reset_index() immediately, or pass as_index=False to groupby. SQL has no equivalent of this confusion because every column is just a column.

The second trap is NaN-versus-NULL semantics. Both languages treat missing values as not-equal-to-themselves, but the filtering syntax differs. In SQL you write WHERE col IS NULL; in pandas you write df[df['col'].isna()]. Using df[df['col'] == None] returns an empty frame because None != None in NumPy-land. This eats juniors alive in interviews because the code "looks right" and the empty result looks like real data.

The third trap is silent join explosions on duplicated keys. If you merge two frames on a key that has duplicates on both sides, the output is a cross product. A frame of 10k orders joined to 10k users with 5% duplicated keys can balloon into hundreds of thousands of rows. The fix is df.drop_duplicates(subset=['key']) on the right-hand frame, or verify with df['key'].is_unique. Warehouses usually surface a query plan hinting at the blow-up; pandas does not.

The fourth trap is forgetting to sort before window operations. df.groupby('user_id')['revenue'].cumsum() returns the cumulative sum in the order rows happen to sit in the frame, rarely the order you want. Always sort first by the column you would put in ORDER BY inside an OVER (...) clause. SQL has the sort baked into the syntax, so this bug is pandas-specific.

The fifth trap is chained assignment. Writing df[df['col'] > 5]['other'] = 1 looks correct but modifies a view that pandas discards. The fix is df.loc[df['col'] > 5, 'other'] = 1. This is the most common silent bug in pandas code that "ran without errors but did not change the data."

FAQ

Should I learn SQL or pandas first?

SQL, almost always. It is asked in roughly 95% of analyst screens, the syntax is stable across employers, and the mental model — declarative set operations — transfers cleanly to pandas later. Pandas is easier to use once you know SQL than the reverse, because pandas adds Python complexity (Series vs DataFrame, indexes, views vs copies) with no SQL equivalent. Learn SQL until you can write window functions from memory, then pick up pandas.

Is pandas slower than SQL?

For large data, yes — by orders of magnitude. SQL runs in the warehouse engine with indexes, parallel execution, and columnar storage. Pandas pulls the dataset into RAM and runs single-threaded by default. The crossover on a laptop is around 1-5 million rows. For smaller frames pandas is fast enough that the difference does not matter, and notebook iteration speed compensates.

When should I use Polars instead of pandas?

Polars is a good fit when your data is too large for pandas but too small to justify a warehouse round-trip — roughly 1-50 million rows on a laptop, or when you need parallel execution out of the box. The API is similar enough that translating pandas code takes hours. The downside is a smaller ecosystem: matplotlib, scikit-learn, and most internal tooling still assume pandas.

How do I translate a WITH (CTE) to pandas?

Treat each CTE as an intermediate variable. WITH cohort_revenue AS (...) SELECT ... FROM cohort_revenue becomes cohort_revenue = df.groupby(...).agg(...) followed by another operation on it. The pattern reads more naturally because you see each step, but you can leak memory by holding onto intermediate frames. Wrap the steps in a function so Python can garbage-collect the temporaries.

Why does my groupby result have weird column names?

When you aggregate multiple columns with multiple functions — for example, df.groupby('city').agg({'revenue': ['sum', 'mean'], 'orders': 'count'}) — pandas builds a MultiIndex on the columns. The column names become tuples like ('revenue', 'sum'). Flatten them with result.columns = ['_'.join(c) for c in result.columns], or — much cleaner — use the named-aggregation syntax: .agg(revenue_sum=('revenue', 'sum'), revenue_mean=('revenue', 'mean')). The named-aggregation form gives you a flat column index from the start.

What is the pandas equivalent of QUALIFY?

QUALIFY (Snowflake, BigQuery) filters on the result of a window function in one statement: SELECT * FROM t QUALIFY ROW_NUMBER() OVER (...) = 1. Pandas has no single-step equivalent — you compute the window column, then filter on it. Example: df['rn'] = df.groupby('city')['revenue'].rank(); df[df['rn'] == 1]. Less elegant, but more debuggable.