Pandas merge — joining DataFrames in Python

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Why merge is the question you will get

If you have ever had a data analyst loop at Stripe, Airbnb, DoorDash, or Meta, you know the pandas live-coding round is rarely about clever syntax. The interviewer drops two DataFrames in a notebook, asks you to combine them, compute a metric, and explain what you would do if half the keys are missing. Nine times out of ten the answer hinges on pd.merge.

Hiring managers lean on merge because it surfaces three skills at once. It checks whether you understand SQL semantics, because pandas merge mirrors INNER, LEFT, RIGHT, and FULL OUTER JOIN. It checks whether you spot data quality issues, because duplicate keys and type mismatches cause silent row explosions or empty results. And it checks whether you debug like a senior analyst, by inspecting shapes, dtypes, and indicator columns rather than rerunning the cell hoping for a different answer.

This guide walks through every flag that matters on real datasets, with code you can paste into a notebook and run. If you can recreate these by hand, you are ready for the live coding round.

Basic syntax and the how parameter

The minimum viable merge takes two DataFrames and one shared key column. Pandas does the rest, but the defaults bite you if you are not paying attention.

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "name": ["Ada", "Ben", "Cara", "Dan"],
})

orders = pd.DataFrame({
    "user_id": [1, 1, 3, 5],
    "amount": [500, 300, 1200, 800],
})

result = pd.merge(users, orders, on="user_id")

By default merge performs an inner join. Only rows whose key appears in both frames survive. Above, Ben and Dan have no orders, and user_id=5 has no matching user, so all three rows quietly disappear. If you do not check the shape of result you will not notice the loss, which is exactly how candidates lose points.

The how parameter controls which rows pass through. The mapping to SQL is one-to-one.

how SQL equivalent Behavior
'inner' INNER JOIN Only matching keys, default
'left' LEFT JOIN All rows from the left frame
'right' RIGHT JOIN All rows from the right frame
'outer' FULL OUTER JOIN Union of keys from both frames
'cross' CROSS JOIN Cartesian product, no key needed
# Keep every user, even those with no order
left_join = pd.merge(users, orders, on="user_id", how="left")

# Union of both populations, nothing is dropped
outer_join = pd.merge(users, orders, on="user_id", how="outer")

The most common mistake in live coding rounds is forgetting how="left" when computing a funnel or conversion metric. The candidate writes the merge, computes a denominator off the merged frame, and reports a conversion rate that silently excludes everyone who never converted. When the metric is "percent of users who did X", reach for left join first.

Keys: on, left_on, right_on

If the join column is named identically in both frames, on is the cleanest option. When the names differ, swap in left_on and right_on.

events = pd.DataFrame({
    "uid": [1, 2, 3],
    "event": ["login", "purchase", "login"],
})

joined = pd.merge(users, events, left_on="user_id", right_on="uid")

Notice that joined now carries both user_id and uid. Drop the duplicate after merge or rename one column upstream so the join column lives in one place.

Compound keys are passed as a list. This pattern shows up whenever you join sessions to events by user and date.

pd.merge(sessions, events, on=["user_id", "session_date"])

For multi-key joins, double-check the dtype of each component. A datetime joined against a string is the silent killer of merge correctness — no error, just an empty result.

suffixes and column conflicts

When both frames carry a non-key column with the same name, pandas appends _x and _y automatically. The defaults are unreadable, so override them.

users_with_source = pd.DataFrame({
    "user_id": [1, 2],
    "name": ["Ada", "Ben"],
    "source": ["organic", "paid"],
})

orders_with_source = pd.DataFrame({
    "user_id": [1, 2],
    "amount": [500, 300],
    "source": ["web", "app"],
})

result = pd.merge(
    users_with_source,
    orders_with_source,
    on="user_id",
    suffixes=("_acq", "_channel"),
)
# Columns: user_id, name, source_acq, amount, source_channel

The suffix trick is most useful when joining a table to itself, for example pairing each manager with their direct reports. Use suffixes that name the role explicitly, not generic _left and _right you will forget in 20 minutes.

indicator for debugging

indicator=True adds a categorical _merge column that records the origin of every row. It costs nothing and turns "why is my merge returning weird numbers" from a 30-minute investigation into a single value_counts call.

audit = pd.merge(users, orders, on="user_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())
# both          rows present in both inputs
# left_only     rows that exist only in the left frame
# right_only    rows that exist only in the right frame

In an interview, narrating an indicator check out loud signals seniority. You are not blindly trusting merge output, you are auditing the join. Follow up by filtering on left_only and inspecting a few rows.

Duplicates and many-to-many joins

If the key is not unique on at least one side, merge produces a Cartesian product across the matching keys. Three left rows and two right rows become six in the output, and the explosion compounds quickly.

df1 = pd.DataFrame({"key": ["a", "a"], "val1": [1, 2]})
df2 = pd.DataFrame({"key": ["a", "a"], "val2": [10, 20]})

result = pd.merge(df1, df2, on="key")
# 4 rows out of 2-row inputs. Every combination is created.

The defensive pattern is to validate uniqueness before merging. Two equally good options exist.

# Option A: explicit check
assert df1["key"].is_unique, "left key is not unique"

# Option B: validate flag, raises MergeError on violation
pd.merge(df1, df2, on="key", validate="one_to_many")

Valid options are 'one_to_one', 'one_to_many', 'many_to_one', 'many_to_many'. Use the strictest constraint that matches your data. A merge that succeeds with one_to_one today and fails six months later tells you something important about the upstream pipeline.

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

merge vs join vs concat

Pandas offers three combine operations and candidates routinely confuse them. The decision is mechanical once you map them to the data shape.

merge join concat
Combines by Column values Index values Position
Direction Horizontal Horizontal Vertical or horizontal
Typical use Adding fields by key Index-aligned add Stacking same-shape frames
Syntax pd.merge(a, b, on='k') a.join(b, on='k') pd.concat([a, b])

In analyst work, merge handles roughly 90 percent of cases. Use concat to stack monthly partitions. Reach for join only when both frames are already indexed by the join key.

Worked example: user funnel

The canonical interview question: given a registrations table and a purchases table, compute the registration-to-purchase conversion rate, broken out by acquisition source.

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "reg_date": pd.to_datetime([
        "2026-01-01", "2026-01-02", "2026-01-03",
        "2026-01-04", "2026-01-05",
    ]),
    "source": ["organic", "paid", "organic", "paid", "organic"],
})

purchases = pd.DataFrame({
    "user_id": [1, 3],
    "purchase_date": pd.to_datetime(["2026-01-05", "2026-01-10"]),
    "revenue": [990, 1490],
})

# Left merge preserves every registered user
funnel = pd.merge(users, purchases, on="user_id", how="left")

# Overall conversion
overall = funnel["revenue"].notna().mean()
print(f"Overall conversion: {overall:.0%}")  # 40%

# Conversion by source
by_source = funnel.groupby("source")["revenue"].apply(lambda s: s.notna().mean())
print(by_source)

Two details earn bonus points. The left join is non-negotiable when the metric is "share of users who did X", because an inner join would silently drop the denominator. And .notna() on revenue is the cleanest conversion flag, because it works whether the purchase amount is positive, zero, or refunded.

Common pitfalls

The default how="inner" is the trap that costs most candidates their first attempt. Pandas does not warn when rows are dropped, so a forgotten parameter silently turns a funnel from "40 percent conversion" into "100 percent conversion of the converters". Make how an explicit choice every time, and inspect result.shape against both input shapes before computing any metric.

Duplicate keys are the second classic failure mode. A merge where the key is non-unique on both sides produces row counts that grow as the product of multiplicities, and downstream sums, means, and counts all inherit the inflation. Run df["key"].duplicated().sum() on each side before merging, and add validate="one_to_one" (or the right relation) so pandas raises a MergeError instead of returning a corrupted frame.

Type mismatches on the join column cause the worst kind of failure: merge succeeds with zero matching rows and no error. The usual cause is one side storing user IDs as integers and the other as strings, often from a CSV import that inferred the wrong dtype. Run df.dtypes and cast explicitly when joining mixed sources. The datetime-versus-string variant produces the same silent empty result.

Chained merges without intermediate checks compound every preceding mistake. Join four tables in a row and the count after merge three may be 50 times the input size. Treat each merge as a checkpoint, log shapes, and materialize intermediates during exploration so you can roll back.

Optimization tips

When DataFrames grow past a million rows, naive merge starts to feel slow. The single highest-leverage change is to cast the join key to a smaller, fixed-width dtype. Joining on int32 is faster than int64, and category beats object at moderate cardinality. For high-cardinality strings, keep object but ensure both sides match — mismatched dtypes trigger a slower comparison path.

Sorting both frames by the join key before merging pays off when you reuse the join repeatedly. Combine it with set_index and join rather than merge: index-based joins on a sorted index use a merge-sort algorithm and avoid the hash table allocation that merge triggers.

For repeated joins against a small lookup table, drop the merge entirely and use map. If the right frame has a few hundred rows and you only need one column, df["new_col"] = df["key"].map(lookup.set_index("key")["value"]) is an order of magnitude faster than pd.merge.

When data does not fit in memory, switch to chunked loads via pd.read_csv(..., chunksize=), merge each chunk against the smaller frame, and concatenate. For workloads that exceed local RAM by a wide margin, push the join into the database with SQL, or use Polars or DuckDB — increasingly common at Snowflake, Databricks, and Notion.

If you want to drill pandas and SQL interview questions like this every day, NAILDD is launching with hundreds of analyst problems across exactly this pattern.

FAQ

Can I merge more than two DataFrames at once?

Not in a single call. Each pd.merge takes exactly two inputs, so three or more tables require chaining. The cleanest pattern is functools.reduce(lambda left, right: pd.merge(left, right, on="key"), [df1, df2, df3, df4]), which reads better than nested merges. Validate shape after each step during development so you catch row explosions early.

Is pandas merge faster than SQL JOIN?

Almost never. A database engine has indexes, query planners, and choice of hash, merge, or nested-loop strategies, and moving data into pandas before joining usually dwarfs the merge itself. For joins on tables over a million rows, run the join in SQL and pull the result into pandas. Reserve pandas merge for ad-hoc work, prototypes, and data that already lives in memory.

How do I do a cross join in pandas?

Use pd.merge(df1, df2, how="cross"), available in pandas 1.2 and later. It returns the Cartesian product of all rows without requiring a key. Cross joins build calendar tables, expand a sparse panel to all date-user combinations, or score every candidate against every job posting in a recommender prototype. Mind the output size — even moderate inputs produce huge results.

What happens if the merge key is null on some rows?

By default pandas does not match nulls to each other, so any row with a null key falls out of an inner merge and into the unmatched side of a left or outer merge. Decide intentionally whether nulls should be treated as a category. If they should, fill them with a sentinel value before the merge, then convert back afterward.

How do I check whether a merge dropped rows I expected to keep?

Run with indicator=True on an outer merge, then inspect the _merge column. Rows tagged left_only are in the left frame only, right_only in the right only, both matched. A senior-sounding answer is "I outer-merge with indicator, value_counts the indicator, filter to the unexpected side, and look at a sample".