MERGE and UPSERT for data engineering interviews
Contents:
Why DE interviewers ask about UPSERT
UPSERT is the load-bearing primitive for idempotent ETL. If a job at Snowflake or Databricks crashes halfway through, the retry must not duplicate rows in the target. Every senior data engineering loop at Stripe, Airbnb, or DoorDash will probe this with prompts like "how would you make this insert idempotent", "when would you reach for MERGE over INSERT ON CONFLICT", or "design a concurrent UPSERT into a 5B-row dimension table". Strong candidates walk through what the warehouse actually does on conflict, how locks are taken, and where the race conditions live.
A worked story frames the stakes. A DE shipped a nightly load that inserted CDC rows from a Postgres source into a Snowflake fact table. The job crashed during a network blip, on-call rerun it, and by morning the marketing dashboard reported double the bookings. The fix took two hours of DELETE-then-rerun work plus a backfill, and the postmortem ended with "should have been a MERGE from day one". Interviewers love this scenario because it forces you to demonstrate that you understand exactly-once semantics and what the warehouse-side guarantees actually are.
The other reason this topic shows up is that UPSERT syntax differs in subtle ways across engines. Postgres has its own non-standard ON CONFLICT clause, Oracle and SQL Server have had ANSI MERGE for two decades, ClickHouse does not do row-level UPDATE at all and asks you to think in merge trees, and BigQuery has its own quirks around partition pruning.
The problem statement
Given a target table users(user_id PK, email, updated_at) and an incoming CDC batch from a streaming source like Kafka or Debezium, the requirements are easy to state and easy to get wrong. If a user_id already exists in the target, update it. If it does not exist, insert it. If the batch is retried after a partial failure, the second run must produce the same final state as a successful single run.
The naive shape — "SELECT to check existence, then INSERT or UPDATE" — is a textbook race condition. Two concurrent batches both see the row as absent, both try to INSERT, and one fails with a unique-key violation. The correct shape is an atomic UPSERT performed by the engine itself, which serializes the decision on the unique index or primary key.
Postgres: INSERT ON CONFLICT
Postgres has the most ergonomic UPSERT syntax in the industry, and most DE candidates start their answer here.
INSERT INTO users (user_id, email, updated_at)
VALUES (1, 'a@b.com', NOW())
ON CONFLICT (user_id) DO UPDATE
SET email = EXCLUDED.email,
updated_at = EXCLUDED.updated_at;EXCLUDED is a pseudo-table that holds the values you tried to insert. ON CONFLICT (user_id) references a UNIQUE index or PRIMARY KEY — Postgres will not let you target a non-unique column. There are three common variants worth knowing by heart.
-- Ignore duplicates, keep the existing row
INSERT INTO users (user_id, email, updated_at)
VALUES (1, 'a@b.com', NOW())
ON CONFLICT (user_id) DO NOTHING;-- Conditional update: only overwrite if the incoming row is newer
INSERT INTO users (user_id, email, updated_at)
VALUES (1, 'a@b.com', NOW())
ON CONFLICT (user_id) DO UPDATE
SET email = EXCLUDED.email,
updated_at = EXCLUDED.updated_at
WHERE users.updated_at < EXCLUDED.updated_at;-- ON CONFLICT on a partial unique index
INSERT INTO users (user_id, email, updated_at, deleted_at)
VALUES (1, 'a@b.com', NOW(), NULL)
ON CONFLICT (user_id) WHERE deleted_at IS NULL DO UPDATE
SET email = EXCLUDED.email;For bulk loads you batch the VALUES tuple, which is the only sane way to do this at scale.
INSERT INTO users (user_id, email, updated_at) VALUES
(1, 'a@b.com', NOW()),
(2, 'c@d.com', NOW()),
(3, 'e@f.com', NOW())
ON CONFLICT (user_id) DO UPDATE SET email = EXCLUDED.email;From Python the idiomatic path is psycopg2.extras.execute_values for batches up to ten or twenty thousand rows, and COPY into a staging table followed by INSERT ... SELECT ... ON CONFLICT for anything larger.
Standard ANSI MERGE
The SQL:2003 MERGE statement is more verbose but strictly more expressive — it can INSERT, UPDATE, and DELETE in one pass and supports compound WHEN clauses.
MERGE INTO users AS target
USING (SELECT * FROM staging_users) AS src
ON target.user_id = src.user_id
WHEN MATCHED AND src.updated_at > target.updated_at THEN
UPDATE SET email = src.email,
updated_at = src.updated_at
WHEN NOT MATCHED THEN
INSERT (user_id, email, updated_at)
VALUES (src.user_id, src.email, src.updated_at)
WHEN MATCHED AND src.is_deleted = TRUE THEN
DELETE;Support is good but not universal. Postgres added ANSI MERGE in version 15. Oracle, SQL Server, DB2 have had it forever. Snowflake, BigQuery, Redshift all support it. ClickHouse has no row-level MERGE and offers ReplacingMergeTree as the closest equivalent.
The practical interview guidance is "use ON CONFLICT in Postgres when you only need INSERT-or-UPDATE, reach for MERGE when you need INSERT + UPDATE + DELETE in one statement, and remember MERGE is the only choice in cloud warehouses".
Snowflake, BigQuery, Redshift
The Snowflake MERGE looks almost identical to the ANSI form, which is good news for portability.
MERGE INTO users t
USING staging_users s
ON t.user_id = s.user_id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT (user_id, email) VALUES (s.user_id, s.email);BigQuery MERGE has a performance gotcha worth memorizing: on a partitioned target, the MERGE can scan the entire table if the ON clause does not include the partition column. The recommended pattern is to include the partition column in the ON predicate, or to fall back to INSERT ... SELECT ... WHERE NOT EXISTS plus a partition overwrite. At Stripe and Vercel-scale tables this is the gap between a query that costs ten dollars and one that costs a thousand. Redshift supports MERGE since 2023; older code still uses two-step DELETE-then-INSERT in a transaction.
ClickHouse: ReplacingMergeTree
ClickHouse does not have row-level UPDATE or DELETE in the traditional sense. The idiomatic pattern is a ReplacingMergeTree table where duplicates by primary key are collapsed automatically during background merges.
CREATE TABLE users (
user_id UInt64,
email String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;The column inside the parentheses is the version column. When ClickHouse merges parts in the background, it keeps the row with the largest updated_at for each primary key.
Deduplication is eventual, not immediate. A SELECT right after an INSERT can see both versions because the background merge has not run yet — the most common ClickHouse surprise in interviews. OPTIMIZE TABLE ... FINAL forces the merge but is expensive and not safe in a hot read path. SELECT ... FINAL forces collapsing at read time, which is honest but slow on large scans. For high-throughput dashboards the recommended approach is argMax(email, updated_at) GROUP BY user_id, which is correct regardless of merge state. Related engines are CollapsingMergeTree for soft-deletes via a sign column and VersionedCollapsingMergeTree for versions plus deletes.
MERGE inside dbt incremental models
In a modern data stack the MERGE statement is almost always wrapped by dbt's incremental materialization, and interviewers at Notion or Linear will ask you to read the macro.
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
SELECT *
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}The four incremental_strategy values to know are merge for adapters that support ANSI MERGE (Snowflake, BigQuery, Databricks, Postgres 15+), delete+insert for adapters that do not, append for pure append-only fact tables, and insert_overwrite for partition-overwrite semantics on date-partitioned timeseries. unique_key is mandatory for merge — without it you get duplicates with no warning.
Common pitfalls
Plain INSERT inside an idempotent pipeline is the single most common bug DE candidates ship to production. The pipeline works perfectly on the first run, the retry path silently duplicates rows, and the data team discovers it only when an executive dashboard double-counts revenue. The fix is to wire ON CONFLICT or MERGE against the primary key from the very first version of the load.
Using ON CONFLICT (col1, col2) without a matching UNIQUE index is the second most common failure mode in Postgres. The engine refuses the statement with "there is no unique or exclusion constraint matching the ON CONFLICT specification". The fix is to create a compound unique index that exactly matches the conflict target, or to use a partial unique index and reference it explicitly.
Forgetting the WHERE clause on a conditional update can be catastrophic when CDC delivers events out of order. Without WHERE users.updated_at < EXCLUDED.updated_at an older event will overwrite a newer one and silently corrupt the table. The defense is to always include a monotonic guard column — updated_at, event_time, or a Kafka offset — in the conditional update.
Concurrent UPSERTs into the same row cause Postgres to serialize on the unique index, which is correct but throws deadlocks at scale when batches share keys in different orders. The trick most production loaders use is to sort each batch by unique_key before sending, which makes lock acquisition order globally consistent.
Ignoring ClickHouse eventual deduplication is the cliff every DE walks off the first time they touch the engine. The query that worked perfectly in test starts returning duplicates in production. The defense is to always read through FINAL or with an argMax group-by in dashboards, and to treat OPTIMIZE as a rare maintenance operation.
MERGE without partition pruning on BigQuery or Snowflake can be the most expensive single query a team runs all month. A naive MERGE that joins only on the surrogate key forces a full scan of both sides; adding the partition column to the ON predicate cuts cost by 99%.
Row-by-row UPSERT from application code is the slow path that still ships to production. Ten thousand round-trips of INSERT ... ON CONFLICT from a Python loop is roughly a hundred times slower than one bulk execute_values call, and a thousand times slower than COPY into a staging table followed by a single MERGE.
Optimization tips
For Postgres, swap INSERT ... ON CONFLICT for COPY into staging plus MERGE once batches exceed ten thousand rows; the staging table can be UNLOGGED to avoid WAL overhead, and the final MERGE benefits from a single transaction and consistent locking order. In Snowflake and BigQuery the dominant cost is bytes scanned, so partition pruning is the only optimization that matters at scale: date-partition the target and include the partition column in the MERGE ON clause. For ClickHouse, design the schema around the merge engine — pick a PARTITION BY matching your rebuild cadence and write dashboards as argMax aggregates so they stay correct without FINAL.
Related reading
- Explain and query plan for DE interviews
- Materialized views for DE interviews
- SQL window functions interview questions
- CTE vs temp table in SQL
- GROUP BY vs PARTITION BY
If you want to drill data engineering questions like this every day, NAILDD is launching with 500+ SQL and DE problems across exactly this pattern.
FAQ
Is MERGE atomic in a single transaction?
Yes. Inside one transaction, MERGE either fully applies all matched and not-matched branches or rolls back entirely. Concurrent MERGE statements that touch overlapping keys can still deadlock — the production defense is to sort each batch by key before sending so that lock acquisition order is consistent across workers, and to wrap the call in a bounded retry on serialization failure.
Which is faster for bulk loads in Postgres, ON CONFLICT or MERGE?
In Postgres 15+ they are essentially the same for simple insert-or-update, with ON CONFLICT slightly more compact to write and very slightly faster at small batch sizes because it has one less plan node. The decision rule is "if you also need DELETE, use MERGE; otherwise stick with ON CONFLICT". In Snowflake and BigQuery there is no ON CONFLICT at all and MERGE is the only choice.
Can I UPSERT against a partial unique index?
Yes. The Postgres syntax is ON CONFLICT (user_id) WHERE deleted_at IS NULL DO UPDATE, and the predicate after WHERE must exactly match the predicate on the partial index. This is the right pattern when soft-deleted rows coexist with new live rows under the same logical key.
Is MERGE viable for billion-row fact tables?
Sometimes. On a small dimension or SCD table it is the cleanest option. On a billion-row partitioned fact table the more efficient pattern is often INSERT OVERWRITE PARTITION — drop and rewrite the partitions touched by the batch rather than scanning the entire target. The right interview answer is "MERGE is great for moderate sizes; for very large partitioned facts I would benchmark partition overwrite as an alternative".
How do I propagate deletes from CDC through an UPSERT?
The CDC source emits an is_deleted flag on each row. In MERGE you add a WHEN MATCHED AND src.is_deleted THEN DELETE branch, collapsing insert, update, and tombstone cases into one statement. In Postgres ON CONFLICT you cannot DELETE inside the UPSERT, so the common pattern is a two-step transaction — UPSERT the live rows, then run a separate DELETE for tombstoned keys — or a soft-delete that flips an is_deleted flag on the target.
Is this content official documentation?
No. This article is based on the Postgres 15+ docs, the SQL:2003 standard, Snowflake and BigQuery vendor docs, and field experience across analytics and DE teams. Always cross-check the specific dialect against the engine's current docs before relying on edge-case behavior.