Databricks in a data engineering interview
Contents:
Why Databricks shows up in DE interviews
If you are interviewing for a data engineer role at Databricks, Snowflake, Stripe, Airbnb, or any platform team on the open lakehouse stack, expect at least one round to land on the trio of Lakehouse architecture, Delta Lake, and Unity Catalog. These are not vendor trivia — they encode the modern answer to "where do we put the data, who can read it, and how do we run ACID on object storage".
The job ladder reflects this. A mid-level data engineer in the US lakehouse space tracks around $160k base + $30k bonus + $40k equity at series-C startups and $190k base + $60k bonus + $90k equity at Databricks/Snowflake themselves (levels.fyi medians, 2026). Recruiters on LinkedIn and hiring managers on Glassdoor consistently flag Delta operations and Catalog design as the two areas where candidates either ace the round or stall.
This is focused interview prep, not a feature tour. It walks the lakehouse concepts, the Delta operations that show up in live coding, and the governance vocabulary an interviewer will probe.
Lakehouse architecture in one screen
The lakehouse pattern keeps raw data in cheap object storage (S3, ADLS, GCS) and layers ACID tables on top, so the same physical files serve both SQL analytics and ML training. Databricks popularised the medallion layout: Bronze for raw ingest, Silver for cleansed and conformed, Gold for analytics-ready marts.
Bronze (raw) → Silver (cleansed, deduped) → Gold (aggregated, BI-ready)
| | |
v v v
Delta tables on S3, queried by Spark, DBSQL, and ML notebooks.The interview prompt usually sounds like: given a stream of clickstream events at 50k rps, design the storage layout for both ad-hoc SQL and a daily training job. The correct answer threads three claims: land raw events in Bronze as append-only Delta, dedupe and apply schema in Silver with MERGE, materialise Gold aggregates with OPTIMIZE + Z-ORDER on the high-cardinality filter column.
Load-bearing trick: the lakehouse is not "warehouse on a lake". It is ACID on object storage via a transaction log. Everything else — time travel, schema evolution, Photon — descends from that one design choice. If you can defend that sentence under follow-ups, you have passed the conceptual half of the round.
Unity Catalog and governance concepts
Unity Catalog is the cross-workspace governance layer. It replaces the older Hive metastore and per-workspace ACLs with a single three-level namespace and a unified permission model. The vocabulary matters because interviewers love to ask "how would a viewer on team A query a Gold table owned by team B without copying data?".
The three-level identifier is catalog.schema.table. A catalog maps to a business domain (e.g. marketing, finance), a schema groups related tables, and the table is a Delta object backed by files in cloud storage. Permissions cascade: granting USE CATALOG on marketing plus SELECT on marketing.events.clicks is enough — there is no need to also share the underlying S3 prefix.
| Concept | What it is | Interview hook |
|---|---|---|
| Catalog | Top-level namespace, usually one per business domain | "How do you isolate prod from dev tables?" |
| Schema | Group of related tables inside a catalog | "Where does PII boundary live?" |
| Managed table | Files + metadata both owned by Unity | "What happens on DROP TABLE?" |
| External table | Metadata in Unity, files in caller-owned S3 | "How do you migrate from Glue?" |
| Volume | Governed path for non-tabular files (PDFs, models) | "Where do ML artefacts go?" |
| Lineage | Auto-captured column-level dependency graph | "How do you audit a breaking change?" |
The single biggest reason Unity Catalog wins enterprise deals is column-level lineage — interviewers from Databricks specifically will probe whether you understand it is automatic, not opt-in.
Delta Lake operations you must know
Delta Lake is the open-source ACID storage layer underneath Databricks. In an interview, you will be asked to write or read DDL/DML against it and to reason about the transaction log (the _delta_log/ directory of JSON commits). Four operations carry most of the weight.
-- Create an external Delta table on S3
CREATE TABLE marketing.events.clicks
USING DELTA
LOCATION 's3://lakehouse-prod/marketing/events/clicks'
PARTITIONED BY (event_date);
-- Upsert from a streaming Bronze source into Silver
MERGE INTO marketing.events.clicks AS target
USING bronze_clicks_staging AS src
ON target.event_id = src.event_id
WHEN MATCHED AND src.updated_at > target.updated_at
THEN UPDATE SET *
WHEN NOT MATCHED
THEN INSERT *;
-- Compact small files + co-locate by high-selectivity column
OPTIMIZE marketing.events.clicks
ZORDER BY (user_id);
-- Time travel: read the table as of yesterday
SELECT count(*)
FROM marketing.events.clicks
TIMESTAMP AS OF '2026-05-17 00:00:00';
-- Reclaim storage from tombstoned files (default retention 7 days)
VACUUM marketing.events.clicks RETAIN 168 HOURS;MERGE is the operation you must write from memory. The follow-up is always concurrency: what if two MERGE jobs race on the same partition? Delta uses optimistic concurrency — the second commit checks the log, and if the read set was modified, the writer retries. Fine for append-heavy workloads, but degrades fast under contention, which is why interviewers push on partitioning strategy.
OPTIMIZE compacts small files into ~1 GB Parquet chunks; Z-ORDER is a multi-dimensional clustering that minimises file skips on filtered reads. VACUUM physically deletes files older than the retention threshold — the default of 168 hours (7 days) exists so concurrent readers and time-travel queries do not see a vanished file mid-scan. Lowering RETAIN below 168 hours requires explicitly setting spark.databricks.delta.retentionDurationCheck.enabled = false, which interviewers expect you to flag as a footgun.
Delta vs Iceberg vs Hudi
Every Databricks DE round eventually gets here. The honest answer is "they solve the same problem with different log shapes", and a strong candidate can name two concrete differences without flinching.
| Dimension | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Origin | Databricks (2019), open since 2022 | Netflix (2018), now Apache | Uber (2016), now Apache |
| Transaction log | JSON files in _delta_log/ |
Manifest files + snapshot metadata | Timeline of action files |
| Engine first-class support | Spark, Photon, Trino, Flink | Spark, Trino, Snowflake, BigQuery, Flink | Spark, Flink, Presto |
| Catalog story | Unity Catalog (proprietary) + REST | Multiple: Glue, Nessie, Polaris, Hive | Hive, Glue |
| Streaming upserts | MERGE + change data feed |
Row-level deletes (V2 spec) | Built-in: Copy-on-Write & Merge-on-Read |
| Read pattern strength | OLAP scans, time travel | Schema evolution, partition evolution | Low-latency upserts on changing rows |
| Typical fit | Databricks-centric lakehouses | Multi-engine, vendor-neutral lakes | High-write CDC and ingest pipelines |
If the interviewer presses for a pick, the defensible framing is: Iceberg if you need engine neutrality across Snowflake, BigQuery, and Spark; Delta if you are already on Databricks and want Photon and Unity Catalog without extra glue; Hudi if upsert throughput on a slowly-changing dimension is the bottleneck. Avoid declaring a universal winner — that signals you have only used one.
Sanity check: all three formats now support row-level deletes, schema evolution, and time travel. The differentiation is ecosystem, upsert latency, and partition evolution, not the existence of ACID. Candidates who claim "Iceberg is the only one with X" usually have outdated knowledge.
DBSQL and Photon for interactive workloads
Databricks SQL (DBSQL) is the warehouse-style endpoint over the lakehouse. It runs queries through Photon, a vectorised C++ execution engine that replaces parts of the Spark SQL runtime. Interviewers ask about DBSQL when the role involves BI or self-serve analytics — the question is usually "why not just point Tableau at Snowflake?".
The honest answer: DBSQL wins when the same physical Delta tables already feed ML training and streaming, so you avoid the ETL hop into a separate warehouse. It loses when the workload is 100% short SQL with no ML side, where Snowflake or BigQuery still tend to be faster off-the-shelf. A strong candidate can articulate this tradeoff in two sentences instead of picking sides.
Photon accelerates scan, filter, aggregation, and join kernels but does not cover every Spark operator, so plans can fall back to non-Photon stages. Interviewers occasionally hand you a slow query and ask you to spot the non-Photon hop in the profile.
Common pitfalls
The most common interview-killing mistake is treating VACUUM as a routine cleanup job. Candidates lower the retention to 1 hour to save storage, then discover that long-running readers and time-travel queries break with FileNotFoundException. The fix is to leave the default 7-day retention unless you have an explicit, measured storage problem — and even then, schedule VACUUM only during low-concurrency windows.
A second trap is over-partitioning Delta tables. New engineers see PARTITIONED BY (event_date, country, user_segment) as obvious, but at typical event volumes this produces tens of thousands of tiny partitions, and OPTIMIZE plus Z-ORDER end up doing the real clustering work anyway. The interview-friendly heuristic is to partition only on columns with <10k distinct values and a strong filter predicate, and to rely on Z-ORDER for the rest.
A third pitfall is confusing Unity Catalog grants with cloud IAM. Granting SELECT on a managed table is sufficient because Unity brokers the underlying S3 access via a storage credential. For an external table on a caller-owned bucket, the user still needs the cloud-level read permission — and forgetting this is the most-reported "works on my workspace, fails in prod" symptom on Databricks community forums.
A fourth mistake is mis-using MERGE for high-throughput append workloads. MERGE is expensive — it rewrites entire files containing matched rows. If your stream is 99% inserts and 1% updates, the cheap pattern is INSERT INTO to Bronze and run a periodic MERGE from Bronze to Silver, instead of MERGE-ing on every micro-batch. Interviewers test this with a "your stream is at 100k rps and MERGE is the bottleneck" prompt.
A fifth, subtle pitfall is assuming time travel is free. Every commit retains old file versions until VACUUM clears them, so a high-churn table with daily OVERWRITE can balloon storage 10-20x. The fix is to tier retention by table — Gold marts can keep 30 days, Bronze append-only logs need only the retention floor.
Related reading
- Apache Iceberg deep dive — data engineering interview
- Iceberg time travel — data engineering interview
- BigQuery — data engineering interview
- ClickHouse MergeTree — data engineering interview
- Merge and upsert — data engineering interview
- SQL for data engineer interview
If you want to drill questions like these every day on real interview material, NAILDD is launching with a data engineering track that covers the lakehouse stack end to end.
FAQ
Do I need hands-on Databricks to pass a Databricks interview?
Helpful but not strictly required for a mid-level role. You need to write MERGE, reason about the transaction log, and explain partitioning and OPTIMIZE — all of which you can practise on the free Databricks Community Edition or open-source Delta on local Spark. For staff-plus roles, real production experience with Unity Catalog rollouts and multi-workspace governance becomes harder to fake.
Is the lakehouse just a warehouse with extra steps?
No, and saying so in an interview will hurt you. The lakehouse keeps one copy of the data in open formats (Parquet + Delta log) and serves both SQL and ML directly. A traditional warehouse forces a separate ETL into proprietary storage before BI can query it, which doubles cost and adds latency. The lakehouse argument is operational, not academic — it removes the ETL hop.
How does Delta compare to Iceberg for a brand-new project in 2026?
For a project that lives entirely inside Databricks, Delta plus Unity Catalog is the path of least resistance — Photon and the Catalog are tightly integrated. For a project that needs to be readable by Snowflake, BigQuery, and Spark with no vendor lock-in, Iceberg is the safer pick because every major engine now has first-class read and write support. Hudi remains the specialist choice when CDC upsert throughput is the dominant requirement.
What is the role of Photon exactly?
Photon is a vectorised, native (C++) execution engine that replaces parts of the Spark SQL runtime for SQL workloads on Databricks. It accelerates scans, filters, joins, and aggregations on Delta tables, and it is the engine behind DBSQL warehouses. It is not a separate query language — your SQL is unchanged. You opt in by running on a Photon-enabled cluster or DBSQL endpoint.
How is Unity Catalog different from AWS Glue or Hive Metastore?
Glue and Hive Metastore are catalog-only: they store table metadata and leave access control to the underlying IAM. Unity Catalog adds column-level lineage, row and column masking, cross-workspace governance, and a unified permission model that applies whether the consumer is Spark, DBSQL, or an ML notebook. The trade-off is that Unity is Databricks-managed — for vendor-neutral catalogs, the comparable option is the newer Polaris (Iceberg REST catalog) or Nessie.
What is a realistic interview signal for "knows Databricks well"?
The strongest signal is that you can sketch a Bronze/Silver/Gold flow on a whiteboard, write a MERGE from memory, explain the optimistic concurrency behaviour of the Delta log, and name two concrete differences between Delta and Iceberg without consulting notes.