Apache Iceberg deep dive for DE interviews

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why Iceberg shows up on the DE loop

Apache Iceberg is now the default table format for serious lakehouse work at Netflix, Apple, Stripe, and Airbnb, and it has become the most common open-table-format question on Data Engineering loops in 2025-2026. If you say "we use S3 + Parquet" without naming a table format, expect the interviewer to push: who owns the metadata, how do you do an atomic overwrite, what happens if two writers commit at the same second?

The good news: the answer is almost always the same three primitives — a metadata pointer, an immutable snapshot tree, and a manifest layer that lists data files with statistics. Once you can draw that on a whiteboard, the follow-ups (partition evolution, time travel, hidden partitioning, compaction) fall out of the same model. This post walks the structure end-to-end with the phrasing senior interviewers reward.

Load-bearing trick: Iceberg is not a storage engine. It is a specification for how to lay out metadata files alongside Parquet/ORC/Avro data so that any compute engine — Spark, Trino, Flink, Snowflake, DuckDB — can read and write the same table with ACID guarantees.

Iceberg table anatomy

A real Iceberg table on object storage looks like this on disk:

events/
  metadata/
    v3.metadata.json          ← current table metadata (pointer)
    v2.metadata.json
    v1.metadata.json
    snap-481.avro             ← manifest list for snapshot 481
    snap-482.avro
    e7a2-m0.avro              ← manifest (lists data files)
    e7a2-m1.avro
  data/
    event_date_day=2026-05-17/
      00000-1-abc.parquet
      00000-2-def.parquet
    event_date_day=2026-05-18/
      00000-3-ghi.parquet

Four layers, top to bottom:

Layer File What it stores Read cost
Catalog pointer external (Glue, REST, Nessie) Current metadata.json location 1 lookup
Table metadata vN.metadata.json Schema, partition spec, snapshot list, properties 1 JSON read
Manifest list snap-N.avro Per-snapshot list of manifests + partition summaries 1 Avro read
Manifest *-m*.avro List of data files, partition values, column stats N Avro reads
Data *.parquet Actual rows Only files that pass pruning

The whole point of this layering is that a query planner can prune 99% of files using manifest statistics before opening a single Parquet footer. That is the answer when an interviewer asks "why is Iceberg faster than Hive on the same data?"

Snapshots and the commit model

Every write — INSERT, MERGE, DELETE, UPDATE, schema change, partition spec change — produces a new snapshot. Snapshots are immutable. The table's "current state" is just a pointer in the metadata file.

-- snapshot S1: empty
INSERT INTO events VALUES (...);   -- now at S2
INSERT INTO events VALUES (...);   -- now at S3
DELETE FROM events WHERE ...;      -- now at S4

A snapshot record carries:

  • The manifest list path (which in turn lists manifests, which list data files).
  • Summary statistics — added-files, deleted-files, added-records, total-records.
  • An operation type: append, overwrite, delete, replace.
  • The parent snapshot ID, forming a lineage chain.

This is why Iceberg can do optimistic-concurrency commits cheaply: a writer reads the current metadata, builds a new snapshot off it, and the catalog does a single compare-and-swap on the metadata pointer. If two writers race, the loser retries against the new parent. No table-level lock, no zookeeper, no quorum dance.

Partition evolution

In Hive, partitioning is baked into the directory layout. Change it and you rewrite the table. Iceberg decouples the logical partition spec from the physical layout, so you can evolve it in place:

-- v1: partitioned by year
ALTER TABLE events SET PARTITION SPEC (years(event_ts));

-- 6 months later, traffic grew 10x, year-grain is too coarse
ALTER TABLE events ADD PARTITION FIELD months(event_ts);

-- 6 months after that, we need daily for hot queries
ALTER TABLE events ADD PARTITION FIELD days(event_ts);

Old data keeps its original spec. New data lands in the new spec. The query planner uses the right pruning predicate per file based on which spec was active when the file was written. From the SQL side, you write WHERE event_ts >= '2026-05-01' and Iceberg figures out the rest.

Sanity check: if your interviewer asks "what's the migration cost of changing partition granularity from monthly to daily on a 50 TB table?" the answer in Iceberg is zero rewrites — only a metadata commit. In Hive it is a full table rewrite plus a downstream invalidation storm.

Hidden partitioning

The single feature that converts skeptics. In Hive, to get partition pruning you must include the partition column in the predicate, which means analysts learn to write double predicates like WHERE event_date_year = 2026 AND event_date = '2026-05-18'. Forget the year and you full-scan the table.

Iceberg stores the transform (year, month, day, hour, bucket(N, col), truncate(N, col)) and applies it at plan time:

-- Hive (manual)
SELECT count(*)
FROM events_hive
WHERE event_date_year = 2026
  AND event_date_month = 5
  AND event_date = '2026-05-18';

-- Iceberg (one predicate, planner does the rest)
SELECT count(*)
FROM events_iceberg
WHERE event_ts = '2026-05-18';

Hidden partitioning is also why Iceberg tables tend to survive analyst self-service better — there's no foot-gun column to forget.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Time travel

Because every snapshot is reachable until it expires, you can read the table as of any point in its history:

-- by snapshot ID
SELECT * FROM events VERSION AS OF 4815162342;

-- by wall-clock timestamp
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-01 12:00:00';

-- diff between two snapshots
SELECT *
FROM events.changes
WHERE start_snapshot_id = 4815162342
  AND end_snapshot_id   = 7300819284;

Four jobs it actually does in production:

  1. Audit and compliance — reproduce a regulator-requested report exactly as it ran on the original date.
  2. ML training reproducibility — pin a feature table to the snapshot used at model-fit time so retraining is bit-identical.
  3. Debugging — diff yesterday's broken numbers against the day-before to find the bad write.
  4. RollbackCALL system.rollback_to_snapshot('events', 4815162342) is one command and atomic.

Compaction and maintenance

Streaming and micro-batch writes leave behind lots of small files, which kill query latency because every Parquet open costs an S3 round-trip and a footer read. Compaction is how you fight back:

-- merge small files into target-size groups (default 512 MB)
CALL system.rewrite_data_files(
  TABLE => 'events',
  options => map('target-file-size-bytes', '536870912')
);

-- expire old snapshots so metadata doesn't bloat
CALL system.expire_snapshots(
  TABLE => 'events',
  older_than => TIMESTAMP '2026-05-11 00:00:00',
  retain_last => 5
);

-- delete files no snapshot references
CALL system.remove_orphan_files(
  TABLE => 'events',
  older_than => TIMESTAMP '2026-05-04 00:00:00'
);

A healthy production table runs all three on a schedule. Typical defaults that interviewers like to hear:

Maintenance task Cadence Typical retention Why
rewrite_data_files Daily on hot partitions n/a Keep read latency flat
rewrite_manifests Weekly n/a Keep planning time bounded
expire_snapshots Daily 7 days active, 30 days audit Cap metadata size
remove_orphan_files Weekly 3 days Reclaim storage from failed writes

Gotcha: remove_orphan_files walks the entire data prefix in object storage. On a multi-petabyte table it is expensive. Run it weekly, not hourly, and scope it to a sub-prefix when you can.

Iceberg vs Delta vs Hudi

The question every DE loop asks. Memorize the table, then pick the angle that matches the role.

Dimension Apache Iceberg Delta Lake Apache Hudi
Origin Netflix (2017), Apache TLP Databricks (2019), Linux Foundation Uber (2017), Apache TLP
Spec governance Open, vendor-neutral Open spec, Databricks-led Open, Uber-led
Default file format Parquet, ORC, Avro Parquet only Parquet, ORC
Hidden partitioning Yes (transforms in spec) No (generated columns workaround) No
Partition evolution In-place, no rewrite Requires rewrite Requires rewrite
Schema evolution Add, drop, rename, reorder, promote Add, drop, rename, reorder Add, drop, rename
Time travel Snapshot ID or timestamp Version or timestamp Commit time
Update mode Copy-on-write, merge-on-read Copy-on-write, deletion vectors CoW or MoR (first-class)
Catalog options Glue, Hive, Nessie, REST, Polaris Unity Catalog, Hive Hive, Glue
Best fit Multi-engine lakehouse Databricks-centric stack High-throughput CDC

If the company runs Snowflake, Trino, and Spark side-by-side: lead with Iceberg. If it is a Databricks shop: Delta is the path of least resistance. If the workload is CDC at >100k events/sec with tight upsert SLAs: Hudi's merge-on-read with record-level indexing is the sharper tool.

Common pitfalls

The first trap is expecting compute engines to auto-compact. Spark and Trino write Iceberg correctly but they do not run rewrite_data_files on their own. If you don't schedule maintenance, your hourly Kafka ingest will leave thousands of 1-MB files per partition per day and reads will get slower week over week. The fix is a nightly maintenance job — Airflow, Argo, or a Spark Structured Streaming job — that compacts hot partitions and expires snapshots in one pass.

A close second is letting snapshot history grow unbounded. Every snapshot keeps its manifest list reachable, and planning time scales with reachable manifests. On a table with 10,000 retained snapshots, plan time can climb from 200 ms to 8 seconds for the same query. Standard hygiene is expire_snapshots daily with a 7-day retention for active analytical tables, longer only when audit policy forces it.

The third pitfall is reading Iceberg through the wrong engine. Hive on Tez technically supports Iceberg via a connector but the experience is rough — predicate pushdown is partial and writes are painful. Iceberg's first-class engines are Spark, Trino, Flink, Snowflake, and DuckDB. If your team is moving off Hive, treat the connector as a migration bridge, not a destination.

Another quiet killer is running Iceberg without a real catalog. File-system-based "Hadoop catalog" is fine for local notebooks and CI tests, but it has no atomic compare-and-swap, which means two writers can corrupt the metadata pointer. Production tables belong in AWS Glue, a Nessie server, or a REST catalog (Polaris, Lakekeeper, Tabular-style). The catalog is what gives you ACID across engines.

Finally, watch for partition spec sprawl. Partition evolution is powerful, but if you change spec every quarter you end up with files written under five generations of specs and planning slows because the planner has to reason about all of them. A reasonable rule: evolve at most twice per year, and run a one-time rewrite to consolidate when the oldest spec drops below 5% of data volume.

If you want to drill these patterns daily with real interview prompts, NAILDD is launching with hundreds of Data Engineering questions across Iceberg, partitioning, and lakehouse design.

FAQ

How does Iceberg achieve ACID without a distributed lock?

Iceberg uses optimistic concurrency at the catalog layer. A writer reads the current metadata pointer, builds a new snapshot whose parent is that snapshot's ID, then asks the catalog to swap the pointer atomically — only if the current pointer still matches. If another writer beat them to it, the swap fails, the writer rebuilds against the new parent and retries (appends are commutative, so retries almost always succeed). The catalog only needs a single atomic compare-and-swap, which is why Glue, REST catalogs, and Nessie all work.

What's the difference between copy-on-write and merge-on-read?

Copy-on-write rewrites affected data files on every update or delete, so reads stay cheap and you pay the cost at write time. Merge-on-read emits small delete files or update files that readers reconcile on the fly, so writes are cheap and reads pay the merge cost until compaction. Pick CoW for read-heavy analytics, MoR for high-throughput CDC where write latency matters more.

Do I need Spark to use Iceberg?

No. Iceberg is a spec, not an engine. Trino, Flink, Snowflake, DuckDB, ClickHouse, StarRocks, and Dremio all read Iceberg natively. Write support is broader on Spark and Flink, but the table itself is engine-agnostic — that is the whole point. You can write with Flink, query with Trino, and BI with Snowflake against the same files.

How big should the target file size be?

The community default is 512 MB, a reasonable starting point on S3, GCS, and ABFS. Go larger (1 GB) for cold archival tables where large scans dominate. Go smaller (128-256 MB) for tables with selective point lookups, because larger files mean longer row-group reads even when you only want a few rows. Always measure.

When should I pick Iceberg over Delta Lake?

Choose Iceberg when you need multiple compute engines on the same table (Snowflake + Spark + Trino is the canonical example), when you want a vendor-neutral catalog story, or when partition evolution matters because your data shape is still moving. Choose Delta when the stack is Databricks-centric and Unity Catalog is already the source of truth.