Apache Kafka on the DE interview

Why Kafka shows up in every DE loop

If a data engineering job description mentions event, realtime, or pipeline, the loop will have a Kafka section. Levels range from "what is a topic" at L3 to "rebalance 200 consumers without dropping the SLA" at L5+. At Stripe, DoorDash, Uber, and Netflix the bar is similar — know the log model, the partition contract, and the exactly-once story cold.

The most expensive mistake on the loop is confusing Kafka with a queue. A candidate who proposes "delete the message after the consumer reads it" has told the interviewer they have never operated Kafka. Kafka is a distributed append-only log: messages stay in the topic until retention expires, and any consumer can replay from any offset that is still retained.

Topics, partitions, replication

A topic is a named stream of events — an append-only log, not a queue. Messages are ordered inside the log and stay available to every consumer until retention deletes them.

A partition is the unit of parallelism. A topic with N partitions can be read by up to N consumers in parallel inside one group. Order is strict inside a partition and undefined across partitions. Producers route by hashing the message key: same key, same partition, strict per-key order. This is why user_id or order_id is almost always the right key for entity-keyed streams.

topic: orders (3 partitions)
partition 0: msg1, msg2, msg5, msg8
partition 1: msg3, msg6
partition 2: msg4, msg7
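
The key-to-partition rule above can be sketched in a few lines. This is a minimal illustration, not Kafka's actual partitioner: the default partitioner hashes the key bytes with murmur2, and CRC32 stands in here so the example needs only the standard library. The function names and sample events are hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Assumption: CRC32 replaces Kafka's murmur2 purely for illustration.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def route(events, num_partitions):
    """Append each (key, value) event to its partition's log, in arrival order."""
    logs = {p: [] for p in range(num_partitions)}
    for key, value in events:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs

# Hypothetical stream keyed by user id.
events = [
    ("user-1", "created"), ("user-2", "created"),
    ("user-1", "paid"), ("user-2", "cancelled"), ("user-1", "shipped"),
]
logs = route(events, num_partitions=3)
```

Whatever partition user-1 hashes to, all of its events sit in that one log in the order they were produced, which is the per-key ordering guarantee. Events for different keys may land on different partitions, where no relative order is defined.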

Replication copies each partition onto other brokers. With replication.factor=3 there is one leader and two followers; followers pull from the leader to stay in sync.

The In-Sync Replicas (ISR) set is the leader plus the followers that have not fallen behind it by more than replica.lag.time.max.ms. When the leader dies, a new leader is elected from the ISR. If no in-sync replica survives and unclean.leader.election.enable=false, the partition goes offline: consistency over availability. Flip it to true and a stale follower can become leader, at the cost of data loss. Payment teams keep it false; clickstream teams sometimes do not.
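
The election rule can be written as a small decision function. This is a simplified sketch of the behaviour described above, not the controller's real algorithm; the function name and return shape are assumptions made for illustration.

```python
def elect_leader(surviving_replicas, isr, unclean_election=False):
    """Pick a new leader after the old one fails.

    surviving_replicas: replica ids still alive, in preference order
    isr: set of surviving replica ids that were in sync with the old leader
    Returns (leader_id or None, possible_data_loss).
    """
    for replica in surviving_replicas:
        if replica in isr:
            return replica, False        # clean election from the ISR
    if unclean_election and surviving_replicas:
        return surviving_replicas[0], True   # stale follower takes over
    return None, False                   # partition stays offline
```

With replicas 2 and 3 alive and only 3 in sync, replica 3 wins. With no in-sync survivor, the result is either an offline partition or a leader that may be missing writes, depending on the flag.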

The classic loop question is "how many partitions should this topic have?" The right answer is a function of throughput and consumer parallelism: consumers in a group can never exceed partitions, and you cannot shrink partitions later without recreating the topic. A reasonable start for a mid-volume topic is 6 to 12 partitions. Over-provisioning hurts too: every partition costs file handles, replication bandwidth, and metadata.

Producer and acks

The producer writes a message and waits for an acknowledgement. The acks setting decides what counts as "written":

| acks | Who acknowledges | Durability guarantee | Latency |
| --- | --- | --- | --- |
| 0 | Nobody (fire-and-forget) | Data can be lost on any failure | Lowest |
| 1 | Leader only | Lost if leader dies before replicating | Medium |
| all (-1) | All ISR members | Safe while ISR >= min.insync.replicas | Highest |

For payments, orders, and anything an auditor will see, acks=all with min.insync.replicas=2 at replication.factor=3 is the standard combination. The broker acknowledges the write only after every current ISR member has it (replicated, though not necessarily flushed to disk), and rejects the write if fewer than two replicas are in sync. If only one ISR member remains, the producer gets a NotEnoughReplicas error instead of silently writing to a single replica, which is exactly the failure mode you want for money.

Load-bearing trick: acks=all alone is not durable. Pair it with min.insync.replicas=2 and replication.factor=3, otherwise a single surviving replica is enough to ack and a subsequent crash loses the write.
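
The interaction between acks and min.insync.replicas is easier to see as a decision table in code. The sketch below is a simplified model of what a producer observes for one write; the function name and the returned strings are hypothetical labels, not broker responses.

```python
def broker_response(acks, isr_size, min_insync_replicas):
    """Return a label for what the producer sees on one write.

    acks: "0", "1", or "all"
    isr_size: current number of in-sync replicas, leader included
    """
    if acks == "0":
        return "no-ack"                       # producer does not wait
    if acks == "1":
        return "ack-after-leader"             # followers may still lag
    if acks == "all":
        if isr_size < min_insync_replicas:
            return "error-not-enough-replicas"    # rejected, not under-replicated
        return f"ack-after-{isr_size}-replicas"   # every current ISR member has it
    raise ValueError(f"unknown acks value: {acks!r}")
```

With min.insync.replicas=1, `broker_response("all", 1, 1)` still acknowledges on a lone replica, which is the gap the paragraph above warns about. Raising the floor to 2 turns the same situation into an error.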

The idempotent producer (enable.idempotence=true, default since 3.0) prevents the same message from being written twice to the same partition during retries. The producer attaches a sequence number; the broker deduplicates. The transactional producer goes further: atomic writes across multiple partitions and topics, gated by transactional.id. It is the foundation for exactly-once inside Kafka Streams and inside connectors that need to commit offsets and output records together.
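
The sequence-number deduplication can be modelled with a toy partition log. This is a deliberately simplified sketch: a real broker tracks sequences per producer id and epoch at the batch level, and the class and method names here are hypothetical.

```python
class PartitionLog:
    """Toy partition that drops retried writes by sequence number."""

    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> last sequence number appended

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"      # retry of a write that already landed
        if seq != last + 1:
            raise ValueError("out-of-order sequence")  # a gap means a missing write
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"
```

If the acknowledgement for sequence 0 is lost and the producer retries, the second `append("p1", 0, ...)` returns "duplicate" and the log still holds one copy.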

Consumer groups and offsets

A consumer group is a set of consumers that divide a topic's partitions among themselves. Each partition is owned by exactly one consumer in the group at a time.

topic orders (6 partitions), group A (3 consumers):
consumer-1 -> p0, p1
consumer-2 -> p2, p3
consumer-3 -> p4, p5

More consumers than partitions and the surplus sits idle. Fewer and some consumers pull multiple partitions. The cap on horizontal scale is the partition count, which is why early under-partitioning is painful.
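
A rough sketch of how partitions spread over a group, including the idle-consumer case. It mimics a range-style split for a single topic and is not the code of any built-in assignor; the function name is an assumption.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers in contiguous ranges.

    Consumers beyond the partition count receive an empty list and sit idle.
    """
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)   # first `extra` consumers get one more
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment
```

Six partitions over three consumers gives two each, matching the layout above. Six partitions over eight consumers leaves two consumers with nothing to read.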

An offset is the consumer's position in a partition, stored in the internal __consumer_offsets topic. There are two strategies. Auto-commit (enable.auto.commit=true) commits every auto.commit.interval.ms (5 seconds by default), which is dangerous for stateful work: the offset can advance before processing completes, and a crash in that window loses messages. Manual commit via commitSync() after the side effect is slower, but the offset reflects work actually done.

A rebalance redistributes partitions when a consumer joins, leaves, or dies. During a stop-the-world rebalance, every consumer in the group pauses. On groups of 50+ this is painful — five-second pauses are common. Two mitigations: cooperative rebalancing via partition.assignment.strategy=CooperativeStickyAssignor reassigns incrementally so most consumers keep working, and static membership with group.instance.id lets a briefly-dropped consumer rejoin without triggering a rebalance, up to session.timeout.ms.
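
For reference, the two mitigations might look like this as consumer settings. The property names are the Java client's; the group name, instance id, and timeout value are illustrative assumptions, and other clients may spell the assignor differently. The small helper only renders the dict in .properties form.

```python
consumer_config = {
    "group.id": "orders-processor",                 # hypothetical group name
    "group.instance.id": "orders-processor-0",      # hypothetical; stable and unique per instance
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    "session.timeout.ms": "45000",                  # illustrative value
    "enable.auto.commit": "false",
}

def to_properties(config):
    """Render a config dict as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))
```

Each instance in the group needs its own group.instance.id that survives restarts, for example one derived from a pod ordinal.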

Delivery semantics

There are three delivery guarantees and the loop will ask about all three.

At-most-once means the message arrives zero or one times — no duplicates, but losses are possible. This happens when the offset is committed before the work succeeds.

At-least-once means the message arrives one or more times — no losses, but duplicates are possible. This is the default in most pipelines: commit the offset after the side effect, and make the downstream idempotent. Idempotent sinks are the real answer to most exactly-once questions.
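
The difference between the first two guarantees comes down to whether the commit happens before or after the side effect. The simulation below is a minimal sketch with a single injected crash; `run` and its parameters are hypothetical names.

```python
def run(messages, commit_first, crash_at):
    """Process messages with one crash, then restart from the committed offset.

    commit_first=True commits before the side effect (at-most-once);
    commit_first=False commits after it (at-least-once).
    crash_at is the offset at which the consumer dies between the two steps.
    """
    committed, sink = 0, []

    def consume(start, crash_offset):
        nonlocal committed
        for offset in range(start, len(messages)):
            if commit_first:
                committed = offset + 1
                if offset == crash_offset:
                    return        # died after commit, before the side effect
                sink.append(messages[offset])
            else:
                sink.append(messages[offset])
                if offset == crash_offset:
                    return        # died after the side effect, before commit
                committed = offset + 1

    consume(0, crash_at)          # first run, crashes midway
    consume(committed, None)      # restart from the last committed offset
    return sink
```

For messages a, b, c with a crash at offset 1, committing first yields a, c (b is lost), while committing last yields a, b, b, c (b is duplicated). The duplicate is what an idempotent sink absorbs.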

Exactly-once means every message lands once, no duplicates and no losses. Inside Kafka this requires three pieces: the idempotent producer (no in-partition duplicates), the transactional producer with a read_committed consumer (atomic multi-partition writes), and Kafka Streams set to processing.guarantee=exactly_once_v2. External systems like Postgres or ClickHouse do not get exactly-once for free — you either run a distributed transaction (rare and slow) or you write idempotent UPSERTs keyed by business id, which is functionally equivalent.
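
An idempotent UPSERT keyed by business id can be sketched as follows. SQLite from the standard library stands in for Postgres here (the ON CONFLICT clause has the same shape in both, and needs SQLite 3.24 or newer); the table and column names are hypothetical.

```python
import sqlite3

def make_sink():
    """Create an in-memory table keyed by the business id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id TEXT PRIMARY KEY, status TEXT, amount_cents INTEGER)"
    )
    return conn

def apply_event(conn, event):
    """Write one event as an UPSERT so replays overwrite rather than duplicate."""
    conn.execute(
        """
        INSERT INTO orders (order_id, status, amount_cents)
        VALUES (:order_id, :status, :amount_cents)
        ON CONFLICT(order_id) DO UPDATE SET
            status = excluded.status,
            amount_cents = excluded.amount_cents
        """,
        event,
    )
    conn.commit()
```

Applying the same event twice, as happens when a consumer reprocesses after a crash, leaves exactly one row for that order_id.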

Kafka vs Kinesis vs Pulsar

System-design rounds at AWS-heavy shops (Airbnb, Stripe, Snowflake) and multi-cloud shops (Databricks, Vercel) ask you to compare Kafka with the two common alternatives. Knowing the partition model, retention, and operational story for each saves the round.

| Dimension | Apache Kafka | AWS Kinesis Data Streams | Apache Pulsar |
| --- | --- | --- | --- |
| Storage model | Append-only log per partition, retention by time or size | Sharded log, default retention 24h, max 365d | Segment-based log, native tiered storage to S3 |
| Parallelism unit | Partition (count can be raised later, never lowered; raising it remaps keys) | Shard (resharding online but throttled) | Partitioned or non-partitioned topic; consumption scales via subscriptions |
| Ordering guarantee | Strict per-partition | Strict per-shard (per partition key) | Strict per-partition, plus key-shared subscription mode |
| Delivery semantics | At-least-once default, exactly-once via transactions + Streams | At-least-once; exactly-once requires app-level idempotency | At-least-once or effectively-once via dedup + transactions |
| Consumer model | Consumer groups, offsets in __consumer_offsets | Application-level checkpoints in DynamoDB | Subscriptions (exclusive, shared, failover, key-shared) |
| Geo-replication | MirrorMaker 2 or Confluent Replicator (add-on) | Cross-region via Kinesis app or EventBridge | Built-in geo-replication at broker level |
| Operations | Self-managed cluster or Confluent Cloud / MSK | Fully managed, IAM-integrated | Self-managed cluster or StreamNative Cloud |
| Best fit | High-throughput streaming with replay, broad ecosystem | AWS-native pipelines with light ops budget | Multi-tenant platforms, geo-replicated streams |

Gotcha: Kinesis is not a Kafka drop-in. Retention defaults to one day, partition keys map to shards, and resharding is asynchronous. Pipelines that assume "I can always replay last week" silently lose history when ported to Kinesis without bumping retention.

Schema Registry and Avro

The Schema Registry (Confluent, Karapace, or AWS Glue Schema Registry) stores message schemas centrally. The producer registers the schema and prepends only its numeric ID to the serialized payload (in the Confluent wire format, a magic byte followed by a 4-byte schema ID). The consumer reads the ID, fetches the schema, and deserializes. Without a registry, producer and consumer have to agree on the wire format out of band: the moment a producer adds a field, every consumer that did not get the memo crashes.
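
In the Confluent wire format, the schema ID travels as a 5-byte prefix on the serialized value: one magic byte (0) followed by the ID as a 4-byte big-endian integer. The sketch below shows only that framing, with hypothetical function names; the Avro encoding of the payload and the registry lookup are out of scope, and other registries may frame messages differently.

```python
import struct

MAGIC_BYTE = 0

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an encoded record with the magic byte and 4-byte schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, payload)."""
    if len(message) < 5:
        raise ValueError("message too short to carry a schema ID")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]
```

A consumer would call `unframe`, look the ID up in the registry (usually through a local cache), and hand the remaining bytes to the deserializer.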

Avro won the Kafka ecosystem for being compact and strongly typed. Protobuf and JSON Schema are also supported. Compatibility modes: backward (registry default) lets a consumer on the new schema read old data, so you may delete a field or add one with a default; forward lets a consumer on the old schema read new data, so you may add a field or delete one that had a default; full requires both, so only fields with defaults can be added or removed.

A reliable loop trap: "Can you add a required field to an Avro schema?" The answer is no, not without a default. Without one, backward compatibility breaks: a consumer upgraded to the new schema cannot deserialize old messages, because nothing fills in the missing field.
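
The backward-compatibility rule for added fields reduces to a short check. This is a simplified sketch over plain dicts, not the registry's or Avro's actual resolution logic: it ignores type promotion, aliases, and unions, and the function name and schema shape are assumptions.

```python
def backward_compatible(old_fields, new_fields):
    """True if a reader on new_fields can decode data written with old_fields.

    Each argument maps field name -> spec dict; a spec may carry a "default".
    """
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False   # nothing to fill in when old records lack the field
    return True            # removed fields are simply skipped by the reader

old = {"order_id": {"type": "string"}}
with_required = {**old, "currency": {"type": "string"}}
with_default = {**old, "currency": {"type": "string", "default": "USD"}}
```

Here `backward_compatible(old, with_required)` is False and `backward_compatible(old, with_default)` is True, which is the whole answer to the trap question.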

Common pitfalls

The first pitfall is treating Kafka as a queue. Kafka is an append-only log, not RabbitMQ. Messages persist until retention deletes them and any group can replay independently. If you say "delete the message after processing," the interviewer is already writing "no hire" in the rubric.

Auto-commit on critical data is the second classic trap. With enable.auto.commit=true the offset advances on a timer, not after work succeeds. A consumer that crashes between auto-commit and the side effect loses messages silently. For payments, billing, or anything with a financial auditor, the answer is manual commit after the sink confirms the write — paired with an idempotent sink, that is at-least-once that behaves like exactly-once.

Under-partitioning on day one is the third trap. A single-partition topic has zero parallelism within a consumer group. You can add partitions later but cannot shrink them, and adding them changes the key-to-partition mapping, so existing data is not re-keyed without a reprocess. Start at 6 to 12 partitions for mid-volume topics; a modest surplus costs some broker overhead, while growing a keyed topic after the fact breaks per-key ordering across the change.

Relying on cross-partition order is the fourth trap. Order is strict inside a partition and undefined between partitions. If a user-level invariant needs strict order, the partition key must include the entity id so that all events for one entity land on one partition. If you genuinely need global order, the answer is one partition — single-consumer throughput is the price.

Stuffing large payloads into Kafka is the fifth trap. The default message.max.bytes is 1 MB and brokers do not enjoy multi-megabyte records — they amplify replication cost, GC pressure, and rebalance pain. Use the claim check pattern: write the blob to S3 or GCS, publish only the pointer plus metadata to Kafka.
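
A minimal sketch of the claim check pattern, assuming an in-memory dict as a stand-in for S3 or GCS. The function names, the record layout, and the inline threshold are all hypothetical; a real version would upload through the object store's client and publish the returned record to Kafka.

```python
import hashlib

blob_store = {}  # stands in for S3 or GCS: object key -> bytes

def to_record(payload: bytes, metadata: dict, max_inline_bytes: int = 900_000) -> dict:
    """Build the record to publish; oversized payloads go to the blob store."""
    if len(payload) <= max_inline_bytes:
        return {"inline": True, "payload": payload, **metadata}
    key = "blobs/" + hashlib.sha256(payload).hexdigest()   # content-addressed key
    blob_store[key] = payload
    return {"inline": False, "blob_key": key, "size": len(payload), **metadata}

def resolve(record: dict) -> bytes:
    """Consumer side: return the payload, fetching the blob if needed."""
    if record["inline"]:
        return record["payload"]
    return blob_store[record["blob_key"]]
```

Keying the blob by its content hash means a producer retry uploads to the same key, so the pattern stays idempotent on the storage side as well.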

The last pitfall is assuming exactly-once survives the broker boundary. Exactly-once inside Kafka is real; the moment you write to Postgres, ClickHouse, or Snowflake you are back to at-least-once unless the sink is idempotent. UPSERT by business key is the standard answer.

If you want to drill Kafka the way loops actually ask it, NAILDD ships scenario-style questions on partitions, offsets, exactly-once, and schema evolution with worked solutions.

FAQ

How is Kafka different from RabbitMQ?

Kafka is a distributed event log with durable retention and parallel reads via partitions; consumers track their own position and can replay any window of history. RabbitMQ is a queue broker with rich routing — messages are removed once acked and re-reading requires re-enqueuing. Kafka wins on throughput, replay, and stream processing; RabbitMQ wins on complex routing and low-latency RPC-style messaging.

What is ISR and why does min.insync.replicas matter?

The ISR is the set of replicas, leader included, that have caught up to the leader within replica.lag.time.max.ms. Setting min.insync.replicas=2 with acks=all and replication.factor=3 guarantees the broker only acknowledges a write once at least two in-sync replicas have it (replicated, though not necessarily flushed to disk). Drop to min.insync.replicas=1 and a single-replica ack plus a leader crash can lose data. For anything financial, two is the floor.

How do you get exactly-once when writing from Kafka into Postgres?

Pure distributed exactly-once needs two-phase commit, which is operationally painful and almost never used. The standard answer is at-least-once delivery plus an idempotent sink: every write is an UPSERT keyed by a business id (order_id, event_id) so reprocessing the same Kafka message yields the same row, not a duplicate. From the system-level view this is exactly-once for every observable purpose.

How many partitions should a new topic have?

A working heuristic is target throughput in messages per second divided by per-consumer throughput, often around 10k/sec per consumer for light work. Use a floor of three so load spreads across brokers (fault tolerance itself comes from the replication factor, not the partition count); the ceiling is the maximum number of parallel consumers you ever expect. Lean slightly high: you can add partitions later but cannot shrink them, and key-based ordering breaks across the boundary if you add mid-flight.
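
The heuristic translates directly into arithmetic. The sketch below is one reading of it, with a hypothetical function name: divide throughput by per-consumer capacity, round up, then clamp between the floor of three and the largest consumer count expected.

```python
import math

def suggest_partitions(target_msgs_per_sec, per_consumer_msgs_per_sec,
                       max_expected_consumers, floor=3):
    """Throughput-driven partition count, clamped to [floor, ceiling]."""
    needed = math.ceil(target_msgs_per_sec / per_consumer_msgs_per_sec)
    ceiling = max(floor, max_expected_consumers)
    return min(max(needed, floor), ceiling)
```

For a 60k/sec topic at 10k/sec per consumer and at most 12 consumers, this suggests 6 partitions. If the throughput term alone exceeds the ceiling, the result is capped, which is a signal to revisit the expected consumer count rather than a number to trust blindly.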

What is a rebalance and how do you make it less painful?

A rebalance is the redistribution of partitions when group membership changes. Shrink the blast radius with cooperative rebalancing (CooperativeStickyAssignor) so reassignment is incremental, static membership (group.instance.id) so rolling restarts skip the rebalance, and a session.timeout.ms large enough to ride out routine pauses without ejecting healthy consumers.

Is this official hiring guidance?

No. This article is based on public sources, Apache Kafka docs, and patterns reported on LinkedIn, Glassdoor, and levels.fyi. Bar and rubric vary by company and level — use it as preparation, not a contract.