Kafka for systems analysts: interview concepts

Why a systems analyst needs Kafka

Kafka is the default async backbone for event-driven integrations at companies like Uber, Stripe, Netflix, and Airbnb. If you write specs at a SaaS, fintech, or marketplace team, you are writing Kafka specs whether the doc says so or not. A systems analyst who treats Kafka as "send the event and figure it out later" ships bugs into production and forces a painful migration eighteen months on.

The classic failure mode is a one-liner in a requirements doc — "publish order events to Kafka" — with no topic name, no key, no schema, no retention. Engineers fill in the blanks, consumers make different assumptions, and the first time an order is processed twice during a deploy, nobody can say whether the producer, the consumer, or the sink owns the bug. The fix is a spec that pins down topic, key, schema, acks, commit strategy, and idempotency before code is written.

You will not be asked about ZooKeeper or KRaft on a systems analyst loop. You will be asked what a topic is, what a partition guarantees, what acks=all means, and how you would describe an exactly-once contract to a backend engineer.

Core concepts

A topic is a named, append-only log of events — not a queue. Messages are not deleted when a consumer reads them; they stay until retention expires or compaction overwrites them, and any number of consumer groups can read the same topic independently. That property is what makes Kafka a streaming platform rather than a message broker.

A partition is the unit of parallelism. A topic with 12 partitions can be processed by up to 12 consumers in parallel within a group. Order is guaranteed inside a partition and never across partitions. If you need strict per-user ordering, the producer must set a partition key (e.g. user_id) so all events for that user hash to the same partition.
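
The key-to-partition rule can be illustrated with a minimal sketch. It assumes a stand-in hash (CRC32 from the Python standard library); Kafka's own default partitioner uses a different hash function, and `partition_for` is a hypothetical helper, not a client API. The point is only that equal keys always land on the same partition.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Stand-in hash: a real Kafka client uses its own hash, not CRC32.
    The property shown is the same: equal keys map to equal partitions.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [("user-42", "created"), ("user-7", "created"), ("user-42", "paid")]
for user_id, status in events:
    # Both user-42 events print the same partition number.
    print(user_id, status, "-> partition", partition_for(user_id, 12))
```

Because the result depends on `num_partitions`, changing the partition count later changes where a key lands, which is worth stating in the spec.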

A producer publishes messages. A consumer reads them, almost always as part of a consumer group — Kafka assigns partitions so each is owned by exactly one consumer at a time. The consumer's progress is tracked by offset, stored in Kafka itself.
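
A simplified sketch of partition ownership inside one consumer group follows. The `assign` function is hypothetical and uses plain round-robin; real clients ship several assignment strategies, so treat this only as an illustration of "each partition has exactly one owner".

```python
def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partition indexes over consumers; each partition gets exactly one owner."""
    owners: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(partitions):
        owners[consumers[p % len(consumers)]].append(p)
    return owners

print(assign(12, ["c1", "c2", "c3"]))   # four partitions per consumer
print(assign(2, ["c1", "c2", "c3"]))    # c3 owns nothing and sits idle
```

The second call shows why adding consumers beyond the partition count buys no extra throughput.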

Retention controls how long messages survive. The default is 7 days; production topics usually run 3 to 30 days. The alternative is log compaction: Kafka keeps only the latest value per key, so the topic behaves like a snapshot of "current state per entity" rather than a stream of every event.
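
To make the compaction idea concrete, here is a small sketch that models a topic as a list of (key, value) pairs. The `compact` function is an illustration of the "latest value per key" outcome, not how the broker implements it, and it ignores tombstones and timing.

```python
def compact(log: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the most recent record per key, preserving log order."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if last_index[rec[0]] == i]

log = [
    ("user-1", "plan=free"),
    ("user-2", "plan=free"),
    ("user-1", "plan=pro"),   # supersedes the first user-1 record
]
print(compact(log))  # [('user-2', 'plan=free'), ('user-1', 'plan=pro')]
```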

Load-bearing rule: order is guaranteed only inside a partition. If your spec depends on cross-partition ordering, the spec is wrong — either pick a key that groups events that must be ordered, or accept they will arrive interleaved.

Producers and delivery

The producer side is mostly about how hard you wait for acknowledgement. The acks setting has three values mapping to a durability/latency tradeoff:

| acks | Waits for | Risk | When to use |
|---|---|---|---|
| 0 | Nothing | Message can be lost on broker failure | Telemetry, click logs where loss is acceptable |
| 1 | Partition leader only | Loss if leader dies before replicas catch up | Non-critical events (this was the client default before Kafka 3.0) |
| all | All in-sync replicas | Essentially no loss with min.insync.replicas=2 | Payments, orders, audit, anything financial |

For critical topics the canonical setup is acks=all + min.insync.replicas=2 with replication factor 3 — tolerates one broker failure without losing writes and refuses writes if fewer than two replicas are caught up.
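
The interaction between acks and min.insync.replicas can be sketched as a decision function. This is a simplified model with a hypothetical `write_outcome` helper; it only captures the accept/reject rule described above, not broker behaviour in full.

```python
def write_outcome(acks: str, in_sync_replicas: int, min_insync_replicas: int = 2) -> str:
    """Return what the producer is told for one write, in a simplified model."""
    if acks == "0":
        return "not acknowledged (fire and forget)"
    if acks == "1":
        return "acknowledged by leader only"
    if acks == "all":
        if in_sync_replicas < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return f"acknowledged by {in_sync_replicas} in-sync replicas"
    raise ValueError(f"unknown acks value: {acks!r}")

# Replication factor 3: all healthy, one broker down, two brokers down.
for isr in (3, 2, 1):
    print(isr, "->", write_outcome("all", isr))
```

With replication factor 3, one broker failure leaves two in-sync replicas and writes continue; a second failure makes the topic refuse writes rather than accept ones it could lose.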

The idempotent producer prevents duplicates on retry. It has been on by default since Kafka 3.0, but call it out in the spec because some legacy clients disable it. The transactional producer goes further: it writes to multiple topics atomically and commits consumer offsets in the same transaction — the building block for exactly-once.
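
A rough sketch of why retries do not create duplicates with an idempotent producer: the broker remembers the last sequence number it appended for each producer and drops anything it has already seen. The `PartitionLog` class below is a toy model under that assumption; the real protocol works per batch and also handles producer epochs and out-of-order sequences.

```python
class PartitionLog:
    """Toy partition that deduplicates by (producer_id, sequence number)."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self.last_seq: dict[int, int] = {}   # producer_id -> highest sequence appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append unless this sequence was already written. Returns True if appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False                      # retry of a write the log already holds
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True

log = PartitionLog()
log.append(producer_id=1, seq=0, value="order-created")
log.append(producer_id=1, seq=0, value="order-created")   # network retry, ignored
log.append(producer_id=1, seq=1, value="order-paid")
print(log.records)  # ['order-created', 'order-paid']
```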

A useful mental model: the producer controls "did the broker accept this message", and the consumer controls "did the application finish processing it". Both ends matter.

Consumers and offsets

Most production incidents start here. The headline knob is when you commit the offset.

With auto-commit the client library commits the latest read offset on a timer (default every 5 seconds). Convenient and dangerous: if the offset past message 100 is committed and the consumer crashes before message 100 is fully processed, the consumer resumes after it on restart and the message is never handled. Fine for analytics that tolerate drops, never fine for anything you bill on.

With manual commit the application commits after the message is fully processed — written to the database, the downstream API has acknowledged, the search index updated. Small throughput cost; the only safe pattern for revenue-bearing data.

A rebalance is what happens when a consumer joins, leaves, or dies: Kafka reassigns partitions across the surviving members. During a classic rebalance the whole group stops processing until the new assignment is finalised. On large groups this can take 5-30 seconds, which is why modern clients prefer cooperative rebalance protocols that only pause the partitions actually moving.

The three delivery semantics every interviewer will ask about line up cleanly:

| Semantics | Where the work happens | Failure outcome | When to pick |
|---|---|---|---|
| At-most-once | Commit offset before processing | Message may be lost on crash | Lossy telemetry, sampling pipelines |
| At-least-once | Commit offset after processing | Message may be processed twice | Default; combine with idempotent sink |
| Exactly-once | Transactional producer + offset commit inside the sink transaction | Neither lost nor duplicated | Payments, ledger, billing |
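
The difference between the first two semantics comes down to the order of two steps. The sketch below simulates a consumer that crashes between them and then restarts from the committed offset. The `run` function is hypothetical and involves no Kafka client, only the ordering logic.

```python
def run(messages: list[str], commit_first: bool, crash_at: int) -> list[str]:
    """Consume with one crash, restart from the committed offset, return what was processed."""
    committed = 0                 # next offset to read after a restart
    processed: list[str] = []

    def commit(offset: int) -> None:
        nonlocal committed
        committed = offset + 1

    def process(offset: int) -> None:
        processed.append(messages[offset])

    def consume(start: int, crash_offset=None) -> None:
        first, second = (commit, process) if commit_first else (process, commit)
        for offset in range(start, len(messages)):
            first(offset)
            if offset == crash_offset:
                return            # simulated crash between the two steps
            second(offset)

    consume(0, crash_at)          # first run dies at crash_at
    consume(committed)            # restart resumes from the committed offset
    return processed

msgs = ["m0", "m1", "m2"]
print(run(msgs, commit_first=True, crash_at=1))    # ['m0', 'm2'] : m1 is lost
print(run(msgs, commit_first=False, crash_at=1))   # ['m0', 'm1', 'm1', 'm2'] : m1 runs twice
```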

The honest answer in 90% of integrations is at-least-once with an idempotent sink: the consumer commits after a successful DB write, and the DB uses UPSERT keyed on the business event ID so a redelivery is a no-op. Exactly-once exists but constrains your sink — it only works cleanly when the sink is Kafka itself or a database that participates in the same transaction.
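
An idempotent sink can be sketched with SQLite from the Python standard library. The table and column names are made up for illustration, and the `ON CONFLICT` clause assumes SQLite 3.24 or newer; other databases have their own UPSERT syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (event_id TEXT PRIMARY KEY, status TEXT NOT NULL)")

def apply_event(event_id: str, status: str) -> None:
    """Write one event; a redelivery with the same event_id leaves a single row."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO orders (event_id, status) VALUES (?, ?) "
            "ON CONFLICT(event_id) DO UPDATE SET status = excluded.status",
            (event_id, status),
        )

apply_event("evt-1", "paid")
apply_event("evt-1", "paid")   # redelivered after a crash before the offset commit
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```

The consumer would commit the Kafka offset only after `apply_event` returns, so a crash in between produces a harmless replay.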

Gotcha: "exactly-once" in marketing copy almost always means "at-least-once plus a dedup step you have to build". If the sink is a third-party API with no idempotency key support, exactly-once is off the menu — make the API call idempotent at the application layer.

Schema Registry and contracts

Without a registry, the contract lives in a Confluence page and the goodwill of whoever last edited the publisher. Add a third consumer six months later and the next breaking field rename takes the integration down for a day.

A Schema Registry (Confluent's, Apicurio, AWS Glue) stores schemas centrally. The producer registers a schema, embeds the schema ID in each message, and the consumer fetches the schema by ID. Format choices are usually Avro, Protobuf, or JSON Schema — Avro is the historical default and most compact on the wire.
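
How a schema ID travels with a message can be sketched as a small framing helper. This follows the layout Confluent documents for its serializers (one zero "magic" byte, then a 4-byte big-endian schema ID, then the encoded payload); other registries may frame messages differently, and `frame` / `unframe` are hypothetical names.

```python
import struct

MAGIC_BYTE = 0  # marks the framing version

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an encoded payload with the magic byte and the schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown framing")
    return schema_id, message[5:]

msg = frame(42, b"encoded-record-bytes")
print(unframe(msg))  # (42, b'encoded-record-bytes')
```

The consumer uses the recovered ID to fetch (and usually cache) the writer's schema before decoding the payload.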

Compatibility modes are the part interviewers care about:

| Mode | What it guarantees | What it forbids | Upgrade order |
|---|---|---|---|
| Backward | New schema can read old data | Adding required fields (no default), type changes | Upgrade consumers first |
| Forward | Old schema can read new data | Removing required fields, type changes | Upgrade producers first |
| Full | Both directions | Adding or removing required fields; only optional fields with defaults may come and go | Either side can move first |
| None | Nothing (no checks) | Nothing | Coordinated big-bang only |

Default-friendly choices are backward (the default mode in Confluent's Schema Registry; consumers move to the new schema before producers start writing it) and full (strictest, safest for cross-team contracts). The spec should state the mode and the deprecation policy: how a field is marked optional before removal, and how consumers are notified.
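
The compatibility modes can be approximated with a toy checker. It assumes Avro-style resolution, where a reader can fill in a missing field only if that field has a default, and it ignores type changes entirely. The `Schema` shape and function names are hypothetical.

```python
Schema = dict[str, bool]  # field name -> True if the field has a default value

def backward_ok(old: Schema, new: Schema) -> bool:
    """New schema can read old data: every field added in `new` has a default."""
    return all(has_default for name, has_default in new.items() if name not in old)

def forward_ok(old: Schema, new: Schema) -> bool:
    """Old schema can read new data: every field dropped from `old` had a default."""
    return all(has_default for name, has_default in old.items() if name not in new)

def full_ok(old: Schema, new: Schema) -> bool:
    return backward_ok(old, new) and forward_ok(old, new)

v1 = {"order_id": False, "amount": False}
add_optional = {**v1, "currency": True}    # new field with a default
add_required = {**v1, "coupon": False}     # new field without a default
drop_required = {"order_id": False}        # "amount" removed

print(full_ok(v1, add_optional))                                      # True
print(backward_ok(v1, add_required), forward_ok(v1, add_required))    # False True
print(backward_ok(v1, drop_required), forward_ok(v1, drop_required))  # True False
```

A registry performs this kind of check at registration time and rejects the new schema version if the configured mode is violated.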

What to put in the integration spec

A complete spec has roughly a dozen lines per topic. The fields you cannot skip: topic (name, partitions, replication, retention, compaction), producer (acks, idempotent, transactional), schema (format, registry, compatibility, evolution policy), partition key (which field and the ordering it implies), consumer (group name, commit strategy, error handling), delivery contract (at-least-once or exactly-once, idempotency in the sink), dead-letter (DLQ topic, triage, replay tooling), and monitoring (lag SLO, error rate, throughput, alert thresholds).

A working block:

topic:
  name: orders.events.v1
  partitions: 12          # hashed by user_id
  replication: 3
  retention: 14d
  compaction: false
  schema: Avro, backward-compatible
  producer:
    acks: all
    idempotent: true
    transactional: false

consumer:
  name: dwh-loader
  group: dwh-loader-prod
  commit: after successful DWH write
  idempotency: UPSERT by event_id in DWH
  dlq: orders.events.v1.dlq, manual review by data-platform on-call
  lag_slo: p95 < 60s, page on p95 > 5m for 10m

The point is not the YAML — it is that every field has been thought about. A spec that names twelve concrete decisions is one a backend engineer can implement without follow-up questions; one that names two produces three different implementations across three consumer teams.
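
One way to enforce "every field has been thought about" is a small completeness check run against the spec before review. The sketch below is an assumption about how a team might do that; the section and field names mirror the checklist earlier in this section but are otherwise made up.

```python
REQUIRED_FIELDS = {
    "topic": ["name", "partitions", "replication", "retention", "compaction"],
    "producer": ["acks", "idempotent", "transactional"],
    "schema": ["format", "registry", "compatibility", "evolution_policy"],
    "partition_key": ["field", "ordering"],
    "consumer": ["group", "commit_strategy", "error_handling"],
    "delivery": ["contract", "sink_idempotency"],
    "dead_letter": ["topic", "triage", "replay"],
    "monitoring": ["lag_slo", "error_rate", "throughput", "alerts"],
}

def missing_fields(spec: dict) -> list[str]:
    """List every 'section.field' the spec leaves undecided."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        present = spec.get(section, {})
        missing += [f"{section}.{field}" for field in fields if field not in present]
    return missing

# The "publish order events to Kafka" one-liner, with only two decisions made.
draft = {"topic": {"name": "orders.events.v1"}, "producer": {"acks": "all"}}
for gap in missing_fields(draft):
    print("undecided:", gap)
```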

Common pitfalls

The most frequent partition-key mistake is leaving the key unset and assuming order is preserved anyway. Without a key the default partitioner spreads records across partitions (round-robin in older clients, sticky per-batch assignment since Kafka 2.4), so two events for the same user can land in different partitions and arrive out of order. If the spec says "process status changes in order", the producer key must be the entity ID, set on every message.

A close second is auto-committing on a revenue-bearing consumer. A deploy between the commit timer firing and the actual DB write silently drops the message. The fix is mechanical: switch to manual commit, commit only after the transaction succeeds, and make the sink idempotent on the business key. This is why "did you write before you committed" is a stock interview question.

One large topic with one partition caps throughput at one consumer per group and makes scaling painful later. Pick a partition count with headroom (12 or 24 is a reasonable default for medium-volume business events): you can grow consumers up to that number, Kafka does not support reducing a topic's partition count, and adding partitions later changes which partition a key hashes to, which breaks per-key ordering for in-flight data.

Relying on order across partitions comes up in specs that say "process events globally in order". Kafka does not provide that. If global order matters, you need a single partition (with its throughput cap) or a downstream reordering buffer keyed by timestamp — which adds latency the spec must own.
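
A downstream reordering buffer can be sketched with a heap. The `ReorderBuffer` class is hypothetical: it holds each event for a fixed delay window and releases events in timestamp order once they are old enough. Anything arriving later than the window would still come out of order, which is the latency-versus-correctness tradeoff the spec has to own.

```python
import heapq

class ReorderBuffer:
    """Hold events for max_delay time units, then release them in timestamp order."""

    def __init__(self, max_delay: float) -> None:
        self.max_delay = max_delay
        self._heap: list[tuple[float, int, str]] = []
        self._seq = 0   # tie-breaker so equal timestamps keep arrival order

    def add(self, timestamp: float, event: str) -> None:
        heapq.heappush(self._heap, (timestamp, self._seq, event))
        self._seq += 1

    def release(self, now: float) -> list[str]:
        """Pop every event whose timestamp is at least max_delay older than now."""
        out = []
        while self._heap and self._heap[0][0] <= now - self.max_delay:
            out.append(heapq.heappop(self._heap)[2])
        return out

buf = ReorderBuffer(max_delay=5)
buf.add(12, "b")        # arrived first from partition 3
buf.add(10, "a")        # arrived second from partition 7, but happened earlier
buf.add(20, "c")
print(buf.release(now=18))   # ['a', 'b']
print(buf.release(now=30))   # ['c']
```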

Skipping the Schema Registry on a multi-team integration sets up breakage every time someone renames a field; with three or more consumer teams the registry pays for itself in the first quarter. Skipping retention is the other half: topics grow forever and replay tooling silently rots. The default 7 days is a starting point; payments and audit topics often want 30 days or more. And large payloads over the 1 MB default message limit belong in object storage (S3, GCS) with a Kafka event carrying the URI, not in Kafka itself.

If you want to drill systems-analyst design questions like this every day, NAILDD is launching with hundreds of practice scenarios across exactly this pattern.

FAQ

Is Kafka a queue?

Not really. A traditional queue (RabbitMQ, SQS) deletes a message once a consumer acknowledges it. Kafka is a distributed log: messages stay for the retention window, and any number of independent consumer groups replay the same data with their own offsets. That property is what makes Kafka useful for streaming analytics, change data capture, and event sourcing.

Kafka vs RabbitMQ — which belongs in the spec?

Kafka wins for streaming events, replay, throughput, and fan-out. RabbitMQ wins for routing-heavy command-style messaging where each message has one intended recipient. On most modern stacks Kafka handles events between services and RabbitMQ or SQS handles task queues. If the interviewer asks "why Kafka here", the answer is durability, replay, and horizontal scaling.

What is consumer lag and why is it the headline SLO?

Lag is the difference between the latest offset written to a partition and the offset the consumer group has committed, so natively it is a message count; time-based lag is derived from record timestamps. Rising lag means the consumer cannot keep up. A typical SLO is p95 time lag under 60 seconds for near-real-time pipelines and under 5 minutes for batch-ish ETL. Alert on sustained breach, not single spikes; a transient bump during a deploy is normal.
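
The lag calculation itself is simple arithmetic per partition. The sketch below uses a hypothetical `consumer_lag` helper and made-up offsets; in practice the two inputs come from the cluster's monitoring tooling.

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag in messages: log end offset minus committed offset."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

end_offsets = {0: 1500, 1: 980, 2: 2210}
committed = {0: 1450, 1: 980, 2: 2000}

lag = consumer_lag(end_offsets, committed)
print(lag)                 # {0: 50, 1: 0, 2: 210}
print(sum(lag.values()))   # 260 messages behind across the group
```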

How do I describe exactly-once semantics in a spec?

State it as a chain: "The producer is idempotent and transactional. The consumer reads the message, writes to the sink, and records the Kafka offset inside the same database transaction, then resumes from that stored offset after a restart. The sink uses UPSERT keyed by event_id so any retry is a no-op." That is enough for a backend engineer to implement, and it forces the conversation about whether the sink can participate in such a transaction. If it cannot, downgrade to at-least-once with idempotent writes and say so explicitly.
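
The "offset inside the same database transaction" step can be sketched with SQLite. The table names, column names, and the `handle` / `resume_offset` helpers are illustrative assumptions, and the `ON CONFLICT` clauses assume SQLite 3.24 or newer. The business write and the offset either both commit or both roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (event_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER, "
    "PRIMARY KEY (topic, part))"
)

def handle(topic: str, part: int, offset: int, event_id: str, amount: int) -> None:
    """Apply one message and record its offset atomically."""
    with conn:  # one transaction: both statements commit together or not at all
        conn.execute(
            "INSERT INTO ledger (event_id, amount) VALUES (?, ?) "
            "ON CONFLICT(event_id) DO NOTHING",
            (event_id, amount),
        )
        conn.execute(
            "INSERT INTO offsets (topic, part, next_offset) VALUES (?, ?, ?) "
            "ON CONFLICT(topic, part) DO UPDATE SET next_offset = excluded.next_offset",
            (topic, part, offset + 1),
        )

def resume_offset(topic: str, part: int) -> int:
    """Offset to seek to after a restart; 0 if nothing was processed yet."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?", (topic, part)
    ).fetchone()
    return row[0] if row else 0

handle("orders.events.v1", 0, 7, "evt-1", 100)
handle("orders.events.v1", 0, 7, "evt-1", 100)   # replay of the same message
print(resume_offset("orders.events.v1", 0))      # 8
```

On restart the consumer seeks to `resume_offset(...)` instead of trusting the offset committed in Kafka, which is what closes the gap between "written" and "committed".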

What is a compacted topic and when do I use one?

A compacted topic keeps only the latest message per key rather than expiring by time. It is the right choice when the topic represents current state per entity, such as "latest user profile" or "latest feature flag value". A new consumer reading from the beginning gets a full snapshot without replaying every historical change. Use sparingly: compaction has subtle interactions with tombstones and retention, and a misconfigured compacted topic can silently lose data you expected to keep.

Should a systems analyst know broker internals like ISR and the controller?

No. The loop tests whether you can write a clean integration spec and reason about producer/consumer guarantees, not KRaft or tiered storage. If you find yourself deep in broker internals, redirect to writing two or three sample integration specs and reviewing them against the pitfalls above.