May 20, 2026·13 min read

Event-driven architecture for SA interviews

Contents:

Why panels keep asking about EDA
Event vs command
Event-driven architecture
Event sourcing
CQRS
When to apply which flavor
Common pitfalls
FAQ

Why panels keep asking about EDA

Almost every systems analyst loop at Stripe, Uber, DoorDash, or Snowflake includes a question like "how would you decouple the order service from inventory and notifications?" The interviewer is not testing whether you can spell Kafka. They are checking whether you can reason about loose coupling, eventual consistency, and the failure modes that come with replacing a synchronous RPC call with an event bus.

The trickier follow-ups land in event sourcing and CQRS territory. These are concrete patterns with specific tradeoffs in banking, billing, and regulated domains. A senior SA is expected to know when each flavor is justified and when it is overkill. What follows is the full chain — event vs command, EDA, event sourcing, CQRS — with the tradeoff tables and gotchas you need on the tip of your tongue during a loop.

Load-bearing rule: if you remember only one thing, remember that events are facts about the past and commands are requests about the future. Every confusion downstream traces back to mixing these two.

Event vs command

The first calibration question almost any panel asks is "what is the difference between an event and a command?" The answers split candidates into two buckets immediately.

A command is imperative: a request to do something (CreateOrder, ChargeCard, CancelSubscription). The verb is in the imperative mood, there is a single intended receiver, and the sender expects either success or a typed failure. A command can be rejected if validation fails or the state machine forbids it. Conceptually, a command is a method call that happens to travel over a wire.

An event is declarative — it records that something already happened (OrderCreated, CardCharged, SubscriptionCanceled). The verb is past tense, there can be zero or many subscribers, and they are anonymous to the publisher. Subscribers cannot reject the event — they can only react. This single semantic difference is what makes loose coupling possible.

Command: CreateOrder{customer_id, items}     → Order Service (one receiver)
Event:   OrderCreated{order_id, total, ts}   → Kafka topic → N subscribers

Aspect	Command	Event
Verb form	Imperative (do X)	Past tense (X happened)
Receivers	Exactly one	Zero to many
Can be rejected?	Yes	No
Naming convention	`VerbNoun`	`NounVerbed`
Coupling	Publisher knows receiver	Publisher does not know subscribers
Typical transport	Sync RPC, REST, gRPC	Kafka, Pulsar, EventBridge, SNS
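
The contrast in the table can be sketched in a few lines of Python. This is an illustrative in-memory model, not a real broker: `EventBus`, `CommandRejected`, `handle_create_order`, and the hard-coded `order-1` id are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CreateOrder:          # command: imperative, exactly one receiver
    customer_id: str
    items: tuple


@dataclass(frozen=True)
class OrderCreated:         # event: past tense, zero to many subscribers
    order_id: str
    customer_id: str


class CommandRejected(Exception):
    """Typed failure returned to the sender of a command."""


class EventBus:
    """Hypothetical in-memory stand-in for a broker topic."""

    def __init__(self) -> None:
        self._subscribers: list = []

    def subscribe(self, handler: Callable[[OrderCreated], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: OrderCreated) -> None:
        # The publisher does not know or care who is listening.
        for handler in self._subscribers:
            handler(event)


def handle_create_order(cmd: CreateOrder, bus: EventBus) -> OrderCreated:
    # A command can be rejected before any state changes.
    if not cmd.items:
        raise CommandRejected("order must contain at least one item")
    event = OrderCreated(order_id="order-1", customer_id=cmd.customer_id)
    bus.publish(event)      # subscribers can only react, never veto
    return event
```

Note that rejection lives entirely in the command handler; nothing on the subscriber side has a way to refuse `OrderCreated`.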

If you blur this line at the contract layer, you end up with "events" that are really commands in disguise — a subscriber that subscribes to OrderCreated only because it needs to send a notification, and the publisher silently relying on that subscriber existing. That is a synchronous dependency wearing an event hat.

Event-driven architecture

Event-driven architecture is the system-level pattern where services communicate primarily through events on a broker, not through synchronous RPC. The diagram is the same one panels expect you to sketch on the whiteboard:

Order Service ── publishes "OrderCreated" ── Kafka topic
                                              ├─→ Inventory Service   (reserves stock)
                                              ├─→ Notification Service (email + push)
                                              ├─→ Analytics Service    (logs cohort)
                                              └─→ Fraud Service        (scores risk)

Benefits are real but each has a matching cost. Loose coupling means Order does not know Inventory exists: you can deploy Inventory independently or add a new subscriber without touching the publisher. Resilience means that if Notification is down, orders still flow; the email goes out when the consumer recovers. Scalability means each consumer scales independently. An audit trail comes almost for free, because every event is persisted in the broker for as long as its retention policy allows.
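
The resilience claim is easier to defend with a concrete picture. Below is a minimal sketch, assuming a log-style broker where each consumer tracks its own read offset; the `Topic` class and the consumer names are hypothetical and ignore partitions, persistence, and retention.

```python
class Topic:
    """Append-only log; each consumer keeps its own read offset."""

    def __init__(self) -> None:
        self.log: list = []
        self.offsets: dict = {}

    def publish(self, event: dict) -> None:
        # Publishing never waits for any consumer.
        self.log.append(event)

    def poll(self, consumer: str) -> list:
        # Return everything this consumer has not seen yet, then advance.
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch


topic = Topic()
topic.publish({"type": "OrderCreated", "order_id": 1})
inventory_seen = topic.poll("inventory")        # inventory is up
topic.publish({"type": "OrderCreated", "order_id": 2})
inventory_seen += topic.poll("inventory")

# Notification was down for both publishes; it catches up on recovery.
notification_seen = topic.poll("notification")
```

The slow consumer neither blocked the publisher nor lost events, which is the property a synchronous call chain cannot give you.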

The costs trip junior candidates. Eventual consistency is the big one — the customer sees the order confirmation page, but stock is not yet reserved for another few hundred milliseconds. Distributed debugging is the second — one user action fans out into a dozen consumer logs across services, and reconstructing the flow requires a correlation id and a tracing stack like OpenTelemetry. Schema evolution is the third — once a payload is published, every existing subscriber depends on its shape.

Sanity check during the interview: if the interviewer says "the customer must see the final state immediately", that is your signal to push back on full EDA and propose a hybrid — sync call to the service of record, async events to everyone else.

Event sourcing

Event sourcing is a storage pattern, not a transport pattern, and conflating the two is the most common interview mistake. The idea: instead of persisting the current state of an entity, you persist the full ordered history of events that produced it. Current state is then a function — replay the events, fold them into an aggregate.

Events for account #42:
  1. AccountOpened{id=42, owner="ann"}
  2. Deposited{id=42, amount=100}
  3. Deposited{id=42, amount=50}
  4. Withdrawn{id=42, amount=30}

Current state = fold(events) → balance = 120

The wins are concrete. Complete audit trail for free — non-negotiable in banking, billing, healthcare. Time travel — reconstruct the state of any entity at any past moment by replaying up to a timestamp. Rebuildable read views when requirements change, because the source of truth is immutable.
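
The fold and the time-travel query can be sketched directly from the account #42 example. This is a minimal illustration: the `seq` field is an assumed stand-in for a timestamp or stream position, and `apply` and `state_at` are hypothetical helper names.

```python
from functools import reduce

events = [
    {"seq": 1, "type": "AccountOpened", "id": 42, "owner": "ann"},
    {"seq": 2, "type": "Deposited", "id": 42, "amount": 100},
    {"seq": 3, "type": "Deposited", "id": 42, "amount": 50},
    {"seq": 4, "type": "Withdrawn", "id": 42, "amount": 30},
]


def apply(state: dict, event: dict) -> dict:
    """Pure transition: (state, event) -> new state."""
    if event["type"] == "AccountOpened":
        return {"id": event["id"], "owner": event["owner"], "balance": 0}
    if event["type"] == "Deposited":
        return {**state, "balance": state["balance"] + event["amount"]}
    if event["type"] == "Withdrawn":
        return {**state, "balance": state["balance"] - event["amount"]}
    return state  # unknown event types are ignored


def state_at(history: list, up_to_seq: int) -> dict:
    """Time travel: replay only the events up to a sequence number."""
    return reduce(apply, (e for e in history if e["seq"] <= up_to_seq), {})


current = reduce(apply, events, {})
print(current["balance"])               # 120
print(state_at(events, 3)["balance"])   # 150
```

Because `apply` is a pure function, rebuilding a read view after a requirements change is the same operation with a different fold.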

The costs are also concrete. The system is harder to design — every state change must be modeled as a discrete event with clear domain meaning. Storage grows linearly with activity, so a chatty entity needs snapshotting every N events to keep replay fast. Schema evolution is a permanent chore — you cannot rewrite history, so old events must stay readable by every future projection.
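
Snapshotting can be sketched as "load the last snapshot, replay only the tail". The threshold `SNAPSHOT_EVERY`, the `Snapshot` type, and the simplified `delta` events are assumptions made for the example, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Optional

SNAPSHOT_EVERY = 3  # assumed threshold; tune per aggregate


@dataclass
class Snapshot:
    version: int   # number of events already folded in
    balance: int


def load(log: list, snapshot: Optional[Snapshot]) -> tuple:
    """Return (balance, events_replayed), starting from the snapshot."""
    start = snapshot.version if snapshot else 0
    balance = snapshot.balance if snapshot else 0
    tail = log[start:]
    for event in tail:
        balance += event["delta"]
    return balance, len(tail)


def maybe_snapshot(log: list, snapshot: Optional[Snapshot]) -> Optional[Snapshot]:
    """Take a new snapshot once enough events pile up past the last one."""
    last = snapshot.version if snapshot else 0
    if len(log) - last >= SNAPSHOT_EVERY:
        balance, _ = load(log, snapshot)
        return Snapshot(version=len(log), balance=balance)
    return snapshot
```

The event log stays the source of truth; a snapshot is only a cache of the fold and can be discarded and rebuilt at any time.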

Approach	Source of truth	Audit	Storage cost	Read latency
CRUD on current state	Latest row	None (add audit table separately)	Low	Low
Event sourcing (raw)	Event log	Built-in, complete	High	High (replay)
Event sourcing + snapshots	Event log + snapshot	Built-in, complete	Medium-high	Medium
Event sourcing + CQRS	Event log + read DB	Built-in, complete	High	Low (read DB)

Event sourcing without snapshots in a high-volume domain is a footgun. Greg Young, who popularized the pattern and coined the term CQRS, has explicitly warned that ES is not a default; it is a specialist tool for domains where the history itself is the product.

CQRS

Command Query Responsibility Segregation splits the model in two. The write side accepts commands, validates them against an aggregate, and emits events. The read side consumes those events and projects them into one or more denormalized read models tuned for specific queries.

Write side                       Read side
─────────────                    ───────────────
Commands → Aggregate                  Queries → Read DB(s)
              │                                   ▲
              ▼                                   │
           Events ─────→ Event store ─────→  Projection workers

The pattern earns its complexity in two cases. First, when read load dwarfs write load by an order of magnitude, so that denormalized read models tuned per use case (customer view, analyst dashboard, export job) pay for themselves. Second, when multiple read shapes are needed for the same data and you do not want one normalized schema fighting all of them with conflicting indexes.

CQRS pairs naturally with event sourcing, and ES + CQRS is the canonical pattern in financial services and regulated billing. CQRS without ES is also fine — run CQRS on a relational write DB that publishes change events via CDC. Stripe's ledger, Uber's payments, and several Snowflake internal services use variants of this. The cost is operational — two databases, a projection pipeline, and an eventual consistency contract between them.
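
A toy version of the write/read split may help when sketching this on a whiteboard. Everything here is illustrative: `Account`, `InsufficientFunds`, `project`, and the two view names are hypothetical, and the projection runs inline where a real system would run it asynchronously.

```python
from collections import defaultdict


class InsufficientFunds(Exception):
    """Typed rejection returned to the command sender."""


class Account:
    """Write side: validate a command against current state, emit an event."""

    def __init__(self, account_id: str) -> None:
        self.account_id = account_id
        self.balance = 0

    def deposit(self, amount: int) -> dict:
        self.balance += amount
        return {"type": "Deposited", "account_id": self.account_id, "amount": amount}

    def withdraw(self, amount: int) -> dict:
        if amount > self.balance:
            raise InsufficientFunds(self.account_id)
        self.balance -= amount
        return {"type": "Withdrawn", "account_id": self.account_id, "amount": amount}


# Read side: two denormalized views fed by the same event stream.
balance_view: dict = defaultdict(int)
withdrawal_count_view: dict = defaultdict(int)


def project(event: dict) -> None:
    account_id = event["account_id"]
    if event["type"] == "Deposited":
        balance_view[account_id] += event["amount"]
    elif event["type"] == "Withdrawn":
        balance_view[account_id] -= event["amount"]
        withdrawal_count_view[account_id] += 1


event_store: list = []
account = Account("a1")
for event in (account.deposit(100), account.withdraw(30)):
    event_store.append(event)   # log of what happened
    project(event)              # asynchronous in a real deployment
```

Queries read only `balance_view` or `withdrawal_count_view`; they never touch the aggregate, which is what lets each view be indexed and scaled for its own access pattern.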

When to apply which flavor

Picking the right level of investment is the senior-level signal. Below is the cheat sheet to keep in your head.

Pattern	Use when	Avoid when
Plain EDA (events, no ES, no CQRS)	Microservices needing loose coupling; high-volume async workflows; fan-out notifications	Simple CRUD app with one read shape and low traffic
Event sourcing	Audit-critical domains (banking, billing); compliance requires full history; temporal queries are core	Simple entities where current state is all anyone reads
Full CQRS + ES	Complex domain with multiple read views; read load >> write load; regulated industry	Small team without prior EDA experience; tight deadline; CRUD shaped problem

A defensible interview answer almost always starts with "I would default to plain EDA, and only reach for ES or CQRS if the domain requires it." That framing alone separates you from candidates who pattern-match buzzwords without weighing cost.

The honest anti-patterns are: a five-person team adopting full CQRS+ES for an MVP, a chat app using event sourcing for messages (high volume, low audit value), and a startup adding Kafka because "microservices need it" when the entire backend is a single service. Architecture matches problem shape — copying patterns from a FAANG postmortem without the matching load profile is how you ship a six-month rewrite.

Common pitfalls

The first pitfall is anemic events. A payload like UserChanged{user_id} tells subscribers nothing — they have to call back to the source of truth to ask what changed, which collapses the loose coupling you paid for. Publish what changed and why: UserEmailUpdated{user_id, old, new, reason}. Events should carry enough domain context that a subscriber can act without a round trip.

The second pitfall is events used as commands in disguise. If OrderCreated is only published because Notification needs to send an email, and the publisher silently relies on Notification being subscribed, you have built a synchronous dependency with extra steps. The signal you crossed the line: the publisher's tests start mocking the broker to verify who received the event. Events are facts; if you need a specific receiver, send a command.

The third pitfall is event sourcing without the supporting infrastructure. ES needs an append-only event store, snapshotting, projection workers, and an idempotent projection contract. Bolting "we'll store events in an events table in Postgres and replay them" onto an existing CRUD codebase usually ends in a broken rebuild three quarters later. Commit to a real event store (EventStoreDB, Axon, hardened Kafka) or stick with audit tables.
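
One common way to meet the idempotent projection contract is to remember which event ids have already been applied, so that a redelivered event becomes a no-op. The sketch below assumes each event carries a unique `event_id`; `BalanceProjection` is a hypothetical name, and a production version would persist the applied ids in the same transaction as the read model.

```python
class BalanceProjection:
    """Applies each event at most once, keyed by its event_id."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied: set = set()

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self._applied:
            return False
        self.balance += event["delta"]
        self._applied.add(event["event_id"])
        return True


projection = BalanceProjection()
deposit = {"event_id": "e-1", "delta": 100}
projection.handle(deposit)
projection.handle(deposit)      # redelivery after a consumer restart
print(projection.balance)       # 100
```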

The fourth pitfall is breaking schema changes without versioning. Once an event is published, every subscriber depends on its shape forever. The discipline is additive evolution only — new fields optional, old fields never removed — plus an event version in the payload and a registry like Confluent Schema Registry with backward-compatibility rules enforced at publish time.
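
Additive evolution plus a version field is often paired with an "upcaster" on the consumer side. The sketch below assumes, purely for illustration, that version 2 of UserEmailUpdated added the optional `reason` field; `upcast` and `on_user_email_updated` are hypothetical names.

```python
def upcast(event: dict) -> dict:
    """Bring an event of any known version up to the latest shape (v2)."""
    version = event.get("version", 1)   # events without a version count as v1
    if version == 1:
        # Assumed history: v2 added an optional "reason"; old events get a default.
        event = {**event, "version": 2, "reason": None}
    return event


def on_user_email_updated(event: dict) -> str:
    event = upcast(event)
    reason = event["reason"] or "unspecified"
    return f"{event['user_id']}: {event['old']} -> {event['new']} ({reason})"
```

The handler only ever sees the latest shape, so old events stay readable without rewriting history and without version checks scattered through the business logic.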

The fifth pitfall is ignoring ordering guarantees. Kafka orders events per partition, not globally. If you partition by customer_id, OrderShipped is guaranteed to be read after OrderCreated for the same customer only because both events carry the same partition key and therefore land in the same partition. Cross-partition ordering needs explicit reasoning, often a saga. Candidates who say "Kafka guarantees ordering" with no qualifier lose senior points.
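
Key-based partitioning can be illustrated with a stable hash. This is a sketch only: Kafka's default Java partitioner hashes the key bytes with murmur2, and `zlib.crc32` is used here as a stand-in; `NUM_PARTITIONS`, `partition_for`, and `publish` are hypothetical names.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the sketch


def partition_for(key: str) -> int:
    """Stable key -> partition mapping (crc32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions: list = [[] for _ in range(NUM_PARTITIONS)]


def publish(event: dict) -> int:
    # Same customer_id -> same partition -> relative order is preserved.
    p = partition_for(event["customer_id"])
    partitions[p].append(event)
    return p


p1 = publish({"type": "OrderCreated", "customer_id": "c-1"})
publish({"type": "OrderCreated", "customer_id": "c-2"})
p2 = publish({"type": "OrderShipped", "customer_id": "c-1"})
```

Events for `c-1` share a partition and keep their order; nothing here says anything about how events for `c-1` and `c-2` are ordered relative to each other.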

FAQ

Is event sourcing the only path to a real audit log?

No. The cheaper, more common approach is to keep current state in a normal database and write change rows into a separate audit table — either via application code or via CDC. That gives you a queryable history without the rebuild and projection complexity of full event sourcing. Event sourcing is the right call when the history is the product — financial ledgers, healthcare records, regulated billing — not when audit is a side requirement.
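
The application-code variant of the audit table can be sketched with the standard-library `sqlite3` module. Table and column names are assumptions for the example; the point is that the state change and its audit row commit in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE users_audit (
    audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id    INTEGER NOT NULL,
    old_email  TEXT,
    new_email  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")


def update_email(db: sqlite3.Connection, user_id: int, new_email: str) -> None:
    # "with db" commits on success and rolls back if anything raises,
    # so current state and audit history cannot drift apart.
    with db:
        row = db.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()
        if row is None:
            raise KeyError(user_id)
        db.execute("UPDATE users SET email = ? WHERE id = ?", (new_email, user_id))
        db.execute(
            "INSERT INTO users_audit (user_id, old_email, new_email) VALUES (?, ?, ?)",
            (user_id, row[0], new_email),
        )


with conn:
    conn.execute("INSERT INTO users (id, email) VALUES (1, 'ann@old.example')")
update_email(conn, 1, "ann@new.example")
```

Current state stays a plain row you can query directly, and the history is one `SELECT` on `users_audit` away, with no replay or projection machinery.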

Can you use CQRS without event sourcing?

Yes, and most production CQRS systems do exactly this. The write side uses a normal relational database, change events are emitted via CDC (Debezium, Snowflake streams, Postgres logical replication), and projection workers build read models from the change stream. You get the read-write separation and the multi-view flexibility without committing to a full event store. The tradeoff is that you do not get free time travel — your audit window is whatever your CDC retention is.

What's the smallest team that should attempt full CQRS + ES?

Rough heuristic: you need at least two engineers who have shipped an ES system before, plus capacity for a dedicated infra investment in event store, snapshotting, and projection monitoring. Below that, the operational overhead eats the velocity you hoped to gain. A team of three with no prior EDA experience adopting full CQRS+ES on a greenfield project is the canonical don't. Start with plain EDA, evolve when the domain demands it.

How do you handle a poison event that breaks every consumer?

Three-layer defense. Validate at publish time so the schema registry rejects malformed events at the producer. On the consumer side, wrap projection logic in a try/catch and route failures to a dead letter topic instead of blocking the group. Alert on dead-letter rate and have a manual replay path once the bug is fixed. Letting one bad event halt the consumer group for hours turns a small bug into an incident.
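
The consumer-side layer and the replay path can be sketched as follows. This is an in-memory illustration: the `dead_letters` list stands in for a dead letter topic, `consume` and `replay` are hypothetical helpers, and Python's try/except plays the role of try/catch.

```python
def consume(events: list, project, dead_letters: list) -> int:
    """Project each event; park failures instead of halting the group."""
    processed = 0
    for event in events:
        try:
            project(event)
            processed += 1
        except Exception as exc:
            # Keep the payload and the error so the event can be replayed later.
            dead_letters.append({"event": event, "error": repr(exc)})
    return processed


def replay(dead_letters: list, project) -> list:
    """Manual replay after the fix; returns entries that still fail."""
    still_dead = []
    for entry in dead_letters:
        try:
            project(entry["event"])
        except Exception as exc:
            still_dead.append({"event": entry["event"], "error": repr(exc)})
    return still_dead
```

Alerting would hang off the growth rate of `dead_letters`; a sudden spike usually means a producer shipped a bad schema, not that individual events are corrupt.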

Is Kafka required for event-driven architecture?

No. Kafka dominates for high-throughput durable streams, but plenty of EDA systems run on Pulsar, AWS EventBridge + SNS, GCP Pub/Sub, or NATS. The pattern is broker-independent. The interview signal is being able to compare them (Kafka for high throughput and replay, EventBridge for AWS-native fan-out, NATS for lightweight low-latency messaging) and picking one with reasoning.