Background jobs on a systems analyst interview
Why background jobs matter in the design round
When a systems analyst interview moves from "draw the API" to "what happens after the user clicks submit," the answer is almost always a background job. The interviewer wants to hear that you separate the synchronous response (return an ID, return 202 Accepted) from the asynchronous work (send the email, generate the PDF, kick off a billing run). If you keep everything inline, p99 latency tracks the slowest dependency you talk to, and a flaky email provider becomes a checkout outage.
The load-bearing concept is decoupling. The API writes a task to a durable queue, returns immediately, and one or more workers pull from that queue on their own clock. This is the moment in the interview where you stop sounding like a CRUD developer. The interviewer is listening for three things: a queue you can name, a retry policy you can defend, and an idempotency story that survives the worker crashing mid-job.
Load-bearing trick: if the work can be retried, can be reordered, or takes longer than 200 ms, push it to a queue. Anything else stays on the request path.
Common candidates: transactional emails, report generation, image processing, bulk imports, periodic cleanups, webhook fan-out, ML feature computation.
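The split between the request path and the worker path can be sketched in a few lines. This is a minimal illustration, assuming an in-memory `queue.Queue` as a stand-in for a durable broker; `handle_submit`, `worker`, and the payload shape are hypothetical names, not part of any framework.

```python
import queue
import threading
import uuid

# In-memory stand-in for a durable queue (a real broker would persist tasks).
task_queue = queue.Queue()
results = {}


def handle_submit(payload):
    """Request path: record the task, then return 202 and an ID right away."""
    job_id = str(uuid.uuid4())
    task_queue.put({"id": job_id, "payload": payload})
    return 202, {"job_id": job_id}


def worker():
    """Worker path: drain the queue on its own clock."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel used to stop the worker
            task_queue.task_done()
            break
        # Hypothetical slow work, for example sending an email.
        results[task["id"]] = "sent to " + task["payload"]["email"]
        task_queue.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
status, body = handle_submit({"email": "user@example.com"})
task_queue.join()      # wait until the worker has processed the task
task_queue.put(None)   # ask the worker to stop
thread.join()
print(status, results[body["job_id"]])  # 202 sent to user@example.com
```

The caller gets its 202 before any email work starts; the only thing on the request path is the enqueue.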
Queue technologies compared
You will be asked "which queue would you use." There is no single correct answer, but there is a wrong way to answer: naming one tool with no justification. The right approach is to map the requirement (ordering, throughput, durability, fan-out) to the tool. Below is the shortlist most interviewers expect you to know cold.
| Broker | Model | Strengths | Weak spots |
|---|---|---|---|
| RabbitMQ | AMQP, push-based | Mature, rich routing, per-message ack | Throughput ceiling around 50k msg/s per node |
| Redis Streams / Lists | In-memory pull | Sub-millisecond latency, simple | Durability tied to RDB/AOF config |
| Kafka | Append-only log | High throughput, replay, partition ordering | Heavy ops, no per-message ack semantics |
| AWS SQS | Managed pull | Zero ops, visibility timeout, DLQ built-in | Standard queues are at-least-once with best-effort ordering; strict ordering needs FIFO queues, which trade away throughput |
| Google Cloud Tasks | Managed push | HTTP target, scheduled dispatch | Lower throughput than Pub/Sub |
| Azure Service Bus | AMQP managed | Sessions, dead-lettering, transactions | Latency higher than Redis-based options |
The rule of thumb interviewers like to hear: Kafka for streaming and replay, RabbitMQ or SQS for task queues, Redis-backed job libraries (Sidekiq, RQ, BullMQ) when latency matters more than durability guarantees. If you say "Kafka for password reset emails" you have just lost the round: Kafka's strengths (replay, partitions, ordering) are wasted on one-off transactional work.
A question that trips junior candidates: "Is Kafka a message queue?" The honest answer is no — it is a distributed log. Consumers track their own offsets, messages are not removed on consume, and there is no per-message acknowledgement.
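The log-versus-queue distinction is easier to defend with a toy model. The sketch below is an in-memory illustration only, not Kafka's API; the `Log` class and its method names are hypothetical.

```python
class Log:
    """Append-only log: reading a record does not remove it."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # committed offset per consumer group

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Return records starting at the group's committed offset."""
        start = self.offsets.get(group, 0)
        return self.records[start:start + max_records]

    def commit(self, group, offset):
        """Record that `group` has processed everything before `offset`."""
        self.offsets[group] = offset


log = Log()
for event in ["signup", "purchase", "refund"]:
    log.append(event)

batch = log.poll("billing")
log.commit("billing", len(batch))

print(log.poll("billing"))    # []  (billing has caught up)
print(log.poll("analytics"))  # ['signup', 'purchase', 'refund']
log.commit("billing", 0)      # rewinding the offset replays the log
print(log.poll("billing"))    # ['signup', 'purchase', 'refund']
```

Each consumer group keeps its own position, nothing is deleted on read, and replay is just an offset reset. A task queue offers none of those three.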
Worker patterns
The worker is the process that drains the queue. Two patterns matter.
Pull-based workers fetch jobs from the broker themselves. This is what Sidekiq, RQ, BullMQ, and Celery on a Redis broker do. The worker controls its own concurrency and backpressure. You can scale by running more worker processes, and the broker is largely passive.
    # Simplified pull loop (queue, process, and should_retry are placeholders)
    while not shutdown:
        job = queue.reserve(timeout=5)  # wait up to 5 s for a job
        if not job:
            continue  # nothing available, poll again
        try:
            process(job.payload)
            job.ack()  # ack only after the work succeeded
        except Exception as exc:
            # requeue transient failures; reject everything else
            job.nack(requeue=should_retry(exc))

Push-based workers have jobs dispatched to them by the broker. Examples: RabbitMQ with basic.consume, Google Cloud Tasks hitting an HTTP endpoint, or AWS SQS with an event source mapping into Lambda. The broker decides when and where to deliver, which makes autoscaling easier but means you lose some local control over concurrency.
Either way, the interviewer will probe concurrency. Workers run N concurrent jobs per process, scaled by process count. Defaults vary by tool and version (Sidekiq has shipped with 10 threads, Celery's prefork pool defaults to the CPU count), so check yours before quoting a number. Tune N based on whether jobs are I/O-bound or CPU-bound: for I/O-bound work push N to 25-50; for CPU-bound work keep N close to core count or you will thrash.
Sanity check: if your queue depth keeps growing during the working day and drains overnight, you are under-provisioned, not "fine." Provision for the peak hour, not the daily average.
Retry strategies and idempotency
Jobs fail. Networks blip, third-party APIs throttle, databases deadlock. Your retry policy is the difference between a recoverable hiccup and a 3 AM page. The interviewer wants to hear exponential backoff with jitter, capped retries, and a dead-letter queue (DLQ) for poison messages.
    Attempt 1: immediate
    Attempt 2: 30 s + jitter
    Attempt 3: 5 min + jitter
    Attempt 4: 30 min + jitter
    Attempt 5: 2 h + jitter
    After attempt 5: move to DLQ, page on-call

Two ideas you have to articulate clearly:
Idempotency. Because the queue is at-least-once (SQS, Kafka, and Redis Streams are all at-least-once by default), the same job may be delivered twice. Your handler must produce the same observable result whether it runs once or three times. The standard trick is an idempotency key stored in your database with a unique constraint: insert the key in the same transaction as the work's own writes; if the insert violates uniqueness, the job has already run and you ack-and-skip. Committing the key together with the work matters: if the key were committed first and the worker crashed mid-job, the retry would see the key and skip work that never finished. For a deeper recipe, see idempotency in distributed systems; the same key discipline applies whether the work is a single HTTP call or a saga step.
Poison messages. A bad payload (malformed JSON, deleted referenced row, schema-incompatible event) will fail forever on retry. Without a DLQ, that one job clogs your worker pool and stalls everything behind it. With a DLQ, after N attempts you move the message off the hot path, alert, and let a human or a separate consumer decide what to do.
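The idempotency-key idea can be sketched with nothing but the standard library. This is a minimal example, assuming SQLite as a stand-in for the application database and assuming the key insert and the business write share one transaction; the table names, `make_db`, and `handle_charge` are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with the two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_jobs (idempotency_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
    conn.commit()
    return conn


def handle_charge(conn, key, order_id, amount_cents):
    """Apply the charge once per key, however many times the job is delivered."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO processed_jobs (idempotency_key) VALUES (?)", (key,)
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        return "skipped"  # key already present: an earlier delivery finished
    return "done"


conn = make_db()
print(handle_charge(conn, "order-42:charge", "order-42", 1999))  # done
print(handle_charge(conn, "order-42:charge", "order-42", 1999))  # skipped
print(conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0])  # 1
```

Because both inserts commit or roll back together, a crash between them leaves no key behind, so the redelivered job simply runs again from scratch.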
| Failure type | Right response |
|---|---|
| Transient (timeout, 503, deadlock) | Retry with exponential backoff |
| Rate limited (429) | Wait for the Retry-After header value, then fall back to exponential backoff |
| Validation error (400, bad payload) | Do not retry, move to DLQ immediately |
| Auth error (401, 403) | Do not retry blindly — rotate credentials, then replay |
| Resource not found (404) | Skip or DLQ depending on whether eventual consistency is in play |
Distinguishing these failure classes in your handler is what separates a senior answer from a junior one. Retrying a 400 burns money on a job that can never succeed and delays the moment a human looks at it.
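The failure table and the retry schedule from this section can be combined into one decision function. This is a sketch under stated assumptions: the status-code groupings, the 10% jitter ceiling, and the names `next_action`, `BASE_DELAYS`, and `MAX_ATTEMPTS` are illustrative choices, and anything not listed (including 404) is treated as transient here.

```python
import random
from typing import Optional, Tuple

# Base delays in seconds before attempts 2..5, mirroring the schedule above.
BASE_DELAYS = [30, 300, 1800, 7200]
MAX_ATTEMPTS = 5


def next_action(status: int, attempt: int,
                retry_after: Optional[float] = None) -> Tuple[str, float]:
    """Return (action, delay_seconds) after attempt `attempt` failed with `status`."""
    if status in (400, 422):
        return "dead_letter", 0.0            # bad payload: retrying cannot help
    if status in (401, 403):
        return "hold_for_credentials", 0.0   # fix credentials first, then replay
    if attempt >= MAX_ATTEMPTS:
        return "dead_letter", 0.0            # retry budget exhausted
    if status == 429 and retry_after is not None:
        return "retry", retry_after          # honor the server's hint
    base = BASE_DELAYS[attempt - 1]
    jitter = random.uniform(0, base * 0.1)   # assumed: up to 10% extra delay
    return "retry", base + jitter


print(next_action(400, 1))                    # ('dead_letter', 0.0)
print(next_action(429, 1, retry_after=12.0))  # ('retry', 12.0)
print(next_action(503, 2)[0])                 # retry
```

The jitter term is what keeps a burst of jobs that failed together from retrying together.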
Scheduling and recurring jobs
The other half of background work is scheduled rather than triggered. Three flavors come up in interviews:
Cron-like recurring jobs. Run at fixed intervals — every 5 minutes, every Monday at 09:00 UTC, the first of every month. Tools: Celery Beat, BullMQ repeatable jobs, Kubernetes CronJobs, GitHub Actions schedules, Airflow for orchestration with dependencies.
Delayed jobs. Enqueue now, execute after a delay. The 30-day "have you tried our premium tier" reminder. SQS supports up to 15 minutes delay natively; for longer delays you need a scheduler (Cloud Tasks supports up to 30 days, BullMQ stores delayed jobs in a sorted set indexed by execute-at time).
Workflows. Multi-step orchestration with dependencies, retries per step, and observable state. Airflow's home turf, plus Temporal, Prefect, and Dagster. If the interviewer asks "Airflow vs Celery," answer: Celery for task queues, Airflow for workflows of tasks.
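The delayed-job flavor boils down to a collection ordered by execute-at time. The sketch below uses a heap in memory to show the idea; it is not BullMQ's or Cloud Tasks' implementation, and `DelayedQueue` with its methods is a hypothetical name.

```python
import heapq
import itertools


class DelayedQueue:
    """Jobs ordered by execute-at time; only due jobs are handed to workers."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so equal times stay FIFO

    def schedule(self, job, execute_at):
        heapq.heappush(self._heap, (execute_at, next(self._seq), job))

    def pop_due(self, now):
        """Remove and return every job whose execute-at time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


q = DelayedQueue()
thirty_days = 30 * 24 * 3600
q.schedule("premium-reminder:user-7", execute_at=thirty_days)
q.schedule("welcome-email:user-7", execute_at=60)
print(q.pop_due(now=120))          # ['welcome-email:user-7']
print(q.pop_due(now=thirty_days))  # ['premium-reminder:user-7']
```

A scheduler process would call `pop_due` on a timer and push the results onto the regular work queue.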
A practical scheduling question: "How would you send 10 million emails at 09:00 UTC tomorrow?" The wrong answer is "enqueue 10 million jobs at 09:00." The right answer is to smear the load: spread execute-at times across a window (say, 09:00-09:30), so you do not melt your SMTP provider or your worker pool.
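Smearing is mostly arithmetic, which makes it easy to show. The sketch below assumes the emails are grouped into batches of 50,000 (an arbitrary figure chosen for illustration) and uses a hypothetical `smear` helper; the send date is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def smear(start, window, total_jobs, batch_size):
    """Split total_jobs into batches and space their execute-at times evenly."""
    batches = -(-total_jobs // batch_size)  # ceiling division
    step = window / batches
    plan = []
    remaining = total_jobs
    for i in range(batches):
        size = min(batch_size, remaining)
        plan.append((start + i * step, size))
        remaining -= size
    return plan


start = datetime(2030, 1, 1, 9, 0, tzinfo=timezone.utc)  # placeholder date
plan = smear(start, timedelta(minutes=30), total_jobs=10_000_000, batch_size=50_000)
print(len(plan))                # 200
print(plan[1][0] - plan[0][0])  # 0:00:09
print(plan[-1][0].time())       # 09:29:51
```

Two hundred batches nine seconds apart give the provider a steady rate instead of a single spike at 09:00.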
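Gotcha: scheduled jobs running in multiple regions will fire once per region unless you add a distributed lock. Either pin scheduling to one region or have the handler claim each run in a shared database: take SELECT ... FOR UPDATE on a cron_runs row keyed by (job_name, scheduled_at), check whether the run is already marked done, and mark it before committing. A lock alone only serializes the regions; the recorded run is what stops the second one from firing.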
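One way to get the once-per-run guarantee is to let a unique key do the arbitration. Note that this swaps the row lock for a unique-key insert on the same cron_runs idea; SQLite stands in for a database that all regions share, and `try_claim` and the `region` column are hypothetical.

```python
import sqlite3


def make_db():
    """Create the illustrative cron_runs table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cron_runs ("
        "job_name TEXT, scheduled_at TEXT, region TEXT, "
        "PRIMARY KEY (job_name, scheduled_at))"
    )
    conn.commit()
    return conn


def try_claim(conn, job_name, scheduled_at, region):
    """Return True for exactly one caller per (job_name, scheduled_at)."""
    try:
        with conn:  # commit on success, roll back on a key conflict
            conn.execute(
                "INSERT INTO cron_runs (job_name, scheduled_at, region) "
                "VALUES (?, ?, ?)",
                (job_name, scheduled_at, region),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # another region already claimed this run


conn = make_db()
run = ("nightly-cleanup", "2030-01-01T02:00:00Z")
print(try_claim(conn, *run, region="us-east"))  # True
print(try_claim(conn, *run, region="eu-west"))  # False
```

The region that loses the insert skips the run; the row also doubles as an audit trail of which region fired.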
Common pitfalls
The most common failure mode is conflating queues with databases. Candidates store "jobs" in a Postgres table with a status column and poll with SELECT * FROM jobs WHERE status = 'pending' LIMIT 10. Without FOR UPDATE SKIP LOCKED you get the same job picked up multiple times, and even with it throughput is bound by row-level locks. A real queue gives you reservation, visibility timeout, and acks for free. If you must use Postgres as a queue (legitimate from a few hundred up to low thousands of jobs per second), use SKIP LOCKED and say so out loud.
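If Postgres is the queue, the claim step might look like the sketch below. It assumes a psycopg-style DB-API connection and a `jobs` table with `id`, `status`, and `payload` columns; only `status` comes from the text, the rest is hypothetical, and `claim_jobs` is an illustrative helper rather than a library function.

```python
# Claim a batch of pending jobs in one statement. Rows already locked by
# another worker are skipped instead of waited on.
CLAIM_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id IN (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_jobs(conn, limit=10):
    """Mark up to `limit` pending jobs as running and return them."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, (limit,))
        rows = cur.fetchall()
    conn.commit()  # releases the row locks; the status change is now visible
    return rows
```

This covers reservation only. A crashed worker leaves its rows in 'running', so a real table would also need something like a claimed-at timestamp and a sweeper to play the role of a visibility timeout.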
A second pitfall is swallowing exceptions inside the worker. A try / except / pass block will ack the message even though the work failed, and the queue will look healthy while your business logic silently drops on the floor. The right pattern: only ack on confirmed success, nack-with-requeue on transient errors, nack-without-requeue for poison. Every exception is either retryable or not — there is no third bucket.
A third pitfall is unbounded fan-out. A single user action triggers a notification job, which enqueues email, SMS, push, and webhook fan-out, each enqueuing five more downstream jobs. Within a couple of hops your queue depth explodes and workers saturate on one user's work. Cap fan-out at the source, batch where possible.
A fourth pitfall is forgetting timezone handling on scheduled jobs. Cron on a worker pod with TZ=UTC and cron on a laptop with TZ=America/Los_Angeles produce different fire times. Always run schedulers in UTC and write the expected fire time as a comment next to every cron expression. Daylight saving has eaten more reports than any other single cause.
A fifth pitfall is assuming ordering you do not have. Inside a Kafka partition, ordering is preserved. Across partitions, it is not. On a standard SQS queue it is not either, and a message redelivered after its visibility timeout can arrive behind newer ones. If your business logic requires "process A before B," put A and B on the same partition key, or design B to detect that A has not happened yet and back off for a later retry rather than run out of order.
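The "same partition key" advice works because the partition is a pure function of the key. The sketch below shows the idea with a stable hash; it is not Kafka's actual partitioner, and `partition_for` and the event names are hypothetical.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


events = [("order-17", "created"), ("order-99", "created"), ("order-17", "paid")]
for key, event in events:
    print(key, event, "-> partition", partition_for(key, 12))

# Both order-17 events land on the same partition, so they keep their order.
print(partition_for("order-17", 12) == partition_for("order-17", 12))  # True
```

Events for different orders may share a partition or not; the only guarantee is that one key never spreads across two.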
FAQ
What is the difference between a job queue and a message bus?
A job queue (Sidekiq, Celery, RQ, BullMQ, SQS) is built around the idea of work units that get executed once. Each job has retries, a DLQ, and a clear success/failure state. A message bus (Kafka, Kinesis, Pub/Sub) is built around events that get observed by many consumers. The same event can be replayed, consumed by ten different services, and is not "done" when one consumer processes it. Use a queue when the question is "did this work get done." Use a bus when the question is "what happened, and who needs to know."
When should I use exactly-once delivery?
Almost never, and that is the honest answer. True exactly-once across a queue and your database requires either a transactional outbox (you write the job and the business state in the same DB transaction) or a queue with transactional semantics (Kafka transactions, sometimes). In practice, most production systems run at-least-once delivery plus idempotent handlers, which gives the same observable result with far less complexity. If an interviewer pushes on this, name the outbox pattern and idempotency keys and move on.
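The outbox pattern named above can be sketched with SQLite standing in for the application database. The table layout and the `place_order` and `relay` helpers are assumptions made for illustration; `publish` is whatever function hands a job to the real queue.

```python
import json
import sqlite3


def make_db():
    """Create illustrative orders and outbox tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, job TEXT, sent INTEGER DEFAULT 0)"
    )
    conn.commit()
    return conn


def place_order(conn, order_id, total_cents):
    """Write the business row and the job row in one transaction."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        job = json.dumps({"type": "send_receipt", "order_id": order_id})
        conn.execute("INSERT INTO outbox (job) VALUES (?)", (job,))


def relay(conn, publish):
    """Publish unsent outbox rows, then mark them sent. Returns the count."""
    rows = conn.execute(
        "SELECT id, job FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, job in rows:
        publish(json.loads(job))  # a crash before the UPDATE means a resend
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = make_db()
published = []
place_order(conn, "order-42", 1999)
print(relay(conn, published.append))  # 1
print(relay(conn, published.append))  # 0
print(published)  # [{'type': 'send_receipt', 'order_id': 'order-42'}]
```

The relay is still at-least-once, which is why the outbox is usually paired with idempotent handlers rather than treated as a replacement for them.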
How do I size my worker pool?
Start with Little's Law: workers needed = arrival rate × mean processing time. If you receive 100 jobs per second averaging 200 ms each, you need 20 concurrent slots at the absolute minimum. Multiply by 2-3x for headroom (traffic spikes, retries, slow third-party calls) and split across enough processes that one crashing does not take you under capacity. Then watch queue depth in production and adjust — autoscaling on queue depth is the cleanest signal you will get.
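The sizing arithmetic is short enough to keep as a helper. This is a sketch of the calculation described above; `worker_slots` is a hypothetical name and the default headroom of 2x is just the low end of the suggested range.

```python
import math


def worker_slots(arrival_rate_per_s, mean_seconds, headroom=2.0):
    """Return (minimum_slots, provisioned_slots) using Little's Law."""
    # Round before ceil so float noise cannot bump the result up by one.
    minimum = math.ceil(round(arrival_rate_per_s * mean_seconds, 9))
    provisioned = math.ceil(round(minimum * headroom, 9))
    return minimum, provisioned


print(worker_slots(100, 0.2))       # (20, 40)
print(worker_slots(100, 0.2, 3.0))  # (20, 60)
print(worker_slots(35, 1.5))        # (53, 106)
```

With 10 concurrent slots per process, the first case works out to 4 processes at 2x headroom and 6 at 3x.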
Should I use a separate queue per job type?
Usually yes, for two reasons. First, different jobs have different SLAs — sending a password reset email needs to leave the queue in seconds, while generating a monthly invoice can wait minutes. Mixing them on one queue means a slow invoice job blocks a fast email job behind it. Second, separate queues let you scale workers independently and apply different retry policies. The cost is more queues to monitor, but that is cheap compared to debugging head-of-line blocking under load.
What is a dead letter queue and when should I drain it?
A DLQ is where messages go after they have exhausted their retry budget. The point is to stop the bleeding — get the poison message off the active queue so the rest of the work can flow — without losing the payload. Drain it when a human (or automated rule) has fixed the root cause: redeployed the service, patched the handler, restored the missing referenced row. Run a daily report on DLQ depth as a leading indicator of degradation.
Can I use Postgres as a queue?
Yes, up to a point. With SELECT ... FOR UPDATE SKIP LOCKED and an index on (status, scheduled_at), Postgres handles a few hundred to low thousands of jobs per second comfortably. The win is operational simplicity — no extra service to run, transactional consistency with your business data, easy to inspect. The loss is throughput ceiling and writing your own retry, visibility timeout, and DLQ logic. Past low-thousands per second, move to a purpose-built queue.