Latency budget for systems analyst interviews

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What a latency budget actually is

Picture the moment: a Stripe interviewer draws a checkout API on the whiteboard and says, "p99 needs to come in under 500ms. Walk me through how you'd allocate that." A latency budget is exactly that allocation — the total user-facing p99 SLA carved into per-component slices that each owner has to defend. It is not an aspiration. It is a contract between teams that says: "if your service eats more than 80ms at p99, you broke the page, not me."

This shows up in every senior SA loop at Google, Meta, Amazon, Uber, and DoorDash because it forces end-to-end percentile thinking under pressure. Candidates who fail quote vague numbers. Candidates who pass walk in with a default split and adjust as constraints land.

Load-bearing trick: Always start with the total p99 number the user can tolerate, then subtract — never add up and hope it fits. Subtraction forces you to make trade-offs explicit.

A reasonable starter budget for a synchronous read API at p99 = 500ms looks like the table below. Memorize it; you will reuse it in every design round.

| Component | Budget (ms) | Owner | Notes |
| --- | --- | --- | --- |
| TLS + ingress + LB | 30 | Platform/infra | Edge POP to origin, kept-alive conn |
| Auth / token validation | 40 | Identity team | JWT verify + permission cache lookup |
| App logic + serializer | 60 | Service owner | Validation, mapping, JSON encode |
| Primary DB read | 180 | DB team | Index hit, p99 not p50 |
| Downstream RPC fan-out | 120 | Each callee | Parallel where possible |
| Cache + network slack | 70 | Shared | Safety margin for GC pauses, retries |
| Total p99 | 500 | | |
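
The subtraction habit can be sketched in a few lines of Python. This is only an illustration of the starter table: the dictionary keys and the helper name `remaining_after` are made up for this example, and the numbers are the ones from the table.

```python
# Starter budget from the table above, in milliseconds at p99.
TOTAL_P99_MS = 500

BUDGET_MS = {
    "tls_ingress_lb": 30,
    "auth": 40,
    "app_logic_serializer": 60,
    "primary_db_read": 180,
    "downstream_rpc_fanout": 120,
    "cache_network_slack": 70,
}


def remaining_after(budget, total_ms):
    """Subtract each slice from the total and record what is left after it."""
    remaining = total_ms
    trail = []
    for component, cost_ms in budget.items():
        remaining -= cost_ms
        trail.append((component, remaining))
    return trail


trail = remaining_after(BUDGET_MS, TOTAL_P99_MS)
for component, left_ms in trail:
    print(f"{component:<24} leaves {left_ms:>4} ms")
```

If the last row ends below zero, the design is over budget and something has to give before the conversation moves on.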

Breaking the SLA into components

The first move in the interview is to trace one request end-to-end. Pick the exact path: edge load balancer, API gateway, auth middleware, your service, every downstream call, the database round trip, the serializer, the egress. Each hop earns a line in the budget. If you cannot draw the hops, you cannot allocate them.

Allocate proportional to expected work, not equally. A JWT validation against a warm cache is ~5-15ms; a Postgres index seek against a billion-row table is 20-80ms at p99; a cross-region call to a payments provider is 150-300ms before you do anything. Give each component a number that survives a sanity check from someone who runs that system in production.

Three caveats the interviewer is waiting for you to mention:

Network round-trips are systematically underestimated. Intra-AZ is sub-millisecond, but cross-region (us-east-1 → eu-west-1) is 70-90ms one way. Two sequential RPCs across that link, each a full round trip, burn roughly 300ms before any logic runs.

Cache misses must be in the budget. A 95% hit-rate cache is a 5% tail-latency landmine. If a miss costs 200ms and you priced the call at 20ms, your p99 explodes the moment traffic shifts. Budget the miss path.

Retries are extra latency, not free reliability. One retry on a 100ms call with 50ms backoff turns 100ms into 250ms under failure. Retry budgets belong at the caller, not the callee.
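
The cache-miss and retry caveats above are easy to check numerically. The sketch below models the cached call as a two-outcome distribution (hit or miss) and assumes a failed attempt takes as long as a successful one; the function names are hypothetical and the figures come from the caveats.

```python
def mixture_percentile(outcomes, q):
    """Smallest latency whose cumulative probability reaches quantile q.

    outcomes is a list of (latency_ms, probability) pairs that sum to 1.
    """
    cumulative = 0.0
    for latency_ms, probability in sorted(outcomes):
        cumulative += probability
        if cumulative >= q - 1e-12:  # small tolerance for float rounding
            return latency_ms
    return max(latency_ms for latency_ms, _ in outcomes)


def latency_with_one_retry(attempt_ms, backoff_ms):
    """Failed first attempt, then backoff, then a successful second attempt."""
    return attempt_ms + backoff_ms + attempt_ms


# 95% hits priced at 20ms, 5% misses costing 200ms.
cache_call = [(20, 0.95), (200, 0.05)]
p50_ms = mixture_percentile(cache_call, 0.50)
p99_ms = mixture_percentile(cache_call, 0.99)
retry_ms = latency_with_one_retry(attempt_ms=100, backoff_ms=50)
print(f"p50={p50_ms}ms p99={p99_ms}ms one-retry={retry_ms}ms")
```

With a 95% hit rate the p99 sits entirely on the miss path, which is why the miss cost, not the hit cost, belongs in the budget line.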

Sequential vs parallel math

This is the part of the question where most candidates fumble the arithmetic. The rule is short.

For sequential calls, latencies sum:

A → B (50ms) → C (100ms) → D (50ms)
Total = 50 + 100 + 50 = 200ms

For parallel calls (fan-out, then join), latency is that of the slowest branch, plus the small fan-out and gather overhead:

A fires [B, C, D] in parallel
B = 50ms, C = 100ms, D = 50ms
Total ≈ max(50, 100, 50) = 100ms
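
As a quick sanity check, the two rules reduce to `sum` and `max`. The helper names and the 3ms overhead below are illustrative assumptions, not measured values.

```python
def sequential_ms(latencies_ms):
    """Calls made one after another: latencies add up."""
    return sum(latencies_ms)


def parallel_ms(latencies_ms, overhead_ms=0):
    """Calls made concurrently: the slowest branch dominates."""
    return max(latencies_ms) + overhead_ms


hops = {"B": 50, "C": 100, "D": 50}
chain_total = sequential_ms(hops.values())
fanout_total = parallel_ms(hops.values())
fanout_with_overhead = parallel_ms(hops.values(), overhead_ms=3)  # 3ms is illustrative
print(chain_total, fanout_total, fanout_with_overhead)
```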

The follow-up: "What's the p99 of a fan-out to 10 backends that each have p99 = 100ms?" The trap is to say "100ms". The real answer depends on the shape of the tail, but it is often closer to 200-300ms. If each call independently has a 1% chance of exceeding 100ms, ten parallel calls give a 1 - (0.99)^10 ≈ 9.6% chance that at least one call lands in the tail, so the fan-out p99 drifts toward the p99.9 of the individual call. This is why services that fan out aggressively (search, feed ranking) lean on hedged requests and tight deadlines.
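
The fan-out arithmetic can be reproduced directly. This sketch assumes the backend calls are statistically independent, which real systems only approximate; both function names are made up for illustration.

```python
def prob_any_in_tail(n_calls, per_call_tail=0.01):
    """Chance that at least one of n independent calls exceeds its own p99."""
    return 1 - (1 - per_call_tail) ** n_calls


def per_call_quantile_needed(n_calls, target_quantile=0.99):
    """Per-call quantile each backend must meet so that the whole
    fan-out meets the target quantile (independence assumed)."""
    return target_quantile ** (1 / n_calls)


tail_chance = prob_any_in_tail(10)
needed = per_call_quantile_needed(10)
print(f"chance some call is in the tail: {tail_chance:.1%}")
print(f"per-call quantile needed for fan-out p99: {needed:.4f}")
```

The second number lands very close to 0.999, which is the "p99 of the fan-out is roughly p99.9 of one call" rule of thumb.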

| Pattern | Latency formula | When it wins |
| --- | --- | --- |
| Sequential RPC chain | sum of components | Strict ordering needed |
| Parallel fan-out | max(branches) + overhead | Independent reads |
| Hedged request | min(primary, hedge delay + hedge) | Tail latency dominated |
| Scatter-gather + cap | min(max(branches), deadline) | Partial results acceptable |

Slow chains and the compounding problem

A common production sin is the synchronous RPC chain: service A calls B, which calls C, which calls D, which calls E. Five services at 50ms each give a floor of 250ms before any retries, before any GC pause, before any one node has a bad second.

Service A → B → C → D → E
50ms each × 5 services = 250ms floor

Add a single retry at each hop and you are at 400-500ms when one node misbehaves. The fix is structural, not tactical. You do not optimize a chain by shaving 5ms off each hop; you flatten it.

Gotcha: Long sync chains are a budget tax that compounds with every failure mode. If you see four-plus hops in a design, push back on the architecture, not the per-hop latency.

Three flattening strategies that come up in every design round:

The first is async event handoff. If C → D → E is "do work and notify", convert that tail into a Kafka topic or an SQS queue and let it run off the hot path. The user gets a response after A → B → C; D and E consume the event independently. You give up synchronous confirmation that the downstream work finished in exchange for a lower end-to-end p99, which is the right trade for write-heavy workflows.

The second is parallel aggregation. If B, C, D are independent lookups, replace the chain with a fan-out: A calls B, C, D in parallel and merges the results. This drops sum-of-three to max-of-three, often a 2-3x p99 improvement.
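
A minimal sketch of parallel aggregation with `asyncio.gather` follows. The `lookup` coroutine is a stand-in that just sleeps; in a real service it would be an RPC client call, and the delays here are scaled-down placeholders rather than measurements.

```python
import asyncio
import time


async def lookup(name, delay_s):
    """Hypothetical backend call, simulated with a sleep."""
    await asyncio.sleep(delay_s)
    return name, delay_s


async def aggregate():
    # Fire B, C and D concurrently and wait for all of them.
    results = await asyncio.gather(
        lookup("B", 0.05),
        lookup("C", 0.10),
        lookup("D", 0.05),
    )
    return dict(results)


start = time.perf_counter()
merged = asyncio.run(aggregate())
elapsed_s = time.perf_counter() - start
print(f"merged {sorted(merged)} in {elapsed_s * 1000:.0f} ms")
```

Elapsed time tracks the slowest branch rather than the sum of all three, which is the whole point of replacing the chain.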

The third is pre-aggregation. Compute the answer on write and stash it in a materialized view, denormalized column, or Redis hash. The hot path becomes a single key lookup — Notion's page tree and Linear's "my issues" view are both shaped this way.

Optimization levers

Here is the menu of moves you reach for, in rough order of payback per engineering hour.

Caching is the highest-leverage lever for read-heavy paths. A 99% hit-rate read-through cache turns a 50ms DB query into a 1ms memory hit. Cover the miss path in your budget.

Async processing moves any work the user does not need to see right now off the hot path. Confirmation emails, indexing updates, fan-out notifications, analytics events — none should live inside the p99 budget.

Batching combines multiple operations into one round-trip. A loop calling getUser(id) a hundred times is a hundred network round-trips; a single getUsers([ids]) is one. Savings are linear up to 100-500 items per batch.
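
One way to picture the batching lever is a helper that splits IDs into capped batches and counts round trips. `fetch_users_batched` and `fake_get_users` are hypothetical names for this sketch; the 500-item cap mirrors the upper end of the range mentioned above.

```python
def chunked(ids, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def fetch_users_batched(ids, get_users, batch_size=500):
    """Call get_users once per batch instead of once per ID."""
    users = []
    round_trips = 0
    for batch in chunked(ids, batch_size):
        users.extend(get_users(batch))
        round_trips += 1
    return users, round_trips


def fake_get_users(batch):
    """Stand-in for a real batched endpoint."""
    return [{"id": user_id} for user_id in batch]


user_ids = list(range(1200))
users, round_trips = fetch_users_batched(user_ids, fake_get_users)
print(f"{len(users)} users fetched in {round_trips} round trips")
```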

Connection pooling removes TCP and TLS handshake cost from every call. A cold connection to Postgres costs 5-20ms; a pooled connection costs close to zero. Always assume pooling exists, and say so explicitly when you present the budget.

Compression trades CPU for network bytes. For payloads over ~1KB on cross-region links, gzip or zstd is a clear win. For tiny intra-AZ calls, skip it.

Edge / CDN placement moves static and cacheable content closer to the user. A 70ms cross-continent fetch becomes a 5ms POP hit — the premise behind Vercel's edge network.

Read replicas distribute read load. The catch is replication lag: stale reads can violate read-your-writes expectations if you do not route consistency-sensitive reads to the primary.

Pre-computation is the nuclear option: materialized views, denormalized join tables, pre-aggregated rollups. The cost moves from read to write and storage. Worth it for top-of-funnel views hit by hundreds of millions of users.

Common pitfalls

The first pitfall is budgeting the mean and reporting the p99. Engineers calculate the latency budget using average DB response times, then ship and find p99 is 4x worse than the budget said it would be. The fix is to use the p99 of each component, not the mean, and to verify with histograms — not summary statistics — that the tail is actually where you think it is. Tools like Prometheus histograms and Datadog distributions exist for this reason.
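
To see how far the mean can sit from the tail, consider a made-up sample of 100 request latencies where 3 are slow. The data and the nearest-rank `percentile` helper are illustrative only; production numbers should come from real histograms.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: pct is a number from 0 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 97 fast requests and 3 slow ones, in ms.
latencies_ms = [20] * 97 + [400] * 3

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

A budget built on the mean of this sample would be off by more than an order of magnitude at p99.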

The second pitfall is ignoring GC pauses and noisy neighbors. A JVM service with a 200ms full-GC pause once an hour will blow your p99 budget every single hour. If the interviewer mentions JVM or Go GC, expect a follow-up on how you measure and amortize pause time into the budget — usually as a "cache + network slack" line item, not as a fight with the garbage collector.

The third pitfall is treating downstream SLAs as guarantees. Your auth team says "we serve at 30ms p99". You budget 30ms. Six months later they ship a regression and serve at 80ms. Your p99 is now broken and you have no early warning. The fix is to measure your own observed latency to each downstream and alert when it drifts from the budgeted number, not to trust someone else's dashboard.

The fourth pitfall is forgetting the long tail of retries and timeouts. A 100ms call with a 500ms timeout and one retry takes 600ms when the first attempt times out and the retry succeeds at normal speed, and up to 1,000ms if the retry also runs to its timeout. If your retry policy is "retry once on 5xx", those cases are part of your p99 budget whether you wrote them down or not.
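
The timeout arithmetic is worth writing out once. In this sketch the 600ms figure is the case where the first attempt times out and the retry succeeds at typical speed; if every attempt runs to its timeout the bound is higher. The function name and the zero default backoff are assumptions for illustration.

```python
def retry_latency_ms(typical_ms, timeout_ms, retries=1, backoff_ms=0):
    """Return (timeout then success, every attempt times out) in ms."""
    # First attempt hits the timeout, the first retry succeeds normally.
    timeout_then_success = timeout_ms + backoff_ms + typical_ms
    # Every attempt, including all retries, runs to the timeout.
    all_attempts_time_out = (retries + 1) * timeout_ms + retries * backoff_ms
    return timeout_then_success, all_attempts_time_out


recovered_ms, exhausted_ms = retry_latency_ms(typical_ms=100, timeout_ms=500)
print(f"retry succeeds: {recovered_ms}ms, retry also times out: {exhausted_ms}ms")
```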

The fifth pitfall is optimizing the wrong hop. Juniors shave 5ms off the serializer while the real bottleneck is a 250ms cross-region payment call. Sort the budget table by latency consumed and attack the top line. Amdahl's law applies to interviews too.

If you want to drill systems-design questions like this every day, NAILDD ships 500+ SA interview problems across exactly this shape — design rounds, percentile math, and SLA breakdowns from real loops.

FAQ

Is latency budget the same as SLO?

They are related but not the same. An SLO (service level objective) is the user-facing target — "99% of requests return under 500ms over a 28-day window". A latency budget is the internal allocation of that SLO across components — auth gets 40ms, DB gets 180ms, downstream gets 120ms. The SLO is the contract with the user; the budget is the contract between teams that lets you meet the SLO. You build the budget backwards from the SLO.

Do I budget p50, p95, or p99?

Always start with the percentile your SLO is written against, usually p99 or p99.9 for user-facing APIs. Internal services that are not directly on the hot path may use p95. Budgeting on mean or p50 is the most common rookie mistake — your tail will betray you in production every single time.

How do I budget for a cross-region call?

Treat cross-region as a fixed tax of 70-90ms one-way between US east and EU, 120-180ms between US and APAC, and add 1-2ms intra-AZ for the local hops. A single synchronous cross-region call eats 150-200ms before you do any work. If your SLA is under 300ms, you almost certainly cannot afford a synchronous cross-region call on the hot path, and you should answer the interview question with "we'd replicate or cache regionally" rather than "we'd just make the call".

What's a good default budget for a write API?

Writes typically get a looser budget than reads because they often involve persistence, replication ack, and downstream event publishing. A common split for a write at p99 = 800ms looks like: 30ms ingress + 40ms auth + 80ms validation/business logic + 250ms primary DB write with sync replication + 200ms downstream RPC + 100ms event publish + 100ms slack. Adjust based on whether replication is sync or async and how many synchronous side effects the write triggers.

How do hedged requests fit into the budget?

A hedged request fires the primary call and, after a short delay (often p95 of the call), fires a second copy to a different replica. The user sees min(primary, hedge delay + hedge). This trims tail latency at the cost of extra backend load, typically 5-10% more requests on the dependency. Budget the hedged call as the expected min, not the raw p99, but make sure your downstream capacity plan accounts for the hedge multiplier. Google's "The Tail at Scale" paper is the canonical reference.
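
A rough asyncio sketch of the hedging pattern described above. `call_replica` is a stand-in that sleeps instead of making a network call, and the delays are arbitrary small values chosen for the demo, not recommended settings.

```python
import asyncio


async def call_replica(name, latency_s):
    """Hypothetical replica call, simulated with a sleep."""
    await asyncio.sleep(latency_s)
    return name


async def hedged(primary_latency_s, hedge_latency_s, hedge_delay_s):
    primary = asyncio.create_task(call_replica("primary", primary_latency_s))
    # Give the primary a head start equal to the hedge delay.
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()
    # Primary is slow: fire the hedge and take whichever finishes first.
    hedge = asyncio.create_task(call_replica("hedge", hedge_latency_s))
    done, pending = await asyncio.wait(
        {primary, hedge}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    return done.pop().result()


slow_winner = asyncio.run(hedged(0.30, 0.02, hedge_delay_s=0.05))
fast_winner = asyncio.run(hedged(0.01, 0.02, hedge_delay_s=0.05))
print(slow_winner, fast_winner)
```

The hedge only fires when the primary has already exceeded the delay, which is what keeps the extra load to a small fraction of requests.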

When should I push back on the SLA itself?

If the budget does not add up (the total of realistic per-component p99s exceeds the SLA), pushing back on the SLA is the senior move, not optimizing further. In the interview, say: "Given a cross-region payment call alone is 180ms p99, a 200ms total SLA leaves only 20ms for everything else, which is not feasible without caching the response. We should either relax the SLA to 400ms, cache aggressively, or move the payment off the hot path." That answer signals you understand the trade space, which is exactly what the loop is testing for.
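
The feasibility argument can be made concrete with a small gap calculation. The 180ms payment call and the 200ms and 400ms SLAs come from the answer above; the other component figures are borrowed from the starter read budget as an assumption, and `budget_gap_ms` is a hypothetical helper.

```python
def budget_gap_ms(component_p99_ms, sla_ms):
    """Positive result: over budget by that many ms. Negative: slack left."""
    return sum(component_p99_ms.values()) - sla_ms


hot_path = {
    "ingress": 30,
    "auth": 40,
    "app_logic": 60,
    "cross_region_payment": 180,
}

gap_at_200 = budget_gap_ms(hot_path, sla_ms=200)
gap_at_400 = budget_gap_ms(hot_path, sla_ms=400)

# Option three: move the payment call off the hot path entirely.
without_payment = {
    name: ms for name, ms in hot_path.items() if name != "cross_region_payment"
}
gap_async_payment = budget_gap_ms(without_payment, sla_ms=200)

print(gap_at_200, gap_at_400, gap_async_payment)
```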