Circuit breaker in systems analyst interviews
Why interviewers ask about it
When a hiring manager at a microservices-heavy company (think Stripe, DoorDash, Uber, Netflix) asks a systems analyst about resilience, they want to know whether you can reason about a graph of services where any node can fail without taking the whole product down with it. The circuit breaker is the pattern that separates "I read a Fowler blog post once" from "I've watched a retry storm eat a downstream service alive."
The question shows up in two flavors. The first is definitional: explain circuit breaker, how it differs from a retry, and when you'd use each. The second is the senior version: here is a sketch of three services, one is flaky, design the resilience strategy in the non-functional requirements. If you answer the second one with only "add a circuit breaker" you've failed — the right answer is a layered combination of timeout + retry with backoff + circuit breaker + bulkhead, with thresholds you can defend.
One-line takeaway: A circuit breaker stops your service from making calls to a downstream that is already on fire, so your threads, sockets, and users don't burn with it.
The core idea
Service A depends on service B. B starts degrading: a database failover, a noisy neighbor, a bad deploy. A keeps sending traffic. Each request to B now takes 30 seconds instead of 30 milliseconds. A's thread pool fills with calls waiting on B. New requests to A, even ones that have nothing to do with B, start queueing. This is how a single slow dependency takes down the whole upstream. Engineers call it a cascading failure, and it is one of the most common multi-service outage patterns in production.
A circuit breaker sits in front of the call to B and watches the failure rate. Once failures cross a threshold — say 5 failures in 60 seconds, or 50% error rate over 100 requests — the breaker trips. It enters an open state and short-circuits all subsequent calls, returning an error immediately without touching the network. After a cooldown, it allows one or two probe requests through. If those succeed, normal traffic resumes; if they fail, the breaker stays open and the cooldown restarts.
The benefit is two-sided. Upstream, A stops wasting threads and sockets on doomed calls, so the rest of its traffic keeps moving. Downstream, B gets relief — instead of a retry storm hammering an already-sick service, the breaker gives it room to recover.
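The percentage-based trigger mentioned above ("50% error rate over 100 requests") can be sketched as a count-based sliding window. This is a minimal illustration, not any particular library's algorithm; the class name `FailureRateWindow` and its parameters are hypothetical.

```python
from collections import deque


class FailureRateWindow:
    """Count-based sliding window over the most recent call outcomes (hypothetical helper)."""

    def __init__(self, window_size=100, failure_rate_threshold=0.5):
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        # A deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)

    def record(self, success):
        self.outcomes.append(bool(success))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_trip(self):
        # Wait for a full window so one early failure (1 of 1 = 100%)
        # does not trip the breaker on its own.
        if len(self.outcomes) < self.window_size:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

Waiting for a full window before evaluating is a design choice in this sketch: it trades slower detection at startup for fewer false trips on low traffic.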
States: closed, open, half-open
The three-state machine is the whole pattern. Memorize the transitions; this is what gets drawn on the whiteboard.
| State | Behavior | Transition trigger |
|---|---|---|
| Closed | All requests pass through to downstream. Failure counter increments on errors. | Failure count exceeds threshold → Open |
| Open | All requests fail fast with no network call. Counter frozen. | Cooldown timer expires → Half-open |
| Half-open | A small number of probe requests are allowed through. | Probes succeed → Closed. Probes fail → Open (cooldown restarts) |
```
          request fails (count >= threshold)
closed ───────────────────────────────────────→ open
  ↑                                               │
  │ probes pass                                   │ cooldown timer expires
  │                                               ↓
  └────────────── half-open ←─────────────────────┘
            (one or a few probe requests)
```

The half-open state is where most candidates trip up. It is not a slow ramp-up. It is a single binary test: send one probe, if it works the breaker fully closes and full traffic resumes; if it fails, slam back open. Some implementations allow N concurrent probes, but the principle is the same: you are not gradually warming up, you are checking whether the downstream is alive.
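The transition table can also be written down as a lookup, which is a handy way to check that you have every arrow. This is a sketch only; the event names are made up for illustration.

```python
# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    ("closed", "threshold_exceeded"): "open",
    ("open", "cooldown_expired"): "half-open",
    ("half-open", "probe_succeeded"): "closed",
    ("half-open", "probe_failed"): "open",
}


def next_state(state, event):
    """Return the new state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Note there is no direct path from open to closed: the only way back to normal traffic is through a successful probe in half-open.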
Retries and timeouts
A circuit breaker on its own does almost nothing useful. It needs siblings.
A timeout is the floor of any resilience strategy. Without a per-call timeout, "the downstream is slow" looks indistinguishable from "the downstream is fine" until your thread pool dies. For user-facing services the typical budget is 500ms to 3 seconds. A 60-second timeout is essentially "no timeout" — by the time it fires, the user has already left and your queues have already overflowed.
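As a rough sketch of a per-call budget, the wrapper below runs a call on a worker thread and stops waiting after a deadline. The helper name `call_with_timeout` and the pool size are assumptions for illustration.

```python
import concurrent.futures
import time

# Shared worker pool for outbound calls (size chosen arbitrarily for the sketch).
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn on a worker thread and stop waiting after timeout_s seconds."""
    future = _executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started yet;
        # a call that is already running keeps its worker thread busy.
        future.cancel()
        raise TimeoutError(f"call exceeded {timeout_s}s budget") from None
```

This only bounds how long the caller waits. The worker thread stays occupied until the underlying call returns, so in practice you would also set connect and read timeouts on the client doing the network I/O.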
Retry with exponential backoff is the next layer. The first retry fires after one second, the next after two, the next after four, then eight. Fixed-interval retries keep the pressure on a struggling downstream constant, while doubling the wait backs off as the failure persists. Every retry should also include jitter, a random offset of plus-or-minus a few hundred milliseconds; otherwise every client that failed simultaneously will retry simultaneously (the thundering herd), and you'll DDoS the downstream right at the moment it's trying to recover.
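A minimal sketch of that schedule, assuming a 1-second base and ±200ms of jitter. The function names are hypothetical, and catching every `Exception` is a simplification: real code would retry only errors known to be transient.

```python
import random
import time


def backoff_delays(retries=3, base_s=1.0, jitter_s=0.2):
    """Wait before each retry: base_s * 2^n, plus a random offset within ±jitter_s."""
    delays = []
    for attempt in range(retries):
        delay = base_s * (2 ** attempt) + random.uniform(-jitter_s, jitter_s)
        delays.append(max(0.0, delay))
    return delays


def retry_with_backoff(fn, retries=3, base_s=1.0, jitter_s=0.2, sleep=time.sleep):
    """Call fn, retrying up to `retries` times with exponential backoff and jitter."""
    for delay in backoff_delays(retries, base_s, jitter_s):
        try:
            return fn()
        except Exception:
            # Simplification: a real client would check the error is transient.
            sleep(delay)
    # Final attempt: let its exception propagate to the caller.
    return fn()
```

The `sleep` parameter exists so the schedule can be exercised without actually waiting, for example by passing a list's `append` method to capture the delays.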
Retries also need idempotency awareness. GET, PUT, and DELETE are idempotent by HTTP semantics: repeating them leaves the server in the same state, so they can be retried. POST is not; a retried POST can create a duplicate order, a duplicate charge, a duplicate notification. The standard solution is an idempotency key in the request header that the downstream uses to deduplicate. You'll see this pattern in every payments API worth using.
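To make the deduplication concrete, here is a toy downstream that remembers results by idempotency key. `ChargeService` and its fields are invented for illustration; a real service would persist keys with an expiry and handle concurrent requests carrying the same key.

```python
import uuid


class ChargeService:
    """Toy downstream that deduplicates POST-style requests by idempotency key."""

    def __init__(self):
        self._results = {}        # idempotency key -> stored response
        self.charges_created = 0

    def create_charge(self, idempotency_key, amount_cents):
        # A repeated key returns the original response instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_created += 1
        result = {"charge_id": self.charges_created, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = ChargeService()
key = str(uuid.uuid4())                      # generated once per logical operation
first = service.create_charge(key, 1999)
retried = service.create_charge(key, 1999)   # e.g. the client retries after a timeout
```

The client generates the key once per logical operation and reuses it on every retry, so the second call returns the stored response rather than billing twice.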
| Layer | What it solves | Typical config |
|---|---|---|
| Timeout | Hung calls eating thread pool | 500ms-3s for user-facing |
| Retry + backoff + jitter | Transient failures, network blips | 3 retries, 1s/2s/4s with ±200ms jitter |
| Circuit breaker | Sustained downstream failure | Trip at 50% errors over 100 requests, cooldown 30s |
| Bulkhead | One slow dep killing unrelated traffic | Separate thread pool per downstream |
Bulkhead pattern
The bulkhead name comes from ship design — watertight compartments that stop a single hull breach from flooding the whole ship. In software, the bulkhead pattern isolates the resources used to talk to each downstream service.
Without a bulkhead, service A uses a single shared thread pool — say 100 threads — for all outbound calls. If downstream B starts taking 10 seconds per call and you get 100 concurrent requests, the entire pool is now blocked on B. Calls to C and D, which are perfectly healthy, can't get a thread. A is effectively down, even though only one of its three dependencies is sick.
With a bulkhead, A allocates a separate pool per downstream — 30 threads for B, 30 for C, 30 for D, plus some shared headroom. When B goes sideways, those 30 threads fill up and new calls to B fail fast, but C and D still have their own pools. The blast radius of one bad dependency is contained.
```
Service A:
├─ Pool for service B (30 threads, queue 50)
├─ Pool for service C (30 threads, queue 50)
└─ Pool for service D (30 threads, queue 50)
```

The combination is powerful. Bulkhead caps the worst-case resource consumption per dependency. Circuit breaker stops sending traffic into a known-bad dependency. Timeouts prevent any single call from hanging. Retries handle transient blips. Each layer covers a different failure mode, and the system stays standing even when one of them misbehaves.
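One lightweight way to sketch a bulkhead is a semaphore per downstream that caps concurrent calls and rejects the overflow immediately. Note this is the semaphore variant rather than the dedicated thread pools described above, and the class names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: fail fast instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"no free slots for {self.name}")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency, a saturated limit for service B raises `BulkheadFullError` for new calls to B while calls guarded by the bulkheads for C and D are unaffected.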
Implementation choices
You don't usually write this from scratch in production code. The mature libraries are well-known and worth naming in an interview.
Resilience4j is the current standard for the JVM. It succeeded Netflix's Hystrix, which entered maintenance mode in 2018; mentioning Hystrix is fine for context, but don't recommend it for new code. Polly is the .NET equivalent, with a fluent policy API. Go developers tend to use either sony/gobreaker or roll a custom 50-line implementation, since the pattern is genuinely small.
For polyglot environments and service meshes, the breaker often lives at the sidecar layer. Envoy, and therefore meshes built on it such as Istio and Consul's service mesh, implements circuit breaking at the proxy, so application code stays oblivious; Linkerd takes the same sidecar approach with its own proxy rather than Envoy. This is an increasingly common pattern at large infrastructure-heavy companies because it lets you set policy uniformly across Java, Python, Go, and Node services without per-language libraries.
For a quick whiteboard sketch, a minimal Python implementation looks like this — useful to show you understand the state machine, not as production code:
```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30):
        self.failures = 0            # consecutive failures since the last success
        self.threshold = threshold   # failures needed to trip the breaker
        self.last_failure = None     # monotonic timestamp of the latest failure
        self.cooldown = cooldown     # seconds to stay open before probing
        self.state = "closed"

    def _is_open(self):
        if self.state == "open":
            # After the cooldown, let the next call through as a probe.
            if time.monotonic() - self.last_failure > self.cooldown:
                self.state = "half-open"
                return False
            return True
        return False

    def call(self, fn, *args, **kwargs):
        if self._is_open():
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            # A failed probe reopens immediately; otherwise trip at the threshold.
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise
        # Any success, including a probe, closes the breaker and resets the count.
        self.failures = 0
        self.state = "closed"
        return result
```
Common pitfalls
The most common interview mistake is setting the failure threshold too high. Saying "open the breaker after 50 failures over five minutes" sounds generous, but it means your service will spend five minutes sending doomed requests into a dying downstream — exactly the cascade failure the pattern is supposed to prevent. A defensible default is 3-5 failures in a 60-second window, or a percentage-based trigger like 50% error rate over the last 100 requests. Tune from there based on actual traffic volume.
Another classic is mismatching timeout to user experience. A 60-second per-call timeout on a user-facing API means a slow downstream effectively hangs the whole product for a minute, which is worse than just returning a fast error. The right number for synchronous user-facing calls is somewhere between 500ms and 3 seconds, with the upper bound being the latency you'd actually accept from a healthy system plus some headroom. For batch and async workloads the budget can be much higher, but state that as a separate decision.
A subtle and dangerous trap is retrying non-idempotent operations. The textbook example is payments: a POST to /charges times out, you retry, the original eventually succeeded, and the customer is double-billed. The fix is to require an idempotency key that downstreams use to deduplicate. Senior interviewers probe specifically for this, because it's the kind of bug that escapes review and shows up in a postmortem.
A fourth pitfall is omitting jitter from retry backoff. Without jitter, every client that failed at the same moment retries at the same moment, hitting the recovering downstream with a synchronized wave. Classic thundering herd. Add ±20-30% randomness to each backoff interval — costs nothing, prevents one of the most embarrassing self-inflicted incidents in distributed systems.
Finally, candidates often rely on the circuit breaker alone. The breaker does nothing if you have no timeout — it never sees a failure, just a hung thread. It does nothing about cascade resource exhaustion across dependencies — that's what the bulkhead is for. Treat it as one layer of four; the layers compose, they don't substitute.
Related reading
- Bulkhead pattern for systems analyst interview
- API Gateway vs BFF for systems analyst interview
- 2PC vs Saga for systems analyst interview
- Cache strategies for systems analyst interview
- Chaos engineering for systems analyst interview
- Backpressure for systems analyst interview
FAQ
Where does a systems analyst actually touch the circuit breaker?
In the non-functional requirements and the integration spec. A typical NFR row reads: calls to PaymentService must have a 2-second timeout, retry up to 3 times with exponential backoff and jitter, and trip a circuit breaker at 50% error rate over a 60-second window with a 30-second cooldown. The SA writes that spec, hands it to engineering, and reviews the implementation. You may also be asked to justify the numbers to a product manager who wonders why "we don't just always retry forever."
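One way to make such an NFR row reviewable is to capture it as a small validated record and compute what the numbers imply. The sketch below is illustrative only: `ResiliencePolicy` is a hypothetical type, and the 1-second base backoff is an assumption, since the example row does not state one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResiliencePolicy:
    dependency: str
    timeout_s: float
    max_retries: int
    failure_rate_threshold: float   # fraction of calls, e.g. 0.5 for 50%
    window_s: int
    cooldown_s: int

    def __post_init__(self):
        if self.timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        if self.max_retries < 0:
            raise ValueError("max_retries cannot be negative")
        if not 0 < self.failure_rate_threshold <= 1:
            raise ValueError("failure_rate_threshold must be in (0, 1]")

    def worst_case_call_budget_s(self, base_backoff_s=1.0):
        """Upper bound on one logical call: every attempt times out, jitter ignored."""
        attempts = self.max_retries + 1
        backoff = sum(base_backoff_s * 2 ** n for n in range(self.max_retries))
        return attempts * self.timeout_s + backoff


payment_policy = ResiliencePolicy(
    dependency="PaymentService",
    timeout_s=2.0,
    max_retries=3,
    failure_rate_threshold=0.5,
    window_s=60,
    cooldown_s=30,
)
```

Under these assumptions `payment_policy.worst_case_call_budget_s()` comes to 15 seconds (four 2-second attempts plus 1 + 2 + 4 seconds of backoff), which is the kind of number worth surfacing before a product manager asks why a checkout can hang that long.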
How is a circuit breaker different from a retry?
A retry repeats a single failed call hoping the next attempt succeeds. A circuit breaker is a longer-horizon judgment: it says "this dependency has been failing enough that further calls are pointless right now, stop trying for a while." Retries handle transient blips that clear within milliseconds to a few seconds; circuit breakers handle sustained outages on the seconds-to-minutes scale. They work together: retries first, breaker if retries keep failing.
Should every dependency have a circuit breaker?
In a mature microservices system, yes — though many of them are set with permissive defaults and never actually trip. The cost of having one is near-zero with a good library. The cost of not having one on a critical dependency is "we found out about it during the outage." If you're early-stage with three services, you can defer it. Past about ten services, the answer is reflexively yes.
What's a sensible default cooldown?
30 seconds is a common starting point. Library defaults vary (Resilience4j's waitDurationInOpenState, for example, defaults to 60 seconds), so set the value explicitly rather than inheriting it. Shorter than that and you risk flapping: the breaker re-opens, re-closes, re-opens as the downstream wobbles. Longer than that and you take too long to detect recovery. For dependencies known to take longer to recover (a database failover, for instance), bump it to 60-90 seconds.
Is the half-open state always a single probe?
In the simplest implementations, yes: one probe, then either close or reopen. Others use several; Resilience4j, for example, permits 10 calls in half-open by default (permittedNumberOfCallsInHalfOpenState) and decides based on their failure rate. Requiring multiple probes is safer for high-traffic endpoints where one lucky success could mislead.
Can a circuit breaker cause its own outage?
Yes. If the threshold is too low, the breaker trips on normal noise and your service starts failing requests the downstream could have handled. Tune thresholds against real traffic patterns, not a guess, and alert on breaker-open events — sustained open state is itself a signal that something needs human attention.