Health checks for systems analyst interviews

Why health checks matter

You are sitting in a systems analyst loop at Stripe and the interviewer sketches a checkout service behind a load balancer. One pod is hung — accepting connections, never replying. Traffic still routes there because the LB cannot tell the difference between slow and dead. Customers see timeouts; the on-call sees a green dashboard. Design the contract that lets the platform notice this in under thirty seconds. That contract is a health check.

Health checks are the platform's only honest signal about whether an instance should receive traffic or be restarted. Without them, a broken pod silently absorbs requests until error budgets burn. With the wrong checks, the platform flaps — restart loops, cascading 503s, traffic stampedes onto the survivors. The difference between a candidate who has run production and one who has read a blog post is usually whether they can name three probe types (liveness, readiness, startup) and choose between deep and shallow checks without hand-waving.

This is the answer you want in a systems analyst interview at any company running Kubernetes at scale.

Load-bearing rule: liveness restarts the container; readiness only removes it from the load balancer. Confuse these in an interview and you will design a system that restart-loops on a slow database.

Liveness vs readiness

A liveness probe answers a single question: is the process still alive, or is it stuck? A successful response means "do not restart me". A failure — after a configurable threshold — means the orchestrator kills the container and starts a new one. Because the consequence is so violent, a liveness probe must be cheap and self-contained. It should not depend on the database, the cache, or any downstream service. The classic mistake is checking the database from /healthz and watching the entire fleet restart-loop when the database has a thirty-second blip.

# Good liveness: only proves the process can still answer a request
from flask import Flask

app = Flask(__name__)

@app.get("/healthz")
def healthz():
    return {"status": "ok"}, 200

A readiness probe answers a different question: should the load balancer send me traffic right now? A failure here removes the pod from the service endpoint set, but does not restart it. That distinction is everything. A pod might be perfectly alive while still warming caches, replaying a journal, or waiting on a downstream dependency. Readiness lets the pod opt out of traffic without paying the cost of a cold start. Readiness can — and usually should — touch the dependencies the request path needs.

# Readiness: fail closed if a hard dependency is unavailable.
# db and cache stand for the application's own client objects.
from flask import Response

@app.get("/ready")
def ready():
    if not db.is_connected():
        return Response(status=503)
    if not cache.ping():
        return Response(status=503)
    return Response(status=200)

The intuition: liveness is "am I a zombie", readiness is "am I useful right now".

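| Probe | Failure action | Touches dependencies? | Typical interval | Failure threshold |
| --- | --- | --- | --- | --- |
| Liveness | Restart container | No | 10-30 s | 3 |
| Readiness | Remove from LB | Yes (hard deps) | 5-10 s | 2-3 |
| Startup | Restart container; holds off liveness and readiness until it succeeds | Optional | 5-10 s | 30 |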

Startup probes

Kubernetes 1.16 introduced the startup probe (as an alpha feature; it has been stable since 1.20) for a specific failure mode: slow-starting applications being killed by an over-eager liveness probe. Imagine a JVM service that needs ninety seconds to warm a class loader, prime a JIT, and load a model. If the liveness probe starts checking at thirty seconds with a three-failure threshold, the pod is killed before it has served a single request. The cluster sees a restart loop and concludes the image is broken.

The startup probe disables liveness and readiness checks until it succeeds once. That means you can set failureThreshold: 30 with periodSeconds: 10 for a five-minute startup window, then switch to aggressive five-second liveness checks for the rest of the pod's life.

startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10  # 5 minutes total before giving up

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3  # ~15 s to declare dead

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2

Candidates who only know liveness and readiness usually compensate by setting initialDelaySeconds: 120 on the liveness probe. That works, but it is a fixed guess: a pod that hangs during startup goes undetected for the full two minutes after every container start, and a pod that needs longer than the guess is still killed mid-warm-up.

Deep vs shallow checks

The next question every interviewer asks is: what should the probe actually do? The answer lives on a spectrum between shallow and deep.

A shallow check verifies the process is responsive. It binds to the port, runs the HTTP stack, and returns 200. Nothing else. It is cheap, predictable, and immune to cascading failures.

GET /health/shallow → 200 OK

A deep check verifies the service can actually do its job. It pings the database, pokes the cache, sometimes calls a downstream over the network with a short timeout. It is honest about whether the pod can serve a request, but it ties the pod's health to every dependency.

GET /health/deep → checks DB + cache + auth service

The trade-off is brutal and worth memorising. Deep checks give accurate readiness but couple every pod's health to every dependency's health. If the database has a five-second blip, every readiness probe across the fleet fails simultaneously and the load balancer drains all backends. You get a thundering-herd reconnect storm on top of a database that was already struggling.

Shallow checks are immune to cascading failure but can mark a pod ready when it cannot actually serve traffic. The healthy-looking pod returns 500s to real users while its probe says everything is fine.

Gotcha: the standard compromise is readiness deep, liveness shallow. Readiness can fail one pod at a time to drain traffic; liveness should never restart the whole fleet because Postgres hiccuped.

Some teams add a third endpoint — a deep check at /health/deep scraped by an external monitor (Datadog, PagerDuty). This decouples the alerting signal from the platform action: a deep failure pages someone without removing pods from rotation.
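A deep check of the kind described above can be sketched as follows. This is a minimal illustration that uses only the standard library; run_deep_check, the budget_s parameter, and the status strings are hypothetical names, and each check callable is assumed to wrap a real dependency ping.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict


def run_deep_check(checks: Dict[str, Callable[[], bool]],
                   budget_s: float = 0.5) -> Dict[str, str]:
    """Run every dependency check in parallel under one shared time budget."""
    results: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    deadline = time.monotonic() + budget_s
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            # Whatever is left of the shared budget is the wait for this check.
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = "ok" if future.result(timeout=remaining) else "fail"
            except FutureTimeout:
                results[name] = "timeout"
            except Exception:
                results[name] = "error"
    finally:
        # Do not block the probe on a hung check; its worker thread finishes on its own.
        pool.shutdown(wait=False)
    return results
```

The endpoint can then report healthy only when every value is "ok", while the per-dependency map tells the external monitor which dependency failed.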

Implementation patterns

A handful of patterns separate production-grade health endpoints from copy-pasted snippets.

Cache health for a short window. If a check is expensive — opens a database transaction, hits a service across a region — the probe will hammer the dependency every few seconds. Cache the result for 1-5 seconds. The freshness loss is irrelevant; the load reduction is large.
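One way to cache a check result for a short window is sketched below. CachedCheck, ttl_s, and the injectable clock are illustrative names rather than part of any framework, and the two-second default is just a value inside the 1-5 second range mentioned above.

```python
import threading
import time
from typing import Callable, Optional


class CachedCheck:
    """Wrap an expensive health check and reuse its result for ttl_s seconds."""

    def __init__(self, check: Callable[[], bool], ttl_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._expires_at: Optional[float] = None
        self._last = False

    def __call__(self) -> bool:
        with self._lock:
            now = self._clock()
            # Only call the real check when there is no cached result or it has expired.
            if self._expires_at is None or now >= self._expires_at:
                self._last = self._check()
                self._expires_at = now + self._ttl_s
            return self._last
```

Holding the lock while the check runs means concurrent probes wait for one result instead of each hitting the dependency, which is the point of the cache.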

Expose health on a separate port. Health endpoints on the main API port get tangled in request metrics, rate-limited by gateways, and pollute access logs. Bind them to an admin port (often 8081 or 9090) reachable by the platform but not the public LB.

Exclude health from RPS dashboards. Liveness and readiness probes polled every five seconds add about two dozen phantom requests per minute per pod. Tag them and drop them from latency and throughput metrics so the SLO maths stays clean.
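A small filter for tagging probe traffic might look like this. The path set mirrors the endpoints used in this article, and the function name is hypothetical; the User-Agent check relies on the kubelet identifying its HTTP probes with a kube-probe/ prefix, which is worth verifying against your own access logs.

```python
# Paths used for health endpoints in this article; adjust to your own service.
HEALTH_PATHS = frozenset({"/healthz", "/ready", "/startup"})


def is_probe_request(path: str, user_agent: str = "") -> bool:
    """Return True for requests that should be excluded from RPS and latency metrics."""
    return path in HEALTH_PATHS or user_agent.startswith("kube-probe/")


requests_seen = [
    ("/checkout", "Mozilla/5.0"),
    ("/ready", "kube-probe/1.29"),
    ("/healthz", "kube-probe/1.29"),
]
# Keep only real customer traffic for the dashboard.
customer_traffic = [r for r in requests_seen if not is_probe_request(*r)]
```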

Fail closed on hard dependencies, open on soft. No database connection means not ready — the service cannot answer. A cold cache means still ready, degraded — slower but correct. Encoding this distinction is how senior candidates separate themselves.
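The hard versus soft distinction can be encoded in a few lines. This is a sketch under the assumption that each dependency has already been checked and reduced to a boolean; readiness_status and the state strings are hypothetical names.

```python
from typing import Dict, Tuple


def readiness_status(hard: Dict[str, bool], soft: Dict[str, bool]) -> Tuple[int, str]:
    """Fail closed on hard dependencies, stay ready but degraded on soft ones."""
    if not all(hard.values()):
        return 503, "not_ready"   # cannot answer requests at all
    if not all(soft.values()):
        return 200, "degraded"    # slower, but still correct
    return 200, "ready"
```

For example, readiness_status({"db": True}, {"cache": False}) keeps the pod in rotation with a "degraded" state, while a failed database check returns 503.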

Probe tuning in Kubernetes

The probe config in YAML is where most production incidents start. The knobs interact in ways that are easy to get wrong.

| Knob | What it does | Typical value | Failure mode if wrong |
| --- | --- | --- | --- |
| initialDelaySeconds | Wait before first check | 0 (use startup probe instead) | Long delay masks real crashes |
| periodSeconds | Interval between checks | 5-10 | Too tight = noisy, too loose = slow detection |
| timeoutSeconds | How long to wait for response | 1-3 | Too tight = spurious failures under GC pause |
| failureThreshold | Consecutive failures before action | 2-3 (readiness), 3 (liveness) | Too low = flapping, too high = slow draining |
| successThreshold | Consecutive successes to recover | 1 | Must be 1 for liveness and startup; higher values only apply to readiness |

A defensible default for a typical web service: readiness periodSeconds: 5, failureThreshold: 2, timeoutSeconds: 2; liveness periodSeconds: 10, failureThreshold: 3, timeoutSeconds: 3. That detects a hung pod in roughly thirty seconds and drains a stuck pod in ten seconds. Faster and you fight GC pauses; slower and customers feel it.
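The arithmetic behind those numbers can be sketched with a simplified model. It assumes the pod hangs at a random point between probes and that every probe against a hung pod runs to its full timeout; it ignores kubelet scheduling jitter, so treat the results as rough bounds. detection_window is a hypothetical helper.

```python
from typing import Tuple


def detection_window(period_s: int, failure_threshold: int, timeout_s: int) -> Tuple[int, int]:
    """Return (best, worst) seconds from a hang until the failure threshold is reached."""
    # Best case: the pod hangs just before a probe fires.
    best = (failure_threshold - 1) * period_s + timeout_s
    # Worst case: the pod hangs just after a probe succeeded.
    worst = failure_threshold * period_s + timeout_s
    return best, worst


print(detection_window(10, 3, 3))  # liveness defaults above -> (23, 33)
print(detection_window(5, 2, 2))   # readiness defaults above -> (7, 12)
```

Both ranges bracket the roughly thirty seconds and ten seconds quoted above.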

The single most common production incident from health checks is a liveness probe that times out because the pod is busy serving real traffic. The timeout restarts the pod, which throws its traffic onto the remaining pods and pushes them over the same threshold.

Common pitfalls

The pitfall that buries the most teams is using the same endpoint for liveness and readiness. A pod with a five-second database blip should be drained from the LB and then recovered when the database returns. With a unified endpoint, the same blip restarts the container, throws away the warm in-memory cache, and forces a cold start across the entire fleet in lock-step. The fix is obvious once you have lived through it — separate probes, separate semantics — but the laziness of "just one health endpoint" survives many code reviews.

Another recurring trap is checking downstream services in the liveness probe. The reasoning sounds clean: "if Stripe is down, mark ourselves unhealthy". The reality is brutal — every consumer of Stripe restart-loops the moment Stripe coughs. The platform's circuit breaker (see Circuit breaker pattern) is the right tool for downstream failures; liveness is not. Liveness should only fail when the process itself is broken.

A third trap is probes that allocate or open transactions. A health endpoint that runs SELECT 1 inside a transaction looks cheap until you notice it consumes a connection from a small pool. At thirty probes per minute across two hundred pods, the probe itself becomes the source of the outage it was meant to detect. The fix is a pre-allocated probe connection outside the transactional pool.
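To put a number on that example, a quick calculation shows the fleet-wide query rate the probes alone generate. The helper name is hypothetical, and it assumes every probe runs one query against the database.

```python
def probe_queries_per_second(probes_per_minute_per_pod: int, pods: int) -> float:
    """Fleet-wide database queries per second caused by health probes alone."""
    return probes_per_minute_per_pod * pods / 60


print(probe_queries_per_second(30, 200))  # 100.0
```

A hundred extra queries per second is small for a healthy database, but each one competes for a pooled connection at exactly the moment the pool is under pressure.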

A fourth trap, common with sidecar-heavy stacks, is the Envoy / Istio readiness race. The application reports ready before the mesh sidecar has loaded its config, so the LB sends traffic that the mesh refuses. Gate readiness on the sidecar's admin endpoint, or use a wait-for-sidecar init container.

The last trap is identical probe values copy-pasted across all services. A batch worker handling ten-minute jobs needs different timing than a low-latency API. One YAML template means either sluggish API recovery or batch workers restarted mid-job.

If you want to drill systems analyst questions like this every day, NAILDD is launching with 1,500+ design problems across exactly this pattern.

FAQ

Should health endpoints require authentication?

In most clusters, no. The probe runs inside the pod network and the endpoint is bound to an admin port that is not exposed publicly. Adding auth means the probe needs credentials, which adds a failure mode (rotated secret kills the fleet) for negligible security benefit. If the endpoint must be reachable from outside the cluster — say, for an external monitor — expose a separate scrubbed endpoint with a long-lived token rather than putting auth on the platform-facing probe.

How is a Kubernetes readiness probe different from an AWS ALB health check?

They overlap in intent but operate at different layers. The Kubernetes readiness probe controls whether the pod is in the Service endpoints list; the ALB check controls whether the target is in the target group. In Kubernetes-on-AWS, traffic typically flows ALB → node port (kube-proxy) → pod in instance target mode, or ALB → pod directly in IP target mode, and each hop has its own health view. Production-grade setups treat them as two layers of defence rather than duplicates.

What status code should a failing probe return?

503 Service Unavailable is conventional because it signals "I am here but cannot serve right now". Some teams use 500 for liveness failures to flag "genuinely broken". Kubernetes HTTP probes treat any status from 200 to 399 as success and anything else as failure, so the exact code matters more for log readability than for the orchestrator.

Can you skip readiness if you have liveness?

You can, and many small services do, but you give up the ability to drain a pod cleanly during deploys, dependency blips, or warm-up. Without readiness, the first request a pod sees is a real customer request, and a slow-starting pod returns errors during its warm-up window. Adding readiness is usually a fifty-line change that prevents an entire class of deploy-time incidents.

How do health checks interact with graceful shutdown?

When a pod is terminated, Kubernetes sends SIGTERM and removes the pod from the Service endpoints — but propagation through kube-proxy and any external LBs is not instant. The standard pattern is: on SIGTERM, immediately flip readiness to not ready, sleep for 5-10 seconds to let LB propagation finish, then stop accepting new connections, drain in-flight requests, and exit. Without that pre-stop sleep, you drop the requests that arrive in the propagation window — usually the source of mysterious "deploys cause 5xx spikes" tickets.
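The shutdown sequence above can be sketched as a small coordinator object. ShutdownCoordinator and its attributes are hypothetical names; the sketch assumes requests are served on other threads, that the readiness endpoint consults is_ready(), and that the server loop watches the stopped event before draining and exiting.

```python
import signal
import threading
import time
from typing import Callable


class ShutdownCoordinator:
    """Flip readiness on SIGTERM, wait for LB propagation, then signal the server to stop."""

    def __init__(self, drain_delay_s: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._ready = threading.Event()
        self._ready.set()
        self._drain_delay_s = drain_delay_s
        self._sleep = sleep
        self.stopped = threading.Event()

    def is_ready(self) -> bool:
        """The readiness endpoint should return 503 once this is False."""
        return self._ready.is_set()

    def handle_sigterm(self, signum=None, frame=None) -> None:
        self._ready.clear()                # 1. report not ready immediately
        self._sleep(self._drain_delay_s)   # 2. let endpoint removal propagate
        self.stopped.set()                 # 3. server stops accepting and drains

    def install(self) -> None:
        """Register the handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self.handle_sigterm)
```

The delay is injectable so the sequence can be exercised in tests without real sleeping; in a pod, keep it shorter than terminationGracePeriodSeconds so there is still time to drain in-flight requests.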

What metrics should I export from the health endpoint?

The probe itself should not emit metrics — it is hot-path code. But the health-check logic should record, separately, the per-dependency status and latency that the deep check used. A common pattern is a dependency_health{name="db", status="ok"} gauge updated every probe cycle. That gives you a per-dependency view of why readiness flipped, which is the first thing the on-call wants when the LB drained a fleet at 3 a.m.
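As an illustration of that gauge, the following sketch renders per-dependency status in the Prometheus text exposition style using only the standard library. render_dependency_health is a hypothetical helper; a real service would more likely update a gauge through its metrics client library than format lines by hand.

```python
from typing import Dict, List


def render_dependency_health(statuses: Dict[str, str]) -> List[str]:
    """Turn {"db": "ok"} into one labelled sample line per dependency."""
    lines = []
    for name in sorted(statuses):
        status = statuses[name]
        # Value 1 marks the dependency's current status label as active.
        lines.append(f'dependency_health{{name="{name}",status="{status}"}} 1')
    return lines


for line in render_dependency_health({"db": "ok", "cache": "timeout"}):
    print(line)
```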