Logging strategy for systems analyst interviews
Why logging shows up on the whiteboard
A senior systems analyst at Stripe once told me the interview filter for SA candidates is brutally simple: can you design a system that survives a 3 AM page? Logging is the load-bearing answer. When the panel asks "your checkout service is throwing 500s, walk me through the next ten minutes" — the candidate who reaches for structured logs, correlation IDs, and log levels in the first sentence passes. The candidate who says "I'd check the logs" without specifics gets a polite thank-you email.
The trap is that logging sounds boring. It is not a flashy topic like CAP theorem or saga patterns, so candidates under-prep it. Then on the call they get a question like "design observability for a payment service handling 50,000 RPS" and they fumble between "we'd use Datadog" and "we'd add some log statements". The panel wants to hear a layered strategy: levels, format, IDs, retention, and the rules for what stays out of the log entirely.
This post gives you the answer to keep in your head. Everything below is the framework I would use on a Microsoft, DoorDash, or Snowflake SA loop — and the same one I drill with candidates before their on-sites.
Log levels and when to use each
The five canonical levels appear, under slightly different names, in most Java, Python, Go, and Node logging libraries (Python's standard library, for instance, calls the top level CRITICAL rather than FATAL). In an interview, do not list them as a flat bullet; explain the production policy behind each one. Most teams ship at INFO and above in production, with a sampled DEBUG channel (typically 1–5% of requests) routed to a separate index for live diagnosis. That sampling rule alone signals you have seen real systems.
| Level | Production use | Example event |
|---|---|---|
| DEBUG | Sampled 1–5%, off by default | SQL query plan for a single user request |
| INFO | Always on | order_created, auth_success, payment_captured |
| WARN | Always on | Retried 3 times before success, fallback path hit |
| ERROR | Always on, paged | Database connection refused, unhandled exception |
| FATAL | Always on, immediate page | Process unable to start, data corruption detected |
A clean rule for interviews: WARN means a human should read this within the week; ERROR means a human should look within the hour; FATAL wakes someone up. If your panel pushes back with "what about a 500 that a known retry path recovers from, is that WARN or ERROR?", the right answer is WARN plus a counter, escalating to ERROR only after the retry budget is exhausted. That subtlety is what separates SAs from juniors.
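The retry rule can be sketched in a few lines. This is a minimal illustration using Python's standard `logging` module; the function name `call_with_retry`, the `retry_budget` parameter, and the choice of `ConnectionError` as the retryable failure are assumptions made for the example, not something the rule prescribes.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def call_with_retry(operation, retry_budget=3):
    """Run operation; WARN on each recoverable failure, ERROR once the budget is spent."""
    for attempt in range(1, retry_budget + 1):
        try:
            return operation()
        except ConnectionError as exc:
            if attempt < retry_budget:
                # Still recoverable: WARN, so a derived counter can track the retry rate.
                log.warning("retrying attempt=%d budget=%d error=%s", attempt, retry_budget, exc)
            else:
                # Budget exhausted: this is now a real failure that a human should see.
                log.error("retry budget exhausted attempts=%d error=%s", attempt, exc)
                raise
```

A call that succeeds on its third attempt leaves two WARN lines and no ERROR; a call that never succeeds leaves two WARN lines, one ERROR, and re-raises.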
Structured logs vs plain text
Plain text logs are a relic. They look like this:
[2026-05-22 12:00:00] User 42 ordered product 17 with amount 100
You cannot query that at scale. Every team I have worked with above Series B emits JSON-structured logs so every field is independently filterable and aggregatable in Elasticsearch, Loki, or Datadog. The same event becomes:
{"ts":"2026-05-22T12:00:00Z","level":"INFO","msg":"order_created","user_id":42,"product_id":17,"amount":100,"currency":"USD","service":"checkout","trace_id":"abc123"}
Load-bearing trick: when an interviewer asks "why JSON over plain text", do not say "easier to read", because JSON is harder for humans. Say "machine-parseable, indexable per field, and aggregatable without regex". That phrase, almost verbatim, is what staff engineers say in design reviews.
The format unlocks queries like "p99 latency of order_created for users in EU with amount > $500 last 24h" without writing a single regex. It also makes log-based alerting cheap: a counter on level:ERROR AND service:checkout over five minutes is a one-line alert in any modern observability platform.
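One way to emit the JSON event shown earlier is a custom formatter on top of Python's standard `logging` module. Treat this as a sketch: the `JsonFormatter` class, the convention of passing event fields under an `extra={"fields": ...}` key, and the hard-coded service name are all assumptions for illustration. Many teams use an off-the-shelf structured-logging library instead.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "msg": record.getMessage(),
            "service": "checkout",  # assumed service name
        }
        # Fields passed via extra={"fields": {...}} become top-level keys.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout.json")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False  # avoid a second, plain-text copy from the root logger

log.info(
    "order_created",
    extra={"fields": {"user_id": 42, "product_id": 17, "amount": 100,
                      "currency": "USD", "trace_id": "abc123"}},
)
```

Because every field is a top-level key, the backend can index `user_id` or `amount` directly instead of parsing them out of a sentence.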
There is one tax: JSON is verbose. A typical line is 300–800 bytes versus 80–120 for plain text. At 50,000 RPS and just one line per request, that is roughly 1.3–3.5 TB per day of raw logs, and more once each request emits several lines. This is exactly why the retention section below matters: you cannot keep all of it hot.
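The volume estimate is worth being able to reproduce on a whiteboard. The back-of-envelope below assumes one log line per request and decimal terabytes; the helper name `daily_log_volume_tb` and the `lines_per_request` parameter are illustrative, and real services usually emit more than one line per request.

```python
def daily_log_volume_tb(rps, bytes_per_line, lines_per_request=1):
    """Raw log volume per day in decimal terabytes (1 TB = 10**12 bytes)."""
    seconds_per_day = 86_400
    return rps * lines_per_request * bytes_per_line * seconds_per_day / 10**12

low = daily_log_volume_tb(50_000, 300)
high = daily_log_volume_tb(50_000, 800)
print(f"{low:.1f}-{high:.1f} TB/day")  # prints "1.3-3.5 TB/day"
```

Raising `lines_per_request` scales the result linearly, which is the quickest way to show a panel why per-request log discipline matters.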
Correlation IDs and distributed tracing
A request to a modern checkout flow hits 5–15 services. Without a correlation ID (often called trace_id or request_id), debugging is impossible — you have logs in fifteen indices with no way to join them. The pattern is universal:
1. Edge gateway generates trace_id "abc123" on inbound request
2. trace_id propagates in headers (W3C traceparent, or X-Request-ID)
3. Every downstream service reads the header and injects trace_id into every log line it emits
4. On the debug side: filter logs by trace_id="abc123" → full request path across all services
The next level up is distributed tracing with OpenTelemetry, Jaeger, or Tempo. Tracing adds span IDs inside the trace so you see not just which services touched the request but how long each took and which called which. In an interview, name-drop the W3C traceparent header and OpenTelemetry, because they signal you have seen real microservice debugging.
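The first three steps of the propagation pattern can be sketched with the standard library alone. This is a simplified illustration, not an OpenTelemetry replacement: the helper names (`extract_or_create_trace_id`, `TraceIdFilter`, `handle_request`) are hypothetical, and the header check only validates the general `version-traceid-parentid-flags` shape of W3C `traceparent`, not every rule in the specification.

```python
import contextvars
import logging
import re
import secrets

# Holds the trace_id for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

# W3C traceparent shape: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$")

def extract_or_create_trace_id(headers):
    """Reuse the inbound trace_id if the header is well formed, else mint one (edge behaviour)."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match:
        return match.group(1)
    return secrets.token_hex(16)  # 32 hex characters

class TraceIdFilter(logging.Filter):
    """Attach the current trace_id to every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

def handle_request(headers, log):
    """Bind the trace_id, log with it, and build the header for downstream calls."""
    trace_id_var.set(extract_or_create_trace_id(headers))
    log.info("request_received")
    # Same trace id, fresh 16-hex parent id for the outgoing hop.
    return {"traceparent": f"00-{trace_id_var.get()}-{secrets.token_hex(8)}-01"}

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
log = logging.getLogger("gateway")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False
```

Using a context variable rather than passing `trace_id` through every function call keeps the ID available to any log statement on the request path, including code that knows nothing about HTTP headers.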
A useful aside: traces are sampled at ingestion (often 1–10% of requests) because storing every span is prohibitive. Logs are sampled differently — usually all errors plus a fraction of successes. Conflating the two sampling strategies is a common candidate mistake.
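The log-side rule ("all errors plus a fraction of successes") might look like the filter below. It is a sketch under two assumptions: each record already carries a `trace_id` attribute, and hashing that ID with CRC32 is an acceptable way to pick the sampled fraction. The class name `SamplingFilter` and the default rate are illustrative.

```python
import logging
import zlib

class SamplingFilter(logging.Filter):
    """Keep every WARNING+ record; keep lower levels for a fixed fraction of trace IDs."""

    def __init__(self, sample_rate=0.05):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        trace_id = getattr(record, "trace_id", "")
        # Hash the trace_id so one request is either fully kept or fully dropped.
        bucket = zlib.crc32(trace_id.encode()) % 10_000
        return bucket < self.sample_rate * 10_000
```

Sampling by trace ID rather than by random draw per line means a sampled request keeps all of its DEBUG and INFO lines, so the surviving logs still tell a complete story.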
What never to log
This is where SA candidates win or lose on the security follow-up. Three categories must never enter the log pipeline in plaintext: credentials, PII, and payment data. Specifically:
- Passwords and password hashes. Never. Even hashed passwords give an attacker who breaches your log store a free offline cracking target.
- Full card PANs. Mask to ****1234 at the edge. Logging full PANs makes the log store in scope for PCI-DSS, which is a compliance nightmare you do not want.
- API keys, OAuth tokens, session cookies. Same logic as passwords.
- Government IDs, full names tied to medical data. GDPR and HIPAA scope. Either tokenize or omit.
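As an illustration of masking at the edge, the helper below rewrites anything shaped like a card number to `****` plus its last four digits. It is deliberately rough: the regular expression matches 13 to 19 digits with optional spaces or dashes and does no Luhn check, so it is an assumption-laden sketch rather than a PCI-grade scrubber. The name `mask_pans` is hypothetical.

```python
import re

# 13-19 digits, optionally separated by spaces or dashes (a rough PAN shape, not a validator).
PAN_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def mask_pans(text):
    """Replace anything that looks like a card number with **** plus its last four digits."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group(0))
        return "****" + digits[-4:]
    return PAN_RE.sub(_mask, text)

print(mask_pans("card 4111 1111 1111 1111 declined"))  # card ****1111 declined
```

Pattern-based masking is a safety net. The stronger control is never passing the full PAN to the logger in the first place.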
The second class of bad logging is volume noise. If your service processes 10,000 rows in a batch and you log every row, you have just generated 10,000 lines of useless context that drowns the one ERROR you actually need to find. The pattern is: log start of operation, log end with summary stats, log errors per item. Not every step in between.
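The start, summary, and per-error pattern is short enough to sketch. The function name `process_batch` and the `process_row` callback are placeholders for whatever your batch job actually does.

```python
import logging

log = logging.getLogger("batch")

def process_batch(rows, process_row):
    """Log once at start, once at end with totals, and once per failed row."""
    log.info("batch_started rows=%d", len(rows))
    failed = 0
    for index, row in enumerate(rows):
        try:
            process_row(row)
        except Exception as exc:
            failed += 1
            # Only failures get their own line; successful rows stay silent.
            log.error("row_failed index=%d error=%s", index, exc)
    log.info("batch_finished rows=%d ok=%d failed=%d", len(rows), len(rows) - failed, failed)
    return failed
```

A 10,000-row batch with one bad row produces three log lines instead of 10,000, and the ERROR is impossible to miss.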
The third class is giant blobs. A 10 MB JSON payload in a single log line will choke your shipper, blow your storage budget, and is unsearchable anyway. Truncate to a hash plus a few key fields, or store the blob in object storage and log just the pointer.
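A hedged sketch of the "hash plus a few key fields" approach follows. The function name `summarize_payload` and the default `keep_fields` are assumptions; pick whichever fields your team actually searches on.

```python
import hashlib
import json

def summarize_payload(payload, keep_fields=("order_id", "status")):
    """Return a small, loggable summary of a large JSON-serializable dict."""
    raw = json.dumps(payload, sort_keys=True).encode()
    summary = {
        "payload_sha256": hashlib.sha256(raw).hexdigest(),
        "payload_bytes": len(raw),
    }
    # Copy only the few fields worth searching on.
    for field in keep_fields:
        if field in payload:
            summary[field] = payload[field]
    return summary
```

The hash lets you confirm later that a blob retrieved from object storage is the one the log line refers to, without the log line carrying the blob itself.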
Retention tiers
Logs grow fast. A well-run system uses tiered retention to balance debuggability against cost:
| Tier | Storage | Retention | Use case |
|---|---|---|---|
| Hot | Elasticsearch, Loki, Datadog | 7–30 days | Live debugging, alerting, recent incidents |
| Warm | S3 with Athena / BigQuery | 30–180 days | Post-mortem analysis, slow forensics |
| Cold | S3 Glacier, GCS Archive | 1–7 years | Compliance, legal hold |
The lifecycle policy is automated with object-storage lifecycle rules, so no human moves data between tiers. Compliance frameworks dictate the floor: SOX-driven rules are commonly cited as 7 years for financial audit records, HIPAA as 6 years for its required compliance documentation, and most security teams want at least 90 days of warm logs for breach forensics. GDPR pushes in the opposite direction (minimize personal data and delete on request), which is one more reason PII should never be in the log to begin with.
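The tier boundaries in the table reduce to a simple age lookup. In practice the storage platform's lifecycle rules enforce this, so the function below only illustrates the decision. The cut-offs use the upper end of each range in the table, and the names `TIERS` and `tier_for` are hypothetical.

```python
from datetime import date, timedelta

# Upper bound of each tier in days, taken from the high end of the retention table.
TIERS = [("hot", 30), ("warm", 180), ("cold", 7 * 365)]

def tier_for(log_date, today):
    """Return the tier a log written on log_date belongs to, or 'delete' past cold retention."""
    age_days = (today - log_date).days
    for name, max_age_days in TIERS:
        if age_days <= max_age_days:
            return name
    return "delete"

today = date(2026, 5, 22)
print(tier_for(today - timedelta(days=3), today))    # hot
print(tier_for(today - timedelta(days=60), today))   # warm
print(tier_for(today - timedelta(days=400), today))  # cold
```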
A nuance that impresses panels: hot logs are 10–50x more expensive per GB than cold storage. So the right answer to "how long do you keep logs" is never a single number — it is a tiered policy with cost as the explicit trade-off.
Common pitfalls
The first pitfall I see in mock interviews is conflating logs, metrics, and traces. Candidates treat them as one bucket called "observability". They are three pillars with different shapes: logs are discrete events with rich context, metrics are pre-aggregated counters and histograms, traces are causal graphs across services. If your panel asks "would you alert on a log line or a metric", the answer is almost always metric — you derive a counter from logs and alert on the counter, not on the raw log line, because raw log alerts are noisy and expensive.
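To make "derive a counter from logs and alert on the counter" concrete, here is a minimal sliding-window counter over parsed log events. Real platforms do this for you; the class name `ErrorRateAlert`, the five-minute window, and the threshold of 50 are assumptions chosen for the example.

```python
from collections import deque

class ErrorRateAlert:
    """Count checkout ERROR events in a sliding window and report when a threshold is crossed."""

    def __init__(self, window_seconds=300, threshold=50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def observe(self, event, now):
        """Feed one parsed log event; return True if the alert should fire."""
        if event.get("level") == "ERROR" and event.get("service") == "checkout":
            self.timestamps.append(now)
        # Drop events that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold
```

The alert fires on the aggregate, not on any single line, which is what keeps it quiet during an isolated failure and loud during a real incident.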
A second trap is logging at the wrong level. Candidates who set everything to INFO end up paying for noise; candidates who set everything to ERROR miss the early warning signals. The fix is a deliberate per-event policy: every event class gets a level decision documented in the service runbook, and code reviews enforce it. Yes, this means logging policy belongs in your design doc — not as an afterthought.
A third pitfall is forgetting to propagate the correlation ID across asynchronous boundaries. Kafka messages, retry queues, and background jobs all break the natural call chain. You must explicitly carry the trace_id in the message envelope or you lose the ability to reconstruct flows that touch queues. This is a favorite follow-up question on SA loops at Uber and DoorDash — both companies run heavy async pipelines and have been burned by orphaned traces.
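Carrying the ID across a queue boundary can be sketched with an in-process queue standing in for Kafka or a job queue. The envelope shape (`trace_id` next to `payload`) and the function names `publish` and `consume` are illustrative assumptions; with a real broker the ID would more likely travel in message headers.

```python
import contextvars
import queue

trace_id_var = contextvars.ContextVar("trace_id", default="-")

def publish(q, payload):
    """Wrap the payload in an envelope that carries the caller's trace_id."""
    q.put({"trace_id": trace_id_var.get(), "payload": payload})

def consume(q, handler):
    """Restore the trace_id before running the handler, so its logs join the original request."""
    envelope = q.get()
    token = trace_id_var.set(envelope["trace_id"])
    try:
        return handler(envelope["payload"])
    finally:
        # Do not leak this message's trace_id into the next one.
        trace_id_var.reset(token)
```

The `reset` in the `finally` block matters for long-lived workers: without it, a message that arrives with no trace ID would silently inherit the previous message's ID.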
A fourth pitfall is leaking PII through error messages. A naive logger.error(f"Failed to process user {user.full_name} {user.email}") looks fine until you notice you just dumped names and emails into a log store that retains for years. The fix is structured fields with explicit scrubbing — never interpolate raw user objects into log strings; pass typed fields and let the logger redact based on a deny-list.
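A deny-list scrubber for structured fields might look like the sketch below. The key names in `DENY_LIST` are examples only, and the function name `redact` is hypothetical; a real deployment would maintain the list centrally and apply it inside the logger rather than at each call site.

```python
DENY_LIST = {"password", "email", "full_name", "card_number", "api_key", "token"}

def redact(fields, deny_list=DENY_LIST):
    """Return a copy of fields with deny-listed keys replaced, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in deny_list:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value, deny_list)
        else:
            clean[key] = value
    return clean

print(redact({"user_id": 42, "email": "a@example.com"}))
# {'user_id': 42, 'email': '[REDACTED]'}
```

Logging `user_id` alone is usually enough to find the affected account, and it keeps names and emails out of a store that may retain data for years.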
The fifth pitfall is no log budget. Teams that ship without a per-service log volume budget end up paying observability vendor bills that exceed their compute bills. A rough sanity check: log spend above 8–10% of compute spend means you are over-logging. Audit, sample, and prune.
Related reading
- Capacity planning for systems analyst interviews
- API gateway vs BFF for systems analyst interviews
- CAP theorem for systems analyst interviews
- Cache strategies for systems analyst interviews
- Chaos engineering for systems analyst interviews
FAQ
Is this the official answer interviewers expect?
There is no single "official" answer — every company has its own logging conventions. What this post captures is the vendor-neutral baseline that Datadog, Splunk, Elastic, Grafana Labs, and every major cloud provider converge on. If you can articulate the five layers — levels, format, correlation, redaction, retention — you will sound credible at any FAANG, fintech, or late-stage startup loop.
How is logging different from monitoring?
Logging captures individual events with rich context. Monitoring captures aggregate metrics over time — counters, gauges, histograms. You typically derive monitoring signals from logs (or from a metrics SDK directly), then alert on the metrics, not on the logs. The two are complementary: metrics tell you something is wrong, logs tell you what happened to that specific request.
Should I log every database query?
No, and this is a frequent follow-up. Logging every query at INFO will drown your pipeline. The standard pattern is: log the business-level event (order_created, payment_captured) at INFO, and emit query-level detail only at sampled DEBUG. If you need per-query performance data, use a query-store feature like pg_stat_statements or your APM's slow-query module — that is metrics territory, not logs.
What about logging in serverless or edge functions?
Same principles, with a twist: serverless platforms (Lambda, Vercel Functions, Cloudflare Workers) typically ship stdout/stderr to a managed log store automatically. You still want JSON structured logs so the platform can index fields, and you still want a correlation ID propagated through the event payload. The platform handles transport and retention tiers for you, but you own the schema.
How do I handle log schema evolution?
Treat your log schema like an API. Adding fields is safe; removing or renaming them breaks downstream dashboards and alerts. Mature teams version the schema (log_version: "v3"), maintain a schema registry, and run a deprecation cycle before removing fields. On an SA interview, mentioning log schema as a contract is a quick way to signal seniority.
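The "adding is safe, removing or renaming breaks things" rule is easy to check mechanically. The sketch below compares two sets of field names; the function name `breaking_changes` and the example schemas are made up for illustration, and a real schema registry would also track field types.

```python
def breaking_changes(old_fields, new_fields):
    """Fields present in the old schema but missing from the new one (removed or renamed)."""
    return sorted(set(old_fields) - set(new_fields))

v2 = {"ts", "level", "msg", "user_id"}
v3_additive = v2 | {"trace_id"}            # only adds a field
v3_renamed = {"ts", "level", "msg", "uid"}  # renames user_id to uid

print(breaking_changes(v2, v3_additive))  # []
print(breaking_changes(v2, v3_renamed))   # ['user_id']
```

Running a check like this in CI turns the deprecation cycle from a convention into something a reviewer cannot skip.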
Is OpenTelemetry worth adopting?
For any system above a handful of services, yes. OpenTelemetry (OTel) is now the de facto standard for traces, metrics, and logs, with vendor-neutral SDKs and a unified data model. Adopting OTel keeps you out of vendor-proprietary formats and lets you switch between Datadog, Honeycomb, Grafana Cloud, or self-hosted backends without rewriting instrumentation. The migration tax is real, but the optionality is worth it for any team planning to scale past five services.