Distributed tracing for systems analysts

Why tracing shows up in SA interviews

You are sitting across from a staff engineer at a payments company. They draw a checkout flow on the whiteboard — five services, two queues, a third-party fraud check — and ask the question every systems analyst eventually hears: "a customer says checkout took 12 seconds, where do you look?" If your answer starts with grepping logs across five containers, the interview is effectively over. The correct answer starts with one phrase: distributed tracing.

Tracing is the only observability primitive that preserves the causal shape of a request as it fans out through services. Logs tell you what happened in one process. Metrics tell you aggregate behavior over time. A trace tells you that the 12-second checkout spent 11.4 seconds waiting on a synchronous fraud call that should have been async. That is the kind of answer that gets you to the next round, and it is why interviewers at Stripe, DoorDash, Uber, and any company running more than a dozen services treat tracing literacy as table stakes for senior SAs.

The trap is that most candidates recite the OpenTelemetry homepage but freeze when asked what a traceparent header actually contains. This post is the answer kit for the gap between marketing and interview-grade fluency.

Traces and spans, properly defined

A trace is the end-to-end record of a single logical request as it traverses your system. A span is one unit of work inside that trace — a function call, a DB query, an outbound HTTP request, a message publish. Each span carries a start time, end time, parent span ID (except the root), attributes, and a status. Visualized, a trace looks like a waterfall or a tree.

A worked example for a checkout request:

Trace: POST /checkout    total: 612 ms
├─ Span: api-gateway              612 ms
│  ├─ Span: auth-service.verify    18 ms
│  └─ Span: order-service.create  588 ms
│     ├─ Span: postgres.insert     42 ms
│     ├─ Span: fraud.check        510 ms   ← the smoking gun
│     └─ Span: events.publish      14 ms

The waterfall makes the 510 ms fraud call obvious in a way no log line would. That visual is the entire pitch for tracing — and what interviewers want you to draw on a whiteboard from memory.

Every span carries a small payload of metadata. The fields you should be able to name without thinking:

| Field | Purpose | Example |
| --- | --- | --- |
| trace_id | 16-byte ID shared by every span in the trace | 4bf92f3577b34da6a3ce929d0e0e4736 |
| span_id | 8-byte ID unique to this span | 00f067aa0ba902b7 |
| parent_span_id | Links span to its caller; null for root | b7ad6b7169203331 |
| name | Logical operation, not URL | order-service.create |
| attributes | Key-value tags: http.method, db.statement, user.id | http.status_code=200 |
| events | Timestamped logs scoped to the span | cache.miss at t+12ms |
| status | OK, ERROR, or UNSET | ERROR |

Load-bearing rule: spans are nested by parent_span_id, not by wall-clock time. A child span can outlive its parent (think fire-and-forget async work) and you still get a correct tree as long as the parent ID is set.
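The rule can be sketched in a few lines. This is an illustration only, not how any particular backend does it: the Span dataclass and the build_tree and render helpers are hypothetical names, and the span IDs are shortened stand-ins for the checkout example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Link spans into a tree using parent_span_id only; timestamps are ignored."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_span_id) if s.parent_span_id else None
        if parent is None:
            # Either the true root, or an orphan whose parent never arrived.
            roots.append(s)
        else:
            parent.children.append(s)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]


def render(span: Span, depth: int = 0) -> list[str]:
    """Return an indented waterfall, one line per span."""
    lines = [f"{'  ' * depth}{span.name} {span.duration_ms} ms"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Spans arrive in arbitrary order, as they would from independent services.
spans = [
    Span("c3", "b2", "fraud.check", 510),
    Span("a1", None, "api-gateway", 612),
    Span("b2", "a1", "order-service.create", 588),
    Span("b1", "a1", "auth-service.verify", 18),
]
root = build_tree(spans)
print("\n".join(render(root)))
```

Because the tree is assembled purely from IDs, arrival order and overlapping timestamps do not affect the result.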

Context propagation across services

A trace only works if every service in the call chain agrees on the same trace_id and knows its parent_span_id. The mechanism is context propagation — the most common follow-up to "what is a span?"

For synchronous HTTP, the W3C Trace Context spec defines a traceparent header with a strict format:

traceparent: 00-{trace_id}-{parent_span_id}-{trace_flags}
             ^^                                ^^
             version                           01 = sampled, 00 = not

Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The receiver parses this, creates a new span with the propagated trace_id, sets parent_span_id from the header, and generates a fresh span_id.
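The receiver's steps can be sketched with the standard library alone. This is a simplified illustration, assuming only the version 00 layout shown above; SpanContext, parse_traceparent, and start_server_span are hypothetical names, and a real SDK would consult its sampler when it has to start a new root.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-trace_id-parent_id-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # This sketch accepts version 00 only, which is stricter than the spec.
    if fields["version"] != "00":
        return None
    # All-zero trace or parent IDs are invalid per the W3C spec.
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    return fields


def start_server_span(header: Optional[str]) -> SpanContext:
    """Continue the caller's trace if possible, otherwise start a new root."""
    fields = parse_traceparent(header) if header else None
    if fields is None:
        # Assumption for the sketch: new roots are always sampled.
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), None, True)
    sampled = bool(int(fields["flags"], 16) & 0x01)
    # Same trace_id, caller's span becomes the parent, fresh span_id for us.
    return SpanContext(fields["trace_id"], secrets.token_hex(8), fields["parent_id"], sampled)


ctx = start_server_span("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.parent_span_id, ctx.sampled)
```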

For async messaging — Kafka, SQS, RabbitMQ — the trace context is embedded in message headers instead. The producer attaches the traceparent to message metadata, the consumer extracts it on processing. This is where most teams quietly fail: the producer instruments correctly, but the consumer starts a brand new root span and orphans the downstream work.

Gotcha worth memorizing: if your trace shows two disconnected trees for what is logically one request, the propagation broke at a queue boundary or at a service that does not forward the header. Manual traceparent propagation through legacy code is the single most common SA design-task answer.
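The queue-boundary failure is easier to see in code. The sketch below uses a plain Python list as a stand-in for Kafka or SQS; publish and consume are hypothetical helpers, not a real client API, and the message shape (a headers dict plus a payload) is an assumption.

```python
import secrets
from typing import Optional


def publish(queue: list, payload: dict, trace_id: str, current_span_id: str) -> None:
    """Producer side: attach the current span's context to the message headers."""
    headers = {"traceparent": f"00-{trace_id}-{current_span_id}-01"}
    queue.append({"headers": headers, "payload": payload})


def consume(message: dict) -> dict:
    """Consumer side: continue the producer's trace instead of starting a new root."""
    header: Optional[str] = message.get("headers", {}).get("traceparent")
    if header:
        _version, trace_id, parent_span_id, _flags = header.split("-")
    else:
        # No context on the message: this is where an orphaned root is born.
        trace_id, parent_span_id = secrets.token_hex(16), None
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
        "name": "order-events.process",
    }


queue: list = []
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
producer_span_id = "00f067aa0ba902b7"
publish(queue, {"order_id": 42}, trace_id, producer_span_id)

linked = consume(queue[0])                       # joins the existing trace
orphan = consume({"payload": {"order_id": 42}})  # headers were dropped in transit
print(linked["parent_span_id"], orphan["parent_span_id"])
```

The orphan span has a different trace_id and no parent, which is exactly the two disconnected trees described above.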

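For gRPC, the same context travels in call metadata, gRPC's equivalent of HTTP headers. For background jobs, the enqueuing code must attach the parent context to the job payload and the worker must restore it before starting its own span. None of this happens automatically unless the client library you use is instrumented.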

Sampling strategies that survive scale

If a service handles 10,000 requests per second and you trace every one, you fill a small data lake every hour. Storage explodes, the collector chokes, and you pay to store ten thousand identical happy-path traces. Sampling solves this, and the strategy choice is a senior-level interview question.

| Strategy | When decision is made | Pros | Cons |
| --- | --- | --- | --- |
| Head-based | At the trace root, before any work | Cheap, simple, deterministic | Cannot prioritize errors; misses rare slow requests |
| Tail-based | After full trace completes | Always catches errors and slow traces | Needs in-memory buffering across collectors |
| Adaptive / dynamic | Adjusts rate based on traffic | Keeps cost bounded under spikes | Complex to tune; can mask regressions |
| Probabilistic | Coin flip at root, fixed rate | Predictable volume | Loses tail-of-distribution outliers |

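Production sampling typically lands at 1% to 10% for normal traffic, with always-on capture for errors and slow paths layered on via tail-based rules. The interview-grade answer: head-based at 1-5% for baseline, tail-based for status=ERROR and duration > p99. The caveat to state alongside it: tail-based rules only see traces that reach the collector, so the two layers must be wired so that error and slow traces are not discarded at the root before the tail rules run (commonly by exporting everything to the collector and applying the baseline percentage there). That combination keeps the storage bill roughly flat while preserving the signals that matter.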
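The layered policy can be sketched as a single decision function evaluated once a trace is complete. The names baseline_sampled and keep_trace, the dict shape, and the 800 ms p99 threshold are assumptions for illustration; this is not the configuration format of any real collector.

```python
def baseline_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic baseline decision: the same trace_id always gets the same answer."""
    # Treat the low 8 bytes of the hex trace ID as a number in [0, 2**64).
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


def keep_trace(trace: dict, baseline_rate: float, p99_ms: float) -> bool:
    """Keep every error and every slow trace, plus a fixed share of the rest."""
    if trace["status"] == "ERROR":
        return True
    if trace["duration_ms"] > p99_ms:
        return True
    return baseline_sampled(trace["trace_id"], baseline_rate)


tid = "4bf92f3577b34da6a3ce929d0e0e4736"
failed = {"trace_id": tid, "status": "ERROR", "duration_ms": 40}
slow = {"trace_id": tid, "status": "OK", "duration_ms": 1200}
happy = {"trace_id": tid, "status": "OK", "duration_ms": 40}

for trace in (failed, slow, happy):
    print(trace["status"], trace["duration_ms"], keep_trace(trace, 0.05, 800))
```

Hashing the trace ID rather than flipping a coin means independent components reach the same baseline decision for a given trace.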

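The trap with tail-based is that the collector must hold every span of a trace in memory until it makes its decision, and it has no way to know when the last span has arrived. Most tail-based deployments therefore cap the wait at 60-120 seconds, and spans that arrive after the window are silently dropped or judged without the rest of the trace.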
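A toy buffer makes the memory and late-span trade-off concrete. TailBuffer is a hypothetical class written for this post, the 30-second window is an arbitrary choice, and the only rule it applies is "keep traces that contain an error".

```python
from collections import defaultdict


class TailBuffer:
    """Hold spans per trace until a fixed decision window expires."""

    def __init__(self, decision_wait_s: float):
        self.decision_wait_s = decision_wait_s
        self._spans = defaultdict(list)  # trace_id -> spans seen so far
        self._first_seen = {}            # trace_id -> arrival time of first span

    def add(self, span: dict, now: float) -> None:
        tid = span["trace_id"]
        self._first_seen.setdefault(tid, now)
        self._spans[tid].append(span)

    def flush(self, now: float) -> list[list[dict]]:
        """Decide every trace whose window has expired; return the kept traces."""
        expired = [
            tid for tid, seen in self._first_seen.items()
            if now - seen >= self.decision_wait_s
        ]
        kept = []
        for tid in expired:
            spans = self._spans.pop(tid)
            del self._first_seen[tid]
            if any(s["status"] == "ERROR" for s in spans):
                kept.append(spans)
        return kept


buf = TailBuffer(decision_wait_s=30)
buf.add({"trace_id": "t1", "name": "api-gateway", "status": "OK"}, now=0)
buf.add({"trace_id": "t1", "name": "fraud.check", "status": "ERROR"}, now=10)

too_early = buf.flush(now=20)  # window still open, nothing decided
decided = buf.flush(now=30)    # window expired, trace kept because of the error

# A span arriving after the decision is judged alone and, being OK, is dropped.
buf.add({"trace_id": "t1", "name": "events.publish", "status": "OK"}, now=45)
late = buf.flush(now=80)
print(len(too_early), len(decided), len(late))
```

Memory use grows with traffic multiplied by the window length, which is why the window is capped in practice.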

OpenTelemetry, the actual standard

OpenTelemetry (OTel) is the vendor-neutral standard for generating and exporting telemetry. It was formed in 2019 by merging OpenTracing and OpenCensus and is hosted by the CNCF. Asked "how would you instrument a new service?", the correct first sentence is "I would use the OpenTelemetry SDK for the language."

OTel ships three things you should be able to name:

  1. APIs — language-specific interfaces that application code calls (tracer.start_span, etc.).
  2. SDKs — concrete implementations that batch, sample, and export spans.
  3. The Collector — a separate process that receives spans from SDKs, optionally processes them (filtering, attribute scrubbing, tail-based sampling), and exports to one or more backends.

A minimal Python instrumentation looks like this:

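from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import Status, StatusCode

# Register a global tracer provider, then batch spans and ship them to the
# collector over OTLP/gRPC. Pass insecure=True to the exporter if the
# collector endpoint is plaintext rather than TLS.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# order_id, user, and result come from the surrounding request handler.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.tier", user.tier)
    # ... business logic ...
    if not result.ok:
        # Mark the span as failed so error-based sampling rules can match it.
        span.set_status(Status(StatusCode.ERROR))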

Auto-instrumentation covers most frameworks — Flask, Django, FastAPI, Express, Spring, ASP.NET — so HTTP, database, and cache calls are captured without code changes. Manual spans are reserved for business-domain operations like "process_order" or "settle_payment" that no auto-instrumentation can discover on its own.

Tooling landscape

The backend that stores and visualizes traces is decoupled from how you produce them, which is the entire point of OTel. The market consolidated around a handful of options:

| Tool | Type | Strengths | Trade-offs |
| --- | --- | --- | --- |
| Jaeger | Open source | CNCF graduated, mature UI, easy local setup | Storage is your problem |
| Grafana Tempo | Open source | Cheap object-storage backend, integrates with Grafana | Built around trace-ID lookup; attribute search goes through TraceQL rather than free-text indexing |
| Zipkin | Open source | The original, simple data model | Smaller community than Jaeger today |
| Honeycomb | SaaS | Best-in-class query UX, BubbleUp diff | Per-event pricing adds up |
| Datadog APM | SaaS | Tight integration with logs and metrics | Expensive at scale; vendor lock-in |
| AWS X-Ray | Managed | Native in AWS | Weaker tree visualization, AWS-only |

For an interview answer, the safe stack is OpenTelemetry SDK → OTel Collector → Tempo or Jaeger → Grafana for dashboards. That stack is open source end-to-end, runs in any cloud, and is what most senior SAs would actually propose if asked to greenfield observability for a new platform.

Common pitfalls

The first pitfall is treating tracing as a logging replacement. A span attribute is not a log line. Stuffing full request bodies and 2 KB JSON blobs into attributes blows up storage and trips attribute-length limits. Keep attributes to small, high-cardinality identifiers (user.id, order.id, region) and leave verbose context in structured logs that share the trace_id so you can pivot between systems.
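One way to enforce that split is a small filter in front of the span. This is a sketch under stated assumptions: the allow-list, the 256-character limit, and the split_attributes helper are all hypothetical choices, not OpenTelemetry defaults.

```python
MAX_VALUE_LEN = 256  # assumed limit for this sketch
ALLOWED_KEYS = {"user.id", "order.id", "region"}


def split_attributes(raw: dict) -> tuple[dict, dict]:
    """Return (span_attributes, log_fields) for one unit of work."""
    span_attrs, log_fields = {}, {}
    for key, value in raw.items():
        if key in ALLOWED_KEYS and len(str(value)) <= MAX_VALUE_LEN:
            span_attrs[key] = value
        else:
            # Verbose or unexpected context goes to the structured log instead.
            log_fields[key] = value
    return span_attrs, log_fields


raw = {
    "user.id": "u-1842",
    "order.id": "o-99071",
    "region": "eu-west-1",
    "request.body": '{"items": [' + ", ".join(['{"sku": "A1"}'] * 200) + "]}",
}
span_attrs, log_fields = split_attributes(raw)

# The log record carries the trace_id so the two systems can be joined later.
log_record = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", **log_fields}
print(sorted(span_attrs), sorted(log_record))
```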

Second, forgetting context propagation at every async boundary. Every queue, background job runner, and cron task drops the trace context unless you carry it explicitly. Candidates who say "we use OpenTelemetry, so propagation is automatic" lose points instantly, because OTel only auto-propagates within a process or through instrumented HTTP/gRPC clients. Manual propagation through Kafka, Redis queues, or custom RPC is on you.

Third, picking a sampling rate without thinking about error visibility. A 1% head-based sample sounds reasonable until you are paged on a P0 and 99% of failing requests have no trace. The senior-level answer is always a layered policy: low-rate baseline for happy paths plus tail-based always-on capture for errors, slow requests, and specific high-value tenants or canary cohorts.

Fourth, ignoring clock skew across services. Span timestamps come from each host's wall clock, and a 50 ms drift can make child spans appear to start before their parents, which is causally impossible and confuses every visualization tool. Fix it with NTP discipline on every node and, for forensics, trust the parent-child link over the timestamps.
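Skew of this kind is easy to detect after the fact. The sketch below flags children whose recorded start precedes their parent's; find_skewed_spans and the start_ms field are hypothetical names, and the timestamps are invented for the example.

```python
def find_skewed_spans(spans: list[dict]) -> list[tuple[str, int]]:
    """Return (span_id, skew_ms) for children that appear to start before their parent."""
    by_id = {s["span_id"]: s for s in spans}
    skewed = []
    for s in spans:
        parent = by_id.get(s.get("parent_span_id"))
        if parent is not None and s["start_ms"] < parent["start_ms"]:
            skewed.append((s["span_id"], parent["start_ms"] - s["start_ms"]))
    return skewed


spans = [
    {"span_id": "a1", "parent_span_id": None, "start_ms": 1000},
    {"span_id": "b1", "parent_span_id": "a1", "start_ms": 1004},
    # Recorded on a host whose clock runs behind: starts "before" its parent.
    {"span_id": "b2", "parent_span_id": "a1", "start_ms": 950},
]
print(find_skewed_spans(spans))
```

The parent-child link still places b2 correctly in the tree; only its position on the time axis is wrong.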

Fifth, instrumenting too much and dashboarding too little. Spans nobody looks at are pure overhead. The mature pattern is two or three "golden trace" dashboards — checkout, signup, payment settlement — that any on-call engineer can open during an incident, then expand coverage based on the questions production actually asks.

FAQ

How is tracing different from logging and metrics?

Logs are discrete events from a single process. Metrics are pre-aggregated time series — counters, gauges, histograms — that summarize behavior. Tracing is the only one of the three that preserves the causal structure of a request across service boundaries. The mature observability stack uses all three with a shared trace_id so you can pivot: see a metric spike, find an example trace, drill into the logs of one failing span. Serious practitioners now treat the "three pillars" as one correlated data model.

Should I always instrument manually, or rely on auto-instrumentation?

Start with auto-instrumentation for the boring infrastructure layer — HTTP servers and clients, ORMs, caches, message queues. That gives 80% of the value with zero code changes. Layer manual spans on top for business-domain operations auto cannot discover, like "validate_kyc" or "compute_pricing_quote." Pure auto gives generic spans without business context; pure manual is a maintenance nightmare.

What sampling rate should I start with in production?

For a mid-size service under 5,000 RPS, start at 5-10% head-based plus tail-based always-on for errors and traces slower than p99. Past 10k RPS, drop head-based to 1% and lean harder on tail-based rules. Below 1% baseline you start hiding 1-in-10,000 bugs — exactly what tracing was supposed to catch.
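The arithmetic behind that last claim is worth doing once. The sketch assumes head-based sampling only, a bug that hits requests independently at a fixed rate, and a bug that does not set status=ERROR (so no tail rule rescues it); the five-minute window and the function name are illustrative choices.

```python
import math


def p_at_least_one_trace(rps: float, window_s: float, bug_rate: float, sample_rate: float) -> float:
    """Probability that at least one request hitting the bug is also sampled."""
    requests = rps * window_s
    p_hit = bug_rate * sample_rate  # one request is both buggy and sampled
    # 1 - (1 - p_hit) ** requests, computed in a numerically stable way.
    return -math.expm1(requests * math.log1p(-p_hit))


# 10,000 RPS, a 1-in-10,000 bug, five-minute incident window.
at_1_percent = p_at_least_one_trace(10_000, 300, 1e-4, 0.01)
at_01_percent = p_at_least_one_trace(10_000, 300, 1e-4, 0.001)
print(f"1% baseline:   {at_1_percent:.2f}")   # about 0.95
print(f"0.1% baseline: {at_01_percent:.2f}")  # about 0.26
```

At a 1% baseline you will very likely have an example trace within five minutes; at 0.1% you will probably have none.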

How do I trace through a third-party API I cannot instrument?

You cannot continue the trace inside their system, but you can wrap the outbound call in a span of your own and capture status code, latency, and any correlation ID they return. If the vendor documents support for W3C trace context, pass the headers through; otherwise accept that the boundary is opaque and document it.

Does tracing replace APM tools like New Relic or Datadog?

Tracing is the data layer; APM is a product category bundling tracing with metrics, logs, profiling, and alerting in one UI. OpenTelemetry data feeds Datadog, New Relic, Honeycomb, or a self-hosted stack — same SDK. The interview-grade framing: OTel for emission, APM vendor for analysis. That separation is why OTel won the standard war.

How do I know if my tracing setup is actually working?

Open one trace for a real production request and confirm three things: more than one service in the waterfall, parent-child links forming a connected tree with no orphan roots, and span attributes including the business identifiers you care about (user, order, tenant). If any check fails, your propagation, instrumentation, or attribute hygiene is broken — and you will only find out during the incident tracing was supposed to help debug.
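The three checks can be automated against an exported trace. This is a sketch: check_trace is a hypothetical helper, and the span shape (service, span_id, parent_span_id, attributes) and the required attribute names are assumptions about how your backend exports spans.

```python
def check_trace(spans: list[dict], required_attrs: tuple = ("user.id", "order.id")) -> dict[str, bool]:
    """Run the three health checks on one exported trace."""
    span_ids = {s["span_id"] for s in spans}
    services = {s["service"] for s in spans}
    # A root is any span whose parent is absent from the trace (None or missing).
    roots = [s for s in spans if s.get("parent_span_id") not in span_ids]
    seen_attrs = set().union(*(s.get("attributes", {}).keys() for s in spans))
    return {
        "multiple_services": len(services) > 1,
        "single_connected_tree": len(roots) == 1,
        "business_attributes": all(a in seen_attrs for a in required_attrs),
    }


healthy = [
    {"service": "api-gateway", "span_id": "a1", "parent_span_id": None,
     "attributes": {"user.id": "u-1842"}},
    {"service": "order-service", "span_id": "b2", "parent_span_id": "a1",
     "attributes": {"order.id": "o-99071"}},
    {"service": "order-service", "span_id": "c3", "parent_span_id": "b2",
     "attributes": {}},
]
# Same trace, but the last span points at a parent that was never recorded.
broken = healthy[:2] + [{**healthy[2], "parent_span_id": "zz"}]

print(check_trace(healthy))
print(check_trace(broken))
```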