Audience: platform engineers and SREs instrumenting microservices for production observability.
This article assumes:
Customer complaint: "Checkout failed at 3:42 PM with error 'Payment timeout'."
Your investigation:
You have 4 services with independent logs. No way to correlate them. No end-to-end visibility.
How do you debug a request that touches 4 services when logs are disconnected?
Take 10 seconds.
Answer: (2) - distributed tracing solves correlation automatically.
Manual correlation (option 1) doesn't scale beyond 3-5 services. Option 3 makes the problem worse (more noise), and option 4 is painful to maintain.
Shipping a package:
Without tracking: Package goes through sorting facilities, trucks, and delivery. If it's lost, you have no idea where.
With tracking: Every step scans the barcode. You see: "Picked up → Sorting facility → In transit → Out for delivery → Delivered." One tracking number links the entire journey.
Distributed tracing is package tracking for requests.
Distributed tracing automatically correlates logs, spans, and events across services using a shared trace ID, giving you end-to-end visibility.
If distributed tracing is so valuable, why isn't every company using it in production?
User clicks "Buy Now" button. Request flows:
Each service needs to know: "This request is part of the same user action."
How do services know they're handling the same logical request?
A. They share a database that tracks requests
B. Each service generates a new request ID
C. A trace ID is passed through headers
D. Services don't know - logs are correlated later
Answer: C.
The trace ID travels with the request through HTTP headers (or gRPC metadata).
Trace: the end-to-end journey of one request across all services, identified by a single trace ID (e.g., abc123).
Span: one unit of work inside the trace - a single service call or operation - identified by a span ID (e.g., def456).
Context propagation: each service passes the trace context to the next, typically via the W3C traceparent header:
traceparent: 00-abc123-def456-01
Context propagation is the magic that links spans across services. Without it, you have independent spans, not a distributed trace.
What happens if one service in the chain doesn't propagate context? Does the trace break entirely?
Your team debates: "Should we use DataDog's tracing, New Relic's, or OpenTelemetry?"
OpenTelemetry is the HTTPS of observability - a standard that everyone implements, preventing vendor lock-in.
If OpenTelemetry is vendor-neutral, how do vendors like DataDog make money? What's their moat?
You need to add tracing to 50 microservices. Do you:
Auto-instrumentation example (Python):
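One common zero-code approach uses the OpenTelemetry Python distro's `opentelemetry-instrument` launcher. This setup fragment assumes a service entry point named `app.py` and the service name `checkout` (both placeholders):

```shell
pip install opentelemetry-distro
opentelemetry-bootstrap -a install   # detect installed libraries, install matching instrumentations
OTEL_SERVICE_NAME=checkout opentelemetry-instrument python app.py
```

The launcher patches supported libraries (Flask, requests, SQLAlchemy, and so on) at startup, so HTTP and database spans appear without touching application code.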
Start with auto-instrumentation for coverage, add manual instrumentation for business context. The combination gives you both breadth and depth.
You auto-instrument a legacy service and performance drops by 20%. How do you debug which instrumentation is causing overhead?
You have 100K requests/second. You can't send every trace to the backend.
Where do you sample: application (SDK) or collector?
Why might you sample at the SDK level instead of collector?
A. Reduce network bandwidth (don't send unsampled traces)
B. Reduce collector load
C. Lower latency (no collector buffering)
D. All of the above
Answer: D - SDK sampling reduces load on everything downstream.
Strategy 1: AlwaysOn sampler
Strategy 2: Probabilistic sampler
Strategy 3: Parent-based sampler
Strategy 4: Custom sampler (rate limiting)
SDK sampling is economic (reduce costs), collector sampling is intelligent (keep important traces). Use both in layers.
How do you ensure consistent sampling across multiple SDK languages (Go, Python, Java) without configuration drift?
You add OpenTelemetry to production. Suddenly p99 latency increases from 200ms to 250ms.
Is the 25% latency increase worth the observability gain?
Optimization 1: Reduce span cardinality
Optimization 2: Sampling
Optimization 3: Asynchronous export
Optimization 4: Conditional instrumentation
Tracing overhead is typically 1-10% latency increase. Optimize by sampling, async export, and low-cardinality span names.
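Optimization 3 can be sketched as follows, mimicking what OpenTelemetry's BatchSpanProcessor does: finished spans land on a bounded in-memory queue and a background thread exports them in batches, so the request path never blocks on the network. The exporter callback here just collects batches for illustration; in production it would POST to a collector.

```python
import queue
import threading

class BatchExporter:
    """Background thread drains a bounded queue, exporting spans in batches."""

    def __init__(self, export_fn, max_batch=512, queue_size=2048):
        self.export_fn = export_fn
        self.max_batch = max_batch
        self.q = queue.Queue(maxsize=queue_size)
        self._done = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_end(self, span):
        """Hot path: never blocks. Drops the span if the queue is full,
        because dropping telemetry is cheaper than stalling a request."""
        try:
            self.q.put_nowait(span)
        except queue.Full:
            pass

    def _run(self):
        batch = []
        while not (self._done.is_set() and self.q.empty()):
            try:
                batch.append(self.q.get(timeout=0.05))
            except queue.Empty:
                pass
            if batch and (len(batch) >= self.max_batch or self.q.empty()):
                self.export_fn(batch)
                batch = []
        if batch:
            self.export_fn(batch)  # flush the remainder on shutdown

    def shutdown(self):
        self._done.set()
        self._worker.join()

batches = []
exporter = BatchExporter(batches.append, max_batch=10)
for i in range(25):
    exporter.on_end({"span": i})
exporter.shutdown()
assert sum(len(b) for b in batches) == 25  # nothing lost, nothing blocked
```

The bounded queue is the key trade-off: under backpressure this design sheds spans instead of latency, which is usually the right choice for telemetry.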
Your service creates 50 spans per request. Overhead is 20%. How do you decide which spans to keep and which to remove?
Your team has instrumented everything, but the traces are useless.
Anti-pattern 1: Too many spans (noise)
Anti-pattern 2: High-cardinality span names
Anti-pattern 3: Forgetting to propagate context
Anti-pattern 4: Sensitive data in spans
Anti-pattern 5: Blocking on span export
Common mistakes: too many spans, high cardinality, dropped context, PII exposure, synchronous export. Avoid these to keep traces useful and performant.
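Anti-pattern 4 is cheapest to prevent at export time. Here is a sketch of an attribute-scrubbing step run before spans leave the process; the key patterns are illustrative, and real deployments often do this in the OpenTelemetry Collector instead so every language is covered by one config:

```python
import re

# Illustrative deny-list of sensitive-looking attribute keys.
SENSITIVE_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|authorization|card)", re.I
)

def scrub_attributes(attributes: dict) -> dict:
    """Replace the values of sensitive-looking attribute keys before export."""
    return {
        k: "[REDACTED]" if SENSITIVE_KEY.search(k) else v
        for k, v in attributes.items()
    }

span_attrs = {
    "http.method": "POST",
    "user.password": "hunter2",
    "payment.card_number": "4111111111111111",
}
clean = scrub_attributes(span_attrs)
assert clean["user.password"] == "[REDACTED]"
assert clean["payment.card_number"] == "[REDACTED]"
assert clean["http.method"] == "POST"  # non-sensitive keys pass through
```

Scrubbing at the edge only prevents new leaks; data already stored in the backend still has to be deleted there, which is exactly why redaction belongs before export.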
Your traces contain passwords in span attributes (accidental logging). How do you scrub existing traces from the backend?
You're the observability lead for a microservices e-commerce platform.
Requirements:
Constraints:
Write down your architecture.
1. Instrumentation strategy:
2. Sampling approach:
3. Trace backend architecture:
4. PII redaction strategy:
5. Performance testing:
6. Rollout plan:
Rollout distributed tracing incrementally: auto-instrument for coverage, manual instrumentation for depth, measure overhead continuously.
After 6 months, your trace storage costs doubled (traffic grew). How do you reduce costs without losing observability?
Instrumentation checklist:
Sampling configuration:
Performance optimization:
PII and security:
Backend setup:
Operational readiness:
Red flags (fix immediately):