Observability · Chapter 43 of 51

OpenTelemetry and Distributed Tracing

Akhil Sharma
20 min

OpenTelemetry and Distributed Tracing at Scale (Observability in Production)

Audience: platform engineers and SREs instrumenting microservices for production observability.

This article assumes:

  • Your system has dozens to hundreds of microservices calling each other.
  • A single user request might traverse 10-50 services.
  • Manual logging is insufficient - you need automated correlation across services.
  • Performance overhead matters - instrumentation can't slow down production traffic.

[CHALLENGE] Challenge: Your microservices are a black box

Scenario

Customer complaint: "Checkout failed at 3:42 PM with error 'Payment timeout'."

Your investigation:

  • Frontend logs: "Called payment service, got timeout after 5s"
  • Payment service logs: "Received request, called fraud detection, waiting..."
  • Fraud detection logs: No logs at 3:42 PM (service restarted at 3:40 PM, logs lost)
  • Database logs: Slow query at 3:41 PM, might be related?

You have 4 services with independent logs. No way to correlate them. No end-to-end visibility.

Interactive question (pause and think)

How do you debug a request that touches 4 services when logs are disconnected?

  1. Add request IDs manually to every log statement
  2. Use distributed tracing to automatically link spans
  3. Increase log verbosity everywhere
  4. Just grep logs really hard

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) - distributed tracing solves correlation automatically.

Manual correlation (option 1) doesn't scale beyond 3-5 services. Option 3 makes the problem worse: more volume with the same lack of correlation. Option 4 is exactly the manual pain you are trying to escape.

Real-world analogy (package tracking)

Shipping a package:

Without tracking: Package goes through sorting facilities, trucks, and delivery. If it's lost, you have no idea where.

With tracking: Every step scans the barcode. You see: "Picked up → Sorting facility → In transit → Out for delivery → Delivered." One tracking number links the entire journey.

Distributed tracing is package tracking for requests.

Key insight box

Distributed tracing automatically correlates logs, spans, and events across services using a shared trace ID, giving you end-to-end visibility.

Challenge question

If distributed tracing is so valuable, why isn't every company using it in production?


[MENTAL MODEL] Mental model - Traces, spans, and context propagation

Scenario

User clicks "Buy Now". The request flows frontend → payment service → fraud detection, with database queries along the way.

Each service needs to know: "This request is part of the same user action."

Interactive question (pause and think)

How do services know they're handling the same logical request?

  A. They share a database that tracks requests
  B. Each service generates a new request ID
  C. A trace ID is passed through headers
  D. Services don't know - logs are correlated later

Progressive reveal

Answer: C.

The trace ID travels with the request through HTTP headers (or gRPC metadata).

Core concepts

Trace:

  • Represents one complete user request end-to-end
  • Has a unique trace ID (e.g., abc123)
  • Contains multiple spans

Span:

  • Represents a single operation within a trace
  • Has a span ID and parent span ID
  • Contains: operation name, start time, duration, tags, logs

Context propagation:

  • Passing trace ID and span ID between services
  • Usually via HTTP headers: traceparent: 00-abc123-def456-01

Visual: trace structure (illustrative, using the checkout scenario)

  Trace abc123
  └── frontend: POST /checkout         span def456     5200ms
      ├── payment: charge-card         parent def456   5004ms (timed out)
      │   └── fraud-detection: check   (span never exported: service restarted)
      └── database: load-cart          parent def456     80ms

Context propagation example


Key insight box

Context propagation is the magic that links spans across services. Without it, you have independent spans, not a distributed trace.

Challenge question

What happens if one service in the chain doesn't propagate context? Does the trace break entirely?


[WARNING] OpenTelemetry vs proprietary solutions - standards matter

Scenario

Your team debates: "Should we use DataDog's tracing, New Relic's, or OpenTelemetry?"

The vendor lock-in problem

  • Instrument with a proprietary SDK and every service now depends on that vendor's API.
  • Switching vendors later means re-instrumenting all of your services by hand.
  • That migration cost is the vendor's pricing leverage: the more you instrument, the harder it is to leave.

OpenTelemetry architecture

  Application code
    └─ OTel SDK (vendor-neutral API, auto + manual instrumentation)
         └─ OTLP export
              └─ OTel Collector (batch, sample, redact, route)
                   └─ Any backend: Jaeger, Tempo, DataDog, New Relic, ...

Why OpenTelemetry won

  • It merged the two competing standards (OpenTracing and OpenCensus) into one CNCF project.
  • One SDK per language covers traces, metrics, and logs.
  • Every major vendor ingests OTLP, so you can switch backends without touching application code.

Key insight box

OpenTelemetry is the HTTPS of observability - a standard that everyone implements, preventing vendor lock-in.

Challenge question

If OpenTelemetry is vendor-neutral, how do vendors like DataDog make money? What's their moat?


[DEEP DIVE] Instrumenting applications - auto vs manual

Scenario

You need to add tracing to 50 microservices. Do you:

  • Auto-instrument everything (zero code changes)
  • Manually instrument critical paths (more control)

Auto-instrumentation

  • Library hooks create spans for you: HTTP servers and clients, database drivers, gRPC, message queues.
  • Zero (or near-zero) code changes; ideal for covering 50 services quickly.
  • Trade-off: generic spans only, with no business context like order IDs or customer tiers.

Auto-instrumentation example (Python): install the distro (pip install opentelemetry-distro, then opentelemetry-bootstrap -a install to pull instrumentations for the libraries you use) and launch the app under the wrapper: opentelemetry-instrument python app.py. Flask, requests, and database clients are patched at startup with no code changes.

Manual instrumentation

You create spans yourself around operations the business cares about, and attach attributes (order ID, payment amount) that auto-instrumentation can't know. In the Go SDK this is otel.Tracer("checkout") plus tracer.Start(ctx, "charge-card") and a deferred span.End().
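The span lifecycle can be sketched without the SDK. This toy Span is illustrative (the real Go type is trace.Span, created via tracer.Start), but its fields mirror the span definition from the mental-model section: span ID, parent span ID, timing, attributes.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a toy stand-in for trace.Span, holding exactly the fields
// the mental-model section lists.
type Span struct {
	Name     string
	TraceID  string
	ID       string
	ParentID string
	Start    time.Time
	Duration time.Duration
	Attrs    map[string]string
}

// startSpan creates a child span that inherits the parent's trace ID,
// which is what links spans into one trace.
func startSpan(parent *Span, name, id string) *Span {
	s := &Span{Name: name, ID: id, Start: time.Now(), Attrs: map[string]string{}}
	if parent != nil {
		s.TraceID = parent.TraceID
		s.ParentID = parent.ID
	}
	return s
}

func (s *Span) End() { s.Duration = time.Since(s.Start) }

func main() {
	root := startSpan(nil, "POST /checkout", "def456")
	root.TraceID = "abc123"

	child := startSpan(root, "charge-card", "789abc")
	child.Attrs["payment.method"] = "card" // business context: manual only
	child.End()
	root.End()

	fmt.Println(child.TraceID, child.ParentID) // abc123 def456
}
```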

Hybrid approach (best practice)

  • Auto-instrument all 50 services first: instant HTTP/DB coverage everywhere.
  • Add manual spans only on critical paths: checkout, payment, search.
  • Enrich auto-created spans with business attributes rather than creating new spans.

Key insight box

Start with auto-instrumentation for coverage, add manual instrumentation for business context. The combination gives you both breadth and depth.

Challenge question

You auto-instrument a legacy service and performance drops by 20%. How do you debug which instrumentation is causing overhead?


[PUZZLE] Sampling strategies at the SDK level

Scenario

You have 100K requests/second. Can't send all traces to backend.

Where do you sample: application (SDK) or collector?

Think about it

  • SDK sampling: Decide at request start (head-based)
  • Collector sampling: Decide after trace completes (tail-based)

Interactive question (pause and think)

Why might you sample at the SDK level instead of collector?

  A. Reduce network bandwidth (don't send unsampled traces)
  B. Reduce collector load
  C. Lower latency (no collector buffering)
  D. All of the above

Progressive reveal

Answer: D - SDK sampling reduces load on everything downstream.

SDK sampling strategies

Strategy 1: AlwaysOn sampler

Sample every trace. Fine in development and test; at 100K requests/second it floods the network, the collector, and your bill. (Go SDK: sdktrace.AlwaysSample().)

Strategy 2: Probabilistic sampler

Sample a fixed fraction of traces, with the decision derived deterministically from the trace ID so every service reaches the same verdict. (Go SDK: sdktrace.TraceIDRatioBased(0.10) for 10%.)

Strategy 3: Parent-based sampler

Respect the caller's decision: if the upstream service sampled the trace, sample your spans too; apply a local sampler only for root spans. This keeps traces complete instead of half-recorded. (Go SDK: sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10)).)

Strategy 4: Custom sampler (rate limiting)

Cap sampled traces at a fixed rate (say, 100 traces/second) no matter how traffic spikes, by implementing the SDK's Sampler interface with a token bucket.

SDK vs Collector sampling trade-offs

  • SDK (head-based): decides at request start. Cheap, cuts bandwidth and collector load, but it cannot yet know whether the trace will turn out slow or erroring.
  • Collector (tail-based): decides after the full trace arrives. Can keep every error and every slow trace, but must buffer all traces in memory until the decision.

Key insight box

SDK sampling is economic (reduce costs), collector sampling is intelligent (keep important traces). Use both in layers.

Challenge question

How do you ensure consistent sampling across multiple SDK languages (Go, Python, Java) without configuration drift?


[DEEP DIVE] Performance overhead - measuring the cost of observability

Scenario

You add OpenTelemetry to production. Suddenly p99 latency increases from 200ms to 250ms.

Is the 25% latency increase worth the observability gain?

Sources of overhead

  • Span creation: allocations and timestamps on every operation.
  • Attributes: string formatting and serialization.
  • Context propagation: header parsing and injection on every hop.
  • Export: batching, compression, and network I/O (the biggest cost if done synchronously).

Measuring overhead

Run the same endpoint with instrumentation on and off, under identical load, and compare p50/p99. Microbenchmarks isolate per-span cost; load tests expose export and GC pressure.
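One way to isolate per-span cost with only the standard library's testing.Benchmark. The "span" here is simulated by the two timestamps a real span records; export cost is deliberately out of scope:

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// work simulates the handler's real job.
func work() {
	s := 0
	for i := 0; i < 10_000; i++ {
		s += i
	}
	_ = s
}

// instrumented simulates minimal span bookkeeping around the same work:
// one timestamp at start, one at end.
func instrumented() {
	start := time.Now()
	work()
	_ = time.Since(start)
}

func main() {
	base := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			work()
		}
	})
	instr := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			instrumented()
		}
	})
	fmt.Printf("baseline:     %d ns/op\n", base.NsPerOp())
	fmt.Printf("instrumented: %d ns/op\n", instr.NsPerOp())
}
```

Compare the two ns/op figures to get the per-span floor; real spans add allocation and attribute costs on top.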

Optimization strategies

Optimization 1: Reduce span cardinality

Name spans after the route template (GET /users/{id}), never the concrete URL (GET /users/12345). Put the concrete value in an attribute instead.
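A sketch of route-template naming. The regex and the {id} convention here are illustrative; most HTTP frameworks expose the matched route directly, which is the better source:

```go
package main

import (
	"fmt"
	"regexp"
)

// idSegment matches numeric path segments, the usual source of
// span-name cardinality explosions.
var idSegment = regexp.MustCompile(`/\d+`)

// spanName collapses concrete IDs into a route template so all
// requests to the same endpoint share one span name.
func spanName(method, path string) string {
	return method + " " + idSegment.ReplaceAllString(path, "/{id}")
}

func main() {
	fmt.Println(spanName("GET", "/users/12345/orders/67890"))
	// GET /users/{id}/orders/{id}
}
```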

Optimization 2: Sampling

Sample 10-20% at the SDK. Span creation still costs a little, but export, the expensive part, drops proportionally.

Optimization 3: Asynchronous export

Never export on the request path. A batch span processor buffers finished spans and ships them from a background goroutine. (Go SDK: sdktrace.NewBatchSpanProcessor.)

Optimization 4: Conditional instrumentation

Skip spans entirely on hot, uninteresting paths (health checks, cache hits), or gate detailed attributes behind a config flag so the common case pays near-zero cost.

Key insight box

Tracing overhead is typically 1-10% latency increase. Optimize by sampling, async export, and low-cardinality span names.

Challenge question

Your service creates 50 spans per request. Overhead is 20%. How do you decide which spans to keep and which to remove?


[WARNING] Common OpenTelemetry anti-patterns

Scenario

Your team has instrumented everything. But traces are useless.

Anti-patterns catalog

Anti-pattern 1: Too many spans (noise)

A span around every helper function buries the signal: a 50-span trace for one request is unreadable and expensive. Create spans for network calls, database queries, and meaningful business operations, not for every getter.

Anti-pattern 2: High-cardinality span names

Span names like process-order-8812 create millions of unique names, breaking aggregation and search. Keep the name templated (process-order) and put the ID in an attribute.

Anti-pattern 3: Forgetting to propagate context

Spawning a goroutine or issuing an HTTP call with a fresh context.Background() drops the trace ID: the downstream work shows up as a disconnected root span. Always pass the request's ctx through.

Anti-pattern 4: Sensitive data in spans

Attributes like user.password or card.number get exported and stored in the tracing backend, often with long retention. Redact at the SDK or in the collector before export.

Anti-pattern 5: Blocking on span export

A synchronous exporter adds a backend round trip to every request and fails requests when the collector is down. Export asynchronously with batching.
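A sketch of the non-blocking shape: a buffered channel plus a background goroutine. The SDK's batch span processor does this for you, with flush intervals and retry that this toy omits.

```go
package main

import "fmt"

// exporter drains finished spans in batches, off the request path.
func exporter(spans <-chan string, done chan<- int) {
	batch := make([]string, 0, 100)
	shipped := 0
	for s := range spans {
		batch = append(batch, s)
		if len(batch) == cap(batch) {
			shipped += len(batch) // ship batch (a network call in real life)
			batch = batch[:0]
		}
	}
	shipped += len(batch) // final partial batch on shutdown
	done <- shipped
}

func main() {
	spans := make(chan string, 1000) // buffered: the request path never waits on export
	done := make(chan int)
	go exporter(spans, done)

	for i := 0; i < 250; i++ {
		spans <- fmt.Sprintf("span-%d", i)
	}
	close(spans)
	fmt.Println(<-done) // 250
}
```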

Key insight box

Common mistakes: too many spans, high cardinality, dropped context, PII exposure, synchronous export. Avoid these to keep traces useful and performant.

Challenge question

Your traces contain passwords in span attributes (accidental logging). How do you scrub existing traces from the backend?


[SYNTHESIS] Final synthesis - Instrument a production system

Synthesis challenge

You're the observability lead for a microservices e-commerce platform.

Requirements:

  • 50 microservices (Go, Python, Node.js)
  • 200K requests/second peak
  • Need to debug: checkout failures, slow searches, payment timeouts
  • Compliance: No PII in traces (GDPR)
  • Performance: < 5% latency overhead acceptable

Constraints:

  • Team: 5 engineers, limited OpenTelemetry experience
  • Budget: $20K/month for tracing backend
  • Current: No distributed tracing (only logs)
  • Timeline: 3 months to full rollout

Your tasks (pause and think)

  1. Choose instrumentation strategy (auto vs manual)
  2. Select sampling approach (SDK, collector, or both)
  3. Design trace backend architecture
  4. Plan PII redaction strategy
  5. Define performance testing approach
  6. Create rollout plan (which services first?)

Write down your architecture.

Progressive reveal (one possible solution)

1. Instrumentation strategy:

  • Month 1: auto-instrument all 50 services (HTTP, DB, cache) for baseline coverage.
  • Month 2: add manual spans and business attributes on checkout, payment, and search paths.
  • Standardize on OTel SDKs for Go, Python, and Node.js with a shared config module per language.

2. Sampling approach:

  • SDK head sampling at 10% (parent-based, trace-ID ratio) to cut volume at the source.
  • Collector tail sampling on top: keep 100% of errors and traces slower than 1s.

3. Trace backend architecture:

  • Services → load-balanced OTel Collector pool → backend.
  • Backend: self-hosted Grafana Tempo or Jaeger to stay inside $20K/month; Grafana for queries.

4. PII redaction strategy:

  • Never put raw emails, card numbers, or passwords in attributes (enforced in code review).
  • Defense in depth: a collector processor deletes or hashes known sensitive keys before export.
  • Audit sampled traces for PII before widening the rollout.

5. Performance testing:

  • Record baseline p50/p99 per service before instrumenting.
  • Load test each service with tracing enabled; gate rollout on < 5% p99 overhead.

6. Rollout plan:

  • Weeks 1-2: two low-risk internal services to validate the pipeline end to end.
  • Weeks 3-8: critical path (checkout, payment, fraud detection) with manual spans.
  • Weeks 9-12: remaining services, one language stack at a time.
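The sampling and redaction pieces can be sketched as collector configuration. The processor names come from the OpenTelemetry Collector contrib distribution; treat the exact schema as version-dependent and the attribute keys as examples:

```yaml
processors:
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: card.number
        action: delete
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
```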

Key insight box

Rollout distributed tracing incrementally: auto-instrument for coverage, manual instrumentation for depth, measure overhead continuously.

Final challenge question

After 6 months, your trace storage costs doubled (traffic grew). How do you reduce costs without losing observability?


Appendix: Quick checklist (printable)

Instrumentation checklist:

  • Choose SDK language (Go, Python, Java, Node.js)
  • Configure auto-instrumentation (HTTP, DB, cache)
  • Add manual spans for critical business logic
  • Propagate context across service boundaries
  • Test context propagation (end-to-end traces)
  • Avoid high-cardinality span names

Sampling configuration:

  • Configure SDK sampling (10-20% recommended)
  • Configure collector tail sampling (errors, slow)
  • Set sampling consistently across services
  • Monitor actual sample rate (vs expected)
  • Adjust sampling based on costs

Performance optimization:

  • Measure baseline latency (before tracing)
  • Measure overhead after instrumentation
  • Use async batched export (not synchronous)
  • Reduce span count (only meaningful operations)
  • Load test with tracing enabled

PII and security:

  • Identify sensitive attributes (passwords, credit cards)
  • Configure attribute redaction in collector
  • Hash or truncate PII (emails, IPs)
  • Audit traces for PII leakage
  • Document PII handling policy

Backend setup:

  • Deploy OTel Collectors (load balanced)
  • Choose backend (Jaeger, Tempo, vendor)
  • Configure retention policy
  • Setup Grafana dashboards
  • Test collector failover

Operational readiness:

  • Create runbooks for common trace queries
  • Train team on trace debugging
  • Monitor collector health (CPU, memory)
  • Monitor trace ingestion rate
  • Alert on trace export failures

Red flags (fix immediately):

  • Traces missing spans (context not propagated)
  • High-cardinality explosion (millions of unique span names)
  • PII in traces (compliance violation)
  • > 10% latency overhead (optimize instrumentation)
  • Collector OOM crashes (reduce buffer or add instances)
  • Traces incomplete (spans arriving out of order)

Key Takeaways

  1. Distributed tracing follows a request across multiple services — showing the full call chain, timing, and where latency or errors occur
  2. OpenTelemetry is the industry standard for instrumentation — vendor-neutral SDKs for traces, metrics, and logs across all major languages
  3. Trace context propagation passes trace IDs across service boundaries — via HTTP headers (traceparent) so each service can contribute its span
  4. Spans represent individual operations within a trace — with parent-child relationships forming a tree that visualizes the request flow
  5. Start with auto-instrumentation — OpenTelemetry SDKs automatically instrument HTTP clients, databases, and frameworks with zero code changes