Observability · Chapter 43 of 51

OpenTelemetry and Distributed Tracing

Akhil Sharma
20 min

OpenTelemetry and Distributed Tracing at Scale (Observability in Production)

Audience: platform engineers and SREs instrumenting microservices for production observability.

This article assumes:

  • Your system has dozens to hundreds of microservices calling each other.
  • A single user request might traverse 10-50 services.
  • Manual logging is insufficient - you need automated correlation across services.
  • Performance overhead matters - instrumentation can't slow down production traffic.

[CHALLENGE] Challenge: Your microservices are a black box

Scenario

Customer complaint: "Checkout failed at 3:42 PM with error 'Payment timeout'."

Your investigation:

  • Frontend logs: "Called payment service, got timeout after 5s"
  • Payment service logs: "Received request, called fraud detection, waiting..."
  • Fraud detection logs: No logs at 3:42 PM (service restarted at 3:40 PM, logs lost)
  • Database logs: Slow query at 3:41 PM, might be related?

You have 4 services with independent logs. No way to correlate them. No end-to-end visibility.

Interactive question (pause and think)

How do you debug a request that touches 4 services when logs are disconnected?

  1. Add request IDs manually to every log statement
  2. Use distributed tracing to automatically link spans
  3. Increase log verbosity everywhere
  4. Just grep logs really hard

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) - distributed tracing solves correlation automatically.

Manual correlation (option 1) doesn't scale beyond 3-5 services. Option 3 makes the problem worse: more volume with the same lack of correlation. Option 4 is exactly the manual pain you are trying to escape.

Real-world analogy (package tracking)

Shipping a package:

Without tracking: Package goes through sorting facilities, trucks, and delivery. If it's lost, you have no idea where.

With tracking: Every step scans the barcode. You see: "Picked up → Sorting facility → In transit → Out for delivery → Delivered." One tracking number links the entire journey.

Distributed tracing is package tracking for requests.

Key insight box

Distributed tracing automatically correlates logs, spans, and events across services using a shared trace ID, giving you end-to-end visibility.

Challenge question

If distributed tracing is so valuable, why isn't every company using it in production?


[MENTAL MODEL] Mental model - Traces, spans, and context propagation

Scenario

User clicks "Buy Now". The request flows frontend → payment service → fraud detection, with database queries along the way.

Each service needs to know: "This request is part of the same user action."

Interactive question (pause and think)

How do services know they're handling the same logical request?

  A. They share a database that tracks requests
  B. Each service generates a new request ID
  C. A trace ID is passed through headers
  D. Services don't know - logs are correlated later

Progressive reveal

Answer: C.

The trace ID travels with the request through HTTP headers (or gRPC metadata).

Core concepts

Trace:

  • Represents one complete user request end-to-end
  • Has a unique trace ID (e.g., abc123)
  • Contains multiple spans

Span:

  • Represents a single operation within a trace
  • Has a span ID and parent span ID
  • Contains: operation name, start time, duration, tags, logs

Context propagation:

  • Passing trace ID and span ID between services
  • Usually via HTTP headers: traceparent: 00-abc123-def456-01

Visual: trace structure (illustrative, using the checkout scenario)

  Trace abc123
  └── frontend: POST /checkout         span def456     5200ms
      ├── payment: charge-card         parent def456   5004ms (timed out)
      │   └── fraud-detection: check   (span never exported: service restarted)
      └── database: load-cart          parent def456     80ms

Context propagation example


Key insight box

Context propagation is the magic that links spans across services. Without it, you have independent spans, not a distributed trace.

Challenge question

What happens if one service in the chain doesn't propagate context? Does the trace break entirely?


[WARNING] OpenTelemetry vs proprietary solutions - standards matter

Scenario

Your team debates: "Should we use DataDog's tracing, New Relic's, or OpenTelemetry?"

The vendor lock-in problem

  • Instrument with a proprietary SDK and every service now depends on that vendor's API.
  • Switching vendors later means re-instrumenting all of your services by hand.
  • That migration cost is the vendor's pricing leverage: the more you instrument, the harder it is to leave.

OpenTelemetry architecture

  Application code
    └─ OTel SDK (vendor-neutral API, auto + manual instrumentation)
         └─ OTLP export
              └─ OTel Collector (batch, sample, redact, route)
                   └─ Any backend: Jaeger, Tempo, DataDog, New Relic, ...

Why OpenTelemetry won

  • It merged the two competing standards (OpenTracing and OpenCensus) into one CNCF project.
  • One SDK per language covers traces, metrics, and logs.
  • Every major vendor ingests OTLP, so you can switch backends without touching application code.

Key insight box

OpenTelemetry is the HTTPS of observability - a standard that everyone implements, preventing vendor lock-in.

Challenge question

If OpenTelemetry is vendor-neutral, how do vendors like DataDog make money? What's their moat?


[DEEP DIVE] Instrumenting applications - auto vs manual

Scenario

You need to add tracing to 50 microservices. Do you:

  • Auto-instrument everything (zero code changes)
  • Manually instrument critical paths (more control)

Auto-instrumentation

  • Library hooks create spans for you: HTTP servers and clients, database drivers, gRPC, message queues.
  • Zero (or near-zero) code changes; ideal for covering 50 services quickly.
  • Trade-off: generic spans only, with no business context like order IDs or customer tiers.

Auto-instrumentation example (Python): install the distro (pip install opentelemetry-distro, then opentelemetry-bootstrap -a install to pull instrumentations for the libraries you use) and launch the app under the wrapper: opentelemetry-instrument python app.py. Flask, requests, and database clients are patched at startup with no code changes.

Manual instrumentation

You create spans yourself around operations the business cares about, and attach attributes (order ID, payment amount) that auto-instrumentation can't know. In the Go SDK this is otel.Tracer("checkout") plus tracer.Start(ctx, "charge-card") and a deferred span.End().
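The span lifecycle can be sketched without the SDK. This toy Span is illustrative (the real Go type is trace.Span, created via tracer.Start), but its fields mirror the span definition from the mental-model section: span ID, parent span ID, timing, attributes.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a toy stand-in for trace.Span, holding exactly the fields
// the mental-model section lists.
type Span struct {
	Name     string
	TraceID  string
	ID       string
	ParentID string
	Start    time.Time
	Duration time.Duration
	Attrs    map[string]string
}

// startSpan creates a child span that inherits the parent's trace ID,
// which is what links spans into one trace.
func startSpan(parent *Span, name, id string) *Span {
	s := &Span{Name: name, ID: id, Start: time.Now(), Attrs: map[string]string{}}
	if parent != nil {
		s.TraceID = parent.TraceID
		s.ParentID = parent.ID
	}
	return s
}

func (s *Span) End() { s.Duration = time.Since(s.Start) }

func main() {
	root := startSpan(nil, "POST /checkout", "def456")
	root.TraceID = "abc123"

	child := startSpan(root, "charge-card", "789abc")
	child.Attrs["payment.method"] = "card" // business context: manual only
	child.End()
	root.End()

	fmt.Println(child.TraceID, child.ParentID) // abc123 def456
}
```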

Hybrid approach (best practice)

  • Auto-instrument all 50 services first: instant HTTP/DB coverage everywhere.
  • Add manual spans only on critical paths: checkout, payment, search.
  • Enrich auto-created spans with business attributes rather than creating new spans.

Key insight box

Start with auto-instrumentation for coverage, add manual instrumentation for business context. The combination gives you both breadth and depth.

Challenge question

You auto-instrument a legacy service and performance drops by 20%. How do you debug which instrumentation is causing overhead?


[PUZZLE] Sampling strategies at the SDK level

Scenario

You have 100K requests/second. Can't send all traces to backend.

Where do you sample: application (SDK) or collector?

Think about it

  • SDK sampling: Decide at request start (head-based)
  • Collector sampling: Decide after trace completes (tail-based)

Interactive question (pause and think)

Why might you sample at the SDK level instead of collector?

  A. Reduce network bandwidth (don't send unsampled traces)
  B. Reduce collector load
  C. Lower latency (no collector buffering)
  D. All of the above

Progressive reveal

Answer: D - SDK sampling reduces load on everything downstream.

SDK sampling strategies

Strategy 1: AlwaysOn sampler

Sample every trace. Fine in development and test; at 100K requests/second it floods the network, the collector, and your bill. (Go SDK: sdktrace.AlwaysSample().)

Strategy 2: Probabilistic sampler

Sample a fixed fraction of traces, with the decision derived deterministically from the trace ID so every service reaches the same verdict. (Go SDK: sdktrace.TraceIDRatioBased(0.10) for 10%.)

Strategy 3: Parent-based sampler

Respect the caller's decision: if the upstream service sampled the trace, sample your spans too; apply a local sampler only for root spans. This keeps traces complete instead of half-recorded. (Go SDK: sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10)).)

Strategy 4: Custom sampler (rate limiting)

Cap sampled traces at a fixed rate (say, 100 traces/second) no matter how traffic spikes, by implementing the SDK's Sampler interface with a token bucket.

SDK vs Collector sampling trade-offs

  • SDK (head-based): decides at request start. Cheap, cuts bandwidth and collector load, but it cannot yet know whether the trace will turn out slow or erroring.
  • Collector (tail-based): decides after the full trace arrives. Can keep every error and every slow trace, but must buffer all traces in memory until the decision.

Key insight box

SDK sampling is economic (reduce costs), collector sampling is intelligent (keep important traces). Use both in layers.

Challenge question

How do you ensure consistent sampling across multiple SDK languages (Go, Python, Java) without configuration drift?


[DEEP DIVE] Performance overhead - measuring the cost of observability

Scenario

You add OpenTelemetry to production. Suddenly p99 latency increases from 200ms to 250ms.

Is the 25% latency increase worth the observability gain?

Sources of overhead

  • Span creation: allocations and timestamps on every operation.
  • Attributes: string formatting and serialization.
  • Context propagation: header parsing and injection on every hop.
  • Export: batching, compression, and network I/O (the biggest cost if done synchronously).

Measuring overhead

Run the same endpoint with instrumentation on and off, under identical load, and compare p50/p99. Microbenchmarks isolate per-span cost; load tests expose export and GC pressure.
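One way to isolate per-span cost with only the standard library's testing.Benchmark. The "span" here is simulated by the two timestamps a real span records; export cost is deliberately out of scope:

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// work simulates the handler's real job.
func work() {
	s := 0
	for i := 0; i < 10_000; i++ {
		s += i
	}
	_ = s
}

// instrumented simulates minimal span bookkeeping around the same work:
// one timestamp at start, one at end.
func instrumented() {
	start := time.Now()
	work()
	_ = time.Since(start)
}

func main() {
	base := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			work()
		}
	})
	instr := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			instrumented()
		}
	})
	fmt.Printf("baseline:     %d ns/op\n", base.NsPerOp())
	fmt.Printf("instrumented: %d ns/op\n", instr.NsPerOp())
}
```

Compare the two ns/op figures to get the per-span floor; real spans add allocation and attribute costs on top.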

Optimization strategies

Optimization 1: Reduce span cardinality

Name spans after the route template (GET /users/{id}), never the concrete URL (GET /users/12345). Put the concrete value in an attribute instead.
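A sketch of route-template naming. The regex and the {id} convention here are illustrative; most HTTP frameworks expose the matched route directly, which is the better source:

```go
package main

import (
	"fmt"
	"regexp"
)

// idSegment matches numeric path segments, the usual source of
// span-name cardinality explosions.
var idSegment = regexp.MustCompile(`/\d+`)

// spanName collapses concrete IDs into a route template so all
// requests to the same endpoint share one span name.
func spanName(method, path string) string {
	return method + " " + idSegment.ReplaceAllString(path, "/{id}")
}

func main() {
	fmt.Println(spanName("GET", "/users/12345/orders/67890"))
	// GET /users/{id}/orders/{id}
}
```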

Optimization 2: Sampling

Sample 10-20% at the SDK. Span creation still costs a little, but export, the expensive part, drops proportionally.

Optimization 3: Asynchronous export

Never export on the request path. A batch span processor buffers finished spans and ships them from a background goroutine. (Go SDK: sdktrace.NewBatchSpanProcessor.)

Optimization 4: Conditional instrumentation

Skip spans entirely on hot, uninteresting paths (health checks, cache hits), or gate detailed attributes behind a config flag so the common case pays near-zero cost.

Key insight box

Tracing overhead is typically 1-10% latency increase. Optimize by sampling, async export, and low-cardinality span names.

Challenge question

Your service creates 50 spans per request. Overhead is 20%. How do you decide which spans to keep and which to remove?


[WARNING] Common OpenTelemetry anti-patterns

Scenario

Your team has instrumented everything. But traces are useless.

Anti-patterns catalog

Anti-pattern 1: Too many spans (noise)

A span around every helper function buries the signal: a 50-span trace for one request is unreadable and expensive. Create spans for network calls, database queries, and meaningful business operations, not for every getter.

Anti-pattern 2: High-cardinality span names

Span names like process-order-8812 create millions of unique names, breaking aggregation and search. Keep the name templated (process-order) and put the ID in an attribute.

Anti-pattern 3: Forgetting to propagate context

Spawning a goroutine or issuing an HTTP call with a fresh context.Background() drops the trace ID: the downstream work shows up as a disconnected root span. Always pass the request's ctx through.

Anti-pattern 4: Sensitive data in spans

Attributes like user.password or card.number get exported and stored in the tracing backend, often with long retention. Redact at the SDK or in the collector before export.

Anti-pattern 5: Blocking on span export

A synchronous exporter adds a backend round trip to every request and fails requests when the collector is down. Export asynchronously with batching.
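A sketch of the non-blocking shape: a buffered channel plus a background goroutine. The SDK's batch span processor does this for you, with flush intervals and retry that this toy omits.

```go
package main

import "fmt"

// exporter drains finished spans in batches, off the request path.
func exporter(spans <-chan string, done chan<- int) {
	batch := make([]string, 0, 100)
	shipped := 0
	for s := range spans {
		batch = append(batch, s)
		if len(batch) == cap(batch) {
			shipped += len(batch) // ship batch (a network call in real life)
			batch = batch[:0]
		}
	}
	shipped += len(batch) // final partial batch on shutdown
	done <- shipped
}

func main() {
	spans := make(chan string, 1000) // buffered: the request path never waits on export
	done := make(chan int)
	go exporter(spans, done)

	for i := 0; i < 250; i++ {
		spans <- fmt.Sprintf("span-%d", i)
	}
	close(spans)
	fmt.Println(<-done) // 250
}
```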

Key insight box

Common mistakes: too many spans, high cardinality, dropped context, PII exposure, synchronous export. Avoid these to keep traces useful and performant.

Challenge question

Your traces contain passwords in span attributes (accidental logging). How do you scrub existing traces from the backend?


[SYNTHESIS] Final synthesis - Instrument a production system

Synthesis challenge

You're the observability lead for a microservices e-commerce platform.

Requirements:

  • 50 microservices (Go, Python, Node.js)
  • 200K requests/second peak
  • Need to debug: checkout failures, slow searches, payment timeouts
  • Compliance: No PII in traces (GDPR)
  • Performance: < 5% latency overhead acceptable

Constraints:

  • Team: 5 engineers, limited OpenTelemetry experience
  • Budget: $20K/month for tracing backend
  • Current: No distributed tracing (only logs)
  • Timeline: 3 months to full rollout

Your tasks (pause and think)

  1. Choose instrumentation strategy (auto vs manual)
  2. Select sampling approach (SDK, collector, or both)
  3. Design trace backend architecture
  4. Plan PII redaction strategy
  5. Define performance testing approach
  6. Create rollout plan (which services first?)

Write down your architecture.

Progressive reveal (one possible solution)

1. Instrumentation strategy:

  • Month 1: auto-instrument all 50 services (HTTP, DB, cache) for baseline coverage.
  • Month 2: add manual spans and business attributes on checkout, payment, and search paths.
  • Standardize on OTel SDKs for Go, Python, and Node.js with a shared config module per language.

2. Sampling approach:

  • SDK head sampling at 10% (parent-based, trace-ID ratio) to cut volume at the source.
  • Collector tail sampling on top: keep 100% of errors and traces slower than 1s.

3. Trace backend architecture:

  • Services → load-balanced OTel Collector pool → backend.
  • Backend: self-hosted Grafana Tempo or Jaeger to stay inside $20K/month; Grafana for queries.

4. PII redaction strategy:

  • Never put raw emails, card numbers, or passwords in attributes (enforced in code review).
  • Defense in depth: a collector processor deletes or hashes known sensitive keys before export.
  • Audit sampled traces for PII before widening the rollout.

5. Performance testing:

  • Record baseline p50/p99 per service before instrumenting.
  • Load test each service with tracing enabled; gate rollout on < 5% p99 overhead.

6. Rollout plan:

  • Weeks 1-2: two low-risk internal services to validate the pipeline end to end.
  • Weeks 3-8: critical path (checkout, payment, fraud detection) with manual spans.
  • Weeks 9-12: remaining services, one language stack at a time.
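The sampling and redaction pieces can be sketched as collector configuration. The processor names come from the OpenTelemetry Collector contrib distribution; treat the exact schema as version-dependent and the attribute keys as examples:

```yaml
processors:
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: card.number
        action: delete
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
```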

Key insight box

Rollout distributed tracing incrementally: auto-instrument for coverage, manual instrumentation for depth, measure overhead continuously.

Final challenge question

After 6 months, your trace storage costs doubled (traffic grew). How do you reduce costs without losing observability?


Appendix: Quick checklist (printable)

Instrumentation checklist:

  • Choose SDK language (Go, Python, Java, Node.js)
  • Configure auto-instrumentation (HTTP, DB, cache)
  • Add manual spans for critical business logic
  • Propagate context across service boundaries
  • Test context propagation (end-to-end traces)
  • Avoid high-cardinality span names

Sampling configuration:

  • Configure SDK sampling (10-20% recommended)
  • Configure collector tail sampling (errors, slow)
  • Set sampling consistently across services
  • Monitor actual sample rate (vs expected)
  • Adjust sampling based on costs

Performance optimization:

  • Measure baseline latency (before tracing)
  • Measure overhead after instrumentation
  • Use async batched export (not synchronous)
  • Reduce span count (only meaningful operations)
  • Load test with tracing enabled

PII and security:

  • Identify sensitive attributes (passwords, credit cards)
  • Configure attribute redaction in collector
  • Hash or truncate PII (emails, IPs)
  • Audit traces for PII leakage
  • Document PII handling policy

Backend setup:

  • Deploy OTel Collectors (load balanced)
  • Choose backend (Jaeger, Tempo, vendor)
  • Configure retention policy
  • Setup Grafana dashboards
  • Test collector failover

Operational readiness:

  • Create runbooks for common trace queries
  • Train team on trace debugging
  • Monitor collector health (CPU, memory)
  • Monitor trace ingestion rate
  • Alert on trace export failures

Red flags (fix immediately):

  • Traces missing spans (context not propagated)
  • High-cardinality explosion (millions of unique span names)
  • PII in traces (compliance violation)
  • > 10% latency overhead (optimize instrumentation)
  • Collector OOM crashes (reduce buffer or add instances)
  • Traces incomplete (spans arriving out of order)

Key Takeaways

  1. Distributed tracing follows a request across multiple services — showing the full call chain, timing, and where latency or errors occur
  2. OpenTelemetry is the industry standard for instrumentation — vendor-neutral SDKs for traces, metrics, and logs across all major languages
  3. Trace context propagation passes trace IDs across service boundaries — via HTTP headers (traceparent) so each service can contribute its span
  4. Spans represent individual operations within a trace — with parent-child relationships forming a tree that visualizes the request flow
  5. Start with auto-instrumentation — OpenTelemetry SDKs automatically instrument HTTP clients, databases, and frameworks with zero code changes