Observability · Chapter 44 of 51

Tail-Based Sampling

Akhil Sharma
20 min

Tail-Based Sampling (Smart Trace Collection at Scale)

Audience: observability engineers and SREs managing distributed tracing in high-throughput production systems.

This article assumes:

  • Your system generates millions of traces per second; you can't store them all.
  • The most important traces (failures, slow requests) are rare but critical.
  • Head-based sampling makes blind decisions before seeing the full trace.
  • Storage costs and query performance matter at scale.

[CHALLENGE] Challenge: Your sampling strategy throws away the evidence

Scenario

Your distributed tracing captures 1% of requests (head-based sampling).

Then an incident happens:

  • Customer reports: "Checkout failed at 2:47 PM"
  • You check traces: Nothing. The failed request wasn't in your 1% sample
  • Support: "This is the 5th complaint today about checkout failures"
  • You check error logs: Errors are there, but no traces to debug context

You're flying blind because your sampling threw away the evidence.

Interactive question (pause and think)

What's wrong with "sample 1% of all requests randomly"?

  1. 1% is too low (should sample more)
  2. Random sampling treats all requests equally
  3. You're sampling before you know which traces matter
  4. All of the above

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (3) is the core problem, which leads to (2).

Random head-based sampling decides at the START of a request whether to trace it. But you don't know if the request will fail or be slow until it COMPLETES.

Real-world analogy (security cameras)

Imagine a bank with security cameras that randomly record 1% of the day:

Head-based sampling: Decide at midnight which 1% of today to record. Might miss the robbery at 2 PM.

Tail-based sampling: Record everything temporarily, then at end of day, keep only footage with interesting events (robbery, suspicious activity). Delete boring footage.

Key insight box

Tail-based sampling makes sampling decisions AFTER seeing the complete trace, keeping important traces (errors, slow requests) while discarding boring ones.

Challenge question

If you keep 100% of traces temporarily before sampling, where do you store them? How long can you afford to keep them?


[MENTAL MODEL] Mental model - Sampling is an optimization, not a feature

Scenario

Your VP asks: "Why do we sample traces at all? Just store everything."

You explain: "We generate 10 million traces/day. That's 50 TB/day. That's $500K/month storage."

Interactive question (pause and think)

Why can't we just store all traces?

A. Storage costs too much
B. Query performance degrades with more data
C. Most traces are uninteresting (successful, fast requests)
D. All of the above

Progressive reveal

Answer: D.

But the real question is: Can we keep the important traces and drop the boring ones?

Mental model

Think of trace sampling as:

  • Signal vs noise problem: Errors are signal, successful requests are (mostly) noise
  • Compression problem: Keep data with high information content
  • Economic optimization: Maximize observability value per dollar spent

The goal: Achieve 99% of debugging value with 1-10% of the data.

Sampling strategies comparison

  • Head-based: decide at request start; cheap (no buffering, runs in the SDK), but blind to the outcome, so rare failures are dropped at the sample rate.
  • Tail-based: decide after the trace completes; keeps every error and slow trace, but requires buffering all traces until a decision is made.
  • Hybrid: head-sample to cut volume, then tail-sample the survivors; trades some error coverage for lower collector cost.

Key insight box

Head-based sampling is economically efficient but informationally wasteful. Tail-based sampling inverts the trade-off.

Challenge question

If tail-based sampling is better, why does anyone use head-based sampling?


[WARNING] Understanding the tail-based sampling architecture

Scenario

You decide to implement tail-based sampling. Where does the logic run?

Architecture options

Option 1: Application-side tail sampling (doesn't work)

Each service only ever sees its own spans, never the assembled trace, so no SDK can know whether the request as a whole failed or was slow. The decision has to run somewhere that receives every span of a trace.

Option 2: Collector-side tail sampling (standard approach)

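A minimal configuration for this approach, using the tail_sampling processor from opentelemetry-collector-contrib (endpoint name is a placeholder; policy values are illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 10s      # buffer spans this long before deciding
    num_traces: 100000      # max traces held in memory at once
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

exporters:
  otlp:
    endpoint: tracing-backend.internal:4317   # placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```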

Visual: tail-based sampling pipeline

  Services ──spans──▶ Load balancer (hash by trace_id) ──▶ Sampling collectors
  (buffer trace, wait for completion, apply policies) ──kept traces──▶ Storage

Memory management challenges

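To see why memory is the binding constraint, here is a back-of-the-envelope sketch in Go (the request rate, span size, and decision window are illustrative assumptions, not measurements):

```go
package main

import "fmt"

// bufferBytes estimates collector memory needed for in-flight traces:
// every trace must stay buffered for the full decision window.
func bufferBytes(tracesPerSec, avgSpansPerTrace, bytesPerSpan, decisionWaitSec int) int {
	buffered := tracesPerSec * decisionWaitSec // traces resident at any instant
	return buffered * avgSpansPerTrace * bytesPerSpan
}

func main() {
	// Illustrative: 50K traces/sec, 8 spans/trace, ~500 B/span, 10 s wait.
	b := bufferBytes(50_000, 8, 500, 10)
	fmt.Printf("%.1f GB of span buffer\n", float64(b)/1e9) // prints "2.0 GB of span buffer"
}
```

Doubling either the traffic or the decision window doubles the buffer, which is why both knobs matter.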

Key insight box

Tail-based sampling requires buffering complete traces in memory, which means memory usage scales with request rate and trace duration.

Challenge question

What happens if a trace never completes (missing spans due to network issues)? How long do you wait before making a decision?


[DEEP DIVE] Sampling policies - what to keep, what to discard

Scenario

You have 10 million traces buffered. Which ones do you keep?

Core sampling policies

Policy 1: Always keep errors

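With the OpenTelemetry Collector's tail_sampling processor (opentelemetry-collector-contrib), this is the built-in status_code policy:

```yaml
processors:
  tail_sampling:
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]   # keep any trace containing an error span
```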

Policy 2: Always keep slow requests

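In the same policies list, the latency policy keeps traces whose end-to-end duration exceeds a threshold (2 s here is illustrative):

```yaml
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 2000   # keep traces slower than 2 s
```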

Policy 3: Probabilistic sampling for healthy traces

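The probabilistic policy keeps a fixed percentage of everything else, giving a healthy-traffic baseline to compare failures against (1% is illustrative):

```yaml
      - name: healthy-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1   # keep ~1% of traces regardless of outcome
```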

Policy 4: Rate limiting per attribute

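The collector's rate_limiting policy caps sampled throughput globally; a per-attribute cap can be approximated by wrapping it with a string_attribute match inside an and policy (the http.route value is an illustrative hot endpoint):

```yaml
      - name: cap-hot-endpoint
        type: and
        and:
          and_sub_policy:
            - name: match-endpoint
              type: string_attribute
              string_attribute:
                key: http.route
                values: [/api/checkout]   # illustrative endpoint
            - name: limit-endpoint
              type: rate_limiting
              rate_limiting:
                spans_per_second: 100
```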

Policy 5: String attribute matching

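The string_attribute policy keeps traces whose spans carry a matching attribute, useful for debug flags or specific tenants (the attribute key and values here are assumptions):

```yaml
      - name: keep-flagged-traffic
        type: string_attribute
        string_attribute:
          key: user.tier            # illustrative attribute
          values: [internal, beta]
```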

Composite policy example

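Putting the pieces together, a full processor configuration might look like this sketch; a trace is kept if any policy matches:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```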

Implementation


Key insight box

Effective tail-based sampling uses composite policies: keep 100% of interesting traces (errors, slow) + small % of baseline traffic.

Challenge question

How do you set the "slow request" threshold dynamically as your system's performance changes over time?


[PUZZLE] The trace completion problem - when is a trace "done"?

Scenario

You're buffering traces to make tail-based decisions. But how do you know when a trace is complete?

Think about it

A trace might have:

  • 3 spans (simple request)
  • 300 spans (complex microservice call graph)
  • Missing spans (network packet loss)

How long do you wait?

Interactive question (pause and think)

When should you make a sampling decision?

A. After receiving root span
B. After X seconds of no new spans
C. After all parent-child references resolved
D. All of the above, with fallbacks

Progressive reveal

Answer: D - you need multiple heuristics.

Trace completion heuristics

Heuristic 1: Root span detection. A span with no parent has arrived; once the root is present, its duration bounds the rest of the trace.

Heuristic 2: Reference resolution. Every buffered span's parent_span_id resolves to a span already in the buffer, so no known span is still missing.

Heuristic 3: Inactivity timeout. No new span has arrived for this trace_id for N seconds (commonly 2-3x your p99 latency).

Heuristic 4: Expected span count (if available). Instrumentation annotates the root span with how many spans the trace should contain; decide once that count arrives.

Practical completion strategy


Production configuration

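In the OpenTelemetry Collector, the production knobs live on the tail_sampling processor itself (values illustrative; decision_wait is the one timer the processor exposes directly, so richer heuristics need custom processing):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s                  # buffer window before deciding
    num_traces: 200000                  # max traces held in memory
    expected_new_traces_per_sec: 5000   # sizing hint for internal buffers
```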

Key insight box

Trace completion is probabilistic, not deterministic. Use multiple heuristics with timeouts to force decisions before memory exhaustion.

Challenge question

A long-running async job generates spans over 5 minutes. How do you handle this without waiting 5 minutes to make a sampling decision?


[WARNING] Scaling challenges - handling millions of traces

Scenario

Your system does 100K requests/second. That's 8.6 billion traces/day.

Even with tail-based sampling keeping only 10%, that's 860 million traces to store.

Scaling dimensions

Dimension 1: Collector memory. Usage scales with traces buffered at once: traces/sec x decision window x average trace size. 50K traces/sec held for 10 s at ~4 KB each is already ~2 GB.

Dimension 2: Collector throughput. Every span must be deserialized, buffered, and evaluated; CPU is usually the first bottleneck, well before network.

Dimension 3: Trace distribution. All spans of a trace must reach the same collector instance, or no instance ever sees a complete trace. Route by trace_id, never round-robin.

Horizontal scaling architecture

  Services ─▶ Routing tier (consistent hash on trace_id)
                ├─▶ Sampling collector 1 ─┐
                ├─▶ Sampling collector 2 ─┼─▶ Trace storage
                └─▶ Sampling collector N ─┘

Load balancer configuration

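The usual pattern is a first tier of stateless collectors running the loadbalancing exporter (opentelemetry-collector-contrib), which hashes on trace ID so every span of a trace reaches the same second-tier sampling collector (the hostname is a placeholder):

```yaml
exporters:
  loadbalancing:
    routing_key: traceID          # hash spans to backends by trace ID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: tail-collectors.internal   # placeholder service name

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```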

Memory pressure handling

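A sketch of oldest-first eviction under memory pressure. The key design point is that evicted traces get a forced sampling decision immediately rather than silently disappearing (sizes here are illustrative):

```go
package main

import "fmt"

// evictOldest drops the oldest buffered traces until usage falls below
// the high-water mark, returning how many traces were force-decided.
func evictOldest(bufferedBytes, highWater int, oldestSizes []int) (evicted int) {
	for _, sz := range oldestSizes {
		if bufferedBytes <= highWater {
			break
		}
		bufferedBytes -= sz // force a decision and release this trace
		evicted++
	}
	return evicted
}

func main() {
	// 1.2 GB buffered, 1.0 GB high-water mark, oldest traces ~100 MB each.
	n := evictOldest(1_200_000_000, 1_000_000_000, []int{100e6, 100e6, 100e6, 100e6})
	fmt.Println(n) // prints "2"
}
```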

Key insight box

Tail-based sampling at scale requires distributed collectors with trace-aware routing and aggressive memory management.

Challenge question

How do you handle a traffic spike that multiplies your trace volume by 10? Can you shed load gracefully?


[DEEP DIVE] Tail sampling vs head sampling - when to use which

Scenario

Your manager asks: "Should we use tail-based sampling for everything?"

Trade-off analysis

  • Head-based: near-zero overhead, no buffering, runs in the SDK; but error coverage equals the sample rate, and the decision can never use the outcome.
  • Tail-based: 100% coverage of errors and slow requests; but needs a collector tier, buffering memory, and trace-aware routing.

Hybrid approach (best of both worlds)

Head-sample in the SDK at a moderate rate (say 25-50%) to cut span volume, then tail-sample the survivors at the collector. You give up a fixed fraction of errors (those dropped at the head) in exchange for a proportionally smaller collector fleet.

Decision matrix

  • Low traffic or early-stage system: head-based is fine.
  • Errors and latency outliers must always be debuggable: tail-based.
  • Very high volume on a tight budget: hybrid.
  • No collector infrastructure available: head-based (tail sampling requires one).
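One concrete way to wire a hybrid setup: use the SDK's built-in ratio sampler for the head stage (these are the standard OpenTelemetry environment variables; the 25% rate is illustrative) and run tail sampling at the collector as before:

```shell
# Head stage, set in each service's environment:
# keep 25% of traces at the root, honoring the parent's decision downstream.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.25
```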

Key insight box

Pure tail-based sampling is ideal for observability but expensive at scale. Hybrid approaches balance cost and coverage.

Challenge question

Can you implement tail-based sampling at the application level (SDK) instead of collector, avoiding the buffering problem?


[SYNTHESIS] Final synthesis - Design your tail-based sampling system

Synthesis challenge

You're the observability lead for a global e-commerce platform.

Requirements:

  • 500K requests/second peak (43 billion requests/day)
  • Average 8 spans per trace = 4M spans/second
  • Error rate: 0.5% (215 million errors/day)
  • Slow requests (>2s): 1% (430 million/day)
  • Must catch 99%+ of errors and slow requests for debugging
  • Budget: $50K/month for trace storage

Constraints:

  • Current: Head-based 1% sampling (430M traces/day)
  • Problem: Missing 99% of errors (only 2.1M errors stored)
  • Team: 3 observability engineers
  • Must work with OpenTelemetry

Your tasks (pause and think)

  1. Calculate storage requirements for different sampling strategies
  2. Design tail-based sampling policies
  3. Plan collector architecture (how many instances?)
  4. Handle trace completion (timeout strategy)
  5. Define memory management approach
  6. Estimate costs and justify to leadership

Write down your design.

Progressive reveal (one possible solution)

1. Storage requirements:

   Keep 100% of errors (215M/day) and slow requests (430M/day), plus 1% of the remaining ~42.4B healthy traces (~424M/day): about 1.07B traces/day. That is ~2.5x today's 430M traces, but error coverage goes from 1% to 100%. Assuming ~10 KB per 8-span trace, that is roughly 10 TB/day of trace data.

2. Tail-based sampling policies:

   • status_code = ERROR: keep 100%
   • latency > 2000 ms: keep 100%
   • probabilistic 1%: healthy-traffic baseline
   • rate limit per endpoint so no hot path dominates the baseline

3. Collector architecture:

   4M spans/sec at peak. Assuming ~100K spans/sec per collector instance, ~40 instances are needed; provision ~60 for headroom and AZ loss. Front them with a stateless routing tier that consistent-hashes on trace_id.

4. Trace completion strategy:

   decision_wait of 10 s (5x the 2 s slow threshold), hard cap of 30 s trace age. Traces forced out early are kept if any span seen so far is an error.

5. Memory management:

   ~500K traces/sec x 10 s wait = 5M traces buffered; at ~8 KB each that is ~40 GB fleet-wide, under 1 GB per instance. Add a memory_limiter processor and oldest-first eviction with forced decisions under pressure.

6. Cost justification:

   Storage grows ~2.5x, but captured errors go from 2.1M/day to 215M/day. Framed against incident cost: if complete error traces shave one hour off one major incident per month, the storage delta typically pays for itself.

Key insight box

Tail-based sampling is a business decision: pay more for storage to save more on incident costs through better debugging.

Final challenge question

After implementing tail-based sampling, your storage costs are on target, but query performance degrades (too many traces to search). How do you optimize?


Appendix: Quick checklist (printable)

Tail-based sampling design:

  • Calculate expected trace volume (requests/sec × retention window)
  • Estimate memory requirements (traces buffered × avg trace size)
  • Define sampling policies (errors, slow, baseline)
  • Choose completion detection strategy (timeout + heuristics)
  • Plan collector scaling (horizontal based on traffic)
  • Configure load balancing (trace-aware routing)

Sampling policy checklist:

  • Always keep errors (100% of failures)
  • Always keep slow requests (p99+ latency)
  • Rate limit per endpoint (prevent hot endpoints)
  • Keep baseline healthy traces (0.5-1% for comparison)
  • Support debug mode (specific users/features)
  • Test policies before production (dry-run mode)

Trace completion detection:

  • Set inactivity timeout (2-3x p99 latency)
  • Set max trace age (force decision limit)
  • Implement root span detection
  • Implement parent reference resolution
  • Handle incomplete traces (timeout fallback)
  • Monitor completion rate (alert if <90%)

Memory management:

  • Set per-collector memory limits
  • Implement eviction on memory pressure
  • Monitor buffer size (traces in memory)
  • Alert on high eviction rate
  • Load test memory usage (before production)
  • Plan for traffic spikes (burst capacity)

Scaling considerations:

  • Use consistent hashing by trace_id
  • Deploy collectors across availability zones
  • Monitor collector CPU/memory utilization
  • Auto-scale based on traffic
  • Test failover (collector crashes)
  • Measure end-to-end latency (span → decision → storage)

Cost optimization:

  • Measure actual sample rate (% of traces kept)
  • Analyze policy effectiveness (which policies trigger most?)
  • Optimize trace completion timeout (balance completeness vs memory)
  • Consider hybrid head+tail (reduce collector load)
  • Monitor storage costs (alert on unexpected growth)
  • Review policies quarterly (adjust based on traffic patterns)

Red flags (redesign needed):

  • Memory pressure causes >10% trace eviction
  • Incomplete trace rate >20%
  • Collector CPU consistently >80%
  • Missing critical errors in sampled data
  • Storage costs exceed ROI from better debugging
  • Query performance degraded despite sampling

Key Takeaways

  1. Tail-based sampling decides whether to keep a trace after it completes — unlike head-based sampling which decides upfront and misses rare errors
  2. Captures 100% of error and high-latency traces — while sampling routine successful requests to control storage costs
  3. Requires buffering complete traces before making the sampling decision — adding memory overhead and collection delay
  4. Essential for debugging production issues — the traces you most need to see (errors, timeouts) are exactly the ones head-based sampling drops

Course Complete!

You've finished all 51 chapters of

System Design Advanced
