Audience: observability engineers and SREs managing distributed tracing in high-throughput production systems.
This article assumes your distributed tracing captures 1% of requests via head-based sampling.
Then an incident happens, and you're flying blind: your sampling threw away the evidence.
What's wrong with "sample 1% of all requests randomly"?
Take 10 seconds.
Answer: the core problem is that the sampling decision is made before the request's outcome is known, which in turn means the rare failures you most need to see are almost never captured.
Random head-based sampling decides at the START of a request whether to trace it. But you don't know if the request will fail or be slow until it COMPLETES.
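The timing problem is easy to make concrete. Below is a minimal sketch of a head-based sampler (the function name and the hash-modulo scheme are illustrative; real SDKs typically derive the decision from the trace ID's random bits):

```python
import hashlib

def head_based_decision(trace_id: str, rate: float = 0.01) -> bool:
    """Decide at request START whether to record this trace.

    The outcome (error? slow?) is unknown at this point, so a failing
    request has exactly the same 1% chance of being kept as a healthy one.
    """
    # Hash the trace ID so every service in the call chain reaches the
    # same keep/drop decision without coordination.
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 10_000) < rate * 10_000
```

With a 0.1% error rate and 1% sampling, you capture roughly 10 error traces per million requests, while about 990 error traces are discarded unseen.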
Imagine a bank with security cameras that randomly record 1% of the day:
Head-based sampling: Decide at midnight which 1% of today to record. Might miss the robbery at 2 PM.
Tail-based sampling: Record everything temporarily, then at end of day, keep only footage with interesting events (robbery, suspicious activity). Delete boring footage.
Tail-based sampling makes sampling decisions AFTER seeing the complete trace, keeping important traces (errors, slow requests) while discarding boring ones.
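In code, the contrast is a decision function that runs only after the trace has finished. A minimal sketch, where the `Span` shape, threshold, and baseline rate are all illustrative assumptions:

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    duration_ms: float
    is_error: bool

def tail_based_decision(spans: list[Span],
                        latency_threshold_ms: float = 500.0,
                        baseline_rate: float = 0.01) -> bool:
    """Decide AFTER the trace completes, with full knowledge of the outcome."""
    if any(s.is_error for s in spans):
        return True                          # keep 100% of error traces
    if max(s.duration_ms for s in spans) > latency_threshold_ms:
        return True                          # keep 100% of slow traces
    return random.random() < baseline_rate   # keep a sliver of healthy traffic
```

Every error and every slow request is retained; boring, healthy traffic is down-sampled to the baseline rate.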
If you keep 100% of traces temporarily before sampling, where do you store them? How long can you afford to keep them?
Your VP asks: "Why do we sample traces at all? Just store everything."
You explain: "We generate 10 million traces/day. That's 50 TB/day. That's $500K/month storage."
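The arithmetic behind those figures, for the skeptical VP. The ~5 MB-per-trace size is the assumption implied by the quoted numbers; it is on the large side and typical only of span-heavy traces:

```python
traces_per_day = 10_000_000
bytes_per_trace = 5_000_000       # implied by 50 TB/day; an assumption, not a benchmark

daily_tb = traces_per_day * bytes_per_trace / 1e12
monthly_pb = daily_tb * 30 / 1_000    # with ~30-day retention of raw traces

print(daily_tb, monthly_pb)   # 50.0 TB/day, 1.5 PB retained per month
```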
Why can't we just store all traces?
A. Storage costs too much B. Query performance degrades with more data C. Most traces are uninteresting (successful, fast requests) D. All of the above
Answer: D.
But the real question is: Can we keep the important traces and drop the boring ones?
Think of trace sampling as an information-retention problem, not just a cost problem.
The goal: achieve 99% of the debugging value with 1-10% of the data.
Head-based sampling is economically efficient but informationally wasteful. Tail-based sampling inverts the trade-off.
If tail-based sampling is better, why does anyone use head-based sampling?
You decide to implement tail-based sampling. Where does the logic run?
Option 1: Application-side tail sampling (doesn't work: no single service ever sees the complete trace, so no SDK can make a whole-trace decision)
Option 2: Collector-side tail sampling (standard approach)
Tail-based sampling requires buffering complete traces in memory, which means memory usage scales with request rate and trace duration.
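That scaling relationship is Little's law applied to the span buffer: resident data ≈ arrival rate × residency time. A back-of-envelope sketch, where all three input figures are illustrative assumptions:

```python
def collector_buffer_bytes(spans_per_sec: float,
                           avg_span_bytes: float,
                           decision_wait_s: float) -> float:
    """Steady-state buffer size = arrival rate x time each span is held."""
    return spans_per_sec * avg_span_bytes * decision_wait_s

# 50k spans/s at ~1 KB/span, held for a 10 s decision window:
estimate = collector_buffer_bytes(50_000, 1_000, 10)
print(estimate / 1e9)   # 0.5 -> about 0.5 GB resident, before any overhead
```

Double the decision window or the traffic and the buffer doubles with it, which is why the decision timeout is a memory knob, not just a correctness knob.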
What happens if a trace never completes (missing spans due to network issues)? How long do you wait before making a decision?
You have 10 million traces buffered. Which ones do you keep?
Policy 1: Always keep errors
Policy 2: Always keep slow requests
Policy 3: Probabilistic sampling for healthy traces
Policy 4: Rate limiting per attribute
Policy 5: String attribute matching
Effective tail-based sampling uses composite policies: keep 100% of interesting traces (errors, slow) + small % of baseline traffic.
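Policies 1-5 map closely onto policy types in the OpenTelemetry Collector's tail_sampling processor. A sketch of a composite configuration follows; all thresholds, rates, and attribute names are illustrative rather than recommendations, and a trace is kept if any policy matches:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer spans before deciding
    num_traces: 100000        # cap on traces held in memory
    policies:
      - name: keep-errors               # Policy 1
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow                 # Policy 2
        type: latency
        latency: {threshold_ms: 2000}
      - name: healthy-baseline          # Policy 3
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
      - name: cap-noisy-traffic         # Policy 4 (global cap; per-attribute
        type: rate_limiting             # caps need an `and` composite policy)
        rate_limiting: {spans_per_second: 100}
      - name: debug-flagged             # Policy 5
        type: string_attribute
        string_attribute: {key: debug.enabled, values: ["true"]}
```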
How do you set the "slow request" threshold dynamically as your system's performance changes over time?
You're buffering traces to make tail-based decisions. But how do you know when a trace is complete?
A trace might have spans that arrive late, out of order, or never at all (dropped in transit or lost to a crashed service).
How long do you wait?
When should you make a sampling decision?
A. After receiving root span B. After X seconds of no new spans C. After all parent-child references resolved D. All of the above, with fallbacks
Answer: D - you need multiple heuristics.
Heuristic 1: Root span detection
Heuristic 2: Reference resolution
Heuristic 3: Inactivity timeout
Heuristic 4: Expected span count (if available)
Trace completion is probabilistic, not deterministic. Use multiple heuristics with timeouts to force decisions before memory exhaustion.
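Heuristic 3 plus a hard cap can be sketched as a small buffer. The class and thresholds below are hypothetical; a real collector layers root-span detection and reference resolution on top:

```python
import time

class TraceBuffer:
    """Force a sampling decision via inactivity timeout or a hard max wait."""

    def __init__(self, inactivity_s: float = 10.0, max_wait_s: float = 60.0):
        self.inactivity_s = inactivity_s
        self.max_wait_s = max_wait_s
        # trace_id -> (first_seen, last_seen, spans)
        self.traces: dict[str, tuple[float, float, list]] = {}

    def add_span(self, trace_id: str, span, now=None) -> None:
        now = time.monotonic() if now is None else now
        first, _, spans = self.traces.get(trace_id, (now, now, []))
        spans.append(span)
        self.traces[trace_id] = (first, now, spans)

    def ready_for_decision(self, now=None) -> list[str]:
        """Traces we must decide on NOW, complete or not."""
        now = time.monotonic() if now is None else now
        return [tid for tid, (first, last, _) in self.traces.items()
                if now - last >= self.inactivity_s      # trace has gone quiet
                or now - first >= self.max_wait_s]      # held too long regardless
```

The hard cap is what protects memory: even a trace that keeps trickling in spans gets decided on eventually.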
A long-running async job generates spans over 5 minutes. How do you handle this without waiting 5 minutes to make a sampling decision?
Your system does 100K requests/second. That's 8.6 billion traces/day.
Even with tail-based sampling keeping only 10%, that's 860 million traces to store.
Dimension 1: Collector memory
Dimension 2: Collector throughput
Dimension 3: Trace distribution
Tail-based sampling at scale requires distributed collectors with trace-aware routing and aggressive memory management.
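Trace-aware routing means every span of a given trace must reach the same collector instance; otherwise no single collector ever holds a complete trace. A minimal sketch using modulo hashing (production setups typically use consistent hashing, e.g. the OpenTelemetry load-balancing exporter keyed on trace ID, so scaling events remap only a fraction of in-flight traces):

```python
import hashlib

def route_span(trace_id: str, collectors: list[str]) -> str:
    """Pick a collector deterministically from the trace ID, so all spans
    of one trace converge on the same buffer."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return collectors[h % len(collectors)]
```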
How do you handle a traffic spike that increases your trace volume tenfold? Can you shed load gracefully?
Your manager asks: "Should we use tail-based sampling for everything?"
Pure tail-based sampling is ideal for observability but expensive at scale. Hybrid approaches balance cost and coverage.
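One caveat worth quantifying: in a hybrid pipeline, any trace dropped by the head sampler is gone before the tail policies can examine it, so error coverage is capped by the head rate. A hypothetical helper:

```python
def hybrid_error_coverage(head_rate: float, tail_error_keep: float = 1.0) -> float:
    """Fraction of error traces a hybrid pipeline can possibly retain:
    traces dropped at the head never reach the tail sampler."""
    return head_rate * tail_error_keep

# Head-sample 20% to bound collector cost, keep 100% of errors at the tail:
print(hybrid_error_coverage(0.20))   # 0.2 -> at most 20% of error traces survive
```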
Can you implement tail-based sampling at the application level (SDK) instead of collector, avoiding the buffering problem?
You're the observability lead for a global e-commerce platform.
Requirements:
Constraints:
Write down your design.
1. Storage requirements:
2. Tail-based sampling policies:
3. Collector architecture:
4. Trace completion strategy:
5. Memory management:
6. Cost justification:
Tail-based sampling is a business decision: pay more for storage to save more on incident costs through better debugging.
After implementing tail-based sampling, your storage costs are on target, but query performance degrades (too many traces to search). How do you optimize?
Tail-based sampling design:
Sampling policy checklist:
Trace completion detection:
Memory management:
Scaling considerations:
Cost optimization:
Red flags (redesign needed):