Audience: engineers who already ship distributed systems and now need their time-series stack to stay fast, cheap, and correct under load.
It’s 11:58 PM. Your on-call phone is quiet. Two minutes later, your dashboards freeze.
Scenario: You run a distributed TSDB (or a single-node TSDB behind a distributed ingestion tier). You must keep:
If you could change only one thing right now to survive the storm, which would it be?
Don’t answer yet—keep it in mind. We’ll return to it after building the mental model.
Imagine a city-wide delivery service:
A TSDB must do two opposing jobs:
It achieves this by buffering and organizing data over time.
Most TSDB optimizations are about controlling amplification:
- Write amplification (how many bytes written per byte ingested)
- Read amplification (how many bytes scanned per byte returned)
- Space amplification (how many bytes stored per byte of raw samples)
- Network amplification (how many nodes touched per query/write)
Your ingestion path typically includes:
Where do you expect the first bottleneck under a sudden ingest spike?
Hold your answer.
Think of a coffee shop:
If the ticket printer jams (WAL fsync), the whole line stops.
Many TSDBs (Prometheus TSDB, InfluxDB, ClickHouse MergeTree, Cassandra-based systems, M3DB, VictoriaMetrics) rely on a WAL to survive crashes. The WAL is frequently a latency floor.
If your WAL is synchronous per request, your p99 write latency is often bounded by fsync latency and queueing.
Pause and think.
Answer reveal: 2 is generally true. Batching amortizes per-request overhead (WAL fsync, RPC overhead, index updates). (1) can increase coordination cost. (3) may increase CPU and hurt tail latency. (4) doesn’t address ingest bottlenecks.
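The amortization effect of batching can be sketched with rough numbers. This is a back-of-envelope model, not a measurement; the fsync and RPC latencies below are assumptions.

```python
# Illustrative model: if every write request waits for its own WAL fsync,
# fsync latency caps throughput. Batching shares one fsync per batch.
FSYNC_MS = 2.0         # assumed WAL fsync latency
RPC_OVERHEAD_MS = 0.1  # assumed per-request overhead

def max_writes_per_sec(batch_size: int) -> float:
    """One fsync and one RPC per batch, so per-write cost shrinks with batch size."""
    per_batch_ms = FSYNC_MS + RPC_OVERHEAD_MS
    return batch_size / (per_batch_ms / 1000.0)

print(f"{max_writes_per_sec(1):.0f} writes/s unbatched")     # ~476
print(f"{max_writes_per_sec(500):.0f} writes/s batched x500") # ~238,000
```

The exact numbers don't matter; the shape does: a synchronous-per-request WAL puts a hard ceiling on ingest that batching lifts by orders of magnitude.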
If you can only change one client-side behavior to help ingestion, what is it?
A time series is identified by metric + labels/tags. Cardinality is how many distinct series exist.
Symptom: Everything looks fine at 50k series. At 5M series, your TSDB collapses:
Which metric is more dangerous?
A) 10 metrics x 1,000,000 series each
B) 1,000 metrics x 10,000 series each
Think: both are 10M series total, but query patterns and index layout matter.
A restaurant can handle:
But if every customer customizes their dish differently (high label cardinality), the kitchen (index) breaks.
Labels like pod, container, request_id, and user_id can explode cardinality. Cardinality is not just storage: it's index size, memory residency, compaction fanout, and query fanout.
“Compression will save me from high cardinality.”
Compression helps sample payloads. High cardinality kills you in metadata and indexes, which often compress poorly and must be kept hot.
Match the label to “safe-ish” vs “dangerous”:
| Label example | Safe-ish | Dangerous |
|---|---|---|
| region=us-east-1 | [ ] | [ ] |
| instance=10.2.3.4:9100 | [ ] | [ ] |
| request_id=... | [ ] | [ ] |
| status_code=500 | [ ] | [ ] |
| user_id=... | [ ] | [ ] |
Pause and think.
Answer reveal:
- Safe-ish: region, status_code
- Dangerous: request_id, user_id
- instance is acceptable in infra metrics but can be expensive at huge scale (still bounded by fleet size)

Name one label you should almost never use in metrics.
Distributed TSDBs must decide:
Common approaches:
If you shard purely by time (e.g., daily partitions), what happens to write load?
Answer reveal: B. Everyone writes “now”. Time-only sharding concentrates writes.
If you assign deliveries by day instead of by address, then today’s warehouse gets every package in the city.
In ClickHouse terms: ORDER BY (metric, tags, time) plus a sharding key. Good sharding spreads writes and bounds query fanout. You rarely get both for free.
Workload: high ingest, most queries are “last 15 minutes for a single service”, occasional “30 days across all services”.
Pause and think.
Answer reveal: 3. Series-hash spreads ingest; time blocks enable retention, compaction, and cold storage.
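A minimal sketch of series-hash sharding, assuming a hypothetical 8-shard layout: every sample of a series lands on one shard, and "now" writes spread roughly evenly across shards instead of piling onto today's time partition.

```python
# Sketch: stable hashing of series identity (metric + labels) to shards.
import hashlib

NUM_SHARDS = 8  # assumed shard count

def shard_for(series_key: str) -> int:
    # Stable hash so all samples of a series go to the same shard.
    digest = hashlib.sha256(series_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

series = [f'http_requests_total{{pod="pod-{i}"}}' for i in range(10_000)]
counts = [0] * NUM_SHARDS
for s in series:
    counts[shard_for(s)] += 1
print(counts)  # roughly 1,250 series per shard
```

Contrast with time-only sharding: under the same workload, all 10,000 series would write their current samples to the single "today" partition.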
What query pattern becomes expensive when sharding by series hash?
You replicate data for durability and availability. But replication raises questions:
If your TSDB returns HTTP 200 for a write, what do you want that to guarantee?
There’s no universally correct answer—only trade-offs.
You place an order:
Write acknowledgment levels map to stages of completion.
Many systems expose tunable W and R quorums. Decide your write success contract explicitly. Otherwise, you'll discover it during an incident.
“Replication factor 3 means I can lose 2 nodes without losing data.”
Only if:
1. With RF=3 and W=1, you can lose data on a single-node failure.
2. With RF=3 and W=2, you can never lose data.
3. With RF=1, you can still be highly available if you add caches.

Pause and think.
Answer reveal: 1 is true. W=1 acknowledges before replication completes.
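The underlying rule is quorum overlap: a read is guaranteed to see the latest acknowledged write only when W + R > RF, because then every read set intersects every write set. A brute-force check of that claim:

```python
# Sketch: verify quorum overlap by enumerating all write/read node sets.
from itertools import combinations

def quorums_overlap(rf: int, w: int, r: int) -> bool:
    # True if every possible W-node write set intersects every R-node read set.
    nodes = range(rf)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_overlap(3, 2, 2)      # W + R > RF: reads always see the write
assert not quorums_overlap(3, 1, 1)  # W=1, R=1: a read can miss the acked copy
```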
What’s the trade-off between W=quorum and W=1 for metrics?
Dashboard queries are usually:
Which is typically the biggest query accelerator?
A) better compression
B) pre-aggregation / downsampling
C) adding more replicas
D) increasing retention
Answer reveal: B. Downsampling and rollups reduce scan volume.
If customers constantly order “large drip coffee”, you pre-brew a big batch. That’s downsampling/rollups.
Most dashboards don’t need raw 10s resolution for 30-day ranges.
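The scan-volume arithmetic is worth doing once. For a 30-day panel, per series:

```python
# Points scanned per series for a 30-day range: raw 10s vs 5-minute rollup.
RANGE_S = 30 * 86_400

raw_points = RANGE_S // 10        # 10s resolution
rollup_points = RANGE_S // 300    # 5m rollup
print(raw_points, rollup_points)  # 259200 8640 -> a 30x reduction
```

Multiply by thousands of series in a dashboard fanout and the rollup is often the difference between an interactive query and a timeout.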
[IMAGE: A diagram showing query path for “recent” (hits ingesters/mem + cache) vs “historical” (hits object storage blocks + index) with fanout across shards and a query-frontend cache.]
What is the risk of downsampling for incident response?
TSDBs maintain an index mapping:
When the index spills to disk, query latency becomes unpredictable.
If you have to choose, would you rather keep:
Answer reveal: B. Without the index, you can’t find the samples efficiently.
Raw samples are books. The index is the card catalog. Without the catalog, finding books is slow even if books are nearby.
Index locality matters more than raw throughput for interactive queries.
“SSD solves index problems.”
SSDs reduce pain, but the cost is still high: random reads + cache misses + CPU overhead. RAM-resident indexes still win for p99.
Name one optimization that reduces index size.
Compaction merges small immutable files into larger ones to:
But compaction consumes:
What happens if compaction falls behind for 24 hours?
Answer reveal: B (usually). Many small files increase seeks and metadata overhead; storage grows due to duplicates and poor compression.
If no one clears tables, the restaurant can still take orders—until every surface is covered and service collapses.
Compaction debt is real operational debt. Track it like error budget.
[IMAGE: Timeline showing ingestion producing many small segments, compaction merging into larger blocks, and how backlog increases read amplification.]
What’s one safe lever to reduce compaction pressure without losing data?
You have:
A partition isolates half of the storage nodes.
What do you prefer during the partition?
Do you accept packages when your second warehouse is unreachable?
Metrics are often treated as AP: better to accept and be “eventually consistent” than to drop all telemetry.
For metrics, availability often dominates consistency, but you must engineer reconciliation (dedupe, backfill, anti-entropy).
“Metrics are non-critical, so we can drop them.”
During incidents, metrics become critical. Dropping them precisely when failure happens is the worst time.
What mechanism can you use to survive partitions without losing data?
When writes fail, clients retry. If they retry immediately, they amplify load.
Which retry policy is safer?
Answer reveal: 2.
If everyone honks and accelerates into a jam, it gets worse. Backoff is spacing cars out.
Backpressure is a feature. If your TSDB never says “slow down,” it will fail catastrophically.
[CODE: YAML, Prometheus remote_write queue settings demonstrating batching, backoff, and capacity tuning]
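A sketch of such a configuration. The `queue_config` field names are real Prometheus remote_write options; the URL and values are illustrative starting points, not recommendations.

```yaml
# Prometheus remote_write tuned for batching and backoff (illustrative values).
remote_write:
  - url: https://tsdb.example.com/api/v1/write   # hypothetical endpoint
    queue_config:
      capacity: 10000              # samples buffered per shard
      max_shards: 50               # upper bound on parallel senders
      max_samples_per_send: 2000   # batch size; amortizes per-request cost
      batch_send_deadline: 5s      # flush partial batches at least this often
      min_backoff: 30ms            # exponential backoff on retryable errors
      max_backoff: 5s
      retry_on_http_429: true      # honor server-side backpressure
```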
What is the risk of buffering too much at the edge?
In TSDBs, “schema” often means:
You need per-endpoint latency metrics. Should path be a label?
Answer reveal: C.
If every customer can invent a new dish name, the kitchen collapses. If choices are from a fixed menu, it’s manageable.
Use route templates like /users/:id not /users/12345.
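A minimal normalization sketch; the regex below is illustrative and would need extending for your actual ID formats (UUIDs, slugs, etc.):

```python
# Sketch: collapse unbounded path segments to a template before labeling.
import re

# Matches numeric IDs or long hex-ish segments between slashes (illustrative).
ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F-]{8,})(?=/|$)")

def normalize_path(path: str) -> str:
    return ID_SEGMENT.sub("/:id", path)

print(normalize_path("/users/12345/orders/987"))  # /users/:id/orders/:id
print(normalize_path("/health"))                  # /health (unchanged)
```

Do this in the instrumentation layer or a relabeling step, before the label ever reaches the TSDB.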
Bounded label values are the difference between observability and self-inflicted DDoS.
Name a label you would normalize before storing.
In distributed TSDBs, a query often executes as:
Where should aggregation happen?
Answer reveal: B. Push down aggregation reduces network and merge cost.
If each restaurant branch totals its day’s sales locally, HQ only merges totals—not every receipt.
Query pushdown converts network amplification into CPU work near data.
[IMAGE: A distributed query plan diagram showing pushdown aggregation on shard nodes, then merge step.]
What’s the danger of too much pushdown?
You have HA scrapers or multiple agents writing the same metric series.
If two writers send the same timestamp with different values, what should happen?
It depends on your system’s semantics.
Two couriers deliver the same package. Do you keep both? Return one? Decide based on business rules.
Define dedupe behavior for same timestamp, out-of-order, and late arriving data.
“Out-of-order is just a minor annoyance.”
Out-of-order can break compression, compaction assumptions, and query correctness.
What ingestion-side technique reduces out-of-order risk?
You can’t keep everything on fast disks.
Common architecture:
What gets worse when you move data to object storage?
Answer reveal: D.
Hot storage is your local warehouse. Cold storage is a remote warehouse with slower trucks and occasional delays.
Cortex/Mimir/Thanos store blocks in object storage; query path includes store gateways and caches.
Object storage is cheap and durable, but it shifts complexity into caching, indexing, and consistency handling.
[IMAGE: Tiered storage diagram with hot ingesters, compactors, object storage, store-gateway, query-frontend cache.]
Why do many systems keep a “recent window” in ingesters even after shipping blocks?
Caches in TSDB architectures:
Which cache is most sensitive to cardinality?
Answer reveal: B.
If the catalog grows, caching the catalog pages matters more than caching books.
Thanos store gateway uses index cache; Mimir uses multiple caches.
Cache the join points: label -> series sets, series -> chunks.
What cache invalidation strategy works for immutable blocks?
Track:
Which metric is the earliest warning sign of cardinality explosion?
Answer reveal: B is often earliest.
Kitchen prep space (RAM) fills before the pantry (disk).
Watch series count, active series, and index memory like you watch CPU.
What SLO would you set for dashboard queries during incidents?
Below is a comparison table of common optimizations.
| Lever | Helps | Hurts / Risk | Best when |
|---|---|---|---|
| Reduce cardinality | index/memory, query fanout, compaction | loss of granularity | labels are unbounded |
| Client batching | WAL overhead, RPC | larger loss on crash if buffers not flushed | high ingest |
| Increase shards | parallelism | overhead, rebalancing churn | CPU-bound or single shard hot |
| Increase replication | availability | write latency, cost | need HA and can tolerate slower writes |
| Downsampling/rollups | long-range queries | hides spikes | dashboards + billing |
| Tiered storage | cost | complexity, cold query latency | high retention |
| Caching | query latency | consistency, memory | repeated queries |
| Pushdown aggregation | network | CPU hotspots | high fanout queries |
Which lever is most likely to fix write timeouts quickly?
Answer reveal: B.
Optimize the path that’s failing: ingest failures rarely yield to query-side tricks.
If you see WAL fsync p99 spike, what are two immediate mitigations?
At 00:03:
Pick the best first action:
Answer reveal: 2. You’re disk-bound on WAL; stop the storm first.
Don’t rearrange furniture while the kitchen is on fire. Reduce inflow, then fix root cause.
In distributed systems, stability beats throughput during incidents.
After stabilizing, what’s your second action to address root cause?
Phase 1: Single-node TSDB
Phase 2: Sharded ingestion + replicated storage
Phase 3: Object storage + compactor + query frontend
Phase 4: Multi-region
In multi-region, where should writes go?
Answer reveal: usually B, unless strict consistency is required (rare for metrics).
Ship from nearest hub, then transfer to central warehouse later.
Multi-region TSDBs are mostly about latency and failure domains, not raw throughput.
What’s the hardest part of multi-region queries?
Reality: compression helps, but cardinality, indexing, and compaction dominate many failure modes.
Reality: shards increase parallelism but also overhead: coordination, metadata, rebalancing, query fanout.
Reality: object storage changes the bottleneck to indexing, caching, and request rate limits.
Reality: correctness matters for alerting, SLOs, billing, and incident response. You need explicit semantics.
Which misconception have you personally seen cause an outage?
You see a metric:
[PAUSE AND THINK] What’s the problem?
Answer reveal: path contains unbounded IDs. Normalize to route templates.
A single unbounded label can create millions of series.
You need:
[PAUSE AND THINK] How many rollup levels do you store?
Answer reveal: at least two tiers (raw 10s for short window, 5m rollup for long window), possibly 1h rollup for 1y.
Rollups are a query-time optimization and a cost-control strategy.
You run RF=3 across AZs. You choose W=1 for ingestion.
What failure can lose acknowledged writes?
Answer reveal: the single node that acked crashes before replicating or before durable fsync.
Acknowledgment level defines durability, not replication factor.
[CODE: SQL, TimescaleDB continuous aggregate creation and refresh policy for downsampling]
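A sketch of a TimescaleDB continuous aggregate; table and column names (`metrics`, `metric`, `value`) are hypothetical.

```sql
-- 5-minute rollup maintained incrementally by TimescaleDB.
CREATE MATERIALIZED VIEW metrics_5m
WITH (timescaledb.continuous) AS
SELECT time_bucket('5 minutes', time) AS bucket,
       metric,
       avg(value) AS avg_value,
       max(value) AS max_value   -- keep max so rollups don't fully hide spikes
FROM metrics
GROUP BY bucket, metric;

-- Refresh policy: keep the rollup fresh while excluding the most recent,
-- still-changing bucket.
SELECT add_continuous_aggregate_policy('metrics_5m',
  start_offset      => INTERVAL '1 hour',
  end_offset        => INTERVAL '5 minutes',
  schedule_interval => INTERVAL '5 minutes');
```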
[CODE: SQL, ClickHouse table definition using MergeTree with PARTITION BY toYYYYMM(time) and ORDER BY (metric, tags..., time) plus TTL for retention]
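A sketch of the ClickHouse layout; the schema is hypothetical (here `tags` is a serialized label set; you might split hot labels into their own columns) and `ORDER BY` should be tuned to your query patterns.

```sql
-- MergeTree samples table: monthly partitions, series-then-time sort order,
-- and TTL-based retention.
CREATE TABLE samples (
    metric LowCardinality(String),
    tags   String,      -- serialized label set (hypothetical encoding)
    time   DateTime,
    value  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)        -- coarse time partitioning for retention
ORDER BY (metric, tags, time)      -- clusters each series' samples together
TTL time + INTERVAL 90 DAY;        -- drop data past retention automatically
```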
[CODE: Go, example of client-side batching and exponential backoff with jitter for remote write]
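A client-side sketch, assuming a hypothetical `sendBatch` RPC (here stubbed to always succeed); the batch size and backoff limits are illustrative.

```go
// Client-side batching with exponential backoff and full jitter.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	maxBatch   = 500
	baseDelay  = 50 * time.Millisecond
	maxDelay   = 5 * time.Second
	maxRetries = 6
)

// backoff returns the sleep before retry `attempt`, capped and fully jittered
// so synchronized clients don't retry in lockstep.
func backoff(attempt int) time.Duration {
	d := baseDelay << uint(attempt) // exponential: 50ms, 100ms, 200ms, ...
	if d > maxDelay {
		d = maxDelay
	}
	return time.Duration(rand.Int63n(int64(d) + 1)) // full jitter in [0, d]
}

// sendBatch stands in for a remote-write RPC (hypothetical; always succeeds here).
func sendBatch(samples []float64) error { return nil }

func flush(buf []float64) error {
	for attempt := 0; ; attempt++ {
		err := sendBatch(buf)
		if err == nil {
			return nil
		}
		if attempt >= maxRetries {
			return err // give up; caller decides whether to drop or spill
		}
		time.Sleep(backoff(attempt))
	}
}

func main() {
	var buf []float64
	for i := 0; i < 1234; i++ {
		buf = append(buf, float64(i))
		if len(buf) >= maxBatch { // send full batches
			_ = flush(buf)
			buf = buf[:0]
		}
	}
	_ = flush(buf) // flush the partial tail
	fmt.Println("done")
}
```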
[CODE: Python, simulating cardinality explosion by generating label combinations and estimating series count]
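A sketch of the cross-product estimate. Real label sets are rarely a full cross product, so treat this as an upper bound; all label values below are made up.

```python
# Worst-case series count is the product of per-label cardinalities.
labels = {
    "region":      [f"region-{i}" for i in range(4)],
    "service":     [f"svc-{i}" for i in range(50)],
    "status_code": ["200", "404", "500"],
    "pod":         [f"pod-{i}" for i in range(200)],
}

upper_bound = 1
for values in labels.values():
    upper_bound *= len(values)

print(upper_bound)  # 4 * 50 * 3 * 200 = 120000 potential series

# Adding one unbounded label (say, user_id with 100k values) multiplies:
print(upper_bound * 100_000)  # 12 billion potential series
```

This is why a single unbounded label dominates every other modeling decision: it multiplies, it doesn't add.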
[CODE: Bash, using tsdb tooling/PromQL to estimate active series and top label cardinalities]
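A sketch against the Prometheus HTTP API (requires a reachable server plus `curl` and `jq`; the endpoint URL is hypothetical). `prometheus_tsdb_head_series` is Prometheus's own self-metric for active head series.

```bash
# Estimate active series and top cardinality contributors via PromQL.
PROM=http://localhost:9090

# Total active head series:
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=prometheus_tsdb_head_series' \
  | jq '.data.result'

# Top 10 metric names by series count:
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))' \
  | jq '.data.result[] | {metric: .metric.__name__, series: .value[1]}'
```

Run the second query periodically and alert on sudden jumps; it's a cheap early-warning signal for cardinality explosions.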
You’re building telemetry for a delivery marketplace (drivers, restaurants, customers). Requirements:
Part 1: Labeling Which label is most dangerous?
A) city
B) driver_id
C) vehicle_type
D) status_code
Pause and think.
Answer: B (unbounded/high cardinality).
Part 2: Sharding Choose a sharding approach:
Pause and think.
Answer: 3.
Part 3: Multi-region writes Choose write strategy:
Pause and think.
Answer: usually 2 for metrics.
Part 4: Query path You need fast dashboards during incidents. Pick two:
Pause and think.
Answer: A + B.
Write a 10-bullet design:
Optimization is not one trick—it’s aligning data model, partitioning, and failure semantics with your actual query and ingest patterns.
At the start, you had one change to survive the telemetry storm:
Answer: most often 3 (reduce cardinality) is the highest-leverage long-term fix. But during an active incident, the fastest stabilizers are usually batching/backpressure and ingest throttling. Replication and caches help, but they won’t save a cluster drowning in unbounded series.