Observability · Chapter 45 of 51

High Cardinality Data Management

Akhil Sharma
20 min

High-Cardinality Data Management (The Curse of Infinite Dimensions)

Audience: observability engineers dealing with exploding metrics, logs, and traces in high-scale systems.

This article assumes:

  • Your metrics have attributes like user_id, request_id, container_id (millions of unique values).
  • Traditional time-series databases explode when cardinality increases.
  • Query performance degrades from seconds to minutes as cardinality grows.
  • Storage costs scale linearly with cardinality - this gets expensive fast.

[CHALLENGE] Challenge: Your metrics database just exploded

Scenario

You add a new metric: http_request_duration{user_id="..."}.

Timeline:

  • Day 1: 10K unique users, metrics database happy
  • Day 7: 100K unique users, query latency increases to 5 seconds
  • Day 14: 1M unique users, Prometheus OOM crashes
  • Day 30: Database unusable, costs $50K/month

One innocent user_id label destroyed your observability.

Interactive question (pause and think)

What went wrong?

  1. Prometheus can't handle scale
  2. user_id is high-cardinality (millions of unique values)
  3. Needed more memory
  4. Should have used a different database

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) - high cardinality kills time-series databases.

More memory (option 3) delays the problem but doesn't solve it. Different database (option 4) might help, but understanding cardinality is key.

Real-world analogy (library catalog)

Imagine cataloging books:

Low cardinality: Genre (fiction, non-fiction, science, history) = 10 categories. Easy to organize and search.

High cardinality: ISBN (every book has unique number) = millions of categories. Impossible to efficiently organize and search without specialized indexes.

Metrics databases are optimized for low cardinality (like genres), not high cardinality (like ISBNs).

Key insight box

Cardinality is the number of unique combinations of label values. High cardinality (millions+) causes exponential growth in storage and query cost.

Challenge question

If high-cardinality metrics are so problematic, why do we need them? Can't we just avoid them?


[MENTAL MODEL] Mental model - Cardinality is a combinatorial explosion

Scenario

You have a metric with three labels:

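For example, a request-counter metric of this shape (metric and label names are illustrative, matching the question below):

```promql
http_requests_total{endpoint="/api/users", status="200", region="us-east-1"}
http_requests_total{endpoint="/api/users", status="500", region="eu-west-1"}
# ...one time series per unique (endpoint, status, region) combination
```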

Interactive question (pause and think)

If you have:

  • 100 endpoints
  • 10 status codes
  • 5 regions

How many unique time series can this metric create?

A. 115 (100 + 10 + 5)
B. 500 (100 × 5)
C. 5,000 (100 × 10 × 5)
D. It depends on actual combinations

Progressive reveal

Answer: C (worst case), D (in practice).

Maximum cardinality = product of all label cardinalities.

Cardinality mathematics

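The arithmetic can be checked directly; a quick sketch using the numbers from the question above:

```python
from math import prod

# Maximum cardinality is the product of each label's unique-value count.
def max_cardinality(label_counts):
    return prod(label_counts)

base = max_cardinality([100, 10, 5])    # endpoint x status x region
print(base)                             # 5000 series, worst case

# Adding one high-cardinality label multiplies, never adds:
with_user_id = max_cardinality([100, 10, 5, 1_000_000])
print(with_user_id)                     # 5,000,000,000 potential series
```

One million users turns a harmless 5,000-series metric into five billion potential series.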

Why time-series databases struggle

  • Every unique label combination is its own series, with its own index entries and in-memory state
  • Head-block memory and index size scale with active series, not with traffic volume
  • Series churn (restarting pods, rotating IDs) keeps minting new series even when load is flat
  • Queries must find and merge every matching series, so selectors slow down as series counts grow

Mental model

Think of cardinality as:

  • Dimensionality curse: More dimensions = exponentially more space
  • Combinatorial explosion: Labels multiply, not add
  • Query cost multiplier: Each label filters, but high-cardinality labels don't filter much

Key insight box

Cardinality isn't additive (100 + 1M = 1M), it's multiplicative (100 × 1M = 100M). One high-cardinality label destroys everything.

Challenge question

You have 5 labels, 4 are low-cardinality (10 values each), 1 is high-cardinality (1M values). Which label causes the problem?


[WARNING] Identifying high-cardinality culprits

Scenario

Your Prometheus is slow. You suspect high cardinality. How do you find the culprit?

Cardinality analysis techniques

Technique 1: Prometheus built-in metrics

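Prometheus tracks its own series counts; these standard queries surface the worst offenders (the last one assumes a suspect metric and label to inspect):

```promql
# Total series currently in the head block
prometheus_tsdb_head_series

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Unique values of one suspect label on one metric
count(count by (user_id) (http_request_duration_seconds_count))
```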

Technique 2: Prometheus TSDB stats

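Prometheus's TSDB status endpoint reports the top metric names and label names by series count (adjust host and port to your deployment):

```bash
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```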

Technique 3: Cardinality profiling script

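A minimal, self-contained sketch of a cardinality profiler: it counts unique values per label across a list of label sets (in practice you would feed it series fetched from Prometheus's /api/v1/series endpoint; the sample data is illustrative):

```python
from collections import defaultdict

def profile_label_cardinality(series):
    """Count unique values per label across a list of label dicts,
    highest-cardinality labels first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(((name, len(v)) for name, v in values.items()),
                  key=lambda item: -item[1])

series = [
    {"endpoint": "/api/users", "status": "200", "user_id": "u1"},
    {"endpoint": "/api/users", "status": "200", "user_id": "u2"},
    {"endpoint": "/api/orders", "status": "500", "user_id": "u3"},
]
print(profile_label_cardinality(series))
# user_id leads with 3 unique values; endpoint and status have 2 each
```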

Common high-cardinality culprits

  • Identifiers: user_id, request_id, trace_id, session_id
  • Infrastructure churn: pod_id, container_id, client IP addresses
  • Unbounded strings: full URL paths with embedded IDs, raw error messages, query strings
  • Timestamps, build hashes, or version strings used as label values

Key insight box

High-cardinality labels are usually identifiers (IDs, IPs, timestamps) or unbounded strings. Look for labels with >10K unique values.

Challenge question

You find that pod_id has 100K unique values (Kubernetes pods constantly restart). Should you remove the label or handle it differently?


[DEEP DIVE] Strategies for managing high-cardinality data

Scenario

You've identified high-cardinality labels. Now what?

You can't just delete them - they provide valuable debugging context.

Strategy 1: Don't use metrics for high-cardinality data

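A before/after sketch (metric and attribute names are illustrative):

```text
# BAD: one series per user - cardinality explodes
http_request_duration_seconds{endpoint="/api/users", user_id="u-82917"}

# GOOD: bounded labels on the metric...
http_request_duration_seconds{endpoint="/api/users", status="200"}

# ...and the high-cardinality context in the trace/log for that request:
# span attributes: user_id=u-82917, request_id=req-5531
```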

When to use what:

  • Metrics: low-cardinality aggregates (request rates, latency percentiles by endpoint/status/region)
  • Traces: per-request debugging, where user_id and request_id belong as span attributes
  • Logs: discrete events carrying arbitrary high-cardinality fields
  • Analytics databases: per-user or per-customer reporting over pre-aggregated data

Strategy 2: Aggregate at write time

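One way to sketch this in Python: collapse the unbounded user_id into a bounded tier before it ever becomes a label (the plan-to-tier mapping is an assumption):

```python
# Map an unbounded identifier to a bounded label before recording the metric.
PLAN_TO_TIER = {"enterprise": "enterprise", "pro": "paid", "free": "free"}

def metric_labels(endpoint, status, user_plan):
    # user_id never appears; user_tier has 3 possible values, not millions
    return {
        "endpoint": endpoint,
        "status": str(status),
        "user_tier": PLAN_TO_TIER.get(user_plan, "free"),
    }

print(metric_labels("/api/users", 200, "pro"))
# {'endpoint': '/api/users', 'status': '200', 'user_tier': 'paid'}
```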

Strategy 3: Use exemplars (metrics + traces bridge)

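Exemplars attach a sample trace ID to an otherwise low-cardinality histogram bucket. In the OpenMetrics exposition format an exemplar looks like this (values illustrative):

```text
http_request_duration_seconds_bucket{le="0.5"} 1450 # {trace_id="4bf92f35"} 0.43
```

The metric stays low-cardinality, but a dashboard can jump from the histogram straight to a representative trace.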

Strategy 4: Cardinality-aware databases

  • VictoriaMetrics and Grafana Mimir sustain far more active series than a single Prometheus, though high cardinality still costs memory and money
  • Columnar analytics stores (ClickHouse, Druid) are designed for queries over billions of unique values
  • Wide-event stores treat every field as queryable per event, avoiding a per-series index entirely

Strategy 5: Label transformation/aggregation

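A common transformation is normalizing URL paths so embedded IDs never become label values; a regex sketch in Python:

```python
import re

# Replace numeric and UUID-like path segments with fixed placeholders.
_UUID = re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
_ID = re.compile(r"/\d+")

def normalize_path(path):
    path = _UUID.sub("/:uuid", path)
    return _ID.sub("/:id", path)

print(normalize_path("/users/12345/orders/987"))   # /users/:id/orders/:id
```

Millions of distinct paths collapse into a handful of route templates.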

Key insight box

Managing high cardinality requires multiple strategies: use the right tool (metrics vs traces), aggregate, sample, or transform labels to reduce cardinality.

Challenge question

Your CEO wants a dashboard showing "response time per customer." How do you build this without exploding cardinality?


[PUZZLE] The sampling dilemma - keeping enough data

Scenario

You decide to sample high-cardinality metrics: keep 1% of user_id values.

Problem: Enterprise customer (0.01% of users) generates error. Their data wasn't sampled. You can't debug.

Think about it

Random sampling might miss important users. How do you sample intelligently?

Interactive question (pause and think)

Which sampling strategy catches the most important data?

A. Random sampling (1% of all users)
B. Stratified sampling (1% per user tier)
C. Always keep errors + sample healthy traffic
D. Dynamic sampling based on anomalies

Progressive reveal

Answer: C or D, depending on use case.

Random sampling (A) misses rare but important events. Stratified (B) is better but still probabilistic.

Intelligent sampling strategies

Strategy 1: Error-weighted sampling

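A sketch of error-weighted, deterministic sampling in Python (the 1% base rate is an assumption):

```python
import hashlib

def should_sample(user_id, is_error, base_rate=0.01):
    if is_error:
        return True                      # always keep errors
    # Deterministic: the same user is consistently kept or dropped
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate

print(should_sample("u-1", is_error=True))   # True: errors bypass sampling
```

Hashing the user ID instead of rolling dice means a sampled user's entire history is retained, which makes debugging coherent.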

Strategy 2: Tier-based sampling

Sample each customer tier at its own rate: for example, enterprise at 100%, paid at 10%, free at 1%. High-value customers stay fully observable while bulk traffic stays cheap.

Strategy 3: Reservoir sampling (keep diverse sample)

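Reservoir sampling (Algorithm R) keeps a uniform fixed-size sample from a stream of unknown length; a minimal sketch in Python:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item      # replace with decreasing probability
    return reservoir

sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))                       # 100
```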

Strategy 4: Anomaly-based sampling

Raise the sampling rate dynamically when a signal deviates from its baseline (error rate doubling, latency spiking), so detail is captured exactly when incidents occur, then fall back to the base rate.

Combining strategies

In practice these compose into a per-event decision: always keep errors, always keep traffic flagged as anomalous, otherwise apply the tier's deterministic sampling rate.

Key insight box

Intelligent sampling keeps important data (errors, anomalies, VIPs) at 100% while sampling less important data. Don't rely on pure random sampling.

Challenge question

You sample 1% of users but need to report "accurate total request count." How do you estimate the true count from sampled data?


[WARNING] Query performance degradation patterns

Scenario

Your metrics database has high cardinality. Queries that took seconds now take minutes.

What's happening under the hood?

Performance degradation mechanics

  • Label-index (postings) lookups return enormous series sets, so every selector touches more data
  • Each matched series adds chunks that must be read, decoded, and merged
  • Index and head-block memory grow with active series, raising GC pressure and OOM risk
  • Cache hit rates collapse once the working set of series no longer fits in memory

Query optimization techniques

Technique 1: Pre-aggregation (recording rules)

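A recording rule pre-computes the expensive aggregation once per evaluation interval, so dashboards query the cheap result instead. Standard Prometheus rule syntax, assuming a histogram named http_request_duration_seconds:

```yaml
groups:
  - name: api-aggregates
    interval: 30s
    rules:
      - record: endpoint:http_request_duration_seconds:p95
        expr: >
          histogram_quantile(0.95,
            sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
```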

Technique 2: Query result caching

Cache query results at a query-frontend layer (for example, Thanos Query Frontend or Mimir's query frontend) so repeated dashboard panels and overlapping time ranges reuse previous results instead of re-scanning millions of series.

Technique 3: Downsampling

Keep raw-resolution samples only for a recent window, and store older data at progressively coarser resolutions (Thanos, for example, produces 5-minute and 1-hour downsampled blocks). Queries over long ranges then read far fewer samples.

Technique 4: Cardinality limits

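Prometheus can enforce limits per scrape job; exceeding them fails the scrape instead of exploding the TSDB (the values here are examples, not recommendations):

```yaml
scrape_configs:
  - job_name: api
    sample_limit: 50000          # max samples accepted per scrape
    label_limit: 30              # max labels per series
    label_value_length_limit: 200
```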

Key insight box

High cardinality degrades query performance dramatically as series counts grow. Mitigate with pre-aggregation, caching, downsampling, and cardinality limits.

Challenge question

Your dashboard queries 100 high-cardinality metrics simultaneously. How do you make it load in < 5 seconds?


[SYNTHESIS] Final synthesis - Design a high-cardinality observability system

Synthesis challenge

You're the observability architect for a multi-tenant SaaS platform.

Requirements:

  • 1 million users across 10K organizations
  • Users want "per-user dashboards" showing their activity
  • Need to debug individual user issues
  • Compliance: 1-year data retention
  • Budget: $30K/month for observability

Constraints:

  • Current: Prometheus with user_id labels (10M series, constantly OOM)
  • Query latency: 2+ minutes for dashboards
  • Team: frustrated, want to remove observability entirely

Your tasks (pause and think)

  1. Analyze current cardinality problem
  2. Design metrics strategy (what goes in metrics?)
  3. Design traces/logs strategy (what doesn't go in metrics?)
  4. Choose appropriate databases
  5. Implement sampling strategy
  6. Optimize query performance

Write down your architecture.

Progressive reveal (one possible solution)

1. Cardinality analysis:

  • The user_id label multiplied across endpoints and status codes is what produced 10M series; a single Prometheus holds that only under constant OOM pressure
  • Per-series index and head-block memory, not raw traffic, are exhausting the instance
  • Conclusion: user_id cannot remain a metric label

2. Metrics strategy (low-cardinality only):

  • Bounded labels only: endpoint, method, status_code, region, org_tier
  • Aggregate by organization tier rather than by user; if per-org metrics are unavoidable, confine org_id to a handful of metrics and never cross it with other labels
  • Target well under 1M total active series

3. Traces/logs strategy (high-cardinality):

  • user_id, org_id, and request_id become span attributes and log fields, where cost is per event rather than per series
  • Exemplars on latency histograms link aggregate metrics to example trace IDs for drill-down
  • "Per-user dashboards" are served from pre-aggregated analytics tables, not from raw metrics

4. Database choices:

  • Prometheus or VictoriaMetrics for low-cardinality operational metrics
  • Tempo or Jaeger for traces (per-request debugging)
  • Loki or Elasticsearch for logs
  • ClickHouse or BigQuery for per-user analytics and the 1-year compliance retention

5. Sampling strategy:

  • Keep 100% of errors and anomalous traffic
  • Tier-based rates for healthy traffic: enterprise 100%, paid 10%, free 1%
  • Deterministic per user_id, so any given user's requests are consistently kept or dropped

6. Query performance optimization:

  • Recording rules pre-aggregate every dashboard query
  • Result caching at the query frontend
  • Downsample metrics older than 30 days; pre-aggregate per-user stats in the analytics database (e.g., nightly materialized views)
  • Cardinality limits on scrape jobs to prevent regressions

Key insight box

High-cardinality requires multi-database architecture: metrics for aggregates, traces for per-request debugging, analytics DB for pre-aggregated user stats.

Final challenge question

After implementing this architecture, your COO asks: "Why do we need 4 databases? Can't we just use one?" How do you explain the trade-offs?


Appendix: Quick checklist (printable)

Cardinality audit:

  • List all metrics and their labels
  • Count unique values per label (identify high-cardinality)
  • Calculate max cardinality (multiply label cardinalities)
  • Identify labels with >10K unique values
  • Flag metrics with >100K total series

Cardinality reduction:

  • Remove high-cardinality labels (user_id, request_id, etc.)
  • Transform labels (normalize URLs, aggregate errors)
  • Use low-cardinality aggregates (user_tier not user_id)
  • Move high-cardinality data to traces/logs
  • Implement sampling for necessary high-cardinality metrics

Database selection:

  • Prometheus/VictoriaMetrics: low-cardinality metrics
  • Tempo/Jaeger: high-cardinality traces
  • ClickHouse/BigQuery: analytical queries on high-cardinality
  • Loki/Elasticsearch: logs
  • Use exemplars to bridge metrics and traces

Query optimization:

  • Create recording rules (pre-aggregation)
  • Implement query caching (reduce repeated queries)
  • Downsample old data (reduce resolution over time)
  • Set cardinality limits (prevent explosion)
  • Monitor query latency (alert on degradation)

Sampling strategy:

  • Define sampling rate per metric
  • Use intelligent sampling (errors, anomalies, VIPs)
  • Implement deterministic sampling (consistent per entity)
  • Document what's sampled and what's not
  • Validate sample representativeness

Operational:

  • Monitor cardinality growth (trend over time)
  • Alert on cardinality spikes (new labels added)
  • Review metrics quarterly (remove unused)
  • Educate team on cardinality impact
  • Document cardinality guidelines

Red flags:

  • Metrics with >1M series
  • Labels with >100K unique values
  • Query latency >10s
  • Prometheus OOM crashes
  • Storage costs growing >50%/month
  • New labels added without review

Key Takeaways

  1. High-cardinality dimensions (user IDs, request IDs) explode metric storage — creating millions of unique time series that overwhelm traditional monitoring systems
  2. Don't use high-cardinality values as metric labels — use them in logs and traces instead, where storage is per-event not per-series
  3. Columnar storage (ClickHouse, Druid) handles high-cardinality analytics — compressing and querying billions of unique values efficiently
  4. Aggregate before storing where possible — pre-compute counts and summaries rather than storing every raw data point as a metric
Chapter complete!

Course Complete!

You've finished all 51 chapters of System Design Advanced.
