Observability · Chapter 45 of 51

High Cardinality Data Management

Akhil Sharma
20 min

High-Cardinality Data Management (The Curse of Infinite Dimensions)

Audience: observability engineers dealing with exploding metrics, logs, and traces in high-scale systems.

This article assumes:

  • Your metrics have attributes like user_id, request_id, container_id (millions of unique values).
  • Traditional time-series databases explode when cardinality increases.
  • Query performance degrades from seconds to minutes as cardinality grows.
  • Storage costs scale linearly with cardinality - this gets expensive fast.

[CHALLENGE] Challenge: Your metrics database just exploded

Scenario

You add a new metric: http_request_duration{user_id="..."}.

Timeline:

  • Day 1: 10K unique users, metrics database happy
  • Day 7: 100K unique users, query latency increases to 5 seconds
  • Day 14: 1M unique users, Prometheus OOM crashes
  • Day 30: Database unusable, costs $50K/month

One innocent user_id label destroyed your observability.

Interactive question (pause and think)

What went wrong?

  1. Prometheus can't handle scale
  2. user_id is high-cardinality (millions of unique values)
  3. Needed more memory
  4. Should have used a different database

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) - high cardinality kills time-series databases.

More memory (option 3) delays the problem but doesn't solve it. Different database (option 4) might help, but understanding cardinality is key.

Real-world analogy (library catalog)

Imagine cataloging books:

Low cardinality: Genre (fiction, non-fiction, science, history) = 10 categories. Easy to organize and search.

High cardinality: ISBN (every book has unique number) = millions of categories. Impossible to efficiently organize and search without specialized indexes.

Metrics databases are optimized for low cardinality (like genres), not high cardinality (like ISBNs).

Key insight box

Cardinality is the number of unique combinations of label values. High cardinality (millions+) causes exponential growth in storage and query cost.

Challenge question

If high-cardinality metrics are so problematic, why do we need them? Can't we just avoid them?


[MENTAL MODEL] Mental model - Cardinality is a combinatorial explosion

Scenario

You have a metric with three labels:

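For example, a request-counter metric of this shape (metric and label names are illustrative, matching the question below):

```promql
http_requests_total{endpoint="/api/users", status="200", region="us-east-1"}
http_requests_total{endpoint="/api/users", status="500", region="eu-west-1"}
# ...one time series per unique (endpoint, status, region) combination
```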

Interactive question (pause and think)

If you have:

  • 100 endpoints
  • 10 status codes
  • 5 regions

How many unique time series can this metric create?

A. 115 (100 + 10 + 5)
B. 500 (100 × 5)
C. 5,000 (100 × 10 × 5)
D. It depends on actual combinations

Progressive reveal

Answer: C (worst case), D (in practice).

Maximum cardinality = product of all label cardinalities.

Cardinality mathematics

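The arithmetic can be checked directly; a quick sketch using the numbers from the question above:

```python
from math import prod

# Maximum cardinality is the product of each label's unique-value count.
def max_cardinality(label_counts):
    return prod(label_counts)

base = max_cardinality([100, 10, 5])    # endpoint x status x region
print(base)                             # 5000 series, worst case

# Adding one high-cardinality label multiplies, never adds:
with_user_id = max_cardinality([100, 10, 5, 1_000_000])
print(with_user_id)                     # 5,000,000,000 potential series
```

One million users turns a harmless 5,000-series metric into five billion potential series.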

Why time-series databases struggle

  • Every unique label combination is its own series, with its own index entries and in-memory state
  • Head-block memory and index size scale with active series, not with traffic volume
  • Series churn (restarting pods, rotating IDs) keeps minting new series even when load is flat
  • Queries must find and merge every matching series, so selectors slow down as series counts grow

Mental model

Think of cardinality as:

  • Dimensionality curse: More dimensions = exponentially more space
  • Combinatorial explosion: Labels multiply, not add
  • Query cost multiplier: Each label filters, but high-cardinality labels don't filter much

Key insight box

Cardinality isn't additive (100 + 1M = 1M), it's multiplicative (100 × 1M = 100M). One high-cardinality label destroys everything.

Challenge question

You have 5 labels, 4 are low-cardinality (10 values each), 1 is high-cardinality (1M values). Which label causes the problem?


[WARNING] Identifying high-cardinality culprits

Scenario

Your Prometheus is slow. You suspect high cardinality. How do you find the culprit?

Cardinality analysis techniques

Technique 1: Prometheus built-in metrics

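Prometheus tracks its own series counts; these standard queries surface the worst offenders (the last one assumes a suspect metric and label to inspect):

```promql
# Total series currently in the head block
prometheus_tsdb_head_series

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Unique values of one suspect label on one metric
count(count by (user_id) (http_request_duration_seconds_count))
```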

Technique 2: Prometheus TSDB stats

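Prometheus's TSDB status endpoint reports the top metric names and label names by series count (adjust host and port to your deployment):

```bash
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```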

Technique 3: Cardinality profiling script

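A minimal, self-contained sketch of a cardinality profiler: it counts unique values per label across a list of label sets (in practice you would feed it series fetched from Prometheus's /api/v1/series endpoint; the sample data is illustrative):

```python
from collections import defaultdict

def profile_label_cardinality(series):
    """Count unique values per label across a list of label dicts,
    highest-cardinality labels first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(((name, len(v)) for name, v in values.items()),
                  key=lambda item: -item[1])

series = [
    {"endpoint": "/api/users", "status": "200", "user_id": "u1"},
    {"endpoint": "/api/users", "status": "200", "user_id": "u2"},
    {"endpoint": "/api/orders", "status": "500", "user_id": "u3"},
]
print(profile_label_cardinality(series))
# user_id leads with 3 unique values; endpoint and status have 2 each
```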

Common high-cardinality culprits

  • Identifiers: user_id, request_id, trace_id, session_id
  • Infrastructure churn: pod_id, container_id, client IP addresses
  • Unbounded strings: full URL paths with embedded IDs, raw error messages, query strings
  • Timestamps, build hashes, or version strings used as label values

Key insight box

High-cardinality labels are usually identifiers (IDs, IPs, timestamps) or unbounded strings. Look for labels with >10K unique values.

Challenge question

You find that pod_id has 100K unique values (Kubernetes pods constantly restart). Should you remove the label or handle it differently?


[DEEP DIVE] Strategies for managing high-cardinality data

Scenario

You've identified high-cardinality labels. Now what?

You can't just delete them - they provide valuable debugging context.

Strategy 1: Don't use metrics for high-cardinality data

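A before/after sketch (metric and attribute names are illustrative):

```text
# BAD: one series per user - cardinality explodes
http_request_duration_seconds{endpoint="/api/users", user_id="u-82917"}

# GOOD: bounded labels on the metric...
http_request_duration_seconds{endpoint="/api/users", status="200"}

# ...and the high-cardinality context in the trace/log for that request:
# span attributes: user_id=u-82917, request_id=req-5531
```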

When to use what:

  • Metrics: low-cardinality aggregates (request rates, latency percentiles by endpoint/status/region)
  • Traces: per-request debugging, where user_id and request_id belong as span attributes
  • Logs: discrete events carrying arbitrary high-cardinality fields
  • Analytics databases: per-user or per-customer reporting over pre-aggregated data

Strategy 2: Aggregate at write time

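One way to sketch this in Python: collapse the unbounded user_id into a bounded tier before it ever becomes a label (the plan-to-tier mapping is an assumption):

```python
# Map an unbounded identifier to a bounded label before recording the metric.
PLAN_TO_TIER = {"enterprise": "enterprise", "pro": "paid", "free": "free"}

def metric_labels(endpoint, status, user_plan):
    # user_id never appears; user_tier has 3 possible values, not millions
    return {
        "endpoint": endpoint,
        "status": str(status),
        "user_tier": PLAN_TO_TIER.get(user_plan, "free"),
    }

print(metric_labels("/api/users", 200, "pro"))
# {'endpoint': '/api/users', 'status': '200', 'user_tier': 'paid'}
```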

Strategy 3: Use exemplars (metrics + traces bridge)

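Exemplars attach a sample trace ID to an otherwise low-cardinality histogram bucket. In the OpenMetrics exposition format an exemplar looks like this (values illustrative):

```text
http_request_duration_seconds_bucket{le="0.5"} 1450 # {trace_id="4bf92f35"} 0.43
```

The metric stays low-cardinality, but a dashboard can jump from the histogram straight to a representative trace.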

Strategy 4: Cardinality-aware databases

  • VictoriaMetrics and Grafana Mimir sustain far more active series than a single Prometheus, though high cardinality still costs memory and money
  • Columnar analytics stores (ClickHouse, Druid) are designed for queries over billions of unique values
  • Wide-event stores treat every field as queryable per event, avoiding a per-series index entirely

Strategy 5: Label transformation/aggregation

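A common transformation is normalizing URL paths so embedded IDs never become label values; a regex sketch in Python:

```python
import re

# Replace numeric and UUID-like path segments with fixed placeholders.
_UUID = re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
_ID = re.compile(r"/\d+")

def normalize_path(path):
    path = _UUID.sub("/:uuid", path)
    return _ID.sub("/:id", path)

print(normalize_path("/users/12345/orders/987"))   # /users/:id/orders/:id
```

Millions of distinct paths collapse into a handful of route templates.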

Key insight box

Managing high cardinality requires multiple strategies: use the right tool (metrics vs traces), aggregate, sample, or transform labels to reduce cardinality.

Challenge question

Your CEO wants a dashboard showing "response time per customer." How do you build this without exploding cardinality?


[PUZZLE] The sampling dilemma - keeping enough data

Scenario

You decide to sample high-cardinality metrics: keep 1% of user_id values.

Problem: Enterprise customer (0.01% of users) generates error. Their data wasn't sampled. You can't debug.

Think about it

Random sampling might miss important users. How do you sample intelligently?

Interactive question (pause and think)

Which sampling strategy catches the most important data?

A. Random sampling (1% of all users)
B. Stratified sampling (1% per user tier)
C. Always keep errors + sample healthy traffic
D. Dynamic sampling based on anomalies

Progressive reveal

Answer: C or D, depending on use case.

Random sampling (A) misses rare but important events. Stratified (B) is better but still probabilistic.

Intelligent sampling strategies

Strategy 1: Error-weighted sampling

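A sketch of error-weighted, deterministic sampling in Python (the 1% base rate is an assumption):

```python
import hashlib

def should_sample(user_id, is_error, base_rate=0.01):
    if is_error:
        return True                      # always keep errors
    # Deterministic: the same user is consistently kept or dropped
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate

print(should_sample("u-1", is_error=True))   # True: errors bypass sampling
```

Hashing the user ID instead of rolling dice means a sampled user's entire history is retained, which makes debugging coherent.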

Strategy 2: Tier-based sampling

Sample each customer tier at its own rate: for example, enterprise at 100%, paid at 10%, free at 1%. High-value customers stay fully observable while bulk traffic stays cheap.

Strategy 3: Reservoir sampling (keep diverse sample)

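Reservoir sampling (Algorithm R) keeps a uniform fixed-size sample from a stream of unknown length; a minimal sketch in Python:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item      # replace with decreasing probability
    return reservoir

sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))                       # 100
```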

Strategy 4: Anomaly-based sampling

Raise the sampling rate dynamically when a signal deviates from its baseline (error rate doubling, latency spiking), so detail is captured exactly when incidents occur, then fall back to the base rate.

Combining strategies

In practice these compose into a per-event decision: always keep errors, always keep traffic flagged as anomalous, otherwise apply the tier's deterministic sampling rate.

Key insight box

Intelligent sampling keeps important data (errors, anomalies, VIPs) at 100% while sampling less important data. Don't rely on pure random sampling.

Challenge question

You sample 1% of users but need to report "accurate total request count." How do you estimate the true count from sampled data?


[WARNING] Query performance degradation patterns

Scenario

Your metrics database has high cardinality. Queries that took seconds now take minutes.

What's happening under the hood?

Performance degradation mechanics

  • Label-index (postings) lookups return enormous series sets, so every selector touches more data
  • Each matched series adds chunks that must be read, decoded, and merged
  • Index and head-block memory grow with active series, raising GC pressure and OOM risk
  • Cache hit rates collapse once the working set of series no longer fits in memory

Query optimization techniques

Technique 1: Pre-aggregation (recording rules)

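A recording rule pre-computes the expensive aggregation once per evaluation interval, so dashboards query the cheap result instead. Standard Prometheus rule syntax, assuming a histogram named http_request_duration_seconds:

```yaml
groups:
  - name: api-aggregates
    interval: 30s
    rules:
      - record: endpoint:http_request_duration_seconds:p95
        expr: >
          histogram_quantile(0.95,
            sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
```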

Technique 2: Query result caching

Cache query results at a query-frontend layer (for example, Thanos Query Frontend or Mimir's query frontend) so repeated dashboard panels and overlapping time ranges reuse previous results instead of re-scanning millions of series.

Technique 3: Downsampling

Keep raw-resolution samples only for a recent window, and store older data at progressively coarser resolutions (Thanos, for example, produces 5-minute and 1-hour downsampled blocks). Queries over long ranges then read far fewer samples.

Technique 4: Cardinality limits

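Prometheus can enforce limits per scrape job; exceeding them fails the scrape instead of exploding the TSDB (the values here are examples, not recommendations):

```yaml
scrape_configs:
  - job_name: api
    sample_limit: 50000          # max samples accepted per scrape
    label_limit: 30              # max labels per series
    label_value_length_limit: 200
```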

Key insight box

High cardinality degrades query performance dramatically as series counts grow. Mitigate with pre-aggregation, caching, downsampling, and cardinality limits.

Challenge question

Your dashboard queries 100 high-cardinality metrics simultaneously. How do you make it load in < 5 seconds?


[SYNTHESIS] Final synthesis - Design a high-cardinality observability system

Synthesis challenge

You're the observability architect for a multi-tenant SaaS platform.

Requirements:

  • 1 million users across 10K organizations
  • Users want "per-user dashboards" showing their activity
  • Need to debug individual user issues
  • Compliance: 1-year data retention
  • Budget: $30K/month for observability

Constraints:

  • Current: Prometheus with user_id labels (10M series, constantly OOM)
  • Query latency: 2+ minutes for dashboards
  • Team: frustrated, want to remove observability entirely

Your tasks (pause and think)

  1. Analyze current cardinality problem
  2. Design metrics strategy (what goes in metrics?)
  3. Design traces/logs strategy (what doesn't go in metrics?)
  4. Choose appropriate databases
  5. Implement sampling strategy
  6. Optimize query performance

Write down your architecture.

Progressive reveal (one possible solution)

1. Cardinality analysis:

  • The user_id label multiplied across endpoints and status codes is what produced 10M series; a single Prometheus holds that only under constant OOM pressure
  • Per-series index and head-block memory, not raw traffic, are exhausting the instance
  • Conclusion: user_id cannot remain a metric label

2. Metrics strategy (low-cardinality only):

  • Bounded labels only: endpoint, method, status_code, region, org_tier
  • Aggregate by organization tier rather than by user; if per-org metrics are unavoidable, confine org_id to a handful of metrics and never cross it with other labels
  • Target well under 1M total active series

3. Traces/logs strategy (high-cardinality):

  • user_id, org_id, and request_id become span attributes and log fields, where cost is per event rather than per series
  • Exemplars on latency histograms link aggregate metrics to example trace IDs for drill-down
  • "Per-user dashboards" are served from pre-aggregated analytics tables, not from raw metrics

4. Database choices:

  • Prometheus or VictoriaMetrics for low-cardinality operational metrics
  • Tempo or Jaeger for traces (per-request debugging)
  • Loki or Elasticsearch for logs
  • ClickHouse or BigQuery for per-user analytics and the 1-year compliance retention

5. Sampling strategy:

  • Keep 100% of errors and anomalous traffic
  • Tier-based rates for healthy traffic: enterprise 100%, paid 10%, free 1%
  • Deterministic per user_id, so any given user's requests are consistently kept or dropped

6. Query performance optimization:

  • Recording rules pre-aggregate every dashboard query
  • Result caching at the query frontend
  • Downsample metrics older than 30 days; pre-aggregate per-user stats in the analytics database (e.g., nightly materialized views)
  • Cardinality limits on scrape jobs to prevent regressions

Key insight box

High-cardinality requires multi-database architecture: metrics for aggregates, traces for per-request debugging, analytics DB for pre-aggregated user stats.

Final challenge question

After implementing this architecture, your COO asks: "Why do we need 4 databases? Can't we just use one?" How do you explain the trade-offs?


Appendix: Quick checklist (printable)

Cardinality audit:

  • List all metrics and their labels
  • Count unique values per label (identify high-cardinality)
  • Calculate max cardinality (multiply label cardinalities)
  • Identify labels with >10K unique values
  • Flag metrics with >100K total series

Cardinality reduction:

  • Remove high-cardinality labels (user_id, request_id, etc.)
  • Transform labels (normalize URLs, aggregate errors)
  • Use low-cardinality aggregates (user_tier not user_id)
  • Move high-cardinality data to traces/logs
  • Implement sampling for necessary high-cardinality metrics

Database selection:

  • Prometheus/VictoriaMetrics: low-cardinality metrics
  • Tempo/Jaeger: high-cardinality traces
  • ClickHouse/BigQuery: analytical queries on high-cardinality
  • Loki/Elasticsearch: logs
  • Use exemplars to bridge metrics and traces

Query optimization:

  • Create recording rules (pre-aggregation)
  • Implement query caching (reduce repeated queries)
  • Downsample old data (reduce resolution over time)
  • Set cardinality limits (prevent explosion)
  • Monitor query latency (alert on degradation)

Sampling strategy:

  • Define sampling rate per metric
  • Use intelligent sampling (errors, anomalies, VIPs)
  • Implement deterministic sampling (consistent per entity)
  • Document what's sampled and what's not
  • Validate sample representativeness

Operational:

  • Monitor cardinality growth (trend over time)
  • Alert on cardinality spikes (new labels added)
  • Review metrics quarterly (remove unused)
  • Educate team on cardinality impact
  • Document cardinality guidelines

Red flags:

  • Metrics with >1M series
  • Labels with >100K unique values
  • Query latency >10s
  • Prometheus OOM crashes
  • Storage costs growing >50%/month
  • New labels added without review

Key Takeaways

  1. High-cardinality dimensions (user IDs, request IDs) explode metric storage — creating millions of unique time series that overwhelm traditional monitoring systems
  2. Don't use high-cardinality values as metric labels — use them in logs and traces instead, where storage is per-event not per-series
  3. Columnar storage (ClickHouse, Druid) handles high-cardinality analytics — compressing and querying billions of unique values efficiently
  4. Aggregate before storing where possible — pre-compute counts and summaries rather than storing every raw data point as a metric
Chapter complete!

Course Complete!

You've finished all 51 chapters of System Design Advanced.
