Audience: observability engineers dealing with the explosion of metrics, logs, and traces in high-scale systems.
This article assumes:
You add a new metric: http_request_duration{user_id="..."}.
Timeline:
One innocent user_id label destroyed your observability.
What went wrong?
Take 10 seconds.
Answer: (2) - high cardinality kills time-series databases.
More memory (option 3) delays the problem but doesn't solve it. A different database (option 4) might help, but understanding cardinality is the key.
Imagine cataloging books:
Low cardinality: Genre (fiction, non-fiction, science, history) = 10 categories. Easy to organize and search.
High cardinality: ISBN (every book has a unique number) = millions of categories. Impossible to organize and search efficiently without specialized indexes.
Metrics databases are optimized for low cardinality (like genres), not high cardinality (like ISBNs).
Cardinality is the number of unique combinations of label values. High cardinality (millions of series and beyond) causes combinatorial growth in storage and query cost.
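In database terms, each unique combination of label values becomes its own time series. A minimal sketch, using made-up label sets:

```python
# Minimal sketch: cardinality = number of distinct label-value combinations.
# The series below are hypothetical examples, not real scrape data.
series = [
    {"method": "GET", "status": "200"},
    {"method": "GET", "status": "500"},
    {"method": "POST", "status": "200"},
    {"method": "GET", "status": "200"},  # duplicate combination: same series
]

# Each unique combination of label values is one time series.
cardinality = len({tuple(sorted(s.items())) for s in series})
print(cardinality)  # 3 unique series
```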
If high-cardinality metrics are so problematic, why do we need them? Can't we just avoid them?
You have a metric with three labels, with 100, 10, and 5 unique values respectively.
How many unique time series can this metric create?
A. 115 (100 + 10 + 5)
B. 500 (100 × 5)
C. 5,000 (100 × 10 × 5)
D. It depends on actual combinations
Answer: C (worst case), D (in practice).
Maximum cardinality = product of all label cardinalities.
Think of cardinality as multiplicative, not additive. Adding a 100-value label to a metric with a 1M-value label doesn't add 100 series; it multiplies: 100 × 1M = 100M. One high-cardinality label destroys everything.
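The arithmetic from the quiz, sketched with hypothetical label names:

```python
import math

# Worst-case series count is the product of per-label cardinalities,
# not the sum. The label names are illustrative; the counts match the
# quiz above (100, 10, 5).
labels = {"endpoint": 100, "status": 10, "region": 5}

worst_case = math.prod(labels.values())
print(worst_case)  # 5000 series: manageable

# One high-cardinality label multiplies everything:
labels["user_id"] = 1_000_000
print(math.prod(labels.values()))  # 5000000000 series: catastrophic
```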
You have 5 labels, 4 are low-cardinality (10 values each), 1 is high-cardinality (1M values). Which label causes the problem?
Your Prometheus is slow. You suspect high cardinality. How do you find the culprit?
Technique 1: Prometheus built-in metrics
Technique 2: Prometheus TSDB stats
Technique 3: Cardinality profiling script
High-cardinality labels are usually identifiers (IDs, IPs, timestamps) or unbounded strings. Look for labels with >10K unique values.
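The profiling idea behind Technique 3 can be sketched in a few lines, assuming series metadata in the label-set shape returned by Prometheus's /api/v1/series endpoint:

```python
from collections import defaultdict

def label_cardinality(series):
    """Count unique values per label across a list of series.
    Each series is a dict of label -> value, the shape returned by
    Prometheus's GET /api/v1/series endpoint."""
    values = defaultdict(set)
    for s in series:
        for label, value in s.items():
            values[label].add(value)
    return {label: len(vs) for label, vs in values.items()}

# Hypothetical sample; in practice, fetch the metadata with
# GET /api/v1/series?match[]=http_request_duration
sample = [
    {"__name__": "http_request_duration", "user_id": f"u{i}", "region": "us"}
    for i in range(3)
]
print(label_cardinality(sample))
# user_id has 3 unique values, region has 1 -> user_id is the suspect
```

Sort the result descending and any label near the top with tens of thousands of values is your culprit.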
You find that pod_id has 100K unique values (Kubernetes pods constantly restart). Should you remove the label or handle it differently?
You've identified high-cardinality labels. Now what?
You can't simply delete them: they provide valuable debugging context.
When to use what:
Managing high cardinality requires multiple strategies: use the right tool (metrics vs traces), aggregate, sample, or transform labels to reduce cardinality.
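One label-transformation sketch (the function, label names, and tiers are all hypothetical): replace the unbounded user_id with a bounded tier label before the metric is recorded:

```python
def reduce_label(user_id, tier_lookup):
    """Transform an unbounded user_id label into a bounded tier label.
    tier_lookup is a hypothetical mapping maintained elsewhere
    (e.g. synced from a billing system)."""
    # Unknown users fall into a default bucket, keeping the label bounded
    # to a handful of values instead of millions.
    return tier_lookup.get(user_id, "free")

tiers = {"u1001": "enterprise", "u1002": "pro"}
print(reduce_label("u1001", tiers))  # enterprise
print(reduce_label("u9999", tiers))  # free (unknown users default)
```

The metric keeps useful segmentation (per-tier latency) while the per-user detail moves to traces or logs.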
Your CEO wants a dashboard showing "response time per customer." How do you build this without exploding cardinality?
You decide to sample high-cardinality metrics: keep 1% of user_id values.
Problem: an enterprise customer (0.01% of users) generates an error. Their data wasn't sampled, so you can't debug it.
Random sampling might miss important users. How do you sample intelligently?
Which sampling strategy catches the most important data?
A. Random sampling (1% of all users)
B. Stratified sampling (1% per user tier)
C. Always keep errors + sample healthy traffic
D. Dynamic sampling based on anomalies
Answer: C or D, depending on use case.
Random sampling (A) misses rare but important events. Stratified (B) is better but still probabilistic.
Strategy 1: Error-weighted sampling
Strategy 2: Tier-based sampling
Strategy 3: Reservoir sampling (keep diverse sample)
Strategy 4: Anomaly-based sampling
Intelligent sampling keeps important data (errors, anomalies, VIPs) at 100% while sampling less important data. Don't rely on pure random sampling.
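Strategy 1 can be sketched as a per-event sampling decision that always keeps errors and VIP traffic (the field names `status` and `tier` are illustrative assumptions):

```python
import random

def should_sample(event, base_rate=0.01):
    """Error-weighted sampling sketch: keep 100% of errors and
    enterprise-tier traffic, sample the healthy remainder."""
    if event["status"] >= 500:         # always keep errors
        return True
    if event["tier"] == "enterprise":  # always keep VIP customers
        return True
    return random.random() < base_rate  # ~1% of healthy traffic

print(should_sample({"status": 503, "tier": "free"}))  # True: error kept
```

The same shape extends to Strategy 4: replace the tier check with an anomaly score threshold.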
You sample 1% of users but need to report "accurate total request count." How do you estimate the true count from sampled data?
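A common answer is an inverse-probability (Horvitz-Thompson-style) estimate: divide the sampled count by the sampling rate. A minimal sketch:

```python
def estimate_total(sampled_count, sampling_rate):
    """Scale the observed count by the inverse of the sampling
    probability to estimate the true total."""
    return sampled_count / sampling_rate

# If 1% sampling kept 1,230 requests, the true total is roughly 123,000:
print(estimate_total(1_230, 0.01))
```

If you use tiered or error-weighted rates, scale each stratum by its own rate before summing; never apply one global rate to a mixed sample.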
Your metrics database has high cardinality. Queries that took seconds now take minutes.
What's happening under the hood?
Technique 1: Pre-aggregation (recording rules)
Technique 2: Query result caching
Technique 3: Downsampling
Technique 4: Cardinality limits
High cardinality degrades query performance dramatically as series counts grow. Mitigate it with pre-aggregation, caching, downsampling, and cardinality limits.
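Technique 2 can be illustrated with a minimal TTL cache; production setups would rely on a query frontend (e.g. Thanos Query Frontend or Trickster) rather than hand-rolled caching, so treat this only as a sketch of the idea:

```python
import time

class QueryCache:
    """Minimal TTL cache sketch for expensive dashboard queries."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._store = {}

    def get(self, query, run):
        now = time.monotonic()
        hit = self._store.get(query)
        if hit and now - hit[0] < self.ttl:
            return hit[1]              # serve the cached result
        result = run(query)            # fall through to the TSDB
        self._store[query] = (now, result)
        return result

cache = QueryCache(ttl=30)
calls = []
runner = lambda q: calls.append(q) or len(calls)  # stand-in for the TSDB
cache.get("rate(http_requests_total[5m])", runner)
cache.get("rate(http_requests_total[5m])", runner)
print(len(calls))  # 1: the second call was served from cache
```

Dashboards refreshing every 30 seconds hit the TSDB once per interval instead of once per viewer.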
Your dashboard queries 100 high-cardinality metrics simultaneously. How do you make it load in under 5 seconds?
You're the observability architect for a multi-tenant SaaS platform.
Requirements:
Constraints:
Write down your architecture.
1. Cardinality analysis:
2. Metrics strategy (low-cardinality only):
3. Traces/logs strategy (high-cardinality):
4. Database choices:
5. Sampling strategy:
6. Query performance optimization:
High cardinality requires a multi-database architecture: metrics for aggregates, traces for per-request debugging, and an analytics DB for pre-aggregated user stats.
After implementing this architecture, your COO asks: "Why do we need 4 databases? Can't we just use one?" How do you explain the trade-offs?
Cardinality audit:
Cardinality reduction:
Database selection:
Query optimization:
Sampling strategy:
Operational:
Red flags: