Time Series DB — Cardinality Explosion Crashed Monitoring
Prometheus 30s timeouts, InfluxDB OOM every 4h due to 1.2M series explosion.
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
- Time series databases store timestamped measurements, optimized for high write throughput and time-range queries
- Storage engine decision: LSM-Tree for write-heavy (InfluxDB), B-Tree for mixed workloads (TimescaleDB), or custom append-only (Prometheus)
- Compression: delta-of-delta timestamps, XOR floating point, bit-packing — cuts storage 90%+
- Cardinality (number of unique metric series) is the #1 performance killer — keep under 100k per shard
- Retention policies define data lifecycle; downsampling preserves long-term trends while shedding raw data
- Production mistake: treating TSDB like a relational DB leads to slow queries and disk bloat within weeks
Imagine your fitness tracker records your heart rate every second — not just what it is right now, but a permanent, ordered log of every reading since you put it on. A time series database is like that log, but engineered specifically so you can ask questions like 'what was my average heart rate between 2pm and 3pm last Tuesday?' in milliseconds, even if there are billions of readings. It's a regular database's specialised cousin that obsesses over timestamps and makes time-based queries absurdly fast.
Every production system that matters emits a continuous stream of measurements — CPU usage spikes at 3am, a payment gateway latency ticks up by 12ms, a wind turbine's RPM drops before a bearing fails. These aren't events you look up by ID; they're facts anchored in time, and the questions you need to ask are always temporal: trends, anomalies, rolling averages, rate-of-change. General-purpose relational databases weren't built for this workload, and trying to make them do it at scale is one of the fastest ways to watch a perfectly good PostgreSQL instance beg for mercy.
Time series databases (TSDBs) exist because the write pattern, query pattern, and retention lifecycle of timestamped data are fundamentally different from transactional data. You almost never update a past measurement — the past is immutable. Writes are high-throughput and append-only. Queries almost always involve a time range plus aggregation. And data has a natural decay in value: last second's CPU reading matters more than last year's. A TSDB exploits all of these properties at every layer of its architecture — from how bytes land on disk to how queries are planned.
By the end of this article you'll understand exactly how TSDBs store and compress data internally, why cardinality is the silent killer of TSDB performance, how to make a principled choice between InfluxDB, TimescaleDB, and Prometheus for a given system design, and what production mistakes cost teams weeks of firefighting. This is the article you wish existed the first time a monitoring stack fell over at 2am.
What Makes a Database 'Time Series'?
A time series database is any storage system optimized for append-only writes of timestamped measurements. Unlike a traditional row store, a TSDB treats time as the primary sort key — data is always written in time order and queried by time range. This changes everything.
Relational databases organize data by entity (user, order, product). Time series databases organize by time first. That means writes go to the latest time chunk (hot shard), and compactions merge older chunks into read-optimised blocks. You never update a past value — the only operations are insert (new point) and delete (retention).
The query pattern is also different. You ask for 'average CPU over the last hour grouped by 5-minute windows' — not 'CPU value for server X at timestamp Y'. That's a range scan with aggregation, which TSDBs accelerate with time-based partitioning and pre-aggregated rollups.
- Data is always ordered by time — no random inserts.
- Past data is immutable: updates are inserts with a later timestamp.
- Queries are range scans with aggregation; point lookups are rare.
- Hot shard vs cold shard: recent data in memory, older data compressed on disk.
Storage Engine Internals: How TSDBs Organise Data on Disk
The storage engine determines write throughput, compression ratio, and query speed. Three approaches dominate: Log-Structured Merge (LSM) trees, B-Tree adaptations, and custom append-only block formats.
LSM Trees (InfluxDB, VictoriaMetrics, TimescaleDB hybrid)
Writes first land in a Write-Ahead Log (WAL), then are buffered in memory in a sorted data structure (memtable). When the memtable reaches a threshold (e.g., 64MB), it's flushed to disk as an immutable SSTable file. Background compaction merges multiple SSTables into larger ones, discarding tombstones and dead versions. This design gives excellent write throughput because every write is sequential. Reads require checking multiple layers (memory + several SSTables per level). Bloom filters accelerate point lookups, but range scans still need to merge overlapping SSTables.
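The write path above (WAL, memtable, flush, compaction, multi-run reads) can be sketched in a few lines. This is a toy model, not how InfluxDB or any real engine is implemented: the class name TinyLSM, the point-count flush threshold, and the in-memory lists standing in for files are all assumptions made for illustration.

```python
import bisect


class TinyLSM:
    """Toy LSM store: WAL list + dict memtable + immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # flush threshold in points (real engines use bytes)
        self.wal = []        # stand-in for the on-disk write-ahead log
        self.memtable = {}   # timestamp -> value for the newest writes
        self.sstables = []   # list of sorted [(timestamp, value), ...] runs

    def write(self, ts, value):
        self.wal.append((ts, value))   # 1. sequential append for durability
        self.memtable[ts] = value      # 2. buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. the memtable becomes an immutable sorted run; the WAL can be truncated
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []

    def compact(self):
        # 4. merge all runs into one; later runs win on duplicate timestamps
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [sorted(merged.items())]

    def range_scan(self, start, end):
        # Reads consult every run plus the memtable, newest data applied last
        result = {}
        for run in self.sstables:
            keys = [ts for ts, _ in run]
            lo = bisect.bisect_left(keys, start)
            hi = bisect.bisect_right(keys, end)
            result.update(run[lo:hi])
        result.update({ts: v for ts, v in self.memtable.items() if start <= ts <= end})
        return sorted(result.items())


db = TinyLSM(memtable_limit=3)
for ts in range(10, 80, 10):
    db.write(ts, ts / 10)
print(len(db.sstables), db.range_scan(25, 65))
```

With a limit of 3, seven writes leave two flushed runs and one point in the memtable, so the range scan has to merge results from several places. That is the read cost the paragraph describes.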
B-Tree with Time-Ordered Primary Key (TimescaleDB, hybrid)
TimescaleDB uses PostgreSQL's B-Tree where the primary key is (time, optional hash) in a chunked table structure. Writes are still sequential-ish because time is monotonically increasing. It supports full SQL, joins, and relational flexibility. Write throughput is lower than LSM for pure time series, but query performance on small ranges is better due to efficient index lookups. B-Trees also have no write amplification from compaction — but they do have page splits and WAL overhead.
Custom Append-Only (Prometheus, M3DB)
Prometheus groups incoming samples into blocks that each cover a 2-hour window. Within a block, every series is stored as compressed chunks (delta-of-delta timestamps, XOR-encoded values). The blocks are read-only once completed. This gives extremely fast read access to recent data but requires careful planning for long-term retention (usually offloaded to Thanos or Cortex).
Compression Techniques: How TSDBs Fit Petabytes into Gigabytes
Time series data is highly compressible because adjacent points are correlated — CPU utilization doesn't jump from 5% to 95% in a millisecond. TSDBs exploit this with specialized compression schemes.
Timestamp Compression: Delta-of-Delta
Instead of storing full 64-bit timestamps, TSDBs store the difference from the previous timestamp. Most timestamps arrive at a fixed interval (e.g., every 10s). The delta-of-delta is often zero or small. Prometheus uses this: for a batch of 122 points, it stores the first timestamp fully, then the first delta (usually the interval), then delta-of-delta values. When the interval is constant, delta-of-delta is zero and can be encoded in a single bit. This compresses 64-bit timestamps to an average of ~2 bits.
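A minimal sketch of the arithmetic behind delta-of-delta, assuming integer epoch-second timestamps. It produces the sequence of values an encoder would then bit-pack; the bit-level packing itself is omitted, and the function names are hypothetical.

```python
def delta_of_delta_encode(timestamps):
    """Return (first_ts, first_delta, [delta-of-delta, ...])."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)  # zero whenever the interval is steady
        prev_delta = delta
    return timestamps[0], first_delta, dods


def delta_of_delta_decode(first_ts, first_delta, dods):
    """Rebuild the original timestamps from the encoded form."""
    out = [first_ts, first_ts + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


ts = [1700000000, 1700000010, 1700000020, 1700000031, 1700000040]
first, d0, dods = delta_of_delta_encode(ts)
print(dods)  # [0, 1, -2]
```

The steady 10s interval yields a zero, and the one late sample shows up as a small +1 followed by a -2 correction. Small integers like these are what make the short variable-length codes possible.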
Value Compression: XOR (Gorilla)
Floating point values change slowly. Gorilla compression (originally from Facebook's Gorilla TSDB, now used in Prometheus) XORs consecutive values. If most bits are the same, the result has many leading and trailing zeros, so only the short run of differing bits needs to be stored, using a variable-length scheme. An unchanged value costs a single bit. Combined with delta-of-delta timestamps, Facebook reported an average of about 1.37 bytes per data point, down from 16 bytes uncompressed.
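To see why XOR helps, it is enough to look at the bit patterns. The sketch below only measures leading and trailing zeros of the XOR result and does not implement the full Gorilla bitstream; the helper names are hypothetical.

```python
import struct


def float_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stats(prev, cur):
    """Return (xor, leading_zeros, trailing_zeros) for two consecutive values."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 0, 64, 64  # identical value: nothing but a marker bit to store
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return x, leading, trailing


print(xor_stats(12.0, 12.0))      # (0, 64, 64)
print(xor_stats(12.0, 24.0)[1:])  # (11, 52)
```

Going from 12.0 to 24.0 changes only one exponent bit, so 63 of the 64 XOR bits are zero and a single meaningful bit remains to encode.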
Columnar Compression
InfluxDB stores each field in separate columns (similar to Parquet). Within a column, values are sorted by time first, so run-length encoding (RLE) and delta compression work well. TimescaleDB's compression converts chunks to a columnar layout and picks an algorithm per column type (delta-of-delta for timestamps and integers, Gorilla-style XOR for floats, dictionary encoding for repetitive text), achieving 5–20x reduction.
Real-world results: a 7-day retention of 10k series at 10s resolution (~605 million points) occupies ~12GB uncompressed, ~400MB compressed.
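The point count and raw size follow from simple arithmetic. A quick check, assuming 16 bytes per raw point (an 8-byte timestamp plus an 8-byte float). That per-point size is an assumption, not a figure from the measurement above, and real formats add tag and index overhead on top of it.

```python
SERIES = 10_000
INTERVAL_S = 10
RETENTION_S = 7 * 24 * 3600
BYTES_PER_POINT = 16  # assumed: 8-byte timestamp + 8-byte float, no overhead

points = SERIES * RETENTION_S // INTERVAL_S
raw_gb = points * BYTES_PER_POINT / 1e9

print(points)            # 604800000
print(round(raw_gb, 2))  # 9.68
```

Under that assumption the raw payload alone is just under 10GB, so a figure around 12GB is plausible once per-series metadata is included.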
Retention Policies and Downsampling — Managing Storage Over Time
Time series data has diminishing value over time. The exact CPU reading from 3 months ago isn't useful — but the hourly average is. TSDBs manage this through retention policies and downsampling.
Retention Policies (RP)
InfluxDB's RP defines how long data is kept and how many copies exist. For example, 'autogen' RP with DURATION 7d, REPLICATION 1. Once data exceeds the duration, it's automatically deleted. TimescaleDB uses chunk-based retention: you can drop chunks older than a threshold using drop_chunks() or automate with a background job. Prometheus stores data in blocks per retention period; blocks older than --storage.tsdb.retention.time are deleted.
Downsampling / Continuous Queries
To keep long-term historical data without blowing your storage budget, you create rollups: store raw data at 10s resolution for 7 days, then 1-minute averages for 30 days, then 1-hour averages for 1 year. InfluxDB calls this Continuous Queries (CQ). TimescaleDB uses time_bucket() and continuous aggregates. Prometheus recording rules can precompute rollups.
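Whatever the product calls it, a rollup is a group-by on a truncated timestamp. A minimal sketch in plain Python, assuming points arrive as (epoch_seconds, value) tuples; the function name is hypothetical and no database is involved.

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        bucket = ts - ts % bucket_seconds  # start of the bucket this point falls in
        sums[bucket] += value
        counts[bucket] += 1
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]


raw = [(0, 10.0), (10, 20.0), (50, 30.0), (60, 40.0), (110, 60.0)]
rollup = downsample_mean(raw, 60)
print(rollup)  # [(0, 20.0), (60, 50.0)]
```

Five raw points collapse to two 1-minute averages. A mean alone loses spikes, so rollups in practice often keep min, max, and count next to it.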
Shard Groups and Time-Based Partitioning
Data is written to shard groups that correspond to time ranges (e.g., 7-day shards). Each shard is stored as a set of files (TSM files in InfluxDB, chunks in Prometheus). When a shard is closed (no longer receiving writes), it can be fully compressed and moved to cold storage or even archived to object storage (e.g., S3) using tools like Thanos or InfluxDB Cloud's tiered storage.
- Hot tier: raw data in memory (last 6h) — fastest queries, expensive storage.
- Warm tier: compressed data on SSD (last 7 days) — good performance, moderate cost.
- Cold tier: downsampled rollups on HDD/object storage (months/years) — cheap, slower queries.
Cardinality: The Silent Performance Killer
Cardinality is the number of unique time series in your TSDB. Each combination of metric name + label key/value pair creates a distinct series. For Prometheus: http_requests_total{method="GET", endpoint="/api", status="200"} is one series. Add a label user_uuid with 10,000 different values, and you get 10,000 series — for just this one metric.
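The worst case for one metric is the product of the distinct values of each label. A small sketch makes the multiplication visible; the label values below are made up for illustration, and real systems usually see fewer combinations than the upper bound.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


labels = {
    "method": ["GET", "POST"],
    "endpoint": ["/api", "/health"],
    "status": ["200", "404", "500"],
}
base = worst_case_series(labels)

labels["user_uuid"] = [f"u{i}" for i in range(10_000)]
exploded = worst_case_series(labels)

print(base, exploded)  # 12 120000
```

One extra label takes a metric from 12 possible series to 120,000, which is why a single careless label can dominate the whole index.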
Why cardinality matters
- Index memory: TSDBs keep an in-memory index of all series. High cardinality consumes RAM and slows down writes because every new label combination must be added to the index (usually requires a lock or CAS operation).
- Query performance: Queries that scan many series (e.g., 'sum of all requests') must iterate over thousands of series. Even with aggregation, the planner needs to evaluate each series.
- Compaction overhead: More series means more SSTable files (in LSM) or more blocks (Prometheus). Compaction merges many small files, increasing CPU and I/O.
Safe cardinality thresholds
- Prometheus: keep total series under 500k per server (for 8GB RAM). Prometheus has no global series cap, so set sample_limit on each scrape job to bound how many samples a single target can contribute.
- InfluxDB: keep series per measurement under 100k. Use max-series-per-database config.
- TimescaleDB: less sensitive because indexing is disk-based (B-Tree), but high cardinality still hurts query performance. Keep distinct tag values under 1,000 per dimension.
How cardinality explosions happen
Common culprits: Including request IDs, timestamps, or random values as labels. For example, adding a 'container_id' label in a Kubernetes cluster with 500 pods creates 500 series per metric. Add 'pod_name' with unique names and it multiplies. Multiply by every redeployment (new container IDs), and you get unbounded growth.
Choosing the Right TSDB: InfluxDB, TimescaleDB, or Prometheus
You don't start with 'which TSDB is best?' You start with 'what's my workload?' The answer hinges on three things: write throughput, query complexity, and ecosystem.
InfluxDB
- Best for: High-volume IoT metrics, application monitoring, sensor data.
- Engine: LSM-style TSM (Time-Structured Merge tree). Strong compression, high write throughput (millions of points/sec on a single node).
- Query: InfluxQL or Flux (Flux is being phased out in 3.x). No SQL in 1.x/2.x, limited joins.
- Retention: Native RP and CQ. Downsampling built-in.
- Scalability: Single-node OSS up to ~1M series. Enterprise/Cloud for clustering. No built-in federation.
TimescaleDB
- Best for: Hybrid workloads, meaning time series that occasionally need SQL joins with relational data (e.g., financial trades, sensor + metadata).
- Engine: PostgreSQL with automatic partitioning (hypertable). Full SQL support, JSON, JOINs, window functions.
- Query: PostgreSQL SQL. Rich ecosystem of tools.
- Retention: Via drop_chunks() or an automated retention policy (add_retention_policy()).
- Scalability: Single-node with streaming replication for HA. Multi-node support shipped in 2.0 but was deprecated in 2.13.
Prometheus
- Best for: Cloud-native monitoring, Kubernetes, alerting tied to metrics.
- Engine: Custom block-based storage. Pull model (or push via Pushgateway).
- Query: PromQL. Powerful, but different from SQL. Excellent for alerting rules.
- Retention: Block-based with configurable retention. Long term via Thanos/Cortex.
- Scalability: Single-server limited to ~1M series. Use Thanos for aggregation across clusters.
When to avoid each
- Avoid InfluxDB if you need SQL joins or complex relational queries.
- Avoid TimescaleDB if your write throughput exceeds 500k points/sec on a single node (use LSM alternatives).
- Avoid Prometheus if you need to push metrics from many sources without a central coordinator (use Pushgateway carefully).
Scaling Writes: Why Your INSERT Becomes a Firehose
Time series databases live or die on write throughput. A single IoT deployment can hammer you with 10 million data points per second. Relational databases choke because they're built for transactional integrity — row-level locks, B-tree splits, MVCC overhead. TSDBs flip the script: they append, they batch, they sacrifice transactional guarantees for write speed.
The trick is the WAL (Write-Ahead Log). Every point hits the WAL first — sequential, no seeks. Then the TSDB batches those writes into memory-mapped segments before flushing to columnar storage on disk. This batching is critical: writing one row at a time kills performance. You batch 5,000–10,000 points per write call. Any less and you're fighting IOPS. Any more and you risk latency spikes.
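On the client side, batching is a small amount of code. A minimal sketch, assuming a sink callable that accepts a list of points. In a real client the sink would be the HTTP or line-protocol write call; here it is a plain list append so the example runs on its own, and the class name is hypothetical.

```python
class BatchingWriter:
    """Buffer points and hand them to `sink` in fixed-size batches."""

    def __init__(self, sink, batch_size=5000):
        self.sink = sink              # callable taking a list of points
        self.batch_size = batch_size
        self.buffer = []

    def write(self, point):
        self.buffer.append(point)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []          # fresh list; the sink keeps the old one


batches = []
writer = BatchingWriter(batches.append, batch_size=5000)
for i in range(12_000):
    writer.write(("cpu", i, 0.5))
writer.flush()                        # push the final partial batch

sizes = [len(b) for b in batches]
print(sizes)  # [5000, 5000, 2000]
```

A production writer would also flush on a timer so a slow trickle of points does not sit in the buffer indefinitely. That part is left out to keep the sketch short.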
If you're hitting 100k+ writes per second, understand your TSDB's sharding strategy. InfluxDB uses shards by time range. TimescaleDB uses hypertable chunks. Prometheus has no built-in sharding, so you split scrape targets across servers. Miss this and you'll hit a write bottleneck at 500k points per second. I've seen it kill a production monitoring pipeline at 2 AM on a Friday.
Querying the Past: How TSDBs Make Time Travel Fast
Reading time series data is the mirror problem of writing — you need to scan huge ranges of sequential data fast. Pulling 24 hours of sensor readings at one-second resolution means 86,400 points. Do that across 10,000 devices and you're looking at 864 million rows. No sane developer runs SELECT * on that. TSDBs win here with two tricks: time-based partitioning and chunked storage.
Partitioning by time is the first layer. Every TSDB splits data into time windows — one hour, one day, one week depending on retention. The query planner prunes partitions that don't match. If you query last hour, it doesn't scan last month's data. This isn't magic. It's just a bounded scan range.
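Partition pruning is an interval-overlap test applied before any data is read. A sketch assuming each partition is described by a half-open [start, end) window in epoch seconds; the chunk names and the function are hypothetical.

```python
def prune_partitions(partitions, query_start, query_end):
    """Keep only partitions whose [start, end) window overlaps the query range."""
    return [
        name
        for name, (start, end) in partitions.items()
        if start < query_end and end > query_start
    ]


HOUR = 3600
partitions = {f"chunk_{i}": (i * HOUR, (i + 1) * HOUR) for i in range(24)}

# Query the last 90 minutes of a 24-hour day.
hit = prune_partitions(partitions, 22 * HOUR + 1800, 24 * HOUR)
print(hit)  # ['chunk_22', 'chunk_23']
```

Twenty-two of the twenty-four chunks are never opened. Without a time predicate there is nothing to test against, so every chunk stays in the scan.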
The second trick: columnar storage with pre-aggregated blocks. Instead of storing each point naively, TSDBs store timestamps, values, and tags in separate column files. They pre-compute min, max, sum, count per block (typically 1,000–10,000 points per block). When you query an aggregate like AVG(), the engine reads the block metadata first. If the entire block fits your time range, it uses the pre-computed sum and count. No need to decompress and iterate individual points.
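The block-metadata shortcut can be sketched as follows. This is an illustrative model, not any engine's real format: the Block class, its fields, and the tiny 10-point blocks are assumptions. The second return value counts how many blocks had to be decoded point by point.

```python
from dataclasses import dataclass


@dataclass
class Block:
    start: int         # first timestamp in the block
    end: int           # last timestamp in the block (inclusive)
    points: list       # [(timestamp, value), ...]; compressed on disk in a real engine
    total: float = 0.0
    count: int = 0

    def __post_init__(self):
        # Pre-computed once, when the block is sealed
        self.total = sum(v for _, v in self.points)
        self.count = len(self.points)


def average(blocks, q_start, q_end):
    """AVG over [q_start, q_end]; returns (mean, blocks_decoded)."""
    total, count, decoded = 0.0, 0, 0
    for b in blocks:
        if b.end < q_start or b.start > q_end:
            continue                      # no overlap: skipped entirely
        if q_start <= b.start and b.end <= q_end:
            total += b.total              # fully covered: metadata only
            count += b.count
        else:
            decoded += 1                  # partial overlap: scan the points
            for ts, v in b.points:
                if q_start <= ts <= q_end:
                    total += v
                    count += 1
    return (total / count if count else None), decoded


blocks = [
    Block(0, 9, [(t, 1.0) for t in range(10)]),
    Block(10, 19, [(t, 2.0) for t in range(10, 20)]),
    Block(20, 29, [(t, 3.0) for t in range(20, 30)]),
]
print(average(blocks, 5, 29))  # (2.2, 1)
```

Only the block that straddles the start of the range is decoded; the other two are answered from their stored sum and count.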
This means a query like "average CPU usage over the last hour" can run in under 100ms on 1 billion points. But it only works if your query predicates include a time range. Forget the time filter and you're scanning everything. That's how you get 5-minute queries. I've fixed those production incidents more times than I can count.
ClickHouse: The Column-Oriented Wrecking Ball for Real-Time Analytics
Stop treating time-series databases as generic storage buckets. ClickHouse doesn't store rows; it stores columns. That swap changes everything for aggregation-heavy workloads. Why? Because most time-series queries ask for the average, max, or count of a metric across millions of timestamps — not the raw event data. Columnar storage means you only read the columns you need, not the entire row. Disk I/O plummets. Queries that would cripple InfluxDB or Prometheus finish in milliseconds.
ClickHouse uses a MergeTree engine with primary keys that double as sort orders. Your timestamp sits in that key, and each inserted part is sorted on write. The engine partitions data by the time expression you declare in PARTITION BY (by month is a common choice). Combine that with its vectorized query execution, which processes blocks of data rather than single rows, and you get hardware-level parallelism. The trade-off is real-time insert latency: batch your writes every few seconds, not per event. Use the asynchronous insert mode or buffer tables in production. Single-row inserts will kill throughput. This is not a transactional database. This is a chainsaw for analytical workloads.
The production pattern is simple: buffer metrics in memory, flush to ClickHouse every 5-30 seconds, and query with SQL that filters time ranges early. Materialized views precompute rollups for dashboard queries. Forget about joins. Denormalize everything into wide tables.
Apache Cassandra: When Your Write Throughput Must Survive a Nuclear Blast
Cassandra is not a time-series database. But people use it for time-series data because it laughs at write scaling. Why? It's a peer-to-peer ring with no single point of failure. Every node accepts writes. No leader. No election. You add nodes, write capacity doubles linearly. For metrics ingestion at planet scale — IoT fleets, CDN logs, stock ticks — Cassandra is the hammer that doesn't break.
The catch: Cassandra is terrible at range scans by default. Your time-series query 'give me all metrics for the last hour' becomes a full cluster scan if you don't design the partition key correctly. The golden rule: partition by a high-cardinality key (like device ID or metric name) and cluster by timestamp. Each partition stays small — under 100MB — so reads are fast. Use the TimeWindowCompactionStrategy (TWCS) to avoid compaction storms. TWCS compacts SSTables only within a fixed time window, preventing write amplification from killing your disks.
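One common way to keep partitions bounded is to add a time bucket to the partition key, so each device gets a new partition per day. That day bucket is an assumption layered on the rule above rather than something the text prescribes, and the helper names are hypothetical. The sketch shows how a time-range query maps to the exact partition keys it must name.

```python
from datetime import datetime, timezone

DAY = 86400


def partition_key(device_id, ts_epoch_s):
    """(device, UTC day) pair; timestamps within the day become clustering keys."""
    day = datetime.fromtimestamp(ts_epoch_s, tz=timezone.utc).strftime("%Y-%m-%d")
    return (device_id, day)


def partitions_for_range(device_id, start_s, end_s):
    """List every partition key a [start_s, end_s] query has to touch."""
    keys = []
    day_start = start_s - start_s % DAY
    while day_start <= end_s:
        keys.append(partition_key(device_id, day_start))
        day_start += DAY
    return keys


start = 1700000000  # 2023-11-14 22:13:20 UTC
keys = partitions_for_range("sensor-1", start, start + 3 * 3600)
print(keys)  # [('sensor-1', '2023-11-14'), ('sensor-1', '2023-11-15')]
```

A three-hour query that crosses midnight names exactly two partitions, so the read stays a targeted lookup instead of a cluster scan.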
Production reality: Cassandra handles 100K+ writes per second per node. But reads are expensive if you miss the partition key. Your query must always specify the full partition key. No exceptions. Design your data model around the query pattern first, then the ingestion. Compaction is not maintenance — it's a design constraint. Choose TWCS, set window size to match your retention, and never look back.
Amazon Timestream: The Managed Trap You Should Know Before You Sign
Amazon Timestream screams 'serverless time-series' and it delivers on setup time. You click a button, you have a database. No hardware provisioning. No cluster configuration. Why would anyone ever self-host again? Because the cost model will ruin your budget if you don't understand the pricing knobs. Timestream separates storage into 'memory store' (hot tier) and 'magnetic store' (cold tier). Writes go to memory. After a configurable time, data moves to magnetic storage automatically. That's seamless — until you realize the magnetic store charges per query, not just per GB.
The dirty secret: Timestream queries that scan magnetic data cost you on every byte read. A dashboard refresh that runs once per minute on 30 days of data will cost more in query charges than the storage itself. The fix is aggressive downsampling. Define scheduled queries that pre-aggregate data into hourly or daily tables stored only in memory. Accept the trade-off: you lose raw granularity past a week. Also, Timestream uses a proprietary query language that looks like SQL but isn't. No joins. No subqueries deeper than one level. Complex analytics require Lambda functions for post-processing.
Production path: use Timestream if your ops team is small and your query volume is low — under 100 queries per hour on cold data. For high-frequency dashboards, configure sampling at the SDK level to reduce write costs. And always set a retention policy to move old data to magnetic store after 48 hours max. Your AWS bill will thank you.
Why NoSQL Key-Value Stores Fail for Time-Series Workloads
Time-series databases demand range scans over time-ordered keys. Key-value stores like Redis or DynamoDB are optimized for single-key lookups and lack native ordering. To simulate time-series, developers prefix keys with timestamps (e.g., sensor_1700000000), but this creates hotspots on the latest partition and forces application-level sorting. Reads become scatter-gather operations for any query spanning multiple seconds. The absence of compression (every raw value is stored as a separate key-value pair) multiplies storage costs 10x versus columnar TSDBs. Retention policies require scanning and deleting millions of keys manually, which risks inconsistent results in eventually consistent distributed stores. When you need last-hour aggregation over 10,000 sensors, key-value stores degrade into O(n) scans across all keys. They serve low-latency lookups but collapse under time-windowed range queries. Use key-value only for hot caching of recent time-series aggregates, never as the primary store.
Appending Entries in MongoDB: The Hidden Cost of $push on Embedded Arrays
MongoDB’s document model encourages embedding time-series data points in a single document per sensor (e.g., { _id: "sensor_1", readings: [{ts, val}] }). Appending with $push grows the BSON document. MongoDB has a 16 MB document limit, and each $push triggers a document move on disk when padding runs out, fragmenting storage and increasing write amplification. A sensor writing 1 point/second adds 86,400 embedded documents per day; at roughly 30 to 40 bytes per {ts, val} entry that is about 3 MB per day, so the document passes 1 MB within hours and would hit the 16 MB limit in under a week, with ingestion slowing 10x along the way due to constant relocation. Querying recent points requires fetching the entire bloated document. Instead, use MongoDB’s time-series collections (introduced in 5.0) which bucket data automatically into fixed-size BSON documents and apply columnar compression. For existing setups, cap embedded arrays at 1,024 entries and roll over to new documents daily. Avoid $push on unbounded arrays for high-frequency writes.
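The capped-array rollover can be modelled without a database. The sketch below uses plain dicts in place of MongoDB documents and rolls over on entry count only (the daily rollover is left out); the function and field names are hypothetical, and no driver calls are shown.

```python
MAX_ENTRIES = 1024  # cap per bucket document, as suggested above


def append_reading(buckets, sensor_id, ts, value):
    """Append to the sensor's current bucket, opening a new one when it is full."""
    docs = buckets.setdefault(sensor_id, [])
    if not docs or len(docs[-1]["readings"]) >= MAX_ENTRIES:
        docs.append({"sensor": sensor_id, "start": ts, "readings": []})
    docs[-1]["readings"].append({"ts": ts, "val": value})


buckets = {}
for second in range(3000):
    append_reading(buckets, "sensor_1", second, 20.0)

sizes = [len(d["readings"]) for d in buckets["sensor_1"]]
print(sizes)  # [1024, 1024, 952]
```

Every document stays small and bounded, and a query for recent data only needs the last bucket or two rather than one ever-growing document.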
$push on unbounded arrays causes document relocation; cap embedded arrays or switch to time-series collections for high-frequency data.
Cardinality Explosion: The 3am Outage That Took Down Monitoring
- Always set a hard cardinality limit in TSDB configuration before ingesting production data.
- Monitor cardinality as a first-class metric — track distinct series count per measurement over time.
- Design label schemas to keep distinct values under 10 per dimension. Pod IDs belong in logs, not metrics.
- Run top and free -m on the TSDB node. Run influx_inspect report-shards (InfluxDB) or promtool tsdb analyze (Prometheus) to identify large shards. Merge or downsample old shards.
- Use SHOW SERIES CARDINALITY or SHOW TAG KEYS to identify high-cardinality tags. Remove or relabel immediately. Increase shard retention duration to reduce index rebuild frequency.
- Check with influx_inspect dumptsm (InfluxDB) or tsdb retention (TimescaleDB). Ensure compression is enabled for each column. For Prometheus, verify --storage.tsdb.retention.time and --storage.tsdb.max-block-duration are set correctly.
- Set --storage.tsdb.min-block-duration=2h to reduce memory for in-memory blocks. For InfluxDB, reduce max-series-per-database in config. Add memory request/limit to pod spec.
influx -database 'mydb' -execute 'SHOW SERIES CARDINALITY'
curl http://localhost:9090/api/v1/status/tsdb | jq '.data.labelValueCountByLabelName'
Common mistakes to avoid
- Using high-cardinality labels like 'user_id' or 'request_id'
- Not setting retention policies before production
For InfluxDB: CREATE RETENTION POLICY "30d" ON "mydb" DURATION 30d REPLICATION 1. For Prometheus: set --storage.tsdb.retention.time=30d.
- Writing data points out of order (backfilling with random timestamps)
… timestamp_precision set.
- Ignoring compression impact on query speed
- Assuming TSDBs handle write spikes gracefully
… write-timeout and max-values-per-tag to reject excessive data.
Interview Questions on This Topic
Explain how time series data differs from transactional data and how a TSDB exploits those differences.