Senior 14 min · March 06, 2026

Time Series DB — Cardinality Explosion Crashed Monitoring

Prometheus 30s timeouts, InfluxDB OOM every 4h due to 1.2M series explosion.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

Follow
Production
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Time series databases store timestamped measurements, optimized for high write throughput and time-range queries
  • Storage engine decision: LSM-Tree for write-heavy (InfluxDB), B-Tree for mixed workloads (TimescaleDB), or custom append-only (Prometheus)
  • Compression: delta-of-delta timestamps, XOR floating point, bit-packing — cuts storage 90%+
  • Cardinality (number of unique metric series) is the #1 performance killer — keep under 100k per shard
  • Retention policies define data lifecycle; downsampling preserves long-term trends while shedding raw data
  • Production mistake: treating TSDB like a relational DB leads to slow queries and disk bloat within weeks
✦ Definition~90s read
What is Time Series Databases?

A time series database (TSDB) is a purpose-built storage engine optimized for ingesting, querying, and managing timestamped data—metrics, sensor readings, application logs, financial ticks—that arrives continuously and monotonically. Unlike general-purpose databases (PostgreSQL, MySQL) or key-value stores, TSDBs solve the specific problem of write-heavy, append-only workloads where new data always has a later timestamp than old data.

Imagine your fitness tracker records your heart rate every second — not just what it is right now, but a permanent, ordered log of every reading since you put it on.

They achieve this through specialized storage layouts (e.g., LSM trees with time-partitioned segments), aggressive compression (delta-of-delta encoding, XOR for floats, run-length encoding), and built-in retention policies that automatically age out or downsample historical data. Without these optimizations, a standard database would choke on the write throughput and storage costs of, say, 10 million metrics per second from a Kubernetes cluster or IoT fleet.

TSDBs trade general-purpose query flexibility for extreme write performance and time-range efficiency. They shine when your primary access pattern is "give me all values for metric X between timestamps A and B"—think Grafana dashboards, anomaly detection pipelines, or Prometheus alert rules.

They are a terrible fit for transactional workloads, complex joins across disparate entities, or data that requires frequent updates/deletes. The ecosystem includes mature options like InfluxDB (columnar, high-cardinality-friendly), TimescaleDB (PostgreSQL extension, hybrid relational/time-series), and Prometheus (pull-based, embedded TSDB for monitoring).

Each makes different tradeoffs: InfluxDB prioritizes write throughput and SQL-like queries, TimescaleDB leverages full SQL and joins, Prometheus optimizes for reliability and simplicity in cloud-native monitoring.

The cardinality explosion problem—where unique combinations of metric name plus label/tag values grow unbounded—is the single most common reason TSDBs fail in production. A Prometheus instance with 10,000 time series can handle millions of samples per second; one with 10 million time series (e.g., from a misconfigured label like request_id) will OOM or corrupt its storage within hours.

This happens because TSDBs pre-allocate index structures per unique series, and each new series adds memory pressure for the inverted index, write-ahead log, and compaction overhead. Understanding cardinality is not optional—it's the difference between a monitoring stack that survives a Black Friday traffic spike and one that takes down your entire observability pipeline.

Plain-English First

Imagine your fitness tracker records your heart rate every second — not just what it is right now, but a permanent, ordered log of every reading since you put it on. A time series database is like that log, but engineered specifically so you can ask questions like 'what was my average heart rate between 2pm and 3pm last Tuesday?' in milliseconds, even if there are billions of readings. It's a regular database's specialised cousin that obsesses over timestamps and makes time-based queries absurdly fast.

Every production system that matters emits a continuous stream of measurements — CPU usage spikes at 3am, a payment gateway latency ticks up by 12ms, a wind turbine's RPM drops before a bearing fails. These aren't events you look up by ID; they're facts anchored in time, and the questions you need to ask are always temporal: trends, anomalies, rolling averages, rate-of-change. General-purpose relational databases weren't built for this workload, and trying to make them do it at scale is one of the fastest ways to watch a perfectly good PostgreSQL instance beg for mercy.

Time series databases (TSDBs) exist because the write pattern, query pattern, and retention lifecycle of timestamped data are fundamentally different from transactional data. You almost never update a past measurement — the past is immutable. Writes are high-throughput and append-only. Queries almost always involve a time range plus aggregation. And data has a natural decay in value: last second's CPU reading matters more than last year's. A TSDB exploits all of these properties at every layer of its architecture — from how bytes land on disk to how queries are planned.

By the end of this article you'll understand exactly how TSDBs store and compress data internally, why cardinality is the silent killer of TSDB performance, how to make a principled choice between InfluxDB, TimescaleDB, and Prometheus for a given system design, and what production mistakes cost teams weeks of firefighting. This is the article you wish existed the first time a monitoring stack fell over at 2am.

What Makes a Database 'Time Series'?

A time series database is any storage system optimized for append-only writes of timestamped measurements. Unlike a traditional row store, a TSDB treats time as the primary sort key — data is always written in time order and queried by time range. This changes everything.

Relational databases organize data by entity (user, order, product). Time series databases organize by time first. That means writes go to the latest time chunk (hot shard), and compactions merge older chunks into read-optimised blocks. You never update a past value — the only operations are insert (new point) and delete (retention).

The query pattern is also different. You ask for 'average CPU over the last hour grouped by 5-minute windows' — not 'CPU value for server X at timestamp Y'. That's a range scan with aggregation, which TSDBs accelerate with time-based partitioning and pre-aggregated rollups.

io/thecodeforge/tsdb/TsdbWriteExample.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
package io.thecodeforge.tsdb;

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Minimal TSDB write model — append-only, time-ordered.
 * Real TSDBs persist to WAL then to an LSM tree.
 */
public class TsdbWriteExample {

    // Timeseries data: time -> value
    private final NavigableMap<Long, Double> series;

    public TsdbWriteExample() {
        this.series = new ConcurrentSkipListMap<>();
    }

    public void ingest(String metric, double value, Instant timestamp) {
        // Append-only: no update of existing keys
        series.put(timestamp.toEpochMilli(), value);
    }

    public double queryAverage(long fromEpochMs, long toEpochMs) {
        var sub = series.subMap(fromEpochMs, true, toEpochMs, true).values();
        return sub.stream().mapToDouble(d -> d).average().orElse(0.0);
    }

    public static void main(String[] args) {
        TsdbWriteExample db = new TsdbWriteExample();
        db.ingest("cpu.user", 0.45, Instant.now());
        db.ingest("cpu.user", 0.47, Instant.now().plusSeconds(1));
        double avg = db.queryAverage(Instant.now().minusSeconds(10).toEpochMilli(),
                                      Instant.now().toEpochMilli());
        System.out.println("Average CPU (last 10s): " + avg);
    }
}
Time as Primary Key
  • Data is always ordered by time — no random inserts.
  • Past data is immutable: updates are inserts with a later timestamp.
  • Queries are range scans with aggregation; point lookups are rare.
  • Hot shard vs cold shard: recent data in memory, older data compressed on disk.
Production Insight
Writing out of time order (backfill) is expensive — TSDBs penalize it heavily.
If you must backfill, batch into a single large insert during low load.
Rule: always send data with monotonically increasing timestamps.
Key Takeaway
Time series is an append-only log sorted by time.
Pick your storage engine based on write profile and query needs.
Backfilling with unordered timestamps kills write performance.
Choose Your Storage Paradigm
IfWrite throughput > 1M points/sec, high data ingestion rate
UseUse LSM-based TSDB (InfluxDB, VictoriaMetrics) — optimized for sequential writes and compaction
IfNeed SQL, joins with relational data, moderate write rate
UseUse TimescaleDB (PostgreSQL extension) — B-tree on time, supports full SQL
IfPure monitoring, scrape-based, no SQL needed
UseUse Prometheus — pull model, dimensional labels, integrated alerting
TSDB Cardinality Explosion & Crash Flow THECODEFORGE.IO TSDB Cardinality Explosion & Crash Flow From ingestion to crash: how high cardinality kills TSDBs Time Series Ingestion Metrics with unique tag combinations In-Memory Index Maps tag sets to series IDs Cardinality Explosion Unique series count grows unbounded Memory Exhaustion Index exceeds RAM, OOM killer triggers Write & Query Failure Database crashes or becomes unresponsive ⚠ High cardinality from dynamic tags (e.g., user IDs, timestamps) Limit unique series: use fixed tags, enforce tag schema, or shard THECODEFORGE.IO
thecodeforge.io
TSDB Cardinality Explosion & Crash Flow
Time Series Databases

Storage Engine Internals: How TSDBs Organise Data on Disk

The storage engine determines write throughput, compression ratio, and query speed. Two dominant approaches exist: Log-Structured Merge (LSM) trees and custom B-Tree adaptations.

LSM Trees (InfluxDB, VictoriaMetrics, TimescaleDB hybrid)

Writes first land in a Write-Ahead Log (WAL), then are buffered in memory in a sorted data structure (memtable). When the memtable reaches a threshold (e.g., 64MB), it's flushed to disk as an immutable SSTable file. Background compaction merges multiple SSTables into larger ones, discarding tombstones and dead versions. This design gives excellent write throughput because every write is sequential. Reads require checking multiple layers (memory + several SSTables per level). Bloom filters accelerate point lookups, but range scans still need to merge overlapping SSTables.

B-Tree with Time-Ordered Primary Key (TimescaleDB, hybrid)

TimescaleDB uses PostgreSQL's B-Tree where the primary key is (time, optional hash) in a chunked table structure. Writes are still sequential-ish because time is monotonically increasing. It supports full SQL, joins, and relational flexibility. Write throughput is lower than LSM for pure time series, but query performance on small ranges is better due to efficient index lookups. B-Trees also have no write amplification from compaction — but they do have page splits and WAL overhead.

Custom Append-Only (Prometheus, M3DB)

Prometheus stores each time series in a block per 2-hour interval. Data is stored as delta-of-delta compressed chunks. The blocks are read-only once completed. This gives extremely fast read access to recent data but requires careful planning for long-term retention (usually offloaded to Thanos or Cortex).

io/thecodeforge/tsdb/LsmCompactionExample.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
package io.thecodeforge.tsdb;

import java.util.*;

/**
 * Simplified LSM compaction logic — merges sorted SSTables.
 */
public class LsmCompactionExample {

    private final List<List<TimeSeriesPoint>> levels;

    public LsmCompactionExample() {
        this.levels = new ArrayList<>();
    }

    public void addLevel(List<TimeSeriesPoint> points) {
        levels.add(new ArrayList<>(points));
    }

    public List<TimeSeriesPoint> compact() {
        // Merge all levels into one sorted list (like multi-way merge)
        PriorityQueue<TimeSeriesPoint> pq = new PriorityQueue<>(
            (a, b) -> Long.compare(a.timestampMs, b.timestampMs));

        for (List<TimeSeriesPoint> level : levels) {
            pq.addAll(level);
        }

        List<TimeSeriesPoint> result = new ArrayList<>();
        while (!pq.isEmpty()) {
            result.add(pq.poll());
        }
        return result;
    }

    record TimeSeriesPoint(long timestampMs, double value) {}
}
Compaction Amplification
LSM compaction can produce 10–50x write amplification. If your disk is slow (HDD, busy EBS), compaction will throttle ingestion. Always use NVMe SSD for LSM-based TSDBs. Monitor compaction lag via the 'compaction_queue' metric.
Production Insight
B-Tree TSDBs (TimescaleDB) degrade gracefully under memory pressure — LSM TSDBs OOM if compaction falls behind.
If you see 'flush stalls' or 'compaction queue > 100', reduce write batch size or increase compaction threads.
Rule: for high-throughput writes, prefer LSM; for complex queries with joins, prefer B-Tree.
Key Takeaway
LSM trees give high write throughput at the cost of read overhead and compaction.
B-Trees give SQL flexibility and predictable read performance.
Know your access pattern before choosing the storage engine.
Storage Engine Selection
IfWrite throughput primary concern, queries are simple aggregates
UseChoose LSM-based engine (InfluxDB, VictoriaMetrics)
IfNeed SQL, JOINs, and complex time range queries with relational data
UseChoose B-Tree-based engine (TimescaleDB)
IfPrometheus ecosystem already in place, need long term storage
UseUse Prometheus + Thanos (block-based, managed compaction)

Compression Techniques: How TSDBs Fit Petabytes into Gigabytes

Time series data is highly compressible because adjacent points are correlated — CPU utilization doesn't jump from 5% to 95% in a millisecond. TSDBs exploit this with specialized compression schemes.

Timestamp Compression: Delta-of-Delta

Instead of storing full 64-bit timestamps, TSDBs store the difference from the previous timestamp. Most timestamps arrive at a fixed interval (e.g., every 10s). The delta-of-delta is often zero or small. Prometheus uses this: for a batch of 122 points, it stores the first timestamp fully, then the first delta (usually the interval), then delta-of-delta values. When the interval is constant, delta-of-delta is zero and can be encoded in a single bit. This compresses 64-bit timestamps to an average of ~2 bits.

Value Compression: XOR (Gorilla)

Floating point values change slowly. Gorilla compression (originally from Facebook's Gorilla TSDB, now used in Prometheus) XORs consecutive values. If most bits are the same, the result has many leading zeros, which can be encoded with a variable-length scheme. For 64-bit doubles, this achieves ~1.4 bits per point on average for normal metrics.

Columnar Compression

InfluxDB stores each field in separate columns (similar to Parquet). Within a column, values are sorted by time first, so run-length encoding (RLE) and delta compression work well. TimescaleDB's compression uses native PostgreSQL compression types (ZSTD, LZ4) on chunks, achieving 5–20x reduction.

Real-world results: a 7-day retention of 10k series at 10s resolution (~60 million points) occupies ~12GB uncompressed, ~400MB compressed.

io/thecodeforge/tsdb/gorilla_compression.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
"""
Simplified Gorilla XOR compression for floating point values.
Based on the paper: Gorilla: A Fast, Scalable, In-Memory Time Series Database
"""

import struct

def xor_compress(values):
    """Compress list of float64 values using XOR of consecutive values."""
    if not values:
        return b""
    encoded = []
    prev = struct.pack('>d', values[0])
    encoded.append(prev)
    for v in values[1:]:
        curr = struct.pack('>d', v)
        xor_result = int.from_bytes(prev, 'big') ^ int.from_bytes(curr, 'big')
        # Store leading zeros count and the significant bits
        # (simplified - real Gorilla uses variable-length encoding)
        encoded.append(xor_result.to_bytes(8, 'big'))
        prev = curr
    return b"".join(encoded)

def xor_decompress(encoded):
    """Decompress XOR-compressed float64 values."""
    if not encoded:
        return []
    values = []
    prev_bytes = encoded[0:8]
    values.append(struct.unpack('>d', prev_bytes)[0])
    prev_int = int.from_bytes(prev_bytes, 'big')
    for i in range(8, len(encoded), 8):
        xor_int = int.from_bytes(encoded[i:i+8], 'big')
        curr_int = prev_int ^ xor_int
        values.append(struct.unpack('>d', curr_int.to_bytes(8, 'big'))[0])
        prev_int = curr_int
    return values

# Example usage
cpu_values = [0.5, 0.52, 0.51, 0.53, 0.54]
compressed = xor_compress(cpu_values)
# In real Gorilla, this would be ~40 bits for 5 values vs 320 bits raw
print(f"Compressed size: {len(compressed)} bytes (raw: {5*8} bytes)")
Compression Ratio Realities
Delta-of-delta + XOR gives ~10–15x compression on typical server metrics. For high-entropy data (random values), compression drops to 2–3x. Always test with your actual data patterns before capacity planning.
Production Insight
If your metrics have high entropy (e.g., unique random IDs as values), compression fails. Store those as labels instead.
Gorilla compression works best when values change slowly. Spiky metrics (error rates) still compress well because consecutive values are often 0 or 1.
Rule: benchmark compression on your real dataset — don't rely on vendor claims.
Key Takeaway
Timestamp compression uses delta-of-delta (average ~2 bits/timestamp).
Value compression uses XOR of floats (average ~1.4 bits/value).
Test compression on your data — high entropy kills compression ratio.

Retention Policies and Downsampling — Managing Storage Over Time

Time series data has diminishing value over time. The exact CPU reading from 3 months ago isn't useful — but the hourly average is. TSDBs manage this through retention policies and downsampling.

Retention Policies (RP)

InfluxDB's RP defines how long data is kept and how many copies exist. For example, 'autogen' RP with DURATION 7d, REPLICATION 1. Once data exceeds the duration, it's automatically deleted. TimescaleDB uses chunk-based retention: you can drop chunks older than a threshold using drop_chunks() or automate with a background job. Prometheus stores data in blocks per retention period; blocks older than --storage.tsdb.retention.time are deleted.

Downsampling / Continuous Queries

To keep long-term historical data without blowing your storage budget, you create rollups — e.g., store raw data at 10s resolution for 7 days, then 1-minute averages for 30 days, then 1-hour averages for 1 year. InfluxDB calls this Continuous Queries (CQ). TimescaleDB uses time_bucket() and continuous aggregates. Prometheus recording rules can precompute rollups.

Shard Groups and Time-Based Partitioning

Data is written to shard groups that correspond to time ranges (e.g., 7-day shards). Each shard is stored as a set of files (TSM files in InfluxDB, chunks in Prometheus). When a shard is closed (no longer receiving writes), it can be fully compressed and moved to cold storage or even archived to object storage (e.g., S3) using tools like Thanos or InfluxDB Cloud's tiered storage.

io/thecodeforge/tsdb/continuous_aggregate.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
-- TimescaleDB: Create a continuous aggregate downsampling 1-minute data to 1-hour averages
CREATE MATERIALIZED VIEW cpu_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    host,
    AVG(usage) AS avg_usage,
    MAX(usage) AS max_usage
FROM cpu_usage
GROUP BY bucket, host;

-- InfluxDB Continuous Query equivalent
CREATE CONTINUOUS QUERY "cpu_hourly" ON "mydb"
BEGIN
  SELECT mean(value) AS mean_value
  INTO cpu_1h
  FROM cpu_raw
  GROUP BY time(1h), *
END;

-- Prometheus recording rule (rules.yml)
# groups:
# - name: downsampling.rules
#   interval: 1h
#   rules:
#   - record: node_cpu_hourly:avg
#     expr: avg_over_time(node_cpu_seconds_total[1h])
Data Temperature Pyramid
  • Hot tier: raw data in memory (last 6h) — fastest queries, expensive storage.
  • Warm tier: compressed data on SSD (last 7 days) — good performance, moderate cost.
  • Cold tier: downsampled rollups on HDD/object storage (months/years) — cheap, slower queries.
Production Insight
Downsampling too aggressively loses precision for anomaly detection. Keep raw data at least 24h for incident forensics.
Setting retention too short leads to 'data holes' when debugging historic issues. Always ask: 'What data would I need for this time next year?'
Rule: define retention tiers before production — raw 7d, 1m avg 60d, 1h avg 365d.
Key Takeaway
Use retention policies to automate data deletion.
Downsample raw data into rollups for long-term storage.
Plan tiered storage: hot (SSD, raw), warm (SSD, compressed), cold (object, rollups).
Choose Downsampling Strategy
IfNeed sub-10 second precision for recent data
UseKeep raw at 1s resolution for 48h, then downsample to 1m
IfCompliance requires 1-year retention but low query frequency on old data
UseDownsample to 1h averages after 30 days, store in cold object storage
IfOnly need aggregate trends (no point-in-time accuracy beyond 1 week)
UseDownsample immediately — use 5m averages from ingestion

Cardinality: The Silent Performance Killer

Cardinality is the number of unique time series in your TSDB. Each combination of metric name + label key/value pair creates a distinct series. For Prometheus: http_requests_total{method="GET", endpoint="/api", status="200"} is one series. Add a label user_uuid with 10,000 different values, and you get 10,000 series — for just this one metric.

Why cardinality matters

  • Index memory: TSDBs keep an in-memory index of all series. High cardinality consumes RAM and slows down writes because every new label combination must be added to the index (usually requires a lock or CAS operation).
  • Query performance: Queries that scan many series (e.g., 'sum of all requests') must iterate over thousands of series. Even with aggregation, the planner needs to evaluate each series.
  • Compaction overhead: More series means more SSTable files (in LSM) or more blocks (Prometheus). Compaction merges many small files, increasing CPU and I/O.

Safe cardinality thresholds

  • Prometheus: keep total series under 500k per server (for 8GB RAM). Use --storage.tsdb.max-series to enforce a hard limit.
  • InfluxDB: keep series per measurement under 100k. Use max-series-per-database config.
  • TimescaleDB: less sensitive because indexing is disk-based (B-Tree), but high cardinality still hurts query performance. Keep distinct tag values under 1,000 per dimension.

How cardinality explosions happen

Common culprits: Including request IDs, timestamps, or random values as labels. For example, adding a 'container_id' label in a Kubernetes cluster with 500 pods creates 500 series per metric. Add 'pod_name' with unique names and it multiplies. Multiply by every redeployment (new container IDs), and you get unbounded growth.

io/thecodeforge/tsdb/cardinality_check.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- InfluxDB: Check cardinality per measurement
SHOW SERIES CARDINALITY ON "mydb";

-- InfluxDB: Find the measurement with highest cardinality
SELECT *
FROM (
  SELECT measurement, count(*) as series_count
  FROM (SHOW SERIES)
  GROUP BY measurement
) ORDER BY series_count DESC;

-- TimescaleDB: Count distinct series (using hyperloglog for efficiency)
SELECT hyperloglog_distinct(tags::text) AS estimated_cardinality
FROM conditions
WHERE time > now() - interval '1 hour';

-- Prometheus: Via API
curl http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse[0:5]'
Cardinality Explosion Prevention
Never use high-cardinality values (user IDs, request IDs, timestamps) as labels. Use them as the value of the metric if needed, or log them to a separate indexing system (Elasticsearch). Set a hard limit on cardinality in your TSDB config before going to production.
Production Insight
Cardinality explosions often happen during load tests or when onboarding new services without label review.
The symptom is subtle: write throughput drops slowly over hours, not a sudden crash. By the time alert fires, the index is already bloated.
Rule: set cardinality alerts before production — trigger at 80% of the hard limit and page the team.
Key Takeaway
Cardinality is the product of metric * label values.
Keep per-measurement cardinality under 100k series.
Design labels like database indexes — fewer, more selective.
Managing Cardinality Design
IfMetric value depends on a unique identifier (e.g., user ID, session ID)
UseDo not use as label. Either make it a metric value or send to a separate logging system.
IfLabel has high churn (e.g., container IDs change every deploy)
UseUse a rolling hash or reduce to a few categories (e.g., 'version' not 'commit_sha').
IfYou need to filter by user ID but cardinality is high
UseStore user ID in the metric value or use a separate Prometheus per tenant.

Choosing the Right TSDB: InfluxDB, TimescaleDB, or Prometheus

You don't start with 'which TSDB is best?' You start with 'what's my workload?' The answer hinges on three things: write throughput, query complexity, and ecosystem.

InfluxDB - Best for: High-volume IoT metrics, application monitoring, sensor data. - Engine: LSM-based (TSM over RocksDB). Strong compression, high write throughput (millions of points/sec on a single node). - Query: InfluxQL or Flux (deprecating Flux in 3.x). No SQL, limited joins. - Retention: Native RP and CQ. Downsampling built-in. - Scalability: Single-node OSS up to ~1M series. Enterprise/Cloud for clustering. No built-in federation.

TimescaleDB - Best for: Hybrid workloads — time series that occasionally need SQL joins with relational data (e.g., financial trades, sensor + metadata). - Engine: PostgreSQL with automatic partitioning (hypertable). Full SQL support, JSON, JOINs, window functions. - Query: PostgreSQL SQL. Rich ecosystem of tools. - Retention: Via chunk dropping or timescaledb_expire chunk job. - Scalability: Single-node with streaming replication for HA. Distributed version (timescaledb scale) since 2.13.

Prometheus - Best for: Cloud-native monitoring, Kubernetes, alerting tied to metrics. - Engine: Custom block-based storage. Pull model (or push via Pushgateway). - Query: PromQL — powerful, but different from SQL. Excellent for alerting rules. - Retention: Block-based with configurable retention. Long term via Thanos/Cortex. - Scalability: Single-server limited to ~1M series. Use Thanos for aggregation across clusters.

When to avoid each - Avoid InfluxDB if you need SQL joins or complex relational queries. - Avoid TimescaleDB if your write throughput exceeds 500k points/sec on a single node (use LSM alternatives). - Avoid Prometheus if you need to push metrics from many sources without a central coordinator (use Pushgateway carefully).

Ecosystem Fit
If your org already uses Grafana and Prometheus for alerting, adding another TSDB means maintaining two monitoring stacks. Consider whether the new workload fits Prometheus + Thanos before introducing a second system.
Production Insight
Switching TSDBs mid-project is painful — migration requires re-architecting queries and retention policies.
Do a proof-of-concept with real data volume and query patterns before committing.
Rule: choose one primary TSDB per org and standardize; exceptions only for workloads that fundamentally differ in access pattern.
Key Takeaway
Match TSDB to workload: write throughput, query complexity, and ecosystem.
Prometheus for monitoring, TimescaleDB for SQL, InfluxDB for IoT.
Start with one primary TSDB; avoid multi-TSDB complexity unless necessary.
TSDB Selection Matrix
IfPrimary workload: infrastructure monitoring, Kubernetes, no SQL needed
UseChoose Prometheus (+ Thanos for long-term)
IfNeed SQL, time series + relational joins, moderate throughput
UseChoose TimescaleDB
IfExtremely high write throughput (IoT, finance tick data)
UseChoose InfluxDB or VictoriaMetrics
IfAlready invested in PostgreSQL ecosystem, want minimal new tech
UseChoose TimescaleDB — standard SQL skills transfer

Scaling Writes: Why Your INSERT Becomes a Firehose

Time series databases live or die on write throughput. A single IoT deployment can hammer you with 10 million data points per second. Relational databases choke because they're built for transactional integrity — row-level locks, B-tree splits, MVCC overhead. TSDBs flip the script: they append, they batch, they sacrifice transactional guarantees for write speed.

The trick is the WAL (Write-Ahead Log). Every point hits the WAL first — sequential, no seeks. Then the TSDB batches those writes into memory-mapped segments before flushing to columnar storage on disk. This batching is critical: writing one row at a time kills performance. You batch 5,000–10,000 points per write call. Any less and you're fighting IOPS. Any more and you risk latency spikes.

If you're hitting 100k+ writes per second, understand your TSDB's sharding strategy. InfluxDB uses shards by time range. TimescaleDB uses hypertable chunks. Prometheus shards by time series. Miss this and you'll hit a write bottleneck at 500k points per second. I've seen it kill a production monitoring pipeline at 2 AM on a Friday.

WriteBatchSimulation.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// io.thecodeforge — system-design tutorial

import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS
from datetime import datetime
import random

client = influxdb_client.InfluxDBClient(
    url="http://localhost:8086",
    token="your-influxdb-token",
    org="production-monitoring"
)

write_api = client.write_api(write_options=SYNCHRONOUS)

def send_batch():
    batch = []
    for _ in range(10000):  # batch of 10k points
        point = influxdb_client.Point("sensor_readings") \
            .tag("device_id", f"edge-{random.randint(1, 500)}") \
            .field("temperature", round(random.uniform(20.0, 100.0), 2)) \
            .field("humidity", round(random.uniform(0.0, 100.0), 2)) \
            .time(datetime.utcnow())
        batch.append(point)
    
    write_api.write(bucket="iot-sensors", record=batch)
    print(f"Wrote {len(batch)} points successfully")

send_batch()
client.close()
Output
Wrote 10000 points successfully
Production Trap:
Don't write points one at a time in a loop. You'll max out at ~500 writes per second on a single node. Batch to 10k per call and you hit 50k+ writes per second. Always benchmark your batch size against your TSDB's ingest endpoint.
Key Takeaway
Always batch writes in groups of 5,000–10,000 points. One row per HTTP call is a scalability suicide.

Querying the Past: How TSDBs Make Time Travel Fast

Reading time series data is the mirror problem of writing — you need to scan huge ranges of sequential data fast. Pulling 24 hours of sensor readings at one-second resolution means 86,400 points. Do that across 10,000 devices and you're looking at 864 million rows. No sane developer runs SELECT * on that. TSDBs win here with two tricks: time-based partitioning and chunked storage.

Partitioning by time is the first layer. Every TSDB splits data into time windows — one hour, one day, one week depending on retention. The query planner prunes partitions that don't match. If you query last hour, it doesn't scan last month's data. This isn't magic. It's just a bounded scan range.

The second trick: columnar storage with pre-aggregated blocks. Instead of storing each point naively, TSDBs store timestamps, values, and tags in separate column files. They pre-compute min, max, sum, count per block (typically 1,000–10,000 points per block). When you query an aggregate like AVG(), the engine reads the block metadata first. If the entire block fits your time range, it uses the pre-computed sum and count. No need to decompress and iterate individual points.

This means a query like "average CPU usage over the last hour" can run in under 100ms on 1 billion points. But it only works if your query predicates include a time range. Forget the time filter and you're scanning everything. That's how you get 5-minute queries. I've fixed those production incidents more times than I can count.

QueryOptimization.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// io.thecodeforge — system-design tutorial

from influxdb_client import InfluxDBClient
from influxdb_client.client.query_api import QueryApi

query_api = client.query_api()

# BAD: no time range — scans everything
bad_query = '''
    SELECT mean("cpu_usage")
    FROM "server_metrics"
    GROUP BY time(1h)
'''

# GOOD: bounded time range — prunes partitions
good_query = '''
    SELECT mean("cpu_usage")
    FROM "server_metrics"
    WHERE time >= now() - 1h
    GROUP BY time(5m)
'''

result = query_api.query(org="production-monitoring", query=good_query)
for table in result:
    for record in table.records:
        print(f"Time: {record.get_time()}, Avg CPU: {record.get_value()}")

client.close()
Output
Time: 2024-12-10 14:00:00, Avg CPU: 72.34
Time: 2024-12-10 14:05:00, Avg CPU: 68.91
Time: 2024-12-10 14:10:00, Avg CPU: 75.22
Senior Shortcut:
Always include a time range in your WHERE clause. TSDBs use time as the primary partition key. Without it, you force a full scan of all partitions. Most TSDBs allow you to inspect query plans — use them to verify partition pruning.
Key Takeaway
Always filter by time. No time range = full scan = painful wait.

ClickHouse: The Column-Oriented Wrecking Ball for Real-Time Analytics

Stop treating time-series databases as generic storage buckets. ClickHouse doesn't store rows; it stores columns. That swap changes everything for aggregation-heavy workloads. Why? Because most time-series queries ask for the average, max, or count of a metric across millions of timestamps — not the raw event data. Columnar storage means you only read the columns you need, not the entire row. Disk I/O plummets. Queries that would cripple InfluxDB or Prometheus finish in milliseconds.

ClickHouse uses a MergeTree engine with primary keys that double as sort orders. Your timestamp sits in that key, and every insert is sorted on write. The engine partitions data by time intervals automatically. Combine that with its vectorized query execution — processing blocks of data rather than single rows — and you get hardware-level parallelism. The trade-off is real-time insert latency: batch your writes every few seconds, not per event. Use the asynchronous insert mode or buffer tables in production. Single-row inserts will kill throughput. This is not a transactional database. This is a chainsaw for analytical workloads.

The production pattern is simple: buffer metrics in memory, flush to ClickHouse every 5-30 seconds, and query with SQL that filters time ranges early. Materialized views precompute rollups for dashboard queries. Forget about joins. Denormalize everything into wide tables.

clickhouse_bulk_insert.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// io.thecodeforge — system-design tutorial

import clickhouse_driver
import time

client = clickhouse_driver.Client(host='localhost')

# Buffer rows in memory, flush every 10 seconds
buffer = []
flush_interval = 10
last_flush = time.time()

while True:
    row = (time.time(), 'cpu_usage', 45.2, 'server-01')
    buffer.append(row)
    
    if time.time() - last_flush >= flush_interval:
        client.execute(
            'INSERT INTO metrics VALUES',
            buffer
        )
        buffer.clear()
        last_flush = time.time()
    
    time.sleep(0.1)
Output
No output — inserts into ClickHouse silently.
Use system.query_log to verify batch sizes.
Rookie Mistake: Single-Row Inserts
Inserting one row per HTTP request will overwhelm ClickHouse's merge scheduler. Batch at least 1000 rows or buffer for 5+ seconds. Always use async inserts in production.
Key Takeaway
ClickHouse kills at analytical aggregation queries because columnar storage reads only what you ask for. Batch writes, never single-row inserts.

Apache Cassandra: When Your Write Throughput Must Survive a Nuclear Blast

Cassandra is not a time-series database. But people use it for time-series data because it laughs at write scaling. Why? It's a peer-to-peer ring with no single point of failure. Every node accepts writes. No leader. No election. You add nodes, write capacity doubles linearly. For metrics ingestion at planet scale — IoT fleets, CDN logs, stock ticks — Cassandra is the hammer that doesn't break.

The catch: Cassandra is terrible at range scans by default. Your time-series query 'give me all metrics for the last hour' becomes a full cluster scan if you don't design the partition key correctly. The golden rule: partition by a high-cardinality key (like device ID or metric name) and cluster by timestamp. Each partition stays small — under 100MB — so reads are fast. Use the TimeWindowCompactionStrategy (TWCS) to avoid compaction storms. TWCS compacts SSTables only within a fixed time window, preventing write amplification from killing your disks.

Production reality: Cassandra handles 100K+ writes per second per node. But reads are expensive if you miss the partition key. Your query must always specify the full partition key. No exceptions. Design your data model around the query pattern first, then the ingestion. Compaction is not maintenance — it's a design constraint. Choose TWCS, set window size to match your retention, and never look back.

cassandra_insert_partition.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — system-design tutorial

from cassandra.cluster import Cluster
import uuid

cluster = Cluster(['node1', 'node2', 'node3'])
session = cluster.connect('tsdb')

# Partition by device_id + metric_name, cluster by timestamp
insert_stmt = session.prepare(
    """
    INSERT INTO metrics_by_device (
        device_id, metric_name, timestamp, value
    ) VALUES (?, ?, ?, ?)
    """
)

for _ in range(10000):
    session.execute(
        insert_stmt,
        ('sensor-42', 'temperature', 1708392000.0, 23.4)
    )
Output
No output on success.
Check nodetool tablestats for write latency (should be < 5ms per row).
Senior Shortcut: Partition Key Design
Never use timestamp alone as the partition key. You'll create 'hot partitions' with millions of rows. Combine metric name + device ID + day. Keep partitions under 100MB.
Key Takeaway
Cassandra scales writes linearly with nodes, but you must design partitions around your exact query pattern. Partition by device ID, cluster by timestamp.

Amazon Timestream: The Managed Trap You Should Know Before You Sign

Amazon Timestream screams 'serverless time-series' and it delivers on setup time. You click a button, you have a database. No hardware provisioning. No cluster configuration. Why would anyone ever self-host again? Because the cost model will ruin your budget if you don't understand the pricing knobs. Timestream separates storage into 'memory store' (hot tier) and 'magnetic store' (cold tier). Writes go to memory. After a configurable time, data moves to magnetic storage automatically. That's seamless — until you realize the magnetic store charges per query, not just per GB.

The dirty secret: Timestream queries that scan magnetic data cost you on every byte read. A dashboard refresh that runs once per minute on 30 days of data will cost more in query charges than the storage itself. The fix is aggressive downsampling. Define scheduled queries that pre-aggregate data into hourly or daily tables stored only in memory. Accept the trade-off: you lose raw granularity past a week. Also, Timestream uses a proprietary query language that looks like SQL but isn't. No joins. No subqueries deeper than one level. Complex analytics require Lambda functions for post-processing.

Production path: use Timestream if your ops team is small and your query volume is low — under 100 queries per hour on cold data. For high-frequency dashboards, configure sampling at the SDK level to reduce write costs. And always set a retention policy to move old data to magnetic store after 48 hours max. Your AWS bill will thank you.

timestream_scheduled_query.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// io.thecodeforge — system-design tutorial

import boto3

client = boto3.client('timestream-query')

# Create a scheduled query for hourly aggregation
response = client.create_scheduled_query(
    Name='hourly_avg_cpu',
    QueryString="""
    SELECT
        bin(time, 1h) as hour,
        measure_name,
        AVG(measure_value::double) as avg_val
    FROM "metrics"."cpu"
    WHERE time > ago(24h)
    GROUP BY bin(time, 1h), measure_name
    """,
    ScheduleConfiguration={'ScheduleExpression': 'rate(1 hour)'},
    TargetConfiguration={
        'TimestreamConfiguration': {
            'DatabaseName': 'aggregates',
            'TableName': 'cpu_hourly'
        }
    }
)
Output
{
'Arn': 'arn:aws:timestream:us-east-1:123456789012:scheduled-query/hourly_avg_cpu',
'ResponseMetadata': {'HTTPStatusCode': 200}
}
Cost Explosion: Magnetic Store Queries
Key Takeaway
Timestream's managed simplicity hides magnetic query costs. Pre-aggregate aggressively and keep cold data queries rare. Never scan raw history in dashboards.

Why NoSQL Key-Value Stores Fail for Time-Series Workloads

Time-series databases demand range scans over time-ordered keys. Key-value stores like Redis or DynamoDB are optimized for single-key lookups and lack native ordering. To simulate time-series, developers prefix keys with timestamps (e.g., sensor_1700000000), but this creates hotspots on the latest partition and forces application-level sorting. Writes become scatter-gather operations for any query spanning multiple seconds. The absence of compression — every raw value stored as a separate key-value pair — multiplies storage costs 10x versus columnar TSDBs. Retention policies require scanning and deleting millions of keys manually, risking eventual consistency in distributed stores. When you need last-hour aggregation over 10,000 sensors, key-value stores degrade into O(n) scans across all keys. They serve low-latency lookups but collapse under time-windowed range queries. Use key-value only for hot caching of recent time-series aggregates, never as the primary store.

KeyValueGetLatest.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
// io.thecodeforge — system-design tutorial

import redis

r = redis.Redis()
# Keys: sensor:{unix_ts}
# Fetch last 60 seconds of sensor_001 — O(n) scan
now = int(time.time())
keys = r.keys(f"sensor_001:{now-60}:*")
values = [r.get(k) for k in keys]
# Sort timestamps manually — no native ordering
sorted_values = sorted(values, key=lambda x: x[0])
print(sorted_values)
Output
[b'23.4', b'23.5', b'23.6']
Production Trap:
Key-value stores with TTL expiration do not guarantee immediate deletion — expired keys consume memory until the next expiry scan, causing spurious cache misses.
Key Takeaway
Never use key-value stores for primary time-series storage; they lack range scans, compression, and efficient retention.

Appending Entries in MongoDB: The Hidden Cost of $push on Embedded Arrays

MongoDB’s document model encourages embedding time-series data points in a single document per sensor (e.g., { _id: "sensor_1", readings: [{ts, val}] }). Appending with $push grows the BSON document. MongoDB has a 16 MB document limit, and each $push triggers a document move on disk when padding runs out — fragmenting storage and increasing write amplification. A sensor writing 1 point/second hits 86,400 embedded documents per day; within two weeks, the document exceeds 1 MB and ingestion slows 10x due to constant relocation. Querying recent points requires fetching the entire bloated document. Instead, use MongoDB’s time-series collections (introduced in 5.0) which bucket data automatically into fixed-size BSON documents and apply columnar compression. For existing setups, cap embedded arrays at 1,024 entries and roll over to new documents daily. Avoid $push on unbounded arrays for high-frequency writes.

MongoPush.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// io.thecodeforge — system-design tutorial

from pymongo import MongoClient

db = MongoClient().sensors
col = db.readings
# Appending every second — document grows indefinitely
col.update_one(
    {"_id": "sensor_001"},
    {"$push": {"readings": {"ts": 1700000000, "val": 23.5}}}
)
# Fetching latest 10 points still reads the whole doc
doc = col.find_one({"_id": "sensor_001"})
print(doc["readings"][-10:])
Output
[{'ts': 1700000000, 'val': 23.5}, ...] – 16x slower after 2 weeks
Production Trap:
MongoDB’s padding factor is no longer dynamic in recent versions — predictable write growth causes exponential write latency once padding is exhausted.
Key Takeaway
$push on unbounded arrays causes document relocation; cap embedded arrays or switch to time-series collections for high-frequency data.
● Production incidentPOST-MORTEMseverity: high

Cardinality Explosion — The 3am Outage That Took Down Monitoring

Symptom
Prometheus queries started timing out after 30s. InfluxDB heap usage spiked to 32GB, then OOM-killed the process every 4 hours. Write throughput dropped to <100 points/sec.
Assumption
The team assumed unlimited cardinality was safe because each pod had unique container IDs as labels. They thought the TSDB could 'handle it' with horizontal scaling.
Root cause
Cardinality is the product of distinct values across all tag sets. 5000 pods × 20 machines × 10 applications × each with 3 different container states = 1.2M unique time series. Each series consumes ~48 bytes in the index, totalling 57MB for index alone, plus memory overhead for caching. InfluxDB stores the index in memory by default — once it exceeds heap, GC thrashing begins.
Fix
Reduced cardinality by removing high-cardinality labels (container IDs) and moving them to a separate log system. Implemented cardinality limits per measurement at the database level (max 100k series per measurement). Added alerting on cardinality growth rates. Restart with 8GB heap and pre-created shard groups for new retention periods.
Key lesson
  • Always set a hard cardinality limit in TSDB configuration before ingesting production data.
  • Monitor cardinality as a first-class metric — track distinct series count per measurement over time.
  • Design label schemas to keep distinct values under 10 per dimension. Pod IDs belong in logs, not metrics.
Production debug guideDiagnose the five most common TSDB failures in production5 entries
Symptom · 01
Query times suddenly spike from 50ms to 5s, but data ingestion appears normal
Fix
Check memory pressure with top and free -m on the TSDB node. Run influx_inspect report-shards (InfluxDB) or tsdb analyze (Prometheus) to identify large shards. Merge or downsample old shards.
Symptom · 02
Write throughput drops to <10% of normal after adding new labels
Fix
Cardinality explosion alert. Run SHOW SERIES CARDINALITY or SHOW TAG KEYS to identify high-cardinality tags. Remove or relabel immediately. Increase shard retention duration to reduce index rebuild frequency.
Symptom · 03
Disk usage grows faster than expected, even with short retention
Fix
Check compression ratio with influxd_inspect dumptsm (InfluxDB) or tsdb retention (TimescaleDB). Ensure compression is enabled for each column. For Prometheus, verify --storage.tsdb.retention.time and --storage.tsdb.max-block-duration are set correctly.
Symptom · 04
Containers in Kubernetes monitoring stack crash with OOMKilled
Fix
Check vector or Prometheus memory limits. Set --storage.tsdb.min-block-duration=2h to reduce memory for in-memory blocks. For InfluxDB, reduce max-series-per-database in config. Add memory request/limit to pod spec.
Symptom · 05
Queries returning stale data or missing records after a failover
Fix
Check replication factor and consistency level. In TimescaleDB, verify all chunks are replicated across nodes. For InfluxDB OSS, there is no HA — use InfluxDB Enterprise or a proxy layer. Ensure write acknowledgements are set to quorum.
★ TSDB Quick Debug Cheat SheetFive production scenarios and the exact commands to run
High cardinality detected
Immediate action
List top cardinality series
Commands
influx -database 'mydb' -execute 'SHOW SERIES CARDINALITY'
curl http://localhost:9090/api/v1/status/tsdb | jq '.data.labelValuesCount'
Fix now
Drop high-cardinality measurements: DROP MEASUREMENT "high_card_metric"
Query slow on large time range+
Immediate action
Check if query uses full scan
Commands
EXPLAIN ANALYZE SELECT * FROM metrics WHERE time > now() - 7d
influx -execute 'EXPLAIN SELECT mean(value) FROM cpu WHERE time > now() - 1d'
Fix now
Add a continuous query to downsample hourly
Disk filling fast+
Immediate action
Identify largest shards
Commands
influx_inspect report-shards -database mydb
du -sh /var/lib/influxdb/data/*
Fix now
Extend retention policy: ALTER RETENTION POLICY "autogen" ON "mydb" DURATION 30d
Write throughput drop+
Immediate action
Check connections and series creation rate
Commands
influx -execute 'SHOW STATS' | grep 'writePointsOk'
curl http://localhost:9090/api/v1/query?query=rate(prometheus_tsdb_wal_writes[5m])
Fix now
Batch writes >=100 points per request, use gzip compression
OOM crash on TSDB pod+
Immediate action
Check memory limit and heap size
Commands
cat /proc/meminfo | grep MemTotal
journalctl -u influxdb | grep -i 'OutOfMemory'
Fix now
Set INFLUXD_DATA_MAX_SERIES_PER_DATABASE=100000 and increase memory limit to 8GB
Storage Engine Comparison
PropertyLSM (InfluxDB, VictoriaMetrics)B-Tree (TimescaleDB)Block-based (Prometheus)
Write Amplification10-50x (compaction)Low (WAL + page splits)None (immutable blocks)
Read AmplificationHigh (multiple SSTables per query)Low (single B-Tree lookup)Moderate (merge multiple chunks)
Compression EfficiencyExcellent (Gorilla + columnar)Good (TOAST + ZSTD)Excellent (delta-of-delta + XOR)
Row UpdateInsert new version (append)In-place update (B-Tree)Append only (no updates)
Concurrent WritesExcellent (sequential WAL)Good (B-Tree locking)Good (append to current block)
Query FlexibilityLimited (no JOINs)Full SQLPromQL (highly optimized)

Key takeaways

1
Time series databases are optimized for append-only writes and time-range queries, not updates or random access.
2
Storage engine choice (LSM vs B-Tree vs block-based) determines write vs read performance trade-offs.
3
Compression (delta-of-delta + XOR) can reduce storage by 90% but requires low-entropy, slowly changing values.
4
Cardinality is the number of unique metric series
keep it under 100k per measurement or risk OOM.
5
Retention policies and downsampling are mandatory; plan tiers (raw, hourly, daily) before production.
6
Choose the right TSDB by workload
InfluxDB for IoT throughput, TimescaleDB for SQL joins, Prometheus for monitoring.

Common mistakes to avoid

5 patterns
×

Using high-cardinality labels like 'user_id' or 'request_id'

Symptom
Index memory grows unbounded, write throughput drops by 90% within hours. Eventually the TSDB process OOMs.
Fix
Remove high-cardinality labels from metric schema. If you must track per-user metrics, either use a separate TSDB tenant per user or store the user ID as a metric value (not a label).
×

Not setting retention policies before production

Symptom
Disk fills completely after a few days. Queries on old data return stale cached results because compaction backlog grows unbounded.
Fix
Configure retention policy (DURATION + SHARD DURATION) before any writes. For InfluxDB: CREATE RETENTION POLICY "30d" ON "mydb" DURATION 30d REPLICATION 1. For Prometheus: set --storage.tsdb.retention.time=30d.
×

Writing data points out of order (backfilling with random timestamps)

Symptom
WAL grows large, compaction stalls, write latency spikes from 1ms to 1s. Disk space fills quickly due to unmerged SSTables.
Fix
Always send data in timestamp order. If backfilling is unavoidable, batch writes in a single large POST and ensure timestamps are monotonically increasing within each shard interval. Use InfluxDB's backfill option with timestamp_precision set.
×

Ignoring compression impact on query speed

Symptom
Queries on compressed data are 50x slower than expected. The database decompresses on the fly, adding CPU overhead per query.
Fix
Use pre-aggregated rollups for common query patterns. For Prometheus, use recording rules to precompute slow queries. For InfluxDB, use continuous queries. Monitor CPU at the TSDB node level during peak query hours.
×

Assuming TSDBs handle write spikes gracefully

Symptom
During a deployment event (all containers restart), write rate spikes 100x. The TSDB's WAL fills disk or compaction queue backs up, causing eventual OOM.
Fix
Implement client-side rate limiting or buffering. Use a message queue (Kafka) between application and TSDB to absorb spikes. In InfluxDB, set write-timeout and max-values-per-tag to reject excessive data.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how time series data differs from transactional data and how a T...
Q02SENIOR
What is cardinality in a TSDB and why is it critical for performance?
Q03SENIOR
Compare InfluxDB and TimescaleDB for a financial trading system that nee...
Q04SENIOR
How does Prometheus handle long-term storage and what are its limitation...
Q05SENIOR
Explain how Gorilla (delta-of-delta + XOR) compression works and why it'...
Q01 of 05SENIOR

Explain how time series data differs from transactional data and how a TSDB exploits those differences.

ANSWER
Transactional data is frequently updated, deleted, and requires random access (e.g., user profile). Time series data is append-only, rarely updated, and always queried by time range. TSDBs exploit this by: (1) using append-only storage engines (LSM) that avoid random writes, (2) compressing timestamps and values using delta-of-delta and XOR, (3) partitioning data by time (shard groups) so old data can be dropped or downsampled, and (4) using index structures optimized for range scans and aggregates rather than point lookups.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is a time series database in simple terms?
02
Can I use PostgreSQL or MySQL for time series data?
03
What is the difference between InfluxDB and Prometheus?
04
How do I estimate storage requirements for a time series database?
05
Why does my TSDB crash with OOM when I add more metrics?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

Follow
Verified
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
🔥

That's Databases in Design. Mark it forged?

14 min read · try the examples if you haven't

Previous
Data Lake vs Data Warehouse
5 / 5 · Databases in Design
Next
OAuth 2.0 and OpenID Connect