Senior 10 min · March 06, 2026

HBase Row Key Hotspots — Why One RegionServer Hit 100% CPU

Timestamp-prefixed row keys funneled all writes to one region, spiking latency from 2ms to 8 seconds.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • HBase is a distributed, sorted, column-family store running on top of HDFS — designed for random read/write access at petabyte scale
  • Data is split into regions by row key ranges; each region is served by exactly one RegionServer at a time
  • Writes go to WAL then MemStore; flushed to immutable HFiles on HDFS when MemStore fills — WAL guarantees no acknowledged write is ever lost
  • Row key design is the single most critical decision — it determines read/write distribution across the cluster and cannot be changed without a full migration
  • Compactions merge HFiles to reclaim space and reduce read amplification — but major compactions can saturate disk I/O for hours on large regions
  • Biggest mistake: sequential row keys (timestamps, auto-increment) create a single hot region — the #1 cause of HBase performance failures in production
Plain-English First

Imagine a school with millions of students, each having a locker. Instead of one giant hallway with all the lockers in a straight line, the school splits the hallway into sections managed by different hall monitors. Each monitor knows exactly which locker numbers they're responsible for and can find any locker in their section instantly. HBase works exactly like that — your data is sorted by a unique key, sliced into ranges called regions, and each region is managed by a dedicated server. When you ask for a row, HBase goes straight to the right monitor without scanning every locker in the school. The catch: if you number your lockers with today's date as a prefix, every new locker ends up in the last section, overloading one hall monitor while the others stand around doing nothing. That's the hotspot problem, and it's the first thing you need to understand before writing a single byte to HBase.

Every week, systems like Facebook Messenger, Apache Phoenix, and large-scale e-commerce platforms serve billions of read and write requests against datasets that dwarf anything a relational database was built to handle. The secret weapon in many of these stacks is Apache HBase — a distributed, column-family-oriented store that runs on top of HDFS and borrows its core ideas from Google's Bigtable paper. When your time-series data grows past what PostgreSQL partitioning can stomach, or when you need sub-10ms random reads across petabytes of sparse data, HBase is the conversation you need to have.

HBase solves a specific and nasty problem: how do you get fast, consistent random reads and writes on a dataset so large it must be spread across hundreds of machines? HDFS on its own is sequential — great for MapReduce batch jobs, terrible for point lookups. HBase layers a sorted, indexed structure on top of HDFS so you can fetch a single row in milliseconds even when the table holds a trillion rows. It trades the flexibility of SQL for raw scalability, and understanding when that trade is worth making separates engineers who reach for HBase correctly from those who spend three months regretting the decision.

By the end of this article you'll understand how HBase physically stores data from the MemStore all the way down to HFiles on HDFS, why row key design determines whether your cluster thrives or dies, how compactions work and when they bite you in production, how bloom filters and the BlockCache keep reads fast, and when to choose a different tool entirely. Every section maps to a real production failure pattern — the kind that shows up at 2am when write latency spikes and one RegionServer is pegged at 100% CPU while 49 others sit idle.

HBase Architecture — From Client to HDFS

HBase has three core components: the HMaster, RegionServers, and ZooKeeper. Understanding what each one does — and critically, what it does NOT do — is the foundation for every production decision you'll make.

The HMaster manages cluster metadata: table creation and deletion, schema changes, region assignment, and load balancing. It does NOT participate in the read or write data path at all. If the HMaster dies, the cluster continues serving reads and writes without interruption. You just cannot create tables, alter schemas, or rebalance regions until a new master is elected. This is a deliberate design choice — the HMaster is intentionally not a bottleneck for data operations.

RegionServers are the workhorses. Each RegionServer hosts multiple regions and handles all client read/write requests for those regions. Each region covers a contiguous range of sorted row keys and contains one or more column families. Each column family maintains its own MemStore — an in-memory sorted write buffer — and its own set of HFiles on HDFS.

ZooKeeper is the coordination layer and the true single point of failure in the cluster. It tracks which server is the active HMaster, which RegionServer owns which region, and detects RegionServer failures through ephemeral node heartbeats. If ZooKeeper loses quorum, the entire cluster becomes unavailable — no reads, no writes, no admin operations. This is the component that warrants the most aggressive monitoring in production.

The write path is straightforward but the ordering matters for reasoning about durability. The client sends a Put to the RegionServer responsible for the target row's region. The RegionServer writes to the WAL first — this is the durability guarantee. Then it writes to the in-memory MemStore. The client receives acknowledgment at this point. When the MemStore reaches its size threshold (default 128MB), it flushes to an immutable HFile on HDFS. The WAL is the safety net: if the JVM crashes after the WAL write but before the HFile flush, the data is recovered by replaying the WAL when the region is reassigned.

The read path is a layered cache hierarchy. The RegionServer checks the MemStore first (contains the most recent data for unflushed writes), then checks the BlockCache (recently accessed HFile blocks cached in heap), then checks HFiles on disk. For HFiles, bloom filters are consulted first to skip files that cannot contain the requested row, then the block index locates the specific block, and finally the block is read from HDFS or the OS page cache.

io/thecodeforge/hbase/HBaseArchitectureClient.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
package io.thecodeforge.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/**
 * Production-grade HBase client demonstrating the read/write path.
 *
 * Key production rules:
 *   1. Connection is heavyweight and thread-safe — create once, share across threads
 *   2. Table is lightweight — obtain per-operation, close after use
 *   3. Always set retries and pause — transient RegionServer failures are normal
 *   4. Always set scan caching — default of 1 row per RPC is catastrophically slow
 */
public class HBaseArchitectureClient implements AutoCloseable {

    private final Connection connection;
    private final Admin admin;

    public HBaseArchitectureClient(String zkQuorum, int zkPort) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", zkQuorum);
        conf.setInt("hbase.zookeeper.property.clientPort", zkPort);

        // Retry on transient failures: up to 3 retries with 1s pause between
        conf.setInt("hbase.client.retries.number", 3);
        conf.setInt("hbase.client.pause", 1000);

        // Operation timeout — fail fast rather than queue behind a slow RegionServer
        conf.setInt("hbase.rpc.timeout", 5000);         // 5s per individual RPC
        conf.setInt("hbase.client.operation.timeout", 15000); // 15s total including retries

        // Connection is thread-safe and maintains a connection pool to each RegionServer
        // One Connection per application process is the correct pattern
        this.connection = ConnectionFactory.createConnection(conf);
        this.admin = connection.getAdmin();
    }

    /**
     * Write path: Put -> WAL -> MemStore -> (eventually) HFile on HDFS
     *
     * The WAL write happens before the client receives acknowledgment.
     * If the RegionServer crashes before MemStore flush, WAL replay recovers the data.
     * No acknowledged write is ever lost, even in a hard crash scenario.
     */
    public void writeRow(String tableName, String rowKey,
                         String columnFamily, String qualifier, byte[] value)
            throws IOException {

        // Table is lightweight — safe to create per-operation; always close after use
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(
                Bytes.toBytes(columnFamily),
                Bytes.toBytes(qualifier),
                value
            );
            // setDurability controls WAL behaviour:
            //   SYNC_WAL (default)  — fsync WAL before ack — safest
            //   ASYNC_WAL           — buffer WAL writes — faster, small loss window
            //   SKIP_WAL            — no WAL — fastest, data loss on crash
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }

    /**
     * Read path: Get -> MemStore -> BlockCache -> Bloom filter -> HFile block index -> HDFS
     *
     * Sub-10ms latency when the row is in MemStore (unflushed write) or BlockCache.
     * Cold reads (data not in any cache) depend on HDFS read latency — typically 20-50ms.
     */
    public byte[] readRow(String tableName, String rowKey,
                          String columnFamily, String qualifier)
            throws IOException {

        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Get get = new Get(Bytes.toBytes(rowKey));
            get.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(qualifier));

            // setCacheBlocks(true) is the default — hot rows stay in BlockCache
            // Set to false for analytical/scan workloads to avoid evicting hot data
            get.setCacheBlocks(true);

            Result result = table.get(get);

            if (result.isEmpty()) {
                return null;  // Row does not exist — always check before unwrapping
            }

            return result.getValue(
                Bytes.toBytes(columnFamily),
                Bytes.toBytes(qualifier)
            );
        }
    }

    /**
     * Scan with explicit start/stop row — the correct production scan pattern.
     *
     * Never scan without start/stop rows unless you intend a full table scan.
     * Full table scans block RegionServer threads for minutes and evict hot
     * BlockCache data, degrading point lookup latency for all other clients.
     */
    public void scanRange(String tableName, String startKey, String stopKey)
            throws IOException {

        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes(startKey), true)   // inclusive
                .withStopRow(Bytes.toBytes(stopKey), false)    // exclusive
                .setCaching(100)             // 100 rows per RPC — reduces round trips
                .setMaxResultSize(2 * 1024 * 1024L)  // 2MB max per RPC response
                .setCacheBlocks(false);      // scans read sequentially — don't pollute BlockCache

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    byte[] rowKey = result.getRow();
                    // process each row here — scanner handles region boundaries transparently
                    _ = rowKey; // suppress unused warning in example
                }
            }
        }
    }

    @Override
    public void close() throws IOException {
        admin.close();
        connection.close();
    }
}
The HBase Write Path — Durability First, Performance Second
  • Client sends Put to the RegionServer owning the target row's region — ZooKeeper tells the client which server that is
  • RegionServer appends to the WAL before acknowledging — guarantees the write survives a hard JVM crash
  • Write is applied to the MemStore — a sorted skip list in heap memory, one per column family per region
  • Client receives acknowledgment — the data is durable at this point, regardless of whether a flush has occurred
  • When MemStore reaches 128MB (default), it flushes to an HFile on HDFS — immutable, sorted, compressed, replicated
  • Multiple HFiles accumulate over time; compaction merges them periodically to limit read amplification
Production Insight
HMaster death does not stop reads and writes — only region rebalancing halts until a new master is elected.
ZooKeeper quorum loss stops everything — no reads, no writes, no admin operations. It is the true single point of failure.
Rule: run ZooKeeper on dedicated nodes with SSD storage, a stable JVM (no large heap), and monitor session latency aggressively. ZK sessions should complete in under 10ms on a healthy cluster.
Key Takeaway
HMaster manages metadata only — it is not in the data path and its failure does not stop reads or writes.
RegionServers handle all data operations; ZooKeeper coordinates everything and is the true cluster SPOF.
Rule: monitor ZooKeeper session latency and node health more aggressively than any other component — a flapping ZK causes cascading RegionServer crashes.
HBase Component Failure Impact
IfHMaster dies
UseCluster continues serving reads and writes; no new table creation, schema changes, or region rebalancing until a standby master is elected — typically within 30 seconds with HA master configuration
IfSingle RegionServer dies
UseIts regions are reassigned to other servers by the HMaster; WAL replay reconstructs MemStore data; recovery takes 30-90 seconds; those regions are unavailable during recovery
IfZooKeeper quorum is lost (majority of ZK nodes fail)
UseEntire cluster becomes unavailable immediately — no reads, no writes, no admin operations; restore ZK quorum as the first priority
IfHDFS NameNode becomes unavailable
UseHFile reads degrade or fail; new MemStore flushes cannot complete; cluster degrades progressively until HDFS recovers or NameNode failover completes
IfNetwork partition between client and RegionServer
UseClient retries with exponential backoff up to hbase.client.retries.number; if all retries exhausted, the operation fails with an IOException — not a silent failure

Row Key Design — The Single Most Important Decision

Row key design in HBase is the most consequential architectural decision you will make, and it is the one you cannot change without a full data migration. Unlike a relational database where you can add an index after the fact, HBase has no secondary indexes. The row key is the only native index. It determines three things simultaneously: how writes are distributed across RegionServers, which rows can be read with a single efficient range scan, and how much data each region holds.

HBase stores rows sorted lexicographically by row key. Adjacent keys land in the same region and can be fetched with a single range scan — this is the key locality property that makes HBase efficient for time-series queries like 'give me all readings for device X in the last hour.' But lexicographic sorting also means that sequential keys funnel all new writes to a single region — the rightmost one, which covers the highest current key range. No amount of RegionServer hardware fixes this because the problem is structural: the data model directs all writes to one place.

Salt prefix: prepend a one-byte hash-derived salt (md5(naturalKey) % regionCount) to the row key. Every key gets distributed across all regions. The trade-off is that range scans on the natural key become multi-region scatter-gather operations — you have to scan all regions in parallel and merge results in the client. Acceptable for write-heavy workloads where scans are rare.

Reversed timestamp: for time-series data, format the key as {Long.MAX_VALUE - timestamp}_{deviceId} instead of {timestamp}_{deviceId}. Reversed timestamps decrease as real time increases, so each new write is directed toward lower key space — spreading across regions rather than always hitting the rightmost one. The bonus: a range scan for 'the last N records' now returns contiguous data instead of requiring a reverse scan.

Hash prefix: prepend a fixed-length hash of the natural key. Deterministic — the same natural key always maps to the same region. Good for Get-heavy workloads where you always know the full key. No range scan capability on the natural key.

The row key must also be as short as practical — under 100 bytes is the guideline. The row key is stored verbatim in every HFile block that contains that row, and it appears in every column qualifier's KeyValue metadata. A 500-byte row key on a table with 20 column qualifiers wastes 10KB per row just in key storage overhead, before any actual data.

io/thecodeforge/hbase/RowKeyDesign.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
package io.thecodeforge.hbase;

import org.apache.hadoop.hbase.util.Bytes;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Production row key design patterns for HBase.
 *
 * Each pattern makes a different trade-off between write distribution,
 * range scan efficiency, and deterministic lookup. Choose based on
 * your dominant access pattern — not based on what looks simplest.
 *
 * Decision guide:
 *   Write-heavy + rare scans    → salt prefix
 *   Time-series + recent reads  → reversed timestamp
 *   Read-heavy + point lookups  → hash prefix
 *   Multi-dimensional queries   → composite key
 */
public class RowKeyDesign {

    /**
     * Pattern 1: Salt prefix for write distribution.
     *
     * Distributes writes uniformly across all regions.
     * Trade-off: range scans on the natural key require querying all regions
     * in parallel and merging results in the client — scatter-gather.
     *
     * Use when: write throughput is the constraint and scans are rare or
     * can tolerate multi-region fan-out.
     */
    public static byte[] saltedKey(String naturalKey, int regionCount) {
        // Ensure consistent positive modulus regardless of hashCode sign
        int salt = (naturalKey.hashCode() & 0x7FFFFFFF) % regionCount;
        // Zero-pad salt so lexicographic ordering of region ranges works correctly
        return Bytes.toBytes(String.format("%02d_%s", salt, naturalKey));
    }

    /**
     * Pattern 2: Reversed timestamp for time-series write distribution.
     *
     * Reversed timestamps decrease as real time increases, directing each
     * new write toward lower key space rather than always to the rightmost region.
     *
     * Bonus: 'latest N records for deviceId' scan reads contiguous rows
     * because recent data is now at lower keys, not scattered.
     *
     * Use when: ingesting time-series data (IoT, logs, events) and recent-data
     * reads are more important than historical range scans.
     */
    public static byte[] reversedTimestampKey(long timestampMillis, String deviceId) {
        long reversedTs = Long.MAX_VALUE - timestampMillis;
        // 16 hex chars = 8 bytes = full long range, zero-padded for correct lex sort
        return Bytes.toBytes(String.format("%016x_%s", reversedTs, deviceId));
    }

    /**
     * Pattern 3: Hash prefix for deterministic distribution.
     *
     * Same natural key always maps to the same region — lookups are fast
     * because you can reconstruct the hash from the natural key.
     * Trade-off: no range scan on natural key order; you lose locality.
     *
     * Use when: the workload is predominantly point Gets and range scans
     * on the natural key are not needed.
     */
    public static byte[] hashPrefixedKey(String naturalKey) {
        String prefix = md5HexPrefix(naturalKey, 4);  // 4-char prefix = 16^4 = 65536 buckets
        return Bytes.toBytes(String.format("%s_%s", prefix, naturalKey));
    }

    /**
     * Pattern 4: Composite key for multi-dimensional access patterns.
     *
     * Encodes multiple fields at fixed widths so lexicographic ordering
     * enables efficient range scans on the primary dimension while
     * filtering on secondary dimensions within the scan.
     *
     * Example: scan all events for US_EAST in reverse-chronological order
     * by setting startRow=US_EAST_<Long.MAX_VALUE> stopRow=US_EAST_<0>
     *
     * Use when: queries always filter on a primary dimension (region, tenant,
     * category) and sort or range-scan on a secondary dimension (time, score).
     */
    public static byte[] compositeKey(String regionCode, long timestampMillis, String userId) {
        // Fixed-width region code ensures correct lexicographic boundaries
        String paddedRegion  = String.format("%-6s", regionCode).replace(' ', '0');
        String reversedTs    = String.format("%016x", Long.MAX_VALUE - timestampMillis);
        return Bytes.toBytes(String.format("%s_%s_%s", paddedRegion, reversedTs, userId));
    }

    private static String md5HexPrefix(String input, int hexChars) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(Bytes.toBytes(input));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
                if (sb.length() >= hexChars) break;
            }
            return sb.substring(0, hexChars);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 unavailable — should never happen on a JVM", e);
        }
    }
}
The #1 HBase Mistake: Sequential Row Keys
Using a timestamp or auto-increment value as the row key prefix creates a write hotspot that no amount of hardware can fix. All writes go to the rightmost region; one RegionServer handles 100% of write traffic while every other server sits idle. This is the most common cause of HBase production incidents and it cannot be fixed without a full data migration. Design the row key correctly before writing a single byte. Retrofitting a row key schema means exporting the entire dataset, transforming every key, and reimporting — which is weeks of work on a petabyte-scale table.
Production Insight
A telemetry pipeline used {timestamp}_{deviceId} as the row key. One RegionServer handled 100% of writes.
49 other RegionServers sat idle. Write latency spiked from 2ms to 8 seconds as MemStore filled on the hot server.
Rule: never use monotonically increasing values as row key prefixes. Reverse them or add a salt prefix before the first production write.
Key Takeaway
Row key design determines write distribution, read efficiency, and scan locality — all three simultaneously, with no way to change it post-deployment without a migration.
Sequential keys create hotspots; reversed timestamps and salt prefixes solve this before the first write.
Rule: treat the row key schema as a permanent architectural decision. Sketch expected access patterns before creating the table.
Row Key Strategy Selection
IfWrite throughput is the constraint and range scans are rare or can tolerate multi-region scatter-gather
UseSalt prefix: (md5(key) % regionCount)_naturalKey — uniform write distribution, scatter-gather for scans
IfTime-series data with recent-data-heavy reads and ingest at high rate
UseReversed timestamp: (Long.MAX_VALUE - ts)_deviceId — spreads writes, recent data clustered for efficient scans
IfGet-heavy workload where you always know the full natural key
UseHash prefix: md5prefix(naturalKey)_naturalKey — deterministic, no range scan on natural key order
IfMulti-dimensional queries: scan by primary dimension, filter by secondary
UseComposite key: fixedWidthPrimary_reversedTimestamp_secondaryKey — enables range scans on primary while sorting by secondary
IfRow key exceeds 100 bytes
UseShorten it — key storage overhead multiplies by number of columns per row and appears in every HFile block

Compactions — The Silent Performance Killer

Compactions are HBase's mechanism for managing HFile accumulation. Every MemStore flush produces a new HFile on HDFS. Without any merging, a region that has been written to heavily could accumulate hundreds of HFiles, and every read would need to check each one for the requested row. Compaction solves this by periodically merging HFiles into fewer, larger ones — trading I/O cost during compaction for lower read amplification afterward.

Minor compaction is lightweight and runs continuously. It fires automatically when the number of HFiles in a column family store exceeds hbase.hstore.compactionThreshold (default 3). It picks a small batch of adjacent HFiles and merges them. Minor compaction does not remove deleted or expired cells — those tombstone markers remain until a major compaction. Minor compactions are safe to let run during business hours; they consume modest I/O and complete quickly.

Major compaction is the expensive one. It merges all HFiles in a store into a single HFile, permanently removes deleted cells and cells past their TTL, and rewrites the entire column family from scratch. For a region with 200GB of data, a major compaction reads 200GB from HDFS and writes 200GB back. On a cluster already under read and write load, this I/O competition can push latency from 5ms to 500ms. By default, HBase schedules automatic major compactions every 7 days (hbase.hregion.majorcompaction) with a random jitter (hbase.hregion.majorcompaction.jitter) to prevent all regions compacting simultaneously.

The production trap is that automatic major compactions will eventually collide with peak traffic hours, especially as the cluster grows and 7 days worth of accumulated HFiles becomes significant data. The best practice for large production clusters is to disable automatic major compactions entirely (set hbase.hregion.majorcompaction to 0) and schedule them through an external job during the lowest-traffic window — typically weekend nights or the quietest hours of the day. For time-series data with a TTL, major compactions are also the only mechanism that reclaims space from expired cells, so you cannot skip them indefinitely.

io/thecodeforge/hbase/CompactionManager.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
package io.thecodeforge.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;
import java.util.Collection;
import java.util.Map;

/**
 * Production compaction management — manual triggering, throughput throttling,
 * and configuration tuning to prevent compactions from overwhelming live traffic.
 *
 * The core rule: never let automatic major compactions run during peak hours.
 * For clusters serving live traffic, manual scheduling during off-peak windows
 * is the only safe approach at scale.
 */
public class CompactionManager implements AutoCloseable {

    private final Admin admin;
    private final Connection connection;

    public CompactionManager(String zkQuorum) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", zkQuorum);
        this.connection = ConnectionFactory.createConnection(conf);
        this.admin = connection.getAdmin();
    }

    /**
     * Trigger a major compaction on a specific table.
     * Call this during a scheduled maintenance window, not during peak hours.
     * Major compaction rewrites all HFiles — I/O impact is proportional to region size.
     */
    public void majorCompactTable(String tableName) throws IOException {
        System.out.printf("Triggering major compaction on table: %s%n", tableName);
        admin.majorCompact(TableName.valueOf(tableName));
        System.out.println("Major compaction submitted — runs asynchronously in the background");
    }

    /**
     * Trigger a minor compaction on a specific table.
     * Safe to run during business hours — merges a few HFiles with modest I/O.
     * Does NOT remove deleted cells or reclaim space from expired TTL data.
     */
    public void minorCompactTable(String tableName) throws IOException {
        admin.compact(TableName.valueOf(tableName));
    }

    /**
     * Check compaction queue depth per RegionServer.
     * A sustained queue > 5 indicates the RegionServer's disk I/O cannot keep up
     * with the incoming flush rate — time to tune or add capacity.
     */
    public void printCompactionQueueDepth() throws IOException {
        Map<ServerName, RegionMetrics> clusterMetrics =
            admin.getClusterMetrics().getLiveServerMetrics()
                 .entrySet().stream()
                 .collect(java.util.stream.Collectors.toMap(
                     Map.Entry::getKey,
                     e -> e.getValue().getRegionMetrics()
                               .values().stream().findFirst().orElse(null)
                 ));
        // In practice, iterate ServerMetrics for compactionQueueSize
        admin.getClusterMetrics().getLiveServerMetrics().forEach((server, metrics) -> {
            long compactionQueueSize = metrics.getCompactionQueueSize();
            System.out.printf("  %s  compaction queue: %d%n",
                server.getHostname(), compactionQueueSize);
        });
    }

    /**
     * Returns the recommended configuration for compaction tuning.
     *
     * Key decisions:
     *   - Disable automatic major compactions for clusters > 1TB
     *   - Throttle compaction throughput to leave headroom for live I/O
     *   - Increase minor compaction threshold to reduce frequency
     */
    public static Configuration getCompactionTunedConfig() {
        Configuration conf = HBaseConfiguration.create();

        // ── Minor compaction tuning ──────────────────────────────────────
        // Trigger minor compaction when 4+ HFiles exist per store (default: 3)
        // Higher threshold = less frequent compaction, more read amplification
        conf.setInt("hbase.hstore.compactionThreshold", 4);

        // Max HFiles merged in one minor compaction batch (default: 10)
        // Smaller batches = faster individual compactions, more total I/O over time
        conf.setInt("hbase.hstore.compaction.max", 6);

        // ── Major compaction tuning ──────────────────────────────────────
        // Disable automatic major compactions for large clusters
        // Set to 0 and schedule manually during off-peak windows
        conf.setLong("hbase.hregion.majorcompaction", 0L);  // 0 = disabled

        // If you do leave automatic major compaction enabled, add jitter
        // to prevent all regions from compacting simultaneously
        // conf.setLong("hbase.hregion.majorcompaction", 7 * 24 * 3600 * 1000L);
        // conf.setFloat("hbase.hregion.majorcompaction.jitter", 0.5f);

        // ── I/O throttling ───────────────────────────────────────────────
        // Limit compaction throughput to leave headroom for live reads/writes
        // 20MB/s per store is conservative; tune up if disk throughput allows
        conf.setLong("hbase.hstore.compaction.throughput.lower.bound",
                     20L * 1024 * 1024);  // 20 MB/s floor
        conf.setLong("hbase.hstore.compaction.throughput.higher.bound",
                     40L * 1024 * 1024);  // 40 MB/s ceiling

        return conf;
    }

    @Override
    public void close() throws IOException {
        admin.close();
        connection.close();
    }
}
Compaction Trade-offs — Read Amplification vs Write Amplification
  • Minor compaction: merges a few adjacent HFiles — lightweight, runs continuously, does not remove deleted cells, safe during business hours
  • Major compaction: rewrites ALL HFiles into one — heavy I/O, removes tombstones and TTL-expired cells, reclaims HDFS space
  • Without any compaction: reads degrade linearly with HFile count — each read must check every HFile via bloom filters
  • With unthrottled major compaction during peak hours: disk I/O saturation drives read latency from 5ms to 500ms
  • Rule: disable automatic major compactions on clusters above 1TB and schedule them manually during the quietest traffic window
Production Insight
A 50-node cluster ran automatic major compactions on the default 7-day schedule with no jitter.
All regions compacted simultaneously on the same day, saturating every DataNode's disk for 4 hours.
Read latency spiked from 8ms to 600ms. Setting hbase.hregion.majorcompaction.jitter=0.5 distributes the load over a 3-day window.
Rule: if you use automatic major compaction, always set jitter to at least 0.5 — never leave it at 0.
Key Takeaway
Minor compaction is safe during business hours; major compaction can saturate disk I/O for hours on large regions.
Disable automatic major compaction on clusters above 1TB and schedule it manually during off-peak windows.
Compaction is the only mechanism that physically removes deleted cells and TTL-expired data — it cannot be skipped indefinitely.
Compaction Configuration Strategy
IfCluster is under 100GB total data, development or staging environment
UseLeave automatic major compaction enabled at default 7-day interval — the I/O cost is negligible
IfCluster is between 100GB and 1TB in production
UseEnable automatic major compaction but set jitter=0.5 and throttle throughput to 40MB/s to avoid peak-hour collisions
IfCluster is above 1TB in production with live traffic SLOs
UseDisable automatic major compaction (set to 0) and schedule manual compactions during off-peak windows via admin.majorCompact()
IfTable uses TTL for data expiry (time-series, event logs)
UseMajor compaction is the only mechanism that physically removes expired cells — run it at least once per TTL period, even if manually scheduled
IfMinor compaction queue depth stays above 5 consistently
UseDisk I/O is undersized for the write rate — increase hbase.hstore.flusher.count, reduce MemStore flush size, or add DataNodes

Bloom Filters and BlockCache — Keeping Reads Fast

Two mechanisms keep HBase read latency in the millisecond range: bloom filters that skip irrelevant HFiles before reading any disk data, and the BlockCache that keeps recently accessed HFile blocks in heap memory.

Bloom filters are probabilistic data structures stored within each HFile. They answer the question 'does this HFile possibly contain row key X?' with certainty in the negative direction — if the bloom filter says no, the RegionServer skips that HFile entirely. If it says yes, the HFile might contain the row (with a configurable false positive rate, typically 1%). For a table with 20 HFiles per region, bloom filters can eliminate 15-19 disk reads per Get request. The space cost is roughly 120MB of bloom filter data per 100 million keys at 1% false positive rate.

HBase offers two bloom filter types. ROW bloom filters check only the row key — the right choice for workloads dominated by point Gets on narrow rows with few columns. ROWCOL bloom filters check both the row key and the column qualifier — useful for wide rows where you frequently fetch specific columns and want to skip HFiles that don't contain that column. ROWCOL uses 3-5x more memory per HFile than ROW. For scan-heavy workloads, bloom filters provide no benefit because range scans must read all blocks in sequence regardless — use NONE to save the memory.

The BlockCache keeps recently accessed HFile blocks in RegionServer heap memory. When a Get or Scan reads a block from HDFS, that block is placed in the BlockCache. Subsequent reads for rows in the same block find it in cache and skip the HDFS read entirely — this is how HBase achieves sub-millisecond reads for hot data. The BlockCache has two optional tiers: an on-heap L1 cache and an off-heap L2 bucket cache for larger working sets without JVM GC pressure.

The configuration rule that causes the most production incidents is the memory budget. The RegionServer's JVM heap must support MemStore, BlockCache, RPC handlers, GC overhead, and JVM internal structures simultaneously. The constraint is firm: hbase.regionserver.global.memstore.size + hfile.block.cache.size must not exceed 0.80. If they sum above 0.80, the remaining heap is insufficient for GC metadata and RPC handler threads, leading to Full GC pauses that exceed ZooKeeper's session timeout and cause RegionServer crashes.

io/thecodeforge/hbase/ReadTuningConfig.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
package io.thecodeforge.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

import java.io.IOException;

/**
 * Production read performance tuning for HBase.
 *
 * Covers:
 *   - BlockCache and MemStore memory budget (the 80% rule)
 *   - Bloom filter type selection per column family
 *   - Off-heap L2 bucket cache for larger working sets
 *   - Scan caching and result size limits
 *   - Column family compression
 *
 * The 80% rule is non-negotiable:
 *   hfile.block.cache.size + hbase.regionserver.global.memstore.size <= 0.80
 */
public class ReadTuningConfig {

    /**
     * Returns the recommended RegionServer configuration for a 32GB heap node.
     *
     * Memory budget:
     *   BlockCache : 40% of 32GB = 12.8GB  (on-heap L1)
     *   MemStore   : 35% of 32GB = 11.2GB
     *   Total      : 75%  (leaves 25% for GC, RPC handlers, JVM internals)
     *
     * With 8GB off-heap L2 bucket cache, effective cache = 12.8 + 8 = 20.8GB
     * without increasing GC pressure.
     */
    public static Configuration getReadOptimizedConfig() {
        Configuration conf = HBaseConfiguration.create();

        // ── On-heap BlockCache (L1) ──────────────────────────────────────
        // 40% of heap — primary cache for hot data
        conf.setFloat("hfile.block.cache.size", 0.40f);

        // ── MemStore ─────────────────────────────────────────────────────
        // 35% of heap — write buffer across all regions on this server
        // Triggers a global flush if total MemStore usage exceeds this
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.35f);

        // Verify: 0.40 + 0.35 = 0.75 <= 0.80 — within the safe budget

        // ── Off-heap L2 Bucket Cache ─────────────────────────────────────
        // Stores larger working sets without contributing to GC pressure
        // Data is serialised to native memory — slightly slower than L1 but
        // allows a much larger effective cache without increasing -Xmx
        conf.set("hbase.bucketcache.ioengine", "offheap");
        conf.setLong("hbase.bucketcache.size", 8192L);  // 8192 MB = 8GB off-heap

        // ── Scan optimisation ────────────────────────────────────────────
        // Rows fetched per RPC during a Scan — default of 1 is catastrophically slow
        // 100-500 is reasonable; tune based on average row size
        conf.setInt("hbase.client.scanner.caching", 100);

        // Max bytes returned per scanner RPC — prevents OOM on wide rows
        conf.setLong("hbase.client.scanner.max.result.size", 2L * 1024 * 1024);  // 2MB

        return conf;
    }

    /**
     * Creates a table with per-column-family bloom filter and compression settings.
     *
     * Bloom filter type is set per column family because different families
     * can have fundamentally different access patterns on the same table.
     */
    public static void createOptimisedTable(Connection connection,
                                             String tableName) throws IOException {
        try (Admin admin = connection.getAdmin()) {
            HTableDescriptor tableDescriptor =
                new HTableDescriptor(TableName.valueOf(tableName));

            // Column family for point-lookup-heavy data — use ROW bloom filter
            // ROW: probabilistic skip on row key only
            // Low memory cost: ~1.2 bytes per unique row key at 1% false positive
            HColumnDescriptor metadataFamily = new HColumnDescriptor("meta");
            metadataFamily.setBloomFilterType(BloomType.ROW);
            metadataFamily.setCompressionType(Compression.Algorithm.SNAPPY);
            metadataFamily.setMaxVersions(1);     // time-series: keep only latest
            metadataFamily.setTimeToLive(86400 * 90);  // 90-day TTL
            tableDescriptor.addFamily(metadataFamily);

            // Column family for wide rows with selective column reads — use ROWCOL
            // ROWCOL: probabilistic skip on row key + column qualifier
            // Higher memory cost: 3-5x more than ROW — only justified for wide rows
            HColumnDescriptor attributesFamily = new HColumnDescriptor("attrs");
            attributesFamily.setBloomFilterType(BloomType.ROWCOL);
            attributesFamily.setCompressionType(Compression.Algorithm.LZ4);
            attributesFamily.setMaxVersions(3);   // retain last 3 versions
            tableDescriptor.addFamily(attributesFamily);

            // Column family for scan-heavy analytical data — no bloom filter
            // Bloom filters provide no benefit for range scans — skip the memory cost
            HColumnDescriptor metricsFamily = new HColumnDescriptor("metrics");
            metricsFamily.setBloomFilterType(BloomType.NONE);
            metricsFamily.setCompressionType(Compression.Algorithm.GZ);  // higher ratio for cold data
            metricsFamily.setMaxVersions(1);
            tableDescriptor.addFamily(metricsFamily);

            admin.createTable(tableDescriptor);
        }
    }

    /**
     * Selects the appropriate bloom filter type based on column family access pattern.
     * Use this as a decision aide when designing a new column family schema.
     */
    public static BloomType selectBloomType(boolean isGetHeavy,
                                             boolean hasWideRows,
                                             boolean isScanHeavy) {
        if (isScanHeavy) {
            return BloomType.NONE;  // scans read all blocks — bloom filters are wasted memory
        }
        if (isGetHeavy && hasWideRows) {
            return BloomType.ROWCOL;  // skip HFiles missing the specific column — expensive but targeted
        }
        if (isGetHeavy) {
            return BloomType.ROW;  // skip HFiles missing the row — cheap and effective
        }
        return BloomType.NONE;  // default for mixed or unknown patterns
    }
}
Pro Tip: Bloom Filter Memory Budget
Bloom filter data is stored inside HFiles and loaded into memory when the HFile is opened. For a 100M-row HFile with 1% false positive rate, the bloom filter consumes roughly 120MB. If you have 10 HFiles per region and 200 regions per server, that's 240GB of bloom filter data — which clearly cannot all live in memory simultaneously. HBase loads bloom filter chunks lazily and caches them in the BlockCache. If bloom filter data is evicting hot row data from the BlockCache, reduce the false positive rate (costs more memory per HFile) or switch from ROWCOL to ROW. Use ROWCOL only when the column-specific filtering genuinely reduces disk reads enough to justify the 3-5x memory overhead.
Production Insight
A 16GB RegionServer was configured with MemStore at 0.40 and BlockCache at 0.45 — summing to 0.85.
Full GC pauses of 12-15 seconds caused ZooKeeper session timeouts and triggered RegionServer crashes.
The crash-recovery cycle repeated every 20-30 minutes, making the cluster unusable.
Fix: reduce MemStore to 0.35, BlockCache to 0.40 — sum 0.75 — leaving 25% for JVM internals. Crashes stopped immediately.
Key Takeaway
Bloom filters skip irrelevant HFiles before reading any disk data — transforming O(HFileCount) disk reads into O(1) for cache-cold Gets.
BlockCache keeps recently read HFile blocks in heap — hot data stays at sub-millisecond read latency.
MemStore + BlockCache must sum to <= 0.80 of heap — this is the most commonly violated configuration rule and the most reliable cause of RegionServer crashes.
Bloom Filter Type Selection
IfWorkload is Get-heavy with narrow rows (few columns per row)
UseUse ROW bloom filter — filters on row key, low memory overhead, effective for point lookups
IfWorkload is Get-heavy with wide rows (many columns) and reads target specific columns
UseUse ROWCOL bloom filter — filters on row key + column qualifier, 3-5x more memory but avoids reading HFiles missing the target column
IfWorkload is scan-heavy (range scans, full family reads)
UseUse NONE — range scans must read all blocks sequentially; bloom filters consume memory without reducing any disk reads
IfRegionServer memory is constrained (under 16GB heap)
UseUse ROW or NONE — ROWCOL memory overhead may not be justified; monitor BlockCache eviction rates
IfHFile has very few unique row keys (under 10,000)
UseUse NONE — bloom filter overhead exceeds the benefit; the small HFile block index is faster than a bloom filter lookup

HBase vs Alternatives — When to Use What

HBase is a powerful and operationally demanding tool. Choosing it for the wrong problem is a multi-month mistake. Understanding where it fits relative to alternatives is as important as understanding HBase itself.

HBase vs Cassandra: both are wide-column stores inspired by Google's Bigtable paper, but they make fundamentally different architectural trade-offs. Cassandra is masterless and peer-to-peer — every node is equal, and there is no single point of failure equivalent to ZooKeeper. Cassandra supports per-query consistency tuning: write at ONE for maximum throughput, read at QUORUM for strong consistency, or mix them per operation. It also has native multi-datacenter replication built into its core data model. HBase uses a master-based architecture with strong consistency by default and no native multi-datacenter support. Choose Cassandra when you need multi-datacenter active-active replication, tunable consistency on a per-operation basis, or a masterless topology that has no ZooKeeper equivalent as a single point of failure. Choose HBase when you need strong consistency guaranteed by default, coprocessor-based server-side computation, or tight integration with the Hadoop ecosystem for batch analytics on the same data (HBase + Spark/MapReduce on HFiles).

HBase vs MongoDB: MongoDB is a document store with a flexible JSON schema, a rich aggregation query language, and built-in secondary indexes. HBase has none of those things — it is a sorted key-value store with a fixed column-family schema, no query language beyond Get/Scan/Put, and no secondary indexes. Use MongoDB when your access patterns require ad-hoc queries, aggregation pipelines, or text search. Use HBase when you need predictable low-latency random reads across petabytes of sparse data with simple key-based access and no query flexibility required.

HBase vs Google Cloud Bigtable: Bigtable is the managed version of the same architecture. The API is largely compatible, the data model is identical, and the operational model is dramatically simpler — no ZooKeeper to manage, no RegionServers to tune, no compactions to schedule. If you are on GCP and your dataset warrants HBase, Bigtable is almost always the right choice unless you have a specific requirement for on-premises deployment, multi-cloud portability, or HDFS data locality for co-located batch jobs.

When NOT to Use HBase
HBase is the wrong choice when: (1) your dataset fits comfortably in a single PostgreSQL or MySQL instance — the operational burden of ZooKeeper, HDFS, and RegionServer tuning is not justified; (2) you need ad-hoc SQL queries — use Apache Phoenix on HBase if you must, but consider whether a purpose-built analytical store (BigQuery, Redshift, Snowflake) would serve better; (3) you need secondary indexes — HBase has none natively; Phoenix can add them but at the cost of write amplification and additional operational complexity; (4) you need multi-row ACID transactions — HBase provides single-row atomicity only; (5) your engineering team lacks HBase operational experience — production incidents involving GC tuning, WAL replay, and region reassignment are genuinely difficult to debug without prior exposure.
Production Insight
A team chose HBase for a 50GB dataset that needed SQL aggregation queries on ad-hoc dimensions.
They spent 3 months building and tuning an Apache Phoenix integration before realising that a PostgreSQL instance with partition pruning would have handled the load with far less engineering investment.
Rule: if your data fits on one machine and you need SQL, do not use HBase. Scale the database before adopting a distributed store.
Key Takeaway
HBase excels at petabyte-scale random read/write with strong consistency and native HDFS integration for co-located batch jobs.
Cassandra is the better choice for multi-datacenter deployments and tunable per-operation consistency; MongoDB for flexible schemas and complex query patterns.
Rule: the right reason to choose HBase is a specific technical requirement — petabyte scale, strong consistency, HDFS co-location — not familiarity with the Hadoop ecosystem.
● Production incidentPOST-MORTEMseverity: high

HBase Cluster Pegged at 100% CPU on a Single RegionServer — All Writes Funneling to One Region

Symptom
Average write latency jumped from 2ms to 8 seconds overnight. One RegionServer showed 100% CPU and 95% MemStore utilisation while all others were below 5%. The HBase Master UI showed one region with 50GB of accumulated data while every other region held under 1GB. The cluster had 50 RegionServers and was running at roughly 2% of its total write capacity.
Assumption
The team assumed HBase would automatically distribute writes evenly across all regions regardless of row key pattern — similar to how a hash-based partitioning scheme in Kafka or Cassandra would distribute load. They had tested write throughput in staging with a few thousand records and seen no issues.
Root cause
Row keys were formatted as {timestamp}_{deviceId}. Since timestamps are monotonically increasing, every new write was directed to the single rightmost region — the one covering the highest current key range. This region never split into a position that would relieve pressure because all new writes continued arriving in the same range immediately after any split. The RegionServer hosting this region became the sole bottleneck: every write hit the same WAL segment, the same MemStore, and the same HDFS DataNodes backing that region. No horizontal scaling was possible because the row key pattern made distribution structurally impossible.
Fix
Reversed the timestamp bytes in the row key: the format changed from {timestamp}_{deviceId} to {Long.MAX_VALUE - timestamp}_{deviceId}. Reversed timestamps decrease as real time increases, so new writes are directed toward the leftmost regions instead of the rightmost. A salt prefix (deviceId hash modulo region count) was added as a secondary distribution mechanism to handle devices with correlated write patterns. After the fix, write throughput increased 40x and all 50 RegionServers showed balanced load within one hour. The table was pre-split into 50 regions at creation to avoid the initial single-region bottleneck during the data load.
Key lesson
  • Never use monotonically increasing values as row key prefixes — timestamps, auto-increment IDs, and sequence numbers all create write hotspots
  • Reverse timestamps or add salt prefixes before writing a single record — retrofitting requires a full data migration
  • Monitor per-region write count and region size — imbalance larger than 2x the average is the earliest warning sign of a hotspot
  • HBase does NOT auto-balance writes based on row key patterns — the sorted partitioning model means distribution is entirely the caller's responsibility
Production debug guideSymptom → Action mapping for common HBase production issues5 entries
Symptom · 01
Single RegionServer at 100% CPU, others idle
Fix
Check row key distribution immediately. Run hdfs dfs -du /hbase/data/<namespace>/<table>/ and compare region sizes. If one region is 10x larger than the median, your keys are sequential — the fix is a row key redesign with salt prefix or reversed timestamp. Also run hbase org.apache.hadoop.hbase.util.RegionSplitter to inspect current split boundaries.
Symptom · 02
Read latency spikes every 30-60 minutes on a regular schedule
Fix
Compaction is the likely cause. Check HBase Master UI for compaction queue length and run hdfs dfs -du /hbase/data/<namespace>/<table>// to see HFile count per region. If major compactions are firing automatically, tune hbase.hregion.majorcompaction.jitter to stagger them, or disable automatic major compactions and schedule them manually during off-peak hours.
Symptom · 03
OutOfMemoryError in RegionServer JVM
Fix
MemStore or BlockCache is over-allocated. Verify that hbase.regionserver.global.memstore.size + hfile.block.cache.size does not exceed 0.80. On a 32GB heap, MemStore at 0.35 and BlockCache at 0.40 leaves 25% for JVM overhead — the minimum required. Reduce one or both allocations if they sum above 0.80.
Symptom · 04
RegionServer keeps crashing and regions are stuck in transition
Fix
Run hbase hbck -details to check for META inconsistencies. If META is corrupt, run hbase hbck -fixMeta -fixAssignments. Also check ZooKeeper session timeouts — GC pauses longer than 1 second cause the RegionServer to lose its ZK ephemeral node, triggering a crash cycle even if the JVM itself recovers.
Symptom · 05
Writes timing out with RegionTooBusyException
Fix
The MemStore flush queue is backed up. Check hbase.hstore.flusher.count — the default of 2 flusher threads is often insufficient for high write rates. Also check HDFS write throughput via DataNode metrics — if HDFS is slow, MemStore flushes queue behind it and eventually block incoming writes. Increase flusher threads or reduce hbase.hregion.memstore.flush.size to trigger more frequent, smaller flushes.
★ HBase Quick Debug Cheat SheetFast diagnostics for production HBase issues. Run these commands to confirm the root cause before making any configuration changes.
RegionServer unresponsive or repeatedly dying
Immediate action
Check GC pause duration and ZooKeeper session status
Commands
jstat -gcutil <rs_pid> 1000 10
hbase hbck -details 2>&1 | grep -i 'inconsist'
Fix now
If GC pause exceeds 1 second, increase -XX:G1ReservePercent and reduce MemStore share below 0.35. GC pauses longer than the ZK session timeout (default 90s divided by tick count) cause ephemeral node loss and trigger crash/recovery cycles.
Write throughput dropping — latency climbing above 50ms+
Immediate action
Check for write hotspot by comparing region sizes
Commands
hdfs dfs -du /hbase/data/<namespace>/<table>/ | sort -rn | head -10
echo "scan 'hbase:meta', {LIMIT => 50}" | hbase shell | grep -E 'regioninfo|server'
Fix now
If one region is 10x larger than the median, row keys are sequential. Pre-split the table with custom split keys and redesign the row key schema with a salt prefix or reversed timestamp before the next data load.
Read latency above 50ms for point lookups+
Immediate action
Check BlockCache hit ratio — below 85% means either the cache is undersized or scans are evicting hot data
Commands
curl -s http://<rs_host>:16030/jmx | python3 -c "import sys,json; d=json.load(sys.stdin); [print(k,v) for k,v in d.items() if 'blockCache' in k.lower()]"
echo "status 'detailed'" | hbase shell 2>&1 | grep -i 'blockcache\|compaction'
Fix now
If BlockCache hit ratio is below 85%, increase hfile.block.cache.size. If scans are the culprit (they bypass BlockCache by default), set setCacheBlocks(false) on Scan objects so they stop polluting the cache used by point lookups.
HBase vs Alternatives
AspectHBaseCassandraMongoDBGCP Bigtable
ArchitectureMaster-based: HMaster + RegionServers + ZooKeeper coordinationMasterless peer-to-peer: every node is equal, no SPOF equivalent to ZKReplica set with primary election; mongos router for sharded clustersFully managed: no servers, no compaction tuning, no ZooKeeper
Consistency modelStrong by default: single-row ACID, no tuning requiredTunable per operation: ONE (eventual) to ALL (strong); QUORUM is the common production choiceStrong with majority reads by default; configurable read/write concernStrong by default: single-row ACID, identical model to HBase
Data modelSorted wide-column families: sparse, versioned cells with TTLWide-column with CQL: more structured than HBase, less flexible than MongoDBDocument store: flexible BSON/JSON schema with nested documents and arraysSorted wide-column families: API-compatible with HBase, identical data model
Secondary indexesNone natively — use Apache Phoenix for SQL and secondary indexes at write amplification costBuilt-in materialized secondary indexes with automatic maintenanceRich index support: single-field, compound, text, geospatial, TTL indexesNone natively — same constraint as HBase
Query interfaceGet/Scan/Put API — no query language; HBase Shell for admin; Phoenix adds SQLCQL (Cassandra Query Language) — SQL-like with limitations on joins and aggregationRich aggregation pipeline, full-text search, geospatial queries, $lookup joinsRead/write API; integrates with Dataflow and BigQuery for analytics
Multi-datacenterNo native support — requires custom WAL replication or third-party CDCNative multi-DC replication with per-DC consistency configuration built into the core modelGlobal clusters with zone-aware reads/writes (Atlas); sharding adds complexityMulti-region replication with Bigtable replication — managed, not DIY
HDFS / batch integrationNative — HFiles live on HDFS; direct Spark/MapReduce access to raw data filesNone — own proprietary storage engine; Spark integration via connectorNone — own storage engine; Atlas Data Federation for analytics queriesGoogle Colossus internally; Dataflow and BigQuery integration for analytics
Operational complexityHigh: ZooKeeper quorum, HDFS NameNode HA, RegionServer GC tuning, compaction schedulingMedium: no ZooKeeper dependency; tuning required for compaction and repair; simpler failure modelLow to medium: Atlas managed option removes most operational burden; self-hosted adds complexityLow: fully managed service; no ZooKeeper, no compaction scheduling, no RegionServer sizing

Key takeaways

1
HBase layers a sorted, indexed key-value store on top of HDFS to provide sub-10ms random reads across petabytes
HDFS alone supports only sequential access, making point lookups impractical without the indexing layer HBase provides.
2
Row key design is the single most critical and irreversible decision
it simultaneously determines write distribution across RegionServers, read locality, and scan efficiency. Sequential keys create hotspots that no hardware addition can fix.
3
Compactions trade write amplification (I/O cost during compaction) for read amplification reduction (fewer HFiles to check per read). Major compactions can saturate disk I/O for hours on large regions
disable automatic scheduling above 1TB and schedule manually.
4
MemStore + BlockCache must sum to 80% or less of JVM heap
exceeding this causes Full GC pauses that trigger ZooKeeper session timeouts and cascading RegionServer crashes.
5
ZooKeeper is the true single point of failure in HBase
if ZooKeeper loses quorum, the entire cluster becomes unavailable for reads, writes, and admin operations. Monitor it more aggressively than any other component.

Common mistakes to avoid

5 patterns
×

Using sequential row keys such as timestamps or auto-increment IDs as the key prefix

Symptom
One RegionServer at 100% CPU handling all writes while the rest of the cluster sits idle. Write latency spikes from single-digit milliseconds to multiple seconds. The hot region never stabilises because every new write continues arriving in the same key range immediately after any split.
Fix
Reverse the timestamp bytes — use (Long.MAX_VALUE - timestamp) as the prefix instead of the raw timestamp. Alternatively, prepend a hash-based salt: (md5(naturalKey) % regionCount) as a zero-padded prefix. Pre-split the table into as many regions as RegionServers before the initial data load to avoid the reactive-split performance cliff.
×

Allocating MemStore and BlockCache memory that sums above 80% of JVM heap

Symptom
Frequent Full GC pauses lasting 5-15 seconds. RegionServer loses its ZooKeeper session during GC and is declared dead by the HMaster. Regions are reassigned and WAL replay begins, but the newly assigned RegionServers also start GC-pausing. The cluster enters a crash-recovery loop.
Fix
Ensure hbase.regionserver.global.memstore.size + hfile.block.cache.size <= 0.80. A safe starting point for a 32GB heap node: MemStore=0.35 (11.2GB), BlockCache=0.40 (12.8GB), sum=0.75 — leaving 25% for RPC handlers, GC internal structures, and JVM overhead. Use G1GC and tune -XX:G1ReservePercent=20 to reduce GC pause duration.
×

Not pre-splitting tables before bulk data loads

Symptom
All initial data lands in a single region. The region splits reactively when it exceeds the size threshold, but by then the RegionServer hosting it is already overwhelmed and load balancing takes hours to converge. Write throughput is a fraction of what the cluster can handle during the entire load window.
Fix
Pre-split the table at creation time: create 'table', SPLITS => ['10','20','30','40','50','60','70','80','90'] for a 10-region split based on expected key distribution. Use the RegionSplitter utility for hash-based splits when key distribution is unknown. Match the number of pre-split regions to the number of RegionServers for even initial distribution.
×

Allowing automatic major compactions to run during peak traffic hours

Symptom
Read latency spikes 10-50x every 7 days on a regular schedule. The spike correlates with major compaction activity visible in the HBase Master UI. A major compaction on a 200GB region reads and writes 200GB of data, competing with live reads and writes for disk I/O.
Fix
For clusters above 1TB, set hbase.hregion.majorcompaction to 0 to disable automatic scheduling. Schedule manual major compactions via admin.majorCompact() during the lowest-traffic window. If you keep automatic compaction enabled for smaller clusters, set hbase.hregion.majorcompaction.jitter to at least 0.5 to spread the load across a multi-day window.
×

Scanning tables without explicit startRow and stopRow boundaries

Symptom
Full table scan blocks RegionServer handler threads for minutes. Other clients queue behind the scan and begin timing out. The scan's sequential reads evict hot BlockCache data, causing point lookup latency to degrade for all other clients on the same server after the scan completes.
Fix
Always set startRow and stopRow on Scan objects — treat an unbounded scan as a code review failure. Set setCaching(100) to fetch 100 rows per RPC instead of the default 1. Set setCacheBlocks(false) on analytical scans so they do not pollute the BlockCache used by point lookups. For cluster-wide analytics, use Spark or MapReduce directly on HFiles instead of client Scan operations.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the HBase write path from client to disk. What happens if the Re...
Q02SENIOR
Why are sequential row keys bad in HBase, and how would you fix a table ...
Q03SENIOR
What is the difference between a minor and major compaction in HBase, an...
Q04SENIOR
How do bloom filters work in HBase, and when would you choose ROW versus...
Q05SENIOR
Your HBase cluster has 50 RegionServers but only 3 are handling writes. ...
Q06SENIOR
Compare HBase and Cassandra for a multi-datacenter e-commerce platform t...
Q01 of 06SENIOR

Explain the HBase write path from client to disk. What happens if the RegionServer crashes after writing to the WAL but before flushing the MemStore?

ANSWER
The write path has four steps. First, the client sends a Put to the RegionServer that owns the target row's region — it knows which server that is by looking up the hbase:meta table, which it caches locally. Second, the RegionServer appends the mutation to the Write-Ahead Log on HDFS. This is a sequential append to a file on a durable distributed filesystem — it survives RegionServer crashes. Third, the mutation is applied to the in-memory MemStore — a sorted skip list per column family per region. Fourth, the RegionServer sends acknowledgment to the client. If the RegionServer crashes after step 2 but before any MemStore flush to an HFile, the data is safe. When the HMaster detects the crash via ZooKeeper, it reassigns the dead server's regions to other RegionServers. Each recovering RegionServer replays the WAL for its newly assigned regions to reconstruct the MemStore state. The WAL replay guarantees that no acknowledged write is ever lost. The key insight is that WAL durability precedes client acknowledgment — those two facts together make HBase's write durability guarantee possible.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What happens when a RegionServer crashes?
02
Can HBase handle secondary indexes?
03
How does HBase handle region splits?
04
What is the difference between HBase and HDFS?
05
How do I monitor HBase health in production?
🔥

That's NoSQL. Mark it forged?

10 min read · try the examples if you haven't

Previous
Apache Kafka Basics
15 / 15 · NoSQL
Next
Database Normalization