HBase is a distributed, sorted, column-family store running on top of HDFS — designed for random read/write access at petabyte scale
Data is split into regions by row key ranges; each region is served by exactly one RegionServer at a time
Writes go to WAL then MemStore; flushed to immutable HFiles on HDFS when MemStore fills — WAL guarantees no acknowledged write is ever lost
Row key design is the single most critical decision — it determines read/write distribution across the cluster and cannot be changed without a full migration
Compactions merge HFiles to reclaim space and reduce read amplification — but major compactions can saturate disk I/O for hours on large regions
Biggest mistake: sequential row keys (timestamps, auto-increment) create a single hot region — the #1 cause of HBase performance failures in production
Plain-English First
Imagine a school with millions of students, each having a locker. Instead of one giant hallway with all the lockers in a straight line, the school splits the hallway into sections managed by different hall monitors. Each monitor knows exactly which locker numbers they're responsible for and can find any locker in their section instantly. HBase works exactly like that — your data is sorted by a unique key, sliced into ranges called regions, and each region is managed by a dedicated server. When you ask for a row, HBase goes straight to the right monitor without scanning every locker in the school. The catch: if you number your lockers with today's date as a prefix, every new locker ends up in the last section, overloading one hall monitor while the others stand around doing nothing. That's the hotspot problem, and it's the first thing you need to understand before writing a single byte to HBase.
Every week, systems like Facebook Messenger, Apache Phoenix, and large-scale e-commerce platforms serve billions of read and write requests against datasets that dwarf anything a relational database was built to handle. The secret weapon in many of these stacks is Apache HBase — a distributed, column-family-oriented store that runs on top of HDFS and borrows its core ideas from Google's Bigtable paper. When your time-series data grows past what PostgreSQL partitioning can stomach, or when you need sub-10ms random reads across petabytes of sparse data, HBase is the conversation you need to have.
HBase solves a specific and nasty problem: how do you get fast, consistent random reads and writes on a dataset so large it must be spread across hundreds of machines? HDFS on its own is sequential — great for MapReduce batch jobs, terrible for point lookups. HBase layers a sorted, indexed structure on top of HDFS so you can fetch a single row in milliseconds even when the table holds a trillion rows. It trades the flexibility of SQL for raw scalability, and understanding when that trade is worth making separates engineers who reach for HBase correctly from those who spend three months regretting the decision.
By the end of this article you'll understand how HBase physically stores data from the MemStore all the way down to HFiles on HDFS, why row key design determines whether your cluster thrives or dies, how compactions work and when they bite you in production, how bloom filters and the BlockCache keep reads fast, and when to choose a different tool entirely. Every section maps to a real production failure pattern — the kind that shows up at 2am when write latency spikes and one RegionServer is pegged at 100% CPU while 49 others sit idle.
HBase Architecture — From Client to HDFS
HBase has three core components: the HMaster, RegionServers, and ZooKeeper. Understanding what each one does — and critically, what it does NOT do — is the foundation for every production decision you'll make.
The HMaster manages cluster metadata: table creation and deletion, schema changes, region assignment, and load balancing. It does NOT participate in the read or write data path at all. If the HMaster dies, the cluster continues serving reads and writes without interruption. You just cannot create tables, alter schemas, or rebalance regions until a new master is elected. This is a deliberate design choice — the HMaster is intentionally not a bottleneck for data operations.
RegionServers are the workhorses. Each RegionServer hosts multiple regions and handles all client read/write requests for those regions. Each region covers a contiguous range of sorted row keys and contains one or more column families. Each column family maintains its own MemStore — an in-memory sorted write buffer — and its own set of HFiles on HDFS.
ZooKeeper is the coordination layer and the true single point of failure in the cluster. It tracks which server is the active HMaster, which RegionServer owns which region, and detects RegionServer failures through ephemeral node heartbeats. If ZooKeeper loses quorum, the entire cluster becomes unavailable — no reads, no writes, no admin operations. This is the component that warrants the most aggressive monitoring in production.
The write path is straightforward but the ordering matters for reasoning about durability. The client sends a Put to the RegionServer responsible for the target row's region. The RegionServer writes to the WAL first — this is the durability guarantee. Then it writes to the in-memory MemStore. The client receives acknowledgment at this point. When the MemStore reaches its size threshold (default 128MB), it flushes to an immutable HFile on HDFS. The WAL is the safety net: if the JVM crashes after the WAL write but before the HFile flush, the data is recovered by replaying the WAL when the region is reassigned.
The read path is a layered cache hierarchy. The RegionServer checks the MemStore first (contains the most recent data for unflushed writes), then checks the BlockCache (recently accessed HFile blocks cached in heap), then checks HFiles on disk. For HFiles, bloom filters are consulted first to skip files that cannot contain the requested row, then the block index locates the specific block, and finally the block is read from HDFS or the OS page cache.
package io.thecodeforge.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;
/**
* Production-grade HBase client demonstrating the read/write path.
*
* Key production rules:
* 1. Connection is heavyweight and thread-safe — create once, share across threads
* 2. Table is lightweight — obtain per-operation, close after use
* 3. Always set retries and pause — transientRegionServer failures are normal
* 4. Always set scan caching — default of 1 row per RPC is catastrophically slow
*/
publicclassHBaseArchitectureClientimplementsAutoCloseable {
privatefinalConnection connection;
privatefinalAdmin admin;
publicHBaseArchitectureClient(String zkQuorum, int zkPort) throwsIOException {
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", zkQuorum);
conf.setInt("hbase.zookeeper.property.clientPort", zkPort);
// Retry on transient failures: up to 3 retries with 1s pause between
conf.setInt("hbase.client.retries.number", 3);
conf.setInt("hbase.client.pause", 1000);
// Operation timeout — fail fast rather than queue behind a slow RegionServer
conf.setInt("hbase.rpc.timeout", 5000); // 5s per individual RPC
conf.setInt("hbase.client.operation.timeout", 15000); // 15s total including retries// Connection is thread-safe and maintains a connection pool to each RegionServer// One Connection per application process is the correct patternthis.connection = ConnectionFactory.createConnection(conf);
this.admin = connection.getAdmin();
}
/**
* Write path: Put -> WAL -> MemStore -> (eventually) HFile on HDFS
*
* TheWAL write happens before the client receives acknowledgment.
* If the RegionServer crashes before MemStore flush, WAL replay recovers the data.
* No acknowledged write is ever lost, even in a hard crash scenario.
*/
publicvoidwriteRow(String tableName, String rowKey,
String columnFamily, String qualifier, byte[] value)
throwsIOException {
// Table is lightweight — safe to create per-operation; always close after usetry (Table table = connection.getTable(TableName.valueOf(tableName))) {
Put put = newPut(Bytes.toBytes(rowKey));
put.addColumn(
Bytes.toBytes(columnFamily),
Bytes.toBytes(qualifier),
value
);
// setDurability controls WAL behaviour:// SYNC_WAL (default) — fsync WAL before ack — safest// ASYNC_WAL — buffer WAL writes — faster, small loss window// SKIP_WAL — no WAL — fastest, data loss on crash
put.setDurability(Durability.SYNC_WAL);
table.put(put);
}
}
/**
* Read path: Get -> MemStore -> BlockCache -> Bloom filter -> HFile block index -> HDFS
*
* Sub-10ms latency when the row is in MemStore (unflushed write) or BlockCache.
* Coldreads (data not in any cache) depend on HDFS read latency — typically 20-50ms.
*/
publicbyte[] readRow(String tableName, String rowKey,
String columnFamily, String qualifier)
throwsIOException {
try (Table table = connection.getTable(TableName.valueOf(tableName))) {
Get get = newGet(Bytes.toBytes(rowKey));
get.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(qualifier));
// setCacheBlocks(true) is the default — hot rows stay in BlockCache// Set to false for analytical/scan workloads to avoid evicting hot data
get.setCacheBlocks(true);
Result result = table.get(get);
if (result.isEmpty()) {
return null; // Row does not exist — always check before unwrapping
}
return result.getValue(
Bytes.toBytes(columnFamily),
Bytes.toBytes(qualifier)
);
}
}
/**
* Scan with explicit start/stop row — the correct production scan pattern.
*
* Never scan without start/stop rows unless you intend a full table scan.
* Full table scans block RegionServer threads for minutes and evict hot
* BlockCache data, degrading point lookup latency for all other clients.
*/
publicvoidscanRange(String tableName, String startKey, String stopKey)
throwsIOException {
try (Table table = connection.getTable(TableName.valueOf(tableName))) {
Scan scan = newScan()
.withStartRow(Bytes.toBytes(startKey), true) // inclusive
.withStopRow(Bytes.toBytes(stopKey), false) // exclusive
.setCaching(100) // 100 rows per RPC — reduces round trips
.setMaxResultSize(2 * 1024 * 1024L) // 2MB max per RPC response
.setCacheBlocks(false); // scans read sequentially — don't pollute BlockCachetry (ResultScanner scanner = table.getScanner(scan)) {
for (Result result : scanner) {
byte[] rowKey = result.getRow();
// process each row here — scanner handles region boundaries transparently
_ = rowKey; // suppress unused warning in example
}
}
}
}
@Overridepublicvoidclose() throwsIOException {
admin.close();
connection.close();
}
}
The HBase Write Path — Durability First, Performance Second
Client sends Put to the RegionServer owning the target row's region — ZooKeeper tells the client which server that is
RegionServer appends to the WAL before acknowledging — guarantees the write survives a hard JVM crash
Write is applied to the MemStore — a sorted skip list in heap memory, one per column family per region
Client receives acknowledgment — the data is durable at this point, regardless of whether a flush has occurred
When MemStore reaches 128MB (default), it flushes to an HFile on HDFS — immutable, sorted, compressed, replicated
Multiple HFiles accumulate over time; compaction merges them periodically to limit read amplification
Production Insight
HMaster death does not stop reads and writes — only region rebalancing halts until a new master is elected.
ZooKeeper quorum loss stops everything — no reads, no writes, no admin operations. It is the true single point of failure.
Rule: run ZooKeeper on dedicated nodes with SSD storage, a stable JVM (no large heap), and monitor session latency aggressively. ZK sessions should complete in under 10ms on a healthy cluster.
Key Takeaway
HMaster manages metadata only — it is not in the data path and its failure does not stop reads or writes.
RegionServers handle all data operations; ZooKeeper coordinates everything and is the true cluster SPOF.
Rule: monitor ZooKeeper session latency and node health more aggressively than any other component — a flapping ZK causes cascading RegionServer crashes.
HBase Component Failure Impact
IfHMaster dies
→
UseCluster continues serving reads and writes; no new table creation, schema changes, or region rebalancing until a standby master is elected — typically within 30 seconds with HA master configuration
IfSingle RegionServer dies
→
UseIts regions are reassigned to other servers by the HMaster; WAL replay reconstructs MemStore data; recovery takes 30-90 seconds; those regions are unavailable during recovery
IfZooKeeper quorum is lost (majority of ZK nodes fail)
→
UseEntire cluster becomes unavailable immediately — no reads, no writes, no admin operations; restore ZK quorum as the first priority
IfHDFS NameNode becomes unavailable
→
UseHFile reads degrade or fail; new MemStore flushes cannot complete; cluster degrades progressively until HDFS recovers or NameNode failover completes
IfNetwork partition between client and RegionServer
→
UseClient retries with exponential backoff up to hbase.client.retries.number; if all retries exhausted, the operation fails with an IOException — not a silent failure
Row Key Design — The Single Most Important Decision
Row key design in HBase is the most consequential architectural decision you will make, and it is the one you cannot change without a full data migration. Unlike a relational database where you can add an index after the fact, HBase has no secondary indexes. The row key is the only native index. It determines three things simultaneously: how writes are distributed across RegionServers, which rows can be read with a single efficient range scan, and how much data each region holds.
HBase stores rows sorted lexicographically by row key. Adjacent keys land in the same region and can be fetched with a single range scan — this is the key locality property that makes HBase efficient for time-series queries like 'give me all readings for device X in the last hour.' But lexicographic sorting also means that sequential keys funnel all new writes to a single region — the rightmost one, which covers the highest current key range. No amount of RegionServer hardware fixes this because the problem is structural: the data model directs all writes to one place.
The three production-proven strategies for write distribution:
Salt prefix: prepend a one-byte hash-derived salt (md5(naturalKey) % regionCount) to the row key. Every key gets distributed across all regions. The trade-off is that range scans on the natural key become multi-region scatter-gather operations — you have to scan all regions in parallel and merge results in the client. Acceptable for write-heavy workloads where scans are rare.
Reversed timestamp: for time-series data, format the key as {Long.MAX_VALUE - timestamp}_{deviceId} instead of {timestamp}_{deviceId}. Reversed timestamps decrease as real time increases, so each new write is directed toward lower key space — spreading across regions rather than always hitting the rightmost one. The bonus: a range scan for 'the last N records' now returns contiguous data instead of requiring a reverse scan.
Hash prefix: prepend a fixed-length hash of the natural key. Deterministic — the same natural key always maps to the same region. Good for Get-heavy workloads where you always know the full key. No range scan capability on the natural key.
The row key must also be as short as practical — under 100 bytes is the guideline. The row key is stored verbatim in every HFile block that contains that row, and it appears in every column qualifier's KeyValue metadata. A 500-byte row key on a table with 20 column qualifiers wastes 10KB per row just in key storage overhead, before any actual data.
io/thecodeforge/hbase/RowKeyDesign.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
package io.thecodeforge.hbase;
import org.apache.hadoop.hbase.util.Bytes;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
/**
* Production row key design patterns forHBase.
*
* Each pattern makes a different trade-off between write distribution,
* range scan efficiency, and deterministic lookup. Choose based on
* your dominant access pattern — not based on what looks simplest.
*
* Decision guide:
* Write-heavy + rare scans → salt prefix
* Time-series + recent reads → reversed timestamp
* Read-heavy + point lookups → hash prefix
* Multi-dimensional queries → composite key
*/
publicclassRowKeyDesign {
/**
* Pattern1: Salt prefix for write distribution.
*
* Distributes writes uniformly across all regions.
* Trade-off: range scans on the natural key require querying all regions
* in parallel and merging results in the client — scatter-gather.
*
* Use when: write throughput is the constraint and scans are rare or
* can tolerate multi-region fan-out.
*/
publicstaticbyte[] saltedKey(String naturalKey, int regionCount) {
// Ensure consistent positive modulus regardless of hashCode signint salt = (naturalKey.hashCode() & 0x7FFFFFFF) % regionCount;
// Zero-pad salt so lexicographic ordering of region ranges works correctlyreturnBytes.toBytes(String.format("%02d_%s", salt, naturalKey));
}
/**
* Pattern2: Reversed timestamp for time-series write distribution.
*
* Reversed timestamps decrease as real time increases, directing each
* new write toward lower key space rather than always to the rightmost region.
*
* Bonus: 'latest N records for deviceId' scan reads contiguous rows
* because recent data is now at lower keys, not scattered.
*
* Use when: ingesting time-series data (IoT, logs, events) and recent-data
* reads are more important than historical range scans.
*/
publicstaticbyte[] reversedTimestampKey(long timestampMillis, String deviceId) {
long reversedTs = Long.MAX_VALUE - timestampMillis;
// 16 hex chars = 8 bytes = full long range, zero-padded for correct lex sortreturnBytes.toBytes(String.format("%016x_%s", reversedTs, deviceId));
}
/**
* Pattern3: Hash prefix for deterministic distribution.
*
* Same natural key always maps to the same region — lookups are fast
* because you can reconstruct the hash from the natural key.
* Trade-off: no range scan on natural key order; you lose locality.
*
* Use when: the workload is predominantly point Gets and range scans
* on the natural key are not needed.
*/
publicstaticbyte[] hashPrefixedKey(String naturalKey) {
String prefix = md5HexPrefix(naturalKey, 4); // 4-char prefix = 16^4 = 65536 bucketsreturnBytes.toBytes(String.format("%s_%s", prefix, naturalKey));
}
/**
* Pattern4: Composite key for multi-dimensional access patterns.
*
* Encodes multiple fields at fixed widths so lexicographic ordering
* enables efficient range scans on the primary dimension while
* filtering on secondary dimensions within the scan.
*
* Example: scan all events for US_EAST in reverse-chronological order
* by setting startRow=US_EAST_<Long.MAX_VALUE> stopRow=US_EAST_<0>
*
* Use when: queries always filter on a primary dimension (region, tenant,
* category) and sort or range-scan on a secondary dimension (time, score).
*/
publicstaticbyte[] compositeKey(String regionCode, long timestampMillis, String userId) {
// Fixed-width region code ensures correct lexicographic boundariesString paddedRegion = String.format("%-6s", regionCode).replace(' ', '0');
String reversedTs = String.format("%016x", Long.MAX_VALUE - timestampMillis);
returnBytes.toBytes(String.format("%s_%s_%s", paddedRegion, reversedTs, userId));
}
privatestaticStringmd5HexPrefix(String input, int hexChars) {
try {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] hash = md.digest(Bytes.toBytes(input));
StringBuilder sb = newStringBuilder();
for (byte b : hash) {
sb.append(String.format("%02x", b));
if (sb.length() >= hexChars) break;
}
return sb.substring(0, hexChars);
} catch (NoSuchAlgorithmException e) {
thrownewRuntimeException("MD5 unavailable — should never happen on a JVM", e);
}
}
}
The #1 HBase Mistake: Sequential Row Keys
Using a timestamp or auto-increment value as the row key prefix creates a write hotspot that no amount of hardware can fix. All writes go to the rightmost region; one RegionServer handles 100% of write traffic while every other server sits idle. This is the most common cause of HBase production incidents and it cannot be fixed without a full data migration. Design the row key correctly before writing a single byte. Retrofitting a row key schema means exporting the entire dataset, transforming every key, and reimporting — which is weeks of work on a petabyte-scale table.
Production Insight
A telemetry pipeline used {timestamp}_{deviceId} as the row key. One RegionServer handled 100% of writes.
49 other RegionServers sat idle. Write latency spiked from 2ms to 8 seconds as MemStore filled on the hot server.
Rule: never use monotonically increasing values as row key prefixes. Reverse them or add a salt prefix before the first production write.
Key Takeaway
Row key design determines write distribution, read efficiency, and scan locality — all three simultaneously, with no way to change it post-deployment without a migration.
Sequential keys create hotspots; reversed timestamps and salt prefixes solve this before the first write.
Rule: treat the row key schema as a permanent architectural decision. Sketch expected access patterns before creating the table.
Row Key Strategy Selection
IfWrite throughput is the constraint and range scans are rare or can tolerate multi-region scatter-gather
IfTime-series data with recent-data-heavy reads and ingest at high rate
→
UseReversed timestamp: (Long.MAX_VALUE - ts)_deviceId — spreads writes, recent data clustered for efficient scans
IfGet-heavy workload where you always know the full natural key
→
UseHash prefix: md5prefix(naturalKey)_naturalKey — deterministic, no range scan on natural key order
IfMulti-dimensional queries: scan by primary dimension, filter by secondary
→
UseComposite key: fixedWidthPrimary_reversedTimestamp_secondaryKey — enables range scans on primary while sorting by secondary
IfRow key exceeds 100 bytes
→
UseShorten it — key storage overhead multiplies by number of columns per row and appears in every HFile block
Compactions — The Silent Performance Killer
Compactions are HBase's mechanism for managing HFile accumulation. Every MemStore flush produces a new HFile on HDFS. Without any merging, a region that has been written to heavily could accumulate hundreds of HFiles, and every read would need to check each one for the requested row. Compaction solves this by periodically merging HFiles into fewer, larger ones — trading I/O cost during compaction for lower read amplification afterward.
Minor compaction is lightweight and runs continuously. It fires automatically when the number of HFiles in a column family store exceeds hbase.hstore.compactionThreshold (default 3). It picks a small batch of adjacent HFiles and merges them. Minor compaction does not remove deleted or expired cells — those tombstone markers remain until a major compaction. Minor compactions are safe to let run during business hours; they consume modest I/O and complete quickly.
Major compaction is the expensive one. It merges all HFiles in a store into a single HFile, permanently removes deleted cells and cells past their TTL, and rewrites the entire column family from scratch. For a region with 200GB of data, a major compaction reads 200GB from HDFS and writes 200GB back. On a cluster already under read and write load, this I/O competition can push latency from 5ms to 500ms. By default, HBase schedules automatic major compactions every 7 days (hbase.hregion.majorcompaction) with a random jitter (hbase.hregion.majorcompaction.jitter) to prevent all regions compacting simultaneously.
The production trap is that automatic major compactions will eventually collide with peak traffic hours, especially as the cluster grows and 7 days worth of accumulated HFiles becomes significant data. The best practice for large production clusters is to disable automatic major compactions entirely (set hbase.hregion.majorcompaction to 0) and schedule them through an external job during the lowest-traffic window — typically weekend nights or the quietest hours of the day. For time-series data with a TTL, major compactions are also the only mechanism that reclaims space from expired cells, so you cannot skip them indefinitely.
io/thecodeforge/hbase/CompactionManager.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
package io.thecodeforge.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import java.io.IOException;
import java.util.Collection;
import java.util.Map;
/**
* Production compaction management — manual triggering, throughput throttling,
* and configuration tuning to prevent compactions from overwhelming live traffic.
*
* The core rule: never let automatic major compactions run during peak hours.
* For clusters serving live traffic, manual scheduling during off-peak windows
* is the only safe approach at scale.
*/
publicclassCompactionManagerimplementsAutoCloseable {
privatefinalAdmin admin;
privatefinalConnection connection;
publicCompactionManager(String zkQuorum) throwsIOException {
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", zkQuorum);
this.connection = ConnectionFactory.createConnection(conf);
this.admin = connection.getAdmin();
}
/**
* Trigger a major compaction on a specific table.
* Callthis during a scheduled maintenance window, not during peak hours.
* Major compaction rewrites all HFiles — I/O impact is proportional to region size.
*/
publicvoidmajorCompactTable(String tableName) throwsIOException {
System.out.printf("Triggering major compaction on table: %s%n", tableName);
admin.majorCompact(TableName.valueOf(tableName));
System.out.println("Major compaction submitted — runs asynchronously in the background");
}
/**
* Trigger a minor compaction on a specific table.
* Safe to run during business hours — merges a few HFiles with modest I/O.
* DoesNOT remove deleted cells or reclaim space from expired TTL data.
*/
publicvoidminorCompactTable(String tableName) throwsIOException {
admin.compact(TableName.valueOf(tableName));
}
/**
* Check compaction queue depth per RegionServer.
* A sustained queue > 5 indicates the RegionServer's disk I/O cannot keep up
* with the incoming flush rate — time to tune or add capacity.
*/
publicvoidprintCompactionQueueDepth() throwsIOException {
Map<ServerName, RegionMetrics> clusterMetrics =
admin.getClusterMetrics().getLiveServerMetrics()
.entrySet().stream()
.collect(java.util.stream.Collectors.toMap(
Map.Entry::getKey,
e -> e.getValue().getRegionMetrics()
.values().stream().findFirst().orElse(null)
));
// In practice, iterate ServerMetrics for compactionQueueSize
admin.getClusterMetrics().getLiveServerMetrics().forEach((server, metrics) -> {
long compactionQueueSize = metrics.getCompactionQueueSize();
System.out.printf(" %s compaction queue: %d%n",
server.getHostname(), compactionQueueSize);
});
}
/**
* Returns the recommended configuration for compaction tuning.
*
* Key decisions:
* - Disable automatic major compactions for clusters > 1TB
* - Throttle compaction throughput to leave headroom for live I/O
* - Increase minor compaction threshold to reduce frequency
*/
publicstaticConfigurationgetCompactionTunedConfig() {
Configuration conf = HBaseConfiguration.create();
// ── Minor compaction tuning ──────────────────────────────────────// Trigger minor compaction when 4+ HFiles exist per store (default: 3)// Higher threshold = less frequent compaction, more read amplification
conf.setInt("hbase.hstore.compactionThreshold", 4);
// Max HFiles merged in one minor compaction batch (default: 10)// Smaller batches = faster individual compactions, more total I/O over time
conf.setInt("hbase.hstore.compaction.max", 6);
// ── Major compaction tuning ──────────────────────────────────────// Disable automatic major compactions for large clusters// Set to 0 and schedule manually during off-peak windows
conf.setLong("hbase.hregion.majorcompaction", 0L); // 0 = disabled// If you do leave automatic major compaction enabled, add jitter// to prevent all regions from compacting simultaneously// conf.setLong("hbase.hregion.majorcompaction", 7 * 24 * 3600 * 1000L);// conf.setFloat("hbase.hregion.majorcompaction.jitter", 0.5f);// ── I/O throttling ───────────────────────────────────────────────// Limit compaction throughput to leave headroom for live reads/writes// 20MB/s per store is conservative; tune up if disk throughput allows
conf.setLong("hbase.hstore.compaction.throughput.lower.bound",
20L * 1024 * 1024); // 20 MB/s floor
conf.setLong("hbase.hstore.compaction.throughput.higher.bound",
40L * 1024 * 1024); // 40 MB/s ceilingreturn conf;
}
@Overridepublicvoidclose() throwsIOException {
admin.close();
connection.close();
}
}
Compaction Trade-offs — Read Amplification vs Write Amplification
Minor compaction: merges a few adjacent HFiles — lightweight, runs continuously, does not remove deleted cells, safe during business hours
Major compaction: rewrites ALL HFiles into one — heavy I/O, removes tombstones and TTL-expired cells, reclaims HDFS space
Without any compaction: reads degrade linearly with HFile count — each read must check every HFile via bloom filters
With unthrottled major compaction during peak hours: disk I/O saturation drives read latency from 5ms to 500ms
Rule: disable automatic major compactions on clusters above 1TB and schedule them manually during the quietest traffic window
Production Insight
A 50-node cluster ran automatic major compactions on the default 7-day schedule with no jitter.
All regions compacted simultaneously on the same day, saturating every DataNode's disk for 4 hours.
Read latency spiked from 8ms to 600ms. Setting hbase.hregion.majorcompaction.jitter=0.5 distributes the load over a 3-day window.
Rule: if you use automatic major compaction, always set jitter to at least 0.5 — never leave it at 0.
Key Takeaway
Minor compaction is safe during business hours; major compaction can saturate disk I/O for hours on large regions.
Disable automatic major compaction on clusters above 1TB and schedule it manually during off-peak windows.
Compaction is the only mechanism that physically removes deleted cells and TTL-expired data — it cannot be skipped indefinitely.
Compaction Configuration Strategy
IfCluster is under 100GB total data, development or staging environment
→
UseLeave automatic major compaction enabled at default 7-day interval — the I/O cost is negligible
IfCluster is between 100GB and 1TB in production
→
UseEnable automatic major compaction but set jitter=0.5 and throttle throughput to 40MB/s to avoid peak-hour collisions
IfCluster is above 1TB in production with live traffic SLOs
→
UseDisable automatic major compaction (set to 0) and schedule manual compactions during off-peak windows via admin.majorCompact()
IfTable uses TTL for data expiry (time-series, event logs)
→
UseMajor compaction is the only mechanism that physically removes expired cells — run it at least once per TTL period, even if manually scheduled
UseDisk I/O is undersized for the write rate — increase hbase.hstore.flusher.count, reduce MemStore flush size, or add DataNodes
Bloom Filters and BlockCache — Keeping Reads Fast
Two mechanisms keep HBase read latency in the millisecond range: bloom filters that skip irrelevant HFiles before reading any disk data, and the BlockCache that keeps recently accessed HFile blocks in heap memory.
Bloom filters are probabilistic data structures stored within each HFile. They answer the question 'does this HFile possibly contain row key X?' with certainty in the negative direction — if the bloom filter says no, the RegionServer skips that HFile entirely. If it says yes, the HFile might contain the row (with a configurable false positive rate, typically 1%). For a table with 20 HFiles per region, bloom filters can eliminate 15-19 disk reads per Get request. The space cost is roughly 120MB of bloom filter data per 100 million keys at 1% false positive rate.
HBase offers two bloom filter types. ROW bloom filters check only the row key — the right choice for workloads dominated by point Gets on narrow rows with few columns. ROWCOL bloom filters check both the row key and the column qualifier — useful for wide rows where you frequently fetch specific columns and want to skip HFiles that don't contain that column. ROWCOL uses 3-5x more memory per HFile than ROW. For scan-heavy workloads, bloom filters provide no benefit because range scans must read all blocks in sequence regardless — use NONE to save the memory.
The BlockCache keeps recently accessed HFile blocks in RegionServer heap memory. When a Get or Scan reads a block from HDFS, that block is placed in the BlockCache. Subsequent reads for rows in the same block find it in cache and skip the HDFS read entirely — this is how HBase achieves sub-millisecond reads for hot data. The BlockCache has two optional tiers: an on-heap L1 cache and an off-heap L2 bucket cache for larger working sets without JVM GC pressure.
The configuration rule that causes the most production incidents is the memory budget. The RegionServer's JVM heap must support MemStore, BlockCache, RPC handlers, GC overhead, and JVM internal structures simultaneously. The constraint is firm: hbase.regionserver.global.memstore.size + hfile.block.cache.size must not exceed 0.80. If they sum above 0.80, the remaining heap is insufficient for GC metadata and RPC handler threads, leading to Full GC pauses that exceed ZooKeeper's session timeout and cause RegionServer crashes.
io/thecodeforge/hbase/ReadTuningConfig.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
package io.thecodeforge.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import java.io.IOException;
/**
* Production read performance tuning forHBase.
*
* Covers:
* - BlockCache and MemStore memory budget (the 80% rule)
* - Bloom filter type selection per column family
* - Off-heap L2 bucket cache for larger working sets
* - Scan caching and result size limits
* - Column family compression
*
* The80% rule is non-negotiable:
* hfile.block.cache.size + hbase.regionserver.global.memstore.size <= 0.80
*/
publicclassReadTuningConfig {
/**
* Returns the recommended RegionServer configuration for a 32GB heap node.
*
* Memory budget:
* BlockCache : 40% of 32GB = 12.8GB (on-heap L1)
* MemStore : 35% of 32GB = 11.2GB
* Total : 75% (leaves 25% forGC, RPC handlers, JVM internals)
*
* With 8GB off-heap L2 bucket cache, effective cache = 12.8 + 8 = 20.8GB
* without increasing GC pressure.
*/
publicstaticConfigurationgetReadOptimizedConfig() {
Configuration conf = HBaseConfiguration.create();
// ── On-heap BlockCache (L1) ──────────────────────────────────────// 40% of heap — primary cache for hot data
conf.setFloat("hfile.block.cache.size", 0.40f);
// ── MemStore ─────────────────────────────────────────────────────// 35% of heap — write buffer across all regions on this server// Triggers a global flush if total MemStore usage exceeds this
conf.setFloat("hbase.regionserver.global.memstore.size", 0.35f);
// Verify: 0.40 + 0.35 = 0.75 <= 0.80 — within the safe budget// ── Off-heap L2 Bucket Cache ─────────────────────────────────────// Stores larger working sets without contributing to GC pressure// Data is serialised to native memory — slightly slower than L1 but// allows a much larger effective cache without increasing -Xmx
conf.set("hbase.bucketcache.ioengine", "offheap");
conf.setLong("hbase.bucketcache.size", 8192L); // 8192 MB = 8GB off-heap// ── Scan optimisation ────────────────────────────────────────────// Rows fetched per RPC during a Scan — default of 1 is catastrophically slow// 100-500 is reasonable; tune based on average row size
conf.setInt("hbase.client.scanner.caching", 100);
// Max bytes returned per scanner RPC — prevents OOM on wide rows
conf.setLong("hbase.client.scanner.max.result.size", 2L * 1024 * 1024); // 2MBreturn conf;
}
/**
* Creates a table with per-column-family bloom filter and compression settings.
*
* Bloom filter type is set per column family because different families
* can have fundamentally different access patterns on the same table.
*/
publicstaticvoidcreateOptimisedTable(Connection connection,
String tableName) throwsIOException {
try (Admin admin = connection.getAdmin()) {
HTableDescriptor tableDescriptor =
newHTableDescriptor(TableName.valueOf(tableName));
// Column family for point-lookup-heavy data — use ROW bloom filter// ROW: probabilistic skip on row key only// Low memory cost: ~1.2 bytes per unique row key at 1% false positiveHColumnDescriptor metadataFamily = newHColumnDescriptor("meta");
metadataFamily.setBloomFilterType(BloomType.ROW);
metadataFamily.setCompressionType(Compression.Algorithm.SNAPPY);
metadataFamily.setMaxVersions(1); // time-series: keep only latest
metadataFamily.setTimeToLive(86400 * 90); // 90-day TTL
tableDescriptor.addFamily(metadataFamily);
// Column family for wide rows with selective column reads — use ROWCOL// ROWCOL: probabilistic skip on row key + column qualifier// Higher memory cost: 3-5x more than ROW — only justified for wide rowsHColumnDescriptor attributesFamily = newHColumnDescriptor("attrs");
attributesFamily.setBloomFilterType(BloomType.ROWCOL);
attributesFamily.setCompressionType(Compression.Algorithm.LZ4);
attributesFamily.setMaxVersions(3); // retain last 3 versions
tableDescriptor.addFamily(attributesFamily);
// Column family for scan-heavy analytical data — no bloom filter// Bloom filters provide no benefit for range scans — skip the memory costHColumnDescriptor metricsFamily = newHColumnDescriptor("metrics");
metricsFamily.setBloomFilterType(BloomType.NONE);
metricsFamily.setCompressionType(Compression.Algorithm.GZ); // higher ratio for cold data
metricsFamily.setMaxVersions(1);
tableDescriptor.addFamily(metricsFamily);
admin.createTable(tableDescriptor);
}
}
/**
* Selects the appropriate bloom filter type based on column family access pattern.
* Usethis as a decision aide when designing a new column family schema.
*/
publicstaticBloomTypeselectBloomType(boolean isGetHeavy,
boolean hasWideRows,
boolean isScanHeavy) {
if (isScanHeavy) {
return BloomType.NONE; // scans read all blocks — bloom filters are wasted memory
}
if (isGetHeavy && hasWideRows) {
return BloomType.ROWCOL; // skip HFiles missing the specific column — expensive but targeted
}
if (isGetHeavy) {
return BloomType.ROW; // skip HFiles missing the row — cheap and effective
}
return BloomType.NONE; // default for mixed or unknown patterns
}
}
Pro Tip: Bloom Filter Memory Budget
Bloom filter data is stored inside HFiles and loaded into memory when the HFile is opened. For a 100M-row HFile with 1% false positive rate, the bloom filter consumes roughly 120MB. If you have 10 HFiles per region and 200 regions per server, that's 240GB of bloom filter data — which clearly cannot all live in memory simultaneously. HBase loads bloom filter chunks lazily and caches them in the BlockCache. If bloom filter data is evicting hot row data from the BlockCache, reduce the false positive rate (costs more memory per HFile) or switch from ROWCOL to ROW. Use ROWCOL only when the column-specific filtering genuinely reduces disk reads enough to justify the 3-5x memory overhead.
Production Insight
A 16GB RegionServer was configured with MemStore at 0.40 and BlockCache at 0.45 — summing to 0.85.
Full GC pauses of 12-15 seconds caused ZooKeeper session timeouts and triggered RegionServer crashes.
The crash-recovery cycle repeated every 20-30 minutes, making the cluster unusable.
Fix: reduce MemStore to 0.35, BlockCache to 0.40 — sum 0.75 — leaving 25% for JVM internals. Crashes stopped immediately.
Key Takeaway
Bloom filters skip irrelevant HFiles before reading any disk data — transforming O(HFileCount) disk reads into O(1) for cache-cold Gets.
BlockCache keeps recently read HFile blocks in heap — hot data stays at sub-millisecond read latency.
MemStore + BlockCache must sum to <= 0.80 of heap — this is the most commonly violated configuration rule and the most reliable cause of RegionServer crashes.
Bloom Filter Type Selection
IfWorkload is Get-heavy with narrow rows (few columns per row)
→
UseUse ROW bloom filter — filters on row key, low memory overhead, effective for point lookups
IfWorkload is Get-heavy with wide rows (many columns) and reads target specific columns
→
UseUse ROWCOL bloom filter — filters on row key + column qualifier, 3-5x more memory but avoids reading HFiles missing the target column
IfWorkload is scan-heavy (range scans, full family reads)
→
UseUse NONE — range scans must read all blocks sequentially; bloom filters consume memory without reducing any disk reads
IfRegionServer memory is constrained (under 16GB heap)
→
UseUse ROW or NONE — ROWCOL memory overhead may not be justified; monitor BlockCache eviction rates
IfHFile has very few unique row keys (under 10,000)
→
UseUse NONE — bloom filter overhead exceeds the benefit; the small HFile block index is faster than a bloom filter lookup
HBase vs Alternatives — When to Use What
HBase is a powerful and operationally demanding tool. Choosing it for the wrong problem is a multi-month mistake. Understanding where it fits relative to alternatives is as important as understanding HBase itself.
HBase vs Cassandra: both are wide-column stores inspired by Google's Bigtable paper, but they make fundamentally different architectural trade-offs. Cassandra is masterless and peer-to-peer — every node is equal, and there is no single point of failure equivalent to ZooKeeper. Cassandra supports per-query consistency tuning: write at ONE for maximum throughput, read at QUORUM for strong consistency, or mix them per operation. It also has native multi-datacenter replication built into its core data model. HBase uses a master-based architecture with strong consistency by default and no native multi-datacenter support. Choose Cassandra when you need multi-datacenter active-active replication, tunable consistency on a per-operation basis, or a masterless topology that has no ZooKeeper equivalent as a single point of failure. Choose HBase when you need strong consistency guaranteed by default, coprocessor-based server-side computation, or tight integration with the Hadoop ecosystem for batch analytics on the same data (HBase + Spark/MapReduce on HFiles).
HBase vs MongoDB: MongoDB is a document store with a flexible JSON schema, a rich aggregation query language, and built-in secondary indexes. HBase has none of those things — it is a sorted key-value store with a fixed column-family schema, no query language beyond Get/Scan/Put, and no secondary indexes. Use MongoDB when your access patterns require ad-hoc queries, aggregation pipelines, or text search. Use HBase when you need predictable low-latency random reads across petabytes of sparse data with simple key-based access and no query flexibility required.
HBase vs Google Cloud Bigtable: Bigtable is the managed version of the same architecture. The API is largely compatible, the data model is identical, and the operational model is dramatically simpler — no ZooKeeper to manage, no RegionServers to tune, no compactions to schedule. If you are on GCP and your dataset warrants HBase, Bigtable is almost always the right choice unless you have a specific requirement for on-premises deployment, multi-cloud portability, or HDFS data locality for co-located batch jobs.
When NOT to Use HBase
HBase is the wrong choice when: (1) your dataset fits comfortably in a single PostgreSQL or MySQL instance — the operational burden of ZooKeeper, HDFS, and RegionServer tuning is not justified; (2) you need ad-hoc SQL queries — use Apache Phoenix on HBase if you must, but consider whether a purpose-built analytical store (BigQuery, Redshift, Snowflake) would serve better; (3) you need secondary indexes — HBase has none natively; Phoenix can add them but at the cost of write amplification and additional operational complexity; (4) you need multi-row ACID transactions — HBase provides single-row atomicity only; (5) your engineering team lacks HBase operational experience — production incidents involving GC tuning, WAL replay, and region reassignment are genuinely difficult to debug without prior exposure.
Production Insight
A team chose HBase for a 50GB dataset that needed SQL aggregation queries on ad-hoc dimensions.
They spent 3 months building and tuning an Apache Phoenix integration before realising that a PostgreSQL instance with partition pruning would have handled the load with far less engineering investment.
Rule: if your data fits on one machine and you need SQL, do not use HBase. Scale the database before adopting a distributed store.
Key Takeaway
HBase excels at petabyte-scale random read/write with strong consistency and native HDFS integration for co-located batch jobs.
Cassandra is the better choice for multi-datacenter deployments and tunable per-operation consistency; MongoDB for flexible schemas and complex query patterns.
Rule: the right reason to choose HBase is a specific technical requirement — petabyte scale, strong consistency, HDFS co-location — not familiarity with the Hadoop ecosystem.
● Production incidentPOST-MORTEMseverity: high
HBase Cluster Pegged at 100% CPU on a Single RegionServer — All Writes Funneling to One Region
Symptom
Average write latency jumped from 2ms to 8 seconds overnight. One RegionServer showed 100% CPU and 95% MemStore utilisation while all others were below 5%. The HBase Master UI showed one region with 50GB of accumulated data while every other region held under 1GB. The cluster had 50 RegionServers and was running at roughly 2% of its total write capacity.
Assumption
The team assumed HBase would automatically distribute writes evenly across all regions regardless of row key pattern — similar to how a hash-based partitioning scheme in Kafka or Cassandra would distribute load. They had tested write throughput in staging with a few thousand records and seen no issues.
Root cause
Row keys were formatted as {timestamp}_{deviceId}. Since timestamps are monotonically increasing, every new write was directed to the single rightmost region — the one covering the highest current key range. This region never split into a position that would relieve pressure because all new writes continued arriving in the same range immediately after any split. The RegionServer hosting this region became the sole bottleneck: every write hit the same WAL segment, the same MemStore, and the same HDFS DataNodes backing that region. No horizontal scaling was possible because the row key pattern made distribution structurally impossible.
Fix
Reversed the timestamp bytes in the row key: the format changed from {timestamp}_{deviceId} to {Long.MAX_VALUE - timestamp}_{deviceId}. Reversed timestamps decrease as real time increases, so new writes are directed toward the leftmost regions instead of the rightmost. A salt prefix (deviceId hash modulo region count) was added as a secondary distribution mechanism to handle devices with correlated write patterns. After the fix, write throughput increased 40x and all 50 RegionServers showed balanced load within one hour. The table was pre-split into 50 regions at creation to avoid the initial single-region bottleneck during the data load.
Key lesson
Never use monotonically increasing values as row key prefixes — timestamps, auto-increment IDs, and sequence numbers all create write hotspots
Reverse timestamps or add salt prefixes before writing a single record — retrofitting requires a full data migration
Monitor per-region write count and region size — imbalance larger than 2x the average is the earliest warning sign of a hotspot
HBase does NOT auto-balance writes based on row key patterns — the sorted partitioning model means distribution is entirely the caller's responsibility
Production debug guideSymptom → Action mapping for common HBase production issues5 entries
Symptom · 01
Single RegionServer at 100% CPU, others idle
→
Fix
Check row key distribution immediately. Run hdfs dfs -du /hbase/data/<namespace>/<table>/ and compare region sizes. If one region is 10x larger than the median, your keys are sequential — the fix is a row key redesign with salt prefix or reversed timestamp. Also run hbase org.apache.hadoop.hbase.util.RegionSplitter to inspect current split boundaries.
Symptom · 02
Read latency spikes every 30-60 minutes on a regular schedule
→
Fix
Compaction is the likely cause. Check HBase Master UI for compaction queue length and run hdfs dfs -du /hbase/data/<namespace>/<table>// to see HFile count per region. If major compactions are firing automatically, tune hbase.hregion.majorcompaction.jitter to stagger them, or disable automatic major compactions and schedule them manually during off-peak hours.
Symptom · 03
OutOfMemoryError in RegionServer JVM
→
Fix
MemStore or BlockCache is over-allocated. Verify that hbase.regionserver.global.memstore.size + hfile.block.cache.size does not exceed 0.80. On a 32GB heap, MemStore at 0.35 and BlockCache at 0.40 leaves 25% for JVM overhead — the minimum required. Reduce one or both allocations if they sum above 0.80.
Symptom · 04
RegionServer keeps crashing and regions are stuck in transition
→
Fix
Run hbase hbck -details to check for META inconsistencies. If META is corrupt, run hbase hbck -fixMeta -fixAssignments. Also check ZooKeeper session timeouts — GC pauses longer than 1 second cause the RegionServer to lose its ZK ephemeral node, triggering a crash cycle even if the JVM itself recovers.
Symptom · 05
Writes timing out with RegionTooBusyException
→
Fix
The MemStore flush queue is backed up. Check hbase.hstore.flusher.count — the default of 2 flusher threads is often insufficient for high write rates. Also check HDFS write throughput via DataNode metrics — if HDFS is slow, MemStore flushes queue behind it and eventually block incoming writes. Increase flusher threads or reduce hbase.hregion.memstore.flush.size to trigger more frequent, smaller flushes.
★ HBase Quick Debug Cheat SheetFast diagnostics for production HBase issues. Run these commands to confirm the root cause before making any configuration changes.
RegionServer unresponsive or repeatedly dying−
Immediate action
Check GC pause duration and ZooKeeper session status
Commands
jstat -gcutil <rs_pid> 1000 10
hbase hbck -details 2>&1 | grep -i 'inconsist'
Fix now
If GC pause exceeds 1 second, increase -XX:G1ReservePercent and reduce MemStore share below 0.35. GC pauses longer than the ZK session timeout (default 90s divided by tick count) cause ephemeral node loss and trigger crash/recovery cycles.
If one region is 10x larger than the median, row keys are sequential. Pre-split the table with custom split keys and redesign the row key schema with a salt prefix or reversed timestamp before the next data load.
Read latency above 50ms for point lookups+
Immediate action
Check BlockCache hit ratio — below 85% means either the cache is undersized or scans are evicting hot data
Commands
curl -s http://<rs_host>:16030/jmx | python3 -c "import sys,json; d=json.load(sys.stdin); [print(k,v) for k,v in d.items() if 'blockCache' in k.lower()]"
If BlockCache hit ratio is below 85%, increase hfile.block.cache.size. If scans are the culprit (they bypass BlockCache by default), set setCacheBlocks(false) on Scan objects so they stop polluting the cache used by point lookups.
Medium: no ZooKeeper dependency; tuning required for compaction and repair; simpler failure model
Low to medium: Atlas managed option removes most operational burden; self-hosted adds complexity
Low: fully managed service; no ZooKeeper, no compaction scheduling, no RegionServer sizing
Key takeaways
1
HBase layers a sorted, indexed key-value store on top of HDFS to provide sub-10ms random reads across petabytes
HDFS alone supports only sequential access, making point lookups impractical without the indexing layer HBase provides.
2
Row key design is the single most critical and irreversible decision
it simultaneously determines write distribution across RegionServers, read locality, and scan efficiency. Sequential keys create hotspots that no hardware addition can fix.
3
Compactions trade write amplification (I/O cost during compaction) for read amplification reduction (fewer HFiles to check per read). Major compactions can saturate disk I/O for hours on large regions
disable automatic scheduling above 1TB and schedule manually.
4
MemStore + BlockCache must sum to 80% or less of JVM heap
exceeding this causes Full GC pauses that trigger ZooKeeper session timeouts and cascading RegionServer crashes.
5
ZooKeeper is the true single point of failure in HBase
if ZooKeeper loses quorum, the entire cluster becomes unavailable for reads, writes, and admin operations. Monitor it more aggressively than any other component.
Common mistakes to avoid
5 patterns
×
Using sequential row keys such as timestamps or auto-increment IDs as the key prefix
Symptom
One RegionServer at 100% CPU handling all writes while the rest of the cluster sits idle. Write latency spikes from single-digit milliseconds to multiple seconds. The hot region never stabilises because every new write continues arriving in the same key range immediately after any split.
Fix
Reverse the timestamp bytes — use (Long.MAX_VALUE - timestamp) as the prefix instead of the raw timestamp. Alternatively, prepend a hash-based salt: (md5(naturalKey) % regionCount) as a zero-padded prefix. Pre-split the table into as many regions as RegionServers before the initial data load to avoid the reactive-split performance cliff.
×
Allocating MemStore and BlockCache memory that sums above 80% of JVM heap
Symptom
Frequent Full GC pauses lasting 5-15 seconds. RegionServer loses its ZooKeeper session during GC and is declared dead by the HMaster. Regions are reassigned and WAL replay begins, but the newly assigned RegionServers also start GC-pausing. The cluster enters a crash-recovery loop.
Fix
Ensure hbase.regionserver.global.memstore.size + hfile.block.cache.size <= 0.80. A safe starting point for a 32GB heap node: MemStore=0.35 (11.2GB), BlockCache=0.40 (12.8GB), sum=0.75 — leaving 25% for RPC handlers, GC internal structures, and JVM overhead. Use G1GC and tune -XX:G1ReservePercent=20 to reduce GC pause duration.
×
Not pre-splitting tables before bulk data loads
Symptom
All initial data lands in a single region. The region splits reactively when it exceeds the size threshold, but by then the RegionServer hosting it is already overwhelmed and load balancing takes hours to converge. Write throughput is a fraction of what the cluster can handle during the entire load window.
Fix
Pre-split the table at creation time: create 'table', SPLITS => ['10','20','30','40','50','60','70','80','90'] for a 10-region split based on expected key distribution. Use the RegionSplitter utility for hash-based splits when key distribution is unknown. Match the number of pre-split regions to the number of RegionServers for even initial distribution.
×
Allowing automatic major compactions to run during peak traffic hours
Symptom
Read latency spikes 10-50x every 7 days on a regular schedule. The spike correlates with major compaction activity visible in the HBase Master UI. A major compaction on a 200GB region reads and writes 200GB of data, competing with live reads and writes for disk I/O.
Fix
For clusters above 1TB, set hbase.hregion.majorcompaction to 0 to disable automatic scheduling. Schedule manual major compactions via admin.majorCompact() during the lowest-traffic window. If you keep automatic compaction enabled for smaller clusters, set hbase.hregion.majorcompaction.jitter to at least 0.5 to spread the load across a multi-day window.
×
Scanning tables without explicit startRow and stopRow boundaries
Symptom
Full table scan blocks RegionServer handler threads for minutes. Other clients queue behind the scan and begin timing out. The scan's sequential reads evict hot BlockCache data, causing point lookup latency to degrade for all other clients on the same server after the scan completes.
Fix
Always set startRow and stopRow on Scan objects — treat an unbounded scan as a code review failure. Set setCaching(100) to fetch 100 rows per RPC instead of the default 1. Set setCacheBlocks(false) on analytical scans so they do not pollute the BlockCache used by point lookups. For cluster-wide analytics, use Spark or MapReduce directly on HFiles instead of client Scan operations.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
Explain the HBase write path from client to disk. What happens if the Re...
Q02SENIOR
Why are sequential row keys bad in HBase, and how would you fix a table ...
Q03SENIOR
What is the difference between a minor and major compaction in HBase, an...
Q04SENIOR
How do bloom filters work in HBase, and when would you choose ROW versus...
Q05SENIOR
Your HBase cluster has 50 RegionServers but only 3 are handling writes. ...
Q06SENIOR
Compare HBase and Cassandra for a multi-datacenter e-commerce platform t...
Q01 of 06SENIOR
Explain the HBase write path from client to disk. What happens if the RegionServer crashes after writing to the WAL but before flushing the MemStore?
ANSWER
The write path has four steps. First, the client sends a Put to the RegionServer that owns the target row's region — it knows which server that is by looking up the hbase:meta table, which it caches locally. Second, the RegionServer appends the mutation to the Write-Ahead Log on HDFS. This is a sequential append to a file on a durable distributed filesystem — it survives RegionServer crashes. Third, the mutation is applied to the in-memory MemStore — a sorted skip list per column family per region. Fourth, the RegionServer sends acknowledgment to the client. If the RegionServer crashes after step 2 but before any MemStore flush to an HFile, the data is safe. When the HMaster detects the crash via ZooKeeper, it reassigns the dead server's regions to other RegionServers. Each recovering RegionServer replays the WAL for its newly assigned regions to reconstruct the MemStore state. The WAL replay guarantees that no acknowledged write is ever lost. The key insight is that WAL durability precedes client acknowledgment — those two facts together make HBase's write durability guarantee possible.
Q02 of 06SENIOR
Why are sequential row keys bad in HBase, and how would you fix a table that already has sequential keys in production?
ANSWER
Sequential keys are bad because HBase partitions data by sorted row key ranges. All writes to a sequential keyspace go to the single rightmost region — the one covering the currently highest key range. One RegionServer handles 100% of write traffic while all others sit idle. No amount of horizontal scaling helps because the data model funnels writes to one place. To fix an existing table in production: first, determine the target row key schema — reversed timestamp or hash-prefix based on the access pattern. Second, run a MapReduce or Spark job to read the existing table, transform the row keys, and write to a new table with the correct schema. Pre-split the new table before loading data. Third, after validation, update the application to point to the new table and drain writes from the old table. Finally, delete the old table once the migration is confirmed clean. There is no in-place rename or re-key operation in HBase — the migration is a full rewrite. This is why getting the row key right before the first write is so important.
Q03 of 06SENIOR
What is the difference between a minor and major compaction in HBase, and when would you disable automatic major compactions?
ANSWER
Minor compaction fires automatically when the number of HFiles in a store exceeds the threshold (default 3). It merges a small batch of adjacent HFiles into one. It is lightweight, does not remove deleted or TTL-expired cells, and is safe to run during business hours. Major compaction merges all HFiles in a store into a single HFile and permanently removes deleted cells and cells past their TTL. It is I/O intensive — a 300GB region undergoing major compaction reads and writes 300GB, competing with live traffic for disk bandwidth. Disable automatic major compaction when: the cluster is above 1TB and the I/O cost would violate read latency SLOs; the cluster serves latency-sensitive traffic with no maintenance window; or you want predictable compaction timing. Set hbase.hregion.majorcompaction to 0 and run manual major compaction via the Admin API during off-peak windows. If you leave automatic compaction enabled, always set jitter to 0.5 or higher to prevent all regions from compacting simultaneously on a fixed 7-day cadence.
Q04 of 06SENIOR
How do bloom filters work in HBase, and when would you choose ROW versus ROWCOL?
ANSWER
Bloom filters are per-HFile probabilistic structures that answer 'does this HFile possibly contain row key X?' — or for ROWCOL, 'does this HFile possibly contain this specific row+column combination?' They guarantee no false negatives: if the bloom filter says the row is not in the file, it is definitely not there and the RegionServer skips the file without any disk read. They allow configurable false positives (default 1%) — occasionally the bloom filter says yes when the row is not present, causing a wasted block read. ROW bloom filters work on row keys only — use them for Get-heavy workloads on narrow rows where eliminating irrelevant HFiles from a point lookup is the primary goal. Memory cost is roughly 1.2 bytes per unique row key at 1% false positive rate. ROWCOL bloom filters check the row key and the column qualifier together — use them for wide rows where you frequently fetch specific columns and want to skip HFiles that don't contain that column. ROWCOL uses 3-5x more memory per HFile than ROW. For scan-heavy workloads, use NONE — range scans must read all blocks in the scanned range anyway, so bloom filters consume memory without reducing any disk reads.
Q05 of 06SENIOR
Your HBase cluster has 50 RegionServers but only 3 are handling writes. Walk me through your debugging process.
ANSWER
Step 1: confirm the hypothesis. Open the HBase Master UI and check the region count and request count per RegionServer. If 3 servers have dramatically higher request counts, the imbalance is real. Step 2: check region size distribution. Run hdfs dfs -du /hbase/data/<namespace>/<table>/ and compare sizes across regions. If one or a few regions are 10x larger than the median, row keys are sequential. Step 3: check row key structure. Look at actual row keys with a short Scan — if they start with timestamps or incrementing numbers, that is the root cause. Step 4: verify that region splits have occurred but failed to relieve pressure. Sequential keys mean new writes still go to the rightmost region even after splits. Step 5: implement the fix — reverse timestamps or add salt prefix. Pre-split the table with the new key schema, migrate data if necessary, and update the ingestion pipeline. Step 6: add monitoring. Alert when any region exceeds 2x the median region size across the cluster. Track per-RegionServer write request rate as a metric and alert on imbalance above 3x. Load should balance within 1 hour after the key schema fix.
Q06 of 06SENIOR
Compare HBase and Cassandra for a multi-datacenter e-commerce platform that needs strong consistency for inventory and eventual consistency for product catalog updates.
ANSWER
The mixed consistency requirement is the deciding factor here and it favours Cassandra. Cassandra supports per-query consistency levels natively — you can write inventory updates at LOCAL_QUORUM for strong-ish consistency within a datacenter, and write catalog updates at ONE for maximum write throughput with eventual propagation to other datacenters. Multi-datacenter replication in Cassandra is a core architectural feature, not an add-on — you configure replication factor per datacenter at table creation and the system handles cross-DC replication automatically. HBase provides strong consistency by default but has no native multi-datacenter support. Replicating HBase across datacenters requires WAL shipping or a custom CDC pipeline — complex to build and operationally expensive to maintain. The inventory use case (strong consistency) works well on HBase but the catalog use case (eventual consistency, multi-DC) does not. Cassandra handles both in a single consistent operational model. The verdict for this specific scenario: Cassandra is the better fit. HBase would be the better choice if the workload required strong consistency for all operations, lived in a single datacenter with HDFS infrastructure already deployed, and needed co-located MapReduce or Spark batch jobs to run directly on the same data.
01
Explain the HBase write path from client to disk. What happens if the RegionServer crashes after writing to the WAL but before flushing the MemStore?
SENIOR
02
Why are sequential row keys bad in HBase, and how would you fix a table that already has sequential keys in production?
SENIOR
03
What is the difference between a minor and major compaction in HBase, and when would you disable automatic major compactions?
SENIOR
04
How do bloom filters work in HBase, and when would you choose ROW versus ROWCOL?
SENIOR
05
Your HBase cluster has 50 RegionServers but only 3 are handling writes. Walk me through your debugging process.
SENIOR
06
Compare HBase and Cassandra for a multi-datacenter e-commerce platform that needs strong consistency for inventory and eventual consistency for product catalog updates.
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
What happens when a RegionServer crashes?
When a RegionServer crashes, ZooKeeper detects the lost session within a few seconds (via the expiry of its ephemeral node). The HMaster is notified and immediately begins reassigning the dead server's regions to other RegionServers. Each recovering RegionServer replays the WAL segments for its newly assigned regions, reconstructing the MemStore state and ensuring no acknowledged write is lost. Recovery time depends on WAL size and the number of regions being recovered — typically 30-90 seconds per affected RegionServer. During recovery, the affected regions are unavailable for reads and writes. After recovery completes, the regions resume serving traffic with full data integrity.
Was this helpful?
02
Can HBase handle secondary indexes?
HBase has no native secondary index support. The common approaches are: Apache Phoenix — a SQL layer on HBase that creates and maintains secondary indexes automatically as shadow tables, at the cost of write amplification (each indexed write goes to both the primary table and the index table); manual index tables — you maintain a separate HBase table mapping secondary key values to primary row keys, with application-level consistency; or Apache Solr integration (via Lily HBase Indexer) for full-text search on HBase data. Each approach adds operational complexity. If secondary indexes are a core requirement, evaluate whether the use case genuinely needs HBase scale or whether a database with native index support would serve better.
Was this helpful?
03
How does HBase handle region splits?
HBase splits regions automatically when a region exceeds the size threshold (hbase.hregion.max.filesize, default 10GB). The split process: the region is marked as splitting in hbase:meta; HBase identifies the midpoint row key; the region is split into two daughter regions at that midpoint; each daughter opens on a RegionServer (possibly different ones after load balancing); the parent's HFiles are referenced by links in both daughters until a major compaction rewrites them into separate files. The split itself is fast (seconds) but the parent region is briefly unavailable during the transition. Pre-splitting avoids this reactive overhead — create the table with the expected number of regions at table creation time using the SPLITS parameter or the RegionSplitter utility.
Was this helpful?
04
What is the difference between HBase and HDFS?
HDFS is a distributed filesystem optimised for large sequential reads and writes — think MapReduce jobs reading multi-gigabyte files from end to end. It has no concept of rows, indexes, or random access. HBase is a database that uses HDFS as its underlying storage layer. It adds sorted storage (rows ordered by key), a per-row index (the row key), column-family organisation, versioned cells, and the RegionServer layer that handles random read/write requests. The relationship is similar to a filesystem and a database engine — HDFS is the durable storage medium; HBase is the structured access layer that makes sub-millisecond point lookups possible on data that would otherwise require scanning multi-gigabyte files.
Was this helpful?
05
How do I monitor HBase health in production?
The key metrics that matter in production: RegionServer request latency at p50, p95, and p99 — spikes indicate compaction I/O contention or write hotspots; BlockCache hit ratio — below 85% means the cache is undersized or scans are evicting hot data; MemStore size per RegionServer — approaching the global limit (hbase.regionserver.global.memstore.size) triggers flush storms; compaction queue depth — sustained queue above 5 indicates disk I/O cannot keep pace with incoming flushes; region count and request count imbalance per server — any server handling more than 3x the median write rate indicates a hotspot; ZooKeeper session latency — above 500ms indicates network issues or GC pressure. Expose these via Hadoop Metrics2 to Prometheus (using the HBase Prometheus exporter) and alert on p99 latency above 100ms, BlockCache hit ratio below 80%, and any single region exceeding 2x the median size.