Advanced 15 min · March 06, 2026

Apache HBase Basics

HBase Row Key Hotspots — Why One RegionServer Hit 100% CPU

Q: What happens when a RegionServer crashes?

When a RegionServer crashes, ZooKeeper detects the lost session within a few seconds (via the expiry of its ephemeral node). The HMaster is notified and immediately begins reassigning the dead server's regions to other RegionServers. Each recovering RegionServer replays the WAL segments for its newly assigned regions, reconstructing the MemStore state and ensuring no acknowledged write is lost. Recovery time depends on WAL size and the number of regions being recovered — typically 30-90 seconds per affected RegionServer. During recovery, the affected regions are unavailable for reads and writes. After recovery completes, the regions resume serving traffic with full data integrity.

Q: Can HBase handle secondary indexes?

HBase has no native secondary index support. The common approaches are: Apache Phoenix — a SQL layer on HBase that creates and maintains secondary indexes automatically as shadow tables, at the cost of write amplification (each indexed write goes to both the primary table and the index table); manual index tables — you maintain a separate HBase table mapping secondary key values to primary row keys, with application-level consistency; or Apache Solr integration (via Lily HBase Indexer) for full-text search on HBase data. Each approach adds operational complexity. If secondary indexes are a core requirement, evaluate whether the use case genuinely needs HBase scale or whether a database with native index support would serve better.

Q: How does HBase handle region splits?

HBase splits regions automatically when a region exceeds the size threshold (hbase.hregion.max.filesize, default 10GB). The split process: the region is marked as splitting in hbase:meta; HBase identifies the midpoint row key; the region is split into two daughter regions at that midpoint; each daughter opens on a RegionServer (possibly different ones after load balancing); the parent's HFiles are referenced by links in both daughters until a major compaction rewrites them into separate files. The split itself is fast (seconds) but the parent region is briefly unavailable during the transition. Pre-splitting avoids this reactive overhead — create the table with the expected number of regions at table creation time using the SPLITS parameter or the RegionSplitter utility.

Q: What is the difference between HBase and HDFS?

HDFS is a distributed filesystem optimised for large sequential reads and writes — think MapReduce jobs reading multi-gigabyte files from end to end. It has no concept of rows, indexes, or random access. HBase is a database that uses HDFS as its underlying storage layer. It adds sorted storage (rows ordered by key), a per-row index (the row key), column-family organisation, versioned cells, and the RegionServer layer that handles random read/write requests. The relationship is similar to a filesystem and a database engine — HDFS is the durable storage medium; HBase is the structured access layer that makes sub-millisecond point lookups possible on data that would otherwise require scanning multi-gigabyte files.

Q: How do I monitor HBase health in production?

The key metrics that matter in production: RegionServer request latency at p50, p95, and p99 — spikes indicate compaction I/O contention or write hotspots; BlockCache hit ratio — below 85% means the cache is undersized or scans are evicting hot data; MemStore size per RegionServer — approaching the global limit (hbase.regionserver.global.memstore.size) triggers flush storms; compaction queue depth — sustained queue above 5 indicates disk I/O cannot keep pace with incoming flushes; region count and request count imbalance per server — any server handling more than 3x the median write rate indicates a hotspot; ZooKeeper session latency — above 500ms indicates network issues or GC pressure. Expose these via Hadoop Metrics2 to Prometheus (using the HBase Prometheus exporter) and alert on p99 latency above 100ms, BlockCache hit ratio below 80%, and any single region exceeding 2x the median size.

Timestamp-prefixed row keys funneled all writes to one region, spiking latency from 2ms to 8 seconds.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Drawn from code that ran under real load.

✓ Production

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

HBase is a distributed, sorted, column-family store running on top of HDFS — designed for random read/write access at petabyte scale
Data is split into regions by row key ranges; each region is served by exactly one RegionServer at a time
Writes go to WAL then MemStore; flushed to immutable HFiles on HDFS when MemStore fills — WAL guarantees no acknowledged write is ever lost
Row key design is the single most critical decision — it determines read/write distribution across the cluster and cannot be changed without a full migration
Compactions merge HFiles to reclaim space and reduce read amplification — but major compactions can saturate disk I/O for hours on large regions
Biggest mistake: sequential row keys (timestamps, auto-increment) create a single hot region — the #1 cause of HBase performance failures in production

✦ Definition~90s read

What is Apache HBase Basics?

HBase is a distributed, column-oriented NoSQL database modeled after Google's Bigtable, designed for real-time random read/write access to massive datasets on commodity hardware. It sits on top of HDFS (Hadoop Distributed File System) and provides low-latency lookups by key, row-level ACID semantics, and automatic sharding across RegionServers.

★

Imagine a school with millions of students, each having a locker.

Unlike relational databases, HBase has no joins, no SQL, and no fixed schema — you trade query flexibility for horizontal scalability and consistent write throughput at petabyte scale. It's the hammer for time-series data, clickstream logs, and any workload where you need fast point lookups by a single key and can tolerate eventual consistency for non-critical reads.

HBase's architecture is deceptively simple: tables are split into regions, each served by exactly one RegionServer. Regions are sorted by row key, and all reads/writes for a contiguous key range hit the same server. This design is why row key selection is the single most important decision you'll make — a poorly chosen key creates hotspots where one RegionServer handles 90% of traffic while its peers sit idle.

You've seen this in production: a monotonically increasing timestamp key means every new write goes to the last region, pegging that server's CPU at 100% while the cluster underutilizes the other 99% of its capacity. The fix is key salting, hashing, or field swapping to distribute writes uniformly.

HBase competes with Cassandra (better for multi-datacenter writes and no single point of failure), DynamoDB (fully managed, but expensive at scale), and even plain HDFS with Parquet (for batch analytics, not real-time). Use HBase when you need consistent, low-latency access to billions of rows by primary key, and you're already running Hadoop.

Avoid it for ad-hoc queries, complex aggregations, or workloads under 100GB — a relational database or even Redis will serve you better. The operational cost is real: you'll wrestle with compactions (background merges that can throttle your cluster), tune BlockCache sizes, and design Bloom filters to avoid scanning entire regions.

But when you get the row key right, HBase is a beast that handles 100k+ writes per second on a modest cluster.

Plain-English First

Imagine a school with millions of students, each having a locker. Instead of one giant hallway with all the lockers in a straight line, the school splits the hallway into sections managed by different hall monitors. Each monitor knows exactly which locker numbers they're responsible for and can find any locker in their section instantly. HBase works exactly like that — your data is sorted by a unique key, sliced into ranges called regions, and each region is managed by a dedicated server. When you ask for a row, HBase goes straight to the right monitor without scanning every locker in the school. The catch: if you number your lockers with today's date as a prefix, every new locker ends up in the last section, overloading one hall monitor while the others stand around doing nothing. That's the hotspot problem, and it's the first thing you need to understand before writing a single byte to HBase.

⚙ Browser compatibility

Latest versions — ✓ supported

Chrome	Firefox	Safari	Edge
✓	✓	✓	✓

Every week, systems like Facebook Messenger, Apache Phoenix, and large-scale e-commerce platforms serve billions of read and write requests against datasets that dwarf anything a relational database was built to handle. The secret weapon in many of these stacks is Apache HBase — a distributed, column-family-oriented store that runs on top of HDFS and borrows its core ideas from Google's Bigtable paper. When your time-series data grows past what PostgreSQL partitioning can stomach, or when you need sub-10ms random reads across petabytes of sparse data, HBase is the conversation you need to have.

HBase solves a specific and nasty problem: how do you get fast, consistent random reads and writes on a dataset so large it must be spread across hundreds of machines? HDFS on its own is sequential — great for MapReduce batch jobs, terrible for point lookups. HBase layers a sorted, indexed structure on top of HDFS so you can fetch a single row in milliseconds even when the table holds a trillion rows. It trades the flexibility of SQL for raw scalability, and understanding when that trade is worth making separates engineers who reach for HBase correctly from those who spend three months regretting the decision.

By the end of this article you'll understand how HBase physically stores data from the MemStore all the way down to HFiles on HDFS, why row key design determines whether your cluster thrives or dies, how compactions work and when they bite you in production, how bloom filters and the BlockCache keep reads fast, and when to choose a different tool entirely. Every section maps to a real production failure pattern — the kind that shows up at 2am when write latency spikes and one RegionServer is pegged at 100% CPU while 49 others sit idle.

Why HBase Row Key Design Is Not Optional

HBase is a distributed, column-oriented NoSQL database built on top of HDFS. Its core mechanic is the row key — a sorted, unique byte array that determines how data is partitioned across RegionServers. Every read and write operation targets a specific row key range, so the distribution of those keys directly controls system throughput.

HBase splits tables into regions by contiguous row key ranges. Each region is served by exactly one RegionServer. When row keys are monotonically increasing — like timestamps or auto-increment IDs — all new writes hit the same region, overloading that single server while others sit idle. This is the hotspot pattern. HBase offers no built-in load balancing for write distribution; the row key is your only lever.

Use HBase when you need real-time random read/write access to billions of rows with strong consistency, and you can design a row key that distributes writes uniformly. It excels at time-series data, telemetry, and event logs — but only if you salt, hash, or reverse your keys. Without that, a single RegionServer will saturate at 100% CPU while the cluster screams.

⚠ Hotspot ≠ RegionServer Failure

A hotspot doesn't crash the server — it throttles all writes to that region, causing latency spikes and timeouts across the entire table.

📊 Production Insight

A telemetry pipeline writing events with millisecond-precision timestamps as row keys.

One RegionServer hits 100% CPU, write latency jumps from 5ms to 30s, and the client's retry storm causes cascading failures.

Always salt row keys with a hash prefix or reverse the timestamp to spread writes across all regions.

🎯 Key Takeaway

Row key design is the single most important performance decision in HBase.

Monotonically increasing keys guarantee hotspots — never use raw timestamps or auto-increment IDs.

Salt or hash your keys to distribute writes uniformly across all RegionServers.

thecodeforge.io

Apache Hbase Basics

HBase Architecture — From Client to HDFS

HBase has three core components: the HMaster, RegionServers, and ZooKeeper. Understanding what each one does — and critically, what it does NOT do — is the foundation for every production decision you'll make.

The HMaster manages cluster metadata: table creation and deletion, schema changes, region assignment, and load balancing. It does NOT participate in the read or write data path at all. If the HMaster dies, the cluster continues serving reads and writes without interruption. You just cannot create tables, alter schemas, or rebalance regions until a new master is elected. This is a deliberate design choice — the HMaster is intentionally not a bottleneck for data operations.

RegionServers are the workhorses. Each RegionServer hosts multiple regions and handles all client read/write requests for those regions. Each region covers a contiguous range of sorted row keys and contains one or more column families. Each column family maintains its own MemStore — an in-memory sorted write buffer — and its own set of HFiles on HDFS.

ZooKeeper is the coordination layer and the true single point of failure in the cluster. It tracks which server is the active HMaster, which RegionServer owns which region, and detects RegionServer failures through ephemeral node heartbeats. If ZooKeeper loses quorum, the entire cluster becomes unavailable — no reads, no writes, no admin operations. This is the component that warrants the most aggressive monitoring in production.

The write path is straightforward but the ordering matters for reasoning about durability. The client sends a Put to the RegionServer responsible for the target row's region. The RegionServer writes to the WAL first — this is the durability guarantee. Then it writes to the in-memory MemStore. The client receives acknowledgment at this point. When the MemStore reaches its size threshold (default 128MB), it flushes to an immutable HFile on HDFS. The WAL is the safety net: if the JVM crashes after the WAL write but before the HFile flush, the data is recovered by replaying the WAL when the region is reassigned.

The read path is a layered cache hierarchy. The RegionServer checks the MemStore first (contains the most recent data for unflushed writes), then checks the BlockCache (recently accessed HFile blocks cached in heap), then checks HFiles on disk. For HFiles, bloom filters are consulted first to skip files that cannot contain the requested row, then the block index locates the specific block, and finally the block is read from HDFS or the OS page cache.

io/thecodeforge/hbase/HBaseArchitectureClient.javaJAVA

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

package io.thecodeforge.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/**
 * Production-grade HBase client demonstrating the read/write path.
 *
 * Key production rules:
 *   1. Connection is heavyweight and thread-safe — create once, share across threads
 *   2. Table is lightweight — obtain per-operation, close after use
 *   3. Always set retries and pause — transient RegionServer failures are normal
 *   4. Always set scan caching — default of 1 row per RPC is catastrophically slow
 */
public class HBaseArchitectureClient implements AutoCloseable {

    private final Connection connection;
    private final Admin admin;

    public HBaseArchitectureClient(String zkQuorum, int zkPort) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", zkQuorum);
        conf.setInt("hbase.zookeeper.property.clientPort", zkPort);

        // Retry on transient failures: up to 3 retries with 1s pause between
        conf.setInt("hbase.client.retries.number", 3);
        conf.setInt("hbase.client.pause", 1000);

        // Operation timeout — fail fast rather than queue behind a slow RegionServer
        conf.setInt("hbase.rpc.timeout", 5000);         // 5s per individual RPC
        conf.setInt("hbase.client.operation.timeout", 15000); // 15s total including retries

        // Connection is thread-safe and maintains a connection pool to each RegionServer
        // One Connection per application process is the correct pattern
        this.connection = ConnectionFactory.createConnection(conf);
        this.admin = connection.getAdmin();
    }

    /**
     * Write path: Put -> WAL -> MemStore -> (eventually) HFile on HDFS
     *
     * The WAL write happens before the client receives acknowledgment.
     * If the RegionServer crashes before MemStore flush, WAL replay recovers the data.
     * No acknowledged write is ever lost, even in a hard crash scenario.
     */
    public void writeRow(String tableName, String rowKey,
                         String columnFamily, String qualifier, byte[] value)
            throws IOException {

        // Table is lightweight — safe to create per-operation; always close after use
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(
                Bytes.toBytes(columnFamily),
                Bytes.toBytes(qualifier),
                value
            );
            // setDurability controls WAL behaviour:
            //   SYNC_WAL (default)  — fsync WAL before ack — safest
            //   ASYNC_WAL           — buffer WAL writes — faster, small loss window
            //   SKIP_WAL            — no WAL — fastest, data loss on crash
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }

    /**
     * Read path: Get -> MemStore -> BlockCache -> Bloom filter -> HFile block index -> HDFS
     *
     * Sub-10ms latency when the row is in MemStore (unflushed write) or BlockCache.
     * Cold reads (data not in any cache) depend on HDFS read latency — typically 20-50ms.
     */
    public byte[] readRow(String tableName, String rowKey,
                          String columnFamily, String qualifier)
            throws IOException {

        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Get get = new Get(Bytes.toBytes(rowKey));
            get.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(qualifier));

            // setCacheBlocks(true) is the default — hot rows stay in BlockCache
            // Set to false for analytical/scan workloads to avoid evicting hot data
            get.setCacheBlocks(true);

            Result result = table.get(get);

            if (result.isEmpty()) {
                return null;  // Row does not exist — always check before unwrapping
            }

            return result.getValue(
                Bytes.toBytes(columnFamily),
                Bytes.toBytes(qualifier)
            );
        }
    }

    /**
     * Scan with explicit start/stop row — the correct production scan pattern.
     *
     * Never scan without start/stop rows unless you intend a full table scan.
     * Full table scans block RegionServer threads for minutes and evict hot
     * BlockCache data, degrading point lookup latency for all other clients.
     */
    public void scanRange(String tableName, String startKey, String stopKey)
            throws IOException {

        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes(startKey), true)   // inclusive
                .withStopRow(Bytes.toBytes(stopKey), false)    // exclusive
                .setCaching(100)             // 100 rows per RPC — reduces round trips
                .setMaxResultSize(2 * 1024 * 1024L)  // 2MB max per RPC response
                .setCacheBlocks(false);      // scans read sequentially — don't pollute BlockCache

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    byte[] rowKey = result.getRow();
                    // process each row here — scanner handles region boundaries transparently
                    _ = rowKey; // suppress unused warning in example
                }
            }
        }
    }

    @Override
    public void close() throws IOException {
        admin.close();
        connection.close();
    }
}

Mental Model

The HBase Write Path — Durability First, Performance Second

Every write follows the same sequence: WAL for crash recovery, MemStore for fast in-memory buffering, then an eventual flush to immutable sorted files on HDFS. The WAL is the durability guarantee — everything else is optimisation.

Client sends Put to the RegionServer owning the target row's region — ZooKeeper tells the client which server that is
RegionServer appends to the WAL before acknowledging — guarantees the write survives a hard JVM crash
Write is applied to the MemStore — a sorted skip list in heap memory, one per column family per region
Client receives acknowledgment — the data is durable at this point, regardless of whether a flush has occurred
When MemStore reaches 128MB (default), it flushes to an HFile on HDFS — immutable, sorted, compressed, replicated
Multiple HFiles accumulate over time; compaction merges them periodically to limit read amplification

📊 Production Insight

HMaster death does not stop reads and writes — only region rebalancing halts until a new master is elected.

ZooKeeper quorum loss stops everything — no reads, no writes, no admin operations. It is the true single point of failure.

Rule: run ZooKeeper on dedicated nodes with SSD storage, a stable JVM (no large heap), and monitor session latency aggressively. ZK sessions should complete in under 10ms on a healthy cluster.

🎯 Key Takeaway

HMaster manages metadata only — it is not in the data path and its failure does not stop reads or writes.

RegionServers handle all data operations; ZooKeeper coordinates everything and is the true cluster SPOF.

Rule: monitor ZooKeeper session latency and node health more aggressively than any other component — a flapping ZK causes cascading RegionServer crashes.

HBase Component Failure Impact

IfHMaster dies

→

UseCluster continues serving reads and writes; no new table creation, schema changes, or region rebalancing until a standby master is elected — typically within 30 seconds with HA master configuration

IfSingle RegionServer dies

→

UseIts regions are reassigned to other servers by the HMaster; WAL replay reconstructs MemStore data; recovery takes 30-90 seconds; those regions are unavailable during recovery

IfZooKeeper quorum is lost (majority of ZK nodes fail)

→

UseEntire cluster becomes unavailable immediately — no reads, no writes, no admin operations; restore ZK quorum as the first priority

IfHDFS NameNode becomes unavailable

→

UseHFile reads degrade or fail; new MemStore flushes cannot complete; cluster degrades progressively until HDFS recovers or NameNode failover completes

IfNetwork partition between client and RegionServer

→

UseClient retries with exponential backoff up to hbase.client.retries.number; if all retries exhausted, the operation fails with an IOException — not a silent failure

Row Key Design — The Single Most Important Decision

Row key design in HBase is the most consequential architectural decision you will make, and it is the one you cannot change without a full data migration. Unlike a relational database where you can add an index after the fact, HBase has no secondary indexes. The row key is the only native index. It determines three things simultaneously: how writes are distributed across RegionServers, which rows can be read with a single efficient range scan, and how much data each region holds.

HBase stores rows sorted lexicographically by row key. Adjacent keys land in the same region and can be fetched with a single range scan — this is the key locality property that makes HBase efficient for time-series queries like 'give me all readings for device X in the last hour.' But lexicographic sorting also means that sequential keys funnel all new writes to a single region — the rightmost one, which covers the highest current key range. No amount of RegionServer hardware fixes this because the problem is structural: the data model directs all writes to one place.

The three production-proven strategies for write distribution:

Salt prefix: prepend a one-byte hash-derived salt (md5(naturalKey) % regionCount) to the row key. Every key gets distributed across all regions. The trade-off is that range scans on the natural key become multi-region scatter-gather operations — you have to scan all regions in parallel and merge results in the client. Acceptable for write-heavy workloads where scans are rare.

Reversed timestamp: for time-series data, format the key as {Long.MAX_VALUE - timestamp}_{deviceId} instead of {timestamp}_{deviceId}. Reversed timestamps decrease as real time increases, so each new write is directed toward lower key space — spreading across regions rather than always hitting the rightmost one. The bonus: a range scan for 'the last N records' now returns contiguous data instead of requiring a reverse scan.

Hash prefix: prepend a fixed-length hash of the natural key. Deterministic — the same natural key always maps to the same region. Good for Get-heavy workloads where you always know the full key. No range scan capability on the natural key.

The row key must also be as short as practical — under 100 bytes is the guideline. The row key is stored verbatim in every HFile block that contains that row, and it appears in every column qualifier's KeyValue metadata. A 500-byte row key on a table with 20 column qualifiers wastes 10KB per row just in key storage overhead, before any actual data.

io/thecodeforge/hbase/RowKeyDesign.javaJAVA

100

101

102

103

104

105

106

107

package io.thecodeforge.hbase;

import org.apache.hadoop.hbase.util.Bytes;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Production row key design patterns for HBase.
 *
 * Each pattern makes a different trade-off between write distribution,
 * range scan efficiency, and deterministic lookup. Choose based on
 * your dominant access pattern — not based on what looks simplest.
 *
 * Decision guide:
 *   Write-heavy + rare scans    → salt prefix
 *   Time-series + recent reads  → reversed timestamp
 *   Read-heavy + point lookups  → hash prefix
 *   Multi-dimensional queries   → composite key
 */
public class RowKeyDesign {

    /**
     * Pattern 1: Salt prefix for write distribution.
     *
     * Distributes writes uniformly across all regions.
     * Trade-off: range scans on the natural key require querying all regions
     * in parallel and merging results in the client — scatter-gather.
     *
     * Use when: write throughput is the constraint and scans are rare or
     * can tolerate multi-region fan-out.
     */
    public static byte[] saltedKey(String naturalKey, int regionCount) {
        // Ensure consistent positive modulus regardless of hashCode sign
        int salt = (naturalKey.hashCode() & 0x7FFFFFFF) % regionCount;
        // Zero-pad salt so lexicographic ordering of region ranges works correctly
        return Bytes.toBytes(String.format("%02d_%s", salt, naturalKey));
    }

    /**
     * Pattern 2: Reversed timestamp for time-series write distribution.
     *
     * Reversed timestamps decrease as real time increases, directing each
     * new write toward lower key space rather than always to the rightmost region.
     *
     * Bonus: 'latest N records for deviceId' scan reads contiguous rows
     * because recent data is now at lower keys, not scattered.
     *
     * Use when: ingesting time-series data (IoT, logs, events) and recent-data
     * reads are more important than historical range scans.
     */
    public static byte[] reversedTimestampKey(long timestampMillis, String deviceId) {
        long reversedTs = Long.MAX_VALUE - timestampMillis;
        // 16 hex chars = 8 bytes = full long range, zero-padded for correct lex sort
        return Bytes.toBytes(String.format("%016x_%s", reversedTs, deviceId));
    }

    /**
     * Pattern 3: Hash prefix for deterministic distribution.
     *
     * Same natural key always maps to the same region — lookups are fast
     * because you can reconstruct the hash from the natural key.
     * Trade-off: no range scan on natural key order; you lose locality.
     *
     * Use when: the workload is predominantly point Gets and range scans
     * on the natural key are not needed.
     */
    public static byte[] hashPrefixedKey(String naturalKey) {
        String prefix = md5HexPrefix(naturalKey, 4);  // 4-char prefix = 16^4 = 65536 buckets
        return Bytes.toBytes(String.format("%s_%s", prefix, naturalKey));
    }

    /**
     * Pattern 4: Composite key for multi-dimensional access patterns.
     *
     * Encodes multiple fields at fixed widths so lexicographic ordering
     * enables efficient range scans on the primary dimension while
     * filtering on secondary dimensions within the scan.
     *
     * Example: scan all events for US_EAST in reverse-chronological order
     * by setting startRow=US_EAST_<Long.MAX_VALUE> stopRow=US_EAST_<0>
     *
     * Use when: queries always filter on a primary dimension (region, tenant,
     * category) and sort or range-scan on a secondary dimension (time, score).
     */
    public static byte[] compositeKey(String regionCode, long timestampMillis, String userId) {
        // Fixed-width region code ensures correct lexicographic boundaries
        String paddedRegion  = String.format("%-6s", regionCode).replace(' ', '0');
        String reversedTs    = String.format("%016x", Long.MAX_VALUE - timestampMillis);
        return Bytes.toBytes(String.format("%s_%s_%s", paddedRegion, reversedTs, userId));
    }

    private static String md5HexPrefix(String input, int hexChars) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(Bytes.toBytes(input));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
                if (sb.length() >= hexChars) break;
            }
            return sb.substring(0, hexChars);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 unavailable — should never happen on a JVM", e);
        }
    }
}

⚠ The #1 HBase Mistake: Sequential Row Keys

Using a timestamp or auto-increment value as the row key prefix creates a write hotspot that no amount of hardware can fix. All writes go to the rightmost region; one RegionServer handles 100% of write traffic while every other server sits idle. This is the most common cause of HBase production incidents and it cannot be fixed without a full data migration. Design the row key correctly before writing a single byte. Retrofitting a row key schema means exporting the entire dataset, transforming every key, and reimporting — which is weeks of work on a petabyte-scale table.

📊 Production Insight

A telemetry pipeline used {timestamp}_{deviceId} as the row key. One RegionServer handled 100% of writes.

49 other RegionServers sat idle. Write latency spiked from 2ms to 8 seconds as MemStore filled on the hot server.

Rule: never use monotonically increasing values as row key prefixes. Reverse them or add a salt prefix before the first production write.

🎯 Key Takeaway

Row key design determines write distribution, read efficiency, and scan locality — all three simultaneously, with no way to change it post-deployment without a migration.

Sequential keys create hotspots; reversed timestamps and salt prefixes solve this before the first write.

Rule: treat the row key schema as a permanent architectural decision. Sketch expected access patterns before creating the table.

Row Key Strategy Selection

IfWrite throughput is the constraint and range scans are rare or can tolerate multi-region scatter-gather

→

UseSalt prefix: (md5(key) % regionCount)_naturalKey — uniform write distribution, scatter-gather for scans

IfTime-series data with recent-data-heavy reads and ingest at high rate

→

UseReversed timestamp: (Long.MAX_VALUE - ts)_deviceId — spreads writes, recent data clustered for efficient scans

IfGet-heavy workload where you always know the full natural key

→

UseHash prefix: md5prefix(naturalKey)_naturalKey — deterministic, no range scan on natural key order

IfMulti-dimensional queries: scan by primary dimension, filter by secondary

→

UseComposite key: fixedWidthPrimary_reversedTimestamp_secondaryKey — enables range scans on primary while sorting by secondary

IfRow key exceeds 100 bytes

→

UseShorten it — key storage overhead multiplies by number of columns per row and appears in every HFile block

thecodeforge.io

Apache Hbase Basics

Compactions — The Silent Performance Killer

Compactions are HBase's mechanism for managing HFile accumulation. Every MemStore flush produces a new HFile on HDFS. Without any merging, a region that has been written to heavily could accumulate hundreds of HFiles, and every read would need to check each one for the requested row. Compaction solves this by periodically merging HFiles into fewer, larger ones — trading I/O cost during compaction for lower read amplification afterward.

Minor compaction is lightweight and runs continuously. It fires automatically when the number of HFiles in a column family store exceeds hbase.hstore.compactionThreshold (default 3). It picks a small batch of adjacent HFiles and merges them. Minor compaction does not remove deleted or expired cells — those tombstone markers remain until a major compaction. Minor compactions are safe to let run during business hours; they consume modest I/O and complete quickly.

Major compaction is the expensive one. It merges all HFiles in a store into a single HFile, permanently removes deleted cells and cells past their TTL, and rewrites the entire column family from scratch. For a region with 200GB of data, a major compaction reads 200GB from HDFS and writes 200GB back. On a cluster already under read and write load, this I/O competition can push latency from 5ms to 500ms. By default, HBase schedules automatic major compactions every 7 days (hbase.hregion.majorcompaction) with a random jitter (hbase.hregion.majorcompaction.jitter) to prevent all regions compacting simultaneously.

The production trap is that automatic major compactions will eventually collide with peak traffic hours, especially as the cluster grows and 7 days worth of accumulated HFiles becomes significant data. The best practice for large production clusters is to disable automatic major compactions entirely (set hbase.hregion.majorcompaction to 0) and schedule them through an external job during the lowest-traffic window — typically weekend nights or the quietest hours of the day. For time-series data with a TTL, major compactions are also the only mechanism that reclaims space from expired cells, so you cannot skip them indefinitely.

io/thecodeforge/hbase/CompactionManager.javaJAVA

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

package io.thecodeforge.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;
import java.util.Collection;
import java.util.Map;

/**
 * Production compaction management — manual triggering, throughput throttling,
 * and configuration tuning to prevent compactions from overwhelming live traffic.
 *
 * The core rule: never let automatic major compactions run during peak hours.
 * For clusters serving live traffic, manual scheduling during off-peak windows
 * is the only safe approach at scale.
 */
public class CompactionManager implements AutoCloseable {

    private final Admin admin;
    private final Connection connection;

    public CompactionManager(String zkQuorum) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", zkQuorum);
        this.connection = ConnectionFactory.createConnection(conf);
        this.admin = connection.getAdmin();
    }

    /**
     * Trigger a major compaction on a specific table.
     * Call this during a scheduled maintenance window, not during peak hours.
     * Major compaction rewrites all HFiles — I/O impact is proportional to region size.
     */
    public void majorCompactTable(String tableName) throws IOException {
        System.out.printf("Triggering major compaction on table: %s%n", tableName);
        admin.majorCompact(TableName.valueOf(tableName));
        System.out.println("Major compaction submitted — runs asynchronously in the background");
    }

    /**
     * Trigger a minor compaction on a specific table.
     * Safe to run during business hours — merges a few HFiles with modest I/O.
     * Does NOT remove deleted cells or reclaim space from expired TTL data.
     */
    public void minorCompactTable(String tableName) throws IOException {
        admin.compact(TableName.valueOf(tableName));
    }

    /**
     * Check compaction queue depth per RegionServer.
     * A sustained queue > 5 indicates the RegionServer's disk I/O cannot keep up
     * with the incoming flush rate — time to tune or add capacity.
     */
    public void printCompactionQueueDepth() throws IOException {
        Map<ServerName, RegionMetrics> clusterMetrics =
            admin.getClusterMetrics().getLiveServerMetrics()
                 .entrySet().stream()
                 .collect(java.util.stream.Collectors.toMap(
                     Map.Entry::getKey,
                     e -> e.getValue().getRegionMetrics()
                               .values().stream().findFirst().orElse(null)
                 ));
        // In practice, iterate ServerMetrics for compactionQueueSize
        admin.getClusterMetrics().getLiveServerMetrics().forEach((server, metrics) -> {
            long compactionQueueSize = metrics.getCompactionQueueSize();
            System.out.printf("  %s  compaction queue: %d%n",
                server.getHostname(), compactionQueueSize);
        });
    }

    /**
     * Returns the recommended configuration for compaction tuning.
     *
     * Key decisions:
     *   - Disable automatic major compactions for clusters > 1TB
     *   - Throttle compaction throughput to leave headroom for live I/O
     *   - Increase minor compaction threshold to reduce frequency
     */
    public static Configuration getCompactionTunedConfig() {
        Configuration conf = HBaseConfiguration.create();

        // ── Minor compaction tuning ──────────────────────────────────────
        // Trigger minor compaction when 4+ HFiles exist per store (default: 3)
        // Higher threshold = less frequent compaction, more read amplification
        conf.setInt("hbase.hstore.compactionThreshold", 4);

        // Max HFiles merged in one minor compaction batch (default: 10)
        // Smaller batches = faster individual compactions, more total I/O over time
        conf.setInt("hbase.hstore.compaction.max", 6);

        // ── Major compaction tuning ──────────────────────────────────────
        // Disable automatic major compactions for large clusters
        // Set to 0 and schedule manually during off-peak windows
        conf.setLong("hbase.hregion.majorcompaction", 0L);  // 0 = disabled

        // If you do leave automatic major compaction enabled, add jitter
        // to prevent all regions from compacting simultaneously
        // conf.setLong("hbase.hregion.majorcompaction", 7 * 24 * 3600 * 1000L);
        // conf.setFloat("hbase.hregion.majorcompaction.jitter", 0.5f);

        // ── I/O throttling ───────────────────────────────────────────────
        // Limit compaction throughput to leave headroom for live reads/writes
        // 20MB/s per store is conservative; tune up if disk throughput allows
        conf.setLong("hbase.hstore.compaction.throughput.lower.bound",
                     20L * 1024 * 1024);  // 20 MB/s floor
        conf.setLong("hbase.hstore.compaction.throughput.higher.bound",
                     40L * 1024 * 1024);  // 40 MB/s ceiling

        return conf;
    }

    @Override
    public void close() throws IOException {
        admin.close();
        connection.close();
    }
}

Mental Model

Compaction Trade-offs — Read Amplification vs Write Amplification

Compaction is HBase trading I/O cost today for lower read amplification tomorrow. Without it, reads degrade as HFile count grows. With too-aggressive compaction, live traffic suffers from I/O contention.

Minor compaction: merges a few adjacent HFiles — lightweight, runs continuously, does not remove deleted cells, safe during business hours
Major compaction: rewrites ALL HFiles into one — heavy I/O, removes tombstones and TTL-expired cells, reclaims HDFS space
Without any compaction: reads degrade linearly with HFile count — each read must check every HFile via bloom filters
With unthrottled major compaction during peak hours: disk I/O saturation drives read latency from 5ms to 500ms
Rule: disable automatic major compactions on clusters above 1TB and schedule them manually during the quietest traffic window

📊 Production Insight

A 50-node cluster ran automatic major compactions on the default 7-day schedule with no jitter.

All regions compacted simultaneously on the same day, saturating every DataNode's disk for 4 hours.

Read latency spiked from 8ms to 600ms. Setting hbase.hregion.majorcompaction.jitter=0.5 distributes the load over a 3-day window.

Rule: if you use automatic major compaction, always set jitter to at least 0.5 — never leave it at 0.

🎯 Key Takeaway

Minor compaction is safe during business hours; major compaction can saturate disk I/O for hours on large regions.

Disable automatic major compaction on clusters above 1TB and schedule it manually during off-peak windows.

Compaction is the only mechanism that physically removes deleted cells and TTL-expired data — it cannot be skipped indefinitely.

Compaction Configuration Strategy

IfCluster is under 100GB total data, development or staging environment

→

UseLeave automatic major compaction enabled at default 7-day interval — the I/O cost is negligible

IfCluster is between 100GB and 1TB in production

→

UseEnable automatic major compaction but set jitter=0.5 and throttle throughput to 40MB/s to avoid peak-hour collisions

IfCluster is above 1TB in production with live traffic SLOs

→

UseDisable automatic major compaction (set to 0) and schedule manual compactions during off-peak windows via admin.majorCompact()

IfTable uses TTL for data expiry (time-series, event logs)

→

UseMajor compaction is the only mechanism that physically removes expired cells — run it at least once per TTL period, even if manually scheduled

IfMinor compaction queue depth stays above 5 consistently

→

UseDisk I/O is undersized for the write rate — increase hbase.hstore.flusher.count, reduce MemStore flush size, or add DataNodes

Bloom Filters and BlockCache — Keeping Reads Fast

Two mechanisms keep HBase read latency in the millisecond range: bloom filters that skip irrelevant HFiles before reading any disk data, and the BlockCache that keeps recently accessed HFile blocks in heap memory.

Bloom filters are probabilistic data structures stored within each HFile. They answer the question 'does this HFile possibly contain row key X?' with certainty in the negative direction — if the bloom filter says no, the RegionServer skips that HFile entirely. If it says yes, the HFile might contain the row (with a configurable false positive rate, typically 1%). For a table with 20 HFiles per region, bloom filters can eliminate 15-19 disk reads per Get request. The space cost is roughly 120MB of bloom filter data per 100 million keys at 1% false positive rate.

HBase offers two bloom filter types. ROW bloom filters check only the row key — the right choice for workloads dominated by point Gets on narrow rows with few columns. ROWCOL bloom filters check both the row key and the column qualifier — useful for wide rows where you frequently fetch specific columns and want to skip HFiles that don't contain that column. ROWCOL uses 3-5x more memory per HFile than ROW. For scan-heavy workloads, bloom filters provide no benefit because range scans must read all blocks in sequence regardless — use NONE to save the memory.

The BlockCache keeps recently accessed HFile blocks in RegionServer heap memory. When a Get or Scan reads a block from HDFS, that block is placed in the BlockCache. Subsequent reads for rows in the same block find it in cache and skip the HDFS read entirely — this is how HBase achieves sub-millisecond reads for hot data. The BlockCache has two optional tiers: an on-heap L1 cache and an off-heap L2 bucket cache for larger working sets without JVM GC pressure.

The configuration rule that causes the most production incidents is the memory budget. The RegionServer's JVM heap must support MemStore, BlockCache, RPC handlers, GC overhead, and JVM internal structures simultaneously. The constraint is firm: hbase.regionserver.global.memstore.size + hfile.block.cache.size must not exceed 0.80. If they sum above 0.80, the remaining heap is insufficient for GC metadata and RPC handler threads, leading to Full GC pauses that exceed ZooKeeper's session timeout and cause RegionServer crashes.

io/thecodeforge/hbase/ReadTuningConfig.javaJAVA

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

package io.thecodeforge.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

import java.io.IOException;

/**
 * Production read performance tuning for HBase.
 *
 * Covers:
 *   - BlockCache and MemStore memory budget (the 80% rule)
 *   - Bloom filter type selection per column family
 *   - Off-heap L2 bucket cache for larger working sets
 *   - Scan caching and result size limits
 *   - Column family compression
 *
 * The 80% rule is non-negotiable:
 *   hfile.block.cache.size + hbase.regionserver.global.memstore.size <= 0.80
 */
public class ReadTuningConfig {

    /**
     * Returns the recommended RegionServer configuration for a 32GB heap node.
     *
     * Memory budget:
     *   BlockCache : 40% of 32GB = 12.8GB  (on-heap L1)
     *   MemStore   : 35% of 32GB = 11.2GB
     *   Total      : 75%  (leaves 25% for GC, RPC handlers, JVM internals)
     *
     * With 8GB off-heap L2 bucket cache, effective cache = 12.8 + 8 = 20.8GB
     * without increasing GC pressure.
     */
    public static Configuration getReadOptimizedConfig() {
        Configuration conf = HBaseConfiguration.create();

        // ── On-heap BlockCache (L1) ──────────────────────────────────────
        // 40% of heap — primary cache for hot data
        conf.setFloat("hfile.block.cache.size", 0.40f);

        // ── MemStore ─────────────────────────────────────────────────────
        // 35% of heap — write buffer across all regions on this server
        // Triggers a global flush if total MemStore usage exceeds this
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.35f);

        // Verify: 0.40 + 0.35 = 0.75 <= 0.80 — within the safe budget

        // ── Off-heap L2 Bucket Cache ─────────────────────────────────────
        // Stores larger working sets without contributing to GC pressure
        // Data is serialised to native memory — slightly slower than L1 but
        // allows a much larger effective cache without increasing -Xmx
        conf.set("hbase.bucketcache.ioengine", "offheap");
        conf.setLong("hbase.bucketcache.size", 8192L);  // 8192 MB = 8GB off-heap

        // ── Scan optimisation ────────────────────────────────────────────
        // Rows fetched per RPC during a Scan — default of 1 is catastrophically slow
        // 100-500 is reasonable; tune based on average row size
        conf.setInt("hbase.client.scanner.caching", 100);

        // Max bytes returned per scanner RPC — prevents OOM on wide rows
        conf.setLong("hbase.client.scanner.max.result.size", 2L * 1024 * 1024);  // 2MB

        return conf;
    }

    /**
     * Creates a table with per-column-family bloom filter and compression settings.
     *
     * Bloom filter type is set per column family because different families
     * can have fundamentally different access patterns on the same table.
     */
    public static void createOptimisedTable(Connection connection,
                                             String tableName) throws IOException {
        try (Admin admin = connection.getAdmin()) {
            HTableDescriptor tableDescriptor =
                new HTableDescriptor(TableName.valueOf(tableName));

            // Column family for point-lookup-heavy data — use ROW bloom filter
            // ROW: probabilistic skip on row key only
            // Low memory cost: ~1.2 bytes per unique row key at 1% false positive
            HColumnDescriptor metadataFamily = new HColumnDescriptor("meta");
            metadataFamily.setBloomFilterType(BloomType.ROW);
            metadataFamily.setCompressionType(Compression.Algorithm.SNAPPY);
            metadataFamily.setMaxVersions(1);     // time-series: keep only latest
            metadataFamily.setTimeToLive(86400 * 90);  // 90-day TTL
            tableDescriptor.addFamily(metadataFamily);

            // Column family for wide rows with selective column reads — use ROWCOL
            // ROWCOL: probabilistic skip on row key + column qualifier
            // Higher memory cost: 3-5x more than ROW — only justified for wide rows
            HColumnDescriptor attributesFamily = new HColumnDescriptor("attrs");
            attributesFamily.setBloomFilterType(BloomType.ROWCOL);
            attributesFamily.setCompressionType(Compression.Algorithm.LZ4);
            attributesFamily.setMaxVersions(3);   // retain last 3 versions
            tableDescriptor.addFamily(attributesFamily);

            // Column family for scan-heavy analytical data — no bloom filter
            // Bloom filters provide no benefit for range scans — skip the memory cost
            HColumnDescriptor metricsFamily = new HColumnDescriptor("metrics");
            metricsFamily.setBloomFilterType(BloomType.NONE);
            metricsFamily.setCompressionType(Compression.Algorithm.GZ);  // higher ratio for cold data
            metricsFamily.setMaxVersions(1);
            tableDescriptor.addFamily(metricsFamily);

            admin.createTable(tableDescriptor);
        }
    }

    /**
     * Selects the appropriate bloom filter type based on column family access pattern.
     * Use this as a decision aide when designing a new column family schema.
     */
    public static BloomType selectBloomType(boolean isGetHeavy,
                                             boolean hasWideRows,
                                             boolean isScanHeavy) {
        if (isScanHeavy) {
            return BloomType.NONE;  // scans read all blocks — bloom filters are wasted memory
        }
        if (isGetHeavy && hasWideRows) {
            return BloomType.ROWCOL;  // skip HFiles missing the specific column — expensive but targeted
        }
        if (isGetHeavy) {
            return BloomType.ROW;  // skip HFiles missing the row — cheap and effective
        }
        return BloomType.NONE;  // default for mixed or unknown patterns
    }
}

💡Pro Tip: Bloom Filter Memory Budget

Bloom filter data is stored inside HFiles and loaded into memory when the HFile is opened. For a 100M-row HFile with 1% false positive rate, the bloom filter consumes roughly 120MB. If you have 10 HFiles per region and 200 regions per server, that's 240GB of bloom filter data — which clearly cannot all live in memory simultaneously. HBase loads bloom filter chunks lazily and caches them in the BlockCache. If bloom filter data is evicting hot row data from the BlockCache, reduce the false positive rate (costs more memory per HFile) or switch from ROWCOL to ROW. Use ROWCOL only when the column-specific filtering genuinely reduces disk reads enough to justify the 3-5x memory overhead.

📊 Production Insight

A 16GB RegionServer was configured with MemStore at 0.40 and BlockCache at 0.45 — summing to 0.85.

Full GC pauses of 12-15 seconds caused ZooKeeper session timeouts and triggered RegionServer crashes.

The crash-recovery cycle repeated every 20-30 minutes, making the cluster unusable.

Fix: reduce MemStore to 0.35, BlockCache to 0.40 — sum 0.75 — leaving 25% for JVM internals. Crashes stopped immediately.

🎯 Key Takeaway

Bloom filters skip irrelevant HFiles before reading any disk data — transforming O(HFileCount) disk reads into O(1) for cache-cold Gets.

BlockCache keeps recently read HFile blocks in heap — hot data stays at sub-millisecond read latency.

MemStore + BlockCache must sum to <= 0.80 of heap — this is the most commonly violated configuration rule and the most reliable cause of RegionServer crashes.

Bloom Filter Type Selection

IfWorkload is Get-heavy with narrow rows (few columns per row)

→

UseUse ROW bloom filter — filters on row key, low memory overhead, effective for point lookups

IfWorkload is Get-heavy with wide rows (many columns) and reads target specific columns

→

UseUse ROWCOL bloom filter — filters on row key + column qualifier, 3-5x more memory but avoids reading HFiles missing the target column

IfWorkload is scan-heavy (range scans, full family reads)

→

UseUse NONE — range scans must read all blocks sequentially; bloom filters consume memory without reducing any disk reads

IfRegionServer memory is constrained (under 16GB heap)

→

UseUse ROW or NONE — ROWCOL memory overhead may not be justified; monitor BlockCache eviction rates

IfHFile has very few unique row keys (under 10,000)

→

UseUse NONE — bloom filter overhead exceeds the benefit; the small HFile block index is faster than a bloom filter lookup

HBase vs Alternatives — When to Use What

HBase is a powerful and operationally demanding tool. Choosing it for the wrong problem is a multi-month mistake. Understanding where it fits relative to alternatives is as important as understanding HBase itself.

HBase vs Cassandra: both are wide-column stores inspired by Google's Bigtable paper, but they make fundamentally different architectural trade-offs. Cassandra is masterless and peer-to-peer — every node is equal, and there is no single point of failure equivalent to ZooKeeper. Cassandra supports per-query consistency tuning: write at ONE for maximum throughput, read at QUORUM for strong consistency, or mix them per operation. It also has native multi-datacenter replication built into its core data model. HBase uses a master-based architecture with strong consistency by default and no native multi-datacenter support. Choose Cassandra when you need multi-datacenter active-active replication, tunable consistency on a per-operation basis, or a masterless topology that has no ZooKeeper equivalent as a single point of failure. Choose HBase when you need strong consistency guaranteed by default, coprocessor-based server-side computation, or tight integration with the Hadoop ecosystem for batch analytics on the same data (HBase + Spark/MapReduce on HFiles).

HBase vs MongoDB: MongoDB is a document store with a flexible JSON schema, a rich aggregation query language, and built-in secondary indexes. HBase has none of those things — it is a sorted key-value store with a fixed column-family schema, no query language beyond Get/Scan/Put, and no secondary indexes. Use MongoDB when your access patterns require ad-hoc queries, aggregation pipelines, or text search. Use HBase when you need predictable low-latency random reads across petabytes of sparse data with simple key-based access and no query flexibility required.

HBase vs Google Cloud Bigtable: Bigtable is the managed version of the same architecture. The API is largely compatible, the data model is identical, and the operational model is dramatically simpler — no ZooKeeper to manage, no RegionServers to tune, no compactions to schedule. If you are on GCP and your dataset warrants HBase, Bigtable is almost always the right choice unless you have a specific requirement for on-premises deployment, multi-cloud portability, or HDFS data locality for co-located batch jobs.

🔥When NOT to Use HBase

HBase is the wrong choice when: (1) your dataset fits comfortably in a single PostgreSQL or MySQL instance — the operational burden of ZooKeeper, HDFS, and RegionServer tuning is not justified; (2) you need ad-hoc SQL queries — use Apache Phoenix on HBase if you must, but consider whether a purpose-built analytical store (BigQuery, Redshift, Snowflake) would serve better; (3) you need secondary indexes — HBase has none natively; Phoenix can add them but at the cost of write amplification and additional operational complexity; (4) you need multi-row ACID transactions — HBase provides single-row atomicity only; (5) your engineering team lacks HBase operational experience — production incidents involving GC tuning, WAL replay, and region reassignment are genuinely difficult to debug without prior exposure.

📊 Production Insight

A team chose HBase for a 50GB dataset that needed SQL aggregation queries on ad-hoc dimensions.

They spent 3 months building and tuning an Apache Phoenix integration before realising that a PostgreSQL instance with partition pruning would have handled the load with far less engineering investment.

Rule: if your data fits on one machine and you need SQL, do not use HBase. Scale the database before adopting a distributed store.

🎯 Key Takeaway

HBase excels at petabyte-scale random read/write with strong consistency and native HDFS integration for co-located batch jobs.

Cassandra is the better choice for multi-datacenter deployments and tunable per-operation consistency; MongoDB for flexible schemas and complex query patterns.

Rule: the right reason to choose HBase is a specific technical requirement — petabyte scale, strong consistency, HDFS co-location — not familiarity with the Hadoop ecosystem.

RegionServer Death — What Actually Happens to Your Writes

When a RegionServer goes down, it's not the data you lose — it's the write path. HDFS keeps the files intact. What breaks is your ability to serve or write until the cluster heals.

ZooKeeper detects the failure within seconds. It removes the dead server from the active list, then waits for HMaster to assign the orphaned regions to surviving RegionServers. During that window — typically 30-60 seconds depending on your cluster size and WAL replay — any client write to those regions gets a NotServingRegionException. Your application needs to retry. If it doesn't, you've got silent data loss.

The WAL (Write-Ahead Log) on HDFS is the safety net. Each RegionServer writes all mutations to a shared WAL before applying them to MemStore. When HMaster reassigns a region, the new RegionServer replays that WAL to reconstruct MemStore state. If the WAL is corrupted or the dead server took HDFS blocks with it, you're looking at manual repair with hbase hbck.

Production rule: Never set the ZK session timeout below 30 seconds. You'll trigger false failovers during GC pauses.

RecoverDeadRegionServer.sqlSQL

// io.thecodeforge — database tutorial
// Check region assignment after server failure

-- Step 1: Identify unassigned regions
scan 'hbase:meta', {COLUMNS => ['info:regioninfo', 'info:server']}

-- Step 2: Force region assignment (rare, only when auto-assign fails)
alter 'user_events', {NAME => 'cf1', REPLICATION_SCOPE => 1}

-- Step 3: Verify WAL replay status in logs
-- grep 'replaying edits' /var/log/hbase/*RegionServer*.log

-- Step 4: Run cluster health check
hbase hbck -details

Output

Number of orphaned regions: 12

Regions stuck in OPENING state: 3

WAL replay in progress on rs-04.example.com

⚠ Production Trap:

A single slow WAL replay can block all region reassignments. Monitor 'hbase.regionserver.wal.max.splitters' — default 2. Bump to 4 on clusters with >50 RegionServers.

🎯 Key Takeaway

A dead RegionServer is a write outage, not a data outage — unless you ignore WAL replay.

SplitBrain — Why ZooKeeper Is Not Optional

HBase without ZooKeeper is like a plane without a black box. You'll crash eventually, and nobody will know why.

ZooKeeper maintains the cluster consensus: which server is the active HMaster, which regions belong to which RegionServer, and the state of every table. If ZooKeeper goes down, HMaster election stops. RegionServers continue serving existing data, but no new regions can be assigned, no splits happen, no failovers run. You're flying blind.

Split-brain is the real nightmare. If network partition isolates half the cluster, without ZooKeeper's quorum you could end up with two HMasters both trying to assign the same region. Data gets written to two places. Reads return stale or inconsistent results. Recovery requires manual merge of HFiles — a job nobody wants at 3 AM.

ZooKeeper uses a majority quorum — 3 nodes for small clusters, 5 for production. Never run an odd number. If you lose the quorum, HBase shuts down all client-facing operations until ZooKeeper heals. That's by design. It's safer than serving corrupt data.

Senior shortcut: Monitor ZooKeeper latency. If your HBase write latency spikes but CPU is low, ZK is your bottleneck.

💡Senior Shortcut:

Run 'echo stat | nc zk-node 2181' — if 'Outstanding' requests > 1000 and 'Znode count' > 100k, your ZK is overloaded. Add nodes before HBase degrades.

🎯 Key Takeaway

ZooKeeper is the cluster's single source of truth. Lose the quorum, lose the cluster.

HBase Training Syllabus: What Senior Engineers Actually Need

Most training material teaches you HBase theory, then drops you into a production fire. That's backwards. Real value comes from understanding failure modes before you hit them.

Start with row key design. Not the theory — the practical patterns that prevent region hot spotting. Then move to compaction mechanics. Your cluster will die not from reads, but from write amplification during major compactions.

Next, region server failure handling. You need to know what happens to WALs when a node dies. Don't skip ZooKeeper session loss — that's the root cause of split-brain scenarios. Finally, Bloom filters and BlockCache tuning. Most engineers get these wrong by setting global defaults. You tune per column family.

This syllabus skips fluff. It prioritizes operational survival over academic knowledge.

RegionServerCheck.sqlSQL

// io.thecodeforge — database tutorial

-- Identify regions with write skew before training
SELECT
  table_name,
  region_name,
  storefile_size_mb,
  writes_per_second,
  CASE
    WHEN writes_per_second > 1000 THEN 'HOT_REGION'
    ELSE 'OK'
  END AS region_status
FROM
  hbase_region_metrics
WHERE
  storefile_size_mb > 10000
  AND writes_per_second > 500
ORDER BY
  writes_per_second DESC
LIMIT 20;

Output

table_name | region_name | storefile_size_mb | writes_per_second | region_status

-------------+-------------------+-------------------+-------------------+---------------

user_events | default,1,abc123 | 15420 | 2300 | HOT_REGION

orders | default,3,def456 | 12800 | 1800 | HOT_REGION

⚠ Production Trap:

Do not train on single-node clusters. The failure modes you need to learn — region splitting, WAL replication delays, region server stalls — only happen at scale.

🎯 Key Takeaway

Train on operational survival, not theory. Hot regions and compaction storms kill clusters faster than any data model.

Why NoSQL Education Never Covers Your Actual Workload

Every HBase training course starts with CAP theorem and column families. That's fine for a whiteboard. In production, your pain points are different.

You'll spend months debugging write path latencies caused by MemStore flushes timing badly. You'll chase region server crashes that happen only when ZooKeeper sessions expire during network blips. No course teaches you to read a region server log in real time and spot a pending compaction deadlock.

The gap is worse: most training assumes uniform access patterns. You don't have those. Your hot keys will be 10% of your data causing 90% of your load. Courses skip how to design pre-splitting strategies that match your actual key distribution.

Build your own curriculum. Start with region server metrics, then WAL hygiene, then compaction tuning. That's where production engineering lives.

CompactionDeadlockCheck.sqlSQL

// io.thecodeforge — database tutorial

-- Find compaction queues blocking writes
SELECT
  rs_host,
  compaction_queue_size,
  flush_queue_size,
  memstore_size_mb,
  CASE
    WHEN compaction_queue_size > 10 AND flush_queue_size > 5
    THEN 'DEADLOCK_RISK'
    ELSE 'NORMAL'
  END AS cluster_health
FROM
  hbase_rs_metrics
WHERE
  compaction_queue_size > 0
ORDER BY
  compaction_queue_size DESC
LIMIT 10;

Output

rs_host | compaction_queue_size | flush_queue_size | memstore_size_mb | cluster_health

-------------+----------------------+------------------+------------------+----------------

rs01:16020 | 14 | 8 | 3200 | DEADLOCK_RISK

rs02:16020 | 2 | 1 | 800 | NORMAL

🔥Senior Shortcut:

If compaction queue length ever exceeds region server CPU cores, you're one spike away from write path timeouts. Tune hbase.hstore.compaction.min and max before that happens.

🎯 Key Takeaway

Actual HBase skills come from reading metrics and logs, not lectures. Compaction backpressure is the silent killer no syllabus covers.

Summary: HBase at 50,000 Feet

HBase is a distributed, column-oriented NoSQL database built on HDFS, designed for real-time random read/write access to massive datasets. Its strength lies in strong consistency, automatic sharding, and seamless Hadoop ecosystem integration. However, HBase is not a magic bullet. Row key design is the single most critical architectural decision — a poorly chosen key causes hotspotting, region splits, and degraded performance that no amount of hardware can fix. Write performance depends on the write-ahead log (WAL) and memstore flushing; reads rely on Bloom filters and BlockCache for efficiency. Compactions, while necessary, silently consume I/O if misconfigured. RegionServer failures are handled via ZooKeeper coordination and HDFS replication, but split-brain scenarios require ZooKeeper's consensus guarantees. You trade operational complexity for scaling power — HBase excels with billions of rows and high write throughput, but fails for low-latency queries on small datasets or complex joins. Use it when you need linear scaling, schema flexibility, and Hadoop-native integration. Avoid it when simplicity, operational ease, or ACID transactions across rows are mandatory.

HBaseSummaryChecklist.sqlSQL

// io.thecodeforge — database tutorial
// Key decisions before deploying HBase

-- 1. Row key design is NOT optional
-- Avoid monotonic keys (timestamps) to prevent hotspotting
-- Use salted or hashed prefixes (e.g., userId_hash + timestamp)

-- 2. Always set Bloom Filter type on column families
-- ROW vs ROWCOL: ROWCOL for high-selectivity scans
ALTER 'users', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}

-- 3. Configure BlockCache (on-heap + off-heap) for read-heavy workloads
-- hfile.block.cache.size = 0.4 (default 0.25)

-- 4. Monitor compaction queue length daily
-- hbase.hstore.compaction.throughput.lower.bound = 20MB/sec

-- 5. Test RegionServer failure recovery with simulated kills
-- WAL replayed, data safe — but only if HDFS replication >= 3

-- 6. ZooKeeper quorum: odd number, minimum 3, dedicated nodes
-- No ZK shared with Kafka or Solr
-- End of summary checklist

Output

No output — configuration reference for senior engineers.

⚠ Production Trap:

Treating HBase like a traditional RDBMS leads to disaster. You cannot add indexes after the fact, change row keys easily, or rely on secondary indexes without manual maintenance. Every design choice is final at scale.

🎯 Key Takeaway

Row key design and compaction tuning define your HBase performance ceiling; architecture handles scale, but only smart design prevents failure.

HBase vs Cassandra vs Bigtable: Column-Family Comparison

While HBase, Cassandra, and Google Cloud Bigtable all belong to the wide-column NoSQL family, their internal architectures differ significantly, impacting row key design, consistency, and operational behavior.

Consistency Model: HBase offers strong consistency (per row) via its single-master architecture and WAL. Cassandra provides tunable consistency (e.g., QUORUM, ONE) but defaults to eventual consistency. Bigtable also provides strong consistency, similar to HBase.

Row Key Design: In HBase, row key design is critical to avoid hotspotting because all writes go to a single region server per row key range. Cassandra uses consistent hashing (partitioners) to distribute rows across nodes, reducing hotspot risk but making range scans expensive. Bigtable splits tablets automatically based on load, but row key design still matters for locality.

Column Families: HBase supports multiple column families per table, but they are stored separately (one store per CF per region), which can cause performance issues if CFs have different access patterns. Cassandra also supports multiple CFs (called tables), but they are stored together per partition. Bigtable allows only one column family per table (though you can have many columns), simplifying storage.

Compaction: HBase uses major and minor compactions, which can block writes if not tuned. Cassandra uses size-tiered or leveled compaction, with less blocking. Bigtable handles compaction transparently.

When to choose: HBase for strong consistency and HDFS integration; Cassandra for multi-datacenter, high write throughput with eventual consistency; Bigtable for fully managed, low-latency, high-throughput workloads on GCP.

SQL Example: ```sql -- HBase: create table with column families CREATE TABLE 'users', 'profile', 'activity'

-- Cassandra: create keyspace and table CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}; CREATE TABLE mykeyspace.users (id UUID PRIMARY KEY, name text, email text);

-- Bigtable: create table via cbt cbt createtable users families=profile:maxversions=1,activity:maxversions=1 ```

column_family_comparison.sqlSQL

-- HBase: create table with column families
CREATE TABLE 'users', 'profile', 'activity'

-- Cassandra: create keyspace and table
CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE mykeyspace.users (id UUID PRIMARY KEY, name text, email text);

-- Bigtable: create table via cbt
cbt createtable users families=profile:maxversions=1,activity:maxversions=1

🔥Column Family Best Practices

📊 Production Insight

In production, we migrated a time-series workload from Cassandra to HBase to achieve strong consistency for financial data, but had to redesign row keys to avoid region server hotspots during peak writes.

🎯 Key Takeaway

HBase provides strong consistency but requires careful row key design to avoid hotspots; Cassandra distributes writes evenly but sacrifices range scan performance; Bigtable offers managed scalability with automatic load balancing.

HBase Region Splitting and Compaction

Region splitting and compaction are two automatic processes in HBase that directly impact write performance and read latency. Misconfiguration can lead to the very hotspotting and RegionServer death discussed earlier.

Region Splitting: When a region exceeds a configured size (default 10GB), HBase splits it into two daughter regions. This is critical for load distribution. However, if row keys are poorly designed (e.g., monotonically increasing), splits may concentrate load on one server. There are three split policies: ConstantSizeRegionSplitPolicy (deprecated), IncreasingToUpperBoundRegionSplitPolicy (default), and SteppingSplitPolicy. The default policy uses a heuristic based on region count and table size.

Compaction: HBase merges HFiles to reduce storage overhead and improve read performance. Minor compactions merge a few small files; major compactions merge all files in a region into one. Major compactions can be I/O intensive and block writes if not scheduled carefully. In production, disable automatic major compactions and run them during low-traffic windows using cron jobs.

Impact on Hotspots: Frequent splits due to hotspotting can cause a "split storm," overwhelming the RegionServer. Conversely, if compactions fall behind, the number of HFiles grows, increasing read latency and write amplification.

SQL Example: ```sql -- Disable automatic major compaction for a table ALTER 'mytable', {METHOD => 'table_att', MAJOR_COMPACTION => 'false'}

-- Manually trigger major compaction major_compact 'mytable'

-- Check region split statistics hbase> get_splits 'mytable' ```

region_compaction.sqlSQL

-- Disable automatic major compaction for a table
ALTER 'mytable', {METHOD => 'table_att', MAJOR_COMPACTION => 'false'}

-- Manually trigger major compaction
major_compact 'mytable'

-- Check region split statistics
hbase> get_splits 'mytable'

⚠ Major Compaction Pitfall

📊 Production Insight

We once saw a RegionServer hit 100% CPU because automatic major compaction kicked in during a write-heavy batch job. After switching to manual compaction with a scheduled cron job, CPU usage dropped to 40%.

🎯 Key Takeaway

Region splitting distributes data across servers, but poor row key design can cause split storms; compaction is essential for read performance but must be carefully scheduled to avoid write stalls.

thecodeforge.io

Apache Hbase Basics

HBase Use Cases: Time Series, IoT, and Large-Scale Storage

HBase excels in workloads requiring real-time random access to massive datasets, especially time-series data, IoT sensor logs, and large-scale storage (e.g., messaging, metadata).

Time Series: HBase's sorted row key structure is ideal for time-series data if the row key is designed as (metric_id + timestamp). However, monotonically increasing timestamps cause hotspotting. A common pattern is to salt the row key (e.g., prefix with a hash of metric_id) or use a bucket prefix (e.g., metric_id + (timestamp % num_buckets)).

IoT: IoT devices generate high-velocity writes. HBase can handle millions of writes per second if row keys are distributed. For example, use device_id (hashed) + timestamp to spread writes across regions. Bloom filters help skip irrelevant HFiles during reads.

Large-Scale Storage: HBase is used for storing message queues (e.g., Facebook Messenger), metadata for distributed file systems, and even as a backend for Apache Phoenix (SQL layer). The key is to model row keys for efficient scans and point lookups.

SQL Example: ```sql -- Time series row key with salting: hash(metric_id) + timestamp -- Insert example (pseudo-code) put 'tsdb', 'a1_20250101120000', 'data:value', '42.5'

-- Query last 10 minutes for a metric scan 'tsdb', {STARTROW => 'a1_20250101115000', ENDROW => 'a1_20250101120000'}

-- IoT device data with device_id prefix put 'iot', 'dev123_20250101120000', 'sensor:temp', '22.3' ```

use_cases.sqlSQL

-- Time series row key with salting: hash(metric_id) + timestamp
-- Insert example (pseudo-code)
put 'tsdb', 'a1_20250101120000', 'data:value', '42.5'

-- Query last 10 minutes for a metric
scan 'tsdb', {STARTROW => 'a1_20250101115000', ENDROW => 'a1_20250101120000'}

-- IoT device data with device_id prefix
put 'iot', 'dev123_20250101120000', 'sensor:temp', '22.3'

💡Salting for Time Series

📊 Production Insight

In a production IoT deployment, we used a 4-bit bucket prefix (0-15) on device IDs to spread writes across 16 regions, achieving 500K writes/sec without hotspotting.

🎯 Key Takeaway

HBase is well-suited for time series, IoT, and large-scale storage when row keys are designed to distribute writes and enable efficient scans; salting and bucketing are essential to avoid hotspots.

● Production incidentPOST-MORTEMseverity: high

HBase Cluster Pegged at 100% CPU on a Single RegionServer — All Writes Funneling to One Region

Symptom

Average write latency jumped from 2ms to 8 seconds overnight. One RegionServer showed 100% CPU and 95% MemStore utilisation while all others were below 5%. The HBase Master UI showed one region with 50GB of accumulated data while every other region held under 1GB. The cluster had 50 RegionServers and was running at roughly 2% of its total write capacity.

Assumption

The team assumed HBase would automatically distribute writes evenly across all regions regardless of row key pattern — similar to how a hash-based partitioning scheme in Kafka or Cassandra would distribute load. They had tested write throughput in staging with a few thousand records and seen no issues.

Root cause

Row keys were formatted as {timestamp}_{deviceId}. Since timestamps are monotonically increasing, every new write was directed to the single rightmost region — the one covering the highest current key range. This region never split into a position that would relieve pressure because all new writes continued arriving in the same range immediately after any split. The RegionServer hosting this region became the sole bottleneck: every write hit the same WAL segment, the same MemStore, and the same HDFS DataNodes backing that region. No horizontal scaling was possible because the row key pattern made distribution structurally impossible.

Fix

Reversed the timestamp bytes in the row key: the format changed from {timestamp}_{deviceId} to {Long.MAX_VALUE - timestamp}_{deviceId}. Reversed timestamps decrease as real time increases, so new writes are directed toward the leftmost regions instead of the rightmost. A salt prefix (deviceId hash modulo region count) was added as a secondary distribution mechanism to handle devices with correlated write patterns. After the fix, write throughput increased 40x and all 50 RegionServers showed balanced load within one hour. The table was pre-split into 50 regions at creation to avoid the initial single-region bottleneck during the data load.

Key lesson

Never use monotonically increasing values as row key prefixes — timestamps, auto-increment IDs, and sequence numbers all create write hotspots
Reverse timestamps or add salt prefixes before writing a single record — retrofitting requires a full data migration
Monitor per-region write count and region size — imbalance larger than 2x the average is the earliest warning sign of a hotspot
HBase does NOT auto-balance writes based on row key patterns — the sorted partitioning model means distribution is entirely the caller's responsibility

Production debug guideSymptom → Action mapping for common HBase production issues5 entries

Symptom · 01

Single RegionServer at 100% CPU, others idle

→

Fix

Check row key distribution immediately. Run hdfs dfs -du /hbase/data/<namespace>/<table>/ and compare region sizes. If one region is 10x larger than the median, your keys are sequential — the fix is a row key redesign with salt prefix or reversed timestamp. Also run hbase org.apache.hadoop.hbase.util.RegionSplitter to inspect current split boundaries.

Symptom · 02

Read latency spikes every 30-60 minutes on a regular schedule

→

Fix

Compaction is the likely cause. Check HBase Master UI for compaction queue length and run hdfs dfs -du /hbase/data/<namespace>/<table>// to see HFile count per region. If major compactions are firing automatically, tune hbase.hregion.majorcompaction.jitter to stagger them, or disable automatic major compactions and schedule them manually during off-peak hours.

Symptom · 03

OutOfMemoryError in RegionServer JVM

→

Fix

MemStore or BlockCache is over-allocated. Verify that hbase.regionserver.global.memstore.size + hfile.block.cache.size does not exceed 0.80. On a 32GB heap, MemStore at 0.35 and BlockCache at 0.40 leaves 25% for JVM overhead — the minimum required. Reduce one or both allocations if they sum above 0.80.

Symptom · 04

RegionServer keeps crashing and regions are stuck in transition

→

Fix

Run hbase hbck -details to check for META inconsistencies. If META is corrupt, run hbase hbck -fixMeta -fixAssignments. Also check ZooKeeper session timeouts — GC pauses longer than 1 second cause the RegionServer to lose its ZK ephemeral node, triggering a crash cycle even if the JVM itself recovers.

Symptom · 05

Writes timing out with RegionTooBusyException

→

Fix

The MemStore flush queue is backed up. Check hbase.hstore.flusher.count — the default of 2 flusher threads is often insufficient for high write rates. Also check HDFS write throughput via DataNode metrics — if HDFS is slow, MemStore flushes queue behind it and eventually block incoming writes. Increase flusher threads or reduce hbase.hregion.memstore.flush.size to trigger more frequent, smaller flushes.

★ HBase Quick Debug Cheat SheetFast diagnostics for production HBase issues. Run these commands to confirm the root cause before making any configuration changes.

RegionServer unresponsive or repeatedly dying−

Immediate action

Check GC pause duration and ZooKeeper session status

Commands

jstat -gcutil <rs_pid> 1000 10

hbase hbck -details 2>&1 | grep -i 'inconsist'

Fix now

If GC pause exceeds 1 second, increase -XX:G1ReservePercent and reduce MemStore share below 0.35. GC pauses longer than the ZK session timeout (default 90s divided by tick count) cause ephemeral node loss and trigger crash/recovery cycles.

Write throughput dropping — latency climbing above 50ms+

Read latency above 50ms for point lookups+

HBase vs Alternatives

Aspect	HBase	Cassandra	MongoDB	GCP Bigtable
Architecture	Master-based: HMaster + RegionServers + ZooKeeper coordination	Masterless peer-to-peer: every node is equal, no SPOF equivalent to ZK	Replica set with primary election; mongos router for sharded clusters	Fully managed: no servers, no compaction tuning, no ZooKeeper
Consistency model	Strong by default: single-row ACID, no tuning required	Tunable per operation: ONE (eventual) to ALL (strong); QUORUM is the common production choice	Strong with majority reads by default; configurable read/write concern	Strong by default: single-row ACID, identical model to HBase
Data model	Sorted wide-column families: sparse, versioned cells with TTL	Wide-column with CQL: more structured than HBase, less flexible than MongoDB	Document store: flexible BSON/JSON schema with nested documents and arrays	Sorted wide-column families: API-compatible with HBase, identical data model
Secondary indexes	None natively — use Apache Phoenix for SQL and secondary indexes at write amplification cost	Built-in materialized secondary indexes with automatic maintenance	Rich index support: single-field, compound, text, geospatial, TTL indexes	None natively — same constraint as HBase
Query interface	Get/Scan/Put API — no query language; HBase Shell for admin; Phoenix adds SQL	CQL (Cassandra Query Language) — SQL-like with limitations on joins and aggregation	Rich aggregation pipeline, full-text search, geospatial queries, $lookup joins	Read/write API; integrates with Dataflow and BigQuery for analytics
Multi-datacenter	No native support — requires custom WAL replication or third-party CDC	Native multi-DC replication with per-DC consistency configuration built into the core model	Global clusters with zone-aware reads/writes (Atlas); sharding adds complexity	Multi-region replication with Bigtable replication — managed, not DIY
HDFS / batch integration	Native — HFiles live on HDFS; direct Spark/MapReduce access to raw data files	None — own proprietary storage engine; Spark integration via connector	None — own storage engine; Atlas Data Federation for analytics queries	Google Colossus internally; Dataflow and BigQuery integration for analytics
Operational complexity	High: ZooKeeper quorum, HDFS NameNode HA, RegionServer GC tuning, compaction scheduling	Medium: no ZooKeeper dependency; tuning required for compaction and repair; simpler failure model	Low to medium: Atlas managed option removes most operational burden; self-hosted adds complexity	Low: fully managed service; no ZooKeeper, no compaction scheduling, no RegionServer sizing

⚙ Quick Reference

11 commands from this guide

File	Command / Code	Purpose
iothecodeforgehbaseHBaseArchitectureClient.java	/**	HBase Architecture
iothecodeforgehbaseRowKeyDesign.java	/**	Row Key Design
iothecodeforgehbaseCompactionManager.java	/**	Compactions
iothecodeforgehbaseReadTuningConfig.java	/**	Bloom Filters and BlockCache
RecoverDeadRegionServer.sql	scan 'hbase:meta', {COLUMNS => ['info:regioninfo', 'info:server']}	RegionServer Death
RegionServerCheck.sql	SELECT	HBase Training Syllabus
CompactionDeadlockCheck.sql	SELECT	Why NoSQL Education Never Covers Your Actual Workload
HBaseSummaryChecklist.sql	ALTER 'users', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}	Summary
column_family_comparison.sql	CREATE TABLE 'users', 'profile', 'activity'	HBase vs Cassandra vs Bigtable
region_compaction.sql	ALTER 'mytable', {METHOD => 'table_att', MAJOR_COMPACTION => 'false'}	HBase Region Splitting and Compaction
use_cases.sql	put 'tsdb', 'a1_20250101120000', 'data:value', '42.5'	HBase Use Cases

Key takeaways

HBase layers a sorted, indexed key-value store on top of HDFS to provide sub-10ms random reads across petabytes

HDFS alone supports only sequential access, making point lookups impractical without the indexing layer HBase provides.

Row key design is the single most critical and irreversible decision

it simultaneously determines write distribution across RegionServers, read locality, and scan efficiency. Sequential keys create hotspots that no hardware addition can fix.

Compactions trade write amplification (I/O cost during compaction) for read amplification reduction (fewer HFiles to check per read). Major compactions can saturate disk I/O for hours on large regions

disable automatic scheduling above 1TB and schedule manually.

MemStore + BlockCache must sum to 80% or less of JVM heap

exceeding this causes Full GC pauses that trigger ZooKeeper session timeouts and cascading RegionServer crashes.

ZooKeeper is the true single point of failure in HBase

if ZooKeeper loses quorum, the entire cluster becomes unavailable for reads, writes, and admin operations. Monitor it more aggressively than any other component.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Explain the HBase write path from client to disk. What happens if the Re...

Q02SENIOR

Why are sequential row keys bad in HBase, and how would you fix a table ...

Q03SENIOR

What is the difference between a minor and major compaction in HBase, an...

Q04SENIOR

How do bloom filters work in HBase, and when would you choose ROW versus...

Q05SENIOR

Your HBase cluster has 50 RegionServers but only 3 are handling writes. ...

Q06SENIOR

Compare HBase and Cassandra for a multi-datacenter e-commerce platform t...

Q01 of 06SENIOR

Explain the HBase write path from client to disk. What happens if the RegionServer crashes after writing to the WAL but before flushing the MemStore?

ANSWER

The write path has four steps. First, the client sends a Put to the RegionServer that owns the target row's region — it knows which server that is by looking up the hbase:meta table, which it caches locally. Second, the RegionServer appends the mutation to the Write-Ahead Log on HDFS. This is a sequential append to a file on a durable distributed filesystem — it survives RegionServer crashes. Third, the mutation is applied to the in-memory MemStore — a sorted skip list per column family per region. Fourth, the RegionServer sends acknowledgment to the client. If the RegionServer crashes after step 2 but before any MemStore flush to an HFile, the data is safe. When the HMaster detects the crash via ZooKeeper, it reassigns the dead server's regions to other RegionServers. Each recovering RegionServer replays the WAL for its newly assigned regions to reconstruct the MemStore state. The WAL replay guarantees that no acknowledged write is ever lost. The key insight is that WAL durability precedes client acknowledgment — those two facts together make HBase's write durability guarantee possible.

FAQ · 5 QUESTIONS

Frequently Asked Questions

What happens when a RegionServer crashes?

Can HBase handle secondary indexes?

How does HBase handle region splits?

What is the difference between HBase and HDFS?

How do I monitor HBase health in production?

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Drawn from code that ran under real load.

✓ Verified

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

🔥

That's NoSQL. Mark it forged?

15 min read · try the examples if you haven't