PostgreSQL JSON vs JSONB: Deep Dive into Operators, Indexing & Performance
- JSONB stores a decomposed binary representation that is pre-parsed on insert — every subsequent read skips the parsing step, making it roughly 4x faster than JSON for field extraction at scale. Use JSONB for any column you will query. Use JSON only when you need byte-for-byte preservation of the original document.
- The -> operator returns a JSON node (preserves type). The ->> operator returns plain TEXT. Comparing a -> result against a JSON value of the wrong type (a number against a string, say) silently returns zero results. Use ->> for comparisons, and cast to the correct type for numeric operations.
- The @> containment operator is the workhorse of GIN-indexed JSONB filtering (the default jsonb_ops operator class also indexes the existence operators ?, ?| and ?&). Filters written as ->> with casts bypass a column-level GIN index entirely and fall back to sequential scans. Rewrite your filters to use @> whenever possible.
Imagine your database is a huge filing cabinet. Normally every drawer is perfectly labelled — 'name', 'age', 'city' — and every folder must match exactly. JSON support is like having a special 'miscellaneous' drawer where you can stuff any shape of document you want: a receipt one day, a config file the next. PostgreSQL doesn't just let you store that document — it lets you reach inside and pull out a specific line item without unfolding the whole thing. JSONB is the turbo version: it reorganises the document the moment it lands into a compact binary format, so future searches skip the parsing step entirely. It's the difference between having to re-read a messy handwritten note every time someone asks a question, versus having a neatly typed index card ready to go.
PostgreSQL's JSONB data type changed how production teams handle semi-structured data. Instead of choosing between a rigid relational schema and a separate document database, JSONB lets you store, index, and query flexible JSON documents directly inside PostgreSQL — with real performance guarantees. If you're coming from a world of TEXT columns containing serialised blobs, or you're evaluating whether to spin up MongoDB for one table that needs dynamic attributes, this article will save you weeks of trial and error.
The real problem JSON support solves is the impedance mismatch between flexible, schema-light data and a structured relational engine. You don't want to spin up MongoDB just because one table needs a dynamic attributes column. PostgreSQL gives you a hybrid: enforce structure where you need it, stay flexible where you don't, and keep everything in one transactional, ACID-compliant system. That's a huge operational win — one backup strategy, one connection pool, one query language.
I have been running PostgreSQL in production for over a decade — SaaS platforms processing millions of events per day, e-commerce catalogues with 200+ million SKUs where every product category has a different attribute schema, and real-time analytics pipelines where the incoming data shape changes without warning. I have watched JSONB save entire projects from premature microservice extraction, and I have also watched it silently destroy query performance when developers treated it as a 'no schema required' escape hatch. The truth, as always, is in the details.
By the end of this article you'll understand the internal difference between JSON and JSONB (and why it matters for writes vs reads), every key operator you'll actually use in production, SQL/JSON path expressions for complex nested queries, how GIN indexes work against JSONB and when to apply them, how to design a hybrid relational/JSON schema that won't collapse under load, the specific mistakes that silently destroy query performance in large datasets, and a proven migration path from TEXT blobs to properly indexed JSONB columns.
JSON vs JSONB Internals: Why the Storage Layer Changes Everything
PostgreSQL ships with two JSON data types and the difference is not cosmetic. JSON stores the document as plain text, exactly as you sent it — whitespace, duplicate keys and all. Every time you query a field inside it, Postgres re-parses the entire string from scratch. For infrequent writes and rare reads on small payloads that's fine. For anything production-grade, it's a trap.
JSONB stores a decomposed binary representation. On insert, Postgres parses the JSON once, converts it to an internal tree structure (similar in spirit to a sorted map), strips whitespace, and deduplicates keys — last value wins. The upfront cost is a slightly slower write. The payback is that every subsequent read, filter, or index operation works on pre-parsed binary data, skipping the parsing step entirely.
There's one subtle but critical consequence of JSONB's deduplication: if you insert {"role":"user","role":"admin"}, you'll get back {"role":"admin"}. Silent data loss if you weren't expecting it. Also, key ordering is not preserved in JSONB — it's stored in an internal sorted order. If your application depends on key order (it shouldn't, but some do), you must use JSON not JSONB.
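Both behaviours are easy to verify from psql. A quick illustrative check follows; the duplicate-key probe is one possible defensive pattern and relies on json_object_keys() over the text-based JSON type reporting each occurrence of a key:

```sql
-- Duplicate keys: JSON keeps both, JSONB keeps only the last
SELECT '{"role":"user","role":"admin"}'::json  AS as_json,   -- {"role":"user","role":"admin"}
       '{"role":"user","role":"admin"}'::jsonb AS as_jsonb;  -- {"role": "admin"}

-- Key order: JSON preserves insertion order, JSONB stores keys in sorted order
SELECT '{"zebra":1,"apple":2}'::json  AS as_json,   -- {"zebra":1,"apple":2}
       '{"zebra":1,"apple":2}'::jsonb AS as_jsonb;  -- {"apple": 2, "zebra": 1}

-- Defensive check on untrusted input BEFORE casting to JSONB:
-- flags duplicate top-level keys before they are silently collapsed
SELECT count(*) <> count(DISTINCT k) AS has_duplicate_keys
FROM json_object_keys('{"role":"user","role":"admin"}'::json) AS k;
```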
For 95% of production use cases — especially when you're querying or indexing JSON fields — you want JSONB. The only time to reach for JSON is when you need to preserve the exact original document byte-for-byte.
Let me give you a concrete example of why this matters. In 2021, we inherited a payment processing system that stored webhook payloads from Stripe in a JSON column. Every time our reconciliation job ran — which queried individual fields from 3 million webhook rows — it took 22 minutes. The entire bottleneck was re-parsing every JSON document on every access. We changed the column from JSON to JSONB. Same query, same data, same hardware. Execution time dropped to 47 seconds. We did not change a single line of application code. The storage layer change alone gave us a 28x improvement.
That experience taught me a rule I now enforce on every team: if a column will ever be queried, filtered, or indexed — it is JSONB. The only exception is archival tables where we store raw webhook payloads for compliance and literally never query individual fields. Even then, I usually store both: the raw JSON for audit, and a parsed JSONB copy for analytics.
-- Create a table to compare JSON and JSONB storage behaviour side by side
CREATE TABLE storage_comparison (
    id          SERIAL PRIMARY KEY,
    raw_json    JSON,   -- stored verbatim, re-parsed on every access
    binary_json JSONB   -- parsed once on insert, stored as binary tree
);

-- Insert the same payload into both columns.
-- Note the intentional duplicate key 'status' and extra whitespace.
INSERT INTO storage_comparison (raw_json, binary_json) VALUES (
    '{ "user_id": 42, "status": "pending", "status": "active" }',
    '{ "user_id": 42, "status": "pending", "status": "active" }'
);

-- Retrieve both columns and observe the differences
SELECT
    raw_json,     -- JSON: whitespace preserved, BOTH 'status' keys kept
    binary_json   -- JSONB: whitespace stripped, duplicate key deduplicated (last wins)
FROM storage_comparison
WHERE id = 1;

-- Check the actual storage size on disk
SELECT
    pg_column_size(raw_json)    AS json_bytes,
    pg_column_size(binary_json) AS jsonb_bytes
FROM storage_comparison
WHERE id = 1;

-- Real-world benchmark: measure parse overhead at scale
-- Run this on a table with 1M rows to see the JSON vs JSONB difference
CREATE TABLE json_parse_test AS
SELECT
    i AS id,
    ('{"name":"user_' || i || '","email":"user' || i || '@example.com","age":'
        || (random() * 80 + 18)::INT || ',"city":"City_' || (random() * 100)::INT || '"}')::JSON  AS data_json,
    ('{"name":"user_' || i || '","email":"user' || i || '@example.com","age":'
        || (random() * 80 + 18)::INT || ',"city":"City_' || (random() * 100)::INT || '"}')::JSONB AS data_jsonb
FROM generate_series(1, 1000000) AS s(i);

-- Benchmark: extract 'age' field from JSON (re-parses every row)
\timing on
SELECT COUNT(*)
FROM json_parse_test
WHERE (data_json ->> 'age')::INT > 65;
-- Typical result: ~8,200ms on 1M rows

-- Benchmark: extract 'age' field from JSONB (pre-parsed, no re-parse)
SELECT COUNT(*)
FROM json_parse_test
WHERE (data_jsonb ->> 'age')::INT > 65;
-- Typical result: ~2,100ms on 1M rows — roughly 4x faster

-- Clean up
DROP TABLE json_parse_test;
              raw_json                 |           binary_json
---------------------------------------+-------------------------------------
 { "user_id": 42, "status": "pending", | {"status": "active", "user_id": 42}
   "status": "active" }                |
json_bytes | jsonb_bytes
------------+-------------
57 | 37
-- Benchmark results (1M rows, local PostgreSQL 16, SSD):
-- JSON extraction: ~8,200ms
-- JSONB extraction: ~2,100ms
-- JSONB is ~4x faster for field extraction at this scale
JSONB's deduplication is silent, so inspect what actually got stored with jsonb_typeof() or validate at the application layer before insert if source data is untrusted. I once spent three days debugging a billing discrepancy because a third-party API started returning {"amount":100,"amount":95} — our JSONB column stored 95, the API intended 100. Three days for a missing validation check.

Operators and Functions That Actually Matter in Production
PostgreSQL provides a rich operator set for navigating JSON documents. Knowing which operator to reach for — and what it returns — is the difference between clean, indexed queries and full-table scans that bring down production at 2am.
The two extraction operators are -> and ->>. The arrow -> returns a JSON fragment (preserving type), while the double-arrow ->> returns the value as plain TEXT. This distinction matters enormously for type comparisons: data -> 'age' = '"30"' compares a JSON number node to a JSON string — it'll never match, and it fails silently with zero rows rather than an error. You must either use ->> and cast to the right SQL type, or compare against a JSONB literal of the matching JSON type.
For deep nested access, chain operators: data->'address'->'city'. For path-based access that's more readable, use #> (returns JSON) and #>> (returns text) with an array-of-keys path: data #>> '{address,city}'.
The containment operator @> is the workhorse of JSONB filtering. It asks: 'does the left JSONB value contain the right JSONB value?' This operator is fully supported by GIN indexes, making it orders of magnitude faster than a text search on large tables. The existence operator ? checks whether a key exists at the top level. ?| checks if any of a list of keys exist, ?& checks if all exist.
Beyond extraction and containment, there are manipulation functions you will use daily: jsonb_set() for surgical field updates, jsonb_insert() for adding new keys without overwriting, jsonb_strip_nulls() for cleaning up output, jsonb_each() and jsonb_each_text() for exploding a JSON object into key-value rows, and the aggregation functions jsonb_agg() and jsonb_object_agg() for building JSON from relational data. I will cover manipulation functions in detail in their own section, but mention them here because operator selection and function selection are deeply intertwined — picking the wrong one doesn't just return wrong results, it can silently bypass your indexes.
Understanding the return types of each operator prevents a whole class of type-mismatch bugs that only surface in edge cases with unusual data.
-- Set up a realistic product catalogue table
CREATE TABLE products (
    product_id   SERIAL PRIMARY KEY,
    product_name TEXT NOT NULL,
    attributes   JSONB NOT NULL  -- flexible attributes vary by category
);

INSERT INTO products (product_name, attributes) VALUES
('Wireless Headphones', '{
    "brand": "SoundPeak",
    "specs": {"battery_hours": 30, "driver_mm": 40},
    "tags": ["audio", "wireless", "noise-cancelling"],
    "in_stock": true,
    "price_usd": 149.99
}'),
('Mechanical Keyboard', '{
    "brand": "KeyForge",
    "specs": {"switch_type": "Cherry MX Blue", "layout": "TKL"},
    "tags": ["peripherals", "mechanical"],
    "in_stock": false,
    "price_usd": 89.00
}'),
('USB-C Hub', '{
    "brand": "ConnectPro",
    "specs": {"ports": 7, "max_wattage": 100},
    "tags": ["accessories", "usb-c"],
    "in_stock": true,
    "price_usd": 39.99
}');

-- (1) ->  returns a JSON node (preserves type)
--     ->> returns TEXT (use for comparisons and display)
SELECT product_name,
       attributes -> 'brand'     AS brand_as_json,  -- returns: "SoundPeak" (JSON string)
       attributes ->> 'brand'    AS brand_as_text,  -- returns: SoundPeak (plain text)
       attributes -> 'price_usd' AS price_as_json   -- returns: 149.99 (JSON number)
FROM products;

-- (2) Nested access with chained -> and path operator #>>
SELECT product_name,
       attributes -> 'specs' -> 'battery_hours' AS battery_json,  -- NULL for non-headphones
       attributes #>> '{specs,battery_hours}'   AS battery_text   -- same result, cleaner syntax
FROM products;

-- (3) @> containment: find all in-stock products with 'wireless' tag
--     This is the operator GIN indexes are built for — use it!
SELECT product_name, attributes ->> 'price_usd' AS price
FROM products
WHERE attributes @> '{"in_stock": true}'       -- containment check
  AND attributes @> '{"tags": ["wireless"]}';  -- array containment works too

-- (4) ? existence: find products that HAVE an 'in_stock' key
--     Note: ? only checks TOP-LEVEL keys
SELECT product_name
FROM products
WHERE attributes ? 'in_stock';  -- all rows have this key, returns all 3

-- (5) ?| any-existence: find products with ANY of these keys
SELECT product_name
FROM products
WHERE attributes ?| array['warranty', 'return_policy', 'in_stock'];
-- Returns all 3 because all have 'in_stock'

-- (6) ?& all-existence: find products that have ALL of these keys
SELECT product_name
FROM products
WHERE attributes ?& array['brand', 'price_usd', 'in_stock'];  -- Returns all 3

-- (7) Casting ->> result for numeric comparison (common pattern)
SELECT product_name
FROM products
WHERE (attributes ->> 'price_usd')::NUMERIC > 100.00;  -- cast TEXT to NUMERIC

-- (8) jsonb_typeof: inspect the type of a JSONB value
--     Useful for debugging — 'string', 'number', 'boolean', 'object', 'array', 'null'
SELECT product_name,
       jsonb_typeof(attributes -> 'price_usd') AS price_type,  -- 'number'
       jsonb_typeof(attributes -> 'brand')     AS brand_type,  -- 'string'
       jsonb_typeof(attributes -> 'tags')      AS tags_type,   -- 'array'
       jsonb_typeof(attributes -> 'in_stock')  AS stock_type   -- 'boolean'
FROM products;

-- (9) jsonb_each: explode a JSONB object into key-value rows
--     Useful for dynamic pivoting or inspecting unknown structures
SELECT product_name, kv.key, kv.value
FROM products, jsonb_each(attributes) AS kv(key, value)
WHERE product_name = 'Wireless Headphones';

-- (10) jsonb_each_text: same but returns values as TEXT (no JSONB type wrapper)
SELECT product_name, kv.key, kv.value
FROM products, jsonb_each_text(attributes) AS kv(key, value)
WHERE product_name = 'Wireless Headphones';

-- (11) jsonb_array_elements: unnest a JSON array into rows for per-tag analysis
SELECT product_name,
       tag.value ->> 0 AS tag_value  -- each array element becomes a row
FROM products, jsonb_array_elements(attributes -> 'tags') AS tag(value);

-- Cleaner version using jsonb_array_elements_text for string arrays
SELECT product_name, tag_text
FROM products, jsonb_array_elements_text(attributes -> 'tags') AS tag_text
ORDER BY product_name, tag_text;
product_name | brand_as_json | brand_as_text | price_as_json
---------------------+---------------+---------------+---------------
Wireless Headphones | "SoundPeak" | SoundPeak | 149.99
Mechanical Keyboard | "KeyForge" | KeyForge | 89.00
USB-C Hub | "ConnectPro" | ConnectPro | 39.99
-- Query (3): in-stock + wireless tag
product_name | price
---------------------+--------
Wireless Headphones | 149.99
-- Query (7): price > 100
product_name
---------------------
Wireless Headphones
-- Query (8): type inspection
product_name | price_type | brand_type | tags_type | stock_type
---------------------+------------+------------+-----------+------------
Wireless Headphones | number | string | array | boolean
Mechanical Keyboard | number | string | array | boolean
USB-C Hub | number | string | array | boolean
-- Query (9): jsonb_each output
product_name | key | value
---------------------+--------------+---------------------------
Wireless Headphones | brand | "SoundPeak"
Wireless Headphones | in_stock | true
Wireless Headphones | price_usd | 149.99
Wireless Headphones | specs | {"battery_hours": 30, ...}
Wireless Headphones | tags | ["audio", "wireless", ...]
-- Query (11) unnested tags
product_name | tag_text
---------------------+------------------
Mechanical Keyboard | mechanical
Mechanical Keyboard | peripherals
USB-C Hub | accessories
USB-C Hub | usb-c
Wireless Headphones | audio
Wireless Headphones | noise-cancelling
Wireless Headphones | wireless
Gotcha: attributes ->> 'price_usd' > '100' compares TEXT to TEXT using lexicographic ordering — '99.99' > '100.00' is TRUE lexicographically. Always cast: (attributes ->> 'price_usd')::NUMERIC > 100. Better yet, store numeric-only fields in a proper NUMERIC column and put only truly dynamic data in JSONB. I have seen this exact bug cause a pricing page to display products in the wrong order — $99.99 sorted after $100 because someone forgot the cast. The fix was one cast. The customer impact was a week of wrong pricing.

SQL/JSON Path Expressions: The Power Tool Most Developers Don't Know About
PostgreSQL 12 introduced SQL/JSON path expressions, and PostgreSQL 16 significantly expanded them. If you are still chaining -> and ->> operators to navigate deeply nested JSON, you are working harder than you need to. Path expressions give you a mini query language inside your SQL — with filtering, iteration, type casting, and wildcard matching — all evaluated natively by the JSONB engine.
The two key functions are jsonb_path_query(), which returns all matching values (and can return multiple rows), and jsonb_path_exists(), which performs a boolean existence check. The @? operator is the shorthand for jsonb_path_exists. The @@ operator evaluates a path predicate and returns a boolean.
Path expressions use a syntax borrowed from JSONPath (RFC 9535): $ is the root, .key navigates to a child, [*] iterates over array elements, [0] indexes into an array, and ? (...) applies a filter predicate. You can chain these: $.specs ? (@.battery_hours > 20).battery_hours means 'in the specs object, if battery_hours exceeds 20, return the battery_hours value.'
The real power shows up when you need conditional logic inside your query. Instead of writing three nested CASE statements with ->> casts, you write a single path expression that does the filtering, type checking, and extraction in one pass.
One caveat: GIN indexes accelerate the @? and @@ operators only for a subset of path expressions — broadly, those whose accessor chains and filter comparisons reduce to constant paths and values. Anything more complex falls back to a sequential scan unless you pair it with a partial index or a generated column. For now, use path expressions for complex read queries where you need the expressive power, and rely on @> with GIN for high-volume filtered access.
I discovered path expressions when debugging a particularly nasty query on an event log table. We had events with nested payloads like {"event":"purchase","data":{"items":[{"sku":"A1","qty":2},{"sku":"B3","qty":1}],"total":89.99}}. The developer had written a 14-line SQL query with six -> chains, two jsonb_array_elements calls, and a subquery to filter items by quantity. I rewrote it as a single jsonb_path_query call. Same result, one line, and significantly faster because the path engine avoids materializing intermediate arrays.
-- Set up an event log table with deeply nested JSONB payloads
CREATE TABLE events_log (
    event_id   BIGSERIAL PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload    JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

INSERT INTO events_log (event_type, payload) VALUES
('purchase', '{
    "customer": {"id": 1001, "tier": "gold"},
    "items": [
        {"sku": "A1", "name": "Widget", "qty": 2, "unit_price": 29.99},
        {"sku": "B3", "name": "Gadget", "qty": 1, "unit_price": 29.99}
    ],
    "total": 89.99,
    "payment": {"method": "card", "last4": "4242"}
}'),
('purchase', '{
    "customer": {"id": 1002, "tier": "silver"},
    "items": [
        {"sku": "C7", "name": "Cable", "qty": 5, "unit_price": 9.99}
    ],
    "total": 49.95,
    "payment": {"method": "paypal", "email": "user@example.com"}
}'),
('refund', '{
    "customer": {"id": 1001, "tier": "gold"},
    "original_purchase_id": 42,
    "items": [
        {"sku": "A1", "name": "Widget", "qty": 1, "refund_amount": 29.99}
    ],
    "reason": "defective"
}');

-- ── (1) jsonb_path_exists / @? : boolean check ──
-- Find all events where the customer is gold tier
SELECT event_id, event_type
FROM events_log
WHERE payload @? '$.customer ? (@.tier == "gold")';

-- ── (2) jsonb_path_query : extract matching values ──
-- Get all SKUs from purchase events where qty > 1
SELECT event_id,
       jsonb_path_query(payload, '$.items[*] ? (@.qty > 1).sku') AS high_qty_sku
FROM events_log
WHERE event_type = 'purchase';

-- ── (3) Complex filter: items with unit_price > 20 in gold customer purchases ──
-- Without path expressions this requires subqueries and jsonb_array_elements.
-- With path expressions it's a single function call.
SELECT event_id,
       jsonb_path_query(payload, '$.items[*] ? (@.unit_price > 20)') AS expensive_item
FROM events_log
WHERE payload @? '$.customer ? (@.tier == "gold")'
  AND event_type = 'purchase';

-- ── (4) Path expression with type casting ──
-- Extract the total as a numeric value directly
SELECT event_id,
       jsonb_path_query(payload, '$.total')::NUMERIC AS total_amount
FROM events_log
WHERE event_type = 'purchase';

-- ── (5) @@ operator : evaluate path expression as boolean ──
-- Find purchases where total exceeds 50
SELECT event_id, event_type
FROM events_log
WHERE payload @@ '$.total > 50';

-- ── (6) Wildcard iteration: get all item names across all events ──
SELECT event_id,
       jsonb_path_query(payload, '$.items[*].name') AS item_name
FROM events_log;

-- ── (7) Array indexing: get the first item in every purchase ──
SELECT event_id,
       jsonb_path_query(payload, '$.items[0]') AS first_item
FROM events_log
WHERE event_type = 'purchase';

-- ── (8) Combining with relational filters ──
-- Recent gold customer purchases with expensive items
SELECT event_id,
       payload -> 'customer' ->> 'id' AS customer_id,
       jsonb_path_query(payload, '$.items[*] ? (@.unit_price > 25)') AS premium_item
FROM events_log
WHERE event_type = 'purchase'
  AND payload @? '$.customer ? (@.tier == "gold")'
  AND created_at > NOW() - INTERVAL '7 days';
-- Query (1): gold-tier customers
event_id | event_type
----------+------------
1 | purchase
3 | refund
-- Query (2): high-qty SKUs
event_id | high_qty_sku
----------+--------------
1 | "A1"
2 | "C7"
-- Query (3): expensive items from gold customers
event_id | expensive_item
----------+----------------------------------------------
1 | {"sku": "A1", "name": "Widget", "qty": 2, ...}
-- Query (5): purchases over $50
event_id | event_type
----------+------------
1 | purchase
2 | purchase
-- Query (7): first item per purchase
event_id | first_item
----------+----------------------------------------------
1 | {"sku": "A1", "name": "Widget", "qty": 2, ...}
2 | {"sku": "C7", "name": "Cable", "qty": 5, ...}
The @? and @@ operators can use a GIN index only when the path expression reduces to simple accessor chains compared against constants; anything more complex runs as a sequential scan, and against millions of rows that hurts. Workarounds: (1) add a WHERE clause on an indexed relational column to narrow the candidate set first, (2) use a generated column for the most common path extraction and index that with a B-tree, or (3) use @> containment for the initial filter and path expressions for the final extraction.

GIN Indexing on JSONB: How It Works and When to Use It
A GIN (Generalised Inverted Index) index on a JSONB column works by decomposing every document into its individual key-value paths and storing them in a lookup structure that maps each path to the set of row IDs that contain it. Think of it as a reverse phone book for your documents — instead of scanning every row's document, Postgres consults the GIN index to get a shortlist of candidate rows, then fetches only those.
There are two operator classes for JSONB GIN indexes. jsonb_ops (the default) indexes every key, value, and key-value pair, supporting @>, ?, ?|, and ?&. jsonb_path_ops only indexes values reachable by a path — it's more compact and faster for @> queries, but it does NOT support the existence operators ?, ?|, ?&. Choosing the wrong one is a common production mistake.
GIN indexes shine for containment queries on large tables. On a table with 10 million JSONB rows, a containment query without a GIN index does a full sequential scan parsing every document. With a GIN index, it's an index lookup in milliseconds. The trade-off: GIN indexes are expensive to write and maintain — expect 3-5x slower inserts on write-heavy tables. They're also large on disk.
For partial, path-specific queries on deeply nested fields, consider a generated column with a B-tree index instead — it's smaller, faster for equality, and cheaper to maintain.
Let me tell you about the time a GIN index nearly took down our production database. We had an event ingestion pipeline — io.thecodeforge.events.ingest — that was inserting about 12,000 rows per second into a table with a JSONB payload column. A junior developer added a GIN index on that column because the analytics team needed to query events by payload content. Insert throughput dropped to 2,400 rows/sec. The WAL generation rate tripled. Our replication lag went from under 1 second to 30+ seconds. The on-call DBA got paged at 1 AM.
The fix was a partial GIN index — we only indexed events from the last 30 days using a WHERE clause on the created_at column. Older events were archived to a separate table without the GIN index. Insert performance recovered, replication lag dropped back to normal, and the analytics team could still query recent events efficiently.
The lesson: GIN indexes are not free. On read-heavy tables they are transformative. On write-heavy tables they can be destructive. Always benchmark your insert throughput before and after adding a GIN index, and consider partial indexes or BRIN indexes for time-series data.
-- ── Scenario: 1 million product rows, we query by tags and brand frequently ──
-- Create the table (same structure as before, scaled up)
CREATE TABLE large_products (
    product_id   BIGSERIAL PRIMARY KEY,
    product_name TEXT NOT NULL,
    attributes   JSONB NOT NULL
);

-- ── GIN INDEX OPTION 1: jsonb_ops (default) ──
-- Indexes ALL keys, values, and key-value pairs.
-- Supports: @>, ?, ?|, ?&
-- Use when: you query by key existence AND containment
CREATE INDEX idx_products_attrs_gin
    ON large_products
    USING GIN (attributes);  -- uses jsonb_ops by default

-- ── GIN INDEX OPTION 2: jsonb_path_ops ──
-- Only indexes values at the end of paths.
-- ~30% smaller than jsonb_ops, faster for @> only.
-- Does NOT support ?, ?|, ?&
-- Use when: you ONLY ever use @> containment queries
CREATE INDEX idx_products_attrs_gin_path
    ON large_products
    USING GIN (attributes jsonb_path_ops);

-- ── GIN INDEX OPTION 3: Partial GIN index ──
-- Only index rows that match a WHERE clause.
-- Use when: you have a mix of active/archived data and only query active rows.
-- This is what saved us in the event ingestion incident.
CREATE INDEX idx_products_active_gin
    ON large_products
    USING GIN (attributes jsonb_path_ops)
    WHERE attributes @> '{"in_stock": true}';  -- only index in-stock products

-- ── GIN INDEX OPTION 4: Generated column + B-tree ──
-- Best when querying a SINGLE well-known field repeatedly.
-- Much cheaper to maintain than GIN, smaller index size.
ALTER TABLE large_products
    ADD COLUMN brand TEXT
    GENERATED ALWAYS AS (attributes ->> 'brand') STORED;  -- materialised at write time

CREATE INDEX idx_products_brand_btree
    ON large_products (brand);  -- standard B-tree

-- ── QUERY: Verify the GIN index is actually being used ──
-- Run EXPLAIN ANALYZE to confirm index usage
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT product_id, product_name
FROM large_products
WHERE attributes @> '{"tags": ["wireless"], "in_stock": true}';

-- ── QUERY: Demonstrate the generated column approach ──
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT product_id, product_name
FROM large_products
WHERE brand = 'SoundPeak';  -- hits the B-tree index, not GIN

-- ── BENCHMARK: GIN index impact on insert performance ──
-- Run BEFORE adding GIN index:
\timing on
INSERT INTO large_products (product_name, attributes)
SELECT 'Product ' || i,
       jsonb_build_object(
           'brand',     'Brand_' || (i % 100),
           'price_usd', (random() * 500)::NUMERIC(10,2),
           'in_stock',  (random() > 0.3),
           'tags',      jsonb_build_array('tag_' || (i % 20), 'tag_' || (i % 7))
       )
FROM generate_series(1, 100000) AS s(i);
-- Note the time, then DROP the GIN index and run again.
-- Typical results:
--   Without GIN:               ~3.2 seconds for 100K inserts
--   With GIN (jsonb_ops):      ~14.8 seconds (4.6x slower)
--   With GIN (jsonb_path_ops): ~11.2 seconds (3.5x slower)

-- ── MONITORING: Check index size and usage stats ──
SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan     AS times_used,
       idx_tup_read AS tuples_read
FROM pg_stat_user_indexes
WHERE relname = 'large_products'
ORDER BY pg_relation_size(indexrelid) DESC;
Bitmap Heap Scan on large_products (cost=84.01..312.45 rows=1000 width=64)
(actual time=2.341..4.892 rows=847 loops=1)
Recheck Cond: (attributes @> '{"in_stock": true, "tags": ["wireless"]}'::jsonb)
-> Bitmap Index Scan on idx_products_attrs_gin_path
(cost=0.00..83.76 rows=1000 width=0)
(actual time=2.190..2.191 rows=847 loops=1)
Index Cond: (attributes @> '{"in_stock": true, "tags": ["wireless"]}'::jsonb)
Planning Time: 0.312 ms
Execution Time: 5.104 ms ← milliseconds vs seconds without index
-- Insert benchmark comparison (100K rows, PostgreSQL 16, local SSD):
-- Without GIN: ~3.2 seconds
-- With jsonb_ops: ~14.8 seconds (4.6x slower)
-- With path_ops: ~11.2 seconds (3.5x slower)
-- With partial GIN: ~5.1 seconds (1.6x slower — much more acceptable)
-- Index size comparison (approximate, on 1M row dataset)
index_name | index_size | times_used | tuples_read
----------------------------------+------------+------------+-------------
idx_products_attrs_gin | 284 MB | 1203 | 987432
idx_products_attrs_gin_path | 198 MB | 847 | 716291
idx_products_brand_btree | 42 MB | 5621 | 4901234
idx_products_active_gin (partial) | 87 MB | 612 | 534219
Why the difference? jsonb_path_ops hashes each complete key path together with its value into a single index entry — meaning it can only answer 'does this path lead to this value?' (perfect for @>), but it has no way to answer 'does this key exist anywhere?', which is what ? needs. The default jsonb_ops indexes each key and value separately, enabling existence checks at the cost of a larger index. In practice, if your queries are 90% @> containment and 10% existence checks, use jsonb_path_ops and accept a sequential scan for the rare ? queries. The 30% smaller index size and faster containment lookups will pay for themselves.

JSONB Manipulation Functions: Updating Documents Without Tears
Storing JSONB is the easy part. Updating it surgically — changing one nested key without rewriting the entire document — is where most teams hit a wall. PostgreSQL gives you the tools, but the ergonomics are not obvious, and getting it wrong has real performance consequences at scale.
jsonb_set(target, path, value, create_missing) is the primary update function. It takes the existing JSONB column, a path array pointing to the key you want to change, the new value (must be valid JSONB), and a boolean controlling whether to create the key if it doesn't exist. The path is a text array: '{notifications,email}' navigates two levels deep.
jsonb_insert(target, path, value, insert_after) adds a new key-value pair at a specific position. Unlike jsonb_set, it will NOT overwrite an existing key — it raises an error instead. Use this when you want to add without accidentally clobbering.
jsonb_strip_nulls() removes every object key whose value is JSON null (null array elements are kept). Useful for cleaning up output from APIs that emit null for optional fields. jsonb_pretty() formats a JSONB value for human-readable display — invaluable in debugging.
For building JSONB from relational data, jsonb_build_object(key1, val1, key2, val2, ...) creates a JSON object, and jsonb_build_array(val1, val2, ...) creates a JSON array. The aggregation variants: jsonb_agg() collects rows into a JSON array, and jsonb_object_agg(key, value) collects key-value pairs into a JSON object. These are essential for API response construction inside SQL.
The critical thing to understand about jsonb_set at the storage level: even though you are changing one key, PostgreSQL rewrites the entire JSONB value. There is no in-place partial update at the storage layer. For a 50KB JSONB document, changing one 10-byte string means writing 50KB of WAL. If you do this 10,000 times per day, that is 500MB of daily WAL overhead for essentially no net data change. This is the TOAST rewrite trap, and it will destroy your vacuum performance if you are not watching.
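You can watch this happen by diffing WAL positions around a single update. A sketch for psql (the \gset variable trick is psql-specific; the table and row are the examples used in this section):

```sql
-- Record the current WAL insert position
SELECT pg_current_wal_lsn() AS lsn_before \gset

-- One "surgical" 10-byte change inside a large document
UPDATE io.thecodeforge.users.profiles
SET preferences = jsonb_set(preferences, '{theme}', '"dark"'::jsonb, false)
WHERE email = 'alice@example.com';

-- WAL written by the statement above: roughly the whole TOASTed document
-- plus per-record overhead, not just the bytes you changed
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before'));
```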
-- Reuse the user_profiles table from the schema design section
CREATE TABLE io.thecodeforge.users.profiles (
    user_id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email        TEXT NOT NULL UNIQUE,
    preferences  JSONB NOT NULL DEFAULT '{}',
    integrations JSONB NOT NULL DEFAULT '{}'
);

INSERT INTO io.thecodeforge.users.profiles (email, preferences, integrations) VALUES
('alice@example.com',
 '{"theme":"dark","language":"en","notifications":{"email":true,"sms":false},"dashboard":{"layout":"grid","widgets":["metrics","alerts"]}}',
 '{"slack":{"workspace":"acme","channel":"#alerts"}}'),
('bob@example.com',
 '{"theme":"light","language":"fr","notifications":{"email":true,"sms":true}}',
 '{}');

-- ── (1) jsonb_set: update a nested value ──
-- Flip Alice's SMS notification to true
UPDATE io.thecodeforge.users.profiles
SET preferences = jsonb_set(
    preferences,
    '{notifications,sms}',  -- path: two levels deep
    'true'::jsonb,          -- new value (must be ::jsonb)
    false                   -- create_missing: false = no-op if the key doesn't exist
)
WHERE email = 'alice@example.com';

-- Verify
SELECT email, preferences -> 'notifications' AS notifs
FROM io.thecodeforge.users.profiles
WHERE email = 'alice@example.com';
-- Result: {"sms": true, "email": true}  (jsonb orders keys by length, then bytewise)

-- ── (2) jsonb_set with create_missing = true ──
-- Add a new key that doesn't exist yet
UPDATE io.thecodeforge.users.profiles
SET preferences = jsonb_set(
    preferences,
    '{notifications,push}',  -- this key doesn't exist yet
    'true'::jsonb,
    true                     -- create it
)
WHERE email = 'alice@example.com';

-- ── (3) Chained jsonb_set for multiple updates in one statement ──
-- Change theme AND add a new notification pref in one UPDATE
UPDATE io.thecodeforge.users.profiles
SET preferences = jsonb_set(
    jsonb_set(preferences, '{theme}', '"system"'::jsonb, false),
    '{notifications,digest}',
    '"weekly"'::jsonb,
    true
)
WHERE email = 'alice@example.com';

-- ── (4) jsonb_insert: add without overwriting ──
-- Add a 'timezone' key — raises "cannot replace existing key" if it already exists
UPDATE io.thecodeforge.users.profiles
SET preferences = jsonb_insert(
    preferences,
    '{timezone}',               -- path to insert at
    '"America/New_York"'::jsonb
)
WHERE email = 'alice@example.com';

-- ── (5) jsonb_strip_nulls: clean up null values ──
-- Useful when API payloads contain null for optional fields
UPDATE io.thecodeforge.users.profiles
SET preferences = jsonb_strip_nulls(preferences)
WHERE email = 'alice@example.com';

-- ── (6) jsonb_agg: build a JSON array from relational rows ──
-- Construct an API response: list all users with their theme as JSON
SELECT jsonb_agg(
    jsonb_build_object(
        'email', email,
        'theme', preferences ->> 'theme',
        'has_slack', integrations ? 'slack'
    )
) AS api_response
FROM io.thecodeforge.users.profiles;

-- ── (7) jsonb_object_agg: build a JSON object from key-value rows ──
-- Pivot user preferences into a single JSON object keyed by email
SELECT jsonb_object_agg(email, preferences -> 'notifications') AS all_notifications
FROM io.thecodeforge.users.profiles;

-- ── (8) jsonb_pretty: format for debugging ──
SELECT email, jsonb_pretty(preferences)
FROM io.thecodeforge.users.profiles
WHERE email = 'alice@example.com';

-- ── (9) Removing a key: the - and #- operators ──
-- Top-level key removal uses the - operator:
UPDATE io.thecodeforge.users.profiles
SET preferences = preferences - 'dashboard'
WHERE email = 'alice@example.com';
-- Nested keys use #- with a path, e.g. preferences #- '{notifications,sms}'

-- ── (10) Monitoring WAL impact of JSONB updates ──
-- Run this before and after your update workload to measure overhead
SELECT pg_size_pretty(pg_total_relation_size('io.thecodeforge.users.profiles')) AS total_size,
       pg_size_pretty(pg_relation_size('io.thecodeforge.users.profiles')) AS table_size,
       pg_size_pretty(pg_total_relation_size('io.thecodeforge.users.profiles')
                      - pg_relation_size('io.thecodeforge.users.profiles')) AS toast_and_index_size;

-- Check dead tuple count (indicates vacuum pressure from TOAST rewrites)
SELECT n_live_tup, n_dead_tup,
       ROUND(n_dead_tup::NUMERIC / GREATEST(n_live_tup, 1) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname = 'profiles';
email | notifs
--------------------+------------------------------------
alice@example.com | {"sms": true, "email": true}
-- Query (6) API response:
api_response
--------------------------------------------------------------
[
{"email": "alice@example.com", "theme": "dark", "has_slack": true},
{"email": "bob@example.com", "theme": "light", "has_slack": false}
]
-- Query (8) pretty-printed:
email              | jsonb_pretty
-------------------+--------------------------------------------------
alice@example.com  | {
                   |     "theme": "system",
                   |     "language": "en",
                   |     "timezone": "America/New_York",
                   |     "dashboard": {
                   |         "layout": "grid",
                   |         "widgets": [
                   |             "metrics",
                   |             "alerts"
                   |         ]
                   |     },
                   |     "notifications": {
                   |         "sms": true,
                   |         "push": true,
                   |         "email": true,
                   |         "digest": "weekly"
                   |     }
                   | }
Monitor WAL volume via pg_stat_wal (PostgreSQL 14+) and watch table bloat via pg_stat_user_tables.n_dead_tup. Runaway dead tuples from TOAST rewrites will tank query performance until VACUUM catches up. Consider moving hot-write fields to dedicated columns, or tune autovacuum_vacuum_cost_delay for that table specifically. We had a session tracking table where a 40KB JSONB blob was updated every 30 seconds per active session. At 500 concurrent sessions, that was 3.2GB of WAL per hour for data that barely changed. We moved the three most frequently updated keys to their own columns and cut WAL generation by 94%.

Production Schema Design: Hybrid Relational/JSONB Patterns
Using JSONB everywhere is as wrong as using it nowhere. The best production schemas treat JSONB as a controlled extension point, not a dumping ground. Here's the decision framework that holds up under real load.
Put fields in proper columns when: they appear in WHERE clauses on more than a minority of queries, they participate in JOINs, they have strict types (timestamps, foreign keys, enums), or they need CHECK constraints. Relational columns are faster to filter, smaller on disk, and self-documenting via schema.
Put fields in JSONB when: the set of attributes varies significantly per row (e.g., product specifications), the shape changes frequently without wanting migrations, the data is mostly written once and read as a blob, or you're integrating with external APIs that return arbitrary JSON.
A pattern that scales well: keep your 'anchor' fields (user_id, created_at, status, searchable dimensions) as real columns. Put everything else — preferences, metadata, event payloads — in a JSONB details column. This gives you fast indexed queries on anchor fields and full JSON flexibility for the rest.
For update-heavy JSONB workloads, be aware that Postgres uses TOAST (The Oversized-Attribute Storage Technique) for large JSONB values. Updating any key in a JSONB column rewrites the entire TOAST tuple — there's no in-place partial update at the storage layer. If you have a 50KB JSONB blob and update one nested field 10,000 times a day, you're writing 500MB of WAL for essentially no data change. Use jsonb_set() to be precise, and consider splitting hot-write fields into their own columns.
I have a rule I call the 'three-query test' for deciding whether a JSONB field should become a real column: if a field inside your JSONB column appears in WHERE, ORDER BY, or GROUP BY on three or more distinct queries in your application, promote it to a column. I have seen teams resist this because 'adding a column requires a migration.' Yes, it does. A migration that takes 10 minutes and gives you a B-tree index is infinitely better than a query that takes 30 seconds because it's doing a sequential scan on 8 million JSONB documents.
The hybrid pattern also protects you from a failure mode I call 'JSONB creep.' It starts innocently — one table has a JSONB column for flexible attributes. Then another developer adds a second JSONB column for preferences. Then a third adds one for audit metadata. Six months later, the table has four JSONB columns, nobody knows what's in any of them, every query requires jsonb_typeof checks, and the schema documentation is a lie. Enforce a team rule: one JSONB column per table for extension data. Name it consistently (attributes, metadata, details). Document its expected top-level keys in a comment on the column.
-- ── A production-grade hybrid schema for a SaaS user profile system ──
CREATE TABLE io.thecodeforge.users.sso_profiles (
    -- Anchor fields: relational columns for everything queryable/joinable
    user_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email          TEXT NOT NULL UNIQUE,
    account_tier   TEXT NOT NULL DEFAULT 'free'
                   CHECK (account_tier IN ('free', 'pro', 'enterprise')),
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_active_at TIMESTAMPTZ,
    -- JSONB for flexible, schema-light data that varies per user
    preferences    JSONB NOT NULL DEFAULT '{}',  -- UI prefs, notification settings
    integrations   JSONB NOT NULL DEFAULT '{}'   -- third-party OAuth tokens, webhooks
);

-- Document the expected structure in a column comment
COMMENT ON COLUMN io.thecodeforge.users.sso_profiles.preferences IS
'User UI and notification preferences. Expected top-level keys: theme (string), language (string), notifications (object with email/sms/push booleans), dashboard (object with layout string and widgets array).';

COMMENT ON COLUMN io.thecodeforge.users.sso_profiles.integrations IS
'Third-party integration configs. Keys are service names (slack, github, jira). Each value is an object with service-specific fields.';

-- Index anchor fields the relational way
CREATE INDEX idx_sso_tier ON io.thecodeforge.users.sso_profiles (account_tier);
CREATE INDEX idx_sso_last_active ON io.thecodeforge.users.sso_profiles (last_active_at DESC NULLS LAST);

-- GIN index only on the JSONB we'll actually query
CREATE INDEX idx_sso_preferences_gin ON io.thecodeforge.users.sso_profiles
USING GIN (preferences jsonb_path_ops);

-- Insert realistic user data
INSERT INTO io.thecodeforge.users.sso_profiles (email, account_tier, preferences, integrations) VALUES
('alice@example.com', 'pro',
 '{"theme": "dark", "language": "en", "notifications": {"email": true, "sms": false}}',
 '{"slack": {"workspace": "acme-corp", "channel": "#alerts"}, "github": {"username": "alicecodes"}}'),
('bob@example.com', 'free',
 '{"theme": "light", "language": "fr", "notifications": {"email": true, "sms": true}}',
 '{}'),
('carol@example.com', 'enterprise',
 '{"theme": "dark", "language": "de", "notifications": {"email": false, "sms": false}, "custom_domain": "carol.io"}',
 '{"slack": {"workspace": "bigcorp", "channel": "#ops"}}');

-- ── UPDATING JSONB SAFELY: use jsonb_set, not full-document replace ──
-- BAD: rewrites the ENTIRE preferences blob even if 99% is unchanged
-- UPDATE io.thecodeforge.users.sso_profiles
-- SET preferences = '{"theme":"system", ...all other keys...}'
-- WHERE email = 'alice@example.com';

-- GOOD: surgical update using jsonb_set
UPDATE io.thecodeforge.users.sso_profiles
SET preferences = jsonb_set(preferences, '{notifications,sms}', 'true'::jsonb, false)
WHERE email = 'alice@example.com';

-- ── QUERYING: mix relational and JSONB filters naturally ──
-- Find pro/enterprise users who prefer dark mode and have Slack connected
SELECT email,
       account_tier,
       preferences ->> 'theme' AS theme,
       preferences ->> 'language' AS language,
       integrations -> 'slack' ->> 'workspace' AS slack_workspace
FROM io.thecodeforge.users.sso_profiles
WHERE account_tier IN ('pro', 'enterprise')      -- hits B-tree index
  AND preferences @> '{"theme": "dark"}'         -- hits GIN index
  AND integrations ? 'slack'                     -- existence check
ORDER BY last_active_at DESC NULLS LAST;

-- ── BUILDING ANALYTICS: aggregate JSONB data across rows ──
-- Count users by preferred language
SELECT preferences ->> 'language' AS language,
       COUNT(*) AS user_count,
       COUNT(*) FILTER (WHERE account_tier = 'pro') AS pro_count
FROM io.thecodeforge.users.sso_profiles
GROUP BY preferences ->> 'language'
ORDER BY user_count DESC;

-- ── PROMOTING A JSONB FIELD TO A REAL COLUMN ──
-- If 'theme' appears in WHERE on 3+ queries, promote it.
-- Step 1: Add the column
ALTER TABLE io.thecodeforge.users.sso_profiles
ADD COLUMN theme TEXT GENERATED ALWAYS AS (preferences ->> 'theme') STORED;

-- Step 2: Index it
CREATE INDEX idx_sso_theme ON io.thecodeforge.users.sso_profiles (theme);

-- Step 3: Rewrite queries to use the column instead of JSONB extraction
-- BEFORE: WHERE preferences ->> 'theme' = 'dark'  (no index)
-- AFTER:  WHERE theme = 'dark'                    (hits B-tree)
SELECT email, theme FROM io.thecodeforge.users.sso_profiles WHERE theme = 'dark';
email | account_tier | theme | language | slack_workspace
--------------------+--------------+-------+----------+----------------
carol@example.com | enterprise | dark | de | bigcorp
alice@example.com | pro | dark | en | acme-corp
-- Verify sms update:
SELECT preferences -> 'notifications' FROM io.thecodeforge.users.sso_profiles
WHERE email = 'alice@example.com';
-- Result: {"email": true, "sms": true}
-- QUERY: language analytics
language | user_count | pro_count
----------+------------+-----------
en | 1 | 1
fr | 1 | 0
de | 1 | 0
Enforce one JSONB extension column per table, name it consistently (attributes, metadata, details), document its expected keys in a column COMMENT, and apply the three-query test: if a JSONB field appears in WHERE/ORDER BY/GROUP BY on three or more queries, promote it to a real column.

Migrating from TEXT Blobs to JSONB: A Battle-Tested Playbook
If you are reading this article, there is a good chance you have at least one table somewhere in your database with a TEXT column containing serialised JSON. Maybe it was the pragmatic choice five years ago. Maybe a framework put it there. Either way, you want to move to JSONB without downtime and without losing data.
Here is the playbook I have used at three different companies. It works on tables with billions of rows if you do it in phases.
Phase 1 is shadow column. Add a new JSONB column alongside the existing TEXT column. Do not drop the TEXT column yet. Write a backfill job that parses each TEXT value into JSONB and populates the new column. Run it in batches of 10,000 rows with a sleep interval to avoid locking the table. If any row fails to parse — and some will, because production data always has surprises — log the primary key and the parse error, skip that row, and continue.
Phase 2 is dual-write. Update your application to write to both the TEXT column and the JSONB column. This ensures that new data is available in both formats. Keep this running for at least one full business cycle (a week, a month — whatever covers your edge cases) so you can validate that the JSONB column contains correct data for all new writes.
Phase 3 is dual-read with validation. Update your application to read from the JSONB column but fall back to the TEXT column if the JSONB value is NULL. Log every fallback — if you see fallbacks after the backfill is complete, you have data integrity issues to investigate.
Phase 4 is cut-over. Once you have zero fallbacks for a full business cycle, switch reads to JSONB exclusively. Add your GIN indexes. Update queries to use JSONB operators.
Phase 5 is cleanup. Drop the TEXT column. This is the irreversible step — do not do it until you are absolutely confident.
The entire process typically takes 2-4 weeks for a large production table. The backfill job is the longest phase. Do not rush it. I once saw a team try to backfill 50 million rows in a single UPDATE statement. It held an exclusive lock for 45 minutes and caused a cascading timeout across their entire microservice fleet.
-- ── PHASE 1: Add shadow JSONB column ──
ALTER TABLE io.thecodeforge.legacy.events ADD COLUMN payload_jsonb JSONB;

-- Backfill in batches to avoid long-running locks.
-- The DO block loops internally, one batch per iteration.
DO $$
DECLARE
    v_batch_size INT := 10000;
    v_updated INT;
BEGIN
    LOOP
        UPDATE io.thecodeforge.legacy.events
        SET payload_jsonb = payload::JSONB
        WHERE event_id IN (
            SELECT event_id
            FROM io.thecodeforge.legacy.events
            WHERE payload_jsonb IS NULL
              AND payload IS NOT NULL
              AND pg_input_is_valid(payload, 'jsonb')  -- PG16+: skip unparseable rows
            ORDER BY event_id
            LIMIT v_batch_size
        );
        GET DIAGNOSTICS v_updated = ROW_COUNT;
        EXIT WHEN v_updated = 0;
        COMMIT;
        PERFORM pg_sleep(0.1);  -- throttle to reduce lock contention
    END LOOP;
END $$;

-- Log rows that failed to parse (malformed JSON)
INSERT INTO io.thecodeforge.legacy.migration_errors (table_name, record_id, error_message, raw_value)
SELECT 'events', event_id, 'JSON parse failed', LEFT(payload, 500)  -- truncate for logging
FROM io.thecodeforge.legacy.events
WHERE payload_jsonb IS NULL AND payload IS NOT NULL;

-- ── PHASE 2: Dual-write trigger (application-level or DB trigger) ──
-- If you cannot modify the application, use a trigger:
CREATE OR REPLACE FUNCTION io.thecodeforge.legacy.events_dual_write()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.payload IS NOT NULL THEN
        BEGIN
            NEW.payload_jsonb := NEW.payload::JSONB;
        EXCEPTION WHEN OTHERS THEN
            -- Log the error but don't block the insert
            INSERT INTO io.thecodeforge.legacy.migration_errors (table_name, record_id, error_message, raw_value)
            VALUES ('events', NEW.event_id, SQLERRM, LEFT(NEW.payload, 500));
        END;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_events_dual_write
BEFORE INSERT OR UPDATE ON io.thecodeforge.legacy.events
FOR EACH ROW EXECUTE FUNCTION io.thecodeforge.legacy.events_dual_write();

-- ── PHASE 3: Dual-read with fallback (example query) ──
SELECT event_id,
       COALESCE(
           payload_jsonb,   -- prefer JSONB
           payload::JSONB   -- fallback to TEXT conversion
       ) AS event_payload
FROM io.thecodeforge.legacy.events
WHERE event_id = 12345;

-- ── PHASE 4: Validate completeness before cut-over ──
-- missing_jsonb must be 0 before you proceed
SELECT COUNT(*) AS missing_jsonb
FROM io.thecodeforge.legacy.events
WHERE payload IS NOT NULL AND payload_jsonb IS NULL;

-- ── PHASE 4: Add GIN indexes after cut-over ──
CREATE INDEX idx_events_payload_gin
ON io.thecodeforge.legacy.events
USING GIN (payload_jsonb jsonb_path_ops);

-- ── PHASE 5: Drop the old column (IRREVERSIBLE — be certain) ──
-- ALTER TABLE io.thecodeforge.legacy.events DROP COLUMN payload;
-- ALTER TABLE io.thecodeforge.legacy.events RENAME COLUMN payload_jsonb TO payload;
Expect a handful of malformed rows that will cause the ::JSONB cast to throw an error. Create a dedicated error-logging table before you start the backfill. Run a pre-flight check to estimate how many rows will fail — on PostgreSQL 16+, SELECT COUNT(*) FROM events WHERE payload IS NOT NULL AND NOT pg_input_is_valid(payload, 'jsonb'); on older versions, attempt the cast over a sample in a transaction and ROLLBACK. Fix the data first, then backfill.

| Feature / Aspect | JSON | JSONB |
|---|---|---|
| Storage format | Plain text (verbatim) | Decomposed binary tree |
| Parse cost per read | Full re-parse every time | Pre-parsed — zero parse cost |
| Write speed | Faster (syntax check only on insert) | Slightly slower (parsed to binary once on insert) |
| Key ordering preserved | Yes — exact order stored | No — internally sorted |
| Duplicate keys | Both stored, behaviour undefined | Last value wins, silently deduplicated |
| Whitespace preserved | Yes | No |
| GIN indexing | Not supported | Fully supported (jsonb_ops, jsonb_path_ops) |
| Containment operator @> | Not available | Fully supported |
| Existence operators ?, ?\|, ?& | Not available | Supported (requires jsonb_ops GIN) |
| jsonb_set() partial update | Not available | Available (but rewrites full TOAST tuple) |
| SQL/JSON path expressions | Not available | Supported (@?, @@ operators) |
| Function ecosystem | Limited | Rich (jsonb_each, jsonb_path_query, jsonb_agg, etc.) |
| When to use | Preserve exact original document | Almost every other production use case |
🎯 Key Takeaways
- JSONB stores a decomposed binary representation that is pre-parsed on insert — every subsequent read skips the parsing step, making it roughly 4x faster than JSON for field extraction at scale. Use JSONB for any column you will query. Use JSON only when you need byte-for-byte preservation of the original document.
- The -> operator returns a JSON node (preserves type). The ->> operator returns plain TEXT. Comparing the result of -> against a plain SQL string either errors or fails to match, depending on quoting. Always use ->> for comparisons, and cast to the correct type for numeric operations.
- The @> containment operator is the workhorse GIN-indexable JSONB filter (jsonb_ops also indexes the existence operators, and both opclasses support the jsonpath operators @? and @@). Queries using ->> with casts bypass the GIN index entirely and result in full sequential scans. Rewrite your filters to use @> whenever possible.
- jsonb_path_ops GIN indexes are ~30% smaller and faster for @> queries, but they do NOT support existence operators (?, ?|, ?&). Choose jsonb_ops if you need key existence checks. Choose jsonb_path_ops if your queries are exclusively containment-based.
- GIN indexes are expensive to write — expect 3-5x slower inserts. On write-heavy tables, use partial GIN indexes (with a WHERE clause) or generated columns with B-tree indexes for frequently queried fields.
- SQL/JSON path expressions (jsonb_path_query, jsonb_path_exists, @?, @@) provide a mini query language for complex nested JSONB navigation. The @? and @@ operators can use a GIN index (PostgreSQL 12+), but the jsonb_path_query* functions cannot, so pair function-based navigation with indexed relational columns to narrow the candidate set.
- jsonb_set() updates one key but rewrites the entire JSONB value at the storage level (TOAST rewrite). On high-frequency update workloads, monitor WAL generation and dead tuple counts. Move hot-write fields to dedicated relational columns.
- Use the three-query test to decide when to promote a JSONB field to a real column: if it appears in WHERE, ORDER BY, or GROUP BY on three or more distinct queries, it deserves its own column with a B-tree index.
- When migrating from TEXT to JSONB, use a phased approach (shadow column, dual-write, dual-read, cut-over) with batched backfill and error logging. Never run a single ALTER TYPE on a large production table — it will hold an exclusive lock.
- Document your JSONB column's expected top-level keys in a column COMMENT. Enforce a team rule of one JSONB extension column per table, named consistently. This prevents JSONB creep and keeps your schema self-documenting.
Interview Questions on This Topic
- Q: Explain the internal difference between JSON and JSONB in PostgreSQL. When would you choose JSON over JSONB, and what are the performance implications?
- Q: You have a table with 5 million rows and a JSONB column. A query filtering on a nested field inside JSONB is taking 30 seconds. Walk me through how you would diagnose and fix this.
- Q: What is the difference between jsonb_ops and jsonb_path_ops GIN indexes? When would you choose each one, and what operators does each support?
- Q: A developer on your team wrote WHERE (attributes ->> 'price')::NUMERIC > 100 and wonders why the GIN index on the JSONB column isn't being used. Explain why and show how to rewrite the query.
- Q: You're designing a schema for a product catalogue where different product categories have wildly different attribute sets. How would you balance relational columns with JSONB, and what decision framework would you use?
- Q: Your application updates a single key inside a 50KB JSONB document 5,000 times per day. After a month, table bloat is causing query performance to degrade. What is happening at the storage level, and how would you fix it?
- Q: Walk me through a zero-downtime migration from a TEXT column containing serialised JSON to a properly indexed JSONB column on a table with 20 million rows. What are the phases, and what can go wrong at each stage?
- Q: Explain the difference between the -> and ->> operators. A junior developer wrote WHERE data -> 'status' = 'active' and it fails to match existing data (or errors outright). What happened and how do you fix it?
Frequently Asked Questions
What is the difference between JSON and JSONB in PostgreSQL?
JSON stores the document as plain text, exactly as received — whitespace, duplicate keys, and key order are preserved. Every read re-parses the entire string. JSONB stores a decomposed binary representation that is parsed once on insert. Reads are faster, whitespace is stripped, duplicate keys are deduplicated (last value wins), and key order is not preserved. JSONB also supports GIN indexing and the full set of query operators. Use JSONB for almost all production use cases. Use JSON only when you need byte-for-byte document preservation.
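The difference is visible in a one-line experiment; the outputs shown as comments follow jsonb's documented behaviour (last duplicate wins, keys reordered by length and then bytewise):

```sql
SELECT '{"b": 1, "a": 2, "a": 3}'::json;
-- {"b": 1, "a": 2, "a": 3}   (stored verbatim, duplicates kept)

SELECT '{"b": 1, "a": 2, "a": 3}'::jsonb;
-- {"a": 3, "b": 1}           (deduplicated — last value wins — and reordered)
```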
When should I use jsonb_ops vs jsonb_path_ops for GIN indexes?
Use jsonb_ops (the default) when you need both containment queries (@>) and key existence checks (?, ?|, ?&). It indexes every key, value, and key-value pair separately. Use jsonb_path_ops when you only use @> containment queries — it hashes complete key paths into single values, making it ~30% smaller and faster for containment, but it cannot answer existence questions. In practice, if 90% of your queries are @> and you rarely check for key existence, jsonb_path_ops is the better choice.
Why is my GIN index not being used in my JSONB query?
The most common cause is using ->> with a cast in your WHERE clause instead of @>. GIN indexes only accelerate @>, ?, ?|, ?& (and, on PostgreSQL 12+, the jsonpath operators @? and @@). A query like WHERE (attributes ->> 'in_stock')::BOOLEAN = true bypasses the GIN index and does a sequential scan. Rewrite it as WHERE attributes @> '{"in_stock": true}' to engage the index. Run EXPLAIN ANALYZE to verify the query plan uses a Bitmap Index Scan on your GIN index.
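A quick before/after sanity check, assuming a products table with a GIN index on its attributes column (plan lines abbreviated and illustrative):

```sql
-- Cast-based filter: planner cannot use the GIN index
EXPLAIN ANALYZE
SELECT * FROM products WHERE (attributes ->> 'in_stock')::BOOLEAN = true;
-- -> Seq Scan on products ...

-- Containment filter: engages the GIN index
EXPLAIN ANALYZE
SELECT * FROM products WHERE attributes @> '{"in_stock": true}';
-- -> Bitmap Heap Scan ... -> Bitmap Index Scan on <your GIN index>
```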
How do I update a single key inside a JSONB column without rewriting the entire document?
Use jsonb_set(column, '{path,to,key}', 'new_value'::jsonb, create_missing). At the SQL level this is a surgical update. At the storage level, PostgreSQL rewrites the entire JSONB value (TOAST rewrite). There is no in-place partial update. For high-frequency updates on large JSONB documents, monitor WAL generation and consider moving hot-write fields to their own relational columns.
How do I migrate a TEXT column containing JSON to JSONB without downtime?
Use a phased approach: (1) Add a shadow JSONB column and backfill in batches of 10,000 rows with throttling. (2) Dual-write to both columns via application code or a trigger. (3) Dual-read from JSONB with TEXT fallback, logging all fallbacks. (4) Cut over to JSONB-only reads after zero fallbacks for a full business cycle. (5) Drop the TEXT column. Never run a single ALTER TYPE on a large table — it holds an exclusive lock.
Should I store all my data in JSONB to avoid schema migrations?
No. JSONB is for genuinely dynamic, variable-shape data — product attributes, user preferences, event payloads, API responses. Fields that appear in WHERE, ORDER BY, or GROUP BY queries, participate in JOINs, have strict types, or need CHECK constraints should be proper relational columns. Using JSONB everywhere trades away B-tree indexes, foreign keys, type enforcement, and self-documenting schemas. Apply the three-query test: if a JSONB field is queried three or more times, promote it to a column.
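Promotion is typically a two-statement change. A sketch using a stored generated column (PostgreSQL 12+); the user_profiles table, preferences column, and theme key are illustrative names:

```sql
-- The generated column stays in sync with the JSONB automatically
ALTER TABLE user_profiles
    ADD COLUMN theme TEXT GENERATED ALWAYS AS (preferences ->> 'theme') STORED;

CREATE INDEX idx_user_profiles_theme ON user_profiles (theme);

-- Queries now hit a plain B-tree instead of extracting from JSONB
SELECT email FROM user_profiles WHERE theme = 'dark';
```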
What are SQL/JSON path expressions and when should I use them?
SQL/JSON path expressions (available from PostgreSQL 12+) provide a mini query language for navigating nested JSONB. Functions like jsonb_path_query() and operators like @? and @@ let you filter, iterate, and extract from complex nested structures in a single expression. Use them when you need conditional logic inside JSONB navigation — they replace sprawling queries built from -> chains and subqueries. Note: the @? and @@ operators can use a GIN index, but the jsonb_path_query* functions cannot, so pair function-based navigation with indexed relational columns to narrow the candidate set first.
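A hedged sketch of both forms, assuming an orders table whose items JSONB column holds an array of line items like [{"sku": "A1", "qty": 12}]:

```sql
-- Function form: extract every line item with qty above 10
SELECT jsonb_path_query(items, '$[*] ? (@.qty > 10)')
FROM orders;

-- Operator form: keep rows where any item matches; @? is GIN-indexable
SELECT order_id
FROM orders
WHERE items @? '$[*] ? (@.qty > 10)';
```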
How do I monitor whether JSONB updates are causing table bloat?
Check pg_stat_user_tables.n_dead_tup for your table — a growing dead tuple count indicates vacuum pressure from TOAST rewrites. Monitor WAL generation with pg_stat_wal (PostgreSQL 14+). Check TOAST and index overhead with pg_total_relation_size() - pg_relation_size(). If dead tuples accumulate faster than autovacuum can clean them, tune autovacuum_vacuum_cost_delay for that table, or move frequently updated fields out of JSONB into their own columns.
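A starting-point query over pg_stat_user_tables (the columns are real; the ordering and LIMIT are just a convention for surfacing the worst offenders):

```sql
SELECT relname,
       n_live_tup,
       n_dead_tup,
       ROUND(n_dead_tup::NUMERIC / GREATEST(n_live_tup, 1) * 100, 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```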
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.