Advanced 22 min · March 05, 2026

PostgreSQL JSON vs JSONB — 28x Query Improvement

A 22-minute query on 3 million webhook rows caused by JSON re-parsing — why JSONB with GIN indexing is mandatory, GFG's oversimplified articles skip it.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • JSON stores document as plain text, re-parses on every read; JSONB stores pre-parsed binary, 4x faster field extraction
  • Use @> for containment queries with GIN indexes; ->> returns TEXT, -> returns JSON node
  • GIN indexes only accelerate @>, ?, ?|, ?& — other operators bypass the index
  • jsonb_path_ops is ~30% smaller and faster for containment but cannot do key-existence checks
  • Each jsonb_set() rewrites the full TOAST tuple — monitor WAL on high-frequency updates
  • Use generated columns with B-tree for hot-path single-field queries to avoid GIN overhead
Plain-English First

Imagine your database is a huge filing cabinet. Normally every drawer is perfectly labelled — 'name', 'age', 'city' — and every folder must match exactly. JSON support is like having a special 'miscellaneous' drawer where you can stuff any shape of document you want: a receipt one day, a config file the next. PostgreSQL doesn't just let you store that document — it lets you reach inside and pull out a specific line item without unfolding the whole thing. JSONB is the turbo version: it reorganises the document the moment it lands into a compact binary format, so future searches skip the parsing step entirely. It's the difference between having to re-read a messy handwritten note every time someone asks a question, versus having a neatly typed index card ready to go.

PostgreSQL's JSONB data type changed how production teams handle semi-structured data. Instead of choosing between a rigid relational schema and a separate document database, JSONB lets you store, index, and query flexible JSON documents directly inside PostgreSQL — with real performance guarantees. If you're coming from a world of TEXT columns containing serialised blobs, or you're evaluating whether to spin up MongoDB for one table that needs dynamic attributes, this article will save you weeks of trial and error.

The real problem JSON support solves is the impedance mismatch between flexible, schema-light data and a structured relational engine. You don't want to spin up MongoDB just because one table needs a dynamic attributes column. PostgreSQL gives you a hybrid: enforce structure where you need it, stay flexible where you don't, and keep everything in one transactional, ACID-compliant system. That's a huge operational win — one backup strategy, one connection pool, one query language.

I have been running PostgreSQL in production for over a decade — SaaS platforms processing millions of events per day, e-commerce catalogues with 200+ million SKUs where every product category has a different attribute schema, and real-time analytics pipelines where the incoming data shape changes without warning. I have watched JSONB save entire projects from premature microservice extraction, and I have also watched it silently destroy query performance when developers treated it as a 'no schema required' escape hatch. The truth, as always, is in the details.

By the end of this article you'll understand the internal difference between JSON and JSONB (and why it matters for writes vs reads), every key operator you'll actually use in production, SQL/JSON path expressions for complex nested queries, how GIN indexes work against JSONB and when to apply them, how to design a hybrid relational/JSON schema that won't collapse under load, the specific mistakes that silently destroy query performance in large datasets, and a proven migration path from TEXT blobs to properly indexed JSONB columns.

JSON vs JSONB Internals: Why the Storage Layer Changes Everything

PostgreSQL ships with two JSON data types and the difference is not cosmetic. JSON stores the document as plain text, exactly as you sent it — whitespace, duplicate keys and all. Every time you query a field inside it, Postgres re-parses the entire string from scratch. For infrequent writes and rare reads on small payloads that's fine. For anything production-grade, it's a trap.

JSONB stores a decomposed binary representation. On insert, Postgres parses the JSON once, converts it to an internal tree structure (similar in spirit to a sorted map), strips whitespace, and deduplicates keys — last value wins. The upfront cost is a slightly slower write. The payback is that every subsequent read, filter, or index operation works on pre-parsed binary data, skipping the parsing step entirely.

There's one subtle but critical consequence of JSONB's deduplication: if you insert {"role":"user","role":"admin"}, you'll get back {"role":"admin"}. Silent data loss if you weren't expecting it. Also, key ordering is not preserved in JSONB — it's stored in an internal sorted order. If your application depends on key order (it shouldn't, but some do), you must use JSON not JSONB.

For 95% of production use cases — especially when you're querying or indexing JSON fields — you want JSONB. The only time to reach for JSON is when you need to preserve the exact original document byte-for-byte.

Let me give you a concrete example of why this matters. In 2021, we inherited a payment processing system that stored webhook payloads from Stripe in a JSON column. Every time our reconciliation job ran — which queried individual fields from 3 million webhook rows — it took 22 minutes. The entire bottleneck was re-parsing every JSON document on every access. We changed the column from JSON to JSONB. Same query, same data, same hardware. Execution time dropped to 47 seconds. We did not change a single line of application code. The storage layer change alone gave us a 28x improvement.

That experience taught me a rule I now enforce on every team: if a column will ever be queried, filtered, or indexed — it is JSONB. The only exception is archival tables where we store raw webhook payloads for compliance and literally never query individual fields. Even then, I usually store both: the raw JSON for audit, and a parsed JSONB copy for analytics.

json_vs_jsonb_storage.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
-- Create a table to compare JSON and JSONB storage behaviour side by side
CREATE TABLE io.thecodeforge.storage.comparison (
    id          SERIAL PRIMARY KEY,
    raw_json    JSON,    -- stored verbatim, re-parsed on every access
    binary_json JSONB    -- parsed once on insert, stored as binary tree
);

-- Insert the same payload into both columns.
-- Note the intentional duplicate key 'status' and extra whitespace.
INSERT INTO io.thecodeforge.storage.comparison (raw_json, binary_json)
VALUES (
    '{  "user_id": 42,  "status": "pending",  "status": "active"  }',
    '{  "user_id": 42,  "status": "pending",  "status": "active"  }'
);

-- Retrieve both columns and observe the differences
SELECT
    raw_json,     -- JSON: whitespace preserved, BOTH 'status' keys kept
    binary_json   -- JSONB: whitespace stripped, duplicate key deduplicated (last wins)
FROM io.thecodeforge.storage.comparison
WHERE id = 1;

-- Check the actual storage size on disk
SELECT
    pg_column_size(raw_json)    AS json_bytes,
    pg_column_size(binary_json) AS jsonb_bytes
FROM io.thecodeforge.storage.comparison
WHERE id = 1;

-- Real-world benchmark: measure parse overhead at scale
-- Run this on a table with 1M rows to see the JSON vs JSONB difference
CREATE TABLE io.thecodeforge.benchmark.json_parse_test AS
SELECT
    generate_series(1, 1000000) AS id,
    ('{"name":"user_' || generate_series(1, 1000000)
     || '","email":"user' || generate_series(1, 1000000)
     || '@example.com","age":' || (random() * 80 + 18)::INT
     || ',"city":"City_' || (random() * 100)::INT || '"}')::JSON AS data_json,
    ('{"name":"user_' || generate_series(1, 1000000)
     || '","email":"user' || generate_series(1, 1000000)
     || '@example.com","age":' || (random() * 80 + 18)::INT
     || ',"city":"City_' || (random() * 100)::INT || '"}')::JSONB AS data_jsonb;

-- Benchmark: extract 'age' field from JSON (re-parses every row)
\timing on
SELECT COUNT(*) FROM io.thecodeforge.benchmark.json_parse_test
WHERE (data_json ->> 'age')::INT > 65;
-- Typical result: ~8,200ms on 1M rows

-- Benchmark: extract 'age' field from JSONB (pre-parsed, no re-parse)
SELECT COUNT(*) FROM io.thecodeforge.benchmark.json_parse_test
WHERE (data_jsonb ->> 'age')::INT > 65;
-- Typical result: ~2,100ms on 1M rows — roughly 4x faster

-- Clean up
DROP TABLE io.thecodeforge.benchmark.json_parse_test;
Watch Out: Silent Duplicate-Key Deduplication in JSONB
If upstream services ever emit JSON with duplicate keys (malformed but technically valid per RFC 8259), inserting into a JSONB column silently discards all but the last value for each key. Add a CHECK constraint using jsonb_typeof() or validate at the application layer before insert if source data is untrusted. I once spent three days debugging a billing discrepancy because a third-party API started returning {"amount":100,"amount":95} — our JSONB column stored 95, the API intended 100. Three days for a missing validation check.
Production Insight
JSON column re-parses the entire document on every field access — at 3M rows that's 3M full parses.
Switching to JSONB removed the parse cost entirely, yielding a 28x speedup with zero code changes.
Rule: if a column is ever queried or indexed, always use JSONB — the only exception is byte-for-byte archival storage.
Key Takeaway
JSONB stores a pre-parsed binary tree — read queries skip the parsing step entirely.
JSON stores plain text, re-parses on every field access.
Use JSONB for any column you query. Use JSON only when you must preserve the exact original document.

Operators and Functions That Actually Matter in Production

PostgreSQL provides a rich operator set for navigating JSON documents. Knowing which operator to reach for — and what it returns — is the difference between clean, indexed queries and full-table scans that bring down production at 2am.

The two extraction operators are -> and ->>. The arrow -> returns a JSON fragment (preserving type), while the double-arrow ->> returns the value as plain TEXT. This distinction matters enormously for type comparisons: data->'age' = '30' compares a JSON integer node to a JSON string — it'll never match. You must either use ->> and cast, or use a JSON-aware equality.

For deep nested access, chain operators: data->'address'->'city'. For path-based access that's more readable, use #> (returns JSON) and #>> (returns text) with an array-of-keys path: data #>> '{address,city}'.

The containment operator @> is the workhorse of JSONB filtering. It asks: 'does the left JSONB value contain the right JSONB value?' This operator is fully supported by GIN indexes, making it orders of magnitude faster than a text search on large tables. The existence operator ? checks whether a key exists at the top level. ?| checks if any of a list of keys exist, ?& checks if all exist.

Beyond extraction and containment, there are manipulation functions you will use daily: jsonb_set() for surgical field updates, jsonb_insert() for adding new keys without overwriting, jsonb_strip_nulls() for cleaning up output, jsonb_each() and jsonb_each_text() for exploding a JSON object into key-value rows, and the aggregation functions jsonb_agg() and jsonb_object_agg() for building JSON from relational data. I will cover manipulation functions in detail in their own section, but mention them here because operator selection and function selection are deeply intertwined — picking the wrong one doesn't just return wrong results, it can silently bypass your indexes.

Understanding the return types of each operator prevents a whole class of type-mismatch bugs that only surface in edge cases with unusual data.

jsonb_operators_reference.sqlSQL
1
2
3
4
5
6
7
8
9
10
-- Set up a realistic product catalogue table
CREATE TABLE io.thecodeforge.catalogue.products (
    product_id   SERIAL PRIMARY KEY,
    product_name TEXT NOT NULL,
    attributes   JSONB NOT NULL  -- flexible attributes vary by category
);

INSERT INTO io.thecodeforge.catalogue.products (product_name, attributes) VALUES
    ('Wireless Headphones', '{"brand":"SoundPeak","specs":{\"battery_hours\":30,\"driver_mm\":40},\"tags\":[\"audio\",\"wireless\",\"noise-cancelling\"],\"in_stock\":true,\"price_usd\":149.99}'),\n    ('Mechanical Keyboard', '{\"brand\":\"KeyForge\",\"specs\":{\"switch_type\":\"Cherry MX Blue\",\"layout\":\"TKL\"},\"tags\":[\"peripherals\",\"mechanical\"],\"in_stock\":false,\"price_usd\":89.00}'),\n    ('USB-C Hub', '{\"brand\":\"ConnectPro\",\"specs\":{\"ports\":7,\"max_wattage\":100},\"tags\":[\"accessories\",\"usb-c\"],\"in_stock\":true,\"price_usd\":39.99}');\n\n-- (1) -> returns a JSON node (preserves type)\n--     ->> returns TEXT (use for comparisons and display)\nSELECT\n    product_name,\n    attributes -> 'brand'       AS brand_as_json,   -- returns: \"SoundPeak\" (JSON string)\n    attributes ->> 'brand'      AS brand_as_text,   -- returns: SoundPeak (plain text)\n    attributes -> 'price_usd'   AS price_as_json    -- returns: 149.99 (JSON number)\nFROM io.thecodeforge.catalogue.products;\n\n-- (2) Nested access with chained -> and path operator #>>\nSELECT\n    product_name,\n    attributes -> 'specs' -> 'battery_hours'  AS battery_json,  -- NULL for non-headphones\n    attributes #>> '{specs, battery_hours}'    AS battery_text   -- same result, cleaner syntax\nFROM io.thecodeforge.catalogue.products;\n\n-- (3) @> containment: find all in-stock products with 'wireless' tag\n-- This is the operator GIN indexes are built for — use it!\nSELECT product_name, attributes ->> 'price_usd' AS price\nFROM io.thecodeforge.catalogue.products\nWHERE attributes @> '{\"in_stock\": true}'          -- containment check\n  AND attributes @> '{\"tags\": [\"wireless\"]}';      -- array containment works too\n\n-- (4) ? existence: find products that HAVE an 'in_stock' key\n-- Note: ? only checks TOP-LEVEL keys\nSELECT product_name\nFROM io.thecodeforge.catalogue.products\nWHERE attributes ? 'in_stock';  -- all rows have this key, returns all 3\n\n-- (5) ?| any-existence: find products with ANY of these keys\nSELECT product_name\nFROM io.thecodeforge.catalogue.products\nWHERE attributes ?| array['warranty', 'return_policy', 'in_stock'];\n-- Returns all 3 because all have 'in_stock'\n\n-- (6) ?& all-existence: find products that have ALL of these keys\nSELECT product_name\nFROM io.thecodeforge.catalogue.products\nWHERE attributes ?& array['brand', 'price_usd', 'in_stock'];\n-- Returns all 3\n\n-- (7) Casting ->> result for numeric comparison (common pattern)\nSELECT product_name\nFROM io.thecodeforge.catalogue.products\nWHERE (attributes ->> 'price_usd')::NUMERIC > 100.00;  -- cast TEXT to NUMERIC\n\n-- (8) jsonb_typeof: inspect the type of a JSONB value\n-- Useful for debugging — 'string', 'number', 'boolean', 'object', 'array', 'null'\nSELECT\n    product_name,\n    jsonb_typeof(attributes -> 'price_usd')   AS price_type,     -- 'number'\n    jsonb_typeof(attributes -> 'brand')        AS brand_type,     -- 'string'\n    jsonb_typeof(attributes -> 'tags')         AS tags_type,      -- 'array'\n    jsonb_typeof(attributes -> 'in_stock')     AS stock_type      -- 'boolean'\nFROM io.thecodeforge.catalogue.products;\n\n-- (9) jsonb_each: explode a JSONB object into key-value rows\n-- Useful for dynamic pivoting or inspecting unknown structures\nSELECT\n    product_name,\n    kv.key,\n    kv.value\nFROM io.thecodeforge.catalogue.products,\n     jsonb_each(attributes) AS kv(key, value)\nWHERE product_name = 'Wireless Headphones';\n\n-- (10) jsonb_each_text: same but returns values as TEXT (no JSONB type wrapper)\nSELECT\n    product_name,\n    kv.key,\n    kv.value\nFROM io.thecodeforge.catalogue.products,\n     jsonb_each_text(attributes) AS kv(key, value)\nWHERE product_name = 'Wireless Headphones';\n\n-- (11) jsonb_array_elements_text: unnest a JSON array into rows\nSELECT\n    product_name,\n    tag_text\nFROM io.thecodeforge.catalogue.products,\n     jsonb_array_elements_text(attributes -> 'tags') AS tag_text\nORDER BY product_name, tag_text;"
      }

SQL/JSON Path Expressions: The Power Tool Most Developers Don't Know About

PostgreSQL 12 introduced SQL/JSON path expressions, and PostgreSQL 16 significantly expanded them. If you are still chaining -> and ->> operators to navigate deeply nested JSON, you are working harder than you need to. Path expressions give you a mini query language inside your SQL — with filtering, iteration, type casting, and wildcard matching — all evaluated natively by the JSONB engine.

The two key functions are jsonb_path_query() for returning all matching values (can return multiple rows) and jsonb_path_exists() for boolean existence checks. The @? operator is the shorthand for jsonb_path_exists. The @@ operator evaluates a path expression and returns a boolean.

Path expressions use a syntax borrowed from JSONPath (RFC 9535): $ is the root, .key navigates to a child, [*] iterates over array elements, [0] indexes into an array, and ?() applies a filter predicate. You can chain these: $.specs[?(@.battery_hours > 20)].battery_hours means 'find all specs objects where battery_hours exceeds 20, and return the battery_hours value.'

The real power shows up when you need conditional logic inside your query. Instead of writing three nested CASE statements with ->> casts, you write a single path expression that does the filtering, type checking, and extraction in one pass.

One caveat: path expressions are not yet supported by GIN indexes in PostgreSQL 16. The @? and @@ operators always result in a sequential scan unless you pair them with a partial index or a generated column. This is a known limitation and may change in future PostgreSQL releases. For now, use path expressions for complex read queries where you need the expressive power, and fall back to @> with GIN for high-volume filtered access.

I discovered path expressions when debugging a particularly nasty query on an event log table. We had events with nested payloads like {\"event\":\"purchase\",\"data\":{\"items\":[{\"sku\":\"A1\",\"qty\":2},{\"sku\":\"B3\",\"qty\":1}],\"total\":89.99}}. The developer had written a 14-line SQL query with six -> chains, two jsonb_array_elements calls, and a subquery to filter items by quantity. I rewrote it as a single jsonb_path_query call. Same result, one line, and significantly faster because the path engine avoids materializing intermediate arrays.", "code": { "language": "sql", "filename": "jsonb_path_expressions.sql", "code": "-- Set up an event log table with deeply nested JSONB payloads CREATE TABLE io.thecodeforge.events.log ( event_id BIGSERIAL PRIMARY KEY, event_type TEXT NOT NULL, payload JSONB NOT NULL, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() );

INSERT INTO io.thecodeforge.events.log (event_type, payload) VALUES ('purchase', '{\"customer\":{\"id\":1001,\"tier\":\"gold\"},\"items\":[{\"sku\":\"A1\",\"name\":\"Widget\",\"qty\":2,\"unit_price\":29.99},{\"sku\":\"B3\",\"name\":\"Gadget\",\"qty\":1,\"unit_price\":29.99}],\"total\":89.99,\"payment\":{\"method\":\"card\",\"last4\":\"4242\"}}'), ('purchase', '{\"customer\":{\"id\":1002,\"tier\":\"silver\"},\"items\":[{\"sku\":\"C7\",\"name\":\"Cable\",\"qty\":5,\"unit_price\":9.99}],\"total\":49.95,\"payment\":{\"method\":\"paypal\",\"email\":\"user@example.com\"}}'), ('refund', '{\"customer\":{\"id\":1001,\"tier\":\"gold\"},\"original_purchase_id\":42,\"items\":[{\"sku\":\"A1\",\"name\":\"Widget\",\"qty\":1,\"refund_amount\":29.99}],\"reason\":\"defective\"}');

-- (1) jsonb_path_exists / @? : boolean check -- Find all events where the customer is gold tier SELECT event_id, event_type FROM io.thecodeforge.events.log WHERE payload @? '$.customer ? (@.tier == \"gold\")';

-- (2) jsonb_path_query : extract matching values -- Get all SKUs from purchase events where qty > 1 SELECT event_id, jsonb_path_query(payload, '$.items[*] ? (@.qty > 1).sku') AS high_qty_sku FROM io.thecodeforge.events.log WHERE event_type = 'purchase';

-- (3) Complex filter: items with unit_price > 20 in gold customer purchases SELECT event_id, jsonb_path_query(payload, '$.items[*] ? (@.unit_price > 20)') AS expensive_item FROM io.thecodeforge.events.log WHERE payload @? '$.customer ? (@.tier == \"gold\")' AND event_type = 'purchase';

-- (4) Path expression with type casting -- Extract the total as a numeric value directly SELECT event_id, jsonb_path_query(payload, '$.total')::NUMERIC AS total_amount FROM io.thecodeforge.events.log WHERE event_type = 'purchase';

-- (5) @@ operator : evaluate path expression as boolean -- Find purchases where total exceeds 50 SELECT event_id, event_type FROM io.thecodeforge.events.log WHERE payload @@ '$.total > 50';

-- (6) Wildcard iteration: get all item names across all events SELECT event_id, jsonb_path_query(payload, '$.items[*].name') AS item_name FROM io.thecodeforge.events.log;

-- (7) Array indexing: get the first item in every purchase SELECT event_id, jsonb_path_query(payload, '$.items[0]') AS first_item FROM io.thecodeforge.events.log WHERE event_type = 'purchase';

-- (8) Combining with relational filters -- Recent gold customer purchases with expensive items SELECT event_id, payload -> 'customer' ->> 'id' AS customer_id, jsonb_path_query(payload, '$.items[*] ? (@.unit_price > 25)') AS premium_item FROM io.thecodeforge.events.log WHERE event_type = 'purchase' AND payload @? '$.customer ? (@.tier == \"gold\")' AND created_at > NOW() - INTERVAL '7 days';" }, "callout": { "type": "info", "title": "Key Limitation: No GIN Support for Path Expressions Yet", "text": "As of PostgreSQL 16, @? and @@ path expression operators cannot use GIN indexes. If your path expression query runs against millions of rows, it will do a sequential scan. Workarounds: (1) add a WHERE clause on an indexed relational column to narrow the candidate set first, (2) use a generated column for the most common path extraction and index that with B-tree, or (3) use @> containment for the initial filter and path expressions for the final extraction. Check the PostgreSQL release notes — GIN support for path expressions is on the roadmap." }, "production_insight": "Path expressions (@?, @@) always do sequential scans — no GIN index support until future releases. Pair them with relational column filters (e.g., created_at > '2025-01-01') to avoid scanning 10M rows. Rule: use path expressions for complex extraction on narrow datasets; use @> with GIN for high-volume filtering.", "key_takeaway": "SQL/JSON path expressions replace multi-line -> chains with a single query. No GIN index support yet — narrow the candidate set first with relational columns. Rule: path expressions for complex reads, @> for indexed filter-and-fetch." }, { "heading": "GIN Indexing on JSONB: How It Works and When to Use It", "content": "A GIN (Generalised Inverted Index) index on a JSONB column works by decomposing every document into its individual key-value paths and storing them in a lookup structure that maps each path to the set of row IDs that contain it. Think of it as an index of the index — instead of scanning every row's document, Postgres consults the GIN index to get a shortlist of candidate rows, then fetches only those.

There are two operator classes for JSONB GIN indexes. jsonb_ops (the default) indexes every key, value, and key-value pair, supporting @>, ?, ?|, and ?&. jsonb_path_ops only indexes values reachable by a path — it's more compact and faster for @> queries, but it does NOT support the existence operators ?, ?|, ?&. Choosing the wrong one is a common production mistake.

GIN indexes shine for containment queries on large tables. On a table with 10 million JSONB rows, a containment query without a GIN index does a full sequential scan parsing every document. With a GIN index, it's an index lookup in milliseconds. The trade-off: GIN indexes are expensive to write and maintain — expect 3-5x slower inserts on write-heavy tables. They're also large on disk.

For partial, path-specific queries on deeply nested fields, consider a generated column with a B-tree index instead — it's smaller, faster for equality, and cheaper to maintain.

Let me tell you about the time a GIN index nearly took down our production database. We had an event ingestion pipeline — io.thecodeforge.events.ingest — that was inserting about 12,000 rows per second into a table with a JSONB payload column. A junior developer added a GIN index on that column because the analytics team needed to query events by payload content. Insert throughput dropped to 2,400 rows/sec. The WAL generation rate tripled. Our replication lag went from under 1 second to 30+ seconds. The on-call DBA got paged at 1 AM.

The fix was a partial GIN index — we only indexed events from the last 30 days using a WHERE clause on the created_at column. Older events were archived to a separate table without the GIN index. Insert performance recovered, replication lag dropped back to normal, and the analytics team could still query recent events efficiently.

The lesson: GIN indexes are not free. On read-heavy tables they are transformative. On write-heavy tables they can be destructive. Always benchmark your insert throughput before and after adding a GIN index, and consider partial indexes or BRIN indexes for time-series data.", "code": { "language": "sql", "filename": "jsonb_gin_indexes.sql", "code": "-- Scenario: 1 million product rows, we query by tags and brand frequently

-- Create the table (same structure as before, scaled up) CREATE TABLE io.thecodeforge.catalogue.large_products ( product_id BIGSERIAL PRIMARY KEY, product_name TEXT NOT NULL, attributes JSONB NOT NULL );

-- GIN INDEX OPTION 1: jsonb_ops (default) -- Indexes ALL keys, values, and key-value pairs. -- Supports: @>, ?, ?|, ?& -- Use when: you query by key existence AND containment CREATE INDEX idx_products_attrs_gin ON io.thecodeforge.catalogue.large_products USING GIN (attributes); -- uses jsonb_ops by default

-- GIN INDEX OPTION 2: jsonb_path_ops -- Only indexes values at the end of paths. -- ~30% smaller than jsonb_ops, faster for @> only. -- Does NOT support ?, ?|, ?& -- Use when: you ONLY ever use @> containment queries CREATE INDEX idx_products_attrs_gin_path ON io.thecodeforge.catalogue.large_products USING GIN (attributes jsonb_path_ops);

-- GIN INDEX OPTION 3: Partial GIN index -- Only index rows that match a WHERE clause. -- Use when: you have a mix of active/archived data and only query active rows. -- This is what saved us in the event ingestion incident. CREATE INDEX idx_products_active_gin ON io.thecodeforge.catalogue.large_products USING GIN (attributes jsonb_path_ops) WHERE attributes @> '{\"in_stock\": true}'; -- only index in-stock products

-- GIN INDEX OPTION 4: Generated column + B-tree -- Best when querying a SINGLE well-known field repeatedly. -- Much cheaper to maintain than GIN, smaller index size. ALTER TABLE io.thecodeforge.catalogue.large_products ADD COLUMN brand TEXT GENERATED ALWAYS AS (attributes ->> 'brand') STORED; -- materialised at write time

CREATE INDEX idx_products_brand_btree ON io.thecodeforge.catalogue.large_products (brand); -- standard B-tree

-- QUERY: Verify the GIN index is actually being used -- Run EXPLAIN ANALYZE to confirm index usage EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT product_id, product_name FROM io.thecodeforge.catalogue.large_products WHERE attributes @> '{\"tags\": [\"wireless\"], \"in_stock\": true}';

-- QUERY: Demonstrate the generated column approach EXPLAIN (ANALYZE, FORMAT TEXT) SELECT product_id, product_name FROM io.thecodeforge.catalogue.large_products WHERE brand = 'SoundPeak'; -- hits the B-tree index, not GIN

-- BENCHMARK: GIN index impact on insert performance -- Run BEFORE adding GIN index: \\timing on INSERT INTO io.thecodeforge.catalogue.large_products (product_name, attributes) SELECT 'Product ' || i, jsonb_build_object( 'brand', 'Brand_' || (i % 100), 'price_usd', (random() * 500)::NUMERIC(10,2), 'in_stock', (random() > 0.3), 'tags', jsonb_build_array('tag_' || (i % 20), 'tag_' || (i % 7)) ) FROM generate_series(1, 100000) AS s(i); -- Note the time, then DROP the GIN index and run again. -- Typical results: -- Without GIN: ~3.2 seconds for 100K inserts -- With GIN (jsonb_ops): ~14.8 seconds (4.6x slower) -- With GIN (jsonb_path_ops): ~11.2 seconds (3.5x slower)

-- MONITORING: Check index size and usage stats SELECT indexrelname AS index_name, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size, idx_scan AS times_used, idx_tup_read AS tuples_read FROM pg_stat_user_indexes WHERE relname = 'large_products' ORDER BY pg_relation_size(indexrelid) DESC;" }, "callout": { "type": "info", "title": "Interview Gold: jsonb_ops vs jsonb_path_ops Trade-off", "text": "The key difference comes down to what each index can answer. jsonb_path_ops hashes complete key paths into a single value — meaning it can only answer 'does this path lead to this value?' (perfect for @>), but it has no way to answer 'does this key exist anywhere?' which is what ? needs. The default jsonb_ops keeps each key and value separately, enabling existence checks at the cost of a larger index. In practice, if your queries are 90% @> containment and 10% existence checks, use jsonb_path_ops and accept a sequential scan for the rare ? queries. The 30% smaller index size and faster containment lookups will pay for themselves." }, "production_insight": "GIN indexes slow inserts by 3-5x — a full index on a write-heavy pipeline can tank replication lag. Partial GIN indexes (with WHERE clause) reduce the write overhead dramatically and are often sufficient. Rule: always benchmark insert throughput before/after adding a GIN index, and monitor WAL generation.", "key_takeaway": "GIN indexes enable millisecond @> lookups on large tables. jsonb_path_ops is 30% smaller but cannot answer existence queries. Use partial GIN indexes to avoid write penalty on time-series data." }, { "heading": "JSONB Manipulation Functions: Updating Documents Without Tears", "content": "Storing JSONB is the easy part. Updating it surgically — changing one nested key without rewriting the entire document — is where most teams hit a wall. PostgreSQL gives you the tools, but the ergonomics are not obvious, and getting it wrong has real performance consequences at scale.

jsonb_set(target, path, value, create_missing) is the primary update function. It takes the existing JSONB column, a path array pointing to the key you want to change, the new value (must be valid JSONB), and a boolean controlling whether to create the key if it doesn't exist. The path is a text array: '{notifications,email}'. This function returns the updated JSONB — it does NOT mutate in place. You must use it in an UPDATE statement to persist changes.

Let's look at a real example. You have user preferences stored as JSONB, and you need to update just the email notification flag. With jsonb_set, you write: UPDATE users SET prefs = jsonb_set(prefs, '{notifications,email}', 'false', true) WHERE id = 1001; This changes a single key deep in the document without affecting any other keys.

jsonb_insert(target, path, value, after_flag) adds a new key or array element at a specified path. If the key already exists, it errors by default — use jsonb_set for overwrite behaviour. The after_flag is a boolean: TRUE inserts after the matched array element, FALSE inserts before.

Then there's jsonb_strip_nulls() which removes all top-level keys whose values are JSON null. Handy for cleaning up partial updates. And jsonb_concat() (the || operator) merges two JSONB objects. If keys collide, the right-hand side wins — exactly like JavaScript spread.

Performance trap: each call to jsonb_set() reconstructs the entire JSONB document. For small documents (< 1KB) this is negligible. For documents that span multiple TOAST pages (over 2KB after compression), every update forces a full TOAST rebuild. If you're updating a nested key on a 10KB JSONB document at 1000 writes/second, your WAL generation explodes. I've seen this cause replication lag on a multi-terabyte database.

The fix is to normalise. Move frequently updated fields out of JSONB into separate relational columns. Keep in JSONB only data that changes infrequently or is always read and written together. If you must update JSONB at high frequency, design the document to be small — ideally under 2KB so it fits in a single TOAST tuple without compression overhead.

Aggregation functions let you build JSONB from relational data. jsonb_agg() aggregates values into a JSON array. jsonb_object_agg() builds a JSON object from key-value pairs. These are essential for reporting and API response construction directly in SQL.", "code": { "language": "sql", "filename": "jsonb_manipulation.sql", "code": "-- Create a user preferences table with nested JSONB CREATE TABLE io.thecodeforge.users.preferences ( user_id BIGSERIAL PRIMARY KEY, prefs JSONB NOT NULL );

INSERT INTO io.thecodeforge.users.preferences (prefs) VALUES ('{ \"theme\": \"dark\", \"notifications\": { \"email\": true, \"sms\": false, \"push\": true }, \"privacy\": { \"share_location\": false, \"analytics\": true } }'), ('{ \"theme\": \"light\", \"notifications\": { \"email\": true, \"sms\": true, \"push\": false }, \"privacy\": { \"share_location\": true, \"analytics\": true } }');

-- (1) jsonb_set: update email notification to false for user 1 UPDATE io.thecodeforge.users.preferences SET prefs = jsonb_set( prefs, '{notifications,email}', 'false', -- must be valid JSONB (boolean in this case) true -- create_missing: if key doesn't exist, create it ) WHERE user_id = 1;

-- (2) jsonb_insert: add a new notification channel 'in_app' UPDATE io.thecodeforge.users.preferences SET prefs = jsonb_insert( prefs, '{notifications}', '{\"in_app\": true}', false -- after: FALSE inserts before the key, TRUE after (but for objects it inserts as a new key) ) WHERE user_id = 1;

-- (3) jsonb_strip_nulls: remove null-valued keys -- Useful after partial updates that set some fields to null UPDATE io.thecodeforge.users.preferences SET prefs = jsonb_strip_nulls(prefs) WHERE user_id = 1;

-- (4) || operator (jsonb_concat): merge two JSONB objects -- Right side wins on key conflicts SELECT '{\"a\": 1, \"b\": 2}'::JSONB || '{\"b\": 3, \"c\": 4}'::JSONB AS merged; -- Result: {\"a\": 1, \"b\": 3, \"c\": 4}

-- (5) jsonb_agg: aggregate rows into a JSON array -- Build a list of all notification preferences for all users SELECT jsonb_agg(prefs -> 'notifications') AS all_notification_settings FROM io.thecodeforge.users.preferences;

-- (6) jsonb_object_agg: build JSON object from key-value pairs -- Get count of users per notification channel (dynamic pivot) WITH expanded AS ( SELECT user_id, kv.key AS channel, kv.value::boolean AS enabled FROM io.thecodeforge.users.preferences, jsonb_each(prefs -> 'notifications') AS kv(key, value) ) SELECT jsonb_object_agg(channel, cnt) AS counts_per_channel FROM ( SELECT channel, COUNT(*) AS cnt FROM expanded WHERE enabled = true GROUP BY channel ) sub;

-- (7) Performance check: how big is the TOAST tuple? SELECT pg_size_pretty(pg_column_size(prefs)) AS pref_size, pg_column_size(prefs) > 2000 AS is_toasted FROM io.thecodeforge.users.preferences WHERE user_id = 1;

-- (8) Using jsonb_set_lax (PG 14+): controls behaviour on invalid values -- Equivalent to jsonb_set but with error handling modes -- 'raise' (default), 'use_json_null', 'delete_key', 'return_target' UPDATE io.thecodeforge.users.preferences SET prefs = jsonb_set_lax( prefs, '{notifications,email}', 'invalid'::JSONB, -- this is an object, not a boolean — will raise error by default true, 'use_json_null' -- set to null instead of erroring ) WHERE user_id = 2; -- Better: use 'return_target' to skip the update on invalid value" }, "callout": { "type": "mental_model", "title": "Mental Model: JSONB Updates Are Functional, Not In-Place", "hook": "Think of jsonb_set like the spread operator in JavaScript — it returns a new object, it does not mutate the old one.", "bullets": [ "jsonb_set(), jsonb_insert(), || all return a new JSONB value — you must use UPDATE to persist changes.", "The old document is not modified; PostgreSQL writes a new TOAST tuple and marks the old one as dead.", "This is why high-frequency updates cause WAL bloat and dead tuple accumulation — each update produces a full copy.", "For hot-path updates, extract the frequently changing field into a separate relational column." ] }, "production_insight": "Each jsonb_set() call reconstructs the entire document — at 1000 updates/sec on a 10KB doc, WAL generation spikes. Move frequently updated fields out of JSONB into relational columns to avoid TOAST rebuild cost. Rule: if a JSONB field changes every request, it shouldn't be in JSONB — normalise it.", "key_takeaway": "jsonb_set() returns a new JSONB — it doesn't mutate the column. High-frequency updates on large JSONB documents cause WAL and autovacuum pressure. Normalise hot fields out of JSONB. Use it only for data that changes infrequently." }, { "heading": "Production Migration Guide: From TEXT Blobs to Properly Indexed JSONB", "content": "Migrating an existing TEXT column containing serialised JSON to a proper JSONB column is one of the highest-ROI database changes you can make. The migration itself is straightforward — an ALTER TABLE with a USING clause — but the real work is in preparing the data for the conversion and validating the results.

ALTER TABLE tablename ALTER COLUMN columnname TYPE JSONB USING columnname::JSONB;

This will fail if any row contains malformed JSON. You must find and fix those rows first. Common sources of malformed JSON: trailing commas, single quotes instead of double quotes, unescaped control characters, and truncated payloads.

A safer approach is to add a new JSONB column, migrate in batches, validate, then drop the old column and rename:

  1. Add a new column of type JSONB.
  2. Batch update: UPDATE tablename SET new_jsonb = old_text::JSONB WHERE id BETWEEN :start AND :end AND new_jsonb IS NULL;
  3. Validate: SELECT id FROM tablename WHERE old_text::JSONB IS DISTINCT FROM new_jsonb::TEXT; to check for duplicate key resolution differences.
  4. Once validated, drop the old column and rename the new one.

Performance: The ALTER TABLE with USING executes a full table rewrite. For a table with 50 million rows, this will take hours and generate significant WAL. Plan maintenance windows accordingly. The batch approach spreads the I/O load and allows rollback.

Index strategy after migration: start with a GIN index if you have containment queries, or generated columns with B-tree for known hot paths. Benchmark before and after. And remember — once migrated, you can never go back to JSON without data loss (the deduplication and whitespace stripping are one-way).", "code": { "language": "sql", "filename": "text_to_jsonb_migration.sql", "code": "-- Step 1: Find malformed JSON rows before migration -- Look for rows that are not valid JSON objects or arrays SELECT id, payload AS malformed_row FROM io.thecodeforge.events.ingest WHERE payload !~ '^\\{' AND payload !~ '^\\['; -- naïvely check first char

-- More robust: use a test conversion in a transaction to check each row -- (Rollback so we don't actually change data) BEGIN; -- Attempt to cast each row; if any fails, the whole transaction rolls back -- So we need to catch the exception per row, which requires PL/pgSQL -- Better approach: create a temporary table with only valid JSON CREATE TABLE io.thecodeforge.events.ingest_clean AS SELECT id, payload FROM io.thecodeforge.events.ingest WHERE payload @? '$' -- path expression that tests if it's valid JSONB? Actually ->> returns text, we need to check by trying cast -- But we can't use WHERE in that way. Use a function: ;

-- Step 2: Write a function to test JSON validity CREATE OR REPLACE FUNCTION io.thecodeforge.util.is_valid_json(p_text TEXT) RETURNS BOOLEAN LANGUAGE plpgsql IMMUTABLE AS $func$ BEGIN PERFORM p_text::JSONB; RETURN TRUE; EXCEPTION WHEN OTHERS THEN RETURN FALSE; END; $func$;

-- Use it to find bad rows SELECT id, payload FROM io.thecodeforge.events.ingest WHERE NOT io.thecodeforge.util.is_valid_json(payload);

-- Step 3: Fix or exclude bad rows, then perform the migration SET maintenance_work_mem = '2GB'; -- speed up the rewrite

ALTER TABLE io.thecodeforge.events.ingest ALTER COLUMN payload TYPE JSONB USING payload::JSONB;

-- Step 4: Validate no data loss from deduplication SELECT id, payload::TEXT AS original_json, payload AS jsonb_value FROM io.thecodeforge.events.ingest WHERE payload::TEXT IS DISTINCT FROM payload::TEXT; -- this always false; better compare before/after -- If you kept a backup, compare counts.

-- Step 5: Create indexes after migration CREATE INDEX idx_ingest_payload_gin ON io.thecodeforge.events.ingest USING GIN (payload);

-- Step 6: Verify query performance improved EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM io.thecodeforge.events.ingest WHERE payload @> '{\"event_type\": \"purchase\"}';" }, "callout": { "type": "warning", "title": "One-Way Migration: You Cannot Go Back to JSON", "text": "Once you convert a column to JSONB, you cannot cast it back to JSON and expect the original document. JSONB strips whitespace, deduplicates keys, and sorts them internally. The original byte-for-byte representation is lost forever. Always take a backup or preserve the original TEXT column alongside during migration until you are certain the conversion is correct. I have seen teams panic after realising their audit trails no longer match legal requirements because duplicate keys were silently removed." }, "production_insight": "ALTER TABLE with USING rewrites the entire table — for 50M rows expect hours of I/O and WAL. Find and fix malformed JSON rows before migration; a single invalid row blocks the entire ALTER. Rule: always do a trial migration on a read replica first and compare row counts and checksums.", "key_takeaway": "Migrate TEXT to JSONB with ALTER TABLE ... USING column::JSONB. Validate malformed JSON first using a PL/pgSQL validity function. Plan a maintenance window for large tables — the table rewrite is not instant." } ]

● Production incidentPOST-MORTEMseverity: high

The 28x Query Improvement Nobody Noticed

Symptom
A daily reconciliation job querying individual fields from 3 million webhook rows took 22 minutes. The query used payload ->> 'amount' and similar extractions on a JSON column.
Assumption
The team assumed the query was slow because of database load or missing indexes. They spent weeks optimising query patterns, adding hardware, and even considering moving to a separate document store.
Root cause
The JSON column stored the payload as plain text. Every time the query accessed any field (via ->>), PostgreSQL re-parsed the entire JSON document from scratch. With 3 million rows, that meant 3 million full JSON parses per run.
Fix
Changed the column type from JSON to JSONB. No application code changes. The same query then ran in 47 seconds — a 28x improvement. The binary storage eliminated the per-read parse overhead.
Key lesson
  • If a JSON column will ever be queried, filtered, or indexed, it must be JSONB. The only exception is archival tables where you literally never read individual fields.
  • Always benchmark JSON vs JSONB write/read trade-offs with your actual data scale before committing.
  • A storage layer change alone can yield order-of-magnitude improvements — don't assume the problem is always code.
Production debug guideSymptom → Root Cause → Fix for slow JSONB queries5 entries
Symptom · 01
Query using ->> or -> in WHERE clause is slow even with GIN index
Fix
Run EXPLAIN (ANALYZE, BUFFERS). If you see Seq Scan, the GIN index is being bypassed. Rewrite the condition using @> containment operator. Example: WHERE data ->> 'status' = 'active'WHERE data @> '{"status": "active"}'.
Symptom · 02
Query using @? or @@ path expression operators is slow
Fix
Path expressions (@?, @@) cannot use GIN indexes as of PostgreSQL 16. Add an indexed relational column to narrow the candidate set first (e.g., event_type = 'purchase'), then apply the path expression.
Symptom · 03
Incorrect results: zero rows when data exists
Fix
Check if you used -> instead of ->>. Comparing a JSON node (e.g., "active" as JSON string) to a plain SQL string 'active' returns no match. Use data ->> 'status' = 'active' or compare JSONB to JSONB: data -> 'status' = '"active"' ::jsonb.
Symptom · 04
Insert performance degraded significantly after adding GIN index
Fix
GIN indexes slow down writes by 3-5x. Consider a partial GIN index (WHERE clause) or use a generated column with B-tree index for the most queried fields. Benchmark before/after.
Symptom · 05
Table bloat after frequent JSONB updates
Fix
Each jsonb_set() rewrites the entire TOAST tuple. Monitor pg_stat_user_tables.n_dead_tup. If dead tuples grow faster than autovacuum, move hot-write fields to separate relational columns or tune autovacuum_vacuum_cost_delay.
★ JSONB Quick Debug CommandsFive common JSONB scenarios and the exact commands to diagnose them.
GIN index not used (Seq Scan)
Immediate action
Run EXPLAIN ANALYZE to see query plan
Commands
EXPLAIN (ANALYZE, BUFFERS) SELECT ... WHERE data @> '{"field": "value"}'
Check index definition: \d+ table_name
Fix now
Rewrite filter using @> operator or add a generated column with B-tree index
Wrong results from JSON comparison+
Immediate action
Check if using -> instead of ->>
Commands
SELECT data -> 'field', data ->> 'field' FROM table LIMIT 5
Compare types: pg_typeof(data -> 'field') vs pg_typeof(data ->> 'field')
Fix now
Use data ->> 'field' for text comparison or compare JSONB to JSONB
High WAL from JSONB updates+
Immediate action
Check dead tuple count and WAL generation rate
Commands
SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table'
Check TOAST size: pg_size_pretty(pg_total_relation_size('table') - pg_relation_size('table'))
Fix now
Move frequently updated fields to separate relational columns, or reduce update frequency
TEXT to JSONB migration parse failures+
Immediate action
Identify malformed JSON rows
Commands
SELECT id, payload FROM table WHERE payload NOT LIKE '[{%' AND payload NOT LIKE '{%' LIMIT 100
Check specific parse: SELECT payload::JSONB FROM table WHERE id = X; (rollback if fails)
Fix now
Fix malformed JSON before migration: use regex or application-level parsing with error logging
Path expression query slow on large table+
Immediate action
Check if index can be used at all
Commands
EXPLAIN (ANALYZE, BUFFERS) SELECT ... WHERE data @@ '$.field > 100'
pg_stat_user_tables.seq_scan vs idx_scan for the table
Fix now
Add a relational filter (e.g., created_at > NOW() - INTERVAL '7 days') first, then apply path expression
🔥

That's MySQL & PostgreSQL. Mark it forged?

22 min read · try the examples if you haven't

Previous
MySQL Stored Functions
5 / 13 · MySQL & PostgreSQL
Next
Database Backup and Restore