Senior 10 min · May 23, 2026

CQRS in Spring Boot: You Split the Database, Now Your Queries Are Stale

CQRS pattern in Spring Boot microservices — real production failures, eventual consistency traps, and debug commands that save your on-call rotation.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • CQRS separates read and write models to scale independently.
  • The write model uses events; the read model is a denormalized projection.
  • Eventual consistency means stale reads — your users will see old data.
  • Kafka + Debezium is the battle-tested pipeline, not shared DBs.
  • If your read store is down, writes still succeed — and users get 500s on reads.
✦ Definition~90s read
What is CQRS in Spring Boot?

CQRS stands for Command Query Responsibility Segregation. You split your database into a write side (commands) and a read side (queries). Commands produce events. Queries consume those events to build denormalized projections. The write model is normalized, ACID, and transactional.

You have two cash registers.

The read model is flat, indexed for the exact queries your UI needs. They are not the same database. If they are, you don't have CQRS. You have a transaction log.

The split solves two problems. First, your write model doesn't get crushed by read traffic. Second, your read model doesn't get bogged down by write locks. But it introduces a third problem: eventual consistency. The read side is always behind. A few milliseconds in the best case. Seconds or minutes if the event pipeline breaks. You cannot hide this. You must design for it.

You need CQRS when your read queries have different performance and scaling requirements than your write commands. When reporting queries join 12 tables and take 30 seconds per request while your writes need sub-10ms latency. When your graphQL API needs drastically different data shapes than your REST command endpoints. When one-size-fits-all CRUD gives you a database that can't do anything well.

Plain-English First

You have two cash registers. One takes orders and writes them in a book. The other just reads the book and tells customers their total. If the reader lags behind the writer, the customer sees yesterday's price. That's eventual consistency — and you have to tell the customer they might be wrong.

You're on-call. It's 2 AM. The ticket queue is exploding. Users report that their order total shows double the product price. Every third refresh, the total changes. The product team is screaming. Your instinct says 'data race' or 'cache bug'. You check Redis TTLs. You check the database for phantom writes. Nothing.

You go deeper. The orders service writes to PostgreSQL. The pricing service writes to MongoDB. A Kafka topic pipes events between them. You find it. The pricing projection lagged by 45 seconds. A burst of 10,000 orders hit the same time as a pricing update. The read model hadn't processed the new price yet. Users saw old prices on new orders.

Welcome to CQRS. You split the database for scaling and now your queries are stale.

I've seen this exact pattern kill a Black Friday launch. We had three services sharing a single PostgreSQL database. Reads and writes fought each other at 500 QPS. The fix was CQRS. But the first implementation was naive — one database with two connection pools. That's not CQRS. That's a shared database with extra latency.

Real CQRS means separate data stores. Writes go to an event store or a write-optimized database like Aurora. Reads come from a denormalized store like Elasticsearch or DynamoDB. Events flow through Kafka. Projections update asynchronously. This buys you independent scaling. It costs you eventual consistency.

You can't wave that away with 'it's fine for our use case'. It breaks. Predictably. Your web app shows stale data until the projection catches up. Your tests pass because they poll and wait. Production users don't poll. They hit refresh and get a different answer. You have to tell them the answer is wrong and show a loading state. That's the contract.

If you can't handle that, don't do CQRS. If you can, it saves your systems under load. Let me show you how to build it right and what breaks when you don't.

Command Model: The Write Side Is Not an ORM Dump

The command model handles writes. It validates invariants, applies business logic, and produces events. It does not serve queries. End of story. If your command handler returns the saved entity, you're doing CRUD, not CQRS.

The command handler should be a single-purpose class. One command, one handler. No generic save() methods. The handler receives a command object (DTO), validates it, checks invariants, and if all passes, applies the change and publishes an event. The event is the only output. If the client needs the new state, they query the read model.

I've seen teams handwave this and return the entity from the command endpoint 'for convenience'. Then the client starts using that data instead of querying the read model. Fast forward six months. The write model has leaky abstractions. The read model is stale because no one uses it. You have a shared database with a facade.

Use the transactional outbox pattern. Write the event to an outbox table in the same transaction as the domain change. A separate process polls that table and publishes to Kafka. This guarantees at-least-once delivery. If you use Eventuate Tram or Debezium, you get this out of the box.

The command model database should be normalized. 3NF or BCNF. Indexes for the write path only. No read-oriented denormalization. That's the read model's job.

One more thing: commands are not queries. A command changes state. If it doesn't, it's a query. Don't call a query a command. Don't call a command a query. The distinction is behavioral, not architectural.

Production Trap:
If your command handler loads the entity and returns it, you've broken the read-write separation. The client will cache that response and never query the read model. Your read model becomes a dead letter. Always return 202 Accepted with no body, or a correlation ID only.
Production Insight
Saw a team use Jackson JSON views to return different fields for commands vs queries. That's not CQRS. That's a different JSON serializer.
Key Takeaway
Commands produce events, not responses. If your command returns data, you're doing CRUD.

Query Model: Denormalize Aggressively, But Don't Forget Versioning

The query model is a denormalized projection of the events. One table per query shape. No joins. No subqueries. You eat the storage cost to gain read speed. Each projection is built by a consumer that listens to events and writes directly to the query store.

This is where the real work lives. Your events are JSON blobs with nested objects. Your query store is flat columns and arrays. You need three pieces: the event consumer (Kafka Listener), the projection function, and the query repository.

Version every projection. Add a version column or field. When the consumer processes an event, it increments the version. Queries check this version. If the version hasn't changed in N seconds (your acceptable staleness window), return the data. If it's older, return a 503 with a Retry-After header or a loading state. This turns an eventual consistency problem into a user experience problem.

Failures happen. The consumer crashes mid-event. The projection updates half the fields. You need idempotent projections — the same event applied twice produces the same result, not a duplicate row. Use upsert logic: if the entity exists, update; if not, insert. The event ID is your idempotency key.

Database choice matters. For high-read, low-write projections, use Elasticsearch or OpenSearch. For consistent, low-latency reads, use DynamoDB with DAX. For simple projections, use PostgreSQL with a denormalized schema and no joins. Do not use Cassandra for projections unless you love debugging tombstone issues.

One more thing: the query service must not talk to the write database. Not even for health checks. Separate connection pools, separate schemas, separate clusters. If they share a database, you don't have CQRS. You have a monolith with two APIs.

Senior Shortcut:
Use MongoDB or DynamoDB for projections. They handle partial updates and schema changes better than relational DBs. Plus, you get automatic versioning with @Version in JPA, but DynamoDB conditional updates are faster under high concurrency.
Production Insight
Don't put read models in a relational database if you can avoid it. The normalization tax kills query performance at scale.
Key Takeaway
Denormalize until there are no joins. Store the query in the shape your API sends to the client. Version everything.

Event Sourcing vs CQRS: You Don't Need Both

People conflate CQRS with event sourcing. They're separate. CQRS splits read and write. Event sourcing stores state as a sequence of events. You can have one without the other.

CQRS without event sourcing: you have a normal write database and a separate read database. Events flow between them. The write database stores current state. Events are produced on change. The read database is rebuilt from events. This is simpler. It also means you lose historical state — you can't replay events to rebuild the read model from scratch unless you kept them.

Event sourcing without CQRS: you store all events, but you still query the event store directly. This is painful. Querying a sequence of events to answer 'what's the current state?' requires replaying all events. You need snapshots. You need projection indexes. It's a write-optimized nightmare for reads.

You need both when you need complete audit trails. Financial systems. Compliance-heavy domains. Multi-version concurrency with serializable isolation. If you don't need those, don't pay the complexity tax.

Here's the production truth: event sourcing is for when you need to know not just the current state, but how you got there. CQRS is for when your read and write workloads have fundamentally different scaling profiles. They overlap but are not dependent.

My team once built an event-sourced CQRS system for a customer loyalty program. The events were 'points earned', 'points redeemed', 'points expired'. The read model was a current balance. We never needed the event history. We could have stored just the balance in a single row. The event store was dead weight. Don't be that team.

Interview Gold:
When asked 'How do you rebuild a read model?', say 'Replay all events from the event store, or from the last checkpoint. If you don't have an event store, you need a snapshot table or a periodic full rebuild from the write database.'
Production Insight
90% of CQRS systems don't need event sourcing. The event stream between write and read is enough for replay.
Key Takeaway
CQRS splits read from write. Event sourcing stores all events. They are orthogonal. Choose both only when you need audit trails.

Eventual Consistency: You Can't Hide It, So Design For It

Eventual consistency is not a bug. It's a feature of distributed systems. If you try to hide it, you end up with synchronous event processing, which defeats the purpose of CQRS. The write model slows down waiting for the read model to update. You lose the scaling benefit.

Design for staleness. The UI must handle stale data. Show loading spinners, skeleton screens, or a 'this data is N seconds old' indicator. The API must return stale data gracefully. If the read model is behind, return a warning header: Warning: 299 - "stale-data". The client can decide to retry or show cached data.

Set SLAs. Define acceptable staleness per entity. Order status can be 5 seconds stale. Inventory count can be 30 seconds. User profile can be 2 minutes. The projection service must meet these SLAs. Monitor consumer lag. Alert when lag exceeds the SLA.

What happens when the projection service is down for 10 minutes? The read model is 10 minutes stale. Users create orders against a 10-minute-old inventory count. You sell items that no longer exist. The fix is to reject the command at the write model if the read model is too stale. Yes, that couples them again. But it's better than overselling.

Another pattern: read-your-writes consistency. When a user writes data, they should see it in the next read. For that user only. Other users see stale data. Implement this by returning a write timestamp from the command. The client includes that timestamp in the next query. The read model checks if its projection version is >= that timestamp. If not, it blocks until it catches up or returns a 503.

I've seen teams use a distributed cache like Redis as a write-through cache between the command and read model. The command writes to both the write DB and Redis. The read model reads from Redis. This gives strong consistency at the cost of latency. It works for small datasets. It fails for large ones because Redis memory fills up.

Never Do This:
Don't make the projection service synchronous. Don't block the command handler waiting for the read model to update. You'll bottleneck the write path and lose the entire benefit of CQRS.
Production Insight
A stale read that returns 200 is worse than a 503 with Retry-After. Users understand 'try again'. They don't understand 'the total changed'.
Key Takeaway
Design for eventual consistency, don't fight it. Set SLAs, monitor lag, and tell the client when data is stale.

Transactional Outbox Pattern: The Backbone of Reliable CQRS

The outbox pattern is how you guarantee events are published reliably. Write the event to an outbox table in the same database transaction as the domain change. A separate publisher polls the outbox for unpublished events and sends them to Kafka.

Why not publish the event directly from the command handler? Because the DB transaction might fail after the event is published. Now you have an event that says 'order created' but the order doesn't exist in the write database. This is the classic dual-write problem.

Use Debezium if you want CDC (change data capture). It reads the database transaction log and publishes events. No outbox table needed. But it requires the write database to support CDC (PostgreSQL logical replication, MySQL binlog). It also adds latency — the event is available only after the transaction log is flushed to disk.

Use a simple outbox table if you want low latency. Poll interval of 100ms. Batch size of 100. Deduplicate by event ID. This is what I use in production. It's simple, predictable, and easy to debug.

The outbox publisher must be idempotent. If it crashes after sending the event but before marking it as published, the event is sent twice. The consumer must handle duplicates — use idempotent projections (upsert by event ID).

I once saw an outbox publisher with a bug: it published events but never marked them as published. The table grew to 5 million rows. The poll query (SELECT * FROM outbox WHERE published = false) timed out. The publisher stopped. Not a single event was processed for 8 hours. The fix: add a LIMIT and an index on (published, created_at). Also add a TTL cleanup job — delete published events after 24 hours.

Don't use Quartz or cron jobs for polling. Use Spring's @Scheduled with a configurable rate. Or use a dedicated library like Outbox Runner. Keep it simple.

Senior Shortcut:
Add an index on (published, created_at) on the outbox table. Without it, the poll query scans the entire table as it grows. Add a TTL: delete published events older than 24 hours. The outbox table is not an event store — don't keep it forever.
Production Insight
The #1 outbox failure is the poll query timing out because of missing index. Add the index before you ship to production.
Key Takeaway
Outbox pattern guarantees at-least-once delivery. Poll with LIMIT. Index the right columns. Clean up old records.

Testing CQRS: It's Not Just Unit Tests

CQRS breaks your test assumptions. You can't test with a single database and in-memory mocks. You need three things: component tests for the command handler, component tests for the query handler, and integration tests for the event pipeline.

Command handler tests: mock the repository and the event publisher (or outbox writer). Test invariants and event generation. Don't test the database. Use a fake or an in-memory repository.

Query handler tests: mock the read repository. Test that the correct data is returned. Test that stale data warnings are generated. Test the projection function by feeding it events and checking the output in the read store.

Integration tests: spin up a real Kafka (use Testcontainers). Start the command service, the outbox publisher, and the query service. Send a command via REST. Wait for the event to flow through the pipeline. Query the read model and assert the result. This is the only test that proves the system works end-to-end.

Don't stub the Kafka producer. Don't mock the outbox publisher. These are the parts that break in production. Test them with real infrastructure.

Resilience tests: simulate consumer crash. Kill the projection service while events are in flight. Restart it. Assert that events are processed exactly once (idempotent projection). Simulate outbox publisher crash. Kill it. Restart. Assert no events are lost.

Performance tests: generate 10,000 events. Measure consumer lag. Measure read model update latency. Verify it meets your SLA. If it doesn't, tune the consumer batch size, parallelism, and read store write throughput.

I've seen teams skip integration tests because 'they're slow'. Then they ship an outbox bug that drops every 100th event. Logging found it 3 hours after deployment. 5,000 orders lost. Don't be that team.

The Classic Bug:
Your outbox publisher uses @Scheduled but the application has multiple instances. Now every instance polls the same outbox table and publishes the same events. Duplicate events everywhere. Fix: use a distributed lock (e.g., ShedLock) or a leader election.
Production Insight
Duplicate events from multiple outbox publishers cost me a weekend. Use ShedLock. Or better, use Debezium which naturally single-threads the CDC pipeline.
Key Takeaway
Test the full pipeline with real infrastructure. Don't trust mocks for event-driven architecture.
● Production incidentPOST-MORTEMseverity: high

The 45-Second Pricing Lag — A Black Friday Near-Miss

Symptom
PagerDuty alert: 'Price mismatch rate > 5%'. Users report order total changes between step 2 and step 3 of checkout. Revenue reconciliation job fails — says 12% of orders have wrong totals.
Assumption
Thought it was a race condition in the checkout service. Checked write locks and transaction isolation levels. All normal.
Root cause
The pricing projection service processed Kafka messages with batch size 500 and poll interval 5 seconds. A price update event and 10,000 orders hit the same window. The projection service processed orders with old prices before processing the price update. The read model was 45 seconds stale under peak load.
Fix
Reduced Kafka consumer batch size to 50. Set max.poll.interval.ms to 10 seconds. Added a materialized view version column. Queries now check if the view version matches the current event timestamp. If mismatch > 2 seconds, show a loading skeleton instead of stale data.
Key lesson
  • Eventual consistency isn't eventual enough for checkout.
  • Build staleness detection into the read path or don't use CQRS for pricing.
Production debug guideSymptom → root cause → fix for the failures that actually happen4 entries
Symptom · 01
Read model returns stale data — user sees yesterday's order status.
Fix
Check the Kafka consumer lag for your projection topic (kafka-consumer-groups --bootstrap-server ... --group ... --describe). High lag means consumer is stuck or slow. Check CPU on the consumer pod — likely too small. Check for poison pill messages (non-deserializable events). Also check if the write service published the event. If the write succeeded but the event didn't publish, you have no data for the projection to consume.
Symptom · 02
Commands return success but reads show no change.
Fix
Check the write service logs for successful event publication. Then check the event store (Kafka topic) for the message. If it's there, the projection service didn't consume it. Check the consumer group offset — it might have committed without processing due to auto-commit. Enable manual offset commit after processing. If the event is missing, the write service didn't publish it. Check transactional outbox pattern — maybe the DB transaction rolled back but the event was published.
Symptom · 03
Read model inconsistent across replicas — two API calls return different data.
Fix
Check if your read replicas are behind different write leaders. If using DynamoDB global tables, check replication lag via CloudWatch. For Elasticsearch, check cluster health — unassigned shards cause partial results. The root cause is usually a partial deploy of the projection service — some pods have old code that processes events differently. Roll back to a single version.
Symptom · 04
High latency on read queries despite CQRS.
Fix
Check if your read model is still doing joins. If your projection service writes to a normalized schema, you've defeated CQRS. The read model must be denormalized. Check for missing indexes. Check the read model's write path — if the projection service writes too slowly, queries pile up. Also check if the read model is being used as a write cache (write-through, write-behind) — that violates the pattern.
★ Debug Cheat SheetCommands for fast diagnosis in production
Kafka consumer lag > 1000
Immediate action
Check consumer pod CPU and memory. Check for poison pill messages.
Commands
kafka-consumer-groups --bootstrap-server broker:9092 --group pricing-projection-group --describe
kubectl top pod -l app=pricing-projection
Fix now
Increase consumer parallelism: add more partitions and consumers. Set max.poll.records=50. Enable manual offset commit.
Write succeeds but read shows old data after 30 seconds+
Immediate action
Check transactional outbox table in write database for unpublished events.
Commands
SELECT * FROM outbox WHERE published = false AND created_at > NOW() - INTERVAL '1 minute'
kubectl logs -l app=order-write --tail=100 | grep 'outbox'
Fix now
Restart the outbox publisher with --spring.datasource.hikari.maximum-pool-size=10. Set outbox.publisher.poll-interval-ms=1000.
Elasticsearch returns partial results — some fields null+
Immediate action
Check cluster health and shard allocation.
Commands
curl -X GET 'http://es-cluster:9200/_cluster/health?pretty'
curl -X GET 'http://es-cluster:9200/_cat/shards?v'
Fix now
Reallocate unassigned shards: curl -X POST 'http://es-cluster:9200/_cluster/reroute' —retry_failed. Then reindex the missing documents from the last Kafka offset.
Read replica behind primary by > 10 seconds+
Immediate action
Check network latency between primary and replica. Check if replica is using asynchronous replication.
Commands
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp() FROM pg_stat_replication;
SHOW synchronous_commit;
Fix now
Set synchronous_commit = 'on' for the write database. Or switch to synchronous replication mode for the read replica. If you can't, add a read-after-write consistency guarantee at the application level.
CQRS vs Traditional CRUD vs Event Sourcing
DimensionTraditional CRUDCQRSCQRS + Event Sourcing
Read/Write couplingTight — one DB for bothLoose — separate storesLoose — separate stores
Consistency modelStrong (within one DB)Eventual (across stores)Eventual (across stores + events)
ScalingShared bottleneckIndependent scalingIndependent scaling + audit
Historical stateLast known state onlyLast known state onlyFull event history
ComplexityLowMedium — outbox, projections, lagHigh — event store, snapshots, versioning
Best forSmall teams, simple CRUDHigh read/write skew, reportingCompliance, audit, multi-version concurrency

Key takeaways

1
CQRS separates read and write databases. If they share storage, you don't have CQRS.
2
Eventual consistency is a design constraint, not a bug. Design the UI for staleness.
3
Always use the transactional outbox pattern
never publish events from the command handler directly.
4
Version your projections. Detect staleness. Return warnings or 503 instead of wrong data.
5
Test the full pipeline with real Kafka and real databases. Mocks hide production failures.

Common mistakes to avoid

5 patterns
×

Using a single database table for both command and query models with two connection pools.

Symptom
Read queries still block writes. Performance doesn't improve.
Fix
Create a separate read-only replica or a denormalized read table. The command model should never serve queries.
×

Making the projection service synchronous — waiting for the read model to update before returning from the command.

Symptom
Command latency spikes as reads catch up. Throughput drops.
Fix
Fire-and-forget the event. The command returns 202 immediately. The read model updates asynchronously. Set SLA monitoring.
×

Not versioning the projection — no way to detect staleness.

Symptom
Users see inconsistent data across refreshes. No metric to track lag.
Fix
Add a version column to the read model. The projection service increments it on each update. The query controller checks it against the write timestamp.
×

Publishing events from the command handler without using the outbox pattern.

Symptom
Events published for transactions that rolled back. Or events lost when the Kafka producer fails.
Fix
Use transactional outbox. Write event to outbox table in same DB transaction. Poll and publish separately.
×

Running multiple outbox publisher instances without a distributed lock.

Symptom
Duplicate events in Kafka. Read model processes each event multiple times (unless idempotent).
Fix
Use ShedLock or a leader election pattern. Or use Debezium which handles this natively.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
What happens to the read model when the write model publishes an event b...
Q02JUNIOR
Explain the difference between CQRS and event sourcing in simple terms.
Q03SENIOR
Design a CQRS system for an e-commerce catalog that supports 10,000 read...
Q04SENIOR
How do you handle schema changes in CQRS when the read model needs a new...
Q05SENIOR
Your CQRS system has 3 read replicas for the query database. Why does on...
Q06JUNIOR
Can CQRS work without a message broker like Kafka?
Q07SENIOR
How do you test eventual consistency in CQRS without relying on Thread.s...
Q08SENIOR
You need to rebuild the read model from scratch after a bug in the proje...
Q01 of 08SENIOR

What happens to the read model when the write model publishes an event but the projection service crashes before processing? How do you recover?

ANSWER
The event stays in Kafka (uncommitted offset) if the consumer uses manual commit. When the projection service restarts, it resumes from the last committed offset and reprocesses the event. If the consumer used auto-commit (bad), the event is lost. Recovery requires replaying from the outbox table or a backup. Always use manual offset commits and idempotent projections.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Is CQRS the same as using a read replica database?
02
Do I need CQRS for my microservices?
03
What database should I use for the read model?
04
How do I handle idempotency in CQRS?
05
Can CQRS be used within a single monolithic application?
🔥

That's Microservices Patterns. Mark it forged?

10 min read · try the examples if you haven't

Previous
Saga Pattern for Distributed Transactions
3 / 3 · Microservices Patterns
Next
Spring Boot Production Deployment Guide