CQRS in Spring Boot: You Split the Database, Now Your Queries Are Stale
CQRS pattern in Spring Boot microservices — real production failures, eventual consistency traps, and debug commands that save your on-call rotation.
- CQRS separates read and write models to scale independently.
- The write model uses events; the read model is a denormalized projection.
- Eventual consistency means stale reads — your users will see old data.
- Kafka + Debezium is the battle-tested pipeline, not shared DBs.
- If your read store is down, writes still succeed — and users get 500s on reads.
You have two cash registers. One takes orders and writes them in a book. The other just reads the book and tells customers their total. If the reader lags behind the writer, the customer sees yesterday's price. That's eventual consistency — and you have to tell the customer they might be wrong.
You're on-call. It's 2 AM. The ticket queue is exploding. Users report that their order total shows double the product price. Every third refresh, the total changes. The product team is screaming. Your instinct says 'data race' or 'cache bug'. You check Redis TTLs. You check the database for phantom writes. Nothing.
You go deeper. The orders service writes to PostgreSQL. The pricing service writes to MongoDB. A Kafka topic pipes events between them. You find it. The pricing projection lagged by 45 seconds. A burst of 10,000 orders hit the same time as a pricing update. The read model hadn't processed the new price yet. Users saw old prices on new orders.
Welcome to CQRS. You split the database for scaling and now your queries are stale.
I've seen this exact pattern kill a Black Friday launch. We had three services sharing a single PostgreSQL database. Reads and writes fought each other at 500 QPS. The fix was CQRS. But the first implementation was naive — one database with two connection pools. That's not CQRS. That's a shared database with extra latency.
Real CQRS means separate data stores. Writes go to an event store or a write-optimized database like Aurora. Reads come from a denormalized store like Elasticsearch or DynamoDB. Events flow through Kafka. Projections update asynchronously. This buys you independent scaling. It costs you eventual consistency.
You can't wave that away with 'it's fine for our use case'. It breaks. Predictably. Your web app shows stale data until the projection catches up. Your tests pass because they poll and wait. Production users don't poll. They hit refresh and get a different answer. You have to tell them the answer is wrong and show a loading state. That's the contract.
If you can't handle that, don't do CQRS. If you can, it saves your systems under load. Let me show you how to build it right and what breaks when you don't.
Command Model: The Write Side Is Not an ORM Dump
The command model handles writes. It validates invariants, applies business logic, and produces events. It does not serve queries. End of story. If your command handler returns the saved entity, you're doing CRUD, not CQRS.
The command handler should be a single-purpose class. One command, one handler. No generic save() methods. The handler receives a command object (DTO), validates it, checks invariants, and if all passes, applies the change and publishes an event. The event is the only output. If the client needs the new state, they query the read model.
I've seen teams handwave this and return the entity from the command endpoint 'for convenience'. Then the client starts using that data instead of querying the read model. Fast forward six months. The write model has leaky abstractions. The read model is stale because no one uses it. You have a shared database with a facade.
Use the transactional outbox pattern. Write the event to an outbox table in the same transaction as the domain change. A separate process polls that table and publishes to Kafka. This guarantees at-least-once delivery. If you use Eventuate Tram or Debezium, you get this out of the box.
The command model database should be normalized. 3NF or BCNF. Indexes for the write path only. No read-oriented denormalization. That's the read model's job.
One more thing: commands are not queries. A command changes state. If it doesn't, it's a query. Don't call a query a command. Don't call a command a query. The distinction is behavioral, not architectural.
Query Model: Denormalize Aggressively, But Don't Forget Versioning
The query model is a denormalized projection of the events. One table per query shape. No joins. No subqueries. You eat the storage cost to gain read speed. Each projection is built by a consumer that listens to events and writes directly to the query store.
This is where the real work lives. Your events are JSON blobs with nested objects. Your query store is flat columns and arrays. You need three pieces: the event consumer (Kafka Listener), the projection function, and the query repository.
Version every projection. Add a version column or field. When the consumer processes an event, it increments the version. Queries check this version. If the version hasn't changed in N seconds (your acceptable staleness window), return the data. If it's older, return a 503 with a Retry-After header or a loading state. This turns an eventual consistency problem into a user experience problem.
Failures happen. The consumer crashes mid-event. The projection updates half the fields. You need idempotent projections — the same event applied twice produces the same result, not a duplicate row. Use upsert logic: if the entity exists, update; if not, insert. The event ID is your idempotency key.
Database choice matters. For high-read, low-write projections, use Elasticsearch or OpenSearch. For consistent, low-latency reads, use DynamoDB with DAX. For simple projections, use PostgreSQL with a denormalized schema and no joins. Do not use Cassandra for projections unless you love debugging tombstone issues.
One more thing: the query service must not talk to the write database. Not even for health checks. Separate connection pools, separate schemas, separate clusters. If they share a database, you don't have CQRS. You have a monolith with two APIs.
Event Sourcing vs CQRS: You Don't Need Both
People conflate CQRS with event sourcing. They're separate. CQRS splits read and write. Event sourcing stores state as a sequence of events. You can have one without the other.
CQRS without event sourcing: you have a normal write database and a separate read database. Events flow between them. The write database stores current state. Events are produced on change. The read database is rebuilt from events. This is simpler. It also means you lose historical state — you can't replay events to rebuild the read model from scratch unless you kept them.
Event sourcing without CQRS: you store all events, but you still query the event store directly. This is painful. Querying a sequence of events to answer 'what's the current state?' requires replaying all events. You need snapshots. You need projection indexes. It's a write-optimized nightmare for reads.
You need both when you need complete audit trails. Financial systems. Compliance-heavy domains. Multi-version concurrency with serializable isolation. If you don't need those, don't pay the complexity tax.
Here's the production truth: event sourcing is for when you need to know not just the current state, but how you got there. CQRS is for when your read and write workloads have fundamentally different scaling profiles. They overlap but are not dependent.
My team once built an event-sourced CQRS system for a customer loyalty program. The events were 'points earned', 'points redeemed', 'points expired'. The read model was a current balance. We never needed the event history. We could have stored just the balance in a single row. The event store was dead weight. Don't be that team.
Eventual Consistency: You Can't Hide It, So Design For It
Eventual consistency is not a bug. It's a feature of distributed systems. If you try to hide it, you end up with synchronous event processing, which defeats the purpose of CQRS. The write model slows down waiting for the read model to update. You lose the scaling benefit.
Design for staleness. The UI must handle stale data. Show loading spinners, skeleton screens, or a 'this data is N seconds old' indicator. The API must return stale data gracefully. If the read model is behind, return a warning header: Warning: 299 - "stale-data". The client can decide to retry or show cached data.
Set SLAs. Define acceptable staleness per entity. Order status can be 5 seconds stale. Inventory count can be 30 seconds. User profile can be 2 minutes. The projection service must meet these SLAs. Monitor consumer lag. Alert when lag exceeds the SLA.
What happens when the projection service is down for 10 minutes? The read model is 10 minutes stale. Users create orders against a 10-minute-old inventory count. You sell items that no longer exist. The fix is to reject the command at the write model if the read model is too stale. Yes, that couples them again. But it's better than overselling.
Another pattern: read-your-writes consistency. When a user writes data, they should see it in the next read. For that user only. Other users see stale data. Implement this by returning a write timestamp from the command. The client includes that timestamp in the next query. The read model checks if its projection version is >= that timestamp. If not, it blocks until it catches up or returns a 503.
I've seen teams use a distributed cache like Redis as a write-through cache between the command and read model. The command writes to both the write DB and Redis. The read model reads from Redis. This gives strong consistency at the cost of latency. It works for small datasets. It fails for large ones because Redis memory fills up.
Transactional Outbox Pattern: The Backbone of Reliable CQRS
The outbox pattern is how you guarantee events are published reliably. Write the event to an outbox table in the same database transaction as the domain change. A separate publisher polls the outbox for unpublished events and sends them to Kafka.
Why not publish the event directly from the command handler? Because the DB transaction might fail after the event is published. Now you have an event that says 'order created' but the order doesn't exist in the write database. This is the classic dual-write problem.
Use Debezium if you want CDC (change data capture). It reads the database transaction log and publishes events. No outbox table needed. But it requires the write database to support CDC (PostgreSQL logical replication, MySQL binlog). It also adds latency — the event is available only after the transaction log is flushed to disk.
Use a simple outbox table if you want low latency. Poll interval of 100ms. Batch size of 100. Deduplicate by event ID. This is what I use in production. It's simple, predictable, and easy to debug.
The outbox publisher must be idempotent. If it crashes after sending the event but before marking it as published, the event is sent twice. The consumer must handle duplicates — use idempotent projections (upsert by event ID).
I once saw an outbox publisher with a bug: it published events but never marked them as published. The table grew to 5 million rows. The poll query (SELECT * FROM outbox WHERE published = false) timed out. The publisher stopped. Not a single event was processed for 8 hours. The fix: add a LIMIT and an index on (published, created_at). Also add a TTL cleanup job — delete published events after 24 hours.
Don't use Quartz or cron jobs for polling. Use Spring's @Scheduled with a configurable rate. Or use a dedicated library like Outbox Runner. Keep it simple.
Testing CQRS: It's Not Just Unit Tests
CQRS breaks your test assumptions. You can't test with a single database and in-memory mocks. You need three things: component tests for the command handler, component tests for the query handler, and integration tests for the event pipeline.
Command handler tests: mock the repository and the event publisher (or outbox writer). Test invariants and event generation. Don't test the database. Use a fake or an in-memory repository.
Query handler tests: mock the read repository. Test that the correct data is returned. Test that stale data warnings are generated. Test the projection function by feeding it events and checking the output in the read store.
Integration tests: spin up a real Kafka (use Testcontainers). Start the command service, the outbox publisher, and the query service. Send a command via REST. Wait for the event to flow through the pipeline. Query the read model and assert the result. This is the only test that proves the system works end-to-end.
Don't stub the Kafka producer. Don't mock the outbox publisher. These are the parts that break in production. Test them with real infrastructure.
Resilience tests: simulate consumer crash. Kill the projection service while events are in flight. Restart it. Assert that events are processed exactly once (idempotent projection). Simulate outbox publisher crash. Kill it. Restart. Assert no events are lost.
Performance tests: generate 10,000 events. Measure consumer lag. Measure read model update latency. Verify it meets your SLA. If it doesn't, tune the consumer batch size, parallelism, and read store write throughput.
I've seen teams skip integration tests because 'they're slow'. Then they ship an outbox bug that drops every 100th event. Logging found it 3 hours after deployment. 5,000 orders lost. Don't be that team.
The 45-Second Pricing Lag — A Black Friday Near-Miss
- Eventual consistency isn't eventual enough for checkout.
- Build staleness detection into the read path or don't use CQRS for pricing.
kafka-consumer-groups --bootstrap-server broker:9092 --group pricing-projection-group --describekubectl top pod -l app=pricing-projectionKey takeaways
Common mistakes to avoid
5 patternsUsing a single database table for both command and query models with two connection pools.
Making the projection service synchronous — waiting for the read model to update before returning from the command.
Not versioning the projection — no way to detect staleness.
Publishing events from the command handler without using the outbox pattern.
Running multiple outbox publisher instances without a distributed lock.
Interview Questions on This Topic
What happens to the read model when the write model publishes an event but the projection service crashes before processing? How do you recover?
Frequently Asked Questions
That's Microservices Patterns. Mark it forged?
10 min read · try the examples if you haven't