Advanced 16 min · March 05, 2026

Event Sourcing Explained: Architecture, Patterns and Production Pitfalls

Q: Do I need CQRS to use Event Sourcing?

No. CQRS is optional. If your only read pattern is "load aggregate by ID", you can use snapshot-based reads without a separate read model. Add CQRS when you have complex query requirements or need different storage technologies for reads.

Q: How do you handle event store growth?

Plan for growth: 1KB events at 100 events/s generates ~8.6GB/day. Archive old events to cheaper storage (e.g., S3) after a retention period (e.g., 90 days). Use snapshots to avoid replaying the full history. Some systems implement event retention policies: keep last N months online, archive older events.

Q: What happens if an event handler fails to process an event?

The projection should have a dead letter queue (DLQ). Failed events are moved to the DLQ with error details. An alert triggers on any new DLQ entry. A human operator inspects the event and either fixes the handler, skips the event, or replays it after a fix. The projection continues processing subsequent events.

Q: Can I use Event Sourcing with a relational database?

Yes. Many production systems use PostgreSQL with JSONB columns for event data. It handles thousands of events per second with proper indexing and partitioning. The primary key (aggregate_id, version) provides fast stream reads. For higher throughput, consider dedicated event stores.

Q: What is the transactional outbox pattern?

It's a pattern to atomically write events to the event store and a separate outbox table in the same database transaction. A background process then publishes the events to a message bus. This avoids the dual-write problem where events might be stored but not published, or vice versa. It's essential for reliable event-driven architectures.

Q: Can Event Sourcing work with Kafka as the event store?

Yes, Kafka can serve as an event store if you treat its log as the source of truth. However, Kafka lacks the built-in optimistic concurrency control (per-key expected-version checks) that dedicated event stores provide, and log compaction does not fill that gap: it only retains the latest record per key. You'd need to enforce version checks yourself, for example with a single writer per partition or a separate store that tracks stream versions. It's viable for high-throughput systems but adds complexity.

Event Sourcing demystified: learn how append-only event logs replace mutable state, how CQRS fits in, and the real gotchas that trip up engineers in production.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Production tested · Last updated July 18, 2026 · 2,466 articles, all by Naren

Before you start ⏱ 30 min

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems


⚡ Quick Answer

Event Sourcing persists every state change as an immutable event, not the current snapshot.
The event log becomes the sole source of truth; projections derive read models.
CQRS separates write and read models — you query from projections, not from events.
Snapshots prevent unbounded replay cost: take them every ~500 events.
Schema evolution demands versioned deserialization — events outlive your code.
Idempotency is non-negotiable: duplicate events must produce the same state.

✦ Definition · ~90s read

What is Event Sourcing?

Event sourcing is a persistence pattern where you store every state change as an immutable, ordered sequence of events, rather than overwriting the current state in a table. Instead of updating a row for 'order.total = 200', you append an event like 'OrderItemAdded' or 'DiscountApplied'.

The current state is derived by replaying all events from the beginning. This exists because traditional CRUD loses history, audit trails, and the ability to reconstruct past states—critical for domains like banking, logistics, or compliance where every fact matters.

It solves the problem of temporal queries and debugging by making the event log the single source of truth.

In the ecosystem, event sourcing is not a silver bullet. It pairs naturally with CQRS (Command Query Responsibility Segregation) because writes (append-only events) and reads (projected state) have fundamentally different shapes. You'll see it in production built on EventStoreDB (a dedicated event database), Axon Framework (Java), or lightweight implementations with PostgreSQL as an event store.

Do NOT use it for simple CRUD apps, high-frequency trading where latency under 1ms is required, or when your team lacks operational maturity: snapshotting, schema evolution, and idempotency handling add real complexity. Alternatives include outbox patterns (for reliable messaging) or temporal tables (built into SQL Server, available through extensions in PostgreSQL) if you only need point-in-time recovery without full event replay.

Concretely, an event store is an append-only log where each event has a type, timestamp, payload (JSON or Protobuf), and an aggregate ID. You optimize with snapshots—periodic state checkpoints—to avoid replaying 10 million events on every read. Schema evolution is brutal: events live forever, so you must handle versioning (e.g., event upcasting or dual-write migrations).

Idempotency requires deduplication keys on event streams to prevent double-processing from retries. In production, expect to deal with eventual consistency between event store and read models, and invest in monitoring for event replay latency—teams often underestimate the operational cost until they hit 100k events per aggregate.

Plain-English First

Imagine your bank never just overwrites your balance — instead it keeps every single deposit and withdrawal in a ledger, and your current balance is just the total of all those entries. Event Sourcing works the same way: instead of saving 'the current state' of something, you save every change that ever happened to it as a sequence of immutable events. Want to know what your data looked like last Tuesday at 3pm? Just replay the events up to that point. It's like having a full undo history for your entire application.

One thing people miss: the ledger itself is append-only. You never erase or edit past entries. That means you also never lose intent: each event captures what changed and why, not just the final number.

Here's the practical difference: with CRUD, a buggy UPDATE can silently corrupt your data. With Event Sourcing, every state is derived from an auditable chain. You can always trace back and fix a projection without losing the original history. That's a game-changer for regulated industries.

But here's the catch: the ledger grows forever. If you don't plan for growth, replay times blow up. Snapshots are your escape hatch, but they add complexity. Don't let the simplicity of the analogy fool you — the operational cost is real.

Most databases are built around a lie of convenience: they store only the present. Your users table holds today's email address, not the five addresses the user had before it. Your orders table shows 'CANCELLED' but not who cancelled it, when, or why. This feels fine until the day your CEO asks "why did revenue drop on the 14th?" and your answer is silence — the data that would have told you is gone, overwritten by the next UPDATE statement. Event Sourcing is the architectural answer to that silence.

The problem Event Sourcing solves is deceptively simple: traditional CRUD systems treat every write as a destructive operation. State changes obliterate the history that caused them. This creates three compounding pain points — no audit trail, no ability to reconstruct past state, and a tight coupling between the write model and read model that makes complex business domains nearly impossible to express cleanly. Event Sourcing decouples all three by making the event log the source of truth, and deriving all state from it.

By the end of this article you'll understand how to design an event store from first principles, why snapshots exist and when to reach for them, how CQRS and Event Sourcing compose together, and — critically — the production gotchas around schema evolution, idempotency, and eventual consistency that separate engineers who've shipped this from engineers who've only read about it. Code examples are in Java but the patterns are language-agnostic.

Here's the thing: if your team isn't ready to own an immutable log, you'll reintroduce CRUD patterns inside your event store within six months. I've seen it. It gets ugly. Event Sourcing demands operational discipline — backup integrity checks, schema governance, and projection monitoring. It's not a library you drop in; it's an architectural commitment. Expect at least two sprints of learning curve before your team stops treating events as glorified INSERTs.

One more thing: don't start with Event Sourcing on day one of a greenfield project. Start with CRUD and a simple audit log. Migrate to full ES when the pain of lost history outweighs the operational overhead. I've watched teams adopt ES prematurely and spend more time managing the infrastructure than building features.

Event Sourcing: Recording State as a Sequence of Facts

Event sourcing is a pattern where every change to application state is captured as an immutable event in an append-only log. Instead of storing the current state of an entity, you store the sequence of events that led to it. The current state becomes a derived artifact — you replay events to reconstruct it. This shifts the database from a snapshot model to a journal model.

Practically, this means you never UPDATE or DELETE rows representing state. You only INSERT events. To get an entity's current state, you read all its events and fold them into a projection. This gives you a complete audit trail by design, temporal queries for free, and the ability to rebuild any past state. The trade-off is that reads become O(n) in the number of events unless you maintain snapshots (periodic state checkpoints) or materialized read models.
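The fold itself is small enough to show in full. Here is a minimal sketch, assuming a toy bank account with two event types; `AccountEvent`, `Deposited`, `Withdrawn`, and `AccountState` are illustrative names, not part of any library.

```java
import java.util.List;

// Hypothetical event types for a bank account; names are illustrative only.
interface AccountEvent {}
record Deposited(long amountCents) implements AccountEvent {}
record Withdrawn(long amountCents) implements AccountEvent {}

// Immutable state: applying an event returns a new state instead of mutating.
record AccountState(long balanceCents, long version) {

    static AccountState empty() {
        return new AccountState(0, 0);
    }

    AccountState apply(AccountEvent event) {
        if (event instanceof Deposited d) {
            return new AccountState(balanceCents + d.amountCents(), version + 1);
        }
        if (event instanceof Withdrawn w) {
            return new AccountState(balanceCents - w.amountCents(), version + 1);
        }
        throw new IllegalArgumentException("Unknown event type: " + event);
    }

    // Left fold over the stream: cost is O(n) in the number of events.
    static AccountState replay(List<AccountEvent> events) {
        AccountState state = empty();
        for (AccountEvent e : events) {
            state = state.apply(e);
        }
        return state;
    }
}
```

Replaying a prefix of the list gives you the state at that point in time, which is all a temporal query really is.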

Use event sourcing when you need a perfect audit log, complex temporal reasoning, or when multiple systems must derive their own views from the same facts. It shines in financial ledgers, compliance-heavy domains, and systems where debugging requires replaying production state. It is overkill for simple CRUD apps where current-state persistence is sufficient.

🔥 Event vs. Command

An event is a fact that has already happened — it cannot be rejected. A command is an intent that may be validated and either accepted (producing events) or rejected.
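In code, the distinction usually shows up as a handler that takes a command and either returns events or throws. The sketch below assumes a simple withdrawal rule; `WithdrawCash`, `CashWithdrawn`, and `WithdrawalHandler` are hypothetical names chosen for illustration.

```java
import java.util.List;

// Hypothetical names throughout; this is a sketch, not a framework API.
record WithdrawCash(long amountCents) {}    // command: a request, may be rejected
record CashWithdrawn(long amountCents) {}   // event: a fact in the past tense

class InsufficientFundsException extends RuntimeException {
    InsufficientFundsException(String message) {
        super(message);
    }
}

class WithdrawalHandler {
    // Validates the command against the current state (itself derived from
    // the event stream) and returns the events to append.
    // A rejected command produces no events at all.
    static List<CashWithdrawn> handle(long currentBalanceCents, WithdrawCash command) {
        if (command.amountCents() <= 0) {
            throw new IllegalArgumentException("Amount must be positive");
        }
        if (command.amountCents() > currentBalanceCents) {
            throw new InsufficientFundsException(
                "Balance " + currentBalanceCents + " < requested " + command.amountCents());
        }
        return List.of(new CashWithdrawn(command.amountCents()));
    }
}
```

Note that nothing ever validates `CashWithdrawn` after the fact. Once it is in the log, every consumer has to accept it.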

📊 Production Insight

A payment ledger service used event sourcing but skipped snapshotting. After 18 months, a single account had 2.4 million events. Reconstructing its balance for a read request took 12 seconds, timing out the API gateway.

Symptom: read latency for high-volume entities degrades linearly with event count, eventually exceeding timeouts.

Rule: snapshot every N events (e.g., 1000) or when the replay time exceeds 100ms — and rebuild snapshots asynchronously.

🎯 Key Takeaway

Event sourcing decouples write (append-only log) from read (projection), enabling independent scaling.

You must snapshot regularly or reads become O(n) and will fail under load.

The event store is the source of truth — never delete or modify events, or you lose audit integrity.

thecodeforge.io

Event Sourcing

Designing the Event Store

The event store is the backbone of any event-sourced system. It must provide:

Append-only writes (no updates, no deletes).
Optimistic concurrency control via aggregate version numbers.
Efficient stream reading (by aggregate ID and optionally by event type).
Snapshot support to avoid full replays.

A common implementation uses a relational database table with columns: aggregate_id, aggregate_type, version, event_type, event_data (JSON/JSONB), created_at, and a metadata JSON field. The primary key is (aggregate_id, version).

For high throughput, you might use a dedicated event store like EventStoreDB or Axon Server, but for many systems a standard PostgreSQL table with proper indexing works well up to thousands of events per second. One thing engineers miss: your event store needs to support transactional outbox if you're publishing events to a message bus — otherwise you risk publishing events that never get stored, or storing events that never get published.

Another nuance: event ordering is critical for aggregates. If you use a distributed event store, ensure all events for the same aggregate land on the same partition. I've seen teams use random sharding and then wonder why projections produce inconsistent results.

Indexing strategy matters: a composite index on (aggregate_id, version) is essential for stream reads. For global projections, an index on (created_at) or (event_type, created_at) helps. Don't over-index — writes are append-only, but indexes still add overhead.

Here's something nobody tells you: your event store will grow faster than you expect. A system handling 100 events per second with 1KB payloads generates ~8.6GB per day. At that rate, you'll hit 3TB in a year. Plan your retention and archival strategy before you go to production, not after.
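If you want to plug in your own numbers, the arithmetic is a one-liner. This sketch assumes 1 KB means 1,000 bytes and ignores index and row overhead, so treat the result as a floor; `EventStoreGrowth` is just an illustrative name.

```java
class EventStoreGrowth {
    // Raw payload bytes written per day at a steady event rate.
    // Index, metadata, and row overhead are deliberately ignored.
    static long bytesPerDay(long eventsPerSecond, long bytesPerEvent) {
        return eventsPerSecond * bytesPerEvent * 86_400L;
    }

    public static void main(String[] args) {
        long perDay = bytesPerDay(100, 1_000); // 1 KB taken as 1,000 bytes
        double gbPerDay = perDay / 1e9;
        double tbPerYear = perDay * 365 / 1e12;
        System.out.printf("%.2f GB/day, %.2f TB/year%n", gbPerDay, tbPerYear);
    }
}
```

For 100 events/s at 1 KB each, that works out to 8.64 GB per day and roughly 3.15 TB per year of payload alone.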

Another practical detail: benchmark your event store's write throughput before going live. A simple test: generate 10,000 events with realistic payloads and measure p99 write latency. If it's above 50ms, your storage choice or indexing is wrong. Tune before you ship.

One more tip: use a separate schema or database for the event store to avoid accidental table drops or migrations affecting your read models. We keep our event store in a dedicated 'events' schema with restricted write permissions.

When choosing between a generic SQL store and a specialised event store, consider the team's familiarity and operational overhead. A specialised store gives you subscriptions, projections, and clustering built-in, but adds a new system to learn and maintain. The SQL approach is simpler but requires more boilerplate for features like stream subscriptions and snapshot management.

Partitioning the event store by aggregate ID is critical for scaling writes. Use hash-based partitioning to spread load while keeping per-aggregate ordering. Don't use range partitioning: contiguous key ranges can concentrate active aggregates in one partition and create hot spots.
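A minimal sketch of hash-based routing, assuming UUID aggregate IDs and a fixed partition count. `StreamPartitioner` is a hypothetical helper, not an API of any particular store.

```java
import java.util.UUID;

class StreamPartitioner {
    // Maps an aggregate ID to a partition in [0, partitionCount).
    // The same ID always lands on the same partition, which is what
    // preserves per-aggregate ordering.
    static int partitionFor(UUID aggregateId, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(aggregateId.hashCode(), partitionCount);
    }
}
```

One caveat with this simple modulo scheme: changing `partitionCount` remaps most aggregates, so pick the count with headroom or plan a migration.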

Also: consider using a separate event store instance for high-frequency aggregates to isolate performance. We once had a single event store handling 50K events/s for order events and a few hundred for user events — contention on the primary key index caused latency spikes for both.

Another production pattern: use a write-ahead log (WAL) before the event store to absorb bursts. Write to a fast local log, then flush to the durable event store asynchronously. This smooths out latency spikes at the cost of a small window of data loss risk.

event_store_schema.sql (SQL)

CREATE TABLE io_thecodeforge_event_store (
    aggregate_id UUID NOT NULL,
    aggregate_type VARCHAR(100) NOT NULL,
    version BIGINT NOT NULL,
    event_type VARCHAR(200) NOT NULL,
    event_data JSONB NOT NULL,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (aggregate_id, version)
);

CREATE INDEX idx_event_store_type
    ON io_thecodeforge_event_store (aggregate_type, created_at DESC);

CREATE INDEX idx_event_store_time
    ON io_thecodeforge_event_store (created_at DESC);

Mental Model

Mental Model: Event Store as Write-Ahead Log

Think of the event store as a distributed write-ahead log that you never truncate.

Every write is an append to the log; there is no in-place update.
Readers (projections) consume the log from their last known position.
Snapshotting is like taking a checkpoint: you can start from the snapshot instead of the beginning.
The log is immutable, so you never need locks for writes to different aggregates.
Concurrency conflicts are detected at write time via version check, not at read time.
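The version check in that last point can be sketched with an in-memory log. This is an illustration under simplifying assumptions (events are plain strings, one JVM, no durability); in a SQL store the same guarantee comes from the (aggregate_id, version) primary key rejecting the conflicting INSERT. `InMemoryEventLog` and `ConcurrencyConflictException` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

class ConcurrencyConflictException extends RuntimeException {
    ConcurrencyConflictException(String message) {
        super(message);
    }
}

// In-memory stand-in for the event table, for illustration only.
class InMemoryEventLog {
    private final Map<UUID, List<String>> streams = new HashMap<>();

    // Appends only if the stream is still at expectedVersion.
    // Returns the new stream version.
    synchronized long append(UUID aggregateId, long expectedVersion, List<String> newEvents) {
        List<String> stream = streams.computeIfAbsent(aggregateId, id -> new ArrayList<>());
        if (stream.size() != expectedVersion) {
            throw new ConcurrencyConflictException(
                "Expected version " + expectedVersion + " but stream is at " + stream.size());
        }
        stream.addAll(newEvents);
        return stream.size();
    }

    // Readers resume from their last known position (0 means from the start).
    synchronized List<String> readFrom(UUID aggregateId, long fromVersion) {
        List<String> stream = streams.getOrDefault(aggregateId, List.of());
        int start = (int) Math.min(fromVersion, stream.size());
        return new ArrayList<>(stream.subList(start, stream.size()));
    }
}
```

A writer that loses the race gets the exception, reloads the stream, re-validates its command, and retries with the new version.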

📊 Production Insight

Event store writes must be idempotent at the protocol level — enforce unique (aggregate_id, version) constraints.

If events also flow through a message queue, expect duplicate delivery; handlers that are not idempotent will apply the same event twice and corrupt state.

Use transactional outbox to avoid dual-write inconsistencies.

Also: consider using append-only tables with no UPDATE privileges for the application user.

Index maintenance can cause write stalls — schedule VACUUM and reindex during low traffic windows.
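To make the duplicate-delivery point concrete, here is a sketch of an idempotent consumer that deduplicates on an event ID. It assumes each event carries a unique ID and keeps the seen set in memory; `DeliveredEvent` and `IdempotentBalanceProjection` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical event envelope: eventId is the deduplication key.
record DeliveredEvent(UUID eventId, long amountCents) {}

class IdempotentBalanceProjection {
    private final Set<UUID> processed = new HashSet<>();
    private long balanceCents;

    // Returns true if the event was applied, false if it was a duplicate.
    // In a real projection, the processed-ID record and the state update
    // would be committed in the same transaction.
    synchronized boolean handle(DeliveredEvent event) {
        if (!processed.add(event.eventId())) {
            return false; // already seen: skip so redelivery cannot double-count
        }
        balanceCents += event.amountCents();
        return true;
    }

    synchronized long balanceCents() {
        return balanceCents;
    }
}
```

Delivering the same event twice leaves the balance unchanged after the first application, which is the property a retrying message bus needs from you.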

🎯 Key Takeaway

Event store design is all about append speed and concurrency.

Plan for growth: 1KB events at 100/s = 8.6GB/day.

If you ignore backup planning now, you'll pay for it at 3am.

Choosing between SQL-based event store vs dedicated ES DB

If: Team is experienced with SQL and has moderate throughput (<1k events/s)
→ Use: PostgreSQL with JSONB. Simpler ops, good enough performance.

If: High throughput (>10k events/s) or need built-in subscriptions
→ Use: EventStoreDB, Axon Server, or Kafka with event sourcing pattern.

If: You need built-in projections (category streams, etc.)
→ Use: EventStoreDB, which provides a built-in projection engine; less custom code.

If: Existing infrastructure is heavy in the cloud (AWS, Azure)
→ Use: Consider DynamoDB or Cosmos DB with TTL for metadata-based snapshots.

If: Strong consistency requirements across partitions
→ Use: Avoid distributed stores; use a single-node database with synchronous replication.

CQRS and Event Sourcing: A Symbiotic Pair

Event Sourcing naturally pairs with CQRS (Command Query Responsibility Segregation) because the two sides map cleanly onto it:

Write side: commands produce events that are appended to the event store.
Read side: events are consumed by one or more projections that build specialised read models.

This separation solves a core tension: you don't want complex read queries against your event store (which is optimised for write). Instead, you maintain dedicated read tables that are optimised for your query patterns. The trade-off is eventual consistency — the read model lags behind the write model by the time it takes to project the events.

CQRS is optional with Event Sourcing, but the combination unlocks powerful patterns like multiple read views of the same data, read model rebuilds from scratch (reprojection), and easy integration with different storage technologies for reads vs writes. One practical pattern you'll see in production: use a separate Elasticsearch index for full-text search, rebuilt from events, while keeping the main read model in PostgreSQL.

Don't blindly add CQRS. If your only read is "get the aggregate state by ID", you don't need it. The complexity of maintaining multiple projections isn't free.

I once worked on a system where the team built three separate projections for the same entity because each query needed a different shape. That's fine until a schema change requires updating all three. We ended up with a single canonical projection that served most queries, and only kept the extra ones for performance-critical paths.

The real pain point with CQRS isn't the initial setup — it's the ongoing cost. Every schema change to events means updating every projection that consumes that event. If you have 12 projections consuming OrderPlaced, you touch 12 files. That's fine until someone forgets one in a code review and the discrepancy silently corrupts that read model.

Here's another caution: when you use CQRS, never let a read model influence a write decision. If you check a read model to decide whether to allow a command, you've introduced a race condition. The write side must always validate against the event store, not a projection. I've debugged production bugs where a stale projection caused double-spending. The fix was to move the check to the command handler where it queried the event store directly.

Practical tip: start with one projection. Only add more when you measure a concrete performance need. Premature projection proliferation is a common source of technical debt.

Also consider: if your projections are slow because they do complex joins, you can pre-join data at projection time. For example, instead of joining Order and Customer tables at query time, denormalise customer info into the order read model when the order event is processed. This makes reads fast at the cost of storage and more projection logic.

Monitor per-projection lag separately. A global lag metric can hide one projection that's stuck. Use a dedicated table recording the last event ID processed by each projection. Alert if any projection hasn't advanced in more than 5 minutes.
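A sketch of that checkpoint table as an in-memory class, so the alert rule is explicit. It adds one assumption not stated above: a projection only counts as stalled if it is also behind the head of the log, which avoids false alarms during quiet periods. `ProjectionCheckpoints` is a hypothetical name.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ProjectionCheckpoints {
    private record Checkpoint(long lastEventId, Instant advancedAt) {}

    private final Map<String, Checkpoint> checkpoints = new HashMap<>();

    // Called by a projection after it has processed an event.
    synchronized void advance(String projection, long lastEventId, Instant now) {
        checkpoints.put(projection, new Checkpoint(lastEventId, now));
    }

    // Projections that are behind the head of the log AND have not
    // advanced within maxIdle. The "behind" condition is an assumption
    // added here so idle systems do not page anyone.
    synchronized List<String> stalled(long headEventId, Instant now, Duration maxIdle) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Checkpoint> entry : checkpoints.entrySet()) {
            Checkpoint c = entry.getValue();
            boolean behind = c.lastEventId() < headEventId;
            boolean idle = Duration.between(c.advancedAt(), now).compareTo(maxIdle) > 0;
            if (behind && idle) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Passing `now` in as a parameter keeps the check deterministic in tests; in production you would call it on a timer with `Instant.now()` and `Duration.ofMinutes(5)`.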

Another pattern: use materialized views for projections that need to aggregate across streams. They can be refreshed periodically or on-demand, but be careful with refresh performance on large datasets.

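io/thecodeforge/eventsourcing/cqrs/AccountProjection.java (Java)

package io.thecodeforge.eventsourcing.cqrs;

import io.thecodeforge.eventsourcing.*;
import java.util.List;
import java.util.UUID;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component
public class AccountProjection {

    private final JdbcTemplate jdbc;
    private final EventStore eventStore;
    private final SnapshotStore snapshotStore;

    public AccountProjection(JdbcTemplate jdbc, EventStore eventStore, SnapshotStore snapshotStore) {
        this.jdbc = jdbc;
        this.eventStore = eventStore;
        this.snapshotStore = snapshotStore;
    }

    public AccountView project(UUID aggregateId) {
        // Try snapshot first; handle() keeps it current
        AccountView cached = snapshotStore.get(aggregateId);
        if (cached != null) {
            return cached;
        }
        return rebuild(aggregateId);
    }

    public void handle(Event event) {
        // Called by event bus when new event arrives.
        // Rebuild instead of calling project(): a cached view predates this event.
        AccountView view = rebuild(event.aggregateId());
        // Update read model table (omitted for brevity)
    }

    private AccountView rebuild(UUID aggregateId) {
        // Replay from start and refresh the cached view
        List<Event> events = eventStore.readStream(aggregateId);
        Account account = Account.loadFromHistory(events);
        AccountView view = new AccountView(account);
        snapshotStore.put(aggregateId, view);
        return view;
    }
}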

⚠ Consistency Warning

When using CQRS, never allow read-modify-write cycles that depend on the read model being up-to-date. The read model is always behind. Use command validation on the write side, not from the projection.

📊 Production Insight

Eventual consistency means stale reads are expected behavior, not a bug.

Users and downstream systems must tolerate a few hundred ms lag.

Route critical reads through the event store directly — don't rely on projections.

Also: monitor projection lag per projection, not just globally. One slow projection can hide behind others.

Don't assume all projections have the same freshness requirements — some can tolerate 5-minute lag, others need sub-second.

🎯 Key Takeaway

CQRS is optional but powerful with ES. Read models trade consistency for performance.

Always treat the event store as the only source of truth.

If you feel tempted to check a projection before writing, stop. You'll regret it.

When to add CQRS to your Event Sourcing

If: Read queries are complex and differ from write model shape
→ Use: CQRS adds value; optimise the read model per query need.

If: You have one simple query that matches the aggregate structure
→ Use: CQRS overhead may not be justified; use snapshot-based reads.

If: Read volume is 10x write volume and requires different tech
→ Use: CQRS allows using Elasticsearch or Redis for reads while storing events in DB.

If: Multiple teams own different read models for the same events
→ Use: CQRS enables independent evolution of each read model.

Snapshots and Performance Optimisation

Without snapshots, every time you need the current state of an aggregate you replay every event since the beginning of time. For aggregates with 100k+ events, that's a hundred thousand rows fetched, deserialized, and applied per request, which is catastrophic for latency.

Snapshots store the state of an aggregate at a specific version. When loading, you fetch the most recent snapshot (or create one if none exists) and then replay only the events after that snapshot version. This collapses the replay cost from O(total events) to O(post-snapshot events).

Snapshot frequency is a trade-off: too frequent and you waste write overhead; too rare and replay is still expensive. A good starting point is every 100–1000 events, or every hour for low-volume aggregates. You can also take snapshots on-demand after a state-changing command. In production, you'll often combine both: snapshot every N events plus a periodic time-based snapshot for aggregates that rarely change.

A common failure: snapshot corruption. Always store a checksum (a hash of the serialized state) alongside the snapshot and verify it on every load. If it doesn't match, fall back to full replay and rebuild the snapshot.
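One way to wire that check in, sketched with SHA-256 from the JDK. The `StoredSnapshot` record and `SnapshotChecksum` helper are hypothetical; the article's own `Snapshot` class may be shaped differently.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Optional;

// Hypothetical stored form of a snapshot: serialized state plus its SHA-256.
record StoredSnapshot(long version, byte[] payload, String sha256Hex) {

    static StoredSnapshot of(long version, byte[] payload) {
        return new StoredSnapshot(version, payload, SnapshotChecksum.sha256Hex(payload));
    }
}

class SnapshotChecksum {
    static String sha256Hex(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // An empty result means: discard the snapshot, replay the full
    // stream, and write a fresh snapshot.
    static Optional<byte[]> verifiedPayload(StoredSnapshot snapshot) {
        String actual = sha256Hex(snapshot.payload());
        return actual.equals(snapshot.sha256Hex())
            ? Optional.of(snapshot.payload())
            : Optional.empty();
    }
}
```

The hash covers the serialized bytes, so it catches truncation and bit rot but not a snapshot that was serialized from already-wrong state. Only a replay comparison catches that.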

Memory cache for snapshots? Use Redis with a TTL — but invalidate it when a new event for that aggregate is stored. Otherwise, you'll serve stale data from the in-memory cache while the snapshot store already has a newer version.

Here's the trap most teams hit: they don't back up snapshots because 'snapshots can be rebuilt from events'. True, but rebuilding 10M events for a single aggregate takes 45 minutes. If you lose all snapshots after a crash, your recovery time goes from minutes to hours. Back up snapshots to speed recovery — just don't treat them as the truth.

One more pitfall: snapshotting too often. If you snapshot after every event on a frequently changing aggregate, you turn your event store into a state store and pay a snapshot write for every event, losing the performance benefit. Snapshot every N events or when the aggregate reaches a certain version threshold.

Performance note: use a dedicated snapshot store (e.g., Redis, DynamoDB) that's fast for point reads. The event store can be slower but more durable. Keep snapshot serialization efficient — use a binary format if latency matters.

A practical tip: implement snapshot versioning. Store the snapshot version alongside the aggregate version. When you load, check that the snapshot version is not older than the last known event version. If it is, replay from snapshot version + 1. This gives you a safety net against snapshot drift.

Consider multi-level snapshots for aggregates with millions of events. Store a daily checkpoint snapshot plus per-1000-event snapshots. When loading, use the closest snapshot before the target version. This reduces worst-case replay to 1000 events.
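Picking "the closest snapshot before the target version" is a floor lookup on a sorted index. A minimal sketch, assuming snapshots are referenced by an opaque key; `SnapshotIndex` is a hypothetical name.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

class SnapshotIndex {
    // aggregate version -> snapshot key (an opaque reference, hypothetical)
    private final TreeMap<Long, String> byVersion = new TreeMap<>();

    void record(long version, String snapshotKey) {
        byVersion.put(version, snapshotKey);
    }

    // Closest snapshot at or before targetVersion, if any exists.
    Optional<Map.Entry<Long, String>> closestAtOrBefore(long targetVersion) {
        return Optional.ofNullable(byVersion.floorEntry(targetVersion));
    }

    // Number of events to replay on top of the chosen snapshot.
    // With no usable snapshot, that is the whole stream up to the target.
    long eventsToReplay(long targetVersion) {
        return closestAtOrBefore(targetVersion)
            .map(entry -> targetVersion - entry.getKey())
            .orElse(targetVersion);
    }
}
```

With snapshots at versions 1000, 2000, and 3000, loading version 2750 starts from the 2000 snapshot and replays 750 events.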

Also: use a separate thread pool for snapshot materialization to avoid blocking the event writing path. Snapshots can be taken asynchronously after a certain threshold is reached.

Another nuance: if you use Redis for snapshot caching, set an eviction policy that favors high-traffic aggregates. LFU (Least Frequently Used) works better than LRU for workloads where a subset of aggregates sees most of the reads.

io/thecodeforge/eventsourcing/SnapshotService.java (Java)

package io.thecodeforge.eventsourcing;

import java.util.List;
import java.util.UUID;

public class SnapshotService {

    // Take a new snapshot once this many events had to be replayed on load
    private static final int SNAPSHOT_THRESHOLD = 100;

    private final EventStore eventStore;
    private final SnapshotStore snapshotStore;

    public SnapshotService(EventStore eventStore, SnapshotStore snapshotStore) {
        this.eventStore = eventStore;
        this.snapshotStore = snapshotStore;
    }

    public <T extends Aggregate> T load(Class<T> type, UUID aggregateId) {
        // Start from the latest snapshot if one exists, otherwise from version 0
        Snapshot snapshot = snapshotStore.get(aggregateId);
        long fromVersion = (snapshot != null) ? snapshot.version() : 0;
        // Read only the events recorded after the snapshot
        List<Event> events = eventStore.readStream(aggregateId, fromVersion + 1);
        T aggregate;
        if (snapshot != null) {
            aggregate = snapshot.deserialize(type);
        } else {
            aggregate = createEmpty(type);
        }
        for (Event e : events) {
            aggregate.apply(e);
        }
        // Take a snapshot if the replay count is high, including when an
        // existing snapshot has fallen far behind the stream
        if (events.size() > SNAPSHOT_THRESHOLD) {
            snapshotStore.put(aggregateId, new Snapshot(aggregate, aggregate.version()));
        }
        return aggregate;
    }

    private <T> T createEmpty(Class<T> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException("Aggregate must have default constructor", e);
        }
    }
}

🔥 Snapshot Storage Tip

Store snapshots separately from events — use a fast KV store (Redis, DynamoDB) for snapshots. The event store can be a slower but more durable SQL database. But if you use Redis, ensure snapshot durability with AOF persistence; losing the snapshot cache is fine, losing the event store is catastrophic.

📊 Production Insight

If snapshot load still takes >100ms, you need multi-level snapshots.

Store snapshots every N events, but also a daily checkpoint.

Always hash the snapshot and verify on load to catch corruption.

Invalidate in-memory snapshot cache on event append to avoid serving stale data.

Worth calling out: snapshot serialization format matters. JSON is easy but slow; a binary format such as Protobuf or Kryo can cut deserialization time severalfold, so benchmark with your own payloads.

Snapshots are not a panacea: if your aggregates have millions of events, even replaying 1000 events can be slow if each event triggers complex logic.

🎯 Key Takeaway

Snapshots are essential for production-scale ES.

Replay cost = post-snapshot events only.

If your snapshot load time is over 200ms, your users will feel it. Fix it.

Snapshot Strategy Decision

If: Average aggregate size < 1000 events
→ Use: No snapshot needed; full replay is cheap enough.

If: Average aggregate size 1k–100k events, frequent reads
→ Use: Snapshots every 500 events. Cache snapshots in memory with TTL.

If: Average aggregate size > 100k events, read-heavy
→ Use: Multi-level snapshots: hourly + per 1000 events. Use Redis for snapshot cache.

If: Extreme case: millions of events, high read concurrency
→ Use: Consider materialized views that are updated asynchronously from event streams, bypassing aggregate load entirely.

Schema Evolution: When Events Outlive Your Code

Events are immutable — once written, they must remain readable forever. This means your event schemas will outlive the code that created them. A common trap: you add a new field to an event and old events break the deserialization.

The solution: never rename, remove, or reorder fields. Only add optional fields. Use a serialization format that supports schema evolution (JSON, Avro, or Protobuf with forward/backward compatibility modes). Each event should carry a schema version (e.g., in metadata). Your event handler must be able to process multiple schema versions.

A robust pattern is to store events as JSONB and use a version-specific deserialization layer that fills defaults for missing fields. For example, v1 events may not have a 'status' field; v2 expects it. The v2 handler supplies a default when deserializing v1 events. You also need to handle the case where you deprecate a field: you cannot remove it from events, but your handlers can ignore it. If you absolutely must change event structure, plan an offline migration with a maintenance window.

Here's the pain point nobody tells you: when you have multiple services consuming the same events, schema evolution becomes a coordination problem. You can't just update one service — you need to ensure all consumers can handle the new schema before you start emitting it. That's where a schema registry (like Confluent's) shines. It lets you enforce compatibility rules and prevent breaking changes from reaching production.

Another trick: use a generic wrapper that includes a type discriminator and version. Then write version-specific deserializers that can be registered dynamically. This avoids massive if-else chains in your handler code.
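Here is a sketch of that wrapper-plus-registry idea, using a plain string map as the payload so it stays dependency-free. `EventEnvelope`, `OrderPlacedEvent`, and `DeserializerRegistry` are hypothetical names, and a real version would likely take a `JsonNode` instead of a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical envelope: type discriminator, schema version, raw payload.
record EventEnvelope(String type, int schemaVersion, Map<String, String> payload) {}

record OrderPlacedEvent(String orderId, String status) {}

class DeserializerRegistry {
    private final Map<String, Function<Map<String, String>, Object>> deserializers = new HashMap<>();

    private static String key(String type, int schemaVersion) {
        return type + "#v" + schemaVersion;
    }

    // One deserializer per (type, version) pair replaces the if-else chain.
    void register(String type, int schemaVersion, Function<Map<String, String>, Object> deserializer) {
        deserializers.put(key(type, schemaVersion), deserializer);
    }

    Object deserialize(EventEnvelope envelope) {
        String k = key(envelope.type(), envelope.schemaVersion());
        Function<Map<String, String>, Object> deserializer = deserializers.get(k);
        if (deserializer == null) {
            throw new IllegalArgumentException("No deserializer registered for " + k);
        }
        return deserializer.apply(envelope.payload());
    }

    public static void main(String[] args) {
        DeserializerRegistry registry = new DeserializerRegistry();
        // v1 has no status field, so its deserializer supplies the default.
        registry.register("OrderPlaced", 1, p -> new OrderPlacedEvent(p.get("orderId"), "PENDING"));
        registry.register("OrderPlaced", 2, p -> new OrderPlacedEvent(p.get("orderId"), p.get("status")));
        System.out.println(registry.deserialize(
            new EventEnvelope("OrderPlaced", 1, Map.of("orderId", "o-1"))));
    }
}
```

Failing loudly on an unregistered (type, version) pair is a deliberate choice here: a silent skip during replay would corrupt the rebuilt state without anyone noticing.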

I've seen teams try to solve schema evolution with a shared JAR that contains all event classes. That works until two services need different versions of the same event. Then you're stuck in dependency hell. A schema registry prevents that by decoupling the wire format from the class definition.

Don't forget to test schema evolution in your CI pipeline. Write a test that replays a set of old events (exported from production) against the latest handler code. If any event fails to deserialize, the pipeline fails. This catches breaking changes before they hit production.

One more thing: when you deprecate an event type, don't delete the event class immediately. Keep it in a 'legacy' package and let your deserialization layer know to skip it or transform it. Deleting the class too early will break replay from the beginning of time.

Consider using Avro with a schema registry from the start, even if you only have one service. It forces discipline and makes future multi-service integration much easier. The upfront cost is negligible compared to the pain of retrofitting.

Schema evolution also affects downstream BI systems. If you publish events to a data lake via Kafka, the schema must stay compatible for every consumer, including the ones you don't own. Roll out schema changes as their own deployment, separate from feature releases, and watch consumer lag while they go out.

Also: think about backward vs forward compatibility. Backward compatibility (new reader can read old data) is easier and usually sufficient. Forward compatibility (old reader can read new data) is harder but necessary when you can't upgrade all consumers simultaneously.

One real-world example: we had an event with a 'price' field that was a decimal. We needed to add a 'currency' field. We made it optional with a default of 'USD'. Old events continued to work, and new events included the currency. The projection that needed currency just used the default for legacy events.
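Both directions of compatibility can be sketched with a tolerant reader. This assumes a hypothetical `PriceChanged` event modelled as a dataclass: the default on `currency` gives backward compatibility, and dropping unknown keys gives forward compatibility.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PriceChanged:
    sku: str
    price: str               # decimal kept as a string to avoid float rounding
    currency: str = "USD"    # added later; the default keeps old events readable


def read_event(cls, payload: dict):
    """Build cls from payload, dropping fields this reader does not know about."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in known})


# Backward compatible: a legacy event without 'currency' still loads.
legacy = read_event(PriceChanged, {"sku": "sku-445", "price": "49.99"})

# Forward compatible: a newer event with an extra 'discount' field still loads.
newer = read_event(
    PriceChanged,
    {"sku": "sku-445", "price": "49.99", "currency": "EUR", "discount": "0.10"},
)
```

The trade-off is that silently ignoring unknown fields can hide a producer bug, so it may be worth logging the dropped keys.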

io/thecodeforge/eventsourcing/schema/EventDeserializer.java · JAVA

package io.thecodeforge.eventsourcing.schema;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.math.BigDecimal;

public class EventDeserializer {

    private static final ObjectMapper mapper = new ObjectMapper();

    // Handles v1->v2 migration for OrderPlaced event
    public OrderPlaced deserializeOrderPlaced(JsonNode node, int schemaVersion) {
        String orderId = node.get("orderId").asText();
        BigDecimal total = new BigDecimal(node.get("total").asText());
        // v1 events predate the 'status' field, so fall back to a default
        String status;
        if (schemaVersion >= 2) {
            status = node.get("status").asText();
        } else {
            status = "PENDING";
        }
        return new OrderPlaced(orderId, total, status);
    }
}

Idempotency and Consistency Guarantees

Event Sourcing systems are event-driven, and events can be delivered multiple times (e.g., after a broker restart, network retry, or projection crash). Without idempotency handling, duplicate events will produce duplicate state changes — corrupting your read models.

Write-side idempotency: each command should be associated with a unique idempotency key (UUID). Before appending an event, check if an event with that key already exists. This prevents double-spending, duplicate orders, etc.

Read-side idempotency: projections must be idempotent. For upserts, use INSERT ... ON CONFLICT DO UPDATE (UPSERT). For set-based updates, make the operation deterministic: set the account balance to the computed value rather than incrementing it.
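A small sketch of such a projection using SQLite's upsert syntax. It assumes each event carries the resulting balance and the aggregate version; the table and column names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE account_balance ("
    " account_id TEXT PRIMARY KEY,"
    " balance_cents INTEGER NOT NULL,"
    " last_version INTEGER NOT NULL)"
)


def apply_balance_event(event: dict) -> None:
    """Project an event that carries the absolute balance, not a delta."""
    with db:  # commit on success, roll back on error
        db.execute(
            """
            INSERT INTO account_balance (account_id, balance_cents, last_version)
            VALUES (:account_id, :balance_cents, :version)
            ON CONFLICT(account_id) DO UPDATE SET
                balance_cents = excluded.balance_cents,
                last_version  = excluded.last_version
            WHERE excluded.last_version >= account_balance.last_version
            """,
            event,
        )


# Deliver v1, v2, then v2 and v1 again: the row ends at the v2 balance.
for evt in (
    {"account_id": "acct-1", "balance_cents": 100, "version": 1},
    {"account_id": "acct-1", "balance_cents": 150, "version": 2},
    {"account_id": "acct-1", "balance_cents": 150, "version": 2},
    {"account_id": "acct-1", "balance_cents": 100, "version": 1},
):
    apply_balance_event(evt)
```

The WHERE clause on the update is an extra guard added in this sketch: it keeps a late redelivery of an older event from overwriting newer state.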

Consistency: Event Sourcing gives you strong consistency within an aggregate (writes are atomic per aggregate via version check). Across aggregates, you have eventual consistency unless you use distributed transactions (which you should avoid). For critical cross-aggregate consistency, consider using sagas or process managers.
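The per-aggregate version check can be sketched in a few lines. This in-memory store is hypothetical; a SQL event store typically gets the same effect from a primary key on (aggregate_id, version).

```python
from collections import defaultdict


class ConcurrencyError(Exception):
    """Raised when the stream moved on since the caller loaded it."""


class InMemoryEventStore:
    def __init__(self):
        self._streams = defaultdict(list)

    def load(self, aggregate_id: str) -> list:
        return list(self._streams[aggregate_id])

    def append(self, aggregate_id: str, event: dict, expected_version: int) -> int:
        """Append only if the stream is still at expected_version."""
        stream = self._streams[aggregate_id]
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"{aggregate_id}: expected version {expected_version}, "
                f"stream is at {len(stream)}"
            )
        stream.append(event)
        return len(stream)  # new version of the aggregate


store = InMemoryEventStore()
store.append("acct-1", {"type": "deposited", "amount": 50}, expected_version=0)

# Two writers both loaded version 1; only the first append can win.
store.append("acct-1", {"type": "withdrawn", "amount": 20}, expected_version=1)
try:
    store.append("acct-1", {"type": "withdrawn", "amount": 40}, expected_version=1)
    second_writer_rejected = False
except ConcurrencyError:
    second_writer_rejected = True  # caller reloads the aggregate and retries
```

The losing writer reloads the aggregate, re-runs its business rules against the new state, and tries again. That is what keeps the balance from going negative here.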

Here's a real scenario: we had a projection that used an incrementing counter instead of setting the absolute value. On replay, it doubled every event — took us two hours to find the bug because the numbers looked "close enough".

Idempotency key storage: keep them in a time-bounded store (e.g., Redis with 24h TTL). After that, the risk of duplicate delivery is negligible. Make the idempotency check atomic with the event append — ideally using a database unique constraint, not a check-then-act.
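Here is roughly what "unique constraint, not check-then-act" can look like, sketched with SQLite. The table layout is an assumption, and the version column is left out to keep the focus on the key.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE events (
        seq             INTEGER PRIMARY KEY,   -- auto-assigned append order
        aggregate_id    TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        payload         TEXT NOT NULL,
        UNIQUE (aggregate_id, idempotency_key)
    )
    """
)


def append_once(aggregate_id: str, idempotency_key: str, payload: str) -> bool:
    """Return True if the event was stored, False if this key was already used."""
    with db:
        cur = db.execute(
            "INSERT INTO events (aggregate_id, idempotency_key, payload) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT (aggregate_id, idempotency_key) DO NOTHING",
            (aggregate_id, idempotency_key, payload),
        )
    # The database decides atomically; there is no window between check and write.
    return cur.rowcount == 1


first = append_once("acct-1", "key-123", '{"type": "deposited", "amount": 50}')
retry = append_once("acct-1", "key-123", '{"type": "deposited", "amount": 50}')
```

Because the check and the write are one statement, a retry after a lost response gets False back and stores nothing.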

There's a subtlety most docs miss: idempotency keys need to survive the event store write failure scenario. If your app generates a key, checks it doesn't exist, and then the write fails — the key is still unused. A retry will use the same key and succeed. But if the write actually succeeded and only the response was lost, the idempotency check prevents the duplicate. That's the whole point, but it only works if your idempotency store is durable and checked atomically with the write.

Another nuance: idempotency keys for event handlers that process batches. If you replay a batch of events, each event should have its own idempotency check, not a batch-level key. Otherwise, a partial failure in the batch causes the whole batch to be skipped on retry, leading to lost events.
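A sketch of per-event checks inside a batch, assuming a hypothetical `processed_events` ledger table that is written in the same transaction as the projection update.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute(
    "CREATE TABLE order_totals (order_id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)"
)


def process_batch(events: list) -> list:
    """Apply each event at most once; return the IDs of events that failed."""
    failed = []
    for event in events:
        try:
            with db:  # one transaction per event: ledger row + projection write
                cur = db.execute(
                    "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
                    (event["event_id"],),
                )
                if cur.rowcount == 0:
                    continue  # already applied in an earlier attempt
                if not isinstance(event["amount_cents"], int):
                    raise ValueError("amount_cents must be an integer")
                db.execute(
                    "INSERT INTO order_totals (order_id, total_cents) VALUES (?, ?) "
                    "ON CONFLICT(order_id) DO UPDATE SET "
                    "total_cents = total_cents + excluded.total_cents",
                    (event["order_id"], event["amount_cents"]),
                )
        except Exception:
            # Rollback removed the ledger row too, so a retry can still apply it.
            failed.append(event["event_id"])
    return failed


batch = [
    {"event_id": "e1", "order_id": "o1", "amount_cents": 100},
    {"event_id": "e2", "order_id": "o1", "amount_cents": None},  # poison event
    {"event_id": "e3", "order_id": "o1", "amount_cents": 50},
]
first_failed = process_batch(batch)
second_failed = process_batch(batch)  # retry the whole batch
```

Retrying the whole batch is safe here: e1 and e3 are skipped by the ledger, and only e2 is attempted again. With the ledger in place, even an increment-style update does not double-count.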

Practical advice: make your projection handlers idempotent by design. Write them as if every event could be processed twice. That means assigning absolute values (SET balance = <computed value>) instead of incrementing, using INSERT ... ON CONFLICT, and logging any duplicate attempts for monitoring.

Also, consider using event sourcing frameworks that provide idempotent event handling out of the box, but understand the mechanism so you can debug when it fails. The framework can't fix incorrect business logic that assumes events are never replayed.

Additional: for financial systems, implement idempotency with a separate ledger table that records every processed event ID. This is more durable than a TTL-based cache.

io/thecodeforge/eventsourcing/idempotency/CommandHandler.java · JAVA

package io.thecodeforge.eventsourcing.idempotency;

import io.thecodeforge.eventsourcing.EventStore;
import io.thecodeforge.eventsourcing.IdempotencyStore;
import io.thecodeforge.eventsourcing.events.DepositEvent;

// DepositCommand is assumed to live in this package.
public class CommandHandler {

    private final EventStore eventStore;
    private final IdempotencyStore idempotencyStore;

    public CommandHandler(EventStore eventStore, IdempotencyStore idempotencyStore) {
        this.eventStore = eventStore;
        this.idempotencyStore = idempotencyStore;
    }

    public void handle(DepositCommand cmd) {
        // Fast path only: exists() followed by append() is check-then-act and can race.
        // The real guard should be a unique constraint on the idempotency key in the event store.
        if (idempotencyStore.exists(cmd.idempotencyKey())) {
            return; // Already processed
        }
        DepositEvent event = new DepositEvent(
            cmd.aggregateId(),
            cmd.amount(),
            cmd.idempotencyKey()
        );
        eventStore.append(cmd.aggregateId(), event, cmd.expectedVersion());
        // If the process dies before mark(), a retry passes the check above,
        // so the append itself must reject the duplicate key.
        idempotencyStore.mark(cmd.idempotencyKey());
    }
}

⚠ Idempotency Key Storage

Keep idempotency keys in a time-bounded store with a TTL of at least 24 hours. Use the same durable store as your event store if possible (e.g., a separate table with a unique constraint). Otherwise you risk losing idempotency state on cache failure.

📊 Production Insight

Idempotency failures cause duplicate state changes that are hard to detect.

Use UPSERT in projections and set absolute values, never increments.

Batch processing must check idempotency per event, not per batch.

Rule: write every projection as if every event will be processed twice.

Also: ensure your idempotency key has sufficient entropy — a timestamp alone is not enough; use UUIDs.

🎯 Key Takeaway

Idempotency is non-negotiable in Event Sourcing.

Duplicate events are inevitable; your system must handle them.

If you don't test duplicate delivery, you'll find it in production.

Idempotency strategy decision

If: high write throughput and you need a fast idempotency check
→ Use: Redis with a TTL for idempotency keys, and accept the risk of key eviction.

If: financial system with no tolerance for duplicates
→ Use: an event store table with a unique constraint on (aggregate_id, idempotency_key).

If: projections are stateless and use SET operations
→ Use: no explicit idempotency store; replay is safe.

If: batch event processing with partial failures
→ Use: a separate idempotency key for each event in the batch; log failures and retry individually.

The Problem with Traditional CRUD Systems

You've been lied to about databases. Every time you run an UPDATE statement, you're burning evidence. You overwrite the truth with a guess at what the current state should be, and the old truth disappears into a transaction log nobody reads.

Ask yourself this: when a customer calls support at 10:03 AM screaming that their order got doubled, can you pull up the exact sequence of events between 10:01 and 10:03? In a CRUD system, you can't. You see "status = processed" and that's it. You lost the fact that a payment gateway retry collided with a webhook replay, both of them incrementing the order total.

This isn't just an audit problem. It's a consistency problem. Partial failures in multi-step workflows leave your database in states that don't make sense. The payment succeeded, but the status update failed. Now you've got a happy customer, a sad accountant, and a three-hour debugging session that ends with "restore from backup and pray."

Event sourcing doesn't fix these problems by accident. It fixes them by design: you stop throwing history away. You start recording facts.

ShipmentUpdateProblem.py · PYTHON

# io.thecodeforge · system-design tutorial

# Traditional CRUD: the update that loses history
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE shipments (id INT, status TEXT)")
db.execute("INSERT INTO shipments VALUES (1, 'pending')")

def update_status(shipment_id, new_status):
    # This is the moment you lose the truth
    db.execute(
        "UPDATE shipments SET status = ? WHERE id = ?",
        (new_status, shipment_id)
    )
    # Old value 'pending' is gone forever

update_status(1, 'in_transit')
update_status(1, 'delivered')

result = db.execute("SELECT status FROM shipments WHERE id = 1").fetchone()
print(f"Current status: {result[0]}")

Output

Current status: delivered

⚠ Production Trap:

A single UPDATE during a traffic spike can mask a race condition that only surfaces in post-mortems. By then, the evidence is gone.

🎯 Key Takeaway

If your system can't replay what happened between two timestamps for any entity, you're operating blind.
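For contrast, here is a sketch of the same shipment modelled as appended facts. The `shipment_events` table and helper names are hypothetical, and timestamps are ISO-8601 strings so they sort correctly as text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE shipment_events (
        seq         INTEGER PRIMARY KEY AUTOINCREMENT,
        shipment_id INTEGER NOT NULL,
        status      TEXT NOT NULL,
        occurred_at TEXT NOT NULL
    )
    """
)


def record_status(shipment_id: int, status: str, occurred_at: str) -> None:
    # Append only: nothing is ever updated or deleted.
    with db:
        db.execute(
            "INSERT INTO shipment_events (shipment_id, status, occurred_at) "
            "VALUES (?, ?, ?)",
            (shipment_id, status, occurred_at),
        )


def current_status(shipment_id: int) -> str:
    # Current state is derived from the latest event.
    row = db.execute(
        "SELECT status FROM shipment_events WHERE shipment_id = ? "
        "ORDER BY seq DESC LIMIT 1",
        (shipment_id,),
    ).fetchone()
    return row[0]


def history_between(shipment_id: int, start: str, end: str) -> list:
    # The question CRUD cannot answer: what happened between two timestamps?
    return db.execute(
        "SELECT status, occurred_at FROM shipment_events "
        "WHERE shipment_id = ? AND occurred_at BETWEEN ? AND ? ORDER BY seq",
        (shipment_id, start, end),
    ).fetchall()


record_status(1, "pending", "2026-03-05T10:00:00Z")
record_status(1, "in_transit", "2026-03-05T10:01:30Z")
record_status(1, "delivered", "2026-03-05T10:02:45Z")
```

`current_status(1)` still answers "delivered", but the 10:01 to 10:03 window is now a query instead of a guess.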

Event Sourcing Architecture: Kafka vs. The World

Stop thinking about event sourcing as a database pattern. Think about it as a logistics problem: you need a place to dump immutable packets of truth, in order, forever. That's an append-only log. Kafka is the production-grade answer, but don't pretend it's your only option.

The architectural shape is simple:

Producers write events to a log. Skipping schema enforcement at write time is a trap we'll cover later.
The log (Kafka topic, Kinesis stream, Postgres WAL, whatever) stores them in sequence. Retention is your choice: keep everything forever, or tier to S3 after 30 days.
Consumers read the log and build whatever state they need. Projections, materialized views, microservice boundaries — each consumer sees the same events and builds its own truth.

Why Kafka? Because it gives you ordered, partitioned, replayable streams that survive node failures without losing a single byte. Kinesis works if you're married to AWS and hate managing clusters. Postgres WAL works if you're insane or your event volume is measured in dozens per second.

The key insight: the log is the source of truth. Not your database. Not your cache. The log. Everything else is a derived view that can be rebuilt by replaying events.

EventPublisher.py · PYTHON

# io.thecodeforge · system-design tutorial

# Publishing an event to Kafka
from kafka import KafkaProducer
import json, time

producer = KafkaProducer(
    bootstrap_servers='event-cluster:9092',
    value_serializer=lambda v: json.dumps(v).encode(),
    acks='all'  # Don't fire and forget. Wait for leader + ISR ack.
)

event = {
    'event_type': 'order_placed',
    'entity_id': 'ORD-7734',
    'version': 3,
    'data': {
        'customer_id': 'USR-8821',
        'items': ['sku-445', 'sku-112'],
        'total_cents': 4999
    },
    'timestamp': int(time.time() * 1000)
}

# Kafka partitions by key, so same entity_id always hits same partition
future = producer.send('order_events', key=b'ORD-7734', value=event)
result = future.get(timeout=10)
print(f"Event committed at offset {result.offset} in partition {result.partition}")

Output

Event committed at offset 140 in partition 3

💡Senior Shortcut:

Use the entity ID as your Kafka partition key. Every event for a given order then lands in the same partition, in the order it was produced, which gives you per-entity ordering as long as the topic's partition count does not change.

🎯 Key Takeaway

The event log is your source of truth; databases and caches are just disposable views built from that log.
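To make "each consumer builds its own truth" concrete, here is a sketch with two projections folding the same log. A plain list stands in for the Kafka topic, and the `order_cancelled` event type is a hypothetical addition.

```python
from collections import defaultdict

# Stand-in for the log; in production this would be read from the topic.
EVENT_LOG = [
    {"event_type": "order_placed", "entity_id": "ORD-1",
     "data": {"customer_id": "USR-1", "total_cents": 4999}},
    {"event_type": "order_placed", "entity_id": "ORD-2",
     "data": {"customer_id": "USR-1", "total_cents": 1500}},
    {"event_type": "order_cancelled", "entity_id": "ORD-2", "data": {}},
]


def project_order_status(log: list) -> dict:
    """View 1: latest status per order."""
    status = {}
    for event in log:
        if event["event_type"] == "order_placed":
            status[event["entity_id"]] = "placed"
        elif event["event_type"] == "order_cancelled":
            status[event["entity_id"]] = "cancelled"
    return status


def project_customer_spend(log: list) -> dict:
    """View 2: net spend per customer, built from the very same events."""
    open_orders = {}
    spend = defaultdict(int)
    for event in log:
        if event["event_type"] == "order_placed":
            customer = event["data"]["customer_id"]
            total = event["data"]["total_cents"]
            open_orders[event["entity_id"]] = (customer, total)
            spend[customer] += total
        elif event["event_type"] == "order_cancelled":
            if event["entity_id"] in open_orders:
                customer, total = open_orders.pop(event["entity_id"])
                spend[customer] -= total
    return dict(spend)


order_status = project_order_status(EVENT_LOG)
customer_spend = project_customer_spend(EVENT_LOG)
```

Either view can be deleted and rebuilt by calling its function again on the log, which is what makes them disposable.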

Why Event Sourcing Fails Without a Hard Schema Audit

Your event store is a lie if you haven’t nailed schema evolution. Events live forever. Your code doesn’t. The moment you rename a field, remove one, or add a required one, every consumer built on the old format can break. This isn’t a v2 problem; it’s a day-one design requirement.

Stop treating events as loose JSON blobs. Enforce a schema registry at write time. Apache Avro or Protobuf with schema IDs embedded in each event. Consumers fetch the schema by ID, decode safely. No surprises. Backward compatibility means optional fields only. Never rename or remove fields — mark them deprecated and stop populating after a coordinated cutoff. Forward compatibility requires matching on known fields and ignoring unknowns. Test both with a compatibility checker in CI. Production teams that skip this rebuild entire pipelines on a Friday night. Don’t be that team.

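SchemaAudit.py · PYTHON

# io.thecodeforge · system-design tutorial

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry import Schema
from confluent_kafka.schema_registry.avro import AvroSerializer

SUBJECT = "order-events-value"
SCHEMA_STR = '{...}'  # your Avro schema

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

serializer = AvroSerializer(
    registry,
    schema_str=SCHEMA_STR,
    to_dict=lambda event, ctx: event.dict()  # assumes a pydantic-style .dict()
)

# Production rule: register schema BEFORE writing any event
registry.register_schema(SUBJECT, Schema(SCHEMA_STR, schema_type="AVRO"))
latest = registry.get_latest_version(SUBJECT)
print(f"Schema registered: {SUBJECT} (version {latest.version})")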

Output

Schema registered: order-events-value (version 3)

⚠ Schema Turbulence:

Never delete a field. Consumers backfill state from historical events. A deleted field kills their replay logic silently.

🎯 Key Takeaway

Schema registry + backward/forward compatibility testing is the only way to survive event sourcing in production.
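If you want a feel for what a compatibility checker enforces, here is a deliberately simplified sketch. It uses a hypothetical dict-based schema description (field name mapped to type and optional default), not real Avro, and it is no substitute for the registry's own compatibility modes.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, dict], new: Dict[str, dict]) -> List[str]:
    """List changes that would break consumers replaying existing events."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        # A new field is only safe if old events can be read with a default.
        if name not in old and "default" not in spec:
            problems.append(f"new field without default: {name}")
    return problems


ORDER_V1 = {
    "orderId": {"type": "string"},
    "total": {"type": "string"},
}
ORDER_V2_OK = {
    "orderId": {"type": "string"},
    "total": {"type": "string"},
    "currency": {"type": "string", "default": "USD"},
}
ORDER_V2_BAD = {
    "order_id": {"type": "string"},   # renamed: shows up as removed + added
    "total": {"type": "double"},      # type change
}

ok_report = breaking_changes(ORDER_V1, ORDER_V2_OK)
bad_report = breaking_changes(ORDER_V1, ORDER_V2_BAD)
```

In CI, a non-empty report would fail the build before the new schema reaches a producer.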

Scrap Snapshots — The Only Performance Escape

Event sourcing’s Achilles' heel is replay. Rebuild an aggregate from 10,000 events? That’s 10,000 loads, 10,000 deserializations, 10,000 handler invocations: latency you can’t buy your way out of. The only cure is a snapshot: a serialized state checkpoint taken every N events. Your aggregate loads the snapshot, then replays only the events after it. With a snapshot every 100 events, that 10,000-event aggregate replays at most about 100 events per load instead of all 10,000.

Choose the snapshot interval with purpose. Too frequent? You’ve replaced event traffic with snapshot storage churn. Too rare? Replay still stings. A good rule: 100 events for high-write aggregates, 500 for read-heavy ones. Store snapshots alongside events in the same stream — retrieval is sequential and fast. At production scale, commit the snapshot in the same transaction as the events to keep consistency without distributed coordination. No magic, just math. Snapshots aren’t optional; they’re the difference between a system that works and one that stalls.

SnapshotManager.py · PYTHON

from typing import Optional, List
from uuid import UUID
import time  # needed for Snapshot.timestamp below

class Snapshot:
    __slots__ = ('aggregate_id', 'state', 'version', 'timestamp')

    def __init__(self, aggregate_id: str, state: dict, version: int):
        self.aggregate_id = aggregate_id
        self.state = state
        self.version = version
        self.timestamp = time.time()

class SnapshotPolicy:
    def __init__(self, threshold: int = 100):
        self._threshold = threshold

    def should_snapshot(self, current_version: int, last_snapshot_version: int) -> bool:
        return (current_version - last_snapshot_version) >= self._threshold

Output

>>> policy = SnapshotPolicy(threshold=100)

>>> policy.should_snapshot(105, 0)

True

>>> policy.should_snapshot(90, 0)

False

⚠ Production Trap:

Don’t wait for replay to hurt. Add snapshots on day one. Retrofitting them into a system with millions of events is a costly migration game you do not want to play.

🎯 Key Takeaway

Snapshots are not optimization; they’re existential. Without them, replay scales linearly with event count — and that curve kills.
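The load path that goes with a snapshot policy can be sketched like this. The account-balance reducer and the event shape are assumptions made for the example.

```python
from typing import Optional, Tuple


def apply(state: dict, event: dict) -> dict:
    """Hypothetical reducer: fold one deposit or withdrawal into the balance."""
    delta = event["amount"] if event["type"] == "deposited" else -event["amount"]
    return {"balance": state["balance"] + delta}


def load_aggregate(snapshot: Optional[Tuple[dict, int]], events: list):
    """Start from the snapshot (state, version) if present, then replay the tail.

    Returns (state, version, number_of_events_replayed).
    """
    state, version = snapshot if snapshot else ({"balance": 0}, 0)
    replayed = 0
    for event in events:
        if event["version"] <= version:
            continue  # already folded into the snapshot
        state = apply(state, event)
        version = event["version"]
        replayed += 1
    return state, version, replayed


events = [{"version": v, "type": "deposited", "amount": 10} for v in range(1, 6)]

full = load_aggregate(None, events)                    # replays all five events
fast = load_aggregate(({"balance": 30}, 3), events)    # replays only versions 4 and 5
```

A real store would query only events with a version above the snapshot's, so the skipped events are never read at all; the filter here just keeps the sketch self-contained.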

Challenges in Event Sourcing

Event sourcing introduces a set of hard challenges that kill projects that don't prepare for them. First, event schema evolution is brutal: events live forever, so every rename, removal, or new required field needs version-aware handling (upcasting or defaults) in every consumer. Second, replaying events to rebuild state grows linearly with event count; without aggressive snapshotting or incremental projections, rebuilds and cold aggregate loads can stretch from seconds to hours. Third, there is no global ordering across partitions; distributed event stores like Kafka guarantee order only within a partition, forcing careful key routing. Fourth, the system becomes harder to debug: you can't just look at a row in a database; you replay thousands of events to understand current state. Fifth, operational complexity balloons: you need dedicated infrastructure for event storage, projection rebuilds, and catch-up subscriptions. Finally, erasing personal data to satisfy GDPR is hard because events are immutable facts; you must implement tombstone events or anonymization logic. These challenges are not optional problems; they are baked into the pattern.

event_migration.py · PYTHON

# io.thecodeforge · system-design tutorial

from dataclasses import dataclass, asdict

@dataclass
class UserV1:
    name: str
    email: str

@dataclass
class UserV2:
    name: str
    email: str | None
    phone: str | None = None

def migrate_event(raw: dict) -> dict:
    if 'version' not in raw:
        raw['version'] = 1
    if raw['version'] == 1:
        raw['version'] = 2
        raw['phone'] = None
    return raw

Output

Example: raw = {'name':'Alice','email':'a@b.com'} -> {'name':'Alice','email':'a@b.com','version':2,'phone':None}

⚠ Production Trap:

A missing version field on old events crashes all replay pipelines. Never assume events have a version key—always default to version 1 during deserialization.

🎯 Key Takeaway

Every event must carry an explicit version field from day one.

Disadvantages of Event Sourcing

Event sourcing is not a free upgrade. It carries real disadvantages that often sink teams expecting magic. The primary disadvantage is complexity: you trade a simple CRUD write for a multi-step process of appending an event, updating a projection, and handling eventual consistency. Read models lag behind writes, and you need a separate read model layer to serve them. Second, the storage footprint balloons: every state change is stored forever, not just the latest value. Third, event replay becomes slower as the event log grows; without snapshotting, rebuilding state takes time linear in the number of events. Fourth, debugging becomes a forensic exercise: you cannot inspect a current row; you must reconstruct history by replaying events. Fifth, operational overhead is high: you need specialized infrastructure for event stores, projection rebuilds, and schema migration tooling. Sixth, event ordering across services is difficult without distributed coordination. Finally, GDPR compliance is painful: deleting an individual's data requires rewriting the event log or introducing tombstone events. Event sourcing works only when the benefit of an audit trail outweighs these costs.

state_rebuild_penalty.py · PYTHON

# io.thecodeforge · system-design tutorial

def rebuild_state(events):
    state = {}
    for event in events:
        if event['type'] == 'USER_CREATED':
            state[event['user_id']] = event
        elif event['type'] == 'USER_DELETED':
            # pop() tolerates a replayed or duplicate delete; del would raise KeyError
            state.pop(event['user_id'], None)
    return state

# 10 million events -> 10 million iterations
# Without a snapshot, every rebuild is a full O(n) pass over the log

Output

Events stored: 10,000,000 | Time to rebuild: ~45 seconds (single thread)

⚠ Production Trap:

A team skipped snapshots for 'simplicity' and took 47 minutes to restore a service after a crash. Always snapshot at a cadence tied to your recovery SLA.

🎯 Key Takeaway

Never deploy event sourcing without snapshots and a defined recovery time objective.
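Tying snapshot cadence to a recovery objective is simple arithmetic. The sketch below reuses the figures from the output above (10,000,000 events in about 45 seconds) and assumes, purely for illustration, that 5 seconds of your recovery budget may be spent on replay.

```python
def replay_rate(events: int, seconds: float) -> float:
    """Measured replay throughput in events per second."""
    return events / seconds


def max_events_since_snapshot(rate_per_s: float, replay_budget_s: float) -> int:
    """Largest event tail that can be replayed within the budget."""
    return int(rate_per_s * replay_budget_s)


# Figures from the rebuild example: 10M events in ~45 s on a single thread.
rate = replay_rate(10_000_000, 45)           # about 222,222 events/s

# Assumed budget: 5 s of the recovery time objective reserved for replay.
tail_limit = max_events_since_snapshot(rate, 5)
```

Measure the rate on your own handlers before trusting the number. A projection that writes to a database per event will replay far slower than an in-memory fold.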

● Production incident · POST-MORTEM · severity: high

The Missing Event Incident

Symptom

After a database failover, the trading platform showed stale positions. Trades executed during the outage window were not reflected. The projection lag indicators showed zero, yet the data was wrong.

Assumption

The team assumed the snapshot store was a cache — losing it would just slow down replay. They didn't verify that the event store replication lag was within acceptable bounds.

Root cause

During failover, the event store primary went down before replicating the latest events to the secondary. The snapshot store had been configured with a short retention and was rebuilt from the (incomplete) secondary event store. The event store and snapshot store became inconsistent without any alert.

Fix

Implemented read-repair on every snapshot load: recompute the snapshot from the event log and compare version hashes. Added a consistency check job that runs hourly on all projections. Configured the event store with synchronous replication to the secondary.

Key lesson

Snapshots are a performance optimisation, not a durable source of truth.
Always verify consistency between event store and snapshot store on read.
Synchronous replication is worth the latency trade-off for financial data.
Never trust projection lag as a health indicator — validate data correctness separately.
Alert on any snapshot version mismatch immediately — don't wait for a manual check.
Back up the event store independently of snapshot backups. Snapshots can be rebuilt, events cannot.
Test failover recovery with a full event replay in staging at least once per quarter.
Don't assume the event store secondary is current — verify replication lag before failing over.
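The read-repair check from the fix can be sketched as a hash comparison. The position-keeping event shape and the `state_hash` field on the snapshot are assumptions made for this example.

```python
import hashlib
import json


def state_hash(state: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the result."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rebuild(events: list, upto_version: int) -> dict:
    """Recompute positions from the log up to and including upto_version."""
    positions = {}
    for event in events:
        if event["version"] > upto_version:
            break
        symbol = event["symbol"]
        positions[symbol] = positions.get(symbol, 0) + event["quantity"]
    return positions


def verify_snapshot(snapshot: dict, events: list) -> bool:
    """True if the snapshot matches what the event log says at that version."""
    return state_hash(rebuild(events, snapshot["version"])) == snapshot["state_hash"]


log = [
    {"version": 1, "symbol": "ACME", "quantity": 100},
    {"version": 2, "symbol": "ACME", "quantity": -40},
    {"version": 3, "symbol": "INIT", "quantity": 10},
]

good = {"version": 3, "state_hash": state_hash({"ACME": 60, "INIT": 10})}
# Snapshot built from an incomplete replica that never saw version 3.
stale = {"version": 3, "state_hash": state_hash({"ACME": 60})}
```

A failed check is the signal to discard the snapshot, fall back to the event log, and raise the alert the lessons above call for. Recomputing costs a full replay, so this shows the check itself, not how often to run it.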

Production debug guide: symptom-to-action guide for common ES failures (10 entries)

Symptom · 01

Events are missing after a failover or restart

→

Fix

Check replication lag metrics on the event store. Query the event log directly for the expected event IDs. Verify snapshot version hash against recomputed snapshot.

Symptom · 02

Projection lag is increasing, never catching up

→

Fix

Profile the projection handler — likely a slow SQL query or external API call. Consider batching events or using a dedicated projection worker pool. Also check if the event bus is backing up due to slow consumer.

Symptom · 03

Snapshot rebuild takes too long

→

Fix

Reduce snapshot interval. Increase event store read throughput by scaling read replicas. Use a separate projection for snapshot materialisation. For extreme cases, implement multi-level snapshots (hourly + per-N-events).

Symptom · 04

Duplicate events cause inconsistent state

→

Fix

Implement idempotency keys on the write side. In the projection, make operations idempotent (e.g., UPSERT, SET operations). Log and deduplicate by event ID. Ensure projection handlers are idempotent by design.

Symptom · 05

Event schema mismatch causes deserialization failures

→

Fix

Check event metadata for schema version. Implement version-aware deserialization with defaults for missing fields. Set up dead letter queue for unprocessable events and alert on any new DLQ entries.

Symptom · 06

Transactional outbox write fails – event stored but not published

→

Fix

Inspect the outbox table for entries with status='PENDING'. Verify message broker connectivity. Implement a retry mechanism with exponential backoff. Ensure the outbox cleanup job marks entries as published only after broker acknowledges receipt.

Symptom · 07

Event store write latency spikes under load

→

Fix

Check for contention on the primary key index. Move to a higher-performance write tier (e.g., local SSDs, separate write replica). Ensure that batch inserts are used instead of single-row inserts.

Symptom · 08

Snapshot corruption detected during consistency check

→

Fix

Immediately fall back to full replay from events. Generate a new snapshot and store it. Investigate the root cause: could be a bug in the snapshot serialization or a concurrent write to the snapshot store.

Symptom · 09

Event store backup restore fails to bring system online

→

Fix

Ensure you restore events first, then rebuild all snapshots and projections from scratch. Never restore a snapshot-only backup without corresponding events. Test restore procedure quarterly.

Symptom · 10

Event stream ordering is inconsistent across consumers

→

Fix

Verify that all events for a given aggregate are routed to the same partition. Check consumer group configuration. Consider using a global event log with offset tracking for strict ordering.
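For Symptom 06, it helps to see the outbox mechanics end to end. This is a sketch using SQLite, with hypothetical table names and a `publish` callable standing in for the broker client.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE events (
        aggregate_id TEXT NOT NULL,
        version      INTEGER NOT NULL,
        payload      TEXT NOT NULL,
        PRIMARY KEY (aggregate_id, version)
    );
    CREATE TABLE outbox (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        status  TEXT NOT NULL DEFAULT 'PENDING'
    );
    """
)


def append_with_outbox(aggregate_id: str, version: int, event: dict) -> None:
    """Store the event and its outbox row in one transaction."""
    payload = json.dumps(event)
    with db:  # both inserts commit together or roll back together
        db.execute("INSERT INTO events VALUES (?, ?, ?)", (aggregate_id, version, payload))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def pending_count() -> int:
    return db.execute("SELECT COUNT(*) FROM outbox WHERE status = 'PENDING'").fetchone()[0]


def relay_pending(publish) -> int:
    """Publish PENDING rows in order; mark each one only after publish() returns."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE status = 'PENDING' ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(payload)
        except Exception:
            break  # leave this row and the rest PENDING for the next run
        with db:
            db.execute("UPDATE outbox SET status = 'PUBLISHED' WHERE id = ?", (row_id,))
        sent += 1
    return sent


append_with_outbox("acct-1", 1, {"type": "deposited", "amount": 10})
append_with_outbox("acct-1", 2, {"type": "deposited", "amount": 5})
```

If the relay crashes after `publish` succeeds but before the row is marked, the event is sent again on the next run. The outbox gives at-least-once delivery, which is why consumers still need to be idempotent.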

★ Event Sourcing Quick Debug Cheat Sheet

When something is wrong with your event-sourced system, these commands and checks will find the issue fast.

Event store write failure: duplicate key or constraint violation

Immediate action

Check for duplicate event IDs in the log. Verify idempotency key handling.

Commands

SELECT event_id, COUNT(*) FROM events GROUP BY event_id HAVING COUNT(*) > 1;

If duplicates found: implement UPSERT with ON CONFLICT DO NOTHING.

Fix now

Set a unique constraint on (aggregate_id, event_version) or use an idempotency key column.

Snapshot vs event store mismatch (version hash different)

Projection lag > 5 minutes

Deserialization error on event replay

Outbox table has stuck entries after broker failure

Event store backup restore inconsistency

Dead letter queue growing due to unprocessable events

Event ordering divergence between consumers

Event Sourcing vs CRUD vs Audit Log

Dimension	CRUD	Audit Log (Append-only)	Event Sourcing
Write model	Destructive UPDATE	Append-only log + current state	Append-only event log
Read model	Direct table read	Current state read	Projections derived from events
Historical queries	Not possible	Track changes, but no intent	Full time travel with intent
Complexity	Low	Medium	High
Storage growth	Fixed (state size)	Moderate (state + audit)	High (all events)
Consistency	Strong (per row)	Eventual (read model)	Eventual (projections) but strong per aggregate

⚙ Quick Reference

11 commands from this guide

File	Command / Code	Purpose
event_store_schema.sql	CREATE TABLE io_thecodeforge_event_store (	Designing the Event Store
io/thecodeforge/eventsourcing/cqrs/AccountProjection.java	@Component	CQRS and Event Sourcing
io/thecodeforge/eventsourcing/SnapshotService.java	public class SnapshotService {	Snapshots and Performance Optimisation
io/thecodeforge/eventsourcing/schema/EventDeserializer.java	public class EventDeserializer {	Schema Evolution
io/thecodeforge/eventsourcing/idempotency/CommandHandler.java	public class CommandHandler {	Idempotency and Consistency Guarantees
ShipmentUpdateProblem.py	db = sqlite3.connect(':memory:')	The Problem with Traditional CRUD Systems
EventPublisher.py	from kafka import KafkaProducer	Event Sourcing Architecture
SchemaAudit.py	from confluent_kafka.schema_registry import SchemaRegistryClient	Why Event Sourcing Fails Without a Hard Schema Audit
SnapshotManager.py	from typing import Optional, List	Scrap Snapshots
event_migration.py	from dataclasses import dataclass, asdict	Challenges in Event Sourcing
state_rebuild_penalty.py	def rebuild_state(events):	Disadvantages of Event Sourcing

Key takeaways

Event Sourcing makes the event log the truth; state is always derived from replay.

Snapshots are essential for performance: plan for them from day one.

Idempotency must be built into both write and read sides.

Schema evolution should be anticipated and tested continuously.

Monitor projection lag per projection, not globally.

Start simple: SQL event store without CQRS, then add complexity as needed.

Always test failover recovery with full event replay in staging.

Event design is a modeling exercise: keep events granular and past-tense.

Never let a read model influence a write decision; the event store is the only truth.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How does Event Sourcing guarantee consistency within an aggregate?

Q02 · SENIOR

Explain the trade-offs between using a SQL-based event store vs a dedica...

Q03 · SENIOR

What strategies do you use to handle schema evolution in Event Sourcing?

Q04 · SENIOR

How do you maintain event ordering in a distributed event store?

Q05 · SENIOR

What is the transactional outbox pattern and why is it important in Even...

Q01 of 05 · SENIOR

How does Event Sourcing guarantee consistency within an aggregate?

ANSWER

Consistency within an aggregate is enforced by optimistic concurrency control: the event store uses the aggregate version as a concurrency token instead of taking a lock. When appending an event, you must provide the expected version of the aggregate. If another writer has already appended a newer event, the version check fails (typically as a unique-constraint violation on (aggregate_id, version)) and the write is rejected; the caller reloads the aggregate and retries. This serializes all writes to the same aggregate. Cross-aggregate consistency relies on eventual consistency through projections or saga patterns.

FAQ · 7 QUESTIONS

Frequently Asked Questions

Do I need CQRS to use Event Sourcing?

How do you handle event store growth?

What happens if an event handler fails to process an event?

Can I use Event Sourcing with a relational database?

How often should I take snapshots?

What is the transactional outbox pattern?

Can Event Sourcing work with Kafka as the event store?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Verified · production tested

Last updated: July 18, 2026

2,466 articles · all by Naren
