Event-Driven Architecture: Patterns, Brokers, and Production Strategies
Master Event-Driven Architecture (EDA).
- EDA is a design pattern where services communicate asynchronously via immutable events
- Events are immutable facts (past tense); commands are imperative requests (future tense)
- Brokers (Kafka, RabbitMQ) decouple producers from consumers, enabling horizontal scaling
- Kafka retains messages for replay; RabbitMQ deletes after acknowledgment
- Biggest mistake: treating commands as events, creating tight coupling and breaking idempotency
Think of a group chat where people post updates (events) that anyone can read later. Commands are like private messages demanding immediate action. Queries are like checking the chat history for a specific fact.
Event-driven architecture is the 'social media' of microservices. In a synchronous, legacy system, Service A acts like a demanding boss—it calls Service B and waits, staring at the screen until it gets an answer. If Service B is having a bad day (or is down), Service A crashes too. In an EDA world, Service A simply 'posts' an update to a broker. It doesn't know, nor does it care, who is 'following' that update. Service B can process it now, ten minutes from now, or after it recovers from a reboot.
While this decoupling provides immense resilience, it introduces a new mental model: Eventual Consistency. You must design your system to accept that the 'User Created' email might arrive a few seconds after the database record is saved. For high-scale systems like Netflix, Uber, or LinkedIn, this trade-off is the only way to achieve global availability.
Events vs Commands vs Queries
In production system design, vocabulary is a functional constraint. Using a command when you meant an event creates accidental coupling. A Command is a 'Request for Action' (imperative), while an Event is a 'Statement of Fact' (declarative). Queries remain the synchronous backbone for immediate data retrieval.
Here's the real test: ask yourself 'Has this already happened?' If yes, it's an event. If no, it's a command. The name must reflect that. 'OrderCreated' is an event. 'CreateOrder' is a command disguised as an event. That distinction directly impacts how much coupling you ship.
- Commands are sent to a specific person (tightly coupled).
- Events are broadcast to everyone who cares (decoupled).
- If you reply to a command with 'I can't', that's still a synchronous response.
- If you react to an event, you don't need permission — that's async decoupling.
The Heavyweights: Kafka vs RabbitMQ
Choosing a broker isn't about speed; it's about the 'Retention Model'. RabbitMQ is a smart post office: it routes messages to specific mailboxes and shreds the letter once it's read (acknowledged). Kafka is a distributed ledger: it appends messages to an immutable log and keeps them there for days or weeks, allowing consumers to 'rewind time' and reprocess data.
Don't fall for the raw throughput numbers. Both can handle 100k messages/sec. The real question is: can you afford to lose history? In event sourcing, you can't. Kafka's retention wins. For transient task queues, RabbitMQ's smart routing and per-queue TTL is simpler and cheaper to operate.
Eventual Consistency and Trade-offs
In a synchronous system, after a write, a subsequent read immediately sees the new data. In EDA, the producer publishes an event, but the consumer may not have processed it yet. During that window, the system is inconsistent. This is eventual consistency. You must design your APIs to either return stale data or use compensating actions. For example, an order service creates an order and emits 'OrderPlaced'. The inventory service eventually decrements stock. If a read request arrives before inventory decrements, the API may report old stock levels. The trade-off: higher availability and scalability at the cost of immediate consistency.
This isn't a bug — it's a design choice. You can mitigate it with read-side caching, optimistic concurrency, or versioned responses. But you can never eliminate the inconsistency window without going synchronous.
Idempotent Consumer: The Non-Negotiable Pattern
Because of at-least-once delivery guarantees, the same event may arrive multiple times. An idempotent consumer ensures that processing the same event multiple times has the same effect as processing it once. The standard implementation stores processed event IDs in a database or Redis. Before processing, the consumer checks if the event ID exists. If yes, it skips the event. If no, it processes and then persists the ID. Use a unique constraint on the event ID to prevent race conditions.
In production, the race condition between 'check' and 'insert' is where duplicates creep in. Use a database INSERT with ON CONFLICT DO NOTHING or a Redis SET NX to atomically claim the event. If two consumers attempt to process the same event simultaneously, only one succeeds.
event_id, concurrent inserts can cause deadlocks.Dead Letter Queues and Error Handling
No matter how well you test, some events will fail to process—bad data, transient dependencies, or code bugs. Instead of retrying forever (which blocks the queue), route the failed event to a Dead Letter Queue (DLQ). The DLQ holds events that exceeded retry limits. You can inspect, fix, and replay them later.
But a DLQ without monitoring is a data graveyard. Set up alerts on DLQ message count. Automate the replay pipeline: when the consumer is fixed, a script should republish DLQ events back to the main topic. Without that, events pile up silently and the system drifts.
Schema Evolution and Compatibility
Over time, event payloads change. A producer adds a new field; consumers built six months ago need to ignore it. Schema registries (like Confluent Schema Registry) enforce compatibility rules: backward, forward, full. Backward means new consumers can read old data (default). Forward means old consumers can read new data. Full means both. Using Avro or Protobuf with a schema registry ensures safe evolution.
The #1 cause of silent event loss in production is a breaking schema change. A producer removes a required field — every consumer downstream crashes with a deserialization error, events pile up in the DLQ, and the business impact is delayed by hours. Always set defaults for new fields and enforce backward compatibility.
BACKWARD or FULL compatibility.Transactional Outbox: Preventing Dual Write Failures
A common problem: your service updates a database and then publishes an event. If the publish fails, the event is lost and other services never see the change. This is the 'Dual Write' problem. The Transactional Outbox pattern solves it: write the event into an outbox table within the same database transaction. A separate process (polling publisher or CDC) reads the outbox and publishes the event to the broker. This guarantees at-least-once publication.
For low-latency needs, use Change Data Capture (CDC) with Debezium. It reads the database transaction log and publishes events within milliseconds. No polling overhead, no risk of missing events.
Missed Order Events During Kafka Partition Rebalance
max.poll.interval.ms was too high (5 min), so the broker thought the consumer died and triggered a rebalance. The new consumer started from the earliest offset because auto.offset.reset was set to earliest, reprocessing old events and skipping the uncommitted ones that hadn't been processed yet.max.poll.interval.ms to a value lower than the rebalance timeout (e.g., 30 seconds) and use an idempotent consumer with manual offset commits after processing. Enable enable.auto.commit=false and commit offsets in batches after successful processing.- Always tune consumer timeouts to your processing latency, not the other way around.
- Use manual offset commits combined with idempotent consumers to avoid skipping or double-processing.
- Test rebalance scenarios in staging with actual event payloads.
kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe. Ensure consumer is alive and not stuck in rebalance.processed_events table before applying side effects. Look for missing eventId deduplication or incorrect ACK timing.max.in.flight.requests.per.connection is set to 1 for guaranteed order if needed.max.poll.records or increase session.timeout.ms. The consumer is taking too long to process a batch; either batch size or processing time is too high. Enable async commits and handle rebalance listeners.Key takeaways
Common mistakes to avoid
4 patternsUsing commands instead of events (imperative naming)
CreateOrder event implies the order doesn't exist yet, so consumers wait for the database write to complete before reacting. This breaks the async promise and causes timeouts.OrderCreated, PaymentFailed. Commands should be imperative: CreateOrderCommand, RefundUser. Decouple intent from fact.No idempotency in consumers
processed_events table with a UNIQUE constraint on event_id. Check before processing, skip if already processed.Ignoring schema evolution
Choosing the wrong broker for the workload
Interview Questions on This Topic
Explain the 'At-Least-Once' delivery guarantee. Why does this necessitate idempotency in your microservices?
Frequently Asked Questions
That's Architecture. Mark it forged?
4 min read · try the examples if you haven't