Senior 4 min · March 06, 2026

Event-Driven Architecture: Patterns, Brokers, and Production Strategies

Master Event-Driven Architecture (EDA).

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • EDA is a design pattern where services communicate asynchronously via immutable events
  • Events are immutable facts (past tense); commands are imperative requests (future tense)
  • Brokers (Kafka, RabbitMQ) decouple producers from consumers, enabling horizontal scaling
  • Kafka retains messages for replay; RabbitMQ deletes after acknowledgment
  • Biggest mistake: treating commands as events, creating tight coupling and breaking idempotency
Plain-English First

Think of a group chat where people post updates (events) that anyone can read later. Commands are like private messages demanding immediate action. Queries are like checking the chat history for a specific fact.

Event-driven architecture is the 'social media' of microservices. In a synchronous, legacy system, Service A acts like a demanding boss—it calls Service B and waits, staring at the screen until it gets an answer. If Service B is having a bad day (or is down), Service A crashes too. In an EDA world, Service A simply 'posts' an update to a broker. It doesn't know, nor does it care, who is 'following' that update. Service B can process it now, ten minutes from now, or after it recovers from a reboot.

While this decoupling provides immense resilience, it introduces a new mental model: Eventual Consistency. You must design your system to accept that the 'User Created' email might arrive a few seconds after the database record is saved. For high-scale systems like Netflix, Uber, or LinkedIn, this trade-off is the only way to achieve global availability.

Events vs Commands vs Queries

In production system design, vocabulary is a functional constraint. Using a command when you meant an event creates accidental coupling. A Command is a 'Request for Action' (imperative), while an Event is a 'Statement of Fact' (declarative). Queries remain the synchronous backbone for immediate data retrieval.

Here's the real test: ask yourself 'Has this already happened?' If yes, it's an event. If no, it's a command. The name must reflect that. 'OrderCreated' is an event. 'CreateOrder' is a command disguised as an event. That distinction directly impacts how much coupling you ship.

CommunicationPatterns.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
package io.thecodeforge.eda.models;

import java.time.Instant;
import java.util.UUID;

/**
 * io.thecodeforge: Standardizing Intent vs Fact
 */
public class CommunicationPatterns {

    // COMMAND: Targeted intent. High coupling.
    // If the receiver is gone, the intent fails.
    public record RegisterUserCommand(
        UUID commandId,
        String email,
        String plainTextPassword
    ) {}

    // EVENT: Immutable fact. Zero coupling.
    // Multiple services (Audit, Email, Analytics) can 'sub' to this.
    public record UserRegisteredEvent(
        UUID eventId,
        UUID userId,
        String email,
        Instant occurredAt
    ) {}

    // QUERY: Immediate data request (REST/gRPC).
    public record GetUserProfileQuery(UUID userId) {}
}
The Party Invitation Model
  • Commands are sent to a specific person (tightly coupled).
  • Events are broadcast to everyone who cares (decoupled).
  • If you reply to a command with 'I can't', that's still a synchronous response.
  • If you react to an event, you don't need permission — that's async decoupling.
Production Insight
Using an event name like 'CreateOrder' is a command disguised as an event.
When multiple services react to 'OrderCreated', they expect the order to already exist.
Rule: name events with past tense (OrderCreated, PaymentFailed) and commands with present tense (CreateOrder, ChargePayment).
Key Takeaway
Commands express intent; events express fact.
Past tense event names prevent ambiguity in async pipelines.
If you're naming your event 'CreateOrder', you've created accidental coupling.
Choosing Between Event and Command
IfHas the action already happened?
UseUse an Event (past tense)
IfAre you asking someone to do something?
UseUse a Command (imperative)
IfDo you need immediate data?
UseUse a Query (synchronous)

The Heavyweights: Kafka vs RabbitMQ

Choosing a broker isn't about speed; it's about the 'Retention Model'. RabbitMQ is a smart post office: it routes messages to specific mailboxes and shreds the letter once it's read (acknowledged). Kafka is a distributed ledger: it appends messages to an immutable log and keeps them there for days or weeks, allowing consumers to 'rewind time' and reprocess data.

Don't fall for the raw throughput numbers. Both can handle 100k messages/sec. The real question is: can you afford to lose history? In event sourcing, you can't. Kafka's retention wins. For transient task queues, RabbitMQ's smart routing and per-queue TTL is simpler and cheaper to operate.

TheCodeForgeProducer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
package io.thecodeforge.eda.kafka;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class TheCodeForgeProducer {

    public void publishUserEvent(String userId, String payload) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Partitioning by userId ensures all events for ONE user
            // land in the same partition, preserving sequence order.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("io.thecodeforge.user-events", userId, payload);

            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("Event persisted to topic %s partition %d%n",
                        metadata.topic(), metadata.partition());
                }
            });
        }
    }
}
Throughput vs Retention
Kafka's sequential disk I/O delivers ~2x raw throughput over RabbitMQ for large messages, but RabbitMQ's broker-side routing can reduce network round trips for complex patterns.
Production Insight
Kafka's retention makes it ideal for event sourcing and replays, but it adds operational cost.
RabbitMQ's smart routing (headers, topics) is simpler for work queues but loses history after ack.
Rule: If you need to reprocess past events or audit history, choose Kafka. For simple task distribution, RabbitMQ wins.
Key Takeaway
Kafka is a database of facts; RabbitMQ is a message taxi.
Retention model dictates use cases, not raw throughput.
Know your consumption pattern: pull (Kafka) or push (RabbitMQ).
When to choose Kafka vs RabbitMQ
IfNeed to replay past events or audit history
UseUse Kafka — its immutable log is designed for replay.
IfSimple work queue with one consumer per message
UseUse RabbitMQ — lightweight, easy to operate.
IfComplex routing based on message headers or topics
UseUse RabbitMQ — its exchange types (topic, headers) are purpose-built.
IfEvent sourcing or stream processing (Kafka Streams, ksqlDB)
UseUse Kafka — it's the foundation for stream processing.

Eventual Consistency and Trade-offs

In a synchronous system, after a write, a subsequent read immediately sees the new data. In EDA, the producer publishes an event, but the consumer may not have processed it yet. During that window, the system is inconsistent. This is eventual consistency. You must design your APIs to either return stale data or use compensating actions. For example, an order service creates an order and emits 'OrderPlaced'. The inventory service eventually decrements stock. If a read request arrives before inventory decrements, the API may report old stock levels. The trade-off: higher availability and scalability at the cost of immediate consistency.

This isn't a bug — it's a design choice. You can mitigate it with read-side caching, optimistic concurrency, or versioned responses. But you can never eliminate the inconsistency window without going synchronous.

EventualConsistencyExample.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
package io.thecodeforge.eda.consistency;

public class EventualConsistencyExample {
    // On order placed, emit event: inventory decrement happens later.
    public record OrderPlacedEvent(
        String orderId,
        String userId,
        int totalAmount
    ) {}

    // The inventory service consumes and decrements asynchronously.
    // If someone queries inventory before this, they see old stock.
    // Solution: Use optimistic concurrency or version vectors.
}
Don't confuse eventual consistency with eventual correctness
Eventual consistency means the data will converge if updates stop. It does NOT mean the system will eventually correct business logic errors. Write compensating transactions (Sagas) to handle failures.
Production Insight
Eventual consistency is not eventual correctness.
If your business requires immediate consistency (e.g., seat reservation), EDA alone is dangerous.
Rule: Use CQRS with a read model that is asynchronously updated from events and a synchronous validation layer for critical paths.
Key Takeaway
EDA trades consistency for availability.
Design for eventual consistency: idempotent consumers, conflict resolution, and compensating transactions.
If you need strong consistency, don't use vanilla EDA—use Saga or two-phase commit.

Idempotent Consumer: The Non-Negotiable Pattern

Because of at-least-once delivery guarantees, the same event may arrive multiple times. An idempotent consumer ensures that processing the same event multiple times has the same effect as processing it once. The standard implementation stores processed event IDs in a database or Redis. Before processing, the consumer checks if the event ID exists. If yes, it skips the event. If no, it processes and then persists the ID. Use a unique constraint on the event ID to prevent race conditions.

In production, the race condition between 'check' and 'insert' is where duplicates creep in. Use a database INSERT with ON CONFLICT DO NOTHING or a Redis SET NX to atomically claim the event. If two consumers attempt to process the same event simultaneously, only one succeeds.

IdempotentConsumer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
package io.thecodeforge.eda.idempotency;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IdempotentConsumer {
    public void processEvent(String eventId, Runnable processingLogic) {
        String sql = "INSERT INTO processed_events (event_id) VALUES (?) " +
                     "ON CONFLICT (event_id) DO NOTHING";
        try (Connection conn = getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, eventId);
            int rowsInserted = stmt.executeUpdate();
            if (rowsInserted == 0) {
                System.out.println("Event " + eventId + " already processed, skipping.");
                return;
            }
        }
        processingLogic.run();
    }
}
Atomic Check-and-Insert
Use UPSERT (PostgreSQL ON CONFLICT, MySQL INSERT IGNORE) to avoid race conditions. A separate check + insert in two statements will leak duplicates.
Production Insight
Idempotency keys must be unique across all events, not just per partition.
If your deduplication table uses a primary key on event_id, concurrent inserts can cause deadlocks.
Rule: Use ON CONFLICT DO NOTHING (PostgreSQL) or a LOCK-free insertion with UniqueConstraintException handling.
Key Takeaway
Assume every event arrives at least twice.
Idempotency is not optional—it's a reliability contract.
Store processed event IDs with a unique constraint to prevent race conditions.

Dead Letter Queues and Error Handling

No matter how well you test, some events will fail to process—bad data, transient dependencies, or code bugs. Instead of retrying forever (which blocks the queue), route the failed event to a Dead Letter Queue (DLQ). The DLQ holds events that exceeded retry limits. You can inspect, fix, and replay them later.

But a DLQ without monitoring is a data graveyard. Set up alerts on DLQ message count. Automate the replay pipeline: when the consumer is fixed, a script should republish DLQ events back to the main topic. Without that, events pile up silently and the system drifts.

DLQConfig.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
package io.thecodeforge.eda.dlq;

import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Spring Kafka: configure retries and DLQ
DefaultErrorHandler errorHandler = new DefaultErrorHandler(
    (record, exception) -> {
        System.err.println("Event failed after retries: " + record.value());
        // Send to DLQ manually or let auto-DLQ routing handle
    },
    new FixedBackOff(1000L, 3) // 3 retries with 1 sec interval
);
Production Insight
DLQs can become black holes if nobody monitors them.
A silent DLQ backlog means events are lost, often causing data inconsistencies that surface days later.
Rule: Set an alert on DLQ topic message count and replay events immediately after fixing the consumer.
Key Takeaway
DLQs prevent retry storms but require active monitoring.
Every broker should have a DLQ retention policy and alert threshold.
Automate replay: once the consumer is fixed, a script republishes DLQ events.

Schema Evolution and Compatibility

Over time, event payloads change. A producer adds a new field; consumers built six months ago need to ignore it. Schema registries (like Confluent Schema Registry) enforce compatibility rules: backward, forward, full. Backward means new consumers can read old data (default). Forward means old consumers can read new data. Full means both. Using Avro or Protobuf with a schema registry ensures safe evolution.

The #1 cause of silent event loss in production is a breaking schema change. A producer removes a required field — every consumer downstream crashes with a deserialization error, events pile up in the DLQ, and the business impact is delayed by hours. Always set defaults for new fields and enforce backward compatibility.

SchemaEvolutionExample.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
package io.thecodeforge.eda.schema;

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

// Define schema with a required field and add an optional one later
Schema userSchema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"UserCreated\",\"namespace\":\"io.thecodeforge\"," +
    "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}," +
    "{\"name\":\"name\",\"type\":\"string\"}," +
    "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
);
GenericRecord userRecord = new GenericRecordBuilder(userSchema)
    .set("id", "user-123")
    .set("name", "Alice")
    .set("email", null)
    .build();
Production Insight
Breaking schema changes are the #1 cause of silent event loss.
Removing a field or changing its type causes deserialization failures that propagate to the DLQ.
Rule: Always set default values for new optional fields and use BACKWARD or FULL compatibility.
Key Takeaway
Schema evolution is a binary contract between producer and consumer.
Use Avro/Protobuf with a schema registry and enforce backward compatibility.
Never remove a field—mark it deprecated and start a new schema version if needed.

Transactional Outbox: Preventing Dual Write Failures

A common problem: your service updates a database and then publishes an event. If the publish fails, the event is lost and other services never see the change. This is the 'Dual Write' problem. The Transactional Outbox pattern solves it: write the event into an outbox table within the same database transaction. A separate process (polling publisher or CDC) reads the outbox and publishes the event to the broker. This guarantees at-least-once publication.

For low-latency needs, use Change Data Capture (CDC) with Debezium. It reads the database transaction log and publishes events within milliseconds. No polling overhead, no risk of missing events.

OutboxTransaction.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
package io.thecodeforge.eda.outbox;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OutboxTransaction {
    public void createOrderAndPublish(String orderId, String userId) {
        String sqlInsertOrder = "INSERT INTO orders (id, user_id) VALUES (?, ?)";
        String sqlInsertOutbox = "INSERT INTO outbox (event_type, payload, created_at) VALUES (?, ?, NOW())";

        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement orderStmt = conn.prepareStatement(sqlInsertOrder);
                 PreparedStatement outboxStmt = conn.prepareStatement(sqlInsertOutbox)) {
                orderStmt.setString(1, orderId);
                orderStmt.setString(2, userId);
                orderStmt.executeUpdate();

                outboxStmt.setString(1, "OrderCreated");
                outboxStmt.setString(2, "{\"orderId\":\"" + orderId + "\",\"userId\":\"" + userId + "\"}");
                outboxStmt.executeUpdate();

                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
Production Insight
Without an outbox, a lost event leads to silent data drift that may not be detected for hours.
CDC tools like Debezium can read the outbox via binlogs, reducing latency to milliseconds.
Rule: Always use an outbox or CDC when the same transaction must both persist and notify.
Key Takeaway
Dual write = lost events.
Transactional outbox ensures the database and message broker are eventually consistent.
Use CDC (Debezium) for low-latency, reliable outbox publishing.
● Production incidentPOST-MORTEMseverity: high

Missed Order Events During Kafka Partition Rebalance

Symptom
Order confirmation emails stopped arriving for a few minutes, then resumed. Some orders were never double-charged but the payment system missed the 'PaymentFailed' event.
Assumption
Kafka handles everything automatically; consumers will resume from where they left off.
Root cause
The consumer's max.poll.interval.ms was too high (5 min), so the broker thought the consumer died and triggered a rebalance. The new consumer started from the earliest offset because auto.offset.reset was set to earliest, reprocessing old events and skipping the uncommitted ones that hadn't been processed yet.
Fix
Set max.poll.interval.ms to a value lower than the rebalance timeout (e.g., 30 seconds) and use an idempotent consumer with manual offset commits after processing. Enable enable.auto.commit=false and commit offsets in batches after successful processing.
Key lesson
  • Always tune consumer timeouts to your processing latency, not the other way around.
  • Use manual offset commits combined with idempotent consumers to avoid skipping or double-processing.
  • Test rebalance scenarios in staging with actual event payloads.
Production debug guideSymptom-based actions for common event-driven failures4 entries
Symptom · 01
Events are not reaching the consumer
Fix
Check broker logs for authorization failures or topic missing. Verify consumer group offset: kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe. Ensure consumer is alive and not stuck in rebalance.
Symptom · 02
Duplicate event processing
Fix
Inspect consumer idempotency implementation. Check if the consumer checks a processed_events table before applying side effects. Look for missing eventId deduplication or incorrect ACK timing.
Symptom · 03
Event order is scrambled
Fix
Verify that producers partition by a consistent key (e.g., userId). Ensure consumers in the same group are not exceeding the number of partitions. Check that max.in.flight.requests.per.connection is set to 1 for guaranteed order if needed.
Symptom · 04
Consumer logs show 'CommitFailedException'
Fix
Reduce max.poll.records or increase session.timeout.ms. The consumer is taking too long to process a batch; either batch size or processing time is too high. Enable async commits and handle rebalance listeners.
★ Quick EDA Debug Cheat SheetOne-liner commands and immediate actions for the top 3 EDA emergencies
Kafka consumer not reading any new events
Immediate action
Check consumer group lag
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe
kafka-console-consumer --bootstrap-server localhost:9092 --topic my-topic --from-beginning --max-messages 5
Fix now
If lag shows '0' and no new messages, restart the consumer with correct bootstrap server. If lag > 0 and no consumption, check consumer thread is not blocked (thread dump).
Duplicate events being processed+
Immediate action
Temporarily increase idempotency check scope
Commands
grep 'processed_events' consumer.log | tail -100
SELECT count(*), event_id FROM processed_events GROUP BY event_id HAVING count(*) > 1;
Fix now
Add a UNIQUE constraint on event_id in the processed_events table to stop duplicates immediately. Then fix the consumer logic.
Event processing order incorrect+
Immediate action
Check partition count and key distribution
Commands
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -1
kafka-topics --describe --topic my-topic --bootstrap-server localhost:9092
Fix now
Ensure the number of partitions is at least as large as the number of consumers in the group. For strict ordering, use a single partition per entity and set max.in.flight.requests.per.connection=1.
Kafka vs RabbitMQ Comparison
FeatureApache KafkaRabbitMQ
Architectural StyleDistributed Append-only LogTraditional Message Broker
Consumer ModelPull-based (Consumers track offset)Push-based (Broker pushes to workers)
Message RetentionRetains after consumption (Replayable)Deleted after Ack (Transient)
ScalabilityHorizontal (add partitions/nodes)Vertical / Clustering (limited scale)
ProtocolCustom binary over TCPAMQP, MQTT, STOMP
Best Use CaseEvent Sourcing, Real-time Stream ProcessingWork Queues, Simple Decoupling, RPC

Key takeaways

1
Events are immutable facts—they record things that have already happened and cannot be changed.
2
The 'Broker' acts as the buffer
it ensures that producers are never blocked by slow or offline consumers.
3
Idempotency is non-negotiable in EDA—assume every event will be delivered at least twice.
4
Kafka is built for high-throughput and history; RabbitMQ is built for complex routing and immediate tasks.
5
Monitoring is harder in EDA—you need 'Distributed Tracing' (like Zipkin or Jaeger) to follow a request across async boundaries.
6
Use a schema registry to safely evolve event contracts without breaking consumers.
7
Never put business logic in event consumers that assume in-order delivery across partitions.

Common mistakes to avoid

4 patterns
×

Using commands instead of events (imperative naming)

Symptom
In production, services become tightly coupled. A CreateOrder event implies the order doesn't exist yet, so consumers wait for the database write to complete before reacting. This breaks the async promise and causes timeouts.
Fix
Rename events to past tense: OrderCreated, PaymentFailed. Commands should be imperative: CreateOrderCommand, RefundUser. Decouple intent from fact.
×

No idempotency in consumers

Symptom
Duplicate event processing leads to double charges, duplicate notifications, or inconsistent state. Often discovered when a network retry triggers the same event twice.
Fix
Implement an idempotent consumer using a processed_events table with a UNIQUE constraint on event_id. Check before processing, skip if already processed.
×

Ignoring schema evolution

Symptom
After a producer adds a required field, old consumers crash with deserialization errors. Events pile up in the DLQ, and the system becomes inconsistent for hours.
Fix
Use a schema registry (Confluent Schema Registry, Apicurio) with backward-compatible schemas. Always set defaults for new fields. Test schema changes in a staging environment first.
×

Choosing the wrong broker for the workload

Symptom
Using RabbitMQ for event sourcing leads to lost data because events are deleted after ack. Using Kafka for simple request/reply introduces unnecessary operational overhead and latency.
Fix
Match the broker to the consumption pattern: Kafka for log-based, replayable streams; RabbitMQ for task queues, RPC, and complex routing.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the 'At-Least-Once' delivery guarantee. Why does this necessitat...
Q02SENIOR
Kafka partitions provide ordering. What happens to the message order if ...
Q03SENIOR
What is the 'Transactional Outbox' pattern, and how does it prevent the ...
Q04SENIOR
Compare and contrast 'Fan-out' vs 'Point-to-Point' messaging. In RabbitM...
Q05SENIOR
How do you handle a 'Schema Evolution' problem in EDA when a producer ad...
Q01 of 05SENIOR

Explain the 'At-Least-Once' delivery guarantee. Why does this necessitate idempotency in your microservices?

ANSWER
At-Least-Once means the broker guarantees every message is delivered to a consumer at least once, but in failure scenarios (network issues, consumer crashes before ack) the same message may be delivered multiple times. Without idempotency, each delivery would trigger side effects (email, charge, database write) multiple times, causing duplicates or data corruption. Idempotency ensures that processing the same event multiple times has the same net effect as processing it once.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How do you handle idempotency in event consumers?
02
What is the 'Dead Letter Queue' (DLQ) and why do I need it?
03
When should I NOT use event-driven architecture?
04
How does Kafka ordering work when consumers scale up?
🔥

That's Architecture. Mark it forged?

4 min read · try the examples if you haven't

Previous
GraphQL vs REST
4 / 13 · Architecture
Next
CQRS Pattern