Mid 10 min · May 23, 2026

Kafka Producer & Consumer Deep Dive: acks, Retries, Batch Processing in Spring Boot

Master Kafka producer ack modes, AckMode.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Producer acks=all + min.insync.replicas=2 gives the strongest durability guarantee — use for financial events
  • AckMode.MANUAL_IMMEDIATE commits offsets only after your business logic succeeds, preventing data loss on restart
  • @RetryableTopic(attempts=3, backoff=@Backoff(delay=1000, multiplier=2)) auto-configures retry topics and DLT routing
  • RecordFilterStrategy allows declarative message filtering before your listener method is invoked
  • Batch consumers with List> increase throughput 10-50x by amortizing per-record overhead
✦ Definition~90s read
What is Kafka Producer & Consumer?

Kafka's producer acknowledgment system (acks) controls the trade-off between durability, latency, and throughput at the protocol level. When a producer calls send(), the broker must acknowledge receipt before the producer considers the message safely delivered.

Think of Kafka ack modes like a certified letter vs.

The acknowledgment policy determines how many broker replicas must persist the message before that acknowledgment is sent. This is not a Spring-level abstraction — it's a core Kafka protocol guarantee that survives broker restarts, leader elections, and network partitions.

Offset management on the consumer side controls the at-least-once vs. at-most-once vs. exactly-once delivery semantics. An offset is a monotonically increasing integer identifying a message's position within a partition. Committing an offset tells the broker 'my consumer group has successfully processed everything up to this position.' If the consumer crashes before committing, the next consumer to take ownership of the partition will re-read from the last committed offset — giving at-least-once delivery.

Spring Kafka's AckMode enum controls when commits happen: automatically on poll (BATCH), after each record (RECORD), or only when the application explicitly calls Acknowledgment.acknowledge() (MANUAL_IMMEDIATE).

The @RetryableTopic annotation is Spring Kafka's declarative retry infrastructure. It creates sibling retry topics (e.g., orders-retry-0, orders-retry-1) and a dead-letter topic (orders-dlt) automatically. Failed messages are routed to retry topics with embedded delay metadata, consumed after the configured backoff period, and retried.

After exhausting all attempts, the message lands in the DLT where a @DltHandler method can perform specialized handling (alerting, manual review, compensation). This entire infrastructure is created from a single annotation.

Plain-English First

Think of Kafka ack modes like a certified letter vs. dropping a note in a mailbox. With acks=0, you just throw the letter in — no confirmation. With acks=1, the post office stamps 'received' but might lose it in a back-office fire. With acks=all, every post office branch confirms they have a copy before you leave. Retry topics are like a special tray for letters that couldn't be delivered — they wait their turn and get another attempt automatically.

The first Kafka integration you build usually works great in the demo. The producer sends, the consumer receives, everything looks smooth. Then your system hits its first real incident: the broker has a leadership election mid-deployment, your consumer receives a malformed message from a third-party service, a downstream database goes down for 45 seconds, and suddenly you realize your Kafka setup has no durability guarantees, no retry logic, and no way to isolate bad messages from good ones.

Producer acknowledgment modes are the first line of defense. acks=0 is performance theater — you're broadcasting messages into the void and hoping for the best. acks=1 is a false comfort — the leader acknowledges but hasn't replicated, so a leader failover between write and replication loses your data permanently. acks=all with min.insync.replicas=2 is the production standard: the message is durable on multiple brokers before the producer considers the send complete.

On the consumer side, offset management is where most production bugs live. Auto-commit increments the offset on a timer, not after successful processing — so a consumer crash mid-processing auto-commits the offset, and those messages are gone forever. AckMode.MANUAL_IMMEDIATE gives you exact control: commit only when your database write succeeds.

Retry logic with @RetryableTopic addresses the class of transient failures — downstream database timeouts, external API blips, temporary lock contention — where the message is valid but the processing environment wasn't ready. Spring Kafka creates retry topics automatically, routes failed messages through configurable backoff periods, and finally sends to a dead-letter topic if all attempts fail. No manual retry infrastructure required.

Batch consumers unlock an order-of-magnitude throughput improvement for high-volume processing. Instead of one database call per message, you receive List<ConsumerRecord>, process 500 records at once, and bulk-insert in a single transaction. The operational complexity increases, but so does your system's ability to keep up under load.

Producer Acknowledgment Modes: acks=0, acks=1, acks=all

The acks producer configuration is the most impactful durability setting in Kafka. It determines how many broker replicas must confirm message receipt before the KafkaProducer.send() future completes. Each value represents a distinct point on the latency-durability trade-off spectrum.

acks=0 (fire-and-forget): The producer sends the message and immediately considers it delivered, without waiting for any broker response. No acknowledgment packet traverses the network for this message. This maximizes throughput and minimizes latency but provides zero durability — if the broker is unavailable or rejects the message, the producer is unaware. Use cases are narrow: high-volume metrics, IoT sensor readings, or application logs where occasional loss is acceptable and the monitoring overhead of tracking every message exceeds the business value of any individual one.

acks=1 (leader-only): The partition leader writes the message to its local log and acknowledges the producer. This is faster than acks=all and provides protection against temporary broker unavailability. However, there is a critical gap: if the leader fails between writing the message and replicating it to followers, the message is lost permanently. The new elected leader (a follower that never received the message) has no record of it, and the producer already received an ACK. This failure scenario is not hypothetical — it occurs during every broker restart and leadership election in a rolling update.

acks=all (or acks=-1): The leader writes the message and waits for all in-sync replicas (ISRs) to confirm they have written it to their logs before acknowledging the producer. The ISR list is maintained by the broker — replicas that fall behind (haven't fetched from the leader within replica.lag.time.max.ms) are removed from the ISR. Setting min.insync.replicas=2 on the topic ensures that even if replicas lag out, at least 2 (leader + 1 follower) must confirm the write. If fewer than min.insync.replicas are in sync, the broker rejects the write with NotEnoughReplicasException — the producer should retry until replicas catch up.

In Spring Boot, set spring.kafka.producer.acks=all and configure min.insync.replicas on your NewTopic bean. Pair with enable.idempotence=true to get duplicate-free retries — the broker assigns a producer ID and sequence number to each message, deduplicating retransmissions that occur during retry after an ambiguous failure.

acks=all Requires min.insync.replicas=2 to Be Meaningful
If min.insync.replicas=1 (the default), acks=all is equivalent to acks=1 — only the leader needs to be in the ISR. Always set min.insync.replicas=2 on critical topics so that at least one follower must confirm the write, providing actual durability against leader failure.
Production Insight
Enable idempotent producers (enable.idempotence=true) for all production producers, not just those requiring exactly-once semantics. It prevents duplicate messages caused by producer retries after ambiguous network failures at zero additional operational cost.
Key Takeaway
acks=all + min.insync.replicas=2 + enable.idempotence=true is the minimum durability configuration for any event with real-world consequences.

AckMode.MANUAL_IMMEDIATE and Offset Management

Spring Kafka's AckMode controls when consumed message offsets are committed back to the Kafka broker. The default AckMode.BATCH commits offsets after each poll loop completes — after the listener method returns for all records in the batch, regardless of whether processing succeeded. This is convenient but dangerous: if your database write fails inside the listener after the poll completes, the offset is already committed and the message is silently lost.

AckMode.MANUAL_IMMEDIATE transfers offset commit control entirely to your code. The listener container injects an Acknowledgment object into your listener method. Calling ack.acknowledge() commits the current record's offset to the broker immediately. If you do not call acknowledge() — because an exception was thrown before reaching that line — the offset stays uncommitted and the message is redelivered on the next poll.

The `IMMEDIATE` suffix is important: AckMode.MANUAL batches acknowledgments and flushes them at the end of the poll loop, while MANUAL_IMMEDIATE commits the offset synchronously as soon as acknowledge() is called. For high-reliability consumers where you need to know that each record's offset is committed before proceeding to the next, use MANUAL_IMMEDIATE.

Pairing MANUAL_IMMEDIATE with error handling requires careful design. A naive implementation that calls ack.acknowledge() inside a try-catch that also catches the failure will acknowledge even on error — defeating the purpose. The correct pattern: call acknowledge() only in the success path. Configure SeekToCurrentErrorHandler on the container so that when an exception escapes the listener, the container seeks the consumer back to the failed record's offset, ensuring it is redelivered rather than skipped.

SeekToCurrentErrorHandler (renamed DefaultErrorHandler in Spring Kafka 2.8+) accepts a BackOff configuration for retry timing. With ExponentialBackOffWithMaxRetries(5), the handler retries the failed record 5 times with exponential backoff before routing to the dead-letter topic via DeadLetterPublishingRecoverer. This gives transient failures (DB timeout, external API blip) multiple chances to succeed before the message is quarantined.

Never Call ack.acknowledge() in a Finally Block
Placing ack.acknowledge() in a finally block ensures the offset is committed even when an exception is thrown — silently dropping failed messages. The DefaultErrorHandler's seek-back mechanism only works when exceptions propagate out of the listener method. Swallowing exceptions and acknowledging in finally is the most common cause of silent message loss in Spring Kafka applications.
Production Insight
For consumers that must maintain exactly-once processing, combine AckMode.MANUAL_IMMEDIATE with a database unique constraint on the message's correlation ID. On duplicate delivery, the insert fails with a constraint violation — add this to addNotRetryableExceptions so the offset advances and the duplicate is discarded cleanly.
Key Takeaway
AckMode.MANUAL_IMMEDIATE gives you full control over offset commits — use it for any consumer where losing a message has business consequences, and always pair it with DefaultErrorHandler for automatic retry and DLT routing.

@RetryableTopic: Declarative Retry with Exponential Backoff

Before @RetryableTopic, implementing retry logic for Kafka consumers required building custom retry infrastructure: separate retry topics, a scheduler to re-enqueue messages after a delay, a DLQ writer, and manual tracking of retry counts. Spring Kafka 2.7+ collapses all of this into a single annotation.

@RetryableTopic(attempts = 3, backoff = @Backoff(delay = 1000, multiplier = 2)) on a @KafkaListener method automatically creates two retry topics (orders-retry-0, orders-retry-1) and one dead-letter topic (orders-dlt). When the listener method throws a retryable exception, the framework publishes the message to orders-retry-0 with headers recording the original topic, the attempt number, and the timestamp after which the message should be reprocessed. A background listener on the retry topic waits until the delay has elapsed, then re-delivers the message to the main listener. After the configured number of attempts, the message routes to the DLT.

The retry delay strategy is configurable. @Backoff(delay=1000, multiplier=2, maxDelay=30000) implements exponential backoff: attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s, capped at 30s. This gives transient failures (DB connection timeouts typically last 1-5s) time to resolve while not holding retry topics open indefinitely.

Exception classification is granular. notRetryOn = {DataIntegrityViolationException.class, DeserializationException.class} sends these exceptions directly to the DLT without retrying — they will fail on every attempt, so retrying wastes resources and delays DLT processing. include = {OptimisticLockingFailureException.class} restricts retrying to only the listed exceptions, sending everything else to DLT.

dltStrategy = DltStrategy.FAIL_ON_ERROR (the default) routes to DLT when retries are exhausted. dltStrategy = DltStrategy.NO_DLT drops the message after exhaustion — dangerous, but useful for truly ephemeral events. The @DltHandler annotation on a method in the same class receives DLT messages for specialized handling: logging, alerting, inserting into a database for manual review, or triggering a compensation event.

By default, @RetryableTopic uses the same consumer group for retry and main topic consumption. This means retry processing uses the same concurrency settings as main topic processing. For high-volume retry scenarios, configure sameIntervalTopicReuseStrategy = SameIntervalTopicReuseStrategy.SINGLE_TOPIC to consolidate retries or override the container factory for retry topics separately.

@RetryableTopic Creates Real Kafka Topics — Account for Them
With autoCreateTopics=true, Spring Kafka creates orders-retry-0, orders-retry-1, orders-retry-2, and orders-dlt in your broker. With 10 main topics and 3 retry attempts, you have 40 additional topics. Include these in your topic naming conventions, monitoring, and retention policies. DLT topics accumulate messages indefinitely — always monitor their depth and set retention.
Production Insight
Inject @Header(KafkaHeaders.RECEIVED_TOPIC) String topic in your listener to detect whether you're processing a retry. Log retry attempts at WARN level with the original exception headers — this creates an automatic audit trail of which messages needed retries, invaluable for post-incident analysis.
Key Takeaway
@RetryableTopic replaces hundreds of lines of custom retry infrastructure with a single annotation — configure exception classification carefully to avoid retrying non-transient errors that will never succeed.

RecordFilterStrategy: Filtering Messages Before Processing

Not every message in a topic is meant for every consumer. A RecordFilterStrategy allows you to declaratively exclude messages that match a filter condition, without the consumer ever invoking the listener method. The filter runs after deserialization but before listener invocation — the consumer polls the record, deserializes it, evaluates the filter, and if the filter returns true (filtered), the record is acknowledged and the listener is skipped.

Common use cases: filtering by message type header (one topic carries multiple event types, but each consumer only handles one), filtering by tenant ID in a multi-tenant system, or excluding test messages (produced by integration tests) from production consumers.

The filter function receives the deserialized ConsumerRecord<K, V>. You have full access to the record's key, value, headers, topic, partition, and offset. The return value is a boolean: true means discard (filter out), false means pass through to the listener.

When a record is filtered, its offset still needs to be committed so the consumer doesn't re-read it endlessly. Set factory.setAckDiscarded(true) to auto-acknowledge filtered records. Without this, filtered records accumulate and the consumer's committed offset falls behind the end offset, making lag monitoring misleading — you'll see apparent lag from filtered records that will never be processed.

For batch listeners, BatchRecordFilterStrategy filters at the batch level. The default behavior is to filter individual records within the batch and deliver only the non-filtered subset to the listener. If all records in a batch are filtered, the listener is not called at all. Configure factory.setBatchToRecordFilterStrategy(strategy) for fine-grained control.

AckDiscarded Must Be True for Filtered Records
Without setAckDiscarded(true), filtered records are never acknowledged. The consumer's committed offset remains behind the filtered records, creating phantom lag that your monitoring dashboards will report. Always set ackDiscarded=true when using any RecordFilterStrategy.
Production Insight
For high-volume topics where most messages are irrelevant to a specific consumer, consider filtering on message headers rather than deserialized value. Header inspection avoids the deserialization cost for records that will be discarded anyway, reducing CPU and GC pressure on consumers handling millions of messages per hour.
Key Takeaway
RecordFilterStrategy enables clean separation of concerns between the topic producer (which may write multiple event types) and each consumer (which handles only specific types), without requiring separate topics per event type.

Batch Consumers with List

Individual record processing — one listener invocation per Kafka message — is clean and simple but inefficient at scale. For each record, you pay the overhead of listener dispatch, Spring proxy invocation, and crucially, one database round-trip. At 10,000 messages per second, that's 10,000 individual INSERT statements per second — far more than most databases sustain comfortably.

Batch consumers receive a List<ConsumerRecord<K, V>> (or List<V> for value-only access) representing all records returned by a single KafkaConsumer.poll() call. Instead of 500 individual database calls, you execute one bulk INSERT with 500 rows — a 50-100x reduction in database round-trips. Spring Boot's spring.kafka.listener.type=batch enables batch mode globally, or you can configure it per-factory.

Batch size is controlled by max.poll.records (how many records per poll, default 500) and the consumer's fetch.min.bytes / fetch.max.wait.ms settings which control how full the fetch buffer must be before returning. Setting max.poll.records=1000 with fetch.min.bytes=1048576 (1MB) creates large, efficient batches at the cost of slightly increased latency.

Error handling in batch mode is more complex than single-record mode. If the listener throws, the entire batch is the unit of retry — all records in the batch will be redelivered. This makes idempotent processing essential. The BatchToRecordAdapter bridges batch delivery with single-record error handling, allowing DefaultErrorHandler (which operates at the record level) to function within a batch context.

For partial batch commits — acknowledging records 0-249 but retrying records 250-499 — use Acknowledgment.nack(int index, Duration duration). This seeks the consumer back to record at index within the batch, ensuring records after the failure point are redelivered while records before it are committed. This is the correct approach when batch processing involves heterogeneous records that may fail independently.

Batch listeners integrate naturally with Spring's @Transactional — wrap the entire batch in one database transaction for atomic commit/rollback. This is significantly more efficient than one transaction per record. Be aware of transaction size limits and timeout settings when batches are very large.

Batch Retries Reprocess the Entire Batch
When a batch listener throws an exception (without calling ack.nack(index)), the entire batch is retried from the beginning. Records already successfully processed will be reprocessed. Always implement idempotent processing (unique constraint, Redis SETNX, or database upsert) before using batch consumers in production.
Production Insight
Monitor max-poll-records vs actual batch sizes in production using @Header(KafkaHeaders.BATCH_CONVERTED_MESSAGES) or a ConsumerAwareMessageListener. If actual batch sizes are consistently much smaller than max.poll.records, the topic has insufficient throughput to fill batches — consider reducing max.poll.records to avoid holding the poll lock longer than necessary.
Key Takeaway
Batch consumers deliver 10-50x database throughput improvement for bulk processing use cases, but require idempotent processing and careful error handling to safely handle partial batch failures.

SeekToCurrentErrorHandler and DefaultErrorHandler Patterns

The error handler sits between the listener container and your listener method. When an unhandled exception escapes your @KafkaListener method, the error handler decides what to do next: retry immediately, seek back to retry after backoff, skip the record, or route to a dead-letter topic.

Pre-2.8 Spring Kafka used SeekToCurrentErrorHandler. From 2.8+, DefaultErrorHandler is the unified replacement that handles both single-record and batch consumers. The behavior is identical: on exception, the container seeks the consumer back to the failed record's offset so the next poll re-reads it. Without this seek, the container would advance past the failed record (as if it were processed), losing the message.

The DefaultErrorHandler constructor accepts a BackOff and a ConsumerRecordRecoverer. The BackOff determines retry timing. FixedBackOff(1000L, 3) retries 3 times with 1-second delays between each. ExponentialBackOffWithMaxRetries(5) retries 5 times with exponential delays. After exhausting retries, the ConsumerRecordRecoverer handles the failed record. DeadLetterPublishingRecoverer publishes the failed record to a .DLT topic, including headers identifying the original topic, partition, offset, and exception details.

Exception classification controls which exceptions trigger retries and which go directly to the DLT. errorHandler.addNotRetryableExceptions(DeserializationException.class) ensures that malformed messages (which will always fail deserialization) bypass retries and go straight to the DLT. errorHandler.addRetryableExceptions(DataAccessException.class) forces retry for transient database exceptions. Without classification, all exceptions retry, wasting time on failures that are guaranteed to repeat.

The ExceptionClassifier interface provides fine-grained control, allowing a lambda or class to inspect the exception (including its cause chain) and return whether it should be retried. This handles wrapped exceptions where the root cause is what matters, not the wrapper type thrown by the listener.

DeserializationException Must Be Non-Retryable
A message that fails JSON deserialization will always fail — the bytes don't change between retry attempts. Without addNotRetryableExceptions(DeserializationException.class), a single malformed message will retry until all attempts are exhausted, blocking the partition for seconds or minutes. Always add DeserializationException to the non-retryable list.
Production Insight
Add retry attempt logging via errorHandler.setRetryListeners() — these log entries are invaluable during incidents to distinguish between 'this message failed once due to a DB timeout' (normal, resolved) and 'this message has failed 50 times' (poison message requiring investigation).
Key Takeaway
DefaultErrorHandler with DeserializationException in the non-retryable list and DeadLetterPublishingRecoverer is the production-standard error handling configuration for every Kafka consumer — no exceptions.
● Production incidentPOST-MORTEMseverity: high

Silent Message Loss During Broker Leader Election

Symptom
After a scheduled Kafka broker restart for security patching, the payments team noticed 3,200 transactions in the 'PENDING' state that never transitioned. No errors appeared in producer logs. Kafka consumer lag was zero. The messages simply didn't exist in the topic.
Assumption
The team assumed the missing messages were caused by a consumer bug — perhaps the consumer committed offsets without processing. They spent two days reviewing consumer code before escalating.
Root cause
The payment producer was configured with acks=1. During the broker restart, 3,200 messages were written to the leader and acknowledged to the producer. Before those messages could be replicated to followers, the leader went offline for patching. Kafka elected a follower (which didn't have the messages) as the new leader. The messages never existed on any surviving broker — they were acknowledged to the producer and then lost permanently when the old leader's unsynced data was discarded.
Fix
Change producer configuration to acks=all and set min.insync.replicas=2 on all payment-related topics. Also enable retries=Integer.MAX_VALUE and enable.idempotence=true on the producer to handle transient broker errors during leadership transitions without duplicates. Implement an outbox pattern for payment events so database and Kafka writes are atomic.
Key lesson
  • acks=1 provides a false sense of durability.
  • The acknowledgment means the leader received the message, not that the message is safe.
  • For any event representing a financial transaction or state change with real-world consequences, acks=all is non-negotiable.
Production debug guideSymptom → root cause → fix5 entries
Symptom · 01
Messages are processed but offset never advances — consumer appears stuck at same position
Fix
When using AckMode.MANUAL_IMMEDIATE, the offset only advances when ack.acknowledge() is called explicitly. Check that your error handling path also acknowledges (or that a SeekToCurrentErrorHandler is configured to seek back on failure). If an exception propagates past ack.acknowledge(), the offset is committed. If the exception is thrown before it, the offset stays — causing the message to be redelivered on the next poll. Review all exception paths in your listener method to ensure exactly one code path acknowledges.
Symptom · 02
@RetryableTopic not routing failed messages to retry topics
Fix
Verify that @EnableKafka is on your configuration class and that @RetryableTopic is on the @KafkaListener method, not the class. Confirm the retry topic names exist in the broker — Spring Boot creates them on startup, but if auto.create.topics.enable=false on the broker, you must create them via NewTopic beans or the admin CLI. Also check that your error is not excluded by notRetryOn — by default, @RetryableTopic retries all exceptions; subclasses of java.lang.Error and explicitly excluded exceptions go straight to DLT.
Symptom · 03
Producer intermittently throws TimeoutException or NotLeaderOrFollowerException
Fix
These exceptions indicate the producer is trying to write to a partition whose leader just changed (leadership election in progress). Ensure retries is set to a high value (Integer.MAX_VALUE) and retry.backoff.ms=300. With enable.idempotence=true, the producer assigns each message a sequence number and the broker deduplicates retries — enabling safe retry without producing duplicates. Add delivery.timeout.ms=120000 to allow up to 2 minutes for a message to be acknowledged, covering most election durations.
Symptom · 04
Batch consumer processes some records but fails mid-batch — uncertain which were processed
Fix
Batch consumer failures require careful offset management. If you throw from a batch listener, the entire batch is retried by default — any records already processed will be reprocessed. Implement idempotent processing (unique constraint in DB or Redis SET NX) so reprocessing is safe. Alternatively, use BatchToRecordAdapter to have Spring Kafka call your method once per record within the batch, making partial batch failures tractable. For partial commits within a batch, use Acknowledgment.nack(int index, Duration duration) to seek to a specific offset within the batch.
Symptom · 05
RecordFilterStrategy is configured but messages still reach the listener
Fix
Verify that the filter strategy is set on the ConcurrentKafkaListenerContainerFactory, not the @KafkaListener annotation. Check factory.setRecordFilterStrategy() is called before the factory is used to create containers. If using FilteringBatchMessageListenerAdapter, ensure the adapter wraps the correct listener. Also confirm that factory.setAckDiscarded(true) is set if you want filtered (discarded) records to have their offsets committed automatically — without this, filtered records remain uncommitted and are redelivered.
★ Producer & Consumer Debug Cheat SheetTargeted CLI commands for diagnosing producer delivery and consumer offset issues in production Kafka clusters.
Producer sending but messages missing from topic — verify actual broker receipt
Immediate action
Tail the topic directly to confirm message arrival independent of consumer
Commands
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payments --partition 0 --offset earliest --max-messages 50
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic payments --time -1
Fix now
If console consumer shows messages but your application consumer doesn't see them, the issue is consumer group offset — the committed offset is ahead of the message. Reset with kafka-consumer-groups.sh. If console consumer shows nothing, producer is not reaching the broker — check acks, broker connectivity, and topic existence.
Consumer group offset stuck — need to check exact committed offsets per partition+
Immediate action
List committed offsets for each partition in the consumer group
Commands
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group payment-service --describe
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group payment-service --reset-offsets --to-offset 12345 --topic payments:0 --execute
Fix now
Use --reset-offsets with --to-offset to move a specific partition's offset to a precise position. Always use --dry-run first. Stop all consumer instances in the group before resetting — offsets cannot be reset while consumers are active.
Retry topics accumulating messages — need to verify retry routing is working+
Immediate action
Check all retry topic consumer groups for lag and activity
Commands
kafka-topics.sh --bootstrap-server localhost:9092 --list | grep -E 'retry|dlt|DLT'
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group payment-service
Fix now
If retry topics have lag but the consumer group shows them as assigned, the retry consumer is running but processing slowly — check the retry delay configuration. If partitions are unassigned, the retry listener is not running (application startup issue or @RetryableTopic misconfiguration). Check application logs for 'Creating retry topic' messages on startup.
Producer throughput lower than expected — diagnose batching and compression+
Immediate action
Check producer metrics via JMX or actuator to see batch fill rate and compression ratio
Commands
kafka-producer-perf-test.sh --topic payments --num-records 100000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092 acks=all compression.type=snappy linger.ms=5
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic payments
Fix now
If perf test throughput is much higher than application throughput, your application is not batching efficiently. Set linger.ms=5-20 and batch.size=65536 in producer properties. Verify compression.type=snappy is configured — JSON payloads typically compress 60-70%, dramatically increasing effective throughput.
Kafka AckMode Comparison
AckModeWhen Offset is CommittedRisk of Message LossRisk of ReprocessingBest For
BATCH (default)After poll loop completes, regardless of successHIGH — commits even if listener threwLOWNon-critical events, logging
RECORDAfter each listener method returns successfullyMEDIUM — commits after each successful callLOWModerate reliability needs
MANUALWhen app calls ack.acknowledge(), batched until end of pollLOWLOWHigh reliability, bulk ack
MANUAL_IMMEDIATEWhen app calls ack.acknowledge(), committed immediatelyVERY LOWLOW (only on crash)Financial transactions, critical events
COUNTAfter N records are acknowledgedMEDIUM — partial batch loss on crashLOWBalanced throughput/reliability
TIMEAfter a configured time intervalMEDIUM — time-window loss on crashLOWThroughput-optimized consumers

Key takeaways

1
acks=all + min.insync.replicas=2 + enable.idempotence=true is the non-negotiable minimum for any Kafka producer handling events with real business consequences
2
AckMode.MANUAL_IMMEDIATE combined with DefaultErrorHandler is the safest offset management pattern
commits only on success, retries on failure, routes to DLT after exhaustion
3
@RetryableTopic eliminates custom retry infrastructure but requires careful exception classification to avoid retrying non-transient failures that will always fail
4
Batch consumers deliver 10-50x throughput improvements over individual record processing but require idempotent processing to handle redelivery safely
5
Always add DeserializationException to DefaultErrorHandler's non-retryable list
malformed messages will never deserialize successfully and should go directly to the DLT

Common mistakes to avoid

7 patterns
×

Using acks=1 for payment or order events

Symptom
Messages silently disappear during broker restarts, leadership elections, or rolling deployments — no errors in producer logs
Fix
Set acks=all and min.insync.replicas=2 for all topics where message loss has business impact. Enable enable.idempotence=true to prevent duplicate messages from producer retries.
×

Calling ack.acknowledge() inside a finally block or catch block

Symptom
Messages that fail processing still have their offsets committed — they are silently lost and never reprocessed or routed to DLT
Fix
Call ack.acknowledge() only in the happy path after all business logic succeeds. Let exceptions propagate to the error handler, which handles retry and DLT routing. Never acknowledge in exception handling paths.
×

Not adding DeserializationException to non-retryable exceptions

Symptom
A malformed message blocks the partition for the entire retry duration (attempts × backoff) before finally reaching the DLT, causing processing lag for all subsequent valid messages
Fix
Add errorHandler.addNotRetryableExceptions(DeserializationException.class) to send malformed messages directly to the DLT without retrying. A message that can't be deserialized will never succeed regardless of how many times you retry it.
×

Setting max.poll.records very high without adjusting max.poll.interval.ms

Symptom
Consumer is evicted from the group sporadically with 'max.poll.interval.ms exceeded' log messages, causing rebalances and lag spikes
Fix
If max.poll.records=1000 and each record takes 10ms to process, you need at least 10 seconds to process one poll. Set max.poll.interval.ms higher than (max.poll.records × max processing time per record). Alternatively, reduce max.poll.records until the math works with your current max.poll.interval.ms.
×

Using @RetryableTopic without monitoring DLT depth

Symptom
DLT accumulates thousands of failed messages that go unnoticed for days or weeks, representing data that the business assumes was processed
Fix
Add a Prometheus alert on DLT topic depth (kafka.consumer.fetch-manager.records-lag for the DLT consumer group, or a custom KafkaLagMonitor). Alert when DLT depth exceeds a threshold. Implement a @DltHandler method that sends an alert to PagerDuty or Slack immediately when a message hits the DLT.
×

Batch consumer without idempotent processing

Symptom
After a failure mid-batch, records at the beginning of the batch are inserted twice into the database — causing duplicate orders, payments, or inventory deductions
Fix
Add a unique constraint on the Kafka message offset + partition (or a business-level correlation ID) in your database. Use INSERT ... ON CONFLICT DO NOTHING (PostgreSQL) or REPLACE INTO (MySQL) for upsert semantics. The constraint prevents duplicates even when the batch is fully reprocessed.
×

Using the same container factory for main topic and retry topic listeners with different concurrency needs

Symptom
Retry processing saturates consumer threads, delaying processing of new messages on the main topic
Fix
Configure a separate, lower-concurrency container factory for retry consumers. Use @RetryableTopic's retryTopicSuffix and create a @KafkaListener on the retry topic with a dedicated factory that has concurrency=1 — retry processing rarely needs high parallelism.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
What is the difference between acks=1 and acks=all, and when would you c...
Q02SENIOR
How does AckMode.MANUAL_IMMEDIATE differ from the default AckMode.BATCH,...
Q03SENIOR
How does @RetryableTopic work under the hood? What infrastructure does i...
Q04SENIOR
A batch consumer processes 500 records per poll and fails at record 237....
Q05SENIOR
What is enable.idempotence=true and how does it prevent duplicate messag...
Q06SENIOR
How would you implement a consumer that processes messages from multiple...
Q07SENIOR
Explain the max.poll.interval.ms setting and how it interacts with batch...
Q08SENIOR
What is the SeekToCurrentErrorHandler / DefaultErrorHandler seek behavio...
Q09SENIOR
How does RecordFilterStrategy affect offset commits for filtered message...
Q01 of 09JUNIOR

What is the difference between acks=1 and acks=all, and when would you choose each?

ANSWER
acks=1 means only the partition leader must acknowledge the write before the producer considers the message delivered. acks=all means all in-sync replicas (ISRs) must acknowledge. With acks=1, a message can be lost if the leader fails before replicating to followers — a real scenario during rolling deployments. With acks=all + min.insync.replicas=2, the message is durable even if any single broker fails. Choose acks=1 only for high-volume, loss-tolerant streams like metrics or logs. For any event with business impact (orders, payments, user state changes), use acks=all.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01
Can I use @RetryableTopic with @Transactional?
02
What happens if the DLT itself is unavailable when DefaultErrorHandler tries to publish?
03
How do I share a @RetryableTopic configuration across multiple @KafkaListener methods?
04
Is acks=all significantly slower than acks=1?
05
How do I control how many retry topics @RetryableTopic creates?
06
Can I use batch consumers with @RetryableTopic?
🔥

That's Messaging. Mark it forged?

10 min read · try the examples if you haven't

Previous
Apache Kafka with Spring Boot — Getting Started
2 / 5 · Messaging
Next
RabbitMQ with Spring Boot