Mid 10 min · May 23, 2026

Kafka Producer & Consumer Deep Dive: acks, Retries, Batch Processing in Spring Boot

Q: Can I use @RetryableTopic with @Transactional?

Yes, but with important caveats. @RetryableTopic's retry mechanism publishes the failed message to a retry topic — this Kafka publish is a separate operation from your @Transactional database transaction. If the database transaction commits but the retry-topic publish fails (or vice versa), you can have inconsistencies. For transactional retry with exactly-once semantics, use @RetryableTopic with a transactional KafkaTemplate and ensure both the DB transaction and the Kafka retry publish are part of the same Kafka transaction (using KafkaTransactionManager). For most use cases, @RetryableTopic without transactions is sufficient — design your processing to be idempotent instead.

Q: What happens if the DLT itself is unavailable when DefaultErrorHandler tries to publish?

DeadLetterPublishingRecoverer uses a KafkaTemplate to publish to the DLT. If the DLT topic is unavailable or the publish fails, the recoverer throws an exception. This exception propagates back to the DefaultErrorHandler, which has already exhausted its retries for the original message. The behavior at this point depends on your error handler configuration — by default, the failed record is logged and skipped (the offset advances). To prevent DLT publish failures from causing message loss, monitor DLT publish success as a separate metric and ensure DLT topics have appropriate durability settings (replication factor 3, acks=all).

Q: How do I share a @RetryableTopic configuration across multiple @KafkaListener methods?

Create a custom composed annotation that combines @RetryableTopic and @KafkaListener with your standard retry configuration. Alternatively, configure retry behavior at the container factory level using RetryTopicConfiguration @Bean. A RetryTopicConfigurationBuilder lets you define retry policy, backoff, exception classification, and DLT strategy once and apply it to all listeners that use the configured factory. This is the recommended approach for organizations that want consistent retry behavior across all consumers without duplicating annotation parameters.

Q: Is acks=all significantly slower than acks=1?

The latency difference depends on your cluster network topology and replica lag. In a well-configured cluster with replicas on the same datacenter subnet, acks=all adds roughly the time for the leader to receive replica acknowledgments — typically 2-10ms additional latency per message. For throughput (messages per second), the impact is minimal because Kafka pipelines multiple in-flight sends. The latency cost is worth paying for all financial and state-changing events. For high-volume metrics where you're already aggregating and the individual message value is low, acks=1 or acks=0 is an acceptable trade-off.

Q: How do I control how many retry topics @RetryableTopic creates?

The number of retry topics equals `attempts - 1`. With `attempts=4`, Spring creates retry-0, retry-1, retry-2, and one DLT — four additional topics per main topic. Use `fixedDelayTopicStrategy = FixedDelayStrategy.SINGLE_TOPIC` to use a single retry topic for all retry attempts instead of one per attempt — this reduces topic count but requires the retry topic consumer to manage delay internally. For services with many @KafkaListener methods, this can significantly reduce broker topic count. The trade-off is slightly more complex retry topic management and less visibility into which retry attempt a message is on.

Q: Can I use batch consumers with @RetryableTopic?

Yes, starting from Spring Kafka 2.9+, @RetryableTopic supports batch listeners. When a batch listener throws, the entire batch is published to the retry topic as individual records (not as a batch), and each record is retried independently in subsequent retry topic consumer invocations. This means a batch of 100 records where record 50 fails results in record 50 going to retry-0 while records 0-49 and 51-99 are committed. This is actually more granular than naive batch retry and is the preferred approach for large batches with occasional failures.

Master Kafka producer ack modes, AckMode.

Naren · Founder

Plain-English first. Then code. Then the interview question.

About

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Producer acks=all + min.insync.replicas=2 gives the strongest durability guarantee — use for financial events
AckMode.MANUAL_IMMEDIATE commits offsets only after your business logic succeeds, preventing data loss on restart
@RetryableTopic(attempts=3, backoff=@Backoff(delay=1000, multiplier=2)) auto-configures retry topics and DLT routing
RecordFilterStrategy allows declarative message filtering before your listener method is invoked
Batch consumers with List> increase throughput 10-50x by amortizing per-record overhead

✦ Definition~90s read

What is Kafka Producer & Consumer?

Kafka's producer acknowledgment system (acks) controls the trade-off between durability, latency, and throughput at the protocol level. When a producer calls send(), the broker must acknowledge receipt before the producer considers the message safely delivered.

★

Think of Kafka ack modes like a certified letter vs.

The acknowledgment policy determines how many broker replicas must persist the message before that acknowledgment is sent. This is not a Spring-level abstraction — it's a core Kafka protocol guarantee that survives broker restarts, leader elections, and network partitions.

Offset management on the consumer side controls the at-least-once vs. at-most-once vs. exactly-once delivery semantics. An offset is a monotonically increasing integer identifying a message's position within a partition. Committing an offset tells the broker 'my consumer group has successfully processed everything up to this position.' If the consumer crashes before committing, the next consumer to take ownership of the partition will re-read from the last committed offset — giving at-least-once delivery.

Spring Kafka's AckMode enum controls when commits happen: automatically on poll (BATCH), after each record (RECORD), or only when the application explicitly calls Acknowledgment.acknowledge() (MANUAL_IMMEDIATE).

The @RetryableTopic annotation is Spring Kafka's declarative retry infrastructure. It creates sibling retry topics (e.g., orders-retry-0, orders-retry-1) and a dead-letter topic (orders-dlt) automatically. Failed messages are routed to retry topics with embedded delay metadata, consumed after the configured backoff period, and retried.

After exhausting all attempts, the message lands in the DLT where a @DltHandler method can perform specialized handling (alerting, manual review, compensation). This entire infrastructure is created from a single annotation.

Plain-English First

Think of Kafka ack modes like a certified letter vs. dropping a note in a mailbox. With acks=0, you just throw the letter in — no confirmation. With acks=1, the post office stamps 'received' but might lose it in a back-office fire. With acks=all, every post office branch confirms they have a copy before you leave. Retry topics are like a special tray for letters that couldn't be delivered — they wait their turn and get another attempt automatically.

The first Kafka integration you build usually works great in the demo. The producer sends, the consumer receives, everything looks smooth. Then your system hits its first real incident: the broker has a leadership election mid-deployment, your consumer receives a malformed message from a third-party service, a downstream database goes down for 45 seconds, and suddenly you realize your Kafka setup has no durability guarantees, no retry logic, and no way to isolate bad messages from good ones.

Producer acknowledgment modes are the first line of defense. acks=0 is performance theater — you're broadcasting messages into the void and hoping for the best. acks=1 is a false comfort — the leader acknowledges but hasn't replicated, so a leader failover between write and replication loses your data permanently. acks=all with min.insync.replicas=2 is the production standard: the message is durable on multiple brokers before the producer considers the send complete.

On the consumer side, offset management is where most production bugs live. Auto-commit increments the offset on a timer, not after successful processing — so a consumer crash mid-processing auto-commits the offset, and those messages are gone forever. AckMode.MANUAL_IMMEDIATE gives you exact control: commit only when your database write succeeds.

Retry logic with @RetryableTopic addresses the class of transient failures — downstream database timeouts, external API blips, temporary lock contention — where the message is valid but the processing environment wasn't ready. Spring Kafka creates retry topics automatically, routes failed messages through configurable backoff periods, and finally sends to a dead-letter topic if all attempts fail. No manual retry infrastructure required.

Batch consumers unlock an order-of-magnitude throughput improvement for high-volume processing. Instead of one database call per message, you receive List<ConsumerRecord>, process 500 records at once, and bulk-insert in a single transaction. The operational complexity increases, but so does your system's ability to keep up under load.

Producer Acknowledgment Modes: acks=0, acks=1, acks=all

The acks producer configuration is the most impactful durability setting in Kafka. It determines how many broker replicas must confirm message receipt before the KafkaProducer.send() future completes. Each value represents a distinct point on the latency-durability trade-off spectrum.

acks=0 (fire-and-forget): The producer sends the message and immediately considers it delivered, without waiting for any broker response. No acknowledgment packet traverses the network for this message. This maximizes throughput and minimizes latency but provides zero durability — if the broker is unavailable or rejects the message, the producer is unaware. Use cases are narrow: high-volume metrics, IoT sensor readings, or application logs where occasional loss is acceptable and the monitoring overhead of tracking every message exceeds the business value of any individual one.

acks=1 (leader-only): The partition leader writes the message to its local log and acknowledges the producer. This is faster than acks=all and provides protection against temporary broker unavailability. However, there is a critical gap: if the leader fails between writing the message and replicating it to followers, the message is lost permanently. The new elected leader (a follower that never received the message) has no record of it, and the producer already received an ACK. This failure scenario is not hypothetical — it occurs during every broker restart and leadership election in a rolling update.

acks=all (or acks=-1): The leader writes the message and waits for all in-sync replicas (ISRs) to confirm they have written it to their logs before acknowledging the producer. The ISR list is maintained by the broker — replicas that fall behind (haven't fetched from the leader within replica.lag.time.max.ms) are removed from the ISR. Setting min.insync.replicas=2 on the topic ensures that even if replicas lag out, at least 2 (leader + 1 follower) must confirm the write. If fewer than min.insync.replicas are in sync, the broker rejects the write with NotEnoughReplicasException — the producer should retry until replicas catch up.

In Spring Boot, set spring.kafka.producer.acks=all and configure min.insync.replicas on your NewTopic bean. Pair with enable.idempotence=true to get duplicate-free retries — the broker assigns a producer ID and sequence number to each message, deduplicating retransmissions that occur during retry after an ambiguous failure.

acks=all Requires min.insync.replicas=2 to Be Meaningful

If min.insync.replicas=1 (the default), acks=all is equivalent to acks=1 — only the leader needs to be in the ISR. Always set min.insync.replicas=2 on critical topics so that at least one follower must confirm the write, providing actual durability against leader failure.

Production Insight

Enable idempotent producers (enable.idempotence=true) for all production producers, not just those requiring exactly-once semantics. It prevents duplicate messages caused by producer retries after ambiguous network failures at zero additional operational cost.

Key Takeaway

acks=all + min.insync.replicas=2 + enable.idempotence=true is the minimum durability configuration for any event with real-world consequences.

AckMode.MANUAL_IMMEDIATE and Offset Management

Spring Kafka's AckMode controls when consumed message offsets are committed back to the Kafka broker. The default AckMode.BATCH commits offsets after each poll loop completes — after the listener method returns for all records in the batch, regardless of whether processing succeeded. This is convenient but dangerous: if your database write fails inside the listener after the poll completes, the offset is already committed and the message is silently lost.

AckMode.MANUAL_IMMEDIATE transfers offset commit control entirely to your code. The listener container injects an Acknowledgment object into your listener method. Calling ack.acknowledge() commits the current record's offset to the broker immediately. If you do not call acknowledge() — because an exception was thrown before reaching that line — the offset stays uncommitted and the message is redelivered on the next poll.

The `IMMEDIATE` suffix is important: AckMode.MANUAL batches acknowledgments and flushes them at the end of the poll loop, while MANUAL_IMMEDIATE commits the offset synchronously as soon as acknowledge() is called. For high-reliability consumers where you need to know that each record's offset is committed before proceeding to the next, use MANUAL_IMMEDIATE.

Pairing MANUAL_IMMEDIATE with error handling requires careful design. A naive implementation that calls ack.acknowledge() inside a try-catch that also catches the failure will acknowledge even on error — defeating the purpose. The correct pattern: call acknowledge() only in the success path. Configure SeekToCurrentErrorHandler on the container so that when an exception escapes the listener, the container seeks the consumer back to the failed record's offset, ensuring it is redelivered rather than skipped.

SeekToCurrentErrorHandler (renamed DefaultErrorHandler in Spring Kafka 2.8+) accepts a BackOff configuration for retry timing. With ExponentialBackOffWithMaxRetries(5), the handler retries the failed record 5 times with exponential backoff before routing to the dead-letter topic via DeadLetterPublishingRecoverer. This gives transient failures (DB timeout, external API blip) multiple chances to succeed before the message is quarantined.

Never Call ack.acknowledge() in a Finally Block

Placing ack.acknowledge() in a finally block ensures the offset is committed even when an exception is thrown — silently dropping failed messages. The DefaultErrorHandler's seek-back mechanism only works when exceptions propagate out of the listener method. Swallowing exceptions and acknowledging in finally is the most common cause of silent message loss in Spring Kafka applications.

Production Insight

For consumers that must maintain exactly-once processing, combine AckMode.MANUAL_IMMEDIATE with a database unique constraint on the message's correlation ID. On duplicate delivery, the insert fails with a constraint violation — add this to addNotRetryableExceptions so the offset advances and the duplicate is discarded cleanly.

Key Takeaway

AckMode.MANUAL_IMMEDIATE gives you full control over offset commits — use it for any consumer where losing a message has business consequences, and always pair it with DefaultErrorHandler for automatic retry and DLT routing.

@RetryableTopic: Declarative Retry with Exponential Backoff

Before @RetryableTopic, implementing retry logic for Kafka consumers required building custom retry infrastructure: separate retry topics, a scheduler to re-enqueue messages after a delay, a DLQ writer, and manual tracking of retry counts. Spring Kafka 2.7+ collapses all of this into a single annotation.

@RetryableTopic(attempts = 3, backoff = @Backoff(delay = 1000, multiplier = 2)) on a @KafkaListener method automatically creates two retry topics (orders-retry-0, orders-retry-1) and one dead-letter topic (orders-dlt). When the listener method throws a retryable exception, the framework publishes the message to orders-retry-0 with headers recording the original topic, the attempt number, and the timestamp after which the message should be reprocessed. A background listener on the retry topic waits until the delay has elapsed, then re-delivers the message to the main listener. After the configured number of attempts, the message routes to the DLT.

The retry delay strategy is configurable. @Backoff(delay=1000, multiplier=2, maxDelay=30000) implements exponential backoff: attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s, capped at 30s. This gives transient failures (DB connection timeouts typically last 1-5s) time to resolve while not holding retry topics open indefinitely.

Exception classification is granular. notRetryOn = {DataIntegrityViolationException.class, DeserializationException.class} sends these exceptions directly to the DLT without retrying — they will fail on every attempt, so retrying wastes resources and delays DLT processing. include = {OptimisticLockingFailureException.class} restricts retrying to only the listed exceptions, sending everything else to DLT.

dltStrategy = DltStrategy.FAIL_ON_ERROR (the default) routes to DLT when retries are exhausted. dltStrategy = DltStrategy.NO_DLT drops the message after exhaustion — dangerous, but useful for truly ephemeral events. The @DltHandler annotation on a method in the same class receives DLT messages for specialized handling: logging, alerting, inserting into a database for manual review, or triggering a compensation event.

By default, @RetryableTopic uses the same consumer group for retry and main topic consumption. This means retry processing uses the same concurrency settings as main topic processing. For high-volume retry scenarios, configure sameIntervalTopicReuseStrategy = SameIntervalTopicReuseStrategy.SINGLE_TOPIC to consolidate retries or override the container factory for retry topics separately.

@RetryableTopic Creates Real Kafka Topics — Account for Them

With autoCreateTopics=true, Spring Kafka creates orders-retry-0, orders-retry-1, orders-retry-2, and orders-dlt in your broker. With 10 main topics and 3 retry attempts, you have 40 additional topics. Include these in your topic naming conventions, monitoring, and retention policies. DLT topics accumulate messages indefinitely — always monitor their depth and set retention.

Production Insight

Inject @Header(KafkaHeaders.RECEIVED_TOPIC) String topic in your listener to detect whether you're processing a retry. Log retry attempts at WARN level with the original exception headers — this creates an automatic audit trail of which messages needed retries, invaluable for post-incident analysis.

Key Takeaway

@RetryableTopic replaces hundreds of lines of custom retry infrastructure with a single annotation — configure exception classification carefully to avoid retrying non-transient errors that will never succeed.

RecordFilterStrategy: Filtering Messages Before Processing

Not every message in a topic is meant for every consumer. A RecordFilterStrategy allows you to declaratively exclude messages that match a filter condition, without the consumer ever invoking the listener method. The filter runs after deserialization but before listener invocation — the consumer polls the record, deserializes it, evaluates the filter, and if the filter returns true (filtered), the record is acknowledged and the listener is skipped.

Common use cases: filtering by message type header (one topic carries multiple event types, but each consumer only handles one), filtering by tenant ID in a multi-tenant system, or excluding test messages (produced by integration tests) from production consumers.

The filter function receives the deserialized ConsumerRecord<K, V>. You have full access to the record's key, value, headers, topic, partition, and offset. The return value is a boolean: true means discard (filter out), false means pass through to the listener.

When a record is filtered, its offset still needs to be committed so the consumer doesn't re-read it endlessly. Set factory.setAckDiscarded(true) to auto-acknowledge filtered records. Without this, filtered records accumulate and the consumer's committed offset falls behind the end offset, making lag monitoring misleading — you'll see apparent lag from filtered records that will never be processed.

For batch listeners, BatchRecordFilterStrategy filters at the batch level. The default behavior is to filter individual records within the batch and deliver only the non-filtered subset to the listener. If all records in a batch are filtered, the listener is not called at all. Configure factory.setBatchToRecordFilterStrategy(strategy) for fine-grained control.

AckDiscarded Must Be True for Filtered Records

Without setAckDiscarded(true), filtered records are never acknowledged. The consumer's committed offset remains behind the filtered records, creating phantom lag that your monitoring dashboards will report. Always set ackDiscarded=true when using any RecordFilterStrategy.

Production Insight

For high-volume topics where most messages are irrelevant to a specific consumer, consider filtering on message headers rather than deserialized value. Header inspection avoids the deserialization cost for records that will be discarded anyway, reducing CPU and GC pressure on consumers handling millions of messages per hour.

Key Takeaway

RecordFilterStrategy enables clean separation of concerns between the topic producer (which may write multiple event types) and each consumer (which handles only specific types), without requiring separate topics per event type.

Batch Consumers with List

Individual record processing — one listener invocation per Kafka message — is clean and simple but inefficient at scale. For each record, you pay the overhead of listener dispatch, Spring proxy invocation, and crucially, one database round-trip. At 10,000 messages per second, that's 10,000 individual INSERT statements per second — far more than most databases sustain comfortably.

Batch consumers receive a List<ConsumerRecord<K, V>> (or List<V> for value-only access) representing all records returned by a single KafkaConsumer.poll() call. Instead of 500 individual database calls, you execute one bulk INSERT with 500 rows — a 50-100x reduction in database round-trips. Spring Boot's spring.kafka.listener.type=batch enables batch mode globally, or you can configure it per-factory.

Batch size is controlled by max.poll.records (how many records per poll, default 500) and the consumer's fetch.min.bytes / fetch.max.wait.ms settings which control how full the fetch buffer must be before returning. Setting max.poll.records=1000 with fetch.min.bytes=1048576 (1MB) creates large, efficient batches at the cost of slightly increased latency.

Error handling in batch mode is more complex than single-record mode. If the listener throws, the entire batch is the unit of retry — all records in the batch will be redelivered. This makes idempotent processing essential. The BatchToRecordAdapter bridges batch delivery with single-record error handling, allowing DefaultErrorHandler (which operates at the record level) to function within a batch context.

For partial batch commits — acknowledging records 0-249 but retrying records 250-499 — use Acknowledgment.nack(int index, Duration duration). This seeks the consumer back to record at index within the batch, ensuring records after the failure point are redelivered while records before it are committed. This is the correct approach when batch processing involves heterogeneous records that may fail independently.

Batch listeners integrate naturally with Spring's @Transactional — wrap the entire batch in one database transaction for atomic commit/rollback. This is significantly more efficient than one transaction per record. Be aware of transaction size limits and timeout settings when batches are very large.

Batch Retries Reprocess the Entire Batch

When a batch listener throws an exception (without calling ack.nack(index)), the entire batch is retried from the beginning. Records already successfully processed will be reprocessed. Always implement idempotent processing (unique constraint, Redis SETNX, or database upsert) before using batch consumers in production.

Production Insight

Monitor max-poll-records vs actual batch sizes in production using @Header(KafkaHeaders.BATCH_CONVERTED_MESSAGES) or a ConsumerAwareMessageListener. If actual batch sizes are consistently much smaller than max.poll.records, the topic has insufficient throughput to fill batches — consider reducing max.poll.records to avoid holding the poll lock longer than necessary.

Key Takeaway

Batch consumers deliver 10-50x database throughput improvement for bulk processing use cases, but require idempotent processing and careful error handling to safely handle partial batch failures.

SeekToCurrentErrorHandler and DefaultErrorHandler Patterns

The error handler sits between the listener container and your listener method. When an unhandled exception escapes your @KafkaListener method, the error handler decides what to do next: retry immediately, seek back to retry after backoff, skip the record, or route to a dead-letter topic.

Pre-2.8 Spring Kafka used SeekToCurrentErrorHandler. From 2.8+, DefaultErrorHandler is the unified replacement that handles both single-record and batch consumers. The behavior is identical: on exception, the container seeks the consumer back to the failed record's offset so the next poll re-reads it. Without this seek, the container would advance past the failed record (as if it were processed), losing the message.

The DefaultErrorHandler constructor accepts a BackOff and a ConsumerRecordRecoverer. The BackOff determines retry timing. FixedBackOff(1000L, 3) retries 3 times with 1-second delays between each. ExponentialBackOffWithMaxRetries(5) retries 5 times with exponential delays. After exhausting retries, the ConsumerRecordRecoverer handles the failed record. DeadLetterPublishingRecoverer publishes the failed record to a .DLT topic, including headers identifying the original topic, partition, offset, and exception details.

Exception classification controls which exceptions trigger retries and which go directly to the DLT. errorHandler.addNotRetryableExceptions(DeserializationException.class) ensures that malformed messages (which will always fail deserialization) bypass retries and go straight to the DLT. errorHandler.addRetryableExceptions(DataAccessException.class) forces retry for transient database exceptions. Without classification, all exceptions retry, wasting time on failures that are guaranteed to repeat.

The ExceptionClassifier interface provides fine-grained control, allowing a lambda or class to inspect the exception (including its cause chain) and return whether it should be retried. This handles wrapped exceptions where the root cause is what matters, not the wrapper type thrown by the listener.

DeserializationException Must Be Non-Retryable

A message that fails JSON deserialization will always fail — the bytes don't change between retry attempts. Without addNotRetryableExceptions(DeserializationException.class), a single malformed message will retry until all attempts are exhausted, blocking the partition for seconds or minutes. Always add DeserializationException to the non-retryable list.

Production Insight

Add retry attempt logging via errorHandler.setRetryListeners() — these log entries are invaluable during incidents to distinguish between 'this message failed once due to a DB timeout' (normal, resolved) and 'this message has failed 50 times' (poison message requiring investigation).

Key Takeaway

DefaultErrorHandler with DeserializationException in the non-retryable list and DeadLetterPublishingRecoverer is the production-standard error handling configuration for every Kafka consumer — no exceptions.

● Production incidentPOST-MORTEMseverity: high

Silent Message Loss During Broker Leader Election

Symptom

After a scheduled Kafka broker restart for security patching, the payments team noticed 3,200 transactions in the 'PENDING' state that never transitioned. No errors appeared in producer logs. Kafka consumer lag was zero. The messages simply didn't exist in the topic.

Assumption

The team assumed the missing messages were caused by a consumer bug — perhaps the consumer committed offsets without processing. They spent two days reviewing consumer code before escalating.

Root cause

The payment producer was configured with acks=1. During the broker restart, 3,200 messages were written to the leader and acknowledged to the producer. Before those messages could be replicated to followers, the leader went offline for patching. Kafka elected a follower (which didn't have the messages) as the new leader. The messages never existed on any surviving broker — they were acknowledged to the producer and then lost permanently when the old leader's unsynced data was discarded.

Fix

Change producer configuration to acks=all and set min.insync.replicas=2 on all payment-related topics. Also enable retries=Integer.MAX_VALUE and enable.idempotence=true on the producer to handle transient broker errors during leadership transitions without duplicates. Implement an outbox pattern for payment events so database and Kafka writes are atomic.

Key lesson

acks=1 provides a false sense of durability.
The acknowledgment means the leader received the message, not that the message is safe.
For any event representing a financial transaction or state change with real-world consequences, acks=all is non-negotiable.

Production debug guideSymptom → root cause → fix5 entries

Symptom · 01

Messages are processed but offset never advances — consumer appears stuck at same position

→

Fix

When using AckMode.MANUAL_IMMEDIATE, the offset only advances when ack.acknowledge() is called explicitly. Check that your error handling path also acknowledges (or that a SeekToCurrentErrorHandler is configured to seek back on failure). If an exception propagates past ack.acknowledge(), the offset is committed. If the exception is thrown before it, the offset stays — causing the message to be redelivered on the next poll. Review all exception paths in your listener method to ensure exactly one code path acknowledges.

Symptom · 02

@RetryableTopic not routing failed messages to retry topics

→

Fix

Verify that @EnableKafka is on your configuration class and that @RetryableTopic is on the @KafkaListener method, not the class. Confirm the retry topic names exist in the broker — Spring Boot creates them on startup, but if auto.create.topics.enable=false on the broker, you must create them via NewTopic beans or the admin CLI. Also check that your error is not excluded by notRetryOn — by default, @RetryableTopic retries all exceptions; subclasses of java.lang.Error and explicitly excluded exceptions go straight to DLT.

Symptom · 03

Producer intermittently throws TimeoutException or NotLeaderOrFollowerException

→

Fix

These exceptions indicate the producer is trying to write to a partition whose leader just changed (leadership election in progress). Ensure retries is set to a high value (Integer.MAX_VALUE) and retry.backoff.ms=300. With enable.idempotence=true, the producer assigns each message a sequence number and the broker deduplicates retries — enabling safe retry without producing duplicates. Add delivery.timeout.ms=120000 to allow up to 2 minutes for a message to be acknowledged, covering most election durations.

Symptom · 04

Batch consumer processes some records but fails mid-batch — uncertain which were processed

→

Fix

Batch consumer failures require careful offset management. If you throw from a batch listener, the entire batch is retried by default — any records already processed will be reprocessed. Implement idempotent processing (unique constraint in DB or Redis SET NX) so reprocessing is safe. Alternatively, use BatchToRecordAdapter to have Spring Kafka call your method once per record within the batch, making partial batch failures tractable. For partial commits within a batch, use Acknowledgment.nack(int index, Duration duration) to seek to a specific offset within the batch.

Symptom · 05

RecordFilterStrategy is configured but messages still reach the listener

→

Fix

Verify that the filter strategy is set on the ConcurrentKafkaListenerContainerFactory, not the @KafkaListener annotation. Check factory.setRecordFilterStrategy() is called before the factory is used to create containers. If using FilteringBatchMessageListenerAdapter, ensure the adapter wraps the correct listener. Also confirm that factory.setAckDiscarded(true) is set if you want filtered (discarded) records to have their offsets committed automatically — without this, filtered records remain uncommitted and are redelivered.

★ Producer & Consumer Debug Cheat SheetTargeted CLI commands for diagnosing producer delivery and consumer offset issues in production Kafka clusters.

Producer sending but messages missing from topic — verify actual broker receipt−

Immediate action

Tail the topic directly to confirm message arrival independent of consumer

Commands

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payments --partition 0 --offset earliest --max-messages 50

kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic payments --time -1

Fix now

If console consumer shows messages but your application consumer doesn't see them, the issue is consumer group offset — the committed offset is ahead of the message. Reset with kafka-consumer-groups.sh. If console consumer shows nothing, producer is not reaching the broker — check acks, broker connectivity, and topic existence.

Consumer group offset stuck — need to check exact committed offsets per partition+

Retry topics accumulating messages — need to verify retry routing is working+

Producer throughput lower than expected — diagnose batching and compression+

Kafka AckMode Comparison

AckMode	When Offset is Committed	Risk of Message Loss	Risk of Reprocessing	Best For
BATCH (default)	After poll loop completes, regardless of success	HIGH — commits even if listener threw	LOW	Non-critical events, logging
RECORD	After each listener method returns successfully	MEDIUM — commits after each successful call	LOW	Moderate reliability needs
MANUAL	When app calls `ack.acknowledge()`, batched until end of poll	LOW	LOW	High reliability, bulk ack
MANUAL_IMMEDIATE	When app calls `ack.acknowledge()`, committed immediately	VERY LOW	LOW (only on crash)	Financial transactions, critical events
COUNT	After N records are acknowledged	MEDIUM — partial batch loss on crash	LOW	Balanced throughput/reliability
TIME	After a configured time interval	MEDIUM — time-window loss on crash	LOW	Throughput-optimized consumers

Key takeaways

acks=all + min.insync.replicas=2 + enable.idempotence=true is the non-negotiable minimum for any Kafka producer handling events with real business consequences

AckMode.MANUAL_IMMEDIATE combined with DefaultErrorHandler is the safest offset management pattern

commits only on success, retries on failure, routes to DLT after exhaustion

@RetryableTopic eliminates custom retry infrastructure but requires careful exception classification to avoid retrying non-transient failures that will always fail

Batch consumers deliver 10-50x throughput improvements over individual record processing but require idempotent processing to handle redelivery safely

Always add DeserializationException to DefaultErrorHandler's non-retryable list

malformed messages will never deserialize successfully and should go directly to the DLT

Common mistakes to avoid

7 patterns

Using acks=1 for payment or order events

Symptom

Messages silently disappear during broker restarts, leadership elections, or rolling deployments — no errors in producer logs

Fix

Set acks=all and min.insync.replicas=2 for all topics where message loss has business impact. Enable enable.idempotence=true to prevent duplicate messages from producer retries.

Calling ack.acknowledge() inside a finally block or catch block

Symptom

Messages that fail processing still have their offsets committed — they are silently lost and never reprocessed or routed to DLT

Fix

Call ack.acknowledge() only in the happy path after all business logic succeeds. Let exceptions propagate to the error handler, which handles retry and DLT routing. Never acknowledge in exception handling paths.

Not adding DeserializationException to non-retryable exceptions

Symptom

A malformed message blocks the partition for the entire retry duration (attempts × backoff) before finally reaching the DLT, causing processing lag for all subsequent valid messages

Fix

Add errorHandler.addNotRetryableExceptions(DeserializationException.class) to send malformed messages directly to the DLT without retrying. A message that can't be deserialized will never succeed regardless of how many times you retry it.

Setting max.poll.records very high without adjusting max.poll.interval.ms

Symptom

Consumer is evicted from the group sporadically with 'max.poll.interval.ms exceeded' log messages, causing rebalances and lag spikes

Fix

If max.poll.records=1000 and each record takes 10ms to process, you need at least 10 seconds to process one poll. Set max.poll.interval.ms higher than (max.poll.records × max processing time per record). Alternatively, reduce max.poll.records until the math works with your current max.poll.interval.ms.

Using @RetryableTopic without monitoring DLT depth

Symptom

DLT accumulates thousands of failed messages that go unnoticed for days or weeks, representing data that the business assumes was processed

Fix

Add a Prometheus alert on DLT topic depth (kafka.consumer.fetch-manager.records-lag for the DLT consumer group, or a custom KafkaLagMonitor). Alert when DLT depth exceeds a threshold. Implement a @DltHandler method that sends an alert to PagerDuty or Slack immediately when a message hits the DLT.

Batch consumer without idempotent processing

Symptom

After a failure mid-batch, records at the beginning of the batch are inserted twice into the database — causing duplicate orders, payments, or inventory deductions

Fix

Add a unique constraint on the Kafka message offset + partition (or a business-level correlation ID) in your database. Use INSERT ... ON CONFLICT DO NOTHING (PostgreSQL) or REPLACE INTO (MySQL) for upsert semantics. The constraint prevents duplicates even when the batch is fully reprocessed.

Using the same container factory for main topic and retry topic listeners with different concurrency needs

Symptom

Retry processing saturates consumer threads, delaying processing of new messages on the main topic

Fix

Configure a separate, lower-concurrency container factory for retry consumers. Use @RetryableTopic's retryTopicSuffix and create a @KafkaListener on the retry topic with a dedicated factory that has concurrency=1 — retry processing rarely needs high parallelism.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR

What is the difference between acks=1 and acks=all, and when would you c...

Q02SENIOR

How does AckMode.MANUAL_IMMEDIATE differ from the default AckMode.BATCH,...

Q03SENIOR

How does @RetryableTopic work under the hood? What infrastructure does i...

Q04SENIOR

A batch consumer processes 500 records per poll and fails at record 237....

Q05SENIOR

What is enable.idempotence=true and how does it prevent duplicate messag...

Q06SENIOR

How would you implement a consumer that processes messages from multiple...

Q07SENIOR

Explain the max.poll.interval.ms setting and how it interacts with batch...

Q08SENIOR

What is the SeekToCurrentErrorHandler / DefaultErrorHandler seek behavio...

Q09SENIOR

How does RecordFilterStrategy affect offset commits for filtered message...

Q01 of 09JUNIOR

What is the difference between acks=1 and acks=all, and when would you choose each?

ANSWER

acks=1 means only the partition leader must acknowledge the write before the producer considers the message delivered. acks=all means all in-sync replicas (ISRs) must acknowledge. With acks=1, a message can be lost if the leader fails before replicating to followers — a real scenario during rolling deployments. With acks=all + min.insync.replicas=2, the message is durable even if any single broker fails. Choose acks=1 only for high-volume, loss-tolerant streams like metrics or logs. For any event with business impact (orders, payments, user state changes), use acks=all.

FAQ · 6 QUESTIONS

Frequently Asked Questions

Can I use @RetryableTopic with @Transactional?

What happens if the DLT itself is unavailable when DefaultErrorHandler tries to publish?

How do I share a @RetryableTopic configuration across multiple @KafkaListener methods?

Is acks=all significantly slower than acks=1?

How do I control how many retry topics @RetryableTopic creates?

Can I use batch consumers with @RetryableTopic?

🔥

That's Messaging. Mark it forged?

10 min read · try the examples if you haven't