Senior 5 min · June 25, 2026

Stream Processing: Why Your Batch Jobs Are Losing Money and How to Fix It

Stream processing explained with real production patterns.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

Production tested · Last updated June 25, 2026 · 1,663 articles, all by Naren
Quick Answer

Stream processing handles data records as they arrive, typically with millisecond-to-second latency, using frameworks like Apache Kafka Streams, Apache Flink, or Apache Spark Streaming. It's ideal for real-time analytics, monitoring, and event-driven applications where batch processing's latency is unacceptable.

Definition · ~90s read
What is Stream Processing?

Stream processing is the continuous, real-time handling of data as it arrives, rather than in batches. It enables low-latency computations, stateful operations, and exactly-once semantics for use cases like fraud detection, real-time analytics, and event-driven microservices.

Plain-English First

Imagine a conveyor belt in a factory. Batch processing is like stopping the belt, taking all items off, sorting them, then restarting. Stream processing is like inspecting and acting on each item as it passes by, without ever stopping the belt. You get results instantly, but you need to handle items that arrive out of order or in bursts.

You're losing money. Every second your data pipeline sits idle between batch runs, you're missing fraud, losing customers, and making decisions on stale data. I've seen a payments company lose $50k in a single night because their batch job ran every 15 minutes and a fraudster exploited the gap. Stream processing fixes that — but only if you understand the trade-offs.

The core problem is latency. Batch processing is simple: collect data for an hour, then run a job. But the world doesn't wait. Stream processing lets you react within milliseconds — but it introduces complexity: state management, out-of-order events, exactly-once semantics, and backpressure. Most tutorials gloss over these. This article won't.

By the end, you'll know exactly when to use stream processing, how to choose between Kafka Streams and Flink, how to handle backpressure without losing data, and the three production mistakes that will burn you. You'll also get a debug cheat sheet for when things go wrong at 3am.

The Problem: Why Batch Processing Fails at Real-Time

Batch processing is simple: collect data for a window, then run a job. But the latency is the window size. If you run every hour, you're an hour late. For fraud detection, that's catastrophic. I've seen a team try to reduce the batch window to 5 seconds; the overhead of starting and stopping Spark jobs killed throughput. Stream processing keeps a persistent pipeline, processing each record as it arrives. The trade-off: you must manage state, ordering, and failures continuously.

Without stream processing, you hack it: cron jobs every minute, polling databases, or using message queues with manual ack. These all break under load. Stream processing gives you exactly-once semantics, stateful operations, and automatic scaling — but only if you design for it.

BatchVsStream.systemdesign
// io.thecodeforge — System Design tutorial

// Batch approach: poll the table every 10 minutes
while (true) {
    Thread.sleep(10 * 60 * 1000L);  // wait 10 minutes between runs
    List<Event> batch = db.query("SELECT * FROM events WHERE processed = false");
    for (Event e : batch) {
        process(e);
        db.execute("UPDATE events SET processed = true WHERE id = ?", e.id);
    }
}
// Problems: up to 10 minutes of latency, row-by-row UPDATEs contending with writers,
// and a duplicate if the job crashes after process(e) but before the UPDATE

// Stream approach: process each event as it arrives
KStream<String, Event> stream = builder.stream("events");
stream
    .filter((key, event) -> event.isFraudulent())
    .to("fraud-alerts");
// Latency: milliseconds. No polling. Exactly-once with proper config.
Output
Batch: 10-minute latency, potential duplicates. Stream: <100ms latency, exactly-once.
Production Trap: Polling-Based Hacks
Never use SELECT ... FOR UPDATE in a batch loop for stream-like behavior. I've seen this cause a full table lock on a 500GB table, taking down the entire service. Use Kafka or a queue instead.
Figure: Stream Processing vs Batch: Real-Time Wins. From batch failure to stream processing with state, semantics, and backpressure. Steps shown: batch processing fails (high latency, stale data, money loss); choose a stream engine (Kafka Streams, Flink, or Spark Streaming); manage state (state stores, fault tolerance, consistency); exactly-once semantics (idempotent writes, transactional boundaries); handle backpressure (rate limiting, buffering, dynamic scaling); windowing and time (event time vs processing time, watermarks). Warning: ignoring state management leads to data loss, so always use persistent state stores and enable checkpointing.

You have three main options for a stream engine. Kafka Streams is lightweight, runs in your app process, and is perfect for simple transformations and joins. Flink is a heavy-duty cluster framework with advanced windowing, exactly-once, and state management. Spark Streaming is micro-batch: it's not true streaming, but it's good if you're already in the Spark ecosystem.

Here's the rule: if you can express your logic as a single Kafka topic transformation with state, use Kafka Streams. If you need complex event time windows, large state, or multi-source joins, use Flink. If you're doing batch analytics and want near-real-time, use Spark Streaming but accept 1-5 second latency.

I've seen teams over-engineer with Flink for a simple filter+map. The operational overhead of managing a Flink cluster killed their velocity. Start with Kafka Streams, graduate to Flink only when you hit its limits.

FrameworkComparison.systemdesign
// io.thecodeforge — System Design tutorial

// Kafka Streams: simple, embeddable
KStream<String, Order> orders = builder.stream("orders");
KTable<String, User> users = builder.table("users");
orders.join(users, (order, user) -> new EnrichedOrder(order, user))
      .to("enriched-orders");

// Flink: complex windowing
DataStream<Event> stream = env.addSource(new FlinkKafkaConsumer<>("events", ...));
stream
    .keyBy(Event::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new FraudAggregator())
    .addSink(new AlertSink());

// Spark Streaming: micro-batch
val stream = KafkaUtils.createDirectStream[String, String](ssc, ...)
stream.foreachRDD { rdd =>
    rdd.map(...).saveToCassandra()  // 2-second latency
}
Output
Kafka Streams: sub-100ms latency, no cluster. Flink: sub-100ms, cluster required. Spark: 1-5s latency, cluster required.
Senior Shortcut: Start with Kafka Streams
If you're on Kafka, start with Kafka Streams. It's simpler, runs in-process, and handles exactly-once with the right config. Only move to Flink when you need event-time windows or large state (>10GB per operator).

State Management: The Silent Killer

State is where stream processing gets hard. Every time you do a join, aggregation, or window, you're storing state. That state lives in memory or on disk, and it must survive failures. Kafka Streams uses RocksDB (embedded key-value store) and backs it up to Kafka changelog topics. Flink uses a state backend (memory, filesystem, or RocksDB) with periodic checkpoints.
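The changelog idea can be sketched without any framework. What follows is a toy model, not Kafka Streams code: `ChangelogBackedStore` and its methods are hypothetical names, the in-memory list stands in for a changelog topic, and the `HashMap` stands in for the local RocksDB store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a changelog-backed state store (hypothetical names, not a real API).
class ChangelogBackedStore {
    // Durable log of state mutations; stands in for a changelog topic and survives a crash.
    private final List<Map.Entry<String, Long>> changelog;
    // Local store; stands in for RocksDB and is lost when the instance dies.
    private final Map<String, Long> local = new HashMap<>();

    ChangelogBackedStore(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
        // Recovery: replay every mutation from the earliest offset; later writes win.
        for (Map.Entry<String, Long> change : changelog) {
            local.put(change.getKey(), change.getValue());
        }
    }

    // Stateful operation: bump a per-key counter and log the new value.
    long increment(String key) {
        long updated = local.getOrDefault(key, 0L) + 1;
        local.put(key, updated);
        changelog.add(Map.entry(key, updated));
        return updated;
    }

    long get(String key) {
        return local.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> log = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(log);
        store.increment("user-1");
        store.increment("user-1");
        store.increment("user-2");

        // Simulate a crash: the old instance is gone, only the log remains.
        ChangelogBackedStore recovered = new ChangelogBackedStore(log);
        System.out.println(recovered.get("user-1")); // 2
        System.out.println(recovered.get("user-2")); // 1
    }
}
```

Recovery time in this model is proportional to the length of the log, which is one reason large state stores are slow to restore.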

The rookie mistake: assuming state is free. I've seen a Kafka Streams app with a 24-hour window on a high-cardinality key (user_id) — the state store grew to 50GB and crashed the JVM. The fix: reduce window size, use a sliding window, or add state TTL.

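Always estimate state size. If the window retains every raw record, the upper bound is state_size = data_rate × window_duration: 10 MB/s × 3600 s = 36 GB. If it keeps one aggregate per key, it is state_size = key_cardinality × bytes_per_aggregate × open_windows. Either way, a high-rate, high-cardinality window is not going to fit in memory. Use RocksDB on disk.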
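A back-of-envelope estimate is easy to script. This sketch assumes two simple models that do not come from any library: a window that retains every raw record (bytes = rate × duration), and a window that keeps one fixed-size aggregate per key (bytes = keys × bytes per aggregate × open windows). `StateSizeEstimate` is a hypothetical helper, and the 100-byte aggregate size is a made-up figure for illustration.

```java
// Back-of-envelope state sizing (hypothetical helper, not part of any framework).
class StateSizeEstimate {
    // Upper bound when the window retains every raw record.
    static long rawRecordBytes(long bytesPerSecond, long windowSeconds) {
        return Math.multiplyExact(bytesPerSecond, windowSeconds);
    }

    // Size when the window keeps one fixed-size aggregate per key.
    static long aggregateBytes(long keyCardinality, long bytesPerAggregate, long openWindows) {
        return Math.multiplyExact(Math.multiplyExact(keyCardinality, bytesPerAggregate), openWindows);
    }

    public static void main(String[] args) {
        long raw = rawRecordBytes(10_000_000L, 3_600L);   // 10 MB/s for one hour
        long agg = aggregateBytes(1_000_000L, 100L, 1L);  // 1M keys, assumed 100 B each, one window
        System.out.println(raw / 1_000_000_000L + " GB raw");     // 36 GB raw
        System.out.println(agg / 1_000_000L + " MB aggregated");  // 100 MB aggregated
    }
}
```

The gap between the two numbers is the argument for aggregating early instead of buffering raw events.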

StateManagement.systemdesign
// io.thecodeforge — System Design tutorial

// Kafka Streams: configure state store
Properties props = new Properties();
props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/state");  // use fast SSD
props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);

// Custom RocksDB config to limit memory
public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        options.setMaxWriteBufferNumber(2);
        options.setWriteBufferSize(64 * 1024 * 1024L);  // 64MB per memtable
        // Block cache
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCacheSize(256 * 1024 * 1024L);  // 256MB cache
        options.setTableFormatConfig(tableConfig);
    }
}

// Flink: configure state backend
// The second argument (true) enables incremental checkpoints, so each checkpoint uploads only changed files
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:40010/flink-checkpoints", true));
Output
State store memory capped at ~500MB. Checkpoints to HDFS every 10s. Recovery time: ~30s for 10GB state.
Never Do This: Default State Store Config
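Default RocksDB config in Kafka Streams allocates a block cache and write buffers per state store, off-heap, so total memory grows with the number of stores and partitions and is not bounded by the JVM heap. Always set a custom RocksDBConfigSetter to cap memory. Otherwise, you'll hit OOM under load.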
Figure: Stream Processing State Failure Recovery. How RocksDB and Kafka changelogs keep state safe: an incoming record triggers a stateful operation (join, aggregation, window); the state is written to the local embedded RocksDB store; the mutation is sent to a Kafka changelog topic; on a crash the task fails and local state is lost; state is rebuilt by replaying the changelog from the earliest offset. Without changelogs, state loss means reprocessing from scratch.

Exactly-Once Semantics: The Devil in the Details

Exactly-once means each record is processed exactly one time, even if the producer retries or the consumer crashes. It's the holy grail, but it's not free. Kafka Streams achieves it via idempotent producers, transactional writes, and consumer offset commits in the same transaction. Flink uses checkpointing and two-phase commit.

The gotcha: exactly-once adds latency. Every batch of records requires a transaction commit, which adds ~10-100ms. For ultra-low-latency use cases (like real-time bidding), you might prefer at-least-once with deduplication downstream.

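I've seen a team enable exactly-once on a high-throughput topic (100k msg/s) and see throughput drop by 40%. The fix: commit less often by raising commit.interval.ms (it defaults to 100 ms under exactly-once), increase the producer batch size, and keep the producer's transaction.timeout.ms above the commit interval and below the broker's transaction.max.timeout.ms. Or switch to at-least-once if duplicates are acceptable.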
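If you go the at-least-once route, the downstream deduplication can be as small as a bounded set of recently seen IDs. A minimal sketch, assuming every record carries a unique, stable ID; `BoundedDeduplicator` is a hypothetical helper, not a Kafka or Flink API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Remembers the most recent record IDs so redelivered records can be skipped.
// Hypothetical helper; assumes every record carries a unique, stable ID.
class BoundedDeduplicator {
    private final Map<String, Boolean> seen;

    BoundedDeduplicator(final int capacity) {
        // Insertion-ordered map that evicts the oldest ID once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time an ID is seen, false for a duplicate still in memory.
    boolean firstTime(String recordId) {
        return seen.putIfAbsent(recordId, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        BoundedDeduplicator dedup = new BoundedDeduplicator(2);
        System.out.println(dedup.firstTime("tx-1")); // true
        System.out.println(dedup.firstTime("tx-1")); // false (redelivery)
        System.out.println(dedup.firstTime("tx-2")); // true
        System.out.println(dedup.firstTime("tx-3")); // true, evicts tx-1
        System.out.println(dedup.firstTime("tx-1")); // true again: tx-1 fell out of memory
    }
}
```

The last line is the trade-off: the capacity has to cover the longest redelivery gap you expect, or old duplicates slip through. This sketch also keeps the set in memory only, so it forgets everything on restart.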

ExactlyOnceConfig.systemdesign
// io.thecodeforge — System Design tutorial

// Kafka Streams: enable exactly-once
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
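// Exactly-once already implies idempotence and acks=all; the next two lines only make that explicit.
// The transaction timeout and batch size are the actual tuning knobs.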
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 60000);  // 60s
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);  // 64KB

// Flink: exactly-once with Kafka sink (legacy FlinkKafkaProducer API)
// The semantic is a constructor argument; the default is at-least-once
FlinkKafkaProducer<String> sink = new FlinkKafkaProducer<>(
    "output",
    new SimpleStringSchema(),
    properties,
    null,  // no custom partitioner
    FlinkKafkaProducer.Semantic.EXACTLY_ONCE,
    FlinkKafkaProducer.DEFAULT_KAFKA_PRODUCERS_POOL_SIZE);
// Requires checkpointing enabled
env.enableCheckpointing(10000);  // 10s interval
Output
Throughput: 50k msg/s with exactly-once (vs 80k with at-least-once). Latency: +50ms per batch.
Interview Gold: Exactly-Once vs At-Least-Once
Interviewers love this. Say: 'Exactly-once is a coordination protocol, not a magic flag. It adds latency and reduces throughput. Use it only when duplicates cause business errors (e.g., financial transactions). For logs or analytics, at-least-once with dedup is cheaper.'

Backpressure: When the Firehose Overwhelms You

Backpressure is when your producer sends data faster than your consumer can process. Without handling it, you'll OOM, drop messages, or crash. Kafka handles backpressure naturally: consumers pull at their own pace. But if your processing is slow, consumer lag grows. Flink has built-in backpressure: if a downstream operator is slow, it signals upstream to slow down.

The fix: monitor consumer lag. If it grows, you need more partitions or parallelism. Or you need to optimize your processing logic. I've seen a team ignore lag for hours, then try to catch up — the consumer tried to process millions of records at once and OOM'd. Solution: use max.poll.records to limit batch size, and use a separate thread pool for processing.
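The catch-up arithmetic is worth doing before you scale anything. A rough sketch, assuming steady produce and consume rates; `LagMath` and its method names are hypothetical, and the rates in `main` are invented for illustration.

```java
import java.util.OptionalLong;

// Rough lag arithmetic (hypothetical helper; assumes steady rates).
class LagMath {
    // Lag is the distance between the newest offset in the log and the committed offset.
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Seconds needed to drain the lag, rounded up.
    // Empty when the consumer is not faster than the producer: the lag never drains.
    static OptionalLong secondsToDrain(long lag, long consumePerSec, long producePerSec) {
        long headroom = consumePerSec - producePerSec;
        if (headroom <= 0) {
            return OptionalLong.empty();
        }
        return OptionalLong.of((lag + headroom - 1) / headroom);
    }

    public static void main(String[] args) {
        long lag = lag(5_000_000L, 2_000_000L);
        System.out.println(lag);                                                // 3000000
        System.out.println(secondsToDrain(lag, 60_000L, 50_000L).getAsLong());  // 300
        System.out.println(secondsToDrain(lag, 50_000L, 50_000L).isPresent());  // false
    }
}
```

Only the headroom (consume rate minus produce rate) drains lag, so a consumer that merely keeps pace with the producer never catches up.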

Another trick: use a dead letter queue for records that fail processing. Don't block the stream on a single bad record.
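The dead letter pattern reduces to a try/catch around the handler. In this sketch an in-memory list stands in for the dead letter topic, and `DeadLetterRouter` is a hypothetical name; a real pipeline would publish the failed record, ideally with the error attached, to a separate topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Routes records that fail processing to a side list instead of stopping the stream.
// Hypothetical helper; the list stands in for a dead letter topic.
class DeadLetterRouter<T> {
    private final Consumer<T> handler;
    private final List<T> deadLetters = new ArrayList<>();

    DeadLetterRouter(Consumer<T> handler) {
        this.handler = handler;
    }

    // Returns true if the record was processed, false if it was parked.
    boolean handle(T record) {
        try {
            handler.accept(record);
            return true;
        } catch (RuntimeException e) {
            deadLetters.add(record);  // park the bad record and keep going
            return false;
        }
    }

    List<T> deadLetters() {
        return deadLetters;
    }

    public static void main(String[] args) {
        // The handler throws NumberFormatException on non-numeric input.
        DeadLetterRouter<String> router = new DeadLetterRouter<>(s -> Integer.parseInt(s));
        for (String record : List.of("1", "oops", "3")) {
            router.handle(record);
        }
        System.out.println(router.deadLetters()); // [oops]
    }
}
```

Catching only `RuntimeException` is a deliberate choice here: errors such as `OutOfMemoryError` should still take the task down rather than be parked quietly.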

BackpressureHandling.systemdesign
// io.thecodeforge — System Design tutorial

// Kafka consumer: limit batch size
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);  // max 500 records per poll
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);  // 1MB per partition

// Flink: auto backpressure
// Flink automatically throttles sources when sinks are slow
// Monitor via web UI: backpressure status shows 'OK', 'LOW', 'HIGH'

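// Custom backpressure: use a bounded queue between the poll loop and the workers
BlockingQueue<ConsumerRecord<String, Event>> queue = new ArrayBlockingQueue<>(10000);
// Poll thread
while (true) {
    ConsumerRecords<String, Event> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, Event> r : records) {
        // put() blocks until there is room; offer() with a timeout would silently drop the record
        queue.put(r);
    }
    // If put() can block for longer than max.poll.interval.ms, pause() the partitions and keep polling
}
// Worker thread
while (true) {
    ConsumerRecord<String, Event> r = queue.take();
    process(r);
}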
Output
Consumer lag stays below 1000 records. No OOM. Dead letter queue catches 0.1% bad records.
The Classic Bug: Unbounded Internal Queues
Never use unbounded queues (e.g., LinkedBlockingQueue without capacity) between poll and process. Under load, the queue grows until OOM. Always use bounded queues with a sensible capacity.

Windowing and Time Semantics: Event Time vs Processing Time

Windowing groups records by time. Processing time uses the wall clock when the record is processed. Event time uses a timestamp embedded in the record. Event time is harder but necessary for out-of-order data. Flink handles event time natively with watermarks. Kafka Streams supports event time via TimestampExtractor.

The gotcha: event time requires handling late data. If a record arrives 10 minutes late, do you include it in the window or discard it? Flink allows setting an allowedLateness and a side output for late events. Kafka Streams has a grace period on the window, plus the suppress() operator to emit only final results once the window closes.
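The include-or-discard decision can be written down as a few comparisons against the watermark. This is a simplified model, not Flink's implementation: it assumes non-negative millisecond timestamps, a watermark equal to the highest event time seen minus a fixed out-of-orderness bound, and epoch-aligned tumbling windows. `LatenessClassifier` and its names are hypothetical.

```java
// Simplified watermark and lateness model (hypothetical; not Flink's actual code).
class LatenessClassifier {
    enum Verdict { ON_TIME, LATE_BUT_ACCEPTED, DROPPED_OR_SIDE_OUTPUT }

    private final long windowSizeMs;
    private final long maxOutOfOrderMs;
    private final long allowedLatenessMs;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    LatenessClassifier(long windowSizeMs, long maxOutOfOrderMs, long allowedLatenessMs) {
        this.windowSizeMs = windowSizeMs;
        this.maxOutOfOrderMs = maxOutOfOrderMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    Verdict classify(long eventTimeMs) {
        // The watermark trails the highest event time seen by the out-of-orderness bound.
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMs);
        long watermark = maxEventTimeSeen - maxOutOfOrderMs;
        // End of the tumbling window this record belongs to.
        long windowEnd = eventTimeMs - (eventTimeMs % windowSizeMs) + windowSizeMs;

        if (watermark < windowEnd) {
            return Verdict.ON_TIME;                 // window has not fired yet
        }
        if (watermark < windowEnd + allowedLatenessMs) {
            return Verdict.LATE_BUT_ACCEPTED;       // window fired, state still kept
        }
        return Verdict.DROPPED_OR_SIDE_OUTPUT;      // window state already cleaned up
    }

    public static void main(String[] args) {
        // 5-minute windows, 10 s out-of-orderness, 1 minute allowed lateness.
        LatenessClassifier c = new LatenessClassifier(300_000L, 10_000L, 60_000L);
        System.out.println(c.classify(290_000L)); // ON_TIME
        System.out.println(c.classify(320_000L)); // ON_TIME (advances the watermark to 310 s)
        System.out.println(c.classify(295_000L)); // LATE_BUT_ACCEPTED
        System.out.println(c.classify(400_000L)); // ON_TIME (advances the watermark to 390 s)
        System.out.println(c.classify(299_000L)); // DROPPED_OR_SIDE_OUTPUT
    }
}
```

Note that lateness is judged against the watermark, not the wall clock: a record is only late relative to how far event time has already advanced.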

I've seen a team use processing time for a fraud detection system. A network glitch delayed records by 30 seconds, causing them to fall into the wrong window and trigger false positives. The fix: switch to event time with a 1-minute allowed lateness.
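A 30-second delay is enough to move a record across a window boundary, which a few lines of arithmetic make concrete. A small sketch, assuming epoch-aligned 5-minute tumbling windows; `TumblingWindows` is a hypothetical helper and the timestamps are invented.

```java
import java.time.Duration;
import java.time.Instant;

// Window assignment for epoch-aligned tumbling windows (hypothetical helper).
class TumblingWindows {
    // Start of the tumbling window that contains the given timestamp.
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMs = size.toMillis();
        long ts = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    public static void main(String[] args) {
        Duration fiveMinutes = Duration.ofMinutes(5);
        Instant eventTime = Instant.parse("2026-06-25T11:59:50Z");  // when it happened
        Instant processingTime = eventTime.plusSeconds(30);         // when it was processed

        System.out.println(windowStart(eventTime, fiveMinutes));      // 2026-06-25T11:55:00Z
        System.out.println(windowStart(processingTime, fiveMinutes)); // 2026-06-25T12:00:00Z
    }
}
```

The same record lands in the 11:55 window by event time and in the 12:00 window by processing time, so any per-window count or threshold sees it in the wrong place.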

WindowingExample.systemdesign
// io.thecodeforge — System Design tutorial

// Flink: event time window with allowed lateness
DataStream<Event> stream = env.addSource(...)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((event, timestamp) -> event.getTimestamp())
    );

stream
    .keyBy(Event::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowedLateness(Time.minutes(1))  // late events within 1 min are included
    .sideOutputLateData(lateOutputTag)  // late events beyond 1 min go here
    .aggregate(new CountAggregate());

// Kafka Streams: event time with suppress
KStream<String, Event> stream = builder.stream("events", Consumed.with(Serdes.String(), eventSerde)
    .withTimestampExtractor(new MyTimestampExtractor()));

stream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
    .count()
    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))  // emit only when window closes
    .toStream()
    .to("windowed-counts");
Output
Windowed counts emitted with <1s delay for on-time events. Late events within 1 minute are included. Extremely late events go to side output.
Senior Shortcut: Use Event Time for Any Production System
Always use event time for production stream processing. Processing time is only acceptable for internal monitoring where latency is more important than accuracy. Event time with watermarks is the standard.
Figure: Event Time vs Processing Time (which timestamp to use for windowing?). Processing time: uses the wall clock when the record arrives; simple, low overhead, no watermark; fails on out-of-order or late data; results vary with system load. Event time: uses the timestamp embedded in the record; handles out-of-order data via watermarks; requires state to track the watermark; accurate even under backpressure. Event time is harder but necessary for correct windowed results.

When Not to Use Stream Processing

Stream processing is not a silver bullet. If your data volume is low (< 1000 events/second) and latency requirements are loose (> 1 minute), batch processing is simpler and cheaper. If you need complex joins across multiple data sources that change slowly, a database with materialized views might be better.

Also, stream processing adds operational complexity: you need to manage state, handle failures, and monitor lag. For a startup with a small team, a simple cron job or a message queue with a worker might be more maintainable.

I've seen a team use Flink to process 100 events per second from a single MySQL table. They spent weeks setting up the cluster, only to realize a simple Python script with polling would have worked. Don't over-engineer.

Interview Gold: When to Choose Batch Over Stream
Interviewers ask: 'When would you use batch instead of stream?' Answer: 'When latency requirements are >1 minute, data volume is low, or you need complex ad-hoc queries. Batch is simpler to debug and cheaper to operate. Stream is for real-time decisions.'
Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
A Kafka Streams app processing clickstream data would crash every 2 hours with OutOfMemoryError. Heap was 4GB, data rate ~10MB/s.
Assumption
We assumed the heap was too small. Doubled it to 8GB. Still crashed.
Root cause
The state store (RocksDB) was storing all windowed aggregates in memory because we set memory_mode to HEAP instead of DISK. The state grew unbounded with 24-hour windows.
Fix
Changed state store config to DISK mode, set rocksdb.block.cache-size to 256MB, and added a state TTL of 1 hour. Also enabled logging to offload changelogs to Kafka.
Key lesson
  • State stores are not free.
  • Always estimate state size (data rate × window duration for retained records, key cardinality × bytes per key for aggregates) and configure accordingly.
  • Defaults will kill you.
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.
Symptom · 01
Consumer lag growing despite no apparent bottleneck
Fix
1. Check if processing time per record > produce interval. 2. Increase partitions and parallelism. 3. Profile processing logic for slow operations (e.g., external API calls). 4. Consider async processing with a bounded queue.
Symptom · 02
OutOfMemoryError in Kafka Streams app
Fix
1. Check state store metrics via JMX (kafka.streams:type=stream-state-metrics). 2. Reduce window duration or state TTL. 3. Switch RocksDB to disk mode with capped block cache. 4. Increase heap if necessary.
Symptom · 03
Duplicate records in output despite exactly-once config
Fix
1. Verify producer idempotence is enabled (enable.idempotence=true). 2. Check that transactional coordinator is not timing out (increase transaction.timeout.ms). 3. Ensure consumer offsets are committed within the transaction. 4. Test with a simple producer to isolate issue.
Stream Processing Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
Consumer lag > 10000 in Kafka
Immediate action
Check if processing is slower than ingestion
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -1
Fix now
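Increase partitions: kafka-topics --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 20 (this changes which partition a key maps to, so plan for it on keyed topics). Increase consumer parallelism.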
OutOfMemoryError in Flink job
Immediate action
Check state backend size
Commands
curl http://flink-jobmanager:8081/jobs/{jobId}/checkpoints
Check RocksDB directory size: `du -sh /flink-state/rocksdb`
Fix now
Reduce state TTL (StateTtlConfig in the DataStream API, or table.exec.state.ttl: 1h for SQL jobs). Switch to RocksDB: state.backend: rocksdb. Increase taskmanager.memory.process.size.
Exactly-once violations (duplicates)
Immediate action
Check producer and consumer config
Commands
kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --describe
Check consumer group offsets: `kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe`
Fix now
Set enable.idempotence=true, acks=all, transaction.timeout.ms=60000 on producer. Ensure consumer commits offsets within transaction.
High latency in Flink job (seconds instead of milliseconds)
Immediate action
Check for backpressure
Commands
curl http://flink-jobmanager:8081/jobs/{jobId}/vertices/{vertexId}/backpressure
Check task manager logs for GC pauses: `grep 'Full GC' flink-taskmanager.log`
Fix now
Increase parallelism: flink run -p 10. Optimize operator logic. Increase heap to reduce GC.
Feature / Aspect | Kafka Streams | Apache Flink
Deployment | Embedded in app (no cluster) | Requires Flink cluster (JobManager + TaskManagers)
State Backend | RocksDB (embedded) | RocksDB, Filesystem, or Memory (with checkpointing)
Exactly-Once | Transactional producer + consumer | Checkpointing + two-phase commit
Event Time Support | Via TimestampExtractor | Native with watermarks and allowed lateness
Windowing | Tumbling, hopping, sliding, session | Tumbling, hopping, sliding, session, global, custom
Operational Complexity | Low (just add library) | High (cluster management, checkpoint config)
Best For | Simple transformations, joins, aggregations | Complex event-time windows, large state, multi-source

Key takeaways

1. Stream processing is for low-latency, real-time decisions. If you can tolerate >1 minute latency, batch is simpler.
2. State management is the hardest part. Always estimate state size and configure RocksDB with memory limits.
3. Exactly-once semantics add latency and reduce throughput. Use only when duplicates cause business errors.
4. Event time with watermarks is the standard for production systems. Processing time is only for internal monitoring.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does Kafka Streams achieve exactly-once semantics, and what are the ...
Q02 (Senior): When would you choose Flink over Kafka Streams for a stream processing j...
Q03 (Senior): What happens when a Flink job fails mid-checkpoint? How does it recover?
Q04 (Junior): What is the difference between event time and processing time? When woul...
Q05 (Senior): You see consumer lag growing in a Kafka Streams app. What steps do you t...
Q06 (Senior): Design a real-time fraud detection system that processes credit card tra...

Q01 of 06 · Senior

How does Kafka Streams achieve exactly-once semantics, and what are the trade-offs?

ANSWER
Kafka Streams uses idempotent producers, transactional writes, and consumer offset commits within the same transaction. The trade-off is increased latency (due to transaction commits) and reduced throughput (up to 40% in some cases). It also requires the Kafka cluster to have transaction coordination enabled.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is stream processing used for?
02. What's the difference between stream processing and batch processing?
03. How do I handle backpressure in stream processing?
04. What happens when a stream processing job crashes? How do I recover without data loss?