Stream Processing: Why Your Batch Jobs Are Losing Money and How to Fix It
Stream processing explained with real production patterns.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
Stream processing handles data records as they arrive, often with millisecond-level latency, using frameworks like Kafka Streams, Apache Flink, or Apache Spark Streaming. It's ideal for real-time analytics, monitoring, and event-driven applications where batch processing's latency is unacceptable.
Imagine a conveyor belt in a factory. Batch processing is like stopping the belt, taking all items off, sorting them, then restarting. Stream processing is like inspecting and acting on each item as it passes by, without ever stopping the belt. You get results instantly, but you need to handle items that arrive out of order or in bursts.
You're losing money. Every second your data pipeline sits idle between batch runs, you're missing fraud, losing customers, and making decisions on stale data. I've seen a payments company lose $50k in a single night because their batch job ran every 15 minutes and a fraudster exploited the gap. Stream processing fixes that — but only if you understand the trade-offs.
The core problem is latency. Batch processing is simple: collect data for an hour, then run a job. But the world doesn't wait. Stream processing lets you react within milliseconds — but it introduces complexity: state management, out-of-order events, exactly-once semantics, and backpressure. Most tutorials gloss over these. This article won't.
By the end, you'll know exactly when to use stream processing, how to choose between Kafka Streams and Flink, how to handle backpressure without losing data, and the three production mistakes that will burn you. You'll also get a debug cheat sheet for when things go wrong at 3am.
The Problem: Why Batch Processing Fails at Real-Time
Batch processing is simple: collect data for a window, then run a job. But the latency is at least the window size. If you run every hour, you're up to an hour late. For fraud detection, that's catastrophic. I've seen a team try to reduce the batch window to 5 seconds; the overhead of starting and stopping Spark jobs killed throughput. Stream processing keeps a persistent pipeline, processing each record as it arrives. The trade-off: you must manage state, ordering, and failures continuously.
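To put numbers on the latency claim, here is a minimal Python sketch. The function names and the sample event times are hypothetical, and it assumes batch runs fire at fixed multiples of the interval with zero startup and run time (both of which only make the real delay worse).

```python
import math

def batch_delay(event_time_s, interval_s):
    """Seconds an event waits until the next batch run picks it up.

    Assumes runs fire at t = interval, 2 * interval, ... and that an
    event landing exactly on a run boundary is picked up by that run.
    """
    next_run = math.ceil(event_time_s / interval_s) * interval_s
    return next_run - event_time_s

def worst_and_mean_delay(event_times_s, interval_s):
    """Worst-case and average wait across a list of event times."""
    delays = [batch_delay(t, interval_s) for t in event_times_s]
    return max(delays), sum(delays) / len(delays)

# Four events inside one 15-minute (900 s) batch interval.
events = [1, 300, 450, 899]
worst, mean = worst_and_mean_delay(events, 900)
print(worst, mean)  # 899 487.5
```

An event that arrives one second after a run waits almost the full interval. A streaming pipeline's delay is instead bounded by its per-record processing time.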
Without stream processing, you hack it: cron jobs every minute, polling databases, or using message queues with manual ack. These all break under load. Stream processing gives you exactly-once semantics, stateful operations, and automatic scaling — but only if you design for it.
Choosing Your Weapon: Kafka Streams vs Flink vs Spark Streaming
You have three main options. Kafka Streams is lightweight, runs in your app process, and is perfect for simple transformations and joins. Flink is a heavy-duty cluster framework with advanced windowing, exactly-once, and state management. Spark Streaming is micro-batch — it's not true streaming, but it's good if you're already in the Spark ecosystem.
Here's the rule: if you can express your logic as a single Kafka topic transformation with state, use Kafka Streams. If you need complex event time windows, large state, or multi-source joins, use Flink. If you're doing batch analytics and want near-real-time, use Spark Streaming but accept 1-5 second latency.
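The rule above can be written down as a small decision helper. This is only a sketch of the heuristic as stated here: the Workload fields and the pick_framework function are hypothetical names, and the fallback to Kafka Streams reflects the "start simple" advice rather than any hard requirement.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    single_topic_transform: bool     # one Kafka topic transformation with state
    complex_event_time: bool         # complex event time windows
    large_state: bool                # state too big for an embedded store
    multi_source_joins: bool         # joins across several sources
    already_on_spark: bool           # existing Spark batch analytics
    tolerates_seconds_latency: bool  # 1-5 second latency is acceptable

def pick_framework(w: Workload) -> str:
    # Needs that exceed an embedded library push you to a cluster framework.
    if w.complex_event_time or w.large_state or w.multi_source_joins:
        return "Flink"
    if w.single_topic_transform:
        return "Kafka Streams"
    if w.already_on_spark and w.tolerates_seconds_latency:
        return "Spark Streaming"
    # Default: start with the lightest option.
    return "Kafka Streams"

simple = Workload(True, False, False, False, False, False)
heavy = Workload(False, True, True, True, False, False)
spark_shop = Workload(False, False, False, False, True, True)
print(pick_framework(simple), pick_framework(heavy), pick_framework(spark_shop))
```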
I've seen teams over-engineer with Flink for a simple filter+map. The operational overhead of managing a Flink cluster killed their velocity. Start with Kafka Streams, graduate to Flink only when you hit its limits.
State Management: The Silent Killer
State is where stream processing gets hard. Every time you do a join, aggregation, or window, you're storing state. That state lives in memory or on disk, and it must survive failures. Kafka Streams uses RocksDB (embedded key-value store) and backs it up to Kafka changelog topics. Flink uses a state backend (memory, filesystem, or RocksDB) with periodic checkpoints.
The rookie mistake: assuming state is free. I've seen a Kafka Streams app with a 24-hour window on a high-cardinality key (user_id) — the state store grew to 50GB and crashed the JVM. The fix: reduce window size, use a sliding window, or add state TTL.
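To illustrate what a state TTL buys you, here is a toy in-memory store in Python. It is not the Kafka Streams or Flink API. TtlStore is a hypothetical class, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class TtlStore:
    """Key-value store that forgets entries not updated within ttl_s seconds."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._data = {}  # key -> (value, last_update_time)

    def put(self, key, value, now):
        self._data[key] = (value, now)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, updated = entry
        if now - updated > self.ttl_s:
            del self._data[key]  # lazily drop the expired entry
            return None
        return value

    def expire(self, now):
        """Remove every expired entry; return how many were dropped."""
        dead = [k for k, (_, t) in self._data.items() if now - t > self.ttl_s]
        for k in dead:
            del self._data[k]
        return len(dead)

    def __len__(self):
        return len(self._data)

store = TtlStore(ttl_s=3600)
store.put("user-1", 5, now=0)
store.put("user-2", 7, now=3000)
dropped = store.expire(now=4000)  # user-1 is 4000 s old, user-2 only 1000 s
print(dropped, len(store))        # 1 1
```

With a TTL, state size follows the number of recently active keys, not the number of keys ever seen.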
Always estimate state size: state_size = data_rate × window_duration × key_cardinality, where data_rate is the write rate per key. For example, 10 KB/s per key × 3600 s × 1M keys = 36 TB. That's not going to fit in memory. Use RocksDB on disk.
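A back-of-the-envelope estimate is easy to script. The sketch below assumes the data rate is measured per key; if you measure the aggregate rate across all keys, pass key_cardinality=1. The function names and the sample numbers are illustrative only.

```python
def estimate_state_bytes(per_key_rate_bytes_s, window_s, key_cardinality):
    """Rough upper bound on windowed state: rate x duration x number of keys."""
    return per_key_rate_bytes_s * window_s * key_cardinality

def human(n_bytes):
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1000

# Hypothetical workload: 10 KB/s per key, 1-hour window, 1M keys.
size = estimate_state_bytes(10_000, 3600, 1_000_000)
print(human(size))  # 36 TB
```

Run this before you pick a window size. It is cheaper than finding the answer in a heap dump.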
Exactly-Once Semantics: The Devil in the Details
Exactly-once means each record is processed exactly one time, even if the producer retries or the consumer crashes. It's the holy grail, but it's not free. Kafka Streams achieves it via idempotent producers, transactional writes, and consumer offset commits in the same transaction. Flink uses checkpointing and two-phase commit.
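The core idea, committing output and input offset together or not at all, can be shown with a toy simulation. This is a sketch in plain Python, not the Kafka or Flink API. TransactionalSink and run are hypothetical names, and the "processing" step simply doubles each number.

```python
class TransactionalSink:
    """Toy stand-in for an output topic plus the committed input offset."""

    def __init__(self):
        self.committed_output = []
        self.committed_offset = 0

    def commit(self, pending_output, next_offset):
        # Output and offset become visible together, or not at all.
        self.committed_output.extend(pending_output)
        self.committed_offset = next_offset

def run(input_log, sink, batch_size, crash_after_batches=None):
    """Process input_log starting from the last committed offset.

    If crash_after_batches is set, simulate a crash after that many
    committed batches, in the middle of the next one (before its commit).
    Returns True if the whole log was processed.
    """
    batches_done = 0
    while sink.committed_offset < len(input_log):
        start = sink.committed_offset
        batch = input_log[start:start + batch_size]
        pending = [x * 2 for x in batch]
        if crash_after_batches is not None and batches_done == crash_after_batches:
            return False  # crash: pending output and offset are both lost
        sink.commit(pending, start + len(batch))
        batches_done += 1
    return True

input_log = [1, 2, 3, 4, 5]
sink = TransactionalSink()
finished_first = run(input_log, sink, batch_size=2, crash_after_batches=1)
finished_second = run(input_log, sink, batch_size=2)  # restart after the crash
print(sink.committed_output)  # [2, 4, 6, 8, 10]
```

The crashed batch is reprocessed after the restart, but its first attempt never became visible, so each input appears in the output exactly once.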
The gotcha: exactly-once adds latency. Every batch of records requires a transaction commit, which adds ~10-100ms. For ultra-low-latency use cases (like real-time bidding), you might prefer at-least-once with deduplication downstream.
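For the at-least-once alternative, downstream deduplication might look like the sketch below. It assumes every record carries a unique ID, which is an assumption about your data and not something the framework gives you. Deduplicator and apply_once are hypothetical names.

```python
from collections import OrderedDict

class Deduplicator:
    """Remembers the most recent max_ids record IDs and flags repeats."""

    def __init__(self, max_ids):
        self.max_ids = max_ids
        self._seen = OrderedDict()

    def is_new(self, record_id):
        if record_id in self._seen:
            self._seen.move_to_end(record_id)  # keep recently seen IDs longest
            return False
        self._seen[record_id] = True
        if len(self._seen) > self.max_ids:
            self._seen.popitem(last=False)     # evict the oldest ID
        return True

def apply_once(records, dedup):
    """Sum amounts from (record_id, amount) pairs, counting each ID once."""
    total = 0
    for record_id, amount in records:
        if dedup.is_new(record_id):
            total += amount
    return total

# At-least-once delivery: "a" and "b" were redelivered after a retry.
delivered = [("a", 10), ("b", 5), ("a", 10), ("c", 1), ("b", 5)]
total = apply_once(delivered, Deduplicator(max_ids=1000))
print(total)  # 16
```

The ID cache is bounded, so a duplicate that arrives after its ID has been evicted will slip through. Size max_ids to cover your realistic redelivery window.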
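I've seen a team enable exactly-once on a high-throughput topic (100k msg/s) and see throughput drop by 40%. The fix: commit less often so each transaction covers more records (in Kafka Streams, raise commit.interval.ms) and increase the producer batch size. If longer transactions then hit timeouts, raise transaction.timeout.ms, which brokers cap at transaction.max.timeout.ms. Or switch to at-least-once if duplicates are acceptable.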
Backpressure: When the Firehose Overwhelms You
Backpressure is when your producer sends data faster than your consumer can process. Without handling it, you'll OOM, drop messages, or crash. Kafka handles backpressure naturally: consumers pull at their own pace. But if your processing is slow, consumer lag grows. Flink has built-in backpressure: if a downstream operator is slow, it signals upstream to slow down.
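Consumer lag is just arithmetic on offsets, so it helps to see it spelled out. In this sketch the offset dictionaries are hypothetical inputs that you would fill from your monitoring tooling, and a partition with no committed offset is assumed to start at 0.

```python
def partition_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset minus committed consumer offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}

def lag_is_growing(samples, min_increase=0):
    """True if each total-lag reading exceeds the previous by more than min_increase."""
    if len(samples) < 2:
        return False
    return all(b - a > min_increase for a, b in zip(samples, samples[1:]))

end = {0: 1500, 1: 900}
committed = {0: 1200, 1: 900}
lag = partition_lag(end, committed)
print(lag)                              # {0: 300, 1: 0}
print(lag_is_growing([100, 250, 700]))  # True
print(lag_is_growing([100, 90, 120]))   # False
```

A single high reading is usually just a burst. A run of steadily increasing readings means the consumer cannot keep up.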
The fix: monitor consumer lag. If it grows, you need more partitions or parallelism. Or you need to optimize your processing logic. I've seen a team ignore lag for hours, then try to catch up — the consumer tried to process millions of records at once and OOM'd. Solution: use max.poll.records to limit batch size, and use a separate thread pool for processing.
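The "bounded poll plus thread pool" pattern can be sketched with Python's standard library. This is a simulation with an in-memory list standing in for the topic, not a real Kafka consumer. run_pipeline and its parameters are hypothetical names chosen to mirror the settings mentioned above.

```python
import queue
import threading

def run_pipeline(records, max_poll_records, queue_size, workers, handler):
    """Feed records to worker threads through a bounded queue.

    The poll loop takes at most max_poll_records at a time and blocks on
    put() when the queue is full, so a slow handler slows down polling
    instead of letting unprocessed records pile up in memory.
    """
    work = queue.Queue(maxsize=queue_size)
    results = []
    lock = threading.Lock()
    max_depth = 0

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: shut this worker down
                return
            out = handler(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    for i in range(0, len(records), max_poll_records):
        batch = records[i:i + max_poll_records]  # one bounded "poll"
        for r in batch:
            work.put(r)                          # blocks while the queue is full
            max_depth = max(max_depth, work.qsize())

    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return results, max_depth

results, max_depth = run_pipeline(
    list(range(100)), max_poll_records=10, queue_size=5, workers=4,
    handler=lambda x: x * x,
)
print(len(results), max_depth <= 5)  # 100 True
```

In-flight records never exceed the queue size, however far behind the consumer is. With a real Kafka consumer, blocking too long between polls can exceed max.poll.interval.ms and trigger a rebalance, so keep the queue small and the handlers fast.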
Another trick: use a dead letter queue for records that fail processing. Don't block the stream on a single bad record.
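A dead letter queue can be as simple as "retry a few times, then set the record aside with its error". The sketch below uses a Python list where a real system would use a separate topic. process_with_dlq and the retry count are illustrative assumptions.

```python
def process_with_dlq(records, handler, max_attempts=3):
    """Apply handler to each record; route persistent failures to a dead letter list."""
    processed = []
    dead_letters = []
    for record in records:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:  # broad on purpose: nothing may block the stream
                last_error = exc
        else:
            # Every attempt failed: park the record with context for later inspection.
            dead_letters.append(
                {"record": record, "error": repr(last_error), "attempts": max_attempts}
            )
    return processed, dead_letters

processed, dead_letters = process_with_dlq(["10", "oops", "7"], handler=int)
print(processed)                  # [10, 7]
print(dead_letters[0]["record"])  # oops
```

The bad record costs a few retries and the stream moves on. Keep the original payload and the error together so you can replay the record once the bug is fixed.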
Windowing and Time Semantics: Event Time vs Processing Time
Windowing groups records by time. Processing time uses the wall clock when the record is processed. Event time uses a timestamp embedded in the record. Event time is harder but necessary for out-of-order data. Flink handles event time natively with watermarks. Kafka Streams supports event time via TimestampExtractor.
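The difference between the two time semantics shows up as soon as one record is delayed. This Python sketch assigns records to 60-second tumbling windows by either timestamp. The record layout and field names are hypothetical.

```python
from collections import defaultdict

def window_start(ts_s, size_s):
    """Start of the tumbling window that contains timestamp ts_s."""
    return ts_s - (ts_s % size_s)

def count_by_window(records, size_s, time_field):
    """Count records per window, keyed by window start, using the chosen time field."""
    counts = defaultdict(int)
    for r in records:
        counts[window_start(r[time_field], size_s)] += 1
    return dict(counts)

records = [
    {"event_time": 10, "arrival_time": 11},
    {"event_time": 55, "arrival_time": 56},
    {"event_time": 58, "arrival_time": 88},  # delayed 30 s in transit
    {"event_time": 61, "arrival_time": 62},
]
by_event = count_by_window(records, 60, "event_time")
by_arrival = count_by_window(records, 60, "arrival_time")
print(by_event)    # {0: 3, 60: 1}
print(by_arrival)  # {0: 2, 60: 2}
```

The same four records produce different window counts depending on which clock you trust. Only the event-time result reflects when things actually happened.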
The gotcha: event time requires handling late data. If a record arrives 10 minutes late, do you include it in the window or discard it? Flink allows setting an allowedLateness and a side output for late events. Kafka Streams has a suppress() operator to emit final results after a grace period.
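A simplified model of the late-data decision is sketched below in Python. It is not Flink's or Kafka Streams' implementation. The watermark here is simply the highest event time seen so far, and a record is routed to a late list once the watermark has passed its window end plus the allowed lateness. All names are hypothetical.

```python
def assign_with_lateness(event_times, size_s, allowed_lateness_s):
    """Assign event timestamps (in arrival order) to tumbling windows.

    Returns (windows, late), where windows maps window start -> accepted
    timestamps and late holds records that arrived after their window
    plus the allowed lateness had already passed.
    """
    windows = {}
    late = []
    watermark = float("-inf")
    for ts in event_times:
        watermark = max(watermark, ts)  # simplification: no out-of-orderness bound
        start = ts - (ts % size_s)
        end = start + size_s
        if end + allowed_lateness_s <= watermark:
            late.append(ts)             # the "side output" for late events
        else:
            windows.setdefault(start, []).append(ts)
    return windows, late

arrivals = [5, 70, 50, 130, 40, 200, 110]
windows, late = assign_with_lateness(arrivals, size_s=60, allowed_lateness_s=60)
print(windows)  # {0: [5, 50], 60: [70], 120: [130], 180: [200]}
print(late)     # [40, 110]
```

The record with timestamp 50 is out of order but still inside the lateness budget, so it is accepted. Timestamps 40 and 110 arrive after their windows have closed for good and go to the late list instead of being silently dropped.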
I've seen a team use processing time for a fraud detection system. A network glitch delayed records by 30 seconds, causing them to fall into the wrong window and trigger false positives. The fix: switch to event time with a 1-minute allowed lateness.
When Not to Use Stream Processing
Stream processing is not a silver bullet. If your data volume is low (< 1000 events/second) and latency requirements are loose (> 1 minute), batch processing is simpler and cheaper. If you need complex joins across multiple data sources that change slowly, a database with materialized views might be better.
Also, stream processing adds operational complexity: you need to manage state, handle failures, and monitor lag. For a startup with a small team, a simple cron job or a message queue with a worker might be more maintainable.
I've seen a team use Flink to process 100 events per second from a single MySQL table. They spent weeks setting up the cluster, only to realize a simple Python script with polling would have worked. Don't over-engineer.
The 4GB Container That Kept Dying
Root cause: memory_mode was set to HEAP instead of DISK. The state grew unbounded with 24-hour windows.
Fix: switched to DISK mode, set rocksdb.block.cache-size to 256MB, and added a state TTL of 1 hour. Also enabled logging to offload changelogs to Kafka.
- State stores are not free.
- Always estimate state size (data rate × window duration × key cardinality) and configure accordingly.
- Defaults will kill you.
kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -1
kafka-topics --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 20
Increase consumer parallelism.
Key takeaways
Interview Questions on This Topic
How does Kafka Streams achieve exactly-once semantics, and what are the trade-offs?
Frequently Asked Questions