CountDownLatch Deadlock — Missing countDown() After Crash
One unchecked exception before countDown() hangs your service forever.
20+ years shipping production Java in banking & fintech. Drawn from code that ran under real load.
- CountDownLatch is a one-shot gate: threads wait until the count reaches zero, then the gate stays open forever.
- CyclicBarrier is a reusable meeting point: all threads wait for each other, then the barrier resets for the next cycle.
- CountDownLatch uses AQS with a single CAS per countDown(); CyclicBarrier uses ReentrantLock + Condition, higher overhead.
- In production, always use await(timeout) with CountDownLatch; a hung worker blocks indefinitely.
- CyclicBarrier's broken barrier state is active failure detection; CountDownLatch just stays stuck silently.
- Biggest mistake: using CountDownLatch when you need reuse, or CyclicBarrier when participants are dynamic.
Imagine a rocket launch. The countdown — 10, 9, 8 … 1, 0 — happens once, and when it hits zero, the rocket fires. That's a CountDownLatch: a one-shot gate that opens when a count reaches zero. Now imagine a relay race where all four runners must reach the exchange zone before anyone passes the baton. Once they're all there, they all go — and the next lap can repeat the same wait. That's a CyclicBarrier: a reusable meeting point that resets after every group finishes.
Modern Java applications rarely run a single task at a time. Whether you're loading config from three different microservices before serving the first request, running parallel test suites, or coordinating phases in a data-processing pipeline, you need threads to wait for each other in a controlled, predictable way. Get this wrong and you end up with race conditions, deadlocks, or — the sneaky worst case — a service that silently produces incomplete results because one thread raced ahead before the others were ready.
Both CountDownLatch and CyclicBarrier live in java.util.concurrent and solve the 'threads waiting for each other' problem, but they solve subtly different flavours of it. CountDownLatch is about one or more threads waiting until a set of operations performed by other threads completes — think dependencies. CyclicBarrier is about a fixed group of threads all waiting until every member of that group is ready to proceed together — think synchronisation points in iterative work.
By the end of this article you'll understand the internal mechanics of both primitives, know exactly which one to reach for in a given situation, be able to explain their trade-offs in an interview without hesitation, and have production-ready patterns you can drop straight into your codebase.
CountDownLatch vs CyclicBarrier — Two Synchronizers, One Critical Difference
CountDownLatch and CyclicBarrier are both Java synchronizers that coordinate multiple threads, but they solve fundamentally different problems. CountDownLatch is a one-shot gate: one or more threads block until a fixed number of countDown() calls have been made. CyclicBarrier is a reusable rendezvous point: a fixed number of threads all wait for each other to arrive, then proceed together.
CountDownLatch is not reusable: once the count reaches zero, the latch is permanently open. CyclicBarrier resets automatically after all parties trip it, and can optionally run a barrier action. Both ultimately rest on AQS (AbstractQueuedSynchronizer): CountDownLatch uses it directly, while CyclicBarrier goes through a ReentrantLock and Condition that are themselves built on AQS. The practical difference: CountDownLatch signals an event; CyclicBarrier synchronizes a phase.
Use CountDownLatch when you need to wait for N operations to complete before proceeding — e.g., waiting for N services to start, or N parallel tasks to finish. Use CyclicBarrier when you have a fixed-size group of threads that must meet at a common point repeatedly — e.g., in parallel simulations or multi-phase computations. The wrong choice leads to deadlocks or wasted threads.
CountDownLatch: Internals, Lifecycle and When to Reach for It
CountDownLatch wraps an AbstractQueuedSynchronizer (AQS) state integer. When you call new CountDownLatch(n), the AQS state is initialised to n. Every countDown() call performs a compareAndSet that decrements the state by 1 — atomically, without a lock. When the state hits 0, all threads parked in await() are unblocked via AQS's release mechanism. That's it. There is no reset path in the API. The latch is a one-way gate.
This single-use nature is a feature, not a limitation. It makes CountDownLatch perfect for start-up sequencing (wait for N services to register before opening traffic), test coordination (wait for N worker threads to complete before asserting results), and event broadcasting (all waiting threads unblock simultaneously the moment the count hits zero).
The key mental model: the thread calling await() is the dependent — it needs work done. The threads calling countDown() are the producers — they signal completion. These roles can overlap; a thread can countDown() and then await() on a different latch, which is exactly how two-phase startup coordination is built.
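The dependent/producer split can be sketched as a start-up gate. This is a minimal illustration, not a definitive implementation: the class name `StartupGate`, the method `awaitStartup`, and the empty "initialisation" body are hypothetical placeholders, and the timeout value is arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class StartupGate {

    // Starts `services` hypothetical warm-up tasks and waits up to `timeoutMs`
    // for all of them to signal completion. Returns true if the gate opened.
    static boolean awaitStartup(int services, long timeoutMs) throws InterruptedException {
        CountDownLatch ready = new CountDownLatch(services);
        ExecutorService pool = Executors.newFixedThreadPool(services);
        try {
            for (int i = 0; i < services; i++) {
                pool.execute(() -> {
                    try {
                        // placeholder for real initialisation work (assumption)
                    } finally {
                        ready.countDown(); // producer: signal completion
                    }
                });
            }
            // Dependent: timed wait. A false return means the count did not
            // reach zero within the timeout.
            return ready.await(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean opened = awaitStartup(3, 2_000);
        System.out.println(opened ? "all services ready" : "startup timed out");
    }
}
```

The boolean returned by the timed `await` is the only signal that the gate failed to open, so the caller has to check it rather than assume success.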
CyclicBarrier: Reusable Phases, the Barrier Action, and Its AQS Internals
CyclicBarrier is built differently from CountDownLatch. It uses an internal ReentrantLock and a Condition to park threads rather than AQS directly. The critical state is a 'generation' object that gets replaced each time the barrier trips (resets). This generation mechanism is precisely what makes the barrier cyclic — each trip through the barrier starts a fresh generation, so the same CyclicBarrier instance coordinates an unbounded number of phases.
The constructor accepts an optional Runnable barrierAction. This action runs exactly once per cycle, in the last thread to arrive at the barrier, before any of the waiting threads are released. This is incredibly useful for aggregating results from the phase that just completed (e.g., merging partial sums) before the next phase begins — all without an external synchronisation step.
Broken barrier state is a crucial concept you must understand. If any thread waiting at a barrier is interrupted or times out, or if the barrier action throws, the barrier enters a broken state. Every thread currently waiting, and every thread that later calls await() on that barrier, gets a BrokenBarrierException until the barrier is reset. reset() exists, but coordinating it safely across threads is awkward, so the simplest recovery is usually to build a new CyclicBarrier. This failure mode is intentional: a partially-completed phase in iterative work produces corrupt results, so it's better to fail loudly.
Use CyclicBarrier for parallel iterative algorithms (matrix multiplication phases, parallel merge sort stages), simulation loops where N agent threads must sync before each tick, and multi-stage data-processing pipelines where every worker must finish stage N before any starts stage N+1.
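A reusable barrier with a merging barrier action might look like the sketch below. The class `PhasedSum`, the arithmetic each worker performs, and the worker and round counts are all made up for illustration; only the `CyclicBarrier` calls are the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class PhasedSum {

    // Runs `workers` threads through `rounds` phases. In each phase every worker
    // writes one partial value, then the barrier action merges them.
    static List<Integer> run(int workers, int rounds) throws InterruptedException {
        int[] partial = new int[workers];
        List<Integer> totals = new ArrayList<>();

        // Barrier action: runs once per phase, in the last thread to arrive,
        // before any waiting thread is released.
        CyclicBarrier barrier = new CyclicBarrier(workers, () -> {
            int sum = 0;
            for (int p : partial) sum += p;
            totals.add(sum);
        });

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int round = 1; round <= rounds; round++) {
                        partial[id] = (id + 1) * round; // this worker's slice of the phase
                        barrier.await();                // same barrier, reused every round
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag for callers
                } catch (BrokenBarrierException e) {
                    // another worker failed; abandon the remaining phases
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        return totals;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(3, 2)); // prints [6, 12]
    }
}
```

Because the barrier action runs before anyone is released, no worker can overwrite its slot for the next round while the merge is still reading it.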
Head-to-Head Comparison — Choosing the Right Tool Under Pressure
The single most important question to ask yourself is: 'Is the wait one-directional (waiters depend on workers) or mutual (everyone waits for everyone)?' CountDownLatch is one-directional. CyclicBarrier is mutual.
The second question is: 'Does this pattern repeat?' If threads need to sync once and move on independently, use CountDownLatch. If threads must sync at the end of every phase in a loop, CyclicBarrier's automatic reset is exactly what you need — recreating a CountDownLatch every iteration is wasteful and error-prone.
Performance considerations matter at scale. CountDownLatch.countDown() is a single CAS on an AQS integer (retried only under contention), which is extremely cheap. CyclicBarrier.await() acquires a ReentrantLock, which involves more overhead. For ultra-hot paths with thousands of threads syncing per second, consider Phaser (a more flexible alternative to both), which can be tiered into a tree of phasers to reduce contention. For most application-level coordination (tens of threads, not thousands), both primitives are fast enough that design clarity matters far more than the performance difference.
Error propagation also differs sharply. A missing countDown() call (e.g., from a crashed thread that never reaches it) simply leaves the latch stuck, which is why the timeout overload of await() is non-negotiable in production. CyclicBarrier's broken-barrier state at least actively notifies waiting threads that something went wrong, making it somewhat easier to detect a fault mid-cycle.
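The difference in failure signalling can be seen in a small single-threaded sketch. The class `FailureSignals`, the 50 ms timeouts, and the trace strings are illustrative assumptions; in both cases the "missing" party is simulated by simply never starting it.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FailureSignals {

    // A latch whose worker never counts down: the only signal is a false return.
    static boolean latchOpened() throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1); // nobody will call countDown()
        return latch.await(50, TimeUnit.MILLISECONDS);
    }

    // A 2-party barrier whose second party never arrives. Returns a short trace
    // of what the barrier reports.
    static String barrierTrace() throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(2);
        StringBuilder trace = new StringBuilder();
        try {
            barrier.await(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            trace.append("timeout;");   // the timed-out waiter breaks the barrier
        } catch (BrokenBarrierException e) {
            trace.append("unexpected;");
        }
        trace.append("broken=").append(barrier.isBroken()).append(';');
        try {
            barrier.await();            // later arrivals fail fast instead of hanging
        } catch (BrokenBarrierException e) {
            trace.append("later-await-rejected");
        }
        return trace.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("latch opened: " + latchOpened());
        System.out.println(barrierTrace());
    }
}
```

The latch says nothing beyond "not yet", while the barrier records the failure and rejects every later arrival until it is reset or replaced.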
Production Decision Framework — How to Pick the Right Primitive
You've seen the internals. Now here's a concrete decision tree you can apply in code reviews or on the whiteboard. Start with these three questions:
1. Roles: Are there distinct 'waiters' and 'workers', or does every thread play both roles?
   - Distinct roles → CountDownLatch. One thread (or group) waits for others to finish.
   - All threads equal → CyclicBarrier. Everyone waits for everyone.
2. Repeatability: Will this coordination point be used exactly once, or multiple times?
   - Once → CountDownLatch (or create a new one each time, but that's fragile).
   - Multiple times → CyclicBarrier (auto-reset) or Phaser (if participants change).
3. Failure semantics: What should happen if a worker fails?
   - Silent stuck latch? Use CountDownLatch with timeout.
   - Active failure notification? Use CyclicBarrier: BrokenBarrierException tells all threads.
Use this table as a quick reference:
| Scenario | Best Choice | Why |
|---|---|---|
| Start-up sequencing (wait for N services) | CountDownLatch | One-shot, distinct roles |
| Parallel algorithm with phases | CyclicBarrier | Reusable, all threads equal |
| Test coordination (wait for threads to finish) | CountDownLatch | Simple, one-time |
| Dynamic worker pool for iterative processing | Phaser | Participants can join/leave |
| Event broadcasting (fire when all ready) | CountDownLatch | All waiters unblock simultaneously |
| Simulation ticks where each tick is a phase | CyclicBarrier | Auto-reset, barrier action for aggregation |
In production, apply the rule of least surprise: pick the primitive whose name and contract clearly communicate the intent. Your future self — and your colleagues — will thank you.
- Gate: Opens once. Once open, nothing stops it. Perfect for one-time dependencies.
- Round table: Everyone sits, the barrier action runs (like a toast), then they get up and the table resets for the next course.
- Phaser extends the round table: chairs can be added or removed between courses.
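The "chairs can be added or removed" idea can be sketched with java.util.concurrent.Phaser. This is an illustrative sketch only: the class `DynamicParties`, the helper `startWorker`, and the choice of three workers are hypothetical, and each worker does no real work between arrivals.

```java
import java.util.concurrent.Phaser;

class DynamicParties {

    // Registers a new party from the calling thread, then starts a worker that
    // takes part in `rounds` phases and gives up its chair on the last one.
    static Thread startWorker(Phaser phaser, int rounds) {
        phaser.register(); // register before the thread runs so the count is right
        Thread t = new Thread(() -> {
            for (int r = 1; r < rounds; r++) {
                phaser.arriveAndAwaitAdvance(); // sync, and stay for the next course
            }
            phaser.arriveAndDeregister();       // final arrival: leave the table
        });
        t.start();
        return t;
    }

    // Returns {final phase number, parties still registered}.
    static int[] run() throws InterruptedException {
        Phaser phaser = new Phaser(1);          // the coordinating thread is party #1
        Thread a = startWorker(phaser, 2);      // sits through two rounds
        Thread b = startWorker(phaser, 1);      // leaves after the first round
        phaser.arriveAndAwaitAdvance();         // round one: three parties
        Thread c = startWorker(phaser, 1);      // joins for the second round only
        phaser.arriveAndAwaitAdvance();         // round two: a different set of parties
        a.join();
        b.join();
        c.join();
        return new int[] { phaser.getPhase(), phaser.getRegisteredParties() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run();
        System.out.println("phase=" + result[0] + ", parties=" + result[1]);
    }
}
```

Registering in the coordinating thread, rather than inside the worker, avoids a race where a phase completes before a late worker has claimed its seat.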
Common Pitfalls and How to Avoid Them
Even experienced developers make these mistakes. Here's what to watch for.
Pitfall 1: Missing countDown() guarantee. If your worker code throws an unchecked exception before calling countDown(), the latch never reaches zero. Always wrap the body in try-finally and call countDown() in the finally block. This ensures the latch is decremented even on failure.
Pitfall 2: Forgetting to restore the interrupt flag. When you catch InterruptedException, call Thread.currentThread().interrupt() to reassert the interrupt. Throwing the exception clears the flag, so swallowing it hides the cancellation request from code further up the stack (for example, an executor trying to shut down), and the thread may carry on as if nothing happened. A thread interrupted inside CyclicBarrier.await() breaks the barrier either way; restoring the flag is about letting the rest of that thread's code see the interrupt.
Pitfall 3: Reusing a CountDownLatch by creating a new one in a loop. You create a new CountDownLatch(n) each iteration, but if another thread still holds the reference from a previous iteration, that latch is already at zero and its await() returns immediately. Switch to CyclicBarrier or Phaser if you need reuse.
Pitfall 4: Calling countDown() after the latch has reached zero. It's a no-op, but it can mask bugs. For example, if you accidentally call countDown() 5 times on a latch initialised with 3, the extra calls do nothing, and nothing tells you that a worker signalled more often than it should have. Add assertions if you suspect over-counting.
Pitfall 5: Using CyclicBarrier with more threads than the party count. The barrier trips for exactly its party count. If 5 threads call await() on a 4-party barrier, the first four trip it and proceed, while the fifth starts a new generation and waits for three more arrivals that never come. Ensure the number of threads equals the barrier's party count exactly, or use a secondary coordination mechanism.
Pitfall 6: Not handling BrokenBarrierException. If you ignore BrokenBarrierException and continue, you risk processing garbage data. Always abort the current phase and restart with a fresh barrier.
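Pitfall 1 is easy to reproduce. The sketch below runs the same crashing worker with and without a finally block; the class `SafeCountDown`, the helper `doWork`, and the simulated crash are hypothetical stand-ins for real worker code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SafeCountDown {

    // Starts `workers` threads in which worker 0 throws an unchecked exception.
    // Returns true if the latch still opens within the timeout.
    static boolean run(int workers, boolean useFinally) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            final boolean crashes = (i == 0);
            Thread t = new Thread(() -> {
                if (useFinally) {
                    try {
                        doWork(crashes);
                    } finally {
                        done.countDown(); // runs even when doWork throws
                    }
                } else {
                    doWork(crashes);
                    done.countDown();     // skipped when doWork throws
                }
            });
            t.setUncaughtExceptionHandler((thread, ex) -> { }); // keep demo output quiet
            t.start();
        }
        return done.await(500, TimeUnit.MILLISECONDS);
    }

    // Hypothetical unit of work; throws to simulate a crashed worker.
    static void doWork(boolean crashes) {
        if (crashes) throw new IllegalStateException("simulated worker crash");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("with finally:    " + run(3, true));
        System.out.println("without finally: " + run(3, false));
    }
}
```

Without the timed await, the second call would never return, which is the hang this article's title describes.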
When a thread is interrupted inside CyclicBarrier.await(), the barrier breaks and the other waiting threads get BrokenBarrierException. If you catch InterruptedException, always call Thread.currentThread().interrupt() to restore the flag; otherwise you mask the interruption from the code that requested it.
Why CountDownLatch Cares About Tasks, Not Threads
That distinction kills more junior engineers than null pointers. CountDownLatch tracks a counter you decrement. It has zero interest in who does the decrementing. One thread can call countDown() five times. Five threads can each call it once. The latch doesn't care. It only watches that counter hit zero. This matters in production because you might have a thread pool of three workers needing to complete eight pre-flight checks. With CountDownLatch, you set the initial count to eight, each check calls countDown(), and your coordinator thread waits on await(). The pool size is irrelevant. CyclicBarrier would force you to match thread count to barrier count, which is the wrong abstraction when you're tracking units of work, not thread rendezvous. That's the root cause I've debugged at 2 AM: someone treated the barrier count like a task counter and wondered why their pipeline deadlocked. Don't be that engineer.
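The three-workers, eight-checks scenario can be sketched as follows. The class `PreflightChecks` and the counter that stands in for each check are illustrative assumptions; the point is only that the latch is sized by tasks while the pool is sized by threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PreflightChecks {

    // Runs `checks` tasks on a pool of `poolSize` threads. Returns the number of
    // completed checks once the latch opens, or -1 if it times out.
    static int run(int poolSize, int checks) throws InterruptedException {
        CountDownLatch remaining = new CountDownLatch(checks); // counts tasks, not threads
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            for (int i = 0; i < checks; i++) {
                pool.execute(() -> {
                    try {
                        completed.incrementAndGet(); // stand-in for a real check
                    } finally {
                        remaining.countDown();
                    }
                });
            }
            boolean opened = remaining.await(2, TimeUnit.SECONDS);
            return opened ? completed.get() : -1;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(3, 8)); // prints 8
    }
}
```

Each pool thread ends up calling countDown() more than once, which a latch accepts without complaint. A CyclicBarrier sized to 8 would instead wait for eight threads to be parked in await() at the same time, and a pool of three can never provide them.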
Reusability Is Where CyclicBarrier Earns Its Paycheck
CountDownLatch is a one-shot. Once you hit zero, it's a corpse. You cannot reset it. CyclicBarrier resets implicitly when all parties trip the barrier, or explicitly via reset(). That reusability defines when you reach for it. Think phased computations: map stage, then reduce stage, then output stage. Each phase needs all threads to sync before the next. You create one CyclicBarrier sized to your worker thread count (not your phase count), call await() at the end of each phase, and optionally run a barrier action (like shuffling data) between phases. The barrier handles reset automatically. In production, I've used this for partitioned cache refresh jobs: each partition loads fresh data, threads rendezvous, then the barrier action publishes the combined update. Without CyclicBarrier, you'd be wiring up a fresh CountDownLatch per phase, handing each new instance to every thread, and praying you don't leak a stale reference. That's fragile. CyclicBarrier is built for that rhythm.
Startup Hang: Missing countDown() After Worker Crash
A thread was stuck in CountDownLatch.await(), called with no timeout. The latch count stayed at 1 forever.
- Never call await() without a timeout in production.
- Put countDown() in a finally block, every time.
- If a worker crashes before decrementing, your latch becomes a deadlock trap.
Broken barriers require a new instance.
To find threads parked on a latch, take a thread dump:
jstack $(pgrep -f 'your-app') | grep -A 20 'CountDownLatch.await'
Check latch count: add debug logging or attach with jcmd to inspect object
Key takeaways
- Use the timed await() variants in production.
Common mistakes to avoid
Not guaranteeing countDown() in a finally block
Catching InterruptedException without restoring the interrupt flag
Call Thread.currentThread().interrupt() in the catch block for InterruptedException. For CountDownLatch, still call countDown() to unblock others. For CyclicBarrier, let await() throw BrokenBarrierException.
Reusing a CountDownLatch by allocating a new one in a loop
Calling CyclicBarrier.await() without handling BrokenBarrierException
Once the barrier is broken, await() calls throw BrokenBarrierException. If the code catches Exception generically or ignores it, the phase may continue with partial data.
Using CyclicBarrier with a thread pool that has dynamic size
If the number of threads calling await() varies (e.g., cached thread pool), the barrier may never trip because the party count is fixed. Extra threads calling await() exceed the intended count, or too few threads arrive.
Interview Questions on This Topic
Can you explain the difference between CountDownLatch and CyclicBarrier, and give a concrete production scenario where you'd pick one over the other?
Frequently Asked Questions