Advanced 8 min · March 06, 2026

Node.js Performance Optimisation

Event Loop Block: 4-Second Template Compile in Node.js

Q: How many cluster workers should I run in production?

Start with one worker per CPU core, but subtract one core for the primary process and OS overhead. On a 4-core container: run 3 workers. On an 8-core VM: run 7 workers. This is a starting point, not a hard rule. Monitor per-worker CPU utilisation after deploying. If workers consistently sit below 50% CPU utilisation under peak load, you have more workers than the workload needs and you are wasting memory (each worker consumes 30-100MB of baseline RAM). If workers consistently exceed 80% CPU utilisation, add workers or scale the container. In Kubernetes with autoscaling, the K8s-preferred pattern is one process per pod (no intra-pod clustering) with the HPA managing pod count. This gives cleaner isolation, simpler crash recovery, more accurate resource accounting, and allows K8s to schedule pods across nodes. Intra-pod clustering makes most sense when a single pod is allocated 4+ CPUs and you want to utilise all of them without the orchestration overhead of additional pods.
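The starting-point rule and the 50%/80% thresholds above can be written down as a small helper. This is only a sketch: `suggestedWorkerCount` and `utilisationAdvice` are hypothetical names, and the thresholds are the guideline figures from the answer, not measured limits.

```javascript
const os = require('node:os');

// Starting point: one worker per core, minus one core for the primary and the OS.
// Never returns less than 1 so single-core hosts still get a worker.
function suggestedWorkerCount(cores = os.cpus().length) {
  return Math.max(1, cores - 1);
}

// Maps average per-worker CPU utilisation under peak load to the advice above.
function utilisationAdvice(avgCpuPercent) {
  if (avgCpuPercent < 50) return 'fewer workers';
  if (avgCpuPercent > 80) return 'more workers or a larger container';
  return 'keep current count';
}

console.log(suggestedWorkerCount(4)); // 3
console.log(suggestedWorkerCount(8)); // 7
console.log(utilisationAdvice(35));   // fewer workers
```

Treat the result as an initial value to deploy and then adjust from monitoring data.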

Q: Why does my Node.js process use more memory than --max-old-space-size allows?

The V8 heap is only one component of total Node.js process memory. Total RSS (Resident Set Size) includes all of: the V8 old generation heap (controlled by --max-old-space-size), the V8 new generation (young space, ~32MB by default), off-heap Buffer allocations (Buffer.alloc and Buffer.from use memory outside the V8 heap by design — this is intentional for performance), native module memory (crypto, zlib, bcrypt, and other C++ addons allocate from the OS directly), the libuv thread pool stacks (default 4 threads), worker thread stacks (if you're using worker_threads), and V8 code space and map space for compiled JavaScript. A process with --max-old-space-size=512 routinely uses 700-900MB of total RSS under moderate load. Set the flag to 70-75% of your container memory allocation, not 90% or 100%. The remaining headroom is needed for non-heap allocations and to prevent the OOM killer from intervening during peak GC activity when V8 briefly holds both the current and compacted heap simultaneously.
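To see how much of RSS sits outside the V8 heap, `process.memoryUsage()` reports the heap, external, and ArrayBuffer figures separately. The sketch below also derives a flag value from a container limit using the 75% guideline above; `oldSpaceSizeMb` and `memoryBreakdownMb` are hypothetical helper names.

```javascript
// Derive --max-old-space-size (in MB) from the container memory limit.
// The 0.75 default is the guideline from the text, not a V8 requirement.
function oldSpaceSizeMb(containerLimitMb, fraction = 0.75) {
  return Math.floor(containerLimitMb * fraction);
}

// Snapshot of where process memory currently sits, rounded to MB.
function memoryBreakdownMb() {
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  return {
    rss: toMb(rss),                   // everything resident, heap and non-heap
    heapTotal: toMb(heapTotal),       // V8 heap reserved
    heapUsed: toMb(heapUsed),         // V8 heap in use
    external: toMb(external),         // C++ objects bound to JS objects
    arrayBuffers: toMb(arrayBuffers), // Buffers and ArrayBuffers
  };
}

console.log(`--max-old-space-size=${oldSpaceSizeMb(1024)}`); // for a 1024MB container
console.log(memoryBreakdownMb());
```

A gap between `rss` and `heapTotal` that keeps growing usually points at Buffers or native modules rather than a JavaScript leak.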

Q: Should I use PM2 or the built-in cluster module?

It depends on where you're deploying. On a VM or bare-metal server where you are responsible for process lifecycle management: PM2 is the right choice. It wraps the cluster module and adds operational necessities — zero-downtime reload (pm2 reload), structured log aggregation, a monitoring dashboard (pm2 monit), and automatic restart policies. Managing process lifecycle manually with the raw cluster module on a VM is reinventing what PM2 already does reliably. In Kubernetes: PM2 adds very little value. K8s handles restarts (liveness probes), scaling (HPA), rolling deployments (RollingUpdate strategy), and log collection (stdout to a sidecar or log aggregator). The overhead of PM2 inside a container that K8s is already managing creates a confusing double layer of process supervision. In K8s, prefer either one process per pod (simplest) or the raw cluster module if you need multi-core utilisation within a single pod.

Q: Can I use async/await for everything and never worry about blocking the event loop?

No — and this misconception is responsible for a significant number of production event loop blocking incidents. async/await is syntax that makes asynchronous code readable. It does not make synchronous operations asynchronous. Every one of the following is synchronous and blocks the event loop regardless of whether it appears inside an async function: JSON.parse(), JSON.stringify() on large objects, regex matching on long strings (especially with backtracking-prone patterns), Array.sort() on large arrays, crypto.pbkdf2Sync() (note the Sync suffix — that is the warning), any tight computation loop, and any native addon that does not use libuv's async APIs. await only helps when you're awaiting something that is already asynchronous — a network request, a database query, a file read via fs.promises. If the underlying operation is synchronous, wrapping it in async/await changes nothing about its blocking behaviour. The rule: use async/await for I/O concurrency. Use worker threads for CPU-bound work that takes more than 10ms. There is no shortcut between these two.
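A short demonstration of the point above, as a sketch: the busy loop stands in for any synchronous CPU work, and `hashEverything` is a hypothetical name. The function is declared `async`, yet a 0ms timer still cannot fire until the loop finishes.

```javascript
const events = [];

async function hashEverything(durationMs) {
  // Stand-in for CPU-bound work: spin on the main thread for durationMs
  const end = Date.now() + durationMs;
  while (Date.now() < end) { /* busy */ }
  events.push('cpu work finished');
}

// Due immediately, but it can only run once the call stack is empty
setTimeout(() => events.push('timer fired'), 0);

hashEverything(50); // the synchronous body runs right here, async keyword or not
events.push('caller resumed');

// The timer has been due for ~50ms and still has not run
console.log(events); // [ 'cpu work finished', 'caller resumed' ]
```

Moving the loop into a worker thread is what lets the timer fire on schedule; adding `await` in front of the call does not.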

Q: What causes the 'JavaScript heap out of memory' error and how do I fix it immediately?

This error fires when V8's old generation heap reaches its configured maximum. The default limit depends on the Node.js version and on available system memory; older 64-bit releases capped it at roughly 1.5GB. V8 cannot collect enough garbage to make room for new allocations and gives up. Immediate fix: increase --max-old-space-size to buy time, for example by setting NODE_OPTIONS='--max-old-space-size=2048' while you diagnose the root cause. This is not a permanent solution: it delays the crash but does not fix the leak. Proper fix: take a heap snapshot before and after 30 minutes of load, compare them in the Comparison view of the Chrome DevTools Memory tab, and identify the object types with the largest retained size growth. Then fix the actual leak. The most common causes are unbounded in-memory caches (fix with lru-cache max and ttl options), event listeners that accumulate across reconnections (fix by pairing addListener with removeListener on close), and closures in middleware or request handlers retaining req or res objects beyond the request lifecycle. Long-term fix: add heap utilisation monitoring (used_heap_size / heap_size_limit) as a Prometheus metric and alert when it exceeds 0.75. This gives you hours of warning before the OOM crash, which is time to take a heap snapshot and diagnose while the service is still running.
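The heap utilisation ratio mentioned above can be read from `v8.getHeapStatistics()`. A minimal sketch, assuming you wire the returned value into whatever metrics client you already use; `heapUtilisation`, `checkHeap`, and the threshold constant are hypothetical names.

```javascript
const v8 = require('node:v8');

const HEAP_ALERT_THRESHOLD = 0.75; // alert level suggested in the text

// Fraction of the configured heap limit currently in use (0..1).
function heapUtilisation() {
  const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
  return used_heap_size / heap_size_limit;
}

function checkHeap() {
  const ratio = heapUtilisation();
  if (ratio > HEAP_ALERT_THRESHOLD) {
    console.warn(`Heap at ${(ratio * 100).toFixed(1)}% of limit: capture a heap snapshot`);
    // v8.writeHeapSnapshot() could be called here, but it blocks the process
    // while writing, so trigger it deliberately rather than on every check.
  }
  return ratio; // export this value as a gauge
}

console.log(checkHeap());
```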

Q: Is Node.js 22 LTS significantly different from Node.js 20 LTS for the topics covered in this guide?

For the event loop model, clustering architecture, and worker thread API: no, they are functionally identical. The fundamental architecture has not changed. What is different in Node.js 22 that might affect production performance: io_uring support on Linux (kernel 5.1+) for faster async file I/O where libuv has it available and enabled; V8 12.x with improved JSON.parse performance (still synchronous, still blocks the event loop, but faster than 20 LTS for the same payload size); the built-in test runner (no more mocha/jest dependency for basic testing); a built-in WebSocket client via the WHATWG WebSocket API; and improved single-executable application support. Node.js 20 entered maintenance mode in October 2024 and reaches end-of-life in April 2026. If you are still running 20 in production, the migration to 22 LTS is low-risk for most applications: the breaking changes between 20 and 22 are minimal and well-documented in the Node.js changelog.

An inline handlebars.compile() blocked the Node.js event loop for 3.8s, spiking P99 from 45ms to 4s and failing all endpoints with 504s in production.

Naren Founder & Principal Engineer

20+ years shipping production JavaScript and front-end systems at scale. Notes here come from systems that actually shipped.

Last updated: July 27, 2026

Before you start (⏱ 30 min)

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems


⚡Quick Answer

Node.js runs on a single-threaded event loop with libuv handling async I/O via a thread pool (default 4 threads, tunable via UV_THREADPOOL_SIZE)
The event loop has 6 distinct phases (timers, pending, idle/prepare, poll, check, close) — blocking any phase freezes the entire process for every connected client
cluster module forks one process per CPU core, sharing no memory but enabling true parallel request handling across cores
Worker threads share memory via SharedArrayBuffer and handle CPU-bound work without blocking the event loop — use a fixed pool, never spawn per request
Memory leaks typically come from unbounded caches, forgotten timers, or closure-retained references — detect with --inspect and heap snapshot comparison
Production profiling (flamegraphs, clinic.js) reveals bottlenecks that synthetic benchmarks never expose
Node.js 22 LTS ships with a native test runner, built-in WebSocket client, and improved single-executable application support; no runtime changes affect the event loop model described here

✦ Definition (~90s read)

What is Node.js Performance Optimisation?

Node.js performance optimisation is the practice of identifying and eliminating bottlenecks in your application's event loop, memory usage, and CPU-bound operations to maintain sub-100ms response times under load. The core challenge is that Node.js runs a single-threaded event loop: any synchronous JavaScript that runs for more than a few milliseconds delays every other request on that process.

A 4-second template compile isn't just slow; it's a denial-of-service for every other user hitting that process. Optimisation targets three specific areas: keeping the event loop responsive (no callback should hold the thread for more than about 50ms), managing garbage collection pauses (V8's GC can stop the world for 10-100ms), and offloading CPU-intensive work to worker threads or child processes.

Real-world tools like clinic.js, 0x, and Node's built-in --prof flag show where CPU time actually goes under load, which average response times hide. For CPU-heavy workloads the usual alternative to Node.js is a runtime with native multi-core parallelism such as Go (goroutines scheduled across cores); Python with asyncio is also single-threaded and hits the same wall. When you're already in the Node ecosystem, clustering across 4-8 cores and using worker threads for template compilation, image processing, or data transformation is the pragmatic path.

You don't optimise everything. You profile under real load (1000+ concurrent connections with tools like autocannon or wrk2), find the single blocking function, and fix that. The 4-second compile becomes a one-time cached compile or an offloaded task, and your event loop breathes again.

Plain-English First

Imagine a single brilliant chef running your entire restaurant kitchen alone. They're fast — genuinely fast — but if one order takes 20 minutes to prepare (like baking a cake from scratch), every other customer waits. The chef doesn't serve table two while table one's cake is in the oven. Node.js is that chef: one thread, blazing fast for quick tasks, but one slow synchronous operation can freeze everything else behind it. Optimising Node.js means teaching that chef to delegate long jobs to kitchen assistants, batch similar tasks efficiently, and never stand still doing nothing when there are orders waiting. Clustering is like opening multiple identical kitchens. Worker threads are the kitchen assistants for heavy prep work.

Every Node.js app starts fast. Then reality hits — traffic spikes, database queries pile up, memory climbs steadily overnight, and that 'non-blocking' promise starts feeling like a lie. The truth is Node.js is genuinely efficient by design, but that efficiency has a very specific shape. Violate that shape and you'll hit performance walls that no amount of horizontal scaling will fully fix.

Node.js is single-threaded, but not single-concurrent. The event loop, backed by libuv's thread pool and epoll/kqueue/io_uring at the OS level, handles thousands of concurrent I/O operations without spawning OS threads. The problem starts when you confuse 'non-blocking I/O' with 'can do anything without consequence'. A JSON.parse on a 50MB payload, a synchronous crypto operation, or a tight computation loop will block the event loop — and every connected client pays the price simultaneously.

In 2026, Node.js 22 is a supported long-term support release (Node.js 20 entered maintenance mode in October 2024 and reaches end-of-life in April 2026). Node.js 22 shipped with io_uring support on Linux for faster async file I/O (where it is available and enabled), a built-in test runner, native WebSocket client support, and performance improvements from V8 12.x. None of these changes alter the event loop model or the clustering architecture described in this guide: the fundamentals remain exactly what they were, and the mistakes remain exactly as costly.

This guide covers the internals that matter in production: event loop phases and timing, clustering for multi-core utilisation, worker threads for CPU-bound work, memory leak patterns and detection, and the profiling tools that reveal what synthetic benchmarks consistently miss. Every section is grounded in production behaviour, not toy examples.

What Node.js Performance Optimisation Actually Targets

Node.js performance optimisation is the practice of minimising event loop latency by identifying and eliminating synchronous operations that block the thread. The core mechanic is the event loop: a single-threaded loop that processes callbacks, timers, and I/O events. Any synchronous CPU work — like compiling a Handlebars template that takes 4 seconds — stalls the loop, freezing all concurrent requests. The goal is to keep each tick under 1ms for interactive services.

In practice, the key property is that Node.js excels at I/O-bound workloads but fails under CPU-bound synchronous code. A single synchronous 4-second template compile blocks 100% of throughput during that window. The event loop cannot advance, so no new requests are accepted, no timers fire, and no data is read from sockets. This is not a concurrency issue — it's a thread-blocking issue. The fix is either offloading to a worker thread or caching the compiled template.

Use optimisation techniques like caching, streaming, and worker threads when you detect synchronous operations exceeding 10ms. In real systems, a single unoptimised template compile can cause cascading timeouts across upstream services, triggering retry storms and degrading the entire cluster. The rule: never do CPU work in the main thread that can be done once, cached, or offloaded.

⚠ Misconception: Async Means Non-Blocking

Async I/O does not make CPU-bound synchronous code non-blocking. A 4-second template compile in a route handler blocks the event loop regardless of async wrappers.

📊 Production Insight

A Node.js API server with Handlebars template compilation on every request caused 4-second latency spikes under load, tripping health checks and triggering Kubernetes pod restarts.

Symptom: request latency jumps from 5ms to 4000ms, CPU at 100% on one core, event loop delay > 3000ms.

Rule: compile templates once at startup or cache them — never compile on every request in the main thread.
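The compile-once rule can be sketched as a small cache keyed by template name. This is an illustration, not the incident's actual fix: `createTemplateCache` is a hypothetical helper, and `toyCompile` is a stand-in compiler so the example runs without dependencies. With Handlebars you would pass `Handlebars.compile` instead.

```javascript
// Wraps any compile(source) => render(data) function with a per-name cache,
// so the expensive compile step runs once per template, not once per request.
function createTemplateCache(compile) {
  const cache = new Map();
  return function render(name, source, data) {
    let template = cache.get(name);
    if (!template) {
      template = compile(source); // slow path: first use only
      cache.set(name, template);
    }
    return template(data);        // fast path: every later request
  };
}

// Stand-in compiler (assumption): replaces {{key}} with data[key]
let compileCalls = 0;
function toyCompile(source) {
  compileCalls++;
  return (data) => source.replace(/\{\{(\w+)\}\}/g, (_, key) => String(data[key] ?? ''));
}

const render = createTemplateCache(toyCompile);
console.log(render('greeting', 'Hello {{name}}', { name: 'Ada' })); // Hello Ada
console.log(render('greeting', 'Hello {{name}}', { name: 'Lin' })); // Hello Lin
console.log(compileCalls); // 1
```

Warming the cache at startup, before the server starts listening, moves even the first compile out of the request path.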

🎯 Key Takeaway

The event loop is single-threaded — any synchronous CPU work blocks all concurrent requests.

Measure event loop delay with process.hrtime.bigint(), perf_hooks.monitorEventLoopDelay(), or clinic.js, and target < 1ms per tick.

Offload CPU-heavy work to worker threads or cache aggressively; never compile templates or parse large JSON in request handlers.
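For the measurement takeaway above, Node.js ships a built-in histogram in `perf_hooks`. A minimal sketch, assuming a 10-second reporting interval and console output in place of a real metrics client; `reportLoopDelay` is a hypothetical helper name.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Samples loop delay every 20ms; values are recorded in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Returns the delay seen since the last report, in milliseconds, then resets.
function reportLoopDelay() {
  const toMs = (ns) => ns / 1e6;
  const snapshot = {
    meanMs: toMs(histogram.mean),
    p99Ms: toMs(histogram.percentile(99)),
    maxMs: toMs(histogram.max),
  };
  histogram.reset();
  return snapshot;
}

// unref() so the reporting timer never keeps the process alive on its own
const reporter = setInterval(() => console.log(reportLoopDelay()), 10_000);
reporter.unref();
```

The p99 and max figures matter more than the mean: a single multi-second stall barely moves an average taken over thousands of samples.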

thecodeforge.io

Nodejs Performance Optimisation

Event Loop Internals: Phases, Timing, and What Blocks Everything

The Node.js event loop is not a simple FIFO queue. It runs in six distinct phases, each with its own callback queue, executed in a fixed order every tick. Understanding this order is the difference between writing code that performs predictably in production and code that surprises you at 3 AM on-call.

The phases in execution order: timers (setTimeout/setInterval callbacks whose delay has expired), pending callbacks (I/O callbacks deferred from the previous iteration, like TCP error notifications), idle/prepare (used internally by libuv; you don't interact with this), poll (new I/O events; this is where most work happens, and where the loop may block waiting for events when there's nothing queued), check (setImmediate callbacks), and close callbacks (socket.on('close', ...) handlers).

After every callback, and therefore at every phase transition, Node.js fully drains two queues: the process.nextTick queue first, then the microtask queue that holds resolved Promise.then handlers. This ordering matters more than most engineers realise: process.nextTick fires before any Promise.then, which fires before the next event loop phase.

The critical practical insight: setImmediate always runs after the poll phase completes within the current iteration. setTimeout(fn, 0) runs in the next iteration's timer phase. Inside an I/O callback, setImmediate therefore always fires first. (Scheduled from the main module instead, the relative order of the two is not guaranteed.) This isn't academic trivia. It determines the ordering of operations in streaming pipelines, connection teardown sequences, and graceful shutdown logic where getting the order wrong causes data loss or unclosed handles.

On Node.js 22 LTS, libuv can use io_uring on Linux where it is available and enabled, which reduces syscall overhead for file I/O operations. The event loop phase model is unchanged; what changes is the speed at which the poll phase can process file system events.

event-loop-phases.js (JavaScript)

const fs = require('fs');

// Demonstrates the phase ordering — run this and study the output
fs.readFile(__filename, () => {
  // We are now inside an I/O callback (poll phase)

  setTimeout(() => console.log('3: setTimeout'), 0);
  // Will run in the NEXT iteration's timer phase

  setImmediate(() => console.log('2: setImmediate'));
  // Will run in THIS iteration's check phase — before setTimeout

  Promise.resolve().then(() => console.log('1: Promise.then'));
  // Microtask — runs before check phase

  process.nextTick(() => console.log('0: nextTick'));
  // Microtask — runs before Promise.then, before any phase
});

// Output order:
// 0: nextTick       (microtask, highest priority — before any phase transition)
// 1: Promise.then   (microtask, runs after nextTick queue is empty)
// 2: setImmediate   (check phase — this iteration)
// 3: setTimeout     (timer phase — NEXT iteration)


// ── Production pattern: yield the event loop during batch processing ───────
// Without yielding, processing 100,000 items blocks the loop for the full duration.
// With setImmediate between batches, I/O and other requests can interleave.
function processBatchWithYield(items, batchSize, processFn, onComplete) {
  let index = 0;

  function processNextBatch() {
    const batchEnd = Math.min(index + batchSize, items.length);

    // Process one batch synchronously
    for (; index < batchEnd; index++) {
      processFn(items[index]);
    }

    if (index < items.length) {
      // Yield to the event loop — allows pending I/O and HTTP requests to run
      // setImmediate is better than setTimeout(fn, 0) here:
      // no artificial 1ms delay, runs in the check phase of the current iteration
      setImmediate(processNextBatch);
    } else {
      onComplete();
    }
  }

  processNextBatch();
}

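// Usage (largeArray and transform are placeholders for your own data and per-item work;
// define them before calling, otherwise this throws a ReferenceError):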
processBatchWithYield(
  largeArray,
  500,           // Process 500 items per batch before yielding
  (item) => transform(item),
  () => console.log('All items processed')
);


// ── Measuring actual event loop lag in production ──────────────────────────
// Schedule a task, measure how long it actually takes to execute vs expected.
// The delta is your event loop lag.
function measureEventLoopLag() {
  const INTERVAL_MS = 1000;
  let lastCheck = Date.now();

  setInterval(() => {
    const now = Date.now();
    const lag = now - lastCheck - INTERVAL_MS;
    lastCheck = now;

    if (lag > 50) {
      console.warn(`Event loop lag: ${lag}ms — investigate blocking operations`);
    }
    // In production: expose this as a Prometheus gauge
  }, INTERVAL_MS);
}

measureEventLoopLag();

Output

// Output from the phase ordering demo:
0: nextTick
1: Promise.then
2: setImmediate
3: setTimeout

// Output from measureEventLoopLag() under normal load:
// (no output: lag stays under the 50ms threshold)

// Output from measureEventLoopLag() when a synchronous operation blocks the loop:
Event loop lag: 382ms — investigate blocking operations
Event loop lag: 3847ms — investigate blocking operations

Mental Model

The Event Loop as a Six-Station Assembly Line

Think of the event loop as a factory assembly line with six inspection stations — each station must clear its entire backlog before the belt moves to the next station. If one station stalls (because someone gave it a 4-second task), every item queued at every other station waits. The line is frozen.

process.nextTick() cuts the queue entirely — it runs before the next station even starts, making it useful for deferring work within the current operation but dangerous if overused (it can starve I/O)
Promise.then() callbacks also run as microtasks, after nextTick is exhausted but before any phase transition
The poll phase is where the loop spends most of its time — it processes I/O callbacks and may block waiting for new I/O events when the queue is empty and no timers are pending
setImmediate was designed specifically for yielding inside I/O callbacks — it fires before the next timer phase without the 1ms minimum delay that setTimeout carries
Blocking any phase for more than 10-20ms degrades latency for every connected client simultaneously — there is no isolation between requests in a single Node.js process

📊 Production Insight

In production, the single most common event loop blocker is JSON.parse on large request bodies.

Parsing a 10MB JSON payload takes 50-200ms depending on structure complexity and CPU generation — during that entire window, zero other requests are processed.

On Node.js 22 with V8 12.x, JSON.parse performance has improved, but the blocking nature has not changed — it is still synchronous.

Rule: stream-parse large bodies with a streaming JSON parser, or enforce a strict body size limit at the reverse proxy (nginx, AWS ALB) before the payload reaches Node.js.
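If the limit cannot be enforced at the proxy, the same rule can be applied inside Node.js before JSON.parse ever sees the payload. The sketch below is one possible shape: `readBodyWithLimit` is a hypothetical helper, the 413 status is an assumption about how you want to respond, and an in-memory stream stands in for an `http.IncomingMessage`.

```javascript
const { Readable } = require('node:stream');

// Collects a readable stream into a Buffer, rejecting as soon as the running
// total exceeds maxBytes so an oversized body is never buffered or parsed.
function readBodyWithLimit(stream, maxBytes) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    let received = 0;

    stream.on('data', (chunk) => {
      received += chunk.length;
      if (received > maxBytes) {
        stream.destroy(); // stop reading; no further 'data' events arrive
        const err = new Error(`Body exceeds ${maxBytes} bytes`);
        err.statusCode = 413;
        reject(err);
        return;
      }
      chunks.push(chunk);
    });
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });
}

// Usage with an in-memory stream standing in for an incoming request
readBodyWithLimit(Readable.from([Buffer.from('{"ok":true}')]), 1024)
  .then((body) => console.log(JSON.parse(body.toString())));
```

Rejecting early bounds the worst-case JSON.parse time, because the parser only ever sees payloads under the limit.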

🎯 Key Takeaway

The event loop has six phases with fixed execution order — microtasks (nextTick, Promise.then) drain between every phase transition.

Blocking any phase blocks all concurrent connections for the full duration — there is no request isolation in a single process.

Yield long synchronous work with setImmediate between batches, or offload to worker threads for work that cannot be chunked.

Where io_uring is available and enabled on Linux, Node.js 22 gets faster async file I/O; the phase model is unchanged, but poll phase throughput for file operations improves.

Choosing Between setTimeout, setImmediate, and nextTick

If: you need to defer work until after I/O callbacks complete in the current event loop iteration
→ Use: setImmediate. It runs in the check phase immediately after poll, with no artificial timer delay.

If: you need work to run before any I/O, timer, or check callbacks (highest priority deferral)
→ Use: process.nextTick. It runs before the next event loop phase, but use it sparingly as it can starve I/O if called recursively.

If: you need a minimum wall-clock delay before execution
→ Use: setTimeout. Minimum granularity is ~1ms (a delay of 0 is treated as 1ms); Node.js 22 improves timer accuracy but does not eliminate this floor.

If: you are processing a large array or dataset without blocking the event loop
→ Use: batching with setImmediate between chunks. Yield after every N items to allow pending I/O and incoming requests to interleave.

If: the work is CPU-bound and will take more than 10ms
→ Use: a worker thread via piscina. Do not try to chunk it with setImmediate if the work cannot be easily divided.

Clustering: Scaling Across CPU Cores Without Shared State

Node.js runs on a single thread. A single process on a 32-core machine uses roughly 3% of available CPU — the other 97% sits idle while your users wait. The cluster module solves this by forking the main process into N worker processes, each with its own event loop, its own V8 heap, and its own memory space. The primary process accepts incoming connections and distributes them to workers via IPC.

Clustering is not the same as load balancing across machines. Cluster workers share the same physical host: same CPU, same RAM, same network interface, same system-wide file descriptor limits. The scaling ceiling is not just CPU. As you add workers, shared resources become the actual bottleneck: a database that allows 20 connections, divided across 16 workers, leaves each worker an average of 1.25 connections, which all but serialises concurrent database work inside each worker.
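The connection arithmetic above is worth making explicit when sizing pools. A sketch with hypothetical helper names, using the 20-connection, 16-worker figures from the paragraph:

```javascript
// Whole connections each worker can hold when a fixed database connection
// budget is split evenly across workers (at least 1 so every worker can query).
function poolSizePerWorker(dbConnectionBudget, workerCount) {
  return Math.max(1, Math.floor(dbConnectionBudget / workerCount));
}

// Largest worker count that still gives every worker minPerWorker connections.
function maxWorkersForBudget(dbConnectionBudget, minPerWorker) {
  return Math.floor(dbConnectionBudget / minPerWorker);
}

console.log(20 / 16);                    // 1.25 connections per worker on average
console.log(poolSizePerWorker(20, 16));  // 1
console.log(maxWorkersForBudget(20, 5)); // 4
```

Working backwards from the connection budget often gives a lower worker count than the core count does, and the lower number is the real ceiling.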

The cluster module has two scheduling policies. Round-robin (cluster.SCHED_RR), where the primary accepts connections and hands them to workers in turn, is the default on every platform except Windows. The alternative (cluster.SCHED_NONE) delegates connection distribution to the OS. That sounds neutral but can produce significantly uneven worker load in practice: some workers handle 3x more requests than others depending on connection arrival timing. If your workers are on SCHED_NONE, setting NODE_CLUSTER_SCHED_POLICY=rr switches back to round-robin in the primary process, which distributes connections more evenly at the cost of a small IPC overhead per connection.
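One way to make the policy explicit rather than relying on platform defaults or environment variables is to set it in code. A minimal sketch; `policyName` is a hypothetical helper used only for logging.

```javascript
const cluster = require('node:cluster');

// Must be assigned before the first cluster.fork() call to take effect.
cluster.schedulingPolicy = cluster.SCHED_RR;

// Human-readable label for whichever policy is active
function policyName(policy) {
  if (policy === cluster.SCHED_RR) return 'round-robin (primary distributes)';
  if (policy === cluster.SCHED_NONE) return 'none (OS distributes)';
  return 'unknown';
}

console.log(`cluster scheduling policy: ${policyName(cluster.schedulingPolicy)}`);
```

Logging the active policy at startup makes uneven per-worker load easier to explain later.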

In Kubernetes in 2026, the question of whether to cluster at all has a nuanced answer. If your pod is allocated 1 CPU, clustering gives you nothing — you have one core. If your pod is allocated 4 CPUs, clustering with 3-4 workers utilises those cores. The K8s-preferred pattern is one process per pod (no clustering) with the orchestrator managing pod count for horizontal scaling — this gives you cleaner isolation, simpler crash recovery, and better resource accounting. But if your workload has high per-request CPU variance (some requests are light, some are heavy), clustering within a multi-CPU pod still wins on tail latency.

cluster.production.js (JavaScript)

const cluster = require('cluster');
const http = require('http');
const os = require('os');

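// In K8s: WEB_CONCURRENCY can be injected as an env var matching allocated CPUs.
// In VMs/bare metal: default to os.cpus().length.
// Note: os.cpus().length reports the host's logical cores, not a container's CPU limit,
// so set WEB_CONCURRENCY explicitly in containers.
// Never hard-code a number: it breaks when the container size changes.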
const WORKER_COUNT = parseInt(process.env.WEB_CONCURRENCY, 10) || os.cpus().length;
const PORT = parseInt(process.env.PORT, 10) || 3000;

if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} — forking ${WORKER_COUNT} workers`);

  // Crash-loop protection has to live outside any single worker: every fork
  // gets a fresh worker.id, so a per-worker counter would reset on each respawn.
  const MAX_CRASHES = 5;
  const CRASH_WINDOW_MS = 60_000;
  let recentCrashes = [];       // timestamps of unexpected worker exits
  let shuttingDown = false;     // set on SIGTERM so exits are not respawned
  const retiring = new Set();   // ids of workers being replaced by a rolling restart

  function spawnWorker() {
    const worker = cluster.fork();
    console.log(`Worker ${worker.id} (pid ${worker.process.pid}) started`);
    return worker;
  }

  for (let i = 0; i < WORKER_COUNT; i++) {
    spawnWorker();
  }

  cluster.on('exit', (worker, code, signal) => {
    // Planned exits (shutdown or rolling restart) are not crashes
    if (shuttingDown || retiring.delete(worker.id)) return;

    const now = Date.now();
    recentCrashes = recentCrashes.filter((t) => now - t < CRASH_WINDOW_MS);
    recentCrashes.push(now);

    const reason = signal || `exit code ${code}`;
    console.error(`Worker ${worker.id} exited (${reason}). Crash count: ${recentCrashes.length}`);

    // More than 5 crashes inside the window means something is fundamentally
    // wrong. Stop restarting: in K8s the pod will be replaced, in a VM alert
    // and investigate.
    if (recentCrashes.length > MAX_CRASHES) {
      console.error(`${recentCrashes.length} worker crashes in ${CRASH_WINDOW_MS / 1000}s — stopping restarts`);
      return;
    }

    // Brief delay before respawning to avoid crash storms on startup failures
    setTimeout(spawnWorker, 500);
  });

  // Rolling restart: ask each worker in sequence to drain and exit, then
  // replace it before moving on. This achieves zero-downtime restarts without PM2.
  async function rollingRestart() {
    const workerIds = Object.keys(cluster.workers);
    for (const id of workerIds) {
      const worker = cluster.workers[id];
      if (!worker || worker.isDead()) continue;

      // Mark the exit as planned so the 'exit' handler above does not respawn too
      retiring.add(worker.id);
      await new Promise((resolve) => {
        worker.once('exit', () => {
          spawnWorker();
          // Give the new worker 2 seconds to initialise before draining the next one
          setTimeout(resolve, 2000);
        });
        worker.send('graceful-shutdown');
      });
    }
    console.log('Rolling restart complete');
  }

  // SIGUSR1 is reserved by Node.js for starting the inspector, so use SIGUSR2
  process.on('SIGUSR2', rollingRestart);

  process.on('SIGTERM', () => {
    console.log('Primary received SIGTERM — initiating graceful shutdown');
    shuttingDown = true;
    for (const id in cluster.workers) {
      cluster.workers[id].send('graceful-shutdown');
      cluster.workers[id].disconnect();
    }
  });

} else {
  // Worker process — this is where your actual application runs
  const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end(`Worker ${cluster.worker.id} (pid ${process.pid}) handled this request`);
  });

  server.listen(PORT, () => {
    console.log(`Worker ${cluster.worker.id} listening on port ${PORT}`);
  });

  // Handle graceful shutdown: stop accepting new connections,
  // finish in-flight requests, then exit cleanly.
  process.on('message', (msg) => {
    if (msg === 'graceful-shutdown') {
      console.log(`Worker ${cluster.worker.id} shutting down gracefully`);
      server.close(() => {
        console.log(`Worker ${cluster.worker.id} shutdown complete`);
        process.exit(0);
      });

      // Force exit if graceful shutdown takes too long (e.g., a stuck WebSocket)
      setTimeout(() => {
        console.error(`Worker ${cluster.worker.id} forced exit after timeout`);
        process.exit(1);
      }, 30_000);
    }
  });

  // Catch unhandled rejections at the worker level
  // Log them fully before the process exits — silent rejections lose stack traces
  process.on('unhandledRejection', (reason, promise) => {
    console.error('Unhandled rejection in worker', cluster.worker.id, reason);
    // Depending on the severity, you may want to exit here:
    // process.exit(1);
  });
}

Output

Primary 1234 — forking 4 workers
Worker 1 (pid 1235) listening on port 3000
Worker 2 (pid 1236) listening on port 3000
Worker 3 (pid 1237) listening on port 3000
Worker 4 (pid 1238) listening on port 3000

# After sending SIGTERM to the primary:
Primary received SIGTERM — initiating graceful shutdown
Worker 1 shutting down gracefully
Worker 2 shutting down gracefully
Worker 3 shutting down gracefully
Worker 4 shutting down gracefully
Worker 1 shutdown complete
Worker 2 shutdown complete
Worker 3 shutdown complete
Worker 4 shutdown complete

# If a worker crashes:
Worker 2 exited (exit code 1). Crash count: 1
Worker 5 (pid 1291) listening on port 3000

⚠ Cluster Workers Share Absolutely Nothing in Memory

Each cluster worker is a completely separate V8 isolate with its own heap. In-memory caches, session stores, rate limiters, and feature flag states are NOT shared across workers — they exist independently in each worker's heap. A user whose request hits worker 1 has zero visibility into what worker 2 has cached. In practice this means: any state that must be consistent across requests must live in an external store (Redis for sessions and caches, a database for rate limiting counters). In-memory state is only reliable in single-process deployments, which is fine for local development and occasionally fine for small internal tools, but never appropriate for multi-worker production services.

📊 Production Insight

Cluster workers die silently in production more often than engineers expect.

Unhandled promise rejections (fatal by default since Node.js 15), native addon segfaults, and OOM kills all take workers down with little log output beyond the exit event.

Node.js 22 still treats unhandled promise rejections as fatal by default. If your workers are dying and you're not sure why, add a process.on('unhandledRejection') handler that logs the full rejection reason. Registering that handler replaces the default crash behaviour, so call process.exit(1) yourself after logging if the worker should still die.

Rule: implement worker death tracking with crash count limits. A crash-looping worker that restarts every 500ms consumes resources and delays investigation. Cap restarts at 5, alert, and let the orchestrator handle replacement.

🎯 Key Takeaway

Cluster forks one process per CPU core — each gets its own V8 heap, its own event loop, and its own memory space.

Workers share nothing in memory — sessions, caches, and rate limiters must live in Redis or an equivalent external store.

Shared resource contention (database connection pool size, file descriptor limits) is the actual scaling ceiling, not CPU count.

In Kubernetes, prefer one process per pod plus K8s horizontal pod autoscaling over intra-pod clustering for cleaner isolation and resource accounting.

Worker Threads for CPU-Intensive Work — And When Not to Use Them

Cluster processes are heavyweight. Each fork creates a full operating-system process with its own V8 instance, its own heap, its own garbage collector, and its own event loop. Worker threads are lighter in a specific way: each thread still gets its own V8 isolate and event loop, but it runs within the same process, can share memory via SharedArrayBuffer, and has a startup cost of 10-50ms versus the 100-300ms for a full process fork.

Worker threads are not a general-purpose concurrency model. They exist for one job: CPU-bound parallelism within a single request or operation lifecycle. They excel at image processing, cryptographic key derivation, data transformation, report generation, ML model inference, and any algorithm that is genuinely computation-heavy. For I/O-bound work — database queries, HTTP calls, file reads — the event loop already handles concurrency efficiently. Adding threads for I/O adds complexity, coordination overhead, and thread pool management with zero performance benefit.

The architectural difference that matters most: worker threads can share memory via SharedArrayBuffer with Atomics for synchronisation. This enables zero-copy data sharing between threads — you pass a reference to a shared buffer, not a serialised copy of the data. For high-throughput data processing where the payload size is large, the difference between zero-copy and serialise-deserialise can dominate latency entirely.
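To make the Atomics part concrete, here is a small sketch in which several threads increment one shared counter. The countInParallel function and its parameters are hypothetical names chosen for this illustration, and spawning workers directly (instead of through a pool) is done only to keep the example self-contained.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical helper: spawn `threads` workers that each add to one shared counter.
function countInParallel(threads, incrementsPerThread) {
  const shared = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
  const counter = new Int32Array(shared);

  const workerSource = `
    const { workerData, parentPort } = require('worker_threads');
    const counter = new Int32Array(workerData.shared);
    for (let i = 0; i < workerData.increments; i++) {
      // Atomics.add is an indivisible read-modify-write, so no updates are lost
      Atomics.add(counter, 0, 1);
    }
    parentPort.postMessage('done');
  `;

  const jobs = Array.from({ length: threads }, () => new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      // The SharedArrayBuffer is shared by reference; every thread sees the same memory
      workerData: { shared, increments: incrementsPerThread },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
  }));

  return Promise.all(jobs).then(() => Atomics.load(counter, 0));
}
```

With a plain `counter[0]++` in place of Atomics.add, concurrent increments could overwrite each other and the final total would be unreliable.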

The rule about thread pools is non-negotiable in production: never spawn a worker thread per incoming request. Thread creation has real overhead — 10-50ms startup, ~5-10MB memory per thread including stack and V8 overhead. Under concurrent load, naive spawn-per-request causes thread storms. 200 simultaneous requests spawn 200 threads, the process hits OS thread limits, thread creation starts failing, and the whole thing collapses. The correct pattern is a fixed-size thread pool — piscina is the most production-hardened library for this in the Node.js ecosystem as of 2026.

workers.cpu-task.js · JAVASCRIPT

// ── cpu-worker.js ─────────────────────────────────────────────────────────
// This file runs inside each piscina worker thread.
// piscina calls the exported function with the argument passed to pool.run()
// and sends the return value back to the caller. (Reading workerData and
// calling parentPort.postMessage is the raw worker_threads pattern and does
// not work with pool.run().)

const { threadId } = require('worker_threads');
const crypto = require('crypto');

function computeHashes(count) {
  // This is genuinely CPU-bound: no I/O, pure computation.
  // Running this on the main thread would block the event loop for the full duration.
  const results = new Array(count);
  for (let i = 0; i < count; i++) {
    results[i] = crypto
      .createHash('sha256')
      .update(`payload-${i}-${Date.now()}`)
      .digest('hex');
  }
  return results;
}

module.exports = ({ iterations }) => ({
  threadId,
  count: iterations,
  sample: computeHashes(iterations)[0],
});


// ── thread-pool.js ────────────────────────────────────────────────────────
// Production pattern: fixed-size thread pool using piscina.
// Never spawn a thread per request. Queue excess work.

const Piscina = require('piscina');
const path = require('path');

// Create the pool once at application startup — not per request.
// maxThreads should roughly match CPU core count.
// idleTimeout: threads that have been idle for 30s are terminated to free memory.
const pool = new Piscina({
  filename: path.resolve(__dirname, 'cpu-worker.js'),
  // Math.max guards against 0 threads on a single-core machine or container
  maxThreads: parseInt(process.env.WORKER_THREADS, 10) || Math.max(1, require('os').cpus().length - 1),
  idleTimeout: 30_000,
});

// In your Express/Fastify route handler:
async function handleHashRequest(req, res) {
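  // Query values arrive as strings; fall back to 1000 when missing or not a number
  const requested = parseInt(req.query.iterations, 10);

  // Clamp to 1..10,000 to prevent intentional DoS via this endpoint
  const safeIterations = Math.min(Math.max(Number.isNaN(requested) ? 1000 : requested, 1), 10_000);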

  try {
    // This runs in a worker thread — the event loop is completely free
    // to handle other requests while this computation runs.
    const result = await pool.run({ iterations: safeIterations });
    res.json({ threadId: result.threadId, count: result.count, sample: result.sample });
  } catch (err) {
    // piscina surfaces worker errors as rejected promises
    console.error('Worker thread error:', err);
    res.status(500).json({ error: 'Computation failed' });
  }
}

// Monitor pool queue depth — if it grows unboundedly, you need more threads
// or you need to reject requests earlier with a 503
setInterval(() => {
  if (pool.queueSize > 50) {
    console.warn(`Thread pool queue depth: ${pool.queueSize} — consider increasing maxThreads or load shedding`);
  }
}, 5_000);


// ── SharedArrayBuffer pattern for zero-copy data sharing ──────────────────
// When passing large datasets to worker threads, avoid serialisation overhead
// by sharing the underlying ArrayBuffer directly.

const { Worker } = require('worker_threads');

function processLargeDataset(data) {
  // Convert data to a typed array for sharing
  const sharedBuffer = new SharedArrayBuffer(data.length * 4);
  const sharedView = new Int32Array(sharedBuffer);

  // Copy data into shared buffer — this is the only copy
  data.forEach((val, i) => { sharedView[i] = val; });

  return new Promise((resolve, reject) => {
    const worker = new Worker(`
      const { workerData, parentPort } = require('worker_threads');
      const view = new Int32Array(workerData.buffer);
      // Process in-place — no additional memory allocation for the data itself
      let sum = 0;
      for (let i = 0; i < view.length; i++) sum += view[i];
      parentPort.postMessage({ sum });
    `, {
      eval: true,
      workerData: { buffer: sharedBuffer } // Shared by reference: neither copied nor transferred
    });

    worker.on('message', resolve);
    worker.on('error', reject);
  });
}

Output

// piscina pool handling concurrent requests:

// Request 1 → threadId: 1, count: 1000, sample: 'a3f8c2...'

// Request 2 → threadId: 2, count: 1000, sample: 'b7d4e1...'

// Request 3 → threadId: 1, count: 1000, sample: 'c9a1f3...'

// (threadId 1 was reused — threads stay alive and handle multiple tasks)

// Pool queue warning under sustained load:

// Thread pool queue depth: 67 — consider increasing maxThreads or load shedding

// Comparison: blocking event loop vs worker thread

// Without worker thread:

// Request latency during 10,000 hash computation: ~380ms blocked

// Other requests during that time: 0 served

// With worker thread pool:

// Request latency during 10,000 hash computation: ~380ms (in thread)

// Other requests during that time: served normally, <5ms latency

Mental Model

Cluster vs Worker Threads — Two Different Problems

Clustering and worker threads solve completely different problems. Mixing them up wastes resources and adds complexity without benefit. Get the model clear before choosing.

Cluster workers: each gets its own event loop and V8 heap — best when you need to handle more concurrent I/O-bound requests than one event loop can manage
Worker threads: run inside the parent process, each with its own V8 isolate and heap, and can share memory explicitly via SharedArrayBuffer. Lighter startup than a fork. Best when a single request triggers a CPU-bound computation that would block the event loop
If your bottleneck is database query throughput or concurrent HTTP connections, clustering wins — more event loops means more concurrent I/O operations
If your bottleneck is computation within a request (hashing, image resizing, PDF generation, ML inference), worker threads win — CPU parallelism without full process overhead
You can combine both: a clustered app where each worker has a small thread pool for the occasional CPU-bound request — this is the right architecture for mixed I/O and CPU workloads
Never spawn a thread per request — use piscina with a fixed maxThreads sized to CPU count minus 1 (leave one core for the event loop)

📊 Production Insight

Worker thread startup costs 10-50ms and 5-10MB memory per thread on Node.js 22.

Spawning threads per request under load creates thread storms — 500 concurrent requests attempt to spawn 500 threads, hit OS limits, and the process crashes or thrashes.

Piscina's queue mechanism is your safety valve: excess tasks queue rather than spawning new threads. Monitor pool.queueSize as a metric — if it grows consistently, you need more threads or earlier load shedding.

Rule: create the pool once at startup, size maxThreads to CPU count minus one (leaving a core for the event loop), and bound the queue: return a 503 once pool.queueSize exceeds the depth your SLA can tolerate.
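One way to sketch that queue bound is a thin wrapper that checks the queue depth before submitting work. Everything named here (QueueFullError, shouldShed, runOrShed, makeHandler, and the default depth of 50) is a hypothetical illustration; the only things assumed from piscina are the queueSize property and the run() method already used above.

```javascript
// Hypothetical error type used to signal load shedding to the route handler.
class QueueFullError extends Error {
  constructor(depth) {
    super(`Thread pool queue is full (depth ${depth})`);
    this.name = 'QueueFullError';
  }
}

// `pool` is any object exposing `queueSize` and `run(task)`, as a piscina pool does.
function shouldShed(pool, maxQueueDepth) {
  return pool.queueSize >= maxQueueDepth;
}

function runOrShed(pool, task, maxQueueDepth = 50) {
  if (shouldShed(pool, maxQueueDepth)) {
    // Reject immediately instead of letting the request wait in a growing queue
    return Promise.reject(new QueueFullError(pool.queueSize));
  }
  return pool.run(task);
}

// Express-style handler sketch: map the shed signal to a 503, anything else to a 500.
function makeHandler(pool) {
  return async function handler(req, res) {
    try {
      res.json(await runOrShed(pool, { iterations: 1000 }));
    } catch (err) {
      const status = err instanceof QueueFullError ? 503 : 500;
      res.status(status).json({ error: err.message });
    }
  };
}
```

Failing fast with a 503 gives the load balancer a chance to retry elsewhere, while a request stuck in a deep queue usually times out anyway.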

🎯 Key Takeaway

Worker threads handle CPU-bound work without blocking the event loop — they are not for I/O.

Use piscina for thread pool management — never spawn per request, never more threads than CPU cores.

SharedArrayBuffer enables zero-copy data sharing between threads — critical for high-throughput data processing.

You can combine clustering and worker threads: clustered workers each with a small thread pool is the right architecture for mixed I/O and CPU workloads.

Memory Management, GC Pauses, and Leak Detection That Actually Works

Node.js uses V8's generational garbage collector. Objects start in the young generation (also called the nursery or new space), which is collected via a fast scavenge algorithm that runs frequently. Objects that survive two scavenge cycles are promoted to the old generation, which is collected via a mark-sweep-compact algorithm — slower, less frequent, and critically, it pauses the event loop while it runs.

The size of those GC pauses scales with old generation heap utilisation. A heap at 30% utilisation with 512MB old space might produce 5-15ms GC pauses. The same heap at 90% utilisation triggers increasingly frequent major GC cycles producing 50-200ms pauses. Those pauses look exactly like event loop blocking in your latency metrics — P99 spikes with no corresponding CPU spike.
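To confirm that a latency spike is a GC pause and not application code, the pauses can be observed directly. The sketch below uses the built-in perf_hooks PerformanceObserver with the 'gc' entry type; the observeGcPauses name and the 50ms warning threshold are assumptions made for this example.

```javascript
const { PerformanceObserver, constants } = require('perf_hooks');

// Hypothetical helper: log major (mark-sweep-compact) GC pauses above a threshold.
function observeGcPauses({ warnAtMs = 50, log = console.warn } = {}) {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      // Node.js 16+ reports the GC type on entry.detail.kind; older versions use entry.kind
      const kind = entry.detail ? entry.detail.kind : entry.kind;
      const isMajor = kind === constants.NODE_PERFORMANCE_GC_MAJOR;
      if (isMajor && entry.duration >= warnAtMs) {
        log(`Major GC pause: ${entry.duration.toFixed(1)}ms`);
      }
    }
  });
  observer.observe({ entryTypes: ['gc'] });
  return observer; // call observer.disconnect() to stop observing
}
```

If logged pauses line up with your P99 spikes, the fix is on the memory side (heap size, allocation rate, leaks) and not in the request handlers.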

The default old space limit has historically been approximately 1.5GB on 64-bit systems; recent Node.js releases derive the default from available system memory, so it is often higher. Check v8.getHeapStatistics().heap_size_limit for the value in your environment. Either way, the limit is a ceiling, not a target. Running a 1.4GB heap is not healthy; it means V8 is under severe GC pressure. Use --max-old-space-size to set an appropriate limit based on your container allocation, then use clustering to run multiple smaller heaps rather than one large one.

The most dangerous memory leaks in production are the gradual ones. A slow-growing Map, an event emitter accumulating listeners, a closure in a middleware capturing a request object — none of these cause immediate failures. They grow over hours or days, GC pauses increase gradually, latency percentiles drift upward, and the eventual OOM crash looks like a random event rather than the conclusion of a long-running leak. Heap snapshot comparison is the only reliable way to find them.

memory.leak-patterns.js · JAVASCRIPT

// ══════════════════════════════════════════════════════════════════
// COMMON PRODUCTION LEAK PATTERNS — what they look like and how to fix them
// ══════════════════════════════════════════════════════════════════

// ── PATTERN 1: Unbounded in-memory cache ──────────────────────────
// The most common leak in production Node.js services.
// Works fine in staging (low traffic, frequent restarts).
// Reaches OOM in production after 12-48 hours of sustained traffic.

const BAD_cache = {};
function getUserBad(id) {
  if (!BAD_cache[id]) {
    BAD_cache[id] = fetchFromDb(id); // Added forever, never evicted
  }
  return BAD_cache[id];
}
// After 100,000 unique users: BAD_cache has 100,000 entries eating memory

// FIX: LRU cache with both max size and TTL
const { LRUCache } = require('lru-cache');
const userCache = new LRUCache({
  max: 10_000,           // Evict least-recently-used entries beyond this count
  ttl: 1000 * 60 * 5,   // Each entry expires after 5 minutes regardless of access
  allowStale: false,     // Don't serve expired entries even while revalidating
});

function getUser(id) {
  const cached = userCache.get(id);
  if (cached) return cached;
  const user = fetchFromDb(id);
  userCache.set(id, user);
  return user;
}


// ── PATTERN 2: Event listener accumulation ────────────────────────
// Each reconnection adds a new listener. removeListener is never called.
// process.on('warning') will emit MaxListenersExceededWarning at 11 listeners,
// but by then you might already have hundreds.

function handleConnectionBad(socket) {
  const onData = (chunk) => processChunk(socket, chunk);
  socket.on('data', onData);
  // BUG: socket 'close' event never removes 'data' listener
  // If this socket is reused or the same emitter receives multiple calls,
  // listeners pile up indefinitely
}

// FIX: always pair addListener with removeListener
function handleConnection(socket) {
  // For high-connection-volume services, set an explicit listener limit.
  // The limit applies per event name, so accumulation on 'data' triggers
  // MaxListenersExceededWarning after 3 listeners instead of the default 10.
  socket.setMaxListeners(3);

  const onData = (chunk) => processChunk(socket, chunk);
  socket.on('data', onData);

  socket.once('close', () => {
    socket.removeListener('data', onData);
    // 'once' ensures this cleanup handler itself doesn't accumulate
  });
}


// ── PATTERN 3: Closure retaining large objects ────────────────────
// A long-lived closure keeps every variable it references reachable.
// Returned functions keep those objects alive in the old generation for as long as they exist.

function BAD_createMiddleware() {
  const requestLog = []; // Grows with every request — never cleared

  return function middleware(req, res, next) {
    requestLog.push(req); // Holds every req object ever received
    next();
  };
}

// FIX: only retain what you actually need
function createMiddleware() {
  const requestCount = { value: 0 }; // Tiny counter, not the whole req object

  return function middleware(req, res, next) {
    requestCount.value++;
    next();
  };
}


// ── PATTERN 4: setInterval without clearInterval ──────────────────
// Common in module-level setup code — the interval callback fires forever
// and holds references to everything in its closure scope.

let pollingInterval;
function startPolling(config) {
  // GOOD: store the reference so we can clear it
  pollingInterval = setInterval(async () => {
    await pollRemoteService(config);
  }, 5000);
}

function stopPolling() {
  if (pollingInterval) {
    clearInterval(pollingInterval);
    pollingInterval = null;
  }
}

// Always hook into process shutdown to clean up timers:
process.on('SIGTERM', () => {
  stopPolling();
  // then close server, drain DB pool, etc.
});


// ── Heap snapshot helper for production debugging ─────────────────
// Add this to a debug-only route or triggered by a signal.
// Never leave heap snapshot generation on the hot path — it pauses the event loop.
const v8 = require('v8');

process.on('SIGUSR2', () => {
  const filename = `heap-${Date.now()}.heapsnapshot`;
  v8.writeHeapSnapshot(filename);
  console.log(`Heap snapshot written to ${filename}`);
  // Copy the file off the instance and open in Chrome DevTools Memory tab
});

Output

// MaxListenersExceededWarning from Node.js when listener pattern is wrong:

(node:1234) MaxListenersExceededWarning: Possible EventEmitter memory leak detected.

11 data listeners added to [Socket]. Use emitter.setMaxListeners() to increase limit

// Heap snapshot written on SIGUSR2:

Heap snapshot written to heap-1741132800000.heapsnapshot

// lru-cache eviction working correctly under load:

// After 100,000 unique user requests, userCache.size === 10,000

// (not 100,000 — LRU eviction kept it bounded)

// Memory stays stable instead of growing linearly with unique users

⚠ GC Pauses Are the Silent Latency Killer

A 1.4GB heap at 90% utilisation triggers aggressive mark-sweep-compact cycles that pause the event loop for 50-200ms each. These pauses do not appear in your application logs. They do not increment your error counters. They show up only as latency spikes at the P95 and P99 percentiles — and they look identical to a slow database query or a blocked event loop. If your latency percentiles drift upward over hours with no corresponding application error, check heap utilisation and GC frequency before debugging anything else. The metric you want: v8.getHeapStatistics().used_heap_size / v8.getHeapStatistics().heap_size_limit — if this ratio exceeds 0.85, you are in GC pressure territory.
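The ratio above can be checked periodically with a few lines. This is a sketch under stated assumptions: heapUtilisation and startHeapMonitor are hypothetical names, and the 10-second interval is an arbitrary choice; only v8.getHeapStatistics() and its used_heap_size and heap_size_limit fields come from Node.js.

```javascript
const v8 = require('v8');

// Fraction of the V8 heap limit currently in use (0..1).
function heapUtilisation() {
  const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
  return used_heap_size / heap_size_limit;
}

// Hypothetical monitor: report when utilisation crosses the threshold.
function startHeapMonitor({ threshold = 0.85, intervalMs = 10_000, onPressure = console.warn } = {}) {
  const timer = setInterval(() => {
    const ratio = heapUtilisation();
    if (ratio > threshold) {
      onPressure(`Heap at ${(ratio * 100).toFixed(1)}% of limit: GC pressure likely`);
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer); // call the returned function to stop
}
```

In a real service you would probably export the ratio as a gauge metric and alert on it, so the trend over hours is visible alongside latency percentiles.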

📊 Production Insight

The most common production memory leak in Node.js is an unbounded in-memory cache with no size limit and no TTL.

It works perfectly in staging: low traffic, frequent redeploys, never grows large enough to matter.

In production: 48 hours of real traffic later, the cache has 500,000 entries, GC is running continuously, event loop lag is 300ms, and on-call gets paged for an 'unexplained latency increase'.

Rule: every in-memory cache, every Map used as a cache, every object accumulating state must have an explicit maximum size and a TTL. No exceptions. lru-cache with both max and ttl options is one line of configuration.

🎯 Key Takeaway

V8 GC pauses scale with old generation heap utilisation — a heap at 90% capacity causes 50-200ms event loop stops that look like application bugs.

The three most common production leak sources: unbounded caches without TTL/size limits, event listeners never removed on connection close, and closures in long-lived objects retaining large references.

Heap snapshot comparison in Chrome DevTools is the only reliable way to find leaks — guessing wastes hours. Take a baseline, wait under load, take a second snapshot, compare retained size growth by object type.

Keep heaps small: --max-old-space-size at 70-75% of container memory, multiple smaller processes via clustering rather than one large heap.

Production Profiling: Finding Real Bottlenecks Under Real Load

Synthetic benchmarks lie in very specific ways. A service that handles 50,000 requests per second in autocannon or wrk testing may collapse at 5,000 requests per second in production because synthetic benchmarks cannot replicate: real-world payload size variance, database query latency distribution under concurrent connection pressure, connection pool contention between concurrent requests, GC pressure from actual memory usage patterns, and the interaction between all of these simultaneously. Production profiling reveals what benchmarks never expose.

The three essential tools in 2026: clinic.js for automated high-level diagnosis across event loop, CPU, and memory simultaneously; --inspect with Chrome DevTools for interactive CPU flame graphs and heap timeline recording; and perf/DTrace for kernel-level analysis when the issue is in native code or the V8 runtime itself. Each operates at a different depth.

Start with clinic doctor every time — it gives the fastest cross-dimensional view of what is wrong. Event loop delay, CPU profile, and memory trend in a single dashboard that takes two minutes to generate. When doctor identifies an area of concern, drill into it: clinic flame for CPU hotspot identification, clinic heapprofiler for allocation timeline analysis. Only drop to perf/DTrace when the problem is not visible at the JavaScript level — native addon performance, V8 JIT behaviour, or system-call overhead.

The profiling overhead question matters in 2026 because teams are increasingly reluctant to run profiling tools in production after incidents caused by profiler overhead. Clinic.js adds roughly 5-15% overhead to throughput. Chrome DevTools CPU profiling adds 20-40% overhead and should only run on canary instances. The right workflow: run clinic against a canary pod or a staging environment under production-representative load, not against your primary fleet.

profiling.commands.sh · BASH

# ── Install clinic globally (requires Node.js 18+, works on 22 LTS) ────────
npm install -g clinic

# Verify installation:
clinic --version


# ═════════════════════════════════════════════════════════════════════
# STEP 1: Full diagnostic overview — start here, always
# ═════════════════════════════════════════════════════════════════════

clinic doctor -- node server.js
# While this runs, apply load in a separate terminal:
npx autocannon -c 100 -d 30 http://localhost:3000
# clinic doctor generates an HTML report showing:
#   - Event loop delay over time (spikes = blocking operations)
#   - CPU usage across the profile period
#   - Memory growth trend
#   - Recommendations based on detected patterns


# ═════════════════════════════════════════════════════════════════════
# STEP 2: CPU hotspot identification (when doctor shows high CPU)
# ═════════════════════════════════════════════════════════════════════

clinic flame -- node server.js
# Apply load while running.
# The flamegraph shows call stacks with width proportional to CPU time.
# Wide boxes at the top of the stack are your hotspots.
# Look for synchronous JavaScript in the middle of the stack — anything
# that should be async but isn't shows up as a wide synchronous column.


# ═════════════════════════════════════════════════════════════════════
# STEP 3: Memory allocation profiling (when memory grows unexpectedly)
# ═════════════════════════════════════════════════════════════════════

clinic heapprofiler -- node server.js
# Shows an allocation timeline — not just what IS in the heap,
# but what is being actively allocated over time.
# Useful distinction: a large retained object is a leak.
# High allocation rate that GC keeps up with is a GC pressure issue, not a leak.


# ═════════════════════════════════════════════════════════════════════
# STEP 4: Async operation flow analysis (when timing looks wrong)
# ═════════════════════════════════════════════════════════════════════

clinic bubbleprof -- node server.js
# Visualises async operation chains — shows where time is spent waiting
# between async operations. Useful for finding:
#   - Database query chains that could be parallelised with Promise.all()
#   - Missing connection pool capacity (lots of time waiting for a connection)
#   - Unnecessary sequential async operations


# ═════════════════════════════════════════════════════════════════════
# STEP 5: Manual event loop lag measurement (production-safe)
# ═════════════════════════════════════════════════════════════════════

# Option A: quick terminal check
node -e "
  const INTERVAL = 1000;
  let last = Date.now();
  setInterval(() => {
    const now = Date.now();
    const lag = now - last - INTERVAL;
    last = now;
    console.log('Event loop lag:', lag.toFixed(0), 'ms');
    if (lag > 100) console.warn('WARNING: event loop saturation detected');
  }, INTERVAL);
"

# Option B: expose as a Prometheus metric (recommended for production)
# In your application startup:
# const promClient = require('prom-client');
# promClient.collectDefaultMetrics(); // includes nodejs_eventloop_lag_seconds


# ═════════════════════════════════════════════════════════════════════
# STEP 6: V8 CPU profile via --inspect (for interactive investigation)
# ═════════════════════════════════════════════════════════════════════

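# Start the process with the inspector enabled.
# WARNING: binding to 0.0.0.0 exposes the debugger to the whole network, and
# anyone who can reach the port can execute code in the process.
# Use this only inside an isolated development environment:
node --inspect=0.0.0.0:9229 server.js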

# Then:
# 1. Open Chrome and navigate to chrome://inspect
# 2. Click 'inspect' on your Node.js target
# 3. Go to the Profiler tab
# 4. Click 'Start' — apply load — click 'Stop'
# 5. Analyse the CPU profile flamegraph

# For production canary: bind inspector to localhost only
node --inspect=127.0.0.1:9229 server.js
# Tunnel via SSH to access from your laptop:
# ssh -L 9229:localhost:9229 user@canary-host


# ═════════════════════════════════════════════════════════════════════
# STEP 7: Heap snapshot on demand (production debugging)
# ═════════════════════════════════════════════════════════════════════

# If you added process.on('SIGUSR2') for heap snapshots:
kill -USR2 <node-pid>
# File appears in the working directory as heap-<timestamp>.heapsnapshot
# Load it in Chrome DevTools > Memory tab > Load profile

Output

# clinic doctor output (terminal summary before HTML report opens):

✔ Analysed data

✔ Generated HTML report

open file:///.clinic/12345.clinic-doctor-html

Key findings:

⚠ Event loop delay detected — max 3847ms at 14:23:07

⚠ Synchronous operations detected in the event loop

✓ Memory: stable growth — no apparent leak

✓ CPU: consistent with expected load

Recommendation: Use clinic flame to identify the synchronous hotspot

# Event loop lag measurement output:

Event loop lag: 2 ms

Event loop lag: 3 ms

Event loop lag: 1 ms

Event loop lag: 387 ms ← This is when the blocking operation ran

WARNING: event loop saturation detected

Event loop lag: 4 ms

Event loop lag: 2 ms

# autocannon load test output:

Running 30s test @ http://localhost:3000

100 connections

┌─────────┬────────┬─────────┬─────────┬────────────┬──────────┐
│ Stat    │ 2.5%   │ 50%     │ 97.5%   │ 99%        │ Avg      │
├─────────┼────────┼─────────┼─────────┼────────────┼──────────┤
│ Latency │ 8 ms   │ 12 ms   │ 3847 ms │ 4203 ms    │ 47 ms    │
└─────────┴────────┴─────────┴─────────┴────────────┴──────────┘

Req/Sec: 8,432 (average) — note the P99 spike despite high average throughput

# The average looks fine. The P99 is catastrophic. This is why you watch percentiles.
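The manual lag check in step 5 can also be done in-process with the built-in perf_hooks.monitorEventLoopDelay histogram, which samples continuously instead of once per second. The sketch below is one possible shape; startLoopDelayMonitor, its option names, and the 100ms warning threshold are assumptions made for this example.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Hypothetical helper: periodically report event loop delay percentiles.
function startLoopDelayMonitor({ resolutionMs = 20, reportEveryMs = 10_000, warnAtMs = 100, log = console.log } = {}) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();

  const timer = setInterval(() => {
    // Histogram values are reported in nanoseconds
    const p99Ms = histogram.percentile(99) / 1e6;
    const maxMs = histogram.max / 1e6;
    log(`Event loop delay p99=${p99Ms.toFixed(1)}ms max=${maxMs.toFixed(1)}ms`);
    if (p99Ms > warnAtMs) {
      log('WARNING: event loop saturation detected');
    }
    histogram.reset(); // start a fresh window for the next report
  }, reportEveryMs);
  timer.unref(); // do not keep the process alive just for monitoring

  return {
    histogram,
    stop() {
      clearInterval(timer);
      histogram.disable();
    },
  };
}
```

Reporting a percentile per window keeps a single long block visible in the max value without letting it dominate every later report.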

💡Always Profile Under Realistic Concurrent Load

An idle Node.js server shows zero event loop lag, zero CPU hotspots, and flat memory — a completely clean profile that tells you nothing. The blocking operations, the GC pressure patterns, and the async chain bottlenecks only appear under concurrent load because they require multiple requests competing for the same event loop. Before you profile, generate realistic traffic with autocannon, k6, or artillery. Use realistic payload sizes and request patterns from your production access logs — not uniform requests to a single lightweight endpoint. A profiling session on an idle server is the most expensive way to learn nothing.

📊 Production Insight

Clinic.js adds 5-15% overhead to request throughput and should never run permanently in production.

The right approach: run clinic against a canary pod or a dedicated staging instance that receives a copy of production traffic via traffic mirroring.

Node.js also ships a built-in V8 CPU profiler behind the --cpu-prof flag (available since Node.js 12, so it works on 22 without any external tooling). The output is compatible with Chrome DevTools: node --cpu-prof server.js.

Rule: profile in production-equivalent environments under production-equivalent traffic patterns. Analyse the report. Remove the profiler. Make one change. Measure. Repeat.

🎯 Key Takeaway

Start with clinic doctor — it gives the fastest cross-dimensional view of event loop, CPU, and memory behaviour simultaneously.

Always apply realistic concurrent load while profiling — idle servers reveal nothing.

Move from overview (doctor) to detail (flame for CPU, heapprofiler for allocations, bubbleprof for async chains) as patterns emerge.

The built-in --cpu-prof flag (available since Node.js 12) gives you a V8 CPU profile with zero external dependencies, which is useful when clinic.js is not available.

The Middleware Chain Is Killing Your Latency — Here's How To Measure It

Express middleware looks harmless. Each function is tiny, but stack enough of them and the overhead becomes a visible share of request latency. The problem isn't any individual middleware; it's the chain's cumulative effect on the event loop. Every synchronous operation in a middleware holds the thread: every JSON parse, every regex match, every object spread. You're not writing middleware. You're writing a blocking queue.

Measure actual wall-clock time per middleware, for example with a timing wrapper, or with async_hooks or cls-hooked when you need context tracking. Then question anything that doesn't return within 1ms. Authentication? Should be a quick token verify, not a database lookup. Body parsing? Use raw middleware and parse only what you need. Compression? Delegate to a reverse proxy.

Your goal is a middleware chain that never exceeds 5ms total. If it does, restructure. Move validation to a pre-request layer. Move logging to a background stream. Stop treating middleware as a dumping ground for every cross-cutting concern. Treat it like a hot code path, because it is.

middleware-profiler.js · JAVASCRIPT

const express = require('express');
const { performance } = require('perf_hooks');

const app = express();

// Express never creates async resources of type 'MIDDLEWARE', so an
// async_hooks filter on that type would never fire. This version wraps each
// middleware and measures wall-clock time from entry until it calls next().
let nextMiddlewareId = 1;

function timed(middleware) {
  const id = nextMiddlewareId++;
  return function timedMiddleware(req, res, next) {
    const start = performance.now();
    middleware(req, res, (err) => {
      const elapsed = performance.now() - start;
      // Avoid synchronous file writes here: they would block the loop being measured
      console.log(`Middleware ${id}: ${Math.round(elapsed)}ms`);
      next(err);
    });
  };
}

// Usage in your app: wrap each middleware at registration time
app.use(timed(express.json()));
app.use(timed((req, res, next) => {
  // Placeholder for your own middleware (auth, validation, ...)
  next();
}));

app.get('/api/users', (req, res) => {
  res.json({ message: 'Profile your middleware chain' });
});

app.listen(3000);

// Sample output:
// Middleware 12: 3ms
// Middleware 13: 12ms  <-- THIS ONE IS THE PROBLEM
// Middleware 14: 1ms

Output

Middleware 12: 3ms

Middleware 13: 12ms <-- THIS ONE IS THE PROBLEM

Middleware 14: 1ms

⚠ Production Trap:

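Instrumentation is never free. async_hooks.createHook in particular adds overhead to every async operation while it is enabled, so switch it on only for short profiling windows and never write to disk synchronously inside a hook. For plain per-middleware timing, performance.now() or process.hrtime.bigint() around the call is cheaper. I've seen teams chasing ghosts because their instrumentation doubled latency.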

🎯 Key Takeaway

Your middleware chain should never exceed 5ms total — if it does, you're not writing middleware, you're writing a blocking queue.

Connection Pooling in Production: The Settings That Actually Matter

Every Node.js tutorial tells you to use connection pooling. Few tell you how to configure it for production. The default pool size in most drivers is 10 connections. That's fine for a blog. It can become the bottleneck for an API server handling 5000 requests per minute.

The settings that matter: max (total connections per pool), idleTimeoutMillis (how long before idle connections close), and a bound on how long callers wait when all connections are busy. Size max from your database's connection limit minus 20% headroom, divided across every process that opens its own pool (each cluster worker and each pod has one). Set idleTimeoutMillis to 30000ms; any longer and you hold resources hostage. For the wait bound, mysql2 exposes queueLimit (0 means unlimited), while pg's Pool has no queueLimit option and bounds waiting callers with connectionTimeoutMillis. Whichever driver you use, cap the wait at around 10 seconds or less.

Don't set max higher than what your DB can handle. I've seen Postgres servers crash because a Node app opened 500 connections thinking "more is better". Test under load with pg-pool or mysql2 and monitor pool.waitingCount. If it grows above 0 consistently, you need more connections or faster queries.

pool-config.js · JAVASCRIPT

const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.DB_HOST,
  port: 5432,
  database: 'myapp',
  user: process.env.DB_USER,
  password: process.env.DB_PASS,
  max: 20,                // Never exceed DB's max_connections - 20%
  idleTimeoutMillis: 30000, // Close idle connections after 30s
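  connectionTimeoutMillis: 5000, // Fail fast if the DB is down or no connection frees up in time
  maxUses: 7500,           // Recycle connection after 7500 queries
  // Note: queueLimit is a mysql2 option; pg's Pool does not use it. In pg,
  // callers waiting for a connection are bounded by connectionTimeoutMillis.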
  allowExitOnIdle: true    // Let Node.js exit if pool is idle
});

// Monitor queue depth
setInterval(() => {
  const stats = {
    totalCount: pool.totalCount,
    idleCount: pool.idleCount,
    waitingCount: pool.waitingCount,
  };
  console.log('Pool stats:', stats);
  // If waitingCount > 0 for 30 seconds, alert!
}, 10000);

// Sample output:
// Pool stats: { totalCount: 5, idleCount: 3, waitingCount: 0 }
// Pool stats: { totalCount: 12, idleCount: 0, waitingCount: 8 }  <-- PROBLEM

Output

Pool stats: { totalCount: 5, idleCount: 3, waitingCount: 0 }

Pool stats: { totalCount: 12, idleCount: 0, waitingCount: 8 } <-- PROBLEM
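
The "alert if waitingCount > 0 for 30 seconds" comment can be made concrete with a small tracker. This is a minimal sketch and not part of pg: SustainedWaitDetector and its parameters are hypothetical names, and the 30-second window simply mirrors the comment in pool-config.js.

```javascript
// Hypothetical helper: reports true once a sampled value has stayed
// above zero for at least `windowMs` without interruption.
class SustainedWaitDetector {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.waitingSince = null; // timestamp of the first sample in the current streak
  }

  // Call once per sample. `nowMs` is injectable so the logic is easy to test.
  sample(waitingCount, nowMs = Date.now()) {
    if (waitingCount <= 0) {
      this.waitingSince = null; // streak broken, start over
      return false;
    }
    if (this.waitingSince === null) this.waitingSince = nowMs;
    return nowMs - this.waitingSince >= this.windowMs;
  }
}

const detector = new SustainedWaitDetector(30000);
```

Inside the monitoring interval you would call detector.sample(pool.waitingCount) and raise an alert when it returns true. A single busy sample never fires; only a streak that outlasts the window does.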

🔥Real-World Insight:

Start small (max=10) and increase by 5 until waitingCount stays below 1 under peak load. Most production apps never need more than 30 connections per process. Amazon RDS derives its default max_connections from instance memory, so small instance classes can allow fewer than 100 connections. Don't exhaust that with one app server.
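
The sizing rule is easier to apply once the arithmetic is written down. A minimal sketch, assuming every process opens its own pool against the same database: maxPoolSizePerProcess is a hypothetical helper, and the 20% headroom is the figure used in this section.

```javascript
// Hypothetical helper: the largest `max` each process can use so that all
// pools together stay under the database limit minus the headroom.
function maxPoolSizePerProcess(dbMaxConnections, processCount, headroomPercent = 20) {
  if (processCount < 1) throw new RangeError('processCount must be >= 1');
  // Integer arithmetic avoids floating point surprises on the percentage.
  const usable = Math.floor((dbMaxConnections * (100 - headroomPercent)) / 100);
  // Never return 0: a process with no connections cannot serve queries.
  return Math.max(1, Math.floor(usable / processCount));
}

// 100 connections on the database, 4 Node.js processes in total:
console.log(maxPoolSizePerProcess(100, 4)); // 20
```

If the result comes back as 1, the database limit is the real constraint, and adding processes will not help until it is raised or a pooler sits in front of the database.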

🎯 Key Takeaway

Connection pooling doesn't mean "unlimited connections" — it means right-sizing for your database's capacity and your app's concurrency profile.

● Production incident · POST-MORTEM · severity: high

The 4-Second Event Loop Block That Took Down Production

Symptom

All API endpoints — not just the reporting endpoint — started returning 504 Gateway Timeout errors within minutes of the deployment. P99 latency spiked from 45ms to over 4,000ms. Kubernetes pods showed healthy status (the health check was a simple synchronous string response that returned in under 1ms) but load balancer metrics showed 100% connection exhaustion across all pods in the cluster.

Assumption

The team assumed the database was the bottleneck because latency had spiked once before due to a slow query. They scaled read replicas, increased connection pool sizes from 10 to 50, and waited. Latency continued climbing. Nobody looked at the newly deployed reporting endpoint because it was described in the PR as a 'minor template change'.

Root cause

The reporting endpoint called handlebars.compile() inline on every request against a 2MB template file with deeply nested loops. Each compile-and-render cycle blocked the event loop for approximately 3.8 seconds. Under concurrent load, event loop lag compounded: every new request queued behind the blocked loop, so even a 5ms health check request had to wait for the 3.8-second render to complete, and each additional report request added another 3.8 seconds to the queue. The Kubernetes liveness probe had a 5-second timeout, so probes barely passed, just often enough to keep the pods alive and accepting traffic while all actual application traffic timed out. The incident ran for 22 minutes before the new deployment was identified as the cause.

Fix

1. Immediately reverted the deployment: restored P99 to 45ms within 90 seconds of rollback
2. Moved template compilation to application startup: compile once at boot, store the compiled template function, call it on each request, so compile time is now paid once, not per request
3. Switched to streaming template rendering for the large report output using a streaming-compatible templating approach
4. Added event loop lag monitoring via the event-loop-lag npm package, exposed as a Prometheus gauge
5. Implemented a circuit breaker middleware that returns HTTP 503 with a Retry-After header when event loop lag exceeds 500ms, preventing further request queuing behind a saturated loop
6. Updated the Kubernetes liveness probe to measure event loop responsiveness (a dedicated /health/live endpoint that uses setImmediate to verify the event loop is actually scheduling work) rather than just TCP connectivity
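
The lag monitoring and circuit breaker steps can be sketched together. This is not the team's implementation: it swaps the event-loop-lag package for Node's built-in perf_hooks.monitorEventLoopDelay, and shedWhenSaturated, LAG_LIMIT_MS, and the Retry-After value are assumptions chosen for illustration.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Built-in histogram of event loop delay; values are in nanoseconds.
// Depending on the Node.js version, readings may sit close to the sampling
// resolution even when idle, so calibrate against an idle baseline.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

const LAG_LIMIT_MS = 500; // threshold taken from the fix list above

// Mean event loop delay over the current window, in milliseconds.
function currentLagMs() {
  const meanNs = histogram.mean; // NaN until the first sample is recorded
  return Number.isNaN(meanNs) ? 0 : meanNs / 1e6;
}

// Hypothetical middleware in the (req, res, next) style.
function shedWhenSaturated(req, res, next) {
  if (currentLagMs() > LAG_LIMIT_MS) {
    res.statusCode = 503;
    res.setHeader('Retry-After', '5'); // assumed back-off; tune per service
    res.end('Service temporarily overloaded');
    return;
  }
  next();
}

// Reset periodically so old samples do not mask a recent stall.
const resetTimer = setInterval(() => histogram.reset(), 10000);
resetTimer.unref(); // do not keep the process alive just for this timer
```

Mount the middleware first in the chain so a saturated process rejects work before doing any of it. The same currentLagMs() reading can back a /health/live handler that fails when the loop is not keeping up.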

Key lesson

Never compile templates, parse large payloads, or run regex on untrusted input synchronously inside the request path — these are event loop blockers regardless of how small they look in code review
Health checks must measure event loop responsiveness, not just process existence — a process can be alive and completely unable to serve requests
A single synchronous operation does not just slow one endpoint — it blocks every concurrent request in the entire process for its full duration
Monitor event loop lag as a first-class production metric from day one — latency percentiles alone will not tell you why things are slow
Code review descriptions like 'minor template change' deserve scrutiny when the change touches the hot request path — the word 'minor' has no meaning when the event loop is involved
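
The first lesson, keeping compilation out of the request path, comes down to a lookup table filled at boot. A minimal sketch under stated assumptions: compileTemplate is a deliberately trivial stand-in for an expensive compiler such as handlebars.compile, and precompile and render are hypothetical names.

```javascript
// Stand-in for an expensive compile step; returns a render function.
// Hypothetical: it only substitutes {{name}} placeholders.
let compileCalls = 0;
function compileTemplate(source) {
  compileCalls += 1;
  return (data) =>
    source.replace(/\{\{(\w+)\}\}/g, (_, key) => String(data[key] ?? ''));
}

// Compile every template once, at startup, before accepting traffic.
const compiled = new Map();
function precompile(templates) {
  for (const [name, source] of Object.entries(templates)) {
    compiled.set(name, compileTemplate(source));
  }
}

// Request path: a Map lookup plus the render call, no compilation.
function render(name, data) {
  const template = compiled.get(name);
  if (!template) throw new Error(`Unknown template: ${name}`);
  return template(data);
}

precompile({ report: 'Report for {{customer}}: {{total}} rows' });
console.log(render('report', { customer: 'Acme', total: 3 }));
// Report for Acme: 3 rows
```

This removes the compile cost only. Rendering a very large template is still synchronous work, which is why the fix list also moved the report to streaming output.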

Production debug guide · Common production symptoms and their immediate debugging actions · 5 entries

Symptom · 01

P99 latency spikes while CPU usage remains low across all cores

→

Fix

This is almost always event loop blocking or GC pressure — not a resource exhaustion problem. Measure event loop lag directly with the event-loop-lag package or use clinic doctor to visualise event loop blocking over time. Low CPU with high latency means work is queued behind something synchronous, not that the system is idle.

Symptom · 02

Memory usage grows linearly over hours until OOM crash (exit code 137 in containers)

→

Fix

Take heap snapshots at intervals via --inspect and Chrome DevTools Memory tab. Take snapshot A, wait 30 minutes under load, take snapshot B, then use the Comparison view to identify object types with growing retained size. Exit code 137 is the kernel OOM killer — the container hit its memory limit. Check not just V8 heap but total RSS including Buffer allocations.
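
Before reaching for snapshots, it helps to see which part of RSS is growing. The sketch below uses only built-in APIs (process.memoryUsage, v8.getHeapStatistics, v8.writeHeapSnapshot); the function names and the choice of SIGUSR2 as the trigger are assumptions.

```javascript
const v8 = require('v8');

const toMB = (bytes) => Math.round(bytes / 1024 / 1024);

// Where the process's memory currently sits, in MB.
function memoryBreakdown() {
  const m = process.memoryUsage();
  return {
    rssMB: toMB(m.rss),                 // what the container limit is measured against
    heapUsedMB: toMB(m.heapUsed),       // live objects on the V8 heap
    heapTotalMB: toMB(m.heapTotal),
    externalMB: toMB(m.external),       // off-heap memory tied to JS objects
    arrayBuffersMB: toMB(m.arrayBuffers), // Buffers and ArrayBuffers
    heapLimitMB: toMB(v8.getHeapStatistics().heap_size_limit),
  };
}

// Write a .heapsnapshot file on demand; Chrome DevTools can load it.
// v8.writeHeapSnapshot() is synchronous and pauses the process while it runs.
function dumpHeapOnSignal(signal = 'SIGUSR2') {
  process.on(signal, () => {
    const file = v8.writeHeapSnapshot();
    console.log('Heap snapshot written to', file);
  });
}

console.log(memoryBreakdown());
```

If rssMB climbs while heapUsedMB stays flat, the growth is outside the V8 heap and heap snapshots will not show it; look at Buffers and native modules instead.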

Symptom · 03

CPU pinned at 100% on a single core despite clustering being configured

→

Fix

Verify cluster workers are actually forked and running: ps aux | grep node. Check whether the primary process is handling requests itself instead of delegating to workers. Also verify the scheduling policy: round-robin (cluster.SCHED_RR) is the default on every platform except Windows, but if it has been switched to cluster.SCHED_NONE, in code or via NODE_CLUSTER_SCHED_POLICY=none, connection distribution is left to the OS and can be very uneven.
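
Both checks, the scheduling policy and the primary staying out of the request path, are visible in the bootstrap code. A minimal sketch: start and runWorker are hypothetical names, and the unconditional re-fork is a simplification that production code would replace with a backoff.

```javascript
const cluster = require('cluster');
const os = require('os');

// Hypothetical bootstrap: call once from the entry point.
function start(runWorker, workerCount = Math.max(1, os.cpus().length - 1)) {
  if (cluster.isPrimary) {
    // Must be set before the first fork(); it does not affect existing workers.
    cluster.schedulingPolicy = cluster.SCHED_RR;

    for (let i = 0; i < workerCount; i++) cluster.fork();

    cluster.on('exit', (worker, code, signal) => {
      console.error(`Worker ${worker.process.pid} exited (${signal || code}); forking a replacement`);
      cluster.fork(); // simplification: no backoff, so a crash loop will spin
    });
    return; // the primary never creates a server
  }

  runWorker(); // only workers reach this line
}
```

A typical entry point would call start(() => require('./server')), where './server' is a placeholder for whatever module creates the HTTP server.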

Symptom · 04

Throughput plateaus after adding more cluster workers — no improvement beyond N workers

→

Fix

Check shared resource contention first. The database is the most common ceiling: each worker opens its own pool, so 8 workers with max set to 10 can already hold 80 connections, and once the database's max_connections is reached, adding workers 9 and 10 gains nothing. Also check file descriptor limits (ulimit -n), port range exhaustion for outbound connections (ss -s), and whether a downstream service is the actual bottleneck.

Symptom · 05

Gradual latency degradation over days with no single identifiable event

→

Fix

This pattern usually indicates a slow-growing event loop block — a cache whose lookup time grows as it fills, a regex applied to progressively longer strings, or GC pauses increasing as heap utilisation climbs. Correlate event loop lag metrics over time with heap size metrics. If lag tracks heap growth, you have a memory leak causing GC pressure.

★ Node.js Performance Quick Debug Cheat Sheet · Immediate actions for common Node.js performance issues in production, ordered by what to check first

Event loop appears blocked: all requests timing out simultaneously

Immediate action

Identify the blocking operation before changing anything else — you need to know what's blocking before you can fix it

Commands

npx clinic doctor -- node app.js

node --inspect=9229 app.js

Fix now

Move the identified synchronous work off the request path entirely — either pre-compute at startup, cache the result, or offload to a worker thread using piscina

Heap memory growing unbounded: RSS increasing steadily over hours

Single worker consuming disproportionate CPU in cluster mode

High latency under load with low CPU: requests queuing without obvious reason

Node.js Scaling Strategies Compared

Strategy	Best For	Memory Model	Overhead	Primary Limitation
Cluster Module	I/O-bound HTTP services on multi-core VMs or multi-CPU pods	Fully isolated heaps per worker — no sharing	Low (~10-30MB per worker for runtime overhead)	No shared in-memory state; shared resource contention (DB pools, FDs) is the real ceiling
Worker Threads (piscina)	CPU-bound computation within a request (hashing, image processing, report generation)	Shared memory possible via SharedArrayBuffer	Medium (10-50ms startup, ~5-10MB per thread)	Not for I/O; pool sizing critical; thread errors surface as promise rejections
PM2 Cluster Mode	Long-running services needing zero-downtime reload and log management without K8s	Isolated heaps per worker (wraps cluster module)	Low (thin wrapper around cluster module)	Less operational control than raw cluster API; adds dependency; redundant in K8s
Kubernetes Horizontal Pod Autoscaling	Stateless services with variable or unpredictable traffic patterns	Fully isolated pods — separate processes, separate nodes	Higher (orchestration, scheduling, network overhead between pods)	Network latency between services; slower scale-up than in-process solutions
Event Loop Optimisation alone	Services already at low concurrency where single-core throughput is the bottleneck	Single process, single heap	None — pure code improvement	Single-core CPU ceiling; cannot utilise additional cores on the host
Cluster + Worker Threads (combined)	Mixed workloads: high concurrent I/O with occasional CPU-bound operations per request	Isolated heaps per cluster worker; optional shared memory within each worker's thread pool	Medium (cluster overhead + thread pool overhead)	Complexity: two concurrency models to reason about, debug, and tune simultaneously

⚙ Quick Reference

7 commands from this guide

File	Command / Code	Purpose
event-loop-phases.js	const fs = require('fs');	Event Loop Internals
cluster.production.js	const cluster = require('cluster');	Clustering
workers.cpu-task.js	const { parentPort, workerData, threadId } = require('worker_threads');	Worker Threads for CPU-Intensive Work
memory.leak-patterns.js	const BAD_cache = {};	Memory Management, GC Pauses, and Leak Detection That Actual
profiling.commands.sh	npm install -g clinic	Production Profiling
middleware-profiler.js	const async_hooks = require('async_hooks');	The Middleware Chain Is Killing Your Latency
pool-config.js	const { Pool } = require('pg');	Connection Pooling in Production

Key takeaways

The event loop has six phases with a fixed execution order. Microtasks (nextTick first, then Promise.then) drain after each callback and between every phase transition. Blocking any phase blocks every concurrent connection for the full duration.

Cluster forks one process per CPU core, each with its own V8 heap and event loop. Workers share nothing in memory: sessions, caches, and rate limiters must live in Redis or an equivalent external store.

Worker threads handle CPU-bound computation within a request without blocking the event loop. Use a fixed-size pool via piscina sized to CPU count; never spawn a thread per request. Cluster and worker threads solve different problems and can be combined.

Most production memory leaks come from three sources: unbounded in-memory caches without size limits or TTLs, event listeners accumulated without corresponding removeListener calls, and closures in long-lived objects retaining large references. Every cache needs both max and ttl.

V8 GC pauses scale with old generation heap utilisation. A heap at 90% capacity produces 50-200ms event loop stops that manifest as latency spikes, not errors. Set --max-old-space-size to 70-75% of container memory allocation, not 100%.

Profile under realistic concurrent load with clinic.js: start with clinic doctor for the cross-dimensional overview, then drill into flame or heapprofiler as patterns emerge. Idle servers hide every problem.

Monitor event loop lag as a first-class production metric using prom-client's collectDefaultMetrics(). It is the earliest signal of event loop health degradation: earlier than error rates, earlier than latency percentiles.

Node.js 22 is a supported LTS line in 2026: it ships V8 12.x with improved JSON.parse performance and a stable built-in test runner. libuv's io_uring backend for async file I/O on Linux is available but opt-in rather than on by default. The event loop model and all optimisation principles in this guide remain unchanged.

Common mistakes to avoid

7 patterns

Using JSON.parse on large request bodies without streaming or size limits

Symptom

Event loop blocks for 50-500ms per large request, causing all concurrent connections to experience the same latency spike simultaneously. Appears as P99 spikes correlated with specific request types, not with overall load.

Fix

Enforce a strict body size limit at the reverse proxy (nginx, AWS ALB, Cloudflare) before the payload reaches Node.js. For payloads that legitimately need to be large, use a streaming JSON parser (JSONStream or the WHATWG Streams API with a streaming JSON decoder). Set body-parser's limit option as a last line of defence. Never buffer a multi-megabyte payload and then synchronously parse it in the request handler.
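
For handlers that read the body themselves, the last line of defence can be sketched with a plain stream loop. This is an illustration, not body-parser's implementation: readBodyWithLimit is a hypothetical helper, and it assumes the stream yields Buffers, as http.IncomingMessage does by default.

```javascript
// Collect a request body, aborting as soon as it exceeds `limitBytes`.
async function readBodyWithLimit(stream, limitBytes) {
  const chunks = [];
  let received = 0;
  for await (const chunk of stream) {
    received += chunk.length;
    if (received > limitBytes) {
      const err = new Error('Payload too large');
      err.statusCode = 413;
      throw err; // exiting the loop early destroys the stream
    }
    chunks.push(chunk);
  }
  return Buffer.concat(chunks);
}
```

The point of checking inside the loop is that an oversized payload is rejected after limitBytes have arrived, not after the whole body has been buffered. JSON.parse on the result is still synchronous, so the limit should be small enough that parsing stays in the low milliseconds.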

Using cluster without an external session or cache store

Symptom

Users experience random logouts, inconsistent feature flag states, or stale data because their requests hit different workers with completely separate in-memory stores. The issue appears intermittently and is difficult to reproduce locally because local development typically runs a single process.

Fix

Move sessions to Redis (connect-redis with express-session), move caches to Redis with appropriate TTLs, and move rate limiting state to Redis (rate-limit-redis). In-memory stores are only reliable in single-process deployments. As of 2026, Valkey (the Redis fork maintained by the Linux Foundation) is a production-ready alternative if Redis licensing is a concern.

Setting --max-old-space-size to the full container memory allocation

Symptom

Container gets OOM-killed by the kernel (exit code 137) even though the V8 heap appears to be within the limit. The kill happens during peak GC activity when V8 briefly holds both the current heap and the compacted copy in memory simultaneously.

Fix

Set --max-old-space-size to 70-75% of your container memory allocation. The remaining 25-30% covers non-heap allocations: Node.js Buffer pool (off-heap by design), native module memory (zlib, crypto), libuv thread pool stacks, worker thread stacks, and the OS page cache. A 512MB container should have --max-old-space-size=384 at most.
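
The 70-75% rule as arithmetic, using the 512MB example from the paragraph above. maxOldSpaceSizeMB is a hypothetical helper; the default of 75 is the upper end of the range recommended here.

```javascript
// Suggested --max-old-space-size (MB) for a given container memory limit (MB).
function maxOldSpaceSizeMB(containerLimitMB, heapPercent = 75) {
  return Math.floor((containerLimitMB * heapPercent) / 100);
}

console.log(`--max-old-space-size=${maxOldSpaceSizeMB(512)}`);
// --max-old-space-size=384
```

Services that hold many Buffers or use heavy native modules may need to drop heapPercent towards 70 or lower, since that memory lives outside the heap the flag controls.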

Using setTimeout(fn, 0) to yield the event loop in batch processing

Symptom

Batch processing of large arrays takes significantly longer than expected — sometimes 10-100x longer than necessary. Each setTimeout(fn, 0) call introduces a minimum ~1ms delay due to timer resolution, which adds up to seconds across thousands of batches.

Fix

Use setImmediate instead: it runs in the check phase of the current event loop iteration with no artificial delay. For a batch of 100,000 items processed 500 at a time, both approaches yield 200 times, but setTimeout adds at least 200ms of timer delay on top (roughly 1ms per yield) while setImmediate adds none. At scale, that is the difference between 'fast enough' and 'too slow for production SLAs'.
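
A minimal sketch of the batching pattern described above. processInBatches and its parameters are hypothetical names; handleItem is assumed to be synchronous and cheap per item.

```javascript
// Process `items` in slices of `batchSize`, yielding to the event loop
// between slices so pending I/O callbacks get a turn.
function processInBatches(items, batchSize, handleItem) {
  return new Promise((resolve, reject) => {
    let index = 0;

    function runBatch() {
      try {
        const end = Math.min(index + batchSize, items.length);
        for (; index < end; index++) handleItem(items[index], index);
      } catch (err) {
        reject(err);
        return;
      }
      if (index < items.length) {
        setImmediate(runBatch); // yield, then continue with the next slice
      } else {
        resolve(index); // number of items processed
      }
    }

    runBatch();
  });
}
```

Batch size is the tuning knob: each slice should finish in a few milliseconds, because other requests wait for the full duration of one slice.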

Not monitoring event loop lag as a production metric

Symptom

Latency degrades gradually over days — a slow-growing cache, a regex applied to progressively longer strings, or GC pressure building as a memory leak matures. No single request fails. No error rate increases. Latency percentiles drift upward by 10ms per day for two weeks before someone notices.

Fix

Expose event loop lag as a Prometheus metric using prom-client's collectDefaultMetrics(), which emits nodejs_eventloop_lag_seconds, nodejs_eventloop_lag_p50_seconds, and nodejs_eventloop_lag_p99_seconds among others. Set alerts: warn when p90 exceeds 50ms (the defaults expose p50, p90, and p99, not p95), page when p99 exceeds 200ms. This metric is the earliest signal of event loop health degradation: earlier than latency percentiles, earlier than error rates.

Spawning a worker thread per incoming request

Symptom

Under sustained load, the process spawns hundreds of threads simultaneously. Memory usage spikes 2-3x normal. CPU spends more time on thread management than actual work. OS thread limits (typically 4,096-32,768 depending on configuration) cause thread creation to start failing with EAGAIN errors, which surface as unhandled exceptions.

Fix

Use piscina with a fixed maxThreads sized to the CPU core count (or CPU count minus 1 to leave a core for the event loop). Queue excess work rather than spawning more threads. Monitor pool.queueSize as a metric and implement load shedding (return 503) when the queue exceeds a threshold that your SLA cannot tolerate.

Ignoring unhandledRejection events in cluster workers

Symptom

On Node.js 15 and later (including Node.js 22), unhandled promise rejections terminate the process. Cluster workers die silently — no stack trace in the application logs, just a 'Worker N exited with code 1' message. The root cause is invisible.

Fix

Add a process.on('unhandledRejection') handler in every worker that logs the full rejection reason and stack trace before the process exits. Consider whether to exit immediately (safest — prevents undefined state) or attempt graceful shutdown first. Never swallow the rejection without logging — a silent crash is the hardest class of production bug to diagnose.
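
A minimal sketch of such a handler, taking the "log, then exit" option. describeRejection is a hypothetical helper; the handler itself uses only the built-in process API.

```javascript
// Turn any rejection reason (Error or not) into a loggable record.
function describeRejection(reason) {
  if (reason instanceof Error) {
    return { message: reason.message, stack: reason.stack };
  }
  return { message: String(reason), stack: null };
}

process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection in worker', process.pid, describeRejection(reason));
  // Registering this handler replaces Node's default crash behaviour,
  // so exit explicitly rather than continue in an unknown state.
  process.exitCode = 1;
  setImmediate(() => process.exit(1));
});
```

Rejections are not always Error objects, which is why the helper handles plain values too. A graceful variant would stop accepting connections and wait for in-flight requests before calling process.exit.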

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the phases of the Node.js event loop and what happens in each ph...

Q02 · SENIOR

What is the difference between cluster.fork() and worker_threads, and wh...

Q03 · SENIOR

How would you detect and fix a memory leak in a production Node.js servi...

Q04 · SENIOR

Why might a Node.js service have low CPU usage but high latency?

Q05 · JUNIOR

What is event loop lag and how do you monitor it in production?

Q01 of 05 · SENIOR

Explain the phases of the Node.js event loop and what happens in each phase.

ANSWER

The event loop runs six phases in a fixed order every iteration.
Timers: executes setTimeout and setInterval callbacks whose minimum delay has elapsed.
Pending callbacks: executes I/O callbacks that were deferred to the next iteration, like TCP error notifications from the previous iteration.
Idle/prepare: internal libuv housekeeping; application code does not interact with this phase.
Poll: retrieves new I/O events and executes their callbacks. This is where the majority of application work happens, and where the loop may block waiting for new I/O events if there are no pending timers and no setImmediate callbacks.
Check: executes setImmediate callbacks, which are specifically designed to run after the poll phase and before the next timer phase.
Close callbacks: executes handlers for abrupt closes, like socket.on('close').

Between phases, and since Node.js 11 after each individual callback as well, two queues are fully drained in priority order: process.nextTick callbacks first (all of them), then resolved Promise.then handlers. This means process.nextTick fires before any Promise.then, which fires before the next event loop phase. The practical implication: if you call process.nextTick recursively, you can starve I/O indefinitely, because the next phase never starts until the nextTick queue is empty.
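
The ordering in this answer can be observed directly. The sketch below schedules one callback of each kind from inside an I/O callback, where the order is deterministic; observeOrder is a hypothetical name and fs.stat('.') is just a convenient way to get into the poll phase.

```javascript
const fs = require('fs');

// Resolves with the order in which the four callbacks actually ran.
function observeOrder() {
  return new Promise((resolve) => {
    const order = [];
    fs.stat('.', () => {
      // We are now inside an I/O callback, i.e. in the poll phase.
      setTimeout(() => { order.push('setTimeout'); resolve(order); }, 0);
      setImmediate(() => order.push('setImmediate'));
      Promise.resolve().then(() => order.push('promise'));
      process.nextTick(() => order.push('nextTick'));
    });
  });
}

observeOrder().then((order) => console.log(order.join(' -> ')));
// nextTick -> promise -> setImmediate -> setTimeout
```

The same four calls made from the main module, outside any I/O callback, do not guarantee that setImmediate beats setTimeout; the guarantee holds because the check phase always follows poll.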

FAQ · 6 QUESTIONS

Frequently Asked Questions

How many cluster workers should I run in production?

Why does my Node.js process use more memory than --max-old-space-size allows?

Should I use PM2 or the built-in cluster module?

Can I use async/await for everything and never worry about blocking the event loop?

What causes the 'JavaScript heap out of memory' error and how do I fix it immediately?

Is Node.js 22 LTS significantly different from Node.js 20 LTS for the topics covered in this guide?

Naren Founder & Principal Engineer

20+ years shipping production JavaScript and front-end systems at scale. Notes here come from systems that actually shipped.

Last updated: July 27, 2026