HTTP 500 means the server code failed — not the client request.
The real detail is server-side, not in the response: app logs hold the stack trace (the symptom), infra metrics hold the root cause.
Three layers to check: HTTP response (500), app logs (what failed), infrastructure (why it failed).
80% of 500s come from 5 causes: unhandled exceptions, DB failures, misconfig, resource exhaustion, bad deploys.
Production trap: the stack trace often points at a symptom; the real cause is upstream (pool exhaustion, timeout).
Plain-English First
Imagine you walk into a restaurant, hand the waiter your order, and he disappears into the kitchen — then comes back five minutes later and just says 'something went wrong in there.' He can't tell you what. The chef burned something, dropped something, ran out of gas — who knows. An HTTP 500 is exactly that: the server received your request just fine, understood it, tried to do something with it, and then something inside blew up. The server's embarrassed, it's not your fault as the customer, and the only way to find out what actually happened is to go into the kitchen and look at the mess yourself.
At 2:47am on a Black Friday, I watched a payments service return nothing but 500s for eleven straight minutes because a single database connection pool hit its limit and nobody had set a timeout on the fallback. Eleven minutes. Six figures in lost revenue. The worst part? The fix was a one-line config change that had been flagged in a code review two weeks earlier and marked 'low priority.' The HTTP 500 is the most common, most misunderstood, and most preventable error in web development — and most teams are flying blind when it hits.
A 500 is the server's way of raising a white flag. It doesn't mean your network is broken. It doesn't mean the URL is wrong. It means the server got your request, tried to process it, and something inside its own code or infrastructure fell apart. That distinction matters enormously when you're debugging at speed under pressure. Half the time I see developers waste thirty minutes checking their frontend or their DNS when the actual problem is a null pointer in a backend service they forgot to restart after a config change.
By the end of this, you'll know exactly what causes a 500, how to read the signals it leaves behind, and how to fix the five most common production variants. You'll have a repeatable debugging process you can run in under ten minutes. And you'll know which monitoring you need in place before the next one hits — because there will be a next one.
What a 500 Actually Means Under the Hood
HTTP status codes are a conversation between a client (your browser, a mobile app, an API consumer) and a server. The 5xx range specifically means 'the server is the problem here, not you.' A 400 means you sent something bad. A 500 means the server tried to handle your request and something in its own territory exploded.
The HTTP spec defines 500 as a catch-all: 'The server encountered an unexpected condition that prevented it from fulfilling the request.' That word 'unexpected' is doing a lot of heavy lifting. It means the developer didn't anticipate this failure path. A well-designed server that intentionally rejects something sends a 400 or 409. A 500 is unplanned chaos.
Every 500 has three layers you need to understand. First, there's the HTTP response the client sees — just the status code and maybe a vague error page. Second, there's the application log on the server — this is where the actual stack trace or error message lives, and it's the only thing that matters for debugging. Third, there's the infrastructure layer — the database, the message queue, the third-party API — which may be the real root cause even if the application log points somewhere else. Skipping any of these three layers is how debugging turns into a three-hour mystery instead of a ten-minute fix.
HTTP500ResponseFlow.systemdesign (PLAINTEXT)
// io.thecodeforge — System Design tutorial
// What actually happens during an HTTP 500 — request/response lifecycle
// === CLIENT SIDE (what the browser or API consumer sees) ===
REQUEST:
POST /api/checkout/complete HTTP/1.1
Host: shop.example.com
Content-Type: application/json
Body: { "cart_id": "abc123", "payment_token": "tok_xyz" }
RESPONSE (what the client receives — almost useless for debugging):
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Body: { "error": "Something went wrong. Please try again." }
// Notice: the client gets ZERO useful information.
// This is intentional — leaking stack traces to clients is a security risk.
// The real information lives in the SERVER LOGS, not the response.
// === SERVER SIDE (what actually happened — where you debug) ===
[2024-11-29 02:47:13] ERROR CheckoutService - Unhandled exception during payment processing
java.lang.NullPointerException: Cannot invoke method getBalance() on null object reference
    at io.thecodeforge.checkout.PaymentProcessor.validateFunds(PaymentProcessor.java:112)
    at io.thecodeforge.checkout.CheckoutService.completeOrder(CheckoutService.java:87)
    at io.thecodeforge.checkout.CheckoutController.handleCheckout(CheckoutController.java:45)
Caused by: UserAccount object was null — user session expired mid-checkout
// === INFRASTRUCTURE LAYER (may be the real root cause) ===
[2024-11-29 02:47:13] WARN DatabasePool - Connection pool exhausted (max=10, active=10, pending=47)
// 47 requests waiting for a DB connection that never comes free.
// The NullPointerException above is a SYMPTOM.
// The DB pool exhaustion is the ROOT CAUSE.
// Fixing only the NPE would not fix the 500s — they'd keep coming.
// === THE THREE LAYERS — always check all three ===
// Layer 1: HTTP response → tells you a 500 happened
// Layer 2: App logs → tells you WHAT failed (stack trace)
// Layer 3: Infra metrics → tells you WHY it failed (root cause)
APP LOG SHOWS: NullPointerException at PaymentProcessor.java:112
INFRA SHOWS: DB connection pool exhausted — 47 requests queued
ROOT CAUSE: Pool maxed out → DB queries hung → sessions expired → NPE on null user
FIX REQUIRED: Increase pool size + add connection timeout + add null guard on user session
Production Trap: The Misleading Stack Trace
The exception in your app log is often a symptom, not the root cause. I've seen teams spend two hours 'fixing' a NullPointerException that kept coming back — because the real problem was a saturated thread pool upstream that was killing DB connections before queries could complete. Always check your infrastructure metrics (DB pool, memory, thread count) before you trust the stack trace as the final word.
Production Insight
Stack traces show what broke — not why it broke.
Infra metrics (pool usage, memory, threads) expose the real cause.
Rule: never fix a 500 based on the stack trace alone. Check infra first.
Key Takeaway
The 500 response tells you nothing. The app log tells you what. The infra metrics tell you why.
Always check all three layers before changing a single line of code.
Symptom != root cause — that stack trace is a distraction until you confirm infrastructure health.
The Five Real Causes Behind 95% of 500 Errors
Here's what nobody tells you: 500 errors come from a surprisingly small set of root causes. Once you've seen enough of them in production, you develop a mental checklist you run in sequence. These five cover the vast majority of everything you'll encounter.
The first is unhandled exceptions — code that throws an error and has no try/catch or error handler to intercept it. The runtime unwinds, nothing catches it, and the web framework slaps a 500 on the response. The second is database failures — connection timeouts, pool exhaustion, query errors, or the database simply being down. The third is misconfiguration — a missing environment variable, a wrong file path, a secret that didn't get deployed to production. I've seen entire services go 500 because someone forgot to set a DATABASE_URL environment variable after a cloud migration. Fourth is resource exhaustion — out of memory, out of disk space, out of file descriptors. The fifth is bad deployments — a syntax error in code that only manifests at runtime, a missing dependency, or a breaking schema change deployed out of order.
The reason this matters before you look at any code: each cause has a different debugging path and a different fix. Jumping straight to code before you know which category you're in is how you waste an hour.
HTTP500CausesDiagnosticTree.systemdesign (PLAINTEXT)
// io.thecodeforge — System Design tutorial
// Decision tree: diagnosing which category of 500 you're dealing with
// Run these checks IN ORDER — each one narrows the field
==========================================================================
STEP 1 — Did this just start? Or has it always happened on this endpoint?
==========================================================================
Always happened on this endpoint:
→ Likely: Unhandled exception OR misconfiguration
→ Go to STEP 3
Just started after a deployment:
→ Likely: Bad deployment (syntax error, missing env var, schema mismatch)
→ IMMEDIATE ACTION: Check deploy logs and consider rollback
→ Go to STEP 2
Started gradually under load:
→ Likely: Resource exhaustion or DB connection pool saturation
→ Go to STEP 4
==========================================================================
STEP 2 — Bad Deployment Checklist
==========================================================================
[ ] Check application startup logs — did the process even start cleanly?
Red flag: "Error: Cannot find module './config/database'"Red flag: "SyntaxError: Unexpected token }" (runtime parse error)
[ ] Check environment variables are set in the NEW environment
Red flag: process.env.DATABASE_URL is undefined
Fix: Re-run your secrets injection / config sync before redeploying
[ ] Check for database schema mismatches
Red flag: "column 'user_tier' does not exist" (code expects column, migration didn't run)
Fix: Run pending migrations BEFORE deploying code that depends on them
[ ] If nothing obvious — ROLLBACK first, investigate second
Rule: Production stability > root cause analysis. Rollback. Then debug.
==========================================================================
STEP 3 — Unhandled Exception / Misconfiguration Checklist
==========================================================================
[ ] Pull the server application log for the exact timestamp of the 500
Look for: stack trace, exception class name, file + line number
[ ] Most common exception types that cause 500s:
NullPointerException / TypeError → object was null/undefined when you accessed it
FileNotFoundException → config file path is wrong or file not deployed
ClassNotFoundException → dependency jar/package missing in production
OperationalError: no such table → database migration never ran
[ ] Search for the error message verbatim in your codebase
This tells you exactly which line threw — and whether it has error handling
==========================================================================
STEP 4 — Resource Exhaustion Checklist
==========================================================================
[ ] Database connection pool
Check: SELECT count(*) FROM pg_stat_activity; (PostgreSQL)
Red flag: active connections near or at max_connections limit
Quick fix: Kill idle connections; longer fix: tune pool size + add timeouts
[ ] Memory
Check: `free -h` (Linux) or your cloud provider's memory metric
Red flag: available memory near zero, OOM killer in system logs
Fix: Increase instance size OR fix the memory leak (heap dump required)
[ ] Disk space
Check: `df -h`
Red flag: filesystem at 100% — logs often fill disks silently
Quick fix: Clear old logs; permanent fix: log rotation + disk alerts
[ ] File descriptors
Check: `ulimit -n` vs `lsof | wc -l`
Red flag: open files near system limit
Fix: Increase ulimit; check for connection/file handle leaks in code
==========================================================================
DECISION OUTPUT — what to do with your finding
==========================================================================
Bad Deployment → Rollback → Fix → Redeploy with proper migration order
Unhandled Exception → Add try/catch → return meaningful error response → fix root cause
Misconfiguration → Set the missing config → restart service → add config validation at startup
Resource Exhaustion → Immediate: scale or kill idle connections → Long term: fix the leak
Output
Diagnostic result depends on your environment — this is a decision tree, not runnable code.
Expected output for each step:
Step 1 → routes you to Step 2, 3, or 4 based on timing
Step 2 → identifies deploy artifact or migration problem
Step 3 → gives you exact file + line number of the exception
Step 4 → surfaces the exhausted resource and its current vs. max value
Senior Shortcut: The 5-Minute 500 Triage
When a 500 alert fires, run these four commands before touching any code: (1) check when it started relative to the last deploy, (2) grep your app logs for 'ERROR' or 'Exception' at that timestamp, (3) check your DB connection pool metrics, (4) run 'df -h' and 'free -h'. In 80% of cases, one of these four gives you the answer before you've opened a single source file.
Production Insight
The five causes each have a distinct fingerprint.
Unhandled exceptions show a stack trace; DB failures show pool metrics; misconfig shows startup errors; resource exhaustion shows system metrics; bad deploy shows timing correlation.
Rule: classify before you debug — the wrong fix wastes time and often causes collateral damage.
Key Takeaway
Jumping to code without classifying the cause is the #1 time-waster.
Use the timing of the 500 to narrow it down: always happening? just deployed? under load?
Each cause has a distinct debugging path — pick the right one and you're 80% done.
Fixing 500s the Right Way: Code Patterns That Actually Hold Up
Knowing the cause is half the battle. The other half is fixing it in a way that doesn't just hide the 500 and create a worse problem downstream. The two most common bad fixes I've seen: swallowing exceptions silently (so the 500 goes away but the actual failure keeps happening undetected), and catching every exception at the top level and returning a 200 with an error body (which is arguably worse — now your monitoring thinks everything is fine).
The right approach has three parts. First: catch specific, expected failures close to where they happen and handle them gracefully — redirect to a login page, return a meaningful 4xx, retry the operation. Second: let unexpected exceptions bubble up to a single top-level error handler that logs the full stack trace, returns a proper 500, and triggers an alert. Third: add circuit breakers around external dependencies so that when a downstream service is sick, you fail fast instead of piling up 500s while threads wait for timeouts.
The following example shows all three patterns working together in a realistic e-commerce checkout service — the kind of code that actually needs to survive traffic spikes and flaky payment providers.
CheckoutService.errorhandling.js (JAVASCRIPT)
// io.thecodeforge — System Design tutorial
// Production error handling pattern for an e-commerce checkout service
// Framework: Express.js — patterns apply to any Node.js web framework
const express = require('express');
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated below

// ─────────────────────────────────────────────────────────────────
// CIRCUIT BREAKER — fail fast when a dependency is known to be down
// Without this: every request hangs for 30s waiting for a timeout,
// threads pile up, memory spikes, the whole service goes 500.
// ─────────────────────────────────────────────────────────────────
class CircuitBreaker {
  constructor(failureThreshold = 5, recoveryTimeoutMs = 30000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold; // open circuit after 5 consecutive failures
    this.state = 'CLOSED'; // CLOSED = normal, OPEN = failing fast, HALF_OPEN = testing recovery
    this.nextAttemptAt = null;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
  }

  async call(operationFn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptAt) {
        // Still in recovery window — reject immediately without calling the dependency
        throw new Error('CircuitBreaker:OPEN — dependency unavailable, failing fast');
      }
      // Recovery window expired — allow one probe request through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operationFn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err; // re-throw so the caller handles it — don't swallow
    }
  }

  _onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED'; // dependency is healthy again
  }

  _onFailure() {
    this.failureCount += 1;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      // Schedule the recovery probe — don't hammer a sick dependency
      this.nextAttemptAt = Date.now() + this.recoveryTimeoutMs;
    }
  }
}

// One circuit breaker per external dependency — never share them
const paymentGatewayBreaker = new CircuitBreaker(5, 30000);
const inventoryServiceBreaker = new CircuitBreaker(3, 15000);

// ─────────────────────────────────────────────────────────────────
// CHECKOUT ROUTE — specific error handling close to the source
// Each failure type gets its own response — no generic catch-all
// ─────────────────────────────────────────────────────────────────
app.post('/api/checkout/complete', async (req, res, next) => {
  const { cartId, paymentToken, userId } = req.body;

  // INPUT VALIDATION — catch bad requests before any business logic runs
  // These are 400s, not 500s — the client sent bad data, not our fault
  if (!cartId || !paymentToken || !userId) {
    return res.status(400).json({
      error: 'MISSING_REQUIRED_FIELDS',
      message: 'cartId, paymentToken, and userId are all required'
    });
  }

  try {
    // STEP 1: Check inventory via circuit-breaker-protected call
    const inventoryAvailable = await inventoryServiceBreaker.call(() =>
      checkInventoryAvailability(cartId)
    );
    if (!inventoryAvailable) {
      // This is an expected business failure — not a 500, it's a 409 Conflict
      return res.status(409).json({
        error: 'INVENTORY_CONFLICT',
        message: 'One or more items in your cart are no longer available'
      });
    }

    // STEP 2: Process payment via circuit-breaker-protected call
    const paymentResult = await paymentGatewayBreaker.call(async () =>
      chargePaymentToken(paymentToken, await calculateCartTotal(cartId))
    );

    // STEP 3: Persist the order — wrap in try/catch for DB-specific errors
    const order = await persistOrder(userId, cartId, paymentResult.transactionId);
    return res.status(201).json({
      orderId: order.id,
      transactionId: paymentResult.transactionId,
      status: 'CONFIRMED'
    });
  } catch (err) {
    // SPECIFIC KNOWN ERRORS — handle gracefully without a 500
    if (err.message.includes('CircuitBreaker:OPEN')) {
      // Dependency is known-down — tell the client, don't pretend it's our fault
      return res.status(503).json({
        error: 'SERVICE_TEMPORARILY_UNAVAILABLE',
        message: 'Payment processing is temporarily unavailable. Please try again in 30 seconds.',
        retryAfterSeconds: 30
      });
    }
    if (err.code === 'PAYMENT_DECLINED') {
      // Payment gateway explicitly declined — this is a 402, client needs to act
      return res.status(402).json({
        error: 'PAYMENT_DECLINED',
        message: 'Your payment was declined. Please check your card details and try again.'
      });
    }
    // UNEXPECTED ERROR — pass to the global error handler via next()
    // DO NOT return a 500 here — let the central handler do it.
    // DO NOT log here — the central handler does that too.
    // This keeps logging consistent and prevents double-logging.
    next(err);
  }
});

// ─────────────────────────────────────────────────────────────────
// GLOBAL ERROR HANDLER — the last line of defence
// Express recognises this as an error handler because it has 4 params
// This runs for any error that reaches next(err) from any route
// ─────────────────────────────────────────────────────────────────
app.use((err, req, res, next) => {
  // Generate a unique ID so you can correlate the user's report with your logs
  const errorId = `ERR-${Date.now()}-${Math.random().toString(36).substr(2, 6).toUpperCase()}`;

  // ALWAYS log the full stack trace server-side — never swallow it
  // Include request context so you can reproduce the failure
  console.error({
    errorId,
    message: err.message,
    stack: err.stack,
    request: {
      method: req.method,
      url: req.url,
      userId: req.body?.userId, // log who was affected
      cartId: req.body?.cartId, // log what they were doing
      userAgent: req.headers['user-agent']
    },
    timestamp: new Date().toISOString()
  });

  // Trigger your alerting pipeline here (PagerDuty, Sentry, etc.)
  // notifyOnCallEngineer(err, errorId); ← wire this up in production

  // Return the error ID to the client — they can quote it in a support ticket
  // NEVER return the stack trace or internal error message to the client
  return res.status(500).json({
    error: 'INTERNAL_SERVER_ERROR',
    message: 'An unexpected error occurred. Please try again or contact support.',
    errorId // lets your support team look this up in logs instantly
  });
});

// Placeholder stubs — these would be real service calls in production
async function checkInventoryAvailability(cartId) { return true; }
async function chargePaymentToken(token, amount) { return { transactionId: 'txn_abc123' }; }
async function calculateCartTotal(cartId) { return 99.99; }
async function persistOrder(userId, cartId, txnId) { return { id: 'order_xyz789' }; }

app.listen(3000, () => console.log('Checkout service running on port 3000'));
Never Do This: Swallowing Exceptions to Kill the 500
I've reviewed codebases where someone wrapped an entire route in try/catch and returned res.status(200).json({ success: false }) for every error — because 'the client was complaining about 500s.' The 500s disappeared from monitoring. The underlying failures kept happening. Nobody knew for six weeks. Your monitoring is only as honest as your HTTP status codes — a lying 200 is worse than an honest 500.
Circuit breakers prevent cascading 500s by failing fast when a dependency is sick.
Rule: let unexpected exceptions propagate to a central handler that logs, alerts, and returns a proper 500. Never catch-all to 200.
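To make the contrast concrete, here is a minimal sketch of the dishonest pattern next to the honest one. Express is assumed, and createOrder plus the VALIDATION_FAILED error code are hypothetical placeholders rather than part of the checkout service above.
DishonestVsHonest.errorhandling.js (JAVASCRIPT)
const express = require('express');
const app = express();
app.use(express.json());
// Hypothetical stub standing in for a real service call
async function createOrder(payload) { return { id: 'order_123', ...payload }; }

// ANTI-PATTERN: the 500 disappears from monitoring, the failure does not
app.post('/api/orders-dishonest', async (req, res) => {
  try {
    const order = await createOrder(req.body);
    res.status(200).json({ success: true, order });
  } catch (err) {
    // Monitoring sees a 200, alerting stays silent, the failure stays invisible for weeks
    res.status(200).json({ success: false });
  }
});

// HONEST PATTERN: expected failures get a 4xx, everything else reaches the central handler
app.post('/api/orders', async (req, res, next) => {
  try {
    const order = await createOrder(req.body);
    res.status(201).json({ order });
  } catch (err) {
    if (err.code === 'VALIDATION_FAILED') {
      // Expected business failure: the client can fix this, so it is a 422, not a 500
      return res.status(422).json({ error: 'VALIDATION_FAILED', message: err.message });
    }
    next(err); // the global handler logs, alerts, and returns an honest 500
  }
});

app.listen(3001);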
Key Takeaway
Honest HTTP status codes are your monitoring's only source of truth.
A 200 with an error flag is a lie that delays detection by weeks.
Careful error handling: catch expected failures early, let unexpected ones bubble to a central handler that acts.
Debugging 500s in Production: A Step-by-Step Process That Always Works
When a 500 alert fires at 3am, you don't have the luxury of browsing through documentation. You need a repeatable process that works every time. Here's the process I've used across five production outages — it's never failed me.
Step 1: Determine the blast radius. Is this affecting one user, one endpoint, or the whole service? Check your error rate dashboard first, not the logs. If it's the whole service, start with infrastructure checks (disk, memory, pool). If it's one endpoint, focus on that endpoint's logs and any recent changes.
Step 2: Check the deployment timeline. Did a deploy happen in the last hour? If yes, roll back before investigating. Production stability comes first. If no deploy, move to the next step.
Step 3: Read the logs — but read them with intent. Don't scroll aimlessly. grep for 'ERROR' or 'Exception' at the timestamp of the first 500. Look for the first occurrence of a new error pattern. The first error is often the root cause; subsequent ones are cascade failures.
Step 4: Check infrastructure metrics simultaneously. Open three terminal windows — one for logs tailing, one for 'free -h' and 'df -h', one for DB pool status. Cross-reference what you see. If logs show a connection timeout and the DB pool shows 100% active, you've found the cause.
Step 5: Reproduce locally if possible. If the error only happens under specific conditions, try to simulate them in a staging environment. If you can't reproduce, add structured logging around the failing code path and wait for the next occurrence. Yes, sometimes you have to let it happen again with more instrumentation — and that's okay if you've reduced the blast radius.
This process takes 10 minutes. Most of your time will be spent on false trails — logs that point to a symptom, not the cause. The key is staying disciplined and not jumping to conclusions.
The 3-Window Debugging Setup
Open three terminals or split panes: (1) 'kubectl logs -f <pod> --tail=100' for a live log tail, (2) 'watch -n 5 "free -h; df -h"' for real-time resource metrics, and (3) a DB pool poll such as 'kubectl exec <pod> -- psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"', re-run every few seconds (wrap it in watch or a shell loop). Cross-reference in real time — when you see a log spike, check which metric changed at the same instant.
Production Insight
Most debugging time is wasted on false trails caused by cascade failures.
The first error in the logs is often the real cause — later errors are just downstream effects.
Rule: never chase a stack trace that appears after a resource exhaustion error. Fix the exhaustion first.
Key Takeaway
A disciplined 10-minute process beats an hour of frantic log scrolling.
Check blast radius, deployment timeline, first error timestamp, and infrastructure metrics — in that order.
Cross-reference logs and metrics in real time. The correlation tells you the story, not either one alone.
Monitoring and Prevention: Never Be Blind-sided by a 500 Again
Fixing the current 500 is reactive. What separates seniors from juniors is what you put in place so the next one doesn't take you by surprise at 3am. There are four things that matter here: structured logging, error rate alerting, health checks, and startup validation.
Structured logging means your logs are JSON, not plain text. When you're grepping logs at 2am for a specific user's failed checkout, you want to filter by userId in one command — not read through thousands of lines of unformatted text. Every log line should have a timestamp, severity level, correlation ID, and the relevant business context.
Error rate alerting means you're monitoring the percentage of 5xx responses, not just whether the service is up. A service that's 'up' but returning 500 on 30% of requests is not 'up.' Set an alert threshold — anything above 1% 5xx rate on a critical endpoint should page someone. And add startup-time config validation: if a required environment variable is missing, crash loudly at boot with a clear error message instead of returning 500s for hours until someone checks the logs.
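To make "alert on rate, not raw count" concrete, here is a minimal sketch of a sliding-window 5xx-rate tracker. In production this calculation usually lives in your metrics backend (Prometheus, Datadog, CloudWatch) rather than in the service itself, and the names and thresholds below are illustrative only.
ErrorRateTracker.sketch.js (JAVASCRIPT)
const WINDOW_MS = 5 * 60 * 1000;   // 5-minute sliding window
const CRITICAL_5XX_RATE = 0.01;    // 1% threshold for a critical path
const MIN_SAMPLES = 100;           // don't alert on a handful of requests
const samples = [];                // one entry per completed request

function recordResponse(statusCode) {
  const now = Date.now();
  samples.push({ ts: now, is5xx: statusCode >= 500 });
  // Drop samples that have fallen out of the window
  while (samples.length && samples[0].ts < now - WINDOW_MS) samples.shift();

  const errors = samples.reduce((n, s) => n + (s.is5xx ? 1 : 0), 0);
  const rate = errors / samples.length;
  if (samples.length >= MIN_SAMPLES && rate > CRITICAL_5XX_RATE) {
    // Placeholder: wire this to your real paging hook instead of console.error
    console.error(`ALERT: 5xx rate ${(rate * 100).toFixed(2)}% (${errors}/${samples.length}) over the last 5 minutes`);
  }
}

// Express wiring: record the status code of every completed response
// app.use((req, res, next) => { res.on('finish', () => recordResponse(res.statusCode)); next(); });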
HTTP500PreventionChecklist.systemdesign (PLAINTEXT)
// io.thecodeforge — System Design tutorial
// Production readiness checklist: preventing and catching 500s before users do
==========================================================================
TIER 1 — STARTUP VALIDATION (catch misconfigs before the service accepts traffic)
==========================================================================
At service boot, BEFORE binding to a port:
[ ] Validate all required environment variables exist and are non-empty
Pattern: fail-fast with a clear message
Example:
const required = ['DATABASE_URL', 'PAYMENT_API_KEY', 'JWT_SECRET'];
required.forEach(key => {
if (!process.env[key]) {
throw new Error(`STARTUP_FAILURE: Required environment variable '${key}' is not set.`);
// Process exits. Load balancer sees the instance never became healthy.
// No 500s served. Clean failure.
}
});
[ ] Test database connectivity at startup
Pattern: ping the DB, confirm connection pool initialises successfully
If DB is unreachable at startup: crash loudly, do not serve traffic
[ ] Verify critical config file paths exist
Pattern: fs.accessSync(configPath) — throws if file missing, crashes cleanly
==========================================================================
TIER 2 — STRUCTURED LOGGING (make logs searchable when it matters most)
==========================================================================
Bad log (plain text — useless under pressure):
[ERROR] Something failed during checkout for user abc at 2024-11-29 02:47:13
Good log (structured JSON — filterable in 10 seconds):
{
"timestamp": "2024-11-29T02:47:13.000Z",
"level": "ERROR",
"service": "checkout-service",
"errorId": "ERR-1732845600000-K7X2MN",
"userId": "usr_456",
"cartId": "cart_789",
"endpoint": "POST /api/checkout/complete",
"errorClass": "NullPointerException",
"message": "Cannot invoke getBalance() on null UserAccount",
"durationMs": 234
}
Why this matters: grep '"userId": "usr_456"' | jq '.errorId'
Gets you the exact error ID in one command. Without structure: read every line manually.
==========================================================================
TIER 3 — ALERTING THRESHOLDS (know before your users do)
==========================================================================
Metric | Alert threshold | Severity
─────────────────────────────────────────────────────────────────────
5xx error rate (critical path) | > 1% over 5 min window | PAGE
5xx error rate (non-critical) | > 5% over 5 min window | SLACK ALERT
DB connection pool usage | > 80% of max | SLACK ALERT
DB connection pool usage | > 95% of max | PAGE
Available memory | < 20% of total | SLACK ALERT
Disk usage | > 85% of total | SLACK ALERT
Disk usage | > 95% of total | PAGE
P99 response latency | > 5x normal baseline | SLACK ALERT
Key rule: alert on RATE, not raw count.
10 errors in 1 minute during 10 req/min traffic = 100% error rate. PAGE.
10 errors in 1 minute during 100,000 req/min traffic = 0.01% error rate. Ignore.
==========================================================================
TIER 4 — HEALTH CHECK ENDPOINT (let your load balancer save you)
==========================================================================
GET /health → should check:
[ ] Database is reachable (run a lightweight SELECT 1 query)
[ ] Memory usage is below critical threshold
[ ] All required config is loaded
[ ] Any circuit breakers are not permanently open
Return 200 only when ALL checks pass.
Return 503 (not 500) when any dependency is unhealthy.
Your load balancer polls /health every 10-30 seconds.
If it gets a non-200, it stops routing traffic to that instance.
This means a sick instance stops serving 500s automatically —
without anyone waking up at 3am to restart it manually.
Health check response time must be < 500ms.
If your health check itself times out, it causes cascading failures.
Output
Startup failure (missing env var):
STARTUP_FAILURE: Required environment variable 'PAYMENT_API_KEY' is not set.
Process exited with code 1.
Load balancer: instance never marked healthy, no traffic routed.
Runtime health check failure: load balancer stops routing to the instance within 30 seconds.
Interview Gold: Health Check vs Liveness Check
Interviewers love this distinction. A liveness check answers 'Is the process alive?' — even a totally broken service passes this. A readiness/health check answers 'Is this instance ready to serve production traffic?' — it checks DB connectivity, dependency health, and memory. Kubernetes uses both: liveness probes restart dead processes, readiness probes control load balancer routing. Conflating them causes incidents where a degraded instance stays in the load balancer rotation returning 500s because the liveness check is passing.
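Here is what that split can look like in Express, as a rough sketch: the dependency checks (pingDatabase, the breaker reference, the memory threshold) are placeholders, and the real probes depend on your stack and orchestrator.
LivenessVsReadiness.sketch.js (JAVASCRIPT)
const express = require('express');
const app = express();

// Stand-ins for real dependencies; replace with your pool ping and your actual breaker instance
async function pingDatabase() { return true; } // e.g. a SELECT 1 with a short timeout
const paymentGatewayBreaker = { state: 'CLOSED' };

// Liveness: "is the process alive?" Never touches dependencies.
// An orchestrator such as Kubernetes restarts the pod when this fails.
app.get('/livez', (req, res) => res.status(200).send('OK'));

// Readiness: "can this instance serve production traffic right now?"
// The load balancer pulls the instance from rotation on any non-200.
app.get('/readyz', async (req, res) => {
  const mem = process.memoryUsage();
  const checks = {
    database: await pingDatabase().catch(() => false),
    memoryOk: mem.heapUsed < 0.9 * mem.heapTotal,
    breakerClosed: paymentGatewayBreaker.state !== 'OPEN'
  };
  const healthy = Object.values(checks).every(Boolean);
  // 503, not 500: the instance is degraded, the request itself did nothing wrong
  res.status(healthy ? 200 : 503).json(checks);
});

app.listen(3000);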
Production Insight
Startup validation catches misconfigs before they hurt users.
Error rate alerting on 5xx rate > 1% beats paging on process down.
Health checks must verify dependencies — a 200 from a sick instance is a lie.
Rule: crash loudly at boot for missing config, not silently during requests.
Key Takeaway
Prevention beats reaction: validate config at startup, log structurally, alert on error rate, and health-check dependencies.
A health check that returns 200 when the service is degraded is worse than no health check — it hides the failure.
The best 500 is the one that never happens because the instance never entered production.
● Production incident · Post-mortem · Severity: high
Black Friday Payment Meltdown: Connection Pool Exhaustion Without Timeouts
Symptom
All checkout requests returned HTTP 500 with various NullPointerExceptions and timeout errors. Health check still returned 200. Application logs showed intermittent DB query failures.
Assumption
The team assumed a database crash or network issue. They spent 20 minutes checking network connectivity and restarting the database before looking at connection pool metrics.
Root cause
The database connection pool was configured with max=10 connections and no connection timeout. Under normal load, 10 connections were enough. During Black Friday, 47 requests queued up waiting for a connection that never came free because each query took 30+ seconds due to a missing index. All 10 connections were occupied, new requests timed out after 120 seconds (default), and the application threw NullPointerException when the session expired mid-request.
Fix
1) Kill idle connections immediately: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle'; 2) Increase pool max to 50. 3) Add connection timeout of 5 seconds. 4) Add query timeout of 10 seconds. 5) Add health check that verifies pool health and returns 503 instead of 200 when pool usage exceeds 80%. The fix was deployed in 5 minutes after identifying the root cause.
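As a rough sketch of the configuration side of that fix, here is what the pool settings can look like with node-postgres. The driver, the exact option names, and the sample query are assumptions; check your own driver's documentation for the equivalents.
PoolTimeouts.sketch.js (JAVASCRIPT)
const { Pool } = require('pg'); // node-postgres assumed; other drivers expose similar knobs
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 50,                        // pool ceiling raised from 10 to 50, per the fix above
  connectionTimeoutMillis: 5000,  // stop waiting for a free connection after 5 seconds
  idleTimeoutMillis: 30000,       // release idle connections instead of hoarding them
  statement_timeout: 10000        // server-side query timeout: no query holds a connection past 10s
});

// With these settings a saturated pool produces fast, explicit errors
// instead of requests hanging until sessions expire and NPEs appear.
async function getOrderCount(userId) {
  const { rows } = await pool.query('SELECT count(*) FROM orders WHERE user_id = $1', [userId]);
  return Number(rows[0].count);
}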
Key lesson
Connection timeouts are not optional — they're the difference between a degraded service and a dead one.
A health check that returns 200 while the service can't serve requests is worse than no health check.
Stack traces lie. The NullPointerException was a symptom of the real cause: pool exhaustion. Always check infra metrics before trusting the first exception you see.
Production debug guide: run these checks in order — each one narrows the field by 50%.
Symptom · 01
500s started immediately after a deployment
→
Fix
Rollback first — production stability over RCA. Then check deploy diff: missing env var? Schema migration not run? Syntax error in new code? Use kubectl rollout undo or swap to previous version.
Symptom · 02
500s appear gradually under increasing load
→
Fix
Check DB connection pool usage, thread pool size, memory, disk space. Run df -h and free -h on the server. Look for OOM killer logs. The 500s are a symptom of resource exhaustion.
Symptom · 03
500s on specific endpoints only
→
Fix
Grep app logs for that endpoint's stack trace. Check if the endpoint calls an external API that might be down (circuit breaker pattern). Check recent schema changes that might affect that specific query.
Symptom · 04
500s with no stack trace in logs
→
Fix
Verify log level is set to ERROR or DEBUG. Check if the error handler is swallowing exceptions. Check for thread pool shutdown errors (e.g., RejectedExecutionException). Increase log verbosity temporarily.
Symptom · 05
500s that disappear after restart
→
Fix
Likely memory leak or connection leak. Run for a while after restart, then check memory usage and open connections. Use heap dump analysis for memory leaks (jmap, Eclipse MAT). Check for unclosed database connections.
★ Quick 500 Debug Cheat Sheet: go-to commands for the five most common 500 root causes. Run these before opening any code file.
Application feels slow, 500s pile up under load
Immediate action
Check database connection pool usage immediately
Commands
docker compose logs | grep -i "connection pool"
SELECT count(*) FROM pg_stat_activity;
Fix now
Kill idle connections and increase pool size with timeouts
500s with 'OutOfMemoryError' in logs
Immediate action
Check system memory and heap usage
Commands
free -h
jstat -gcutil <pid> 1000 5
Fix now
Restart with increased heap or fix the leak (heap dump + analysis)
Disk full — logs show 'No space left on device'
Immediate action
Check disk usage
Commands
df -h /app
du -sh /var/log/* | sort -rh | head -5
Fix now
Remove old logs and set up log rotation (logrotate)
500s with 'Connection refused' to an external service
Immediate action
Check if the downstream service is up
Commands
curl -I http://downstream-service/health
kubectl get pods -l app=downstream-service
Fix now
Restart downstream service or remove it from load balancer rotation
HTTP 500 vs HTTP 503: Know the Difference
Aspect | HTTP 500 Internal Server Error | HTTP 503 Service Unavailable
Fault owner | The server application code or config | Infrastructure or a downstream dependency
Typical cause | Unhandled exception, null reference, bad config | DB down, dependency timeout, circuit breaker open, overloaded
Is the service up? | Yes — process is running but code failed | Partially — process running but can't serve traffic healthily
Client should retry? | Not automatically — same request usually fails the same way | Yes — with exponential backoff; the issue is usually transient
Correct Retry-After header? | Rarely appropriate | Always set it — tells clients when to try again
Root cause location | Application logs — stack trace | Infrastructure metrics — connection pools, memory, external API status
Fix usually requires | Code change or config correction | Scaling, dependency recovery, or circuit breaker reset
Load balancer behaviour | Instance stays in rotation — keeps serving 500s | Health check returns 503 — instance pulled from rotation automatically
Your monitoring alert fires on | Error rate > threshold on that endpoint | Health check failures or dependency latency spike
Example error message | NullPointerException at PaymentProcessor.java:112 | Connection pool exhausted: max=10, active=10, pending=47
Key takeaways
1
The stack trace in your app log tells you what failed; the infrastructure metrics tell you why. Always check both before you touch code.
2
Swallowing exceptions to eliminate 500s from your monitoring is the most dangerous thing you can do. A lying 200 hides real failures for weeks. Your HTTP status codes are the only honest signal your monitoring has.
3
Set connection timeouts on every external call your service makes: database, HTTP client, cache client, everything. A missing timeout is a loaded gun pointed at your thread pool. When that pool exhausts, every request returns a 500 (see the sketch after this list).
4
A 500 that happens at startup is infinitely better than a 500 that happens in production traffic. Validate every required environment variable and config dependency before your service binds to a port; fail loudly at boot, not silently during requests.
5
Classify the cause before you debug: always happening? just deployed? under load? Each category has a different fix path. Jumping to code without classification wastes 80% of your debugging time.
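For the timeout rule in takeaway 3, here is a minimal sketch of putting an explicit deadline on an outbound HTTP call. It assumes Node 18+ (global fetch and AbortSignal.timeout); the URL and values are illustrative.
FetchWithTimeout.sketch.js (JAVASCRIPT)
// Every outbound call gets an explicit deadline so a slow dependency cannot pin a request forever
async function fetchJsonWithTimeout(url, options = {}, timeoutMs = 5000) {
  // AbortSignal.timeout aborts the underlying request once the deadline passes
  const response = await fetch(url, { ...options, signal: AbortSignal.timeout(timeoutMs) });
  if (!response.ok) {
    throw new Error(`Upstream returned ${response.status} for ${url}`);
  }
  return response.json();
}

// Usage sketch: a 5-second ceiling on a hypothetical inventory-service call
// const stock = await fetchJsonWithTimeout('http://inventory-service/api/stock/abc123', {}, 5000);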
Common mistakes to avoid
×
Catching all exceptions and returning HTTP 200 with an error flag
Symptom
Monitoring shows 0% error rate while real failures pile up silently. Users see a 'success' response but the action didn't complete.
Fix
Always use correct HTTP status codes: 500 for unexpected errors, 4xx for client errors, 503 for dependency failures. Your alerting and load balancer depend on honest status codes.
×
Returning the raw stack trace or internal error message in the HTTP response body
Symptom
Exposes internal file paths, library versions, and logic that attackers use for reconnaissance. Compliance failures (PII leak).
Fix
Log the full stack trace server-side. Return only a sanitised message and a unique errorId to the client. Never include err.stack in the response.
×
Not setting connection timeouts on database or HTTP clients
Symptom
One slow external call holds a thread forever. Under load, the pool exhausts in seconds and every subsequent request gets a 500.
Fix
Always set explicit connection and socket timeouts (e.g., connectionTimeout: 3000, socketTimeout: 5000). Wrap external calls in a circuit breaker with a timeout.
×
Deploying code that depends on a new database column before the migration runs
Symptom
100% of requests to that endpoint return 500 with 'column does not exist' until the migration is applied.
Fix
Run database migrations before deploying application code that depends on them. Add a startup check that verifies the expected schema version (a sketch follows this list).
×
Letting log files fill the disk because log rotation was never configured
Symptom
Server runs out of disk space. Every write operation (including logging the 500 itself) fails, making the incident completely undebuggable.
Fix
Configure logrotate or your logging daemon to rotate and compress logs daily. Set disk usage alerts at 85%. Use centralised logging (Datadog, CloudWatch, ELK) so logs survive instance crashes.
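For the startup schema check in mistake #4, here is a rough sketch. The schema_migrations table and the version numbering are placeholders for whatever your migration tool actually maintains.
SchemaVersionGuard.sketch.js (JAVASCRIPT)
// Refuse to accept traffic if the database schema is older than the code expects
const EXPECTED_SCHEMA_VERSION = 42; // bump alongside every migration the code depends on (illustrative)

async function assertSchemaVersion(pool) {
  // Assumes a migrations table maintained by your migration tool; table and column names vary by tool
  const { rows } = await pool.query('SELECT MAX(version) AS version FROM schema_migrations');
  const current = rows[0] && rows[0].version !== null ? Number(rows[0].version) : 0;
  if (current < EXPECTED_SCHEMA_VERSION) {
    // Crash at boot: the load balancer never marks the instance healthy, so no 500s reach users
    throw new Error(`STARTUP_FAILURE: schema version ${current} is behind expected ${EXPECTED_SCHEMA_VERSION}`);
  }
}

// Call this before app.listen(), right after the environment-variable validation shown earlier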
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 04 · SENIOR
Your checkout endpoint is returning 500s for 40% of requests. Your health check is still returning 200. Your application logs show no exceptions. Where do you look first and why?
ANSWER
The health check returning 200 while 40% of requests are 500s tells me the health check is too shallow. It's probably just checking if the process is alive, not if it can actually serve requests. I'd immediately check infrastructure metrics: memory, disk space, database connection pool. The absence of exceptions in app logs often points to resource exhaustion — the request fails before it even reaches your code. In Node.js, that could be a thread pool saturation; in Java, a connection pool timeout; in any language, an out-of-memory kill that silently fails requests. First command: free -h and df -h. Second: check database pool metrics. Third: look for TCP queue overflow or load balancer timeout. The root cause is almost certainly an exhausted resource that doesn't throw a standard application exception.
Q02 of 04 · SENIOR
When would you return a 503 instead of a 500 from your API, and how does that decision affect your client's retry behaviour and your load balancer's routing?
ANSWER
Return 503 when your service is alive but cannot serve traffic because a dependency is down, overwhelmed, or unreachable. This includes database connection pool exhaustion, circuit breaker open, downstream API timeout, or resource saturation (memory/disk/threads). The decision matters enormously: a 503 tells the client and the load balancer that the issue is transient. Clients should retry with exponential backoff (and you should set a Retry-After header). Load balancers configured with health checks that return 503 will automatically take the instance out of rotation, stopping the flood of retries from making things worse. A 500, on the other hand, means the server tried and failed — the same request is likely to fail again, so automatic retry is usually fruitless. Using 503 correctly turns a potential cascading failure into a controlled degradation. Using 500 for transient issues turns a minor hiccup into a full outage as threads pile up waiting for timeouts.
Q03 of 04 · SENIOR
You've added a global error handler that catches all unhandled exceptions and returns a 500. A junior dev asks: 'Why not just catch every exception and return a 200 with an error field — that way our error rate metric stays clean?' How do you respond?
ANSWER
I'd explain that clean metrics are worthless if they're lying. A 200 tells your monitoring, your load balancer, your logging pipeline, and your alerting system that everything is fine. It's not fine — the operation failed. The 500 is an honest signal. Without it, you won't be paged, your error budgets won't degrade, and the failure will go undetected until users start complaining or churning. The silence is the danger. You can't fix what you don't measure, and if you hide the signal, you've lost the ability to measure. The right pattern is to have a global handler that logs the full context and returns a 500 with an errorId — not to silence the 500. If the client really needs a 'soft error' pattern (like showing an error toast but not crashing), that's fine at the client level — but the API must still report the 500 so operations knows something broke. I've seen this exact mistake cause a six-week undetected payment processing failure at a previous company. The error rate dashboard showed 0% all along.
Q04 of 04 · SENIOR
Your service uses a database connection pool with a max of 20 connections. Under Black Friday load, you start seeing 500s. The DB itself is healthy. What's happening, what metrics confirm it, and what are your short-term and long-term fixes?
ANSWER
The database is healthy but the connection pool is saturated — all 20 connections are active, and new requests are queuing up waiting for a connection. The queue eventually times out, causing the request to fail. The metrics to confirm: pool active connections = 20, pool pending requests > 0, query latency spikes because queries are competing for connections, and the application logs show timeout errors or 'connection not available' exceptions. The short-term fix: kill idle connections immediately with SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' and increase pool max to 50 (or whatever number fits your database's max connections). Add connection timeout of 5 seconds so no request waits indefinitely. Long-term fix: add query timeouts (10 seconds) to prevent slow queries from holding connections, add a read replica for read-heavy workloads to reduce pool pressure on the primary, implement connection pool metrics monitoring with alert at 80% usage, and consider using a connection pooler like PgBouncer between your service and the database. Also add a health check that returns 503 when pool usage exceeds 95% so the load balancer stops routing to that instance before it starts returning 500s.
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
Why am I getting a 500 error when my code worked fine in development?
The most common reason is a missing environment variable or configuration value that exists on your local machine but was never set in the production environment. Check your application's startup logs for phrases like 'undefined is not a function', 'Cannot read properties of undefined', or 'ECONNREFUSED' — these almost always point to a config value that's present locally but missing in production. The second most common cause is a database migration that ran locally but never ran against your production database.
02
What's the difference between a 500 and a 503 error?
A 500 means your application code itself failed — unhandled exception, null reference, bad config. A 503 means your service is alive but can't serve traffic because something it depends on is down or overwhelmed. The practical rule: if the problem is in your code, it's a 500; if the problem is a database being down or a dependency being unavailable, return a 503 with a Retry-After header so clients and load balancers know the issue is transient.
03
How do I find what's causing a 500 error when the response just says 'Internal Server Error'?
Go directly to your server-side application logs — not the browser, not the network tab. Search for 'ERROR' or 'Exception' filtered to the timestamp when the 500 occurred. Your framework will have logged a full stack trace there. If you can't find logs, check that your logging is actually configured to write somewhere and that log level isn't set to WARN or higher, which would suppress ERROR output. As a last resort, temporarily enable verbose error responses in a staging environment to surface the stack trace.
04
Why do 500 errors suddenly appear under high load but never happen during normal traffic?
Almost always it's resource exhaustion — usually the database connection pool. Under low traffic, your pool of 10 connections handles requests fine. Under high load, all 10 connections are occupied, new requests queue up waiting, queries time out, sessions expire mid-request, and null references start appearing in code that worked perfectly at low scale. Check your DB pool metrics (active vs. max connections) and your thread pool metrics simultaneously. The fix is a combination of increasing pool size, adding explicit connection timeouts so hung connections release, and implementing a circuit breaker so you fail fast instead of queuing indefinitely.