HTTP 500 Internal Server Error: Causes, Debugging & Fixes
At 2:47am on Black Friday, I watched a payments service return nothing but 500s for eleven straight minutes because a single database connection pool hit its limit and nobody had set a timeout on the fallback. Eleven minutes. Six figures in lost revenue. The worst part? The fix was a one-line config change that had been flagged in a code review two weeks earlier and marked 'low priority.' The HTTP 500 is the most common, most misunderstood, and most preventable error in web development, and most teams are flying blind when it hits.
A 500 is the server's way of raising a white flag. It doesn't mean your network is broken. It doesn't mean the URL is wrong. It means the server got your request, tried to process it, and something inside its own code or infrastructure fell apart. That distinction matters enormously when you're debugging at speed under pressure. Half the time I see developers waste thirty minutes checking their frontend or their DNS when the actual problem is a null pointer in a backend service they forgot to restart after a config change.
By the end of this, you'll know exactly what causes a 500, how to read the signals it leaves behind, and how to fix the five most common production variants. You'll have a repeatable debugging process you can run in under ten minutes. And you'll know which monitoring you need in place before the next one hits β because there will be a next one.
What a 500 Actually Means Under the Hood
HTTP status codes are a conversation between a client (your browser, a mobile app, an API consumer) and a server. The 5xx range specifically means 'the server is the problem here, not you.' A 400 means you sent something bad. A 500 means the server tried to handle your request and something in its own territory exploded.
The HTTP spec defines 500 as a catch-all: 'The server encountered an unexpected condition that prevented it from fulfilling the request.' That word 'unexpected' is doing a lot of heavy lifting. It means the developer didn't anticipate this failure path. A well-designed server that intentionally rejects something sends a 400 or 409. A 500 is unplanned chaos.
Every 500 has three layers you need to understand. First, there's the HTTP response the client sees: just the status code and maybe a vague error page. Second, there's the application log on the server: this is where the actual stack trace or error message lives, and it's the only thing that matters for debugging. Third, there's the infrastructure layer (the database, the message queue, the third-party API), which may be the real root cause even if the application log points somewhere else. Skipping any of these three layers is how debugging turns into a three-hour mystery instead of a ten-minute fix.
// io.thecodeforge - System Design tutorial
// What actually happens during an HTTP 500: request/response lifecycle

// === CLIENT SIDE (what the browser or API consumer sees) ===
REQUEST:
POST /api/checkout/complete HTTP/1.1
Host: shop.example.com
Content-Type: application/json
Body: { "cart_id": "abc123", "payment_token": "tok_xyz" }

RESPONSE (what the client receives - almost useless for debugging):
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Body: { "error": "Something went wrong. Please try again." }

// Notice: the client gets ZERO useful information.
// This is intentional - leaking stack traces to clients is a security risk.
// The real information lives in the SERVER LOGS, not the response.

// === SERVER SIDE (what actually happened - where you debug) ===
[2024-11-29 02:47:13] ERROR CheckoutService - Unhandled exception during payment processing
java.lang.NullPointerException: Cannot invoke method getBalance() on null object reference
    at io.thecodeforge.checkout.PaymentProcessor.validateFunds(PaymentProcessor.java:112)
    at io.thecodeforge.checkout.CheckoutService.completeOrder(CheckoutService.java:87)
    at io.thecodeforge.checkout.CheckoutController.handleCheckout(CheckoutController.java:45)
Caused by: UserAccount object was null - user session expired mid-checkout

// === INFRASTRUCTURE LAYER (may be the real root cause) ===
[2024-11-29 02:47:13] WARN DatabasePool - Connection pool exhausted (max=10, active=10, pending=47)
// 47 requests waiting for a DB connection that never comes free.
// The NullPointerException above is a SYMPTOM.
// The DB pool exhaustion is the ROOT CAUSE.
// Fixing only the NPE would not fix the 500s - they'd keep coming.

// === THE THREE LAYERS - always check all three ===
// Layer 1: HTTP response  -> tells you a 500 happened
// Layer 2: App logs       -> tells you WHAT failed (stack trace)
// Layer 3: Infra metrics  -> tells you WHY it failed (root cause)
APP LOG SHOWS: NullPointerException at PaymentProcessor.java:112
INFRA SHOWS: DB connection pool exhausted, 47 requests queued
ROOT CAUSE: Pool maxed out -> DB queries hung -> sessions expired -> NPE on null user
FIX REQUIRED: Increase pool size + add connection timeout + add null guard on user session
The Five Real Causes Behind 95% of 500 Errors
Here's what nobody tells you: 500 errors come from a surprisingly small set of root causes. Once you've seen enough of them in production, you develop a mental checklist you run in sequence. These five cover the vast majority of everything you'll encounter.
The first is unhandled exceptions: code that throws an error and has no try/catch or error handler to intercept it. The runtime unwinds, nothing catches it, and the web framework slaps a 500 on the response. The second is database failures: connection timeouts, pool exhaustion, query errors, or the database simply being down. The third is misconfiguration: a missing environment variable, a wrong file path, a secret that didn't get deployed to production. I've seen entire services go 500 because someone forgot to set a DATABASE_URL environment variable after a cloud migration. Fourth is resource exhaustion: out of memory, out of disk space, out of file descriptors. The fifth is bad deployments: a syntax error in code that only manifests at runtime, a missing dependency, or a breaking schema change deployed out of order.
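The first category is worth seeing in miniature. The `dispatch` helper below is hypothetical, standing in for what Express or any web framework does internally: wrap every handler in a try/catch and turn anything uncaught into a 500.

```javascript
// Hypothetical mini-dispatcher: how a framework converts an uncaught
// exception into a 500. Handlers return a { status, body } object.
function dispatch(handler, request) {
  try {
    return handler(request);
  } catch (err) {
    // Nothing in the route caught this -> the framework's catch-all fires
    return { status: 500, body: { error: 'INTERNAL_SERVER_ERROR' } };
  }
}

// A handler with the classic cause-1 bug: a null object dereference
const brokenHandler = (request) => {
  const user = null; // e.g. session expired mid-request
  return { status: 200, body: { balance: user.balance } }; // throws TypeError
};

console.log(dispatch(brokenHandler, {}).status); // 500
```

The point: the 500 is not something your code returns deliberately; it is what the framework produces when your code gives it no other option.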
The reason this matters before you look at any code: each cause has a different debugging path and a different fix. Jumping straight to code before you know which category you're in is how you waste an hour.
// io.thecodeforge - System Design tutorial
// Decision tree: diagnosing which category of 500 you're dealing with
// Run these checks IN ORDER - each one narrows the field

==========================================================================
STEP 1 - Did this just start? Or has it always happened on this endpoint?
==========================================================================
Always happened on this endpoint:
  -> Likely: Unhandled exception OR misconfiguration
  -> Go to STEP 3
Just started after a deployment:
  -> Likely: Bad deployment (syntax error, missing env var, schema mismatch)
  -> IMMEDIATE ACTION: Check deploy logs and consider rollback
  -> Go to STEP 2
Started gradually under load:
  -> Likely: Resource exhaustion or DB connection pool saturation
  -> Go to STEP 4

==========================================================================
STEP 2 - Bad Deployment Checklist
==========================================================================
[ ] Check application startup logs - did the process even start cleanly?
    Red flag: "Error: Cannot find module './config/database'"
    Red flag: "SyntaxError: Unexpected token }" (runtime parse error)
[ ] Check environment variables are set in the NEW environment
    Red flag: process.env.DATABASE_URL is undefined
    Fix: Re-run your secrets injection / config sync before redeploying
[ ] Check for database schema mismatches
    Red flag: "column 'user_tier' does not exist" (code expects column, migration didn't run)
    Fix: Run pending migrations BEFORE deploying code that depends on them
[ ] If nothing obvious - ROLL BACK first, investigate second
    Rule: Production stability > root cause analysis. Rollback. Then debug.

==========================================================================
STEP 3 - Unhandled Exception / Misconfiguration Checklist
==========================================================================
[ ] Pull the server application log for the exact timestamp of the 500
    Look for: stack trace, exception class name, file + line number
[ ] Most common exception types that cause 500s:
    NullPointerException / TypeError -> object was null/undefined when you accessed it
    FileNotFoundException            -> config file path is wrong or file not deployed
    ClassNotFoundException           -> dependency jar/package missing in production
    OperationalError: no such table  -> database migration never ran
[ ] Search for the error message verbatim in your codebase
    This tells you exactly which line threw - and whether it has error handling

==========================================================================
STEP 4 - Resource Exhaustion Checklist
==========================================================================
[ ] Database connection pool
    Check: SELECT count(*) FROM pg_stat_activity;  (PostgreSQL)
    Red flag: active connections near or at max_connections limit
    Quick fix: Kill idle connections; longer fix: tune pool size + add timeouts
[ ] Memory
    Check: `free -h` (Linux) or your cloud provider's memory metric
    Red flag: available memory near zero, OOMKiller in system logs
    Fix: Increase instance size OR fix the memory leak (heap dump required)
[ ] Disk space
    Check: `df -h`
    Red flag: filesystem at 100% - logs often fill disks silently
    Quick fix: Clear old logs; permanent fix: log rotation + disk alerts
[ ] File descriptors
    Check: `ulimit -n` vs `lsof | wc -l`
    Red flag: open files near system limit
    Fix: Increase ulimit; check for connection/file handle leaks in code

==========================================================================
DECISION OUTPUT - what to do with your finding
==========================================================================
Bad Deployment      -> Rollback -> Fix -> Redeploy with proper migration order
Unhandled Exception -> Add try/catch -> return meaningful error response -> fix root cause
Misconfiguration    -> Set the missing config -> restart service -> add config validation at startup
Resource Exhaustion -> Immediate: scale or kill idle connections -> Long term: fix the leak
Expected output for each step:
Step 1 -> routes you to Step 2, 3, or 4 based on timing
Step 2 -> identifies deploy artifact or migration problem
Step 3 -> gives you exact file + line number of the exception
Step 4 -> surfaces the exhausted resource and its current vs. max value
Fixing 500s the Right Way: Code Patterns That Actually Hold Up
Knowing the cause is half the battle. The other half is fixing it in a way that doesn't just hide the 500 and create a worse problem downstream. The two most common bad fixes I've seen: swallowing exceptions silently (so the 500 goes away but the actual failure keeps happening undetected), and catching every exception at the top level and returning a 200 with an error body (which is arguably worse β now your monitoring thinks everything is fine).
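The second bad fix, the "lying 200", is worth making concrete. The handler shapes below are illustrative, not a real framework API; the point is purely the status code:

```javascript
// Anti-pattern: the "lying 200" - the body admits failure, but the status
// code (the only thing monitoring and load balancers see) claims success.
function lying200(err) {
  return { status: 200, body: { ok: false, error: err.message } }; // DON'T
}

// Honest version: status code matches reality, details stay server-side.
function honest500(err) {
  console.error(err.stack);               // full detail goes to server logs
  return { status: 500, body: { error: 'INTERNAL_SERVER_ERROR' } };
}

const failure = new Error('DB connection refused');
console.log(lying200(failure).status);  // 200 - monitoring sees success
console.log(honest500(failure).status); // 500 - alerting can actually fire
```

Both functions "handle" the error, but only the second one keeps your error rate metric, your alerting, and your load balancer honest.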
The right approach has three parts. First, catch specific, expected failures close to where they happen and handle them gracefully: redirect to a login page, return a meaningful 4xx, or retry the operation. Second, let unexpected exceptions bubble up to a single top-level error handler that logs the full stack trace, returns a proper 500, and triggers an alert. Third, add circuit breakers around external dependencies so that when a downstream service is sick, you fail fast instead of piling up 500s while threads wait for timeouts.
The following example shows all three patterns working together in a realistic e-commerce checkout service, the kind of code that actually needs to survive traffic spikes and flaky payment providers.
// io.thecodeforge - System Design tutorial
// Production error handling pattern for an e-commerce checkout service
// Framework: Express.js - patterns apply to any Node.js web framework

const express = require('express');
const app = express();
app.use(express.json()); // required so req.body is parsed for JSON requests

// ─────────────────────────────────────────────────────────────────
// CIRCUIT BREAKER - fail fast when a dependency is known to be down
// Without this: every request hangs for 30s waiting for a timeout,
// threads pile up, memory spikes, the whole service goes 500.
// ─────────────────────────────────────────────────────────────────
class CircuitBreaker {
  constructor(failureThreshold = 5, recoveryTimeoutMs = 30000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold; // open circuit after N consecutive failures
    this.state = 'CLOSED'; // CLOSED = normal, OPEN = failing fast, HALF_OPEN = testing recovery
    this.nextAttemptAt = null;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
  }

  async call(operationFn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptAt) {
        // Still in recovery window - reject immediately without calling the dependency
        throw new Error('CircuitBreaker:OPEN - dependency unavailable, failing fast');
      }
      // Recovery window expired - allow one probe request through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operationFn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err; // re-throw so the caller handles it - don't swallow
    }
  }

  _onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED'; // dependency is healthy again
  }

  _onFailure() {
    this.failureCount += 1;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      // Schedule the recovery probe - don't hammer a sick dependency
      this.nextAttemptAt = Date.now() + this.recoveryTimeoutMs;
    }
  }
}

// One circuit breaker per external dependency - never share them
const paymentGatewayBreaker = new CircuitBreaker(5, 30000);
const inventoryServiceBreaker = new CircuitBreaker(3, 15000);

// ─────────────────────────────────────────────────────────────────
// CHECKOUT ROUTE - specific error handling close to the source
// Each failure type gets its own response - no generic catch-all
// ─────────────────────────────────────────────────────────────────
app.post('/api/checkout/complete', async (req, res, next) => {
  const { cartId, paymentToken, userId } = req.body;

  // INPUT VALIDATION - catch bad requests before any business logic runs
  // These are 400s, not 500s - the client sent bad data, not our fault
  if (!cartId || !paymentToken || !userId) {
    return res.status(400).json({
      error: 'MISSING_REQUIRED_FIELDS',
      message: 'cartId, paymentToken, and userId are all required'
    });
  }

  try {
    // STEP 1: Check inventory via circuit-breaker-protected call
    const inventoryAvailable = await inventoryServiceBreaker.call(() =>
      checkInventoryAvailability(cartId)
    );
    if (!inventoryAvailable) {
      // This is an expected business failure - not a 500, it's a 409 Conflict
      return res.status(409).json({
        error: 'INVENTORY_CONFLICT',
        message: 'One or more items in your cart are no longer available'
      });
    }

    // STEP 2: Process payment via circuit-breaker-protected call
    // Note: calculateCartTotal is async - it must be awaited, not passed raw
    const cartTotal = await calculateCartTotal(cartId);
    const paymentResult = await paymentGatewayBreaker.call(() =>
      chargePaymentToken(paymentToken, cartTotal)
    );

    // STEP 3: Persist the order - DB errors here fall through to the outer catch
    const order = await persistOrder(userId, cartId, paymentResult.transactionId);

    return res.status(201).json({
      orderId: order.id,
      transactionId: paymentResult.transactionId,
      status: 'CONFIRMED'
    });
  } catch (err) {
    // SPECIFIC KNOWN ERRORS - handle gracefully without a 500
    if (err.message && err.message.includes('CircuitBreaker:OPEN')) {
      // Dependency is known-down - tell the client, don't pretend it's our fault
      return res.status(503).json({
        error: 'SERVICE_TEMPORARILY_UNAVAILABLE',
        message: 'Payment processing is temporarily unavailable. Please try again in 30 seconds.',
        retryAfterSeconds: 30
      });
    }
    if (err.code === 'PAYMENT_DECLINED') {
      // Payment gateway explicitly declined - this is a 402, client needs to act
      return res.status(402).json({
        error: 'PAYMENT_DECLINED',
        message: 'Your payment was declined. Please check your card details and try again.'
      });
    }
    // UNEXPECTED ERROR - pass to the global error handler via next()
    // DO NOT return a 500 here - let the central handler do it.
    // DO NOT log here - the central handler does that too.
    // This keeps logging consistent and prevents double-logging.
    next(err);
  }
});

// ─────────────────────────────────────────────────────────────────
// GLOBAL ERROR HANDLER - the last line of defence
// Express recognises this as an error handler because it has 4 params
// This runs for any error that reaches next(err) from any route
// ─────────────────────────────────────────────────────────────────
app.use((err, req, res, next) => {
  // Generate a unique ID so you can correlate the user's report with your logs
  const errorId = `ERR-${Date.now()}-${Math.random().toString(36).slice(2, 8).toUpperCase()}`;

  // ALWAYS log the full stack trace server-side - never swallow it
  // Include request context so you can reproduce the failure
  console.error({
    errorId,
    message: err.message,
    stack: err.stack,
    request: {
      method: req.method,
      url: req.url,
      userId: req.body?.userId,  // log who was affected
      cartId: req.body?.cartId,  // log what they were doing
      userAgent: req.headers['user-agent']
    },
    timestamp: new Date().toISOString()
  });

  // Trigger your alerting pipeline here (PagerDuty, Sentry, etc.)
  // notifyOnCallEngineer(err, errorId); <- wire this up in production

  // Return the error ID to the client - they can quote it in a support ticket
  // NEVER return the stack trace or internal error message to the client
  return res.status(500).json({
    error: 'INTERNAL_SERVER_ERROR',
    message: 'An unexpected error occurred. Please try again or contact support.',
    errorId // lets your support team look this up in logs instantly
  });
});

// Placeholder stubs - these would be real service calls in production
async function checkInventoryAvailability(cartId) { return true; }
async function chargePaymentToken(token, amount) { return { transactionId: 'txn_abc123' }; }
async function calculateCartTotal(cartId) { return 99.99; }
async function persistOrder(userId, cartId, txnId) { return { id: 'order_xyz789' }; }

app.listen(3000, () => console.log('Checkout service running on port 3000'));
=== Successful checkout ===
POST /api/checkout/complete -> HTTP 201
{ "orderId": "order_xyz789", "transactionId": "txn_abc123", "status": "CONFIRMED" }
=== Payment gateway down (circuit open after 5 failures) ===
POST /api/checkout/complete -> HTTP 503
{ "error": "SERVICE_TEMPORARILY_UNAVAILABLE", "message": "Payment processing is temporarily unavailable. Please try again in 30 seconds.", "retryAfterSeconds": 30 }
=== Unexpected database error (unhandled path) ===
Server log: { errorId: "ERR-1732845600000-K7X2MN", message: "Connection timeout after 5000ms", stack: "...", request: { userId: "usr_456", cartId: "cart_789" } }
POST /api/checkout/complete -> HTTP 500
{ "error": "INTERNAL_SERVER_ERROR", "message": "An unexpected error occurred.", "errorId": "ERR-1732845600000-K7X2MN" }
Monitoring and Prevention: Never Be Blindsided by a 500 Again
Fixing the current 500 is reactive. What separates seniors from juniors is what you put in place so the next one doesn't take you by surprise at 3am. There are four things that matter here: structured logging, error rate alerting, health checks, and startup validation.
Structured logging means your logs are JSON, not plain text. When you're grepping logs at 2am for a specific user's failed checkout, you want to filter by userId in one command, not read through thousands of lines of unformatted text. Every log line should have a timestamp, severity level, correlation ID, and the relevant business context.
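A minimal structured logger along these lines is a few lines of code (the field names here are illustrative, not a fixed schema):

```javascript
// Minimal structured logger sketch: every log line is one JSON object
// carrying timestamp, level, and business context, so it can be filtered
// with grep/jq instead of being read line by line at 2am.
function logEvent(level, message, context = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context, // e.g. { userId, cartId, errorId, durationMs }
  };
  console.log(JSON.stringify(entry)); // one JSON object per line
  return entry;                       // returned to make it testable
}

logEvent('ERROR', 'Checkout failed', { userId: 'usr_456', cartId: 'cart_789' });
```

In production you would feed these lines into a log aggregator; the only invariant that matters is one JSON object per line with consistent field names.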
Error rate alerting means you're monitoring the percentage of 5xx responses, not just whether the service is up. A service that's 'up' but returning 500 on 30% of requests is not 'up.' Set an alert threshold: anything above a 1% 5xx rate on a critical endpoint should page someone. And add startup-time config validation: if a required environment variable is missing, crash loudly at boot with a clear error message instead of returning 500s for hours until someone checks the logs.
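The rate-not-count logic behind that alert is simple enough to sketch; the 1% default below is the article's suggested threshold, not a universal constant:

```javascript
// Alert on error RATE, not raw count: the same 10 errors mean very
// different things at 10 req/min versus 100,000 req/min.
function shouldPage(errorCount, totalCount, thresholdPct = 1) {
  if (totalCount === 0) return false; // no traffic, nothing to rate
  return (errorCount / totalCount) * 100 > thresholdPct;
}

console.log(shouldPage(10, 10));     // true  - 100% error rate: page someone
console.log(shouldPage(10, 100000)); // false - 0.01% error rate: ignore
```

Your monitoring system evaluates this over a sliding window (e.g. 5 minutes) rather than per-request, but the decision rule is the same.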
// io.thecodeforge - System Design tutorial
// Production readiness checklist: preventing and catching 500s before users do

==========================================================================
TIER 1 - STARTUP VALIDATION (catch misconfigs before the service accepts traffic)
==========================================================================
At service boot, BEFORE binding to a port:

[ ] Validate all required environment variables exist and are non-empty
    Pattern: fail-fast with a clear message
    Example:
      const required = ['DATABASE_URL', 'PAYMENT_API_KEY', 'JWT_SECRET'];
      required.forEach(key => {
        if (!process.env[key]) {
          throw new Error(`STARTUP_FAILURE: Required environment variable '${key}' is not set.`);
          // Process exits. Load balancer sees the instance never became healthy.
          // No 500s served. Clean failure.
        }
      });
[ ] Test database connectivity at startup
    Pattern: ping the DB, confirm connection pool initialises successfully
    If DB is unreachable at startup: crash loudly, do not serve traffic
[ ] Verify critical config file paths exist
    Pattern: fs.accessSync(configPath) - throws if file missing, crashes cleanly

==========================================================================
TIER 2 - STRUCTURED LOGGING (make logs searchable when it matters most)
==========================================================================
Bad log (plain text - useless under pressure):
  [ERROR] Something failed during checkout for user abc at 2024-11-29 02:47:13

Good log (structured JSON - filterable in 10 seconds):
  {
    "timestamp": "2024-11-29T02:47:13.000Z",
    "level": "ERROR",
    "service": "checkout-service",
    "errorId": "ERR-1732845600000-K7X2MN",
    "userId": "usr_456",
    "cartId": "cart_789",
    "endpoint": "POST /api/checkout/complete",
    "errorClass": "NullPointerException",
    "message": "Cannot invoke getBalance() on null UserAccount",
    "durationMs": 234
  }

Why this matters:
  grep '"userId": "usr_456"' | jq '.errorId'
  Gets you the exact error ID in one command.
  Without structure: read every line manually.

==========================================================================
TIER 3 - ALERTING THRESHOLDS (know before your users do)
==========================================================================
Metric                         | Alert threshold        | Severity
───────────────────────────────────────────────────────────────────
5xx error rate (critical path) | > 1% over 5 min window | PAGE
5xx error rate (non-critical)  | > 5% over 5 min window | SLACK ALERT
DB connection pool usage       | > 80% of max           | SLACK ALERT
DB connection pool usage       | > 95% of max           | PAGE
Available memory               | < 20% of total         | SLACK ALERT
Disk usage                     | > 85% of total         | SLACK ALERT
Disk usage                     | > 95% of total         | PAGE
P99 response latency           | > 5x normal baseline   | SLACK ALERT

Key rule: alert on RATE, not raw count.
  10 errors in 1 minute during 10 req/min traffic      = 100% error rate. PAGE.
  10 errors in 1 minute during 100,000 req/min traffic = 0.01% error rate. Ignore.

==========================================================================
TIER 4 - HEALTH CHECK ENDPOINT (let your load balancer save you)
==========================================================================
GET /health - should check:
[ ] Database is reachable (run a lightweight SELECT 1 query)
[ ] Memory usage is below critical threshold
[ ] All required config is loaded
[ ] Any circuit breakers are not permanently open

Return 200 only when ALL checks pass.
Return 503 (not 500) when any dependency is unhealthy.

Your load balancer polls /health every 10-30 seconds. If it gets a non-200,
it stops routing traffic to that instance. This means a sick instance stops
serving 500s automatically - without anyone waking up at 3am to restart it manually.

Health check response time must be < 500ms. If your health check itself
times out, it causes cascading failures.
Startup validation (missing PAYMENT_API_KEY):
STARTUP_FAILURE: Required environment variable 'PAYMENT_API_KEY' is not set.
Process exited with code 1.
Load balancer: instance never marked healthy, no traffic routed.
Health check (all systems go):
GET /health β HTTP 200
{ "status": "healthy", "db": "connected", "memoryUsagePct": 42, "circuitBreakers": { "paymentGateway": "CLOSED", "inventoryService": "CLOSED" } }
Health check (DB unreachable):
GET /health β HTTP 503
{ "status": "unhealthy", "db": "unreachable", "error": "Connection timeout after 2000ms" }
Load balancer: stops routing to this instance within 30 seconds.
| Aspect | HTTP 500 Internal Server Error | HTTP 503 Service Unavailable |
|---|---|---|
| Fault owner | The server application code or config | Infrastructure or a downstream dependency |
| Typical cause | Unhandled exception, null reference, bad config | DB down, dependency timeout, circuit breaker open, overloaded |
| Is the service up? | Yes: the process is running but the code failed | Partially: the process is running but can't serve traffic healthily |
| Client should retry? | Not automatically: the same request usually fails the same way | Yes, with exponential backoff; the issue is usually transient |
| Correct Retry-After header? | Rarely appropriate | Always set it β tells clients when to try again |
| Root cause location | Application logs (stack trace) | Infrastructure metrics (connection pools, memory, external API status) |
| Fix usually requires | Code change or config correction | Scaling, dependency recovery, or circuit breaker reset |
| Load balancer behaviour | Instance stays in rotation and keeps serving 500s | Health check returns 503, so the instance is pulled from rotation automatically |
| Your monitoring alert fires on | Error rate > threshold on that endpoint | Health check failures or dependency latency spike |
| Example error message | NullPointerException at PaymentProcessor.java:112 | Connection pool exhausted: max=10, active=10, pending=47 |
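The retry semantics in the table can be expressed as a small client-side policy. This is a sketch; the 500ms backoff base is an arbitrary choice, and real clients should also cap the number of attempts:

```javascript
// Client retry policy implied by the table: never auto-retry a 500,
// retry a 503 with exponential backoff, honouring Retry-After when sent.
// Returns a delay in milliseconds, or null for "do not retry".
function retryDelayMs(status, attempt, retryAfterSeconds = null, baseMs = 500) {
  if (status !== 503) return null;                 // 500 etc.: same request would fail again
  if (retryAfterSeconds != null) return retryAfterSeconds * 1000; // server told us when
  return baseMs * 2 ** attempt;                    // 500ms, 1s, 2s, 4s, ...
}

console.log(retryDelayMs(500, 0));     // null  - do not auto-retry a 500
console.log(retryDelayMs(503, 0, 30)); // 30000 - Retry-After: 30 wins
console.log(retryDelayMs(503, 2));     // 2000  - exponential backoff, attempt 2
```

Adding jitter to the computed delay is a common refinement so that many clients retrying at once don't stampede the recovering service.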
🎯 Key Takeaways
- The stack trace in your app log tells you what failed; the infrastructure metrics tell you why. Always check both before you touch code.
- Swallowing exceptions to eliminate 500s from your monitoring is the most dangerous thing you can do. A lying 200 hides real failures for weeks. Your HTTP status codes are the only honest signal your monitoring has.
- Set connection timeouts on every external call your service makes: database, HTTP client, cache client, everything. A missing timeout is a loaded gun pointed at your thread pool. When that pool exhausts, every request returns a 500.
- A 500 that happens at startup is infinitely better than a 500 that happens in production traffic. Validate every required environment variable and config dependency before your service binds to a port: fail loudly at boot, not silently during requests.
❌ Common Mistakes to Avoid
- Mistake 1: Catching all exceptions in a top-level try/catch and returning HTTP 200 with an error flag in the body. Consequence: monitoring shows a 0% error rate while real failures pile up silently. Fix: always use the correct HTTP status code (500 for unexpected errors, 4xx for client errors, 503 for dependency failures) so your alerting and load balancer behave correctly.
- Mistake 2: Returning the raw stack trace or internal error message in the HTTP response body. Consequence: exposes internal file paths, library versions, and logic that attackers use for reconnaissance. Fix: log the full stack trace server-side, return only a sanitised message and a unique errorId to the client, and make sure production error handlers never include err.stack in the response.
- Mistake 3: Not setting connection timeouts on database or HTTP clients. Consequence: one slow query or one unavailable third-party API holds a thread forever, the thread pool exhausts in seconds under load, and every subsequent request gets a 500. Fix: always set explicit socket and connection timeouts (e.g., connectionTimeout: 3000, socketTimeout: 5000 in your DB client config) and wrap external calls in a circuit breaker.
- Mistake 4: Deploying application code that depends on a new database column before running the migration that creates it. Consequence: 100% of requests to that endpoint return 500 with 'column does not exist' until the migration runs. Fix: always run database migrations before deploying application code that depends on them, and add a startup check that verifies the schema version matches what the application expects.
- Mistake 5: Letting log files fill the disk because log rotation was never configured. Consequence: the server runs out of disk space and every write operation (including logging the 500 itself) fails, making the incident completely undebuggable. Fix: configure logrotate or your logging daemon to rotate and compress logs daily, set disk usage alerts at 85%, and use a centralised logging service (Datadog, CloudWatch, ELK) so logs survive even if the instance dies.
Interview Questions on This Topic
- Q: Your checkout endpoint is returning 500s for 40% of requests. Your health check is still returning 200. Your application logs show no exceptions. Where do you look first and why?
- Q: When would you return a 503 instead of a 500 from your API, and how does that decision affect your client's retry behaviour and your load balancer's routing?
- Q: You've added a global error handler that catches all unhandled exceptions and returns a 500. A junior dev asks: 'Why not just catch every exception and return a 200 with an error field, so that our error rate metric stays clean?' How do you respond?
- Q: Your service uses a database connection pool with a max of 20 connections. Under Black Friday load, you start seeing 500s. The DB itself is healthy. What's happening, what metrics confirm it, and what are your short-term and long-term fixes?
Frequently Asked Questions
Why am I getting a 500 error when my code worked fine in development?
The most common reason is a missing environment variable or configuration value that exists on your local machine but was never set in the production environment. Check your application's startup logs for phrases like 'undefined is not a function', 'Cannot read properties of undefined', or 'ECONNREFUSED'; these almost always point to a config value that's present locally but missing in production. The second most common cause is a database migration that ran locally but never ran against your production database.
What's the difference between a 500 and a 503 error?
A 500 means your application code itself failed β unhandled exception, null reference, bad config. A 503 means your service is alive but can't serve traffic because something it depends on is down or overwhelmed. The practical rule: if the problem is in your code, it's a 500; if the problem is a database being down or a dependency being unavailable, return a 503 with a Retry-After header so clients and load balancers know the issue is transient.
How do I find what's causing a 500 error when the response just says 'Internal Server Error'?
Go directly to your server-side application logs β not the browser, not the network tab. Search for 'ERROR' or 'Exception' filtered to the timestamp when the 500 occurred. Your framework will have logged a full stack trace there. If you can't find logs, check that your logging is actually configured to write somewhere and that log level isn't set to WARN or higher, which would suppress ERROR output. As a last resort, temporarily enable verbose error responses in a staging environment to surface the stack trace.
Why do 500 errors suddenly appear under high load but never happen during normal traffic?
Almost always it's resource exhaustion, usually the database connection pool. Under low traffic, your pool of 10 connections handles requests fine. Under high load, all 10 connections are occupied, new requests queue up waiting, queries time out, sessions expire mid-request, and null references start appearing in code that worked perfectly at low scale. Check your DB pool metrics (active vs. max connections) and your thread pool metrics simultaneously. The fix is a combination of increasing pool size, adding explicit connection timeouts so hung connections release, and implementing a circuit breaker so you fail fast instead of queuing indefinitely.
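The 'explicit timeout' part of that fix can be approximated at the call site with a stdlib-only wrapper. This is a sketch; real DB clients expose native timeout settings that are preferable because they also cancel the underlying socket work:

```javascript
// Bound any external call with an explicit timeout, so a hung dependency
// fails fast and releases its slot instead of holding a connection forever.
function withTimeout(promiseFactory, timeoutMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`TIMEOUT after ${timeoutMs}ms`)),
      timeoutMs
    );
    promiseFactory().then(
      value => { clearTimeout(timer); resolve(value); },
      err   => { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage sketch: a query that used to hang indefinitely now fails in 3s,
// turning a silent thread-pool drain into a visible, handleable error.
// withTimeout(() => db.query('SELECT ...'), 3000).catch(handleDbError);
```

Note the limitation: the rejected promise stops your code waiting, but the underlying operation may still be running. That is why native client-level timeouts (and a circuit breaker in front) remain the long-term fix.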
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.