HTTP 500 Internal Server Error: Causes, Debugging & Fixes
At 2:47am on Black Friday, I watched a payments service return nothing but 500s for eleven straight minutes because a single database connection pool hit its limit and nobody had set a timeout on the fallback. Eleven minutes. Six figures in lost revenue. The worst part? The fix was a one-line config change that had been flagged in a code review two weeks earlier and marked 'low priority.' The HTTP 500 is the most common, most misunderstood, and most preventable error in web development, and most teams are flying blind when it hits.
A 500 is the server's way of raising a white flag. It doesn't mean your network is broken. It doesn't mean the URL is wrong. It means the server got your request, tried to process it, and something inside its own code or infrastructure fell apart. That distinction matters enormously when you're debugging at speed under pressure. Half the time I see developers waste thirty minutes checking their frontend or their DNS when the actual problem is a null pointer in a backend service they forgot to restart after a config change.
By the end of this, you'll know exactly what causes a 500, how to read the signals it leaves behind, and how to fix the five most common production variants. You'll have a repeatable debugging process you can run in under ten minutes. And you'll know which monitoring you need in place before the next one hits β because there will be a next one.
What a 500 Actually Means Under the Hood
HTTP status codes are a conversation between a client (your browser, a mobile app, an API consumer) and a server. The 5xx range specifically means 'the server is the problem here, not you.' A 400 means you sent something bad. A 500 means the server tried to handle your request and something in its own territory exploded.
The HTTP spec defines 500 as a catch-all: 'The server encountered an unexpected condition that prevented it from fulfilling the request.' That word 'unexpected' is doing a lot of heavy lifting. It means the developer didn't anticipate this failure path. A well-designed server that intentionally rejects something sends a 400 or 409. A 500 is unplanned chaos.
Every 500 has three layers you need to understand. First, there's the HTTP response the client sees: just the status code and maybe a vague error page. Second, there's the application log on the server: this is where the actual stack trace or error message lives, and it's the only thing that matters for debugging. Third, there's the infrastructure layer (the database, the message queue, the third-party API), which may be the real root cause even if the application log points somewhere else. Skipping any of these three layers is how debugging turns into a three-hour mystery instead of a ten-minute fix.
// io.thecodeforge - System Design tutorial
// What actually happens during an HTTP 500: request/response lifecycle

// === CLIENT SIDE (what the browser or API consumer sees) ===
REQUEST:
POST /api/checkout/complete HTTP/1.1
Host: shop.example.com
Content-Type: application/json
Body: { "cart_id": "abc123", "payment_token": "tok_xyz" }

RESPONSE (what the client receives - almost useless for debugging):
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Body: { "error": "Something went wrong. Please try again." }

// Notice: the client gets ZERO useful information.
// This is intentional - leaking stack traces to clients is a security risk.
// The real information lives in the SERVER LOGS, not the response.

// === SERVER SIDE (what actually happened - where you debug) ===
[2024-11-29 02:47:13] ERROR CheckoutService - Unhandled exception during payment processing
java.lang.NullPointerException: Cannot invoke method getBalance() on null object reference
    at io.thecodeforge.checkout.PaymentProcessor.validateFunds(PaymentProcessor.java:112)
    at io.thecodeforge.checkout.CheckoutService.completeOrder(CheckoutService.java:87)
    at io.thecodeforge.checkout.CheckoutController.handleCheckout(CheckoutController.java:45)
Caused by: UserAccount object was null - user session expired mid-checkout

// === INFRASTRUCTURE LAYER (may be the real root cause) ===
[2024-11-29 02:47:13] WARN DatabasePool - Connection pool exhausted (max=10, active=10, pending=47)
// 47 requests waiting for a DB connection that never comes free.
// The NullPointerException above is a SYMPTOM.
// The DB pool exhaustion is the ROOT CAUSE.
// Fixing only the NPE would not fix the 500s - they'd keep coming.

// === THE THREE LAYERS - always check all three ===
// Layer 1: HTTP response  -> tells you a 500 happened
// Layer 2: App logs       -> tells you WHAT failed (stack trace)
// Layer 3: Infra metrics  -> tells you WHY it failed (root cause)
APP LOG SHOWS: NullPointerException at PaymentProcessor.java:112
INFRA SHOWS: DB connection pool exhausted, 47 requests queued
ROOT CAUSE: Pool maxed out -> DB queries hung -> sessions expired -> NPE on null user
FIX REQUIRED: Increase pool size + add connection timeout + add null guard on user session
The Five Real Causes Behind 95% of 500 Errors
Here's what nobody tells you: 500 errors come from a surprisingly small set of root causes. Once you've seen enough of them in production, you develop a mental checklist you run in sequence. These five cover the vast majority of everything you'll encounter.
The first is unhandled exceptions: code that throws an error and has no try/catch or error handler to intercept it. The runtime unwinds, nothing catches it, and the web framework slaps a 500 on the response. The second is database failures: connection timeouts, pool exhaustion, query errors, or the database simply being down. The third is misconfiguration: a missing environment variable, a wrong file path, a secret that didn't get deployed to production. I've seen entire services go 500 because someone forgot to set a DATABASE_URL environment variable after a cloud migration. Fourth is resource exhaustion: out of memory, out of disk space, out of file descriptors. The fifth is bad deployments: a syntax error in code that only manifests at runtime, a missing dependency, or a breaking schema change deployed out of order.
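The first category is worth seeing in miniature. The `dispatch` helper below is hypothetical, standing in for what Express or any web framework does internally: wrap every handler in a try/catch and turn anything uncaught into a 500.

```javascript
// Hypothetical mini-dispatcher: how a framework converts an uncaught
// exception into a 500. Handlers return a { status, body } object.
function dispatch(handler, request) {
  try {
    return handler(request);
  } catch (err) {
    // Nothing in the route caught this -> the framework's catch-all fires
    return { status: 500, body: { error: 'INTERNAL_SERVER_ERROR' } };
  }
}

// A handler with the classic cause-1 bug: a null object dereference
const brokenHandler = (request) => {
  const user = null; // e.g. session expired mid-request
  return { status: 200, body: { balance: user.balance } }; // throws TypeError
};

console.log(dispatch(brokenHandler, {}).status); // 500
```

The point: the 500 is not something your code returns deliberately; it is what the framework produces when your code gives it no other option.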
The reason this matters before you look at any code: each cause has a different debugging path and a different fix. Jumping straight to code before you know which category you're in is how you waste an hour.
// io.thecodeforge - System Design tutorial
// Decision tree: diagnosing which category of 500 you're dealing with
// Run these checks IN ORDER - each one narrows the field

==========================================================================
STEP 1 - Did this just start? Or has it always happened on this endpoint?
==========================================================================
Always happened on this endpoint:
  -> Likely: Unhandled exception OR misconfiguration
  -> Go to STEP 3
Just started after a deployment:
  -> Likely: Bad deployment (syntax error, missing env var, schema mismatch)
  -> IMMEDIATE ACTION: Check deploy logs and consider rollback
  -> Go to STEP 2
Started gradually under load:
  -> Likely: Resource exhaustion or DB connection pool saturation
  -> Go to STEP 4

==========================================================================
STEP 2 - Bad Deployment Checklist
==========================================================================
[ ] Check application startup logs - did the process even start cleanly?
    Red flag: "Error: Cannot find module './config/database'"
    Red flag: "SyntaxError: Unexpected token }" (runtime parse error)
[ ] Check environment variables are set in the NEW environment
    Red flag: process.env.DATABASE_URL is undefined
    Fix: Re-run your secrets injection / config sync before redeploying
[ ] Check for database schema mismatches
    Red flag: "column 'user_tier' does not exist" (code expects column, migration didn't run)
    Fix: Run pending migrations BEFORE deploying code that depends on them
[ ] If nothing obvious - ROLL BACK first, investigate second
    Rule: Production stability > root cause analysis. Rollback. Then debug.

==========================================================================
STEP 3 - Unhandled Exception / Misconfiguration Checklist
==========================================================================
[ ] Pull the server application log for the exact timestamp of the 500
    Look for: stack trace, exception class name, file + line number
[ ] Most common exception types that cause 500s:
    NullPointerException / TypeError -> object was null/undefined when you accessed it
    FileNotFoundException            -> config file path is wrong or file not deployed
    ClassNotFoundException           -> dependency jar/package missing in production
    OperationalError: no such table  -> database migration never ran
[ ] Search for the error message verbatim in your codebase
    This tells you exactly which line threw - and whether it has error handling

==========================================================================
STEP 4 - Resource Exhaustion Checklist
==========================================================================
[ ] Database connection pool
    Check: SELECT count(*) FROM pg_stat_activity;  (PostgreSQL)
    Red flag: active connections near or at max_connections limit
    Quick fix: Kill idle connections; longer fix: tune pool size + add timeouts
[ ] Memory
    Check: `free -h` (Linux) or your cloud provider's memory metric
    Red flag: available memory near zero, OOMKiller in system logs
    Fix: Increase instance size OR fix the memory leak (heap dump required)
[ ] Disk space
    Check: `df -h`
    Red flag: filesystem at 100% - logs often fill disks silently
    Quick fix: Clear old logs; permanent fix: log rotation + disk alerts
[ ] File descriptors
    Check: `ulimit -n` vs `lsof | wc -l`
    Red flag: open files near system limit
    Fix: Increase ulimit; check for connection/file handle leaks in code

==========================================================================
DECISION OUTPUT - what to do with your finding
==========================================================================
Bad Deployment      -> Rollback -> Fix -> Redeploy with proper migration order
Unhandled Exception -> Add try/catch -> return meaningful error response -> fix root cause
Misconfiguration    -> Set the missing config -> restart service -> add config validation at startup
Resource Exhaustion -> Immediate: scale or kill idle connections -> Long term: fix the leak
Expected output for each step:
Step 1 -> routes you to Step 2, 3, or 4 based on timing
Step 2 -> identifies deploy artifact or migration problem
Step 3 -> gives you exact file + line number of the exception
Step 4 -> surfaces the exhausted resource and its current vs. max value
Fixing 500s the Right Way: Code Patterns That Actually Hold Up
Knowing the cause is half the battle. The other half is fixing it in a way that doesn't just hide the 500 and create a worse problem downstream. The two most common bad fixes I've seen: swallowing exceptions silently (so the 500 goes away but the actual failure keeps happening undetected), and catching every exception at the top level and returning a 200 with an error body (which is arguably worse β now your monitoring thinks everything is fine).
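The second bad fix, the "lying 200", is worth making concrete. The handler shapes below are illustrative, not a real framework API; the point is purely the status code:

```javascript
// Anti-pattern: the "lying 200" - the body admits failure, but the status
// code (the only thing monitoring and load balancers see) claims success.
function lying200(err) {
  return { status: 200, body: { ok: false, error: err.message } }; // DON'T
}

// Honest version: status code matches reality, details stay server-side.
function honest500(err) {
  console.error(err.stack);               // full detail goes to server logs
  return { status: 500, body: { error: 'INTERNAL_SERVER_ERROR' } };
}

const failure = new Error('DB connection refused');
console.log(lying200(failure).status);  // 200 - monitoring sees success
console.log(honest500(failure).status); // 500 - alerting can actually fire
```

Both functions "handle" the error, but only the second one keeps your error rate metric, your alerting, and your load balancer honest.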
The right approach has three parts. First, catch specific, expected failures close to where they happen and handle them gracefully: redirect to a login page, return a meaningful 4xx, or retry the operation. Second, let unexpected exceptions bubble up to a single top-level error handler that logs the full stack trace, returns a proper 500, and triggers an alert. Third, add circuit breakers around external dependencies so that when a downstream service is sick, you fail fast instead of piling up 500s while threads wait for timeouts.
The following example shows all three patterns working together in a realistic e-commerce checkout service, the kind of code that actually needs to survive traffic spikes and flaky payment providers.
// io.thecodeforge - System Design tutorial
// Production error handling pattern for an e-commerce checkout service
// Framework: Express.js - patterns apply to any Node.js web framework

const express = require('express');
const app = express();
app.use(express.json()); // required so req.body is parsed for JSON requests

// ─────────────────────────────────────────────────────────────────
// CIRCUIT BREAKER - fail fast when a dependency is known to be down
// Without this: every request hangs for 30s waiting for a timeout,
// threads pile up, memory spikes, the whole service goes 500.
// ─────────────────────────────────────────────────────────────────
class CircuitBreaker {
  constructor(failureThreshold = 5, recoveryTimeoutMs = 30000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold; // open circuit after N consecutive failures
    this.state = 'CLOSED'; // CLOSED = normal, OPEN = failing fast, HALF_OPEN = testing recovery
    this.nextAttemptAt = null;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
  }

  async call(operationFn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptAt) {
        // Still in recovery window - reject immediately without calling the dependency
        throw new Error('CircuitBreaker:OPEN - dependency unavailable, failing fast');
      }
      // Recovery window expired - allow one probe request through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operationFn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err; // re-throw so the caller handles it - don't swallow
    }
  }

  _onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED'; // dependency is healthy again
  }

  _onFailure() {
    this.failureCount += 1;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      // Schedule the recovery probe - don't hammer a sick dependency
      this.nextAttemptAt = Date.now() + this.recoveryTimeoutMs;
    }
  }
}

// One circuit breaker per external dependency - never share them
const paymentGatewayBreaker = new CircuitBreaker(5, 30000);
const inventoryServiceBreaker = new CircuitBreaker(3, 15000);

// ─────────────────────────────────────────────────────────────────
// CHECKOUT ROUTE - specific error handling close to the source
// Each failure type gets its own response - no generic catch-all
// ─────────────────────────────────────────────────────────────────
app.post('/api/checkout/complete', async (req, res, next) => {
  const { cartId, paymentToken, userId } = req.body;

  // INPUT VALIDATION - catch bad requests before any business logic runs
  // These are 400s, not 500s - the client sent bad data, not our fault
  if (!cartId || !paymentToken || !userId) {
    return res.status(400).json({
      error: 'MISSING_REQUIRED_FIELDS',
      message: 'cartId, paymentToken, and userId are all required'
    });
  }

  try {
    // STEP 1: Check inventory via circuit-breaker-protected call
    const inventoryAvailable = await inventoryServiceBreaker.call(() =>
      checkInventoryAvailability(cartId)
    );
    if (!inventoryAvailable) {
      // This is an expected business failure - not a 500, it's a 409 Conflict
      return res.status(409).json({
        error: 'INVENTORY_CONFLICT',
        message: 'One or more items in your cart are no longer available'
      });
    }

    // STEP 2: Process payment via circuit-breaker-protected call
    // Note: calculateCartTotal is async - it must be awaited, not passed raw
    const cartTotal = await calculateCartTotal(cartId);
    const paymentResult = await paymentGatewayBreaker.call(() =>
      chargePaymentToken(paymentToken, cartTotal)
    );

    // STEP 3: Persist the order - DB errors here fall through to the outer catch
    const order = await persistOrder(userId, cartId, paymentResult.transactionId);

    return res.status(201).json({
      orderId: order.id,
      transactionId: paymentResult.transactionId,
      status: 'CONFIRMED'
    });
  } catch (err) {
    // SPECIFIC KNOWN ERRORS - handle gracefully without a 500
    if (err.message && err.message.includes('CircuitBreaker:OPEN')) {
      // Dependency is known-down - tell the client, don't pretend it's our fault
      return res.status(503).json({
        error: 'SERVICE_TEMPORARILY_UNAVAILABLE',
        message: 'Payment processing is temporarily unavailable. Please try again in 30 seconds.',
        retryAfterSeconds: 30
      });
    }
    if (err.code === 'PAYMENT_DECLINED') {
      // Payment gateway explicitly declined - this is a 402, client needs to act
      return res.status(402).json({
        error: 'PAYMENT_DECLINED',
        message: 'Your payment was declined. Please check your card details and try again.'
      });
    }
    // UNEXPECTED ERROR - pass to the global error handler via next()
    // DO NOT return a 500 here - let the central handler do it.
    // DO NOT log here - the central handler does that too.
    // This keeps logging consistent and prevents double-logging.
    next(err);
  }
});

// ─────────────────────────────────────────────────────────────────
// GLOBAL ERROR HANDLER - the last line of defence
// Express recognises this as an error handler because it has 4 params
// This runs for any error that reaches next(err) from any route
// ─────────────────────────────────────────────────────────────────
app.use((err, req, res, next) => {
  // Generate a unique ID so you can correlate the user's report with your logs
  const errorId = `ERR-${Date.now()}-${Math.random().toString(36).slice(2, 8).toUpperCase()}`;

  // ALWAYS log the full stack trace server-side - never swallow it
  // Include request context so you can reproduce the failure
  console.error({
    errorId,
    message: err.message,
    stack: err.stack,
    request: {
      method: req.method,
      url: req.url,
      userId: req.body?.userId,  // log who was affected
      cartId: req.body?.cartId,  // log what they were doing
      userAgent: req.headers['user-agent']
    },
    timestamp: new Date().toISOString()
  });

  // Trigger your alerting pipeline here (PagerDuty, Sentry, etc.)
  // notifyOnCallEngineer(err, errorId); <- wire this up in production

  // Return the error ID to the client - they can quote it in a support ticket
  // NEVER return the stack trace or internal error message to the client
  return res.status(500).json({
    error: 'INTERNAL_SERVER_ERROR',
    message: 'An unexpected error occurred. Please try again or contact support.',
    errorId // lets your support team look this up in logs instantly
  });
});

// Placeholder stubs - these would be real service calls in production
async function checkInventoryAvailability(cartId) { return true; }
async function chargePaymentToken(token, amount) { return { transactionId: 'txn_abc123' }; }
async function calculateCartTotal(cartId) { return 99.99; }
async function persistOrder(userId, cartId, txnId) { return { id: 'order_xyz789' }; }

app.listen(3000, () => console.log('Checkout service running on port 3000'));
=== Successful checkout ===
POST /api/checkout/complete -> HTTP 201
{ "orderId": "order_xyz789", "transactionId": "txn_abc123", "status": "CONFIRMED" }
=== Payment gateway down (circuit open after 5 failures) ===
POST /api/checkout/complete -> HTTP 503
{ "error": "SERVICE_TEMPORARILY_UNAVAILABLE", "message": "Payment processing is temporarily unavailable. Please try again in 30 seconds.", "retryAfterSeconds": 30 }
=== Unexpected database error (unhandled path) ===
Server log: { errorId: "ERR-1732845600000-K7X2MN", message: "Connection timeout after 5000ms", stack: "...", request: { userId: "usr_456", cartId: "cart_789" } }
POST /api/checkout/complete -> HTTP 500
{ "error": "INTERNAL_SERVER_ERROR", "message": "An unexpected error occurred.", "errorId": "ERR-1732845600000-K7X2MN" }
Monitoring and Prevention: Never Be Blindsided by a 500 Again
Fixing the current 500 is reactive. What separates seniors from juniors is what you put in place so the next one doesn't take you by surprise at 3am. There are four things that matter here: structured logging, error rate alerting, health checks, and startup validation.
Structured logging means your logs are JSON, not plain text. When you're grepping logs at 2am for a specific user's failed checkout, you want to filter by userId in one command, not read through thousands of lines of unformatted text. Every log line should have a timestamp, severity level, correlation ID, and the relevant business context.
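A minimal structured logger along these lines is a few lines of code (the field names here are illustrative, not a fixed schema):

```javascript
// Minimal structured logger sketch: every log line is one JSON object
// carrying timestamp, level, and business context, so it can be filtered
// with grep/jq instead of being read line by line at 2am.
function logEvent(level, message, context = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context, // e.g. { userId, cartId, errorId, durationMs }
  };
  console.log(JSON.stringify(entry)); // one JSON object per line
  return entry;                       // returned to make it testable
}

logEvent('ERROR', 'Checkout failed', { userId: 'usr_456', cartId: 'cart_789' });
```

In production you would feed these lines into a log aggregator; the only invariant that matters is one JSON object per line with consistent field names.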
Error rate alerting means you're monitoring the percentage of 5xx responses, not just whether the service is up. A service that's 'up' but returning 500 on 30% of requests is not 'up.' Set an alert threshold: anything above a 1% 5xx rate on a critical endpoint should page someone. And add startup-time config validation: if a required environment variable is missing, crash loudly at boot with a clear error message instead of returning 500s for hours until someone checks the logs.
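The rate-not-count logic behind that alert is simple enough to sketch; the 1% default below is the article's suggested threshold, not a universal constant:

```javascript
// Alert on error RATE, not raw count: the same 10 errors mean very
// different things at 10 req/min versus 100,000 req/min.
function shouldPage(errorCount, totalCount, thresholdPct = 1) {
  if (totalCount === 0) return false; // no traffic, nothing to rate
  return (errorCount / totalCount) * 100 > thresholdPct;
}

console.log(shouldPage(10, 10));     // true  - 100% error rate: page someone
console.log(shouldPage(10, 100000)); // false - 0.01% error rate: ignore
```

Your monitoring system evaluates this over a sliding window (e.g. 5 minutes) rather than per-request, but the decision rule is the same.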
// io.thecodeforge - System Design tutorial
// Production readiness checklist: preventing and catching 500s before users do

==========================================================================
TIER 1 - STARTUP VALIDATION (catch misconfigs before the service accepts traffic)
==========================================================================
At service boot, BEFORE binding to a port:

[ ] Validate all required environment variables exist and are non-empty
    Pattern: fail-fast with a clear message
    Example:
      const required = ['DATABASE_URL', 'PAYMENT_API_KEY', 'JWT_SECRET'];
      required.forEach(key => {
        if (!process.env[key]) {
          throw new Error(`STARTUP_FAILURE: Required environment variable '${key}' is not set.`);
          // Process exits. Load balancer sees the instance never became healthy.
          // No 500s served. Clean failure.
        }
      });
[ ] Test database connectivity at startup
    Pattern: ping the DB, confirm connection pool initialises successfully
    If DB is unreachable at startup: crash loudly, do not serve traffic
[ ] Verify critical config file paths exist
    Pattern: fs.accessSync(configPath) - throws if file missing, crashes cleanly

==========================================================================
TIER 2 - STRUCTURED LOGGING (make logs searchable when it matters most)
==========================================================================
Bad log (plain text - useless under pressure):
  [ERROR] Something failed during checkout for user abc at 2024-11-29 02:47:13

Good log (structured JSON - filterable in 10 seconds):
  {
    "timestamp": "2024-11-29T02:47:13.000Z",
    "level": "ERROR",
    "service": "checkout-service",
    "errorId": "ERR-1732845600000-K7X2MN",
    "userId": "usr_456",
    "cartId": "cart_789",
    "endpoint": "POST /api/checkout/complete",
    "errorClass": "NullPointerException",
    "message": "Cannot invoke getBalance() on null UserAccount",
    "durationMs": 234
  }

Why this matters:
  grep '"userId": "usr_456"' | jq '.errorId'
  Gets you the exact error ID in one command.
  Without structure: read every line manually.

==========================================================================
TIER 3 - ALERTING THRESHOLDS (know before your users do)
==========================================================================
Metric                         | Alert threshold        | Severity
───────────────────────────────────────────────────────────────────
5xx error rate (critical path) | > 1% over 5 min window | PAGE
5xx error rate (non-critical)  | > 5% over 5 min window | SLACK ALERT
DB connection pool usage       | > 80% of max           | SLACK ALERT
DB connection pool usage       | > 95% of max           | PAGE
Available memory               | < 20% of total         | SLACK ALERT
Disk usage                     | > 85% of total         | SLACK ALERT
Disk usage                     | > 95% of total         | PAGE
P99 response latency           | > 5x normal baseline   | SLACK ALERT

Key rule: alert on RATE, not raw count.
  10 errors in 1 minute during 10 req/min traffic      = 100% error rate. PAGE.
  10 errors in 1 minute during 100,000 req/min traffic = 0.01% error rate. Ignore.

==========================================================================
TIER 4 - HEALTH CHECK ENDPOINT (let your load balancer save you)
==========================================================================
GET /health - should check:
[ ] Database is reachable (run a lightweight SELECT 1 query)
[ ] Memory usage is below critical threshold
[ ] All required config is loaded
[ ] Any circuit breakers are not permanently open

Return 200 only when ALL checks pass.
Return 503 (not 500) when any dependency is unhealthy.

Your load balancer polls /health every 10-30 seconds. If it gets a non-200,
it stops routing traffic to that instance. This means a sick instance stops
serving 500s automatically - without anyone waking up at 3am to restart it manually.

Health check response time must be < 500ms. If your health check itself
times out, it causes cascading failures.
Startup validation (missing PAYMENT_API_KEY):
STARTUP_FAILURE: Required environment variable 'PAYMENT_API_KEY' is not set.
Process exited with code 1.
Load balancer: instance never marked healthy, no traffic routed.
Health check (all systems go):
GET /health β HTTP 200
{ "status": "healthy", "db": "connected", "memoryUsagePct": 42, "circuitBreakers": { "paymentGateway": "CLOSED", "inventoryService": "CLOSED" } }
Health check (DB unreachable):
GET /health β HTTP 503
{ "status": "unhealthy", "db": "unreachable", "error": "Connection timeout after 2000ms" }
Load balancer: stops routing to this instance within 30 seconds.
| Aspect | HTTP 500 Internal Server Error | HTTP 503 Service Unavailable |
|---|---|---|
| Fault owner | The server application code or config | Infrastructure or a downstream dependency |
| Typical cause | Unhandled exception, null reference, bad config | DB down, dependency timeout, circuit breaker open, overloaded |
| Is the service up? | Yes: the process is running but the code failed | Partially: the process is running but can't serve traffic healthily |
| Client should retry? | Not automatically: the same request usually fails the same way | Yes, with exponential backoff; the issue is usually transient |
| Correct Retry-After header? | Rarely appropriate | Always set it β tells clients when to try again |
| Root cause location | Application logs (stack trace) | Infrastructure metrics (connection pools, memory, external API status) |
| Fix usually requires | Code change or config correction | Scaling, dependency recovery, or circuit breaker reset |
| Load balancer behaviour | Instance stays in rotation and keeps serving 500s | Health check returns 503, so the instance is pulled from rotation automatically |
| Your monitoring alert fires on | Error rate > threshold on that endpoint | Health check failures or dependency latency spike |
| Example error message | NullPointerException at PaymentProcessor.java:112 | Connection pool exhausted: max=10, active=10, pending=47 |
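The retry semantics in the table can be expressed as a small client-side policy. This is a sketch; the 500ms backoff base is an arbitrary choice, and real clients should also cap the number of attempts:

```javascript
// Client retry policy implied by the table: never auto-retry a 500,
// retry a 503 with exponential backoff, honouring Retry-After when sent.
// Returns a delay in milliseconds, or null for "do not retry".
function retryDelayMs(status, attempt, retryAfterSeconds = null, baseMs = 500) {
  if (status !== 503) return null;                 // 500 etc.: same request would fail again
  if (retryAfterSeconds != null) return retryAfterSeconds * 1000; // server told us when
  return baseMs * 2 ** attempt;                    // 500ms, 1s, 2s, 4s, ...
}

console.log(retryDelayMs(500, 0));     // null  - do not auto-retry a 500
console.log(retryDelayMs(503, 0, 30)); // 30000 - Retry-After: 30 wins
console.log(retryDelayMs(503, 2));     // 2000  - exponential backoff, attempt 2
```

Adding jitter to the computed delay is a common refinement so that many clients retrying at once don't stampede the recovering service.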
🎯 Key Takeaways
- The stack trace in your app log tells you what failed; the infrastructure metrics tell you why. Always check both before you touch code.
- Swallowing exceptions to eliminate 500s from your monitoring is the most dangerous thing you can do. A lying 200 hides real failures for weeks. Your HTTP status codes are the only honest signal your monitoring has.
- Set connection timeouts on every external call your service makes: database, HTTP client, cache client, everything. A missing timeout is a loaded gun pointed at your thread pool. When that pool exhausts, every request returns a 500.
- A 500 that happens at startup is infinitely better than a 500 that happens in production traffic. Validate every required environment variable and config dependency before your service binds to a port: fail loudly at boot, not silently during requests.
❌ Common Mistakes to Avoid
- Mistake 1: Catching all exceptions in a top-level try/catch and returning HTTP 200 with an error flag in the body. Consequence: monitoring shows a 0% error rate while real failures pile up silently. Fix: always use the correct HTTP status code (500 for unexpected errors, 4xx for client errors, 503 for dependency failures) so your alerting and load balancer behave correctly.
- Mistake 2: Returning the raw stack trace or internal error message in the HTTP response body. Consequence: exposes internal file paths, library versions, and logic that attackers use for reconnaissance. Fix: log the full stack trace server-side, return only a sanitised message and a unique errorId to the client, and make sure production error handlers never include err.stack in the response.
- Mistake 3: Not setting connection timeouts on database or HTTP clients. Consequence: one slow query or one unavailable third-party API holds a thread forever, the thread pool exhausts in seconds under load, and every subsequent request gets a 500. Fix: always set explicit socket and connection timeouts (e.g., connectionTimeout: 3000, socketTimeout: 5000 in your DB client config) and wrap external calls in a circuit breaker.
- Mistake 4: Deploying application code that depends on a new database column before running the migration that creates it. Consequence: 100% of requests to that endpoint return 500 with 'column does not exist' until the migration runs. Fix: always run database migrations before deploying application code that depends on them, and add a startup check that verifies the schema version matches what the application expects.
- Mistake 5: Letting log files fill the disk because log rotation was never configured. Consequence: the server runs out of disk space and every write operation (including logging the 500 itself) fails, making the incident completely undebuggable. Fix: configure logrotate or your logging daemon to rotate and compress logs daily, set disk usage alerts at 85%, and use a centralised logging service (Datadog, CloudWatch, ELK) so logs survive even if the instance dies.
Interview Questions on This Topic
- Q: Your checkout endpoint is returning 500s for 40% of requests. Your health check is still returning 200. Your application logs show no exceptions. Where do you look first and why?
- Q: When would you return a 503 instead of a 500 from your API, and how does that decision affect your client's retry behaviour and your load balancer's routing?
- Q: You've added a global error handler that catches all unhandled exceptions and returns a 500. A junior dev asks: 'Why not just catch every exception and return a 200 with an error field, so that our error rate metric stays clean?' How do you respond?
- Q: Your service uses a database connection pool with a max of 20 connections. Under Black Friday load, you start seeing 500s. The DB itself is healthy. What's happening, what metrics confirm it, and what are your short-term and long-term fixes?
Frequently Asked Questions
Why am I getting a 500 error when my code worked fine in development?
The most common reason is a missing environment variable or configuration value that exists on your local machine but was never set in the production environment. Check your application's startup logs for phrases like 'undefined is not a function', 'Cannot read properties of undefined', or 'ECONNREFUSED'; these almost always point to a config value that's present locally but missing in production. The second most common cause is a database migration that ran locally but never ran against your production database.
What's the difference between a 500 and a 503 error?
A 500 means your application code itself failed β unhandled exception, null reference, bad config. A 503 means your service is alive but can't serve traffic because something it depends on is down or overwhelmed. The practical rule: if the problem is in your code, it's a 500; if the problem is a database being down or a dependency being unavailable, return a 503 with a Retry-After header so clients and load balancers know the issue is transient.
How do I find what's causing a 500 error when the response just says 'Internal Server Error'?
Go directly to your server-side application logs β not the browser, not the network tab. Search for 'ERROR' or 'Exception' filtered to the timestamp when the 500 occurred. Your framework will have logged a full stack trace there. If you can't find logs, check that your logging is actually configured to write somewhere and that log level isn't set to WARN or higher, which would suppress ERROR output. As a last resort, temporarily enable verbose error responses in a staging environment to surface the stack trace.
Why do 500 errors suddenly appear under high load but never happen during normal traffic?
Almost always it's resource exhaustion, usually the database connection pool. Under low traffic, your pool of 10 connections handles requests fine. Under high load, all 10 connections are occupied, new requests queue up waiting, queries time out, sessions expire mid-request, and null references start appearing in code that worked perfectly at low scale. Check your DB pool metrics (active vs. max connections) and your thread pool metrics simultaneously. The fix is a combination of increasing pool size, adding explicit connection timeouts so hung connections release, and implementing a circuit breaker so you fail fast instead of queuing indefinitely.
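The 'explicit timeout' part of that fix can be approximated at the call site with a stdlib-only wrapper. This is a sketch; real DB clients expose native timeout settings that are preferable because they also cancel the underlying socket work:

```javascript
// Bound any external call with an explicit timeout, so a hung dependency
// fails fast and releases its slot instead of holding a connection forever.
function withTimeout(promiseFactory, timeoutMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`TIMEOUT after ${timeoutMs}ms`)),
      timeoutMs
    );
    promiseFactory().then(
      value => { clearTimeout(timer); resolve(value); },
      err   => { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage sketch: a query that used to hang indefinitely now fails in 3s,
// turning a silent thread-pool drain into a visible, handleable error.
// withTimeout(() => db.query('SELECT ...'), 3000).catch(handleDbError);
```

Note the limitation: the rejected promise stops your code waiting, but the underlying operation may still be running. That is why native client-level timeouts (and a circuit breaker in front) remain the long-term fix.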
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.