HTTP 500 means the server code failed — not the client request.
The real detail is server-side, not in the response: app logs hold the stack trace (the symptom), infra metrics hold the root cause.
Three layers to check: HTTP response (500), app logs (what failed), infrastructure (why it failed).
80% of 500s come from 5 causes: unhandled exceptions, DB failures, misconfig, resource exhaustion, bad deploys.
Production trap: the stack trace often points at a symptom; the real cause is upstream (pool exhaustion, timeout).
Plain-English First
Imagine you walk into a restaurant, hand the waiter your order, and he disappears into the kitchen — then comes back five minutes later and just says 'something went wrong in there.' He can't tell you what. The chef burned something, dropped something, ran out of gas — who knows. An HTTP 500 is exactly that: the server received your request just fine, understood it, tried to do something with it, and then something inside blew up. The server's embarrassed, it's not your fault as the customer, and the only way to find out what actually happened is to go into the kitchen and look at the mess yourself.
At 2:47am on a Black Friday, I watched a payments service return nothing but 500s for eleven straight minutes because a single database connection pool hit its limit and nobody had set a timeout on the fallback. Eleven minutes. Six figures in lost revenue. The worst part? The fix was a one-line config change that had been flagged in a code review two weeks earlier and marked 'low priority.' The HTTP 500 is the most common, most misunderstood, and most preventable error in web development — and most teams are flying blind when it hits.
A 500 is the server's way of raising a white flag. It doesn't mean your network is broken. It doesn't mean the URL is wrong. It means the server got your request, tried to process it, and something inside its own code or infrastructure fell apart. That distinction matters enormously when you're debugging at speed under pressure. Half the time I see developers waste thirty minutes checking their frontend or their DNS when the actual problem is a null pointer in a backend service they forgot to restart after a config change.
By the end of this, you'll know exactly what causes a 500, how to read the signals it leaves behind, and how to fix the five most common production variants. You'll have a repeatable debugging process you can run in under ten minutes. And you'll know which monitoring you need in place before the next one hits — because there will be a next one.
What a 500 Actually Means Under the Hood
HTTP status codes are a conversation between a client (your browser, a mobile app, an API consumer) and a server. The 5xx range specifically means 'the server is the problem here, not you.' A 400 means you sent something bad. A 500 means the server tried to handle your request and something in its own territory exploded.
The HTTP spec defines 500 as a catch-all: 'The server encountered an unexpected condition that prevented it from fulfilling the request.' That word 'unexpected' is doing a lot of heavy lifting. It means the developer didn't anticipate this failure path. A well-designed server that intentionally rejects something sends a 400 or 409. A 500 is unplanned chaos.
Every 500 has three layers you need to understand. First, there's the HTTP response the client sees — just the status code and maybe a vague error page. Second, there's the application log on the server — this is where the actual stack trace or error message lives, and it's the only thing that matters for debugging. Third, there's the infrastructure layer — the database, the message queue, the third-party API — which may be the real root cause even if the application log points somewhere else. Skipping any of these three layers is how debugging turns into a three-hour mystery instead of a ten-minute fix.
HTTP500ResponseFlow.systemdesign (PLAINTEXT)
// io.thecodeforge — System Design tutorial
// What actually happens during an HTTP 500 — request/response lifecycle
// === CLIENT SIDE (what the browser or API consumer sees) ===
REQUEST:
POST /api/checkout/complete HTTP/1.1
Host: shop.example.com
Content-Type: application/json
Body: { "cart_id": "abc123", "payment_token": "tok_xyz" }
RESPONSE (what the client receives — almost useless for debugging):
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Body: { "error": "Something went wrong. Please try again." }
// Notice: the client gets ZERO useful information.
// This is intentional — leaking stack traces to clients is a security risk.
// The real information lives in the SERVER LOGS, not the response.
// === SERVER SIDE (what actually happened — where you debug) ===
[2024-11-29 02:47:13] ERROR CheckoutService - Unhandled exception during payment processing
java.lang.NullPointerException: Cannot invoke method getBalance() on null object reference
    at io.thecodeforge.checkout.PaymentProcessor.validateFunds(PaymentProcessor.java:112)
    at io.thecodeforge.checkout.CheckoutService.completeOrder(CheckoutService.java:87)
    at io.thecodeforge.checkout.CheckoutController.handleCheckout(CheckoutController.java:45)
Caused by: UserAccount object was null — user session expired mid-checkout
// === INFRASTRUCTURE LAYER (may be the real root cause) ===
[2024-11-29 02:47:13] WARN DatabasePool - Connection pool exhausted (max=10, active=10, pending=47)
// 47 requests waiting for a DB connection that never comes free.
// The NullPointerException above is a SYMPTOM.
// The DB pool exhaustion is the ROOT CAUSE.
// Fixing only the NPE would not fix the 500s — they'd keep coming.
// === THE THREE LAYERS — always check all three ===
// Layer 1: HTTP response → tells you a 500 happened
// Layer 2: App logs → tells you WHAT failed (stack trace)
// Layer 3: Infra metrics → tells you WHY it failed (root cause)
APP LOG SHOWS: NullPointerException at PaymentProcessor.java:112
INFRA SHOWS: DB connection pool exhausted — 47 requests queued
ROOT CAUSE: Pool maxed out → DB queries hung → sessions expired → NPE on null user
FIX REQUIRED: Increase pool size + add connection timeout + add null guard on user session
Production Trap: The Misleading Stack Trace
The exception in your app log is often a symptom, not the root cause. I've seen teams spend two hours 'fixing' a NullPointerException that kept coming back — because the real problem was a saturated thread pool upstream that was killing DB connections before queries could complete. Always check your infrastructure metrics (DB pool, memory, thread count) before you trust the stack trace as the final word.
Production Insight
Stack traces show what broke — not why it broke.
Infra metrics (pool usage, memory, threads) expose the real cause.
Rule: never fix a 500 based on the stack trace alone. Check infra first.
Key Takeaway
The 500 response tells you nothing. The app log tells you what. The infra metrics tell you why.
Always check all three layers before changing a single line of code.
Symptom != root cause — that stack trace is a distraction until you confirm infrastructure health.
The Five Real Causes Behind 95% of 500 Errors
Here's what nobody tells you: 500 errors come from a surprisingly small set of root causes. Once you've seen enough of them in production, you develop a mental checklist you run in sequence. These five cover the vast majority of everything you'll encounter.
The first is unhandled exceptions — code that throws an error and has no try/catch or error handler to intercept it. The runtime unwinds, nothing catches it, and the web framework slaps a 500 on the response. The second is database failures — connection timeouts, pool exhaustion, query errors, or the database simply being down. The third is misconfiguration — a missing environment variable, a wrong file path, a secret that didn't get deployed to production. I've seen entire services go 500 because someone forgot to set a DATABASE_URL environment variable after a cloud migration. Fourth is resource exhaustion — out of memory, out of disk space, out of file descriptors. The fifth is bad deployments — a syntax error in code that only manifests at runtime, a missing dependency, or a breaking schema change deployed out of order.
The reason this matters before you look at any code: each cause has a different debugging path and a different fix. Jumping straight to code before you know which category you're in is how you waste an hour.
HTTP500CausesDiagnosticTree.systemdesign (PLAINTEXT)
// io.thecodeforge — System Design tutorial
// Decision tree: diagnosing which category of 500 you're dealing with
// Run these checks IN ORDER — each one narrows the field
==========================================================================
STEP 1 — Did this just start? Or has it always happened on this endpoint?
==========================================================================
Always happened on this endpoint:
→ Likely: Unhandled exception OR misconfiguration
→ Go to STEP 3
Just started after a deployment:
→ Likely: Bad deployment (syntax error, missing env var, schema mismatch)
→ IMMEDIATE ACTION: Check deploy logs and consider rollback
→ Go to STEP 2
Started gradually under load:
→ Likely: Resource exhaustion or DB connection pool saturation
→ Go to STEP 4
==========================================================================
STEP 2 — Bad Deployment Checklist
==========================================================================
[ ] Check application startup logs — did the process even start cleanly?
Red flag: "Error: Cannot find module './config/database'"Red flag: "SyntaxError: Unexpected token }" (runtime parse error)
[ ] Check environment variables are set in the NEW environment
Red flag: process.env.DATABASE_URL is undefined
Fix: Re-run your secrets injection / config sync before redeploying
[ ] Check for database schema mismatches
Red flag: "column 'user_tier' does not exist" (code expects column, migration didn't run)
Fix: Run pending migrations BEFORE deploying code that depends on them
[ ] If nothing obvious — ROLLBACK first, investigate second
Rule: Production stability > root cause analysis. Rollback. Then debug.
==========================================================================
STEP 3 — Unhandled Exception / Misconfiguration Checklist
==========================================================================
[ ] Pull the server application log for the exact timestamp of the 500
Look for: stack trace, exception class name, file + line number
[ ] Most common exception types that cause 500s:
NullPointerException / TypeError → object was null/undefined when you accessed it
FileNotFoundException → config file path is wrong or file not deployed
ClassNotFoundException → dependency jar/package missing in production
OperationalError: no such table → database migration never ran
[ ] Search for the error message verbatim in your codebase
This tells you exactly which line threw — and whether it has error handling
==========================================================================
STEP 4 — Resource Exhaustion Checklist
==========================================================================
[ ] Database connection pool
Check: SELECT count(*) FROM pg_stat_activity; (PostgreSQL)
Red flag: active connections near or at max_connections limit
Quick fix: Kill idle connections; longer fix: tune pool size + add timeouts
[ ] Memory
Check: `free -h` (Linux) or your cloud provider's memory metric
Red flag: available memory near zero, OOM killer in system logs
Fix: Increase instance size OR fix the memory leak (heap dump required)
[ ] Disk space
Check: `df -h`
Red flag: filesystem at 100% — logs often fill disks silently
Quick fix: Clear old logs; permanent fix: log rotation + disk alerts
[ ] File descriptors
Check: `ulimit -n` vs `lsof | wc -l`
Red flag: open files near system limit
Fix: Increase ulimit; check for connection/file handle leaks in code
==========================================================================
DECISION OUTPUT — what to do with your finding
==========================================================================
Bad Deployment → Rollback → Fix → Redeploy with proper migration order
Unhandled Exception → Add try/catch → return meaningful error response → fix root cause
Misconfiguration → Set the missing config → restart service → add config validation at startup
Resource Exhaustion → Immediate: scale or kill idle connections → Long term: fix the leak
Output
Diagnostic result depends on your environment — this is a decision tree, not runnable code.
Expected output for each step:
Step 1 → routes you to Step 2, 3, or 4 based on timing
Step 2 → identifies deploy artifact or migration problem
Step 3 → gives you exact file + line number of the exception
Step 4 → surfaces the exhausted resource and its current vs. max value
Senior Shortcut: The 5-Minute 500 Triage
When a 500 alert fires, run these four commands before touching any code: (1) check when it started relative to the last deploy, (2) grep your app logs for 'ERROR' or 'Exception' at that timestamp, (3) check your DB connection pool metrics, (4) run 'df -h' and 'free -h'. In 80% of cases, one of these four gives you the answer before you've opened a single source file.
Production Insight
The five causes each have a distinct fingerprint.
Unhandled exceptions show a stack trace; DB failures show pool metrics; misconfig shows startup errors; resource exhaustion shows system metrics; bad deploy shows timing correlation.
Rule: classify before you debug — the wrong fix wastes time and often causes collateral damage.
Key Takeaway
Jumping to code without classifying the cause is the #1 time-waster.
Use the timing of the 500 to narrow it down: always happening? just deployed? under load?
Each cause has a distinct debugging path — pick the right one and you're 80% done.
Fixing 500s the Right Way: Code Patterns That Actually Hold Up
Knowing the cause is half the battle. The other half is fixing it in a way that doesn't just hide the 500 and create a worse problem downstream. The two most common bad fixes I've seen: swallowing exceptions silently (so the 500 goes away but the actual failure keeps happening undetected), and catching every exception at the top level and returning a 200 with an error body (which is arguably worse — now your monitoring thinks everything is fine).
The right approach has three parts. First: catch specific, expected failures close to where they happen and handle them gracefully — redirect to a login page, return a meaningful 4xx, retry the operation. Second: let unexpected exceptions bubble up to a single top-level error handler that logs the full stack trace, returns a proper 500, and triggers an alert. Third: add circuit breakers around external dependencies so that when a downstream service is sick, you fail fast instead of piling up 500s while threads wait for timeouts.
The following example shows all three patterns working together in a realistic e-commerce checkout service — the kind of code that actually needs to survive traffic spikes and flaky payment providers.
CheckoutService.errorhandling.js (JAVASCRIPT)
// io.thecodeforge — System Design tutorial
// Production error handling pattern for an e-commerce checkout service
// Framework: Express.js — patterns apply to any Node.js web framework
const express = require('express');
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated below

// ─────────────────────────────────────────────────────────────────
// CIRCUIT BREAKER — fail fast when a dependency is known to be down
// Without this: every request hangs for 30s waiting for a timeout,
// threads pile up, memory spikes, the whole service goes 500.
// ─────────────────────────────────────────────────────────────────
class CircuitBreaker {
  constructor(failureThreshold = 5, recoveryTimeoutMs = 30000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold; // open circuit after 5 consecutive failures
    this.state = 'CLOSED'; // CLOSED = normal, OPEN = failing fast, HALF_OPEN = testing recovery
    this.nextAttemptAt = null;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
  }

  async call(operationFn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptAt) {
        // Still in recovery window — reject immediately without calling the dependency
        throw new Error('CircuitBreaker:OPEN — dependency unavailable, failing fast');
      }
      // Recovery window expired — allow one probe request through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operationFn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err; // re-throw so the caller handles it — don't swallow
    }
  }

  _onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED'; // dependency is healthy again
  }

  _onFailure() {
    this.failureCount += 1;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      // Schedule the recovery probe — don't hammer a sick dependency
      this.nextAttemptAt = Date.now() + this.recoveryTimeoutMs;
    }
  }
}

// One circuit breaker per external dependency — never share them
const paymentGatewayBreaker = new CircuitBreaker(5, 30000);
const inventoryServiceBreaker = new CircuitBreaker(3, 15000);

// ─────────────────────────────────────────────────────────────────
// CHECKOUT ROUTE — specific error handling close to the source
// Each failure type gets its own response — no generic catch-all
// ─────────────────────────────────────────────────────────────────
app.post('/api/checkout/complete', async (req, res, next) => {
  const { cartId, paymentToken, userId } = req.body;

  // INPUT VALIDATION — catch bad requests before any business logic runs
  // These are 400s, not 500s — the client sent bad data, not our fault
  if (!cartId || !paymentToken || !userId) {
    return res.status(400).json({
      error: 'MISSING_REQUIRED_FIELDS',
      message: 'cartId, paymentToken, and userId are all required'
    });
  }

  try {
    // STEP 1: Check inventory via circuit-breaker-protected call
    const inventoryAvailable = await inventoryServiceBreaker.call(() =>
      checkInventoryAvailability(cartId)
    );
    if (!inventoryAvailable) {
      // This is an expected business failure — not a 500, it's a 409 Conflict
      return res.status(409).json({
        error: 'INVENTORY_CONFLICT',
        message: 'One or more items in your cart are no longer available'
      });
    }

    // STEP 2: Process payment via circuit-breaker-protected call
    const paymentResult = await paymentGatewayBreaker.call(async () =>
      chargePaymentToken(paymentToken, await calculateCartTotal(cartId))
    );

    // STEP 3: Persist the order — wrap in try/catch for DB-specific errors
    const order = await persistOrder(userId, cartId, paymentResult.transactionId);
    return res.status(201).json({
      orderId: order.id,
      transactionId: paymentResult.transactionId,
      status: 'CONFIRMED'
    });
  } catch (err) {
    // SPECIFIC KNOWN ERRORS — handle gracefully without a 500
    if (err.message.includes('CircuitBreaker:OPEN')) {
      // Dependency is known-down — tell the client, don't pretend it's our fault
      return res.status(503).json({
        error: 'SERVICE_TEMPORARILY_UNAVAILABLE',
        message: 'Payment processing is temporarily unavailable. Please try again in 30 seconds.',
        retryAfterSeconds: 30
      });
    }
    if (err.code === 'PAYMENT_DECLINED') {
      // Payment gateway explicitly declined — this is a 402, client needs to act
      return res.status(402).json({
        error: 'PAYMENT_DECLINED',
        message: 'Your payment was declined. Please check your card details and try again.'
      });
    }
    // UNEXPECTED ERROR — pass to the global error handler via next()
    // DO NOT return a 500 here — let the central handler do it.
    // DO NOT log here — the central handler does that too.
    // This keeps logging consistent and prevents double-logging.
    next(err);
  }
});

// ─────────────────────────────────────────────────────────────────
// GLOBAL ERROR HANDLER — the last line of defence
// Express recognises this as an error handler because it has 4 params
// This runs for any error that reaches next(err) from any route
// ─────────────────────────────────────────────────────────────────
app.use((err, req, res, next) => {
  // Generate a unique ID so you can correlate the user's report with your logs
  const errorId = `ERR-${Date.now()}-${Math.random().toString(36).substr(2, 6).toUpperCase()}`;

  // ALWAYS log the full stack trace server-side — never swallow it
  // Include request context so you can reproduce the failure
  console.error({
    errorId,
    message: err.message,
    stack: err.stack,
    request: {
      method: req.method,
      url: req.url,
      userId: req.body?.userId, // log who was affected
      cartId: req.body?.cartId, // log what they were doing
      userAgent: req.headers['user-agent']
    },
    timestamp: new Date().toISOString()
  });

  // Trigger your alerting pipeline here (PagerDuty, Sentry, etc.)
  // notifyOnCallEngineer(err, errorId); ← wire this up in production

  // Return the error ID to the client — they can quote it in a support ticket
  // NEVER return the stack trace or internal error message to the client
  return res.status(500).json({
    error: 'INTERNAL_SERVER_ERROR',
    message: 'An unexpected error occurred. Please try again or contact support.',
    errorId // lets your support team look this up in logs instantly
  });
});

// Placeholder stubs — these would be real service calls in production
async function checkInventoryAvailability(cartId) { return true; }
async function chargePaymentToken(token, amount) { return { transactionId: 'txn_abc123' }; }
async function calculateCartTotal(cartId) { return 99.99; }
async function persistOrder(userId, cartId, txnId) { return { id: 'order_xyz789' }; }

app.listen(3000, () => console.log('Checkout service running on port 3000'));
Never Do This: Swallowing Exceptions to Kill the 500
I've reviewed codebases where someone wrapped an entire route in try/catch and returned res.status(200).json({ success: false }) for every error — because 'the client was complaining about 500s.' The 500s disappeared from monitoring. The underlying failures kept happening. Nobody knew for six weeks. Your monitoring is only as honest as your HTTP status codes — a lying 200 is worse than an honest 500.
Circuit breakers prevent cascading 500s by failing fast when a dependency is sick.
Rule: let unexpected exceptions propagate to a central handler that logs, alerts, and returns a proper 500. Never catch-all to 200.
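To make the contrast concrete, here is a minimal sketch of the dishonest pattern next to the honest one. Express is assumed, and createOrder plus the VALIDATION_FAILED error code are hypothetical placeholders rather than part of the checkout service above.
DishonestVsHonest.errorhandling.js (JAVASCRIPT)
const express = require('express');
const app = express();
app.use(express.json());
// Hypothetical stub standing in for a real service call
async function createOrder(payload) { return { id: 'order_123', ...payload }; }

// ANTI-PATTERN: the 500 disappears from monitoring, the failure does not
app.post('/api/orders-dishonest', async (req, res) => {
  try {
    const order = await createOrder(req.body);
    res.status(200).json({ success: true, order });
  } catch (err) {
    // Monitoring sees a 200, alerting stays silent, the failure stays invisible for weeks
    res.status(200).json({ success: false });
  }
});

// HONEST PATTERN: expected failures get a 4xx, everything else reaches the central handler
app.post('/api/orders', async (req, res, next) => {
  try {
    const order = await createOrder(req.body);
    res.status(201).json({ order });
  } catch (err) {
    if (err.code === 'VALIDATION_FAILED') {
      // Expected business failure: the client can fix this, so it is a 422, not a 500
      return res.status(422).json({ error: 'VALIDATION_FAILED', message: err.message });
    }
    next(err); // the global handler logs, alerts, and returns an honest 500
  }
});

app.listen(3001);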
Key Takeaway
Honest HTTP status codes are your monitoring's only source of truth.
A 200 with an error flag is a lie that delays detection by weeks.
Careful error handling: catch expected failures early, let unexpected ones bubble to a central handler that acts.
Debugging 500s in Production: A Step-by-Step Process That Always Works
When a 500 alert fires at 3am, you don't have the luxury of browsing through documentation. You need a repeatable process that works every time. Here's the process I've used across five production outages — it's never failed me.
Step 1: Determine the blast radius. Is this affecting one user, one endpoint, or the whole service? Check your error rate dashboard first, not the logs. If it's the whole service, start with infrastructure checks (disk, memory, pool). If it's one endpoint, focus on that endpoint's logs and any recent changes.
Step 2: Check the deployment timeline. Did a deploy happen in the last hour? If yes, roll back before investigating. Production stability comes first. If no deploy, move to the next step.
Step 3: Read the logs — but read them with intent. Don't scroll aimlessly. grep for 'ERROR' or 'Exception' at the timestamp of the first 500. Look for the first occurrence of a new error pattern. The first error is often the root cause; subsequent ones are cascade failures.
Step 4: Check infrastructure metrics simultaneously. Open three terminal windows — one for logs tailing, one for 'free -h' and 'df -h', one for DB pool status. Cross-reference what you see. If logs show a connection timeout and the DB pool shows 100% active, you've found the cause.
Step 5: Reproduce locally if possible. If the error only happens under specific conditions, try to simulate them in a staging environment. If you can't reproduce, add structured logging around the failing code path and wait for the next occurrence. Yes, sometimes you have to let it happen again with more instrumentation — and that's okay if you've reduced the blast radius.
This process takes 10 minutes. Most of your time will be spent on false trails — logs that point to a symptom, not the cause. The key is staying disciplined and not jumping to conclusions.
The 3-Window Debugging Setup
Open three terminals or split panes: (1) 'kubectl logs -f <pod> --tail=100' for a live log tail, (2) 'watch -n 5 "free -h; df -h"' for real-time resource metrics, and (3) a DB pool poll such as 'kubectl exec <pod> -- psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"', re-run every few seconds (wrap it in watch or a shell loop). Cross-reference in real time — when you see a log spike, check which metric changed at the same instant.
Production Insight
Most debugging time is wasted on false trails caused by cascade failures.
The first error in the logs is often the real cause — later errors are just downstream effects.
Rule: never chase a stack trace that appears after a resource exhaustion error. Fix the exhaustion first.
Key Takeaway
A disciplined 10-minute process beats an hour of frantic log scrolling.
Check blast radius, deployment timeline, first error timestamp, and infrastructure metrics — in that order.
Cross-reference logs and metrics in real time. The correlation tells you the story, not either one alone.
Monitoring and Prevention: Never Be Blind-sided by a 500 Again
Fixing the current 500 is reactive. What separates seniors from juniors is what you put in place so the next one doesn't take you by surprise at 3am. There are four things that matter here: structured logging, error rate alerting, health checks, and startup validation.
Structured logging means your logs are JSON, not plain text. When you're grepping logs at 2am for a specific user's failed checkout, you want to filter by userId in one command — not read through thousands of lines of unformatted text. Every log line should have a timestamp, severity level, correlation ID, and the relevant business context.
Error rate alerting means you're monitoring the percentage of 5xx responses, not just whether the service is up. A service that's 'up' but returning 500 on 30% of requests is not 'up.' Set an alert threshold — anything above 1% 5xx rate on a critical endpoint should page someone. And add startup-time config validation: if a required environment variable is missing, crash loudly at boot with a clear error message instead of returning 500s for hours until someone checks the logs.
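To make "alert on rate, not raw count" concrete, here is a minimal sketch of a sliding-window 5xx-rate tracker. In production this calculation usually lives in your metrics backend (Prometheus, Datadog, CloudWatch) rather than in the service itself, and the names and thresholds below are illustrative only.
ErrorRateTracker.sketch.js (JAVASCRIPT)
const WINDOW_MS = 5 * 60 * 1000;   // 5-minute sliding window
const CRITICAL_5XX_RATE = 0.01;    // 1% threshold for a critical path
const MIN_SAMPLES = 100;           // don't alert on a handful of requests
const samples = [];                // one entry per completed request

function recordResponse(statusCode) {
  const now = Date.now();
  samples.push({ ts: now, is5xx: statusCode >= 500 });
  // Drop samples that have fallen out of the window
  while (samples.length && samples[0].ts < now - WINDOW_MS) samples.shift();

  const errors = samples.reduce((n, s) => n + (s.is5xx ? 1 : 0), 0);
  const rate = errors / samples.length;
  if (samples.length >= MIN_SAMPLES && rate > CRITICAL_5XX_RATE) {
    // Placeholder: wire this to your real paging hook instead of console.error
    console.error(`ALERT: 5xx rate ${(rate * 100).toFixed(2)}% (${errors}/${samples.length}) over the last 5 minutes`);
  }
}

// Express wiring: record the status code of every completed response
// app.use((req, res, next) => { res.on('finish', () => recordResponse(res.statusCode)); next(); });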
HTTP500PreventionChecklist.systemdesign (PLAINTEXT)
// io.thecodeforge — System Design tutorial
// Production readiness checklist: preventing and catching 500s before users do
==========================================================================
TIER 1 — STARTUP VALIDATION (catch misconfigs before the service accepts traffic)
==========================================================================
At service boot, BEFORE binding to a port:
[ ] Validate all required environment variables exist and are non-empty
Pattern: fail-fast with a clear message
Example:
const required = ['DATABASE_URL', 'PAYMENT_API_KEY', 'JWT_SECRET'];
required.forEach(key => {
if (!process.env[key]) {
throw new Error(`STARTUP_FAILURE: Required environment variable '${key}' is not set.`);
// Process exits. Load balancer sees the instance never became healthy.
// No 500s served. Clean failure.
}
});
[ ] Test database connectivity at startup
Pattern: ping the DB, confirm connection pool initialises successfully
If DB is unreachable at startup: crash loudly, do not serve traffic
[ ] Verify critical config file paths exist
Pattern: fs.accessSync(configPath) — throws if file missing, crashes cleanly
==========================================================================
TIER 2 — STRUCTURED LOGGING (make logs searchable when it matters most)
==========================================================================
Bad log (plain text — useless under pressure):
[ERROR] Something failed during checkout for user abc at 2024-11-29 02:47:13
Good log (structured JSON — filterable in 10 seconds):
{
"timestamp": "2024-11-29T02:47:13.000Z",
"level": "ERROR",
"service": "checkout-service",
"errorId": "ERR-1732845600000-K7X2MN",
"userId": "usr_456",
"cartId": "cart_789",
"endpoint": "POST /api/checkout/complete",
"errorClass": "NullPointerException",
"message": "Cannot invoke getBalance() on null UserAccount",
"durationMs": 234
}
Why this matters: grep '"userId": "usr_456"' | jq '.errorId'
Gets you the exact error ID in one command. Without structure: read every line manually.
==========================================================================
TIER 3 — ALERTING THRESHOLDS (know before your users do)
==========================================================================
Metric | Alert threshold | Severity
─────────────────────────────────────────────────────────────────────
5xx error rate (critical path) | > 1% over 5 min window | PAGE
5xx error rate (non-critical) | > 5% over 5 min window | SLACK ALERT
DB connection pool usage | > 80% of max | SLACK ALERT
DB connection pool usage | > 95% of max | PAGE
Available memory | < 20% of total | SLACK ALERT
Disk usage | > 85% of total | SLACK ALERT
Disk usage | > 95% of total | PAGE
P99 response latency | > 5x normal baseline | SLACK ALERT
Key rule: alert on RATE, not raw count.
10 errors in 1 minute during 10 req/min traffic = 100% error rate. PAGE.
10 errors in 1 minute during 100,000 req/min traffic = 0.01% error rate. Ignore.
==========================================================================
TIER 4 — HEALTH CHECK ENDPOINT (let your load balancer save you)
==========================================================================
GET /health → should check:
[ ] Database is reachable (run a lightweight SELECT 1 query)
[ ] Memory usage is below critical threshold
[ ] All required config is loaded
[ ] Any circuit breakers are not permanently open
Return 200 only when ALL checks pass.
Return 503 (not 500) when any dependency is unhealthy.
Your load balancer polls /health every 10-30 seconds.
If it gets a non-200, it stops routing traffic to that instance.
This means a sick instance stops serving 500s automatically —
without anyone waking up at 3am to restart it manually.
Health check response time must be < 500ms.
If your health check itself times out, it causes cascading failures.
Output
Startup failure (missing env var):
STARTUP_FAILURE: Required environment variable 'PAYMENT_API_KEY' is not set.
Process exited with code 1.
Load balancer: instance never marked healthy, no traffic routed.
Runtime health check failure: load balancer stops routing to the instance within 30 seconds.
Interview Gold: Health Check vs Liveness Check
Interviewers love this distinction. A liveness check answers 'Is the process alive?' — even a totally broken service passes this. A readiness/health check answers 'Is this instance ready to serve production traffic?' — it checks DB connectivity, dependency health, and memory. Kubernetes uses both: liveness probes restart dead processes, readiness probes control load balancer routing. Conflating them causes incidents where a degraded instance stays in the load balancer rotation returning 500s because the liveness check is passing.
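Here is what that split can look like in Express, as a rough sketch: the dependency checks (pingDatabase, the breaker reference, the memory threshold) are placeholders, and the real probes depend on your stack and orchestrator.
LivenessVsReadiness.sketch.js (JAVASCRIPT)
const express = require('express');
const app = express();

// Stand-ins for real dependencies; replace with your pool ping and your actual breaker instance
async function pingDatabase() { return true; } // e.g. a SELECT 1 with a short timeout
const paymentGatewayBreaker = { state: 'CLOSED' };

// Liveness: "is the process alive?" Never touches dependencies.
// An orchestrator such as Kubernetes restarts the pod when this fails.
app.get('/livez', (req, res) => res.status(200).send('OK'));

// Readiness: "can this instance serve production traffic right now?"
// The load balancer pulls the instance from rotation on any non-200.
app.get('/readyz', async (req, res) => {
  const mem = process.memoryUsage();
  const checks = {
    database: await pingDatabase().catch(() => false),
    memoryOk: mem.heapUsed < 0.9 * mem.heapTotal,
    breakerClosed: paymentGatewayBreaker.state !== 'OPEN'
  };
  const healthy = Object.values(checks).every(Boolean);
  // 503, not 500: the instance is degraded, the request itself did nothing wrong
  res.status(healthy ? 200 : 503).json(checks);
});

app.listen(3000);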
Production Insight
Startup validation catches misconfigs before they hurt users.
Error rate alerting on 5xx rate > 1% beats paging on process down.
Health checks must verify dependencies — a 200 from a sick instance is a lie.
Rule: crash loudly at boot for missing config, not silently during requests.
Key Takeaway
Prevention beats reaction: validate config at startup, log structurally, alert on error rate, and health-check dependencies.
A health check that returns 200 when the service is degraded is worse than no health check — it hides the failure.
The best 500 is the one that never happens because the instance never entered production.
● Production incident · Post-mortem · Severity: high
Black Friday Payment Meltdown: Connection Pool Exhaustion Without Timeouts
Symptom
All checkout requests returned HTTP 500 with various NullPointerExceptions and timeout errors. Health check still returned 200. Application logs showed intermittent DB query failures.
Assumption
The team assumed a database crash or network issue. They spent 20 minutes checking network connectivity and restarting the database before looking at connection pool metrics.
Root cause
The database connection pool was configured with max=10 connections and no connection timeout. Under normal load, 10 connections were enough. During Black Friday, 47 requests queued up waiting for a connection that never came free because each query took 30+ seconds due to a missing index. All 10 connections were occupied, new requests timed out after 120 seconds (default), and the application threw NullPointerException when the session expired mid-request.
Fix
1) Kill idle connections immediately: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle'; 2) Increase pool max to 50. 3) Add connection timeout of 5 seconds. 4) Add query timeout of 10 seconds. 5) Add health check that verifies pool health and returns 503 instead of 200 when pool usage exceeds 80%. The fix was deployed in 5 minutes after identifying the root cause.
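As a rough sketch of the configuration side of that fix, here is what the pool settings can look like with node-postgres. The driver, the exact option names, and the sample query are assumptions; check your own driver's documentation for the equivalents.
PoolTimeouts.sketch.js (JAVASCRIPT)
const { Pool } = require('pg'); // node-postgres assumed; other drivers expose similar knobs
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 50,                        // pool ceiling raised from 10 to 50, per the fix above
  connectionTimeoutMillis: 5000,  // stop waiting for a free connection after 5 seconds
  idleTimeoutMillis: 30000,       // release idle connections instead of hoarding them
  statement_timeout: 10000        // server-side query timeout: no query holds a connection past 10s
});

// With these settings a saturated pool produces fast, explicit errors
// instead of requests hanging until sessions expire and NPEs appear.
async function getOrderCount(userId) {
  const { rows } = await pool.query('SELECT count(*) FROM orders WHERE user_id = $1', [userId]);
  return Number(rows[0].count);
}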
Key lesson
Connection timeouts are not optional — they're the difference between a degraded service and a dead one.
A health check that returns 200 while the service can't serve requests is worse than no health check.
Stack traces lie. The NullPointerException was a symptom of the real cause: pool exhaustion. Always check infra metrics before trusting the first exception you see.
Production debug guide: run these checks in order — each one narrows the field by 50%.
Symptom · 01
500s started immediately after a deployment
→
Fix
Rollback first — production stability over RCA. Then check deploy diff: missing env var? Schema migration not run? Syntax error in new code? Use kubectl rollout undo or swap to previous version.
Symptom · 02
500s appear gradually under increasing load
→
Fix
Check DB connection pool usage, thread pool size, memory, disk space. Run df -h and free -h on the server. Look for OOM killer logs. The 500s are a symptom of resource exhaustion.
Symptom · 03
500s on specific endpoints only
→
Fix
Grep app logs for that endpoint's stack trace. Check if the endpoint calls an external API that might be down (circuit breaker pattern). Check recent schema changes that might affect that specific query.
Symptom · 04
500s with no stack trace in logs
→
Fix
Verify log level is set to ERROR or DEBUG. Check if the error handler is swallowing exceptions. Check for thread pool shutdown errors (e.g., RejectedExecutionException). Increase log verbosity temporarily.
Symptom · 05
500s that disappear after restart
→
Fix
Likely memory leak or connection leak. Run for a while after restart, then check memory usage and open connections. Use heap dump analysis for memory leaks (jmap, Eclipse MAT). Check for unclosed database connections.
★ Quick 500 Debug Cheat Sheet: go-to commands for the five most common 500 root causes. Run these before opening any code file.
Application feels slow, 500s pile up under load
Immediate action
Check database connection pool usage immediately
Commands
docker compose logs | grep -i "connection pool"
SELECT count(*) FROM pg_stat_activity;
Fix now
Kill idle connections and increase pool size with timeouts
500s with 'OutOfMemoryError' in logs
Immediate action
Check system memory and heap usage
Commands
free -h
jstat -gcutil <pid> 1000 5
Fix now
Restart with increased heap or fix the leak (heap dump + analysis)
Disk full — logs show 'No space left on device'
Immediate action
Check disk usage
Commands
df -h /app
du -sh /var/log/* | sort -rh | head -5
Fix now
Remove old logs and set up log rotation (logrotate)
500s with 'Connection refused' to an external service
Immediate action
Check if the downstream service is up
Commands
curl -I http://downstream-service/health
kubectl get pods -l app=downstream-service
Fix now
Restart downstream service or remove it from load balancer rotation
HTTP 500 vs HTTP 503: Know the Difference
Aspect | HTTP 500 Internal Server Error | HTTP 503 Service Unavailable
Fault owner | The server application code or config | Infrastructure or a downstream dependency
Typical cause | Unhandled exception, null reference, bad config | DB down, dependency timeout, circuit breaker open, overloaded
Is the service up? | Yes — process is running but code failed | Partially — process running but can't serve traffic healthily
Client should retry? | Not automatically — same request usually fails the same way | Yes — with exponential backoff; the issue is usually transient
Correct Retry-After header? | Rarely appropriate | Always set it — tells clients when to try again
Root cause location | Application logs — stack trace | Infrastructure metrics — connection pools, memory, external API status
Fix usually requires | Code change or config correction | Scaling, dependency recovery, or circuit breaker reset
Load balancer behaviour | Instance stays in rotation — keeps serving 500s | Health check returns 503 — instance pulled from rotation automatically
Your monitoring alert fires on | Error rate > threshold on that endpoint | Health check failures or dependency latency spike
Example error message | NullPointerException at PaymentProcessor.java:112 | Connection pool exhausted: max=10, active=10, pending=47
Key takeaways
1
The stack trace in your app log tells you what failed; the infrastructure metrics tell you why. Always check both before you touch code.
2
Swallowing exceptions to eliminate 500s from your monitoring is the most dangerous thing you can do. A lying 200 hides real failures for weeks. Your HTTP status codes are the only honest signal your monitoring has.
3
Set connection timeouts on every external call your service makes: database, HTTP client, cache client, everything. A missing timeout is a loaded gun pointed at your thread pool. When that pool exhausts, every request returns a 500 (see the sketch after this list).
4
A 500 that happens at startup is infinitely better than a 500 that happens in production traffic. Validate every required environment variable and config dependency before your service binds to a port; fail loudly at boot, not silently during requests.
5
Classify the cause before you debug: always happening? just deployed? under load? Each category has a different fix path. Jumping to code without classification wastes 80% of your debugging time.
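For the timeout rule in takeaway 3, here is a minimal sketch of putting an explicit deadline on an outbound HTTP call. It assumes Node 18+ (global fetch and AbortSignal.timeout); the URL and values are illustrative.
FetchWithTimeout.sketch.js (JAVASCRIPT)
// Every outbound call gets an explicit deadline so a slow dependency cannot pin a request forever
async function fetchJsonWithTimeout(url, options = {}, timeoutMs = 5000) {
  // AbortSignal.timeout aborts the underlying request once the deadline passes
  const response = await fetch(url, { ...options, signal: AbortSignal.timeout(timeoutMs) });
  if (!response.ok) {
    throw new Error(`Upstream returned ${response.status} for ${url}`);
  }
  return response.json();
}

// Usage sketch: a 5-second ceiling on a hypothetical inventory-service call
// const stock = await fetchJsonWithTimeout('http://inventory-service/api/stock/abc123', {}, 5000);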
Common mistakes to avoid
×
Catching all exceptions and returning HTTP 200 with an error flag
Symptom
Monitoring shows 0% error rate while real failures pile up silently. Users see a 'success' response but the action didn't complete.
Fix
Always use correct HTTP status codes: 500 for unexpected errors, 4xx for client errors, 503 for dependency failures. Your alerting and load balancer depend on honest status codes.
×
Returning the raw stack trace or internal error message in the HTTP response body
Symptom
Exposes internal file paths, library versions, and logic that attackers use for reconnaissance. Compliance failures (PII leak).
Fix
Log the full stack trace server-side. Return only a sanitised message and a unique errorId to the client. Never include err.stack in the response.
×
Not setting connection timeouts on database or HTTP clients
Symptom
One slow external call holds a thread forever. Under load, the pool exhausts in seconds and every subsequent request gets a 500.
Fix
Always set explicit connection and socket timeouts (e.g., connectionTimeout: 3000, socketTimeout: 5000). Wrap external calls in a circuit breaker with a timeout.
×
Deploying code that depends on a new database column before the migration runs
Symptom
100% of requests to that endpoint return 500 with 'column does not exist' until the migration is applied.
Fix
Run database migrations before deploying application code that depends on them. Add a startup check that verifies the expected schema version (a sketch follows this list).
×
Letting log files fill the disk because log rotation was never configured
Symptom
Server runs out of disk space. Every write operation (including logging the 500 itself) fails, making the incident completely undebuggable.
Fix
Configure logrotate or your logging daemon to rotate and compress logs daily. Set disk usage alerts at 85%. Use centralised logging (Datadog, CloudWatch, ELK) so logs survive instance crashes.
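For the startup schema check in mistake #4, here is a rough sketch. The schema_migrations table and the version numbering are placeholders for whatever your migration tool actually maintains.
SchemaVersionGuard.sketch.js (JAVASCRIPT)
// Refuse to accept traffic if the database schema is older than the code expects
const EXPECTED_SCHEMA_VERSION = 42; // bump alongside every migration the code depends on (illustrative)

async function assertSchemaVersion(pool) {
  // Assumes a migrations table maintained by your migration tool; table and column names vary by tool
  const { rows } = await pool.query('SELECT MAX(version) AS version FROM schema_migrations');
  const current = rows[0] && rows[0].version !== null ? Number(rows[0].version) : 0;
  if (current < EXPECTED_SCHEMA_VERSION) {
    // Crash at boot: the load balancer never marks the instance healthy, so no 500s reach users
    throw new Error(`STARTUP_FAILURE: schema version ${current} is behind expected ${EXPECTED_SCHEMA_VERSION}`);
  }
}

// Call this before app.listen(), right after the environment-variable validation shown earlier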
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 04 · SENIOR
Your checkout endpoint is returning 500s for 40% of requests. Your health check is still returning 200. Your application logs show no exceptions. Where do you look first and why?
ANSWER
The health check returning 200 while 40% of requests are 500s tells me the health check is too shallow. It's probably just checking if the process is alive, not if it can actually serve requests. I'd immediately check infrastructure metrics: memory, disk space, database connection pool. The absence of exceptions in app logs often points to resource exhaustion — the request fails before it even reaches your code. In Node.js, that could be a thread pool saturation; in Java, a connection pool timeout; in any language, an out-of-memory kill that silently fails requests. First command: free -h and df -h. Second: check database pool metrics. Third: look for TCP queue overflow or load balancer timeout. The root cause is almost certainly an exhausted resource that doesn't throw a standard application exception.
Q02 of 04 · SENIOR
When would you return a 503 instead of a 500 from your API, and how does that decision affect your client's retry behaviour and your load balancer's routing?
ANSWER
Return 503 when your service is alive but cannot serve traffic because a dependency is down, overwhelmed, or unreachable. This includes database connection pool exhaustion, circuit breaker open, downstream API timeout, or resource saturation (memory/disk/threads). The decision matters enormously: a 503 tells the client and the load balancer that the issue is transient. Clients should retry with exponential backoff (and you should set a Retry-After header). Load balancers configured with health checks that return 503 will automatically take the instance out of rotation, stopping the flood of retries from making things worse. A 500, on the other hand, means the server tried and failed — the same request is likely to fail again, so automatic retry is usually fruitless. Using 503 correctly turns a potential cascading failure into a controlled degradation. Using 500 for transient issues turns a minor hiccup into a full outage as threads pile up waiting for timeouts.
Q03 of 04 · SENIOR
You've added a global error handler that catches all unhandled exceptions and returns a 500. A junior dev asks: 'Why not just catch every exception and return a 200 with an error field — that way our error rate metric stays clean?' How do you respond?
ANSWER
I'd explain that clean metrics are worthless if they're lying. A 200 tells your monitoring, your load balancer, your logging pipeline, and your alerting system that everything is fine. It's not fine — the operation failed. The 500 is an honest signal. Without it, you won't be paged, your error budgets won't degrade, and the failure will go undetected until users start complaining or churning. The silence is the danger. You can't fix what you don't measure, and if you hide the signal, you've lost the ability to measure. The right pattern is to have a global handler that logs the full context and returns a 500 with an errorId — not to silence the 500. If the client really needs a 'soft error' pattern (like showing an error toast but not crashing), that's fine at the client level — but the API must still report the 500 so operations knows something broke. I've seen this exact mistake cause a six-week undetected payment processing failure at a previous company. The error rate dashboard showed 0% all along.
Q04 of 04 · SENIOR
Your service uses a database connection pool with a max of 20 connections. Under Black Friday load, you start seeing 500s. The DB itself is healthy. What's happening, what metrics confirm it, and what are your short-term and long-term fixes?
ANSWER
The database is healthy but the connection pool is saturated — all 20 connections are active, and new requests are queuing up waiting for a connection. The queue eventually times out, causing the request to fail. The metrics to confirm: pool active connections = 20, pool pending requests > 0, query latency spikes because queries are competing for connections, and the application logs show timeout errors or 'connection not available' exceptions. The short-term fix: kill idle connections immediately with SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' and increase pool max to 50 (or whatever number fits your database's max connections). Add connection timeout of 5 seconds so no request waits indefinitely. Long-term fix: add query timeouts (10 seconds) to prevent slow queries from holding connections, add a read replica for read-heavy workloads to reduce pool pressure on the primary, implement connection pool metrics monitoring with alert at 80% usage, and consider using a connection pooler like PgBouncer between your service and the database. Also add a health check that returns 503 when pool usage exceeds 95% so the load balancer stops routing to that instance before it starts returning 500s.
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
Why am I getting a 500 error when my code worked fine in development?
The most common reason is a missing environment variable or configuration value that exists on your local machine but was never set in the production environment. Check your application's startup logs for phrases like 'undefined is not a function', 'Cannot read properties of undefined', or 'ECONNREFUSED' — these almost always point to a config value that's present locally but missing in production. The second most common cause is a database migration that ran locally but never ran against your production database.
02
What's the difference between a 500 and a 503 error?
A 500 means your application code itself failed — unhandled exception, null reference, bad config. A 503 means your service is alive but can't serve traffic because something it depends on is down or overwhelmed. The practical rule: if the problem is in your code, it's a 500; if the problem is a database being down or a dependency being unavailable, return a 503 with a Retry-After header so clients and load balancers know the issue is transient.
03
How do I find what's causing a 500 error when the response just says 'Internal Server Error'?
Go directly to your server-side application logs — not the browser, not the network tab. Search for 'ERROR' or 'Exception' filtered to the timestamp when the 500 occurred. Your framework will have logged a full stack trace there. If you can't find logs, check that your logging is actually configured to write somewhere and that log level isn't set to WARN or higher, which would suppress ERROR output. As a last resort, temporarily enable verbose error responses in a staging environment to surface the stack trace.
04
Why do 500 errors suddenly appear under high load but never happen during normal traffic?
Almost always it's resource exhaustion — usually the database connection pool. Under low traffic, your pool of 10 connections handles requests fine. Under high load, all 10 connections are occupied, new requests queue up waiting, queries time out, sessions expire mid-request, and null references start appearing in code that worked perfectly at low scale. Check your DB pool metrics (active vs. max connections) and your thread pool metrics simultaneously. The fix is a combination of increasing pool size, adding explicit connection timeouts so hung connections release, and implementing a circuit breaker so you fail fast instead of queuing indefinitely.