ASP.NET Core health checks let external systems (K8s, load balancers) query app and dependency status in a standard way.
Split liveness (pod alive) and readiness (can serve traffic) onto separate endpoints using tag filtering.
Custom IHealthCheck classes with timeouts prevent hung checks from blocking probes – always fail fast.
HTTP status mapping via ResultStatusCodes controls how Degraded vs Unhealthy affects infrastructure decisions.
The Health Checks UI dashboard gives ops teams visual history and drill-down per check.
Never put database checks on the liveness probe – that causes restart storms during partial outages.
Plain-English First
Imagine a hospital with a dashboard showing every patient's vital signs — heart rate, blood pressure, oxygen — all on one screen. A doctor glances at it and instantly knows who needs attention. ASP.NET Core health checks are exactly that dashboard for your application. Instead of patients, you're monitoring your database connection, your message queue, your disk space, and any other system your app depends on. One endpoint, one glance, and you know if everything is healthy or something is about to crash.
When your app is running in production, 'it deployed successfully' is just the beginning. Kubernetes needs to know whether to send traffic to your pod. Your load balancer needs to decide if an instance should be taken out of rotation. Your ops team needs an alert before a full outage hits — not after. Without a structured health check system, you're flying blind. You're relying on a user to tell you something is broken, which is the worst possible monitoring strategy.
Health checks solve a specific, painful problem: how do external systems and internal teams get a reliable, machine-readable signal about whether your application and all of its dependencies are functioning correctly? Before ASP.NET Core 2.2, teams would hand-roll ping endpoints, scatter try-catch blocks across random controllers, and end up with inconsistent, unreliable status pages. The built-in health check middleware standardises all of that — with a clean model for registering checks, aggregating results, and exposing them over HTTP.
By the end of this article you'll know how to register built-in and custom health checks, gate them by tags for different audiences (liveness vs readiness), wire up the visual Health Checks UI dashboard, and avoid the three mistakes that catch almost every developer the first time. You'll also have copy-paste-ready code patterns you can drop into a real project today.
How the Health Check Pipeline Actually Works
Before writing a single line of code, it's worth understanding the architecture — because once you see it, every API decision makes sense.
ASP.NET Core's health check system has three layers. First, you register one or more IHealthCheck implementations with the DI container via AddHealthChecks(). Each check is a small class with a single method — CheckHealthAsync — that returns a HealthCheckResult of Healthy, Degraded, or Unhealthy.
Second, the framework aggregates those results. When the health endpoint is hit, it runs all registered checks (or a filtered subset by tag), collects every result, and computes an overall status. If any check is Unhealthy, the aggregate is Unhealthy. If any is Degraded but none are Unhealthy, the aggregate is Degraded.
Third, the middleware serialises that result and returns an HTTP response. By default it writes only the aggregate status ('Healthy', 'Degraded', or 'Unhealthy') as plain text. You can swap in a custom response writer to return rich JSON, which is exactly what production systems need.
The key insight here is separation of concerns: the check logic, the aggregation logic, and the serialisation logic are all independent. That's what makes the system so composable.
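The aggregation rule can be seen in isolation with a small sketch. It builds a HealthReport by hand from three made-up entries (the names "self", "cache" and "database" are illustrative only, not part of the article's app) and assumes the Microsoft.Extensions.Diagnostics.HealthChecks package is referenced, as it is in any ASP.NET Core project.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical entries standing in for results the framework would collect.
static HealthReportEntry Entry(HealthStatus status, string description) =>
    new(status, description, TimeSpan.FromMilliseconds(5), exception: null, data: null);

var entries = new Dictionary<string, HealthReportEntry>
{
    ["self"] = Entry(HealthStatus.Healthy, "App is running"),
    ["cache"] = Entry(HealthStatus.Degraded, "Cache is slow"),
    ["database"] = Entry(HealthStatus.Unhealthy, "Connection refused"),
};

// HealthReport derives the overall status from its entries: the worst one wins.
var report = new HealthReport(entries, totalDuration: TimeSpan.FromMilliseconds(15));
Console.WriteLine(report.Status); // Unhealthy

// Without the failing database entry, the worst remaining status is Degraded.
entries.Remove("database");
var withoutDatabase = new HealthReport(entries, totalDuration: TimeSpan.FromMilliseconds(10));
Console.WriteLine(withoutDatabase.Status); // Degraded
```

In a running app you never construct the report yourself; the middleware does it and hands the finished HealthReport to the response writer.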
Program.cs (C#)
// Program.cs: Minimal API style (.NET 6+)
// This is the absolute foundation. Every health check setup starts here.
using Microsoft.AspNetCore.Diagnostics.HealthChecks;   // HealthCheckOptions
using Microsoft.Extensions.Diagnostics.HealthChecks;   // HealthCheckResult

var builder = WebApplication.CreateBuilder(args);

// Step 1: Register the health check services with the DI container.
// AddHealthChecks() returns an IHealthChecksBuilder you can chain onto.
builder.Services.AddHealthChecks()
    // Register a named check. The name appears in the JSON response
    // so ops teams know exactly WHICH check failed.
    .AddCheck("self", () => HealthCheckResult.Healthy("App is running"))
    // Tags let you group checks for different audiences.
    // 'live'  = Kubernetes liveness probe (is the process alive?)
    // 'ready' = Kubernetes readiness probe (can it serve traffic?)
    .AddCheck(
        name: "startup-warmup",
        check: () => HealthCheckResult.Healthy("Warm-up complete"),
        tags: new[] { "live" }
    );

var app = builder.Build();

// Step 2: Map the health check endpoints.
// /healthz/live  : only runs checks tagged 'live'
// /healthz/ready : only runs checks tagged 'ready'
// /healthz       : runs ALL checks (useful for ops dashboards)
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    // ResponseWriter controls what gets written to the HTTP response body.
    // WriteResponse is a static helper defined later in this article.
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    // Predicate filters which checks run on this endpoint.
    // Here we only run checks tagged 'live'.
    Predicate = check => check.Tags.Contains("live"),
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

app.Run();
Output
// GET /healthz
// HTTP 200 OK
{
"status": "Healthy",
"totalDuration": "00:00:00.0012345",
"entries": {
"self": {
"status": "Healthy",
"description": "App is running",
"duration": "00:00:00.0001234"
},
"startup-warmup": {
"status": "Healthy",
"description": "Warm-up complete",
"duration": "00:00:00.0000987"
}
}
}
Why Two Endpoints?
Kubernetes uses both a liveness probe (/healthz/live) and a readiness probe (/healthz/ready). Liveness asks 'is the process still alive, or is it hung?' If it fails, K8s restarts the container. Readiness asks 'is this pod ready to receive traffic?' If it fails, K8s stops routing Service traffic to the pod but doesn't restart it. Mixing all your checks on one endpoint means a slow database query could trigger a restart, which is almost never what you want.
Production Insight
The default response writer returns only the aggregate status ('Healthy', 'Degraded', or 'Unhealthy') as plain text.
Engineers in production need to know which check failed and why.
Always replace the default writer with a custom JSON writer that includes entry details and exceptions.
Key Takeaway
Health checks have three independent layers: registration, aggregation, serialisation.
Understand how they connect before writing your first check.
Separate concerns mean you can replace any layer without touching the others.
Writing a Real Custom Health Check — Database + External API
The built-in lambda-style checks are fine for demos, but production systems need proper IHealthCheck implementations. This is where the pattern gets genuinely powerful.
A well-written health check does three things: it detects a real failure condition (not just 'can I reach the host'), it includes diagnostic data in the result so engineers can debug without reading logs, and it fails fast — it has a timeout so a slow dependency doesn't hold up your entire health endpoint.
Let's build two concrete examples: a SQL Server check that validates query execution (not just connection), and an external HTTP API check that confirms the downstream service is actually responding correctly.
Notice the pattern in both checks: every catch block returns a failure result with the exception message as the description. That description surfaces in the JSON response, which means your on-call engineer sees the actual error message, not just a red dot on a dashboard.
SqlServerHealthCheck.cs (C#)
// SqlServerHealthCheck.cs
// A production-grade database health check that validates the connection
// AND confirms the database can execute a real query, not just ping.
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class SqlServerHealthCheck : IHealthCheck
{
    private readonly string _connectionString;

    // Inject the connection string via DI rather than hard-coding it.
    // In production this comes from IConfiguration / environment variables.
    public SqlServerHealthCheck(IConfiguration configuration)
    {
        _connectionString = configuration.GetConnectionString("DefaultConnection")
            ?? throw new InvalidOperationException("DefaultConnection string is not configured.");
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Use a short timeout: health checks should fail fast.
            // 5 seconds is a reasonable maximum for a DB ping.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(TimeSpan.FromSeconds(5));

            await using var connection = new SqlConnection(_connectionString);
            await connection.OpenAsync(cts.Token);

            // Run a trivial query. SELECT 1 confirms the DB engine is
            // accepting queries, not just that the TCP port is open.
            await using var command = connection.CreateCommand();
            command.CommandText = "SELECT 1";
            await command.ExecuteScalarAsync(cts.Token);

            // Include useful diagnostic data in the result.
            // This appears in the JSON response and in the Health Checks UI.
            var data = new Dictionary<string, object>
            {
                { "database", connection.Database },
                { "server", connection.DataSource }
            };

            return HealthCheckResult.Healthy(
                description: "SQL Server is reachable and accepting queries.",
                data: data
            );
        }
        catch (OperationCanceledException)
        {
            // Distinguish a timeout from a general failure:
            // timeouts and connection errors need different ops responses.
            // Report the status chosen at registration (failureStatus)
            // instead of hard-coding Unhealthy.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: "SQL Server health check timed out after 5 seconds."
            );
        }
        catch (Exception ex)
        {
            // The exception message goes into the description field
            // so it shows up directly in your monitoring dashboard.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: $"SQL Server check failed: {ex.Message}",
                exception: ex
            );
        }
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// ExternalPaymentApiHealthCheck.cs
// Checks that a critical downstream HTTP dependency is healthy.
// Uses a named HttpClient registered via IHttpClientFactory, the correct
// pattern for health checks, which must not create HttpClient instances
// directly (causes socket exhaustion).
public class ExternalPaymentApiHealthCheck : IHealthCheck
{
    private readonly IHttpClientFactory _httpClientFactory;

    public ExternalPaymentApiHealthCheck(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Use the named client configured in Program.cs.
            // Named clients have pre-configured BaseAddress, timeout, etc.
            var httpClient = _httpClientFactory.CreateClient("PaymentApiClient");

            // Hit the payment API's own health endpoint rather than a
            // business endpoint, which avoids triggering real business logic.
            using var response = await httpClient.GetAsync("/health", cancellationToken);

            if (response.IsSuccessStatusCode)
            {
                return HealthCheckResult.Healthy(
                    description: $"Payment API responded with {(int)response.StatusCode}."
                );
            }

            // Degraded = the service is reachable but not fully healthy.
            // This is useful when a dependency is slow or partially down.
            return HealthCheckResult.Degraded(
                description: $"Payment API returned unexpected status: {(int)response.StatusCode}."
            );
        }
        catch (HttpRequestException ex)
        {
            // Use the registration's failureStatus so the severity set in
            // Program.cs (Degraded below) actually applies to this failure.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: $"Cannot reach Payment API: {ex.Message}",
                exception: ex
            );
        }
        catch (TaskCanceledException)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: "Payment API health check timed out."
            );
        }
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Program.cs: registering both custom checks
builder.Services.AddHttpClient("PaymentApiClient", client =>
{
    client.BaseAddress = new Uri("https://api.paymentprovider.com");
    // Set a tight timeout. Do not rely on the default 100s HttpClient timeout.
    client.Timeout = TimeSpan.FromSeconds(8);
});

builder.Services.AddHealthChecks()
    .AddCheck<SqlServerHealthCheck>(
        name: "sql-server",
        failureStatus: HealthStatus.Unhealthy, // a DB failure = fully unhealthy
        tags: new[] { "ready", "db" }
    )
    .AddCheck<ExternalPaymentApiHealthCheck>(
        name: "payment-api",
        failureStatus: HealthStatus.Degraded, // payment API down = degraded, not dead
        tags: new[] { "ready", "external" }
    );
Output
// GET /healthz/ready when SQL Server is fine but the Payment API answers with 503
// HTTP 200 OK (Degraded still returns 200 by default — see Gotchas)
{
"status": "Degraded",
"totalDuration": "00:00:00.2341567",
"entries": {
"sql-server": {
"status": "Healthy",
"description": "SQL Server is reachable and accepting queries.",
"duration": "00:00:00.0234567",
"data": {
"database": "AppDb",
"server": "prod-sql-01.internal"
}
},
"payment-api": {
"status": "Degraded",
"description": "Payment API returned unexpected status: 503.",
"duration": "00:00:00.2107000",
"data": {}
}
}
}
Watch Out: Never new up HttpClient in a health check
Creating new HttpClient() inside CheckHealthAsync is a classic socket exhaustion bug — health checks run frequently (every few seconds in K8s), so you'll blow through available sockets fast. Always inject IHttpClientFactory and call CreateClient(). It's one extra line of setup in Program.cs and it eliminates the entire problem.
Production Insight
Health checks that create HttpClient directly cause socket exhaustion.
Factory-managed clients reuse connections and respect DNS changes.
Rule: always use IHttpClientFactory for any HTTP-dependent health check.
Key Takeaway
Write checks that detect real failure, include diagnostic data, and fail fast.
Timeouts are mandatory: the report is not returned until every check finishes, so one hung check stalls the whole probe.
Never new up HttpClient inside CheckHealthAsync.
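Besides the in-check CancelAfter shown earlier, the registration itself accepts a timeout. The following is a minimal sketch under stated assumptions: HangingDependencyCheck and the name "slow-dependency" are hypothetical, and the check is run through a bare ServiceCollection (which needs the Microsoft.Extensions.Logging package for AddLogging) instead of a full web host, purely to keep the example short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var services = new ServiceCollection();
services.AddLogging(); // the health check service resolves an ILogger

services.AddHealthChecks()
    .AddCheck<HangingDependencyCheck>(
        name: "slow-dependency",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready" },
        // The framework cancels the check's token after this long.
        timeout: TimeSpan.FromSeconds(2));

await using var provider = services.BuildServiceProvider();
var healthService = provider.GetRequiredService<HealthCheckService>();

var report = await healthService.CheckHealthAsync();
var entry = report.Entries["slow-dependency"];

// A timed-out check is reported with the registration's failureStatus
// (Degraded in this sketch), not with whatever the check would have returned.
Console.WriteLine($"{entry.Status}: {entry.Description}");

// Hypothetical check that never completes unless its token is cancelled.
public sealed class HangingDependencyCheck : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // The timeout only helps if the check passes the token along.
        await Task.Delay(Timeout.InfiniteTimeSpan, cancellationToken);
        return HealthCheckResult.Healthy();
    }
}
```

The design point is that the registration timeout works through cancellation: a check that ignores its CancellationToken will keep running regardless, so both layers (honouring the token and setting a timeout) are needed.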
Custom JSON Response Writer and the Health Checks UI Dashboard
The default health check response is a single word: 'Healthy', 'Degraded', or 'Unhealthy'. That's fine for Kubernetes probes, but it's useless for a human engineer trying to diagnose a problem. You need a JSON response that includes every check name, its status, its description, and how long it took.
ASP.NET Core lets you swap in a custom ResponseWriter — a delegate of type Func<HttpContext, HealthReport, Task>. You write it once, pass it to every HealthCheckOptions instance, and every endpoint automatically returns rich JSON.
For a visual dashboard, the AspNetCore.HealthChecks.UI NuGet package gives you a ready-made React UI that polls your health endpoints and shows a live status board. It's genuinely useful for ops teams — and it takes about ten minutes to set up.
The UI package needs to know which health check URIs to monitor. You can list them in code with AddHealthCheckEndpoint() or in a HealthChecksUI section in appsettings.json. Either way, the UI can monitor multiple services, not just the current app, which makes it a lightweight centralised health dashboard.
HealthCheckResponseWriter.cs (C#)
// HealthCheckResponseWriter.cs
// A reusable JSON response writer that returns rich diagnostic output.
// Reference this from every MapHealthChecks call.
using System.Text.Json;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthCheckResponseWriter
{
    // Use camelCase to match the convention of JSON APIs everywhere.
    // The options object is created once and reused across requests.
    private static readonly JsonSerializerOptions JsonOptions = new()
    {
        WriteIndented = true,
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase
    };

    public static Task WriteResponse(HttpContext context, HealthReport report)
    {
        // Always return JSON. Never let this endpoint return HTML.
        context.Response.ContentType = "application/json; charset=utf-8";

        // Map each health check entry to a serialisable anonymous object.
        var responseBody = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration.ToString(),
            entries = report.Entries.ToDictionary(
                entry => entry.Key, // check name e.g. "sql-server"
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    duration = entry.Value.Duration.ToString(),
                    // Serialise the exception message if one was captured.
                    // This is invaluable for on-call debugging.
                    exception = entry.Value.Exception?.Message,
                    data = entry.Value.Data
                }
            )
        };

        return context.Response.WriteAsync(
            JsonSerializer.Serialize(responseBody, JsonOptions)
        );
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Program.cs additions for Health Checks UI
// Install: dotnet add package AspNetCore.HealthChecks.UI
//          dotnet add package AspNetCore.HealthChecks.UI.Client
//          dotnet add package AspNetCore.HealthChecks.UI.InMemory.Storage
using HealthChecks.UI.Client; // provides UIResponseWriter
using Microsoft.AspNetCore.Diagnostics.HealthChecks; // HealthCheckOptions

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    .AddCheck<SqlServerHealthCheck>("sql-server", tags: new[] { "ready", "db" })
    .AddCheck<ExternalPaymentApiHealthCheck>("payment-api", tags: new[] { "ready", "external" });

// Register the UI services and configure in-memory storage for check history.
builder.Services
    .AddHealthChecksUI(settings =>
    {
        // How often the UI polls the health endpoint (in seconds).
        settings.SetEvaluationTimeInSeconds(15);
        // Maximum number of history entries to retain per endpoint.
        settings.MaximumHistoryEntriesPerEndpoint(50);
        // Register the endpoint the UI will poll.
        // The name shows up as a label in the UI dashboard.
        settings.AddHealthCheckEndpoint(
            name: "Production App",
            uri: "/healthz"
        );
    })
    .AddInMemoryStorage(); // stores check history in-process (use SQL for multi-instance)

var app = builder.Build();

// The /healthz endpoint uses the UI client's response writer.
// UIResponseWriter.WriteHealthCheckUIResponse outputs the exact JSON format
// the UI dashboard expects, which is richer than our custom writer.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Serve the Health Checks UI dashboard at /healthchecks-ui
// Restrict this to internal networks in production!
app.MapHealthChecksUI(options =>
{
    options.UIPath = "/healthchecks-ui";
    options.ApiPath = "/healthchecks-api";
});

app.Run();

// ─────────────────────────────────────────────────────────────────────────────
// appsettings.json: used for multi-service UI monitoring
// (When using AddHealthCheckEndpoint() in code, this section is optional
// but useful for environment-specific overrides via environment variables.)
/*
{
  "HealthChecksUI": {
    "HealthChecks": [
      {
        "Name": "Production App",
        "Uri": "https://myapp.internal/healthz"
      },
      {
        "Name": "Background Worker",
        "Uri": "https://worker.internal/healthz"
      }
    ],
    "EvaluationTimeInSeconds": 15,
    "MaximumHistoryEntriesPerEndpoint": 50
  }
}
*/
Output
// Navigate to https://localhost:5001/healthchecks-ui
// You'll see a dashboard with:
// - A green/yellow/red status badge per registered service
// - A timeline chart showing health history
// - Drill-down per check showing description, duration, exception
// The /healthz JSON response looks like:
{
"status": "Healthy",
"totalDuration": "00:00:00.0342100",
"entries": {
"sql-server": {
"status": "Healthy",
"description": "SQL Server is reachable and accepting queries.",
The /healthchecks-ui endpoint exposes infrastructure details: server names, connection strings in exception messages, latency data. Gate it behind a network policy or add app.MapHealthChecksUI().RequireAuthorization("InternalOnly") with an IP-restriction policy. Exposing it publicly is a real security risk.
Production Insight
Health Checks UI exposes server names and exception details.
Without network or authorization gating, you leak internal topology.
Rule: treat the UI endpoint as internal infrastructure — never public.
Key Takeaway
Custom ResponseWriter gives you full control over JSON shape.
The UI dashboard is a ten-minute setup for a live ops board.
Always secure the UI endpoint behind network policy or auth.
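One lightweight way to keep health endpoints off the public listener is host and port filtering with RequireHost. The sketch below is illustrative only: ports 5000 and 5001 are assumptions, and because RequireHost matches on the request's Host header (which a client can spoof), it complements a network policy or authorization and does not replace them.

```csharp
// Program.cs: a minimal sketch, assuming a dedicated management port.
var builder = WebApplication.CreateBuilder(args);

// Hypothetical layout: 5000 serves public traffic, 5001 is internal only.
builder.WebHost.UseUrls("http://*:5000", "http://*:5001");

builder.Services.AddHealthChecks();

var app = builder.Build();

// The endpoint only matches requests whose Host header targets port 5001.
// Requests arriving via the public port get a 404 for this path.
app.MapHealthChecks("/healthz").RequireHost("*:5001");

// MapHealthChecksUI(...) also returns an endpoint convention builder,
// so the same RequireHost("*:5001") call can be chained onto it.

app.MapGet("/", () => "Public endpoint");

app.Run();
```

The management port should then be left out of the public ingress or load balancer configuration so that only in-cluster callers can reach it.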
HTTP Status Codes, Failure Thresholds and the ResultStatusCodes Gotcha
Here's something that surprises almost everyone the first time: by default, ASP.NET Core returns HTTP 200 for both Healthy and Degraded results, and HTTP 503 only for Unhealthy. Kubernetes HTTP probes treat any status from 200 to 399 as success, so a readiness probe won't remove a degraded pod from the load balancer. If 'degraded' for you means 'stop sending traffic here', you need to override this.
You control the HTTP status code mapping via HealthCheckOptions.ResultStatusCodes. It's a dictionary from HealthStatus to HTTP status code. Changing Degraded to map to 503 tells K8s to remove the pod from rotation when any check is degraded.
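There's also the FailureStatus concept, set per check registration, not per endpoint. It controls what status the framework reports when a check throws an unhandled exception or exceeds its registration timeout, and the check can read it as context.Registration.FailureStatus. A check that explicitly returns HealthCheckResult.Unhealthy(...) is still reported as Unhealthy, whatever the registration says. Setting failureStatus: HealthStatus.Degraded on a non-critical check, and returning context.Registration.FailureStatus from its failure paths, means that check can fail without taking the whole service offline.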
These two levers together give you very fine-grained control over how dependency failures propagate to your infrastructure.
HealthCheckStatusCodeConfig.cs (C#)
// Program.cs: demonstrating ResultStatusCodes and FailureStatus configuration.
// This is the production-ready pattern for a Kubernetes-hosted service.
using HealthChecks.UI.Client; // provides UIResponseWriter
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    // Critical dependency: DB down = service is Unhealthy
    .AddCheck<SqlServerHealthCheck>(
        name: "sql-server",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "ready" }
    )
    // Important but non-critical: cache down = service is Degraded.
    // The service can still serve traffic without Redis, just slower.
    .AddCheck<RedisCacheHealthCheck>(
        name: "redis-cache",
        failureStatus: HealthStatus.Degraded, // downgrade the severity
        tags: new[] { "ready" }
    )
    // External dependency: payment API down = Degraded (we can queue transactions)
    .AddCheck<ExternalPaymentApiHealthCheck>(
        name: "payment-api",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready" }
    );

var app = builder.Build();

// Liveness endpoint: no dependency checks at all.
// A liveness failure triggers a pod RESTART. Keep this minimal.
// Do NOT include DB or external checks here. A slow DB causes
// restart loops, which makes an outage dramatically worse.
app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = _ => false, // run NO registered checks, just return 200
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

// Readiness endpoint: all 'ready' tagged checks.
// A readiness failure removes the pod from the load balancer.
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthCheckResponseWriter.WriteResponse,
    // THE KEY CHANGE: map Degraded to 503 so K8s stops sending traffic
    // when any dependency is struggling, even if not fully failed.
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    }
});

// Full check endpoint for ops dashboards and manual inspection.
// Returns 200 even when Degraded so the dashboard doesn't show false alarms.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    // No ResultStatusCodes override: uses the default (Degraded = 200)
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.Run();

// ─────────────────────────────────────────────────────────────────────────────
// RedisCacheHealthCheck.cs: a lightweight example showing Degraded usage
using Microsoft.Extensions.Diagnostics.HealthChecks;
using StackExchange.Redis;

public class RedisCacheHealthCheck : IHealthCheck
{
    private readonly IConnectionMultiplexer _redis;

    public RedisCacheHealthCheck(IConnectionMultiplexer redis)
    {
        _redis = redis;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // IsConnected is synchronous, so no await is needed for a connection check.
        if (_redis.IsConnected)
        {
            return Task.FromResult(
                HealthCheckResult.Healthy("Redis is connected.")
            );
        }

        // Return Degraded here: the service can operate without cache,
        // but performance will degrade. Ops should know.
        // The FailureStatus on the registration (Degraded) means even
        // if this throws an exception, it reports as Degraded not Unhealthy.
        return Task.FromResult(
            HealthCheckResult.Degraded("Redis is not connected. Operating without cache.")
        );
    }
}
Output
// GET /healthz/ready — when Redis is disconnected
// HTTP 503 Service Unavailable <-- Kubernetes now removes this pod from rotation
{
"status": "Degraded",
"totalDuration": "00:00:00.0089234",
"entries": {
"sql-server": {
"status": "Healthy",
"description": "SQL Server is reachable and accepting queries."
},
"redis-cache": {
"status": "Degraded",
"description": "Redis is not connected. Operating without cache."
},
"payment-api": {
"status": "Healthy",
"description": "Payment API responded with 200."
}
}
}
// GET /healthz/live — always 200 while the process is running
// HTTP 200 OK
{
"status": "Healthy",
"totalDuration": "00:00:00.0001234",
"entries": {}
}
Watch Out: Putting DB checks on the liveness probe
If your SQL Server is slow and you've put the DB health check on /healthz/live, Kubernetes will interpret the timeout as a dead process and restart your pod. Now every instance is restarting simultaneously while the DB recovers — turning a degraded situation into a complete outage. Liveness should only check 'is the process itself alive?'. Readiness handles dependency checks.
Production Insight
Degraded returns HTTP 200 by default — Kubernetes ignores it and keeps sending traffic.
You must explicitly map Degraded to 503 on the readiness endpoint to make K8s react.
Rule: never assume the default status codes match your infrastructure's expectation.
Key Takeaway
ResultStatusCodes maps HealthStatus to HTTP codes per endpoint.
FailureStatus sets the HealthStatus reported when a check throws or times out, per registration.
Combine both levers for fine-grained control over dependency failure propagation.
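To make the FailureStatus lever concrete, here is a small sketch with three hypothetical checks (ThrowingCheck, ExplicitUnhealthyCheck, RegistrationAwareCheck are illustrative names), all registered with failureStatus: HealthStatus.Degraded. It runs them through a bare ServiceCollection, which assumes the Microsoft.Extensions.Logging package is available for AddLogging.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var services = new ServiceCollection();
services.AddLogging();
services.AddHealthChecks()
    .AddCheck<ThrowingCheck>("throws", failureStatus: HealthStatus.Degraded)
    .AddCheck<ExplicitUnhealthyCheck>("explicit", failureStatus: HealthStatus.Degraded)
    .AddCheck<RegistrationAwareCheck>("aware", failureStatus: HealthStatus.Degraded);

await using var provider = services.BuildServiceProvider();
var report = await provider.GetRequiredService<HealthCheckService>().CheckHealthAsync();

foreach (var (name, entry) in report.Entries)
{
    Console.WriteLine($"{name}: {entry.Status}");
}

// Unhandled exception: the framework substitutes the registration's failureStatus.
public sealed class ThrowingCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => throw new InvalidOperationException("Simulated dependency failure.");
}

// Explicit Unhealthy: reported as-is; the registration's failureStatus is not applied.
public sealed class ExplicitUnhealthyCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(HealthCheckResult.Unhealthy("Hard-coded Unhealthy."));
}

// Registration-aware: the check reads failureStatus and returns it itself.
public sealed class RegistrationAwareCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(new HealthCheckResult(
            context.Registration.FailureStatus,
            description: "Failure reported with the registered severity."));
}
```

The "throws" and "aware" entries come back Degraded, while "explicit" stays Unhealthy and drags the aggregate down with it. If a check's severity should be decided at registration time, its failure paths need to return context.Registration.FailureStatus.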
Health Checks for Background Services and Worker Processes
Not all work happens in request-response cycles. Your app probably runs background services — hosted services that process messages, poll queues, or perform periodic maintenance. If one of those workers stalls, the health endpoint should know about it, even if the main web process is still accepting requests.
The solution is to share state between your BackgroundService and an IHealthCheck implementation, usually via a thread-safe flag or a shared object registered as a singleton. The background service writes its status (last processed timestamp, queue depth, error count), and the health check reads it.
This pattern keeps the health check lightweight and decouples worker logic from health reporting. You get accurate visibility into background activity without making the health check itself execute business logic.
WorkerHealthCheck.cs (C#)
// BackgroundQueueProcessor.cs
// A BackgroundService that updates a shared health status object.
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class BackgroundQueueProcessor : BackgroundService
{
    private readonly ILogger<BackgroundQueueProcessor> _logger;
    private readonly WorkerHealthStatus _status;
    private long _processedCount;

    public BackgroundQueueProcessor(
        ILogger<BackgroundQueueProcessor> logger,
        WorkerHealthStatus status) // registered as singleton
    {
        _logger = logger;
        _status = status;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        _logger.LogInformation("Queue processor started.");
        _status.SetHealthy("Worker running, processing queue");

        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                // Simulate processing a batch of messages
                await Task.Delay(1000, stoppingToken);

                // Update health status with last run time
                _status.SetHealthy(
                    $"Queue processed at {DateTime.UtcNow:O}",
                    new Dictionary<string, object>
                    {
                        ["lastRun"] = DateTime.UtcNow,
                        ["processedCount"] = Interlocked.Increment(ref _processedCount)
                    }
                );
            }
            catch (OperationCanceledException)
            {
                // Graceful shutdown
                _status.SetDegraded("Worker stopping");
                break;
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Queue processing failed");
                _status.SetUnhealthy($"Queue processing failed: {ex.Message}");
                // Optionally wait before retrying to avoid tight failure loops
                await Task.Delay(5000, stoppingToken);
            }
        }

        _status.SetDegraded("Worker stopped");
    }
}

// WorkerHealthStatus.cs
// Thread-safe health status holder for background workers.
public class WorkerHealthStatus
{
    private HealthStatus _status = HealthStatus.Unhealthy;
    private string _description = "Not started";
    private Dictionary<string, object> _data = new();
    private readonly object _lock = new();

    public void SetHealthy(string description, Dictionary<string, object>? data = null)
    {
        lock (_lock)
        {
            _status = HealthStatus.Healthy;
            _description = description;
            _data = data ?? new Dictionary<string, object>();
        }
    }

    public void SetDegraded(string description)
    {
        lock (_lock)
        {
            _status = HealthStatus.Degraded;
            _description = description;
            // Clear data from the last healthy run so it is not reported as current.
            _data = new Dictionary<string, object>();
        }
    }

    public void SetUnhealthy(string description)
    {
        lock (_lock)
        {
            _status = HealthStatus.Unhealthy;
            _description = description;
            // Clear data from the last healthy run so it is not reported as current.
            _data = new Dictionary<string, object>();
        }
    }

    public HealthCheckResult GetResult()
    {
        lock (_lock)
        {
            return new HealthCheckResult(_status, _description, data: _data);
        }
    }
}

// BackgroundWorkerHealthCheck.cs
// IHealthCheck that reads from the shared status object.
public class BackgroundWorkerHealthCheck : IHealthCheck
{
    private readonly WorkerHealthStatus _status;

    public BackgroundWorkerHealthCheck(WorkerHealthStatus status)
    {
        _status = status;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Just delegate to the shared status object. No async work needed.
        return Task.FromResult(_status.GetResult());
    }
}

// Program.cs: registration for worker health check
builder.Services.AddSingleton<WorkerHealthStatus>();
builder.Services.AddHostedService<BackgroundQueueProcessor>();
builder.Services.AddHealthChecks()
    .AddCheck<BackgroundWorkerHealthCheck>(
        name: "queue-worker",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready", "background" }
    );
Output
// GET /healthz/ready (worker is processing)
// HTTP 200 OK
{
  "status": "Healthy",
  "entries": {
    "sql-server": { "status": "Healthy" },
    "payment-api": { "status": "Healthy" },
    "queue-worker": {
      "status": "Healthy",
      "description": "Queue processed at 2026-04-22T14:35:10.123Z",
      "data": {
        "lastRun": "2026-04-22T14:35:10.123Z",
        "processedCount": 42
      }
    }
  }
}
// GET /healthz/ready (worker has failed)
// HTTP 503 Service Unavailable (the default status code for Unhealthy)
{
  "status": "Unhealthy",
  "entries": {
    "sql-server": { "status": "Healthy" },
    "payment-api": { "status": "Healthy" },
    "queue-worker": {
      "status": "Unhealthy",
      "description": "Queue processing failed: Connection to message bus refused",
      "data": {}
    }
  }
}
Why Use a Shared Status Object?
The worker health check doesn't call the message queue every time it's invoked — that would be slow and could overwhelm the queue during an outage. Instead, the background service writes its status periodically, and the health check reads the latest value. This decouples check execution from actual monitoring cost.
Production Insight
Background workers can stall silently while the web layer stays healthy.
A shared status object lets the health check see worker failures immediately.
Rule: always add a health check for each critical BackgroundService — don't assume it's running just because the process is up.
Key Takeaway
Use a singleton shared status object between BackgroundService and IHealthCheck.
Workers update status periodically; health checks read it — no heavy lifting in the check.
This pattern avoids worker health checks that depend on the very system they monitor.
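One gap in the shared status object is a worker that hangs without throwing: the last status written stays Healthy. A minimal sketch of one way to catch that, assuming the worker records a heartbeat on every loop iteration. `WorkerHeartbeat`, `WorkerHeartbeatHealthCheck`, and the 60-second threshold are hypothetical names and values chosen for illustration, not part of the code above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical companion to WorkerHealthStatus: remembers when the worker last made progress.
public sealed class WorkerHeartbeat
{
    // Stored as ticks so reads and writes can be atomic without a lock.
    private long _lastBeatTicks = DateTime.MinValue.Ticks;

    // The worker calls this once per loop iteration.
    public void Beat() => Interlocked.Exchange(ref _lastBeatTicks, DateTime.UtcNow.Ticks);

    public DateTime LastBeatUtc =>
        new DateTime(Interlocked.Read(ref _lastBeatTicks), DateTimeKind.Utc);
}

public sealed class WorkerHeartbeatHealthCheck : IHealthCheck
{
    // Assumed threshold: a few multiples of the worker's normal loop interval.
    private static readonly TimeSpan MaxSilence = TimeSpan.FromSeconds(60);

    private readonly WorkerHeartbeat _heartbeat;

    public WorkerHeartbeatHealthCheck(WorkerHeartbeat heartbeat) => _heartbeat = heartbeat;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var silence = DateTime.UtcNow - _heartbeat.LastBeatUtc;

        // Report the registration's failureStatus when the worker has gone quiet.
        var result = silence <= MaxSilence
            ? HealthCheckResult.Healthy($"Last heartbeat {silence.TotalSeconds:F0}s ago")
            : new HealthCheckResult(
                context.Registration.FailureStatus,
                $"No heartbeat for {silence.TotalSeconds:F0}s");

        return Task.FromResult(result);
    }
}
```

Registered as a singleton next to `WorkerHealthStatus`, the heartbeat keeps the check as cheap as before: it still only reads in-memory state and never touches the queue.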
Production incident · Post-mortem · Severity: high
The Restart Storm That Took Down Three Services
Symptom
All pods across three services started restarting in rapid succession. Requests returned 503 errors. The database team reported slow queries due to an unplanned index rebuild, but the apps were crashing, not just slowing down.
Assumption
The team assumed that if the health check returned Unhealthy, Kubernetes would handle it gracefully. They thought setting a high failureThreshold on the liveness probe would buy them time.
Root cause
The liveness probe (/healthz/live) ran the same SQL Server health check as the readiness probe. When the database slowed down, the health check timed out after 5 seconds. Kubernetes saw the timeout, interpreted it as a dead process, and restarted the pod. With multiple replicas, all restarts happened within the same minute, causing complete downtime.
Fix
Changed the liveness endpoint to run zero checks (Predicate = _ => false) so it always returns 200 as long as the process is alive. Moved the database check exclusively to the readiness endpoint. Set the readiness probe's failureThreshold to 3 to tolerate transient slowness before removing pods from rotation.
Key lesson
Liveness probes must only check if the process itself is alive, not its dependencies.
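The fix described in this post-mortem can be sketched as two endpoint mappings. This is a minimal Program.cs under the assumption that dependency checks are registered elsewhere with the "ready" tag; the paths match the ones used in this article.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Assumption: dependency checks tagged "ready" are added to this builder elsewhere.
builder.Services.AddHealthChecks();

var app = builder.Build();

// Liveness: the predicate rejects every check, so the endpoint
// answers 200 whenever the process can serve HTTP at all.
app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: run only the checks tagged "ready" (database, queues, external APIs).
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready")
});

app.Run();
```

With this split, a slow database can only fail the readiness endpoint, which takes the pod out of rotation instead of restarting it.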
Production debug guide
Common symptoms and the exact actions to diagnose them
Symptom · 01
Health endpoint returns 200 but pod keeps restarting
Fix
Check which endpoint Kubernetes is using as liveness probe. If it includes dependency checks, reconfigure to only use a no-check liveness endpoint.
Symptom · 02
Health endpoint times out after 30 seconds
Fix
Add an explicit CancellationTokenSource with short timeout inside CheckHealthAsync. The default CancellationToken may not enforce a timeout.
Symptom · 03
Health check says Degraded but K8s doesn't stop traffic
Fix
Verify ResultStatusCodes mapping on the readiness endpoint. Degraded returns 200 by default – override to map Degraded to 503.
Symptom · 04
JSON response missing exception details
Fix
Ensure your ResponseWriter serializes entry.Value.Exception?.Message. The default writer emits only the overall status as plain text, so exception data never appears in the response.
Symptom · 05
Health Checks UI shows 'Unhealthy' but app works fine
Fix
Check if the UI endpoint URI is correct. If using different ports or authentication, the UI might receive a 401 or 404, which it interprets as Unhealthy.
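For Symptom 04, a custom response writer might look like the sketch below. `HealthResponseWriter` and the JSON property names are hypothetical; the shape is modelled on the sample output shown earlier in this article, not on any official format.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: writes the health report as JSON, including exception messages.
public static class HealthResponseWriter
{
    public static Task WriteJson(HttpContext httpContext, HealthReport report)
    {
        httpContext.Response.ContentType = "application/json; charset=utf-8";

        var payload = new
        {
            status = report.Status.ToString(),
            entries = report.Entries.ToDictionary(
                entry => entry.Key,
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    // Message only: full stack traces stay in the logs.
                    exception = entry.Value.Exception?.Message,
                    data = entry.Value.Data
                })
        };

        return httpContext.Response.WriteAsync(JsonSerializer.Serialize(payload));
    }
}
```

It plugs in through `ResponseWriter = HealthResponseWriter.WriteJson` on the `HealthCheckOptions` of the endpoint. Because exception messages can reveal host names, this is best paired with a secured or internal-only endpoint.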
Quick Health Check Debug Cheat Sheet
Commands and fixes for the most common health check issues in production
Liveness probe causing restarts
Immediate action
Run `kubectl describe pod <pod-name>` and check the Liveness probe section to see which endpoint and threshold is configured.
Commands
kubectl get pods --field-selector=status.phase=Running -o custom-columns=NAME:.metadata.name,LIVENESS:.spec.containers[0].livenessProbe.httpGet.path
Check application logs for 'OperationCanceledException' – that means the timeout is being triggered.
Fix now
Wrap your health check logic in a using (var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken)) { cts.CancelAfter(TimeSpan.FromSeconds(5)); ... } block and pass cts.Token to every awaited call.
Degraded not stopping K8s traffic
Immediate action
Check the HealthCheckOptions on the readiness endpoint: ResultStatusCodes is probably missing the Degraded -> 503 mapping.
Commands
kubectl get pods -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
Fix now
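Add ResultStatusCodes = new Dictionary<HealthStatus, int> { [HealthStatus.Healthy] = 200, [HealthStatus.Degraded] = 503, [HealthStatus.Unhealthy] = 503 } to the readiness endpoint options. Map all three statuses, because this dictionary replaces the default mapping.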
Health Checks UI shows 'Unhealthy' but app is fine
Immediate action
Verify the URI configured in HealthChecksUI settings matches the actual health endpoint URL including port and path.
Commands
curl -v http://localhost:5000/healthz (compare status code and body)
Check the UI's network tab – is it receiving a 200 response with proper JSON?
Fix now
Update AddHealthCheckEndpoint URI in Program.cs or the HealthChecksUI configuration section in appsettings.json.
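The Degraded-to-503 fix from the cheat sheet, written out as a sketch of a readiness endpoint. It assumes the "ready" tag convention used in this article and maps every status explicitly.

```csharp
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks(); // checks tagged "ready" are assumed to be added here

var app = builder.Build();

app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready"),

    // Explicit mapping for all three statuses; only the Degraded entry
    // differs from the framework default (which is 200).
    ResultStatusCodes = new Dictionary<HealthStatus, int>
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    }
});

app.Run();
```

Whether Degraded should stop traffic is a per-service decision: mapping it to 503 trades availability for correctness whenever a non-critical dependency struggles.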
Liveness vs Readiness Probes in Kubernetes

| Aspect | Liveness Probe (/healthz/live) | Readiness Probe (/healthz/ready) |
| --- | --- | --- |
| Purpose | Is the process itself alive and not deadlocked? | Is the pod ready to receive user traffic? |
| K8s action on failure | Restarts the pod | Removes pod from load balancer rotation |
| Recommended checks | Self-check only (return 200 if process runs) | DB, cache, external APIs, message queues |
| Risk of including DB checks | High: a slow DB causes a restart storm | Safe: a slow DB just pauses traffic to that pod |
| Typical HTTP status codes | 200 OK | 200 OK when Healthy; 503 when Unhealthy (or Degraded, if mapped to 503) |
| Run frequency in K8s | Every 10-30 seconds | Every 10-30 seconds |
| FailureStatus recommendation | N/A: no checks to configure | Unhealthy for critical, Degraded for non-critical |
Key takeaways
1. Split liveness and readiness onto separate endpoints with tag filtering: putting database checks on the liveness probe is the #1 cause of Kubernetes restart storms during partial outages.
2. Always set a hard timeout inside CheckHealthAsync: a health check that hangs for 100 seconds is worse than one that fails fast and returns Unhealthy after 5 seconds.
3. Use ResultStatusCodes to map Degraded to HTTP 503 on the readiness endpoint if you want Kubernetes to stop routing traffic to a pod when any dependency is struggling.
4. The FailureStatus per-check registration and the ResultStatusCodes per-endpoint configuration are independent levers: FailureStatus controls what HealthStatus gets reported, ResultStatusCodes controls what HTTP code that status maps to.
5. Background services need their own health checks via a shared status object: don't assume a healthy web layer means background workers are still running.
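The hard timeout from takeaway 2 can be sketched as a reusable wrapper. `TimeBoxedHealthCheck` and its `probe` delegate are hypothetical names; the probe stands in for whatever call the real check makes, and it only benefits from the timeout if it honours the token it is given.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical wrapper: runs a probe under a hard deadline and fails fast.
public sealed class TimeBoxedHealthCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task> _probe;
    private readonly TimeSpan _timeout;

    public TimeBoxedHealthCheck(Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        _probe = probe;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Linked source: cancels on the caller's token or on our own deadline.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            await _probe(cts.Token);
            return HealthCheckResult.Healthy();
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Our deadline fired, not the caller's cancellation.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                $"Timed out after {_timeout.TotalSeconds:F0}s");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus, ex.Message, ex);
        }
    }
}
```

An instance can be registered with `AddCheck("sql-server", new TimeBoxedHealthCheck(probe, TimeSpan.FromSeconds(5)), tags: new[] { "ready" })`. Cancellation requested by the caller is deliberately left to propagate rather than being reported as a failure.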
Common mistakes to avoid
Exposing /healthz publicly without securing it
Symptom
The endpoint includes server names, database host names, and exception stack traces in its JSON response. An attacker can map your entire infrastructure topology from it.
Fix
Add .RequireAuthorization() to MapHealthChecks and back it with an IP-restriction policy, or serve the endpoint only on a non-public internal port or host with RequireHost, for example app.MapHealthChecks(...).RequireHost("*:5001") or .RequireHost("*.internal").
Forgetting that Degraded returns HTTP 200 by default
Symptom
Developers test their health check, see 'Degraded' in the JSON, assume Kubernetes will react, and are confused when degraded pods keep receiving traffic.
Fix
Explicitly configure ResultStatusCodes in HealthCheckOptions and map HealthStatus.Degraded to StatusCodes.Status503ServiceUnavailable on the readiness endpoint — as shown in the code above.
Running all health checks on the liveness probe
Symptom
When the database is slow, the liveness check times out, Kubernetes restarts every pod simultaneously, and a partial outage becomes a total one.
Fix
Either use Predicate = _ => false on the liveness endpoint (returning 200 unconditionally while the process is up), or only tag lightweight self-checks with 'live' and never tag external dependency checks with it.
Not including a timeout inside CheckHealthAsync
Symptom
A third-party API health check hangs for 100 seconds because that is HttpClient's default timeout. The aggregate health endpoint also hangs, causing Kubernetes probes to fail and restart the pod.
Fix
Always use CancellationTokenSource.CreateLinkedTokenSource with a short timeout inside CheckHealthAsync. Set per-check timeouts — 5 seconds for databases, 8 seconds for HTTP calls.
Directly creating HttpClient in health checks
Symptom
Health checks run every 15 seconds, each creating a new HttpClient. Within minutes, socket exhaustion crashes the process with SocketException: Only one usage of each socket address is normally permitted.
Fix
Inject IHttpClientFactory (registered in DI) and call CreateClient(). For health checks that call HTTP endpoints, always use a named or typed client with a pre-configured timeout.
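The IHttpClientFactory fix from the last mistake might look like the sketch below. The class name `PaymentApiPingCheck`, the client name, the `/health` path, and the base address in the registration comment are all hypothetical placeholders.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check: pings an HTTP dependency through a pooled, named client.
public sealed class PaymentApiPingCheck : IHealthCheck
{
    public const string ClientName = "payment-api-health"; // assumed client name

    private readonly IHttpClientFactory _clientFactory;

    public PaymentApiPingCheck(IHttpClientFactory clientFactory) => _clientFactory = clientFactory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // The factory reuses pooled handlers, so no sockets leak per probe.
        var client = _clientFactory.CreateClient(ClientName);

        try
        {
            using var response = await client.GetAsync("/health", cancellationToken);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy("Payment API reachable")
                : new HealthCheckResult(
                    context.Registration.FailureStatus,
                    $"Payment API returned {(int)response.StatusCode}");
        }
        catch (HttpRequestException ex)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus, "Payment API unreachable", ex);
        }
        catch (TaskCanceledException ex) when (!cancellationToken.IsCancellationRequested)
        {
            // The client's own Timeout elapsed, not the caller's cancellation.
            return new HealthCheckResult(
                context.Registration.FailureStatus, "Payment API timed out", ex);
        }
    }
}

// Program.cs registration (the base address is a placeholder):
// builder.Services.AddHttpClient(PaymentApiPingCheck.ClientName, client =>
// {
//     client.BaseAddress = new Uri("https://payments.example.internal");
//     client.Timeout = TimeSpan.FromSeconds(8);
// });
// builder.Services.AddHealthChecks()
//     .AddCheck<PaymentApiPingCheck>("payment-api", tags: new[] { "ready" });
```

Setting the timeout on the named client keeps the 8-second limit in one place, so the check body does not need its own cancellation plumbing.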
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 03 · SENIOR
What's the difference between a liveness probe and a readiness probe, and how do ASP.NET Core health check tags help you implement both correctly?
ANSWER
Liveness probes check if the process is alive – if they fail, Kubernetes restarts the pod. Readiness probes check if the pod can serve traffic – if they fail, Kubernetes removes it from the load balancer. You implement both by mapping separate endpoints (e.g., /healthz/live and /healthz/ready) and using the Predicate option with tag filtering. Liveness endpoints should run zero or only self-checks, readiness endpoints should run all dependency checks. Tags like "live" and "ready" let you group checks cleanly.
Q02 of 03 · SENIOR
A colleague says 'our health check endpoint returned Degraded so Kubernetes should have removed the pod from the load balancer, but it didn't'. What would you check first, and why?
ANSWER
First, check the HealthCheckOptions.ResultStatusCodes mapping on the readiness endpoint. By default, Degraded maps to HTTP 200, not 503. Kubernetes HTTP probes treat any status from 200 to 399 as success, so a 200 means 'ready'. If Degraded is not explicitly mapped to 503, K8s ignores it. Second, verify that the check is registered with the "ready" tag and that the endpoint's Predicate includes that tag. Third, confirm the probe configuration in the Kubernetes deployment YAML points to the correct readiness endpoint path.
Q03 of 03 · SENIOR
You have five microservices. The payment service depends on a shared Redis cluster. If Redis goes down, should the payment service report Unhealthy or Degraded? How does your answer change the Kubernetes behaviour, and how do you configure that distinction in ASP.NET Core?
ANSWER
If Redis is a cache that can be skipped (e.g., fallback to database), report Degraded. If Redis is critical for every request (e.g., session state, rate limiting), report Unhealthy. For Degraded, set failureStatus: HealthStatus.Degraded on the check registration. With the default mapping, a Degraded pod returns 200 and keeps receiving traffic; map Degraded to 503 via ResultStatusCodes on the readiness endpoint only if you want K8s to stop routing traffic to it. For Unhealthy, no extra configuration is needed: failureStatus defaults to Unhealthy, and the default mapping already returns 503. The key distinction is what Kubernetes does with the pod: Degraded can keep it in rotation while signalling a problem, Unhealthy pulls it from rotation, and neither restarts it as long as the check stays off the liveness probe.
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
How do I add health checks to an existing ASP.NET Core app without breaking anything?
Add builder.Services.AddHealthChecks() in Program.cs and call app.MapHealthChecks("/healthz") before app.Run(). This adds a new endpoint and touches nothing else in your app. You can start with zero checks registered (the endpoint just returns 'Healthy') and add real checks incrementally. There's no risk of breaking existing routes.
02
What NuGet packages do I need for health checks in ASP.NET Core?
The core health check middleware is built into Microsoft.AspNetCore.Diagnostics.HealthChecks, which ships with the ASP.NET Core SDK — no extra package needed. For the visual UI dashboard you need AspNetCore.HealthChecks.UI, AspNetCore.HealthChecks.UI.Client, and a storage package like AspNetCore.HealthChecks.UI.InMemory.Storage. Community packages like AspNetCore.HealthChecks.SqlServer exist for common dependencies but the custom IHealthCheck approach shown above gives you more control.
03
Can I use health checks with .NET Framework or only .NET Core?
The health check middleware (Microsoft.AspNetCore.Diagnostics.HealthChecks) is an ASP.NET Core feature introduced in version 2.2; classic ASP.NET on .NET Framework has no built-in equivalent. If you're on classic ASP.NET, you'd need to hand-roll a similar pattern using an HTTP handler or a custom endpoint, optionally combined with a NuGet package like Polly for timeouts. Upgrading to .NET 6+ is the practical path to getting the full health check ecosystem.
04
How do I add a health check for a background service?
Create a singleton class (e.g., WorkerHealthStatus) that holds the current health state with thread-safe methods. Have your BackgroundService update this object periodically or on failure. Then implement an IHealthCheck that reads from the same singleton. Register both the singleton and the health check in DI. This decouples the actual worker work from the health check invocation, keeping the health check lightweight.