Mid-level 5 min · March 06, 2026

ASP.NET Core Health Checks: Liveness Probe Timeout Restarts

When the database slowed down, the liveness probe timed out after 5 seconds, and every pod restarted within a minute.

Naren · Founder
Quick Answer
  • ASP.NET Core health checks let external systems (K8s, load balancers) query app and dependency status in a standard way.
  • Split liveness (pod alive) and readiness (can serve traffic) onto separate endpoints using tag filtering.
  • Custom IHealthCheck classes with timeouts prevent hung checks from blocking probes – always fail fast.
  • HTTP status mapping via ResultStatusCodes controls how Degraded vs Unhealthy affects infrastructure decisions.
  • The Health Checks UI dashboard gives ops teams visual history and drill-down per check.
  • Never put database checks on the liveness probe – that causes restart storms during partial outages.
Plain-English First

Imagine a hospital with a dashboard showing every patient's vital signs — heart rate, blood pressure, oxygen — all on one screen. A doctor glances at it and instantly knows who needs attention. ASP.NET Core health checks are exactly that dashboard for your application. Instead of patients, you're monitoring your database connection, your message queue, your disk space, and any other system your app depends on. One endpoint, one glance, and you know if everything is healthy or something is about to crash.

When your app is running in production, 'it deployed successfully' is just the beginning. Kubernetes needs to know whether to send traffic to your pod. Your load balancer needs to decide if an instance should be taken out of rotation. Your ops team needs an alert before a full outage hits — not after. Without a structured health check system, you're flying blind. You're relying on a user to tell you something is broken, which is the worst possible monitoring strategy.

Health checks solve a specific, painful problem: how do external systems and internal teams get a reliable, machine-readable signal about whether your application and all of its dependencies are functioning correctly? Before ASP.NET Core 2.2, teams would hand-roll ping endpoints, scatter try-catch blocks across random controllers, and end up with inconsistent, unreliable status pages. The built-in health check middleware standardises all of that — with a clean model for registering checks, aggregating results, and exposing them over HTTP.

By the end of this article you'll know how to register built-in and custom health checks, gate them by tags for different audiences (liveness vs readiness), wire up the visual Health Checks UI dashboard, and avoid the three mistakes that catch almost every developer the first time. You'll also have copy-paste-ready code patterns you can drop into a real project today.

How the Health Check Pipeline Actually Works

Before writing a single line of code, it's worth understanding the architecture — because once you see it, every API decision makes sense.

ASP.NET Core's health check system has three layers. First, you register one or more IHealthCheck implementations with the DI container via AddHealthChecks(). Each check is a small class with a single method — CheckHealthAsync — that returns a HealthCheckResult of Healthy, Degraded, or Unhealthy.

Second, the framework aggregates those results. When the health endpoint is hit, it runs all registered checks (or a filtered subset by tag), collects every result, and computes an overall status. If any check is Unhealthy, the aggregate is Unhealthy. If any is Degraded but none are Unhealthy, the aggregate is Degraded.

Third, the middleware serialises that result and returns an HTTP response. By default it just writes the aggregate status ('Healthy', 'Degraded', or 'Unhealthy') as plain text. But you can swap in a custom response writer to return rich JSON, which is exactly what production systems need.

The key insight here is separation of concerns: the check logic, the aggregation logic, and the serialisation logic are all independent. That's what makes the system so composable.
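
The aggregation rule can be checked without starting a web server. The sketch below is an illustration rather than part of the pipeline you would ship: it builds HealthReport objects by hand (the framework normally does this for you) and lets the two-argument HealthReport constructor compute the overall status from the entries. The AggregationDemo class and the check names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class AggregationDemo
{
    // Builds a report from (name, status) pairs. The two-argument HealthReport
    // constructor derives the overall status from the entries: the worst one wins.
    public static HealthReport Build(params (string Name, HealthStatus Status)[] checks)
    {
        var entries = new Dictionary<string, HealthReportEntry>();
        foreach (var (name, status) in checks)
        {
            entries[name] = new HealthReportEntry(
                status,
                description: null,
                duration: TimeSpan.Zero,
                exception: null,
                data: null);
        }

        return new HealthReport(entries, totalDuration: TimeSpan.Zero);
    }

    public static void Main()
    {
        var degraded = Build(
            ("sql-server", HealthStatus.Healthy),
            ("payment-api", HealthStatus.Degraded));

        var unhealthy = Build(
            ("sql-server", HealthStatus.Unhealthy),
            ("payment-api", HealthStatus.Degraded));

        Console.WriteLine(degraded.Status);   // Degraded
        Console.WriteLine(unhealthy.Status);  // Unhealthy
    }
}
```

A report with no entries at all aggregates to Healthy, which is why an endpoint whose predicate filters out every check still answers 200.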

Program.cs (C#)
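// Program.cs: minimal API style (.NET 6+)
// This is the absolute foundation. Every health check setup starts here.

using Microsoft.AspNetCore.Diagnostics.HealthChecks;   // HealthCheckOptions
using Microsoft.Extensions.Diagnostics.HealthChecks;   // HealthCheckResult

var builder = WebApplication.CreateBuilder(args);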

// Step 1: Register the health check services with the DI container.
// AddHealthChecks() returns an IHealthChecksBuilder you can chain onto.
builder.Services.AddHealthChecks()
    // Register a named check. The name appears in the JSON response
    // so ops teams know exactly WHICH check failed.
    .AddCheck("self", () => HealthCheckResult.Healthy("App is running"))
    
    // Tags let you group checks for different audiences.
    // 'live' = Kubernetes liveness probe (is the process alive?)
    // 'ready' = Kubernetes readiness probe (can it serve traffic?)
    .AddCheck(
        name: "startup-warmup",
        check: () => HealthCheckResult.Healthy("Warm-up complete"),
        tags: new[] { "live" }
    );

var app = builder.Build();

// Step 2: Map the health check endpoints.
// /healthz/live — only runs checks tagged 'live'
// /healthz/ready — only runs checks tagged 'ready'
// /healthz       — runs ALL checks (useful for ops dashboards)
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    // ResponseWriter controls what gets written to the HTTP response body.
    // WriteResponse is a static helper defined later, in HealthCheckResponseWriter.cs.
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    // Predicate filters which checks run on this endpoint.
    // Here we only run checks tagged 'live'.
    Predicate = check => check.Tags.Contains("live"),
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

app.Run();
Output
// GET /healthz
// HTTP 200 OK
{
  "status": "Healthy",
  "totalDuration": "00:00:00.0012345",
  "entries": {
    "self": {
      "status": "Healthy",
      "description": "App is running",
      "duration": "00:00:00.0001234"
    },
    "startup-warmup": {
      "status": "Healthy",
      "description": "Warm-up complete",
      "duration": "00:00:00.0000987"
    }
  }
}
Why Two Endpoints?
Kubernetes uses both a liveness probe (/healthz/live) and a readiness probe (/healthz/ready). Liveness asks 'is the process still alive, or is it hung beyond recovery?'; if it fails, K8s restarts the container. Readiness asks 'is this pod ready to receive traffic?'; if it fails, K8s removes the pod from the Service's endpoints but doesn't restart it. Mixing all your checks on one endpoint means a slow database query could trigger a restart, which is almost never what you want.
Production Insight
The default response writer returns only the aggregate status ('Healthy', 'Degraded', or 'Unhealthy') as plain text.
Engineers in production need to know which check failed and why.
Always replace the default writer with a custom JSON writer that includes entry details and exceptions.
Key Takeaway
Health checks have three independent layers: registration, aggregation, serialisation.
Understand how they connect before writing your first check.
Separate concerns mean you can replace any layer without touching the others.

Writing a Real Custom Health Check — Database + External API

The built-in lambda-style checks are fine for demos, but production systems need proper IHealthCheck implementations. This is where the pattern gets genuinely powerful.

A well-written health check does three things: it detects a real failure condition (not just 'can I reach the host'), it includes diagnostic data in the result so engineers can debug without reading logs, and it fails fast — it has a timeout so a slow dependency doesn't hold up your entire health endpoint.

Let's build two concrete examples: a SQL Server check that validates query execution (not just connection), and an external HTTP API check that confirms the downstream service is actually responding correctly.

Notice the pattern in both checks: the try/catch returns Unhealthy with the exception message as the description. That description surfaces in the JSON response, which means your on-call engineer sees the actual error message — not just a red dot on a dashboard.

SqlServerHealthCheck.cs (C#)
// SqlServerHealthCheck.cs
// A production-grade database health check that validates the connection
// AND confirms the database can execute a real query — not just ping.

using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Data.SqlClient;

public class SqlServerHealthCheck : IHealthCheck
{
    private readonly string _connectionString;
    
    // Inject the connection string via DI rather than hard-coding it.
    // In production this comes from IConfiguration / environment variables.
    public SqlServerHealthCheck(IConfiguration configuration)
    {
        _connectionString = configuration.GetConnectionString("DefaultConnection")
            ?? throw new InvalidOperationException("DefaultConnection string is not configured.");
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
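            // Use a short timeout: health checks should fail fast.
            // 5 seconds is an upper bound for a DB ping. Keep it below the probe's
            // own timeout (Kubernetes timeoutSeconds defaults to 1 second), or the
            // probe gives up before this check can report a result.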
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(TimeSpan.FromSeconds(5));

            await using var connection = new SqlConnection(_connectionString);
            await connection.OpenAsync(cts.Token);

            // Run a trivial query — SELECT 1 confirms the DB engine is
            // accepting queries, not just that the TCP port is open.
            await using var command = connection.CreateCommand();
            command.CommandText = "SELECT 1";
            await command.ExecuteScalarAsync(cts.Token);

            // Include useful diagnostic data in the result.
            // This appears in the JSON response and in health check UI.
            var data = new Dictionary<string, object>
            {
                { "database", connection.Database },
                { "server", connection.DataSource }
            };

            return HealthCheckResult.Healthy(
                description: "SQL Server is reachable and accepting queries.",
                data: data
            );
        }
        catch (OperationCanceledException)
        {
            // Distinguish a timeout from a general failure —
            // timeouts and connection errors need different ops responses.
            return HealthCheckResult.Unhealthy(
                description: "SQL Server health check timed out after 5 seconds."
            );
        }
        catch (Exception ex)
        {
            // The exception message goes into the description field
            // so it shows up directly in your monitoring dashboard.
            return HealthCheckResult.Unhealthy(
                description: $"SQL Server check failed: {ex.Message}",
                exception: ex
            );
        }
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// ExternalPaymentApiHealthCheck.cs
// Checks that a critical downstream HTTP dependency is healthy.
// Uses a named HttpClient registered via IHttpClientFactory — the correct
// pattern for health checks, which must not create HttpClient instances
// directly (causes socket exhaustion).

public class ExternalPaymentApiHealthCheck : IHealthCheck
{
    private readonly IHttpClientFactory _httpClientFactory;

    public ExternalPaymentApiHealthCheck(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Use the named client configured in Program.cs.
            // Named clients have pre-configured BaseAddress, timeout, etc.
            var httpClient = _httpClientFactory.CreateClient("PaymentApiClient");

            // Hit the payment API's own health endpoint rather than a
            // business endpoint — avoids triggering real business logic.
            using var response = await httpClient.GetAsync("/health", cancellationToken);

            if (response.IsSuccessStatusCode)
            {
                return HealthCheckResult.Healthy(
                    description: $"Payment API responded with {(int)response.StatusCode}."
                );
            }

            // Degraded = the service is reachable but not fully healthy.
            // This is useful when a dependency is slow or partially down.
            return HealthCheckResult.Degraded(
                description: $"Payment API returned unexpected status: {(int)response.StatusCode}."
            );
        }
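        catch (HttpRequestException ex)
        {
            // Report the registration's failureStatus (Degraded in the
            // registration below) instead of hard-coding Unhealthy. The framework
            // only applies failureStatus automatically when a check throws.
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: $"Cannot reach Payment API: {ex.Message}",
                exception: ex
            );
        }
        catch (TaskCanceledException)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus,
                description: "Payment API health check timed out."
            );
        }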
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Program.cs — registering both custom checks

builder.Services.AddHttpClient("PaymentApiClient", client =>
{
    client.BaseAddress = new Uri("https://api.paymentprovider.com");
    // Set a tight timeout — do not rely on the default 100s HttpClient timeout.
    client.Timeout = TimeSpan.FromSeconds(8);
});

builder.Services.AddHealthChecks()
    .AddCheck<SqlServerHealthCheck>(
        name: "sql-server",
        failureStatus: HealthStatus.Unhealthy,  // a DB failure = fully unhealthy
        tags: new[] { "ready", "db" }
    )
    .AddCheck<ExternalPaymentApiHealthCheck>(
        name: "payment-api",
        failureStatus: HealthStatus.Degraded,   // payment API down = degraded, not dead
        tags: new[] { "ready", "external" }
    );
Output
// GET /healthz/ready when SQL Server is fine but the Payment API returns 503
// HTTP 200 OK (Degraded still returns 200 by default; see the ResultStatusCodes section)
{
  "status": "Degraded",
  "totalDuration": "00:00:00.2341567",
  "entries": {
    "sql-server": {
      "status": "Healthy",
      "description": "SQL Server is reachable and accepting queries.",
      "duration": "00:00:00.0234567",
      "data": {
        "database": "AppDb",
        "server": "prod-sql-01.internal"
      }
    },
    "payment-api": {
      "status": "Degraded",
      "description": "Payment API returned unexpected status: 503.",
      "duration": "00:00:00.2107000",
      "data": {}
    }
  }
}
Watch Out: Never new up HttpClient in a health check
Creating new HttpClient() inside CheckHealthAsync is a classic socket exhaustion bug — health checks run frequently (every few seconds in K8s), so you'll blow through available sockets fast. Always inject IHttpClientFactory and call CreateClient(). It's one extra line of setup in Program.cs and it eliminates the entire problem.
Production Insight
Health checks that create HttpClient directly cause socket exhaustion.
Factory-managed clients reuse connections and respect DNS changes.
Rule: always use IHttpClientFactory for any HTTP-dependent health check.
Key Takeaway
Write checks that detect real failure, include diagnostic data, and fail fast.
Timeouts are mandatory: a hung check holds up the response for the whole endpoint.
Never new up HttpClient inside CheckHealthAsync.
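
Besides the in-check CancelAfter pattern shown earlier, AddCheck also accepts a per-registration timeout. The sketch below is a minimal illustration under stated assumptions: HangingDependencyCheck and TimeoutDemo are hypothetical names, and the 200 ms value is arbitrary. The registration timeout works by cancelling the token passed to CheckHealthAsync, so it only helps if the check actually observes that token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check that never completes on its own. It only ends when the
// token it is given is cancelled, which simulates a hung dependency.
public sealed class HangingDependencyCheck : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        await Task.Delay(Timeout.InfiniteTimeSpan, cancellationToken);
        return HealthCheckResult.Healthy();
    }
}

public static class TimeoutDemo
{
    public static async Task<HealthReport> RunAsync(TimeSpan timeout)
    {
        var services = new ServiceCollection();
        services.AddLogging(); // HealthCheckService needs an ILogger
        services.AddHealthChecks()
            .AddCheck<HangingDependencyCheck>(
                name: "hanging-dependency",
                failureStatus: HealthStatus.Degraded,
                tags: new[] { "ready" },
                timeout: timeout); // the framework cancels the check after this

        await using var provider = services.BuildServiceProvider();
        var healthChecks = provider.GetRequiredService<HealthCheckService>();
        return await healthChecks.CheckHealthAsync();
    }

    public static async Task Main()
    {
        var report = await RunAsync(TimeSpan.FromMilliseconds(200));

        // The check was cancelled by the registration timeout, so the framework
        // reports the registration's failureStatus.
        Console.WriteLine(report.Entries["hanging-dependency"].Status); // Degraded
    }
}
```

Whichever mechanism you use, the effective limit has to be shorter than the probe's own timeout, otherwise the probe fails before the check can report anything.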

Custom JSON Response Writer and the Health Checks UI Dashboard

The default health check response is a single word: 'Healthy', 'Degraded', or 'Unhealthy'. That's fine for Kubernetes probes, but it's useless for a human engineer trying to diagnose a problem. You need a JSON response that includes every check name, its status, its description, and how long it took.

ASP.NET Core lets you swap in a custom ResponseWriter — a delegate of type Func<HttpContext, HealthReport, Task>. You write it once, pass it to every HealthCheckOptions instance, and every endpoint automatically returns rich JSON.

For a visual dashboard, the AspNetCore.HealthChecks.UI NuGet package gives you a ready-made React UI that polls your health endpoints and shows a live status board. It's genuinely useful for ops teams — and it takes about ten minutes to set up.

The UI reads the list of health check URIs to monitor either from code (AddHealthCheckEndpoint) or from a HealthChecksUI section in appsettings.json. Either way, it can monitor multiple services, not just the current app, which makes it a lightweight centralised health dashboard.

HealthCheckResponseWriter.cs (C#)
// HealthCheckResponseWriter.cs
// A reusable JSON response writer that returns rich diagnostic output.
// Reference this from every MapHealthChecks call.

using System.Text.Json;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthCheckResponseWriter
{
    public static Task WriteResponse(HttpContext context, HealthReport report)
    {
        // Always return JSON — never let this endpoint return HTML.
        context.Response.ContentType = "application/json; charset=utf-8";

        // Map each health check entry to a serialisable anonymous object.
        var responseBody = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration.ToString(),
            entries = report.Entries.ToDictionary(
                entry => entry.Key,   // check name e.g. "sql-server"
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    duration = entry.Value.Duration.ToString(),
                    // Serialise the exception message if one was captured.
                    // This is invaluable for on-call debugging.
                    exception = entry.Value.Exception?.Message,
                    data = entry.Value.Data
                }
            )
        };

        // Use camelCase to match the convention of JSON APIs everywhere.
        var jsonOptions = new JsonSerializerOptions
        {
            WriteIndented = true,
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
        };

        return context.Response.WriteAsync(
            JsonSerializer.Serialize(responseBody, jsonOptions)
        );
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Program.cs additions for Health Checks UI
// Install: dotnet add package AspNetCore.HealthChecks.UI
//          dotnet add package AspNetCore.HealthChecks.UI.Client
//          dotnet add package AspNetCore.HealthChecks.UI.InMemory.Storage

using HealthChecks.UI.Client;   // provides UIResponseWriter

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    .AddCheck<SqlServerHealthCheck>("sql-server", tags: new[] { "ready", "db" })
    .AddCheck<ExternalPaymentApiHealthCheck>("payment-api", tags: new[] { "ready", "external" });

// Register the UI services and configure in-memory storage for check history.
builder.Services
    .AddHealthChecksUI(settings =>
    {
        // How often the UI polls the health endpoint (in seconds).
        settings.SetEvaluationTimeInSeconds(15);
        
        // Maximum number of history entries to retain per endpoint.
        settings.MaximumHistoryEntriesPerEndpoint(50);
        
        // Register the endpoint the UI will poll.
        // The name shows up as a label in the UI dashboard.
        settings.AddHealthCheckEndpoint(
            name: "Production App",
            uri: "/healthz"
        );
    })
    .AddInMemoryStorage();   // stores check history in-process (use SQL for multi-instance)

var app = builder.Build();

// The /healthz endpoint uses the UI client's response writer.
// UIResponseWriter.WriteHealthCheckUIResponse outputs the exact JSON format
// the UI dashboard expects — richer than our custom writer.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Serve the Health Checks UI dashboard at /healthchecks-ui
// Restrict this to internal networks in production!
app.MapHealthChecksUI(options =>
{
    options.UIPath = "/healthchecks-ui";
    options.ApiPath = "/healthchecks-api";
});

app.Run();

// ─────────────────────────────────────────────────────────────────────────────
// appsettings.json: configuration-driven alternative for multi-service UI monitoring
// (When using AddHealthCheckEndpoint() in code, this section is optional
// but useful for environment-specific overrides via environment variables.)
/*
{
  "HealthChecksUI": {
    "HealthChecks": [
      {
        "Name": "Production App",
        "Uri": "https://myapp.internal/healthz"
      },
      {
        "Name": "Background Worker",
        "Uri": "https://worker.internal/healthz"
      }
    ],
    "EvaluationTimeInSeconds": 15,
    "MaximumHistoryEntriesPerEndpoint": 50
  }
}
*/
Output
// Navigate to https://localhost:5001/healthchecks-ui
// You'll see a dashboard with:
// - A green/yellow/red status badge per registered service
// - A timeline chart showing health history
// - Drill-down per check showing description, duration, exception
// The /healthz JSON response looks like:
{
  "status": "Healthy",
  "totalDuration": "00:00:00.0342100",
  "entries": {
    "sql-server": {
      "status": "Healthy",
      "description": "SQL Server is reachable and accepting queries.",
      "duration": "00:00:00.0234100",
      "exception": null,
      "data": { "database": "AppDb", "server": "prod-sql-01.internal" }
    },
    "payment-api": {
      "status": "Healthy",
      "description": "Payment API responded with 200.",
      "duration": "00:00:00.0108000",
      "exception": null,
      "data": {}
    }
  }
}
Pro Tip: Secure your health UI in production
The /healthchecks-ui endpoint exposes infrastructure details: server names, connection strings in exception messages, latency data. Gate it behind a network policy or add app.MapHealthChecksUI().RequireAuthorization("InternalOnly") with an IP-restriction policy. Exposing it publicly is a real security risk.
Production Insight
Health Checks UI exposes server names and exception details.
Without network or authorization gating, you leak internal topology.
Rule: treat the UI endpoint as internal infrastructure — never public.
Key Takeaway
Custom ResponseWriter gives you full control over JSON shape.
The UI dashboard takes about ten minutes to set up and gives you a live ops board.
Always secure the UI endpoint behind network policy or auth.

HTTP Status Codes, Failure Thresholds and the ResultStatusCodes Gotcha

Here's something that surprises almost everyone the first time: by default, ASP.NET Core returns HTTP 200 for both Healthy and Degraded results, and HTTP 503 only for Unhealthy. Kubernetes HTTP probes treat any status code from 200 to 399 as success, so a readiness probe won't remove a degraded pod from the load balancer. If 'degraded' for you means 'stop sending traffic here', you need to override this.

You control the HTTP status code mapping via HealthCheckOptions.ResultStatusCodes. It's a dictionary from HealthStatus to HTTP status code. Changing Degraded to map to 503 tells K8s to remove the pod from rotation when any check is degraded.

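There's also the FailureStatus concept, set per check registration rather than per endpoint. The framework reports it when a check throws an unhandled exception or exceeds its registered timeout; a check that explicitly returns HealthCheckResult.Unhealthy() is still reported as Unhealthy. To make a check honour the setting in its own failure paths, return new HealthCheckResult(context.Registration.FailureStatus, ...). Setting failureStatus: HealthStatus.Degraded on a non-critical check then means that check can fail without taking the whole service offline.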

These two levers together give you very fine-grained control over how dependency failures propagate to your infrastructure.
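
To make the FailureStatus behaviour concrete, here is a small sketch that registers three checks with the same failureStatus and runs them through HealthCheckService. LambdaCheck and FailureStatusDemo are hypothetical helpers introduced only for this illustration; the three registrations differ only in how they signal failure.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: wraps a delegate so all three behaviours fit in one file.
public sealed class LambdaCheck : IHealthCheck
{
    private readonly Func<HealthCheckContext, HealthCheckResult> _run;

    public LambdaCheck(Func<HealthCheckContext, HealthCheckResult> run) => _run = run;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
        => Task.FromResult(_run(context));
}

public static class FailureStatusDemo
{
    public static async Task<HealthReport> RunAsync()
    {
        var services = new ServiceCollection();
        services.AddLogging(); // HealthCheckService needs an ILogger
        services.AddHealthChecks()
            // Returns Unhealthy explicitly: failureStatus is NOT applied.
            .AddCheck("explicit",
                new LambdaCheck(_ => HealthCheckResult.Unhealthy("dependency down")),
                failureStatus: HealthStatus.Degraded)
            // Throws: the framework catches the exception and reports failureStatus.
            .AddCheck("throws",
                new LambdaCheck(_ => throw new TimeoutException("dependency down")),
                failureStatus: HealthStatus.Degraded)
            // Reads the registration's failureStatus itself.
            .AddCheck("honours",
                new LambdaCheck(ctx =>
                    new HealthCheckResult(ctx.Registration.FailureStatus, "dependency down")),
                failureStatus: HealthStatus.Degraded);

        await using var provider = services.BuildServiceProvider();
        return await provider.GetRequiredService<HealthCheckService>().CheckHealthAsync();
    }

    public static async Task Main()
    {
        var report = await RunAsync();
        Console.WriteLine(report.Entries["explicit"].Status); // Unhealthy
        Console.WriteLine(report.Entries["throws"].Status);   // Degraded
        Console.WriteLine(report.Entries["honours"].Status);  // Degraded
        Console.WriteLine(report.Status);                     // Unhealthy (worst entry wins)
    }
}
```

The practical consequence: if a non-critical check catches its own exceptions, it has to return context.Registration.FailureStatus itself, otherwise the downgrade configured at registration never takes effect.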

HealthCheckStatusCodeConfig.cs (C#)
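// Program.cs: demonstrating ResultStatusCodes and FailureStatus configuration
// This is the production-ready pattern for a Kubernetes-hosted service.

using HealthChecks.UI.Client;                           // UIResponseWriter
using Microsoft.AspNetCore.Diagnostics.HealthChecks;    // HealthCheckOptions
using Microsoft.Extensions.Diagnostics.HealthChecks;    // HealthStatus

var builder = WebApplication.CreateBuilder(args);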

builder.Services.AddHealthChecks()
    // Critical dependency: DB down = service is Unhealthy
    .AddCheck<SqlServerHealthCheck>(
        name: "sql-server",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "ready" }
    )
    // Important but non-critical: cache down = service is Degraded
    // The service can still serve traffic without Redis, just slower.
    .AddCheck<RedisCacheHealthCheck>(
        name: "redis-cache",
        failureStatus: HealthStatus.Degraded,  // downgrade the severity
        tags: new[] { "ready" }
    )
    // External dependency: payment API down = Degraded (we can queue transactions)
    .AddCheck<ExternalPaymentApiHealthCheck>(
        name: "payment-api",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready" }
    );

var app = builder.Build();

// Liveness endpoint: runs no registered checks at all.
// A liveness failure triggers a container RESTART. Keep this minimal.
// Do NOT include DB or external checks here — a slow DB causes
// restart loops, which makes an outage dramatically worse.
app.MapHealthChecks("/healthz/live", new HealthCheckOptions
{
    Predicate = _ => false,   // run NO registered checks — just return 200
    ResponseWriter = HealthCheckResponseWriter.WriteResponse
});

// Readiness endpoint — all 'ready' tagged checks.
// A readiness failure removes the pod from the load balancer.
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthCheckResponseWriter.WriteResponse,

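    // THE KEY CHANGE: map Degraded to 503 so K8s stops sending traffic
    // when any dependency is struggling, even if not fully failed.
    // Caution: a shared dependency degrades every pod at once, so only tag
    // a check 'ready' if losing it should really take pods out of rotation.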
    ResultStatusCodes =
    {
        [HealthStatus.Healthy]   = StatusCodes.Status200OK,
        [HealthStatus.Degraded]  = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    }
});

// Full check endpoint — for ops dashboards and manual inspection.
// Returns 200 even when Degraded so the dashboard doesn't show false alarms.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
    // No ResultStatusCodes override — uses default (Degraded = 200)
});

app.Run();

// ─────────────────────────────────────────────────────────────────────────────
// RedisCacheHealthCheck.cs — a lightweight example showing Degraded usage

using StackExchange.Redis;

public class RedisCacheHealthCheck : IHealthCheck
{
    private readonly IConnectionMultiplexer _redis;

    public RedisCacheHealthCheck(IConnectionMultiplexer redis)
    {
        _redis = redis;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // IsConnected is synchronous — no await needed for a connection check.
        if (_redis.IsConnected)
        {
            return Task.FromResult(
                HealthCheckResult.Healthy("Redis is connected.")
            );
        }

        // Return Degraded here — the service can operate without cache,
        // but performance will degrade. Ops should know.
        // The FailureStatus on the registration (Degraded) means even
        // if this throws an exception, it reports as Degraded not Unhealthy.
        return Task.FromResult(
            HealthCheckResult.Degraded("Redis is not connected. Operating without cache.")
        );
    }
}
Output
// GET /healthz/ready when Redis is disconnected
// HTTP 503 Service Unavailable <-- Kubernetes now removes this pod from rotation
{
  "status": "Degraded",
  "totalDuration": "00:00:00.0089234",
  "entries": {
    "sql-server": {
      "status": "Healthy",
      "description": "SQL Server is reachable and accepting queries."
    },
    "redis-cache": {
      "status": "Degraded",
      "description": "Redis is not connected. Operating without cache."
    },
    "payment-api": {
      "status": "Healthy",
      "description": "Payment API responded with 200."
    }
  }
}
// GET /healthz/live always returns 200 while the process is running
// HTTP 200 OK
{
  "status": "Healthy",
  "totalDuration": "00:00:00.0001234",
  "entries": {}
}
Watch Out: Putting DB checks on the liveness probe
If your SQL Server is slow and you've put the DB health check on /healthz/live, Kubernetes will interpret the timeout as a dead process and restart your pod. Now every instance is restarting simultaneously while the DB recovers — turning a degraded situation into a complete outage. Liveness should only check 'is the process itself alive?'. Readiness handles dependency checks.
Production Insight
Degraded returns HTTP 200 by default — Kubernetes ignores it and keeps sending traffic.
You must explicitly map Degraded to 503 on the readiness endpoint to make K8s react.
Rule: never assume the default status codes match your infrastructure's expectation.
Key Takeaway
ResultStatusCodes maps HealthStatus to HTTP codes per endpoint.
FailureStatus sets the HealthStatus reported when a check throws or times out, per registration.
Combine both levers for fine-grained control over dependency failure propagation.

Health Checks for Background Services and Worker Processes

Not all work happens in request-response cycles. Your app probably runs background services — hosted services that process messages, poll queues, or perform periodic maintenance. If one of those workers stalls, the health endpoint should know about it, even if the main web process is still accepting requests.

The solution is to share state between your BackgroundService and an IHealthCheck implementation, usually via a thread-safe flag or a shared object registered as a singleton. The background service writes its status (last processed timestamp, queue depth, error count), and the health check reads it.

This pattern keeps the health check lightweight and decouples worker logic from health reporting. You get accurate visibility into background activity without making the health check itself execute business logic.
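
One way to sketch the reading side of this pattern is a heartbeat: the worker records a timestamp after each successful iteration, and the health check only compares that timestamp with a staleness threshold. Everything named here (WorkerHeartbeat, WorkerHeartbeatHealthCheck, the 30-second threshold, the injected clock) is a hypothetical illustration and is separate from the status-holder approach in the listing that follows.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical singleton the worker updates after every successful iteration.
public sealed class WorkerHeartbeat
{
    private long _lastBeatTicks; // UTC ticks; 0 means "never reported"

    public void Beat(DateTime utcNow) => Interlocked.Exchange(ref _lastBeatTicks, utcNow.Ticks);

    public DateTime? LastBeatUtc
    {
        get
        {
            var ticks = Interlocked.Read(ref _lastBeatTicks);
            return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
        }
    }
}

// Reads the heartbeat and fails when the worker has been silent for too long.
public sealed class WorkerHeartbeatHealthCheck : IHealthCheck
{
    private readonly WorkerHeartbeat _heartbeat;
    private readonly TimeSpan _maxSilence;
    private readonly Func<DateTime> _utcNow; // injected clock keeps the check testable

    public WorkerHeartbeatHealthCheck(WorkerHeartbeat heartbeat, TimeSpan maxSilence, Func<DateTime> utcNow)
    {
        _heartbeat = heartbeat;
        _maxSilence = maxSilence;
        _utcNow = utcNow;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var last = _heartbeat.LastBeatUtc;
        if (last is null)
        {
            return Task.FromResult(new HealthCheckResult(
                context.Registration.FailureStatus, "Worker has not reported yet."));
        }

        var silence = _utcNow() - last.Value;
        return Task.FromResult(silence <= _maxSilence
            ? HealthCheckResult.Healthy($"Last heartbeat {silence.TotalSeconds:F0}s ago.")
            : new HealthCheckResult(
                context.Registration.FailureStatus, $"No heartbeat for {silence.TotalSeconds:F0}s."));
    }
}

public static class HeartbeatDemo
{
    public static async Task Main()
    {
        var now = new DateTime(2026, 3, 6, 12, 0, 0, DateTimeKind.Utc);
        var heartbeat = new WorkerHeartbeat();
        var check = new WorkerHeartbeatHealthCheck(heartbeat, TimeSpan.FromSeconds(30), () => now);
        var context = new HealthCheckContext
        {
            Registration = new HealthCheckRegistration("queue-worker", check, HealthStatus.Unhealthy, tags: null)
        };

        heartbeat.Beat(now.AddSeconds(-5));
        Console.WriteLine((await check.CheckHealthAsync(context)).Status); // Healthy

        heartbeat.Beat(now.AddSeconds(-120));
        Console.WriteLine((await check.CheckHealthAsync(context)).Status); // Unhealthy
    }
}
```

The advantage of a timestamp over a plain flag is that a worker that hangs silently, without throwing, still shows up as unhealthy once the threshold passes.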

WorkerHealthCheck.cs (C#)
// BackgroundQueueProcessor.cs
// A BackgroundService that updates a shared health status object.

public class BackgroundQueueProcessor : BackgroundService
{
    private readonly ILogger<BackgroundQueueProcessor> _logger;
    private readonly WorkerHealthStatus _status;

    public BackgroundQueueProcessor(
        ILogger<BackgroundQueueProcessor> logger,
        WorkerHealthStatus status)  // registered as singleton
    {
        _logger = logger;
        _status = status;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        _logger.LogInformation("Queue processor started.");
        _status.SetHealthy("Worker running, processing queue");

        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                // Simulate processing a batch of messages
                await Task.Delay(1000, stoppingToken);

                // Update health status with last run time
                _status.SetHealthy(
                    $"Queue processed at {DateTime.UtcNow:O}",
                    new Dictionary<string, object>
                    {
                        ["lastRun"] = DateTime.UtcNow,
                        ["processedCount"] = Interlocked.Increment(ref _processedCount)
                    }
                );
            }
            catch (OperationCanceledException)
            {
                // Graceful shutdown
                _status.SetDegraded("Worker stopping");
                break;
            }
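            catch (Exception ex)
            {
                _logger.LogError(ex, "Queue processing failed");
                _status.SetUnhealthy($"Queue processing failed: {ex.Message}");

                // Back off before retrying to avoid a tight failure loop.
                // The delay throws if shutdown is requested mid-wait, so guard it
                // and leave the loop cleanly instead of escaping ExecuteAsync.
                try
                {
                    await Task.Delay(5000, stoppingToken);
                }
                catch (OperationCanceledException)
                {
                    break;
                }
            }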
        }

        _status.SetDegraded("Worker stopped");
    }

    private long _processedCount;
}

// WorkerHealthStatus.cs
// Thread-safe health status holder for background workers.

public class WorkerHealthStatus
{
    private HealthStatus _status = HealthStatus.Unhealthy;
    private string _description = "Not started";
    private Dictionary<string, object> _data = new();
    private readonly object _lock = new();

    public void SetHealthy(string description, Dictionary<string, object>? data = null)
    {
        lock (_lock)
        {
            _status = HealthStatus.Healthy;
            _description = description;
            _data = data ?? new Dictionary<string, object>();
        }
    }

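    public void SetDegraded(string description)
    {
        lock (_lock)
        {
            _status = HealthStatus.Degraded;
            _description = description;
            // Clear data from the last healthy run so it is not reported as current.
            _data = new Dictionary<string, object>();
        }
    }

    public void SetUnhealthy(string description)
    {
        lock (_lock)
        {
            _status = HealthStatus.Unhealthy;
            _description = description;
            // Clear data from the last healthy run so it is not reported as current.
            _data = new Dictionary<string, object>();
        }
    }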

    public HealthCheckResult GetResult()
    {
        lock (_lock)
        {
            return new HealthCheckResult(_status, _description, data: _data);
        }
    }
}

// BackgroundWorkerHealthCheck.cs
// IHealthCheck that reads from the shared status object.

public class BackgroundWorkerHealthCheck : IHealthCheck
{
    private readonly WorkerHealthStatus _status;

    public BackgroundWorkerHealthCheck(WorkerHealthStatus status)
    {
        _status = status;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Just delegate to the shared status object — no async work needed.
        return Task.FromResult(_status.GetResult());
    }
}

// Program.cs — registration for worker health check

builder.Services.AddSingleton<WorkerHealthStatus>();
builder.Services.AddHostedService<BackgroundQueueProcessor>();

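builder.Services.AddHealthChecks()
    .AddCheck<BackgroundWorkerHealthCheck>(
        name: "queue-worker",
        // failureStatus applies when the check throws (or reads
        // context.Registration.FailureStatus itself). This check returns the
        // worker's own status, so an Unhealthy worker is still reported as Unhealthy.
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "ready", "background" }
    );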
Output
// GET /healthz/ready (worker is processing)
// HTTP 200 OK
{
  "status": "Healthy",
  "entries": {
    "sql-server": { "status": "Healthy" },
    "payment-api": { "status": "Healthy" },
    "queue-worker": {
      "status": "Healthy",
      "description": "Queue processed at 2026-04-22T14:35:10.123Z",
      "data": {
        "lastRun": "2026-04-22T14:35:10.123Z",
        "processedCount": 42
      }
    }
  }
}
// GET /healthz/ready (worker has failed)
// HTTP 503 Service Unavailable (the default mapping for Unhealthy)
{
  "status": "Unhealthy",
  "entries": {
    "sql-server": { "status": "Healthy" },
    "payment-api": { "status": "Healthy" },
    "queue-worker": {
      "status": "Unhealthy",
      "description": "Queue processing failed: Connection to message bus refused",
      "data": {}
    }
  }
}
Why Use a Shared Status Object?
The worker health check doesn't call the message queue every time it's invoked — that would be slow and could overwhelm the queue during an outage. Instead, the background service writes its status periodically, and the health check reads the latest value. This decouples check execution from actual monitoring cost.
Production Insight
Background workers can stall silently while the web layer stays healthy.
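A shared status object lets the health check see reported worker failures immediately. A worker that hangs never reports anything, so also treat a stale lastRun timestamp as a failure.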
Rule: always add a health check for each critical BackgroundService — don't assume it's running just because the process is up.
Key Takeaway
Use a singleton shared status object between BackgroundService and IHealthCheck.
Workers update status periodically; health checks read it — no heavy lifting in the check.
This pattern avoids worker health checks that depend on the very system they monitor.
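One gap in the shared-status pattern is the silent stall: a worker blocked on a hung call never writes a failure, so the last Healthy status stays in place. A minimal sketch of a staleness-aware variant follows. `WorkerHeartbeat`, `StaleWorkerHealthCheck`, and the injectable clock are hypothetical names introduced for illustration, not part of the article's code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical heartbeat holder; the worker calls Beat() after each successful batch.
public sealed class WorkerHeartbeat
{
    private long _lastBeatTicks; // 0 means the worker has never reported

    public void Beat(DateTime utcNow) => Interlocked.Exchange(ref _lastBeatTicks, utcNow.Ticks);

    public DateTime? LastBeatUtc
    {
        get
        {
            var ticks = Interlocked.Read(ref _lastBeatTicks);
            return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
        }
    }
}

// Reports Unhealthy when the last heartbeat is older than the allowed silence.
public sealed class StaleWorkerHealthCheck : IHealthCheck
{
    private readonly WorkerHeartbeat _heartbeat;
    private readonly TimeSpan _maxSilence;
    private readonly Func<DateTime> _utcNow; // injectable clock, for tests

    public StaleWorkerHealthCheck(
        WorkerHeartbeat heartbeat, TimeSpan maxSilence, Func<DateTime>? utcNow = null)
    {
        _heartbeat = heartbeat;
        _maxSilence = maxSilence;
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var last = _heartbeat.LastBeatUtc;
        if (last is null)
        {
            return Task.FromResult(
                HealthCheckResult.Unhealthy("Worker has not completed a batch yet"));
        }

        var silence = _utcNow() - last.Value;
        var data = new Dictionary<string, object>
        {
            ["lastRun"] = last.Value,
            ["silenceSeconds"] = silence.TotalSeconds
        };

        return Task.FromResult(silence > _maxSilence
            ? HealthCheckResult.Unhealthy($"No heartbeat for {silence.TotalSeconds:F0}s", data: data)
            : HealthCheckResult.Healthy("Worker heartbeat is fresh", data));
    }
}
```

Because the constructor takes a `TimeSpan`, one way to register it is by instance: create a single `WorkerHeartbeat`, add it with `AddSingleton`, and pass `new StaleWorkerHealthCheck(heartbeat, TimeSpan.FromSeconds(30))` to `AddCheck` with the "ready" tag. The 30-second threshold is an assumption; pick a value a few times longer than the worker's normal batch interval.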
● Production incident · POST-MORTEM · severity: high

The Restart Storm That Took Down Three Services

Symptom
All pods across three services started restarting in rapid succession. Requests returned 503 errors. The database team reported slow queries due to an unplanned index rebuild, but the apps were crashing, not just slowing down.
Assumption
The team assumed that if the health check returned Unhealthy, Kubernetes would handle it gracefully. They thought setting a high failureThreshold on the liveness probe would buy them time.
Root cause
The liveness probe (/healthz/live) ran the same SQL Server health check as the readiness probe. When the database slowed down, the health check timed out after 5 seconds. Kubernetes saw the timeout, interpreted it as a dead process, and restarted the pod. With multiple replicas, all restarts happened within the same minute, causing complete downtime.
Fix
Changed the liveness endpoint to run zero checks (Predicate = _ => false) so it always returns 200 as long as the process is alive. Moved the database check exclusively to the readiness endpoint. Set the readiness probe's failureThreshold to 3 to tolerate transient slowness before removing pods from rotation.
Key lesson
  • Liveness probes must only check if the process itself is alive, not its dependencies.
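The fix from this incident can be sketched as a small extension method. It assumes dependency checks are tagged "ready" as elsewhere in this article; the `ProbeEndpoints` class and `MapProbes` method are hypothetical names introduced here.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

public static class ProbeEndpoints
{
    // Liveness: the predicate rejects every registration, so no checks run
    // and the endpoint answers 200 whenever the process can serve HTTP.
    public static HealthCheckOptions Liveness { get; } = new HealthCheckOptions
    {
        Predicate = _ => false
    };

    // Readiness: run only the checks tagged "ready" (database, queues, APIs).
    public static HealthCheckOptions Readiness { get; } = new HealthCheckOptions
    {
        Predicate = registration => registration.Tags.Contains("ready")
    };

    public static void MapProbes(this IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/healthz/live", Liveness);
        endpoints.MapHealthChecks("/healthz/ready", Readiness);
    }
}
```

Calling `app.MapProbes();` in Program.cs then exposes both endpoints. A slow database can only fail `/healthz/ready`, which takes the pod out of rotation without restarting it.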
Production debug guide · Common symptoms and the exact actions to diagnose them · 5 entries
Symptom · 01
Health endpoint returns 200 but pod keeps restarting
Fix
Check which endpoint Kubernetes is using as liveness probe. If it includes dependency checks, reconfigure to only use a no-check liveness endpoint.
Symptom · 02
Health endpoint times out after 30 seconds
Fix
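Add an explicit CancellationTokenSource with a short timeout inside CheckHealthAsync. The CancellationToken passed in is cancelled when the probe request is aborted, so unless a timeout is set on the registration it enforces no deadline of its own.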
Symptom · 03
Health check says Degraded but K8s doesn't stop traffic
Fix
Verify ResultStatusCodes mapping on the readiness endpoint. Degraded returns 200 by default – override to map Degraded to 503.
Symptom · 04
JSON response missing exception details
Fix
Ensure your ResponseWriter serializes entry.Value.Exception?.Message. The default writer omits exception data.
Symptom · 05
Health Checks UI shows 'Unhealthy' but app works fine
Fix
Check if the UI endpoint URI is correct. If using different ports or authentication, the UI might receive a 401 or 404, which it interprets as Unhealthy.
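For Symptom 04, a custom response writer along these lines would surface the exception message. This is a sketch under assumptions: `HealthResponseWriters` is a hypothetical helper, and the JSON shape loosely follows the sample output shown earlier rather than any fixed schema.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthResponseWriters
{
    // Builds the JSON body. Kept separate from the HTTP write so it can be
    // unit-tested without an HttpContext.
    public static string ToJson(HealthReport report)
    {
        var payload = new
        {
            status = report.Status.ToString(),
            entries = report.Entries.ToDictionary(
                entry => entry.Key,
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    // Message only: full stack traces leak internals.
                    exception = entry.Value.Exception?.Message,
                    data = entry.Value.Data
                })
        };

        return JsonSerializer.Serialize(payload);
    }

    // Matches the shape of HealthCheckOptions.ResponseWriter.
    public static Task WriteJson(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json; charset=utf-8";
        return context.Response.WriteAsync(ToJson(report));
    }
}
```

It would be attached with `ResponseWriter = HealthResponseWriters.WriteJson` on the endpoint's HealthCheckOptions. Exposing only `Exception.Message` is a deliberate trade-off: enough to diagnose, without publishing stack traces.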
★ Quick Health Check Debug Cheat Sheet · Commands and fixes for the most common health check issues in production
Liveness probe causing restarts
Immediate action
Run `kubectl describe pod <pod-name>` and check the Liveness probe section to see which endpoint and thresholds are configured.
Commands
kubectl get pods --field-selector=status.phase=Running -o custom-columns=NAME:.metadata.name,LIVENESS:.spec.containers[0].livenessProbe.httpGet.path
curl -w '%{http_code}' http://localhost:5000/healthz/live
Fix now
Change liveness probe path to /healthz/live and configure that endpoint to run no checks (Predicate = _ => false).
Health check times out after 30 seconds
Immediate action
Add a 5-second timeout inside CheckHealthAsync using CancellationTokenSource.CreateLinkedTokenSource.
Commands
kubectl exec <pod> -- curl -m 5 http://localhost:5000/healthz/ready
Check application logs for 'OperationCanceledException' – that means the timeout is being triggered.
Fix now
Wrap your health check logic in a using (var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken)) { cts.CancelAfter(TimeSpan.FromSeconds(5)); ... }
Degraded not stopping K8s traffic
Immediate action
Check the HealthCheckOptions on the readiness endpoint – ResultStatusCodes probably missing Degraded -> 503.
Commands
curl -w '%{http_code}' http://localhost:5000/healthz/ready
kubectl get pods -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
Fix now
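Add ResultStatusCodes = new Dictionary<HealthStatus, int> { [HealthStatus.Healthy] = 200, [HealthStatus.Degraded] = 503, [HealthStatus.Unhealthy] = 503 } to the readiness endpoint options. Include an entry for every HealthStatus value; a dictionary that omits one is rejected when it is assigned.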
Health Checks UI shows 'Unhealthy' but app is fine
Immediate action
Verify the URI configured in HealthChecksUI settings matches the actual health endpoint URL including port and path.
Commands
curl -v http://localhost:5000/healthz (compare status code and body)
Check the UI's network tab – is it receiving a 200 response with proper JSON?
Fix now
Update AddHealthCheckEndpoint URI in Program.cs or the HealthChecksUI configuration section in appsettings.json.
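The linked-token timeout recommended in the cheat sheet above can be packaged as a reusable wrapper. This is a sketch, assuming the wrapped probe honors the token it receives; `TimeBoxedHealthCheck` is a hypothetical name, and the probe delegate stands in for whatever dependency call the check makes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical wrapper: runs any probe delegate under a hard deadline.
public sealed class TimeBoxedHealthCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task> _probe;
    private readonly TimeSpan _timeout;

    public TimeBoxedHealthCheck(Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        _probe = probe;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Linked source: cancels on our deadline OR when the caller cancels.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            await _probe(cts.Token);
            return HealthCheckResult.Healthy("Probe completed in time");
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Our deadline fired, not the caller's token: fail fast.
            return HealthCheckResult.Unhealthy(
                $"Probe timed out after {_timeout.TotalSeconds:F0}s");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return HealthCheckResult.Unhealthy("Probe failed", ex);
        }
    }
}
```

The deadline only works if the probe passes the token on to its I/O calls; a probe that ignores the token will still hang. Caller-initiated cancellation is deliberately left to propagate rather than being reported as a dependency failure.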
Liveness vs Readiness Probes in Kubernetes
| Aspect | Liveness Probe (/healthz/live) | Readiness Probe (/healthz/ready) |
| --- | --- | --- |
| Purpose | Is the process itself alive and not deadlocked? | Is the pod ready to receive user traffic? |
| K8s action on failure | Restarts the pod | Removes pod from load balancer rotation |
| Recommended checks | Self-check only (return 200 if process runs) | DB, cache, external APIs, message queues |
| Risk of including DB checks | High: slow DB causes restart storm | Safe: slow DB just pauses traffic to that pod |
| Typical HTTP success code | 200 OK | 200 OK (Healthy) or 503 (if Degraded = 503) |
| Run frequency in K8s | Every 10-30 seconds | Every 10-30 seconds |
| FailureStatus recommendation | N/A: no checks to configure | Unhealthy for critical, Degraded for non-critical |

Key takeaways

1. Split liveness and readiness onto separate endpoints with tag filtering: putting database checks on the liveness probe is the #1 cause of Kubernetes restart storms during partial outages.
2. Always set a hard timeout inside CheckHealthAsync: a health check that hangs for 100 seconds is worse than one that fails fast and returns Unhealthy after 5 seconds.
3. Use ResultStatusCodes to map Degraded to HTTP 503 on the readiness endpoint if you want Kubernetes to stop routing traffic to a pod when any dependency is struggling.
4. The FailureStatus per-check registration and the ResultStatusCodes per-endpoint configuration are independent levers: FailureStatus controls what HealthStatus gets reported, ResultStatusCodes controls what HTTP code that status maps to.
5. Background services need their own health checks via a shared status object: don't assume a healthy web layer means background workers are still running.

Common mistakes to avoid

5 patterns

Exposing /healthz publicly without securing it

Symptom
The endpoint includes server names, database host names, and exception stack traces in its JSON response. An attacker can map your entire infrastructure topology from it.
Fix
Add .RequireAuthorization() to MapHealthChecks and back it with an IP-restriction policy, or limit the endpoint to an internal host name or port with app.MapHealthChecks(...).RequireHost("*.internal"). RequireHost matches on the request's Host header; it does not change which addresses the server listens on.

Forgetting that Degraded returns HTTP 200 by default

Symptom
Developers test their health check, see 'Degraded' in the JSON, assume Kubernetes will react, and are confused when degraded pods keep receiving traffic.
Fix
Explicitly configure ResultStatusCodes in HealthCheckOptions and map HealthStatus.Degraded to StatusCodes.Status503ServiceUnavailable on the readiness endpoint — as shown in the code above.

Running all health checks on the liveness probe

Symptom
When the database is slow, the liveness check times out, Kubernetes restarts every pod simultaneously, and a partial outage becomes a total one.
Fix
Either use Predicate = _ => false on the liveness endpoint (returning 200 unconditionally while the process is up), or only tag lightweight self-checks with 'live' and never tag external dependency checks with it.

Not including a timeout inside CheckHealthAsync

Symptom
A third-party API health check hangs for 100 seconds, which is the HttpClient default timeout. The aggregate health endpoint hangs with it, so Kubernetes probes time out and the pod is restarted.
Fix
Always use CancellationTokenSource.CreateLinkedTokenSource with a short timeout inside CheckHealthAsync. Set per-check timeouts — 5 seconds for databases, 8 seconds for HTTP calls.

Directly creating HttpClient in health checks

Symptom
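Health checks run every 15 seconds, each creating and disposing a new HttpClient. Every disposed client leaves its connection lingering in TIME_WAIT, so with several probers hitting the endpoint (Kubernetes, load balancers, dashboards) the process can eventually run out of sockets and fail with SocketException: Only one usage of each socket address is normally permitted.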
Fix
Inject IHttpClientFactory (registered in DI) and call CreateClient(). For health checks that call HTTP endpoints, always use a named or typed client with a pre-configured timeout.
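The last fix above might look like the following. It is a sketch, not a drop-in implementation: the client name "payment-api-health" and the relative path "health" are assumptions, and the check relies on the named client being registered with a base address and a short timeout.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class PaymentApiHealthCheck : IHealthCheck
{
    // Hypothetical named client; register it once with AddHttpClient.
    public const string ClientName = "payment-api-health";

    private readonly IHttpClientFactory _factory;

    public PaymentApiHealthCheck(IHttpClientFactory factory) => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // The factory pools the underlying handlers, so creating a client
        // per check does not open a fresh socket every time.
        var client = _factory.CreateClient(ClientName);

        try
        {
            // Relative path; the base address comes from the named client.
            using var response = await client.GetAsync("health", cancellationToken);

            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy("Payment API reachable")
                : HealthCheckResult.Unhealthy(
                    $"Payment API returned {(int)response.StatusCode}");
        }
        catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
        {
            // Connection failure, or the client's configured timeout elapsed.
            return HealthCheckResult.Unhealthy("Payment API unreachable or timed out", ex);
        }
    }
}
```

The named client would be registered in Program.cs with `builder.Services.AddHttpClient(PaymentApiHealthCheck.ClientName, c => { c.BaseAddress = ...; c.Timeout = TimeSpan.FromSeconds(8); });`, using your own base address. The 8-second value follows the per-check guidance given earlier in this list.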
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What's the difference between a liveness probe and a readiness probe, an...
Q02 · SENIOR
A colleague says 'our health check endpoint returned Degraded so Kuberne...
Q03 · SENIOR
You have five microservices. The payment service depends on a shared Red...
Q01 of 03 · SENIOR

What's the difference between a liveness probe and a readiness probe, and how do ASP.NET Core health check tags help you implement both correctly?

ANSWER
Liveness probes check if the process is alive – if they fail, Kubernetes restarts the pod. Readiness probes check if the pod can serve traffic – if they fail, Kubernetes removes it from the load balancer. You implement both by mapping separate endpoints (e.g., /healthz/live and /healthz/ready) and using the Predicate option with tag filtering. Liveness endpoints should run zero or only self-checks, readiness endpoints should run all dependency checks. Tags like "live" and "ready" let you group checks cleanly.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How do I add health checks to an existing ASP.NET Core app without breaking anything?
02
What NuGet packages do I need for health checks in ASP.NET Core?
03
Can I use health checks with .NET Framework or only .NET Core?
04
How do I add a health check for a background service?