FastAPI Deployment — Docker, Uvicorn and Gunicorn
- Gunicorn is the right process manager on bare VMs and EC2 — it handles worker crash recovery, graceful shutdown, and multi-core utilisation without an external orchestrator. On Kubernetes, Uvicorn alone with multiple pod replicas is equally valid and often simpler to reason about.
- The (2×CPU)+1 worker formula is for synchronous WSGI workers. For async FastAPI, start with 2–4 workers and benchmark under load — the event loop provides concurrency, not worker count. Adding workers beyond CPU saturation multiplies memory and connection consumption with no throughput gain.
- Use a gunicorn.conf.py file for all Gunicorn configuration: it is versionable, readable, and supports dynamic values. Set max_requests to recycle workers and contain slow memory leaks, and set max_requests_jitter to stagger those restarts so the workers do not all recycle at once (a thundering herd).
- FastAPI requires an ASGI server; Uvicorn handles the async HTTP event loop
- Gunicorn acts as a process manager on bare VMs and EC2: it spawns, monitors, and restarts Uvicorn workers; on Kubernetes, Uvicorn alone with multiple pod replicas is equally valid
- Worker count formula: (2 × CPU cores) + 1 is a heuristic for CPU-bound WSGI apps — for async FastAPI, start with 2–4 workers per container and benchmark; the event loop handles concurrency, not worker count
- Gunicorn's --timeout prevents stuck workers from hanging requests indefinitely — set it to at least 2× your slowest endpoint's p99 latency
- Use --max-requests and --max-requests-jitter to periodically recycle workers and mitigate slow memory leaks without thundering herd restarts
- Biggest mistake: using --reload in production — it polls file descriptors, spikes CPU under no load, and is a developer-only feature that has no place in a production image
- In 2026, @app.on_event is deprecated — use the lifespan context manager for startup and shutdown logic
- Secrets do not belong in docker run -e flags or Dockerfiles — use Kubernetes Secrets, AWS Secrets Manager, or Vault with a CSI driver
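The timeout rule in the takeaways above can be turned into a small calculation. The following is a minimal sketch, assuming you have a list of latency samples (in seconds) for your slowest endpoint. The function names, the nearest-rank percentile method, and the floor_s safety floor are illustrative assumptions, not part of Gunicorn.

```python
def p99(latencies_s: list[float]) -> float:
    """Nearest-rank 99th percentile of latency samples, in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # 1-based nearest rank, computed with integers to avoid float rounding
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_timeout(latencies_s: list[float], factor: int = 2, floor_s: int = 30) -> int:
    """Gunicorn timeout candidate: factor x p99, rounded up, never below floor_s.

    floor_s is a hypothetical safety floor so that very fast services
    do not end up with an unrealistically tight timeout.
    """
    scaled = factor * p99(latencies_s)
    rounded_up = int(scaled) if scaled == int(scaled) else int(scaled) + 1
    return max(floor_s, rounded_up)


if __name__ == "__main__":
    samples = [0.1] * 98 + [12.0, 20.0]
    print(suggested_timeout(samples, floor_s=0))
```

Treat the result as a starting value for timeout in gunicorn.conf.py and revisit it whenever the latency profile of the slowest endpoint changes.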
FastAPI Deployment Troubleshooting Quick Guide
Container exits immediately after start
docker logs <container_id> --tail 100 2>&1 | grep -E '(Error|Exception|Traceback|ValidationError)'
docker run --rm --entrypoint python <image> -c "from io.thecodeforge.main import app; print('import OK')"

Gunicorn starts but workers die immediately with 'address already in use'
docker exec <container> sh -c "ss -tulpn | grep 8000"
docker ps -a | grep 8000 # look for exited containers still holding the port mapping

Health check returns 200 but readiness returns 503
docker exec <container> sh -c "python -c \"from io.thecodeforge.config import settings; print(settings.database_url)\""
docker exec <container> sh -c "python -c \"import asyncio, databases; db = databases.Database('$FORGE_DATABASE_URL'); asyncio.run(db.connect()); print('DB OK')\""

Slow response times under load: p99 latency climbing even with low CPU
docker exec <container> sh -c "ps aux | grep uvicorn | wc -l" # count actual worker processes
docker stats <container> --no-stream # check CPU% per container; if it's near 100%, workers are CPU-saturated

Environment variables not loading inside container: settings validation fails
docker exec <container> env | grep FORGE_ # shows what the process actually receives
docker inspect <container> --format='{{range .Config.Env}}{{println .}}{{end}}' | grep FORGE_

Production Incident
The settings model declared database_url: str instead of database_url: PostgresDsn. An empty string is a valid str. Pydantic accepted it, the app started, the health check passed, the readiness probe (which was the same lightweight /health endpoint, not a separate DB-checking /ready endpoint) returned 200, and Kubernetes routed traffic. The connection failure only surfaced when the first actual database call was made, four minutes into the deployment when the smoke test hit a data-fetching endpoint.

First: changed database_url: str to database_url: PostgresDsn in the Pydantic model. An empty string now raises ValidationError at startup: the app refuses to start and Kubernetes never marks it as ready. Second: separated /health (lightweight, no DB call) from /ready (calls SELECT 1 against the database). Now even if a future misconfiguration produces a valid-looking but wrong URL, the readiness probe catches the connectivity failure before traffic is routed. Also added a Kubernetes Secret validator in the CI pipeline that verifies all referenced Secret keys exist in the target namespace before a deployment is applied.

Production Debug Guide
Symptoms and actions for the failures that actually appear in production, not the ones that are easy to demo.
tracemalloc.start(); then call tracemalloc.take_snapshot() in a route and compare snapshots across requests. Check all async context managers: every 'async with db.transaction()' must exit cleanly or the session leaks.

time.sleep() instead of asyncio.sleep(), synchronous database drivers called from an async function, or a requests.get() instead of httpx.AsyncClient. If the endpoint is legitimately slow (batch processing, large file generation), increase --timeout in gunicorn.conf.py to 2x the endpoint's p99 latency, or move the work to a background task queue (Celery, ARQ, FastAPI BackgroundTasks for lightweight cases).

FastAPI is an ASGI application. That single fact determines your entire deployment stack. Unlike traditional WSGI frameworks where a synchronous worker handles one request at a time, FastAPI's async model means a single Uvicorn worker can hold hundreds of concurrent connections open (waiting on database responses, external APIs, file I/O) without blocking. That's the performance story. The deployment story is what happens when that worker crashes, when your VM has 8 cores sitting idle, when a Kubernetes rolling update sends SIGTERM to your process, or when a junior engineer accidentally commits a .env file with production credentials.
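The concurrency claim above, and the time.sleep() pitfall from the debug guide, can be seen with the standard library alone. This is a minimal sketch: fake_io_call and blocking_handler are hypothetical stand-ins for real handlers, and asyncio.sleep stands in for a database or HTTP wait.

```python
import asyncio
import time


async def fake_io_call(delay_s: float) -> None:
    """Stand-in for an awaited database or HTTP call: yields to the event loop."""
    await asyncio.sleep(delay_s)


async def blocking_handler(delay_s: float) -> None:
    """Stand-in for a handler that calls time.sleep(): blocks the whole loop."""
    time.sleep(delay_s)


async def serve(handler, n_requests: int, delay_s: float) -> float:
    """Run n simulated requests on one event loop; return wall-clock seconds."""
    started = time.perf_counter()
    await asyncio.gather(*(handler(delay_s) for _ in range(n_requests)))
    return time.perf_counter() - started


if __name__ == "__main__":
    concurrent = asyncio.run(serve(fake_io_call, 200, 0.05))
    serialised = asyncio.run(serve(blocking_handler, 5, 0.05))
    print(f"200 awaited calls:  {concurrent:.2f}s")
    print(f"5 blocking calls:   {serialised:.2f}s")
```

The awaited calls overlap, so 200 of them finish in roughly the time of one. The blocking calls run one after another because each holds the event loop for its full duration, which is the pattern that makes p99 latency climb while CPU stays low.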
This guide covers the full production stack: Gunicorn as the process manager on bare infrastructure, Uvicorn as the ASGI worker, multi-stage Docker builds that don't ship your build toolchain to production, Pydantic Settings for configuration that fails loudly at startup rather than silently at 2 AM, and the graceful shutdown sequence that prevents connection pool exhaustion during every deployment.
One clarification up front: on Kubernetes, you don't always need Gunicorn. Kubernetes already does much of a process manager's job: it handles pod restarts, rolling updates, and health-based traffic routing. Running Uvicorn directly with multiple pod replicas, each with a single worker, is a legitimate and increasingly common pattern in 2026. Where Gunicorn earns its place is on bare VMs, EC2 instances, and any environment where you don't have an external orchestrator managing process lifecycle. We'll cover both. Know which one you're in before you copy a deployment command.
The Production Stack: Gunicorn + Uvicorn
Uvicorn is fast. It implements the ASGI specification cleanly, handles HTTP/1.1 and WebSockets (it does not speak HTTP/2; terminate that at a reverse proxy or load balancer if you need it), and processes async requests efficiently. What it doesn't do is manage its own process lifecycle. If the Uvicorn process crashes, it's gone. If your VM has 8 cores, a single Uvicorn process uses one of them. If a worker gets stuck in a CPU-bound loop that never yields to the event loop, every request queued behind it waits indefinitely.
Gunicorn solves all three problems. It operates as a master process — what the Gunicorn docs call the arbiter — that forks multiple worker processes and monitors them via a heartbeat mechanism. If a worker stops sending heartbeats (because it's crashed, hung, or exceeded the timeout), Gunicorn sends SIGKILL and forks a replacement. The master process itself never handles HTTP traffic. It just manages the workers.
The key configuration flag is --worker-class uvicorn.workers.UvicornWorker. This tells Gunicorn to fork workers that run Uvicorn's async event loop rather than Gunicorn's default synchronous worker. Without this flag, Gunicorn spawns sync workers that can't run FastAPI's async code correctly.
On the worker count question: the (2 × CPU) + 1 formula comes from Gunicorn's own documentation and was designed for synchronous, CPU-bound WSGI workers like Django or Flask with sync views. For async FastAPI applications, it's the wrong starting point. Each Uvicorn worker runs a full asyncio event loop that can handle hundreds of concurrent connections. The bottleneck is almost never CPU cores — it's your database connection pool size, external API rate limits, or available memory. Start with 2 workers on a 2-core machine, load test with realistic traffic patterns, watch CPU utilisation and memory, and add workers only when CPU is actually saturated. On a 4-core machine running a typical API-heavy FastAPI app, 4 workers is usually the ceiling before memory pressure outweighs concurrency gains.
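The starting point described above can be written down as a small helper. This is a sketch under stated assumptions: the function names, the floor of 2, and the default cap of 4 are illustrative choices taken from the numbers in this section, and the result is only a first value to load test against.

```python
import os
from typing import Optional


def sync_formula(cores: int) -> int:
    """The classic (2 x CPU) + 1 heuristic, intended for synchronous WSGI workers."""
    return 2 * cores + 1


def suggested_async_workers(cores: Optional[int] = None, cap: int = 4) -> int:
    """Starting worker count for async workers: one per core, at least 2, at most cap.

    The cap of 4 is an assumption based on the guidance in this section;
    raise it only after load testing shows CPU saturation.
    """
    detected = cores if cores is not None else (os.cpu_count() or 1)
    return max(2, min(detected, cap))


if __name__ == "__main__":
    for cores in (2, 4, 8):
        print(cores, "cores ->", suggested_async_workers(cores), "async workers,",
              sync_formula(cores), "by the sync formula")
```

On an 8-core machine the sync formula suggests 17 workers, while this starting point stays at 4. The gap is memory and database connections that the extra workers would hold without adding throughput.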
For production Gunicorn configuration, use a gunicorn.conf.py file rather than a long command-line flag string. It's versionable, readable, and supports Python expressions for dynamic values.
# gunicorn.conf.py: version-control this alongside your application code
# Reference: https://docs.gunicorn.org/en/stable/settings.html
import multiprocessing

# -----------------------------------------------------------------
# Server socket
# -----------------------------------------------------------------
bind = "0.0.0.0:8000"

# -----------------------------------------------------------------
# Worker processes
# For async FastAPI: start low (2-4) and benchmark.
# The (2*CPU)+1 formula is for sync WSGI workers, not async ASGI.
# Each UvicornWorker runs a full asyncio event loop, so concurrency
# comes from the event loop, not from worker count.
# -----------------------------------------------------------------
workers = 2  # tune based on load testing, not the formula
worker_class = "uvicorn.workers.UvicornWorker"

# -----------------------------------------------------------------
# Timeouts
# timeout: kill and restart a worker that hasn't responded in N seconds.
# graceful_timeout: time allowed for in-flight requests to complete
#                   after SIGTERM before workers are force-killed.
# Set timeout to at least 2x your slowest endpoint's p99 latency.
# -----------------------------------------------------------------
timeout = 60
graceful_timeout = 30
keepalive = 5

# -----------------------------------------------------------------
# Worker recycling: critical for long-running services
# max_requests: restart each worker after this many requests.
#               Mitigates slow memory leaks without a full redeploy.
# max_requests_jitter: randomises the restart point per worker.
#               Without jitter, all workers restart simultaneously
#               (thundering herd) causing a brief availability dip.
# -----------------------------------------------------------------
max_requests = 1000
max_requests_jitter = 100

# -----------------------------------------------------------------
# Logging: always route to stdout/stderr in containers
# accesslog "-" and errorlog "-" send to stdout/stderr so your
# container runtime (Docker, Kubernetes) captures them.
# -----------------------------------------------------------------
accesslog = "-"
errorlog = "-"
loglevel = "info"

# Structured access log format for JSON log aggregators (Datadog, Loki, CloudWatch)
# Fields: forwarded client address, method, path, status, duration in milliseconds.
# %(M)s is the request time in milliseconds; %(D)s would be microseconds.
access_log_format = '{"remote": "%({X-Forwarded-For}i)s", "method": "%(m)s", "path": "%(U)s", "status": %(s)s, "duration_ms": %(M)s}'

# -----------------------------------------------------------------
# Process naming: visible in ps aux, htop, and monitoring dashboards
# -----------------------------------------------------------------
proc_name = "thecodeforge-api"
[2026-01-15 10:00:00] [INFO] Listening at: http://0.0.0.0:8000
[2026-01-15 10:00:00] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2026-01-15 10:00:00] [INFO] Booting worker with pid: 42
[2026-01-15 10:00:00] [INFO] Booting worker with pid: 43
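To see why max_requests_jitter matters, it helps to look at the restart points it produces. The sketch below is an illustration, not Gunicorn's source: it assumes each worker adds a random offset between 0 and max_requests_jitter to max_requests, and restart_thresholds is a hypothetical helper name.

```python
import random
from typing import List, Optional


def restart_thresholds(
    n_workers: int,
    max_requests: int,
    max_requests_jitter: int,
    seed: Optional[int] = None,
) -> List[int]:
    """Per-worker request counts at which each worker would be recycled."""
    rng = random.Random(seed)
    return [
        max_requests + rng.randint(0, max_requests_jitter)
        for _ in range(n_workers)
    ]


if __name__ == "__main__":
    # Without jitter every worker recycles at the same request count.
    print("no jitter:  ", restart_thresholds(4, 1000, 0))
    # With jitter the restart points are spread across a 100-request window.
    print("with jitter:", restart_thresholds(4, 1000, 100))
```

With evenly balanced traffic, identical thresholds mean the workers reach their limit at nearly the same moment. Spreading the thresholds keeps most workers serving while one restarts.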
fork(), each worker has its own event loop, but the connection objects reference the master's loop, which no longer exists in the worker. The first database call produces a cryptic 'Future attached to a different loop' error. We've debugged this at 2 AM more than once.

Production-Grade Multi-Stage Dockerfile
A naive Dockerfile copies your entire project directory, installs dependencies, and produces an image that ships your build toolchain, compiler headers, pip cache, .git history, and local .env files to every environment it runs in. In a security audit this is a finding. In a CI pipeline it's a slow build. In production it's an unnecessarily large attack surface.
The solution is a multi-stage build. The builder stage installs system dependencies and compiles any C extensions (psycopg2, cryptography, numpy). The runtime stage starts clean from a slim base image and copies only the compiled packages from the builder: no gcc, no build headers, no pip internals. The final image is substantially smaller than a naive single-stage build; the example build output in this section shows 187MB against 312MB, a reduction of about 40%.
Alongside the Dockerfile, a .dockerignore file is not optional. It's the first line of defence against shipping things that should never leave your workstation. Without it, COPY . . sends your .git directory, your local venv, your .env file with real credentials, your editor configuration, and your __pycache__ to the Docker build context. Some of those end up in image layers, and the contents of image layers are readable by anyone who can pull the image.
One more thing about user creation: the adduser instruction should appear before the COPY instructions, not after. Docker caches layers sequentially. If adduser runs after COPY, any change to your application code invalidates the adduser cache layer, which forces Docker to re-run user creation on every build that touches app code. Put adduser early, before any COPY that changes frequently.
# =================================================================
# STAGE 1: Builder
# Installs dependencies and compiles C extensions.
# This stage is discarded: nothing from it ships to production
# except the installed packages we explicitly copy.
# =================================================================
FROM python:3.12-slim AS builder

# Install build tools needed for compiled extensions (psycopg2, cryptography)
# These are NOT copied to the runtime stage
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build

# Copy only requirements first: layer cache means pip only re-runs
# when requirements.txt changes, not on every code change
COPY requirements.txt .

# Install into an isolated prefix so we can copy the exact set of
# installed packages to the runtime stage cleanly
RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir --prefix=/install -r requirements.txt

# =================================================================
# STAGE 2: Runtime
# Clean slim image: no build tools, no compiler, no pip cache.
# Only the application code and the installed packages.
# The result is substantially smaller than a naive single-stage build.
# =================================================================
FROM python:3.12-slim AS runtime

# Prevents Python from writing .pyc files to disk
# (not useful in containers, where the filesystem is ephemeral)
ENV PYTHONDONTWRITEBYTECODE=1

# Forces Python stdout/stderr to be unbuffered.
# Without this, logs may be lost on crash because they're sitting
# in Python's internal buffer when the process exits.
# Critical for CloudWatch, Stackdriver, and Loki log capture.
ENV PYTHONUNBUFFERED=1

# No PYTHONPATH override is needed: the packages from the builder stage
# are copied into /usr/local below, which is already on Python's
# default search path in this base image.

# Install only the runtime system libraries (not build tools)
# libpq is the PostgreSQL client library, needed at runtime for psycopg2
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create a non-privileged user BEFORE copying application code.
# Reason: adduser is layer-cached independently of app code.
# If adduser runs after COPY, a code change invalidates the adduser
# layer and Docker re-runs user creation on every build.
RUN addgroup --system forgegroup \
    && adduser --system --ingroup forgegroup --no-create-home forgeuser

WORKDIR /app

# Copy installed packages from builder stage
COPY --from=builder /install /usr/local

# Copy application code. Changes here do NOT invalidate earlier layers.
COPY --chown=forgeuser:forgegroup . .

# Switch to non-privileged user
# Running as root in a container is a container escape risk:
# a vulnerability in your app or a dependency could give an attacker
# root access to the host if the container runtime is misconfigured.
USER forgeuser

# Document the port. This is metadata for tooling and humans.
# EXPOSE does NOT control network access. The -p flag in docker run
# or the containerPort in Kubernetes does that.
EXPOSE 8000

# Use gunicorn.conf.py for all configuration: cleaner than
# a long flag string and versionable alongside the application.
# Workers and timeout are set in gunicorn.conf.py, not hardcoded here.
CMD ["gunicorn", "-c", "gunicorn.conf.py", "io.thecodeforge.main:app"]
=> [builder 1/4] FROM python:3.12-slim
=> [builder 2/4] RUN apt-get update && apt-get install gcc libpq-dev
=> [builder 3/4] COPY requirements.txt .
=> [builder 4/4] RUN pip install --prefix=/install -r requirements.txt
=> [runtime 1/5] FROM python:3.12-slim
=> [runtime 2/5] RUN apt-get install libpq5 curl
=> [runtime 3/5] RUN addgroup && adduser
=> [runtime 4/5] COPY --from=builder /install /usr/local
=> [runtime 5/5] COPY . .
=> exporting to image
Successfully built thecodeforge-api:latest
Image size: 187MB (vs 312MB single-stage)
Startup and Shutdown: The lifespan Context Manager
Before FastAPI 0.93.0 (early 2023), the way to run code at startup and shutdown was @app.on_event('startup') and @app.on_event('shutdown'). Both decorators are now deprecated and marked for removal in a future version. In 2026, they should not appear in any new code or tutorial. If you see them in a codebase you're inheriting, put them on your refactoring list.
The replacement is the lifespan context manager — a single async generator function that wraps your application's entire lifecycle. Everything before the yield runs at startup. Everything after the yield runs at shutdown. The yield itself is when your application accepts traffic.
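The ordering around the yield can be checked without FastAPI at all. In the sketch below, run_app is a hypothetical stand-in for the ASGI server entering and leaving the lifespan context, and the events list exists only to record the order. The try/finally is a design choice of this sketch so that cleanup also runs if serving ends with an error.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List

events: List[str] = []


@asynccontextmanager
async def lifespan(app: object) -> AsyncIterator[None]:
    # Everything before the yield runs once, at startup.
    events.append("startup: open pool")
    try:
        yield  # the application serves traffic while suspended here
    finally:
        # Everything after the yield runs once, at shutdown.
        events.append("shutdown: close pool")


async def run_app() -> List[str]:
    """Stand-in for the server: enter lifespan, serve, then leave lifespan."""
    events.clear()
    async with lifespan(app=None):
        events.append("serving requests")
    return list(events)


if __name__ == "__main__":
    for line in asyncio.run(run_app()):
        print(line)
```

Because the resource is opened and closed in the same function, a missing cleanup is visible at a glance, which is harder to spot when startup and shutdown live in two separate handlers.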
This pattern is better for several reasons. Startup and shutdown logic lives in one function, so it's impossible to initialise a resource in startup without a corresponding cleanup in shutdown — they're in the same scope. It composes naturally with async context managers from your database libraries (databases, SQLAlchemy async, motor). And it works correctly with Gunicorn's graceful timeout: when Gunicorn receives SIGTERM, it sets a timer equal to graceful_timeout, drains in-flight requests, then calls the ASGI lifespan shutdown event — which triggers everything after the yield.
The graceful shutdown sequence in a Kubernetes rolling update is worth understanding completely, because getting it wrong means connection pool exhaustion on every deployment:
1. Kubernetes sends SIGTERM to the Gunicorn master process
2. Gunicorn stops accepting new connections and signals workers to finish in-flight requests
3. Workers complete their current requests within graceful_timeout (default 30s)
4. FastAPI's lifespan runs the shutdown section: closes database pools, flushes caches, deregisters from service discovery
5. Workers exit cleanly
6. Gunicorn master exits
If step 4 doesn't happen — because there's no shutdown handler — every rolling update leaks database connections. With SQLAlchemy's default pool of 5 connections per worker, 4 workers per pod, and 10 pods, a deployment cycle leaks 200 connection slots. PostgreSQL's default max_connections is 100. You can see where this ends.
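The arithmetic above is worth keeping as a reusable check. This is a minimal sketch: the function names and the reserved parameter (headroom for migrations, admin sessions, and monitoring) are assumptions for illustration.

```python
def connections_held(pool_size: int, workers_per_pod: int, pods: int) -> int:
    """Connections the whole deployment holds when every pool is full."""
    return pool_size * workers_per_pod * pods


def max_pool_size(max_connections: int, workers_per_pod: int, pods: int, reserved: int = 10) -> int:
    """Largest per-worker pool that keeps the deployment under max_connections.

    reserved is a hypothetical allowance for connections that do not
    come from the application (psql sessions, migrations, monitoring).
    """
    usable = max_connections - reserved
    return max(0, usable // (workers_per_pod * pods))


if __name__ == "__main__":
    # The example from the text: 5 connections x 4 workers x 10 pods
    print(connections_held(pool_size=5, workers_per_pod=4, pods=10))
    # What PostgreSQL's default max_connections of 100 actually allows
    print(max_pool_size(max_connections=100, workers_per_pod=4, pods=10))
```

With the numbers from the text, the deployment can hold 200 connections against a server limit of 100, and the largest safe pool per worker is 2. During a rolling update old and new pods overlap, so the real budget is tighter still.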
from contextlib import asynccontextmanager
from typing import AsyncGenerator

import databases
from fastapi import FastAPI
from fastapi.responses import JSONResponse

from io.thecodeforge.config import settings

# -----------------------------------------------------------------
# Database setup
# Using the 'databases' library for async PostgreSQL.
# This line only configures the Database object. The connection pool
# itself is opened by database.connect() inside lifespan, NOT at
# import time. Opening connections at module level breaks --preload
# and causes 'Future attached to a different loop' errors after fork().
# -----------------------------------------------------------------
database = databases.Database(
    str(settings.database_url),
    min_size=2,
    max_size=10,  # align with your PostgreSQL max_connections / (workers * pods)
)


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """
    Manages the full application lifecycle.
    Replaces the deprecated @app.on_event('startup') and @app.on_event('shutdown').

    Startup (before yield):
    - Connect to the database and verify connectivity
    - Warm any caches
    - Register with service discovery if applicable

    Shutdown (after yield):
    - Disconnect from the database (returns connections to the pool,
      then closes pool connections to PostgreSQL)
    - Flush metrics buffers
    - Deregister from service discovery

    Gunicorn's graceful_timeout controls how long workers have to reach
    the shutdown section before being force-killed.
    """
    # --- STARTUP ---
    await database.connect()

    # Verify the database is actually reachable before accepting traffic.
    # This prevents a Kubernetes pod from passing its readiness check
    # before the connection pool is healthy.
    try:
        await database.execute("SELECT 1")
    except Exception as exc:
        # Re-raise: the app should not start if the DB is unreachable.
        # Gunicorn will log the exception and the worker will exit.
        raise RuntimeError(f"Database connectivity check failed at startup: {exc}") from exc

    yield  # Application is live and accepting traffic here

    # --- SHUTDOWN ---
    # This runs after the worker receives SIGTERM (for example during a
    # Kubernetes rolling update) and in-flight requests have drained.
    # It must complete within graceful_timeout or the worker is killed.
    # Closing the pool here prevents a connection leak on every deployment.
    await database.disconnect()


app = FastAPI(
    title="TheCodeForge API",
    lifespan=lifespan,
)


# -----------------------------------------------------------------
# Liveness probe: is the process alive and the event loop running?
# Kubernetes restarts the pod if this returns non-200 or times out.
# Keep this lightweight, with no database calls. If the event loop is
# blocked, this endpoint won't respond and Kubernetes will act.
# -----------------------------------------------------------------
@app.get("/health", tags=["observability"])
async def health() -> dict:
    return {"status": "healthy"}


# -----------------------------------------------------------------
# Readiness probe: is this pod ready to receive traffic?
# Kubernetes stops sending traffic to this pod if this returns non-200.
# Unlike /health, this SHOULD check dependencies: a pod that has
# started but cannot reach its database should not receive traffic.
# -----------------------------------------------------------------
@app.get("/ready", tags=["observability"])
async def ready() -> JSONResponse:
    try:
        await database.execute("SELECT 1")
        return JSONResponse(status_code=200, content={"status": "ready"})
    except Exception as exc:
        # Return 503: Kubernetes will remove this pod from the load balancer
        # rotation until the database becomes reachable again
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "detail": str(exc)},
        )
[2026-01-15 10:00:00] [INFO] Database connectivity check passed
[2026-01-15 10:00:00] [INFO] Application ready — accepting traffic
...
[2026-01-15 10:05:00] [INFO] SIGTERM received — beginning graceful shutdown
[2026-01-15 10:05:00] [INFO] Finishing in-flight requests
[2026-01-15 10:05:01] [INFO] Database pool disconnected cleanly
[2026-01-15 10:05:01] [INFO] Worker exited
Robust Configuration with Pydantic Settings
Hardcoding database URLs is a security problem. Passing them as plain str fields in Pydantic is a validation problem — an empty string passes str validation and your app starts fine, then crashes on the first database call with a connection error that takes 10 minutes to trace back to a missing environment variable.
Pydantic Settings solves both: it reads configuration from environment variables (and optionally .env files for local development), and it validates the types strictly. Using PostgresDsn instead of str means an empty string, a typo, or a missing scheme like postgresql:// will raise a ValidationError at startup — before your app accepts a single request.
The env_prefix configuration prevents collision with system environment variables. Without a prefix, a setting named DEBUG would be overridden by any DEBUG variable in the system environment, which on some Linux distributions is set by the shell. With FORGE_ as the prefix, your settings are namespaced: FORGE_DATABASE_URL, FORGE_SECRET_KEY, FORGE_DEBUG.
The @lru_cache pattern on the settings factory function ensures the .env file is read and validated exactly once, even if get_settings() is called from multiple modules. It also makes the settings injectable via FastAPI's dependency injection system, which makes testing significantly cleaner: you can override get_settings in tests to return a test-specific configuration without modifying environment variables.
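The caching behaviour is easy to verify in isolation. The sketch below uses a plain dataclass (DemoSettings, a hypothetical stand-in for the real Pydantic settings class) so it runs with the standard library only. It also shows cache_clear(), which tests need when they change environment variables between cases.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Stand-in for the real settings model; holds one value for illustration."""
    database_url: str


@lru_cache(maxsize=1)
def get_settings() -> DemoSettings:
    # The environment is read only on the first call; later calls
    # return the cached instance without touching os.environ again.
    return DemoSettings(database_url=os.environ.get("FORGE_DATABASE_URL", ""))


if __name__ == "__main__":
    os.environ["FORGE_DATABASE_URL"] = "postgresql://user:pw@localhost/app"
    first = get_settings()
    os.environ["FORGE_DATABASE_URL"] = "postgresql://user:pw@localhost/other"
    print(get_settings() is first)          # cached instance is reused
    get_settings.cache_clear()              # force a re-read, as a test would
    print(get_settings().database_url)
```

The flip side of the cache is that a changed environment variable is invisible until cache_clear() is called, which is the desired behaviour in production and the thing to remember in test fixtures.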
from functools import lru_cache
from typing import Literal

from pydantic import Field, PostgresDsn, SecretStr, model_validator
from pydantic_settings import BaseSettings, SettingsConfigDict


class ForgeSettings(BaseSettings):
    """
    Application configuration loaded from environment variables.

    All fields with FORGE_ prefix are read from the environment.
    In local development, values are loaded from a .env file.
    In production (Kubernetes, ECS), they come from the orchestrator's
    secret injection, never from a file committed to version control.

    Validation happens at instantiation time. If any required field is
    missing or malformed, the app raises ValidationError immediately
    and refuses to start, which is the correct behaviour.
    """

    # PostgresDsn validates the URL format, including the scheme and host.
    # An empty string or a URL without 'postgresql://' raises ValidationError.
    # Plain str would accept '' and let your app start, then crash on first DB call.
    database_url: PostgresDsn

    # SecretStr prevents the value from appearing in logs, repr(), or tracebacks.
    # settings.secret_key returns SecretStr; call .get_secret_value() to use the raw string.
    secret_key: SecretStr = Field(min_length=32)

    # Literal type restricts valid values: 'development', 'staging', 'production'.
    # Passing FORGE_ENVIRONMENT=prod (typo) raises ValidationError at startup.
    environment: Literal["development", "staging", "production"] = "production"
    debug: bool = False

    # Database pool sizing: align with (gunicorn workers × pods × this value)
    # to stay under PostgreSQL's max_connections limit
    db_pool_min_size: int = Field(default=2, ge=1, le=10)
    db_pool_max_size: int = Field(default=10, ge=1, le=50)

    # Worker configuration: read by gunicorn.conf.py if you want
    # dynamic worker count based on container CPU allocation
    workers: int = Field(default=2, ge=1, le=16)

    @model_validator(mode="after")
    def debug_must_be_false_in_production(self) -> "ForgeSettings":
        """Prevent debug mode from running in production accidentally."""
        # A model-level validator runs after all fields are validated,
        # so both 'environment' and 'debug' are available here.
        if self.environment == "production" and self.debug:
            raise ValueError("debug must be False when environment is 'production'")
        return self

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        env_prefix="FORGE_",
        # Ignore unknown keys (for example, extra entries in the .env file)
        # instead of raising. Switch to extra="forbid" if you want unknown
        # .env entries to fail validation and surface typos in variable names.
        extra="ignore",
        case_sensitive=False,
    )


@lru_cache(maxsize=1)
def get_settings() -> ForgeSettings:
    """
    Returns a cached singleton of the application settings.

    lru_cache ensures the .env file is read and validated exactly once
    at startup, regardless of how many modules call get_settings().

    In tests, override this with:
        app.dependency_overrides[get_settings] = lambda: ForgeSettings(
            database_url='postgresql://test:test@localhost/testdb',
            secret_key='a' * 32,
        )
    """
    return ForgeSettings()


# Module-level singleton for use outside FastAPI's DI system
# (e.g., in gunicorn.conf.py or CLI scripts)
settings = get_settings()
ValidationError: 1 validation error for ForgeSettings
database_url
Field required [type=missing, input_url=FORGE_DATABASE_URL]
# When FORGE_DATABASE_URL is an empty string:
ValidationError: 1 validation error for ForgeSettings
database_url
URL scheme should be 'postgresql', 'postgresql+asyncpg' or similar [type=url_scheme]
# When all values are valid:
Settings validated: environment=production debug=False workers=2
Environment Separation and Secrets Management
Local development uses a .env file. Staging and production do not. That sentence should settle most debates about secret management, but the details of how production secrets are delivered matter more than most articles acknowledge.
The wrong pattern, and one that still appears in deployment tutorials in 2026, is passing secrets as docker run -e DATABASE_URL=postgresql://... in a CI/CD script. This exposes the secret in:
- Shell history on the CI runner
- The process list (the full docker run command line, including the -e value, is visible in ps aux while the command runs)
- docker inspect output, which is readable by anyone with Docker socket access
- CI logs if the command is echoed
The right patterns, in order of preference for 2026 deployments:
Kubernetes: Store secrets in Kubernetes Secrets (base64-encoded, etcd-backed). Mount them as environment variables in the pod spec using secretKeyRef, or mount them as files using a volume. For stronger security, use the Secrets Store CSI Driver to pull secrets directly from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault into the pod's filesystem — the secret never touches etcd.
AWS ECS: Use AWS Secrets Manager or Parameter Store. Reference secrets in the task definition's secrets block — ECS injects them as environment variables at task startup. The secret value is fetched at runtime by the ECS agent, not baked into the task definition.
Bare VMs: Use HashiCorp Vault with the AppRole auth method. The VM has a role ID and secret ID (injected by your provisioning system), authenticates to Vault at startup, and receives a short-lived token to fetch secrets. The actual credentials are never on disk.
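When secrets are mounted as files (a Kubernetes volume or the Secrets Store CSI Driver) rather than injected as environment variables, the application needs a small loader. The following is a minimal sketch, assuming one file per secret named after the setting. The read_secret helper and the /run/secrets default are hypothetical; the real mount path is whatever your pod spec declares.

```python
import os
from pathlib import Path


def read_secret(name: str, mount_dir: str = "/run/secrets") -> str:
    """Return a secret from a mounted file, falling back to the environment.

    mount_dir is an assumed location for illustration; use the mountPath
    configured for your secret volume.
    """
    secret_file = Path(mount_dir) / name
    if secret_file.is_file():
        # Mounted secret files often end with a trailing newline.
        return secret_file.read_text(encoding="utf-8").strip()

    value = os.environ.get(name)
    if value:
        return value

    # Fail loudly at startup instead of continuing with an empty secret.
    raise RuntimeError(f"secret {name!r} not found in {mount_dir} or the environment")
```

Checking the file first and the environment second lets the same code run in a cluster with mounted secrets and on a workstation with a .env file, and raising when both are missing keeps the fail-at-startup behaviour described earlier.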
For local development, .env files are fine — with two non-negotiable rules: .env is in .gitignore, and .env.example (with placeholder values) is in version control as documentation.
# .env.example: commit this to version control
# Copy to .env and fill in real values for local development
# NEVER commit .env; it is in .gitignore
#
# In production (Kubernetes), these come from Kubernetes Secrets
# mounted as environment variables via secretKeyRef in the pod spec.
# In ECS, they come from AWS Secrets Manager referenced in the task definition.
# On bare VMs, they come from HashiCorp Vault injected at startup.

# PostgreSQL connection string
# Format: postgresql+asyncpg://user:password@host:port/database
FORGE_DATABASE_URL=postgresql+asyncpg://forge_user:change_me@localhost:5432/forge_db

# Must be at least 32 characters; used for JWT signing and session encryption
# Generate with: python -c "import secrets; print(secrets.token_hex(32))"
FORGE_SECRET_KEY=change-me-to-a-64-character-random-string-generated-with-secrets-module

# Environment tag: controls logging level, error detail exposure, and feature flags
# Valid values: development | staging | production
FORGE_ENVIRONMENT=development

# Debug mode: NEVER set to true in staging or production
FORGE_DEBUG=false

# Database connection pool sizing
# Max connections consumed = FORGE_DB_POOL_MAX_SIZE × gunicorn_workers × pod_count
# Must be less than PostgreSQL's max_connections (default: 100)
FORGE_DB_POOL_MIN_SIZE=2
FORGE_DB_POOL_MAX_SIZE=5
At minimum: keep .env in .gitignore, run a pre-commit hook with git-secrets or truffleHog to scan for credential patterns before every commit, and keep SECRET_KEY and DATABASE_URL as separate values so rotating one doesn't invalidate the other.

Health Checks, Readiness, and Graceful Shutdown
Health checks are where the gap between 'it works on my machine' and 'it works in production' is most visible. Getting them wrong costs you downtime on every deployment — which in a team doing continuous delivery might mean several times a day.
The distinction between liveness and readiness is not semantic pedantry. It maps directly to different Kubernetes behaviours with different consequences:
Liveness probe failure → Kubernetes restarts the pod. Use this for: detecting a deadlocked event loop, a process that's running but not processing requests, or a hung worker. The probe itself must be cheap and dependency-free. If your liveness probe calls the database and the database goes down, Kubernetes will restart all your pods in a loop — making an outage worse.
Readiness probe failure → Kubernetes removes the pod from the load balancer's endpoint set. Traffic stops going to that pod, but the pod itself is not restarted. Use this for: verifying database connectivity, checking that connection pools are initialised, confirming that any required cache warmup has completed. A pod that's alive but can't reach its database should return 503 from readiness so traffic routes to healthy pods — not 503 from liveness, which would trigger a restart loop.
For Kubernetes deployment configuration, set both probes with appropriate delays and thresholds. The initialDelaySeconds on the readiness probe should account for your application's startup time: the database connectivity check in the lifespan context manager, any cache warming, service registration. A new pod counts as not ready until its first readiness probe succeeds, so if your startup takes 5 seconds and initialDelaySeconds is 3, the first check fails and the pod has to wait for a later probe period before it joins rotation, which slows every deployment.
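One way to keep the two probes consistent with the measured startup time is to generate the probe fragment rather than hand-edit it. The sketch below builds the livenessProbe and readinessProbe fields as a Python dict and prints them as JSON, which Kubernetes accepts alongside YAML. The probe field names are the standard ones; the port, the 5-second startup figure, and every delay, period, and threshold value are illustrative assumptions to be replaced with your own measurements.

```python
import json

STARTUP_SECONDS = 5   # assumed: measured time for lifespan startup to finish
APP_PORT = 8000       # assumed: port the container listens on


def http_probe(path: str, *, initial_delay: int, period: int,
               timeout: int = 2, failure_threshold: int = 3) -> dict:
    """Build one HTTP probe using the standard Kubernetes probe fields."""
    return {
        "httpGet": {"path": path, "port": APP_PORT},
        "initialDelaySeconds": initial_delay,
        "periodSeconds": period,
        "timeoutSeconds": timeout,
        "failureThreshold": failure_threshold,
    }


container_probes = {
    # Liveness: cheap and dependency-free, restarts the pod on failure.
    "livenessProbe": http_probe(
        "/health", initial_delay=STARTUP_SECONDS + 5, period=10
    ),
    # Readiness: checks the database, removes the pod from rotation on failure.
    "readinessProbe": http_probe(
        "/ready", initial_delay=STARTUP_SECONDS + 2, period=5
    ),
}

print(json.dumps(container_probes, indent=2))
```

Deriving both delays from one STARTUP_SECONDS constant means a slower startup only needs to be updated in one place.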
    # This file shows the complete main.py, combining lifespan, health,
    # and readiness into the single application entry point.
    # The lifespan implementation from the previous section is included
    # here in full context.
    from contextlib import asynccontextmanager
    from typing import AsyncGenerator

    import databases
    from fastapi import FastAPI
    from fastapi.responses import JSONResponse

    from io.thecodeforge.config import get_settings

    settings = get_settings()

    database = databases.Database(
        str(settings.database_url),
        min_size=settings.db_pool_min_size,
        max_size=settings.db_pool_max_size,
    )


    @asynccontextmanager
    async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
        # STARTUP
        await database.connect()
        try:
            await database.execute("SELECT 1")
        except Exception as exc:
            raise RuntimeError(f"Database unreachable at startup: {exc}") from exc

        yield  # serving traffic

        # SHUTDOWN: runs on SIGTERM via Gunicorn graceful_timeout
        await database.disconnect()


    app = FastAPI(title="TheCodeForge API", lifespan=lifespan)


    @app.get("/health", tags=["observability"])
    async def health() -> dict:
        """
        Liveness probe: is this process alive and the event loop running?

        Kubernetes action on failure: RESTART the pod.

        Keep this dependency-free. If this endpoint calls the database
        and the DB goes down, Kubernetes will restart all pods, turning
        a DB outage into a total availability incident.

        A blocked event loop will prevent this endpoint from responding,
        which is exactly the signal Kubernetes needs to restart the pod.
        """
        return {"status": "healthy"}


    @app.get("/ready", tags=["observability"])
    async def ready() -> JSONResponse:
        """
        Readiness probe: can this pod receive traffic?

        Kubernetes action on failure: REMOVE from load balancer (no restart).

        This SHOULD check actual dependencies. A pod that's alive but
        disconnected from its database should not receive traffic.
        Return 503 to signal 'alive but not ready'. Kubernetes stops
        routing traffic to this pod without restarting it.
        """
        try:
            await database.execute("SELECT 1")
            return JSONResponse(status_code=200, content={"status": "ready"})
        except Exception as exc:
            return JSONResponse(
                status_code=503,
                content={
                    "status": "not_ready",
                    "detail": "database unreachable",
                    # Do not expose exc details in production; log them instead
                },
            )
GET /health -> 200 {"status": "healthy"}
GET /ready -> 200 {"status": "ready"}
# During database outage:
GET /health -> 200 {"status": "healthy"} # pod stays up
GET /ready -> 503 {"status": "not_ready"} # pod removed from LB rotation
# During SIGTERM (Kubernetes rolling update):
[INFO] SIGTERM received
[INFO] Finishing 3 in-flight requests
[INFO] Database pool disconnected
[INFO] Worker exited cleanly
| Feature | Uvicorn Alone | Gunicorn + Uvicorn |
|---|---|---|
| Process management | None from Uvicorn itself — single process per invocation | Master process (arbiter) spawns, monitors, and restarts workers via heartbeat |
| Worker crash recovery | Pod exits — Kubernetes or systemd must restart it | Gunicorn detects failed heartbeat, sends SIGKILL, forks replacement — no external restart needed |
| Graceful shutdown | Handle SIGTERM in lifespan — works correctly but no worker draining | Built-in graceful_timeout: workers finish in-flight requests before shutdown |
| Multi-core utilisation | Use multiple pod replicas in Kubernetes — each pod one worker | Multiple workers per process — suited for bare VMs with no external orchestrator |
| Zero-downtime reload | Rolling update via Kubernetes — replace pods not processes | SIGHUP for config reload; USR2 for zero-downtime binary upgrade on bare VMs |
| Best fit | Kubernetes, Cloud Run, ECS — where the orchestrator manages restarts, scaling, and health | Bare VMs, EC2 without ECS, GCE — where no external process manager exists |
| Memory per worker | 50-150MB per pod for a typical FastAPI app | Same per worker, plus ~10-20MB for the Gunicorn master process |
| Configuration | uvicorn main:app --workers N, set via CLI flags or UVICORN_* environment variables | gunicorn.conf.py with worker_class=uvicorn.workers.UvicornWorker |
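For the right-hand column, a minimal gunicorn.conf.py might look like the sketch below. The setting names are Gunicorn's own; every value is an illustrative starting point rather than a recommendation, and the bind address and the WEB_CONCURRENCY fallback of 2 are assumptions to adjust for your deployment.

```python
# gunicorn.conf.py - a minimal sketch with illustrative values
import os

# Address the master binds to; 8000 is an assumed port.
bind = "0.0.0.0:8000"

# Run each worker as a Uvicorn ASGI event loop.
worker_class = "uvicorn.workers.UvicornWorker"

# Start small for async apps and benchmark; overridable per environment.
workers = int(os.environ.get("WEB_CONCURRENCY", "2"))

# Seconds of silence before the master kills and replaces a worker.
timeout = 30

# Seconds a worker gets to finish in-flight requests after SIGTERM.
graceful_timeout = 30

# Recycle each worker after roughly this many requests; the jitter
# staggers restarts so workers do not all recycle at once.
max_requests = 1000
max_requests_jitter = 100
```

Because the file is plain Python, values such as workers can be computed at startup from the environment instead of being hard-coded into the image.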
🎯 Key Takeaways
- Gunicorn is the right process manager on bare VMs and EC2 — it handles worker crash recovery, graceful shutdown, and multi-core utilisation without an external orchestrator. On Kubernetes, Uvicorn alone with multiple pod replicas is equally valid and often simpler to reason about.
- The (2×CPU)+1 worker formula is for synchronous WSGI workers. For async FastAPI, start with 2–4 workers and benchmark under load — the event loop provides concurrency, not worker count. Adding workers beyond CPU saturation multiplies memory and connection consumption with no throughput gain.
- Use a gunicorn.conf.py file for all Gunicorn configuration — versionable, readable, and supports dynamic values. Set max_requests and max_requests_jitter to recycle workers and prevent thundering herd restarts from slow memory leaks.
- The lifespan context manager is the only correct pattern for startup and shutdown in FastAPI as of 2026. @app.on_event is deprecated. Initialise database pools inside lifespan (never at module level) to avoid post-fork event loop conflicts with --preload.
- Use PostgresDsn (not str) for database URLs in Pydantic Settings — empty strings and malformed URLs fail at startup, not at runtime under real traffic. Use SecretStr for passwords and keys to prevent them appearing in logs and stack traces.
- Separate /health (liveness: no dependencies, must be fast) from /ready (readiness: calls SELECT 1, returns 503 when DB is unreachable). A single combined endpoint serving both Kubernetes probes will either cause restart loops during DB outages or fail to catch connectivity problems before traffic is routed.
- Production secrets do not belong in docker run -e flags — they appear in shell history, process lists, and docker inspect output. Use Kubernetes secretKeyRef, AWS Secrets Manager references in ECS task definitions, or HashiCorp Vault with AppRole on bare VMs.
- Multi-stage Dockerfiles separate the build environment from the runtime image. The builder stage compiles C extensions; the runtime stage ships nothing but what's needed to run. Add adduser before COPY for correct layer caching. Ship a .dockerignore alongside every Dockerfile.
Interview Questions on This Topic
- Explain the master-worker architecture of Gunicorn. What happens to the master process if a worker encounters a segmentation fault, and what happens if the master itself crashes? (Senior)
- In a Docker context, why is .dockerignore important for Python applications, and what specifically should it always contain? (Mid-level)
- How does PYTHONUNBUFFERED affect log visibility in containerised environments, and what happens to logs if a container crashes without it set? (Mid-level)
- You're deploying a FastAPI application to a 16-core machine. The (2n+1) formula suggests 33 workers. Why might this degrade performance, and how would you determine the correct worker count? (Senior)
- What is the difference between EXPOSE and publishing a port in Docker, and which one actually controls whether traffic reaches your FastAPI application? (Junior)
Frequently Asked Questions
Why use Gunicorn with Uvicorn workers instead of just Uvicorn?
On bare VMs, EC2 instances, and any environment without an external process manager: Gunicorn handles worker crash recovery, graceful shutdown, and multi-core utilisation that Uvicorn alone doesn't provide. If your single Uvicorn process crashes, the app is down until something restarts it. Gunicorn restarts the worker in seconds. On Kubernetes: the answer flips. Kubernetes is already a process manager — it handles pod restarts, rolling updates, autoscaling, and health-based traffic routing. Running Uvicorn directly in a Kubernetes Deployment with multiple replicas (each pod one Uvicorn process) is a completely valid 2026 pattern. The Gunicorn master process becomes redundant overhead when Kubernetes is doing the same job at the pod level. Know your infrastructure before choosing.
How do I handle database migrations in a Docker deployment?
Never put migration commands (alembic upgrade head) inside your Docker CMD or the lifespan startup handler. If you scale to multiple pods and all of them run migrations on startup, you get concurrent migration attempts. Alembic does not take a migration lock of its own, so concurrent runs can race each other; even when the database serialises them, you pay for it in startup latency and occasional lock timeout failures. The correct pattern: run migrations as a Kubernetes Job before the Deployment is updated. An init container in the pod spec also runs to completion before the main container starts, but it runs once per pod, so it only fits when the migration is safe to run concurrently. In ECS, run migrations as a one-off task in your CI/CD pipeline before the new task definition is deployed. The rule is: migrations run exactly once per deployment, not once per pod replica.
What is the 'zombie worker' problem in Gunicorn, and how does async FastAPI reduce its likelihood?
A zombie worker is a Gunicorn worker that has stopped sending heartbeats to the master — either because it's stuck in a CPU-bound synchronous loop that never yields, or because it's handling a request that exceeds the --timeout limit. Gunicorn detects the missing heartbeat and sends SIGKILL. For synchronous WSGI workers, any blocking operation (a slow database query, a network call, a file read) can cause this because the worker thread is blocked and can't yield. For async FastAPI workers, the event loop naturally yields during every I/O wait — a slow database query suspends the current coroutine and lets the event loop handle other requests. The zombie worker problem becomes a code quality issue: it only occurs when someone calls a synchronous blocking function (time.sleep, requests.get, a sync database driver) from an async context. The fix is always the same: use the async equivalent.
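The difference between a blocking call and its non-blocking equivalent is easy to see with a standard-library sketch. A background ticker stands in for "other requests" and counts how often the event loop gets to run while a handler is in progress. The handler names and timings are hypothetical; asyncio.to_thread is one way to offload a blocking call, and an async-native library is usually the better fix where one exists.

```python
import asyncio
import time
from typing import Awaitable, Callable

BLOCK_SECONDS = 0.2


async def count_ticks(stop: asyncio.Event, interval: float = 0.01) -> int:
    """Count how many times the event loop lets this task run."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def blocking_handler() -> None:
    # Synchronous sleep inside a coroutine: the loop is frozen for the
    # whole duration, so no other request (or health check) is served.
    time.sleep(BLOCK_SECONDS)


async def offloaded_handler() -> None:
    # Same blocking call, moved to a thread: the loop keeps running.
    await asyncio.to_thread(time.sleep, BLOCK_SECONDS)


async def ticks_during(handler: Callable[[], Awaitable[None]]) -> int:
    """Run one handler and report how many ticks happened meanwhile."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker a chance to start
    await handler()
    stop.set()
    return await ticker


if __name__ == "__main__":
    blocked = asyncio.run(ticks_during(blocking_handler))
    offloaded = asyncio.run(ticks_during(offloaded_handler))
    print(f"ticks while blocked: {blocked}, ticks while offloaded: {offloaded}")
```

With the blocking handler the ticker barely advances, which is the same starvation that stops a worker from answering its liveness probe or the Gunicorn heartbeat. With the offloaded handler the ticker keeps counting for the full duration.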
Should I use a reverse proxy like Nginx in front of Gunicorn?
In production on bare infrastructure: yes. Gunicorn is not hardened against slow HTTP attacks (Slowloris and similar), does not handle TLS termination efficiently, cannot serve static files without loading Python, and has no built-in rate limiting. Nginx handles all of these before a request reaches Gunicorn. The typical bare-VM stack: Nginx (TLS, rate limiting, static files, request buffering) → Gunicorn (process management) → Uvicorn workers (async HTTP) → FastAPI. In Kubernetes or managed container platforms: your ingress controller (Nginx Ingress, Traefik, Caddy, or a cloud-native load balancer) takes the place of a local Nginx. You don't typically run a per-pod Nginx sidecar. The ingress handles TLS and routing at the cluster level, and the application pod runs Uvicorn alone or Gunicorn + Uvicorn.
How do I enable Prometheus metrics for FastAPI without blocking the event loop?
Install prometheus-fastapi-instrumentator: 'pip install prometheus-fastapi-instrumentator'. Add to your application: 'from prometheus_fastapi_instrumentator import Instrumentator; Instrumentator().instrument(app).expose(app)'. This exposes a /metrics endpoint with request counts, latency histograms, and in-progress request gauges, all collected without blocking the async event loop. Configure Prometheus to scrape /metrics with a 15-second interval. For Gunicorn-level metrics (worker count, worker restarts, request queue depth), set statsd_host = 'localhost:8125' in gunicorn.conf.py (the command-line equivalent is --statsd-host=localhost:8125) and run a StatsD exporter sidecar. In Kubernetes, annotate the pod with 'prometheus.io/scrape: "true"' and 'prometheus.io/path: /metrics' so that a Prometheus instance configured for annotation-based discovery picks it up automatically.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.