Intermediate 7 min · March 06, 2026

Docker Crash on Boot — API Connects Before Postgres Ready

API container exits ECONNREFUSED 5432 due to missing healthcheck on Postgres.

Naren · Founder
Quick Answer
  • Container: a running instance of an image — isolated filesystem, network, and process tree sharing the host kernel
  • Image: a read-only blueprint built from layers — each Dockerfile instruction creates one cached layer
  • Dockerfile: the build script that defines the image — instructions are executed top-to-bottom
  • Volume: persistent storage that survives container deletion — named volumes for production, bind mounts for development
  • Containers share the host OS kernel (no guest OS overhead)
  • VMs run a full guest OS per instance (stronger isolation, much heavier)
  • Layer caching: changing one layer invalidates all layers after it — order from least-to-most frequently changing
  • Multi-stage builds: use heavy toolchains during compilation, ship only the output to production
Plain-English First

Imagine you're moving house. Instead of dismantling every piece of furniture and hoping it fits in the new place, you pack everything — sofa, TV, cables, instruction manuals — into a perfectly sized shipping container. That container can be loaded onto any truck, ship, or train and delivered anywhere. Docker does exactly this for software: it bundles your app, its dependencies, its config, and its runtime into one portable 'container' that runs identically on your laptop, your colleague's machine, or a cloud server in Singapore. No more 'but it works on my machine'.

Environment drift is the root cause of most 'works on my machine' failures. A different Node version, a missing library, an environment variable pointing nowhere — these are not skill problems, they are infrastructure problems. Docker eliminates this class of issue by packaging the entire runtime environment into a portable, immutable container.

Containers are not VMs. They share the host OS kernel and use Linux namespaces and cgroups for isolation. This means containers start in milliseconds and use megabytes of memory, making microservices architectures economically viable. On the same machine that runs three VMs, you can run thirty containers.

Common misconceptions: containers are not inherently insecure (misconfiguration is the problem, not the technology), data inside containers is not persistent by default (you need volumes), and Docker Compose is not just for development (it works in production for single-host deployments).

Containers vs Virtual Machines: Why Docker Is a Fundamentally Different Idea

Most people learn Docker by running commands without understanding the architectural shift underneath. That's fine for getting started, but it bites you the moment something breaks.

A virtual machine (VM) runs a full guest operating system — its own kernel, drivers, system processes — on top of a hypervisor. Your app sits at the top of this tower. Booting a VM can take minutes. It consumes gigabytes of RAM even before your app starts. Scaling ten microservices with VMs means ten full operating systems running simultaneously.

Docker containers take a different path. They share the host machine's kernel directly. Each container gets its own isolated view of the filesystem (via union file system layers), its own network namespace, and its own process tree — but there's no duplicated OS. A container starts in milliseconds. It uses megabytes of overhead instead of gigabytes.

The practical implication: on the same machine where you could run three VMs, you can run thirty containers. That's not a minor efficiency gain — it's the reason microservices architectures became economically viable. When AWS charges you per second of compute, that difference compounds fast.

Containers are not inherently less secure than VMs — they're just differently isolated. A misconfigured container is dangerous, just as a misconfigured VM is. The security story depends on your configuration, not the technology itself.

Kernel sharing trade-off: Because containers share the host kernel, a kernel vulnerability (like CVE-2022-0185 or Dirty Pipe) affects all containers on that host. VMs have a separate kernel per instance, so a kernel vulnerability in one VM does not affect others. For high-security multi-tenant environments (running untrusted code), VMs provide stronger isolation. For single-tenant application workloads, container isolation is sufficient.
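One quick way to see the shared kernel for yourself is to compare the kernel release reported on the host with the one reported inside a container. The sketch below assumes a Linux host with Docker installed; same_kernel is a hypothetical helper name, not a Docker command. On Docker Desktop for macOS or Windows, containers run inside a Linux VM, so the two values will differ there.

```bash
# Sketch: compare the kernel release seen by the host and by a container.
# same_kernel is a hypothetical helper, not part of Docker.
same_kernel() {
  local image="${1:-alpine:3.19}"
  local host_kernel container_kernel

  host_kernel="$(uname -r)"
  # uname -r inside the container reports the kernel the container runs on
  container_kernel="$(docker run --rm "$image" uname -r)" || return 2

  echo "host:      $host_kernel"
  echo "container: $container_kernel"

  # Succeeds (exit status 0) only when both report the same kernel release
  [ "$host_kernel" = "$container_kernel" ]
}
```

Calling same_kernel && echo "shared kernel" on a native Linux host should print the same release twice, which is the practical meaning of "containers share the host kernel".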

io/thecodeforge/container_vs_vm_demo.sh (Bash)
# Compare startup time and resource footprint — run these and watch the difference

# Pull a minimal Linux image (only ~5MB compressed)
docker pull alpine:3.19

# Start a container, run a command, and exit — time the whole thing
time docker run --rm alpine:3.19 echo "Container is alive"
# --rm tells Docker to delete the container after it exits (no cleanup needed)
# alpine:3.19 is the image — think of it as the blueprint
# 'echo ...' is the command to run inside the container

# Now check how much memory an idle container is using right now
# Run it in the background, then take a single resource snapshot
docker run -d --name resource-demo alpine:3.19 sleep 30
# -d runs in detached (background) mode
# --name gives it a human-readable name instead of a random hash

docker stats resource-demo --no-stream
# --no-stream prints one snapshot instead of a live feed
# Look at the MEM USAGE column — typically under 1MB for alpine doing nothing

# Clean up
docker stop resource-demo && docker rm resource-demo
Output
Container is alive
real 0m0.387s
user 0m0.021s
sys 0m0.018s
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
resource-demo 0.00% 632KiB / 15.55GiB 0.00% 796B / 0B 0B / 0B
Containers as Apartments, VMs as Houses
Cases where the shared kernel is a deal-breaker and a VM is the better fit:
  • Multi-tenant environments running untrusted code — the shared kernel is a risk.
  • Workloads requiring a different kernel version than the host.
  • Compliance requirements that mandate full OS isolation.
  • For everything else — single-tenant application workloads — containers are the right choice.
Production Insight
The kernel sharing trade-off has real security implications. In 2022, the Dirty Pipe vulnerability (CVE-2022-0847) let an unprivileged process overwrite data in files it could only read. Because every container on a host runs on that host's kernel, all containers on an unpatched host were exposed at once. VMs limit the blast radius: each VM runs its own kernel, so exploiting a vulnerable guest kernel does not by itself compromise the host or neighbouring VMs. For high-security environments, consider gVisor (a user-space kernel) or Kata Containers (lightweight VMs), which add a stronger isolation boundary while keeping a container-style workflow.
Key Takeaway
Containers share the host kernel — they start in milliseconds and use megabytes of memory. VMs run a full guest OS — stronger isolation but much heavier. For single-tenant workloads, containers are the right choice. For multi-tenant or high-security environments, consider gVisor or Kata Containers.
Container vs VM Selection
If: Single-tenant application workload (API, web server, worker)
Use: Container. Faster startup, lower cost, sufficient isolation.
If: Multi-tenant environment running untrusted code
Use: VM or gVisor/Kata Containers. Stronger isolation required.
If: Workload requires a specific kernel version
Use: VM. Containers share the host kernel.
If: CI/CD pipeline, developer environment
Use: Container. Fast spin-up, disposable, reproducible.

Containerization vs Virtualization Comparison Table

The table below summarizes the key differences between traditional virtual machines and Docker containers. These trade-offs directly impact how you architect, deploy, and secure your applications.

| Aspect | Virtual Machines | Docker Containers |
| --- | --- | --- |
| Startup time | 30 seconds – 5 minutes | Milliseconds to 2 seconds |
| Memory overhead | 512MB – 2GB per instance | 1MB – 50MB per instance |
| OS isolation | Full guest OS per VM | Shared host kernel, isolated namespaces |
| Disk footprint | 5GB – 50GB per image | 5MB – 500MB per image |
| Portability | Hypervisor-dependent (.vmdk, .vhd) | Runs on any Docker host (Linux, Mac, Windows, Cloud) |
| Security isolation | Strong (separate kernel) | Good (namespaces + cgroups, but shared kernel) |
| Best for | Full OS control, strong isolation needs | Microservices, CI/CD pipelines, developer environments |
| Scaling speed | Minutes (VM provisioning) | Seconds (container spin-up) |

When choosing between them, consider your workload's isolation requirements, startup latency tolerance, and operational overhead budget. For most web applications and APIs on single-tenant infrastructure, the figures above put per-instance memory and disk overhead roughly 10-100x lower for containers than for VMs.

Quick Decision Rule
If you can run it on the same kernel as the host, use a container. If you need a different kernel, full OS isolation, or are running untrusted code in a multi-tenant environment, use a VM or a lightweight VM alternative like Kata Containers.
Production Insight
Many production teams start with containers but eventually need a mix: containers for stateless services and VMs for legacy workloads or compliance requirements. Tools like AWS Fargate and Azure Container Instances abstract away the underlying VM management, giving you container speed with VM-level isolation for each task.
Key Takeaway
Containers are not miniature VMs — they share the host kernel. This fundamental difference makes containers faster and more efficient, but with weaker isolation. Choose based on your security and workload requirements.

Docker Architecture: From CLI to Running Container

When you type docker run, a chain of components works together to create and start a container. Understanding this flow helps you troubleshoot startup failures (like the ECONNREFUSED scenario) and optimize build performance.

  1. Docker CLI (your terminal) sends a REST API request to the Docker daemon. The CLI uses the DOCKER_HOST environment variable to know where the daemon is listening (default: unix:///var/run/docker.sock).
  2. Docker daemon (dockerd) receives the request, checks local image cache, and pulls the image if necessary. It then calls containerd via gRPC to create a container.
  3. containerd is the industry-standard container runtime (used by Docker, Kubernetes, and others). It manages the entire container lifecycle — image transfer, storage, network interfaces, and process orchestration. It calls runc to actually start the container.
  4. runc is the low-level OCI runtime that spawns the container process on the host. It uses Linux namespaces (PID, mount, network, user) and cgroups (CPU, memory, I/O) to isolate the process. Once the container is running, runc exits and the process runs directly under the host kernel.

The diagram below visualizes this flow. The important detail: the Docker daemon is not involved in the running container's process — it only coordinates setup. This makes containers lightweight and fast.

```mermaid
graph TD
    A[User runs docker run] --> B[Docker CLI]
    B -->|REST API on /var/run/docker.sock| C["Docker Daemon (dockerd)"]
    C -->|gRPC| D[containerd]
    D -->|OCI runc call| E[runc]
    E -->|"clone() with namespaces & cgroups"| F[Container Process]
    F --> G[Host Kernel]
    C -->|image pull| H["Registry (Docker Hub)"]
    H --> C
```
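The first hop in that chain, the CLI talking REST to the daemon over a Unix socket, can be reproduced by hand with curl. This is a minimal sketch: docker_api_get and the DOCKER_SOCKET variable are hypothetical names introduced here for illustration, and it assumes curl is installed and your user can read the socket. /version and /containers/json are Docker Engine API endpoints.

```bash
# Sketch: call the Docker Engine API directly, bypassing the docker CLI.
# docker_api_get and DOCKER_SOCKET are illustrative names, not Docker features.
docker_api_get() {
  local path="$1"
  local socket="${DOCKER_SOCKET:-/var/run/docker.sock}"

  # --unix-socket sends the HTTP request over the socket file instead of TCP;
  # the host part of the URL is required by curl but otherwise ignored
  curl --silent --fail --unix-socket "$socket" "http://localhost${path}"
}

# Example calls (they need a running daemon):
#   docker_api_get /version           # JSON, similar to `docker version`
#   docker_api_get /containers/json   # JSON, similar to `docker ps`
```

If these calls fail while the daemon is running, the problem is usually socket permissions or a DOCKER_HOST pointing elsewhere, which is also why the CLI itself would fail.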

Production Insight
In Kubernetes, you often replace dockerd with containerd directly (no Docker daemon). Understanding this decoupling helps when migrating from Docker Swarm or Docker Compose to Kubernetes — the container runtime stays the same, but the orchestration layer changes. Also, the Docker daemon is a single point of failure if you tie all container management to it; running containers directly via containerd (using crictl or ctr) avoids that bottleneck.
Key Takeaway
The Docker architecture is a layered pipeline: CLI -> daemon -> containerd -> runc. The daemon coordinates but does not run containers directly. This decoupling allows Kubernetes to use containerd independently.

First 5 Docker Commands Every Developer Should Know

If you're new to Docker, these five commands cover most day-to-day work. Master them before diving into advanced topics like multi-stage builds or healthchecks.

  1. docker pull: Download an image from a registry (Docker Hub by default). Always specify a tag; never use latest in scripts.
     ```bash
     docker pull postgres:16-alpine
     ```
  2. docker run: Create and start a container from an image. Common flags: -d (detach), -p (port mapping), --name, -v (volume), -e (environment variable).
     ```bash
     docker run -d --name my-postgres -p 5432:5432 -e POSTGRES_PASSWORD=secret postgres:16-alpine
     ```
  3. docker ps: List running containers. Add -a to include stopped containers. Use --format for custom output.
     ```bash
     docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
     ```
  4. docker logs: View logs from a container. Use -f to follow (tail -f style). Combine with --tail for performance.
     ```bash
     docker logs -f my-postgres --tail 20
     ```
  5. docker exec: Run a command inside a running container. The -it flags allocate an interactive terminal (useful for debugging).
     ```bash
     docker exec -it my-postgres psql -U postgres
     ```

These commands are the foundation. Once comfortable, add docker compose, docker build, docker images, and docker system prune to your toolkit.

io/thecodeforge/first_5_commands.sh (Bash)
# 1. Pull an image
$ docker pull alpine:3.19

# 2. Run a container interactively
$ docker run -it alpine:3.19 sh
/ # echo "Hello from inside container"
Hello from inside container
/ # exit

# 3. List running containers
$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS   PORTS   NAMES

# 4. View logs of a container (replace <container-id>)
$ docker logs <container-id>

# 5. Execute a command in a running container
$ docker run -d --name test alpine:3.19 sleep 30
$ docker exec test echo "Alive!"
Alive!
$ docker stop test && docker rm test
Output
Hello from inside container
Alive!
Always Clean Up After Yourself
Containers and images accumulate quickly. Run docker system prune -a --volumes periodically, but carefully: it removes stopped containers, unused networks, every image not used by a container, build cache, and unused volumes (on recent Docker releases, only anonymous ones). Omit --volumes if you want to keep volume data.
Production Insight
In CI/CD pipelines, use docker run --rm to auto-clean containers after they finish. This prevents disk fill-ups on build machines. Also, always pull images with a specific digest (@sha256:...) instead of tags for immutable builds.
Key Takeaway
Five commands — pull, run, ps, logs, exec — form the daily driver set. Practice them until they're muscle memory, then explore compose and volumes.
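To act on the digest advice above, you first need the digest that a tag currently points to. A minimal sketch, assuming the image has already been pulled from a registry (locally built images have no repo digest, and the command then fails); resolve_digest is a hypothetical helper name.

```bash
# Sketch: print the immutable name@sha256:... reference for a pulled image.
# resolve_digest is an illustrative helper, not a Docker command.
resolve_digest() {
  local image="$1"

  # RepoDigests lists registry digests recorded for this image;
  # `index ... 0` takes the first entry and errors if the list is empty
  docker image inspect --format '{{index .RepoDigests 0}}' "$image"
}

# Example (after `docker pull alpine:3.19`):
#   resolve_digest alpine:3.19
# The printed reference can be used in place of the tag in CI scripts.
```

Unlike a tag, the printed reference always resolves to the same image content, which is what makes the build reproducible.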

Images, Layers and Dockerfiles: How Docker Actually Builds Your App

A Docker image is a read-only blueprint for creating containers. A container is a running instance of an image — the same relationship as a class and an object in OOP, or a recipe and a meal.

Images are built in layers. Every instruction in a Dockerfile creates a new layer on top of the previous one. Docker caches these layers aggressively. This is the single most important thing to understand about Dockerfile efficiency: if layer 3 changes, Docker rebuilds from layer 3 downward. Layers 1 and 2 are served from cache instantly.

This is why experienced engineers always copy dependency manifests (package.json, requirements.txt, go.mod) and install dependencies BEFORE copying application source code. Source code changes every commit; dependencies change rarely. Put the slow, stable work near the top of your Dockerfile so it stays cached.

Multi-stage builds are the other major pattern worth knowing early. You use one image (with compilers, build tools, dev dependencies) to build your app, then copy only the compiled output into a minimal runtime image. Your final image contains zero build tooling — smaller, faster, and with a dramatically reduced attack surface.

Let's build a realistic Node.js API with both patterns applied — this is what a production-ready Dockerfile actually looks like, not the toy examples you usually see.

Layer cleanup in the same RUN: Each RUN creates a new layer. If you download a 200MB package in one RUN and delete it in the next RUN, the 200MB still exists in the first layer — layers are additive. Always chain download and cleanup in the same RUN with &&.

io/thecodeforge/Dockerfile (Dockerfile)
# ── STAGE 1: Build Stage ──────────────────────────────────────────────────────
# Use the full Node image with build tools available
FROM node:20-alpine AS builder
# 'AS builder' names this stage so we can reference it later
# node:20-alpine uses Alpine Linux — much smaller than node:20-bullseye

# Set the working directory inside the container
WORKDIR /app

# COPY dependency files FIRST — before application code
# Docker caches this layer. If package.json hasn't changed, npm install
# won't re-run even if your source code changed. This saves minutes per build.
COPY package.json package-lock.json ./

# Install only production dependencies (saves ~200MB vs installing devDependencies)
RUN npm ci --omit=dev
# npm ci is faster and stricter than npm install — it respects package-lock.json exactly

# NOW copy application source code
# Changing any source file only invalidates from this line forward
COPY src/ ./src/

# ── STAGE 2: Production Runtime Stage ─────────────────────────────────────────
# Start fresh from a clean base image: nothing from the build stage carries over unless we COPY it
FROM node:20-alpine AS production

# Run as a non-root user — critical for production security
# node:alpine ships with a 'node' user built in
USER node

WORKDIR /app

# Copy only what we need from the builder stage — not the entire filesystem
COPY --from=builder --chown=node:node /app/node_modules ./node_modules
COPY --from=builder --chown=node:node /app/src ./src
COPY --chown=node:node package.json ./
# --chown ensures the node user owns these files, not root

# Document which port the app listens on (informational — doesn't actually publish it)
EXPOSE 3000

# Define the command to run when a container starts from this image
# Use array form (exec form) — NOT string form — to ensure signals are handled correctly
CMD ["node", "src/server.js"]
Output
# Build the image — run from the directory containing your Dockerfile
$ docker build -t my-node-api:1.0.0 .
Sending build context to Docker daemon 48.13kB
Step 1/13 : FROM node:20-alpine AS builder
---> 3f4d90098f5b
Step 2/13 : WORKDIR /app
---> Using cache
Step 3/13 : COPY package.json package-lock.json ./
---> Using cache <- dependencies layer served from cache!
Step 4/13 : RUN npm ci --omit=dev
---> Using cache <- install step also cached, so the build is fast
Step 5/13 : COPY src/ ./src/
---> 8c3a1b2d4e5f <- only this layer rebuilt (source changed)
...
Successfully built a7b3c9d1e2f4
Successfully tagged my-node-api:1.0.0
# Check the final image size
$ docker image ls my-node-api
REPOSITORY TAG IMAGE ID CREATED SIZE
my-node-api 1.0.0 a7b3c9d1e2f4 12 seconds ago 142MB
# Compare: the builder stage alone would be ~380MB with all dev tooling
Layers as Transparent Slides
  • Each layer is a diff on top of the previous layer. If the base changes, the diff no longer applies.
  • Docker cannot know if a later instruction depends on the changed content in an earlier layer.
  • The cache is sequential, not selective — Docker rebuilds from the first invalidated layer onward.
  • This is why layer ordering (least-change to most-change) is the single most impactful Dockerfile optimization.
Production Insight
Breaking the cleanup-in-same-layer rule is a common cause of bloated images. One team's image was 1.2GB because they ran apt-get install in one RUN and apt-get clean in the next, so the 800MB apt cache persisted in the first layer. Fix: chain with && and clean up in the same RUN. This alone reduced their image from 1.2GB to 340MB.
Key Takeaway
Docker builds images as a stack of cached layers. Order instructions from least-to-most frequently changing. Copy dependency manifests before source code. Chain cleanup in the same RUN as the operation. This single optimization can turn 5-minute rebuilds into 10-second rebuilds.
Layer Ordering Strategy
If: Base image (FROM)
Use: First layer. Changes rarely. Cached indefinitely until the tag is updated.
If: System dependencies (apt-get install, apk add)
Use: Second layer. Changes occasionally. Chain with && and clean up in the same RUN.
If: Dependency manifests (package.json, requirements.txt)
Use: Third layer. Changes when dependencies change. Copy BEFORE source code.
If: Dependency installation (npm ci, pip install)
Use: Fourth layer. Changes when dependencies change. Cached until manifests change.
If: Source code (COPY . . or COPY src/)
Use: Last layer. Changes on every code edit. Must be the final COPY to maximize cache.

Volumes and Docker Compose: Persistence and Multi-Container Orchestration

Containers are ephemeral by design. Data written inside a container lives in its writable layer and is gone as soon as the container is removed or re-created from its image. That's perfect for stateless services, but databases, file uploads, and logs need to outlive any single container. Docker volumes solve this by mounting a storage location from the host (or a managed volume) into the container's filesystem.

There are three storage mechanisms: bind mounts (link a specific host directory into the container — great for local development where you want live code reloading), named volumes (Docker manages the storage location — best for databases in production), and tmpfs mounts (in-memory only — useful for sensitive data you never want written to disk).

Real applications are rarely a single container. You have an API, a database, a cache, maybe a background worker. Running and networking these manually with individual docker run commands is error-prone and hard to reproduce reliably. Docker Compose lets you define your entire multi-container application in one YAML file and bring it all up with a single command.

Here's a complete, realistic Compose setup for a Node.js API backed by PostgreSQL and Redis — the stack you'll encounter in most backend roles.

The depends_on trap: depends_on without condition: service_healthy only waits for the container to START, not for the process inside to be READY. Postgres can take 5-15 seconds to initialize on first boot. Without service_healthy, your API will crash on boot trying to connect to a database that is not accepting connections yet. This is one of the most common causes of flaky Docker Compose environments.
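The gap between "started" and "ready" is visible from the outside: Docker records a health status for any container that defines a healthcheck. The sketch below polls that status, roughly what condition: service_healthy does for you. wait_for_healthy is a hypothetical helper name, and the timeout and interval defaults are arbitrary assumptions; pass a positive interval.

```bash
# Sketch: block until a container's healthcheck reports "healthy".
# wait_for_healthy is an illustrative helper; defaults are assumptions.
# Returns 0 when healthy, 1 when unhealthy, 2 on timeout.
wait_for_healthy() {
  local container="$1" timeout="${2:-60}" interval="${3:-2}"
  local waited=0 status

  while :; do
    # .State.Health only exists when the container defines a healthcheck;
    # without one the template fails and we report "unknown"
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)" \
      || status="unknown"

    case "$status" in
      healthy)   return 0 ;;
      unhealthy) echo "$container is unhealthy" >&2; return 1 ;;
    esac

    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s (last status: $status)" >&2
      return 2
    fi

    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example: wait_for_healthy my-postgres 60 2 && echo "Postgres is ready"
```

While Postgres is initializing, the status reads "starting" even though docker ps already shows the container as Up, which is exactly the window in which an API without service_healthy crashes.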

io/thecodeforge/docker-compose.yml (YAML)
# Docker Compose V2 format (no 'version' key needed with modern Docker Desktop)
services:

  # ── The API service ───────────────────────────────────────────────────────
  api:
    build:
      context: .           # Build from the Dockerfile in the current directory
      target: production   # Use the 'production' stage from our multi-stage Dockerfile
    container_name: my-api
    ports:
      - "3000:3000"        # Map host port 3000 -> container port 3000
    environment:
      # Reference values from a .env file — never hardcode secrets in Compose files
      NODE_ENV: production
      DATABASE_URL: postgresql://api_user:${DB_PASSWORD}@postgres:5432/app_db
      REDIS_URL: redis://redis:6379
      # 'postgres' and 'redis' are the service names below — Docker's internal
      # DNS resolves them automatically within the shared network
    depends_on:
      postgres:
        condition: service_healthy   # Wait until postgres passes its health check
      redis:
        condition: service_started
    restart: unless-stopped          # Restart on crash, but not if manually stopped

  # ── PostgreSQL database ───────────────────────────────────────────────────
  postgres:
    image: postgres:16-alpine        # Always pin a specific version — never use 'latest'
    container_name: my-postgres
    environment:
      POSTGRES_DB: app_db
      POSTGRES_USER: api_user
      POSTGRES_PASSWORD: ${DB_PASSWORD}   # Pulled from .env file
    volumes:
      - postgres_data:/var/lib/postgresql/data
      # Named volume — Docker manages where this lives on the host.
      # Database files survive 'docker compose down' and container rebuilds.
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
      # Bind mount an init script — runs once when the DB is first created.
      # :ro makes it read-only inside the container (good security habit)
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U api_user -d app_db"]
      interval: 5s     # Check every 5 seconds
      timeout: 5s      # Fail if no response in 5 seconds
      retries: 5       # Mark unhealthy after 5 consecutive failures
      start_period: 30s  # Startup grace period: failed checks in this window don't count toward retries

  # ── Redis cache ───────────────────────────────────────────────────────────
  redis:
    image: redis:7-alpine
    container_name: my-redis
    command: redis-server --appendonly yes
    # --appendonly yes enables AOF persistence — data survives Redis restarts
    volumes:
      - redis_data:/data

# Named volumes must be declared at the top level
# Docker creates and manages these — they persist across 'docker compose down'
volumes:
  postgres_data:
  redis_data:
Output
# Start everything (add -d for detached/background mode)
$ docker compose up -d
[+] Running 6/6
✔ Network my-app_default Created
✔ Volume "postgres_data" Created
✔ Volume "redis_data" Created
✔ Container my-postgres Healthy
✔ Container my-redis Started
✔ Container my-api Started
# Check all services are running
$ docker compose ps
NAME IMAGE COMMAND STATUS PORTS
my-api my-app-api "docker-entrypoint.s…" Up 12 seconds 0.0.0.0:3000->3000/tcp
my-postgres postgres:16-alpine "docker-entrypoint.s…" Up 18 seconds 5432/tcp
my-redis redis:7-alpine "docker-entrypoint.s…" Up 18 seconds 6379/tcp
# Tail logs from a specific service
$ docker compose logs -f api
my-api | Server listening on port 3000
my-api | Database connection established
my-api | Redis connection established
# Tear down (volumes are preserved by default)
$ docker compose down
# Add --volumes to also delete the named volumes (WARNING: deletes all DB data)
docker compose down vs docker compose down -v
  • docker compose down -v deletes all named volumes for the project, including databases with days of data.
  • There is no undo. Once volumes are deleted, data is gone unless backed up.
  • Developers often use it thinking it is a 'clean restart' — it is a destructive operation.
  • Always back up volumes before running down -v. Use: docker run --rm -v vol:/data -v $(pwd):/backup alpine tar czf /backup/backup.tar.gz -C /data .
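The backup one-liner above can be paired with a matching restore. This is a sketch under a few assumptions: backup_volume and restore_volume are hypothetical helper names, the archive is written to the current directory, and the service using the volume is stopped first (archiving a live database directory can produce an inconsistent copy).

```bash
# Sketch: archive a named volume to ./backup.tar.gz and restore it again.
# backup_volume / restore_volume are illustrative helpers, not Docker commands.
backup_volume() {
  local volume="$1" archive="${2:-backup.tar.gz}"

  # Mount the volume read-only and write the archive into the current directory
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "$(pwd):/backup" \
    alpine:3.19 tar czf "/backup/${archive}" -C /data .
}

restore_volume() {
  local volume="$1" archive="${2:-backup.tar.gz}"

  # Unpack the archive into the (ideally empty) volume
  docker run --rm \
    -v "${volume}:/data" \
    -v "$(pwd):/backup:ro" \
    alpine:3.19 tar xzf "/backup/${archive}" -C /data
}

# Example: backup_volume postgres_data && docker compose down -v
```

Compose usually prefixes volume names with the project name, so check docker volume ls for the exact name before running either helper.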
Production Insight
The depends_on with service_healthy pattern is not just for databases. Any service with initialization time — Redis, Elasticsearch, Kafka, message queues — needs a healthcheck and a depends_on condition. Without it, dependent services will crash on boot and enter a restart loop, delaying deployments and creating flaky CI pipelines.
Key Takeaway
Named volumes persist data across container restarts — they are the production default. depends_on with condition: service_healthy prevents race conditions between services. docker compose down preserves volumes; down -v deletes them. Always pin image versions — never use 'latest' in production.
Volume Type Selection
If: Production database or stateful service
Use: Named volume (postgres_data:/var/lib/postgresql/data). Docker manages the path. Portable.
If: Development: live code reload, config files
Use: Bind mount (-v ./src:/app/src). Direct access to host files. Fast iteration.
If: Sensitive data that should never touch disk
Use: tmpfs mount (--tmpfs /secrets:size=1m). In-memory only. Deleted on container stop.
If: Shared config files across multiple containers
Use: Named volume with :ro (read-only) flag. Prevents accidental modification by any container.
● Production incident · POST-MORTEM · severity: high

Production Deployment Crashes on Boot — API Connects to Database Before Postgres Is Ready

Symptom
After running docker compose up -d, the API container exited immediately with code 1. docker compose logs api showed: 'Error: connect ECONNREFUSED 172.18.0.2:5432'. The container restarted automatically (restart: unless-stopped) but crashed again with the same error. After 4-5 restarts over 60 seconds, Postgres finished initializing, and the API finally started successfully on the 6th attempt.
Assumption
The team assumed a network configuration issue — maybe the containers were on different networks. They checked docker network inspect and confirmed both containers were on the same network. They assumed a DNS resolution issue — they exec'd into the API container and ran nslookup postgres, which resolved correctly. They assumed the database was misconfigured — they exec'd into the Postgres container and ran psql, which connected successfully.
Root cause
The docker-compose.yml had depends_on: [postgres] without a condition. depends_on without condition: service_healthy only waits for the container to START (docker start returns), not for the process inside to be READY. Postgres takes 5-15 seconds to initialize after the container starts — creating the default database, running init scripts, and opening the port. The API container started immediately after the Postgres container started, before Postgres was accepting connections. The API crashed, Docker restarted it, and this repeated until Postgres was finally ready.
Fix
  1. Added a healthcheck to the Postgres service using pg_isready.
  2. Changed depends_on to use condition: service_healthy so the API waits until Postgres passes its health check.
  3. Added start_period: 30s to the healthcheck to prevent false failures during Postgres initialization.
  4. Added a retry loop in the API's startup code as a defense-in-depth measure (connect with exponential backoff for 30 seconds before giving up).
  5. Documented the pattern in the team's Docker Compose style guide.
Key lesson
  • depends_on without condition: service_healthy only waits for container start — not process readiness. Always use service_healthy for databases.
  • Postgres, MySQL, Redis, and any service with initialization time needs a healthcheck. Without it, dependent services will crash on boot.
  • A restart loop (container crashing and restarting repeatedly) is a symptom of a race condition, not a network issue. Check startup ordering first.
  • Defense-in-depth: add a connection retry loop in your application code as a second layer of protection, even with healthchecks in place.
  • The start_period flag on healthcheck prevents false failures during slow startup. Set it to the expected maximum initialization time.
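The retry-with-backoff idea from the fix can be sketched as a small shell wrapper. The incident's fix lived in the API's own startup code, which is not shown in the source, so treat this as an equivalent illustration rather than the team's implementation. retry_with_backoff is a hypothetical helper, and the example call assumes pg_isready is available in the API image, which a stock Node image does not include.

```bash
# Sketch: retry a command with exponentially growing delays.
# retry_with_backoff is an illustrative helper, not the team's actual code.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1

  while true; do
    # Run the command; stop retrying as soon as it succeeds
    "$@" && return 0

    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi

    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))        # 1s, 2s, 4s, 8s, ...
    attempt=$((attempt + 1))
  done
}

# Example entrypoint (6 attempts starting at 1s waits 1+2+4+8+16 = 31s in total):
#   retry_with_backoff 6 1 pg_isready -h postgres -p 5432 -U api_user -d app_db \
#     && exec node src/server.js
```

The healthcheck covers ordering at startup; a retry like this also covers the case where Postgres restarts later while the API keeps running.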
Production debug guide
From startup crashes to slow builds: systematic debugging paths.
Symptom · 01
Container exits immediately with code 1 or 137 on startup.
Fix
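Check logs: docker compose logs <service> or docker logs <container>. Exit code 1 is an application error: check the stack trace. Exit code 137 (128 + 9) means the process received SIGKILL, most often from the kernel OOM killer: confirm with docker inspect <container> --format '{{.State.OOMKilled}}' and compare usage against the memory limit in docker stats. Exit code 143 (128 + 15) means SIGTERM: something asked the container to stop, such as docker stop, a deployment tool, or an orchestrator reacting to a failed healthcheck.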
Symptom · 02
Container cannot connect to another container — 'ECONNREFUSED' or 'could not translate host name'.
Fix
Verify both containers are on the same network: docker network inspect <network>. Check if the target container is running: docker compose ps. Check if the target service is healthy (if it has a healthcheck). Verify DNS resolution: docker exec <container> nslookup <service-name>.
Symptom · 03
Docker build is slow — every rebuild takes 3-5 minutes.
Fix
Check layer ordering. Run docker history <image> to see which layers were rebuilt. If the dependency install layer rebuilds on every code change, the Dockerfile copies source code before dependency manifests. Fix: copy package.json/requirements.txt before COPY . . and run install in a separate layer.
Symptom · 04
Container data disappears after restart.
Fix
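Check if a volume is mounted: docker inspect <container> --format '{{.Mounts}}'. If no volume is mounted, data lives in the container's writable layer. That layer survives docker restart but is lost whenever the container is removed or recreated (docker rm, docker compose down, or an up that recreates the container). Fix: add a named volume to docker-compose.yml and restart.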
Symptom · 05
docker compose up fails with 'port is already allocated'.
Fix
Check what is using the port: docker ps --format '{{.Ports}}' or ss -tlnp | grep <port>. Either stop the conflicting container or change the host-side port mapping in docker-compose.yml.
Symptom · 06
Image is unexpectedly large — 1GB+ for a simple application.
Fix
Run docker history --no-trunc <image> to see layer sizes. Check if .dockerignore exists — without it, COPY . . includes node_modules and .git. Check if multi-stage builds are used — the final image may contain build tools. Check if package manager cache is cleaned in the same RUN layer.
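For Symptoms 04 and 05 above, the two Compose-level fixes can be sketched together. This is a minimal sketch under stated assumptions: the service names, the postgres:16 tag, the password, the volume name pgdata, and the 3001:3000 mapping are illustrative, and the data path is the default for the official Postgres image at that tag.

```yaml
# Illustrative sketch: names, tags, credentials, and ports are assumptions.
services:
  api:
    build: .
    ports:
      - "3001:3000"   # host port 3001 -> container port 3000; change the left side on a conflict

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - pgdata:/var/lib/postgresql/data   # named volume holds the database files

volumes:
  pgdata:   # declared at top level so Compose creates and manages it
```

With the volume declared, docker compose down followed by docker compose up keeps the database files; docker compose down -v removes pgdata along with the containers.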
Docker Container Triage Cheat Sheet
First-response commands when containers crash, builds are slow, or services cannot communicate.
Container exits immediately on startup.
Immediate action
Check container logs and exit code.
Commands
docker compose logs <service> --tail 50
docker inspect <container> --format '{{.State.ExitCode}} {{.State.Error}}'
Fix now
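Exit code 1 = app error (check logs). Exit code 137 = SIGKILL, usually the OOM killer (confirm with '{{.State.OOMKilled}}', then raise the memory limit). Exit code 143 = SIGTERM (something stopped the container; check for docker stop calls or an orchestrator).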
Container A cannot reach Container B by hostname.
Immediate action
Verify network connectivity and DNS resolution.
Commands
docker network inspect <network> --format '{{range .Containers}}{{.Name}} {{end}}'
docker exec <container-a> nslookup <container-b-service-name>
Fix now
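If container-b is missing from the network, add it to the same network in docker-compose.yml. If the containers run on the default bridge network (plain docker run), create a user-defined network: the default bridge does not resolve container names.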
Docker build is slow — every rebuild takes minutes.
Immediate action
Check which layers are being rebuilt vs cached.
Commands
docker build --progress=plain -t test . 2>&1 | grep -E 'CACHED|RUN|COPY'
docker history <image> --format '{{.Size}} {{.CreatedBy}}' | sort -hr
Fix now
If RUN npm install rebuilds on every change, move COPY package.json (plus the lockfile) and RUN npm install above COPY . . so that dependency installation is cached separately from source code.
Port already allocated — container cannot start.
Immediate action
Find what is using the port.
Commands
docker ps --format '{{.Names}} {{.Ports}}' | grep <port>
ss -tlnp | grep <port>
Fix now
Stop the conflicting container or change the host-side port in docker-compose.yml (e.g., 3001:3000 instead of 3000:3000).
Container data lost after docker compose down.
Immediate action
Check if volumes were deleted or never created.
Commands
docker volume ls | grep <project>
docker compose config | grep -A2 volumes
Fix now
If docker compose down -v was run, the data is gone (check backups). If no volumes are defined in docker-compose.yml, add named volumes for stateful services.
Image is unexpectedly large (>500MB).
Immediate action
Inspect layer sizes and check for .dockerignore.
Commands
docker history <image> --format '{{.Size}} {{.CreatedBy}}' | sort -hr
cat .dockerignore 2>/dev/null || echo 'NO .dockerignore FILE'
Fix now
Create .dockerignore. Use multi-stage builds. Chain cleanup in same RUN: RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*.
Virtual Machines vs Docker Containers
| Aspect | Virtual Machines | Docker Containers |
| --- | --- | --- |
| Startup time | 30 seconds – 5 minutes | Milliseconds to 2 seconds |
| Memory overhead | 512MB – 2GB per instance | 1MB – 50MB per instance |
| OS isolation | Full guest OS per VM | Shared host kernel, isolated namespaces |
| Disk footprint | 5GB – 50GB per image | 5MB – 500MB per image |
| Portability | Hypervisor-dependent (.vmdk, .vhd) | Runs on any Docker host (Linux, Mac, Windows, Cloud) |
| Security isolation | Strong (separate kernel) | Good (namespaces + cgroups, but shared kernel) |
| Best for | Full OS control, strong isolation needs | Microservices, CI/CD pipelines, developer environments |
| Scaling speed | Minutes (VM provisioning) | Seconds (container spin-up) |

Key takeaways

1. Containers share the host OS kernel; they're not mini VMs. This is why they start in milliseconds and use megabytes of memory, making them economically practical for microservices at scale.
2. Docker image layers are cached from top to bottom. Copy dependency manifests and run installs BEFORE copying source code, or every source change will trigger a full package reinstall.
3. Multi-stage builds are not optional in production: they separate build-time tooling from the runtime image, often cutting image sizes by 50-70% and removing attack surface from your deployed artifact.
4. Named volumes persist data across container restarts and rebuilds; depends_on with condition: service_healthy prevents startup race conditions. Both are non-negotiable for any database-backed service.
5. docker compose down preserves volumes. docker compose down -v deletes them. Always back up volumes before any destructive operation.
6. Always use exec-form CMD (CMD ["node", "server.js"]). Shell form wraps the process in /bin/sh -c, so the app is not PID 1 and may never receive SIGTERM, which silently breaks graceful shutdown in Docker and Kubernetes alike.
Frequently Asked Questions

01. What is Docker used for in real-world software development?
02. Is Docker the same as a virtual machine?
03. Does data inside a Docker container get deleted when the container stops?
04. What is the difference between depends_on and depends_on with condition: service_healthy?
05. How do I reduce the size of my Docker image?