Advanced 6 min · March 06, 2026

Docker ML Models — Fixing numpy Version Drift in Serving

A numpy 1.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • The image is the artifact. It runs identically on your laptop, CI, and production GPU instances.
  • Multi-stage builds separate training dependencies from serving runtime, keeping images small.
  • NVIDIA Container Toolkit exposes GPU devices to containers via --gpus flag.
  • Base image with pinned CUDA and Python versions
  • Model weights copied or mounted as volumes
  • Serving framework (FastAPI, TorchServe, Triton) as the entrypoint
  • Health check endpoint for orchestrator readiness probes
Plain-English First

Imagine you bake the perfect cake in your kitchen — but when you try to bake it at a friend's house, it collapses because their oven runs hotter and they don't have the same brand of flour. Docker is like shipping your entire kitchen — oven, flour, recipe, temperature settings — in one sealed box, so the cake comes out identical every single time, on any stove, anywhere in the world. For ML models, that 'kitchen' is Python 3.10, CUDA 11.8, PyTorch, your trained weights file, and your serving script. Docker boxes all of that up so the model that worked on your laptop works exactly the same way in production on a cloud GPU instance.

ML models are environment-sensitive in ways that most web applications are not. A model trained with numpy 1.23 can silently produce different floating-point results on numpy 2.0. A CUDA version mismatch between training and serving can cause crashes, or a silent CPU fallback that makes inference as much as 50x slower. These are not hypothetical: they are among the most common causes of 'it works on my machine' failures in ML deployments.

Docker eliminates environment drift by packaging the entire runtime — OS libraries, Python version, pip packages, CUDA toolkit, model weights, and serving logic — into a single versioned image. That image runs identically on your laptop, your CI pipeline, a Kubernetes cluster, and an edge device.

The gap between a Jupyter notebook that produces great metrics and a model that reliably serves predictions in production is wider than most teams expect. Docker closes that gap by making the environment a constant, not a variable. This guide covers the patterns that separate production-grade ML containers from fragile ones.

Multi-Stage Builds for ML — Separating Training from Serving

The most common mistake in ML Dockerfiles is shipping the training environment as the serving image. A training image includes Jupyter, gcc, test frameworks, debugging tools, and development dependencies — none of which are needed in production. This bloats the image to 5-10GB, increases attack surface, and slows deployments.

Multi-stage builds solve this by using a heavy 'builder' stage with all training and build dependencies, then copying only the trained model and runtime dependencies into a minimal 'serving' stage. The final image contains Python, the serving framework, and the model — nothing else.

Why this matters for ML specifically: ML images are uniquely large because they include CUDA toolkit (2-3GB), PyTorch/TensorFlow (1-2GB), and model weights (1-10GB). A single-stage image that includes training tools, CUDA development headers, and model weights can easily exceed 10GB. A multi-stage serving image with quantized weights can be under 2GB.

Layer caching for ML: Model weights change rarely (only after retraining). Dependencies change occasionally. Application code changes frequently. Order your Dockerfile: base image + CUDA first, dependencies second, model weights third, application code last. This ensures code changes do not trigger a re-download of PyTorch or a re-copy of multi-gigabyte weights.
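The ordering rule above can be made concrete with a toy model of the build cache. This is a simplified sketch, not Docker's actual cache logic: it assumes each step depends on a named set of inputs, and that the first step whose inputs changed invalidates itself and every step after it. The step names and the `rebuilt_steps` helper are hypothetical.

```python
from typing import List, Set, Tuple

# Each step is (name, inputs it depends on). Names are illustrative only.
Step = Tuple[str, Set[str]]


def rebuilt_steps(steps: List[Step], changed: Set[str]) -> List[str]:
    """Return the steps that must rebuild when the `changed` inputs differ.

    Simplified cache model: the first step whose inputs changed misses the
    cache, and every step after it is rebuilt as well.
    """
    for index, (_, inputs) in enumerate(steps):
        if inputs & changed:
            return [name for name, _ in steps[index:]]
    return []


# Recommended order: rarely-changing inputs first, application code last.
good_order: List[Step] = [
    ("base image", {"base"}),
    ("pip install", {"requirements-serving.txt"}),
    ("copy weights", {"models/"}),
    ("copy code", {"src/"}),
]

# Anti-pattern: application code copied before dependencies and weights.
bad_order: List[Step] = [
    ("base image", {"base"}),
    ("copy code", {"src/"}),
    ("pip install", {"requirements-serving.txt"}),
    ("copy weights", {"models/"}),
]

print(rebuilt_steps(good_order, {"src/"}))  # ['copy code']
print(rebuilt_steps(bad_order, {"src/"}))   # ['copy code', 'pip install', 'copy weights']
```

With the recommended order, a code-only change rebuilds one small layer. With code copied early, the same change forces the dependency install and the weight copy to run again.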

io/thecodeforge/ml-serving.Dockerfile (Dockerfile)
# ─── STAGE 1: Build environment (training deps, compilers) ───
FROM python:3.10.12-slim-bookworm AS builder

WORKDIR /build

# Install build dependencies (not in final image)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ libpq-dev && \
    rm -rf /var/lib/apt/lists/*

COPY requirements-training.txt .
RUN pip install --user --no-cache-dir -r requirements-training.txt

# Simulate model training artifact (in practice, this comes from
# a training pipeline or model registry)
COPY models/ ./models/

# ─── STAGE 2: Serving runtime (minimal) ───
FROM python:3.10.12-slim-bookworm AS serving

WORKDIR /app

# Install only runtime dependencies
COPY requirements-serving.txt .
RUN pip install --no-cache-dir -r requirements-serving.txt

# Copy trained model weights from builder
COPY --from=builder /build/models/ ./models/

# Copy serving application code
COPY src/serving/ ./src/serving/

# Non-root user for security
RUN useradd --create-home appuser
USER appuser

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"

EXPOSE 8080

CMD ["python", "-m", "uvicorn", "src.serving.api:app", "--host", "0.0.0.0", "--port", "8080"]
Output
# Build:
docker build -f io/thecodeforge/ml-serving.Dockerfile -t io.thecodeforge/ml-model:v1.0 .
# Image size comparison:
# Single-stage (with training deps): 8.2 GB
# Multi-stage (serving only): 1.8 GB
ML Image Layers as a Supply Chain
  • Docker layers are additive. A file added in one layer and deleted in a later layer still occupies space in the earlier layer.
  • RUN pip install torch && pip uninstall torch still has torch in the install layer — the image does not shrink.
  • Multi-stage builds start fresh — the serving stage never contains training dependencies in any layer.
  • This is the only way to genuinely reduce image size for ML workloads where base dependencies are gigabytes.
Production Insight
The layer caching insight is critical for ML because dependency downloads are large. PyTorch with CUDA is ~2GB. If every code change invalidates the pip install layer, every CI build downloads 2GB of dependencies. By copying requirements-serving.txt before application code, the dependency layer is cached on code-only changes — turning 10-minute builds into 30-second builds.
Key Takeaway
Multi-stage builds are not optional for ML serving images. The training environment and the serving environment are fundamentally different — training needs compilers and debugging tools, serving needs only the framework and the model. A 10GB training image deployed as a serving image wastes storage, slows deployments, and increases attack surface.
ML Image Size Optimization
If: Image is >5GB and includes training tools
Use: Multi-stage builds that separate training from serving. Copy only model weights and runtime deps to the serving stage.
If: Model weights are >1GB and baked into the image
Use: Move weights to a volume or download from S3/GCS at container startup. Keep the image under 2GB.
If: CUDA toolkit adds 2-3GB to the image
Use: Runtime-only CUDA base images (nvidia/cuda:11.8.0-runtime-ubuntu22.04) instead of devel images.
If: Multiple models share the same serving framework
Use: A shared base image with the framework, extended per model with just the weights and config.

GPU Access with NVIDIA Container Toolkit

ML inference on CPU can be 10-100x slower than on GPU, depending on the model and batch size. Docker does not expose GPU devices to containers by default: you need the NVIDIA Container Toolkit and the --gpus flag.

The NVIDIA Container Toolkit (formerly nvidia-docker2) installs a Docker runtime that automatically mounts the GPU device drivers and libraries into containers. Without it, containers see no GPU devices even when the host has GPUs available.

Installation and verification:
1. Install nvidia-container-toolkit on the host.
2. Configure Docker to use the nvidia runtime: sudo nvidia-ctk runtime configure --runtime=docker
3. Restart Docker: sudo systemctl restart docker
4. Verify: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

GPU allocation strategies:
  • --gpus all: expose all GPUs to the container
  • --gpus 1: expose one GPU (Docker picks which)
  • --gpus '"device=0,2"': expose specific GPUs by index
  • NVIDIA_VISIBLE_DEVICES=0,2: set via environment variable (useful in Compose)
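When containers are launched from a script rather than a shell, the quoting of the device list is easy to get wrong. The sketch below builds an argv list for docker run. The helper name `docker_run_args` and the image name are hypothetical, and it assumes the value passed to --gpus must carry literal double quotes whenever it contains a comma, as in the shell form shown above.

```python
from typing import List, Optional, Sequence


def docker_run_args(image: str, gpu_indices: Optional[Sequence[int]] = None) -> List[str]:
    """Build an argv list for `docker run` with a GPU selection.

    gpu_indices=None exposes all GPUs; otherwise only the listed indices.
    """
    if gpu_indices is None:
        gpus_value = "all"
    else:
        if not gpu_indices:
            raise ValueError("gpu_indices must not be empty")
        joined = ",".join(str(i) for i in gpu_indices)
        # The double quotes are part of the value itself: without them the
        # comma would be read as a separator between --gpus options.
        gpus_value = f'"device={joined}"'
    return ["docker", "run", "--rm", "--gpus", gpus_value, image]


# Pass the list to subprocess.run(...) without shell=True so no extra
# shell quoting is needed.
print(docker_run_args("example/ml-model:v1.0", [0, 2]))
```

Because the list form bypasses the shell, the outer single quotes from the command-line example are not needed here; only the inner double quotes remain.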

Failure scenario (silent CPU fallback): If the NVIDIA Container Toolkit is not installed or --gpus is not passed, torch.cuda.is_available() returns False inside the container. Serving code that picks its device with the common "cuda if available, else cpu" idiom then runs on CPU without complaint, and TensorFlow places operations on CPU when it finds no GPU. The model loads successfully and inference works, but latency can be 50x worse than expected. Nothing raises an error, and many serving frameworks never check. The fix: always add a startup assertion that verifies GPU availability.

io/thecodeforge/ml_serving/startup_check.py (Python)
import sys
import logging

logger = logging.getLogger(__name__)

def verify_gpu_availability(required_gpus: int = 1) -> None:
    """Startup assertion: fail fast if GPU is not available.
    
    Call this at application startup before loading the model.
    If GPU is required but not available, exit immediately
    rather than silently falling back to CPU.
    """
    try:
        import torch
        
        available = torch.cuda.is_available()
        device_count = torch.cuda.device_count()
        
        if not available:
            logger.error(
                "GPU not available. torch.cuda.is_available() returned False. "
                "Ensure NVIDIA Container Toolkit is installed and "
                "container is started with --gpus flag."
            )
            sys.exit(1)
        
        if device_count < required_gpus:
            logger.error(
                f"Insufficient GPUs: required={required_gpus}, "
                f"available={device_count}. "
                f"Adjust --gpus flag or reduce requirement."
            )
            sys.exit(1)
        
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        logger.info(
            f"GPU verified: {gpu_name}, "
            f"{gpu_memory:.1f}GB VRAM, "
            f"{device_count} device(s) available"
        )
        
    except ImportError:
        logger.error("PyTorch not installed. Cannot verify GPU availability.")
        sys.exit(1)
Output
# Successful startup:
# GPU verified: NVIDIA A10G, 22.0GB VRAM, 1 device(s) available
# Failed startup (no --gpus flag):
# ERROR: GPU not available. torch.cuda.is_available() returned False.
# Ensure NVIDIA Container Toolkit is installed and container is started with --gpus flag.
GPU Access as a Device Permission
  • PyTorch was designed to work on both CPU and GPU — GPU is an optimization, not a requirement.
  • Many development environments (laptops without GPU) run PyTorch on CPU legitimately.
  • The framework cannot know if you intended to use GPU or CPU — it defers to the developer.
  • This is why a startup assertion (verify_gpu_availability) is essential in production serving containers.
Production Insight
The silent CPU fallback is the most insidious GPU-related production bug. The model loads successfully, inference returns correct results, but latency is 50x slower than expected. Monitoring shows high CPU usage instead of GPU utilization. The team spends hours profiling the model code before discovering the container never had GPU access. The fix is a one-time startup assertion that fails fast if GPU is not available.
Key Takeaway
GPU access in Docker requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU — no error, just 50x slower inference. Always add a startup assertion that verifies GPU availability. This single check prevents hours of debugging silent CPU fallback.
GPU Troubleshooting Decision Tree
If: torch.cuda.is_available() returns False inside container
Use: Check if the --gpus flag was passed. Check if nvidia-container-toolkit is installed on the host. Run: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
If: GPU is available but inference is still slow
Use: Check if data is being transferred CPU->GPU on every inference call. Pin the model to GPU once at startup. Check batch size: batches that are too small underutilize GPU parallelism.
If: CUDA out of memory during inference
Use: Check if multiple containers share the same GPU. Use CUDA_VISIBLE_DEVICES to assign specific GPUs. Use float16 instead of float32. Reduce batch size.
If: nvidia-smi works on host but not in container
Use: The host driver must support the CUDA version used in the container. Compare the "CUDA Version" that nvidia-smi reports on the host (the newest CUDA release the driver supports) with the CUDA version in the FROM image; the host value should generally be equal or newer.

Model Serving Patterns — FastAPI, TorchServe, and Triton

There are three common patterns for serving ML models in Docker containers. The right choice depends on your latency requirements, model complexity, and operational maturity.

Pattern 1: FastAPI + direct model loading. Load the model at startup, expose a /predict endpoint. Simple, full control over the inference pipeline, easy to customize. Best for single-model serving with custom pre/post-processing. The model is loaded into the application process — startup time equals model load time.

Pattern 2: TorchServe / TF Serving. Purpose-built serving frameworks with built-in batching, model versioning, and A/B testing. More operational overhead but better for multi-model serving and high-throughput scenarios. TorchServe runs a separate model server process — the Docker container wraps the TorchServe binary.

Pattern 3: NVIDIA Triton Inference Server. GPU-optimized serving with dynamic batching, model ensemble pipelines, and multi-framework support (PyTorch, TensorFlow, ONNX, TensorRT). Highest throughput but most complex configuration. Best for latency-critical production workloads with multiple models.

Health check patterns: A health check that only verifies the server is running is insufficient for ML serving. The health check must verify that the model is loaded and can produce a valid output. A /health endpoint should run a dummy inference with a known input and verify the output shape matches expectations.

io/thecodeforge/ml_serving/api.py (Python)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import numpy as np
import torch
import logging

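# Import path assumes the src/serving/ layout that the Dockerfile copies and runs.
# A top-level package named `io` cannot be imported: it collides with Python's
# standard-library io module, which is not a package.
from src.serving.startup_check import verify_gpu_availability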

logger = logging.getLogger(__name__)

app = FastAPI(title="ML Model Serving API")

# Global model reference — loaded once at startup
model = None
model_device = None


class PredictionRequest(BaseModel):
    features: List[float]


class PredictionResponse(BaseModel):
    prediction: float
    model_version: str
    device: str


@app.on_event("startup")
async def load_model() -> None:
    """Load model once at startup — not on every request."""
    global model, model_device

    # Fail fast if GPU is not available
    verify_gpu_availability(required_gpus=1)

    model_device = torch.device("cuda:0")
    model_path = "./models/production_model.pt"

    logger.info(f"Loading model from {model_path}...")
    model = torch.jit.load(model_path, map_location=model_device)
    model.eval()

    # Warmup inference — ensures CUDA kernels are compiled
    dummy_input = torch.randn(1, 128, device=model_device)
    with torch.no_grad():
        _ = model(dummy_input)

    logger.info("Model loaded and warmed up successfully.")


@app.get("/health")
async def health_check() -> dict:
    """Health check that verifies model is loaded and functional."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Dummy inference to verify the model actually works
    try:
        dummy_input = torch.randn(1, 128, device=model_device)
        with torch.no_grad():
            output = model(dummy_input)
        return {
            "status": "healthy",
            "model_loaded": True,
            "output_shape": list(output.shape),
            "device": str(model_device),
        }
    except Exception as e:
        raise HTTPException(
            status_code=503,
            detail=f"Model health check failed: {str(e)}"
        )


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest) -> PredictionResponse:
    """Run inference on the loaded model."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    try:
        input_tensor = torch.tensor(
            [request.features], dtype=torch.float32, device=model_device
        )

        with torch.no_grad():
            output = model(input_tensor)

        return PredictionResponse(
            prediction=output.item(),
            model_version="v1.0.0",
            device=str(model_device),
        )
    except Exception as e:
        logger.error(f"Inference failed: {e}")
        raise HTTPException(status_code=500, detail=f"Inference error: {str(e)}")
Output
# Run:
# uvicorn src.serving.api:app --host 0.0.0.0 --port 8080
#
# Health check:
# curl http://localhost:8080/health
# {"status":"healthy","model_loaded":true,"output_shape":[1,1],"device":"cuda:0"}
#
# Predict:
# curl -X POST http://localhost:8080/predict -H 'Content-Type: application/json' \
# -d '{"features": [1.0, 2.0, 3.0, ...]}'
# {"prediction": 0.847,"model_version": "v1.0.0","device": "cuda:0"}
Model Serving as a Restaurant Kitchen
  • Model loading takes 5-60 seconds depending on model size. If the model were loaded lazily, the first request would time out.
  • Loading at startup means the health check can verify the model is functional before accepting traffic.
  • The orchestrator (Kubernetes, ECS) uses the health check to know when the container is ready.
  • Warmup inference ensures CUDA kernels are compiled before the first real request — avoids cold-start latency.
Production Insight
The health check pattern is critical for ML serving. A health check that only verifies the server process is running (returns 200) is insufficient. The model might have failed to load, the GPU might be unavailable, or the weights might be corrupted. A proper health check runs a dummy inference and verifies the output shape. Kubernetes readiness probes use this health check to route traffic only to containers that are actually ready to serve predictions.
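One way to catch corrupted weights before the model is even loaded is to compare the file against a known checksum. The following is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes the expected SHA-256 digest is published alongside the weights (for example by the training pipeline), which the original setup does not state.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight files are not read into RAM at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file on disk does not match the expected digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Calling verify_weights before torch.jit.load turns a corrupted or truncated file into an immediate startup failure with a clear message, rather than an obscure deserialization error or a failed health check later on.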
Key Takeaway
Choose the serving pattern that matches your operational maturity. FastAPI for simplicity and control. TorchServe for multi-model production. Triton for maximum GPU throughput. Regardless of framework, the health check must verify model loading and inference capability — not just server process liveness.
Model Serving Framework Selection
If: Single model, custom pre/post-processing, fast iteration
Use: FastAPI. Simple, full control, easy to customize, minimal operational overhead.
If: Multiple models, versioning, A/B testing, high throughput
Use: TorchServe. Built-in batching, model management, versioning API.
If: Multi-framework (PyTorch + TensorFlow + ONNX), latency-critical, GPU-optimized
Use: Triton Inference Server. Dynamic batching, model ensembles, TensorRT integration.
If: Edge deployment, resource-constrained, no GPU
Use: ONNX Runtime or TensorFlow Lite. Optimized for CPU inference on edge devices.

Volume Strategies for Model Weights — Baking vs Mounting vs Pulling

Model weights are the largest component of an ML serving image. A production NLP model can be 2-10GB. A large language model can be 50-200GB. How you deliver these weights to the container has a major impact on deployment speed, storage costs, and operational flexibility.

Strategy 1: Bake weights into the image (COPY). Simplest approach. The weights are part of the image layer. Every deployment pulls the full image including weights. Pros: self-contained, no external dependencies. Cons: every model update requires a full image rebuild and multi-gigabyte pull. Not practical for models >1GB.

Strategy 2: Mount weights as a named volume. Weights are stored in a Docker volume, mounted into the container at runtime. The image stays small (just the framework and serving code). Pros: image is small and fast to pull. Cons: weights must be pre-populated in the volume. Requires a separate weight management process.

Strategy 3: Pull weights from object storage at startup. The container downloads weights from S3/GCS/Azure Blob at startup. Pros: always gets the latest version, no pre-population needed, works across environments. Cons: adds startup latency (5-60 seconds depending on model size and network), requires credentials management, adds a failure mode (network timeout during download).

Strategy 4: Hybrid — framework in image, weights in registry. Use a model registry (MLflow, Weights & Biases, SageMaker Model Registry) to version and store weights. The serving image contains the framework and a startup script that pulls the correct model version from the registry. This is the most operationally mature approach — it decouples model updates from image updates.
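Strategy 3 names a network timeout during download as a new failure mode. A common mitigation is to retry the download with exponential backoff before giving up. The sketch below is generic and uses only the standard library: `with_retries`, `fetch`, and `retry_on` are hypothetical names, the caller decides which exception types count as transient, and the storage SDK in use may already retry some errors on its own.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

A download call would be wrapped as with_retries(lambda: download(...)), where download stands for whatever function fetches the weights. Exhausting the attempts should still fail the container startup so the orchestrator can reschedule it.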

io/thecodeforge/ml_serving/model_loader.py (Python)
import os
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_model_weights(model_name: str, version: str) -> Path:
    """Load model weights using the appropriate strategy.
    
    Strategy is determined by environment variable MODEL_SOURCE:
    - 'baked': weights are in the image (COPY in Dockerfile)
    - 'volume': weights are in a mounted volume
    - 's3': weights are downloaded from S3 at startup
    """
    source = os.environ.get("MODEL_SOURCE", "baked")
    
    if source == "baked":
        # Weights were COPY'd into the image during build
        model_path = Path(f"./models/{model_name}/{version}/model.pt")
        if not model_path.exists():
            raise FileNotFoundError(
                f"Baked model not found at {model_path}. "
                f"Ensure the model was copied during docker build."
            )
        logger.info(f"Loaded baked model from {model_path}")
        return model_path
    
    elif source == "volume":
        # Weights are in a mounted Docker volume
        volume_path = os.environ.get("MODEL_VOLUME_PATH", "/data/models")
        model_path = Path(f"{volume_path}/{model_name}/{version}/model.pt")
        if not model_path.exists():
            raise FileNotFoundError(
                f"Model not found in volume at {model_path}. "
                f"Ensure the volume is mounted and contains the model."
            )
        logger.info(f"Loaded model from volume: {model_path}")
        return model_path
    
    elif source == "s3":
        # Download from S3 at startup
        import boto3
        
        bucket = os.environ["MODEL_S3_BUCKET"]
        key = f"models/{model_name}/{version}/model.pt"
        local_path = Path(f"/tmp/models/{model_name}/{version}/model.pt")
        local_path.parent.mkdir(parents=True, exist_ok=True)
        
        logger.info(f"Downloading model from s3://{bucket}/{key}...")
        s3 = boto3.client("s3")
        s3.download_file(bucket, key, str(local_path))
        logger.info(f"Downloaded model to {local_path}")
        return local_path
    
    else:
        raise ValueError(f"Unknown MODEL_SOURCE: {source}")
Output
# Pass settings into the container with -e. A VAR=value prefix before
# `docker run` only sets the variable for the docker CLI process on the host.
#
# With baked weights:
# docker run --gpus all -e MODEL_SOURCE=baked io.thecodeforge/ml-model:v1.0
#
# With volume:
# docker volume create model_weights
# docker run --gpus all -e MODEL_SOURCE=volume -e MODEL_VOLUME_PATH=/data/models \
#   -v model_weights:/data/models io.thecodeforge/ml-model:v1.0
#
# With S3:
# docker run --gpus all -e MODEL_SOURCE=s3 -e MODEL_S3_BUCKET=my-models-bucket \
#   -e AWS_DEFAULT_REGION=us-east-1 io.thecodeforge/ml-model:v1.0
Model Weights as a Supply Chain Decision
  • Bake when: model is <500MB, deployment frequency is low, self-contained images are required (air-gapped environments).
  • Mount when: model is >1GB, multiple containers share the same weights, you need to update weights without rebuilding the image.
  • Pull from S3 when: model updates are frequent, you need version management, you deploy across multiple environments.
  • Use a model registry (MLflow) when: you need version tracking, A/B testing, and rollback capabilities.
Production Insight
The failure scenario for baked weights is deployment speed. A 5GB model baked into the image means every deployment pulls 5GB+ — even if only the serving code changed. With 20 nodes and a 1Gbps network, that is 100GB of data transfer and 15+ minutes of deployment time. Mounting weights as a volume or pulling from S3 keeps the image under 500MB and deploys in under 30 seconds.
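The transfer figures can be checked with a few lines of arithmetic. This sketch assumes decimal gigabytes and that all 20 pulls share a single 1Gbps link to the registry, which is a simplifying assumption rather than something the scenario states.

```python
def transfer_seconds(total_gb: float, link_gbps: float) -> float:
    """Seconds to move total_gb gigabytes over a link of link_gbps gigabits per second."""
    return total_gb * 8 / link_gbps


nodes = 20
image_gb = 5
total_gb = nodes * image_gb  # 100 GB pulled across the cluster
seconds = transfer_seconds(total_gb, link_gbps=1.0)

print(total_gb, round(seconds / 60, 1))  # 100 13.3
```

At line rate the pull alone takes about 13 minutes. Registry throttling, layer decompression, and rolling-update pauses add to that, which is consistent with deployments stretching past 15 minutes.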
Key Takeaway
Model weight delivery strategy determines deployment speed. Baked weights make images large and slow to pull. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image — mount them or pull from object storage.
Model Weight Delivery Strategy
If: Model < 500MB, infrequent updates, self-contained deployment required
Use: Bake into image. Simplest, no external dependencies.
If: Model > 1GB, shared across multiple containers
Use: Mount as named volume. Small image, shared storage.
If: Frequent model updates, multi-environment deployment
Use: Pull from S3/GCS at startup. Decouples the model from the image.
If: Need version tracking, A/B testing, rollback
Use: A model registry (MLflow, W&B) for full lifecycle management.

Production Deployment Patterns — Health Checks, Graceful Shutdown, and Resource Limits

Deploying ML models in production requires patterns that go beyond basic containerization. Four patterns separate production-grade deployments from fragile ones.

1. Health checks that verify inference capability. A /health endpoint must verify that the model is loaded and can produce valid output. Run a dummy inference at startup and on every health check. Kubernetes readiness probes use this to route traffic only to ready containers.

2. Graceful shutdown for in-flight requests. When a container is stopped (docker stop, Kubernetes pod termination), it receives SIGTERM. The serving framework must stop accepting new requests, complete in-flight requests, and exit cleanly. Docker's default stop timeout is 10 seconds and Kubernetes' default grace period is 30 seconds; raise them with --stop-timeout or terminationGracePeriodSeconds if inference takes longer.
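The shutdown sequence described in point 2 can be sketched with the standard library. This is an illustration of the pattern, not how any particular framework implements it: servers such as uvicorn already ship their own SIGTERM handling, and the `ShutdownState` class and its method names are hypothetical.

```python
import signal
import threading


class ShutdownState:
    """Tracks in-flight requests and whether shutdown has been requested."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._drained = threading.Event()
        self._drained.set()  # nothing is in flight yet
        self.stopping = False

    def handle_sigterm(self, signum, frame) -> None:
        # Keep the handler trivial: only flip a flag.
        self.stopping = True

    def try_begin(self) -> bool:
        """Register a new request; returns False once shutdown has started."""
        with self._lock:
            if self.stopping:
                return False
            self._in_flight += 1
            self._drained.clear()
            return True

    def end(self) -> None:
        """Mark one request as finished."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._drained.set()

    def wait_drained(self, timeout: float) -> bool:
        """Block until no requests are in flight, or until the timeout expires."""
        return self._drained.wait(timeout)


def install_sigterm_handler(state: ShutdownState) -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, state.handle_sigterm)
```

A request handler would call try_begin() on entry (returning a 503 if it is False) and end() on exit. The main loop would call wait_drained() with a timeout shorter than the stop timeout before exiting.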

3. Resource limits to prevent GPU and memory contention. Without resource limits, one container can consume all GPU memory or host memory, crashing other services. Set --memory limits for RAM. Use NVIDIA MPS (Multi-Process Service) or CUDA_VISIBLE_DEVICES to control GPU sharing and isolation. In Kubernetes, use resource requests and limits for both CPU/memory and nvidia.com/gpu.

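4. Model warmup to avoid cold-start latency. The first inference on a GPU model is slow because of one-time costs: the CUDA context must be initialized, kernels must be loaded, and TorchScript models are still being optimized by the JIT on their first calls. Run a dummy inference at startup to warm up the GPU. This brings the first real request close to the latency of subsequent requests.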

io/thecodeforge/ml-serving-deployment.yml (YAML)
# Kubernetes deployment for ML model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-serving
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model-serving
  template:
    metadata:
      labels:
        app: ml-model-serving
    spec:
      containers:
        - name: model-server
          image: io.thecodeforge/ml-model:v1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
              nvidia.com/gpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60  # Model loading takes time
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 5
          env:
            - name: MODEL_SOURCE
              value: "s3"
            - name: MODEL_S3_BUCKET
              valueFrom:
                secretKeyRef:
                  name: ml-model-secrets
                  key: s3-bucket
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
      terminationGracePeriodSeconds: 60  # Allow in-flight requests to complete
Output
# Deploy:
# kubectl apply -f io/thecodeforge/ml-serving-deployment.yml
#
# Verify:
# kubectl get pods -n production -l app=ml-model-serving
# NAME READY STATUS RESTARTS AGE
# ml-model-serving-7d4f8b6c9-abc12 1/1 Running 0 2m
# ml-model-serving-7d4f8b6c9-def34 1/1 Running 0 2m
# ml-model-serving-7d4f8b6c9-ghi56 1/1 Running 0 2m
Production ML Serving as an Airport
  • ML model loading involves reading multi-gigabyte weight files and initializing CUDA contexts.
  • PyTorch model loading can take 30-60 seconds for large models.
  • If the readiness probe fires before the model is loaded, Kubernetes marks the pod as not ready and does not route traffic.
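  • Set the liveness probe's initialDelaySeconds above the expected model load time, or add a startupProbe. Otherwise repeated liveness failures during loading cause Kubernetes to restart the container before it ever becomes ready.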
Production Insight
The preStop hook with sleep 10 is critical for zero-downtime deployments. When Kubernetes terminates a pod, it removes the pod from the service endpoints and begins container shutdown at the same time. Endpoint removal is not instant: there is a propagation delay. Kubernetes runs the preStop hook before it sends SIGTERM, so sleep 10 keeps the server accepting requests for 10 more seconds while the removal propagates. Without it, requests routed to the pod during a deployment get connection resets.
Key Takeaway
Production ML serving requires health checks that verify inference capability, graceful shutdown for in-flight requests, resource limits for GPU isolation, and model warmup to avoid cold-start latency. The preStop hook with sleep is essential for zero-downtime deployments. These patterns are not optional — they are the difference between a reliable serving system and one that fails during every deployment.
ML Serving Production Readiness Checklist
If: Health check only verifies server process
Use: Add dummy inference to the health check. Verify the model is loaded and produces valid output.
If: Pods restart during deployment with connection errors
Use: Add preStop hook with sleep, increase terminationGracePeriodSeconds, ensure SIGTERM handler in serving code.
If: First request after deployment is 10x slower than subsequent requests
Use: Add model warmup at startup. Run a dummy inference before accepting traffic.
If: One model container consumes all GPU memory, crashing other containers
Use: Set nvidia.com/gpu resource limits. Use CUDA_VISIBLE_DEVICES to assign specific GPUs per container.
Production incident · Post-mortem · Severity: high

Silent Prediction Drift — numpy Version Mismatch Between Training and Serving Containers

Symptom
A/B test showed the production model had 3.2% lower click-through rate than the offline evaluation. The model code was identical. The weights were identical. The input data pipeline was identical. Engineers could not reproduce the discrepancy locally because their dev environment matched the training environment.
Assumption
Team assumed a data pipeline issue — perhaps the production feature store had stale data. They spent 2 days comparing feature vectors between training and serving. All features matched. Second assumption: a random seed issue causing non-deterministic behavior. They set all seeds explicitly — the discrepancy persisted.
Root cause
numpy was not pinned in either image. The training Dockerfile used FROM python:3.10, where pip resolved numpy to 1.24 at build time. The serving Dockerfile used FROM python:3.11, where it resolved to numpy 2.0. The two versions produced slightly different float32 results in the model's np.dot and np.matmul calls. The model's softmax layer applied np.exp to logits near the overflow boundary, so those small differences were enough to change which items appeared in the top-10 recommendation list. The 3.2% CTR drop was caused by slightly different recommendations being served.
Fix
1. Pinned numpy==1.24.3 in both the training and serving requirements.txt.
2. Pinned the base image to FROM python:3.10.12-slim-bookworm in both Dockerfiles.
3. Added a CI step that runs a prediction consistency test: the same input must produce identical output in both containers.
4. Added pip freeze output to the image metadata as a LABEL for auditability.
5. Implemented a model validation pipeline that compares offline and online predictions within a tolerance threshold before deploying.
Key lesson
  • ML models are sensitive to floating-point library versions in ways that web applications are not.
  • Pin every dependency — including numpy, scipy, and CUDA toolkit — in both training and serving Dockerfiles.
  • A prediction consistency test between training and serving environments catches version drift before it reaches production.
  • The serving Dockerfile must be derived from the same base image as the training Dockerfile, or at minimum pin identical dependency versions.
  • pip freeze output should be captured as image metadata for post-deployment auditability.
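A prediction consistency test of the kind described in the fix could be sketched as follows. This is an illustration, not the team's actual CI step: the function names and the default tolerance are assumptions, and the predictions are assumed to have been exported from each container as plain lists of floats.

```python
import math
from typing import List, Sequence


def predictions_consistent(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 0.0,
    abs_tol: float = 1e-6,  # assumed tolerance; tune per model
) -> bool:
    """Return True if both prediction vectors match element-wise within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest scores, best first (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def rankings_match(reference: Sequence[float], candidate: Sequence[float], k: int) -> bool:
    """Return True if both score vectors yield the same top-k ordering."""
    return top_k_indices(reference, k) == top_k_indices(candidate, k)
```

For a recommender like the one in this incident, comparing the top-k ordering matters as much as the raw scores, because a tiny numeric difference can be within tolerance and still swap two adjacent items.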
Production debug guide
From silent prediction drift to GPU failures: systematic debugging paths.
Symptom · 01
Model container starts but inference is 10-50x slower than expected.
Fix
Check if the model is running on CPU instead of GPU. Exec into the container and run: python -c "import torch; print(torch.cuda.is_available())". If False, the NVIDIA Container Toolkit is not configured or --gpus was not passed. Check nvidia-smi on the host to verify GPU availability.
Symptom · 02
Container crashes with CUDA out of memory on a GPU that should have enough VRAM.
Fix
Check if multiple containers are sharing the same GPU without memory isolation. Use NVIDIA MPS (Multi-Process Service) or set CUDA_VISIBLE_DEVICES to assign specific GPUs. Check if the model is loading weights in float32 instead of float16: float32 uses 2x the VRAM.
Symptom · 03
Model produces different predictions in Docker than in the training notebook.
Fix
Compare dependency versions: docker exec <container> pip freeze vs your training environment. Check numpy, scipy, and CUDA toolkit versions specifically. Run a prediction consistency test with fixed inputs and seeds. Check if the model uses any platform-specific operations (MKL vs OpenBLAS).
Symptom · 04
Docker image is 8GB+ and deploys take 10+ minutes.
Fix
Audit image layers: docker history <image>. Check if training dependencies (Jupyter, gcc, test frameworks) are in the serving image. Use multi-stage builds to separate build-time from runtime. Move model weights to a volume or object storage instead of baking them into the image.
Symptom · 05
Health check passes but the model returns errors on actual inference requests.
Fix
The health check endpoint may only verify the server is running, not that the model is loaded. Add a health check that runs a dummy inference with a known input and verifies the output shape. Also confirm that the model file inside the container is intact, for example by comparing its size and checksum with the original artifact.
Symptom · 06
Container runs out of disk space during inference (large batch processing).
Fix
Check if the model writes temporary files (attention caches, intermediate tensors) to the container filesystem. Mount a tmpfs or volume for temporary storage. Set --shm-size for PyTorch DataLoader workers that use shared memory.
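For Symptom 03, comparing two pip freeze outputs by eye is error-prone. A small script can do the diff. The sketch below assumes both outputs have been captured as text, and the helper names `parse_freeze` and `diff_freeze` are hypothetical. It only handles `name==version` lines and ignores editable or URL-based entries.

```python
from typing import Dict, Optional, Tuple


def parse_freeze(text: str) -> Dict[str, str]:
    """Parse `pip freeze` output into {lowercased package name: version}."""
    versions: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not simple pins.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        versions[name.strip().lower()] = version.strip()
    return versions


def diff_freeze(training: str, serving: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {package: (training_version, serving_version)} for every mismatch.

    A version of None means the package is absent from that environment.
    """
    train, serve = parse_freeze(training), parse_freeze(serving)
    return {
        name: (train.get(name), serve.get(name))
        for name in sorted(set(train) | set(serve))
        if train.get(name) != serve.get(name)
    }
```

An empty result means the pinned packages agree. Any entry for numpy, scipy, or the ML framework is the first place to look when predictions differ.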
★ Docker ML Model Triage Cheat Sheet
First-response commands when an ML serving container fails in production.
Inference is extremely slow (10x+ slower than expected).
Immediate action
Check if GPU is being used inside the container.
Commands
docker exec <container> python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
docker exec <container> nvidia-smi
Fix now
If torch.cuda.is_available() returns False, restart the container with --gpus all. If nvidia-smi is not found, install nvidia-container-toolkit on the host.
Model predictions differ from training notebook results.
Immediate action
Compare dependency versions between training and serving.
Commands
docker exec <container> pip freeze | grep -E 'numpy|scipy|torch|tensorflow'
docker inspect <container> --format='{{.Config.Labels}}'
Fix now
Pin identical versions in both Dockerfiles. Add a prediction consistency test to CI.
Container crashes with OOM (out of memory) during model loading.
Immediate action
Check GPU memory and container memory limits.
Commands
docker stats <container> --no-stream
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
Fix now
Increase the --memory limit. Use float16 instead of float32. Use model sharding for models larger than a single GPU's VRAM.
Container image pull takes 10+ minutes in CI/CD.
Immediate action
Check image size and layer composition.
Commands
docker images <image> --format '{{.Size}}'
docker history <image> | head -20
Fix now
Use multi-stage builds. Move model weights to volumes or object storage. Use a local registry mirror.
Health check passes but inference requests fail with 500.
Immediate action
Verify model is actually loaded, not just the server process.
Commands
docker exec <container> curl -s http://localhost:8080/health
docker exec <container> curl -s -X POST http://localhost:8080/predict -H 'Content-Type: application/json' -d '{"input": [1.0, 2.0, 3.0]}'
Fix now
Add model-loaded check to health endpoint. Check container logs for model loading errors: docker logs --tail 100 <container>
Model Weight Delivery Strategies Compared

| Strategy | Image Size | Deployment Speed | Operational Complexity | Best For |
| --- | --- | --- | --- | --- |
| Bake into image (COPY) | Large (model size + framework) | Slow (full image pull) | Low (self-contained) | Models < 500MB, infrequent updates |
| Named volume (mount) | Small (framework only) | Fast (small image) | Medium (volume management) | Large models, shared across containers |
| S3/GCS download at startup | Small (framework only) | Fast pull + download time | Medium (credentials, retry logic) | Frequent model updates, multi-environment |
| Model registry (MLflow) | Small (framework only) | Fast pull + download time | High (registry infrastructure) | Version tracking, A/B testing, rollback |
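The "retry logic" listed for the download-at-startup strategy could be sketched like this. The code is deliberately storage-agnostic: `fetch` is a hypothetical callable that you would implement with your S3 or GCS client of choice, and `fetch_weights` and its parameters are illustrative names, not part of any library.

```python
import hashlib
import time
from typing import Callable, Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fetch_weights(
    fetch: Callable[[], bytes],  # hypothetical: wraps the real object-storage download
    expected_sha256: str,
    attempts: int = 3,
    backoff_s: float = 1.0,
) -> bytes:
    """Download model weights with retries, rejecting payloads whose checksum does not match."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            data = fetch()
            if sha256_hex(data) == expected_sha256:
                return data
            last_error = ValueError("checksum mismatch")
        except OSError as exc:
            # Network and filesystem failures are treated as transient here.
            last_error = exc
        if attempt < attempts - 1:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError(f"could not fetch model weights after {attempts} attempts") from last_error
```

Failing fast at startup with a clear error is usually preferable to serving with missing or corrupted weights, because the orchestrator will then keep the previous healthy pods in rotation.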

Key takeaways

1
Multi-stage builds are essential for ML serving: the training image includes compilers and debugging tools that should never ship to production. The serving image should contain only the framework, the model, and the serving code.
2
GPU access requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU with no error. Always add a startup assertion that verifies GPU availability.
3
Model weight delivery strategy determines deployment speed. Baked weights make images large. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image.
4
Production ML serving requires health checks that verify inference capability (not just server liveness), graceful shutdown for in-flight requests, and model warmup to avoid cold-start latency.
5
Pin identical dependency versions (numpy, scipy, CUDA toolkit) in both training and serving Dockerfiles. Version drift causes silent prediction changes that are extremely difficult to debug.
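The startup assertion recommended in takeaway 2 might look like the sketch below. It assumes a PyTorch-based server. `assert_gpu_available`, `default_cuda_probe`, and the `require_gpu` flag are hypothetical names, and the probe is injectable so the logic can be exercised on machines without PyTorch or a GPU.

```python
from typing import Callable


def default_cuda_probe() -> bool:
    """Return torch.cuda.is_available(), or False when PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def assert_gpu_available(
    require_gpu: bool,
    probe: Callable[[], bool] = default_cuda_probe,
) -> str:
    """Return the device name to use, raising instead of silently falling back to CPU."""
    if probe():
        return "cuda"
    if require_gpu:
        raise RuntimeError(
            "GPU required but CUDA is not available; "
            "check that the container was started with --gpus"
        )
    return "cpu"
```

Calling `assert_gpu_available(require_gpu=True)` before loading the model turns a silent 10-50x slowdown into an immediate, visible crash at startup.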
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How do I access GPUs from a Docker container?
02
Why is my ML model inference slow inside Docker but fast on the host?
03
Should I bake model weights into the Docker image?
04
How do I handle model versioning with Docker?
05
What is the difference between Docker and NVIDIA Triton for model serving?