
Docker for ML Models: Containerize, Deploy and Scale with Confidence

📍 Part of: MLOps → Topic 5 of 9
Docker for ML models explained deeply — multi-stage builds, GPU access, model serving patterns, and production gotchas every MLOps engineer must know.
🔥 Advanced — solid ML / AI foundation required
In this tutorial, you'll learn
  • Multi-stage builds are essential for ML serving — the training image includes compilers and debugging tools that should never ship to production. The serving image should contain only the framework, the model, and the serving code.
  • GPU access requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU with no error. Always add a startup assertion that verifies GPU availability.
  • Model weight delivery strategy determines deployment speed. Baked weights make images large. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image.
Quick Answer
  • The image is the artifact. It runs identically on your laptop, CI, and production GPU instances.
  • Multi-stage builds separate training dependencies from serving runtime, keeping images small.
  • NVIDIA Container Toolkit exposes GPU devices to containers via --gpus flag.
  • Base image with pinned CUDA and Python versions
  • Model weights copied or mounted as volumes
  • Serving framework (FastAPI, TorchServe, Triton) as the entrypoint
  • Health check endpoint for orchestrator readiness probes
🚨 START HERE
Docker ML Model Triage Cheat Sheet
First-response commands when an ML serving container fails in production.
🟠 Inference is extremely slow (10x+ slower than expected).
Immediate Action: Check if GPU is being used inside the container.
Commands
docker exec <container> python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
docker exec <container> nvidia-smi
Fix Now: If torch.cuda.is_available() is False, restart the container with --gpus all. If nvidia-smi is not found, install nvidia-container-toolkit on the host.
🟡 Model predictions differ from training notebook results.
Immediate Action: Compare dependency versions between training and serving.
Commands
docker exec <container> pip freeze | grep -E 'numpy|scipy|torch|tensorflow'
docker inspect <container> --format='{{.Config.Labels}}'
Fix Now: Pin identical versions in both Dockerfiles. Add a prediction consistency test to CI.
🔴 Container crashes with OOM (out of memory) during model loading.
Immediate Action: Check GPU memory and container memory limits.
Commands
docker stats <container> --no-stream
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
Fix Now: Increase the --memory limit. Load weights in float16 instead of float32. Use model sharding for models larger than a single GPU's VRAM.
🟡 Container image pull takes 10+ minutes in CI/CD.
Immediate Action: Check image size and layer composition.
Commands
docker images <image> --format '{{.Size}}'
docker history <image> | head -20
Fix Now: Use multi-stage builds. Move model weights to volumes or object storage. Use a local registry mirror.
🟡 Health check passes but inference requests fail with 500.
Immediate Action: Verify the model is actually loaded, not just the server process.
Commands
docker exec <container> curl -s http://localhost:8080/health
docker exec <container> curl -s -X POST http://localhost:8080/predict -d '{"input": [1.0, 2.0, 3.0]}'
Fix Now: Add a model-loaded check to the health endpoint. Check container logs for model loading errors: docker logs --tail 100 <container>
Production Incident: Silent Prediction Drift — numpy Version Mismatch Between Training and Serving Containers
A recommendation model deployed via Docker produced different top-10 results in production than in the training notebook. The root cause was a numpy version mismatch — 1.24 in training, 2.0 in serving — that changed floating-point rounding behavior in matrix operations.
Symptom: The A/B test showed the production model had a 3.2% lower click-through rate than the offline evaluation. The model code was identical. The weights were identical. The input data pipeline was identical. Engineers could not reproduce the discrepancy locally because their dev environment matched the training environment.
Assumption: The team assumed a data pipeline issue — perhaps the production feature store had stale data. They spent two days comparing feature vectors between training and serving. All features matched. Second assumption: a random seed issue causing non-deterministic behavior. They set all seeds explicitly — the discrepancy persisted.
Root cause: The training Dockerfile used FROM python:3.10, which resolved to numpy 1.24 at build time. The serving Dockerfile used FROM python:3.11, which resolved to numpy 2.0. numpy 2.0 changed the default rounding behavior in np.dot and np.matmul for certain float32 operations. The model's softmax layer used np.exp on logits near the overflow boundary — the rounding difference changed which items appeared in the top-10 recommendation list. The 3.2% CTR drop was caused by slightly different recommendations being served.
Fix:
1. Pinned numpy==1.24.3 in both training and serving requirements.txt.
2. Pinned the base image to FROM python:3.10.12-slim-bookworm in both Dockerfiles.
3. Added a CI step that runs a prediction consistency test — the same input must produce identical output in both containers.
4. Added pip freeze output to the image metadata as a LABEL for auditability.
5. Implemented a model validation pipeline that compares offline and online predictions within a tolerance threshold before deploying.
Key Lesson
  • ML models are sensitive to floating-point library versions in ways that web applications are not.
  • Pin every dependency — including numpy, scipy, and the CUDA toolkit — in both training and serving Dockerfiles.
  • A prediction consistency test between training and serving environments catches version drift before it reaches production.
  • The serving Dockerfile must be derived from the same base image as the training Dockerfile, or at minimum pin identical dependency versions.
  • pip freeze output should be captured as image metadata for post-deployment auditability.
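The prediction consistency test from the fix above can be sketched in plain Python. Everything here — the weights, the golden input, and the golden output — is a hypothetical stand-in: in practice the golden output is recorded once in the training container and the real model replaces the toy predict function.

```python
import math

# Hypothetical stand-ins for a real model and its recorded reference output.
WEIGHTS = [0.5, -1.25, 2.0]
GOLDEN_INPUT = [1.0, 2.0, 3.0]
GOLDEN_OUTPUT = [0.0040693, 0.0002026, 0.9957281]  # captured in the training container

def predict(features):
    """Toy inference: elementwise weight * feature, then a numerically
    stable softmax (subtract the max before exponentiating)."""
    logits = [w * x for w, x in zip(WEIGHTS, features)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consistency_check(tolerance: float = 1e-5) -> float:
    """Fail CI if the serving container's output drifts from the
    training-time golden output. Returns the observed max drift."""
    output = predict(GOLDEN_INPUT)
    drift = max(abs(a - b) for a, b in zip(output, GOLDEN_OUTPUT))
    if drift > tolerance:
        raise AssertionError(f"Prediction drift {drift:.2e} exceeds tolerance")
    return drift
```

Run the same check inside both the training and the serving container in CI; a version-induced rounding change shows up as drift above the tolerance before it reaches production.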
Production Debug Guide
From silent prediction drift to GPU failures — systematic debugging paths.
Model container starts but inference is 10-50x slower than expected.
→ Check if the model is running on CPU instead of GPU. Exec into the container and run: python -c "import torch; print(torch.cuda.is_available())". If it prints False, the NVIDIA Container Toolkit is not configured or --gpus was not passed. Run nvidia-smi on the host to verify GPU availability.
Container crashes with CUDA out of memory on a GPU that should have enough VRAM.
→ Check if multiple containers are sharing the same GPU without memory isolation. Use NVIDIA MPS (Multi-Process Service) or set CUDA_VISIBLE_DEVICES to assign specific GPUs. Check if the model is loading weights in float32 instead of float16 — float32 uses 2x the VRAM.
Model produces different predictions in Docker than in the training notebook.
→ Compare dependency versions: docker exec <container> pip freeze vs your training environment. Check numpy, scipy, and CUDA toolkit versions specifically. Run a prediction consistency test with fixed inputs and seeds. Check whether the model uses any platform-specific operations (MKL vs OpenBLAS).
Docker image is 8GB+ and deploys take 10+ minutes.
→ Audit image layers: docker history <image>. Check if training dependencies (Jupyter, gcc, test frameworks) are in the serving image. Use multi-stage builds to separate build-time from runtime. Move model weights to a volume or object storage instead of baking them into the image.
Health check passes but the model returns errors on actual inference requests.
→ The health check endpoint may only verify the server is running, not that the model is loaded. Add a health check that runs a dummy inference with a known input and verifies the output shape. Check if the model file was corrupted during the COPY step (large files can fail silently).
Container runs out of disk space during inference (large batch processing).
→ Check if the model writes temporary files (attention caches, intermediate tensors) to the container filesystem. Mount a tmpfs or volume for temporary storage. Set --shm-size for PyTorch DataLoader workers that use shared memory.

ML models are environment-sensitive in ways that web applications are not. A model trained with numpy 1.23 can silently produce different floating-point results on numpy 2.0. A CUDA version mismatch between training and serving causes either crashes or silent CPU fallback that tanks inference latency by 50x. These are not hypothetical — they are the leading causes of 'it works on my machine' failures in ML deployments.

Docker eliminates environment drift by packaging the entire runtime — OS libraries, Python version, pip packages, CUDA toolkit, model weights, and serving logic — into a single versioned image. That image runs identically on your laptop, your CI pipeline, a Kubernetes cluster, and an edge device.

The gap between a Jupyter notebook that produces great metrics and a model that reliably serves predictions in production is wider than most teams expect. Docker closes that gap by making the environment a constant, not a variable. This guide covers the patterns that separate production-grade ML containers from fragile ones.

Multi-Stage Builds for ML — Separating Training from Serving

The most common mistake in ML Dockerfiles is shipping the training environment as the serving image. A training image includes Jupyter, gcc, test frameworks, debugging tools, and development dependencies — none of which are needed in production. This bloats the image to 5-10GB, increases attack surface, and slows deployments.

Multi-stage builds solve this by using a heavy 'builder' stage with all training and build dependencies, then copying only the trained model and runtime dependencies into a minimal 'serving' stage. The final image contains Python, the serving framework, and the model — nothing else.

Why this matters for ML specifically: ML images are uniquely large because they include CUDA toolkit (2-3GB), PyTorch/TensorFlow (1-2GB), and model weights (1-10GB). A single-stage image that includes training tools, CUDA development headers, and model weights can easily exceed 10GB. A multi-stage serving image with quantized weights can be under 2GB.

Layer caching for ML: Model weights change rarely (only after retraining). Dependencies change occasionally. Application code changes frequently. Order your Dockerfile: base image + CUDA first, dependencies second, model weights third, application code last. This ensures code changes do not trigger a re-download of PyTorch or a re-copy of multi-gigabyte weights.

io/thecodeforge/ml-serving.Dockerfile · DOCKERFILE
# ─── STAGE 1: Build environment (training deps, compilers) ───
FROM python:3.10.12-slim-bookworm AS builder

WORKDIR /build

# Install build dependencies (not in final image)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ libpq-dev && \
    rm -rf /var/lib/apt/lists/*

COPY requirements-training.txt .
RUN pip install --user --no-cache-dir -r requirements-training.txt

# Simulate model training artifact (in practice, this comes from
# a training pipeline or model registry)
COPY models/ ./models/

# ─── STAGE 2: Serving runtime (minimal) ───
FROM python:3.10.12-slim-bookworm AS serving

WORKDIR /app

# Install only runtime dependencies
COPY requirements-serving.txt .
RUN pip install --no-cache-dir -r requirements-serving.txt

# Copy trained model weights from builder
COPY --from=builder /build/models/ ./models/

# Copy serving application code
COPY src/serving/ ./src/serving/

# Non-root user for security
RUN useradd --create-home appuser
USER appuser

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"

EXPOSE 8080

CMD ["python", "-m", "uvicorn", "src.serving.api:app", "--host", "0.0.0.0", "--port", "8080"]
▶ Output
# Build:
docker build -f io/thecodeforge/ml-serving.Dockerfile -t io.thecodeforge/ml-model:v1.0 .

# Image size comparison:
# Single-stage (with training deps): 8.2 GB
# Multi-stage (serving only): 1.8 GB
Mental Model
ML Image Layers as a Supply Chain
Why not just delete training dependencies in a RUN command at the end of a single-stage Dockerfile?
  • Docker layers are additive. A file added in one layer and deleted in a later layer still occupies space in the earlier layer.
  • RUN pip install torch && pip uninstall torch still has torch in the install layer — the image does not shrink.
  • Multi-stage builds start fresh — the serving stage never contains training dependencies in any layer.
  • This is the only way to genuinely reduce image size for ML workloads where base dependencies are gigabytes.
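A minimal Dockerfile sketch of the difference (package names are illustrative only — a real serving stage would also COPY the artifacts it needs from the builder):

```dockerfile
# Anti-pattern: single stage. The uninstall hides the files in a new
# layer, but the ~2GB install layer still ships with the image.
FROM python:3.10-slim
RUN pip install torch
RUN pip uninstall -y torch   # image size does NOT shrink

# Multi-stage: the serving stage starts from a fresh base, so the
# heavy install layer exists only in the discarded builder stage.
FROM python:3.10-slim AS builder
RUN pip install torch
FROM python:3.10-slim AS serving
```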
📊 Production Insight
The layer caching insight is critical for ML because dependency downloads are large. PyTorch with CUDA is ~2GB. If every code change invalidates the pip install layer, every CI build downloads 2GB of dependencies. By copying requirements-serving.txt before application code, the dependency layer is cached on code-only changes — turning 10-minute builds into 30-second builds.
🎯 Key Takeaway
Multi-stage builds are not optional for ML serving images. The training environment and the serving environment are fundamentally different — training needs compilers and debugging tools, serving needs only the framework and the model. A 10GB training image deployed as a serving image wastes storage, slows deployments, and increases attack surface.
ML Image Size Optimization
If: Image is >5GB and includes training tools
Use: Multi-stage builds — separate training from serving. Copy only model weights and runtime deps to the serving stage.
If: Model weights are >1GB and baked into the image
Use: Move weights to a volume or download from S3/GCS at container startup. Keep the image under 2GB.
If: CUDA toolkit adds 2-3GB to the image
Use: Runtime-only CUDA base images (nvidia/cuda:11.8.0-runtime-ubuntu22.04) instead of devel images.
If: Multiple models share the same serving framework
Use: A shared base image with the framework, extended per model with just the weights and config.

GPU Access with NVIDIA Container Toolkit

ML inference on CPU is 10-100x slower than on GPU. Docker does not expose GPU devices to containers by default — you need the NVIDIA Container Toolkit and the --gpus flag.

The NVIDIA Container Toolkit (formerly nvidia-docker2) installs a Docker runtime that automatically mounts the GPU device drivers and libraries into containers. Without it, containers see no GPU devices even when the host has GPUs available.

Installation and verification:
1. Install nvidia-container-toolkit on the host.
2. Configure Docker to use the nvidia runtime: sudo nvidia-ctk runtime configure --runtime=docker
3. Restart Docker: sudo systemctl restart docker
4. Verify: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

GPU allocation strategies:
  • --gpus all: expose all GPUs to the container
  • --gpus 1: expose one GPU (Docker picks which)
  • --gpus '"device=0,2"': expose specific GPUs by index
  • NVIDIA_VISIBLE_DEVICES=0,2: set via environment variable (useful in Compose)
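In Compose, GPU reservations use the deploy.resources syntax instead of a --gpus flag. A minimal sketch (the service name, image tag, and device indices are illustrative):

```yaml
services:
  ml-serving:
    image: io.thecodeforge/ml-model:v1.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "2"]   # or use `count: 1` / `count: all`
              capabilities: [gpu]
```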

Failure scenario — silent CPU fallback: If the NVIDIA Container Toolkit is not installed or --gpus is not passed, PyTorch and TensorFlow silently fall back to CPU. The model loads successfully, inference works, but latency is 50x slower than expected. There is no error — torch.cuda.is_available() returns False, but many serving frameworks do not check this. The fix: always add a startup assertion that verifies GPU availability.

io/thecodeforge/ml_serving/startup_check.py · PYTHON
import sys
import logging

logger = logging.getLogger(__name__)

def verify_gpu_availability(required_gpus: int = 1) -> None:
    """Startup assertion: fail fast if GPU is not available.
    
    Call this at application startup before loading the model.
    If GPU is required but not available, exit immediately
    rather than silently falling back to CPU.
    """
    try:
        import torch
        
        available = torch.cuda.is_available()
        device_count = torch.cuda.device_count()
        
        if not available:
            logger.error(
                "GPU not available. torch.cuda.is_available() returned False. "
                "Ensure NVIDIA Container Toolkit is installed and "
                "container is started with --gpus flag."
            )
            sys.exit(1)
        
        if device_count < required_gpus:
            logger.error(
                f"Insufficient GPUs: required={required_gpus}, "
                f"available={device_count}. "
                f"Adjust --gpus flag or reduce requirement."
            )
            sys.exit(1)
        
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        logger.info(
            f"GPU verified: {gpu_name}, "
            f"{gpu_memory:.1f}GB VRAM, "
            f"{device_count} device(s) available"
        )
        
    except ImportError:
        logger.error("PyTorch not installed. Cannot verify GPU availability.")
        sys.exit(1)
▶ Output
# Successful startup:
# GPU verified: NVIDIA A10G, 22.0GB VRAM, 1 device(s) available

# Failed startup (no --gpus flag):
# ERROR: GPU not available. torch.cuda.is_available() returned False.
# Ensure NVIDIA Container Toolkit is installed and container is started with --gpus flag.
Mental Model
GPU Access as a Device Permission
Why does PyTorch silently fall back to CPU instead of raising an error when no GPU is available?
  • PyTorch was designed to work on both CPU and GPU — GPU is an optimization, not a requirement.
  • Many development environments (laptops without GPU) run PyTorch on CPU legitimately.
  • The framework cannot know if you intended to use GPU or CPU — it defers to the developer.
  • This is why a startup assertion (verify_gpu_availability) is essential in production serving containers.
📊 Production Insight
The silent CPU fallback is the most insidious GPU-related production bug. The model loads successfully, inference returns correct results, but latency is 50x slower than expected. Monitoring shows high CPU usage instead of GPU utilization. The team spends hours profiling the model code before discovering the container never had GPU access. The fix is a one-time startup assertion that fails fast if GPU is not available.
🎯 Key Takeaway
GPU access in Docker requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU — no error, just 50x slower inference. Always add a startup assertion that verifies GPU availability. This single check prevents hours of debugging silent CPU fallback.
GPU Troubleshooting Decision Tree
If: torch.cuda.is_available() returns False inside the container
Use: Check whether the --gpus flag was passed and whether nvidia-container-toolkit is installed on the host. Run: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
If: GPU is available but inference is still slow
Use: Check if data is being transferred CPU->GPU on every inference call. Pin the model to GPU once at startup. Check batch size — too-small batches underutilize GPU parallelism.
If: CUDA out of memory during inference
Use: Check if multiple containers share the same GPU. Use CUDA_VISIBLE_DEVICES to assign specific GPUs. Use float16 instead of float32. Reduce batch size.
If: nvidia-smi works on the host but not in the container
Use: The host driver must support the CUDA version in the container. Compare the driver's supported CUDA version (shown by nvidia-smi on the host) against the CUDA version in the FROM image.

Model Serving Patterns — FastAPI, TorchServe, and Triton

There are three common patterns for serving ML models in Docker containers. The right choice depends on your latency requirements, model complexity, and operational maturity.

Pattern 1: FastAPI + direct model loading. Load the model at startup, expose a /predict endpoint. Simple, full control over the inference pipeline, easy to customize. Best for single-model serving with custom pre/post-processing. The model is loaded into the application process — startup time equals model load time.

Pattern 2: TorchServe / TF Serving. Purpose-built serving frameworks with built-in batching, model versioning, and A/B testing. More operational overhead but better for multi-model serving and high-throughput scenarios. TorchServe runs a separate model server process — the Docker container wraps the TorchServe binary.

Pattern 3: NVIDIA Triton Inference Server. GPU-optimized serving with dynamic batching, model ensemble pipelines, and multi-framework support (PyTorch, TensorFlow, ONNX, TensorRT). Highest throughput but most complex configuration. Best for latency-critical production workloads with multiple models.

Health check patterns: A health check that only verifies the server is running is insufficient for ML serving. The health check must verify that the model is loaded and can produce a valid output. A /health endpoint should run a dummy inference with a known input and verify the output shape matches expectations.

io/thecodeforge/ml_serving/api.py · PYTHON
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import torch
import logging

from io.thecodeforge.ml_serving.startup_check import verify_gpu_availability

logger = logging.getLogger(__name__)

app = FastAPI(title="ML Model Serving API")

# Global model reference — loaded once at startup
model = None
model_device = None


class PredictionRequest(BaseModel):
    features: List[float]


class PredictionResponse(BaseModel):
    prediction: float
    model_version: str
    device: str


@app.on_event("startup")
async def load_model() -> None:
    """Load model once at startup — not on every request."""
    global model, model_device

    # Fail fast if GPU is not available
    verify_gpu_availability(required_gpus=1)

    model_device = torch.device("cuda:0")
    model_path = "./models/production_model.pt"

    logger.info(f"Loading model from {model_path}...")
    model = torch.jit.load(model_path, map_location=model_device)
    model.eval()

    # Warmup inference — ensures CUDA kernels are compiled
    dummy_input = torch.randn(1, 128, device=model_device)
    with torch.no_grad():
        _ = model(dummy_input)

    logger.info("Model loaded and warmed up successfully.")


@app.get("/health")
async def health_check() -> dict:
    """Health check that verifies model is loaded and functional."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Dummy inference to verify the model actually works
    try:
        dummy_input = torch.randn(1, 128, device=model_device)
        with torch.no_grad():
            output = model(dummy_input)
        return {
            "status": "healthy",
            "model_loaded": True,
            "output_shape": list(output.shape),
            "device": str(model_device),
        }
    except Exception as e:
        raise HTTPException(
            status_code=503,
            detail=f"Model health check failed: {str(e)}"
        )


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest) -> PredictionResponse:
    """Run inference on the loaded model."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    try:
        input_tensor = torch.tensor(
            [request.features], dtype=torch.float32, device=model_device
        )

        with torch.no_grad():
            output = model(input_tensor)

        return PredictionResponse(
            prediction=output.item(),
            model_version="v1.0.0",
            device=str(model_device),
        )
    except Exception as e:
        logger.error(f"Inference failed: {e}")
        raise HTTPException(status_code=500, detail=f"Inference error: {str(e)}")
▶ Output
# Run:
# uvicorn io.thecodeforge.ml_serving.api:app --host 0.0.0.0 --port 8080
#
# Health check:
# curl http://localhost:8080/health
# {"status":"healthy","model_loaded":true,"output_shape":[1,1],"device":"cuda:0"}
#
# Predict:
# curl -X POST http://localhost:8080/predict -H 'Content-Type: application/json' \
# -d '{"features": [1.0, 2.0, 3.0, ...]}'
# {"prediction": 0.847,"model_version": "v1.0.0","device": "cuda:0"}
Mental Model
Model Serving as a Restaurant Kitchen
Why load the model at startup instead of on the first request?
  • Model loading takes 5-60 seconds depending on model size. The first request would time out.
  • Loading at startup means the health check can verify the model is functional before accepting traffic.
  • The orchestrator (Kubernetes, ECS) uses the health check to know when the container is ready.
  • Warmup inference ensures CUDA kernels are compiled before the first real request — avoids cold-start latency.
📊 Production Insight
The health check pattern is critical for ML serving. A health check that only verifies the server process is running (returns 200) is insufficient. The model might have failed to load, the GPU might be unavailable, or the weights might be corrupted. A proper health check runs a dummy inference and verifies the output shape. Kubernetes readiness probes use this health check to route traffic only to containers that are actually ready to serve predictions.
🎯 Key Takeaway
Choose the serving pattern that matches your operational maturity. FastAPI for simplicity and control. TorchServe for multi-model production. Triton for maximum GPU throughput. Regardless of framework, the health check must verify model loading and inference capability — not just server process liveness.
Model Serving Framework Selection
If: Single model, custom pre/post-processing, fast iteration
Use: FastAPI — simple, full control, easy to customize, minimal operational overhead
If: Multiple models, versioning, A/B testing, high throughput
Use: TorchServe — built-in batching, model management, versioning API
If: Multi-framework (PyTorch + TensorFlow + ONNX), latency-critical, GPU-optimized
Use: Triton Inference Server — dynamic batching, model ensembles, TensorRT integration
If: Edge deployment, resource-constrained, no GPU
Use: ONNX Runtime or TensorFlow Lite — optimized for CPU inference on edge devices

Volume Strategies for Model Weights — Baking vs Mounting vs Pulling

Model weights are the largest component of an ML serving image. A production NLP model can be 2-10GB. A large language model can be 50-200GB. How you deliver these weights to the container has a major impact on deployment speed, storage costs, and operational flexibility.

Strategy 1: Bake weights into the image (COPY). Simplest approach. The weights are part of the image layer. Every deployment pulls the full image including weights. Pros: self-contained, no external dependencies. Cons: every model update requires a full image rebuild and multi-gigabyte pull. Not practical for models >1GB.

Strategy 2: Mount weights as a named volume. Weights are stored in a Docker volume, mounted into the container at runtime. The image stays small (just the framework and serving code). Pros: image is small and fast to pull. Cons: weights must be pre-populated in the volume. Requires a separate weight management process.

Strategy 3: Pull weights from object storage at startup. The container downloads weights from S3/GCS/Azure Blob at startup. Pros: always gets the latest version, no pre-population needed, works across environments. Cons: adds startup latency (5-60 seconds depending on model size and network), requires credentials management, adds a failure mode (network timeout during download).

Strategy 4: Hybrid — framework in image, weights in registry. Use a model registry (MLflow, Weights & Biases, SageMaker Model Registry) to version and store weights. The serving image contains the framework and a startup script that pulls the correct model version from the registry. This is the most operationally mature approach — it decouples model updates from image updates.

io/thecodeforge/ml_serving/model_loader.py · PYTHON
import os
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_model_weights(model_name: str, version: str) -> Path:
    """Load model weights using the appropriate strategy.
    
    Strategy is determined by environment variable MODEL_SOURCE:
    - 'baked': weights are in the image (COPY in Dockerfile)
    - 'volume': weights are in a mounted volume
    - 's3': weights are downloaded from S3 at startup
    """
    source = os.environ.get("MODEL_SOURCE", "baked")
    
    if source == "baked":
        # Weights were COPY'd into the image during build
        model_path = Path(f"./models/{model_name}/{version}/model.pt")
        if not model_path.exists():
            raise FileNotFoundError(
                f"Baked model not found at {model_path}. "
                f"Ensure the model was copied during docker build."
            )
        logger.info(f"Loaded baked model from {model_path}")
        return model_path
    
    elif source == "volume":
        # Weights are in a mounted Docker volume
        volume_path = os.environ.get("MODEL_VOLUME_PATH", "/data/models")
        model_path = Path(f"{volume_path}/{model_name}/{version}/model.pt")
        if not model_path.exists():
            raise FileNotFoundError(
                f"Model not found in volume at {model_path}. "
                f"Ensure the volume is mounted and contains the model."
            )
        logger.info(f"Loaded model from volume: {model_path}")
        return model_path
    
    elif source == "s3":
        # Download from S3 at startup
        import boto3
        
        bucket = os.environ["MODEL_S3_BUCKET"]
        key = f"models/{model_name}/{version}/model.pt"
        local_path = Path(f"/tmp/models/{model_name}/{version}/model.pt")
        local_path.parent.mkdir(parents=True, exist_ok=True)
        
        logger.info(f"Downloading model from s3://{bucket}/{key}...")
        s3 = boto3.client("s3")
        s3.download_file(bucket, key, str(local_path))
        logger.info(f"Downloaded model to {local_path}")
        return local_path
    
    else:
        raise ValueError(f"Unknown MODEL_SOURCE: {source}")
▶ Output
# With baked weights:
# docker run --gpus all -e MODEL_SOURCE=baked io.thecodeforge/ml-model:v1.0
#
# With volume:
# docker volume create model_weights
# docker run --gpus all -e MODEL_SOURCE=volume -e MODEL_VOLUME_PATH=/data/models \
#   -v model_weights:/data/models io.thecodeforge/ml-serving:v1.0
#
# With S3:
# docker run --gpus all -e MODEL_SOURCE=s3 -e MODEL_S3_BUCKET=my-models-bucket \
#   -e AWS_DEFAULT_REGION=us-east-1 io.thecodeforge/ml-serving:v1.0
Mental Model
Model Weights as a Supply Chain Decision
When should you bake weights into the image vs mount them as volumes?
  • Bake when: model is <500MB, deployment frequency is low, self-contained images are required (air-gapped environments).
  • Mount when: model is >1GB, multiple containers share the same weights, you need to update weights without rebuilding the image.
  • Pull from S3 when: model updates are frequent, you need version management, you deploy across multiple environments.
  • Use a model registry (MLflow) when: you need version tracking, A/B testing, and rollback capabilities.
📊 Production Insight
The failure scenario for baked weights is deployment speed. A 5GB model baked into the image means every deployment pulls 5GB+ — even if only the serving code changed. With 20 nodes and a 1Gbps network, that is 100GB of data transfer and 15+ minutes of deployment time. Mounting weights as a volume or pulling from S3 keeps the image under 500MB and deploys in under 30 seconds.
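The arithmetic behind that estimate can be checked directly. The sketch below assumes the scenario's numbers (5GB image, 20 nodes, a shared 1Gbps registry link) and ignores protocol overhead and layer decompression, which push real-world times toward the 15+ minutes quoted above:

```python
# Back-of-the-envelope transfer estimate for the deployment scenario above:
# a 5 GB image pulled by 20 nodes over a shared 1 Gbps registry link.
image_gb = 5
nodes = 20
link_gbps = 1  # registry-side bandwidth, gigabits per second

total_gb = image_gb * nodes           # 100 GB of data to move
total_gigabits = total_gb * 8         # 800 Gb
seconds = total_gigabits / link_gbps  # 800 s, ignoring overhead
minutes = seconds / 60

print(f"{total_gb} GB total, ~{minutes:.1f} minutes")  # → 100 GB total, ~13.3 minutes
```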
🎯 Key Takeaway
Model weight delivery strategy determines deployment speed. Baked weights make images large and slow to pull. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image — mount them or pull from object storage.
Model Weight Delivery Strategy
  • If: Model < 500MB, infrequent updates, self-contained deployment required
    Use: Bake into image — simplest, no external dependencies
  • If: Model > 1GB, shared across multiple containers
    Use: Mount as named volume — small image, shared storage
  • If: Frequent model updates, multi-environment deployment
    Use: Pull from S3/GCS at startup — decouple model from image
  • If: Need version tracking, A/B testing, rollback
    Use: Model registry (MLflow, W&B) — full lifecycle management

Production Deployment Patterns — Health Checks, Graceful Shutdown, and Resource Limits

Deploying ML models in production requires patterns that go beyond basic containerization. Four patterns separate production-grade deployments from fragile ones.

1. Health checks that verify inference capability. A /health endpoint must verify that the model is loaded and can produce valid output. Run a dummy inference at startup and on every health check. Kubernetes readiness probes use this to route traffic only to ready containers.

2. Graceful shutdown for in-flight requests. When a container is stopped (docker stop, Kubernetes pod termination), it receives SIGTERM. The serving framework must stop accepting new requests, complete in-flight requests, and exit cleanly. Default stop timeout is 10 seconds — increase it with --stop-timeout or terminationGracePeriodSeconds if inference takes longer.

3. Resource limits to prevent GPU and memory contention. Without resource limits, one container can consume all GPU memory or host memory, crashing other services. Set --memory limits for RAM. Use NVIDIA_MPS or CUDA_VISIBLE_DEVICES for GPU isolation. In Kubernetes, use resource requests and limits for both CPU/memory and nvidia.com/gpu.

4. Model warmup to avoid cold-start latency. The first inference on a GPU model is slow because CUDA kernels must be compiled. Run a dummy inference at startup to warm up the GPU. This ensures the first real request has the same latency as subsequent requests.
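Patterns 1 and 4 can be sketched together in framework-agnostic Python. The `DummyModel` class, the 4-feature dummy input, and the 10-class output shape are illustrative assumptions standing in for a real PyTorch or TensorFlow model; the `health` function is the body you would wire into a FastAPI `/health` route:

```python
# Sketch: a health check that verifies inference capability (pattern 1)
# plus startup warmup (pattern 4). DummyModel stands in for a real model.
class DummyModel:
    def __call__(self, batch):
        # A real model would return logits; here we echo a fixed-shape result.
        return [0.0] * 10  # hypothetical 10-class output

def run_dummy_inference(model):
    """Run one inference on a fixed dummy input and validate the output shape."""
    output = model([[0.0] * 4])  # hypothetical 4-feature dummy input
    if len(output) != 10:
        raise RuntimeError(f"unexpected output shape: {len(output)}")
    return output

def warmup(model, iterations=3):
    """Pattern 4: run a few inferences at startup so CUDA kernels and
    allocator pools are initialized before the first real request."""
    for _ in range(iterations):
        run_dummy_inference(model)

def health(model):
    """Pattern 1: succeed only if the model is loaded AND produces a
    valid output, not merely if the server process is up."""
    try:
        run_dummy_inference(model)
        return {"status": "ok"}
    except Exception as exc:
        return {"status": "unhealthy", "reason": str(exc)}

model = DummyModel()
warmup(model)
print(health(model))  # → {'status': 'ok'}
```

The key design choice: the health handler exercises the same inference path real requests use, so a model that failed to load, or loads but produces garbage shapes, is reported as unhealthy before any traffic arrives.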

io/thecodeforge/ml-serving-deployment.yml · YAML
# Kubernetes deployment for ML model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-serving
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model-serving
  template:
    metadata:
      labels:
        app: ml-model-serving
    spec:
      containers:
        - name: model-server
          image: io.thecodeforge/ml-model:v1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
              nvidia.com/gpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60  # Model loading takes time
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 5
          env:
            - name: MODEL_SOURCE
              value: "s3"
            - name: MODEL_S3_BUCKET
              valueFrom:
                secretKeyRef:
                  name: ml-model-secrets
                  key: s3-bucket
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
      terminationGracePeriodSeconds: 60  # Allow in-flight requests to complete
▶ Output
# Deploy:
# kubectl apply -f io/thecodeforge/ml-serving-deployment.yml
#
# Verify:
# kubectl get pods -n production -l app=ml-model-serving
# NAME READY STATUS RESTARTS AGE
# ml-model-serving-7d4f8b6c9-abc12 1/1 Running 0 2m
# ml-model-serving-7d4f8b6c9-def34 1/1 Running 0 2m
# ml-model-serving-7d4f8b6c9-ghi56 1/1 Running 0 2m
Mental Model
Production ML Serving as an Airport
Why is initialDelaySeconds set to 60 for ML serving but typically 5-10 for web apps?
  • ML model loading involves reading multi-gigabyte weight files and initializing CUDA contexts.
  • PyTorch model loading can take 30-60 seconds for large models.
  • If the readiness probe fails before the model is loaded, Kubernetes keeps the pod out of rotation, which is safe. If the liveness probe fires too early, however, failed checks trigger container restarts and can put the pod in a crash loop.
  • The initialDelaySeconds, especially on the liveness probe, must exceed the expected model load time to prevent restart loops before the model is ready to serve.
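An alternative worth knowing: rather than inflating initialDelaySeconds, Kubernetes 1.18+ supports a startupProbe that gates the readiness and liveness probes until the model has loaded. A sketch of the relevant fragment (field values are illustrative, not tuned recommendations):

```yaml
# Sketch: startupProbe holds off readiness/liveness checks until the
# model loads, allowing up to 30 x 10s = 300s of startup time.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # restart only if loading exceeds ~300s
```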
📊 Production Insight
The preStop hook with sleep 10 is critical for zero-downtime deployments. When Kubernetes terminates a pod, endpoint removal and pod shutdown proceed in parallel: the pod is removed from the Service endpoints while the preStop hook runs, and SIGTERM is sent only after the hook completes. Endpoint removal is not instant; there is a propagation delay to kube-proxy and load balancers. The sleep 10 delays SIGTERM by 10 seconds, so the pod keeps serving during that window while the endpoint removal propagates. Without this, in-flight requests during deployment get connection resets.
🎯 Key Takeaway
Production ML serving requires health checks that verify inference capability, graceful shutdown for in-flight requests, resource limits for GPU isolation, and model warmup to avoid cold-start latency. The preStop hook with sleep is essential for zero-downtime deployments. These patterns are not optional — they are the difference between a reliable serving system and one that fails during every deployment.
ML Serving Production Readiness Checklist
  • If: Health check only verifies server process
    Use: Add dummy inference to health check — verify model is loaded and produces valid output
  • If: Pods restart during deployment with connection errors
    Use: Add preStop hook with sleep, increase terminationGracePeriodSeconds, ensure SIGTERM handler in serving code
  • If: First request after deployment is 10x slower than subsequent requests
    Use: Add model warmup at startup — run dummy inference to compile CUDA kernels before accepting traffic
  • If: One model container consumes all GPU memory, crashing other containers
    Use: Set nvidia.com/gpu resource limits. Use CUDA_VISIBLE_DEVICES to assign specific GPUs per container.
🗂 Model Weight Delivery Strategies Compared
Baking vs mounting vs pulling — deployment speed, operational complexity, and use cases.
Strategy | Image Size | Deployment Speed | Operational Complexity | Best For
Bake into image (COPY) | Large (model size + framework) | Slow (full image pull) | Low (self-contained) | Models < 500MB, infrequent updates
Named volume (mount) | Small (framework only) | Fast (small image) | Medium (volume management) | Large models, shared across containers
S3/GCS download at startup | Small (framework only) | Fast pull + download time | Medium (credentials, retry logic) | Frequent model updates, multi-environment
Model registry (MLflow) | Small (framework only) | Fast pull + download time | High (registry infrastructure) | Version tracking, A/B testing, rollback

🎯 Key Takeaways

  • Multi-stage builds are essential for ML serving — the training image includes compilers and debugging tools that should never ship to production. The serving image should contain only the framework, the model, and the serving code.
  • GPU access requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU with no error. Always add a startup assertion that verifies GPU availability.
  • Model weight delivery strategy determines deployment speed. Baked weights make images large. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image.
  • Production ML serving requires health checks that verify inference capability (not just server liveness), graceful shutdown for in-flight requests, and model warmup to avoid cold-start latency.
  • Pin identical dependency versions (numpy, scipy, CUDA toolkit) in both training and serving Dockerfiles. Version drift causes silent prediction changes that are extremely difficult to debug.
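One lightweight way to enforce that pinning is to compare installed package versions against an explicit manifest at container startup. A sketch using only the standard library; the `PINNED` entries are illustrative placeholders, not recommended versions:

```python
# Sketch: fail fast at startup if installed versions drift from the
# pins the model was trained against. Pins shown are placeholders.
import sys
from importlib.metadata import PackageNotFoundError, version

PINNED = {
    # "numpy": "1.26.4",  # fill in with the versions used in training
    # "scipy": "1.11.4",
}

def verify_pins(pins):
    """Exit with a clear error if any pinned package is missing or drifted."""
    for pkg, want in pins.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            sys.exit(f"FATAL: pinned package {pkg!r} is not installed")
        if got != want:
            sys.exit(f"FATAL: {pkg} is {got}, training used {want}")

verify_pins(PINNED)
print("dependency pins verified")
```

Pairing this with a prediction consistency test in CI catches the drift before deployment rather than at container startup.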

⚠ Common Mistakes to Avoid

    Shipping the training image as the serving image
    Symptom: 8GB+ image, 10-minute deploys, large attack surface with gcc, Jupyter, and test frameworks in production.
    Fix: Use multi-stage builds. The training stage compiles everything; the serving stage copies only the model weights and runtime dependencies. Final image should be under 2GB.

    Not verifying GPU availability at startup
    Symptom: Model loads successfully, inference works, but latency is 50x slower than expected because PyTorch silently fell back to CPU.
    Fix: Add a startup assertion that checks torch.cuda.is_available() and exits immediately if False. This fail-fast approach prevents hours of debugging silent CPU fallback.

    Health check only verifies server process, not model functionality
    Symptom: Kubernetes routes traffic to a container where the model failed to load, causing 500 errors on all requests.
    Fix: The health check must run a dummy inference and verify the output shape. Use a readinessProbe with initialDelaySeconds that exceeds model load time.

    Baking multi-gigabyte model weights into the image
    Symptom: Every deployment pulls 5GB+ even when only the serving code changed. Deploys take 15+ minutes across 20 nodes.
    Fix: Mount weights as a volume or download from S3 at startup. Keep the serving image under 500MB.

    Not pinning numpy/scipy/CUDA versions between training and serving
    Symptom: The model produces different predictions in production than in the training notebook. Floating-point library version differences change computation results.
    Fix: Pin identical dependency versions in both training and serving Dockerfiles. Add a prediction consistency test to CI.

    No graceful shutdown for in-flight requests
    Symptom: During rolling deployments, in-flight inference requests get connection resets. Users see intermittent 500 errors.
    Fix: Add a preStop hook with sleep 10, increase terminationGracePeriodSeconds to 60, and implement a SIGTERM handler in the serving code.

Interview Questions on This Topic

  • Q: How would you structure a Dockerfile for an ML model serving container? Walk me through the multi-stage build approach and explain why you would not ship the training environment.
  • Q: Your ML model container is running but inference latency is 50x slower than your benchmarks. Walk me through your debugging process.
  • Q: Explain the difference between baking model weights into the Docker image, mounting them as a volume, and downloading them from S3 at startup. When would you use each?
  • Q: How do you handle GPU access in Docker containers? What happens if the NVIDIA Container Toolkit is not installed?
  • Q: Your Kubernetes deployment shows pods restarting during rolling updates with connection errors. How do you fix this for an ML serving container?
  • Q: What should a health check endpoint verify for an ML model serving container? Why is a simple 'server is running' check insufficient?

Frequently Asked Questions

How do I access GPUs from a Docker container?

Install the NVIDIA Container Toolkit on the host, configure it with nvidia-ctk runtime configure, restart Docker, then run containers with --gpus all. Verify with docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi. In Docker Compose, use deploy.resources.reservations.devices with driver: nvidia.

Why is my ML model inference slow inside Docker but fast on the host?

The most common cause is the container not having GPU access. Without --gpus, PyTorch and TensorFlow silently fall back to CPU. Check with docker exec <container> python -c 'import torch; print(torch.cuda.is_available())'. If False, the NVIDIA Container Toolkit is not configured or the --gpus flag is missing.
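The fail-fast check described above can be as small as this. The availability flag is passed in as a boolean so the same function works for PyTorch (torch.cuda.is_available()) or TensorFlow (tf.config.list_physical_devices('GPU')); the function name is a hypothetical helper, not a library API:

```python
# Sketch: fail fast at container startup instead of silently serving on CPU.
import sys

def require_gpu(gpu_available: bool) -> None:
    """Exit immediately if the container has no usable GPU."""
    if not gpu_available:
        sys.exit(
            "FATAL: no GPU visible inside the container. "
            "Check that the NVIDIA Container Toolkit is installed "
            "and the container was started with --gpus."
        )

# At startup, with PyTorch:
#   import torch
#   require_gpu(torch.cuda.is_available())
```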

Should I bake model weights into the Docker image?

Only for models under 500MB with infrequent updates. For larger models, mount weights as a volume or download from S3/GCS at startup. Baked weights make every deployment a multi-gigabyte pull — even when only the serving code changed. This adds 10-15 minutes to deployment time across a cluster.

How do I handle model versioning with Docker?

Tag images with the model version (my-model:v1.2.3). Use a model registry (MLflow, Weights & Biases) to version weights independently of the serving image. The serving image contains the framework; the startup script pulls the correct model version from the registry. This decouples model updates from image updates.

What is the difference between Docker and NVIDIA Triton for model serving?

Docker is a containerization platform — it packages and runs any application. Triton Inference Server is a model serving framework that runs inside a Docker container. Triton provides GPU-optimized inference, dynamic batching, model ensembles, and multi-framework support. You use Docker to containerize Triton, not as an alternative to it.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
