The image is the artifact. It runs identically on your laptop, CI, and production GPU instances.
Multi-stage builds separate training dependencies from serving runtime, keeping images small.
The NVIDIA Container Toolkit exposes GPU devices to containers via the --gpus flag.
Base image with pinned CUDA and Python versions
Model weights copied or mounted as volumes
Serving framework (FastAPI, TorchServe, Triton) as the entrypoint
Health check endpoint for orchestrator readiness probes
Plain-English First
Imagine you bake the perfect cake in your kitchen — but when you try to bake it at a friend's house, it collapses because their oven runs hotter and they don't have the same brand of flour. Docker is like shipping your entire kitchen — oven, flour, recipe, temperature settings — in one sealed box, so the cake comes out identical every single time, on any stove, anywhere in the world. For ML models, that 'kitchen' is Python 3.10, CUDA 11.8, PyTorch, your trained weights file, and your serving script. Docker boxes all of that up so the model that worked on your laptop works exactly the same way in production on a cloud GPU instance.
ML models are environment-sensitive in ways that web applications are not. A model trained with numpy 1.23 can silently produce different floating-point results on numpy 2.0. A CUDA version mismatch between training and serving can cause crashes or a silent CPU fallback that makes inference as much as 50x slower. These are not hypothetical: they are among the most common causes of 'it works on my machine' failures in ML deployments.
Docker eliminates environment drift by packaging the entire runtime — OS libraries, Python version, pip packages, CUDA toolkit, model weights, and serving logic — into a single versioned image. That image runs identically on your laptop, your CI pipeline, a Kubernetes cluster, and an edge device.
The gap between a Jupyter notebook that produces great metrics and a model that reliably serves predictions in production is wider than most teams expect. Docker closes that gap by making the environment a constant, not a variable. This guide covers the patterns that separate production-grade ML containers from fragile ones.
Multi-Stage Builds for ML — Separating Training from Serving
The most common mistake in ML Dockerfiles is shipping the training environment as the serving image. A training image includes Jupyter, gcc, test frameworks, debugging tools, and development dependencies — none of which are needed in production. This bloats the image to 5-10GB, increases attack surface, and slows deployments.
Multi-stage builds solve this by using a heavy 'builder' stage with all training and build dependencies, then copying only the trained model and runtime dependencies into a minimal 'serving' stage. The final image contains Python, the serving framework, and the model — nothing else.
Why this matters for ML specifically: ML images are uniquely large because they include CUDA toolkit (2-3GB), PyTorch/TensorFlow (1-2GB), and model weights (1-10GB). A single-stage image that includes training tools, CUDA development headers, and model weights can easily exceed 10GB. A multi-stage serving image with quantized weights can be under 2GB.
Layer caching for ML: Model weights change rarely (only after retraining). Dependencies change occasionally. Application code changes frequently. Order your Dockerfile: base image + CUDA first, dependencies second, model weights third, application code last. This ensures code changes do not trigger a re-download of PyTorch or a re-copy of multi-gigabyte weights.
io/thecodeforge/ml-serving.Dockerfile (Dockerfile)
# ─── STAGE 1: Build environment (training deps, compilers) ───
FROM python:3.10.12-slim-bookworm AS builder
WORKDIR /build
# Install build dependencies (not in final image)
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ libpq-dev && \
rm -rf /var/lib/apt/lists/*
COPY requirements-training.txt .
RUN pip install --user --no-cache-dir -r requirements-training.txt
# Simulate model training artifact (in practice, this comes from
# a training pipeline or model registry)
COPY models/ ./models/
# ─── STAGE 2: Serving runtime (minimal) ───
FROM python:3.10.12-slim-bookworm AS serving
WORKDIR /app
# Install only runtime dependencies
COPY requirements-serving.txt .
RUN pip install --no-cache-dir -r requirements-serving.txt
# Copy trained model weights from builder
COPY --from=builder /build/models/ ./models/
# Copy serving application code
COPY src/serving/ ./src/serving/
# Non-root user for security
RUN useradd --create-home appuser
USER appuser
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"
EXPOSE 8080
CMD ["python", "-m", "uvicorn", "src.serving.api:app", "--host", "0.0.0.0", "--port", "8080"]
Docker layers are additive. A file added in one layer and deleted in a later layer still occupies space in the earlier layer.
A RUN pip install torch followed by a separate RUN pip uninstall -y torch still has torch in the first layer, so the image does not shrink.
Multi-stage builds start fresh — the serving stage never contains training dependencies in any layer.
This is the only way to genuinely reduce image size for ML workloads where base dependencies are gigabytes.
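The additive-layer rule can be illustrated with a toy model. This is only a sketch: the `Layer` class and the sizes are hypothetical and do not reflect how Docker stores layers internally. It shows why deleting a file in a later layer leaves the stored image size unchanged, while a fresh serving stage never carries the bytes at all.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Toy stand-in for an image layer (hypothetical, not Docker's real format)."""
    added: dict = field(default_factory=dict)   # path -> size in MB written by this layer
    deleted: set = field(default_factory=set)   # paths this layer marks as removed


def image_size_mb(layers):
    """Stored size: every layer keeps the bytes it added; deletions free nothing."""
    return sum(sum(layer.added.values()) for layer in layers)


def visible_size_mb(layers):
    """Size of the files a running container would actually see."""
    files = {}
    for layer in layers:
        files.update(layer.added)
        for path in layer.deleted:
            files.pop(path, None)
    return sum(files.values())


# Two separate RUN instructions: install torch, then uninstall it.
single_stage = [
    Layer(added={"python": 150}),
    Layer(added={"torch": 2000}),
    Layer(deleted={"torch"}),
]
# A serving stage that never installed torch in any layer.
serving_stage = [Layer(added={"python": 150})]

print(image_size_mb(single_stage), visible_size_mb(single_stage))
print(image_size_mb(serving_stage), visible_size_mb(serving_stage))
```

The first image still stores the 2000 MB it no longer shows; the second never had them.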
Production Insight
The layer caching insight is critical for ML because dependency downloads are large. PyTorch with CUDA is ~2GB. If every code change invalidates the pip install layer, every CI build downloads 2GB of dependencies. By copying requirements-serving.txt before application code, the dependency layer is cached on code-only changes — turning 10-minute builds into 30-second builds.
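The effect of instruction order on caching can be sketched with a simplified rule: the first layer whose input changed is rebuilt, and so is every layer after it. The layer names below are hypothetical labels, and real build caches have more nuances, so treat this as an illustration rather than a model of Docker's cache.

```python
def rebuilt_layers(layer_order, changed_inputs):
    """Return the layers that must be rebuilt under the simplified cache rule.

    The first layer whose input changed is rebuilt, along with every
    layer that comes after it in the Dockerfile.
    """
    for index, layer in enumerate(layer_order):
        if layer in changed_inputs:
            return layer_order[index:]
    return []


# Hypothetical layer labels, for illustration only.
good_order = ["base+cuda", "dependencies", "model weights", "application code"]
bad_order = ["base+cuda", "application code", "dependencies", "model weights"]

code_change = {"application code"}
print(rebuilt_layers(good_order, code_change))  # only the code layer
print(rebuilt_layers(bad_order, code_change))   # code, dependencies, and weights
```

With the recommended order, a code-only change rebuilds one small layer; with code copied early, the dependency and weight layers are rebuilt too.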
Key Takeaway
Multi-stage builds are not optional for ML serving images. The training environment and the serving environment are fundamentally different — training needs compilers and debugging tools, serving needs only the framework and the model. A 10GB training image deployed as a serving image wastes storage, slows deployments, and increases attack surface.
ML Image Size Optimization
If the image is >5GB and includes training tools → use multi-stage builds to separate training from serving. Copy only model weights and runtime deps to the serving stage.
If model weights are >1GB and baked into the image → move weights to a volume or download them from S3/GCS at container startup. Keep the image under 2GB.
If the CUDA toolkit adds 2-3GB to the image → use runtime-only CUDA base images (nvidia/cuda:11.8.0-runtime-ubuntu22.04) instead of devel images.
If multiple models share the same serving framework → create a shared base image with the framework and extend it per model with just the weights and config.
GPU Access with NVIDIA Container Toolkit
For large models, ML inference on CPU can be 10-100x slower than on GPU. Docker does not expose GPU devices to containers by default: you need the NVIDIA Container Toolkit and the --gpus flag.
The NVIDIA Container Toolkit (formerly nvidia-docker2) installs a Docker runtime that automatically mounts the GPU device drivers and libraries into containers. Without it, containers see no GPU devices even when the host has GPUs available.
Installation and verification:
1. Install nvidia-container-toolkit on the host.
2. Configure Docker to use the nvidia runtime: sudo nvidia-ctk runtime configure --runtime=docker
3. Restart Docker: sudo systemctl restart docker
4. Verify: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
GPU allocation strategies:
- --gpus all: expose all GPUs to the container
- --gpus 1: expose one GPU (Docker picks which)
- --gpus '"device=0,2"': expose specific GPUs by index
- NVIDIA_VISIBLE_DEVICES=0,2: set via environment variable (useful in Compose)
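When containers are launched from a script rather than a shell, the quoting of the device list is easy to get wrong. The sketch below builds the argument list for the three --gpus forms listed above; the helper names `gpu_args` and `docker_run_command` are hypothetical, and the image name is reused from this guide purely as a placeholder.

```python
def gpu_args(count=None, devices=None):
    """Build the --gpus arguments for `docker run` in argv form (no shell)."""
    if count is not None and devices is not None:
        raise ValueError("pass either count or devices, not both")
    if devices is not None:
        # The value keeps literal double quotes so Docker does not split
        # the device list on the comma.
        device_list = ",".join(str(d) for d in devices)
        return ["--gpus", '"device={}"'.format(device_list)]
    if count is not None:
        return ["--gpus", str(count)]
    return ["--gpus", "all"]


def docker_run_command(image, *, count=None, devices=None):
    """Return a full `docker run` argv; pass it to subprocess.run to execute."""
    return ["docker", "run", "--rm", *gpu_args(count, devices), image]


print(docker_run_command("io.thecodeforge/ml-serving:v1.0"))
print(docker_run_command("io.thecodeforge/ml-serving:v1.0", count=1))
print(docker_run_command("io.thecodeforge/ml-serving:v1.0", devices=[0, 2]))
```

Building an argv list avoids a layer of shell quoting: the single quotes in the shell form exist only to protect the inner double quotes.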
Failure scenario — silent CPU fallback: If the NVIDIA Container Toolkit is not installed or --gpus is not passed, PyTorch and TensorFlow silently fall back to CPU. The model loads successfully, inference works, but latency is 50x slower than expected. There is no error — torch.cuda.is_available() returns False, but many serving frameworks do not check this. The fix: always add a startup assertion that verifies GPU availability.
io/thecodeforge/ml_serving/startup_check.py (Python)
import logging
import sys

logger = logging.getLogger(__name__)


def verify_gpu_availability(required_gpus: int = 1) -> None:
    """Startup assertion: fail fast if GPU is not available.

    Call this at application startup before loading the model.
    If GPU is required but not available, exit immediately
    rather than silently falling back to CPU.
    """
    try:
        import torch
    except ImportError:
        logger.error("PyTorch not installed. Cannot verify GPU availability.")
        sys.exit(1)

    available = torch.cuda.is_available()
    device_count = torch.cuda.device_count()

    if not available:
        logger.error(
            "GPU not available. torch.cuda.is_available() returned False. "
            "Ensure NVIDIA Container Toolkit is installed and "
            "container is started with --gpus flag."
        )
        sys.exit(1)

    if device_count < required_gpus:
        logger.error(
            f"Insufficient GPUs: required={required_gpus}, "
            f"available={device_count}. "
            "Adjust --gpus flag or reduce requirement."
        )
        sys.exit(1)

    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    logger.info(
        f"GPU verified: {gpu_name}, "
        f"{gpu_memory:.1f}GB VRAM, "
        f"{device_count} device(s) available"
    )
Output
# Successful startup:
# GPU verified: NVIDIA A10G, 22.0GB VRAM, 1 device(s) available
# Failed startup (no --gpus flag):
# ERROR: GPU not available. torch.cuda.is_available() returned False.
# Ensure NVIDIA Container Toolkit is installed and container is started with --gpus flag.
GPU Access as a Device Permission
PyTorch was designed to work on both CPU and GPU — GPU is an optimization, not a requirement.
Many development environments (laptops without GPU) run PyTorch on CPU legitimately.
The framework cannot know if you intended to use GPU or CPU — it defers to the developer.
This is why a startup assertion (verify_gpu_availability) is essential in production serving containers.
Production Insight
The silent CPU fallback is the most insidious GPU-related production bug. The model loads successfully, inference returns correct results, but latency is 50x slower than expected. Monitoring shows high CPU usage instead of GPU utilization. The team spends hours profiling the model code before discovering the container never had GPU access. The fix is a one-time startup assertion that fails fast if GPU is not available.
Key Takeaway
GPU access in Docker requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU — no error, just 50x slower inference. Always add a startup assertion that verifies GPU availability. This single check prevents hours of debugging silent CPU fallback.
If torch.cuda.is_available() returns False inside the container → check whether the --gpus flag was passed and whether nvidia-container-toolkit is installed on the host. Run: docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
If the GPU is available but inference is still slow → check whether data is being transferred CPU->GPU on every inference call. Pin the model to the GPU once at startup. Check the batch size: batches that are too small underutilize GPU parallelism.
If CUDA runs out of memory during inference → check whether multiple containers share the same GPU. Use CUDA_VISIBLE_DEVICES to assign specific GPUs. Use float16 instead of float32. Reduce the batch size.
If nvidia-smi works on the host but not in the container → the host driver must support the CUDA version used in the container. Compare the CUDA version reported by nvidia-smi on the host (the maximum the driver supports) with the CUDA version in the FROM image; the host value must be >= the container value.
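The driver check in the last item can be automated in a deployment script. The sketch below assumes you have already read the "CUDA Version" value from the nvidia-smi header on the host and the CUDA version from the FROM tag; the function names are hypothetical. It is deliberately conservative: NVIDIA's compatibility rules allow some combinations that this simple comparison rejects.

```python
def parse_version(text):
    """Turn '11.8' or '12.2.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def container_cuda_supported(host_max_cuda, container_cuda):
    """True if the host driver's maximum CUDA version covers the container's.

    Only major.minor is compared, so '11.8' and '11.8.0' are treated as equal.
    """
    return parse_version(host_max_cuda)[:2] >= parse_version(container_cuda)[:2]


print(container_cuda_supported("12.2", "11.8.0"))  # newer driver, older container
print(container_cuda_supported("11.4", "11.8.0"))  # driver older than container
```

A False result is a prompt to check the driver version before debugging anything inside the container.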
Model Serving Patterns — FastAPI, TorchServe, and Triton
There are three common patterns for serving ML models in Docker containers. The right choice depends on your latency requirements, model complexity, and operational maturity.
Pattern 1: FastAPI + direct model loading. Load the model at startup, expose a /predict endpoint. Simple, full control over the inference pipeline, easy to customize. Best for single-model serving with custom pre/post-processing. The model is loaded into the application process — startup time equals model load time.
Pattern 2: TorchServe / TF Serving. Purpose-built serving frameworks with built-in batching, model versioning, and A/B testing. More operational overhead but better for multi-model serving and high-throughput scenarios. TorchServe runs a separate model server process — the Docker container wraps the TorchServe binary.
Pattern 3: NVIDIA Triton Inference Server. GPU-optimized serving with dynamic batching, model ensemble pipelines, and multi-framework support (PyTorch, TensorFlow, ONNX, TensorRT). Highest throughput but most complex configuration. Best for latency-critical production workloads with multiple models.
Health check patterns: A health check that only verifies the server is running is insufficient for ML serving. The health check must verify that the model is loaded and can produce a valid output. A /health endpoint should run a dummy inference with a known input and verify the output shape matches expectations.
io/thecodeforge/ml_serving/api.py (Python)
import logging
from typing import List

import numpy as np
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Adjust this import to match your own package layout.
from io.thecodeforge.ml_serving.startup_check import verify_gpu_availability

logger = logging.getLogger(__name__)

app = FastAPI(title="ML Model Serving API")

# Global model reference, loaded once at startup
model = None
model_device = None


class PredictionRequest(BaseModel):
    features: List[float]


class PredictionResponse(BaseModel):
    prediction: float
    model_version: str
    device: str


@app.on_event("startup")
async def load_model() -> None:
    """Load model once at startup, not on every request."""
    global model, model_device

    # Fail fast if GPU is not available
    verify_gpu_availability(required_gpus=1)

    model_device = torch.device("cuda:0")
    model_path = "./models/production_model.pt"
    logger.info(f"Loading model from {model_path}...")
    model = torch.jit.load(model_path, map_location=model_device)
    model.eval()

    # Warmup inference: ensures CUDA kernels are compiled
    dummy_input = torch.randn(1, 128, device=model_device)
    with torch.no_grad():
        _ = model(dummy_input)
    logger.info("Model loaded and warmed up successfully.")


@app.get("/health")
async def health_check() -> dict:
    """Health check that verifies model is loaded and functional."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Dummy inference to verify the model actually works
    try:
        dummy_input = torch.randn(1, 128, device=model_device)
        with torch.no_grad():
            output = model(dummy_input)
        return {
            "status": "healthy",
            "model_loaded": True,
            "output_shape": list(output.shape),
            "device": str(model_device),
        }
    except Exception as e:
        raise HTTPException(
            status_code=503,
            detail=f"Model health check failed: {str(e)}"
        )


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest) -> PredictionResponse:
    """Run inference on the loaded model."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    try:
        input_tensor = torch.tensor(
            [request.features], dtype=torch.float32, device=model_device
        )
        with torch.no_grad():
            output = model(input_tensor)
        return PredictionResponse(
            prediction=output.item(),
            model_version="v1.0.0",
            device=str(model_device),
        )
    except Exception as e:
        logger.error(f"Inference failed: {e}")
        raise HTTPException(status_code=500, detail=f"Inference error: {str(e)}")
Model loading takes 5-60 seconds depending on model size. If loading were deferred to the first request, that request would likely time out.
Loading at startup means the health check can verify the model is functional before accepting traffic.
The orchestrator (Kubernetes, ECS) uses the health check to know when the container is ready.
Warmup inference ensures CUDA kernels are compiled before the first real request — avoids cold-start latency.
Production Insight
The health check pattern is critical for ML serving. A health check that only verifies the server process is running (returns 200) is insufficient. The model might have failed to load, the GPU might be unavailable, or the weights might be corrupted. A proper health check runs a dummy inference and verifies the output shape. Kubernetes readiness probes use this health check to route traffic only to containers that are actually ready to serve predictions.
Key Takeaway
Choose the serving pattern that matches your operational maturity. FastAPI for simplicity and control. TorchServe for multi-model production. Triton for maximum GPU throughput. Regardless of framework, the health check must verify model loading and inference capability — not just server process liveness.
Model Serving Framework Selection
If you have a single model, custom pre/post-processing, and need fast iteration → use FastAPI: simple, full control, easy to customize, minimal operational overhead.
If you have multiple models and need versioning, A/B testing, and high throughput → use TorchServe: built-in batching, model management, versioning API.
If you run latency-critical production workloads with multiple models on GPU → use Triton Inference Server: dynamic batching, model ensembles, TensorRT integration.
If you deploy to the edge, are resource-constrained, and have no GPU → use ONNX Runtime or TensorFlow Lite: optimized for CPU inference on edge devices.
Volume Strategies for Model Weights — Baking vs Mounting vs Pulling
Model weights are the largest component of an ML serving image. A production NLP model can be 2-10GB. A large language model can be 50-200GB. How you deliver these weights to the container has a major impact on deployment speed, storage costs, and operational flexibility.
Strategy 1: Bake weights into the image (COPY). Simplest approach. The weights are part of the image layer. Every deployment pulls the full image including weights. Pros: self-contained, no external dependencies. Cons: every model update requires a full image rebuild and multi-gigabyte pull. Not practical for models >1GB.
Strategy 2: Mount weights as a named volume. Weights are stored in a Docker volume, mounted into the container at runtime. The image stays small (just the framework and serving code). Pros: image is small and fast to pull. Cons: weights must be pre-populated in the volume. Requires a separate weight management process.
Strategy 3: Pull weights from object storage at startup. The container downloads weights from S3/GCS/Azure Blob at startup. Pros: always gets the latest version, no pre-population needed, works across environments. Cons: adds startup latency (5-60 seconds depending on model size and network), requires credentials management, adds a failure mode (network timeout during download).
Strategy 4: Hybrid — framework in image, weights in registry. Use a model registry (MLflow, Weights & Biases, SageMaker Model Registry) to version and store weights. The serving image contains the framework and a startup script that pulls the correct model version from the registry. This is the most operationally mature approach — it decouples model updates from image updates.
io/thecodeforge/ml_serving/model_loader.py (Python)
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)


def load_model_weights(model_name: str, version: str) -> Path:
    """Load model weights using the appropriate strategy.

    Strategy is determined by environment variable MODEL_SOURCE:
    - 'baked': weights are in the image (COPY in Dockerfile)
    - 'volume': weights are in a mounted volume
    - 's3': weights are downloaded from S3 at startup
    """
    source = os.environ.get("MODEL_SOURCE", "baked")

    if source == "baked":
        # Weights were COPY'd into the image during build
        model_path = Path(f"./models/{model_name}/{version}/model.pt")
        if not model_path.exists():
            raise FileNotFoundError(
                f"Baked model not found at {model_path}. "
                "Ensure the model was copied during docker build."
            )
        logger.info(f"Loaded baked model from {model_path}")
        return model_path

    elif source == "volume":
        # Weights are in a mounted Docker volume
        volume_path = os.environ.get("MODEL_VOLUME_PATH", "/data/models")
        model_path = Path(f"{volume_path}/{model_name}/{version}/model.pt")
        if not model_path.exists():
            raise FileNotFoundError(
                f"Model not found in volume at {model_path}. "
                "Ensure the volume is mounted and contains the model."
            )
        logger.info(f"Loaded model from volume: {model_path}")
        return model_path

    elif source == "s3":
        # Download from S3 at startup
        import boto3

        bucket = os.environ["MODEL_S3_BUCKET"]
        key = f"models/{model_name}/{version}/model.pt"
        local_path = Path(f"/tmp/models/{model_name}/{version}/model.pt")
        local_path.parent.mkdir(parents=True, exist_ok=True)
        logger.info(f"Downloading model from s3://{bucket}/{key}...")
        s3 = boto3.client("s3")
        s3.download_file(bucket, key, str(local_path))
        logger.info(f"Downloaded model to {local_path}")
        return local_path

    else:
        raise ValueError(f"Unknown MODEL_SOURCE: {source}")
Output
# With baked weights:
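# docker run --gpus all -e MODEL_SOURCE=baked io.thecodeforge/ml-model:v1.0
# With S3 download (MODEL_SOURCE=s3 and MODEL_S3_BUCKET must also be set):
# docker run --gpus all -e MODEL_SOURCE=s3 -e MODEL_S3_BUCKET=<bucket> -e AWS_DEFAULT_REGION=us-east-1 io.thecodeforge/ml-serving:v1.0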
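Pulling weights at startup adds a network failure mode, so the download is usually wrapped in retry logic. The following is a minimal sketch of exponential backoff around any zero-argument download callable; `download_with_retry` and `flaky_download` are hypothetical names, and the broad `except Exception` should be narrowed to network errors in real code.

```python
import time


def download_with_retry(download, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call download() until it succeeds, backing off exponentially between tries.

    `download` is any zero-argument callable. The last exception is
    re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except Exception:
            if attempt == attempts:
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated download that times out twice before succeeding.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated network timeout")
    return "/tmp/models/model.pt"


delays = []  # record requested delays instead of actually sleeping
print(download_with_retry(flaky_download, sleep=delays.append))
print(delays)
```

In the S3 branch of the loader, the callable could be a lambda around `s3.download_file(bucket, key, str(local_path))`.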
Model Weights as a Supply Chain Decision
Bake when: model is <500MB, deployment frequency is low, self-contained images are required (air-gapped environments).
Mount when: model is >1GB, multiple containers share the same weights, you need to update weights without rebuilding the image.
Pull from S3 when: model updates are frequent, you need version management, you deploy across multiple environments.
Use a model registry (MLflow) when: you need version tracking, A/B testing, and rollback capabilities.
Production Insight
The failure scenario for baked weights is deployment speed. A 5GB model baked into the image means every deployment pulls 5GB+ — even if only the serving code changed. With 20 nodes and a 1Gbps network, that is 100GB of data transfer and 15+ minutes of deployment time. Mounting weights as a volume or pulling from S3 keeps the image under 500MB and deploys in under 30 seconds.
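The transfer figure can be checked with a few lines of arithmetic. This sketch assumes decimal units and a single 1Gbps link shared by all nodes as the only bottleneck; registry, decompression, and scheduling overhead are ignored, so the result is a lower bound rather than a prediction.

```python
def transfer_seconds(image_gb, nodes, link_gbps):
    """Seconds to move an image of image_gb to each of `nodes` over one shared link.

    Assumes 1 GB = 8 gigabits and that the link is the only bottleneck.
    """
    total_gigabits = image_gb * nodes * 8
    return total_gigabits / link_gbps


seconds = transfer_seconds(image_gb=5, nodes=20, link_gbps=1)
print(f"{5 * 20} GB total, {seconds:.0f} s, about {seconds / 60:.1f} minutes")
```

Raw transfer alone takes roughly 13 minutes under these assumptions, which is in line with the 15+ minute figure once overhead is added.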
Key Takeaway
Model weight delivery strategy determines deployment speed. Baked weights make images large and slow to pull. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image — mount them or pull from object storage.
If the model is < 500MB and updated infrequently → bake it into the image: simplest, no external dependencies.
If the model is > 1GB and shared across multiple containers → mount it as a named volume: small image, shared storage.
If model updates are frequent and you deploy to multiple environments → pull from S3/GCS at startup: decouples the model from the image.
If you need version tracking, A/B testing, and rollback → use a model registry (MLflow, W&B): full lifecycle management.
Production Deployment Patterns — Health Checks, Graceful Shutdown, and Resource Limits
Deploying ML models in production requires patterns that go beyond basic containerization. Four patterns separate production-grade deployments from fragile ones.
1. Health checks that verify inference capability. A /health endpoint must verify that the model is loaded and can produce valid output. Run a dummy inference at startup and on every health check. Kubernetes readiness probes use this to route traffic only to ready containers.
2. Graceful shutdown for in-flight requests. When a container is stopped (docker stop, Kubernetes pod termination), it receives SIGTERM. The serving framework must stop accepting new requests, complete in-flight requests, and exit cleanly. Docker's default stop timeout is 10 seconds and Kubernetes' default terminationGracePeriodSeconds is 30 seconds; increase them (--stop-timeout for Docker) if inference takes longer.
3. Resource limits to prevent GPU and memory contention. Without resource limits, one container can consume all GPU memory or host memory, crashing other services. Set --memory limits for RAM. Use NVIDIA MPS (Multi-Process Service) or CUDA_VISIBLE_DEVICES for GPU isolation. In Kubernetes, use resource requests and limits for both CPU/memory and nvidia.com/gpu.
4. Model warmup to avoid cold-start latency. The first inference on a GPU model is slow because CUDA kernels must be compiled. Run a dummy inference at startup to warm up the GPU. This ensures the first real request has the same latency as subsequent requests.
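The graceful-shutdown pattern in item 2 can be sketched with the standard library alone. Many serving frameworks implement their own SIGTERM handling, so this is an illustration of the mechanism for a custom serving loop, not a replacement for it; `ShutdownCoordinator` and its methods are hypothetical names.

```python
import signal
import threading


class ShutdownCoordinator:
    """Tracks in-flight requests so the process can drain before exiting."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self.shutting_down = False

    def begin_shutdown(self, signum=None, frame=None):
        """Usable as a signal handler: only flips a flag, takes no locks."""
        self.shutting_down = True

    def try_start_request(self):
        """Admit a request, or return False once shutdown has begun."""
        with self._cond:
            if self.shutting_down:
                return False
            self._in_flight += 1
            return True

    def finish_request(self):
        """Mark one admitted request as completed."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def wait_for_drain(self, timeout):
        """Block until no requests are in flight; False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


def install_sigterm_handler(coordinator):
    """Route SIGTERM to the coordinator (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, coordinator.begin_shutdown)


# Demonstration without sending a real signal.
coordinator = ShutdownCoordinator()
assert coordinator.try_start_request()       # one request is admitted
coordinator.begin_shutdown()                 # what SIGTERM would trigger
print(coordinator.try_start_request())       # new requests are now refused
threading.Timer(0.05, coordinator.finish_request).start()  # in-flight request completes
print(coordinator.wait_for_drain(timeout=2.0))
```

The drain timeout should stay below the container's stop timeout so the process exits on its own before it is killed.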
ML model loading involves reading multi-gigabyte weight files and initializing CUDA contexts.
PyTorch model loading can take 30-60 seconds for large models.
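If the readiness probe fires before the model is loaded, it fails, and Kubernetes keeps the pod out of the Service until a later probe succeeds.
Set initialDelaySeconds above the expected model load time so probes do not fail needlessly during loading; this matters most for liveness probes, where repeated failures restart the container mid-load.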
Production Insight
The preStop hook with sleep 10 is critical for zero-downtime deployments. When Kubernetes terminates a pod, it starts removing the pod from the Service endpoints at the same time as it begins shutting the container down, but endpoint removal is not instant: there is a propagation delay. Because the preStop hook runs before SIGTERM is delivered, sleep 10 keeps the pod accepting requests for 10 seconds while the endpoint removal propagates. Without this, requests routed to the pod during a deployment get connection resets.
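The probe, preStop, and resource settings discussed in this section can be sketched as a pod spec fragment. It is written here as a Python dict and printed as JSON so that it stays in the same language as the other examples; the container name, the memory limit, and `MODEL_LOAD_SECONDS` are illustrative assumptions, while the image name, /health path, and port 8080 follow the earlier examples in this guide.

```python
import json

MODEL_LOAD_SECONDS = 60   # assumption: measured worst-case model load time
PRESTOP_SLEEP_SECONDS = 10

container = {
    "name": "ml-serving",                          # hypothetical name
    "image": "io.thecodeforge/ml-serving:v1.0",
    "ports": [{"containerPort": 8080}],
    "readinessProbe": {
        "httpGet": {"path": "/health", "port": 8080},
        "initialDelaySeconds": MODEL_LOAD_SECONDS + 10,
        "periodSeconds": 10,
        "timeoutSeconds": 5,
        "failureThreshold": 3,
    },
    # Keep serving while endpoint removal propagates.
    "lifecycle": {
        "preStop": {"exec": {"command": ["sleep", str(PRESTOP_SLEEP_SECONDS)]}}
    },
    # One whole GPU per container plus a RAM ceiling (values are illustrative).
    "resources": {"limits": {"nvidia.com/gpu": 1, "memory": "8Gi"}},
}

pod_spec = {
    # Must cover the preStop sleep plus time to drain in-flight requests.
    "terminationGracePeriodSeconds": 60,
    "containers": [container],
}

print(json.dumps(pod_spec, indent=2))
```

The fragment is meant to be merged into the pod template of a Deployment; the grace period is deliberately larger than the preStop sleep so that draining still has time to finish.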
Key Takeaway
Production ML serving requires health checks that verify inference capability, graceful shutdown for in-flight requests, resource limits for GPU isolation, and model warmup to avoid cold-start latency. The preStop hook with sleep is essential for zero-downtime deployments. These patterns are not optional — they are the difference between a reliable serving system and one that fails during every deployment.
ML Serving Production Readiness Checklist
If the health check only verifies the server process → add a dummy inference to the health check to verify the model is loaded and produces valid output.
If pods restart during deployment with connection errors → add a preStop hook with sleep, increase terminationGracePeriodSeconds, and ensure the serving code has a SIGTERM handler.
If the first request after deployment is 10x slower than subsequent requests → add model warmup at startup: run a dummy inference to compile CUDA kernels before accepting traffic.
If one model container consumes all GPU memory, crashing other containers → set nvidia.com/gpu resource limits. Use CUDA_VISIBLE_DEVICES to assign specific GPUs per container.
Production incident · POST-MORTEM · severity: high
Silent Prediction Drift — numpy Version Mismatch Between Training and Serving Containers
Symptom
A/B test showed the production model had 3.2% lower click-through rate than the offline evaluation. The model code was identical. The weights were identical. The input data pipeline was identical. Engineers could not reproduce the discrepancy locally because their dev environment matched the training environment.
Assumption
Team assumed a data pipeline issue — perhaps the production feature store had stale data. They spent 2 days comparing feature vectors between training and serving. All features matched. Second assumption: a random seed issue causing non-deterministic behavior. They set all seeds explicitly — the discrepancy persisted.
Root cause
The training Dockerfile used FROM python:3.10 which resolved to numpy 1.24 at build time. The serving Dockerfile used FROM python:3.11 which resolved to numpy 2.0. numpy 2.0 changed the default rounding behavior in np.dot and np.matmul for certain float32 operations. The model's softmax layer used np.exp on logits that were near the overflow boundary — the rounding difference changed which items appeared in the top-10 recommendation list. The 3.2% CTR drop was caused by slightly different recommendations being served.
Fix
1. Pinned numpy==1.24.3 in both training and serving requirements.txt.
2. Pinned the base image to FROM python:3.10.12-slim-bookworm in both Dockerfiles.
3. Added a CI step that runs a prediction consistency test: the same input must produce identical output in both containers.
4. Added pip freeze output to the image metadata as a LABEL for auditability.
5. Implemented a model validation pipeline that compares offline and online predictions within a tolerance threshold before deploying.
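A prediction consistency test of the kind described in the fix can be very small. The sketch below compares two lists of predictions for the same fixed inputs; `predictions_match` and the sample values are hypothetical, and the tolerances are placeholders to tune per model (setting both to 0 demands identical output).

```python
import math


def predictions_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Return (ok, mismatches) for two equal-length lists of float predictions.

    Each mismatch is a tuple (index, reference_value, candidate_value).
    """
    if len(reference) != len(candidate):
        raise ValueError("prediction lists differ in length")
    mismatches = [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    return (not mismatches, mismatches)


# Hypothetical outputs for the same fixed inputs in each container.
training_out = [0.91, 0.12, 0.55]
serving_ok = [0.91, 0.12, 0.55000001]
serving_drifted = [0.91, 0.13, 0.55]

print(predictions_match(training_out, serving_ok))
print(predictions_match(training_out, serving_drifted))
```

In CI, the two lists would come from running the same saved inputs through the training and serving containers, and a failed match would block the deployment.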
Key lesson
ML models are sensitive to floating-point library versions in ways that web applications are not.
Pin every dependency — including numpy, scipy, and CUDA toolkit — in both training and serving Dockerfiles.
A prediction consistency test between training and serving environments catches version drift before it reaches production.
The serving Dockerfile must be derived from the same base image as the training Dockerfile, or at minimum pin identical dependency versions.
pip freeze output should be captured as image metadata for post-deployment auditability.
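Once pip freeze output is captured for both images, version drift can be detected mechanically. This is a minimal sketch that only understands `name==version` lines; `parse_freeze`, `version_drift`, and the sample freeze texts are hypothetical.

```python
def parse_freeze(text):
    """Parse `pip freeze` output into {package: version}, skipping other lines."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            versions[name.lower()] = version
    return versions


def version_drift(training_freeze, serving_freeze, packages=None):
    """Return {package: (training_version, serving_version)} for mismatches.

    By default only packages present in both environments are compared,
    since the serving image is expected to omit training-only tools.
    """
    training = parse_freeze(training_freeze)
    serving = parse_freeze(serving_freeze)
    names = packages if packages is not None else training.keys() & serving.keys()
    return {
        name: (training.get(name), serving.get(name))
        for name in sorted(names)
        if training.get(name) != serving.get(name)
    }


training_freeze = "numpy==1.24.3\nscipy==1.10.1\njupyter==1.0.0\n"
serving_freeze = "numpy==2.0.0\nscipy==1.10.1\n"

print(version_drift(training_freeze, serving_freeze))
```

Passing an explicit `packages` list makes a package that is missing from one side show up as a mismatch with `None`.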
Production debug guide
From silent prediction drift to GPU failures: systematic debugging paths.
Symptom · 01
Model container starts but inference is 10-50x slower than expected.
→
Fix
Check if the model is running on CPU instead of GPU. Exec into the container and run: python -c "import torch; print(torch.cuda.is_available())". If False, the NVIDIA Container Toolkit is not configured or --gpus was not passed. Check nvidia-smi on the host to verify GPU availability.
Symptom · 02
Container crashes with CUDA out of memory on a GPU that should have enough VRAM.
→
Fix
Check if multiple containers are sharing the same GPU without memory isolation. Use NVIDIA MPS (Multi-Process Service) or set CUDA_VISIBLE_DEVICES to assign specific GPUs. Check if the model is loading weights in float32 instead of float16; float32 uses 2x the VRAM.
Symptom · 03
Model produces different predictions in Docker than in the training notebook.
→
Fix
Compare dependency versions: docker exec <container> pip freeze vs your training environment. Check numpy, scipy, and CUDA toolkit versions specifically. Run a prediction consistency test with fixed inputs and seeds. Check if the model uses any platform-specific operations (MKL vs OpenBLAS).
Symptom · 04
Docker image is 8GB+ and deploys take 10+ minutes.
→
Fix
Audit image layers: docker history <image>. Check if training dependencies (Jupyter, gcc, test frameworks) are in the serving image. Use multi-stage builds to separate build-time from runtime. Move model weights to a volume or object storage instead of baking them into the image.
Symptom · 05
Health check passes but the model returns errors on actual inference requests.
→
Fix
The health check endpoint may only verify the server is running, not that the model is loaded. Add a health check that runs a dummy inference with a known input and verifies the output shape. Check if the model file was corrupted during the COPY step (large files can fail silently).
Symptom · 06
Container runs out of disk space during inference (large batch processing).
→
Fix
Check if the model writes temporary files (attention caches, intermediate tensors) to the container filesystem. Mount a tmpfs or volume for temporary storage. Set --shm-size for PyTorch DataLoader workers that use shared memory.
Docker ML Model Triage Cheat Sheet
First-response commands when an ML serving container fails in production.
Inference is extremely slow (10x+ slower than expected).
Add model-loaded check to health endpoint. Check container logs for model loading errors: docker logs --tail 100 <container>
Model Weight Delivery Strategies Compared
| Strategy | Image Size | Deployment Speed | Operational Complexity | Best For |
|---|---|---|---|---|
| Bake into image (COPY) | Large (model size + framework) | Slow (full image pull) | Low (self-contained) | Models < 500MB, infrequent updates |
| Named volume (mount) | Small (framework only) | Fast (small image) | Medium (volume management) | Large models, shared across containers |
| S3/GCS download at startup | Small (framework only) | Fast pull + download time | Medium (credentials, retry logic) | Frequent model updates, multi-environment |
| Model registry (MLflow) | Small (framework only) | Fast pull + download time | High (registry infrastructure) | Version tracking, A/B testing, rollback |
Key takeaways
1. Multi-stage builds are essential for ML serving: the training image includes compilers and debugging tools that should never ship to production. The serving image should contain only the framework, the model, and the serving code.
2. GPU access requires the NVIDIA Container Toolkit and the --gpus flag. Without them, ML frameworks silently fall back to CPU with no error. Always add a startup assertion that verifies GPU availability.
3. Model weight delivery strategy determines deployment speed. Baked weights make images large. Volumes and S3 downloads keep images small but add operational complexity. For models >1GB, never bake weights into the image.
4. Production ML serving requires health checks that verify inference capability (not just server liveness), graceful shutdown for in-flight requests, and model warmup to avoid cold-start latency.
5. Pin identical dependency versions (numpy, scipy, CUDA toolkit) in both training and serving Dockerfiles. Version drift causes silent prediction changes that are extremely difficult to debug.
Frequently Asked Questions
01
How do I access GPUs from a Docker container?
Install the NVIDIA Container Toolkit on the host, configure it with nvidia-ctk runtime configure, restart Docker, then run containers with --gpus all. Verify with docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi. In Docker Compose, use deploy.resources.reservations.devices with driver: nvidia.
02
Why is my ML model inference slow inside Docker but fast on the host?
The most common cause is the container not having GPU access. Without --gpus, PyTorch and TensorFlow silently fall back to CPU. Check with docker exec <container> python -c 'import torch; print(torch.cuda.is_available())'. If False, the NVIDIA Container Toolkit is not configured or the --gpus flag is missing.
03
Should I bake model weights into the Docker image?
Only for models under 500MB with infrequent updates. For larger models, mount weights as a volume or download from S3/GCS at startup. Baked weights make every deployment a multi-gigabyte pull — even when only the serving code changed. This adds 10-15 minutes to deployment time across a cluster.
04
How do I handle model versioning with Docker?
Tag images with the model version (my-model:v1.2.3). Use a model registry (MLflow, Weights & Biases) to version weights independently of the serving image. The serving image contains the framework; the startup script pulls the correct model version from the registry. This decouples model updates from image updates.
05
What is the difference between Docker and NVIDIA Triton for model serving?
Docker is a containerization platform — it packages and runs any application. Triton Inference Server is a model serving framework that runs inside a Docker container. Triton provides GPU-optimized inference, dynamic batching, model ensembles, and multi-framework support. You use Docker to containerize Triton, not as an alternative to it.