Docker Image Bloat — 1.2GB Java Image Killed Friday Deploy

📍 Part of: Docker → Topic 14 of 18
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • What Is Optimising Docker Images?
  • Understanding Docker Layers and the Union Filesystem
  • Multi-Stage Builds: The Right Way to Compile and Package
Quick Answer
  • Docker images are built as layers; each filesystem-changing instruction (RUN, COPY, ADD) adds a new read-only layer
  • Layer caching speeds up rebuilds only if earlier layers haven't changed
  • Multi-stage builds separate build-time deps from runtime, cutting image size by up to 90%
  • Slim base images (Alpine, distroless) reduce attack surface and pull time
  • Biggest mistake: installing build tools in the final image — they're never needed at runtime
  • Another overlooked win: use a .dockerignore file to exclude local caches and secrets from the build context
🚨 START HERE

Quick Debug Commands for Docker Image Size

Run these commands when you suspect an image is bloated. No theory — just the commands that find the fat layers.
🟡

Image seems too large

Immediate Action: Inspect layer sizes and the files changed in each layer.
Commands
dive <image>
docker image history --no-trunc <image>
Fix Now: Identify the largest layer, then refactor that RUN instruction or move the work into the build stage of a multi-stage build. Use `docker build --no-cache` to discard stale layers.
🟠

CI builds are slow because image push takes minutes

Immediate Action: Check whether your image has unnecessary layers or a fat base image.
Commands
docker image ls --format '{{.Repository}}:{{.Tag}} {{.Size}}'
docker scout recommendations <image>
Fix Now: Switch to a distroless, slim, or Alpine base. Use multi-stage builds. Squash layers only as a last resort (`--squash` is experimental and not supported by BuildKit).
🟡

Container gets 'not found' errors for shared libraries

Immediate Action: Check whether you used a minimal base that lacks required libs.
Commands
ldd /app/your-binary (or use `docker run --entrypoint ldd <image> /app/binary`)
docker run --entrypoint sh <image> -c 'ls -la /lib/x86_64-linux-gnu/'
Fix Now: Use a base image that includes the necessary libs (e.g., `debian:stable-slim` instead of `alpine` if your app needs glibc). Or run `apk add libc6-compat` on Alpine.
🟡

Security scan reports 200+ CVEs

Immediate Action: Identify which layers introduce the most CVEs.
Commands
docker scout cves <image> --format sarif
docker scout recommendations <image>
Fix Now: Switch to a more minimal base (distroless or slim). Remove packages like curl, vim, wget. Pin base image digests.
Production Incident

How a 1.2GB Java Image Took Down Our Friday Deploy

A simple Spring Boot app weighed 1.2GB because the Dockerfile used a full JDK base, installed Maven, left cached .m2 files, and ran `apt-get` without cleaning up. CI builds took 8 minutes, pulls on prod nodes saturated the network, and a registry storage bill hit $500/month.
Symptom: CI builds taking 8+ minutes, 'docker pull' timing out on AWS EKS nodes, registry costs skyrocketing.
Assumption: We thought "it's just a few MB of extra libraries, no big deal".
Root cause: The Dockerfile had 7 RUN instructions, each adding a layer. One RUN installed Maven and all its dependencies; another left the Maven local repository (.m2) in the image. The base image was openjdk:11-jdk (400MB) instead of a slim JRE. No multi-stage build was used.
Fix: Switched to a multi-stage build: stage 1 used maven:3.8.4-openjdk-11-slim to compile, stage 2 used openjdk:11-jre-slim and copied only the JAR. Added `--link` to COPY so the copied layer stays independent of the layers beneath it and can be reused when the base image changes. Used .dockerignore to exclude the local .m2. Final image size: 118MB.
Key Lesson
  • Use multi-stage builds for any compiled language: separate build tools from runtime.
  • Choose the smallest base image that provides the runtime your app needs (JRE, not JDK).
  • Clean up package manager caches inside the same RUN layer (e.g., apt-get clean).
  • Profile every image with dive or docker scout before pushing to a registry.
  • Pin base image digests, not just tags; a tag change can silently double your image size.
Production Debug Guide

Symptom → Action guide for common image size problems

Image >500MB for a simple web app: Run dive <image> to see per-layer size. Look for layers adding big files like /usr/share/doc, /var/cache/apt, or entire language SDKs.
CI build time increases without code changes: Check if a dependency version changed or a base image tag moved (e.g., :latest). Pin exact base image digests in your Dockerfile.
Container fails with 'exec user process caused: no such file or directory' for a binary that exists, or reports missing shared libraries: You likely switched to a musl-based minimal base (Alpine) but the binary needs glibc, so its dynamic loader is missing. Use gcr.io/distroless/java17-debian11 or add apk add libc6-compat for Alpine.
Security scanner reports high CVEs in image: Scan with docker scout cves <image>. Replace the fat base with distroless or a hardened slim image. Remove unnecessary packages, especially curl, wget, and vim.
Container not starting with 'exec user process caused: exec format error': Your binary was built for the wrong architecture (e.g., x86_64 vs arm64). Run file <binary> on the artifact or docker image inspect --format '{{.Architecture}}' <image> to check, then rebuild for your target platform.
Image size creeps up 5% every month with no code change: Check if a base image tag is floating. Pin to a digest. Also check whether your CI pulls a new builder image each time, since a changed builder invalidates the cache and can pull in newer, larger dependencies.

Every second your CI pipeline spends pushing a bloated Docker image to a registry is a second your deployment is blocked. At scale — dozens of services, hundreds of deploys per day — a 1 GB image versus a 100 MB image isn't a minor aesthetic preference, it's the difference between a 30-second deploy and a 5-minute one. It compounds across your entire fleet, inflates egress costs on AWS/GCP/Azure, and widens your attack surface because every unused package is a potential CVE waiting to be exploited.

The root cause is almost always the same: Dockerfiles written like shell scripts — one giant RUN block, a fat base image chosen for convenience, build tools left behind after compilation, secrets accidentally baked into layers. Docker's union filesystem means every layer is permanent history; you can't 'delete' a file from a previous layer by removing it in a later one — the bytes are still there, just hidden.

By the end of this article you'll be able to diagnose a bloated image using real tooling, rewrite Dockerfiles using multi-stage builds and layer-cache discipline, choose the right minimal base image for your workload, and avoid the production gotchas that catch even experienced engineers off guard. We'll go deep on the internals — because understanding why Docker layers work the way they do is what separates a developer who memorises tricks from one who can solve novel problems.

Here's the hard truth: most teams don't realise how much bloat costs them until their registry bill hits four figures. One team we consulted had a 2.3GB image for a simple Go webserver. After applying the techniques in this article, they dropped it to 12MB. That's not optimisation — that's elimination.

What Is Optimising Docker Images?

At its heart, image optimisation is about understanding Docker's union file system and using it deliberately. Each Dockerfile instruction adds a new layer. The image's total compressed size isn't just the sum of final files — it includes everything that was written in earlier layers, even if later instructions delete or overwrite them. That's the trap most beginner Dockerfiles fall into: they install compilers, download Maven, compile the app, remove the compiler — but the compiler's bytes still live in an older layer, never to be recovered.

The real measure isn't the size you see when you run docker images — that's the uncompressed size. Pull time is based on compressed size, and registry storage costs are typically based on compressed size as well. So optimisation targets both. A 2GB uncompressed image might compress to 700MB, still far too large for a microservice that does nothing but serve HTTP.

Optimisation isn't a one-time thing. It's a discipline: layer ordering, multi-stage builds, base image selection, and cache management. Get it right and your deploys are faster, your registry bill drops, and your attack surface shrinks. Get it wrong and you're paying for every unnecessary megabyte, every day.

One more thing: compressed vs uncompressed matters. That 2GB image may compress to 700MB on push, but when pulled over a 100Mbps link, that's still 56 seconds of network time. Every megabyte has a cost — even if it's not obvious from docker images.
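
The arithmetic above can be checked with a small shell sketch. It assumes the manifest arriving on stdin is the pretty-printed JSON of a single-platform image (for example from `docker manifest inspect`), with one field per line. The function names `compressed_bytes` and `pull_seconds` are made up for illustration.

```bash
# Sum the compressed layer sizes (bytes) from an image manifest on stdin.
# Assumption: pretty-printed JSON, one "size" field per line, and the
# layer entries appear after the "layers" key. The config size is skipped.
compressed_bytes() {
  awk '
    /"layers"/ { in_layers = 1 }
    in_layers && /"size"/ {
      gsub(/[^0-9]/, "", $0)   # keep only the digits of the size value
      total += $0
    }
    END { printf "%.0f\n", total }
  '
}

# Whole seconds needed to move SIZE_MB megabytes over a LINK_MBPS link.
# Ignores latency, decompression, and protocol overhead.
pull_seconds() {
  size_mb=$1
  link_mbps=$2
  echo $(( size_mb * 8 / link_mbps ))
}
```

For example, `docker manifest inspect myapp:latest | compressed_bytes` prints the compressed layer total in bytes, and `pull_seconds 700 100` prints 56, matching the estimate above. A multi-platform tag returns a manifest list that has no layers, so the sum is 0 and a per-platform manifest has to be inspected instead.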

Here's a nuance most guides miss: layer deduplication across images. If you have ten microservices all built on debian:stable-slim, each pulls the same base layer once on a node. But if each uses a different apt-get install in the first RUN layer, those layers aren't shared. That's why keeping common dependencies in a shared base image saves both build time and node storage.

Dockerfile.simple-bloated · DOCKERFILE
# A naive Dockerfile that wastes space
FROM ubuntu:latest
RUN apt-get update
RUN apt-get install -y build-essential curl wget
RUN curl -fsSL https://deb.nodesource.com/setup_18.x | bash -
RUN apt-get install -y nodejs
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "server.js"]
📊 Production Insight
A bloated image doesn't just waste storage — it adds minutes to every deploy and hides security vulnerabilities in unused packages.
The first step to optimisation is visibility: use dive to inspect each layer's filesystem.
Another reality check: a 2 GB image that's 90% unused packages is a ticking time bomb for PCI or SOC2 audits.
But also: the compressed size matters for CI pull times, not just storage — a 2GB uncompressed image may compress to 700MB, but that's still 700MB every pull.
Pro tip: track compressed size, not uncompressed. Get it from docker manifest inspect or regctl.
🎯 Key Takeaway
Every Docker image is a stack of layers. The goal is to minimise the total size by reducing both the number of layers and the weight of each layer.
Optimisation starts with measurement.
If you can't measure it, you can't shrink it.
Always use dive before pushing to a registry.

Understanding Docker Layers and the Union Filesystem

Docker images are built from a series of read-only layers. Instructions that change the filesystem (RUN, COPY, ADD) each create a new layer; instructions such as ENV, CMD, and EXPOSE only record metadata. The overlay2 storage driver stacks these layers and presents them as a single filesystem. This is why deleting a file in a later layer doesn't reduce image size: the file still exists in an underlying layer.

Understanding this mechanism is the key to writing efficient Dockerfiles. A layer is reused from cache as long as its instruction text and inputs (e.g., the files being copied) haven't changed. Once one instruction misses the cache, every instruction after it is rebuilt, so a badly placed instruction can throw away most of the cache.

The point: place instructions that change infrequently (like installing packages) early, and instructions that change with every code change (like COPY . /app) as late as possible.

But there's a deeper trap: RUN rm -rf /var/cache/apt in a separate instruction doesn't remove those files from the previous layer. The files are still there in the layer stack, just hidden. That's why you must combine apt-get install and apt-get clean in the same RUN instruction using shell operators. Every byte you clean inside the same layer is actually gone. Every byte you clean in a later layer is still costing you.

Here's a real-world number: a single RUN with apt-get install -y build-essential && apt-get clean saves about 30MB compared to splitting it into two RUN commands. That 30MB per layer adds up fast when you have 5-10 layers.

Want a mental model? Each layer is like a delta snapshot in Git. If you add a file in commit A and delete it in commit B, the blob still exists in the object store. Docker is the same — docker history shows every layer's bytes.
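
To see which layers carry the bytes, the history output can be ranked by size. This is a minimal sketch that assumes input lines of the form `BYTES<TAB>COMMAND`; `largest_layers` is a hypothetical helper name.

```bash
# Print the N largest layers (default 5), biggest first.
# Input on stdin: one layer per line as "BYTES<TAB>COMMAND".
largest_layers() {
  n=${1:-5}
  sort -t "$(printf '\t')" -k1,1 -n -r | head -n "$n"
}
```

One way to feed it is `docker image history --no-trunc --human=false --format '{{.Size}}\t{{.CreatedBy}}' myapp:latest | largest_layers 3`. A layer that only runs `rm` shows up near zero bytes, while the layer that installed the files keeps its full size.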

Dockerfile.layer-order · DOCKERFILE
# Good layer order: stable deps first, code last
FROM node:18-alpine
WORKDIR /app

# Install dependencies (changes rarely)
COPY package.json package-lock.json ./
RUN npm ci --only=production

# Copy application code (changes frequently) — invalidates cache only from here
COPY . .

CMD ["node", "server.js"]
Mental Model
Mental Model: Layer Stack
Think of each layer like a transparency sheet — you can only add new sheets on top, never remove something from a sheet below.
  • Each Dockerfile instruction adds a new read-only layer on top of the previous ones.
  • If you install a package in one layer and remove it in the next, the package still exists in the lower layer — the image stays large.
  • Use multi-stage builds to copy only the final artifact into a fresh, clean layer stack.
  • Combine cleanup commands into the same RUN instruction to avoid wasting bytes.
📊 Production Insight
Placing frequently changing instructions early (like COPY .) kills caching and forces all downstream layers to rebuild.
Always structure your Dockerfile with stable instructions first (APK/APT installs, dependency downloads), then code copy at the end.
Pro tip: use BuildKit's --cache-from to reuse layers from previous builds — it's a game changer for CI pipelines.
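Squashing can shrink an image whose layers overwrite or delete each other's files, which is sometimes useful for AWS Lambda container images. The cost is a single large layer that cannot be shared with other images.
But beware: docker build --squash is an experimental flag of the legacy builder and is not supported by BuildKit; a multi-stage build that copies the final files into a fresh stage achieves the same result.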
🎯 Key Takeaway
Layer order determines cache effectiveness.
Put stable instructions first, code at the end.
Never delete files in a later layer — use multi-stage to avoid carrying them at all.
Use docker build --squash sparingly — it breaks caching.
Should You Squash Layers?
If: You need a single-layer image for minimal size (e.g., for AWS Lambda or distroless configs)
Use: A multi-stage build that copies only the final files into a fresh base, or docker build --squash (experimental, legacy builder only).
If: You want to preserve cacheability and share common base layers across services
Use: Do not squash. Keep layers separate to reuse cached base layers in CI pipelines.

Multi-Stage Builds: The Right Way to Compile and Package

The single most effective technique to reduce image size is multi-stage builds. Instead of using one Dockerfile that compiles your application and then runs it — leaving all build tools, source code, and intermediate artifacts in the final image — you split the process into two or more stages.

Stage 1 (build stage): Use a full SDK base image, install all build dependencies, compile your application. Stage 2 (runtime stage): Use a minimal base image (e.g., distroless, Alpine slim, or JRE-slim) that contains only the runtime necessary to execute your compiled artifact. Then copy only the compiled output (e.g., JAR, binary) from the build stage.

The syntax uses FROM ... AS alias, and COPY --from=alias to grab files from an earlier stage.

This pattern eliminates build-time dependencies, reduces image size dramatically, and also improves security because the final image contains only what's needed at runtime.

One pattern that catches people out: copying the entire /build directory instead of just the artifact. If your static files are in /build/static but you also have node_modules in /build, they'll all come along. Be precise with your COPY paths. For Go apps, copy only the single binary. For Java, copy only the *.jar. For Python, you might need to copy the entire site-packages, but you can control it with a virtualenv.

A common gotcha: each COPY --from instruction still adds a layer. A single COPY can take several source paths from the same stage, but files that come from different stages or go to different destinations need separate instructions. Accept the overhead; it is still far smaller than shipping the build stage.

Real example: a Java team at a fintech used multi-stage and dropped their image from 1.8GB to 145MB. The build stage contained Maven, JDK, and all source; the runtime stage had only the JRE and the fat JAR. Their deploy time dropped from 8 minutes to 45 seconds.

Dockerfile.multistage · DOCKERFILE
# Stage 1: Build
FROM maven:3.8.4-openjdk-11-slim AS builder
WORKDIR /build
COPY pom.xml .
# Resolve dependencies in their own layer so they stay cached until pom.xml changes
RUN mvn dependency:go-offline
COPY src ./src
RUN mvn package -DskipTests

# Stage 2: Runtime
FROM openjdk:11-jre-slim
WORKDIR /app
COPY --from=builder /build/target/myapp.jar ./app.jar
EXPOSE 8080
CMD ["java", "-jar", "app.jar"]
⚠ Common Pitfall: Copying Too Much from Builder
Don't copy the entire /build directory from the builder stage. That includes source code, compiled test classes, and intermediate artifacts. Always copy only the exact file (JAR, binary) that your runtime needs.
📊 Production Insight
Multi-stage builds can reduce image size by 85-95% compared to single-stage builds.
Without --link (BuildKit), a COPY layer depends on the layers beneath it, so a changed base image or earlier stage forces the copy to be redone. With COPY --link the copied files form an independent layer that can be reused when the base changes.
Another trap: copying the whole /build directory instead of the specific artifact drags in test jars, cached .class files, even source code.
Be precise: COPY --from=builder /build/target/*.jar /app/ copies only the JARs and excludes everything else.
For Go static binaries, copy a single file — no need to include any runtime libs.
🎯 Key Takeaway
Use multi-stage builds for every compiled language.
Separate build environment from runtime environment.
Copy only the final artifact — nothing else.
Use a precise path such as COPY --from=builder /build/target/*.jar /app/.

Choosing the Right Base Image: Alpine, Slim, Distroless

The base image you choose sets the lower bound for your final image size and directly influences your attack surface. The trade-off is between size, package availability, and compatibility.

  • Full images (e.g.

Dockerfile.base-examples · DOCKERFILE
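# Three independent runtime-stage fragments. Each COPY --from=builder assumes
# an earlier build stage named "builder" that is not shown here.

# Distroless for security-critical apps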
FROM gcr.io/distroless/java17-debian11
COPY --from=builder /app/target/*.jar /app.jar
CMD ["/app.jar"]

# Slim for general use
FROM node:18-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
CMD ["node", "server.js"]

# Alpine for static binaries
FROM alpine:3.18
RUN apk add --no-cache ca-certificates
COPY --from=builder /app/server /server
CMD ["/server"]
🔥Production Tip
Always pin a specific version tag (or even a digest) for your base image. Using :latest can cause builds to break when the maintainer updates the image.
📊 Production Insight
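Switching from a full Debian tag to its -slim variant (for example, debian:bookworm to debian:bookworm-slim) can reduce base image size by 30-40% without any code changes. Ubuntu's official image does not publish -slim tags.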
But if you need glibc and pick Alpine, you'll face runtime errors. Test thoroughly before switching base images in production.
Also consider distroless if your security team demands zero package managers — but you'll lose the ability to shell into the container for debugging.
Test base image switches in staging for at least a week to catch compatibility issues.
A common trap: FROM node:18-alpine works fine until you add a native npm module whose prebuilt binaries target glibc. Then the install falls back to compiling from source, or the module fails to load at runtime.
🎯 Key Takeaway
Smallest isn't always best. Match the base image's libc and library set to your application's requirements.
Pin versions, test, and scan for CVEs before deploying.
When in doubt, start with a slim Debian-based image — it covers 90% of use cases with minimal risk.
Which Base Image Should You Choose?
If: App depends on glibc (e.g., Java, C++, Python with native extensions)
Use: A slim variant of a Debian-based image (e.g., -slim). Avoid Alpine unless you add libc6-compat and test.
If: App is a statically linked binary (Go with CGO disabled, Rust built for a musl target), or a scripting-language app with pure dependencies
Use: Alpine is a great choice. Small and fast.
If: Security is paramount and you don't need a shell or package manager at runtime
Use: Distroless images from Google.

Layer Cache Optimisation for CI/CD

In a CI pipeline, image rebuilds happen multiple times a day. A well-structured Dockerfile can reuse cached layers from previous builds, cutting build time from minutes to seconds.

The rule: order instructions by frequency of change. Start with system packages (almost never change), then language dependencies (change when you update dependencies), then application code (changes every commit).

Also, use .dockerignore to avoid sending unnecessary files (like .git, node_modules, target) to the Docker daemon — they invalidate the COPY layer.
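
A starter .dockerignore can be generated with a short shell sketch like the one below. The pattern list is an assumption based on the directories named above plus `.m2` and `.env`; `write_dockerignore` is a hypothetical helper, and the entries should be adjusted to the project.

```bash
# Write a starter .dockerignore into DIR (default: current directory).
# The patterns are illustrative; extend or trim them for your project.
write_dockerignore() {
  dir=${1:-.}
  cat > "$dir/.dockerignore" <<'EOF'
# Version control metadata
.git
# Dependency and build output that is recreated inside the image
node_modules
target
.m2
# Secrets and local environment files
.env
EOF
}
```

Running `write_dockerignore .` in the build context root creates the file. Anything listed there is never sent to the builder, so it can neither end up in a layer nor invalidate the `COPY . .` cache entry.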

BuildKit (enabled by default in recent Docker versions) offers additional cache optimization: --cache-from to use remote caches, --mount=type=cache for persistent package caches across builds.

But there's a hidden cost: cache invalidation can be unpredictable. If your CI runner doesn't reuse the Docker cache between builds (e.g., ephemeral runners), you lose all the benefit of layer ordering. In that case, lean on BuildKit's --cache-from pointing to the previous build in your registry. That's the pattern that reduces 5-minute builds to 30 seconds.

Another trick: for monorepos with multiple Dockerfiles, share a common base layer by building a base image containing all system deps, then use FROM base in each service Dockerfile. This saves both build time and registry storage.

One thing that trips up teams: cache mounts (--mount=type=cache) persist on the host. If you're running on ephemeral CI runners like GitHub Actions hosted, they don't persist between runs. You need either a persistent cache volume or --cache-from with a registry.

Pro tip: use --cache-to and --cache-from together to push cache to a registry and pull it on the next build. This works even across different runners.
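
The pairing of the two flags can be sketched as a small helper that prints them for a given cache reference. The reference used here (`registry.example.com/myapp:buildcache`) is a placeholder, and `registry_cache_flags` is a hypothetical function name; `mode=max` is chosen on the assumption that intermediate stages should be cached too.

```bash
# Print buildx flags that read from and write to a registry-backed cache.
# CACHE_REF is a repository:tag reserved for cache data (placeholder value).
registry_cache_flags() {
  cache_ref=$1
  printf '%s %s\n' \
    "--cache-from=type=registry,ref=$cache_ref" \
    "--cache-to=type=registry,ref=$cache_ref,mode=max"
}
```

A build might then look like `docker buildx build $(registry_cache_flags registry.example.com/myapp:buildcache) -t registry.example.com/myapp:latest --push .`. Depending on the builder driver in use, exporting a registry cache may require a `docker-container` builder rather than the default one.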

Dockerfile.cache-optimised · DOCKERFILE
# Optimised Dockerfile for CI: stable → less stable → code
FROM python:3.11-slim AS base
WORKDIR /app

# Layer 1: System deps (rarely changes)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq-dev gcc \
    && rm -rf /var/lib/apt/lists/*

# Layer 2: Python deps (changes with requirements.txt)
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Layer 3: Application code (changes every commit)
COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]
💡CI Cache Strategy
Use Docker BuildKit's --cache-from to pull a previous build's cache from a registry. This speeds up remote builds dramatically.
📊 Production Insight
A poorly ordered Dockerfile can cause every CI build to take 5+ minutes because package installation is repeated on every commit.
Reorder your instructions once and save hundreds of hours across the team annually.
Don't forget .dockerignore — one missing line there can invalidate the COPY layer on every build, killing your cache strategy.
Use BuildKit's --cache-from for remote CI runners — it's the key to consistent cache reuse.
If you use ephemeral runners, consider using a remote Docker cache in an S3-compatible bucket.
🎯 Key Takeaway
Layer caching is the single biggest lever for reducing CI build time.
Put instructions that change least often first.
Always include a .dockerignore.
And enable BuildKit for advanced cache features.
Pin base image digests to avoid surprise cache invalidations from updated tags.
Should You Use BuildKit Cache Mounts?
If: Your CI pipeline rebuilds images frequently and package downloads are the bottleneck
Use: Add --mount=type=cache for your package manager (npm, pip, apt). It persists downloads across builds without bloating the image.
If: You need to ensure a clean build every time (e.g., security-sensitive environments)
Use: Avoid cache mounts. Use fresh downloads each build to guarantee correct dependencies.

Production Monitoring: Tracking Image Size Over Time

Image size tends to creep up over time as developers add new dependencies, install debugging tools for troubleshooting, or forget to clean up temporary files. Monitoring image size as a CI metric helps catch bloat before it reaches production.

You can gate builds in CI with a short script that reads the image size and fails above a threshold. Tools such as docker scout and dive can add CVE and wasted-space checks to the same step. Also, use docker image history to track the impact of each Dockerfile change.

Another approach: maintain a Dockerfile.sizelimit or use external tools like regctl to query registry manifests and track image size across tags.

One team we worked with added a simple CI step that compares new image size against the previous tag and fails if it increased by more than 5%. That single check caught three regressions in the first month, each caused by a developer adding a debugging library they forgot to remove.
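
A check of that kind might look like the sketch below. It compares two sizes in bytes and treats anything above the allowed percentage as a regression; the function name `size_regressed` and the 5% default are illustrative, and how the two sizes are obtained is left to the pipeline.

```bash
# Succeed (exit 0) when NEW_BYTES exceeds OLD_BYTES by more than MAX_PCT percent.
# Integer arithmetic only, so it works for sizes given in bytes.
size_regressed() {
  old_bytes=$1
  new_bytes=$2
  max_pct=${3:-5}
  [ $(( new_bytes * 100 )) -gt $(( old_bytes * (100 + max_pct) )) ]
}
```

In a pipeline this could be used as `if size_regressed "$OLD" "$NEW" 5; then echo "image grew by more than 5%"; exit 1; fi`, with `$OLD` and `$NEW` taken from `docker image inspect --format '{{.Size}}'` on the previous and current tags.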

Pro tip: store size metrics in a time-series database (e.g., InfluxDB) and graph them on a dashboard. A weekly trend that shows +2% every week means you'll hit your budget in 25 weeks — but you'll only notice when the deploy fails.

I've also seen teams use GitHub Actions to post a comment on every PR comparing the new image size to the base branch. That transparency alone stops bloat — nobody wants to see 'Image increased by 45%' on their PR.

ci-size-check.sh · BASH
# Check image size in CI.
# docker image inspect reports the local (uncompressed) size in bytes.
IMAGE="myapp:latest"
MAX_SIZE_MB=200
SIZE_BYTES=$(docker image inspect --format '{{.Size}}' "$IMAGE") || exit 1
SIZE_MB=$(( SIZE_BYTES / 1024 / 1024 ))
if (( SIZE_MB > MAX_SIZE_MB )); then
  echo "Image size ${SIZE_MB} MB exceeds limit ${MAX_SIZE_MB} MB"
  exit 1
fi
echo "Image size OK: ${SIZE_MB} MB"
💡Historical Size Tracking
Use regctl image digest and regctl image manifest to pull image sizes from the registry without pulling the whole image. Perfect for CI checks.
📊 Production Insight
Teams often ignore image size until the registry bill shocks them. Set a hard limit in CI and alert when it's exceeded.
Track image size per tag over time — it's a leading indicator of dependency bloat.
Pro tip: use docker scout to also track the number of CVEs per image — it correlates with size.
Consider a dashboard showing image size trend per service over time, with alerts for >10% weekly increase.
Automate the measurement — manual checks don't scale.
🎯 Key Takeaway
Treat image size as a performance metric.
Monitor it in CI, set thresholds, and fail builds that exceed them.
Use tools like dive and docker scout for deep inspection.
Set a size budget for each service and enforce it.

Security Implications of Bloated Images

Every unnecessary package in a Docker image is a potential entry point for attackers. A fat base image like ubuntu:latest includes thousands of binaries, many of which have known CVEs. Even if your app doesn't use them, they're still in the container and exploitable if an attacker gains access.

Distroless images remove most of this surface: no shell, no package manager, no utilities. That also makes debugging harder, because there is no shell to exec into. A compromised distroless container is harder to exploit because the attacker lacks basic tools like curl, wget, or bash.

Another angle: multi-stage builds reduce the attack surface by leaving build tools (e.g., compilers, debuggers) in the builder stage. The final image only contains what's needed to run the app.

Consider this real example: a team had a Node.js image with curl, wget, vim, and netcat installed. An attacker who got a shell via a vulnerable Express route had immediate internet access and lateral movement tools. Switching to distroless for the final image removed all those utilities: an attacker who achieves code execution now finds no shell, no curl, and no wget. It's a massive reduction in blast radius.

CVE density: a typical ubuntu:22.04 image has ~200 CVEs at baseline. Switch to distroless and that drops to ~5. Exact counts depend on the scanner and the scan date. Which would you rather deploy to production?

But here's the trade-off: distroless images can't run apt-get update to patch CVEs. You have to rebuild the image with a new base. That's fine for CI but means you can't hotfix a running container. Plan your patch cycle accordingly.

cve-scan.sh · BASH
# Quick CVE scan with Docker Scout
docker scout cves myapp:latest --format sarif > cve-report.json
# dive does not scan for CVEs; in CI mode it fails the build when too much space is wasted
dive --ci --lowestEfficiency=0.9 myapp:latest
🔥Production Security Tip
Run docker scout cves <image> on every image you push to production. If your base image has 200+ CVEs, switch to a slim or distroless variant immediately.
📊 Production Insight
A bloated image with curl and wget installed is a 3-second attack vector after an initial compromise.
Switching to distroless removes these tools but requires changing your debugging workflows — use ephemeral debug containers instead.
Bottom line: every megabyte of unnecessary software is a liability.
Run docker scout cves as a mandatory CI step to catch vulnerable base images early.
Even if you can't go distroless, removing a single package like vim can reduce CVE count by 10%.
🎯 Key Takeaway
Image size is directly linked to attack surface.
Use distroless for security-critical workloads.
Scan every image for CVEs before deployment.
Don't sacrifice security for debugging convenience — invest in proper debugging tools.
Remove all unnecessary utilities from final images.

Advanced Layer Caching with BuildKit

Docker's BuildKit (enabled by default since Docker 23.0) offers several cache optimization features that go beyond simple layer ordering.

  • --cache-from: Pull a previous build's cache from a registry. Essential for remote CI runners where local cache doesn't persist.
  • --mount=type=cache: Persist package manager caches (like apt, npm, pip) across builds without including them in the final image. Reduces network downloads dramatically.
  • COPY --link: Copies files without creating a new layer that depends on the previous layer. Improves cache sharing between build stages.
  • Cache mounts: Mount a scratch directory for build artifacts like .m2 for Maven or node_modules for npm, which are kept across builds but not in the final image.

Using these features requires minimal Dockerfile changes but yields significant speedups in CI.

But there's a subtlety with cache mounts: they persist across builds on the same host. If you're using ephemeral CI runners (like GitHub Actions hosted runners), cache mounts give no benefit because the runner's filesystem is fresh each time. In that scenario, only --cache-from with a remote registry works. Know your CI environment before investing in one pattern over the other.

A real-world benchmark: a Node.js app with npm ci took 45 seconds per build without cache mounts. Adding --mount=type=cache,target=/root/.npm dropped that to 8 seconds on the second build — a 5.6x improvement. On a hundred builds per day, that's 62 minutes saved.

Dockerfile.buildkit-cache · DOCKERFILE
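# syntax=docker/dockerfile:1.4
FROM node:18-alpine
WORKDIR /app

# Copy the manifests first so npm ci has a lockfile to install from
COPY package.json package-lock.json ./

# Reuse npm's download cache across builds without storing it in the image
RUN --mount=type=cache,target=/root/.npm \
    npm ci --only=production

COPY . .
CMD ["node", "server.js"]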
💡BuildKit Syntax Hint
Add # syntax=docker/dockerfile:1.4 at the top of your Dockerfile to enable the latest BuildKit features. Without it, --mount and --link may not work.
📊 Production Insight
Using BuildKit cache mounts can reduce npm install time from 45 seconds to 8 seconds on repeated CI builds.
But beware: cache mounts persist across builds, so you might accidentally reuse a broken cache. Clear it periodically with docker builder prune --filter type=exec.cachemount.
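Also, --cache-from only helps for layers that were actually exported. With the default mode=min, only the final stage's layers are exported; set mode=max on --cache-to so the intermediate stages of a multi-stage build are cached as well.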
Avoid cache mounts on ephemeral CI runners; they provide no benefit.
For monorepos, use --mount=type=cache carefully — it can cross-contaminate packages across projects.
🎯 Key Takeaway
BuildKit's cache mounts and remote cache are force multipliers for CI speed.
Enable the latest Dockerfile syntax (1.4+).
Use --cache-from with a remote registry for persistent caching in CI pipelines.
Clear cache mounts periodically to avoid corruption.
Use COPY --link to improve layer cache sharing between stages.

Image Size Governance: Setting Budgets and Automating Checks

Image size is a performance and cost metric that deserves the same attention as latency or error rates. Without a budget, bloat creeps in silently. The fix: set per-service size limits and enforce them in CI.

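Start by establishing a baseline: measure the current size of every image using docker scout quickview or docker images --format. Note that docker images reports the uncompressed on-disk size, while the registry stores and transfers the smaller compressed size, so pick one and use it consistently. Then set a reduction target (e.g., 20% smaller) as the initial budget. Store budgets in a YAML file committed to your repository.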
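A minimal sketch of what such a budget file could look like is shown below. The `budgets` / `image` / `max_mb` layout is inferred from the yq query in the CI script later in this section. The image names, limits, and helper function names are hypothetical. The awk lookup is only a fallback for environments without yq.

```shell
#!/bin/sh
# Write a sample budget file and look up one entry.
# All image names and limits here are made-up examples.
write_sample_budgets() {
  cat > "$1" <<'YAML'
budgets:
  - image: myapp:latest
    max_mb: 150
  - image: ml-inference:latest
    max_mb: 900
YAML
}

# budget_for IMAGE FILE: print max_mb for IMAGE (simple parser for this exact layout).
budget_for() {
  awk -v img="$1" '
    $1 == "-" && $2 == "image:" { current = $3 }
    $1 == "max_mb:" && current == img { print $2; exit }
  ' "$2"
}

BUDGETS="${TMPDIR:-/tmp}/docker-image-budgets.yaml"
write_sample_budgets "$BUDGETS"
budget_for "myapp:latest" "$BUDGETS"
# prints: 150
```

Because the file lives in version control, every budget change goes through the same review as a code change.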

In CI, use a script that builds the image, extracts its compressed size, compares it against the budget, and fails if exceeded. Integrate with regctl to query historical sizes from the registry without pulling the entire image.

For advanced governance, enforce that any PR that increases image size by more than 10% requires a review. Use Docker Scout policies to automatically scan for excessive CVEs or size regressions.
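One way to express the 10% rule is a small comparison helper that a PR check could call with the base-branch size and the PR size. This is a sketch: `size_regression` is a hypothetical name, and how you obtain the two sizes depends on your registry tooling.

```shell
#!/bin/sh
# size_regression BASE_MB NEW_MB MAX_PCT
# Returns 0 (true) when NEW_MB exceeds BASE_MB by more than MAX_PCT percent.
size_regression() {
  base_mb="$1"; new_mb="$2"; max_pct="$3"
  # Integer math: compare new*100 against base*(100+max_pct) to avoid rounding.
  [ $(( new_mb * 100 )) -gt $(( base_mb * (100 + max_pct) )) ]
}

if size_regression 200 230 10; then
  echo "Image grew by more than 10%: review required"
fi
```

Growth of exactly 10% passes; only larger increases trip the gate.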

One team we know saved $3000/month in ECR costs just by adding a size gate. The gate caught two instances where a developer accidentally added a 200MB debug image into their base. Without the gate, that cost would have run indefinitely.

Another approach: use docker manifest inspect in a cron job to check sizes of images in the registry and alert if any exceed the budget. That catches bloat that slipped through CI (e.g., if someone pushed manually).
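A hedged sketch of the size calculation for such a job: for a single-platform image, `docker manifest inspect` returns JSON with a `layers` array, and summing each layer's `size` gives the compressed size in bytes. The helper name `sum_layer_mb` is hypothetical and it assumes jq is installed. Multi-platform tags return a manifest list without `layers`, so point the command at a platform-specific digest in that case.

```shell
#!/bin/sh
# sum_layer_mb: read an image manifest (JSON) on stdin and print the
# total compressed layer size in whole megabytes. Requires jq.
sum_layer_mb() {
  jq '[.layers[].size] | add / 1048576 | floor'
}

# Example usage in a cron job (image reference is a placeholder):
#   docker manifest inspect registry.example.com/myapp:latest | sum_layer_mb
```

Compare the printed value against the budget and send an alert when it is exceeded.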

image-size-check.sh · BASH
#!/bin/bash
# Image size governance check for CI
# Reads budget from docker-image-budgets.yaml
set -euo pipefail

IMAGE="myapp:latest"
BUDGET_FILE="docker-image-budgets.yaml"

# Get compressed size from registry (without pulling), converted from bytes to MB
COMPRESSED_SIZE=$(regctl image manifest "$IMAGE" --format '{{.Size}}' | awk '{printf "%.0f", $1 / 1048576}')
# Escape the inner quotes so yq receives a valid string comparison
ALLOWED=$(yq eval ".budgets[] | select(.image == \"$IMAGE\") | .max_mb" "$BUDGET_FILE")

# Fail closed if the image has no budget entry
if [ -z "$ALLOWED" ]; then
  echo "No budget defined for $IMAGE in $BUDGET_FILE" >&2
  exit 1
fi

if [ "$COMPRESSED_SIZE" -gt "$ALLOWED" ]; then
  echo "Image $IMAGE is $COMPRESSED_SIZE MB, budget is $ALLOWED MB: FAIL"
  exit 1
fi
echo "Image $IMAGE size OK: $COMPRESSED_SIZE MB"
⚠ Set a Realistic Baseline First
Don't set an arbitrary budget without understanding your current state. Measure all images for a week, then set the budget to 80% of the average. Tighten over time.
📊 Production Insight
Teams that enforce size budgets in CI catch regressions before they reach production.
One team saw a 40% size creep over three months simply because nobody was watching.
Automated enforcement forces developers to think about every new dependency.
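Cost savings: a 200MB reduction per image for 10 services deployed 50 times/day removes roughly 3TB of transfer per month for each node that pulls every deploy. Depending on node count and egress pricing, that can save ~$1500/month in egress alone.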
Integrate size checks with Slack alerts so the team sees bloat immediately.
Don't just check at build time — also run nightly scans on registry images for drift.
🎯 Key Takeaway
Image size is a metric, not a preference.
Set a budget based on current baselines.
Enforce it in CI — fail builds that exceed the limit.
Monitor trends over time to catch gradual bloat.
If you don't measure it, bloat creeps in.
Use regctl for fast registry queries without pulling images.
Per-Service vs Global Budgets?
If: Services have wildly different needs (e.g., ML model vs Go binary)
Use: Per-service budgets defined in a YAML file tracked in version control. Each service has a unique limit.
If: All services are similar (e.g., microservices with comparable stacks)
Use: A global limit that applies to all. Deviations indicate an outlier that needs review.

Practical Refactoring Workflow: From Bloated to Lean Dockerfile

Let's walk through a real-world refactoring. Start with a typical bloated Dockerfile:

```
FROM ubuntu:latest
RUN apt-get update
RUN apt-get install -y curl wget vim git build-essential
RUN apt-get install -y python3 python3-pip
RUN curl -sSL https://sh.rustup.rs -o rustup.sh
RUN sh rustup.sh -y
RUN git clone https://github.com/someapp
WORKDIR /someapp
RUN cargo build --release
RUN cp target/release/someapp /usr/local/bin/
RUN rm -rf /someapp ~/.cargo
CMD ["someapp"]
```

Problems: No multi-stage, fat base, unnecessary packages (vim, git), cleanup in separate RUN layers, no .dockerignore, using :latest.

Fixes:

  • Stage 1 (builder): FROM rust:1.65-slim AS builder. Install only the necessary build deps, then compile.
  • Stage 2 (runtime): FROM debian:bookworm-slim with only the runtime libs, if any are needed. Otherwise use FROM scratch and copy in a statically linked binary. (Ubuntu does not publish a -slim tag.)
  • Add a .dockerignore: .git, target, *.md
  • Pin base image digests.
  • Combine apt-get install and cleanup in one RUN.

After refactor: 1.2GB → 12MB for a Go/Rust static binary, or ~50MB if using glibc runtime. That's a 95-99% reduction.

Make this process a standard checklist in your team's Dockerfile review template. After a few months, it becomes second nature.

I've seen teams automate this refactoring with a script that runs dive, identifies top layers by size, and suggests a multi-stage pattern. You don't need to do it manually every time — build a tool once.
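A starting point for such a tool, sketched with plain `docker image history` output rather than dive: the helper below keeps the largest layers from a list of "size command" lines. `top_layers` is a hypothetical name, and it relies on `sort -h` (human-readable numeric sort), which is available in GNU coreutils.

```shell
#!/bin/sh
# top_layers [N]: read "SIZE COMMAND" lines on stdin and print the N largest.
# Relies on sort -h to order sizes such as 12MB and 1.1GB correctly.
top_layers() {
  n="${1:-3}"
  sort -rh | head -n "$n"
}

# Example usage (image name is a placeholder):
#   docker image history --no-trunc --format '{{.Size}} {{.CreatedBy}}' myapp:latest | top_layers 3
```

The instructions behind the top entries are usually the first candidates to move into a builder stage.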

Dockerfile.refactored · DOCKERFILE
# Refactored with multi-stage and minimal base
# Stage 1: Build
FROM rust:1.65-slim AS builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
# Cargo needs a build target to parse the manifest, so add a placeholder
# main.rs, then fetch dependencies into a cacheable layer
RUN mkdir src && echo 'fn main() {}' > src/main.rs && cargo fetch --locked
COPY src ./src
RUN cargo build --release

# Stage 2: Runtime (using distroless for security)
# Use 'scratch' instead if the binary is fully static
FROM gcr.io/distroless/cc
COPY --from=builder /app/target/release/someapp /usr/local/bin/someapp
CMD ["/usr/local/bin/someapp"]
🔥Refactoring Checklist
1. Identify if app is compiled (need multi-stage) or interpreted (can go alpine/distroless).
2. Remove all build tools from final image.
3. Combine cleanup into same RUN layer.
4. Pin base image digests.
5. Add .dockerignore.
6. Test thoroughly.

🎯 Key Takeaways

    Naren Founder & Author

    Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
