Jenkins Kubernetes Deployment: Build Agents That Survive Production Chaos
Deploy Jenkins on Kubernetes with battle-tested patterns: avoid pod eviction, config drift, and plugin hell.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
Deploy Jenkins on Kubernetes using the official Helm chart with persistent volumes for /var/jenkins_home, configure the Kubernetes plugin to spin up agent pods from custom images, and set resource limits to prevent noisy neighbor builds from starving the cluster.
Imagine you run a food truck (Jenkins) that needs a kitchen to cook orders (build jobs). Instead of owning a fixed kitchen, you rent space in a shared commercial kitchen (Kubernetes). When an order comes in, you grab a prep station (agent pod), cook, clean up, and leave. If the kitchen gets busy, you can rent more stations automatically. But if you leave dirty dishes (build artifacts) or take too long, the kitchen manager (Kubernetes scheduler) kicks you out.
You've containerized everything except your CI/CD pipeline. That's like building a race car and parking it in a horse stable. Running Jenkins on bare metal or VMs in 2024 is a self-inflicted wound—you're managing pets when you should be managing cattle. The Kubernetes plugin promises dynamic agents, but the default setup will burn you with pod evictions, config drift, and plugin incompatibilities at 3 AM. I've seen a payment service go down because a Jenkins agent pod consumed all node memory during a parallel test run. By the end of this, you'll deploy Jenkins on Kubernetes with production-hardened configurations that survive node failures, resource contention, and plugin upgrades without waking you up.
Why Not Just Run Jenkins on a VM? The Hidden Costs
Before Kubernetes, Jenkins ran on a single VM. That VM was a pet—you patched it, backed it up, and prayed it didn't die. Scaling meant cloning the VM and manually configuring agents. The real cost wasn't the VM; it was the operational overhead of maintaining a stateful CI server. With Kubernetes, you get self-healing, auto-scaling, and declarative config. But the trade-off is complexity: you now manage persistent volumes, pod networking, and plugin compatibility with container images. If your team has fewer than 10 developers, a simple Docker Compose setup might be faster. But for any serious CI/CD, Kubernetes is the only sane choice.
The Kubernetes Plugin: Dynamic Agents Done Right (and Wrong)
The Kubernetes plugin lets Jenkins spin up agent pods on demand. The default config is a trap: it uses the 'jnlp' image that's huge (800MB+) and slow to start. Worse, it doesn't clean up idle pods, so you pay for resources you don't use. The fix: use a custom agent image with only the tools you need (e.g., maven, docker, kubectl). Set 'Pod Retention' to 'Never' to delete pods after build. And always set resource requests and limits. Without them, your builds can starve the cluster. I've seen a team's entire dev namespace get evicted because a Jenkins agent pod consumed all CPU with no limits.
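Here is a minimal sketch of an agent pod definition with explicit requests and limits, built as a Python dict and printed as JSON (which is also valid YAML, so it can be pasted into a pod template). The image name, label, and resource values are hypothetical placeholders, not recommendations. The container is named jnlp because the Kubernetes plugin treats a container with that name as the agent container.

```python
import json


def agent_pod_manifest(image, cpu_request, mem_request, cpu_limit, mem_limit):
    """Build a pod manifest for a build agent with explicit requests and limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # Hypothetical label, only used here to make agent pods easy to find.
        "metadata": {"labels": {"role": "jenkins-agent"}},
        "spec": {
            # Build pods are one-shot: do not restart a failed agent container.
            "restartPolicy": "Never",
            "containers": [
                {
                    # A container named "jnlp" replaces the plugin's default agent.
                    "name": "jnlp",
                    "image": image,
                    "resources": {
                        "requests": {"cpu": cpu_request, "memory": mem_request},
                        "limits": {"cpu": cpu_limit, "memory": mem_limit},
                    },
                }
            ],
        },
    }


# Hypothetical custom image that contains only the tools the builds need.
manifest = agent_pod_manifest(
    "registry.example.com/ci/maven-agent:1.0", "500m", "1Gi", "2", "2Gi"
)
print(json.dumps(manifest, indent=2))
```

Keeping requests and limits as function arguments makes it harder to ship a pod template that silently has neither.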
Persistent Volumes: Don't Lose Your Jenkins Home
Jenkins stores everything in /var/jenkins_home: jobs, configs, build logs, plugin binaries. If that disappears, you lose your CI history. On Kubernetes, you must attach a PersistentVolumeClaim (PVC) to the controller pod. Use a ReadWriteOnce volume with a backup strategy. I've seen teams use emptyDir for 'ephemeral' Jenkins, then wonder why all jobs vanished after a pod restart. The gotcha: dynamically provisioned volumes default to a Delete reclaim policy, so if your PVC is deleted, the data is gone. Always set persistentVolumeReclaimPolicy: Retain on the PV (or reclaimPolicy: Retain on the StorageClass that provisions it). And for god's sake, back up /var/jenkins_home to S3 or GCS daily.
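A rough sketch of the two objects involved, again as Python dicts that serialize to JSON manifests. The claim name, size, and storage class name are assumptions for illustration. The second dict is the body you would pass to kubectl patch pv to switch an existing volume to Retain.

```python
import json


def jenkins_home_claim(name, size, storage_class):
    """Build a ReadWriteOnce PVC manifest for the controller's /var/jenkins_home."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def retain_patch():
    """Patch body that sets the reclaim policy of an existing PV to Retain."""
    return {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}


# Hypothetical names and size; adjust to your cluster.
claim = jenkins_home_claim("jenkins-home", "50Gi", "fast-retain")
print(json.dumps(claim, indent=2))
print(json.dumps(retain_patch()))
```

Retain only protects the volume from deletion. It is not a backup, so the daily copy to object storage still applies.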
Resource Limits: Stop Noisy Neighbor Builds
Without resource limits, a single build can consume all node memory and trigger OOM kills of other pods. The Kubernetes plugin lets you set requests and limits per agent pod template. But here's the catch: if you set limits too low, builds fail with 'Container killed by OOM killer'. Too high, and you waste cluster capacity. The rule of thumb: set requests to the average usage, limits to the peak. Monitor with kubectl top pods. I once debugged a build that failed intermittently—turns out the agent pod had no memory limits and was sharing a node with a memory-hungry database. Set limits, and always test with your heaviest build.
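The rule of thumb above (requests at the average, limits at the peak) can be sketched as a small helper. The sample values are made up, and the headroom parameter is my own addition: it defaults to 1.0 so the limit equals the observed peak, and you can raise it if your heaviest build was not in the sample.

```python
from statistics import mean


def size_from_samples(samples_mib, headroom=1.0):
    """Derive memory request and limit (in Mi) from observed usage samples.

    request = average usage, limit = peak usage * headroom.
    """
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    request = round(mean(samples_mib))
    limit = round(max(samples_mib) * headroom)
    return {
        "requests": {"memory": f"{request}Mi"},
        "limits": {"memory": f"{limit}Mi"},
    }


# Hypothetical memory readings (Mi) collected from kubectl top pods during builds.
resources = size_from_samples([800, 1000, 1200, 2000])
print(resources)
```

The output is only as good as the samples, so collect them while your heaviest build is running.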
Plugin Management: The Dependency Hell You Inherit
Jenkins plugins are a mess. They depend on each other, and version mismatches cause startup failures. On Kubernetes, you can't just SSH in and fix plugins; you rebuild the image. The solution: use a custom controller image with pinned plugin versions. The official Jenkins image lets you install plugins at build time with jenkins-plugin-cli, which replaced the older install-plugins.sh script. Never install plugins at runtime via the UI, because that creates a snowflake container. I've seen a team's Jenkins fail to start because a plugin update pulled in a newer version of a dependency that broke another plugin. Pin everything.
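"Pin everything" is easy to say and easy to forget, so a check in the image build helps. This sketch assumes a plugins.txt with one plugin-id:version entry per line and # comments; the version numbers are placeholders, not real releases. It flags entries with no version or a floating one.

```python
FLOATING_VERSIONS = {"latest", "experimental"}


def unpinned_plugins(plugins_txt):
    """Return plugin ids in a plugins.txt body that are not pinned to a version."""
    unpinned = []
    for raw in plugins_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, version = line.partition(":")
        if not version or version in FLOATING_VERSIONS:
            unpinned.append(name)
    return unpinned


# Placeholder versions for illustration only.
example = """\
kubernetes:1.2.3
git
configuration-as-code:latest  # floating
# a full-line comment
"""
print(unpinned_plugins(example))
```

Failing the image build when this list is non-empty keeps unpinned plugins out of the controller image.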
Run jenkins-plugin-cli --list to check the dependency tree. Always test plugin upgrades in a staging environment.
Configuration as Code: Stop Clicking Around the UI
Jenkins UI configuration is not reproducible. One wrong click and your CI is broken. The Configuration as Code (JCasC) plugin lets you define everything in YAML. On Kubernetes, mount the YAML as a ConfigMap. This way, you can version control your Jenkins config and rebuild from scratch in minutes. The gotcha: some plugins don't support JCasC—you'll need to use the Groovy init script for those. I've seen a team lose a day because they forgot to configure the Kubernetes plugin credentials in JCasC, and agents couldn't connect.
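As a minimal sketch of "mount the YAML as a ConfigMap", the snippet below wraps a tiny JCasC document in a ConfigMap manifest. The ConfigMap name, namespace, and data key are assumptions. The JCasC body only sets the system message and executor count, and you would still need to mount the file and point the CASC_JENKINS_CONFIG environment variable at it.

```python
import json

# A deliberately small JCasC document; real configs are much longer.
JCASC_YAML = """\
jenkins:
  systemMessage: "Configured by JCasC. UI changes will be overwritten."
  numExecutors: 0
"""


def jcasc_configmap(name, namespace, casc_yaml):
    """Wrap a JCasC YAML document in a ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # The key becomes the file name when the ConfigMap is mounted as a volume.
        "data": {"jenkins.yaml": casc_yaml},
    }


# Hypothetical name and namespace.
config_map = jcasc_configmap("jenkins-casc", "jenkins", JCASC_YAML)
print(json.dumps(config_map, indent=2))
```

Setting numExecutors to 0 on the controller fits the rest of this setup, where builds run only on agent pods.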
Run curl -X POST http://jenkins:8080/configuration-as-code/check to validate your YAML. This catches syntax errors before they break the controller.
Security: Don't Leave the Back Door Open
Jenkins on Kubernetes exposes two ports: 8080 (UI/API) and 50000 (agent listener). The agent listener should never be exposed externally—agents connect from inside the cluster. Use a ClusterIP service for 50000. For 8080, use an Ingress with TLS and authentication. The default Jenkins setup has no security—anyone can access the UI. I've seen a company's Jenkins exposed to the internet with no auth, allowing anyone to run arbitrary builds. Always enable security via JCasC and use Kubernetes secrets for credentials.
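To make the port split concrete, here is a sketch of a ClusterIP Service for the agent listener, plus a small check that refuses Service types reachable from outside the cluster. The Service name is hypothetical, and the app=jenkins-controller selector is an assumption about how your controller pod is labeled.

```python
import json

EXTERNAL_TYPES = {"NodePort", "LoadBalancer"}


def agent_listener_service(name, selector):
    """Build a ClusterIP Service that exposes port 50000 inside the cluster only."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            "selector": selector,
            "ports": [{"name": "agent", "port": 50000, "targetPort": 50000}],
        },
    }


def is_externally_exposed(service):
    """True if the Service type makes it reachable from outside the cluster."""
    # A Service without an explicit type defaults to ClusterIP.
    return service["spec"].get("type", "ClusterIP") in EXTERNAL_TYPES


# Hypothetical Service name; the selector must match your controller pod labels.
service = agent_listener_service("jenkins-agent", {"app": "jenkins-controller"})
print(json.dumps(service, indent=2))
print(is_externally_exposed(service))
```

Port 8080 gets its own Service behind the Ingress; keeping the two Services separate means a change to the UI exposure cannot accidentally publish the agent port.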
Monitoring: Know When Your CI Is Dying
Jenkins on Kubernetes can fail silently: pods restart, builds hang, agents fail to connect. You need monitoring. Use Prometheus to scrape Jenkins metrics (via the Prometheus plugin) and set up alerts for: queue length > 10, build failure rate > 5%, agent connection errors. Also monitor Kubernetes pod status: if agent pods are in CrashLoopBackOff, you have a problem. I once debugged a case where Jenkins was 'running' but no builds executed. The Kubernetes plugin had lost connectivity to the API server due to an expired service account token.
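The thresholds above are simple enough to express as plain logic before you translate them into alert rules. This is only a sketch of the conditions; the function and argument names are hypothetical, and in practice the inputs would come from whatever metrics the Prometheus plugin exposes in your setup.

```python
def ci_alerts(queue_length, builds_total, builds_failed, agent_connection_errors):
    """Return the alert messages that fire for one snapshot of CI metrics."""
    alerts = []
    if queue_length > 10:
        alerts.append("queue length above 10")
    # Avoid dividing by zero when no builds ran in the window.
    failure_rate = builds_failed / builds_total if builds_total else 0.0
    if failure_rate > 0.05:
        alerts.append("build failure rate above 5%")
    if agent_connection_errors > 0:
        alerts.append("agent connection errors")
    return alerts


print(ci_alerts(queue_length=12, builds_total=100, builds_failed=3,
                agent_connection_errors=0))
print(ci_alerts(queue_length=0, builds_total=100, builds_failed=6,
                agent_connection_errors=1))
```

Having the conditions in one place makes it clear what you are alerting on before you commit to a rule syntax.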
The 4GB Container That Kept Dying
Set JENKINS_JAVA_OPTIONS=-Xmx2g -Xms2g in the controller deployment. Also set --handlerCountMax=100 in Jenkins CLI args to limit concurrent connections. Pod memory limit reduced to 3GB.
- Never trust default JVM settings in containerized Jenkins.
- Always set -Xmx to at least 50% of the pod memory limit.
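A quick sanity check for the heap rule, as a sketch. The 50% floor comes from the lesson above. The upper bound (heap strictly below the pod limit) is my own assumption, added because the JVM also needs non-heap memory such as metaspace and thread stacks.

```python
def heap_check(xmx_mib, pod_limit_mib, min_fraction=0.5):
    """Return (fraction, ok) for a JVM heap size against a pod memory limit.

    ok means the heap is at least min_fraction of the limit and still below it.
    """
    if pod_limit_mib <= 0:
        raise ValueError("pod_limit_mib must be positive")
    fraction = xmx_mib / pod_limit_mib
    ok = min_fraction <= fraction < 1.0
    return fraction, ok


# -Xmx2g against a 3GB pod limit, the combination used in the fix above.
fraction, ok = heap_check(2048, 3072)
print(f"heap is {fraction:.0%} of the pod limit, ok={ok}")
```

Run it whenever you change either number, since the two settings live in different parts of the deployment and drift apart easily.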
1. Run kubectl describe pod <agent-pod-name> to check events.
2. Look for 'Insufficient memory' or 'Insufficient cpu'.
3. Increase cluster node count or reduce agent resource requests.
4. If using nodeSelector, verify node labels match.

1. Run kubectl get pods -l app=jenkins-controller.
2. If pod is running, check logs: kubectl logs <pod-name>.
3. Look for 'OutOfMemoryError' or 'Port already in use'.
4. If OOM, increase memory limits and JVM heap.
5. If port conflict, ensure no other service uses 8080.

2. Run kubectl get pods -n jenkins-agents.
3. Check network policies: agent pods must reach controller on port 50000.
4. Increase 'Wait for pod to be running' timeout in Kubernetes plugin config to 300s.
5. If using custom agent image, ensure JNLP agent is correctly configured.