Jenkins Controller and Agent: Stop Running Everything on One Machine
Jenkins controller and agent architecture explained.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
The Jenkins controller is the central server that schedules jobs and stores configuration. Agents (formerly slaves) are remote machines that execute those jobs. You connect agents to the controller over SSH, or as JNLP agents (now called inbound agents, and once launched through Java Web Start). This lets you run builds on different OSes, scale horizontally, and isolate workloads.
Think of the controller as the head chef in a busy kitchen. They plan the menu, take orders, and decide which dishes go to which station. The agents are the line cooks — each specializes in a different task (grill, pastry, etc.) and works independently. If you only had the head chef cooking, they'd be overwhelmed and everything would slow down. By delegating to multiple line cooks, the kitchen serves more orders faster.
Most Jenkins setups I've seen in production start the same way: someone installs Jenkins on a single VM, runs everything there, and it works fine — until it doesn't. The controller gets overloaded, builds queue up, and a single rogue job can take down the entire CI/CD pipeline. I've watched a payments team lose an entire release day because a memory leak in a test suite killed the Jenkins process. Don't be that team.
The controller-agent pattern solves this by separating the brain from the brawn. The controller handles lightweight orchestration — scheduling, authentication, UI — while agents do the heavy lifting: compiling, testing, packaging. This isn't just about scaling; it's about survival. Without it, you can't run builds on different platforms, you can't isolate untrusted jobs, and you can't recover from a single point of failure.
By the end of this article, you'll be able to set up a Jenkins controller with multiple agents, configure agent security, and diagnose the most common production failures — including the one that cost a fintech startup 6 hours of downtime last year.
Why You Need a Separate Controller and Agent
Running everything on the controller is like using your laptop as a production server. It works until you need to deploy at 3 AM and your laptop runs out of battery. The controller is the nervous system — it should be lean, responsive, and always available. Agents are the muscles — they do the heavy lifting and can be swapped out when they fail.
Without agents, every build competes for the same CPU, memory, and disk I/O. A single memory-intensive test suite can starve the controller's web UI, making it impossible to cancel the build or check logs. I've seen this bring down a CI pipeline for a 200-person engineering org because no one could access the Jenkins UI to kill the runaway job.
Agents also let you run builds on different platforms. Want to test on Windows, Linux, and macOS? Spin up an agent for each. Want to isolate untrusted pipeline code? Run it on a disposable agent in a Docker container. Set the controller's own executor count to 0 and it never runs build code; it only orchestrates.
Agent Connection Methods: SSH vs JNLP vs Web Start
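You have three ways to connect an agent to the controller, and each has trade-offs. SSH is the most common for permanent agents: the controller opens the connection, pushes agent.jar to the machine, and manages the process. JNLP (now called an inbound agent) is for agents that can't accept inbound SSH connections, like Windows machines behind a firewall; the agent runs agent.jar itself and dials the controller. Web Start is the legacy way of launching a JNLP agent from a browser. Java Web Start was removed from the JDK in Java 11, so you will only see it in legacy setups.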
SSH is preferred because it's encrypted by default, supports key-based auth, and the controller handles reconnection automatically. JNLP requires the agent to initiate the connection, which is useful when the agent is in a different network segment. But JNLP agents need manual restart if the connection drops — they don't auto-reconnect without extra configuration.
I've seen teams use JNLP because 'it's easier' — then they wonder why agents go offline after a network blip. Use SSH unless you have a specific reason not to. If you must use JNLP, wrap the agent launch in a systemd service with Restart=always.
Configuring Agent Labels for Targeted Builds
Labels are how you tell Jenkins which agent should run a specific job. Without labels, Jenkins picks any available agent — which is fine for simple setups, but dangerous when you need specific tools or environments. For example, a Docker build must run on an agent with Docker installed. A Windows build needs a Windows agent.
Labels are free-form strings. You can assign multiple labels to an agent, separated by spaces (e.g., 'linux docker high-mem'). In a declarative pipeline, the agent directive constrains where the job runs, e.g. agent { label 'linux && docker' }. If no agent matches the label, the job waits in the queue indefinitely, which is why you should always have a fallback or timeout.
I've seen a team label their agents 'production' and 'testing', then accidentally run a destructive database migration on the production agent because the pipeline didn't specify a label. Always label explicitly, and never rely on the default 'any' agent for sensitive jobs.
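To make the matching rule concrete, here is a minimal Python sketch of how a job that requires several labels ends up with a set of eligible agents. It is a toy model, not Jenkins code: the Agent class, the eligible_agents function, and the agent names are all made up for illustration, and it only covers the "every label must be present" case.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Agent:
    """Hypothetical stand-in for a Jenkins agent and its labels."""
    name: str
    labels: Set[str] = field(default_factory=set)
    online: bool = True


def eligible_agents(agents: List[Agent], required: Set[str]) -> List[str]:
    """Return names of online agents that carry every required label."""
    return [a.name for a in agents if a.online and required <= a.labels]


agents = [
    Agent("linux-1", {"linux", "docker", "high-mem"}),
    Agent("linux-2", {"linux"}),
    Agent("win-1", {"windows"}),
]

print(eligible_agents(agents, {"linux", "docker"}))  # ['linux-1']
print(eligible_agents(agents, {"macos"}))            # [] -> the job would sit in the queue
print(eligible_agents(agents, set()))                # no label required: every online agent qualifies
```

The last call is the dangerous case from the story above: with no label required, every agent is a candidate, including the one you never wanted that job to touch.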
Scaling Agents Horizontally with Cloud Plugins
When your build demand spikes — say, during a release day — you don't want to manually spin up agents. Cloud plugins (EC2, Kubernetes, Azure VM) let Jenkins provision agents on demand. The controller detects a queued job, launches a new agent, runs the job, then terminates the agent after a timeout.
This is the gold standard for CI/CD at scale. You pay only for what you use, and you never have idle agents wasting resources. The Kubernetes plugin is especially popular: each build runs in a pod with ephemeral storage, so no workspace cleanup needed.
But cloud agents introduce latency. Spinning up a VM takes 30-60 seconds. For short jobs, that overhead might exceed the build time. Use a hybrid approach: keep a pool of warm agents (e.g., 2-3 always-on) for quick jobs, and use cloud agents for spikes.
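The trade-off is simple arithmetic, sketched below in Python. The 45-second provisioning time is just the midpoint of the 30-60 second range above, and the 25% overhead threshold is an arbitrary assumption; cold_start_overhead and pick_pool are hypothetical helpers, not part of any Jenkins plugin.

```python
def cold_start_overhead(build_seconds: float, provision_seconds: float) -> float:
    """Fraction of total wall time spent waiting for a cloud agent to start."""
    return provision_seconds / (provision_seconds + build_seconds)


def pick_pool(build_seconds: float, provision_seconds: float = 45.0,
              max_overhead: float = 0.25) -> str:
    """Suggest 'warm' agents when provisioning would dominate the job's runtime."""
    if cold_start_overhead(build_seconds, provision_seconds) > max_overhead:
        return "warm"
    return "cloud"


print(pick_pool(20))    # warm: 45s of waiting for a 20s lint job
print(pick_pool(600))   # cloud: 45s is noise next to a 10-minute build
```

In a real setup the routing itself would be done with labels (for example a 'warm' label on the always-on agents); the sketch only shows how to decide which jobs deserve that label.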
Securing the Controller-Agent Connection
The controller-agent channel carries sensitive data: source code, credentials, deployment keys. If an attacker compromises an agent, they can exfiltrate secrets or inject malicious builds. You must secure the connection.
SSH agents use the controller's SSH key to authenticate. Protect that key with a passphrase and store it in Jenkins credentials with restricted scope. For JNLP agents, use a secret token that's unique per agent. Never reuse secrets across agents.
Beyond authentication, encrypt the traffic. SSH is encrypted by default. For JNLP, use HTTPS for the controller URL and enable TCP encryption if using the TCP agent port. Also, run agents in isolated environments — don't give them access to production networks unless necessary.
I've seen a company where an agent had access to the production database because it was on the same VLAN. A compromised build script dumped the entire user table. Isolate agents in a separate subnet with strict firewall rules.
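For the per-agent secret mentioned above, one approach is to keep it in a file only the agent user can read and hand the agent the file path instead of the value. The Python sketch below assumes a POSIX system and the -url, -name, and -secret flags that recent Jenkins versions show on the node page (older versions use -jnlpUrl, so check yours). The helper names, URL, and agent name are made up.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import List


def write_secret_file(secret: str, path: Path) -> Path:
    """Write the agent secret to a file only the owner can read (mode 0600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed
    return path


def agent_command(controller_url: str, agent_name: str, secret_file: Path) -> List[str]:
    """Build the launch argv; the secret is referenced by file, never inlined."""
    return [
        "java", "-jar", "agent.jar",
        "-url", controller_url,
        "-name", agent_name,
        "-secret", f"@{secret_file}",
    ]


with tempfile.TemporaryDirectory() as tmp:
    secret_path = write_secret_file("not-a-real-secret", Path(tmp) / "secret.txt")
    print(oct(stat.S_IMODE(secret_path.stat().st_mode)))
    print(agent_command("https://jenkins.example.com/", "linux-1", secret_path))
```

Because the argv only contains the @path reference, the secret value itself never shows up in the process list.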
Passing the agent secret on the command line exposes it in ps aux output. Use @secret.txt to read it from a file instead. Also, never echo the secret in build logs; mask it with echo '***'.
Monitoring Agent Health and Performance
Agents die. Networks blip. Disks fill up. You need to know when an agent goes offline before a developer complains. Jenkins provides monitoring plugins (Monitoring, Metrics) that expose agent status via API and UI.
Set up alerts for agent disconnection. Use the Jenkins CLI or API to check agent status periodically. For example, a cron job that polls the controller's /computer/api/json endpoint and alerts if any agent has been offline for more than 5 minutes.
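A periodic status check can be sketched in a few lines of Python against the controller's JSON API at /computer/api/json. This is a starting point under assumptions, not a finished monitor: authentication is left out, the sample payload is trimmed to the fields the code reads, and skipping agents that someone deliberately took offline (temporarilyOffline) is my design choice. Tracking the "more than 5 minutes" part would need state saved between runs, which is not shown.

```python
import json
import urllib.request
from typing import List


def offline_agents(payload: dict) -> List[str]:
    """Return display names of agents reported offline, ignoring manual offlining."""
    return [
        node["displayName"]
        for node in payload.get("computer", [])
        if node.get("offline") and not node.get("temporarilyOffline")
    ]


def fetch_status(controller_url: str) -> dict:
    """Fetch agent status from the controller's JSON API (authentication omitted)."""
    url = f"{controller_url.rstrip('/')}/computer/api/json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hypothetical payload, trimmed to the fields used above.
sample = {"computer": [
    {"displayName": "linux-1", "offline": False, "temporarilyOffline": False},
    {"displayName": "win-1", "offline": True, "temporarilyOffline": False},
    {"displayName": "mac-1", "offline": True, "temporarilyOffline": True},
]}
print(offline_agents(sample))  # ['win-1']
```

In a cron job you would call offline_agents(fetch_status("http://controller:8080")) and send the result to whatever alerting channel you already use.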
Also monitor agent resource usage. A build that consumes 100% CPU for an hour might indicate an infinite loop. Use the 'Monitoring' plugin to track CPU, memory, and disk on each agent. Set thresholds and trigger notifications.
I've seen a build that wrote gigabytes of logs to the workspace, filling the agent's disk and causing all subsequent builds to fail with 'No space left on device'. Set workspace cleanup policies and disk usage alerts.
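A disk usage alert does not need a plugin to get started. The Python sketch below uses the standard library's shutil.disk_usage; the 85% threshold and the disk_alert helper are assumptions for illustration, and the path you check would be the agent's workspace root.

```python
import shutil
from typing import Optional


def disk_alert(path: str, max_used_fraction: float = 0.85) -> Optional[str]:
    """Return a warning string when the filesystem holding `path` is too full."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction > max_used_fraction:
        free_mib = usage.free // 2**20
        return f"{path}: {used_fraction:.0%} used, {free_mib} MiB free"
    return None


warning = disk_alert(".")
print(warning or "disk usage below threshold")
```

Run it from cron on each agent and forward any non-empty result to your alerting channel, so you hear about a filling disk before the builds start failing.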
Troubleshooting Common Agent Failures
Agents fail in predictable ways. Here are the top four I've seen in production:
- Connection refused: The agent machine is down, or the SSH port is blocked. Check network connectivity and firewall rules. Use telnet agent-ip 22 to test SSH.
- Authentication failure: The SSH key changed or the agent secret expired. Regenerate the key/secret and update the agent configuration. For SSH, verify the public key is in ~jenkins/.ssh/authorized_keys.
- Out of disk space: Builds accumulate workspace files. Set up a cron job to clean workspaces older than 7 days. Use the 'Workspace Cleanup Plugin' to delete workspace after each build.
- Java version mismatch: The agent runs Java 8 or 11, but the controller expects a different version. Check the agent's Java version with java -version. Use the same major version as the controller.
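If telnet is not installed on the controller, the same reachability test for the "connection refused" case can be done in Python. The port_open helper below is a hypothetical name; it only tells you whether something accepts TCP connections on that port, not whether SSH authentication will succeed.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling port_open("agent-ip", 22) from the controller gives the same answer as the telnet test: False means the problem is the network or the firewall, not Jenkins.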
I've debugged an agent that disconnected every 30 minutes. Turned out the agent's JVM was running out of memory because the -Xmx was set too low for the agent process itself. Increased it from 64m to 256m and the disconnects stopped.
The 4GB Container That Kept Dying
Set -Xmx512m -Xms256m for the agent process. Then set build tool memory limits explicitly (e.g., MAVEN_OPTS=-Xmx2g). Also added -XX:+UseContainerSupport for JDK 10+ to respect container limits.
- Always cap the agent JVM memory.
- The agent doesn't need gigabytes — it's just a relay.
- Starve the agent, feed the build.
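To sanity-check a memory split like this, you can parse the -Xmx values and compare their sum with the container limit. The Python sketch below is only that: parse_xmx and heap_budget_ok are hypothetical helpers, and the 25% headroom for non-heap memory (metaspace, thread stacks, native buffers) is an assumed rule of thumb, not a JVM guarantee.

```python
import re
from typing import List, Optional

_UNITS = {"k": 2**10, "m": 2**20, "g": 2**30}


def parse_xmx(command_line: str) -> Optional[int]:
    """Return the -Xmx heap limit in bytes, or None if the flag is absent."""
    match = re.search(r"-Xmx(\d+)([kmgKMG]?)", command_line)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    return value * _UNITS.get(unit, 1)


def heap_budget_ok(container_bytes: int, heaps: List[int], headroom: float = 0.25) -> bool:
    """True if the summed heaps leave `headroom` of the container for non-heap memory."""
    return sum(heaps) <= container_bytes * (1 - headroom)


agent_heap = parse_xmx("java -Xmx512m -Xms256m -jar agent.jar")
maven_heap = parse_xmx("-Xmx2g")
print(agent_heap, maven_heap)                                  # 536870912 2147483648
print(heap_budget_ok(4 * 2**30, [agent_heap, maven_heap]))     # True
```

With a 4GB container, a 512m agent and a 2g build heap fit with room to spare. Drop the explicit limits and there is nothing to add up, which is how containers like the one in this story end up getting killed.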
1. ps aux | grep agent.jar.
2. If not running, restart agent service: sudo systemctl restart jenkins-agent.
3. Check agent logs: tail -100 /home/jenkins/agent/remoting.log.
4. Verify network connectivity from controller to agent: ssh -i /var/lib/jenkins/.ssh/agent-key jenkins@agent-ip 'echo OK'.
5. If SSH fails, check firewall rules and SSH key permissions.
1. curl -s http://controller:8080/computer/api/json (add credentials if the controller requires authentication).
2. Ensure at least one agent has the required label.
3. If using cloud agents, check cloud plugin logs for provisioning errors.
4. Increase agent count or add more executors.
5. As a temporary workaround, add a label to an existing agent that matches the job requirement.
1. ps aux | grep agent.jar and look for -Xmx.
2. Increase agent heap: add -Xmx512m to launch command.
3. Check network stability: ping controller-ip for packet loss.
4. If using JNLP, wrap agent in systemd with Restart=always.
5. Enable remoting logging: add -Djava.util.logging.config.file=/path/to/logging.properties to agent JVM args.