Jenkins Monitoring with Prometheus: Stop Reacting, Start Predicting Failures
Monitor Jenkins with Prometheus to detect queue spikes, executor exhaustion, and JVM issues before they cause outages.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
Install the Prometheus plugin, which serves metrics at the /prometheus path of your Jenkins URL by default, then add a Prometheus scrape target pointing to that endpoint. Start with the default metrics and trim them with the plugin's filtering options, which vary by plugin version.
Think of Jenkins as a busy kitchen with multiple chefs (executors) and a whiteboard of orders (build queue). Prometheus is a health inspector who constantly checks how many orders are waiting, how many chefs are idle, and whether the fridge (JVM) is overheating. When the queue grows past a threshold, the inspector pages the head chef before orders start burning.
Your Jenkins master is a single point of failure for your entire delivery pipeline. I've seen a 200-node cluster grind to a halt because nobody noticed the build queue hit 5000 pending jobs at 2 AM. Monitoring isn't optional — it's survival. This article gives you the exact Prometheus configs, alert rules, and dashboards to catch executor exhaustion, queue backlogs, and JVM memory leaks before they take down your deployments. You'll walk away with production-ready scrape configs, Grafana dashboard JSON, and alert rules that have kept my Jenkins masters alive through Black Friday traffic.
Why Default Jenkins Monitoring Fails at Scale
The built-in Jenkins monitoring page shows you the last 10 builds and a memory graph that updates every 30 seconds. That's fine for a hobby project. In production, you need historical trends, correlation with deployments, and alerts that wake you up at 3 AM. Without Prometheus, you're flying blind. The classic mistake is relying on Jenkins' own health check — it reports green even when the queue is 10,000 deep because the master process is still alive. Prometheus gives you the real picture: executor utilization, queue depth, build duration percentiles, and JVM internals.
Installing and Configuring the Prometheus Plugin
The Prometheus plugin exposes metrics at /prometheus. Install it through the Jenkins Plugin Manager, then configure which metrics to expose. By default it exposes everything, including job-level metrics that can balloon the response. In production, filter aggressively: use the system property jenkins.metrics.prometheus.exclude to drop high-cardinality metrics such as per-branch job build durations (confirm the exact setting name against the documentation for your plugin version). I've seen a 1000-job folder generate a 200MB metrics page. Filter it down to aggregate metrics only.
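Before you decide what to filter, it helps to see which metric names account for most of the series on the page. The following is a minimal sketch, not part of the plugin: it counts series per metric name in Prometheus text-format output. The function name series_per_metric and the sample metric example_job_duration_seconds are hypothetical, made up for illustration; feed it the real body of your /prometheus endpoint.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count time series per metric name in Prometheus text-format output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Hypothetical sample; real input is the body of your Jenkins /prometheus page.
SAMPLE = """\
# HELP jenkins_queue_size_value Queue size
# TYPE jenkins_queue_size_value gauge
jenkins_queue_size_value 12
example_job_duration_seconds{job="app/main"} 41.0
example_job_duration_seconds{job="app/feature-1"} 39.5
example_job_duration_seconds{job="app/feature-2"} 44.2
"""

if __name__ == "__main__":
    # Print the metric names with the most series first.
    for name, n in series_per_metric(SAMPLE).most_common():
        print(f"{n:6d}  {name}")
```

Metric names that show up thousands of times are the first candidates for exclusion.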
Essential Metrics: What to Watch and Why
Not all metrics are equal. Focus on the ones that predict failure. Queue depth (jenkins_queue_size_value) tells you if builds are piling up. Executor utilization (jenkins_executor_in_use divided by the total executor count) tells you if you need more capacity; the raw in-use count means little until you compare it with what is available. Build duration (jenkins_runs_duration_seconds) helps detect performance regressions. JVM metrics (jvm_memory_bytes_used, jvm_gc_collection_seconds) catch memory leaks. Ignore job-level metrics like per-branch counters: they create high cardinality and slow down Prometheus. Exact metric names can differ between plugin versions, so check them against your own /prometheus output.
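To make the utilization idea concrete, here is a small sketch that turns two raw values into a ratio you can put a threshold on. The function name and the decision to treat zero executors as fully saturated are my assumptions for illustration, not something the plugin defines.

```python
def executor_utilization(in_use: float, total: float) -> float:
    """Return the fraction of executors that are busy, from 0.0 to 1.0."""
    if total <= 0:
        # Assumption: no executors online is treated as fully saturated,
        # so a dead agent pool is never reported as "idle".
        return 1.0
    # Clamp to 1.0 in case the two values were sampled at different moments.
    return min(in_use / total, 1.0)


if __name__ == "__main__":
    # Hypothetical readings: 15 busy executors out of 20.
    print(f"utilization: {executor_utilization(15, 20):.0%}")
```

In Prometheus itself you would express the same division in a query; the point is that capacity decisions come from the ratio, not from the busy count alone.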
Building a Grafana Dashboard That Tells a Story
A good dashboard shows the system's health at a glance. Start with a row for 'Build Pipeline Health': queue depth, executor utilization, build duration (p50/p95/p99). Second row for 'JVM Health': heap usage, GC pause time, thread count. Third row for 'Throughput': builds completed per minute, success rate. Use Grafana's time series panels with thresholds. Color-code: green (<70%), yellow (70-90%), red (>90%). Add a log panel for recent build failures.
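Grafana computes percentiles and applies thresholds for you, but it is worth knowing what those panels actually calculate. This is an illustrative sketch only: percentile uses the nearest-rank method (Prometheus and Grafana may interpolate differently), and threshold_color mirrors the green/yellow/red bands above. Both function names are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Multiply before dividing to keep the rank exact for whole-number inputs.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def threshold_color(percent):
    """Map a utilization percentage to the dashboard bands used in this article."""
    if percent > 90:
        return "red"
    if percent >= 70:
        return "yellow"
    return "green"


if __name__ == "__main__":
    # Hypothetical build durations in seconds.
    durations = [42, 38, 51, 47, 300, 44, 40, 39, 45, 43]
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(durations, p)}s")
    print(threshold_color(85))
```

Notice how a single slow build dominates p95 and p99 while p50 barely moves. That is why the pipeline row shows all three.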
Alerting Rules That Don't Wake You Up for Nothing
Bad alerts are worse than no alerts. You need to tune thresholds and durations. A queue depth of 100 for 30 seconds is noise. A queue depth of 100 for 10 minutes is a problem. Use 'for' clauses to avoid flapping. Also set up 'no data' alerts: if Prometheus stops scraping Jenkins, every threshold alert goes silent and you won't know. Alert when absent(jenkins_queue_size_value) holds for 5m. Finally, route alerts to the right channel: critical to PagerDuty, warnings to Slack.
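The effect of a 'for' clause is easier to see in code. The sketch below is a simplified model, not Prometheus's actual rule evaluator: it assumes samples arrive as (timestamp, value) pairs sorted by time and fires only when the value stays at or above the threshold for the whole duration. The function name alert_fires is hypothetical.

```python
def alert_fires(samples, threshold, for_seconds):
    """Return True if the value stays at or above threshold for at least for_seconds.

    samples: list of (timestamp_seconds, value) tuples, sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value >= threshold:
            if breach_start is None:
                breach_start = ts  # start of a new continuous breach
            if ts - breach_start >= for_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the timer
    return False


if __name__ == "__main__":
    # A 30-second spike versus a sustained 10-minute backlog, sampled every 15s.
    spike = [(0, 120), (15, 130), (30, 110), (45, 20)]
    sustained = [(t, 150) for t in range(0, 601, 15)]
    print(alert_fires(spike, 100, 600))
    print(alert_fires(sustained, 100, 600))
```

The spike never pages anyone and the sustained backlog does, which is exactly the behavior you want from the 10-minute rule.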
When Not to Use Prometheus for Jenkins Monitoring
Prometheus is overkill if you have a single Jenkins master with <10 executors and <50 builds per day. In that case, the built-in monitoring page and email alerts suffice. Also, if your Jenkins is ephemeral (spun up per branch), Prometheus scraping becomes complex because the targets come and go; consider using the Prometheus Pushgateway or centralized logging instead. Finally, if your team doesn't have the bandwidth to maintain a Prometheus stack, use a SaaS solution like Datadog or New Relic with their Jenkins integrations.
The 4GB Heap That Kept Dying
- Always benchmark your metrics endpoint under load before going to production — a bloated /prometheus can kill your Jenkins master.
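A rough way to run that benchmark is to fetch the endpoint a few times while Jenkins is busy and record the worst case. This is a minimal sketch using only the Python standard library. The helper names fetch_metrics and measure are hypothetical, and it measures a single sequential client, so treat the numbers as a starting point rather than a real load test.

```python
import time
import urllib.request


def fetch_metrics(url, timeout=10.0):
    """Download the metrics page and return its raw body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def measure(fetch, runs=5):
    """Call fetch() several times; return (largest body in bytes, slowest call in seconds)."""
    max_bytes = 0
    max_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        body = fetch()
        elapsed = time.perf_counter() - start
        max_bytes = max(max_bytes, len(body))
        max_seconds = max(max_seconds, elapsed)
    return max_bytes, max_seconds


# Example usage (hypothetical URL; replace with your own Jenkins address):
#   url = "http://jenkins.example.com:8080/prometheus"
#   size, slowest = measure(lambda: fetch_metrics(url))
#   print(f"largest response: {size / 1e6:.1f} MB, slowest scrape: {slowest:.2f}s")
```

If the slowest fetch approaches your scrape timeout, or the page runs to tens of megabytes, trim the exposed metrics before pointing Prometheus at it.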