Advanced 3 min · June 21, 2026

Jenkins Monitoring with Prometheus: Stop Reacting, Start Predicting Failures

Monitor Jenkins with Prometheus to detect queue spikes, executor exhaustion, and JVM issues before they cause outages.

Naren, Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

Last updated: June 21, 2026
Quick Answer

Install the Prometheus plugin, confirm that /prometheus is reachable from your Prometheus server (supplying credentials if the endpoint requires authentication), then add a Prometheus scrape target pointing to that endpoint. Use the default metrics or filter with the ?include= and ?exclude= query parameters.

What is Jenkins Monitoring with Prometheus?

Jenkins monitoring with Prometheus involves exposing Jenkins metrics via the Prometheus plugin and scraping them into Prometheus for alerting and dashboards. It covers build queue depth, executor utilization, JVM health, and plugin-specific metrics.

Plain-English First

Think of Jenkins as a busy kitchen with multiple chefs (executors) and a whiteboard of orders (build queue). Prometheus is a health inspector who constantly checks how many orders are waiting, how many chefs are idle, and whether the fridge (JVM) is overheating. When the queue grows past a threshold, the inspector pages the head chef before orders start burning.

Your Jenkins master is a single point of failure for your entire delivery pipeline. I've seen a 200-node cluster grind to a halt because nobody noticed the build queue hit 5000 pending jobs at 2 AM. Monitoring isn't optional — it's survival. This article gives you the exact Prometheus configs, alert rules, and dashboards to catch executor exhaustion, queue backlogs, and JVM memory leaks before they take down your deployments. You'll walk away with production-ready scrape configs, Grafana dashboard JSON, and alert rules that have kept my Jenkins masters alive through Black Friday traffic.

Why Default Jenkins Monitoring Fails at Scale

The built-in Jenkins monitoring page shows you the last 10 builds and a memory graph that updates every 30 seconds. That's fine for a hobby project. In production, you need historical trends, correlation with deployments, and alerts that wake you up at 3 AM. Without Prometheus, you're flying blind. The classic mistake is relying on Jenkins' own health check — it reports green even when the queue is 10,000 deep because the master process is still alive. Prometheus gives you the real picture: executor utilization, queue depth, build duration percentiles, and JVM internals.

prometheus-scrape-config.yml
# io.thecodeforge: DevOps tutorial

scrape_configs:
  - job_name: 'jenkins'
    metrics_path: '/prometheus'
    scrape_interval: 60s
    scrape_timeout: 30s  # must not exceed scrape_interval
    # Query parameters go under params; they are sent as ?include=...
    # Do not append a query string to metrics_path itself.
    params:
      include: ['jenkins_executor_*', 'jenkins_queue_*', 'jvm_*']
    static_configs:
      - targets: ['jenkins-master:8080']
        labels:
          env: 'production'
    # Overwrite the instance label with a stable, readable name
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'jenkins-prod'
Output
No direct output — this is a config file. After reloading Prometheus, check target status: 'curl http://localhost:9090/api/v1/targets | jq .'
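Production Trap: 403 Forbidden on /prometheus
A 403 from /prometheus is usually an access-control problem rather than a CSRF one: Jenkins only requires a CSRF crumb on POST requests, and Prometheus scrapes with GET. Check whether the plugin's authenticated-endpoint option is enabled; if it is, give Prometheus credentials (for example, basic_auth with an API token) for a user allowed to read metrics. Note that --httpPort changes the single port Jenkins listens on rather than adding a second one, and CSRF protection cannot be switched off per port, so do not disable it to fix a scrape.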

Installing and Configuring the Prometheus Plugin

The Prometheus plugin exposes metrics at /prometheus. Install it via Jenkins Plugin Manager. Then configure which metrics to expose. By default, it exposes everything — including job-level metrics that can balloon the response. In production, you want to filter aggressively. Use the system property jenkins.metrics.prometheus.exclude to drop high-cardinality metrics like job build durations per branch. I've seen a 1000-job folder generate a 200MB metrics page. Filter it down to aggregate metrics only.

jenkins-metrics-config.groovy
import jenkins.model.Jenkins

// Run in the Jenkins script console, or pass the property as a -D JVM flag at startup.
// Exclude job-level metrics to reduce cardinality.
System.setProperty('jenkins.metrics.prometheus.exclude', '.*job.*')

// Or include only essential metrics via query parameters:
// /prometheus?include[]=jenkins_executor_*&include[]=jenkins_queue_*&include[]=jvm_*

// Verify the endpoint by counting the samples it returns.
// Assumes the endpoint is readable without credentials from the Jenkins host itself.
def root = Jenkins.get().rootUrl ?: 'http://localhost:8080/'
def body = new URL("${root}prometheus/").text
println body.readLines().count { it && !it.startsWith('#') }
Output
Sample count printed (e.g., 150). After filtering, should be <50.
Senior Shortcut: Filter at Query Time
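Instead of restarting Jenkins to change system properties, use the include[] and exclude[] query parameters in the Prometheus scrape config. This lets you adjust filtering without downtime. Set them under params rather than appending them to metrics_path, because Prometheus escapes a query string embedded in the path instead of sending it as parameters. Example: params: {'include[]': ['jenkins_executor_*', 'jenkins_queue_*']}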
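
To decide what to filter, it helps to know which metric families dominate the response. The sketch below counts samples per metric name in a saved copy of the /prometheus output. The function name, the sample text, and the example_job_duration metric are hypothetical and only illustrate the parsing; real input would come from your own endpoint.

```python
from collections import Counter


def summarize_exposition(text: str, top: int = 3):
    """Return (sample_count, size_bytes, top_families) for Prometheus text output."""
    families = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        families[name] += 1
    size_bytes = len(text.encode("utf-8"))
    return sum(families.values()), size_bytes, families.most_common(top)


# Hypothetical sample; in practice, save the endpoint output to a file and read it.
SAMPLE = """\
# HELP jenkins_queue_size_value Queue size
# TYPE jenkins_queue_size_value gauge
jenkins_queue_size_value 7.0
example_job_duration{job="app",branch="main"} 12.5
example_job_duration{job="app",branch="dev"} 9.1
"""

count, size, top = summarize_exposition(SAMPLE)
print(count, size, top)
```

Families with the most samples are the first candidates for an exclude pattern, since each labeled sample becomes its own time series in Prometheus.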

Essential Metrics: What to Watch and Why

Not all metrics are equal. Focus on the ones that predict failure. Queue depth (jenkins_queue_size_value) tells you if builds are piling up. Executor utilization (jenkins_executor_in_use divided by the executor total) tells you if you need more capacity. Build duration (jenkins_runs_duration_seconds) helps detect performance regressions. JVM metrics (jvm_memory_bytes_used, jvm_gc_collection_seconds) catch memory leaks and GC pressure. Exact metric names vary with plugin version and configured namespace, so confirm them against your own /prometheus output before building alerts on them. Ignore job-level metrics like per-branch counters: they create high cardinality and slow down Prometheus.

essential-metrics.rules
groups:
  - name: jenkins
    rules:
      # Alert when queue exceeds 50 for 5 minutes
      - alert: JenkinsQueueHigh
        expr: jenkins_queue_size_value > 50
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Jenkins queue depth is {{ $value }}"

      # Alert when executor utilization > 90%
      - alert: JenkinsExecutorsExhausted
        expr: (jenkins_executor_in_use / jenkins_executor_total) > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Executor utilization at {{ $value | humanizePercentage }}"

      # Alert when the average GC pause exceeds 1 second
      - alert: JenkinsGCPauseHigh
        expr: rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m]) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Average GC pause above 1s over the last 5 minutes"
Output
Alerts fire in Prometheus Alertmanager. Example: 'JenkinsQueueHigh' triggers when queue > 50 for 5 minutes.
Interview Gold: Cardinality
Interviewers love asking about Prometheus cardinality. The Jenkins Prometheus plugin can explode cardinality if you expose job labels like branch name. Each unique label combination creates a new time series. With 1000 jobs and 10 branches each, that's 10,000 series just for build duration. Always filter or aggregate.
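
The arithmetic above can be sketched as a small estimator: the series count is roughly the product of the distinct values of each label, multiplied by the number of series a metric emits per label combination. The function name and the 12-series histogram figure (10 buckets plus sum and count) are illustrative assumptions, not plugin defaults.

```python
from math import prod


def estimated_series(label_cardinalities: dict, series_per_combo: int = 1) -> int:
    """Upper-bound estimate: every label combination yields its own time series."""
    return series_per_combo * prod(label_cardinalities.values())


# 1000 jobs with 10 branches each, one series per combination.
per_branch = estimated_series({"job": 1000, "branch": 10})

# Same labels on a histogram: assume 10 buckets + _sum + _count = 12 series each.
per_branch_histogram = estimated_series({"job": 1000, "branch": 10}, series_per_combo=12)

# Dropping the branch label removes a whole factor of 10.
per_job = estimated_series({"job": 1000})

print(per_branch, per_branch_histogram, per_job)
```

The estimate is an upper bound, since not every job has every branch, but it shows why removing a single high-cardinality label is usually more effective than trimming individual metrics.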

Building a Grafana Dashboard That Tells a Story

A good dashboard shows the system's health at a glance. Start with a row for 'Build Pipeline Health': queue depth, executor utilization, build duration (p50/p95/p99). Second row for 'JVM Health': heap usage, GC pause time, thread count. Third row for 'Throughput': builds completed per minute, success rate. Use Grafana's time series panels with thresholds. Color-code: green (<70%), yellow (70-90%), red (>90%). Add a log panel for recent build failures.

grafana-dashboard.json
{
  "title": "Jenkins Production Overview",
  "panels": [
    {
      "title": "Queue Depth",
      "type": "timeseries",
      "targets": [{"expr": "jenkins_queue_size_value", "legendFormat": "queue"}],
      "fieldConfig": {"defaults": {"thresholds": {"mode": "absolute", "steps": [
        {"color": "green", "value": null},
        {"color": "red", "value": 50}
      ]}}}
    },
    {
      "title": "Executor Utilization",
      "type": "timeseries",
      "targets": [{"expr": "jenkins_executor_in_use / jenkins_executor_total", "legendFormat": "utilization"}],
      "fieldConfig": {"defaults": {"unit": "percentunit", "thresholds": {"mode": "absolute", "steps": [
        {"color": "green", "value": null},
        {"color": "red", "value": 0.9}
      ]}}}
    },
    {
      "title": "Build Duration (p99)",
      "type": "timeseries",
      "targets": [{"expr": "histogram_quantile(0.99, sum by (le) (rate(jenkins_runs_duration_seconds_bucket[5m])))", "legendFormat": "p99"}]
    }
  ]
}
Output
Grafana renders three panels. Import this JSON into Grafana via + Import.
Senior Shortcut: Use Variables
Add a Grafana variable for 'instance' so you can switch between Jenkins masters. Query: label_values(jenkins_queue_size_value, instance). Then use $instance in all panel queries.

Alerting Rules That Don't Wake You Up for Nothing

Bad alerts are worse than no alerts. You need to tune thresholds and durations. A queue depth of 100 for 30 seconds is noise. A queue depth of 100 for 10 minutes is a problem. Use 'for' clauses to avoid flapping. Also set up 'no data' alerts — if Prometheus stops scraping Jenkins, you won't know. Alert on absent(jenkins_queue_size_value) for 5m. Finally, route alerts to the right channel: critical to PagerDuty, warnings to Slack.
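
The effect of a 'for' clause can be illustrated with a simplified model: the condition must hold at every evaluation for the whole duration before the alert fires, and any evaluation below the threshold resets the timer. This is a sketch of the idea, not Prometheus' actual implementation; the function name and the sample data are hypothetical.

```python
def fires(samples, threshold, for_seconds):
    """samples: (timestamp_seconds, value) pairs in time order, one per evaluation.

    Returns True once the value has stayed above the threshold for at least
    for_seconds without interruption.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: start the timer
            if ts - pending_since >= for_seconds:
                return True
        else:
            pending_since = None  # condition cleared: reset the timer
    return False


# A 30-second spike to 120 pending builds, evaluated every 30 seconds.
spike = [(0, 10), (30, 120), (60, 10), (90, 12)]

# Queue held at 120 for ten minutes, evaluated every 30 seconds.
sustained = [(t, 120) for t in range(0, 601, 30)]

print(fires(spike, threshold=100, for_seconds=600))
print(fires(sustained, threshold=100, for_seconds=600))
```

With a ten-minute 'for' duration the short spike never leaves the pending state, while the sustained backlog fires.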

alertmanager-config.yml
route:
  # Default receiver for alerts that match no child route
  receiver: 'pagerduty-critical'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'
        channel: '#jenkins-alerts'
Output
Alertmanager routes alerts based on severity. Test with: amtool alert add alertname=Test severity=critical
Never Do This: Alert on Every Build Failure
Build failures are normal. Alerting on every failed build creates noise and trains your team to ignore alerts. Instead, alert on build failure rate > 10% over 1 hour, or on infrastructure issues like queue backup.
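
The failure-rate rule can be sketched as a windowed ratio: count only builds that finished within the last hour and alert when the failed share exceeds 10%. The function names and the build data below are hypothetical; in Prometheus the same idea would be expressed over counters exposed by your plugin version.

```python
def failure_rate(builds, now, window_seconds=3600):
    """builds: (finish_timestamp_seconds, succeeded) pairs. Returns failed share in window."""
    recent = [ok for ts, ok in builds if now - window_seconds <= ts <= now]
    if not recent:
        return 0.0  # no builds in the window: nothing to alert on
    return recent.count(False) / len(recent)


def should_alert(builds, now, max_rate=0.10):
    return failure_rate(builds, now) > max_rate


NOW = 7200
# Five old failures outside the window, which must be ignored.
old = [(100 + i, False) for i in range(5)]
# Twenty builds in the last hour, the first three failed.
recent = [(3600 + i * 180, i >= 3) for i in range(20)]

rate = failure_rate(old + recent, NOW)
print(rate, should_alert(old + recent, NOW))
```

A single failed build barely moves the ratio once there is a reasonable number of builds per hour, which is what keeps this rule quiet during normal development.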

When Not to Use Prometheus for Jenkins Monitoring

Prometheus is overkill if you have a single Jenkins master with <10 executors and <50 builds per day. In that case, the built-in monitoring page and email alerts suffice. Also, if your Jenkins is ephemeral (spun up per branch), Prometheus scraping becomes complex — consider using a push gateway or centralized logging instead. Finally, if your team doesn't have the bandwidth to maintain a Prometheus stack, use a SaaS solution like Datadog or New Relic with their Jenkins integrations.

Alternative: Push Gateway
For ephemeral Jenkins instances, use the Prometheus Pushgateway. Jenkins pushes metrics at the end of each build, and Prometheus scrapes the pushgateway. This avoids the problem of scraping a target that disappears.
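
A minimal sketch of the push side, using only the Python standard library: the Pushgateway accepts metrics in the text exposition format at /metrics/job/<job>/<label>/<value>. The gateway host, job name, instance value, and metric names below are hypothetical, and the request is built but not sent.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, instance, metrics):
    """Build a PUT request that replaces all metrics for this job/instance group."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    body = ("\n".join(lines) + "\n").encode("utf-8")  # body must end with a newline
    url = (
        f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


req = build_push_request(
    "http://pushgateway.example:9091",  # hypothetical gateway address
    "jenkins_build",
    "agent-42",
    {"jenkins_build_duration_seconds": 93.5, "jenkins_build_success": 1},
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```

PUT replaces every metric in the group, so stale values from a previous build do not linger; pushed metrics still stay on the gateway until they are deleted, so clean up groups for instances that no longer exist.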
Production incident · Post-mortem · severity: high

The 4GB Heap That Kept Dying

Symptom
Jenkins master became unresponsive every 3 hours. Restart fixed it temporarily. No alerts fired.
Assumption
Team assumed a memory leak in a custom plugin.
Root cause
The Prometheus plugin was exposing all metrics including detailed job history, causing the /prometheus endpoint to generate a 50MB response every scrape. Prometheus scraped every 15 seconds, triggering full GCs that paused the JVM for 5+ seconds.
Fix
Set the system property jenkins.metrics.prometheus.exclude=.*job.* to exclude job-level metrics. This reduced the response size to 2MB. The scrape interval was also increased to 60s.
Key lesson
  • Always benchmark your metrics endpoint under load before going to production — a bloated /prometheus can kill your Jenkins master.
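
The numbers in this incident can be checked with simple arithmetic: the data Jenkins has to generate per minute is the response size multiplied by the number of scrapes per minute. The function name is illustrative.

```python
def scrape_load_mb_per_minute(response_mb: float, interval_seconds: float) -> float:
    """Megabytes of metrics output Jenkins must generate per minute."""
    return response_mb * (60.0 / interval_seconds)


before = scrape_load_mb_per_minute(50, 15)  # 50MB response, scraped every 15s
after = scrape_load_mb_per_minute(2, 60)    # 2MB response, scraped every 60s

print(before, after, before / after)
```

Filtering and the longer interval together cut the generated volume from 200MB to 2MB per minute, a hundredfold reduction in the work done purely to serve scrapes.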
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.
Symptom · 01
Prometheus scrape returns 403 Forbidden
Fix
1. Confirm the scrape is a plain GET: CSRF crumbs are only required on POST, so a 403 here points to access control. 2. Check whether the plugin's authenticated-endpoint option is enabled. 3. If it is, add credentials to the scrape config (for example, basic_auth with an API token) for a user allowed to read metrics. 4. Re-check the target status in Prometheus.
Symptom · 02
Prometheus scrape timeout or high memory on Jenkins
Fix
1. Check /prometheus response size with curl. 2. If >10MB, filter metrics using ?include[]= or the system property. 3. Increase scrape_timeout to 30s. 4. Increase scrape_interval to 60s so scrapes happen less often. 5. Consider increasing the Jenkins heap.
Symptom · 03
No metrics appear in Prometheus (target down)
Fix
1. Verify Jenkins is running. 2. Check Prometheus target status at /targets. 3. Ensure network connectivity between Prometheus and Jenkins. 4. Check Jenkins logs for errors. 5. Verify plugin is installed and enabled.
| Feature | Prometheus Plugin | Jenkins Built-in Monitoring |
| --- | --- | --- |
| Historical data | Yes, via Prometheus TSDB | Limited (last few hours) |
| Alerting | Yes, via Alertmanager | Email only |
| Custom metrics | Yes, via plugin API | No |
| Scalability | Handles 1000s of jobs | Struggles above 100 jobs |
| Setup complexity | Medium (Prometheus + Grafana) | Low (built-in) |

Key takeaways

1. Filter metrics aggressively to avoid high cardinality and large response sizes: include only what you need.
2. Make sure Prometheus can reach /prometheus with the right credentials. A 403 on a scrape is an access-control problem, so never disable CSRF protection to fix it.
3. Set up alerts on queue depth, executor utilization, and JVM health, not on individual build failures.
4. The /prometheus endpoint can kill your Jenkins master if it's too large: always benchmark under load.

Frequently Asked Questions

1. How do I monitor Jenkins with Prometheus?
2. What's the difference between the Prometheus plugin and the Metrics plugin?
3. How do I reduce the size of the /prometheus response?
4. Can Prometheus monitoring cause Jenkins to crash?