Advanced 11 min · 2026-06-21

Ansible Performance Tuning: From 10x Slower to 10x Faster with Forks, Pipelining, and Async

Q: What is the optimal forks value for 1000 hosts?

Start at 50 and increase in steps while monitoring controller CPU, memory, and ulimit -n. Most 16-core controllers handle 200-300 forks.

Q: Does pipelining work with become?

Yes, but requiretty must be disabled. Some modules may have issues; test with your playbooks.

Q: Can I use ControlMaster with jump hosts?

Yes, but specify control_path to avoid socket name collisions. Use `control_path = /tmp/ansible-%%h-%%p-%%r`.

Q: What happens if I set strategy=free and a task fails?

Only that host fails; others continue. Use `any_errors_fatal: true` if you want to stop all on failure.

Q: How do I clean stale ControlMaster sockets?

Run `ssh -O stop -o ControlPath=<socket> hostname` for each socket, or delete the socket files in ~/.ansible/cp/. Keep ControlPersist short to avoid accumulation.

Q: Is Redis fact caching persistent across Ansible controller restarts?

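Yes. The cache lives in Redis, not in the Ansible process, so facts persist between playbook runs until they time out or are flushed. They survive a restart of Redis or of the machine it runs on only if Redis persistence (RDB snapshots or AOF) is enabled.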

Q: Can I use profile_tasks with Tower/AWX?

Yes, but enable it in the project's ansible.cfg or through an environment variable in the job settings; callbacks are not enabled through extra vars. Tower may strip some callbacks; test first.

Q: What is the impact of gathering=smart?

It only gathers facts if the cache is missing or expired. Reduces startup time significantly when caching is enabled.

Production-tested Ansible tuning: forks, SSH pipelining, ControlMaster, strategy plugins, async, fact caching with Redis, and profiling with profile_tasks.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Production

production tested

June 21, 2026

last updated

1,596

articles · all by Naren

⚡Quick Answer

Set forks = 50 in ansible.cfg to work on up to 50 hosts in parallel; check the controller's open-file limit with ulimit -n.
Enable pipelining = True to cut the SSH operations needed per task; it requires requiretty to be disabled in sudoers.
Use ssh_args = -o ControlMaster=auto -o ControlPersist=60s to reuse SSH connections across tasks.
Switch strategy to free for independent host execution; host_pinned for locality; linear for strict ordering.
For fire-and-forget tasks, use async: 300 and poll: 0; check status with the async_status module.
Cache facts in Redis: fact_caching = redis with fact_caching_timeout = 86400; this avoids JSON file contention.
Enable profiling: callback_whitelist = profile_tasks in ansible.cfg (the setting is named callbacks_enabled on ansible-core 2.11 and later); parse the output with grep to find slow tasks.
Always test tuning changes on a small batch before rolling out; monitor Ansible controller CPU and memory.

✦ Definition~90s read

What is Ansible Performance Tuning?

Ansible performance tuning is the practice of adjusting Ansible's configuration and playbook patterns to minimize execution time, reduce resource consumption on the controller, and increase throughput when managing many hosts. The core areas are parallelism (forks), SSH optimization (pipelining, ControlMaster), task execution strategy (linear, free, host_pinned), asynchronous task management, and fact caching.

In the Ansible ecosystem, these settings live primarily in ansible.cfg or as environment variables. They affect how Ansible's internal engine communicates with managed hosts. Without tuning, Ansible works through hosts five at a time and repeats SSH and fact-gathering work on every run. With tuning, it becomes a high-throughput automation engine capable of managing thousands of hosts in minutes.

The problem these settings solve is the inherent latency of SSH-based communication. Each task on each host requires an SSH connection, authentication, task transfer, execution, and result retrieval. Without pipelining, that's multiple round trips per task.

Without ControlMaster, a new TCP connection is established for every task. Without sufficient forks, hosts wait in a queue. Without caching, facts are gathered on every run. Each of these adds up to massive overhead at scale.

Plain-English First

Imagine you're a chef in a busy restaurant kitchen. Each order (host) needs a series of steps: chop vegetables, boil water, cook pasta. If you do all orders one by one, it's slow. Ansible's forks setting is like hiring more chefs to work in parallel—each chef handles one order. SSH pipelining is like having a direct conveyor belt from prep to stove instead of walking each ingredient. ControlMaster is like keeping a dedicated line to each station so you don't have to dial the phone every time. Strategy plugins are different kitchen workflows: linear is a strict assembly line (one step for all orders, then next), free lets each chef work independently on their order, and host_pinned keeps a chef at one station. Async tasks are like putting a pot on low heat and walking away—you check later if it's done. Fact caching is like having a cheat sheet of each ingredient's properties so you don't have to look them up every time. Profiling is a stopwatch that shows which step takes the longest so you can optimize it.

I once managed a 2000-node deployment that took 45 minutes to run a simple playbook. The team was blaming network latency, but the real culprit was Ansible's default configuration. By the time we finished tuning, the same playbook ran in under 4 minutes. That incident taught me that Ansible's defaults are designed for small environments, not production at scale.

Ansible's performance bottlenecks are often misdiagnosed. Engineers blame slow SSH, slow hosts, or the tool itself, but the root cause is almost always configuration. The defaults prioritize compatibility over speed: forks=5, pipelining=False, linear strategy, no fact caching. For a handful of hosts, that's fine. For hundreds or thousands, it's a disaster.

This article covers the five most impactful tuning levers: forks parallelism, SSH pipelining and ControlMaster, strategy plugins, async tasks, and fact caching. I'll also show how to profile your playbooks to find the real slow spots. Every recommendation comes from production experience—including the gotchas that can break your deployment if you're not careful.

By the end, you'll know how to make Ansible run 5-10x faster on large infrastructures, and more importantly, how to avoid the common mistakes that lead to timeouts, connection failures, and inconsistent state.

Forks: The First Lever for Parallelism

The forks setting in Ansible controls the maximum number of hosts that can be processed in parallel for any given task. The default is 5, which is absurdly low for any environment with more than a handful of servers. Increasing forks is the single most impactful change you can make.

In ansible.cfg:

```ini
[defaults]
forks = 100
```

You can also set it via environment variable: ANSIBLE_FORKS=100.

Production Gotcha: Setting forks too high can overwhelm the Ansible controller's CPU, memory, and file descriptor limits. Each fork uses a separate SSH process. On Linux, check your ulimit -n (open file limit). For 100 forks, you need at least 100 file descriptors per task (plus overhead). Also, network bandwidth and target host capacity are factors. I've seen controllers become unresponsive with forks=500 on a 2GB VM.

How to find the right value: Start with 50, monitor controller CPU and memory with htop. Increase by 25 until CPU reaches ~70% or you see connection errors. In our production environment with 2000 hosts and 16-core controllers, we settled on forks=200.

Another gotcha: forks caps how many hosts are worked on at once; it is not multiplied by the number of tasks. With forks=100 and a 20-task play, at most 100 hosts are in flight at any moment, but that batch still generates 2000 task executions (100 hosts * 20 tasks), each with its own SSH operations. That's why ControlMaster and pipelining are critical: they reduce the connection work per task.
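To get a feel for how much the fork pool matters, it can help to count batches. The sketch below is a back-of-the-envelope model only: it assumes the linear strategy and a uniform, made-up per-task duration (`seconds_per_task`), and `estimate_play_seconds` is an illustrative helper, not anything Ansible provides.

```python
import math


def estimate_play_seconds(hosts: int, tasks: int, forks: int,
                          seconds_per_task: float) -> float:
    """Rough wall-clock estimate for a play under the linear strategy.

    Assumption: every task takes the same time on every host. Real
    playbooks never behave this neatly, so treat the result as an
    order-of-magnitude figure.
    """
    # Hosts beyond the fork pool wait in a queue, so each task is
    # processed in ceil(hosts / forks) batches.
    batches_per_task = math.ceil(hosts / forks)
    return tasks * batches_per_task * seconds_per_task


if __name__ == "__main__":
    for forks in (5, 50, 200):
        seconds = estimate_play_seconds(hosts=2000, tasks=20, forks=forks,
                                        seconds_per_task=2.0)
        print(f"forks={forks:<3} -> about {seconds / 60:.1f} minutes")
```

The model ignores controller limits entirely, which is why the real ceiling on forks has to be found by measurement rather than arithmetic.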

Don't Forget ulimit

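Ansible forks consume file descriptors, and each fork holds several (pipes, sockets, temporary files). If your controller's ulimit -n is 1024, the practical forks ceiling is well below 1024. Raise the limit in /etc/security/limits.conf with `* soft nofile 65536` and `* hard nofile 65536`, then log out and back in; the new limit applies to new login sessions.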
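A quick pre-flight check along these lines can be scripted with Python's standard `resource` module (Unix only). This is a minimal sketch: the per-fork descriptor count and the reserve are illustrative guesses, not Ansible constants, and `max_forks_for_limit` is a hypothetical helper.

```python
import resource  # standard library, available on Unix-like systems only


def max_forks_for_limit(nofile_limit: int, fds_per_fork: int = 8,
                        reserved: int = 128) -> int:
    """Return a conservative forks ceiling for a given open-file limit.

    fds_per_fork and reserved are assumptions chosen for illustration;
    measure your own controller before trusting them.
    """
    return max((nofile_limit - reserved) // fds_per_fork, 0)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("open-file limit is unlimited")
    else:
        print(f"soft limit {soft}, hard limit {hard}")
        print(f"conservative forks ceiling: {max_forks_for_limit(soft)}")
```

Running it before a large rollout gives an early warning when the requested forks value is out of proportion to the session's limit.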

Production Insight

We once set forks=500 on a controller with ulimit -n=1024. Ansible started throwing 'Too many open files' errors. We had to kill the process and lower forks. Now we always check ulimit first.

Key Takeaway

Increase forks to match your controller's capacity; start at 50 and scale up while monitoring resources.

SSH Pipelining: Cut Connections in Half

SSH pipelining reduces the number of SSH operations required to execute a module. Without pipelining, Ansible sends the module file to the host via SFTP, then executes it via SSH. With pipelining, it sends the module as part of the SSH session, eliminating the separate file transfer.

Enable in ansible.cfg:

```ini
[ssh_connection]
pipelining = True
```

Requirement: Pipelining requires that requiretty be disabled in the sudoers file on managed hosts. Otherwise, you'll get sudo: sorry, you must have a tty to run sudo errors. Fix with:

```
# /etc/sudoers or /etc/sudoers.d/ansible
Defaults !requiretty
```

Why it matters: Each task without pipelining uses 3 SSH connections (SFTP + exec + cleanup). With pipelining, it's 1. For 1000 hosts and 20 tasks, that's 60,000 vs 20,000 connections. That's a 3x reduction in SSH overhead.
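The arithmetic above can be reproduced in a few lines. This sketch simply multiplies out the article's own figures (three SSH operations per task without pipelining, one with it); the exact count per task varies by module and Ansible version, so the constants are assumptions.

```python
def ssh_operations(hosts: int, tasks: int, ops_per_task: int) -> int:
    """Total SSH operations for one playbook run."""
    return hosts * tasks * ops_per_task


if __name__ == "__main__":
    # Figures taken from the text; ops_per_task is an approximation.
    without_pipelining = ssh_operations(hosts=1000, tasks=20, ops_per_task=3)
    with_pipelining = ssh_operations(hosts=1000, tasks=20, ops_per_task=1)
    ratio = without_pipelining / with_pipelining
    print(f"without pipelining: {without_pipelining}")
    print(f"with pipelining:    {with_pipelining}")
    print(f"reduction factor:   {ratio:.0f}x")
```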

Production Gotcha: Pipelining works with become once requiretty is disabled, but we have seen issues with some modules because the module is fed to the remote interpreter over the SSH session's stdin. In our experience shell and command work fine, while copy with content may fail. Test your playbooks with pipelining enabled on a small set first.

Debug: To verify pipelining is working, run with -vvv and look for the line Pipelining is enabled. If it is missing, the module is being copied to the host as a file before it runs.

Test with -vvv

Run ansible-playbook playbook.yml -vvv | grep -i pipelining to confirm it's enabled. You should see a line containing 'Pipelining is enabled.'

Production Insight

When we first enabled pipelining, we got 'sudo: sorry, you must have a tty to run sudo' on half our hosts. The team had Defaults requiretty in sudoers. We created a playbook to remove it, validating the file before it is saved: lineinfile path=/etc/sudoers regexp='^Defaults\s+requiretty' state=absent validate='visudo -cf %s'. Then pipelining worked.

Key Takeaway

Enable pipelining and disable requiretty on all managed hosts; test with -vvv to confirm.

SSH ControlMaster: Reuse Connections Across Tasks

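ControlMaster is an OpenSSH feature that allows multiplexing multiple SSH sessions over a single TCP connection. Ansible can leverage this to reuse the same SSH connection across multiple tasks on the same host, reducing the overhead of TCP handshakes. Ansible's default ssh_args already include ControlMaster=auto and ControlPersist=60s, so multiplexing is usually lost only when ssh_args is overridden without them; setting the options explicitly keeps the behavior visible and tunable.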

Configure in ansible.cfg:

```ini
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```

ControlMaster=auto: Automatically use a control socket if available.
ControlPersist=60s: Keep the master connection open for 60 seconds after the last session closes. Adjust based on the longest idle gap between tasks on a host.

Why it matters: Without ControlMaster, each task on a host opens a new TCP connection (SYN, SYN-ACK, ACK). With ControlMaster, the first task opens the connection, and subsequent tasks reuse it. For a 20-task playbook on 1000 hosts, that's about 1,000 TCP handshakes instead of 20,000.

Production Gotcha: ControlMaster sockets are stored in ~/.ansible/cp/ by default. If you have many hosts and long ControlPersist, you can accumulate thousands of socket files. Clean them up with ssh -O stop (pointing -o ControlPath at the socket) or set ControlPersist to a reasonable value (60-300s). Also, if the control socket becomes stale (e.g., host rebooted), Ansible may fail. Setting ControlMaster=auto and ControlPersist=60s usually works.

Debug: Check active control sockets with ls -la ~/.ansible/cp/ or ssh -O check -o ControlPath=<socket> hostname.
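When the socket directory grows large, listing it by hand gets tedious. The following sketch walks a directory and returns the Unix socket files older than a chosen age. It assumes the default `~/.ansible/cp` location mentioned above; `find_control_sockets` and the 600-second threshold are illustrative, and the script only reports files, it does not delete them.

```python
import stat
import time
from pathlib import Path
from typing import List, Optional


def find_control_sockets(directory: Path, older_than_seconds: float = 0.0,
                         now: Optional[float] = None) -> List[Path]:
    """Return socket files in `directory` at least `older_than_seconds` old."""
    now = time.time() if now is None else now
    found: List[Path] = []
    if not directory.is_dir():
        return found
    for entry in directory.iterdir():
        info = entry.lstat()
        # Only Unix domain sockets are of interest; skip regular files.
        if stat.S_ISSOCK(info.st_mode) and now - info.st_mtime >= older_than_seconds:
            found.append(entry)
    return sorted(found)


if __name__ == "__main__":
    cp_dir = Path.home() / ".ansible" / "cp"  # default location per the text
    for sock in find_control_sockets(cp_dir, older_than_seconds=600):
        print(sock)
```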

ControlPersist Tuning

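Set ControlPersist slightly longer than the longest idle gap you expect between tasks on a host. For most playbooks, 60s is safe. Too long (>600s) can leave stale sockets. Note that ControlPersist=0 does not disable reuse: OpenSSH treats 0 like yes and keeps the master open indefinitely, while ControlPersist=no closes it together with the initial session.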

Production Insight

We once had ControlPersist=600s and after a large deployment, ~2000 stale sockets remained. The next run tried to use them and got 'Connection refused' because the hosts had rebooted. We added a cleanup task: command: ssh -O stop hostname for each host. Now we use 60s.

Key Takeaway

Add ControlMaster and ControlPersist to ssh_args; keep ControlPersist short (60s) to avoid stale sockets.

Strategy Plugins: Linear vs Free vs Host Pinned

Ansible's strategy plugin determines how tasks are executed across hosts. The default is linear, which waits for all hosts to complete a task before moving to the next. free allows each host to progress independently. host_pinned is like free but ensures tasks on the same host run consecutively without interleaving.

Set in ansible.cfg:

```ini
[defaults]
strategy = free
```

Or per-playbook:

```yaml
- hosts: all
  strategy: free
  tasks:
    - ...
```

When to use each

linear: Required when tasks have inter-host dependencies (e.g., you need all hosts to stop service before any host starts new version).
free: Best for independent hosts. Reduces overall runtime because fast hosts don't wait for slow ones. Risk: race conditions if tasks are not idempotent.
host_pinned: Like free, but a host holds its worker until it finishes the play, so no more than forks hosts are in progress at once. Good when each host's run should complete without interruption (e.g., install package, then configure).

Production Gotcha: With free strategy, task ordering per host is preserved, but across hosts it's unpredictable. If your playbook relies on a global order (e.g., update load balancer before app servers), free will break it. Also, free makes controller load less predictable because different hosts run different tasks at the same time.

Profiling difference: In our tests, free reduced total playbook time by 30-50% compared to linear for independent tasks. host_pinned was similar to free but with better locality.
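Where the savings come from can be shown with a toy model. Under linear, every task costs as much as its slowest host; under free, the play costs as much as the slowest host's total. The sketch below assumes forks is at least the number of hosts and uses invented durations; `linear_runtime` and `free_runtime` are illustrative helpers, not Ansible APIs.

```python
from typing import List


def linear_runtime(durations: List[List[float]]) -> float:
    """durations[h][t] is the time task t takes on host h.

    Linear strategy: each task waits for the slowest host before the
    next task starts, so the total is the sum of per-task maxima.
    """
    n_tasks = len(durations[0])
    return sum(max(host[t] for host in durations) for t in range(n_tasks))


def free_runtime(durations: List[List[float]]) -> float:
    """Free strategy: each host runs through its tasks independently,
    so the play ends when the slowest host finishes."""
    return max(sum(host) for host in durations)


if __name__ == "__main__":
    # Three hosts, three tasks; each host is slow on a different task.
    sample = [
        [1.0, 5.0, 1.0],
        [5.0, 1.0, 1.0],
        [1.0, 1.0, 5.0],
    ]
    print("linear:", linear_runtime(sample))
    print("free:  ", free_runtime(sample))
```

The gap shrinks when the same host is the slowest on every task, which is why measured gains depend heavily on how uneven the fleet is.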

Free Strategy and Idempotency

Free strategy can expose non-idempotent tasks. For example, if two hosts try to write to a shared file concurrently, you'll get corruption. Ensure tasks are idempotent before switching to free.

Production Insight

We switched a 500-host playbook from linear to free and saw runtime drop from 12 minutes to 5. But one task that appended to a shared NFS file caused corruption. We had to refactor that task to be idempotent using lineinfile with regexp.

Key Takeaway

Use free strategy for independent hosts to reduce runtime; use linear when inter-host ordering matters.

Async Tasks: Fire-and-Forget with poll: 0

Async tasks allow Ansible to start a long-running operation on a host and move on without waiting for completion. Set async to the maximum time you expect the task to take (in seconds), and poll: 0 to fire-and-forget. Later, you can check the status with async_status.

Example:

```yaml
- name: Run long script
  shell: /opt/long_script.sh
  async: 3600
  poll: 0
  register: long_task

- name: Check status later
  async_status:
    jid: "{{ long_task.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  # retries * delay should cover the async window (360 * 10s = 3600s)
  retries: 360
  delay: 10
```

Why it matters: Without async, a long task blocks that host's fork for the entire duration. With async, the fork is freed to work on other hosts. This is critical for tasks like database migrations, package installs, or reboots.

Production Gotcha: Async tasks with poll: 0 return immediately, but the job runs in the background. If the playbook ends before the job completes, the job keeps running on the host but its result is never collected, so a failure goes unnoticed. Always use async_status to wait for completion if the result matters. Also, in our experience async tasks were not reliable with the free strategy (we saw job IDs go missing); test that combination before relying on it.

Best practices

Use poll: 0 for truly fire-and-forget tasks (e.g., sending a notification).
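Use poll: 5 (check every 5s) for tasks that might outlive the SSH timeout. Ansible still waits for the task and holds the fork, but the connection does not have to stay open for the whole run.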
Set async value generously to avoid timeout.
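One easy mistake when pairing async with an async_status loop is a polling window shorter than the async timeout: async: 3600 combined with retries: 30 and delay: 10 stops polling after roughly 300 seconds. The sketch below checks that relationship; it treats the window as simply retries times delay, which is an approximation, and both helper names are hypothetical.

```python
import math


def polling_window(retries: int, delay: int) -> int:
    """Approximate seconds an until/retries/delay loop keeps polling."""
    return retries * delay


def retries_needed(async_seconds: int, delay: int) -> int:
    """Smallest retries value whose polling window covers the async timeout."""
    return math.ceil(async_seconds / delay)


def covers_async_window(async_seconds: int, retries: int, delay: int) -> bool:
    """True when the poll loop can outlast the async timeout."""
    return polling_window(retries, delay) >= async_seconds


if __name__ == "__main__":
    print(covers_async_window(async_seconds=3600, retries=30, delay=10))
    print(retries_needed(async_seconds=3600, delay=10))
```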

Async and Reboots

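For reboots, use async: 1 and poll: 0 on the reboot command, then wait_for_connection to wait for the host to come back (for example with connect_timeout: 60 and sleep: 5). async: 0 would run the task synchronously, so the play would hang on the dropped connection. The built-in reboot module wraps this pattern for you.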

Production Insight

We had a playbook that ran a 10-minute DB migration on each host. With async and poll: 0, we started all migrations in parallel and then polled for completion. Total time dropped from 10 minutes per host to 10 minutes total.

Key Takeaway

Use async: <timeout> poll: 0 for long-running tasks to avoid blocking forks; check status with async_status.

Gather Facts Caching: Skip Repetitive Work

By default, Ansible gathers facts at the start of every playbook run. For large environments, this can take minutes. Fact caching stores facts between runs so they are only gathered once per cache timeout.

Enable in ansible.cfg:

```ini
[defaults]
gathering = smart
fact_caching = redis
fact_caching_timeout = 86400
```

Or use `jsonfile`:

```ini
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
```

Why Redis over jsonfile: Redis is faster and handles concurrent access better. jsonfile can have file locking issues when multiple Ansible processes write to the same file. Redis is also easier to flush and inspect.

Install Redis:

```bash
apt install redis-server
pip install redis
```

Production Gotcha: If facts change between runs (e.g., IP address changes), cached facts become stale. Set fact_caching_timeout appropriately (e.g., 24h). For dynamic environments, use gathering = smart which only gathers facts if the cache is missing or expired. To force a refresh, use --flush-cache.

Impact: We saw playbook startup time drop from 30 seconds to 2 seconds after enabling Redis caching on a 500-host environment.

Redis Connection Details

By default, Ansible connects to Redis on localhost:6379. Configure with fact_caching_connection = localhost:6379:0 (host:port:db). Use a dedicated Redis instance for Ansible to avoid eviction.

Production Insight

We used jsonfile caching initially, but after a few concurrent runs, we got 'OSError: [Errno 24] Too many open files' because each playbook opened many cache files. Switched to Redis and the problem vanished.

Key Takeaway

Use Redis for fact caching; set timeout to 86400s (24h); use gathering = smart to only gather when needed.

Redis vs JSON File Fact Caching: A Comparison

Ansible supports multiple backends for fact caching. The two most common are redis and jsonfile. Here's a detailed comparison.

JSON File Caching

Stores facts in individual JSON files per host in a directory.
Simple, no external dependency.
Problems: File locking under concurrent access; slow with many hosts (1000+ files); filesystem overhead.
Config: fact_caching = jsonfile, fact_caching_connection = /path/to/dir.
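Because the jsonfile backend keeps one file per host, cache freshness can be inspected straight from the filesystem. The sketch below lists hosts whose cache file is older than the configured timeout. It assumes file modification time is a fair proxy for cache age and that file names match host names; `stale_fact_files` is an illustrative helper.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_fact_files(cache_dir: Path, timeout_seconds: int,
                     now: Optional[float] = None) -> List[str]:
    """Return names of cache files older than `timeout_seconds`.

    Assumption: one file per host, named after the host, with mtime
    reflecting when the facts were last written.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_file() and now - entry.stat().st_mtime > timeout_seconds:
            stale.append(entry.name)
    return stale


if __name__ == "__main__":
    cache = Path("/tmp/ansible_facts")  # path used in the config above
    if cache.is_dir():
        for host in stale_fact_files(cache, timeout_seconds=86400):
            print(f"stale facts: {host}")
```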

Redis Caching

Stores facts in Redis key-value store.
Fast, concurrent-safe, easy to flush with redis-cli flushall (which wipes every database on that Redis instance, so use it only on a dedicated one).
Requires Redis server and Python redis library.
Config: fact_caching = redis, fact_caching_connection = localhost:6379:0.

Performance: In our tests with 500 hosts, jsonfile took ~5s to read all caches (sequential file reads), while Redis took ~0.5s. Write times were similar.

Recommendation: Use Redis for any environment with >100 hosts or multiple concurrent Ansible runs. For small labs, jsonfile is fine.

Flush Cache on Demand

Use redis-cli --scan --pattern 'ansible_facts*' | xargs redis-cli del to flush Ansible facts from Redis (--scan iterates incrementally, whereas keys can block a busy server). Or use ansible-playbook --flush-cache to force re-gathering.

Production Insight

We had a CI pipeline that ran 10 Ansible playbooks concurrently with jsonfile caching. The cache files got corrupted due to race conditions. Switched to Redis and the corruption stopped.

Key Takeaway

Prefer Redis over jsonfile for production; it's faster and handles concurrency better.

Profiling with callback_whitelist=profile_tasks

Ansible's profile_tasks callback plugin prints the execution time of each task at the end of the playbook. This is invaluable for identifying bottlenecks.

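Enable in ansible.cfg (on ansible-core 2.11 and later the setting is named callbacks_enabled; callback_whitelist is the older, deprecated name):

```ini
[defaults]
callback_whitelist = profile_tasks
```

Or use environment variable: ANSIBLE_CALLBACK_WHITELIST=profile_tasks (ANSIBLE_CALLBACKS_ENABLED on newer releases).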

Output example:

```
Friday 06 October 2023 14:23:45 +0000 (0:00:00.123) 0:00:00.123 *******
===============================================================================
Install packages ------------------------------------------------------- 2.34s
Configure service ------------------------------------------------------ 1.20s
Start service ---------------------------------------------------------- 0.50s
```

How to use: Run your playbook, then look at the summary. Tasks with the highest cumulative time are your targets for optimization. Common culprits: package installs, template rendering, or modules that query APIs.
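For long playbooks, the summary is easier to work with once it is parsed. The sketch below extracts task names and durations from summary lines shaped like the sample output in this section and sorts them slowest first. The regular expression assumes that exact "name, dashes, seconds" layout, which may differ between versions; `parse_profile_summary` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Matches lines such as "Install packages ------- 2.34s".
SUMMARY_LINE = re.compile(r"^(?P<name>.+?)\s-{2,}\s(?P<seconds>\d+(?:\.\d+)?)s$")


def parse_profile_summary(text: str) -> List[Tuple[str, float]]:
    """Return (task name, seconds) pairs, slowest first."""
    results = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            results.append((match.group("name").strip(),
                            float(match.group("seconds"))))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = """\
===============================================================================
Start service ---------------------------------------------------------- 0.50s
Install packages ------------------------------------------------------- 2.34s
Configure service ------------------------------------------------------ 1.20s
"""
    for name, seconds in parse_profile_summary(sample):
        print(f"{seconds:>8.2f}s  {name}")
```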

Production Gotcha: The profile_tasks callback adds overhead (it stores timing data). In our tests, overhead was ~1-2% for playbooks with 50+ tasks. Acceptable for debugging, but remove in production if every second counts.

Alternative: The timer callback reports total playbook duration only. Note that ansible-playbook --timeout sets the connection timeout; it does not profile anything.

Advanced: Combine with profile_roles callback to profile entire roles.

Remove in Production

The profile_tasks callback adds memory overhead. Remove it from ansible.cfg or set ANSIBLE_CALLBACK_WHITELIST='' in production to avoid performance impact.

Production Insight

We profiled a playbook and found that yum module took 30 seconds per host due to slow repo updates. We switched to dnf with --cacheonly and reduced it to 5 seconds.

Key Takeaway

Enable profile_tasks to find slow tasks; remove in production to avoid overhead.

Putting It All Together: A Production ansible.cfg

Here's a production-tested ansible.cfg that incorporates all the tuning discussed:

```ini
[defaults]
forks = 200
host_key_checking = False
timeout = 30
strategy = free
gathering = smart
fact_caching = redis
fact_caching_timeout = 86400
fact_caching_connection = localhost:6379:0
# On ansible-core 2.11 and later this key is named callbacks_enabled
callback_whitelist = profile_tasks

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path = /tmp/ansible-%%h-%%p-%%r

[inventory]
enable_plugins = yaml,ini,script
```
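A config like this is easy to drift away from, so a small lint step can be useful in CI. The sketch below reads ansible.cfg text with Python's standard `configparser` and flags the settings this article cares about. The checks and their wording are illustrative choices, not an official Ansible validation, and `tuning_warnings` is a hypothetical helper.

```python
import configparser
from typing import List


def tuning_warnings(cfg_text: str) -> List[str]:
    """Return human-readable warnings for an ansible.cfg given as text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    warnings = []

    # Missing sections or keys fall back to the defaults named in the text.
    if parser.getint("defaults", "forks", fallback=5) <= 5:
        warnings.append("forks is at the default of 5 or lower")
    if not parser.getboolean("ssh_connection", "pipelining", fallback=False):
        warnings.append("pipelining is disabled")
    ssh_args = parser.get("ssh_connection", "ssh_args", fallback="")
    if ssh_args and "ControlMaster" not in ssh_args:
        warnings.append("ssh_args is set without ControlMaster")
    if parser.get("defaults", "fact_caching", fallback="memory") == "memory":
        warnings.append("fact caching is not persistent")
    return warnings


if __name__ == "__main__":
    sample = """\
[defaults]
forks = 200
fact_caching = redis

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
"""
    problems = tuning_warnings(sample)
    print(problems if problems else "no tuning warnings")
```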

Important adjustments

forks: Tune based on your controller's CPU and ulimit.
strategy: Use free only if tasks are independent; otherwise linear or host_pinned.
callback_whitelist: Remove for production runs.
control_path: Custom path to avoid socket conflicts.

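Testing: Before rolling out, test with a small group of hosts. --limit takes a host pattern rather than a count, so point it at a small group (e.g., --limit canary, where canary is a hypothetical inventory group of about 10 hosts). Monitor controller with htop, nload, and ulimit -n.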

Rollback: Keep a backup of your original ansible.cfg.

Version Control Your Config

Store ansible.cfg in your playbook repository. Use environment-specific overrides via ANSIBLE_CONFIG environment variable.

Production Insight

We once deployed a new ansible.cfg without testing and broke all playbooks because strategy=free caused a dependency issue. Now we always test against a handful of hosts with --limit first.

Key Takeaway

Use the provided ansible.cfg as a starting point; test thoroughly before production rollout.

Common Pitfalls in Ansible Performance Tuning

Even with the best intentions, tuning can backfire. Here are the most common mistakes I've seen:

Setting forks too high: Leads to OOM or file descriptor exhaustion. Always check ulimit -n and monitor memory.
Enabling pipelining without disabling requiretty: Causes sudo errors on many systems. Must be done on all managed hosts.
Using ControlMaster with long ControlPersist: Stale sockets cause connection failures. Keep ControlPersist short (60s).
Using free strategy with non-idempotent tasks: Race conditions and data corruption. Ensure idempotency.
Async tasks with poll:0 and no status check: Jobs may fail silently. Always check with async_status if result matters.
Fact caching with jsonfile under concurrent runs: File corruption. Use Redis.
Leaving profile_tasks enabled in production: Adds overhead. Disable for production runs.
Not testing on a small batch: A misconfiguration can take down all hosts. Always use --limit first.

The Most Dangerous Mistake

Setting forks=500 on a controller with ulimit -n=1024 will cause Ansible to hang or crash. Always check ulimit first.

Production Insight

A colleague once set forks=500 and ran a playbook on 2000 hosts. The controller ran out of file descriptors, and Ansible couldn't even be killed with SIGTERM. We had to reboot the machine.

Key Takeaway

Avoid these common pitfalls by testing changes incrementally and monitoring system limits.

Advanced: Custom Strategy Plugin and Async Patterns

For extreme performance needs, you can write custom strategy plugins or use advanced async patterns.

Custom Strategy Plugin: Ansible allows you to write your own strategy plugin in Python. For example, a strategy that batches hosts based on network topology. This is advanced and requires deep understanding of Ansible internals. See the Ansible documentation for strategy plugin development.

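Async with Batch Processing: Instead of fire-and-forget all at once, you can limit concurrency by running the play in batches with serial (loop_control has no batch option):

```yaml
- hosts: all
  serial: 50   # process hosts 50 at a time
  tasks:
    - name: Start long script
      shell: /opt/long_script.sh
      async: 3600
      poll: 0
      register: async_results

    - name: Wait for this batch to finish
      async_status:
        jid: "{{ async_results.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 360
      delay: 10
```

Waiting on async jobs: For tasks that need to complete before proceeding, use async_status in a loop with until and delay.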

Production Gotcha: Custom strategy plugins are not supported by Ansible Tower/AWX. Stick to built-in strategies unless you control the execution environment.

Community Strategies

The Ansible community has plugins like 'mitogen' that replace the default SSH mechanism with a faster one. However, Mitogen supports only specific ansible-core versions and has tended to lag behind new releases. Check compatibility before adopting it, and use with caution.

Production Insight

We experimented with a custom strategy that prioritized hosts by role. It reduced total time by 10% but was hard to maintain. We reverted to free strategy.

Key Takeaway

Custom strategies are possible but rarely needed; focus on built-in tuning first.

Monitoring and Alerting for Ansible Performance

Once you've tuned Ansible, you need to monitor its performance to catch regressions.

Metrics to track

Playbook execution time per host (use profile_tasks and parse with grep).
Ansible controller CPU, memory, and file descriptor usage.
Redis cache hit rate: redis-cli info stats | grep keyspace_hits.
SSH connection failure rate.
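The Redis hit rate is not reported directly; it has to be derived from the keyspace_hits and keyspace_misses counters in the stats section. The sketch below computes it from the text printed by redis-cli info stats. `cache_hit_rate` is an illustrative helper, and the counters cover the whole Redis instance, not only Ansible's keys.

```python
from typing import Optional


def cache_hit_rate(info_stats: str) -> Optional[float]:
    """Compute hits / (hits + misses) from `redis-cli info stats` output.

    Returns None when no lookups have been recorded yet.
    """
    fields = {}
    for line in info_stats.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip section headers such as "# Stats"
            fields[key] = value
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    sample = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"
    rate = cache_hit_rate(sample)
    print("no lookups yet" if rate is None else f"hit rate: {rate:.1%}")
```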

Tooling

Use ansible-playbook --syntax-check to catch errors before running.
Use ansible-inventory --graph to verify inventory.
Integrate with Prometheus: Export Ansible run duration as a metric using a custom callback.

Alerting: Set alerts for:
Playbook duration > 2x baseline.
Controller CPU > 80% during runs.
Redis memory > 80%.

Production Gotcha: Without monitoring, a config change that slows things down can go unnoticed for days. We once had a junior engineer set forks=5 accidentally, and the deployment time doubled. We caught it via a Grafana dashboard showing playbook duration.

Baseline Your Performance

After tuning, run your playbook 5 times and record the average duration. Use that as a baseline for alerts. Any deviation >20% warrants investigation.
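The baseline rule can be expressed as a small classifier. This sketch averages a handful of run durations and labels a new run using the two thresholds mentioned in this section (more than 20% deviation to investigate, more than 2x baseline to alert). The sample durations are invented and `classify_run` is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence


def baseline_seconds(runs: Sequence[float]) -> float:
    """Average duration of the baseline runs."""
    return mean(runs)


def classify_run(duration: float, baseline: float,
                 investigate_at: float = 0.20, alert_factor: float = 2.0) -> str:
    """Label a run as 'ok', 'investigate', or 'alert'."""
    if duration > alert_factor * baseline:
        return "alert"
    deviation = abs(duration - baseline) / baseline
    if deviation > investigate_at:
        return "investigate"
    return "ok"


if __name__ == "__main__":
    # Five made-up baseline runs, in seconds.
    baseline = baseline_seconds([240, 250, 230, 245, 235])
    for duration in (250, 300, 500):
        print(duration, classify_run(duration, baseline))
```

Deviation is measured in both directions on purpose: a run that finishes much faster than baseline often means hosts were skipped.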

Production Insight

We set up a Prometheus exporter that captures playbook duration from Ansible logs. Now we get alerted if any playbook takes longer than 10 minutes (our baseline is 4).

Key Takeaway

Monitor Ansible performance with metrics and alerts to catch regressions quickly.

Production incident · POST-MORTEM · severity: high

The 45-Minute Playbook That Became 4 Minutes

Symptom

Playbook execution time ~45 minutes for 2000 hosts. CPU on Ansible controller was idle; network utilization low. Many hosts timed out with 'Timeout (12s) waiting for privilege escalation'.

Assumption

The team assumed it was a network bandwidth issue or that Ansible was inherently slow at scale.

Root cause

Default forks=5 meant only 5 hosts were processed at a time. Pipelining was disabled, causing 3 SSH connections per task per host. With 2000 hosts and ~20 tasks, that's 120,000 SSH connections. ControlMaster was also disabled, so each connection was a full TCP handshake.

Fix

Set forks=100, pipelining=True, and ssh_args = -o ControlMaster=auto -o ControlPersist=60s in ansible.cfg. Also disabled requiretty in sudoers on managed hosts.

Key lesson

Default Ansible settings are for small labs.
Always override forks, pipelining, and ControlMaster for production.
Test with a small batch first to avoid overwhelming the controller or network.

Production debug guide: Symptom → Root cause → Fix (4 entries)

Symptom · 01

Playbook runs slowly; CPU on controller is low; many hosts waiting.

→

Fix

Root cause: forks too low. Fix: Increase forks in ansible.cfg (e.g., forks=100). Monitor system limits: ulimit -n must be > forks.

Symptom · 02

Frequent 'Timeout (12s) waiting for privilege escalation' errors.

→

Fix

Root cause: pipelining=False causing extra SSH connections and sudo prompt issues. Fix: Set pipelining=True in ansible.cfg and ensure requiretty is disabled in sudoers on managed hosts.

Symptom · 03

High number of SSH connections per second; controller running out of file descriptors.

→

Fix

Root cause: ControlMaster not used. Fix: Add ssh_args = -o ControlMaster=auto -o ControlPersist=60s to ansible.cfg. Also increase ulimit -n on controller.

Symptom · 04

Some hosts finish tasks much earlier than others; playbook still waits for all.

→

Fix

Root cause: strategy=linear waits for all hosts to complete each task before proceeding. Fix: Use strategy=free if task order across hosts is not critical, or strategy=host_pinned for host-local ordering.

Ansible Performance Tuning Quick Reference

Playbook too slow, low CPU usage

Immediate action

Check current forks value

Commands

grep -i forks /etc/ansible/ansible.cfg

ulimit -n

Fix now

Set forks=50 in ansible.cfg

Ansible Performance Tuning Techniques Comparison

Technique	Impact	Complexity	Risk
Forks	High	Low	Medium (resource exhaustion)
SSH Pipelining	Medium	Low	Low (requiretty issue)
ControlMaster	Medium	Low	Low (stale sockets)
Strategy Plugin	High	Medium	Medium (race conditions)
Async Tasks	Medium	Medium	Low (lost jobs)
Fact Caching (Redis)	High	Medium	Low (stale facts)
Profile Tasks	Low (debug)	Low	Low (overhead)

Key takeaways

Increase forks to 50-200 based on controller capacity; always check ulimit -n first.

Enable SSH pipelining and disable requiretty on managed hosts for 2-3x fewer SSH connections.

Use ControlMaster with ControlPersist=60s to reuse SSH connections across tasks.

Switch strategy to 'free' for independent hosts; use 'linear' if ordering matters.

Use async with poll:0 for long-running tasks; check status with async_status.

Cache facts in Redis with gathering=smart to avoid re-gathering on every run.

Enable profile_tasks callback to identify slow tasks; remove in production.

Always test tuning changes with --limit on a small batch before full rollout.

Common mistakes to avoid

6 patterns

Setting forks too high without checking ulimit

Symptom

Ansible crashes with 'Too many open files' or OOM

Fix

Check ulimit -n; keep forks well below it (each fork holds several descriptors) and raise the limit before raising forks

Enabling pipelining without disabling requiretty

Symptom

sudo: sorry, you must have a tty to run sudo

Fix

Add 'Defaults !requiretty' to sudoers on all managed hosts

Using ControlMaster with ControlPersist=600s

Symptom

Stale sockets cause 'Connection refused' on subsequent runs

Fix

Set ControlPersist=60s; clean stale sockets with ssh -O stop

Using free strategy with non-idempotent tasks

Symptom

Race conditions, corrupted shared files, inconsistent state

Fix

Refactor tasks to be idempotent; use linear if ordering matters

Async with poll:0 and never checking status

Symptom

Jobs fail silently; playbook thinks they succeeded

Fix

Use async_status to verify completion if result matters

Using jsonfile fact caching for concurrent runs

Symptom

Cache file corruption, 'Too many open files' errors

Fix

Switch to Redis fact caching

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

What is the default forks value in Ansible and why is it problematic for...

Q02 · SENIOR

Explain SSH pipelining in Ansible and its requirement.

Q03 · SENIOR

How does ControlMaster improve Ansible performance and what is a common ...

Q04 · SENIOR

Compare linear, free, and host_pinned strategies. When would you use eac...

Q05 · SENIOR

How do you implement fire-and-forget tasks in Ansible and how do you che...

Q06 · SENIOR

What are the benefits of Redis over jsonfile for fact caching?

Q07 · SENIOR

How do you profile Ansible playbooks to find performance bottlenecks?

Q08 · SENIOR

What system limits should you check before increasing forks?

Q01 of 08 · JUNIOR

What is the default forks value in Ansible and why is it problematic for large environments?

ANSWER

The default forks value is 5. This means only 5 hosts are processed in parallel per task. For large environments (e.g., 1000 hosts), this serializes execution and causes long runtimes. Increasing forks to 50-200 can drastically reduce total time, but requires monitoring controller resources.
