Ansible Performance Tuning: From 10x Slower to 10x Faster with Forks, Pipelining, and Async
Production-tested Ansible tuning: forks, SSH pipelining, ControlMaster, strategy plugins, async, fact caching with Redis, and profiling with profile_tasks.
20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.
Set forks = 50 in ansible.cfg to run up to 50 hosts in parallel; check the controller's open-file limit with ulimit -n.
Enable pipelining = True to reduce SSH connections by ~50%; requires requiretty disabled in sudoers.
Use ssh_args = -o ControlMaster=auto -o ControlPersist=60s to reuse SSH connections across tasks.
Switch strategy to free for independent host execution; host_pinned for locality; linear for strict ordering.
For fire-and-forget tasks, use async: 300 and poll: 0; check status with async_status module.
Cache facts in Redis: fact_caching = redis with fact_caching_timeout = 86400; avoid JSON file contention.
Enable profiling: callback_whitelist = profile_tasks in ansible.cfg (the setting is named callbacks_enabled from ansible-core 2.11 onward); parse output with grep to find slow tasks.
Always test tuning changes on a small batch before rolling out; monitor Ansible controller CPU and memory.
Imagine you're a chef in a busy restaurant kitchen. Each order (host) needs a series of steps: chop vegetables, boil water, cook pasta. If you do all orders one by one, it's slow. Ansible's forks setting is like hiring more chefs to work in parallel—each chef handles one order. SSH pipelining is like having a direct conveyor belt from prep to stove instead of walking each ingredient. ControlMaster is like keeping a dedicated line to each station so you don't have to dial the phone every time. Strategy plugins are different kitchen workflows: linear is a strict assembly line (one step for all orders, then next), free lets each chef work independently on their order, and host_pinned keeps a chef at one station. Async tasks are like putting a pot on low heat and walking away—you check later if it's done. Fact caching is like having a cheat sheet of each ingredient's properties so you don't have to look them up every time. Profiling is a stopwatch that shows which step takes the longest so you can optimize it.
I once managed a 2000-node deployment that took 45 minutes to run a simple playbook. The team was blaming network latency, but the real culprit was Ansible's default configuration. By the time we finished tuning, the same playbook ran in under 4 minutes. That incident taught me that Ansible's defaults are designed for small environments, not production at scale.
Ansible's performance bottlenecks are often misdiagnosed. Engineers blame slow SSH, slow hosts, or the tool itself, but the root cause is almost always configuration. The defaults prioritize compatibility over speed: forks=5, pipelining=False, linear strategy, no fact caching. For a handful of hosts, that's fine. For hundreds or thousands, it's a disaster.
This article covers the five most impactful tuning levers: forks parallelism, SSH pipelining and ControlMaster, strategy plugins, async tasks, and fact caching. I'll also show how to profile your playbooks to find the real slow spots. Every recommendation comes from production experience—including the gotchas that can break your deployment if you're not careful.
By the end, you'll know how to make Ansible run 5-10x faster on large infrastructures, and more importantly, how to avoid the common mistakes that lead to timeouts, connection failures, and inconsistent state.
Forks: The First Lever for Parallelism
The forks setting in Ansible controls the maximum number of hosts that can be processed in parallel for any given task. The default is 5, which is absurdly low for any environment with more than a handful of servers. Increasing forks is the single most impactful change you can make.
In ansible.cfg:

```ini
[defaults]
forks = 100
```
You can also set it via environment variable: ANSIBLE_FORKS=100.
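To get a feel for what the forks value buys you, here is a rough back-of-the-envelope sketch. It assumes every host takes about the same time per task, so a task finishes in "waves" of at most `forks` hosts; real runs refill workers as they free up, so treat the numbers as an approximation. The function name `waves_per_task` is made up for this illustration.

```python
import math


def waves_per_task(hosts: int, forks: int) -> int:
    """Approximate number of sequential batches needed to run one task on every host."""
    if hosts < 0 or forks < 1:
        raise ValueError("hosts must be >= 0 and forks must be >= 1")
    return math.ceil(hosts / forks)


# Compare the default (5) with a few tuned values for a 2000-host inventory.
for forks in (5, 50, 100, 200):
    print(f"forks={forks:>3}: {waves_per_task(2000, forks)} waves per task")
```

With the default of 5, each task on 2000 hosts needs 400 waves; at forks=200 it needs 10.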
Production Gotcha: Setting forks too high can overwhelm the Ansible controller's CPU, memory, and file descriptor limits. Each fork uses a separate SSH process. On Linux, check your ulimit -n (open file limit). For 100 forks, you need at least 100 file descriptors per task (plus overhead). Also, network bandwidth and target host capacity are factors. I've seen controllers become unresponsive with forks=500 on a 2GB VM.
How to find the right value: Start with 50, monitor controller CPU and memory with htop. Increase by 25 until CPU reaches ~70% or you see connection errors. In our production environment with 2000 hosts and 16-core controllers, we settled on forks=200.
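Before raising forks, it can help to sanity-check the controller's open-file limit. The sketch below reads the current limit with Python's standard `resource` module (Unix only) and compares it against a rough budget. The `fds_per_fork` and `reserve` values are assumptions picked for illustration, not numbers published by Ansible; measure your own controller before trusting them.

```python
import resource


def fd_budget_ok(forks: int, soft_limit: int,
                 fds_per_fork: int = 10, reserve: int = 256) -> bool:
    """Return True if the soft open-file limit covers the estimated need.

    fds_per_fork and reserve are illustrative guesses, not Ansible constants.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return forks * fds_per_fork + reserve <= soft_limit


# Same value that `ulimit -n` reports for the current process.
soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
for forks in (50, 100, 200):
    verdict = "ok" if fd_budget_ok(forks, soft) else "raise ulimit -n first"
    print(f"forks={forks}: soft limit {soft} -> {verdict}")
```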
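Another gotcha: forks caps concurrency per task, not per playbook. With forks=100, at most 100 hosts are being worked on at any moment, no matter how many tasks the playbook has. What grows with the task count is the cumulative number of connections: a 20-task playbook opens about 2,000 SSH connections for every 100 hosts (100 hosts * 20 tasks) unless connections are reused. That's why ControlMaster and pipelining are critical: they reduce the number of connections and SSH operations per task.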
Raise the open-file limit with soft nofile 65536 and hard nofile 65536. Reboot or restart SSH.
SSH Pipelining: Cut Connections in Half
SSH pipelining reduces the number of SSH operations required to execute a module. Without pipelining, Ansible sends the module file to the host via SFTP, then executes it via SSH. With pipelining, it sends the module as part of the SSH session, eliminating the separate file transfer.
Enable in ansible.cfg:

```ini
[ssh_connection]
pipelining = True
```
Requirement: Pipelining requires that requiretty be disabled in the sudoers file on managed hosts. Otherwise, you'll get sudo: sorry, you must have a tty to run sudo errors. Fix with:

```
# /etc/sudoers or /etc/sudoers.d/ansible
Defaults !requiretty
```
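Why it matters: By the count used here, each task without pipelining costs 3 SSH operations (SFTP + exec + cleanup). With pipelining, it's 1. For 1000 hosts and 20 tasks, that's 60,000 vs 20,000 operations, a 3x reduction in SSH overhead. The exact number of operations per task varies by module and Ansible version, so treat the ratio as an estimate.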
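The arithmetic behind those figures is simple enough to check in a few lines. This is a sketch of the estimate only: the per-task counts (3 without pipelining, 1 with) are the ones quoted above, not values measured from Ansible, and `ssh_operations` is a made-up helper name.

```python
def ssh_operations(hosts: int, tasks: int, ops_per_task: int) -> int:
    """Total SSH operations for a run: one set of operations per host per task."""
    return hosts * tasks * ops_per_task


without_pipelining = ssh_operations(hosts=1000, tasks=20, ops_per_task=3)
with_pipelining = ssh_operations(hosts=1000, tasks=20, ops_per_task=1)

print(f"without pipelining: {without_pipelining}")
print(f"with pipelining:    {with_pipelining}")
print(f"reduction factor:   {without_pipelining / with_pipelining:.1f}x")
```

Plug in your own host and task counts to see how quickly the overhead grows with inventory size.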
Production Gotcha: Pipelining can cause issues with modules that need to run as a different user or with become. The module's stdin is consumed by the SSH session. Some modules like shell and command work fine, but copy with content may fail. Test your playbooks with pipelining enabled on a small set first.
Debug: To verify pipelining is working, run with -vvv and look for Using module file vs Pipelining is enabled.
Run ansible-playbook playbook.yml -vvv | grep -i pipelining to confirm it's enabled. You should see 'Pipelining is enabled. Sending module via pipelining.'
The culprit was Defaults requiretty in sudoers. We created a playbook to remove it: lineinfile path=/etc/sudoers regexp='^Defaults requiretty' state=absent. Then pipelining worked.
SSH ControlMaster: Reuse Connections Across Tasks
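ControlMaster is an OpenSSH feature that multiplexes multiple SSH sessions over a single TCP connection. Ansible can use it to reuse the same SSH connection across multiple tasks on the same host, reducing the overhead of TCP handshakes. Current Ansible releases already pass ControlMaster=auto and ControlPersist=60s in their default ssh_args, so setting them explicitly matters mainly when you override ssh_args or want a different persist time.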
Configure in ansible.cfg:

```ini
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```
- ControlMaster=auto: Automatically use a control socket if available.
- ControlPersist=60s: Keep the master connection open for 60 seconds after the last session closes. Adjust based on task duration.
Why it matters: Without ControlMaster, each task on a host opens a new TCP connection (SYN, SYN-ACK, ACK). With ControlMaster, the first task opens the connection, and subsequent tasks reuse it as long as the socket persists. For a 20-task playbook on 1000 hosts, that's 1,000 handshakes instead of 20,000, a saving of roughly 19,000.
Production Gotcha: ControlMaster sockets are stored in ~/.ansible/cp/ by default. If you have many hosts and long ControlPersist, you can accumulate thousands of socket files. Clean them up with ssh -O stop or set ControlPersist to a reasonable value (60-300s). Also, if the control socket becomes stale (e.g., host rebooted), Ansible may fail. Setting ControlMaster=auto and ControlPersist=60s usually works.
Debug: Check active control sockets with ls -la ~/.ansible/cp/ or ssh -O check hostname.
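If you want to spot socket buildup before it bites, a small script can list old entries in the control path directory. The sketch below uses only the Python standard library; the function name and the 300-second threshold are assumptions for illustration. Age alone does not prove a socket is dead, so treat the result as a list of candidates to verify with ssh -O check.

```python
import time
from pathlib import Path


def old_control_sockets(cp_dir, max_age_s, now=None):
    """Return entries in cp_dir whose modification time is older than max_age_s."""
    now = time.time() if now is None else now
    directory = Path(cp_dir).expanduser()
    if not directory.is_dir():
        return []
    old = []
    for entry in directory.iterdir():
        # lstat() so that a dangling symlink does not raise.
        if now - entry.lstat().st_mtime > max_age_s:
            old.append(entry)
    return sorted(old)


if __name__ == "__main__":
    # ~/.ansible/cp is the default location mentioned above.
    for path in old_control_sockets("~/.ansible/cp", max_age_s=300):
        print(path)
```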
command: ssh -O stop hostname for each host. Now we use 60s.
Strategy Plugins: Linear vs Free vs Host Pinned
Ansible's strategy plugin determines how tasks are executed across hosts. The default is linear, which waits for all hosts to complete a task before moving to the next. free allows each host to progress independently. host_pinned is like free but ensures tasks on the same host run consecutively without interleaving.
Set in ansible.cfg:

```ini
[defaults]
strategy = free
```
Or per-playbook:

```yaml
- hosts: all
  strategy: free
  tasks:
    - ...
```
- linear: Required when tasks have inter-host dependencies (e.g., you need all hosts to stop service before any host starts new version).
- free: Best for independent hosts. Reduces overall runtime because fast hosts don't wait for slow ones. Risk: race conditions if tasks are not idempotent.
- host_pinned: Good for hosts with sequential task dependencies (e.g., install package, then configure). Reduces context switching overhead.
Production Gotcha: With free strategy, task ordering per host is preserved, but across hosts it's unpredictable. If your playbook relies on a global order (e.g., update load balancer before app servers), free will break it. Also, free can cause higher peak load on the controller because many hosts may execute the same task simultaneously.
Profiling difference: In our tests, free reduced total playbook time by 30-50% compared to linear for independent tasks. host_pinned was similar to free but with better locality.
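A toy model shows where that saving comes from. The sketch below is not how Ansible schedules work internally; it assumes forks is at least the number of hosts and ignores controller overhead. Under linear, every task costs as long as its slowest host; under free, the run costs as long as the slowest host's own total. The host names and durations are invented.

```python
def linear_runtime(durations):
    """Each task waits for the slowest host before the next task starts."""
    return sum(max(per_task) for per_task in zip(*durations.values()))


def free_runtime(durations):
    """Each host runs its tasks back to back; the run ends with the slowest host."""
    return max((sum(tasks) for tasks in durations.values()), default=0)


# Seconds per task for each host (hypothetical numbers).
durations = {
    "web1": [1, 5, 1],
    "web2": [5, 1, 1],
    "web3": [1, 1, 5],
}

print("linear:", linear_runtime(durations))  # 5 + 5 + 5
print("free:  ", free_runtime(durations))    # max(7, 7, 7)
```

The gap is largest when slow tasks land on different hosts; if one host is slow at everything, both strategies finish at about the same time.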
lineinfile with regexp.
Use the free strategy for independent hosts to reduce runtime; use linear when inter-host ordering matters.
Async Tasks: Fire-and-Forget with poll: 0
Async tasks allow Ansible to start a long-running operation on a host and move on without waiting for completion. Set async to the maximum time you expect the task to take (in seconds), and poll: 0 to fire-and-forget. Later, you can check the status with async_status.
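Example:

```yaml
# Start the job and return immediately (poll: 0).
- name: Run long script
  shell: /opt/long_script.sh
  async: 3600   # upper bound on job runtime, in seconds
  poll: 0
  register: long_task

# Poll the job by its ID until it reports finished.
# Note: retries * delay is roughly 300 s of waiting; raise retries
# if the script can really run for the full hour.
- name: Check status later
  async_status:
    jid: "{{ long_task.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30
  delay: 10
```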
Why it matters: Without async, a long task blocks that host's fork for the entire duration. With async, the fork is freed to work on other hosts. This is critical for tasks like database migrations, package installs, or reboots.
Production Gotcha: Async tasks with poll: 0 return immediately, and the job keeps running in the background on the managed host. If the playbook ends before the job completes, nothing collects the result, so a failure goes unnoticed. Always use async_status to wait for completion if the result matters. Also, async tasks cannot be used with free strategy reliably because the job ID may be lost.
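The until / retries / delay pattern used with async_status is a plain polling loop. Here is a simplified Python model of it, useful mainly for checking that your polling window actually covers the job's expected runtime. It is a sketch, not Ansible's implementation, and the exact attempt count in Ansible may differ by one; `wait_for_job` and `polling_window_s` are hypothetical names.

```python
import time
from typing import Callable


def polling_window_s(retries: int, delay: float) -> float:
    """Approximate time spent waiting before the poller gives up."""
    return retries * delay


def wait_for_job(is_finished: Callable[[], bool], retries: int, delay: float,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Check is_finished() up to `retries` times, sleeping `delay` between checks."""
    for attempt in range(retries):
        if is_finished():
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False


# A job allowed to run for 3600 s, polled 30 times every 10 s, is only
# watched for about 300 s.
print(polling_window_s(retries=30, delay=10))
```

If the window is shorter than the async value, the status check can give up while the job is still healthy and running.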
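- Use poll: 0 for truly fire-and-forget tasks (e.g., sending a notification).
- Use poll: 5 (check every 5s) for long tasks you want to wait on without holding one SSH connection open the whole time. The play still blocks on that task until it finishes or hits the async limit.
- Set the async value generously to avoid timeout.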
For reboots, use async: 1 poll: 0 (async: 0 would run the task synchronously) and then wait_for_connection to wait for the host to come back. Example: - name: Reboot; async: 1; poll: 0; then - wait_for_connection:; connect_timeout: 60; sleep: 5.
Use async: <timeout> poll: 0 for long-running tasks to avoid blocking forks; check status with async_status.
Gather Facts Caching: Skip Repetitive Work
By default, Ansible gathers facts at the start of every playbook run. For large environments, this can take minutes. Fact caching stores facts between runs so they are only gathered once per cache timeout.
Enable in ansible.cfg:

```ini
[defaults]
gathering = smart
fact_caching = redis
fact_caching_timeout = 86400
```
Or use `jsonfile`:

```ini
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
```
Why Redis over jsonfile: Redis is faster and handles concurrent access better. jsonfile can have file locking issues when multiple Ansible processes write to the same file. Redis is also easier to flush and inspect.
Install Redis:

```bash
apt install redis-server
pip install redis
```
Production Gotcha: If facts change between runs (e.g., IP address changes), cached facts become stale. Set fact_caching_timeout appropriately (e.g., 24h). For dynamic environments, use gathering = smart which only gathers facts if the cache is missing or expired. To force a refresh, use --flush-cache.
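The timeout and flush behavior is easier to reason about with a small model. The sketch below is a toy in-memory cache, not Ansible's cache plugin: facts are reused while they are younger than the timeout, re-gathered once they expire, and dropped on flush. The class name, the `gather` callback, and the injectable clock are all assumptions made for the example.

```python
import time


class FactCache:
    """Toy time-to-live cache that mimics fact_caching_timeout semantics."""

    def __init__(self, timeout_s, clock=time.time):
        self._timeout_s = timeout_s
        self._clock = clock
        self._store = {}  # host -> (stored_at, facts)

    def get(self, host, gather):
        """Return cached facts if still fresh, otherwise call gather(host)."""
        now = self._clock()
        entry = self._store.get(host)
        if entry is not None and now - entry[0] < self._timeout_s:
            return entry[1]
        facts = gather(host)
        self._store[host] = (now, facts)
        return facts

    def flush(self):
        """Drop everything, like running with --flush-cache."""
        self._store.clear()


cache = FactCache(timeout_s=86400)
facts = cache.get("web1", lambda host: {"hostname": host})
print(facts)
```

The model also shows the staleness risk: anything that changes on a host within the timeout stays invisible until the entry expires or the cache is flushed.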
Impact: We saw playbook startup time drop from 30 seconds to 2 seconds after enabling Redis caching on a 500-host environment.
Set fact_caching_connection = localhost:6379:0 (host:port:db). Use a dedicated Redis instance for Ansible to avoid eviction.
Use gathering = smart to only gather when needed.
Redis vs JSON File Fact Caching: A Comparison
Ansible supports multiple backends for fact caching. The two most common are redis and jsonfile. Here's a detailed comparison.
jsonfile:
- Stores facts in individual JSON files per host in a directory.
- Simple, no external dependency.
- Problems: File locking under concurrent access; slow with many hosts (1000+ files); filesystem overhead.
- Config: fact_caching = jsonfile, fact_caching_connection = /path/to/dir.
redis:
- Stores facts in Redis key-value store.
- Fast, concurrent-safe, easy to flush with redis-cli flushall.
- Requires Redis server and Python redis library.
- Config: fact_caching = redis, fact_caching_connection = localhost:6379:0.
Performance: In our tests with 500 hosts, jsonfile took ~5s to read all caches (sequential file reads), while Redis took ~0.5s. Write times were similar.
Recommendation: Use Redis for any environment with >100 hosts or multiple concurrent Ansible runs. For small labs, jsonfile is fine.
Run redis-cli keys 'ansible_facts*' | xargs redis-cli del to flush Ansible facts from Redis. Or use ansible-playbook --flush-cache to force re-gathering.
Profiling with callback_whitelist=profile_tasks
Ansible's profile_tasks callback plugin prints the execution time of each task at the end of the playbook. This is invaluable for identifying bottlenecks.
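Enable in ansible.cfg:

```ini
[defaults]
callback_whitelist = profile_tasks
```

Or use environment variable: ANSIBLE_CALLBACK_WHITELIST=profile_tasks. From ansible-core 2.11 onward the setting is named callbacks_enabled (ANSIBLE_CALLBACKS_ENABLED) and callback_whitelist is deprecated; recent releases also ship the plugin in the ansible.posix collection.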
Output example:

```
Friday 06 October 2023 14:23:45 +0000 (0:00:00.123) 0:00:00.123 *******
===============================================================================
Install packages ------------------------------------------------------- 2.34s
Configure service ------------------------------------------------------ 1.20s
Start service ---------------------------------------------------------- 0.50s
```
How to use: Run your playbook, then look at the summary. Tasks with the highest cumulative time are your targets for optimization. Common culprits: package installs, template rendering, or modules that query APIs.
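If you want something sturdier than grep, the summary lines are easy to parse. The sketch below assumes the "task name, run of dashes, seconds" layout shown in the output example above and ignores every other line; the function name is made up, and the regex may need adjusting if your Ansible version formats the summary differently.

```python
import re

# "<task name> ------- 2.34s"
SUMMARY_LINE = re.compile(r"^(?P<name>.+?)\s-{3,}\s(?P<secs>\d+(?:\.\d+)?)s$")


def slowest_tasks(output, top=5):
    """Return (task name, seconds) pairs from a profile_tasks summary, slowest first."""
    timings = []
    for line in output.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            timings.append((match.group("name"), float(match.group("secs"))))
    return sorted(timings, key=lambda item: item[1], reverse=True)[:top]


sample = """\
Friday 06 October 2023 14:23:45 +0000 (0:00:00.123) 0:00:00.123 *******
===============================================================================
Configure service ------------------------------------------------------ 1.20s
Install packages ------------------------------------------------------- 2.34s
Start service ---------------------------------------------------------- 0.50s
"""

for name, secs in slowest_tasks(sample):
    print(f"{secs:6.2f}s  {name}")
```

Feeding it the saved output of several runs makes it easy to see whether the same tasks keep topping the list.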
Production Gotcha: The profile_tasks callback adds overhead (it stores timing data). In our tests, overhead was ~1-2% for playbooks with 50+ tasks. Acceptable for debugging, but remove in production if every second counts.
Alternative: ansible-playbook --timeout only sets the connection timeout; it does not profile anything.
Advanced: Combine with profile_roles callback to profile entire roles.
Set ANSIBLE_CALLBACK_WHITELIST='' in production to avoid performance impact.
The yum module took 30 seconds per host due to slow repo updates. We switched to dnf with --cacheonly and reduced it to 5 seconds.
Putting It All Together: A Production ansible.cfg
Here's a production-tested ansible.cfg that incorporates all the tuning discussed:
```ini
[defaults]
forks = 200
host_key_checking = False
timeout = 30
strategy = free
gathering = smart
fact_caching = redis
fact_caching_timeout = 86400
fact_caching_connection = localhost:6379:0
callback_whitelist = profile_tasks

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path = /tmp/ansible-%%h-%%p-%%r

[inventory]
enable_plugins = yaml,ini,script
```
- forks: Tune based on your controller's CPU and ulimit.
- strategy: Use free only if tasks are independent; otherwise linear or host_pinned.
- callback_whitelist: Remove for production runs.
- control_path: Custom path to avoid socket conflicts.
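Testing: Before rolling out, test with a small group of hosts. Note that --limit takes a host pattern rather than a count, so pick a subset with something like --limit 'webservers[0:9]' (webservers is a placeholder group name). Monitor the controller with htop and nload, and check ulimit -n.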
Rollback: Keep a backup of your original ansible.cfg.
You can point Ansible at an alternate config file with the ANSIBLE_CONFIG environment variable.
strategy=free caused a dependency issue. Now we always test with --limit 5 first.
Common Pitfalls in Ansible Performance Tuning
Even with the best intentions, tuning can backfire. Here are the most common mistakes I've seen:
- Setting forks too high: Leads to OOM or file descriptor exhaustion. Always check ulimit -n and monitor memory.
- Enabling pipelining without disabling requiretty: Causes sudo errors on many systems. Must be done on all managed hosts.
- Using ControlMaster with long ControlPersist: Stale sockets cause connection failures. Keep ControlPersist short (60s).
- Using free strategy with non-idempotent tasks: Race conditions and data corruption. Ensure idempotency.
- Async tasks with poll:0 and no status check: Jobs may fail silently. Always check with async_status if result matters.
- Fact caching with jsonfile under concurrent runs: File corruption. Use Redis.
- Leaving profile_tasks enabled in production: Adds overhead. Disable for production runs.
- Not testing on a small batch: A misconfiguration can take down all hosts. Always use --limit first.
Advanced: Custom Strategy Plugin and Async Patterns
For extreme performance needs, you can write custom strategy plugins or use advanced async patterns.
Custom Strategy Plugin: Ansible allows you to write your own strategy plugin in Python. For example, a strategy that batches hosts based on network topology. This is advanced and requires deep understanding of Ansible internals. See the Ansible documentation for strategy plugin development.
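Async with Batch Processing: Instead of fire-and-forget on every host at once, you can limit concurrency by batching hosts. loop_control has no batch option, so the batching is done at play level with serial, and each batch waits for its jobs before the next one starts:

```yaml
- hosts: all
  serial: 50            # work through the inventory 50 hosts at a time
  tasks:
    - name: Start long script on this batch
      shell: /opt/long_script.sh
      async: 3600
      poll: 0
      register: async_result

    - name: Wait for this batch to finish
      async_status:
        jid: "{{ async_result.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 360      # with delay: 10, roughly matches async: 3600
      delay: 10
```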
Waiting on async jobs: For tasks that need to complete before proceeding, use async_status in a loop with until and delay.
Production Gotcha: Custom strategy plugins are not supported by Ansible Tower/AWX. Stick to built-in strategies unless you control the execution environment.
Monitoring and Alerting for Ansible Performance
Once you've tuned Ansible, you need to monitor its performance to catch regressions.
- Playbook execution time per host (use profile_tasks and parse with grep).
- Ansible controller CPU, memory, and file descriptor usage.
- Redis cache hit rate: redis-cli info stats | grep keyspace_hits.
- SSH connection failure rate.
- Use ansible-playbook --syntax-check to catch errors before running.
- Use ansible-inventory --graph to verify inventory.
- Integrate with Prometheus: Export Ansible run duration as a metric using a custom callback.
Alerting: Set alerts for:
- Playbook duration > 2x baseline.
- Controller CPU > 80% during runs.
- Redis memory > 80%.
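Those three thresholds translate directly into a small check you could run after each playbook. The sketch below is only an illustration of the rules as listed: the `RunMetrics` fields and the `alerts` function are hypothetical names, and how you collect the numbers (Prometheus, a callback, a cron job) is up to you.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    duration_s: float          # wall-clock time of the playbook run
    controller_cpu_pct: float  # peak controller CPU during the run
    redis_memory_pct: float    # Redis memory used, as a share of its limit


def alerts(metrics: RunMetrics, baseline_duration_s: float) -> list:
    """Return a message for every threshold from the list above that is exceeded."""
    triggered = []
    if metrics.duration_s > 2 * baseline_duration_s:
        triggered.append("playbook duration > 2x baseline")
    if metrics.controller_cpu_pct > 80:
        triggered.append("controller CPU > 80%")
    if metrics.redis_memory_pct > 80:
        triggered.append("Redis memory > 80%")
    return triggered


# A run that took 9 minutes against a 4-minute baseline.
print(alerts(RunMetrics(duration_s=540, controller_cpu_pct=65, redis_memory_pct=40),
             baseline_duration_s=240))
```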
Production Gotcha: Without monitoring, a config change that slows things down can go unnoticed for days. We once had a junior engineer set forks=5 accidentally, and the deployment time doubled. We caught it via a Grafana dashboard showing playbook duration.
The 45-Minute Playbook That Became 4 Minutes
The fix: we set forks=100, pipelining=True, and ssh_args = -o ControlMaster=auto -o ControlPersist=60s in ansible.cfg. Also disabled requiretty in sudoers on managed hosts.
- Default Ansible settings are for small labs.
- Always override forks, pipelining, and ControlMaster for production.
- Test with a small batch first to avoid overwhelming the controller or network.
- forks too low. Fix: Increase forks in ansible.cfg (e.g., forks=100). Monitor system limits: ulimit -n must be > forks.
- pipelining=False causing extra SSH connections and sudo prompt issues. Fix: Set pipelining=True in ansible.cfg and ensure requiretty is disabled in sudoers on managed hosts.
- ControlMaster not used. Fix: Add ssh_args = -o ControlMaster=auto -o ControlPersist=60s to ansible.cfg. Also increase ulimit -n on controller.
- strategy=linear waits for all hosts to complete each task before proceeding. Fix: Use strategy=free if task order across hosts is not critical, or strategy=host_pinned for host-local ordering.
Quick checks: grep -i forks /etc/ansible/ansible.cfg and ulimit -n.
Common mistakes to avoid
- Setting forks too high without checking ulimit
- Enabling pipelining without disabling requiretty
- Using ControlMaster with ControlPersist=600s
- Using free strategy with non-idempotent tasks
- Async with poll:0 and never checking status
- Using jsonfile fact caching for concurrent runs
Interview Questions on This Topic
What is the default forks value in Ansible and why is it problematic for large environments?