Linux Performance Tuning — Silent Swap from vm.swappiness
Default vm.swappiness=60 silently swaps Redis working set, spiking P99 from 2ms to 4s.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
- Tuning modifies kernel parameters via sysctl to match workload, not defaults
- Key subsystems: CPU scheduler, virtual memory, I/O, and network
- vm.swappiness controls swap tendency — set to 1 for latency-sensitive apps
- Wrong I/O scheduler on NVMe adds ~40% latency — use none (or mq-deadline)
- Production rule: change one parameter at a time, measure before and after
Imagine a busy restaurant kitchen. The head chef (Linux kernel) manages cooks (CPU cores), pantry space (RAM), and delivery trucks (I/O). Out of the box, the kitchen is set up for a casual diner — it works fine for most nights. But on a Saturday rush with 300 covers, you need to rearrange the stations, pre-stock the fridges, and assign cooks to specific roles. That's exactly what Linux performance tuning is: deliberately reorganising how the OS allocates its resources so it can handle YOUR workload, not just an average one.
A default Linux installation is deliberately conservative. The kernel ships with settings tuned for broad compatibility — a database server, a gaming rig, a Raspberry Pi, and a 64-core cloud VM will all boot with roughly the same baseline config. That's great for getting started, but catastrophic for production at scale. A misconfigured TCP buffer kills throughput on a 10 Gbps link. The wrong I/O scheduler on NVMe storage adds 40% latency. A forgotten vm.swappiness setting causes a Redis node to start swapping under load, tanking p99 response times from 2ms to 4 seconds. These aren't theoretical problems — they're war stories from real oncall rotations.
Performance tuning solves the gap between 'it works' and 'it works under pressure'. The Linux kernel exposes hundreds of tuneable knobs through /proc, /sys, and sysctl. Understanding which knobs affect which subsystem — and crucially, WHY they exist — lets you make surgical changes instead of cargo-culting settings from a Stack Overflow post that was written for a 2012 spinning-disk server.
By the end of this article you'll understand how the kernel scheduler, virtual memory subsystem, I/O stack, and network stack interact with each other. You'll be able to profile a live system, identify the bottleneck, apply the right tuning, and verify the improvement with hard numbers — all without rebooting. You'll also know which changes to make permanent and which to test ephemerally first.
What is Linux System Performance Tuning?
Linux system performance tuning is the practice of modifying kernel parameters via /proc, /sys, and sysctl to adjust the OS's behaviour for a specific workload. Default settings target broad compatibility, not peak performance. Tuning is not a one-time event — it's an iterative cycle of measurement, change, verification.
The kernel exposes these knobs because there's no single 'best' config. A web server that handles short-lived connections needs different TCP buffers than a file server streaming large files. A real-time analytics database needs different memory pressure settings than a batch processing job.
The goal is to close the gap between 'works' and 'works under production load'. That means measuring latency, throughput, and resource utilisation before and after each change.
Kernel Scheduler Tuning — CPU Affinity, CFS & NUMA
The Completely Fair Scheduler (CFS) allocates CPU time proportionally among runnable tasks. Its main tuning knobs control preemption aggressiveness, migration cost, and NUMA balancing. The sched_* knobs below are sysctls on kernels before 5.13; from 5.13 they live under /sys/kernel/debug/sched/ without the sched_ prefix, and kernel 6.6 replaced CFS with EEVDF, which drops the two granularity knobs in favour of base_slice_ns.
- kernel.sched_min_granularity_ns: Minimum slice per process. Lower values reduce latency but increase context switches. The default is 0.75ms scaled by CPU count, which comes to 3ms on machines with 8 or more cores; reduce to 1ms for latency-sensitive apps.
- kernel.sched_wakeup_granularity_ns: How long a waking process must wait before preempting a running one. Reduce for interactive workloads.
- kernel.numa_balancing: Default 1 (enabled). On NUMA machines, this migrates pages and threads across nodes, which often causes latency spikes. Set to 0 for pinned workloads and in virtualised environments.
- kernel.sched_migration_cost_ns: How long after it last ran a task is still treated as cache-hot and left on its CPU. Increasing it prevents unnecessary migrations.
Also use taskset to pin processes to specific cores and numactl to bind memory to the local NUMA node.
Memory Management Tuning — vm.swappiness, dirty pages & huge pages
The virtual memory subsystem decides how aggressively to swap anonymous pages versus reclaim page cache. The key parameters:
- vm.swappiness: 0–100 (0–200 since kernel 5.8). The default of 60 lets the kernel swap out idle anonymous pages under memory pressure even when most of RAM is only page cache that could be dropped instead. Set to 1 for latency-sensitive apps. 0 means anonymous pages are swapped only as a last resort (but the kernel still swaps).
- vm.dirty_background_ratio / vm.dirty_ratio: When writeback starts. Defaults are 10% (background flushing begins) and 20% (writers block and flush synchronously). On write-heavy systems, this can cause latency spikes when the kernel blocks writes. Lower to 5%/10% for transaction logs.
- vm.vfs_cache_pressure: Controls tendency to reclaim inode/dentry cache. Default 100. Lower to 50 on file servers to keep metadata in memory.
- vm.min_free_kbytes: Reserve memory to avoid direct reclaim under load. About 1% of RAM is a common starting point; going much higher wastes memory and brings the OOM killer in sooner.
Huge pages (2MB vs 4KB) reduce TLB misses. For apps with large memory footprints (databases, JVMs), enable transparent huge pages (THP) or use explicit hugetlbfs. THP can cause allocation stalls – often better to disable it and pre-allocate huge pages.
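To make the dirty-page ratios concrete, here is a minimal Python sketch that turns them into byte thresholds. It assumes the ratios apply to roughly the available-memory figure (the kernel's real base is free plus reclaimable pages), and the function name, the 64 GiB figure, and the two ratio pairs are illustrative, not taken from any particular system.

```python
GIB = 1024 ** 3


def dirty_thresholds(available_bytes, background_ratio, dirty_ratio):
    """Return (background_bytes, blocking_bytes) for the two percentages.

    background_bytes: dirty data at which background writeback starts.
    blocking_bytes:   dirty data at which writers are forced to flush.
    """
    for ratio in (background_ratio, dirty_ratio):
        if not 0 <= ratio <= 100:
            raise ValueError("ratios are percentages between 0 and 100")
    background = available_bytes * background_ratio // 100
    blocking = available_bytes * dirty_ratio // 100
    return background, blocking


if __name__ == "__main__":
    # Hypothetical host with 64 GiB available; compare two settings.
    for bg, hard in ((10, 20), (5, 10)):
        background, blocking = dirty_thresholds(64 * GIB, bg, hard)
        print(f"{bg}%/{hard}%: background at {background / GIB:.1f} GiB, "
              f"blocking at {blocking / GIB:.1f} GiB")
```

On a large-memory box a 20% blocking threshold is many gigabytes of unwritten data, which is why the stall, when it comes, is long.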
I/O Subsystem Tuning — Scheduler, Queue Depth & Block Layer
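The Linux block layer sits between file systems and hardware. Its main tunable is the I/O scheduler, which queues and reorders requests. On modern NVMe SSDs, a reordering scheduler such as BFQ (or CFQ on kernels before 5.0, where it was removed) adds overhead. Switch to none (no reordering) or mq-deadline (bounded latency). Many distributions already default NVMe devices to none, so check the current value before changing anything.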
- /sys/block/<dev>/queue/scheduler: set to 'none' for NVMe, 'mq-deadline' for SATA SSD, 'bfq' for spinning disks.
- /sys/block/<dev>/queue/nr_requests: I/O queue depth. Increase for high-throughput workloads (e.g., 1024 for databases).
- /sys/block/<dev>/queue/read_ahead_kb: Pre-fetch size. Larger values benefit sequential reads but waste cache on random workloads.
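Also tune filesystem mount options: noatime is safe almost everywhere. nobarrier on ext4 disables write-cache flushes and risks corruption on power loss, so use it only with a battery-backed or otherwise non-volatile write cache. For large parallel workloads, consider XFS with larger allocation groups.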
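Before changing a scheduler, check what is active. The kernel marks the current choice with square brackets in /sys/block/<dev>/queue/scheduler. A minimal Python sketch of reading that file follows; the helper names are made up for illustration and nvme0n1 is an example device.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split a queue/scheduler line into (active, available).

    Example input: "[none] mq-deadline kyber bfq"
    The bracketed entry is the scheduler currently in use.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device):
    """Read the scheduler line for a block device, or None if it is absent."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    try:
        return parse_scheduler(path.read_text())
    except OSError:
        return None  # device does not exist or sysfs is not mounted


if __name__ == "__main__":
    print(read_scheduler("nvme0n1"))  # example device name
```

Writing a new value to the same file takes effect immediately but does not survive a reboot, so a udev rule or tuned profile is still needed for persistence.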
Network Stack Tuning — TCP Buffers, Congestion Control & Ring Buffers
The network stack's biggest bottleneck is often TCP buffer sizing and interrupt processing. For high-speed links (>1 Gbps), default socket buffers are too small, causing underutilisation.
- net.core.rmem_max / net.core.wmem_max: Max receive/send socket buffer (bytes). Set to 16MB for 10G links.
- net.ipv4.tcp_rmem / tcp_wmem: min-default-max for TCP buffers. Set min=4096, default=87380, max=16777216 (16MB).
- net.ipv4.tcp_congestion_control: Default cubic. For lossy or long-haul links, use bbr (needs kernel 4.9+). BBR handles packet loss better and can increase throughput.
- net.core.netdev_max_backlog: Max packets queued from NIC before kernel drops. Increase to 5000 on 10G links.
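- /sys/class/net/eth0/queues/rx-*/rps_cpus: Enable RPS (Receive Packet Steering) to spread receive softirq processing across CPUs when the NIC has fewer hardware queues than you have cores. eth0 is an example interface name.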
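Where does 16MB come from? A TCP connection can only keep a link full if its buffer holds at least one bandwidth-delay product (bandwidth times round-trip time). The Python sketch below does that arithmetic with integers; the function name and the 13 ms round-trip time are illustrative assumptions.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth-delay product in bytes.

    bandwidth_bits_per_sec: link speed in bits per second (integer).
    rtt_ms: round-trip time in whole milliseconds.
    bits/s * ms / 1000 gives bits in flight; / 8 converts to bytes.
    """
    if bandwidth_bits_per_sec < 0 or rtt_ms < 0:
        raise ValueError("bandwidth and RTT must be non-negative")
    return bandwidth_bits_per_sec * rtt_ms // 8000


if __name__ == "__main__":
    ten_gbit = 10_000_000_000
    for rtt in (1, 13, 80):  # hypothetical LAN, regional, long-haul RTTs
        print(f"10 Gbps at {rtt} ms RTT needs about {bdp_bytes(ten_gbit, rtt):,} bytes")
```

A 16MB maximum therefore covers a single 10 Gbps flow up to roughly 13 ms of round-trip time. Longer paths need a larger maximum, and short LAN paths get nothing from raising it.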
Putting It All Together — A Repeatable Tuning Workflow
Follow this sequence to avoid chaos:
- Baseline measurement: Collect latency, throughput, CPU, memory, I/O, and network metrics for at least 24 hours under typical load. Use tools like sar (part of sysstat), perf, and netdata.
- Identify bottleneck: Use the USE method (Utilization, Saturation, Errors) – e.g., CPU util > 90%? I/O queue length growing? Network drops?
- Hypothesis and change: Pick ONE parameter. Change it. Document why and expected effect.
- Measure again: Same period and load type. Compare before/after.
- Accept or rollback: If improvement >5% in the target metric, keep. If not, rollback and try different hypothesis.
- Make persistent: Only after validation, add to /etc/sysctl.d/ or tuned profiles.
Treat every parameter change as an experiment. Use tools like 'tuned' to apply preset profiles for common workloads (latency-performance, throughput-performance).
- Default config is for breadth, not production
- One change at a time, measure before and after
- The kernel has hundreds of knobs, but only 5-10 matter for your workload
- If you can't explain why a parameter helps, don't apply it
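The accept-or-rollback step in the workflow above is easy to fudge under pressure, so it helps to write the rule down as code. This is a minimal Python sketch, assuming a single target metric and the 5% threshold from the list; the function names and sample numbers are hypothetical.

```python
def improvement(before, after, lower_is_better=True):
    """Fractional improvement of `after` over `before`; positive means better.

    Use lower_is_better=True for latency, False for throughput.
    """
    if before <= 0:
        raise ValueError("baseline must be positive")
    change = (before - after) / before
    return change if lower_is_better else -change


def decide(before, after, threshold=0.05, lower_is_better=True):
    """Return "keep" if the change beats the threshold, else "rollback"."""
    gain = improvement(before, after, lower_is_better)
    return "keep" if gain > threshold else "rollback"


if __name__ == "__main__":
    # Hypothetical p99 latency in ms before and after one sysctl change.
    print(decide(before=2.0, after=1.8))
    # Hypothetical throughput in requests/s, where higher is better.
    print(decide(before=1000, after=1100, lower_is_better=False))
```

Feed it the same percentile from the same load window on both sides; comparing a quiet Sunday against a Monday peak proves nothing.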
CPU Cache Thrashing Is Your Silent Killer
Most devs stare at CPU utilization and think they're fine. 5% usage means nothing when your L2 cache miss rate is 40%. The CPU spends more cycles waiting on memory than computing. That's the real bottleneck.
Cache hierarchy matters because data locality is physics, not magic. L1 cache runs at CPU speed. L3 is an order of magnitude slower. Every cache miss stalls the pipeline. If the scheduler keeps bouncing your thread between cores, you're flushing and reloading that cache every time. That's thrashing.
Check your cache topology with lscpu or lstopo. Pin critical processes with taskset or numactl to keep them on the same core. For NUMA boxes, memory allocation follows CPU node. Bind both to the same node. Remote memory access is slow, and your database will feel it.
Run perf stat -e cache-references,cache-misses on your workload; with both events counted, perf prints misses as a percentage of references. If the miss rate exceeds 5%, you're leaving performance on the table. Fix affinity first. Tune later.
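To bind a process and its memory to one node you first need that node's CPU list. The kernel publishes it as a range string such as 0-3,8-11 in /sys/devices/system/node/node<N>/cpulist. Here is a minimal Python sketch that parses it; the helper names are illustrative.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string like "0-3,8-11" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def node_cpus(node):
    """CPUs belonging to a NUMA node, or [] if the node is not present."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    try:
        return parse_cpulist(path.read_text())
    except OSError:
        return []


if __name__ == "__main__":
    print(node_cpus(0))
```

On Linux the resulting list can be handed to os.sched_setaffinity, or turned into the -c argument for taskset.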
Context Switches Are The Hidden Tax On Your Throughput
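Every context switch costs you CPU cycles. The scheduler saves state, may flush the TLB, and reloads the next thread. On a busy web server handling 10k connections, you might be switching 100k times per second. At roughly a microsecond of direct cost each, that's 100k lost microseconds, a tenth of a core every second, before you count the cache misses that follow. Add them up.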
High context switch rates usually mean one of two things: too many threads fighting for CPU time, or I/O-bound tasks that yield constantly. Both waste cycles. The fix: reduce thread count or switch to an event-driven model like epoll. For databases, tune the connection pool to match core count, not max connections.
Watch /proc/stat or use vmstat 1 and look at the cs column. If it's over 50k per core per second, you're thrashing the scheduler. perf sched gives you a per-process breakdown. Identify the worst offender. Either pin it, scale its threads, or rewrite it.
Don't touch scheduler policies (SCHED_FIFO, SCHED_RR) unless you know exactly what you're doing. Preempting the kernel scheduler can lock your box. Start with affinity and thread count. That's 80% of the win.
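The cs column in vmstat comes from the ctxt counter in /proc/stat. A minimal Python sketch of turning two samples of that counter into a per-core rate follows; the function names are illustrative, and the 50k figure is the rule of thumb from the text above.

```python
import os
import time


def read_ctxt(proc_stat_text):
    """Extract the cumulative context-switch count from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def switches_per_core_per_sec(first, second, interval_s, cores):
    """Rate of context switches per core between two counter samples."""
    if interval_s <= 0 or cores <= 0:
        raise ValueError("interval and core count must be positive")
    return (second - first) / interval_s / cores


if __name__ == "__main__":
    with open("/proc/stat") as f:
        a = read_ctxt(f.read())
    time.sleep(1)
    with open("/proc/stat") as f:
        b = read_ctxt(f.read())
    rate = switches_per_core_per_sec(a, b, 1.0, os.cpu_count() or 1)
    print(f"{rate:.0f} context switches per core per second")
```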
Interrupt Affinity: Don't Let IRQs Steal Your Cache
Network interrupts land on whatever core the kernel picks. Without irqbalance or a multi-queue NIC, that is often CPU 0. Your cache-hot nginx worker on that core now gets interrupted 20k times a second to handle packet processing. Cache evicted. Pipeline stalled. Performance tanks.
Bind interrupt request lines (IRQs) to a dedicated core — separate from your application cores. This keeps your worker's cache warm and lets the interrupt handler run unopposed. Check /proc/interrupts to see which IRQ is hammering which core. Then write the CPU mask to /proc/irq/<N>/smp_affinity.
For high-throughput NICs (10GbE+), spread IRQs across a set of cores, not all. Use irqbalance as a baseline but don't trust it blindly for heavy loads. Manual tuning beats a daemon every time. Match NIC queue count to core count, then assign one queue per core. No sharing.
Test with perf top or mpstat -I CPU before and after. If you see 10%+ of CPU time in softirq or hardirq, you have work to do. Dedicated interrupt cores are free performance. Take it.
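The value written to /proc/irq/<N>/smp_affinity is a hexadecimal bitmask with one bit per CPU, split into comma-separated 32-bit groups on machines with more than 32 CPUs. A minimal Python sketch of building that mask from a CPU list follows; the function name is illustrative.

```python
def smp_affinity_mask(cpus):
    """Build the hex bitmask string for a list of CPU numbers.

    Bit N is set for CPU N. Masks wider than 32 bits are written as
    comma-separated groups of 8 hex digits, most significant first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    if mask == 0:
        raise ValueError("need at least one CPU")
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.append(digits[-8:])  # peel 32 bits off the right-hand side
        digits = digits[:-8]
    return ",".join(reversed(groups))


if __name__ == "__main__":
    # Hypothetical layout: dedicate CPUs 2 and 3 to NIC interrupts.
    print(smp_affinity_mask([2, 3]))
```

If hex masks feel error-prone, /proc/irq/<N>/smp_affinity_list accepts a plain list such as 2-3 instead.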
I/O Priority: Stop Letting Batch Jobs Steal Your Latency
Most engineers tune I/O scheduling but ignore I/O priority. That's like upgrading your car's tires while leaving the parking brake on. The kernel's I/O priority system (ionice) lets you tell the block layer which processes are latency-sensitive and which are background noise.
WHY this matters: Without I/O priority, a cron job running log rotation can starve your database. The BFQ scheduler (and CFQ on kernels before 5.0) respects three priority classes: Real-time (1), Best-effort (2), and Idle (3). Real-time processes always get first dibs on I/O requests. Best-effort divides bandwidth proportionally. Idle processes only run when nobody else needs the disk. With the none scheduler there is no queue to reorder, so expect ionice classes to have no effect on those devices.
HOW to use it: Attach ionice to critical processes. Your PostgreSQL should be ionice -c 2 -n 0 (best-effort, high priority). Your nightly backup script should be ionice -c 3 (idle). On systemd, set IOSchedulingClass and IOSchedulingPriority in your service unit. This single change can cut database tail latency by 40% in mixed-workload environments. Test it: run fio with different ionice classes and watch the latency divergence.
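The class numbers are easy to mix up, so here is a minimal Python sketch that pins them down. It uses the numbering documented in ionice(1) (1 real-time, 2 best-effort, 3 idle) and the kernel's encoding for the ioprio_set syscall, which stores the class above a 13-bit shift. The helper names are illustrative.

```python
IOPRIO_CLASS_RT = 1     # real-time
IOPRIO_CLASS_BE = 2     # best-effort
IOPRIO_CLASS_IDLE = 3   # idle
IOPRIO_CLASS_SHIFT = 13


def ioprio_value(io_class, level=0):
    """Pack class and level (0 = highest, 7 = lowest) as ioprio_set expects."""
    if io_class not in (IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE):
        raise ValueError("class must be 1, 2, or 3")
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return (io_class << IOPRIO_CLASS_SHIFT) | level


def ionice_args(io_class, level=0):
    """Argument list for the ionice command; idle takes no level."""
    ioprio_value(io_class, level)  # reuse the validation
    args = ["ionice", "-c", str(io_class)]
    if io_class != IOPRIO_CLASS_IDLE:
        args += ["-n", str(level)]
    return args


if __name__ == "__main__":
    # Hypothetical roles: a database at best-effort/0, a backup at idle.
    print(" ".join(ionice_args(IOPRIO_CLASS_BE, 0) + ["-p", "<db-pid>"]))
    print(" ".join(ionice_args(IOPRIO_CLASS_IDLE) + ["<backup-command>"]))
```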
Useful I/O Monitoring Commands: Stop Guessing, Start Measuring
You can't tune what you can't see. Most engineers look at iostat, get lost in %util, and call it a day. That number is a lie — %util shows device busy time, not saturation. You need the real tools.
WHY this matters: False signals cause bad tuning. %util at 100% means nothing on an NVMe drive that can queue 64K commands. You need indicators of actual congestion: average queue size (aqu-sz, called avgqu-sz in older sysstat) and the per-request wait times r_await and w_await. Recent sysstat releases dropped svctm because it was unreliable, so don't build alerts on it.
- iostat -x 1: Look for aqu-sz persistently above 1 on a single-queue device. That's congestion. await under 10ms on spinning rust is fine; under 1ms on SSD is fine.
- iotop -oP: See which process is eating I/O right now. The -o flag hides idle entries and -P shows whole processes instead of individual threads. Discover your rogue log writer.
- blktrace / blkparse: Capture every I/O event. Trace live with blktrace -d /dev/sda -o - | blkparse -i - (add -w 10 to stop after a 10-second window). This shows exact I/O sizes, latencies, and which process issued them.
- bcc-tools (biosnoop, biolatency): biolatency prints instant histograms of I/O latency; biosnoop lists each I/O with its latency. No configuration. Just run biosnoop and watch outliers appear.
Ditch guesswork. Measure with blktrace when you profile, iostat when you monitor, iotop when you firefight.
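Because the queue-size column changed name between sysstat versions, scripts that read iostat by column position break silently. A minimal Python sketch of looking the column up by header name follows. The sample report is trimmed and invented for illustration, and real output has more columns.

```python
# Illustrative, trimmed sample of `iostat -x` output (not from a real host).
SAMPLE = """
Device   r/s    w/s    r_await  w_await  aqu-sz  %util
nvme0n1  120.0  340.0  0.21     0.35     0.18    12.40
sda      5.0    80.0   4.10     22.70    3.25    98.00
"""


def queue_depths(iostat_text):
    """Map device name to average queue size from `iostat -x` text."""
    result = {}
    columns = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the device table
            continue
        if fields[0].rstrip(":") == "Device":
            columns = fields  # remember the header for the rows below
            continue
        if columns is None:
            continue
        for name in ("aqu-sz", "avgqu-sz"):  # new and old column names
            if name in columns:
                result[fields[0]] = float(fields[columns.index(name)])
                break
    return result


if __name__ == "__main__":
    for device, depth in queue_depths(SAMPLE).items():
        flag = "congested" if depth > 1 else "ok"
        print(device, depth, flag)
```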
... strftime(), $0}'. Run it in a tmux pane. Now you see congestion before the pager goes off.
Keepalive Timeout: Stop Letting Dead Connections Rot Your Resources
Every open TCP socket costs you a file descriptor and a chunk of kernel memory. When a client crashes without sending FIN, the server-side socket stays in ESTABLISHED until the application closes it or the kernel notices the peer is gone. If you have 10,000 connections and 10% of them are dead, you're leaking 1,000 descriptors. That's absurd.
WHY this matters: Default keepalive settings are glacial. tcp_keepalive_time is 7200 seconds (2 hours) on most distros. Production apps talking to flaky mobile clients don't have 2 hours. Your connection pool fills with zombies, and new connections get rejected once the process hits its file descriptor limit. Keepalive is your garbage collector for dead peers.
HOW to fix it: Set tcp_keepalive_time to 300 seconds (5 minutes), tcp_keepalive_intvl to 15 seconds, and tcp_keepalive_probes to 5. After 5 minutes idle, the kernel sends up to 5 probes at 15-second intervals. Total detection time: 300 s + 5 × 15 s = 375 s, or 6.25 minutes. These sysctls only apply to sockets that have SO_KEEPALIVE set, so enable it in application code. Tune per service with the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT socket options, or set the sysctls per network namespace, rather than applying a single value to everything. Your database servers need shorter timeouts than your web servers.
Test it: ss -tno state established shows a timer:(keepalive,...) entry for every socket with keepalive armed; sockets without one will never be probed. A growing count from netstat -tn | grep CLOSE_WAIT is a different bug: the peer closed cleanly and your application never called close().
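Per-service tuning is done with socket options, which override the system-wide sysctls for that one socket. A minimal Python sketch follows, using the 300/15/5 values from the text as defaults. TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are the Linux option names, and the sketch skips any that the platform does not define. The function names are illustrative.

```python
import socket


def detection_seconds(idle, interval, probes):
    """Worst-case time to declare a silent peer dead."""
    return idle + interval * probes


def enable_keepalive(sock, idle=300, interval=15, probes=5):
    """Turn on keepalive for one socket and override the sysctl defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:  # not every platform defines all three
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print("dead peer detected within", detection_seconds(300, 15, 5), "seconds")
```

Call enable_keepalive on each accepted connection, not only on the listening socket, unless you have confirmed your platform inherits the options.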
Kernel Function Tracing for Low-Level Analysis
WHY kernel tracing matters: Latency spikes hide behind aggregated metrics. Function tracing reveals exactly which kernel path causes a bottleneck. Use ftrace or BPF-based tools to measure entry-to-exit times of syscalls, interrupt handlers, and scheduler functions. Start with trace-cmd record -p function_graph -g do_sys_open to trace file open latency. For targeted analysis, use funclatency from BCC to histogram a single kernel function's execution time. Focus on high-frequency or high-jitter paths like tcp_ack, do_try_to_free_pages, or enqueue_task_fair. Compare before and after tuning. Always trace under load—idle traces tell nothing. Stop guessing with averages; start tracing the 99th percentile path.
Programmable Tracing for Custom Metrics
WHY programmable tracing: Traditional tools report fixed metrics, so you get what they give. With BPF (BCC or bpftrace), you write custom probes on any kernel or user function. Measure the exact distribution of mutex hold times, NUMA miss ratios, or per-socket TCP retransmit counts. Start with bpftrace -e 'kprobe:tcp_retransmit_skb { @[uid] = count(); }' to count retransmits per user. The uid builtin is the user of whatever task is on the CPU when the probe fires, so treat the attribution as approximate. For NUMA analysis, run numa-miss from BCC: it shows where memory allocation fails to match the requesting CPU's node. Build metrics that matter to your workload, not generic dashboard noise. Write scripts that save histograms and alert on tail latency shifts. This turns raw tracing into actionable tuning data.
... tcp_transmit_skb) can overflow the per-CPU buffer and drop events. Monitor /sys/kernel/debug/tracing/trace_stat/bpf_stats for lost probes.
The Silent OOM that Cost $2,000 in AWS Credits
- Default sysctl values are not safe for all workloads – especially vm.swappiness.
- Always check swap usage (free -h, /proc/meminfo) even if you have 'plenty' of RAM.
- For latency-sensitive apps, set vm.swappiness=1 or disable swap entirely.
- Monitor /proc/[pid]/status VmSwap to see per-process swap usage.
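Checking VmSwap one PID at a time does not scale, so here is a minimal Python sketch that scans /proc and lists the processes holding the most swap. The function names are illustrative, and unreadable or already-exited processes are skipped.

```python
from pathlib import Path


def parse_status(text):
    """Return (process_name, swap_kb) from /proc/<pid>/status text."""
    name, swap_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            if len(parts) > 1:
                name = parts[1].strip()
        elif line.startswith("VmSwap:"):
            swap_kb = int(line.split()[1])  # value is reported in kB
    return name, swap_kb  # kernel threads have no VmSwap line, so 0


def top_swappers(limit=5, proc=Path("/proc")):
    """List (swap_kb, pid, name) for the processes using the most swap."""
    rows = []
    if not proc.is_dir():
        return rows
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            text = (entry / "status").read_text(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        name, swap_kb = parse_status(text)
        if swap_kb > 0:
            rows.append((swap_kb, int(entry.name), name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for swap_kb, pid, name in top_swappers():
        print(f"{swap_kb:>10} kB  pid {pid}  {name}")
```

If your Redis or database process shows up in this list at all, the swappiness setting deserves a look before anything else.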
vmstat 1 10 | awk '{print $12,$14}' # cs and sy columns
perf top -e cs -s comm,sym # find which code is triggering context switches
Common mistakes to avoid
- Copying sysctl settings from outdated blog posts without verification
- Never measuring before applying changes
- Changing multiple parameters simultaneously
- Disabling swap entirely on memory-constrained servers
Interview Questions on This Topic
You notice a database server has high %iowait and I/O latency is 50ms. The I/O scheduler is CFQ and storage is NVMe. What is the first thing you would change?
echo none > /sys/block/nvme0n1/queue/scheduler and verify with iostat -x 1 that await drops below 2ms.