Linux Performance Tuning — Silent Swap from vm.swappiness
- Tuning adjusts kernel parameters via sysctl to fit your workload instead of accepting the defaults
- Key subsystems: CPU scheduler, virtual memory, I/O, and network
- vm.swappiness controls swap tendency — set to 1 for latency-sensitive apps
- Wrong I/O scheduler on NVMe adds ~40% latency — use none (or mq-deadline)
- Production rule: change one parameter at a time, measure before and after
Imagine a busy restaurant kitchen. The head chef (Linux kernel) manages cooks (CPU cores), pantry space (RAM), and delivery trucks (I/O). Out of the box, the kitchen is set up for a casual diner — it works fine for most nights. But on a Saturday rush with 300 covers, you need to rearrange the stations, pre-stock the fridges, and assign cooks to specific roles. That's exactly what Linux performance tuning is: deliberately reorganising how the OS allocates its resources so it can handle YOUR workload, not just an average one.
A default Linux installation is deliberately conservative. The kernel ships with settings tuned for broad compatibility: a database server, a gaming rig, a Raspberry Pi, and a 64-core cloud VM will all boot with roughly the same baseline config. That's great for getting started, but it can be catastrophic for production at scale. A misconfigured TCP buffer kills throughput on a 10 Gbps link. The wrong I/O scheduler on NVMe storage can add around 40% latency. A forgotten vm.swappiness setting can cause a Redis node to start swapping under load, tanking p99 response times from 2ms to 4 seconds. These aren't theoretical problems; they're war stories from real oncall rotations.
Performance tuning solves the gap between 'it works' and 'it works under pressure'. The Linux kernel exposes hundreds of tuneable knobs through /proc, /sys, and sysctl. Understanding which knobs affect which subsystem — and crucially, WHY they exist — lets you make surgical changes instead of cargo-culting settings from a Stack Overflow post that was written for a 2012 spinning-disk server.
By the end of this article you'll understand how the kernel scheduler, virtual memory subsystem, I/O stack, and network stack interact with each other. You'll be able to profile a live system, identify the bottleneck, apply the right tuning, and verify the improvement with hard numbers — all without rebooting. You'll also know which changes to make permanent and which to test ephemerally first.
What is Linux System Performance Tuning?
Linux system performance tuning is the practice of modifying kernel parameters via /proc, /sys, and sysctl to adjust the OS's behaviour for a specific workload. Default settings target broad compatibility, not peak performance. Tuning is not a one-time event; it is an iterative cycle of measurement, change, and verification.
The kernel exposes these knobs because there's no single 'best' config. A web server that handles short-lived connections needs different TCP buffers than a file server streaming large files. A real-time analytics database needs different memory pressure settings than a batch processing job.
The goal is to close the gap between 'works' and 'works under production load'. That means measuring latency, throughput, and resource utilisation before and after each change.
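As a minimal sketch of how the sysctl names used in this article map onto files, a dotted name such as vm.swappiness corresponds to a path under /proc/sys. The helper names sysctl_path and read_sysctl are hypothetical, chosen for illustration; reading needs no privileges, while changing a value requires root.

```python
from pathlib import Path

PROC_SYS = Path("/proc/sys")


def sysctl_path(name: str, root: Path = PROC_SYS) -> Path:
    """Map a dotted sysctl name such as 'vm.swappiness' to its /proc/sys file."""
    return root.joinpath(*name.split("."))


def read_sysctl(name: str, root: Path = PROC_SYS) -> str:
    """Return the current value of a sysctl as text."""
    return sysctl_path(name, root).read_text().strip()


if __name__ == "__main__":
    # Record the current values before changing anything.
    for knob in ("vm.swappiness", "vm.dirty_ratio", "net.core.rmem_max"):
        try:
            print(f"{knob} = {read_sysctl(knob)}")
        except OSError as exc:
            print(f"{knob} unavailable: {exc}")
```

Capturing the current values this way before a change gives you something concrete to roll back to.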
Kernel Scheduler Tuning — CPU Affinity, CFS & NUMA
The Completely Fair Scheduler (CFS) allocates CPU time proportionally among processes. Its main tuning knobs control preemption aggressiveness, group scheduling, and NUMA balancing.
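- kernel.sched_min_granularity_ns: Minimum slice per process. Lower values reduce latency but increase context switches. The default scales with CPU count and is typically 3ms on machines with 8 or more cores; reduce to 1ms for latency-sensitive apps.
- kernel.sched_wakeup_granularity_ns: How far ahead a waking process must be before it preempts the running one. Reduce for interactive workloads.
- kernel.numa_balancing: Usually 1 (enabled) on NUMA hardware. It migrates pages and threads across nodes automatically, which can cause latency spikes. If profiling points to it, set it to 0, particularly for pinned workloads and in virtualised environments.
- kernel.sched_migration_cost_ns: How long after running a task is still treated as cache-hot and therefore not migrated to another CPU. Increasing it prevents unnecessary migrations.
On kernels 5.13 and later the three sched_* knobs are no longer sysctls; they live under /sys/kernel/debug/sched/ as min_granularity_ns, wakeup_granularity_ns, and migration_cost_ns. Kernel 6.6 replaced CFS with the EEVDF scheduler, which drops the two granularity knobs in favour of base_slice_ns, so check which interface your kernel exposes before applying any of them.
Also use taskset to pin processes to specific cores and numactl to bind memory to the local NUMA node.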
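The pinning step can be sketched in Python as well. This is an illustration under assumptions, not a replacement for taskset or numactl: parse_cpulist and pin_to_node are hypothetical helper names, and the sketch sets CPU affinity only, so memory binding still needs numactl or an equivalent.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse kernel cpulist syntax such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node: int, pid: int = 0) -> set[int]:
    """Pin a process (0 = the caller) to the CPUs of one NUMA node. Linux only."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

Calling pin_to_node(0) from a worker at startup keeps its threads on node 0, which is the same effect as launching it under taskset with that node's CPU list.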
Memory Management Tuning — vm.swappiness, dirty pages & huge pages
The virtual memory subsystem decides how aggressively to swap anonymous pages versus reclaim page cache. The key parameters:
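- vm.swappiness: 0 to 100 (0 to 200 since kernel 5.8). Default 60. It weights how readily the kernel reclaims anonymous pages (by swapping) versus page cache once memory is under pressure, so at the default a host can push idle application pages to swap while most of its RAM holds cache. Set to 1 for latency-sensitive apps. 0 tells the kernel to avoid swapping until absolutely necessary (it still swaps as a last resort).
- vm.dirty_ratio / vm.dirty_background_ratio: When writeback starts. The defaults are 10% for dirty_background_ratio (background flushing begins) and 20% for dirty_ratio (writing processes block until pages are flushed). On write-heavy systems, hitting the blocking threshold causes latency spikes. Lower to 5% background and 10% blocking for transaction logs.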
- vm.vfs_cache_pressure: Controls tendency to reclaim inode/dentry cache. Default 100. Lower to 50 on file servers to keep metadata in memory.
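- vm.min_free_kbytes: Reserve memory to avoid direct reclaim under load. Around 1% of RAM is a common starting point; setting it far higher leaves less memory for applications and can trigger OOM kills.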
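To make the percentages above concrete, the following sketch turns /proc/meminfo text into absolute numbers: swap in use, the two dirty-page thresholds, and 1% of RAM. The function names are hypothetical, the default ratios passed in are 10 and 20, and MemAvailable is used as a rough stand-in for the memory the kernel actually applies the ratios to, so treat the thresholds as estimates.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}. Most values are in KiB."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_report(meminfo_text: str,
                  dirty_background_ratio: int = 10,
                  dirty_ratio: int = 20) -> dict[str, int]:
    """Estimate swap usage and dirty-page thresholds in KiB."""
    m = parse_meminfo(meminfo_text)
    # Assumption: MemAvailable approximates the kernel's "dirtyable" memory.
    dirtyable = m["MemAvailable"]
    return {
        "swap_used_kib": m["SwapTotal"] - m["SwapFree"],
        "background_flush_at_kib": dirtyable * dirty_background_ratio // 100,
        "writers_block_at_kib": dirtyable * dirty_ratio // 100,
        "one_percent_of_ram_kib": m["MemTotal"] // 100,
    }
```

On a live host, pass it the contents of /proc/meminfo along with the current vm.dirty_background_ratio and vm.dirty_ratio values. On machines with a lot of RAM the blocking threshold can be several gigabytes of unwritten data, which is why lowering the ratios helps write-latency-sensitive workloads.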
Huge pages (2MB vs 4KB) reduce TLB misses. For apps with large memory footprints (databases, JVMs), consider transparent huge pages (THP) or explicit hugetlbfs. THP can cause allocation and compaction stalls, so for latency-sensitive services it is often better to disable it and pre-allocate huge pages with vm.nr_hugepages.
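Two small calculations come up when acting on this: finding which THP mode is active, and working out how many huge pages to pre-allocate. The sketch below assumes 2MB huge pages and uses hypothetical helper names; the THP selector is read from /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in square brackets.

```python
def active_choice(selector_text: str) -> str:
    """Return the bracketed option from a sysfs selector like 'always [madvise] never'."""
    for token in selector_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no bracketed choice in {selector_text!r}")


def hugepages_needed(bytes_wanted: int, hugepage_kib: int = 2048) -> int:
    """Number of huge pages needed to back bytes_wanted, rounded up."""
    page_bytes = hugepage_kib * 1024
    return -(-bytes_wanted // page_bytes)  # ceiling division


if __name__ == "__main__":
    # Example: an 8 GiB buffer pool backed by 2 MiB pages.
    print("vm.nr_hugepages >=", hugepages_needed(8 * 1024**3))
```

The result is the minimum value for vm.nr_hugepages; in practice you would add headroom for anything else on the host that uses hugetlbfs.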
I/O Subsystem Tuning — Scheduler, Queue Depth & Block Layer
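The Linux block layer sits between file systems and hardware. Its main tunable is the I/O scheduler, which queues and reorders requests. Modern NVMe SSDs handle their own queueing, so a reordering scheduler mostly adds overhead. Recent kernels usually default NVMe devices to none already, but some distributions ship mq-deadline or bfq, and CFQ was removed in kernel 5.0 along with the legacy block layer. Check the active scheduler first, then switch to none (no reordering) or mq-deadline (bounded latency) if needed.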
- /sys/block/<dev>/queue/scheduler: set to 'none' for NVMe, 'mq-deadline' for SATA SSD, 'bfq' for spinning disks.
- /sys/block/<dev>/queue/nr_requests: I/O queue depth. Increase for high-throughput workloads (e.g., 1024 for databases).
- /sys/block/<dev>/queue/read_ahead_kb: Pre-fetch size. Larger values benefit sequential reads but waste cache on random workloads.
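Also tune filesystem mount options. noatime is broadly safe. nobarrier on ext4 trades crash safety for speed, so use it only when the storage has a battery-backed or otherwise non-volatile write cache. Using XFS with larger allocation groups is another option.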
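Before changing a scheduler it helps to see what is active and what the rule of thumb above would suggest. A minimal sketch, with hypothetical function names: the kernel marks the active scheduler with square brackets in /sys/block/<dev>/queue/scheduler, and /sys/block/<dev>/queue/rotational is 1 for spinning disks.

```python
def active_scheduler(text: str) -> str:
    """Extract the active scheduler from e.g. '[none] mq-deadline kyber bfq'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return text.strip()  # some kernels print a bare 'none'


def recommended_scheduler(dev: str, rotational: bool) -> str:
    """This article's rule of thumb: none for NVMe, mq-deadline for SATA SSD, bfq for HDD."""
    if dev.startswith("nvme"):
        return "none"
    return "bfq" if rotational else "mq-deadline"


def check_device(dev: str, sys_block: str = "/sys/block") -> tuple[str, str]:
    """Return (current, recommended) scheduler for a block device."""
    base = f"{sys_block}/{dev}/queue"
    with open(f"{base}/scheduler") as f:
        current = active_scheduler(f.read())
    with open(f"{base}/rotational") as f:
        rotational = f.read().strip() == "1"
    return current, recommended_scheduler(dev, rotational)
```

If check_device("nvme0n1") returns two different names, that is the cue to test a scheduler change, measured before and after like any other parameter.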
Network Stack Tuning — TCP Buffers, Congestion Control & Ring Buffers
The network stack's biggest bottleneck is often TCP buffer sizing and interrupt processing. For high-speed links (>1 Gbps), default socket buffers are too small, causing underutilisation.
- net.core.rmem_max / net.core.wmem_max: Max receive/send socket buffer (bytes). Set to 16MB for 10G links.
- net.ipv4.tcp_rmem / tcp_wmem: min, default, and max TCP buffer sizes in bytes. A common 10G setting is 4096 87380 16777216 for tcp_rmem and 4096 65536 16777216 for tcp_wmem (16MB max for both).
- net.ipv4.tcp_congestion_control: Default cubic. For lossy or long-haul links, use bbr (needs kernel 4.9+). BBR handles packet loss better and can increase throughput.
- net.core.netdev_max_backlog: Max packets queued from NIC before kernel drops. Increase to 5000 on 10G links.
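- /sys/class/net/eth0/queues/rx-*/rps_cpus: Write a hexadecimal CPU bitmask here to enable RPS (Receive Packet Steering), which spreads receive packet processing across CPUs instead of leaving it on the core that took the interrupt.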
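Two of the values above are easier to choose with a little arithmetic. The sketch below computes the bandwidth-delay product, which is the amount of data in flight that a socket buffer must hold to keep a link full, and builds the hex bitmask that rps_cpus expects. The function names are hypothetical, and the comma-separated 32-bit grouping is what the kernel uses when it prints these masks.

```python
def bdp_bytes(link_bits_per_sec: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: link speed times round-trip time."""
    return link_bits_per_sec * rtt_ms // 8000


def rps_mask(cpus) -> str:
    """Build an rps_cpus bitmask: hex, in comma-separated 32-bit groups."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    digits = format(mask, "x")
    digits = digits.zfill(-(-len(digits) // 8) * 8)  # pad to whole 8-digit groups
    groups = [digits[i:i + 8] for i in range(0, len(digits), 8)]
    groups[0] = groups[0].lstrip("0") or "0"
    return ",".join(groups)


if __name__ == "__main__":
    ten_gbit = 10_000_000_000
    for rtt in (1, 10, 80):
        print(f"10G link, {rtt} ms RTT: {bdp_bytes(ten_gbit, rtt)} bytes in flight")
    print("RPS mask for CPUs 0-3:", rps_mask(range(4)))
```

A 16MB maximum covers a 10G link at around 10 ms of round-trip time, but a long-haul path with a much higher RTT needs a larger maximum to fill the pipe.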
Putting It All Together — A Repeatable Tuning Workflow
Follow this sequence to avoid chaos:
- Baseline measurement: Collect latency, throughput, CPU, memory, I/O, and network metrics for at least 24 hours under typical load. Use tools like sar (from the sysstat package), perf, and netdata.
- Identify bottleneck: Use the USE method (Utilization, Saturation, Errors) – e.g., CPU util > 90%? I/O queue length growing? Network drops?
- Hypothesis and change: Pick ONE parameter. Change it. Document why and expected effect.
- Measure again: Same period and load type. Compare before/after.
- Accept or roll back: If the target metric improves by more than 5%, keep the change. If not, roll back and try a different hypothesis.
- Make persistent: Only after validation, add to /etc/sysctl.d/ or tuned profiles.
Treat every parameter change as an experiment. Use tools like 'tuned' to apply preset profiles for common workloads (latency-performance, throughput-performance).
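The accept-or-roll-back step can be reduced to a small, repeatable calculation. This sketch compares the median of the samples taken before and after a change and applies the 5% threshold from the workflow above; the function names are hypothetical, and using the median is an assumption made here to dampen outliers.

```python
from statistics import median


def improvement_pct(before, after, lower_is_better: bool = True) -> float:
    """Percentage improvement of the median, e.g. latency (lower) or throughput (higher)."""
    b, a = median(before), median(after)
    if b == 0:
        raise ValueError("baseline median is zero; cannot compute a percentage")
    change = (b - a) / b if lower_is_better else (a - b) / b
    return change * 100


def decide(before, after, threshold_pct: float = 5.0,
           lower_is_better: bool = True) -> str:
    """Return 'keep' if the change beats the threshold, else 'rollback'."""
    gain = improvement_pct(before, after, lower_is_better)
    return "keep" if gain > threshold_pct else "rollback"


if __name__ == "__main__":
    p99_before_ms = [10, 12, 11]
    p99_after_ms = [8, 9, 10]
    print(decide(p99_before_ms, p99_after_ms))
```

Feed it samples from the same load pattern and time window on both sides; otherwise the comparison measures the traffic, not the tuning.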
- Default config is for breadth, not production
- One change at a time, measure before and after
- The kernel has hundreds of knobs, but only 5-10 matter for your workload
- If you can't explain why a parameter helps, don't apply it
The Silent OOM that Cost $2,000 in AWS Credits
- Default sysctl values are not safe for all workloads – especially vm.swappiness.
- Always check swap usage (free -h, /proc/meminfo) even if you have 'plenty' of RAM.
- For latency-sensitive apps, set vm.swappiness=1 or disable swap entirely.
- Monitor /proc/[pid]/status VmSwap to see per-process swap usage.
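The per-process check can be scripted. A minimal sketch, with hypothetical function names, that reads the VmSwap field from each /proc/[pid]/status file and lists the processes holding the most swap; kernel threads have no VmSwap line and are reported as 0.

```python
import os


def parse_status(status_text: str) -> tuple[str, int]:
    """Return (process name, VmSwap in KiB) from /proc/<pid>/status text."""
    name, swap_kib = "?", 0
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmSwap:"):
            swap_kib = int(line.split()[1])
    return name, swap_kib


def top_swappers(proc_root: str = "/proc", limit: int = 5) -> list[tuple[int, str, int]]:
    """Return up to `limit` (pid, name, swap_kib) rows, largest swap users first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, swap_kib = parse_status(f.read())
        except OSError:
            continue  # the process exited or is not readable
        if swap_kib > 0:
            rows.append((swap_kib, int(entry), name))
    rows.sort(reverse=True)
    return [(pid, name, kib) for kib, pid, name in rows[:limit]]
```

Running top_swappers() on a schedule and alerting when a latency-sensitive process appears in the list catches silent swapping before it shows up in p99 latency.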
Key takeaways
Common mistakes to avoid
- Copying sysctl settings from outdated blog posts without verification
- Never measuring before applying changes
- Changing multiple parameters simultaneously
- Disabling swap entirely on memory-constrained servers
Interview Questions on This Topic
You notice a database server has high %iowait and I/O latency is 50ms. The I/O scheduler is bfq and storage is NVMe. What is the first thing you would change?
echo none > /sys/block/nvme0n1/queue/scheduler and verify with iostat -x 1 that await drops below 2ms.
Frequently Asked Questions