Senior 4 min · March 06, 2026

Linux Performance Tuning — Silent Swap from vm.swappiness

Default vm.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Tuning modifies kernel parameters via sysctl to match workload, not defaults
  • Key subsystems: CPU scheduler, virtual memory, I/O, and network
  • vm.swappiness controls swap tendency — set to 1 for latency-sensitive apps
  • Wrong I/O scheduler on NVMe adds ~40% latency — use none (or mq-deadline)
  • Production rule: change one parameter at a time, measure before and after
Plain-English First

Imagine a busy restaurant kitchen. The head chef (Linux kernel) manages cooks (CPU cores), pantry space (RAM), and delivery trucks (I/O). Out of the box, the kitchen is set up for a casual diner — it works fine for most nights. But on a Saturday rush with 300 covers, you need to rearrange the stations, pre-stock the fridges, and assign cooks to specific roles. That's exactly what Linux performance tuning is: deliberately reorganising how the OS allocates its resources so it can handle YOUR workload, not just an average one.

A default Linux installation is deliberately conservative. The kernel ships with settings tuned for broad compatibility — a database server, a gaming rig, a Raspberry Pi, and a 64-core cloud VM will all boot with roughly the same baseline config. That's great for getting started, but catastrophic for production at scale. A misconfigured TCP buffer kills throughput on a 10 Gbps link. The wrong I/O scheduler on NVMe storage adds 40% latency. A forgotten vm.swappiness setting causes a Redis node to start swapping under load, tanking p99 response times from 2ms to 4 seconds. These aren't theoretical problems — they're war stories from real oncall rotations.

Performance tuning solves the gap between 'it works' and 'it works under pressure'. The Linux kernel exposes hundreds of tuneable knobs through /proc, /sys, and sysctl. Understanding which knobs affect which subsystem — and crucially, WHY they exist — lets you make surgical changes instead of cargo-culting settings from a Stack Overflow post that was written for a 2012 spinning-disk server.

By the end of this article you'll understand how the kernel scheduler, virtual memory subsystem, I/O stack, and network stack interact with each other. You'll be able to profile a live system, identify the bottleneck, apply the right tuning, and verify the improvement with hard numbers — all without rebooting. You'll also know which changes to make permanent and which to test ephemerally first.

What is Linux System Performance Tuning?

Linux system performance tuning is the practice of modifying kernel parameters via /proc, /sys, and sysctl to adjust the OS's behaviour for a specific workload. Default settings target broad compatibility, not peak performance. Tuning is not a one-time event — it's an iterative cycle of measurement, change, verification.

The kernel exposes these knobs because there's no single 'best' config. A web server that handles short-lived connections needs different TCP buffers than a file server streaming large files. A real-time analytics database needs different memory pressure settings than a batch processing job.

The goal is to close the gap between 'works' and 'works under production load'. That means measuring latency, throughput, and resource utilisation before and after each change.
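
One way to make "measure before and after" concrete is to record the current value of every knob before touching it. The following is a minimal sketch, not a finished tool: the helper names `knob_path` and `snapshot_knobs` and the `PROC_SYS_ROOT` override are hypothetical, introduced here for illustration. It relies only on the fact that a sysctl key such as vm.swappiness maps to a file under /proc/sys.

```bash
#!/bin/bash
# Hypothetical helpers: record sysctl values before a change so they can be restored.
# PROC_SYS_ROOT is an assumed override used for testing; it defaults to /proc/sys.

# Map a sysctl key (vm.swappiness) to its file (/proc/sys/vm/swappiness).
knob_path() {
    local root="${PROC_SYS_ROOT:-/proc/sys}"
    printf '%s/%s\n' "$root" "${1//.//}"
}

# Print "key = value" for each key given, flagging knobs this kernel lacks.
snapshot_knobs() {
    local key path
    for key in "$@"; do
        path=$(knob_path "$key")
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$key" "$(cat "$path")"
        else
            printf '# %s not available on this kernel\n' "$key"
        fi
    done
}
```

A possible use is `snapshot_knobs vm.swappiness vm.dirty_ratio > before.conf` ahead of a change. The "key = value" lines follow the sysctl.conf format, so the saved file should be loadable again with `sysctl -p before.conf` if a rollback is needed.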

Production Insight
Blindly applying sysctl settings from a blog post can make things worse.
Example: setting vm.swappiness=0 on a database server seemed right, but it caused the page cache to be evicted aggressively, doubling I/O read latency.
Rule: understand the trade-off — no parameter is universally 'optimal'.
Key Takeaway
Tuning is workload-specific.
Default kernel config is for booting, not for performance.
Always measure before change and after change.

Kernel Scheduler Tuning — CPU Affinity, CFS & NUMA

The Completely Fair Scheduler (CFS) allocates CPU time proportionally among processes. Its main tuning knobs control preemption aggressiveness, group scheduling, and NUMA balancing.

Key parameters
  • kernel.sched_min_granularity_ns: Minimum slice per process. Lower values reduce latency but increase context switches. Default 3ms, reduce to 1ms for latency-sensitive apps.
  • kernel.sched_wakeup_granularity_ns: How long a waking process must wait before preempting a running one. Reduce for interactive workloads.
  • kernel.numa_balancing: Default 1 (enabled). On NUMA machines, this can migrate pages and threads across nodes. Often causes latency spikes. Disable with 0 in virtualised environments.
  • kernel.sched_migration_cost_ns: Time after a process runs before it can be migrated to another CPU. Increasing prevents unnecessary migrations.

Also use taskset to pin processes to specific cores and numactl to bind memory to the local NUMA node.
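
Where the scheduler knobs live depends on the kernel version. Older kernels expose them as kernel.sched_* sysctls, while kernels from 5.13 onward moved them into debugfs under /sys/kernel/debug/sched/ without the sched_ prefix, and 6.6 replaced CFS with EEVDF, which drops some of them. A hedged sketch of a lookup helper follows; the function name `sched_knob_file` and the `SYSROOT` override are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: find the file backing a scheduler knob on this kernel.
# SYSROOT is an assumed override for testing; it defaults to the real root.

sched_knob_file() {
    local name="$1" root="${SYSROOT:-}"
    local sysctl_file="$root/proc/sys/kernel/sched_${name}"
    local debugfs_file="$root/sys/kernel/debug/sched/${name}"
    if [ -e "$sysctl_file" ]; then
        echo "$sysctl_file"
    elif [ -e "$debugfs_file" ]; then
        echo "$debugfs_file"
    else
        return 1
    fi
}

# Example: report where migration_cost_ns lives, if it is visible here.
if f=$(sched_knob_file migration_cost_ns); then
    echo "migration_cost_ns is at $f"
else
    echo "migration_cost_ns not exposed (debugfs may need root or mounting)"
fi
```

Checking for the file first avoids a failed `sysctl -w` on a kernel that no longer has the key.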

scheduler-tuning.sh · BASH
#!/bin/bash
# TheCodeForge: Linux scheduler tuning for an API server
# Pins Nginx workers to cores 0-15 (physical CPUs) on a dual-socket NUMA system

# Collect the PIDs of all running Nginx worker processes
WORKER_PIDS=$(pgrep -f 'nginx: worker')

for pid in $WORKER_PIDS; do
    # Pin to first socket cores only (avoid cross-socket memory access)
    taskset -pc 0-15 "$pid"
done

# Sysctl tweaks for lower latency
# (kernels 5.13+ expose the sched_* knobs under /sys/kernel/debug/sched/ instead)
sysctl -w kernel.sched_min_granularity_ns=1000000   # 1ms vs default 3ms
sysctl -w kernel.sched_wakeup_granularity_ns=2000000  # 2ms vs default 4ms
sysctl -w kernel.sched_migration_cost_ns=5000000    # 5ms → fewer migrations
sysctl -w kernel.numa_balancing=0                    # Disable NUMA balancing
Output
pid 12345's current affinity list: 0-31
pid 12345's new affinity list: 0-15
kernel.sched_min_granularity_ns = 1000000
kernel.sched_wakeup_granularity_ns = 2000000
kernel.sched_migration_cost_ns = 5000000
kernel.numa_balancing = 0
Production Insight
An e-commerce team enabled NUMA balancing on a 2-socket server expecting better memory locality. It caused ~10% CPU overhead from page migrations and random latency spikes 3x the baseline.
Fix: disable numa_balancing on any host running dedicated workloads.
Rule: NUMA balancing helps mixed workloads; hurts single-app servers.
Key Takeaway
Pin critical processes with taskset.
Disable numa_balancing in VMs or dedicated app hosts.
Measure latency before enabling scheduler migrations.

Memory Management Tuning — vm.swappiness, dirty pages & huge pages

The virtual memory subsystem decides how aggressively to swap anonymous pages versus reclaim page cache. The key parameters:

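  • vm.swappiness: 0–100 (0–200 since kernel 5.8). Default 60. Higher values make the kernel more willing to swap out anonymous pages instead of dropping page cache when it reclaims memory. Set to 1 for latency-sensitive apps. 0 avoids swapping until free and file-backed pages fall below a zone's high watermark, so the kernel can still swap.
  • vm.dirty_background_ratio / vm.dirty_ratio: Percentage of memory that may be dirty before background writeback starts (default 10%) and before writing processes are blocked to flush synchronously (default 20%). On write-heavy systems, hitting the blocking threshold causes latency spikes. Lower to 5%/10% for transaction logs.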
  • vm.vfs_cache_pressure: Controls tendency to reclaim inode/dentry cache. Default 100. Lower to 50 on file servers to keep metadata in memory.
  • vm.min_free_kbytes: Reserve memory to avoid direct reclaim under load. Set to 1% of RAM.
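
Because the dirty ratios are percentages, the same setting allows far more unwritten data on a large-memory host. The sketch below turns a ratio into an approximate size; `dirty_limit_mb` is a hypothetical helper, and it deliberately simplifies by using MemTotal, whereas the kernel applies the ratio to "dirtyable" memory (roughly free pages plus page cache), so the result is an upper bound.

```bash
#!/bin/bash
# Hypothetical helper: estimate how much dirty data a ratio allows, in MiB.
# Approximation: uses MemTotal, which is an upper bound on dirtyable memory.

# Args: total memory in kB, ratio in percent.
dirty_limit_mb() {
    local mem_total_kb="$1" ratio_pct="$2"
    echo $(( mem_total_kb * ratio_pct / 100 / 1024 ))
}

# Read MemTotal (in kB) from /proc/meminfo when available.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "dirty_background_ratio=5 -> ~$(dirty_limit_mb "$total_kb" 5) MiB before background writeback"
    echo "dirty_ratio=10           -> ~$(dirty_limit_mb "$total_kb" 10) MiB before writers block"
fi
```

On hosts where even a small percentage is several gigabytes, the absolute-value counterparts vm.dirty_background_bytes and vm.dirty_bytes may be a better fit.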

Huge pages (2MB vs 4KB) reduce TLB misses. For apps with large memory footprints (databases, JVMs), enable transparent huge pages (THP) or use explicit hugetlbfs. THP can cause allocation stalls – often better to disable it and pre-allocate huge pages.
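
Pre-allocating explicit huge pages needs a page count, which is simple arithmetic on the buffer size. A minimal sketch, assuming a hypothetical helper `hugepages_needed` and a 2 GiB buffer pool as the example workload:

```bash
#!/bin/bash
# Hypothetical helper: how many huge pages cover a buffer of a given size?

# Args: buffer size in MiB, huge page size in kB (2048 on most x86-64 systems).
hugepages_needed() {
    local buffer_mb="$1" page_kb="${2:-2048}"
    # Round up so the whole buffer fits.
    echo $(( (buffer_mb * 1024 + page_kb - 1) / page_kb ))
}

# Use the page size this kernel reports, falling back to 2048 kB.
page_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo 2>/dev/null)
echo "2 GiB buffer pool needs $(hugepages_needed 2048 "${page_kb:-2048}") huge pages"
```

With 2MB pages, a 2 GiB buffer works out to 1024 pages. In practice the application usually needs some headroom beyond the buffer itself, so treat the result as a lower bound.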

memory-tuning.sh · BASH
#!/bin/bash
# TheCodeForge: Memory tuning for a MySQL database server
# Aim: reduce swap, control dirty writeback, and use huge pages

# Swap tuning
sysctl -w vm.swappiness=1
sysctl -w vm.min_free_kbytes=$(( $(grep MemTotal /proc/meminfo | awk '{print $2}') / 100 ))

# Dirty page tuning for transaction logs
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Reduce inode cache pressure
sysctl -w vm.vfs_cache_pressure=50

# Disable transparent huge pages (THP) to avoid stalls, use explicit huge pages instead
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Pre-allocate 1024 huge pages for MySQL buffer pool
sysctl -w vm.nr_hugepages=1024

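# Verify (select the fields explicitly; their position in /proc/meminfo varies by kernel)
grep -E '^(AnonHugePages|HugePages_(Total|Free|Rsvd)):' /proc/meminfo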
Output
AnonHugePages: 0 kB
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 0
Production Insight
A Redis instance with vm.swappiness=60 started swapping when a co-located backup process read 100GB of data. The page cache grew to hold the backup's reads, and the kernel swapped out Redis pages to make room. P99 latency went from 2ms to 4s.
Fix: set swappiness=1 and configure cgroups to isolate backup's memory pressure.
Rule: backup and burst processes can trigger swapping on co-located apps – always use cgroup memory limits.
Key Takeaway
Set vm.swappiness=1 for apps that hate swapping.
Control dirty page ratios to avoid long write stalls.
Consider explicit huge pages over THP to avoid allocation jitter.

I/O Subsystem Tuning — Scheduler, Queue Depth & Block Layer

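The Linux block layer sits between file systems and hardware. Its main tunable is the I/O scheduler, which queues and reorders requests. On modern NVMe SSDs, a reordering scheduler such as BFQ (or CFQ on kernels before 5.0, where it was removed) adds overhead. Recent kernels usually select none for NVMe already, so check the current value before changing it. Use none (no reordering) or mq-deadline (bounded request latency).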

Key parameters
  • /sys/block/<dev>/queue/scheduler: set to 'none' for NVMe, 'mq-deadline' for SATA SSD, 'bfq' for spinning disks.
  • /sys/block/<dev>/queue/nr_requests: I/O queue depth. Increase for high-throughput workloads (e.g., 1024 for databases).
  • /sys/block/<dev>/queue/read_ahead_kb: Pre-fetch size. Larger values benefit sequential reads but waste cache on random workloads.

Also tune filesystem mount options: noatime, and for ext4 nobarrier only when the storage has power-loss protection, or use XFS with larger allocation groups.
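
Before changing anything, it helps to see what each device is using now. The sketch below reads the scheduler file, where the active entry is shown in square brackets, and compares it with this article's rule of thumb. The helper names `active_scheduler` and `recommend_scheduler` are hypothetical; the rotational flag comes from /sys/block/<dev>/queue/rotational.

```bash
#!/bin/bash
# Hypothetical helpers around /sys/block/<dev>/queue/scheduler.

# Extract the active scheduler from a line such as "[none] mq-deadline bfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' <<< "$1"
}

# Apply this article's rule of thumb. Args: device name, rotational flag (0 or 1).
recommend_scheduler() {
    local dev="$1" rotational="$2"
    case "$dev" in
        nvme*) echo none ;;
        *) if [ "$rotational" = "1" ]; then echo bfq; else echo mq-deadline; fi ;;
    esac
}

# Report every block device that exposes a scheduler file.
for q in /sys/block/*/queue; do
    [ -r "$q/scheduler" ] || continue
    dev=${q#/sys/block/}; dev=${dev%/queue}
    current=$(active_scheduler "$(cat "$q/scheduler")")
    wanted=$(recommend_scheduler "$dev" "$(cat "$q/rotational" 2>/dev/null)")
    echo "$dev: current=${current:-unknown} suggested=$wanted"
done
```

The loop only reports; it changes nothing, so it is safe to run on a live host. Virtual devices (loop, dm, virtio) will also appear and need their own judgment.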

io-tuning.sh · BASH
#!/bin/bash
# TheCodeForge: I/O tuning for an NVMe-based database
DEV=nvme0n1   # device name as it appears under /sys/block

# Set scheduler to none (called noop on the legacy block layer) for NVMe
echo none > "/sys/block/$DEV/queue/scheduler"

# Increase queue depth for multiple concurrent I/O
echo 1024 > "/sys/block/$DEV/queue/nr_requests"

# Reduce read-ahead since database does random I/O
echo 128 > "/sys/block/$DEV/queue/read_ahead_kb"

# Mount with noatime to reduce metadata writes
mount -o remount,noatime /data

# For ext4, disable barriers (only safe on NVMe with power loss protection)
mount -o remount,noatime,nobarrier /data

# Check new settings
cat "/sys/block/$DEV/queue/scheduler"
cat "/sys/block/$DEV/queue/nr_requests"
Output
[none] mq-deadline bfq
1024
Production Insight
A PostgreSQL server on virtualised NVMe (cloud instance) used the default CFQ scheduler. CFQ split requests into 100ms time slices designed for spinning disks. The database's synchronous commit latency jumped to 150ms.
Fix: switch to none – latency dropped to 2ms.
Rule: always verify the I/O scheduler when deploying on flash storage.
Key Takeaway
NVMe → scheduler to 'none'.
SATA SSD → 'mq-deadline'.
Spinning disk → 'bfq'.
Increase nr_requests for high concurrency.

Network Stack Tuning — TCP Buffers, Congestion Control & Ring Buffers

The network stack's biggest bottleneck is often TCP buffer sizing and interrupt processing. For high-speed links (>1 Gbps), default socket buffers are too small, causing underutilisation.

Key parameters
  • net.core.rmem_max / net.core.wmem_max: Max receive/send socket buffer (bytes). Set to 16MB for 10G links.
  • net.ipv4.tcp_rmem / tcp_wmem: min-default-max for TCP buffers. Set min=4096, default=87380, max=16777216 (16MB).
  • net.ipv4.tcp_congestion_control: Default cubic. For lossy or long-haul links, use bbr (needs kernel 4.9+). BBR handles packet loss better and can increase throughput.
  • net.core.netdev_max_backlog: Max packets queued from NIC before kernel drops. Increase to 5000 on 10G links.
  • /sys/class/net/eth0/queues/rx-*/rps_cpus: Enable RPS (Receive Packet Steering) to spread interrupt load across CPUs.
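
The buffer sizes above follow from the bandwidth-delay product (BDP): a single TCP connection can only fill the link if its buffer holds at least bandwidth × round-trip time worth of data. A small sketch of that arithmetic, with `bdp_bytes` as a hypothetical helper and the RTT values chosen purely as examples:

```bash
#!/bin/bash
# Hypothetical helper: bandwidth-delay product in bytes.
# BDP = bandwidth (bits/s) * RTT (s) / 8

# Args: link speed in Mbit/s, round-trip time in milliseconds.
bdp_bytes() {
    local mbps="$1" rtt_ms="$2"
    # mbps * 1e6 bits/s * rtt_ms / 1e3 s / 8 bits per byte = mbps * rtt_ms * 125
    echo $(( mbps * rtt_ms * 125 ))
}

echo "10G, 1 ms RTT:  $(bdp_bytes 10000 1) bytes"
echo "10G, 13 ms RTT: $(bdp_bytes 10000 13) bytes"
```

At 10 Gbps a 16MB maximum covers round trips up to roughly 13 ms. Longer paths need a larger maximum, while hosts that only talk within one datacenter need far less.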

network-tuning.sh · BASH
#!/bin/bash
# TheCodeForge: Network tuning for a 10Gbps web server

# Socket buffer maxima for 10G
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP auto-tuning ranges
sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'

# Use BBR congestion control (requires kernel 4.9+)
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Increase backlog for bursty traffic
sysctl -w net.core.netdev_max_backlog=5000

# Enable RPS (Receive Packet Steering) on all cores for eth0
# For a 4-core machine: bitmap 0x0f (cores 0-3)
echo 'f' > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Verify
sysctl net.ipv4.tcp_congestion_control
cat /sys/class/net/eth0/queues/rx-0/rps_cpus
Output
net.ipv4.tcp_congestion_control = bbr
f
Production Insight
A streaming service used Cubic congestion control on a 10G inter-datacenter link with 0.2% packet loss. Throughput was capped at 1.5Gbps. After switching to BBR, throughput jumped to 8.5Gbps – BBR's bandwidth estimation doesn't treat packet loss as congestion.
Fix: sysctl net.ipv4.tcp_congestion_control=bbr.
Rule: for WAN links with >0.1% loss, BBR is almost always better than Cubic.
Key Takeaway
Size socket buffer maxima to the link's bandwidth-delay product; 16MB is a common ceiling for 10G links.
BBR beats Cubic on lossy long-haul links.
RPS/IRQ balance prevents single-CPU saturation.
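
Both rps_cpus and /proc/irq/*/smp_affinity take a hexadecimal CPU bitmask, which is easy to get wrong by hand. A minimal sketch of a mask builder follows; `cpu_mask` is a hypothetical helper, and it assumes CPU numbers below 63 because it uses 64-bit shell arithmetic.

```bash
#!/bin/bash
# Hypothetical helper: hex CPU bitmask for cores first..last (inclusive),
# in the format expected by rps_cpus and /proc/irq/*/smp_affinity.
# Limited to CPU numbers below 63 (64-bit signed shell arithmetic).

cpu_mask() {
    local first="$1" last="$2" mask=0 cpu
    for (( cpu = first; cpu <= last; cpu++ )); do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 3    # cores 0-3
cpu_mask 4 7    # cores 4-7
```

The two calls print f and f0. A typical use would be `cpu_mask 0 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus`, run as root and repeated for each rx queue.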

Putting It All Together — A Repeatable Tuning Workflow

  1. Baseline measurement: Collect latency, throughput, CPU, memory, I/O, and network metrics for at least 24 hours under typical load. Use tools like sar (from the sysstat package), perf, and netdata.
  2. Identify bottleneck: Use the USE method (Utilization, Saturation, Errors) – e.g., CPU util > 90%? I/O queue length growing? Network drops?
  3. Hypothesis and change: Pick ONE parameter. Change it. Document why and expected effect.
  4. Measure again: Same period and load type. Compare before/after.
  5. Accept or rollback: If improvement >5% in the target metric, keep. If not, rollback and try different hypothesis.
  6. Make persistent: Only after validation, add to /etc/sysctl.d/ or tuned profiles.

Treat every parameter change as an experiment. Use tools like 'tuned' to apply preset profiles for common workloads (latency-performance, throughput-performance).
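
The accept-or-rollback step can be reduced to a single comparison so that the decision is not made by eye. The sketch below assumes a lower-is-better metric such as p99 latency; `tuning_verdict` is a hypothetical helper and the 5% default mirrors the threshold in the list above.

```bash
#!/bin/bash
# Hypothetical helper for step 5: keep a change only if a lower-is-better
# metric (for example p99 latency in ms) improved by more than a threshold.

# Args: baseline value, new value, threshold percent (default 5).
# Prints "keep" or "rollback".
tuning_verdict() {
    awk -v before="$1" -v after="$2" -v threshold="${3:-5}" 'BEGIN {
        if (before <= 0) { print "rollback"; exit }
        improvement = (before - after) / before * 100
        if (improvement > threshold) print "keep"; else print "rollback"
    }'
}

tuning_verdict 2.0 1.7    # 15% lower latency
tuning_verdict 2.0 1.95   # 2.5% lower latency
```

For a higher-is-better metric such as throughput, swap the two arguments' roles or negate the improvement. Either way, compare runs taken over the same period and load type.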

The Tuning Loop
  • Default config is for breadth, not production
  • One change at a time, measure before and after
  • The kernel has hundreds of knobs, but only 5-10 matter for your workload
  • If you can't explain why a parameter helps, don't apply it
Production Insight
A team applied 20 sysctl changes from a 'production tuning' blog post at once. When latency improved, they didn't know which change caused it. When the database later started throwing connection resets, they couldn't roll back.
Rule: change one parameter per iteration. Use version control for your sysctl configs.
Key Takeaway
Tune one parameter at a time.
Baseline for 24h before any change.
Measure the same metric – more data, less guesswork.
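
Keeping one validated setting per drop-in file, with the reason recorded beside it, makes later rollbacks and code review straightforward. A hedged sketch: `persist_knob`, the 90- file prefix, and the `SYSCTL_DIR` override are all hypothetical choices made for illustration.

```bash
#!/bin/bash
# Hypothetical helper: persist one validated setting as its own drop-in file,
# with the reason recorded next to it. SYSCTL_DIR is an assumed override for
# testing; the real location is /etc/sysctl.d (writing there needs root).

# Args: sysctl key, value, one-line reason. Prints the file it wrote.
persist_knob() {
    local key="$1" value="$2" reason="$3"
    local dir="${SYSCTL_DIR:-/etc/sysctl.d}"
    local file="$dir/90-${key}.conf"
    {
        printf '# %s\n' "$reason"
        printf '%s = %s\n' "$key" "$value"
    } > "$file" && echo "$file"
}
```

A call such as `persist_knob vm.swappiness 1 "Redis p99 regression under backup load"` writes the file; `sysctl --system` then reloads every drop-in, and the directory can be committed to version control.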
● Production incident · POST-MORTEM · severity: high

The Silent OOM that Cost $2,000 in AWS Credits

Symptom
Redis cluster latency spikes, cloud monitoring shows memory pressure but no OOM kill. P99 response times go from 2ms to 4s. The node stays up but becomes unusable.
Assumption
The team assumed that with 64GB RAM and only 30GB of Redis data, swapping was impossible. They never checked vm.swappiness.
Root cause
vm.swappiness defaults to 60. At that setting, whenever the kernel has to reclaim memory (for example because page cache has filled the remaining RAM), it is willing to swap out anonymous pages rather than only dropping cache. Redis's working set pages get swapped to disk, and each access to them then requires a slow disk read.
Fix
Set vm.swappiness=1 in /etc/sysctl.d/99-swap.conf and apply with sysctl --system (plain sysctl -p only reads /etc/sysctl.conf). Then disable swap entirely for latency-critical workloads (swapoff -a). After tuning, swap usage dropped to zero and p99 returned to 2ms.
Key lesson
  • Default sysctl values are not safe for all workloads – especially vm.swappiness.
  • Always check swap usage (free -h, /proc/meminfo) even if you have 'plenty' of RAM.
  • For latency-sensitive apps, set vm.swappiness=1 or disable swap entirely.
  • Monitor /proc/[pid]/status VmSwap to see per-process swap usage.
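
The last lesson can be automated by walking /proc and reading the VmSwap field of each status file. A minimal sketch, assuming a hypothetical helper `swap_by_process` and a `PROC_ROOT` override that exists only so the function can be tested against fixture files:

```bash
#!/bin/bash
# Hypothetical helper: list processes that currently have pages in swap.
# PROC_ROOT is an assumed override for testing; it defaults to /proc.

swap_by_process() {
    local root="${PROC_ROOT:-/proc}" status
    for status in "$root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # Name: and VmSwap: are fields in /proc/<pid>/status;
        # kernel threads have no VmSwap line and are skipped.
        awk '/^Name:/ {name = $2}
             /^VmSwap:/ && $2 > 0 {print $2, name}' "$status" 2>/dev/null
    done | sort -rn
}

# Top ten swap users, printed as "kB name".
swap_by_process | head -10
```

An empty result means no process has swapped-out pages. Seeing the latency-critical service near the top of this list is the signal the incident above was missing.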
Production debug guide
Symptom -> Action flow for the most common production issues · 4 entries
Symptom · 01
Latency spikes, high %iowait in top or vmstat
Fix
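Check iostat -x 1. Look at the queue size (avgqu-sz, shown as aqu-sz in newer sysstat) and await. If await and queue size are high while %util sits near 100%, the disk is saturated. If await is high but %util is low, the delay is more likely added above the device, so check the I/O scheduler: cat /sys/block/sda/queue/scheduler. Switch to none or mq-deadline if on NVMe.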
Symptom · 02
Unexpected swapping (kswapd0 high CPU, p99 latency spikes)
Fix
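Run free -h and check the used column of the Swap row. Then grep -E 'Swap|Dirty' /proc/meminfo. If SwapTotal minus SwapFree is non-zero, pages are sitting in swap; non-zero si/so columns in vmstat 1 mean the system is swapping right now. Set vm.swappiness=1 and consider disabling swap for latency-critical services.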
Symptom · 03
Network throughput far below link speed, dropped packets
Fix
Check netstat -s for TCP retransmits and packet drops. Run ethtool -S eth0 | grep drop. Tune net.core.rmem_max and net.core.wmem_max to 16MB for 10G links. Also check tcp_congestion_control vs BBR.
Symptom · 04
CPU imbalance: one core pegged, others idle
Fix
Check /proc/interrupts for IRQ imbalance. Run irqbalance, or manually assign IRQ affinity via /proc/irq/*/smp_affinity. For NUMA, ensure memory is local (numactl --membind).
★ Linux Performance Debug Cheat Sheet
Quick-fire commands for the three most common failure scenarios
High CPU system time but low user time
Immediate action
Check context switches: vmstat 1 5. Watch cs (context switches per second).
Commands
vmstat 1 10 | awk '{print $12,$14}' # cs and sy columns
perf top -e context-switches -s comm,symbol # find which tasks and functions trigger context switches
Fix now
Raise kernel.sched_migration_cost_ns to 5000000. If system time is still high, disable NUMA balancing with kernel.numa_balancing=0.
Memory allocation latency spikes under load
Immediate action
Check direct reclaim counters: grep -E '^(allocstall|pgscan_direct)' /proc/vmstat
Commands
grep -E '^(allocstall|pgscan_direct)' /proc/vmstat | awk '$2 > 0 {print "direct reclaim has occurred since boot:", $1, $2}'
echo 50 > /proc/sys/vm/vfs_cache_pressure # below the default of 100, to reduce dentry/inode cache eviction
Fix now
Increase vm.min_free_kbytes to 1% of RAM. Set vm.zone_reclaim_mode=0 (disable aggressive zone reclaim).
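
Counters in /proc/vmstat are cumulative since boot, so a non-zero value only proves that direct reclaim happened at some point. To see whether it is happening now, sample twice and compare. A minimal sketch, with `allocstall_total` as a hypothetical helper and a 2-second window chosen arbitrarily:

```bash
#!/bin/bash
# Hypothetical helper: detect direct reclaim happening *now*.
# Counters in /proc/vmstat are cumulative since boot, so compare two samples.

# Sum every allocstall* counter (allocations that stalled in direct reclaim).
# Optional arg: file to read, defaulting to /proc/vmstat.
allocstall_total() {
    awk '/^allocstall/ {sum += $2} END {print sum + 0}' "${1:-/proc/vmstat}"
}

if [ -r /proc/vmstat ]; then
    before=$(allocstall_total)
    sleep 2
    after=$(allocstall_total)
    echo "direct reclaim stalls in the last 2s: $(( after - before ))"
fi
```

A steadily growing delta under load suggests the free-memory reserve is too small for the allocation rate, which is the case the min_free_kbytes change targets.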
Apache/Nginx worker threads time out during TCP reconnection
Immediate action
Check net.ipv4.tcp_tw_reuse and tcp_fin_timeout
Commands
ss -s | grep timewait # count TIME_WAIT sockets
sysctl net.ipv4.tcp_fin_timeout # default is 60
Fix now
Set net.ipv4.tcp_tw_reuse=1 and net.ipv4.tcp_fin_timeout=15. Also increase net.ipv4.tcp_max_tw_buckets=2000000.
I/O Scheduler Selection Guide
Storage Type | Recommended Scheduler | Why
NVMe SSD | none (or noop) | No reordering, minimal latency. NVMe has >100K IOPS; scheduler overhead slows it down.
SATA SSD | mq-deadline | Deadline ensures reads/writes don't starve. Multi-queue variant scales with multiple cores.
Spinning HDD (database) | bfq | Budget Fair Queuing provides fairness among processes. Good for mixed workloads.
Spinning HDD (sequential streaming) | mq-deadline | Starves writes moderately, but sequential throughput remains high.
Virtualised (paravirtual scsi) | none | Hypervisor handles scheduling; a scheduler inside the guest adds double queuing.

Key takeaways

  1. Default kernel settings are for boot, not for production. Always tune for your workload.
  2. One change per iteration. Baseline for 24h before and after. Document everything.
  3. The four subsystems (CPU, memory, I/O, network) interact. Fix one bottleneck first, then the next.
  4. NVMe storage must use the 'none' I/O scheduler to avoid latency overhead.
  5. vm.swappiness=1 stops silent swapping that destroys latency.
  6. BBR congestion control beats Cubic on any link with >0.1% packet loss.
  7. Use tools like sar, perf, iostat, and sysctl consistently in a repeatable workflow.

Common mistakes to avoid


Copying sysctl settings from outdated blog posts without verification

Symptom
After applying, performance worsens or mysterious errors appear (e.g., tcp_tw_recycle=1, which broke connections from clients behind NAT and was removed in kernel 4.12).
Fix
Verify each parameter's kernel version support. Use sysctl -a | grep <param> to confirm it exists and see its current value, and read the kernel's sysctl documentation for its meaning. Test in non-production first.

Never measuring before applying changes

Symptom
No baseline exists. When latency improves, you can't tell if it was your change or traffic variation.
Fix
Run 'sar -A' for at least 24h before any tuning. Save output to a file. Re-run same sar commands after change and diff the results.

Changing multiple parameters simultaneously

Symptom
Performance gain is real but you don't know which change caused it. If side effects appear, you can't isolate them.
Fix
Change ONE parameter per iteration. Document the change, reason, and outcome. Use version-controlled sysctl configs.

Disabling swap entirely on memory-constrained servers

Symptom
OOM killer triggers more often because swap acts as emergency buffer. Kernel cannot reclaim memory fast enough.
Fix
Set vm.swappiness=1 instead of setting it to 0 or turning off swap. This keeps swap available as a safety net while prioritising keeping pages in memory. For latency-critical apps, disabling swap is still reasonable, but only if the host has enough spare RAM.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
You notice a database server has high %iowait and I/O latency is 50ms. T...
Q02 · SENIOR
Explain the relationship between vm.swappiness, dirty ratio, and swap pr...
Q03 · SENIOR
How would you tune a 10Gbps web server to handle 100K concurrent connect...
Q01 of 03 · SENIOR

You notice a database server has high %iowait and I/O latency is 50ms. The I/O scheduler is CFQ and storage is NVMe. What is the first thing you would change?

ANSWER
Change the I/O scheduler to none (noop). CFQ was designed for spinning disks – it groups requests into time slices, which adds latency. NVMe flash has no seek penalty, so request reordering is unnecessary overhead. Run echo none > /sys/block/nvme0n1/queue/scheduler and verify with iostat -x 1 that await drops below 2ms.
FAQ · 4 QUESTIONS

Frequently Asked Questions

  01. Is it safe to modify sysctl values without rebooting?
  02. What is the difference between transparent huge pages (THP) and explicit huge pages?
  03. Why does the default I/O scheduler on modern distros still use CFQ or BFQ?
  04. How do I know which sysctl parameters to change?