Senior 11 min · March 06, 2026

Linux Performance Tuning — Silent Swap from vm.swappiness

Default vm.swappiness=60 silently swaps Redis working set, spiking P99 from 2ms to 4s.

Naren, Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Last updated: May 23, 2026
Quick Answer
  • Tuning modifies kernel parameters via sysctl to match workload, not defaults
  • Key subsystems: CPU scheduler, virtual memory, I/O, and network
  • vm.swappiness controls swap tendency — set to 1 for latency-sensitive apps
  • Wrong I/O scheduler on NVMe adds ~40% latency — use none (or mq-deadline)
  • Production rule: change one parameter at a time, measure before and after
What is Linux System Performance Tuning?

Linux system performance tuning is the practice of modifying kernel parameters via /proc, /sys, and sysctl to adjust the OS's behaviour for a specific workload. Default settings target broad compatibility, not peak performance. Tuning is not a one-time event — it's an iterative cycle of measurement, change, verification.

The kernel exposes these knobs because there's no single 'best' config. A web server that handles short-lived connections needs different TCP buffers than a file server streaming large files. A real-time analytics database needs different memory pressure settings than a batch processing job.

The goal is to close the gap between 'works' and 'works under production load'. That means measuring latency, throughput, and resource utilisation before and after each change.

Plain-English First

Imagine a busy restaurant kitchen. The head chef (Linux kernel) manages cooks (CPU cores), pantry space (RAM), and delivery trucks (I/O). Out of the box, the kitchen is set up for a casual diner — it works fine for most nights. But on a Saturday rush with 300 covers, you need to rearrange the stations, pre-stock the fridges, and assign cooks to specific roles. That's exactly what Linux performance tuning is: deliberately reorganising how the OS allocates its resources so it can handle YOUR workload, not just an average one.

A default Linux installation is deliberately conservative. The kernel ships with settings tuned for broad compatibility — a database server, a gaming rig, a Raspberry Pi, and a 64-core cloud VM will all boot with roughly the same baseline config. That's great for getting started, but catastrophic for production at scale. A misconfigured TCP buffer kills throughput on a 10 Gbps link. The wrong I/O scheduler on NVMe storage adds 40% latency. A forgotten vm.swappiness setting causes a Redis node to start swapping under load, tanking p99 response times from 2ms to 4 seconds. These aren't theoretical problems — they're war stories from real oncall rotations.

Performance tuning solves the gap between 'it works' and 'it works under pressure'. The Linux kernel exposes hundreds of tuneable knobs through /proc, /sys, and sysctl. Understanding which knobs affect which subsystem — and crucially, WHY they exist — lets you make surgical changes instead of cargo-culting settings from a Stack Overflow post that was written for a 2012 spinning-disk server.

By the end of this article you'll understand how the kernel scheduler, virtual memory subsystem, I/O stack, and network stack interact with each other. You'll be able to profile a live system, identify the bottleneck, apply the right tuning, and verify the improvement with hard numbers — all without rebooting. You'll also know which changes to make permanent and which to test ephemerally first.
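As a rough sketch of what "test ephemerally first" looks like in practice: every sysctl key maps to a file under /proc/sys, so current values can be read without root before anything is changed. The helper name sysctl_path is an assumption made for illustration, not a standard tool, and it ignores edge cases such as interface names that contain dots.

```shell
#!/bin/bash
# Sketch only: sysctl_path is a hypothetical helper, not part of procps.
# It maps a dotted sysctl key to the file the kernel exposes for it.
sysctl_path() {
    # vm.swappiness -> /proc/sys/vm/swappiness
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Print the live value of a few keys this article tunes.
for key in vm.swappiness vm.dirty_ratio net.core.rmem_max; do
    path=$(sysctl_path "$key")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$key" "$(cat "$path")"
    else
        printf '%s: not readable on this host\n' "$key"
    fi
done
```

A value written with sysctl -w lasts only until reboot; it becomes permanent once it lives in a file under /etc/sysctl.d/, which is why persistence is the last step of any tuning cycle.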

Production Insight
Blindly applying sysctl settings from a blog post can make things worse.
Example: setting vm.swappiness=0 on a database server seemed right, but it caused the page cache to be evicted aggressively, doubling I/O read latency.
Rule: understand the trade-off — no parameter is universally 'optimal'.
Key Takeaway
Tuning is workload-specific.
Default kernel config is for booting, not for performance.
Always measure before and after each change.
[Figure: Linux Performance Tuning Flow. Stages: kernel scheduler tuning (CPU affinity, CFS, NUMA balancing); memory management tuning (vm.swappiness, dirty page ratios); I/O subsystem tuning (scheduler, queue depth, block layer); network stack tuning (TCP buffers, congestion control); repeatable tuning workflow (measure, adjust, validate). Callout: CPU cache thrashing is your silent killer; monitor cache misses and avoid over-tuning context switches.]

Kernel Scheduler Tuning — CPU Affinity, CFS & NUMA

The Completely Fair Scheduler (CFS) allocates CPU time proportionally among processes. Its main tuning knobs control preemption aggressiveness, group scheduling, and NUMA balancing.

Key parameters
  • kernel.sched_min_granularity_ns: Minimum slice per process. Lower values reduce latency but increase context switches. Default is typically 3ms on multi-core hosts; reduce to 1ms for latency-sensitive apps.
  • kernel.sched_wakeup_granularity_ns: How far ahead a waking process must be before it preempts the running one. Reduce for interactive workloads.
  • kernel.numa_balancing: Usually 1 (enabled) on NUMA hardware. It migrates pages and threads across nodes at runtime, which often causes latency spikes. Disable with 0 in virtualised environments.
  • kernel.sched_migration_cost_ns: How long after it last ran a process is still treated as cache-hot and not worth migrating to another CPU. Increasing it prevents unnecessary migrations.
  • Kernel version caveat: from 5.13 the three sched_* knobs live in debugfs (/sys/kernel/debug/sched/) rather than sysctl, and from 6.6 EEVDF replaces CFS, so confirm which of them exist on your kernel before scripting them.

Also use taskset to pin processes to specific cores and numactl to bind memory to the local NUMA node.

scheduler-tuning.sh (Bash)
#!/bin/bash
# TheCodeForge: Linux scheduler tuning for an API server
# Pins Nginx workers to cores 0-15 (physical CPUs) on a dual-socket NUMA system

# pgrep -f matches against the full command line and prints one PID per line
WORKER_PIDS=$(pgrep -f 'nginx: worker')

for pid in $WORKER_PIDS; do
    # Pin to first socket cores only (avoid cross-socket memory access)
    taskset -pc 0-15 "$pid"
done

# Sysctl tweaks for lower latency
# (kernels before 5.13; later kernels expose the sched_* knobs in debugfs)
sysctl -w kernel.sched_min_granularity_ns=1000000   # 1ms vs default 3ms
sysctl -w kernel.sched_wakeup_granularity_ns=2000000  # 2ms vs default 4ms
sysctl -w kernel.sched_migration_cost_ns=5000000    # 5ms → fewer migrations
sysctl -w kernel.numa_balancing=0                    # Disable NUMA balancing
Output
pid 12345's current affinity list: 0-31
pid 12345's new affinity list: 0-15
kernel.sched_min_granularity_ns = 1000000
kernel.sched_wakeup_granularity_ns = 2000000
kernel.sched_migration_cost_ns = 5000000
kernel.numa_balancing = 0
Production Insight
An e-commerce team enabled NUMA balancing on a 2-socket server expecting better memory locality. It caused ~10% CPU overhead from page migrations and random latency spikes 3x the baseline.
Fix: disable numa_balancing on any host running dedicated workloads.
Rule: NUMA balancing helps mixed workloads; hurts single-app servers.
Key Takeaway
Pin critical processes with taskset.
Disable numa_balancing in VMs or dedicated app hosts.
Measure latency before enabling scheduler migrations.

Memory Management Tuning — vm.swappiness, dirty pages & huge pages

The virtual memory subsystem decides how aggressively to swap anonymous pages versus reclaim page cache. The key parameters:

  • vm.swappiness: 0-100 (0-200 since kernel 5.8). The default of 60 lets the kernel swap out anonymous pages during reclaim instead of only dropping page cache, so a working set can be swapped even when most RAM is reclaimable cache. Set to 1 for latency-sensitive apps. 0 avoids swapping until reclaim has no other option (the kernel can still swap).
  • vm.dirty_ratio / vm.dirty_background_ratio: When writeback starts. Defaults are 10% (background writeback begins) and 20% (writers block until pages are flushed). On write-heavy systems, the blocking threshold causes latency spikes. Lower to 5% (background) and 10% (blocking) for transaction logs.
  • vm.vfs_cache_pressure: Controls tendency to reclaim inode/dentry cache. Default 100. Lower to 50 on file servers to keep metadata in memory.
  • vm.min_free_kbytes: Reserve memory to avoid direct reclaim under load. A common starting point is about 1% of RAM.
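
To see why the dirty ratios matter, it helps to convert the percentages into megabytes. The sketch below assumes the ratio applies to total RAM, which slightly overstates the real threshold (the kernel uses available memory); dirty_limit_mb is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: approximate how much dirty data can pile up before writers block.
# Assumption: the ratio is applied to total RAM. The kernel actually uses
# available (free + reclaimable) memory, so treat the result as an upper bound.
dirty_limit_mb() {
    # $1 = RAM in MB, $2 = vm.dirty_ratio in percent
    echo $(( $1 * $2 / 100 ))
}

# A 64 GB host at a 20% ratio versus a tuned 10%
echo "dirty_ratio=20 -> $(dirty_limit_mb 65536 20) MB"
echo "dirty_ratio=10 -> $(dirty_limit_mb 65536 10) MB"
```

On a 64 GB host, 20% means roughly 13 GB of unwritten data can accumulate before a writer stalls, which is a long flush on anything but fast storage.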

Huge pages (2MB vs 4KB) reduce TLB misses. For apps with large memory footprints (databases, JVMs), enable transparent huge pages (THP) or use explicit hugetlbfs. THP can cause allocation stalls – often better to disable it and pre-allocate huge pages.

memory-tuning.sh (Bash)
#!/bin/bash
# TheCodeForge: Memory tuning for a MySQL database server
# Aim: reduce swap, control dirty writeback, and use huge pages

# Swap tuning
sysctl -w vm.swappiness=1
# Reserve about 1% of RAM (MemTotal is reported in kB)
sysctl -w vm.min_free_kbytes="$(awk '/^MemTotal:/ {print int($2 / 100)}' /proc/meminfo)"

# Dirty page tuning for transaction logs
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Reduce inode cache pressure
sysctl -w vm.vfs_cache_pressure=50

# Disable transparent huge pages (THP) to avoid stalls, use explicit huge pages instead
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Pre-allocate 1024 huge pages (2GB at 2MB each) for MySQL buffer pool
sysctl -w vm.nr_hugepages=1024

# Verify (match the four fields by name; their position in meminfo varies by kernel)
grep -E '^(AnonHugePages|HugePages_(Total|Free|Rsvd)):' /proc/meminfo
Output
AnonHugePages: 0 kB
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 0
Production Insight
A Redis instance with vm.swappiness=60 started swapping when a backup process flooded the page cache. The backup read 100GB of data; the page cache grew, and the kernel swapped out Redis pages to make room. P99 latency went from 2ms to 4s.
Fix: set swappiness=1 and configure cgroups to isolate backup's memory pressure.
Rule: backup and burst processes can trigger swapping on co-located apps – always use cgroup memory limits.
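One way to catch this early is to check how much of a given process has already been swapped out. This is a minimal sketch that assumes the kernel exposes a VmSwap field in /proc/<pid>/status (most kernels built with swap support do); swap_kb_of is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: report how many kB of a process currently live in swap.
# Reads the VmSwap field of /proc/<pid>/status; prints 0 if the field is absent.
swap_kb_of() {
    awk '/^VmSwap:/ { print $2; found = 1 }
         END { if (!found) print 0 }' "/proc/$1/status"
}

# Example: inspect this shell. For Redis, pass the server's PID instead.
echo "PID $$ has $(swap_kb_of $$) kB in swap"
```

Any non-zero value for a latency-sensitive process is worth an alert, because each swapped page turns a memory access into a disk read.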
Key Takeaway
Set vm.swappiness=1 for apps that hate swapping.
Control dirty page ratios to avoid long write stalls.
Consider explicit huge pages over THP to avoid allocation jitter.

I/O Subsystem Tuning — Scheduler, Queue Depth & Block Layer

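The Linux block layer sits between file systems and hardware. Its main tunable is the I/O scheduler, which queues and reorders requests. On modern NVMe SSDs, a reordering scheduler (BFQ, or CFQ on kernels before 5.0, where it was removed) adds overhead. Switch to none (no reordering) or mq-deadline (bounded latency). Many current distros already default NVMe devices to none, so check before changing anything.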

Key parameters
  • /sys/block/<dev>/queue/scheduler: set to 'none' for NVMe, 'mq-deadline' for SATA SSD, 'bfq' for spinning disks.
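  • /sys/block/<dev>/queue/nr_requests: I/O queue depth. Increase for high-throughput workloads (e.g., 1024 for databases). With the none scheduler the value is capped by the device's hardware queue depth, so a larger write may be rejected.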
  • /sys/block/<dev>/queue/read_ahead_kb: Pre-fetch size. Larger values benefit sequential reads but waste cache on random workloads.

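Also tune filesystem mount options: noatime on any filesystem; nobarrier on ext4 only when the storage has power-loss protection, since it trades crash safety for speed; or use XFS with larger allocation groups.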
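
Before changing a scheduler, it is worth listing what each device is actually using. The sketch below relies on the kernel marking the active scheduler with square brackets in /sys/block/<dev>/queue/scheduler; active_sched is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: show the active I/O scheduler for every block device.
# The kernel brackets the active one, e.g. "[none] mq-deadline bfq".
active_sched() {
    # Print the bracketed token; input without brackets is printed unchanged.
    printf '%s\n' "$1" | sed 's/.*\[\([^]]*\)\].*/\1/'
}

for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%-12s %s\n' "$dev" "$(active_sched "$(cat "$f")")"
done
```

If an NVMe device already reports none, there is nothing to switch, and any latency problem lies elsewhere.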

io-tuning.sh (Bash)
#!/bin/bash
# TheCodeForge: I/O tuning for an NVMe-based database
DEV=nvme0n1   # device name as it appears under /sys/block

# Set scheduler to none for NVMe (the multi-queue successor to noop)
echo none > "/sys/block/$DEV/queue/scheduler"

# Increase queue depth for multiple concurrent I/O
# (may be rejected if it exceeds the device's hardware queue depth)
echo 1024 > "/sys/block/$DEV/queue/nr_requests"

# Keep read-ahead small since the database does random I/O (128 is the usual default)
echo 128 > "/sys/block/$DEV/queue/read_ahead_kb"

# Mount with noatime to reduce metadata writes
mount -o remount,noatime /data

# For ext4 on storage with power-loss protection only: also disable barriers
mount -o remount,noatime,nobarrier /data

# Check new settings
cat "/sys/block/$DEV/queue/scheduler"
cat "/sys/block/$DEV/queue/nr_requests"
Output
[none] mq-deadline bfq
1024
Production Insight
A PostgreSQL server on virtualised NVMe (cloud instance) used the default CFQ scheduler. CFQ split requests into 100ms time slices designed for spinning disks. The database's synchronous commit latency jumped to 150ms.
Fix: switch to none – latency dropped to 2ms.
Rule: always verify the I/O scheduler when deploying on flash storage.
Key Takeaway
NVMe → scheduler to 'none'.
SATA SSD → 'mq-deadline'.
Spinning disk → 'bfq'.
Increase nr_requests for high concurrency.

Network Stack Tuning — TCP Buffers, Congestion Control & Ring Buffers

The network stack's biggest bottleneck is often TCP buffer sizing and interrupt processing. For high-speed links (>1 Gbps), default socket buffers are too small, causing underutilisation.

Key parameters
  • net.core.rmem_max / net.core.wmem_max: Max receive/send socket buffer (bytes). Set to 16MB for 10G links.
  • net.ipv4.tcp_rmem / tcp_wmem: min-default-max for TCP buffers. Set min=4096, default=87380, max=16777216 (16MB).
  • net.ipv4.tcp_congestion_control: Default cubic. For lossy or long-haul links, use bbr (needs kernel 4.9+). BBR handles packet loss better and can increase throughput.
  • net.core.netdev_max_backlog: Max packets queued from NIC before kernel drops. Increase to 5000 on 10G links.
  • /sys/class/net/eth0/queues/rx-*/rps_cpus: Enable RPS (Receive Packet Steering) to spread receive softirq processing across CPUs when the NIC has fewer hardware queues than cores.
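
The 16MB figure is not arbitrary: a socket buffer needs to hold roughly one bandwidth-delay product (BDP) of data to keep a link full. A minimal sketch of the arithmetic, using integer link speeds and round-trip times chosen purely for illustration; bdp_bytes is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: bandwidth-delay product, the amount of data in flight on a full pipe.
# BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8
bdp_bytes() {
    # $1 = link speed in Gbit/s, $2 = round-trip time in ms (integers)
    echo $(( $1 * 1000000000 / 8 * $2 / 1000 ))
}

echo "10G, 1 ms RTT:  $(bdp_bytes 10 1) bytes"
echo "10G, 10 ms RTT: $(bdp_bytes 10 10) bytes"
```

A 16MB maximum (16777216 bytes) covers a 10G link up to roughly 13 ms of round-trip time; longer paths need a larger ceiling, and shorter ones get by with far less.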
network-tuning.sh (Bash)
#!/bin/bash
# TheCodeForge: Network tuning for a 10Gbps web server

# Socket buffer maxima for 10G
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP auto-tuning ranges
sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'

# Use BBR congestion control (requires kernel 4.9+ and the tcp_bbr module)
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Increase backlog for bursty traffic
sysctl -w net.core.netdev_max_backlog=5000

# Enable RPS (Receive Packet Steering) on all cores for eth0
# For a 4-core machine: bitmap 0x0f (cores 0-3)
echo 'f' > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Verify
sysctl net.ipv4.tcp_congestion_control
cat /sys/class/net/eth0/queues/rx-0/rps_cpus
Output
net.ipv4.tcp_congestion_control = bbr
f
Production Insight
A streaming service used Cubic congestion control on a 10G inter-datacenter link with 0.2% packet loss. Throughput was capped at 1.5Gbps. After switching to BBR, throughput jumped to 8.5Gbps – BBR's bandwidth estimation doesn't treat packet loss as congestion.
Fix: sysctl net.ipv4.tcp_congestion_control=bbr.
Rule: for WAN links with >0.1% loss, BBR is almost always better than Cubic.
Key Takeaway
Size socket buffer maxima to the link's bandwidth-delay product; 16MB suits most 10G links.
BBR beats Cubic on lossy long-haul links.
RPS/IRQ balance prevents single-CPU saturation.

Putting It All Together — A Repeatable Tuning Workflow

  1. Baseline measurement: Collect latency, throughput, CPU, memory, I/O, and network metrics for at least 24 hours under typical load. Use tools like sar (part of the sysstat package), perf, and netdata.
  2. Identify bottleneck: Use the USE method (Utilization, Saturation, Errors) – e.g., CPU util > 90%? I/O queue length growing? Network drops?
  3. Hypothesis and change: Pick ONE parameter. Change it. Document why and expected effect.
  4. Measure again: Same period and load type. Compare before/after.
  5. Accept or rollback: If improvement >5% in the target metric, keep. If not, rollback and try different hypothesis.
  6. Make persistent: Only after validation, add to /etc/sysctl.d/ or tuned profiles.

Treat every parameter change as an experiment. Use tools like 'tuned' to apply preset profiles for common workloads (latency-performance, throughput-performance).
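
The accept-or-rollback decision in step 5 is simple enough to script. A minimal sketch, assuming a "lower is better" metric such as p99 latency and the 5% bar from the workflow above; improvement_pct and verdict are hypothetical helper names, and the numbers are illustrative only.

```shell
#!/bin/bash
# Sketch: decide whether a change clears the 5% improvement bar.
# Assumes a lower-is-better metric such as p99 latency in ms.
improvement_pct() {
    # $1 = baseline value, $2 = value after the change
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

verdict() {
    awk -v p="$(improvement_pct "$1" "$2")" \
        'BEGIN { r = (p > 5) ? "keep" : "rollback"; print r }'
}

echo "p99 200ms -> 150ms: $(improvement_pct 200 150)% ($(verdict 200 150))"
echo "p99 200ms -> 196ms: $(improvement_pct 200 196)% ($(verdict 200 196))"
```

Feeding it the same metric from the same load window before and after each change keeps the comparison honest.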

The Tuning Loop
  • Default config is for breadth, not production
  • One change at a time, measure before and after
  • The kernel has hundreds of knobs, but only 5-10 matter for your workload
  • If you can't explain why a parameter helps, don't apply it
Production Insight
A team applied 20 sysctl changes from a 'production tuning' blog post at once. When latency improved, they didn't know which change caused it. When the database later started throwing connection resets, they couldn't roll back.
Rule: change one parameter per iteration. Use version control for your sysctl configs.
Key Takeaway
Tune one parameter at a time.
Baseline for 24h before any change.
Measure the same metric – more data, less guesswork.

CPU Cache Thrashing Is Your Silent Killer

Most devs stare at CPU utilization and think they're fine. A utilization figure, high or low, says nothing about stalls: cycles spent waiting on memory are counted as busy. With a 40% L2 cache miss rate, the CPU spends more cycles waiting on memory than computing. That's the real bottleneck.

Cache hierarchy matters because data locality is physics, not magic. L1 cache runs at CPU speed. L3 is an order of magnitude slower. Every cache miss stalls the pipeline. If the scheduler keeps bouncing your thread between cores, you're flushing and reloading that cache every time. That's thrashing.

Check your cache topology with lscpu or lstopo. Pin critical processes with taskset or numactl to keep them on the same core. For NUMA boxes, memory allocation follows CPU node. Bind both to the same node. Remote memory access is slow, and your database will feel it.

Run perf stat -e cache-misses,cache-references on your workload; the miss rate is misses divided by references. If it exceeds 5%, you're leaving performance on the table. Fix affinity first. Tune later.
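
The miss rate itself is a single division. A minimal sketch with illustrative counts rather than a real measurement; miss_rate is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: turn two perf counters into a miss-rate percentage.
miss_rate() {
    # $1 = cache-misses, $2 = cache-references
    awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f\n", m / r * 100 }'
}

# Illustrative counts, not a measurement
echo "miss rate: $(miss_rate 89012345 1234567890)%"
```

Recent perf versions print this percentage themselves when both events are requested together, so the helper is mainly useful when the counters come from logs or another collector.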

CacheMissCheck.yml (YAML)
# io.thecodeforge: devops tutorial

# Check cache topology for a production web server
- name: Inspect CPU cache hierarchy
  # shell, not command: the command module does not support pipes
  shell: lscpu | grep -E "(cache|Core|Socket)"
  register: cache_info
  changed_when: false

- name: Pin nginx worker to specific core
  command: taskset -pc 0 {{ nginx_pid }}
  # Quoted so the expression is valid YAML; runs only if lscpu reported an L1d cache
  when: "'L1d cache' in cache_info.stdout"

- name: Measure cache misses under load
  command: perf stat -e cache-misses,cache-references -p {{ nginx_pid }} sleep 10
Output
Performance counter stats for process 12345:
1,234,567,890 cache-references
89,012,345 cache-misses # 7.21% of all cache refs
10.001234 seconds time elapsed
Production Trap: The 5% Utilization Myth
A low CPU figure doesn't prove the box is healthy, and a high one doesn't prove it's doing useful work: cycles stalled on cache misses are counted as busy. Always check cache-misses before scaling out.
Key Takeaway
Cache miss rate above 5% means you're memory-bound, not CPU-bound. Pin processes to cores before tuning anything else.

Context Switches Are The Hidden Tax On Your Throughput

Every context switch costs you CPU cycles. The scheduler saves state, may flush the TLB, and reloads the next thread. On a busy web server handling 10k connections, you might be switching 100k times per second. At roughly a microsecond or more per switch, that's at least 100ms of CPU time gone every second, before you count the cache misses that follow. Add them up.

High context switch rates usually mean one of two things: too many threads fighting for CPU time, or I/O-bound tasks that yield constantly. Both waste cycles. The fix: reduce thread count or switch to an event-driven model like epoll. For databases, tune the connection pool to match core count, not max connections.

Watch /proc/stat or use vmstat 1 and look at the cs column. That figure is system-wide, so divide it by the core count. If it's over 50k per core per second, you're thrashing the scheduler. perf sched gives you a per-process breakdown. Identify the worst offender. Either pin it, reduce its threads, or rewrite it.

Don't touch scheduler policies (SCHED_FIFO, SCHED_RR) unless you know exactly what you're doing. Preempting the kernel scheduler can lock your box. Start with affinity and thread count. That's 80% of the win.
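
To compare against the per-core threshold, the system-wide counter has to be sampled and normalised. A minimal sketch, assuming the cumulative ctxt line in /proc/stat and a one-second sample; per_core_rate and read_ctxt are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: system-wide context switches per second, normalised per core.
# /proc/stat exposes "ctxt", a cumulative count of switches since boot.
per_core_rate() {
    # $1 = switches per second (whole system), $2 = number of cores
    echo $(( $1 / $2 ))
}

read_ctxt() {
    awk '/^ctxt / { print $2 }' /proc/stat
}

if [ -r /proc/stat ]; then
    before=$(read_ctxt)
    sleep 1
    after=$(read_ctxt)
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    rate=$(( ${after:-0} - ${before:-0} ))
    echo "cs/s per core: $(per_core_rate "$rate" "${cores:-1}")"
fi
```

A single one-second sample is noisy; averaging several samples under representative load gives a figure worth acting on.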

ContextSwitchAudit.yml (YAML)
# io.thecodeforge: devops tutorial

# Audit context switches on a production API server
- name: Check context switch rate
  # Column 12 of vmstat is cs; tail -4 drops the first sample (average since boot)
  shell: vmstat 1 5 | tail -4 | awk '{print $12}'
  register: cs_rate
  changed_when: false

- name: Alert if excessive context switching
  fail:
    msg: "Context switch rate {{ item }} per second exceeds threshold"
  loop: "{{ cs_rate.stdout_lines }}"
  when: item | int > 50000

- name: Identify top offender by scheduling latency
  # shell, not command: the command module does not support &&
  shell: perf sched record -g -a sleep 5 && perf sched latency
  register: sched_latency
Output
---------------------------------------------------------------------
Task | Runtime ms | Switches | Avg Delay ms | Max Delay ms |
---------------------------------------------------------------------
nginx:12345 | 1200.345 | 5432 | 0.12 | 2.34 |
postgres:56789 | 800.123 | 3210 | 0.08 | 1.45 |
python-worker:98765 | 400.456 | 8765 | 0.45 | 8.91 |
---------------------------------------------------------------------
Senior Shortcut: Thread Count Math
Key Takeaway
Context switch rate over 50k per core per second means you're paying more for switching than for actual work. Reduce threads or switch to async.

Interrupt Affinity: Don't Let IRQs Steal Your Cache

Network interrupts land on whatever core the kernel picks. Without irqbalance or manual affinity, that's often CPU 0. Your cache-hot nginx worker on that core now gets interrupted 20k times a second to handle packet processing. Cache evicted. Pipeline stalled. Performance tanks.

Bind interrupt request lines (IRQs) to a dedicated core — separate from your application cores. This keeps your worker's cache warm and lets the interrupt handler run unopposed. Check /proc/interrupts to see which IRQ is hammering which core. Then write the CPU mask to /proc/irq/<N>/smp_affinity.

For high-throughput NICs (10GbE+), spread IRQs across a set of cores, not all. Use irqbalance as a baseline but don't trust it blindly for heavy loads; hand-tuned affinity usually wins there. If you set masks by hand, stop irqbalance or tell it to leave those IRQs alone, otherwise it will overwrite them. Match NIC queue count to core count, then assign one queue per core. No sharing.

Test with perf top or mpstat -I CPU before and after. If you see 10%+ of CPU time in softirq or hardirq, you have work to do. Dedicated interrupt cores are free performance. Take it.
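
The mask written to smp_affinity (and to rps_cpus) is a hex bitmap in which bit N stands for CPU N. A minimal sketch of building one from a list of core numbers; cpu_mask is a hypothetical helper name, and the approach is limited to CPUs 0-62 by shell integer width.

```shell
#!/bin/bash
# Sketch: build the hex bitmask that smp_affinity and rps_cpus expect.
# Bit N set means CPU N is allowed. Limited to CPUs 0-62 (shell integer width).
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

echo "core 2 only: $(cpu_mask 2)"
echo "cores 0-3:   $(cpu_mask 0 1 2 3)"
echo "cores 8-11:  $(cpu_mask 8 9 10 11)"
```

The kernel accepts the value with or without leading zeros, so 4 and 04 select the same core.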

InterruptAffinityBinding.yml (YAML)
# io.thecodeforge: devops tutorial

# Bind NIC interrupts to dedicated cores (not application cores)
- name: Check current interrupt distribution
  command: grep eth0 /proc/interrupts
  register: irq_eth0
  changed_when: false

- name: Identify IRQ numbers for eth0
  shell: "grep eth0 /proc/interrupts | awk '{print $1}' | tr -d ':'"
  register: irq_numbers
  changed_when: false

- name: "Bind each IRQ to core 2 (mask 0x04), skipping cores 0 and 1"
  shell: "echo 04 > /proc/irq/{{ item }}/smp_affinity"
  loop: "{{ irq_numbers.stdout_lines }}"
  become: yes

- name: Verify binding
  # shell, not command: the command module does not support &&
  shell: head -n 1 /proc/interrupts && grep eth0 /proc/interrupts
Output
CPU0 CPU1 CPU2 CPU3
98: 0 0 12345 0 IR-PCI-MSI eth0
99: 0 0 13456 0 IR-PCI-MSI eth0-tx-0
100: 0 0 14567 0 IR-PCI-MSI eth0-rx-0
Production Trap: Don't Bind IRQs To Your App Cores
Key Takeaway
Bind network IRQs to dedicated cores outside your application affinity mask. This isolates cache contention and boosts throughput by 10-20% under load.

I/O Priority: Stop Letting Batch Jobs Steal Your Latency

Most engineers tune I/O scheduling but ignore I/O priority. That's like upgrading your car's tires while leaving the parking brake on. The kernel's I/O priority system (ionice) lets you tell the block layer which processes are latency-sensitive and which are background noise.

WHY this matters: Without I/O priority, a cron job running log rotation can starve your database. The CFQ and BFQ schedulers respect three priority classes: Real-time (1), Best-effort (2), and Idle (3). Real-time processes always get first dibs on I/O requests. Best-effort divides bandwidth proportionally. Idle processes only run when nobody else needs the disk. A device running the none scheduler does no such arbitration, so ionice has no effect there.

HOW to use it: Attach ionice to critical processes. Your PostgreSQL should be ionice -c 2 -n 0 (best-effort, high priority). Your nightly backup script should be ionice -c 3 (idle). On systemd, set IOSchedulingClass and IOSchedulingPriority in your service unit. This single change can cut database tail latency by 40% in mixed-workload environments. Test it: run fio with different ionice classes and watch the latency divergence.

docker-compose.yml (YAML)
# io.thecodeforge: devops tutorial

services:
  postgres:
    image: postgres:15
    # Do NOT forget ionice; default is best-effort 4
    # This is for containers that control global ionice
    # On bare metal, wrap your entrypoint with ionice

  backup:
    image: backup-agent:latest
    # Trash priority: won't impact prod queries
    cpu_shares: 128
    mem_limit: 256m
    # In container runtime, set via cgroup I/O weight (range 10-1000):
    blkio_config:
      weight: 10
Output
No direct output; apply via systemd drop-in:
[Service]
IOSchedulingClass=best-effort
IOSchedulingPriority=0
Production Trap:
Don't set real-time I/O priority (ionice -c 1) on anything except watchdog daemons. A bug in your process can lock the entire I/O subsystem. Best-effort level 0 (ionice -c 2 -n 0) is the highest safe priority for production.
Key Takeaway
I/O priority is free latency protection — classify every I/O-heavy process or accept random slowdowns.

Useful I/O Monitoring Commands: Stop Guessing, Start Measuring

You can't tune what you can't see. Most engineers look at iostat, get lost in %util, and call it a day. That number is a lie — %util shows device busy time, not saturation. You need the real tools.

WHY this matters: False signals cause bad tuning. %util at 100% means nothing on an NVMe drive that can queue 64K commands. You need indicators of actual congestion: average queue size (avgqu-sz, shown as aqu-sz in newer sysstat releases), service time (svctm, dropped from recent sysstat releases), and divergence between r_await and w_await.

HOW to do it right
  • iostat -x 1: Look at avgqu-sz > 1 per device. That's congestion. svctm under 10ms on spinning rust is fine; under 1ms on SSD is fine.
  • iotop -oP: See which process is eating I/O right now. The -P flag shows whole processes instead of individual threads. Discover your rogue log writer.
  • blktrace / blkparse: Capture every I/O event. Trace a 10-second window: blktrace -d /dev/sda -w 10 -o - | blkparse -i -. This shows exact I/O sizes, latencies, and which process issued them.
  • bcc-tools (biosnoop, biolatency): Instant histograms of I/O latency. No configuration. Just run biosnoop and watch outliers appear.

Ditch guesswork. Measure with blktrace when you profile, iostat when you monitor, iotop when you firefight.

monitor.yml (YAML)
# io.thecodeforge: devops tutorial

steps:
  - name: "Check I/O congestion"
    command: "iostat -x 1 5"
    # Look for avgqu-sz > 1.0 on any device

  - name: "Find I/O hog"
    command: "iotop -oP -b -n 1"
    # -o: only show processes actually doing I/O
    # -P: show processes only, not individual threads
    # -b: batch mode (non-interactive)

  - name: "Capture I/O trace"
    command: "timeout 10 blktrace -d /dev/nvme0n1 -o - | blkparse -i -"

# Install tools (Debian/Ubuntu):
#   apt install sysstat iotop blktrace bpfcc-tools
Output
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 892.00 64.00 3568.00 256.00 8.00 2.45 2.68 2.01 12.34 0.22 21.00
Senior Shortcut:
Write a one-liner: iostat -x 1 | awk '/nvme/ && $9 > 2 {print strftime(), $0}' (column 9 is avgqu-sz in the layout shown above; strftime needs gawk). Run it in a tmux pane. Now you see congestion before the pager goes off.
Key Takeaway
Ignore %util. Watch avgqu-sz, await, and svctm — that's real I/O health.

Keepalive Timeout: Stop Letting Dead Connections Rot Your Resources

Every open TCP socket costs you a file descriptor and a chunk of kernel memory. When a client crashes without sending FIN, that socket sits in ESTABLISHED state until your application writes to it or the kernel's keepalive probes notice the peer is gone. If you have 10,000 connections, 10% of them dead, you're leaking 1,000 descriptors. That's absurd.

WHY this matters: Default keepalive settings are glacial. tcp_keepalive_time is 7200 seconds (2 hours) on most distros. Production apps talking to flaky mobile clients don't have 2 hours. Your connection pool fills with zombies. New connections start failing once you hit the file descriptor limit. Keepalive is your garbage collector for dead peers.

HOW to fix it: Set tcp_keepalive_time to 300 seconds (5 minutes). tcp_keepalive_intvl to 15 seconds. tcp_keepalive_probes to 5. That means after 5 minutes idle, the kernel sends 5 probes at 15-second intervals. Total detection time: 5 min + 75 sec = 6.25 minutes. Tune per-service with socket options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) rather than applying a single value to everything. Your database servers need shorter timeouts than your web servers. The sysctl values only apply to sockets that have SO_KEEPALIVE enabled, so set it in application code; the sysctls themselves can differ per network namespace.

Test it: ss -tno state established shows a keepalive timer on every socket that has SO_KEEPALIVE set. No timer means the socket will never be probed. A growing CLOSE_WAIT count is a different problem: the peer closed cleanly and your application never called close().
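
The detection-time arithmetic generalises to any combination of the three settings. A minimal sketch comparing the tuned values with the stock Linux defaults (7200 seconds idle, 75-second interval, 9 probes); keepalive_detect_secs is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: worst-case time to detect a dead peer on an idle connection.
# detection = keepalive_time + keepalive_intvl * keepalive_probes
keepalive_detect_secs() {
    # $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
    echo $(( $1 + $2 * $3 ))
}

echo "tuned (300/15/5):     $(keepalive_detect_secs 300 15 5) s"
echo "defaults (7200/75/9): $(keepalive_detect_secs 7200 75 9) s"
```

The defaults take more than two hours to reclaim a dead socket; the tuned values do it in just over six minutes.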

keepalive-tune.yml (YAML)
# io.thecodeforge: devops tutorial

# /etc/sysctl.d/99-tcp-keepalive.conf

net.ipv4.tcp_keepalive_time = 300
# Default: 7200 (2 hours). Production: 300.

net.ipv4.tcp_keepalive_intvl = 15
# Interval between keepalive probes. Default: 75.

net.ipv4.tcp_keepalive_probes = 5
# Number of probes before declaring connection dead. Default: 9.

# Apply (run in a shell, not part of the file):
#   sysctl -p /etc/sysctl.d/99-tcp-keepalive.conf

# Verify:
#   sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
Output
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
# After tuning, confirm keepalive timers are armed on established sockets:
# ss -tno state established | grep -c keepalive
Production Trap:
Don't set tcp_keepalive_time below 60 seconds. You'll burn CPU on probe traffic and kill mobile clients on brief network glitches. 300 seconds is aggressive but safe for server-to-server connections.
Key Takeaway
A keepalive timeout under 10 minutes is free — dead connections are not. Tune it or accept resource leaks.

Kernel Function Tracing for Low-Level Analysis

WHY kernel tracing matters: Latency spikes hide behind aggregated metrics. Function tracing reveals exactly which kernel path causes a bottleneck. Use ftrace or BPF-based tools to measure entry-to-exit times of syscalls, interrupt handlers, and scheduler functions. Start with trace-cmd record -p function_graph -g do_sys_open to trace file open latency (on newer kernels the entry point may be do_sys_openat2; check available_filter_functions under /sys/kernel/tracing). For targeted analysis, use funclatency from BCC to histogram a single kernel function's execution time. Focus on high-frequency or high-jitter paths like tcp_ack, do_try_to_free_pages, or enqueue_task_fair. Compare before and after tuning. Always trace under load; idle traces tell you nothing. Stop guessing with averages; start tracing the 99th percentile path.

kernel_trace_example.yml (YAML)
# io.thecodeforge: devops tutorial

name: trace-latency-p99
on:
  workflow_dispatch:
    inputs:
      function:
        description: 'Kernel function to trace'
        required: true
        default: 'do_sys_open'
      duration:
        description: 'Trace duration in seconds'
        default: '30'

jobs:
  trace:
    runs-on: ubuntu-22.04
    steps:
      - run: sudo apt-get install -y trace-cmd
      - run: |
          sudo trace-cmd record -p function_graph \
            -g ${{ inputs.function }} \
              sleep ${{ inputs.duration }}
      - run: sudo trace-cmd report --func-stack | awk '/${{ inputs.function }}/{print $NF}' | sort -n | tail -1
Production Trap:
Running ftrace on every function in production crushes CPU. Always filter to one function or PID, and limit trace duration to under 60 seconds.
Key Takeaway
Trace one function at a time under real load, never idle.

Programmable Tracing for Custom Metrics

WHY programmable tracing: Traditional tools report fixed metrics, so you get what they give. With BPF (BCC or bpftrace), you write custom probes on any kernel or user function. Measure the exact distribution of mutex hold times, NUMA miss ratios, or per-socket TCP retransmit counts. Start with bpftrace -e 'kprobe:tcp_retransmit_skb { @[uid] = count(); }' to count retransmits per user (uid is the bpftrace builtin for the current task's user ID). For NUMA analysis, run numastat (from the numactl package): its numa_miss counter shows how often memory was allocated on a node other than the preferred one. Build metrics that matter to your workload, not generic dashboard noise. Write scripts that save histograms and alert on tail latency shifts. This turns raw tracing into actionable tuning data.

bpf_metric_example.yml · YAML
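# io.thecodeforge - devops tutorial

name: bpf-custom-metric
on:
  schedule:
    - cron: '*/5 * * * *'

jobs:
  collect:
    runs-on: ubuntu-22.04
    steps:
      - run: sudo apt-get update && sudo apt-get install -y bpftrace
      - run: |
          # Count retransmits for 60 seconds; bpftrace prints every map on exit.
          # pid is the task running when the probe fires, which in softirq
          # context may not be the socket owner.
          sudo bpftrace -e '
            kprobe:tcp_retransmit_skb
            {
              @transmits_by_pid[pid] = count();
            }
            interval:s:60
            {
              exit();
            }' > /tmp/retransmit_raw.log
      - run: |
          # $NF + 0 forces a numeric comparison, so the "Attaching ..." header is skipped.
          awk '$NF + 0 > 5' /tmp/retransmit_raw.log \
            > /tmp/high_retransmit_pids.txt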
Production Trap:
BPF probes on high-frequency events (e.g., tcp_transmit_skb) that print once per event can overflow the per-CPU perf buffer and drop events. Watch bpftrace's stderr for "Lost N events" messages, and aggregate in-kernel with maps (count(), hist()) instead of printing each event.
Key Takeaway
Write custom BPF probes for workload-specific metrics; never settle for stock tool outputs.
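Saving the map output is only useful if something reads it back. As a minimal sketch of the "scripts that alert on shifts" idea, the Python below parses integer-keyed bpftrace map lines of the form @name[key]: count and returns the keys above a threshold. The function names and the sample output are hypothetical; only the @name[key]: count line shape is taken from bpftrace's map printing.

```python
import re

# One printed map entry, e.g. "@transmits_by_pid[4121]: 12".
MAP_LINE = re.compile(r"^@(\w+)\[(\d+)\]:\s+(\d+)$")


def parse_counts(output, map_name):
    """Return {key: count} for integer-keyed entries of one bpftrace map."""
    counts = {}
    for line in output.splitlines():
        match = MAP_LINE.match(line.strip())
        if match and match.group(1) == map_name:
            key = int(match.group(2))
            # Sum repeated keys in case the map was printed more than once.
            counts[key] = counts.get(key, 0) + int(match.group(3))
    return counts


def over_threshold(counts, threshold):
    """(key, count) pairs strictly above the threshold, highest count first."""
    hot = [(k, c) for k, c in counts.items() if c > threshold]
    return sorted(hot, key=lambda kc: kc[1], reverse=True)


# Illustrative sample, not real capture output.
SAMPLE_OUTPUT = """\
Attaching 2 probes...
@transmits_by_pid[0]: 3
@transmits_by_pid[4121]: 12
@transmits_by_pid[977]: 6
"""

if __name__ == "__main__":
    counts = parse_counts(SAMPLE_OUTPUT, "transmits_by_pid")
    for pid, count in over_threshold(counts, 5):
        print(f"pid {pid}: {count} retransmits")
```

Matching the full line shape, rather than comparing the last field, keeps banner and status lines out of the result without extra filtering.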
● Production incident · POST-MORTEM · severity: high

The Silent OOM that Cost $2,000 in AWS Credits

Symptom
Redis cluster latency spikes, cloud monitoring shows memory pressure but no OOM kill. P99 response times go from 2ms to 4s. The node stays up but becomes unusable.
Assumption
The team assumed that with 64GB RAM and only 30GB of Redis data, swapping was impossible. They never checked vm.swappiness.
Root cause
vm.swappiness defaults to 60, which lets the kernel swap out anonymous pages while reclaimable file cache is still plentiful. Once page cache growth has consumed free memory, the kernel's reclaim heuristic tries to keep the file cache big, so it swaps out anonymous pages. Redis's working set pages get swapped to disk, and each access then requires a slow disk read.
Fix
Set vm.swappiness=1 in /etc/sysctl.d/99-swap.conf and apply with sysctl --system (plain sysctl -p only reads /etc/sysctl.conf). Then disable swap entirely for latency-critical workloads (swapoff -a). After tuning, swap usage dropped to zero and p99 returned to 2ms.
Key lesson
  • Default sysctl values are not safe for all workloads – especially vm.swappiness.
  • Always check swap usage (free -h, /proc/meminfo) even if you have 'plenty' of RAM.
  • For latency-sensitive apps, set vm.swappiness=1 or disable swap entirely.
  • Monitor /proc/[pid]/status VmSwap to see per-process swap usage.
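The last lesson can be automated. The sketch below, in Python, reads the VmSwap field from each /proc/<pid>/status file and lists the processes holding the most swap. The function names and the limit parameter are illustrative choices; the VmSwap field and its kB unit are what the kernel exposes.

```python
import os


def vmswap_kb(status_text):
    """Return VmSwap in kB from /proc/<pid>/status text, or 0 if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Line format: "VmSwap:      1024 kB"
            return int(line.split()[1])
    return 0  # kernel threads have no VmSwap line


def top_swappers(proc_root="/proc", limit=5):
    """Return up to `limit` (pid, swap_kb) pairs with nonzero VmSwap, largest first."""
    usage = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                kb = vmswap_kb(fh.read())
        except OSError:
            continue  # process exited or is not readable
        if kb > 0:
            usage.append((int(entry), kb))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:limit]


if __name__ == "__main__":
    for pid, kb in top_swappers():
        print(f"pid {pid}: {kb} kB in swap")
```

Run it on the affected host during an incident: a Redis PID near the top of the list is a direct confirmation of the root cause described above.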
Production debug guide · Symptom -> Action flow for the most common production issues · 4 entries
Symptom · 01
Latency spikes, high %iowait in top or vmstat
Fix
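Check iostat -x 1. Look at aqu-sz (avgqu-sz in older sysstat versions) and await. If await is high and %util is near 100%, the device is saturated. If await is high but %util is low, the delay is more likely in queueing or the storage backend than in raw device throughput. Then check the I/O scheduler: cat /sys/block/<device>/queue/scheduler (for example sda or nvme0n1). Switch to none or mq-deadline if on NVMe.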
Symptom · 02
Unexpected swapping (kswapd0 high CPU, p99 latency spikes)
Fix
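Run free -h and check the used column of the Swap row. Then cat /proc/meminfo | grep -E 'Swap|Dirty'. If SwapFree is below SwapTotal, pages are sitting in swap; nonzero si/so columns in vmstat 1 confirm that swapping is happening right now. Set vm.swappiness=1 and consider disabling swap for latency-critical services.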
Symptom · 03
Network throughput far below link speed, dropped packets
Fix
Check netstat -s for TCP retransmits and packet drops. Run ethtool -S eth0 | grep drop. Raise net.core.rmem_max and net.core.wmem_max to 16MB (16777216) for 10G links. Also check net.ipv4.tcp_congestion_control and consider switching to BBR.
Symptom · 04
CPU imbalance: one core pegged, others idle
Fix
Check /proc/interrupts for IRQ imbalance. Run irqbalance, or manually assign IRQ affinity via /proc/irq/*/smp_affinity. For NUMA, ensure memory is local (numactl --membind).
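Reading /proc/interrupts by eye gets hard beyond a few cores. The Python sketch below sums the per-CPU columns and reports what share of all interrupts the busiest CPU handled. The function names and the sample text are illustrative; the layout assumed is the usual header row of CPU labels followed by one count per CPU on each IRQ row.

```python
def per_cpu_totals(interrupts_text):
    """Sum interrupt counts per CPU from /proc/interrupts-style text."""
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1:1 + len(cpus)]
        # Skip rows (such as ERR or MIS) that do not carry one count per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for i, c in enumerate(counts):
            totals[i] += int(c)
    return dict(zip(cpus, totals))


def busiest_share(totals):
    """Fraction of all interrupts handled by the single busiest CPU."""
    grand = sum(totals.values())
    return max(totals.values()) / grand if grand else 0.0


# Illustrative sample, not real output.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  0:         10          0   IO-APIC   2-edge      timer
 24:     900000        100   PCI-MSI 524288-edge      eth0-rx-0
NMI:          5          7   Non-maskable interrupts
ERR:          0
"""

if __name__ == "__main__":
    totals = per_cpu_totals(SAMPLE_INTERRUPTS)
    print(totals, f"busiest CPU share: {busiest_share(totals):.1%}")
```

A share close to 1.0 on a multi-core host matches the "one core pegged, others idle" symptom and points at IRQ affinity rather than application threads.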
★ Linux Performance Debug Cheat Sheet · Quick-fire commands for the three most common failure scenarios
High CPU system time but low user time
Immediate action
Check context switches: vmstat 1 5. Watch cs (context switches per second).
Commands
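vmstat 1 10 | awk '{print $12,$14}' # cs and sy columns
perf top -e context-switches -s comm,sym # find which tasks and symbols trigger the switches
Fix now
Raise kernel.sched_migration_cost_ns to 5000000 (on kernels 5.13 and later the knob is /sys/kernel/debug/sched/migration_cost_ns). If system time stays high, disable automatic NUMA balancing with kernel.numa_balancing=0.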
Memory allocation latency spikes under load
Immediate action
Check direct reclaim: grep -E 'allocstall|pgscan_direct' /proc/vmstat
Commands
grep -E 'allocstall|pgscan_direct' /proc/vmstat | awk '$2 > 0 {print "ALERT: direct reclaim has occurred: " $1}'
echo 50 > /proc/sys/vm/vfs_cache_pressure # below the default of 100: hold dentry and inode caches longer
Fix now
Increase vm.min_free_kbytes to 1% of RAM. Set vm.zone_reclaim_mode=0 (disable aggressive zone reclaim).
Apache/Nginx worker threads time out during TCP reconnection
Immediate action
Check net.ipv4.tcp_tw_reuse and tcp_fin_timeout
Commands
ss -s | grep timewait # count TIME_WAIT sockets
sysctl net.ipv4.tcp_fin_timeout # default is 60
Fix now
Set net.ipv4.tcp_tw_reuse=1 (it only affects outbound connections) and net.ipv4.tcp_fin_timeout=15 (it shortens FIN_WAIT_2, not TIME_WAIT). Also increase net.ipv4.tcp_max_tw_buckets to 2000000.
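The /proc/vmstat counters are cumulative since boot, so a nonzero value alone does not show that direct reclaim is happening now. A minimal Python sketch of the missing step: take two snapshots and report which direct-reclaim counters grew. The counter prefixes (allocstall, pgscan_direct, pgsteal_direct) are an assumption about the running kernel and vary by version; the function names are illustrative.

```python
# Counter name prefixes tied to direct reclaim (assumed; names vary by kernel version).
DIRECT_RECLAIM_PREFIXES = ("allocstall", "pgscan_direct", "pgsteal_direct")


def direct_reclaim_counters(vmstat_text):
    """Return {counter: value} for direct-reclaim counters in /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(DIRECT_RECLAIM_PREFIXES):
            counters[parts[0]] = int(parts[1])
    return counters


def growth(before, after):
    """Counters that increased between two snapshots, with the size of the increase."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


if __name__ == "__main__":
    import time

    with open("/proc/vmstat") as fh:
        first = direct_reclaim_counters(fh.read())
    time.sleep(10)  # sample window; run this while the workload is under load
    with open("/proc/vmstat") as fh:
        second = direct_reclaim_counters(fh.read())
    print(growth(first, second) or "no direct reclaim in this window")
```

An empty result over a window that covers a latency spike suggests the spike has another cause, which saves a round of unnecessary vm.min_free_kbytes tuning.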
I/O Scheduler Selection Guide
| Storage Type | Recommended Scheduler | Why |
|---|---|---|
| NVMe SSD | none (noop on legacy single-queue kernels) | No reordering, minimal latency. NVMe has >100K IOPS; scheduler overhead slows it down. |
| SATA SSD | mq-deadline | Deadline ensures reads/writes don't starve. Multi-queue variant scales with multiple cores. |
| Spinning HDD (database) | bfq | Budget Fair Queuing provides fairness among processes. Good for mixed workloads. |
| Spinning HDD (sequential streaming) | mq-deadline | Delays writes moderately in favour of reads, but sequential throughput remains high. |
| Virtualised (paravirtual SCSI) | none | Hypervisor handles scheduling; a guest-side scheduler adds double queuing. |
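The guide above can be checked mechanically. The Python sketch below extracts the active scheduler from the contents of /sys/block/<dev>/queue/scheduler, where the active entry is shown in square brackets, and compares it with the recommendation for a storage type. The dictionary keys (nvme, sata_ssd, and so on) are hypothetical labels chosen for this example.

```python
def active_scheduler(scheduler_text):
    """Return the scheduler shown in [brackets] in the sysfs scheduler file text."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Fallback for files that list a single scheduler without brackets
    # (an assumption about some older kernels).
    return scheduler_text.strip() or None


# The selection guide as a lookup table; the keys are hypothetical labels.
RECOMMENDED = {
    "nvme": "none",
    "sata_ssd": "mq-deadline",
    "hdd_database": "bfq",
    "hdd_streaming": "mq-deadline",
    "virtual": "none",
}


def needs_change(storage_type, scheduler_text):
    """True when the active scheduler differs from the guide's recommendation."""
    return active_scheduler(scheduler_text) != RECOMMENDED[storage_type]


if __name__ == "__main__":
    sample = "[mq-deadline] kyber bfq none\n"  # illustrative file contents
    print(active_scheduler(sample), needs_change("nvme", sample))
```

Keeping the recommendations in data rather than in code makes it easy to adjust the table after your own measurements disagree with the defaults listed here.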

Key takeaways

1. Default kernel settings target broad compatibility, not your production workload. Always tune for your workload.
2. One change per iteration. Baseline for 24h before and after. Document everything.
3. The four subsystems (CPU, memory, I/O, network) interact. Fix one bottleneck first, then the next.
4. NVMe storage should generally use the 'none' I/O scheduler to avoid latency overhead.
5. vm.swappiness=1 stops silent swapping that destroys latency.
6. BBR congestion control often beats CUBIC on links with >0.1% packet loss; verify on your own traffic.
7. Use tools like sar, perf, iostat, and sysctl consistently in a repeatable workflow.

Common mistakes to avoid

4 patterns
×

Copying sysctl settings from outdated blog posts without verification

Symptom
After applying, performance worsens or mysterious errors appear (e.g., tcp_tw_reuse=2 on old kernels causes netfilter issues).
Fix
Verify that your kernel version supports each parameter. Use sysctl -a | grep <param> to check the current value, and read the kernel's sysctl documentation for its meaning. Test in non-production first.
×

Never measuring before applying changes

Symptom
No baseline exists. When latency improves, you can't tell if it was your change or traffic variation.
Fix
Run 'sar -A' for at least 24h before any tuning. Save output to a file. Re-run same sar commands after change and diff the results.
×

Changing multiple parameters simultaneously

Symptom
Performance gain is real but you don't know which change caused it. If side effects appear, you can't isolate them.
Fix
Change ONE parameter per iteration. Document the change, reason, and outcome. Use version-controlled sysctl configs.
×

Disabling swap entirely on memory-constrained servers

Symptom
OOM killer triggers more often because swap acts as emergency buffer. Kernel cannot reclaim memory fast enough.
Fix
Set vm.swappiness=1 instead of 0 or turning off swap. Keep swap available but prioritise keeping pages in memory. For latency-critical apps, still disable swap but only if you have sufficient over-provisioning.
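The "one parameter per iteration" rule from the list above is easy to enforce when sysctl configs live in version control. A minimal Python sketch, assuming sysctl.d-style key = value files: it parses a baseline and a proposed config and checks that exactly one key differs. The function names are illustrative, and wiring this into a review or CI step is left as an assumption.

```python
def parse_sysctl_conf(text):
    """Parse 'key = value' lines, ignoring blank lines and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def changed_keys(baseline, proposed):
    """Keys that were added, removed, or given a different value."""
    keys = set(baseline) | set(proposed)
    return sorted(k for k in keys if baseline.get(k) != proposed.get(k))


def is_single_change(baseline_text, proposed_text):
    """True when the proposed config differs from the baseline in exactly one key."""
    diff = changed_keys(parse_sysctl_conf(baseline_text), parse_sysctl_conf(proposed_text))
    return len(diff) == 1


if __name__ == "__main__":
    baseline = "# swap\nvm.swappiness = 60\nnet.core.rmem_max = 212992\n"
    proposed = "vm.swappiness = 1\nnet.core.rmem_max = 212992\n"
    print(changed_keys(parse_sysctl_conf(baseline), parse_sysctl_conf(proposed)))
```

Rejecting any change set where is_single_change returns False keeps each measured before/after comparison attributable to one parameter.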
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
You notice a database server has high %iowait and I/O latency is 50ms. T...
Q02 · SENIOR
Explain the relationship between vm.swappiness, dirty ratio, and swap pr...
Q03 · SENIOR
How would you tune a 10Gbps web server to handle 100K concurrent connect...
Q01 of 03 · SENIOR

You notice a database server has high %iowait and I/O latency is 50ms. The I/O scheduler is CFQ and storage is NVMe. What is the first thing you would change?

ANSWER
Change the I/O scheduler to none (noop). CFQ was designed for spinning disks – it groups requests into time slices, which adds latency. NVMe flash has no seek penalty, so request reordering is unnecessary overhead. Run echo none > /sys/block/nvme0n1/queue/scheduler and verify with iostat -x 1 that await drops below 2ms.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
Is it safe to modify sysctl values without rebooting?
02
What is the difference between transparent huge pages (THP) and explicit huge pages?
03
Why does the default I/O scheduler on modern distros still use CFQ or BFQ?
04
How do I know which sysctl parameters to change?