Senior 11 min · March 06, 2026

Linux Performance Tuning — Silent Swap from vm.swappiness

Q: Is it safe to modify sysctl values without rebooting?

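Yes. Most sysctl parameters can be changed at runtime with 'sysctl -w' and take effect on the running kernel immediately. Changes won't survive a reboot unless you persist them in /etc/sysctl.d/ (reload with 'sysctl --system'). Some settings only apply to objects created after the change: TCP buffer defaults, for example, affect new sockets, not established ones. A few tunables are read-only at runtime or can only be set as boot parameters. 'sysctl -a' lists current values but doesn't tell you which category a parameter falls into, so check the kernel documentation (Documentation/admin-guide/sysctl/) before relying on a change.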

Q: What is the difference between transparent huge pages (THP) and explicit huge pages?

THP automatically promotes regular 4KB pages to 2MB huge pages in the background. This can cause performance stalls during promotion (due to memory compaction). Explicit huge pages (hugetlbfs) are reserved up front, ideally at boot before memory fragments, and are never swapped out. They give deterministic performance but require the application to be aware of them. For databases, explicit huge pages usually outperform THP.

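Q: Why do distros still ship a default I/O scheduler such as mq-deadline or BFQ (or CFQ on older kernels)?

Distros prioritise compatibility over peak performance. CFQ was removed in Linux 5.0 along with the legacy single-queue block layer, so you'll only meet it on older kernels. Current distros typically default to 'none' for NVMe and to mq-deadline or BFQ for SATA and rotational devices, where merging and ordering requests still helps. Switching to 'none' on a device that benefits from reordering (a spinning disk, for example) can hurt performance. Many cloud instances virtualise the disk from a shared backend where the hypervisor already schedules. In such cases, 'none' is usually safe and recommended.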

Q: How do I know which sysctl parameters to change?

Start by identifying the bottleneck: CPU (use top, perf, sar -u), memory (free, vmstat, sar -r), I/O (iostat, sar -b), network (netstat, ethtool, sar -n). Each bottleneck leads to specific parameters. For example, high I/O wait → check scheduler and queue depth. Low network throughput → check buffer sizes and congestion control. Never change parameters blindly – read man pages or kernel docs first.

Default vm.swappiness=60 silently swaps Redis working set, spiking P99 from 2ms to 4s.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

May 23, 2026

last updated

⚡Quick Answer

Tuning modifies kernel parameters via sysctl to match workload, not defaults
Key subsystems: CPU scheduler, virtual memory, I/O, and network
vm.swappiness controls swap tendency — set to 1 for latency-sensitive apps
Wrong I/O scheduler on NVMe adds ~40% latency — use none (or mq-deadline)
Production rule: change one parameter at a time, measure before and after

✦ Definition~90s read

What is Linux System Performance Tuning?

Linux system performance tuning is the practice of modifying kernel parameters via /proc, /sys, and sysctl to adjust the OS's behaviour for a specific workload. Default settings target broad compatibility, not peak performance. Tuning is not a one-time event — it's an iterative cycle of measurement, change, verification.

The kernel exposes these knobs because there's no single 'best' config. A web server that handles short-lived connections needs different TCP buffers than a file server streaming large files. A real-time analytics database needs different memory pressure settings than a batch processing job.

The goal is to close the gap between 'works' and 'works under production load'. That means measuring latency, throughput, and resource utilisation before and after each change.

Plain-English First

Imagine a busy restaurant kitchen. The head chef (Linux kernel) manages cooks (CPU cores), pantry space (RAM), and delivery trucks (I/O). Out of the box, the kitchen is set up for a casual diner — it works fine for most nights. But on a Saturday rush with 300 covers, you need to rearrange the stations, pre-stock the fridges, and assign cooks to specific roles. That's exactly what Linux performance tuning is: deliberately reorganising how the OS allocates its resources so it can handle YOUR workload, not just an average one.

A default Linux installation is deliberately conservative. The kernel ships with settings tuned for broad compatibility — a database server, a gaming rig, a Raspberry Pi, and a 64-core cloud VM will all boot with roughly the same baseline config. That's great for getting started, but catastrophic for production at scale. A misconfigured TCP buffer kills throughput on a 10 Gbps link. The wrong I/O scheduler on NVMe storage adds 40% latency. A forgotten vm.swappiness setting causes a Redis node to start swapping under load, tanking p99 response times from 2ms to 4 seconds. These aren't theoretical problems — they're war stories from real oncall rotations.

Performance tuning solves the gap between 'it works' and 'it works under pressure'. The Linux kernel exposes hundreds of tuneable knobs through /proc, /sys, and sysctl. Understanding which knobs affect which subsystem — and crucially, WHY they exist — lets you make surgical changes instead of cargo-culting settings from a Stack Overflow post that was written for a 2012 spinning-disk server.

By the end of this article you'll understand how the kernel scheduler, virtual memory subsystem, I/O stack, and network stack interact with each other. You'll be able to profile a live system, identify the bottleneck, apply the right tuning, and verify the improvement with hard numbers — all without rebooting. You'll also know which changes to make permanent and which to test ephemerally first.

Production Insight

Blindly applying sysctl settings from a blog post can make things worse.

Example: setting vm.swappiness=0 on a database server seemed right, but it caused the page cache to be evicted aggressively, doubling I/O read latency.

Rule: understand the trade-off — no parameter is universally 'optimal'.

Key Takeaway

Tuning is workload-specific.

Default kernel config is for booting, not for performance.

Always measure before change and after change.

Figure: Linux Performance Tuning Flow

Kernel Scheduler Tuning — CPU Affinity, CFS & NUMA

The Completely Fair Scheduler (CFS) allocates CPU time proportionally among processes. Its main tuning knobs control preemption aggressiveness, group scheduling, and NUMA balancing.

Key parameters

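kernel.sched_min_granularity_ns: Minimum slice per process. Lower values reduce latency but increase context switches. Default is 3ms on most multi-core machines (it scales with CPU count); reduce to 1ms for latency-sensitive apps.
kernel.sched_wakeup_granularity_ns: How much of a lead a waking process needs before it preempts the running one. Reduce for interactive workloads.
kernel.numa_balancing: Default 1 (enabled) on most distro kernels. On NUMA machines, this migrates pages and threads across nodes, which often causes latency spikes. Disable with 0 in virtualised environments.
kernel.sched_migration_cost_ns: How long after it last ran a task is still treated as cache-hot and left on its CPU. Increasing it prevents unnecessary migrations.
Note: kernels 5.13 and later expose the sched_* knobs under /sys/kernel/debug/sched/ rather than sysctl, and the EEVDF scheduler (6.6+) replaces the granularity knobs. If sysctl reports an unknown key, look there.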

Also use taskset to pin processes to specific cores and numactl to bind memory to local NUMA node.

scheduler-tuning.shBASH

#!/bin/bash
# TheCodeForge - Linux scheduler tuning for an API server
# Pins Nginx workers to cores 0-15 (first socket) on a dual-socket NUMA system.
# Run as root. Confirm the core-to-socket layout with lscpu before reusing the range.

WORKER_PIDS=$(pgrep -f 'nginx: worker')

for pid in $WORKER_PIDS; do
    # Pin to first-socket cores only (avoid cross-socket memory access)
    taskset -pc 0-15 "$pid"
done

# Sysctl tweaks for lower latency
# (on kernels 5.13+ the sched_* keys live under /sys/kernel/debug/sched/ instead)
sysctl -w kernel.sched_min_granularity_ns=1000000   # 1ms vs default 3ms
sysctl -w kernel.sched_wakeup_granularity_ns=2000000  # 2ms vs default 4ms
sysctl -w kernel.sched_migration_cost_ns=5000000    # 5ms → fewer migrations
sysctl -w kernel.numa_balancing=0                    # Disable NUMA balancing

Output

pid 12345's current affinity list: 0-31

pid 12345's new affinity list: 0-15

kernel.sched_min_granularity_ns = 1000000

kernel.sched_wakeup_granularity_ns = 2000000

kernel.sched_migration_cost_ns = 5000000

kernel.numa_balancing = 0

Production Insight

An e-commerce team enabled NUMA balancing on a 2-socket server expecting better memory locality. It caused ~10% CPU overhead from page migrations and random latency spikes 3x the baseline.

Fix: disable numa_balancing on any host running dedicated workloads.

Rule: NUMA balancing helps mixed workloads; hurts single-app servers.

Key Takeaway

Pin critical processes with taskset.

Disable numa_balancing in VMs or dedicated app hosts.

Measure latency before enabling scheduler migrations.
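
The scheduler knobs in this section have moved between kernel versions, so a script that hardcodes the sysctl path can fail on a newer host. A minimal sketch of a lookup, assuming the two locations named in its comments; sched_knob_path and its optional root prefix are hypothetical names used for illustration:

```shell
# Resolve where a CFS knob lives on this kernel.
# Assumption: older kernels expose /proc/sys/kernel/sched_<name>,
# kernels 5.13+ expose /sys/kernel/debug/sched/<name> (debugfs, root only).
sched_knob_path() {
    name=$1        # e.g. min_granularity_ns
    root=${2:-}    # optional path prefix, only useful for testing
    if [ -e "$root/proc/sys/kernel/sched_$name" ]; then
        echo "$root/proc/sys/kernel/sched_$name"
    elif [ -e "$root/sys/kernel/debug/sched/$name" ]; then
        echo "$root/sys/kernel/debug/sched/$name"
    else
        return 1   # knob not present (or debugfs not readable)
    fi
}

sched_knob_path min_granularity_ns \
    || echo "min_granularity_ns is not exposed on this kernel"
```

If the function prints a path, read the current value from it before writing a new one, so you have something to roll back to.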

Memory Management Tuning — vm.swappiness, dirty pages & huge pages

The virtual memory subsystem decides how aggressively to swap anonymous pages versus reclaim page cache. The key parameters:

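vm.swappiness: 0–100 (0–200 since kernel 5.8). Default 60. It weights how readily the kernel reclaims anonymous pages (by swapping) versus page cache when it needs memory. At 60, a burst of page cache growth can push idle-looking application pages to swap even though the box has plenty of reclaimable memory. Set to 1 for latency-sensitive apps. 0 tells the kernel to avoid swap until free plus file-backed memory falls below the high watermark; it can still swap after that point.
vm.dirty_ratio / vm.dirty_background_ratio: When writeback starts. Defaults are 10% for background writeback and 20% for the hard limit at which writers block. On write-heavy systems with lots of RAM, that hard limit means a large backlog and long stalls when it's hit. Lower to 5% (background) and 10% (blocking) for transaction logs.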
vm.vfs_cache_pressure: Controls tendency to reclaim inode/dentry cache. Default 100. Lower to 50 on file servers to keep metadata in memory.
vm.min_free_kbytes: Reserve memory to avoid direct reclaim under load. Set to 1% of RAM.

Huge pages (2MB vs 4KB) reduce TLB misses. For apps with large memory footprints (databases, JVMs), enable transparent huge pages (THP) or use explicit hugetlbfs. THP can cause allocation stalls – often better to disable it and pre-allocate huge pages.

memory-tuning.shBASH

#!/bin/bash
# TheCodeForge — Memory tuning for a MySQL database server
# Aim: reduce swap, control dirty writeback, and use huge pages

# Swap tuning
sysctl -w vm.swappiness=1
sysctl -w vm.min_free_kbytes=$(( $(grep MemTotal /proc/meminfo | awk '{print $2}') / 100 ))

# Dirty page tuning for transaction logs
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Reduce inode cache pressure
sysctl -w vm.vfs_cache_pressure=50

# Disable transparent huge pages (THP) to avoid stalls, use explicit huge pages instead
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Pre-allocate 1024 huge pages for MySQL buffer pool
sysctl -w vm.nr_hugepages=1024

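# Verify (newer kernels list extra *Huge* counters, so match only the ones we need)
grep -E '^(AnonHugePages|HugePages_)' /proc/meminfo | head -4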

Output

AnonHugePages: 0 kB

HugePages_Total: 1024

HugePages_Free: 1024

HugePages_Rsvd: 0
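
How many huge pages to reserve follows from simple arithmetic: the memory you want backed, divided by the huge page size, rounded up. A minimal sketch, assuming the default 2MB page size unless told otherwise; hugepages_needed is a hypothetical helper name:

```shell
# Number of huge pages needed to back a given amount of memory.
hugepages_needed() {
    want_mb=$1            # memory to back, in MiB
    page_kb=${2:-2048}    # huge page size in KiB (see Hugepagesize in /proc/meminfo)
    # Integer division, rounded up
    echo $(( (want_mb * 1024 + page_kb - 1) / page_kb ))
}

hugepages_needed 2048    # 2 GiB buffer pool with 2 MiB pages -> 1024
```

The 1024 pages reserved in the script therefore cover a 2GB buffer pool; a larger pool needs a proportionally larger vm.nr_hugepages.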

Production Insight

A Redis instance with vm.swappiness=60 started swapping when a backup process filled the page cache. The backup read 100GB of data; as the page cache grew, the kernel reclaimed memory by swapping out Redis pages it considered idle. P99 latency went from 2ms to 4s.

Fix: set swappiness=1 and configure cgroups to isolate backup's memory pressure.

Rule: backup and burst processes can trigger swapping on co-located apps – always use cgroup memory limits.
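
To confirm whether a specific process has pages in swap, read the VmSwap field the kernel reports per process. A minimal sketch; swap_kb is a hypothetical helper name, and the Redis process name in the usage comment is an assumption about your setup:

```shell
# Print how many KiB of a process are currently swapped out,
# given a /proc/<pid>/status-style file. Prints 0 if the field is absent.
swap_kb() {
    awk '/^VmSwap:/ {print $2; found=1} END {if (!found) print 0}' "$1"
}

# Harmless demo on Linux: inspect the process that reads the file.
if [ -r /proc/self/status ]; then
    swap_kb /proc/self/status
fi

# Usage against a real service (process name is an assumption):
#   swap_kb "/proc/$(pgrep -o redis-server)/status"
```

Anything above zero on a latency-sensitive process is worth investigating before you touch any other knob.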

Key Takeaway

Set vm.swappiness=1 for apps that hate swapping.

Control dirty page ratios to avoid long write stalls.

Consider explicit huge pages over THP to avoid allocation jitter.

I/O Subsystem Tuning — Scheduler, Queue Depth & Block Layer

The Linux block layer sits between file systems and hardware. Its main tunable is the I/O scheduler, which queues and reorders requests. On NVMe SSDs any reordering scheduler (BFQ, or CFQ on pre-5.0 kernels) adds overhead the device doesn't need. Most current distros already default NVMe to none, but verify it: use none (no reordering) or mq-deadline (bounded request latency).

Key parameters

/sys/block/<dev>/queue/scheduler: set to 'none' for NVMe, 'mq-deadline' for SATA SSD, 'bfq' for spinning disks.
/sys/block/<dev>/queue/nr_requests: I/O queue depth. Increase for high-throughput workloads (e.g., 1024 for databases).
/sys/block/<dev>/queue/read_ahead_kb: Pre-fetch size. Larger values benefit sequential reads but waste cache on random workloads.

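Also tune filesystem mount options: noatime is almost always safe. nobarrier on ext4 trades crash safety for speed, so use it only on storage with power-loss protection (XFS has removed the option entirely). For XFS, consider the allocation group layout at mkfs time.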

io-tuning.shBASH

#!/bin/bash
# TheCodeForge - I/O tuning for an NVMe-based database (run as root)
DEV=nvme0n1
Q=/sys/block/$DEV/queue

# Set scheduler to none for NVMe (the multi-queue successor to noop)
echo none > "$Q/scheduler"

# Increase queue depth for concurrent I/O
# (the kernel may reject values above the device's hardware queue depth)
echo 1024 > "$Q/nr_requests"

# Keep read-ahead small since the database does random I/O
echo 128 > "$Q/read_ahead_kb"

# Mount with noatime to reduce metadata writes
mount -o remount,noatime /data

# ext4 only: disable barriers. Safe only on NVMe with power-loss protection.
mount -o remount,noatime,nobarrier /data

# Check new settings
cat "$Q/scheduler"
cat "$Q/nr_requests"

Output

[none] mq-deadline bfq

1024
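
The scheduler file lists every available scheduler and marks the active one in square brackets, which makes fleet-wide audits easy to script. A minimal sketch; active_scheduler is a hypothetical helper name:

```shell
# Extract the active scheduler from a string like "[none] mq-deadline bfq".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "[none] mq-deadline bfq"    # prints: none

# Audit every block device on this host (skips silently if /sys is absent).
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(active_scheduler "$(cat "$f")")"
done
```

Run the loop after every kernel or image upgrade; defaults have changed between releases.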

Production Insight

A PostgreSQL server on virtualised NVMe (cloud instance) used the default CFQ scheduler. CFQ split requests into 100ms time slices designed for spinning disks. The database's synchronous commit latency jumped to 150ms.

Fix: switch to none – latency dropped to 2ms.

Rule: always verify the I/O scheduler when deploying on flash storage.

Key Takeaway

NVMe → scheduler to 'none'.

SATA SSD → 'mq-deadline'.

Spinning disk → 'bfq'.

Increase nr_requests for high concurrency.

Network Stack Tuning — TCP Buffers, Congestion Control & Ring Buffers

The network stack's biggest bottleneck is often TCP buffer sizing and interrupt processing. For high-speed links (>1 Gbps), default socket buffers are too small, causing underutilisation.

Key parameters

net.core.rmem_max / net.core.wmem_max: Max receive/send socket buffer (bytes). Set to 16MB for 10G links.
net.ipv4.tcp_rmem / tcp_wmem: min-default-max for TCP buffers. Set min=4096, default=87380, max=16777216 (16MB).
net.ipv4.tcp_congestion_control: Default cubic. For lossy or long-haul links, use bbr (needs kernel 4.9+). BBR handles packet loss better and can increase throughput.
net.core.netdev_max_backlog: Max packets queued from NIC before kernel drops. Increase to 5000 on 10G links.
/sys/class/net/eth0/queues/rx-*/rps_cpus: Enable RPS (Receive Packet Steering) to spread receive-side packet processing across CPUs when the NIC has fewer hardware queues than you have cores.
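
Where does a figure like 16MB come from? A TCP connection needs roughly one bandwidth-delay product (BDP) of buffer to keep the link full. A minimal sketch of the arithmetic, assuming you know the link speed and round-trip time; bdp_bytes is a hypothetical helper name:

```shell
# Bandwidth-delay product in bytes.
# (Mbit/s * 1,000,000 / 8) bytes per second * (rtt_ms / 1000) seconds
# simplifies to Mbit/s * 125 * rtt_ms.
bdp_bytes() {
    mbit=$1      # link speed in Mbit/s
    rtt_ms=$2    # round-trip time in milliseconds
    echo $(( mbit * 125 * rtt_ms ))
}

bdp_bytes 10000 10    # 10 Gbit/s at 10 ms RTT -> 12500000 bytes (~12.5 MB)
```

A 16MB maximum therefore covers a 10G link up to roughly 13ms of RTT. A longer path needs a larger maximum, and a LAN with sub-millisecond RTT needs far less.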

network-tuning.shBASH

#!/bin/bash
# TheCodeForge — Network tuning for a 10Gbps web server

# Socket buffer maxima for 10G
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP auto-tuning ranges
sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'

# Use BBR congestion control (requires kernel 4.9+ and the tcp_bbr module)
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Increase backlog for bursty traffic
sysctl -w net.core.netdev_max_backlog=5000

# Enable RPS (Receive Packet Steering) on all cores for eth0
# For a 4-core machine: bitmap 0x0f (cores 0-3)
echo 'f' > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Verify
sysctl net.ipv4.tcp_congestion_control
cat /sys/class/net/eth0/queues/rx-0/rps_cpus

Output

net.ipv4.tcp_congestion_control = bbr

Production Insight

A streaming service used Cubic congestion control on a 10G inter-datacenter link with 0.2% packet loss. Throughput was capped at 1.5Gbps. After switching to BBR, throughput jumped to 8.5Gbps – BBR's bandwidth estimation doesn't treat packet loss as congestion.

Fix: sysctl net.ipv4.tcp_congestion_control=bbr.

Rule: for WAN links with >0.1% loss, BBR is almost always better than Cubic.

Key Takeaway

Size socket buffers to the bandwidth-delay product; 16MB covers a 10G link at roughly 13ms RTT.

BBR beats Cubic on lossy long-haul links.

RPS/IRQ balance prevents single-CPU saturation.

Putting It All Together — A Repeatable Tuning Workflow

Follow this sequence to avoid chaos:

Baseline measurement: Collect latency, throughput, CPU, memory, I/O, and network metrics for at least 24 hours under typical load. Use tools like sar (part of sysstat), perf, and netdata.
Identify bottleneck: Use the USE method (Utilization, Saturation, Errors) – e.g., CPU util > 90%? I/O queue length growing? Network drops?
Hypothesis and change: Pick ONE parameter. Change it. Document why and expected effect.
Measure again: Same period and load type. Compare before/after.
Accept or rollback: If improvement >5% in the target metric, keep. If not, rollback and try different hypothesis.
Make persistent: Only after validation, add to /etc/sysctl.d/ or tuned profiles.

Treat every parameter change as an experiment. Use tools like 'tuned' to apply preset profiles for common workloads (latency-performance, throughput-performance).
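
The accept-or-rollback step is easier to apply consistently when the decision is computed rather than eyeballed. A minimal sketch for a "lower is better" metric such as p99 latency, using the 5% threshold from the workflow; improvement_pct and verdict are hypothetical helper names:

```shell
# Percentage improvement between a baseline and a new measurement
# for a metric where lower is better (e.g. p99 latency in ms).
improvement_pct() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

# Keep the change only if it beats the threshold (default 5%).
verdict() {
    awk -v p="$(improvement_pct "$1" "$2")" -v t="${3:-5}" \
        'BEGIN { if (p + 0 > t + 0) print "keep"; else print "rollback" }'
}

improvement_pct 4.0 3.5    # -> 12.5
verdict 4.0 3.5            # -> keep
verdict 4.0 3.9            # -> rollback (only 2.5% better)
```

For throughput-style metrics where higher is better, swap the operands so the sign comes out the right way.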

The Tuning Loop

Default config is for breadth, not production
One change at a time, measure before and after
The kernel has hundreds of knobs, but only 5-10 matter for your workload
If you can't explain why a parameter helps, don't apply it

Production Insight

A team applied 20 sysctl changes from a 'production tuning' blog post at once. When latency improved, they didn't know which change caused it. When the database later started throwing connection resets, they couldn't roll back.

Rule: change one parameter per iteration. Use version control for your sysctl configs.
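
One way to keep sysctl changes reviewable is to record each validated setting together with the reason it exists, then commit the file. A minimal sketch; persist_sysctl and the drop-in file name 99-tuning.conf are assumptions for illustration, not a standard:

```shell
# Append one validated setting to a drop-in file, with its justification
# written as a comment on the line above it.
persist_sysctl() {
    key=$1; value=$2; reason=$3
    file=${4:-/etc/sysctl.d/99-tuning.conf}    # hypothetical file name
    printf '# %s\n%s = %s\n' "$reason" "$key" "$value" >> "$file"
}

# Demo against a temp file so nothing on the host changes.
demo=$(mktemp)
persist_sysctl vm.swappiness 1 "Redis p99: keep the working set out of swap" "$demo"
cat "$demo"
rm -f "$demo"

# On a real host (as root), write to /etc/sysctl.d/ and load all drop-ins with:
#   sysctl --system
```

One setting per commit makes the later question "which change caused this?" answerable from the history.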

Key Takeaway

Tune one parameter at a time.

Baseline for 24h before any change.

Measure the same metric – more data, less guesswork.

CPU Cache Thrashing Is Your Silent Killer

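Most devs stare at CPU utilization and think they're fine. The number means little when your L2 cache miss rate is 40%: a core stalled on memory is still reported as busy, so utilization can't tell you how much of that time was real computation. The CPU spends more cycles waiting on memory than computing. That's the real bottleneck.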

Cache hierarchy matters because data locality is physics, not magic. L1 cache runs at CPU speed. L3 is an order of magnitude slower. Every cache miss stalls the pipeline. If the scheduler keeps bouncing your thread between cores, you're flushing and reloading that cache every time. That's thrashing.

Check your cache topology with lscpu or lstopo. Pin critical processes with taskset or numactl to keep them on the same core. For NUMA boxes, memory allocation follows CPU node. Bind both to the same node. Remote memory access is slow, and your database will feel it.

Run perf stat -e cache-misses,cache-references on your workload; the miss rate is misses divided by references. If it exceeds 5%, you're leaving performance on the table. Fix affinity first. Tune later.

CacheMissCheck.ymlYAML

# io.thecodeforge - devops tutorial

# Check cache topology for a production web server
- name: Inspect CPU cache hierarchy
  # Pipes need the shell module; command does not run a shell
  shell: lscpu | grep -E "(cache|Core|Socket)"
  register: cache_info
  changed_when: false

- name: Pin nginx worker to specific core
  command: taskset -pc 0 {{ nginx_pid }}
  # lscpu pads its columns, so match with a regex rather than an exact string
  when: cache_info.stdout is search('L1d cache:\s+32')

- name: Measure cache misses under load
  command: perf stat -e cache-misses,cache-references -p {{ nginx_pid }} sleep 10

Output

Performance counter stats for process 12345:

1,234,567,890 cache-references

89,012,345 cache-misses # 7.21% of all cache refs

10.001234 seconds time elapsed
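
The percentage perf prints is simply misses divided by references. A minimal sketch of the same calculation, handy when you collect the raw counters yourself; miss_rate_pct is a hypothetical helper name:

```shell
# Cache miss rate as a percentage of all cache references.
miss_rate_pct() {
    awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f\n", m / r * 100 }'
}

# Counters from the sample output above
miss_rate_pct 89012345 1234567890    # -> 7.21
```

Track this number before and after pinning; the absolute counts depend on load, but the ratio is comparable between runs.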

Production Trap: The 5% Utilization Myth

A utilization figure, high or low, doesn't tell you whether the CPU was computing or stalled: cycles spent waiting on a cache miss are counted as busy time. Always check cache-misses before scaling out.

Key Takeaway

Cache miss rate above 5% means you're memory-bound, not CPU-bound. Pin processes to cores before tuning anything else.

Context Switches Are The Hidden Tax On Your Throughput

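Every context switch costs you CPU cycles. The scheduler saves state, switches address spaces (which can cost TLB entries), and loads the next thread. On a busy web server handling 10k connections, you might be switching 100k times per second. At a microsecond or more per switch, that's at least 100ms of CPU time gone every second, before counting the cache misses that follow each switch.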

High context switch rates usually mean one of two things: too many threads fighting for CPU time, or I/O-bound tasks that yield constantly. Both waste cycles. The fix: reduce thread count or switch to an event-driven model like epoll. For databases, tune the connection pool to match core count, not max connections.

Watch /proc/stat or use vmstat 1 and look at the cs column. If it's over 50k per core per second, you're thrashing the scheduler. perf sched gives you a per-process breakdown. Identify the worst offender. Either pin it, scale its threads, or rewrite it.

Don't touch scheduler policies (SCHED_FIFO, SCHED_RR) unless you know exactly what you're doing. A runaway real-time task can starve everything else on its core and lock up the box. Start with affinity and thread count. That's 80% of the win.
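
The per-core rate can be derived from the ctxt counter in /proc/stat, which counts context switches since boot. A minimal sketch that samples it twice; ctxt_total and cs_per_core are hypothetical helper names:

```shell
# Total context switches since boot, from a /proc/stat-style file.
ctxt_total() {
    awk '/^ctxt / {print $2}' "${1:-/proc/stat}"
}

# Switches per core per second between two samples.
cs_per_core() {
    first=$1; second=$2; secs=$3; cores=$4
    echo $(( (second - first) / secs / cores ))
}

cs_per_core 1000000 5000000 5 16    # -> 50000, right at the threshold

# Live usage on a Linux host:
#   a=$(ctxt_total); sleep 5; b=$(ctxt_total)
#   cs_per_core "$a" "$b" 5 "$(nproc)"
```

The result is an average across all cores; a single hot core can be far above it, so follow up with a per-process view when the average looks borderline.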

ContextSwitchAudit.ymlYAML

# io.thecodeforge - devops tutorial

# Audit context switches on a production API server
- name: Check context switch rate
  shell: vmstat 1 5 | tail -4 | awk '{print $12}'
  register: cs_rate
  changed_when: false

- name: Alert if excessive context switching
  fail:
    msg: "Context switch rate {{ item }} per second exceeds threshold"
  loop: "{{ cs_rate.stdout_lines }}"
  when: item | int > 50000

- name: Identify top offender by context switches
  # '&&' needs the shell module; perf needs root
  shell: perf sched record -g -a sleep 5 && perf sched latency --sort switch
  register: sched_latency
  become: yes

Output

---------------------------------------------------------------------

Task | Runtime ms | Switches | Avg delay ms | Max delay ms |

---------------------------------------------------------------------

nginx:12345 | 1200.345 | 5432 | 0.12 | 2.34 |

postgres:56789 | 800.123 | 3210 | 0.08 | 1.45 |

python-worker:98765 | 400.456 | 8765 | 0.45 | 8.91 |

---------------------------------------------------------------------

Senior Shortcut: Thread Count Math

Key Takeaway

Context switch rate over 50k per core per second means you're paying more for switching than for actual work. Reduce threads or switch to async.

Interrupt Affinity: Don't Let IRQs Steal Your Cache

Network interrupts land on whatever core the kernel picks. Without irqbalance or manual affinity, that's often CPU 0. Your cache-hot nginx worker on that core now gets interrupted 20k times a second to handle packet processing. Cache evicted. Pipeline stalled. Performance tanks.

Bind interrupt request lines (IRQs) to a dedicated core — separate from your application cores. This keeps your worker's cache warm and lets the interrupt handler run unopposed. Check /proc/interrupts to see which IRQ is hammering which core. Then write the CPU mask to /proc/irq/<N>/smp_affinity.

For high-throughput NICs (10GbE+), spread IRQs across a set of cores, not all. Use irqbalance as a baseline but don't trust it blindly for heavy loads. Manual tuning beats a daemon every time. Match NIC queue count to core count, then assign one queue per core. No sharing.

Test with perf top or mpstat -I CPU before and after. If you see 10%+ of CPU time in softirq or hardirq, you have work to do. Dedicated interrupt cores are free performance. Take it.
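
The value written to smp_affinity is a hexadecimal bitmask with one bit per CPU, which is easy to get wrong by hand. A minimal sketch that builds the mask from a list of core numbers; cpu_mask is a hypothetical helper name, and it assumes fewer than 63 CPUs (larger machines use comma-separated 32-bit groups):

```shell
# Build the hex CPU mask for a list of core numbers.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))    # set the bit for this core
    done
    printf '%x\n' "$mask"
}

cpu_mask 2          # -> 4   (core 2 only)
cpu_mask 0 1 2 3    # -> f   (cores 0-3)
cpu_mask 4 5 6 7    # -> f0  (cores 4-7)
```

If you'd rather skip the mask entirely, /proc/irq/<N>/smp_affinity_list accepts a plain core list such as 2 or 4-7.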

InterruptAffinityBinding.ymlYAML

# io.thecodeforge - devops tutorial

# Bind NIC interrupts to dedicated cores (not application cores)
- name: Check current interrupt distribution
  command: grep eth0 /proc/interrupts
  register: irq_eth0
  changed_when: false

- name: Identify IRQ numbers for eth0
  shell: "grep eth0 /proc/interrupts | awk '{print $1}' | tr -d ':'"
  register: irq_numbers
  changed_when: false

# The name is quoted because it contains ': ', which YAML would read as a mapping
- name: "Bind each IRQ to core 2 (mask: 0x04), skipping cores 0 and 1"
  shell: "echo 04 > /proc/irq/{{ item }}/smp_affinity"
  loop: "{{ irq_numbers.stdout_lines }}"
  become: yes

- name: Verify binding
  shell: "head -n 1 /proc/interrupts && grep eth0 /proc/interrupts"

Output

CPU0 CPU1 CPU2 CPU3

98: 0 0 12345 0 IR-PCI-MSI eth0

99: 0 0 13456 0 IR-PCI-MSI eth0-tx-0

100: 0 0 14567 0 IR-PCI-MSI eth0-rx-0

Production Trap: Don't Bind IRQs To Your App Cores

Key Takeaway

Bind network IRQs to dedicated cores outside your application affinity mask. This isolates cache contention and can boost throughput by 10-20% under load.

I/O Priority: Stop Letting Batch Jobs Steal Your Latency

Most engineers tune I/O scheduling but ignore I/O priority. That's like upgrading your car's tires while leaving the parking brake on. The kernel's I/O priority system (ionice) lets you tell the block layer which processes are latency-sensitive and which are background noise.

WHY this matters: Without I/O priority, a cron job running log rotation can starve your database. The kernel defines three priority classes: Real-time (1), Best-effort (2), and Idle (3). BFQ honours them, as CFQ did on pre-5.0 kernels. With the none scheduler, ionice settings have little or no effect, so on those devices use cgroup I/O controls instead. Real-time processes always get first dibs on I/O requests. Best-effort divides bandwidth proportionally. Idle processes only run when nobody else needs the disk.

HOW to use it: Attach ionice to critical processes. Your PostgreSQL should be ionice -c 2 -n 0 (best-effort, high priority). Your nightly backup script should be ionice -c 3 (idle). On systemd, set IOSchedulingClass and IOSchedulingPriority in your service unit. This single change can cut database tail latency by 40% in mixed-workload environments. Test it: run fio with different ionice classes and watch the latency divergence.
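
A small wrapper keeps the class choice consistent across scripts and cron jobs. A minimal sketch; ionice_args and the role labels (latency, default, batch) are hypothetical names chosen for illustration, and the backup command in the usage comment is an example path:

```shell
# Map a workload role to ionice arguments.
ionice_args() {
    case $1 in
        latency) echo "-c 2 -n 0" ;;    # best-effort, highest level
        default) echo "-c 2 -n 4" ;;    # best-effort, the usual default level
        batch)   echo "-c 3" ;;         # idle: runs only when the disk is otherwise free
        *)       return 1 ;;            # unknown role
    esac
}

ionice_args batch      # -> -c 3
ionice_args latency    # -> -c 2 -n 0

# Usage (word splitting of the arguments is intentional):
#   ionice $(ionice_args batch) tar czf /backup/data.tgz /data
```

Remember that these classes only matter when the device's scheduler honours them; check the active scheduler before relying on this.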

docker-compose.ymlYAML

# io.thecodeforge - devops tutorial

services:
  postgres:
    image: postgres:15
    # Containers start at the default I/O priority (best-effort, level 4).
    # Compose has no ionice setting; on bare metal, wrap the entrypoint with ionice.

  backup:
    image: backup-agent:latest
    # Lowest priority: should not impact prod queries
    cpu_shares: 128
    mem_limit: 256m
    # In a container runtime, set relative I/O priority via the cgroup I/O weight:
    blkio_config:
      weight: 10

Output

No direct output; apply via systemd drop-in:

[Service]

IOSchedulingClass=best-effort

IOSchedulingPriority=0

Production Trap:

Don't set real-time I/O priority (ionice -c 1) on anything except watchdog daemons. A bug in your process can lock the entire I/O subsystem. Best-effort at level 0 (ionice -c 2 -n 0) is the highest safe priority for production.

Key Takeaway

I/O priority is free latency protection — classify every I/O-heavy process or accept random slowdowns.

Useful I/O Monitoring Commands: Stop Guessing, Start Measuring

You can't tune what you can't see. Most engineers look at iostat, get lost in %util, and call it a day. That number is a lie — %util shows device busy time, not saturation. You need the real tools.

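WHY this matters: False signals cause bad tuning. %util at 100% means nothing on an NVMe drive that can queue 64K commands. You need indicators of actual congestion: average queue size (avgqu-sz, shown as aqu-sz in newer sysstat releases) and the r_await and w_await latencies. Treat svctm with suspicion: it's unreliable on devices that service requests in parallel, and recent sysstat versions have dropped it.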

HOW to do it right

iostat -x 1: Look at avgqu-sz (aqu-sz) > 1 per device. That's requests waiting in the queue. Then compare r_await and w_await with what the device should deliver: under 10ms on spinning rust is fine; under 1ms on SSD is fine.
iotop -oP: See which process is eating I/O right now. -o hides idle entries and -P aggregates by process instead of listing individual threads. Discover your rogue log writer.
blktrace / blkparse: Capture every I/O event. Trace a 10-second window: timeout 10 blktrace -d /dev/sda -o - | blkparse -i -
This shows exact I/O sizes, latencies, and which process issued them.
bcc-tools (biosnoop, biolatency): Instant histograms of I/O latency. No configuration. Just run biosnoop and watch outliers appear.

Ditch guesswork. Measure with blktrace when you profile, iostat when you monitor, iotop when you firefight.

monitor.ymlYAML

# io.thecodeforge - devops tutorial

steps:
  - name: "Check I/O congestion"
    command: "iostat -x 1 5"
    # Look for avgqu-sz (aqu-sz on newer sysstat) > 1.0 on any device

  - name: "Find I/O hog"
    command: "iotop -oP -b -n 1"
    # -o: only show processes actually doing I/O
    # -P: show processes, not individual threads
    # -b: batch mode (non-interactive)

  - name: "Capture I/O trace"
    command: "timeout 10 blktrace -d /dev/nvme0n1 -o - | blkparse -i -"

# Install tools (Debian/Ubuntu):
#   apt install sysstat iotop blktrace bpfcc-tools

Output

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

nvme0n1 0.00 0.00 892.00 64.00 3568.00 256.00 8.00 2.45 2.68 2.01 12.34 0.22 21.00

Senior Shortcut:

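Write a one-liner: iostat -x 1 | awk '/nvme/ && $10 > 2 {print strftime(), $0}'. In the column layout shown above, $10 is await in milliseconds, so this prints a timestamped line whenever NVMe latency exceeds 2ms. Column positions differ between sysstat versions, so check the header first; strftime() needs gawk. Run it in a tmux pane. Now you see congestion before the pager goes off.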

Key Takeaway

Don't rely on %util alone. Watch queue size (avgqu-sz or aqu-sz) together with r_await and w_await: that's real I/O health.

Keepalive Timeout: Stop Letting Dead Connections Rot Your Resources

Every open TCP socket costs you a file descriptor and a chunk of kernel memory. When a client crashes or drops off the network without sending FIN, the server never hears about it: that socket sits in ESTABLISHED until your process exits or the kernel notices. If you have 10,000 connections, 10% of them dead, you're leaking 1,000 descriptors. That's absurd.

WHY this matters: Default keepalive settings are glacial. tcp_keepalive_time is 7200 seconds (2 hours) on most distros. Production apps talking to flaky mobile clients don't have 2 hours. Your connection pool fills with zombies. New connections get refused once you hit the file-descriptor limit. Keepalive is your garbage collector for dead peers.

HOW to fix it: Set tcp_keepalive_time to 300 seconds (5 minutes). tcp_keepalive_intvl to 15 seconds. tcp_keepalive_probes to 5. That means after 5 minutes idle, the kernel sends 5 probes at 15-second intervals. Total detection time: 5 min + 75 sec = 6.25 minutes. These sysctls only affect sockets that have SO_KEEPALIVE enabled, so turn it on in application code. Don't apply a single value to everything: override per service with the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options, or set the sysctls per network namespace. Your database servers need shorter timeouts than your web servers.
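
The detection time is the idle period plus one interval per unanswered probe. A minimal sketch of that arithmetic, useful for comparing candidate settings before you apply them; keepalive_detect_secs is a hypothetical helper name:

```shell
# Worst-case seconds before the kernel declares an idle peer dead:
# idle time before the first probe + (interval * number of probes)
keepalive_detect_secs() {
    time=$1; intvl=$2; probes=$3
    echo $(( time + intvl * probes ))
}

keepalive_detect_secs 300 15 5     # tuned values   -> 375  (6.25 minutes)
keepalive_detect_secs 7200 75 9    # kernel defaults -> 7875 (over 2 hours)
```

Run it against each service's tolerance for dead peers; the right numbers differ between a database pool and a public-facing web tier.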

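Test it: run ss -tno state established and look at the timer column; sockets with keepalive enabled show timer:(keepalive,...). A pile of CLOSE_WAIT sockets (netstat -tn | grep CLOSE_WAIT) is a different problem: the peer closed cleanly and your application never called close(), which keepalive won't fix.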

keepalive-tune.ymlYAML

# io.thecodeforge - devops tutorial

# /etc/sysctl.d/99-tcp-keepalive.conf

net.ipv4.tcp_keepalive_time = 300
# Default: 7200 (2 hours). Production: 300.

net.ipv4.tcp_keepalive_intvl = 15
# Interval between keepalive probes.

net.ipv4.tcp_keepalive_probes = 5
# Number of probes before declaring connection dead.

# Apply (run in a shell, not part of this file):
#   sysctl -p /etc/sysctl.d/99-tcp-keepalive.conf

# Verify:
#   sysctl net.ipv4.tcp_keepalive_time

Output

net.ipv4.tcp_keepalive_time = 300

net.ipv4.tcp_keepalive_intvl = 15

net.ipv4.tcp_keepalive_probes = 5

# After tuning, watch idle ESTABLISHED connections to dead peers drain:

# ss -tn state established | wc -l

Production Trap:

Don't set tcp_keepalive_time below 60 seconds. You'll burn CPU on probe traffic and kill mobile clients on brief network glitches. 300 seconds is aggressive but safe for server-to-server connections.

Key Takeaway

A keepalive timeout under 10 minutes is free — dead connections are not. Tune it or accept resource leaks.

Kernel Function Tracing for Low-Level Analysis

WHY kernel tracing matters: Latency spikes hide behind aggregated metrics. Function tracing reveals exactly which kernel path causes a bottleneck. Use ftrace or BPF-based tools to measure entry-to-exit times of syscalls, interrupt handlers, and scheduler functions. Start with trace-cmd record -p function_graph -g do_sys_open to trace file open latency (on recent kernels the function is do_sys_openat2; check available_filter_functions in the tracing directory for the names your kernel exposes). For targeted analysis, use funclatency from BCC to histogram a single kernel function's execution time. Focus on high-frequency or high-jitter paths like tcp_ack, do_try_to_free_pages, or enqueue_task_fair. Compare before and after tuning. Always trace under load; idle traces tell you nothing. Stop guessing with averages; start tracing the 99th percentile path.

kernel_trace_example.ymlYAML

# io.thecodeforge - devops tutorial

name: trace-latency-p99
on:
  workflow_dispatch:
    inputs:
      function:
        description: 'Kernel function to trace'
        required: true
        default: 'do_sys_open'
      duration:
        description: 'Trace duration in seconds'
        default: '30'

jobs:
  trace:
    runs-on: ubuntu-22.04
    steps:
      - run: sudo apt-get install -y trace-cmd
      - run: |
          sudo trace-cmd record -p function_graph \
            -g ${{ inputs.function }} \
            sleep ${{ inputs.duration }}
      # Durations appear as "<n> us" in the report. The largest one belongs to the
      # outermost traced call. Layout varies by trace-cmd version; adjust if needed.
      - run: |
          sudo trace-cmd report \
            | grep -oE '[0-9]+\.[0-9]+ us' | sort -n | tail -1

Production Trap:

Running ftrace on every function in production crushes CPU. Always filter to one function or PID, and limit trace duration to under 60 seconds.

Key Takeaway

Trace one function at a time under real load, never idle.

Programmable Tracing for Custom Metrics

WHY programmable tracing: Traditional tools report fixed metrics: you get what they give. With BPF (BCC or bpftrace), you write custom probes on any kernel or user function. Measure the exact distribution of mutex hold times, NUMA miss ratios, or per-socket TCP retransmit counts. Start with bpftrace -e 'kprobe:tcp_retransmit_skb { @[uid] = count(); }' to count retransmits by the UID of the task running at the time (retransmits fired from timer context won't be attributed to the socket's owner). For NUMA analysis, run numastat: its numa_miss and numa_foreign counters show allocations that landed on a different node than the one intended. Build metrics that matter to your workload, not generic dashboard noise. Write scripts that save histograms and alert on tail latency shifts. This turns raw tracing into actionable tuning data.

bpf_metric_example.yml · YAML

# io.thecodeforge - devops tutorial

name: bpf-custom-metric
on:
  schedule:
    - cron: '*/5 * * * *'

jobs:
  collect:
    runs-on: ubuntu-22.04
    steps:
      - run: sudo apt-get update && sudo apt-get install -y bpftrace
      # pid is whichever task is on-CPU when the probe fires, so treat it as a hint.
      - run: |
          sudo bpftrace -e '
            kprobe:tcp_retransmit_skb
            {
              @retransmits_by_pid[pid] = count();
            }
            interval:s:60
            {
              print(@retransmits_by_pid);
              clear(@retransmits_by_pid);
              exit();
            }' > /tmp/retransmit_raw.log
      # Keep only map entries (lines starting with "@") whose count exceeds 5.
      - run: |
          awk '/^@/ && $NF + 0 > 5' /tmp/retransmit_raw.log \
            > /tmp/high_retransmit_pids.txt

Production Trap:

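BPF probes on high-frequency events (e.g., tcp_transmit_skb) add overhead to every call, and per-event output such as printf() can overflow the per-CPU perf buffer and drop events. bpftrace reports drops on stderr as "Lost N events". Aggregate in the kernel with maps (count(), hist()) instead of printing each event, and watch stderr for that warning.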

Key Takeaway

Write custom BPF probes for workload-specific metrics; never settle for stock tool outputs.
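
To turn the raw bpftrace log into something a script can alert on, a small parser is usually enough. The sketch below assumes the log contains map entries in bpftrace's usual "@name[key]: value" form next to status lines such as "Attaching 2 probes..."; the function name high_retransmit_pids and the SAMPLE_LOG text are hypothetical.

```python
import re

# One bpftrace map entry with an integer key, e.g. "@retransmits_by_pid[977]: 6".
MAP_LINE_RE = re.compile(r"^@\w*\[(\d+)\]:\s+(\d+)\s*$")


def high_retransmit_pids(log_text, threshold=5):
    """Return {pid: count} for map entries whose count exceeds threshold."""
    result = {}
    for line in log_text.splitlines():
        match = MAP_LINE_RE.match(line)
        if not match:
            continue  # skips "Attaching N probes..." and blank lines
        pid, count = int(match.group(1)), int(match.group(2))
        if count > threshold:
            # Keep the worst window if the same PID shows up more than once.
            result[pid] = max(result.get(pid, 0), count)
    return result


# Hypothetical log contents for illustration.
SAMPLE_LOG = """\
Attaching 2 probes...
@transmits_by_pid[0]: 12
@transmits_by_pid[4312]: 3
@transmits_by_pid[977]: 6
"""

print(high_retransmit_pids(SAMPLE_LOG))
```

Matching on the "@" prefix matters: a plain numeric comparison on the last field would also let non-numeric status lines through.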

● Production incident · POST-MORTEM · severity: high

The Silent OOM that Cost $2,000 in AWS Credits

Symptom

Redis cluster latency spikes, cloud monitoring shows memory pressure but no OOM kill. P99 response times go from 2ms to 4s. The node stays up but becomes unusable.

Assumption

The team assumed that with 64GB RAM and only 30GB of Redis data, swapping was impossible. They never checked vm.swappiness.

Root cause

vm.swappiness defaults to 60, which lets the kernel swap out anonymous pages during reclaim even while most of RAM holds reclaimable page cache. Once page cache grows to fill the memory Redis is not using, the kernel tries to keep that cache and pushes anonymous pages that look idle out to swap. Parts of Redis's working set end up on disk, and each access to them now requires a slow disk read.

Fix

Set vm.swappiness=1 in /etc/sysctl.d/99-swap.conf and apply with sysctl --system (plain sysctl -p reads only /etc/sysctl.conf). Then disable swap entirely for latency-critical workloads (swapoff -a). After tuning, swap usage dropped to zero and P99 returned to 2ms.

Key lesson

Default sysctl values are not safe for all workloads – especially vm.swappiness.
Always check swap usage (free -h, /proc/meminfo) even if you have 'plenty' of RAM.
For latency-sensitive apps, set vm.swappiness=1 or disable swap entirely.
Monitor /proc/[pid]/status VmSwap to see per-process swap usage.
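
The per-process check in the last lesson is easy to script. This is a minimal sketch that assumes the standard /proc/<pid>/status layout, where swap usage appears as a "VmSwap: N kB" line; the helper names vmswap_kb and top_swappers and the sample text are hypothetical.

```python
import re
from pathlib import Path


def vmswap_kb(status_text):
    """Extract VmSwap in kB from /proc/<pid>/status text; 0 if the field is absent."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def top_swappers(proc_root="/proc", limit=5):
    """Return [(kB, pid, name)] for the processes holding the most swap."""
    rows = []
    for status in Path(proc_root).glob("[0-9]*/status"):
        try:
            text = status.read_text()
        except OSError:
            continue  # the process exited while we were scanning
        kb = vmswap_kb(text)
        if kb:
            name = re.search(r"^Name:\s+(.+)$", text, re.MULTILINE)
            rows.append((kb, int(status.parent.name), name.group(1) if name else "?"))
    return sorted(rows, reverse=True)[:limit]


# Hypothetical status excerpt for illustration.
SAMPLE_STATUS = "Name:\tredis-server\nVmRSS:\t31457280 kB\nVmSwap:\t  524288 kB\n"
print(vmswap_kb(SAMPLE_STATUS), "kB swapped")
```

On a live host, calling top_swappers() lists the worst offenders; a latency-critical process such as Redis appearing there at all is the signal to act on.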

Production debug guide: Symptom -> Action flow for the most common production issues (4 entries)

Symptom · 01

Latency spikes, high %iowait in top or vmstat

→

Fix

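Check iostat -x 1. Look at aqu-sz (avgqu-sz in older sysstat versions) and await. If await and queue size are high with %util near 100%, the device is saturated. If await is high but %util is low, the delay is likely above the device, so check the I/O scheduler: cat /sys/block/sda/queue/scheduler. Switch to none on NVMe or mq-deadline on SATA SSD.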

Symptom · 02

Unexpected swapping (kswapd0 high CPU, p99 latency spikes)

→

Fix

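Run free -h and check the used column of the Swap row. Then run grep -E 'Swap|Dirty' /proc/meminfo: swap in use is SwapTotal minus SwapFree. To confirm that swapping is happening right now, watch the si and so columns of vmstat 1; sustained non-zero values mean pages are moving to and from disk. Set vm.swappiness=1 and consider disabling swap for latency-critical services.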
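
The swap check above reduces to one subtraction on two /proc/meminfo fields. A minimal sketch, assuming the usual "Key: value kB" layout of that file; the function names and the sample values are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}, values in kB where a unit is given."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use: SwapTotal minus SwapFree."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


# Hypothetical excerpt of /proc/meminfo for illustration.
SAMPLE_MEMINFO = """\
MemTotal:       65843012 kB
MemFree:         1203344 kB
SwapCached:        40960 kB
SwapTotal:       8388604 kB
SwapFree:        7340028 kB
Dirty:               512 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print(f"swap used: {swap_used_kb(info)} kB")
```

On a real host, pass the contents of /proc/meminfo instead of the sample string and alert when the result is non-zero on a latency-critical node.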

Symptom · 03

Network throughput far below link speed, dropped packets

→

Fix

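Check netstat -s for TCP retransmits and packet drops. Run ethtool -S eth0 | grep drop. For 10G links, raise net.core.rmem_max and net.core.wmem_max to 16MB (16777216), and raise the maximum (third) value of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to match, since those govern TCP autotuning. Also check net.ipv4.tcp_congestion_control and consider switching to BBR.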
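
Where a figure like 16MB comes from is easier to see with the bandwidth-delay product: the bytes that must be in flight to keep a link full equal the bandwidth multiplied by the round-trip time. The arithmetic below is a rough sketch; the 10 ms RTT is an assumed example value, and because the kernel reserves part of each socket buffer for bookkeeping, the results are upper bounds rather than exact limits.

```python
def bdp_bytes(link_bits_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return link_bits_per_s * rtt_ms // 8000  # 8 bits per byte, 1000 ms per second


def rtt_covered_ms(buffer_bytes, link_bits_per_s):
    """Longest round-trip time for which a buffer of this size can fill the link."""
    return buffer_bytes * 8 * 1000 / link_bits_per_s


TEN_GBIT = 10_000_000_000
SIXTEEN_MIB = 16 * 1024 * 1024

print(bdp_bytes(TEN_GBIT, 10))                          # bytes needed at 10 ms RTT
print(round(rtt_covered_ms(SIXTEEN_MIB, TEN_GBIT), 1))  # ms of RTT that 16 MiB covers
```

A 16 MiB ceiling therefore covers a 10 Gbit/s path up to roughly 13 ms of round-trip time; longer paths need a larger maximum, and same-rack traffic needs far less.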

Symptom · 04

CPU imbalance: one core pegged, others idle

→

Fix

Check /proc/interrupts for IRQ imbalance. Make sure irqbalance is running, or assign IRQ affinity manually via /proc/irq/*/smp_affinity. For NUMA, keep CPU and memory on the same node (numactl --cpunodebind=N --membind=N).
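
Spotting IRQ imbalance by eye in /proc/interrupts gets hard beyond a few cores, so a per-CPU total helps. The sketch below assumes the usual layout (a header row of CPU names, then one row per interrupt with one count per CPU); the function names and the sample rows are hypothetical.

```python
def per_cpu_totals(interrupts_text):
    """Sum interrupt counts per CPU from /proc/interrupts text."""
    lines = interrupts_text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        counts = line.split()[1:1 + ncpu]  # skip the "24:" style label
        for cpu, field in enumerate(counts):
            if not field.isdigit():
                break  # reached the description text on a short row
            totals[cpu] += int(field)
    return totals


def busiest_share(totals):
    """Fraction of all interrupts handled by the busiest CPU."""
    return max(totals) / sum(totals)


# Hypothetical 4-CPU excerpt where one NIC queue is pinned to CPU0.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1       CPU2       CPU3
 24:    9812345          0          0          0   PCI-MSI 524288-edge      eth0-rx-0
 25:       1200       1180       1210       1190   PCI-MSI 524289-edge      nvme0q1
NMI:          0          0          0          0   Non-maskable interrupts
"""

totals = per_cpu_totals(SAMPLE_INTERRUPTS)
print(totals, f"busiest CPU handles {busiest_share(totals):.1%}")
```

These counters are cumulative since boot, so for a live diagnosis take two snapshots a few seconds apart and compare the differences.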

★ Linux Performance Debug Cheat Sheet: quick-fire commands for the three most common failure scenarios

High CPU system time but low user time

Immediate action

Check context switches: vmstat 1 5. Watch cs (context switches per second).

Commands

vmstat 1 10 | awk '{print $12,$14}' # cs and sy columns

perf top -e cs -s comm,symbol # which tasks and symbols trigger context switches

Fix now

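Raise kernel.sched_migration_cost_ns to 5000000 (on kernels 5.13 and later the knob lives at /sys/kernel/debug/sched/migration_cost_ns). If context switches stay high, disable automatic NUMA balancing with kernel.numa_balancing=0.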
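
Column positions in vmstat output are easy to miscount and differ between procps versions, so a script that watches cs and sy is safer when it looks columns up by header name. A minimal sketch, assuming the standard two-line vmstat header; parse_vmstat and the sample output are hypothetical.

```python
def parse_vmstat(text):
    """Parse vmstat output into a list of {column_name: int} rows."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # the column-name row starts with the run-queue column
            header = fields
        elif header and len(fields) == len(header) and fields[0].isdigit():
            rows.append(dict(zip(header, map(int, fields))))
    return rows


# Hypothetical vmstat output for illustration.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812345  10240 204800    0    0     5    12  900 15000  5 40 55  0  0
 3  0      0 811900  10240 204812    0    0     0    20  950 16200  4 42 54  0  0
"""

for row in parse_vmstat(SAMPLE_VMSTAT):
    print(f"cs={row['cs']} sy={row['sy']}% us={row['us']}%")
```

The same rows also expose si and so, which makes this parser reusable for the swap checks elsewhere in the guide.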

Memory allocation latency spikes under load

Apache/Nginx worker threads time out during TCP reconnection

I/O Scheduler Selection Guide

Storage Type	Recommended Scheduler	Why
NVMe SSD	none (noop on legacy pre-5.0 kernels)	No reordering, minimal latency. NVMe sustains >100K IOPS, so scheduler overhead only slows it down.
SATA SSD	mq-deadline	Deadline ensures reads/writes don't starve. Multi-queue variant scales with multiple cores.
Spinning HDD (database)	bfq	Budget Fair Queuing provides fairness among processes. Good for mixed workloads.
Spinning HDD (sequential streaming)	mq-deadline	Favours reads over writes, but keeps requests sorted so sequential throughput remains high.
Virtualised (paravirtual SCSI)	none	Hypervisor handles scheduling; a guest-side scheduler adds double queuing.
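
To compare a host against the table, read each device's queue/scheduler file, where the active scheduler is the entry in square brackets. The sketch below is an illustration only: the function names are hypothetical, the mapping covers just the bare-metal rows of the table, and it cannot tell whether a disk is virtualised.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) entry from a queue/scheduler file."""
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return text.strip() or None  # some devices list a single unbracketed "none"


def recommended(device, rotational, streaming=False):
    """Map a device to the scheduler the table above suggests."""
    if device.startswith("nvme"):
        return "none"
    if not rotational:
        return "mq-deadline"
    return "mq-deadline" if streaming else "bfq"


def audit(sys_block="/sys/block"):
    """Yield (device, active, suggested) for each block device with a scheduler file."""
    for sched in sorted(Path(sys_block).glob("*/queue/scheduler")):
        queue = sched.parent
        device = queue.parent.name
        rotational = (queue / "rotational").read_text().strip() == "1"
        yield device, active_scheduler(sched.read_text()), recommended(device, rotational)


print(active_scheduler("[mq-deadline] kyber bfq none"))
print(recommended("nvme0n1", rotational=False))
```

On a Linux host, iterating over audit() and printing the rows where active and suggested differ gives a quick list of devices worth a second look.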

Key takeaways

Default kernel settings are for boot, not for production. Always tune for your workload.

One change per iteration. Baseline for 24h before and after. Document everything.

The four subsystems (CPU, memory, I/O, network) interact. Fix one bottleneck first, then the next.

NVMe storage must use the 'none' I/O scheduler to avoid latency overhead.

vm.swappiness=1 stops silent swapping that destroys latency.

BBR congestion control usually beats CUBIC on lossy or high-latency links (above roughly 0.1% packet loss); measure on your own traffic before switching.

Use tools like sar, perf, iostat, and sysctl consistently in a repeatable workflow.

Common mistakes to avoid

4 patterns

Copying sysctl settings from outdated blog posts without verification

Symptom

After applying, performance worsens or mysterious errors appear (e.g., net.ipv4.tcp_tw_recycle=1, which broke connections from clients behind NAT and was removed in Linux 4.12).

Fix

Verify that each parameter is supported by your kernel version. Use sysctl -a | grep <param> to confirm it exists and see its current value, and read the kernel documentation for that release. Test in non-production first.

Never measuring before applying changes

Symptom

No baseline exists. When latency improves, you can't tell if it was your change or traffic variation.

Fix

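Collect sar data for at least 24h before any tuning, for example sar -A -o baseline.sa 60 1440, which samples every 60 seconds and saves the readings to a file. Repeat the same collection after the change and compare the two periods with sar -f.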

Changing multiple parameters simultaneously

Symptom

Performance gain is real but you don't know which change caused it. If side effects appear, you can't isolate them.

Fix

Change ONE parameter per iteration. Document the change, reason, and outcome. Use version-controlled sysctl configs.
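
Version-controlled sysctl configs are most useful when a script can compare them against the running kernel. The sketch below parses a sysctl.d-style file and reports keys whose live value differs. It assumes simple "key = value" lines and the usual dots-to-slashes mapping under /proc/sys, which does not hold for keys that embed dots in an interface name; all function names and the sample file are hypothetical.

```python
from pathlib import Path


def parse_sysctl_conf(text):
    """Parse sysctl.d-style text into {key: value}, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # A leading "-" tells sysctl to ignore errors for that key.
            settings[key.strip().lstrip("-")] = value.strip()
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return f"{root}/{key.replace('.', '/')}"


def drift(settings, root="/proc/sys"):
    """Return {key: (wanted, live)} for keys whose running value differs."""
    out = {}
    for key, wanted in settings.items():
        path = Path(proc_path(key, root))
        live = " ".join(path.read_text().split()) if path.exists() else None
        if live != " ".join(wanted.split()):
            out[key] = (wanted, live)
    return out


# Hypothetical config file contents for illustration.
SAMPLE_CONF = """\
# swap tuning
vm.swappiness = 1
net.core.rmem_max=16777216
"""

wanted = parse_sysctl_conf(SAMPLE_CONF)
print(wanted)
print(proc_path("vm.swappiness"))
```

Running drift(wanted) after each change records what actually took effect, which keeps the one-change-per-iteration log honest.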

Disabling swap entirely on memory-constrained servers

Symptom

The OOM killer triggers more often because swap no longer acts as an emergency buffer, and the kernel cannot reclaim memory fast enough.

Fix

Set vm.swappiness=1 instead of setting it to 0 or turning off swap. Keep swap available but prioritise keeping pages in memory. Disable swap for latency-critical apps only when the host has enough spare RAM to absorb spikes.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

You notice a database server has high %iowait and I/O latency is 50ms. T...

Q02 · SENIOR

Explain the relationship between vm.swappiness, dirty ratio, and swap pr...

Q03 · SENIOR

How would you tune a 10Gbps web server to handle 100K concurrent connect...

Q01 of 03 · SENIOR

You notice a database server has high %iowait and I/O latency is 50ms. The I/O scheduler is CFQ and storage is NVMe. What is the first thing you would change?

ANSWER

Change the I/O scheduler to none (noop). CFQ was designed for spinning disks – it groups requests into time slices, which adds latency. NVMe flash has no seek penalty, so request reordering is unnecessary overhead. Run echo none > /sys/block/nvme0n1/queue/scheduler and verify with iostat -x 1 that await drops below 2ms.

FAQ · 4 QUESTIONS

Frequently Asked Questions

Is it safe to modify sysctl values without rebooting?

What is the difference between transparent huge pages (THP) and explicit huge pages?

Why does the default I/O scheduler on modern distros still use CFQ or BFQ?

How do I know which sysctl parameters to change?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

✓ Verified

production tested

May 23, 2026

last updated
