Mid-level 6 min · March 06, 2026

Linux Process Management — Unkillable D State from NFS Hang

All pods froze in D state from NFS hang - kill -9 had zero effect.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Every running program is a Linux process with a unique PID (Process ID)
  • ps shows process states — look at the STAT column for D (unkillable I/O wait), Z (zombie), R (running)
  • SIGTERM (15) asks politely; SIGKILL (9) forces death — always try SIGTERM first
  • Background jobs die when terminal closes unless you use nohup or disown
  • Process trees (pstree) reveal parent-child relationships faster than flat ps output
  • Use jobs, fg, bg for shell job control; strace for syscall-level debugging
Plain-English First

Imagine your computer is a busy restaurant kitchen. Every dish being cooked right now is a 'process' — it has a chef assigned to it, a station it runs on, and a ticket number so the head chef can track it. Linux process management is how the head chef (the OS) keeps track of every dish, reassigns chefs when things get backed up, and shuts down a dish that's gone wrong before it burns the whole kitchen down.

Every command you run, every web server you start, every cron job that fires at midnight — all of it becomes a Linux process. Understanding how those processes live, communicate and die isn't optional knowledge for a DevOps engineer or backend developer; it's the difference between confidently diagnosing a runaway process at 2 AM and blindly rebooting a production server and hoping for the best.

The problem most developers hit is that they learn 'ps aux | grep something' and 'kill -9' and think that's process management. It isn't. That's like learning to use a fire extinguisher but not knowing what causes fires. Real process management means understanding process states, parent-child relationships, signals, job control, and how the kernel schedules work — so you can make deliberate decisions instead of panicked ones.

By the end of this article you'll be able to inspect any running process and understand what it's doing, send the right signal for the right situation (spoiler: kill -9 is almost never the right answer), manage foreground and background jobs like a pro, and build the mental model that makes every 'why is my server slow?' investigation start from a place of clarity.

How Linux Processes Are Born — PIDs, PPIDs and the Process Tree

Every process in Linux gets a Process ID (PID) — a unique integer the kernel assigns at birth. But processes don't appear from nowhere. Almost every process is spawned by another process, its parent, which holds a Parent Process ID (PPID). This parent-child relationship forms a tree, and the root of that entire tree is PID 1 — the init system (systemd on modern distros).

Why does this matter? Because when a parent process dies before its child, the child becomes an 'orphan' and gets re-parented to PID 1. When a child dies but the parent hasn't called wait() to collect its exit status, the child becomes a 'zombie' — it occupies a PID slot and a row in the process table while holding no real resources. A handful of zombies is harmless. Thousands mean something in your code is seriously wrong.
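To make the zombie state concrete, here is a minimal sketch that creates one on purpose and reads its state letter. It assumes Linux with /proc mounted; the variable names and the trick of having the parent exec into `sleep` (a program that never calls wait()) are illustrative choices, not part of the original walkthrough.

```bash
# Sketch: create a short-lived zombie and read its state from /proc.
pid_file=$(mktemp)

# The inner shell forks a child (sleep 0.2), records the child's PID, then
# replaces itself with `sleep 5`, which never calls wait() on that child.
bash -c 'sleep 0.2 & echo "$!" > "$0"; exec sleep 5' "$pid_file" &
parent_pid=$!

sleep 1   # the child has exited by now, but nobody has collected its status
child_pid=$(cat "$pid_file")

# The State: line in /proc/<pid>/status carries the same letter ps shows in STAT
zombie_state=$(awk '/^State:/ { print $2 }' "/proc/$child_pid/status")
echo "child $child_pid (parent $parent_pid) is in state: $zombie_state"

# Killing the parent lets PID 1 (or a subreaper) adopt and reap the zombie
kill "$parent_pid"
wait "$parent_pid" 2>/dev/null || true
rm -f "$pid_file"
```

The zombie disappears only once something collects its exit status, which is why the cleanup targets the parent rather than the zombie itself.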

The fork-exec model is how new processes are created. A parent calls fork(), which clones itself into a child process. The child then calls exec() to replace its memory with a new program. This is why your shell is the parent of almost every command you run. Combined with the SIGHUP a shell passes to its jobs when the terminal closes, it is also why closing a terminal usually takes the processes started inside it down with it.
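The difference between the two steps shows up in the PIDs. The sketch below is only an illustration from the shell's point of view (it never calls fork() or exec() directly): a newly started command gets a new PID, while `exec` swaps the program and keeps the PID. All variable names are made up for the example.

```bash
# Sketch of fork vs exec as seen from the shell (PIDs differ on every run).
shell_pid=$$

# fork: running a command creates a child process with its own PID
forked_pid=$(bash -c 'echo $$')

# exec: the first bash prints its PID, then exec replaces it with a second
# bash inside the same process, so the second line shows the same PID
exec_pids=$(bash -c 'echo $$; exec bash -c "echo \$\$"')
pid_before_exec=$(echo "$exec_pids" | sed -n 1p)
pid_after_exec=$(echo "$exec_pids" | sed -n 2p)

echo "shell=$shell_pid forked child=$forked_pid"
echo "before exec=$pid_before_exec after exec=$pid_after_exec"
```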

Run pstree to see the entire family tree live. It's one of the most clarifying commands a Linux learner can run.

process_tree_inspection.sh (BASH)
#!/usr/bin/env bash
# process_tree_inspection.sh
# Goal: Understand where a process comes from and who owns it

# --- Step 1: Find the PID of the current shell ---
current_shell_pid=$$   # $$ is a special variable holding the current process's PID
echo "Current shell PID: $current_shell_pid"

# --- Step 2: Start a background sleep to give us something to inspect ---
sleep 300 &            # The & sends the process to the background immediately
sleep_pid=$!           # $! captures the PID of the last background command
echo "Background sleep PID: $sleep_pid"

# --- Step 3: Inspect the process in detail using ps ---
# -o lets us choose exactly which columns to display
# pid=process id, ppid=parent process id, stat=state, cmd=full command
echo ""
echo "--- Detailed view of our sleep process ---"
ps -o pid,ppid,stat,user,cmd -p "$sleep_pid"

# --- Step 4: Show how this process fits in the full tree ---
echo ""
echo "--- Process tree rooted at our shell ---"
pstree -p "$current_shell_pid"   # -p shows PIDs next to each process name

# --- Step 5: Clean up — kill our background sleep gracefully ---
kill "$sleep_pid"      # Sends SIGTERM (signal 15) by default — polite shutdown request
echo ""
echo "Sent SIGTERM to sleep process $sleep_pid. It should be gone now."

# --- Step 6: Confirm it's gone ---
sleep 0.2              # Give the kernel a moment to clean up
if ! kill -0 "$sleep_pid" 2>/dev/null; then
  # kill -0 doesn't actually kill — it just checks if the process exists
  echo "Confirmed: PID $sleep_pid no longer exists."
fi
Output
Current shell PID: 47821
Background sleep PID: 47832
--- Detailed view of our sleep process ---
PID PPID STAT USER CMD
47832 47821 S deploy sleep 300
--- Process tree rooted at our shell ---
bash(47821)---sleep(47832)
Sent SIGTERM to sleep process 47832. It should be gone now.
Confirmed: PID 47832 no longer exists.
Why PPID Matters in Production:
When debugging a runaway process, check its PPID first. If a worker is leaking memory and its parent turns out to be the nginx master process rather than your application, that points the investigation at nginx and its configuration before your application code. ps -o pid,ppid,cmd -p <PID> is the first command to run.
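When one PPID lookup is not enough, the whole chain of ancestors can be followed upward. The helper below is a sketch under the assumption of a Linux /proc filesystem; `ancestry` is a hypothetical name, not a standard command.

```bash
# Sketch: walk from a PID up to the top of the process tree using /proc.
ancestry() {
  pid=$1
  while [ -n "$pid" ] && [ "$pid" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
    name=$(cat "/proc/$pid/comm" 2>/dev/null)
    ppid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status")
    echo "$pid $ppid $name"     # PID, parent PID, command name
    pid=$ppid                   # step up to the parent
  done
}

ancestry "$$"   # one line per ancestor, starting with the current shell
```

The walk stops when it reaches a process whose parent is 0, which is PID 1 on a normal host and may be an earlier process inside a container.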
Production Insight
Zombie processes from a misbehaving parent can consume all available PIDs, causing fork() failures across the system.
Run pstree -p to find the parent and restart it to reap zombies.
Rule: make sure every parent reaps its children, for example with a SIGCHLD handler that calls waitpid() to collect each child's exit status.
Key Takeaway
Processes form a tree rooted at PID 1.
Orphans get reparented; zombies indicate a broken parent.
The fork-exec model is how Linux creates every new process.

Reading Process State — What ps and top Are Actually Telling You

Developers glance at ps output and look for a name. Senior engineers look at the STAT column first. That single letter (or two) tells you exactly what the kernel is doing with that process right now, and it's the fastest way to diagnose a sick system.

The core states are: R (Running or runnable — actively using CPU or waiting for a CPU slot), S (Interruptible Sleep — waiting for I/O or an event, will wake up when signalled), D (Uninterruptible Sleep — waiting on I/O it cannot be interrupted from, typically disk or NFS), Z (Zombie — dead but parent hasn't collected exit status), and T (Stopped — paused by a signal like SIGSTOP or by a debugger).

The D state is the one that causes real pain. A process in D state cannot be killed, not even with kill -9: the signal is queued but is not acted on until the process leaves the kernel code path it is blocked in, which happens only when that operation finishes or times out. If you see dozens of processes in D state, your storage layer is almost certainly the problem: a hung NFS mount, a failing disk, or an overloaded I/O scheduler.
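As a rough way to see where blocked processes are waiting, the sketch below scans /proc for state D and prints the kernel wait channel next to each one. It assumes Linux; `list_blocked` is a made-up helper name, and on some kernels /proc/<pid>/wchan reports only 0, in which case the last column is not informative.

```bash
# Sketch: list processes in uninterruptible sleep plus their wait channel.
list_blocked() {
  for status_file in /proc/[0-9]*/status; do
    [ -r "$status_file" ] || continue
    state=$(awk '/^State:/ { print $2 }' "$status_file" 2>/dev/null)
    [ "$state" = "D" ] || continue
    pid=${status_file#/proc/}
    pid=${pid%/status}
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
    # PID, command name, kernel function the process is sleeping in
    echo "$pid $(cat "/proc/$pid/comm" 2>/dev/null) ${wchan:-unknown}"
  done
}

blocked=$(list_blocked)
blocked_count=$(printf '%s' "$blocked" | grep -c . || true)
echo "processes in D state: $blocked_count"
[ -z "$blocked" ] || echo "$blocked"
```

A wait channel that mentions nfs or rpc points at a network mount, while block-layer names point at a local disk.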

top and htop give you the same state information but in real time, so you can watch a process oscillate between R and S as it processes requests — that's healthy. A process pinned in R consuming 100% CPU for minutes is not.

process_state_diagnosis.sh (BASH)
#!/usr/bin/env bash
# process_state_diagnosis.sh
# Goal: Show how to read process states and identify problematic ones

# --- Snapshot of all processes with state info ---
# a = all users, u = user-oriented format, x = include processes without a terminal
echo "=== Full process snapshot (top 20 by CPU) ==="
ps aux --sort=-%cpu | head -20
# Output columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

echo ""
echo "=== Processes currently in Uninterruptible Sleep (D state) ==="
# These are the dangerous ones — they can't be killed and indicate I/O problems
ps aux | awk '$8 ~ /^D/ { print $0 }'
# awk checks column 8 (STAT) for a value starting with D

echo ""
echo "=== Zombie processes on this system ==="
# Zombies start with Z in the STAT column
zombie_count=$(ps aux | awk '$8 ~ /^Z/ { count++ } END { print count+0 }')
echo "Zombie count: $zombie_count"
if [ "$zombie_count" -gt 0 ]; then
  echo "Zombies found — listing with parent PIDs:"
  # ps aux has no PPID column, so ask for it explicitly (column 3 is STAT here)
  ps -eo pid,ppid,stat,user,cmd | awk 'NR == 1 || $3 ~ /^Z/'
  echo ""
  echo "To fix zombies: identify the parent (PPID) and restart it."
  echo "The parent is responsible for calling wait() to reap its children."
fi

echo ""
echo "=== Top 5 memory consumers ==="
ps aux --sort=-%mem | awk 'NR==1 || NR<=6 { print $0 }'
# NR==1 preserves the header row, NR<=6 gives us 5 data rows
Output
=== Full process snapshot (top 20 by CPU) ===
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
deploy 3821 24.3 2.1 987432 86412 ? Sl 09:14 1:43 node /app/server.js
postgres 1204 8.1 4.7 432100 193020 ? Ss 08:01 0:52 postgres: autovacuum
nginx 2301 1.2 0.3 48220 12300 ? S 08:01 0:08 nginx: worker process
root 1 0.0 0.1 169936 9812 ? Ss 08:00 0:01 /sbin/init
=== Processes currently in Uninterruptible Sleep (D state) ===
(none on this system — storage is healthy)
=== Zombie processes on this system ===
Zombie count: 0
=== Top 5 memory consumers ===
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 1204 8.1 4.7 432100 193020 ? Ss 08:01 0:52 postgres: autovacuum
deploy 3821 24.3 2.1 987432 86412 ? Sl 09:14 1:43 node /app/server.js
deploy 3840 0.1 1.8 880100 74210 ? Sl 09:14 0:02 node /app/worker.js
Watch Out: kill -9 Won't Touch a D-State Process
If ps shows your process in D state and kill -9 isn't working, stop sending signals: they stay pending until the kernel operation the process is blocked in completes. Check dmesg for I/O errors, run iostat -x 1 5 to watch disk utilisation, and check mount | grep nfs for hung NFS mounts. The fix is fixing the underlying I/O, not the process.
Production Insight
A sudden spike in D state processes is almost always a storage problem, not a code problem.
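Use iostat -x 1 to identify the device; use cat /proc/<PID>/wchan (and, as root, cat /proc/<PID>/stack) to see where in the kernel a stuck process is blocked. strace -e trace=openat,read,write -p <PID> helps only for processes that are still making syscalls, since a process already blocked in D state will not report anything new.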
Rule: when processes become unkillable, stop diagnosing the process and start diagnosing the I/O subsystem.
Key Takeaway
The STAT column is the first thing a senior engineer reads.
D state = unkillable I/O wait; Z state = broken parent.
Monitor these states in your alerting, not just CPU and memory.
Process State Diagnosis Decision Tree
If: STAT = R and CPU > 90% for > 2 minutes
Use: Likely a runaway process. Run strace -p <PID> to see syscalls; kill -SIGTERM <PID> if unexpected.
If: STAT = D (any duration)
Use: I/O bottleneck. Check dmesg, iostat, NFS mounts. Do not attempt kill.
If: STAT = Z (zombie)
Use: Parent not reaping. Find the parent via PPID; restart the parent service.
If: STAT = T (stopped)
Use: Process paused by SIGSTOP, or by SIGTSTP from Ctrl+Z. Use kill -SIGCONT <PID> to resume or kill -SIGTERM <PID> to terminate.
If: STAT = S (sleeping) but process seems unresponsive
Use: Normal for many server processes. Check if it's blocking on a resource (lsof, strace -e trace=network).

Signals — The Right Way to Talk to a Running Process

A signal is a small integer the kernel delivers to a process as a notification or instruction. Most developers only know two: kill -9 and 'the other one'. That ignorance causes real production problems — from data corruption when processes don't get to flush their write buffers, to configuration changes never taking effect because an engineer restarted instead of reloaded.

The key signals every DevOps engineer must know (the numbers shown are for x86 and ARM Linux; kill -l prints the values for your platform): SIGTERM (15) is a polite shutdown request: the process can catch it, finish what it's doing, close files, and exit cleanly. This is the default signal for kill and the one you should try first. SIGKILL (9) is unconditional termination by the kernel: the process gets no say and no cleanup. Use it only when SIGTERM has failed after a reasonable wait. SIGHUP (1) means 'hang up' and historically signalled a dropped modem line, but daemons such as nginx and sshd re-read their config files when they receive it, with no restart and no downtime. SIGSTOP (19) and SIGCONT (18) pause and resume a process. Ctrl+Z sends the closely related SIGTSTP (20), which a process can catch or ignore, whereas SIGSTOP cannot be caught; fg and bg resume a stopped job with SIGCONT. SIGUSR1 and SIGUSR2 are user-defined signals that applications can use for custom behaviour; some log rotation setups use them to tell a daemon to reopen its log files.

The kill command is misnamed — it sends signals, it doesn't exclusively kill. kill -l shows every signal your system supports.
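The shell also reports which signal ended a process: a child killed by signal N shows up as exit status 128+N. The sketch below uses that to contrast a default SIGTERM, a handled SIGTERM, and SIGKILL. The inline `service` script is a stand-in invented for this example, and its helper `sleep 5` may outlive it by a few seconds.

```bash
# Sketch: SIGTERM can be caught by the target; SIGKILL cannot.

# 1) Default disposition: SIGTERM terminates sleep
sleep 60 &
plain_pid=$!
kill -TERM "$plain_pid"
term_status=0
wait "$plain_pid" || term_status=$?        # 143 = 128 + 15

# A tiny "service" that turns SIGTERM into a clean exit
service='trap "exit 0" TERM; sleep 5 >/dev/null 2>&1 & wait'

# 2) The handler runs, so the exit status is whatever the handler chose
bash -c "$service" &
service_pid=$!
sleep 0.5                                  # let the trap get installed
kill -TERM "$service_pid"
handled_status=0
wait "$service_pid" || handled_status=$?   # 0

# 3) SIGKILL never reaches the handler
bash -c "$service" &
service_pid=$!
sleep 0.5
kill -KILL "$service_pid"
kill_status=0
wait "$service_pid" || kill_status=$?      # 137 = 128 + 9

echo "default TERM: $term_status, handled TERM: $handled_status, KILL: $kill_status"
```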

signal_management_demo.sh (BASH)
#!/usr/bin/env bash
# signal_management_demo.sh
# Goal: Demonstrate the right signal for each situation

# --- Part 1: Graceful shutdown vs forced kill ---
# Start a simulated long-running service
sleep 600 &
long_running_pid=$!
echo "Started fake service with PID: $long_running_pid"

# The RIGHT first step — ask politely
echo "Sending SIGTERM (graceful shutdown request)..."
kill -SIGTERM "$long_running_pid"   # Same as: kill -15 $long_running_pid

# Wait up to 5 seconds for graceful shutdown
for wait_seconds in 1 2 3 4 5; do
  sleep 1
  if ! kill -0 "$long_running_pid" 2>/dev/null; then
    echo "Process exited cleanly after ${wait_seconds}s. Good."
    break
  fi
  if [ "$wait_seconds" -eq 5 ]; then
    echo "Process didn't respond to SIGTERM after 5s. Now using SIGKILL."
    kill -SIGKILL "$long_running_pid"   # Only escalate when SIGTERM fails
  fi
done

echo ""

# --- Part 2: SIGHUP for zero-downtime config reload ---
# In production you'd do: kill -SIGHUP $(cat /var/run/nginx.pid)
# Let's simulate it with a script that catches SIGHUP
cat > /tmp/signal_catcher.sh << 'SCRIPT'
#!/usr/bin/env bash
trap 'echo "[PID $$] Caught SIGHUP — reloading config (no restart needed)"' SIGHUP
trap 'echo "[PID $$] Caught SIGTERM — shutting down cleanly"; exit 0' SIGTERM
echo "[PID $$] Service started. Waiting for signals..."
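# bash runs a trap only after the current foreground command ends, so sleep
# in the background and wait on it: wait returns as soon as a signal arrives
while true; do sleep 1 & wait $!; done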
SCRIPT
chmod +x /tmp/signal_catcher.sh

/tmp/signal_catcher.sh &
catcher_pid=$!
sleep 0.5   # Give it a moment to start

echo "--- Simulating config reload with SIGHUP ---"
kill -SIGHUP "$catcher_pid"         # nginx does this — reload config, keep serving traffic
sleep 0.3

echo ""
echo "--- Simulating graceful shutdown with SIGTERM ---"
kill -SIGTERM "$catcher_pid"        # Clean exit
wait "$catcher_pid" 2>/dev/null     # Wait for it to finish before exiting this script

echo ""
echo "--- All signals on this system (for reference) ---"
kill -l   # Print all signal names and numbers
Output
Started fake service with PID: 52341
Sending SIGTERM (graceful shutdown request)...
Process exited cleanly after 1s. Good.
[PID 52355] Service started. Waiting for signals...
--- Simulating config reload with SIGHUP ---
[PID 52355] Caught SIGHUP — reloading config (no restart needed)
--- Simulating graceful shutdown with SIGTERM ---
[PID 52355] Caught SIGTERM — shutting down cleanly
--- All signals on this system (for reference) ---
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGSTKFLT
17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
Pro Tip: Use trap in Every Long-Running Script
Add 'trap "cleanup_function" SIGTERM SIGINT' at the top of any bash script that writes temp files, holds locks, or manages child processes. Without it, Ctrl+C or a deployment pipeline kill leaves orphaned files and locks behind. The cleanup_function should remove temp files and kill child processes before exiting.
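A minimal sketch of that pattern follows. The worker script, the marker file standing in for a lock or temp file, and the function name `cleanup` are all illustrative assumptions; the point is only that the trap runs before the script exits.

```bash
# Sketch: a worker that removes its temp file when it receives SIGTERM.
worker_script=$(mktemp)
marker_file=$(mktemp)      # stands in for a lock or temp file the worker owns

cat > "$worker_script" <<'WORKER'
marker_file=$1
cleanup() {
  rm -f "$marker_file"               # release whatever the script was holding
  kill "$child_pid" 2>/dev/null      # stop the child it started
}
trap 'cleanup; exit 0' TERM INT
sleep 30 >/dev/null 2>&1 &           # the "work", backgrounded so wait can be interrupted
child_pid=$!
wait "$child_pid"
WORKER

bash "$worker_script" "$marker_file" &
worker_pid=$!
sleep 0.5                  # give the worker time to install its trap
kill -TERM "$worker_pid"   # what a deploy pipeline or plain `kill` would send
worker_status=0
wait "$worker_pid" || worker_status=$?

if [ -e "$marker_file" ]; then echo "marker left behind"; else echo "marker cleaned up"; fi
rm -f "$worker_script"
```

Running the work in the background and waiting on it matters: a trap cannot interrupt a foreground command, so a plain `sleep 30` would delay the cleanup until the sleep finished.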
Production Insight
Using SIGKILL as a first resort can cause data loss or corruption: in-flight database transactions are aborted and a half-written config file stays truncated.
Always implement a graceful shutdown handler in your applications and honour SIGTERM.
Rule: the only signal you should send without a grace period is one you understand completely.
Key Takeaway
SIGTERM is a polite request; SIGKILL is a forced execution.
SIGHUP reloads config without restart — use it for nginx, sshd.
Trap signals in scripts to avoid orphaned resources.

Job Control — Managing Foreground, Background and Suspended Processes

Job control is the shell's built-in mechanism for managing multiple processes from a single terminal session. It's the feature that lets you start a long compile, push it to the background, check your email, bring the compile back, and do all of this without opening a second terminal.

When you press Ctrl+Z, the shell sends SIGTSTP to the foreground process, which pauses it immediately (state changes to T). The process is now a 'stopped job'. bg resumes it in the background (sends SIGCONT). fg brings any background or stopped job back to the foreground. The jobs command lists everything the current shell is managing.
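The state change can be observed directly. This sketch stops and resumes a background `sleep` and reads its state letter each time. It deliberately uses SIGSTOP rather than SIGTSTP so that it behaves the same outside an interactive terminal; `proc_state` is a hypothetical helper and the example assumes Linux /proc.

```bash
# Sketch: watch the state letter change as a process is stopped and resumed.
proc_state() { awk '/^State:/ { print $2 }' "/proc/$1/status"; }

sleep 30 &
job_pid=$!
sleep 0.2

kill -STOP "$job_pid"; sleep 0.2
stopped_state=$(proc_state "$job_pid")    # T: stopped

kill -CONT "$job_pid"; sleep 0.2
resumed_state=$(proc_state "$job_pid")    # S: back to interruptible sleep

echo "after STOP: $stopped_state, after CONT: $resumed_state"
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true
```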

The critical thing most developers miss is that background jobs in a terminal session are tied to that terminal. Close the terminal (or SSH connection drops), and the shell sends SIGHUP to all its jobs, which kills them. This is why nohup and disown exist — nohup makes a process immune to SIGHUP, and disown removes a job from the shell's job table so closing the terminal doesn't affect it.
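The effect of nohup can be checked without closing a terminal by sending SIGHUP by hand. The sketch below is an illustration only; it assumes the script itself was not started with SIGHUP already ignored, since that setting would be inherited by both jobs.

```bash
# Sketch: send SIGHUP to a plain background job and to one started via nohup.
sleep 30 &
plain_pid=$!
nohup sleep 30 >/dev/null 2>&1 &
nohup_pid=$!
sleep 0.5                      # let nohup exec the real command

kill -HUP "$plain_pid" "$nohup_pid"
sleep 0.5

plain_status=0
wait "$plain_pid" || plain_status=$?      # 129 = 128 + SIGHUP (1)
if kill -0 "$nohup_pid" 2>/dev/null; then nohup_alive=yes; else nohup_alive=no; fi
echo "plain job exit status: $plain_status, nohup job alive: $nohup_alive"

kill "$nohup_pid"                         # SIGTERM still works on the nohup job
wait "$nohup_pid" 2>/dev/null || true
```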

For anything that needs to truly survive a disconnection, use tmux or screen — they create a persistent session that lives on the server, not inside your SSH connection.

job_control_workflow.sh (BASH)
#!/usr/bin/env bash
# job_control_workflow.sh
# Goal: Show the complete job control lifecycle including background survival

# Job control is off in non-interactive shells; without this, bg fails with "no job control"
set -m

# --- Part 1: Basic job management ---
echo "=== Starting three background jobs ==="

# Simulate three different long-running tasks
sleep 120 &   # Pretend this is a database backup
backup_pid=$!
echo "Backup job started — PID: $backup_pid, Job: $!"
# Note: $! is the PID again; the shell's job number for this process is %1 (see jobs -l below)

sleep 240 &   # Pretend this is a data export
export_pid=$!
echo "Export job started — PID: $export_pid"

sleep 360 &   # Pretend this is a log archive
archive_pid=$!
echo "Archive job started — PID: $archive_pid"

echo ""
echo "=== All current jobs in this shell ==="
jobs -l   # -l includes PIDs alongside job numbers

echo ""
echo "=== Suspending the export job (simulating Ctrl+Z) ==="
kill -SIGTSTP "$export_pid"   # Same signal as pressing Ctrl+Z interactively
sleep 0.2

echo "Job state after SIGTSTP:"
jobs -l   # Export should now show as 'Stopped'

echo ""
echo "=== Resuming export in the background ==="
bg %2    # %2 refers to job number 2 (the export). bg sends SIGCONT
sleep 0.2
jobs -l  # Should be back to Running

echo ""
# --- Part 2: Making a job survive terminal closure ---
echo "=== Running a job that survives SSH disconnection ==="

# nohup makes the process ignore SIGHUP; output is redirected explicitly here, so nohup.out is not created
nohup sleep 9999 > /tmp/persistent_job.log 2>&1 &
persistent_pid=$!
echo "Persistent job PID: $persistent_pid"

# disown removes it from shell job table — terminal closing won't affect it
disown "$persistent_pid"
echo "Job $persistent_pid disowned — it will survive terminal closure"

# Verify it's no longer in the job table
echo ""
echo "Current jobs (persistent_pid should NOT appear):"
jobs -l

echo ""
echo "But it IS still in the process table:"
ps -p "$persistent_pid" -o pid,stat,cmd

# --- Cleanup ---
kill "$backup_pid" "$export_pid" "$archive_pid" "$persistent_pid" 2>/dev/null
wait 2>/dev/null
echo ""
echo "All jobs cleaned up."
Output
=== Starting three background jobs ===
Backup job started — PID: 61201, Job: 61201
Export job started — PID: 61202
Archive job started — PID: 61203
=== All current jobs in this shell ===
[1] 61201 Running sleep 120
[2] 61202 Running sleep 240
[3] 61203 Running sleep 360
=== Suspending the export job (simulating Ctrl+Z) ===
Job state after SIGTSTP:
[1] 61201 Running sleep 120
[2]+ 61202 Stopped sleep 240
[3] 61203 Running sleep 360
=== Resuming export in the background ===
[2] 61202 Running sleep 240
=== Running a job that survives SSH disconnection ===
Persistent job PID: 61210
Job 61210 disowned — it will survive terminal closure
Current jobs (persistent_pid should NOT appear):
[1] 61201 Running sleep 120
[2] 61202 Running sleep 240
[3] 61203 Running sleep 360
But it IS still in the process table:
PID STAT CMD
61210 S sleep 9999
All jobs cleaned up.
Watch Out: nohup Alone Isn't Enough for Production
nohup keeps the process alive after logout, but when stdout is a terminal it appends the output to nohup.out, which grows without bound and can eventually fill your disk. For production daemons, always redirect output explicitly: nohup ./server >> /var/log/myapp/server.log 2>&1 &. Better yet, use systemd to manage services: it handles logging, restarts, and resource limits properly.
Production Insight
A long database migration killed by SSH timeout corrupted a production table because the migration was part-way through.
Always wrap critical jobs in tmux or screen, or use nohup + disown.
Rule: if a job takes longer than your SSH timeout, it must be detached from the terminal before you start.
Key Takeaway
Ctrl+Z pauses, fg/bg resume, jobs list them.
Background jobs die with the terminal — use nohup or disown.
For production, use systemd or tmux, not shell job control.

Debugging Running Processes with strace and ltrace

When a process is misbehaving — high CPU, hanging, slow responses — the first question is 'what is it actually doing right now?' strace gives you the answer by intercepting system calls: every open, read, write, connect, poll that the process makes. ltrace does the same for library calls (e.g., malloc, free, gettimeofday).

Common debugging scenarios: A web server that's slow: strace -p <PID> -e trace=network reveals it's stuck on a connect() to a backend that's not responding. A process consuming 100% CPU: strace -c -p <PID> shows the distribution of syscall counts; if you see millions of gettimeofday() calls, your code is polling in a tight loop (on systems where time lookups are answered through the vDSO these calls may not appear in strace at all). A process that's leaking memory: strace -e trace=brk,mmap,munmap -p <PID> shows each time the process asks the kernel for more memory or hands some back, which is coarser than individual malloc() and free() calls.

strace can also attach to already-running processes, follow child processes (-f), and filter by specific syscalls (-e). Use it sparingly in production because it slows the traced process significantly (syscalls are often 10-100x slower while traced). For a quick check, run strace -p <PID> -c to get a summary, then dive deeper if needed.
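One way to keep an attach short is to put a hard time limit around it. The wrapper below is a sketch: `strace_summary` is a hypothetical helper, and it assumes strace and the coreutils `timeout` command are installed and that you are allowed to trace the target.

```bash
# Sketch: a time-boxed strace summary for an already-running process.
strace_summary() {
  target_pid=$1
  seconds=${2:-5}
  if ! kill -0 "$target_pid" 2>/dev/null; then
    echo "PID $target_pid does not exist or is not yours to signal" >&2
    return 1
  fi
  if ! command -v strace >/dev/null 2>&1; then
    echo "strace is not installed" >&2
    return 2
  fi
  # timeout sends SIGINT after $seconds; strace then detaches and prints
  # the -c summary table, sorted by time spent (-S time)
  timeout -s INT "$seconds" strace -c -S time -p "$target_pid"
}

# Usage (not run here): strace_summary 12345 3
```

Checking the PID first avoids a confusing attach error, and the time limit means a forgotten terminal cannot leave a production process traced indefinitely.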

ltrace is less common but useful when you suspect a library call is the bottleneck — for example, a process that calls gettimeofday millions of times or does excessive memory allocation.

strace_debug_demo.sh (BASH)
#!/usr/bin/env bash
# strace_debug_demo.sh
# Goal: Use strace to diagnose process behavior without modifying the process

# Simulate a problematic process: a tight loop calling gettimeofday()
cat > /tmp/busy_loop.py << 'PYTHON'
import time
import sys
while True:
    time.time()  # asks for the current time; on many modern systems the vDSO answers this without a real syscall
    if time.time() % 1000 < 0.001:
        print("tick")
PYTHON

python3 /tmp/busy_loop.py &
busy_pid=$!
sleep 0.5  # Let it start

echo "=== strace summary for PID $busy_pid (10 seconds) ==="
strace -p "$busy_pid" -c -S time 2>&1 &
strace_pid=$!
sleep 10   # trace for the 10 seconds announced in the heading above
kill "$strace_pid" 2>/dev/null
wait "$strace_pid" 2>/dev/null

echo ""
echo "=== Showing last 10 syscalls for PID $busy_pid ==="
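# Note: with -c this prints a count summary restricted to write(), not a list of the most recent calls
strace -p "$busy_pid" -e trace=write -c -S calls 2>&1 &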
strace_pid2=$!
sleep 2
kill "$strace_pid2" 2>/dev/null
wait "$strace_pid2" 2>/dev/null

echo ""
echo "=== Killing the test process ==="
kill "$busy_pid" 2>/dev/null
wait "$busy_pid" 2>/dev/null
echo "Done."
Output
=== strace summary for PID 72341 (10 seconds) ===
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.002345 2 1172 gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00 0.002345 2 1172 total
=== Showing last 10 syscalls for PID 72341 ===
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000012 1 12 write
------ ----------- ----------- --------- --------- ----------------
100.00 0.000012 1 12 total
=== Killing the test process ===
Done.
Pro Tip: Use strace -c First, Then Drill Down
Running strace on a production process can slow it down massively, because every syscall is intercepted. Start with strace -p <PID> -c -S time for a few seconds: it prints a count and timing summary instead of logging every call, which keeps the output cost and the time you stay attached low. If the summary shows something suspicious (e.g., millions of poll() calls), then run a filtered strace for that specific syscall.
Production Insight
A Node.js server was consuming 130% CPU. strace -c revealed 95% of syscalls were gettimeofday(). The developer had used new Date() inside a hot loop. Replacing it with a cached timestamp fixed it.
strace is your best friend for CPU and hang investigations, but use it carefully in prod.
Rule: start with the summary flag -c and a short attach window to limit the performance impact.
Key Takeaway
strace shows what syscalls a process is making in real time.
Use -c for a low-impact summary before drilling into details.
ltrace shows library calls — less common but useful for memory and timing bugs.
Production incident · POST-MORTEM · severity: high

The NFS Hang That Froze an Entire Microservice Fleet

Symptom
All pods on one node stopped responding. kubectl exec timed out. ps aux showed every node process in STAT=D. kill -9 had zero effect.
Assumption
The previous on-call engineer assumed it was a memory leak and tried to increase ulimit. That just delayed the inevitable.
Root cause
An NFS volume used for log storage was mounted with hard,bg options. The NFS server became unreachable. All processes writing to that mount got stuck in uninterruptible sleep waiting for NFS to respond.
Fix
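We had to find the hung NFS mount (mount | grep nfs), unmount it forcibly with umount -f /mnt/nfs after identifying and killing the NFS client daemon, then reboot the affected node. After reboot, we reconfigured all mounts to use soft,intr and added proper timeout settings (timeo and retrans). Note that intr is accepted but ignored on kernels newer than 2.6.25, so soft and the timeout values are what actually change the behaviour.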
Key lesson
  • D state processes are unkillable — you must fix the underlying I/O (disk, NFS, kernel issue).
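  • Prefer soft NFS mounts with carefully tuned timeo and retrans for non-critical data such as logs. A soft mount returns an I/O error to the application instead of hanging, so keep hard mounts where a failed write must never go unnoticed.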
  • Monitor D state process count in your alerting — a sudden spike means I/O trouble, not application trouble.
  • Keep iostat -x 1 and dmesg output in your debug playbook.
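A hung mount can also be detected before processes pile up, by probing each NFS mount under a hard time limit. The sketch below is one possible shape for such a check, not the procedure used in the incident: `probe_nfs_mounts` is a made-up helper, it reads a table in /proc/mounts format, and it assumes a blocked `stat` can still be ended by SIGKILL (if it cannot, the probe itself will hang, which is a finding in its own right).

```bash
# Sketch: probe each NFS mount with a time limit so the check cannot wait forever.
probe_nfs_mounts() {
  mounts_file=${1:-/proc/mounts}
  limit=${2:-3}
  # /proc/mounts fields: device mountpoint fstype options dump pass
  awk '$3 ~ /^nfs/ { print $2 }' "$mounts_file" | while read -r mount_point; do
    if timeout -s KILL "$limit" stat -t "$mount_point" >/dev/null 2>&1; then
      echo "OK   $mount_point"
    else
      echo "FAIL $mount_point"   # no answer within the limit, or stat failed
    fi
  done
}

probe_nfs_mounts   # prints nothing on a host without NFS mounts
```

Run from cron or a monitoring agent, a FAIL line gives an earlier warning than waiting for the D-state count to spike.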
Production debug guide: match your symptom to the right reaction
Symptom · 01
Process consuming 100% CPU for > 5 minutes
Fix
Run top -p <PID> to see live CPU. Check process name and command line. If expected behavior (e.g., video transcoding), let it run. If unexpected, run strace -p <PID> to see what syscalls it's making, then kill -SIGTERM <PID>.
Symptom · 02
Process won't die after kill -SIGTERM
Fix
Wait 10 seconds. If still alive, use kill -SIGKILL. If even SIGKILL fails, check STAT column for D state — then investigate I/O (dmesg, iostat, NFS mounts).
Symptom · 03
Many zombie processes accumulating
Fix
Identify the parent PID of zombies using ps -eo pid,ppid,stat,cmd. The parent is the process whose PID appears in the PPID column of the rows in state Z. Restart the parent service; once it exits, its zombies are adopted by PID 1 and reaped.
Symptom · 04
Background job killed after SSH disconnect
Fix
You forgot nohup or tmux. Reconnect and confirm with ps that the job is gone; a SIGHUP is not normally logged, so /var/log/messages will rarely show it. Next time use nohup command & or tmux new-session before starting the job.
Symptom · 05
Out of memory killer (OOM) killed my process
Fix
Check dmesg, /var/log/syslog or journalctl -k for OOM killer messages. The process with the highest oom_score is the most likely victim. Analyze memory usage with ps aux --sort=-%mem. Lower /proc/<PID>/oom_score_adj (the replacement for the deprecated oom_adj) to protect critical processes.
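Both values can be read straight from /proc. The snippet below is a small sketch for Linux that inspects the current shell; substitute the PID you care about.

```bash
# Sketch: read the OOM killer's view of a process.
# oom_score is the kernel's current badness score; oom_score_adj is the
# tunable bias, from -1000 (never kill) to 1000 (kill first).
pid_to_check=$$
oom_score=$(cat "/proc/$pid_to_check/oom_score")
oom_score_adj=$(cat "/proc/$pid_to_check/oom_score_adj")
echo "PID $pid_to_check: oom_score=$oom_score oom_score_adj=$oom_score_adj"

# Lowering the value needs root; shown as a comment rather than executed:
#   echo -500 | sudo tee /proc/<PID>/oom_score_adj
```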
Quick Process Debug Cheat Sheet: commands to diagnose and resolve process issues without context switching
High CPU, unknown process
Immediate action
Identify top CPU consumer
Commands
ps aux --sort=-%cpu | head -5
top -b -n 1 -o +%CPU | head -10
Fix now
If rogue, kill -SIGTERM <PID>; if legitimate, investigate further
Process stuck in D state
Immediate action
Check if I/O subsystem is healthy
Commands
ps aux | awk '$8 ~ /^D/ { print }'
iostat -x 1 5
Fix now
Identify hung mount with mount | grep nfs; fix network/storage; cannot kill process until I/O resolves
Zombie processes
Immediate action
Find parent
Commands
ps aux | awk '$8 ~ /^Z/ { print $2 }'
ps -o pid,ppid,cmd -p "$(ps -eo ppid=,stat= | awk '$2 ~ /^Z/ { print $1 }' | sort -u | paste -sd, -)"
Fix now
Restart or fix the parent process to reap children
Job disappeared after terminal closed
Immediate action
Check if it survived via nohup
Commands
ps aux | grep <job_name>
Check nohup.out or specified log file
Fix now
If gone, restart with nohup command & disown or better: tmux new-session -s jobname
Process running slowly, may be blocked on I/O
Immediate action
Check what syscalls it's making
Commands
strace -p <PID> -c -S time 2>&1 | head -10
lsof -p <PID> | grep -E 'REG|CHR'
Fix now
Identify slow filesystem or network resource; if disk, check iostat; if network, check netstat
| Scenario | Best Tool | Why Not the Alternative |
| --- | --- | --- |
| Graceful app shutdown | kill -SIGTERM <pid> | kill -9 skips cleanup: open files, DB connections, temp files all left dirty |
| Reload nginx config live | kill -SIGHUP $(cat /run/nginx.pid) | Restarting drops active connections; SIGHUP reloads with zero downtime |
| Find what's eating CPU | top or htop (live) | ps aux is a snapshot; you miss transient spikes that top catches in real time |
| Debug process relationships | pstree -p <pid> | ps aux shows all processes but not the parent-child tree structure |
| Run job after SSH logout | tmux or screen + disown | nohup alone still ties output to a growing file; tmux lets you reconnect interactively |
| Process won't respond to SIGTERM | Wait, then kill -9 | Jumping straight to -9 is the mistake; always give SIGTERM 5-10 seconds first |
| Monitor I/O-blocked processes | iostat -x 1 + ps aux check D-state | top alone won't tell you WHY a process is stuck in D state; iostat shows the disk bottleneck |
| Trace syscalls of a misbehaving process | strace -p <pid> -c -S time | ltrace shows library calls, not kernel interactions; strace gives you the low-level truth |
| Find which files a process has open | lsof -p <pid> | ps only shows command and state, not open file descriptors |

Key takeaways

1. Every process has a PID and a PPID: the parent-child tree (visible with pstree -p) is the fastest way to understand what spawned a problem process and what will be affected if you kill it.
2. The STAT column in ps output is more diagnostic than the process name: D state means I/O blocked and unkillable, Z means zombie from a bad parent, and R pinned for minutes means runaway CPU.
3. SIGTERM (15) is a polite request; SIGKILL (9) is a forced execution by the kernel: always try SIGTERM first and escalate only after a timeout, or you risk corrupt state and broken locks.
4. Background jobs in a terminal die when the terminal closes (SIGHUP): use nohup + disown for quick survival, and tmux or systemd for anything that matters in production.
5. strace -c gives a syscall summary with far less output overhead than a full trace: use it first to identify what the process is actually spending time on.
6. D state processes are unkillable: diagnose the I/O subsystem, not the process itself.

Common mistakes to avoid

Using kill -9 as the first response to a hung process

Symptom
The process dies but leaves behind corrupt temp files, unreleased locks (e.g., a stale .pid file), or open database transactions that need manual rollback
Fix
Always send SIGTERM first and wait 5-10 seconds. A one-line chain does it: kill -SIGTERM "$pid" && sleep 5 && kill -0 "$pid" 2>/dev/null && kill -SIGKILL "$pid". Reserve SIGKILL for processes that genuinely ignore SIGTERM.
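For scripts, the same escalation reads better as a small function with a real loop. The version below is a sketch: `graceful_kill` and its return codes (0 clean exit, 1 escalated to SIGKILL, 2 nothing to signal) are conventions invented for this example.

```bash
# Sketch: SIGTERM first, SIGKILL only after a grace period.
graceful_kill() {
  gk_pid=$1
  gk_grace=${2:-5}          # seconds to wait before escalating
  gk_waited=0
  kill -TERM "$gk_pid" 2>/dev/null || return 2   # no such process, or not ours
  while kill -0 "$gk_pid" 2>/dev/null; do
    if [ "$gk_waited" -ge "$gk_grace" ]; then
      kill -KILL "$gk_pid" 2>/dev/null
      return 1              # had to force it
    fi
    sleep 1
    gk_waited=$((gk_waited + 1))
  done
  return 0                  # exited on its own after SIGTERM
}

# Usage (not run here): graceful_kill "$pid" 10 || echo "did not exit cleanly"
```

Returning a distinct code for the forced case lets a deploy script log which services are not honouring SIGTERM.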

Running a long job directly in an SSH session without nohup or tmux

Symptom
You kick off a 2-hour database migration, your laptop closes the lid, the SSH connection drops, SIGHUP kills the job, and you return to a half-migrated database
Fix
Always wrap critical long-running commands in tmux new-session or prefix with nohup ... & disown. Make it a habit before you type any command that'll take more than a minute.

Assuming a process in D state can be killed

Symptom
kill -9 <pid> appears to do nothing, the process is still visible in ps, and your monitoring alerts keep firing
Fix
D state means the process is waiting inside a kernel I/O operation and is completely unkillable until that operation resolves. Run dmesg | tail -20 to check for I/O errors, run iostat -x 1 5 to identify a saturated disk, and check for hung NFS mounts with mount | grep nfs. Fixing the underlying I/O issue will unblock the process naturally.

Ignoring zombie processes until PID exhaustion

Symptom
Services fail to start with 'Cannot fork' errors because all PIDs are consumed by zombie entries
Fix
Set up monitoring for zombie count (e.g., alert if > 10). The root cause is always the parent process failing to call wait(). Restart the parent service to reap zombies, then fix the bug in the parent that skips wait().
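A check like that can be sketched with nothing but /proc. `zombie_parents` is a hypothetical helper, the threshold of 10 simply mirrors the example figure in the fix text, and the output format (count, then parent PID) is an arbitrary choice.

```bash
# Sketch: count zombies and group them by the parent that failed to reap them.
zombie_parents() {
  for status_file in /proc/[0-9]*/status; do
    awk '/^State:/ { state = $2 } /^PPid:/ { ppid = $2 }
         END { if (state == "Z") print ppid }' "$status_file" 2>/dev/null
  done | sort -n | uniq -c      # output: <zombie count> <parent PID>
}

zombie_total=$(zombie_parents | awk '{ total += $1 } END { print total + 0 }')
if [ "$zombie_total" -gt 10 ]; then
  echo "ALERT: $zombie_total zombies; parents (count, PID):"
  zombie_parents
else
  echo "zombie count OK: $zombie_total"
fi
```

Grouping by parent matters more than the raw count, because the parent PID is the thing you will restart or fix.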

Using nohup without redirecting output

Symptom
nohup.out grows to gigabytes, fills root filesystem, and crashes the server
Fix
Always redirect explicitly: nohup ./command > /var/log/myapp/output.log 2>&1 &. Or better, use systemd to manage the process and handle logs properly.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (SENIOR): What's the difference between a zombie process and an orphan process? Ho...
Q02 (SENIOR): If kill -9 isn't working on a process, what's most likely happening and ...
Q03 (SENIOR): An nginx worker is consuming 100% CPU. Walk me through exactly how you'd...
Q04 (SENIOR): How does the shell handle Ctrl+C vs Ctrl+Z? What signals are sent and ho...

Q01 of 04 (SENIOR)

What's the difference between a zombie process and an orphan process? How do you handle each in production?

ANSWER
A zombie process has finished execution but still has an entry in the process table because its parent hasn't called wait(). An orphan process is still running but its parent has died, so it gets adopted by PID 1 (init/systemd). Zombies are harmless in small numbers but indicate a bug in the parent (failing to reap). Orphans are normal and continue running. To fix zombies, find the parent PID and restart the parent service. Orphans don't need fixing because the system handles them.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What is the difference between kill -9 and kill -15 in Linux?
02. Why can't I kill a process even with kill -9?
03. What is a zombie process and should I be worried about it?
04. What does `strace` do and when should I use it in production?
05. How do I run a command that survives SSH disconnection without nohup?
06. What is the difference between `ps aux` and `ps -ef`?