Senior 20 min · March 06, 2026
File Systems in OS

ext3 Superblock Corruption — Payment Gateway Outage

Disk partition won't mount: bad superblock — ext3 superblock corruption from power loss caused a payment gateway outage.

Naren Founder & Principal Engineer

20+ years shipping production systems from the metal up. Drawn from code that ran under real load.

June 10, 2026
last updated
Quick Answer
  • A file system organises raw storage into files and directories using metadata like inodes and allocation tables
  • Core components: superblock (globals), inode table (per-file metadata), data blocks (content), directory entries (name-to-inode maps)
  • FAT32 uses linked list cluster chains; ext4 uses extents and journaling, reducing seeks up to 80%
  • Production gotcha: abrupt power loss during a metadata write can orphan inodes — journaling prevents this on ext4, but not on FAT32
  • Biggest mistake: assuming file delete frees data — it only marks blocks as free; data remains recoverable until overwritten
  • Forensic reality: Tools like extundelete can recover deleted files minutes after deletion if no new writes occurred
Definition
What is File Systems in OS?

A file system is the operating system's structured method for organizing, storing, and retrieving data on a storage device. It's not just a directory tree you see in a file manager — it's a low-level data structure that maps logical file names to physical blocks on disk, tracks which blocks are free, and maintains metadata like timestamps, permissions, and ownership.

When you write a file, the file system decides where on the platter or NAND chip those bytes land, and it keeps a ledger (the superblock, inodes, and journal) so it can find them again. Without a file system, a disk is just a raw array of sectors — useful only for niche applications like database raw devices or swap partitions.

The superblock is the file system's master control block — it stores critical parameters like block size, total block count, free block count, and pointers to the root inode and the journal. In ext3, the superblock is replicated across the disk (primary at offset 1024 bytes, backups at block group boundaries), but if the primary superblock gets corrupted — from a power failure, a kernel bug, or a failing disk sector — the file system can become unmountable.

This is exactly what caused the payment gateway outage: a corrupted superblock made the entire file system unreadable, taking down the application that depended on it.

In production, file system corruption is a silent killer. You might see "I/O error" in logs, a process hanging on a write, or a server refusing to boot. Debugging starts with dmesg for kernel-level errors, then fsck in read-only mode to assess damage.

For ext3, you can point fsck at a backup superblock with -b. The first backup sits at block 8193 on file systems with 1 KB blocks, 16384 with 2 KB blocks, and 32768 with 4 KB blocks. Modern systems mitigate corruption with journaling (ext3/4, XFS) or copy-on-write designs (btrfs). A journal is a circular log that records pending metadata changes before they're applied, so a crash only loses in-flight transactions, not the entire file system. ext4 improves on ext3 by using checksums in the journal and faster recovery, but no file system is immune to hardware faults or driver bugs.

For critical workloads, you pair journaling with hardware RAID, regular e2fsck scans, and monitoring tools like smartctl to catch disk errors before they corrupt the superblock.

Plain-English First

Imagine your OS is a giant library. A file system is the librarian's cataloguing system — it decides which shelf each book goes on, writes a card in the index so anyone can find it later, and tracks which shelves are empty. Without the librarian, books would be dumped on the floor in a pile and nobody could find anything. Your hard drive is that same pile of storage space, and the file system is what turns chaos into an organised, searchable collection.

Every time you hit Ctrl+S, drag a photo into a folder, or install an app, you're trusting a file system to keep that data safe and findable. File systems are one of those invisible layers of the OS that almost nobody thinks about — until something goes wrong and years of photos vanish. Understanding how they work isn't just academic; it's the difference between a developer who debugs a corrupted disk by instinct and one who panics and Googles for three hours.

The core problem a file system solves is deceptively simple: a hard drive or SSD is just a flat sequence of bytes — millions of them, with no inherent meaning. The file system imposes structure on that flat sequence. It records where each file starts and ends, what it's called, who owns it, when it was last modified, and which blocks of storage are free for new data. Without this layer, the OS couldn't tell the difference between a Python script and a JPEG.

By the end of this article you'll understand the internal structure of a file system (directories, inodes, blocks, and allocation tables), why different file systems like FAT32, NTFS, and ext4 exist and when each one is the right choice, what actually happens on disk when you create or delete a file, and the most common mistakes engineers make when reasoning about file systems under load or across platforms. You'll also walk away with concrete talking points for any OS or systems design interview.

Here's the thing: when your filesystem goes down, every other service goes down with it. The debug commands in this article are the same ones I've used to recover production systems at 2 AM. Learn them once, and you'll never panic again. You'll also pick up the recovery procedures that turn a potential hours-long outage into a ten-minute fix — because I've lived that outage, and the first time was on a production database at 2 AM on a Saturday.

What a File System Actually Is — And Why It Corrupted

A file system is the kernel-level data structure and code that controls how data is stored, named, and retrieved on a block device. It translates file paths and I/O operations into block reads and writes on disk. The ext3 file system uses a journal (a circular log) to record metadata changes before applying them, enabling recovery after crashes. Without the journal, a power loss during a block group descriptor update can leave the superblock — the master metadata block — pointing to invalid inode tables, rendering the entire volume unmountable.

Ext3 organizes disk space into block groups, each with its own block bitmap, inode bitmap, and inode table; group 0 holds the primary superblock and selected other groups hold backup copies. The primary superblock at offset 1024 bytes contains critical fields: number of inodes, block count, first data block, and the state (clean, errors, or mounted). When a write to the superblock is interrupted, the superblock can fail the kernel's sanity checks or the state field can be left at 'errors', forcing fsck to scan every block group. This is O(n) in the number of blocks: on a 2 TB volume with 4 KB blocks, that's about 536 million blocks to account for.
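The block count quoted above is plain arithmetic. A quick sketch in shell, assuming a 2 TiB volume, 4 KB blocks, and the mke2fs default of block_size × 8 blocks per group (the variable names are illustrative):

```shell
# Assumptions: 2 TiB volume, 4 KiB blocks, default blocks-per-group.
volume_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))
block_size=4096

# One block bitmap (a single block) tracks every block in a group,
# so a group spans block_size * 8 blocks.
blocks_per_group=$(( block_size * 8 ))

total_blocks=$(( volume_bytes / block_size ))
block_groups=$(( total_blocks / blocks_per_group ))

echo "blocks: $total_blocks"        # blocks: 536870912
echo "block groups: $block_groups"  # block groups: 16384
```

Every one of those blocks has to be accounted for in a full check, which is why a forced fsck on a multi-terabyte volume is measured in hours rather than seconds.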

Use ext3 when you need journaling for crash recovery on spinning disks or embedded systems with limited memory. It is not suitable for flash storage (no TRIM support) or workloads requiring sub-second fsck times. In production, a single corrupted superblock can take down a payment gateway for hours while fsck runs — or worse, if the backup superblock is also stale, data is lost.

Backup Superblocks Are Not Always Fresh
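With the sparse_super feature (the mke2fs default), ext3 stores redundant superblock copies in block group 1 and in groups whose number is a power of 3, 5, or 7. The kernel does not rewrite them during normal operation; userspace tools such as e2fsck refresh them, so a backup can be stale by the time you need it.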
Production Insight
A payment gateway running ext3 on a 2 TB RAID array lost power during a kernel panic. The primary superblock failed validation, and the backup superblock dated from the last fsck three months prior. Its stale block group descriptors caused fsck to mark 40% of inodes as unused, deleting transaction logs. The rule: always run 'e2fsck -b <backup_superblock>' with a known-good backup superblock offset, and monitor the errors_count attribute in sysfs for rising error counts (on current kernels ext3 is handled by the ext4 driver, so the path is /sys/fs/ext4/<device>/errors_count).
Key Takeaway
The superblock is the single point of truth for the entire volume — corrupt it and the filesystem is unmountable.
Journaling only protects metadata, not data writes — a crash can still lose the last few seconds of file content.
Always store a fresh copy of the superblock after every clean unmount, taken from the partition that holds the filesystem rather than the whole disk (dd if=/dev/sda1 of=superblock.bak bs=4096 count=1).
[Figure: file system structure, journaling, and failure modes in production. Panels: Superblock Corruption (metadata damage causes mount failure); Block & Inode Allocation (contiguous vs linked allocation strategies); Journaling (ext3/ext4) (metadata consistency via write-ahead log); Production Failure (file system breaks under load or hardware fault); SSD vs HDD Behavior (different wear and corruption patterns); Mount Options & Tuning (performance tuning and POSIX ACLs). Callout: superblock backup copies are often ignored; always use mke2fs -n to locate backups before recovery.]

Anatomy of a File System — Blocks, Inodes and Directories

Every file system organizes storage into fixed-size blocks (typically 4 KB). The crucial metadata structures are the superblock, inode table, and directory entries.

  • Superblock: Stores global info like filesystem type, block size, number of blocks, number of free inodes. If the superblock corrupts, the entire filesystem is unreadable.
  • Inode (index node): Each file and directory has one inode. It holds metadata (size, permissions, timestamps) and pointers to the data blocks. Inodes are stored in a reserved area of the partition.
  • Directory: A special file whose data block is a list of (name, inode number) pairs. The '.' and '..' entries are stored here.

The inode does not store the file name. The name lives only in the directory entry. This means a file can have multiple names (hard links) — each pointing to the same inode. Moving a file within the same filesystem simply changes the directory entry, not the inode.
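The name-versus-inode split is easy to observe from a shell. The sketch below works in a throwaway temporary directory (the file names are made up for illustration) and uses only ls, ln, mv, and rm:

```shell
workdir=$(mktemp -d)
echo "hello" > "$workdir/original.txt"

# A hard link is a second directory entry for the same inode.
ln "$workdir/original.txt" "$workdir/alias.txt"
ino_original=$(ls -i "$workdir/original.txt" | awk '{print $1}')
ino_alias=$(ls -i "$workdir/alias.txt" | awk '{print $1}')
link_count=$(ls -l "$workdir/original.txt" | awk '{print $2}')

# Renaming within the same filesystem rewrites the directory entry only.
mv "$workdir/original.txt" "$workdir/renamed.txt"
ino_renamed=$(ls -i "$workdir/renamed.txt" | awk '{print $1}')

# Removing one name leaves the data reachable through the other.
rm "$workdir/renamed.txt"
content_after_unlink=$(cat "$workdir/alias.txt")

echo "same inode: $ino_original / $ino_alias / $ino_renamed, links: $link_count"
rm -rf "$workdir"
```

The inode number is identical for all three names, and the content survives until the last name (and the last open handle) is gone.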

In ext2 and ext3, the inode contains 12 direct block pointers, then single, double, and triple indirect pointers. These 15 four-byte pointers fill the inode's 60-byte i_block field. This design lets small files be mapped from the inode alone, while large files use progressively deeper indirection: a file that fits within the 12 direct pointers needs no indirect block reads, only the inode read followed by the data block reads.
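To see how far that pointer scheme reaches, here is a rough calculation, assuming 4 KB blocks and 4-byte block pointers (so 1,024 pointers per indirect block). It ignores the other limits a real file system places on file size.

```shell
block_size=4096
ptr_size=4
ptrs_per_block=$(( block_size / ptr_size ))   # 1024

direct_blocks=12
single_blocks=$ptrs_per_block
double_blocks=$(( ptrs_per_block * ptrs_per_block ))
triple_blocks=$(( double_blocks * ptrs_per_block ))

direct_bytes=$(( direct_blocks * block_size ))
max_blocks=$(( direct_blocks + single_blocks + double_blocks + triple_blocks ))
max_bytes=$(( max_blocks * block_size ))

echo "direct pointers cover: $direct_bytes bytes"   # 49152 (48 KiB)
echo "addressable blocks:    $max_blocks"           # 1074791436
echo "addressable bytes:     $max_bytes"            # 4402345721856 (about 4 TiB)
```

Anything up to 48 KiB is mapped by the direct pointers alone, which is why small-file access is cheap under this layout.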

Here's a production nuance: if you have millions of small files (think Docker overlay layers or mail spools), you'll exhaust inodes long before the disk fills. I've seen 'No space left on device' bring down a mail server while 'df -h' showed 40% free. Always monitor 'df -i'.

Another hidden detail: the primary superblock isn't the only copy. ext4 keeps backup superblocks at the start of selected block groups (the first backup is at block 8193 with 1 KB blocks, or block 32768 with 4 KB blocks). When the primary superblock corrupts, you can recover using a backup. But many engineers don't know where their backups are until they need them. That's the point of the production incident earlier: know your backup block numbers before a crash.
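If you want to sanity-check the numbers a tool reports, the backup locations can be derived by hand. The function below is a sketch, not a replacement for mke2fs -n or dumpe2fs: it assumes the sparse_super feature (backups in group 1 and in groups that are powers of 3, 5, or 7) and the default blocks-per-group of block_size × 8. The function name and arguments are illustrative.

```shell
# backup_superblocks <block_size> <block_count>
backup_superblocks() {
  block_size=$1
  block_count=$2
  per_group=$(( block_size * 8 ))
  group_count=$(( (block_count + per_group - 1) / per_group ))
  # With 1 KiB blocks the filesystem starts at block 1, otherwise at block 0.
  if [ "$block_size" -eq 1024 ]; then first=1; else first=0; fi

  {
    echo 1
    for base in 3 5 7; do
      g=$base
      while [ "$g" -lt "$group_count" ]; do
        echo "$g"
        g=$(( g * base ))
      done
    done
  } | sort -n | while read -r group; do
    if [ "$group" -lt "$group_count" ]; then
      echo $(( first + group * per_group ))
    fi
  done
}

# A 10 GiB filesystem with 4 KiB blocks (2621440 blocks, 80 groups):
backup_superblocks 4096 2621440 | tr '\n' ' '; echo
# 32768 98304 163840 229376 294912 819200 884736 1605632
```

Write the resulting list down somewhere that is not on the same disk; it is only useful if you can read it when the filesystem will not mount.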

What about extent trees? In ext4, an extent is a contiguous range of blocks. The inode stores up to 4 extents inline; for files with more than 4 extents, a tree of extent nodes is used. This reduces metadata overhead dramatically — a 16 MB file stored in one extent requires only one entry in the inode, not thousands of individual block pointers. This is why ext4 handles large files much better than ext3 without extent support.

One more internal detail: the directory structure itself can be a hash tree (htree) in ext4, allowing fast lookups even in directories with millions of entries. Without htree, a linear scan of directory entries would be O(n) per lookup. ext4's htree is a B-tree variant that gives O(log n) lookups. This is why re-creating filesystems with 'dir_index' feature matters for mail servers and image repositories.

inspect_inodes.sh (SHELL)
# View the superblock summary
sudo dumpe2fs -h /dev/sda1 | head -20

# List inode attributes of a specific file
stat /etc/hostname

# Find which inode a file uses
ls -i /etc/hostname

# With debugfs, inspect an inode directly. debugfs opens the device read-only
# by default; the angle brackets are debugfs syntax for 'inode number'.
sudo debugfs -R "stat <131073>" /dev/sda1
Output
Filesystem volume name: <none>
Last mounted on: /
Filesystem magic number: 0xEF53
Inode count: 655360
Block count: 2621440
Block size: 4096
Inodes per group: 8192
Inode size: 256
---
File: /etc/hostname
Size: 15 Blocks: 8 IO Block: 4096 regular file
Device: 8,1 Inode: 131073 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Mental Model: The File System as a Dictionary
  • Directory entries: name -> inode number (e.g., 'hostname' -> 131073)
  • Inode table: inode number -> metadata + block pointers (e.g., inode 131073 points to blocks 100-102)
  • Data blocks: the actual bytes of the file
  • This separation means you can have multiple names (hard links) pointing to the same inode — deleting one name just removes the directory entry, not the inode
  • The superblock is the 'globals' dict — without it, you can't parse anything else
Production Insight
A full inode table is a silent killer. New files fail with 'No space left on device' even when 'df -h' shows free blocks.
Always monitor 'df -i' alongside 'df -h' — inode exhaustion brings down services without warning.
Rule: for many small files, use XFS (dynamic inodes) or format ext4 with 'mkfs.ext4 -i 4096'.
Also: directory hashing (dir_index) is on by default in modern ext4, but check with 'dumpe2fs -h /dev/sdX | grep dir_index'.
Key Takeaway
Inodes store metadata but not names. Directories store names but not data.
A file is just a number (inode) until a directory entry gives it a name.
Monitor inode exhaustion — it's invisible in normal 'df' output.
Inode Sizing Decision Tree
If: Expected many small files (< 16 KB each)
Use: mkfs.ext4 -i 4096 to increase inode count. Or use XFS, which allocates inodes dynamically.
If: Expected mostly large files (> 1 MB each)
Use: mkfs.ext4 -i 65536 or larger to reduce inode count and save space.
If: You need to change inode count on an existing filesystem
Use: You can't. You must back up, reformat with the proper -i option, and restore. Plan ahead.
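The -i value is a bytes-per-inode ratio, so the resulting inode count can be estimated before formatting. A rough sketch, assuming a hypothetical 10 GiB partition and a 16384-byte default ratio (a common mke2fs.conf setting; check your own). Real mke2fs rounds per block group, so treat these as approximations.

```shell
# Rough estimate: mke2fs creates about one inode per bytes_per_inode bytes.
estimate_inodes() {
  fs_bytes=$1
  bytes_per_inode=$2
  echo $(( fs_bytes / bytes_per_inode ))
}

fs_bytes=$(( 10 * 1024 * 1024 * 1024 ))              # hypothetical 10 GiB partition
small_files=$(estimate_inodes "$fs_bytes" 4096)      # 2621440
default_ratio=$(estimate_inodes "$fs_bytes" 16384)   # 655360
large_files=$(estimate_inodes "$fs_bytes" 65536)     # 163840

echo "-i 4096:  $small_files inodes"
echo "-i 16384: $default_ratio inodes"
echo "-i 65536: $large_files inodes"
```

Compare the estimate against the number of files you expect to store, with headroom, before committing to a format.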

File Allocation Strategies — Contiguous, Linked and Indexed

How does the file system map file offsets to disk blocks? Three classic strategies:

  1. Contiguous allocation: Each file occupies consecutive blocks. Simple and fast for sequential reads (single seek), but suffers from external fragmentation: as files are created and deleted, free space gets scattered. Used by write-once formats such as ISO 9660.
  2. Linked allocation: Each block contains a pointer to the next block. No external fragmentation, but random access is slow: the pointer lives in the block data, so reaching block n means reading every block before it. FAT32 uses a variant where the File Allocation Table (FAT) stores the chain separately, allowing faster random access.
  3. Indexed allocation: The inode contains a list of direct block pointers, plus indirect, double indirect, and triple indirect pointers for large files. This gives O(1) access to any block via a few index reads. ext4 and NTFS use indexed allocation with extent trees (ranges of contiguous blocks) to reduce pointer overhead.

Modern file systems combine these: ext4 uses extents (contiguous runs of blocks) tracked in an indexed structure, giving the best of both worlds.
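For classic indexed allocation, the mapping from a byte offset to a pointer level is a few comparisons. The helper below is a sketch assuming 12 direct pointers and 4-byte block pointers (the ext2/ext3 layout); the function name is illustrative.

```shell
# pointer_level <byte_offset> <block_size>
pointer_level() {
  offset=$1
  block_size=$2
  ptrs=$(( block_size / 4 ))        # 4-byte block pointers per indirect block
  lbn=$(( offset / block_size ))    # logical block number within the file
  if [ "$lbn" -lt 12 ]; then
    echo "direct"
  elif [ "$lbn" -lt $(( 12 + ptrs )) ]; then
    echo "single-indirect"
  elif [ "$lbn" -lt $(( 12 + ptrs + ptrs * ptrs )) ]; then
    echo "double-indirect"
  else
    echo "triple-indirect"
  fi
}

pointer_level 0 4096         # direct
pointer_level 49152 4096     # single-indirect (first byte past 48 KiB)
pointer_level 4243456 4096   # double-indirect
```

Each extra level costs one more metadata block read before the data block itself, which is the overhead extents were designed to avoid.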

Here's the real gotcha: on spinning disks, a heavily fragmented file can kill read throughput. I once debugged a log parser that took 10x longer on an HDD than expected — the log files were fragmented into thousands of 4KB chunks across the platter. 'filefrag /var/log/syslog' showed 2,347 extents for a 1GB file. The fix was to defragment or switch to ext4 which merges extents better.

And here's something most docs skip: the extent tree's depth limits. Each on-disk extent covers at most 32,768 blocks (128 MiB with 4 KB blocks), and a 4 KB extent tree node holds up to 340 entries, so a single leaf node can map roughly 42.5 GiB of data when every extent is at maximum length. Most files never go beyond the inline extents. But if you have database files that are terabytes large with hundreds of extents, the tree grows, and that adds latency to each metadata lookup. XFS handles this more gracefully with B+ trees for extents.
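The capacity of one extent tree node follows from the on-disk sizes. This is a back-of-the-envelope sketch assuming 4 KB blocks, a 12-byte node header, 12-byte entries, and a maximum initialized extent length of 32,768 blocks:

```shell
block_size=4096
header_bytes=12          # extent header at the start of every tree node
entry_bytes=12           # each extent or index entry
max_extent_blocks=32768  # longest initialized extent

entries_per_node=$(( (block_size - header_bytes) / entry_bytes ))
max_extent_mib=$(( max_extent_blocks * block_size / 1024 / 1024 ))
inline_mib=$(( 4 * max_extent_mib ))            # four extents stored in the inode
leaf_mib=$(( entries_per_node * max_extent_mib ))

echo "entries per 4 KiB node: $entries_per_node"   # 340
echo "largest single extent:  $max_extent_mib MiB" # 128 MiB
echo "inline extents cover:   $inline_mib MiB"     # 512 MiB
echo "one full leaf covers:   $leaf_mib MiB"       # 43520 MiB (42.5 GiB)
```

These are best-case figures: a fragmented file uses many short extents, so it reaches the same tree depth with far less data.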

One more production nuance: 'filefrag' reports how many extents a file has using the FIEMAP ioctl. On filesystems that don't support FIEMAP it falls back to the older FIBMAP ioctl, which requires root. Always verify extent counts on HDDs to avoid performance surprises.

Another hidden cost: on FAT32, the FAT itself is a large table that must be cached. For large partitions, the FAT can be tens of MB, and frequent updates (file creation/deletion) cause heavy write traffic. This is why FAT32 is unsuitable for high-write server workloads.

io/thecodeforge/fs/block_chain.c (C)
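/* Simplified simulation of FAT-style linked allocation */
#include <stdio.h>
#include <stdint.h>

#define NUM_BLOCKS 16
#define END_OF_CHAIN 0xFFFF

int main(void) {
    /* fat[i] holds the number of the block that follows block i in its file. */
    uint16_t fat[NUM_BLOCKS] = {0};

    /* One file occupying blocks 2 -> 5 -> 9. */
    fat[2] = 5;
    fat[5] = 9;
    fat[9] = END_OF_CHAIN;

    /* Sequential read: follow the chain until the end marker. */
    printf("File block chain: ");
    uint16_t current = 2;
    while (current != END_OF_CHAIN) {
        printf("%d ", current);
        current = fat[current];
    }
    printf("\n");

    /* Random access: reaching logical block 2 means walking two links. */
    printf("Accessing logical block 2 (third block): ");
    current = 2;
    int i = 0;
    while (current != END_OF_CHAIN && i < 2) {
        current = fat[current];
        i++;
    }
    printf("block %d\n", current);
    return 0;
}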
Output
File block chain: 2 5 9
Accessing logical block 2 (third block): block 9
Fragmentation Trap
Contiguous allocation suffers from external fragmentation. After many file create/delete cycles, free blocks are scattered, and new large files can't be allocated contiguously. Defragmentation tools exist for FAT32/NTFS, and ext4 ships e4defrag, although extents and delayed allocation mean it is rarely needed. On SSDs fragmentation matters far less because there is no seek penalty.
Production Insight
On spinning disks, fragmentation directly impacts read throughput: a 1 MB file split into 256 scattered 4 KB blocks adds roughly 2.5 s of seek overhead at ~10 ms per seek.
Extents in ext4 reduce this by grouping contiguous blocks; keep 10-15% free space so the allocator can find contiguous runs.
On SSDs, fragmentation has little effect on read performance, but free space still matters because the controller needs it for garbage collection.
FAT32's FAT table is a write bottleneck — use exFAT for USB drives with many small file operations.
Key Takeaway
Contiguous allocation is fast but fragments. Linked allocation avoids fragmentation but kills random access.
Indexed allocation (extents) is the gold standard.
On HDDs, fragmentation is a real performance killer — monitor with 'filefrag'.
Allocation Strategy Trade-offs
If: Need simple sequential read performance above all
Use: Contiguous allocation (e.g., ISO 9660 for optical media)
If: Need portability and can tolerate slow random access
Use: Linked allocation with FAT (FAT32, exFAT)
If: Need balanced read/write for general OS use
Use: Indexed allocation with extents (ext4, NTFS, XFS)
If: Extreme large-file workloads (video editing, HPC)
Use: XFS with large block sizes (e.g., 64 KB blocks)

Journaling and Metadata Consistency — Why ext4 Survives Crashes

Before journaling, a power loss during a write could leave the filesystem in an inconsistent state: an inode pointing to blocks that are still marked free, or a directory entry referencing a non-existent inode. Recovery required a full fsck scan that could take hours on large volumes.

Journaling solves this by recording pending metadata operations in a circular log (journal) before applying them to the main filesystem. If a crash occurs, the journal is replayed on next mount — applying completed transactions and discarding partial ones. The filesystem is consistent in seconds.
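The replay rule (apply committed transactions, drop anything without a commit record) can be shown with a toy log. This is a simplified model, not the jbd2 on-disk format: the BEGIN/SET/COMMIT record names and the log contents are invented for illustration.

```shell
journal=$(mktemp)
cat > "$journal" <<'EOF'
BEGIN 1
SET inode_bitmap[12]=1
SET dir_entry[report.txt]=12
COMMIT 1
BEGIN 2
SET inode_bitmap[13]=1
EOF
# Transaction 2 has no COMMIT record: the "crash" happened mid-transaction.

# Replay: buffer each transaction's updates, emit them only on COMMIT.
replayed=$(awk '
  $1 == "BEGIN"  { n = 0; next }
  $1 == "SET"    { pending[n++] = $2; next }
  $1 == "COMMIT" { for (i = 0; i < n; i++) print pending[i]; n = 0 }
' "$journal")
rm -f "$journal"

echo "$replayed"
# inode_bitmap[12]=1
# dir_entry[report.txt]=12
```

Transaction 1 is applied in full and transaction 2 leaves no trace, so the metadata never reflects a half-finished operation.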

ext3 introduced journaling as an optional feature (data=ordered mode journals metadata only; data blocks are written before metadata). ext4 extended it with checksums, faster recovery, and the ability to disable journaling for performance-critical partitions (at your own risk). NTFS uses a similar $LogFile. FAT32 has no journaling, which is the primary reason it's not used for system partitions.

There's a common misconception that journaling protects file data. In data=ordered mode, only metadata is journalled. If you need both metadata and data to be atomic, use data=journal mode. However, that writes every data block twice (once to journal, once to final location), doubling write I/O. For most applications, data=ordered is the right balance: data blocks are written before metadata, so if a crash occurs, the metadata either refers to fully written data or is rolled back.

I'll never forget the time a colleague said 'we don't need journaling, it's just a cache' — then a power outage corrupted the database. The fsck took 6 hours. Never skip journaling on production filesystems.

One more thing: journaling isn't free. The journal itself consumes disk space (typically 128 MB for ext4), and each metadata write adds latency. If you're running a high-throughput log server that can tolerate some loss, you might consider disabling journaling on the log partition. But for any system where data integrity matters — databases, transaction logs, stateful applications — keep it on. The trade-off is real, but the cost of recovery outweighs the performance gain.

Modern ext4 also includes metadata checksums (metadata_csum feature) to detect corruption during reads and journal replay. Always enable this feature — it adds negligible overhead but catches silent corruption from bit flips or kernel bugs.

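Another production insight: journal size matters. If your journal is too small for a burst of metadata operations (e.g., bulk file extraction), writers stall while the journal checkpoints older transactions to free space. Default journal size is usually fine, but for very large filesystems (10TB+) consider a larger journal. On an existing filesystem that means recreating the journal while it is unmounted and clean: remove it with 'tune2fs -O ^has_journal /dev/sdX', then add a new one with 'tune2fs -J size=256 /dev/sdX' (size is in megabytes).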

check_journal_status.sh (SHELL)
# Check if journaling is enabled on an ext3/ext4 volume
tune2fs -l /dev/sda1 | grep -i 'Filesystem features'
# Look for 'has_journal' in the output

# Show current journal size
dumpe2fs -h /dev/sda1 | grep 'Journal'

# Journal replay happens automatically when an unclean filesystem is mounted.
# To force a full fsck at the next boot, set the maximum mount count to 1:
sudo tune2fs -c 1 /dev/sda1
# Then reboot; restore the previous limit afterwards (tune2fs -c -1 disables it).

# Disable journaling (requires unmounted volume):
# sudo tune2fs -O ^has_journal /dev/sda1
Output
Filesystem features: has_journal, ext_attr, resize_inode, dir_index, filetype, needs_recovery, extent, 64bit, flex_bg, metadata_csum
Journal inode: 8
Journal backup: inode blocks
Journal size: 128M
Mental Model: Journaling as a Transaction Log
  • Before modifying the real inode table or bitmaps, write a 'redo' entry to the journal
  • After the journal entry is safely on disk, apply the change to the main filesystem
  • On crash recovery, replay all completed journal entries; partial entries are discarded (they never reached the main area)
  • Result: filesystem is always consistent after a crash — no need for full fsck
  • Trade-off: journal writes add latency and extra disk I/O (about 5-10% write performance hit)
Production Insight
Never use 'data=writeback' for databases — a power failure can leave partially written data pages with committed metadata.
Default 'data=ordered' is safe and fast; for maximum atomicity use 'data=journal' at the cost of doubled write I/O.
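Rule: always verify the journaling mode. 'tune2fs -l' shows the default mount options stored in the superblock, and the kernel log records the active mode at mount time (dmesg | grep -i 'data mode'). Defaults are not guaranteed on all filesystems.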
Also: journal size should be tuned for write-heavy workloads. Default 128MB may be insufficient for large filesystems handling many metadata operations in bursts.
Key Takeaway
Journaling makes crash recovery fast (seconds) and safe.
Without it (FAT32, old ext2), a crash forces a full fsck that can take hours.
Choose 'data=ordered' for general use; never use 'data=writeback' for databases.
Journaling Mode Selection
If: General-purpose server (web, app, file server)
Use: data=ordered (default). Safe, fast, minimal overhead.
If: Database server (MySQL, PostgreSQL, MongoDB)
Use: data=ordered, never data=writeback. Consider data=journal for extra safety of transaction logs.
If: Ephemeral data (scratch space, build caches) where a crash is acceptable
Use: Disable journaling: mkfs.ext4 -O ^has_journal to save I/O. Accept risk.
If: Read-only filesystem (e.g., embedded system rootfs)
Use: Journaling not needed. Format without it to save space and eliminate journal replay time.

Production Reality — When File Systems Break and How to Debug Them

File systems in production fail in predictable ways. The most common scenarios:

  1. Out of inodes: 'No space left on device' even though 'df -h' shows free blocks. Happens with millions of tiny files (e.g., Docker overlay layers, mail spools).
  2. Corrupted superblock: Power loss, bad memory, or disk firmware bugs corrupt the superblock. Without a backup, the entire filesystem is lost.
  3. Orphaned inodes: A file's inode has no directory entry (lost+found). Happens after an unclean shutdown when the directory update didn't make it to disk.
  4. Read-only remount: The kernel detects an inconsistency and remounts the filesystem read-only to prevent further damage. Caused by hardware faults or kernel bugs.
  5. Disk full but can't delete: A deleted file still held open by a process. 'du' doesn't see it but 'df' does — the blocks remain allocated until the file handle closes.

Each of these has a specific debugging pathway, shown in the commands below.
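Scenario 5 is the easiest to reproduce safely. The sketch below uses a temporary directory and a made-up file name; descriptor 3 in the current shell stands in for the long-running process that still holds the file open.

```shell
workdir=$(mktemp -d)
logfile="$workdir/app.log"
echo "still here" > "$logfile"

exec 3< "$logfile"     # a "process" keeps the file open on descriptor 3
rm "$logfile"          # the directory entry is gone...

if [ -e "$logfile" ]; then name_exists=yes; else name_exists=no; fi
read -r recovered <&3  # ...but the inode and its blocks are still readable
exec 3<&-              # closing the last descriptor finally frees the blocks

echo "name exists: $name_exists, data via fd: $recovered"
# name exists: no, data via fd: still here
rm -rf "$workdir"
```

This is why du (which walks names) and df (which counts allocated blocks) disagree until the holding process exits or closes the file.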

Beyond these, large unjournalled filesystems can take hours to fsck. Modern ext4 filesystems with journaling can recover in seconds, but if you have an unjournalled filesystem with billions of inodes, fsck can take days. That's why enterprise storage uses XFS or ZFS with checksums — they provide faster recovery and better resilience.

I once spent a full weekend recovering a 20TB XFS filesystem after a dual-controller failure. The lesson: never trust a single backup of the superblock — keep multiple copies and verify them regularly.

There's another failure mode that's surprisingly common: filesystem metadata corruption due to memory errors (bit flips). ECC RAM catches most of these, but non-ECC systems in cloud VMs are vulnerable. XFS and ZFS use metadata checksums to detect corruption; ext4 gained the metadata_csum feature in Linux 3.5, with userspace support in e2fsprogs 1.43. Always check if your filesystem has checksum support enabled. Without it, a single bit flip can silently corrupt an inode, leading to data loss that only surfaces months later.

One more often overlooked issue: filesystem quotas. If you use ext4 quotas (usrquota/grpquota), a quota limit can cause 'disk full' errors even when both blocks and inodes are free. I've seen a development server grind to a halt because a user's quota was hit — and the error message pointed to inode exhaustion, not quota. Always check 'repquota -a' when debugging mysterious 'no space' errors.

Also worth noting: hardware RAID card failures can present as filesystem errors. A dying controller might corrupt writes transparently. If you see unexplained corruption on multiple filesystems, suspect the RAID controller before the disks.

debug_recovery_commands.sh (SHELL)
# Scenario 1: Check inode usage
df -i /var

# Scenario 2: List backup superblock locations.
# -n is a dry run that formats nothing; it reports where mke2fs WOULD place
# backups, which matches the real layout only if the original mkfs options are used.
sudo mke2fs -n /dev/sda1

# Scenario 3: Find orphaned inodes in lost+found
sudo ls -la /lost+found/
sudo find /lost+found -type f -exec file {} \;    # identify what they are

# Scenario 4: Diagnose read-only remount
journalctl -k | grep -iE 'remount|EXT4-fs error'

# Scenario 5: Find processes holding deleted files
sudo lsof +L1 | grep '(deleted)'
# Then restart the owning process, or truncate the file via /proc/PID/fd/N
Output
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 655360 654321 1039 100% /
Backup superblock at block 32768
-rw-r--r-- 1 root root 12345 Apr 22 10:30 /lost+found/#131073
Proactive Monitoring
Set up alerts on inode usage (>80%) and filesystem remount events. Use 'smartmontools' to monitor disk health — bad sectors are often the root cause of metadata corruption. Run 'fsck -n' during maintenance windows to catch silent errors before they cause downtime.
Production Insight
The most expensive incident I debugged was inode exhaustion on a Docker host — 'df -h' showed free but apps couldn't create files.
Always run 'df -i' in parallel with 'df -h', especially on systems with many small files like containers or mail spools.
Fix: reformat with higher inode count or switch to XFS; also implement cleanup for unused layers.
Also: always verify hardware RAID controller health when filesystem corruption appears mysteriously.
Key Takeaway
'No space left on device' can mean blocks, inodes, or even directory entries are exhausted.
Always verify with both 'df -h' and 'df -i'.
Production readiness means monitoring both dimensions and setting recovery procedures for each.
Recovering from Filesystem Failures
If: Filesystem won't mount due to superblock corruption
Use: A backup superblock. Find it with mke2fs -n, then repair with e2fsck -b block_number, or mount with -o sb=N where N is the backup's location in 1 KB units (block 32768 on a 4 KB-block filesystem is sb=131072).
If: Filesystem goes read-only
Use: Unmount, fsck -fy, then check hardware health (smartctl). Replace disk if bad sectors.
If: df shows full, du doesn't
Use: Find deleted-but-open files with lsof +L1, restart the owning process, and the space is released.
If: Orphaned inodes in lost+found
Use: Check file types with file command, move to appropriate location if important.
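One detail that bites during recovery: mount's sb= option counts in 1 KB units, while e2fsck -b and mke2fs -n speak in filesystem blocks. A small conversion sketch (the function name, /dev/sdX1, and /mnt are placeholders):

```shell
# mount's sb= option counts in 1 KiB units, regardless of the block size.
sb_mount_option() {
  backup_block=$1
  block_size=$2
  echo $(( backup_block * block_size / 1024 ))
}

sb_4k=$(sb_mount_option 32768 4096)   # 131072
sb_1k=$(sb_mount_option 8193 1024)    # 8193

# Print the command rather than run it; mount read-only first when recovering.
echo "mount -o ro,sb=$sb_4k /dev/sdX1 /mnt"
```

Passing the raw block number on a 4 KB-block filesystem points the kernel at the wrong offset, and the mount fails with the same bad superblock error you started with.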

SSD vs HDD: How File Systems Behave on Different Storage Media

Your file system's performance and reliability depend heavily on the underlying storage technology. Hard disk drives (HDDs) and solid-state drives (SSDs) have fundamentally different characteristics that affect file system behaviour.

HDDs: Seek time is the dominant factor (~10 ms per random read). Sequential reads are fast (~200 MB/s). The file system should try to keep related blocks close together (extents, block groups). Fragmentation directly hurts performance because each fragment requires an additional seek.
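Plugging those two figures into a worst-case estimate shows why fragmentation hurts so much on spinning disks. The numbers below are assumptions for illustration: a 1 MB file, either contiguous or split into 256 scattered 4 KB fragments that each need a seek.

```shell
seek_ms=10            # ~10 ms per random seek
throughput_mb_s=200   # ~200 MB/s sequential
file_mb=1
fragments=256         # 1 MB split into 4 KB pieces

transfer_ms=$(( file_mb * 1000 / throughput_mb_s ))      # 5
contiguous_ms=$(( seek_ms + transfer_ms ))               # 15
fragmented_ms=$(( fragments * seek_ms + transfer_ms ))   # 2565
slowdown=$(( fragmented_ms / contiguous_ms ))            # 171

echo "contiguous: ${contiguous_ms} ms, fragmented: ${fragmented_ms} ms (${slowdown}x)"
```

Real drives do better than the worst case thanks to read-ahead and request reordering, but the transfer time is still dwarfed by seeks.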

SSDs: No mechanical seek. Random reads are as fast as sequential (typically 50–100 µs access time). Fragmentation is irrelevant for performance. However, SSDs have write endurance limits and require TRIM to inform the controller which blocks are free. Without TRIM, write performance degrades over time as the SSD must erase blocks before writing (write amplification).

Modern file systems like ext4 support the DISCARD/TRIM operation. You can enable it via mount option discard or run fstrim periodically. Be careful: frequent discards can increase latency on some SSDs. Batch trimming (fstrim -a via cron) is often preferred.

Alignment is critical on drives with 4K physical sectors. If partitions and file system blocks are not aligned to the drive's physical sector (and ideally erase block) boundaries, a single logical write can straddle two physical units and write performance degrades dramatically. Modern partitioning tools align to 1 MiB boundaries by default, but legacy partition tables may be misaligned.

Write amplification: Each SSD write operation may require erasing a larger block (e.g., 512 KB) even for a small 4 KB write. File systems that batch small writes (delayed allocation in ext4) reduce write amplification by grouping writes into larger contiguous chunks.

Here's something most articles miss: on NVMe drives with high queue depth (128+), XFS outperformed ext4 by about 30% in my benchmarks, which I attribute to the parallelism of its allocation group design. I learned this the hard way benchmarking a database migration, so treat the figure as workload-specific and measure your own.

Another critical detail: the interaction between file system journal and SSD wear. Each journal write adds extra I/O, which on an SSD consumes write endurance. If you have a high-write workload on a consumer SSD (low TBW), consider raising the commit interval from the default 5 seconds (commit=30 in mount options) to batch journal commits, or use a separate journal device (external journal) on a more resilient SSD.

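For NVMe specifically, the PCIe lane count matters. A single drive on four PCIe 4.0 lanes tops out around 7 GB/s, and it takes plenty of parallel I/O to saturate those lanes. The su and sw settings describe the geometry of a striped volume, so they matter most when several drives are striped together: XFS with su=128k,sw=4 (a 128 KiB stripe unit across four devices) is a safer bet there than ext4's default settings. On a single drive, benchmark before assuming they help.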

ssd_trim_setup.sh · SHELL
# Check if TRIM/discard is supported on your SSD
lsblk -D /dev/sda
# DISC-GRAN (discard granularity) and DISC-MAX (max discard size) should be non-zero

# Enable discard mount option in /etc/fstab
# /dev/sda1  /  ext4  defaults,discard  0 1

# Or schedule fstrim (preferred for many SSDs)
# Cron alternative: run 'fstrim -a' weekly
# Check the systemd timer that runs fstrim:
systemctl status fstrim.timer

# Check logical/physical sector size (partition starts should be a multiple of 4096 bytes)
sudo fdisk -l /dev/sda | grep 'Sector size'
Output
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 512B 2G 0
Sector size (logical/physical): 512 bytes / 4096 bytes
TRIM Pitfall
Enabling the 'discard' mount option on ext4 causes on-the-fly TRIM commands for every delete. This can cause significant latency spikes on some SSDs. Prefer periodic fstrim via systemd timer or cron, which batches TRIM operations and performs better.
Production Insight
After migrating a database from HDD to SSD, I only saw a 2x speedup because ext4 defaults were tuned for HDDs.
Reformatting with 'mkfs.ext4 -E stride=32,stripe_width=64' and enabling periodic fstrim gave 4x write throughput.
Rule: benchmark your workload on the actual storage — don't trust defaults for SSD or NVMe.
For NVMe, consider XFS with large stripe unit for parallelism.
Key Takeaway
HDDs care about seek time and fragmentation. SSDs care about write endurance and alignment.
TRIM is not optional on modern SSDs — enable it via fstrim.
Always use filesystem settings that match your storage hardware, not the defaults.
Storage Type Decision Tree
If: You're running a database on an SSD
Use: ext4 without the discard option, with fstrim scheduled weekly. Consider XFS for high concurrency.
If: You're running a media server on HDDs
Use: ext4 with 4K blocks, keep 15% free, and monitor fragmentation with filefrag.
If: You're using NVMe with high queue depth
Use: XFS with a large stripe unit (mkfs.xfs -d su=128k,sw=4). NVMe loves parallelism.
If: You need maximum endurance on a consumer SSD
Use: Reduce write frequency: data=writeback (accept the risk), noatime to disable access time updates, and tmpfs for logs.

File System Mount Options and Performance Tuning — What Senior Engineers Change

Default mount options are designed for safety, not performance. In production, you'll almost always want to tune a few key parameters to reduce I/O overhead and match your workload.

Atime updates: With strict atime semantics, every read updates the access time (atime) in the inode, which turns each read into an extra metadata write. relatime (default on modern Linux) updates atime only if it is older than mtime or ctime, or more than 24 hours old, which removes most of the penalty but still causes a write on the first read after a modification. Use noatime if you don't need access time at all (common for databases and web servers). noatime already implies nodiratime, so listing both is harmless but redundant.
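The relatime rule can be written as a three-line predicate. This is a sketch of the decision only, not kernel code; timestamps are plain seconds and the function name is made up for illustration. It includes the additional Linux rule that an atime older than 24 hours is refreshed regardless.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_needs_update(atime, mtime, ctime, now):
    """Decide whether a read should rewrite atime under relatime semantics."""
    if mtime >= atime:              # file content changed since the last recorded access
        return True
    if ctime >= atime:              # inode metadata changed since the last recorded access
        return True
    if now - atime >= DAY_SECONDS:  # recorded access is more than a day old
        return True
    return False


if __name__ == "__main__":
    print(relatime_needs_update(atime=100, mtime=200, ctime=200, now=300))
    print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=400))
```

Repeated reads of an unchanged file hit the final `return False`, which is where relatime saves writes compared with strict atime.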

Commit interval: The journal commits metadata every commit seconds (default 5). A lower value improves crash safety by shrinking the window of lost metadata, but increases write frequency. For write-heavy workloads, raising commit to 30 or 60 seconds cuts timer-driven journal commits by roughly 83% or 92% (one commit per 30 or 60 seconds instead of one per 5); commits forced by fsync() are unaffected. Trade-off: you lose up to 60 seconds of metadata changes in a crash (data is safe if using data=ordered).
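The savings from a longer commit interval are simple arithmetic, shown here as a sketch. It counts only timer-driven commits and assumes the workload itself does not force extra commits through fsync(); the function names are illustrative.

```python
def periodic_commits_per_hour(commit_seconds):
    """Timer-driven journal commits per hour for a given commit= value."""
    return 3600 / commit_seconds


def commit_reduction(old_seconds, new_seconds):
    """Fraction of timer-driven commits removed by raising the interval."""
    return 1 - old_seconds / new_seconds


if __name__ == "__main__":
    for new in (30, 60):
        print(new, periodic_commits_per_hour(new), round(commit_reduction(5, new) * 100, 1))
```

Going from 5 to 30 seconds drops 720 commits per hour to 120, and 60 seconds drops it to 60.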

Write barriers: Ensure that journal and metadata writes reach persistent storage in the correct order. Usually safe to disable on battery-backed RAID controllers (barrier=0), but dangerous on single SSDs or HDDs with volatile write caches, where a power loss can reorder writes. Default is on.

Data mode: Already covered in journaling section: data=ordered for most, data=journal for extreme consistency, data=writeback for performance at risk.

Delayed allocation: Enabled by default in ext4. Groups small writes into larger contiguous chunks before flushing, which reduces fragmentation and write amplification on SSDs. The cost is a larger window of unflushed data: if the system crashes before the flush, recent writes are lost. For most workloads that risk is small relative to the gain, and applications that need durability should call fsync() anyway.

Production tip: On a busy database server, I once cut I/O wait by 30% just by adding noatime,nodiratime,commit=30 to the mount options. The default 5-second commit was causing a journal flush storm on every transaction batch.

Tuning example: For a MySQL data directory on ext4, typical mount options: rw,noatime,nodiratime,data=ordered,commit=30,barrier=1. For an SSD with frequent fstrim, do not use discard option; use periodic fstrim.

Benchmark before and after: Use fio to measure I/O latency and throughput with different options. Documented savings of 10-20% write I/O are common when switching from defaults to tuned options.

One more option often overlooked: 'nodelalloc' to disable delayed allocation. This can be useful for databases that need immediate write ordering (e.g., PostgreSQL's full-page writes). But on most workloads, delayed allocation improves performance significantly — test both.

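Another senior trick: 'noauto_da_alloc' disables the ext4 safety heuristic that forces delayed-allocation blocks to disk when a file is replaced via rename or truncate. Turning it off can remove flush stalls in certain database workloads, but applications that replace files without fsync() can then end up with zero-length files after a crash. Only change it if you understand the exact consequences.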

mount_options_tuning.sh · SHELL
# Check current mount options
mount | grep ' / '

# Remount with noatime and longer commit interval (temporary)
mount -o remount,noatime,commit=30 /dev/sda1 /

# Make permanent: edit /etc/fstab
# Example line for /dev/sda1:
# /dev/sda1 / ext4 defaults,noatime,nodiratime,commit=30,data=ordered,barrier=1 0 1

# Verify new options
mount | grep ' / '

# Run benchmark before/after with fio (install if needed)
# Point --filename at the filesystem under test; /tmp is often tmpfs and would measure RAM
sudo fio --name=test --rw=randwrite --bs=4k --size=1G --runtime=30 --filename=/tmp/fiotest --group_reporting
Output
/dev/sda1 on / type ext4 (rw,relatime,commit=5,data=ordered)
/dev/sda1 on / type ext4 (rw,noatime,commit=30,data=ordered)
Commit Interval Risk
Increasing commit to 60 seconds means you can lose up to 60 seconds of metadata updates in a crash. Data itself is safe (data=ordered writes data blocks before the metadata that references them), but recent file creations, deletions, and renames may not survive. Acceptable for bulk data loads, risky for transactional systems. Longer intervals also mean larger transactions, so check the journal size with dumpe2fs -h before raising the value.
Production Insight
Strict atime adds one metadata write per read; relatime trims that to the first read after a change (or once a day), and noatime removes it entirely, which helps SSD endurance.
Increasing commit from 5 to 30 seconds reduces timer-driven journal commits by 83% (720 per hour down to 120), a dramatic improvement on write-bound workloads.
Write barriers cost little on modern hardware; disable them only when you understand the power loss guarantees of your storage stack.
Test delayed allocation vs nodelalloc for database workloads: the difference can be 20% on write throughput.
Key Takeaway
Default mount options are safe but not optimal.
'noatime' is the single most impactful change — eliminate an extra write on every read.
Balance commit interval against crash safety: 30 seconds is a good starting point for most workloads.
Mount Option Decision Guide
If: General server (web, app, CI), no access time requirements
Use: noatime,nodiratime,commit=30,data=ordered,barrier=1
If: Database server requiring maximum consistency
Use: noatime,nodiratime,commit=5 (or 10),data=ordered,barrier=1. Consider data=journal if you use replication and want to avoid the database's doublewrite overhead.
If: Battery-backed RAID with journal on separate device
Use: noatime,nodiratime,commit=30,data=ordered,barrier=0 (barriers disabled because the cache battery protects in-flight writes). Use an external journal on SSD.
If: Ephemeral/log partition where data loss is acceptable
Use: noatime,nodiratime,commit=60,data=writeback,barrier=0. Max throughput, minimal safety.

File System Security & Permissions: Why POSIX ACLs and Extended Attributes Matter

File systems enforce access control through permissions, capabilities, and extended attributes. On Linux, the standard Unix rwx model gives owner/group/world sets. But production environments need finer control: POSIX Access Control Lists (ACLs) allow specifying permissions for individual users or groups, and extended attributes (xattr) store metadata like file capabilities or SELinux labels.

POSIX ACLs: Set with setfacl and viewed with getfacl. They add a logical ACL entry to the inode's extended attributes. Useful for shared directories where one user needs read and another write. However, ACLs increase metadata size and can slow down directory listing operations.
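How an ACL is evaluated explains why the mask entry matters. The sketch below follows the POSIX.1e evaluation order (owner, named user, groups, other) in plain Python. It is a simplified model for illustration, not the kernel's implementation, and the class and function names are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    owner: str
    group: str
    user_obj: str                                # "user::" entry, e.g. "rwx"
    group_obj: str                               # "group::" entry
    other: str                                   # "other::" entry
    mask: str = "rwx"                            # caps named users and all group entries
    users: dict = field(default_factory=dict)    # named user -> permissions
    groups: dict = field(default_factory=dict)   # named group -> permissions


def _allows(perms, want, mask=None):
    effective = set(perms.replace("-", ""))
    if mask is not None:
        effective &= set(mask.replace("-", ""))
    return set(want) <= effective


def check_access(acl, user, user_groups, want):
    """Return True if `user` (member of `user_groups`) is granted all of `want`."""
    if user == acl.owner:
        return _allows(acl.user_obj, want)               # owner entry is never masked
    if user in acl.users:
        return _allows(acl.users[user], want, acl.mask)
    matching = []
    if acl.group in user_groups:
        matching.append(acl.group_obj)
    matching += [p for g, p in acl.groups.items() if g in user_groups]
    if matching:                                         # any matching group entry may grant
        return any(_allows(p, want, acl.mask) for p in matching)
    return _allows(acl.other, want)


if __name__ == "__main__":
    www = Acl("root", "root", "rwx", "r-x", "r-x", mask="rwx", users={"deploy": "rwx"})
    print(check_access(www, "deploy", ["deploy"], "w"))
    print(check_access(www, "guest", ["guest"], "w"))
```

Tightening the mask to r-x would take write access away from the named user without touching the user:deploy entry itself.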

Extended attributes: Namespace-stored metadata (user, trusted, security). Used by SELinux for security contexts, by attr for custom attributes. They are stored in the inode if small enough, otherwise in a separate block, impacting space and performance.

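Immutable files: The chattr +i command on ext4 sets the immutable attribute, which blocks modification, deletion, renaming, and linking, even by root, until the flag is cleared with chattr -i. Critical for system binaries and static configuration; for log files, use the append-only flag (chattr +a) instead. My production lesson: a misbehaving process once tried to delete its own binary, and the immutable flag is the reason the binary survived.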

File capabilities: Instead of setuid root, you can grant specific capabilities to a binary (e.g., CAP_NET_BIND_SERVICE to bind to low ports). This reduces the attack surface. But capabilities are stored in extended attributes and can be stripped by file copies or backups.

Production insight: A common mistake is assuming local ACLs protect an exported filesystem on their own. If you export an ext4 filesystem via NFS with the default AUTH_SYS security, the server evaluates permissions against whatever UID and GIDs the client presents, so your fine-grained ACLs are only as trustworthy as the clients and their UID mapping. Always review exports with exportfs -v and monitor with nfsstat.

SELinux contexts: On Red Hat systems, the filesystem stores SELinux labels in extended attributes. A relabel operation (restorecon -R /) can take hours on large volumes. We once had a security audit fail because a backup restored files without SELinux contexts — the entire web server was inaccessible.

Performance trade-off: Every ACL or extended attribute adds to inode metadata size. For directories with thousands of files, listing with getfacl * can be slower than expected. Use it only where necessary.

Another hidden trap: chattr +a (append only) is great for logs, but not every writer copes with it. Programs that open the file with O_APPEND (like rsyslog) keep working, while any open() for writing without O_APPEND, or with O_TRUNC, fails with EPERM, as do rename and unlink. That breaks log rotation schemes that truncate or rename files. Always test your logging and rotation setup against an append-only file.

Also note: file capabilities (setcap) are lost if the file is copied to a filesystem that doesn't support xattr (like NFSv3, or FAT). Always store critical binaries on native ext4/XFS with xattr support.
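Because capabilities live in the security.capability extended attribute, a deployment script can verify that they survived a copy. A minimal Linux-only sketch using os.getxattr; the helper name is hypothetical, and any error (attribute missing, filesystem without xattr support) is treated as "no capabilities".

```python
import os

CAPABILITY_XATTR = "security.capability"  # where setcap stores file capabilities


def has_file_capabilities(path):
    """Return True if `path` carries file capabilities, False otherwise."""
    if not hasattr(os, "getxattr"):       # os.getxattr only exists on Linux
        return False
    try:
        os.getxattr(path, CAPABILITY_XATTR)
        return True
    except OSError:                       # no such attribute, or xattrs unsupported
        return False


if __name__ == "__main__":
    print(has_file_capabilities("/usr/bin/node"))  # path is just an example
```

Running this check before and after a copy or restore step catches capabilities that were silently dropped.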

set_acl.sh · SHELL
# Set ACL: grant user 'deploy' read, write, and execute (directory traversal) on /var/www
sudo setfacl -m u:deploy:rwx /var/www

# View ACL
sudo getfacl /var/www

# Set immutable attribute on critical configuration
sudo chattr +i /etc/shadow

# Give binary capability to bind to port 80
sudo setcap 'cap_net_bind_service=+ep' /usr/bin/node

# Restore SELinux contexts recursively
sudo restorecon -R /var/www/html
Output
# file: var/www
# owner: root
# group: root
user::rwx
user:deploy:rwx
group::r-x
mask::rwx
other::r-x
# File $LogFile has immutable flag set
Production Insight
ACLs add metadata overhead — each ACL entry consumes extra inode space. For directories with 10k+ files, listing with getfacl can be noticeably slower.
NFS exports trust client-supplied UIDs under AUTH_SYS, so local ACLs offer little protection on their own. Enforce identity on the NFS server or use Kerberos.
Rule: use ACLs sparingly; prefer groups for most permission needs.
Also: file capabilities set via setcap are lost on copy to non-xattr filesystems — plan deployments accordingly.
Key Takeaway
Permissions go beyond rwx — ACLs, capabilities, and extended attributes provide finer control.
But they come with metadata overhead and NFS export gotchas.
Use immutable flag (chattr +i) for critical system files.

Virtual File System (VFS): How Linux Supports Multiple File Systems Transparently

The Virtual File System (VFS) is a kernel abstraction layer that allows user-space applications to use the same system calls (open, read, write) regardless of the underlying filesystem. Each filesystem registers itself with VFS, providing a standard set of operations (inode operations, file operations, dentry operations). The VFS inode and dentry caches improve performance. This is why you can mount ext4 on /, XFS on /home, and an NFS share on /mnt, and all work with the same APIs.
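The dispatch idea can be modelled in a few lines: a mount table maps path prefixes to driver objects that share one interface, and the longest matching mount point wins. This is a toy model for intuition only; the class names are invented and nothing here mirrors the kernel's actual data structures.

```python
class DictFS:
    """Toy filesystem driver: files are entries in a dict keyed by relative path."""

    def __init__(self, name, files):
        self.name = name
        self.files = files

    def read(self, relpath):
        return self.files[relpath]


class ToyVFS:
    """Routes a uniform read() call to whichever driver owns the path."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mountpoint, fs):
        self.mounts[mountpoint.rstrip("/") or "/"] = fs

    def resolve(self, path):
        candidates = [
            m for m in self.mounts
            if path == m or path.startswith(m.rstrip("/") + "/")
        ]
        best = max(candidates, key=len)          # longest mount point wins
        return self.mounts[best], path[len(best):].lstrip("/")

    def read(self, path):
        fs, relpath = self.resolve(path)
        return fs.read(relpath)


if __name__ == "__main__":
    vfs = ToyVFS()
    vfs.mount("/", DictFS("ext4", {"etc/hosts": "127.0.0.1 localhost"}))
    vfs.mount("/home", DictFS("xfs", {"naren/notes.txt": "hello"}))
    print(vfs.read("/etc/hosts"))
    print(vfs.read("/home/naren/notes.txt"))
```

The caller never names a filesystem type; swapping the driver behind /home changes nothing in the calling code.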

Production insight: The dentry cache stores recently accessed directory entries. If you have a large directory with millions of files, the dentry cache can consume significant memory. You can tune it with 'vm.vfs_cache_pressure'. A common issue is that the cache keeps entries, including negative entries for names that no longer exist, long after files are deleted, which adds to memory pressure. Setting vfs_cache_pressure=200 (default 100) makes the kernel reclaim dentries and inodes more aggressively.

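Another gotcha: The page cache holds file data in memory before it reaches disk. When a filesystem is remounted read-only after an error, dirty pages that have not been written back can be lost, and later writes on already-open file descriptors fail with EROFS or EIO. Applications that ignore return values lose data without noticing. Always check the return values of write() and fsync().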

Key takeaway: VFS is the glue that makes all file systems look the same to applications. Tuning dentry and inode caches can save memory on systems with many small files.
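To turn a /proc/slabinfo line into a memory figure, multiply the object count by the object size. A small sketch, assuming the standard column order (name, active_objs, num_objs, objsize, ...); the sample line is the truncated one from this article's output, and the function name is illustrative.

```python
def slab_cache_bytes(slabinfo_text, cache_name):
    """Approximate memory held by one slab cache: num_objs * objsize, or None if absent."""
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == cache_name:
            num_objs, objsize = int(fields[2]), int(fields[3])
            return num_objs * objsize
    return None


if __name__ == "__main__":
    sample = "dentry 51200 51000 192 10 1"
    print(slab_cache_bytes(sample, "dentry"))
```

On a live system, feed it the contents of /proc/slabinfo (readable by root) instead of the sample string.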

vfs_inspection.sh · SHELL
# Show mounted filesystems and their types
cat /proc/mounts

# Inspect dentry and inode cache sizes
sudo cat /proc/slabinfo | grep -E 'dentry|inode'

# Adjust vfs_cache_pressure (temporary)
echo 200 | sudo tee /proc/sys/vm/vfs_cache_pressure

# Make permanent: add 'vm.vfs_cache_pressure=200' to /etc/sysctl.conf
Output
rootfs / rootfs rw 0 0
/dev/sda1 / ext4 rw,noatime 0 0
...
dentry 51200 51000 192 10 1 : tunables ...
VFS Cache Monitoring
Use 'free -m' for an overview and 'grep SReclaimable /proc/meminfo' for the reclaimable slab total. If SReclaimable exceeds 20% of total RAM, consider tuning vfs_cache_pressure.
Production Insight
The VFS dentry cache can consume gigabytes of memory on servers with millions of files.
Monitor /proc/slabinfo to see dentry and inode cache usage.
Rule: adjust vm.vfs_cache_pressure if slab cache dominates memory.
Also: the page cache delays durability, not visibility. A read after a write sees the new data immediately, but the data is only on disk after writeback; use fsync() for transactional writes.
Key Takeaway
VFS provides a uniform interface over heterogeneous filesystems.
The dentry cache is essential for fast path lookups but can consume memory.
Tune vfs_cache_pressure to reclaim dentries under memory pressure.

File Access Patterns: Sequential vs Direct — Why Your Database is Slowing Down

Most juniors think file access is file access. It's not. The way you read data dictates whether your disk becomes a bottleneck or a workhorse. Sequential access reads data in order, block after block. On an HDD the heads stay on track instead of seeking. That's why log files and video streams scream. Direct (random) access jumps to any block by its number. Databases love this for random lookups.

The trap is mixing patterns. If you write a time-series log using direct access, you fragment your disk and kill write throughput. Always match your access pattern to your workload. Sequential writes to append-only logs. Direct reads for hash indexes. Know your workload before you choose your file system. ext4 with noatime for sequential. XFS with large allocation groups for concurrent random I/O. Your database will thank you.

access_patterns.py · PYTHON
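# io.thecodeforge
import os
import random
import time

PAGE_SIZE = 8192


def process(line):
    """Placeholder handler for one log line."""
    print(line, end='')


def parse_page(page):
    """Placeholder parser for one database page."""
    return len(page)


def tail_log(path):
    """Sequential: follow a log file from its current end."""
    with open(path, 'r') as f:
        f.seek(0, os.SEEK_END)  # go to end
        while True:
            line = f.readline()
            if line:
                process(line)
            else:
                time.sleep(0.1)  # wait for new data instead of busy-spinning


def read_pages(path, page_numbers):
    """Direct: read database pages by offset, in whatever order is requested."""
    with open(path, 'rb') as f:
        for page_num in page_numbers:
            f.seek(page_num * PAGE_SIZE)
            parse_page(f.read(PAGE_SIZE))


# Example calls (paths are illustrative):
# read_pages('/data/db/mytable.ibd', random.sample(range(1000), 1000))
# tail_log('/var/log/app/events.log')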
Output
Sequential: 2.1GB/s throughput | Direct: 1800 random reads/sec
Production Trap:
Never use direct access on spinning disks for high-concurrency workloads. The seek time kills you. SSDs handle random reads 100x better. Profile before you blame the file system.
Key Takeaway
Match your access pattern to your workload. Sequential for streaming, direct for indexing. Mix them and pay the latency tax.

Disk Free Space Management: How Bitmaps and Free Lists Keep You From Running Out

You delete a file. The space doesn't magically reappear. Something has to track which blocks are free. Two main strategies exist.

Bitmaps use one bit per block. Simple, fast, and cache-friendly. ext4 uses this. A 1 TiB disk with 4 KiB blocks has 2^28 blocks, so it needs just 32 MiB of bitmap. That stays resident in RAM easily, and each block group's share fits in a single 4 KiB block.

Free lists chain free blocks together. Early Unix file systems used them (FAT instead marks free clusters inside its allocation table). Drawback? Fragmentation. As files come and go, the free list becomes a linked list of scattered blocks. Allocation gets slow.

Modern file systems use hybrid approaches. ext4 groups blocks into block groups, each with its own bitmap and inode table. This keeps metadata local. When you allocate a file, it stays near its inode. Less head movement. Less latency. Never assume free space is contiguous. Always check fragmentation before blaming I/O issues.
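A bitmap allocator is small enough to sketch in full. The version below uses first-fit search for a contiguous run and is an illustration of the idea only; ext4's real allocator is far more elaborate, and the class and function names are made up for this example.

```python
def bitmap_bytes(disk_bytes, block_bytes):
    """Size of a one-bit-per-block bitmap for the given disk."""
    blocks = disk_bytes // block_bytes
    return (blocks + 7) // 8


class BlockBitmap:
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)   # all zero: every block free

    def is_used(self, n):
        return bool(self.bits[n // 8] & (1 << (n % 8)))

    def _set(self, n, used):
        if used:
            self.bits[n // 8] |= 1 << (n % 8)
        else:
            self.bits[n // 8] &= ~(1 << (n % 8)) & 0xFF

    def allocate(self, count):
        """First-fit: mark `count` contiguous free blocks used; return start or None."""
        run = 0
        for n in range(self.nblocks):
            run = 0 if self.is_used(n) else run + 1
            if run == count:
                start = n - count + 1
                for b in range(start, n + 1):
                    self._set(b, True)
                return start
        return None

    def free(self, start, count):
        for b in range(start, start + count):
            self._set(b, False)


if __name__ == "__main__":
    print(bitmap_bytes(2**40, 4096) // 2**20, "MiB")
    bm = BlockBitmap(16)
    first, second = bm.allocate(4), bm.allocate(4)
    bm.free(first, 4)
    print(bm.allocate(2), bm.allocate(3))
```

The last line shows fragmentation in miniature: after the 2-block allocation reuses part of the freed hole, the 3-block request no longer fits there and lands past the second file.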

check_free_space.sh · BASH
#!/bin/bash
# io.thecodeforge

# Check block group fragmentation on ext4
# Source: production incident on log server
# Run as root; use e2fsck only on an unmounted (or read-only) filesystem

dumpe2fs -h /dev/sda1 | grep -E "(Block count|Free blocks|Blocks per group)"

# Output shows:
# Block count:              262144000
# Free blocks:              52428800
# Blocks per group:         32768

# Free blocks per group
# If any group has < 10% free, allocation slows down
dumpe2fs /dev/sda1 2>/dev/null | grep -E "^Group|free blocks,"

# Check fragmentation level (e2fsck reports it as "% non-contiguous")
e2fsck -fn /dev/sda1 2>&1 | grep -i "non-contiguous"

# If fragmented > 5%, run:
# e4defrag /mount/point
Output
Block group 127: 2% free blocks
Block group 128: 3% free blocks
Warning: Running low on contiguous free space. Consider resize or defrag.
Production Trap:
Don't blindly trust df. It shows total free space, not contiguous free space. A disk at 60% capacity can still fragment badly if files are small and deleted often. Monitor per-block-group free space.
Key Takeaway
Bitmap-based allocation is fast and cache-friendly. Free lists cause fragmentation. Know your file system's block group strategy before capacity planning.
● Production incident · POST-MORTEM · severity: high

The Day an ext3 Superblock Corruption Took Down a Payment Gateway

Symptom
Server failed to mount root partition after reboot. dmesg showed 'EXT3-fs: unable to read superblock'. No backup superblock was used.
Assumption
The team assumed that ext3 with default options always journals metadata, but the filesystem was created with 'mkfs.ext3 -O ^has_journal' (no journal) to save space on the small root partition.
Root cause
The primary superblock starts at byte offset 1024 of the partition, is 1024 bytes long, and holds critical pointers. A power loss during a metadata write left it unreadable. Without a journal, no transaction log existed to replay or roll back.
Fix
Booted from a rescue disk, located the backup superblock at block 8193 (the first backup on a filesystem with 1 KiB blocks; with 4 KiB blocks it is 32768), used 'e2fsck -b 8193 /dev/sda1' to repair, then remounted and recreated the journal with 'tune2fs -j /dev/sda1'.
Key lesson
  • Always enable journaling on production filesystems — the performance hit (5-10% write overhead) is worth crash safety.
  • Know your filesystem's backup superblock locations and how to recover from a corrupted primary superblock.
  • Make regular dumps of superblock information using 'dumpe2fs -h' and store them off-box.
  • Practice recovery scenarios in staging so you don't learn the procedure during an outage.
  • Never assume default options are safe — verify 'has_journal' feature on every new filesystem with 'tune2fs -l'.
  • Consider using ext4 or XFS for production — ext3's journal is optional and easily missed.
  • Document backup superblock addresses (block 8193, 32768, etc.) in your runbook before a crash happens.
  • Enable metadata checksums (metadata_csum) on new ext4 filesystems to detect silent corruption early.
  • Test filesystem recovery from backup superblock at least once quarterly—theory doesn't survive panic.
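Backup superblock locations can be derived from three superblock fields, which makes them easy to record in a runbook ahead of time. The sketch below parses those fields from a raw image and lists the backups, assuming the sparse_super layout (backups in group 1 and in groups that are powers of 3, 5, and 7). Field offsets follow the ext2/3/4 on-disk format; the function names are invented for this example.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # the primary superblock starts 1024 bytes into the partition
EXT_MAGIC = 0xEF53


def parse_superblock(image):
    """Extract the fields needed to locate backups from the first 2 KiB of a partition."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 1024:
        raise ValueError("image too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError("bad magic: primary superblock is damaged or not ext2/3/4")
    (blocks_count,) = struct.unpack_from("<I", sb, 4)
    first_data_block, log_block_size = struct.unpack_from("<II", sb, 20)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 32)
    return {
        "blocks_count": blocks_count,
        "first_data_block": first_data_block,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
    }


def _is_power_of(n, base):
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblocks(blocks_count, blocks_per_group, first_data_block):
    """Block numbers of backup superblocks under the sparse_super layout."""
    groups = -(-(blocks_count - first_data_block) // blocks_per_group)  # ceiling division
    return [
        first_data_block + g * blocks_per_group
        for g in range(1, groups)
        if any(_is_power_of(g, base) for base in (3, 5, 7))
    ]


def mount_sb_option(block_number, block_size):
    """Convert a filesystem block number to the 1 KiB units that mount -o sb= expects."""
    return block_number * block_size // 1024


if __name__ == "__main__":
    print(backup_superblocks(65536, 8192, 1))        # 1 KiB blocks
    print(backup_superblocks(262144, 32768, 0))      # 4 KiB blocks
    print(mount_sb_option(32768, 4096))
```

Run against a healthy filesystem, the parsed values reproduce the list that mke2fs -n prints; store them off-box so they are available when the primary superblock is not.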
Production debug guide
Symptom → Action: quick field guide for the most common file system failures (7 entries)
Symptom · 01
Disk partition won't mount: 'mount: wrong fs type, bad option, bad superblock'
Fix
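Run 'dmesg | tail -20' to see the exact error. Then try 'fsck -n /dev/sdX' (no repair) to assess damage. List backup superblocks with 'mke2fs -n /dev/sdX' (the -n is essential: without it mke2fs creates a new filesystem) and mount with 'mount -o sb=<backup_block>'. Note that sb= is counted in 1 KiB units, so backup block 32768 on a filesystem with 4 KiB blocks is sb=131072. If that fails, check whether the partition table is intact using 'parted /dev/sdX print'.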
Symptom · 02
df reports disk full, but du shows much less used space
Fix
A process may have deleted a file while it was still open, holding the space. Run 'lsof +L1' to find orphaned file handles. Kill the process or restart it to free the blocks. Also check for unlinked inodes via 'debugfs -R "ls -d" /dev/sdX'.
Symptom · 03
Filesystem enters read-only mode unexpectedly (EXT4-fs error)
Fix
Check syslog for 'EXT4-fs (sda1): remounting filesystem read-only'. This is a kernel safety mechanism. Unmount, run 'fsck -fy /dev/sdX' to fix corruption, then remount. Identify the root cause — bad disk sectors (smartctl), power issues, or hardware memory errors.
Symptom · 04
Directory listing hangs or returns 'Input/output error'
Fix
The directory's inode or data block is damaged. If the hardware is failing, use 'ddrescue' to copy the partition to a fresh disk first and repair the copy. Otherwise run 'e2fsck -c -c' on the unmounted filesystem to scan for and mark bad blocks. Never run fsck on a mounted filesystem.
Symptom · 05
Inode exhaustion: 'No space left on device' but df -h shows free blocks
Fix
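Run 'df -i' to check inode usage. If inodes are at 100%, delete old files (especially small ones in spool directories). For a permanent fix, reformat with a lower bytes-per-inode ratio (mkfs.ext4 -i 4096 creates one inode per 4 KiB instead of the default 16 KiB) or switch to XFS, which allocates inodes dynamically.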
Symptom · 06
File system is mounted but writes fail with 'Read-only file system'
Fix
Could be a hardware RAID controller with write-back cache that lost power. Check 'dmesg' for 'forcing read-only'. Reboot, run 'fsck -fy' after unmount, and verify RAID controller battery health. Disable write-back cache until battery is replaced.
Symptom · 07
Quota limit reached but blocks and inodes are free
Fix
Check filesystem quotas: run 'repquota -a'. If user/group quota is exceeded, adjust quota limits or delete files. Quota errors often masquerade as disk full.
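For Symptom 05, the same numbers that df -i prints are available programmatically through statvfs, which makes inode exhaustion easy to alert on. A minimal sketch; the function name is illustrative, and filesystems that report no fixed inode count return None.

```python
import os


def inode_usage_percent(path="/"):
    """Percentage of inodes in use on the filesystem containing `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:          # some filesystems allocate inodes dynamically and report 0
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


if __name__ == "__main__":
    usage = inode_usage_percent("/")
    print("inode usage:", "n/a" if usage is None else f"{usage:.1f}%")
```

Alerting at a threshold well below 100% leaves time to clean up spool directories before writes start failing.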
★ Quick Debug Cheat Sheet: File System Troubleshooting
Run these commands in order when a file system misbehaves. Each command targets a specific layer, from disk health to metadata consistency.
Can't mount or read disk
Immediate action
Stop all I/O to the device. Do NOT force-mount.
Commands
dmesg | grep -i 'fs\|superblock\|i/o error\|recovery'
smartctl -H /dev/sdX (check disk health)
Fix now
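If the disk is healthy, use a backup superblock: run mke2fs -n /dev/sdX to get the block numbers, then mount -o sb=8193 /dev/sdX /mnt (sb= is in 1 KiB units, so multiply the block number by 4 on filesystems with 4 KiB blocks)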
df reports full but du disagrees
Immediate action
Identify the process holding deleted files.
Commands
lsof +L1 | grep '(deleted)'
fuser -m /mountpoint
Fix now
Kill the process (kill -9 PID) or restart the service to release space
Filesystem goes read-only (EXT4)
Immediate action
Do not write anything. Unmount safely if possible.
Commands
mount -o remount,ro /dev/sdX (force read-only if not already)
umount /dev/sdX
Fix now
fsck -fy /dev/sdX; then mount again. If errors persist, replace hardware.
Directory or file access returns EIO
Immediate action
Check if the disk has bad sectors.
Commands
smartctl -a /dev/sdX | grep Reallocated_Sector_Ct
badblocks -sv /dev/sdX (non-destructive read-only test)
Fix now
Use ddrescue to clone the partition, then run fsck on the clone. Replace disk if reallocation count is high.
Accidentally deleted a critical file
Immediate action
Stop all writes to the partition immediately. Remount read-only if possible.
Commands
debugfs -R 'lsdel' /dev/sdX (ext3/4 only — list recently deleted inodes)
extundelete /dev/sdX --restore-file /path/to/file
Fix now
If extundelete fails, restore from backup. For ext4 with journal, use extundelete --journal
Superblock corruption (primary only)
Immediate action
Do not write to the partition. Boot from rescue media if possible.
Commands
mke2fs -n /dev/sdX (find backup superblock numbers)
fsck -b <backup_block> -fy /dev/sdX
Fix now
If fsck succeeds, mount normally. If not, try a different backup superblock (e.g., block 32768, 98304, or 163840 on filesystems with 4 KiB blocks). Document all backup blocks in the runbook.

Key takeaways

1. File systems map logical file structures to physical blocks using inodes, directories, and allocation strategies.
2. Journaling ensures rapid crash recovery but does not protect data unless data=journal mode is used.
3. Monitor inode usage with df -i; inode exhaustion causes mysterious disk full errors even when space is available.
4. Always know your backup superblock locations before a crash arrives.
5. Tune mount options for your workload: noatime reduces writes, commit interval balances performance and safety.
6. On SSDs, periodic fstrim is better than the discard mount option for consistent performance.
7. ACLs and extended attributes add metadata overhead; use them sparingly in favor of groups.

Common mistakes to avoid

5 patterns
×

Not monitoring inode usage

Symptom
df -h shows free space but applications report 'No space left on device' because the inode table is full.
Fix
Monitor df -i alongside df -h. For many small files, format with more inodes (a lower bytes-per-inode ratio such as -i 4096) or use XFS with dynamic inodes.
×

Assuming journaling is always enabled

Symptom
After a power failure, filesystem requires lengthy fsck or corrupts because journaling was disabled during format (^has_journal).
Fix
Always verify with 'tune2fs -l /dev/sdX | grep has_journal'. Enable journaling with 'tune2fs -j /dev/sdX' on an unmounted filesystem.
×

Using FAT32 for system partitions

Symptom
No journaling leads to frequent consistency checks on unclean shutdown; 4GB file size limit blocks large files.
Fix
Use NTFS for Windows, ext4 or XFS for Linux. FAT32 only for portable USB drives.
×

Forgetting backup superblock locations

Symptom
When primary superblock corrupts, engineer doesn't know where the backups are and wastes hours searching.
Fix
Document backup superblock addresses (e.g., block 8193 with 1 KiB blocks; 32768 and 98304 with 4 KiB blocks) in the runbook. Use 'mke2fs -n' to list them before a crash.
×

Not testing file system recovery in staging

Symptom
During a real outage, recovery commands are unfamiliar, leading to mistakes and extended downtime.
Fix
Practice superblock recovery, fsck, and journal replay in staging quarterly. Document step-by-step.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
What is the difference between an inode and a directory entry?
Q02 · SENIOR
How does journaling ensure filesystem consistency after a crash?
Q03 · JUNIOR
What happens when you delete a file on ext4?
Q04 · SENIOR
How does ext4 handle large files efficiently?
Q05 · SENIOR
How would you recover from a primary superblock corruption on an ext4 fi...
Q06 · SENIOR
Explain the trade-offs between data=ordered, data=journal, and data=writ...
Q01 of 06 · JUNIOR

What is the difference between an inode and a directory entry?

ANSWER
An inode stores metadata about a file (size, permissions, timestamps, block pointers) but does not contain the file name. A directory entry maps a file name to an inode number. This separation allows hard links: multiple directory entries pointing to the same inode. Deleting a file removes the directory entry and decrements the inode's link count; when the link count reaches zero, the inode and its data blocks are freed.
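The name-to-inode separation is easy to observe from user space. A short sketch, assuming a filesystem that supports hard links (the function and file names are illustrative):

```python
import os
import tempfile


def demo_hard_link():
    """Create two names for one inode, remove one, and report what happens."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "a.txt")
        second = os.path.join(d, "b.txt")
        with open(first, "w") as f:
            f.write("payload")
        os.link(first, second)                      # second directory entry, same inode
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino
        links_before = os.stat(first).st_nlink
        os.unlink(first)                            # removes a name, not the data
        links_after = os.stat(second).st_nlink
        with open(second) as f:
            content = f.read()
        return same_inode, links_before, links_after, content


if __name__ == "__main__":
    print(demo_hard_link())
```

The data survives the first unlink because the link count is still 1; it is freed only when the count reaches zero and no process holds the file open.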
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the superblock in ext3 and why does its corruption cause an unmountable filesystem?
02
How do you recover an ext3 filesystem with a corrupted primary superblock?
03
Why did the payment gateway outage happen specifically from ext3 superblock corruption?
04
What's the difference between ext3 and ext4 that affects superblock corruption recovery?