Intermediate 18 min · March 06, 2026

ext3 Superblock Corruption — Payment Gateway Outage

Disk partition won't mount: bad superblock — ext3 superblock corruption from power loss caused a payment gateway outage.

Naren · Founder
Quick Answer
  • A file system organises raw storage into files and directories using metadata like inodes and allocation tables
  • Core components: superblock (globals), inode table (per-file metadata), data blocks (content), directory entries (name-to-inode maps)
  • FAT32 uses linked-list cluster chains; ext4 uses extents and journaling, which cuts metadata overhead and seeks for large files
  • Production gotcha: abrupt power loss during a metadata write can orphan inodes — journaling prevents this on ext4, but not on FAT32
  • Biggest mistake: assuming file delete frees data — it only marks blocks as free; data remains recoverable until overwritten
  • Forensic reality: Tools like extundelete can recover deleted files minutes after deletion if no new writes occurred
Plain-English First

Imagine your OS is a giant library. A file system is the librarian's cataloguing system — it decides which shelf each book goes on, writes a card in the index so anyone can find it later, and tracks which shelves are empty. Without the librarian, books would be dumped on the floor in a pile and nobody could find anything. Your hard drive is that same pile of storage space, and the file system is what turns chaos into an organised, searchable collection.

Every time you hit Ctrl+S, drag a photo into a folder, or install an app, you're trusting a file system to keep that data safe and findable. File systems are one of those invisible layers of the OS that almost nobody thinks about — until something goes wrong and years of photos vanish. Understanding how they work isn't just academic; it's the difference between a developer who debugs a corrupted disk by instinct and one who panics and Googles for three hours.

The core problem a file system solves is deceptively simple: a hard drive or SSD is just a flat sequence of bytes — millions of them, with no inherent meaning. The file system imposes structure on that flat sequence. It records where each file starts and ends, what it's called, who owns it, when it was last modified, and which blocks of storage are free for new data. Without this layer, the OS couldn't tell the difference between a Python script and a JPEG.

By the end of this article you'll understand the internal structure of a file system (directories, inodes, blocks, and allocation tables), why different file systems like FAT32, NTFS, and ext4 exist and when each one is the right choice, what actually happens on disk when you create or delete a file, and the most common mistakes engineers make when reasoning about file systems under load or across platforms. You'll also walk away with concrete talking points for any OS or systems design interview.

Here's the thing: when your filesystem goes down, every other service goes down with it. The debug commands in this article are the same ones I've used to recover production systems at 2 AM. Learn them once, and you'll never panic again. You'll also pick up the recovery procedures that turn a potential hours-long outage into a ten-minute fix — because I've lived that outage, and the first time was on a production database at 2 AM on a Saturday.

What Is a File System in an OS?

A file system is the layer of the operating system that manages how data is stored, retrieved, organized, and named on a storage device. Without it, the OS would see the disk as a single flat array of blocks: no structure, no names, no attributes.

The key insight: the file system is a mapping between the logical file structure (path, name, size, timestamps) and the physical blocks on disk. This abstraction allows applications to work with files without knowing the underlying hardware geometry. It also enforces security (permissions), concurrency (locking), and consistency (journalling).

The real power of a filesystem is the metadata abstraction. Without it, every application would need to know the exact block layout of the disk. The filesystem provides a logical view — paths, sizes, permissions — that the OS and apps can rely on. That abstraction is what makes it possible to move a file between different storage devices without the application even noticing.

But here's what you don't see in textbooks: the abstraction leaks. When a database writes directly to raw block devices, it bypasses the filesystem entirely — because the filesystem's guarantee of ordered writes (data=ordered) isn't enough for some workloads. That's a real production trade-off: performance vs. safety.

Most articles stop there. But here's the part that matters in practice: the abstraction leak isn't just theoretical. PostgreSQL's full-page writes and InnoDB's doublewrite buffer exist precisely because the storage stack's atomic write guarantee covers at most one block, not a whole database page. A 16KB database page spans four 4KB filesystem blocks, so a crash in the middle of writing it leaves a torn page: part new, part old. Your database survives because it adds its own consistency layer on top. Filesystem journaling keeps the metadata sane, but it does not make multi-block data writes atomic, so you need both layers.
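To make the torn-page problem concrete, here is a minimal sketch. It is a toy in-memory simulation, not how PostgreSQL or InnoDB actually lay out pages: the class name `TornPageDemo`, the fill bytes, and the use of CRC32 as a stand-in for a page checksum are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Toy model: a 16 KB database page written as four 4 KB blocks, with a crash part-way. */
class TornPageDemo {
    static final int BLOCK = 4096;
    static final int PAGE = 16 * 1024;

    /** CRC32 over the whole page, standing in for a database page checksum (assumption). */
    static long checksum(byte[] page) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, page.length);
        return crc.getValue();
    }

    /** Copies only the first blocksWritten blocks of newPage over oldPage, as a crash mid-write would. */
    static byte[] partialWrite(byte[] oldPage, byte[] newPage, int blocksWritten) {
        byte[] onDisk = Arrays.copyOf(oldPage, oldPage.length);
        System.arraycopy(newPage, 0, onDisk, 0, blocksWritten * BLOCK);
        return onDisk;
    }

    public static void main(String[] args) {
        byte[] oldPage = new byte[PAGE];
        byte[] newPage = new byte[PAGE];
        Arrays.fill(oldPage, (byte) 'A');
        Arrays.fill(newPage, (byte) 'B');

        long expected = checksum(newPage);
        byte[] torn = partialWrite(oldPage, newPage, 2); // crash after 2 of 4 blocks

        System.out.println("blocks per page: " + PAGE / BLOCK);
        System.out.println("torn page detected: " + (checksum(torn) != expected));
    }
}
```

The checksum only detects the torn page. Repairing it needs a clean copy from somewhere else, which is the job full-page writes and the doublewrite buffer do.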

I've seen teams blame the filesystem for 'corrupt data' when the real culprit was a misconfigured RAID controller with write-back cache enabled. The filesystem reported success to the application, but the data sat in the controller's volatile cache. When power dropped, the cache vanished. Moral: understand your entire I/O stack, not just the filesystem.

io/thecodeforge/fs/SimpleFilesystem.java (Java)
package io.thecodeforge.fs;

import java.util.*;

public class SimpleFilesystem {
    private static final int BLOCK_SIZE = 4096;
    private static final int TOTAL_BLOCKS = 1024;

    // Toy inode: size, type, and the list of data blocks that hold the content.
    private static class Inode {
        int size;
        boolean isDirectory;
        List<Integer> dataBlocks = new ArrayList<>();
    }

    private final Map<String, Inode> inodeByPath = new HashMap<>();
    // Allocation bitmap: a set bit means the block is in use.
    private final BitSet blockFreeMap = new BitSet(TOTAL_BLOCKS);

    public void createFile(String path, int initialSize) {
        Inode inode = new Inode();
        inode.size = initialSize;
        inode.isDirectory = false;
        int blocksNeeded = (int) Math.ceil(initialSize / (double) BLOCK_SIZE);
        // First-fit scan: claim the lowest-numbered free blocks.
        for (int i = 0; i < TOTAL_BLOCKS && inode.dataBlocks.size() < blocksNeeded; i++) {
            if (!blockFreeMap.get(i)) {
                blockFreeMap.set(i);
                inode.dataBlocks.add(i);
            }
        }
        if (inode.dataBlocks.size() < blocksNeeded) {
            // Roll back the partial allocation so a failed create does not leak blocks.
            for (int block : inode.dataBlocks) {
                blockFreeMap.clear(block);
            }
            throw new RuntimeException("Disk full");
        }
        inodeByPath.put(path, inode);
        System.out.println("File created: " + path + " using blocks " + inode.dataBlocks);
    }

    public void deleteFile(String path) {
        Inode inode = inodeByPath.remove(path);
        if (inode == null) throw new NoSuchElementException(path + " not found");
        for (int block : inode.dataBlocks) {
            blockFreeMap.clear(block);
        }
        System.out.println("File deleted: " + path + " — blocks freed but content may still be recoverable");
    }

    public static void main(String[] args) {
        SimpleFilesystem fs = new SimpleFilesystem();
        fs.createFile("/etc/config.yaml", 12288);
        fs.deleteFile("/etc/config.yaml");
    }
}
Output
File created: /etc/config.yaml using blocks [0, 1, 2]
File deleted: /etc/config.yaml — blocks freed but content may still be recoverable
Forge Tip:
Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.
Production Insight
Never implement your own filesystem logic in production. The code above ignores journalling, concurrency, and fragmentation.
A delete operation does NOT erase data — tools like 'extundelete' can recover blocks if not overwritten.
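Rule: for sensitive data, use full-disk encryption so block recovery is moot. 'shred' helps only where overwrites land on the original blocks, which journaling filesystems and SSDs do not guarantee.
Also: RAID write-back cache + sudden power loss = false success. Use a battery- or flash-backed cache, and call fsync after critical writes so the data at least leaves the OS page cache.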
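For the fsync part of that advice, here is a minimal Java sketch. `FileChannel.force(true)` is the standard-library call that asks the OS to flush a file's content and metadata to the device; the class name `DurableWrite`, the temp-file prefix, and the sample record are hypothetical. Note the limit: if a controller acknowledges the flush from a volatile cache, no application-level call can fix that.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DurableWrite {
    /** Writes data to path, then asks the OS to flush content and metadata to the device. */
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // fsync equivalent; true also flushes file metadata
        }
    }

    /** Writes data to a temp file durably, returns the resulting file size, then cleans up. */
    static long roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("durable-write", ".bin");
            try {
                writeDurably(tmp, data);
                return Files.size(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = "txn=42 status=committed\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes written and flushed: " + roundTrip(record));
    }
}
```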
Key Takeaway
A filesystem is a mapping layer between block storage and file abstractions.
Files survive deletion until blocks are overwritten.
Never assume delete = data gone — always verify with secure erasure tools.
Choose the Right Filesystem for Your Use Case
If: Cross-platform USB drive or SD card
Use: FAT32 or exFAT. Widest compatibility, but no journaling and a 4GB file size limit (FAT32).
If: Windows system volume or general-purpose Windows server
Use: NTFS. Journaling, ACLs, large files. Use for boot drives.
If: Linux production server (database, logs, applications)
Use: ext4 with data=ordered (default). Journaling, extents, POSIX permissions. For high concurrency workloads, consider XFS.
If: Large media server with many large files
Use: XFS. Excellent scalability for large files, online defragmentation, and dynamic inode allocation.
If: NAS or ZFS storage pool
Use: ZFS (or Btrfs with caution). Checksums, snapshots, and built-in RAID. Not for Linux boot disks.

Anatomy of a File System — Blocks, Inodes and Directories

Every file system organizes storage into fixed-size blocks (typically 4 KB). The crucial metadata structures are the superblock, inode table, and directory entries.

  • Superblock: Stores global info like filesystem type, block size, number of blocks, number of free inodes. If the primary superblock is corrupted, the filesystem will not mount until you point the tools at a backup copy.
  • Inode (index node): Each file and directory has one inode. It holds metadata (size, permissions, timestamps) and pointers to the data blocks. Inodes are stored in a reserved area of the partition.
  • Directory: A special file whose data block is a list of (name, inode number) pairs. The '.' and '..' entries are stored here.

The inode does not store the file name. The name lives only in the directory entry. This means a file can have multiple names (hard links) — each pointing to the same inode. Moving a file within the same filesystem simply changes the directory entry, not the inode.

A classic ext2/ext3 inode holds 12 direct block pointers, then one single, one double, and one triple indirect pointer. This design lets small files be read with one inode lookup, while large files use progressively deeper indirection. The 15 pointers live in the inode's 60-byte i_block field (15 × 4 bytes). ext4 files that use extents reuse the same 60 bytes for an extent header plus up to four extents.
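The pointer arithmetic is easier to see in code. This is a sketch under stated assumptions: 4-byte block numbers, the classic 12 + 1 + 1 + 1 pointer layout, and no other limits (real ext2/ext3 volumes hit lower caps from other on-disk fields). The class and method names are made up for illustration.

```java
/** Sketch of classic block mapping: 12 direct pointers plus three indirect levels. */
class IndirectBlockMath {
    static final int DIRECT = 12;

    /** Pointers that fit in one block, assuming 4-byte block numbers. */
    static long pointersPerBlock(int blockSize) {
        return blockSize / 4;
    }

    /** Returns 0 for direct, 1/2/3 for single/double/triple indirect, -1 if out of range. */
    static int indirectionLevel(long logicalBlock, int blockSize) {
        long p = pointersPerBlock(blockSize);
        long limit = DIRECT;
        if (logicalBlock < limit) return 0;
        limit += p;
        if (logicalBlock < limit) return 1;
        limit += p * p;
        if (logicalBlock < limit) return 2;
        limit += p * p * p;
        if (logicalBlock < limit) return 3;
        return -1;
    }

    /** Largest file, in bytes, addressable by the pointer scheme alone. */
    static long maxFileBytes(int blockSize) {
        long p = pointersPerBlock(blockSize);
        return (DIRECT + p + p * p + p * p * p) * blockSize;
    }

    public static void main(String[] args) {
        int bs = 4096;
        System.out.println("block 11   -> level " + indirectionLevel(11, bs));
        System.out.println("block 12   -> level " + indirectionLevel(12, bs));
        System.out.println("block 1036 -> level " + indirectionLevel(1036, bs));
        System.out.println("max bytes by pointers alone: " + maxFileBytes(bs));
    }
}
```

With 4 KB blocks, the first 48 KB of a file (12 blocks) needs no indirection at all, and each extra level costs one more block read per lookup.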

Here's a production nuance: if you have millions of small files (think Docker overlay layers or mail spools), you'll exhaust inodes long before the disk fills. I've seen 'No space left on device' bring down a mail server while 'df -h' showed 40% free. Always monitor 'df -i'.

Another hidden detail: the superblock isn't the only copy. ext4 keeps backup superblocks at the start of selected block groups: blocks 32768, 98304, and so on with 4 KB blocks, or 8193, 24577, and so on with 1 KB blocks. When the primary superblock is corrupted, you can recover using a backup. But many engineers don't know where their backups are until they need them. That's the point of the production incident earlier: know your backup block numbers before a crash.
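As a rough sketch of where those backups land, the code below assumes the common sparse_super layout (backups in block group 1 and in groups that are powers of 3, 5, or 7) and the default of 8 × block-size blocks per group. Treat it as an illustration: the authoritative list for a real volume comes from dumpe2fs or mke2fs -n, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BackupSuperblocks {
    /** True if n is base^k for some k >= 0. */
    static boolean isPowerOf(long n, long base) {
        if (n < 1) return false;
        while (n % base == 0) n /= base;
        return n == 1;
    }

    /** Block numbers of the first `count` backup superblocks under the assumed layout. */
    static List<Long> locations(int blockSize, int count) {
        long blocksPerGroup = 8L * blockSize;               // one bitmap block tracks 8 * blockSize blocks
        long firstDataBlock = (blockSize == 1024) ? 1 : 0;  // 1 KB filesystems start groups at block 1
        List<Long> result = new ArrayList<>();
        for (long group = 1; result.size() < count; group++) {
            if (group == 1 || isPowerOf(group, 3) || isPowerOf(group, 5) || isPowerOf(group, 7)) {
                result.add(group * blocksPerGroup + firstDataBlock);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("4 KB blocks: " + locations(4096, 6));
        System.out.println("1 KB blocks: " + locations(1024, 5));
    }
}
```

For 4 KB blocks this yields 32768, 98304, 163840, 229376, 294912, 819200, assuming the filesystem is large enough to contain those groups.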

What about extent trees? In ext4, an extent is a contiguous range of blocks. The inode stores up to 4 extents inline; for files with more than 4 extents, a tree of extent nodes is used. This reduces metadata overhead dramatically — a 16 MB file stored in one extent requires only one entry in the inode, not thousands of individual block pointers. This is why ext4 handles large files much better than ext3 without extent support.

One more internal detail: the directory structure itself can be a hash tree (htree) in ext4, allowing fast lookups even in directories with millions of entries. Without htree, a linear scan of directory entries would be O(n) per lookup. ext4's htree is a B-tree variant that gives O(log n) lookups. This is why re-creating filesystems with 'dir_index' feature matters for mail servers and image repositories.

inspect_inodes.sh (Shell)
# View the superblock summary
sudo dumpe2fs -h /dev/sda1 | head -20

# List inode attributes of a specific file
stat /etc/hostname

# Find which inode a file uses
ls -i /etc/hostname

# With debugfs, read an inode straight from the inode table.
# debugfs opens the device read-only by default, so no unmount is needed.
# Replace 131073 with the number reported by ls -i and keep the angle brackets.
sudo debugfs -R "stat <131073>" /dev/sda1
Output
Filesystem volume name: <none>
Last mounted on: /
Filesystem magic number: 0xEF53
Inode count: 655360
Block count: 2621440
Block size: 4096
Inodes per group: 8192
Inode size: 256
---
File: /etc/hostname
Size: 15 Blocks: 8 IO Block: 4096 regular file
Device: 8,1 Inode: 131073 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Mental Model: The File System as a Dictionary
  • Directory entries: name -> inode number (e.g., 'hostname' -> 131073)
  • Inode table: inode number -> metadata + block pointers (e.g., inode 131073 points to blocks 100-102)
  • Data blocks: the actual bytes of the file
  • This separation means you can have multiple names (hard links) pointing to the same inode — deleting one name just removes the directory entry, not the inode
  • The superblock is the 'globals' dict — without it, you can't parse anything else
Production Insight
A full inode table is a silent killer. New files fail with 'No space left on device' even when 'df -h' shows free blocks.
Always monitor 'df -i' alongside 'df -h' — inode exhaustion brings down services without warning.
Rule: for many small files, use XFS (dynamic inodes) or format ext4 with 'mkfs.ext4 -i 4096'.
Also: directory hashing (dir_index) is on by default in modern ext4, but check with 'dumpe2fs -h /dev/sdX | grep dir_index'.
Key Takeaway
Inodes store metadata but not names. Directories store names but not data.
A file is just a number (inode) until a directory entry gives it a name.
Monitor inode exhaustion — it's invisible in normal 'df' output.
Inode Sizing Decision Tree
If: Expected many small files (< 16 KB each)
Use: mkfs.ext4 -i 4096 to increase inode count. Or use XFS, which allocates inodes dynamically.
If: Expected mostly large files (> 1 MB each)
Use: mkfs.ext4 -i 65536 or larger to reduce inode count and save space.
If: You need to change inode count on an existing filesystem
Use: You can't. You must back up, reformat with the proper -i option, and restore. Plan ahead.
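The -i value is a bytes-per-inode ratio, so the trade-off is simple arithmetic. Here is a sketch, assuming a 10 GiB volume and 256-byte inodes (the figures in the dumpe2fs output above). Real mkfs rounds the count per block group, so treat the results as approximations; the class name is hypothetical.

```java
class InodeBudget {
    /** Approximate inode count: one inode per bytesPerInode of capacity. */
    static long inodeCount(long fsBytes, long bytesPerInode) {
        return fsBytes / bytesPerInode;
    }

    /** Space consumed by the inode tables themselves. */
    static long inodeTableBytes(long inodes, int inodeSize) {
        return inodes * inodeSize;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        for (long ratio : new long[] {4096, 16384, 65536}) {
            long inodes = inodeCount(tenGiB, ratio);
            long tableMiB = inodeTableBytes(inodes, 256) / (1024 * 1024);
            System.out.printf("-i %d -> %d inodes, %d MiB of inode tables%n", ratio, inodes, tableMiB);
        }
    }
}
```

At the default ratio of 16384 this gives 655,360 inodes, matching the sample output earlier. Dropping to -i 4096 quadruples that to 2,621,440 but spends 640 MiB on inode tables instead of 160 MiB.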

File Allocation Strategies — Contiguous, Linked and Indexed

How does the file system map file offsets to disk blocks? Three classic strategies:

  1. Contiguous allocation: Each file occupies consecutive blocks. Simple and fast for sequential reads (single seek), but suffers from external fragmentation: as files are created and deleted, free space gets scattered. Used mainly by write-once formats such as ISO 9660.
  2. Linked allocation: Each block contains a pointer to the next block. There is no external fragmentation, but random access is slow: the pointer lives in the block data, so reaching block n means reading the n blocks before it. FAT32 uses a variant where the File Allocation Table (FAT) stores the chain separately, so the chain can be walked in memory without touching data blocks.
  3. Indexed allocation: The inode contains a list of direct block pointers, plus indirect, double indirect, and triple indirect pointers for large files. This gives O(1) access to any block via a few index reads. ext4 and NTFS use indexed allocation with extent trees (ranges of contiguous blocks) to reduce pointer overhead.

Modern file systems combine these: ext4 uses extents (contiguous runs of blocks) tracked in an indexed structure, giving the best of both worlds.
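A minimal sketch of extent-based lookup follows. It assumes the extents are sorted by logical start and do not overlap. The `Extent` class, the sample block numbers, and the -1 "hole" convention are illustrative choices, not ext4's on-disk format.

```java
class ExtentMap {
    /** One extent: `length` blocks starting at logical block `logicalStart`, stored from `physicalStart`. */
    static final class Extent {
        final long logicalStart;
        final long physicalStart;
        final int length;

        Extent(long logicalStart, long physicalStart, int length) {
            this.logicalStart = logicalStart;
            this.physicalStart = physicalStart;
            this.length = length;
        }
    }

    /** Binary search over extents sorted by logicalStart; returns -1 if the block falls in a hole. */
    static long physicalBlock(Extent[] extents, long logicalBlock) {
        int lo = 0;
        int hi = extents.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Extent e = extents[mid];
            if (logicalBlock < e.logicalStart) {
                hi = mid - 1;
            } else if (logicalBlock >= e.logicalStart + e.length) {
                lo = mid + 1;
            } else {
                return e.physicalStart + (logicalBlock - e.logicalStart);
            }
        }
        return -1;
    }

    /** Three hypothetical extents with a hole between logical blocks 12 and 19. */
    static Extent[] sample() {
        return new Extent[] {
            new Extent(0, 1000, 8),
            new Extent(8, 5000, 4),
            new Extent(20, 7000, 10)
        };
    }

    public static void main(String[] args) {
        Extent[] extents = sample();
        for (long logical : new long[] {3, 9, 15, 25}) {
            System.out.println("logical " + logical + " -> physical " + physicalBlock(extents, logical));
        }
    }
}
```

Three entries describe 22 blocks here. A pointer-per-block scheme would need 22 entries for the same file, and the gap widens quickly as files grow.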

Here's the real gotcha: on spinning disks, a heavily fragmented file can kill read throughput. I once debugged a log parser that took 10x longer on an HDD than expected — the log files were fragmented into thousands of 4KB chunks across the platter. 'filefrag /var/log/syslog' showed 2,347 extents for a 1GB file. The fix was to defragment or switch to ext4 which merges extents better.

And here's something most docs skip: the extent tree's depth limits. With 4KB blocks, one extent leaf block holds up to 340 extents (12-byte entries after a 12-byte header), and each extent covers at most 32,768 blocks, or 128 MiB. A single leaf can therefore map about 42.5 GiB of perfectly contiguous data. Most files never go beyond the inline extents. But if you have database files that are terabytes large with hundreds of extents, the tree grows, and that adds latency to each metadata lookup. XFS handles this more gracefully with B+ trees for extents.
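Here is the capacity arithmetic as a sketch, assuming the ext4 on-disk sizes as I understand them: a 12-byte extent header, 12-byte entries, and at most 32,768 blocks per initialized extent. The class and method names are made up for illustration.

```java
class ExtentTreeCapacity {
    static final int HEADER_BYTES = 12;               // extent header at the start of each node
    static final int ENTRY_BYTES = 12;                // one extent or index entry
    static final long MAX_BLOCKS_PER_EXTENT = 32768;  // length limit of an initialized extent
    static final int INLINE_ENTRIES = 4;              // entries that fit in the inode's 60-byte area

    /** Entries that fit in one tree node occupying a full block. */
    static long entriesPerNode(int blockSize) {
        return (blockSize - HEADER_BYTES) / ENTRY_BYTES;
    }

    /** Bytes mappable through one leaf block when every extent is maximal. */
    static long bytesPerLeaf(int blockSize) {
        return entriesPerNode(blockSize) * MAX_BLOCKS_PER_EXTENT * blockSize;
    }

    /** Bytes mappable by a depth-1 tree: inline index entries, each pointing at one leaf. */
    static long bytesAtDepthOne(int blockSize) {
        return INLINE_ENTRIES * bytesPerLeaf(blockSize);
    }

    public static void main(String[] args) {
        int bs = 4096;
        double gib = 1024.0 * 1024 * 1024;
        System.out.println("entries per node: " + entriesPerNode(bs));
        System.out.printf("one leaf maps up to %.1f GiB%n", bytesPerLeaf(bs) / gib);
        System.out.printf("depth-1 tree maps up to %.1f GiB%n", bytesAtDepthOne(bs) / gib);
    }
}
```

These are best-case figures for perfectly contiguous files. A fragmented file uses many short extents and pushes the tree deeper much sooner.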

One more production nuance: 'filefrag' reads extent maps through the FIEMAP ioctl. On filesystems without FIEMAP support it falls back to FIBMAP, which requires root, so run it with sudo if you hit permission errors. Always check extent counts on HDDs to avoid performance surprises.

Another hidden cost: on FAT32, the FAT itself is a large table that must be cached. For large partitions, the FAT can be tens of MB, and frequent updates (file creation/deletion) cause heavy write traffic. This is why FAT32 is unsuitable for high-write server workloads.

io/thecodeforge/fs/block_chain.c (C)
/* Simplified simulation of FAT-style linked allocation */
#include <stdio.h>
#include <stdint.h>

#define NUM_BLOCKS 16

int main() {
    uint16_t fat[NUM_BLOCKS] = {0};

    fat[2] = 5;
    fat[5] = 9;
    fat[9] = 0xFFFF;

    printf("File block chain: ");
    uint16_t current = 2;
    while (current != 0xFFFF) {
        printf("%d ", current);
        current = fat[current];
    }
    printf("\n");

    printf("Accessing logical block 2 (third block): ");
    current = 2;
    int i = 0;
    while (current != 0xFFFF && i < 2) {
        current = fat[current];
        i++;
    }
    printf("block %d\n", current);
    return 0;
}
Output
File block chain: 2 5 9
Accessing logical block 2 (third block): block 9
Fragmentation Trap
Contiguous allocation suffers from external fragmentation. After many file create/delete cycles, free blocks are scattered, and new large files can't be allocated contiguously. Defragmentation tools exist for FAT32 and NTFS, and ext4 ships 'e4defrag', although extents and delayed allocation mean it is rarely needed. SSDs have no seek penalty, so fragmentation matters far less there.
Production Insight
On spinning disks, fragmentation directly impacts read throughput: a 1MB file split into 256 scattered 4KB blocks costs roughly 256 seeks, about 2.5s at ~10 ms each.
Extents in ext4 reduce this by grouping contiguous blocks; keep 10-15% free space so the allocator can find contiguous runs.
On SSDs, fragmentation barely affects read performance, but free space still matters because the controller needs it for garbage collection.
The FAT in FAT32 is a write bottleneck; use exFAT for USB drives with many small file operations.
Key Takeaway
Contiguous allocation is fast but fragments. Linked allocation avoids fragmentation but kills random access.
Indexed allocation (extents) is the gold standard.
On HDDs, fragmentation is a real performance killer — monitor with 'filefrag'.
Allocation Strategy Trade-offs
If: Need simple sequential read performance above all
Use: Contiguous allocation (e.g., ISO 9660 for optical media)
If: Need portability and can tolerate slow random access
Use: Linked allocation with FAT (FAT32, exFAT)
If: Need balanced read/write for general OS use
Use: Indexed allocation with extents (ext4, NTFS, XFS)
If: Extreme large-file workloads (video editing, HPC)
Use: XFS with large block sizes (e.g., 64 KB blocks)

Journaling and Metadata Consistency — Why ext4 Survives Crashes

Before journaling, a power loss during a write could leave the filesystem in an inconsistent state: an inode pointing to blocks that are still marked free, or a directory entry referencing a non-existent inode. Recovery required a full fsck scan that could take hours on large volumes.

Journaling solves this by recording pending metadata operations in a circular log (journal) before applying them to the main filesystem. If a crash occurs, the journal is replayed on next mount — applying completed transactions and discarding partial ones. The filesystem is consistent in seconds.
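The replay rule (apply committed transactions, discard partial ones) can be sketched in a few lines. This is a toy redo log over a map of metadata keys, not the real jbd2 on-disk format. The `Record` class, the keys, and the values are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class JournalReplay {
    /** A journal record: a metadata update belonging to a transaction, or that transaction's commit mark. */
    static final class Record {
        final int txn;
        final String key;
        final String value;
        final boolean commit;

        private Record(int txn, String key, String value, boolean commit) {
            this.txn = txn;
            this.key = key;
            this.value = value;
            this.commit = commit;
        }

        static Record update(int txn, String key, String value) {
            return new Record(txn, key, value, false);
        }

        static Record commit(int txn) {
            return new Record(txn, null, null, true);
        }
    }

    /** Applies, in log order, only the updates whose transaction has a commit record in the journal. */
    static Map<String, String> replay(List<Record> journal, Map<String, String> metadata) {
        Set<Integer> committed = new HashSet<>();
        for (Record r : journal) {
            if (r.commit) committed.add(r.txn);
        }
        Map<String, String> result = new TreeMap<>(metadata);
        for (Record r : journal) {
            if (!r.commit && committed.contains(r.txn)) {
                result.put(r.key, r.value);
            }
        }
        return result;
    }

    /** Transaction 1 committed; transaction 2 was cut off by the crash before its commit record. */
    static List<Record> sampleJournal() {
        return Arrays.asList(
            Record.update(1, "inode 12 size", "4096"),
            Record.update(1, "bitmap block 7", "used"),
            Record.commit(1),
            Record.update(2, "inode 13 size", "8192"));
    }

    public static void main(String[] args) {
        System.out.println(replay(sampleJournal(), new TreeMap<>()));
    }
}
```

Only the two updates from transaction 1 survive replay. Transaction 2 never reached its commit record, so the filesystem behaves as if it never started.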

ext3 introduced journaling as an optional feature (data=ordered mode journals metadata only; data blocks are written before metadata). ext4 extended it with checksums, faster recovery, and the ability to disable journaling for performance-critical partitions (at your own risk). NTFS uses a similar $LogFile. FAT32 has no journaling, which is the primary reason it's not used for system partitions.

There's a common misconception that journaling protects file data. In data=ordered mode, only metadata is journalled. If you need both metadata and data to be atomic, use data=journal mode. However, that writes every data block twice (once to journal, once to final location), doubling write I/O. For most applications, data=ordered is the right balance: data blocks are written before metadata, so if a crash occurs, the metadata either refers to fully written data or is rolled back.

I'll never forget the time a colleague said 'we don't need journaling, it's just a cache' — then a power outage corrupted the database. The fsck took 6 hours. Never skip journaling on production filesystems.

One more thing: journaling isn't free. The journal itself consumes disk space (typically 128 MB for ext4), and each metadata write adds latency. If you're running a high-throughput log server that can tolerate some loss, you might consider disabling journaling on the log partition. But for any system where data integrity matters — databases, transaction logs, stateful applications — keep it on. The trade-off is real, but the cost of recovery outweighs the performance gain.

Modern ext4 also includes metadata checksums (metadata_csum feature) to detect corruption during reads and journal replay. Always enable this feature — it adds negligible overhead but catches silent corruption from bit flips or kernel bugs.

Another production insight: journal size matters. If the journal is too small for a burst of metadata operations (e.g., bulk file extraction), it fills up and writers stall while the kernel checkpoints older transactions to make room. Default journal size is usually fine, but for very large filesystems (10TB+) consider a bigger journal. On an existing filesystem that means unmounting, removing the journal with 'tune2fs -O ^has_journal /dev/sdX', and recreating it with 'tune2fs -J size=256 /dev/sdX' (the size is in megabytes).

check_journal_status.sh (Shell)
# Check if journaling is enabled on an ext3/ext4 volume
sudo tune2fs -l /dev/sda1 | grep -i 'Filesystem features'
# Look for 'has_journal' in the output

# Show journal details, including its size
sudo dumpe2fs -h /dev/sda1 | grep 'Journal'

# Journal replay needs no manual trigger: it runs automatically at mount time
# whenever the 'needs_recovery' flag is set.
# To force a full fsck at the next boot on a systemd-based system,
# add fsck.mode=force to the kernel command line for that boot.

# Disable journaling (requires unmounted volume):
# sudo tune2fs -O ^has_journal /dev/sda1
Output
Filesystem features: has_journal, ext_attr, resize_inode, dir_index, filetype, needs_recovery, extent, 64bit, flex_bg, metadata_csum
Journal inode: 8
Journal backup: inode blocks
Journal size: 128M
Mental Model: Journaling as a Transaction Log
  • Before modifying the real inode table or bitmaps, write a 'redo' entry to the journal
  • After the journal entry is safely on disk, apply the change to the main filesystem
  • On crash recovery, replay all completed journal entries; partial entries are discarded (they never reached the main area)
  • Result: filesystem is always consistent after a crash — no need for full fsck
  • Trade-off: journal writes add latency and extra disk I/O (about 5-10% write performance hit)
Production Insight
Never use 'data=writeback' for databases — a power failure can leave partially written data pages with committed metadata.
Default 'data=ordered' is safe and fast; for maximum atomicity use 'data=journal' at the cost of doubled write I/O.
Rule: always verify journaling mode with 'tune2fs -l' — defaults are not guaranteed on all filesystems.
Also: journal size should be tuned for write-heavy workloads. Default 128MB may be insufficient for large filesystems handling many metadata operations in bursts.
Key Takeaway
Journaling makes crash recovery fast (seconds) and safe.
Without it (FAT32, old ext2), a crash forces a full fsck that can take hours.
Choose 'data=ordered' for general use; never use 'data=writeback' for databases.
Journaling Mode Selection
If: General-purpose server (web, app, file server)
Use: data=ordered (default). Safe, fast, minimal overhead.
If: Database server (MySQL, PostgreSQL, MongoDB)
Use: data=ordered, never data=writeback. Consider data=journal for extra safety of transaction logs.
If: Ephemeral data (scratch space, build caches) where a crash is acceptable
Use: Disable journaling with mkfs.ext4 -O ^has_journal to save I/O. Accept the risk.
If: Read-only filesystem (e.g., embedded system rootfs)
Use: Journaling is not needed. Format without it to save space and eliminate journal replay time.

Production Reality — When File Systems Break and How to Debug Them

File systems in production fail in predictable ways. The most common scenarios:

  1. Out of inodes: 'No space left on device' even though 'df -h' shows free blocks. Happens with millions of tiny files (e.g., Docker overlay layers, mail spools).
  2. Corrupted superblock: Power loss, bad memory, or disk firmware bugs corrupt the superblock. Without a backup, the entire filesystem is lost.
  3. Orphaned inodes: A file's inode has no directory entry (lost+found). Happens after an unclean shutdown when the directory update didn't make it to disk.
  4. Read-only remount: The kernel detects an inconsistency and remounts the filesystem read-only to prevent further damage. Caused by hardware faults or kernel bugs.
  5. Disk full but can't delete: A deleted file still held open by a process. 'du' doesn't see it but 'df' does — the blocks remain allocated until the file handle closes.

Each of these has a specific debugging pathway, covered by the commands below.

Beyond these, large unjournalled filesystems can take hours to fsck. Modern ext4 filesystems with journaling can recover in seconds, but if you have an unjournalled filesystem with billions of inodes, fsck can take days. That's why enterprise storage uses XFS or ZFS with checksums — they provide faster recovery and better resilience.

I once spent a full weekend recovering a 20TB XFS filesystem after a dual-controller failure. The lesson: never trust a single backup of the superblock — keep multiple copies and verify them regularly.

There's another failure mode that's surprisingly common: filesystem metadata corruption due to memory errors (bit flips). ECC RAM catches most of these, but non-ECC machines are vulnerable. XFS and ZFS use metadata checksums to detect corruption; ext4 gained the metadata_csum feature in Linux 3.5, with tooling support in e2fsprogs 1.43. Always check if your filesystem has checksum support enabled. Without it, a single bit flip can silently corrupt an inode, leading to data loss that only surfaces months later.

One more often overlooked issue: filesystem quotas. If you use ext4 quotas (usrquota/grpquota), a quota limit can cause 'disk full' errors even when both blocks and inodes are free. I've seen a development server grind to a halt because a user's quota was hit — and the error message pointed to inode exhaustion, not quota. Always check 'repquota -a' when debugging mysterious 'no space' errors.

Also worth noting: hardware RAID card failures can present as filesystem errors. A dying controller might corrupt writes transparently. If you see unexplained corruption on multiple filesystems, suspect the RAID controller before the disks.

debug_recovery_commands.sh (Shell)
# Scenario 1: Check inode usage
df -i /var

# Scenario 2: Backup superblock locations
# -n is a dry run: nothing is written, but the printed locations are only
# right if the options match those used when the filesystem was created.
sudo mke2fs -n /dev/sda1

# Scenario 3: Find orphaned inodes in lost+found
sudo ls -la /lost+found/
sudo find /lost+found -type f -exec file {} \;    # identify what they are

# Scenario 4: Diagnose read-only remount
journalctl -k | grep -i 'remount\|ext4-fs error'

# Scenario 5: Find processes holding deleted files
sudo lsof +L1 | grep '(deleted)'
# Then restart the process, or truncate the open descriptor via /proc/PID/fd/N
Output
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 655360 654321 1039 100% /
Backup superblock at block 32768
-rw-r--r-- 1 root root 12345 Apr 22 10:30 /lost+found/#131073
Proactive Monitoring
Set up alerts on inode usage (>80%) and filesystem remount events. Use 'smartmontools' to monitor disk health — bad sectors are often the root cause of metadata corruption. Run 'fsck -n' during maintenance windows to catch silent errors before they cause downtime.
Production Insight
The most expensive incident I debugged was inode exhaustion on a Docker host — 'df -h' showed free but apps couldn't create files.
Always run 'df -i' in parallel with 'df -h', especially on systems with many small files like containers or mail spools.
Fix: reformat with higher inode count or switch to XFS; also implement cleanup for unused layers.
Also: always verify hardware RAID controller health when filesystem corruption appears mysteriously.
Key Takeaway
'No space left on device' can mean blocks, inodes, or even directory entries are exhausted.
Always verify with both 'df -h' and 'df -i'.
Production readiness means monitoring both dimensions and setting recovery procedures for each.
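Recovering from Filesystem Failures
If: Filesystem won't mount due to superblock corruption
Use: A backup superblock. Find one with mke2fs -n (dry run), then repair with e2fsck -b <backup_block> /dev/sdX. If you mount with -o sb= instead, note that mount expects the value in 1 KB units, so backup block 32768 on a 4 KB-block filesystem becomes sb=131072.
If: Filesystem goes read-only
Use: Unmount, fsck -fy, then check hardware health (smartctl). Replace the disk if it has bad sectors.
If: df shows full, du doesn't
Use: Find deleted-but-open files with lsof +L1 and restart the process; the space is released when the file handle closes.
If: Orphaned inodes in lost+found
Use: Check file types with the file command and move anything important to its proper location.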

SSD vs HDD: How File Systems Behave on Different Storage Media

Your file system's performance and reliability depend heavily on the underlying storage technology. Hard disk drives (HDDs) and solid-state drives (SSDs) have fundamentally different characteristics that affect file system behaviour.

HDDs: Seek time is the dominant factor (~10 ms per random read). Sequential reads are fast (~200 MB/s). The file system should try to keep related blocks close together (extents, block groups). Fragmentation directly hurts performance because each fragment requires an additional seek.
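A back-of-envelope model makes the seek cost tangible. This sketch assumes one full seek per fragment at the ~10 ms and ~200 MB/s figures quoted above. Real drives reorder and cache requests, so treat it as an upper-bound illustration; the names are hypothetical.

```java
class HddReadEstimate {
    /** Estimated milliseconds to read a file: one seek per fragment plus sequential transfer time. */
    static double readMillis(long fileBytes, int fragments, double seekMs, double mbPerSec) {
        double transferMs = fileBytes / (mbPerSec * 1_000_000.0) * 1000.0;
        return fragments * seekMs + transferMs;
    }

    public static void main(String[] args) {
        long oneMiB = 1024L * 1024;
        System.out.printf("contiguous:    %.2f ms%n", readMillis(oneMiB, 1, 10.0, 200.0));
        System.out.printf("256 fragments: %.2f ms%n", readMillis(oneMiB, 256, 10.0, 200.0));
    }
}
```

Under these assumptions the contiguous read takes about 15 ms and the fragmented one about 2.57 s. The data transfer itself is only about 5 ms in both cases; almost all of the difference is seeking.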

SSDs: No mechanical seek. Random reads are as fast as sequential (typically 50–100 µs access time). Fragmentation is irrelevant for performance. However, SSDs have write endurance limits and require TRIM to inform the controller which blocks are free. Without TRIM, write performance degrades over time as the SSD must erase blocks before writing (write amplification).

Modern file systems like ext4 support the DISCARD/TRIM operation. You can enable it via mount option discard or run fstrim periodically. Be careful: frequent discards can increase latency on some SSDs. Batch trimming (fstrim -a via cron) is often preferred.

Alignment is critical on drives with 4K physical sectors. If partitions and file system blocks do not start on 4K boundaries (and ideally on erase-block boundaries for SSDs), a single logical write straddles two physical units and write performance degrades dramatically. Most modern tools (parted, mkfs.ext4) align automatically, but legacy partition tables may be misaligned.

Write amplification: Each SSD write operation may require erasing a larger block (e.g., 512 KB) even for a small 4 KB write. File systems that batch small writes (delayed allocation in ext4) reduce write amplification by grouping writes into larger contiguous chunks.
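As a rough worst-case model of that effect, the sketch below assumes every host write forces each erase block it touches to be rewritten in full. Real controllers remap pages and do much better, so this only illustrates why batching helps; the class name and the 512 KB erase block are assumptions taken from the example above.

```java
class WriteAmplification {
    /** Worst-case factor: bytes physically rewritten divided by bytes the host asked to write. */
    static double worstCase(long eraseBlockBytes, long hostWriteBytes) {
        long eraseBlocksTouched = (hostWriteBytes + eraseBlockBytes - 1) / eraseBlockBytes;
        return (double) (eraseBlocksTouched * eraseBlockBytes) / hostWriteBytes;
    }

    public static void main(String[] args) {
        long eraseBlock = 512L * 1024;
        System.out.println("single 4 KB write:     " + worstCase(eraseBlock, 4096) + "x");
        System.out.println("batched 512 KB write:  " + worstCase(eraseBlock, eraseBlock) + "x");
    }
}
```

In this model a lone 4 KB write costs 128x, while 128 of them batched into one aligned 512 KB write cost 1x. That is the intuition behind delayed allocation.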

Here's something most articles miss: on NVMe drives at high queue depth (128+), XFS can pull ahead of ext4 because its allocation groups allow more parallelism. In my own benchmarks for a database migration the gap was about 30%, so measure on your hardware before relying on that figure.

Another critical detail: the interaction between file system journal and SSD wear. Each journal write adds extra I/O, which on an SSD consumes write endurance. If you have a high-write workload on a consumer SSD (low TBW), consider raising the commit interval above its 5-second default (commit=30 in mount options) to batch journal commits, or use a separate journal device (external journal) on a more resilient SSD.

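For NVMe specifically, the PCIe link matters. A single PCIe 4.0 x4 drive tops out at roughly 7 GB/s, and one thread at low queue depth will not saturate it. Stripe geometry only comes into play when several drives are striped together (md RAID 0, LVM, hardware RAID): a file system that does not know the stripe unit and width will not spread I/O evenly across the members. On such an array, XFS with su=128k,sw=4 (matching four data drives with 128 KiB chunks) is a safer bet than ext4's default settings.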

ssd_trim_setup.sh · SHELL
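# Check if TRIM/discard is supported on your SSD
lsblk -D /dev/sda
# DISC-GRAN (discard granularity) and DISC-MAX (max discard size) should be non-zero

# Option 1: continuous TRIM via the discard mount option in /etc/fstab
# /dev/sda1  /  ext4  defaults,discard  0 1

# Option 2 (preferred for many SSDs): periodic TRIM
# With systemd, enable the bundled weekly timer:
#   sudo systemctl enable --now fstrim.timer
# Without systemd, a weekly cron entry works too:
#   @weekly fstrim -a
# Check current fstrim timer status:
systemctl status fstrim.timer

# Show logical/physical sector sizes (physical is 4096 on 4K drives)
sudo fdisk -l /dev/sda | grep 'Sector size'
# To verify that partition 1 itself is aligned:
#   sudo parted /dev/sda align-check optimal 1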
Output
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 512B 2G 0
Sector size (logical/physical): 512 bytes / 4096 bytes
TRIM Pitfall
Enabling the 'discard' mount option on ext4 causes on-the-fly TRIM commands for every delete. This can cause significant latency spikes on some SSDs. Prefer periodic fstrim via systemd timer or cron, which batches TRIM operations and performs better.
Production Insight
After migrating a database from HDD to SSD, I only saw a 2x speedup because ext4 defaults were tuned for HDDs.
Reformatting with 'mkfs.ext4 -E stride=32,stripe_width=64' and enabling periodic fstrim gave 4x write throughput.
Rule: benchmark your workload on the actual storage — don't trust defaults for SSD or NVMe.
For NVMe, consider XFS with large stripe unit for parallelism.
Key Takeaway
HDDs care about seek time and fragmentation. SSDs care about write endurance and alignment.
TRIM is not optional on modern SSDs — enable it via fstrim.
Always use filesystem settings that match your storage hardware, not the defaults.
Storage Type Decision Tree
If: You're running a database on an SSD
Use: ext4 without the discard option, with fstrim scheduled weekly. Consider XFS for high concurrency.
If: You're running a media server on HDDs
Use: ext4 with 4K blocks (the default), keep 15% free, monitor fragmentation with filefrag.
If: You're using NVMe with high queue depth
Use: XFS. On a striped array, set the stripe unit and width to match it (e.g. mkfs.xfs -d su=128k,sw=4). NVMe loves parallelism.
If: You need maximum endurance on consumer SSD
Use: Reduce write frequency: use data=writeback (accept risk), disable access time updates (noatime mount), use tmpfs for logs.

File System Mount Options and Performance Tuning — What Senior Engineers Change

Default mount options are designed for safety, not performance. In production, you'll almost always want to tune a few key parameters to reduce I/O overhead and match your workload.

Atime updates: With strict atime semantics, every read updates the access time (atime) in the inode, which turns each read into an extra metadata write. relatime (the default on modern Linux) updates atime only if it is older than mtime or ctime, or more than 24 hours old. That removes most of the penalty but still causes a write on the first read after a modification. Use noatime if you don't need access times at all (common for databases and web servers); it also implies nodiratime.
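Before changing anything, it helps to know which atime policy a mount is actually using. The function below is a small sketch that classifies a comma-separated option string; `atime_mode` is a hypothetical helper, and the option strings are the same style as the mount output shown in this article.

```shell
# atime_mode OPTIONS
# Prints which access-time option appears in a comma-separated mount option list.
# Wrapping the list in commas keeps "nodiratime" from matching "noatime".
atime_mode() {
  case ",$1," in
    *,noatime,*)     echo noatime ;;
    *,strictatime,*) echo strictatime ;;
    *,relatime,*)    echo relatime ;;
    *)               echo unspecified ;;
  esac
}

atime_mode "rw,relatime,commit=5,data=ordered"    # relatime
atime_mode "rw,noatime,commit=30,data=ordered"    # noatime

# On a live system you could feed it the options of a mount point:
#   atime_mode "$(findmnt -no OPTIONS /)"
```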

Commit interval: ext4 commits its journal every commit seconds (default 5). A lower value shrinks the window of metadata changes a crash can lose, but increases write frequency. For write-heavy workloads, raising commit to 30 or 60 seconds cuts timer-driven journal commits by about 83% or 92% respectively; commits forced by fsync() still happen immediately. Trade-off: a crash can lose up to that many seconds of metadata changes (with data=ordered, files never point at unwritten blocks, but the most recent writes may still be missing).
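The reduction is easy to sanity-check. This sketch counts only timer-driven commits and ignores commits triggered by fsync() or a full journal, so treat it as an upper bound on the saving; both function names are invented for the example.

```shell
# periodic_commits_per_hour INTERVAL_SECONDS
periodic_commits_per_hour() {
  echo $(( 3600 / $1 ))
}

# percent_fewer OLD_INTERVAL NEW_INTERVAL
# Integer percentage drop in timer-driven commits (integer division truncates).
percent_fewer() (
  old=$(periodic_commits_per_hour "$1")
  new=$(periodic_commits_per_hour "$2")
  echo $(( (old - new) * 100 / old ))
)

periodic_commits_per_hour 5    # 720
periodic_commits_per_hour 30   # 120
percent_fewer 5 30             # 83
percent_fewer 5 60             # 91 (91.67 truncated)
```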

Write barriers: Ensure that journal and metadata writes reach persistent storage in the correct order. Usually safe to disable on battery-backed RAID controllers (barrier=0), but dangerous on single SSDs or HDDs, where a volatile write cache can reorder writes and a power loss then leaves the journal inconsistent. Default is on.

Data mode: Already covered in journaling section: data=ordered for most, data=journal for extreme consistency, data=writeback for performance at risk.

Delayed allocation: Enabled by default in ext4. Groups small writes into larger contiguous chunks before flushing, which reduces fragmentation and write amplification on SSDs. The cost is that data still sitting in the page cache is lost if the system crashes before it is flushed. For most workloads the performance gain outweighs that risk, and applications that need durability should call fsync().

Production tip: On a busy database server, I once cut I/O wait by 30% just by adding noatime,nodiratime,commit=30 to the mount options. The default 5-second commit was causing a journal flush storm on every transaction batch.

Tuning example: For a MySQL data directory on ext4, typical mount options: rw,noatime,nodiratime,data=ordered,commit=30,barrier=1. For an SSD with frequent fstrim, do not use discard option; use periodic fstrim.
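When reviewing an fstab line like the one above, a tiny linter can flag the options this article treats as risky. This is a sketch, not a complete policy: `audit_mount_opts` is a hypothetical helper, and the three rules simply mirror the warnings in this section.

```shell
# audit_mount_opts OPTIONS
# Prints one warning per risky option in a comma-separated mount option list.
audit_mount_opts() (
  set -f      # no filename expansion while splitting the list
  IFS=,
  for opt in $1; do
    case "$opt" in
      discard)
        echo "discard: online TRIM on every delete, prefer periodic fstrim" ;;
      barrier=0|nobarrier)
        echo "$opt: write barriers off, only safe with a protected write cache" ;;
      data=writeback)
        echo "data=writeback: files may contain stale data after a crash" ;;
    esac
  done
)

audit_mount_opts "rw,noatime,nodiratime,data=ordered,commit=30,barrier=1"   # prints nothing
audit_mount_opts "defaults,discard,data=writeback,barrier=0"                # three warnings
```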

Benchmark before and after: Use fio to measure I/O latency and throughput with different options. Savings of 10-20% in write I/O are a reasonable expectation when switching from defaults to tuned options, but the number depends heavily on the workload, so measure your own.

One more option often overlooked: 'nodelalloc' to disable delayed allocation. This can be useful for databases that need immediate write ordering (e.g., PostgreSQL's full-page writes). But on most workloads, delayed allocation improves performance significantly — test both.

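Another senior trick: 'noauto_da_alloc' turns off the ext4 heuristic that forces delayed-allocation blocks to disk when a file is replaced via rename or truncate. That avoids the forced flushes in certain database workloads, but it's risky: applications that skip fsync() can end up with zero-length files after a crash. Only change it if you understand the exact consequences.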

mount_options_tuning.sh · SHELL
# Check current mount options
mount | grep ' / '

# Remount with noatime and longer commit interval (temporary, lost on reboot)
sudo mount -o remount,noatime,commit=30 /

# Make permanent: edit /etc/fstab
# Example line for /dev/sda1 (noatime already implies nodiratime):
# /dev/sda1 / ext4 defaults,noatime,nodiratime,commit=30,data=ordered,barrier=1 0 1

# Verify new options
mount | grep ' / '

# Run benchmark before/after with fio (install if needed).
# Point --filename at the file system under test; /tmp is often a tmpfs.
# --direct=1 bypasses the page cache so the device itself is measured.
sudo fio --name=test --rw=randwrite --bs=4k --size=1G --runtime=30 --time_based --direct=1 --filename=/var/tmp/fiotest --group_reporting
Output
/dev/sda1 on / type ext4 (rw,relatime,commit=5,data=ordered)
/dev/sda1 on / type ext4 (rw,noatime,commit=30,data=ordered)
Commit Interval Risk
Increasing commit to 60 seconds means you can lose up to 60 seconds of metadata updates in a crash. With data=ordered, data blocks are written before the metadata that points at them, so files stay consistent, but recent file creations, deletions, and renames may not survive. Acceptable for bulk data loads, risky for transactional systems. Check the journal size with dumpe2fs -h to make sure it can absorb the larger transactions.
Production Insight
Strict atime adds one inode write per read; the default relatime removes most of those, and noatime removes the rest, which helps SSD endurance on read-heavy workloads.
Increasing commit from 5 to 30 seconds reduces timer-driven journal commits by 83%, a dramatic improvement on write-bound workloads.
Write barriers cost little on modern hardware; disable them only when you understand the power-loss guarantees of your storage stack.
Test delayed allocation vs nodelalloc for database workloads; the difference can be 20% on write throughput.
Key Takeaway
Default mount options are safe but not optimal.
'noatime' is a cheap, low-risk change: it eliminates the atime writes that relatime still performs.
Balance commit interval against crash safety: 30 seconds is a good starting point for most workloads.
Mount Option Decision Guide
If: General server (web, app, CI) with no access time requirements
Use: noatime,nodiratime,commit=30,data=ordered,barrier=1
If: Database server requiring maximum consistency
Use: noatime,nodiratime,commit=5 (or 10),data=ordered,barrier=1. Consider data=journal if you use replication and want to avoid doublewrite.
If: Battery-backed RAID with journal on separate device
Use: noatime,nodiratime,commit=30,data=ordered,barrier=0 (barriers disabled because the cache battery protects in-flight writes). Use an external journal on SSD.
If: Ephemeral/log partition where data loss is acceptable
Use: noatime,nodiratime,commit=60,data=writeback,barrier=0. Max throughput, minimal safety.

File System Security & Permissions: Why POSIX ACLs and Extended Attributes Matter

File systems enforce access control through permissions, capabilities, and extended attributes. On Linux, the standard Unix rwx model gives owner/group/world sets. But production environments need finer control: POSIX Access Control Lists (ACLs) allow specifying permissions for individual users or groups, and extended attributes (xattr) store metadata like file capabilities or SELinux labels.

POSIX ACLs: Set with setfacl and viewed with getfacl. They add a logical ACL entry to the inode's extended attributes. Useful for shared directories where one user needs read and another write. However, ACLs increase metadata size and can slow down directory listing operations.
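One detail that trips people up is the ACL mask: the permissions a named user or group actually gets are the entry's permissions limited by the mask (getfacl marks such entries with an "#effective:" comment). The sketch below reproduces that rule on rwx triplets; `effective_perms` is a made-up helper, not part of the acl tools.

```shell
# effective_perms ENTRY MASK
# A permission bit applies only if it is set in both the ACL entry and the mask.
effective_perms() (
  entry=$1 mask=$2 out=""
  for i in 1 2 3; do
    e=$(printf '%s' "$entry" | cut -c"$i")
    m=$(printf '%s' "$mask" | cut -c"$i")
    if [ "$e" != "-" ] && [ "$e" = "$m" ]; then
      out="$out$e"
    else
      out="$out-"
    fi
  done
  echo "$out"
)

effective_perms rwx rwx   # rwx  (mask does not restrict anything)
effective_perms rwx r-x   # r-x  (mask removes write)
effective_perms rw- r--   # r--

# The mask itself can be changed with, for example: setfacl -m m::r-x /var/www
```

So a user:deploy:rwx entry grants write access only while the mask also contains w.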

Extended attributes: Namespace-stored metadata (user, trusted, security). Used by SELinux for security contexts, by attr for custom attributes. They are stored in the inode if small enough, otherwise in a separate block, impacting space and performance.

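Immutable files: The chattr +i command on ext4 sets the immutable attribute, which blocks modification, deletion, and renaming even by root until the flag is cleared with chattr -i. Critical for system binaries and key configuration files; for log files use the append-only flag (chattr +a) instead, since an immutable log cannot be written. My production lesson: a misbehaving process once tried to delete its own binary, and the immutable flag is what protected it.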

File capabilities: Instead of setuid root, you can grant specific capabilities to a binary (e.g., CAP_NET_BIND_SERVICE to bind to low ports). This reduces the attack surface. But capabilities are stored in extended attributes and can be stripped by file copies or backups.

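Production insight: A common mistake is assuming local ACLs protect an NFS export the same way they protect local access. With the default AUTH_SYS security, the NFS server trusts whatever UID and GIDs the client presents, so anyone with root on a client can impersonate a user your ACL allows. Suddenly your fine-grained ACLs are only as strong as your least trusted client. Use Kerberos (sec=krb5) where that matters, and always review exports with exportfs -v and monitor with nfsstat.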

SELinux contexts: On Red Hat systems, the filesystem stores SELinux labels in extended attributes. A relabel operation (restorecon -R /) can take hours on large volumes. We once had a security audit fail because a backup restored files without SELinux contexts — the entire web server was inaccessible.

Performance trade-off: Every ACL or extended attribute adds to inode metadata size. For directories with thousands of files, listing with getfacl * can be slower than expected. Use it only where necessary.

Another hidden trap: chattr +a (append only) is great for logs, but the kernel enforces it when a file is opened, not on every write(). New opens must use O_APPEND (rsyslog does, so it keeps working) and opens for plain writing fail with EPERM, but a descriptor that was already open for writing before the flag was set can typically still overwrite existing data. Set the flag before the writer starts, and always test your implementation.

Also note: file capabilities (setcap) are lost if the file is copied to a filesystem that doesn't support xattr (like NFSv3, or FAT). Always store critical binaries on native ext4/XFS with xattr support.

set_acl.sh · SHELL
# Set ACL: grant user 'deploy' read, write, and traverse (rwx) on /var/www
sudo setfacl -m u:deploy:rwx /var/www

# View ACL
sudo getfacl /var/www

# Set immutable attribute on critical configuration
# (password changes fail until you clear it with chattr -i)
sudo chattr +i /etc/shadow

# Give binary capability to bind to port 80
# (setcap needs the real file, not a symlink)
sudo setcap 'cap_net_bind_service=+ep' /usr/bin/node

# Restore SELinux contexts recursively
sudo restorecon -R /var/www/html
Output
# file: var/www
# owner: root
# group: root
user::rwx
user:deploy:rwx
group::r-x
mask::rwx
other::r-x
# chattr, setcap, and restorecon print nothing on success
Production Insight
ACLs add metadata overhead — each ACL entry consumes extra inode space. For directories with 10k+ files, listing with getfacl can be noticeably slower.
With AUTH_SYS, an NFS server trusts client-supplied UIDs, so local ACLs are only as strong as your clients. Restrict exports to trusted hosts or use Kerberos.
Rule: use ACLs sparingly; prefer groups for most permission needs.
Also: file capabilities set via setcap are lost on copy to non-xattr filesystems — plan deployments accordingly.
Key Takeaway
Permissions go beyond rwx — ACLs, capabilities, and extended attributes provide finer control.
But they come with metadata overhead and NFS export gotchas.
Use immutable flag (chattr +i) for critical system files.
● Production incident · POST-MORTEM · severity: high

The Day an ext3 Superblock Corruption Took Down a Payment Gateway

Symptom
Server failed to mount root partition after reboot. dmesg showed 'EXT3-fs: unable to read superblock'. The boot process did not fall back to a backup superblock on its own.
Assumption
The team assumed that ext3 with default options always journals metadata, but the filesystem was created with 'mkfs.ext3 -O ^has_journal' (no journal) to save space on the small root partition.
Root cause
The primary superblock sits at byte offset 1024 of the partition, is 1024 bytes long, and holds the file system's critical parameters and pointers. A power loss during a metadata write left it corrupted. Without a journal, no transaction log existed to replay or roll back.
Fix
Booted from a rescue disk, located the backup superblock at block 8193 (the first backup on a file system with 1 KiB blocks; with 4 KiB blocks it is 32768), repaired with 'e2fsck -b 8193 /dev/sda1', then remounted and added a journal with 'tune2fs -j /dev/sda1'.
Key lesson
  • Always enable journaling on production filesystems — the performance hit (5-10% write overhead) is worth crash safety.
  • Know your filesystem's backup superblock locations and how to recover from a corrupted primary superblock.
  • Make regular dumps of superblock information using 'dumpe2fs -h' and store them off-box.
  • Practice recovery scenarios in staging so you don't learn the procedure during an outage.
  • Never assume default options are safe — verify 'has_journal' feature on every new filesystem with 'tune2fs -l'.
  • Consider using ext4 or XFS for production — ext3's journal is optional and easily missed.
  • Document backup superblock addresses (block 8193, 32768, etc.) in your runbook before a crash happens.
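For the runbook entry in the last lesson, the usual backup locations can be computed rather than memorised. The sketch below assumes the mke2fs defaults (sparse_super enabled, 8 × block-size blocks per group) and a file system large enough to contain those groups; both function names are hypothetical, and the authoritative list for a given disk still comes from the file system itself.

```shell
# backup_superblocks BLOCK_SIZE [COUNT]
# Prints the first COUNT backup superblock block numbers (default 5).
# With sparse_super, backups live in group 1 and in groups that are powers of 3, 5, and 7.
backup_superblocks() (
  block_size=$1
  count=${2:-5}
  blocks_per_group=$(( block_size * 8 ))
  first_data_block=0
  if [ "$block_size" -eq 1024 ]; then first_data_block=1; fi
  for group in 1 3 5 7 9 25 27 49; do
    if [ "$count" -le 0 ]; then break; fi
    echo $(( group * blocks_per_group + first_data_block ))
    count=$(( count - 1 ))
  done
)

# mount_sb_value BLOCK_NUMBER BLOCK_SIZE
# mount's sb= option counts in 1 KiB units, not file system blocks.
mount_sb_value() {
  echo $(( $1 * $2 / 1024 ))
}

backup_superblocks 1024 3    # 8193, 24577, 40961
backup_superblocks 4096 3    # 32768, 98304, 163840
mount_sb_value 32768 4096    # 131072
```

The 1 KiB result matches the block 8193 used in this incident; on a 4 KiB file system the first backup is block 32768.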
Production debug guide · Symptom → Action: quick field guide for the most common file system failures · 6 entries
Symptom · 01
Disk partition won't mount: 'mount: wrong fs type, bad option, bad superblock'
Fix
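Run 'dmesg | tail -20' to see the exact error. Then try 'fsck -n /dev/sdX' (no repair) to assess damage. Locate backup superblocks using 'mke2fs -n /dev/sdX' (the -n flag makes it a dry run; without it the command would reformat the disk) and mount with 'mount -o ro,sb=<backup_block> /dev/sdX /mnt'. Note that sb= is given in 1 KiB units, so block 32768 on a file system with 4 KiB blocks is sb=131072. If that fails, check whether the partition table is intact using 'parted /dev/sdX print'.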
Symptom · 02
df reports disk full, but du shows much less used space
Fix
A process may have deleted a file while it was still open, holding the space. Run 'lsof +L1' to find orphaned file handles. Kill the process or restart it to free the blocks. Also check for unlinked inodes via 'debugfs -R "ls -d" /dev/sdX'.
Symptom · 03
Filesystem enters read-only mode unexpectedly (EXT4-fs error)
Fix
Check syslog for 'EXT4-fs (sda1): remounting filesystem read-only'. This is a kernel safety mechanism. Unmount, run 'fsck -fy /dev/sdX' to fix corruption, then remount. Identify the root cause — bad disk sectors (smartctl), power issues, or hardware memory errors.
Symptom · 04
Directory listing hangs or returns 'Input/output error'
Fix
The directory's inode or data block is damaged. If the hardware is suspect, first copy the partition to a fresh disk with 'ddrescue' and repair the copy. Otherwise use 'fsck -c -c' on the unmounted file system to scan for and mark bad blocks. Never run fsck on a mounted filesystem.
Symptom · 05
Inode exhaustion: 'No space left on device' but df -h shows free blocks
Fix
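Run 'df -i' to check inode usage. If inodes are at 100%, delete old files (especially small ones in spool directories). For a permanent fix, reformat with more inodes, either a lower bytes-per-inode ratio (mkfs.ext4 -i 4096, one inode per 4 KiB) or an explicit count (-N), or switch to XFS, which allocates inodes dynamically.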
Symptom · 06
File system is mounted but writes fail with 'Read-only file system'
Fix
Could be a hardware RAID controller with write-back cache that lost power. Check 'dmesg' for 'forcing read-only'. Reboot, run 'fsck -fy' after unmount, and verify RAID controller battery health. Disable write-back cache until battery is replaced.
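For the first symptom in this guide, a quick sanity check is to see whether the ext magic number is still where it should be. The ext2/3/4 superblock starts at byte 1024 and stores the magic 0xEF53 (little-endian, so bytes 53 ef) at offset 56 within it, i.e. byte 1080 of the partition. The sketch below runs against a scratch file so it needs no root; `has_ext_magic` is a hypothetical helper, and a present magic number does not prove the rest of the superblock is intact.

```shell
# has_ext_magic IMAGE_OR_DEVICE
# Succeeds if bytes 1080-1081 hold the ext2/3/4 magic number (53 ef on disk).
has_ext_magic() {
  magic=$(dd if="$1" bs=1 skip=1080 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "53ef" ]
}

# Demo on a scratch file instead of a real device.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=4 2>/dev/null
has_ext_magic "$img" || echo "blank image: no ext superblock found"

# Write the two magic bytes (octal 123 = 0x53, octal 357 = 0xEF) at offset 1080.
printf '\123\357' | dd of="$img" bs=1 seek=1080 conv=notrunc 2>/dev/null
has_ext_magic "$img" && echo "magic present at offset 1080"
rm -f "$img"
```

Against a real partition you would run it read-only, for example `sudo sh -c '...'` with /dev/sdX as the argument, before deciding whether to reach for a backup superblock.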
★ Quick Debug Cheat Sheet: File System Troubleshooting
Run these commands in order when a file system misbehaves. Each command targets a specific layer, from disk health to metadata consistency.
Can't mount or read disk
Immediate action
Stop all I/O to the device. Do NOT force-mount.
Commands
dmesg | grep -i 'fs\|superblock\|i/o error\|recovery'
smartctl -H /dev/sdX (check disk health)
Fix now
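If the disk is healthy, use a backup superblock: run mke2fs -n /dev/sdX to get block numbers, then mount -o ro,sb=8193 /dev/sdX /mnt (sb= is in 1 KiB units, so block 32768 on a 4K file system is sb=131072)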
df reports full but du disagrees
Immediate action
Identify the process holding deleted files.
Commands
lsof +L1 | grep '(deleted)'
fuser -m /mountpoint
Fix now
Kill the process (kill -9 PID) or restart the service to release space
Filesystem goes read-only (EXT4)
Immediate action
Do not write anything. Unmount safely if possible.
Commands
mount -o remount,ro /dev/sdX (force read-only if not already)
umount /dev/sdX
Fix now
fsck -fy /dev/sdX; then mount again. If errors persist, replace hardware.
Directory or file access returns EIO
Immediate action
Check if the disk has bad sectors.
Commands
smartctl -a /dev/sdX | grep Reallocated_Sector_Ct
badblocks -sv /dev/sdX (non-destructive read-only test)
Fix now
Use ddrescue to clone the partition, then run fsck on the clone. Replace disk if reallocation count is high.
Accidentally deleted a critical file
Immediate action
Stop all writes to the partition immediately. Remount read-only if possible.
Commands
debugfs -R 'lsdel' /dev/sdX (ext3/4 only — list recently deleted inodes)
extundelete /dev/sdX --restore-file /path/to/file
Fix now
If restoring a single file fails, try extundelete --restore-all (it works from the journal, so results vary). If that also fails, restore from backup.
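For the EIO entry in this cheat sheet, the reallocated sector count can be pulled out of the SMART attribute table with a one-line filter. This is a sketch under a few assumptions: the drive is ATA/SATA (NVMe drives report different health fields), the table uses the usual ten-column layout with the raw value last, and the sample row is illustrative rather than captured from a real disk. `reallocated_sectors` is a made-up helper name.

```shell
# reallocated_sectors
# Reads SMART attribute table text on stdin and prints the raw Reallocated_Sector_Ct value.
reallocated_sectors() {
  awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Illustrative sample row, not taken from a real disk:
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       24'

count=$(printf '%s\n' "$sample" | reallocated_sectors)
if [ "$count" -gt 0 ]; then
  echo "warning: $count reallocated sectors, clone the disk before repairing"
fi

# Live use (needs root and smartmontools):
#   sudo smartctl -A /dev/sdX | reallocated_sectors
```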