ext3 Superblock Corruption — Payment Gateway Outage
Disk partition won't mount: bad superblock — ext3 superblock corruption from power loss caused a payment gateway outage.
- A file system organises raw storage into files and directories using metadata like inodes and allocation tables
- Core components: superblock (globals), inode table (per-file metadata), data blocks (content), directory entries (name-to-inode maps)
- FAT32 uses linked list cluster chains; ext4 uses extents and journaling, reducing seeks up to 80%
- Production gotcha: abrupt power loss during a metadata write can orphan inodes — journaling prevents this on ext4, but not on FAT32
- Biggest mistake: assuming file delete frees data — it only marks blocks as free; data remains recoverable until overwritten
- Forensic reality: Tools like extundelete can recover deleted files minutes after deletion if no new writes occurred
Imagine your OS is a giant library. A file system is the librarian's cataloguing system — it decides which shelf each book goes on, writes a card in the index so anyone can find it later, and tracks which shelves are empty. Without the librarian, books would be dumped on the floor in a pile and nobody could find anything. Your hard drive is that same pile of storage space, and the file system is what turns chaos into an organised, searchable collection.
Every time you hit Ctrl+S, drag a photo into a folder, or install an app, you're trusting a file system to keep that data safe and findable. File systems are one of those invisible layers of the OS that almost nobody thinks about — until something goes wrong and years of photos vanish. Understanding how they work isn't just academic; it's the difference between a developer who debugs a corrupted disk by instinct and one who panics and Googles for three hours.
The core problem a file system solves is deceptively simple: a hard drive or SSD is just a flat sequence of bytes — millions of them, with no inherent meaning. The file system imposes structure on that flat sequence. It records where each file starts and ends, what it's called, who owns it, when it was last modified, and which blocks of storage are free for new data. Without this layer, the OS couldn't tell the difference between a Python script and a JPEG.
By the end of this article you'll understand the internal structure of a file system (directories, inodes, blocks, and allocation tables), why different file systems like FAT32, NTFS, and ext4 exist and when each one is the right choice, what actually happens on disk when you create or delete a file, and the most common mistakes engineers make when reasoning about file systems under load or across platforms. You'll also walk away with concrete talking points for any OS or systems design interview.
Here's the thing: when your filesystem goes down, every service that depends on it goes down too. The debug commands in this article are the same ones I've used to recover production systems at 2 AM, and the recovery procedures can turn a potential hours-long outage into a ten-minute fix. I've lived that outage: the first time was on a production database, at 2 AM on a Saturday.
What Is a File System in an OS?
A file system is the layer of the operating system that manages how data is stored, retrieved, organized, and named on a storage device. Without it, the OS would see the disk as a single flat array of blocks: no structure, no names, no attributes.
The key insight: the file system is a mapping between the logical file structure (path, name, size, timestamps) and the physical blocks on disk. This abstraction allows applications to work with files without knowing the underlying hardware geometry. It also enforces security (permissions), concurrency (locking), and consistency (journalling).
The real power of a filesystem is the metadata abstraction. Without it, every application would need to know the exact block layout of the disk. The filesystem provides a logical view — paths, sizes, permissions — that the OS and apps can rely on. That abstraction is what makes it possible to move a file between different storage devices without the application even noticing.
But here's what you don't see in textbooks: the abstraction leaks. When a database writes directly to raw block devices, it bypasses the filesystem entirely — because the filesystem's guarantee of ordered writes (data=ordered) isn't enough for some workloads. That's a real production trade-off: performance vs. safety.
Most articles stop there. But here's the part that matters in practice: the abstraction leak isn't just theoretical. PostgreSQL's full-page writes and InnoDB's doublewrite buffer exist precisely because the filesystem's atomic write guarantee is at best per-block, not per-page. A 16KB database page spans four 4KB filesystem blocks, so a crash partway through the write leaves a torn page: some blocks new, some old. Your database survives because it adds its own consistency layer on top. That's also why you run a database on a properly journaled filesystem and still leave the database's own protections switched on.
I've seen teams blame the filesystem for 'corrupt data' when the real culprit was a misconfigured RAID controller with write-back cache enabled. The filesystem reported success to the application, but the data sat in the controller's volatile cache. When power dropped, the cache vanished. Moral: understand your entire I/O stack, not just the filesystem.
Anatomy of a File System — Blocks, Inodes and Directories
Every file system organizes storage into fixed-size blocks (typically 4 KB). The crucial metadata structures are the superblock, inode table, and directory entries.
- Superblock: Stores global info like filesystem type, block size, number of blocks, number of free inodes. If the primary superblock is corrupted, the filesystem won't mount until it is repaired from a backup copy.
- Inode (index node): Each file and directory has one inode. It holds metadata (size, permissions, timestamps) and pointers to the data blocks. Inodes are stored in a reserved area of the partition.
- Directory: A special file whose data block is a list of (name, inode number) pairs. The '.' and '..' entries are stored here.
The inode does not store the file name. The name lives only in the directory entry. This means a file can have multiple names (hard links) — each pointing to the same inode. Moving a file within the same filesystem simply changes the directory entry, not the inode.
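To make the name-versus-inode split concrete, here is a small sketch that creates one file, gives it a second name with a hard link, and then removes the first name. The file names are hypothetical and everything happens in a throwaway temp directory; it assumes a filesystem that supports hard links (any native Linux filesystem does).

```python
import os
import tempfile


def hard_link_demo() -> dict:
    """Create one file with two names and report what the inode says."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "report.txt")     # hypothetical name
        alias = os.path.join(tmp, "report-link.txt")   # second name, same inode

        with open(original, "w") as fh:
            fh.write("quarterly numbers\n")

        os.link(original, alias)  # adds a directory entry; no data is copied

        st_original, st_alias = os.stat(original), os.stat(alias)
        same_inode = st_original.st_ino == st_alias.st_ino
        links_before = st_original.st_nlink

        os.remove(original)  # removes one name only; the inode survives
        with open(alias) as fh:
            content = fh.read()
        links_after = os.stat(alias).st_nlink

    return {
        "same_inode": same_inode,
        "links_before": links_before,
        "links_after": links_after,
        "content_after_unlink": content,
    }


if __name__ == "__main__":
    print(hard_link_demo())
```

Both names report the same inode number, and the data only becomes eligible for reuse once the link count drops to zero and no process still has the file open.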
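The classic inode (ext2/ext3) contains 12 direct block pointers, followed by single, double, and triple indirect pointers. This design allows small files to be reached with one inode read, while large files use progressively deeper indirection. The inode's 60-byte i_block field holds all 15 pointers (12 direct plus 3 indirect) at 4 bytes each. With 4 KB blocks, files up to 48 KB fit entirely within the direct pointers, so reading them requires only the inode lookup. ext4 keeps this layout for compatibility, but when the extents feature is enabled it reuses the same 60 bytes for an extent header and up to four extents.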
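The pointer scheme is easier to reason about with numbers. The sketch below assumes the classic layout (12 direct pointers, 4-byte block pointers) and works out how much data each level can address and which level a given file block lands in. It is a theoretical ceiling only; real filesystems hit other limits first.

```python
def max_file_size(block_size: int = 4096, pointer_size: int = 4, direct: int = 12) -> int:
    """Bytes addressable through direct + single/double/triple indirect pointers."""
    ptrs_per_block = block_size // pointer_size
    blocks = direct + ptrs_per_block + ptrs_per_block ** 2 + ptrs_per_block ** 3
    return blocks * block_size


def indirection_level(file_block: int, block_size: int = 4096,
                      pointer_size: int = 4, direct: int = 12) -> int:
    """Return 0 for a direct block, 1/2/3 for single/double/triple indirect."""
    ptrs_per_block = block_size // pointer_size
    if file_block < direct:
        return 0
    file_block -= direct
    for level in (1, 2, 3):
        span = ptrs_per_block ** level  # blocks reachable at this level
        if file_block < span:
            return level
        file_block -= span
    raise ValueError("block index beyond triple-indirect range")


if __name__ == "__main__":
    print(max_file_size())                 # 4402345721856 bytes, about 4 TiB
    print(indirection_level(11))           # last direct block
    print(indirection_level(12))           # first single-indirect block
    print(indirection_level(12 + 1024))    # first double-indirect block
```

Each extra level costs one more block read before the data block itself, which is the overhead extents were designed to avoid.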
Here's a production nuance: if you have millions of small files (think Docker overlay layers or mail spools), you'll exhaust inodes long before the disk fills. I've seen 'No space left on device' bring down a mail server while 'df -h' showed 40% free. Always monitor 'df -i'.
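If you'd rather alert on this than discover it during an outage, the same numbers that 'df -h' and 'df -i' print are available through statvfs. This is a minimal sketch using Python's os.statvfs; the helper names and the dictionary keys are my own, and the thresholds you alert on are up to you.

```python
import os


def percent_used(total: int, free: int) -> float:
    """Percentage used; returns 0.0 when the filesystem reports no total."""
    if total <= 0:
        return 0.0  # some filesystems report 0 inodes (no fixed inode table)
    return (total - free) / total * 100


def space_and_inode_usage(path: str = "/") -> dict:
    """Report block usage and inode usage for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        "block_pct": percent_used(st.f_blocks, st.f_bfree),
        "inode_pct": percent_used(st.f_files, st.f_ffree),
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
    }


if __name__ == "__main__":
    usage = space_and_inode_usage("/")
    print(f"blocks used: {usage['block_pct']:.1f}%")
    print(f"inodes used: {usage['inode_pct']:.1f}%")
```

A filesystem sitting at 40% blocks and 100% inodes is exactly the mail-server scenario described above.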
Another hidden detail: the superblock isn't the only copy. The primary superblock sits at byte offset 1024, and ext2/3/4 keep backup copies at the start of selected block groups: for example block 8193 on a filesystem with 1 KB blocks, or block 32768 with 4 KB blocks. When the primary superblock corrupts, you can recover using a backup. But many engineers don't know where their backups are until they need them. That's the point of the production incident in this article: know your backup block numbers before a crash.
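Where do numbers like 8193 and 32768 come from? The sketch below reproduces the usual layout under stated assumptions: the sparse_super feature is on (backups live in groups 1 and powers of 3, 5, and 7), each group holds 8 x block_size blocks, and the first data block is 1 for 1 KB blocks and 0 otherwise. Treat it as a way to understand the pattern; for a real filesystem, read the locations from 'dumpe2fs' output rather than trusting arithmetic.

```python
def backup_groups(group_count: int) -> list:
    """Block groups holding backup superblocks under sparse_super."""
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return sorted(groups)


def backup_superblock_blocks(total_blocks: int, block_size: int = 4096) -> list:
    """Block numbers of backup superblocks, assuming default group sizing."""
    blocks_per_group = 8 * block_size           # one bitmap block maps this many
    first_data_block = 1 if block_size == 1024 else 0
    usable = total_blocks - first_data_block
    group_count = -(-usable // blocks_per_group)  # ceiling division
    return [first_data_block + g * blocks_per_group
            for g in backup_groups(group_count)]


if __name__ == "__main__":
    print(backup_superblock_blocks(1_000_000, block_size=4096))
    print(backup_superblock_blocks(100_000, block_size=1024))
```

With 4 KB blocks the list starts 32768, 98304, 163840; with 1 KB blocks it starts 8193, 24577, 40961. Those are the values worth writing into a runbook.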
What about extent trees? In ext4, an extent is a contiguous range of blocks. The inode stores up to 4 extents inline; for files with more than 4 extents, a tree of extent nodes is used. This reduces metadata overhead dramatically — a 16 MB file stored in one extent requires only one entry in the inode, not thousands of individual block pointers. This is why ext4 handles large files much better than ext3 without extent support.
One more internal detail: the directory structure itself can be a hash tree (htree) in ext4, allowing fast lookups even in directories with millions of entries. Without htree, a linear scan of directory entries would be O(n) per lookup. ext4's htree is a B-tree variant that gives O(log n) lookups. This is why re-creating filesystems with 'dir_index' feature matters for mail servers and image repositories.
- Directory entries: name -> inode number (e.g., 'hostname' -> 131073)
- Inode table: inode number -> metadata + block pointers (e.g., inode 131073 points to blocks 100-102)
- Data blocks: the actual bytes of the file
- This separation means you can have multiple names (hard links) pointing to the same inode — deleting one name just removes the directory entry, not the inode
- The superblock is the 'globals' dict — without it, you can't parse anything else
File Allocation Strategies — Contiguous, Linked and Indexed
How does the file system map file offsets to disk blocks? Three classic strategies:
- Contiguous allocation: Each file occupies consecutive blocks. Simple and fast for sequential reads (single seek), but suffers from external fragmentation: as files are created and deleted, free space gets scattered, and a file can't grow once its neighbours are taken. Used by ISO 9660 and other write-once formats.
- Linked allocation: Each block contains a pointer to the next block. No external fragmentation, but random access is slow: the pointer lives in the block data, so reaching block N means reading every block before it. FAT32 uses a variant where the File Allocation Table (FAT) stores the chain separately, so the chain can be followed in the cached table without touching the data blocks.
- Indexed allocation: The inode contains a list of direct block pointers, plus indirect, double indirect, and triple indirect pointers for large files. This gives O(1) access to any block via a few index reads. ext4 and NTFS use indexed allocation with extent trees (ranges of contiguous blocks) to reduce pointer overhead.
Modern file systems combine these: ext4 uses extents (contiguous runs of blocks) tracked in an indexed structure, giving the best of both worlds.
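The linked strategy is the easiest one to model. Here is a toy FAT, assuming a plain dictionary as the table and -1 as a stand-in for the real end-of-chain marker; the cluster numbers are made up. It shows why random access costs one table hop per cluster.

```python
END_OF_CHAIN = -1  # stand-in for FAT32's end-of-chain marker


def cluster_chain(fat: dict, start: int) -> list:
    """Follow the chain from the first cluster to the end marker."""
    chain, seen = [], set()
    cluster = start
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("corrupt FAT: chain loops back on itself")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


def cluster_for_offset(fat: dict, start: int, offset: int,
                       cluster_size: int = 4096) -> int:
    """Find the cluster holding a byte offset: one hop per cluster skipped."""
    cluster = start
    for _ in range(offset // cluster_size):
        cluster = fat[cluster]
        if cluster == END_OF_CHAIN:
            raise ValueError("offset is past the end of the file")
    return cluster


if __name__ == "__main__":
    # A four-cluster file scattered across the disk: 2 -> 5 -> 9 -> 3
    fat = {2: 5, 5: 9, 9: 3, 3: END_OF_CHAIN}
    print(cluster_chain(fat, 2))
    print(cluster_for_offset(fat, 2, 8192))
```

An extent-based filesystem answers the same question with a single range comparison, which is the practical difference on large files.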
Here's the real gotcha: on spinning disks, a heavily fragmented file can kill read throughput. I once debugged a log parser that took 10x longer on an HDD than expected — the log files were fragmented into thousands of 4KB chunks across the platter. 'filefrag /var/log/syslog' showed 2,347 extents for a 1GB file. The fix was to defragment or switch to ext4 which merges extents better.
And here's something most docs skip: the extent tree's depth limits. With 4KB blocks, a single extent covers at most 32,768 blocks (128 MB), and one 4KB leaf node holds up to 340 extents, so a single leaf can map roughly 42 GB of data. Most files never go beyond the inline extents. But if you have database files that are terabytes large with hundreds of extents, the tree grows, and that adds latency to each metadata lookup. XFS handles this more gracefully with B+ trees for extents.
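The capacity of an extent node falls out of the on-disk sizes. This sketch assumes the ext4 extent format as I understand it: a 12-byte header, 12-byte entries, and at most 32,768 blocks per initialized extent. The constant names are mine.

```python
EXTENT_HEADER_BYTES = 12       # header at the start of every extent node
EXTENT_ENTRY_BYTES = 12        # one extent (or index) entry
MAX_BLOCKS_PER_EXTENT = 32768  # limit for an initialized extent
INODE_I_BLOCK_BYTES = 60       # space inside the inode itself


def entries_per_node(node_bytes: int) -> int:
    """How many extent entries fit after the header."""
    return (node_bytes - EXTENT_HEADER_BYTES) // EXTENT_ENTRY_BYTES


def leaf_capacity_bytes(block_size: int = 4096) -> int:
    """Most data a single full leaf block can map."""
    return entries_per_node(block_size) * MAX_BLOCKS_PER_EXTENT * block_size


if __name__ == "__main__":
    print(entries_per_node(INODE_I_BLOCK_BYTES))  # 4 inline extents
    print(entries_per_node(4096))                 # 340 extents per 4 KB leaf
    print(leaf_capacity_bytes() / 2 ** 30)        # 42.5 (GiB)
```

These are best-case figures: they assume every extent is maximally long, which only happens when the file is laid out contiguously. A fragmented file burns through entries far faster.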
One more production nuance: 'filefrag' gets its extent list from the FIEMAP ioctl. On filesystems that don't support FIEMAP it falls back to the older FIBMAP ioctl, which requires root, so run it as root if you see FIEMAP or FIBMAP errors. Always verify extent counts on HDDs to avoid performance surprises.
Another hidden cost: on FAT32, the FAT itself is a large table that must be cached. For large partitions, the FAT can be tens of MB, and frequent updates (file creation/deletion) cause heavy write traffic. This is why FAT32 is unsuitable for high-write server workloads.
Journaling and Metadata Consistency — Why ext4 Survives Crashes
Before journaling, a power loss during a write could leave the filesystem in an inconsistent state: an inode pointing to blocks that are still marked free, or a directory entry referencing a non-existent inode. Recovery required a full fsck scan that could take hours on large volumes.
Journaling solves this by recording pending metadata operations in a circular log (journal) before applying them to the main filesystem. If a crash occurs, the journal is replayed on next mount — applying completed transactions and discarding partial ones. The filesystem is consistent in seconds.
ext3 introduced journaling as an optional feature (data=ordered mode journals metadata only, with data blocks written before the metadata that references them). ext4 extended it with checksums, faster recovery, and the ability to disable journaling for performance-critical partitions (at your own risk). NTFS uses a similar mechanism, $LogFile. FAT32 has no journaling, which is a primary reason it's not used for system partitions.
There's a common misconception that journaling protects file data. In data=ordered mode, only metadata is journalled. If you need both metadata and data to be atomic, use data=journal mode. However, that writes every data block twice (once to journal, once to final location), doubling write I/O. For most applications, data=ordered is the right balance: data blocks are written before metadata, so if a crash occurs, the metadata either refers to fully written data or is rolled back.
I'll never forget the time a colleague said 'we don't need journaling, it's just a cache' — then a power outage corrupted the database. The fsck took 6 hours. Never skip journaling on production filesystems.
One more thing: journaling isn't free. The journal itself consumes disk space (typically 128 MB for ext4), and each metadata write adds latency. If you're running a high-throughput log server that can tolerate some loss, you might consider disabling journaling on the log partition. But for any system where data integrity matters — databases, transaction logs, stateful applications — keep it on. The trade-off is real, but the cost of recovery outweighs the performance gain.
Modern ext4 also includes metadata checksums (metadata_csum feature) to detect corruption during reads and journal replay. Always enable this feature — it adds negligible overhead but catches silent corruption from bit flips or kernel bugs.
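Another production insight: journal size matters. If your journal is too small for a burst of metadata operations (e.g., bulk file extraction), it fills up and writers stall while older transactions are checkpointed to make room. Default journal size is usually fine, but for very large filesystems (10TB+) consider a larger journal. 'tune2fs -J size=256 /dev/sdX' requests a 256 MB journal (the size is given in megabytes); on a filesystem that already has a journal, you first have to remove it with 'tune2fs -O ^has_journal' while the filesystem is unmounted.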
- Before modifying the real inode table or bitmaps, write a 'redo' entry to the journal
- After the journal entry is safely on disk, apply the change to the main filesystem
- On crash recovery, replay all completed journal entries; partial entries are discarded (they never reached the main area)
- Result: filesystem is always consistent after a crash — no need for full fsck
- Trade-off: journal writes add latency and extra disk I/O (about 5-10% write performance hit)
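The steps above can be sketched as a toy write-ahead journal. This is an in-memory model, not how jbd2 is implemented: the class, its method names, and the "inode"/"bitmap" keys are all made up for illustration. What it does preserve is the rule that matters: only transactions with a commit record get replayed.

```python
class ToyJournal:
    """In-memory stand-in for a metadata journal plus the main filesystem area."""

    def __init__(self):
        self.journal = []     # stands in for the on-disk circular log
        self.main_area = {}   # stands in for the inode table and bitmaps

    def log(self, tx_id: int, updates: dict) -> None:
        """Step 1: describe the change in the journal before touching main_area."""
        self.journal.append({"tx": tx_id, "updates": dict(updates), "committed": False})

    def commit(self, tx_id: int) -> None:
        """Step 2: the commit record marks the transaction as complete."""
        for record in self.journal:
            if record["tx"] == tx_id:
                record["committed"] = True

    def replay(self) -> list:
        """Crash recovery: apply committed transactions, drop partial ones."""
        applied = []
        for record in self.journal:
            if record["committed"]:
                self.main_area.update(record["updates"])
                applied.append(record["tx"])
        self.journal.clear()
        return applied


def simulate_crash():
    fs = ToyJournal()
    fs.log(1, {"inode 12 size": 4096, "bitmap block 7": "used"})
    fs.commit(1)
    fs.log(2, {"inode 13 size": 8192})  # power fails before the commit record
    return fs.replay(), fs.main_area


if __name__ == "__main__":
    applied, state = simulate_crash()
    print("replayed transactions:", applied)
    print("main area after recovery:", state)
```

Transaction 2 simply vanishes on recovery. That is the whole point: the main area never sees a half-applied change, so no full scan is needed to find one.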
Production Reality — When File Systems Break and How to Debug Them
File systems in production fail in predictable ways. The most common scenarios:
- Out of inodes: 'No space left on device' even though 'df -h' shows free blocks. Happens with millions of tiny files (e.g., Docker overlay layers, mail spools).
- Corrupted superblock: Power loss, bad memory, or disk firmware bugs corrupt the superblock. Without a backup, the entire filesystem is lost.
- Orphaned inodes: A file's inode has no directory entry (lost+found). Happens after an unclean shutdown when the directory update didn't make it to disk.
- Read-only remount: The kernel detects an inconsistency and remounts the filesystem read-only to prevent further damage. Caused by hardware faults or kernel bugs.
- Disk full but can't delete: A deleted file still held open by a process. 'du' doesn't see it but 'df' does — the blocks remain allocated until the file handle closes.
Each of these has a specific debugging pathway — covered in the debug guides above.
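The "disk full but can't delete" case is the one that confuses people most, and it's safe to reproduce. This sketch writes a temp file, unlinks it while the handle is still open, and inspects it through the open descriptor. It assumes POSIX unlink semantics (Linux, macOS); the function name and the 1 MiB size are arbitrary.

```python
import os
import tempfile


def deleted_but_open_demo(size: int = 1024 * 1024) -> dict:
    """Unlink a file while it is still open and inspect it through the handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"x" * size)
        fh.flush()

        os.unlink(path)               # the name is gone from the directory...
        st = os.fstat(fh.fileno())    # ...but the inode is still alive

        return {
            "path_exists": os.path.exists(path),
            "link_count": st.st_nlink,
            "size_bytes": st.st_size,
        }
    # leaving the with-block closes the handle; only then are the blocks freed


if __name__ == "__main__":
    print(deleted_but_open_demo())
```

The path no longer exists and the link count is zero, yet the full size is still held by the inode. 'du' walks directory entries and misses it; 'df' counts allocated blocks and doesn't.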
Beyond these, large unjournalled filesystems can take hours to fsck. Modern ext4 filesystems with journaling can recover in seconds, but if you have an unjournalled filesystem with billions of inodes, fsck can take days. That's why enterprise storage uses XFS or ZFS with checksums — they provide faster recovery and better resilience.
I once spent a full weekend recovering a 20TB XFS filesystem after a dual-controller failure. The lesson: never trust a single backup of the superblock — keep multiple copies and verify them regularly.
There's another failure mode that's surprisingly common: filesystem metadata corruption due to memory errors (bit flips). ECC RAM catches most of these, but non-ECC systems in cloud VMs are vulnerable. XFS and ZFS use metadata checksums to detect corruption; ext4 offers the same through its metadata_csum feature, which needs a reasonably recent kernel and e2fsprogs. Always check if your filesystem has checksum support enabled. Without it, a single bit flip can silently corrupt an inode, leading to data loss that only surfaces months later.
One more often overlooked issue: filesystem quotas. If you use ext4 quotas (usrquota/grpquota), a quota limit can cause 'disk full' errors even when both blocks and inodes are free. I've seen a development server grind to a halt because a user's quota was hit — and the error message pointed to inode exhaustion, not quota. Always check 'repquota -a' when debugging mysterious 'no space' errors.
Also worth noting: hardware RAID card failures can present as filesystem errors. A dying controller might corrupt writes transparently. If you see unexplained corruption on multiple filesystems, suspect the RAID controller before the disks.
SSD vs HDD: How File Systems Behave on Different Storage Media
Your file system's performance and reliability depend heavily on the underlying storage technology. Hard disk drives (HDDs) and solid-state drives (SSDs) have fundamentally different characteristics that affect file system behaviour.
HDDs: Seek time is the dominant factor (~10 ms per random read). Sequential reads are fast (~200 MB/s). The file system should try to keep related blocks close together (extents, block groups). Fragmentation directly hurts performance because each fragment requires an additional seek.
SSDs: No mechanical seek. Random reads are as fast as sequential (typically 50–100 µs access time). Fragmentation is irrelevant for performance. However, SSDs have write endurance limits and require TRIM to inform the controller which blocks are free. Without TRIM, write performance degrades over time as the SSD must erase blocks before writing (write amplification).
Modern file systems like ext4 support the DISCARD/TRIM operation. You can enable it via mount option discard or run fstrim periodically. Be careful: frequent discards can increase latency on some SSDs. Batch trimming (fstrim -a via cron) is often preferred.
File system alignment is critical on SSDs with 4K sectors. If file system blocks are not aligned to the SSD's erase block boundaries, write performance degrades dramatically. Most modern tools (mkfs.ext4) handle this automatically, but legacy partition tables may misalign.
Write amplification: Each SSD write operation may require erasing a larger block (e.g., 512 KB) even for a small 4 KB write. File systems that batch small writes (delayed allocation in ext4) reduce write amplification by grouping writes into larger contiguous chunks.
Here's something most articles miss: on NVMe drives at high queue depth (128+), XFS can pull ahead of ext4 thanks to the parallelism of its allocation group design. In my own benchmarks for a database migration the gap was about 30%, so measure on your hardware before committing to either.
Another critical detail: the interaction between file system journal and SSD wear. Each journal write adds extra I/O, which on an SSD consumes write endurance. If you have a high-write workload on a consumer SSD (low TBW), consider raising the commit interval (commit=30 in mount options, up from the default 5) to batch journal commits, or move the journal to a separate device (external journal) on a more resilient SSD.
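For NVMe specifically, the PCIe lane count matters. A single PCIe 4.0 NVMe drive on 4 lanes tops out around 7 GB/s. Stripe geometry comes into play when several drives are striped together: if the filesystem isn't created with a stripe unit and width that match the array, you won't saturate those lanes. XFS with su=128k,sw=4 (matching a four-drive stripe with 128 KB chunks) is a safer bet there than ext4's default settings.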
File System Mount Options and Performance Tuning — What Senior Engineers Change
Default mount options are designed for safety, not performance. In production, you'll almost always want to tune a few key parameters to reduce I/O overhead and match your workload.
Atime updates: Every time a file is read, the access time (atime) in the inode is updated. This causes an extra write I/O on every read. Use noatime to disable this. relatime (default on modern Linux) updates atime only if it's older than mtime or ctime, which reduces the penalty significantly but still causes writes on the first read after modification. Use noatime if you don't need access time at all (common for databases, web servers).
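The relatime rule is small enough to write down. This sketch mirrors the decision as I understand the kernel to make it: update when atime is not newer than mtime or ctime, and also when the stored atime is at least 24 hours old. That 24-hour clause is not mentioned above, so treat it as an addition on my part. Times are plain epoch seconds.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_needs_update(atime: float, mtime: float, ctime: float, now: float) -> bool:
    """Decide whether a read should write a new atime under relatime."""
    if atime <= mtime:   # file content changed since the last recorded access
        return True
    if atime <= ctime:   # inode metadata changed since the last recorded access
        return True
    if now - atime >= DAY_SECONDS:  # keep atime at most about a day stale
        return True
    return False


if __name__ == "__main__":
    # Read right after a modification: atime is older than mtime, so update.
    print(relatime_needs_update(atime=100, mtime=200, ctime=50, now=300))
    # Second read shortly afterwards: nothing changed, no write.
    print(relatime_needs_update(atime=500, mtime=200, ctime=200, now=600))
```

With noatime the function would simply always return False, which is why it removes the read-triggered writes entirely.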
Commit interval: The journal writes metadata every commit seconds (default 5). A lower commit improves crash safety by reducing the window of lost metadata, but increases write frequency. For write-heavy workloads, increasing commit to 30 or 60 seconds can reduce journal I/O by 75% or more. Trade-off: you lose up to 60 seconds of metadata changes in a crash (data is safe if using data=ordered).
Write barriers: Ensure that journal and metadata writes reach persistent storage in the correct order. Usually safe to disable (barrier=0) on RAID controllers with battery-backed cache, but dangerous on single SSDs or HDDs, where a power loss can leave reordered or partial writes behind. Default is on.
Data mode: Already covered in journaling section: data=ordered for most, data=journal for extreme consistency, data=writeback for performance at risk.
Delayed allocation: Enabled by default in ext4. Groups small writes into larger contiguous chunks before flushing, which reduces fragmentation and write amplification on SSDs. The catch is that data still sitting in the page cache is lost if the system crashes before it is flushed. For most workloads that risk is small compared with the performance gain.
Production tip: On a busy database server, I once cut I/O wait by 30% just by adding noatime,nodiratime,commit=30 to the mount options. The default 5-second commit was causing a journal flush storm on every transaction batch.
Tuning example: For a MySQL data directory on ext4, typical mount options are rw,noatime,nodiratime,data=ordered,commit=30,barrier=1 (noatime already implies nodiratime, so the second is harmless but redundant). On SSDs, leave out the discard option and run fstrim periodically instead.
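Tuned options have a habit of disappearing after a rebuild, so it's worth checking what is actually mounted. This sketch parses text in the /proc/mounts format and reports which wanted options are absent. The sample line, device name, and mount point are hypothetical, and the kernel doesn't always list options that are at their default, so a "missing" result may just mean "default".

```python
def parse_options(option_field: str) -> dict:
    """Turn 'rw,noatime,commit=30' into {'rw': True, 'noatime': True, 'commit': '30'}."""
    parsed = {}
    for item in option_field.split(","):
        key, _, value = item.partition("=")
        parsed[key] = value or True
    return parsed


def parse_mounts(text: str) -> dict:
    """Map mount point -> device, filesystem type, and parsed options."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        mounts[mount_point] = {
            "device": device,
            "type": fs_type,
            "options": parse_options(options),
        }
    return mounts


def missing_options(options: dict, wanted: dict) -> list:
    """Names of wanted options that are absent or set to a different value."""
    return [name for name, value in wanted.items() if options.get(name) != value]


if __name__ == "__main__":
    # Hypothetical sample; on a real host read the text from /proc/mounts.
    sample = "/dev/sdb1 /var/lib/mysql ext4 rw,noatime,data=ordered,commit=30 0 0\n"
    mysql = parse_mounts(sample)["/var/lib/mysql"]
    print(missing_options(mysql["options"], {"noatime": True, "commit": "30"}))
```

On a live system you would feed it open("/proc/mounts").read() and run the check from your configuration management or monitoring.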
Benchmark before and after: Use fio to measure I/O latency and throughput with different options. Documented savings of 10-20% write I/O are common when switching from defaults to tuned options.
One more option often overlooked: 'nodelalloc' to disable delayed allocation. This can be useful for databases that need immediate write ordering (e.g., PostgreSQL's full-page writes). But on most workloads, delayed allocation improves performance significantly — test both.
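Another senior trick: 'noauto_da_alloc' turns off the ext4 heuristic that forces delayed-allocation blocks to disk when a file is replaced via rename or truncate-and-rewrite. It can shave latency off workloads that replace files constantly, but it's risky: after a crash, a replaced file can come back zero-length unless the application calls fsync itself. Only change it if you understand the exact consequences.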
tune2fs -l to verify journal size isn't overwhelmed.
File System Security & Permissions: Why POSIX ACLs and Extended Attributes Matter
File systems enforce access control through permissions, capabilities, and extended attributes. On Linux, the standard Unix rwx model gives owner/group/world sets. But production environments need finer control: POSIX Access Control Lists (ACLs) allow specifying permissions for individual users or groups, and extended attributes (xattr) store metadata like file capabilities or SELinux labels.
POSIX ACLs: Set with setfacl and viewed with getfacl. They add a logical ACL entry to the inode's extended attributes. Useful for shared directories where one user needs read and another write. However, ACLs increase metadata size and can slow down directory listing operations.
Extended attributes: Namespace-stored metadata (user, trusted, security). Used by SELinux for security contexts, by attr for custom attributes. They are stored in the inode if small enough, otherwise in a separate block, impacting space and performance.
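Extended attributes are easy to poke at from Python on Linux, where the os module exposes setxattr and getxattr. The sketch below writes and reads one attribute in the user namespace on a temp file. The attribute name is made up, and the function returns None when the platform or the filesystem under the temp directory doesn't support user xattrs, which does happen (some tmpfs setups, for example).

```python
import os
import tempfile


def xattr_roundtrip(name: str = "user.review_status", value: bytes = b"approved"):
    """Set and read back one extended attribute; None if xattrs are unsupported."""
    if not hasattr(os, "setxattr"):
        return None  # os.setxattr/os.getxattr are only available on Linux
    with tempfile.NamedTemporaryFile() as tmp:
        try:
            os.setxattr(tmp.name, name, value)
            return os.getxattr(tmp.name, name)
        except OSError:
            return None  # filesystem refuses user xattrs


if __name__ == "__main__":
    result = xattr_roundtrip()
    if result is None:
        print("user xattrs not supported here")
    else:
        print("stored attribute:", result)
```

The None branch is the interesting one operationally: it is the same silent gap that strips capabilities and labels when files land on a filesystem without xattr support.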
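Immutable files: The chattr +i command on ext4 sets the immutable attribute, which blocks modification, deletion, and renaming even for root until the flag is removed with chattr -i. Useful for system binaries and critical config files (for logs, use append-only +a instead, since an immutable log can't be written to). My production lesson: a misbehaving process once tried to delete its own binary, and the delete failed because the binary was immutable.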
File capabilities: Instead of setuid root, you can grant specific capabilities to a binary (e.g., CAP_NET_BIND_SERVICE to bind to low ports). This reduces the attack surface. But capabilities are stored in extended attributes and can be stripped by file copies or backups.
Production insight: A common mistake is forgetting that NFS exports ignore local ACLs. If you export an ext4 filesystem via NFS, the ACLs are only enforced locally — the NFS server relies on the client's UID mapping. Suddenly your fine-grained ACLs are meaningless. Always test NFS with exportfs -v and monitor with nfsstat.
SELinux contexts: On Red Hat systems, the filesystem stores SELinux labels in extended attributes. A relabel operation (restorecon -R /) can take hours on large volumes. We once had a security audit fail because a backup restored files without SELinux contexts — the entire web server was inaccessible.
Performance trade-off: Every ACL or extended attribute adds to inode metadata size. For directories with thousands of files, listing with getfacl * can be slower than expected. Use it only where necessary.
Another hidden trap: chattr +a (append only) is great for logs, but not every writer copes with it. Programs that open the file with O_APPEND (like rsyslog) work fine. A program that opens it for writing without O_APPEND, or with O_TRUNC, gets EPERM at open() time, so log rotation that truncates or renames files in place will fail. Always test your writers and your rotation against an append-only file.
Also note: file capabilities (setcap) are lost if the file is copied to a filesystem that doesn't support xattr (like NFSv3, or FAT). Always store critical binaries on native ext4/XFS with xattr support.
The Day an ext3 Superblock Corruption Took Down a Payment Gateway
- Always enable journaling on production filesystems — the performance hit (5-10% write overhead) is worth crash safety.
- Know your filesystem's backup superblock locations and how to recover from a corrupted primary superblock.
- Make regular dumps of superblock information using 'dumpe2fs -h' and store them off-box.
- Practice recovery scenarios in staging so you don't learn the procedure during an outage.
- Never assume default options are safe — verify 'has_journal' feature on every new filesystem with 'tune2fs -l'.
- Consider using ext4 or XFS for production — ext3's journal is optional and easily missed.
- Document backup superblock addresses (block 8193, 32768, etc.) in your runbook before a crash happens.