ext3 Superblock Corruption — Payment Gateway Outage
Disk partition won't mount: bad superblock — ext3 superblock corruption from power loss caused a payment gateway outage.
20+ years shipping production systems from the metal up. Drawn from code that ran under real load.
- A file system organises raw storage into files and directories using metadata like inodes and allocation tables
- Core components: superblock (globals), inode table (per-file metadata), data blocks (content), directory entries (name-to-inode maps)
- FAT32 uses linked list cluster chains; ext4 uses extents and journaling, reducing seeks up to 80%
- Production gotcha: abrupt power loss during a metadata write can orphan inodes — journaling prevents this on ext4, but not on FAT32
- Biggest mistake: assuming file delete frees data — it only marks blocks as free; data remains recoverable until overwritten
- Forensic reality: Tools like extundelete can recover deleted files minutes after deletion if no new writes occurred
Imagine your OS is a giant library. A file system is the librarian's cataloguing system — it decides which shelf each book goes on, writes a card in the index so anyone can find it later, and tracks which shelves are empty. Without the librarian, books would be dumped on the floor in a pile and nobody could find anything. Your hard drive is that same pile of storage space, and the file system is what turns chaos into an organised, searchable collection.
Every time you hit Ctrl+S, drag a photo into a folder, or install an app, you're trusting a file system to keep that data safe and findable. File systems are one of those invisible layers of the OS that almost nobody thinks about — until something goes wrong and years of photos vanish. Understanding how they work isn't just academic; it's the difference between a developer who debugs a corrupted disk by instinct and one who panics and Googles for three hours.
The core problem a file system solves is deceptively simple: a hard drive or SSD is just a flat sequence of bytes — millions of them, with no inherent meaning. The file system imposes structure on that flat sequence. It records where each file starts and ends, what it's called, who owns it, when it was last modified, and which blocks of storage are free for new data. Without this layer, the OS couldn't tell the difference between a Python script and a JPEG.
By the end of this article you'll understand the internal structure of a file system (directories, inodes, blocks, and allocation tables), why different file systems like FAT32, NTFS, and ext4 exist and when each one is the right choice, what actually happens on disk when you create or delete a file, and the most common mistakes engineers make when reasoning about file systems under load or across platforms. You'll also walk away with concrete talking points for any OS or systems design interview.
Here's the thing: when your filesystem goes down, every other service goes down with it. The debug commands in this article are the same ones I've used to recover production systems, and the recovery procedures are the ones that turn a potential hours-long outage into a ten-minute fix. I've lived that outage, and the first time was on a production database at 2 AM on a Saturday.
What a File System Actually Is — And Why It Corrupted
A file system is the kernel-level data structure and code that controls how data is stored, named, and retrieved on a block device. It translates file paths and I/O operations into block reads and writes on disk. The ext3 file system uses a journal (a circular log) to record metadata changes before applying them, enabling recovery after crashes. Without the journal, a power loss during a block group descriptor update can leave the superblock — the master metadata block — pointing to invalid inode tables, rendering the entire volume unmountable.
Ext3 organizes disk space into block groups, each with its own block bitmap, inode bitmap, and inode table; selected groups also carry a backup copy of the superblock. The primary superblock at offset 1024 bytes contains critical fields: number of inodes, block count, first data block, and the state (cleanly unmounted or errors detected). When a write to the superblock is interrupted, its fields can end up inconsistent or the state is left flagged with errors, forcing fsck to scan every block group. This is O(n) in the number of blocks: on a 2 TiB volume with 4 KB blocks, that's about 536.9 million blocks (2^29) to check.
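To make the superblock fields concrete, here is a minimal sketch that decodes a few of them from a raw image. It assumes the classic ext2/ext3 on-disk layout (little-endian fields, magic 0xEF53 at byte 56 of the superblock, state at byte 58). The helper make_fake_image is hypothetical and only exists so the parser can be tried without touching a real disk.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the primary superblock starts 1024 bytes into the device
EXT_MAGIC = 0xEF53        # magic number shared by ext2, ext3 and ext4


def parse_superblock(image: bytes) -> dict:
    """Decode a handful of superblock fields from a raw device image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    inodes, blocks = struct.unpack_from("<II", sb, 0)
    first_data_block, log_block_size = struct.unpack_from("<II", sb, 20)
    magic, state = struct.unpack_from("<HH", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04x}, not an ext superblock")
    # The state field is a bitmask: 0x1 = cleanly unmounted, 0x2 = errors detected.
    if state & 0x2:
        state_name = "errors detected"
    elif state & 0x1:
        state_name = "cleanly unmounted"
    else:
        state_name = "not cleanly unmounted"
    return {
        "inodes": inodes,
        "blocks": blocks,
        "first_data_block": first_data_block,
        "block_size": 1024 << log_block_size,
        "state": state_name,
    }


def make_fake_image(inodes: int, blocks: int, log_block_size: int, state: int) -> bytes:
    """Build a synthetic image (hypothetical helper) so the parser can be tried safely."""
    image = bytearray(SUPERBLOCK_OFFSET + 1024)
    struct.pack_into("<II", image, SUPERBLOCK_OFFSET + 0, inodes, blocks)
    struct.pack_into("<II", image, SUPERBLOCK_OFFSET + 20, 0, log_block_size)
    struct.pack_into("<HH", image, SUPERBLOCK_OFFSET + 56, EXT_MAGIC, state)
    return bytes(image)


# 2 TiB of 4 KB blocks: the number of blocks a full fsck has to account for.
BLOCKS_IN_2_TIB = (2 * 2**40) // 4096

if __name__ == "__main__":
    print(parse_superblock(make_fake_image(1000, BLOCKS_IN_2_TIB, 2, 0x2)))
    print(f"{BLOCKS_IN_2_TIB:,} blocks")
```

On a real system you would read the first 2048 bytes of the block device (as root) instead of building a fake image, and dumpe2fs -h gives you the same fields without any code.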
Use ext3 when you need journaling for crash recovery on spinning disks or embedded systems with limited memory. It is not suitable for flash storage (no TRIM support) or workloads requiring sub-second fsck times. In production, a single corrupted superblock can take down a payment gateway for hours while fsck runs — or worse, if the backup superblock is also stale, data is lost.
Anatomy of a File System — Blocks, Inodes and Directories
Every file system organizes storage into fixed-size blocks (typically 4 KB). The crucial metadata structures are the superblock, inode table, and directory entries.
- Superblock: Stores global info like filesystem type, block size, number of blocks, number of free inodes. If the superblock corrupts, the entire filesystem is unreadable.
- Inode (index node): Each file and directory has one inode. It holds metadata (size, permissions, timestamps) and pointers to the data blocks. Inodes are stored in a reserved area of the partition.
- Directory: A special file whose data block is a list of (name, inode number) pairs. The '.' and '..' entries are stored here.
The inode does not store the file name. The name lives only in the directory entry. This means a file can have multiple names (hard links) — each pointing to the same inode. Moving a file within the same filesystem simply changes the directory entry, not the inode.
The classic ext2/ext3 inode contains 12 direct block pointers, then single, double, and triple indirect pointers. This design lets small files be mapped with one inode read, while large files use progressively deeper indirection. On disk, these 15 pointers (12 direct plus 3 indirect) occupy the inode's 60-byte i_block field, the same 60 bytes that ext4 reuses for its inline extent tree. Files of up to 12 blocks (48 KB with 4 KB blocks) are covered entirely by the direct pointers, so locating their data requires only the inode lookup.
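The pointer scheme is easier to reason about with numbers. The sketch below assumes 4-byte block pointers and 12 direct slots, and works out which level of indirection a given file block falls under, plus how many bytes the pointer tree can address in total. The function names are illustrative, not part of any real tool.

```python
DIRECT_POINTERS = 12
POINTER_SIZE = 4  # bytes per block pointer (assumed 32-bit block numbers)


def pointers_per_block(block_size: int) -> int:
    """How many block pointers fit in one indirect block."""
    return block_size // POINTER_SIZE


def indirection_level(file_block: int, block_size: int = 4096) -> str:
    """Return which pointer level maps the given logical block of a file."""
    n = pointers_per_block(block_size)
    levels = [
        ("direct", DIRECT_POINTERS),
        ("single indirect", n),
        ("double indirect", n ** 2),
        ("triple indirect", n ** 3),
    ]
    for name, capacity in levels:
        if file_block < capacity:
            return name
        file_block -= capacity
    raise ValueError("block index is beyond the triple-indirect range")


def max_mappable_bytes(block_size: int = 4096) -> int:
    """Bytes addressable through direct plus all three indirect levels."""
    n = pointers_per_block(block_size)
    return (DIRECT_POINTERS + n + n ** 2 + n ** 3) * block_size


if __name__ == "__main__":
    for block in (0, 11, 12, 1036, 1_049_612):
        print(block, indirection_level(block))
    print(max_mappable_bytes() / 2**40, "TiB addressable with 4 KB blocks")
```

With 4 KB blocks the pointer tree can address roughly 4 TiB, though other on-disk fields can cap the real file size lower than that.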
Here's a production nuance: if you have millions of small files (think Docker overlay layers or mail spools), you'll exhaust inodes long before the disk fills. I've seen 'No space left on device' bring down a mail server while 'df -h' showed 40% free. Always monitor 'df -i'.
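If you want the same check as 'df -i' inside a monitoring script, a minimal sketch using Python's os.statvfs looks like this. The 90% alert threshold is an arbitrary assumption, and some filesystems report zero total inodes because they allocate them dynamically, so the function returns None in that case.

```python
import os


def inode_usage_percent(path: str = "/"):
    """Return the percentage of inodes in use on the filesystem holding path.

    Returns None when the filesystem does not report a fixed inode count.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


if __name__ == "__main__":
    usage = inode_usage_percent("/")
    if usage is None:
        print("filesystem does not expose a fixed inode count")
    elif usage > 90.0:  # assumed alert threshold, tune for your fleet
        print(f"WARNING: inode usage at {usage:.1f}%")
    else:
        print(f"inode usage at {usage:.1f}%")
```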
Another hidden detail: the superblock isn't the only copy. ext4 keeps backup superblocks at the start of selected block groups (with the default sparse_super feature: group 1 and groups that are powers of 3, 5, and 7). The first backup sits at block 8193 on a filesystem with 1 KB blocks and at block 32768 with 4 KB blocks. When the primary superblock corrupts, you can recover using a backup. But many engineers don't know where their backups are until they need them. That's the point of the production incident earlier: know your backup block numbers before a crash.
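You can predict where the backups live without touching the disk. The sketch below assumes the default mkfs geometry (8 × block size blocks per group, first data block 1 for 1 KB blocks and 0 otherwise) and the sparse_super layout. Treat the output as a cross-check only; 'dumpe2fs' or 'mke2fs -n' on the real device is the authority.

```python
def backup_groups(group_count: int) -> list:
    """Block groups that hold a backup superblock under sparse_super."""
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return sorted(groups)


def backup_superblock_blocks(total_blocks: int, block_size: int) -> list:
    """First block of every group that carries a backup superblock."""
    blocks_per_group = 8 * block_size          # one bitmap block covers one group
    first_data_block = 1 if block_size == 1024 else 0
    usable = total_blocks - first_data_block
    group_count = -(-usable // blocks_per_group)  # ceiling division
    return [first_data_block + g * blocks_per_group
            for g in backup_groups(group_count)]


if __name__ == "__main__":
    # A 2 TiB filesystem with 4 KB blocks.
    print(backup_superblock_blocks(2**29, 4096)[:6])
    # A small filesystem with 1 KB blocks.
    print(backup_superblock_blocks(100_000, 1024))
```

With one of these block numbers in hand, 'e2fsck -b 32768 /dev/sdX' asks fsck to use that backup instead of the damaged primary.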
What about extent trees? In ext4, an extent is a contiguous range of blocks. The inode stores up to 4 extents inline; for files with more than 4 extents, a tree of extent nodes is used. This reduces metadata overhead dramatically — a 16 MB file stored in one extent requires only one entry in the inode, not thousands of individual block pointers. This is why ext4 handles large files much better than ext3 without extent support.
One more internal detail: the directory structure itself can be a hash tree (htree) in ext4, allowing fast lookups even in directories with millions of entries. Without htree, a linear scan of directory entries would be O(n) per lookup. ext4's htree is a B-tree variant that gives O(log n) lookups. This is why re-creating filesystems with 'dir_index' feature matters for mail servers and image repositories.
- Directory entries: name -> inode number (e.g., 'hostname' -> 131073)
- Inode table: inode number -> metadata + block pointers (e.g., inode 131073 points to blocks 100-102)
- Data blocks: the actual bytes of the file
- This separation means you can have multiple names (hard links) pointing to the same inode — deleting one name just removes the directory entry, not the inode
- The superblock is the 'globals' dict — without it, you can't parse anything else
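The mental model above fits in a few lines of Python. This is a toy, in-memory sketch (the ToyFS class and its tiny 4-byte block size are made up for illustration), but it shows why a hard link is just a second directory entry and why data only goes away when the last link does.

```python
class ToyFS:
    """Toy model: directory entries -> inodes -> data blocks."""

    def __init__(self):
        self.directory = {}        # name -> inode number
        self.inodes = {}           # inode number -> metadata + block pointers
        self.blocks = {}           # block number -> bytes
        self._next_inode = 131073  # arbitrary starting numbers for the demo
        self._next_block = 100

    def create(self, name: str, data: bytes, block_size: int = 4) -> int:
        block_numbers = []
        for offset in range(0, len(data), block_size):
            self.blocks[self._next_block] = data[offset:offset + block_size]
            block_numbers.append(self._next_block)
            self._next_block += 1
        ino = self._next_inode
        self._next_inode += 1
        self.inodes[ino] = {"links": 1, "size": len(data), "blocks": block_numbers}
        self.directory[name] = ino
        return ino

    def link(self, existing: str, new_name: str) -> None:
        """Hard link: a second name for the same inode."""
        ino = self.directory[existing]
        self.inodes[ino]["links"] += 1
        self.directory[new_name] = ino

    def unlink(self, name: str) -> None:
        """Remove one name; free inode and blocks only when no links remain."""
        ino = self.directory.pop(name)
        inode = self.inodes[ino]
        inode["links"] -= 1
        if inode["links"] == 0:
            for block in inode["blocks"]:
                del self.blocks[block]
            del self.inodes[ino]

    def read(self, name: str) -> bytes:
        inode = self.inodes[self.directory[name]]
        return b"".join(self.blocks[b] for b in inode["blocks"])


if __name__ == "__main__":
    fs = ToyFS()
    fs.create("hostname", b"payments-01\n")
    fs.link("hostname", "hostname.bak")
    fs.unlink("hostname")
    print(fs.read("hostname.bak"))  # data survives through the second name
```

A real filesystem only marks the blocks as free on the final unlink rather than erasing them, which is why deleted data stays recoverable until something overwrites it.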
File Allocation Strategies — Contiguous, Linked and Indexed
How does the file system map file offsets to disk blocks? Three classic strategies:
- Contiguous allocation: Each file occupies consecutive blocks. Simple and fast for sequential reads (single seek), but suffers from external fragmentation: as files are created and deleted, free space gets scattered and growing a file becomes hard. Used by ISO 9660 and other write-once formats.
- Linked allocation: Each block contains a pointer to the next block. No external fragmentation, but random access is slow: the pointer lives in the block data, so reaching block N means reading the N blocks before it. FAT32 uses a variant where the File Allocation Table (FAT) stores the chain separately, so the chain can be walked in memory without touching the data blocks.
- Indexed allocation: The inode contains a list of direct block pointers, plus indirect, double indirect, and triple indirect pointers for large files. This gives O(1) access to any block via a few index reads. ext4 and NTFS use indexed allocation with extent trees (ranges of contiguous blocks) to reduce pointer overhead.
Modern file systems combine these: ext4 uses extents (contiguous runs of blocks) tracked in an indexed structure, giving the best of both worlds.
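The difference between following a chain and indexing into a pointer list is easy to see in code. This is a toy sketch with a made-up three-cluster file; the dictionary stands in for the FAT and the list for an inode's block pointers.

```python
END_OF_CHAIN = -1


def fat_block_for(fat: dict, first_cluster: int, index: int) -> int:
    """Linked (FAT-style) lookup: follow the chain `index` hops, O(index)."""
    cluster = first_cluster
    for _ in range(index):
        cluster = fat[cluster]
        if cluster == END_OF_CHAIN:
            raise IndexError("file is shorter than the requested block index")
    return cluster


def indexed_block_for(block_pointers: list, index: int) -> int:
    """Indexed (inode-style) lookup: one array access, O(1)."""
    return block_pointers[index]


if __name__ == "__main__":
    # A file stored in clusters 2 -> 9 -> 4, scattered across the disk.
    fat = {2: 9, 9: 4, 4: END_OF_CHAIN}
    pointers = [2, 9, 4]
    print(fat_block_for(fat, 2, 2), indexed_block_for(pointers, 2))
```

Both calls land on cluster 4, but the chained version had to visit every earlier link to get there, which is the cost that grows with file size.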
Here's the real gotcha: on spinning disks, a heavily fragmented file can kill read throughput. I once debugged a log parser that took 10x longer on an HDD than expected — the log files were fragmented into thousands of 4KB chunks across the platter. 'filefrag /var/log/syslog' showed 2,347 extents for a 1GB file. The fix was to defragment or switch to ext4 which merges extents better.
And here's something most docs skip: the extent tree's depth limits. With 4 KB blocks, a single extent leaf block holds 340 extents, and each extent covers at most 32,768 blocks (128 MiB), so one leaf block can map about 42.5 GiB of data. Most files never go beyond the inline extents. But if you have database files that are terabytes large with hundreds of extents, the tree grows, and that adds latency to each metadata lookup. XFS handles this more gracefully with B+ trees for extents.
One more production nuance: 'filefrag' reports extents through the FIEMAP ioctl. On filesystems that don't support FIEMAP it falls back to the older FIBMAP ioctl, which requires root, so run it with sudo if you get permission errors. Always verify extent counts on HDDs to avoid performance surprises.
Another hidden cost: on FAT32, the FAT itself is a large table that must be cached. For large partitions, the FAT can be tens of MB, and frequent updates (file creation/deletion) cause heavy write traffic. This is why FAT32 is unsuitable for high-write server workloads.
Journaling and Metadata Consistency — Why ext4 Survives Crashes
Before journaling, a power loss during a write could leave the filesystem in an inconsistent state: an inode pointing to blocks that are still marked free, or a directory entry referencing a non-existent inode. Recovery required a full fsck scan that could take hours on large volumes.
Journaling solves this by recording pending metadata operations in a circular log (journal) before applying them to the main filesystem. If a crash occurs, the journal is replayed on next mount — applying completed transactions and discarding partial ones. The filesystem is consistent in seconds.
ext3 introduced journaling as an optional feature (data=ordered mode journals metadata only; data blocks are written before metadata). ext4 extended it with checksums, faster recovery, and the ability to disable journaling for performance-critical partitions (at your own risk). NTFS uses a similar $LogFile. FAT32 has no journaling, which is a primary reason it's not used for system partitions.
There's a common misconception that journaling protects file data. In data=ordered mode, only metadata is journalled. If you need both metadata and data to be atomic, use data=journal mode. However, that writes every data block twice (once to journal, once to final location), doubling write I/O. For most applications, data=ordered is the right balance: data blocks are written before metadata, so if a crash occurs, the metadata either refers to fully written data or is rolled back.
I'll never forget the time a colleague said 'we don't need journaling, it's just a cache' — then a power outage corrupted the database. The fsck took 6 hours. Never skip journaling on production filesystems.
One more thing: journaling isn't free. The journal itself consumes disk space (typically 128 MB for ext4), and each metadata write adds latency. If you're running a high-throughput log server that can tolerate some loss, you might consider disabling journaling on the log partition. But for any system where data integrity matters — databases, transaction logs, stateful applications — keep it on. The trade-off is real, but the cost of recovery outweighs the performance gain.
Modern ext4 also includes metadata checksums (metadata_csum feature) to detect corruption during reads and journal replay. Always enable this feature — it adds negligible overhead but catches silent corruption from bit flips or kernel bugs.
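Another production insight: journal size matters. If your journal is too small for a burst of metadata operations (e.g., bulk file extraction), it fills up and writers stall while the kernel checkpoints old transactions to make room. Default journal size is usually fine, but for very large filesystems (10TB+) consider a larger journal. On an existing filesystem that means removing and re-creating the journal while it is unmounted, for example 'tune2fs -O ^has_journal /dev/sdX' followed by 'tune2fs -J size=256 /dev/sdX' (the size is given in megabytes).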
- Before modifying the real inode table or bitmaps, write a 'redo' entry to the journal
- After the journal entry is safely on disk, apply the change to the main filesystem
- On crash recovery, replay all completed journal entries; partial entries are discarded (they never reached the main area)
- Result: filesystem is always consistent after a crash — no need for full fsck
- Trade-off: journal writes add latency and extra disk I/O (about 5-10% write performance hit)
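The steps above can be sketched as a toy write-ahead log. This is not how jbd2 is implemented (real journals work on disk blocks with commit records and checksums); the ToyJournal class is a hypothetical in-memory model that only shows why committed transactions survive a crash and uncommitted ones vanish.

```python
class ToyJournal:
    """Toy write-ahead journal over a dict that stands in for on-disk metadata."""

    def __init__(self):
        self.journal = []  # transactions, in the order they were logged
        self.main = {}     # the "real" metadata area (inode table, bitmaps, ...)

    def begin(self, ops) -> int:
        """Log the intended changes first; nothing touches the main area yet."""
        self.journal.append({"ops": list(ops), "committed": False})
        return len(self.journal) - 1

    def commit(self, txid: int) -> None:
        """Mark the journal entry complete (the commit record hit the disk)."""
        self.journal[txid]["committed"] = True

    def checkpoint(self, txid: int) -> None:
        """Apply a committed transaction to the main area."""
        if not self.journal[txid]["committed"]:
            raise RuntimeError("cannot checkpoint an uncommitted transaction")
        for key, value in self.journal[txid]["ops"]:
            self.main[key] = value

    def replay(self) -> None:
        """Crash recovery: redo committed entries, drop partial ones."""
        for tx in self.journal:
            if tx["committed"]:
                for key, value in tx["ops"]:
                    self.main[key] = value
        self.journal = []


if __name__ == "__main__":
    j = ToyJournal()
    tx0 = j.begin([("inode 12 size", 4096), ("block 100", "used")])
    j.commit(tx0)                             # committed, but not yet checkpointed
    j.begin([("inode 13 size", 8192)])        # crash happens before commit
    j.replay()                                # what the next mount does
    print(j.main)
```

After replay the main area contains only the first transaction's changes, which is the "consistent in seconds" property: recovery cost depends on the journal size, not the filesystem size.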
Production Reality — When File Systems Break and How to Debug Them
File systems in production fail in predictable ways. The most common scenarios:
- Out of inodes: 'No space left on device' even though 'df -h' shows free blocks. Happens with millions of tiny files (e.g., Docker overlay layers, mail spools).
- Corrupted superblock: Power loss, bad memory, or disk firmware bugs corrupt the superblock. Without a backup, the entire filesystem is lost.
- Orphaned inodes: A file's inode has no directory entry (lost+found). Happens after an unclean shutdown when the directory update didn't make it to disk.
- Read-only remount: The kernel detects an inconsistency and remounts the filesystem read-only to prevent further damage. Caused by hardware faults or kernel bugs.
- Disk full but can't delete: A deleted file still held open by a process. 'du' doesn't see it but 'df' does — the blocks remain allocated until the file handle closes.
Each of these has a specific debugging pathway — covered in the debug guides above.
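The "disk full but can't delete" case from the list above is easy to reproduce safely. The sketch below creates a temporary file, removes its only name while the descriptor is still open, and shows that the inode and its data are still there. It assumes a POSIX system; on most local Linux filesystems the link count reads 0 at that point.

```python
import os
import tempfile


def unlinked_but_open_demo():
    """Delete a file's name while holding it open and inspect what remains."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        os.unlink(path)                 # remove the only directory entry
        st = os.fstat(fd)               # the inode is still alive behind the fd
        still_readable = os.pread(fd, 1, 0) == b"x"
        return os.path.exists(path), st.st_nlink, st.st_size, still_readable
    finally:
        os.close(fd)                    # only now can the blocks be freed


if __name__ == "__main__":
    exists, links, size, readable = unlinked_but_open_demo()
    print(f"name exists: {exists}, links: {links}, size: {size}, readable: {readable}")
```

In production the usual way to find the culprit is 'lsof +L1', which lists open files with a link count below one; restarting or signalling that process releases the space.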
Beyond these, large unjournalled filesystems can take hours to fsck. Modern ext4 filesystems with journaling can recover in seconds, but if you have an unjournalled filesystem with billions of inodes, fsck can take days. That's why enterprise storage uses XFS or ZFS with checksums — they provide faster recovery and better resilience.
I once spent a full weekend recovering a 20TB XFS filesystem after a dual-controller failure. The lesson: never trust a single backup of the superblock — keep multiple copies and verify them regularly.
There's another failure mode that's surprisingly common: filesystem metadata corruption due to memory errors (bit flips). ECC RAM catches most of these, but non-ECC systems in cloud VMs are vulnerable. XFS and ZFS use metadata checksums to detect corruption; ext4 gained metadata_csum in Linux 3.5, with userspace support arriving in e2fsprogs 1.43. Always check if your filesystem has checksum support enabled. Without it, a single bit flip can silently corrupt an inode, leading to data loss that only surfaces months later.
One more often overlooked issue: filesystem quotas. If you use ext4 quotas (usrquota/grpquota), a quota limit can cause 'disk full' errors even when both blocks and inodes are free. I've seen a development server grind to a halt because a user's quota was hit — and the error message pointed to inode exhaustion, not quota. Always check 'repquota -a' when debugging mysterious 'no space' errors.
Also worth noting: hardware RAID card failures can present as filesystem errors. A dying controller might corrupt writes transparently. If you see unexplained corruption on multiple filesystems, suspect the RAID controller before the disks.
SSD vs HDD: How File Systems Behave on Different Storage Media
Your file system's performance and reliability depend heavily on the underlying storage technology. Hard disk drives (HDDs) and solid-state drives (SSDs) have fundamentally different characteristics that affect file system behaviour.
HDDs: Seek time is the dominant factor (~10 ms per random read). Sequential reads are fast (~200 MB/s). The file system should try to keep related blocks close together (extents, block groups). Fragmentation directly hurts performance because each fragment requires an additional seek.
SSDs: No mechanical seek. Random reads are as fast as sequential (typically 50–100 µs access time). Fragmentation is irrelevant for performance. However, SSDs have write endurance limits and require TRIM to inform the controller which blocks are free. Without TRIM, write performance degrades over time as the SSD must erase blocks before writing (write amplification).
Modern file systems like ext4 support the DISCARD/TRIM operation. You can enable it via mount option discard or run fstrim periodically. Be careful: frequent discards can increase latency on some SSDs. Batch trimming (fstrim -a via cron) is often preferred.
File system alignment is critical on SSDs with 4K sectors. If file system blocks are not aligned to the SSD's erase block boundaries, write performance degrades dramatically. Most modern tools (mkfs.ext4) handle this automatically, but legacy partition tables may misalign.
Write amplification: Each SSD write operation may require erasing a larger block (e.g., 512 KB) even for a small 4 KB write. File systems that batch small writes (delayed allocation in ext4) reduce write amplification by grouping writes into larger contiguous chunks.
Here's something most articles miss: on NVMe drives with high queue depth (128+), XFS outperforms ext4 by 30% due to better parallelism in its allocation group design. I learned this the hard way benchmarking a database migration.
Another critical detail: the interaction between file system journal and SSD wear. Each journal write adds extra I/O, which on an SSD consumes write endurance. If you have a high-write workload on a consumer SSD (low TBW), consider raising the commit interval (commit=30 in mount options, up from the default of 5 seconds) to batch journal commits, or use a separate journal device (external journal) on a more resilient SSD.
For NVMe specifically, the PCIe lane count matters. A single PCIe 4.0 NVMe drive on four lanes provides roughly 7 GB/s. Stripe settings come into play when the filesystem sits on a striped array of such drives: if the stripe unit and width don't match the array geometry, you won't saturate those lanes. For a four-drive stripe with 128 KB chunks, XFS created with su=128k,sw=4 is a safer bet than ext4's default settings.
File System Mount Options and Performance Tuning — What Senior Engineers Change
Default mount options are designed for safety, not performance. In production, you'll almost always want to tune a few key parameters to reduce I/O overhead and match your workload.
Atime updates: With strict atime, every file read updates the access time (atime) in the inode, which turns reads into extra write I/O. relatime (the default on modern Linux) updates atime only if it is older than mtime or ctime, or more than 24 hours old. That removes most of the penalty but still causes a write on the first read after a modification. Use noatime if you don't need access times at all (common for databases and web servers).
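The relatime rule is small enough to write down. The sketch below is my paraphrase of the kernel's decision (update when atime is not newer than mtime or ctime, or when it is at least a day old); the function name is made up, and timestamps are plain seconds for simplicity.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_should_update(atime: float, mtime: float, ctime: float, now: float) -> bool:
    """Decide whether a read should write a new atime under relatime."""
    if mtime >= atime:                 # file changed since it was last read
        return True
    if ctime >= atime:                 # metadata changed since it was last read
        return True
    if now - atime >= DAY_SECONDS:     # atime is stale by a day or more
        return True
    return False


if __name__ == "__main__":
    now = 1_000_000.0
    # Read again one minute after the previous read, nothing modified: no write.
    print(relatime_should_update(now - 60, now - 3600, now - 3600, now))
    # First read after a modification: atime gets written.
    print(relatime_should_update(now - 3600, now - 60, now - 60, now))
```

This is why a hot, read-mostly file costs at most one atime write per day under relatime, while noatime costs none.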
Commit interval: The journal writes metadata every commit seconds (default 5). A lower commit improves crash safety by reducing the window of lost metadata, but increases write frequency. For write-heavy workloads, increasing commit to 30 or 60 seconds can reduce journal I/O by 75% or more. Trade-off: you lose up to 60 seconds of metadata changes in a crash (data is safe if using data=ordered).
Write barriers: Ensure that journal and metadata writes reach persistent storage in the correct order. Usually safe to disable on battery-backed RAID controllers (barrier=0), but dangerous on single SSDs or HDDs with volatile write caches, where a power loss can reorder writes. Default is on.
Data mode: Already covered in journaling section: data=ordered for most, data=journal for extreme consistency, data=writeback for performance at risk.
Delayed allocation: Enabled by default in ext4. Groups small writes into larger contiguous chunks before flushing, which reduces fragmentation and write amplification on SSDs. The cost is a larger window of unflushed data: a crash before the flush loses those writes. For most workloads the risk is small relative to the gains, and applications that need durability should call fsync().
Production tip: On a busy database server, I once cut I/O wait by 30% just by adding noatime,nodiratime,commit=30 to the mount options. The default 5-second commit was causing a journal flush storm on every transaction batch.
Tuning example: For a MySQL data directory on ext4, typical mount options: rw,noatime,nodiratime,data=ordered,commit=30,barrier=1. For an SSD with frequent fstrim, do not use discard option; use periodic fstrim.
Benchmark before and after: Use fio to measure I/O latency and throughput with different options. Documented savings of 10-20% write I/O are common when switching from defaults to tuned options.
One more option often overlooked: 'nodelalloc' to disable delayed allocation. This can be useful for databases that need immediate write ordering (e.g., PostgreSQL's full-page writes). But on most workloads, delayed allocation improves performance significantly — test both.
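Another senior trick: 'noauto_da_alloc' turns off ext4's heuristic that forces delayed-allocation blocks to disk when a file is replaced via rename or truncate. Disabling it can smooth out write bursts in some workloads, but applications that replace files without calling fsync() can then end up with zero-length files after a crash. Only change it if you understand the exact consequences.
tune2fs -l to verify journal size isn't overwhelmed.
File System Security & Permissions: Why POSIX ACLs and Extended Attributes Matter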
File systems enforce access control through permissions, capabilities, and extended attributes. On Linux, the standard Unix rwx model gives owner/group/world sets. But production environments need finer control: POSIX Access Control Lists (ACLs) allow specifying permissions for individual users or groups, and extended attributes (xattr) store metadata like file capabilities or SELinux labels.
POSIX ACLs: Set with setfacl and viewed with getfacl. They add a logical ACL entry to the inode's extended attributes. Useful for shared directories where one user needs read and another write. However, ACLs increase metadata size and can slow down directory listing operations.
Extended attributes: Namespace-stored metadata (user, trusted, security). Used by SELinux for security contexts, by attr for custom attributes. They are stored in the inode if small enough, otherwise in a separate block, impacting space and performance.
Immutable files: The chattr +i command on ext4 sets the immutable attribute, which blocks modification, deletion, and renaming even by root until the flag is cleared. Useful for system binaries and archived log files. My production lesson: a misbehaving process once tried to delete its own binary, and the immutable flag is what kept it on disk.
File capabilities: Instead of setuid root, you can grant specific capabilities to a binary (e.g., CAP_NET_BIND_SERVICE to bind to low ports). This reduces the attack surface. But capabilities are stored in extended attributes and can be stripped by file copies or backups.
Production insight: A common mistake is forgetting that NFS exports ignore local ACLs. If you export an ext4 filesystem via NFS, the ACLs are only enforced locally — the NFS server relies on the client's UID mapping. Suddenly your fine-grained ACLs are meaningless. Always test NFS with exportfs -v and monitor with nfsstat.
SELinux contexts: On Red Hat systems, the filesystem stores SELinux labels in extended attributes. A relabel operation (restorecon -R /) can take hours on large volumes. We once had a security audit fail because a backup restored files without SELinux contexts — the entire web server was inaccessible.
Performance trade-off: Every ACL or extended attribute adds to inode metadata size. For directories with thousands of files, listing with getfacl * can be slower than expected. Use it only where necessary.
Another hidden trap: chattr +a (append only) is great for logs, but not every writer copes with it. Programs that open files with O_APPEND (like rsyslog) work, but opening the file for writing without O_APPEND, or with O_TRUNC, fails with EPERM because the kernel enforces the append-only flag when the file is opened. Log rotation that truncates or deletes the file breaks for the same reason. Always test your implementation.
Also note: file capabilities (setcap) are lost if the file is copied to a filesystem that doesn't support xattr (like NFSv3, or FAT). Always store critical binaries on native ext4/XFS with xattr support.
Virtual File System (VFS): How Linux Supports Multiple File Systems Transparently
The Virtual File System (VFS) is a kernel abstraction layer that allows user-space applications to use the same system calls (open, read, write) regardless of the underlying filesystem. Each filesystem registers itself with VFS, providing a standard set of operations (inode operations, file operations, dentry operations). The VFS inode and dentry caches improve performance. This is why you can mount ext4 on /, XFS on /home, and an NFS share on /mnt, and all work with the same APIs.
Production insight: The dentry cache stores recently accessed directory entries. If you have a large directory with millions of files, the dentry cache can consume significant memory. You can tune it with 'vm.vfs_cache_pressure'. A common issue is that the dentry cache retains entries even after files are deleted, causing memory pressure. Setting vfs_cache_pressure=200 (default 100) makes the kernel reclaim dentry more aggressively.
Another gotcha: The VFS page cache caches file data. When a filesystem goes read-only due to errors, all writable mapped pages are invalidated. If an application has open file descriptors, writes may silently fail. Always check write system call return values.
Key takeaway: VFS is the glue that makes all file systems look the same to applications. Tuning dentry and inode caches can save memory on systems with many small files.
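Since failed writes only stay silent when nobody looks at the result, here is a minimal sketch of a write path that checks every step: it handles short writes, calls fsync() on the file, renames into place, and then fsyncs the directory. The atomic_write name and the ".tmp" suffix are illustrative choices, and the pattern assumes a POSIX filesystem where rename within a directory is atomic.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Write every byte, looping because os.write may write fewer than asked."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # raises OSError (e.g. EROFS, EIO) on failure
        view = view[written:]


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data so readers see either the old or the new content."""
    tmp_path = path + ".tmp"  # assumed temp name in the same directory
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)                  # push file data and metadata to stable storage
    finally:
        os.close(fd)
    os.replace(tmp_path, path)        # atomic rename over the old file
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # make the rename itself durable
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    atomic_write("state.json", b'{"status": "ok"}\n')
```

If the filesystem has been remounted read-only, the os.open or os.write call raises OSError instead of failing quietly, which is exactly the signal the application needs to stop and alert.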
fsync() for transactional writes.
File Access Patterns: Sequential vs Direct — Why Your Database is Slowing Down
Most juniors think file access is file access. It's not. The way you read data dictates whether your disk becomes a bottleneck or a workhorse. Sequential access reads data in order, block after block. Disk heads stay in motion without seeking. That's why log files and video streams scream. Direct access jumps to any block by its number. Databases love this for random lookups. The trap is mixing patterns. If you write a time-series log using direct access, you fragment your disk and kill write throughput. Always match your access pattern to your workload. Sequential writes to append-only logs. Direct reads for hash indexes. Know your workload before you choose your file system. ext4 with noatime for sequential. XFS with large allocation groups for concurrent random I/O. Your database will thank you.
Disk Free Space Management: How Bitmaps and Free Lists Keep You From Running Out
You delete a file. The space doesn't magically reappear. Something has to track which blocks are free. Two main strategies exist. Bitmaps use one bit per block. Simple, fast, and cache-friendly. ext4 uses this. A 1 TiB disk with 4 KB blocks needs just 32 MiB of bitmap (2^28 blocks, one bit each), small enough to cache in RAM. Free lists chain free blocks together. Early Unix file systems used them. Drawback? Fragmentation. As files come and go, the free list becomes a linked list of scattered blocks. Allocation gets slow. Modern file systems use hybrid approaches. ext4 groups blocks into block groups, each with its own bitmap and inode table. This keeps metadata local. When you allocate a file, it stays near its inode. Less head movement. Less latency. Never assume free space is contiguous. Always check fragmentation before blaming I/O issues.
df shows total free space, not contiguous free space. A disk at 60% capacity can still fragment badly if files are small and deleted often. Monitor per-block-group free space.
The Day an ext3 Superblock Corruption Took Down a Payment Gateway
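The bitmap arithmetic and the allocate/free cycle fit in a short sketch. BlockBitmap is a toy first-fit allocator made up for illustration; real allocators also try to place blocks near the file's inode and to find contiguous runs.

```python
def bitmap_bytes(device_bytes: int, block_size: int = 4096) -> int:
    """Size of a one-bit-per-block bitmap for a device of the given size."""
    blocks = device_bytes // block_size
    return (blocks + 7) // 8


class BlockBitmap:
    """Toy free-space bitmap: bit set means the block is in use."""

    def __init__(self, n_blocks: int):
        self.n_blocks = n_blocks
        self.bits = bytearray((n_blocks + 7) // 8)

    def is_used(self, block: int) -> bool:
        return bool((self.bits[block // 8] >> (block % 8)) & 1)

    def allocate(self) -> int:
        """First-fit: return the lowest free block and mark it used."""
        for block in range(self.n_blocks):
            if not self.is_used(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise OSError("no space left on device")

    def free(self, block: int) -> None:
        """Deleting a file only clears bits; the data blocks are not erased."""
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF


if __name__ == "__main__":
    print(bitmap_bytes(2**40) // 2**20, "MiB of bitmap for 1 TiB of 4 KB blocks")
    bm = BlockBitmap(16)
    first, second = bm.allocate(), bm.allocate()
    bm.free(first)
    print(first, second, bm.allocate())  # the freed block is handed out again
```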
- Always enable journaling on production filesystems — the performance hit (5-10% write overhead) is worth crash safety.
- Know your filesystem's backup superblock locations and how to recover from a corrupted primary superblock.
- Make regular dumps of superblock information using 'dumpe2fs -h' and store them off-box.
- Practice recovery scenarios in staging so you don't learn the procedure during an outage.
- Never assume default options are safe — verify 'has_journal' feature on every new filesystem with 'tune2fs -l'.
- Consider using ext4 or XFS for production — ext3's journal is optional and easily missed.
- Document backup superblock addresses (block 8193, 32768, etc.) in your runbook before a crash happens.
- Enable metadata checksums (metadata_csum) on new ext4 filesystems to detect silent corruption early.
- Test filesystem recovery from backup superblock at least once quarterly—theory doesn't survive panic.
dmesg | grep -i 'fs\|superblock\|i/o error\|recovery'
smartctl -H /dev/sdX (check disk health)
Key takeaways
Common mistakes to avoid
Not monitoring inode usage
Assuming journaling is always enabled
Using FAT32 for system partitions
Forgetting backup superblock locations
Not testing file system recovery in staging
Interview Questions on This Topic
What is the difference between an inode and a directory entry?
Frequently Asked Questions