Senior 11 min · March 06, 2026

IPC — Misaligned uint64 Torn Read in Shared Memory

A misaligned uint64 write at offset 1022 caused a torn read crash.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • IPC lets isolated processes exchange data through kernel-managed channels
  • Pipes: unidirectional byte stream, parent-child or named (FIFO) for unrelated procs
  • Sockets: bidirectional, works across processes on same host (AF_UNIX) or network (TCP)
  • Shared memory: fastest (70–150 ns access) — zero kernel copy — but requires explicit sync
  • Message queues: async, kernel-buffered, priority delivery — slower but decoupled in time
  • Production risk: shared memory races corrupt data silently — never skip mutexes
  • Biggest mistake: assuming all IPC is interchangeable — latency varies from 70ns to 50µs
Plain-English First

Imagine two chefs working in the same restaurant kitchen. They can't read each other's minds, so they need ways to pass information — one chef shouts across the kitchen (like a pipe), another writes on a shared whiteboard (like shared memory), and a third drops orders into a ticket queue (like a message queue). IPC is exactly that: the set of rules and tools the OS provides so that separate running programs can talk to each other, coordinate work, and share data without crashing into each other.

The OS keeps every process in its own virtual address space — that sandbox is why a crash in one tab doesn't bring down your whole browser. But it creates a hard problem: how do two isolated processes cooperate?

IPC solves exactly this. The kernel provides controlled channels — pipes, sockets, shared memory, message queues — that let data flow between address spaces safely. Each makes a different trade-off between speed, complexity, and what breaks under load.

Here's what most guides skip: the wrong IPC choice can tank your system's throughput by 100x or introduce data corruption that only reproduces on multi-socket machines. This article covers not just how each mechanism works but what fails in production and how to fix it.

Most engineers treat IPC as a one-dimensional decision — faster is better. That's wrong. The real question is what failure mode you can tolerate. Shared memory gives you speed but a single lock mistake corrupts all data. Message queues add latency but isolate failures. Choose based on your outage cost, not a microbenchmark.

What is Inter-Process Communication?

Inter-Process Communication is a core concept in OS design. The kernel keeps processes isolated — each in its own virtual address space. IPC bridges that gap, providing controlled channels for data exchange. There are four classic mechanisms: pipes, sockets, shared memory, and message queues.

Think of IPC as a spectrum. Pipes are simple byte streams for parent-child processes. Sockets add network transparency. Shared memory delivers raw speed (70–150 ns access) but demands manual synchronisation. Message queues give structured, async delivery with kernel buffering.

Here's the rule that matters: you rarely choose IPC directly — your libraries decide. PostgreSQL uses shared memory for its buffer pool. Get the size wrong and you'll either OOM or hit terrible disk reads. Profile IPC overhead under your actual workload, not synthetic benchmarks.

A common misconception: IPC mechanisms are drop-in replacements. They're not. Switch from shared memory to a message queue in a latency-critical path and you'll see a 100x slowdown. Always validate with your workload.

Another nuance: IPC performance depends on the kernel version and CPU architecture. On a NUMA system, touching a shared memory page that was allocated on a remote node can be 2–3x slower than a local access. Use numactl to pin processes and their memory allocations to the same node. Ignoring topology is one of the most common causes of inconsistent IPC performance in production.

io_thecodeforge_ipc_overview.cC
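#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) {              // fd[0] = read end, fd[1] = write end
        perror("pipe");
        return 1;
    }
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        // Child: send one NUL-terminated message to the parent
        close(fd[0]);
        const char *msg = "io.thecodeforge says: IPC works";
        if (write(fd[1], msg, strlen(msg) + 1) == -1)
            perror("write");
        close(fd[1]);
        return 0;
    }
    // Parent: read the message, then reap the child
    close(fd[1]);
    char buf[64] = {0};
    if (read(fd[0], buf, sizeof(buf) - 1) > 0)
        printf("Child said: %s\n", buf);
    close(fd[0]);
    wait(NULL);
    return 0;
}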
Nuance: NUMA Matters
On multi-socket machines, shared memory allocated on one NUMA node can be slower when accessed from another. Use numactl --membind to keep memory local to the processes that use it.
Production Insight
Your libraries choose IPC for you, not your config files.
PostgreSQL uses shared memory for its buffer pool — misconfigure it and you get OOM or terrible disk reads.
Rule: profile IPC overhead under load; 10µs pipe latency is fine for logs, deadly for a trading engine.
Key Takeaway
IPC isolates processes safely but forces data through kernel-managed channels.
Latency: shared memory < 1µs, pipes ~10µs, sockets ~50µs (local) to ms (remote).
Choose based on latency budget, data size, and whether processes are on the same host.
IPC Mechanism Decision Tree
If: Processes are parent-child and need a simple, unidirectional byte stream
Use: Unnamed pipe (pipe()). Zero config, lowest overhead.
If: Processes are unrelated, on the same host, and need a bidirectional byte stream
Use: Unix domain socket (AF_UNIX). socketpair() is convenient for related processes.
If: Processes need high throughput and low latency and share a large data set
Use: Shared memory (shm_open + mmap). Mandatory: mutex sync.
If: Processes are unrelated and need async message delivery with kernel buffering
Use: POSIX message queues (mq_open). Better than SysV; supports notification.
If: Processes are on different machines
Use: TCP sockets (AF_INET). Add message framing (a length prefix), because TCP delivers a byte stream with no message boundaries.

Pipes are the oldest and simplest IPC mechanism. An unnamed pipe created with pipe() gives two file descriptors: the write end and the read end. Data written to the write end is buffered in the kernel and read sequentially from the read end.

Pipes are unidirectional. For bidirectional communication, create two pipes or switch to a socketpair. The kernel buffer is finite: a Linux pipe holds 64 KiB by default, and fcntl(fd, F_SETPIPE_SZ, size) can grow it up to fs.pipe-max-size (1 MiB by default for unprivileged processes). If the buffer fills, write() blocks until the reader drains data.
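To make the buffer limits concrete, here is a minimal sketch that queries a pipe's capacity and asks the kernel to grow it. It assumes Linux (F_GETPIPE_SZ and F_SETPIPE_SZ are Linux extensions), and `grow_pipe_capacity` is a hypothetical helper name, not something from the article.

```c
#define _GNU_SOURCE           /* F_GETPIPE_SZ / F_SETPIPE_SZ are Linux extensions */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create a pipe, try to grow its kernel buffer to `want` bytes, and return
 * the capacity the kernel actually granted (or -1 if pipe() fails). */
int grow_pipe_capacity(int want)
{
    int fd[2];
    if (pipe(fd) == -1)
        return -1;

    int before = fcntl(fd[1], F_GETPIPE_SZ);
    assert(before > 0);

    /* The kernel may round the request up, or refuse it with EPERM when it
     * exceeds fs.pipe-max-size for an unprivileged process. */
    if (fcntl(fd[1], F_SETPIPE_SZ, want) == -1)
        perror("F_SETPIPE_SZ");

    int after = fcntl(fd[1], F_GETPIPE_SZ);
    printf("pipe capacity: %d -> %d bytes\n", before, after);

    close(fd[0]);
    close(fd[1]);
    return after;
}
```

Calling `grow_pipe_capacity(262144)` prints the capacity before and after the request; the exact numbers depend on the kernel and its limits.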

Named pipes (FIFOs) use a filesystem path and allow unrelated processes to communicate. Created with mkfifo(). The same read/write semantics apply, but now any process with permissions can open the file.
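A FIFO round trip can be sketched as follows. To stay self-contained the writer is a forked child, although any unrelated process that knows the path could play that role. The function name `fifo_roundtrip` and the message text are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send one message through a FIFO at `path` from a child to the parent.
 * Copies the text into `out` and returns its length, or -1 on error. */
ssize_t fifo_roundtrip(const char *path, char *out, size_t outsz)
{
    assert(outsz > 0);
    unlink(path);                              /* remove a stale FIFO, if any */
    if (mkfifo(path, 0600) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) { unlink(path); return -1; }

    if (pid == 0) {                            /* child: writer */
        const char *msg = "hello via fifo";
        int wfd = open(path, O_WRONLY);        /* blocks until a reader opens */
        if (wfd == -1) _exit(1);
        ssize_t w = write(wfd, msg, strlen(msg));
        close(wfd);
        _exit(w == (ssize_t)strlen(msg) ? 0 : 1);
    }

    int rfd = open(path, O_RDONLY);            /* blocks until a writer opens */
    if (rfd == -1) {
        kill(pid, SIGKILL);                    /* do not leave the child stuck in open() */
        waitpid(pid, NULL, 0);
        unlink(path);
        return -1;
    }
    ssize_t n = read(rfd, out, outsz - 1);
    close(rfd);
    waitpid(pid, NULL, 0);
    unlink(path);                              /* FIFOs persist until removed */
    if (n < 0)
        return -1;
    out[n] = '\0';
    return n;
}
```

Both open() calls block until the other side shows up, which is the main behavioural difference from an unnamed pipe.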

Pipes are great for one-shot data flows like grep | sort or for passing small control messages between a parent and its children. Don't force them into high-throughput scenarios — the kernel copy and scheduling overhead adds up.

One production gotcha: if the reader closes its end before the writer finishes, the writer receives SIGPIPE (termination) or gets EPIPE error on write if SIGPIPE is ignored. Always handle EPIPE in your writer loop.
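A writer loop that survives a vanished reader might look like the sketch below. It ignores SIGPIPE so that write() reports EPIPE instead of killing the process; `write_all` and `epipe_demo` are hypothetical names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, or -1 with errno set (EPIPE if the reader is gone). */
int write_all(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Close the read end, then write. With SIGPIPE ignored the process survives
 * and the error surfaces as EPIPE. Returns the errno that was observed. */
int epipe_demo(void)
{
    int fd[2];
    signal(SIGPIPE, SIG_IGN);      /* the default action would terminate us */
    if (pipe(fd) == -1)
        return -1;
    close(fd[0]);                  /* simulate a reader that exited early */

    int rc = write_all(fd[1], "late message", 12);
    int err = (rc == -1) ? errno : 0;
    close(fd[1]);
    return err;
}
```

In a real writer, EPIPE is the cue to stop producing, reconnect, or exit cleanly.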

A real-world story: a log aggregation pipeline used pipes between a producer and a consumer. The consumer was slow due to disk I/O, and the producer would block on write, stalling the entire logging thread. The fix was to set O_NONBLOCK on the write end and buffer in userspace — but only after losing a day of logs. Always test your pipe throughput against your worst-case burst rate.
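The non-blocking half of that fix can be sketched like this: the write end is switched to O_NONBLOCK, and a full pipe shows up as EAGAIN rather than a stalled thread. The user-space queue itself is left out; the helper names are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Put the descriptor into non-blocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Fill a pipe that nobody reads and count the bytes accepted before write()
 * reports EAGAIN. Returns that count, or -1 on setup failure. */
long bytes_until_eagain(void)
{
    int fd[2];
    char chunk[4096];
    long total = 0;

    memset(chunk, 'x', sizeof chunk);
    if (pipe(fd) == -1)
        return -1;
    if (set_nonblocking(fd[1]) == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    for (;;) {
        ssize_t n = write(fd[1], chunk, sizeof chunk);
        if (n > 0) { total += n; continue; }
        if (n == -1 && errno == EINTR) continue;
        assert(n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
        break;   /* pipe full: queue the chunk in user space or drop it here */
    }
    close(fd[0]);
    close(fd[1]);
    return total;
}
```

The returned count is the pipe's effective capacity, which is a useful number to compare against the worst-case burst size.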

Another subtle issue: the pipe buffer size is shared across all writers. If you have multiple writers, a single slow reader can cause all of them to block. Consider using a separate pipe per writer or a socket with a larger buffer.

In high-throughput scenarios, consider using splice() to move data between pipes and sockets without copying to userspace. It's a syscall but can reduce CPU overhead significantly when chaining pipes. However, splice requires one end to be a pipe. That's a niche optimisation but worth knowing.
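As a rough illustration of splice(), the sketch below moves bytes from one pipe to another without a user-space buffer in between, then reads them back to show they arrived. It assumes Linux; `splice_between_pipes` is a hypothetical helper name.

```c
#define _GNU_SOURCE          /* splice() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write `msg` into pipe A, splice it into pipe B inside the kernel, and read
 * it from B into `out`. Returns the number of bytes read, or -1 on error. */
ssize_t splice_between_pipes(const char *msg, char *out, size_t outsz)
{
    int a[2], b[2];
    size_t len = strlen(msg);
    assert(outsz > len);

    if (pipe(a) == -1)
        return -1;
    if (pipe(b) == -1) { close(a[0]); close(a[1]); return -1; }

    ssize_t n = -1;
    if (write(a[1], msg, len) == (ssize_t)len) {
        /* Offsets must be NULL for pipes; the payload is not copied to user space. */
        ssize_t moved = splice(a[0], NULL, b[1], NULL, len, SPLICE_F_MOVE);
        if (moved > 0) {
            n = read(b[0], out, outsz - 1);
            if (n >= 0)
                out[n] = '\0';
        }
    }
    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return n;
}
```

In a real pipeline the destination would more often be a socket or a file, with the pipe acting as the in-kernel staging buffer.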

io_thecodeforge_pipe_example.cC
#include <unistd.h>
#include <stdio.h>
#include <sys/wait.h>
#include <string.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        // Child: write one NUL-terminated message, then close the write end
        close(fd[0]);
        const char *msg = "io.thecodeforge says: data through pipe";
        if (write(fd[1], msg, strlen(msg) + 1) == -1)
            perror("write");
        close(fd[1]);
        return 0;
    }

    // Parent: read, check the result, and terminate the buffer ourselves
    close(fd[1]);
    char buf[64];
    ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
    if (n == -1) {
        perror("read");
    } else {
        buf[n] = '\0';
        printf("Received: %s\n", buf);
    }
    close(fd[0]);
    wait(NULL);
    return 0;
}
Pipe Gotcha: SIGPIPE and EPIPE
If the reader closes its end before the writer finishes, the writer receives SIGPIPE (which terminates it by default) or, if SIGPIPE is ignored, gets an EPIPE error from write(). Always handle EPIPE in your writer loop.
Production Insight
Pipes have no user-space buffering: data sits in a fixed-size kernel buffer, so a slow reader eventually blocks the writer.
In a log-aggregation pipeline, a slow reader caused the writer to block, stalling all logging and cascading into a full system freeze.
Fix: set O_NONBLOCK on the write end and log to a separate thread with a user-space buffer.
For high-throughput, splice() avoids user-space copies but requires a pipe as one end.
Key Takeaway
Pipes are simple, zero-config, but limited to parent-child or unrelated via FIFO.
Blocking write can kill throughput if reader is slow.
Use socketpair for bidirectional traffic on the same host: it behaves like a two-way pipe and returns two connected file descriptors.
Pipes Decision Tree
If: Need a one-way byte stream between parent and child
Use: Unnamed pipe (pipe()). Simple, no filesystem artifacts.
If: Need bidirectional and the processes are related
Use: socketpair(AF_UNIX, SOCK_STREAM), which avoids managing two pipes.
If: Need one-way between unrelated processes
Use: Named pipe (FIFO) via mkfifo(). Must handle permissions and cleanup.
If: Multiple writers, single slow reader
Use: Switch to a Unix socket or add per-writer pipes. O_NONBLOCK with a user-space buffer can help.

Sockets: Network-Transparent Bidirectional Communication

Sockets are the most versatile IPC mechanism. They can communicate within a single host using Unix domain sockets (AF_UNIX) or across a network using TCP (AF_INET). Unix domain sockets are file-based and support both stream (SOCK_STREAM) and datagram (SOCK_DGRAM) semantics.

For local IPC, Unix domain sockets are the gold standard: they support bidirectional byte streams, have lower latency than TCP (no loopback interface), and integrate well with select/poll/epoll. They also support passing file descriptors between processes via sendmsg() — extremely useful for delegating server sockets to workers.

A common pattern: a server creates a Unix domain socket at /tmp/app.sock, accepts connections from multiple clients, and services them using a thread pool or event loop. The biggest pitfall is forgetting to unlink() the socket file before bind(), or getting permissions wrong (the socket file must be writable by the client).

TCP sockets add network transparency but at a cost: latency is 10–100x higher, and you need to handle connection management, reconnection, and message framing yourself (length-prefix encoding is standard).

One real-world lesson: if you forget to set SO_RCVTIMEO or use a blocking accept, a single slow client can stall your entire server event loop. Use non-blocking I/O + epoll for any production socket server.
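A minimal sketch of readiness-based waiting with epoll follows. It uses a socketpair in place of a real listening socket so that it runs on its own; `wait_readable` and `epoll_demo` are hypothetical names, and a production loop would keep one epoll instance alive instead of creating one per call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Wait up to timeout_ms for `fd` to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep == -1)
        return -1;

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1) { close(ep); return -1; }

    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    if (n == -1)
        return -1;
    return (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
}

/* On a connected AF_UNIX pair: idle at first, readable after one write.
 * Returns 0 if both observations match. */
int epoll_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv) == -1)
        return -1;

    int idle = wait_readable(sv[0], 10);       /* nothing sent yet */
    ssize_t w = write(sv[1], "x", 1);
    assert(w == 1);
    int ready = wait_readable(sv[0], 1000);    /* one byte is pending */

    close(sv[0]);
    close(sv[1]);
    return (idle == 0 && ready == 1) ? 0 : -1;
}
```

Because the sockets are non-blocking, a slow peer costs a timeout on one descriptor rather than a stalled loop.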

Another subtle issue: partial reads. TCP is a stream, not a message API. You might read only half of what was sent. Always loop and check return values. A common bug is assuming a single read() gets the full message. Build a simple framing layer — four bytes of length followed by the payload — and you'll save hours of debugging.
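The framing layer described here can be sketched as a 4-byte big-endian length followed by the payload, with loops that tolerate short reads and writes. The function names are assumptions for illustration, and the demo runs over a socketpair rather than TCP so it is self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read exactly `len` bytes. Returns 0, or -1 on error or early EOF. */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0) return -1;                        /* peer closed mid-frame */
        if (n == -1) { if (errno == EINTR) continue; return -1; }
        p += n; len -= (size_t)n;
    }
    return 0;
}

/* Write exactly `len` bytes. Returns 0 or -1. */
int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) { if (errno == EINTR) continue; return -1; }
        p += n; len -= (size_t)n;
    }
    return 0;
}

int send_frame(int fd, const void *payload, uint32_t len)
{
    assert(payload != NULL || len == 0);
    uint32_t be = htonl(len);                         /* length prefix, network order */
    if (write_full(fd, &be, sizeof be) == -1) return -1;
    return write_full(fd, payload, len);
}

/* Returns the payload length, or -1 on error or if the frame exceeds `cap`. */
ssize_t recv_frame(int fd, void *buf, size_t cap)
{
    uint32_t be;
    if (read_full(fd, &be, sizeof be) == -1) return -1;
    uint32_t len = ntohl(be);
    if (len > cap) return -1;                         /* never trust a length off the wire */
    if (read_full(fd, buf, len) == -1) return -1;
    return (ssize_t)len;
}

/* Two frames sent back to back come out as two distinct messages. */
int framing_demo(void)
{
    int sv[2];
    char buf[32];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;
    int ok = send_frame(sv[0], "ping", 4) == 0
          && send_frame(sv[0], "pong!", 5) == 0
          && recv_frame(sv[1], buf, sizeof buf) == 4 && memcmp(buf, "ping", 4) == 0
          && recv_frame(sv[1], buf, sizeof buf) == 5 && memcmp(buf, "pong!", 5) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

The cap check in `recv_frame` matters as much as the loop: without it a corrupt or hostile length field turns into a buffer overflow.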

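Passing file descriptors via SCM_RIGHTS is a powerful feature: a supervisor can accept a connection and hand the connected socket to a worker process. The only precondition is a Unix domain socket between the two processes; there is no same-UID requirement, and the receiver gets a new descriptor that refers to the same open file. One message can carry several descriptors (Linux caps it at 253), and on a stream socket the ancillary data must travel with at least one byte of ordinary data.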
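A minimal sketch of SCM_RIGHTS passing is shown below. For brevity both ends live in one process and the descriptor being passed is the read end of a pipe; in practice the two sockets would belong to different processes. `send_fd`, `recv_fd`, and `fd_passing_demo` are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned buffer for one control message carrying one int. */
union fd_cmsg { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; };

/* Send `fd` over a connected AF_UNIX socket. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 'F';                          /* one real data byte accompanies the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    memset(&u, 0, sizeof u);

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new fd, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Pass a pipe's read end across a socketpair and read through the copy. */
int fd_passing_demo(void)
{
    int sv[2], p[2];
    char got[8] = { 0 };
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;
    if (pipe(p) == -1) return -1;
    if (send_fd(sv[0], p[0]) == -1) return -1;

    int copy = recv_fd(sv[1]);
    if (copy == -1) return -1;
    assert(copy != p[0]);                     /* a fresh descriptor number */
    close(p[0]);                              /* the original is no longer needed */

    ssize_t w = write(p[1], "token", 5);
    ssize_t n = (w == 5) ? read(copy, got, sizeof got - 1) : -1;
    close(copy); close(p[1]); close(sv[0]); close(sv[1]);
    return (n == 5 && memcmp(got, "token", 5) == 0) ? 0 : -1;
}
```

The received descriptor keeps working after the sender closes its own copy, which is what makes the hand-off pattern safe.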

Performance-wise, Unix domain sockets are about 2–3x faster than TCP loopback for small messages (under 1KB). For bulk transfers, the gap narrows but still favours Unix sockets. Benchmark your own workload.

io_thecodeforge_unix_domain_socket_server.cC
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>

int main(void) {
    const char *path = "/tmp/io_thecodeforge.sock";
    unlink(path);  // clean up previous run

    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock == -1) { perror("socket"); return 1; }

    struct sockaddr_un addr = {0};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("bind"); return 1;
    }
    if (listen(sock, 5) == -1) { perror("listen"); return 1; }

    // Blocks until one client connects (fine for a demo, not for production)
    int client = accept(sock, NULL, NULL);
    if (client == -1) { perror("accept"); return 1; }

    // Leave room for a terminator: the peer may not send one
    char buf[128];
    ssize_t n = read(client, buf, sizeof(buf) - 1);
    if (n == -1) { perror("read"); return 1; }
    buf[n] = '\0';
    printf("Server got: %s\n", buf);

    const char *reply = "Welcome from io.thecodeforge";
    if (write(client, reply, strlen(reply) + 1) == -1) perror("write");

    close(client); close(sock); unlink(path);
    return 0;
}
Socket Mental Model: Phone System
  • socket() = get a phone device
  • bind() = assign your phone number
  • listen() = wait for incoming calls
  • accept() = pick up the ring
  • connect() = dial
  • read/write = speak and listen
Production Insight
Unix domain sockets are 2–3x faster than loopback TCP for small messages.
But if you forget to set SO_RCVTIMEO or use a blocking accept, a single slow client can stall your entire server event loop.
Rule: use non-blocking I/O + epoll for any production socket server.
Partial reads will happen sooner or later, so always loop. Build a framing layer (length prefix) to avoid app-level corruption.
Key Takeaway
Unix sockets: bidirectional, local, low-latency, can pass FDs.
TCP sockets: network-enabled, higher overhead, need message framing.
Always set non-blocking and handle partial reads/writes — TCP is a stream, not a message API.
Socket Decision Tree
If: Same host, bidirectional, need FD passing
Use: Unix stream socket (AF_UNIX, SOCK_STREAM). Best latency, can pass FDs.
If: Same host, need datagrams (message boundaries preserved; reliable and ordered on Linux)
Use: Unix datagram socket (AF_UNIX, SOCK_DGRAM). No framing needed for fixed-size messages.
If: Cross-network, need a reliable stream
Use: TCP socket (AF_INET, SOCK_STREAM). Must handle partial reads, reconnection, and framing.
If: Cross-network, need low latency and can tolerate loss
Use: UDP socket (AF_INET, SOCK_DGRAM). Build your own reliability if needed.

Shared Memory: Raw Speed with Manual Synchronisation

Shared memory is the fastest IPC mechanism because data moves directly between process address spaces without kernel copies. On Linux, you create a shared memory object with shm_open() and then mmap() it into each process's address space. Both processes see the same physical pages.

Once mapped, you can write and read from the shared region using pointer dereferencing — no system calls. That's nanosecond access. But it comes with a huge responsibility: race conditions are guaranteed if you don't use synchronisation. The typical approach is to embed a pthread_mutex_t (or a futex-based lock) directly in the shared memory region, placed at the start, and protect critical sections.

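A common production pattern: a single producer publishes a new version of a data structure (e.g., a lookup table) under a sequence lock (seqlock). It bumps a sequence counter to an odd value, writes the data, then bumps the counter to the next even value. A consumer reads the counter, copies the data, and re-reads the counter; if the counter was odd or changed in between, it retries. Even then, the compiler and CPU can reorder memory accesses, so you need explicit ordering (atomic operations with acquire/release semantics or atomic_thread_fence).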
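The seqlock idea can be sketched with C11 atomics as below. This is a single-writer sketch under stated assumptions: the `quote_t` layout and function names are invented for illustration, and the struct would normally live inside the shared mapping rather than in a static variable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint      seq;     /* even = stable, odd = write in progress */
    _Atomic uint64_t price;   /* payload fields are relaxed atomics so that */
    _Atomic uint64_t qty;     /* concurrent access is not a C data race     */
} quote_t;

/* Single writer only: publish a new (price, qty) pair. */
void quote_write(quote_t *q, uint64_t price, uint64_t qty)
{
    unsigned s = atomic_load_explicit(&q->seq, memory_order_relaxed);
    assert((s & 1u) == 0);                                   /* no writer already active */
    atomic_store_explicit(&q->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);               /* odd seq visible before data */
    atomic_store_explicit(&q->price, price, memory_order_relaxed);
    atomic_store_explicit(&q->qty, qty, memory_order_relaxed);
    atomic_store_explicit(&q->seq, s + 2, memory_order_release);
}

/* Any number of readers: retry until a consistent snapshot is seen. */
void quote_read(quote_t *q, uint64_t *price, uint64_t *qty)
{
    unsigned s1, s2;
    do {
        s1 = atomic_load_explicit(&q->seq, memory_order_acquire);
        *price = atomic_load_explicit(&q->price, memory_order_relaxed);
        *qty   = atomic_load_explicit(&q->qty, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);           /* data loads before re-check */
        s2 = atomic_load_explicit(&q->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2);
}

/* Write once, read once; returns 0 if the snapshot matches. */
int seqlock_demo(void)
{
    static quote_t q;                 /* zero-initialised: seq starts even */
    uint64_t price = 0, qty = 0;
    quote_write(&q, 10150, 300);
    quote_read(&q, &price, &qty);
    return (price == 10150 && qty == 300) ? 0 : -1;
}
```

Readers never block the writer, which is why this pattern suits read-heavy data with one infrequent updater.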

The biggest mistake? Thinking that because you only write to your own partition, you don't need synchronisation. The production incident described in this article shows why: torn reads happen.

Another subtle issue: cache line bouncing when two processes on different CPU cores repeatedly modify adjacent variables in the same cache line (false sharing). Pad your shared structures to 64-byte boundaries to avoid this.
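Padding against false sharing can be sketched with alignas. The 64-byte figure is an assumption that holds for most x86-64 parts, and `struct shared_counters` is an invented layout used only for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Each counter is written by a different process, so each gets its own line. */
struct shared_counters {
    alignas(CACHE_LINE) _Atomic uint64_t produced;   /* written by the producer only */
    alignas(CACHE_LINE) _Atomic uint64_t consumed;   /* written by the consumer only */
};

/* Fail the build, not production, if the layout ever changes. */
static_assert(offsetof(struct shared_counters, consumed) % CACHE_LINE == 0,
              "consumed must start on its own cache line");
static_assert(sizeof(struct shared_counters) == 2 * CACHE_LINE,
              "each counter should occupy one full cache line");

/* Distance in bytes between the two counters. */
size_t counter_spacing(void)
{
    return offsetof(struct shared_counters, consumed)
         - offsetof(struct shared_counters, produced);
}
```

The cost is 112 bytes of padding for two 8-byte counters, which is usually a cheap trade for keeping the two writers off each other's cache line.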

One more trap: forgetting the PTHREAD_PROCESS_SHARED attribute when initialising the mutex. Without it, locking the mutex from any process other than the initialiser is undefined behaviour: it may appear to work, return an error, or hang. Always make the mutex process-shared and initialise it exactly once.
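A process-shared mutex can be set up roughly as follows. To keep the sketch short and self-contained, an anonymous MAP_SHARED mapping inherited across fork() stands in for shm_open() plus ftruncate(); the struct and function names are assumptions.

```c
#define _GNU_SOURCE           /* MAP_ANONYMOUS */
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared_region {
    pthread_mutex_t lock;     /* lives inside the mapping, at offset 0 */
    long counter;
};

/* Map a region shared with future children and initialise its mutex once. */
struct shared_region *region_create(void)
{
    struct shared_region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);  /* the crucial line */
    int rc = pthread_mutex_init(&r->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) { munmap(r, sizeof *r); return NULL; }

    r->counter = 0;
    return r;
}

/* Parent and child each add `iters` to the counter under the lock.
 * Returns the final counter value as seen by the parent, or -1 on error. */
long locked_increment_demo(int iters)
{
    struct shared_region *r = region_create();
    if (!r)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    for (int i = 0; i < iters; i++) {
        pthread_mutex_lock(&r->lock);
        r->counter++;
        pthread_mutex_unlock(&r->lock);
    }
    if (pid == 0)
        _exit(0);                         /* child is done */

    waitpid(pid, NULL, 0);
    long total = r->counter;
    assert(total == 2L * iters);          /* no lost updates */
    pthread_mutex_destroy(&r->lock);
    munmap(r, sizeof *r);
    return total;
}
```

Dropping the lock calls from the loop makes lost updates likely on a multi-core machine, which is an easy way to see what the mutex is buying.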

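Additionally, be aware of the size of the shared memory object. ftruncate() sets the size, and touching a mapped page that lies beyond the end of the object raises SIGBUS. Check the ftruncate() return value and map exactly the size you set. MAP_POPULATE pre-faults pages at mmap() time, which removes page-fault latency from the first access but does not protect against SIGBUS. On Linux, POSIX shared memory lives in the tmpfs mounted at /dev/shm, so its ceiling is the size of that mount; kernel.shmmax applies to System V segments.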

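For large-scale deployments, consider using huge pages (2MB or 1GB) to reduce TLB misses. MAP_HUGETLB is an mmap() flag for anonymous mappings; to share huge pages between unrelated processes, map a file on a hugetlbfs mount, use memfd_create() with MFD_HUGETLB, or use shmget() with SHM_HUGETLB. All of these require kernel configuration and reserved huge pages.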

io_thecodeforge_shared_memory_producer.cC
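#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <pthread.h>

#define SHM_NAME "/io_thecodeforge_shm"
#define SHM_SIZE 4096

int main(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) == -1) { perror("ftruncate"); return 1; }

    void *ptr = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

    // The mutex lives at offset 0 and must already have been initialised,
    // exactly once, elsewhere, with PTHREAD_PROCESS_SHARED.
    pthread_mutex_t *mtx = (pthread_mutex_t *)ptr;
    if (pthread_mutex_lock(mtx) != 0) {
        fprintf(stderr, "pthread_mutex_lock failed\n");
        return 1;
    }

    // Payload starts right after the mutex
    char *data = (char *)ptr + sizeof(pthread_mutex_t);
    const char *msg = "io.thecodeforge shared memory payload";
    memcpy(data, msg, strlen(msg) + 1);

    pthread_mutex_unlock(mtx);
    munmap(ptr, SHM_SIZE);
    return 0;
}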
Shared Memory Initialisation Trap
The mutex inside shared memory must be initialised with the PTHREAD_PROCESS_SHARED attribute. Otherwise, locking it from a process other than the initialiser is undefined behaviour: it may appear to work, fail, or hang. Do this exactly once: call pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) before pthread_mutex_init.
Production Insight
Shared memory is the mechanism behind PostgreSQL's buffer pool and many other inter-process data stores.
It's also where the nastiest race conditions live — ones that only show up on multi-socket machines under load.
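Rule: stress-test with concurrent readers/writers. Tools like ThreadSanitizer help only if you reproduce the access pattern with threads in a single process, because they cannot observe races between separate processes.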
False sharing can silently halve throughput — pad critical structs to 64 bytes.
Key Takeaway
Shared memory is the fastest IPC (zero kernel copies) but demands explicit synchronisation.
Always use mutexes or atomics with memory ordering — never trust partitioned writes alone.
For read-heavy workloads, consider a seqlock or RCU pattern.
Shared Memory Decision Tree
If: Need the fastest IPC for large data structures, same host
Use: Shared memory with a mutex or RW lock. Embed sync primitives at offset 0.
If: Read-heavy, single writer, infrequent updates
Use: Consider a seqlock (sequence counter + barrier) to avoid exclusive locks on reads.
If: Multiple writers, need atomic updates on counters
Use: C11 atomics or GCC __sync builtins. Align to an 8-byte boundary.
If: Performance critical, large dataset (>1GB)
Use: Enable huge pages (MAP_HUGETLB). Reserve via kernel boot params.

Message Queues: Structured Asynchronous Communication

Message queues (SysV msgget family or POSIX mq_open family) allow processes to send and receive discrete messages. Each message has a type or priority, and the kernel buffers them until the receiver picks them up. The sender doesn't block unless the queue is full.

POSIX message queues (preferred over SysV) provide
  • named queues (/myqueue) accessible by unrelated processes
  • prioritised delivery (messages with higher priority are received first)
  • notification (via mq_notify with a signal or thread) when a message arrives
  • asynchronous send with O_NONBLOCK

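The queue has a fixed maximum number of messages and a maximum message size, set at creation through struct mq_attr. The kernel caps both: for POSIX queues on Linux the ceilings are fs.mqueue.msg_max and fs.mqueue.msgsize_max, while msgmax and msgmnb govern System V queues. The limits cannot grow after creation, so handle a full queue and oversized messages explicitly.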

Message queues are ideal for event-driven systems where producers and consumers don't need to be alive at the same time. They decouple the sender and receiver in time (the kernel holds the message) and in space (receiver can be on a different process).

But they're slower than shared memory — each mq_send/mq_receive involves a system call and a kernel buffer copy. In high-throughput paths, shared memory + a condition variable beats message queues by orders of magnitude.

Be aware of priority effects: a steady stream of high-priority messages can starve low-priority ones, and a full queue blocks every sender regardless of priority. Monitor queue depths and set timeouts on both ends.

A production scenario: a logging service used a message queue between the app and the file writer. When the queue filled up, the app's mq_send blocked, which blocked the request handler. The entire site went down because the logger became the bottleneck. The fix: mq_timedsend with a timeout and fallback to dropping messages under backpressure.
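The timeout-and-drop fix can be sketched with mq_timedsend. The queue name and helper names below are hypothetical, and the demo uses a one-slot queue so the second send is guaranteed to hit the deadline. Older glibc versions need `-lrt` at link time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <time.h>

/* Try to enqueue, giving up after timeout_ms. Returns 0 on success, or the
 * errno value (ETIMEDOUT means the queue stayed full: drop or spill). */
int send_with_deadline(mqd_t mq, const char *msg, size_t len, long timeout_ms)
{
    assert(timeout_ms >= 0);
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);       /* mq_timedsend takes an absolute time */
    ts.tv_sec  += timeout_ms / 1000;
    ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }

    for (;;) {
        if (mq_timedsend(mq, msg, len, 0, &ts) == 0)
            return 0;
        if (errno != EINTR)
            return errno;
    }
}

/* Fill a one-slot queue, then show that the next send times out instead of
 * blocking forever. Returns the result of the second send. */
int backpressure_demo(void)
{
    const char *name = "/gr_demo_backpressure";   /* hypothetical queue name */
    struct mq_attr attr = { .mq_maxmsg = 1, .mq_msgsize = 32 };

    mq_unlink(name);                              /* ignore "does not exist" */
    mqd_t mq = mq_open(name, O_CREAT | O_EXCL | O_WRONLY, 0600, &attr);
    if (mq == (mqd_t)-1)
        return errno;

    int first  = send_with_deadline(mq, "log line 1", 10, 50);   /* fits */
    int second = send_with_deadline(mq, "log line 2", 10, 50);   /* queue is full */

    mq_close(mq);
    mq_unlink(name);
    return first == 0 ? second : first;
}
```

A caller that sees ETIMEDOUT can count the dropped message and move on, which keeps the request handler responsive when the consumer falls behind.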

Another subtlety: POSIX message queue names are limited to NAME_MAX (often 255) characters and must start with a slash. SysV queues use a numeric key. Prefer POSIX for portability and notification support.

Also note: message queues are not suitable for real-time systems with tight deadlines due to unpredictable kernel scheduling. The time between mq_send and mq_receive can vary by microseconds to milliseconds under load.

io_thecodeforge_mq_receiver.cC
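#include <fcntl.h>      // O_CREAT, O_RDONLY
#include <mqueue.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  // ssize_t

#define MQ_NAME    "/io_thecodeforge_mq"
#define MQ_MSGSIZE 256

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = MQ_MSGSIZE };
    mqd_t mq = mq_open(MQ_NAME, O_CREAT | O_RDONLY, 0644, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    // One extra byte so a full-size message can still be NUL-terminated
    char buf[MQ_MSGSIZE + 1];
    unsigned int prio;
    // Blocks until a message arrives; the length passed must be >= mq_msgsize
    ssize_t n = mq_receive(mq, buf, MQ_MSGSIZE, &prio);
    if (n == -1) { perror("mq_receive"); return 1; }

    buf[n] = '\0';
    printf("Received: %s (priority %u)\n", buf, prio);
    mq_close(mq);
    mq_unlink(MQ_NAME);
    return 0;
}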
Use mq_notify for Event-Driven Consumption
Instead of polling mq_receive in a loop, use mq_notify to get a signal or spawn a thread when a message arrives. This reduces CPU usage and latency. But beware: the notification is one-shot — you must rearm it after each receipt.
Production Insight
Message queues are ideal for decoupled, async workflows but can become a bottleneck.
A logging service using a full queue blocked the entire request handler, taking the site down.
Fix: set a timeout on send and drop messages under backpressure instead of blocking.
Priority starvation is real: monitor depths and consider separate queues for critical vs bulk messages.
Key Takeaway
Message queues decouple sender and receiver in time and space.
Slower than shared memory (syscall + copy) but safer and easier to use across unrelated processes.
Prefer POSIX queues over SysV — they're more portable and support notification.
Message Queue Decision Tree
If: Need async, decoupled messaging, same host, moderate throughput
Use: POSIX message queue (mq_open). Named, priority support, notification available.
If: Need fixed-priority delivery and the sender must never block
Use: O_NONBLOCK and handle EAGAIN with a fallback, or use mq_timedsend.
If: Need persistent distributed messaging, cross-network
Use: Kafka or RabbitMQ. Kernel message queues are local-only; don't try to stretch POSIX MQ across machines.
If: Real-time critical, tight latency bounds
Use: Avoid message queues. Use shared memory with condition variables or lock-free ring buffers.

IPC Performance Tuning: Kernel Parameters That Actually Matter

The Linux kernel exposes dozens of sysctl knobs that control IPC behavior. Getting these wrong means your pipe buffer starves, shared memory allocations fail, or message queues block unnecessarily.

For pipes: fs.pipe-max-size (default 1 MB) caps the kernel buffer. Use fcntl(fd, F_SETPIPE_SZ, size) to increase per-pipe up to the max. For high-throughput pipelines, increase fs.pipe-max-size and set appropriate per-pipe sizes.

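For shared memory: kernel.shmmax sets the maximum size of a single System V shared memory segment, and kernel.shmall limits the total number of shared pages. Older kernels and some distros ship a 32 MB shmmax, while recent kernels default to an effectively unlimited value, so check before tuning. If your app needs a large System V region (say 16 GB) on a constrained system, you must increase both. POSIX objects created with shm_open() are not governed by these knobs; they are bounded by the size of the tmpfs at /dev/shm. Also, kernel.shm_rmid_forced can help clean up orphaned segments after a crash.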

For message queues: kernel.msgmnb (default 16 KB) caps total bytes in a queue. kernel.msgmax caps per-message size. kernel.msgmni caps number of queues system-wide. Monitor with ipcs -q -l.

For semaphores: kernel.sem sets semaphore limits. If you use System V semaphores for IPC sync, ensure SEMMSL isn't too low.

A common production trap: a team increased shmmax but forgot shmall, so shmget failed with ENOSPC once the system-wide page limit was reached. Always set both. Also, changes via sysctl -w are ephemeral, so persist them in /etc/sysctl.conf.

Here's a snippet that sets sane defaults for a database workload:

```bash
# Increase shared memory limits for Postgres-like workloads
sysctl -w kernel.shmmax=68719476736   # 64 GB
sysctl -w kernel.shmall=16777216      # 64 GB in pages (4K each)
sysctl -w kernel.msgmnb=65536         # 64 KB per queue
sysctl -w fs.pipe-max-size=4194304    # 4 MB per pipe
```

After editing sysctl.conf, reload it with sysctl -p (or reboot).

io_thecodeforge_ipc_sysctl.shBASH
#!/bin/bash
# Set IPC kernel parameters for production workloads
# Run as root

sysctl -w kernel.shmmax=68719476736
sysctl -w kernel.shmall=16777216
sysctl -w kernel.msgmnb=65536
sysctl -w kernel.msgmax=65536
sysctl -w fs.pipe-max-size=4194304

# Persist across reboots
cat >> /etc/sysctl.conf <<EOF
kernel.shmmax=68719476736
kernel.shmall=16777216
kernel.msgmnb=65536
kernel.msgmax=65536
fs.pipe-max-size=4194304
EOF
Parameter Order Matters
Some kernel parameters are interdependent. For example, kernel.shmall must be at least kernel.shmmax / PAGE_SIZE. If you increase shmmax without adjusting shmall, large allocations fail with ENOSPC even though shmmax looks sufficient. Always check with ipcs -lm after changes.
Production Insight
A team increased shmmax but forgot shmall, causing shmget to fail with ENOSPC at scale.
Always set both and persist in /etc/sysctl.conf.
Verify with ipcs -lm after each change.
For pipe-intensive workloads, increase pipe-max-size and set per-pipe sizes via fcntl. Don't rely on the 64 KiB default capacity.
Key Takeaway
Kernel parameters control IPC resource limits.
shmmax and shmall must be set together.
pipe-max-size and msgmnb are the most commonly misconfigured.
IPC Tuning Decision Tree
If: shmget fails with EINVAL or ENOSPC for a large segment
Use: Increase kernel.shmmax and kernel.shmall (shmall >= shmmax / PAGE_SIZE). Verify with ipcs -lm.
If: Pipe write blocks even with space in the user buffer
Use: Increase fs.pipe-max-size via sysctl, and set per-pipe size with fcntl(fd, F_SETPIPE_SZ, new_size).
If: Message queue send blocks (full queue)
Use: Increase kernel.msgmnb and kernel.msgmax. Consider using non-blocking send with a fallback.
If: Need to clean up orphaned shared memory after a crash
Use: Enable kernel.shm_rmid_forced. Also consider using shm_unlink in application cleanup handlers.
If: Semaphore operations fail with ENOSPC
Use: Increase kernel.sem values: SEMMSL, SEMMNS, SEMOPM, SEMMNI. Use ipcs -sl to check current limits.

Real-World IPC: How PostgreSQL and Redis Use Shared Memory

Two of the most popular databases rely on shared memory for performance-critical paths.

PostgreSQL uses shared memory for its shared buffer pool, WAL buffers, and lock tables. On releases before 9.3, or when shared_memory_type is set to sysv, shared_buffers is allocated from a System V segment, so a kernel.shmmax or kernel.shmall smaller than shared_buffers makes PostgreSQL fail to start with an IPC error. Newer releases allocate the main region with anonymous mmap and need only a tiny System V segment. Where the limit applies, the fix is to increase the kernel parameters or reduce shared_buffers.

Redis, by contrast, does not use shared memory for persistence. It snapshots by calling fork() and relying on copy-on-write pages, and it uses socket IPC for replication and client traffic.

But the real shared-memory workhorse is PostgreSQL. A production story: a PostgreSQL deployment had shared_buffers set to 8 GB but kernel.shmmax was only 1 GB. The server refused to start. The fix was to increase kernel.shmmax to 12 GB. Always check ipcs -lm after changes.

Another example: Apache HTTP uses shared memory for the scoreboard to track worker processes. Nginx uses shared memory for rate limiting and load balancing across workers. Both depend on correct kernel tuning.

For custom applications, a ring buffer in shared memory is a common pattern for high-throughput logging or metrics collection. The key is to use a lock-free design with atomic sequence numbers and enough padding to avoid false sharing.
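A single-producer, single-consumer ring of this kind can be sketched with C11 atomics. The slot count, the 64-byte cache-line figure, and all names are assumptions for illustration; in a real deployment `struct ring` would be placed inside the shared mapping.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 8u      /* must be a power of two */
#define CACHE_LINE 64      /* assumption: 64-byte cache lines */

struct ring {
    alignas(CACHE_LINE) _Atomic uint32_t head;   /* next slot to write; producer-owned */
    alignas(CACHE_LINE) _Atomic uint32_t tail;   /* next slot to read; consumer-owned */
    alignas(CACHE_LINE) uint64_t slot[RING_SLOTS];
};

static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0,
              "RING_SLOTS must be a power of two");

/* Producer side. Returns 1 if stored, 0 if the ring is full. */
int ring_push(struct ring *r, uint64_t v)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return 0;                                     /* full: caller drops or retries */
    r->slot[head & (RING_SLOTS - 1)] = v;
    /* Release: the slot contents become visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side. Returns 1 and fills *out, or 0 if the ring is empty. */
int ring_pop(struct ring *r, uint64_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return 0;                                     /* empty */
    *out = r->slot[tail & (RING_SLOTS - 1)];
    /* Release: the slot is handed back only after it has been read. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

This design is only correct with exactly one producer and one consumer; multiple writers need a different scheme, such as a compare-and-swap on the head index.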

In production, always monitor shared memory usage with ipcs -m and ensure your monitoring alerts on segment allocation failures. A sudden spike in shared memory usage can indicate a bug that leaks segments — set up a job to clean up orphaned segments on startup.

io_thecodeforge_check_shm.shBASH
#!/bin/bash
# Check shared memory usage and kernel limits

echo "=== Shared Memory Limits ==="
sysctl kernel.shmmax kernel.shmall 2>/dev/null

echo ""
echo "=== Current Shared Memory Segments ==="
ipcs -m

echo ""
echo "=== IPC Limits (ipcs -lm) ==="
ipcs -lm
PostgreSQL IPC Startup Failure
PostgreSQL will refuse to start if it cannot allocate its System V segment because kernel.shmmax is too small. The error message typically reads 'could not create shared memory segment', followed by the system error text. Increase kernel parameters as needed.
Production Insight
PostgreSQL's shared_buffers is a direct consumer of shared memory.
If shmmax is too low, the database won't start — you get a cryptic IPC error.
Rule: always set kernel.shmmax >= shared_buffers + some margin.
Redis doesn't use shared memory for data, but its replication uses socket IPC.
Monitor ipcs -m for orphaned segments that can slowly exhaust system memory.
Key Takeaway
PostgreSQL, Apache, Nginx — all use shared memory behind the scenes.
Get the kernel parameters wrong and they refuse to start or fail to allocate.
Always verify shared memory limits before deploying a database or web server at scale.
Real-World IPC Decision Tree
If: Running PostgreSQL with large shared_buffers
Use: Ensure kernel.shmmax and kernel.shmall are sufficiently large. Increase if necessary.
If: Custom high-throughput logging using a ring buffer
Use: Shared memory with atomic sequence numbers and padded structs to avoid false sharing.
If: Nginx load balancing across workers
Use: Shared memory for rate limit zones. Pre-allocate with adequate size.
If: Apache HTTP with MPM worker
Use: Shared memory is used for the scoreboard. Monitor with ipcs -m; clean up after graceful restart.
● Production incident · POST-MORTEM · severity: high

The Shared Memory Race That Brought Down a Trading Engine

Symptom
Intermittent data corruption: order book entries showed prices from different stocks, trades failed to match, system logged 'impossible' states.
Assumption
The team assumed that because each process only wrote to its own offset within the shared region, no race existed. They used a fixed partitioning scheme — process A owned bytes 0–1023, process B owned 1024–2047 — but both processes also read from each other's regions to detect updates.
Root cause
An 8-byte (uint64) write on x86 is not guaranteed to be atomic when it is misaligned and straddles a cache-line boundary. Process A wrote an 8-byte price at offset 1022, which crosses both the 1024-byte partition boundary and a 64-byte cache-line boundary, and process B's reader saw a torn value: the first 2 bytes from the new write and the remaining 6 from the previous value. That torn integer was read as a negative price, which cascaded into a market order being sent on the wrong side.
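The arithmetic behind the tear is easy to check. The sketch below assumes 64-byte cache lines and a region whose base address is itself cache-line aligned (true for page-aligned mappings); both helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: typical x86-64 line size */

/* Returns 1 if an access of `size` bytes at `offset` touches two lines. */
int crosses_cache_line(size_t offset, size_t size)
{
    assert(size > 0);
    return offset / CACHE_LINE != (offset + size - 1) / CACHE_LINE;
}

/* Number of bytes of the access that land in the first cache line. */
size_t bytes_in_first_line(size_t offset, size_t size)
{
    size_t room = CACHE_LINE - offset % CACHE_LINE;
    return size < room ? size : room;
}
```

For the incident's write, `crosses_cache_line(1022, 8)` is 1 and `bytes_in_first_line(1022, 8)` is 2, which matches the 2-byte/6-byte split the reader observed. The same write at offset 1024 stays inside one line.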
Fix
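1. Align all shared variables to their natural size (uint64 on an 8-byte boundary). 2. Add a pthread_mutex_t per partition, initialised with PTHREAD_PROCESS_SHARED. 3. Use C11 atomics (stdatomic.h) with acquire/release ordering for lock-free reads where possible; volatile sig_atomic_t is meant for signal handlers and gives no cross-process ordering guarantees. 4. Run with ThreadSanitizer on the test cluster before every production deploy.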
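Steps 1 and 3 of the fix might look like the sketch below. The struct and field names are hypothetical and only mirror the 1024-byte partitions described in the incident; the compile-time checks are there so a layout change that reintroduces a misaligned field fails the build instead of failing in production.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PARTITION_BYTES 1024

/* Hypothetical per-process partition. */
struct partition {
    _Atomic uint64_t price;    /* naturally aligned 8-byte value */
    _Atomic uint64_t updates;  /* bumped after each price write  */
    unsigned char reserved[PARTITION_BYTES - 2 * sizeof(uint64_t)];
};

struct shared_region {
    alignas(64) struct partition a;  /* bytes 0-1023    */
    alignas(64) struct partition b;  /* bytes 1024-2047 */
};

static_assert(sizeof(struct partition) == PARTITION_BYTES,
              "partition must be exactly 1024 bytes");
static_assert(offsetof(struct shared_region, b) == PARTITION_BYTES,
              "partition b must start at byte 1024");
static_assert(offsetof(struct partition, price) % alignof(uint64_t) == 0,
              "price must be naturally aligned");

/* Writer side: store the price, then bump the update counter. */
void publish_price(struct partition *p, uint64_t price)
{
    atomic_store_explicit(&p->price, price, memory_order_release);
    atomic_fetch_add_explicit(&p->updates, 1, memory_order_release);
}

/* Reader side: a single atomic load can never observe a torn value. */
uint64_t read_price(struct partition *p)
{
    return atomic_load_explicit(&p->price, memory_order_acquire);
}
```

This only protects a single 8-byte field. Anything that must change together, such as price plus quantity, still needs the per-partition mutex or a sequence-lock scheme.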
Key lesson
  • Shared memory is not safe without synchronisation, even with 'non-overlapping' writes
  • Assume all data races are possible: the compiler and CPU can reorder operations
  • Always align shared variables to their type size
  • Use proper sync primitives — lock-free algorithms require proof, not hope
Production debug guide · Common IPC failures and the exact commands to diagnose them · 5 entries
Symptom · 01
Pipe write returns EPIPE (Broken pipe)
Fix
Check if the reading end is closed. Use lsof -c <process> to see open file descriptors. Ensure reader hasn't crashed or exited early.
Symptom · 02
Socket connect hangs indefinitely
Fix
Run strace -e trace=connect,recvfrom,sendto -p <PID> to see where the connection stalls. Use ss -tlnp to verify that the server is listening on the expected port and to inspect its backlog. Check firewall rules with iptables -L.
Symptom · 03
Shared memory read returns stale or corrupted data
Fix
1. Verify mutex locking with valgrind --tool=helgrind. 2. Use cat /proc/<PID>/maps to confirm shared memory region is mapped with correct permissions. 3. Dump raw bytes with dd if=/dev/shm/<name> bs=1 count=<size> | xxd to inspect corruption.
Symptom · 04
Message queue send blocks forever
Fix
Check queue usage with ipcs -q. Compare the used bytes to msg_qbytes. Use strace -e trace=msgsnd to see if it's blocking on a full queue. Increase the kernel parameter kernel.msgmnb or add the non-blocking flag.
Symptom · 05
Unix socket bind fails with EADDRINUSE
Fix
Check if the socket file exists with ls -l /path/to/socket. Remove stale files with rm /path/to/socket. Ensure your application unlinks the socket on shutdown. Use lsof /path/to/socket to see if another process still has it open.
★ IPC Quick Debug Commands · The 4 most common IPC failures and what to run in under 30 seconds.
Pipe broken EPIPE
Immediate action
`strace -e write -p <writer_pid>`
Commands
`lsof -c <process_name> | grep pipe`
`echo 'pipe size' ; cat /proc/sys/fs/pipe-max-size`
Fix now
Ensure reader process stays alive; add SIGPIPE handler or set it to SIG_IGN
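The SIG_IGN option can be sketched like this. With SIGPIPE ignored, a write to a pipe whose reader has gone returns -1 with errno set to EPIPE, which the writer can log and handle. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Ignore SIGPIPE so a write to a closed pipe fails with EPIPE
 * instead of terminating the process. Returns 0 on success. */
int ignore_sigpipe(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Returns the number of bytes written, or -errno on failure. */
long checked_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    ssize_t n;
    do {
        n = write(fd, buf, len);
    } while (n < 0 && errno == EINTR);   /* retry interrupted writes */
    return n < 0 ? -(long)errno : (long)n;
}

/* Demo: write into a pipe whose read end is already closed. */
long demo_broken_pipe(void)
{
    int fds[2];
    if (ignore_sigpipe() != 0 || pipe(fds) != 0)
        return -(long)errno;
    close(fds[0]);                        /* simulate a dead reader */
    long result = checked_write(fds[1], "x", 1);
    close(fds[1]);
    return result;
}
```

`demo_broken_pipe()` returns `-EPIPE` rather than killing the caller. Ignoring the signal process-wide is a blunt choice; a library that cannot change signal dispositions would need a different approach.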
Socket connect timeout
Immediate action
`ss -tlnp | grep <port>`
Commands
`ping <server_ip>`
`tcpdump -i any port <port> -c 10`
Fix now
Check server listen backlog, increase somaxconn, verify no firewall drop
Shared memory stale data
Immediate action
`cat /proc/<pid>/maps | grep /dev/shm`
Commands
`valgrind --tool=helgrind --log-file=race.log ./program`
`sudo ipcs -m -p`
Fix now
Add mutex around shared data accesses; use atomic operations for counters
Message queue send blocks
Immediate action
`ipcs -q` and check `msqid`
Commands
`ipcs -q -i <msqid>` for cbytes (bytes in use) vs qbytes (byte limit)
`strace -e trace=msgsnd -p <pid>`
Fix now
Increase kernel.msgmnb via sysctl, or switch to non-blocking send (IPC_NOWAIT)
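The non-blocking option for System V queues can be sketched as below. The message layout and function names are hypothetical; the demo creates a private queue, fills it until the kernel reports it full, and removes it again. It assumes System V IPC is available in the current environment.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct order_msg {      /* hypothetical message layout */
    long mtype;         /* must be > 0 */
    char text[64];
};

/* Returns 0 if sent, 1 if the queue is full, -1 on any other error. */
int try_send(int msqid, const char *text)
{
    assert(text != NULL);
    struct order_msg m;
    m.mtype = 1;
    strncpy(m.text, text, sizeof m.text - 1);
    m.text[sizeof m.text - 1] = '\0';
    /* IPC_NOWAIT: fail with EAGAIN instead of blocking on a full queue. */
    if (msgsnd(msqid, &m, sizeof m.text, IPC_NOWAIT) == 0)
        return 0;
    return errno == EAGAIN ? 1 : -1;
}

/* Demo: fill a private queue and report how many messages fit.
 * Returns the count, or -1 if the queue could not be used. */
int demo_fill_queue(void)
{
    int id = msgget(IPC_PRIVATE, 0600);
    if (id < 0)
        return -1;
    int sent = 0, rc = 0;
    while (sent < 100000) {             /* safety cap for huge limits */
        rc = try_send(id, "order");
        if (rc != 0)
            break;
        sent++;
    }
    msgctl(id, IPC_RMID, NULL);         /* always remove the queue */
    return rc == -1 ? -1 : sent;
}
```

On a full queue the caller gets an immediate, distinguishable result and can drop, retry with backoff, or raise an alert, instead of hanging inside `msgsnd`.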
IPC Mechanism Comparison
| Mechanism | Typical Latency | Data Shape | Scope | Sync Needed | Max Throughput |
|---|---|---|---|---|---|
| Pipe | ~10 µs | Byte stream | Same host (parent-child or FIFO) | No (kernel serializes) | ~1 GB/s |
| Unix Socket | ~50 µs | Byte stream or datagram | Same host | No (kernel serializes) | ~2 GB/s |
| TCP Socket | ~100 µs (local) – 10 ms (WAN) | Byte stream | Any network | No (kernel serializes) | ~1 GB/s (local) |
| Shared Memory | ~70–150 ns | Raw memory | Same host | Yes (mutex/atomics) | ~10 GB/s |
| Message Queue | ~10–100 µs | Messages (priority) | Same host | No (kernel buffers) | ~100 MB/s |

Key takeaways

1
IPC mechanisms are not interchangeable: latency ranges from 70ns (shared memory) to 50µs (message queues).
2
Shared memory is the fastest but requires explicit synchronisation; a missing mutex corrupts data silently.
3
Pipes are simple and zero-config but unidirectional; anonymous pipes only connect related processes, so unrelated processes need a named pipe (FIFO).
4
Unix domain sockets are the gold standard for local bidirectional IPC: use them over TCP on the same host.
5
Message queues decouple sender and receiver but can become a bottleneck if queue size is misconfigured.
6
Tune kernel parameters (shmmax, pipe-max-size, msgmnb) for your workload: defaults are often too small.

Common mistakes to avoid

5 patterns
×

Using pipes for bidirectional communication

Symptom
Deadlock or data corruption because both ends try to write and read on the same pipe.
Fix
Use socketpair(AF_UNIX, SOCK_STREAM, 0, fds) for bidirectional communication. Two pipes also work but are clunkier.
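A minimal sketch of the socketpair approach: the parent sends a request and reads the reply on the same descriptor, while a forked child does the reverse on the other end. The function name and the "ping"/"pong" payloads are illustrative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sends "ping" to a forked child and copies its reply into `reply`.
 * Returns 0 on success, -1 on failure. */
int ping_child(char *reply, size_t reply_len)
{
    assert(reply != NULL && reply_len > 1);
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                       /* child: uses fds[1] only */
        char buf[16] = {0};
        close(fds[0]);
        ssize_t n = read(fds[1], buf, sizeof buf - 1);
        if (n > 0 && strcmp(buf, "ping") == 0) {
            ssize_t w = write(fds[1], "pong", 4);
            (void)w;
        }
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                        /* parent: uses fds[0] only */
    int ok = -1;
    if (write(fds[0], "ping", 4) == 4) {
        ssize_t n = read(fds[0], reply, reply_len - 1);
        if (n > 0) {
            reply[n] = '\0';
            ok = 0;
        }
    }
    close(fds[0]);
    waitpid(pid, NULL, 0);                /* reap the child */
    return ok;
}
```

Each side reads and writes on its own descriptor, so there is no ambiguity about who consumes which bytes. With SOCK_STREAM there are no message boundaries, so real protocols need framing; the single short read here works only because the payloads are tiny.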
×

Forgetting to unlink Unix socket files before bind

Symptom
bind() fails with EADDRINUSE on restart even though no process is listening.
Fix
Always unlink(socket_path) before bind. SO_REUSEADDR does not help here: it has no effect on AF_UNIX sockets, so removing the stale file is the reliable approach.
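The unlink-then-bind pattern can be sketched as follows. The function name and backlog value are assumptions. Note that unlinking blindly also removes the path of a server that is still running, so real deployments often guard this with a lock file or a connect probe first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Creates a listening Unix stream socket at `path`, removing any stale
 * socket file first. Returns the listening fd, or -1 on error. */
int listen_unix(const char *path)
{
    assert(path != NULL);
    struct sockaddr_un addr;
    if (strlen(path) >= sizeof addr.sun_path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);          /* length checked above */

    /* A missing file is fine; any other unlink failure is an error. */
    if (unlink(path) != 0 && errno != ENOENT) {
        close(fd);
        return -1;
    }
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(fd, 16) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Closing the descriptor does not remove the file, which is exactly how stale paths appear after a crash. A clean shutdown path should still call `unlink(path)` itself.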
×

Not aligning shared memory variables to their type size

Symptom
Torn reads – a uint64 write crossing a cache line boundary produces a corrupted value on the reader.
Fix
Align every variable to its natural size using alignas or struct padding. For x86 uint64, align to 8 bytes.
×

Assuming message queue send never blocks

Symptom
Application hangs when queue reaches capacity; no error logged.
Fix
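For POSIX queues, use mq_timedsend() with a timeout or open the queue with O_NONBLOCK; for System V queues, pass IPC_NOWAIT to msgsnd(). Monitor queue depth with ipcs -q (System V) or by reading /dev/mqueue/<name> (POSIX, when the mqueue filesystem is mounted).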
×

Initialising a mutex in shared memory without PTHREAD_PROCESS_SHARED

Symptom
Undefined behaviour: depending on the platform, the second process may get EINVAL, hang under contention, or silently lose mutual exclusion.
Fix
Set PTHREAD_PROCESS_SHARED attribute on the mutex before initialisation. Only initialise once.
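A minimal sketch of the fix, assuming Linux and an anonymous shared mapping inherited across fork(); the struct and function names are hypothetical. The mutex is initialised exactly once, by the creating process, before any other process touches it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared_counter {
    pthread_mutex_t lock;
    long value;
};

/* Maps a shared counter and initialises its mutex as process-shared.
 * Returns NULL on failure. */
struct shared_counter *counter_create(void)
{
    struct shared_counter *c = mmap(NULL, sizeof *c, PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (c == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc == 0) {
        rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        if (rc == 0)
            rc = pthread_mutex_init(&c->lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    if (rc != 0) {
        munmap(c, sizeof *c);
        return NULL;
    }
    c->value = 0;
    return c;
}

/* Increments the counter n times, taking the lock for each step. */
void counter_add(struct shared_counter *c, long n)
{
    assert(c != NULL);
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;
        pthread_mutex_unlock(&c->lock);
    }
}

/* Parent and one forked child each add n. Returns the total, or -1. */
long demo_two_processes(long n)
{
    struct shared_counter *c = counter_create();
    if (c == NULL)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        munmap(c, sizeof *c);
        return -1;
    }
    if (pid == 0) {
        counter_add(c, n);
        _exit(0);
    }
    counter_add(c, n);
    waitpid(pid, NULL, 0);
    long total = c->value;
    pthread_mutex_destroy(&c->lock);
    munmap(c, sizeof *c);
    return total;
}
```

With the attribute set, `demo_two_processes(10000)` returns 20000. On older glibc versions the program needs to be linked with `-pthread`. This sketch does not handle a process dying while holding the lock; robust mutexes address that case.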
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Compare pipes and Unix domain sockets for local IPC. When would you use ...
Q02 · SENIOR
How does shared memory achieve the lowest latency? What's the main risk?
Q03 · SENIOR
What is the difference between POSIX and System V message queues? Which ...
Q04 · SENIOR
Explain the role of `kernel.shmmax` and `kernel.shmall`. What happens if...
Q05 · SENIOR
What is false sharing in the context of shared memory IPC, and how do yo...
Q06 · SENIOR
How does PostgreSQL use shared memory, and what kernel parameters must b...
Q01 of 06 · JUNIOR

Compare pipes and Unix domain sockets for local IPC. When would you use one over the other?

ANSWER
Pipes are simpler and lighter – they're just file descriptors for parent-child communication. But they're unidirectional. Unix domain sockets are bidirectional, support multiple connections (server model), and can pass file descriptors. Use pipes for simple one-way data flows; use Unix sockets when you need bidirectional communication or a client-server pattern. Both are kernel-buffered, but Unix sockets are slightly more flexible.
FAQ · 7 QUESTIONS

Frequently Asked Questions

01
Can pipes be used for unrelated processes?
02
Why is shared memory faster than pipes?
03
What happens if a message queue fills up?
04
How do I check the current size of my pipe buffer?
05
Is it safe to share memory between processes without any locks if each process only writes to its own region?
06
What is splice() and when should I use it for pipes?
07
How do I monitor shared memory usage in production?