Flight Booking System Design: Avoid Double-Booking and Payment Race Conditions
Design a flight booking system that handles concurrency, idempotency, and failures.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
Use optimistic locking with version numbers on seat inventory rows, implement a two-phase commit via saga pattern with compensating transactions, and enforce idempotency keys on payment requests to prevent duplicate charges.
Think of it like a concert ticket booth with one clerk. When you pick a seat, the clerk puts a 'reserved' sign on it. If you walk away without paying, the sign comes off. If you pay, the sign becomes 'sold'. Now imagine 1000 clerks — you need a system so two clerks don't sell the same seat, and if a payment fails, the seat goes back to available. That's the booking system.
Every flight booking system has the same nightmare: two users book the last seat, both get charged, and you have to refund one while explaining to an angry customer. I've seen this bring down a payments service when the thread pool was exhausted at 3am because a booking lock wasn't released. The problem isn't just concurrency — it's distributed state across inventory, payment, and booking services. This article walks you through a production-tested design that prevents double-booking, handles payment failures gracefully, and scales to thousands of bookings per second. By the end, you'll be able to design a booking system that survives a flash sale without data corruption.
Why Naive Locking Fails Under Load
Most tutorials show you SELECT FOR UPDATE on the seat row and call it done. That works until you have 500 concurrent requests for the same flight. The lock becomes a bottleneck — every request queues on that row. Worse, if your application holds the transaction open while calling a payment gateway (which can take 2-3 seconds), you'll exhaust your database connection pool. I've seen ERROR: 53300: too many connections in production because of this. The fix: use optimistic locking with a version column. Read the seat, check version, update where version = old_version. If zero rows affected, retry. This avoids long-held locks and scales linearly.
SET statement_timeout = '5s') to prevent a slow payment gateway from holding locks forever. Otherwise, you'll hit ERROR: 57014: canceling statement due to statement timeout — but only after the lock has already caused connection pool exhaustion.Idempotency Keys: Your Shield Against Duplicate Payments
The classic rookie mistake: when a payment request times out, the client retries. Without an idempotency key, the payment gateway charges the card twice. I've seen this result in $40k in duplicate charges during a Black Friday sale. The fix: generate a unique idempotency key (UUID) on the client for each booking attempt. Send it with the payment request. The payment gateway stores the key and returns the same response for duplicate requests. On your side, store the key in the booking record. If you receive a retry, check the key — if the booking already exists, return the existing confirmation without processing payment again.
ON CONFLICT DO NOTHING for fast duplicate detection.Saga Pattern: Surviving Partial Failures
A booking involves multiple steps: lock seat, charge payment, confirm booking, send email. If payment succeeds but email fails, you can't roll back the payment. That's where the saga pattern comes in. Each step has a compensating action: if email fails, you don't roll back the payment — you just retry the email. But if payment fails, you need to release the seat lock. Implement a saga orchestrator that tracks each step's state. If a step fails irrecoverably, execute compensating transactions for all completed steps. Use a transactional outbox to ensure the saga state is persisted atomically with the step execution.
Transactional Outbox: Reliable Event Publishing
When you update seat inventory and create a booking in the same database transaction, you often need to publish an event (e.g., 'booking confirmed') to a message queue. If the queue publish fails after the DB commit, you lose the event. The transactional outbox pattern solves this: instead of publishing directly, write the event to an 'outbox' table in the same DB transaction. A separate process (polling or CDC) reads from the outbox and publishes to the queue. This guarantees at-least-once delivery. I've used this pattern to avoid missing booking confirmations during a Kafka broker outage.
Handling Payment Gateway Timeouts Gracefully
Payment gateways can timeout after 30 seconds. If you hold a database lock during that time, you'll block other bookings. The solution: use a two-phase approach. First, reserve the seat with a short TTL (e.g., 10 minutes) without holding a lock. Then, process payment asynchronously. If payment succeeds, confirm the booking. If it fails or times out, the TTL expires and the seat becomes available again. This is how airlines actually work — they 'hold' a seat for 10 minutes while you enter payment details. Implement this with a reserved_until timestamp column on the seat.
When Not to Use This Design
This design is overkill for a small airline with 10 flights a day. If your scale is low, a simple SELECT FOR UPDATE with a single database and synchronous payment processing works fine. Don't implement sagas, outboxes, and TTLs unless you have concurrency > 10 req/s per seat or you need to survive payment gateway outages. For a prototype, use a monolith with a single transaction. You can always split later. Also, if your payment gateway doesn't support idempotency keys, you can't use the idempotency pattern — fall back to manual reconciliation.
The Double-Booking That Cost $50k in Refunds
- Never split a seat deduction and booking confirmation across separate services without a saga or two-phase commit.
- One transaction, one service.
reserved_until column in seats table. 2. Verify background job is running and releasing expired reservations. 3. Check job logs for errors. 4. Manually release stuck reservations with SQL: UPDATE seats SET status='available', reserved_until=NULL WHERE status='reserved' AND reserved_until < NOW();SELECT seat_id, COUNT(*) FROM seats WHERE status='booked' GROUP BY seat_id HAVING COUNT(*) > 1;SELECT * FROM bookings WHERE seat_id = <duplicate_seat_id>;Key takeaways
Interview Questions on This Topic
How does optimistic locking handle concurrent seat bookings under high load? What happens when two users read the same version and both try to update?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
That's Real World. Mark it forged?
3 min read · try the examples if you haven't