Design a Digital Wallet: Avoiding Double-Spend and Data Loss at Scale
Design a digital wallet system that prevents double-spend, survives crashes, and scales to millions of users.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
Design a digital wallet by using an append-only ledger with idempotency keys, a balance cache with write-through, and a two-phase commit for cross-service transactions. Never update balances in place — always log transactions first.
Think of a digital wallet like a bank teller's drawer. Every time you deposit or withdraw, the teller writes it in a logbook first, then updates the cash count. If the power goes out mid-transaction, the logbook tells them exactly where they left off. Your digital wallet does the same: it writes a transaction record before touching the balance. That way, if the server crashes, you never lose money or double-count.
I've seen a startup lose $200k in one night because their wallet system double-spent on concurrent withdrawals. The fix was a single index and a database constraint they'd skipped. Most wallet designs are wrong from day one — they treat balances as mutable values instead of derived sums. Here's what actually works in production.
The core problem is simple: you need to move money between accounts without ever losing a penny or creating money out of thin air. Banks solved this centuries ago with double-entry bookkeeping. Yet most developers reinvent it badly, using UPDATE balance = balance - amount without any protection against race conditions or crashes.
By the end of this, you'll be able to design a wallet that handles 10,000 transactions per second per shard, survives database failovers without data loss, and scales horizontally. You'll know exactly which database isolation level to use, how to implement idempotency, and when to reach for distributed transactions vs. eventual consistency.
The Ledger-First Pattern: Why You Never Update Balances Directly
The single biggest mistake in wallet design is treating a balance as a mutable field you UPDATE. That's how you lose money. The correct pattern is append-only ledger: every transaction is a row in a transactions table. The balance is a derived value — SELECT SUM(amount) FROM transactions WHERE account_id = ?. But that query is slow at scale, so you cache the balance with a write-through pattern.
Here's the flow: when a transaction comes in, you INSERT a row into the ledger with status='pending'. Then you atomically update the balance cache. If the INSERT succeeds but the cache update fails, you have a pending transaction that never settles. That's why you need a reconciler job that periodically processes pending transactions.
In production, you'll want a materialized balance column on the account table that you update in the same database transaction as the ledger insert. This gives you both the audit trail and fast reads. The key is the order: ledger first, balance second. If the server crashes after the ledger insert but before the balance update, the reconciler picks it up.
WHERE balance >= amount for debits. Also, never use SELECT ... FOR UPDATE on the account row — it causes contention. Use optimistic locking with a version column instead.Idempotency Keys: The Only Way to Safely Retry
Networks fail. Clients retry. If you process the same withdrawal twice, you've created money. The fix is idempotency keys: a unique string the client sends with every request. Your service deduplicates based on that key. The key must be unique per account — a UNIQUE constraint on (idempotency_key, account_id) in the ledger table.
When a retry arrives, you look up the idempotency key. If it exists, return the previous result. If not, process normally. This is simple but has a gotcha: what if the first request timed out after the server committed the transaction but before the client got the response? The retry will find the key and return success — correct. But what if the first request never reached the server? The retry will process fresh — also correct.
The real danger is when the idempotency key is generated incorrectly. Never let the client generate it — they might reuse keys across different transactions. Generate it server-side from a unique request fingerprint, or require the client to provide a UUID that you validate is unique per account.
Concurrency Control: Optimistic Locking Beats Pessimistic Every Time
When two withdrawals hit the same account simultaneously, you need to prevent both from succeeding if the balance is insufficient. Pessimistic locking (SELECT ... FOR UPDATE) serializes access — it works but kills throughput. Optimistic locking with a version column is better: you read the current version, compute the new balance, and UPDATE WHERE version = old_version. If another transaction updated first, your UPDATE affects 0 rows, and you retry.
In practice, you'll combine this with the ledger-first pattern. The ledger insert doesn't need a lock — it's append-only. The balance update uses optimistic locking. If the update fails, you roll back the ledger insert (or mark it as 'failed'). This gives you high concurrency with minimal contention.
One gotcha: if you have a materialized balance column, the UPDATE must be in the same database transaction as the ledger insert. If you're using a separate cache (Redis), you can't easily roll back the cache. That's why write-through cache with database as source of truth is safer.
Sharding the Wallet: How to Scale Beyond a Single Database
When you have millions of accounts, a single database becomes a bottleneck. The natural shard key is account_id. All transactions for an account go to the same shard. This keeps the ledger and balance local to one shard, avoiding distributed transactions.
But sharding introduces new problems: cross-shard transactions (e.g., transferring from account A on shard 1 to account B on shard 2). You can't do a single database transaction across shards. The solution is a two-phase commit (2PC) or, more practically, a saga pattern with compensating transactions.
For most wallet systems, you avoid cross-shard transfers by routing through a central ledger (like a clearing house). Each shard debits the source account and credits a 'clearing' account on the same shard. A separate async process moves funds between clearing accounts across shards. This is eventual consistency, but it's safe if you track the state of each leg.
Handling Failures: The Reconciler Pattern
No matter how careful you are, transactions will get stuck in 'pending' state. Network partitions, database failovers, or bugs can leave a transaction half-processed. You need a reconciler: a background job that scans for pending transactions older than a threshold and either completes or reverses them.
The reconciler must be idempotent. It checks the current state of the account and decides whether the transaction was actually applied. For example, if a withdrawal has status='pending' but the balance already reflects the debit, the reconciler marks it 'completed'. If the balance doesn't reflect the debit, it either retries or marks it 'failed'.
In production, run the reconciler every minute on a small set of recent pending transactions. Use a database index on (status, created_at) to make the scan efficient. Alert if any transaction stays pending for more than 5 minutes.
When Not to Use This: Simpler Alternatives for Low-Risk Systems
If you're building a loyalty points wallet or a in-game currency that can be reissued, you don't need this level of rigor. You can use a simple UPDATE balance with a database transaction and accept occasional inconsistencies that you fix with a daily reconciliation script. The overhead of idempotency keys, optimistic locking, and reconcilers is only justified when real money is on the line.
Also, if your transaction volume is under 100 TPS and you have a single database, you can skip sharding. Use a single ACID-compliant database with materialized balances and be done. Don't over-engineer.
But if you're handling payments, banking, or any system where a single error costs real money, follow the patterns above. The cost of implementing them upfront is far less than the cost of a production incident.
The 4GB Container That Kept Dying
- Never load the full ledger into memory to compute a balance — store the running total and update it atomically.
SELECT * FROM transactions WHERE idempotency_key = 'idem_789';SELECT * FROM idempotency_cache WHERE key = 'idem_789';Key takeaways
Interview Questions on This Topic
How would you handle concurrent withdrawals on the same account without losing money?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
That's Real World. Mark it forged?
5 min read · try the examples if you haven't