Senior 12 min · March 06, 2026

Payment System Design — Idempotency & Double Charge Fixes

Client-generated idempotency keys caused 300% more chargebacks on Black Friday.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Idempotency keys prevent duplicate charges from retries — the foundation of safe payments
  • Append-only ledgers provide an immutable audit trail; balance is always derived, never stored
  • Retry with exponential backoff and jitter, but always pair with idempotency
  • Reconciliation catches silent inconsistencies — automate it from day one
  • Strong consistency beats availability for financial data; choose PostgreSQL or CockroachDB
  • Reconciliation timing matters: run it at least once per billing cycle to catch silent losses before they compound
Plain-English First

Imagine you're at a vending machine. You put in a dollar, press B3, and the machine jams. Did your dollar count? Did the snack dispatch? You don't know — and that confusion is exactly what a payment system is designed to eliminate. A payment system is like a very careful referee that keeps score of every dollar that moves, makes sure nobody's charged twice if the internet hiccups, and can always show you a receipt even years later. Every time you tap your card at a coffee shop, at least five different computer systems have a 300-millisecond conversation to make sure the money moves exactly once, to exactly the right place.

Payment systems are the circulatory system of the modern economy. Stripe processes hundreds of billions of dollars annually. PayPal handles over 22 billion transactions per year. Even a small SaaS product charging $29/month is trusting its payment pipeline with its entire revenue stream. When a payment system fails, it doesn't just log an error — it loses real money, destroys customer trust, and can trigger regulatory scrutiny. This is why payment system design is one of the most consequential, unforgiving engineering domains you'll encounter. Yet most introductory guides stop at idempotency — they skip the ledger model and reconciliation pipelines that catch silent errors.

The core problem a payment system solves is deceptively simple: move money from person A to person B reliably. But 'reliably' carries enormous weight. Networks time out. Databases crash mid-transaction. Users double-click submit buttons. Banks return ambiguous responses like 'do not honor' without explaining why. A naive implementation will lose money, double-charge customers, or silently swallow failures — any of which is catastrophic. The engineering challenge is building a system that behaves correctly under all these failure modes simultaneously.

By the end of this article you'll understand how to architect a production-grade payment system end to end — from the data model and API contract to idempotency keys, ledger design, retry strategies with exponential backoff, reconciliation pipelines, and the edge cases that bite engineers in production. You'll walk away knowing exactly how to answer this question in a system design interview and, more importantly, how to actually build it.

Here's the reality most guides skip: your payment system will fail in ways you can't predict. The difference between a good system and a broken one is not the happy path — it's how gracefully it degrades when the bank's API returns HTML instead of JSON, or when a network partition splits your database nodes. The design choices you make today determine whether those failures become a five-minute incident or a week-long disaster. And that's why every decision — from idempotency key format to database fsync configuration — deserves careful thought.

What Is a Payment System?

A payment system is a set of services and processes that reliably moves money between parties. Its core challenges are not about speed — they're about correctness under uncertainty. Every engineer who builds payment systems soon learns that network timeouts, database failures, and user impatience are the real enemies. The naive approach of 'try, and if it fails, try again' leads to double charges, lost payments, and angry customers. That's why idempotency, ledgers, and reconciliation aren't optional — they're the bedrock.

When you design a payment system, you're designing a system that must say 'yes' exactly once, and 'no' with a clear reason. Getting that wrong costs real money. Senior engineers know that every edge case you don't think of will happen in production — the trick is to design so that when it happens, the system still does the right thing.

Here's the thing: the moment you accept real money, you're operating under regulatory constraints. PCI DSS, SOC 2, and regional regulations like PSD2 in Europe impose specific requirements on how you store, process, and transmit payment data. Your architecture decisions have compliance consequences.

Let's talk about the design decisions that separate production-grade systems from weekend projects. Every choice — whether to use a single database or separate services, how to model the API contract, what to do when a third-party gateway returns a 503 — has a downstream impact on correctness, latency, and auditability. A healthy payment system is boring: it logs everything, it never loses data, and it pushes discrepancies to a queue instead of swallowing them.

A practical way to think about it: your payment system is only as good as its worst failure mode. If you test only the happy path, you've tested nothing. Invest time in simulating gateway timeouts, database failures, and duplicate requests. That's where the real engineering happens.

ForgeExample.java · SYSTEM DESIGN
// TheCodeForge: Design a Payment System example
// Always use meaningful names, not x or n
public class ForgeExample {
    public static void main(String[] args) {
        String topic = "Design a Payment System";
        System.out.println("Learning: " + topic + " 🔥");
    }
}
Output
Learning: Design a Payment System 🔥
Forge Tip:
Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.
Production Insight
A naive payment system without idempotency caused a SaaS company to double-charge 300 customers during a network blip.
The fix took 3 hours, cost $15k in refunds, and triggered a PR disaster that lost 10% of their user base.
Rule: assume every network call can fail, and every retry can succeed — design for both.
Key Takeaway
Payment systems are unforgiving: one wrong assumption about retry semantics can cost real money.
Design for the worst-case network behavior from the start.
Rule: if you don't know what happens when the same request arrives twice, you're not ready for production.
When to invest in payment system design depth
If: You are processing real-money transactions
Use: Invest in idempotency, ledger, and reconciliation from day one.
If: You are building a prototype or MVP with fake money
Use: You can skip some guarantees, but plan to add them before any real payment flows.

Idempotency: The Non-Negotiable Foundation

Idempotency is what stops a payment from being processed twice when a user refreshes the page or a network retry sends the same request again. The core mechanism is an idempotency key — a unique identifier (often a UUID) that the client sends with every creation request. The server checks if it has already seen that key. If yes, it returns the cached result from the first invocation instead of executing the operation again. This transforms an unsafe retry into a safe replay.

You store the key together with the response in a database row with a unique constraint. Any duplicate key insertion fails, and you simply return the stored response. The TTL on this mapping matters: keep it long enough to cover worst-case retry windows (typically 24 hours) but short enough to avoid unbounded storage growth. Use a primary key or a unique index on the idempotency key column.

A common production pattern is a single idempotency-key table with a 24-hour TTL. But be careful: if the TTL is shorter than the longest possible retry window, a retry that arrives after expiry is processed as a new request and you'll double-charge. For card payments, size the TTL to cover the settlement lifecycle, typically 7 days. Use a background job to purge expired keys.
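To make the TTL trade-off concrete, here is a minimal in-memory sketch. It assumes a 7-day TTL and passes the clock in explicitly so the expiry behaviour is easy to see; `TtlIdempotencyStore` and its method names are hypothetical, and a real system would back this with a table that has a unique constraint and an expiry column.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for an idempotency table with an expiry column (hypothetical names). */
class TtlIdempotencyStore {
    private record Entry(String response, Instant expiresAt) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Duration ttl;

    TtlIdempotencyStore(Duration ttl) {
        this.ttl = ttl;
    }

    /** Stores the response unless a live entry exists; returns true if this call stored it. */
    boolean save(String key, String response, Instant now) {
        boolean[] stored = {false};
        entries.compute(key, (k, old) -> {
            if (old != null && now.isBefore(old.expiresAt())) {
                return old; // a live entry wins, like a unique-constraint violation
            }
            stored[0] = true;
            return new Entry(response, now.plus(ttl));
        });
        return stored[0];
    }

    /** Returns the stored response only while the key is inside its TTL window. */
    Optional<String> find(String key, Instant now) {
        Entry entry = entries.get(key);
        if (entry == null || !now.isBefore(entry.expiresAt())) {
            return Optional.empty();
        }
        return Optional.of(entry.response());
    }

    /** Background purge: drops expired entries and reports how many were removed. */
    int purgeExpired(Instant now) {
        int before = entries.size();
        entries.values().removeIf(entry -> !now.isBefore(entry.expiresAt()));
        return before - entries.size();
    }

    public static void main(String[] args) {
        TtlIdempotencyStore store = new TtlIdempotencyStore(Duration.ofDays(7));
        Instant t0 = Instant.parse("2026-03-06T00:00:00Z");
        store.save("key-1", "charged:2000", t0);

        // A retry on day 3 is still deduplicated.
        System.out.println(store.find("key-1", t0.plus(Duration.ofDays(3))).isPresent()); // true
        // A retry on day 8 looks like a brand-new request: this is the double-charge window.
        System.out.println(store.find("key-1", t0.plus(Duration.ofDays(8))).isPresent()); // false
    }
}
```

The second lookup is the failure described above: once the key has expired, nothing stops the same request from being executed again.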

One subtle failure: if you store idempotency keys in the same database as your ledger, a database outage can render your entire payment API unavailable. Consider a separate, highly available key-value store (Redis with persistence, or DynamoDB with strong consistency) for idempotency data, while keeping the ledger in a transactional database. This decoupling lets you tolerate partial failures without losing the ability to retry safely.

Another nuance: idempotency keys don't just protect against double charging — they also serve as a correlation ID for debugging. When a payment fails after multiple retries, the key ties all attempts together. You can trace the entire lifecycle in your logs. Make sure to log the idempotency key at every step.

io/thecodeforge/payment/IdempotencyService.java · JAVA
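import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotencyService {
    // Sentinel marking a key whose first execution has not finished yet
    private static final PaymentResult IN_FLIGHT = new PaymentResult("IN_FLIGHT", null);
    // In-memory for illustration; production needs a durable store with a unique constraint
    private final Map<String, PaymentResult> store = new ConcurrentHashMap<>();

    public PaymentResult process(String idempotencyKey, PaymentRequest request) {
        // Atomic putIfAbsent ensures at most one caller executes the payment
        PaymentResult existing = store.putIfAbsent(idempotencyKey, IN_FLIGHT);
        if (existing != null) {
            // Duplicate request: wait for the first execution if needed, then replay its result
            return existing == IN_FLIGHT ? waitForCompletion(idempotencyKey) : existing;
        }
        try {
            PaymentResult result = executePayment(request);
            store.put(idempotencyKey, result);
            return result;
        } catch (RuntimeException e) {
            // Cache the failure so a retry with this key replays the same outcome
            store.put(idempotencyKey, new PaymentResult("FAILED", e.getMessage()));
            throw e;
        }
    }

    // Polls until the first execution stores its result; gives up after roughly 10 seconds
    private PaymentResult waitForCompletion(String idempotencyKey) {
        for (int i = 0; i < 100; i++) {
            PaymentResult current = store.get(idempotencyKey);
            if (current != IN_FLIGHT) return current;
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        throw new IllegalStateException("Timed out waiting for in-flight payment " + idempotencyKey);
    }
    // executePayment(request) calls the gateway; it and PaymentRequest/PaymentResult are defined elsewhere
}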
Output
Idempotency key ensures a payment request is processed at most once. The ConcurrentHashMap with putIfAbsent guarantees atomic check-and-set.
Idempotency Key Collision
If two different payments accidentally share the same idempotency key, the second one is never executed: it silently receives the first payment's cached response. Keys need enough entropy to make that practically impossible (UUIDv4, or a hash of client_id + timestamp + nonce). Don't assume the client will always send a key: if a request arrives without one, generate it server-side. Additionally, consider using a distributed lock around the idempotency check-and-store to prevent race conditions on concurrent requests with the same key.
Production Insight
A payment processor at a major retailer used a timestamp-based idempotency key during Black Friday.
Concurrent requests from the same user produced identical keys, causing duplicate charges.
They lost $340k before the fix: use a cryptographically random UUID generated on the server.
Key rule: never derive idempotency from request data alone.
Key Takeaway
Idempotency keys are the lock that prevents double charges.
Store them in a table with a unique constraint and a TTL.
The rule: if you can't replay a request safely, your system isn't production-ready.
Idempotency Strategy Decision
If: Client provides idempotency key
Use: Use it as-is after validating format and length. Reject non-compliant keys with 400.
If: Client does not provide key
Use: Generate one server-side using UUIDv4. Return it in response header for client retries.
If: Existing key found in store with IN_FLIGHT status
Use: Block and poll until first execution completes (with timeout). Return the eventual response.
If: Existing key found with completed response
Use: Return that response immediately. Do not re-execute.
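The first two rows of that decision table can be sketched as a small resolver. The format policy here (16 to 64 characters of letters, digits, `-` or `_`) is an assumption chosen for illustration, as are the `IdempotencyKeyResolver` and `Resolution` names; a rejected resolution is what the API layer would map to HTTP 400.

```java
import java.util.Optional;
import java.util.UUID;
import java.util.regex.Pattern;

/** Resolves the idempotency key for an incoming request (policy values are assumptions). */
class IdempotencyKeyResolver {
    // Assumed policy: 16 to 64 characters of letters, digits, '-' or '_'.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9_-]{16,64}");

    /** accepted == false means the API layer should answer 400. */
    record Resolution(boolean accepted, String key, boolean serverGenerated) {}

    static Resolution resolve(Optional<String> clientKey) {
        if (clientKey.isEmpty()) {
            // No key supplied: mint one and return it so the client can reuse it on retries.
            return new Resolution(true, UUID.randomUUID().toString(), true);
        }
        String key = clientKey.get();
        if (!ALLOWED.matcher(key).matches()) {
            return new Resolution(false, null, false);
        }
        return new Resolution(true, key, false);
    }

    public static void main(String[] args) {
        System.out.println(resolve(Optional.of("order-12345-attempt-0001")));
        System.out.println(resolve(Optional.of("short")));
        System.out.println(resolve(Optional.empty()).serverGenerated());
    }
}
```

Note that a server-generated key only protects retries that happen after the client has received it, so clients that care about safe retries should send their own.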

Ledger Design: The Source of Truth

A ledger is the immutable record of every financial event. Every credit, debit, fee, refund, and adjustment is an entry in the ledger. The balance of an account is derived by summing all entries — never stored directly as a mutable field. This prevents inconsistencies from partial updates and gives you a full audit trail.

Design the ledger as an append-only table. Each entry has: account_id, entry_type (DEBIT/CREDIT), amount in the smallest currency unit (cents), currency, reference_id, timestamp, and a unique sequence number. Use a composite primary key on (account_id, sequence) to guarantee ordering. For performance, you can cache balances in a separate table but always rebuild from the ledger during reconciliation.

One leaky abstraction that trips teams up: they store the balance as a column and rebuild it periodically from the ledger. During a reconciliation run, if the rebuild fails halfway, you'll have an inconsistent cache. The better approach is to never trust the cached balance — treat it as a degraded read and always verify against the ledger for critical operations like withdrawals.

Another trap: teams mistakenly store derived balances in the ledger itself using UPDATE statements. That's not an append-only ledger — it's a mutating counter dressed up. If you ever find yourself writing an UPDATE on the ledger table, stop and redesign. The correct pattern is to always INSERT and compute balance via aggregation. For performance, materialise the balance in a separate cache table with a periodic rebuild job, but never trust it for critical withdrawals without re-verifying against the ledger.

One more production consideration: ledger entries must be idempotent themselves. If you process a webhook callback twice, you need to ensure the second attempt doesn't create a duplicate ledger entry. Use the gateway's transaction ID as a unique constraint on the ledger table. This prevents double-counting even if your webhook handler re-executes.
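The ideas in this section (append only, derive the balance, deduplicate on the gateway transaction ID) fit in a few lines of Java. This is an in-memory sketch with hypothetical names, and it assumes one ledger entry per gateway transaction; in a real database the `Set` would be a unique constraint.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory sketch of an append-only ledger (names are illustrative). */
class InMemoryLedger {
    enum EntryType { DEBIT, CREDIT }

    record Entry(String accountId, EntryType type, long amountCents, String gatewayTxnId) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Set<String> seenTxnIds = new HashSet<>();

    /** Appends an entry; returns false if this gateway transaction was already recorded. */
    synchronized boolean append(Entry entry) {
        if (entry.amountCents() <= 0) {
            throw new IllegalArgumentException("amount must be positive; the sign comes from the entry type");
        }
        // Set.add plays the role of the unique constraint on the gateway transaction ID.
        if (!seenTxnIds.add(entry.gatewayTxnId())) {
            return false;
        }
        entries.add(entry);
        return true;
    }

    /** Derives the balance by summing credits minus debits; no row is ever updated. */
    synchronized long balanceCents(String accountId) {
        long balance = 0;
        for (Entry e : entries) {
            if (e.accountId().equals(accountId)) {
                balance += e.type() == EntryType.CREDIT ? e.amountCents() : -e.amountCents();
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        InMemoryLedger ledger = new InMemoryLedger();
        ledger.append(new Entry("acct-1", EntryType.CREDIT, 5_000, "txn_1"));
        ledger.append(new Entry("acct-1", EntryType.DEBIT, 2_000, "txn_2"));
        // The webhook for txn_2 is delivered a second time: the append is refused.
        boolean replayed = ledger.append(new Entry("acct-1", EntryType.DEBIT, 2_000, "txn_2"));
        System.out.println(replayed);                      // false
        System.out.println(ledger.balanceCents("acct-1")); // 3000
    }
}
```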

io/thecodeforge/payment/ledger.sql · SQL
CREATE TABLE ledger_entries (
    account_id     UUID NOT NULL,
    sequence_id    BIGSERIAL,
    entry_type     VARCHAR(10) NOT NULL CHECK (entry_type IN ('DEBIT','CREDIT')),
    -- Smallest currency unit; always positive, the sign comes from entry_type
    amount_cents   BIGINT NOT NULL CHECK (amount_cents > 0),
    currency       CHAR(3) NOT NULL DEFAULT 'USD',
    reference_id   UUID NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (account_id, sequence_id)
);

CREATE INDEX idx_ledger_ref ON ledger_entries(reference_id);

-- Derive balance: sum of credits minus debits
SELECT account_id,
       SUM(CASE WHEN entry_type = 'CREDIT' THEN amount_cents
                ELSE -amount_cents END) AS balance_cents
FROM ledger_entries
WHERE account_id = :accountId
GROUP BY account_id;
Output
The ledger table is append-only. Balance is computed dynamically from the sum of CREDIT and DEBIT entries.
Ledger as Append-Only Log
  • Every financial event appends a row – never UPDATE or DELETE.
  • Balance is always derived by summing entries on the fly (or via materialized view).
  • Reversal entries cancel out previous entries (e.g., a chargeback is a CREDIT reversal of a DEBIT).
  • Auditors and regulators require this immutability.
  • Use a materialised balance table only for read capacity; rebuild it periodically from the ledger to catch drift.
Production Insight
A startup stored account balances in a single row and updated it transactionally.
During a database failover, the UPDATE was applied twice, doubling the balance.
They couldn't detect the error until a withdrawal exceeded actual funds.
Rule: never store a mutable balance – always derive from an immutable ledger.
Key Takeaway
Ledgers are the single source of financial truth.
Append-only, never update – balance is always derived.
Rule: if you are updating a balance column, you're building a future outage.
Ledger vs Mutable Balance
If: You need audit trail and regulatory compliance
Use: Use an append-only ledger with derived balance.
If: You need ultra-low latency read of balance (<1ms)
Use: Use a materialized cache that is rebuilt from ledger every few seconds; accept eventual consistency.
If: You are processing high throughput (10k+ tx/s)
Use: Partition the ledger by account_id and cache recent balances in Redis with periodic ledger reconciliation.

Retry Strategies and Failure Handling

Payment networks are unreliable. A bank might return a timeout, a network partition might drop the response, or the payment gateway may be overloaded. Your system must handle these gracefully. The standard approach is automatic retries with exponential backoff and jitter. But retries must be safe — meaning the operation is idempotent. Every retry uses the same idempotency key, so the payment processor's ledger sees only one debit.

Define a finite number of attempts (e.g., 3 attempts, with a 1s base delay that doubles between attempts: 1s, then 2s) and a final failure state. Track retry attempts in a dedicated table. For critical payments (like subscriptions), consider a delayed retry queue that retries over hours with escalating backoff.

A trap I see often: teams implement retry in the application layer but forget to set a timeout on the upstream call. If the gateway hangs, every attempt blocks a thread, and the blocked threads eventually exhaust the connection pool. Set an explicit timeout on every upstream call, and pair retries with a circuit breaker that trips after consecutive failures so the gateway has room to recover.
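A circuit breaker can be surprisingly small. The sketch below is a simplified illustration, not a substitute for a hardened library: it opens after a configurable number of consecutive failures, fails fast during a cool-down, then lets one trial call through. The class name and the explicit `nowMillis` parameter (used instead of the system clock to keep the example deterministic) are assumptions.

```java
import java.util.function.Supplier;

/** Minimal circuit breaker: opens after N consecutive failures, allows a trial call after a cool-down. */
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = -1; // -1 means the breaker is closed

    SimpleCircuitBreaker(int failureThreshold, long coolDownMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    synchronized boolean isOpen(long nowMillis) {
        return openedAtMillis >= 0 && nowMillis - openedAtMillis < coolDownMillis;
    }

    /** Runs the upstream call unless the breaker is open, in which case it fails fast. */
    <T> T call(Supplier<T> upstream, long nowMillis) {
        if (isOpen(nowMillis)) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = upstream.get(); // the lock is not held while the upstream call runs
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(nowMillis);
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1;
    }

    private synchronized void onFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtMillis = nowMillis; // (re)start the cool-down
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000);
        Supplier<String> failingGateway = () -> { throw new RuntimeException("gateway timeout"); };
        for (long now = 0; now < 2; now++) {
            try {
                breaker.call(failingGateway, now);
            } catch (RuntimeException ignored) {
                // expected: the gateway is down
            }
        }
        System.out.println("open after two failures: " + breaker.isOpen(2));
        System.out.println("trial call after cool-down: " + breaker.call(() -> "ok", 1_500));
    }
}
```

While the breaker is open, callers get an immediate error instead of a blocked thread, which is what keeps the connection pool from draining.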

Consider the 'double timeout' problem: your payment gateway enforces a 10-second internal timeout, but your HTTP client waits up to 30 seconds. If the gateway responds at 15 seconds, past its own timeout, it may already have processed the request while your client was still waiting, so the outcome is ambiguous. Retrying with the same idempotency key is safe as long as the gateway still recognises the key; it turns into a double charge if the retry arrives after the gateway's deduplication window has closed. Align timeout configurations between client and gateway, and log exact request and response timestamps so you can correlate both sides.

Another pattern: for time-sensitive operations like pre-auth expiry, you need aggressive retries with short backoff. If you wait too long, the authorisation expires and you lose the hold on the customer's funds. In that case, use a backoff of 100ms, 200ms, 400ms, and fail fast. Then run a background job to re-authorise if needed.
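The two schedules discussed in this section differ only in their base delay. Here is a small sketch that computes a doubling schedule with a cap, plus a "full jitter" helper that picks a random delay between zero and the computed ceiling. Full jitter is one common variant, named here as an alternative to a fixed jitter factor; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Computes retry delays; the doubling factor and the cap are illustrative choices. */
class BackoffSchedule {
    /** Un-jittered delays: base, 2*base, 4*base, ... never exceeding maxDelayMs. */
    static List<Long> exponentialDelays(long baseMs, int retries, long maxDelayMs) {
        List<Long> delays = new ArrayList<>();
        long delay = Math.min(baseMs, maxDelayMs);
        for (int i = 0; i < retries; i++) {
            delays.add(delay);
            delay = Math.min(delay * 2, maxDelayMs);
        }
        return delays;
    }

    /** Full jitter: a random delay in [0, ceilingMs], so clients do not retry in lockstep. */
    static long withFullJitter(long ceilingMs) {
        return ThreadLocalRandom.current().nextLong(ceilingMs + 1);
    }

    public static void main(String[] args) {
        System.out.println(exponentialDelays(1_000, 3, 60_000)); // [1000, 2000, 4000]
        System.out.println(exponentialDelays(100, 3, 60_000));   // [100, 200, 400]
        System.out.println(withFullJitter(400));                 // some value between 0 and 400
    }
}
```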

io/thecodeforge/payment/RetryHandler.java · JAVA
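import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeoutException;

public class RetryHandler {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;

    // PaymentClient and DeadLetterQueue are assumed collaborator types, defined elsewhere
    private final PaymentClient paymentClient;
    private final DeadLetterQueue deadLetterQueue;

    public RetryHandler(PaymentClient paymentClient, DeadLetterQueue deadLetterQueue) {
        this.paymentClient = paymentClient;
        this.deadLetterQueue = deadLetterQueue;
    }

    public PaymentResult executeWithRetry(String idempotencyKey, PaymentRequest request) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                // Idempotency key stays the same across retries
                return paymentClient.charge(idempotencyKey, request);
            } catch (TimeoutException e) {
                if (attempt == MAX_RETRIES) {
                    break;
                }
                // Exponential backoff (1s, then 2s) scaled by a jitter factor in [0.5, 1.5)
                long delay = (long) (BASE_DELAY_MS * Math.pow(2, attempt - 1)
                        * (0.5 + ThreadLocalRandom.current().nextDouble()));
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    break;
                }
            }
        }
        // Retries exhausted or interrupted: move to dead-letter queue for manual review
        deadLetterQueue.enqueue(idempotencyKey, request);
        return new PaymentResult("FAILED", "Exhausted retries");
    }
}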
Output
Retry loop with exponential backoff and jitter. On final failure, enqueue to dead-letter queue for manual reconciliation.
Idempotency + Retry = Safety
Without idempotency, a retry can result in multiple charges. Always attach the same idempotency key to every retry attempt. The payment provider can then deduplicate on its side, so a retry cannot create a second charge as long as the key is still inside the provider's retention window. The dead-letter queue is your last line of defence: monitor its depth and alert if it grows beyond a threshold.
Production Insight
A SaaS company's payment integration had no retry limit – it retried indefinitely with fixed 5-second delay.
When the payment gateway went down for 30 minutes, millions of requests piled up.
Upon recovery, the backlog caused a DDoS-like spike, taking down the gateway again.
Rule: cap retries, use exponential backoff, and implement a circuit breaker to back off during extended failures.
Key Takeaway
Retries are safe only when paired with idempotency.
Always cap retries, add jitter, and use a circuit breaker.
Rule: if you're not tracking retry attempts, you're one outage away from a spike that takes you down.
Retry Strategy Selection
If: Operation is idempotent and safe to replay
Use: Use automatic retries with exponential backoff and jitter (3–5 attempts).
If: Operation is not idempotent
Use: Do NOT retry automatically – throw an error and rely on manual intervention or a compensating transaction.
If: Payment is time-sensitive (e.g., pre-auth expires)
Use: Use aggressive retry with short backoff (100ms, 200ms, 400ms) and fail fast.

Reconciliation and Double-Spend Prevention

Reconciliation is the process of comparing your internal ledger with external statements (bank, gateway, network) to catch discrepancies. It's your safety net against silent data loss, double charges, or missing credits. Run reconciliation periodically (daily or hourly depending on volume). For each transaction, compare status, amount, currency, timestamps. Flag any mismatch for manual review.

Double-spend prevention goes further: ensure that a given payment instruction (e.g., a specific invoice) is fulfilled exactly once. Use a unique constraint on the combination of (merchant_id, payment_reference). Combined with idempotency, this prevents any chance of creating two debits for the same order.

Don't limit reconciliation to daily batches. For high-value payments, implement near-real-time reconciliation using webhooks from the gateway. The webhook acts as an async callback that your system processes immediately. Compare the webhook payload with the ledger entry using the gateway transaction ID. If they don't match, trigger an alert. This cuts detection time from hours to seconds.
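The matching step itself is simple to sketch. The code below compares an internal view with a gateway view keyed by gateway transaction ID and reports the three kinds of discrepancy: missing on either side, amount or currency mismatch, and status mismatch. The `Reconciler` and `Txn` names and the message strings are hypothetical; a real pipeline would write these to a review queue instead of returning strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Compares our records with the gateway's records, keyed by gateway transaction ID. */
class Reconciler {
    record Txn(String gatewayTxnId, long amountCents, String currency, String status) {}

    static List<String> findMismatches(Map<String, Txn> ours, Map<String, Txn> gateway) {
        List<String> issues = new ArrayList<>();
        for (Txn mine : ours.values()) {
            Txn theirs = gateway.get(mine.gatewayTxnId());
            if (theirs == null) {
                issues.add(mine.gatewayTxnId() + ": missing at gateway");
            } else if (mine.amountCents() != theirs.amountCents()
                    || !mine.currency().equals(theirs.currency())) {
                issues.add(mine.gatewayTxnId() + ": amount mismatch");
            } else if (!mine.status().equals(theirs.status())) {
                issues.add(mine.gatewayTxnId() + ": status mismatch");
            }
        }
        // Charges the gateway knows about but we never recorded are the dangerous ones.
        for (String id : gateway.keySet()) {
            if (!ours.containsKey(id)) {
                issues.add(id + ": missing in ledger");
            }
        }
        return issues;
    }

    public static void main(String[] args) {
        Map<String, Txn> ours = new TreeMap<>();
        ours.put("txn_1", new Txn("txn_1", 2_000, "USD", "captured"));
        ours.put("txn_2", new Txn("txn_2", 500, "USD", "captured"));
        ours.put("txn_3", new Txn("txn_3", 700, "USD", "captured"));

        Map<String, Txn> gateway = new TreeMap<>();
        gateway.put("txn_1", new Txn("txn_1", 2_000, "USD", "captured"));
        gateway.put("txn_2", new Txn("txn_2", 550, "USD", "captured"));
        gateway.put("txn_4", new Txn("txn_4", 900, "USD", "captured"));

        findMismatches(ours, gateway).forEach(System.out::println);
    }
}
```

The same function works for a nightly batch (pass in the whole day's transactions) or for a webhook handler (pass in a single-entry map for the transaction the webhook describes).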

Reconciliation is also your best defence against bank-level errors. Banks sometimes batch settlements incorrectly, or apply fees that your system didn't anticipate. Automated reconciliation that compares line items (transaction ID, amount, currency, fee) will flag these mismatches. Build a dashboard that shows the 'health score' of reconciliation for the last 7 days — green if no unresolved mismatches, red if any discrepancy exceeds $X. This gives you a single pane of glass for financial integrity.

One more thing: reconciliation should be a first-class feature, not an afterthought. Start implementing it from your first payment in production. If you wait until you have thousands of transactions, the cleanup effort is enormous. Automate the matching logic and build a manual review queue for edge cases. The operations team's ability to reconcile quickly is a direct measure of your system's reliability.

io/thecodeforge/payment/reconciliation.sql · SQL
-- Sample reconciliation query to find mismatches
-- LEFT JOIN keeps payments that have no gateway row, so the IS NULL check can actually match
SELECT
    p.id AS payment_id,
    p.amount AS our_amount,
    g.amount AS gateway_amount,
    p.status AS our_status,
    g.status AS gateway_status
FROM payments p
LEFT JOIN gateway_transactions g ON p.gateway_txn_id = g.txn_id
WHERE g.txn_id IS NULL
   OR p.amount <> g.amount
   OR p.status <> g.status;

-- Double-spend prevention: unique constraint on merchant payment reference
ALTER TABLE payments
ADD CONSTRAINT uq_merchant_ref UNIQUE (merchant_id, merchant_payment_reference);
Output
Reconciliation query to find amount/status mismatches. Unique constraint prevents duplicate payments for the same invoice.
Reconciliation as a Safety Net
  • Run reconciliation regularly (daily minimum).
  • Automate flagging of discrepancies for human review.
  • For high-value transactions, consider near-real-time reconciliation via webhook callbacks.
  • Double-spend prevention via unique constraints is a hard guarantee; idempotency is a soft guarantee.
  • Automate reconciliation discrepancy handling with a retry queue for transient mismatches and a manual review queue for permanent ones.
Production Insight
A fintech startup lost track of a batch of transactions during a database migration.
They skipped reconciliation for two weeks.
When they finally ran it, they found 47 duplicate charges and 12 missing credits worth $11,000.
Reputation damage was far worse than the financial loss.
Rule: reconciliation is not optional – it's how you catch the things you don't know are wrong.
Key Takeaway
Reconciliation catches what your application logic misses.
Always run it, automate it, and escalate discrepancies.
Rule: the absence of a discrepancy report doesn't mean no discrepancy exists.
Reconciliation Strategy Selection
If: Low volume (< 1k transactions/day) and simple payment flows
Use: Run reconciliation daily. Manual review is acceptable.
If: High volume or complex payment flows (subscriptions, partial captures)
Use: Run reconciliation hourly. Automate alerts and implement a discrepancy dashboard.
If: Regulatory requirement (e.g., PSD2, SOX)
Use: Run reconciliation every 15 minutes with automated escalation and audit logs.

Production Architecture and Trade-offs

A production payment system is composed of several layers: an API gateway (handles auth, rate limiting, idempotency), a payment service (orchestrates the flow), a ledger service (append-only log), an external gateway adapter (communicates with banks/networks), and a reconciliation pipeline. The system must be stateless at the edge to scale horizontally, and stateful components (ledger, idempotency store) must use a strongly consistent database with failover.

Key trade-offs: consistency vs availability (choose consistency for financial data – CAP says partition tolerance is a given, so you sacrifice availability in favor of consistency). Use a database that supports transactions and strong consistency (e.g., PostgreSQL, CockroachDB). For high throughput, partition by account_id. Avoid distributed transactions if you can; prefer idempotent operations and eventual consistency with compensation (Saga pattern) for cross-service workflows.

Another pattern that works well in production is the 'outbox pattern'. When your payment service needs to emit an event (e.g., payment.completed) while updating the ledger, use a transactional outbox: write the event to an outbox table in the same database transaction as the ledger update. A separate process reads the outbox and publishes the events. This guarantees at-least-once delivery downstream: the relay may publish an event twice if it crashes before marking the row as sent, so consumers must deduplicate (for example, on the event ID).
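Here is a minimal in-memory sketch of the outbox flow, including a relay that crashes between publishing and deleting the row, and a consumer that tolerates the resulting redelivery. Everything in it (`OutboxSketch`, `Database`, `DedupingConsumer`) is a hypothetical stand-in; in a real system the two writes in `recordPaymentCompleted` would share one database transaction.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

class OutboxSketch {
    record OutboxRow(long id, String eventType, String paymentId) {}

    /** Stands in for one database that holds both the ledger and the outbox table. */
    static class Database {
        final List<String> ledger = new ArrayList<>();
        final Deque<OutboxRow> outbox = new ArrayDeque<>();
        private long nextId = 1;

        /** Models a single transaction: the ledger row and the outbox row are written together. */
        synchronized void recordPaymentCompleted(String paymentId) {
            ledger.add("DEBIT " + paymentId);
            outbox.add(new OutboxRow(nextId++, "payment.completed", paymentId));
        }
    }

    /** Relay: publish first, remove the row only after the publish call returns. */
    static void relay(Database db, Consumer<OutboxRow> publisher) {
        OutboxRow row;
        while ((row = db.outbox.peek()) != null) {
            publisher.accept(row); // a crash here leaves the row in place for redelivery
            db.outbox.poll();
        }
    }

    /** Downstream consumer that deduplicates on the outbox row ID. */
    static class DedupingConsumer implements Consumer<OutboxRow> {
        private final Set<Long> seen = new HashSet<>();
        int deliveries = 0;
        int applied = 0;

        @Override
        public void accept(OutboxRow row) {
            deliveries++;
            if (seen.add(row.id())) {
                applied++;
            }
        }
    }

    public static void main(String[] args) {
        Database db = new Database();
        db.recordPaymentCompleted("pay_1");
        DedupingConsumer consumer = new DedupingConsumer();

        try {
            // First relay run: the event is delivered, then the relay "crashes" before removing the row.
            relay(db, row -> {
                consumer.accept(row);
                throw new IllegalStateException("simulated crash after publish");
            });
        } catch (IllegalStateException expected) {
            // the outbox row is still there
        }
        relay(db, consumer); // second run redelivers the same row

        System.out.println("deliveries=" + consumer.deliveries + ", applied=" + consumer.applied);
        // prints "deliveries=2, applied=1"
    }
}
```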

A pattern that's gaining traction in 2026 is the 'payment orchestrator as a state machine.' Instead of wiring complex sagas manually, define the payment lifecycle states (INITIATED, AUTHED, CAPTURED, REFUNDED, FAILED) and transitions as a finite state machine. Each state transition is guarded by preconditions and triggers compensation on failure. Tools like AWS Step Functions or Cadence/Temporal implement this natively, giving you visibility, retries, and timeouts without custom boilerplate.
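Before reaching for a workflow engine, it helps to see how little code the core state machine needs. The states below come from the list above; the transition table is an assumption (for example, that a refund is only possible after capture), and the class name is hypothetical.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Payment lifecycle as a finite state machine; the allowed transitions are an assumed policy. */
class PaymentStateMachine {
    enum State { INITIATED, AUTHED, CAPTURED, REFUNDED, FAILED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.INITIATED, EnumSet.of(State.AUTHED, State.FAILED));
        ALLOWED.put(State.AUTHED, EnumSet.of(State.CAPTURED, State.FAILED));
        ALLOWED.put(State.CAPTURED, EnumSet.of(State.REFUNDED));
        ALLOWED.put(State.REFUNDED, EnumSet.noneOf(State.class)); // terminal
        ALLOWED.put(State.FAILED, EnumSet.noneOf(State.class));   // terminal
    }

    private State current = State.INITIATED;

    State current() {
        return current;
    }

    /** Moves to the next state, or throws if the transition is not in the table. */
    void transitionTo(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not allowed");
        }
        current = next;
    }

    public static void main(String[] args) {
        PaymentStateMachine payment = new PaymentStateMachine();
        payment.transitionTo(State.AUTHED);
        payment.transitionTo(State.CAPTURED);
        try {
            payment.transitionTo(State.AUTHED); // a late, duplicate "authorized" event
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // CAPTURED -> AUTHED is not allowed
        }
        System.out.println(payment.current());  // CAPTURED
    }
}
```

Rejecting out-of-order transitions is what makes late or duplicated gateway events harmless: they fail loudly instead of rewinding the payment.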

Don't forget monitoring. Payment systems need specific metrics: idempotency hit rate (how many retries are deduplicated), retry rate, reconciliation discrepancy count, gateway latency percentiles. Set up alerts for any deviation from baseline. A sudden spike in retries often signals a gateway issue. A growing reconciliation queue means a data integrity problem that needs immediate investigation.

io/thecodeforge/payment/PaymentOrchestrator.java · JAVA
public class PaymentOrchestrator {
    private final IdempotencyService idempotencyService;
    private final LedgerService ledgerService;
    private final GatewayClient gatewayClient;

    public PaymentOrchestrator(IdempotencyService idempotencyService,
                               LedgerService ledgerService,
                               GatewayClient gatewayClient) {
        this.idempotencyService = idempotencyService;
        this.ledgerService = ledgerService;
        this.gatewayClient = gatewayClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // 1. Check idempotency (a production version must claim the key atomically,
        //    otherwise two concurrent requests can both pass this check)
        PaymentResult cached = idempotencyService.getResult(request.idempotencyKey());
        if (cached != null) return cached;

        // 2. Begin distributed saga
        String authCode = null; // declared outside the try so the catch block can compensate
        try {
            // Step 1: Reserve funds in gateway
            authCode = gatewayClient.authorize(request.amount(), request.currency());
            // Step 2: Record ledger entry
            ledgerService.recordDebit(request.accountId(), request.amount(), request.currency(), request.idempotencyKey());
            // Step 3: Capture (only after ledger is safe)
            gatewayClient.capture(authCode);
            // Step 4: Update idempotency store
            PaymentResult result = new PaymentResult("SUCCESS", authCode);
            idempotencyService.storeResult(request.idempotencyKey(), result);
            return result;
        } catch (Exception e) {
            // Compensate: void the authorization if one was obtained.
            // A full saga would also append a reversing ledger entry when Step 2 succeeded.
            if (authCode != null) {
                gatewayClient.voidAuthorization(authCode);
            }
            idempotencyService.storeResult(request.idempotencyKey(), new PaymentResult("FAILED", e.getMessage()));
            throw new PaymentException("Payment failed", e);
        }
    }
}
Output
Saga pattern: each step records an effect; if one fails, compensating actions undo previous steps. Idempotency key is used throughout.
Architecture Note
Choose a database that supports transactions and strong consistency for the ledger and idempotency store. PostgreSQL is a solid choice. Avoid eventually consistent stores (like DynamoDB in default mode) for financial data unless you understand the trade-offs and implement reconciliation. For global deployments, consider CockroachDB which offers PostgreSQL-compatible strong consistency across regions without application changes.
Production Insight
A payment team used DynamoDB with eventual consistency for their ledger.
During a region failover, they lost several committed payments because the last write was not replicated.
Balances went out of sync, and they had to run a full reconciliation manually.
Rule: for financial data, choose strong consistency over availability – your users will forgive a brief outage but not a lost transaction.
Key Takeaway
Payments demand strong consistency – choose your database accordingly.
Architect for idempotency at every layer and use sagas for cross-service workflows.
Rule: in payment systems, consistency beats availability when you must choose.
Database Selection for Payments
If: You need ACID transactions and strong consistency
Use: Use PostgreSQL or CockroachDB. Partition by account_id if throughput demands.
If: You need global multi-region active-active
Use: Use Google Spanner or CockroachDB – they provide strong consistency across regions.
If: You are processing non-critical payments (e.g., loyalty points)
Use: Eventually consistent stores like DynamoDB are acceptable with good reconciliation.

Testing Payment Systems: Integration, Contract & Chaos

Testing a payment system is different from testing a typical CRUD app. You can't just test the happy path — you need to verify behaviour under network failures, duplicate requests, and unexpected gateway responses. Three testing strategies are essential: integration tests that run against a sandbox gateway, contract tests that verify your API agreement with the gateway, and chaos experiments that simulate real-world failures.

Integration tests should cover the full payment flow: authorise, capture, refund, void. Use a test gateway like Stripe's test mode or a mock server. Each test must include a unique idempotency key to avoid state contamination. Run these tests in a dedicated environment with its own ledger database.

Contract tests (e.g., using Pact) verify that your payment service and the gateway agree on the API contract. When the gateway changes their response format, a contract test breaks before you deploy to production. This catches mismatches like field name changes or new required parameters.

Chaos engineering for payments: deliberately inject failures — gateway timeouts, slow responses, duplicate requests, invalid responses. Observe how your system behaves. Does it double-charge? Does it lose transactions? Does it handle the dead-letter queue correctly? Run these experiments in a staging environment that mirrors production load. A weekly chaos day helped one team discover that their circuit breaker reset too aggressively, causing repeated gateway overloads.
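You can rehearse the most important of these experiments without any infrastructure. The sketch below uses a fake gateway that commits the charge and then "loses" the first response, which is the classic timeout-after-commit failure. `ChaosSketch`, `FlakyGateway`, and the transaction ID format are all hypothetical; the point is the assertion you would make afterwards: exactly one charge exists.

```java
import java.util.HashMap;
import java.util.Map;

class ChaosSketch {
    /** Fake gateway: deduplicates by idempotency key, but drops the first response after committing. */
    static class FlakyGateway {
        private final Map<String, Long> committedCharges = new HashMap<>();
        private boolean dropNextResponse = true;

        String charge(String idempotencyKey, long amountCents) {
            // Only the first request for a given key moves money.
            committedCharges.putIfAbsent(idempotencyKey, amountCents);
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new IllegalStateException("simulated timeout after commit");
            }
            return "txn_for_" + idempotencyKey;
        }

        int chargeCount() {
            return committedCharges.size();
        }

        long totalChargedCents() {
            return committedCharges.values().stream().mapToLong(Long::longValue).sum();
        }
    }

    /** Client under test: retries with the SAME key, up to maxAttempts. */
    static String chargeWithRetry(FlakyGateway gateway, String key, long amountCents, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IllegalStateException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return gateway.charge(key, amountCents);
            } catch (IllegalStateException e) {
                last = e; // ambiguous outcome: the charge may or may not have happened
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        FlakyGateway safe = new FlakyGateway();
        chargeWithRetry(safe, "key-1", 2_000, 3);
        System.out.println("same key on retry, charges: " + safe.chargeCount());   // 1

        // Anti-pattern for contrast: a fresh key per attempt defeats deduplication.
        FlakyGateway naive = new FlakyGateway();
        try {
            naive.charge("attempt-1", 2_000);
        } catch (IllegalStateException e) {
            naive.charge("attempt-2", 2_000);
        }
        System.out.println("new key per attempt, charges: " + naive.chargeCount()); // 2
    }
}
```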

io/thecodeforge/payment/PaymentContractTest.java (Java)
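import static au.com.dius.pact.consumer.dsl.LambdaDsl.newJsonBody;
import static org.junit.jupiter.api.Assertions.assertEquals;

import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

// Consumer-side contract test (Pact JVM with JUnit 5).
// PaymentClient, PaymentRequest and PaymentResponse are this service's own classes.
// Assumptions: the provider name, the /charge path, a PaymentClient(baseUrl) constructor,
// and a client that serialises the amount as a JSON string.
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "PaymentGateway")
class PaymentContractTest {

    // Describes the request we send and the response we expect from the gateway.
    @Pact(consumer = "PaymentService")
    public RequestResponsePact createPact(PactDslWithProvider builder) {
        return builder
            .given("a valid charge request")
            .uponReceiving("a charge with idempotency key")
                .path("/charge")
                .method("POST")
                .headers("Idempotency-Key", "test-key-123")
                .body(newJsonBody(body -> {
                    body.stringType("amount", "2000");
                    body.stringType("currency", "USD");
                }).build())
            .willRespondWith()
                .status(200)
                .body(newJsonBody(body -> {
                    body.stringType("id", "txn_123");
                    body.stringType("status", "captured");
                }).build())
            .toPact();
    }

    // Runs the real client against the Pact mock server started for this test.
    @Test
    @PactTestFor(pactMethod = "createPact")
    void testCharge(MockServer mockServer) {
        PaymentClient paymentClient = new PaymentClient(mockServer.getUrl());
        PaymentResponse response = paymentClient.charge("test-key-123", new PaymentRequest(2000, "USD"));
        assertEquals("captured", response.getStatus());
    }
}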
Output
Consumer-driven contract test for the payment gateway's charge endpoint. It checks the payment client against a Pact mock server and records the response shape the gateway is expected to honour.
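The chaos scenario described earlier (the gateway processes a capture but the response never arrives) can be rehearsed in-process before running it against staging. The following is a minimal sketch, not a real gateway client: `FlakyGateway`, `GatewayTimeoutException` and `ChaosExperiment` are hypothetical names, and the fake deduplicates on the idempotency key the way a gateway's idempotency store is assumed to.

```java
import java.util.HashMap;
import java.util.Map;

/** Unchecked stand-in for a client-side timeout (hypothetical type). */
class GatewayTimeoutException extends RuntimeException {
    GatewayTimeoutException(String message) { super(message); }
}

/** Hypothetical fake gateway: captures the payment, then drops the first response. */
class FlakyGateway {
    private final Map<String, String> resultsByKey = new HashMap<>();
    private boolean dropNextResponse = true;
    int captureCount = 0; // how many times money actually moved

    String capture(String idempotencyKey, long amountMinor) {
        // Deduplicate on the idempotency key before moving money.
        String txnId = resultsByKey.get(idempotencyKey);
        if (txnId == null) {
            captureCount++;
            txnId = "txn_" + captureCount;
            resultsByKey.put(idempotencyKey, txnId);
        }
        if (dropNextResponse) {
            dropNextResponse = false;
            // The capture already succeeded, but the caller never hears about it.
            throw new GatewayTimeoutException("simulated timeout after processing");
        }
        return txnId;
    }
}

class ChaosExperiment {
    /** Retries the capture with the SAME key, up to maxAttempts times. */
    static String captureWithRetry(FlakyGateway gateway, String key, long amountMinor, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        GatewayTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return gateway.capture(key, amountMinor);
            } catch (GatewayTimeoutException e) {
                last = e; // outcome unknown; retrying is safe only because the key is reused
            }
        }
        throw last;
    }
}
```

Asserting on `captureCount` after the retry is the point of the experiment: with a reused key it stays at 1, while a retry that sends a fresh key moves money twice.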
Payment Testing Layers
  • Unit tests: validate business logic (e.g., compensation calculations) in isolation.
  • Integration tests: run against sandbox gateway with unique idempotency keys.
  • Contract tests: catch API contract changes before they break production.
  • Chaos tests: simulate gateway timeouts, slow responses, and duplicate requests.
  • Production smoke tests: run a small real transaction after every deploy.
Production Insight
A team deployed a new gateway integration without contract tests.
The gateway changed a response field name from 'status' to 'state' without prior notice.
The integration silently failed for 4 hours, losing 2000 transactions.
Rule: always have consumer-driven contract tests for external payment dependencies.
Key Takeaway
Payment systems require multi-layered testing beyond unit tests.
Contract tests protect against API changes; chaos tests verify resilience.
Rule: the test that catches a payment bug after deploy is too late.
Payment Testing Priority
If: You are integrating with a new payment gateway
Use: Write contract tests first, then integration tests with sandbox.
If: You have existing production traffic
Use: Implement production smoke tests that run one real transaction per deploy.
If: You want to improve resilience
Use: Run weekly chaos experiments simulating gateway failures and duplicate requests.
● Production incident · POST-MORTEM · severity: high

Double Charge During Black Friday: Client-Generated Idempotency Key Collision

Symptom
Customers saw two separate charges on their credit card statements for a single purchase. Customer service spent weeks resolving disputes. Chargeback fees increased by 300%.
Assumption
The team assumed the payment gateway's timeout meant the transaction had not gone through, so retrying the capture was safe. They didn't realise the gateway had already processed the first request and was just slow to respond.
Root cause
The capture endpoint lacked idempotency. The payment service retried the capture without an idempotency key when it received a timeout from the gateway. The gateway processed both the original and the retry, resulting in two successful captures for the same authorisation. Additionally, the client SDK generated idempotency keys using a timestamp-based scheme that produced collisions under high concurrency.
Fix
Added an idempotency key to the capture call. The key was derived from the authorisation ID plus a unique request ID. The gateway's idempotency store deduplicated the retry. Also implemented a retry limit and a dead-letter queue for manual review of truly ambiguous failures. Replaced the timestamp-based key generation with cryptographically random UUIDs generated server-side.
Key lesson
  • Always attach an idempotency key to every payment state transition: authorise, capture, and refund are all separate operations.
  • Timeouts don't mean the request didn't go through; they mean you don't know. Idempotency is the only safe way to retry.
  • Never retry a payment operation without an idempotency key, even on timeouts.
  • Client-generated idempotency keys are dangerous — always validate randomness and entropy server-side.
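The fix above derives the capture key from the authorisation ID plus a unique request ID generated server-side. A minimal sketch of that idea follows, assuming Java 17; `CaptureRequest` and the `capture:` key prefix are hypothetical, and in a real service the request ID would be persisted with the payment so every retry reads the same value.

```java
import java.util.UUID;

/** Hypothetical capture request whose ID is generated once, server-side, and then stored. */
record CaptureRequest(String authorisationId, UUID requestId) {

    static CaptureRequest create(String authorisationId) {
        // UUID.randomUUID() uses a cryptographically strong random generator,
        // unlike a timestamp, which collides when two requests share a millisecond.
        return new CaptureRequest(authorisationId, UUID.randomUUID());
    }

    /** Same request gives the same key on every retry; a new request gives a new key. */
    String idempotencyKey() {
        return "capture:" + authorisationId + ":" + requestId;
    }
}
```

The key is a pure function of stored fields, so a retry after a timeout cannot accidentally mint a new one.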
Production debug guide
Diagnose and resolve common payment failures in production. 5 entries
Symptom · 01
Payment failed with 'duplicate idempotency key' error
Fix
Check the idempotency store to see if the key was already used. If it's a legitimate retry, return the cached response. If it's a genuine duplicate from a different payment, generate a new key.
Symptom · 02
Payment appears successful but user did not receive confirmation
Fix
Check the ledger for the debit entry, then check the gateway for the transaction. If the gateway shows success but the ledger entry is missing, the ledger write failed. Reconcile using the idempotency key as the correlation ID.
Symptom · 03
Reconciliation report shows mismatched amounts
Fix
Compare internal 'payments' table with gateway's transaction report. Look for partial captures, refunds, or time zone differences. Flag for manual review if the difference exceeds a threshold.
Symptom · 04
Gateway returns 402 (payment_required) but we already captured
Fix
Check the gateway transaction status using the gateway transaction ID. A 402 on a subsequent authorisation does not invalidate a prior capture. Verify the capture status in the gateway dashboard.
Symptom · 05
Multiple payments created for the same order
Fix
Check for a missing unique constraint on (merchant_id, order_reference). If the constraint exists, verify idempotency key uniqueness. Run a reconciliation query to find duplicates and issue refunds.
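For Symptom 01, telling a legitimate retry from a key that was reused for a different payment requires storing something about the original request next to the key. One way to sketch this, assuming Java 17, is to keep a request fingerprint per key; `IdempotencyStore`, its method names and the fingerprint format are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdempotencyStore {
    enum Outcome { NEW, REPLAY, CONFLICT }

    /** What is kept per key: a fingerprint of the request and the response that was returned. */
    record Entry(String requestFingerprint, String response) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Classifies an incoming request against what is already stored for its key. */
    Outcome classify(String key, String requestFingerprint) {
        Entry existing = entries.get(key);
        if (existing == null) return Outcome.NEW;
        return existing.requestFingerprint().equals(requestFingerprint)
                ? Outcome.REPLAY    // same key, same request: return the cached response
                : Outcome.CONFLICT; // same key, different request: the key was reused
    }

    /** Stores the first result for a key; later calls for the same key are ignored. */
    void record(String key, String requestFingerprint, String response) {
        entries.putIfAbsent(key, new Entry(requestFingerprint, response));
    }

    /** Returns the cached response, or null if the key is unknown. */
    String cachedResponse(String key) {
        Entry existing = entries.get(key);
        return existing == null ? null : existing.response();
    }
}
```

A `REPLAY` maps to "return the cached response" and a `CONFLICT` maps to "generate a new key", matching the two branches of the fix.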
★ Payment System Debug Cheat Sheet
Quick commands to diagnose and fix common payment issues in production.
Idempotency key collision
Immediate action
Check idempotency store for the key.
Commands
SELECT * FROM idempotency_keys WHERE key = '{key}'
curl -v -H 'Idempotency-Key: {key}' https://api.payments/charge
Fix now
If the key was accidentally reused, generate a new server-side UUID and retry.
Ledger balance out of sync with gateway
Immediate action
Run a reconciliation query between payments and gateway_transactions.
Commands
SELECT p.id, p.amount, g.amount FROM payments p JOIN gateway_transactions g ON p.gateway_txn_id = g.txn_id WHERE p.amount != g.amount
SELECT * FROM ledger_entries WHERE reference_id = '{payment_id}'
Fix now
Insert a compensating ledger entry (reversal) and notify finance for manual reconciliation.
Payment timeout during capture
Immediate action
Check gateway status and retry with idempotency key.
Commands
curl -X POST -H 'Idempotency-Key: {key}' https://gateway/capture -d 'auth_code={auth}'
kubectl logs -l app=payment-service --tail=200 | grep 'timeout'
Fix now
Increase capture timeout to 30s and implement circuit breaker if gateway latency spikes.
Gateway report shows extra transactions not in ledger
Immediate action
Query the gateway's transaction report for the date range and compare with internal payments table.
Commands
SELECT txn_id, amount FROM gateway_transactions WHERE created_at >= NOW() - INTERVAL '1 day' AND created_at < NOW()
SELECT reference_id, amount FROM payments WHERE created_at >= NOW() - INTERVAL '1 day'
Fix now
If the internal table is missing a legitimate gateway transaction, insert a compensating ledger entry (credit if missing debit) and flag it for finance review. If the gateway shows a ghost transaction, escalate to gateway support.
Reconciliation report shows orphaned authorisations
Immediate action
Query gateway for all authorisations older than 7 days without a capture.
Commands
SELECT auth_id, amount, created_at FROM authorisations WHERE status='APPROVED' AND created_at < NOW() - INTERVAL '7 days'
curl -X POST -H 'Idempotency-Key: {key}' https://gateway/void -d 'auth_id={auth}'
Fix now
Implement a daily cron job to void orphaned authorisations and log them for audit.
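The selection step of that daily job can be sketched in a few lines, assuming Java 17. `Authorisation` and `OrphanedAuthSweep` are hypothetical names; the `APPROVED` status and the 7-day cutoff mirror the query above. The void call itself would still go to the gateway with its own idempotency key.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical row from the authorisations table. */
record Authorisation(String authId, String status, Instant createdAt) {}

class OrphanedAuthSweep {
    /** Returns approved authorisations older than maxAge, i.e. candidates for a void call. */
    static List<Authorisation> findOrphans(List<Authorisation> all, Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        return all.stream()
                .filter(a -> a.status().equals("APPROVED"))   // still uncaptured
                .filter(a -> a.createdAt().isBefore(cutoff))  // older than the cutoff
                .toList();
    }
}
```

Passing `now` in as a parameter keeps the job deterministic under test.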
Payment Processing Patterns Compared
Pattern | Use Case | Key Trade-off
Idempotency Key | Prevents duplicate processing of the same request | Requires client cooperation or server-side key generation; storage overhead
Append-Only Ledger | Immutable audit trail, regulatory compliance | Balance queries require aggregation; may need caching for low latency
Exponential Backoff Retry | Handles transient failures gracefully | Must pair with idempotency; needs jitter to avoid a thundering herd
Saga Pattern | Cross-service transaction without distributed transaction | Complexity of compensating actions; eventual consistency window
Reconciliation | Detects silent data loss or inconsistency | Manual effort if not automated; batch processing latency
Outbox Pattern | Guarantees event delivery after ledger write | Adds complexity; requires background job for publishing
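For the exponential backoff row, the delay calculation is small enough to show directly. This is a sketch of one common variant ("full jitter", where the delay is drawn uniformly between zero and an exponentially growing ceiling), not necessarily the scheme any particular gateway SDK uses; `Backoff` and its parameters are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

class Backoff {
    /** Upper bound for a 1-based attempt: base * 2^(attempt-1), capped at capMillis. */
    static long ceilingMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.max(0, Math.min(attempt - 1, 30)); // clamp to avoid overflow
        return Math.min(capMillis, baseMillis << shift);
    }

    /** Full jitter: a uniformly random delay in [0, ceiling], so retries spread out. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a 100 ms base and a 5 s cap, the ceilings run 100, 200, 400, 800 ms and so on until the cap. The caller still needs a retry limit and must reuse the same idempotency key on every attempt.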

Key takeaways

1. Idempotency keys are the single most important safety net against duplicate charges. Never retry without one.
2. Append-only ledgers provide an immutable audit trail. Balance is always derived, never stored directly.
3. Pair retries with exponential backoff, jitter, and a circuit breaker. Unlimited retries cause cascading outages.
4. Reconciliation must be automated from day one. It catches silent errors that your application logic misses.
5. Choose strong consistency over availability for financial data. Lost transactions are worse than brief downtime.
6. Test payment systems with chaos experiments: simulate timeouts, duplicates, and unexpected gateway responses.
7. Webhook idempotency is often forgotten, but double-processing a webhook can cause double refunds or duplicate ledger entries.

Common mistakes to avoid

7 patterns
×

Storing balance as a mutable column

Symptom
During failure recovery or race conditions, balance becomes incorrect. Duplicate updates or lost writes go undetected until a user complains or audit fails.
Fix
Use an append-only ledger where balance is derived from summed entries. Use a materialized cache only for reads, and rebuild from ledger on inconsistency.
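To make "balance is derived, never stored" concrete, here is a minimal in-memory sketch, assuming Java 17. `Ledger` and `LedgerEntry` are hypothetical names, amounts are in minor units (cents) to avoid floating point, and a real ledger would be a database table with the same append-only discipline.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One immutable ledger line: positive amounts are credits, negative amounts are debits. */
record LedgerEntry(String accountId, long amountMinor, String referenceId) {}

class Ledger {
    private final List<LedgerEntry> entries = new ArrayList<>();

    /** The only write operation: entries are added, never updated or deleted. */
    void append(LedgerEntry entry) {
        entries.add(entry);
    }

    /** Corrections are new entries that negate the original, not edits. */
    void reverse(LedgerEntry original) {
        append(new LedgerEntry(original.accountId(), -original.amountMinor(),
                "reversal:" + original.referenceId()));
    }

    /** Balance is always computed from the entries. */
    long balanceMinor(String accountId) {
        return entries.stream()
                .filter(e -> e.accountId().equals(accountId))
                .mapToLong(LedgerEntry::amountMinor)
                .sum();
    }

    /** Read-only view for audit and reconciliation. */
    List<LedgerEntry> entries() {
        return Collections.unmodifiableList(entries);
    }
}
```

A cached balance can sit in front of `balanceMinor` for reads, as long as it can be thrown away and rebuilt from the entries.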
×

Retrying without idempotency

Symptom
Users get charged multiple times when network retry or timeout triggers the same request again. Customer support is flooded with dispute calls.
Fix
Always attach an idempotency key to payment requests. Store the result once and return the same for any subsequent request with the same key.
×

Using a single database for ledger and transactional processing without proper isolation

Symptom
Reads of balance during heavy write load see inconsistent state. Partial updates lead to incorrect display balances or failed validations.
Fix
Serve reporting and balance display from read replicas, accepting a small amount of staleness. Send ledger writes to the primary with strong consistency, or separate the write and read models (CQRS).
×

Not implementing reconciliation from day one

Symptom
After a database migration or incident, you realise you have lost or duplicated transactions. No automated check exists to catch the issue; recovery requires lengthy manual audit.
Fix
Implement reconciliation pipelines before going to production. Start with daily batch matching against external payment provider reports. Automate alerts for mismatches.
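The core of a daily batch match is a three-way comparison per transaction ID: present on both sides with equal amounts, present on one side only, or present on both with different amounts. A minimal sketch follows; `Reconciler`, the map-based inputs and the mismatch labels are hypothetical, and a real pipeline would load both sides from the payments table and the provider's report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class Reconciler {
    /**
     * Compares internal and gateway amounts (minor units) keyed by transaction ID.
     * Returns one line per mismatch, sorted by transaction ID.
     */
    static List<String> reconcile(Map<String, Long> internalByTxnId, Map<String, Long> gatewayByTxnId) {
        List<String> mismatches = new ArrayList<>();
        TreeSet<String> allIds = new TreeSet<>(internalByTxnId.keySet());
        allIds.addAll(gatewayByTxnId.keySet());
        for (String id : allIds) {
            Long internal = internalByTxnId.get(id);
            Long gateway = gatewayByTxnId.get(id);
            if (internal == null) {
                mismatches.add(id + ": MISSING_IN_LEDGER");   // gateway charged, we have no record
            } else if (gateway == null) {
                mismatches.add(id + ": MISSING_AT_GATEWAY");  // we recorded it, gateway did not
            } else if (!internal.equals(gateway)) {
                mismatches.add(id + ": AMOUNT_MISMATCH");
            }
        }
        return mismatches;
    }
}
```

An empty list is the healthy result; anything else is what the automated alert should carry.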
×

Not handling idempotency key expiry correctly

Symptom
A retry arrives after the idempotency key has expired (e.g., after 24 hours). The system processes it as a new request, resulting in a double charge or duplicate entry.
Fix
Set the idempotency key TTL to cover the entire settlement lifecycle (7 days for cards). Keep keys in the active table until the TTL expires, then have a background job archive them to cold storage for audit. For extremely delayed retries, implement a manual review step.
×

Not testing against gateway sandbox before production deploy

Symptom
A new gateway integration passes unit tests but fails in production due to differences in response format, field names, or error handling.
Fix
Write integration tests that run against the gateway's sandbox environment. Include scenarios for timeouts, declined payments, and duplicate idempotency keys.
×

Ignoring webhook idempotency

Symptom
The same webhook event (e.g., payment.succeeded) is delivered multiple times, causing duplicate ledger entries or multiple refunds.
Fix
Store the webhook event ID with a unique constraint. Process each event at most once. Use idempotency keys for refund operations triggered by webhooks.
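The "store the event ID with a unique constraint" step can be sketched in memory with a concurrent set, whose `add` returns `false` for an ID it has already seen. `WebhookHandler` and the `ledgerWrites` counter are hypothetical; in production the ID insert and the ledger write would share one database transaction, so a crash between them cannot leave an event marked as processed without its side effect.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class WebhookHandler {
    // Stand-in for a table with a unique constraint on the event ID.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    int ledgerWrites = 0; // side effects actually performed

    /** Returns true if the event was processed, false if it was a redelivery. */
    boolean handle(String eventId, String eventType) {
        if (!processedEventIds.add(eventId)) {
            return false; // already seen: acknowledge without repeating the side effect
        }
        ledgerWrites++; // e.g. write the ledger entry for payment.succeeded
        return true;
    }
}
```

Redeliveries should still be acknowledged with a success status so the sender stops retrying.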
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How would you design a payment system that guarantees exactly-once proce...
Q02 · SENIOR
What is the difference between idempotency and a unique constraint? When...
Q03 · SENIOR
How do you handle a case where a payment is authorised but the capture f...
Q04 · SENIOR
Explain how you would design a ledger that supports high throughput (10,...
Q01 of 04 · SENIOR

How would you design a payment system that guarantees exactly-once processing even with network retries?

ANSWER
Use an idempotency key on every payment request. The key is a UUID, preferably generated server-side; a client-supplied key should be validated before it is trusted. Store the key and the result in a database with a unique constraint on the key. A separate "check, then insert" leaves a race between two concurrent requests, so insert the key first and let the unique constraint decide which request proceeds; the other returns the cached response. This ensures that retries of the same request do not create duplicate charges. Also ensure that all downstream operations (ledger, gateway) are idempotent or have their own idempotency mechanism.
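The "run at most once per key, return the cached response otherwise" behaviour in this answer can be sketched in memory with `ConcurrentHashMap.computeIfAbsent`, which applies the function atomically and at most once per key. `IdempotentExecutor` is a hypothetical name, and the map stands in for the database table with a unique constraint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class IdempotentExecutor {
    // Stand-in for the idempotency table: key -> stored response.
    private final Map<String, String> responsesByKey = new ConcurrentHashMap<>();

    /**
     * Runs the operation only if the key is new, then stores its response.
     * Any later call with the same key returns the stored response instead.
     */
    String execute(String idempotencyKey, Supplier<String> operation) {
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

Because the check and the store happen in one atomic step, two concurrent retries cannot both run the charge. One limitation of this sketch: if the operation throws, nothing is stored, so the next call runs it again; a real implementation has to decide how to record in-flight and failed attempts.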
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What happens if the idempotency key expires before a retry arrives?
02
How do you handle a gateway that returns HTML instead of JSON?
03
What is the outbox pattern and how does it help payment systems?
04
Can I use DynamoDB for my ledger?
05
How often should reconciliation run?