Advanced 10 min · March 06, 2026

Payment System Design — Idempotency & Double Charge Fixes

Q: What happens if the idempotency key expires before a retry arrives?

If the TTL is shorter than the retry window, the retry is treated as a new request and the operation executes again, causing a duplicate. To avoid this, set the idempotency key TTL to cover the entire settlement lifecycle — typically 7 days for card payments. Use a background job to purge expired keys, but only after the settlement window closes.

Q: How do you handle a gateway that returns HTML instead of JSON?

This is a classic failure case. Your HTTP client should be configured to handle non-JSON responses gracefully. Parse the content-type header and if it's not JSON, treat it as a server error, log the response body for debugging, and trigger an alert. Do not retry automatically — escalate to manual review or use a circuit breaker.
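A minimal sketch of the content-type check described above. The class name `GatewayResponseClassifier` and the two outcomes are illustrative assumptions, not part of any gateway SDK; the point is only that the decision is made from the header before any JSON parsing is attempted.

```java
import java.util.Locale;

/** Hypothetical helper: decides how to treat a gateway HTTP response. */
final class GatewayResponseClassifier {
    enum Outcome { PARSE_JSON, SERVER_ERROR_NO_RETRY }

    /** contentType is the raw Content-Type header value, which may be null. */
    static Outcome classify(String contentType) {
        if (contentType == null) {
            return Outcome.SERVER_ERROR_NO_RETRY;
        }
        // Drop parameters such as "; charset=utf-8" before comparing
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        boolean json = mediaType.equals("application/json") || mediaType.endsWith("+json");
        return json ? Outcome.PARSE_JSON : Outcome.SERVER_ERROR_NO_RETRY;
    }

    public static void main(String[] args) {
        System.out.println(classify("application/json; charset=utf-8"));
        System.out.println(classify("text/html"));
    }
}
```

On the `SERVER_ERROR_NO_RETRY` branch the caller would log the body, raise an alert, and hand the payment to manual review or let the circuit breaker count the failure.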

Q: What is the outbox pattern and how does it help payment systems?

The outbox pattern makes a database change and the event that announces it (e.g., payment.completed) commit atomically. Instead of publishing the event directly (which can fail after the DB commit), you write the event to an outbox table in the same transaction. A separate process reads the outbox and publishes the events. This gives at-least-once delivery: no committed payment event is lost, but the relay can publish the same event twice after a crash, so consumers must deduplicate by event ID.

Q: Can I use DynamoDB for my ledger?

You can, but only if you use strongly consistent reads and understand the trade-offs. DynamoDB's default reads are eventually consistent, so a read issued right after a write can return stale data. Where cross-region replication is asynchronous, a regional failover can also drop the most recent writes. Either is unacceptable for financial data. With strong consistency, you lose some availability during partitions. For critical payments, PostgreSQL or CockroachDB are safer choices.

Q: How often should reconciliation run?

At least once per billing cycle. For high-volume or critical payments, run reconciliation hourly or every 15 minutes. The goal is to detect discrepancies before they compound. Start with daily reconciliation and increase frequency as your payment volume grows.

Client-generated idempotency keys caused 300% more chargebacks on Black Friday.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.


July 27, 2026

last updated


Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems


⚡Quick Answer

Idempotency keys prevent duplicate charges from retries — the foundation of safe payments
Append-only ledgers provide an immutable audit trail; balance is always derived, never stored
Retry with exponential backoff and jitter, but always pair with idempotency
Reconciliation catches silent inconsistencies — automate it from day one
Strong consistency beats availability for financial data; choose PostgreSQL or CockroachDB
Reconciliation timing matters: run it at least once per billing cycle to catch silent losses before they compound

✦ Definition~90s read

What Is Payment System Design?


When you design a payment system, you're designing a system that must say 'yes' exactly once, and 'no' with a clear reason. Getting that wrong costs real money. Senior engineers know that every edge case you don't think of will happen in production — the trick is to design so that when it happens, the system still does the right thing.

Here's the thing: the moment you accept real money, you're operating under regulatory constraints. PCI DSS, SOC 2, and regional regulations like PSD2 in Europe impose specific requirements on how you store, process, and transmit payment data. Your architecture decisions have compliance consequences.


A practical way to think about it: your payment system is only as good as its worst failure mode. If you test only the happy path, you've tested nothing. Invest time in simulating gateway timeouts, database failures, and duplicate requests. That's where the real engineering happens.

Plain-English First

Imagine you're at a vending machine. You put in a dollar, press B3, and the machine jams. Did your dollar count? Did the snack dispatch? You don't know — and that confusion is exactly what a payment system is designed to eliminate. A payment system is like a very careful referee that keeps score of every dollar that moves, makes sure nobody's charged twice if the internet hiccups, and can always show you a receipt even years later. Every time you tap your card at a coffee shop, at least five different computer systems have a 300-millisecond conversation to make sure the money moves exactly once, to exactly the right place.

Payment systems are the circulatory system of the modern economy. Stripe processes hundreds of billions of dollars annually. PayPal handles over 22 billion transactions per year. Even a small SaaS product charging $29/month is trusting its payment pipeline with its entire revenue stream. When a payment system fails, it doesn't just log an error — it loses real money, destroys customer trust, and can trigger regulatory scrutiny. This is why payment system design is one of the most consequential, unforgiving engineering domains you'll encounter. Yet most introductory guides stop at idempotency — they skip the ledger model and reconciliation pipelines that catch silent errors.

The core problem a payment system solves is deceptively simple: move money from person A to person B reliably. But 'reliably' carries enormous weight. Networks time out. Databases crash mid-transaction. Users double-click submit buttons. Banks return ambiguous responses like 'do not honor' without explaining why. A naive implementation will lose money, double-charge customers, or silently swallow failures — any of which is catastrophic. The engineering challenge is building a system that behaves correctly under all these failure modes simultaneously.

By the end of this article you'll understand how to architect a production-grade payment system end to end — from the data model and API contract to idempotency keys, ledger design, retry strategies with exponential backoff, reconciliation pipelines, and the edge cases that bite engineers in production. You'll walk away knowing exactly how to answer this question in a system design interview and, more importantly, how to actually build it.

Here's the reality most guides skip: your payment system will fail in ways you can't predict. The difference between a good system and a broken one is not the happy path — it's how gracefully it degrades when the bank's API returns HTML instead of JSON, or when a network partition splits your database nodes. The design choices you make today determine whether those failures become a five-minute incident or a week-long disaster. And that's why every decision — from idempotency key format to database fsync configuration — deserves careful thought.

What Is Payment System Design?

A payment system is a set of services and processes that reliably moves money between parties. Its core challenges are not about speed — they're about correctness under uncertainty. Every engineer who builds payment systems soon learns that network timeouts, database failures, and user impatience are the real enemies. The naive approach of 'try, and if it fails, try again' leads to double charges, lost payments, and angry customers. That's why idempotency, ledgers, and reconciliation aren't optional — they're the bedrock.

Let's talk about the design decisions that separate production-grade systems from weekend projects. Every choice — whether to use a single database or separate services, how to model the API contract, what to do when a third-party gateway returns a 503 — has a downstream impact on correctness, latency, and auditability. A healthy payment system is boring: it logs everything, it never loses data, and it pushes discrepancies to a queue instead of swallowing them.


📊 Production Insight

A naive payment system without idempotency caused a SaaS company to double-charge 300 customers during a network blip.

The fix took 3 hours, cost $15k in refunds, and triggered a PR disaster that lost 10% of their user base.

Rule: assume every network call can fail, and every retry can succeed — design for both.

🎯 Key Takeaway

Payment systems are unforgiving: one wrong assumption about retry semantics can cost real money.

Design for the worst-case network behavior from the start.

Rule: if you don't know what happens when the same request arrives twice, you're not ready for production.

When to invest in payment system design depth

If: You are processing real-money transactions
→ Use: Invest in idempotency, ledger, and reconciliation from day one.

If: You are building a prototype or MVP with fake money
→ Use: You can skip some guarantees, but plan to add them before any real payment flows.

thecodeforge.io

Design Payment System

Idempotency: The Non-Negotiable Foundation

Idempotency is what stops a payment from being processed twice when a user refreshes the page or a network retry sends the same request again. The core mechanism is an idempotency key — a unique identifier (often a UUID) that the client sends with every creation request. The server checks if it has already seen that key. If yes, it returns the cached result from the first invocation instead of executing the operation again. This transforms an unsafe retry into a safe replay.

You store the key together with the response in a database row with a unique constraint. Any duplicate key insertion fails, and you simply return the stored response. The TTL on this mapping matters: it must outlast the longest window in which a retry can still arrive, while staying short enough to avoid unbounded storage growth. Use a primary key or a unique index on the idempotency key column.

A common production pattern is to use a single table for idempotency keys with a TTL of 24 hours. But be careful: if the TTL is too short, a retry that arrives after expiry will be processed as a new request and you'll double-charge. Set the TTL to cover the payment's settlement lifecycle — typically 7 days for card payments. Use a background job to purge expired keys.
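The TTL hazard can be seen in a small in-memory sketch. `ExpiringKeyStore` is a hypothetical stand-in for the idempotency table, and the clock is passed in explicitly so the behaviour is easy to reason about; a real store would be a durable table with a purge job.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory idempotency store with a TTL and an explicit purge step. */
final class ExpiringKeyStore {
    record Stored(String response, Instant expiresAt) {}

    private final Map<String, Stored> rows = new ConcurrentHashMap<>();
    private final Duration ttl;

    ExpiringKeyStore(Duration ttl) {
        this.ttl = ttl;
    }

    /** Returns the previously stored response, or null if this call stored it first. */
    String putIfAbsent(String key, String response, Instant now) {
        Stored existing = rows.putIfAbsent(key, new Stored(response, now.plus(ttl)));
        return existing == null ? null : existing.response();
    }

    /** Background purge: removes rows that expired before `now` and returns how many. */
    int purgeExpired(Instant now) {
        int before = rows.size();
        rows.values().removeIf(stored -> stored.expiresAt().isBefore(now));
        return before - rows.size();
    }

    public static void main(String[] args) {
        ExpiringKeyStore store = new ExpiringKeyStore(Duration.ofDays(7));
        Instant t0 = Instant.parse("2026-01-01T00:00:00Z");
        store.putIfAbsent("key-1", "charged", t0);
        // A retry one day later is deduplicated and replays the stored response
        System.out.println(store.putIfAbsent("key-1", "charged-again", t0.plus(Duration.ofDays(1))));
        // After the purge, the same key is treated as a brand-new request
        store.purgeExpired(t0.plus(Duration.ofDays(8)));
        System.out.println(store.putIfAbsent("key-1", "charged-again", t0.plus(Duration.ofDays(8))));
    }
}
```

The second `putIfAbsent` after the purge returns null, which is exactly the point at which a late retry would be executed as a new charge.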

One subtle failure: if you store idempotency keys in the same database as your ledger, a database outage can render your entire payment API unavailable. Consider a separate, highly available key-value store (Redis with persistence, or DynamoDB with strong consistency) for idempotency data, while keeping the ledger in a transactional database. This decoupling lets you tolerate partial failures without losing the ability to retry safely.

Another nuance: idempotency keys don't just protect against double charging — they also serve as a correlation ID for debugging. When a payment fails after multiple retries, the key ties all attempts together. You can trace the entire lifecycle in your logs. Make sure to log the idempotency key at every step.

io/thecodeforge/payment/IdempotencyService.javaJAVA

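import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory sketch: a production store must be durable and expire keys by TTL.
// Assumes PaymentRequest and PaymentResult(String status, String detail) exist in this package.
public class IdempotencyService {
    // Sentinel marking a key whose first execution has not finished yet
    private static final PaymentResult IN_FLIGHT = new PaymentResult("IN_FLIGHT", "");
    private final Map<String, PaymentResult> store = new ConcurrentHashMap<>();

    public PaymentResult process(String idempotencyKey, PaymentRequest request) {
        // Atomic putIfAbsent: only the first caller for a key gets null and executes
        PaymentResult existing = store.putIfAbsent(idempotencyKey, IN_FLIGHT);
        if (existing != null) {
            // Reference comparison is intentional: IN_FLIGHT is a sentinel object
            return existing == IN_FLIGHT ? waitForCompletion(idempotencyKey) : existing;
        }
        try {
            PaymentResult result = executePayment(request);
            store.put(idempotencyKey, result);
            return result;
        } catch (RuntimeException e) {
            // The failure is cached too, so a retry with this key replays FAILED
            store.put(idempotencyKey, new PaymentResult("FAILED", String.valueOf(e.getMessage())));
            throw e;
        }
    }

    // Poll until the first execution replaces the sentinel (add a timeout in production)
    private PaymentResult waitForCompletion(String idempotencyKey) {
        PaymentResult current;
        while ((current = store.get(idempotencyKey)) == IN_FLIGHT) {
            Thread.onSpinWait();
        }
        return current;
    }

    // Placeholder for the real gateway call
    protected PaymentResult executePayment(PaymentRequest request) {
        return new PaymentResult("SUCCESS", "stub");
    }
}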

Output

Idempotency key ensures a payment request is processed at most once. The ConcurrentHashMap with putIfAbsent guarantees atomic check-and-set.

⚠ Idempotency Key Collision

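If two different payments accidentally share the same idempotency key, the second one is silently ignored. Have clients generate keys with enough entropy (UUIDv4, or a hash of client_id + timestamp + nonce). Don't assume every client sends one: if the key is missing, generate it server-side and return it, bearing in mind that a server-generated key only protects retries made after the client has received it. To catch accidental reuse, store a fingerprint of the request alongside the key and reject a replay whose payload differs. Concurrent requests with the same key must also be serialised, either by the unique constraint itself or by a distributed lock around the check-and-store.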

📊 Production Insight

A payment processor at a major retailer used a timestamp-based idempotency key during Black Friday.

Concurrent requests from the same user produced identical keys, causing duplicate charges.

They lost $340k before the fix: use a cryptographically random UUID generated on the server.

Key rule: never derive idempotency from request data alone.

🎯 Key Takeaway

Idempotency keys are the lock that prevents double charges.

Store them in a table with a unique constraint and a TTL.

The rule: if you can't replay a request safely, your system isn't production-ready.

Idempotency Strategy Decision

If: Client provides idempotency key
→ Use: Use it as-is after validating format and length. Reject non-compliant keys with 400.

If: Client does not provide key
→ Use: Generate one server-side using UUIDv4. Return it in response header for client retries.

If: Existing key found in store with IN_FLIGHT status
→ Use: Block and poll until first execution completes (with timeout). Return the eventual response.

If: Existing key found with completed response
→ Use: Return that response immediately. Do not re-execute.
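The first two branches of the decision above can be sketched as a small policy class. The accepted key format (16 to 64 characters of letters, digits, hyphen, or underscore) is an assumption chosen for illustration, and `IdempotencyKeyPolicy` is a hypothetical name.

```java
import java.util.Optional;
import java.util.UUID;
import java.util.regex.Pattern;

/** Hypothetical policy: validate a client key, or generate one when it is absent. */
final class IdempotencyKeyPolicy {
    // Assumed format: 16-64 characters of letters, digits, '-' or '_'
    private static final Pattern VALID = Pattern.compile("[A-Za-z0-9_-]{16,64}");

    static boolean isValid(String key) {
        return key != null && VALID.matcher(key).matches();
    }

    /**
     * Returns the client key if it is well-formed, a fresh random UUID if no key
     * was sent, or empty if the key is malformed (the caller should answer 400).
     */
    static Optional<String> resolve(String clientKey) {
        if (clientKey == null || clientKey.isBlank()) {
            return Optional.of(UUID.randomUUID().toString());
        }
        return isValid(clientKey) ? Optional.of(clientKey) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("order-12345-attempt-key").isPresent());
        System.out.println(resolve("short").isPresent());
    }
}
```

`UUID.randomUUID()` produces a version 4 (random) UUID, whose 36-character string form also passes the assumed format check.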


Ledger Design: The Source of Truth

A ledger is the immutable record of every financial event. Every credit, debit, fee, refund, and adjustment is an entry in the ledger. The balance of an account is derived by summing all entries — never stored directly as a mutable field. This prevents inconsistencies from partial updates and gives you a full audit trail.

Design the ledger as an append-only table. Each entry has: account_id, entry_type (DEBIT/CREDIT), amount in the smallest currency unit (cents), currency, reference_id, timestamp, and a unique sequence number. Use a composite primary key on (account_id, sequence) to guarantee ordering. For performance, you can cache balances in a separate table but always rebuild from the ledger during reconciliation.

One leaky abstraction that trips teams up: they store the balance as a column and rebuild it periodically from the ledger. During a reconciliation run, if the rebuild fails halfway, you'll have an inconsistent cache. The better approach is to never trust the cached balance — treat it as a degraded read and always verify against the ledger for critical operations like withdrawals.

Another trap: teams mistakenly store derived balances in the ledger itself using UPDATE statements. That's not an append-only ledger — it's a mutating counter dressed up. If you ever find yourself writing an UPDATE on the ledger table, stop and redesign. The correct pattern is to always INSERT and compute balance via aggregation. For performance, materialise the balance in a separate cache table with a periodic rebuild job, but never trust it for critical withdrawals without re-verifying against the ledger.

One more production consideration: ledger entries must be idempotent themselves. If you process a webhook callback twice, you need to ensure the second attempt doesn't create a duplicate ledger entry. Use the gateway's transaction ID as a unique constraint on the ledger table. This prevents double-counting even if your webhook handler re-executes.

io/thecodeforge/payment/ledger.sqlSQL

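CREATE TABLE ledger_entries (
    account_id     UUID NOT NULL,
    sequence_id    BIGSERIAL,
    entry_type     VARCHAR(10) NOT NULL CHECK (entry_type IN ('DEBIT','CREDIT')),
    -- Direction lives in entry_type, so the amount itself is always positive
    amount_cents   BIGINT NOT NULL CHECK (amount_cents > 0),
    currency       CHAR(3) NOT NULL DEFAULT 'USD',
    reference_id   UUID NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (account_id, sequence_id),
    -- One entry per (reference, account, direction): a replayed webhook cannot double-post.
    -- Assumes reference_id carries the gateway transaction ID (or a UUID mapped from it).
    UNIQUE (reference_id, account_id, entry_type)
);

-- The UNIQUE constraint above also serves lookups by reference_id,
-- so a separate index on that column is not needed.

-- Derive balance: sum of credits minus debits
SELECT account_id,
       SUM(CASE WHEN entry_type = 'CREDIT' THEN amount_cents
                ELSE -amount_cents END) AS balance_cents
FROM ledger_entries
WHERE account_id = :accountId
GROUP BY account_id;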

Output

The ledger table is append-only. Balance is computed dynamically from the sum of CREDIT and DEBIT entries.

Mental Model

Ledger as Append-Only Log

Think of a ledger like a bank statement – you never erase a line; you add a reversal entry if something is wrong.

Every financial event appends a row – never UPDATE or DELETE.
Balance is always derived by summing entries on the fly (or via materialized view).
Reversal entries cancel out previous entries (e.g., a chargeback appends a CREDIT that offsets the original DEBIT).
Auditors and regulators require this immutability.
Use a materialised balance table only for read capacity; rebuild it periodically from the ledger to catch drift.
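The same model can be sketched in application code. `LedgerSketch` is a hypothetical in-memory stand-in for the ledger table: it exposes only an append operation, and the balance is computed by summing entries, with a reversal expressed as a new offsetting entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory ledger: entries are only ever appended. */
final class LedgerSketch {
    enum EntryType { DEBIT, CREDIT }

    record Entry(String accountId, EntryType type, long amountCents, String referenceId) {}

    // No update or delete method is exposed on purpose
    private final List<Entry> entries = new ArrayList<>();

    void append(Entry entry) {
        if (entry.amountCents() <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        entries.add(entry);
    }

    /** Balance is derived: credits add, debits subtract. */
    long balanceCents(String accountId) {
        long balance = 0;
        for (Entry entry : entries) {
            if (!entry.accountId().equals(accountId)) {
                continue;
            }
            balance += entry.type() == EntryType.CREDIT ? entry.amountCents() : -entry.amountCents();
        }
        return balance;
    }

    public static void main(String[] args) {
        LedgerSketch ledger = new LedgerSketch();
        ledger.append(new Entry("acct-1", EntryType.CREDIT, 10_000, "topup-1"));
        ledger.append(new Entry("acct-1", EntryType.DEBIT, 2_500, "order-1"));
        // Undo the debit by appending an offsetting credit, never by editing the row
        ledger.append(new Entry("acct-1", EntryType.CREDIT, 2_500, "reversal-of-order-1"));
        System.out.println(ledger.balanceCents("acct-1")); // prints 10000
    }
}
```

All three rows remain visible afterwards, which is what gives an auditor the full history rather than just the final number.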

📊 Production Insight

A startup stored account balances in a single row and updated it transactionally.

During a database failover, the UPDATE was applied twice, doubling the balance.

They couldn't detect the error until a withdrawal exceeded actual funds.

Rule: never store a mutable balance – always derive from an immutable ledger.

🎯 Key Takeaway

Ledgers are the single source of financial truth.

Append-only, never update – balance is always derived.

Rule: if you are updating a balance column, you're building a future outage.

Ledger vs Mutable Balance

If: You need audit trail and regulatory compliance
→ Use: Use an append-only ledger with derived balance.

If: You need ultra-low latency read of balance (<1ms)
→ Use: Use a materialized cache that is rebuilt from ledger every few seconds; accept eventual consistency.

If: You are processing high throughput (10k+ tx/s)
→ Use: Partition the ledger by account_id and cache recent balances in Redis with periodic ledger reconciliation.


Retry Strategies and Failure Handling

Payment networks are unreliable. A bank might return a timeout, a network partition might drop the response, or the payment gateway may be overloaded. Your system must handle these gracefully. The standard approach is automatic retries with exponential backoff and jitter. But retries must be safe — meaning the operation is idempotent. Every retry uses the same idempotency key, so the payment processor's ledger sees only one debit.

Define a finite number of retries (e.g., 3 attempts, waiting roughly 1s and then 2s between them, with the wait doubling each time) and a final failure state. Track retry attempts in a dedicated table. For critical payments (like subscriptions), consider a delayed retry queue that retries over hours with escalating backoff.

A trap I see often: teams implement retry in the application layer but forget to set a timeout on the upstream call. If the gateway hangs, the retry loop queues up waiting threads, eventually exhausting the connection pool. Always pair retries with a circuit breaker that trips after consecutive failures and allows the system to recover.
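A minimal circuit breaker sketch, assuming a simple consecutive-failure threshold and a fixed cool-down. `CircuitBreakerSketch` and its parameters are hypothetical; libraries such as Resilience4j offer production-grade versions with more states and metrics.

```java
import java.util.function.LongSupplier;

/** Hypothetical circuit breaker: opens after N consecutive failures, retries after a cool-down. */
final class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injectable so behaviour can be checked without sleeping
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True when a call may go out: circuit closed, or cool-down elapsed (trial call). */
    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;
        }
        return clock.getAsLong() - openedAt >= openMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // (re)start the cool-down
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(3, 1_000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            breaker.recordFailure();
        }
        System.out.println(breaker.allowRequest()); // false: calls are rejected during the cool-down
    }
}
```

The caller checks `allowRequest()` before each gateway call and reports the outcome afterwards. This simplification lets every caller through once the cool-down has elapsed; a stricter half-open state would admit a single trial call.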

Consider the 'double timeout' problem: your payment gateway enforces a 10-second limit on its own processing, but your HTTP client waits 30 seconds. If the gateway answers at 15 seconds, it is already past its own limit: it may report a timeout even though the charge was committed upstream, and your client has held a connection for 15 seconds to learn nothing definite. Treat any timeout as an unknown outcome, not a failure. A retry with the same idempotency key is safe only while the gateway still remembers that key; if the retry could land outside the gateway's idempotency window, query the payment status first. Align timeout configurations between client and server, and log exact request and response timings so both sides' logs can be correlated.
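One way to keep the two timeouts aligned is to derive the client timeout from the gateway's documented limit instead of configuring them independently. The sketch below uses the JDK's `java.net.http` client; the 10-second gateway limit, the 2-second margin, the URL, and the `Idempotency-Key` header name are all assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical timeout configuration derived from the gateway's own limit. */
final class GatewayTimeouts {
    // Assumption: the gateway documents a 10-second processing limit
    static final Duration GATEWAY_LIMIT = Duration.ofSeconds(10);
    // Wait slightly longer than the gateway so its own answer can still arrive
    static final Duration CLIENT_TIMEOUT = GATEWAY_LIMIT.plusSeconds(2);

    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();
    }

    static HttpRequest chargeRequest(String idempotencyKey, String jsonBody) {
        // Placeholder URL; substitute the real gateway endpoint
        return HttpRequest.newBuilder(URI.create("https://gateway.example.com/v1/charges"))
                .timeout(CLIENT_TIMEOUT)
                .header("Content-Type", "application/json")
                .header("Idempotency-Key", idempotencyKey)
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = chargeRequest("key-1", "{\"amount_cents\":2900}");
        System.out.println(request.timeout().orElseThrow()); // prints PT12S
    }
}
```

Nothing is sent here; the sketch only builds the client and request so the timeout relationship is explicit in one place.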

Another pattern: for time-sensitive operations like pre-auth expiry, you need aggressive retries with short backoff. If you wait too long, the authorisation expires and you lose the hold on the customer's funds. In that case, use a backoff of 100ms, 200ms, 400ms, and fail fast. Then run a background job to re-authorise if needed.

io/thecodeforge/payment/RetryHandler.javaJAVA

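import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeoutException;

// Assumes PaymentClient (whose charge method declares TimeoutException), DeadLetterQueue,
// PaymentRequest and PaymentResult exist in this package
public class RetryHandler {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;

    private final PaymentClient paymentClient;
    private final DeadLetterQueue deadLetterQueue;

    public RetryHandler(PaymentClient paymentClient, DeadLetterQueue deadLetterQueue) {
        this.paymentClient = paymentClient;
        this.deadLetterQueue = deadLetterQueue;
    }

    public PaymentResult executeWithRetry(String idempotencyKey, PaymentRequest request) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                // Idempotency key stays the same across retries
                return paymentClient.charge(idempotencyKey, request);
            } catch (TimeoutException e) {
                if (attempt == MAX_RETRIES) {
                    break; // out of attempts
                }
                // Exponential backoff (1s, 2s, ...) scaled by a random factor in [0.5, 1.5)
                double jitter = 0.5 + ThreadLocalRandom.current().nextDouble();
                long delay = (long) (BASE_DELAY_MS * Math.pow(2, attempt - 1) * jitter);
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    break;
                }
            }
        }
        // A timeout does not prove the charge failed, so hand it to manual review
        deadLetterQueue.enqueue(idempotencyKey, request);
        return new PaymentResult("FAILED", "Retries exhausted or interrupted");
    }
}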

Output

Retry loop with exponential backoff and jitter. On final failure, enqueue to dead-letter queue for manual reconciliation.

⚠ Idempotency + Retry = Safety

Without idempotency, a retry can result in multiple charges. Always attach the same idempotency key to every retry attempt. The payment provider can then deduplicate, provided it supports idempotency keys and the retry arrives within its retention window; verify both rather than assuming them. The dead-letter queue is your last line of defence: monitor its depth and alert if it grows beyond a threshold.

📊 Production Insight

A SaaS company's payment integration had no retry limit – it retried indefinitely with a fixed 5-second delay.

When the payment gateway went down for 30 minutes, millions of requests piled up.

Upon recovery, the backlog caused a DDoS-like spike, taking down the gateway again.

Rule: cap retries, use exponential backoff, and implement a circuit breaker to back off during extended failures.

🎯 Key Takeaway

Retries are safe only when paired with idempotency.

Always cap retries, add jitter, and use a circuit breaker.

Rule: if you're not tracking retry attempts, you're one outage away from a spike that takes you down.

Retry Strategy Selection

If: Operation is idempotent and safe to replay
→ Use: Use automatic retries with exponential backoff and jitter (3–5 attempts).

If: Operation is not idempotent
→ Use: Do NOT retry automatically – throw an error and rely on manual intervention or a compensating transaction.

If: Payment is time-sensitive (e.g., pre-auth expires)
→ Use: Use aggressive retry with short backoff (100ms, 200ms, 400ms) and fail fast.

Reconciliation and Double-Spend Prevention

Reconciliation is the process of comparing your internal ledger with external statements (bank, gateway, network) to catch discrepancies. It's your safety net against silent data loss, double charges, or missing credits. Run reconciliation periodically (daily or hourly depending on volume). For each transaction, compare status, amount, currency, timestamps. Flag any mismatch for manual review.

Double-spend prevention goes further: ensure that a given payment instruction (e.g., a specific invoice) is fulfilled exactly once. Use a unique constraint on the combination of (merchant_id, payment_reference). Combined with idempotency, this prevents any chance of creating two debits for the same order.

Don't limit reconciliation to daily batches. For high-value payments, implement near-real-time reconciliation using webhooks from the gateway. The webhook acts as an async callback that your system processes immediately. Compare the webhook payload with the ledger entry using the gateway transaction ID. If they don't match, trigger an alert. This cuts detection time from hours to seconds.
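A rough sketch of the webhook comparison described above. The record shapes and the `WebhookReconciler` name are assumptions; a real webhook payload would first need signature verification and parsing, which are omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical matcher: compares one gateway webhook against our ledger record. */
final class WebhookReconciler {
    record LedgerRecord(String gatewayTxnId, long amountCents, String currency, String status) {}

    record WebhookEvent(String gatewayTxnId, long amountCents, String currency, String status) {}

    /** Returns a description of every mismatch; an empty list means the two sides agree. */
    static List<String> mismatches(WebhookEvent event, Map<String, LedgerRecord> ledgerByTxnId) {
        List<String> problems = new ArrayList<>();
        LedgerRecord ours = ledgerByTxnId.get(event.gatewayTxnId());
        if (ours == null) {
            problems.add("no ledger entry for " + event.gatewayTxnId());
            return problems;
        }
        if (ours.amountCents() != event.amountCents()) {
            problems.add("amount: ours=" + ours.amountCents() + " gateway=" + event.amountCents());
        }
        if (!ours.currency().equals(event.currency())) {
            problems.add("currency: ours=" + ours.currency() + " gateway=" + event.currency());
        }
        if (!ours.status().equals(event.status())) {
            problems.add("status: ours=" + ours.status() + " gateway=" + event.status());
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, LedgerRecord> ledger =
                Map.of("txn-1", new LedgerRecord("txn-1", 2_900, "USD", "CAPTURED"));
        System.out.println(mismatches(new WebhookEvent("txn-1", 2_900, "USD", "CAPTURED"), ledger));
        System.out.println(mismatches(new WebhookEvent("txn-1", 3_900, "USD", "CAPTURED"), ledger));
    }
}
```

Any non-empty result would feed the alert or the manual review queue, keyed by the gateway transaction ID.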

Reconciliation is also your best defence against bank-level errors. Banks sometimes batch settlements incorrectly, or apply fees that your system didn't anticipate. Automated reconciliation that compares line items (transaction ID, amount, currency, fee) will flag these mismatches. Build a dashboard that shows the 'health score' of reconciliation for the last 7 days — green if no unresolved mismatches, red if any discrepancy exceeds $X. This gives you a single pane of glass for financial integrity.

One more thing: reconciliation should be a first-class feature, not an afterthought. Start implementing it from your first payment in production. If you wait until you have thousands of transactions, the cleanup effort is enormous. Automate the matching logic and build a manual review queue for edge cases. The operations team's ability to reconcile quickly is a direct measure of your system's reliability.

io/thecodeforge/payment/reconciliation.sqlSQL

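-- Sample reconciliation query to find mismatches (PostgreSQL syntax)
SELECT
    p.id AS payment_id,
    g.txn_id AS gateway_txn_id,
    p.amount AS our_amount,
    g.amount AS gateway_amount,
    p.status AS our_status,
    g.status AS gateway_status
FROM payments p
-- FULL OUTER JOIN keeps rows that exist on only one side;
-- an inner JOIN would silently drop them
FULL OUTER JOIN gateway_transactions g ON p.gateway_txn_id = g.txn_id
WHERE p.amount IS DISTINCT FROM g.amount   -- NULL-safe comparison
   OR p.status IS DISTINCT FROM g.status
   OR p.id IS NULL        -- gateway has it, we do not
   OR g.txn_id IS NULL;   -- we have it, gateway does not

-- Double-spend prevention: unique constraint on merchant payment reference
ALTER TABLE payments
ADD CONSTRAINT uq_merchant_ref UNIQUE (merchant_id, merchant_payment_reference);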

Output

Reconciliation query to find amount/status mismatches. Unique constraint prevents duplicate payments for the same invoice.

Mental Model

Reconciliation as a Safety Net

Reconciliation is like balancing a checkbook – you compare what you think happened with what the bank says happened.

Run reconciliation regularly (daily minimum).
Automate flagging of discrepancies for human review.
For high-value transactions, consider near-real-time reconciliation via webhook callbacks.
Double-spend prevention via unique constraints is a hard guarantee; idempotency is a soft guarantee.
Automate reconciliation discrepancy handling with a retry queue for transient mismatches and a manual review queue for permanent ones.

📊 Production Insight

A fintech startup lost track of a batch of transactions during a database migration.

They skipped reconciliation for two weeks.

When they finally ran it, they found 47 duplicate charges and 12 missing credits worth $11,000.

Reputation damage was far worse than the financial loss.

Rule: reconciliation is not optional – it's how you catch the things you don't know are wrong.

🎯 Key Takeaway

Reconciliation catches what your application logic misses.

Always run it, automate it, and escalate discrepancies.

Rule: the absence of a discrepancy report doesn't mean no discrepancy exists.

Reconciliation Strategy Selection

If: Low volume (< 1k transactions/day) and simple payment flows
→ Use: Run reconciliation daily. Manual review is acceptable.

If: High volume or complex payment flows (subscriptions, partial captures)
→ Use: Run reconciliation hourly. Automate alerts and implement a discrepancy dashboard.

If: Regulatory requirement (e.g., PSD2, SOX)
→ Use: Run reconciliation every 15 minutes with automated escalation and audit logs.

Production Architecture and Trade-offs

A production payment system is composed of several layers: an API gateway (handles auth, rate limiting, idempotency), a payment service (orchestrates the flow), a ledger service (append-only log), an external gateway adapter (communicates with banks/networks), and a reconciliation pipeline. The system must be stateless at the edge to scale horizontally, and stateful components (ledger, idempotency store) must use a strongly consistent database with failover.

Key trade-offs: consistency vs availability (choose consistency for financial data – CAP says partition tolerance is a given, so you sacrifice availability in favor of consistency). Use a database that supports transactions and strong consistency (e.g., PostgreSQL, CockroachDB). For high throughput, partition by account_id. Avoid distributed transactions if you can; prefer idempotent operations and eventual consistency with compensation (Saga pattern) for cross-service workflows.

Another pattern that works well in production is the 'outbox pattern'. When your payment service needs to emit an event (e.g., payment.completed) while updating the ledger, use a transactional outbox: write the event to an outbox table in the same database transaction as the ledger update. A separate process reads the outbox and publishes the events. This guarantees at-least-once delivery downstream: no committed event is lost, but an event can be published more than once, so consumers must be idempotent.
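The mechanics can be sketched in memory. `OutboxSketch` is a hypothetical stand-in: two lists play the role of the ledger and outbox tables, and a synchronized method plays the role of the single database transaction. A relay that crashes between publishing and deleting a row will publish it again, which is why the consumer below deduplicates by event ID.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory stand-in for a transactional outbox; a real one uses one DB transaction. */
final class OutboxSketch {
    record OutboxEvent(String eventId, String type, String paymentId) {}

    private final List<String> ledger = new ArrayList<>();       // stands in for the ledger table
    private final List<OutboxEvent> outbox = new ArrayList<>();  // stands in for the outbox table
    private final Set<String> consumerSeen = new HashSet<>();    // consumer-side dedupe by event ID
    private int handled = 0;

    /** Ledger row and outbox row are written together, mimicking one transaction. */
    synchronized void completePayment(String paymentId) {
        ledger.add(paymentId);
        outbox.add(new OutboxEvent("evt-" + paymentId, "payment.completed", paymentId));
    }

    /** Relay: publish each pending event, then delete it unless the simulated crash hits first. */
    synchronized void relayOnce(boolean crashBeforeDelete) {
        for (OutboxEvent event : new ArrayList<>(outbox)) {
            consume(event);
            if (!crashBeforeDelete) {
                outbox.remove(event);
            }
        }
    }

    /** Consumer: idempotent handler that ignores event IDs it has already processed. */
    private void consume(OutboxEvent event) {
        if (consumerSeen.add(event.eventId())) {
            handled++;
        }
    }

    synchronized int handledCount() { return handled; }

    synchronized int pendingCount() { return outbox.size(); }

    public static void main(String[] args) {
        OutboxSketch sketch = new OutboxSketch();
        sketch.completePayment("p1");
        sketch.relayOnce(true);   // published, but the relay "crashed" before deleting the row
        sketch.relayOnce(false);  // the same event is published a second time
        System.out.println(sketch.handledCount()); // prints 1: delivered twice, handled once
    }
}
```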

A pattern that's gaining traction in 2026 is the 'payment orchestrator as a state machine.' Instead of wiring complex sagas manually, define the payment lifecycle states (INITIATED, AUTHED, CAPTURED, REFUNDED, FAILED) and transitions as a finite state machine. Each state transition is guarded by preconditions and triggers compensation on failure. Tools like AWS Step Functions or Cadence/Temporal implement this natively, giving you visibility, retries, and timeouts without custom boilerplate.
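A minimal sketch of the lifecycle as a finite state machine, using the states named above. The transition table is an assumption made for illustration (for example, that only a captured payment can be refunded); a workflow engine would add persistence, timers, and compensation on top of the same idea.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical payment lifecycle guarded by an explicit transition table. */
final class PaymentStateMachine {
    enum State { INITIATED, AUTHED, CAPTURED, REFUNDED, FAILED }

    // Assumed transitions; adjust to the real business rules
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.INITIATED, EnumSet.of(State.AUTHED, State.FAILED));
        ALLOWED.put(State.AUTHED, EnumSet.of(State.CAPTURED, State.FAILED));
        ALLOWED.put(State.CAPTURED, EnumSet.of(State.REFUNDED));
        ALLOWED.put(State.REFUNDED, EnumSet.noneOf(State.class));
        ALLOWED.put(State.FAILED, EnumSet.noneOf(State.class));
    }

    private State current = State.INITIATED;

    State current() {
        return current;
    }

    /** Moves to the next state, or throws if the transition is not in the table. */
    void transitionTo(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not allowed");
        }
        current = next;
    }

    public static void main(String[] args) {
        PaymentStateMachine payment = new PaymentStateMachine();
        payment.transitionTo(State.AUTHED);
        payment.transitionTo(State.CAPTURED);
        System.out.println(payment.current()); // prints CAPTURED
    }
}
```

Because illegal transitions throw, a duplicate or out-of-order event (such as a capture arriving for a FAILED payment) surfaces as an error instead of silently corrupting state.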

Don't forget monitoring. Payment systems need specific metrics: idempotency hit rate (how many retries are deduplicated), retry rate, reconciliation discrepancy count, gateway latency percentiles. Set up alerts for any deviation from baseline. A sudden spike in retries often signals a gateway issue. A growing reconciliation queue means a data integrity problem that needs immediate investigation.

io/thecodeforge/payment/PaymentOrchestrator.javaJAVA

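// Assumes IdempotencyService (getResult/storeResult), LedgerService, GatewayClient,
// PaymentRequest, PaymentResult and PaymentException exist in this package
public class PaymentOrchestrator {
    private final IdempotencyService idempotencyService;
    private final LedgerService ledgerService;
    private final GatewayClient gatewayClient;

    public PaymentOrchestrator(IdempotencyService idempotencyService,
                               LedgerService ledgerService,
                               GatewayClient gatewayClient) {
        this.idempotencyService = idempotencyService;
        this.ledgerService = ledgerService;
        this.gatewayClient = gatewayClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // 1. Replay a stored result. This read-then-write check is not atomic, so
        //    concurrent duplicates still need an atomic claim on the key.
        PaymentResult cached = idempotencyService.getResult(request.idempotencyKey());
        if (cached != null) return cached;

        // 2. Run the saga; progress is tracked outside the try so the catch can compensate
        String authCode = null;
        boolean ledgerWritten = false;
        boolean captured = false;
        try {
            // Step 1: Reserve funds in gateway
            authCode = gatewayClient.authorize(request.amount(), request.currency());
            // Step 2: Record ledger entry
            ledgerService.recordDebit(request.accountId(), request.amount(), request.currency(), request.idempotencyKey());
            ledgerWritten = true;
            // Step 3: Capture (only after ledger is safe)
            gatewayClient.capture(authCode);
            captured = true;
            // Step 4: Update idempotency store
            PaymentResult result = new PaymentResult("SUCCESS", authCode);
            idempotencyService.storeResult(request.idempotencyKey(), result);
            return result;
        } catch (Exception e) {
            if (captured) {
                // Money has moved: do not void or record FAILED; leave it to reconciliation
                throw new PaymentException("Captured but result not stored", e);
            }
            if (ledgerWritten) {
                // recordReversal is a hypothetical method that appends an offsetting CREDIT
                ledgerService.recordReversal(request.accountId(), request.amount(), request.currency(), request.idempotencyKey());
            }
            if (authCode != null) {
                gatewayClient.voidAuthorization(authCode); // release the hold
            }
            idempotencyService.storeResult(request.idempotencyKey(), new PaymentResult("FAILED", e.getMessage()));
            throw new PaymentException("Payment failed", e);
        }
    }
}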

Output

Saga pattern: each step records an effect; if one fails, compensating actions undo previous steps. Idempotency key is used throughout.

🔥Architecture Note

Choose a database that supports transactions and strong consistency for the ledger and idempotency store. PostgreSQL is a solid choice. Avoid eventually consistent stores (like DynamoDB in default mode) for financial data unless you understand the trade-offs and implement reconciliation. For global deployments, consider CockroachDB, which offers PostgreSQL wire compatibility and strong consistency across regions. Expect some application changes, such as retrying transactions that abort under contention.

📊 Production Insight

A payment team used DynamoDB with eventual consistency for their ledger.

During a region failover, they lost several committed payments because the last write was not replicated.

Balances went out of sync, and they had to run a full reconciliation manually.

Rule: for financial data, choose strong consistency over availability – your users will forgive a brief outage but not a lost transaction.

🎯 Key Takeaway

Payments demand strong consistency – choose your database accordingly.

Architect for idempotency at every layer and use sagas for cross-service workflows.

Rule: in payment systems, consistency beats availability when you must choose.

Database Selection for Payments

If: You need ACID transactions and strong consistency
→ Use: PostgreSQL or CockroachDB. Partition by account_id if throughput demands.

If: You need global multi-region active-active
→ Use: Google Spanner or CockroachDB. Both provide strong consistency across regions.

If: You are processing non-critical payments (e.g., loyalty points)
→ Use: Eventually consistent stores like DynamoDB are acceptable with good reconciliation.

Testing Payment Systems: Integration, Contract & Chaos

Testing a payment system is different from testing a typical CRUD app. You can't just test the happy path — you need to verify behaviour under network failures, duplicate requests, and unexpected gateway responses. Three testing strategies are essential: integration tests that run against a sandbox gateway, contract tests that verify your API agreement with the gateway, and chaos experiments that simulate real-world failures.

Integration tests should cover the full payment flow: authorise, capture, refund, void. Use a test gateway like Stripe's test mode or a mock server. Each test must include a unique idempotency key to avoid state contamination. Run these tests in a dedicated environment with its own ledger database.

Contract tests (e.g., using Pact) verify that your payment service and the gateway agree on the API contract. When the gateway changes its response format, a contract verified against the gateway (or its sandbox) breaks before you deploy to production. This catches mismatches like field name changes or new required parameters.

Chaos engineering for payments: deliberately inject failures — gateway timeouts, slow responses, duplicate requests, invalid responses. Observe how your system behaves. Does it double-charge? Does it lose transactions? Does it handle the dead-letter queue correctly? Run these experiments in a staging environment that mirrors production load. A weekly chaos day helped one team discover that their circuit breaker reset too aggressively, causing repeated gateway overloads.

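io/thecodeforge/payment/PaymentContractTest.java · JAVA

// Imports assume pact-jvm 4.x package names and JUnit 5
import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import static au.com.dius.pact.consumer.dsl.LambdaDsl.newJsonBody;
import static org.junit.jupiter.api.Assertions.assertEquals;

@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "PaymentGateway") // provider name is an assumption
class PaymentContractTest {

    @Pact(provider = "PaymentGateway", consumer = "PaymentService")
    public RequestResponsePact createPact(PactDslWithProvider builder) {
        return builder
            .given("a valid charge request")
            .uponReceiving("a charge with idempotency key")
                .path("/charge") // assumed endpoint path
                .method("POST")
                .headers("Idempotency-Key", "test-key-123")
                .body(newJsonBody(body -> {
                    body.stringType("amount", "2000");
                    body.stringType("currency", "USD");
                }).build())
            .willRespondWith()
                .status(200)
                .body(newJsonBody(body -> {
                    body.stringType("id", "txn_123");
                    body.stringType("status", "captured");
                }).build())
            .toPact();
    }

    @Test
    @PactTestFor(pactMethod = "createPact")
    void testCharge(MockServer mockServer) {
        // Assumes PaymentClient accepts a base URL; point it at the Pact mock server
        PaymentClient paymentClient = new PaymentClient(mockServer.getUrl());
        PaymentResponse response = paymentClient.charge("test-key-123", new PaymentRequest(2000, "USD"));
        assertEquals("captured", response.getStatus());
    }
}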

Output

Consumer-driven contract test for the payment gateway charge endpoint. Ensures the gateway's response matches expectations.
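
The chaos experiments described in this section can also be rehearsed in a plain unit test before they reach staging. The sketch below is one possible shape, not a real gateway client: `Gateway`, `InMemoryGateway`, and `TimeoutAfterProcessing` are hypothetical names. The decorator lets the charge go through and then drops the response, which is the failure that most often causes double charges.

```java
import java.util.HashMap;
import java.util.Map;

final class TimeoutChaosSketch {
    interface Gateway {
        String charge(String idempotencyKey, long amountMinor);
    }

    /** Hypothetical in-memory gateway that deduplicates by idempotency key. */
    static final class InMemoryGateway implements Gateway {
        private final Map<String, String> txnByKey = new HashMap<>();
        private int chargesExecuted = 0;

        @Override
        public String charge(String idempotencyKey, long amountMinor) {
            String existing = txnByKey.get(idempotencyKey);
            if (existing != null) return existing; // replay: no second charge
            chargesExecuted++;
            String txnId = "txn_" + chargesExecuted;
            txnByKey.put(idempotencyKey, txnId);
            return txnId;
        }

        int chargesExecuted() {
            return chargesExecuted;
        }
    }

    /** Chaos decorator: the charge happens, then the response is dropped N times. */
    static final class TimeoutAfterProcessing implements Gateway {
        private final Gateway delegate;
        private int timeoutsLeft;

        TimeoutAfterProcessing(Gateway delegate, int timeouts) {
            this.delegate = delegate;
            this.timeoutsLeft = timeouts;
        }

        @Override
        public String charge(String idempotencyKey, long amountMinor) {
            String txnId = delegate.charge(idempotencyKey, amountMinor);
            if (timeoutsLeft > 0) {
                timeoutsLeft--;
                throw new IllegalStateException("simulated timeout after processing");
            }
            return txnId;
        }
    }

    public static void main(String[] args) {
        InMemoryGateway real = new InMemoryGateway();
        Gateway chaos = new TimeoutAfterProcessing(real, 1);
        try {
            chaos.charge("order-42", 2000);
        } catch (IllegalStateException e) {
            System.out.println("first attempt: " + e.getMessage());
        }
        String txnId = chaos.charge("order-42", 2000); // retry with the same key
        System.out.println(txnId + " charges=" + real.chargesExecuted()); // txn_1 charges=1
    }
}
```

The assertion that matters is the charge count: after one timeout and one retry with the same key it must still be 1. Swapping in a gateway fake that does not deduplicate shows how quickly the same test exposes a double charge.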

Mental Model

Payment Testing Layers

Test your payment system like you're testing a nuclear reactor: verify every failure path, not just the successful one.

Unit tests: validate business logic (e.g., compensation calculations) in isolation.
Integration tests: run against sandbox gateway with unique idempotency keys.
Contract tests: catch API contract changes before they break production.
Chaos tests: simulate gateway timeouts, slow responses, and duplicate requests.
Production smoke tests: run a small real transaction after every deploy.

📊 Production Insight

A team deployed a new gateway integration without contract tests.

The gateway changed a response field name from 'status' to 'state' without prior notice.

The integration silently failed for 4 hours, losing 2000 transactions.

Rule: always have consumer-driven contract tests for external payment dependencies.

🎯 Key Takeaway

Payment systems require multi-layered testing beyond unit tests.

Contract tests protect against API changes; chaos tests verify resilience.

Rule: the test that catches a payment bug after deploy is too late.

Payment Testing Priority

If: You are integrating with a new payment gateway
→ Use: Write contract tests first, then integration tests with sandbox.

If: You have existing production traffic
→ Use: Implement production smoke tests that run one real transaction per deploy.

If: You want to improve resilience
→ Use: Run weekly chaos experiments simulating gateway failures and duplicate requests.

Payment Flow – From Buy Button to Bank Ledger

You click Buy. A payment event fires. That event hits your payment service, which stores it in a database before it does anything else. Why? Because you need durability. If the service crashes after receiving the event but before persisting it, you lose the order. That means lost revenue, angry customers, and a P0 incident. Once the event is safe, the payment service checks if it’s a single order or split across multiple sellers. For each sub-order, it creates a payment order record. The payment executor then calls an external PSP to process the credit card. Only after the PSP confirms success does the payment service update wallets and call the ledger. Every step follows the same rule: persist before you act. The ledger gets appended with a permanent record of the transaction. This is the canonical payment flow. Deviate from it, and reconciliation will haunt your on-call rotation.

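PaymentFlowPipeline.java · JAVA

// io.thecodeforge
// Real-world Payment Flow Pipeline
// Note: paymentRepo, paymentOrderRepo, pspClient, walletService, ledgerService and
// splitByMerchant(...) are collaborators assumed to be defined elsewhere in this class.
import java.util.List;
import java.util.UUID;

public class PaymentFlowPipeline {
  public PaymentResult process(PaymentEvent event) {
    PaymentEvent saved = paymentRepo.save(event); // step 1: persist before acting
    if (saved == null) throw new PaymentPersistenceException();

    boolean allCharged = true;
    List<PaymentOrder> orders = splitByMerchant(saved);
    for (PaymentOrder order : orders) {
      PaymentOrder stored = paymentOrderRepo.save(order); // step 2: persist each sub-order
      PaymentResponse pspResponse = pspClient.charge(stored.getCardToken(), stored.getAmount());
      if (pspResponse.isSuccess()) {
        // step 3: credit the wallet and append to the ledger only after PSP success
        walletService.credit(stored.getMerchantId(), stored.getAmount());
        ledgerService.append(new LedgerEntry(UUID.randomUUID(), stored));
      } else {
        allCharged = false; // retry or dead-letter this sub-order
      }
    }
    // Report failure if any sub-order was not charged (PaymentResult.FAILED is an assumed constant)
    return allCharged ? PaymentResult.SUCCESS : PaymentResult.FAILED;
  }
}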

Output

Event persisted -> Orders split -> PSP charged -> Wallet credited -> Ledger appended

⚠ Production Trap:

Do not update the wallet before the PSP confirms. I’ve seen engineers call walletService.credit() optimistically. If the PSP then fails, you’ve created ghost money. Reconciliation won’t save you from a support ticket avalanche.

🎯 Key Takeaway

Persist the event before you do anything else. The payment flow is a series of durable, sequential writes—not optimistic updates.

Settlement Files – Why Your Nightly Batch Is a War Room

Every night, the PSP or bank sends a settlement file. It’s a flat file, CSV or fixed-width, containing every transaction that settled in the bank account that day. Your job is to match each row against your ledger. This is where you discover discrepancies: double charges, missing refunds, or, worst of all, phantom transactions. The settlement file is the bank’s truth. If your ledger doesn’t match it, you have a money leak. Automate this matching. Load the file into a staging table, join on transaction ID, compare the amounts, and flag every row where the amounts don’t match or where a row doesn’t exist in your ledger. Then alert the finance team. Don’t try to autocorrect. Human approval is cheaper than explaining a $100k loss to the CFO.

SettlementReconciler.java · JAVA

// io.thecodeforge
// Nightly Settlement Reconciliation Batch
// Note: ledgerRepo, pagerDuty and alertFinance(...) are assumed to be defined elsewhere in this class.
import java.io.*;
import java.math.BigDecimal;

public class SettlementReconciler {
  public void reconcile(File settlementCsv) {
    try (BufferedReader br = new BufferedReader(new FileReader(settlementCsv))) {
      br.lines().skip(1).forEach(line -> { // skip the header row
        // Naive split: assumes no quoted fields that contain commas
        String[] cols = line.split(",");
        if (cols.length < 4) {
          alertFinance("MALFORMED_ROW: " + line);
          return;
        }
        String txnId = cols[0].trim();
        BigDecimal bankAmount;
        try {
          bankAmount = new BigDecimal(cols[3].trim());
        } catch (NumberFormatException e) {
          alertFinance("MALFORMED_AMOUNT: txn=" + txnId);
          return;
        }

        LedgerEntry entry = ledgerRepo.findByExternalTxnId(txnId);
        if (entry == null) {
          alertFinance("UNMATCHED_TXN: " + txnId);
        } else if (entry.getAmount().compareTo(bankAmount) != 0) {
          alertFinance("AMOUNT_MISMATCH: txn=" + txnId + " ledger=" + entry.getAmount() + " bank=" + bankAmount);
        }
      });
    } catch (IOException | UncheckedIOException e) {
      // page immediately: a missing or unreadable settlement file is a critical alert
      pagerDuty.alert("SETTLEMENT_FILE_MISSING");
    }
  }
}

Output

MATCHED: 9,841 txns | UNMATCHED: 3 txns | AMOUNT_MISMATCH: 1 txn

⚠ Production Trap:

Never autocorrect discrepancies. A script that silently adjusts ledger entries is a fraud vector. Flag, alert, and let a human with access control approve the fix. Your SOC 2 audit will thank you.

🎯 Key Takeaway

Settlement files are the bank’s source of truth. Automated reconciliation is mandatory; autocorrection is a security risk.
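
Matching has to run in both directions: a bank row with no ledger entry is a phantom transaction, and a ledger entry with no bank row may be a missing refund or a payment that never settled. The sketch below shows the comparison on in-memory maps. The class name and the `MISSING_IN_LEDGER` and `MISSING_IN_SETTLEMENT` labels are assumptions for illustration; a real job would run the same logic as a full outer join on the staging table.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class TwoWayReconciliationSketch {
    /** Compares ledger and bank amounts keyed by transaction ID and returns findings. */
    static List<String> reconcile(Map<String, BigDecimal> ledger, Map<String, BigDecimal> bank) {
        List<String> findings = new ArrayList<>();
        // Union of IDs from both sides, sorted for a stable report order.
        TreeSet<String> allIds = new TreeSet<>(ledger.keySet());
        allIds.addAll(bank.keySet());
        for (String id : allIds) {
            BigDecimal ours = ledger.get(id);
            BigDecimal theirs = bank.get(id);
            if (ours == null) {
                findings.add("MISSING_IN_LEDGER: " + id);
            } else if (theirs == null) {
                findings.add("MISSING_IN_SETTLEMENT: " + id);
            } else if (ours.compareTo(theirs) != 0) { // compareTo ignores scale: 20.0 equals 20.00
                findings.add("AMOUNT_MISMATCH: " + id);
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        Map<String, BigDecimal> ledger = Map.of(
            "txn_1", new BigDecimal("20.00"),
            "txn_2", new BigDecimal("15.50"),
            "txn_3", new BigDecimal("9.99"));
        Map<String, BigDecimal> bank = Map.of(
            "txn_1", new BigDecimal("20.0"),
            "txn_2", new BigDecimal("15.05"),
            "txn_4", new BigDecimal("5.00"));
        // Prints: AMOUNT_MISMATCH: txn_2, MISSING_IN_SETTLEMENT: txn_3, MISSING_IN_LEDGER: txn_4
        reconcile(ledger, bank).forEach(System.out::println);
    }
}
```

A `MISSING_IN_SETTLEMENT` finding is not always an error, because a payment captured late in the day may settle in the next file. Treat it as a candidate for review rather than a confirmed loss.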

● Production incident · POST-MORTEM · severity: high

Double Charge During Black Friday: Client-Generated Idempotency Key Collision

Symptom

Customers saw two separate charges on their credit card statements for a single purchase. Customer service spent weeks resolving disputes. Chargeback fees increased by 300%.

Assumption

The team assumed the payment gateway's timeout meant the transaction had not gone through, so retrying the capture was safe. They didn't realise the gateway had already processed the first request and was just slow to respond.

Root cause

The capture endpoint lacked idempotency. The payment service retried the capture without an idempotency key when it received a timeout from the gateway. The gateway processed both the original and the retry, resulting in two successful captures for the same authorisation. Additionally, the client SDK generated idempotency keys using a timestamp-based scheme that produced collisions under high concurrency.

Fix

Added an idempotency key to the capture call. The key was derived from the authorisation ID plus a request ID that is generated once per capture and reused on every retry, so the gateway's idempotency store deduplicated the retry. Also implemented a retry limit and a dead-letter queue for manual review of truly ambiguous failures. Replaced the timestamp-based key generation with cryptographically random UUIDs generated server-side.

Key lesson

Always attach an idempotency key to every payment state transition — authorize, capture, refund are all separate operations.
Timeouts don't mean the request didn't go through; they mean you don't know. Idempotency is the only safe way to retry.
Never retry a payment operation without an idempotency key, even on timeouts.
Client-generated idempotency keys are dangerous — always validate randomness and entropy server-side.
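
The fix from this incident can be sketched as follows. It is an illustration under assumptions, not the team's actual code: the `capture:` key format, the method names, and the in-memory dead-letter list are hypothetical. The important properties are that the request ID is generated once on the server and that every retry sends the same key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;
import java.util.function.UnaryOperator;

final class CaptureRetrySketch {
    /** Generated once on the server when the capture is first requested, then persisted. */
    static String newRequestId() {
        return UUID.randomUUID().toString();
    }

    /** The same authorisation and request ID always yield the same key. */
    static String captureKey(String authorisationId, String requestId) {
        return "capture:" + authorisationId + ":" + requestId;
    }

    /**
     * Calls the gateway with an unchanged key, at most maxAttempts times.
     * When every attempt fails, the key is parked for manual review.
     */
    static Optional<String> captureWithRetry(UnaryOperator<String> gatewayCapture, String key,
                                             int maxAttempts, List<String> deadLetters) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Optional.of(gatewayCapture.apply(key));
            } catch (RuntimeException timeoutOrError) {
                // Outcome unknown. Retrying is safe only because the key is unchanged.
            }
        }
        deadLetters.add(key);
        return Optional.empty();
    }

    public static void main(String[] args) {
        String key = captureKey("auth_789", newRequestId());
        List<String> deadLetters = new ArrayList<>();
        int[] calls = {0};
        // Hypothetical gateway call: times out once, then succeeds.
        Optional<String> result = captureWithRetry(k -> {
            if (calls[0]++ == 0) throw new RuntimeException("timeout");
            return "captured:" + k;
        }, key, 3, deadLetters);
        System.out.println(result.isPresent() + " deadLetters=" + deadLetters.size()); // true deadLetters=0
    }
}
```

Production code would add backoff between attempts and persist the dead-letter entries. The sketch leaves both out to keep the key handling visible.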

Production debug guide

Diagnose and resolve common payment failures in production. (5 entries)

Symptom · 01: Payment failed with 'duplicate idempotency key' error
→ Fix: Check the idempotency store to see if the key was already used. If it's a legitimate retry, return the cached response. If it's a genuine duplicate from a different payment, generate a new key.

Symptom · 02: Payment appears successful but user did not receive confirmation
→ Fix: Check the ledger for the debit entry. Then check the gateway for the transaction. If the gateway shows success but ledger is missing, it's a write failure. Reconcile using the idempotency key as the correlation ID.

Symptom · 03: Reconciliation report shows mismatched amounts
→ Fix: Compare internal 'payments' table with gateway's transaction report. Look for partial captures, refunds, or time zone differences. Flag for manual review if the difference exceeds a threshold.

Symptom · 04: Gateway returns 402 (payment_required) but we already captured
→ Fix: Check the gateway transaction status using the gateway transaction ID. A 402 on a subsequent authorisation does not invalidate a prior capture. Verify the capture status in the gateway dashboard.

Symptom · 05: Multiple payments created for the same order
→ Fix: Check for missing unique constraint on (merchant_id, order_reference). If the constraint exists, verify idempotency key uniqueness. Run a reconciliation query to find duplicates and issue refunds.

★ Payment System Debug Cheat Sheet

Quick commands to diagnose and fix common payment issues in production.

Idempotency key collision

Immediate action

Check idempotency store for the key.

Commands

SELECT * FROM idempotency_keys WHERE key = '{key}'

curl -v -H 'Idempotency-Key: {key}' https://api.payments/charge

Fix now

If the key was accidentally reused, generate a new server-side UUID and retry.

Payment Processing Patterns Compared

Pattern	Use Case	Key Trade-off
Idempotency Key	Prevents duplicate processing of the same request	Requires client cooperation or server-side key generation; storage overhead
Append-Only Ledger	Immutable audit trail, regulatory compliance	Balance queries require aggregation; may need caching for low latency
Exponential Backoff Retry	Handles transient failures gracefully	Must pair with idempotency; careful with jitter to avoid thundering herd
Saga Pattern	Cross-service transaction without distributed transaction	Complexity of compensating actions; eventual consistency window
Reconciliation	Detects silent data loss or inconsistency	Manual effort if not automated; batch processing latency
Outbox Pattern	Guarantees event delivery after ledger write	Adds complexity; requires background job for publishing

⚙ Quick Reference

9 commands from this guide

File	Command / Code	Purpose
ForgeExample.java	public class ForgeExample {	What is Design a Payment System?
io/thecodeforge/payment/IdempotencyService.java	public class IdempotencyService {	Idempotency
io/thecodeforge/payment/ledger.sql	CREATE TABLE ledger_entries (	Ledger Design
io/thecodeforge/payment/RetryHandler.java	public class RetryHandler {	Retry Strategies and Failure Handling
io/thecodeforge/payment/reconciliation.sql	SELECT	Reconciliation and Double-Spend Prevention
io/thecodeforge/payment/PaymentOrchestrator.java	public class PaymentOrchestrator {	Production Architecture and Trade-offs
io/thecodeforge/payment/PaymentContractTest.java	@ExtendWith(PactConsumerTestExt.class)	Testing Payment Systems
PaymentFlowPipeline.java	public class PaymentFlowPipeline {	Payment Flow – From Buy Button to Bank Ledger
SettlementReconciler.java	public class SettlementReconciler {	Settlement Files – Why Your Nightly Batch Is a War Room

Key takeaways

Idempotency keys are the single most important safety net against duplicate charges. Never retry without one.

Append-only ledgers provide an immutable audit trail. Balance is always derived, never stored directly.

Pair retries with exponential backoff, jitter, and a circuit breaker. Unlimited retries cause cascading outages.

Reconciliation must be automated from day one. It catches silent errors that your application logic misses.

Choose strong consistency over availability for financial data. Lost transactions are worse than brief downtime.

Test payment systems with chaos experiments: simulate timeouts, duplicates, and unexpected gateway responses.

Webhook idempotency is often forgotten, but double-processing a webhook can cause double refunds or duplicate ledger entries.

Common mistakes to avoid

7 patterns

Storing balance as a mutable column

Symptom

During failure recovery or race conditions, balance becomes incorrect. Duplicate updates or lost writes go undetected until a user complains or audit fails.

Fix

Use an append-only ledger where balance is derived from summed entries. Use a materialized cache only for reads, and rebuild from ledger on inconsistency.
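
A minimal sketch of a derived balance, assuming signed amounts (positive for credits, negative for debits) and an in-memory list standing in for the ledger table. The class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

final class AppendOnlyLedgerSketch {
    static final class Entry {
        final String accountId;
        final BigDecimal signedAmount; // positive = credit, negative = debit

        Entry(String accountId, BigDecimal signedAmount) {
            this.accountId = accountId;
            this.signedAmount = signedAmount;
        }
    }

    // Entries are only ever added, never updated or deleted.
    private final List<Entry> entries = new ArrayList<>();

    void append(String accountId, BigDecimal signedAmount) {
        entries.add(new Entry(accountId, signedAmount));
    }

    /** The balance is computed from the entries, never stored as its own column. */
    BigDecimal balanceOf(String accountId) {
        BigDecimal total = BigDecimal.ZERO;
        for (Entry e : entries) {
            if (e.accountId.equals(accountId)) {
                total = total.add(e.signedAmount);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        AppendOnlyLedgerSketch ledger = new AppendOnlyLedgerSketch();
        ledger.append("acct-1", new BigDecimal("100.00"));
        ledger.append("acct-1", new BigDecimal("-30.00"));
        // A mistake is fixed with a correcting entry, not by editing history.
        ledger.append("acct-1", new BigDecimal("5.00"));
        System.out.println(ledger.balanceOf("acct-1")); // 75.00
    }
}
```

In a database the same derivation is a SUM over the account's entries, and a cached balance can always be rebuilt from it.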

Retrying without idempotency

Symptom

Users get charged multiple times when network retry or timeout triggers the same request again. Customer support is flooded with dispute calls.

Fix

Always attach an idempotency key to payment requests. Store the result once and return the same for any subsequent request with the same key.

Using a single database for ledger and transactional processing without proper isolation

Symptom

Reads of balance during heavy write load see inconsistent state. Partial updates lead to incorrect display balances or failed validations.

Fix

Use read replicas for reporting and balance display, with a small acceptable staleness. For ledger writes, use the primary with strong consistency. Alternatively, separate the write and read models (CQRS).

Not implementing reconciliation from day one

Symptom

After a database migration or incident, you realise you have lost or duplicated transactions. No automated check exists to catch the issue; recovery requires lengthy manual audit.

Fix

Implement reconciliation pipelines before going to production. Start with daily batch matching against external payment provider reports. Automate alerts for mismatches.

Not handling idempotency key expiry correctly

Symptom

A retry arrives after the idempotency key has expired (e.g., after 24 hours). The system processes it as a new request, resulting in a double charge or duplicate entry.

Fix

Set the idempotency key TTL to cover the entire settlement lifecycle (7 days for cards). Use a background job to archive expired keys to cold storage for audit, but keep them in the active table until TTL expires. For extremely delayed retries, implement a manual review step.
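
One way to picture the expiry rules is a lookup with three outcomes instead of two. The sketch below is an assumption-laden illustration: the `Outcome` names, the in-memory maps, and the archive set standing in for cold storage are all hypothetical. The 7-day TTL is the value from the text.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class IdempotencyTtlSketch {
    enum Outcome { REPLAY_CACHED, NEEDS_MANUAL_REVIEW, NEW_REQUEST }

    static final Duration TTL = Duration.ofDays(7); // settlement lifecycle for cards, per the text

    private static final class Stored {
        final String result;
        final Instant expiresAt;

        Stored(String result, Instant expiresAt) {
            this.result = result;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Stored> active = new HashMap<>();
    private final Set<String> archivedKeys = new HashSet<>(); // stand-in for cold storage

    void store(String key, String result, Instant now) {
        active.put(key, new Stored(result, now.plus(TTL)));
    }

    /** Background job: moves only expired keys out of the active table. */
    void archiveExpired(Instant now) {
        active.entrySet().removeIf(e -> {
            boolean expired = !now.isBefore(e.getValue().expiresAt);
            if (expired) archivedKeys.add(e.getKey());
            return expired;
        });
    }

    Outcome classify(String key, Instant now) {
        Stored stored = active.get(key);
        if (stored != null && now.isBefore(stored.expiresAt)) return Outcome.REPLAY_CACHED;
        // A key seen before but past its TTL is never silently treated as new.
        if (stored != null || archivedKeys.contains(key)) return Outcome.NEEDS_MANUAL_REVIEW;
        return Outcome.NEW_REQUEST;
    }

    public static void main(String[] args) {
        IdempotencyTtlSketch store = new IdempotencyTtlSketch();
        Instant t0 = Instant.parse("2026-01-01T00:00:00Z");
        store.store("key-1", "txn_1", t0);
        System.out.println(store.classify("key-1", t0.plus(Duration.ofDays(1)))); // REPLAY_CACHED
        store.archiveExpired(t0.plus(Duration.ofDays(8)));
        System.out.println(store.classify("key-1", t0.plus(Duration.ofDays(8)))); // NEEDS_MANUAL_REVIEW
        System.out.println(store.classify("key-2", t0));                          // NEW_REQUEST
    }
}
```

Passing the current time in as a parameter keeps the expiry logic testable without waiting a week.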

Not testing against gateway sandbox before production deploy

Symptom

A new gateway integration passes unit tests but fails in production due to differences in response format, field names, or error handling.

Fix

Write integration tests that run against the gateway's sandbox environment. Include scenarios for timeouts, declined payments, and duplicate idempotency keys.

Ignoring webhook idempotency

Symptom

The same webhook event (e.g., payment.succeeded) is delivered multiple times, causing duplicate ledger entries or multiple refunds.

Fix

Store the webhook event ID with a unique constraint. Process each event at most once. Use idempotency keys for refund operations triggered by webhooks.
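
A minimal sketch of webhook deduplication, using an in-memory set as a stand-in for a table with a unique constraint on the event ID. The class name and the counter that represents the side effect are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class WebhookDedupSketch {
    // Stand-in for a table with a unique constraint on event_id.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger sideEffects = new AtomicInteger();

    /** Returns true if the event was processed, false if it was a duplicate delivery. */
    boolean handle(String eventId) {
        // add() returns false when the ID is already present, like a unique-constraint violation.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        sideEffects.incrementAndGet(); // placeholder for the ledger write or refund
        return true;
    }

    int sideEffectCount() {
        return sideEffects.get();
    }

    public static void main(String[] args) {
        WebhookDedupSketch handler = new WebhookDedupSketch();
        boolean first = handler.handle("evt_1");
        boolean second = handler.handle("evt_1"); // gateway redelivers the same event
        System.out.println(first + " " + second + " " + handler.sideEffectCount()); // true false 1
    }
}
```

With a real database, the event ID insert and the side effect belong in the same transaction. Otherwise a crash between the two leaves an event marked as processed that never took effect.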

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How would you design a payment system that guarantees exactly-once proce...

Q02 · SENIOR

What is the difference between idempotency and a unique constraint? When...

Q03 · SENIOR

How do you handle a case where a payment is authorised but the capture f...

Q04 · SENIOR

Explain how you would design a ledger that supports high throughput (10,...

Q01 of 04 · SENIOR

How would you design a payment system that guarantees exactly-once processing even with network retries?

ANSWER

Use an idempotency key on every payment request. The key is a UUID generated by the client or server. Store the key and the result in a database with a unique constraint. Before processing, check if the key exists. If it does, return the cached response. The unique constraint covers the race that the check alone cannot: when two requests with the same key arrive at the same time, only one insert succeeds and the other reads the stored result. This ensures that retries of the same request do not create duplicate charges. Also, ensure that all downstream operations (ledger, gateway) are idempotent or have their own idempotency mechanism.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

Last updated: July 27, 2026