API Keys Explained: How They Work and Why They Fail

API keys exposed in GitHub repos cost companies millions yearly.
Beginner-friendly: no prior System Design experience needed
In this tutorial, you'll learn:
  • An API key is a lookup token, not a cryptographic proof: whoever holds the string has the permission, which is why storage and transmission are everything
  • The logging trap kills you quietly: Sentry, Datadog, and similar tools will happily capture your Authorization header in error breadcrumbs unless you explicitly scrub them. Go check your existing error logs before finishing this article
  • One scoped key per service is the single highest-leverage change you can make. When (not if) a key leaks, scope isolation determines whether you have a five-minute fix or a three-hour incident
⚡ Quick Answer
Think of an API key like a loyalty card at a coffee shop. The barista doesn't know your name and doesn't check your ID. They just scan the card and know you're allowed to order, how many free drinks you have left, and whether you're a VIP. The card itself IS the permission. Lose it, and whoever finds it can order on your tab until you cancel it. That's it. That's an API key. It's a password-shaped permission slip that you hand to every service call instead of logging in each time.

A developer at a Y Combinator startup pushed to GitHub on a Friday afternoon. By Sunday, a bot had scraped their AWS API key from the commit history, spun up 47 GPU instances for crypto mining, and run up a $17,000 bill. The key had been in the code for exactly 11 minutes before the push. Eleven minutes. The bill took three months to dispute.

API keys are everywhere: every third-party service you integrate, every payment processor, every mapping library, every SMS gateway. They're the most common authentication mechanism in modern software, and they're also the most commonly mishandled. Not because developers are careless, but because nobody sits down and explains what these things actually are, how they flow through a system, and specifically what blows up when you treat them carelessly.

By the end of this, you'll know exactly what an API key is and isn't, how to generate and store one safely, how to pass it correctly in HTTP requests, what rate limiting and key rotation actually look like in practice, and, most critically, the exact mistakes that get people paged at 3am or handed a five-figure cloud bill. No handwaving. No 'just be careful with your keys.' Concrete mechanics, real failure modes, specific fixes.

What an API Key Actually Is (And What It Is Not)

Before you can protect an API key, you need to know what it's doing in the first place. Most explanations skip straight to 'keep it secret' without ever explaining the mechanism. That's why people make mistakes: they're following rules they don't understand.

An API is just a door into someone else's software. Stripe's API is a door into their payment system. The Google Maps API is a door into their mapping engine. You're not running their code. You're sending HTTP requests to their servers, and their servers do the work and send back a response. Simple.

The problem is: that door can't be wide open. Stripe needs to know which requests came from your account so they can bill you, rate-limit you, and lock you out if you do something sketchy. They can't ask you to type a username and password every single time your checkout page needs to verify a card, because that would happen dozens of times per second at scale. So instead, they give you a key: a long random string that you attach to every request. Their server sees the key, looks it up in their database, finds your account, and knows who's asking.

Here's the critical thing most juniors get wrong: an API key is NOT encryption. It doesn't scramble your data. It's not a token that proves who you are through math. It's purely a lookup mechanism: a secret identifier that maps to an account in someone else's database. That distinction matters enormously when you're deciding how to store and transmit it.

APIKeyFlowDiagram.systemdesign · PLAINTEXT
// io.thecodeforge - System Design tutorial
// Tracing a single API call from your app to a third-party service
// Scenario: Your e-commerce checkout calls Stripe to charge a card

// ─────────────────────────────────────────────────────────────
// STEP 1: Your checkout service builds an HTTP request
// ─────────────────────────────────────────────────────────────

POST https://api.stripe.com/v1/charges

Headers:
  Authorization: Bearer sk_live_4eC39HqLyjWDarjtT1zdp7dc   // <-- the API key
  Content-Type: application/x-www-form-urlencoded

Body:
  amount=2000          // $20.00 in cents
  currency=usd
  source=tok_visa      // tokenised card from Stripe.js

// ─────────────────────────────────────────────────────────────
// STEP 2: Stripe's server receives the request
// ─────────────────────────────────────────────────────────────

Stripe API Gateway:
  1. Extract key from Authorization header
     key = "sk_live_4eC39HqLyjWDarjtT1zdp7dc"

  2. Look up key in Stripe's internal key store
     SELECT account_id, permissions, rate_limit, is_active
     FROM api_keys
     WHERE key_hash = SHA256("sk_live_4eC39HqLyjWDarjtT1zdp7dc")
     // NOTE: the provider is assumed to store a HASH of your key, not the key itself
     // (illustrative; Stripe's internal schema is not public). With hashed storage,
     // a leaked key table does not reveal the raw keys.

  3. Key found → account_id = "acct_1A2B3C4D5E6F"
     is_active = true
     permissions = ["charges:write", "refunds:write"]
     rate_limit = 100 requests/second

  4. Check rate limit: current usage 23/100 req/sec → OK

  5. Process the charge against account acct_1A2B3C4D5E6F

// ─────────────────────────────────────────────────────────────
// STEP 3: Stripe responds
// ─────────────────────────────────────────────────────────────

HTTP 200 OK
{
  "id": "ch_3MqLiJKZ2eZvKYlo2T9UW2GX",
  "object": "charge",
  "amount": 2000,
  "status": "succeeded"
}

// ─────────────────────────────────────────────────────────────
// WHAT HAPPENS WITH A BAD KEY
// ─────────────────────────────────────────────────────────────

Stripe API Gateway (bad key scenario):
  1. Extract key: "sk_live_INVALIDKEYHERE"
  2. Hash and look up → no matching row in api_keys table
  3. Return immediately: no account check, no charge processing

HTTP 401 Unauthorized
{
  "error": {
    "code": "api_key_invalid",
    "message": "No such API key: 'sk_live_INVA...HERE'"
  }
}
▶ Output
// Successful charge:
HTTP 200 → { "id": "ch_3MqLiJKZ2eZvKYlo2T9UW2GX", "status": "succeeded" }

// Invalid key:
HTTP 401 → { "error": { "code": "api_key_invalid" } }

// Correct key, wrong permissions:
HTTP 403 → { "error": { "code": "permission_denied", "message": "This key does not have permission for charges:write" } }

// Rate limit hit:
HTTP 429 → { "error": { "code": "rate_limit_exceeded", "message": "Too many requests" } }

⚠️ Never Do This: Confusing API Keys with Authentication
An API key proves nothing about identity through cryptography. It just proves the caller has the string. If someone steals your key, the server cannot tell the difference between them and you. Unlike a JWT (which is cryptographically signed and expires), a stolen API key stays valid until you manually revoke it. Build your threat model around that fact.
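
To make the lookup mechanism concrete, here is a minimal sketch of how a provider might issue, verify, and revoke keys. It is not Stripe's actual implementation: the `demo_live_` prefix, the in-memory `KEY_STORE` dict, and every function name below are assumptions made for illustration.

```python
import hashlib
import secrets
from typing import Optional

# Hypothetical in-memory key store: maps SHA-256(key) -> account record.
# A real provider would use a database; this dict is an assumption for the sketch.
KEY_STORE: dict = {}


def hash_key(raw_key: str) -> str:
    # A fast hash is acceptable here because the key is long and random,
    # unlike a human-chosen password.
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def issue_key(account_id: str, permissions: list) -> str:
    # 32 random bytes from the OS CSPRNG, URL-safe encoded.
    raw_key = "demo_live_" + secrets.token_urlsafe(32)
    KEY_STORE[hash_key(raw_key)] = {
        "account_id": account_id,
        "permissions": permissions,
        "is_active": True,
    }
    # The raw key is returned once; only its hash is kept server-side.
    return raw_key


def authenticate(presented_key: str) -> Optional[dict]:
    # Pure lookup: hash what the caller sent and see if a row exists.
    record = KEY_STORE.get(hash_key(presented_key))
    if record is None or not record["is_active"]:
        return None
    return record


def revoke(raw_key: str) -> None:
    record = KEY_STORE.get(hash_key(raw_key))
    if record is not None:
        record["is_active"] = False


if __name__ == "__main__":
    key = issue_key("acct_demo", ["charges:read"])
    print(authenticate(key) is not None)         # valid key is found
    print(authenticate("demo_live_wrong"))       # unknown key: no record
    revoke(key)
    print(authenticate(key))                     # revoked key: no record
```

Notice there is no signature check and no expiry anywhere in `authenticate`. Whoever presents the string gets the record, which is exactly why a stolen key works until someone calls `revoke`.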

Where API Keys Live, Travel, and Get Stolen

The key gets generated once. After that, it has to live somewhere in your system, travel with every request, and never appear anywhere a human or bot shouldn't see it. Every one of those three moments is a potential leak point, and I've seen all three fail in production.

Storage is where most teams fail first. The lazy path (and I've seen it in codebases at companies you've heard of) is hardcoding the key directly in source code. It's fast, it works locally, and it will eventually destroy you. GitHub's secret scanning catches some of these and notifies the vendor, but by the time the notification arrives, automated bots have already scraped the commit. Those bots watch GitHub's public event stream in real time. Real time. Not a crawl, a live stream.
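
A check along these lines, run before every commit, can catch a hardcoded key before it ever reaches a remote. This is a minimal sketch and not a replacement for a dedicated secret scanner: the two regex patterns and the function name `find_suspected_keys` are assumptions chosen for illustration.

```python
import re

# Illustrative patterns only (assumptions): Stripe-style secret keys and
# AWS-style access key IDs. Real scanners ship far larger rule sets.
SUSPECT_PATTERNS = {
    "stripe_secret_key": re.compile(r"sk_(?:live|test)_[0-9A-Za-z]{16,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_suspected_keys(text: str) -> list:
    """Return (line_number, pattern_name) for each line that looks like it holds a key."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


if __name__ == "__main__":
    sample = (
        "import os\n"
        'STRIPE_KEY = "sk_live_4eC39HqLyjWDarjtT1zdp7dc"\n'
        'key = os.environ["STRIPE_SECRET_KEY"]\n'
    )
    for line_number, name in find_suspected_keys(sample):
        print(f"line {line_number}: possible {name}")
```

Only the hardcoded literal is flagged; the line that reads the key from the environment passes, which is the behaviour you want from a commit gate.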

The correct storage pattern is environment variables at minimum, and a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager) in any production system that matters. The key lives in the secrets manager, your app fetches it at startup or at request time, and it never touches your source control, your logs, or your error reporting service. That last one trips people up constantly: Sentry, Datadog, and similar tools often log full request objects on errors. If your API key is in a request header and you log the full request on a 500 error, you just wrote your key into your observability stack.

SecureKeyLoading.py · PYTHON
# io.thecodeforge - System Design tutorial
# Scenario: Payment service loading a Stripe key safely at startup
# Demonstrating: env vars (dev), secrets manager (prod), and the logging trap

import os
import boto3
import json
import logging
import requests
from botocore.exceptions import ClientError
from functools import lru_cache

logger = logging.getLogger(__name__)

# ─────────────────────────────────────────────────────────────
# PATTERN 1: Environment variable (acceptable for local dev)
# ─────────────────────────────────────────────────────────────

def load_stripe_key_from_env() -> str:
    key = os.environ.get("STRIPE_SECRET_KEY")
    if not key:
        # Fail loud at startup: better than a cryptic 401 at checkout time
        raise EnvironmentError(
            "STRIPE_SECRET_KEY is not set. "
            "Check your .env file or deployment environment variables."
        )
    if key.startswith("sk_live") and os.environ.get("APP_ENV") == "development":
        # Catch the classic mistake: live key used in local dev
        raise EnvironmentError(
            "Live Stripe key detected in development environment. "
            "Use sk_test_ keys for local development."
        )
    return key

# ─────────────────────────────────────────────────────────────
# PATTERN 2: AWS Secrets Manager (required for production)
# ─────────────────────────────────────────────────────────────

@lru_cache(maxsize=1)  # Cache the secret: don't call Secrets Manager on every request
def load_stripe_key_from_secrets_manager(secret_name: str, region: str) -> str:
    client = boto3.client("secretsmanager", region_name=region)
    try:
        response = client.get_secret_value(SecretId=secret_name)
    except client.exceptions.ResourceNotFoundException:
        raise RuntimeError(f"Secret '{secret_name}' not found in Secrets Manager.")
    except ClientError as error:
        # Access-denied is not a modeled exception on the Secrets Manager client,
        # so catch the generic ClientError and inspect the error code instead
        if error.response["Error"]["Code"] == "AccessDeniedException":
            # This usually means your IAM role doesn't have secretsmanager:GetSecretValue
            raise RuntimeError(
                f"IAM permission denied reading '{secret_name}'. "
                "Check your task role policy for secretsmanager:GetSecretValue."
            ) from error
        raise
    secret = json.loads(response["SecretString"])
    return secret["stripe_secret_key"]

# ─────────────────────────────────────────────────────────────
# THE LOGGING TRAP: this is how keys end up in Datadog
# ─────────────────────────────────────────────────────────────

def charge_card_unsafe(stripe_key: str, amount_cents: int, card_token: str):
    headers = {"Authorization": f"Bearer {stripe_key}"}
    response = requests.post(
        "https://api.stripe.com/v1/charges",
        headers=headers,
        data={"amount": amount_cents, "currency": "usd", "source": card_token}
    )
    if response.status_code != 200:
        # DANGER: logging response.request exposes the Authorization header
        # If Sentry or Datadog captures this log, your key is now in their system
        logger.error(f"Stripe charge failed. Request: {response.request.headers}")
    return response.json()


def charge_card_safe(stripe_key: str, amount_cents: int, card_token: str):
    headers = {"Authorization": f"Bearer {stripe_key}"}
    response = requests.post(
        "https://api.stripe.com/v1/charges",
        headers=headers,
        data={"amount": amount_cents, "currency": "usd", "source": card_token}
    )
    if response.status_code != 200:
        # Log only what you need to debug; never log headers containing credentials
        logger.error(
            "Stripe charge failed",
            extra={
                "status_code": response.status_code,
                "stripe_error_code": response.json().get("error", {}).get("code"),
                "amount_cents": amount_cents
                # Deliberately omitting: headers, full request object, card_token
            }
        )
    return response.json()


# ─────────────────────────────────────────────────────────────
# STARTUP: how the service wires this together
# ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
    env = os.environ.get("APP_ENV", "development")

    if env == "production":
        stripe_key = load_stripe_key_from_secrets_manager(
            secret_name="prod/payment-service/stripe",
            region="us-east-1"
        )
        print("Loaded Stripe key from Secrets Manager")
    else:
        stripe_key = load_stripe_key_from_env()
        print("Loaded Stripe key from environment variable")

    # Sanity check: log key PREFIX only so you can confirm which key is active
    # Never log the full key, even in debug mode
    print(f"Active Stripe key prefix: {stripe_key[:12]}...")
▶ Output
# Production startup:
Loaded Stripe key from Secrets Manager
Active Stripe key prefix: sk_live_4eC3...

# Development startup with test key:
Loaded Stripe key from environment variable
Active Stripe key prefix: sk_test_51Lk...

# Development startup with LIVE key (caught at startup, not at runtime; EnvironmentError is an alias of OSError in Python 3):
OSError: Live Stripe key detected in development environment. Use sk_test_ keys for local development.

# Production with missing IAM permission:
RuntimeError: IAM permission denied reading 'prod/payment-service/stripe'. Check your task role policy for secretsmanager:GetSecretValue.

⚠️ Production Trap: Your Error Reporter Is Logging Your Keys
Sentry's Django and Flask integrations can capture HTTP request data, including headers, on unhandled exceptions, depending on SDK version and settings. If that data includes Authorization: Bearer sk_live_..., your key goes straight into Sentry's servers. Fix it: configure Sentry's before_send hook to scrub Authorization headers, and keep sentry_sdk's send_default_pii=False setting. Check your existing Sentry issues right now: search for 'Authorization' in the breadcrumb data.
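
The before_send fix can be sketched as a plain function over the event dict. This is a minimal sketch, assuming headers arrive under `event["request"]["headers"]` as a dict; the function name, the `SENSITIVE_HEADERS` set, and the `"[Filtered]"` placeholder are illustrative choices. It does not touch breadcrumbs or log messages, which you would need to scrub separately.

```python
import copy

# Header names treated as credential-bearing (an assumption; extend for your stack).
SENSITIVE_HEADERS = {"authorization", "x-api-key", "cookie"}


def scrub_sensitive_headers(event: dict, hint: dict) -> dict:
    """Redact credential-bearing request headers before an event leaves the process."""
    scrubbed = copy.deepcopy(event)  # leave the caller's event untouched
    headers = scrubbed.get("request", {}).get("headers")
    if isinstance(headers, dict):
        for name in headers:
            if name.lower() in SENSITIVE_HEADERS:
                headers[name] = "[Filtered]"
    return scrubbed


# Registration, assuming sentry_sdk is installed and YOUR_DSN is defined elsewhere:
#   sentry_sdk.init(dsn=YOUR_DSN, before_send=scrub_sensitive_headers)

if __name__ == "__main__":
    sample_event = {
        "request": {
            "headers": {
                "Authorization": "Bearer sk_live_example",
                "Content-Type": "application/json",
            }
        }
    }
    print(scrub_sensitive_headers(sample_event, {})["request"]["headers"])
```

Keeping the scrubber free of any Sentry import means you can unit test it against captured event payloads before trusting it in production.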

Rate Limiting, Key Rotation, and Scoping: The Three Things That Save You

Generating an API key is easy. Managing it across the lifecycle of a production system is where teams fall apart. There are three practices that separate systems that recover from a leaked key in five minutes from systems that spend a week cleaning up the blast radius.

Rate limiting is your circuit breaker. Every serious API provider implements it: they'll return HTTP 429 Too Many Requests when you exceed your quota. But here's what most juniors don't realize: rate limiting protects the provider, not you. It caps how hard a leaked key can hit the provider, but it doesn't stop an attacker from doing exactly 99 requests per minute (just under your limit) indefinitely. You need your own rate limiting on inbound requests to your service, separate from whatever the upstream API enforces.
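
One way to add your own inbound limit is a token bucket per caller. This is a minimal single-process sketch, assuming in-memory state is acceptable; the class names, parameters, and injectable clock are all illustrative. A multi-instance deployment would need shared state, which is not shown here.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: int, clock=time.monotonic):
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last_refill = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last_refill
        self._last_refill = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PerKeyRateLimiter:
    """One bucket per caller. Index by a key ID or prefix, never the raw key."""

    def __init__(self, rate_per_second: float, capacity: int, clock=time.monotonic):
        self._rate_per_second = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, key_id: str) -> bool:
        bucket = self._buckets.get(key_id)
        if bucket is None:
            bucket = TokenBucket(self._rate_per_second, self._capacity, self._clock)
            self._buckets[key_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = PerKeyRateLimiter(rate_per_second=1.0, capacity=2)
    # Three back-to-back calls: the burst of 2 passes, the third is rejected.
    print([limiter.allow("checkout-service") for _ in range(3)])
```

When `allow` returns False, respond with HTTP 429 yourself. The point is that the limit reflects what your service considers normal, not what the upstream provider tolerates.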

Key rotation means proactively replacing your API keys on a schedule, even if they haven't leaked. The argument against it ('why fix what isn't broken?') ignores the reality that you often don't know a key is compromised until damage is done. Rotate quarterly at minimum. Rotate immediately any time a developer with access leaves the company. Rotate immediately if the key appears anywhere it shouldn't. The operational cost of rotation is low if you've already externalized keys to a secrets manager: it's a one-line update, not a deployment.

Scoping means giving each key only the permissions it actually needs. Don't use your master admin key in your read-only reporting service. If that reporting service gets compromised, the attacker should get read access to your data. Not write access, not billing access, not the ability to create new API keys. Most providers let you scope keys to specific operations. Use it every time.

APIKeyRotationAndScoping.py · PYTHON
# io.thecodeforge - System Design tutorial
# Scenario: Internal API gateway managing keys for microservices
# Demonstrates: scoped keys, rotation tracking, and handling 429s correctly

import time
import hashlib
import secrets
import logging
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Optional
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)

# ─────────────────────────────────────────────────────────────
# DATA MODEL: what a managed API key looks like internally
# ─────────────────────────────────────────────────────────────

@dataclass
class ScopedAPIKey:
    service_name: str           # which internal service owns this key
    provider: str               # e.g. "stripe", "sendgrid", "googlemaps"
    permissions: list[str]      # e.g. ["charges:write"], not ["*"]
    created_at: datetime = field(default_factory=datetime.utcnow)
    rotate_by: datetime = field(default_factory=lambda: datetime.utcnow() + timedelta(days=90))
    _raw_key: str = field(default="", repr=False)  # never printed in logs or repr

    @property
    def key_prefix(self) -> str:
        # Safe to log: enough to identify which key is active without exposing it
        return self._raw_key[:12] + "..."

    @property
    def days_until_rotation(self) -> int:
        return (self.rotate_by - datetime.utcnow()).days

    @property
    def needs_rotation(self) -> bool:
        return self.days_until_rotation <= 7  # warn a week out


# ─────────────────────────────────────────────────────────────
# RETRY LOGIC: handling 429s without hammering the upstream
# ─────────────────────────────────────────────────────────────

def build_resilient_http_session(total_retries: int = 3) -> requests.Session:
    session = requests.Session()

    # Retry on 429 (rate limit) and 503 (upstream temporarily unavailable).
    # backoff_factor=2 gives exponential waits: with urllib3's formula the first
    # retry is immediate, then roughly 4s and 8s between later retries.
    # By default only idempotent methods are retried, so the POST in
    # create_charge is not retried automatically.
    retry_strategy = Retry(
        total=total_retries,
        status_forcelist=[429, 503],
        backoff_factor=2,
        respect_retry_after_header=True  # honour Stripe/SendGrid's Retry-After header
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    return session


# ─────────────────────────────────────────────────────────────
# SCOPED REQUEST BUILDER: enforces least-privilege per service
# ─────────────────────────────────────────────────────────────

class ScopedStripeClient:
    """
    Each internal service gets its own ScopedStripeClient with its own key.
    The checkout service gets charges:write.
    The reporting service gets charges:read only.
    A compromised reporting service cannot create charges.
    """

    def __init__(self, api_key: ScopedAPIKey):
        self._key = api_key
        self._session = build_resilient_http_session()
        self._check_rotation_status()

    def _check_rotation_status(self):
        if self._key.needs_rotation:
            # Warn loudly at startup: gives ops team time to rotate before expiry
            logger.warning(
                "API key rotation due soon",
                extra={
                    "service": self._key.service_name,
                    "provider": self._key.provider,
                    "key_prefix": self._key.key_prefix,
                    "days_remaining": self._key.days_until_rotation
                }
            )

    def get_charge(self, charge_id: str) -> dict:
        # Reporting service uses this: read-only, no ability to create/modify
        if "charges:read" not in self._key.permissions:
            raise PermissionError(
                f"Key for '{self._key.service_name}' lacks charges:read permission. "
                f"Granted permissions: {self._key.permissions}"
            )
        response = self._session.get(
            f"https://api.stripe.com/v1/charges/{charge_id}",
            headers={"Authorization": f"Bearer {self._key._raw_key}"}
        )
        response.raise_for_status()
        return response.json()

    def create_charge(self, amount_cents: int, card_token: str) -> dict:
        # Checkout service uses this: requires explicit write permission
        if "charges:write" not in self._key.permissions:
            raise PermissionError(
                f"Key for '{self._key.service_name}' lacks charges:write permission. "
                f"This is likely a scoping error: do not expand permissions. "
                f"Create a dedicated key with charges:write for the checkout service."
            )
        response = self._session.post(
            "https://api.stripe.com/v1/charges",
            headers={"Authorization": f"Bearer {self._key._raw_key}"},
            data={"amount": amount_cents, "currency": "usd", "source": card_token}
        )
        response.raise_for_status()
        return response.json()


# ─────────────────────────────────────────────────────────────
# EXAMPLE USAGE: wiring up two services with different scopes
# ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Checkout service key: write access
    checkout_api_key = ScopedAPIKey(
        service_name="checkout-service",
        provider="stripe",
        permissions=["charges:write", "refunds:write"],
        rotate_by=datetime.utcnow() + timedelta(days=5)  # triggers rotation warning
    )
    checkout_api_key._raw_key = "sk_live_checkout_key_here"

    # Reporting service key: read access only
    reporting_api_key = ScopedAPIKey(
        service_name="reporting-service",
        provider="stripe",
        permissions=["charges:read"],
        rotate_by=datetime.utcnow() + timedelta(days=60)
    )
    reporting_api_key._raw_key = "sk_live_reporting_key_here"

    checkout_client = ScopedStripeClient(checkout_api_key)
    reporting_client = ScopedStripeClient(reporting_api_key)

    # This works:
    print("Checkout client permissions:", checkout_api_key.permissions)

    # This raises PermissionError: the reporting client cannot create charges
    try:
        reporting_client.create_charge(2000, "tok_visa")
    except PermissionError as e:
        print(f"Caught expected permission error: {e}")
▶ Output
# Startup warning (checkout key is due in just under 5 days; timedelta.days truncates, so this typically reports 4):
WARNING: API key rotation due soon | service=checkout-service | provider=stripe | key_prefix=sk_live_chec... | days_remaining=4

# No warning for reporting key (60 days out):
[no rotation warning]

# Permissions check:
Checkout client permissions: ['charges:write', 'refunds:write']

# Reporting client attempting to create a charge:
Caught expected permission error: Key for 'reporting-service' lacks charges:write permission. This is likely a scoping error: do not expand permissions. Create a dedicated key with charges:write for the checkout service.

⚠️ Senior Shortcut: One Key Per Service, Never One Key Per Company
The single biggest operational upgrade you can make today: stop using one shared API key across all your services. Give each service its own scoped key. When a key leaks, you revoke exactly that key, you know exactly which service was compromised, and every other service keeps running. With a shared key, a leak in your reporting cron job takes down your payment flow while you rotate. Scope isolation is your blast radius limiter.

The API Key Graveyard: Real Failure Modes and How to Detect Them

Every API key failure I've seen fits one of four patterns. Learn to recognise the smell of each, because by the time you're debugging them under pressure they all look like generic 'service unavailable' errors.

The first pattern is the silent leak. The key is out in the wild (in a public GitHub repo, in a Slack message, in a Confluence page someone made public) and you don't know yet. The attacker isn't being dramatic. They're making exactly 80 requests per minute to stay under your 100 req/min rate limit. Your metrics look normal. Your error rate is zero. Your bill is climbing. Detection: set up spend anomaly alerts on every API provider that has billing. AWS, Stripe, SendGrid: they all have it. Set the threshold low. A 20% spike in API usage at 2am is worth a PagerDuty alert.
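
That 20% threshold is easy to express as a baseline comparison. A minimal sketch, assuming you already collect per-hour request counts for each key; the function name, the trailing-window input, and the default ratio of 1.2 are illustrative choices, not any provider's alerting API.

```python
from statistics import mean


def is_usage_anomalous(
    baseline_hourly_counts: list,
    current_hour_count: int,
    spike_ratio: float = 1.2,  # 1.2 means "more than 20% above baseline"
) -> bool:
    """Compare this hour's request count to the average of a trailing window."""
    if not baseline_hourly_counts:
        # No history yet, so there is nothing to compare against.
        return False
    baseline = mean(baseline_hourly_counts)
    if baseline == 0:
        # Any traffic on a normally idle key is suspicious.
        return current_hour_count > 0
    return current_hour_count > baseline * spike_ratio


if __name__ == "__main__":
    same_hour_last_week = [400, 380, 420, 410, 390, 405, 395]
    print(is_usage_anomalous(same_hour_last_week, 410))  # within normal range
    print(is_usage_anomalous(same_hour_last_week, 900))  # well above baseline
```

Comparing against the same hour on previous days, rather than a global average, keeps a quiet 2am from hiding behind a busy afternoon.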

The second pattern is the rotation death spiral. Someone rotates a key, updates it in the secrets manager, but forgets that four services read that secret at startup and cache it with lru_cache. They're all still using the old key. You start seeing 401s in production. Panicked, someone reverts the rotation. Now you're back to the leaked key and have to do the whole thing again. Fix: implement a cache TTL on secret fetches, and build a /healthz endpoint that validates the API key is still active without caching the result.

The third pattern is the scope creep accident. Someone needs a quick fix in staging, expands a key's permissions 'temporarily,' and that change makes it to production. Now your read-only analytics service has write access. It doesn't matter until the analytics service has a bug that starts writing garbage data. Audit your key permissions quarterly: not just whether keys are rotated, but whether their scopes still match what they actually need.
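
The cache TTL fix for the rotation death spiral can be sketched in a few lines. This is a minimal sketch, assuming a zero-argument `fetch` callable that returns the current secret (for example, a wrapper around your secrets manager call); the class name, the 60-second default, and the `force_refresh` flag are illustrative.

```python
import time
from typing import Callable, Optional


class TTLSecretCache:
    """Caches a secret for a bounded time, unlike lru_cache which holds it forever."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def get(self, force_refresh: bool = False) -> str:
        now = self._clock()
        expired = self._fetched_at is None or (now - self._fetched_at) >= self._ttl_seconds
        if force_refresh or expired:
            # Re-read from the source of truth so a rotated key is picked up.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


if __name__ == "__main__":
    # Hypothetical fetcher standing in for a real secrets manager call.
    cache = TTLSecretCache(lambda: "sk_test_placeholder", ttl_seconds=60.0)
    print(cache.get()[:8] + "...")
```

Two usage points follow from the design: call `get(force_refresh=True)` once when the upstream returns a 401, and have the health check always force a refresh so a failed rotation shows up within one check cycle.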

APIKeyHealthMonitor.py · PYTHON
# io.thecodeforge - System Design tutorial
# Scenario: A lightweight health-check system that validates API keys are alive
# and alerts on anomalous usage patterns before the bill arrives

import logging
import time
from datetime import datetime, timedelta
from collections import deque
from typing import Callable
import requests

logger = logging.getLogger(__name__)

# ─────────────────────────────────────────────────────────────
# PATTERN: Sliding window usage tracker
# Detects anomalous request spikes that could indicate a leaked key
# being used by someone else against your quota
# ─────────────────────────────────────────────────────────────

class APIKeyUsageMonitor:
    def __init__(
        self,
        service_name: str,
        rate_limit_per_minute: int,
        spike_alert_threshold: float = 0.75  # alert at 75% of rate limit
    ):
        self.service_name = service_name
        self.rate_limit_per_minute = rate_limit_per_minute
        self.spike_alert_threshold = spike_alert_threshold
        # Deque of timestamps: we keep only the last 60 seconds of requests
        self._request_timestamps: deque = deque()

    def record_request(self):
        now = time.monotonic()
        self._request_timestamps.append(now)
        self._evict_old_timestamps(now)
        self._check_for_spike()

    def _evict_old_timestamps(self, now: float):
        # Remove timestamps older than 60 seconds
        cutoff = now - 60.0
        while self._request_timestamps and self._request_timestamps[0] < cutoff:
            self._request_timestamps.popleft()

    def _check_for_spike(self):
        current_rate = len(self._request_timestamps)
        alert_threshold = int(self.rate_limit_per_minute * self.spike_alert_threshold)
        if current_rate >= alert_threshold:
            logger.warning(
                "API key usage spike detected: possible key leak or runaway client",
                extra={
                    "service": self.service_name,
                    "requests_last_60s": current_rate,
                    "rate_limit": self.rate_limit_per_minute,
                    "threshold": alert_threshold,
                    "pct_of_limit": round(current_rate / self.rate_limit_per_minute * 100, 1)
                }
            )

    def current_usage(self) -> dict:
        now = time.monotonic()
        self._evict_old_timestamps(now)
        return {
            "service": self.service_name,
            "requests_last_60s": len(self._request_timestamps),
            "rate_limit_per_minute": self.rate_limit_per_minute,
            "headroom_remaining": self.rate_limit_per_minute - len(self._request_timestamps)
        }


# ─────────────────────────────────────────────────────────────
# PATTERN: Active key health check
# Run this from your /healthz endpoint: does NOT use lru_cache
# so it always validates the current key, even after rotation
# ─────────────────────────────────────────────────────────────

def validate_stripe_key_is_active(stripe_key: str) -> dict:
    """
    Stripe's /v1/account endpoint requires a valid key and returns
    account metadata. It's the canonical 'is this key alive?' check.
    Costs one API call. Cache the RESULT for 60 seconds max, never the key.
    """
    try:
        response = requests.get(
            "https://api.stripe.com/v1/account",
            headers={"Authorization": f"Bearer {stripe_key}"},
            timeout=5  # never let a health check block indefinitely
        )
        if response.status_code == 200:
            account_data = response.json()
            return {
                "status": "healthy",
                "account_id": account_data.get("id"),
                "charges_enabled": account_data.get("charges_enabled"),
                "checked_at": datetime.utcnow().isoformat()
            }
        elif response.status_code == 401:
            # The key is dead: either revoked, rotated, or never valid
            return {
                "status": "invalid_key",
                "error": response.json().get("error", {}).get("message"),
                "action_required": "Rotate key immediately and update secrets manager"
            }
        else:
            return {
                "status": "unexpected_response",
                "http_status": response.status_code
            }
    except requests.Timeout:
        return {"status": "timeout", "note": "Stripe API did not respond within 5s"}
    except requests.ConnectionError:
        return {"status": "network_error", "note": "Cannot reach api.stripe.com"}


# ─────────────────────────────────────────────────────────────
# DEMO: simulating usage tracking and health check
# ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
    monitor = APIKeyUsageMonitor(
        service_name="checkout-service",
        rate_limit_per_minute=100,
        spike_alert_threshold=0.75
    )

    # Simulate normal traffic (30 requests)
    for _ in range(30):
        monitor.record_request()
    print("After 30 requests:", monitor.current_usage())

    # Simulate spike (46 more requests, 76 total, which crosses the 75% threshold)
    for _ in range(46):
        monitor.record_request()
    print("After 76 requests:", monitor.current_usage())

    # Health check output (mocked; would hit real Stripe in prod)
    print("\nKey health check result:")
    print({
        "status": "healthy",
        "account_id": "acct_1A2B3C4D5E6F",
        "charges_enabled": True,
        "checked_at": datetime.utcnow().isoformat()
    })
▶ Output
After 30 requests: {'service': 'checkout-service', 'requests_last_60s': 30, 'rate_limit_per_minute': 100, 'headroom_remaining': 70}

WARNING: API key usage spike detected - possible key leak or runaway client | service=checkout-service | requests_last_60s=76 | rate_limit=100 | threshold=75 | pct_of_limit=76.0

After 76 requests: {'service': 'checkout-service', 'requests_last_60s': 76, 'rate_limit_per_minute': 100, 'headroom_remaining': 24}

Key health check result:
{'status': 'healthy', 'account_id': 'acct_1A2B3C4D5E6F', 'charges_enabled': True, 'checked_at': '2024-03-15T03:42:17.221483'}
🔥
Interview Gold: The Cache-and-Rotate Problem
Interviewers love this one: 'You rotate an API key in Secrets Manager but services start returning 401. Why?' The answer: services fetched the old key at startup and cached it in memory with no TTL. Fix with two things: set a max cache TTL of 60 seconds on secret fetches, and have your health check endpoint always re-fetch the key from Secrets Manager (bypassing cache) so you catch rotation failures within one health check cycle.
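The two-part fix above can be sketched roughly like this. Treat it as a minimal illustration, not a drop-in implementation: `TTLSecretCache` is a hypothetical helper, and `fetch_secret` stands in for whatever call your code really makes to Secrets Manager. The injectable `clock` only exists so the expiry logic can be exercised without waiting 60 seconds.

```python
import time
from typing import Callable, Optional


class TTLSecretCache:
    """Hypothetical helper: caches one secret for at most ttl_seconds."""

    def __init__(
        self,
        fetch_secret: Callable[[], str],   # assumption: wraps your real secrets call
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch_secret = fetch_secret
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def get(self, bypass_cache: bool = False) -> str:
        """Return the cached secret, re-fetching if it is stale or bypassed."""
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._ttl_seconds
        )
        if bypass_cache or expired:
            # Health checks pass bypass_cache=True so they always see the
            # value currently stored upstream, even right after a rotation.
            self._value = self._fetch_secret()
            self._fetched_at = now
        return self._value
```

Normal request paths would call `cache.get()`, while the health check would call `cache.get(bypass_cache=True)` and then validate the returned key against the upstream API.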
| Aspect | API Key | OAuth 2.0 Bearer Token (JWT) |
|---|---|---|
| What it proves | Caller has the string, nothing more | Caller authenticated via a trusted identity provider |
| Expiry | Never expires unless manually revoked | Short-lived (typically 15 min to 1 hr), auto-expires |
| Revocation speed | Instant: delete the key server-side | Cannot revoke before expiry without a blocklist |
| Theft impact | Attacker has permanent access until manual revoke | Attacker has access for the remaining token lifetime only |
| Ideal use case | Server-to-server with a secret you fully control | User-facing auth, or anywhere expiry matters |
| Cryptographic proof | None: pure lookup | Yes: signature verified with public key, no DB call needed |
| Storage location | Secrets manager / environment variable | Short-lived, often stored in memory only |
| Rotation complexity | Manual process, operationally risky if cached | Automatic via token expiry and refresh flow |
| Provider-side DB hit per request | Yes: key must be looked up every request | No: signature verification is stateless |
| Setup complexity | Trivial: generate, copy, use | High: OAuth flows, identity providers, token endpoints |
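To make the "pure lookup" versus "signature verified" rows concrete, here is a small sketch of both validation styles. Several things are assumptions for the sake of a short example: the key table, the signing secret, and every function name are hypothetical, and the token format is a simplified stand-in rather than a real JWT. It also signs with a shared HMAC secret for brevity, whereas the table's "verified with public key" row refers to asymmetric signatures.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical provider-side table: validating an API key is just a lookup.
API_KEY_TABLE = {"key_demo_123": {"account": "acct_demo", "scopes": ["read"]}}

# Assumption: a shared signing secret, used here only to keep the sketch short.
SIGNING_SECRET = b"demo-signing-secret"


def validate_api_key(api_key: str):
    """Return the account record or None. Needs the table on every request."""
    return API_KEY_TABLE.get(api_key)


def issue_token(claims: dict) -> str:
    """Encode the claims and append a signature over the encoded payload."""
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SIGNING_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def validate_token(token: str, now: float):
    """Check signature and expiry using only the token and the signing secret."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SIGNING_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or signed with a different secret
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] <= now:
        return None  # expired tokens die on their own, no revocation needed
    return claims
```

Notice that `validate_api_key` cannot work without the table, while `validate_token` never touches one. That is the trade-off the table describes: the lookup gives you instant revocation, and the signature gives you stateless checks plus built-in expiry.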

🎯 Key Takeaways

  • An API key is a lookup token, not a cryptographic proof: whoever holds the string has the permission, which is why storage and transmission are everything
  • The logging trap kills you quietly: Sentry, Datadog, and similar tools will happily capture your Authorization header in error breadcrumbs unless you explicitly scrub them. Go check your existing error logs before finishing this article
  • One scoped key per service is the single highest-leverage change you can make. When (not if) a key leaks, scope isolation determines whether you have a five-minute fix or a three-hour incident
  • An attacker with your API key doesn't need to hammer your rate limit. They'll stay just under it indefinitely, which means spend anomaly alerts catch leaks that error-rate monitoring completely misses
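As one illustration of the scrubbing idea in the second takeaway, here is a sketch for Python's standard `logging` module. It does not configure Sentry or Datadog; those tools have their own scrubbing settings that you would set up separately. The regex patterns and the names `redact` and `RedactSecretsFilter` are assumptions chosen for this example, so adjust them to the key formats you actually use.

```python
import logging
import re

# Assumed patterns: a Bearer credential in a header string, and
# Stripe-style secret keys (sk_live_... / sk_test_...).
_SECRET_PATTERNS = [
    re.compile(r"(Bearer\s+)[A-Za-z0-9._\-]+"),
    re.compile(r"(sk_(?:live|test)_)[A-Za-z0-9]+"),
]


def redact(text: str) -> str:
    """Replace anything that looks like a credential with a placeholder."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class RedactSecretsFilter(logging.Filter):
    """Hypothetical filter: scrubs the formatted message before handlers see it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style args are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True
```

You would attach it with something like `handler.addFilter(RedactSecretsFilter())` on each handler. A filter like this is a safety net, not a substitute for keeping keys out of log calls in the first place.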

⚠ Common Mistakes to Avoid

  • ✕ Mistake 1: Hardcoding an API key directly in source code. Symptom: 'No such API key' 401 errors appear in production after key rotation, AND the old key is permanently exposed in git history. Fix: run git log -S 'sk_live' --all right now to audit history; rotate any key you find immediately, because a key that was ever pushed must be treated as leaked; then use git filter-repo or BFG Repo-Cleaner to scrub it from history; move storage to Secrets Manager going forward
  • ✕ Mistake 2: Using the same API key across all services (checkout, reporting, cron jobs). Symptom: when any one service is compromised or its key leaks, you revoke it and take down every other service simultaneously. Fix: generate one scoped key per service with least-privilege permissions; use your provider's restricted key feature (Stripe calls these 'Restricted Keys'; Twilio issues separate API Keys, each identified by its own SID)
  • ✕ Mistake 3: Caching the API key from Secrets Manager with no TTL using lru_cache. Symptom: after key rotation, services keep sending the revoked key and receive HTTP 401, and they won't recover until redeployed. Fix: replace @lru_cache with a timed cache that re-fetches after 60 seconds (use cachetools.TTLCache in Python); add a /healthz route that validates the live key against the upstream API on every call

Interview Questions on This Topic

  • Q: Your team rotates a Stripe API key in AWS Secrets Manager. Within two minutes, your checkout service starts returning 401 errors, but your reporting service is fine. Both services use the same key name in Secrets Manager. What's causing the discrepancy, and how do you fix it without a redeployment?
  • Q: When would you use an API key instead of OAuth 2.0 for authenticating a microservice, and at what point does that choice become a liability you need to revisit?
  • Q: A leaked API key is being used by an attacker at exactly 80 req/min, just under your rate limit of 100 req/min. Your error rate is zero and your latency is normal. How would you detect this attack in a system you're designing today?
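For the third question, one possible line of answer is to compare each key's traffic with its own history instead of with the rate limit. The sketch below is an assumption-heavy illustration of that idea: the function name, the 3-sigma default, and the spread floor are all choices made for this example, and a real system would also look at spend and source IPs.

```python
import statistics
from typing import Sequence


def is_usage_anomalous(
    baseline_rpm: Sequence[float],
    current_rpm: float,
    sigma_threshold: float = 3.0,
) -> bool:
    """Flag usage far above this key's own history, ignoring the rate limit."""
    mean = statistics.fmean(baseline_rpm)
    stdev = statistics.pstdev(baseline_rpm)
    # Floor the spread so a very flat baseline doesn't alert on tiny noise.
    spread = max(stdev, 0.05 * mean, 1.0)
    return current_rpm > mean + sigma_threshold * spread
```

With a baseline hovering around 30 req/min, a sustained 80 req/min gets flagged even though it never touches the 100 req/min limit and never produces a single error.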

Frequently Asked Questions

Is it safe to put an API key in a frontend JavaScript file?

No. Never put a secret API key in frontend code. Anything shipped to the browser is public, full stop. Anyone can open DevTools, go to the Network tab, and read every header your frontend sends. If you need to call a third-party API from the frontend, proxy the call through your backend server, which holds the key. The only keys safe for frontend use are explicitly designated 'public keys' (like Stripe's pk_live_ publishable key), which providers restrict to a narrow set of non-sensitive operations by design.

What's the difference between an API key and an API token?

An API key is a static credential that doesn't expire and maps directly to an account. Think of it as a permanent password. An API token (usually a JWT or OAuth Bearer token) is short-lived, cryptographically signed, and expires automatically. Use API keys for server-to-server integrations where you fully control the secret. Use tokens for anything involving user identity, or anywhere automatic expiry matters more than operational simplicity.

How do I rotate an API key in production without downtime?

Generate the new key first, then deploy it. Don't revoke the old key until you've confirmed the new key is working in production. The sequence: (1) generate new key in the provider dashboard, (2) update the value in Secrets Manager or your secrets manager of choice, (3) trigger a rolling restart of services (or wait for the TTL cache to expire if you've implemented one), (4) verify your health check endpoint returns healthy with the new key, (5) only then revoke the old key. Skipping step 4 before step 5 is how teams create 3am incidents.
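The ordering of those five steps can be sketched as a small function. Every callable here is a hypothetical placeholder for your own provider and deployment tooling; the only thing the sketch tries to show is that revocation is gated on the health check.

```python
from typing import Callable


def rotate_without_downtime(
    create_new_key: Callable[[], str],
    store_in_secrets_manager: Callable[[str], None],
    refresh_services: Callable[[], None],
    health_check_passes: Callable[[], bool],
    revoke_old_key: Callable[[], None],
) -> str:
    """Run the rotation steps in order, revoking only after verification."""
    new_key = create_new_key()             # (1) generate the new key
    store_in_secrets_manager(new_key)      # (2) update the stored value
    refresh_services()                     # (3) rolling restart or TTL expiry
    if not health_check_passes():          # (4) verify before touching the old key
        # The old key is still valid here, so nothing is broken yet.
        return "aborted_old_key_still_active"
    revoke_old_key()                       # (5) only now revoke the old key
    return "rotated"
```

If step 4 fails, the function stops with both keys still valid, which leaves you something to debug in daylight instead of an outage.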

If an attacker gets my API key, can I tell what they did with it?

Only if your API provider logs per-key request history. Most do, but retention windows are short (Stripe keeps 30 days, AWS CloudTrail keeps 90 days by default). The hard reality: if a key was silently leaking for six months at just-under-rate-limit usage, you may never reconstruct the full damage. This is exactly why you should set up spend anomaly alerts and per-key request dashboards proactively, not forensically. After an incident, the first thing to pull is your provider's API usage logs filtered by key prefix, timestamp, and source IP. Source IP mismatches between your known datacenter ranges and unknown ranges are your clearest signal.
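The source IP comparison can be sketched with Python's standard `ipaddress` module. The function name is hypothetical, and the addresses in the usage note come from reserved documentation ranges, not from any real log. In practice the inputs would be the source IPs from your provider's usage export and the CIDR blocks you actually own.

```python
import ipaddress
from typing import Iterable, List


def find_unknown_source_ips(
    request_ips: Iterable[str],
    known_cidrs: Iterable[str],
) -> List[str]:
    """Return the request IPs that fall outside every known network range."""
    networks = [ipaddress.ip_network(cidr) for cidr in known_cidrs]
    unknown = []
    for ip_text in request_ips:
        ip = ipaddress.ip_address(ip_text)
        if not any(ip in network for network in networks):
            unknown.append(ip_text)
    return unknown
```

For example, `find_unknown_source_ips(["203.0.113.10", "198.51.100.7"], ["203.0.113.0/24"])` returns `["198.51.100.7"]`, which is the address you would investigate first.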

Naren, Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousNmap Tutorial: Network Scanning and Host Discovery
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged