Expired Root CA Broke mTLS: A $340k PKI Lesson
PKIX path building failed? An expired internal CA root cost $340k.
- PKI is a hierarchy of trust: Root CA → Intermediate CA → Leaf certificate
- TLS handshake verifies every link in that chain before data flows
- A missing intermediate is one of the most common production failures, and mobile clients fail silently when it happens
- Use short-lived certs (24h TTL) over OCSP/CRL for internal services
- Always put hostnames in SAN; CN-only matching died in 2017
- Automatic renewal (ACME/Vault) prevents 90% of expiry incidents
Imagine every business in a city gets a laminated ID badge from the mayor's office. When you walk into a shop, you don't know the shopkeeper personally — but you trust the mayor, and the badge proves the mayor vouched for them. PKI is that exact system for the internet: a chain of vouching, anchored at a root everyone agrees to trust. The twist is that your browser doesn't trust 'the internet' — it trusts a specific, hardcoded list of mayors baked into your OS. If your mayor isn't on that list, the whole thing collapses, no matter how legitimate your badge is.
The chain works like this: the Root CA (the mayor) signs an Intermediate CA's badge (a deputy mayor). The Intermediate CA signs your server's badge. When your browser connects, it checks: 'Is this badge signed by someone I trust? And is THAT person signed by the root I trust?' If both answers are yes, the connection proceeds. If any link in that chain is missing or expired, the connection fails — even if the leaf certificate itself is perfectly valid.
A fintech startup I consulted for lost six hours of payment processing because a certificate issued by their internal CA expired at 2:47 AM on a Tuesday. Their monitoring caught nothing — the service didn't crash, it just silently rejected every TLS handshake with PKIX path building failed: unable to find valid certification path to requested target. Six engineers stared at perfectly healthy application logs while $340k in transactions queued. PKI didn't fail loudly. It failed quietly, at the edges, in a way nobody had written a runbook for.
PKI — Public Key Infrastructure — is the trust plumbing underneath every HTTPS connection, every signed JWT, every mTLS service mesh, and every code-signing pipeline you've ever touched. It answers one deceptively hard question: how do two strangers on a network prove to each other that they are who they claim to be, without having met before? Symmetric keys don't scale — you can't pre-share a secret with every website on the internet. PKI solves this with asymmetric cryptography layered over a hierarchy of trusted authorities. Get it right and it's invisible. Get it wrong and you're the 3 AM war room.
After this you'll be able to read a certificate chain and understand exactly what each field means and why it matters, trace a TLS handshake step by step and know where it can break, build and rotate certificates in a real service without downtime, debug the six most common certificate errors without guessing, and design an internal PKI for a microservices environment that won't bite you six months later.
Asymmetric Cryptography: The Math That Makes Trust Possible
Before PKI existed, encrypting traffic between two servers meant pre-sharing a secret key out-of-band — email it, phone it in, bake it into a config file checked into git (yes, this still happens). That doesn't scale, and it means anyone who intercepts the key exchange owns all past and future traffic. The entire premise of PKI is that you can publish a key openly, and doing so doesn't compromise you.
Asymmetric cryptography gives you a key pair: a public key you broadcast freely and a private key you guard with your life. Anything encrypted with the public key can only be decrypted by the matching private key. More importantly for PKI, anything signed with the private key can be verified by anyone holding the public key — without the verifier ever touching the private key. That second property is what makes certificates work.
A certificate is just a structured document that says: 'This public key belongs to example.com, and I, DigiCert, am signing this claim with my own private key.' Your browser holds DigiCert's public key (via the trust store), verifies DigiCert's signature, and concludes the public key genuinely belongs to example.com. The private key for example.com never travels over the wire. Not once. That's the whole trick.
RSA-2048 was the default for years. Today you should default to ECDSA P-256 — smaller keys, faster handshakes, equivalent or better security. RSA-4096 is not meaningfully more secure than RSA-2048 against current threats, but it's noticeably slower. Don't reach for it unless a compliance checkbox forces your hand.
But there's a subtle failure mode most engineers miss: key management. If your private key is stored on a disk that gets backed up to S3 with world-readable permissions — and that backup is leaked — the entire PKI collapses. The certificate says the key belongs to example.com, but anyone who holds the private key can impersonate example.com. The certificate is still valid; the trust is broken. That's why hardware security modules (HSMs) or at minimum, encrypted keystores with strict access control, are non-negotiable. You don't just protect secrets; you protect the private keys that underpin your entire trust model.
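To make the sign-with-private, verify-with-public property concrete, here is a minimal sketch using only JDK classes. The class name `SignVerifySketch` and the sample payload are illustrative assumptions, not anything from a real service; the curve name `secp256r1` is the JDK's name for P-256.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.spec.ECGenParameterSpec;

// Hypothetical class name; only standard JDK APIs are used.
class SignVerifySketch {

    // Generate an EC key pair on P-256 (called secp256r1 in the JDK).
    static KeyPair newP256KeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("EC");
        generator.initialize(new ECGenParameterSpec("secp256r1"));
        return generator.generateKeyPair();
    }

    // Sign the raw payload bytes. SHA256withECDSA hashes internally,
    // so the payload is passed as-is, not pre-hashed.
    static byte[] sign(PrivateKey privateKey, byte[] payload) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(privateKey);
        signer.update(payload);
        return signer.sign();
    }

    // Verify using only the public key; the private key is never needed here.
    static boolean verify(PublicKey publicKey, byte[] payload, byte[] signature)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(publicKey);
        verifier.update(payload);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        KeyPair pair = newP256KeyPair();
        byte[] claim = "this public key belongs to example.com".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(pair.getPrivate(), claim);

        byte[] tampered = "this public key belongs to evil.example".getBytes(StandardCharsets.UTF_8);
        System.out.println("original verifies: " + verify(pair.getPublic(), claim, signature));
        System.out.println("tampered verifies: " + verify(pair.getPublic(), tampered, signature));
    }
}
```

A CA does essentially the same thing when it issues a certificate: the "payload" is the structured claim about who owns a public key, and the signature is what clients later check against the CA's public key.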
Don't hash the payload yourself with MessageDigest.digest() and then pass the result to Signature.update(). SHA256withECDSA already hashes internally, so you'll be signing a hash of a hash, and any verifier that passes the raw payload will reject it every time. The symptom is a valid-looking signature that always returns false on verify. Use the combined algorithm string (SHA256withECDSA, SHA256withRSA) and pass the raw payload bytes.
The TLS Handshake: What Actually Happens When Your Browser Connects
Every time you visit an HTTPS URL, a precise cryptographic handshake happens before a single byte of application data is exchanged. Understanding this handshake is the single most important thing for debugging certificate errors in production — because every error message you'll ever see maps to a specific step that failed.
Here's the TLS 1.3 handshake (the current standard — TLS 1.2 had more round trips and weaker cipher negotiation):
Step 1 — ClientHello: The client sends a list of supported cipher suites, a random nonce, the SNI hostname (api.example.com), and a list of supported TLS versions. This is plaintext — anyone on the network can read it. The SNI is what lets a single IP serve multiple certificates.
Step 2 — ServerHello + Certificate + CertificateVerify + Finished: The server picks a mutually supported cipher suite, sends its certificate chain (leaf + intermediates, NOT the root), and proves it holds the private key by signing a transcript hash (CertificateVerify). All of this arrives in a single flight in TLS 1.3, which is why the full handshake needs only one round trip.
Step 3 — Client verifies the chain: The client walks the chain from leaf to root: verify each certificate's signature against its issuer's public key, check that the leaf's SAN matches the hostname, check that no certificate in the chain has expired, check that the root is in the client's trust store. If any check fails, the handshake aborts.
Step 4 — Client Finished: The client replies with its own Finished message, protected by the derived handshake keys. Once each side has verified the other's Finished message, the handshake is complete and application data flows.
TLS 1.3 completes in 1-RTT (one round trip) after TCP handshake. TLS 1.2 required 2-RTT. This matters for latency-sensitive services — every millisecond counts when you're processing thousands of connections per second.
The critical thing: step 3 is where 90% of production certificate errors surface. The server sends an incomplete chain? Step 3 fails. The leaf cert's SAN doesn't match the hostname? Step 3 fails. The root CA isn't in the client's trust store? Step 3 fails. The certificate expired two hours ago? Step 3 fails. Almost every certificate error message you'll see is a failure at step 3.
A common production gotcha: Server Name Indication (SNI) mismatch. If you connect to an IP address, or your client omits the SNI extension, the server may return a default certificate that doesn't match your hostname. Java's HttpsURLConnection sends SNI by default when you connect by hostname, but some older libraries and low-level socket tools do not. The symptom is a hostname-mismatch or handshake failure (sometimes a 403 from the default virtual host) even though the certificate you expected is valid. Always include SNI when debugging programmatically.
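For debugging, it helps to see what the server actually sends for a given SNI value. The sketch below uses the JDK's JSSE classes to set SNI explicitly, turn on hostname verification, and print the negotiated protocol and the chain the server presented. The class name `HandshakeProbe` and the default host `example.com` are illustrative assumptions.

```java
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.util.Collections;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical debugging helper; only standard JSSE APIs are used.
class HandshakeProbe {

    // Set SNI explicitly and enable HTTPS-style hostname checks,
    // which a raw SSLSocket does not perform unless asked to.
    static SSLParameters withSni(SSLParameters params, String host) {
        params.setServerNames(Collections.<SNIServerName>singletonList(new SNIHostName(host)));
        params.setEndpointIdentificationAlgorithm("HTTPS");
        return params;
    }

    // Connect, run the handshake, and print what the server presented.
    // A chain problem surfaces as an exception from startHandshake().
    static void probe(String host, int port) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.setSSLParameters(withSni(socket.getSSLParameters(), host));
            socket.startHandshake();

            SSLSession session = socket.getSession();
            System.out.println(session.getProtocol() + " / " + session.getCipherSuite());
            for (Certificate cert : session.getPeerCertificates()) {
                X509Certificate x509 = (X509Certificate) cert;
                System.out.println("subject=" + x509.getSubjectX500Principal()
                        + " issuer=" + x509.getIssuerX500Principal()
                        + " notAfter=" + x509.getNotAfter());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        probe(args.length > 0 ? args[0] : "example.com", 443);
    }
}
```

If the printed chain contains only one certificate, the server is most likely not sending its intermediates, which is worth ruling out before looking at the client's trust store.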
- Round 1: Client announces capabilities (ciphers, versions, SNI). Server responds with chosen cipher, its certificate chain, proof of private key ownership, and its own Finished.
- Round 2: Client verifies the chain (signature, SAN, expiry, trust store) and sends its Finished, encrypted with the shared secret.
- If any step fails, the handshake aborts with a specific error — every error message maps to exactly one step.
- Most production failures happen during chain verification (step 3) — missing intermediate, expired root, SAN mismatch.
Certificate Anatomy: Reading Every Field That Matters
An X.509 certificate is a structured document defined by RFC 5280. It's encoded in DER (binary) format under the hood, but you'll usually see it as PEM (base64 with BEGIN/END markers). Every field has a purpose, and understanding them is essential for debugging.
Subject — Who this certificate belongs to. Contains the Common Name (CN), Organization (O), Country (C), etc. For a server cert, the CN is typically the hostname — but modern TLS ignores CN and checks SAN instead.
Issuer — Who signed this certificate. For a leaf cert, this is the Intermediate CA. For an Intermediate, this is the Root CA. For a self-signed Root, Subject == Issuer.
Subject Alternative Names (SAN) — The list of hostnames and IP addresses this certificate is valid for. This is what modern clients actually check first; the CN is at best a legacy fallback. A cert with CN=api.example.com but no SAN entry for api.example.com fails in Chrome 58+ and in any other client that has dropped the CN fallback.
Validity Period — Not Before and Not After timestamps. The certificate is only valid between these dates. Expired certs are the #1 cause of PKI incidents in production.
Serial Number — A unique identifier assigned by the issuing CA. Used by CRLs to identify revoked certificates.
Key Usage — Bit flags indicating what the key can be used for. The critical ones: digitalSignature (signing data), keyCertSign (signing other certificates — only for CAs), keyEncipherment (RSA key exchange). If keyCertSign is set on a leaf cert, something is very wrong.
Extended Key Usage (EKU) — More specific usage restrictions. serverAuth means 'this cert can authenticate a TLS server.' clientAuth means 'this cert can authenticate a TLS client' (used in mTLS). codeSigning means 'this cert can sign executable code.' A cert with only serverAuth should not be accepted for mTLS client authentication.
Authority Information Access (AIA) — URLs where the client can fetch the issuing CA's certificate if it's missing from the chain. Desktop browsers often use this to silently recover from incomplete chains. Many mobile clients do not, and neither does the JVM by default.
CRL Distribution Points — URLs where the client can download the Certificate Revocation List to check if this certificate has been revoked. Largely replaced by short-lived certs in modern architectures.
Basic Constraints — CA:TRUE or CA:FALSE. Indicates whether this certificate can sign other certificates. Leaf certs must be CA:FALSE. If a leaf cert has CA:TRUE, any attacker who gets the private key can mint arbitrary certificates that chain to it.
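The fields above map directly onto getters of the JDK's `X509Certificate` class. The following is a rough sketch of a summary helper, assuming you already have a parsed certificate in hand; the class and method names are made up for illustration, and the Key Usage bit order follows RFC 5280.

```java
import java.security.cert.CertificateParsingException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for eyeballing a certificate; names are illustrative.
class CertFieldsSketch {

    // Key Usage bit positions in the order RFC 5280 defines them.
    private static final String[] KEY_USAGE_NAMES = {
        "digitalSignature", "nonRepudiation", "keyEncipherment", "dataEncipherment",
        "keyAgreement", "keyCertSign", "cRLSign", "encipherOnly", "decipherOnly"
    };

    // Translate the boolean[] from getKeyUsage() into readable names.
    static List<String> keyUsageNames(boolean[] bits) {
        List<String> names = new ArrayList<>();
        if (bits == null) {
            return names; // extension absent
        }
        for (int i = 0; i < bits.length && i < KEY_USAGE_NAMES.length; i++) {
            if (bits[i]) {
                names.add(KEY_USAGE_NAMES[i]);
            }
        }
        return names;
    }

    // Collect dNSName (type 2) and iPAddress (type 7) SAN entries.
    static List<String> dnsAndIpSans(X509Certificate cert) throws CertificateParsingException {
        List<String> sans = new ArrayList<>();
        Collection<List<?>> entries = cert.getSubjectAlternativeNames();
        if (entries == null) {
            return sans; // no SAN extension at all
        }
        for (List<?> entry : entries) {
            int type = (Integer) entry.get(0);
            if (type == 2 || type == 7) {
                sans.add(String.valueOf(entry.get(1)));
            }
        }
        return sans;
    }

    static String summarize(X509Certificate cert) throws CertificateParsingException {
        return "subject=" + cert.getSubjectX500Principal().getName()
                + "\nissuer=" + cert.getIssuerX500Principal().getName()
                + "\nserial=" + cert.getSerialNumber().toString(16)
                + "\nnotBefore=" + cert.getNotBefore()
                + "\nnotAfter=" + cert.getNotAfter()
                + "\nsan=" + dnsAndIpSans(cert)
                + "\nkeyUsage=" + keyUsageNames(cert.getKeyUsage())
                + "\nextendedKeyUsageOids=" + cert.getExtendedKeyUsage()
                // getBasicConstraints() returns -1 when the cert is not a CA
                + "\nisCa=" + (cert.getBasicConstraints() >= 0);
    }
}
```

Extended Key Usage comes back as OID strings rather than names: `1.3.6.1.5.5.7.3.1` is serverAuth and `1.3.6.1.5.5.7.3.2` is clientAuth. A leaf that prints `isCa=true` or lists `keyCertSign` is the red flag described above.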
Certificate Lifecycle: Rotation, Renewal, and the 3 AM Failure
The most reliable way to prevent the $340k outage from the incident above is to obsess over certificate lifecycle. Certificates expire. That's a feature, not a bug. But if you don't manage the lifecycle proactively, you'll get woken up at 3 AM when a certificate you forgot about stops working.
Expiry Monitoring is table stakes. Every certificate in your chain (root CA, intermediate CA, leaf) must have monitoring with alerts at 30, 14, 7, and 1 day before expiry. Don't just monitor leaf certs. The root CA may have a 10-year validity, but it will expire eventually, and when it does, every certificate that chains up to it stops validating.
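The alert schedule above can be sketched as a small pure function. This is only an illustration of the threshold logic: the class name, the return-value convention (0 for expired, -1 for no alert), and the idea of feeding it the earliest expiry in the chain are assumptions, and wiring it to a real alerting system is left out.

```java
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical monitoring helper; thresholds follow the 30/14/7/1-day schedule.
class ExpiryAlertSketch {

    // Checked tightest-first so the most urgent matching threshold wins.
    static final int[] THRESHOLD_DAYS = {1, 7, 14, 30};

    // Returns the tightest threshold (in days) that has been crossed,
    // 0 if the certificate has already expired, or -1 if no alert is due.
    static int alertThreshold(Instant notAfter, Instant now) {
        Duration remaining = Duration.between(now, notAfter);
        if (remaining.isZero() || remaining.isNegative()) {
            return 0;
        }
        for (int days : THRESHOLD_DAYS) {
            if (remaining.compareTo(Duration.ofDays(days)) <= 0) {
                return days;
            }
        }
        return -1;
    }

    // The chain is only as healthy as its soonest-expiring member,
    // so monitor the earliest notAfter across leaf, intermediates, and root.
    static Instant earliestExpiry(List<X509Certificate> chain) {
        Instant earliest = Instant.MAX;
        for (X509Certificate cert : chain) {
            Instant notAfter = cert.getNotAfter().toInstant();
            if (notAfter.isBefore(earliest)) {
                earliest = notAfter;
            }
        }
        return earliest;
    }
}
```

Calling `alertThreshold(earliestExpiry(chain), Instant.now())` on the full chain, root included, is what covers the case where the leaf is fresh but something above it is about to lapse.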
Automatic Renewal is the gold standard. Use ACME (Let's Encrypt) for public-facing certs or Vault PKI for internal certs. Configure renewal to happen automatically when the certificate has less than 30 days of validity. For Vault, set the TTL of issued certs to 24 hours and configure a renewal job that runs every 12 hours. Short-lived certs eliminate the need for OCSP/CRL entirely — a certificate that expires in 24 hours doesn't need revocation checking.
Dual-Cert Grace Period is critical for rotation without downtime. When renewing a certificate, deploy the new certificate alongside the old one (if your infrastructure supports it) or configure a grace period where the old cert is accepted. In mTLS environments, ensure clients can accept both old and new intermediates during the rotation window. Without this, a rotation can cause a cascading failure as all connections renegotiate simultaneously.
CI/CD Validation should reject any deployment that includes an expired or expiring certificate. Use a script that checks: chain length >= 2, SAN matches the intended hostname, expiry date > 30 days from now, and the root is in the expected trust store. This catches 99% of cert-related deployment failures before they hit production.
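As a sketch of what such a gate might look like, the function below takes already-extracted facts about a chain and returns a list of violations, so a pipeline step can fail when the list is non-empty. Everything here is illustrative: the class and method names are invented, the wildcard matching is a simplified version of the real hostname rules, and the trust-store check from the list above is deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical pre-deploy gate; inputs are assumed to be extracted from the chain beforehand.
class DeployGateSketch {

    // Simplified matching: exact match, or "*." covering exactly one left-most label.
    static boolean sanMatches(String san, String host) {
        String s = san.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (s.startsWith("*.")) {
            int firstDot = h.indexOf('.');
            return firstDot > 0 && h.substring(firstDot).equals(s.substring(1));
        }
        return s.equals(h);
    }

    // Returns one message per failed check; an empty list means the gate passes.
    static List<String> violations(int chainLength, List<String> sans, String host,
                                   Instant earliestNotAfter, Instant now) {
        List<String> problems = new ArrayList<>();
        if (chainLength < 2) {
            problems.add("chain has no intermediate (length " + chainLength + ")");
        }
        boolean matched = false;
        for (String san : sans) {
            matched |= sanMatches(san, host);
        }
        if (!matched) {
            problems.add("no SAN matches " + host);
        }
        if (Duration.between(now, earliestNotAfter).compareTo(Duration.ofDays(30)) <= 0) {
            problems.add("a certificate in the chain expires within 30 days");
        }
        return problems;
    }
}
```

Returning all violations at once, rather than failing on the first, tends to save a round trip through the pipeline when a bundle has more than one problem.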
Key Rotation for private keys: generate a new key pair and CSR at every renewal. Don't reuse the same private key across renewals: if the key has been compromised, issuing a new certificate over the same key doesn't help. The new certificate binds a new public key, and the old private key is destroyed.
The biggest lifecycle mistake is treating certificates as fire-and-forget. You generate one, install it, and never look at it again until something breaks. The fix is to treat certificates as ephemeral, automatically-renewed resources, just like AWS IAM keys or database passwords.
- Short-lived certs (24h TTL) force automatic renewal, eliminating expiry surprises.
- Long-lived certs require complex revocation infrastructure (CRL/OCSP) that often fails.
- Every cert should be tied to a monitoring alert and an automated renewal pipeline.
- If you're not renewing a cert every 30 days, you're not managing it — you're just hoping.
Trust Stores and Chain Validation: How Clients Decide to Trust
The final piece of PKI is the trust store: the list of root certificates that a client considers trustworthy. Your browser ships with or inherits a curated list of roughly 150 root CAs (Firefox, for example, uses Mozilla's root store). Your Java application has its own truststore (cacerts). Your mobile app uses the OS trust store. Each client decides independently what to trust.
The root of trust — literally. A root CA is self-signed: its Subject and Issuer are the same. The client trusts it because it's hardcoded into the trust store, not because it was verified by a higher authority. That means trust is ultimately a social and operational decision: you trust DigiCert because your OS vendor vetted them.
Chain validation step-by-step:
1. Start with the leaf certificate and check that its SAN matches the hostname being connected to.
2. Read the current certificate's Issuer field and look for a certificate in the chain (or trust store) whose Subject matches it.
3. Verify the current certificate's signature using the issuer's public key.
4. Check that the current certificate is within its validity period.
5. Check that the current certificate has not been revoked (via CRL or OCSP).
6. Move up to the issuer and repeat steps 2-5 until you reach a root in the trust store.
7. If you reach a trusted root, the chain is valid. If you run out of issuers first, it is not.
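The JDK can run the signature, validity, and trust-anchor parts of this walk for you through its PKIX validator. The sketch below assumes the chain is ordered leaf first with the root excluded, and that revocation is handled some other way (for example by short-lived certs), so revocation checking is switched off. The class and method names are illustrative. Hostname/SAN matching is not part of PKIX path validation and has to be done separately.

```java
import java.security.GeneralSecurityException;
import java.security.cert.CertPath;
import java.security.cert.CertPathValidator;
import java.security.cert.CertificateFactory;
import java.security.cert.PKIXCertPathValidatorResult;
import java.security.cert.PKIXParameters;
import java.security.cert.TrustAnchor;
import java.security.cert.X509Certificate;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper around the JDK's PKIX validator.
class ChainValidationSketch {

    // chain: leaf first, then intermediates; the root itself is not included.
    // Returns the trusted root the chain anchored to, or throws if validation fails.
    static X509Certificate validate(List<X509Certificate> chain, Set<X509Certificate> trustedRoots)
            throws GeneralSecurityException {
        Set<TrustAnchor> anchors = new HashSet<>();
        for (X509Certificate root : trustedRoots) {
            anchors.add(new TrustAnchor(root, null));
        }

        // Throws InvalidAlgorithmParameterException if the anchor set is empty.
        PKIXParameters params = new PKIXParameters(anchors);
        // Assumption: revocation is covered elsewhere, so CRL/OCSP checks are disabled.
        params.setRevocationEnabled(false);

        CertPath path = CertificateFactory.getInstance("X.509").generateCertPath(chain);
        PKIXCertPathValidatorResult result = (PKIXCertPathValidatorResult)
                CertPathValidator.getInstance("PKIX").validate(path, params);
        return result.getTrustAnchor().getTrustedCert();
    }
}
```

An expired or missing link surfaces as a `CertPathValidatorException`, whose `getIndex()` reports the position in the chain that failed. That is often the fastest way to tell a bad leaf from a bad intermediate.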
The most common trust store failure is a private CA whose root isn't in the client's trust store. You generate a CA, sign a leaf, install the leaf on the server, and forget to add the CA to the client trust store. The handshake fails because the root isn't trusted. This is especially painful in mTLS where both sides need to trust each other's CA.
Java's default truststore is located at $JAVA_HOME/lib/security/cacerts (on Java 8 it sits under $JAVA_HOME/jre/lib/security/cacerts). It contains root CAs from commercial and public CAs. If you're using an internal CA, you must import its root into cacerts using keytool, or point the JVM at a separate truststore with -Djavax.net.ssl.trustStore. Many teams forget this step and spend hours debugging 'PKIX path building failed' because the root isn't in the JVM trust store.
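An alternative to editing cacerts with keytool is to build a dedicated trust store in code and hand it to an SSLContext. This is a different technique from the keytool import described above, sketched here under the assumption that the internal root certificates have already been loaded as `X509Certificate` objects; the class name, method names, and aliases are illustrative.

```java
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.Map;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper: a per-application trust store instead of a modified cacerts.
class TrustStoreSketch {

    // Build an in-memory trust store holding only the given roots, keyed by alias.
    static KeyStore inMemoryTrustStore(Map<String, X509Certificate> rootsByAlias)
            throws GeneralSecurityException, IOException {
        KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
        store.load(null, null); // initialise an empty store
        for (Map.Entry<String, X509Certificate> entry : rootsByAlias.entrySet()) {
            store.setCertificateEntry(entry.getKey(), entry.getValue());
        }
        return store;
    }

    // Create an SSLContext whose trust decisions come only from that store.
    static SSLContext contextTrusting(KeyStore trustStore) throws GeneralSecurityException {
        TrustManagerFactory factory =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        factory.init(trustStore);
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, factory.getTrustManagers(), null);
        return context;
    }
}
```

A context built this way trusts only the roots you put in it, so it suits internal service-to-service calls. A client that also talks to public endpoints would still need the default roots for those connections.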
Mobile clients have stricter trust stores. iOS and Android ship with their own lists. If you're using a private CA for a mobile app, you must distribute the CA certificate to the device — either during app installation or via a profile. Without it, the app will fail to connect to your API, and the error message will be cryptic: 'SSL handshake failed' with no details.
The $340k Payment Outage: An Expired Internal CA Certificate
- Every certificate in your chain must be monitored for expiry — including root CAs and intermediates, not just leaf certs.
- Certificate management must be centralized. Different teams or tools issuing different parts of the chain leads to blind spots.
- Add certificate chain validation (length, expiry, SAN check) to your CI/CD pipeline. Don't rely on runtime monitoring alone.
Key takeaways
Common mistakes to avoid
Deploying certificates without a chain check (the PKI equivalent of using depends_on without a healthcheck in Docker Compose)
Only monitoring leaf certificate expiry, ignoring root and intermediate CAs
Assuming CN is used for hostname matching
Self-signed certificates without distributing the CA to clients
Interview Questions on This Topic
Explain the TLS 1.3 handshake and how it differs from TLS 1.2. Where does certificate chain validation fit?
Frequently Asked Questions