Expired Root CA Broke mTLS: A $340k PKI Lesson
PKIX path building failed? An expired internal CA root cost $340k.
20+ years shipping performance-critical code where algorithms decide the bill. Notes here come from systems that actually shipped.
- PKI is a hierarchy of trust: Root CA → Intermediate CA → Leaf certificate
- TLS handshake verifies every link in that chain before data flows
- Missing intermediate is the #1 production failure — mobile clients fail silently
- Use short-lived certs (24h TTL) over OCSP/CRL for internal services
- Always put hostnames in SAN; CN-only matching died in 2017
- Automatic renewal (ACME/Vault) prevents 90% of expiry incidents
Imagine every business in a city gets a laminated ID badge from the mayor's office. When you walk into a shop, you don't know the shopkeeper personally — but you trust the mayor, and the badge proves the mayor vouched for them. PKI is that exact system for the internet: a chain of vouching, anchored at a root everyone agrees to trust. The twist is that your browser doesn't trust 'the internet' — it trusts a specific, hardcoded list of mayors baked into your OS. If your mayor isn't on that list, the whole thing collapses, no matter how legitimate your badge is.
The chain works like this: the Root CA (the mayor) signs an Intermediate CA's badge (a deputy mayor). The Intermediate CA signs your server's badge. When your browser connects, it checks: 'Is this badge signed by someone I trust? And is THAT person signed by the root I trust?' If both answers are yes, the connection proceeds. If any link in that chain is missing or expired, the connection fails — even if the leaf certificate itself is perfectly valid.
A fintech startup I consulted for lost six hours of payment processing because a certificate issued by their internal CA expired at 2:47 AM on a Tuesday. Their monitoring caught nothing — the service didn't crash, it just silently rejected every TLS handshake with PKIX path building failed: unable to find valid certification path to requested target. Six engineers stared at perfectly healthy application logs while $340k in transactions queued. PKI didn't fail loudly. It failed quietly, at the edges, in a way nobody had written a runbook for.
PKI — Public Key Infrastructure — is the trust plumbing underneath every HTTPS connection, every signed JWT, every mTLS service mesh, and every code-signing pipeline you've ever touched. It answers one deceptively hard question: how do two strangers on a network prove to each other that they are who they claim to be, without having met before? Symmetric keys don't scale — you can't pre-share a secret with every website on the internet. PKI solves this with asymmetric cryptography layered over a hierarchy of trusted authorities. Get it right and it's invisible. Get it wrong and you're the 3 AM war room.
After this you'll be able to read a certificate chain and understand exactly what each field means and why it matters, trace a TLS handshake step by step and know where it can break, build and rotate certificates in a real service without downtime, debug the six most common certificate errors without guessing, and design an internal PKI for a microservices environment that won't bite you six months later.
Why Public Key Infrastructure Is a Chain of Trust, Not a Certificate
Public Key Infrastructure (PKI) is the system of policies, hardware, and software that binds public keys to identities through a hierarchy of Certificate Authorities (CAs). The core mechanic: a root CA signs intermediate CA certificates, which then sign end-entity certificates, forming a trust chain. Every TLS/mTLS connection validates this chain from leaf back to a trusted root — if any link is expired, revoked, or misconfigured, the entire handshake fails.
In practice, PKI relies on two properties: chain validation and revocation checking. Chain validation walks from the server certificate up to a root CA whose public key is in the client's trust store. Revocation checking (via CRL or OCSP) ensures no certificate in the chain has been invalidated before its expiry. A common mistake is assuming root CA expiry doesn't matter. Many validators (OpenSSL among them) check the trust anchor's validity period too, so an expired root can break validation even when every intermediate and leaf cert is still valid.
Use PKI whenever you need authenticated, encrypted communication between services, especially in zero-trust networks, microservices, or IoT. Without PKI, you're vulnerable to man-in-the-middle attacks and identity spoofing. In production, a single expired root CA can cascade into a full mTLS outage, costing teams hours of debugging and thousands in incident response.
Asymmetric Cryptography: The Math That Makes Trust Possible
Before PKI existed, encrypting traffic between two servers meant pre-sharing a secret key out-of-band — email it, phone it in, bake it into a config file checked into git (yes, this still happens). That doesn't scale, and it means anyone who intercepts the key exchange owns all past and future traffic. The entire premise of PKI is that you can publish a key openly, and doing so doesn't compromise you.
Asymmetric cryptography gives you a key pair: a public key you broadcast freely and a private key you guard with your life. Anything encrypted with the public key can only be decrypted by the matching private key. More importantly for PKI, anything signed with the private key can be verified by anyone holding the public key — without the verifier ever touching the private key. That second property is what makes certificates work.
A certificate is just a structured document that says: 'This public key belongs to example.com, and I, DigiCert, am signing this claim with my own private key.' Your browser holds DigiCert's public key (via the trust store), verifies DigiCert's signature, and concludes the public key genuinely belongs to example.com. The private key for example.com never travels over the wire. Not once. That's the whole trick.
RSA-2048 was the default for years. Today you should default to ECDSA P-256 — smaller keys, faster handshakes, equivalent or better security. RSA-4096 is not meaningfully more secure than RSA-2048 against current threats, but it's noticeably slower. Don't reach for it unless a compliance checkbox forces your hand.
But there's a subtle failure mode most engineers miss: key management. If your private key is stored on a disk that gets backed up to S3 with world-readable permissions — and that backup is leaked — the entire PKI collapses. The certificate says the key belongs to example.com, but anyone who holds the private key can impersonate example.com. The certificate is still valid; the trust is broken. That's why hardware security modules (HSMs) or at minimum, encrypted keystores with strict access control, are non-negotiable. You don't just protect secrets; you protect the private keys that underpin your entire trust model.
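To make the sign-and-verify property concrete, here is a minimal Java sketch using only the JDK. It generates a throwaway ECDSA P-256 key pair, signs a payload with the private key, and verifies with the public key. The class name, method names, and sample payload are illustrative assumptions, not from any real service. The last line of main shows what happens if the payload is hashed by hand before signing.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.spec.ECGenParameterSpec;

// Hypothetical demo class; key pair is generated in memory and thrown away.
final class SignVerifySketch {

    // secp256r1 is the JDK's name for the P-256 curve.
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("EC");
            gen.initialize(new ECGenParameterSpec("secp256r1"));
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Signs the raw payload; SHA256withECDSA does the hashing itself.
    static byte[] sign(PrivateKey key, byte[] payload) {
        try {
            Signature s = Signature.getInstance("SHA256withECDSA");
            s.initSign(key);
            s.update(payload);
            return s.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Anyone holding the public key can run this; the private key is never needed.
    static boolean verify(PublicKey key, byte[] payload, byte[] signature) {
        try {
            Signature s = Signature.getInstance("SHA256withECDSA");
            s.initVerify(key);
            s.update(payload);
            return s.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair kp = newKeyPair();
        byte[] payload = "pay 100 EUR to account 42".getBytes(StandardCharsets.UTF_8);
        byte[] sig = sign(kp.getPrivate(), payload);
        System.out.println("raw payload verifies: " + verify(kp.getPublic(), payload, sig));

        byte[] tampered = "pay 900 EUR to account 42".getBytes(StandardCharsets.UTF_8);
        System.out.println("tampered payload verifies: " + verify(kp.getPublic(), tampered, sig));

        // Pitfall: signing a hand-computed hash, then verifying against the raw payload.
        byte[] preHashedSig = sign(kp.getPrivate(), sha256(payload));
        System.out.println("pre-hashed signature verifies: " + verify(kp.getPublic(), payload, preHashedSig));
    }
}
```

A CA does the same thing when it issues a certificate: the "payload" is the certificate body, and the verifier is any client holding the CA's public key.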
Don't hash the payload yourself with MessageDigest.digest() and then pass the result to Signature.update(). SHA256withECDSA already hashes internally, so you'll be signing a hash of a hash, and any verifier that feeds in the raw payload will reject the signature every time. The symptom is a valid-looking signature that always returns false on verify. Use the combined algorithm string (SHA256withECDSA, SHA256withRSA) and pass the raw payload bytes.
The TLS Handshake: What Actually Happens When Your Browser Connects
Every time you visit an HTTPS URL, a precise cryptographic handshake happens before a single byte of application data is exchanged. Understanding this handshake is the single most important thing for debugging certificate errors in production — because every error message you'll ever see maps to a specific step that failed.
Here's the TLS 1.3 handshake (the current standard — TLS 1.2 had more round trips and weaker cipher negotiation):
Step 1: ClientHello. The client sends a list of supported cipher suites, a random nonce, its key share, the SNI hostname (api.example.com), and a list of supported TLS versions. This is plaintext: anyone on the network can read it. The SNI is what lets a single IP serve multiple certificates.
Step 2: ServerHello + Certificate + CertificateVerify + Finished. The server picks a mutually supported cipher suite, sends its certificate chain (leaf + intermediates, NOT the root), and proves it holds the private key by signing a transcript hash (CertificateVerify). All of this arrives in a single server flight in TLS 1.3, which is why the full handshake needs only one round trip.
Step 3: Client verifies the chain. The client walks the chain from leaf to root: verify each certificate's signature against its issuer's public key, check that the leaf's SAN matches the hostname, check that no certificate in the chain has expired, check that the root is in the client's trust store. If any check fails, the handshake aborts.
Step 4: Finished. The server's Finished already arrived with its first flight; the client now sends its own Finished, protected with the derived handshake keys. Once both Finished messages verify, the handshake is complete and application data flows.
TLS 1.3 completes in 1-RTT (one round trip) after TCP handshake. TLS 1.2 required 2-RTT. This matters for latency-sensitive services — every millisecond counts when you're processing thousands of connections per second.
The critical thing: step 3 is where the overwhelming majority of production certificate errors surface. The server sends an incomplete chain? Step 3 fails. The leaf cert's SAN doesn't match the hostname? Step 3 fails. The root CA isn't in the client's trust store? Step 3 fails. The certificate expired two hours ago? Step 3 fails. Nearly every certificate error message you'll see is a failure at step 3.
A common production gotcha: Server Name Indication (SNI) mismatch. If you connect to an IP address without the SNI extension, the server may return a default certificate that doesn't match your hostname. Java's HttpsURLConnection sends SNI by default when you connect by hostname, but some older libraries and low-level socket tools do not. The symptom is a 403 or handshake failure even though the certificate is valid. Always include SNI when debugging programmatically.
- Round 1: Client announces capabilities (ciphers, versions, SNI). Server responds with chosen cipher, its certificate chain, and proof of private key ownership.
- Round 2: Client verifies the chain (signature, SAN, expiry, trust store) and sends Finished encrypted with shared secret. Server sends Finished.
- If any step fails, the handshake aborts with a specific error — every error message maps to exactly one step.
- Most production failures happen during chain verification (step 3) — missing intermediate, expired root, SAN mismatch.
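For the SNI gotcha above, a minimal JSSE sketch of how a client can set the SNI name explicitly when it connects to a raw address. The class and method names are illustrative assumptions, and the address, port, and hostname are whatever you pass on the command line. It also asks JSSE to run HTTPS-style hostname verification, which a bare SSLSocket does not do on its own.

```java
import java.io.IOException;
import java.util.List;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical debugging helper, not a production client.
final class SniHandshakeSketch {

    // Builds parameters that send the given hostname as SNI and enable hostname verification.
    static SSLParameters paramsFor(String hostname) {
        List<SNIServerName> names = List.of(new SNIHostName(hostname));
        SSLParameters params = new SSLParameters();
        params.setServerNames(names);
        params.setEndpointIdentificationAlgorithm("HTTPS");
        return params;
    }

    // Connects to address:port, handshakes with the given SNI name, and summarizes the session.
    static String handshake(String address, int port, String hostname) throws IOException {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(address, port)) {
            socket.setSSLParameters(paramsFor(hostname));
            socket.startHandshake(); // throws SSLHandshakeException if chain validation fails
            SSLSession session = socket.getSession();
            return session.getProtocol() + " / " + session.getCipherSuite()
                    + " / certs sent by server: " + session.getPeerCertificates().length;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 3) {
            System.out.println(handshake(args[0], Integer.parseInt(args[1]), args[2]));
        } else {
            System.out.println("usage: <address> <port> <sni-hostname>");
        }
    }
}
```

If the reported certificate count is 1 for a publicly issued certificate, the server is probably not sending its intermediates.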
Certificate Anatomy: Reading Every Field That Matters
An X.509 certificate is a structured document defined by RFC 5280. It's encoded in DER (binary) format under the hood, but you'll usually see it as PEM (base64 with BEGIN/END markers). Every field has a purpose, and understanding them is essential for debugging.
Subject: Who this certificate belongs to. Contains the Common Name (CN), Organization (O), Country (C), etc. For a server cert, the CN is typically the hostname, but modern TLS clients ignore CN and check SAN instead.
Issuer: Who signed this certificate. For a leaf cert, this is the Intermediate CA. For an Intermediate, this is the Root CA. For a self-signed Root, Subject == Issuer.
Subject Alternative Names (SAN): The list of hostnames and IP addresses this certificate is valid for. This is what modern browsers and Java clients actually check, not the CN. A cert with CN=api.example.com but no SAN for api.example.com will fail in Chrome 58+ and in any other client that has dropped the CN fallback. A few older stacks still fall back to CN when no DNS SAN is present, so don't count on consistent behavior.
Validity Period: Not Before and Not After timestamps. The certificate is only valid between these dates. Expired certs are one of the most common causes of PKI incidents in production.
Serial Number: A unique identifier assigned by the issuing CA. CRLs and OCSP responses use it to identify revoked certificates.
Key Usage: Bit flags indicating what the key can be used for. The critical ones: digitalSignature (signing data), keyCertSign (signing other certificates, only for CAs), keyEncipherment (RSA key exchange). If keyCertSign is set on a leaf cert, something is very wrong.
Extended Key Usage (EKU): More specific usage restrictions. serverAuth means 'this cert can authenticate a TLS server.' clientAuth means 'this cert can authenticate a TLS client' (used in mTLS). codeSigning means 'this cert can sign executable code.' A cert with only serverAuth should not be accepted for mTLS client authentication.
Authority Information Access (AIA): URLs where the client can fetch the issuing CA's certificate if it's missing from the chain. Desktop browsers often use this to silently recover from incomplete chains. Many mobile clients and the JVM (by default) do not.
CRL Distribution Points: URLs where the client can download the Certificate Revocation List to check if this certificate has been revoked. Largely replaced by short-lived certs in modern architectures.
Basic Constraints: CA:TRUE or CA:FALSE. Indicates whether this certificate can sign other certificates. Leaf certs must be CA:FALSE. If a leaf cert has CA:TRUE, any attacker who gets the private key can mint arbitrary certificates that chain to it.
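The SAN rule is the field that most often trips people up, so here is a rough Java sketch of it. It pulls the dNSName entries out of an X509Certificate and matches a hostname against them, with a simplified wildcard rule (the wildcard must be the whole left-most label and covers exactly one label). The class and method names are assumptions for illustration, and real verifiers handle more cases (IP SANs, internationalized names) than this does.

```java
import java.security.cert.CertificateParsingException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

// Hypothetical helper illustrating SAN-based hostname matching.
final class SanMatchSketch {
    private static final int SAN_DNS_NAME = 2; // GeneralName type code for dNSName

    // Collects the DNS names listed in the certificate's SAN extension (empty if none).
    static List<String> dnsNames(X509Certificate cert) throws CertificateParsingException {
        List<String> names = new ArrayList<>();
        Collection<List<?>> sans = cert.getSubjectAlternativeNames();
        if (sans == null) {
            return names;
        }
        for (List<?> entry : sans) {
            if (((Integer) entry.get(0)) == SAN_DNS_NAME) {
                names.add((String) entry.get(1));
            }
        }
        return names;
    }

    // Case-insensitive match; "*.example.com" matches one extra label, never the bare domain.
    static boolean matches(String pattern, String hostname) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = hostname.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return p.equals(h);
        }
        String suffix = p.substring(1); // ".example.com"
        if (!h.endsWith(suffix)) {
            return false;
        }
        String firstLabel = h.substring(0, h.length() - suffix.length());
        return !firstLabel.isEmpty() && !firstLabel.contains(".");
    }

    static boolean anyMatch(List<String> sanDnsNames, String hostname) {
        for (String name : sanDnsNames) {
            if (matches(name, hostname)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sans = List.of("example.com", "*.example.com");
        for (String host : List.of("api.example.com", "example.com", "a.b.example.com")) {
            System.out.println(host + " -> " + anyMatch(sans, host));
        }
    }
}
```

Note that the CN never enters the picture here, which is the point: a certificate with the right CN and an empty SAN list matches nothing.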
Certificate Lifecycle: Rotation, Renewal, and the 3 AM Failure
The most reliable way to prevent the $340k outage from the incident above is to obsess over certificate lifecycle. Certificates expire. That's a feature, not a bug. But if you don't manage the lifecycle proactively, you'll get woken up at 3 AM when a certificate you forgot about stops working.
Expiry Monitoring is table stakes. Every certificate in your chain — root CA, intermediate CA, leaf — must have monitoring with alerts at 30, 14, 7, and 1 day before expiry. Don't just monitor leaf certs. The root CA may have a 10-year validity, but it will expire eventually, and when it does, every leaf signed by it becomes invalid.
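The alert schedule above is simple enough to sketch in a few lines of Java. This is an illustration only: the class name and the return convention (0 for expired, -1 for "no alert yet") are assumptions, and a real monitor would feed it every certificate in the chain, roots included.

```java
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper mapping time-to-expiry onto the 30/14/7/1 day alert tiers.
final class ExpiryAlertSketch {
    // Thresholds in days, tightest first.
    private static final int[] THRESHOLDS_DAYS = {1, 7, 14, 30};

    // Whole days between now and notAfter, truncated toward zero.
    static long daysRemaining(Instant notAfter, Instant now) {
        return Duration.between(now, notAfter).toDays();
    }

    // Returns 0 if already expired, the tightest threshold crossed (1, 7, 14 or 30),
    // or -1 if no alert is due yet.
    static int alertLevel(Instant notAfter, Instant now) {
        if (!now.isBefore(notAfter)) {
            return 0;
        }
        long days = daysRemaining(notAfter, now);
        for (int threshold : THRESHOLDS_DAYS) {
            if (days <= threshold) {
                return threshold;
            }
        }
        return -1;
    }

    // Convenience overload for a parsed certificate (leaf, intermediate, or root).
    static int alertLevel(X509Certificate cert, Instant now) {
        return alertLevel(cert.getNotAfter().toInstant(), now);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println("10 days left -> level " + alertLevel(now.plus(Duration.ofDays(10)), now));
        System.out.println("90 days left -> level " + alertLevel(now.plus(Duration.ofDays(90)), now));
        System.out.println("expired      -> level " + alertLevel(now.minusSeconds(60), now));
    }
}
```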
Automatic Renewal is the gold standard. Use ACME (Let's Encrypt) for public-facing certs or Vault PKI for internal certs. Configure renewal to happen automatically when the certificate has less than 30 days of validity. For Vault, set the TTL of issued certs to 24 hours and configure a renewal job that runs every 12 hours. Short-lived certs eliminate the need for OCSP/CRL entirely — a certificate that expires in 24 hours doesn't need revocation checking.
Dual-Cert Grace Period is critical for rotation without downtime. When renewing a certificate, deploy the new certificate alongside the old one (if your infrastructure supports it) or configure a grace period where the old cert is accepted. In mTLS environments, ensure clients can accept both old and new intermediates during the rotation window. Without this, a rotation can cause a cascading failure as all connections renegotiate simultaneously.
CI/CD Validation should reject any deployment that includes an expired or expiring certificate. Use a script that checks: chain length >= 2, SAN matches the intended hostname, expiry date > 30 days from now, and the root is in the expected trust store. This catches 99% of cert-related deployment failures before they hit production.
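A rough sketch of such a CI gate in Java, assuming the pipeline hands it a PEM bundle (leaf first) and the intended hostname. The class name, constants, and file layout are illustrative assumptions. To stay short it only does an exact, case-insensitive SAN comparison and does not check that the root is in the expected trust store, so treat it as a starting point rather than a complete gate.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.cert.Certificate;
import java.security.cert.CertificateException;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical pre-deploy check; exits non-zero when the bundle should be rejected.
final class CertGateSketch {
    static final int MIN_CHAIN_LENGTH = 2;
    static final long MIN_DAYS_LEFT = 30;

    // Pure decision logic, kept separate so it can be unit tested without files.
    static List<String> violations(int chainLength, long fewestDaysLeft, boolean sanMatchesHost) {
        List<String> out = new ArrayList<>();
        if (chainLength < MIN_CHAIN_LENGTH) {
            out.add("chain has " + chainLength + " cert(s), expected at least " + MIN_CHAIN_LENGTH);
        }
        if (fewestDaysLeft < MIN_DAYS_LEFT) {
            out.add("a cert in the chain has only " + fewestDaysLeft + " day(s) left");
        }
        if (!sanMatchesHost) {
            out.add("leaf SAN does not list the intended hostname");
        }
        return out;
    }

    // Parses a PEM bundle (leaf first) and applies the checks above.
    static List<String> check(Path pemBundle, String hostname, Instant now)
            throws IOException, CertificateException {
        List<X509Certificate> chain = new ArrayList<>();
        try (InputStream in = Files.newInputStream(pemBundle)) {
            for (Certificate c : CertificateFactory.getInstance("X.509").generateCertificates(in)) {
                chain.add((X509Certificate) c);
            }
        }
        long fewestDays = chain.isEmpty() ? 0 : Long.MAX_VALUE;
        for (X509Certificate cert : chain) {
            long days = Duration.between(now, cert.getNotAfter().toInstant()).toDays();
            fewestDays = Math.min(fewestDays, days);
        }
        boolean sanOk = false;
        if (!chain.isEmpty()) {
            Collection<List<?>> sans = chain.get(0).getSubjectAlternativeNames();
            if (sans != null) {
                for (List<?> san : sans) {
                    // Type 2 is dNSName; exact match only in this sketch.
                    if (((Integer) san.get(0)) == 2 && hostname.equalsIgnoreCase((String) san.get(1))) {
                        sanOk = true;
                    }
                }
            }
        }
        return violations(chain.size(), fewestDays, sanOk);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: <chain.pem> <hostname>");
            return;
        }
        List<String> problems = check(Paths.get(args[0]), args[1], Instant.now());
        problems.forEach(System.err::println);
        System.exit(problems.isEmpty() ? 0 : 1);
    }
}
```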
Key Rotation for private keys: generate a new key pair and CSR at every renewal. Don't carry the same private key across renewals. If the key is compromised, reissuing a cert for that same key doesn't help. Each new certificate should bind a new public key, and the old private key should be discarded.
The biggest lifecycle mistake is treating certificates as fire-and-forget. You generate one, install it, and never look at it again until something breaks. The fix is to treat certificates as ephemeral, automatically-renewed resources, just like AWS IAM keys or database passwords.
- Short-lived certs (24h TTL) force automatic renewal, eliminating expiry surprises.
- Long-lived certs require complex revocation infrastructure (CRL/OCSP) that often fails.
- Every cert should be tied to a monitoring alert and an automated renewal pipeline.
- If you're not renewing a cert every 30 days, you're not managing it — you're just hoping.
Trust Stores and Chain Validation: How Clients Decide to Trust
The final piece of PKI is the trust store: the list of root certificates that a client considers trustworthy. Firefox ships with roughly 150 root CAs in Mozilla's root store, while other browsers rely on their own root programs or the OS store. Your Java application has its own truststore (cacerts). Your mobile app has the OS trust store. Each client decides independently what to trust.
The root of trust — literally. A root CA is self-signed: its Subject and Issuer are the same. The client trusts it because it's hardcoded into the trust store, not because it was verified by a higher authority. That means trust is ultimately a social and operational decision: you trust DigiCert because your OS vendor vetted them.
Chain validation step-by-step:
1. Start with the leaf certificate.
2. Find its Issuer field. Look for a certificate in the chain (or trust store) whose Subject matches it.
3. Verify the current certificate's signature using the issuer's public key.
4. Check that the leaf's SAN matches the hostname being connected to (this check applies to the leaf only).
5. Check that the certificate is within its validity period.
6. Check that the certificate has not been revoked (via CRL or OCSP).
7. Move up to the issuer and repeat steps 2, 3, 5, and 6 for each intermediate until you reach a root in the trust store.
8. If you reach a trusted root, the chain is valid.
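The walk described above can be sketched as a toy model in Java. To be clear about what this is: it links certificates by Subject and Issuer name and checks expiry only. It performs no signature verification, SAN matching, or revocation checks, and the Cert class is a made-up stand-in for X509Certificate. It treats an expired root as fatal, which is how the incident in this article played out. Its value is in showing why a missing intermediate and an expired root produce different failures.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of path building: names and expiry only, no cryptography.
final class ChainModelSketch {

    // Hypothetical stand-in for a certificate.
    static final class Cert {
        final String subject;
        final String issuer;
        final Instant notAfter;

        Cert(String subject, String issuer, String notAfterIso) {
            this.subject = subject;
            this.issuer = issuer;
            this.notAfter = Instant.parse(notAfterIso);
        }
    }

    private static final int MAX_DEPTH = 8; // guards against issuer loops

    private static Map<String, Cert> bySubject(List<Cert> certs) {
        Map<String, Cert> index = new HashMap<>();
        for (Cert c : certs) {
            index.put(c.subject, c);
        }
        return index;
    }

    // Returns "OK" or a short reason for the failure.
    static String validate(Cert leaf, List<Cert> sentByServer, List<Cert> trustStore, Instant now) {
        Map<String, Cert> sent = bySubject(sentByServer);
        Map<String, Cert> trusted = bySubject(trustStore);
        Cert current = leaf;
        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            if (!now.isBefore(current.notAfter)) {
                return "expired: " + current.subject;
            }
            Cert root = trusted.get(current.issuer);
            if (root != null) {
                // Reached a trust anchor; this model also rejects an expired anchor.
                return now.isBefore(root.notAfter) ? "OK" : "expired: " + root.subject;
            }
            Cert next = sent.get(current.issuer);
            if (next == null) {
                return "no path: issuer of " + current.subject + " was not sent and is not trusted";
            }
            current = next;
        }
        return "path too long";
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-06-01T00:00:00Z");
        Cert root = new Cert("Root CA", "Root CA", "2030-01-01T00:00:00Z");
        Cert intermediate = new Cert("Issuing CA", "Root CA", "2026-01-01T00:00:00Z");
        Cert leaf = new Cert("api.example.com", "Issuing CA", "2024-09-01T00:00:00Z");
        System.out.println(validate(leaf, List.of(intermediate), List.of(root), now));
        System.out.println(validate(leaf, List.of(), List.of(root), now));
    }
}
```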
The most common trust store failure is a self-signed certificate that isn't in the trust store. You generate a CA, sign a leaf, install the leaf on the server, forget to add the CA to the client trust store. The handshake fails because the root isn't trusted. This is especially painful in mTLS where both sides need to trust each other's CA.
Java's default truststore is the cacerts file under $JAVA_HOME/lib/security (under $JAVA_HOME/jre/lib/security on Java 8 JDKs). It contains root CAs from commercial and public CAs. If you're using an internal CA, you must import its root into cacerts using keytool, or point the JVM at a custom truststore with -Djavax.net.ssl.trustStore. Many teams forget this step and spend hours debugging 'PKIX path building failed' because the root isn't in the JVM trust store.
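Before blaming the server, it can help to ask the JVM what it actually trusts. The sketch below loads the default trust anchors through the standard TrustManagerFactory API and checks whether a given subject fragment appears among them. The class name, method names, and the sample subject string are illustrative assumptions, and the output depends entirely on the JDK it runs on.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical diagnostic helper for inspecting the JVM's default trust store.
final class TrustStoreSketch {

    // Returns the root certificates the JVM trusts by default.
    static List<X509Certificate> defaultTrustAnchors() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init((KeyStore) null); // null means: use the JVM's default trust store
            List<X509Certificate> roots = new ArrayList<>();
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    roots.addAll(Arrays.asList(((X509TrustManager) tm).getAcceptedIssuers()));
                }
            }
            return roots;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if any trusted root's subject DN contains the given fragment.
    static boolean trustsSubject(String subjectFragment) {
        for (X509Certificate root : defaultTrustAnchors()) {
            if (root.getSubjectX500Principal().getName().contains(subjectFragment)) {
                return true;
            }
        }
        return false;
    }

    // Subject DNs of trusted roots whose notAfter falls within the given number of days.
    static List<String> expiringWithin(long days, Instant now) {
        Instant cutoff = now.plus(Duration.ofDays(days));
        List<String> out = new ArrayList<>();
        for (X509Certificate root : defaultTrustAnchors()) {
            if (root.getNotAfter().toInstant().isBefore(cutoff)) {
                out.add(root.getSubjectX500Principal().getName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("trusted roots: " + defaultTrustAnchors().size());
        System.out.println("internal CA trusted: " + trustsSubject("CN=Example Internal Root CA"));
        System.out.println("roots expiring within a year: " + expiringWithin(365, Instant.now()).size());
    }
}
```

If trustsSubject returns false for your internal root, 'PKIX path building failed' is the expected result, not a mystery.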
Mobile clients have stricter trust stores. iOS and Android ship with their own lists. If you're using a private CA for a mobile app, you must distribute the CA certificate to the device — either during app installation or via a profile. Without it, the app will fail to connect to your API, and the error message will be cryptic: 'SSL handshake failed' with no details.
CRL vs OCSP: Why Hard Failures Beat Soft Failures in Production
You've got a certificate revocation problem. The CA told you a cert is dead, but your client accepted it anyway. That's not a CA bug. That's your revocation check strategy.
CRLs are static lists. Every CA publishes one. The problem is size and staleness. A Facebook-sized CRL can be 50MB. You can't fetch that on every connection. Most production systems cache CRLs for hours. That cache window is your attack surface.
OCSP is real-time. Your client asks the CA "is this cert still valid?" and gets a signed response. The trap: OCSP responders go down under load. Your client's default is "soft fail" — if it can't reach the OCSP responder, it treats the cert as valid. That's garbage.
Hard fail is the only honest approach. If OCSP is unreachable, reject the cert. Period. But nobody does that because it breaks prod during CA outages. The compromise is OCSP stapling. The server fetches the OCSP response upfront and ships it with the certificate handshake. No additional round trip for the client. No soft fail vulnerability.
Set your revocation check to require stapled OCSP responses. If the staple is missing, hard fail. That's the pattern Netflix and Cloudflare use.
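In Java, the hard-fail versus soft-fail choice is an explicit option on the JDK's PKIXRevocationChecker. The sketch below builds a checker either way; the class and method names are illustrative assumptions. It does not by itself require a stapled response. In JSSE, stapling is controlled separately through the jdk.tls.client.enableStatusRequestExtension and jdk.tls.server.enableStatusRequestExtension system properties, so this covers only the "what happens when the responder is unreachable" half of the problem.

```java
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertPathValidator;
import java.security.cert.PKIXRevocationChecker;
import java.security.cert.PKIXRevocationChecker.Option;
import java.util.EnumSet;

// Hypothetical factory for a revocation checker with an explicit failure policy.
final class RevocationSketch {

    // softFail=false leaves the option set empty: if revocation status
    // cannot be determined, validation fails (hard fail).
    static PKIXRevocationChecker newChecker(boolean softFail) {
        try {
            CertPathValidator validator = CertPathValidator.getInstance("PKIX");
            PKIXRevocationChecker checker =
                    (PKIXRevocationChecker) validator.getRevocationChecker();
            checker.setOptions(softFail
                    ? EnumSet.of(Option.SOFT_FAIL)
                    : EnumSet.noneOf(Option.class));
            return checker;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean isSoftFail(PKIXRevocationChecker checker) {
        return checker.getOptions().contains(Option.SOFT_FAIL);
    }

    public static void main(String[] args) {
        PKIXRevocationChecker hard = newChecker(false);
        // To use it, add it to your validation parameters:
        //   pkixParameters.addCertPathChecker(hard);
        System.out.println("soft fail enabled: " + isSoftFail(hard));
    }
}
```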
Distinguished Name Constraints: The CAA Record That Actually Blocks Misissuance
You think your certificate transparency logs catch all misissued certs. They don't. They catch them after the fact. You want prevention, not post-mortem.
CAA (Certification Authority Authorization) is that prevention. It's a DNS record that says "only these specific CAs are allowed to issue certs for my domain." When Google or Let's Encrypt attempts issuance, they check the CAA record first. If your CA isn't listed, issuance fails.
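The sharp end: wildcard CAA records. If you set 'example.com CAA 0 issue "letsencrypt.org"', that covers *.example.com too, because the 'issue' tag applies to wildcard requests whenever no 'issuewild' record is present. The trap runs the other way: once any 'issuewild' record exists, it overrides 'issue' for wildcard requests. Add an 'issuewild' for some other CA (or 'issuewild ";"') and forget the CA you actually use, and wildcard issuance from your approved CA is blocked.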
Here's the trick most people miss: CAA also supports account URI constraints. You can pin issuance to a specific Let's Encrypt account. If an attacker compromises a different LE account that uses your domain, CAA blocks it. That's defense in depth.
Set your CAA records today. Run a DNS check after every config change. Tools like 'dig' can verify. Don't rely on a CA's UI to tell you it's working.
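The issue/issuewild precedence is easy to get wrong, so here is a small Java model of how a CA might evaluate a CAA record set. It is a sketch of the RFC 8659 precedence rules under simplifying assumptions: the records are passed in by hand (no DNS lookup, no climbing to parent domains), parameters such as accounturi are stripped and not enforced, and the CaaRecord class is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy evaluator for CAA issue/issuewild precedence; not a DNS client.
final class CaaSketch {

    // Hypothetical in-memory form of one CAA record, e.g. ("issue", "letsencrypt.org").
    static final class CaaRecord {
        final String tag;
        final String value;

        CaaRecord(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    // Issuer domains named by records with the given tag; ";" yields an empty domain.
    private static List<String> issuerDomains(List<CaaRecord> rrset, String tag) {
        List<String> out = new ArrayList<>();
        for (CaaRecord r : rrset) {
            if (r.tag.equalsIgnoreCase(tag)) {
                // Drop any parameters after the first ';' (accounturi and friends).
                out.add(r.value.split(";", 2)[0].trim().toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // True if the CA identified by caDomain may issue under this record set.
    static boolean mayIssue(List<CaaRecord> rrset, String caDomain, boolean wildcardRequest) {
        List<String> issue = issuerDomains(rrset, "issue");
        List<String> issuewild = issuerDomains(rrset, "issuewild");
        // issuewild, when present, replaces issue for wildcard requests only.
        List<String> applicable = (wildcardRequest && !issuewild.isEmpty()) ? issuewild : issue;
        if (applicable.isEmpty()) {
            return true; // no relevant restriction published
        }
        return applicable.contains(caDomain.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<CaaRecord> rrset = List.of(
                new CaaRecord("issue", "letsencrypt.org"),
                new CaaRecord("issuewild", ";"));
        System.out.println("plain cert allowed: " + mayIssue(rrset, "letsencrypt.org", false));
        System.out.println("wildcard allowed:   " + mayIssue(rrset, "letsencrypt.org", true));
    }
}
```

Running the records you plan to publish through a model like this before touching DNS is a cheap way to catch an 'issuewild' that locks out your own CA.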
Cross-Signing: Why Your Root CA Expiration Bombs Every Client
Your root CA certificate expires in 2038. Your intermediate expires in 2030. You roll the intermediate every year. The root stays the same. That's fine until the root's public key algorithm becomes obsolete.
Cross-signing is how you transition roots without bricking every client on the internet. A cross-signed certificate is one root CA signing another root CA's key. The new root is accepted by clients that only trust the old root.
The gotcha: chain order matters. Your server must present the full chain: leaf -> intermediate -> cross-signed root. If you truncate at the intermediate, clients that don't trust the new root will reject the connection.
Most production failures I've seen happen because ops teams replace the intermediate but forget to update the cross-signed root in their TLS configurations. The OpenSSL s_client command shows you exactly what's being sent. Use it before every deployment.
Here's the pattern: keep two trust stores. One for the old root, one for the new. During the transition window, serve both chains. Clients pick whichever they trust. When you're confident the new root has wide adoption, drop the old one.
This is why you rotate roots at a 5-year cadence, not 20. IBM learned this the hard way in 2020 when their 21-year-old root expired silently.
Key Functions of a CA: Why Authorizing the Authority Matters
A Certificate Authority (CA) is not just a signing machine. Its core functions (registration, validation, issuance, and revocation) determine whether the entire PKI stands or falls. Registration verifies the entity's identity via domain validation, organization validation, or extended validation. Validation confirms control or legal existence before any certificate is bound to the requester's public key. Issuance binds the validated identity to a public key inside a signed X.509 structure. Revocation, often overlooked, is the CA's kill switch: when a key is compromised, the CA must publish a CRL or push OCSP responses within minutes. If a CA skips any of these functions, attackers can obtain rogue certificates. The CA acts as the chain's weakest link, not its strongest. Never trust a CA that automates issuance without manual validation for high-value certificates: that's how misissuance bombs production.
Set Up the Subordinate CA: Why Tiered Trust Beats Flat Hierarchies
A root CA signs its own certificate and sits offline in a vault. A subordinate CA does the daily work: issuing server, client, and code-signing certificates. Offline roots prevent key compromise from cascading. To set up a subordinate CA, first generate a new key pair and certificate signing request. Have the root CA sign it, creating a subordinate CA certificate with a Basic Constraints extension marking it as a CA with a path length of 0. Load this certificate into a secure HSM or software keystore. Configure the subordinate CA to sign end-entity certificates only, never another CA. The root stays offline; revocation of a compromised subordinate CA leaves the root untouched. Production shops run multiple subordinate CAs—one per purpose—so a code-signing breach doesn't poison TLS certificates. If you skip tiering, one leaked root key bricks every client trust store.
The $340k Payment Outage: An Expired Internal CA Certificate
- Every certificate in your chain must be monitored for expiry — including root CAs and intermediates, not just leaf certs.
- Certificate management must be centralized. Different teams or tools issuing different parts of the chain leads to blind spots.
- Add certificate chain validation (length, expiry, SAN check) to your CI/CD pipeline. Don't rely on runtime monitoring alone.
echo | openssl s_client -connect host:443 -servername host -showcerts 2>/dev/null | grep -c 'BEGIN CERTIFICATE'
openssl verify -CAfile ca-chain.pem server.pem
Key takeaways
Common mistakes to avoid
- Deploying certificates without validating the full chain first
- Only monitoring leaf certificate expiry, ignoring root and intermediate CAs
- Assuming CN is used for hostname matching
- Using self-signed or private-CA certificates without distributing the CA to clients
Interview Questions on This Topic
Explain the TLS 1.3 handshake and how it differs from TLS 1.2. Where does certificate chain validation fit?