SSO and SAML: Stop Copy-Pasting Auth Code and Fix Your Login Once
SSO and SAML explained for production engineers.
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
SAML is an XML-based protocol for exchanging authentication and authorization data between an identity provider and a service provider. It's the backbone of enterprise SSO. You configure an IdP (like Okta or ADFS) to issue SAML assertions, and your app (the SP) validates those assertions to grant access.
Think of SAML like a VIP pass at a conference. You show your ID at the front desk (the identity provider), and they give you a stamped wristband (the SAML assertion). You then walk into any session (service provider) and just flash the wristband — no need to show your ID again. The wristband has a hologram (digital signature) so no one can fake it.
Every company I've worked at has had a 'login incident' that woke someone up at 3 AM. Usually it's because someone copy-pasted SAML code from a blog post without understanding the clock skew check. SSO is supposed to make life easier, but misconfigured SAML is a silent killer — users can't log in, and you have no idea why because the error messages are garbage.
The problem SAML solves is simple: you don't want every app to manage its own passwords. You want one central authority (the IdP) that says 'this user is authenticated' and every other app trusts that. Without SAML, you either build a custom SSO protocol (please don't) or force users to log in to each app separately (which they hate).
By the end of this, you'll be able to configure SAML SSO in a production app, debug the three most common failures (clock skew, unsigned assertions, and wrong audience), and explain to your boss why you chose SAML over OIDC without sounding like a Wikipedia article.
Why SAML Exists: The Password Proliferation Problem
Before SAML, every app had its own login. Users had 50 passwords, so they reused 'password123' everywhere. IT admins couldn't revoke access centrally — they had to delete accounts in 20 different systems. SAML solved this by separating authentication (the IdP's job) from authorization (the SP's job). The IdP tells the SP 'this user is who they say they are' via a signed XML document. The SP trusts that document because it's signed with the IdP's private key.
The key insight: SAML is about trust, not just data exchange. The SP doesn't ask the IdP 'is this user valid?' every time. Instead, the IdP gives the SP a self-contained assertion that the SP can verify independently. This means the SP doesn't need network access to the IdP at runtime — it just needs the IdP's public key and a valid clock.
Without SAML, you'd have to build a shared session database or use something like OAuth 2.0 with a token introspection endpoint (which requires network calls). SAML's assertion model is more resilient to network failures, but it's also more complex to debug because the assertion is a blob of XML with strict schema rules.
To inspect a failing login, decode the SAMLResponse value:
echo '<base64>' | base64 -d | xmllint --format -
You'll see the actual timestamps, issuer, and signature; 90% of issues are visible right there.
SAML vs OIDC: When to Use Each in Production
The classic mistake: using SAML for a mobile app. SAML's web SSO profile was designed for browsers, using the HTTP-Redirect and HTTP-POST bindings. It doesn't work well with native apps because the flow ends with a form POST to a web endpoint, and there's no standard way to hand the result back to the app. OIDC (OpenID Connect) is built for this: it uses JWTs and REST APIs.
Use SAML when: (1) You're in an enterprise environment with existing IdPs like ADFS, Okta, or Azure AD. (2) You need to support SAML-based federation (e.g., government or healthcare). (3) Your app is web-based and users access it via browsers.
Use OIDC when: (1) You're building a new app from scratch. (2) You have mobile or single-page apps. (3) You want simpler JSON tokens instead of XML. (4) You need fine-grained scopes and API access.
The trade-off: SAML is more mature and has better enterprise support, but OIDC is simpler and more modern. I've seen teams waste weeks trying to make SAML work in a mobile app — don't be that team.
The SAML Handshake: AuthnRequest and Response Flow
The SAML flow starts when an unauthenticated user tries to access a protected resource on the SP. The SP generates an AuthnRequest — an XML document that tells the IdP what the SP expects (e.g., which SAML version, what binding to use, and where to send the response). The SP redirects the user to the IdP with this request.
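As a rough illustration of the step above, here is a minimal sketch of an SP building an unsigned AuthnRequest and packing it for the HTTP-Redirect binding (raw DEFLATE, then base64, then URL-encoding). The function names and all example.com URLs are hypothetical, and a real deployment would normally let a SAML library generate (and optionally sign) the request.

```python
import base64
import uuid
import zlib
from datetime import datetime, timezone
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr


def build_authn_request(sp_entity_id, acs_url, idp_sso_url):
    """Return the XML of a minimal, unsigned SAML 2.0 AuthnRequest."""
    request_id = "_" + uuid.uuid4().hex  # XML IDs must not start with a digit
    issue_instant = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        '<samlp:AuthnRequest'
        ' xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"'
        ' xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"'
        f' ID={quoteattr(request_id)} Version="2.0"'
        f' IssueInstant={quoteattr(issue_instant)}'
        f' Destination={quoteattr(idp_sso_url)}'
        # Ask the IdP to deliver the response to our ACS URL via HTTP-POST.
        ' ProtocolBinding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"'
        f' AssertionConsumerServiceURL={quoteattr(acs_url)}>'
        f'<saml:Issuer>{escape(sp_entity_id)}</saml:Issuer>'
        '</samlp:AuthnRequest>'
    )


def redirect_url(idp_sso_url, authn_request_xml, relay_state):
    """Encode the request for the HTTP-Redirect binding and build the URL."""
    # wbits=-15 produces raw DEFLATE output without the zlib header.
    compressor = zlib.compressobj(wbits=-15)
    deflated = compressor.compress(authn_request_xml.encode("utf-8"))
    deflated += compressor.flush()
    query = urlencode({
        "SAMLRequest": base64.b64encode(deflated).decode("ascii"),
        "RelayState": relay_state,  # where to send the user after login
    })
    return f"{idp_sso_url}?{query}"
```

The SP would answer the unauthenticated request with a 302 to the returned URL. Storing the request ID server-side lets you later match it against InResponseTo in the response.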
The IdP authenticates the user (via password, MFA, etc.) and generates a SAML Response. This response contains an Assertion — the actual statement that the user is authenticated. The assertion includes: the user's identifier (NameID), the issuer (IdP entity ID), the audience (SP entity ID), timestamps (NotBefore and NotOnOrAfter), and optionally attributes (email, roles).
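To make those fields concrete, the sketch below pulls them out of a hand-written sample response using only the standard library. The sample XML, its values, and the summarize_assertion helper are invented for illustration; the sample is unsigned, and ElementTree is not hardened against hostile XML, so treat this as an inspection aid rather than a validation path.

```python
import xml.etree.ElementTree as ET

NS = {
    "saml": "urn:oasis:names:tc:SAML:2.0:assertion",
    "samlp": "urn:oasis:names:tc:SAML:2.0:protocol",
}

# Hypothetical, unsigned sample; real responses also carry a ds:Signature.
SAMPLE_RESPONSE = """<samlp:Response
    xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"
    xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" ID="_r1" Version="2.0">
  <saml:Assertion ID="_a1" Version="2.0" IssueInstant="2024-01-01T12:00:00Z">
    <saml:Issuer>https://idp.example.com/metadata</saml:Issuer>
    <saml:Subject>
      <saml:NameID>alice@example.com</saml:NameID>
    </saml:Subject>
    <saml:Conditions NotBefore="2024-01-01T11:59:00Z"
                     NotOnOrAfter="2024-01-01T12:05:00Z">
      <saml:AudienceRestriction>
        <saml:Audience>https://sp.example.com/metadata</saml:Audience>
      </saml:AudienceRestriction>
    </saml:Conditions>
    <saml:AttributeStatement>
      <saml:Attribute Name="email">
        <saml:AttributeValue>alice@example.com</saml:AttributeValue>
      </saml:Attribute>
    </saml:AttributeStatement>
  </saml:Assertion>
</samlp:Response>"""


def summarize_assertion(response_xml):
    """Extract the fields an SP cares about from the first assertion."""
    root = ET.fromstring(response_xml)
    assertion = root.find("saml:Assertion", NS)
    conditions = assertion.find("saml:Conditions", NS)
    attributes = {
        attr.get("Name"): [v.text for v in attr.findall("saml:AttributeValue", NS)]
        for attr in assertion.findall("saml:AttributeStatement/saml:Attribute", NS)
    }
    return {
        "assertion_id": assertion.get("ID"),
        "issuer": assertion.findtext("saml:Issuer", namespaces=NS),
        "name_id": assertion.findtext("saml:Subject/saml:NameID", namespaces=NS),
        "audience": conditions.findtext(
            "saml:AudienceRestriction/saml:Audience", namespaces=NS
        ),
        "not_before": conditions.get("NotBefore"),
        "not_on_or_after": conditions.get("NotOnOrAfter"),
        "attributes": attributes,
    }
```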
The IdP signs the assertion (or the entire response) with its private key. The SP validates the signature using the IdP's public certificate. If the signature is valid, the timestamps are within range, and the audience matches, the SP creates a local session and redirects the user to the original resource.
The critical detail: the SP must have the IdP's public certificate configured. If the certificate changes (e.g., rotation), all SPs must be updated. This is a common source of production outages.
Validating SAML Responses: The Four Checks You Must Implement
When your SP receives a SAML Response, you must validate four things. Miss any, and you're vulnerable to attacks or login failures.
- Signature validation: Verify the XML signature using the IdP's public certificate. This ensures the response wasn't tampered with. Use a library like opensaml or python3-saml; don't roll your own XML signature verification.
- Timestamp validation: Check the NotBefore and NotOnOrAfter conditions. The current time must be between these two timestamps. Allow a clock skew of up to 5 minutes (configurable). If the assertion is expired, reject it.
- Audience restriction: The Audience element in the assertion must match your SP's entity ID. This prevents an assertion issued for one SP from being used on another.
- Recipient check: The Recipient attribute in the SubjectConfirmationData must match your ACS URL. This prevents assertion replay on a different endpoint.
I've seen production outages caused by each of these. The most common: clock skew (fix with NTP monitoring) and audience mismatch (fix by double-checking entity IDs in both IdP and SP config).
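The three non-cryptographic checks can be sketched in a few lines. This is an assumption-laden illustration, not a replacement for a SAML library: check_assertion and its parameters are hypothetical names, it expects values already extracted from an assertion whose signature your library has verified, and it assumes timestamps in the usual UTC form such as 2024-01-01T12:00:00Z.

```python
from datetime import datetime, timedelta, timezone

CLOCK_SKEW = timedelta(minutes=5)  # tolerance from the text; make it configurable


def parse_saml_time(value):
    """Parse a SAML UTC timestamp such as 2024-01-01T12:00:00Z."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def check_assertion(not_before, not_on_or_after, audience, recipient, *,
                    expected_audience, expected_recipient,
                    now=None, skew=CLOCK_SKEW):
    """Return a list of failure reasons; an empty list means all checks passed.

    Signature validation is deliberately NOT done here. It must already
    have succeeded (via a SAML library) before these values are trusted.
    """
    now = now or datetime.now(timezone.utc)
    failures = []
    if now + skew < parse_saml_time(not_before):
        failures.append("assertion not yet valid (NotBefore)")
    if now - skew >= parse_saml_time(not_on_or_after):
        failures.append("assertion expired (NotOnOrAfter)")
    if audience != expected_audience:  # exact match, trailing slashes matter
        failures.append("audience mismatch")
    if recipient != expected_recipient:
        failures.append("recipient mismatch")
    return failures
```

Returning the reasons instead of a bare boolean is a deliberate choice: logging them server-side turns "users can't log in" into a message that names the failing check.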
Configuring Your SP for SAML: The Metadata Dance
Every SAML deployment starts with exchanging metadata. The IdP has metadata (entity ID, SSO URL, public certificate). The SP has metadata (entity ID, ACS URL, public certificate). You configure each side with the other's metadata.
The IdP's metadata is usually available at a URL like https://idp.example.com/metadata. It's an XML file containing the IdP's entity ID, single sign-on service URL, and X.509 certificate. You import this into your SP configuration.
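For a sense of what that import involves, here is a small sketch that reads the three values out of an IdP metadata document. The sample metadata is hand-written, the certificate body is a placeholder string rather than a real certificate, and parse_idp_metadata is a hypothetical helper; production code should also verify the metadata's own signature when the IdP provides one.

```python
import xml.etree.ElementTree as ET

MD_NS = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "ds": "http://www.w3.org/2000/09/xmldsig#",
}

# Hypothetical sample; the certificate value is a placeholder, not a real cert.
SAMPLE_METADATA = """<md:EntityDescriptor
    xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
    entityID="https://idp.example.com/metadata">
  <md:IDPSSODescriptor
      protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:KeyDescriptor use="signing">
      <ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
        <ds:X509Data>
          <ds:X509Certificate>
            UExBQ0VIT0xERVI=
          </ds:X509Certificate>
        </ds:X509Data>
      </ds:KeyInfo>
    </md:KeyDescriptor>
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
        Location="https://idp.example.com/sso"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def parse_idp_metadata(metadata_xml):
    """Return the entity ID, SSO endpoints by binding, and signing certs."""
    root = ET.fromstring(metadata_xml)
    idp = root.find("md:IDPSSODescriptor", MD_NS)
    sso_urls = {
        svc.get("Binding"): svc.get("Location")
        for svc in idp.findall("md:SingleSignOnService", MD_NS)
    }
    signing_certs = []
    for key in idp.findall("md:KeyDescriptor", MD_NS):
        # A KeyDescriptor without a "use" attribute applies to signing too.
        if key.get("use") in (None, "signing"):
            for cert in key.findall(".//ds:X509Certificate", MD_NS):
                signing_certs.append("".join((cert.text or "").split()))
    return {
        "entity_id": root.get("entityID"),
        "sso_urls": sso_urls,
        "signing_certs": signing_certs,
    }
```

Collecting every signing certificate rather than only the first matters during rotation, when an IdP may publish the old and new certificates side by side.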
Your SP's metadata must be registered with the IdP. It includes your entity ID, ACS URL, and optionally your SP's certificate (if you sign AuthnRequests). The IdP uses this to know where to send responses and which SP is requesting authentication.
The gotcha: metadata changes. If the IdP rotates its certificate, the document behind the metadata URL updates, but your SP might still be using an old cached copy. Always fetch metadata dynamically with a cache TTL (e.g., 24 hours) instead of hardcoding it. I've seen outages because someone hardcoded a certificate that expired.
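One way to get a cache TTL without adding a dependency is a small wrapper like the sketch below. MetadataCache and fetch_idp_metadata are hypothetical names, the URL is the example one from above, and serving stale metadata when a refresh fails is a design assumption you may not want in every environment.

```python
import time
import urllib.request


def fetch_idp_metadata(url="https://idp.example.com/metadata", timeout=5):
    """Download the IdP metadata document (the URL is an example value)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


class MetadataCache:
    """Cache the metadata text and refresh it once the TTL has elapsed."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                # Assumption: a stale copy beats a login outage. With no
                # copy at all there is nothing to fall back to, so re-raise.
                if self._value is None:
                    raise
        return self._value
```

Usage would look like cache = MetadataCache(fetch_idp_metadata) at startup and cache.get() wherever the SP needs the current certificate. A failed refresh is retried on the next call because the fetch timestamp is only updated on success.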
Single Logout (SLO): The Feature Everyone Forgets
Single Logout (SLO) is the SAML feature that lets a user log out of all SPs by logging out of the IdP. It's rarely implemented correctly because it's complex: the IdP sends a LogoutRequest to each SP, and each SP must respond. If any SP is down, the logout fails.
The SLO flow: User clicks logout at SP. SP sends a LogoutRequest to the IdP. IdP terminates the session and sends LogoutRequests to all other SPs that the user has active sessions with. Each SP responds with a LogoutResponse. The IdP then sends a final response to the original SP.
In practice, SLO is unreliable. SPs might be offline, or the user might have multiple browser tabs. Many organizations skip SLO and rely on session timeouts instead. If you must implement SLO, use asynchronous notifications (e.g., a message queue) rather than synchronous HTTP calls, and accept that some SPs might not log out immediately.
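The asynchronous approach might look like the sketch below, where the logout request only enqueues notifications and a background worker delivers them with bounded retries. Everything here is illustrative: LogoutFanout and LogoutNotification are hypothetical names, the in-process queue.Queue stands in for a real message broker, and the send callable is where an actual LogoutRequest would be built and posted.

```python
import queue
from dataclasses import dataclass


@dataclass
class LogoutNotification:
    sp_entity_id: str
    session_index: str
    attempts: int = 0


class LogoutFanout:
    """Queue logout notifications so the user's logout never blocks on SPs."""

    def __init__(self, send, max_attempts=3):
        self._send = send  # callable(notification) -> True when delivered
        self._max_attempts = max_attempts
        self._queue = queue.Queue()  # stand-in for a real message queue

    def user_logged_out(self, session_index, sp_entity_ids):
        """Called on the logout request path: enqueue and return at once."""
        for sp_entity_id in sp_entity_ids:
            self._queue.put(LogoutNotification(sp_entity_id, session_index))

    def drain_once(self):
        """One worker pass. Returns notifications that ran out of attempts."""
        abandoned = []
        for _ in range(self._queue.qsize()):
            note = self._queue.get_nowait()
            note.attempts += 1
            try:
                delivered = self._send(note)
            except Exception:
                delivered = False  # an SP that is down counts as a failure
            if delivered:
                continue
            if note.attempts < self._max_attempts:
                self._queue.put(note)  # retried on the next pass
            else:
                abandoned.append(note)
        return abandoned
```

A worker would call drain_once on a timer and log whatever comes back as abandoned; those sessions then fall back to their normal timeout, which matches the "accept that some SPs might not log out immediately" stance above.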
The classic mistake: implementing SLO with synchronous HTTP calls and blocking the user's logout until all SPs respond. This leads to timeouts and a poor user experience.
Debugging SAML: The Tools and Techniques That Actually Work
When SAML breaks, you need to see the actual XML. The browser's developer tools are your friend. Use the Network tab to capture the POST request to your ACS URL. The SAMLResponse parameter contains the base64-encoded Response, which wraps the assertion. Decode it and format the XML.
Tools: (1) SAML-tracer browser extension (Firefox/Chrome) — intercepts SAML messages and decodes them automatically. (2) base64 -d and xmllint for command-line decoding. (3) Online SAML decoders (but be careful with sensitive data — use local tools).
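If you'd rather keep decoding inside a script, the same local decode can be sketched in Python with the standard library. decode_saml_response is a hypothetical helper; it assumes the HTTP-POST binding (plain base64, no DEFLATE) and is meant for local debugging only, since minidom is not hardened against hostile XML.

```python
import base64
import xml.dom.minidom
from urllib.parse import unquote


def decode_saml_response(saml_response):
    """Decode a SAMLResponse form value and return indented XML."""
    # Values copied from a raw request body may still be URL-encoded.
    raw = base64.b64decode(unquote(saml_response))
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
```

Paste the captured value into a file or variable, print the result, and compare NotBefore, NotOnOrAfter, Issuer, and Audience against what your SP expects.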
Common issues: (1) Clock skew — compare NotBefore and NotOnOrAfter with server time. (2) Signature failure — check that the IdP's certificate is correct and the signature algorithm matches (e.g., RSA-SHA256). (3) Audience mismatch — ensure the Audience element matches your SP's entity ID exactly.
I once spent 4 hours debugging a signature failure only to find that the IdP had rotated its certificate and the SP was using the old one. The fix: automate certificate fetching from the IdP's metadata URL.
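A cheap guard against this failure is to compare fingerprints of the certificate your SP is configured with and the ones currently published in metadata. The sketch below assumes you already have both as text (PEM or the bare base64 form used in metadata); cert_fingerprint and certificate_rotated are hypothetical helpers, and the check only detects a mismatch, it does not validate the certificate itself.

```python
import base64
import hashlib


def cert_fingerprint(cert_text):
    """SHA-256 fingerprint of a certificate given as PEM or bare base64."""
    body = "".join(
        line for line in cert_text.splitlines() if "-----" not in line
    )
    der = base64.b64decode("".join(body.split()))  # drop all whitespace
    return hashlib.sha256(der).hexdigest()


def certificate_rotated(configured_cert, metadata_certs):
    """True when the configured cert is no longer published by the IdP."""
    published = {cert_fingerprint(cert) for cert in metadata_certs}
    return cert_fingerprint(configured_cert) not in published
```

Running this on a schedule and alerting when it returns True gives you a heads-up before signature validation starts failing for users.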
When Not to Use SAML: The Overkill Scenarios
SAML is overkill for: (1) Internal microservices communicating via REST APIs: use OAuth 2.0 client credentials or mutual TLS. (2) Consumer-facing apps where users don't have enterprise accounts: use social login (Google, Facebook) via OIDC. (3) Simple authentication for a single app: just use a session cookie and store password hashes.
SAML adds complexity: XML parsing, signature verification, metadata management, clock synchronization. If you don't need enterprise federation, don't use it. I've seen startups waste weeks implementing SAML because 'it's the enterprise standard' when they had zero enterprise customers.
The rule of thumb: use SAML only if you have an existing IdP (Okta, ADFS, Azure AD) that you must integrate with. If you're building a new system, use OIDC. It's simpler, avoids XML signature handling, and works with mobile apps.
The 3 AM Clock Skew That Killed Login
The fix: systemctl restart ntp. Added monitoring for clock skew via Prometheus node_exporter's node_timex_offset_seconds metric. Set alert if offset > 1 second.
- Always monitor clock skew between IdP and SP.
- A 5-minute tolerance is standard, but drift happens faster than you think.
Quick checks:
ntpdate -q idp.example.com
date && curl -s https://idp.example.com/metadata | grep -oP 'NotOnOrAfter="\K[^"]+'
Interview Questions on This Topic
How does SAML handle replay attacks?
SAML limits replay with NotBefore and NotOnOrAfter timestamps combined with a unique assertion ID. The SP should also cache assertion IDs and reject duplicates. The SubjectConfirmationData with Recipient and InResponseTo (if using AuthnRequest) adds another layer.
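The "cache assertion IDs and reject duplicates" part can be sketched as a small in-memory store. AssertionReplayCache is a hypothetical name, the 5-minute skew mirrors the tolerance used elsewhere in this article, and an in-process dict only works for a single SP instance; with several instances you would need a shared store.

```python
import time


class AssertionReplayCache:
    """Remember assertion IDs until they can no longer pass the time check."""

    def __init__(self, skew_seconds=300, clock=time.time):
        self._skew = skew_seconds
        self._clock = clock  # injectable so tests can control time
        self._seen = {}  # assertion ID -> epoch second after which it is dropped

    def check_and_store(self, assertion_id, not_on_or_after_epoch):
        """Return True for a first-time ID, False for a replayed one."""
        now = self._clock()
        # Evict IDs whose assertions would now fail timestamp validation anyway.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if assertion_id in self._seen:
            return False
        self._seen[assertion_id] = not_on_or_after_epoch + self._skew
        return True
```

Call check_and_store only after signature and timestamp validation have passed, so that forged or expired assertions never occupy cache entries.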