
Backend for Frontend — Cache Key Versioning Pitfalls

Field rename crashed mobile for an hour—Redis served stale shapes within TTL.
🔥 Advanced — solid System Design foundation required
In this tutorial, you'll learn
  • BFF = per client, owned by frontend team. Deployed independently. No business logic — only aggregation, transformation, error normalisation.
  • Promise.allSettled over Promise.all. Classify every dependency as critical or non-critical. One flaky non-critical service should never 503 the page.
  • Field projection = whitelist. If a field isn't whitelisted, it never leaves the BFF. Protects bandwidth and prevents internal data leaks.
Quick Answer
  • BFF = dedicated backend per client type (mobile, web, partner). Owned by frontend team. Deployed independently.
  • Does three things: aggregates downstream calls, transforms response shapes, normalises errors. No business logic.
  • Promise.allSettled (not Promise.all) + classify dependencies as critical vs non-critical = one flaky service doesn't 503 the page.
  • Field projection = whitelist fields per client. Mobile gets 4 fields from 40-field user service. Smaller payload, no internal data leaks.
  • Versioned cache keys: 'mobile:homescreen:v2:userId'. Bump version when response shape changes. Old keys expire naturally, no flush needed.
  • Production killer: unversioned cache + response shape change = stale field names = client renders 'undefined' for an hour.
🚨 START HERE

BFF — 60-Second Diagnosis

When your client-facing BFF isn't behaving, run these checks
🟡

Check if BFF is using Promise.allSettled for fan-out

Immediate Action: Look for Promise.all in aggregation code; it is a bug waiting to happen.
Commands:
grep -rn 'Promise\.all(' src/routes/
grep -rn 'Promise\.allSettled(' src/routes/
Fix Now: Replace Promise.all with Promise.allSettled. Classify each dependency as critical (fails entire request) or non-critical (degrades gracefully).
🟡

Check field projection coverage

Immediate Action: Verify every BFF endpoint has a whitelist of allowed fields.
Commands:
grep -rn 'projectFields\|fieldWhitelist' src/
curl -s "$BFF_ENDPOINT" | jq 'keys | length'
Fix Now: Add field projection to any endpoint returning >10 fields for mobile. Whitelist only what the UI actually renders.
🟡

Check cache key versioning

Immediate Action: Verify cache keys include a version number that matches the response schema.
Commands:
grep -rn 'cacheKey\|redis.set' src/ | grep -v 'v[0-9]'
# --scan iterates incrementally; KEYS would block Redis while it walks the whole keyspace
redis-cli --scan --pattern 'mobile:*' | grep -v ':v[0-9]'
Fix Now: Add a version number to all cache keys. Bump the version when the response shape changes.
🟡

Check for business logic leak into BFF

Immediate Action: Look for pricing, discount, validation, or calculation code in the BFF.
Commands:
grep -r 'price\|discount\|validation\|calculation' src/
git log --since="3 months ago" -- src/ | grep -E 'price|discount'
Fix Now: Extract business logic to a downstream service. The BFF only shapes data, never computes meaning.
Production Incident

The Unversioned Cache That Rendered 'undefined' for an Hour

The mobile team renamed 'avatarUrl' to 'profileImageUrl' in their BFF response. Redis still served the old field name for 60 seconds. Mobile clients crashed trying to read a field that no longer existed. The deploy was rolled back. The on-call engineer learned about cache keys the hard way.
Symptom: Mobile app renders blank images and crashes on the profile screen. Server logs show no errors. A new BFF version was deployed 5 minutes ago. Some users see stale data; new sessions see correct data.
Assumption: The team assumed caches cleared on deploy. They didn't know that Redis keys persist across deployments until they expire or the key name changes.
Root cause: The BFF cached home screen responses with key 'mobile:homescreen:userId', with no version number. When the mobile team renamed a field in the response shape, Redis was still serving the old shape to any request that arrived within the TTL window. Mobile clients expected 'profileImageUrl'. They received 'avatarUrl'. The app crashed when it dereferenced the missing field. The caching layer was working exactly as designed, and that was the problem.
Fix:
  1. Changed the cache key to 'mobile:homescreen:v2:userId', with the version number in the key.
  2. Deployed the new version. Old keys without the new version are ignored by the new code.
  3. Added a smoke test that verifies the cache key version matches the response shape version.
  4. Documented the rule: any breaking change to the response shape = bump the cache key version.
Prevention: Put a version number in every cache key, tied to your API version or schema version. Bump it manually when the shape changes. Never reuse the same key across incompatible response shapes.
Key Lesson
  • Unversioned cache keys + response shape change = stale field names = client crashes.
  • Cache key version must be independent of deployment. Bump it when the shape changes.
  • Never reuse a cache key for two different response shapes.
  • Add cache key version to your API versioning strategy docs.
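
The smoke test mentioned in the fix is not shown in the incident write-up. One way it could work, as a minimal sketch: record a fingerprint of the field whitelist for each cache key version and fail the build when the current whitelist no longer matches the recorded one. All names here (`RECORDED_SHAPES`, `fingerprint`, `assertVersionMatchesShape`) are hypothetical, not part of the original codebase.

```javascript
// Hypothetical smoke test: fails when the response shape changes
// but the cache key version was not bumped.

const CACHE_KEY_VERSION = 'v2';

// The whitelist that defines the cached response shape (illustrative).
const HOME_SCREEN_USER_FIELDS = ['userId', 'displayName', 'profileImageUrl', 'loyaltyTier'];

// Updated by hand in the same commit that bumps CACHE_KEY_VERSION.
const RECORDED_SHAPES = {
  v1: 'avatarUrl,displayName,loyaltyTier,userId',
  v2: 'displayName,loyaltyTier,profileImageUrl,userId',
};

// Order-insensitive fingerprint of a field list.
function fingerprint(fields) {
  return [...fields].sort().join(',');
}

function assertVersionMatchesShape(version, fields, recordedShapes) {
  const expected = recordedShapes[version];
  const actual = fingerprint(fields);
  if (expected !== actual) {
    throw new Error(
      `Response shape changed without a cache key bump: ${version} expects [${expected}], got [${actual}]`
    );
  }
}

// Passes today; would throw if someone renamed a field and left the version at v2.
assertVersionMatchesShape(CACHE_KEY_VERSION, HOME_SCREEN_USER_FIELDS, RECORDED_SHAPES);
```

Run as part of CI, a check like this turns "remember to bump the version" from a convention into a failing test.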
Production Debug Guide

Client gets wrong data? Page partially loads? Cache serves stale fields? Here's the diagnosis map.

Symptom: Mobile app gets 404 or partial data. Some services return data, others error.
Action: Check Promise.allSettled usage. If you're using Promise.all, a single downstream failure 503s the whole BFF. Switch to allSettled and classify dependencies as critical vs non-critical.

Symptom: Response contains 40 fields when mobile only needs 4. Payload size is 200KB on 4G.
Action: Check field projection. Are you returning the entire downstream response without stripping fields? Add whitelist projection per endpoint. Mobile home screen should return <50KB.

Symptom: After deploy, some users see old data or missing fields. App crashes.
Action: Check cache key versioning. Did you change the response shape without bumping the cache key version? Redis serves the stale shape until the TTL expires. Add a version number to the cache key.

Symptom: Mobile and web BFFs return different answers for the same business question.
Action: Check for business logic in the BFF. The BFF should only shape data, not compute it. Extract shared logic to a downstream service. Two BFFs shouldn't independently apply discount rules.

Every distributed system eventually hits the same wall: one set of backend microservices, but clients couldn't be more different. A mobile app on 4G cares about payload size and battery drain. A desktop web app wants rich aggregated data in one round trip. A partner integration needs a stable, versioned contract.

Trying to serve all of them from one general-purpose API Gateway is where the pain starts. Your mobile team complains about 40-field responses. Your partner team complains about breaking changes. Your web team complains about N+1 queries.

The Backend for Frontend (BFF) pattern solves this by giving each client its own dedicated backend. This article covers the three rules that make BFF work in production: fan-out with degradation, field projection as a security boundary, and versioned cache keys that keep stale response shapes out of your cache.

Why a Single API Gateway Breaks Down at Scale — The Case for BFF

The naive starting point is a single API Gateway sitting in front of all your microservices. It handles auth, routing, rate limiting, and maybe a bit of response shaping. This works fine for one or two clients with similar data appetites. The cracks appear the moment you ship a mobile app.

Your mobile team starts complaining that the /user/profile endpoint returns 47 fields when they only render 6. They're paying for bandwidth on every response, parsing data they discard, and your API is throttled by the slowest downstream service even when the mobile screen only needs data from the fastest one. Meanwhile the web team adds a field, breaks the mobile contract, and you spend a week arguing about backward compatibility.

The core problem is impedance mismatch: your backend services model the domain, but your clients model the user experience. Those are genuinely different shapes. A BFF is the translation layer that converts domain model responses into UX-optimised payloads, per client. Critically, the team that owns the frontend also owns its BFF. This is the sociotechnical insight that makes BFF work — Conway's Law turned to your advantage. The mobile team controls the mobile BFF and can iterate it independently without negotiating with the web team or the core services team.

BFF_Architecture_Overview.txt · TEXT
┌─────────────────────────────────────────────────────────┐
│                     CLIENT LAYER                         │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  iOS/Android │  │  React Web   │  │  Partner API  │  │
│  │  Mobile App  │  │  Dashboard   │  │  Consumer     │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬────────┘  │
└─────────┼────────────────┼─────────────────┼────────────┘
          │                │                 │
          ▼                ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                   BFF LAYER                              │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐ │
│  │  Mobile BFF  │  │   Web BFF    │  │  Partner BFF  │ │
│  │  (Node.js)   │  │  (Node.js)   │  │  (Node.js)    │ │
│  │              │  │              │  │               │ │
│  │ - Compresses │  │ - Aggregates │  │ - Versioned   │ │
│  │   payloads   │  │   multi-svc  │  │   contracts   │ │
│  │ - Offline    │  │ - SSE/WS     │  │ - OAuth2      │ │
│  │   delta sync │  │   support    │  │   scoping     │ │
│  └──────┬───────┘  └──────┬───────┘  └──────┬────────┘ │
└─────────┼────────────────┼─────────────────┼───────────┘
          │                │                 │
          └────────────────┴─────────────────┘
                           │
          ┌────────────────▼────────────────┐
          │        INTERNAL SERVICE MESH     │
          │  ┌───────────┐  ┌────────────┐  │
          │  │  User Svc │  │ Order Svc  │  │
          │  └───────────┘  └────────────┘  │
          │  ┌───────────┐  ┌────────────┐  │
          │  │Product Svc│  │Inventory   │  │
          │  └───────────┘  │Svc         │  │
          │                 └────────────┘  │
          └─────────────────────────────────┘

KEY INSIGHT: Each BFF is owned by the frontend team that uses it.
The internal services have no knowledge of client-specific concerns.
▶ Output
Architecture diagram showing three BFFs (Mobile, Web, Partner) each consuming
the same downstream microservices but exposing client-optimised interfaces.
No client talks directly to an internal service.
🔥Conway's Law as a Feature
BFF deliberately aligns team ownership with service boundaries. The team that suffers the pain of a bad API shape is the same team that can fix it, with no cross-team negotiation required. That is a large part of why teams that adopt BFF tend to iterate faster on the frontend.
📊 Production Insight
A company had a single API Gateway that served mobile, web, and partner clients. The partner team needed a stable, versioned contract that never changed. The web team needed to add fields weekly. The mobile team needed smaller payloads.
Every change required coordinating three teams and a two-week release cadence. The API Gateway became the bottleneck.
After moving to BFF per client: mobile team deploys their BFF 3 times per week, web team deploys daily, partner BFF changes twice per year. No coordination required.
Rule: BFF is an organisational pattern as much as a technical one. If your teams can't deploy independently, you're missing the point.
🎯 Key Takeaway
API Gateway = cross-cutting concerns (auth, rate limiting).
BFF = client-specific aggregation and shaping. Owned by frontend team.
If all clients need the same shape, BFF is overkill.
If clients diverge, BFF per client is the organisational win.
API Gateway vs BFF vs GraphQL — Which Pattern?
If: One client type, same data shape for all, small team (<5 engineers)
Use: Single API Gateway with response caching. BFF adds unnecessary cost.

If: One flexible client (web SPA) that knows what fields it needs
Use: GraphQL BFF. Plan DataLoader from day one to avoid N+1 queries.

If: Multiple distinct client surfaces (mobile, web, partner) with separate teams
Use: BFF per client surface. Each team owns and deploys their own BFF.

If: Startup with 2 engineers, 1 client, uncertain future
Use: Simple monolith or single API. Add BFF when the second client arrives.

Building a Production-Grade Mobile BFF in Node.js — Aggregation, Auth, and Error Normalisation

A BFF has three primary jobs: aggregate calls to multiple downstream services into one client request, transform response shapes to match what the UI actually renders, and normalise errors so the client gets consistent, actionable error payloads regardless of which downstream service failed.
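
The endpoint later in this section imports a `buildErrorEnvelope` helper without showing it. A minimal sketch of what such a normaliser might look like follows; the envelope layout, the `classifyDownstreamError` helper, and the assumption that the HTTP client sets `error.status` are all illustrative, not the article's actual implementation.

```javascript
// Hypothetical error normaliser: every failure leaves the BFF in one shape.

function buildErrorEnvelope({ code, message, traceId, retryable = false }) {
  return {
    error: { code, message, retryable },
    meta: { traceId: traceId ?? null, generatedAt: new Date().toISOString() },
  };
}

// One possible mapping from a downstream failure to a client-facing status.
// Assumes the HTTP client attaches `status`; timeouts and network errors have none.
function classifyDownstreamError(serviceName, error) {
  const status = error?.status;
  if (status === 404) {
    return { httpStatus: 404, code: `${serviceName}_NOT_FOUND`, retryable: false };
  }
  if (status === undefined || status === 429 || status >= 500) {
    return { httpStatus: 503, code: `${serviceName}_UNAVAILABLE`, retryable: true };
  }
  // Any other 4xx means the BFF sent a bad request downstream; retrying won't help.
  return { httpStatus: 502, code: `${serviceName}_BAD_RESPONSE`, retryable: false };
}

// Usage: a timeout from the user service becomes a retryable 503 envelope.
const classified = classifyDownstreamError('USER_PROFILE', new Error('socket timeout'));
const envelope = buildErrorEnvelope({
  code: classified.code,
  message: 'Could not load your profile. Please try again.',
  traceId: 'abc-123-xyz',
  retryable: classified.retryable,
});
console.log(classified.httpStatus, envelope.error.code);
```

The point of the single envelope is that the mobile client needs exactly one error-handling path, whichever downstream service failed.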

Authentication lives in the BFF too. The mobile client sends a JWT or session token to the BFF; the BFF validates it and then uses a machine-to-machine credential (service account, mTLS cert, or internal API key) when calling downstream services. This keeps internal service auth completely hidden from the client — a critical security boundary.
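
The credential swap can be illustrated with a small header builder. This is a sketch under stated assumptions: the `X-User-Id` header name, the `FORWARDED_HEADERS` whitelist, and the bearer-token format for the service account are hypothetical choices, and JWT validation itself is assumed to have already happened in middleware.

```javascript
// Hypothetical helper: builds headers for a downstream call so that the
// client's own credentials never cross the BFF boundary.

// Whitelist of client headers that are safe to pass through (illustrative).
const FORWARDED_HEADERS = ['accept-language'];

function buildDownstreamHeaders(incomingHeaders, { serviceToken, userId, traceId }) {
  const headers = {};
  for (const [name, value] of Object.entries(incomingHeaders)) {
    if (FORWARDED_HEADERS.includes(name.toLowerCase())) {
      headers[name.toLowerCase()] = value;
    }
  }
  // Machine-to-machine credential replaces the client's token.
  headers['authorization'] = `Bearer ${serviceToken}`;
  // Identity comes from the already-validated JWT, never from raw client input.
  headers['x-user-id'] = userId;
  headers['x-trace-id'] = traceId;
  return headers;
}

// Usage: the client's JWT and cookies are dropped; the service token goes downstream.
const downstreamHeaders = buildDownstreamHeaders(
  { Authorization: 'Bearer client-jwt', Cookie: 'session=abc', 'Accept-Language': 'en-GB' },
  { serviceToken: 'svc-token', userId: 'usr_9821', traceId: 'abc-123-xyz' }
);
console.log(downstreamHeaders);
```

Using a whitelist rather than a blacklist mirrors the field projection rule: anything not explicitly allowed stays inside the boundary.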

The code below is a production-representative Node.js BFF endpoint for a mobile home screen. It fans out to three services in parallel using Promise.allSettled (not Promise.all; that distinction matters enormously in production), applies field projection to reduce payload size, and returns a normalised error envelope if the critical dependency (the user profile) fails. Non-critical failures degrade to empty sections instead. Every decision here has a reason.

MobileBFF_HomeScreen.js · JAVASCRIPT
// Mobile BFF — Home Screen Aggregation Endpoint
// Owned by: Mobile Platform Team
// Downstream deps: User Service, Order Service, Recommendation Service

import express from 'express';
import { verifyMobileJwt } from './auth/jwtValidator.js';
import { fetchUserProfile } from './clients/userServiceClient.js';
import { fetchRecentOrders } from './clients/orderServiceClient.js';
import { fetchRecommendations } from './clients/recommendationServiceClient.js';
import { projectFields } from './utils/fieldProjector.js';
import { buildErrorEnvelope } from './utils/errorNormaliser.js';

const router = express.Router();

// ─────────────────────────────────────────────────────────────
// FIELD PROJECTION MAPS
// These define EXACTLY what the mobile home screen renders.
// If a field isn't in this map, it never leaves the BFF.
// This is your first line of defence against over-fetching.
// ─────────────────────────────────────────────────────────────
const MOBILE_USER_FIELDS = ['userId', 'displayName', 'avatarUrl', 'loyaltyTier'];
const MOBILE_ORDER_FIELDS = ['orderId', 'status', 'estimatedDelivery', 'itemCount'];
const MOBILE_RECO_FIELDS  = ['productId', 'thumbnailUrl', 'title', 'priceFormatted'];

// ─────────────────────────────────────────────────────────────
// AUTH MIDDLEWARE
// Validates the mobile JWT. On success, attaches decoded payload
// to req.authenticatedUser so downstream handlers don't re-verify.
// The BFF then calls internal services with a SERVICE_ACCOUNT_TOKEN
// — the client never sees or needs internal credentials.
// ─────────────────────────────────────────────────────────────
router.use(verifyMobileJwt);

// ─────────────────────────────────────────────────────────────
// GET /mobile/v1/home
// Returns a single aggregated payload for the mobile home screen.
// Designed for: < 50KB response, < 500ms p95 on 4G.
// ─────────────────────────────────────────────────────────────
router.get('/v1/home', async (req, res) => {
  const { userId } = req.authenticatedUser; // populated by verifyMobileJwt middleware
  const requestStartTime = Date.now();

  // ── PARALLEL FAN-OUT ──────────────────────────────────────
  // We use Promise.allSettled instead of Promise.all.
  // Promise.all would FAIL ENTIRELY if recommendations are down.
  // Promise.allSettled lets us return partial data gracefully —
  // the home screen can still render without recommendations.
  const [userResult, ordersResult, recoResult] = await Promise.allSettled([
    fetchUserProfile(userId),
    fetchRecentOrders(userId, { limit: 3 }),        // mobile only shows 3
    fetchRecommendations(userId, { limit: 6 }),     // 2-column grid = 6 tiles
  ]);

  // ── CRITICAL DEPENDENCY CHECK ─────────────────────────────
  // User profile is non-negotiable. If it fails, the home screen
  // cannot render at all. Return a normalised 503 immediately.
  if (userResult.status === 'rejected') {
    const errorEnvelope = buildErrorEnvelope({
      code:    'USER_PROFILE_UNAVAILABLE',
      message: 'Could not load your profile. Please try again.',
      traceId: req.traceId,   // propagated from upstream via X-Trace-Id header
      retryable: true,
    });
    return res.status(503).json(errorEnvelope);
  }

  // ── NON-CRITICAL DEPENDENCY DEGRADATION ──────────────────
  // Orders or recommendations being unavailable degrades gracefully.
  // We log the failure for alerting but don't blow up the response.
  const recentOrders = ordersResult.status === 'fulfilled'
    ? projectFields(ordersResult.value.orders, MOBILE_ORDER_FIELDS)
    : [];  // empty array tells the UI to render the 'no recent orders' state

  const recommendations = recoResult.status === 'fulfilled'
    ? projectFields(recoResult.value.items, MOBILE_RECO_FIELDS)
    : [];  // UI renders a placeholder skeleton instead of crashing

  // ── LOG DEGRADED DEPENDENCIES ────────────────────────────
  // In production: emit a metric here (e.g. StatsD/Prometheus counter)
  // so your on-call team sees recommendation-service degradation
  // on the dashboard before users start complaining.
  if (ordersResult.status === 'rejected') {
    console.error('[MobileBFF] Order service degraded', {
      userId,
      reason: ordersResult.reason?.message,
      traceId: req.traceId,
    });
  }
  if (recoResult.status === 'rejected') {
    console.error('[MobileBFF] Recommendation service degraded', {
      userId,
      reason: recoResult.reason?.message,
      traceId: req.traceId,
    });
  }

  // ── RESPONSE PROJECTION ───────────────────────────────────
  // projectFields strips every key not in the MOBILE_*_FIELDS arrays.
  // The user service returns ~40 fields. We expose 4.
  // This is not just bandwidth — it prevents accidentally leaking
  // internal fields like 'fraudScore' or 'internalSegmentTag'.
  const projectedUser = projectFields(userResult.value, MOBILE_USER_FIELDS);

  // ── RESPONSE ENVELOPE ─────────────────────────────────────
  // Single, consistent response shape. The mobile app team defined
  // this contract — they own the BFF so they own the contract.
  const responsePayload = {
    meta: {
      traceId:       req.traceId,
      generatedAt:   new Date().toISOString(),
      latencyMs:     Date.now() - requestStartTime,
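      // Derived from settled status, not array length: an empty list can be a healthy result.
      degraded:      ordersResult.status === 'rejected' || recoResult.status === 'rejected',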
    },
    user:            projectedUser,
    recentOrders,
    recommendations,
  };

  // ── CACHE HEADERS FOR CDN/MOBILE CACHE ───────────────────
  // Home screen data is user-specific, so it must never be publicly cacheable.
  // `private` stops shared caches (CDNs, proxies) from storing the response.
  // max-age=30 lets the mobile client reuse it for 30 seconds on navigation back.
  res.set('Cache-Control', 'private, max-age=30');
  return res.status(200).json(responsePayload);
});

export default router;
▶ Output
// Successful response (all services healthy):
{
  "meta": {
    "traceId": "abc-123-xyz",
    "generatedAt": "2024-11-15T09:32:11.204Z",
    "latencyMs": 187,
    "degraded": false
  },
  "user": {
    "userId": "usr_9821",
    "displayName": "Sarah K.",
    "avatarUrl": "https://cdn.example.com/avatars/usr_9821.webp",
    "loyaltyTier": "GOLD"
  },
  "recentOrders": [
    { "orderId": "ord_771", "status": "OUT_FOR_DELIVERY", "estimatedDelivery": "Today, 2–4 PM", "itemCount": 3 }
  ],
  "recommendations": [
    { "productId": "prd_441", "thumbnailUrl": "...", "title": "Wireless Charger", "priceFormatted": "$29.99" }
  ]
}

// Degraded response (recommendation service down):
{
  "meta": { "latencyMs": 203, "degraded": true, ... },
  "user": { ... },
  "recentOrders": [ ... ],
  "recommendations": [] // UI renders skeleton, no crash
}
⚠ Promise.all vs Promise.allSettled
Using Promise.all for BFF fan-out means a flaky recommendations service takes your entire home screen down at 3am. Promise.allSettled lets you classify dependencies as critical vs non-critical and degrade gracefully. Classify before you code — write it down in a comment next to every downstream call.
📊 Production Insight
A BFF using Promise.all failed every time the ad-service (97% uptime) returned an error. The home screen 503'd for 3% of requests. Users saw blank screens. The team spent months debugging 'intermittent 503s'.
Root cause: one flaky non-critical dependency was taking down the whole response.
Fix: Changed to Promise.allSettled. Ads service failure now logs an error and returns an empty array. Home screen renders perfectly without ads.
Rule: Every downstream call is either critical or non-critical. Write that classification in a comment. Use Promise.allSettled for all fan-out. Fail the request only when a critical dependency fails.
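
The classification rule can also live in data rather than in comments. The sketch below is one way to do that: a hypothetical `fanOut` helper (not part of the article's code) takes a map of dependencies, each marked critical or non-critical with a fallback, and runs them through Promise.allSettled.

```javascript
// Hypothetical fan-out helper: the critical/non-critical classification is
// declared next to each dependency instead of being scattered through the handler.

async function fanOut(dependencies) {
  const names = Object.keys(dependencies);
  // Wrapping in Promise.resolve().then() also captures synchronous throws.
  const settled = await Promise.allSettled(
    names.map((name) => Promise.resolve().then(() => dependencies[name].load()))
  );

  const data = {};
  const degraded = [];
  names.forEach((name, index) => {
    const result = settled[index];
    const dependency = dependencies[name];
    if (result.status === 'fulfilled') {
      data[name] = result.value;
    } else if (dependency.critical) {
      // A critical failure aborts the whole request; the route maps this to a 503.
      throw new Error(`Critical dependency failed: ${name}`, { cause: result.reason });
    } else {
      degraded.push(name);
      data[name] = dependency.fallback;
    }
  });
  return { data, degraded };
}

// Usage with stubbed services: ads are down, the page still renders.
fanOut({
  user: { critical: true, load: async () => ({ userId: 'usr_9821' }) },
  ads: { critical: false, fallback: [], load: async () => { throw new Error('ads down'); } },
}).then(({ data, degraded }) => console.log(data, degraded));
```

Because `degraded` lists the failed dependency names, it can feed both the `meta.degraded` flag and the per-service metric without re-inspecting each settled result.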
🎯 Key Takeaway
BFF does three things: aggregate, transform, normalise errors.
Promise.allSettled + critical/non-critical classification = graceful degradation.
Field projection = whitelist. If field not whitelisted, it never leaves BFF.
Auth at BFF boundary = client sends token, BFF uses service account downstream.

Caching Strategy Inside a BFF — Where to Cache and What Goes Wrong

Caching in a BFF is tricky because BFFs sit at the intersection of user-specific data (never publicly cacheable) and shared domain data (very cacheable). Getting this wrong in either direction causes either stale personalised data (a privacy incident waiting to happen) or completely uncacheable responses that hammer your downstream services.

The right model is layered caching with TTL tiering. Domain data that changes rarely (product catalogue, store locations, feature flags) gets cached aggressively at the BFF level: in-process for ultra-low-latency reads, with Redis as the L2 for multi-instance consistency. User-specific aggregated data should stay out of the BFF cache by default; set accurate Cache-Control headers and let the client cache it locally, where it's scoped to that user's session. If you do cache it server-side, as the per-user home screen example in this section does, scope the key to the user, keep the TTL short, and version the key.
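
The in-process L1 plus shared L2 arrangement might be sketched as follows. This is an illustration only: `TwoTierCache` is a hypothetical class, the L2 store is any object with async `get`/`set` (a Map-backed stand-in here rather than a real Redis client), and the L1 Map has no size bound, which a production version would need.

```javascript
// Hypothetical two-tier cache for shared domain data:
// L1 = per-process Map with a short TTL, L2 = shared store (e.g. Redis) behind an adapter.

class TwoTierCache {
  constructor({ l2, l1TtlMs = 5000, now = () => Date.now() }) {
    this.l1 = new Map();
    this.l2 = l2;
    this.l1TtlMs = l1TtlMs;
    this.now = now; // injectable clock, useful for tests
  }

  async get(key, loadFn) {
    const hit = this.l1.get(key);
    if (hit && hit.expiresAt > this.now()) {
      return { value: hit.value, source: 'l1' };
    }
    const shared = await this.l2.get(key);
    if (shared !== undefined && shared !== null) {
      this.remember(key, shared);
      return { value: shared, source: 'l2' };
    }
    const fresh = await loadFn(); // falls through to the downstream service
    await this.l2.set(key, fresh);
    this.remember(key, fresh);
    return { value: fresh, source: 'origin' };
  }

  remember(key, value) {
    this.l1.set(key, { value, expiresAt: this.now() + this.l1TtlMs });
  }
}

// Usage with a Map-backed stand-in for the shared store.
const sharedStore = new Map();
const storeLocationsCache = new TwoTierCache({
  l2: {
    get: async (key) => sharedStore.get(key),
    set: async (key, value) => { sharedStore.set(key, value); },
  },
});
storeLocationsCache
  .get('catalogue:storeLocations:v1', async () => ['store_1', 'store_2'])
  .then((result) => console.log(result.source));
```

Keeping the L1 TTL much shorter than the L2 TTL bounds how long instances can disagree with each other after the shared value changes.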

The subtler gotcha is cache stampede on the aggregated data. If you cache the home screen response in Redis with a 60-second TTL and you have 100k mobile users, when that cache expires simultaneously you get a thundering herd that fans out across all three downstream services at once. You need either probabilistic early expiration (PER) or a per-user cache key with jittered TTLs.
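
To get a feel for how probabilistic early expiration behaves, it helps to compute the chance that a single request triggers an early recompute. In the form used by the cache layer in this section, a request recomputes when the remaining TTL is less than cost × beta × −ln(U), with U uniform on (0, 1), which works out to a probability of exp(−remainingTtl / (cost × beta)). The function name below is hypothetical; the numbers are only illustrative.

```javascript
// Probability that one request triggers an early recompute under PER,
// given: recompute when remainingTtl < cost * beta * -ln(U), U ~ Uniform(0, 1).
function earlyRecomputeProbability(remainingTtlSeconds, recomputeCostSeconds, beta = 1.0) {
  if (remainingTtlSeconds <= 0) return 1; // already expired: always recompute
  return Math.exp(-remainingTtlSeconds / (recomputeCostSeconds * beta));
}

// With a 250 ms recompute cost, the probability stays negligible until the
// key is within roughly a second of expiry, then climbs quickly.
for (const ttl of [30, 5, 1, 0.5, 0.25]) {
  console.log(`ttl=${ttl}s`, earlyRecomputeProbability(ttl, 0.25).toExponential(3));
}
```

A larger beta, or a larger recompute cost, widens the window in which requests start refreshing early; that is the knob to turn if recomputes still bunch up at expiry.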

And the most common production failure: unversioned cache keys. Your response shape changes (rename a field, change a type), but Redis still serves the old shape until TTL expires. Clients expecting the new field name crash. Version your cache keys. Every time.

BFF_CacheLayer.js · JAVASCRIPT
// BFF Cache Layer — Redis-backed with stampede protection
// Uses probabilistic early recompute (PER) to avoid thundering herd.

import { createClient } from 'redis';

const redisClient = createClient({ url: process.env.REDIS_URL });
await redisClient.connect();

// ─────────────────────────────────────────────────────────────
// PROBABILISTIC EARLY RECOMPUTE (PER)
// Instead of letting every instance race to recompute an expired key,
// we start recomputing early with a probability that increases as
// the TTL approaches 0, so recomputes are spread out instead of synchronised.
// Formula from the academic paper by Vattani et al. (2015):
//   recompute_now = current_time - (recompute_cost * beta * ln(random()))
//                  > expiry_time
// ─────────────────────────────────────────────────────────────
const BETA = 1.0; // tuning parameter; 1.0 is a safe default

async function getOrRecompute({ cacheKey, ttlSeconds, recomputeMs, fetchFn }) {
  // Fetch the raw cached value AND its remaining TTL in one pipeline
  const pipeline = redisClient.multi();
  pipeline.get(cacheKey);
  pipeline.ttl(cacheKey); // returns remaining seconds, -2 if key doesn't exist
  const [cachedJson, remainingTtl] = await pipeline.exec();

  if (cachedJson) {
    const cachedValue = JSON.parse(cachedJson);

    // ── PER EARLY RECOMPUTE CHECK ───────────────────────────
    // Convert recompute cost to seconds for comparison with TTL
    const recomputeCostSeconds = recomputeMs / 1000;

    // Math.log returns a negative number for 0 < x < 1, so we negate it
    // This gives us a positive 'recompute window' proportional to cost
    const earlyRecomputeWindow = recomputeCostSeconds * BETA * -Math.log(Math.random());

    const shouldRecomputeEarly = remainingTtl < earlyRecomputeWindow;

    if (!shouldRecomputeEarly) {
      // Cache hit — return immediately without touching downstream services
      return { data: cachedValue, fromCache: true, remainingTtl };
    }
    // Falls through to recompute — probabilistic, so only some instances do this
  }

  // ── CACHE MISS OR EARLY RECOMPUTE ────────────────────────
  console.info(`[BFFCache] Recomputing: ${cacheKey}`);
  const freshData = await fetchFn(); // calls the actual aggregation logic

  // Store with a jittered TTL to prevent synchronised mass expiration.
  // Without jitter: all 100k user caches expire at :00 every minute.
  // With jitter: expiry is spread across 45–75 seconds.
  const jitterSeconds = Math.floor(Math.random() * 31) - 15; // integer in [-15, +15]
  const effectiveTtl  = Math.max(1, ttlSeconds + jitterSeconds); // Redis rejects a TTL <= 0

  await redisClient.set(cacheKey, JSON.stringify(freshData), {
    EX: effectiveTtl, // sets TTL in seconds
  });

  return { data: freshData, fromCache: false, remainingTtl: effectiveTtl };
}

// ─────────────────────────────────────────────────────────────
// FIELD PROJECTOR
// Strips all keys not in the allowedFields array.
// Works on both single objects and arrays of objects.
// This is a whitelist approach — safer than a blacklist.
// ─────────────────────────────────────────────────────────────
export function projectFields(input, allowedFields) {
  if (Array.isArray(input)) {
    return input.map(item => projectFields(item, allowedFields));
  }
  // Object.fromEntries + filter = clean, readable field projection
  return Object.fromEntries(
    Object.entries(input).filter(([key]) => allowedFields.includes(key))
  );
}

// ─────────────────────────────────────────────────────────────
// USAGE EXAMPLE — How the home screen route uses the cache layer
// ─────────────────────────────────────────────────────────────
export async function getCachedHomeScreenData(userId, aggregateFn) {
  const cacheKey = `mobile:homescreen:v2:${userId}`; // versioned key!
  // If you change the response shape, bump v2 → v3 to avoid stale
  // shape mismatches. Unversioned cache keys are a production horror.

  return getOrRecompute({
    cacheKey,
    ttlSeconds:   60,   // 60s base TTL, ±15s jitter applied inside
    recomputeMs:  250,  // estimated cost of the aggregation fan-out
    fetchFn:      () => aggregateFn(userId),
  });
}
▶ Output
// Cache miss (first request for this user):
[BFFCache] Recomputing: mobile:homescreen:v2:usr_9821
{ data: { ...homeScreenPayload }, fromCache: false, remainingTtl: 53 }

// Cache hit (subsequent requests within TTL window):
{ data: { ...homeScreenPayload }, fromCache: true, remainingTtl: 47 }

// PER early recompute triggered (TTL low, probability fires):
[BFFCache] Recomputing: mobile:homescreen:v2:usr_9821
// ↑ the request that triggers the recompute awaits fetchFn and returns fresh data;
// other requests keep getting the cached value until the key is overwritten
💡Version Your BFF Cache Keys
When you change the projected fields in a BFF response (e.g., rename 'avatarUrl' to 'profileImageUrl'), stale Redis values with the old shape will be served until TTL expires. Version your cache keys: 'mobile:homescreen:v2:userId'. Bumping to v3 instantly invalidates all old entries with zero downtime and no cache flush command needed.
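
One way to make the version hard to forget is to route every key through a single builder that refuses to produce an unversioned key. The helper below is a hypothetical sketch (`buildCacheKey` and `HOME_SCREEN_SCHEMA_VERSION` are illustrative names); the key layout follows the 'mobile:homescreen:v2:userId' convention used above.

```javascript
// Hypothetical key builder: the schema version is a required argument,
// so an unversioned key cannot be created by accident.

const HOME_SCREEN_SCHEMA_VERSION = 2; // bump when the response shape changes

function buildCacheKey({ client, resource, schemaVersion, id }) {
  if (!Number.isInteger(schemaVersion) || schemaVersion < 1) {
    throw new Error(`Cache key for ${client}:${resource} needs a positive integer schemaVersion`);
  }
  return `${client}:${resource}:v${schemaVersion}:${id}`;
}

const homeScreenKey = buildCacheKey({
  client: 'mobile',
  resource: 'homescreen',
  schemaVersion: HOME_SCREEN_SCHEMA_VERSION,
  id: 'usr_9821',
});
console.log(homeScreenKey);
```

Old entries under the previous version are simply never read again and fall out of Redis when their TTL expires.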
📊 Production Insight
A team deployed a BFF change that renamed 'avatarUrl' to 'profileImageUrl'. The cache key was unversioned: 'mobile:homescreen:userId'. Redis served the old shape for 60 seconds after deploy. Mobile clients expecting the new field name crashed.
The team rolled back the deploy. The incident post-mortem revealed they had no cache invalidation strategy for schema changes.
Fix: Added a version number to all cache keys, tied to the response shape schema. Any deploy that changes the shape now bumps the version; old keys are ignored and new keys are populated with the new shape.
Rule: Cache key version is independent of deploy version. Bump it manually when response shape changes. Test that old clients (still on old version) get old shape from cache, not a mix.
🎯 Key Takeaway
User-specific data: client-side Cache-Control only. Never BFF cache.
Shared domain data: Redis cache with PER + jittered TTL.
Versioned cache keys: 'service:entity:v3:id'. Bump when shape changes.
Unversioned cache + response shape change = stale keys = client crashes.

BFF vs API Gateway vs GraphQL — When Each Pattern Actually Wins

Engineers debate these three patterns constantly, often because they're solving different problems and the differences only become clear under load or at organisational scale.

An API Gateway is infrastructure. It handles cross-cutting concerns — TLS termination, rate limiting, request routing, auth token validation. It should not know what a mobile home screen looks like. When you push field projection, aggregation, or client-specific error handling into a gateway, you've created a shared bottleneck that every team must touch to change anything client-specific.

GraphQL solves the over-fetching problem elegantly for a single client type where the client knows what it wants to ask for. But in practice, mobile clients frequently need to fan out across 4–5 resolvers in a single query, and each resolver carries N+1 query risks unless you implement DataLoader — which adds complexity. GraphQL also surfaces your schema externally, which is a versioning and security surface area problem with partner APIs.
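
The batching idea behind DataLoader can be shown in a few lines. The sketch below is not the DataLoader library and omits its per-request caching and key de-duplication; `createBatchLoader` is a hypothetical helper that collects every key requested in the same tick and issues one downstream call for all of them.

```javascript
// Hypothetical minimal batch loader illustrating the technique DataLoader uses:
// N individual load(key) calls in one tick become a single batchFn(keys) call.

function createBatchLoader(batchFn) {
  let queue = [];

  function flush() {
    const batch = queue;
    queue = [];
    const keys = batch.map((entry) => entry.key);
    // batchFn must return values in the same order as the keys it received.
    Promise.resolve()
      .then(() => batchFn(keys))
      .then(
        (values) => batch.forEach((entry, index) => entry.resolve(values[index])),
        (error) => batch.forEach((entry) => entry.reject(error))
      );
  }

  return function load(key) {
    return new Promise((resolve, reject) => {
      queue.push({ key, resolve, reject });
      if (queue.length === 1) queueMicrotask(flush); // schedule one flush per batch
    });
  };
}

// Usage: three resolver-style lookups, one downstream request.
const loadUser = createBatchLoader(async (userIds) => {
  console.log('one batched call for', userIds);
  return userIds.map((id) => ({ userId: id }));
});
Promise.all([loadUser('usr_1'), loadUser('usr_2'), loadUser('usr_3')])
  .then((users) => console.log(users.length));
```

Without batching, each resolver would issue its own downstream request, which is the N+1 pattern the paragraph above warns about.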

A BFF wins when: (1) different clients have genuinely different data shapes and update frequencies, (2) teams need independent deployment of client-specific logic, (3) you need to hide the internal service topology from clients entirely. The BFF pattern scales organisationally; the cost is an extra service per client surface that must be deployed, monitored, and maintained.

Pattern_Decision_Matrix.txt · TEXT
DECISION FLOWCHART: API Gateway vs BFF vs GraphQL

START
  │
  ├─ Do ALL your clients need the same data shape?
  │    └─ YES → API Gateway with response caching is probably enough.
  │             BFF adds cost without benefit here.
  │
  ├─ Do you have ONE flexible client (web SPA) that knows
  │  what fields it needs at query time?
  │    └─ YES → GraphQL BFF may be the right call.
  │             But plan for DataLoader from day one or
  │             you'll have N+1 queries in production within a week.
  │
  ├─ Do you have MULTIPLE distinct client surfaces
  │  (mobile, web, third-party) with different teams?
  │    └─ YES → BFF per client surface.
  │             Each team owns their BFF.
  │             Deploy independently. Schema evolves independently.
  │
  └─ Are you a startup with 2 engineers and 1 client?
       └─ YES → Monolith or single lightweight API.
               BFF is premature abstraction at this scale.
               Add it when the second client surface arrives.

─────────────────────────────────────────────────────────────
ORGANISATIONAL OWNERSHIP MAPPING
─────────────────────────────────────────────────────────────

  API Gateway     →  Platform/Infra Team owns it
                     (shared, slow to change)

  BFF (Mobile)    →  Mobile Team owns it
                     (fast iteration, team autonomy)

  BFF (Web)       →  Web Frontend Team owns it
                     (fast iteration, team autonomy)

  Core Services   →  Domain Teams own them
                     (stable APIs, domain logic only)

─────────────────────────────────────────────────────────────
PERFORMANCE CHARACTERISTICS UNDER LOAD
─────────────────────────────────────────────────────────────

  Single API Gateway (aggregation pushed into gateway):
  - One bottleneck for all clients
  - Any client's traffic pattern affects all others
  - Horizontal scaling scales for everyone, wastefully

  Dedicated BFF per client:
  - Mobile BFF scales independently of web traffic spikes
  - Web BFF can use larger instances (web pays for richer data)
  - Mobile BFF can use smaller, cheaper instances (smaller payloads)
  - Failure in web BFF doesn't affect mobile availability
▶ Output
Decision matrix output is textual/architectural.
Use this during system design interviews to structure your answer.
Examiners respond well to explicit trade-off analysis.
🔥Interview Gold: The BFF + GraphQL Hybrid
You can combine them: put a GraphQL BFF in front of a web React client (for flexible query composition) while keeping a REST BFF for mobile (for predictable payload size and HTTP caching semantics). This is increasingly common in large-scale production systems. Knowing this shows interviewers you think in trade-offs, not dogma.
📊 Production Insight
A company adopted GraphQL as a single BFF for both mobile and web. Mobile clients loved the flexible querying. But they started seeing high latency on 4G. Each mobile query triggered 5-10 resolver calls, each to a different downstream service. Without DataLoader, they had N+1 query problems.
Web team was fine. Mobile team suffered. A single GraphQL schema couldn't satisfy both.
Fix: Split into two BFFs. Mobile kept REST BFF with purpose-built endpoints and aggressive field projection. Web kept GraphQL BFF with DataLoader. Each team optimises for their own latency and payload constraints.
Rule: One BFF to rule them all is a myth. If clients have different performance requirements (mobile vs web), give them different BFF implementations.
🎯 Key Takeaway
API Gateway = infrastructure (auth, rate limiting, TLS). Not aggregation.
GraphQL = flexible query for one client type. DataLoader mandatory.
BFF = per client, owned by frontend team. Deploy independently.
BFF + GraphQL hybrid = common in large orgs. Mobile gets REST, web gets GraphQL.
🗂 API Gateway vs BFF vs GraphQL
Trade-offs you need to explain in system design interviews
| Aspect | API Gateway | BFF (per client) | GraphQL (single BFF) |
| --- | --- | --- | --- |
| Team Ownership | Platform/Infra team (shared) | Frontend team (autonomous) | Frontend or API team |
| Deployment Frequency | Slow: shared risk surface | Fast: independent per client | Medium: schema changes require coordination |
| Over-fetching Prevention | Manual field filtering, brittle | Field projection per client | Client-driven query selection |
| Aggregation of Services | Possible but anti-pattern | Core use case | Via resolvers + DataLoader |
| N+1 Query Risk | None (routing only) | None: BFF fan-out is explicit | High if DataLoader is skipped |
| Payload Optimisation | One-size-fits-all | Per client (mobile gets ~90% smaller payloads) | Client chooses fields, variable |
| HTTP Caching Semantics | Full CDN + Cache-Control support | Full CDN + Cache-Control support | POST requests are not CDN-cacheable by default |
| Schema Versioning | API versioning via path (/v1, /v2) | Route versioning per BFF | Schema evolution with @deprecated directives |
| Fault Isolation | Gateway failure = all clients down | BFF failure = one client surface down | GraphQL BFF failure = all clients down |
| Cold Start / Infra Cost | Single service, low infra cost | N services, higher infra cost | Single service, medium cost |
| Best for | Auth, routing, rate limiting | Multiple distinct client surfaces | One flexible client with varying data needs |

🎯 Key Takeaways

  • BFF = per client, owned by frontend team. Deployed independently. No business logic — only aggregation, transformation, error normalisation.
  • Promise.allSettled over Promise.all. Classify every dependency as critical or non-critical. One flaky non-critical service should never 503 the page.
  • Field projection = whitelist. If a field isn't whitelisted, it never leaves the BFF. Protects bandwidth and prevents internal data leaks.
  • Versioned cache keys: 'service:entity:v3:userId'. Bump version when response shape changes. Unversioned keys = stale fields = client crashes.
  • API Gateway = infrastructure (auth, rate limiting). GraphQL = flexible queries for one client. BFF = per-client shaping. They solve different problems.
  • BFFs should never call other BFFs. Chain of BFF calls destroys independence and creates failure cascades. Call domain services directly.
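The field-projection takeaway can be sketched as a small whitelist helper. This is an illustration under assumptions, not the article's implementation: `project`, `fullUser`, and `MOBILE_USER_FIELDS` are hypothetical names, and the user fields shown are invented for the example.

```typescript
// Copy only whitelisted fields; anything not listed never leaves the BFF.
function project<T extends Record<string, unknown>, K extends keyof T>(
  source: T,
  allowed: readonly K[],
): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of allowed) {
    if (key in source) {
      out[key] = source[key];
    }
  }
  return out;
}

// Hypothetical upstream record; a real user service would return many more fields.
const fullUser = {
  id: "u1",
  displayName: "Ada",
  avatarUrl: "https://example.com/a.png",
  tier: "gold",
  passwordHash: "not-for-clients",
  internalRiskScore: 0.9,
};

// The mobile whitelist: four fields, everything else is dropped.
const MOBILE_USER_FIELDS = ["id", "displayName", "avatarUrl", "tier"] as const;

const mobileUser = project(fullUser, MOBILE_USER_FIELDS);
```

Because the helper copies by whitelist rather than deleting by blacklist, a field added upstream later stays internal until someone deliberately adds it to the list.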

⚠ Common Mistakes to Avoid

    Putting business logic into the BFF
    Symptom

    The BFF starts making pricing calculations, applying discount rules, or running validation that belongs in domain services. You notice the mobile BFF and web BFF have diverged in logic and are now giving different answers for the same business question.

    Fix

    BFF does exactly three things — aggregate, transform, normalise. Any logic that could change the meaning of data belongs in a downstream service. The BFF only changes the shape of data.

    Using Promise.all for all downstream fan-out
    Symptom

    A single flaky service (recommendations, ads, banners) takes the entire page down at 2am. Incident reports show 503s across the board even though 4 of 5 services were healthy.

    Fix

    Classify every downstream dependency as either critical (page cannot render without it) or non-critical (page degrades gracefully without it). Use Promise.allSettled for all fan-out and only throw a 503 when a critical dependency rejects.
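A minimal sketch of that classification, assuming each dependency is described by a small record. The `Dependency` and `Aggregated` types, the `aggregate` function, and the error envelope shape are hypothetical; only `Promise.allSettled` is a real built-in.

```typescript
type Dependency<T> = {
  name: string;
  critical: boolean;           // page cannot render without it
  call: () => Promise<T>;      // the downstream request
  fallback: T;                 // used only when a non-critical call fails
};

type Aggregated = {
  status: 200 | 503;
  body: Record<string, unknown>;
  degraded: string[];          // non-critical dependencies that failed
};

async function aggregate(deps: Dependency<unknown>[]): Promise<Aggregated> {
  // allSettled never rejects: every call reports 'fulfilled' or 'rejected'.
  const results = await Promise.allSettled(deps.map((dep) => dep.call()));
  const body: Record<string, unknown> = {};
  const degraded: string[] = [];

  for (let i = 0; i < deps.length; i++) {
    const dep = deps[i];
    const result = results[i];
    if (result.status === "fulfilled") {
      body[dep.name] = result.value;
    } else if (dep.critical) {
      // Only a failed critical dependency turns the response into a 503.
      return { status: 503, body: { error: `${dep.name} unavailable` }, degraded };
    } else {
      body[dep.name] = dep.fallback;
      degraded.push(dep.name);
    }
  }
  return { status: 200, body, degraded };
}
```

The `degraded` list is there so the caller can emit a metric or warning log for each non-critical failure without failing the request.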

    Unversioned cache keys after a response shape change
    Symptom

    You deploy a new BFF version that renames a field (e.g., 'imageUrl' becomes 'thumbnailUrl'), but Redis is still serving the old shape for up to 60 seconds. Mobile clients that deployed expecting the new field name see 'undefined' and render broken UI.

    Fix

    Always include a schema version in your Redis cache key: 'mobile:homescreen:v3:userId'. When your response shape changes, bump the version number. Old keys expire naturally; new requests populate v3 keys immediately.
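The version bump can be sketched with an in-memory `Map` standing in for Redis, so the example runs without a server. `HOMESCREEN_SCHEMA_VERSION`, `homescreenCacheKey`, `fakeRedis`, and `readHomescreen` are hypothetical names; the key format follows the one shown above.

```typescript
// Bump this whenever the cached response shape changes (e.g. a field rename).
const HOMESCREEN_SCHEMA_VERSION = 3;

function homescreenCacheKey(userId: string, version: number = HOMESCREEN_SCHEMA_VERSION): string {
  return `mobile:homescreen:v${version}:${userId}`;
}

// In-memory stand-in for Redis (assumption: a real client would also set a TTL).
const fakeRedis = new Map<string, string>();

// Entry written by the previous deploy, old shape, still inside its TTL.
fakeRedis.set(homescreenCacheKey("42", 2), JSON.stringify({ imageUrl: "a.png" }));

// The new deploy only ever reads the current version's key.
function readHomescreen(userId: string): { thumbnailUrl: string } | null {
  const cached = fakeRedis.get(homescreenCacheKey(userId));
  return cached === undefined ? null : JSON.parse(cached);
}
```

Here `readHomescreen("42")` misses, because the v3 key has not been written yet, so the BFF rebuilds the response in the new shape instead of serving the stale `imageUrl` entry. The v2 entry is simply never read again and ages out on its own.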

    Storing user-specific data in shared cache without versioning or user isolation
    Symptom

    User A sees User B's home screen data. Privacy incident. Cached data from one user's request served to another user because cache key didn't include userId.

    Fix

    Cache key for user-specific data MUST include userId or sessionId. 'mobile:homescreen:v2:userId'. Never omit the user identifier. Never use a generic key that could serve one user's data to another.
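One way to make the rule hard to break is to route every user-specific key through a helper that refuses to build a key without a user identifier. This is a sketch; `userScopedKey` and its parameters are hypothetical.

```typescript
// Builds keys like 'mobile:homescreen:v2:<userId>' and fails loudly
// when the user identifier is missing, instead of producing a shared key.
function userScopedKey(surface: string, view: string, version: number, userId: string): string {
  if (userId.trim() === "") {
    throw new Error("userScopedKey: userId is required for user-specific cache entries");
  }
  return `${surface}:${view}:v${version}:${userId}`;
}

const keyForUserA = userScopedKey("mobile", "homescreen", 2, "user-a");
const keyForUserB = userScopedKey("mobile", "homescreen", 2, "user-b");
```

Two users can never share an entry because their keys differ, and a code path that forgot to pass the identifier throws at the call site rather than caching under a generic key.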

    BFF calling another BFF
    Symptom

    Mobile BFF calls Web BFF for aggregated data. Now you have a chain of BFF calls. Any change in Web BFF affects Mobile BFF. Team coordination returns. Failure cascade: Web BFF down takes Mobile BFF down.

    Fix

    BFFs should never call other BFFs. They should only call internal domain services or the API Gateway. If two BFFs need the same aggregated data, extract that aggregation into a shared downstream service.

Interview Questions on This Topic

  • Q: You have a mobile app, a web dashboard, and a partner API all consuming the same microservices. How would you decide whether to use a single API Gateway with response shaping versus separate BFFs? Walk me through the trade-offs. (Senior)
    Decision factors:
    1. Data shape divergence: Does mobile need different fields than web? Mobile lives on 4G; every kilobyte matters. Web has fibre and can afford richer payloads. If payloads are similar, an API Gateway may suffice. If mobile needs 4 fields and web needs 40, BFF wins.
    2. Update frequency: Mobile deploys weekly, the partner API changes twice per year. Serving both from one gateway means the gateway changes at the pace of the slowest client (the partner's twice-yearly cadence). A BFF per client lets each team deploy on its own cadence.
    3. Team structure: Separate teams for mobile, web, partner? BFF aligns ownership with team boundaries (Conway's Law). The mobile team owns the mobile BFF. The web team owns the web BFF. No cross-team coordination for client-specific changes.
    4. Fault isolation: If web traffic spikes, does it affect mobile? With a shared API Gateway, yes: web customers retrying failures consume gateway resources and slow mobile down. With separate BFFs, the mobile BFF scales independently.
    Trade-offs: BFF adds more services to deploy, monitor, and maintain. Infrastructure cost is higher (N BFFs vs 1 gateway). Operational complexity increases.
    Verdict: BFF per client when you have multiple client types, separate teams, different data needs, and independent deployment requirements. API Gateway when all clients are similar and team structures are centralised.
  • Q: In your Mobile BFF, you're aggregating data from 5 downstream services. The recommendation service has 99.2% uptime, so it is down for roughly 6 hours per month. How do you design the BFF so that recommendation service failures don't affect mobile home screen availability? (Senior)
    Key insight: The recommendation service is non-critical. The home screen can render without recommendations (show skeletons or hide the section).
    Design:
    1. Classify dependencies: Label each downstream call as critical or non-critical.
       - Critical examples: user profile, authentication. Without these, the page cannot render.
       - Non-critical examples: recommendations, ads, social proof. The page degrades gracefully without them.
    2. Use Promise.allSettled, not Promise.all:
       - Promise.all rejects if ANY promise rejects. The home screen would 503 every time the recommendation service fails.
       - Promise.allSettled waits for all promises to settle, then returns a status ('fulfilled' or 'rejected') for each.
    3. Critical dependency check: After allSettled, check the critical dependencies. If any critical call failed, return 503 with an error envelope.
    4. Non-critical graceful degradation: For non-critical dependencies, default to an empty array, null, or cached stale data. The UI renders without that section.
    5. Log but don't fail: Emit metrics/warning logs for non-critical failures. On-call should be alerted if recommendations fail on more than 5% of requests, but not paged for every single failure.
    6. Optional stale cache fallback: For non-critical services, serve stale cache data while refreshing in the background.
    Result: Mobile home screen availability = product of critical service uptimes (e.g., 99.99% × 99.99% ≈ 99.98%). Non-critical failures are invisible to users. Roughly 6 hours/month of recommendation downtime becomes 0 hours of user-visible downtime.
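The availability arithmetic in this answer can be checked with a few lines. This is a worked sketch that assumes an average month of 730 hours (8760 / 12) and independent failures between critical services; the function names are hypothetical.

```typescript
const HOURS_PER_MONTH = 730; // assumption: average month, 8760 hours / 12

// Expected downtime per month for a service with the given uptime fraction.
function monthlyDowntimeHours(uptime: number): number {
  return (1 - uptime) * HOURS_PER_MONTH;
}

// Availability of a page that needs every critical dependency to be up,
// assuming the dependencies fail independently of each other.
function combinedAvailability(criticalUptimes: number[]): number {
  return criticalUptimes.reduce((acc, uptime) => acc * uptime, 1);
}

const recommendationDowntime = monthlyDowntimeHours(0.992);        // about 5.84 hours
const homeScreenAvailability = combinedAvailability([0.9999, 0.9999]); // about 0.9998
```

Under these assumptions a 99.2% service is down for just under 6 hours a month, and because recommendations are non-critical that figure never enters the product that determines home screen availability.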
  • Q: A candidate says 'we could just use GraphQL and let clients ask for exactly the fields they need, so why would we ever need a BFF?' How do you respond? Where does GraphQL fall short that a dedicated BFF handles better? (Senior)
    Where GraphQL is excellent: One client type (web SPA), flexible queries, client knows the schema, over-fetching solved at query level.
    Where GraphQL falls short and BFF wins:
    1. Mobile performance: GraphQL's flexibility means the client decides query complexity. A deep query could trigger 20 resolvers and 100 downstream calls. A mobile BFF with purpose-built endpoints has predictable latency (<500ms p95 on 4G).
    2. HTTP caching: GraphQL typically uses POST requests (same endpoint, different bodies). HTTP caches (CDNs, browser cache) don't cache POST responses by default. A BFF uses GET for cacheable data and POST for mutations. With GraphQL over POST you give up CDN caching unless you add a workaround such as persisted queries sent over GET.
    3. N+1 queries: GraphQL resolvers are per-field. Without DataLoader, a query for 10 orders, each asking for user details, triggers 1 + 10 calls. A BFF explicitly fans out to exactly the services it needs, with no hidden complexity.
    4. Partner API versioning: Exposing a GraphQL schema to external partners gives them full query flexibility, including introspection queries that expose your entire data model if introspection is left enabled. That's a security and stability risk. A BFF can expose a stable, versioned REST contract.
    5. Team autonomy: A single GraphQL schema is shared across all clients. The mobile team adding a field changes the schema the web team depends on. BFF per client means each team owns their schema. No coordination required.
    The hybrid approach: Many large systems use a GraphQL BFF for web (flexible querying, developer productivity) and a REST BFF for mobile (predictable performance, caching). Each client gets the pattern that fits.
    Response: 'GraphQL is great for web dashboards with power users. For mobile, where every byte and every millisecond matters, I'd choose a REST BFF with field projection and aggressive caching. If we have both, I'd build both.'

Frequently Asked Questions

What is the Backend for Frontend (BFF) pattern in microservices?

The BFF pattern is an architectural approach where you create a dedicated backend service for each distinct client type — typically one BFF for mobile apps, one for web, and one for third-party integrations. Each BFF aggregates calls to multiple internal microservices, projects the response to exactly the fields that client needs, and normalises errors. The key differentiator from a shared API Gateway is team ownership: the frontend team owns and deploys their BFF independently.

When should I NOT use the BFF pattern?

Don't use BFF if you have a single client type, a small team (fewer than 4-5 engineers), or if your clients genuinely need the same data in the same shape. BFF adds a service to deploy, monitor, and maintain — that cost is only justified when you have multiple client surfaces with meaningfully different data needs and separate teams working on them. For early-stage products, a single lightweight API with field filtering is almost always the right call.

Can a BFF call another BFF, or does it only talk to microservices?

BFFs should never call other BFFs — that creates coupling between client surfaces and defeats the entire purpose of isolation. A BFF should only communicate with internal domain services (User Service, Order Service, etc.) and the API Gateway layer above it. If two BFFs need the same aggregated data, the correct answer is to extract that aggregation into a shared downstream service or a common library, not to chain BFF calls together.

Why chaining BFFs is dangerous: Mobile BFF calling Web BFF means Web BFF becomes a dependency for Mobile BFF's availability. Web BFF down → Mobile BFF down. Team coordination returns because changing Web BFF might break Mobile BFF. The entire point of BFF is to eliminate cross-client coupling. Chaining BFFs reintroduces it.

How do you handle authentication in a BFF architecture?

Pattern: Client authenticates with BFF. BFF validates token (JWT, session cookie, API key). BFF then uses a machine-to-machine credential (service account, mTLS certificate, internal API key) to call downstream services.

Why this boundary matters: Downstream services only trust the BFF, not the external client directly. The client never sees internal credentials. The BFF can also enforce client-specific auth policies — mobile might have tighter rate limits than web, partner might have different scopes.

Implementation:
1. BFF receives client token, validates signature/expiry
2. BFF attaches internal credentials (e.g., 'X-Service-Account: mobile-bff') to downstream requests
3. Downstream services authorise based on the BFF's identity, not the original client's

Security benefit: Compromised client token cannot directly call internal services. The BFF is a security boundary and an audit point.
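The boundary described in this answer can be sketched as a function that validates the client token and then builds a fresh set of downstream headers, so the client's own credential is never forwarded. Everything here is an assumption for illustration: `VerifyToken` stands in for a real JWT or session check done with a vetted library, `X-User-Id` is a hypothetical header for passing user context, and in practice the service identity would more likely be carried by mTLS or a signed internal token than by a plain header.

```typescript
type ClientClaims = { userId: string; expiresAtMs: number };

// Hypothetical verifier: returns claims for a valid token, null otherwise.
type VerifyToken = (token: string) => ClientClaims | null;

function buildDownstreamHeaders(
  clientToken: string,
  verify: VerifyToken,
  nowMs: number,
): Record<string, string> {
  const claims = verify(clientToken);
  if (claims === null || claims.expiresAtMs <= nowMs) {
    // The BFF rejects the request before any internal service is contacted.
    throw new Error("401: invalid or expired client token");
  }
  // Fresh header set: the client's Authorization header is deliberately not copied.
  return {
    "X-Service-Account": "mobile-bff",
    "X-User-Id": claims.userId, // hypothetical: user context for downstream lookups
  };
}
```

Building the headers from scratch, rather than copying the incoming ones and adding to them, is what keeps the external token out of the internal network.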

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
