Senior 5 min · March 05, 2026

Strangler Fig: How a Missing Bidirectional Sync Lost Financial Data

With 40% of traffic routed to the new service, recent financial transactions went missing.

Naren · Founder
Quick Answer
  • Intercept traffic at the edge proxy, route individual features to new services
  • Old system stays live until all functionality is migrated — no big bang
  • Risk is bounded to the slice currently being migrated, not the entire system
  • Data sync between old and new is the hardest part — expect weeks of reconciliation
  • Rollback means flipping traffic back to legacy, cheap and fast
  • Biggest mistake: migrating the database before the service — you'll need dual writes
Plain-English First

Imagine a giant old oak tree in your garden. A strangler fig vine wraps around it, growing its own roots and branches, slowly taking over — until one day the oak rots away and only the fig is left, strong and healthy. Nobody had to chop the oak down overnight. That's exactly what this pattern does to legacy software: you grow a new system around the old one, route traffic to the new parts gradually, and quietly retire the old code piece by piece.

Every senior engineer has a war story about the legacy monolith. The codebase that nobody dares touch, where a one-line change takes three weeks of regression testing and still breaks something in production at 2 AM on a Friday. These systems didn't become terrifying overnight — they grew that way over years of feature additions, hotfixes, and 'we'll clean this up later' compromises. The business depends on them. You cannot simply turn them off.

The Strangler Fig Pattern, coined by Martin Fowler in 2004 after observing actual strangler fig trees in Australian rainforests, is an architectural migration strategy that solves one specific problem: how do you replace a working-but-painful system with a better one without a risky, all-or-nothing 'big-bang' rewrite? The answer is that you don't replace it all at once. You intercept traffic at the edge, divert individual capabilities to new services as they're built, and let the old system die by starvation rather than demolition. The risk at any point in time is bounded to the slice you're currently migrating.

By the end of this article you'll understand the full mechanics of the pattern — the proxy/facade layer, feature-by-feature traffic routing, data synchronisation between old and new, rollback strategies, and the production gotchas that turn a smooth migration into a nightmare if you don't see them coming. You'll also have working code for the routing facade and a feature-flag-driven traffic splitter you can adapt to your own stack today.

What Is the Strangler Fig Pattern? (And Why Your Team Needs It)

The Strangler Fig Pattern is a migration strategy that lets you replace a legacy system incrementally, one feature at a time. You put a routing layer — a reverse proxy, API gateway, or even a smart load balancer — in front of the existing monolith. Every incoming request hits this facade instead of the legacy app directly.

The facade checks a routing table (often backed by feature flags) and decides whether to send the request to the old system or the new service. Over time, you build replacement services for each functionality while the legacy app still handles everything else. You route traffic to the new service when it's ready. Once a feature is fully replaced and tested, you remove the legacy code for that feature.

This isn't a new idea — Martin Fowler described it in 2004. But most teams still default to the 'rewrite it all' approach, which collapses under its own risk. The Strangler Fig pattern caps the blast radius of any mistake to exactly one feature.

io/thecodeforge/proxy/RoutingFacade.java (Java)
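package io.thecodeforge.proxy;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Resolves each request path to either the legacy backend or a migrated service. */
public class RoutingFacade {
    // Path prefix -> URL of the new service that now owns that prefix
    private final Map<String, String> routeTable = new ConcurrentHashMap<>();
    private final String legacyBackend;

    public RoutingFacade(String legacyBackend) {
        this.legacyBackend = legacyBackend;
    }

    public void registerMigration(String featurePattern, String newServiceUrl) {
        routeTable.put(featurePattern, newServiceUrl);
    }

    public String resolveBackend(String requestPath) {
        // Pick the longest matching active prefix. Map iteration order is not
        // defined, so "first match wins" would be non-deterministic when
        // prefixes overlap (e.g. /api/users and /api/users/avatar).
        String bestPrefix = null;
        for (String prefix : routeTable.keySet()) {
            if (requestPath.startsWith(prefix) && isActive(prefix)
                    && (bestPrefix == null || prefix.length() > bestPrefix.length())) {
                bestPrefix = prefix;
            }
        }
        // Anything not migrated (or flagged off) still goes to the monolith
        return bestPrefix == null ? legacyBackend : routeTable.get(bestPrefix);
    }

    private boolean isActive(String feature) {
        // In production, check a feature flag service (LaunchDarkly, FF4J, etc.)
        return true;
    }
}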
Output
RoutingFacade resolves backend based on feature prefix. Returns new service URL when feature is migrated and flag is active.
Mental Model: The Bounded Slice
  • You never rewrite 'everything'. You pick one capability — login, search, payments — and replace that.
  • The legacy system continues running all unmigrated features, so risk outside the slice is limited to the routing layer itself.
  • If the new service fails, you flip the routing rule back. The legacy system never stopped.
  • This pattern works because each slice is small enough to reason about, test, and rollback independently.
Production Insight
Migrating several features at once complicates rollback, because you end up debugging which new service broke what.
Keep in-flight migrations to one, or two at the very most.
Rule: one slice at a time wherever you can, and done means done (all traffic switched, legacy code removed).
Key Takeaway
Incremental migration caps risk per slice.
Keep in-flight migrations to one slice, two at the very most.
The proxy is the only component that knows about the migration — the rest of the world doesn't need to.

Building the Routing Facade: Feature Flags and Traffic Splitting

The facade is the single most critical piece of a Strangler Fig migration. It must be performant, stateless (or externalise state), and observable. Most teams use an API gateway (Kong, Nginx, Envoy) or a reverse proxy with dynamic routing. The key requirement: routing decisions must be changeable at runtime without a deployment.

Feature flags control which users or requests go to the new service. You start at 0% traffic, enable it for internal testing (1% of users), then gradually increase to 100%. The flag can be based on user ID hash, geographic region, or any other request attribute. If something breaks, you turn the flag off, and traffic goes back to legacy as soon as the proxy picks up the change.
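
A percentage rollout keyed on a user ID hash can be sketched as below. This is a minimal illustration, not how any particular flag service does it: the class name `PercentageRollout` and the choice of CRC32 are assumptions made for the example, and a real flag service would normally do this bucketing for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper: maps a user to a stable bucket in [0, 100) per feature. */
final class PercentageRollout {
    private PercentageRollout() {}

    /** The same feature and user always land in the same bucket, so a user does not flip between backends. */
    static int bucket(String feature, String userId) {
        CRC32 crc = new CRC32();
        crc.update((feature + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100); // getValue() is never negative
    }

    /** True when the user falls inside the current rollout percentage (0 to 100). */
    static boolean routeToNew(String feature, String userId, int trafficPercent) {
        if (trafficPercent < 0 || trafficPercent > 100) {
            throw new IllegalArgumentException("trafficPercent must be between 0 and 100");
        }
        return bucket(feature, userId) < trafficPercent;
    }
}
```

Because the bucket is fixed per user, raising the percentage only adds users to the new service; nobody who was already migrated gets sent back until you lower it.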

Don't implement your own feature flag system in-house. Use LaunchDarkly, Unleash, or even a simple Redis-backed toggle. Your only job is to read the flag in the facade, not implement the flag infrastructure.

io/thecodeforge/proxy/routing-config.yaml (YAML)
# TheCodeForge: Strangler Fig routing config for Envoy (illustrative schema, not Envoy's native config format)
# Feature flags are external, loaded from a config service
routing:
  - feature: user-profile
    pattern: "/api/users/**"
    legacy: "monolith.internal"
    new: "user-service.internal"
    # Flag: strangler.user-profile.enabled
    traffic_percent: 30  # will be overridden at runtime via API

  - feature: payments
    pattern: "/api/payments/**"
    legacy: "monolith.internal"
    new: "payment-service.internal"
    traffic_percent: 0  # not yet migrated

  - feature: legacy-fallback
    pattern: "/**"
    backend: "monolith.internal"
Output
Routing config with one migration in progress (user-profile at 30%). Payments is registered but still at 0%, so no payments traffic goes to the new service.
Watch Out: Hot Reload Without Validation
If your proxy reloads routing configuration without validating the new backend is healthy, you can blackhole traffic. Always require a health check pass before allowing a routing change.
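
One way to enforce that rule is to make the health check part of the route update itself. The sketch below assumes a hypothetical `GuardedRouteTable` whose health probe is injected as a `Predicate`; in practice the probe would be an HTTP call to the backend's health endpoint, and most gateways offer their own equivalent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical guard: a route only changes if the target backend passes a health probe. */
final class GuardedRouteTable {
    private final Map<String, String> routes = new ConcurrentHashMap<>();
    private final Predicate<String> isHealthy; // e.g. wraps an HTTP GET on the backend's health endpoint

    GuardedRouteTable(Predicate<String> isHealthy) {
        this.isHealthy = isHealthy;
    }

    /** Applies the change and returns true only when the new backend is healthy; otherwise the old route stays. */
    boolean updateRoute(String pathPrefix, String newBackend) {
        if (!isHealthy.test(newBackend)) {
            return false; // rejected: nothing is changed, so no traffic is blackholed
        }
        routes.put(pathPrefix, newBackend);
        return true;
    }

    /** Returns the backend registered for this prefix, or the fallback when none is registered. */
    String backendFor(String pathPrefix, String fallback) {
        return routes.getOrDefault(pathPrefix, fallback);
    }
}
```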
Production Insight
The hardest part is not the routing — it's knowing when to flip the flag.
If your new service can't handle the load, 100% traffic will cause a cascade failure.
Rule: load test the new service at 2x expected production traffic before increasing beyond 10%.
Key Takeaway
Routing facade = feature flags + health checks + observable metrics.
Never change routing without validation.
Start at 1% traffic, ramp up slowly, monitor every increment.
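
The "ramp up slowly, monitor every increment" takeaway can be expressed as a small state machine. This is a sketch under assumptions: `TrafficRamp` is a hypothetical name, the step sizes follow the 1% → 5% → 10% → 25% → 50% → 100% schedule recommended under common mistakes, and the error-rate threshold is whatever your team decides is acceptable.

```java
/** Hypothetical ramp controller: one step up when healthy, straight back to 0% when not. */
final class TrafficRamp {
    private static final int[] STEPS = {0, 1, 5, 10, 25, 50, 100};
    private int index = 0;

    int currentPercent() {
        return STEPS[index];
    }

    /**
     * Call once per observation window.
     * errorRate and maxErrorRate are fractions (0.01 means 1%).
     */
    int advance(double errorRate, double maxErrorRate) {
        if (errorRate > maxErrorRate) {
            index = 0; // unhealthy: send everything back to legacy
        } else if (index < STEPS.length - 1) {
            index++;   // healthy: take the next step, never skip one
        }
        return currentPercent();
    }
}
```

Dropping to 0% on a bad window is a deliberately conservative choice for the example; stepping back one level is a reasonable alternative if the failure is clearly load-related.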
Choose Your Routing Strategy
If: You need feature-level routing (e.g., move 'search' but not 'profile')
Use: An API gateway with URL pattern matching and a feature flag per pattern.
If: You need user-level routing (e.g., test the new system with power users first)
Use: Sticky sessions or a user ID hash with percentage rollout in the flag service.
If: You have no API gateway and can't deploy one
Use: A simple routing filter embedded in your load balancer (Nginx Lua script) or a sidecar proxy.

Data Synchronisation: The Real Challenge of Strangler Fig

Routing traffic is the easy part. The hard part is keeping data consistent between the legacy database and your new service's database. During migration, both systems need to access and modify the same user data, orders, or inventory. If you don't have a solid data sync strategy, you'll end up with silent corruption.

The safest approach is to keep a single source of truth (the legacy database) and have the new service read from it but write to both its own database and the legacy one (dual writes). This keeps both systems in sync, but dual writes are error-prone: one side can fail while the other succeeds. A better approach is change data capture (CDC) on the legacy database: every change in the legacy DB is streamed to a message topic, and the new service consumes that stream to update its own store. The new service's writes flow back to the legacy DB through a second pipeline in the opposite direction (reverse sync), and each change must carry its origin so the two pipelines don't replay each other's writes in a loop.

Alternatively, you can migrate data at the database level first (e.g., use database views or federation), but that adds a different kind of coupling. The key is bidirectional replication until you cut over completely.
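
Bidirectional replication only works if each side can recognise its own changes coming back and can safely see the same event twice. The sketch below shows those two checks in isolation. Everything in it is an assumption for illustration: `ChangeApplier` and `ChangeEvent` are hypothetical names, the in-memory map stands in for a real database, and the per-key `version` is assumed to increase monotonically (for example, a log sequence number). It uses a record, so it needs Java 16 or later.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical CDC consumer: applies remote changes idempotently and ignores its own echoes. */
final class ChangeApplier {
    /** Minimal change event; a real CDC payload carries far more (schema, before/after images, offsets). */
    record ChangeEvent(String origin, String key, long version, String value) {}

    private final String selfOrigin;
    private final Map<String, ChangeEvent> store = new HashMap<>();

    ChangeApplier(String selfOrigin) {
        this.selfOrigin = selfOrigin;
    }

    /** Returns true when the event changed local state. */
    boolean apply(ChangeEvent event) {
        if (event.origin().equals(selfOrigin)) {
            return false; // our own write coming back around: applying it would start a loop
        }
        ChangeEvent current = store.get(event.key());
        if (current != null && current.version() >= event.version()) {
            return false; // duplicate or out-of-date delivery: safe to drop
        }
        store.put(event.key(), event);
        return true;
    }

    /** Returns the current value for the key, or null when the key is unknown. */
    String valueOf(String key) {
        ChangeEvent e = store.get(key);
        return e == null ? null : e.value();
    }
}
```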

io/thecodeforge/migration/DualWriter.java (Java)
package io.thecodeforge.migration;

import io.thecodeforge.db.LegacyRepository;
import io.thecodeforge.db.NewServiceRepository;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// UserProfile is assumed to live in this package; import it if yours lives elsewhere.
public class DualWriter {
    private static final Logger log = LoggerFactory.getLogger(DualWriter.class);

    private final LegacyRepository legacy;
    private final NewServiceRepository newService;

    public DualWriter(LegacyRepository legacy, NewServiceRepository newService) {
        this.legacy = legacy;
        this.newService = newService;
    }

    public void writeUserProfile(String userId, UserProfile profile) {
        // Legacy is the source of truth, so it is written first.
        try {
            legacy.saveProfile(userId, profile);
        } catch (Exception e) {
            // Nothing has been written to the new service yet, so there is nothing to undo.
            log.error("Legacy write failed for user {}; skipping new-service write", userId, e);
            throw e; // Let the caller decide how to surface the failure
        }
        try {
            newService.saveProfile(userId, profile);
        } catch (Exception e) {
            // New service write failed but legacy succeeded: compensate later
            log.warn("New service write failed for user {}, scheduling reconciliation", userId, e);
            scheduleReconciliation(userId, profile);
        }
    }

    private void scheduleReconciliation(String userId, UserProfile profile) {
        // Push to dead-letter queue for later retry
    }
}
Output
DualWriter writes to both databases. If new write fails, a reconciliation job is scheduled.
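
The listing leaves `scheduleReconciliation` empty. One possible shape for it is a retry queue with a bounded number of attempts and a dead-letter list, sketched here in memory. `ReconciliationQueue` is a hypothetical name, and a production version would sit on a durable broker rather than a `Deque`. The example uses a record, so it needs Java 16 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical in-memory stand-in for a retry queue with a dead-letter list. */
final class ReconciliationQueue {
    private record Pending(String userId, int attempts) {}

    private final Deque<Pending> queue = new ArrayDeque<>();
    private final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    ReconciliationQueue(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void schedule(String userId) {
        queue.addLast(new Pending(userId, 0));
    }

    /** Retries every pending entry once; the writer returns true when the repair succeeded. */
    void drainOnce(Predicate<String> writer) {
        int size = queue.size();
        for (int i = 0; i < size; i++) {
            Pending p = queue.removeFirst();
            if (writer.test(p.userId())) {
                continue; // repaired, nothing more to do
            }
            int attempts = p.attempts() + 1;
            if (attempts >= maxAttempts) {
                deadLetters.add(p.userId()); // give up and surface it for a human
            } else {
                queue.addLast(new Pending(p.userId(), attempts));
            }
        }
    }

    int pending() {
        return queue.size();
    }

    List<String> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

The queue stores only the user ID, not the profile. Since legacy is the source of truth, the retry can re-read the current legacy row and copy that, which avoids overwriting a newer value with a stale payload.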
Mental Model: The Two-Phase Commit That Isn't
  • There is no distributed transaction between two databases in a strangler fig migration.
  • Your write path must handle three partial outcomes: legacy succeeds and new fails, legacy fails and new succeeds, or both fail.
  • The legacy system must remain the authoritative source until cutover is complete.
  • Use a reconciliation cron job to detect and fix differences between the two stores hourly.
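
The core of such a reconciliation job is a diff that treats legacy as the truth. A minimal sketch, assuming each store can be summarised as a map from record ID to a checksum or version string; `StoreDiff` and that map representation are illustrative, and a real job would page through both tables rather than load them whole.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical reconciliation helper that compares two stores by ID and checksum. */
final class StoreDiff {
    private StoreDiff() {}

    /** Returns the IDs that are missing from, or different in, the new store. Legacy is the truth. */
    static Set<String> keysToRepair(Map<String, String> legacy, Map<String, String> newStore) {
        Set<String> repair = new TreeSet<>();
        for (Map.Entry<String, String> entry : legacy.entrySet()) {
            String newValue = newStore.get(entry.getKey()); // null when the record is missing
            if (!Objects.equals(entry.getValue(), newValue)) {
                repair.add(entry.getKey());
            }
        }
        return repair;
    }
}
```

This only finds records that legacy has and the new store lacks or disagrees on. Records that exist only in the new store need the same comparison run in the other direction.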
Production Insight
Dual writes double your write latency and failure surface.
One team I consulted lost 3 hours of order data because the new service's DB was full but the legacy wrote fine — no alerting on partial failure.
Rule: monitor dual write success/fail rate per operation type, alert if >0.01% of writes fail on either side.
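
That alerting rule reduces to two counters and a threshold. The sketch below is an assumption-laden illustration (`DualWriteMonitor` is a hypothetical name, and in practice you would emit these counts to your metrics system and alert from there); the idea is to keep one instance per operation type.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-operation monitor for dual-write failures. */
final class DualWriteMonitor {
    private final LongAdder attempts = new LongAdder();
    private final LongAdder failures = new LongAdder();
    private final double alertThreshold; // fraction: 0.0001 means 0.01%

    DualWriteMonitor(double alertThreshold) {
        this.alertThreshold = alertThreshold;
    }

    /** Record the outcome of one write on one side (legacy or new). */
    void record(boolean success) {
        attempts.increment();
        if (!success) {
            failures.increment();
        }
    }

    double failureRate() {
        long total = attempts.sum();
        return total == 0 ? 0.0 : (double) failures.sum() / total;
    }

    boolean shouldAlert() {
        return failureRate() > alertThreshold;
    }
}
```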
Key Takeaway
Legacy DB remains source of truth until cutover.
Dual writes need partial failure handling + reconciliation.
CDC pipeline with a message broker is the production-grade answer.

Rollback Strategy: How to Undo a Migration Without Pain

A good strangler fig migration must have a rapid rollback plan for every slice. The beauty of the pattern is that the legacy system never goes away until the last feature is migrated. You can always flip the routing flag back to legacy for a particular feature.

But a simple routing rollback isn't always enough — you also need to handle data. If the new service wrote data that doesn't exist in legacy, you can lose it on rollback. The rule: the legacy system must be the authoritative writer until cutover. Any writes from the new service must be replicated back to legacy (dual writes or CDC reverse sync). That way, when you flip the routing back, the legacy system has all the data.

Your rollback sequence:
  1. Turn off the feature flag (stop routing traffic to the new service).
  2. Verify the legacy system can serve all the data (run a data consistency check).
  3. If data is missing, run a backfill from the new service's database.
  4. Decommission the new service only after at least 48 hours of clean rollback window.

Test your rollback before you need it. Simulate a failure scenario in staging: let the new service crash and verify that the routing facade correctly falls back to legacy without any UX interruption.

rollback.sh (Bash)
#!/bin/bash
# TheCodeForge: Strangler Fig rollback script
# Usage: ./rollback.sh <feature-name>
set -euo pipefail

# Abort with a usage message when no feature name is given
FEATURE="${1:?Usage: ./rollback.sh <feature-name>}"
FLAG_SERVICE="https://flags.internal/flag"

# Step 1: Disable the feature flag (-f makes curl fail on HTTP error codes)
curl -fsS -X POST "$FLAG_SERVICE/strangler.$FEATURE.enabled" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

# Step 2: Wait for proxy to pick up the change
sleep 5  # depends on your proxy cache TTL

# Step 3: Verify traffic is going to legacy
# Capture headers first so a failed request is reported as a failed verification
HEADERS=$(curl -sI "http://proxy/api/$FEATURE/health" || true)
if grep -qi "^X-Backend: legacy" <<<"$HEADERS"; then
  echo "Rollback successful: traffic now to legacy"
else
  echo "ERROR: Traffic still hitting new service" >&2
  exit 1
fi

# Step 4: Check data consistency
# (assuming a reconciliation job runs offline)
echo "Rollback complete. Monitor reconciliation status."
Output
Rollback script disables feature flag, waits for proxy to sync, then verifies backend header.
Never Skip Step 3: Verification
I've seen teams disable the flag but the proxy had a 60-second TTL on config reload. Users hit the new service for another full minute, which then returned 503s because the service was already half-undeployed. Always verify, then proceed.
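
A fixed sleep cannot cover an unknown TTL, so verification is better done by polling until the proxy reports the legacy backend or a deadline passes. The sketch below assumes a hypothetical `RollbackVerifier` whose probe returns the name of the backend currently serving traffic, for example the value of the `X-Backend` response header used in the script.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical verifier: waits for the proxy to actually switch before reporting success. */
final class RollbackVerifier {
    private RollbackVerifier() {}

    /** Polls the probe until it returns the expected backend; returns false on timeout or interruption. */
    static boolean awaitBackend(Supplier<String> probe, String expected,
                                Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (expected.equals(probe.get())) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false; // the proxy never switched: do not continue the rollback blindly
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Set the timeout comfortably above the proxy's config TTL, and only start undeploying the new service once this returns true.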
Production Insight
Rollback is not just about routing — it's about data.
If you didn't replicate new writes back to legacy, you lose data on rollback.
Rule: write to legacy first (or replicate asynchronously in both directions) for the whole migration, so a rollback never strands data in the new service.
Key Takeaway
Rollback = feature flag off + waiting out the proxy TTL + data reconciliation.
Test rollback in staging before production.
New service writes must be replicated back to legacy before any rollback is safe.
When to Roll Back vs. Forward-Fix
If: New service is returning 500s for all requests
Use: Roll back immediately. Turn off the flag, all traffic goes to legacy.
If: New service is slow but functional (e.g., 2s vs 200ms)
Use: Consider forward-fix: keep flag on for internal users only, fix performance, then ramp up.
If: Data inconsistency detected but small scope
Use: Fix the data programmatically (backfill script) and keep migration running with monitoring.

When NOT to Use the Strangler Fig Pattern

The Strangler Fig pattern isn't a silver bullet. It works best for replacing parts of a monolithic system where you can isolate a single capability. It fails when:

  • The legacy system has no clear interface boundaries — everything is tightly coupled through a shared database or global state. In that case, you can't extract a single feature without dragging half the monolith with it.
  • The new system requires a fundamentally different data model that can't be mapped to the legacy one. If every request needs to transform heavily between old and new schema, the proxy becomes a bottleneck.
  • You need performance improvements immediately — the strangler fig approach adds latency from the proxy and dual writes for many months. If you need to make the system 2x faster this quarter, a rewrite (with careful planning) might be the better call.
  • The team is unwilling to maintain two codebases in parallel. The pattern requires you to keep the legacy app around until migration is complete. If your team can't handle that cognitive load, consider a big-bang migration with a well-tested rollback plan instead.

Evaluate your specific context. The pattern is a tool, not a religion.

Production Insight
I worked on a migration where the legacy system had 80 stored procedures shared across all features. Extracting one feature meant copying 40 procedures — we might as well have rewritten the whole thing.
Rule: if the cost of extracting a feature exceeds the cost of building it from scratch, don't use Strangler Fig.
Key Takeaway
Strangler Fig works when features are loosely coupled.
If coupling is high, consider big-bang with careful rollback or a database-first migration.
Always measure the extraction cost before committing to the pattern.
● Production incident · POST-MORTEM · severity: high

The Midnight Data Loss That Killed a Migration

Symptom
After 40% of traffic was routed to the new service, users reported missing recent financial transactions. The legacy system still had the data, but the new service couldn't see updates made on the old side.
Assumption
The team assumed the copy from legacy to new was enough once the new service became the primary writer. They forgot that some users were still served by the legacy app during the gradual rollout. It kept writing to the old database, and those writes were never replicated to the new service.
Root cause
Continuous bidirectional data synchronisation was never designed. The team had a script that copied legacy data to the new service, but nothing kept the two stores in step after that. When a user on legacy made a change, the new service's version became stale.
Fix
The team implemented a change data capture (CDC) pipeline using Debezium on the legacy database. Every write on either side was replicated to a shared event stream, and both systems consumed the stream to stay consistent. It took three weeks to backfill and reconcile.
Key lesson
  • Data sync must be bidirectional during the migration period — not just one way.
  • Assume every user can be served by either system at any time until migration is complete.
  • Change data capture with a message broker keeps both stores in sync without application-level coupling, and it is far more reliable than hand-rolled dual writes.
Production debug guide: Diagnose and resolve the most common failures during incremental migration
Symptom · 01
Traffic to new service returns 404 or 502 after routing change
Fix
Check the proxy/facade routing table — verify the feature flag or URL pattern maps to the correct backend. Use curl -H "X-Force-Route: new-service" to isolate the issue.
Symptom · 02
Users see inconsistent data between old and new UI
Fix
Compare the sync lag between databases. Query the CDC stream offset and the primary key ranges. If lag >30 seconds, throttle traffic to new service until sync catches up.
Symptom · 03
Rollback doesn't restore all data — some writes lost
Fix
You likely have a partial dual-write failure. Check application logs for exceptions in the synchronous replication path. Implement a retry with dead-letter queue for failed write operations.
Symptom · 04
New service is slower than legacy under load
Fix
The new service may not be tuned yet. Compare response times. If it's a database query issue, add indexes or caching. Do NOT blame the pattern — it's a performance problem, not a migration problem.
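
The throttling advice in Symptom 02 (reduce traffic to the new service when sync lag exceeds 30 seconds) can be sketched as a pure function. `LagThrottle` and the throttled percentage are assumptions for illustration; how you measure lag depends on your CDC tooling.

```java
import java.time.Duration;

/** Hypothetical helper: caps the rollout percentage while replication lag is too high. */
final class LagThrottle {
    private LagThrottle() {}

    /** Returns the configured percentage when lag is acceptable, otherwise the lower of the two percentages. */
    static int effectivePercent(int configuredPercent, Duration syncLag,
                                Duration maxLag, int throttledPercent) {
        if (syncLag.compareTo(maxLag) > 0) {
            return Math.min(configuredPercent, throttledPercent);
        }
        return configuredPercent;
    }
}
```

For example, with a configured rollout of 40%, a measured lag of 45 seconds, a 30-second limit and a throttled value of 5, the function returns 5; once lag drops back under the limit it returns 40 again.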
★ Quick Debug Cheat Sheet: Strangler Fig Failures
Three most common production failures during a strangler fig migration and exactly what to do when they hit.
Proxy routing sends traffic to wrong backend
Immediate action
Check the routing config file and revert the last change. Use the rollback button in your CI/CD or manually update the proxy.
Commands
curl -v http://proxy/api/users/123 --header "X-Original-Route: legacy"
diff current_routing.yaml previous_routing.yaml
Fix now
Apply the previous routing config and verify with curl. Then investigate why the new rule failed.
Data mismatch between old and new databases
Immediate action
Stop writing to the new database. Run a reconciliation query to find orphan records.
Commands
SELECT COUNT(*) FROM legacy_users WHERE id NOT IN (SELECT id FROM new_users);
SELECT * FROM legacy_users WHERE updated_at > (SELECT max(updated_at) FROM sync_log);
Fix now
Manually backfill missing records via a script. Enable CDC with reverse sync if not already done.
Feature flag stuck in a partially rolled-out state
Immediate action
Force the flag off to route all traffic back to legacy. Notify the team via incident channel.
Commands
launchdarkly flag set strangler-users off --all-users   # illustrative syntax; substitute your flag provider's actual CLI or API call
kubectl rollout undo deployment/new-service -n production
Fix now
Revert to legacy. Schedule a postmortem to understand why the flag transition failed.
Migration Strategies Compared
Strategy | Risk per change | Rollback time | Parallel maintenance | Best for
Strangler Fig | Bounded (one feature) | Minutes (flag off) | Long (months) | Large monoliths with clear service boundaries
Big Bang Rewrite | Entire system | Hours to days (data migration) | Short (weeks) | Small systems (<50k LOC) or when the current system is net new
Branch by Abstraction | Bounded (one abstraction) | Hours (commit revert + cache clear) | Medium (months) | When you can't add a proxy layer (e.g., mobile SDK)

Key takeaways

1. Strangler Fig pattern replaces legacy systems one feature at a time through a routing proxy.
2. Risk is bounded to the feature slice being migrated; rollback is a flag toggle away.
3. Data synchronisation (bidirectional) is the hardest and most failure-prone part.
4. Always load-test new services at 2x traffic before ramping beyond 10%.
5. Test rollback in staging before going to production; verify the proxy actually reverts.
6. If features are tightly coupled via shared state, extract coupling first or choose another strategy.

Common mistakes to avoid

Migrating the database before the service

Symptom
The new service can't read legacy data because the schema differs, or dual writes fail silently.
Fix
Migrate the service first (just new code talking to old DB via abstraction), then migrate the database later. Or use database views to present a unified schema during migration.

Not planning for bidirectional data sync

Symptom
Users see stale data when switching between old and new UI during the rollout phase.
Fix
Implement CDC (Debezium, AWS DMS) from both databases to a message queue. Both systems consume the stream to stay eventually consistent.

Ramping traffic too fast without load testing the new service

Symptom
New service crashes under sudden full load, taking down the entire feature for all users.
Fix
Load test the new service at 2x expected production traffic. Ramp traffic from 1% → 5% → 10% → 25% → 50% → 100% with 24-hour observation windows at each step.

Assuming the proxy is stateless when it holds routing state

Symptom
Proxy restarts lose the current traffic split percentage, sending all traffic to one backend.
Fix
Externalise routing state to a config service (Consul, Etcd) or a database. The proxy should reload config on startup, not start with hardcoded defaults.
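
The fix for that last mistake can be sketched as a routing-state holder that loads from an external store and keeps the last known good values when the store is unreachable. `RoutingState` is a hypothetical name, and the `Supplier` stands in for whatever client you use to read from Consul, etcd or a database.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical holder for traffic percentages that live in an external config store. */
final class RoutingState {
    private volatile Map<String, Integer> percentByFeature = Map.of();

    /**
     * loader reads feature -> traffic percent from the external store.
     * An empty Optional means the store could not be reached.
     */
    boolean reload(Supplier<Optional<Map<String, Integer>>> loader) {
        Optional<Map<String, Integer>> loaded = loader.get();
        if (loaded.isEmpty()) {
            return false; // keep the last known good state instead of resetting
        }
        percentByFeature = Map.copyOf(loaded.get());
        return true;
    }

    /** Features the store does not mention get 0%, so they stay on legacy. */
    int percentFor(String feature) {
        return percentByFeature.getOrDefault(feature, 0);
    }
}
```

Call `reload` on startup and then on a timer. A proxy that restarts while the store is down serves everything from legacy until the first successful load, which is the safe direction to fail in.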
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): Explain the Strangler Fig Pattern. How does it differ from a big-bang re...
Q02 (Senior): How do you handle data consistency during a Strangler Fig migration?
Q03 (Senior): What happens if the new service fails after 60% traffic is routed to it?...
Q04 (Senior): When would you NOT recommend the Strangler Fig pattern?

Explain the Strangler Fig Pattern. How does it differ from a big-bang rewrite?

ANSWER
The Strangler Fig pattern replaces a legacy system incrementally by routing traffic through a proxy that redirects requests to new microservices one feature at a time. The legacy system continues running in parallel for unmigrated features. A big-bang rewrite replaces the entire system at once, which carries enormous risk — a single bug affects all users, rollback is slow, and the project often fails due to scope creep. Strangler Fig caps risk to the current feature slice and enables instant rollback (just turn off the feature flag).
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is the Strangler Fig Pattern in simple terms?
02. Do I need an API gateway to use Strangler Fig?
03. How long does a typical Strangler Fig migration take?
04. What tools do you recommend for feature flags?
05. Can I use Strangler Fig for database migration alone?