A stale flag at 100% for 3 months caused a NullPointerException, taking down checkout for 10% of users.
Basic Flag Implementation
The simplest feature flag is just an if statement controlled by an environment variable or a config value. This pattern works for small teams and simple rollouts. A production-grade approach needs consistent user bucketing: the same user must always see the same experience. A common way is to hash the flag name together with the user ID and take the result modulo 100 to assign a bucket.
Here's the minimal pattern in Python, using the io.thecodeforge namespace for all production packages.
# Package: io.thecodeforge.python.devops
# Simplest possible feature flag: environment variable
import os

def get_recommendations(user_id: int):
    if os.getenv('ENABLE_ML_RECOMMENDATIONS', 'false') == 'true':
        return ml_recommendations(user_id)  # new ML-based system
    else:
        return rule_based_recommendations(user_id)  # old system

# Better: percentage rollout, tested on a fraction of users
import hashlib

def is_flag_enabled(flag_name: str, user_id: int, percentage: float) -> bool:
    """Consistently assign users to buckets using a hash; the same user always gets the same result."""
    hash_input = f'{flag_name}:{user_id}'.encode()
    hash_val = int(hashlib.md5(hash_input).hexdigest(), 16)
    bucket = (hash_val % 100) + 1  # 1-100
    return bucket <= percentage

# Roll out to 5% of users
def get_checkout_flow(user_id: int):
    if is_flag_enabled('new_checkout', user_id, 5.0):
        return new_checkout_flow(user_id)
    return old_checkout_flow(user_id)

Environment Variable Flags Are Fragile
Using environment variables per flag works for a handful of toggles, but as you scale to hundreds of flags, you need a dedicated flag service with targeting rules and audit trails. Environment variables are also hard to change at runtime without a restart.
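Part of that fragility sits in the string comparison itself: a check like `== 'true'` silently treats `True`, `1`, or `yes` as off. The sketch below shows one way to read such a flag more tolerantly. The helper name `env_flag` and the set of accepted spellings are assumptions made for illustration, not part of the original example.

```python
import os

# Spellings accepted as on/off are an assumption for this sketch.
_TRUTHY = {"1", "true", "yes", "on"}
_FALSY = {"0", "false", "no", "off", ""}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment, ignoring case and whitespace."""
    raw = os.getenv(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    # Unrecognized values fall back to the default rather than guessing.
    return default
```

A call site would then read `if env_flag('ENABLE_ML_RECOMMENDATIONS'):`. The restart problem remains: the value is still only as fresh as the process environment.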
Production Insight
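Bucketing only works if the hash spreads users evenly. Use a stable, well-mixed hash (MD5 or SHA-256) and validate the bucket distribution on a sample of users. Avoid Python's built-in hash() for strings: it is randomized per process, so buckets would change between restarts.
A common mistake is hashing the user ID alone, without the flag name. Every flag then puts each user in the same bucket, so the same users land in the early percentage of every rollout and experiments become correlated.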
Rule: always include the flag name in the hash input.
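Both checks can be scripted. The sketch below reuses the MD5 bucketing from the example above and measures two things: how far any bucket strays from its ideal 1% share, and how much two flags' rollout groups overlap. The function names and the idea of feeding it synthetic user IDs are assumptions for illustration.

```python
import hashlib
from collections import Counter


def bucket_for(flag_name: str, user_id: int) -> int:
    """MD5 of 'flag:user', mapped to a bucket in 1-100."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    return (int(digest, 16) % 100) + 1


def max_relative_deviation(flag_name: str, user_ids) -> float:
    """Largest relative gap between any bucket's size and the ideal 1% share."""
    user_ids = list(user_ids)
    counts = Counter(bucket_for(flag_name, uid) for uid in user_ids)
    expected = len(user_ids) / 100
    return max(abs(counts.get(b, 0) - expected) / expected for b in range(1, 101))


def rollout_overlap(flag_a: str, flag_b: str, user_ids, percentage: int) -> float:
    """Fraction of flag_a's enabled users who are also enabled for flag_b."""
    user_ids = list(user_ids)
    in_a = {u for u in user_ids if bucket_for(flag_a, u) <= percentage}
    in_b = {u for u in user_ids if bucket_for(flag_b, u) <= percentage}
    return len(in_a & in_b) / len(in_a) if in_a else 0.0
```

With the flag name in the hash input, the overlap between two flags at a 10% rollout should sit near 10%. If the hash used the user ID alone, the overlap would be 100%, because every flag would enable exactly the same users.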
Key Takeaway
Start simple, but plan to migrate to a service before you hit 10 flags.
Consistent hashing is non-negotiable for percentage rollouts.
Test bucket distribution — a biased hash can ruin A/B tests.
If: Fewer than 5 flags, single environment, small team
→ Use: Environment variable flags are fine. Keep a checklist to track removal.
If: More than 5 flags or multiple environments (staging, prod)
→ Use: A dedicated flag service (LaunchDarkly, Unleash) for targeting, audit, and easy management.
If: Need real-time changes (e.g., kill switch)
→ Use: Flag service with streaming evaluation is necessary. Polling every 30 seconds is too slow for a kill switch.
Feature Flag Service — LaunchDarkly SDK Pattern
When your team needs targeting by user attributes (plan, country, beta group), a dedicated flag service is the way to go. The SDK handles evaluation, caching, and streaming updates. This example shows how to use the LaunchDarkly SDK in Python, evaluating a flag with a rich user context.
# Package: io.thecodeforge.python.devops
# Using a feature flag service (LaunchDarkly, Unleash, Flagsmith)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config(sdk_key='your-sdk-key'))
client = ldclient.get()

# Evaluate a flag for a specific user
def get_dashboard(user):
    # Recent SDK versions expect a Context object; older versions accepted a plain dict.
    context = (
        Context.builder(str(user.id))
        .name(user.name)
        .set('email', user.email)
        .set('plan', user.subscription_plan)  # target premium users
        .set('country', user.country)  # GDPR rollout by country
        .build()
    )
    # Flag evaluated with user context; targeting rules live in the dashboard
    if client.variation('new-dashboard-v2', context, default=False):
        return render_new_dashboard(user)
    return render_old_dashboard(user)

Use Contextual Defaults
The default parameter in client.variation() is critical. If the flag service is unreachable (network partition, service down), the SDK falls back to this default. Always default to the old/safe behavior — never default to enabling a new feature.
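One way to keep call sites from passing an unsafe default by accident is to register each flag's safe default in one place. The sketch below is a thin wrapper under that assumption. `FlagClient` is a stand-in protocol with the same `variation` shape as the example above, not the LaunchDarkly class, and `SAFE_DEFAULTS` and `evaluate_flag` are hypothetical names.

```python
from typing import Any, Dict, Protocol


class FlagClient(Protocol):
    """Stand-in for any SDK client exposing variation(key, context, default)."""

    def variation(self, key: str, context: Any, default: Any) -> Any: ...


# Hypothetical registry: every flag maps to its old/safe behavior.
SAFE_DEFAULTS: Dict[str, Any] = {
    "new-dashboard-v2": False,
}


def evaluate_flag(client: FlagClient, key: str, context: Any) -> Any:
    """Evaluate a flag, falling back to the registered safe default."""
    default = SAFE_DEFAULTS[key]  # KeyError surfaces unregistered flags early
    try:
        value = client.variation(key, context, default)
    except Exception:
        # Any unexpected client failure degrades to the safe behavior.
        return default
    # Guard against a misconfigured flag returning the wrong type.
    return value if isinstance(value, type(default)) else default
```

The SDK already applies the default when the service is unreachable. The wrapper only adds a single source of truth for what "safe" means for each flag, plus a guard against exceptions and type mismatches.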
Production Insight
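SDK caching can serve stale flag values for up to the cache TTL or polling interval (often 30 seconds). If you need instant rollback, use streaming updates, either directly from the flag service or through a feature flag proxy, rather than waiting for the cache to expire.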
Evaluation context size matters: large user objects (100+ attributes) can add 5-10ms to evaluation time.
Rule: keep context attributes under 20 and use streaming for time-sensitive flags.
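The staleness window is the price of any TTL cache, and a small sketch makes the tradeoff concrete. The class below is an illustration rather than how any particular SDK caches internally. `TTLFlagCache`, the `fetch` callback, and the injectable `clock` are all hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLFlagCache:
    """Cache flag values per (flag, user) for a short, fixed time window."""

    def __init__(
        self,
        fetch: Callable[[str, Hashable], Any],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # performs the slow or remote evaluation
        self._ttl = ttl_seconds  # upper bound on how stale a value can be
        self._clock = clock      # injectable so tests can control time
        self._entries: Dict[Tuple[str, Hashable], Tuple[float, Any]] = {}

    def get(self, flag_key: str, user_key: Hashable) -> Any:
        now = self._clock()
        cache_key = (flag_key, user_key)
        entry = self._entries.get(cache_key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the remote call
        value = self._fetch(flag_key, user_key)
        # Note: entries are never evicted here; a real cache would bound its size.
        self._entries[cache_key] = (now, value)
        return value
```

A flag flipped in the service takes up to `ttl_seconds` to reach a process using this cache. That is acceptable for a gradual rollout, and it is why time-sensitive flags need pushed updates instead.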
Key Takeaway
Always provide a safe fallback default.
Streaming beats polling for kill switches.
Evaluate flags early, pass results down — don't eval inside loops.
If: Network latency to flag service > 10ms
→ Use: A local cache with a short TTL (5-10 seconds) to avoid synchronous network calls on every request.
If: Flag changes need to propagate within seconds
→ Use: Enable streaming (WebSocket or Server-Sent Events) to push changes, not poll.
If: Single flag evaluated hundreds of times per request (e.g., in a loop)
→ Use: Evaluate the flag once at the start of the request and pass the result as a parameter. Avoid repeated evaluations.
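The "evaluate once, pass the result down" advice can be sketched as a small immutable snapshot built at the start of the request. The flag key `release_new_pricing`, the `RequestFlags` type, and the 10% discount rule are hypothetical, and `FlagClient` is a stand-in protocol rather than a real SDK class.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Protocol


class FlagClient(Protocol):
    """Stand-in for any SDK client exposing variation(key, context, default)."""

    def variation(self, key: str, context: Any, default: Any) -> Any: ...


@dataclass(frozen=True)
class RequestFlags:
    """Flag values resolved once per request, then passed down as plain data."""

    new_pricing: bool


def load_request_flags(client: FlagClient, context: Any) -> RequestFlags:
    # Single evaluation per flag, at the request boundary.
    return RequestFlags(
        new_pricing=bool(client.variation("release_new_pricing", context, False)),
    )


def price_items(prices: Iterable[float], flags: RequestFlags) -> List[float]:
    # The loop reads a boolean; it never calls the flag client.
    multiplier = 0.9 if flags.new_pricing else 1.0  # hypothetical business rule
    return [round(price * multiplier, 2) for price in prices]
```

Besides saving evaluations, the snapshot guarantees that one request sees one consistent value even if the flag changes mid-request.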
Types of Feature Flags
Not all feature flags are the same. Pete Hodgson's taxonomy (from his "Feature Toggles" article on Martin Fowler's site) defines four types: release toggles, experiment toggles, ops toggles, and permission toggles. Release toggles are short-lived: they control the rollout of a new feature. Experiment toggles are for A/B tests and should be removed after the experiment ends. Ops toggles are kill switches and circuit breakers, so they must be fast and reliable. Permission toggles (entitlement flags) enable features for specific user segments (e.g., premium plan users) and can live long-term.
Mixing these types leads to confusion. Use naming conventions to distinguish: release_, exp_, ops_, perm_.
# Package: io.thecodeforge.python.devops
# Naming convention for flag types
release_flag_variation = client.variation('release_new_checkout_v3', context, default=False)
experiment_flag_variation = client.variation('exp_checkout_button_color', context, default='blue')
ops_flag_variation = client.variation('ops_disable_payment_gateway', context, default=False)
perm_flag_variation = client.variation('perm_premium_dashboard', context, default=False)

Flag Types as Lifecycle Stages
- Release flags: live 1 day – 2 weeks. Remove once rollout reaches 100%.
- Experiment flags: live for the duration of the experiment (days to months). Remove after analysis.
- Ops flags: live indefinitely but must be easy to toggle and have monitoring.
- Permission flags: live indefinitely, but should be managed by a product config system, not a feature flag tool.
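A naming convention is only useful if something enforces it. The sketch below classifies a flag by its prefix and attaches a maximum lifetime, which could run in CI or at flag creation time. The 14-day release limit follows the list above. The 90-day experiment limit and the names `FlagType`, `MAX_LIFETIME_DAYS`, and `classify_flag` are assumptions for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class FlagType(Enum):
    RELEASE = "release"
    EXPERIMENT = "exp"
    OPS = "ops"
    PERMISSION = "perm"


# Maximum lifetime in days; None means indefinite, with periodic review.
# 14 days mirrors the release guidance above; 90 days is an assumed cap.
MAX_LIFETIME_DAYS: Dict[FlagType, Optional[int]] = {
    FlagType.RELEASE: 14,
    FlagType.EXPERIMENT: 90,
    FlagType.OPS: None,
    FlagType.PERMISSION: None,
}


def classify_flag(name: str) -> FlagType:
    """Map a flag name like 'release_new_checkout_v3' to its type by prefix."""
    prefix, separator, rest = name.partition("_")
    if not separator or not rest:
        raise ValueError(f"flag name has no type prefix: {name!r}")
    for flag_type in FlagType:
        if flag_type.value == prefix:
            return flag_type
    raise ValueError(f"unknown flag prefix {prefix!r} in {name!r}")
```

A name such as `new-dashboard-v2` is rejected, which forces the author to decide up front which lifecycle the flag belongs to.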
Production Insight
Permission flags in a feature flag service create a hidden dependency — if the service goes down, all premium users lose access.
Ops flags must have a dashboard button for emergency toggling, not a CLI command that takes 5 minutes to find.
Rule: never use a feature flag service for permanent permissions — use a role-based access control (RBAC) system instead.
Key Takeaway
Name flags by type to avoid confusion.
Permanent permissions don't belong in feature flag tools.
Ops flags need monitoring and a dashboard toggle.
If: Rolling out a new feature to all users gradually
→ Use: A release flag. Plan to remove it within 2 weeks of reaching 100% rollout.
If: Testing two versions of a UI element to measure engagement
→ Use: An experiment flag. Ensure proper sample size calculation and statistical rigor.
If: Need to instantly disable a misbehaving API call
→ Use: An ops flag. Make sure the flag evaluation is fast (<1ms) and the toggle is available in a dashboard.
If: Show a feature only to paying users
→ Use: A permission flag, but implement it via a user attribute lookup (database or auth token) rather than a feature flag SDK.
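For the last case, a user attribute lookup can be as small as a mapping from plan to entitlements. The sketch below is one possible shape, assuming the plan is already available on the user record or auth token. `PLAN_ENTITLEMENTS`, `User`, and `has_entitlement` are hypothetical names, and the plans listed are examples.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical mapping; in practice this would live in a product config
# system or be derived from roles in an RBAC system.
PLAN_ENTITLEMENTS: Dict[str, FrozenSet[str]] = {
    "free": frozenset(),
    "premium": frozenset({"premium_dashboard"}),
}


@dataclass(frozen=True)
class User:
    id: int
    subscription_plan: str


def has_entitlement(user: User, feature: str) -> bool:
    """Check access from the user's own attributes, with no flag service call."""
    return feature in PLAN_ENTITLEMENTS.get(user.subscription_plan, frozenset())
```

Because the check depends only on data the application already owns, an outage of the flag service cannot lock paying users out.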
Canary Releases and Gradual Rollout with Flags
Canary releases are about routing a percentage of traffic to a new version of the service at the infrastructure level (e.g., Kubernetes canary deployments). But feature flags can enhance canaries by allowing you to target specific user segments within the canary pod. For example, you deploy the new version to 5% of pods, then use a feature flag to only enable the new feature for 10% of users hitting those pods. This gives you fine-grained control.
This pattern is common at large scale: you canary the deployment at the pod level, and inside the pod, use a flag to limit exposure further. This reduces blast radius if the new version has a bug — only a subset of the canary group sees the broken code.
# Package: io.thecodeforge.python.devops
# Canary with feature flag: even if the pod receives traffic, only a fraction of users get the new feature
import hashlib

def compute_bucket(user_id, flag_name, total_percent):
    hash_val = int(hashlib.md5(f'{flag_name}:{user_id}'.encode()).hexdigest(), 16)
    return (hash_val % 100) + 1 <= total_percent

# Canary: 5% of pods run new code, but only 20% of users on those pods get the feature
# That's effectively 1% of total users
def get_recommendations(user_id):
    if compute_bucket(user_id, 'new_recommendation_v2', 20):
        # This code only runs in the canary pods
        return new_recommendation_system(user_id)
    else:
        return old_system(user_id)

Hybrid Canary vs Pure Flag Canary
Pure flag canary: deploy the new code to all pods but turn the flag off. Then gradually increase the flag percentage. This is simpler but uses more resources (both old and new code paths are always loaded). Hybrid is safer for risky changes because the new code is only present in a subset of pods.
Production Insight
If you only use flags for canary, you must ensure the flag evaluation does not add noticeable latency. Use a local cache or a fast evaluation path.
Monitoring the canary: you need separate dashboards for the canary group vs the control group. Use the flag context to tag traces and metrics.
Rule: always run a canary for at least 10 minutes before ramping up. Watch error rates, latency, and business metrics.
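Comparing the two groups can be reduced to a simple in-process sketch. The class below counts requests and errors per group and recommends a rollback when the canary's error rate exceeds the control's by a margin. The class name, the 2-point margin, and the 100-request minimum are assumptions for illustration. A real setup would emit tagged metrics to a monitoring backend instead of counting in memory.

```python
from collections import Counter


class CanaryMonitor:
    """Track error rates for the canary and control groups separately."""

    GROUPS = ("canary", "control")

    def __init__(self, max_error_rate_gap: float = 0.02,
                 min_canary_requests: int = 100) -> None:
        self.max_error_rate_gap = max_error_rate_gap    # assumed threshold
        self.min_canary_requests = min_canary_requests  # avoid tiny-sample noise
        self._requests: Counter = Counter()
        self._errors: Counter = Counter()

    def record(self, group: str, ok: bool) -> None:
        if group not in self.GROUPS:
            raise ValueError(f"unknown group: {group!r}")
        self._requests[group] += 1
        if not ok:
            self._errors[group] += 1

    def error_rate(self, group: str) -> float:
        total = self._requests[group]
        return self._errors[group] / total if total else 0.0

    def should_roll_back(self) -> bool:
        # Too little canary traffic: not enough evidence either way.
        if self._requests["canary"] < self.min_canary_requests:
            return False
        gap = self.error_rate("canary") - self.error_rate("control")
        return gap > self.max_error_rate_gap
```

The group label would come from the flag evaluation result, so each request is tagged with the variant it actually received.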
Key Takeaway
Hybrid canary = pod-level + flag-level control for maximum safety.
Monitor the canary group separately — don't mix metrics with the control group.
Have a kill switch ops flag ready before starting the canary.
If: Low-risk change (UI change, non-critical path)
→ Use: Pure flag canary: deploy to all pods, enable flag for 1% of users first.
If: High-risk change (database schema migration, payment logic)
→ Use: Hybrid canary: deploy to 5% of pods, then enable flag for 10% of users within those pods.
If: Need to roll back instantly for a critical bug
→ Use: An ops flag alongside the canary. Turn the ops flag on to immediately disable the new code path, even if the flag percentage is high.
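The interaction between the ops flag and the rollout percentage is easy to get backwards, so a short sketch may help. Here the kill switch is checked first and always wins. The function names and the flag name `release_new_recommendation_v2` are hypothetical, and the bucketing repeats the MD5 scheme used earlier in this article.

```python
import hashlib


def in_rollout(flag_name: str, user_id: int, percentage: float) -> bool:
    """Deterministic percentage bucketing: MD5 of 'flag:user' mapped to 1-100."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    return (int(digest, 16) % 100) + 1 <= percentage


def use_new_code_path(
    user_id: int,
    *,
    kill_switch_on: bool,
    percentage: float,
    flag_name: str = "release_new_recommendation_v2",  # hypothetical flag name
) -> bool:
    # The ops flag is evaluated first: when it is on, nobody gets the new
    # path, regardless of how far the rollout percentage has been raised.
    if kill_switch_on:
        return False
    return in_rollout(flag_name, user_id, percentage)
```

Keeping the kill switch as a separate flag means a rollback does not require editing the rollout percentage, and the percentage is preserved for when the fix ships.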
Managing Flag Debt and Cleanup
Flag debt is the accumulation of stale conditionals in your code. Every flag that is no longer needed but still present forces your team to maintain two paths. Over time, the old path can break silently because it's rarely tested. The solution is to make flags ephemeral: set a removal date when you create the flag, automate reminders, and schedule cleanup as part of your sprint cycle.
A good rule: if a release flag has been at 100% for more than two weeks, it must be removed. For experiment flags, remove after the experiment analysis is complete — don't keep them 'just in case'. Ops flags and permission flags are exceptions, but they should be reviewed quarterly.
# Package: io.thecodeforge.python.devops
# Example: automate flag cleanup detection in CI
# This script collects the flag keys referenced in the codebase
import re
import subprocess

FLAG_PATTERN = r'client\.variation\([\'"]([\w-]+)[\'"]'

def find_old_flags(months: int = 3):
    # Get all flags used in the codebase (the -P option requires GNU grep)
    result = subprocess.run(['grep', '-roPh', FLAG_PATTERN, 'src/'], capture_output=True, text=True)
    flags = set(re.findall(FLAG_PATTERN, result.stdout))
    # Age filtering is not implemented: `months` is a placeholder until each
    # flag's metadata is fetched from the flag service API.
    # For now, just list them
    return flags

# In CI, warn if a release flag is older than 2 weeks
# This helps reduce flag debt

Flag Debt Causes Real Outages
A stale flag with a code path that is never exercised can break when a refactoring touches the old code. The outage described in the production incident above happened exactly this way. Treat flag cleanup as a security practice.
Production Insight
Automation is key: add a lint rule that flags any client.variation() call for a flag that is > 2 weeks at 100% rollout.
Manual audits every quarter are better than nothing but often get skipped.
Rule: when you create a flag, create a corresponding JIRA ticket with a due date for removal.
Key Takeaway
Create flags with an expiry date.
Automate flag debt detection in CI.
If a flag is at 100% for more than 2 weeks, schedule its removal now.
If: Release flag at 100% for > 2 weeks
→ Use: High priority: remove within next sprint. The old code path is dead and should be deleted.
If: Experiment flag ended > 1 month ago
→ Use: Medium priority: remove after analysis report is finalized. Keep the winning variant, delete the rest.
If: Ops flag never toggled in 6 months
→ Use: Low priority but review: is this ops flag still needed? If not, remove it to simplify the codebase.
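These rules are mechanical enough to automate once flag metadata is available. The sketch below encodes them as a function over a hypothetical `FlagRecord`. The field names are assumptions, and in practice the dates would come from the flag service's API or audit log. "1 month" is approximated as 30 days and "6 months" as 182 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class FlagRecord:
    """Hypothetical metadata for one flag; field names are assumptions."""

    key: str
    kind: str  # "release", "experiment", or "ops"
    created: date
    full_rollout_since: Optional[date] = None  # release flags at 100%
    experiment_ended: Optional[date] = None
    last_toggled: Optional[date] = None


def cleanup_priority(flag: FlagRecord, today: date) -> Optional[str]:
    """Return 'high', 'medium', 'review', or None if no cleanup is due."""
    if flag.kind == "release" and flag.full_rollout_since is not None:
        if today - flag.full_rollout_since > timedelta(weeks=2):
            return "high"
    if flag.kind == "experiment" and flag.experiment_ended is not None:
        if today - flag.experiment_ended > timedelta(days=30):
            return "medium"
    if flag.kind == "ops":
        # A flag that was never toggled counts from its creation date.
        last_activity = flag.last_toggled or flag.created
        if today - last_activity > timedelta(days=182):
            return "review"
    return None
```

A CI job could run this over every flag referenced in the codebase and fail or warn on `high`, which turns the two-week rule into something enforced rather than remembered.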