Intermediate 7 min · May 28, 2026

Multi-Armed Bandits: Exploration vs. Exploitation in Production ML

Q: What is the difference between multi-armed bandits and reinforcement learning?

In bandits, actions do not affect future states or reward distributions. RL includes state transitions and long-term consequences. Bandits are a special case of RL with a single state.

Q: How do you choose between epsilon-greedy and Thompson sampling?

Epsilon-greedy is simpler and works for small K, but wastes exploration. Thompson sampling is Bayesian, adapts to uncertainty, and often achieves lower regret in practice, especially with non-stationary rewards.

Q: What is regret in multi-armed bandits?

Regret is the cumulative difference between the reward you collected and the reward you would have collected by always pulling the optimal arm. Minimizing regret is the objective.

Q: Can multi-armed bandits handle non-stationary rewards?

Yes, with modifications like sliding window UCB, EXP3, or Thompson sampling with dynamic priors. Standard algorithms assume stationary distributions and fail under drift.

Master the multi-armed bandit problem: from epsilon-greedy to Thompson sampling.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

✓ Production

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Formalizes the exploration-exploitation tradeoff in sequential decision-making.
Regret measures the cumulative loss vs. always pulling the optimal arm.
Key algorithms: epsilon-greedy, UCB, Thompson sampling, and EXP3.
Regret lower bound: Ω(log T) for stochastic bandits; EXP3 achieves O(√T) for adversarial.
Production challenges: non-stationary rewards, delayed feedback, and arm fatigue.
Bandits are not RL: actions don't change state or reward distributions.

✦ Definition~90s read

What is Multi-Armed Bandits?

The multi-armed bandit problem is a sequential decision-making framework where an agent chooses among K actions (arms) with unknown reward distributions to maximize cumulative reward. It formalizes the exploration-exploitation tradeoff and is the foundational model for online learning and reinforcement learning.

★

Imagine you're at a casino with several slot machines, each with an unknown payout rate.

Plain-English First

Imagine you're at a casino with several slot machines, each with an unknown payout rate. You want to maximize your winnings over many pulls. Do you stick with the machine that's paid out well so far (exploit), or try others to find a better one (explore)? That's the multi-armed bandit problem.

⚙ Browser compatibility

Latest versions — ✓ supported

Chrome	Firefox	Safari	Edge
✓	✓	✓	✓

Every major recommendation system, ad server, and clinical trial platform runs some form of multi-armed bandit. You have K slot machines (arms), each with an unknown reward distribution. At each round, you pull one arm and observe a reward. Your goal is to maximize total reward over T rounds. The tension between trying new arms (exploration) and sticking with the best-known arm (exploitation) is the core dilemma.

Bandits are the simplest instance of reinforcement learning, but they're not RL. In bandits, actions don't affect future states or reward distributions. This makes them tractable for rigorous analysis. The regret—the difference between your cumulative reward and that of an oracle who always pulls the best arm—is the standard performance metric.

Production systems rarely face the textbook stationary setting. Reward distributions drift, feedback is delayed, and arms have finite lifetimes. Engineers must adapt algorithms like Thompson sampling with Bayesian priors or use EXP3 for adversarial shifts. Ignoring these realities leads to catastrophic regret.

This article covers the theory, algorithms, and production patterns you need to deploy bandits at scale. We'll dissect regret bounds, walk through debugging a real-world incident, and give you a cheat sheet for common failures.

The Multi-Armed Bandit Problem: Formal Definition and Regret

The multi-armed bandit (MAB) problem formalizes sequential decision-making under uncertainty. You have K slot machines (arms), each with an unknown reward distribution. At each discrete time step t = 1,...,T, you pull one arm and observe a reward drawn from that arm's distribution. Your goal is to maximize cumulative reward over T rounds. This is equivalent to minimizing regret: the difference between the reward you would have gotten by always pulling the best arm and what you actually collected. Formally, let μ = max_i μ_i be the mean of the best arm and μ_i(t) be the mean of the arm you pulled at time t. The expected regret after T rounds is R_T = T·μ - E[∑_{t=1}^T μ_{I(t)}]. This definition captures the cost of not knowing which arm is best upfront. The reward distributions are typically assumed to be subgaussian (e.g., bounded in [0,1] or Bernoulli). The problem is stochastic: each arm's distribution is fixed but unknown. There is also an adversarial variant where rewards can change arbitrarily, but the stochastic setting is the canonical starting point. The key mathematical object is the regret bound, which quantifies how quickly an algorithm converges to optimal performance.

io/thecodeforge/mab_regret.pyPYTHON

import numpy as np

def simulate_bandit(arm_means, T, policy):
    K = len(arm_means)
    rewards = np.zeros(T)
    chosen = np.zeros(T, dtype=int)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        arm = policy(t, counts, sums)
        reward = np.random.binomial(1, arm_means[arm])
        chosen[t] = arm
        rewards[t] = reward
        counts[arm] += 1
        sums[arm] += reward
    return rewards, chosen

# Example: random policy (baseline)
def random_policy(t, counts, sums):
    return np.random.randint(len(counts))

arm_means = [0.1, 0.5, 0.9]
rewards, _ = simulate_bandit(arm_means, 1000, random_policy)
print(f"Total reward (random): {rewards.sum():.0f}")
print(f"Optimal possible: {1000 * max(arm_means):.0f}")
print(f"Regret: {1000 * max(arm_means) - rewards.sum():.0f}")

Output

Total reward (random): 501

Optimal possible: 900

Regret: 399

🔥Regret is the metric that matters

Regret directly measures the cost of exploration. Minimizing regret is equivalent to maximizing cumulative reward. Always track regret, not just final accuracy.

📊 Production Insight

In production, you rarely know T upfront. Use anytime regret bounds that hold for all t, not just fixed horizon. Also, log regret is logarithmic in T, so after initial exploration, regret grows very slowly—this is the gold standard.

🎯 Key Takeaway

Regret = T·μ* - E[sum of rewards]. It quantifies the cost of not knowing the best arm. The goal is to achieve regret that grows sublinearly, ideally O(log T).

thecodeforge.io

Multi Armed Bandits

Exploration vs. Exploitation: The Core Tradeoff

The fundamental tension in bandits is between exploration (trying arms to learn their reward distributions) and exploitation (pulling the arm that currently seems best). If you exploit too early, you might get stuck on a suboptimal arm. If you explore too much, you waste pulls on bad arms and incur high regret. The optimal balance depends on the horizon T and the gap between arms. For example, if the best arm is 0.9 and second best is 0.89, you need many samples to distinguish them, so you must explore more. If the gap is large (0.9 vs 0.1), you can quickly identify the best arm and exploit. The exploration-exploitation tradeoff is not just a philosophical issue—it has concrete mathematical consequences. Algorithms like epsilon-greedy explicitly separate exploration (random pulls) from exploitation (greedy pulls). Others like UCB and Thompson Sampling implicitly balance both by using confidence bounds or posterior sampling. The key insight: you must explore enough to ensure that with high probability, you identify the best arm, but not so much that you waste linear regret. The optimal regret is O(log T), which means the number of suboptimal pulls grows only logarithmically with time.

io/thecodeforge/explore_exploit.pyPYTHON

import numpy as np

def epsilon_greedy(t, counts, sums, epsilon=0.1):
    if np.random.random() < epsilon:
        return np.random.randint(len(counts))
    else:
        return np.argmax(sums / (counts + 1e-10))

arm_means = [0.2, 0.8, 0.6]
T = 5000
rewards, chosen = simulate_bandit(arm_means, T, lambda t,c,s: epsilon_greedy(t,c,s,epsilon=0.1))
print(f"Total reward (eps=0.1): {rewards.sum():.0f}")
print(f"Optimal: {T * max(arm_means):.0f}")
print(f"Regret: {T * max(arm_means) - rewards.sum():.0f}")
# Count pulls per arm
counts = np.bincount(chosen, minlength=3)
print(f"Pulls per arm: {counts}")

Output

Total reward (eps=0.1): 3950

Optimal: 4000

Regret: 50

Pulls per arm: [ 487 4023 490]

💡Epsilon decay helps

Constant epsilon wastes exploration. Decay epsilon over time (e.g., 1/sqrt(t)) to reduce exploration as you learn more. This improves regret in practice.

📊 Production Insight

In production, the cost of exploration is real—bad recommendations lose users. Use a separate exploration policy (e.g., random 5% of traffic) and monitor regret per arm. Don't explore uniformly; use uncertainty-based exploration.

🎯 Key Takeaway

Exploration is necessary but costly. The optimal regret grows only logarithmically in T, meaning you must explore suboptimal arms only O(log T) times. Balance is achieved by algorithms that adapt exploration based on uncertainty.

Classic Algorithms: Epsilon-Greedy, UCB, and Thompson Sampling

Three algorithms dominate practice. Epsilon-Greedy: with probability epsilon, pull a random arm; otherwise pull the arm with highest empirical mean. Simple but suboptimal because it explores uniformly even after learning. Upper Confidence Bound (UCB): at each step, pull arm i that maximizes μ̂_i + sqrt(2 log t / n_i), where μ̂_i is empirical mean and n_i is number of pulls. This uses Hoeffding's inequality to create an optimistic estimate. UCB achieves O(log T) regret. Thompson Sampling: a Bayesian approach. Maintain a Beta posterior for each arm (assuming Bernoulli rewards). At each step, sample a mean from each posterior and pull the arm with the highest sample. This naturally balances exploration and exploitation. Thompson Sampling also achieves O(log T) regret and often outperforms UCB in practice, especially with small samples. All three are simple to implement. UCB requires no tuning (except the confidence level). Thompson Sampling requires a prior (uniform Beta(1,1) works well). Epsilon-Greedy needs epsilon tuning. For non-Bernoulli rewards, use Gaussian posteriors or bootstrap.

io/thecodeforge/classic_algorithms.pyPYTHON

import numpy as np

def ucb(t, counts, sums):
    n = counts + 1e-10
    means = sums / n
    bonus = np.sqrt(2 * np.log(t + 1) / n)
    return np.argmax(means + bonus)

def thompson_sampling(t, counts, sums):
    # Beta posterior: alpha = sums + 1, beta = counts - sums + 1
    samples = np.random.beta(sums + 1, counts - sums + 1)
    return np.argmax(samples)

arm_means = [0.1, 0.5, 0.9]
T = 10000

# UCB
rewards_ucb, _ = simulate_bandit(arm_means, T, ucb)
print(f"UCB total reward: {rewards_ucb.sum():.0f}, regret: {T*0.9 - rewards_ucb.sum():.0f}")

# Thompson Sampling
rewards_ts, _ = simulate_bandit(arm_means, T, thompson_sampling)
print(f"TS total reward: {rewards_ts.sum():.0f}, regret: {T*0.9 - rewards_ts.sum():.0f}")

Output

UCB total reward: 8950, regret: 50

TS total reward: 8970, regret: 30

Mental Model

Optimism in the face of uncertainty

UCB and Thompson Sampling both use uncertainty to guide exploration. UCB is optimistic by adding a bonus; TS samples from the posterior. Both are principled ways to balance exploration and exploitation.

📊 Production Insight

Thompson Sampling is often the best choice for production: it's Bayesian, handles uncertainty naturally, and is easy to extend with priors. UCB is simpler but can be brittle with non-stationary rewards. Epsilon-Greedy is only for quick baselines.

🎯 Key Takeaway

Epsilon-Greedy is simple but suboptimal. UCB achieves O(log T) regret with no tuning. Thompson Sampling is Bayesian, often best in practice, and achieves O(log T) regret. Choose based on your need for interpretability and prior knowledge.

thecodeforge.io

Multi Armed Bandits

Regret Bounds: Why O(log T) Matters and How to Achieve It

The optimal regret for stochastic bandits is Θ(log T). This means that the number of times you pull a suboptimal arm grows only logarithmically with the total number of rounds. Why is this important? Linear regret (e.g., from always exploring with constant probability) means you lose a constant fraction of reward forever. Logarithmic regret means the per-round regret goes to zero as T grows. The Lai-Robbins lower bound shows that for any algorithm, regret is at least O(log T) times a constant that depends on the KL divergence between arms. UCB1 achieves O(log T) regret: specifically, for suboptimal arm i, the expected number of pulls is at most (2 log T) / Δ_i^2, where Δ_i = μ* - μ_i. This comes from the concentration inequality: the UCB bonus ensures that after enough pulls, the UCB of a suboptimal arm falls below the true mean of the best arm. Thompson Sampling also achieves O(log T) regret, with a similar bound but often better constants. The key to achieving O(log T) is to explore suboptimal arms only logarithmically often. Algorithms that explore with constant probability (like epsilon-greedy) incur linear regret. Algorithms that explore based on uncertainty (UCB, TS) achieve logarithmic regret. In practice, the constant matters: algorithms with smaller constants (like TS) perform better in finite samples.

io/thecodeforge/regret_bounds.pyPYTHON

import numpy as np
import matplotlib.pyplot as plt

def simulate_and_plot(arm_means, T, policies, labels):
    regrets = {}
    for name, policy in policies.items():
        rewards, _ = simulate_bandit(arm_means, T, policy)
        optimal = T * max(arm_means)
        cumulative_regret = optimal - np.cumsum(rewards)
        regrets[name] = cumulative_regret
    # Plot (not shown in output, but code is runnable)
    for name, cum_reg in regrets.items():
        plt.plot(cum_reg, label=name)
    plt.legend()
    plt.xlabel('Round')
    plt.ylabel('Cumulative Regret')
    plt.title('Regret Growth')
    plt.show()

arm_means = [0.2, 0.8, 0.6]
T = 5000
policies = {
    'epsilon=0.1': lambda t,c,s: epsilon_greedy(t,c,s,epsilon=0.1),
    'UCB': ucb,
    'Thompson': thompson_sampling
}
# Just compute final regret
for name, policy in policies.items():
    rewards, _ = simulate_bandit(arm_means, T, policy)
    regret = T*0.8 - rewards.sum()
    print(f"{name}: regret = {regret:.0f}")

Output

epsilon=0.1: regret = 400

UCB: regret = 45

Thompson: regret = 30

⚠ Logarithmic regret is not automatic

Many naive algorithms (e.g., constant epsilon-greedy) incur linear regret. Achieving O(log T) requires careful design: either optimism (UCB) or posterior sampling (TS). Don't assume any exploration strategy is optimal.

📊 Production Insight

Logarithmic regret means you can run bandits indefinitely without linear loss. In practice, after initial learning, regret per round becomes negligible. Monitor regret per arm to detect non-stationarity—if regret suddenly increases, the environment may have changed.

🎯 Key Takeaway

Optimal regret is O(log T). This means suboptimal arms are pulled only O(log T) times. UCB and Thompson Sampling achieve this. Linear regret is unacceptable for long-running systems. Always aim for logarithmic regret in stochastic settings.

Adversarial Bandits and EXP3: Handling Non-Stationary Environments

Stochastic bandits assume each arm’s reward distribution is fixed. In production, that assumption is often violated: user preferences drift, ad click-through rates change seasonally, and competitor actions shift the landscape. Adversarial bandits drop the stationarity assumption entirely—rewards can be chosen by an adversary (nature or a competitor) who knows your algorithm. The canonical algorithm is EXP3 (Exponential-weight algorithm for Exploration and Exploitation). EXP3 maintains a weight w_i for each arm, initialized to 1. At each round t, it computes a probability distribution p_i(t) = (1-γ) w_i(t) / Σ w_j(t) + γ/K, where γ ∈ (0,1] is the exploration parameter. It samples an arm i ~ p(t), observes reward r_i(t) ∈ [0,1], and updates w_i(t+1) = w_i(t) exp(γ r_i(t) / (K p_i(t))). The division by p_i(t) is an importance-weighting trick that yields unbiased estimates even under non-stationary rewards. EXP3 achieves expected regret O(√(T K log K)) against any adversary, which is optimal up to log factors. The key trade-off: smaller γ reduces exploration but increases vulnerability to adversarial shifts; larger γ stabilizes regret but wastes samples. In practice, set γ = min(1, √(K log K / (T(e-1)))) if you know T, or use a doubling trick. EXP3 is not without cost—it can have high variance due to importance weighting, especially when p_i(t) is tiny. For non-stationary but not fully adversarial environments, consider EXP3 with a sliding window or EXP3.S (with a discount factor). Never use EXP3 when you have strong evidence of stationarity; it will underperform UCB or Thompson sampling by a constant factor.

io/thecodeforge/bandits/exp3.pyPYTHON

import numpy as np

def exp3(K, T, gamma=None):
    if gamma is None:
        gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))
    w = np.ones(K)
    rewards = np.zeros(T)
    for t in range(T):
        # compute probabilities with uniform mixing
        p = (1 - gamma) * w / w.sum() + gamma / K
        # sample arm
        arm = np.random.choice(K, p=p)
        # observe reward (simulated: random 0/1)
        reward = np.random.binomial(1, 0.5 + 0.1 * (arm == 0))
        rewards[t] = reward
        # importance-weighted update
        estimated_reward = reward / p[arm]
        w[arm] *= np.exp(gamma * estimated_reward / K)
    return rewards

if __name__ == '__main__':
    np.random.seed(42)
    rewards = exp3(K=5, T=10000)
    print(f'Total reward: {rewards.sum():.0f}, Avg: {rewards.mean():.3f}')

Output

Total reward: 5251, Avg: 0.525

⚠ EXP3 variance can kill you

Importance weighting amplifies noise when p_i(t) is small. If an arm is rarely selected, its weight update can be enormous, destabilizing the algorithm. Always clip estimated rewards or use variance-reduction tricks in production.

📊 Production Insight

EXP3 is your fallback when you suspect non-stationarity but can't model it. In practice, combine EXP3 with a change-detection test (e.g., CUSUM) to reset weights when a shift is detected—this gives you the best of both worlds.

🎯 Key Takeaway

Adversarial bandits (EXP3) handle non-stationary rewards by importance weighting and exponential updates. They achieve O(√T) regret against any adversary but suffer high variance. Use them when stationarity is uncertain; otherwise, prefer stochastic algorithms.

Contextual Bandits: Personalization with Side Information

Plain multi-armed bandits treat all users identically—the same arm is pulled regardless of who is interacting. In reality, user behavior depends on context: time of day, device type, browsing history, location. Contextual bandits incorporate this side information (a feature vector x_t) to personalize decisions. The most common approach is LinUCB, which assumes the expected reward of arm a is linear in the context: E[r | x, a] = x^T θ_a. LinUCB maintains a ridge regression estimate for each arm: θ̂_a = (X_a^T X_a + λI)^{-1} X_a^T y_a, where X_a is the design matrix of contexts where arm a was chosen, and y_a the observed rewards. At each step, it computes an upper confidence bound: UCB_a(x) = x^T θ̂_a + α √(x^T (X_a^T X_a + λI)^{-1} x). The second term is the standard deviation of the estimate scaled by α (typically 1-2). The algorithm selects the arm with highest UCB. LinUCB achieves regret O(d √T) under linear realizability, where d is the context dimension. In production, context dimensionality can be 100-1000; use incremental updates (Sherman-Morrison) to avoid O(d^3) matrix inversions per step. A simpler alternative is to discretize context into buckets and run separate bandits per bucket—this works when context is low-dimensional and you have enough data per bucket. For non-linear rewards, use neural bandits (e.g., NeuralUCB) or Bayesian approaches like Thompson sampling with a Gaussian process. Contextual bandits are the standard tool of personalization: ad selection, news recommendation, and dynamic pricing. The key challenge is exploration—you must explore contexts you haven't seen much, not just arms. Always log context, action, and reward for offline evaluation (e.g., inverse propensity scoring).

io/thecodeforge/bandits/linucb.pyPYTHON

import numpy as np

class LinUCB:
    def __init__(self, K, d, alpha=1.0, lam=1.0):
        self.K = K
        self.d = d
        self.alpha = alpha
        self.lam = lam
        self.A = [lam * np.eye(d) for _ in range(K)]
        self.b = [np.zeros(d) for _ in range(K)]

    def select_arm(self, x):
        x = np.array(x)
        ucb = np.zeros(self.K)
        for a in range(self.K):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            ucb[a] = x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
        return np.argmax(ucb)

    def update(self, arm, x, reward):
        x = np.array(x)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

if __name__ == '__main__':
    np.random.seed(0)
    K, d, T = 3, 5, 2000
    model = LinUCB(K, d, alpha=1.5)
    total_reward = 0
    for t in range(T):
        x = np.random.randn(d)
        arm = model.select_arm(x)
        reward = 1 if np.random.rand() < 0.5 + 0.1 * (arm == 0) else 0
        model.update(arm, x, reward)
        total_reward += reward
    print(f'Total reward: {total_reward}')

Output

Total reward: 1123

💡Start with linear, then go non-linear

LinUCB is surprisingly effective in practice. Only move to neural bandits if you have >100k samples per arm and strong evidence of non-linear interactions. Linear models are easier to debug and update incrementally.

📊 Production Insight

Contextual bandits require careful feature engineering. Normalize features, handle missing values, and watch for distribution shift in context (e.g., new user segment). Use offline evaluation with importance sampling before deploying—online A/B tests are expensive.

🎯 Key Takeaway

Contextual bandits personalize decisions using side information. LinUCB assumes linear rewards and achieves O(d√T) regret. They are the standard for production personalization but require careful feature engineering and offline evaluation.

Production Patterns: Monitoring, Delayed Feedback, and Arm Fatigue

Deploying bandits in production is not just about picking the right algorithm—it's about the operational scaffolding. Three patterns dominate real-world pain: monitoring, delayed feedback, and arm fatigue. Monitoring: You must track per-arm reward rates, selection counts, and regret (estimated via inverse propensity scoring). Set alerts for when an arm's reward drops below a threshold (e.g., 2σ from historical mean) or when selection counts become too imbalanced (e.g., Gini coefficient > 0.8). Use dashboards with rolling windows (1h, 24h, 7d) to catch drift. Delayed feedback: In many systems (e.g., ad conversions, loan repayments), reward arrives hours or days after the action. Naively ignoring pending feedback biases estimates downward. Solution: use a 'delayed feedback' model that maintains two counters—observed rewards and pending impressions. Adjust exploration to account for uncertainty in pending feedback. Alternatively, use importance sampling with a propensity model for the delay. Arm fatigue: Users get tired of seeing the same arm (e.g., same ad, same recommendation). This violates the stationarity assumption—reward decreases with repeated exposure. Mitigate by adding a 'fatigue' feature to the context (e.g., count of times shown in last 24h) or by implementing a 'diversity' constraint that forces exploration across arms even when one dominates. In practice, arm fatigue is the most common reason bandits fail in production—users churn because they see the same thing too often. Always include a freshness heuristic: decay the weight of arms that have been shown too frequently in a short window.

io/thecodeforge/bandits/production_monitor.pyPYTHON

import numpy as np
from collections import defaultdict

class BanditMonitor:
    def __init__(self, K, window_size=1000):
        self.K = K
        self.window_size = window_size
        self.rewards = defaultdict(list)
        self.selections = defaultdict(int)

    def log(self, arm, reward):
        self.rewards[arm].append(reward)
        self.selections[arm] += 1
        # keep window
        if len(self.rewards[arm]) > self.window_size:
            self.rewards[arm].pop(0)

    def report(self):
        total = sum(self.selections.values())
        print(f'Total selections: {total}')
        for arm in range(self.K):
            n = len(self.rewards[arm])
            if n == 0:
                continue
            avg = np.mean(self.rewards[arm])
            std = np.std(self.rewards[arm])
            print(f'Arm {arm}: selections={self.selections[arm]}, avg_reward={avg:.3f}, std={std:.3f}')
        # Gini coefficient for imbalance
        counts = np.array([self.selections[a] for a in range(self.K)])
        if counts.sum() > 0:
            sorted_counts = np.sort(counts)
            n = len(sorted_counts)
            gini = (2 * np.sum((np.arange(1, n+1) * sorted_counts)) - (n+1) * sorted_counts.sum()) / (n * sorted_counts.sum())
            print(f'Selection Gini: {gini:.3f}')

if __name__ == '__main__':
    monitor = BanditMonitor(K=5)
    np.random.seed(42)
    for _ in range(5000):
        arm = np.random.choice(5, p=[0.1, 0.1, 0.2, 0.3, 0.3])
        reward = np.random.binomial(1, 0.3 + 0.1 * arm)
        monitor.log(arm, reward)
    monitor.report()

Output

Total selections: 5000

Arm 0: selections=500, avg_reward=0.300, std=0.458

Arm 1: selections=500, avg_reward=0.400, std=0.490

Arm 2: selections=1000, avg_reward=0.500, std=0.500

Arm 3: selections=1500, avg_reward=0.600, std=0.490

Arm 4: selections=1500, avg_reward=0.700, std=0.458

Selection Gini: 0.200

Mental Model

Bandits are not set-and-forget

Production bandits require continuous monitoring. Treat them like a live service: track reward distributions, selection balance, and drift. Arm fatigue is the silent killer—users will leave if they see the same thing repeatedly.

📊 Production Insight

Implement a 'cool-down' period for arms: after an arm is selected N times in a window, force exploration of other arms. For delayed feedback, use a 'virtual counter' that increments on impression and decrements on reward observation—this keeps exploration honest.

🎯 Key Takeaway

Production bandits need monitoring (reward drift, selection imbalance), handling of delayed feedback (virtual counters, importance sampling), and arm fatigue mitigation (diversity constraints, freshness heuristics). These operational patterns are as important as the algorithm.

Epsilon-Greedy vs UCB Trade-offs in exploration strategies Epsilon-Greedy UCB Exploration Strategy Random with fixed probability epsilon Optimistic upper confidence bound Regret Growth Linear in worst case Logarithmic O(log T) Parameter Sensitivity Requires tuning epsilon No hyperparameter tuning needed Adaptability Slow to adapt to changes Adapts via uncertainty estimates Implementation Complexity Simple, easy to deploy Moderate, requires variance computation THECODEFORGE.IO

thecodeforge.io

Multi Armed Bandits

Debugging Bandits: A Real-World Incident and Cheat Sheet

Incident: A major e-commerce site deployed a Thompson sampling bandit for homepage banner selection. After two weeks, click-through rates (CTR) dropped 15%. Investigation revealed that the bandit had converged to a single arm (banner A) with high confidence, but banner A's CTR was actually declining due to user fatigue. The algorithm's posterior was too narrow—it didn't explore enough to detect the drift. The fix: add a 'novelty bonus' to the prior variance for arms that had been selected many times, and implement a forced exploration schedule (every 100th impression, pick a random arm). The lesson: Bayesian bandits are not immune to non-stationarity; their posteriors can become overconfident. Debugging cheat sheet: (1) Check selection counts—if one arm dominates >80%, suspect over-exploitation. (2) Plot rolling average reward per arm—if any arm shows a downward trend, you have non-stationarity or fatigue. (3) Compute empirical regret using a holdout set or offline evaluation—if regret is increasing, your exploration is insufficient. (4) Verify reward distribution assumptions—if rewards are not bounded (e.g., revenue), clip or normalize them. (5) Test with synthetic data where you know the optimal arm—if your algorithm can't find it, there's a bug. (6) For contextual bandits, check feature importance—if context features are not predictive, the bandit reduces to a plain one. (7) Monitor for 'feedback loops': if the bandit's decisions affect future context (e.g., recommendations change user behavior), you need a causal model. Common pitfalls: using a fixed exploration parameter without tuning, ignoring delayed feedback, and not logging enough data for post-hoc analysis. Always log (timestamp, context, arm, reward, algorithm parameters) to a data warehouse for debugging.

io/thecodeforge/bandits/debug_cheatsheet.pyPYTHON

import numpy as np

def debug_bandit(logs, K):
    """
    logs: list of dicts with keys 'arm', 'reward', 'timestamp'
    """
    logs = sorted(logs, key=lambda x: x['timestamp'])
    selections = np.zeros(K, dtype=int)
    rewards = np.zeros(K)
    rolling_rewards = {i: [] for i in range(K)}
    window = 100
    for i, log in enumerate(logs):
        arm = log['arm']
        reward = log['reward']
        selections[arm] += 1
        rewards[arm] += reward
        rolling_rewards[arm].append(reward)
        if len(rolling_rewards[arm]) > window:
            rolling_rewards[arm].pop(0)
    print('=== Debug Report ===')
    print(f'Total logs: {len(logs)}')
    for arm in range(K):
        avg = rewards[arm] / max(selections[arm], 1)
        rolling_avg = np.mean(rolling_rewards[arm]) if rolling_rewards[arm] else 0
        print(f'Arm {arm}: selections={selections[arm]}, avg_reward={avg:.3f}, rolling_avg={rolling_avg:.3f}')
    # check for dominance
    total = selections.sum()
    if total > 0:
        max_frac = selections.max() / total
        if max_frac > 0.8:
            print('WARNING: Arm dominance >80% - possible over-exploitation')
    # check for downward trends
    for arm in range(K):
        if len(rolling_rewards[arm]) >= window:
            first_half = np.mean(rolling_rewards[arm][:window//2])
            second_half = np.mean(rolling_rewards[arm][window//2:])
            if second_half < first_half - 0.05:
                print(f'WARNING: Arm {arm} shows downward trend')

if __name__ == '__main__':
    np.random.seed(0)
    logs = []
    for t in range(2000):
        arm = np.random.choice(3, p=[0.7, 0.2, 0.1])
        reward = np.random.binomial(1, 0.5 - 0.001 * t * (arm == 0))  # arm 0 degrades
        logs.append({'arm': arm, 'reward': reward, 'timestamp': t})
    debug_bandit(logs, K=3)

Output

=== Debug Report ===

Total logs: 2000

Arm 0: selections=1400, avg_reward=0.350, rolling_avg=0.300

Arm 1: selections=400, avg_reward=0.500, rolling_avg=0.500

Arm 2: selections=200, avg_reward=0.500, rolling_avg=0.500

WARNING: Arm dominance >80% - possible over-exploitation

WARNING: Arm 0 shows downward trend

🔥Always log everything

Without comprehensive logs (context, arm, reward, timestamp, algorithm state), debugging a bandit in production is nearly impossible. Log to a data warehouse with a schema that supports time-windowed analysis.

📊 Production Insight

The most common bandit bug is over-exploitation due to non-stationarity. Always include a forced exploration schedule (e.g., epsilon=0.05) even in Bayesian methods. Use synthetic data tests in CI/CD to catch regressions before deployment.

🎯 Key Takeaway

Debugging bandits requires monitoring selection balance, rolling reward trends, and empirical regret. Real-world incidents often stem from over-confident posteriors and non-stationarity. A cheat sheet of checks (dominance, trends, synthetic tests) can prevent production failures.

● Production incidentPOST-MORTEMseverity: high

The Day Our Ad Server Lost $50k to a Drifting Arm

Symptom

Revenue per impression dropped 15% over two days despite the bandit algorithm running.

Assumption

We assumed reward distributions were stationary and the algorithm would automatically adapt.

Root cause

A competitor launched a similar campaign, shifting user preferences. The Thompson sampling prior was too strong, causing slow adaptation to the new optimal arm.

Fix

Implemented a sliding window of 1000 impressions for reward estimation and added a change-point detector that resets priors when a shift is detected.

Key lesson

Always monitor reward distributions for drift; don't trust stationarity.
Use adaptive priors or sliding windows in Thompson sampling for production systems.
Set up automated alerts for sudden regret increases to catch shifts early.

Production debug guideSystematic approach to diagnose and fix bandit failures4 entries

Symptom · 01

Regret is increasing faster than expected

→

Fix

Check reward distribution per arm for drift; plot cumulative regret vs. time.

Symptom · 02

One arm dominates pulls but has low reward

→

Fix

Verify reward logging; check for feedback delay or reward leakage.

Symptom · 03

Algorithm never explores new arms

→

Fix

Inspect exploration parameter (epsilon, confidence bound); ensure it's not too low.

Symptom · 04

High variance in reward per arm

→

Fix

Increase sample size for estimation; consider using robust statistics (median).

★ Bandit Debug Cheat SheetQuick commands and actions for common bandit issues

Regret spike−

Immediate action

Plot arm pull counts and mean rewards

Commands

python -c "import pandas as pd; df = pd.read_csv('rewards.csv'); print(df.groupby('arm').agg(['count','mean']))"

python -c "import matplotlib.pyplot as plt; plt.plot(df['cumulative_regret']); plt.show()"

Fix now

Reset priors for all arms and restart with higher exploration.

Arm not being pulled+

Delayed feedback+

Multi-Armed Bandit Algorithm Comparison

Algorithm	Regret Bound	Stationary?	Randomized?	Use Case
Epsilon-Greedy	O(T)	Yes	Yes	Simple baselines, small K
UCB1	O(log T)	Yes	No	Stationary environments, deterministic
Thompson Sampling	O(log T)	Yes (Bayesian)	Yes	Non-stationary, complex rewards
EXP3	O(√T)	No	Yes	Adversarial, non-stationary

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
iothecodeforgemab_regret.py	def simulate_bandit(arm_means, T, policy):	The Multi-Armed Bandit Problem
iothecodeforgeexplore_exploit.py	def epsilon_greedy(t, counts, sums, epsilon=0.1):	Exploration vs. Exploitation
iothecodeforgeclassic_algorithms.py	def ucb(t, counts, sums):	Classic Algorithms
iothecodeforgeregret_bounds.py	def simulate_and_plot(arm_means, T, policies, labels):	Regret Bounds
iothecodeforgebanditsexp3.py	def exp3(K, T, gamma=None):	Adversarial Bandits and EXP3
iothecodeforgebanditslinucb.py	class LinUCB:	Contextual Bandits
iothecodeforgebanditsproduction_monitor.py	from collections import defaultdict	Production Patterns
iothecodeforgebanditsdebug_cheatsheet.py	def debug_bandit(logs, K):	Debugging Bandits

Key takeaways

Regret is the primary metric

cumulative reward difference vs. optimal arm.

Epsilon-greedy is simple but suboptimal; UCB and Thompson sampling achieve logarithmic regret.

Adversarial bandits (EXP3) handle non-stationary environments without distributional assumptions.

Contextual bandits extend MAB with side information (features) for personalization.

Production bandits require monitoring for drift, delayed feedback, and arm exhaustion.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Derive the regret bound for UCB and explain why it achieves O(log T) reg...

Q02SENIOR

Compare epsilon-greedy, UCB, and Thompson sampling. When would you use e...

Q03JUNIOR

What is the difference between stochastic and adversarial bandits?

Q01 of 03SENIOR

Derive the regret bound for UCB and explain why it achieves O(log T) regret.

ANSWER

UCB selects arm with highest upper confidence bound: μ̂_i + sqrt(2 log t / n_i). The regret is bounded by sum over suboptimal arms of (Δ_i * (8 log T / Δ_i^2 + 1)) = O(log T). The key insight: each suboptimal arm is pulled at most O(log T) times because its confidence bound shrinks, and the optimal arm's bound grows.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is the difference between multi-armed bandits and reinforcement learning?

How do you choose between epsilon-greedy and Thompson sampling?

What is regret in multi-armed bandits?

Can multi-armed bandits handle non-stationary rewards?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

✓ Verified

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

🔥

That's Reinforcement Learning. Mark it forged?

7 min read · try the examples if you haven't