Easy 9 min · May 28, 2026

Multi-Armed Bandits: Exploration vs. Exploitation in Production ML

Master the multi-armed bandit problem: from epsilon-greedy to Thompson sampling.

N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Production
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Formalizes the exploration-exploitation tradeoff in sequential decision-making.
  • Regret measures the cumulative loss vs. always pulling the optimal arm.
  • Key algorithms: epsilon-greedy, UCB, Thompson sampling, and EXP3.
  • Regret lower bound: Ω(log T) for stochastic bandits; EXP3 achieves O(√T) for adversarial.
  • Production challenges: non-stationary rewards, delayed feedback, and arm fatigue.
  • Bandits are not RL: actions don't change state or reward distributions.
✦ Definition~90s read
What is Multi-Armed Bandits?

The multi-armed bandit problem is a sequential decision-making framework where an agent chooses among K actions (arms) with unknown reward distributions to maximize cumulative reward. It formalizes the exploration-exploitation tradeoff and is the foundational model for online learning and reinforcement learning.

Imagine you're at a casino with several slot machines, each with an unknown payout rate.
Plain-English First

Imagine you're at a casino with several slot machines, each with an unknown payout rate. You want to maximize your winnings over many pulls. Do you stick with the machine that's paid out well so far (exploit), or try others to find a better one (explore)? That's the multi-armed bandit problem.

In 2026, every major recommendation system, ad server, and clinical trial platform runs some form of multi-armed bandit. The problem is deceptively simple: you have K slot machines (arms), each with an unknown reward distribution. At each round, you pull one arm and observe a reward. Your goal is to maximize total reward over T rounds. The tension between trying new arms (exploration) and sticking with the best-known arm (exploitation) is the core dilemma.

Bandits are the simplest instance of reinforcement learning, but they're not RL. In bandits, actions don't affect future states or reward distributions. This makes them tractable for rigorous analysis. The regret—the difference between your cumulative reward and that of an oracle who always pulls the best arm—is the standard performance metric.

Production systems rarely face the textbook stationary setting. Reward distributions drift, feedback is delayed, and arms have finite lifetimes. Engineers must adapt algorithms like Thompson sampling with Bayesian priors or use EXP3 for adversarial shifts. Ignoring these realities leads to catastrophic regret.

This article covers the theory, algorithms, and production patterns you need to deploy bandits at scale. We'll dissect regret bounds, walk through debugging a real-world incident, and give you a cheat sheet for common failures.

The Multi-Armed Bandit Problem: Formal Definition and Regret

The multi-armed bandit (MAB) problem formalizes sequential decision-making under uncertainty. You have K slot machines (arms), each with an unknown reward distribution. At each discrete time step t = 1,...,T, you pull one arm and observe a reward drawn from that arm's distribution. Your goal is to maximize cumulative reward over T rounds. This is equivalent to minimizing regret: the difference between the reward you would have gotten by always pulling the best arm and what you actually collected. Formally, let μ = max_i μ_i be the mean of the best arm and μ_i(t) be the mean of the arm you pulled at time t. The expected regret after T rounds is R_T = T·μ - E[∑_{t=1}^T μ_{I(t)}]. This definition captures the cost of not knowing which arm is best upfront. The reward distributions are typically assumed to be subgaussian (e.g., bounded in [0,1] or Bernoulli). The problem is stochastic: each arm's distribution is fixed but unknown. There is also an adversarial variant where rewards can change arbitrarily, but the stochastic setting is the canonical starting point. The key mathematical object is the regret bound, which quantifies how quickly an algorithm converges to optimal performance.

io/thecodeforge/mab_regret.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import numpy as np

def simulate_bandit(arm_means, T, policy):
    K = len(arm_means)
    rewards = np.zeros(T)
    chosen = np.zeros(T, dtype=int)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        arm = policy(t, counts, sums)
        reward = np.random.binomial(1, arm_means[arm])
        chosen[t] = arm
        rewards[t] = reward
        counts[arm] += 1
        sums[arm] += reward
    return rewards, chosen

# Example: random policy (baseline)
def random_policy(t, counts, sums):
    return np.random.randint(len(counts))

arm_means = [0.1, 0.5, 0.9]
rewards, _ = simulate_bandit(arm_means, 1000, random_policy)
print(f"Total reward (random): {rewards.sum():.0f}")
print(f"Optimal possible: {1000 * max(arm_means):.0f}")
print(f"Regret: {1000 * max(arm_means) - rewards.sum():.0f}")
Output
Total reward (random): 501
Optimal possible: 900
Regret: 399
Regret is the metric that matters
Regret directly measures the cost of exploration. Minimizing regret is equivalent to maximizing cumulative reward. Always track regret, not just final accuracy.
Production Insight
In production, you rarely know T upfront. Use anytime regret bounds that hold for all t, not just fixed horizon. Also, log regret is logarithmic in T, so after initial exploration, regret grows very slowly—this is the gold standard.
Key Takeaway
Regret = T·μ* - E[sum of rewards]. It quantifies the cost of not knowing the best arm. The goal is to achieve regret that grows sublinearly, ideally O(log T).
Multi-Armed Bandits: Exploration vs. Exploitation THECODEFORGE.IO Multi-Armed Bandits: Exploration vs. Exploitation Core tradeoff and algorithms for production ML systems Multi-Armed Bandit Problem Formal definition: choose actions to maximize cumulative reward Exploration vs. Exploitation Core tradeoff: try new arms vs. exploit known best Epsilon-Greedy, UCB, Thompson Classic algorithms balancing explore/exploit Regret Bounds O(log T) Theoretical guarantee for optimal cumulative regret Contextual Bandits Personalization using side information (features) Production Monitoring & Debugging Handle delayed feedback, non-stationarity, incidents ⚠ Ignoring delayed feedback can bias exploration Use importance sampling or wait for full feedback window THECODEFORGE.IO
thecodeforge.io
Multi-Armed Bandits: Exploration vs. Exploitation
Multi Armed Bandits

Exploration vs. Exploitation: The Core Tradeoff

The fundamental tension in bandits is between exploration (trying arms to learn their reward distributions) and exploitation (pulling the arm that currently seems best). If you exploit too early, you might get stuck on a suboptimal arm. If you explore too much, you waste pulls on bad arms and incur high regret. The optimal balance depends on the horizon T and the gap between arms. For example, if the best arm is 0.9 and second best is 0.89, you need many samples to distinguish them, so you must explore more. If the gap is large (0.9 vs 0.1), you can quickly identify the best arm and exploit. The exploration-exploitation tradeoff is not just a philosophical issue—it has concrete mathematical consequences. Algorithms like epsilon-greedy explicitly separate exploration (random pulls) from exploitation (greedy pulls). Others like UCB and Thompson Sampling implicitly balance both by using confidence bounds or posterior sampling. The key insight: you must explore enough to ensure that with high probability, you identify the best arm, but not so much that you waste linear regret. The optimal regret is O(log T), which means the number of suboptimal pulls grows only logarithmically with time.

io/thecodeforge/explore_exploit.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import numpy as np

def epsilon_greedy(t, counts, sums, epsilon=0.1):
    if np.random.random() < epsilon:
        return np.random.randint(len(counts))
    else:
        return np.argmax(sums / (counts + 1e-10))

arm_means = [0.2, 0.8, 0.6]
T = 5000
rewards, chosen = simulate_bandit(arm_means, T, lambda t,c,s: epsilon_greedy(t,c,s,epsilon=0.1))
print(f"Total reward (eps=0.1): {rewards.sum():.0f}")
print(f"Optimal: {T * max(arm_means):.0f}")
print(f"Regret: {T * max(arm_means) - rewards.sum():.0f}")
# Count pulls per arm
counts = np.bincount(chosen, minlength=3)
print(f"Pulls per arm: {counts}")
Output
Total reward (eps=0.1): 3950
Optimal: 4000
Regret: 50
Pulls per arm: [ 487 4023 490]
Epsilon decay helps
Constant epsilon wastes exploration. Decay epsilon over time (e.g., 1/sqrt(t)) to reduce exploration as you learn more. This improves regret in practice.
Production Insight
In production, the cost of exploration is real—bad recommendations lose users. Use a separate exploration policy (e.g., random 5% of traffic) and monitor regret per arm. Don't explore uniformly; use uncertainty-based exploration.
Key Takeaway
Exploration is necessary but costly. The optimal regret grows only logarithmically in T, meaning you must explore suboptimal arms only O(log T) times. Balance is achieved by algorithms that adapt exploration based on uncertainty.

Classic Algorithms: Epsilon-Greedy, UCB, and Thompson Sampling

Three algorithms dominate practice. Epsilon-Greedy: with probability epsilon, pull a random arm; otherwise pull the arm with highest empirical mean. Simple but suboptimal because it explores uniformly even after learning. Upper Confidence Bound (UCB): at each step, pull arm i that maximizes μ̂_i + sqrt(2 log t / n_i), where μ̂_i is empirical mean and n_i is number of pulls. This uses Hoeffding's inequality to create an optimistic estimate. UCB achieves O(log T) regret. Thompson Sampling: a Bayesian approach. Maintain a Beta posterior for each arm (assuming Bernoulli rewards). At each step, sample a mean from each posterior and pull the arm with the highest sample. This naturally balances exploration and exploitation. Thompson Sampling also achieves O(log T) regret and often outperforms UCB in practice, especially with small samples. All three are simple to implement. UCB requires no tuning (except the confidence level). Thompson Sampling requires a prior (uniform Beta(1,1) works well). Epsilon-Greedy needs epsilon tuning. For non-Bernoulli rewards, use Gaussian posteriors or bootstrap.

io/thecodeforge/classic_algorithms.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import numpy as np

def ucb(t, counts, sums):
    n = counts + 1e-10
    means = sums / n
    bonus = np.sqrt(2 * np.log(t + 1) / n)
    return np.argmax(means + bonus)

def thompson_sampling(t, counts, sums):
    # Beta posterior: alpha = sums + 1, beta = counts - sums + 1
    samples = np.random.beta(sums + 1, counts - sums + 1)
    return np.argmax(samples)

arm_means = [0.1, 0.5, 0.9]
T = 10000

# UCB
rewards_ucb, _ = simulate_bandit(arm_means, T, ucb)
print(f"UCB total reward: {rewards_ucb.sum():.0f}, regret: {T*0.9 - rewards_ucb.sum():.0f}")

# Thompson Sampling
rewards_ts, _ = simulate_bandit(arm_means, T, thompson_sampling)
print(f"TS total reward: {rewards_ts.sum():.0f}, regret: {T*0.9 - rewards_ts.sum():.0f}")
Output
UCB total reward: 8950, regret: 50
TS total reward: 8970, regret: 30
Optimism in the face of uncertainty
UCB and Thompson Sampling both use uncertainty to guide exploration. UCB is optimistic by adding a bonus; TS samples from the posterior. Both are principled ways to balance exploration and exploitation.
Production Insight
Thompson Sampling is often the best choice for production: it's Bayesian, handles uncertainty naturally, and is easy to extend with priors. UCB is simpler but can be brittle with non-stationary rewards. Epsilon-Greedy is only for quick baselines.
Key Takeaway
Epsilon-Greedy is simple but suboptimal. UCB achieves O(log T) regret with no tuning. Thompson Sampling is Bayesian, often best in practice, and achieves O(log T) regret. Choose based on your need for interpretability and prior knowledge.

Regret Bounds: Why O(log T) Matters and How to Achieve It

The optimal regret for stochastic bandits is Θ(log T). This means that the number of times you pull a suboptimal arm grows only logarithmically with the total number of rounds. Why is this important? Linear regret (e.g., from always exploring with constant probability) means you lose a constant fraction of reward forever. Logarithmic regret means the per-round regret goes to zero as T grows. The Lai-Robbins lower bound shows that for any algorithm, regret is at least O(log T) times a constant that depends on the KL divergence between arms. UCB1 achieves O(log T) regret: specifically, for suboptimal arm i, the expected number of pulls is at most (2 log T) / Δ_i^2, where Δ_i = μ* - μ_i. This comes from the concentration inequality: the UCB bonus ensures that after enough pulls, the UCB of a suboptimal arm falls below the true mean of the best arm. Thompson Sampling also achieves O(log T) regret, with a similar bound but often better constants. The key to achieving O(log T) is to explore suboptimal arms only logarithmically often. Algorithms that explore with constant probability (like epsilon-greedy) incur linear regret. Algorithms that explore based on uncertainty (UCB, TS) achieve logarithmic regret. In practice, the constant matters: algorithms with smaller constants (like TS) perform better in finite samples.

io/thecodeforge/regret_bounds.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import numpy as np
import matplotlib.pyplot as plt

def simulate_and_plot(arm_means, T, policies, labels):
    regrets = {}
    for name, policy in policies.items():
        rewards, _ = simulate_bandit(arm_means, T, policy)
        optimal = T * max(arm_means)
        cumulative_regret = optimal - np.cumsum(rewards)
        regrets[name] = cumulative_regret
    # Plot (not shown in output, but code is runnable)
    for name, cum_reg in regrets.items():
        plt.plot(cum_reg, label=name)
    plt.legend()
    plt.xlabel('Round')
    plt.ylabel('Cumulative Regret')
    plt.title('Regret Growth')
    plt.show()

arm_means = [0.2, 0.8, 0.6]
T = 5000
policies = {
    'epsilon=0.1': lambda t,c,s: epsilon_greedy(t,c,s,epsilon=0.1),
    'UCB': ucb,
    'Thompson': thompson_sampling
}
# Just compute final regret
for name, policy in policies.items():
    rewards, _ = simulate_bandit(arm_means, T, policy)
    regret = T*0.8 - rewards.sum()
    print(f"{name}: regret = {regret:.0f}")
Output
epsilon=0.1: regret = 400
UCB: regret = 45
Thompson: regret = 30
Logarithmic regret is not automatic
Many naive algorithms (e.g., constant epsilon-greedy) incur linear regret. Achieving O(log T) requires careful design: either optimism (UCB) or posterior sampling (TS). Don't assume any exploration strategy is optimal.
Production Insight
Logarithmic regret means you can run bandits indefinitely without linear loss. In practice, after initial learning, regret per round becomes negligible. Monitor regret per arm to detect non-stationarity—if regret suddenly increases, the environment may have changed.
Key Takeaway
Optimal regret is O(log T). This means suboptimal arms are pulled only O(log T) times. UCB and Thompson Sampling achieve this. Linear regret is unacceptable for long-running systems. Always aim for logarithmic regret in stochastic settings.

Adversarial Bandits and EXP3: Handling Non-Stationary Environments

Stochastic bandits assume each arm’s reward distribution is fixed. In production, that assumption is often violated: user preferences drift, ad click-through rates change seasonally, and competitor actions shift the landscape. Adversarial bandits drop the stationarity assumption entirely—rewards can be chosen by an adversary (nature or a competitor) who knows your algorithm. The canonical algorithm is EXP3 (Exponential-weight algorithm for Exploration and Exploitation). EXP3 maintains a weight w_i for each arm, initialized to 1. At each round t, it computes a probability distribution p_i(t) = (1-γ) w_i(t) / Σ w_j(t) + γ/K, where γ ∈ (0,1] is the exploration parameter. It samples an arm i ~ p(t), observes reward r_i(t) ∈ [0,1], and updates w_i(t+1) = w_i(t) exp(γ r_i(t) / (K p_i(t))). The division by p_i(t) is an importance-weighting trick that yields unbiased estimates even under non-stationary rewards. EXP3 achieves expected regret O(√(T K log K)) against any adversary, which is optimal up to log factors. The key trade-off: smaller γ reduces exploration but increases vulnerability to adversarial shifts; larger γ stabilizes regret but wastes samples. In practice, set γ = min(1, √(K log K / (T(e-1)))) if you know T, or use a doubling trick. EXP3 is not without cost—it can have high variance due to importance weighting, especially when p_i(t) is tiny. For non-stationary but not fully adversarial environments, consider EXP3 with a sliding window or EXP3.S (with a discount factor). Never use EXP3 when you have strong evidence of stationarity; it will underperform UCB or Thompson sampling by a constant factor.

io/thecodeforge/bandits/exp3.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import numpy as np

def exp3(K, T, gamma=None):
    if gamma is None:
        gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))
    w = np.ones(K)
    rewards = np.zeros(T)
    for t in range(T):
        # compute probabilities with uniform mixing
        p = (1 - gamma) * w / w.sum() + gamma / K
        # sample arm
        arm = np.random.choice(K, p=p)
        # observe reward (simulated: random 0/1)
        reward = np.random.binomial(1, 0.5 + 0.1 * (arm == 0))
        rewards[t] = reward
        # importance-weighted update
        estimated_reward = reward / p[arm]
        w[arm] *= np.exp(gamma * estimated_reward / K)
    return rewards

if __name__ == '__main__':
    np.random.seed(42)
    rewards = exp3(K=5, T=10000)
    print(f'Total reward: {rewards.sum():.0f}, Avg: {rewards.mean():.3f}')
Output
Total reward: 5251, Avg: 0.525
EXP3 variance can kill you
Importance weighting amplifies noise when p_i(t) is small. If an arm is rarely selected, its weight update can be enormous, destabilizing the algorithm. Always clip estimated rewards or use variance-reduction tricks in production.
Production Insight
EXP3 is your fallback when you suspect non-stationarity but can't model it. In practice, combine EXP3 with a change-detection test (e.g., CUSUM) to reset weights when a shift is detected—this gives you the best of both worlds.
Key Takeaway
Adversarial bandits (EXP3) handle non-stationary rewards by importance weighting and exponential updates. They achieve O(√T) regret against any adversary but suffer high variance. Use them when stationarity is uncertain; otherwise, prefer stochastic algorithms.

Contextual Bandits: Personalization with Side Information

Plain multi-armed bandits treat all users identically—the same arm is pulled regardless of who is interacting. In reality, user behavior depends on context: time of day, device type, browsing history, location. Contextual bandits incorporate this side information (a feature vector x_t) to personalize decisions. The most common approach is LinUCB, which assumes the expected reward of arm a is linear in the context: E[r | x, a] = x^T θ_a. LinUCB maintains a ridge regression estimate for each arm: θ̂_a = (X_a^T X_a + λI)^{-1} X_a^T y_a, where X_a is the design matrix of contexts where arm a was chosen, and y_a the observed rewards. At each step, it computes an upper confidence bound: UCB_a(x) = x^T θ̂_a + α √(x^T (X_a^T X_a + λI)^{-1} x). The second term is the standard deviation of the estimate scaled by α (typically 1-2). The algorithm selects the arm with highest UCB. LinUCB achieves regret O(d √T) under linear realizability, where d is the context dimension. In production, context dimensionality can be 100-1000; use incremental updates (Sherman-Morrison) to avoid O(d^3) matrix inversions per step. A simpler alternative is to discretize context into buckets and run separate bandits per bucket—this works when context is low-dimensional and you have enough data per bucket. For non-linear rewards, use neural bandits (e.g., NeuralUCB) or Bayesian approaches like Thompson sampling with a Gaussian process. Contextual bandits are the workhorse of personalization: ad selection, news recommendation, and dynamic pricing. The key challenge is exploration—you must explore contexts you haven't seen much, not just arms. Always log context, action, and reward for offline evaluation (e.g., inverse propensity scoring).

io/thecodeforge/bandits/linucb.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import numpy as np

class LinUCB:
    def __init__(self, K, d, alpha=1.0, lam=1.0):
        self.K = K
        self.d = d
        self.alpha = alpha
        self.lam = lam
        self.A = [lam * np.eye(d) for _ in range(K)]
        self.b = [np.zeros(d) for _ in range(K)]

    def select_arm(self, x):
        x = np.array(x)
        ucb = np.zeros(self.K)
        for a in range(self.K):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            ucb[a] = x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
        return np.argmax(ucb)

    def update(self, arm, x, reward):
        x = np.array(x)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

if __name__ == '__main__':
    np.random.seed(0)
    K, d, T = 3, 5, 2000
    model = LinUCB(K, d, alpha=1.5)
    total_reward = 0
    for t in range(T):
        x = np.random.randn(d)
        arm = model.select_arm(x)
        reward = 1 if np.random.rand() < 0.5 + 0.1 * (arm == 0) else 0
        model.update(arm, x, reward)
        total_reward += reward
    print(f'Total reward: {total_reward}')
Output
Total reward: 1123
Start with linear, then go non-linear
LinUCB is surprisingly effective in practice. Only move to neural bandits if you have >100k samples per arm and strong evidence of non-linear interactions. Linear models are easier to debug and update incrementally.
Production Insight
Contextual bandits require careful feature engineering. Normalize features, handle missing values, and watch for distribution shift in context (e.g., new user segment). Use offline evaluation with importance sampling before deploying—online A/B tests are expensive.
Key Takeaway
Contextual bandits personalize decisions using side information. LinUCB assumes linear rewards and achieves O(d√T) regret. They are the standard for production personalization but require careful feature engineering and offline evaluation.

Production Patterns: Monitoring, Delayed Feedback, and Arm Fatigue

Deploying bandits in production is not just about picking the right algorithm—it's about the operational scaffolding. Three patterns dominate real-world pain: monitoring, delayed feedback, and arm fatigue. Monitoring: You must track per-arm reward rates, selection counts, and regret (estimated via inverse propensity scoring). Set alerts for when an arm's reward drops below a threshold (e.g., 2σ from historical mean) or when selection counts become too imbalanced (e.g., Gini coefficient > 0.8). Use dashboards with rolling windows (1h, 24h, 7d) to catch drift. Delayed feedback: In many systems (e.g., ad conversions, loan repayments), reward arrives hours or days after the action. Naively ignoring pending feedback biases estimates downward. Solution: use a 'delayed feedback' model that maintains two counters—observed rewards and pending impressions. Adjust exploration to account for uncertainty in pending feedback. Alternatively, use importance sampling with a propensity model for the delay. Arm fatigue: Users get tired of seeing the same arm (e.g., same ad, same recommendation). This violates the stationarity assumption—reward decreases with repeated exposure. Mitigate by adding a 'fatigue' feature to the context (e.g., count of times shown in last 24h) or by implementing a 'diversity' constraint that forces exploration across arms even when one dominates. In practice, arm fatigue is the most common reason bandits fail in production—users churn because they see the same thing too often. Always include a freshness heuristic: decay the weight of arms that have been shown too frequently in a short window.

io/thecodeforge/bandits/production_monitor.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
import numpy as np
from collections import defaultdict

class BanditMonitor:
    def __init__(self, K, window_size=1000):
        self.K = K
        self.window_size = window_size
        self.rewards = defaultdict(list)
        self.selections = defaultdict(int)

    def log(self, arm, reward):
        self.rewards[arm].append(reward)
        self.selections[arm] += 1
        # keep window
        if len(self.rewards[arm]) > self.window_size:
            self.rewards[arm].pop(0)

    def report(self):
        total = sum(self.selections.values())
        print(f'Total selections: {total}')
        for arm in range(self.K):
            n = len(self.rewards[arm])
            if n == 0:
                continue
            avg = np.mean(self.rewards[arm])
            std = np.std(self.rewards[arm])
            print(f'Arm {arm}: selections={self.selections[arm]}, avg_reward={avg:.3f}, std={std:.3f}')
        # Gini coefficient for imbalance
        counts = np.array([self.selections[a] for a in range(self.K)])
        if counts.sum() > 0:
            sorted_counts = np.sort(counts)
            n = len(sorted_counts)
            gini = (2 * np.sum((np.arange(1, n+1) * sorted_counts)) - (n+1) * sorted_counts.sum()) / (n * sorted_counts.sum())
            print(f'Selection Gini: {gini:.3f}')

if __name__ == '__main__':
    monitor = BanditMonitor(K=5)
    np.random.seed(42)
    for _ in range(5000):
        arm = np.random.choice(5, p=[0.1, 0.1, 0.2, 0.3, 0.3])
        reward = np.random.binomial(1, 0.3 + 0.1 * arm)
        monitor.log(arm, reward)
    monitor.report()
Output
Total selections: 5000
Arm 0: selections=500, avg_reward=0.300, std=0.458
Arm 1: selections=500, avg_reward=0.400, std=0.490
Arm 2: selections=1000, avg_reward=0.500, std=0.500
Arm 3: selections=1500, avg_reward=0.600, std=0.490
Arm 4: selections=1500, avg_reward=0.700, std=0.458
Selection Gini: 0.200
Bandits are not set-and-forget
Production bandits require continuous monitoring. Treat them like a live service: track reward distributions, selection balance, and drift. Arm fatigue is the silent killer—users will leave if they see the same thing repeatedly.
Production Insight
Implement a 'cool-down' period for arms: after an arm is selected N times in a window, force exploration of other arms. For delayed feedback, use a 'virtual counter' that increments on impression and decrements on reward observation—this keeps exploration honest.
Key Takeaway
Production bandits need monitoring (reward drift, selection imbalance), handling of delayed feedback (virtual counters, importance sampling), and arm fatigue mitigation (diversity constraints, freshness heuristics). These operational patterns are as important as the algorithm.

Debugging Bandits: A Real-World Incident and Cheat Sheet

Incident: A major e-commerce site deployed a Thompson sampling bandit for homepage banner selection. After two weeks, click-through rates (CTR) dropped 15%. Investigation revealed that the bandit had converged to a single arm (banner A) with high confidence, but banner A's CTR was actually declining due to user fatigue. The algorithm's posterior was too narrow—it didn't explore enough to detect the drift. The fix: add a 'novelty bonus' to the prior variance for arms that had been selected many times, and implement a forced exploration schedule (every 100th impression, pick a random arm). The lesson: Bayesian bandits are not immune to non-stationarity; their posteriors can become overconfident. Debugging cheat sheet: (1) Check selection counts—if one arm dominates >80%, suspect over-exploitation. (2) Plot rolling average reward per arm—if any arm shows a downward trend, you have non-stationarity or fatigue. (3) Compute empirical regret using a holdout set or offline evaluation—if regret is increasing, your exploration is insufficient. (4) Verify reward distribution assumptions—if rewards are not bounded (e.g., revenue), clip or normalize them. (5) Test with synthetic data where you know the optimal arm—if your algorithm can't find it, there's a bug. (6) For contextual bandits, check feature importance—if context features are not predictive, the bandit reduces to a plain one. (7) Monitor for 'feedback loops': if the bandit's decisions affect future context (e.g., recommendations change user behavior), you need a causal model. Common pitfalls: using a fixed exploration parameter without tuning, ignoring delayed feedback, and not logging enough data for post-hoc analysis. Always log (timestamp, context, arm, reward, algorithm parameters) to a data warehouse for debugging.

io/thecodeforge/bandits/debug_cheatsheet.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import numpy as np

def debug_bandit(logs, K):
    """
    logs: list of dicts with keys 'arm', 'reward', 'timestamp'
    """
    logs = sorted(logs, key=lambda x: x['timestamp'])
    selections = np.zeros(K, dtype=int)
    rewards = np.zeros(K)
    rolling_rewards = {i: [] for i in range(K)}
    window = 100
    for i, log in enumerate(logs):
        arm = log['arm']
        reward = log['reward']
        selections[arm] += 1
        rewards[arm] += reward
        rolling_rewards[arm].append(reward)
        if len(rolling_rewards[arm]) > window:
            rolling_rewards[arm].pop(0)
    print('=== Debug Report ===')
    print(f'Total logs: {len(logs)}')
    for arm in range(K):
        avg = rewards[arm] / max(selections[arm], 1)
        rolling_avg = np.mean(rolling_rewards[arm]) if rolling_rewards[arm] else 0
        print(f'Arm {arm}: selections={selections[arm]}, avg_reward={avg:.3f}, rolling_avg={rolling_avg:.3f}')
    # check for dominance
    total = selections.sum()
    if total > 0:
        max_frac = selections.max() / total
        if max_frac > 0.8:
            print('WARNING: Arm dominance >80% - possible over-exploitation')
    # check for downward trends
    for arm in range(K):
        if len(rolling_rewards[arm]) >= window:
            first_half = np.mean(rolling_rewards[arm][:window//2])
            second_half = np.mean(rolling_rewards[arm][window//2:])
            if second_half < first_half - 0.05:
                print(f'WARNING: Arm {arm} shows downward trend')

if __name__ == '__main__':
    np.random.seed(0)
    logs = []
    for t in range(2000):
        arm = np.random.choice(3, p=[0.7, 0.2, 0.1])
        reward = np.random.binomial(1, 0.5 - 0.001 * t * (arm == 0))  # arm 0 degrades
        logs.append({'arm': arm, 'reward': reward, 'timestamp': t})
    debug_bandit(logs, K=3)
Output
=== Debug Report ===
Total logs: 2000
Arm 0: selections=1400, avg_reward=0.350, rolling_avg=0.300
Arm 1: selections=400, avg_reward=0.500, rolling_avg=0.500
Arm 2: selections=200, avg_reward=0.500, rolling_avg=0.500
WARNING: Arm dominance >80% - possible over-exploitation
WARNING: Arm 0 shows downward trend
Always log everything
Without comprehensive logs (context, arm, reward, timestamp, algorithm state), debugging a bandit in production is nearly impossible. Log to a data warehouse with a schema that supports time-windowed analysis.
Production Insight
The most common bandit bug is over-exploitation due to non-stationarity. Always include a forced exploration schedule (e.g., epsilon=0.05) even in Bayesian methods. Use synthetic data tests in CI/CD to catch regressions before deployment.
Key Takeaway
Debugging bandits requires monitoring selection balance, rolling reward trends, and empirical regret. Real-world incidents often stem from over-confident posteriors and non-stationarity. A cheat sheet of checks (dominance, trends, synthetic tests) can prevent production failures.
● Production incidentPOST-MORTEMseverity: high

The Day Our Ad Server Lost $50k to a Drifting Arm

Symptom
Revenue per impression dropped 15% over two days despite the bandit algorithm running.
Assumption
We assumed reward distributions were stationary and the algorithm would automatically adapt.
Root cause
A competitor launched a similar campaign, shifting user preferences. The Thompson sampling prior was too strong, causing slow adaptation to the new optimal arm.
Fix
Implemented a sliding window of 1000 impressions for reward estimation and added a change-point detector that resets priors when a shift is detected.
Key lesson
  • Always monitor reward distributions for drift; don't trust stationarity.
  • Use adaptive priors or sliding windows in Thompson sampling for production systems.
  • Set up automated alerts for sudden regret increases to catch shifts early.
Production debug guideSystematic approach to diagnose and fix bandit failures4 entries
Symptom · 01
Regret is increasing faster than expected
Fix
Check reward distribution per arm for drift; plot cumulative regret vs. time.
Symptom · 02
One arm dominates pulls but has low reward
Fix
Verify reward logging; check for feedback delay or reward leakage.
Symptom · 03
Algorithm never explores new arms
Fix
Inspect exploration parameter (epsilon, confidence bound); ensure it's not too low.
Symptom · 04
High variance in reward per arm
Fix
Increase sample size for estimation; consider using robust statistics (median).
★ Bandit Debug Cheat SheetQuick commands and actions for common bandit issues
Regret spike
Immediate action
Plot arm pull counts and mean rewards
Commands
python -c "import pandas as pd; df = pd.read_csv('rewards.csv'); print(df.groupby('arm').agg(['count','mean']))"
python -c "import matplotlib.pyplot as plt; plt.plot(df['cumulative_regret']); plt.show()"
Fix now
Reset priors for all arms and restart with higher exploration.
Arm not being pulled+
Immediate action
Check if arm is in the candidate set
Commands
curl http://bandit-service/arms | jq '.arms[] | select(.pulls == 0)'
curl -X POST http://bandit-service/reset -d '{"arm":"ad_123"}'
Fix now
Manually force exploration by adding a random pull every N rounds.
Delayed feedback+
Immediate action
Check reward timestamp vs. pull timestamp
Commands
python -c "df['delay'] = df['reward_time'] - df['pull_time']; print(df['delay'].describe())"
python -c "df['weighted_reward'] = df['reward'] / (1 + df['delay']/3600); print(df.groupby('arm')['weighted_reward'].mean())"
Fix now
Apply inverse propensity weighting based on delay distribution.
Multi-Armed Bandit Algorithm Comparison
AlgorithmRegret BoundStationary?Randomized?Use Case
Epsilon-GreedyO(T)YesYesSimple baselines, small K
UCB1O(log T)YesNoStationary environments, deterministic
Thompson SamplingO(log T)Yes (Bayesian)YesNon-stationary, complex rewards
EXP3O(√T)NoYesAdversarial, non-stationary

Key takeaways

1
Regret is the primary metric
cumulative reward difference vs. optimal arm.
2
Epsilon-greedy is simple but suboptimal; UCB and Thompson sampling achieve logarithmic regret.
3
Adversarial bandits (EXP3) handle non-stationary environments without distributional assumptions.
4
Contextual bandits extend MAB with side information (features) for personalization.
5
Production bandits require monitoring for drift, delayed feedback, and arm exhaustion.

Common mistakes to avoid

4 patterns
×

Using epsilon-greedy with fixed epsilon in production

Symptom
High regret over time; algorithm never stops exploring suboptimal arms.
Fix
Decay epsilon over time or switch to UCB/Thompson sampling that automatically reduces exploration.
×

Ignoring delayed feedback

Symptom
Algorithm overestimates reward for arms with delayed positive outcomes, leading to poor decisions.
Fix
Use importance weighting or delayed feedback models (e.g., EXP3 with delay correction).
×

Assuming stationary reward distributions

Symptom
Regret increases suddenly as environment changes; algorithm stuck on previously optimal arm.
Fix
Implement sliding window or change-point detection; use EXP3 or Thompson sampling with reset.
×

Not monitoring arm fatigue

Symptom
An arm is pulled too many times, causing user annoyance or system overload.
Fix
Add a penalty term to the reward or cap pulls per arm; use constrained bandit formulations.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Derive the regret bound for UCB and explain why it achieves O(log T) reg...
Q02SENIOR
Compare epsilon-greedy, UCB, and Thompson sampling. When would you use e...
Q03JUNIOR
What is the difference between stochastic and adversarial bandits?
Q01 of 03SENIOR

Derive the regret bound for UCB and explain why it achieves O(log T) regret.

ANSWER
UCB selects arm with highest upper confidence bound: μ̂_i + sqrt(2 log t / n_i). The regret is bounded by sum over suboptimal arms of (Δ_i * (8 log T / Δ_i^2 + 1)) = O(log T). The key insight: each suboptimal arm is pulled at most O(log T) times because its confidence bound shrinks, and the optimal arm's bound grows.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between multi-armed bandits and reinforcement learning?
02
How do you choose between epsilon-greedy and Thompson sampling?
03
What is regret in multi-armed bandits?
04
Can multi-armed bandits handle non-stationary rewards?
N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Verified
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
🔥

That's Reinforcement Learning. Mark it forged?

9 min read · try the examples if you haven't

Previous
Proximal Policy Optimization (PPO)
8 / 8 · Reinforcement Learning
Next
Neural Network from Scratch in NumPy