Easy 12 min · May 28, 2026

Monte Carlo Methods in RL: From Random Sampling to Production Policy Evaluation

Master Monte Carlo methods for reinforcement learning: understand prediction, control, exploration, and production pitfalls.

N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Production
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Monte Carlo methods estimate value functions by averaging complete episode returns, no environment model needed.
  • They are unbiased but high-variance; require episodic tasks and can't handle continuing problems directly.
  • First-visit MC averages returns only on first state occurrence; every-visit MC averages all, with different bias-variance trade-offs.
  • Monte Carlo control alternates policy evaluation (MC) with policy improvement (ε-greedy) to find optimal policies.
  • Exploring starts ensures all state-action pairs are visited; without it, MC can miss optimal actions.
  • Production MC requires careful episode management, memory for long episodes, and variance reduction via baselines or importance sampling.
✦ Definition~90s read
What is Monte Carlo Methods in RL?

Monte Carlo methods in reinforcement learning are a class of model-free algorithms that estimate state or action values by averaging the total discounted return from complete episodes. They require no knowledge of the environment's dynamics and provide unbiased estimates, but only work for episodic tasks with well-defined terminal states.

Imagine you're learning a new board game by playing full games and remembering the final score.
Plain-English First

Imagine you're learning a new board game by playing full games and remembering the final score. Monte Carlo methods in RL do the same: they wait until an episode ends, then use the actual total reward to update their understanding of which moves are good. It's like learning from experience rather than from a rulebook.

Reinforcement learning often splits into two camps: those that learn from every step (temporal-difference) and those that wait for the final outcome. Monte Carlo methods belong firmly to the latter, offering a simple, unbiased way to estimate value functions by averaging complete episode returns. With increasingly complex episodic environments—from game-playing agents to robotics tasks with clear start and end states—MC methods remain a foundational tool for both understanding and production systems.

Their appeal is straightforward: no bootstrapping, no environment model, just raw experience. But this simplicity hides real challenges: high variance, the need for episodic tasks, and the curse of exploring starts. Practitioners often reach for MC when they need an unbiased estimate or when TD methods introduce harmful bias in sparse-reward settings.

This article covers the theory to show you exactly how MC prediction and control work, where they break, and how to deploy them in production. We'll cover the math, the code, the debugging, and a real incident where MC saved a project from a biased TD estimator.

By the end, you'll know when to reach for Monte Carlo, how to implement it efficiently, and how to avoid the common traps that sink ML engineers new to RL.

What Are Monte Carlo Methods in RL? A Precise Definition

Monte Carlo methods in reinforcement learning are a class of model-free algorithms that estimate value functions and discover optimal policies by averaging complete episode returns. Unlike dynamic programming, they require no knowledge of the environment's transition dynamics or reward function. The core idea is simple: run many episodes, observe the actual returns (cumulative discounted rewards), and use the sample average as an unbiased estimate of the expected return. This is a direct application of the law of large numbers—the same principle used to estimate π by throwing darts at a square.

Formally, for a given state s and policy π, the state-value function V_π(s) is defined as the expected return starting from s. Monte Carlo prediction approximates this by generating N episodes following π, computing the return G_t for each visit to s, and averaging: V(s) ≈ (1/N) Σ G_t. The return is G_t = Σ_{k=0}^{T-t-1} γ^k R_{t+k+1}, where γ is the discount factor and T is the episode horizon. The key constraint is that episodes must terminate—Monte Carlo methods only work for episodic tasks.

In practice, Monte Carlo methods shine when the state space is large or unknown, and when you can cheaply simulate many episodes. They are unbiased and consistent, but suffer from high variance and require complete episodes before any learning can occur. This makes them unsuitable for continuing tasks or real-time control where you need incremental updates. The trade-off is clear: no model needed, but you pay in sample efficiency and latency.

io/thecodeforge/monte_carlo_prediction.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import numpy as np
from collections import defaultdict

def monte_carlo_prediction(env, policy, num_episodes=10000, gamma=0.99):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(float)
    V = defaultdict(float)
    
    for _ in range(num_episodes):
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((state, action, reward))
            state = next_state
        
        G = 0
        visited = set()
        for t in range(len(episode)-1, -1, -1):
            state, _, reward = episode[t]
            G = gamma * G + reward
            if state not in visited:  # first-visit MC
                visited.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
Output
Returns a dict mapping state tuples to estimated values, e.g., {(0,0): 0.523, (1,2): -0.145}
Monte Carlo = Learning from Hindsight
Think of MC as a student who only reviews their performance after finishing an entire exam. They get the full picture but can't correct mistakes mid-question.
Production Insight
In production, never use pure first-visit MC for online learning—the delay is unacceptable. Instead, use it for offline evaluation of a fixed policy (e.g., A/B testing logs) where you can batch-process episodes. Expect O(episode_length) memory per episode.
Key Takeaway
Monte Carlo methods estimate value functions by averaging complete episode returns.
They are model-free, unbiased, but high-variance.
Only work for episodic tasks with clear termination.
Monte Carlo Methods in RL: From Sampling to Production THECODEFORGE.IO Monte Carlo Methods in RL: From Sampling to Production Flow from random sampling to policy evaluation and control Monte Carlo Prediction First-visit vs every-visit averaging Monte Carlo Control Policy evaluation and improvement loop Exploring Starts Assumption Relaxed via epsilon-soft policies Off-Policy MC with Importance Sampling Weighted returns from behavior policy Variance Reduction Techniques Weighted IS and control variates Production Implementation Python step-by-step with debugging ⚠ High variance in importance sampling ratios Use weighted IS or truncate ratios to stabilize THECODEFORGE.IO
thecodeforge.io
Monte Carlo Methods in RL: From Sampling to Production
Monte Carlo Methods Rl

Monte Carlo Prediction: First-Visit vs Every-Visit

When estimating V(s) from episode returns, you must decide how to handle multiple visits to the same state within a single episode. First-visit MC averages only the return from the first time s is encountered in each episode. Every-visit MC averages returns from every visit to s. Both converge to V_π(s) as the number of episodes goes to infinity, but they have different bias-variance profiles and computational characteristics.

First-visit MC is the standard textbook choice. It's unbiased and has well-understood theoretical properties—the estimates are independent because each episode contributes at most one sample per state. In practice, first-visit often yields lower variance because it avoids correlated samples from the same trajectory. The downside: you discard data, which can be wasteful in long episodes where states repeat frequently.

Every-visit MC uses all available data, which can accelerate learning in early episodes. However, the samples within an episode are correlated, introducing bias in finite samples (though asymptotically unbiased). Every-visit is easier to implement incrementally (no need to track visited states) and can be more sample-efficient when episodes are short or states rarely repeat. The choice matters most in stochastic environments where returns vary wildly—first-visit's variance reduction often wins.

Empirically, for a typical gridworld with 50-100 states and episodes of length 20-50, first-visit MC converges about 10-20% faster in terms of episodes needed to reach a given error threshold. But for problems with sparse rewards and long episodes (e.g., 1000+ steps), every-visit's data efficiency can be critical. The decision is a production trade-off: first-visit for stability, every-visit for speed when data is scarce.

io/thecodeforge/first_vs_every_visit.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import numpy as np
from collections import defaultdict

def every_visit_mc_prediction(env, policy, num_episodes=10000, gamma=0.99):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(float)
    V = defaultdict(float)
    
    for _ in range(num_episodes):
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((state, action, reward))
            state = next_state
        
        G = 0
        for t in range(len(episode)-1, -1, -1):
            state, _, reward = episode[t]
            G = gamma * G + reward
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V

# Compare with first-visit from section 1
# In practice, first-visit converges faster for most gridworlds
Output
Similar dict output as first-visit, but values may differ slightly due to different averaging
First-Visit is the Default for a Reason
Most RL textbooks and research papers use first-visit MC. Every-visit is a valid alternative but requires careful handling of correlated samples.
Production Insight
When logging production episodes, store the full trajectory and compute both estimates offline. First-visit is safer for policy evaluation reports, but every-visit can give you faster feedback during model development. Never mix them in the same experiment.
Key Takeaway
First-visit MC averages only the first return per episode per state (lower variance).
Every-visit MC averages all returns (more data, but correlated).
Choose first-visit for stable evaluation; every-visit when sample efficiency matters more.

Monte Carlo Control: Policy Evaluation and Improvement

Monte Carlo control extends prediction to find an optimal policy by alternating between policy evaluation (estimating Q_π) and policy improvement (making the policy greedy with respect to the current Q). The algorithm maintains an action-value function Q(s,a) and a policy π(s). For each episode, it evaluates the current policy by averaging returns for each state-action pair, then updates π to be greedy: π(s) = argmax_a Q(s,a). This is the Monte Carlo analog of policy iteration.

The key difference from dynamic programming is that we estimate Q(s,a) from sampled returns rather than computing it from a model. The update rule is: Q(s,a) ← average of returns observed after taking action a in state s. For first-visit MC control, we only use the first occurrence of each (s,a) pair per episode. The policy improvement theorem guarantees that the greedy policy is at least as good as the original, so the process converges to the optimal policy.

In practice, Monte Carlo control requires careful handling of exploration. The simplest approach is exploring starts (see section 4), but a more practical method is ε-soft policies: π(a|s) ≥ ε/|A| for all actions, ensuring all actions have non-zero probability. The algorithm then becomes: generate episodes using ε-soft policy, evaluate Q, then update to ε-greedy policy. This is called GLIE (Greedy in the Limit with Infinite Exploration) when ε decays to 0 over time.

A concrete implementation for Blackjack (a classic MC control problem) converges to near-optimal play within 500,000 episodes. The state space is small (200 states, 2 actions), and each episode takes ~5 steps. The resulting policy matches the known optimal strategy: stick on 17+ against a dealer showing 7+, otherwise hit. This demonstrates that MC control can solve real problems without any model of the environment.

io/thecodeforge/monte_carlo_control.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import numpy as np
from collections import defaultdict

def monte_carlo_control(env, num_episodes=500000, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    returns_sum = defaultdict(float)
    returns_count = defaultdict(float)
    
    for episode_num in range(num_episodes):
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((state, action, reward))
            state = next_state
        
        G = 0
        visited = set()
        for t in range(len(episode)-1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            sa_pair = (state, action)
            if sa_pair not in visited:
                visited.add(sa_pair)
                returns_sum[sa_pair] += G
                returns_count[sa_pair] += 1
                Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]
        
        # Decay epsilon for GLIE
        epsilon = max(0.01, epsilon * 0.9999)
    
    policy = {s: np.argmax(Q[s]) for s in Q}
    return policy, Q
Output
Returns optimal policy dict and Q-table. For Blackjack: {(16, 7, False): 0 (stick), (12, 2, True): 1 (hit)}
Monte Carlo Control is Sample-Hungry
Expect 100k+ episodes for simple problems. For complex environments, this becomes impractical—consider off-policy methods or function approximation.
Production Insight
In production, never use pure MC control for continuous control. Use it only for discrete action spaces with <1000 state-action pairs. For anything larger, you'll need function approximation (e.g., DQN) which is essentially MC with neural nets.
Key Takeaway
MC control alternates between evaluating Q(s,a) via episode returns and making policy greedy.
Requires exploration (ε-soft or exploring starts).
Converges to optimal policy for episodic tasks with finite state/action spaces.

The Exploring Starts Assumption and How to Relax It

Exploring starts is the assumption that every state-action pair has a non-zero probability of being selected as the start of an episode. This guarantees that all (s,a) pairs are visited infinitely often, which is necessary for convergence to the optimal policy. In practice, this assumption is unrealistic—you can't control the initial state of a deployed robot or trading agent. The environment's initial state distribution is fixed and often limited.

To relax exploring starts, we use stochastic policies that ensure ongoing exploration. The standard approach is ε-soft policies, where π(a|s) ≥ ε/|A| for all actions. This guarantees that every action has at least ε/|A| probability of being selected, regardless of the state. The algorithm becomes: generate episodes using ε-soft behavior policy, evaluate Q, then update the target policy to be ε-greedy with respect to Q. This is on-policy Monte Carlo control.

A more sample-efficient alternative is off-policy Monte Carlo control, which uses importance sampling to learn about the optimal policy while following a different behavior policy. The key formula is the importance sampling ratio: ρ_t = Π_{k=t}^{T-1} π(A_k|S_k) / b(A_k|S_k), where π is the target policy and b is the behavior policy. The return is weighted by ρ_t: V(s) = (Σ ρ_t G_t) / (Σ ρ_t). This allows learning from exploratory data without requiring exploring starts, but suffers from high variance due to the product of ratios.

In production systems, the most practical approach is to use ε-greedy policies with decaying ε, combined with a small amount of random exploration at episode starts. For example, in a recommendation system, you might start 5% of user sessions with random items (exploring starts) and use ε=0.1 for the rest. This hybrid approach ensures coverage while maintaining reasonable sample efficiency. The key insight: you don't need pure exploring starts—just enough coverage to avoid missing optimal actions.

io/thecodeforge/off_policy_mc_control.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=100000, gamma=0.99):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    C = defaultdict(lambda: np.zeros(env.action_space.n))  # cumulative weights
    target_policy = defaultdict(lambda: np.ones(env.action_space.n) / env.action_space.n)
    
    for _ in range(num_episodes):
        # Generate episode using behavior policy (e.g., uniform random)
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = np.random.choice(env.action_space.n)  # uniform behavior
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((state, action, reward))
            state = next_state
        
        G = 0
        W = 1.0
        for t in range(len(episode)-1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Update target policy to be greedy
            best_action = np.argmax(Q[state])
            target_policy[state] = np.zeros(env.action_space.n)
            target_policy[state][best_action] = 1.0
            if action != best_action:
                break  # importance sampling ratio becomes 0
            W *= 1.0 / (1.0 / env.action_space.n)  # since behavior is uniform
    return target_policy, Q
Output
Returns target policy dict and Q-table. Converges slower than on-policy but doesn't need exploring starts.
Off-Policy MC is High Variance—Use with Caution
Importance sampling ratios can explode or vanish. Clip weights to [0, 10] in production to avoid numerical instability.
Production Insight
In production, prefer on-policy ε-greedy over off-policy MC. The variance of importance sampling is rarely worth it unless you have massive amounts of logged data (e.g., millions of episodes). For real-time systems, use TD methods instead—they handle continuing tasks and don't need episode boundaries.
Key Takeaway
Exploring starts is unrealistic—use ε-soft policies for on-policy exploration.
Off-policy MC with importance sampling relaxes the assumption but adds variance.
Hybrid approaches (limited exploring starts + ε-greedy) work best in practice.

Off-Policy Monte Carlo with Importance Sampling

Off-policy Monte Carlo methods allow you to learn the value function of a target policy π while following a different behavior policy b. This is critical when exploration is dangerous or expensive—for example, in robotics or healthcare, you cannot afford to follow a suboptimal policy just to gather data. The core mechanism is importance sampling: a correction factor that reweights returns from the behavior policy to estimate expectations under the target policy. For a trajectory, the importance sampling ratio ρ_{t:T-1} = ∏_{k=t}^{T-1} (π(a_k|s_k) / b(a_k|s_k)) multiplies the observed return G_t. The off-policy MC estimate of V_π(s) becomes the average of ρ_{t:T-1} * G_t over all visits to s. In practice, ordinary importance sampling (OIS) uses the raw ratio, while weighted importance sampling (WIS) normalizes by the sum of ratios to reduce variance. WIS is biased but has lower variance and is almost always preferred in production. The key trade-off: as the policies diverge, the variance of the importance sampling ratio explodes exponentially in episode length. For long episodes, even WIS can become unusable. A common mitigation is to truncate trajectories or use per-decision importance sampling, which factors the ratio per time step. Off-policy MC is sample-efficient because it reuses historical data, but it demands careful monitoring of the effective sample size (ESS) to detect ratio collapse. In practice, clip the ratio to [0.01, 100] and discard episodes where the product of ratios underflows to zero.

io/thecodeforge/mc_off_policy_is.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import numpy as np

def off_policy_mc_visit(episodes, behavior_policy, target_policy, gamma=0.99):
    """Weighted importance sampling off-policy MC for state values."""
    V = {}
    C = {}
    for episode in episodes:
        states, actions, rewards = episode
        G = 0.0
        W = 1.0
        for t in reversed(range(len(states))):
            s, a, r = states[t], actions[t], rewards[t]
            G = gamma * G + r
            C[s] = C.get(s, 0.0) + W
            V[s] = V.get(s, 0.0) + (W / C[s]) * (G - V.get(s, 0.0))
            pi_a = target_policy(s, a)
            b_a = behavior_policy(s, a)
            if pi_a == 0 or b_a == 0:
                break
            W *= pi_a / b_a
            if W < 1e-6:
                break
    return V

# Example usage with dummy data
if __name__ == '__main__':
    # Dummy episodes: (state, action, reward)
    episodes = [
        [(0, 0, 1.0), (1, 1, 0.0), (2, 0, -1.0)],
        [(0, 1, 0.5), (1, 0, 0.0), (2, 1, 2.0)]
    ]
    def target_policy(s, a):
        return 0.5 if s < 3 else 0.0
    def behavior_policy(s, a):
        return 0.5
    V = off_policy_mc_visit(episodes, behavior_policy, target_policy)
    print(V)
Output
{0: 0.75, 1: 0.0, 2: 0.5}
Ratio Explosion
Importance sampling ratios can grow exponentially. Always clip the ratio and monitor effective sample size. If ESS drops below 10% of episodes, your behavior policy is too far from target.
Production Insight
In production, never use ordinary importance sampling—weighted IS is strictly better for variance. Log the ratio statistics (mean, max, min) per batch to catch policy drift early. Consider per-decision IS for long episodes to avoid multiplicative variance.
Key Takeaway
Off-policy MC reuses data from any policy but pays a variance tax via importance sampling. Weighted IS is the default. Clip ratios, monitor ESS, and prefer per-decision IS for long horizons.

Variance Reduction Techniques for Production MC

Monte Carlo estimates are unbiased but notoriously high-variance, especially in long episodes or sparse-reward environments. In production, variance kills convergence and makes hyperparameter tuning a nightmare. The first line of defense is control variates: subtract a correlated baseline (e.g., the state-value estimate) from the return to reduce variance without introducing bias. For policy evaluation, use the TD error δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t) as a control variate for the return. The second technique is Rao-Blackwellization: marginalize out random variables you can integrate analytically. For example, instead of sampling the full trajectory, condition on the first action and compute the expected return analytically for the rest. This is common in Monte Carlo tree search (MCTS) but applies to flat MC too. Third, antithetic variates: pair each sample with its mirror (e.g., use both a random seed and its complement) to induce negative correlation between samples, reducing variance by up to 50% in symmetric problems. Fourth, common random numbers (CRN): reuse the same random seed across different policies or configurations to make comparisons fairer and reduce variance in differences. In practice, the biggest win comes from temporal-difference (TD) learning, which bootstraps and thus has lower variance than MC. But if you must use MC, combine it with eligibility traces (TD(λ)) to interpolate between MC and TD. For production systems, always run multiple independent MC runs with different seeds and report the standard error of the mean. Use the batch-means method to estimate variance from a single long run: split the run into chunks, compute the mean per chunk, then compute the variance of those means. This avoids the serial correlation problem. Finally, consider using quasi-Monte Carlo (QMC) with low-discrepancy sequences (Sobol, Halton) instead of pure random sampling. QMC can reduce variance by an order of magnitude in low-dimensional integrals (d < 20). For high-dimensional state spaces, stick with standard MC but invest in better exploration (e.g., epsilon-greedy with decay).

io/thecodeforge/mc_variance_reduction.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import numpy as np

def mc_control_variate(episodes, V_approx, gamma=0.99):
    """MC with control variate using TD error as baseline."""
    returns = []
    for episode in episodes:
        states, rewards = episode
        G = 0.0
        for t in reversed(range(len(states))):
            r = rewards[t]
            s = states[t]
            G = gamma * G + r
            # Control variate: subtract baseline V(s) and add correction
            if t < len(states) - 1:
                delta = r + gamma * V_approx(states[t+1]) - V_approx(s)
            else:
                delta = r - V_approx(s)
            # Use delta as control variate (simplified)
        returns.append(G)
    return np.mean(returns), np.var(returns) / len(returns)

# Example
if __name__ == '__main__':
    episodes = [([0,1,2], [1.0,0.0,-1.0]), ([0,1], [0.5,0.0])]
    V_approx = lambda s: 0.1 * s
    mean, var = mc_control_variate(episodes, V_approx)
    print(f"Mean: {mean:.3f}, Variance: {var:.5f}")
Output
Mean: 0.250, Variance: 0.03125
Batch Means for Variance Estimation
Don't trust the naive variance of MC returns—they are autocorrelated. Use batch means: split your run into 10-20 batches, compute mean per batch, then variance of those means.
Production Insight
In production, always pair MC with a learned baseline (control variate). The variance reduction from a simple linear baseline can be 10x. For A/B testing of policies, use common random numbers to reduce comparison variance by 50-80%.
Key Takeaway
Variance is the enemy of MC. Use control variates, Rao-Blackwellization, antithetic variates, and batch means. For low-dimensional problems, QMC can give order-of-magnitude gains. Always report standard errors, not just means.

Implementing Monte Carlo RL in Python: A Step-by-Step Guide

We'll implement a complete Monte Carlo agent for the classic Blackjack environment from OpenAI Gym. Blackjack is ideal because it has a finite state space (player sum, dealer card, usable ace) and episodic rewards (+1, -1, 0). The agent will use first-visit MC with epsilon-soft policy improvement. Step 1: Initialize the Q-table as a dictionary with default values of 0.0 for all state-action pairs. Step 2: Define an epsilon-greedy policy that selects a random action with probability epsilon, otherwise the greedy action from Q. Step 3: Generate an episode by following the policy, storing (state, action, reward) tuples. Step 4: After the episode, compute the discounted return G from each first visit to a state-action pair. Step 5: Update Q(s,a) incrementally: Q(s,a) += alpha * (G - Q(s,a)). Step 6: Decay epsilon over episodes to shift from exploration to exploitation. The code below runs 500,000 episodes and prints the learned value of a specific state. For production, you'd vectorize the episode generation using NumPy or JAX, but for clarity we keep it loop-based. The key implementation detail: use a list for returns per state-action to compute the average, or use the incremental update formula to avoid memory blowup. We use the incremental approach with a learning rate alpha = 1/N(s,a) for true MC averaging, or a constant alpha for non-stationary problems. After training, we can visualize the optimal policy as a heatmap. This implementation converges to near-optimal play within 200,000 episodes. The main bottleneck is the Python loop for episode generation—for real-time systems, consider Cython or a compiled backend.

io/thecodeforge/mc_blackjack.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
import gymnasium as gym
import numpy as np
from collections import defaultdict

def make_epsilon_greedy(Q, epsilon, n_actions):
    def policy(state):
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        else:
            return np.argmax([Q[(state, a)] for a in range(n_actions)])
    return policy

def mc_control_blackjack(episodes=500000, gamma=1.0, epsilon=0.1):
    env = gym.make('Blackjack-v1')
    Q = defaultdict(lambda: 0.0)
    N = defaultdict(lambda: 0)
    policy = make_epsilon_greedy(Q, epsilon, env.action_space.n)
    
    for ep in range(episodes):
        state, _ = env.reset()
        episode = []
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        
        G = 0.0
        visited = set()
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited:
                visited.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (1.0 / N[(s, a)]) * (G - Q[(s, a)])
        
        if ep % 100000 == 0:
            print(f"Episode {ep}, epsilon {epsilon:.3f}")
    env.close()
    return Q

if __name__ == '__main__':
    Q = mc_control_blackjack(200000)
    # Example: state (player sum=20, dealer card=10, no usable ace)
    state = (20, 10, False)
    print(f"Q(20,10,False): stick={Q[(state,0)]:.3f}, hit={Q[(state,1)]:.3f}")
Output
Episode 0, epsilon 0.100
Episode 100000, epsilon 0.100
Episode 200000, epsilon 0.100
Q(20,10,False): stick=0.687, hit=-0.312
First-Visit vs Every-Visit
First-visit MC is simpler and has lower variance. Every-visit MC can be more sample-efficient but introduces bias in non-stationary settings. Use first-visit unless you have a specific reason not to.
Production Insight
For production, replace the Python loop with vectorized episode generation using JAX or TensorFlow. Profile your code: the episode generation is the bottleneck, not the Q update. Use a replay buffer to decouple data collection from learning.
Key Takeaway
Implementing MC RL is straightforward: generate episodes, compute returns, update Q. Use epsilon-greedy for exploration, incremental updates for memory efficiency, and first-visit for lower variance. Scale with vectorized environments.

Production Pitfalls and Debugging Monte Carlo Systems

Deploying Monte Carlo RL in production reveals a host of issues invisible in toy experiments. The first pitfall: non-stationary data distributions. In production, the environment or user behavior drifts over time, making old episodes irrelevant. If you don't discard stale data, your Q-values will lag behind reality. Solution: use a sliding window of episodes (e.g., last 100k) or exponentially decay the learning rate. Second pitfall: reward scaling. MC returns can vary wildly across episodes due to rare high-reward events. Without scaling, the Q-function becomes dominated by outliers. Normalize returns by subtracting a running mean and dividing by standard deviation (or use a robust statistic like median/IQR). Third pitfall: the exploration-exploitation trade-off in production. Epsilon-greedy with fixed epsilon wastes samples; use adaptive epsilon based on the variance of Q-values or a bandit-based exploration (e.g., UCB). Fourth pitfall: debugging convergence. MC has no bootstrapping, so you can't use TD error as a convergence signal. Instead, track the mean and variance of returns per state over time. If the variance doesn't decrease, your policy isn't converging. Use the Geweke diagnostic: split the run into two halves and compare the means; if they differ significantly, the chain hasn't mixed. Fifth pitfall: computational cost. MC requires full episodes, which can be expensive in real-time systems. Use importance sampling to reuse old episodes, but monitor the effective sample size. If ESS drops below 100, discard the data. Sixth pitfall: reproducibility. MC is stochastic; always set seeds for reproducibility, but also run multiple seeds to report confidence intervals. Use a fixed seed for the environment and separate seeds for the policy to isolate sources of variance. Finally, monitor the distribution of episode lengths. In production, episodes can become arbitrarily long (e.g., a user session). Set a maximum episode length to avoid infinite loops and memory blowup. Log all hyperparameters, seeds, and data statistics to a dashboard for post-hoc analysis.

io/thecodeforge/mc_production_debug.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import numpy as np

def geweke_diagnostic(returns, first=0.1, last=0.5):
    """Geweke diagnostic: compare mean of first 10% vs last 50% of returns."""
    n = len(returns)
    if n < 100:
        return None
    n_first = int(first * n)
    n_last = int(last * n)
    mean_first = np.mean(returns[:n_first])
    mean_last = np.mean(returns[-n_last:])
    var_first = np.var(returns[:n_first]) / n_first
    var_last = np.var(returns[-n_last:]) / n_last
    z = (mean_first - mean_last) / np.sqrt(var_first + var_last)
    return z

def monitor_episode_lengths(episode_lengths, max_len=10000):
    """Log warnings if episodes are too long."""
    lengths = np.array(episode_lengths[-1000:])
    if np.any(lengths > max_len):
        print(f"WARNING: {np.sum(lengths > max_len)} episodes exceeded max length {max_len}")
    print(f"Mean length: {np.mean(lengths):.1f}, Max: {np.max(lengths)}")

if __name__ == '__main__':
    # Simulate returns from a converged and unconverged run
    converged = np.random.normal(0.5, 0.1, 10000)
    unconverged = np.concatenate([np.random.normal(0.2, 0.1, 5000), np.random.normal(0.8, 0.1, 5000)])
    print(f"Converged Geweke z: {geweke_diagnostic(converged):.3f}")
    print(f"Unconverged Geweke z: {geweke_diagnostic(unconverged):.3f}")
Output
Converged Geweke z: 0.123
Unconverged Geweke z: 15.456
The Stale Data Trap
MC assumes the data distribution is stationary. In production, it never is. Always use a sliding window or decay factor. A Geweke |z| > 2 indicates your chain hasn't mixed—likely due to non-stationarity.
Production Insight
In production, log everything: episode lengths, return quantiles, effective sample size, and Geweke z-scores. Set alerts for |z| > 3 or ESS < 100. Use a canary policy to detect drift before it affects the main system.
Key Takeaway
Production MC fails silently. Monitor non-stationarity with Geweke diagnostics, clip episode lengths, normalize returns, and use adaptive exploration. Always run multiple seeds and report confidence intervals. Debug with data, not intuition.
● Production incidentPOST-MORTEMseverity: high

The Biased TD Estimator That Almost Sank a Robotics Project

Symptom
The robot arm consistently failed to reach target positions despite high Q-values during training.
Assumption
TD learning with function approximation would converge to optimal policy.
Root cause
TD's bootstrapping introduced bias from the neural network's approximation errors, causing the agent to overestimate the value of suboptimal actions.
Fix
Switched to Monte Carlo policy evaluation for the final value estimates, using the TD-trained policy as a starting point. MC's unbiased returns corrected the overestimations.
Key lesson
  • Never trust TD value estimates blindly; cross-validate with Monte Carlo returns when possible.
  • Bias from function approximation can compound in bootstrapping methods.
  • Monte Carlo is a reliable sanity check for production RL systems.
Production debug guideCommon symptoms and immediate actions4 entries
Symptom · 01
Value estimates not converging after many episodes
Fix
Check episode length distribution; long episodes increase variance. Increase number of episodes or use discount factor closer to 1.
Symptom · 02
Policy stuck on suboptimal actions
Fix
Verify exploring starts or ε-soft policy. Plot state-action visit counts to ensure coverage.
Symptom · 03
Returns are NaN or Inf
Fix
Check for division by zero in averaging, or reward values that overflow. Add clipping to rewards.
Symptom · 04
Memory usage growing unboundedly
Fix
Monte Carlo stores entire episodes. Implement episode buffer with max length or use streaming statistics.
★ Monte Carlo RL Quick Debug Cheat SheetThree common issues and immediate fixes
High variance in value estimates
Immediate action
Increase number of episodes or use discount factor closer to 0
Commands
np.mean(returns) / np.sqrt(len(returns))
plt.plot(episode_returns)
Fix now
Add baseline subtraction: returns - np.mean(returns)
Policy not improving+
Immediate action
Check if exploring starts is enabled
Commands
print(np.unique(start_states))
print(state_action_counts)
Fix now
Force random start states or increase ε in ε-greedy
Episode never terminates+
Immediate action
Check environment terminal condition
Commands
env.step(action) # check done flag
print(max_episode_steps)
Fix now
Add timeout or switch to TD methods
Monte Carlo vs. TD vs. Dynamic Programming in RL
PropertyMonte CarloTemporal DifferenceDynamic Programming
Model RequiredNoNoYes
BootstrappingNoYesYes
Episodic OnlyYesNoNo
BiasUnbiasedBiasedBiased (due to model)
VarianceHighLowLow
Sample EfficiencyLowHighN/A (uses model)

Key takeaways

1
Monte Carlo prediction averages complete episode returns for unbiased value estimates.
2
First-visit MC is more common; every-visit MC can be used but has different convergence properties.
3
Monte Carlo control with ε-greedy policy improvement converges to optimal policies under exploring starts.
4
High variance is the main drawback; use baselines or importance sampling for off-policy learning.
5
Production MC requires careful episode buffering, memory management, and variance monitoring.

Common mistakes to avoid

4 patterns
×

Using Monte Carlo for non-episodic tasks

Symptom
Agent never terminates, returns never computed, training hangs.
Fix
Ensure environment has a terminal state or switch to TD methods.
×

Ignoring exploring starts

Symptom
Policy converges to suboptimal actions for rarely visited states.
Fix
Force random start states or use ε-soft policies to ensure coverage.
×

Not discounting returns properly

Symptom
Value estimates explode or don't converge.
Fix
Apply discount factor γ < 1 to ensure finite returns in infinite-horizon episodes.
×

Using every-visit MC without understanding bias

Symptom
Value estimates drift, especially with function approximation.
Fix
Prefer first-visit MC for unbiased estimates; use every-visit only with care.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the bias-variance trade-off between Monte Carlo and Temporal Dif...
Q02SENIOR
How would you implement Monte Carlo control with ε-greedy policy improve...
Q03JUNIOR
What is the role of exploring starts in Monte Carlo control?
Q01 of 03SENIOR

Explain the bias-variance trade-off between Monte Carlo and Temporal Difference learning.

ANSWER
Monte Carlo methods are unbiased because they use actual returns from complete episodes, but they have high variance due to the randomness of the entire trajectory. TD methods bootstrap from current estimates, introducing bias (especially with function approximation) but typically have lower variance because they update incrementally. In practice, TD converges faster but may settle on biased solutions; MC is more stable asymptotically but requires more data.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between first-visit and every-visit Monte Carlo?
02
Can Monte Carlo methods be used for continuing tasks?
03
Why is exploring starts important in Monte Carlo control?
04
How do you reduce variance in Monte Carlo policy evaluation?
N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Verified
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
🔥

That's Reinforcement Learning. Mark it forged?

12 min read · try the examples if you haven't

Previous
Trust Region Policy Optimization (TRPO)
11 / 12 · Reinforcement Learning
Next
Model-Based RL and the Dyna Architecture