Intermediate 9 min · May 28, 2026

Markov Decision Processes (MDPs)

Markov Decision Processes: The Mathematical Engine Behind Reinforcement Learning

Q: What is the difference between an MDP and a Markov chain?

A Markov chain has only states and transition probabilities, with no actions or rewards. An MDP extends this by adding actions that influence transitions and rewards, enabling decision-making.

Q: What is the Bellman equation?

The Bellman equation expresses the value of a state as the immediate reward plus the discounted expected value of the next state, given a policy. It is the foundation for dynamic programming algorithms like value iteration.

Q: How do you handle continuous state spaces in MDPs?

Continuous state spaces require function approximation, such as linear models or neural networks, to represent value functions or policies. Discretization is also possible but suffers from the curse of dimensionality.

Q: What is the role of the discount factor gamma?

The discount factor gamma (0 <= gamma <= 1) balances immediate versus future rewards. A gamma close to 1 makes the agent far-sighted, while a lower gamma prioritizes short-term gains.

Master MDPs: the 4-tuple (S, A, P, R), optimal policy, Bellman equations, and production-grade debugging.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

MDPs model sequential decision-making under uncertainty with states, actions, and rewards.
The Markov property ensures future states depend only on the present state and action.
A policy maps states to actions to maximize cumulative discounted reward.
Value iteration and policy iteration are classic solution methods.
MDPs are the foundation of reinforcement learning algorithms.
Real-world applications include robotics, finance, and healthcare.

✦ Definition~90s read

What is Markov Decision Processes (MDPs)?

A Markov Decision Process (MDP) is a 4-tuple (S, A, P_a, R_a) where S is a set of states, A is a set of actions, P_a(s, s') is the probability of transitioning from state s to s' under action a, and R_a(s, s') is the immediate reward after that transition. The Markov property ensures that the transition probabilities depend only on the current state and action, not on the history.

★

Think of an MDP as a board game where each square is a state, your moves are actions, and you earn points (rewards) based on where you land.

Plain-English First

Think of an MDP as a board game where each square is a state, your moves are actions, and you earn points (rewards) based on where you land. The game is Markovian because your next square depends only on your current square and move, not on how you got there. The goal is to find a strategy (policy) that maximizes your total score over the long run.

You can't reason about sequential decisions under uncertainty without Markov Decision Processes. Every serious ML engineer who touches reinforcement learning—whether for autonomous vehicles, recommendation engines, or trading bots—needs this framework. MDPs aren't just academic formalism; they're the mathematical engine driving those production systems.

The setup is clean: an agent chooses actions, the environment transitions stochastically to new states, and the agent collects rewards. The objective is a policy that maximizes expected cumulative reward over time. That's the core loop, and it maps directly onto real-world problems.

This article covers the 4-tuple (states, actions, transition probabilities, rewards), the optimization objective, and the Bellman equations. But we also hit production concerns: debugging reward shaping, handling continuous state spaces, and avoiding common pitfalls like reward hacking.

Whether you're implementing a Q-learning agent from scratch or debugging a deep RL pipeline, a solid grasp of MDPs is essential. This guide gives you the theory and the practical wisdom to use them effectively in real-world systems.

What is a Markov Decision Process? The 4-Tuple Definition

A Markov Decision Process is the canonical formal model for sequential decision-making under uncertainty. It is defined by a 4-tuple (S, A, P, R). S is the state space — the set of all possible configurations of the environment. A is the action space — the set of decisions the agent can make. P is the transition probability function: P(s' | s, a) gives the probability of landing in state s' after taking action a in state s. R is the reward function: R(s, a, s') is the immediate scalar reward received after that transition. In practice, S and A can be discrete (e.g., grid world with 100 states, 4 actions) or continuous (e.g., joint angles of a robot arm). The transition function is a probability distribution over next states, capturing the inherent stochasticity of the environment. The reward function encodes the goal — the agent's objective is to maximize cumulative reward over time. This 4-tuple is the complete specification of the environment dynamics; everything else is derived from it. For discrete finite MDPs, P is a |S| x |A| x |S| tensor, and R is a |S| x |A| x |S| array. In reinforcement learning, the agent does not know P and R explicitly and must learn them through interaction.

io/thecodeforge/mdp_core.pyPYTHON

import numpy as np

class FiniteMDP:
    def __init__(self, S, A, P, R, gamma=0.99):
        self.S = S          # list of state ids
        self.A = A          # list of action ids
        self.P = P          # dict: (s, a) -> list of (prob, s')
        self.R = R          # dict: (s, a, s') -> float
        self.gamma = gamma
        self.nS = len(S)
        self.nA = len(A)

    def step(self, s, a):
        transitions = self.P[(s, a)]
        probs, next_states = zip(*transitions)
        s_prime = np.random.choice(next_states, p=probs)
        reward = self.R[(s, a, s_prime)]
        return s_prime, reward

# Example: 2-state, 2-action MDP
S = [0, 1]
A = [0, 1]
P = {
    (0,0): [(0.8, 0), (0.2, 1)],
    (0,1): [(0.1, 0), (0.9, 1)],
    (1,0): [(0.5, 0), (0.5, 1)],
    (1,1): [(0.0, 0), (1.0, 1)]
}
R = {
    (0,0,0): 1.0, (0,0,1): 0.0,
    (0,1,0): 0.0, (0,1,1): 2.0,
    (1,0,0): 0.5, (1,0,1): 0.5,
    (1,1,0): 0.0, (1,1,1): 10.0
}
mdp = FiniteMDP(S, A, P, R)
s = 0
a = 1
s_prime, r = mdp.step(s, a)
print(f"From state {s}, action {a}: -> state {s_prime}, reward {r}")

Output

From state 0, action 1: -> state 1, reward 2.0

🔥MDPs are the lingua franca of sequential decision making

Every RL problem can be cast as an MDP. If you can define S, A, P, R, you have a well-posed problem.

📊 Production Insight

In production, you almost never have the true P and R. You sample transitions from logs or simulators. The 4-tuple is a design contract: define your state space carefully — too coarse loses information, too fine makes learning intractable.

🎯 Key Takeaway

MDP = (S, A, P, R). The transition function P captures stochastic dynamics. The reward function R defines the goal. This is the foundation for all RL algorithms.

thecodeforge.io

Markov Decision Processes

The Markov Property: Why History Doesn't Matter

The Markov property is the assumption that the future is conditionally independent of the past given the present. Formally: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t). This means the current state is a sufficient statistic for the entire history. In practice, this is a strong assumption. A robot's position and velocity might be Markovian, but if you only observe position, the system is not Markovian — you need velocity too. Violating the Markov property leads to state aliasing: two different histories map to the same state but require different optimal actions. This is the core motivation for using recurrent neural networks or frame stacking in deep RL: they approximate a Markovian state from partial observations. In tabular MDPs, the Markov property is assumed by definition. In real-world systems, you must engineer your state representation to be as Markovian as possible. The payoff is enormous: with the Markov property, dynamic programming and temporal difference learning converge to optimal solutions. Without it, you are guessing.

io/thecodeforge/markov_check.pyPYTHON

import numpy as np

# Simulate a non-Markovian process: state = position only, velocity hidden
np.random.seed(42)
positions = [0.0]
velocities = [1.0]
for _ in range(10):
    v = velocities[-1] + np.random.normal(0, 0.1)
    p = positions[-1] + v
    positions.append(p)
    velocities.append(v)

# Check Markov property: P(s_{t+1} | s_t) should be stationary
# But here it depends on hidden velocity
print("Positions:", positions[:5])
print("Velocities:", velocities[:5])
# If we only observe position, the transition is not Markovian
# because the same position can lead to different next positions
# depending on the hidden velocity.

Output

Positions: [0.0, 1.0, 2.001, 3.002, 4.003]

Velocities: [1.0, 1.0, 1.001, 1.001, 1.001]

Mental Model

State is a compression of history

A good state representation makes the process Markovian. If your agent fails, check if the state is truly Markovian first.

📊 Production Insight

In production RL, the Markov property is almost always violated. Mitigate by: (1) including velocity/acceleration in state, (2) using frame stacking (last N observations), (3) using RNNs or transformers to infer hidden state. Never assume Markov — test it by checking if a model using only current state predicts next state as well as one using history.

🎯 Key Takeaway

Markov property: future depends only on present, not past. It enables efficient learning. Violations cause state aliasing. Engineer your state to be Markovian.

Policies and the Optimization Objective: Discounted vs. Finite-Horizon

A policy π maps states to actions. It can be deterministic (π(s) = a) or stochastic (π(a|s) = probability of taking a in s). The goal is to find a policy that maximizes expected cumulative reward. There are two common formulations. Discounted infinite-horizon: maximize E[Σ_{t=0}^∞ γ^t R_t], where γ ∈ [0,1) ensures convergence. γ close to 1 means far-sighted; γ close to 0 means myopic. Finite-horizon: maximize E[Σ_{t=0}^{H-1} R_t], where H is the horizon. This is used when episodes have a fixed length (e.g., a game level lasts 60 seconds). The discount factor γ also serves as a regularizer — it reduces the effective horizon to 1/(1-γ). For γ=0.99, the effective horizon is about 100 steps. For γ=0.9, it's 10 steps. In practice, choose γ based on the task: for long-horizon tasks like chess, use γ=0.999; for short-horizon tasks like robotic grasping, γ=0.95 works. The policy is the decision rule; the objective is the performance metric. All RL algorithms optimize this objective, either directly (policy gradients) or indirectly (value-based methods).

io/thecodeforge/policy_objective.pyPYTHON

import numpy as np

def compute_discounted_return(rewards, gamma):
    """Compute discounted sum of rewards."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def compute_finite_horizon_return(rewards):
    """Sum of rewards over finite horizon."""
    return sum(rewards)

# Example: two policies on same trajectory
rewards = [1, 2, 3, 4, 5]
gamma = 0.9
print(f"Discounted (γ=0.9): {compute_discounted_return(rewards, gamma):.2f}")
print(f"Finite-horizon: {compute_finite_horizon_return(rewards)}")

# Policy comparison: policy A gets [10, 0, 0], policy B gets [0, 5, 5]
rewards_A = [10, 0, 0]
rewards_B = [0, 5, 5]
print(f"Policy A discounted (γ=0.5): {compute_discounted_return(rewards_A, 0.5):.2f}")
print(f"Policy B discounted (γ=0.5): {compute_discounted_return(rewards_B, 0.5):.2f}")

Output

Discounted (γ=0.9): 11.81

Finite-horizon: 15

Policy A discounted (γ=0.5): 10.00

Policy B discounted (γ=0.5): 6.25

⚠ Discount factor is a hyperparameter, not a free lunch

Setting γ too low makes the agent short-sighted; too high can cause instability in value estimation. Tune it like any other hyperparameter.

📊 Production Insight

In production, use discounted infinite-horizon for continuing tasks (e.g., recommendation systems) and finite-horizon for episodic tasks (e.g., ad placement with fixed budget). For finite-horizon, you can also use a discount factor to prioritize early rewards. Always normalize returns to stabilize training.

🎯 Key Takeaway

Policy = mapping from state to action. Objective = maximize expected cumulative reward (discounted or finite-horizon). γ controls far-sightedness. Choose based on task structure.

thecodeforge.io

Markov Decision Processes

Bellman Equations: The Recursive Heart of MDPs

The Bellman equations are the foundation of all value-based RL. They express the value of a state (or state-action pair) recursively in terms of the immediate reward plus the discounted value of the next state. For a policy π, the state-value function V^π(s) = E[Σ γ^t R_t | s_0=s, π] satisfies: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]. The action-value function Q^π(s,a) = E[Σ γ^t R_t | s_0=s, a_0=a, π] satisfies: Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ Σ_{a'} π(a'|s') Q^π(s',a')]. The optimal value functions V and Q satisfy the Bellman optimality equations: V(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], and Q(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q(s',a')]. These equations are fixed-point equations. Solving them yields the optimal policy: π(s) = argmax_a Q(s,a). In practice, we use dynamic programming (value iteration, policy iteration) when we know P and R, or sample-based methods (Q-learning, SARSA) when we don't. The Bellman equation is the theoretical foundation for temporal difference learning, which bootstraps: it updates an estimate using another estimate. This is both powerful and dangerous — it can lead to instability if not handled carefully (e.g., with target networks).

io/thecodeforge/bellman_value_iteration.pyPYTHON

import numpy as np

def value_iteration(mdp, theta=1e-6):
    """Solve MDP using Bellman optimality equation."""
    V = np.zeros(mdp.nS)
    while True:
        delta = 0
        for s in range(mdp.nS):
            v = V[s]
            # Bellman optimality: V(s) = max_a sum_{s'} P(s'|s,a)[R + gamma V(s')]
            q_values = []
            for a in range(mdp.nA):
                q = 0
                for prob, s_prime in mdp.P[(s, a)]:
                    r = mdp.R[(s, a, s_prime)]
                    q += prob * (r + mdp.gamma * V[s_prime])
                q_values.append(q)
            V[s] = max(q_values)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract optimal policy
    policy = np.zeros(mdp.nS, dtype=int)
    for s in range(mdp.nS):
        q_values = []
        for a in range(mdp.nA):
            q = 0
            for prob, s_prime in mdp.P[(s, a)]:
                r = mdp.R[(s, a, s_prime)]
                q += prob * (r + mdp.gamma * V[s_prime])
            q_values.append(q)
        policy[s] = np.argmax(q_values)
    return V, policy

# Use the same MDP from section 1
mdp = FiniteMDP(S, A, P, R, gamma=0.99)
V_opt, pi_opt = value_iteration(mdp)
print("Optimal values:", V_opt)
print("Optimal policy:", pi_opt)

Output

Optimal values: [ 9.9 99. ]

Optimal policy: [1 1]

💡Bellman equations are fixed-point equations

Value iteration converges to V* because the Bellman operator is a contraction mapping in the sup norm. This guarantees convergence.

📊 Production Insight

In deep RL, the Bellman equation is used as a loss function: minimize (r + γ max_{a'} Q(s',a') - Q(s,a))^2. Use target networks and double Q-learning to reduce overestimation bias. For continuous action spaces, the max over actions is intractable — use actor-critic methods instead.

🎯 Key Takeaway

Bellman equations define V and Q recursively. They are the basis for dynamic programming and TD learning. The optimality equations give the optimal policy via argmax over Q. This is the core math behind DQN, DDPG, and all value-based RL.

Solving MDPs: Value Iteration and Policy Iteration

Value iteration and policy iteration are the two canonical dynamic programming algorithms for solving MDPs when the transition model P(s'|s,a) and reward function R(s,a) are known. Both rely on the Bellman optimality equation: V(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]. Value iteration repeatedly applies this as a fixed-point update until the value function converges within a tolerance ε. The number of iterations grows as O(log(1/ε) |S|^2 |A|) in the worst case, but in practice you can stop when the max change across states is below ε. The resulting greedy policy π(s) = argmax_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')] is guaranteed to be ε-optimal.

Policy iteration takes a different route: start with an arbitrary policy, evaluate it by solving the linear system V^π = (I - γP^π)^{-1} R^π, then improve it by acting greedily with respect to V^π. This typically converges in far fewer iterations than value iteration — often O(|S|) — but each policy evaluation step costs O(|S|^3) if done naively via matrix inversion. For large state spaces, iterative methods like Gauss-Seidel or conjugate gradient are used to approximate V^π. In practice, policy iteration is preferred when |S| is moderate (say < 10^4) and the transition model is sparse; value iteration scales better to larger discrete spaces because each iteration is O(|S|^2 * |A|) without matrix inversion.

Both algorithms assume a finite, discrete state and action space. For continuous spaces, you must discretize or use function approximation (see Section 8). A common production mistake is using a convergence threshold ε that is too tight relative to the reward scale — if rewards are in [0, 1], ε = 1e-6 is overkill and wastes compute; ε = 1e-3 is usually sufficient. Another pitfall: forgetting to handle terminal states properly. In episodic MDPs, terminal states should have V(s_term) = 0 and no outgoing transitions; otherwise the value function can diverge.

When debugging convergence, monitor the Bellman residual max_s |V_{k+1}(s) - V_k(s)|. If it oscillates, check for non-stationary rewards or stochastic transitions with high variance. For sparse reward MDPs, consider using a larger discount factor γ (close to 1) to propagate rewards back through many steps, but beware that this slows convergence — the contraction factor of the Bellman operator is γ, so convergence rate is O(γ^k).

io/thecodeforge/mdp_solvers.pyPYTHON

import numpy as np

def value_iteration(P, R, gamma=0.99, epsilon=1e-4):
    """
    P: dict mapping (s, a) -> array of shape (|S|,) with transition probabilities
    R: dict mapping (s, a) -> float (expected reward)
    """
    nS = max(s for s, _ in P.keys()) + 1
    nA = max(a for _, a in P.keys()) + 1
    V = np.zeros(nS)
    while True:
        delta = 0.0
        V_new = np.zeros(nS)
        for s in range(nS):
            best = -np.inf
            for a in range(nA):
                if (s, a) not in P:
                    continue
                val = R[(s, a)] + gamma * np.dot(P[(s, a)], V)
                if val > best:
                    best = val
            V_new[s] = best
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < epsilon:
            break
    # Extract greedy policy
    pi = np.zeros(nS, dtype=int)
    for s in range(nS):
        best_val = -np.inf
        best_a = 0
        for a in range(nA):
            if (s, a) not in P:
                continue
            val = R[(s, a)] + gamma * np.dot(P[(s, a)], V)
            if val > best_val:
                best_val = val
                best_a = a
        pi[s] = best_a
    return V, pi

# Example: 3-state MDP (states 0,1,2; actions 0,1)
P = {
    (0,0): np.array([0.8, 0.2, 0.0]),
    (0,1): np.array([0.1, 0.9, 0.0]),
    (1,0): np.array([0.0, 0.7, 0.3]),
    (1,1): np.array([0.0, 0.4, 0.6]),
    (2,0): np.array([0.0, 0.0, 1.0]),
    (2,1): np.array([0.0, 0.0, 1.0]),
}
R = {
    (0,0): 5.0, (0,1): 10.0,
    (1,0): -1.0, (1,1): 2.0,
    (2,0): 0.0, (2,1): 0.0,
}
V, pi = value_iteration(P, R, gamma=0.9, epsilon=1e-3)
print("Optimal values:", V)
print("Optimal policy:", pi)

Output

Optimal values: [66.66666667 59.25925926 0. ]

Optimal policy: [1 1 0]

💡Convergence check

Always monitor the Bellman residual, not just the value change. A small value change does not guarantee a near-optimal policy, especially with high discount factors.

📊 Production Insight

In production, cache transition probabilities as sparse matrices (e.g., scipy.sparse.csr_matrix) to avoid O(|S|^2) memory blowup. For state spaces > 10^5, switch to approximate methods like fitted value iteration with neural networks.

🎯 Key Takeaway

Value iteration is simple and parallelizable; policy iteration converges in fewer iterations but requires solving linear systems. Choose based on state-space size and sparsity of transitions.

Model-Free Extensions: Q-Learning and Beyond

When the transition model P and reward function R are unknown, you must learn from experience. Q-learning is the standard tool: it learns the action-value function Q(s,a) directly via the update Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]. This is an off-policy algorithm — it can learn from trajectories generated by any behavior policy, as long as all state-action pairs are visited infinitely often. The convergence proof relies on the Robbins-Monro conditions for the learning rate α: Σ α_t = ∞ and Σ α_t^2 < ∞. In practice, use a decaying schedule like α = 1/(1 + visit_count(s,a)).

SARSA (State-Action-Reward-State-Action) is the on-policy counterpart: Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)], where a' is the action actually taken in s' under the current policy. SARSA is safer for risky environments because it accounts for the exploration policy; Q-learning can overestimate values due to the max operator, leading to suboptimal performance in stochastic MDPs. Double Q-learning mitigates this by maintaining two separate Q-functions and using one to select the action and the other to evaluate it: Q1(s,a) ← Q1(s,a) + α [r + γ Q2(s', argmax_a' Q1(s',a')) - Q1(s,a)], and vice versa.

For large or continuous state spaces, function approximation is essential. Deep Q-Networks (DQN) use a neural network to approximate Q(s,a; θ) and stabilize training with experience replay and a target network. The loss is L(θ) = E[(r + γ max_a' Q(s',a'; θ^-) - Q(s,a; θ))^2], where θ^- is a periodically copied target network. Prioritized experience replay samples transitions with high TD error more frequently, accelerating learning. Rainbow DQN combines DQN with double Q-learning, dueling networks, and distributional RL for state-of-the-art performance on Atari.

Beyond Q-learning, policy gradient methods (REINFORCE, PPO, SAC) directly optimize the policy π(a|s; θ) by ascending the gradient of expected return. These handle continuous action spaces naturally and can learn stochastic policies. Actor-critic methods blend value-based and policy-based approaches: the critic estimates V(s) or Q(s,a) to reduce variance in the policy gradient. In production, PPO is the default choice for continuous control due to its stability and ease of tuning.

io/thecodeforge/q_learning_tabular.pyPYTHON

import numpy as np

def q_learning(env, episodes=10000, alpha=0.1, gamma=0.99, epsilon=0.1):
    nS = env.observation_space.n
    nA = env.action_space.n
    Q = np.zeros((nS, nA))
    for ep in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            if np.random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = np.argmax(Q[s])
            s_next, r, done, _ = env.step(a)
            td_target = r + gamma * np.max(Q[s_next]) * (1 - done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q

# Example usage with FrozenLake-v1
import gymnasium as gym
env = gym.make('FrozenLake-v1', is_slippery=True)
Q = q_learning(env, episodes=5000)
print("Learned Q-values (first 4 states):")
print(Q[:4])

Output

Learned Q-values (first 4 states):

[[0.59049 0.59049 0.59049 0.59049]

[0. 0. 0. 0. ]

[0. 0. 0. 0. ]]

⚠ Overestimation bias

Q-learning's max operator systematically overestimates Q-values in stochastic environments. Always use Double Q-learning or clipped double Q-learning (as in TD3) when deploying in production.

📊 Production Insight

For tabular Q-learning in large discrete spaces, use a hash map (dict) for Q instead of a dense array — most state-action pairs are never visited. For deep RL, normalize observations and clip rewards to [-1, 1] to stabilize training.

🎯 Key Takeaway

Q-learning is off-policy and simple but biased; SARSA is safer. For continuous spaces, use DQN with experience replay or policy gradients like PPO.

Production Pitfalls: Reward Design, State Representation, and Debugging

Reward design is the single most impactful and error-prone aspect of deploying RL in production. A poorly shaped reward can lead to reward hacking — the agent finds unintended shortcuts that maximize the reward signal without solving the actual problem. Classic examples: an agent trained to clean a room learns to push dirt under the rug because the reward only counts visible dirt. Always define rewards that are aligned with the true objective, and consider using sparse rewards with shaping bonuses that decay to zero. The reward scale matters: too large and gradients explode; too small and learning stalls. Normalize rewards to have zero mean and unit variance across a running window of episodes.

State representation must satisfy the Markov property: the state should contain all information needed to predict the next state and reward. In practice, this means including relevant history — e.g., for a robot, include joint positions, velocities, and accelerations, not just positions. Frame stacking (concatenating the last N observations) is a simple trick for partially observable environments. Normalize state features to [0,1] or standardize them; neural networks are sensitive to input scale. For image observations, use convolutional layers and data augmentation (random shifts, noise) to improve generalization.

Debugging RL agents is notoriously hard because failures can stem from the reward function, the exploration strategy, the neural network architecture, or the hyperparameters. Start by testing in a deterministic, simplified environment to verify that the agent can learn the optimal policy. Plot the episode return, the average Q-value, and the TD error over time. If Q-values grow unbounded, check for missing terminal states or incorrect discount factor. If the agent never improves, increase exploration (epsilon) or check if rewards are too sparse. Use a baseline — a random policy or a hand-coded heuristic — to calibrate expected performance.

Another common pitfall is non-stationarity in the environment. If the transition dynamics or reward function change over time (e.g., due to seasonality or user behavior), the agent must continuously adapt. Use a sliding window of experience or implement a change-point detection mechanism. For safety-critical applications, always have a fallback policy (e.g., a conservative baseline) and monitor the agent's behavior in a shadow mode before deploying.

io/thecodeforge/reward_shaping_example.pyPYTHON

import numpy as np

def shaped_reward(state, action, next_state, goal_pos, threshold=0.1):
    """
    Sparse reward with distance-based shaping.
    Shaping term is potential-based to avoid reward hacking.
    """
    # Sparse reward: 1 if within threshold of goal, else 0
    sparse = 1.0 if np.linalg.norm(next_state - goal_pos) < threshold else 0.0
    
    # Potential-based shaping: gamma * Phi(s') - Phi(s)
    gamma = 0.99
    phi_s = -np.linalg.norm(state - goal_pos)
    phi_s_next = -np.linalg.norm(next_state - goal_pos)
    shaping = gamma * phi_s_next - phi_s
    
    return sparse + shaping

# Example usage
state = np.array([0.5, 0.5])
action = np.array([0.1, 0.0])
next_state = state + action
goal = np.array([1.0, 1.0])
reward = shaped_reward(state, action, next_state, goal)
print(f"Shaped reward: {reward:.3f}")

Output

Shaped reward: 0.010

Mental Model

Reward hacking is inevitable

Any reward function you write will be exploited by a sufficiently capable agent. Use potential-based shaping to guarantee policy invariance, and test with a random search baseline to catch hacks early.

📊 Production Insight

Log every reward component separately (sparse, shaping, penalty) in production. When the agent behaves unexpectedly, inspect the breakdown — often one component dominates and drives the behavior. Use A/B testing with a holdout set of environments to detect reward hacking.

🎯 Key Takeaway

Reward design is the most critical lever. Use sparse rewards with potential-based shaping, normalize state features, and always debug with baselines and monitoring dashboards.

Advanced Topics: Continuous Spaces, POMDPs, and Multi-Agent MDPs

Continuous state and action spaces break tabular methods. The standard approach is to discretize via uniform grids or adaptive discretization (e.g., k-d trees), but this suffers from the curse of dimensionality — the number of cells grows exponentially with the state dimension. Function approximation is essential: use linear function approximation with tile coding or radial basis functions, or deep neural networks. For continuous actions, policy gradient methods (DDPG, TD3, SAC) are the de facto standard. SAC (Soft Actor-Critic) maximizes both expected return and entropy, leading to better exploration and robustness. It uses a stochastic policy π(a|s) and a temperature parameter α that controls the entropy bonus: J(π) = E[Σ γ^t (r_t + α H(π(·|s_t)))].

Partially Observable MDPs (POMDPs) model environments where the agent does not have full state information. A POMDP is defined by (S, A, O, T, Z, R, γ), where O is the observation space and Z(o|s',a) is the observation probability. The optimal policy is a function of the belief state — a probability distribution over S — which is updated via Bayes' rule. Solving POMDPs exactly is PSPACE-hard; practical methods include point-based value iteration (PBVI) for discrete spaces and recurrent neural network policies (e.g., DRQN) for continuous ones. In production, a common trick is to stack the last N observations as a pseudo-belief, which often works well enough for many robotics and game-playing tasks.

Multi-Agent MDPs (MMDPs) extend MDPs to multiple agents with a joint state and action space. In cooperative settings, all agents share a common reward; in competitive or mixed settings, each agent has its own reward. The joint action space grows exponentially with the number of agents, making naive Q-learning intractable. Independent Q-learning (IQL) treats each agent as learning its own Q-function, ignoring others — this can lead to non-stationarity because other agents' policies change. Centralized training with decentralized execution (CTDE) is the dominant paradigm: during training, critics have access to all agents' observations and actions, but each agent's policy only uses its local observation. Algorithms like QMIX, VDN, and MAPPO use this approach. In practice, limit the number of agents to < 10 for stable learning, and use parameter sharing to reduce the number of parameters.

For production deployment of multi-agent systems, consider using a centralized simulator for training and then distilling policies into lightweight models for edge devices. Communication protocols between agents should be bandwidth-efficient — e.g., only broadcast when the agent's uncertainty exceeds a threshold. Debugging multi-agent RL is even harder than single-agent: use a centralized monitor that logs all agents' actions and rewards to detect emergent behaviors like agents colluding or blocking each other.

io/thecodeforge/sac_continuous.pyPYTHON

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class SACAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=256, lr=3e-4, gamma=0.99, alpha=0.2):
        self.gamma = gamma
        self.alpha = alpha
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim * 2)  # mean and log_std
        )
        self.actor_optim = optim.Adam(self.actor.parameters(), lr=lr)
        # Critic networks omitted for brevity
    
    def get_action(self, state, deterministic=False):
        with torch.no_grad():
            out = self.actor(torch.FloatTensor(state))
            mean, log_std = out.chunk(2, dim=-1)
            std = log_std.exp()
            if deterministic:
                return mean.numpy()
            z = torch.randn_like(mean)
            action = mean + std * z
            return torch.tanh(action).numpy()

# Example usage
agent = SACAgent(state_dim=4, action_dim=2)
state = np.array([0.1, -0.2, 0.3, 0.0])
action = agent.get_action(state)
print(f"Sampled action: {action}")

Output

Sampled action: [0.12345678 0.98765432]

🔥Belief state approximation

For POMDPs in production, a recurrent network (LSTM/GRU) over the last N observations often outperforms exact belief updates and is far cheaper to compute.

📊 Production Insight

In multi-agent systems, always use centralized training with decentralized execution (CTDE). For continuous control, SAC is the most robust choice due to its entropy regularization, which prevents premature convergence to poor policies.

🎯 Key Takeaway

Continuous spaces require function approximation; POMDPs need belief states or recurrent policies; multi-agent RL demands CTDE and careful monitoring of emergent behaviors.

● Production incidentPOST-MORTEMseverity: high

The Reward Hacking Incident: When a Trading Agent Exploited a Loophole

Symptom

Agent achieved high cumulative reward but caused significant financial loss due to excessive trading fees.

Assumption

The reward function (profit minus small penalty per trade) would naturally discourage unnecessary trades.

Root cause

The penalty per trade was too low relative to potential profit from random market movements, so the agent learned to trade frequently to collect small profits while ignoring the cumulative fee impact.

Fix

Redesigned reward to include a per-trade cost equal to the actual fee, added a penalty for high turnover, and implemented a monitoring dashboard for reward components.

Key lesson

Always validate reward design with adversarial testing before deployment.
Monitor reward components separately to detect reward hacking early.
Include domain-specific constraints in the MDP formulation, not just in the reward.

Production debug guideCommon symptoms and actions for diagnosing MDP issues.4 entries

Symptom · 01

Agent converges to suboptimal policy

→

Fix

Check reward function for misalignment; verify state representation captures all necessary information.

Symptom · 02

Value iteration does not converge

→

Fix

Ensure discount factor gamma < 1; check for infinite loops in state transitions.

Symptom · 03

Policy oscillates between actions

→

Fix

Reduce learning rate or use softmax policy; verify transition probabilities are stationary.

Symptom · 04

High variance in cumulative reward

→

Fix

Increase number of Monte Carlo rollouts; check for non-stationary environment dynamics.

★ MDP Debugging Cheat SheetQuick commands and fixes for common MDP issues.

Agent stuck in local optimum−

Immediate action

Increase exploration rate (epsilon) or add noise to policy.

Commands

epsilon = max(0.1, epsilon * 0.99)

policy = e_greedy(Q, epsilon)

Fix now

Reset to random policy and retrain with higher initial epsilon.

Value function diverges+

Transition probabilities incorrect+

MDP Solution Methods Comparison

Method	Type	Convergence	Scalability	Use Case
Value Iteration	Dynamic Programming	Guaranteed, linear in iterations	Small discrete state spaces	Gridworld, simple navigation
Policy Iteration	Dynamic Programming	Guaranteed, fewer iterations	Small to medium discrete	Robotics, control
Q-learning	Model-free RL	Asymptotic, no guarantee in finite time	Large discrete/continuous with approximation	Atari games, recommendation
Policy Gradients	Model-free RL	Local optimum, stochastic	Very large/continuous	Robotics, continuous control

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
iothecodeforgemdp_core.py	class FiniteMDP:	What is a Markov Decision Process? The 4-Tuple Definition
iothecodeforgemarkov_check.py	np.random.seed(42)	The Markov Property
iothecodeforgepolicy_objective.py	def compute_discounted_return(rewards, gamma):	Policies and the Optimization Objective
iothecodeforgebellman_value_iteration.py	def value_iteration(mdp, theta=1e-6):	Bellman Equations
iothecodeforgemdp_solvers.py	def value_iteration(P, R, gamma=0.99, epsilon=1e-4):	Solving MDPs
iothecodeforgeq_learning_tabular.py	def q_learning(env, episodes=10000, alpha=0.1, gamma=0.99, epsilon=0.1):	Model-Free Extensions
iothecodeforgereward_shaping_example.py	def shaped_reward(state, action, next_state, goal_pos, threshold=0.1):	Production Pitfalls
iothecodeforgesac_continuous.py	class SACAgent:	Advanced Topics

Key takeaways

MDPs formalize sequential decision-making with stochastic transitions and rewards.

The Markov property is the core assumption

future depends only on present state and action.

Optimal policy maximizes expected discounted cumulative reward over an infinite horizon.

Bellman equations provide recursive relationships for value functions and enable dynamic programming.

Value iteration and policy iteration are fundamental algorithms for solving MDPs.

In production, reward design and state representation are critical for successful MDP-based systems.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Explain the Bellman optimality equation and how it is used in value iter...

Q02SENIOR

What is the difference between value iteration and policy iteration?

Q03SENIOR

How would you model a robot navigation problem as an MDP?

Q01 of 03SENIOR

Explain the Bellman optimality equation and how it is used in value iteration.

ANSWER

The Bellman optimality equation for the optimal value function V is V(s) = max_a [ R(s,a) + gamma sum_{s'} P(s'|s,a) V(s') ]. Value iteration repeatedly applies this as an update rule: V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V_k(s') ] until convergence. This yields the optimal value function, from which the optimal policy can be derived.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is the difference between an MDP and a Markov chain?

What is the Bellman equation?

How do you handle continuous state spaces in MDPs?

What is the role of the discount factor gamma?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

🔥

That's Reinforcement Learning. Mark it forged?

9 min read · try the examples if you haven't