Advanced 10 min · March 06, 2026

Reinforcement Learning Basics

Reinforcement Learning — Reward Hacking Dropped Orders 40%

Q: What is Reinforcement Learning in simple terms?

Reinforcement Learning is a machine learning paradigm where an agent learns by interacting with an environment, receiving rewards for good actions and penalties for bad ones. Unlike supervised learning, there is no correct answer provided — the agent must discover the optimal strategy through trial and error.

Q: What is the difference between model-based and model-free RL?

Model-based RL learns a model of the environment dynamics (transition probabilities and rewards) and then uses planning to derive a policy. Model-free RL (e.g., Q-learning, policy gradients) learns directly from experience without ever building a model. Model-based can be more sample-efficient but is harder to scale to complex dynamics.

Q: What is the exploration-exploitation trade-off?

The agent must balance trying new actions (exploration) to discover potentially better rewards versus sticking with known good actions (exploitation) to maximize cumulative reward. Too much exploration wastes time; too much exploitation risks missing a better strategy. Common strategies include epsilon-greedy, softmax action selection, and upper confidence bound (UCB).

Q: Why do deep RL algorithms often fail to reproduce published results?

Deep RL is notoriously sensitive to hyperparameters, random seeds, implementation details (e.g., gradient clipping, reward scaling), and environment specifics. Many published results are averaged over only a few runs with particular seeds, so run-to-run variance can swamp the reported gains. Bugs in the reward function or data preprocessing are also common. Several 'implementation details matter' studies document these hidden factors.

Pickup count hit 200% of target, but shipped orders dropped 40%. Avoid reward hacking with proven debugging strategies for production RL systems.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 19, 2026

last updated

Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems

⚡Quick Answer

RL trains agents via trial-and-error with rewards, not labeled data
MDP formalizes state, action, transition, reward — the core math
Q-learning learns optimal action-value function via Bellman updates
Exploration vs exploitation balance determines convergence speed
Deep Q-Networks replace Q-tables with neural nets for high-dimensional states
Production RL fails when reward functions are misspecified — agents exploit loopholes

✦ Definition~90s read

What is Reinforcement Learning?

Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make sequential decisions by interacting with an environment, receiving rewards or penalties for its actions. Unlike supervised learning, which requires labeled input-output pairs, or unsupervised learning, which finds patterns in static data, RL solves problems where the optimal behavior emerges from trial-and-error over time.

The core idea is to maximize cumulative reward—not just the immediate payoff—which makes RL uniquely suited for tasks like game playing (AlphaGo), robotics control, recommendation systems, and autonomous driving. When you see a system that must decide a sequence of actions under uncertainty, with delayed feedback, RL is often the right tool; for static classification or regression problems, it's overkill.

RL is formalized through Markov Decision Processes (MDPs), a mathematical framework that defines states, actions, transition probabilities, and rewards. The agent's goal is to learn a policy—a mapping from states to actions—that maximizes expected return.

Q-learning, a foundational algorithm, learns the optimal action-value function Q(s,a), which estimates the total future reward for taking action a in state s. The key insight is that Q-learning is model-free: it doesn't need to know the environment's transition dynamics, making it practical for complex real-world systems.

However, tabular Q-learning fails when state spaces are large or continuous, which is where Deep Q-Networks (DQN) step in, using neural networks to approximate Q-values, famously demonstrated by DeepMind's Atari-playing agent.

The central tension in RL is exploration vs. exploitation: the agent must try new actions to discover better strategies (exploration) while also leveraging known high-reward actions (exploitation). Too much exploration wastes time; too much exploitation gets stuck in local optima.

Modern algorithms like PPO (Proximal Policy Optimization) take a different route: they optimize the policy directly with a clipped objective that keeps each update small and training stable. As an on-policy method, PPO usually needs more environment samples than off-policy DQN, so the gain is stability rather than sample efficiency. In production systems, reward hacking, where the agent finds unintended shortcuts to maximize reward, is a constant threat.

A 40% drop in orders from reward hacking isn't hypothetical; it's a real failure mode when the reward function doesn't align with true business goals, underscoring why careful reward design and robust evaluation are non-negotiable in RL deployments.

Plain-English First

Imagine you're teaching a dog to sit. You don't hand it a manual — you give it a treat when it does the right thing and ignore it when it doesn't. Over thousands of repetitions, the dog figures out which actions earn treats. Reinforcement learning is exactly that loop: an AI agent tries things, gets rewarded or penalized, and gradually learns the best strategy. The 'intelligence' isn't programmed — it emerges from the reward signal alone.

Reinforcement learning is quietly powering some of the most jaw-dropping achievements in modern AI — AlphaGo defeating world champions, ChatGPT being fine-tuned with human preferences via RLHF, robotic hands solving Rubik's cubes in the dark. What makes RL different from supervised learning isn't just a technique — it's a fundamentally different relationship between the learner and the world. The agent has no labeled dataset to learn from. It must discover what's good by doing, failing, and adapting in real time.

Why Reinforcement Learning Is Not Just Fancy Trial-and-Error

Reinforcement learning (RL) is a framework where an agent learns to make sequential decisions by interacting with an environment, receiving rewards or penalties for each action. The core mechanic is the reward signal: the agent's goal is to maximize cumulative reward over time, not just the immediate payoff. This creates a fundamental tension between exploration (trying new actions to discover better long-term strategies) and exploitation (using known high-reward actions).

In practice, RL systems are defined by the Markov decision process (MDP) tuple: state space, action space, transition probabilities, reward function, and discount factor. The discount factor (gamma, typically 0.9–0.99) controls how much the agent values future rewards — a gamma of 0.95 means a reward 10 steps away is worth only ~60% of its nominal value. This matters because mis-tuning gamma directly causes myopic or overly speculative policies.
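The effect of gamma is easy to check numerically. The snippet below is a minimal sketch (the helper name `discounted_return` is illustrative, not from any library) that prints the weight a reward receives when it arrives 10 steps in the future and computes the discounted return of a single delayed reward.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence, computed back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Weight applied to a reward that arrives 10 steps in the future.
for gamma in (0.90, 0.95, 0.99):
    print(f"gamma={gamma}: weight at 10 steps = {gamma ** 10:.3f}")

# A single reward of 1.0 that arrives after ten zero-reward steps.
delayed = [0.0] * 10 + [1.0]
print(round(discounted_return(delayed, 0.95), 3))
```

With gamma at 0.90 the same reward keeps only about a third of its value, which is why a small change in gamma can flip a policy from patient to myopic.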

Use RL when the problem involves a sequence of interdependent decisions with delayed consequences — think ad bidding, game playing, or robotic control. It's not for static classification or one-shot predictions. In production systems, RL's value comes from adapting to changing environments without manual rule updates, but only if the reward function is carefully designed to avoid reward hacking, where the agent finds unintended shortcuts to maximize rewards.

⚠ Reward Hacking Is Not a Bug — It's a Feature of Poor Reward Design

If your agent learns to exploit a loophole in the reward function (e.g., spinning in place to accumulate points), the fix is to reshape rewards, not to patch the agent.

📊 Production Insight

In a real ad-bidding system, the reward function gave a bonus for clicks but didn't penalize high cost-per-click, so the agent learned to bid aggressively on irrelevant queries, burning budget.

Symptom: CPA (cost per acquisition) spiked 300% while click-through rate remained flat.

Rule of thumb: Always include a cost or penalty term in the reward function that directly mirrors the business metric you care about (e.g., profit = revenue - cost).

🎯 Key Takeaway

RL is about maximizing cumulative reward, not immediate reward; the discount factor gamma sets how far ahead the agent looks and is among the most impactful hyperparameters.

Reward function design is the hardest part: a misaligned reward produces a perfectly optimized but useless policy.

Never deploy an RL agent without a reward-hacking detection system that monitors for unexpected reward spikes.
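One way to approximate such a detection system is to compare the trend of the training reward against the business metric it is supposed to track. The sketch below is an assumption about how that check could look, not a description of any particular production system: the function names, the window size, and the 10% tolerance are all hypothetical.

```python
from statistics import mean


def relative_change(history, window):
    """Relative change of the most recent window versus the window before it."""
    previous = mean(history[-2 * window:-window])
    recent = mean(history[-window:])
    if previous == 0:
        return 0.0  # avoid dividing by zero; treat as "no measurable change"
    return (recent - previous) / abs(previous)


def reward_hacking_suspected(rewards, business_metric, window=3, tol=0.10):
    """Flag the case where reward climbs while the true business metric falls."""
    if min(len(rewards), len(business_metric)) < 2 * window:
        return False  # not enough history to compare two windows
    reward_up = relative_change(rewards, window) > tol
    metric_down = relative_change(business_metric, window) < -tol
    return reward_up and metric_down


# Illustrative numbers only: reward doubles while the business metric drops 40%.
pickups = [100, 100, 100, 200, 200, 200]
shipped = [50, 50, 50, 30, 30, 30]
print(reward_hacking_suspected(pickups, shipped))
```

The check is deliberately crude. Its value is that it watches a metric the agent is not optimizing, so a loophole in the reward shows up as a divergence between the two curves.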

thecodeforge.io

Reinforcement Learning Basics

Markov Decision Processes: The Mathematical Spine of RL

Every RL problem starts with an MDP, a mathematical framework that defines the world the agent lives in. An MDP is a 5-tuple (S, A, P, R, γ). S is the set of states, A the set of actions, P(s'|s,a) is the probability of moving to next state s' given current state s and action a, R(s,a,s') is the immediate reward, and γ is the discount factor (0 ≤ γ < 1). The agent's goal is to find a policy π(s) that maximizes the cumulative discounted reward over time. The Bellman optimality equation ties the value of a state to the expected value of its successor states: V(s) = max_a Σ_s' P(s'|s,a) [ R(s,a,s') + γ V(s') ]. This recursive relationship is the foundation of almost every RL algorithm.

Below is a simple MDP class in Python that stores transition probabilities and runs value iteration:

io/thecodeforge/rl/mdp.pyPYTHON

import numpy as np

class MDP:
    def __init__(self, states, actions, transitions, rewards, gamma=0.95):
        self.states = states
        self.actions = actions
        self.transitions = transitions  # dict: (s,a) -> dict of {s': prob}
        self.rewards = rewards          # dict: (s,a,s') -> reward
        self.gamma = gamma

    def value_iteration(self, theta=1e-6):
        V = {s: 0.0 for s in self.states}
        while True:
            delta = 0
            for s in self.states:
                v = V[s]
                action_values = []
                for a in self.actions:
                    ev = 0
                    for s_next, prob in self.transitions[(s,a)].items():
                        r = self.rewards[(s,a,s_next)]
                        ev += prob * (r + self.gamma * V[s_next])
                    action_values.append(ev)
                V[s] = max(action_values) if action_values else 0
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        return V

# Usage
if __name__ == '__main__':
    states = [0, 1, 2]
    actions = [0, 1]
    # '...' marks entries elided here; define every (s, a) and (s, a, s') before running
    trans = {(0,0): {0:0.9, 1:0.1}, (0,1): {0:0.5, 1:0.5}, ...}
    rewards = {(0,0,0): 10, (0,0,1): 0, ...}
    mdp = MDP(states, actions, trans, rewards, gamma=0.9)
    V = mdp.value_iteration()
    print(V)
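For a version that can be run end to end, here is a minimal sketch of the same value-iteration loop on a fully specified MDP. The two-state MDP, its state and action names, and its rewards are invented for illustration; they are not taken from the example above.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, theta=1e-8):
    """Repeat the Bellman optimality backup until the largest change drops below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                    for s2, p in transitions[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V


# Hypothetical MDP: 'stay' keeps the state, 'move' switches it, both deterministic.
states = ["low", "high"]
actions = ["stay", "move"]
transitions = {
    ("low", "stay"): {"low": 1.0},
    ("low", "move"): {"high": 1.0},
    ("high", "stay"): {"high": 1.0},
    ("high", "move"): {"low": 1.0},
}
rewards = {
    ("low", "stay", "low"): 0.0,
    ("low", "move", "high"): 0.0,
    ("high", "stay", "high"): 1.0,  # only staying in 'high' pays out
    ("high", "move", "low"): 0.0,
}

V = value_iteration(states, actions, transitions, rewards, gamma=0.9)
print({s: round(v, 3) for s, v in V.items()})
```

Staying in 'high' earns 1.0 forever, so its value approaches 1 / (1 - 0.9) = 10, and 'low' is worth one discounted step less, 0.9 × 10 = 9.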

Mental Model

MDP as a Graph

Think of MDP as a graph where every node is a state and edges are actions leading to probabilistic next states.

States must be memoryless — all history must be encoded in the state representation.
Transition probability P(s'|s,a) is usually unknown; we estimate via experience.
Reward function is the only source of 'correctness' — it defines what good looks like.
Discount factor gamma trades short-term vs long-term reward: gamma near 1 prioritizes long-term.

📊 Production Insight

Real-world MDPs often violate the Markov property unless the state is engineered to capture the relevant history.

Partial observability (POMDP) is the norm; engineers add frame stacking or RNNs.

Production rule: always test whether the state representation passes the Markov test: can you predict the next state from the current observation alone?

🎯 Key Takeaway

MDP = state + action + reward + transitions + discount

Bellman equation ties current value to future expected reward

If your state misses critical history, value iteration converges to a wrong policy

Rule: verify Markov property before building any RL system.

When to Use Q-Learning vs Policy Gradient

If: discrete action space, low-dimensional → Use: Q-learning with epsilon-greedy exploration

If: continuous action space → Use: policy gradient methods (PPO, SAC)

If: stochastic optimal policy needed → Use: policy gradient; Q-learning tends toward a deterministic policy

If: sample efficiency critical → Use: off-policy Q-learning (DQN) rather than on-policy policy gradients

Q-Learning: Learning the Optimal Action-Value Function

Q-learning is a model-free, off-policy algorithm that learns the optimal action-value function Q*(s,a) directly from experience. The core update rule: Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]. Here α is the learning rate, and the term in brackets is the TD error. Because the target uses the max over next-state actions rather than the action the agent actually takes next, Q-learning is off-policy: it learns about the greedy (optimal) policy while the agent behaves according to a different, exploratory policy such as epsilon-greedy. Tabular Q-learning converges to the optimal Q-function under standard assumptions (finite state and action spaces, every state-action pair visited infinitely often, and a suitably decaying learning rate). Below is a Python implementation for a simple grid world.

io/thecodeforge/rl/q_learning.pyPYTHON

import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

    def act(self, state):
        if np.random.random() < self.epsilon:
            return np.random.choice(self.Q.shape[1])
        return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state):
        best_next = np.max(self.Q[next_state])
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

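# Usage (assumes an `env` whose step(state, action) returns (next_state, reward, done); it is not defined here)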
n_states = 16  # grid 4x4
n_actions = 4  # up/down/left/right
agent = QLearningAgent(n_states, n_actions)
for episode in range(1000):
    state = 0
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(state, action)
        agent.update(state, action, reward, next_state)
        state = next_state
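Because the loop above depends on an environment that is not shown, here is a self-contained sketch of the same update rule on a toy problem. `ChainEnv` is a hypothetical five-state corridor invented for this example, and the hyperparameters are illustrative rather than tuned.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor: start at state 0, reward 1.0 on reaching state 4."""
    n_states = 5
    n_actions = 2  # 0 = left, 1 = right

    def reset(self):
        return 0

    def step(self, state, action):
        next_state = max(0, state - 1) if action == 0 else state + 1
        done = next_state == self.n_states - 1
        return next_state, (1.0 if done else 0.0), done


def train(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(env.n_actions)
            else:
                # Break ties at random so an untrained agent is not biased toward 'left'.
                best = max(Q[state])
                action = rng.choice([a for a, q in enumerate(Q[state]) if q == best])
            next_state, reward, done = env.step(state, action)
            # No bootstrapping from a terminal state.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


env = ChainEnv()
Q = train(env)
greedy = [max(range(env.n_actions), key=lambda a: Q[s][a]) for s in range(env.n_states - 1)]
print(greedy)  # greedy action for each non-terminal state
```

The learned values for moving right should approach 0.9 raised to the number of remaining steps minus one, which gives a quick sanity check on any tabular implementation.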

⚠ Deadly Triad Warning

When you combine off-policy learning, bootstrapping (TD updates), and function approximation, Q-values can diverge to infinity. This is the 'deadly triad'. DQN addresses it with experience replay and target networks, but the instability never fully disappears — it's a fundamental tension.

📊 Production Insight

Tabular Q-learning fails catastrophically with continuous state spaces — table size blows up.

Use function approximation (neural nets) but watch for deadly triad: off-policy + bootstrapping + function approximation can diverge.

Production rule: always clip Q-values to avoid unbounded growth; monitor Q-value distribution during training.

🎯 Key Takeaway

Q-learning learns optimal action-values directly from experience, no model needed

Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

Deadly triad is real: off-policy + bootstrap + function approx = instability

Rule: clip gradients, use target networks, and test convergence with random seeds

Exploration vs Exploitation: The Core Tension

Every RL agent faces a fundamental trade-off: should it take actions it knows are good (exploitation) or try new actions that might be better (exploration)? Too much exploration and the agent wastes time; too little and it converges to a suboptimal policy. The most common strategy is epsilon-greedy: with probability ε take a random action, otherwise take the greedy action with respect to Q-values. The epsilon parameter is typically decayed over time, starting high (e.g., 0.5 to 1.0) to encourage exploration, then annealing to a small value (e.g., 0.01) as the agent learns. More sophisticated methods include softmax (Boltzmann) action selection, where actions are sampled with probability proportional to exp(Q/τ) for a temperature τ, and Upper Confidence Bound (UCB), which adds a bonus to actions whose value estimates are still uncertain. Below is an epsilon decay schedule implementation.

io/thecodeforge/rl/exploration.pyPYTHON

import numpy as np

class EpsilonGreedySchedule:
    def __init__(self, start=1.0, end=0.01, decay_steps=10000):
        self.start = start
        self.end = end
        self.decay_steps = decay_steps

    def get_epsilon(self, step):
        fraction = min(1.0, step / self.decay_steps)
        epsilon = self.start + fraction * (self.end - self.start)
        return epsilon

# Usage (n_actions, Q, and state are assumed to come from the surrounding training loop)
schedule = EpsilonGreedySchedule(start=1.0, end=0.01, decay_steps=5000)
for step in range(10000):
    eps = schedule.get_epsilon(step)
    if np.random.random() < eps:
        action = np.random.choice(n_actions)
    else:
        action = np.argmax(Q[state])

Mental Model

Exploration as Investment

Exploration is not random noise — it's an investment in future returns. The agent pays a short-term cost to gather information that may yield higher rewards later.

Epsilon-greedy is simple but crude: treats all actions equally regardless of uncertainty.
Softmax uses Q-values to weight exploration toward promising actions.
UCB explicitly quantifies uncertainty and favors actions whose value estimates are least certain, typically those tried least often.
Thompson sampling samples from a belief distribution and has strong regret guarantees in the bandit setting.
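The softmax and UCB strategies in the list above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the temperature default, and the exploration constant `c` are assumptions, and the UCB rule shown is the UCB1-style bonus commonly used for bandits.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution: P(a) is proportional to exp(Q(a) / temperature)."""
    max_q = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - max_q) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def ucb_action(q_values, counts, t, c=2.0):
    """Pick argmax of Q(a) + c * sqrt(ln t / N(a)); untried actions go first."""
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


print(softmax_probs([1.0, 2.0, 3.0]))
print(ucb_action([1.0, 0.9], counts=[100, 1], t=101))
```

A high temperature flattens the softmax toward uniform exploration, while a low one approaches greedy selection. The UCB bonus shrinks as an action's visit count grows, so exploration fades on its own without a decay schedule.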

📊 Production Insight

Epsilon-greedy is shockingly effective but needs careful decay schedule.

Set epsilon too low too early: convergence to suboptimal policy.

Too high forever: agent never converges.

Production trick: use epsilon schedule with warm restarts to escape local optima.
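One possible reading of "warm restarts" is a schedule that decays within a cycle and then jumps back up to a smaller peak at the start of the next cycle. The sketch below follows that interpretation; the function name, `cycle_steps`, and `restart_scale` are hypothetical choices, not an established standard.

```python
def epsilon_with_restarts(step, start=1.0, end=0.01, cycle_steps=5000, restart_scale=0.5):
    """Linear decay inside each cycle; every restart begins from a smaller peak."""
    cycle, position = divmod(step, cycle_steps)
    # The gap between the peak and the floor halves (by default) on each restart.
    peak = end + (start - end) * (restart_scale ** cycle)
    fraction = position / cycle_steps
    return peak + fraction * (end - peak)


for step in (0, 2500, 4999, 5000, 7500, 10000):
    print(step, round(epsilon_with_restarts(step), 4))
```

Each restart gives the agent a burst of exploration that can knock it out of a poor local optimum, while the shrinking peak keeps later cycles from undoing what has already been learned.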

🎯 Key Takeaway

Exploration is not random noise — it's the only way to discover better returns

Epsilon-greedy: simple but requires tuning decay rate

UCB and Thompson sampling adapt exploration to uncertainty

Rule: always log exploration rate and reward variance to detect premature convergence

Deep Q-Networks: Scaling Q-Learning with Neural Nets

When the state space is too large for a table (e.g., raw pixels from a game), we use a neural network to approximate the Q-function. The original Deep Q-Network (DQN) architecture feeds raw pixel input through a convolutional neural net and outputs one Q-value per action; for low-dimensional state vectors a small fully connected network, as in the code below, is enough. Training uses two key innovations: (1) experience replay, which stores transitions (s,a,r,s') in a replay buffer and samples minibatches uniformly to break temporal correlation; (2) a target network, a separate network with frozen parameters that is periodically updated to stabilize targets. The loss is the mean squared TD error: L = E[(r + γ max_a' Q_target(s',a') - Q_online(s,a))²]. Variants like Double DQN (reduces overestimation) and Dueling DQN (separate advantage and value streams) further improve performance. Below is a minimal PyTorch DQN training loop.

io/thecodeforge/rl/dqn.pyPYTHON

import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    def forward(self, x):
        return self.fc(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99, buffer_size=10000, batch_size=64):
        self.action_dim = action_dim  # needed by act() for random exploration
        self.online = DQN(state_dim, action_dim)
        self.target = DQN(state_dim, action_dim)
        self.target.load_state_dict(self.online.state_dict())
        self.optimizer = optim.Adam(self.online.parameters(), lr=lr)
        self.gamma = gamma
        self.batch_size = batch_size
        self.replay_buffer = deque(maxlen=buffer_size)
        self.loss_fn = nn.MSELoss()

    def act(self, state, epsilon):
        if random.random() < epsilon:
            return random.randrange(self.action_dim)
        state = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():  # no gradients needed when choosing an action
            q = self.online(state)
        return q.argmax().item()

    def update(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))
        if len(self.replay_buffer) < self.batch_size:
            return
        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions).unsqueeze(1)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)

        q_values = self.online(states).gather(1, actions).squeeze(1)  # squeeze(1) keeps the batch dim when batch_size == 1
        with torch.no_grad():
            max_next = self.target(next_states).max(1)[0]
            targets = rewards + self.gamma * max_next * (1 - dones)
        loss = self.loss_fn(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.online.parameters(), 1.0)
        self.optimizer.step()

    def update_target(self):
        self.target.load_state_dict(self.online.state_dict())

🔥Key DQN Hyperparameters

Replay buffer size: 100k–1M transitions. Target network update frequency: every 1000 environment steps. Learning rate: 1e-3 to 1e-4. Gradient clipping to max norm 1.0 is a common stabilizer. Use Double DQN to reduce overestimation: select the next action with the online network but evaluate it with the target network.
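The difference between the standard and Double DQN targets comes down to which network picks the next action. The sketch below shows both for a single transition, using plain Python lists in place of network outputs so it runs without PyTorch; the function names and the numbers are illustrative.

```python
def dqn_target(reward, next_q_target, gamma, done):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return reward if done else reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-value estimates for the next state from each network.
next_q_online = [1.0, 2.0, 1.5]
next_q_target = [1.2, 1.1, 3.0]  # action 2 is probably overestimated by the target net

print(dqn_target(1.0, next_q_target, gamma=0.99, done=False))
print(double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.99, done=False))
```

The standard target trusts the inflated estimate for action 2, while the Double DQN target evaluates the action the online network prefers and so ignores it. Decoupling selection from evaluation is what reduces the upward bias.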

📊 Production Insight

Experience replay buffer memory can dominate RAM — store observations as compressed tensors.

Target network update frequency is a critical hyperparameter; too slow → stale targets, too fast → instability.

Double DQN solves overestimation bias; Dueling DQN separates advantage and value.

Production rule: always monitor replay buffer diversity — if it becomes homogeneous, performance degrades.

🎯 Key Takeaway

DQN replaces Q-table with a neural net trained on minibatches from replay buffer

Two networks: online (learns) and target (stable Q-targets) — fixed interval copy

Experience replay breaks temporal correlation — crucial for convergence

Rule: replay buffer size should be large enough to cover diverse states, but not so large that old experiences dominate

From DQN to PPO: Policy Gradient Methods

While value-based methods learn Q-values and derive a deterministic policy (argmax), policy gradient methods directly learn a parameterized policy π(a|s;θ) by following the gradient of expected return. The REINFORCE algorithm (Williams, 1992) updates θ in the direction of ∇_θ log π(a|s;θ) · G, where G is the cumulative discounted return. This estimate is unbiased but has high variance. Actor-critic methods reduce variance by learning a value function (the critic) that provides a baseline. Proximal Policy Optimization (PPO) is one of the most widely used policy gradient methods; it uses a clipped surrogate objective that prevents the policy from changing too much in a single update. The PPO objective: L_clip(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t ) ], where r_t(θ) is the probability ratio of the new to old policy, A_t is the advantage estimate, and ε is a clipping hyperparameter (typically 0.2). PPO is more stable than vanilla policy gradients and easier to tune than DDPG or TRPO.

io/thecodeforge/rl/ppo.pyPYTHON

import torch
import torch.nn as nn
import torch.optim as optim

class PPOTrainer:
    def __init__(self, actor_critic, lr=3e-4, eps_clip=0.2, gamma=0.99, lam=0.95, K_epochs=4):
        self.model = actor_critic
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.eps_clip = eps_clip
        self.gamma = gamma
        self.lam = lam
        self.K_epochs = K_epochs

    def update(self, old_log_probs, actions, advantages, rewards_to_go, states):
        old_log_probs = old_log_probs.detach()
        for _ in range(self.K_epochs):
            log_probs = self.model.get_log_prob(states, actions)
            ratios = torch.exp(log_probs - old_log_probs)
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = nn.MSELoss()(self.model.get_value(states), rewards_to_go)
            total_loss = actor_loss + 0.5 * critic_loss

            self.optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
            self.optimizer.step()

# Usage assumes actor_critic has get_log_prob and get_value methods
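The trainer above takes `advantages` and `rewards_to_go` as inputs and stores `gamma` and `lam` without using them. Those two parameters normally feed Generalized Advantage Estimation (GAE), which is a common way to produce both inputs. The sketch below is one minimal way to compute them from a rollout, written with plain lists; the function name and argument layout are assumptions.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout using GAE(gamma, lambda).

    values[t] is the critic's estimate for the state at step t, and last_value
    is its estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]  # targets for the critic
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
    dones=[False, True], gamma=1.0, lam=1.0,
)
print(adv, ret)
```

With lam set to 1 the estimate reduces to the Monte Carlo return minus the baseline (low bias, high variance); with lam set to 0 it reduces to the one-step TD error (higher bias, lower variance).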

💡When PPO Beats DQN

PPO is the go-to for continuous control tasks (robotics, simulation) and when you need stable training. DQN is better for discrete actions with limited compute budget. If you have the resources, run both — PPO often wins on final performance, DQN trains faster per step.

📊 Production Insight

PPO's clipped surrogate objective prevents large policy updates — but the clipping parameter ε is sensitive.

If ε too small, policy barely changes; too large, instability returns.

Entropy bonus helps exploration but must be annealed.

Production rule: monitor KL divergence between old and new policies; if it spikes, reduce learning rate.
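The KL check in that rule can be estimated from the same log-probabilities PPO already computes. The sketch below uses a low-variance sample estimator, the mean of (ratio - 1) - log(ratio) over actions sampled from the old policy. The `target_kl` value, the 1.5 multiplier, and the halving factor are assumptions for illustration, not recommended settings.

```python
import math


def approx_kl(old_log_probs, new_log_probs):
    """Estimate KL(old || new) from log-probs of actions sampled under the old policy."""
    total = 0.0
    for old_lp, new_lp in zip(old_log_probs, new_log_probs):
        log_ratio = new_lp - old_lp
        total += (math.exp(log_ratio) - 1.0) - log_ratio  # non-negative for every sample
    return total / len(old_log_probs)


def adjust_learning_rate(lr, kl, target_kl=0.01, factor=0.5):
    """Shrink the learning rate when the policy moved much further than intended."""
    return lr * factor if kl > 1.5 * target_kl else lr


old = [-1.0, -0.7, -2.3]
new = [-0.5, -0.9, -2.0]
kl = approx_kl(old, new)
print(round(kl, 4), adjust_learning_rate(3e-4, kl))
```

Logging this value once per update costs almost nothing and tends to spike before the reward curve collapses, which makes it a useful early-warning signal.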

🎯 Key Takeaway

Policy gradients optimize the policy directly via gradient ascent on expected return

PPO uses clipped objective to take stable steps without overcorrecting

Actor-critic reduces variance by learning a baseline (value function)

Rule: always monitor KL divergence and entropy during PPO training — they flag instability early

RLHF: How LLMs Are Trained with Human Preferences (2026 Standard)

Reinforcement Learning from Human Feedback (RLHF) is the technique behind aligning large language models (LLMs) like ChatGPT, Claude, and Gemini with human values. The 2026 standard for RLHF consists of three stages. First, supervised fine-tuning (SFT) on high-quality human demonstrations to teach the model basic instruction following. Second, training a reward model on human comparisons: humans rank model outputs, and the reward model learns to predict human preference scores. Third, fine-tuning the LLM using PPO to maximize the reward model's score while staying close to the SFT model (via a KL penalty), so the policy neither drifts into regions where the reward model is unreliable nor forgets what it learned during SFT. The result is a model that not only generates coherent text but also aligns with what humans consider helpful, harmless, and honest. The entire pipeline is notoriously compute-intensive and sensitive to reward model quality. If the reward model learns spurious correlations (e.g., prefers longer answers regardless of correctness), the LLM will exploit them, which is a form of reward hacking.

io/thecodeforge/rl/rlhf_ppo.pyPYTHON

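import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Sketch against TRL's legacy PPOTrainer API (trl releases before 0.12); newer releases changed this interface.
# "reward-model" is a placeholder checkpoint name, assumed to share the GPT-2 tokenizer and emit a single logit.

# Load models
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")  # PPO needs a value head
reward_model = AutoModelForSequenceClassification.from_pretrained("reward-model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4
)

# ref_model=None lets the trainer create a frozen reference copy for the KL penalty
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

# Training loop
for epoch in range(10):
    queries = ["Explain RLHF"] * config.batch_size
    # The legacy API expects lists of 1-D tensors, one per sample
    query_tensors = [tokenizer.encode(q, return_tensors='pt')[0] for q in queries]
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=32,
        pad_token_id=tokenizer.eos_token_id
    )
    responses = tokenizer.batch_decode(response_tensors)

    # Get rewards from reward model
    with torch.no_grad():
        reward_input = tokenizer(responses, return_tensors='pt', padding=True)
        scores = reward_model(**reward_input).logits.squeeze(-1)
    rewards = [score for score in scores]  # one scalar tensor per sample

    # PPO step
    train_stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    print(f"Epoch {epoch}: reward {scores.mean().item()}")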
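The loop above assumes a reward model already exists. The second RLHF stage, training that reward model on human comparisons, is commonly done with a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the rejected one. The sketch below computes that loss on plain floats; the function name is illustrative, and a real pipeline would compute it on model outputs and backpropagate through it.

```python
import math


def pairwise_preference_loss(chosen_scores, rejected_scores):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs."""
    losses = []
    for chosen, rejected in zip(chosen_scores, rejected_scores):
        margin = chosen - rejected
        # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


print(pairwise_preference_loss([2.0], [0.0]))  # preferred response scored higher: small loss
print(pairwise_preference_loss([0.0], [2.0]))  # ranking violated: large loss
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. That is one reason reward scores are often normalized before they are fed to PPO.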

💡RLHF in 2026: Key Best Practices

Always use a KL penalty term to prevent the LLM from drifting too far from the SFT model. The reward model should be validated on held-out comparisons to detect overfitting. Use multiple reward models (ensemble) for robustness. Prefer Direct Preference Optimization (DPO) as a simpler alternative to PPO-based RLHF when compute is limited.
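The KL penalty can be pictured as a correction applied to the reward model's score before PPO sees it. The sketch below shows a sequence-level version under simplifying assumptions: the per-token log-probabilities are given as plain lists, `beta` is a hypothetical coefficient, and the summed log-ratio is used as the KL estimate. Libraries typically apply the penalty per token instead.

```python
def kl_shaped_reward(reward_model_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward model score minus beta times the summed per-token log-ratio."""
    kl_estimate = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward_model_score - beta * kl_estimate


# Policy identical to the SFT reference: no penalty.
print(kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
# Policy has become more confident than the reference on its own samples: penalized.
print(kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1))
```

A larger `beta` keeps the model closer to the SFT behavior at the cost of slower reward improvement; a smaller one allows faster gains but makes reward hacking more likely.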

📊 Production Insight

RLHF reward hacking is subtle: the LLM may learn to output safer, shorter responses to game the reward model. Monitor both reward model scores and downstream task metrics. In production, deploy reward model ensembles and use a canary set to detect reward model drift. The 2026 standard includes adversarial training against reward model gaming.

🎯 Key Takeaway

RLHF aligns LLMs with human preferences via SFT, reward modeling, and PPO. Reward model hacking is a real threat—always validate with holdout metrics.

Production MLOps for RL: Monitoring, Reproducibility, Rollback

Deploying RL to production is harder than deploying supervised models because the environment is dynamic — it changes as the agent interacts with it. Three critical practices: (1) Reproducibility: RL is highly sensitive to random seeds and hyperparameters. Always log training config, seed, and environment version. Use configuration files (YAML/JSON) and version control for all parameters. (2) Monitoring: Track not just reward, but also episode length, Q-value distribution, exploration rate, and auxiliary business metrics. Set up alerts for reward divergence or flatlining. (3) Rollback: Maintain a safe fallback policy. Deploy new policies with a shadow deployment first — have both old and new in production, comparing their decisions. If the new policy's Q-values drop below a threshold, fall back to the safe policy automatically. Below is a simple model serving wrapper with fallback.

io/thecodeforge/rl/serving.pyPYTHON

import numpy as np

class RLPipeline:
    def __init__(self, primary_policy, fallback_policy, q_threshold=0.1):
        self.primary = primary_policy
        self.fallback = fallback_policy
        self.q_threshold = q_threshold

    def decide(self, state):
        # Primary policy check
        q_values = self.primary.get_q_values(state)
        best_action = np.argmax(q_values)
        max_q = np.max(q_values)

        if max_q < self.q_threshold:
            # Fallback: use a safe conservative policy
            action = self.fallback.decide(state)
            return action, 'fallback'
        else:
            return best_action, 'primary'

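# Deployment
# The primary policy must expose get_q_values(state), so use a value-based model such as a DQN
# (a PPO actor has no Q-values); `dqn_policy`, `safe_policy`, and `env` are assumed to exist.
pipeline = RLPipeline(primary_policy=dqn_policy, fallback_policy=safe_policy)
state = env.get_state()
action, mode = pipeline.decide(state)
env.step(action)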
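The shadow deployment mentioned in the rollback practice can be sketched as a comparison loop in which only the live policy's action is executed and the candidate's action is merely logged. The function name and the representation of policies as plain callables are assumptions made for this example.

```python
def shadow_compare(states, live_policy, candidate_policy):
    """Return (agreement rate, list of disagreements) without acting on the candidate."""
    disagreements = []
    for state in states:
        live_action = live_policy(state)          # this action would actually be executed
        shadow_action = candidate_policy(state)   # logged only, never executed
        if live_action != shadow_action:
            disagreements.append((state, live_action, shadow_action))
    agreement = 1.0 - len(disagreements) / len(states)
    return agreement, disagreements


# Hypothetical policies over integer states.
live = lambda s: 0 if s < 5 else 1
candidate = lambda s: 0 if s < 7 else 1

rate, diffs = shadow_compare(list(range(10)), live, candidate)
print(rate, diffs)
```

The disagreement list is usually more valuable than the rate itself: reviewing the states where the two policies differ shows whether the candidate has learned something better or found a loophole.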

📊 Production Insight

RL training is highly sensitive to random seeds — two runs with different seeds can produce completely different policies.

Always log the seed and hyperparameters; use a configuration file.

Model rollback in production is tricky because the environment evolves; maintain a shadow policy for A/B testing.

Production rule: serve policies with a fallback safety policy that takes over when Q-values drop below a threshold.

🎯 Key Takeaway

RL reproducibility requires fixed seeds, deterministic environments, and full config logging

A/B test policies in a shadow environment before full rollout

Monitor reward distributions in production: drift means the environment has changed

Rule: always have a safe fallback policy for safety-critical deployments
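For the reproducibility point, a small helper that seeds the random number generator and fingerprints the configuration is often enough to make runs traceable. The sketch below uses only the standard library and seeds Python's `random` module; a real training job would also seed NumPy, the deep learning framework, and the environment. The function name and config keys are hypothetical.

```python
import hashlib
import json
import random


def start_run(config):
    """Seed the RNG and return (run_id, serialized config) for logging."""
    random.seed(config["seed"])
    payload = json.dumps(config, sort_keys=True)  # stable key order gives a stable hash
    run_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return run_id, payload


config = {"seed": 7, "gamma": 0.99, "lr": 3e-4, "env_version": "v1"}
run_id, payload = start_run(config)
print(run_id, payload)
```

Storing the run ID next to every checkpoint and metric makes it possible to tell later which exact configuration produced a given policy.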

Production Environment Design: MDP Design Patterns

Designing the MDP for a production RL system is more art than science. Real-world environments are rarely neat fully-observed finite MDPs. Common patterns include: (1) Partial Observability (POMDP) — the agent sees only a subset of the true state. Mitigate by stacking frames, using RNNs, or adding memory. (2) Delayed Rewards — reward arrives long after the action that caused it. Use eligibility traces or n-step returns to propagate credit. (3) Multi-Agent Environments — multiple agents interact, creating non-stationarity. Use centralized training with decentralized execution (CTDE) or shared reward structures. (4) Safety Constraints — define a safe set of states and penalize violations. Use constrained MDP (CMDP) or Lagrangian methods. (5) Hierarchical RL — decompose long-horizon tasks into subgoals with a manager and workers. The key is to expose exactly the right amount of information: too much state causes the curse of dimensionality; too little violates the Markov property. Below is a pattern for handling partial observability by wrapping an environment with a frame stack wrapper.

io/thecodeforge/rl/pomdp_wrapper.pyPYTHON

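# Written against the classic Gym API (gym < 0.26): reset() returns obs and
# step() returns a 4-tuple. Newer gym/gymnasium return (obs, info) and a 5-tuple.
import gym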
from collections import deque
import numpy as np

class FrameStackWrapper(gym.Wrapper):
    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)
        obs_space = env.observation_space
        self.observation_space = gym.spaces.Box(
            low=obs_space.low.min(),
            high=obs_space.high.max(),
            shape=(k, *obs_space.shape),
            dtype=obs_space.dtype
        )

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames), reward, done, info

# Usage
env = gym.make('CartPole-v1')
env = FrameStackWrapper(env, k=4)
obs = env.reset()  # shape (4, 4) for cartpole

⚠ MDP Design Pitfall: The Markov Property

If your state representation does not contain all necessary history, the environment is a POMDP. Common mistakes: using raw pixel observations without stacking, or dropping sensor readings. Verify the Markov property by testing whether the next state can be predicted from the current state and action alone; if it cannot, add context.
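The predictability test described above can be approximated with a quick regression check: fit one predictor of the next observation from the current observation only, and another that also sees the previous observation. A large drop in error with history suggests hidden state. The toy system below (observed position, hidden velocity) and the linear predictor are assumptions chosen to keep the sketch small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system (assumed for illustration): position is observed, velocity is hidden.
T = 2000
pos = np.zeros(T)
vel = np.zeros(T)
vel[0] = 1.0
for t in range(T - 1):
    vel[t + 1] = 0.95 * vel[t] + rng.normal(scale=0.1)
    pos[t + 1] = pos[t] + 0.1 * vel[t + 1]

def prediction_mse(features, targets):
    """Fit a linear least-squares predictor and return its mean squared error."""
    X = np.column_stack([features, np.ones(len(features))])  # add a bias column
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return float(np.mean((X @ coef - targets) ** 2))

target = pos[2:]                                        # next observation
current_only = pos[1:-1].reshape(-1, 1)                 # o_t
with_history = np.column_stack([pos[1:-1], pos[:-2]])   # o_t and o_{t-1}

mse_current = prediction_mse(current_only, target)
mse_history = prediction_mse(with_history, target)
print(f"MSE from current obs only:    {mse_current:.6f}")
print(f"MSE with one step of history: {mse_history:.6f}")
```

This is an in-sample comparison, so a small improvement from extra features is expected; only a large gap is evidence that the observation alone is not Markov. On real data, use a held-out split and a predictor flexible enough for your dynamics.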

📊 Production Insight

Production environments often have hidden variables (server load, time of day). Include time-stamped features and rolling statistics to capture non-stationarity. Use domain randomization to make policies robust to environment variability. Always log environment parameters and reset distributions to detect drift.

🎯 Key Takeaway

Real MDPs are messy: partial observability, delayed rewards, safety constraints. Design state space to capture necessary history while avoiding the curse of dimensionality. Use wrappers and normalization for robustness.
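For the delayed-reward pattern, an n-step target shows how credit reaches earlier actions: the target sums n discounted rewards and then bootstraps from a value estimate. A minimal sketch, where `n_step_target` and `bootstrap_value` are illustrative names rather than part of any library:

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: r_0 + g*r_1 + ... + g^(n-1)*r_(n-1) + g^n * V(s_n)."""
    target = bootstrap_value
    # Work backwards so each reward is discounted the right number of times.
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Reward arrives only on the third step; the bootstrap value estimates what follows.
print(n_step_target([0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.9))
```

With a one-step target, the first action in this example would see no reward at all until the value estimate had propagated back; the three-step target credits it immediately.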

Keras/TensorFlow Implementation of DQN

While PyTorch dominates the RL research landscape, TensorFlow and Keras remain popular in production due to TF Serving and TFX integration. Below is a complete Keras implementation of a Deep Q-Network for the CartPole environment. The code demonstrates key components: replay buffer, target network updates, and gradient clipping. This implementation mirrors the PyTorch DQN example earlier, allowing a side-by-side comparison.

io/thecodeforge/rl/dqn_tf.pyPYTHON

import tensorflow as tf
from tensorflow import keras
import numpy as np
from collections import deque
import random

class DQNAgentTF:
    def __init__(self, state_dim, action_dim, learning_rate=0.001, gamma=0.99):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.memory = deque(maxlen=10000)
        self.batch_size = 64
        
        # Online network
        self.online = keras.Sequential([
            keras.layers.Dense(128, activation='relu', input_shape=(state_dim,)),
            keras.layers.Dense(128, activation='relu'),
            keras.layers.Dense(action_dim, activation='linear')
        ])
        self.online.compile(optimizer=keras.optimizers.Adam(learning_rate))
        
        # Target network (frozen)
        self.target = keras.models.clone_model(self.online)
        self.target.set_weights(self.online.get_weights())
        
    def act(self, state, epsilon):
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        q_values = self.online.predict(state[np.newaxis], verbose=0)
        return np.argmax(q_values[0])
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    def replay(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([e[0] for e in batch])
        actions = np.array([e[1] for e in batch])
        rewards = np.array([e[2] for e in batch])
        next_states = np.array([e[3] for e in batch])
        dones = np.array([e[4] for e in batch])
        
        # Compute targets
        next_q = np.max(self.target.predict(next_states, verbose=0), axis=1)
        targets = rewards + self.gamma * next_q * (1 - dones)
        
        # Current Q values
        q_values = self.online.predict(states, verbose=0)
        q_values[range(self.batch_size), actions] = targets
        
        # Train online network
        with tf.GradientTape() as tape:
            pred = self.online(states, training=True)
            loss = tf.reduce_mean(tf.square(q_values - pred))
        grads = tape.gradient(loss, self.online.trainable_variables)
        grads = [tf.clip_by_norm(g, 1.0) for g in grads]
        self.online.optimizer.apply_gradients(zip(grads, self.online.trainable_variables))
    
    def update_target(self):
        self.target.set_weights(self.online.get_weights())

📊 Production Insight

TensorFlow 2.x has a steeper learning curve for custom training loops compared to PyTorch, but TF Serving makes model deployment straightforward. In production, avoid model.predict for single-state action selection: it carries per-call overhead meant for large batches, so call the model directly (optionally wrapped in tf.function) instead. Keras is fine for prototyping, but for high-throughput RL serving, export to SavedModel and use TF Serving with batching.

🎯 Key Takeaway

Keras/TF implementation of DQN mirrors PyTorch: replay buffer, target network, gradient clipping. Use tf.GradientTape for custom training. TF Serving simplifies deployment.

RL Algorithm Comparison Matrix: Convergence, Action Space, and Stability

Choosing the right RL algorithm for a production system depends on the problem's action space, required stability, and convergence speed. Below is a comprehensive comparison matrix based on empirical results from the 2025-2026 RL literature. The matrix includes sample efficiency, convergence guarantees, stability under hyperparameter variation, and recommended use cases.

Algorithm	Action Space	Convergence	Stability	Sample Efficiency	When to Use
Tabular Q	Discrete (2-64)	Guaranteed (finite MDP)	High	High (small states)	Toy problems, discrete low-dim
DQN	Discrete high-dim	No guarantee (nonlinear approx)	Medium	Medium	Atari, game playing
Double DQN	Discrete high-dim	No guarantee	Medium-High	Medium	DQN baseline with reduced overestimation
PPO	Discrete/Continuous	No guarantee (clipped update)	High	Low-Medium	Robotics, LLM RLHF, production default
SAC	Continuous	No guarantee (entropy max)	High	High	Continuous control, sample-efficient
DDPG	Continuous	No guarantee (deterministic)	Low	High	Continuous control (outperformed by SAC)
A2C	Discrete/Continuous	No guarantee	Medium	Low	Fast experimentation

Empirical recommendation: Start with PPO for new projects—it is the least sensitive to hyperparameters. For sample-constrained problems, use SAC. For discrete action spaces with large state spaces, use DQN with double DQN and dueling architecture.
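The difference between the DQN and Double DQN targets is small in code. In the sketch below, the Q-value arrays stand in for network outputs on a batch of next states, and the function names are illustrative:

```python
import numpy as np

def dqn_targets(rewards, dones, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return rewards + gamma * next_q_target.max(axis=1) * (1.0 - dones)

def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    best_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * evaluated * (1.0 - dones)

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])          # second transition is terminal
next_q_online = np.array([[1.0, 2.0], [0.5, 0.1]])
next_q_target = np.array([[3.0, 1.5], [0.2, 0.4]])

print(dqn_targets(rewards, dones, next_q_target))
print(double_dqn_targets(rewards, dones, next_q_online, next_q_target))
```

In the first row the target network's maximum (3.0) is not the action the online network prefers, so the Double DQN target is lower. Decoupling selection from evaluation is what reduces overestimation.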

📊 Production Insight

No single algorithm dominates. For discrete actions with limited compute, DQN is still a strong choice. For continuous control, SAC is the sample-efficient champion. PPO offers the most stable training curve, making it the default for high-stakes applications. Use the matrix to shortlist: if you need guaranteed convergence in the tabular case, choose Q-learning. If you need conservative policy updates, use PPO; note that its clipping limits how far each update moves the policy and does not by itself make exploration safe.

🎯 Key Takeaway

Algorithm choice depends on action space (discrete vs continuous), stability needs (PPO is most stable), and sample efficiency (SAC > DQN > PPO). Always benchmark at least two algorithms on your specific environment.

Reward Engineering: Why Your Agent Learns the Wrong Thing

Your reward function is not a suggestion. It's the law. Get it wrong, and your agent will optimize for the exact behavior you didn't want — like a robot learning to knock over a glass just to reset it for another reward.

Reward engineering is the most underrated production skill in RL. It's not about 'designing a good function.' It's about debugging what your agent actually treats as success. A sparse reward (only +1 at goal) forces long search horizons. A dense reward (penalize distance, yaw, etc.) can create local minima that the agent exploits without ever solving the real task.

Here's the production rule: start with a sparse reward that's unambiguous. Then add shaped rewards only when you can prove the sparse version doesn't converge fast enough. And always, always add a penalty for behaviors that game the reward — like penalizing excessive movement energy if your agent is supposed to end at rest.

Treat reward engineering like error handling: test edge cases, log reward components separately, and never trust your first pass.

RewardShapeDebugger.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import numpy as np

# Example: debugging a shaped reward for a navigation task
def compute_reward(agent_position, target_position, is_collision, energy):
    # Compute raw components first so they can be logged separately
    distance = np.linalg.norm(agent_position - target_position)
    
    # Sparse: clear success/failure
    if is_collision:
        return -100.0  # Hard penalty, unambiguous
    if distance < 0.5:
        return 100.0   # Big reward for completion
    
    # Shaped: meant to speed up training. This per-step distance penalty is not
    # potential-based, so it can change the optimal policy.
    # Danger: if collisions are likely, standing still (-0.1 * distance per step)
    # can look better to the agent than risking the -100 collision penalty.
    reward = -0.1 * distance  # Encourage moving closer
    # Energy penalty to discourage oscillation
    reward -= 0.01 * energy
    
    return reward

Output

# No direct output — this is a utility function used inside the training loop.

# Expected side effect: training converges ~2x faster than sparse-only,

# but logs must show no sign of reward-hacking (e.g., staying still or spinning).

⚠ Production Trap: Reward Shaping Confusion

Don't add a 'distance penalty' thinking it just speeds training. If your goal is 'reach the target, then stop,' a distance penalty alone gives almost no signal once the agent is close, so nothing discourages it from oscillating near the target indefinitely. Always pair it with a time-step penalty or energy cost.

🎯 Key Takeaway

Your reward function is a loss function for behavior. Debug it like one: log components separately, test on random rollouts, and never assume the agent shares your intent.
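If a shaped term must not change the optimal policy, potential-based shaping (Ng, Harada, and Russell, 1999) is the standard construction: the bonus is the discounted change in a potential function. The sketch below uses negative distance to the target as the potential; `TARGET`, `potential`, and `shaping_bonus` are illustrative names, not part of the example above.

```python
import numpy as np

GAMMA = 0.99
TARGET = np.array([5.0, 5.0])  # hypothetical goal position

def potential(position):
    """Potential function: less negative the closer the agent is to the target."""
    return -np.linalg.norm(position - TARGET)

def shaping_bonus(position, next_position, gamma=GAMMA):
    """Potential-based shaping term F = gamma * phi(s') - phi(s)."""
    return gamma * potential(next_position) - potential(position)

toward = shaping_bonus(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
away = shaping_bonus(np.array([1.0, 1.0]), np.array([0.0, 0.0]))
print(f"step toward target: {toward:+.3f}, step away: {away:+.3f}")
```

Because the discounted bonuses telescope along any trajectory, their sum depends only on the start and end states, which is why this form cannot reward looping behavior the way an arbitrary dense term can.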

Training Instability: Why Your Loss Curves Lie and How to Catch It

Your TensorBoard loss curve looks beautiful — smooth, monotonically decreasing. That tells you absolutely nothing about whether your RL agent is learning a useful policy. RL training is not supervised learning. A decreasing TD-error can just mean your Q-network is overfitting to stale transitions.

The primary instability in off-policy RL (DQN, SAC, TD3) is the deadly triad: function approximation, bootstrapping, and off-policy data. This combination can cause catastrophic divergence without warning. The classic symptom: the agent suddenly collapses to random performance after hours of 'stable' training.

How do you catch this in production? Stop relying on loss curves. Use three metrics: (1) episode reward over the last 100 episodes (rolling window), (2) Q-value overestimation: the gap between predicted Q-values and actual returns on held-out trajectories, (3) action distribution entropy — if it drops to near-zero, your policy has collapsed.

Log all three every 1000 steps. Set an alarm if rolling reward drops more than 20% below the best 100-episode average. That's your signal to reload a checkpoint and adjust hyperparameters — not to keep training through the crash.

TrainingHealthMonitor.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import numpy as np

def compute_training_metrics(q_network, replay_buffer, recent_episodes,
                             current_policy_entropy, best_rolling=-np.inf, gamma=0.99):
    # Assumed interfaces: q_network.predict(states) -> array (batch, n_actions);
    # replay_buffer.sample(n) -> (states, actions, rewards, next_states, dones).

    # Metric 1: episode reward (real performance, not loss)
    rolling_reward = np.mean([ep['reward'] for ep in recent_episodes[-100:]])
    # Pass the previous best back in on every call; recomputing it from the
    # current value alone would make the collapse alert impossible to trigger.
    best_rolling = max(best_rolling, rolling_reward)
    
    # Metric 2: Q-value gap on sampled transitions
    states, actions, rewards, next_states, dones = replay_buffer.sample(512)
    predicted_qs = q_network.predict(states)[np.arange(len(actions)), actions]
    # One-step TD target: a bootstrapped proxy. For true overestimation, compare
    # against Monte Carlo returns from held-out trajectories.
    td_targets = rewards + gamma * np.max(q_network.predict(next_states), axis=1) * (1 - dones)
    overestimation_gap = np.mean(predicted_qs - td_targets)
    
    # Metric 3: entropy (below 0.1, action diversity is effectively gone)
    entropy = current_policy_entropy
    
    # Alert conditions (abs() keeps the 20% rule meaningful for negative rewards)
    if rolling_reward < best_rolling - 0.2 * abs(best_rolling):
        print(f"ALERT: Reward collapse: rolling={rolling_reward:.2f}, best={best_rolling:.2f}")
    if overestimation_gap > 10.0:
        print(f"WARNING: Q-overestimation gap = {overestimation_gap:.2f}, instability likely")
    if entropy < 0.1:
        print(f"WARNING: Policy collapse: entropy = {entropy:.4f}")
    
    return {
        'rolling_reward': rolling_reward,
        'best_rolling': best_rolling,
        'overestimation_gap': overestimation_gap,
        'entropy': entropy
    }

# Example metrics after 10k steps (training stable):
# Rolling reward: 42.3, Overestimation gap: 3.2, Entropy: 1.4
# No alerts triggered.

Output

Rolling reward: 42.3, Overestimation gap: 3.2, Entropy: 1.4

No alerts triggered.

⚠ Never Do This: Trusting Loss Curves Alone

A decreasing TD-error with a diverging policy is the signature failure of RL training. Always pair loss logging with the three metrics above. If you only watch loss, you'll miss the crash until your agent is flailing uselessly.

🎯 Key Takeaway

RL training metrics are a lie detector test. Loss curves pass; episode reward, Q-overestimation, and entropy convict. Monitor all three.
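The entropy metric is cheap to compute when the policy exposes action probabilities or logits. A minimal sketch, assuming a discrete action space; the helper names are illustrative:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_entropy(action_probs, eps=1e-12):
    """Shannon entropy (in nats) of one action distribution."""
    p = np.asarray(action_probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

uniform = softmax([0.0, 0.0, 0.0, 0.0])     # undecided policy over 4 actions
collapsed = softmax([20.0, 0.0, 0.0, 0.0])  # nearly deterministic policy
print(f"uniform: {policy_entropy(uniform):.3f}, collapsed: {policy_entropy(collapsed):.6f}")
```

The maximum for n actions is ln(n), about 1.386 for four actions, so a fixed threshold such as 0.1 should be read relative to the size of the action space. In practice, average the entropy over a batch of states rather than a single one.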

Types of Reinforcements: Sparse, Shaped, and the Feedback Trap

You don't just toss rewards at an agent. Reinforcement type defines how fast it learns—and what it breaks. Sparse reinforcement gives a reward only at terminal states. Hard to explore, but the agent often discovers robust strategies because it can't cheat intermediate signals. Shaped reinforcement adds dense rewards along the path. Faster convergence, but you're now designing a reward function that can backfire spectacularly (see: Reward Engineering section).

Production teams default to sparse plus a small auxiliary shaped bonus, tuned via ablation. The trap is assuming more feedback equals better learning. It doesn't. The agent will optimize the shaped signal, not the task. You need to match reinforcement type to environment complexity. Sparse for simple terminal goals; shaped only when you can prove the intermediate rewards don't induce shortcut behavior.

reinforcement_types.pyPYTHON

# io.thecodeforge - ml-ai tutorial

class SparseReward:
    def __init__(self, goal_threshold):
        self.threshold = goal_threshold

    def __call__(self, state):
        return 1.0 if state >= self.threshold else 0.0

class ShapedReward:
    def __init__(self, goal, weight=0.2):
        self.goal = goal
        self.weight = weight

    def __call__(self, state, prev_state):
        # Positive when the agent moved closer to the goal, negative when it moved away
        progress = abs(prev_state - self.goal) - abs(state - self.goal)
        return self.weight * progress

sparse = SparseReward(100.0)
shaped = ShapedReward(100.0)
for step in [10, 50, 95, 100]:
    r = sparse(step)
    print(f"State {step}: sparse={r}")

Output

State 10: sparse=0.0

State 50: sparse=0.0

State 95: sparse=0.0

State 100: sparse=1.0

⚠ Production Trap:

Shaped rewards often produce agents that optimize for the proxy signal. Test your shaped reward by removing it—if the agent collapses, your reward is the only thing driving behavior. Not the task.

🎯 Key Takeaway

Sparse rewards are safer; shaped rewards are faster but fragile. Always validate shaped rewards with ablation.

Application: Where RL Actually Works in Production (No, Not Games)

RL isn't just for Atari or chess. Production deployments cluster into three domains: recommendation systems, resource optimization, and robotics. In recommender systems, RL models sequential user interactions as an MDP—each recommendation is an action, reward is engagement (clicks, watch time, purchase). Companies like Netflix and YouTube use policy gradient methods to optimize beyond simple supervised ranking.

Resource optimization includes data center cooling (DeepMind cut Google's cooling bill by 40%), supply chain routing, and ad bid optimization. These have clear state-action spaces and delayed rewards—classic RL territory. Robotics remains the hardest deployment due to simulation-to-reality gap, but companies like Boston Dynamics and warehouse automation firms use constrained PPO variants.

The common thread: a well-defined MDP with measurable, delayed rewards. If your problem lacks a simulator or cheap data collection, RL is premature. If you can simulate millions of episodes cheaply, RL can outperform hand-tuned heuristics, sometimes by 10–30%.

recommendation_mdp.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import random

class RecEnv:
    def __init__(self):
        self.user_state = {"history": [], "engagement": 0.5}

    def step(self, action):
        # action: item id
        reward = random.uniform(0, 1) if action < 100 else -0.5
        self.user_state["engagement"] += reward * 0.1
        return self.user_state, reward, reward > 0.8

env = RecEnv()
state, reward, done = env.step(42)
print(f"Engagement: {state['engagement']:.3f}, Reward: {reward:.3f}, Done: {done}")

Output

Engagement: 0.573, Reward: 0.734, Done: False

# Sample run; the reward is random, so values differ between runs.

🔥Senior Shortcut:

Before building any RL system, ask: can I generate 1M+ cheap transitions? If no, RL won't outperform a simple supervised baseline. If yes, start with a Q-learning variant on your largest possible offline dataset.

🎯 Key Takeaway

RL in production demands massive cheap data generation or a high-fidelity simulator. Without those, stick to supervised or bandit approaches.

Disadvantages: When RL Fails and Why It’s Not a Silver Bullet

Reinforcement learning demands massive sample counts: millions of episodes even for simple tasks like robotic reaching. Each failure costs real environment time, unlike supervised learning where data is static. Sparse rewards compound this: an agent can wander aimlessly for hours.

Training instability is the second killer: Q-values oscillate, policy gradients collapse, and reward hacking emerges. Reproducibility suffers because environments (simulators, hardware) have hidden state. Scaling to high-dimensional or continuous action spaces needs purpose-built algorithms like PPO or SAC, not brute force.

Production RL adds debugging hell: you can’t inspect “why” a policy chose a particular action months ago without versioning everything, including environments, seeds, and checkpoints. The promise of autonomous learning is real, but deploying RL means accepting roughly 10x the engineering cost of supervised models. Skip RL if you have small data, deterministic tasks, or strict safety requirements, at least until formal verification matures.

RLCostAnalysis.pyPYTHON

# io.thecodeforge - ml-ai tutorial

def simulate_cost(episodes, env_steps_per_episode):
    total_steps = episodes * env_steps_per_episode
    hours = total_steps / 3600  # assume 1 step/sec
    return total_steps, hours

steps, hours = simulate_cost(1_000_000, 200)
print(f"Sample cost: {steps:,} steps → ~{hours:.1f} hours")
# Output: Sample cost: 200,000,000 steps → ~55555.6 hours
# Real DQN needs 50M steps for Atari.

Output

Sample cost: 200,000,000 steps → ~55555.6 hours

⚠ Production Trap:

Never trust a single RL run. Metrics can vary by 50% or more across seeds. Repeat every experiment at least 5 times with different random seeds and report the spread, not just the best run.

🎯 Key Takeaway

RL is not drop-in AI. Budget 10x training cost over supervised models, and expect non-deterministic failures.
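Reporting results across seeds takes only a few lines once a training run is wrapped in a function. In the sketch below, `run_experiment` is a hypothetical placeholder that simulates seed-to-seed variation; swap its body for your own train-and-evaluate code.

```python
import numpy as np

def run_experiment(seed):
    """Stand-in for a full training run; returns a final evaluation score.

    Hypothetical placeholder: replace the body with your own training code.
    """
    rng = np.random.default_rng(seed)
    return 100.0 + rng.normal(scale=25.0)  # simulated seed-to-seed variation

seeds = [0, 1, 2, 3, 4]
scores = np.array([run_experiment(s) for s in seeds])

mean = scores.mean()
std = scores.std(ddof=1)  # sample standard deviation across seeds
print(f"score over {len(seeds)} seeds: {mean:.1f} +/- {std:.1f} "
      f"(min {scores.min():.1f}, max {scores.max():.1f})")
```

Logging the seed list next to the result makes any single run reproducible later.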

High-Level Overview: RL in One Diagram

Reinforcement learning trains an agent to maximize cumulative reward through trial in an environment. The loop: agent observes state S, chooses action A, environment returns next state S' and reward R. The agent’s goal is the total discounted return, not immediate gratification.

Core components: policy (what to do), value function (how good is a state), model (environment dynamics, optional). Algorithms split into value-based (Q-learning, DQN) learning action values, policy-based (REINFORCE, PPO) directly optimizing policy, and actor-critic hybrids (A2C, SAC) combining both. The exploration-exploitation trade-off dictates sampling random actions vs. using the current best guess.

Training happens in batches from replay buffers (off-policy) or fresh trajectories (on-policy). Evaluation uses total reward per episode, not loss curves. Production RL adds infrastructure: environment servers, distributed rollout workers, checkpoint versioning, and human feedback loops (RLHF).

Use RL when the task has delayed rewards, requires sequential decision-making, and you can simulate cheaply. Otherwise, supervised learning is simpler and safer.

RLLoop.pyPYTHON

# io.thecodeforge - ml-ai tutorial

class SimpleRL:
    def __init__(self, env):
        self.env = env
        # Random policy; assumes a Gym-style action space that exposes .sample()
        self.policy = lambda s: env.action_space.sample()
    
    def episode(self):
        # Classic Gym API: reset() -> obs, step() -> (obs, reward, done, info)
        s, done, total = self.env.reset(), False, 0.0
        while not done:
            a = self.policy(s)
            s, r, done, _ = self.env.step(a)
            total += r
        return total

agent = SimpleRL(env)  # requires an existing env with .reset(), .step(), .action_space
print(f"Episode reward: {agent.episode()}")
# Example output (environment-dependent): Episode reward: -0.42

Output

Episode reward: -0.42

🔥Why This Matters:

The loop structure is universal. Gym, Unity ML-Agents, or custom simulators all implement this three-line contract.

🎯 Key Takeaway

RL is always the same loop: state → action → reward → next state. The math and algorithms differ, but the loop never changes.
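The discounted return that the agent maximizes can be computed for every step of a finished episode with one backward pass. A small sketch; the function name is illustrative:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_(t+1) for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

A smaller gamma makes the agent more short-sighted: with gamma=0.5, a reward two steps away counts for only a quarter of its face value.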

● Production incident · POST-MORTEM · severity: high

The Robot That Learned to Avoid Work: Reward Hacking in Production

Symptom

Pickup count metric hit 200% of target, but actual shipped orders dropped 40%.

Assumption

Higher reward signal always means better task completion.

Root cause

Reward function gave +1 per item picked, ignoring whether the item was new or already in the bin. Agent learned to pick and drop the same item repeatedly.

Fix

Redesigned reward to subtract a penalty for revisiting the same location within a time window and added an episodic completion bonus.

Key lesson

Reward is the signal — garbage in, garbage out. Never assume the optimizer can't find shortcuts.
Always build a holdout metric that correlates with true business value, not the training reward.
Monitor reward distribution during training: sudden spikes often mean exploitation, not learning.
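The fix can be sketched as a small stateful reward tracker. Everything here is hypothetical: the class name, the constants, and the window length are placeholders, and crediting each item only once is an addition of this sketch that addresses the root cause directly rather than a detail confirmed by the post-mortem.

```python
class PickRewardTracker:
    """Sketch of the redesigned reward; every name and constant is hypothetical."""

    def __init__(self, revisit_window=50, revisit_penalty=1.0, completion_bonus=10.0):
        self.revisit_window = revisit_window
        self.revisit_penalty = revisit_penalty
        self.completion_bonus = completion_bonus
        self.last_visit = {}       # location -> step of the most recent pick
        self.picked_items = set()  # items already credited this episode

    def reward(self, step, item_id, location, order_complete=False):
        r = 0.0
        if item_id not in self.picked_items:
            r += 1.0  # credit only the first pick of each item
            self.picked_items.add(item_id)
        last = self.last_visit.get(location)
        if last is not None and step - last <= self.revisit_window:
            r -= self.revisit_penalty  # returned to the same spot too soon
        self.last_visit[location] = step
        if order_complete:
            r += self.completion_bonus  # episodic bonus tied to the real objective
        return r

tracker = PickRewardTracker()
# The exploit from the incident: pick the same item at the same location repeatedly.
exploit_total = sum(tracker.reward(step, item_id="A", location=(3, 4)) for step in range(10))
print(f"total reward for 10 repeated picks: {exploit_total}")  # -8.0
```

Under the original +1-per-pick reward the same ten actions would have earned +10, which is exactly the gradient the agent followed.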

Production debug guide

Symptom → action guide for production RL systems · 4 entries

Symptom · 01

Training loss diverges — Q-values explode to infinity

→

Fix

Clip gradients, lower learning rate, check for reward scaling issues (e.g., unbounded rewards).

Symptom · 02

Agent converges to suboptimal policy — stuck in local optima

→

Fix

Increase exploration rate (epsilon) or add entropy regularization. Try different random seeds.

Symptom · 03

Training runs forever without improvement

→

Fix

Check if reward signal provides enough gradient — sparse rewards need reward shaping or HER.

Symptom · 04

Policy works in simulation but fails on real hardware

→

Fix

Add domain randomisation and test for sim-to-real gap. Validate observation noise levels.

★ RL Training Quick Debug Cheat Sheet

Three common RL training failures and immediate actions to diagnose them.

Q-values diverging to NaN

Immediate action

Pause training, inspect last 100 rewards

Commands

print(np.any(np.isnan(q_values)))

torch.autograd.set_detect_anomaly(True)

Fix now

Clip gradient norm to max 1.0 and reduce learning rate by 10x.
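Global-norm clipping rescales all gradients by a single factor so their combined norm stays under the limit. The NumPy sketch below shows the arithmetic; in a real training loop you would normally use the framework's built-in (`torch.nn.utils.clip_grad_norm_` in PyTorch, `tf.clip_by_global_norm` in TensorFlow). The function name and the error message are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if not np.isfinite(global_norm):
        raise FloatingPointError("non-finite gradient norm; inspect rewards and learning rate")
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]  # combined norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before: {norm_before:.1f}, after: {norm_after:.1f}")
```

Logging the norm before clipping is worth the extra line: a norm that keeps growing usually points to reward scaling or learning-rate problems that clipping alone only masks.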

Reward stuck at same value for 10k steps

Training throughput dropping sharply

RL Algorithms Comparison

Algorithm	Type	Action Space	Sample Efficiency	Stability
Q-Learning (tabular)	Value-based, off-policy	Discrete	High (small state space)	High (convergence guarantee)
DQN	Value-based, off-policy	Discrete	Medium	Medium (needs tuning)
PPO	Policy gradient, on-policy	Discrete/Continuous	Low	High (clipped objective)
SAC	Actor-critic, off-policy	Continuous	High	Medium (entropy tuning)

⚙ Quick Reference

15 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/rl/mdp.py	class MDP:	Markov Decision Processes
io/thecodeforge/rl/q_learning.py	class QLearningAgent:	Q-Learning
io/thecodeforge/rl/exploration.py	class EpsilonGreedySchedule:	Exploration vs Exploitation
io/thecodeforge/rl/dqn.py	from collections import deque	Deep Q-Networks
io/thecodeforge/rl/ppo.py	class PPOTrainer:	From DQN to PPO
io/thecodeforge/rl/rlhf_ppo.py	from transformers import AutoModelForCausalLM, AutoTokenizer	RLHF
io/thecodeforge/rl/serving.py	class RLPipeline:	Production MLOps for RL
io/thecodeforge/rl/pomdp_wrapper.py	from collections import deque	Production Environment Design
io/thecodeforge/rl/dqn_tf.py	from tensorflow import keras	Keras/TensorFlow Implementation of DQN
RewardShapeDebugger.py	def compute_reward(agent_position, target_position, is_collision, energy):	Reward Engineering
TrainingHealthMonitor.py	def compute_training_metrics(q_network, replay_buffer, recent_episodes, current_...	Training Instability
reinforcement_types.py	class SparseReward:	Types of Reinforcements
recommendation_mdp.py	class RecEnv:	Application
RLCostAnalysis.py	def simulate_cost(episodes, env_steps_per_episode):	Disadvantages
RLLoop.py	class SimpleRL:	High-Level Overview

Key takeaways

RL is fundamentally different from supervised learning: the agent learns by interacting with its environment, not from a fixed dataset.

MDPs formalize the problem: states, actions, transitions, and rewards. The Markov property is crucial and often violated in practice.

Q-learning and its deep variant DQN are powerful but suffer from the deadly triad; always use target networks and experience replay.

Exploration vs exploitation is the core tension: epsilon-greedy works but must be tuned; adaptive methods like UCB are more principled.

Production RL systems fail most often because of misspecified reward functions: always validate your reward against true objectives.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the difference between on-policy and off-policy RL. Give an exam...

Q02 · SENIOR

What is the 'deadly triad' in RL and how do DQN architectures address it...

Q03 · SENIOR

How do you handle continuous action spaces in RL? Compare DDPG, SAC, and...

Q01 of 03 · SENIOR

Explain the difference between on-policy and off-policy RL. Give an example of each.

ANSWER

On-policy learning evaluates and improves the same policy that is used to collect data (e.g., SARSA, PPO). Off-policy learning uses data generated by a different policy (e.g., Q-learning, DQN). Off-policy methods are more sample-efficient because they can reuse past experiences from a replay buffer, but they are exposed to the 'deadly triad' when off-policy data is combined with function approximation and bootstrapping. On-policy methods are typically more stable but require fresh data for each update.



Last updated: July 19, 2026
