Medium 13 min · May 28, 2026

Policy Gradient Methods: From REINFORCE to PPO in Production

Master policy gradient methods from REINFORCE to PPO.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

June 02, 2026
last updated
Quick Answer
  • Policy gradient methods directly optimize a parameterized policy via gradient ascent on expected reward.
  • REINFORCE is the foundational algorithm, using Monte Carlo returns with a log-probability trick.
  • Variance reduction techniques like baselines and GAE are critical for stable training.
  • Trust region methods (TRPO, PPO) constrain policy updates to prevent catastrophic collapse.
  • PPO's clipped surrogate objective is the de facto standard for production RL systems.
  • Policy gradients scale to high-dimensional continuous control and large language model alignment.
Definition
What Are Policy Gradient Methods?

Policy gradient methods are a class of reinforcement learning algorithms that directly optimize a parameterized policy function π_θ(a|s) by estimating the gradient of expected cumulative reward J(θ) with respect to θ. They use the policy gradient theorem to compute ∇_θ J(θ) as an expectation over trajectories, enabling gradient ascent on the policy parameters without requiring a value function.

Plain-English First

Imagine you're training a dog to fetch. Instead of teaching it the value of each step, you directly reward the whole sequence of actions that lead to the ball. Policy gradient methods are like that: they tweak the dog's strategy based on how well the entire fetch went, gradually improving the odds of good sequences.

Policy gradient methods have become the backbone of modern reinforcement learning, powering everything from robotic manipulation to the alignment of large language models. Unlike value-based approaches that learn a Q-function and derive a policy implicitly, policy gradients directly optimize the policy parameters via gradient ascent on expected cumulative reward. This directness makes them naturally suited for continuous action spaces and stochastic policies, but it comes with a notorious challenge: high variance in gradient estimates.

The journey from REINFORCE to PPO is a story of taming that variance. REINFORCE, introduced by Williams in 1992, uses Monte Carlo returns but suffers from high variance, requiring careful reward normalization and baselines. The causality trick and the policy gradient theorem provided theoretical grounding, but practical success demanded more. The introduction of the advantage function and Generalized Advantage Estimation (GAE) by Schulman et al. in 2015 marked a turning point, enabling stable learning in high-dimensional control tasks.

Trust region methods like TRPO and PPO addressed another critical issue: how large can a policy update be without destroying performance? TRPO enforces a hard constraint on the KL divergence between old and new policies, while PPO's clipped surrogate objective offers a simpler, more scalable alternative. Today, PPO is the workhorse of production RL systems, used in robotics, game playing, and fine-tuning large language models via reinforcement learning from human feedback (RLHF).

In 2026, policy gradients remain at the forefront of AI research and deployment. Understanding their theory, implementation pitfalls, and production debugging is essential for any serious ML engineer. This article provides a comprehensive, production-grounded guide to policy gradient methods, from the mathematical foundations to real-world war stories.

The Policy Gradient Theorem: Derivation and Intuition

The Policy Gradient Theorem is the foundational result that makes direct policy optimization tractable. It states that for a parameterized stochastic policy π_θ, the gradient of the expected return J(θ) = E[Σ_t γ^t R_t] can be expressed as ∇_θ J(θ) = E[∇_θ log π_θ(a|s) Q^{π_θ}(s,a)]. The key insight is that we can compute the gradient without differentiating through the environment dynamics or the state distribution. In practice, this means we can estimate the gradient using only samples from the current policy and estimates of the action-value function. The theorem holds for both episodic and continuing settings, with appropriate discounting.

The derivation starts from the log-derivative trick: ∇_θ J(θ) = ∇_θ ∫ p_θ(τ) R(τ) dτ = ∫ p_θ(τ) ∇_θ log p_θ(τ) R(τ) dτ = E[∇_θ log p_θ(τ) R(τ)]. The trajectory probability factorizes as p_θ(τ) = p(s_0) Π_t π_θ(a_t|s_t) p(s_{t+1}|s_t,a_t), so log p_θ(τ) is a sum in which only the policy terms depend on θ. The initial-state and transition terms drop out of the gradient, leaving ∇_θ log p_θ(τ) = Σ_t ∇_θ log π_θ(a_t|s_t). That is why no model of the dynamics is needed.

The causality trick further simplifies this by noting that actions at time t only affect rewards from time t onward, leading to ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a_t|s_t) (Σ_{k=t}^T γ^{k-t} R_k)]. Dropping the past-reward terms is valid because the score function ∇_θ log π_θ(a_t|s_t) has zero expectation under the policy, so multiplying it by any quantity that does not depend on a_t contributes nothing in expectation. Removing those terms reduces variance, and the same property is what later allows a baseline to be subtracted without bias. The theorem is the basis for all modern policy gradient methods, from REINFORCE to PPO. Understanding it is non-negotiable for anyone working in deep RL.

io/thecodeforge/policy_gradient_theorem.pyPYTHON
import torch
import torch.nn as nn
import torch.distributions as dist

def policy_gradient_loss(log_probs, returns):
    """
    Compute the policy gradient loss using the REINFORCE estimator.
    This implements: ∇_θ J(θ) ≈ E[∇_θ log π_θ(a|s) * G_t]
    where G_t is the discounted return from time t.
    """
    # log_probs: (batch_size, seq_len)
    # returns: (batch_size, seq_len)
    # Policy gradient loss = -E[log π(a|s) * G_t] (negative for gradient ascent)
    loss = -torch.mean(log_probs * returns)
    return loss

# Example usage with a simple policy network
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )
    
    def forward(self, state):
        logits = self.fc(state)
        return dist.Categorical(logits=logits)

# Simulate a batch of trajectories
state_dim, action_dim = 4, 2
policy = PolicyNetwork(state_dim, action_dim)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Dummy data: 32 trajectories, each of length 10
batch_size, seq_len = 32, 10
states = torch.randn(batch_size, seq_len, state_dim)
actions = torch.randint(0, action_dim, (batch_size, seq_len))
returns = torch.randn(batch_size, seq_len)  # discounted returns

# Compute log probabilities of taken actions
log_probs = []
for t in range(seq_len):
    dist_t = policy(states[:, t, :])
    log_probs.append(dist_t.log_prob(actions[:, t]))
log_probs = torch.stack(log_probs, dim=1)

loss = policy_gradient_loss(log_probs, returns)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Policy gradient loss: {loss.item():.4f}")
Output (illustrative; the dummy inputs are random, so the value differs on every run)
Policy gradient loss: -0.2341
The Score Function Trick
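The policy gradient theorem rests on the log-derivative (score function) trick: ∇_θ p_θ(τ) = p_θ(τ) ∇_θ log p_θ(τ). Because the environment dynamics do not depend on θ, only the ∇_θ log π_θ(a|s) terms survive, so the gradient can be written as an expectation over sampled trajectories. Separately, the score function has zero expectation under the policy, which is what makes action-independent baselines unbiased.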
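As a sanity check on the score-function idea, the following sketch enumerates a three-action softmax policy with no states (a bandit, chosen purely for illustration; the logits and rewards are made-up numbers). It confirms that the score function averages to zero under the policy and that the score-function gradient E[∇ log π(a) r(a)] agrees with a finite-difference gradient of the expected reward.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(logits, action):
    """Gradient of log softmax(logits)[action] with respect to each logit."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def expected_reward(logits, rewards):
    """J(theta) for a stateless policy: sum_a pi(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

# Hypothetical numbers, chosen only for illustration
logits = [0.2, -0.5, 1.0]
rewards = [1.0, 3.0, -2.0]
probs = softmax(logits)
n = len(logits)

# Exact expectation of the score function under the policy (should be ~0)
mean_score = [sum(probs[a] * score(logits, a)[i] for a in range(n)) for i in range(n)]

# Exact score-function gradient: E_a[ grad log pi(a) * r(a) ]
pg_grad = [sum(probs[a] * score(logits, a)[i] * rewards[a] for a in range(n)) for i in range(n)]

# Central finite differences of J(theta) for comparison
h = 1e-6
fd_grad = []
for i in range(n):
    up = list(logits)
    up[i] += h
    down = list(logits)
    down[i] -= h
    fd_grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * h))

print("mean score:", mean_score)
print("score-function gradient:", pg_grad)
print("finite-difference gradient:", fd_grad)
```

Enumeration is only possible because the action set is tiny; with real environments the same expectation is estimated from sampled trajectories, which is where the variance discussed in the next section comes from.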
Production Insight
In production, avoid implementing the policy gradient from scratch for complex environments. Use libraries like Stable-Baselines3 or Ray RLlib that handle the gradient computation, batching, and distributed sampling. The theorem is correct, but numerical stability (e.g., log-probabilities of actions whose probability is close to zero) will bite you.
Key Takeaway
The Policy Gradient Theorem provides a way to estimate the gradient of expected return using only samples from the current policy. It's the foundation for all policy gradient methods. The key formula: ∇_θ J(θ) = E[∇_θ log π_θ(a|s) Q^{π}(s,a)].
[Figure: Evolution of policy gradient algorithms with variance reduction. Policy Gradient Theorem (∇J(θ) = E[∇log π(a|s) Q(s,a)]) → REINFORCE (Monte Carlo returns, high variance) → Actor-Critic (learned value function reduces variance) → GAE (bias-variance tradeoff via the λ parameter) → TRPO (natural gradient, hard to implement) → PPO (clipped surrogate objective, stable). Callout: high variance in REINFORCE can destabilize training; use GAE and a value function baseline to reduce variance.]

REINFORCE: Monte Carlo Policy Gradient and the Variance Problem

REINFORCE, introduced by Williams in 1992, is the simplest policy gradient algorithm. It directly applies the policy gradient theorem using Monte Carlo returns: ∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t}) G_{i,t}, where G_{i,t} = Σ_{k=t}^T γ^{k-t} R_{i,k} is the discounted return from step t. The algorithm is straightforward: collect a full episode, compute the returns, then update the policy parameters via gradient ascent. The per-step update rule is θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t; in practice, you'd use a batch of episodes and average the gradients.

Despite its simplicity, REINFORCE suffers from high variance because the Monte Carlo returns are noisy single-sample estimates of the true action-value function. The variance grows with the episode length and reward magnitude, making learning unstable in practice. For example, in a task with rewards in [0,1] and episodes of 100 steps, undiscounted returns can range from 0 to 100 (about 63 with γ=0.99), causing gradient estimates to vary wildly. The causality trick helps somewhat by only using future rewards, but the core issue remains: no baseline is subtracted, so every log-probability is scaled by the raw return.

REINFORCE is rarely used in modern deep RL without modifications. However, it's pedagogically important and serves as the reference point for understanding variance reduction techniques. The algorithm is on-policy, meaning you must discard old data after each update, and sample efficiency is poor because each trajectory is used only once.

io/thecodeforge/reinforce.pyPYTHON
import torch
import torch.nn as nn
import torch.distributions as dist
import gym

def reinforce(env_name='CartPole-v1', num_episodes=1000, gamma=0.99, lr=1e-3):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    policy = nn.Sequential(
        nn.Linear(state_dim, 128),
        nn.ReLU(),
        nn.Linear(128, action_dim)
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    
    for episode in range(num_episodes):
        states, actions, rewards = [], [], []
        state, _ = env.reset()
        done = False
        
        while not done:
            state_t = torch.FloatTensor(state).unsqueeze(0)
            logits = policy(state_t)
            action_dist = dist.Categorical(logits=logits)
            action = action_dist.sample().item()
            
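            # The 5-tuple step API returns (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on failure or when the time limit is hit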
            states.append(state_t)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        
        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        
        # Compute policy gradient loss
        log_probs = []
        for t in range(len(states)):
            logits = policy(states[t])
            action_dist = dist.Categorical(logits=logits)
            log_prob = action_dist.log_prob(torch.tensor(actions[t]))
            log_probs.append(log_prob)
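        # Each entry has shape (1,); cat gives shape (T,) so the product with
        # returns is elementwise rather than a (T, T) broadcast
        log_probs = torch.cat(log_probs)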
        
        loss = -torch.mean(log_probs * returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if episode % 100 == 0:
            print(f"Episode {episode}, Loss: {loss.item():.4f}, Return: {sum(rewards):.2f}")
    
    env.close()

if __name__ == '__main__':
    reinforce()
Output
Exact numbers vary from run to run. The script prints one line every 100 episodes in the form "Episode N, Loss: ..., Return: ...". On CartPole every reward is +1, so the returns are positive, the log-probabilities are non-positive, and the printed loss is therefore non-negative. The episode return is the number to watch.
Variance Kills Convergence
REINFORCE's variance grows with episode length and reward magnitude. In practice, you'll see loss values that are all over the place, and the policy may never converge on complex tasks. Always use a baseline or switch to actor-critic.
Production Insight
Never use vanilla REINFORCE in production. The variance is too high for any non-trivial environment. If you must use Monte Carlo returns, at least subtract a state-dependent baseline. Even then, consider using GAE or a full actor-critic setup. REINFORCE is only useful for debugging or teaching.
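To make the effect of a baseline concrete, here is a minimal sketch on a two-action bandit with a uniform softmax policy. The rewards and the helper names are hypothetical and chosen so the arithmetic is easy to follow. The estimator for the gradient with respect to the first logit is evaluated exactly for each action, once with no baseline and once with the expected reward as the baseline.

```python
def grad_samples(probs, rewards, baseline):
    """Value of the estimator d/d(logit_0) log pi(a) * (r(a) - b) for each action a."""
    samples = []
    for a, r in enumerate(rewards):
        score_0 = (1.0 if a == 0 else 0.0) - probs[0]  # softmax score for logit 0
        samples.append(score_0 * (r - baseline))
    return samples

def mean_and_variance(probs, samples):
    """Exact mean and variance of the estimator under the policy."""
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]        # uniform policy over two actions
rewards = [10.0, 11.0]    # hypothetical rewards with a large shared offset
value = sum(p * r for p, r in zip(probs, rewards))  # expected reward, used as baseline

mean_plain, var_plain = mean_and_variance(probs, grad_samples(probs, rewards, 0.0))
mean_base, var_base = mean_and_variance(probs, grad_samples(probs, rewards, value))

print(f"no baseline:   mean={mean_plain:.4f}, variance={var_plain:.4f}")
print(f"with baseline: mean={mean_base:.4f}, variance={var_base:.4f}")
```

Both estimators have the same mean, so the baseline introduces no bias. In this symmetric example the baseline happens to remove all of the variance; in general the reduction is partial, but the offset shared by all rewards no longer inflates the gradient noise.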
Key Takeaway
REINFORCE is the simplest policy gradient method but suffers from high variance due to Monte Carlo return estimates. The update is θ ← θ + α ∇_θ log π_θ(a|s) G_t. It's on-policy and sample-inefficient. Always use variance reduction techniques in practice.

Actor-Critic Methods: Reducing Variance with Learned Baselines

Actor-critic methods address REINFORCE's variance problem by introducing a learned value function (the critic) that serves as a baseline. The key insight is that subtracting a baseline from the return reduces variance without introducing bias, as long as the baseline is independent of the action. The natural choice is the state-value function V^{π}(s), leading to the advantage function A^{π}(s,a) = Q^{π}(s,a) - V^{π}(s). The policy gradient becomes ∇_θ J(θ) = E[∇_θ log π_θ(a|s) A^{π}(s,a)].

In practice, we estimate the advantage using the critic: Â_t = R_t + γ V_φ(s_{t+1}) - V_φ(s_t) for TD(0), or using n-step returns. The critic is trained to minimize the squared TD error: L(φ) = E[(R_t + γ V_φ(s_{t+1}) - V_φ(s_t))^2], with the bootstrapped target R_t + γ V_φ(s_{t+1}) treated as a constant. This creates a bootstrapping loop: the critic provides lower-variance (but biased) advantage estimates, which stabilize the policy gradient. The actor (policy) and critic (value function) are trained jointly. Modern implementations use separate networks or shared feature extractors with separate output heads.

The variance reduction can be dramatic: bootstrapped advantage estimates typically have far lower variance than Monte Carlo returns, because each one depends on a single sampled reward and transition rather than on the whole remainder of the episode. However, bootstrapping introduces bias, especially early in training when the critic is inaccurate. This bias-variance tradeoff is managed through the choice of TD horizon (e.g., TD(λ) or GAE). Actor-critic methods are the backbone of modern deep RL, including A2C, A3C, and PPO. They enable learning in environments with long horizons and sparse rewards where REINFORCE would fail.

io/thecodeforge/actor_critic.pyPYTHON
import torch
import torch.nn as nn
import torch.distributions as dist
import gym

def actor_critic(env_name='CartPole-v1', num_episodes=1000, gamma=0.99, lr=1e-3):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    # Shared feature extractor
    class ActorCritic(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(state_dim, 128),
                nn.ReLU()
            )
            self.actor = nn.Linear(128, action_dim)
            self.critic = nn.Linear(128, 1)
        
        def forward(self, state):
            features = self.fc(state)
            logits = self.actor(features)
            value = self.critic(features)
            return logits, value
    
    model = ActorCritic()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        log_probs, values, rewards = [], [], []
        
        while not done:
            state_t = torch.FloatTensor(state).unsqueeze(0)
            logits, value = model(state_t)
            action_dist = dist.Categorical(logits=logits)
            action = action_dist.sample()
            log_prob = action_dist.log_prob(action)
            
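            # The 5-tuple step API returns (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated  # stop on failure or when the time limit is hit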
            log_probs.append(log_prob)
            values.append(value)
            rewards.append(reward)
            state = next_state
        
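        # Compute returns and advantages. This listing uses Monte Carlo returns with the
        # critic as a baseline (no bootstrapping), the simplest actor-critic variant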
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        values = torch.cat(values).squeeze(-1)  # (T, 1) -> (T,), safe even when T == 1
        advantages = returns - values.detach()
        
        # Actor loss (policy gradient)
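        # Each entry has shape (1,); cat gives shape (T,) so the product with
        # advantages is elementwise rather than a (T, T) broadcast
        log_probs = torch.cat(log_probs)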
        actor_loss = -torch.mean(log_probs * advantages)
        
        # Critic loss (MSE)
        critic_loss = nn.MSELoss()(values, returns)
        
        total_loss = actor_loss + critic_loss
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        
        if episode % 100 == 0:
            print(f"Episode {episode}, Actor Loss: {actor_loss.item():.4f}, Critic Loss: {critic_loss.item():.4f}, Return: {sum(rewards):.2f}")
    
    env.close()

if __name__ == '__main__':
    actor_critic()
Output
Exact numbers vary from run to run. The script prints one line every 100 episodes in the form "Episode N, Actor Loss: ..., Critic Loss: ..., Return: ...". Because the critic regresses onto unnormalized returns, which reach the tens on CartPole, expect the critic loss to be well above 1 early in training.
Bias-Variance Tradeoff
Actor-critic methods trade off the high variance of Monte Carlo returns for the bias introduced by bootstrapping. The critic's value estimates are biased early on but have lower variance, leading to faster and more stable learning.
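The one-step estimate Â_t = R_t + γ V_φ(s_{t+1}) - V_φ(s_t) can be sketched as follows. The function name, the `dones` mask convention (1.0 where the episode terminated at that step), and the toy numbers are assumptions made for this example; the listing above uses Monte Carlo returns instead.

```python
import torch

def td0_advantage_and_critic_loss(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD advantage and squared-TD-error critic loss.

    rewards, values, next_values, dones: tensors of shape (T,).
    dones[t] is 1.0 if the episode terminated at step t, else 0.0.
    """
    # Bootstrapped target; detached so the critic is not trained through its own target
    td_target = rewards + gamma * next_values.detach() * (1.0 - dones)
    td_error = td_target - values
    advantages = td_error.detach()        # the actor treats the advantage as a constant
    critic_loss = td_error.pow(2).mean()  # gradient flows only through `values`
    return advantages, critic_loss

# Toy numbers standing in for one short rollout
rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.tensor([0.5, 0.6, 0.7], requires_grad=True)  # V(s_t)
next_values = torch.tensor([0.6, 0.7, 0.9])                  # V(s_{t+1})
dones = torch.tensor([0.0, 0.0, 1.0])                        # last step is terminal

advantages, critic_loss = td0_advantage_and_critic_loss(
    rewards, values, next_values, dones, gamma=0.9
)
print(advantages)          # one advantage per step
print(critic_loss.item())
```

The terminal step ignores `next_values` entirely, which is the main thing to get right: bootstrapping past a true terminal state leaks value from a state that is never reached.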
Production Insight
In production, use separate learning rates for actor and critic. The critic often needs a lower learning rate to avoid destabilizing the policy. Also, normalize advantages (e.g., by subtracting mean and dividing by standard deviation) to keep gradients well-conditioned. This is standard in PPO implementations.
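One way to set this up in PyTorch is with optimizer parameter groups, sketched below. The module layout (`trunk`, `actor`, `critic`) and both learning-rate values are placeholders for illustration rather than recommendations, and the shared trunk is grouped with the actor by assumption.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Hypothetical shared-trunk network with separate actor and critic heads."""
    def __init__(self, state_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, action_dim)
        self.critic = nn.Linear(hidden, 1)

model = ActorCritic()

ACTOR_LR = 3e-4   # placeholder value
CRITIC_LR = 1e-4  # placeholder value; tune both for the task at hand

# Parameter groups let one optimizer apply a different learning rate per group
optimizer = torch.optim.Adam([
    {"params": list(model.trunk.parameters()) + list(model.actor.parameters()), "lr": ACTOR_LR},
    {"params": model.critic.parameters(), "lr": CRITIC_LR},
])

def normalize_advantages(advantages, eps=1e-8):
    """Zero-mean, unit-std advantages; eps guards against a zero std."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

raw = torch.tensor([2.0, -1.0, 0.5, 4.0])
norm = normalize_advantages(raw)
print([group["lr"] for group in optimizer.param_groups])
print(norm)
```

Normalizing changes the scale of the policy gradient but not its direction within a batch, which is why it pairs well with a fixed clip range in PPO.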
Key Takeaway
Actor-critic methods reduce variance by using a learned value function as a baseline. The policy gradient uses the advantage A(s,a) = Q(s,a) - V(s). The critic is trained via TD learning. This bias-variance tradeoff enables learning in complex environments where REINFORCE fails.

Generalized Advantage Estimation (GAE): Bias-Variance Tradeoff in Practice

Generalized Advantage Estimation (GAE), introduced by Schulman et al. in 2015, provides a principled way to balance bias and variance in advantage estimation. GAE is the exponentially weighted average of the k-step advantage estimators, which simplifies to a discounted sum of TD residuals: Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = R_t + γ V(s_{t+1}) - V(s_t) is the TD error. The parameter λ ∈ [0,1] controls the tradeoff: λ=0 gives the biased but low-variance TD(0) advantage (Â_t = δ_t), while λ=1 gives the unbiased but high-variance Monte Carlo advantage (Â_t = Σ_{l=0}^{∞} γ^l δ_{t+l} = G_t - V(s_t)). In practice, λ=0.95 is a common choice that works well across many tasks.

GAE is computed efficiently using a backward recursion: Â_t = δ_t + γλ Â_{t+1}, starting from Â_T = 0. This makes it computationally cheap to add to any actor-critic implementation. In continuous control benchmarks like MuJoCo, GAE with λ around 0.95 typically yields markedly lower-variance advantage estimates than Monte Carlo while introducing little bias, which allows larger update steps and faster convergence. GAE is a standard component in modern algorithms like PPO and TRPO.

The key insight is that the weight on each additional TD residual decays as (γλ)^l, so λ sets how far into the future the estimator trusts sampled rewards over the critic. For tasks with dense rewards, lower λ (more bias) works well; for sparse rewards, higher λ (less bias) is better. Tuning λ can matter as much as tuning the discount factor γ.

io/thecodeforge/gae.pyPYTHON
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """
    Compute Generalized Advantage Estimation.
    
    Args:
        rewards: Tensor of shape (T,) - rewards at each timestep
        values: Tensor of shape (T+1,) - value estimates including bootstrap
        gamma: discount factor
        lam: GAE lambda parameter
    
    Returns:
        advantages: Tensor of shape (T,) - GAE advantages
        returns: Tensor of shape (T,) - lambda-returns (advantages + values), used as critic targets
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0
    
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t+1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    
    returns = advantages + values[:-1]
    return advantages, returns

# Example usage
T = 10
rewards = torch.randn(T) * 0.1 + 0.5  # simulated rewards
values = torch.randn(T + 1) * 0.1 + 0.5  # value estimates including bootstrap

advantages, returns = compute_gae(rewards, values, gamma=0.99, lam=0.95)
print("Rewards:", rewards.numpy())
print("Advantages:", advantages.numpy())
print("Returns:", returns.numpy())
print(f"Mean advantage: {advantages.mean().item():.4f}, Std: {advantages.std().item():.4f}")
Output
Exact numbers vary from run to run because the inputs are random. With rewards and value estimates both centered near 0.5, each TD residual is roughly 0.5, so the advantages are mostly positive and accumulate toward earlier timesteps: about 0.5 at the last step and several times that at the first.
GAE Implementation Gotcha
Always include a bootstrap value for the final state: V(s_T) = 0 when the episode truly terminated. For continuing tasks, or rollouts cut off by a time limit, use the critic's estimate of the next state instead. The recursion Â_t = δ_t + γλ Â_{t+1} is numerically stable and O(T).
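When a rollout buffer spans several episodes, the recursion also has to be reset at each episode boundary. A minimal sketch of that masking, assuming a `dones` list in which `dones[t]` is True when the episode terminated at step t (the function name and the toy numbers are illustrative, not from a particular library):

```python
def gae_with_dones(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a rollout that may contain several episodes.

    rewards, values, dones: lists of length T.
    last_value: critic estimate for the state after the final step, used only
    if the final step did not terminate its episode.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        nonterminal = 0.0 if dones[t] else 1.0
        # At a terminal step, neither the bootstrap value nor the running
        # advantage from the next episode may leak backward
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
    return advantages

# Two episodes in one buffer: the first ends at step 2, the second is cut off
rewards = [1.0, 1.0, 1.0, 2.0, 2.0]
values = [0.9, 0.8, 0.5, 1.5, 1.1]
dones = [False, False, True, False, False]

adv_all = gae_with_dones(rewards, values, dones, last_value=0.7)
# Processing each episode on its own should give the same numbers
adv_ep1 = gae_with_dones(rewards[:3], values[:3], dones[:3], last_value=123.0)
adv_ep2 = gae_with_dones(rewards[3:], values[3:], dones[3:], last_value=0.7)
print(adv_all)
print(adv_ep1 + adv_ep2)
```

The deliberately absurd `last_value=123.0` for the first episode shows that the bootstrap is ignored whenever the final step is terminal. This sketch treats every `done` as a true termination; a time-limit truncation would need the bootstrap value kept while still resetting the running advantage.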
Production Insight
In production, tune λ carefully. Start with λ=0.95 for dense reward tasks and λ=0.99 for sparse rewards. Also, normalize advantages within each batch (subtract mean, divide by std) to stabilize PPO updates. GAE is cheap to compute, so there's no reason not to use it.
Key Takeaway
GAE provides a smooth bias-variance tradeoff via λ. λ=0 gives TD(0) (high bias, low variance), λ=1 gives Monte Carlo (low bias, high variance). λ=0.95 is a robust default. GAE is computed efficiently via backward recursion and is a standard component in modern actor-critic methods.
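The two limiting cases can be checked numerically. The sketch below uses plain Python lists and a made-up four-step episode with a terminal bootstrap value of 0; it compares the recursion at λ=0 with the TD residuals and at λ=1 with G_t - V(s_t).

```python
def gae(rewards, values, gamma, lam):
    """Backward GAE recursion; values has len(rewards) + 1 entries (last is the bootstrap)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Made-up episode; the final entry of values is V(s_T) = 0 for a terminal state
rewards = [1.0, 0.0, 2.0, -1.0]
values = [0.5, 0.4, 1.2, -0.3, 0.0]
gamma = 0.9
T = len(rewards)

# lambda = 0 should reproduce the one-step TD residuals
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]

# lambda = 1 should reproduce Monte Carlo return minus value
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
mc_advantages = [returns[t] - values[t] for t in range(T)]

adv_lam0 = gae(rewards, values, gamma, lam=0.0)
adv_lam1 = gae(rewards, values, gamma, lam=1.0)
print("lambda=0:", adv_lam0, "TD residuals:", deltas)
print("lambda=1:", adv_lam1, "G_t - V(s_t):", mc_advantages)
```

Intermediate values of λ interpolate between these two estimators, which is the knob the section describes.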

Trust Region Methods: TRPO and the Natural Gradient

Trust Region Policy Optimization (TRPO) addresses a fundamental flaw in vanilla policy gradient: step size sensitivity. A too-large update can collapse performance catastrophically. TRPO constrains the policy update to lie within a trust region measured by the KL divergence between old and new policies. The core objective is to maximize the surrogate advantage subject to a KL constraint: maximize_θ E[π_θ(a|s)/π_θ_old(a|s) * A(s,a)] subject to E[KL(π_θ_old(·|s) || π_θ(·|s))] ≤ δ. Typical δ values are 0.01-0.05. This constraint is enforced via a conjugate gradient solve for the natural gradient direction, avoiding explicit Hessian computation.

The natural gradient emerges from the Fisher Information Matrix F = E[∇_θ log π(a|s) ∇_θ log π(a|s)^T]. The update becomes θ ← θ + α * F^{-1} ∇_θ J(θ). TRPO uses a line search to ensure the surrogate improvement and KL constraint are both satisfied. In practice, TRPO requires careful numerical stability: damping added to F before it is inverted (i.e., solving with F + λI, with λ around 1e-3) and handling of ill-conditioned matrices. The conjugate gradient step typically runs 10-20 iterations, each requiring a Fisher-vector product (a Hessian-vector product of the KL divergence) that can be computed efficiently without forming the full matrix.

TRPO's theoretical guarantee is monotonic improvement under the constraint, although the practical algorithm relies on approximations (a quadratic model of the KL divergence and sampled estimates), so the guarantee holds only approximately. The computational overhead is also significant. Each update requires multiple backward passes for the CG solve. For neural networks with millions of parameters, this becomes a bottleneck. TRPO also struggles with stochastic environments where the KL constraint may be violated due to variance. Despite these issues, TRPO remains the gold standard for understanding trust region methods and inspired PPO's clipped surrogate as a simpler approximation.

Production deployment of TRPO is rare today due to its complexity and computational cost. However, the natural gradient concept is foundational: it accounts for the curvature of the policy parameter space, making updates more efficient per iteration. The Fisher information matrix captures how sensitive the policy distribution is to parameter changes, and using its inverse effectively normalizes the gradient by the local geometry. This insight directly informs second-order optimization methods in deep learning.

io/thecodeforge/trpo_natural_gradient.pyPYTHON
import torch
import torch.nn as nn
import torch.nn.functional as F

def conjugate_gradient(Avp_fn, b, nsteps=10, residual_tol=1e-10):
    x = torch.zeros_like(b)
    r = b - Avp_fn(x)
    p = r.clone()
    rdotr = torch.dot(r, r)
    for _ in range(nsteps):
        Avp = Avp_fn(p)
        alpha = rdotr / torch.dot(p, Avp)
        x += alpha * p
        r -= alpha * Avp
        new_rdotr = torch.dot(r, r)
        if new_rdotr < residual_tol:
            break
        beta = new_rdotr / rdotr
        p = r + beta * p
        rdotr = new_rdotr
    return x

def trpo_step(policy, states, actions, advantages, old_log_probs, max_kl=0.01, damping=1e-3):
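    # Compute the gradient of the surrogate loss. policy.get_log_prob and
    # policy.kl_divergence are assumed to be methods of the policy class;
    # they are not PyTorch APIs.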
    log_probs = policy.get_log_prob(states, actions)
    ratio = torch.exp(log_probs - old_log_probs)
    loss = -(ratio * advantages).mean()
    policy_grad = torch.autograd.grad(loss, policy.parameters())
    flat_grad = torch.cat([g.view(-1) for g in policy_grad])
    
    # Fisher-vector product function
    def fisher_vector_product(v):
        kl = policy.kl_divergence(states)
        kl_grad = torch.autograd.grad(kl, policy.parameters(), create_graph=True)
        flat_kl_grad = torch.cat([g.view(-1) for g in kl_grad])
        kl_v = torch.dot(flat_kl_grad, v)
        kl_v_grad = torch.autograd.grad(kl_v, policy.parameters())
        flat_kl_v_grad = torch.cat([g.contiguous().view(-1) for g in kl_v_grad])
        return flat_kl_v_grad + damping * v
    
    # Solve for natural gradient direction
    step_dir = conjugate_gradient(fisher_vector_product, flat_grad, nsteps=10)
    
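    # Scale the step so the quadratic KL estimate 0.5 * s^T F s equals max_kl
    shs = 0.5 * torch.dot(step_dir, fisher_vector_product(step_dir))
    lm = torch.sqrt(shs / max_kl)
    full_step = step_dir / lm
    
    # Proposed parameters. flat_grad is the gradient of the loss (negative
    # surrogate), so subtracting the step increases the surrogate.
    old_params = torch.cat([p.data.view(-1) for p in policy.parameters()])
    new_params = old_params - full_step
    # ... a backtracking line search belongs here: apply new_params to the policy,
    # then shrink the step until the KL constraint holds and the loss improves
    return new_params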
Natural Gradient as Geometry-Aware Update
Think of the Fisher matrix as a metric tensor that warps the parameter space. The natural gradient follows the steepest direction in distribution space, not parameter space. This is why TRPO can take larger, safer steps than vanilla gradient ascent.
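For a policy small enough to write the Fisher matrix down, the natural gradient can be computed directly. The sketch below uses a stateless three-action softmax policy with made-up logits and rewards, for which F = diag(p) - p pᵀ in logit space, and solves the damped system (F + λI) x = g with a pure-Python conjugate gradient. All names and numbers are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(matvec, b, iters=10, tol=1e-20):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0
    p = list(b)
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:     # converged; also avoids dividing by zero
            break
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        new_rr = dot(r, r)
        p = [ri + (new_rr / rr) * pi for ri, pi in zip(r, p)]
        rr = new_rr
    return x

logits = [0.2, -0.5, 1.0]   # hypothetical policy parameters
rewards = [1.0, 3.0, -2.0]  # hypothetical per-action rewards
probs = softmax(logits)
n = len(logits)

# Fisher matrix of a softmax policy in logit space: F = diag(p) - p p^T
fisher = [[(probs[i] if i == j else 0.0) - probs[i] * probs[j] for j in range(n)]
          for i in range(n)]

# Vanilla policy gradient for this stateless problem: g_i = p_i * (r_i - J)
J = dot(probs, rewards)
grad = [probs[i] * (rewards[i] - J) for i in range(n)]

damping = 1e-3
def damped_fvp(v):
    """(F + damping * I) v, the same role as the Fisher-vector product in TRPO."""
    return [dot(row, v) + damping * vi for row, vi in zip(fisher, v)]

nat_grad = conjugate_gradient(damped_fvp, grad)
residual = [a - b for a, b in zip(damped_fvp(nat_grad), grad)]
print("vanilla gradient:", grad)
print("natural gradient:", nat_grad)
```

The vanilla gradient scales each component by the action's probability, while the natural gradient largely undoes that scaling. Without damping, F is singular here (shifting all logits equally leaves the policy unchanged), which is one concrete reason the damping term is needed.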
Production Insight
TRPO's conjugate gradient solve is numerically brittle. Always add damping (1e-3) to the Fisher-vector product and monitor the KL divergence after each update. If KL exceeds the constraint, reduce step size or skip the update. In practice, we found that using a fixed number of CG iterations (10-15) works better than convergence-based stopping.
Key Takeaway
TRPO enforces a KL trust region to guarantee monotonic policy improvement. The natural gradient via conjugate gradient avoids explicit Hessian computation but adds significant overhead. TRPO is rarely used in production today but its theoretical insights directly led to PPO.

Proximal Policy Optimization (PPO): Clipped Surrogate and Production Deployment

Proximal Policy Optimization (PPO) simplifies TRPO by replacing the hard KL constraint with a clipped surrogate objective. The PPO objective is L_CLIP(θ) = E_t[min(r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t)], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and ε is typically 0.2. Clipping removes the incentive to push the probability ratio outside [0.8, 1.2] in the direction that would raise the objective, which limits how far the policy moves in a single update. The min operation ensures the objective is a lower bound on the unclipped objective, providing a pessimistic update that avoids performance collapse.
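A small numeric sketch of the per-sample objective helps show where the clip is active. The helper names are hypothetical, and the slope is estimated by finite differences purely for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, eps=0.2, h=1e-6):
    """Finite-difference derivative of the objective with respect to the ratio."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)

# (ratio, advantage) pairs covering the four interesting cases
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
results = {}
for ratio, adv in cases:
    results[(ratio, adv)] = (clipped_surrogate(ratio, adv), slope(ratio, adv))
    value, grad = results[(ratio, adv)]
    print(f"ratio={ratio}, A={adv:+.0f}: objective={value:+.2f}, slope={grad:+.2f}")
```

With a positive advantage and a ratio already above 1.2, or a negative advantage and a ratio already below 0.8, the objective is flat and the sample contributes no gradient. In the other two cases the ratio has moved in the harmful direction, the min picks the unclipped term, and the gradient is free to pull the policy back.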

PPO's practical advantages are enormous: no conjugate gradient, no Fisher matrix, no line search. It works with first-order optimizers like Adam. The standard PPO implementation uses a clipped surrogate plus a value function loss and an entropy bonus: L_total = L_CLIP - c1 L_value + c2 H(π), which is maximized (implementations usually minimize its negative). Typical hyperparameters: learning rate 3e-4, ε=0.2, GAE λ=0.95, γ=0.99, and 10 epochs of minibatch SGD per data collection. The value function is typically a separate network or shared trunk with the policy, and in some implementations its loss is clipped similarly to avoid large updates.

Production deployment of PPO requires careful engineering around data collection and batching. The standard setup uses multiple parallel environments (e.g., 8-64) to collect trajectories of length T (e.g., 128-2048). The collected data is then used for multiple epochs of minibatch updates. Key production concerns: (1) Normalize advantages across the batch to reduce variance. (2) Use gradient clipping (max norm 0.5-1.0) to prevent exploding gradients. (3) Monitor the KL divergence between old and new policies; if it exceeds a threshold (e.g., 0.02), early stop the update. (4) Use a decaying learning rate schedule.
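Concerns (3) and (4) can be sketched in a few lines. The KL estimate below is the sample mean of log π_old(a|s) - log π_new(a|s) over actions drawn from the old policy; `run_epochs`, the callable it takes, and the simulated drift are hypothetical stand-ins for a real minibatch update loop.

```python
def approx_kl(old_log_probs, new_log_probs):
    """Sample estimate of KL(old || new) from actions drawn under the old policy."""
    diffs = [o - n for o, n in zip(old_log_probs, new_log_probs)]
    return sum(diffs) / len(diffs)

def linear_lr(initial_lr, update_idx, total_updates):
    """Learning rate decayed linearly from initial_lr toward 0."""
    return initial_lr * (1.0 - update_idx / total_updates)

def run_epochs(old_log_probs, log_probs_after_epoch, max_epochs=10, target_kl=0.02):
    """Run up to max_epochs and stop once the estimated KL exceeds target_kl.

    log_probs_after_epoch(epoch) stands in for "run one epoch of minibatch
    updates, then re-evaluate log pi_new on the collected actions".
    """
    kl = 0.0
    for epoch in range(max_epochs):
        new_log_probs = log_probs_after_epoch(epoch)
        kl = approx_kl(old_log_probs, new_log_probs)
        if kl > target_kl:
            return epoch + 1, kl   # early stop: the policy has moved far enough
    return max_epochs, kl

# Simulated drift: each epoch lowers every log-probability by a further 0.008
old_log_probs = [-0.7, -1.2, -0.3, -0.9]
def simulated_epoch(epoch):
    return [lp - 0.008 * (epoch + 1) for lp in old_log_probs]

epochs_run, final_kl = run_epochs(old_log_probs, simulated_epoch)
print(f"stopped after {epochs_run} epochs with estimated KL {final_kl:.3f}")
print("lr at update 50 of 100:", linear_lr(3e-4, 50, 100))
```

In a PyTorch loop the log-probabilities would be tensors and the decayed rate would be written into each `optimizer.param_groups[i]["lr"]` before the update; the control flow is the same.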

PPO's robustness comes from its clipping mechanism, but it's not foolproof. The clip range ε is a critical hyperparameter: too small (0.1) and learning is slow; too large (0.3) and updates can destabilize. Adaptive clipping schemes exist but are rarely used in production. The entropy bonus is essential for exploration; typical values are 0.01-0.05. In continuous control, the policy outputs Gaussian distribution parameters (mean and log std), and the entropy is computed analytically. PPO with a shared policy-value network requires careful weight initialization (e.g., orthogonal with gain 0.01 for the final layer) to prevent initial policy collapse.
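For the continuous-control case, a minimal Gaussian policy head with the initialization mentioned above might look like the following. The class name, layer sizes, the state-independent `log_std` parameter, and the hidden-layer gain of √2 are assumptions made for this sketch; only the 0.01 gain on the final layer comes from the text.

```python
import math
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Hypothetical diagonal-Gaussian policy for continuous actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        # State-independent log std (an assumed, commonly used parameterization)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

        for layer in self.body:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=math.sqrt(2))
                nn.init.zeros_(layer.bias)
        # Small gain on the final layer keeps initial action means near zero
        nn.init.orthogonal_(self.mean_head.weight, gain=0.01)
        nn.init.zeros_(self.mean_head.bias)

    def forward(self, state):
        mean = self.mean_head(self.body(state))
        return Normal(mean, self.log_std.exp())

torch.manual_seed(0)
policy = GaussianPolicy(state_dim=3, action_dim=2)
states = torch.randn(5, 3)

action_dist = policy(states)
actions = action_dist.sample()
# Sum over action dimensions: the joint density factorizes across dimensions
log_prob = action_dist.log_prob(actions).sum(dim=-1)
entropy = action_dist.entropy().sum(dim=-1)   # analytic, no sampling needed
print(log_prob.shape, entropy)
```

With `log_std` initialized to zero the entropy per action dimension is 0.5 · log(2πe), independent of the state, so the entropy bonus only starts to matter once `log_std` is trained.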

io/thecodeforge/ppo_clipped_surrogate.pyPYTHON
import torch
import torch.nn as nn
import torch.nn.functional as F

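def ppo_update(policy, value_net, states, actions, old_log_probs, returns, advantages, clip_eps=0.2, epochs=10, mini_batch_size=64, optimizer=None):
    # policy.evaluate is assumed to return (log_probs, entropy) for the given states and actions.
    # Pass a persistent optimizer so Adam's moment estimates survive across updates;
    # one is created here only as a fallback.
    if optimizer is None:
        optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_net.parameters()), lr=3e-4)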
    dataset_size = states.shape[0]
    for _ in range(epochs):
        indices = torch.randperm(dataset_size)
        for start in range(0, dataset_size, mini_batch_size):
            batch_idx = indices[start:start+mini_batch_size]
            batch_states = states[batch_idx]
            batch_actions = actions[batch_idx]
            batch_old_log_probs = old_log_probs[batch_idx]
            batch_returns = returns[batch_idx]
            batch_advantages = advantages[batch_idx]
            
            # Normalize advantages
            batch_advantages = (batch_advantages - batch_advantages.mean()) / (batch_advantages.std() + 1e-8)
            
            # Policy loss
            log_probs, entropy = policy.evaluate(batch_states, batch_actions)
            ratio = torch.exp(log_probs - batch_old_log_probs)
            surr1 = ratio * batch_advantages
            surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * batch_advantages
            policy_loss = -torch.min(surr1, surr2).mean()
            
            # Value loss
            values = value_net(batch_states).squeeze()
            value_loss = F.mse_loss(values, batch_returns)
            
            # Entropy bonus
            entropy_loss = -0.01 * entropy.mean()
            
            total_loss = policy_loss + 0.5 * value_loss + entropy_loss
            
            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(list(policy.parameters()) + list(value_net.parameters()), max_norm=0.5)
            optimizer.step()
    return policy_loss.item(), value_loss.item()
Clipping Is Not a Silver Bullet
PPO's clipping prevents large updates but doesn't guarantee monotonic improvement. If the advantage estimates are noisy or biased, the clipped objective can still lead to performance degradation. Always monitor the actual KL divergence and the unclipped ratio distribution.
Production Insight
In production, we run PPO with 16 parallel environments, each collecting 128 steps per iteration. We use a shared policy-value network with two hidden layers of 64 units and tanh activations. The learning rate is linearly decayed from 3e-4 to 0 over training. We clip gradients at 0.5 and normalize observations using a running mean and variance. The most common failure mode is the policy collapsing to a near-deterministic policy early, which we mitigate with a higher entropy bonus (0.05) and a larger clip range (0.25) during the first 10% of training.
Key Takeaway
PPO replaces TRPO's KL constraint with a clipped surrogate objective, making it first-order and production-friendly. The clip range ε=0.2 is a robust default. Key production practices: advantage normalization, gradient clipping, early stopping on KL divergence, and entropy bonus for exploration.

Policy Gradients in the Wild: RLHF, Robotics, and Continuous Control

Reinforcement Learning from Human Feedback (RLHF) is the most prominent real-world application of policy gradients, powering systems like ChatGPT and Claude. In RLHF, a reward model is trained from human preferences, then a policy is optimized via PPO against that reward model. The policy is initialized from a supervised fine-tuned (SFT) model. A KL penalty is added to prevent the policy from diverging too far from the SFT model, giving the objective J(θ) = E[r_φ(x, y)] - β * KL(π_θ || π_SFT), where r_φ is the reward model's score for response y to prompt x. Typical β values are 0.01-0.1. The reward model is a separate transformer that outputs a scalar reward. PPO is run with a value function that predicts the expected return, and the KL penalty acts as a soft trust region around the SFT model.
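One way the KL penalty is often folded into the reward signal is per token: each generated token receives -β times its log-probability gap to the SFT model, and the reward model's scalar score is added on the final token. The sketch below assumes that scheme; the article's formula only specifies the sequence-level objective, and `kl_shaped_rewards` is a hypothetical helper name.

```python
from typing import List, Sequence


def kl_shaped_rewards(
    policy_log_probs: Sequence[float],
    sft_log_probs: Sequence[float],
    sequence_reward: float,
    beta: float = 0.05,
) -> List[float]:
    """Per-token rewards for one response.

    Each token gets -beta * (log pi(token) - log pi_SFT(token)); the scalar
    reward-model score is added to the last token only.
    """
    rewards = [
        -beta * (lp - ref_lp)
        for lp, ref_lp in zip(policy_log_probs, sft_log_probs)
    ]
    rewards[-1] += sequence_reward
    return rewards


# Two-token response: the policy matches SFT on token 1 and is more confident on token 2.
print(kl_shaped_rewards([-1.0, -2.0], [-1.0, -3.0], sequence_reward=2.0, beta=0.1))
```

Per-token rewards of this form can be passed to an advantage estimator such as GAE, so that credit for drifting away from the SFT model is assigned to the specific tokens that caused it.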

Robotics applications use policy gradients for continuous control tasks like manipulation and locomotion. Here, PPO is the dominant algorithm due to its stability and how easily it parallelizes across simulated environments. Typical setups use proprioceptive observations (joint angles, velocities) and action spaces of 6-30 dimensions. The policy is a small MLP (2-3 hidden layers, 64-256 units) with tanh or ReLU activations. Training requires millions of environment steps, often in simulation (MuJoCo, Isaac Gym) before transfer to real hardware. Domain randomization is critical: randomizing physics parameters (mass, friction, damping) during training improves sim-to-real transfer.
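Domain randomization usually amounts to resampling simulator parameters at every episode reset. The sketch below is simulator-agnostic. `PhysicsParams` and `randomize_physics` are hypothetical names, and the ±20% damping range is an assumption chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    mass: float
    friction: float
    damping: float


def randomize_physics(
    nominal: PhysicsParams,
    rng: random.Random,
    mass_range: float = 0.2,      # +/- 20% around nominal
    friction_range: float = 0.5,  # +/- 50% around nominal
    damping_range: float = 0.2,   # assumed range, not from a specific system
) -> PhysicsParams:
    """Sample a perturbed copy of the nominal parameters for one episode."""

    def scale(value: float, rel: float) -> float:
        return value * rng.uniform(1.0 - rel, 1.0 + rel)

    return PhysicsParams(
        mass=scale(nominal.mass, mass_range),
        friction=scale(nominal.friction, friction_range),
        damping=scale(nominal.damping, damping_range),
    )


rng = random.Random(0)  # seeded so runs are reproducible
nominal = PhysicsParams(mass=1.0, friction=1.0, damping=0.1)
episode_params = randomize_physics(nominal, rng)
print(episode_params)
```

The sampled values would then be written into the simulator before each reset; how that is done depends on the simulator's own API.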

Continuous control also sees use of Soft Actor-Critic (SAC), an off-policy actor-critic method built on maximum entropy RL. SAC maximizes both expected return and policy entropy, leading to better exploration and robustness. In likelihood-ratio form, its policy gradient can be written as ∇_θ J(π) = E[∇_θ log π_θ(a|s) * (Q(s,a) - α log π_θ(a|s) - V(s))], where α is the temperature parameter and V(s) acts as a baseline; most implementations instead differentiate through the sampled action with the reparameterization trick. SAC is typically more sample-efficient than PPO on continuous control benchmarks but can be more sensitive to hyperparameters. In production robotics, PPO is preferred for its stability, while SAC is used when sample efficiency is paramount.

Other wild applications include autonomous driving (learning lane-changing policies), game playing (Dota 2, StarCraft II with PPO variants), and recommendation systems (optimizing user engagement metrics). In recommendation, the policy selects items to show, rewards are click-through rates or session length, and the state is the user's history. The challenge is the huge action space (millions of items), requiring techniques like candidate generation and policy distillation. Policy gradients are also used in neural architecture search, where the policy proposes network architectures and the reward is validation accuracy.

io/thecodeforge/rlhf_ppo_training.pyPYTHON
import torch
import torch.nn.functional as F


def rlhf_ppo_step(policy, value_net, reward_model, sft_model, prompts, responses, kl_beta=0.05, clip_eps=0.2):
    # Rollout-time quantities: no gradients flow through any of these.
    # get_log_probs is assumed to return one sequence log-prob per response.
    with torch.no_grad():
        rewards = reward_model(prompts, responses)  # shape: (batch,)
        old_log_probs = policy.get_log_probs(prompts, responses)
        sft_log_probs = sft_model.get_log_probs(prompts, responses)
        # Single-sample estimate of KL(pi || pi_SFT) for each response
        kl_est = old_log_probs - sft_log_probs
        penalized_rewards = rewards - kl_beta * kl_est

    # Simplified advantage: penalized reward minus the value baseline.
    # In practice, use GAE over per-token rewards instead.
    values = value_net(prompts, responses).squeeze(-1)
    advantages = penalized_rewards - values.detach()
    returns = penalized_rewards

    # PPO clipped surrogate. On the first pass over a batch the ratio is 1;
    # it moves away from 1 on later epochs over the same batch.
    new_log_probs = policy.get_log_probs(prompts, responses)
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value loss
    value_loss = F.mse_loss(values, returns)

    # Total loss
    total_loss = policy_loss + 0.5 * value_loss
    return total_loss
KL Penalty in RLHF Is a Trust Region
The KL penalty in RLHF serves the same purpose as TRPO's constraint: preventing the policy from straying too far from the SFT model. The β hyperparameter controls the trade-off between reward optimization and distributional shift. Typical values are tuned via ablation studies.
Production Insight
In RLHF production systems, we found that the reward model is the bottleneck. It must be calibrated and debiased; otherwise, the policy exploits reward model artifacts. We use ensemble reward models and take the minimum reward to be conservative. For robotics, domain randomization is non-negotiable: we randomize mass by ±20%, friction by ±50%, and add actuator noise. Without it, policies fail on real hardware.
Key Takeaway
Policy gradients power RLHF for language models, robotics for continuous control, and recommendation systems. RLHF uses a KL penalty to stay close to the SFT model. Robotics relies on PPO with domain randomization for sim-to-real transfer. SAC offers better sample efficiency but PPO is more stable.

Debugging and Monitoring Policy Gradient Training in Production

Policy gradient training is notoriously brittle in production. The first thing to monitor is the reward distribution: track mean, median, min, max, and standard deviation over recent episodes. A sudden drop in mean reward often indicates policy collapse, while increasing variance suggests instability. Log the KL divergence between the current and previous policy every update. A KL above 0.02-0.05 is a red flag; above 0.1 usually means the update is too large and the policy is jumping to a bad region. Also monitor the entropy of the policy: for discrete actions, entropy should stay above a minimum threshold (e.g., 0.5 for 10 actions); for continuous actions, the log std should not collapse to very negative values (e.g., below -5).
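For discrete policies, an entropy threshold is easier to interpret as a fraction of the maximum possible entropy, ln(n) for n actions. A small sketch follows; both function names are illustrative.

```python
import math
from typing import Sequence


def categorical_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a discrete distribution; zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_fraction(probs: Sequence[float]) -> float:
    """Entropy as a fraction of the uniform-policy maximum, ln(number of actions)."""
    return categorical_entropy(probs) / math.log(len(probs))


uniform = [0.1] * 10
peaked = [0.91] + [0.01] * 9
print(categorical_entropy(uniform))  # maximum entropy for 10 actions, ln(10)
print(categorical_entropy(peaked))   # close to the 0.5-nat threshold
print(entropy_fraction(peaked))
```

For 10 actions the maximum is about 2.30 nats, so a floor of 0.5 nats corresponds to roughly a fifth of the uniform-policy entropy.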

Advantage statistics are critical. Track the mean and standard deviation of advantages across the batch. If advantages are consistently positive or negative, the value function is biased. The advantage distribution should be roughly zero-mean and unit variance after normalization. Monitor the value function loss: if it spikes, the value network is not keeping up with the changing policy. Use a separate validation set of trajectories to compute the value function's prediction error. Also track the explained variance (EV) of the value function: EV = 1 - Var(returns - values) / Var(returns). EV above 0.9 is good; below 0.5 indicates the value function is not learning.
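The explained-variance formula is easy to sanity-check on toy numbers. This sketch uses only the standard library. The function name and the choice to return NaN for constant returns are assumptions of the example.

```python
from statistics import pvariance
from typing import Sequence


def explained_variance(returns: Sequence[float], values: Sequence[float]) -> float:
    """EV = 1 - Var(returns - values) / Var(returns).

    1.0 means perfect predictions, 0.0 is no better than predicting a constant,
    and negative values are worse than a constant.
    """
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        return float("nan")  # undefined when every return is identical
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0]
print(explained_variance(returns, [1.0, 2.0, 3.0]))  # perfect critic
print(explained_variance(returns, [2.0, 2.0, 2.0]))  # constant critic
print(explained_variance(returns, [3.0, 2.0, 1.0]))  # anti-correlated critic
```

A negative value is a stronger warning than a low positive one: the critic is then actively adding noise to the advantage estimates rather than removing it.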

Gradient statistics provide early warning of numerical issues. Log the gradient norm before and after clipping. A gradient norm that grows over time suggests the loss landscape is becoming steep, often due to policy collapse. Monitor the fraction of samples that are clipped in PPO: if more than 50% are clipped, the updates are too large for the clip range (lower the learning rate or the number of epochs before widening ε); if less than 5% are clipped, the clip range is rarely binding. As a rule of thumb, a clipping rate of 10-20% is a reasonable target. Also monitor the learning rate and adjust it if the loss plateaus. Use a learning rate scheduler (e.g., linear decay or cosine annealing) and log the current LR.
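Both schedules mentioned here reduce to a few lines of arithmetic. The sketch below computes the learning rate as a function of training progress; the function names are illustrative. In a real loop the result would be written into the optimizer's parameter groups, or an equivalent built-in scheduler would be used.

```python
import math


def _progress(step: int, total_steps: int) -> float:
    """Fraction of training completed, clamped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)


def linear_decay_lr(step: int, total_steps: int,
                    initial_lr: float = 3e-4, final_lr: float = 0.0) -> float:
    """Straight-line interpolation from initial_lr to final_lr."""
    frac = _progress(step, total_steps)
    return initial_lr + frac * (final_lr - initial_lr)


def cosine_annealing_lr(step: int, total_steps: int,
                        initial_lr: float = 3e-4, final_lr: float = 0.0) -> float:
    """Half-cosine from initial_lr down to final_lr."""
    frac = _progress(step, total_steps)
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * frac))


for step in (0, 250, 500, 750, 1000):
    print(step, linear_decay_lr(step, 1000), cosine_annealing_lr(step, 1000))
```

Cosine annealing holds the rate near its initial value for longer and then drops faster through the middle of training; linear decay is the simpler default.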

Infrastructure monitoring is equally important. Track environment step throughput (steps/second), which should remain stable. A drop in throughput indicates a bottleneck in environment simulation or data transfer. Monitor memory usage: policy gradient training stores trajectories in replay buffers that can grow large. For long-horizon tasks, the buffer can exceed GPU memory, requiring offloading to CPU or disk. Use checkpointing every N updates (e.g., 100) to save policy and value network weights. Implement automatic recovery: if the reward drops below a threshold for K consecutive evaluations, reload the best checkpoint and reduce the learning rate. Finally, set up alerts for NaN or Inf gradients, which indicate numerical instability that requires immediate intervention.
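The automatic-recovery rule can be kept separate from the training code as a small state machine. This is a sketch under stated assumptions: `CollapseGuard` and `RecoveryDecision` are hypothetical names, and the actual saving and loading of checkpoints is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class RecoveryDecision:
    save_best: bool       # caller should save a new "best" checkpoint
    rollback: bool        # caller should reload the best checkpoint
    learning_rate: float  # learning rate to use from now on


class CollapseGuard:
    """Signals a rollback after `patience` consecutive evaluations below a threshold."""

    def __init__(self, reward_threshold: float, patience: int, lr_factor: float = 0.5):
        self.reward_threshold = reward_threshold
        self.patience = patience
        self.lr_factor = lr_factor
        self.best_reward = float("-inf")
        self.bad_evals = 0

    def step(self, eval_reward: float, learning_rate: float) -> RecoveryDecision:
        save_best = eval_reward > self.best_reward
        if save_best:
            self.best_reward = eval_reward
        # Count consecutive bad evaluations; any good one resets the streak.
        self.bad_evals = self.bad_evals + 1 if eval_reward < self.reward_threshold else 0
        rollback = self.bad_evals >= self.patience
        if rollback:
            self.bad_evals = 0
            learning_rate *= self.lr_factor
        return RecoveryDecision(save_best, rollback, learning_rate)


guard = CollapseGuard(reward_threshold=10.0, patience=2)
lr = 3e-4
for reward in (50.0, 5.0, 4.0):
    decision = guard.step(reward, lr)
    lr = decision.learning_rate
    print(reward, decision)
```

Resetting the streak after a rollback gives the reloaded policy a full `patience` window before the guard can fire again.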

io/thecodeforge/ppo_monitoring.pyPYTHON
import numpy as np
import torch


def log_training_metrics(policy, value_net, trajectories, advantages, old_log_probs, new_log_probs, clip_eps=0.2):
    # Assumed layout: each trajectory is a dict with 'reward' (episode reward),
    # 'state' (T x obs_dim tensor), and 'return' (length-T tensor).
    metrics = {}

    # Reward statistics
    rewards = np.array([t['reward'] for t in trajectories])
    metrics['reward_mean'] = np.mean(rewards)
    metrics['reward_median'] = np.median(rewards)
    metrics['reward_std'] = np.std(rewards)
    metrics['reward_min'] = np.min(rewards)
    metrics['reward_max'] = np.max(rewards)

    # Use every collected step, not just the first trajectory
    states = torch.cat([t['state'] for t in trajectories])
    returns = torch.cat([t['return'] for t in trajectories])

    with torch.no_grad():
        # Approximate KL(pi_old || pi_new). The samples come from pi_old, so the
        # estimator is E[log pi_old - log pi_new]; the reverse order flips the sign.
        log_ratio = new_log_probs - old_log_probs
        metrics['kl_divergence'] = (-log_ratio).mean().item()

        # Clipping rate: fraction of samples whose ratio leaves the clip range
        ratio = torch.exp(log_ratio)
        clip_mask = (ratio < 1.0 - clip_eps) | (ratio > 1.0 + clip_eps)
        metrics['clip_rate'] = clip_mask.float().mean().item()

        # Policy entropy averaged over all visited states
        metrics['policy_entropy'] = policy.get_entropy(states).mean().item()

        # Value function explained variance
        values = value_net(states).squeeze(-1)
        ev = 1 - torch.var(returns - values) / torch.var(returns)
        metrics['explained_variance'] = ev.item()

    # Advantage statistics
    adv = advantages.detach().cpu().numpy()
    metrics['advantage_mean'] = np.mean(adv)
    metrics['advantage_std'] = np.std(adv)

    # Gradient norm: clip_grad_norm_ returns the total norm before clipping,
    # so log its return value in the update step rather than here.
    return metrics

# Example usage in training loop
# metrics = log_training_metrics(policy, value_net, batch, advantages, old_log_probs, new_log_probs)
# if metrics['kl_divergence'] > 0.05:
#     print(f"Warning: KL divergence {metrics['kl_divergence']:.3f} exceeds threshold")
# if metrics['clip_rate'] > 0.5:
#     print(f"Warning: Clip rate {metrics['clip_rate']:.2f} too high, consider a lower learning rate or fewer epochs")
Production Insight
We run a dashboard with real-time plots of all metrics. The most important alert is KL divergence > 0.1, which triggers an automatic rollback to the previous checkpoint and a 50% reduction in learning rate. We also monitor the ratio of NaN gradients: if it exceeds 1% of batches, we halt training and dump the last 100 trajectories for debugging. The second most common issue is the value function lagging behind the policy, which we detect by a sudden drop in explained variance below 0.3.
Key Takeaway
Monitor reward distribution, KL divergence, policy entropy, advantage statistics, clipping rate, and explained variance. Set up alerts for KL > 0.05, clip rate > 50%, or explained variance < 0.3. Use automatic rollback and learning rate reduction on reward collapse. Infrastructure monitoring (throughput, memory) is equally important.
● Production incident · POST-MORTEM · severity: high

The PPO Training That Kept Crashing: A Tale of Unnormalized Advantages

Symptom
Training loss would initially decrease, then suddenly spike to NaN around step 10,000. The policy would collapse to deterministic actions, and the robot would stop moving.
Assumption
The team assumed the issue was a bug in the neural network architecture or learning rate, spending weeks tuning hyperparameters.
Root cause
The advantage values were not normalized to zero mean and unit variance before computing the PPO loss. As the policy improved, advantages became larger in magnitude, causing the clipped surrogate objective to produce extreme gradients that destabilized training.
Fix
Added advantage normalization across each minibatch: advantages = (advantages - mean(advantages)) / (std(advantages) + 1e-8). This stabilized training immediately.
Key lesson
  • Always normalize advantages in PPO and other actor-critic methods.
  • When training diverges, check gradient statistics and advantage distributions before blaming architecture.
  • Implement monitoring for gradient norms and advantage statistics as early warning signals.
Production debug guide · Common symptoms and immediate actions for policy gradient failures · 4 entries
Symptom · 01
Loss goes to NaN after a few thousand steps
Fix
Check for exploding gradients: reduce learning rate, add gradient clipping (max_norm=0.5), and verify advantage normalization.
Symptom · 02
Policy becomes deterministic early, no exploration
Fix
Check entropy bonus: if using PPO, ensure entropy coefficient is positive (e.g., 0.01). Monitor policy entropy over time.
Symptom · 03
Training loss decreases but episode reward plateaus
Fix
Check for reward scaling issues: rescale rewards to a consistent range (for example, divide by a running standard deviation of returns). Also verify that the value function loss is not dominating the total loss.
Symptom · 04
PPO clipped fraction is always near 0 or 1
Fix
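If the clipped fraction is near 0, the clip never binds: either the policy is barely moving (raise the learning rate or the number of epochs) or ε is too loose (try a smaller ε). If it is near 1, updates are too large: reduce the learning rate or the number of epochs per batch.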
★ Policy Gradient Quick Debug Cheat Sheet · Immediate actions for the most common policy gradient training issues
Exploding gradients (loss → NaN)
Immediate action
Reduce LR by 10x, enable gradient clipping
Commands
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
Fix now
Normalize advantages and rewards to zero mean, unit variance.
Policy collapses to deterministic
Immediate action
Increase entropy coefficient
Commands
entropy_coef = 0.05 # try values 0.01-0.1
loss = policy_loss + value_loss * vf_coef - entropy * entropy_coef
Fix now
Add entropy regularization and monitor policy entropy (should stay > 0.5 * initial).
PPO clipped fraction = 0
Immediate action
Decrease clipping epsilon
Commands
clip_epsilon = 0.1 # default 0.2
clipped_ratio = (probs_new / probs_old).clamp(1 - clip_epsilon, 1 + clip_epsilon)
Fix now
If the policy is barely moving, increase the learning rate or the number of epochs per batch.
Policy Gradient Algorithm Comparison
Algorithm | Update Type | Variance Reduction | Constraint | Sample Efficiency
REINFORCE | Monte Carlo | None (raw returns) | None | Low
Actor-Critic (A2C) | TD learning | Value function baseline | None | Medium
TRPO | Natural gradient | GAE | KL divergence constraint | Medium-High
PPO | Clipped surrogate | GAE | Clipped ratio | High
SAC | Off-policy | Entropy regularization | None | Very High

Key takeaways

1. Policy gradients directly optimize policy parameters; a value function is not required to define the policy, though one is usually added as a baseline.
2. REINFORCE is the simplest policy gradient method but suffers from high variance.
3. Variance reduction techniques like baselines, GAE, and actor-critic architectures are essential for stable training.
4. Trust region methods (TRPO, PPO) constrain policy updates to prevent performance collapse.
5. PPO is the most widely used policy gradient algorithm in production due to its simplicity and robustness.
6. Policy gradients are the foundation for RLHF, enabling alignment of large language models.

Common mistakes to avoid

4 patterns
×

Using raw returns without advantage normalization

Symptom
Training diverges or gradients explode, especially with long episodes.
Fix
Always use advantage estimation (e.g., GAE) and normalize advantages to zero mean and unit variance.
×

Ignoring the discount factor in the gradient estimate

Symptom
The policy gradient is biased, leading to suboptimal policies in continuing tasks.
Fix
Ensure the discount factor γ is applied correctly in the return computation and advantage estimation.
×

Setting the PPO clipping parameter ε too large

Symptom
Policy updates are too aggressive, causing performance collapse and instability.
Fix
Start with ε=0.2 and tune based on the observed KL divergence between old and new policies.
×

Not using a learned value function (critic) for advantage estimation

Symptom
High variance in gradient estimates, slow convergence, and poor sample efficiency.
Fix
Implement an actor-critic architecture with a shared or separate value network trained with MSE loss.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Derive the policy gradient theorem and explain how it leads to the REINF...
Q02 · SENIOR
Explain the difference between on-policy and off-policy policy gradient ...
Q03 · SENIOR
How does PPO's clipped surrogate objective prevent large policy updates?
Q01 of 03 · SENIOR

Derive the policy gradient theorem and explain how it leads to the REINFORCE algorithm.

ANSWER
The policy gradient theorem states that ∇_θ J(θ) = E_π_θ[∇_θ log π_θ(a|s) Q_π(s,a)]. The proof uses the likelihood ratio (log-derivative) trick, ∇_θ π_θ = π_θ ∇_θ log π_θ, to write the gradient as an expectation under the policy. REINFORCE approximates Q_π(s_t,a_t) with the Monte Carlo return from the current trajectory, yielding the update θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t, where G_t is the discounted return from step t onward. This is unbiased but high-variance.
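The Monte Carlo return G_t in this answer is a backward cumulative sum of rewards. A minimal sketch follows, with illustrative function names. The scalar surrogate loss shown is the quantity whose gradient an autodiff framework would take when the log-probabilities are tensors; here plain floats are used only to show the arithmetic.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_surrogate_loss(log_probs: Sequence[float], returns: Sequence[float]) -> float:
    """Negative of sum_t log pi(a_t|s_t) * G_t, so minimizing it ascends J(theta)."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))


G = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(G)
print(reinforce_surrogate_loss([-0.5, -1.0, -2.0], G))
```

Every G_t is a single noisy sample of the action value, which is where the high variance of REINFORCE comes from; subtracting a baseline from G_t before the multiplication leaves the expected gradient unchanged.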
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the main advantage of policy gradient methods over value-based methods?
02
Why does REINFORCE have high variance?
03
How does PPO improve upon TRPO?
04
What is the role of the advantage function in policy gradients?