TD Learning: SARSA vs Q-Learning – On-Policy vs Off-Policy Control
Master Temporal Difference learning with a production-focused comparison of SARSA and Q-Learning.
20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.
- TD learning bootstraps value estimates from future predictions, blending Monte Carlo and DP.
- SARSA is on-policy: learns action values from the policy being followed.
- Q-Learning is off-policy: learns optimal action values independent of the behavior policy.
- Both use TD(0) updates but differ in the target action selection.
- Q-Learning can overestimate values; Double Q-Learning mitigates this.
- In production, SARSA is safer for risky environments; Q-Learning is more sample-efficient.
Imagine you're learning to cook. TD learning is like tasting the soup as you go and adjusting the recipe based on how you think it will turn out, not waiting until it's fully cooked. SARSA is like adjusting based on the actual next step you take, while Q-Learning adjusts based on the best possible next step you could take, even if you don't take it.
Reinforcement learning has moved from Atari games to production systems: recommendation engines, autonomous driving, and real-time bidding. At the core of many modern RL algorithms lies Temporal Difference (TD) learning, the method that finally made learning from incomplete episodes practical. Without TD, you'd be stuck waiting for terminal states to update your policy—unacceptable in continuous environments.
Two of the most fundamental TD control algorithms are SARSA and Q-Learning. They look almost identical on paper, but that tiny difference in the update rule—whether you use the next action from the current policy or the greedy action—has massive implications for convergence, safety, and sample efficiency. Understanding this distinction is not academic; it determines whether your agent learns to drive safely or to crash spectacularly.
In 2026, with RL being deployed in high-stakes domains like healthcare and finance, choosing the wrong algorithm can lead to catastrophic failures. This article dissects SARSA and Q-Learning from first principles, compares their behavior in production, and provides concrete debugging guidance for when things go wrong.
We'll cover the math, the intuition, the common pitfalls, and the war stories. By the end, you'll know exactly which algorithm to pick and how to diagnose issues when your agent isn't learning.
Foundations: What is Temporal Difference Learning?
Temporal Difference (TD) learning is the backbone of modern reinforcement learning. It combines ideas from Monte Carlo methods and dynamic programming. Like Monte Carlo, TD learns directly from raw experience without a model of the environment. Like dynamic programming, it updates estimates based on other learned estimates, a process called bootstrapping. The key innovation is that TD updates its value estimates after every time step, not at the end of an episode. Monte Carlo, by contrast, must wait for a terminal state before it can update anything. In practice, TD often converges considerably faster than Monte Carlo because it uses the information in each transition immediately, although the size of the speedup depends on the task.
The core mechanism is simple: you observe a transition from state S_t to S_{t+1}, receive reward R_{t+1}, and immediately update your estimate of V(S_t) using the current estimate of V(S_{t+1}). This is the TD update rule. The difference between the observed reward plus discounted next-state value and the current value is the TD error. A positive TD error means the current state was better than expected; a negative one means it was worse. This error signal closely resembles the reward prediction error that neuroscientists have observed in dopamine neurons of the ventral tegmental area and substantia nigra. The biological plausibility of TD learning is one reason it's so compelling.
Consider the classic weather prediction example: you want to predict Saturday's weather. Monte Carlo would wait until Saturday, see the actual weather, and then adjust all your daily predictions. TD, however, would adjust Friday's prediction based on your Saturday prediction, which itself gets adjusted later. This bootstrapping allows learning to propagate backward through time much faster. In reinforcement learning, this means an agent can learn from a single step of experience rather than waiting for the episode to end. This is critical for continuing tasks or long-horizon problems where episodes may never terminate.
The mathematical foundation rests on the Bellman equation for a fixed policy π: V^π(s) = E_π[R_{t+1} + γV^π(S_{t+1}) | S_t = s]. TD learning uses a sample of this expectation: the actual reward R_{t+1} and the current estimate V(S_{t+1}). This is a stochastic approximation to the Bellman operator. Under standard conditions (decreasing learning rates, infinite visits to each state), TD(0) converges to the true value function with probability 1. The convergence proof relies on the fact that the TD update is a contraction mapping in expectation, similar to dynamic programming but with sampling noise.
In production systems, TD learning is the foundation for algorithms like DQN, which uses a neural network to approximate the Q-function and updates it with TD targets. The sample efficiency of TD is what makes deep RL feasible on real-world problems like robotics, game playing, and recommendation systems. Without bootstrapping, Monte Carlo methods would require orders of magnitude more experience, making them impractical for most applications.
The TD(0) Algorithm: Update Rule and Intuition
TD(0) is the simplest temporal difference learning algorithm. It estimates the state-value function V(s) for a given policy π. The update rule is: V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]. The term in brackets is the TD error δ_t. The learning rate α ∈ (0,1] controls how much we adjust toward the target. The TD target R_{t+1} + γV(S_{t+1}) is a biased estimate of the true value because it uses the current estimate V(S_{t+1}), but this bias is what enables bootstrapping. The variance is lower than Monte Carlo returns because we don't wait for the full return.
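The update rule above can be written as a few lines of code. This is a minimal sketch, assuming a tabular value function stored in a dict; the function name `td0_update` and the `terminal` flag are illustrative choices, not something the article prescribes.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to the table V in place and return the TD error."""
    # A terminal successor contributes no future value.
    next_value = 0.0 if terminal else V[s_next]
    td_error = r + gamma * next_value - V[s]   # delta_t
    V[s] += alpha * td_error                   # move V(s) toward the TD target
    return td_error


# One transition A -> B with reward 0.5 (illustrative numbers).
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print("TD error:", delta, "new V(A):", V["A"])
```

The TD target here is 0.5 + 0.9 * 1.0, so V(A) moves one tenth of the way from 0 toward it.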
The intuition is straightforward: when you move from state S_t to S_{t+1} and receive reward R_{t+1}, you have a new data point. If V(S_t) is too low compared to R_{t+1} + γV(S_{t+1}), you increase it; if too high, you decrease it. Over many updates, V converges to the true value function. The algorithm is online and incremental—it processes one transition at a time and discards it. This makes it memory-efficient and suitable for streaming data.
Consider a simple random walk with 5 states. States 0 and 4 are terminal: reaching state 0 yields reward 0 and reaching state 4 yields reward 1. From every other state the agent moves left or right with equal probability. With γ=1, the true values of the non-terminal states 1, 2, and 3 are 0.25, 0.5, and 0.75 (the probability of finishing at state 4), while terminal states have value 0 by convention. TD(0) with α=0.1 gets close to these values after about 100 episodes, although a constant α leaves the estimates fluctuating around them. Monte Carlo typically needs more episodes for similar accuracy on this task because its full-episode returns have higher variance.
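The random walk is small enough to simulate directly. The sketch below assumes episodes start in the middle state and deliberately uses a smaller step size and more episodes than the figures quoted above (α=0.02, 5000 episodes), so that the estimates settle tightly around the true values; both choices, and the function name, are assumptions made for illustration.

```python
import random


def run_random_walk_td0(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values of the 5-state random walk with tabular TD(0)."""
    rng = random.Random(seed)
    # Entries 0 and 4 are terminal and stay at 0; the rest start at 0.5.
    V = [0.0, 0.5, 0.5, 0.5, 0.0]
    for _ in range(episodes):
        s = 2  # assumed start state: the middle of the chain
        while s not in (0, 4):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 4 else 0.0
            # V[s_next] is 0 for terminal states, so no special case is needed.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


values = run_random_walk_td0()
print("estimated V(1..3):", [round(v, 2) for v in values[1:4]])
print("true V(1..3):     ", [0.25, 0.5, 0.75])
```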
The convergence proof for TD(0) relies on the Robbins-Monro conditions for stochastic approximation: Σα_t = ∞ and Σα_t² < ∞. In practice, a constant small α (e.g., 0.01 or 0.001) works well for stationary problems. For non-stationary environments, a constant α is actually preferred because it allows the algorithm to track changes. The choice of α is a critical hyperparameter: too large causes oscillation, too small leads to slow learning.
In deep RL, TD(0) is the foundation for DQN and its variants. The Q-network is trained to minimize the mean squared TD error: (r + γ max_a' Q(s',a') - Q(s,a))². This is essentially TD(0) applied to action-values with a neural network. The key difference is that we use a target network to compute the TD target, which stabilizes training. Without this, the moving target problem causes divergence. Modern implementations also use experience replay to break correlations in the data, which is another form of making TD learning work at scale.
From Value Functions to Control: Introducing SARSA
SARSA extends TD learning from value estimation to control—learning an optimal policy. The name comes from the tuple (State, Action, Reward, next State, next Action) used in the update. SARSA is an on-policy algorithm: it learns the value of the policy it's currently following. The update rule is: Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]. Notice that the TD target uses the next action A_{t+1} chosen by the current policy. This means SARSA evaluates and improves the same policy that generates the data.
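As a sketch of the SARSA rule, assuming a Q-table stored in a `defaultdict` keyed by (state, action); the function name and the toy states are hypothetical.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0, terminal=False):
    """One SARSA update: bootstrap from the action the policy actually chose next."""
    next_q = 0.0 if terminal else Q[(s_next, a_next)]
    td_error = r + gamma * next_q - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)
Q[("s1", "left")] = -5.0    # a poor action in the next state
Q[("s1", "right")] = -1.0   # the best action in the next state
# Suppose the behaviour policy explored and picked "left" in s1.
delta = sarsa_update(Q, "s0", "right", r=-1.0, s_next="s1", a_next="left")
print("TD error:", delta, "Q(s0, right):", Q[("s0", "right")])
```

Because the next action was the exploratory "left", the target is -1 + (-5) = -6, so the value of the preceding state-action pair is pulled down by the exploration that really happened.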
The on-policy nature of SARSA has important implications for exploration. Because the update uses the actual next action taken, SARSA learns a Q-function that accounts for the exploration policy. If the policy is ε-greedy with ε=0.1, SARSA learns values that assume the agent will explore 10% of the time. This makes SARSA more conservative than Q-learning in environments where exploratory actions can be costly. In the classic Cliff Walking problem, SARSA learns a safer path that stays away from the cliff, while Q-learning learns the optimal path along the edge and, because it keeps taking ε-greedy actions there, falls off more often during training.
SARSA's convergence properties are well understood for tabular settings. Under standard conditions (GLIE: greedy in the limit with infinite exploration, together with Robbins-Monro step sizes), SARSA converges to the optimal Q-function with probability 1. The GLIE condition requires that every state-action pair is visited infinitely often and that the exploration rate decays to zero over time, typically with ε decreasing as 1/t or similar. In practice, a common schedule is ε = max(0.01, 1.0 - episode/total_episodes). The 0.01 floor means this schedule is not strictly GLIE, so the learned values reflect an ε-soft policy rather than the fully greedy optimum, but it ensures the agent explores enough early on and is nearly greedy later.
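The two schedules mentioned above can be compared side by side. This is a small sketch with hypothetical function names; note that the linear schedule with a floor never reaches zero, so only the 1/k schedule decays in the way GLIE requires.

```python
def linear_epsilon(episode, total_episodes, floor=0.01):
    """Linear decay from 1.0 down to a fixed floor (never reaches zero)."""
    return max(floor, 1.0 - episode / total_episodes)


def glie_epsilon(episode):
    """1/k decay, counting episodes from 0, which tends to zero in the limit."""
    return 1.0 / (episode + 1)


for ep in (0, 250, 500, 999):
    print(ep, round(linear_epsilon(ep, 1000), 3), round(glie_epsilon(ep), 4))
```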
For function approximation, SARSA is more stable than Q-learning because it doesn't use the max operator. The max operator in Q-learning introduces a positive bias (maximization bias), which can cause overestimation and instability. SARSA avoids this by using the actual next action. However, SARSA's on-policy nature means it's less sample-efficient—it can't reuse data from old policies. This is a fundamental trade-off: on-policy methods are more stable but less sample-efficient; off-policy methods are more sample-efficient but harder to stabilize.
In production, SARSA is useful when you want a conservative agent that accounts for exploration noise. For example, in robotics, you might prefer a policy that stays safe even when exploring. SARSA with ε=0.01 will learn a policy that occasionally takes random actions but still performs well. This is in contrast to Q-learning, which might learn a policy that assumes no exploration and then fails when exploration actually happens. The choice between SARSA and Q-learning depends on whether you can control the exploration policy at deployment time.
Off-Policy Learning: Q-Learning and the Bellman Optimality Equation
Q-learning is the most influential off-policy TD control algorithm. It directly approximates the optimal action-value function Q* regardless of the policy being followed. The update rule is: Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]. The key difference from SARSA is the use of max_a Q(S_{t+1}, a) instead of Q(S_{t+1}, A_{t+1}). This means Q-learning uses the greedy action for the next state, not the action actually taken. This is what makes it off-policy: it learns about the optimal policy while following a different (exploratory) policy.
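A minimal sketch of the Q-learning rule, assuming a `defaultdict` Q-table keyed by (state, action) and an explicit list of actions; the names and the toy states are hypothetical.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0, terminal=False):
    """One Q-learning update: bootstrap from the best next action, taken or not."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)
Q[("s1", "left")] = -5.0
Q[("s1", "right")] = -1.0
# The target ignores whichever action the behaviour policy picks in s1.
delta = q_learning_update(Q, "s0", "right", r=-1.0, s_next="s1",
                          actions=["left", "right"])
print("TD error:", delta, "Q(s0, right):", Q[("s0", "right")])
```

Here the target is -1 + max(-5, -1) = -2 no matter what the agent does next, which is exactly what makes the update off-policy.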
The Bellman optimality equation for Q* is: Q*(s,a) = E[R_{t+1} + γ max_a' Q*(S_{t+1}, a') | S_t=s, A_t=a]. Q-learning's update is a sample-based approximation of this equation. By taking the max over next actions, Q-learning implicitly performs policy improvement at every step. This is why, in the tabular case, it converges to the optimal Q-function under any behavior policy, as long as all state-action pairs are visited infinitely often and the step sizes satisfy the Robbins-Monro conditions. Unlike SARSA, it does not need the behavior policy to become greedy in the limit.
Q-learning's off-policy nature makes it more sample-efficient than SARSA. You can reuse experience from any policy, which is the foundation of experience replay in DQN. The agent stores transitions (s,a,r,s') in a replay buffer and samples them randomly for training. This breaks the temporal correlations between consecutive samples and allows multiple updates per experience. DQN's results on Atari (one architecture evaluated on 49 games, reaching human-comparable scores on more than half of them) demonstrated the power of off-policy TD learning with neural networks.
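A replay buffer of the kind described above can be sketched with the standard library. This is an illustrative minimal version, assuming uniform sampling and a fixed capacity; the class name and the extra `done` flag are assumptions, not DQN's exact implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
print(len(buf), "stored;", "sampled:", buf.sample(2))
```

With a capacity of 3 and five pushes, only the three most recent transitions survive, which is the same mechanism that silently discards old experience when a production buffer is too small.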
However, Q-learning has a well-known flaw: maximization bias. Because max_a Q(s',a) uses the same Q-function for both selection and evaluation, it tends to overestimate the value of actions. This is especially problematic in stochastic environments where the max over noisy estimates is biased upward. Double Q-learning addresses this by maintaining two separate Q-functions and using one for action selection and the other for evaluation. The update becomes: Q_1(s,a) ← Q_1(s,a) + α[r + γ Q_2(s', argmax_a' Q_1(s',a')) - Q_1(s,a)]. This simple trick substantially reduces the overestimation bias (at the cost of possible underestimation) and often leads to better performance.
In production, Q-learning is the default choice for most RL problems due to its sample efficiency and simplicity. The combination of Q-learning with deep neural networks (DQN) has been applied to everything from game playing to chip design. The key engineering considerations are: (1) use a target network that's updated slowly (e.g., every 1000 steps) to stabilize bootstrapping, (2) use experience replay with a large buffer (e.g., 1e6 transitions), and (3) clip rewards or use reward normalization to keep the Q-values in a reasonable range. Modern variants like Rainbow DQN combine Q-learning with double Q-learning, prioritized replay, dueling networks, and distributional RL to achieve state-of-the-art performance.
SARSA vs Q-Learning: The Cliff Walking Experiment
The canonical cliff walking environment (Sutton & Barto, Example 6.6) exposes the critical behavioral difference between SARSA and Q-Learning. The grid is 4x12, with the start at (3,0) and the goal at (3,11). Stepping into the cliff (row 3, columns 1-10) yields -100 and resets the agent to the start. Every other step costs -1. Both algorithms use tabular Q, ε=0.1, α=0.5, γ=1. After 500 episodes, Q-Learning learns the optimal path along row 2, directly above the cliff, while SARSA learns a longer, safer path through the upper part of the grid.
Q-Learning is off-policy: it updates Q(s,a) ← Q(s,a) + α[R + γ max_a' Q(s',a') - Q(s,a)]. The max operator uses the greedy action, not the one actually taken. This causes it to learn the optimal policy regardless of exploration, but during training it takes risky actions because of ε-greedy behavior. In the cliff walk, Q-Learning's optimal path is right next to the cliff, so when it explores (10% of steps), it frequently falls off, accumulating higher total regret during training.
SARSA is on-policy: Q(s,a) ← Q(s,a) + α[R + γ Q(s',a') - Q(s,a)], where a' is the action actually selected by the current policy (including exploration). This means SARSA learns a policy that accounts for the fact that it will sometimes explore. The resulting path keeps extra rows between the agent and the cliff, trading optimality for safety during training. The learned policy is close to the best ε-soft policy rather than the greedy optimum.
Empirically, after 500 episodes, Q-Learning achieves an average return of about -50 per episode (due to falls during exploration), while SARSA achieves about -20. But if you evaluate the learned policies greedily (ε=0), Q-Learning's policy is optimal (-13 per episode) while SARSA's is suboptimal (-17 per episode). This is the fundamental trade-off: Q-Learning learns the optimal policy but suffers during training; SARSA learns a safer policy that accounts for its own exploration noise.
In production, this distinction matters when you cannot afford catastrophic failures during training. If you're training a robot that can break, SARSA's conservatism is a feature, not a bug. If you're training in a simulator where exploration cost is zero, Q-Learning's convergence to the optimal greedy policy wins.
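The experiment described in this section is small enough to reproduce with the standard library. The following is a sketch under stated assumptions: a 4x12 grid with the cliff in row 3, columns 1-10, zero-initialised Q-tables, ε=0.1, α=0.5, γ=1, 500 episodes, and a single fixed seed. All names are hypothetical, and the exact numbers printed will vary with the seed.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    row = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    col = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if row == 3 and 1 <= col <= 10:  # stepped into the cliff
        return START, -100, False
    return (row, col), -1, (row, col) == GOAL


def greedy(Q, state):
    return max(range(4), key=lambda a: Q[(state, a)])


def eps_greedy(Q, state, eps, rng):
    return rng.randrange(4) if rng.random() < eps else greedy(Q, state)


def train(method, episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """Train tabular SARSA or Q-learning; return (Q, per-episode returns)."""
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in range(4)}
    returns = []
    for _ in range(episodes):
        s, total, done = START, 0, False
        a = eps_greedy(Q, s, eps, rng)
        while not done:
            s2, reward, done = step(s, a)
            if method == "sarsa":
                a2 = eps_greedy(Q, s2, eps, rng)    # action that will really be taken
                bootstrap = Q[(s2, a2)]
            else:
                bootstrap = Q[(s2, greedy(Q, s2))]  # best action, taken or not
            target = reward if done else reward + gamma * bootstrap
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if method != "sarsa":
                a2 = eps_greedy(Q, s2, eps, rng)    # behaviour action, chosen after the update
            s, a, total = s2, a2, total + reward
        returns.append(total)
    return Q, returns


def greedy_return(Q, max_steps=200):
    """Return of one rollout with exploration switched off (capped at max_steps)."""
    s, total = START, 0
    for _ in range(max_steps):
        s, reward, done = step(s, greedy(Q, s))
        total += reward
        if done:
            break
    return total


results = {}
for method in ("sarsa", "q_learning"):
    Q, returns = train(method)
    results[method] = (sum(returns[-100:]) / 100, greedy_return(Q))
    print(method, "online avg (last 100):", results[method][0],
          "greedy return:", results[method][1])
```

The first number is the return while the agent is still exploring; the second is the return of the learned policy evaluated greedily. Comparing the two columns is the quickest way to see the on-policy versus off-policy trade-off on your own seed.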
Overestimation Bias and Double Q-Learning
Q-Learning suffers from a systematic overestimation bias because the max operator uses the same Q-values both to select and to evaluate actions. Mathematically, for any set of random variables {X_i} with means {μ_i}, E[max_i X_i] ≥ max_i E[X_i]. Since Q-values are noisy estimates, max_a' Q(s',a') tends to be higher than the true maximum expected return. This bias can lead to suboptimal policies, especially in stochastic environments.
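The inequality E[max_i X_i] ≥ max_i E[X_i] is easy to check numerically. In this sketch every action has a true value of 0 and the estimates are pure unit-variance Gaussian noise, an assumption chosen to make the bias obvious; the function name is hypothetical.

```python
import random


def mean_of_max(num_actions=5, trials=5000, seed=0):
    """Average of max_i X_i when every X_i ~ N(0, 1), so max_i E[X_i] = 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(num_actions))
    return total / trials


print("1 action :", round(mean_of_max(num_actions=1), 3))   # close to 0: no bias
print("5 actions:", round(mean_of_max(num_actions=5), 3))   # clearly above 0
```

With a single action there is nothing to maximize over, so the average stays near the true value. With five equally good actions the max of the noisy estimates lands well above zero, and the gap grows with the number of actions and the noise level.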
Double Q-Learning (van Hasselt, 2010) decouples selection and evaluation by maintaining two separate Q-tables, Q_A and Q_B. The update rule for Q_A is: Q_A(s,a) ← Q_A(s,a) + α[R + γ Q_B(s', argmax_a' Q_A(s',a')) - Q_A(s,a)]. Q_B is updated symmetrically. By using Q_B to evaluate the action selected by Q_A, the upward bias of the max operator is removed, because the noise in Q_B is independent of the noise that drove the selection. The resulting target is not unbiased, however: it can underestimate the true maximum.
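A sketch of the tabular update, assuming Q-tables stored as `defaultdict`s. The function takes the table being updated and the table used for evaluation as separate arguments; choosing which table plays which role with a fair coin flip each step follows the usual presentation of the algorithm and is an assumption here, as are all names.

```python
import random
from collections import defaultdict


def double_q_update(learn, evaluate, s, a, r, s_next, actions,
                    alpha=0.1, gamma=1.0, terminal=False):
    """Update `learn` using `evaluate` to score the action that `learn` selects."""
    if terminal:
        target = r
    else:
        best = max(actions, key=lambda b: learn[(s_next, b)])  # selection
        target = r + gamma * evaluate[(s_next, best)]          # evaluation
    learn[(s, a)] += alpha * (target - learn[(s, a)])


def double_q_step(QA, QB, rng, *args, **kwargs):
    """Flip a fair coin to decide which table learns on this step."""
    if rng.random() < 0.5:
        double_q_update(QA, QB, *args, **kwargs)
    else:
        double_q_update(QB, QA, *args, **kwargs)


QA, QB = defaultdict(float), defaultdict(float)
QA[("s1", "x")], QA[("s1", "y")] = 1.0, 0.0    # QA prefers x in s1
QB[("s1", "x")], QB[("s1", "y")] = -2.0, 5.0   # QB disagrees
double_q_update(QA, QB, "s0", "a", r=0.0, s_next="s1",
                actions=["x", "y"], alpha=0.5)
print("QA(s0, a):", QA[("s0", "a")])
```

QA selects action x, but the target uses QB's value for x (-2.0) rather than QA's own optimistic 1.0, which is the decoupling that removes the upward bias.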
As an illustration, consider a simple MDP with two actions and stochastic rewards (e.g., action A returns N(0,1), action B returns N(0.5,1)). In a setup like this, Q-Learning with ε=0.1 and α=0.1 can overestimate the optimal value noticeably (on the order of 0.5 after 10,000 steps), while Double Q-Learning's estimate typically stays much closer to the true value (within about 0.05). The exact figures depend on the MDP and the random seed. The policy learned by Q-Learning may incorrectly favor action A due to noise, while Double Q-Learning is more likely to identify action B as optimal.
In practice, Double DQN (van Hasselt et al., 2016) applies this idea to deep RL by using the online network for action selection and the target network for evaluation. This reduces overestimation and often improves performance on Atari games. The modification is minimal: replace y = r + γ max_a' Q_target(s',a') with y = r + γ Q_target(s', argmax_a' Q_online(s',a')).
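The difference between the two targets fits in a few lines. This sketch uses plain Python lists in place of network outputs, so `q_online_next` and `q_target_next` stand for the Q-values that the online and target networks would produce for the next state; the function names and numbers are illustrative.

```python
def dqn_target(r, q_target_next, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and evaluates."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return r
    best = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return r + gamma * q_target_next[best]


q_online_next = [1.0, 3.0, 2.0]   # online network prefers action 1
q_target_next = [0.5, 1.0, 4.0]   # target network's largest value is action 2
print("DQN target:       ", dqn_target(1.0, q_target_next, gamma=0.5))
print("Double DQN target:", double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.5))
```

The DQN target uses the target network's largest value (4.0), while Double DQN uses the target network's value for the action the online network picked (1.0), giving a smaller target whenever the two networks disagree.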
A common pitfall: Double Q-Learning is not bias-free. It removes the overestimation but can introduce underestimation bias, which is usually smaller in magnitude and less harmful. For tabular settings, use Double Q-Learning when the number of actions is large or rewards are high-variance. For deep RL, Double DQN is a sensible default and works as a drop-in replacement for the DQN target.
Production Considerations: Safety, Exploration, and Convergence
Deploying TD learning in production requires addressing three interconnected concerns: safety during training, exploration strategy, and convergence guarantees. In practice, these often conflict. A safe exploration policy may converge slowly; an aggressive exploration policy may cause catastrophic failures. The key is to design the reward function and action space to be forgiving.
Safety: Use action masking to prevent obviously dangerous actions. For example, in a robotic arm, mask actions that would cause self-collision. Implement a 'safe fallback' policy: if the Q-values for all actions are below a threshold (e.g., -100), execute a predefined safe action. Monitor the TD error in production—spikes indicate distribution shift or novel states. Set up alerts when TD error exceeds 3 standard deviations from the running mean.
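The three safety mechanisms above (action masking, a safe fallback, and a TD-error alert) can be sketched as follows. The names, the warm-up length, and the choice to apply the fallback threshold only to unmasked actions are assumptions made for illustration; the running statistics use Welford's algorithm.

```python
import math


def select_action(q_values, allowed, safe_action, q_floor=-100.0):
    """Best allowed action, or safe_action if every allowed value is below q_floor."""
    candidates = [a for a, ok in enumerate(allowed) if ok]
    if not candidates or all(q_values[a] < q_floor for a in candidates):
        return safe_action
    return max(candidates, key=lambda a: q_values[a])


class TDErrorMonitor:
    """Running mean and variance of the TD error with an n-sigma alert."""

    def __init__(self, n_sigma=3.0, warmup=30):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.n_sigma, self.warmup = n_sigma, warmup

    def update(self, td_error):
        """Return True if td_error is an outlier relative to the history so far."""
        alert = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            alert = abs(td_error - self.mean) > self.n_sigma * std
        # Welford's online update of mean and sum of squared deviations.
        self.n += 1
        delta = td_error - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (td_error - self.mean)
        return alert


print(select_action([1.0, 5.0, 3.0], [True, False, True], safe_action=0))  # action 1 is masked
monitor = TDErrorMonitor()
for i in range(50):
    monitor.update(0.9 if i % 2 else 1.1)
print("spike flagged:", monitor.update(10.0))
```

In a real system the alert would feed a metrics pipeline rather than a print statement, and the warm-up period keeps the monitor quiet until it has enough history to estimate a meaningful standard deviation.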
Exploration: ε-greedy is simple but inefficient for large action spaces. Use Boltzmann exploration (softmax over Q-values) with a temperature parameter that anneals over time. For continuous action spaces, add Ornstein-Uhlenbeck noise. In production, start with high exploration (ε=0.5) and anneal to 0.01 over the first 20% of training steps. Never fully turn off exploration—non-stationary environments require ongoing exploration.
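Boltzmann exploration with an annealed temperature can be sketched as below. The linear annealing schedule and its start and end values are assumptions for illustration, as are the function names; subtracting the maximum Q-value before exponentiating is a standard trick to avoid overflow.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a lower temperature means a greedier policy."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def boltzmann_action(q_values, temperature, rng=random):
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]


def annealed_temperature(step, total_steps, start=1.0, end=0.05):
    """Linear interpolation from start to end, then held at end."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)


q = [1.0, 2.0, 3.0]
for step in (0, 500, 1000):
    t = annealed_temperature(step, 1000)
    print(step, round(t, 3), [round(p, 3) for p in boltzmann_probs(q, t)])
```

Unlike ε-greedy, the probability of each non-greedy action depends on how much worse it looks, so clearly bad actions are tried far less often than near-ties.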
Convergence: Tabular Q-Learning converges to the optimal Q* under standard conditions (all state-action pairs visited infinitely often, α satisfies Robbins-Monro conditions: sum α = ∞, sum α^2 < ∞). In practice, use α = 1/(1 + visit_count(s,a)) for tabular methods. For function approximation, convergence is not guaranteed—use target networks and experience replay to stabilize training. Monitor the average Q-value over the last 1000 steps; if it plateaus for 10,000 steps, consider adjusting the learning rate or exploration schedule.
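The step size α = 1/(1 + visit_count(s,a)) can be sketched as a small callable. The class name is hypothetical. A useful way to read this schedule: when the targets for a pair are treated as plain samples, it turns the incremental update into an exact running average.

```python
from collections import defaultdict


class VisitCountStepSize:
    """Per-pair step size alpha = 1 / (1 + visit_count(s, a))."""

    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, s, a):
        alpha = 1.0 / (1 + self.counts[(s, a)])
        self.counts[(s, a)] += 1
        return alpha


step_size = VisitCountStepSize()
q = 0.0
for target in [2.0, 4.0, 9.0]:
    # Same shape as a TD update, with the target held fixed for illustration.
    q += step_size("s", "a") * (target - q)
print("estimate:", q, "sample mean:", (2.0 + 4.0 + 9.0) / 3)
```

The step sizes used are 1, 1/2, 1/3, and so on, which sum to infinity while their squares sum to a finite value, so the Robbins-Monro conditions hold. The price is that the estimate adapts more and more slowly, which is why a constant α is preferred when the environment drifts.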
A real production pattern: train in simulation with high exploration, then fine-tune on the real system with low exploration and a safety wrapper. The safety wrapper checks each action against a simple physics model before execution. This hybrid approach reduces real-world failures by 90% compared to direct online learning.
Debugging TD Learning: A Practical Guide with Real-World Incidents
Debugging TD learning is notoriously difficult because the agent's behavior emerges from the interaction of learning rate, exploration, reward design, and environment dynamics. Here are three real-world incidents and how to diagnose them.
Incident 1: 'The Agent That Forgot Everything' (catastrophic forgetting in neural TD). A team trained a DQN to play a video game. After 1 million steps, performance suddenly dropped to random. Root cause: the replay buffer was too small (10,000 transitions) and the network overwrote earlier experiences. Fix: increase buffer to 1 million, use prioritized experience replay, and add a target network with soft updates (τ=0.001). The TD error spiked from 0.5 to 5.0 just before the collapse—monitoring this would have caught it.
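The soft target update mentioned in the fix can be sketched without a deep learning framework. Here parameters are assumed to be a dict mapping a name to a flat list of floats, which is a simplification; real code would apply the same formula to framework tensors. The function name is hypothetical.

```python
def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: target <- (1 - tau) * target + tau * online, in place."""
    for name, online_values in online_params.items():
        target_params[name] = [
            (1.0 - tau) * t + tau * o
            for t, o in zip(target_params[name], online_values)
        ]


target = {"w": [0.0, 1.0]}
online = {"w": [1.0, 1.0]}
soft_update(target, online, tau=0.1)
print(target)
```

With τ=0.001 the target network moves one thousandth of the way toward the online network per update, so the bootstrapping target changes slowly even when the online network is learning quickly.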
Incident 2: 'The Reward Hacker' (reward misspecification). An agent trained to maximize 'score' in a warehouse simulation learned to repeatedly pick up and drop the same item, generating infinite reward. The Q-values diverged to infinity. Root cause: the reward function did not penalize repeated actions. Fix: add a per-step cost (-0.1) and a 'novelty bonus' for visiting new states. The TD error grew unbounded (reaching 1e6) before the fix. Set a hard cap on Q-values (±1000) to prevent numerical instability.
Incident 3: 'The Frozen Agent' (insufficient exploration). A robot trained with ε=0.01 never discovered the optimal path because it got stuck in a local optimum. The Q-values converged but the policy was suboptimal. Root cause: the exploration rate decayed too quickly. Fix: use count-based exploration (add a bonus of β/√(visit_count(s,a)) to the reward). The average Q-value plateaued at 50 instead of the optimal 100—a clear sign of under-exploration.
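The count-based bonus from the fix above can be sketched as a thin wrapper around the reward. The class name and the choice to count the current visit before computing the bonus are assumptions for illustration.

```python
import math
from collections import defaultdict


class CountBonus:
    """Adds beta / sqrt(visit_count(s, a)) to the reward for each visit."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def shaped_reward(self, s, a, reward):
        self.counts[(s, a)] += 1  # count this visit first, so the count is never zero
        return reward + self.beta / math.sqrt(self.counts[(s, a)])


bonus = CountBonus(beta=1.0)
for visit in range(1, 5):
    print(visit, round(bonus.shaped_reward("s", "a", 0.0), 3))
```

The bonus starts at β on the first visit and shrinks as 1/√n, so rarely tried state-action pairs look temporarily more attractive without permanently distorting the learned values.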
General debugging checklist: (1) Plot the average Q-value over time: it should move smoothly toward a plateau (rising or falling depending on the sign of the rewards) rather than oscillate or blow up. (2) Plot the TD error distribution: near convergence it should be centered close to zero with roughly stable variance. (3) Run a 'sanity check' episode with a random policy to ensure the environment is working. (4) Test with a known optimal policy (if available) to verify the Q-values match. (5) Use the 'greedy rollout' metric: evaluate the greedy policy every 1000 steps to separate learning from exploration noise.
The Cliff Walker: When Q-Learning Crashed a Drone
- Always match the learning algorithm to the exploration strategy in safety-critical environments.
- Q-Learning's optimality guarantee assumes you can follow the greedy policy, which may not be true during training.
- Simulate with multiple random seeds to catch rare catastrophic events before deploying.
print(np.mean(q_table, axis=0))
plt.plot(episode_rewards)
Common mistakes to avoid
- Using Q-Learning in a safety-critical environment without exploration constraints.
- Not decaying the learning rate appropriately.
- Neglecting to tune the discount factor gamma.
- Applying Q-Learning to non-stationary environments without adaptation.
Interview Questions on This Topic
Explain the TD(0) update rule and how it differs from Monte Carlo updates.