Temporal Difference learning
Professur für Künstliche Intelligenz - Fakultät für Informatik
V(s_t) = V(s_t) + \alpha (R_t - V(s_t))
R_t = r_{t+1} + \gamma \, R_{t+1}
Using the recursive definition of the return, we could replace R_{t+1} by an estimate, namely the value of the next state V^\pi(s_{t+1}) = \mathbb{E}_\pi [R_{t+1} | s_{t+1}]:
R_t \approx r_{t+1} + \gamma \, V^\pi(s_{t+1})
V(s_t) = V(s_t) + \alpha \, (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t))
where r_{t+1} + \gamma\, V(s_{t+1}) is a sampled estimate of the return.
\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)
is equivalently called the reward prediction error (RPE), the TD error, or the advantage of the action a_t.
It is the difference between:
the estimated return in state s_t: V(s_t).
the actual return r_{t+1} + \gamma \, V(s_{t+1}), computed using the estimate V(s_{t+1}).
If \delta_t > 0, it means that:
we received more reward r_{t+1} than expected, or:
we arrived in a state s_{t+1} that is better than expected.
In either case, we should increase the value of s_t, as we underestimated it.
If \delta_t < 0, we should decrease the value of s_t, as we overestimated it.
Backup diagram of TD(0)
while True:
Start from an initial state s_0.
foreach step t of the episode:
Select a_t using the current policy \pi in state s_t.
Apply a_t, observe r_{t+1} and s_{t+1}.
Compute the TD error:
\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)
Update the value of s_t:
V(s_t) = V(s_t) + \alpha \, \delta_t
V(s_t) = V(s_t) + \alpha \, (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)) = V(s_t) + \alpha \, \delta_t
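As a minimal illustration, here is a tabular TD(0) evaluation sketch in Python. The env object with reset()/step() returning (next_state, reward, done), the policy function and the integer state indices are assumptions of the example, not part of the original algorithm.

```python
import numpy as np

def td0_evaluation(env, policy, nb_states, nb_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) evaluation of a fixed policy (hypothetical env and policy assumed)."""
    V = np.zeros(nb_states)
    for _ in range(nb_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                          # select a_t with the current policy
            next_state, reward, done = env.step(action)     # observe r_{t+1} and s_{t+1}
            # TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
            delta = reward + gamma * V[next_state] * (not done) - V[state]
            V[state] += alpha * delta                       # online update of V(s_t)
            state = next_state
    return V
```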
V^\pi(s) = \mathbb{E}_\pi [r(s, a, s') + \gamma \, V^\pi(s')]
By using an estimate of the return R_t instead of the actual return as in MC,
we increase the bias (estimates are always wrong, especially at the beginning of learning)
but we reduce the variance: only r(s, a, s') is stochastic, not the value function V^\pi.
We can therefore expect less optimal solutions, but we will also need fewer samples.
better sample efficiency than MC.
worse convergence (suboptimal).
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
Like for MC, the exploration/exploitation trade-off has to be managed: what is the next action a_{t+1}?
There are therefore two classes of TD control algorithms:
on-policy (SARSA)
off-policy (Q-learning).
a_{t+1} \sim \pi(s_{t+1}, a)
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
The next action a_{t+1} will actually be executed at the next step: SARSA is on-policy. You cannot change your mind and execute a different action.
The learned policy must be \epsilon-soft (stochastic) to ensure exploration.
SARSA converges to the optimal policy if \alpha is small enough and if \epsilon (or \tau) slowly decreases to 0.
while True:
Start from an initial state s_0 and select a_0 using the current policy \pi.
foreach step t of the episode:
Apply a_{t}, observe r_{t+1} and s_{t+1}.
Select a_{t+1} using the current stochastic policy \pi.
Update the action-value function of (s_t, a_t):
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
\pi(s_t, a) = \begin{cases} 1 - \epsilon \; \text{if} \; a = \text{argmax} \, Q(s_t, a) \\ \frac{\epsilon}{|\mathcal{A}(s_t)| - 1} \; \text{otherwise.} \\ \end{cases}
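A minimal tabular SARSA sketch in Python, under the same assumed interface as before (a hypothetical env whose step() returns (next_state, reward, done), integer states and actions):

```python
import numpy as np

def epsilon_greedy(Q, state, nb_actions, epsilon, rng):
    """epsilon-soft action selection from the current Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(nb_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, nb_states, nb_actions, nb_episodes=1000,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=42):
    rng = np.random.default_rng(seed)
    Q = np.zeros((nb_states, nb_actions))
    for _ in range(nb_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, nb_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # a_{t+1} is sampled with the SAME policy and will actually be executed: on-policy
            next_action = epsilon_greedy(Q, next_state, nb_actions, epsilon, rng)
            target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```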
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, \max_a Q(s_{t+1}, a) - Q(s_t, a_t))
The action a_{t+1} actually executed next does not enter the update (the target uses \max_a Q(s_{t+1}, a)), so actions can be generated by any behavior policy: Q-learning is off-policy.
The learned policy can be deterministic.
The behavior policy can be an \epsilon-soft policy derived from Q or expert knowledge.
The behavior policy only needs to visit all state-action pairs during learning to ensure optimality.
while True:
Start from an initial state s_0.
foreach step t of the episode:
Select a_{t} using the behavior policy b (e.g. derived from \pi).
Apply a_t, observe r_{t+1} and s_{t+1}.
Update the action-value function of (s_t, a_t):
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, \max_a Q(s_{t+1}, a) - Q(s_t, a_t))
\pi(s_t, a) = \begin{cases} 1\; \text{if} \; a = \text{argmax} \, Q(s_t, a) \\ 0 \; \text{otherwise.} \\ \end{cases}
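A minimal tabular Q-learning sketch in Python, under the same assumed interface. Only the behavior policy is \epsilon-greedy; the target uses the greedy max over the next Q-values:

```python
import numpy as np

def q_learning(env, nb_states, nb_actions, nb_episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=42):
    rng = np.random.default_rng(seed)
    Q = np.zeros((nb_states, nb_actions))
    for _ in range(nb_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy on the current Q-values (any exploratory policy works)
            if rng.random() < epsilon:
                action = int(rng.integers(nb_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy target: greedy value of s_{t+1}, independent of the action executed next
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```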
Q^\pi(s, a) = \mathbb{E}_{\tau \sim \rho_b}[\rho_{0:T-1} \, R(\tau) | s_0 = s, a_0=a]
As the rest of the episode is generated by b, we need to correct the returns using the importance sampling weight.
In Q-learning, Q-values are estimated using other estimates:
Q^\pi(s, a) = \mathbb{E}_{s_t \sim \rho_b, a_t \sim b}[ r_{t+1} + \gamma \, \max_a Q^\pi(s_{t+1}, a) | s_t = s, a_t=a]
As we only sample transitions using b and not episodes, there is no need to correct the returns:
The returns use estimates Q^\pi, which depend on \pi and not b.
The immediate reward r_{t+1} is stochastic, but is the same whether you sample a_t from \pi or from b.
Temporal difference methods allow learning Q-values from single transitions instead of complete episodes.
MC methods can only be applied to episodic problems, while TD works for continuing tasks.
MC and TD methods are model-free: you do not need to know anything about the environment (p(s' |s, a) and r(s, a, s')) to learn.
The exploration-exploitation dilemma must be dealt with:
Q(s, a) = Q(s, a) + \alpha \, (r(s, a, s') + \gamma \, Q(s', a') - Q(s, a))
Q(s, a) = Q(s, a) + \alpha \, (r(s, a, s') + \gamma \, \max_a Q(s', a) - Q(s, a))
TD uses bootstrapping like DP: it uses other estimates to update one estimate.
Q-learning is the go-to method in tabular RL.
Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independent of the value function.
The policy \pi is implemented by the actor, because it is used to select actions.
The estimated values V(s) are implemented by the critic, because it criticizes the actions made by the actor.
Learning is always on-policy: the critic must learn about and critique whatever policy is currently being followed by the actor.
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
tells us how good the action a_t was compared to our expectation V(s_t).
When the advantage \delta_t > 0, this means that the action led to a better reward or a better state than what was expected by V(s_t): this is a good surprise, so the action should be reinforced (selected again) and the value of that state increased.
When \delta_t < 0, this means that the previous estimation of (s_t, a_t) was too high (bad surprise), so the action should be avoided in the future and the value of the state reduced.
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t
p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t
\pi(s, a) = \frac{\exp{p(s, a)}}{\sum_b \exp{p(s, b)}}
When \delta_t >0, the preference is increased, so the probability of selecting it again increases.
When \delta_t <0, the preference is decreased, so the probability of selecting it again decreases.
This is the equivalent of reinforcement comparison for bandits.
Start in s_0. Initialize the preferences p(s, a) for each state-action pair and the critic V(s) for each state.
foreach step t:
Select a_t using the actor:
\pi(s_t, a) = \frac{\exp p(s_t, a)}{\sum_b \exp p(s_t, b)}
Apply a_t, observe r_{t+1} and s_{t+1}.
Compute the TD error in s_t using the critic:
\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)
Update the actor (preferences):
p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t
Update the critic:
V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t
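A minimal sketch of this actor-critic scheme in Python, again assuming a hypothetical env whose step() returns (next_state, reward, done), with integer states and actions:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))                 # subtract the max for numerical stability
    return z / z.sum()

def actor_critic(env, nb_states, nb_actions, nb_episodes=1000,
                 alpha=0.1, beta=0.1, gamma=0.99, seed=42):
    rng = np.random.default_rng(seed)
    V = np.zeros(nb_states)                   # critic: state values
    p = np.zeros((nb_states, nb_actions))     # actor: preferences
    for _ in range(nb_episodes):
        state = env.reset()
        done = False
        while not done:
            # Actor: sample a_t from the softmax over the preferences of s_t
            action = rng.choice(nb_actions, p=softmax(p[state]))
            next_state, reward, done = env.step(action)
            # Critic: TD error delta_t
            delta = reward + gamma * V[next_state] * (not done) - V[state]
            p[state, action] += beta * delta  # update the actor (preference)
            V[state] += alpha * delta         # update the critic (value)
            state = next_state
    return p, V
```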
The advantage of the separation between the actor and the critic is that now the actor can take any form (preferences, linear approximation, deep networks).
It requires minimal computation to select actions, which is particularly important when the action space is very large or even continuous.
It can learn stochastic policies, which is particularly useful in non-Markov problems.
Learning must be on-policy:
the critic must evaluate the actions taken by the current actor.
the actor must learn from the current critic, not “old” V-values.
Value-based methods use value estimates Q(s, a) to infer a policy:
On-policy methods learn and use a stochastic policy to explore.
Off-policy methods learn a deterministic policy but use a (stochastic) behavior policy to explore.
Policy-based methods directly learn the policy \pi(s, a) (actor) using preferences or function approximators.
A critic learning values is used to improve the policy w.r.t a performance baseline.
Actor-critic architectures are strictly on-policy.
|  | Bandits | MDP |
|---|---|---|
| Value-based | | |
| \qquad On-policy | \epsilon-greedy, softmax | SARSA |
| \qquad Off-policy | greedy | Q-learning |
| Policy-based | | |
| \qquad On-policy | Reinforcement comparison | Actor-critic |
MC has high variance, zero bias:
Good convergence properties. We are more likely to find the optimal policy.
Not very sensitive to initial estimates.
Very simple to understand and use.
TD has low variance, some bias:
Usually more sample efficient than MC.
TD(0) converges to V^\pi(s) (but not always with function approximation). The policy might be suboptimal.
More sensitive to initial values (bootstrapping).
Q(s, a) = Q(s, a) + \alpha \, (r(s, a, s') + \gamma \, \max_a Q(s', a) - Q(s, a))
The previous actions, which were equally important in obtaining the reward, will only be updated the next time they are visited.
This makes learning very slow: if the path to the reward has n steps, you will need to repeat the same episode at least n times to learn the Q-value of the first action.
Optimally, we would like a trade-off between:
TD (only one state/action is updated each time, small variance but significant bias)
Monte-Carlo (all states/actions in an episode are updated, no bias but huge variance).
In n-step TD prediction, the next n rewards are used to estimate the return; the rest is approximated by the value of the state s_{t+n}.
R^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n})
V(s_t) = V(s_t) + \alpha \, (R^n_t - V (s_t))
A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)
\delta_t = A^1_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)
As you use more “real” rewards, you reduce the bias compared to TD / Q-learning.
As you use estimates for the rest of the episode, you reduce the variance compared to MC methods.
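As a sketch, the n-step return can be computed from a recorded trajectory as follows (the list layout, with rewards[k] standing for r_{k+1}, is an assumption of the example):

```python
def n_step_return(rewards, states, V, t, n, gamma=0.99):
    """R^n_t = sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n V(s_{t+n}), truncated at the episode end."""
    T = len(rewards)                                  # rewards[k] stands for r_{k+1}
    horizon = min(n, T - t)
    R = sum(gamma**k * rewards[t + k] for k in range(horizon))
    if t + n < T:                                     # bootstrap only if s_{t+n} is not terminal
        R += gamma**n * V[states[t + n]]
    return R
```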
But how to choose n?
R^\lambda_t = (1 - \lambda) \, \sum_{n=1}^\infty \lambda^{n-1} \, R^n_t
\sum_{n=1}^\infty \lambda^{n-1} = \dfrac{1}{1 - \lambda}
Each reward r_{t+k+1} will count multiple times in the \lambda-return. Distant rewards are discounted by \lambda^k in addition to \gamma^k.
Long n-step returns (closer to MC) should not have as much weight as short ones (closer to TD), as they have a higher variance.
R^\lambda_t = (1 - \lambda) \, \sum_{n=1}^{T-t-1} \lambda^{n-1} \, R^n_t + \lambda^{T-t-1} \, R_t
\lambda controls the bias-variance trade-off:
If \lambda=0, the \lambda-return is equal to R^1_t = r_{t+1} + \gamma \, V(s_{t+1}), i.e. TD: high bias, low variance.
If \lambda=1, the \lambda-return is equal to R_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, i.e. MC: low bias, high variance.
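A sketch of the truncated \lambda-return of a finite episode, weighting the n-step returns as in the formula above (same assumed trajectory layout as before):

```python
def lambda_return(rewards, states, V, t, lambd, gamma=0.99):
    """Truncated lambda-return: weighted sum of the n-step returns, plus the MC return."""
    T = len(rewards)                                  # rewards[k] stands for r_{k+1}
    R_lambda = 0.0
    for n in range(1, T - t):                         # n-step returns that still bootstrap on V
        R_n = sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * V[states[t + n]]
        R_lambda += (1 - lambd) * lambd**(n - 1) * R_n
    # The remaining weight lambda^(T-t-1) goes to the full Monte Carlo return R_t
    R_mc = sum(gamma**k * rewards[t + k] for k in range(T - t))
    return R_lambda + lambd**(T - t - 1) * R_mc
```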
This forward view of eligibility traces implies looking at all future rewards until the end of the episode before performing a value update. This prevents online learning from single transitions.
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s)
e_t(s) = \begin{cases} \gamma \, \lambda \, e_{t-1}(s) \qquad\qquad \text{if} \quad s \neq s_t \\ e_{t-1}(s) + 1 \qquad \text{if} \quad s = s_t \\ \end{cases}
foreach step t of the episode:
Select a_t using the current policy \pi in state s_t, observe r_{t+1} and s_{t+1}.
Compute the TD error in s_t:
\delta_t = r_{t+1} + \gamma \, V_k(s_{t+1}) - V_k(s_t)
e_{t+1}(s_t) = e_t(s_t) + 1
foreach state s \in [s_0, \ldots, s_t] in the episode:
V_{k+1}(s) = V_k(s) + \alpha \, \delta_t \, e_t(s)
e_{t+1}(s) = \lambda \, \gamma \, e_t(s)
if s_{t+1} is terminal: break
The backward view of eligibility traces can be applied to single transitions, provided we know the history of visited states and maintain a trace for each of them.
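A minimal sketch of online TD(\lambda) evaluation with accumulating eligibility traces, following the backward view above (same assumed env interface as before):

```python
import numpy as np

def td_lambda(env, policy, nb_states, nb_episodes=1000,
              alpha=0.1, gamma=0.99, lambd=0.9):
    V = np.zeros(nb_states)
    for _ in range(nb_episodes):
        e = np.zeros(nb_states)                       # eligibility traces, reset at each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            delta = reward + gamma * V[next_state] * (not done) - V[state]
            e[state] += 1.0                           # increment the trace of the visited state
            V += alpha * delta * e                    # every eligible state shares the TD error
            e *= gamma * lambd                        # decay all traces
            state = next_state
    return V
```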
Eligibility traces are a very useful way to speed learning up in TD methods and control the bias/variance trade-off.
This modification can be applied to all TD methods: TD(\lambda) for states, SARSA(\lambda) and Q(\lambda) for actions.
The main drawback is that we need to keep a trace for ALL possible state-action pairs: memory consumption. Clever programming can limit this issue.
The value of \lambda has to be carefully chosen for the problem: perhaps initial actions are random and should not be reinforced.
If your problem is not strictly Markov (POMDP), eligibility traces can help as they update the history!
A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)
can be written as function of the TD error of the next n transitions:
A^{n}_t = \sum_{l=0}^{n-1} \gamma^l \, \delta_{t+l}
Proof with n=2:
\begin{aligned} A^2_t &= r_{t+1} + \gamma \, r_{t+2} + \gamma^2 \, V(s_{t+2}) - V(s_{t}) \\ &\\ &= (r_{t+1} - V(s_t)) + \gamma \, (r_{t+2} + \gamma \, V(s_{t+2}) ) \\ &\\ &= (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)) + \gamma \, (r_{t+2} + \gamma \, V(s_{t+2}) - V(s_{t+1})) \\ &\\ &= \delta_t + \gamma \, \delta_{t+1} \end{aligned}
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)
A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} \, A^n_t
This is just a forward eligibility trace over the n-step advantages: the 1-step advantage is more important than the 1000-step advantage (which has too much variance).
We can show that the GAE can be expressed as a function of the future 1-step TD errors: A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^\infty (\gamma \, \lambda)^k \, \delta_{t+k}
A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} \, A^n_t = \sum_{k=0}^\infty (\gamma \, \lambda)^k \, \delta_{t+k}
The parameter \lambda controls the bias-variance trade-off.
When \lambda=0, the generalized advantage is the TD error:
A_t^{\text{GAE}(\gamma, 0)} = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) = \delta_{t}
A_t^{\text{GAE}(\gamma, 1)} = \sum_{k=0}^\infty \gamma^k \, r_{t+k+1} - V(s_t) = R_t - V(s_t)
Any value in between controls the bias-variance trade-off: from the high bias / low variance of TD to the zero bias / high variance of MC.
In practice, it leads to a better estimation than n-step advantages, but is more computationally expensive.
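A sketch of how the generalized advantage could be computed over a recorded episode, using the TD-error form A_t^{\text{GAE}(\gamma, \lambda)} = \sum_k (\gamma \, \lambda)^k \, \delta_{t+k} (the array layout, with values containing V(s_T) = 0 as last entry for a terminal state, is an assumption):

```python
import numpy as np

def generalized_advantage_estimation(rewards, values, gamma=0.99, lambd=0.95):
    """A^GAE(gamma, lambda) for every step of a finite episode.
    rewards[t] stands for r_{t+1}; values has length T+1 and values[T] = V(s_T)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lambd * gae
        advantages[t] = gae
    return advantages
```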