Function approximation
Professur für Künstliche Intelligenz - Fakultät für Informatik
All the methods seen so far belong to tabular RL.
Q-learning requires storing one Q-value per state-action pair (s, a) in a Q-table.
If you use black-and-white 256x256 images as inputs, you have 2^{256 \times 256} \approx 10^{19728} possible states!
Tabular RL is limited to toy problems.
Tabular RL only works for small discrete action spaces.
Robots have continuous action spaces, where the actions are changes in joint angles or torques.
A joint angle could take any value in [0, \pi].
The more degrees of freedom, the more discrete actions you need to cover the action space, and the more entries in the Q-table…
Tabular RL cannot deal with continuous action spaces, unless we approximate the policy with an actor-critic architecture.
Let’s represent a state s by a vector of d features \phi(s) = [\phi_1(s), \phi_2(s), \ldots, \phi_d(s)]^T.
For the cartpole, the feature vector would be:
\phi(s) = \begin{bmatrix}x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}
x is the position, \theta the angle, \dot{x} and \dot{\theta} their derivatives.
We are able to represent any state s using these four variables.
For a video game like Breakout, one could engineer a feature vector such as:
\phi(s) = \begin{bmatrix} x \, \text{position of the paddle} \\ x \, \text{position of the ball} \\ y \, \text{position of the ball} \\ x \, \text{speed of the ball} \\ y \, \text{speed of the ball} \\ \text{presence of brick 1} \\ \text{presence of brick 2} \\ \vdots \\ \end{bmatrix}
One could even use a one-hot encoding of the states, which is equivalent to the tabular case:
\phi(s_1) = \begin{bmatrix}1\\0\\0\\ \ldots\\ 0\end{bmatrix} \qquad \phi(s_2) = \begin{bmatrix}0\\1\\0\\ \ldots\\ 0\end{bmatrix}\qquad \phi(s_3) = \begin{bmatrix}0\\0\\1\\ \ldots\\ 0\end{bmatrix} \qquad \ldots
In general, we want the number of features to be much smaller than the number of states:
d \ll |\mathcal{S}|
The goal of function approximation is to find a parameterized function V_\varphi(s) that approximates the true value function:
V_\varphi(s) \approx V^\pi(s)
The simplest approximator is a linear combination of the features:
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)
The weight vector \mathbf{w} = [w_1, w_2, \ldots, w_d]^T is the set of parameters \varphi of the function.
A linear approximator is a single artificial neuron (linear regression) without a bias.
Regardless of the form of the function approximator, we want to find the parameters \varphi making the approximated values V_\varphi(s) as close as possible to the true values V^\pi(s) for all states s.
We want to minimize the mean square error between the two quantities:
\min_\varphi \mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2]
We can use gradient descent, which updates the parameters in the opposite direction of the gradient of the loss:
\Delta \varphi = - \eta \, \nabla_\varphi \mathcal{L}(\varphi)
where the gradient is the vector of partial derivatives of the loss with respect to each parameter:
\nabla_\varphi \mathcal{L}(\varphi) = \begin{bmatrix} \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_1} \\ \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_2} \\ \ldots \\ \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_K} \\ \end{bmatrix}
To minimize the mean square error \mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2], we iteratively modify the parameters \varphi according to:
\begin{aligned} \Delta \varphi = \varphi_{k+1} - \varphi_k & = - \eta \, \nabla_\varphi \mathcal{L}(\varphi) = - \eta \, \nabla_\varphi \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2] \\ &\\ & = \mathbb{E}_{s \in \mathcal{S}} [- \eta \, \nabla_\varphi (V^\pi(s) - V_\varphi(s))^2] \\ &\\ & = \mathbb{E}_{s \in \mathcal{S}} [\eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)] \\ \end{aligned}
(the constant factor 2 coming from the derivative of the square is absorbed into the learning rate \eta).
As computing the expectation over the whole state space is not tractable, we sample the update for the visited states:
\delta_\varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
and update the parameters with stochastic gradient descent (SGD).
Batch version (expectation over all states):
\Delta \varphi = \mathbb{E}_{s \in \mathcal{S}} [\eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)]
Mini-batch version over K sampled states:
\Delta \varphi = \eta \, \frac{1}{K} \sum_{k=1}^K (V^\pi(s_k) - V_\varphi(s_k)) \, \nabla_\varphi V_\varphi(s_k)
Online (stochastic) version, one state at a time:
\Delta \varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
For the linear approximator V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s), the gradient is simply \nabla_\mathbf{w} V_\varphi(s) = \phi(s), so the update becomes:
\Delta \mathbf{w} = \eta \, (V^\pi(s) - V_\varphi(s)) \, \phi(s)
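As an illustration, here is a minimal NumPy sketch of the linear prediction and its SGD update (the feature vector `phi_s`, the `target` value and the learning rate are placeholders):

```python
import numpy as np

def predict(w, phi_s):
    # Linear value estimate V_phi(s) = w^T phi(s)
    return w @ phi_s

def sgd_update(w, phi_s, target, eta=0.01):
    # w <- w + eta * (target - V_phi(s)) * phi(s)
    return w + eta * (target - predict(w, phi_s)) * phi_s
```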
\Delta \varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
The problem is that we do not know V^\pi(s), as it is what we are trying to estimate.
We can replace V^\pi(s) by a sampled estimate using Monte-Carlo or TD:
Monte Carlo, where the target is the actual return R_t:
\Delta \varphi = \eta \, (R_t - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
Temporal Difference, where the target is the bootstrapped estimate r_{t+1} + \gamma \, V_\varphi(s'):
\Delta \varphi = \eta \, (r_{t+1} + \gamma \, V_\varphi(s') - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
Note that for Temporal Difference, we actually want to minimize the TD reward-prediction error for all states, i.e. the surprise:
\mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (r_{t+1} + \gamma \, V_\varphi(s') - V_\varphi(s))^2]= \mathbb{E}_{s \in \mathcal{S}} [ \delta_t^2]
Gradient Monte Carlo algorithm:
Initialize the parameters \varphi to 0 or randomly.
while not converged:
Generate an episode with the current policy \pi: \tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)
For all encountered states s_0, s_1, \ldots, s_{T-1}:
Compute the return R_t = \sum_k \gamma^k r_{t+k+1} .
Update the parameters using function approximation:
\Delta \varphi = \eta \, (R_t - V_\varphi(s_t)) \, \nabla_\varphi V_\varphi(s_t)
Gradient Monte-Carlo has no bias (real returns) but a high variance.
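A minimal sketch of gradient Monte Carlo with a linear approximator, assuming a hypothetical environment with a simplified Gym-like interface (`env.reset()` returns a state, `env.step(a)` returns `(next_state, reward, done)`), a fixed policy `policy(s)` and a feature function `phi(s)`:

```python
import numpy as np

def gradient_monte_carlo(env, policy, phi, d, episodes=1000, gamma=0.99, eta=0.01):
    """Gradient MC evaluation of a fixed policy with a linear approximator."""
    w = np.zeros(d)                          # parameters of V_phi
    for _ in range(episodes):
        # Generate an episode with the current policy
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)    # assumed simplified interface
            states.append(s)
            rewards.append(r)
            s = s_next
        # Compute the returns backwards and update the parameters
        R = 0.0
        for s_t, r_t in zip(reversed(states), reversed(rewards)):
            R = r_t + gamma * R              # return R_t
            w += eta * (R - w @ phi(s_t)) * phi(s_t)
    return w
```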
Semi-gradient Temporal Difference algorithm:
Initialize the parameters \varphi to 0 or randomly.
while not converged:
Start from an initial state s_0.
foreach step t of the episode:
Select a_t using the current policy \pi in state s_t.
Observe r_{t+1} and s_{t+1}.
Update the parameters using function approximation:
\Delta \varphi = \eta \, (r_{t+1} + \gamma \, V_\varphi(s_{t+1}) - V_\varphi(s_t)) \, \nabla_\varphi V_\varphi(s_t)
Semi-gradient TD has less variance, but a significant bias as V_\varphi(s_{t+1}) is initially wrong. You can never trust these estimates completely.
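A corresponding sketch of semi-gradient TD(0) with a linear approximator, under the same hypothetical interface assumptions as above:

```python
import numpy as np

def semi_gradient_td(env, policy, phi, d, episodes=1000, gamma=0.99, eta=0.01):
    """Semi-gradient TD(0) evaluation of a fixed policy with a linear approximator."""
    w = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)    # assumed simplified interface
            # Bootstrapped target; the value of a terminal state is 0
            target = r + gamma * (0.0 if done else w @ phi(s_next))
            # Semi-gradient: no gradient is taken through the target
            w += eta * (target - w @ phi(s)) * phi(s)
            s = s_next
    return w
```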
Q-values can be approximated by a parameterized function Q_\theta(s, a) in the same manner.
There are basically two options for the structure of the function approximator: either it takes the state-action pair (s, a) as input and outputs a single Q-value Q_\theta(s, a), or it takes only the state s as input and outputs one Q-value per possible action.
Q-learning with function approximation:
Initialize the parameters \theta.
while True:
Start from an initial state s_0.
foreach step t of the episode:
Select a_{t} using the behavior policy b (e.g. derived from \pi).
Take a_t, observe r_{t+1} and s_{t+1}.
Update the parameters \theta:
\Delta \theta = \eta \, (r_{t+1} + \gamma \, \max_a Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)) \, \nabla_\theta Q_\theta(s_t, a_t)
Improve the learned policy greedily:
\pi(s_t, a) = \text{Greedy}(Q_\theta(s_t, a))
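A sketch of Q-learning with a linear approximator over state-action features, assuming a hypothetical feature function `phi_sa(s, a)`, a discrete set of `n_actions` actions and an ε-greedy behavior policy:

```python
import numpy as np

def q_learning_fa(env, phi_sa, d, n_actions, episodes=1000,
                  gamma=0.99, eta=0.01, epsilon=0.1):
    """Q-learning with a linear approximator Q_theta(s, a) = theta^T phi_sa(s, a)."""
    theta = np.zeros(d)
    q = lambda s, a: theta @ phi_sa(s, a)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy derived from Q_theta
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: q(s, b))
            s_next, r, done = env.step(a)    # assumed simplified interface
            # The target uses the greedy action in the next state
            q_next = 0.0 if done else max(q(s_next, b) for b in range(n_actions))
            theta += eta * (r + gamma * q_next - q(s, a)) * phi_sa(s, a)
            s = s_next
    return theta
```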
The problem with deep NNs is that they need a lot of samples to converge, which worsens the fundamental problem of RL: sample efficiency.
By engineering the right features, we could use linear approximators, which converge much faster.
The convergence of linear function approximators is guaranteed, which is not (always) the case for non-linear ones.
Let us go back to the cartpole features:
\phi(s) = \begin{bmatrix}x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}
x is the position, \theta the angle, \dot{x} and \dot{\theta} their derivatives.
Can we predict the value of a state linearly?
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)
No: a high angular velocity \dot{\theta} is good when the pole is horizontal (it is going up) but bad when the pole is vertical (it will not stop).
The value would depend linearly on something like \dot{\theta} \, \sin \theta, which is a non-linear combination of the features.
Let’s suppose we have a simple problem where the state s is represented by two continuous variables x and y.
The true value function V^\pi(s) is a non-linear function of x and y.
If we apply linear FA directly on the feature vector [x, y], we capture the general tendency of V^\pi(s), but we make a lot of bad predictions:
To introduce non-linear relationships between continuous variables, a simple method is to construct the features using polynomials of the variables.
Example with polynomials of order 2:
\phi(s) = \begin{bmatrix}1 & x & y & x\, y & x^2 & y^2 \end{bmatrix}^T
We transform the two input variables x and y into a vector with 6 elements. The 1 (order 0) is there to learn the offset.
Example with polynomials of order 3:
\phi(s) = \begin{bmatrix}1 & x & y & x\, y & x^2 & y^2 & x^2 \, y & x \, y^2 & x^3 & y^3\end{bmatrix}^T
V_\varphi(s) = w_0 + w_1 \, x + w_2 \, y + w_3 \, x \, y + w_4 \, x^2 + w_5 \, y^2 + \ldots
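A minimal sketch of these polynomial feature constructions for two variables x and y (plain NumPy, helper names are illustrative):

```python
import numpy as np

def poly2(x, y):
    # Order-2 polynomial features: [1, x, y, xy, x^2, y^2]
    return np.array([1.0, x, y, x * y, x**2, y**2])

def poly3(x, y):
    # Order-3 features add x^2 y, x y^2, x^3 and y^3
    return np.concatenate([poly2(x, y),
                           [x**2 * y, x * y**2, x**3, y**3]])
```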
The higher the degree of the polynomial, the better the fit, but the number of features grows exponentially.
Computational complexity.
Overfitting: if we only sample some states, high-order polynomials will not interpolate correctly.
For a single variable x, a polynomial basis gives:
V_\varphi(s) = w_0 + w_1 \, x + w_2 \, x^2 + w_3 \, x^3 + \ldots
while a Fourier basis uses cosine functions of increasing frequency:
V_\varphi(s) = w_0 + w_1 \, \cos(\pi \, x) + w_2 \, \cos( 2 \, \pi \, x) + w_3 \, \cos(3 \, \pi \, x) + \ldots
A Fourier basis tends to work better than a polynomial basis.
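A sketch of a one-dimensional Fourier basis of order n, assuming the input x has been normalized to [0, 1]:

```python
import numpy as np

def fourier_basis(x, n):
    # [1, cos(pi x), cos(2 pi x), ..., cos(n pi x)]
    return np.array([np.cos(i * np.pi * x) for i in range(n + 1)])
```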
The main problem is that the number of features increases very fast with:
the number of input dimensions.
the desired precision (higher-order polynomials, more frequencies).
An obvious solution for continuous state variables is to discretize the input space.
The input space is divided into a grid of non-overlapping tiles.
Each state s is then represented by a one-hot vector indicating the tile it belongs to:
\phi(s) = \begin{bmatrix}0 & 0 & \ldots & 0 & 1 & 0 & \ldots & 0 \\ \end{bmatrix}^T
This ensures generalization inside a tile: you only need a couple of samples inside a tile to know the mean value of all the states.
Drawbacks:
the value function is step-like (discontinuous).
what is the correct size of a tile?
curse of dimensionality.
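A sketch of this state aggregation for a two-dimensional state, assuming each variable lies in [0, 1] and is discretized into n bins per dimension:

```python
import numpy as np

def one_hot_tile(s, n=10):
    # Index of the tile containing s = (x, y), with x and y in [0, 1]
    ix = min(int(s[0] * n), n - 1)
    iy = min(int(s[1] * n), n - 1)
    phi = np.zeros(n * n)
    phi[ix * n + iy] = 1.0       # single active feature: the enclosing tile
    return phi
```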
A more efficient solution is coarse coding.
The tiles (rectangles, circles, or whatever shape you need) have to overlap.
A state s is encoded by a binary vector with several 1s, one for each tile it belongs to: \phi(s) = \begin{bmatrix}0 & 1 & 0 & \ldots & 1 & 1 & 0 & \ldots & 0 \\ \end{bmatrix}^T
This allows generalization inside a tile, but also across tiles.
A simple way to ensure that tiles overlap is to use several regular grids with an offset.
Each tiling will be coarse, but the location of a state will be quite precise as it may belong to many tiles.
The feature vector in tile coding is a binary vector: there will be discontinuous jumps in the approximated value function when moving between tiles.
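A sketch of tile coding with several offset tilings for a state in [0, 1]^2, each tiling contributing one active feature (the offset scheme is a simple illustrative choice):

```python
import numpy as np

def tile_coding(s, n_tilings=4, n=10):
    # One active feature per tiling, each tiling shifted by a fraction of a tile
    phi = np.zeros(n_tilings * n * n)
    for t in range(n_tilings):
        offset = t / (n_tilings * n)
        ix = min(int((s[0] + offset) * n), n - 1)
        iy = min(int((s[1] + offset) * n), n - 1)
        phi[t * n * n + ix * n + iy] = 1.0
    return phi
```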
We can use radial-basis functions (RBF) such as Gaussians to map the state space.
We place a set of centers \{c_i\}_{i=1}^d in the input space, either on a regular grid or randomly.
Each element of the feature vector will be a Gaussian function of the distance between the state s and one center:
\phi_i(s) = \exp\left(-\frac{\|s - c_i\|^2}{2\, \sigma_i^2}\right)
The value function is then a weighted sum of Gaussian bumps centered on the c_i:
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \sum_{i=1}^d w_i \, \exp\left(-\frac{\|s - c_i\|^2}{2\, \sigma_i^2}\right)
The state itself can be approximately reconstructed as a weighted sum of the centers:
\hat{s} = \sum_{i=1}^d \phi_i(s) \, c_i
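A sketch of Gaussian RBF features, with centers placed on a regular grid in [0, 1]^2 and a common width sigma (both are illustrative choices):

```python
import numpy as np

def rbf_features(s, n=5, sigma=0.2):
    # Centers on a regular n x n grid covering [0, 1]^2
    grid = np.linspace(0.0, 1.0, n)
    centers = np.array([[cx, cy] for cx in grid for cy in grid])
    # Gaussian of the squared distance between the state and each center
    sq_dists = np.sum((centers - np.asarray(s))**2, axis=1)
    return np.exp(-sq_dists / (2 * sigma**2))
```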
In FA, we project the state information into a feature space to get a better representation.
We then apply a linear approximation algorithm to estimate the value function:
V_\varphi(s) = \mathbf{w}^T \, \phi(s)
\Delta \mathbf{w} = \eta \, (V^\pi(s) - V_\varphi(s)) \, \phi(s)
Deep neural networks are the most powerful function approximators in supervised learning.
Do they also work with RL?