Deep Reinforcement Learning

Function approximation

Julien Vitay

Professur für Künstliche Intelligenz - Fakultät für Informatik

1 - Limits of tabular RL

Tabular reinforcement learning

  • All the methods seen so far belong to tabular RL.

  • Q-learning requires storing one Q-value per state-action pair (s, a) in a Q-table.

Source: https://towardsdatascience.com/qrash-course-deep-q-networks-from-the-ground-up-1bbda41d3677

Tabular reinforcement learning

  • If a state has never been visited during learning, its Q-values are still at their initial value (0.0) and no policy can be derived from them.
  • Similar states likely have the same optimal action: we want to be able to generalize the policy between states.

Tabular reinforcement learning

  • For most realistic problems, the size of the Q-table quickly becomes intractable.

Source: https://medium.com/@twt446/a-summary-of-deep-reinforcement-learning-rl-bootcamp-lecture-2-c3a15db5934e

  • If you use black-and-white 256x256 images as inputs, you have 2^{256 \times 256} \approx 10^{19728} possible states!

  • Tabular RL is limited to toy problems.

Tabular RL cannot learn to play video games

Continuous action spaces

  • Tabular RL only works for small discrete action spaces.

  • Robots have continuous action spaces, where the actions are changes in joint angles or torques.

  • A joint angle could take any value in [0, \pi].

Continuous action spaces

  • A solution would be to discretize the action space (one action per degree), but we would fall into the curse of dimensionality.
  • The more degrees of freedom, the more discrete actions, the more entries in the Q-table…

  • Tabular RL cannot deal with continuous action spaces, unless we approximate the policy with an actor-critic architecture.

2 - Function approximation

Feature vectors

  • Let’s represent a state s by a vector of d features \phi(s) = [\phi_1(s), \phi_2(s), \ldots, \phi_d(s)]^T.

  • For the cartpole, the feature vector would be:

\phi(s) = \begin{bmatrix}x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}

  • x is the position, \theta the angle, \dot{x} and \dot{\theta} their derivatives.

  • We are able to represent any state s using these four variables.

Feature vectors

  • For more complex problems, the feature vector should include all the necessary information (Markov property).

\phi(s) = \begin{bmatrix} x \, \text{position of the paddle} \\ x \, \text{position of the ball} \\ y \, \text{position of the ball} \\ x \, \text{speed of the ball} \\ y \, \text{speed of the ball} \\ \text{presence of brick 1} \\ \text{presence of brick 2} \\ \vdots \\ \end{bmatrix}

  • In deep RL, we will learn these feature vectors, but let’s suppose for now that we have them.

Feature vectors

  • Note that we can always fall back to the tabular case using one-hot encoding of the states:

\phi(s_1) = \begin{bmatrix}1\\0\\0\\ \ldots\\ 0\end{bmatrix} \qquad \phi(s_2) = \begin{bmatrix}0\\1\\0\\ \ldots\\ 0\end{bmatrix}\qquad \phi(s_3) = \begin{bmatrix}0\\0\\1\\ \ldots\\ 0\end{bmatrix} \qquad \ldots

  • But the idea is that we can represent states with far fewer values than the number of states:

d \ll |\mathcal{S}|

  • We can also represent continuous state spaces with feature vectors.
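
A minimal sketch of such a one-hot encoding, compared with a compact feature vector (the sizes and values are illustrative):

```python
import numpy as np

n_states = 1000                     # illustrative number of discrete states

def one_hot(state_index, n=n_states):
    """Tabular case: phi(s_i) is the i-th unit vector, so d = |S|."""
    phi = np.zeros(n)
    phi[state_index] = 1.0
    return phi

# Compact case: d << |S|, e.g. the 4 cartpole variables [x, x_dot, theta, theta_dot]
phi_compact = np.array([0.1, -0.5, 0.05, 0.2])
```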

State value approximation

  • In state value approximation, we want to approximate the state value function V^\pi(s) with a parameterized function V_\varphi(s):

V_\varphi(s) \approx V^\pi(s)

  • The parameterized function can have any form. It has a set of parameters \varphi used to transform the feature vector \phi(s) into an approximated value V_\varphi(s).

Linear approximation of state value functions

  • The simplest function approximator (FA) is the linear approximator.
  • The approximated value is a linear combination of the features:

V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)

  • The weight vector \mathbf{w} = [w_1, w_2, \ldots, w_d]^T is the set of parameters \varphi of the function.

  • A linear approximator is a single artificial neuron (linear regression) without a bias.
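
A minimal sketch of this linear approximator, assuming a 4-dimensional feature vector such as the cartpole one (the numerical values are illustrative):

```python
import numpy as np

phi_s = np.array([0.1, -0.5, 0.05, 0.2])   # feature vector phi(s) = [x, x_dot, theta, theta_dot]
w = np.array([0.3, 0.1, -1.2, -0.4])       # weight vector w (the parameters varphi)

V_s = w @ phi_s                            # V_phi(s) = w^T phi(s)
```

Note that with a one-hot encoding of d = |\mathcal{S}| states, this dot product reduces to a table lookup: V_\varphi(s_i) = w_i.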

Learning the state value approximation

  • Regardless of the form of the function approximator, we want to find the parameters \varphi that make the approximated values V_\varphi(s) as close as possible to the true values V^\pi(s) for all states s.

    • This is a regression problem.
  • We want to minimize the mean square error between the two quantities:

\min_\varphi \mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2]

  • The loss function \mathcal{L}(\varphi) is minimal when the predicted values are close to the true ones on average for all states.

Learning the state value approximation

  • Let’s suppose that we know the true state values V^\pi(s) for all states and that the parameterized function is differentiable.

  • We can find the minimum of the loss function by applying gradient descent (GD) iteratively:

\Delta \varphi = - \eta \, \nabla_\varphi \mathcal{L}(\varphi)

  • \nabla_\varphi \mathcal{L}(\varphi) is the gradient of the loss function w.r.t. the parameters \varphi.

\nabla_\varphi \mathcal{L}(\varphi) = \begin{bmatrix} \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_1} \\ \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_2} \\ \ldots \\ \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_K} \\ \end{bmatrix}

  • When applied repeatedly, GD converges to a local minimum of the loss function.

Learning the state value approximation

  • To minimize the mean square error,

\min_\varphi \mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2]

we will iteratively modify the parameters \varphi according to:

\begin{aligned} \Delta \varphi = \varphi_{k+1} - \varphi_k & = - \eta \, \nabla_\varphi \mathcal{L}(\varphi) = - \eta \, \nabla_\varphi \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2] \\ &\\ & = \mathbb{E}_{s \in \mathcal{S}} [- \eta \, \nabla_\varphi (V^\pi(s) - V_\varphi(s))^2] \\ &\\ & = \mathbb{E}_{s \in \mathcal{S}} [\eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)] \\ \end{aligned}

  • The factor 2 coming from the derivative of the square is absorbed into the learning rate \eta.

  • As it would be too slow to compute the expectation on the whole state space (batch algorithm), we will sample the quantity:

\delta_\varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)

and update the parameters with stochastic gradient descent (SGD).

Learning the state value approximation

  • Gradient of the MSE:

\Delta \varphi = \mathbb{E}_{s \in \mathcal{S}} [\eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)]

  • If we sample K states s_i from the state space:

\Delta \varphi = \eta \, \frac{1}{K} \sum_{k=1}^K (V^\pi(s_k) - V_\varphi(s_k)) \, \nabla_\varphi V_\varphi(s_k)

  • We can also sample a single state s (online algorithm):

\Delta \varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)

  • Unless stated otherwise, we will sample single states in this section, but the parameter updates will be noisy (high variance).

Linear approximation

  • The approximated value is a linear combination of the features:

V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)

  • The weights are updated using stochastic gradient descent:

\Delta \mathbf{w} = \eta \, (V^\pi(s) - V_\varphi(s)) \, \phi(s)

  • That is the delta learning rule of linear regression and classification, with \phi(s) being the input vector and V^\pi(s) - V_\varphi(s) the prediction error.
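
A minimal sketch of this delta rule for the linear approximator, assuming a target value is available (the true V^\pi(s) here, a sampled estimate later):

```python
import numpy as np

def linear_value(w, phi_s):
    """V_phi(s) = w^T phi(s)."""
    return w @ phi_s

def delta_rule_update(w, phi_s, target, eta=0.1):
    """SGD / delta rule: w <- w + eta * (target - V_phi(s)) * phi(s)."""
    error = target - linear_value(w, phi_s)   # prediction error V^pi(s) - V_phi(s)
    return w + eta * error * phi_s

# Illustrative usage with a 4-dimensional feature vector
w = np.zeros(4)
phi_s = np.array([0.1, -0.5, 0.05, 0.2])
w = delta_rule_update(w, phi_s, target=1.0)
```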

Function approximation with sampling

  • The rule can be used with any function approximator, as long as it is differentiable:

\Delta \varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)

  • The problem is that we do not know V^\pi(s), as it is what we are trying to estimate.

  • We can replace V^\pi(s) by a sampled estimate using Monte-Carlo or TD:

    • Monte-Carlo function approximation:

    \Delta \varphi = \eta \, (R_t - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)

    • Temporal Difference function approximation:

    \Delta \varphi = \eta \, (r_{t+1} + \gamma \, V_\varphi(s') - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)

  • Note that for Temporal Difference, we actually want to minimize the TD reward-prediction error for all states, i.e. the surprise:

\mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (r_{t+1} + \gamma \, V_\varphi(s') - V_\varphi(s))^2]= \mathbb{E}_{s \in \mathcal{S}} [ \delta_t^2]

Gradient Monte Carlo Algorithm for value estimation

  • Algorithm:

    • Initialize the parameters \varphi to 0 or randomly.

    • while not converged:

      1. Generate an episode according to the current policy \pi until a terminal state s_T is reached.

      \tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)

      2. For all encountered states s_0, s_1, \ldots, s_{T-1}:

        1. Compute the return R_t = \sum_k \gamma^k r_{t+k+1} .

        2. Update the parameters using function approximation:

        \Delta \varphi = \eta \, (R_t - V_\varphi(s_t)) \, \nabla_\varphi V_\varphi(s_t)

  • Gradient Monte-Carlo has no bias (real returns) but a high variance.
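
Below is a sketch of Gradient Monte Carlo with a linear approximator. The names `env`, `phi` and `policy` are assumptions: `env` is supposed to expose a simplified Gym-like interface where `step()` returns `(next_state, reward, done)`.

```python
import numpy as np

def gradient_monte_carlo(env, phi, policy, d, episodes=1000, eta=0.01, gamma=0.99):
    """Gradient MC for value estimation with a linear FA: V_phi(s) = w^T phi(s)."""
    w = np.zeros(d)
    for _ in range(episodes):
        # 1. Generate an episode with the current policy
        transitions = []                        # list of (phi(s_t), r_{t+1})
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)       # assumed simplified Gym-like step
            transitions.append((phi(s), r))
            s = s_next
        # 2. Compute the returns backwards and update the weights for each visited state
        R = 0.0
        for phi_s, r in reversed(transitions):
            R = r + gamma * R                   # R_t = r_{t+1} + gamma * R_{t+1}
            w += eta * (R - w @ phi_s) * phi_s  # delta rule towards the MC return
    return w
```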

Semi-gradient Temporal Difference Algorithm for value estimation

  • Algorithm:

    • Initialize the parameters \varphi to 0 or randomly.

    • while not converged:

      • Start from an initial state s_0.

      • foreach step t of the episode:

        • Select a_t using the current policy \pi in state s_t.

        • Observe r_{t+1} and s_{t+1}.

        • Update the parameters using function approximation:

        \Delta \varphi = \eta \, (r_{t+1} + \gamma \, V_\varphi(s_{t+1}) - V_\varphi(s_t)) \, \nabla_\varphi V_\varphi(s_t)

        • if s_{t+1} is terminal: break
  • Semi-gradient TD has less variance, but a significant bias as V_\varphi(s_{t+1}) is initially wrong. You can never trust these estimates completely.
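
A corresponding sketch of semi-gradient TD(0) with a linear approximator, under the same assumptions on `env`, `phi` and `policy` as above:

```python
import numpy as np

def semi_gradient_td(env, phi, policy, d, episodes=1000, eta=0.01, gamma=0.99):
    """Semi-gradient TD(0) for value estimation with a linear FA."""
    w = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)                 # assumed simplified Gym-like step
            target = r if done else r + gamma * (w @ phi(s_next))
            # Semi-gradient: the target is treated as a constant (no gradient through it)
            w += eta * (target - w @ phi(s)) * phi(s)
            s = s_next
    return w
```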

Function approximation for Q-values

  • Q-values can be approximated by a parameterized function Q_\theta(s, a) in the same manner.

  • There are basically two options for the structure of the function approximator:

  • The FA takes a feature vector for both the state s and the action a (which can be continuous) as inputs, and outputs a single Q-value Q_\theta(s ,a).

  • The FA takes a feature vector for the state s as input, and outputs one Q-value Q_\theta(s ,a) per possible action (the action space must be discrete).

  • In both cases, we minimize the MSE between the true value Q^\pi(s, a) and the approximated value Q_\theta(s, a).

Q-learning with function approximation

  • Initialize the parameters \theta.

  • while True:

    • Start from an initial state s_0.

    • foreach step t of the episode:

      • Select a_{t} using the behavior policy b (e.g. derived from \pi).

      • Take a_t, observe r_{t+1} and s_{t+1}.

      • Update the parameters \theta:

      \Delta \theta = \eta \, (r_{t+1} + \gamma \, \max_a Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)) \, \nabla_\theta Q_\theta(s_t, a_t)

      • Improve greedily the learned policy:

      \pi(s_t, a) = \text{Greedy}(Q_\theta(s_t, a))

      • if s_{t+1} is terminal: break
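
A sketch of Q-learning with a linear FA of the second type (one Q-value per discrete action), again assuming a simplified Gym-like `env` and a feature map `phi`; the \epsilon-greedy behavior policy is derived from the current Q-values:

```python
import numpy as np

def q_learning_fa(env, phi, d, n_actions, episodes=1000,
                  eta=0.01, gamma=0.99, epsilon=0.1):
    """Q-learning with a linear FA: Q_theta(s, a) = theta[a]^T phi(s),
    one weight vector per discrete action."""
    theta = np.zeros((n_actions, d))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q = theta @ phi(s)                  # Q-values of all actions in s
            # epsilon-greedy behavior policy derived from the current Q-values
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(q))
            s_next, r, done = env.step(a)       # assumed simplified Gym-like step
            target = r if done else r + gamma * np.max(theta @ phi(s_next))
            theta[a] += eta * (target - theta[a] @ phi(s)) * phi(s)
            s = s_next
    return theta
```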

3 - Feature construction

Feature construction

  • Before we dive into deep RL (i.e. RL with non-linear FA), let’s see how we can design good feature vectors for linear function approximation.
  • The problem with deep NNs is that they need a lot of samples to converge, which worsens the fundamental problem of RL: sample efficiency.

  • By engineering the right features, we could use linear approximators, which converge much faster.

  • The convergence of linear FA is guaranteed, which is not (always) the case for non-linear FA.

Why do we need to choose features?

  • For the cartpole, the feature vector \phi(s) could be:

\phi(s) = \begin{bmatrix}x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}

  • x is the position, \theta the angle, \dot{x} and \dot{\theta} their derivatives.

  • Can we predict the value of a state linearly?

V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)

  • No, a high angular velocity \dot{\theta} is good when the pole is horizontal (going up) but bad if the pole is vertical (will not stop).

  • The value would depend linearly on something like \dot{\theta} \, \sin \theta, which is a non-linear combination of features.

Feature coding

  • Let’s suppose we have a simple problem where the state s is represented by two continuous variables x and y.

  • The true value function V^\pi(s) is a non-linear function of x and y.

Linear approximation

  • If we apply linear FA directly on the feature vector [x, y], we capture the general trend of V^\pi(s) but we make a lot of bad predictions:

    • high bias (underfitting).

Polynomials

  • To introduce non-linear relationships between continuous variables, a simple method is to construct the feature vector from polynomials of the variables.

  • Example with polynomials of order 2:

\phi(s) = \begin{bmatrix}1 & x & y & x\, y & x^2 & y^2 \end{bmatrix}^T

  • We transform the two input variables x and y into a vector with 6 elements. The 1 (order 0) is there to learn the offset.

  • Example with polynomials of order 3:

\phi(s) = \begin{bmatrix}1 & x & y & x\, y & x^2 & y^2 & x^2 \, y & x \, y^2 & x^3 & y^3\end{bmatrix}^T

  • And so on. We then just need to apply linear FA on these feature vectors (polynomial regression).

V_\varphi(s) = w_0 + w_1 \, x + w_2 \, y + w_3 \, x \, y + w_4 \, x^2 + w_5 \, y^2 + \ldots
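
A possible sketch of such a polynomial feature map for two variables (the ordering of the monomials and the name `polynomial_features` are choices for illustration; `sklearn.preprocessing.PolynomialFeatures` provides the same functionality):

```python
import numpy as np

def polynomial_features(x, y, order=2):
    """All monomials x^i * y^j with i + j <= order (includes the constant 1)."""
    return np.array([x**i * y**j
                     for i in range(order + 1)
                     for j in range(order + 1 - i)])

phi2 = polynomial_features(0.5, -1.0, order=2)   # 6 features for order 2
phi3 = polynomial_features(0.5, -1.0, order=3)   # 10 features for order 3
```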

Polynomials

  • Polynomials of order 2 already give a better approximation.

Polynomials

  • Polynomials of order 6 are an even better fit for our problem.

Polynomials

  • The higher the degree of the polynomial, the better the fit, but the number of features grows exponentially.

    • Computational complexity.

    • Overfitting: if we only sample some states, high-order polynomials will not interpolate correctly.

Fourier transforms

  • Instead of approximating a state variable x by a polynomial:

V_\varphi(s) = w_0 + w_1 \, x + w_2 \, x^2 + w_3 \, x^3 + \ldots

  • we could also use its Fourier decomposition (here DCT, discrete cosine transform):

V_\varphi(s) = w_0 + w_1 \, \cos(\pi \, x) + w_2 \, \cos( 2 \, \pi \, x) + w_3 \, \cos(3 \, \pi \, x) + \ldots

  • Fourier tells us that, if we take enough frequencies, we can reconstruct the signal V^\pi(s) perfectly.
  • It is just a change of basis: the problem remains a linear regression to find w_0, w_1, w_2, etc.
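
A minimal sketch of such a cosine basis for a scalar variable, assuming x has been normalized to [0, 1]:

```python
import numpy as np

def fourier_features(x, order=5):
    """Cosine (DCT-like) basis for a scalar x in [0, 1]: cos(i * pi * x), i = 0..order."""
    return np.cos(np.pi * np.arange(order + 1) * x)

# Linear regression on these features fits w_0 + w_1 cos(pi x) + w_2 cos(2 pi x) + ...
phi = fourier_features(0.3, order=5)
```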

Fourier transforms

  • Fourier transforms can be applied on multivariate functions as well.

Polynomial vs. Fourier basis

  • A Fourier basis tends to work better than a polynomial basis.

  • The main problem is that the number of features increases very fast with:

    • the number of input dimensions.

    • the desired precision (higher-order polynomials, more frequencies).

Discrete coding

  • An obvious solution for continuous state variables is to discretize the input space.

  • The input space is divided into a grid of non-overlapping tiles.

  • The feature vector is a binary vector with a 1 when the input is inside a tile, 0 otherwise.

\phi(s) = \begin{bmatrix}0 & 0 & \ldots & 0 & 1 & 0 & \ldots & 0 \\ \end{bmatrix}^T

  • This ensures generalization inside a tile: you only need a few samples inside a tile to estimate the mean value of all the states it contains.

  • Drawbacks:

    • the value function is step-like (discontinuous).

    • what is the correct size of a tile?

    • curse of dimensionality.

Coarse coding

  • A more efficient solution is coarse coding.

  • The tiles (rectangles, circles, or whatever you need) must overlap.

  • A state s is encoded by a binary vector with several 1s, one for each tile it belongs to:

\phi(s) = \begin{bmatrix}0 & 1 & 0 & \ldots & 1 & 1 & 0 & \ldots & 0 \\ \end{bmatrix}^T

  • This allows generalization inside a tile, but also across tiles.

  • The size and shape of the “receptive field” influences the generalization properties.

Tile coding

  • A simple way to ensure that tiles overlap is to use several regular grids with an offset.

  • Each tiling will be coarse, but the location of a state will be quite precise as it may belong to many tiles.

  • This helps against the curse of dimensionality: high precision, but the number of tiles does not grow exponentially.
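
A minimal sketch of tile coding for two variables normalized to [0, 1), using several regularly shifted 8x8 tilings (the number of tilings, grid size and offsets are arbitrary choices):

```python
import numpy as np

def tile_coding(x, y, n_tilings=4, n_tiles=8):
    """Binary features from several grids, each shifted by a fraction of a tile width.
    x and y are assumed to be normalized to [0, 1)."""
    features = np.zeros(n_tilings * n_tiles * n_tiles)
    for t in range(n_tilings):
        offset = t / (n_tilings * n_tiles)             # shift each tiling slightly
        i = int((x + offset) * n_tiles) % n_tiles
        j = int((y + offset) * n_tiles) % n_tiles
        features[t * n_tiles * n_tiles + i * n_tiles + j] = 1.0
    return features

phi = tile_coding(0.42, 0.87)    # exactly n_tilings entries are 1, one per tiling
```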

Radial-basis functions (RBF)

  • The feature vector in tile coding is a binary vector: there will be discontinuous jumps in the approximated value function when moving between tiles.

  • We can use radial-basis functions (RBF) such as Gaussians to map the state space.

  • We place a set of centers \{c_i\}_{i=1}^K in the input space, on a regular grid or randomly.

  • Each element of the feature vector will be a Gaussian function of the distance between the state s and one center:

\phi_i(s) = \exp \frac{-(s - c_i)^2}{2\, \sigma_i^2}

Radial-basis functions (RBF)

  • The approximated value function now represents the state space continuously:

V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \sum_{i=1}^d w_i \, \exp \frac{-(s - c_i)^2}{2\, \sigma_i^2}

  • If you have enough centers and they overlap sufficiently, you can even decode the original state perfectly:

\hat{s} = \sum_{i=1}^d \phi_i(s) \, c_i
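
A minimal sketch of Gaussian RBF features over a 2D state space, with centers placed on a regular grid (the grid size and the width \sigma are arbitrary choices):

```python
import numpy as np

def rbf_features(s, centers, sigma=0.5):
    """Gaussian RBF features: phi_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    dists = np.linalg.norm(centers - s, axis=1)
    return np.exp(-dists**2 / (2 * sigma**2))

# Illustrative 2D state space with centers on a regular 5x5 grid in [0, 1]^2
grid = np.linspace(0.0, 1.0, 5)
centers = np.array([[cx, cy] for cx in grid for cy in grid])
phi = rbf_features(np.array([0.3, 0.7]), centers, sigma=0.25)
```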

Summary of function approximation

  • In FA, we project the state information into a feature space to get a better representation.

  • We then apply a linear approximation algorithm to estimate the value function:

V_\varphi(s) = \mathbf{w}^T \, \phi(s)

  • The linear FA is trained using some variant of gradient descent:

\Delta \mathbf{w} = \eta \, (V^\pi(s) - V_\varphi(s)) \, \phi(s)

  • Deep neural networks are the most powerful function approximators in supervised learning.

  • Do they also work with RL?