Reinforcement Learning: Q-Learning & Deep Q-Learning

  • These notes present a Reinforcement Learning algorithm, namely Q-Learning.
  • A foundation in Probability, Stochastic Processes, and Machine Learning is recommended before reading.
  • It may be useful to be familiar with Dynamic Programming in the context of Reinforcement Learning.

Introduction

The goal of Q-Learning is to learn a certain measure of quality of actions given states. This measure of quality represents the long-term expected reward we can get by taking a certain action at a specific state. The higher the expected reward, the better the quality of the action.

Framework

We have an agent and an environment which interact with each other in discrete time steps. At time $t$, the agent observes the environment's state $s_t$ and performs action $a_t$. The agent gets a reward $r_t$ from performing this action, and the environment changes to state $s_{t+1}$.

The state transition follows a distribution $p(s_{t+1} \mid s_t, a_t)$, and we assume it has the Markov property: $p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$.

We assume that we don't know the environment's dynamics (model-free), so we don't know the state transition distribution $p(s_{t+1} \mid s_t, a_t)$. In other words, from $(s_t, a_t)$ we cannot infer anything about $s_{t+1}$.

We denote by $\mathcal{S}$ the observable state space and by $\mathcal{A}$ the action space. $\mathcal{A}_s$ is the action space when the state of the environment is $s$, and $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s$.

$R$ is the reward function, and $r_t = R(s_t, a_t)$. The notation $R(s_t, a_t)$ will be used to denote a random variable that depends on the event $(s_t, a_t)$, and $r_t$ will be used when we know its value (i.e. when we know $s_t$ and $a_t$).

We define a policy $\pi$ to be a strategy for the agent. We model it as a function $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$ such that for every $s \in \mathcal{S}$, $\sum_{a \in \mathcal{A}_s} \pi(a \mid s) = 1$, so that it defines, for every choice of $s$, a probability distribution over $\mathcal{A}_s$, and we denote it with $\pi(\cdot \mid s)$.

The notation $\mathbb{E}_\pi[\,\cdot\,]$ means that all actions are sampled according to $\pi$, and all states are sampled according to the state transition distribution.

Defining The Goal

Let's start with what we want to achieve. From a state $s_t$, we want to maximize the expected cumulative reward of our course of action. The expected cumulative reward is what we should obtain on average if we start at a state and follow our policy to perform actions. The expected cumulative reward is defined as $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} R(s_{t+k}, a_{t+k}) \,\middle|\, s_t\right]$, so our goal is:

$$\max_\pi \; \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} R(s_{t+k}, a_{t+k}) \,\middle|\, s_t\right]$$

But summing infinitely many rewards can be infinite. So we slightly change our goal to circumvent this. We prioritize the impact of near-term rewards over the ones that come later, by introducing a discount factor $\gamma \in [0, 1)$. We thus define the discounted cumulative reward as being $\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k})$. The closer $\gamma$ is to 1, the more importance we give to long-term rewards, whereas when $\gamma$ is close to 0, we prioritize short-term rewards. This can be important if, for example, we are in a game where there are multiple short-term goals that don't end the game but a single long-term goal that ends the game.
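As a quick illustration, here is a minimal Python sketch (the reward sequence and function name are our own, not part of the notes) that computes the discounted cumulative reward of a finite reward sequence for two values of $\gamma$:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]                # small immediate reward, large delayed reward
print(discounted_return(rewards, gamma=0.99))  # ~10.70: the delayed reward dominates
print(discounted_return(rewards, gamma=0.10))  # ~1.01: the immediate reward dominates
```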

The V-Function

We can now define the expected discounted cumulative reward when we start at state $s$ and follow policy $\pi$, otherwise known as the State-Value Function (V-function):

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s\right]$$

And finally we can state our ultimate goal:

$$\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S} \qquad (1)$$

But, sticking to our assumptions, the V-function is not sufficient. Let's say we want to use the V-function to choose an action based on $s_t$. Then we would choose the action that maximizes the next state's V-function, taking into account the reward obtained from this transition. We would want something similar to

$$a_t = \arg\max_{a \in \mathcal{A}_{s_t}} \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a)}\left[R(s_t, a) + \gamma V^\pi(s_{t+1})\right]$$

The fact that we need the next state's distribution (i.e. information on the state transition) conflicts with our model-free assumption, so the V-function cannot directly be used as a means to choose an action based on the state we're in.

The Q-Function

Definition

Let's introduce a new function, called the Action-Value Function (or Q-function), which is similar to the State-Value Function but takes into account the action that has been chosen:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s, a_t = a\right]$$

The intuition of this function is that it gives a measure of the quality of the action we take at a certain state.

Let's link the Q-function to the V-function:

$$Q^\pi(s, a) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s, a)}\left[R(s, a) + \gamma V^\pi(s_{t+1})\right]$$

And the other way around is obtained by summing the conditional expectations on $a$:

$$V^\pi(s) = \sum_{a \in \mathcal{A}_s} \pi(a \mid s)\, Q^\pi(s, a)$$

The above equation is important. It describes the relationship between two fundamental value functions in Reinforcement Learning. It is valid for any policy.

Policy Ordering

Let's define what an optimal policy is by first defining a partial ordering between policies:
Let $\pi_1$, $\pi_2$ be two policies. Then,

$$\pi_1 \geq \pi_2 \iff V^{\pi_1}(s) \geq V^{\pi_2}(s) \quad \forall s \in \mathcal{S}$$

Some policies might not be comparable, for example if there exist $s_1, s_2 \in \mathcal{S}$ such that $V^{\pi_1}(s_1) > V^{\pi_2}(s_1)$ but $V^{\pi_1}(s_2) < V^{\pi_2}(s_2)$.
An optimal policy $\pi^*$ is one that is comparable with any other policy $\pi$, and such that $\pi^* \geq \pi$.

A result that we won't prove here but that we'll be using is that, in our setting, $\pi^*$ always exists, and moreover there always exists a deterministic policy that is optimal. Also note that there can be multiple optimal policies that give the same optimal value, i.e. $\pi^*$ may not be unique.

We can rewrite our ultimate goal (1) as computing $V^*(s) = \max_\pi V^\pi(s) = V^{\pi^*}(s)$ for every $s$. It is the optimal V-function.
Similarly, the optimal Q-function is $Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$.

Finding The Optimal V-Function Is Equivalent To Finding The Optimal Q-Function

We will now derive an important result, which says that to obtain the values of the optimal V-function, we can concentrate on getting the values of the optimal Q-function. To help derive this important result, we first give the following lemma.

Policy Improvement Lemma

Lemma: If $\pi_1, \pi_2$ are such that $\mathbb{E}_{a \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a)\right] \geq V^{\pi_2}(s)$ for every $s \in \mathcal{S}$, then

$$V^{\pi_1}(s) \geq \mathbb{E}_{a \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a)\right] \geq V^{\pi_2}(s) \quad \forall s \in \mathcal{S},$$

i.e. $\pi_1 \geq \pi_2$.

Proof.
Without loss of generality, let $t$ be the current timestep, and let $s \in \mathcal{S}$ with $s_t = s$.
In what follows, we use multiple times the links derived between the Q-function and the V-function.

$$
\begin{aligned}
V^{\pi_2}(s) &\leq \mathbb{E}_{a_t \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a_t)\right] && \text{(lemma assumption)} \\
&= \mathbb{E}_{a_t \sim \pi_1(\cdot \mid s)}\left[R(s, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V^{\pi_2}(s_{t+1})\right]\right] && \text{(Q-V link)} \\
&\leq \mathbb{E}_{a_t \sim \pi_1(\cdot \mid s)}\left[R(s, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[\mathbb{E}_{a_{t+1} \sim \pi_1(\cdot \mid s_{t+1})}\left[Q^{\pi_2}(s_{t+1}, a_{t+1})\right]\right]\right] && \text{(lemma assumption at } s_{t+1}\text{)}
\end{aligned}
$$

The first inequality is the lemma assumption applied at $s$, and the last one is the lemma assumption applied at $s_{t+1}$, inside the outer expectation.

Repeating the above reasoning, the rewards collected under $\pi_1$ accumulate, while the remaining $Q^{\pi_2}$ term is pushed arbitrarily far into the future, where it is discounted by $\gamma^k \to 0$. We obtain

$$V^{\pi_2}(s) \leq \mathbb{E}_{\pi_1}\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s\right] = V^{\pi_1}(s).$$

So finally $V^{\pi_1}(s) \geq \mathbb{E}_{a \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a)\right] \geq V^{\pi_2}(s)$ for every $s \in \mathcal{S}$, i.e. $\pi_1 \geq \pi_2$. ∎

Now we are ready to state our important result.

Equivalence Theorem

Theorem:

$$V^*(s) = \max_{a \in \mathcal{A}_s} Q^*(s, a) \quad \forall s \in \mathcal{S}$$

Proof. Since $V^\pi(s) = \sum_{a \in \mathcal{A}_s} \pi(a \mid s)\, Q^\pi(s, a)$ is valid for any policy, it is valid for an optimal policy. So $V^*(s) = \sum_{a \in \mathcal{A}_s} \pi^*(a \mid s)\, Q^*(s, a)$.
Let $s \in \mathcal{S}$.
Then

$$V^*(s) = \sum_{a \in \mathcal{A}_s} \pi^*(a \mid s)\, Q^*(s, a) \leq \max_{a \in \mathcal{A}_s} Q^*(s, a).$$

Thus $V^*(s) \leq \max_{a \in \mathcal{A}_s} Q^*(s, a)$.

We now prove that $V^*(s) \geq \max_{a \in \mathcal{A}_s} Q^*(s, a)$ by contradiction.
Let's assume that $V^*(s_0) < \max_{a \in \mathcal{A}_{s_0}} Q^*(s_0, a)$ for some $s_0 \in \mathcal{S}$.
Consider the policy $\pi_g$ that is greedy with respect to $Q^*$: it satisfies $\mathbb{E}_{a \sim \pi_g(\cdot \mid s)}\left[Q^*(s, a)\right] = \max_{a \in \mathcal{A}_s} Q^*(s, a) \geq V^*(s)$ for every $s$ (by the first part of the proof). By our previous lemma, this means that $V^{\pi_g}(s_0) \geq \max_{a \in \mathcal{A}_{s_0}} Q^*(s_0, a) > V^*(s_0)$, which means $\pi^*$ is not optimal. Contradiction. ∎

This is extremely useful, because we can concentrate on computing the optimal Q-values to obtain the optimal V-function values, which is exactly our ultimate goal (1). So computing the optimal Q-values comes back to achieving our goal.
The whole idea of Q-Learning is learning these optimal Q-values. To put in place our learning framework, we first derive a recursive formula for the optimal Q-function, called the Bellman optimality equation.

Bellman optimality equation for $Q^*$:

$$Q^*(s, a) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s, a)}\left[R(s, a) + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q^*(s_{t+1}, a')\right]$$

This is obtained by noting that the Q-V link $Q^\pi(s, a) = \mathbb{E}_{s_{t+1}}\left[R(s, a) + \gamma V^\pi(s_{t+1})\right]$ works in particular with $\pi = \pi^*$, and by combining it with the theorem $V^*(s) = \max_{a \in \mathcal{A}_s} Q^*(s, a)$.

Q-Learning

Let $Q$ be the function obtained from learning $Q^*$.
The Bellman optimality equation will help us learn $Q(s, a)$ for all $(s, a)$ because our learning objective is minimizing the following error measure:

$$\delta_t = \underbrace{r_t + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q(s_{t+1}, a')}_{\text{target}} - \; Q(s_t, a_t)$$

It is the Bellman error, which is simply the difference between the current Q-value when we're at $s_t$ and about to take $a_t$, and the Q-value computed once we observe the next state $s_{t+1}$. Intuitively, the Bellman error is the update to our expected reward when we observe $s_{t+1}$. The underlined part (the target) is the RHS of the Bellman optimality equation, but knowing $s_{t+1}$.

Q-Learning is an algorithm that repeatedly adjusts $Q$ to minimize the Bellman error. At timestep $t$, we sample the tuple $(s_t, a_t, r_t, s_{t+1})$ and adjust $Q$ as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q(s_{t+1}, a') - Q(s_t, a_t) \right)$$

Where $\alpha$ is a learning rate. In practice $\alpha$ will be close to 0 and strictly less than 1 to take into account previous updates.
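As a concrete illustration, here is a minimal Python sketch of a single update on a tabular $Q$ stored as a NumPy array (the shapes, names, and hyperparameter values are our own assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # tabular Q, initialized arbitrarily (here to 0)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-Learning step: move Q[s, a] towards the Bellman target."""
    target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (target - Q[s, a])    # adjust by alpha times the Bellman error
```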

Now we state the theoretical constraints under which Q-Learning converges, which help motivate the implementation choices of Q-Learning in practice. The proof of convergence is not given here, but the paper containing the proof can be found in Reference 2.

Constraints For Convergence

Convergence of Q-Learning: Let $t_n(s, a)$ be the timestep of the $n$-th time that we're in state $s$ and take action $a$, and let $\alpha_n(s, a)$ be the learning rate used at that update. Let the updates to $Q$ be done as mentioned above. Then, $Q(s, a)$ converges almost surely towards $Q^*(s, a)$ for all $(s, a)$ as long as

$$\sum_{n=0}^{\infty} \alpha_n(s, a) = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \alpha_n^2(s, a) < \infty \quad \forall (s, a).$$

The convergence is almost sure because $R(s_t, a_t)$ and $s_{t+1}$ are random variables.
This statement reveals two constraints:

  1. The learning rate $\alpha_n(s, a)$ for each state-action pair must converge towards 0 (so that $\sum_n \alpha_n^2(s, a)$ converges), but not too quickly (so that $\sum_n \alpha_n(s, a)$ still diverges).
  2. Because $\alpha_n(s, a)$ is bounded, the divergence of $\sum_n \alpha_n(s, a)$ requires infinitely many terms: every state-action pair must be visited infinitely often.
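For instance, a per-pair schedule such as $\alpha_n(s, a) = 1/n$ satisfies both conditions (the harmonic series diverges while the sum of its squares converges). A minimal Python sketch, with a visit counter we introduce ourselves:

```python
from collections import defaultdict

visit_count = defaultdict(int)  # N(s, a): number of updates applied to the pair (s, a)

def learning_rate(s, a):
    """alpha_n(s, a) = 1/n: the sum over n diverges, the sum of squares converges."""
    visit_count[(s, a)] += 1
    return 1.0 / visit_count[(s, a)]
```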

Exploration-Exploitation tradeoff

Now the idea could be to apply Machine Learning to learn it: we sample a lot of events $(s_t, a_t, r_t, s_{t+1})$ to adjust and learn each $Q(s, a)$ so that it is as close as possible to $Q^*(s, a)$. To do this we don't need a specific policy; we just need enough exploration and (by the convergence constraints) enough iterations to make the values of $Q$ converge towards $Q^*$.

But we are doing Reinforcement Learning, so we also need our agent to get better with experience. Thus our agent needs to take the actions that it thinks are the best according to what it has learned so far. So the agent chooses the actions that maximize $Q^*$ for each state, and from the Bellman optimality equation, it chooses $a_t = \arg\max_{a \in \mathcal{A}_{s_t}} Q^*(s_t, a)$ (in fact, the agent chooses $\arg\max_{a} Q(s_t, a)$, but $Q$ converges to $Q^*$ so we'll abuse notation here).

To give its best guess, the agent always chooses $\arg\max_{a \in \mathcal{A}_{s_t}} Q^*(s_t, a)$, so it follows the following policy:

$$\pi(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}_s} Q^*(s, a') \\ 0 & \text{otherwise} \end{cases}$$

By identifying the Bellman optimality equation with the Bellman equation in the Appendix, we can conclude that this is in fact an optimal policy.

But this policy doesn't favor exploration, because it always follows the optimal path. Due to this, our agent might never go into certain states (i.e. sample certain state-action pairs), and thus it misses some information that would help get closer to $Q^*$. This is the exploration-exploitation tradeoff: the agent should sometimes choose suboptimal actions in order to visit new states and actions.
This tradeoff is handled by changing the above optimal (greedy) policy into an $\varepsilon$-greedy policy. The idea is that with probability $1 - \varepsilon$ we apply our optimal policy, and with probability $\varepsilon$ we choose an action uniformly at random. Formally, this gives the following policy:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}_s|} & \text{if } a = \arg\max_{a' \in \mathcal{A}_s} Q(s, a') \\ \dfrac{\varepsilon}{|\mathcal{A}_s|} & \text{otherwise} \end{cases}$$

Typically, $\varepsilon$ changes as we go through training. It starts with a value close to 1 to favor exploration at the beginning, and decreases towards 0 as $Q$ converges to $Q^*$.

Now, with this policy, the agent chooses the (estimated) optimal action most of the time, while still learning $Q^*$ accurately.
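A minimal Python sketch of $\varepsilon$-greedy action selection with a decaying $\varepsilon$ (the decay schedule and its constants are our own choices, not prescribed by these notes):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a uniformly random action with probability epsilon, else the greedy one.

    q_values is the list of Q-values for the current state, indexed by action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def decay_epsilon(epsilon, rate=0.999, minimum=0.05):
    """Exponentially decay epsilon towards a small floor."""
    return max(minimum, epsilon * rate)
```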

Implementation Pseudocode

We give pseudocode for the algorithm (a runnable Python sketch follows below):

  • Parameters: discount factor $\gamma$, step size (function) $\alpha$
  • Initialize: $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}_s$. $Q(\text{terminal state}, \cdot) = 0$.
  • Repeat for each episode:
    • Initialize state $s$
    • For each step of the episode:
      • Choose $a$ from $\mathcal{A}_s$ using the $\varepsilon$-greedy policy
      • Take action $a$, observe reward $r$ and next state $s'$
      • Update $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a' \in \mathcal{A}_{s'}} Q(s', a') - Q(s, a) \right)$, then set $s \leftarrow s'$
    • Until: $s$ ends the episode
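Below is a self-contained Python sketch of the tabular algorithm. The environment interface (reset() returning an initial state, step(a) returning (next_state, reward, done), and actions(s) listing the available actions) is an assumption on our part, as are all hyperparameter values.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, gamma=0.99, alpha=0.1,
               epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialized to 0

    def greedy(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = random.choice(env.actions(s)) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Bellman target; terminal states have value 0
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # move Q towards the target
            s = s_next
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return Q
```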

Replay Memory trick: To improve the learning of $Q$, we can memorize each tuple $(s_t, a_t, r_t, s_{t+1})$ inside a set $\mathcal{D}$ (the replay memory), and at the end of each episode, we can sample tuples uniformly at random from $\mathcal{D}$ and apply the learning process to them. But this has the disadvantage of being slower and more costly.
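A sketch of such a replay memory, assuming a bounded buffer (the capacity and batch size are our own choices):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        """Uniformly sample a batch of past transitions to replay."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```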

Deep-Q-Learning

So far, we've been assuming a tabular representation of the Q-function. There is 1 Q-value to learn per $(s, a)$ tuple, so the number of Q-values (the size of the table) can go up to $|\mathcal{S}| \times |\mathcal{A}|$.
In practice, $|\mathcal{S}|$ is very big, so having to store all the Q-values in a table is impractical.
Since for any limited-size set $\mathcal{S}$ we can uniquely represent any $s \in \mathcal{S}$ using $\lceil \log_2 |\mathcal{S}| \rceil$ bits, it would be better to have a limited-size parameterized function that approximates the Q-function.

This is what deep Q-Learning is about: have an artificial neural network approximate the Q-function. The network would get a representation of $s$ as input (achievable using $\lceil \log_2 |\mathcal{S}| \rceil$ bits), and it would output an approximate value of $Q(s, a)$ for each action $a$.

Our deep Q-Learning network (Q-network) is noted $Q_\theta$, with parameters to learn $\theta$. The loss we use is the Bellman error squared:

$$L(\theta) = \left( \underbrace{r_t + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q_\theta(s_{t+1}, a')}_{\text{target}} - Q_\theta(s_t, a_t) \right)^2$$

$r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a')$ is the target Q-value and is treated as fixed (we do not differentiate through it).

Now, updating $Q_\theta$ is done using backpropagation:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$

Where $\nabla_\theta L(\theta) = -2 \left( r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t) \right) \nabla_\theta Q_\theta(s_t, a_t)$, since the target is treated as a constant.

Notice that, in the loss, we are using the same parameters $\theta$ for the target Q-value and for the predicted Q-value. This creates a significant correlation between the target Q-value and the $Q_\theta(s_t, a_t)$ that we are learning. So at each training (updating) step, both our predicted Q-value and the target Q-value will shift. We're getting closer to the target, but the target is also moving. This leads to oscillation in training.
To mitigate this, we can compute the target with a separate copy of the network, whose parameters $\theta^-$ are updated to the current $\theta$ only every $C$ training steps.
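A minimal PyTorch sketch of this loss with such a target network (the architecture, dimensions, and hyperparameters are our own assumptions, not taken from these notes):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2  # assumed dimensions of the problem
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    """Squared Bellman error; `done` is 1.0 for terminal transitions, else 0.0."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q_theta(s_t, a_t)
    with torch.no_grad():                                   # the target is treated as fixed
        q_next = target_net(s_next).max(dim=1).values       # max_a' Q_theta^-(s_{t+1}, a')
        target = r + gamma * (1 - done) * q_next
    return ((target - q_pred) ** 2).mean()

# One training step on a batch of tensors (s, a, r, s_next, done):
#   loss = dqn_loss(s, a, r, s_next, done); optimizer.zero_grad(); loss.backward(); optimizer.step()
# and every C steps: target_net.load_state_dict(q_net.state_dict())
```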

Implementation Pseudocode

Here is an implementation of the algorithm using the Q-network $Q_\theta$ and a target network $Q_{\theta^-}$:

  • Parameters: discount factor $\gamma$, step size (function) $\alpha$, target update period $C$
  • Initialize: $\theta$ using your favorite initialization technique. $\theta^- \leftarrow \theta$. $t \leftarrow 0$.
  • Repeat for each episode:
    • Initialize state $s$
    • For each step of the episode:
      • Choose $a$ from $\mathcal{A}_s$ using the $\varepsilon$-greedy policy (with respect to $Q_\theta$)
      • Take action $a$, observe reward $r$ and next state $s'$
      • Update $\theta \leftarrow \theta - \alpha \nabla_\theta \left( r + \gamma \max_{a' \in \mathcal{A}_{s'}} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2$
      • if $t \equiv 0 \pmod{C}$, set $\theta^- \leftarrow \theta$; then $t \leftarrow t + 1$ and $s \leftarrow s'$
    • Until: $s$ ends the episode

References

  1. The famous book by Richard S. Sutton and Andrew G. Barto - Reinforcement Learning: An Introduction
  2. Christopher Watkins & Peter Dayan, 1992 - Q-learning, Machine Learning, 8, 279-292 (contains the almost sure convergence proof)

Appendix

Bellman equation of the Q-Function:

$$Q^\pi(s, a) = \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \left[ R(s, a) + \gamma \sum_{a' \in \mathcal{A}_{s'}} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$

Simplifying the notation:

$$Q^\pi(s, a) = \mathbb{E}\left[ R(s, a) + \gamma\, Q^\pi(s', a') \right], \quad s' \sim p(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')$$


Bellman equation of the V-Function:

$$V^\pi(s) = \sum_{a \in \mathcal{A}_s} \pi(a \mid s) \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]$$

Simplifying the notation:

$$V^\pi(s) = \mathbb{E}_\pi\left[ R(s, a) + \gamma V^\pi(s') \right], \quad a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)$$
