Reinforcement Learning: Q-Learning & Deep Q-Learning

  • These notes present a Reinforcement Learning algorithm, namely Q-Learning.
  • A foundation in Probability, Stochastic Processes, and Machine Learning is recommended before reading.
  • It may be useful to be familiar with Dynamic Programming in the context of Reinforcement Learning.

Introduction

The goal of Q-Learning is to learn a certain measure of quality of actions given states. This measure of quality represents the long-term expected reward we can get by taking a certain action at a specific state. The higher the expected reward, the better the quality of the action.

Framework

We have an agent and an environment which interact with each other in discrete time steps. At time $t$, the agent observes the environment's state $s_t$ and performs action $a_t$. The agent gets a reward $r_t$ from performing this action, and the environment changes to state $s_{t+1}$.

The state transition follows a distribution $p(s_{t+1} \mid s_t, a_t)$, and we assume it has the Markov property: $p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$.

We assume that we don't know the environment's dynamics (model-free), so we don't know the state transition distribution $p(s_{t+1} \mid s_t, a_t)$. In other words, from $(s_t, a_t)$ we cannot infer anything about $s_{t+1}$.

We denote by $\mathcal{S}$ the observable state space and by $\mathcal{A}$ the action space. $\mathcal{A}_s$ is the action space when the state of the environment is $s$, and $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s$.

$R$ is the reward function, and $r_t = R(s_t, a_t)$. The notation $R(s_t, a_t)$ will be used to denote a random variable that depends on the event $(s_t, a_t)$, and $r_t$ will be used when we know its value (i.e. when we know $s_t$ and $a_t$).

We define a policy $\pi$ to be a strategy for the agent. We model it as a function $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$ such that for every $s \in \mathcal{S}$, $\sum_{a \in \mathcal{A}_s} \pi(a \mid s) = 1$, so that it defines, for every choice of $s$, a probability distribution over $\mathcal{A}_s$, and we denote it with $\pi(\cdot \mid s)$.

The notation $\mathbb{E}_\pi[\,\cdot\,]$ means that all actions are sampled according to $\pi$, and all states are sampled according to the state transition distribution.

Defining The Goal

Let's start with what we want to achieve. From a state $s_t$, we want to maximize the expected cumulative reward of our course of action. The expected cumulative reward is what we should obtain on average if we start at a state and follow our policy to perform actions. The expected cumulative reward is defined as $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} R(s_{t+k}, a_{t+k}) \,\middle|\, s_t\right]$, so our goal is:

$$\max_\pi \; \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} R(s_{t+k}, a_{t+k}) \,\middle|\, s_t\right]$$

But summing infinitely many rewards can be infinite. So we slightly change our goal to circumvent this. We prioritize the impact of near-term rewards over the ones that come later, by introducing a discount factor $\gamma \in [0, 1)$. We thus define the discounted cumulative reward as being $\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k})$. The closer $\gamma$ is to 1, the more importance we give to long-term rewards, whereas when $\gamma$ is close to 0, we prioritize short-term rewards. This can be important if, for example, we are in a game where there are multiple short-term goals that don't end the game but a single long-term goal that ends the game.
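As a quick illustration, here is a minimal Python sketch (the reward sequence and function name are our own, not part of the notes) that computes the discounted cumulative reward of a finite reward sequence for two values of $\gamma$:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]                # small immediate reward, large delayed reward
print(discounted_return(rewards, gamma=0.99))  # ~10.70: the delayed reward dominates
print(discounted_return(rewards, gamma=0.10))  # ~1.01: the immediate reward dominates
```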

The V-Function

We can now define the expected discounted cumulative reward when we start at state $s$ and follow policy $\pi$, otherwise known as the State-Value Function (V-function):

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s\right]$$

And finally we can state our ultimate goal:

$$\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S} \qquad (1)$$

But, sticking to our assumptions, the V-function is not sufficient. Let's say we want to use the V-function to choose an action based on $s_t$. Then we would choose the action that maximizes the next state's V-function, taking into account the reward obtained from this transition. We would want something similar to

$$a_t = \arg\max_{a \in \mathcal{A}_{s_t}} \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a)}\left[R(s_t, a) + \gamma V^\pi(s_{t+1})\right]$$

The fact that we need the next state's distribution (i.e. information on the state transition) conflicts with our model-free assumption, so the V-function cannot directly be used as a means to choose an action based on the state we're in.

The Q-Function

Definition

Let's introduce a new function, called the Action-Value Function (or Q-function), which is similar to the State-Value Function but takes into account the action that has been chosen:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s, a_t = a\right]$$

The intuition of this function is that it gives a measure of the quality of the action we take at a certain state.

Let's link the Q-function to the V-function:

$$Q^\pi(s, a) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s, a)}\left[R(s, a) + \gamma V^\pi(s_{t+1})\right]$$

And the other way around is obtained by summing the conditional expectations on $a$:

$$V^\pi(s) = \sum_{a \in \mathcal{A}_s} \pi(a \mid s)\, Q^\pi(s, a)$$

The above equation is important. It describes the relationship between two fundamental value functions in Reinforcement Learning. It is valid for any policy.

Policy Ordering

Let's define what an optimal policy is by first defining a partial ordering between policies:
Let $\pi_1$, $\pi_2$ be two policies. Then,

$$\pi_1 \geq \pi_2 \iff V^{\pi_1}(s) \geq V^{\pi_2}(s) \quad \forall s \in \mathcal{S}$$

Some policies might not be comparable, for example if there exist $s_1, s_2 \in \mathcal{S}$ such that $V^{\pi_1}(s_1) > V^{\pi_2}(s_1)$ but $V^{\pi_1}(s_2) < V^{\pi_2}(s_2)$.
An optimal policy $\pi^*$ is one that is comparable with any other policy $\pi$, and such that $\pi^* \geq \pi$.

A result that we won't prove here but that we'll be using is that, in our setting, $\pi^*$ always exists, and moreover there always exists a deterministic policy that is optimal. Also note that there can be multiple optimal policies that give the same optimal value, i.e. $\pi^*$ may not be unique.

We can rewrite our ultimate goal (1) as computing $V^*(s) = \max_\pi V^\pi(s) = V^{\pi^*}(s)$ for every $s$. It is the optimal V-function.
Similarly, the optimal Q-function is $Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$.

Finding The Optimal V-Function Is Equivalent To Finding The Optimal Q-Function

We will now derive an important result, which says that to obtain the values of the optimal V-function, we can concentrate on getting the values of the optimal Q-function. To help derive this important result, we first give the following lemma.

Policy Improvement Lemma

Lemma: If $\pi_1, \pi_2$ are such that $\mathbb{E}_{a \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a)\right] \geq V^{\pi_2}(s)$ for every $s \in \mathcal{S}$, then

$$V^{\pi_1}(s) \geq \mathbb{E}_{a \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a)\right] \geq V^{\pi_2}(s) \quad \forall s \in \mathcal{S},$$

i.e. $\pi_1 \geq \pi_2$.

Proof.
Without loss of generality, let $t$ be the current timestep, and let $s \in \mathcal{S}$ with $s_t = s$.
In what follows, we use multiple times the links derived between the Q-function and the V-function.

$$
\begin{aligned}
V^{\pi_2}(s) &\leq \mathbb{E}_{a_t \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a_t)\right] && \text{(lemma assumption)} \\
&= \mathbb{E}_{a_t \sim \pi_1(\cdot \mid s)}\left[R(s, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V^{\pi_2}(s_{t+1})\right]\right] && \text{(Q-V link)} \\
&\leq \mathbb{E}_{a_t \sim \pi_1(\cdot \mid s)}\left[R(s, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[\mathbb{E}_{a_{t+1} \sim \pi_1(\cdot \mid s_{t+1})}\left[Q^{\pi_2}(s_{t+1}, a_{t+1})\right]\right]\right] && \text{(lemma assumption at } s_{t+1}\text{)}
\end{aligned}
$$

The first inequality is the lemma assumption applied at $s$, and the last one is the lemma assumption applied at $s_{t+1}$, inside the outer expectation.

Repeating the above reasoning, the rewards collected under $\pi_1$ accumulate, while the remaining $Q^{\pi_2}$ term is pushed arbitrarily far into the future, where it is discounted by $\gamma^k \to 0$. We obtain

$$V^{\pi_2}(s) \leq \mathbb{E}_{\pi_1}\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s\right] = V^{\pi_1}(s).$$

So finally $V^{\pi_1}(s) \geq \mathbb{E}_{a \sim \pi_1(\cdot \mid s)}\left[Q^{\pi_2}(s, a)\right] \geq V^{\pi_2}(s)$ for every $s \in \mathcal{S}$, i.e. $\pi_1 \geq \pi_2$. ∎

Now we are ready to state our important result.

Equivalence Theorem

Theorem:

$$V^*(s) = \max_{a \in \mathcal{A}_s} Q^*(s, a) \quad \forall s \in \mathcal{S}$$

Proof. Since $V^\pi(s) = \sum_{a \in \mathcal{A}_s} \pi(a \mid s)\, Q^\pi(s, a)$ is valid for any policy, it is valid for an optimal policy. So $V^*(s) = \sum_{a \in \mathcal{A}_s} \pi^*(a \mid s)\, Q^*(s, a)$.
Let $s \in \mathcal{S}$.
Then

$$V^*(s) = \sum_{a \in \mathcal{A}_s} \pi^*(a \mid s)\, Q^*(s, a) \leq \max_{a \in \mathcal{A}_s} Q^*(s, a).$$

Thus $V^*(s) \leq \max_{a \in \mathcal{A}_s} Q^*(s, a)$.

We now prove that $V^*(s) \geq \max_{a \in \mathcal{A}_s} Q^*(s, a)$ by contradiction.
Let's assume that $V^*(s_0) < \max_{a \in \mathcal{A}_{s_0}} Q^*(s_0, a)$ for some $s_0 \in \mathcal{S}$.
Consider the policy $\pi_g$ that is greedy with respect to $Q^*$: it satisfies $\mathbb{E}_{a \sim \pi_g(\cdot \mid s)}\left[Q^*(s, a)\right] = \max_{a \in \mathcal{A}_s} Q^*(s, a) \geq V^*(s)$ for every $s$ (by the first part of the proof). By our previous lemma, this means that $V^{\pi_g}(s_0) \geq \max_{a \in \mathcal{A}_{s_0}} Q^*(s_0, a) > V^*(s_0)$, which means $\pi^*$ is not optimal. Contradiction. ∎

This is extremely useful, because we can concentrate on computing the optimal Q-values to obtain the optimal V-function values, which is exactly our ultimate goal (1). So computing the optimal Q-values comes back to achieving our goal.
The whole idea of Q-Learning is learning these optimal Q-values. To put in place our learning framework, we first derive a recursive formula for the optimal Q-function, called the Bellman optimality equation.

Bellman optimality equation for $Q^*$:

$$Q^*(s, a) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s, a)}\left[R(s, a) + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q^*(s_{t+1}, a')\right]$$

This is obtained by noting that the Q-V link $Q^\pi(s, a) = \mathbb{E}_{s_{t+1}}\left[R(s, a) + \gamma V^\pi(s_{t+1})\right]$ works in particular with $\pi = \pi^*$, and by combining it with the theorem $V^*(s) = \max_{a \in \mathcal{A}_s} Q^*(s, a)$.

Q-Learning

Let $Q$ be the function obtained from learning $Q^*$.
The Bellman optimality equation will help us learn $Q(s, a)$ for all $(s, a)$ because our learning objective is minimizing the following error measure:

$$\delta_t = \underbrace{r_t + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q(s_{t+1}, a')}_{\text{target}} - \; Q(s_t, a_t)$$

It is the Bellman error, which is simply the difference between the current Q-value when we're at $s_t$ and about to take $a_t$, and the Q-value computed once we observe the next state $s_{t+1}$. Intuitively, the Bellman error is the update to our expected reward when we observe $s_{t+1}$. The underlined part (the target) is the RHS of the Bellman optimality equation, but knowing $s_{t+1}$.

Q-Learning is an algorithm that repeatedly adjusts $Q$ to minimize the Bellman error. At timestep $t$, we sample the tuple $(s_t, a_t, r_t, s_{t+1})$ and adjust $Q$ as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q(s_{t+1}, a') - Q(s_t, a_t) \right)$$

Where $\alpha$ is a learning rate. In practice $\alpha$ will be close to 0 and strictly less than 1 to take into account previous updates.
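As a concrete illustration, here is a minimal Python sketch of a single update on a tabular $Q$ stored as a NumPy array (the shapes, names, and hyperparameter values are our own assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # tabular Q, initialized arbitrarily (here to 0)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-Learning step: move Q[s, a] towards the Bellman target."""
    target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (target - Q[s, a])    # adjust by alpha times the Bellman error
```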

Now we state the theoretical constraints under which Q-Learning converges, which help motivate the implementation choices of Q-Learning in practice. The proof of convergence is not given here, but the paper containing the proof can be found in Reference 2.

Constraints For Convergence

Convergence of Q-Learning: Let $t_n(s, a)$ be the timestep of the $n$-th time that we're in state $s$ and take action $a$, and let $\alpha_n(s, a)$ be the learning rate used at that update. Let the updates to $Q$ be done as mentioned above. Then, $Q(s, a)$ converges almost surely towards $Q^*(s, a)$ for all $(s, a)$ as long as

$$\sum_{n=0}^{\infty} \alpha_n(s, a) = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \alpha_n^2(s, a) < \infty \quad \forall (s, a).$$

The convergence is almost sure because $R(s_t, a_t)$ and $s_{t+1}$ are random variables.
This statement reveals two constraints:

  1. The learning rate $\alpha_n(s, a)$ for each state-action pair must converge towards 0 (so that $\sum_n \alpha_n^2(s, a)$ converges), but not too quickly (so that $\sum_n \alpha_n(s, a)$ still diverges).
  2. Because $\alpha_n(s, a)$ is bounded, the divergence of $\sum_n \alpha_n(s, a)$ requires infinitely many terms: every state-action pair must be visited infinitely often.
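For instance, a per-pair schedule such as $\alpha_n(s, a) = 1/n$ satisfies both conditions (the harmonic series diverges while the sum of its squares converges). A minimal Python sketch, with a visit counter we introduce ourselves:

```python
from collections import defaultdict

visit_count = defaultdict(int)  # N(s, a): number of updates applied to the pair (s, a)

def learning_rate(s, a):
    """alpha_n(s, a) = 1/n: the sum over n diverges, the sum of squares converges."""
    visit_count[(s, a)] += 1
    return 1.0 / visit_count[(s, a)]
```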

Exploration-Exploitation tradeoff

Now the idea could be to apply Machine Learning to learn it: we sample a lot of events $(s_t, a_t, r_t, s_{t+1})$ to adjust and learn each $Q(s, a)$ so that it is as close as possible to $Q^*(s, a)$. To do this we don't need a specific policy; we just need enough exploration and (by the convergence constraints) enough iterations to make the values of $Q$ converge towards $Q^*$.

But we are doing Reinforcement Learning, so we also need our agent to get better with experience. Thus our agent needs to take the actions that it thinks are the best according to what it has learned so far. So the agent chooses the actions that maximize $Q^*$ for each state, and from the Bellman optimality equation, it chooses $a_t = \arg\max_{a \in \mathcal{A}_{s_t}} Q^*(s_t, a)$ (in fact, the agent chooses $\arg\max_{a} Q(s_t, a)$, but $Q$ converges to $Q^*$ so we'll abuse notation here).

To give its best guess, the agent always chooses $\arg\max_{a \in \mathcal{A}_{s_t}} Q^*(s_t, a)$, so it follows the following policy:

$$\pi(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}_s} Q^*(s, a') \\ 0 & \text{otherwise} \end{cases}$$

By identifying the Bellman optimality equation with the Bellman equation in the Appendix, we can conclude that this is in fact an optimal policy.

But this policy doesn't favor exploration, because it always follows the optimal path. Due to this, our agent might never go into certain states (i.e. sample certain state-action pairs), and thus it misses some information that would help get closer to $Q^*$. This is the exploration-exploitation tradeoff: the agent should sometimes choose suboptimal actions in order to visit new states and actions.
This tradeoff is handled by changing the above optimal (greedy) policy into an $\varepsilon$-greedy policy. The idea is that with probability $1 - \varepsilon$ we apply our optimal policy, and with probability $\varepsilon$ we choose an action uniformly at random. Formally, this gives the following policy:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}_s|} & \text{if } a = \arg\max_{a' \in \mathcal{A}_s} Q(s, a') \\ \dfrac{\varepsilon}{|\mathcal{A}_s|} & \text{otherwise} \end{cases}$$

Typically, $\varepsilon$ changes as we go through training. It starts with a value close to 1 to favor exploration at the beginning, and decreases towards 0 as $Q$ converges to $Q^*$.

Now, with this policy, the agent chooses the (estimated) optimal action most of the time, while still learning $Q^*$ accurately.
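A minimal Python sketch of $\varepsilon$-greedy action selection with a decaying $\varepsilon$ (the decay schedule and its constants are our own choices, not prescribed by these notes):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a uniformly random action with probability epsilon, else the greedy one.

    q_values is the list of Q-values for the current state, indexed by action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def decay_epsilon(epsilon, rate=0.999, minimum=0.05):
    """Exponentially decay epsilon towards a small floor."""
    return max(minimum, epsilon * rate)
```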

Implementation Pseudocode

We give pseudocode for the algorithm (a runnable Python sketch follows below):

  • Parameters: discount factor $\gamma$, step size (function) $\alpha$
  • Initialize: $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}_s$. $Q(\text{terminal state}, \cdot) = 0$.
  • Repeat for each episode:
    • Initialize state $s$
    • For each step of the episode:
      • Choose $a$ from $\mathcal{A}_s$ using the $\varepsilon$-greedy policy
      • Take action $a$, observe reward $r$ and next state $s'$
      • Update $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a' \in \mathcal{A}_{s'}} Q(s', a') - Q(s, a) \right)$, then set $s \leftarrow s'$
    • Until: $s$ ends the episode
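Below is a self-contained Python sketch of the tabular algorithm. The environment interface (reset() returning an initial state, step(a) returning (next_state, reward, done), and actions(s) listing the available actions) is an assumption on our part, as are all hyperparameter values.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, gamma=0.99, alpha=0.1,
               epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialized to 0

    def greedy(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = random.choice(env.actions(s)) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Bellman target; terminal states have value 0
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # move Q towards the target
            s = s_next
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return Q
```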

Replay Memory trick: To improve the learning of $Q$, we can memorize each tuple $(s_t, a_t, r_t, s_{t+1})$ inside a set $\mathcal{D}$ (the replay memory), and at the end of each episode, we can sample tuples uniformly at random from $\mathcal{D}$ and apply the learning process to them. But this has the disadvantage of being slower and more costly.
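A sketch of such a replay memory, assuming a bounded buffer (the capacity and batch size are our own choices):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        """Uniformly sample a batch of past transitions to replay."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```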

Deep-Q-Learning

So far, we've been assuming a tabular representation of the Q-function. There is 1 Q-value to learn per $(s, a)$ tuple, so the number of Q-values (the size of the table) can go up to $|\mathcal{S}| \times |\mathcal{A}|$.
In practice, $|\mathcal{S}|$ is very big, so having to store all the Q-values in a table is impractical.
Since for any limited-size set $\mathcal{S}$ we can uniquely represent any $s \in \mathcal{S}$ using $\lceil \log_2 |\mathcal{S}| \rceil$ bits, it would be better to have a limited-size parameterized function that approximates the Q-function.

This is what deep Q-Learning is about: have an artificial neural network approximate the Q-function. The network would get a representation of $s$ as input (achievable using $\lceil \log_2 |\mathcal{S}| \rceil$ bits), and it would output an approximate value of $Q(s, a)$ for each action $a$.

Our deep Q-Learning network (Q-network) is noted $Q_\theta$, with parameters to learn $\theta$. The loss we use is the Bellman error squared:

$$L(\theta) = \left( \underbrace{r_t + \gamma \max_{a' \in \mathcal{A}_{s_{t+1}}} Q_\theta(s_{t+1}, a')}_{\text{target}} - Q_\theta(s_t, a_t) \right)^2$$

$r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a')$ is the target Q-value and is treated as fixed (we do not differentiate through it).

Now, updating $Q_\theta$ is done using backpropagation:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$

Where $\nabla_\theta L(\theta) = -2 \left( r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t) \right) \nabla_\theta Q_\theta(s_t, a_t)$, since the target is treated as a constant.

Notice that, in the loss, we are using the same parameters $\theta$ for the target Q-value and for the predicted Q-value. This creates a significant correlation between the target Q-value and the $Q_\theta(s_t, a_t)$ that we are learning. So at each training (updating) step, both our predicted Q-value and the target Q-value will shift. We're getting closer to the target, but the target is also moving. This leads to oscillation in training.
To mitigate this, we can compute the target with a separate copy of the network, whose parameters $\theta^-$ are updated to the current $\theta$ only every $C$ training steps.
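A minimal PyTorch sketch of this loss with such a target network (the architecture, dimensions, and hyperparameters are our own assumptions, not taken from these notes):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2  # assumed dimensions of the problem
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    """Squared Bellman error; `done` is 1.0 for terminal transitions, else 0.0."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q_theta(s_t, a_t)
    with torch.no_grad():                                   # the target is treated as fixed
        q_next = target_net(s_next).max(dim=1).values       # max_a' Q_theta^-(s_{t+1}, a')
        target = r + gamma * (1 - done) * q_next
    return ((target - q_pred) ** 2).mean()

# One training step on a batch of tensors (s, a, r, s_next, done):
#   loss = dqn_loss(s, a, r, s_next, done); optimizer.zero_grad(); loss.backward(); optimizer.step()
# and every C steps: target_net.load_state_dict(q_net.state_dict())
```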

Implementation Pseudocode

Here is an implementation of the algorithm using the Q-network $Q_\theta$ and a target network $Q_{\theta^-}$:

  • Parameters: discount factor $\gamma$, step size (function) $\alpha$, target update period $C$
  • Initialize: $\theta$ using your favorite initialization technique. $\theta^- \leftarrow \theta$. $t \leftarrow 0$.
  • Repeat for each episode:
    • Initialize state $s$
    • For each step of the episode:
      • Choose $a$ from $\mathcal{A}_s$ using the $\varepsilon$-greedy policy (with respect to $Q_\theta$)
      • Take action $a$, observe reward $r$ and next state $s'$
      • Update $\theta \leftarrow \theta - \alpha \nabla_\theta \left( r + \gamma \max_{a' \in \mathcal{A}_{s'}} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2$
      • if $t \equiv 0 \pmod{C}$, set $\theta^- \leftarrow \theta$; then $t \leftarrow t + 1$ and $s \leftarrow s'$
    • Until: $s$ ends the episode

References

  1. The famous book by Richard S. Sutton and Andrew G. Barto - Reinforcement Learning: An Introduction
  2. Christopher Watkins & Peter Dayan, 1992 - Q-learning, Machine Learning, 8, 279-292 (contains the almost sure convergence proof)

Appendix

Bellman equation of the Q-Function:

$$Q^\pi(s, a) = \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \left[ R(s, a) + \gamma \sum_{a' \in \mathcal{A}_{s'}} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$

Simplifying the notation:

$$Q^\pi(s, a) = \mathbb{E}\left[ R(s, a) + \gamma\, Q^\pi(s', a') \right], \quad s' \sim p(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')$$


Bellman equation of the V-Function:

$$V^\pi(s) = \sum_{a \in \mathcal{A}_s} \pi(a \mid s) \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]$$

Simplifying the notation:

$$V^\pi(s) = \mathbb{E}_\pi\left[ R(s, a) + \gamma V^\pi(s') \right], \quad a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)$$
