Reinforcement Learning Reflections (5): Dynamic Programming


Notes and caveats on dynamic programming.


Principle of Optimality

Theorem (Principle of Optimality)

A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if

  • for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$

State value function & State-action value function

Bellman Expectation Equation

  • The state value function for policy $\pi$
\[\begin{aligned} v_{\pi}(s) &\doteq \Bbb{E}_{\pi}[G_t | S_t = s] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \end{aligned}\]
  • The state-action value function for policy $\pi$
\[\begin{aligned} q_{\pi}(s,a) &\doteq \Bbb{E}_{\pi}[G_t | S_t = s, A_t = a] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] \end{aligned}\]
  • Relation between the state value function and the state-action value function (a policy evaluation sketch follows the equations)
\[v_{\pi}(s) = \Bbb{E}_{\pi}[q_{\pi}(S_t,A_t) | S_t = s]\\ q_{\pi}(s,a) = \Bbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a]\]
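As a concrete illustration, here is a minimal sketch of iterative policy evaluation, which simply turns the Bellman expectation backup into an update rule. The two-state MDP (`P`, `R`, `gamma`) and the uniform random policy are made-up assumptions for demonstration only.

```python
import numpy as np

# Made-up toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy pi(a|s)

def policy_evaluation(pi, theta=1e-8):
    """Iterate the Bellman expectation backup until the change is below theta."""
    v = np.zeros(2)
    while True:
        q = R + gamma * P @ v          # q_pi(s, a): one-step lookahead
        v_new = (pi * q).sum(axis=1)   # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

print(policy_evaluation(pi))  # approximate v_pi for the random policy
```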

Bellman Optimality Equation

  • The optimal state value function
\[\begin{aligned} v_*(s) &\doteq \max_{\pi} v_{\pi}(s) \\&= \max_a\Bbb{E}[R_{t+1} + \gamma v_{*}(S_{t+1}) | S_t = s, A_t =a] \end{aligned}\]
  • The optimal state-action value function
\[\begin{aligned} q_{*}(s,a) &\doteq \max_{\pi} q_{\pi}(s,a) \\&= \Bbb{E}[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') | S_t = s, A_t = a] \end{aligned}\]
  • Relation between the optimal state value function and the optimal state-action value function
\[\begin{aligned} v_*(s) &= \max_{a} q_{*}(s,a) \\ q_{*}(s,a) &= \Bbb{E}[R_{t+1} + \gamma v_{*}(S_{t+1}) | S_t = s, A_t = a] \end{aligned}\]

Policy Iteration and Value Iteration

Generalised Policy Iteration

  • Policy Evaluation: Estimate $v_\pi$ (any policy evaluation algorithm)
  • Policy Improvement: Generate $\pi' \geq \pi$ (any policy improvement algorithm); a minimal sketch follows this list
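Here is a minimal sketch of one instance of GPI: full iterative policy evaluation alternated with greedy policy improvement, i.e. policy iteration. The toy dynamics `P`, rewards `R`, and discount `gamma` are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.5, 1.0]])

def evaluate(pi, theta=1e-8):
    """Policy evaluation: solve for v_pi by repeated expectation backups."""
    v = np.zeros(2)
    while True:
        v_new = (pi * (R + gamma * P @ v)).sum(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

pi = np.full((2, 2), 0.5)                  # start from the random policy
while True:
    v = evaluate(pi)                       # policy evaluation
    greedy = np.argmax(R + gamma * P @ v, axis=1)
    pi_new = np.eye(2)[greedy]             # greedy policy improvement (one-hot)
    if np.array_equal(pi_new, pi):
        break                              # policy stable: pi is optimal
    pi = pi_new

print(greedy, v)                           # optimal actions and values
```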

Policy improvement theorem

Theorem (Policy improvement theorem)

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \cal{S}$,

\[q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s)\]

Then the policy $\pi'$ must be as good as, or better than, $\pi$.

Proof:

We know that $\pi' \geq \pi$ if and only if $v_{\pi'}(s) \geq v_{\pi}(s)$ for all $s$, so it suffices to show that

\[\begin{aligned} v_{\pi}(s) &\leq q_{\pi}(s, \pi'(s)) \\&= \mathbb{E}[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s,A_t=\pi'(s)] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s] \\&\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1})) | S_t=s] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma \mathbb{E}[R_{t+2}+\gamma v_\pi(S_{t+2})| S_{t+1},A_{t+1}=\pi'(S_{t+1}) ] | S_t=s] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 v_\pi(S_{t+2}) | S_t=s] \\& \leq\mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\gamma^3 v_\pi(S_{t+3}) | S_t=s] \\& \vdots \\& \leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\gamma^3R_{t+4}+\cdots | S_t=s] \\&=v_{\pi'}(s) \end{aligned}\]

Since $v_{\pi'}(s) \geq v_{\pi}(s)$ holds for all $s$, we conclude that $\pi' \geq \pi$.

Greedy improvement

We can improve the policy by acting greedily:

\[\pi'(a|s) = \begin{cases} 1, & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_{\pi}(s,a') \\ 0, & \text{otherwise} \end{cases}\]

Proof:

For the greedy policy $\pi'$, by construction

\[q_{\pi}(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_{\pi}(s,a) \geq \sum_{a \in \mathcal{A}} \pi(a|s)\, q_{\pi}(s,a) = v_{\pi}(s)\]

Therefore, by the policy improvement theorem, $v_{\pi'}(s) \geq v_{\pi}(s)$, so $\pi' \geq \pi$.

If improvement stops, so that $\pi' = \pi$, then

\[q_{\pi}\left(s, \pi^{\prime}(s)\right)=\max _{a \in \mathcal{A}} q_{\pi}(s, a)=q_{\pi}(s, \pi(s))=v_{\pi}(s)\]

Then the Bellman optimality equation has been satisfied

\[v_{\pi}(s)=\max _{a \in \mathcal{A}} q_{\pi}(s, a)\]

Therefore $v_{\pi}(s)=v_{*}(s)$ for all $s \in \mathcal{S}$, so $\pi$ is an optimal policy.

$\epsilon$-greedy improvement

We can improve the policy by acting $\epsilon$-greedily, where $m = |\mathcal{A}|$ is the number of actions:

\[\pi'(a|s) = \begin{cases} \epsilon / m + 1 - \epsilon, & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_{\pi}(s,a') \\ \epsilon / m, & \text{otherwise} \end{cases}\]

Proof:

\[\begin{aligned} q_{\pi}\left(s, \pi^{\prime}(s)\right) &=\sum_{a \in \mathcal{A}} \pi^{\prime}(a | s) q_{\pi}(s, a) \\ &=\epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \max _{a \in \mathcal{A}} q_{\pi}(s, a) \\ & \geq \epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a | s)-\epsilon / m}{1-\epsilon} q_{\pi}(s, a) \\ &=\sum_{a \in \mathcal{A}} \pi(a | s) q_{\pi}(s, a)=v_{\pi}(s) \end{aligned}\]

Therefore, by the policy improvement theorem, $v_{\pi'}(s) \geq v_{\pi}(s)$, so $\pi' \geq \pi$.
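A minimal sketch of the $\epsilon$-greedy improvement step itself; the `q_pi` values below are made-up placeholders.

```python
import numpy as np

epsilon = 0.1
q_pi = np.array([[1.2, 3.4],     # made-up q_pi(s, a) table
                 [2.0, 0.5]])
n_states, m = q_pi.shape         # m = |A|, the number of actions

pi_new = np.full((n_states, m), epsilon / m)          # epsilon/m on every action
pi_new[np.arange(n_states), q_pi.argmax(axis=1)] += 1 - epsilon

print(pi_new)  # each row sums to 1; the greedy action gets eps/m + 1 - eps
```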

From Policy Iteration to Value Iteration

Value Iteration

  • We may reach the optimal policy before the value function has converged to $v_*$
  • So policy evaluation need not converge exactly to $v_\pi$
  • We can introduce a stopping condition:
    • $\epsilon$-convergence of the value function
    • simply stop after $k$ iterations of iterative policy evaluation
      • update the policy every iteration
      • stopping at $k = 1$ is equivalent to value iteration (see the sketch below)
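A minimal value iteration sketch on the same made-up toy MDP as the earlier sketches: each sweep applies one Bellman optimality backup, and the greedy policy is read off at the end.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.5, 1.0]])

v = np.zeros(2)
while True:
    v_new = np.max(R + gamma * P @ v, axis=1)   # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-8:        # epsilon-convergence
        break
    v = v_new

pi_star = np.argmax(R + gamma * P @ v, axis=1)  # greedy policy w.r.t. v*
print(v, pi_star)
```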

Dynamic Programming

Synchronous Dynamic Programming Algorithms

| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration (Iterative Policy Evaluation + Policy Improvement) |
| Control | Bellman Optimality Equation | Value Iteration (Truncated Policy Evaluation + Policy Improvement) |
  • synchronous backups: all states are backed up in parallel
  • These algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
    • Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
  • They could also be applied to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
    • Complexity $O(m^2n^2)$ per iteration

Asynchronous Dynamic Programming

  • Asynchronous DP backs up states individually
  • Guaranteed to converge if all states continue to be selected

In-Place Dynamic Programming

  • Synchronous value iteration stores two copies of the value function
  • In-place value iteration stores only one copy of the value function (see the sketch below)
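A sketch contrasting the two, on the same made-up toy MDP: the synchronous sweep reads only the old copy, while the in-place sweep overwrites a single array so later states already see updated values.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])

def sync_sweep(v):
    """Synchronous backup: reads old v, returns a fresh copy."""
    return np.max(R + gamma * P @ v, axis=1)

def inplace_sweep(v):
    """In-place backup: later states use values already updated in this sweep."""
    for s in range(len(v)):
        v[s] = max(R[s, a] + gamma * P[s, a] @ v for a in range(R.shape[1]))
    return v

v = np.zeros(2)
for _ in range(200):
    v = inplace_sweep(v)   # often converges in fewer sweeps than sync
print(v)
```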

Prioritised Sweeping

  • Use magnitude of Bellman error to guide state selection, e.g.
\[\left|\max _{a \in \mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v\left(s^{\prime}\right)\right)-v(s)\right|\]
  • Backup the state with the largest remaining Bellman error
  • Update Bellman error of affected states after each backup
  • Requires knowledge of reverse dynamics (predecessor states)
  • Can be implemented efficiently by maintaining a priority queue, as in the sketch below
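A prioritised-sweeping sketch on the same made-up toy MDP: a max-priority queue (`heapq` with negated keys) always backs up the state with the largest Bellman error, and the hand-written `predecessors` map stands in for knowledge of reverse dynamics.

```python
import heapq
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])
predecessors = {0: [0, 1], 1: [0, 1]}   # s' -> states that can lead to s'

def bellman_error(v, s):
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

v = np.zeros(2)
heap = [(-bellman_error(v, s), s) for s in range(2)]
heapq.heapify(heap)
while heap:
    neg_err, s = heapq.heappop(heap)
    if -neg_err < 1e-8:
        break                                  # largest error is tiny: done
    v[s] = np.max(R[s] + gamma * P[s] @ v)     # back up the worst state
    for p in predecessors[s]:                  # its predecessors' errors changed
        heapq.heappush(heap, (-bellman_error(v, p), p))
print(v)
```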

Real-Time Dynamic Programming

  • Only back up the states that are relevant to the agent
  • Use the agent's experience to guide the selection of states
  • After each time-step, back up the state $S_t$ (a sketch follows the equation):
\[v(S_t) \leftarrow \max _{a \in \mathcal{A}}\left(\mathcal{R}_{S_t}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{S_t s^{\prime}}^{a} v\left(s^{\prime}\right)\right)\]
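A real-time DP sketch on the same made-up toy MDP; acting greedily with respect to the current values and sampling next states from `P` are illustrative assumptions about the agent's behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])

v = np.zeros(2)
s = 0                                  # current state S_t
for _ in range(500):
    q = R[s] + gamma * P[s] @ v        # one-step lookahead at S_t only
    v[s] = np.max(q)                   # back up just the visited state
    a = int(np.argmax(q))              # act greedily w.r.t. current v
    s = int(rng.choice(2, p=P[s, a]))  # environment samples S_{t+1}
print(v)
```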

Backups

Full-Width Backups

  • DP uses full-width backups
    • Every successor state and action is considered
    • Using knowledge of the MDP transitions and reward function
  • DP is effective for medium-sized problems (millions of states)
  • For large problems DP suffers Bellman’s curse of dimensionality
    • The number of states $n = |\mathcal{S}|$ grows exponentially with the number of state variables

Sample Backups

  • Model-free: no advance knowledge of MDP required
    • Using sample rewards and sample transitions
    • Instead of reward function $R$ and transition dynamics $P$
  • Breaks the curse of dimensionality through sampling
    • Cost of a backup is constant, independent of $n = |\mathcal{S}|$

Convergence

Theorem (Contraction Mapping Theorem)

For any metric space $V$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction,

  • Iterated application of $T$ converges to a unique fixed point
  • At a linear convergence rate of $\gamma$

Bellman Expectation Backup is a Contraction

Define the Bellman expectation backup operator $T^{\pi}$,

\[T^{\pi}(v) = \mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} v\]

This operator is a $\gamma$-contraction, i.e. it brings value functions closer by a factor of at least $\gamma$ (a numerical check follows the derivation):

\[\begin{aligned} \left\|T^{\pi}(u)-T^{\pi}(v)\right\|_{\infty} &=\left\|\left(\mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} u\right)-\left(\mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} v\right)\right\|_{\infty} \\ &=\left\|\gamma \mathcal{P}^{\pi}(u-v)\right\|_{\infty} \\ & \leq\left\|\gamma \mathcal{P}^{\pi}\|u-v\|_{\infty}\right\|_{\infty} \\ & \leq \gamma\|u-v\|_{\infty} \end{aligned}\]

where the third step treats $\|u-v\|_{\infty}$ as a constant vector and the last step uses the fact that each row of $\mathcal{P}^{\pi}$ sums to 1.
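A quick numerical sanity check of the contraction property, on a randomly generated (made-up) policy-averaged MDP `P_pi`, `R_pi`:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
P_pi = rng.random((4, 4))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # rows sum to 1 (stochastic matrix)
R_pi = rng.random(4)

def T(v):
    """Bellman expectation backup operator T^pi."""
    return R_pi + gamma * P_pi @ v

u, v = rng.random(4), rng.random(4)
lhs = np.max(np.abs(T(u) - T(v)))          # ||T(u) - T(v)||_inf
rhs = gamma * np.max(np.abs(u - v))        # gamma * ||u - v||_inf
print(lhs <= rhs + 1e-12)                  # True: T^pi is a gamma-contraction
```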

Convergence of Iterative Policy Evaluation and Policy Iteration

  • $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)
  • By the contraction mapping theorem, $T^\pi$ has a unique fixed point, so it must be $v_\pi$
  • Therefore iterative policy evaluation converges on $v_\pi$
  • And policy iteration converges on $v_*$

Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator $T^{*}$,

\[T^{*}(v) = \max_{a\in\mathcal{A}} \left(\mathcal{R}^{a}+\gamma \mathcal{P}^{a} v\right)\]

This operator is a $\gamma$-contraction, i.e. it brings value functions closer by a factor of at least $\gamma$,

\[\left\|T^{*}(u)-T^{*}(v)\right\|_{\infty} \leq \gamma\|u-v\|_{\infty}\]

Convergence of Value Iteration

  • $v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation)
  • By the contraction mapping theorem, $T^*$ has a unique fixed point, so it must be $v_*$
  • Therefore value iteration converges on $v_*$

References and Acknowledgements

All references are listed in the preface; once again, my heartfelt thanks to the pioneering giants of reinforcement learning!