Thoughts on Reinforcement Learning (5): Dynamic Programming
Notes on dynamic programming.
Table of Contents
- Thoughts on Reinforcement Learning (1): Preface
- Thoughts on Reinforcement Learning (2): Introduction to Reinforcement Learning
- Thoughts on Reinforcement Learning (3): Markov Decision Processes
- Thoughts on Reinforcement Learning (4): Imitation Learning and Supervised Learning
- Thoughts on Reinforcement Learning (5): Dynamic Programming
- Thoughts on Reinforcement Learning (6): Monte Carlo and Temporal Difference
- Thoughts on Reinforcement Learning (7): Policy Gradient
- Thoughts on Reinforcement Learning (8): Actor-Critic Methods
- Thoughts on Reinforcement Learning (9): Value Function Methods
- Thoughts on Reinforcement Learning (10): Deep Q Network
- Thoughts on Reinforcement Learning (11): Advanced Policy Gradient
Principle of Optimality
Theorem (Principle of Optimality)
A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if
- For any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$
State value function & State-action value function
Bellman Expectation Equation
- The state value function for policy $\pi$
- The state-action value function for policy $\pi$
- relation between the state value function and the state-action value function (written out below)
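For reference, these can be written out as follows (I am assuming the $\mathcal{R}^a_s$, $\mathcal{P}^a_{ss'}$ notation, consistent with the $\mathcal{R}^\pi$, $\mathcal{P}^a$ symbols used later in this post):
\[v_{\pi}(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s)\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v_{\pi}(s^{\prime})\right)\]
\[q_{\pi}(s, a)=\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} \sum_{a^{\prime} \in \mathcal{A}} \pi(a^{\prime} \mid s^{\prime})\, q_{\pi}(s^{\prime}, a^{\prime})\]
\[v_{\pi}(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a), \qquad q_{\pi}(s, a)=\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v_{\pi}(s^{\prime})\]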
Bellman Optimality Equation
- The optimal state value function $v_*(s)$
- The optimal state-action value function $q_*(s, a)$
- relation between the optimal state value function and the optimal state-action value function (written out below)
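In the same assumed notation, the optimality versions replace the expectation over $\pi$ with a maximisation over actions:
\[v_{*}(s)=\max_{a \in \mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v_{*}(s^{\prime})\right)\]
\[q_{*}(s, a)=\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} \max_{a^{\prime} \in \mathcal{A}} q_{*}(s^{\prime}, a^{\prime})\]
\[v_{*}(s)=\max_{a \in \mathcal{A}} q_{*}(s, a), \qquad q_{*}(s, a)=\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v_{*}(s^{\prime})\]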
Policy Iteration and Value Iteration
Generalised Policy Iteration
- Policy Evaluation: Estimate $v_\pi$ (Any policy evaluation algorithm)
- Policy improvement: Generate $\pi' \geq \pi$ (any policy improvement algorithm); a minimal sketch of this loop follows below
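A minimal NumPy sketch of this loop for a small tabular MDP. All names here are my own choices, including the assumed layout of the model: `P` holds transition probabilities with shape `[n_states, n_actions, n_states]` and `R` holds expected rewards with shape `[n_states, n_actions]`.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iteratively apply the Bellman expectation backup until v stops changing."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v                  # q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') v(s')
        v_new = (pi * q).sum(axis=1)           # v(s) = sum_a pi(a|s) q(s,a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def greedy_improvement(P, R, v, gamma):
    """Return a deterministic policy that acts greedily with respect to v."""
    q = R + gamma * P @ v
    pi = np.zeros(R.shape)
    pi[np.arange(R.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

def policy_iteration(P, R, gamma):
    """Generalised policy iteration: evaluate, then improve, until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)
        pi_new = greedy_improvement(P, R, v, gamma)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```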
Policy improvement theorem
Theorem (Policy improvement theorem)
Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,
\[q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s)\]Then the policy $\pi'$ must be as good as, or better than, $\pi$.
Proof:
We know that $\pi' \geq \pi$ if and only if $v_{\pi'}(s) \geq v_\pi(s)$ for all $s$, so it suffices to show that
\[\begin{aligned} v_{\pi}(s) &\leq q_{\pi}(s, \pi'(s)) \\&= \mathbb{E}[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s,A_t=\pi'(s)] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s] \\&\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1})) | S_t=s] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma \mathbb{E}[R_{t+2}+\gamma v_\pi(S_{t+2})| S_{t+1},A_{t+1}=\pi'(S_{t+1}) ] | S_t=s] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 v_\pi(S_{t+2}) | S_t=s] \\& \leq\mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\gamma^3 v_\pi(S_{t+3}) | S_t=s] \\& \vdots \\& \leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\gamma^3R_{t+4}+\cdots | S_t=s] \\&=v_{\pi'}(s) \end{aligned}\]Since this shows $v_{\pi'}(s) \geq v_\pi(s)$ for all $s$, we conclude that $\pi' \geq \pi$.
greedy improvement
We can improve the policy by acting greedily
\[\pi'(a \mid s) = \begin{cases} 1, & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_{\pi}(s, a') \\ 0, & \text{otherwise} \end{cases}\]Proof:
\[q_{\pi}(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_{\pi}(s, a) \geq \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a) = v_{\pi}(s)\]Therefore, from the policy improvement theorem, $v_{\pi'}(s) \geq v_{\pi}(s)$, so $\pi' \geq \pi$.
If improvements stop and $\pi' = \pi$,
\[q_{\pi}\left(s, \pi^{\prime}(s)\right)=\max _{a \in \mathcal{A}} q_{\pi}(s, a)=q_{\pi}(s, \pi(s))=v_{\pi}(s)\]Then the Bellman optimality equation has been satisfied
\[v_{\pi}(s)=\max _{a \in \mathcal{A}} q_{\pi}(s, a)\]Therefore $v_{\pi}(s)=v_{*}(s)$ for all $s \in \mathcal{S}$, so $\pi$ is an optimal policy.
$\epsilon$-greedy improvement
For an $\epsilon$-greedy policy $\pi$, we can improve it by acting $\epsilon$-greedily with respect to $q_\pi$, where $m = \lvert \mathcal{A} \rvert$ is the number of actions
\[\pi'(a \mid s) = \begin{cases} \epsilon / m + 1 - \epsilon, & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_{\pi}(s, a') \\ \epsilon / m, & \text{otherwise} \end{cases}\]Proof:
\[\begin{aligned} q_{\pi}\left(s, \pi^{\prime}(s)\right) &=\sum_{a \in \mathcal{A}} \pi^{\prime}(a | s) q_{\pi}(s, a) \\ &=\epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \max _{a \in \mathcal{A}} q_{\pi}(s, a) \\ & \geq \epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a | s)-\epsilon / m}{1-\epsilon} q_{\pi}(s, a) \\ &=\sum_{a \in \mathcal{A}} \pi(a | s) q_{\pi}(s, a)=v_{\pi}(s) \end{aligned}\]The inequality uses the fact that a maximum is at least any weighted average; the weights $\frac{\pi(a \mid s)-\epsilon / m}{1-\epsilon}$ are non-negative and sum to one because $\pi$ is itself $\epsilon$-greedy. Therefore, from the policy improvement theorem, $v_{\pi'}(s) \geq v_{\pi}(s)$, so $\pi' \geq \pi$.
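A small sketch of the $\epsilon$-greedy construction for a single state, where `q` is assumed to hold $q_\pi(s, \cdot)$ as a 1-D array (the function name and layout are my own):

```python
import numpy as np

def epsilon_greedy(q, epsilon):
    """Return pi'(.|s) as a probability vector over the m actions of one state."""
    m = q.shape[0]
    pi = np.full(m, epsilon / m)       # every action gets probability eps/m
    pi[q.argmax()] += 1.0 - epsilon    # the greedy action gets the remaining mass
    return pi

q = np.array([1.0, 3.0, 2.0])
print(epsilon_greedy(q, 0.1))          # [0.03333333 0.93333333 0.03333333]
```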
From Policy Iteration to Value Iteration
Value Iteration
- We may obtain the optimal policy before the value function itself has converged
- So policy evaluation may not need to converge all the way to $v_\pi$
- We can introduce a stopping condition, for example:
- $\epsilon$-convergence of the value function
- simply stop after $k$ iterations of iterative policy evaluation
- update the policy every iteration, i.e. stop after $k = 1$; this is equivalent to value iteration (a minimal sketch follows below)
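A minimal sketch of value iteration under the same assumed `P`/`R` array layout as the policy iteration sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup, then read off the greedy policy."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v          # q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') v(s')
        v_new = q.max(axis=1)          # v(s) = max_a q(s,a)
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1), v_new   # greedy policy and (near-)optimal values
        v = v_new
```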
Dynamic Programming
Synchronous Dynamic Programming Algorithms
| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration (Iterative Policy Evaluation + Policy Improvement) |
| Control | Bellman Optimality Equation | Value Iteration (Truncated Policy Evaluation + Policy Improvement) |
- synchronous backups: all states are backed up in parallel
- Algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
- Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- Could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
- Complexity $O(m^2n^2)$ per iteration
Asynchronous Dynamic Programming
- Asynchronous DP backs up states individually
- Guaranteed to converge if all states continue to be selected
In-Place Dynamic Programming
- Synchronous value iteration stores two copies of the value function
- In-place value iteration stores only one copy of the value function (a sketch of the difference follows below)
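A sketch of the in-place variant under the same assumed `P`/`R` layout. The synchronous version above computes `v_new` from a frozen copy of `v`, whereas here the state loop is written out explicitly so that each backup immediately sees the values written earlier in the same sweep:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, sweeps=1000):
    """One value-function array; later backups in a sweep already see earlier updates."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for s in range(n_states):
            # Bellman optimality backup for a single state, written into v directly
            v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```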
Prioritised Sweeping
- Use magnitude of Bellman error to guide state selection, e.g.
- Backup the state with the largest remaining Bellman error
- Update Bellman error of affected states after each backup
- Requires knowledge of reverse dynamics (predecessor states)
- Can be implemented efficiently by maintaining a priority queue (a rough sketch follows below)
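A rough sketch, again under the assumed `P`/`R` layout; `predecessors[s]` (a name I am introducing) maps each state to the states that can transition into it. Priorities in the heap can go stale, so the error is recomputed before each backup:

```python
import heapq
import numpy as np

def prioritised_sweeping(P, R, gamma, predecessors, n_backups=10000):
    """Always back up the state with the largest remaining Bellman error."""
    n_states = P.shape[0]
    v = np.zeros(n_states)

    def bellman_error(s):
        return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

    # max-heap via negated priorities
    heap = [(-bellman_error(s), s) for s in range(n_states)]
    heapq.heapify(heap)
    backups = 0
    while heap and backups < n_backups:
        _, s = heapq.heappop(heap)
        if bellman_error(s) < 1e-8:
            continue                                # stale entry: this state is already consistent
        v[s] = np.max(R[s] + gamma * P[s] @ v)      # backup of the selected state
        backups += 1
        for p in predecessors[s]:                   # errors of predecessor states may have changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return v
```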
Real-Time Dynamic Programming
- Only back up the states that are relevant to the agent
- Use the agent's experience to guide the selection of states
- After each time-step, back up the state $S_t$ (a tiny sketch follows below)
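A tiny sketch of the per-step backup, still assuming the `P`/`R` model arrays; how $S_t$ is generated (the agent/environment loop and action selection) is left out:

```python
import numpy as np

def rtdp_backup(v, s_t, P, R, gamma):
    """Real-time DP: after visiting state s_t, apply a full Bellman optimality backup to it."""
    v[s_t] = np.max(R[s_t] + gamma * P[s_t] @ v)
    # the greedy action w.r.t. the freshly backed-up values can then drive the next time-step
    return int(np.argmax(R[s_t] + gamma * P[s_t] @ v))
```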
Backups
Full-Width Backups
- DP uses full-width backups
- Every successor state and action is considered
- Using knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems DP suffers Bellman’s curse of dimensionality
- Number of states $n = \lvert \mathcal{S} \rvert$ grows exponentially with the number of state variables
Sample Backups
- Model-free: no advance knowledge of MDP required
- Using sample rewards and sample transitions
- Instead of reward function $R$ and transition dynamics $P$
- Breaks the curse of dimensionality through sampling
- Cost of a backup is constant, independent of $n = \lvert \mathcal{S} \rvert$ (a one-line sketch follows below)
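For contrast with the full-width backup, a sample backup touches only one sampled transition $(s, a, r, s')$. A minimal Q-learning-style update as an illustration; the table `Q` (a list of action-value lists per state) and the step size `alpha` are my own placeholders, and this is really the topic of the following posts:

```python
def sample_backup(Q, s, a, r, s_next, alpha, gamma):
    """Update one entry of Q from a single sampled transition; cost is O(m), independent of n."""
    target = r + gamma * max(Q[s_next])      # bootstrap from the sampled next state only
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```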
Convergence
Theorem (Contraction Mapping Theorem)
For any metric space $V$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction,
- $T$ converges to a unique fixed point
- At a linear convergence rate of $\gamma$
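Concretely, the "linear convergence rate of $\gamma$" means that if $v^{*}$ denotes the fixed point (with $\lVert \cdot \rVert_{\infty}$ standing in for the metric used below), then repeated application of $T$ satisfies
\[\left\|T^{k}(v)-v^{*}\right\|_{\infty} \leq \gamma^{k}\left\|v-v^{*}\right\|_{\infty} \rightarrow 0\]
so the distance to the fixed point shrinks by at least a factor of $\gamma$ per application.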
Bellman Expectation Backup is a Contraction
Define the Bellman expectation backup operator $T^{\pi}$,
\[T^{\pi}(v) = \mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} v\]This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$,
\[\begin{aligned} \left\|T^{\pi}(u)-T^{\pi}(v)\right\|_{\infty} &=\left\|\left(\mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} u\right)-\left(\mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} v\right)\right\|_{\infty} \\ &=\left\|\gamma \mathcal{P}^{\pi}(u-v)\right\|_{\infty} \\ & \leq \gamma\left\|\mathcal{P}^{\pi}\right\|_{\infty}\|u-v\|_{\infty} \\ & \leq \gamma\|u-v\|_{\infty} \end{aligned}\]where the last step uses the fact that each row of $\mathcal{P}^{\pi}$ is a probability distribution, so $\left\|\mathcal{P}^{\pi}\right\|_{\infty}=1$.
Convergence of Iterative Policy Evaluation and Policy Iteration
- The Bellman expectation operator $T^\pi$ has a unique fixed point
- $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)
- By contraction mapping theorem
- Iterative policy evaluation converges on $v_\pi$
- Policy iteration converges on $v_*$
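A quick numerical sanity check of the contraction property on a randomly constructed MDP (the construction, names, and seed here are purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# random policy-conditioned transition matrix P_pi (rows sum to 1) and reward vector R_pi
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(n)

def T(v):
    """Bellman expectation backup operator for this random MDP."""
    return R_pi + gamma * P_pi @ v

u, v = rng.random(n), rng.random(n)
lhs = np.max(np.abs(T(u) - T(v)))         # ||T(u) - T(v)||_inf
rhs = gamma * np.max(np.abs(u - v))       # gamma * ||u - v||_inf
print(lhs <= rhs + 1e-12)                 # True: the backup is a gamma-contraction
```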
Bellman Optimality Backup is a Contraction
Define the Bellman optimality backup operator $T^{*}$,
\[T^{*}(v) = \max_{a\in\mathcal{A}} \left(\mathcal{R}^{a}+\gamma \mathcal{P}^{a} v\right)\]This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$,
\[\left\|T^{*}(u)-T^{*}(v)\right\|_{\infty} \leq \gamma\|u-v\|_{\infty}\]
Convergence of Value Iteration
- The Bellman optimality operator $T^*$ has a unique fixed point
- $v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation)
- By contraction mapping theorem
- Value iteration converges on $v_*$
References and Acknowledgements
All references are listed in the preface. Once again, my heartfelt thanks to the pioneers and experts of reinforcement learning!