Reinforcement Learning Reflections (5): Dynamic Programming


Notes and caveats on dynamic programming.


Principle of Optimality

Theorem (Principle of Optimality)

A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if

  • for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$

State value function & State-action value function

Bellman Expectation Equation

  • The state value function for policy $\pi$
\[\begin{aligned} v_{\pi}(s) &\doteq \Bbb{E}_{\pi}[G_t | S_t = s] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \end{aligned}\]
  • The state-action value function for policy $\pi$
\[\begin{aligned} q_{\pi}(s,a) &\doteq \Bbb{E}_{\pi}[G_t | S_t = s, A_t = a] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\&= \Bbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] \end{aligned}\]
  • Relation between the state value function and the state-action value function (a policy evaluation sketch follows the equations)
\[v_{\pi}(s) = \Bbb{E}_{\pi}[q_{\pi}(S_t,A_t) | S_t = s]\\ q_{\pi}(s,a) = \Bbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a]\]
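As a concrete illustration, here is a minimal sketch of iterative policy evaluation, which simply turns the Bellman expectation backup into an update rule. The two-state MDP (`P`, `R`, `gamma`) and the uniform random policy are made-up assumptions for demonstration only.

```python
import numpy as np

# Made-up toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy pi(a|s)

def policy_evaluation(pi, theta=1e-8):
    """Iterate the Bellman expectation backup until the change is below theta."""
    v = np.zeros(2)
    while True:
        q = R + gamma * P @ v          # q_pi(s, a): one-step lookahead
        v_new = (pi * q).sum(axis=1)   # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

print(policy_evaluation(pi))  # approximate v_pi for the random policy
```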

Bellman Optimality Equation

  • The optimal state value function
\[\begin{aligned} v_*(s) &\doteq \max_{\pi} v_{\pi}(s) \\&= \max_a\Bbb{E}[R_{t+1} + \gamma v_{*}(S_{t+1}) | S_t = s, A_t =a] \end{aligned}\]
  • The optimal state-action value function
\[\begin{aligned} q_{*}(s,a) &\doteq \max_{\pi} q_{\pi}(s,a) \\&= \Bbb{E}[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') | S_t = s, A_t = a] \end{aligned}\]
  • Relation between the optimal state value function and the optimal state-action value function
\[\begin{aligned} v_*(s) &= \max_{a} q_{*}(s,a) \\ q_{*}(s,a) &= \Bbb{E}[R_{t+1} + \gamma v_{*}(S_{t+1}) | S_t = s, A_t = a] \end{aligned}\]

Policy Iteration and Value Iteration

Generalised Policy Iteration

  • Policy Evaluation: Estimate $v_\pi$ (any policy evaluation algorithm)
  • Policy Improvement: Generate $\pi' \geq \pi$ (any policy improvement algorithm); a minimal sketch follows this list
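Here is a minimal sketch of one instance of GPI: full iterative policy evaluation alternated with greedy policy improvement, i.e. policy iteration. The toy dynamics `P`, rewards `R`, and discount `gamma` are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.5, 1.0]])

def evaluate(pi, theta=1e-8):
    """Policy evaluation: solve for v_pi by repeated expectation backups."""
    v = np.zeros(2)
    while True:
        v_new = (pi * (R + gamma * P @ v)).sum(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

pi = np.full((2, 2), 0.5)                  # start from the random policy
while True:
    v = evaluate(pi)                       # policy evaluation
    greedy = np.argmax(R + gamma * P @ v, axis=1)
    pi_new = np.eye(2)[greedy]             # greedy policy improvement (one-hot)
    if np.array_equal(pi_new, pi):
        break                              # policy stable: pi is optimal
    pi = pi_new

print(greedy, v)                           # optimal actions and values
```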

Policy improvement theorem

Theorem (Policy improvement theorem)

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \cal{S}$,

\[q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s)\]

Then the policy $\pi'$ must be as good as, or better than, $\pi$.

Proof:

We know that $\pi' \geq \pi$ if and only if $v_{\pi'}(s) \geq v_{\pi}(s)$ for all $s$, so it suffices to show that

\[\begin{aligned} v_{\pi}(s) &\leq q_{\pi}(s, \pi'(s)) \\&= \mathbb{E}[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s,A_t=\pi'(s)] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s] \\&\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1})) | S_t=s] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma \mathbb{E}[R_{t+2}+\gamma v_\pi(S_{t+2})| S_{t+1},A_{t+1}=\pi'(S_{t+1}) ] | S_t=s] \\&= \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 v_\pi(S_{t+2}) | S_t=s] \\& \leq\mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\gamma^3 v_\pi(S_{t+3}) | S_t=s] \\& \vdots \\& \leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\gamma^3R_{t+4}+\cdots | S_t=s] \\&=v_{\pi'}(s) \end{aligned}\]

Since $v_{\pi'}(s) \geq v_{\pi}(s)$ holds for all $s$, we conclude that $\pi' \geq \pi$.

Greedy improvement

We can improve the policy by acting greedily:

\[\pi'(a|s) = \begin{cases} 1, & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_{\pi}(s,a') \\ 0, & \text{otherwise} \end{cases}\]

Proof:

For the greedy policy $\pi'$, by construction

\[q_{\pi}(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_{\pi}(s,a) \geq \sum_{a \in \mathcal{A}} \pi(a|s)\, q_{\pi}(s,a) = v_{\pi}(s)\]

Therefore, by the policy improvement theorem, $v_{\pi'}(s) \geq v_{\pi}(s)$, so $\pi' \geq \pi$.

If improvement stops, so that $\pi' = \pi$, then

\[q_{\pi}\left(s, \pi^{\prime}(s)\right)=\max _{a \in \mathcal{A}} q_{\pi}(s, a)=q_{\pi}(s, \pi(s))=v_{\pi}(s)\]

Then the Bellman optimality equation has been satisfied

\[v_{\pi}(s)=\max _{a \in \mathcal{A}} q_{\pi}(s, a)\]

Therefore $v_{\pi}(s)=v_{*}(s)$ for all $s \in \mathcal{S}$, so $\pi$ is an optimal policy.

$\epsilon$-greedy improvement

We can improve the policy by acting $\epsilon$-greedily, where $m = |\mathcal{A}|$ is the number of actions:

\[\pi'(a|s) = \begin{cases} \epsilon / m + 1 - \epsilon, & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_{\pi}(s,a') \\ \epsilon / m, & \text{otherwise} \end{cases}\]

Proof:

\[\begin{aligned} q_{\pi}\left(s, \pi^{\prime}(s)\right) &=\sum_{a \in \mathcal{A}} \pi^{\prime}(a | s) q_{\pi}(s, a) \\ &=\epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \max _{a \in \mathcal{A}} q_{\pi}(s, a) \\ & \geq \epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a | s)-\epsilon / m}{1-\epsilon} q_{\pi}(s, a) \\ &=\sum_{a \in \mathcal{A}} \pi(a | s) q_{\pi}(s, a)=v_{\pi}(s) \end{aligned}\]

Therefore, by the policy improvement theorem, $v_{\pi'}(s) \geq v_{\pi}(s)$, so $\pi' \geq \pi$.
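A minimal sketch of the $\epsilon$-greedy improvement step itself; the `q_pi` values below are made-up placeholders.

```python
import numpy as np

epsilon = 0.1
q_pi = np.array([[1.2, 3.4],     # made-up q_pi(s, a) table
                 [2.0, 0.5]])
n_states, m = q_pi.shape         # m = |A|, the number of actions

pi_new = np.full((n_states, m), epsilon / m)          # epsilon/m on every action
pi_new[np.arange(n_states), q_pi.argmax(axis=1)] += 1 - epsilon

print(pi_new)  # each row sums to 1; the greedy action gets eps/m + 1 - eps
```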

From Policy Iteration to Value Iteration

Value Iteration

  • We may reach the optimal policy before the value function has converged to $v_*$
  • So policy evaluation need not converge exactly to $v_\pi$
  • We can introduce a stopping condition:
    • $\epsilon$-convergence of the value function
    • simply stop after $k$ iterations of iterative policy evaluation
      • update the policy every iteration
      • stopping at $k = 1$ is equivalent to value iteration (see the sketch below)
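A minimal value iteration sketch on the same made-up toy MDP as the earlier sketches: each sweep applies one Bellman optimality backup, and the greedy policy is read off at the end.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.5, 1.0]])

v = np.zeros(2)
while True:
    v_new = np.max(R + gamma * P @ v, axis=1)   # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-8:        # epsilon-convergence
        break
    v = v_new

pi_star = np.argmax(R + gamma * P @ v, axis=1)  # greedy policy w.r.t. v*
print(v, pi_star)
```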

Dynamic Programming

Synchronous Dynamic Programming Algorithms

| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration (Iterative Policy Evaluation + Policy Improvement) |
| Control | Bellman Optimality Equation | Value Iteration (Truncated Policy Evaluation + Policy Improvement) |
  • synchronous backups: all states are backed up in parallel
  • These algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
    • Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
  • They could also be applied to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
    • Complexity $O(m^2n^2)$ per iteration

Asynchronous Dynamic Programming

  • Asynchronous DP backs up states individually
  • Guaranteed to converge if all states continue to be selected

In-Place Dynamic Programming

  • Synchronous value iteration stores two copies of the value function
  • In-place value iteration stores only one copy of the value function (see the sketch below)
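A sketch contrasting the two, on the same made-up toy MDP: the synchronous sweep reads only the old copy, while the in-place sweep overwrites a single array so later states already see updated values.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])

def sync_sweep(v):
    """Synchronous backup: reads old v, returns a fresh copy."""
    return np.max(R + gamma * P @ v, axis=1)

def inplace_sweep(v):
    """In-place backup: later states use values already updated in this sweep."""
    for s in range(len(v)):
        v[s] = max(R[s, a] + gamma * P[s, a] @ v for a in range(R.shape[1]))
    return v

v = np.zeros(2)
for _ in range(200):
    v = inplace_sweep(v)   # often converges in fewer sweeps than sync
print(v)
```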

Prioritised Sweeping

  • Use magnitude of Bellman error to guide state selection, e.g.
\[\left|\max _{a \in \mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v\left(s^{\prime}\right)\right)-v(s)\right|\]
  • Backup the state with the largest remaining Bellman error
  • Update Bellman error of affected states after each backup
  • Requires knowledge of reverse dynamics (predecessor states)
  • Can be implemented efficiently by maintaining a priority queue, as in the sketch below
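A prioritised-sweeping sketch on the same made-up toy MDP: a max-priority queue (`heapq` with negated keys) always backs up the state with the largest Bellman error, and the hand-written `predecessors` map stands in for knowledge of reverse dynamics.

```python
import heapq
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])
predecessors = {0: [0, 1], 1: [0, 1]}   # s' -> states that can lead to s'

def bellman_error(v, s):
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

v = np.zeros(2)
heap = [(-bellman_error(v, s), s) for s in range(2)]
heapq.heapify(heap)
while heap:
    neg_err, s = heapq.heappop(heap)
    if -neg_err < 1e-8:
        break                                  # largest error is tiny: done
    v[s] = np.max(R[s] + gamma * P[s] @ v)     # back up the worst state
    for p in predecessors[s]:                  # its predecessors' errors changed
        heapq.heappush(heap, (-bellman_error(v, p), p))
print(v)
```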

Real-Time Dynamic Programming

  • Only back up the states that are relevant to the agent
  • Use the agent's experience to guide the selection of states
  • After each time-step, back up the state $S_t$ (a sketch follows the equation):
\[v(S_t) \leftarrow \max _{a \in \mathcal{A}}\left(\mathcal{R}_{S_t}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{S_t s^{\prime}}^{a} v\left(s^{\prime}\right)\right)\]
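A real-time DP sketch on the same made-up toy MDP; acting greedily with respect to the current values and sampling next states from `P` are illustrative assumptions about the agent's behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])

v = np.zeros(2)
s = 0                                  # current state S_t
for _ in range(500):
    q = R[s] + gamma * P[s] @ v        # one-step lookahead at S_t only
    v[s] = np.max(q)                   # back up just the visited state
    a = int(np.argmax(q))              # act greedily w.r.t. current v
    s = int(rng.choice(2, p=P[s, a]))  # environment samples S_{t+1}
print(v)
```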

Backups

Full-Width Backups

  • DP uses full-width backups
    • Every successor state and action is considered
    • Using knowledge of the MDP transitions and reward function
  • DP is effective for medium-sized problems (millions of states)
  • For large problems DP suffers Bellman’s curse of dimensionality
    • The number of states $n = |\mathcal{S}|$ grows exponentially with the number of state variables

Sample Backups

  • Model-free: no advance knowledge of MDP required
    • Using sample rewards and sample transitions
    • Instead of reward function $R$ and transition dynamics $P$
  • Breaks the curse of dimensionality through sampling
    • Cost of a backup is constant, independent of $n = |\mathcal{S}|$

Convergence

Theorem (Contraction Mapping Theorem)

For any metric space $V$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction,

  • Iterated application of $T$ converges to a unique fixed point
  • At a linear convergence rate of $\gamma$

Bellman Expectation Backup is a Contraction

Define the Bellman expectation backup operator $T^{\pi}$,

\[T^{\pi}(v) = \mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} v\]

This operator is a $\gamma$-contraction, i.e. it brings value functions closer by a factor of at least $\gamma$ (a numerical check follows the derivation):

\[\begin{aligned} \left\|T^{\pi}(u)-T^{\pi}(v)\right\|_{\infty} &=\left\|\left(\mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} u\right)-\left(\mathcal{R}^{\pi}+\gamma \mathcal{P}^{\pi} v\right)\right\|_{\infty} \\ &=\left\|\gamma \mathcal{P}^{\pi}(u-v)\right\|_{\infty} \\ & \leq\left\|\gamma \mathcal{P}^{\pi}\|u-v\|_{\infty}\right\|_{\infty} \\ & \leq \gamma\|u-v\|_{\infty} \end{aligned}\]

where the third step treats $\|u-v\|_{\infty}$ as a constant vector and the last step uses the fact that each row of $\mathcal{P}^{\pi}$ sums to 1.
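A quick numerical sanity check of the contraction property, on a randomly generated (made-up) policy-averaged MDP `P_pi`, `R_pi`:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
P_pi = rng.random((4, 4))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # rows sum to 1 (stochastic matrix)
R_pi = rng.random(4)

def T(v):
    """Bellman expectation backup operator T^pi."""
    return R_pi + gamma * P_pi @ v

u, v = rng.random(4), rng.random(4)
lhs = np.max(np.abs(T(u) - T(v)))          # ||T(u) - T(v)||_inf
rhs = gamma * np.max(np.abs(u - v))        # gamma * ||u - v||_inf
print(lhs <= rhs + 1e-12)                  # True: T^pi is a gamma-contraction
```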

Convergence of Iterative Policy Evaluation and Policy Iteration

  • $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)
  • By the contraction mapping theorem, $T^\pi$ has a unique fixed point, so it must be $v_\pi$
  • Therefore iterative policy evaluation converges on $v_\pi$
  • And policy iteration converges on $v_*$

Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator $T^{*}$,

\[T^{*}(v) = \max_{a\in\mathcal{A}} \left(\mathcal{R}^{a}+\gamma \mathcal{P}^{a} v\right)\]

This operator is a $\gamma$-contraction, i.e. it brings value functions closer by a factor of at least $\gamma$,

\[\left\|T^{*}(u)-T^{*}(v)\right\|_{\infty} \leq \gamma\|u-v\|_{\infty}\]

Convergence of Value Iteration

  • $v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation)
  • By the contraction mapping theorem, $T^*$ has a unique fixed point, so it must be $v_*$
  • Therefore value iteration converges on $v_*$

References and Acknowledgements

All references are listed in the preface; once again, my heartfelt thanks to the pioneering giants of reinforcement learning!