Reinforcement Learning Reflections (9): Value Function Methods


Some notes on value function methods.


State value function & State-action value function

In value function methods we usually work with the state-action value function rather than the state value function; the comparison below shows why.

policy evaluation

State-action value function

\[Q^\pi(s,a) \leftarrow r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[Q^\pi(s',\pi(s'))]\]

State value function

\[V^\pi(s) \leftarrow r(s,\pi(s)) + \gamma E_{s' \sim p(s' | s,\pi(s))}[V^\pi(s')]\]
  • The state-action value function can be fitted from samples alone, with no model of the MDP (see the sketch below)
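
A minimal sketch of why the $Q$-form can be evaluated from samples alone: given a sampled transition $(s, a, r, s')$, the regression target only needs the policy's action at $s'$, not the transition model. The `q_net` and `policy` callables below are hypothetical placeholders added for illustration.

```python
import torch

def q_evaluation_target(q_net, policy, transition, gamma=0.99):
    """One-sample target for policy evaluation: r + gamma * Q^pi(s', pi(s')).

    q_net(s) is assumed to return a 1-D vector of Q-values over discrete actions,
    and policy(s) the index of the action the evaluated policy takes at s.
    """
    s, a, r, s_prime, done = transition
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_prime)[policy(s_prime)]
    return target  # regress q_net(s)[a] towards this value
```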

value iteration

State-action value function

\[Q_{\phi}(s,a) \leftarrow r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[\max_{a'} Q_{\phi}(s',a')]\]

State value function

\[V_\phi(s) \leftarrow \max_a(r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[V_\phi(s')])\]
  • The state value function update needs to know the outcomes of different actions, which requires a model (a tabular sketch of both updates follows this list)
  • warning:
    • $V^\pi$: value function for policy $\pi$; this is what the critic estimates
    • $V^{\star}$: value function for the optimal policy $\pi^{\star}$; this is what value iteration computes
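
A minimal tabular sketch of the two updates, assuming a known transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (hypothetical names). With the model available the two forms are equivalent; the advantage of the $Q$-form (taking the max over a stored table rather than over simulated outcomes) only pays off once the model is replaced by samples, as in fitted Q-iteration below.

```python
import numpy as np

def value_iteration_v(P, R, gamma=0.99, iters=1000):
    """V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * np.einsum("sat,t->sa", P, V), axis=1)
    return V

def value_iteration_q(P, R, gamma=0.99, iters=1000):
    """Q(s,a) <- R(s,a) + gamma * sum_s' P(s,a,s') max_a' Q(s',a')."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * np.einsum("sat,t->sa", P, Q.max(axis=1))
    return Q
```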

greedy policy improvement

State-action value function

\[\pi(s) \leftarrow \arg \max _{a \in \mathcal{A}}Q(s,a)\]

State value function

\[\pi(s) \leftarrow \arg \max _{a \in \mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} V\left(s^{\prime}\right)\right)\]
  • The state value function requires a model of the MDP (the transition probabilities); a short sketch of both improvement steps follows
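
A minimal sketch contrasting the two improvement steps, reusing the hypothetical tabular `P`, `R` arrays from above: with $Q$ the greedy policy is a plain argmax over the table, while with $V$ the argmax has to look one step ahead through the model.

```python
import numpy as np

def greedy_from_q(Q):
    """pi(s) = argmax_a Q(s, a): no model required."""
    return np.argmax(Q, axis=1)

def greedy_from_v(V, P, R, gamma=0.99):
    """pi(s) = argmax_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]: needs the MDP model."""
    return np.argmax(R + gamma * np.einsum("sat,t->sa", P, V), axis=1)
```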

Q-iteration algorithms

full fitted Q-iteration algorithm
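
A minimal sketch of full fitted Q-iteration under the standard formulation (repeatedly compute Bellman targets on a fixed batch of transitions, then regress $Q_\phi$ onto them). The `q_net` interface and dataset layout are assumptions for illustration, not the exact algorithm box from the lecture notes.

```python
import torch

def fitted_q_iteration(q_net, dataset, n_iters=100, n_fit_steps=50, gamma=0.99, lr=1e-3):
    """Full fitted Q-iteration on a fixed batch of transitions.

    dataset: tensors (s, a, r, s2, done); q_net(s) is assumed to return
    Q-values for every discrete action, shape (batch, n_actions).
    """
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    s, a, r, s2, done = dataset
    for _ in range(n_iters):
        with torch.no_grad():  # targets use the current Q, held fixed during the inner fit
            y = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values
        for _ in range(n_fit_steps):  # supervised regression of Q(s, a) onto y
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_net
```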

online Q-iteration algorithm

  • works even for off-policy samples (unlike actor-critic)
  • only one network, no high-variance policy gradient
  • no convergence guarantees for non-linear function approximation
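
A minimal sketch of the online variant under the same assumptions: take one transition at a time and do a single gradient step towards $r + \gamma \max_{a'} Q_\phi(s', a')$ (the hypothetical `q_net` here is applied to a single state).

```python
import torch

def online_q_step(q_net, opt, s, a, r, s2, done, gamma=0.99):
    """One online Q-iteration step on a single transition (s, a, r, s2, done).

    q_net(s) is assumed to return a 1-D vector of Q-values for one state s.
    """
    with torch.no_grad():  # no gradient flows through the bootstrap target
        y = r + gamma * (1.0 - float(done)) * q_net(s2).max()
    q_sa = q_net(s)[a]
    loss = (q_sa - y) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```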

Exploration

Do not collect data with a purely deterministic (greedy) policy; some exploration is needed. Two common choices:

1. Epsilon-greedy, where $m = \vert \mathcal{A} \vert$ is the number of actions (the value of $\epsilon$ can be gradually decayed during training)

\[P(a|s) = \begin{cases} \epsilon / m + 1 - \epsilon, & \text{if } a = \arg \max_{a' \in \mathcal{A}} Q(s,a') \\ \epsilon / m, & \text{otherwise} \end{cases}\]

2. Boltzmann exploration

\[P(a | s)=\frac{\exp (Q(s, a))}{\sum_{a'} \exp (Q(s, a'))}\]
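
A minimal sketch of both rules for a single state, given a vector of Q-values over the $m$ actions (the `rng` argument and the softmax temperature are assumptions added for illustration).

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With prob. epsilon pick uniformly among the m actions, otherwise act greedily.

    The greedy action therefore has total probability epsilon/m + 1 - epsilon.
    """
    rng = rng or np.random.default_rng()
    m = len(q_values)
    if rng.random() < epsilon:
        return int(rng.integers(m))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=None):
    """Sample a ~ softmax(Q(s, .) / temperature); higher temperature means more exploration."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```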

Value function learning theory

contraction operator

Define the operator $\mathcal{B}$:

\[{\cal{B}}V = \max_a r_a + \gamma {\cal{T}}_a V\]

1. $V^*$ is a fixed point of the operator $\mathcal{B}$:

\[V^* = {\cal{B}}V^* = \max_a r_a + \gamma {\cal{T}}_a V^*\]

2. The operator $\mathcal{B}$ is a contraction with respect to the $\infty$-norm: for any $V$ and $\bar{V}$, we have

\[|| {\cal{B}}V - {\cal{B}}\bar{V} ||_{\infty} \leq \gamma || V - \bar{V} ||_{\infty}\]

For a more detailed proof, see Reinforcement Learning Reflections (5): Dynamic Programming.

3. We can prove that value iteration converges to $V^\star$ because the operator $\mathcal{B}$ is a contraction: choose $\bar{V} = V^\star$, which is a fixed point,

\[V^* = {\cal{B}}V^*\]

then

\[|| {\cal{B}}V - V^* ||_{\infty} \leq \gamma || V - V^* ||_{\infty}\]
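
Unrolling this inequality over the iterates $V_{k+1} = \mathcal{B}V_k$ makes the convergence explicit (a standard one-line consequence, written out here for completeness; it relies on $\gamma < 1$):

\[|| V_k - V^* ||_{\infty} = || {\cal{B}}V_{k-1} - {\cal{B}}V^* ||_{\infty} \leq \gamma || V_{k-1} - V^* ||_{\infty} \leq \dots \leq \gamma^k || V_0 - V^* ||_{\infty} \rightarrow 0\]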

Define the operator $\Pi$:

\[\Pi V = \arg \min_{V'\in \Omega} \frac{1}{2} \sum_s{|| V'(s) - V(s) ||^2}\]

$\Pi$ is a projection onto $\Omega$, and is a contraction

\[|| \Pi V - \Pi \bar{V} ||^2 \leq || V - \bar{V} ||^2\]
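
In the function-approximation setting $\Omega$ is the set of value functions the chosen model can represent, so applying $\Pi$ is exactly the familiar least-squares regression step (restated here under the assumption of a parametric family $V_\phi$):

\[\Pi V = V_{\phi^*}, \qquad \phi^* = \arg \min_{\phi} \frac{1}{2} \sum_s{|| V_\phi(s) - V(s) ||^2}\]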

Algorithm summary

  • ${\cal{B}}$ is a contraction w.r.t. $\infty$-norm
  • $\Pi$ is a contraction w.r.t. the $\ell_2$-norm
  • $\Pi{\cal{B}}$ is not a contraction of any kind

value iteration algorithm (using ${\cal{B}}$)

\[V \leftarrow {\cal{B}}V\]
  • converges (tabular case)

fitted value iteration algorithm (using ${\cal{B}}$ and $\Pi$)

\[V \leftarrow \Pi{\cal{B}}V\]
  • does not converge in general

fitted Q iteration algorithm (using ${\cal{B}}$ and $\Pi$)

Redefine $\mathcal{B}$ and $\Pi$ for $Q$, analogously to the $V$ case:

\[Q \leftarrow \Pi{\cal{B}} Q\]
  • does not converge in general
  • the empirical success of deep Q-learning owes much to the strong fitting capacity of its network, so the projection $\Pi$ introduces very little error
  • warning: Q-learning is not gradient descent; there is no gradient through the target value, so convergence is not guaranteed (see the sketch below)
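
A minimal sketch of that last warning, reusing the hypothetical batched `q_net` from above: the bootstrap target is explicitly detached, so the update is only a "semi-gradient" of a moving objective rather than gradient descent on a fixed loss.

```python
import torch

def q_learning_loss(q_net, s, a, r, s2, done, gamma=0.99):
    """Q-learning regression loss with the bootstrap target treated as a constant."""
    target = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values.detach()  # no gradient here
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Removing .detach() would instead give the gradient of the Bellman residual,
    # which is a different algorithm, not standard Q-learning.
    return ((q_sa - target) ** 2).mean()
```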

batch actor-critic algorithm (using ${\cal{B}}$ and $\Pi$)

when fitting $V$:

\[V \leftarrow \Pi{\cal{B}}V\]
  • does not converge in general

Types of Value Function Approximation

Convergence

Convergence of Prediction Algorithms

| On/Off-Policy | Algorithm | Table Lookup | Linear | Non-Linear |
| --- | --- | --- | --- | --- |
| On-Policy | MC | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|  | LSMC | $\checkmark$ | $\checkmark$ | - |
|  | TD(0) | $\checkmark$ | $\checkmark$ | $\times$ |
|  | TD($\lambda$) | $\checkmark$ | $\checkmark$ | $\times$ |
|  | Gradient TD | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|  | LSTD | $\checkmark$ | $\checkmark$ | - |
| Off-Policy | MC | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|  | LSMC | $\checkmark$ | $\checkmark$ | - |
|  | TD(0) | $\checkmark$ | $\times$ | $\times$ |
|  | TD($\lambda$) | $\checkmark$ | $\times$ | $\times$ |
|  | Gradient TD | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|  | LSTD | $\checkmark$ | $\checkmark$ | - |

Convergence of Control Algorithms

| Algorithm | Table Lookup | Linear | Non-Linear |
| --- | --- | --- | --- |
| MC Control | $\checkmark$ | ($\checkmark$) | $\times$ |
| Sarsa | $\checkmark$ | ($\checkmark$) | $\times$ |
| Q-learning | $\checkmark$ | $\times$ | $\times$ |
| Gradient Q-learning | $\checkmark$ | $\checkmark$ | $\times$ |
| LSPI | $\checkmark$ | ($\checkmark$) | - |

  • ($\checkmark$) = chatters around the near-optimal value function

References and Acknowledgements

All references were listed in the preface to this series; once again, my sincere thanks to the great researchers and teachers of reinforcement learning!