Reinforcement Learning Thoughts (9): Value Function Methods
Notes and caveats on value function methods.
Contents
- Reinforcement Learning Thoughts (1): Preface
- Reinforcement Learning Thoughts (2): Introduction to Reinforcement Learning
- Reinforcement Learning Thoughts (3): Markov Decision Processes
- Reinforcement Learning Thoughts (4): Imitation Learning and Supervised Learning
- Reinforcement Learning Thoughts (5): Dynamic Programming
- Reinforcement Learning Thoughts (6): Monte Carlo and Temporal Difference
- Reinforcement Learning Thoughts (7): Policy Gradient
- Reinforcement Learning Thoughts (8): Actor-Critic Methods
- Reinforcement Learning Thoughts (9): Value Function Methods
- Reinforcement Learning Thoughts (10): Deep Q Network
- Reinforcement Learning Thoughts (11): Advanced Policy Gradient
State value function & State-action value function
In value function methods we usually work with the state-action value function rather than the state value function; the comparisons below show why.
policy evaluation
State-action value function
\[Q^\pi(s,a) \leftarrow r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[Q^\pi(s',\pi(s'))]\]
State value function
\[V^\pi(s) \leftarrow r(s,\pi(s)) + \gamma E_{s' \sim p(s' | s,\pi(s))}[V^\pi(s')]\]
- The state-action value function can be fitted using samples (see the sketch below)
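To make the "fitted using samples" point concrete, here is a minimal sketch of how the $Q^\pi$ backup becomes a regression target using only sampled transitions; the function names (`q_func`, `pi`) and the batch layout are illustrative assumptions, not a fixed interface:

```python
import numpy as np

def q_pi_targets(q_func, pi, rewards, next_states, gamma=0.99):
    """Sample-based policy-evaluation targets y_i = r_i + gamma * Q^pi(s'_i, pi(s'_i)).

    q_func(s, a) returns the current estimate of Q^pi; pi(s) returns the policy's action.
    The transition model p(s'|s,a) is never needed -- only sampled transitions.
    """
    next_actions = [pi(s) for s in next_states]
    next_q = np.array([q_func(s, a) for s, a in zip(next_states, next_actions)])
    return np.asarray(rewards) + gamma * next_q

# A function approximator Q_phi is then fit by regression so that
# Q_phi(s_i, a_i) matches y_i, and the procedure repeats.
```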
value iteration
State-action value function
\[Q_{\phi}(s,a) \leftarrow r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[\max_{a'} Q_{\phi}(s',a')]\]
State value function
\[V_\phi(s) \leftarrow \max_a\left(r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[V_\phi(s')]\right)\]
- The state value function needs to know the outcomes of different actions (see the tabular sketch below)
- warning:
- $V^\pi$: value function for policy $\pi$, this is what the critic does
- $V^{\star}$: value function for optimal policy $\pi^{\star}$, this is what value iteration does
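As a sanity check of the two backups above, here is a minimal tabular sketch on a small randomly generated MDP; the arrays `P` and `R` are illustrative assumptions. With a known model both backups reach the same optimal values; the practical advantage of the $Q$ form is that its max is taken over the stored table, so it still works when the model is replaced by samples (which is what fitted Q-iteration exploits below):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical known model: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

V = np.zeros(n_states)
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # V backup: max over actions of r(s,a) + gamma * E_{s'}[V(s')] -- needs the model.
    V = np.max(R + gamma * P @ V, axis=1)
    # Q backup: r(s,a) + gamma * E_{s'}[max_a' Q(s',a')] -- the max is over the stored table.
    Q = R + gamma * P @ np.max(Q, axis=1)

# Both backups converge to the same optimal state values.
assert np.allclose(V, np.max(Q, axis=1), atol=1e-4)
```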
greedy policy improvement
State-action value function
\[\pi(s) \leftarrow \arg \max _{a \in \mathcal{A}} Q(s,a)\]
State value function
\[\pi(s) \leftarrow \arg \max _{a \in \mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v\left(s^{\prime}\right)\right)\]
- The state value function needs to know the model of the MDP (the transition probabilities); see the short sketch below
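Continuing the tabular sketch above (reusing `P`, `R`, `Q`, `V`, `gamma`), greedy policy improvement from $Q$ is a single argmax per state, while the same step from $V$ needs the reward and transition model:

```python
import numpy as np

# Greedy policy from the state-action value function: an argmax over the table, no model needed.
pi_from_q = np.argmax(Q, axis=1)

# Greedy policy from the state value function: requires r(s,a) and p(s'|s,a).
pi_from_v = np.argmax(R + gamma * P @ V, axis=1)

# Once the backups have converged, both recover the same greedy (optimal) policy.
```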
Q iteration algorithm
full fitted Q-iteration algorithm
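Reading the algorithm as code may help. Below is a minimal sketch of full fitted Q-iteration under some stated assumptions: transitions have already been collected into a fixed dataset, actions are discrete, and a generic scikit-learn-style regressor (here `RandomForestRegressor`, purely illustrative) plays the role of $Q_\phi$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # any regressor with fit/predict would do

def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.99):
    """Full fitted Q-iteration on a fixed dataset of (s, a, r, s') transitions.

    Each state is a 1-D feature vector; actions are integers in [0, n_actions).
    """
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])

    def featurize(s_batch, a):
        # Represent Q(s, a) as a regressor over [state features, one-hot action].
        one_hot = np.tile(np.eye(n_actions)[a], (len(s_batch), 1))
        return np.concatenate([s_batch, one_hot], axis=1)

    X = np.concatenate([states, np.eye(n_actions)[actions]], axis=1)
    q = None
    for _ in range(n_iters):
        if q is None:
            max_next_q = np.zeros(len(transitions))        # Q is initialized to zero
        else:
            per_action = [q.predict(featurize(next_states, a)) for a in range(n_actions)]
            max_next_q = np.max(np.stack(per_action, axis=1), axis=1)
        y = rewards + gamma * max_next_q                    # y_i = r_i + gamma * max_a' Q(s'_i, a')
        q = RandomForestRegressor(n_estimators=50).fit(X, y)  # full regression onto the targets
    return q
```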
online Q-iteration algorithm
- works even for off-policy samples (unlike actor-critic)
- only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation
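By contrast, the online variant takes one transition at a time and applies a single update toward the bootstrapped target. A minimal tabular sketch (the target is treated as a constant, which is also why this is not true gradient descent, as the warning further down notes):

```python
import numpy as np

def online_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One online Q-iteration (Q-learning) step on a tabular Q array of shape [n_states, n_actions]."""
    target = r + gamma * np.max(Q[s_next])   # bootstrapped target; no gradient flows through it
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate a step toward the target
    return Q
```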
Exploration
Do not act with a purely deterministic policy during training; otherwise some actions are never tried.
1. Epsilon Greedy (the value of $\epsilon$ can be gradually decreased over the course of training)
\[P(a|s) = \begin{cases} \epsilon / m + 1 - \epsilon, & \text{if } a = \arg \max_{a' \in \mathcal{A}} Q(s,a') \\ \epsilon / m, & \text{otherwise} \end{cases}\]
where $m = |\mathcal{A}|$ is the number of actions.
2. Boltzmann Exploration
\[P(a | s)=\frac{\exp (Q(s, a))}{\sum_{a'} \exp (Q(s, a'))}\]
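A minimal sketch of the two exploration rules above, for one row of a tabular Q (i.e. the values $Q(s,\cdot)$ for a fixed state):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """P(a|s) = eps/m for every action, plus an extra (1 - eps) on the greedy action."""
    m = len(q_row)
    probs = np.full(m, epsilon / m)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return rng.choice(m, p=probs)

def boltzmann(q_row, rng):
    """P(a|s) proportional to exp(Q(s,a)); subtracting the max is for numerical stability only."""
    z = np.exp(q_row - np.max(q_row))
    return rng.choice(len(q_row), p=z / z.sum())

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```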
Value function learning theory
contraction operator
Define the backup operator ${\cal{B}}$, where $r_a$ is the vector of rewards for action $a$ and ${\cal{T}}_a$ is the corresponding transition matrix:
\[{\cal{B}}V = \max_a r_a + \gamma {\cal{T}}_a V\]
1. $V^*$ is a fixed point of the operator ${\cal{B}}$:
\[V^* = {\cal{B}}V^* = \max_a r_a + \gamma {\cal{T}}_a V^*\]
2. The operator ${\cal{B}}$ is a contraction: for any $V$ and $\bar{V}$, we have
\[|| {\cal{B}}V - {\cal{B}}\bar{V} ||_{\infty} \leq \gamma || V - \bar{V} ||_{\infty}\]
For more detail on the proof, see Reinforcement Learning Thoughts (5): Dynamic Programming.
3. We can prove that value iteration reaches $V^\star$ because the operator ${\cal{B}}$ is a contraction. Choosing $V^\star$ as $\bar{V}$ and noting that
\[V^* = {\cal{B}}V^*\]
we obtain
\[|| {\cal{B}}V - V^* ||_{\infty} \leq \gamma || V - V^* ||_{\infty}\]
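Applying this bound at every iteration shows that repeated backups converge to $V^*$ geometrically:
\[|| {\cal{B}}^k V - V^* ||_{\infty} \leq \gamma^k || V - V^* ||_{\infty} \to 0 \quad \text{as } k \to \infty\]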
Define the operator $\Pi$
\[\Pi V = \arg \min_{V'\in \Omega} \frac{1}{2} \sum_s{|| V'(s) - V(s) ||^2}\]
$\Pi$ is a projection onto $\Omega$ (the set of value functions representable by the function approximator), and is a contraction:
\[|| \Pi V - \Pi \bar{V} ||^2 \leq || V - \bar{V} ||^2\]
Algorithm summary
- ${\cal{B}}$ is a contraction w.r.t. $\infty$-norm
- $\Pi$ is a contraction w.r.t. $\ell_2$-norm
- $\Pi{\cal{B}}$ is not a contraction of any kind
value iteration algorithm (using ${\cal{B}}$)
\[V \leftarrow {\cal{B}}V\]
- converges (tabular case)
fitted value iteration algorithm (using ${\cal{B}}$ and $\Pi$)
\[V \leftarrow \Pi{\cal{B}}V\]
- does not converge in general
fitted Q iteration algorithm (using ${\cal{B}}$ and $\Pi$)
Redefine ${\cal{B}}$ and $\Pi$ for $Q$, analogously to the definitions for $V$:
\[Q \leftarrow \Pi{\cal{B}} Q\]
- does not converge in general
- Deep Q-learning nonetheless succeeds in practice largely because its network has strong fitting capacity, so the projection $\Pi$ loses very little
- warning: Q-learning is not gradient descent; there is no gradient through the target value, so convergence is not guaranteed
batch actor-critic algorithm (using ${\cal{B}}$ and $\Pi$)
when fitting $V$
\[V \leftarrow \Pi{\cal{B}}V\]
- does not converge in general
Types of Value Function Approximation
Convergence
Convergence of Prediction Algorithms
On/Off-Policy | Algorithm | Table Lookup | Linear | Non-Linear |
---|---|---|---|---|
On-Policy | MC | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSMC | $\checkmark$ | $\checkmark$ | - |
 | TD(0) | $\checkmark$ | $\checkmark$ | $\times$ |
 | TD($\lambda$) | $\checkmark$ | $\checkmark$ | $\times$ |
 | Gradient TD | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSTD | $\checkmark$ | $\checkmark$ | - |
Off-Policy | MC | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSMC | $\checkmark$ | $\checkmark$ | - |
 | TD(0) | $\checkmark$ | $\times$ | $\times$ |
 | TD($\lambda$) | $\checkmark$ | $\times$ | $\times$ |
 | Gradient TD | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSTD | $\checkmark$ | $\checkmark$ | - |
Convergence of Control Algorithms
Algorithm | Table Lookup | Linear | Non-Linear |
---|---|---|---|
MC Control | $\checkmark$ | ($\checkmark$) | $\times$ |
Sarsa | $\checkmark$ | ($\checkmark$) | $\times$ |
Q-learning | $\checkmark$ | $\times$ | $\times$ |
Gradient Q-learning | $\checkmark$ | $\checkmark$ | $\times$ |
LSPI | $\checkmark$ | ($\checkmark$) | - |
- ($\checkmark$) = chatters around near-optimal value function
References and Acknowledgements
All references are listed in the preface. Once again, heartfelt thanks to the masters and pioneers of reinforcement learning!