Reinforcement Learning Thoughts (9): Value Function Methods
Notes and caveats on value function methods.
Contents
- Reinforcement Learning Thoughts (1): Preface
- Reinforcement Learning Thoughts (2): Introduction to Reinforcement Learning
- Reinforcement Learning Thoughts (3): Markov Decision Processes
- Reinforcement Learning Thoughts (4): Imitation Learning and Supervised Learning
- Reinforcement Learning Thoughts (5): Dynamic Programming
- Reinforcement Learning Thoughts (6): Monte Carlo and Temporal Difference
- Reinforcement Learning Thoughts (7): Policy Gradient
- Reinforcement Learning Thoughts (8): Actor-Critic Methods
- Reinforcement Learning Thoughts (9): Value Function Methods
- Reinforcement Learning Thoughts (10): Deep Q Network
- Reinforcement Learning Thoughts (11): Advanced Policy Gradient
State value function & State-action value function
In value function methods we usually work with the state-action value function rather than the state value function; the comparisons below show why.
policy evaluation
State-action value function
\[Q^\pi(s,a) \leftarrow r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[Q^\pi(s',\pi(s'))]\]
State value function
\[V^\pi(s) \leftarrow r(s,\pi(s)) + \gamma E_{s' \sim p(s' | s,\pi(s))}[V^\pi(s')]\]
- The state-action value function can be fitted using samples (see the sketch below)
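To make the "fitted using samples" point concrete, here is a minimal sketch of how the $Q^\pi$ backup becomes a regression target using only sampled transitions; the function names (`q_func`, `pi`) and the batch layout are illustrative assumptions, not a fixed interface:

```python
import numpy as np

def q_pi_targets(q_func, pi, rewards, next_states, gamma=0.99):
    """Sample-based policy-evaluation targets y_i = r_i + gamma * Q^pi(s'_i, pi(s'_i)).

    q_func(s, a) returns the current estimate of Q^pi; pi(s) returns the policy's action.
    The transition model p(s'|s,a) is never needed -- only sampled transitions.
    """
    next_actions = [pi(s) for s in next_states]
    next_q = np.array([q_func(s, a) for s, a in zip(next_states, next_actions)])
    return np.asarray(rewards) + gamma * next_q

# A function approximator Q_phi is then fit by regression so that
# Q_phi(s_i, a_i) matches y_i, and the procedure repeats.
```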
value iteration
State-action value function
\[Q_{\phi}(s,a) \leftarrow r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[\max_{a'} Q_{\phi}(s',a')]\]
State value function
\[V_\phi(s) \leftarrow \max_a\left(r(s,a) + \gamma E_{s' \sim p(s' | s,a)}[V_\phi(s')]\right)\]
- The state value function needs to know the outcomes of different actions (see the tabular sketch below)
- warning:
- $V^\pi$: value function for policy $\pi$, this is what the critic does
- $V^{\star}$: value function for optimal policy $\pi^{\star}$, this is what value iteration does
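As a sanity check of the two backups above, here is a minimal tabular sketch on a small randomly generated MDP; the arrays `P` and `R` are illustrative assumptions. With a known model both backups reach the same optimal values; the practical advantage of the $Q$ form is that its max is taken over the stored table, so it still works when the model is replaced by samples (which is what fitted Q-iteration exploits below):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical known model: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

V = np.zeros(n_states)
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # V backup: max over actions of r(s,a) + gamma * E_{s'}[V(s')] -- needs the model.
    V = np.max(R + gamma * P @ V, axis=1)
    # Q backup: r(s,a) + gamma * E_{s'}[max_a' Q(s',a')] -- the max is over the stored table.
    Q = R + gamma * P @ np.max(Q, axis=1)

# Both backups converge to the same optimal state values.
assert np.allclose(V, np.max(Q, axis=1), atol=1e-4)
```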
greedy policy improvement
State-action value function
\[\pi(s) \leftarrow \arg \max _{a \in \mathcal{A}} Q(s,a)\]
State value function
\[\pi(s) \leftarrow \arg \max _{a \in \mathcal{A}}\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v\left(s^{\prime}\right)\right)\]
- The state value function needs to know the model of the MDP (the transition probabilities); see the short sketch below
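Continuing the tabular sketch above (reusing `P`, `R`, `Q`, `V`, `gamma`), greedy policy improvement from $Q$ is a single argmax per state, while the same step from $V$ needs the reward and transition model:

```python
import numpy as np

# Greedy policy from the state-action value function: an argmax over the table, no model needed.
pi_from_q = np.argmax(Q, axis=1)

# Greedy policy from the state value function: requires r(s,a) and p(s'|s,a).
pi_from_v = np.argmax(R + gamma * P @ V, axis=1)

# Once the backups have converged, both recover the same greedy (optimal) policy.
```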
Q iteration algorithm
full fitted Q-iteration algorithm
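Reading the algorithm as code may help. Below is a minimal sketch of full fitted Q-iteration under some stated assumptions: transitions have already been collected into a fixed dataset, actions are discrete, and a generic scikit-learn-style regressor (here `RandomForestRegressor`, purely illustrative) plays the role of $Q_\phi$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # any regressor with fit/predict would do

def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.99):
    """Full fitted Q-iteration on a fixed dataset of (s, a, r, s') transitions.

    Each state is a 1-D feature vector; actions are integers in [0, n_actions).
    """
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])

    def featurize(s_batch, a):
        # Represent Q(s, a) as a regressor over [state features, one-hot action].
        one_hot = np.tile(np.eye(n_actions)[a], (len(s_batch), 1))
        return np.concatenate([s_batch, one_hot], axis=1)

    X = np.concatenate([states, np.eye(n_actions)[actions]], axis=1)
    q = None
    for _ in range(n_iters):
        if q is None:
            max_next_q = np.zeros(len(transitions))        # Q is initialized to zero
        else:
            per_action = [q.predict(featurize(next_states, a)) for a in range(n_actions)]
            max_next_q = np.max(np.stack(per_action, axis=1), axis=1)
        y = rewards + gamma * max_next_q                    # y_i = r_i + gamma * max_a' Q(s'_i, a')
        q = RandomForestRegressor(n_estimators=50).fit(X, y)  # full regression onto the targets
    return q
```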
online Q-iteration algorithm
- works even for off-policy samples (unlike actor-critic)
- only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation
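By contrast, the online variant takes one transition at a time and applies a single update toward the bootstrapped target. A minimal tabular sketch (the target is treated as a constant, which is also why this is not true gradient descent, as the warning further down notes):

```python
import numpy as np

def online_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One online Q-iteration (Q-learning) step on a tabular Q array of shape [n_states, n_actions]."""
    target = r + gamma * np.max(Q[s_next])   # bootstrapped target; no gradient flows through it
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate a step toward the target
    return Q
```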
Exploration
Do not act with a purely deterministic policy during training; otherwise some actions are never tried.
1. Epsilon Greedy (the value of $\epsilon$ can be gradually decreased over the course of training)
\[P(a|s) = \begin{cases} \epsilon / m + 1 - \epsilon, & \text{if } a = \arg \max_{a' \in \mathcal{A}} Q(s,a') \\ \epsilon / m, & \text{otherwise} \end{cases}\]
where $m = |\mathcal{A}|$ is the number of actions.
2. Boltzmann Exploration
\[P(a | s)=\frac{\exp (Q(s, a))}{\sum_{a'} \exp (Q(s, a'))}\]
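A minimal sketch of the two exploration rules above, for one row of a tabular Q (i.e. the values $Q(s,\cdot)$ for a fixed state):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """P(a|s) = eps/m for every action, plus an extra (1 - eps) on the greedy action."""
    m = len(q_row)
    probs = np.full(m, epsilon / m)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return rng.choice(m, p=probs)

def boltzmann(q_row, rng):
    """P(a|s) proportional to exp(Q(s,a)); subtracting the max is for numerical stability only."""
    z = np.exp(q_row - np.max(q_row))
    return rng.choice(len(q_row), p=z / z.sum())

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```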
Value function learning theory
contraction operator
Define the backup operator ${\cal{B}}$, where $r_a$ is the vector of rewards for action $a$ and ${\cal{T}}_a$ is the corresponding transition matrix:
\[{\cal{B}}V = \max_a r_a + \gamma {\cal{T}}_a V\]
1. $V^*$ is a fixed point of the operator ${\cal{B}}$:
\[V^* = {\cal{B}}V^* = \max_a r_a + \gamma {\cal{T}}_a V^*\]
2. The operator ${\cal{B}}$ is a contraction: for any $V$ and $\bar{V}$, we have
\[|| {\cal{B}}V - {\cal{B}}\bar{V} ||_{\infty} \leq \gamma || V - \bar{V} ||_{\infty}\]
For more detail on the proof, see Reinforcement Learning Thoughts (5): Dynamic Programming.
3. We can prove that value iteration reaches $V^\star$ because the operator ${\cal{B}}$ is a contraction. Choosing $V^\star$ as $\bar{V}$ and noting that
\[V^* = {\cal{B}}V^*\]
we obtain
\[|| {\cal{B}}V - V^* ||_{\infty} \leq \gamma || V - V^* ||_{\infty}\]
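Applying this bound at every iteration shows that repeated backups converge to $V^*$ geometrically:
\[|| {\cal{B}}^k V - V^* ||_{\infty} \leq \gamma^k || V - V^* ||_{\infty} \to 0 \quad \text{as } k \to \infty\]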
Define the operator $\Pi$
\[\Pi V = \arg \min_{V'\in \Omega} \frac{1}{2} \sum_s{|| V'(s) - V(s) ||^2}\]
$\Pi$ is a projection onto $\Omega$ (the set of value functions representable by the function approximator), and is a contraction:
\[|| \Pi V - \Pi \bar{V} ||^2 \leq || V - \bar{V} ||^2\]
Algorithm summary
- ${\cal{B}}$ is a contraction w.r.t. $\infty$-norm
- $\Pi$ is a contraction w.r.t. $\ell_2$-norm
- $\Pi{\cal{B}}$ is not a contraction of any kind
value iteration algorithm (using ${\cal{B}}$)
\[V \leftarrow {\cal{B}}V\]
- converges (tabular case)
fitted value iteration algorithm (using ${\cal{B}}$ and $\Pi$)
\[V \leftarrow \Pi{\cal{B}}V\]
- does not converge in general
fitted Q iteration algorithm (using ${\cal{B}}$ and $\Pi$)
Redefine ${\cal{B}}$ and $\Pi$ for $Q$, analogously to the definitions for $V$:
\[Q \leftarrow \Pi{\cal{B}} Q\]
- does not converge in general
- Deep Q-learning nonetheless succeeds in practice largely because its network has strong fitting capacity, so the projection $\Pi$ loses very little
- warning: Q-learning is not gradient descent; there is no gradient through the target value, so convergence is not guaranteed
batch actor-critic algorithm (using ${\cal{B}}$ and $\Pi$)
when fitting $V$
\[V \leftarrow \Pi{\cal{B}}V\]
- does not converge in general
Types of Value Function Approximation
Convergence
Convergence of Prediction Algorithms
On/Off-Policy | Algorithm | Table Lookup | Linear | Non-Linear |
---|---|---|---|---|
On-Policy | MC | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSMC | $\checkmark$ | $\checkmark$ | - |
 | TD(0) | $\checkmark$ | $\checkmark$ | $\times$ |
 | TD($\lambda$) | $\checkmark$ | $\checkmark$ | $\times$ |
 | Gradient TD | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSTD | $\checkmark$ | $\checkmark$ | - |
Off-Policy | MC | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSMC | $\checkmark$ | $\checkmark$ | - |
 | TD(0) | $\checkmark$ | $\times$ | $\times$ |
 | TD($\lambda$) | $\checkmark$ | $\times$ | $\times$ |
 | Gradient TD | $\checkmark$ | $\checkmark$ | $\checkmark$ |
 | LSTD | $\checkmark$ | $\checkmark$ | - |
Convergence of Control Algorithms
Algorithm | Table Lookup | Linear | Non-Linear |
---|---|---|---|
MC Control | $\checkmark$ | ($\checkmark$) | $\times$ |
Sarsa | $\checkmark$ | ($\checkmark$) | $\times$ |
Q-learning | $\checkmark$ | $\times$ | $\times$ |
Gradient Q-learning | $\checkmark$ | $\checkmark$ | $\times$ |
LSPI | $\checkmark$ | ($\checkmark$) | - |
- ($\checkmark$) = chatters around near-optimal value function
References and Acknowledgements
All references are listed in the preface. Once again, heartfelt thanks to the masters and pioneers of reinforcement learning!