Reflections on Reinforcement Learning (6): Monte Carlo and Temporal Difference
Notes and caveats on Monte Carlo and temporal-difference learning.
Contents
- Reflections on Reinforcement Learning (1): Preface
- Reflections on Reinforcement Learning (2): Introduction to Reinforcement Learning
- Reflections on Reinforcement Learning (3): Markov Decision Processes
- Reflections on Reinforcement Learning (4): Imitation Learning and Supervised Learning
- Reflections on Reinforcement Learning (5): Dynamic Programming
- Reflections on Reinforcement Learning (6): Monte Carlo and Temporal Difference
- Reflections on Reinforcement Learning (7): Policy Gradient
- Reflections on Reinforcement Learning (8): Actor-Critic Methods
- Reflections on Reinforcement Learning (9): Value Function Methods
- Reflections on Reinforcement Learning (10): Deep Q Network
- Reflections on Reinforcement Learning (11): Advanced Policy Gradient
MC and TD
Basic definitions
Monte-Carlo Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate
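Because MC simply averages complete returns, a tabular first-visit implementation is very short. The sketch below assumes a hypothetical Gym-like interface in which `env.reset()` returns a state and `env.step(action)` returns `(next_state, reward, done)`, plus a `policy(state)` function; these names are placeholders for illustration, not any particular library's API.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes, gamma=0.99):
    """First-visit Monte-Carlo prediction: V(s) = mean of observed returns."""
    V = defaultdict(float)            # value estimates
    N = defaultdict(int)              # first-visit counts

    for _ in range(num_episodes):
        # Generate one complete episode (MC requires termination).
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Backwards pass: compute the return G_t for every timestep.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        # First-visit update: use only the first occurrence of each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                # Incremental mean: V(s) <- V(s) + (G_t - V(s)) / N(s)
                V[s] += (returns[t] - V[s]) / N[s]
    return V
```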
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
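For contrast, a minimal sketch of tabular TD(0) prediction under the same hypothetical `env`/`policy` interface: the update bootstraps from the current estimate $V(S_{t+1})$ after every step, so no complete episode is needed.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(S) <- V(S) + alpha * (R + gamma * V(S') - V(S))."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped TD target; terminal successor states are worth 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```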
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
- Sensitivity to initial values
- MC is not very sensitive to the initial value estimates
- TD is more sensitive to the initial values than MC, since it bootstraps from the current estimates
Bias / Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$, because it bootstraps from the current estimate $V$ rather than the true $v_\pi$
- The TD target has much lower variance than the return:
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward
Advantages and Disadvantages of MC vs. TD
- MC has high variance, zero bias
- Good convergence properties
- (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_\pi(s)$
- (but not always with function approximation)
- More sensitive to initial value
Batch MC and TD
Both MC and TD converge: $V(s) \to v_\pi(s)$ as experience $\to \infty$. For a finite batch of experience, however, the two methods converge to different solutions:
- MC converges to solution with minimum mean-squared error
- Best fit to the observed returns
- TD(0) converges to solution of max likelihood Markov model
- Solution to the MDP that best fits the data
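In symbols, following the standard formulation: given $K$ episodes $\{s_1^k, a_1^k, r_2^k, \dots, s_{T_k}^k\}_{k=1}^K$, batch MC minimises the mean-squared error against the observed returns, while batch TD(0) is equivalent to first fitting the maximum-likelihood MDP and then solving it:

\[\text{MC:}\quad \min_V \sum_{k=1}^{K} \sum_{t=1}^{T_k} \left(G_t^k - V(s_t^k)\right)^2\]

\[\text{TD(0):}\quad \hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\!\left(s_t^k = s,\ a_t^k = a,\ s_{t+1}^k = s'\right), \qquad \hat{\mathcal{R}}^a_{s} = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\!\left(s_t^k = s,\ a_t^k = a\right) r_{t+1}^k\]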
Advantages and Disadvantages of MC vs. TD
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
Convergence
Greedy in the Limit with Infinite Exploration (GLIE)
GLIE Monte-Carlo control
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
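A minimal sketch of the GLIE Monte-Carlo control loop under the same hypothetical `env` interface used above. Exploration decays as $\epsilon_k = 1/k$ and the step-size is $1/N(s,a)$; for brevity the sketch does every-visit updates rather than first-visit.

```python
import random
from collections import defaultdict

def glie_mc_control(env, actions, num_episodes, gamma=0.99):
    """GLIE Monte-Carlo control: eps-greedy exploration with eps_k = 1/k."""
    Q = defaultdict(float)            # action values Q[(s, a)]
    N = defaultdict(int)              # visit counts N[(s, a)]

    def eps_greedy(state, eps):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                 # GLIE: exploration vanishes, policy becomes greedy
        episode, state, done = [], env.reset(), False
        while not done:
            action = eps_greedy(state, eps)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit incremental update towards the observed return G_t.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```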
Sarsa
Theorem
Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions:
- GLIE sequence of policies $\pi_t(a \mid s)$
- Robbins-Monro sequence of step-sizes $\alpha_t$:

\[\sum_{t=1}^\infty \alpha_t = \infty, \qquad \sum_{t=1}^\infty \alpha_t^2 < \infty\]
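A minimal sketch of one on-policy Sarsa episode, assuming a tabular `Q` stored as a `defaultdict(float)` keyed by `(state, action)` and a hypothetical `select_action(Q, state)` implementing the current $\epsilon$-greedy policy. The bootstrap target uses the action $A'$ that the behaviour policy actually selects.

```python
def sarsa_episode(env, Q, select_action, alpha=0.1, gamma=0.99):
    """One episode of Sarsa: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    state, done = env.reset(), False
    action = select_action(Q, state)
    while not done:
        next_state, reward, done = env.step(action)
        # On-policy: the next action comes from the same eps-greedy policy being improved.
        next_action = None if done else select_action(Q, next_state)
        td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state, action = next_state, next_action
    return Q
```

In use, one would initialise `Q = defaultdict(float)` and call `sarsa_episode` in a loop, decaying $\epsilon$ inside `select_action` to satisfy the GLIE condition.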
Q-Learning
Theorem
Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
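For comparison, a minimal sketch of one off-policy Q-learning episode under the same assumptions (`Q` a `defaultdict(float)`, hypothetical `select_action` behaviour policy). The only change from Sarsa is the target: it bootstraps from $\max_{a'} Q(S', a')$ regardless of which action the behaviour policy takes next.

```python
def q_learning_episode(env, Q, actions, select_action, alpha=0.1, gamma=0.99):
    """One episode of Q-learning: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    state, done = env.reset(), False
    while not done:
        action = select_action(Q, state)          # behaviour policy, e.g. eps-greedy
        next_state, reward, done = env.step(action)
        # Off-policy target: greedy over the successor state's action values.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q
```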
Unified View of Reinforcement Learning
Bootstrapping and Sampling
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
- Sampling: update samples an expectation
- MC samples
- DP does not sample
- TD samples
Relationship Between DP and TD
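The relationship is the standard correspondence between DP full backups and TD sample backups, reconstructed here:

| Full Backup (DP) | Sample Backup (TD) |
| --- | --- |
| Iterative policy evaluation: $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD learning: $V(S) \xleftarrow{\alpha} R + \gamma V(S')$ |
| Q-policy iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma Q(S',A') \mid s,a]$ | Sarsa: $Q(S,A) \xleftarrow{\alpha} R + \gamma Q(S',A')$ |
| Q-value iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma \max_{a'} Q(S',a') \mid s,a]$ | Q-learning: $Q(S,A) \xleftarrow{\alpha} R + \gamma \max_{a'} Q(S',a')$ |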
where $x \xleftarrow{\alpha} y \;\equiv\; x \leftarrow x + \alpha(y - x)$
References and Acknowledgements
All references are listed in the preface of this series; once again, sincere thanks to the pioneering researchers of reinforcement learning!