Reflections on Reinforcement Learning (6): Monte Carlo and Temporal Difference
Notes and caveats on Monte Carlo and temporal-difference learning.
Contents
- Reflections on Reinforcement Learning (1): Preface
- Reflections on Reinforcement Learning (2): Introduction to Reinforcement Learning
- Reflections on Reinforcement Learning (3): Markov Decision Processes
- Reflections on Reinforcement Learning (4): Imitation Learning and Supervised Learning
- Reflections on Reinforcement Learning (5): Dynamic Programming
- Reflections on Reinforcement Learning (6): Monte Carlo and Temporal Difference
- Reflections on Reinforcement Learning (7): Policy Gradient
- Reflections on Reinforcement Learning (8): Actor-Critic Methods
- Reflections on Reinforcement Learning (9): Value Function Methods
- Reflections on Reinforcement Learning (10): Deep Q Network
- Reflections on Reinforcement Learning (11): Advanced Policy Gradient
MC and TD
Basic definitions
Monte-Carlo Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate
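Because MC simply averages complete returns, a tabular first-visit implementation is very short. The sketch below assumes a hypothetical Gym-like interface in which `env.reset()` returns a state and `env.step(action)` returns `(next_state, reward, done)`, plus a `policy(state)` function; these names are placeholders for illustration, not any particular library's API.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes, gamma=0.99):
    """First-visit Monte-Carlo prediction: V(s) = mean of observed returns."""
    V = defaultdict(float)            # value estimates
    N = defaultdict(int)              # first-visit counts

    for _ in range(num_episodes):
        # Generate one complete episode (MC requires termination).
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Backwards pass: compute the return G_t for every timestep.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        # First-visit update: use only the first occurrence of each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                # Incremental mean: V(s) <- V(s) + (G_t - V(s)) / N(s)
                V[s] += (returns[t] - V[s]) / N[s]
    return V
```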
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
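For contrast, a minimal sketch of tabular TD(0) prediction under the same hypothetical `env`/`policy` interface: the update bootstraps from the current estimate $V(S_{t+1})$ after every step, so no complete episode is needed.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(S) <- V(S) + alpha * (R + gamma * V(S') - V(S))."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped TD target; terminal successor states are worth 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```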
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
- Sensitivity to initial values
- MC is not very sensitive to the initial value estimates
- TD is more sensitive to the initial values than MC, since it bootstraps from the current estimates
Bias / Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$, because it bootstraps from the current estimate $V$ rather than the true $v_\pi$
- The TD target has much lower variance than the return:
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward
Advantages and Disadvantages of MC vs. TD
- MC has high variance, zero bias
- Good convergence properties
- (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_\pi(s)$
- (but not always with function approximation)
- More sensitive to initial value
Batch MC and TD
Both MC and TD converge: $V(s) \to v_\pi(s)$ as experience $\to \infty$. For a finite batch of experience, however, the two methods converge to different solutions:
- MC converges to solution with minimum mean-squared error
- Best fit to the observed returns
- TD(0) converges to solution of max likelihood Markov model
- Solution to the MDP that best fits the data
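In symbols, following the standard formulation: given $K$ episodes $\{s_1^k, a_1^k, r_2^k, \dots, s_{T_k}^k\}_{k=1}^K$, batch MC minimises the mean-squared error against the observed returns, while batch TD(0) is equivalent to first fitting the maximum-likelihood MDP and then solving it:

\[\text{MC:}\quad \min_V \sum_{k=1}^{K} \sum_{t=1}^{T_k} \left(G_t^k - V(s_t^k)\right)^2\]

\[\text{TD(0):}\quad \hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\!\left(s_t^k = s,\ a_t^k = a,\ s_{t+1}^k = s'\right), \qquad \hat{\mathcal{R}}^a_{s} = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\!\left(s_t^k = s,\ a_t^k = a\right) r_{t+1}^k\]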
Advantages and Disadvantages of MC vs. TD
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
Convergence
Greedy in the Limit with Infinite Exploration (GLIE)
GLIE Monte-Carlo control
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
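A minimal sketch of the GLIE Monte-Carlo control loop under the same hypothetical `env` interface used above. Exploration decays as $\epsilon_k = 1/k$ and the step-size is $1/N(s,a)$; for brevity the sketch does every-visit updates rather than first-visit.

```python
import random
from collections import defaultdict

def glie_mc_control(env, actions, num_episodes, gamma=0.99):
    """GLIE Monte-Carlo control: eps-greedy exploration with eps_k = 1/k."""
    Q = defaultdict(float)            # action values Q[(s, a)]
    N = defaultdict(int)              # visit counts N[(s, a)]

    def eps_greedy(state, eps):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                 # GLIE: exploration vanishes, policy becomes greedy
        episode, state, done = [], env.reset(), False
        while not done:
            action = eps_greedy(state, eps)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit incremental update towards the observed return G_t.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```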
Sarsa
Theorem
Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions:
- GLIE sequence of policies $\pi_t(a \mid s)$
- Robbins-Monro sequence of step-sizes $\alpha_t$:

\[\sum_{t=1}^\infty \alpha_t = \infty, \qquad \sum_{t=1}^\infty \alpha_t^2 < \infty\]
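A minimal sketch of one on-policy Sarsa episode, assuming a tabular `Q` stored as a `defaultdict(float)` keyed by `(state, action)` and a hypothetical `select_action(Q, state)` implementing the current $\epsilon$-greedy policy. The bootstrap target uses the action $A'$ that the behaviour policy actually selects.

```python
def sarsa_episode(env, Q, select_action, alpha=0.1, gamma=0.99):
    """One episode of Sarsa: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    state, done = env.reset(), False
    action = select_action(Q, state)
    while not done:
        next_state, reward, done = env.step(action)
        # On-policy: the next action comes from the same eps-greedy policy being improved.
        next_action = None if done else select_action(Q, next_state)
        td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state, action = next_state, next_action
    return Q
```

In use, one would initialise `Q = defaultdict(float)` and call `sarsa_episode` in a loop, decaying $\epsilon$ inside `select_action` to satisfy the GLIE condition.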
Q-Learning
Theorem
Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
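For comparison, a minimal sketch of one off-policy Q-learning episode under the same assumptions (`Q` a `defaultdict(float)`, hypothetical `select_action` behaviour policy). The only change from Sarsa is the target: it bootstraps from $\max_{a'} Q(S', a')$ regardless of which action the behaviour policy takes next.

```python
def q_learning_episode(env, Q, actions, select_action, alpha=0.1, gamma=0.99):
    """One episode of Q-learning: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    state, done = env.reset(), False
    while not done:
        action = select_action(Q, state)          # behaviour policy, e.g. eps-greedy
        next_state, reward, done = env.step(action)
        # Off-policy target: greedy over the successor state's action values.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q
```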
Unified View of Reinforcement Learning
Bootstrapping and Sampling
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
- Sampling: update samples an expectation
- MC samples
- DP does not sample
- TD samples
Relationship Between DP and TD
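The relationship is the standard correspondence between DP full backups and TD sample backups, reconstructed here:

| Full Backup (DP) | Sample Backup (TD) |
| --- | --- |
| Iterative policy evaluation: $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD learning: $V(S) \xleftarrow{\alpha} R + \gamma V(S')$ |
| Q-policy iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma Q(S',A') \mid s,a]$ | Sarsa: $Q(S,A) \xleftarrow{\alpha} R + \gamma Q(S',A')$ |
| Q-value iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma \max_{a'} Q(S',a') \mid s,a]$ | Q-learning: $Q(S,A) \xleftarrow{\alpha} R + \gamma \max_{a'} Q(S',a')$ |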
where $x \xleftarrow{\alpha} y \;\equiv\; x \leftarrow x + \alpha(y - x)$
References and Acknowledgements
All references are listed in the preface of this series; once again, sincere thanks to the pioneering researchers of reinforcement learning!