Reinforcement Learning Thoughts (6): Monte Carlo and Temporal Difference


Notes and caveats on Monte-Carlo and temporal-difference learning.


MC and TD

Basic definitions

Monte-Carlo Learning

  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions / rewards
  • MC learns from complete episodes: no bootstrapping
  • MC uses the simplest possible idea: value = mean return
  • Caveat: can only apply MC to episodic MDPs
    • All episodes must terminate
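
To make "value = mean return" concrete, here is a minimal first-visit MC prediction sketch in Python. The episode format (a list of (state, reward) pairs, where the reward is the one received after leaving that state) and the function name are illustrative assumptions, not part of the original notes.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """First-visit Monte-Carlo prediction: value = mean return.

    `episodes` is assumed to be a list of complete episodes, each a list of
    (state, reward) pairs, with `reward` received after leaving `state`.
    """
    returns = defaultdict(list)              # state -> observed returns
    for episode in episodes:
        # Record the index of each state's first visit in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Walk backwards to accumulate the return G_t efficiently.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G                # return following time t
            if first_visit[s] == t:          # count only the first visit
                returns[s].append(G)
    # Value estimate is the empirical mean return per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```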

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess
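
A minimal tabular TD(0) prediction sketch, showing how one guess ($V(S_{t+1})$) is used to update another ($V(S_t)$). The environment interface (reset() / step(a) returning (next_state, reward, done)) and the policy callable are assumptions made for illustration.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    `env` is assumed to expose reset() -> state and step(a) ->
    (next_state, reward, done); `policy(s)` returns an action.
    Both interfaces are placeholders for illustration.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Bootstrapped target: reward plus discounted current estimate.
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return dict(V)
```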

Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments
  • Sensitivity to the initial value
    • MC is not very sensitive to initial value
    • TD is more sensitive to initial value than MC

Bias / Variance Trade-Off

  • Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$

  • True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$

  • TD target $R_{t+1} + \gamma V(S_{t+1})$ (which uses the current estimate $V$ rather than $v_\pi$) is a biased estimate of $v_\pi(S_t)$
  • TD target is much lower variance than the return:
    • Return depends on many random actions, transitions, rewards
    • TD target depends on one random action, transition, reward
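
For reference, the tabular updates built from these two targets are (standard formulations, stated here for completeness):

\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right) \qquad \text{(MC)}\]

\[V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right) \qquad \text{(TD(0))}\]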

Advantages and Disadvantages of MC vs. TD (2)

  • MC has high variance, zero bias
    • Good convergence properties
    • (even with function approximation)
    • Not very sensitive to initial value
    • Very simple to understand and use
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to $v_\pi(s)$
    • (but not always with function approximation)
    • More sensitive to initial value

Batch MC and TD

MC and TD both converge: $V(s) \to v_\pi(s)$ as experience $\to \infty$. But for a finite batch of experience, their batch solutions need not be equal:

  • MC converges to solution with minimum mean-squared error
    • Best fit to the observed returns
\[\sum_{k=1}^{K} \sum_{t=1}^{T_{k}}\left(G_{t}^{k}-V\left(s_{t}^{k}\right)\right)^{2}\]
  • TD(0) converges to solution of max likelihood Markov model
    • Solution to the MDP that best fits the data
\[\begin{aligned} \hat{\mathcal{P}}_{s, s^{\prime}}^{a} &=\frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{T_{k}} \mathbf{1}\left(s_{t}^{k}, a_{t}^{k}, s_{t+1}^{k}=s, a, s^{\prime}\right) \\ \hat{\mathcal{R}}_{s}^{a} &=\frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{T_{k}} \mathbf{1}\left(s_{t}^{k}, a_{t}^{k}=s, a\right) r_{t}^{k} \end{aligned}\]
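
As one way to read the maximum-likelihood model above, the sketch below counts transitions and averages rewards over a fixed batch of episodes. The transition format (s, a, r, s') is an assumption for illustration.

```python
from collections import defaultdict

def fit_mdp_model(episodes):
    """Maximum-likelihood MDP model from a fixed batch of experience.

    `episodes` is assumed to be a list of episodes, each a list of
    (s, a, r, s_next) transitions; this format is illustrative only.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    reward_sums = defaultdict(float)                 # (s, a) -> sum of rewards
    visits = defaultdict(int)                        # (s, a) -> N(s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            visits[(s, a)] += 1
            counts[(s, a)][s_next] += 1
            reward_sums[(s, a)] += r

    # Empirical transition probabilities and mean rewards per (s, a).
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
             for sa, nxt in counts.items()}
    R_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat
```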

Advantages and Disadvantages of MC vs. TD (3)

  • TD exploits Markov property
    • Usually more efficient in Markov environments
  • MC does not exploit Markov property
    • Usually more effective in non-Markov environments

Convergence

Greedy in the Limit with Infinite Exploration (GLIE)

A schedule of policies $\{\pi_k\}$ is GLIE if:

  • All state-action pairs are explored infinitely many times: $\lim_{k \to \infty} N_k(s, a) = \infty$
  • The policy converges to a greedy policy: $\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\left(a = \arg\max_{a^{\prime}} Q_k(s, a^{\prime})\right)$

For example, $\varepsilon$-greedy is GLIE if $\varepsilon_k$ decays to zero as $\varepsilon_k = \frac{1}{k}$.

GLIE Monte-Carlo control

Theorem

GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$
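
A sketch of one standard GLIE Monte-Carlo control scheme ($\varepsilon$-greedy with $\varepsilon_k = 1/k$ and step-size $1/N(S_t, A_t)$); the environment interface (reset() / step(a) returning (next_state, reward, done)) and the action list are assumptions for illustration.

```python
import random
from collections import defaultdict

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch: epsilon_k = 1/k, step-size 1/N(s, a)."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def eps_greedy(s, eps):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                        # GLIE schedule
        episode, s, done = [], env.reset(), False
        while not done:
            a = eps_greedy(s, eps)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Update Q towards the observed returns, one visit at a time.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```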

Sarsa

Theorem

Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions:

  • GLIE sequence of policies $\pi_t(a \mid s)$
  • Robbins-Monro sequence of step-sizes $\alpha_t$
\[\sum_{t=1}^\infty \alpha_t = \infty, \qquad \sum_{t=1}^\infty \alpha_t^2 < \infty\]
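
A tabular Sarsa sketch with a fixed $\varepsilon$ and constant step-size for simplicity; note that the theorem above additionally requires a GLIE policy schedule and Robbins-Monro step-sizes. The environment interface and action list are again assumptions for illustration.

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """On-policy Sarsa: Q(S,A) <- Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)      # next action chosen by the same policy
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```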

Q-Learning

Theorem

Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$
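
A tabular Q-learning sketch: the behaviour policy is $\varepsilon$-greedy, while the learning target bootstraps from the greedy (max) action. The environment interface and action list are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Off-policy Q-learning:
    Q(S,A) <- Q(S,A) + alpha*(R + gamma*max_a' Q(S',a') - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Target policy is greedy: bootstrap from the max over next actions.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```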

Unified View of Reinforcement Learning

Bootstrapping and Sampling

  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: update samples an expectation
    • MC samples
    • DP does not sample
    • TD samples

Relationship Between DP and TD

The TD methods above are sample-backup counterparts of the corresponding full-backup DP methods:

  • Iterative policy evaluation $V(s) \leftarrow \mathbb{E}\left[R + \gamma V(S^{\prime}) \mid s\right]$ ↔ TD learning $V(S) \;\xleftarrow{\alpha}\; R + \gamma V(S^{\prime})$
  • Q-policy iteration $Q(s, a) \leftarrow \mathbb{E}\left[R + \gamma Q(S^{\prime}, A^{\prime}) \mid s, a\right]$ ↔ Sarsa $Q(S, A) \;\xleftarrow{\alpha}\; R + \gamma Q(S^{\prime}, A^{\prime})$
  • Q-value iteration $Q(s, a) \leftarrow \mathbb{E}\left[R + \gamma \max_{a^{\prime}} Q(S^{\prime}, a^{\prime}) \mid s, a\right]$ ↔ Q-learning $Q(S, A) \;\xleftarrow{\alpha}\; R + \gamma \max_{a^{\prime}} Q(S^{\prime}, a^{\prime})$

where $x \;\xleftarrow{\alpha}\; y \equiv x \leftarrow x + \alpha(y - x)$.

References and Acknowledgements

All references are listed in the preface of this series. Once again, my sincere thanks to the pioneering researchers of reinforcement learning!