Thoughts on Reinforcement Learning (2): An Introduction to Reinforcement Learning

4 minute read


Notes on an introduction to reinforcement learning.


Categories of Reinforcement Learning

Model Free / Model Based
  • Model Free: the model of the environment is not known to the agent.

  • Model Based: the model of the environment is known to the agent.

Prediction / Control
  • Prediction: evaluate the future with a given policy. (evaluation)

  • Control: Find the best policy to optimise the future. (evaluation + improvement)

Learning / Planning
  • Learning:
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning:
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
Value Based / Policy Based / Actor-Critic
  • Value Based
    • Learnt Value Function
    • Implicit policy ($\epsilon$-greedy)
  • Policy Based
    • Learnt Policy
  • Actor-Critic
    • Learnt Value Function
    • Learnt Policy
On-Policy / Off-Policy
  • On policy:
    • In on-policy learning, the agent being trained and the agent interacting with the environment are the same agent.
    • each time the policy is changed, even a little bit, we need to generate new samples
    • “Learn on the job”
      • Learn about policy $\pi$ from experience sampled from $\pi$
  • Off policy:
    • In off-policy learning, the agent being trained and the agent interacting with the environment are different agents.
    • able to improve the policy without generating new samples from that policy
    • “Look over someone’s shoulder”
      • Learn about policy $\pi$ from experience sampled from $\mu$ (a minimal importance-sampling sketch follows this list)
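
To make "learning about $\pi$ from samples drawn under $\mu$" concrete, here is a minimal importance-sampling sketch on a made-up two-armed bandit. The policies, reward values, and sample count are all invented for illustration and are not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, 0.5])            # behaviour policy that collects the data
pi = np.array([0.9, 0.1])            # target policy we actually want to evaluate
true_reward = np.array([1.0, 0.0])   # expected reward of each action

# Generate samples with the behaviour policy mu (off-policy data).
actions = rng.choice(2, size=10_000, p=mu)
rewards = true_reward[actions] + rng.normal(0.0, 0.1, size=actions.shape)

# Reweight each sample by pi(a)/mu(a) so the average estimates E_pi[r],
# even though no sample was ever drawn from pi itself.
weights = pi[actions] / mu[actions]
print(np.mean(weights * rewards))    # ~0.9, the value of pi
```
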
Exploration / Exploitation
  • Exploration finds more information about the environment
    • Epsilon Greedy
    • Boltzmann Exploration (both strategies are sketched in code after this list)
  • Exploitation exploits known information to maximise reward
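
The two exploration strategies named above fit in a few lines. A minimal sketch, assuming a tabular setting where `q_values` is simply a list of estimated action values; the function names and toy numbers are mine, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise exploit the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [1.0, 2.0, 0.5]
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```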

The Reinforcement Learning Framework

A typical reinforcement learning algorithm follows these three steps:

1. generate samples

2. fit a model / estimate the return

3. improve the policy

generate samples

  • sample one step
  • sample $N$ step
  • sample complete trajectory (see the rollout sketch below)
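
A minimal rollout sketch for the "generate samples" step. The `RandomWalkEnv` environment and its gym-style `reset`/`step` interface are assumptions made up for this example; one-step and $N$-step sampling are just truncations of the same loop:

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomWalkEnv:
    """A tiny made-up environment: walk on states 0..4, the episode ends at either edge."""
    def reset(self):
        self.s = 2
        return self.s
    def step(self, action):                       # action 0 = move left, 1 = move right
        self.s += 1 if action == 1 else -1
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s in (0, 4)
        return self.s, reward, done

def sample_trajectory(env, policy):
    """Roll out one complete trajectory of (state, action, reward) tuples under `policy`."""
    s, done, traj = env.reset(), False, []
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    return traj

random_policy = lambda s: int(rng.integers(2))
print(sample_trajectory(RandomWalkEnv(), random_policy))
```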

fit a model / estimate the return

  • Value-based: fit $V(s)$ or $Q(s,a)$
  • Policy gradients: evaluate returns $R_\tau=\sum_t r(s_t,a_t)$ (see the reward-to-go sketch below)
  • Model-based: learn $p(s_{t+1} \mid s_t,a_t)$
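
For the policy-gradient bullet, "estimating the return" can be as simple as computing the (discounted) reward-to-go of each sampled trajectory. A minimal sketch; the discount factor and toy reward sequence are arbitrary choices of mine:

```python
def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = sum_{k >= t} gamma^(k-t) * r_k for every step of one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

rewards = [0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))   # [0.81, 0.9, 1.0]
print(sum(rewards))                             # undiscounted R_tau = sum_t r(s_t, a_t)
```

Averaging such returns over visited states (or state-action pairs) gives Monte Carlo estimates of $V(s)$ or $Q(s,a)$ for the value-based bullet.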

improve the policy

  • Value-based: if we have policy $\pi$, and we know $Q^\pi(s, a)$, then we can improve $\pi$
    • set $\pi'(a \mid s)=1$ if $a=\arg\max_{a'} Q^\pi(s,a')$
    • this policy is at least as good as $\pi$, regardless of what $\pi$ is (a tabular sketch follows this list)
  • Policy gradients: compute gradient to increase probability of good actions $a$
    • modify $\pi(a \mid s)$ to increase probability of $a$ if $Q^\pi(s,a)>V^\pi(s)$
    • if $Q^\pi(s,a)>V^\pi(s)$, then $a$ is better than average
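
In the tabular case the value-based improvement step above is literally one argmax. A minimal sketch with a made-up $Q^\pi$ table (3 states, 2 actions):

```python
import numpy as np

# Made-up tabular Q^pi: rows are states, columns are actions.
Q = np.array([[1.0, 0.2],
              [0.1, 0.5],
              [0.3, 0.3]])

# Greedy improvement: pi'(a|s) = 1 for a = argmax_a' Q(s, a'), and 0 otherwise.
greedy_actions = Q.argmax(axis=1)
pi_prime = np.zeros_like(Q)
pi_prime[np.arange(Q.shape[0]), greedy_actions] = 1.0
print(pi_prime)
```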

Common Algorithms

  • Policy gradients: directly differentiate the reward objective

  • Value-based: estimate value function or Q-function of the optimal policy (no explicit policy)

  • Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy

  • Model-based RL: estimate the transition model, and then

    • Use the model to plan (no explicit policy)
      • Trajectory optimization/optimal control (primarily in continuous spaces) – essentially backpropagation to optimize over actions
      • Discrete planning in discrete action spaces – e.g., Monte Carlo tree search
    • Use it to improve a policy
      • Backpropagate gradients into the policy
      • Requires some tricks to make it work
    • Use the model to learn a value function
      • Dynamic programming (see the model-learning sketch after this list)
      • Generate simulated experience for model-free learner
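
To illustrate the "estimate the transition model, then use it" recipe, here is a minimal tabular sketch: count transitions collected by a random policy to estimate $p(s_{t+1} \mid s_t,a_t)$, then run value iteration (dynamic programming) on the learned model. The toy MDP, reward table, and all sizes are invented for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tabular MDP: 3 states, 2 actions, known rewards, unknown dynamics.
n_states, n_actions, gamma = 3, 2, 0.9
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

# 1. Estimate the transition model p(s' | s, a) by counting random-policy transitions.
counts = np.zeros((n_states, n_actions, n_states))
s = 0
for _ in range(20_000):
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=true_P[s, a])
    counts[s, a, s_next] += 1
    s = s_next
P_hat = (counts + 1e-6) / (counts + 1e-6).sum(axis=-1, keepdims=True)

# 2. Plan with the learned model: value iteration on P_hat.
V = np.zeros(n_states)
for _ in range(200):
    Q = R + gamma * P_hat @ V      # Q[s, a] = R[s, a] + gamma * sum_s' P_hat[s, a, s'] * V[s']
    V = Q.max(axis=1)

print(Q.argmax(axis=1))            # greedy policy extracted from the planned Q-values
```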

Comparison

Comparison: sample efficiency

  • Note that being sample efficient is not the same as being run-time efficient

Comparison: stability and ease of use

Comparing SL and RL Algorithms

Supervised learning

  • almost always gradient descent

Reinforcement learning

  • often not gradient descent
    • Q-learning: fixed point iteration (a tabular sketch follows this list)
    • Policy gradient: is gradient descent, but also often the least efficient!
    • Model-based RL: model is not optimized for expected reward
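
A minimal tabular sketch of what "fixed point iteration" means for Q-learning: every update moves $Q(s,a)$ toward the Bellman target $r + \gamma \max_{a'} Q(s',a')$ rather than following the gradient of an expected-reward objective. The toy MDP and all hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tabular MDP, just to show the update rule.
n_states, n_actions, gamma, alpha = 3, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(50_000):
    a = rng.integers(n_actions)              # behave randomly (Q-learning is off-policy)
    s_next = rng.choice(n_states, p=P[s, a])
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])    # move toward the fixed point of the Bellman operator
    s = s_next

print(Q)
```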

Comparing RL Algorithms

Value function fitting

  • At best, minimizes error of fit (“Bellman error”, written out below)
    • Not the same as expected reward
  • At worst, doesn’t optimize anything
    • Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
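
For reference, the squared Bellman error that value-function fitting typically minimizes can be written as follows (standard notation; this particular objective is my addition, not from the notes):

$$\mathcal{L}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q_\phi(s',a') - Q_\phi(s,a)\big)^2\Big]$$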

Policy gradient

  • The only one that actually performs gradient descent (ascent) on the true objective

Model-based RL

  • Model minimizes error of fit
    • This will converge
  • No guarantee that better model = better policy

Comparison: assumptions

  • Common assumption #1: full observability
    • Generally assumed by value function fitting methods
    • Can be mitigated by adding recurrence
  • Common assumption #2: episodic learning
    • Often assumed by pure policy gradient methods
    • Assumed by some model-based RL methods
  • Common assumption #3: continuity or smoothness
    • Assumed by some continuous value function learning methods
    • Often assumed by some model-based RL methods

References and Acknowledgements

All references are listed in the preface; once again, my sincere thanks to the masters and pioneers of reinforcement learning!