Thoughts on Reinforcement Learning (2): An Introduction to Reinforcement Learning

4 minute read


Notes on an introduction to reinforcement learning.


Categories of Reinforcement Learning

Model Free / Model Based
  • Model Free: the model of the environment is not known to the agent.

  • Model Based: the model of the environment is known to the agent.

Prediction / Control
  • Prediction: evaluate the future with a given policy. (evaluation)

  • Control: Find the best policy to optimise the future. (evaluation + improvement)

Learning / Planning
  • Learning:
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning:
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
Value Based / Policy Based / Actor-Critic
  • Value Based
    • Learnt Value Function
    • Implicit policy ($\epsilon$-greedy)
  • Policy Based
    • Learnt Policy
  • Actor-Critic
    • Learnt Value Function
    • Learnt Policy
On-Policy / Off-Policy
  • On policy:
    • In on-policy learning, the agent being trained and the agent interacting with the environment are the same agent.
    • each time the policy is changed, even a little bit, we need to generate new samples
    • “Learn on the job”
      • Learn about policy $\pi$ from experience sampled from $\pi$
  • Off policy:
    • In off-policy learning, the agent being trained and the agent interacting with the environment are different agents.
    • able to improve the policy without generating new samples from that policy
    • “Look over someone’s shoulder”
      • Learn about policy $\pi$ from experience sampled from $\mu$ (a minimal importance-sampling sketch follows this list)
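
To make "learning about $\pi$ from samples drawn under $\mu$" concrete, here is a minimal importance-sampling sketch on a made-up two-armed bandit. The policies, reward values, and sample count are all invented for illustration and are not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, 0.5])            # behaviour policy that collects the data
pi = np.array([0.9, 0.1])            # target policy we actually want to evaluate
true_reward = np.array([1.0, 0.0])   # expected reward of each action

# Generate samples with the behaviour policy mu (off-policy data).
actions = rng.choice(2, size=10_000, p=mu)
rewards = true_reward[actions] + rng.normal(0.0, 0.1, size=actions.shape)

# Reweight each sample by pi(a)/mu(a) so the average estimates E_pi[r],
# even though no sample was ever drawn from pi itself.
weights = pi[actions] / mu[actions]
print(np.mean(weights * rewards))    # ~0.9, the value of pi
```
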
Exploration / Exploitation
  • Exploration finds more information about the environment
    • Epsilon Greedy
    • Boltzmann Exploration (both strategies are sketched in code after this list)
  • Exploitation exploits known information to maximise reward
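
The two exploration strategies named above fit in a few lines. A minimal sketch, assuming a tabular setting where `q_values` is simply a list of estimated action values; the function names and toy numbers are mine, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise exploit the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [1.0, 2.0, 0.5]
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```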

The Reinforcement Learning Framework

A typical reinforcement learning algorithm follows these three steps:

1. generate samples

2. fit a model / estimate the return

3. improve the policy

generate samples

  • sample one step
  • sample $N$ step
  • sample complete trajectory (see the rollout sketch below)
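
A minimal rollout sketch for the "generate samples" step. The `RandomWalkEnv` environment and its gym-style `reset`/`step` interface are assumptions made up for this example; one-step and $N$-step sampling are just truncations of the same loop:

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomWalkEnv:
    """A tiny made-up environment: walk on states 0..4, the episode ends at either edge."""
    def reset(self):
        self.s = 2
        return self.s
    def step(self, action):                       # action 0 = move left, 1 = move right
        self.s += 1 if action == 1 else -1
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s in (0, 4)
        return self.s, reward, done

def sample_trajectory(env, policy):
    """Roll out one complete trajectory of (state, action, reward) tuples under `policy`."""
    s, done, traj = env.reset(), False, []
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    return traj

random_policy = lambda s: int(rng.integers(2))
print(sample_trajectory(RandomWalkEnv(), random_policy))
```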

fit a model / estimate the return

  • Value-based: fit $V(s)$ or $Q(s,a)$
  • Policy gradients: evaluate returns $R_\tau=\sum_t r(s_t,a_t)$ (see the reward-to-go sketch below)
  • Model-based: learn $p(s_{t+1} \mid s_t,a_t)$
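
For the policy-gradient bullet, "estimating the return" can be as simple as computing the (discounted) reward-to-go of each sampled trajectory. A minimal sketch; the discount factor and toy reward sequence are arbitrary choices of mine:

```python
def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = sum_{k >= t} gamma^(k-t) * r_k for every step of one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

rewards = [0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))   # [0.81, 0.9, 1.0]
print(sum(rewards))                             # undiscounted R_tau = sum_t r(s_t, a_t)
```

Averaging such returns over visited states (or state-action pairs) gives Monte Carlo estimates of $V(s)$ or $Q(s,a)$ for the value-based bullet.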

improve the policy

  • Value-based: if we have policy $\pi$, and we know $Q^\pi(s, a)$, then we can improve $\pi$
    • set $\pi'(a \mid s)=1$ if $a=\arg\max_{a'} Q^\pi(s,a')$
    • this policy is at least as good as $\pi$, regardless of what $\pi$ is (a tabular sketch follows this list)
  • Policy gradients: compute gradient to increase probability of good actions $a$
    • modify $\pi(a \mid s)$ to increase probability of $a$ if $Q^\pi(s,a)>V^\pi(s)$
    • if $Q^\pi(s,a)>V^\pi(s)$, then $a$ is better than average
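
In the tabular case the value-based improvement step above is literally one argmax. A minimal sketch with a made-up $Q^\pi$ table (3 states, 2 actions):

```python
import numpy as np

# Made-up tabular Q^pi: rows are states, columns are actions.
Q = np.array([[1.0, 0.2],
              [0.1, 0.5],
              [0.3, 0.3]])

# Greedy improvement: pi'(a|s) = 1 for a = argmax_a' Q(s, a'), and 0 otherwise.
greedy_actions = Q.argmax(axis=1)
pi_prime = np.zeros_like(Q)
pi_prime[np.arange(Q.shape[0]), greedy_actions] = 1.0
print(pi_prime)
```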

Common Algorithms

  • Policy gradients: directly differentiate the reward objective

  • Value-based: estimate value function or Q-function of the optimal policy (no explicit policy)

  • Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy

  • Model-based RL: estimate the transition model, and then

    • Use the model to plan (no explicit policy)
      • Trajectory optimization/optimal control (primarily in continuous spaces) – essentially backpropagation to optimize over actions
      • Discrete planning in discrete action spaces – e.g., Monte Carlo tree search
    • Use it to improve a policy
      • Backpropagate gradients into the policy
      • Requires some tricks to make it work
    • Use the model to learn a value function
      • Dynamic programming (see the model-learning sketch after this list)
      • Generate simulated experience for model-free learner
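
To illustrate the "estimate the transition model, then use it" recipe, here is a minimal tabular sketch: count transitions collected by a random policy to estimate $p(s_{t+1} \mid s_t,a_t)$, then run value iteration (dynamic programming) on the learned model. The toy MDP, reward table, and all sizes are invented for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tabular MDP: 3 states, 2 actions, known rewards, unknown dynamics.
n_states, n_actions, gamma = 3, 2, 0.9
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

# 1. Estimate the transition model p(s' | s, a) by counting random-policy transitions.
counts = np.zeros((n_states, n_actions, n_states))
s = 0
for _ in range(20_000):
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=true_P[s, a])
    counts[s, a, s_next] += 1
    s = s_next
P_hat = (counts + 1e-6) / (counts + 1e-6).sum(axis=-1, keepdims=True)

# 2. Plan with the learned model: value iteration on P_hat.
V = np.zeros(n_states)
for _ in range(200):
    Q = R + gamma * P_hat @ V      # Q[s, a] = R[s, a] + gamma * sum_s' P_hat[s, a, s'] * V[s']
    V = Q.max(axis=1)

print(Q.argmax(axis=1))            # greedy policy extracted from the planned Q-values
```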

Comparison

Comparison: sample efficiency

  • Note that being sample efficient is not the same as being run-time efficient

Comparison: stability and ease of use

Comparing SL and RL Algorithms

Supervised learning

  • almost always gradient descent

Reinforcement learning

  • often not gradient descent
    • Q-learning: fixed point iteration (a tabular sketch follows this list)
    • Policy gradient: is gradient descent, but also often the least efficient!
    • Model-based RL: model is not optimized for expected reward
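
A minimal tabular sketch of what "fixed point iteration" means for Q-learning: every update moves $Q(s,a)$ toward the Bellman target $r + \gamma \max_{a'} Q(s',a')$ rather than following the gradient of an expected-reward objective. The toy MDP and all hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tabular MDP, just to show the update rule.
n_states, n_actions, gamma, alpha = 3, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(50_000):
    a = rng.integers(n_actions)              # behave randomly (Q-learning is off-policy)
    s_next = rng.choice(n_states, p=P[s, a])
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])    # move toward the fixed point of the Bellman operator
    s = s_next

print(Q)
```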

Comparing RL Algorithms

Value function fitting

  • At best, minimizes error of fit (“Bellman error”, written out below)
    • Not the same as expected reward
  • At worst, doesn’t optimize anything
    • Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
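
For reference, the squared Bellman error that value-function fitting typically minimizes can be written as follows (standard notation; this particular objective is my addition, not from the notes):

$$\mathcal{L}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q_\phi(s',a') - Q_\phi(s,a)\big)^2\Big]$$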

Policy gradient

  • The only one that actually performs gradient descent (ascent) on the true objective

Model-based RL

  • Model minimizes error of fit
    • This will converge
  • No guarantee that better model = better policy

Comparison: assumptions

  • Common assumption #1: full observability
    • Generally assumed by value function fitting methods
    • Can be mitigated by adding recurrence
  • Common assumption #2: episodic learning
    • Often assumed by pure policy gradient methods
    • Assumed by some model-based RL methods
  • Common assumption #3: continuity or smoothness
    • Assumed by some continuous value function learning methods
    • Often assumed by some model-based RL methods

References and Acknowledgements

All references are listed in the preface; once again, my sincere thanks to the masters and pioneers of reinforcement learning!