强化学习思考（4）模仿学习和监督学习

2 minute read

Published: April 13, 2020

关于模仿学习和监督学习的注意事项。

imitation learning

在某些情况下:

agent 和环境进行互动中得到的 reward 十分稀疏。
在某些任务中很难定义 reward。
人为设计的 reward 可能会得到不受控制的 policy。

因此需要 imitation learning：让一个专家来示范应该如何解决问题，而 agent 则试着去模仿专家。

Behavior Cloning

这个方法可以看做是一个监督学习，agent 需要学习在某些特定的 state 下尽可能采取像专家一样的 action。然而专家只能进行有限的采样，因此需要引入 Dataset Aggregation。

DAgger: Dataset Aggregation

我们的目标是使得 $P_{data}(o_t) = P_{\pi_\theta}(o_t)$，因此可以从 $P_{\pi_\theta}(o_t)$ 中收集训练数据更新 $P_{data}(o_t)$：

step 1、train $\pi_\theta(a_t \mid o_t)$ from human data ${\cal{D}} = {o_1, a_1,…,o_N, a_N}$
step 2、run $\pi_\theta(a_t \mid o_t)$ to get dataset ${\cal{D}}_\pi = { o_1, …, o_M }$
step 3、Ask human to label $\cal{D}_\pi$ with actions $a_t$
step 4、Aggregate: $\cal{D} \leftarrow \cal{D} \cup \cal{D}_\pi$

优点

DAgger addresses the problem of distributional “drift”

缺点

agent 会复制专家所有的动作，甚至包括一些错误的行为。
在监督学习中，我们希望训练数据和测试数据有相同的分布，但是在行为克隆中，训练数据来自于专家的分布，而测试数据来自于 agent，因为专家的 $\pi$ 和 agent 的是不一样的，所以生成的 state 也是不一样的，分布可能会不相同（可以采用 Inverse Reinforcement Learning 解决）。

How bad is the policy?

定义损失函数如下：

\[c(\mathbf{s}, \mathbf{a})=\left\{\begin{array}{l} 0 \text { if } \mathbf{a}=\pi^{\star}(\mathbf{s}) \\ 1 \text { otherwise } \end{array}\right.\]

情况 1

不使用 DAgger，且只对来自于训练集中的 $s$ 做假设。

Assume:

\[\pi_{\theta}\left(\mathbf{a} \neq \pi^{\star}(\mathbf{s}) | \mathbf{s}\right) \leq \epsilon \text { for all } \mathbf{s} \in \mathcal{D}_{\text {train }}\]

Result:

\[\begin{aligned} &E\left[\sum_{t} c\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \leq \underbrace{\epsilon T+(1-\epsilon)(\epsilon(T-1)+(1-\epsilon)(\ldots))}_{T \text { terms }*\text{ each } O(\epsilon T) \;= \; O(\epsilon T^2)} \end{aligned}\]

情况 2

使用 DAgger，且对来自于训练集分布中的 $s$ 做假设。

Assume:

\[\pi_{\theta}\left(\mathbf{a} \neq \pi^{\star}(\mathbf{s}) | \mathbf{s}\right) \leq \epsilon\text { for all } \mathbf{s} \sim p_{\text {train }}(\mathbf{s})\]

With DAgger:

\[p_{\text {train }}(\mathbf{s}) \to p_{\theta}(\mathbf{s})\]

Result:

\[\begin{aligned} E\left[\sum_{t} c\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \leq \epsilon T \end{aligned}\]

情况 3

不使用 DAgger，对来自于训练集分布中的 $s$ 做假设。

Assume:

\[\pi_{\theta}\left(\mathbf{a} \neq \pi^{\star}(\mathbf{s}) | \mathbf{s}\right) \leq \epsilon\text { for all } \mathbf{s} \sim p_{\text {train }}(\mathbf{s})\]

If:

\[p_{\text {train }}(\mathbf{s}) \neq p_{\theta}(\mathbf{s})\]

Result:

step 1：

\[p_{\theta}\left(\mathbf{s}_{t}\right)=\underbrace{(1-\epsilon)^{t}}_{probability we made no mistakes} p_{\text {train }}\left(\mathbf{s}_{t}\right)+\left(1-(1-\epsilon)^{t}\right) \underbrace{p_{\text {mistake }}\left(\mathbf{s}_{t}\right)}_{some other distribution}\]

step 2：

\[\begin{aligned} &\left|p_{\theta}\left(\mathbf{s}_{t}\right)-p_{\text {train }}\left(\mathbf{s}_{t}\right)\right| \\=&\left(1-(1-\epsilon)^{t}\right)\left|p_{\text {mistake }}\left(\mathbf{s}_{t}\right)-p_{\text {train }}\left(\mathbf{s}_{t}\right)\right| \\ \leq & 2\left(1-(1-\epsilon)^{t}\right) \\ \leq & 2 \epsilon t \end{aligned}\]

其中使用定理：$(1-\epsilon)^{t} \geq 1-\epsilon t$ for $\epsilon \in[0,1]$

step 3：

\[\begin{aligned} &\sum_{t} E_{p_{\theta}\left(\mathbf{s}_{t}\right)}\left[c_{t}\right] \\=&\sum_{t} \sum_{\mathbf{s}_{t}} p_{\theta}\left(\mathbf{s}_{t}\right) c_{t}\left(\mathbf{s}_{t}\right) \\ \leq & \sum_{t} \sum_{\mathbf{s}_{t}} p_{\text {train }}\left(\mathbf{s}_{t}\right) c_{t}\left(\mathbf{s}_{t}\right)+\left|p_{\theta}\left(\mathbf{s}_{t}\right)-p_{\text {train }}\left(\mathbf{s}_{t}\right)\right| c_{\max } \\ \leq & \underbrace{\sum_{t} \epsilon+2 \epsilon t}_{O(\epsilon T^2)} \end{aligned}\]

For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”

Why might we fail to fit the expert?

1、Non-Markovian behavior

use the whole history: RNN

\[\pi_\theta(a_t|o_t) \to \pi_\theta(a_t|o_1,...,o_t)\]

2、Multimodal behavior

Output mixture of Gaussians
Latent variable models
Autoregressive discretization

Goal-conditioned behavioral cloning

将 policy function 修改为目标相关的，这样便可以学习不同目标下的 policy function

\[\pi_\theta(a|s) \to \pi_\theta(a|s,g)\]

where $g$ is the goal state.

Inverse Reinforcement Learning

在 IRL 中，没有固定的 reward function，而是用一个专家来和环境做互动并学到一个 reward function，然后这个 reward function 才会被用来训练 actor，具体算法如下：

step 1、专家和 actor 都会生成对应的 trajectory
step 2、生成的 reward function 需要满足专家的累积 reward 总是比 actor 的大（假设：专家永远是最棒的）
step 3、使用 reward function 来训练一个新的 actor 替换原来旧的 actor
step 4、重复上述步骤

这里的模型和 GAN 十分相似，actor 就是 generator，reward function 就是 discriminator。

参考资料及致谢

所有参考资料在前言中已列出，再次向强化学习的大佬前辈们表达衷心的感谢！

Share on

Twitter Facebook LinkedIn

目录

imitation learning

Behavior Cloning

DAgger: Dataset Aggregation

How bad is the policy?

情况 1

情况 2

情况 3

Why might we fail to fit the expert?

Goal-conditioned behavioral cloning

Inverse Reinforcement Learning

参考资料及致谢

Share on