Thoughts on Reinforcement Learning (8): Actor-Critic Methods

Notes on Actor-Critic methods.



1. Sample $\tau_{i}$ from $\pi_{\theta}$ (run the policy).

2. Compute the gradient:

\[\nabla_{\theta} J_{PG}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) \left(\sum_{t'=t}^{T} r\left({s}_{i,t'}, {a}_{i,t'}\right) - b\right) \right]\]

3. Update the parameters: $\theta \leftarrow \theta+\alpha \nabla_{\theta} J(\theta)$

reward to go

$\hat{Q}(s_t,a_t)$: estimate of the expected reward to go if we take action $a_{t}$ in state $s_{t}$

  • unbiased estimate
  • high variance: it uses only a single sampled trajectory
\[\hat{Q}(s_t,a_t) = \sum_{t'=t}^{T} r\left({s}_{t'}, {a}_{t'}\right)\]
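The single-sample estimate $\hat{Q}$ can be computed for every time step of a trajectory in one backward pass. The following is a minimal sketch; the function name `reward_to_go` and the example rewards are illustrative assumptions, not part of the original notes.

```python
def reward_to_go(rewards):
    """Return [sum_{t'=t}^{T} r_{t'} for each t] for one sampled trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already computed for t + 1.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns


# Hypothetical rewards r(s_1, a_1), ..., r(s_4, a_4) from a single rollout.
q_hat = reward_to_go([1.0, 0.0, 2.0, 3.0])
print(q_hat)  # [6.0, 5.0, 5.0, 3.0]
```

Because each entry depends on one rollout only, repeating the rollout under a stochastic policy would generally give different values, which is where the high variance comes from.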

$Q(s_t,a_t)$: true expected reward to go

  • unbiased (it is the exact expectation)
  • low variance: it takes the expectation over all trajectories
\[Q(s_t,a_t) = \sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}, {a}_{t}]\]

baseline: average reward

\[b= \frac{1}{N} \sum_i Q(s_{i,t},a_{i,t})\]

baseline: state value function

\[b=V(s_t) = E_{a_t \sim \pi_{\theta}}Q(s_{t},a_{t})\]

Advantage function

state-action value function: total reward from taking action $a_{t}$ in state $s_{t}$

\[Q^\pi(s_t,a_t) = \sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}, {a}_{t}]\]

state value function: total reward from state $s_{t}$

\[V^\pi(s_t) = E_{a_t \sim \pi_{\theta}}[Q(s_{t},a_{t})]\]

Advantage function: how much better action $a_{t}$ is

\[A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V^\pi(s_t)\]
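For a single state with a small discrete action set, the three definitions can be checked numerically. This is only an illustrative sketch: the policy probabilities `pi_s` and the values `q_s` are made-up numbers, and the helper names are assumptions.

```python
def state_value(pi_s, q_s):
    """V(s) = sum_a pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(pi_s, q_s))


def advantages(pi_s, q_s):
    """A(s, a) = Q(s, a) - V(s) for every action a."""
    v = state_value(pi_s, q_s)
    return [q - v for q in q_s]


pi_s = [0.5, 0.25, 0.25]   # hypothetical pi(a|s) for three actions
q_s = [1.0, 2.0, 4.0]      # hypothetical Q(s, a)

v = state_value(pi_s, q_s)
adv = advantages(pi_s, q_s)
print(v)    # 2.0
print(adv)  # [-1.0, 0.0, 2.0]
# The advantage averages to zero under the policy.
print(sum(p * a for p, a in zip(pi_s, adv)))  # 0.0
```

A positive advantage marks an action that is better than the policy's average behaviour in that state, a negative one an action that is worse.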

Actor-Critic methods

compute the gradient

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) A^\pi(s_{i,t},a_{i,t}) \right]\]

so we need to estimate the advantage function

Value function fitting

Q and V

\[\begin{aligned} & Q^\pi(s_t,a_t) \\=& \sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}, {a}_{t}] \\ =& r(s_t,a_t) + \sum_{t'=t+1}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}, {a}_{t}] \\ =& r(s_t,a_t) + E_{s_{t+1} \sim p(s_{t+1}|s_t,a_t)}[V^\pi(s_{t+1})] \\ \approx & r(s_t,a_t) + V^\pi(s_{t+1}) \end{aligned}\]

V and A

\[A^\pi(s_t,a_t) \approx r(s_t,a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)\]

so it is enough to estimate the state value function in order to obtain the advantage function

Policy evaluation

we use a network $V^\pi_{\phi}(s_t)$ to approximate state value function $V^\pi(s_t)$

\[\begin{aligned} & V^\pi(s_t) \\=&\; E_{a_t \sim \pi_{\theta}}[Q(s_{t},a_{t})] \\=&\; E_{a_t \sim \pi_{\theta}}\left[\sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}, {a}_{t}] \right] \\=&\; \sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}] \end{aligned}\]

Monte Carlo evaluation with function approximation

use sampled reward sums to train network

\[\begin{aligned} & V^\pi(s_t) \\=&\; \sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}] \\ \approx&\;\sum_{t'=t}^{T} r\left({s}_{t'}, {a}_{t'}\right) \end{aligned}\]


\[y_{i,t} = \sum_{t'=t}^{T} r\left({s}_{i,t'}, {a}_{i,t'}\right)\]

Temporal difference evaluation with function approximation

use bootstrap to train network

\[\begin{aligned} & V^\pi(s_t) \\=&\; \sum_{t'=t}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}] \\ \approx&\; r(s_t,a_t) + \sum_{t'=t+1}^T E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}] \\ \approx&\; r(s_t,a_t) + V^\pi(s_{t+1}) \end{aligned}\]


\[y_{i,t} = r(s_{i,t},a_{i,t}) + V^\pi_{\phi}(s_{i,t+1})\]
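The two kinds of regression targets can be compared side by side on one trajectory. The sketch below is an assumption-laden illustration: the rewards and the critic outputs `values` are invented numbers, and the value after the final step is taken to be 0.

```python
def monte_carlo_targets(rewards):
    """y_t = sum of the sampled rewards from step t to the end."""
    targets, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        targets[t] = running
    return targets


def td_targets(rewards, next_values):
    """y_t = r_t + V_phi(s_{t+1}), bootstrapping from the current critic."""
    return [r + v_next for r, v_next in zip(rewards, next_values)]


rewards = [1.0, 0.0, 2.0]          # hypothetical sampled rewards
values = [2.5, 1.5, 2.0]           # hypothetical critic outputs V_phi(s_t)
next_values = values[1:] + [0.0]   # V_phi(s_{t+1}); 0 after the last step

print(monte_carlo_targets(rewards))      # [3.0, 2.0, 2.0]
print(td_targets(rewards, next_values))  # [2.5, 2.0, 2.0]
```

The Monte Carlo targets depend only on sampled rewards, while the bootstrapped targets change whenever the critic changes, so they inherit any error in $V^\pi_{\phi}$.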

Actor-critic algorithms

Batch Actor-critic algorithms

  • step 1: sample $(s_i,a_i)$ from $\pi_\theta(a \mid s)$

  • step 2: fit $\hat{V}_{\phi}^{\pi}(s)$ to sampled reward sums

  • step 3: evaluate

\[\hat{A}^{\pi}\left(s_{i}, a_{i}\right)=r\left(s_{i}, a_{i}\right)+\gamma \hat{V}_{\phi}^{\pi}\left(s_{i}^{\prime}\right)-\hat{V}_{\phi}^{\pi}\left(s_{i}\right)\]
  • step 4: compute the gradient
\[\nabla_{\theta} J(\theta) \approx \sum_{i} \nabla_{\theta} \log \pi_{\theta}\left(a_{i} \mid s_{i}\right) \hat{A}^{\pi}\left(s_{i}, a_{i}\right)\]
  • step 5: $\theta \leftarrow \theta+\alpha \nabla_{\theta} J(\theta)$
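Steps 3 to 5 of the batch algorithm can be sketched for a tabular softmax policy, where the gradient of $\log \pi_\theta(a \mid s)$ with respect to the logits of state $s$ is the one-hot vector of $a$ minus $\pi_\theta(\cdot \mid s)$. Everything concrete here is an assumption for illustration: the tabular parameterisation, the `done` flag marking terminal transitions, the function name, and the toy batch. The critic `v_hat` is taken as already fitted (step 2).

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def batch_actor_critic_step(theta, v_hat, batch, gamma=0.9, alpha=0.1):
    """theta[s][a]: policy logits; v_hat[s]: fitted critic;
    batch: list of (s, a, r, s_next, done) sampled from the policy."""
    grad = [[0.0] * len(row) for row in theta]
    for s, a, r, s_next, done in batch:
        # Step 3: A(s, a) = r + gamma * V(s') - V(s); V of a terminal state is 0.
        bootstrap = 0.0 if done else v_hat[s_next]
        adv = r + gamma * bootstrap - v_hat[s]
        # Step 4: accumulate grad log pi(a|s) * A, with
        # d log pi(a|s) / d theta[s][b] = 1{a == b} - pi(b|s).
        pi_s = softmax(theta[s])
        for b in range(len(pi_s)):
            grad[s][b] += ((1.0 if b == a else 0.0) - pi_s[b]) * adv
    # Step 5: gradient ascent on the logits.
    new_theta = []
    for t_row, g_row in zip(theta, grad):
        new_theta.append([th + alpha * g for th, g in zip(t_row, g_row)])
    return new_theta, grad


theta = [[0.0, 0.0], [0.0, 0.0]]
v_hat = [1.0, 0.5]
batch = [(0, 1, 2.0, 1, False), (0, 0, 0.0, 1, False)]
new_theta, grad = batch_actor_critic_step(theta, v_hat, batch)
print(grad[0])  # approximately [-1.0, 1.0]: action 1 is reinforced in state 0
```

In this toy batch action 1 has a positive advantage (1.45) and action 0 a negative one (-0.55), so both samples push probability mass toward action 1.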

Online Actor-critic algorithms

  • step 1: take action $a \sim \pi_{\theta}(a \mid s),$ get $\left(s, a, s^{\prime}, r\right)$

  • step 2: update $\hat{V}_{\phi}^{\pi}$ using the target

\[\hat{V}_{\phi}^{\pi}(s) \leftarrow r+\gamma\hat{V}_{\phi}^{\pi}(s^{\prime})\]
  • step 3: evaluate
\[\hat{A}^{\pi}(s, a)=r(s, a)+\gamma \hat{V}_{\phi}^{\pi}\left(s^{\prime}\right)-\hat{V}_{\phi}^{\pi}(s)\]
  • step 4: compute the gradient
\[\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}(a \mid s) \hat{A}^{\pi}(s, a)\]
  • step 5: $\theta \leftarrow \theta+\alpha \nabla_{\theta} J(\theta)$
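The online loop can be sketched end to end on a tiny made-up environment. All specifics below are assumptions for illustration: a two-step chain in which every action moves one state to the right and action 1 pays a reward of 1, a tabular critic updated with step size `beta`, a tabular softmax actor, and the chosen hyperparameters.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def online_actor_critic(episodes=2000, gamma=0.9, alpha=0.1, beta=0.1, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor logits
    v_hat = [0.0] * (n_states + 1)  # critic; index 2 is terminal and stays 0
    for _ in range(episodes):
        s = 0
        while s < n_states:
            # Step 1: sample a ~ pi(a|s) and observe (s, a, s', r).
            pi_s = softmax(theta[s])
            a = 0 if rng.random() < pi_s[0] else 1
            r, s_next = float(a), s + 1  # toy dynamics (hypothetical)
            # Step 2: move V(s) toward the target r + gamma * V(s').
            target = r + gamma * v_hat[s_next]
            v_hat[s] += beta * (target - v_hat[s])
            # Step 3: advantage estimate from the updated critic.
            adv = r + gamma * v_hat[s_next] - v_hat[s]
            # Steps 4-5: single-sample policy gradient ascent on the logits.
            for b in range(n_actions):
                indicator = 1.0 if b == a else 0.0
                theta[s][b] += alpha * (indicator - pi_s[b]) * adv
            s = s_next
    return theta, v_hat


theta, v_hat = online_actor_critic()
probs = [softmax(row) for row in theta]
print(probs)  # the probability of action 1 should dominate in both states
```

Each update uses a single transition, so the gradient is noisy; this is one motivation for the parallel variants discussed under architecture design.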

Architecture design

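1. Two-network design vs. shared-network design

  • two-network design:
    • simple & stable
  • shared-network design:
    • shared features between actor & critic
    • Harder to train: two different gradients (the policy gradient and the regression gradient) push the shared parameters in different directions, and the two signals also differ in nature, so stabilizing the network takes many tricks, such as careful initialization and learning-rate selection.
    • If the simulator is fast (e.g. the Atari emulator or MuJoCo), you may as well use the two-network design, which is easier.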

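2. When the output entropy is used as a regularizer for $\pi(s)$, a larger entropy means the probabilities of the different actions are closer to each other, which encourages more exploration.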
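The relationship between entropy and how spread out the action probabilities are can be checked directly. A minimal sketch, with two made-up action distributions:

```python
import math


def entropy(probs):
    """Shannon entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # all actions equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic policy

print(entropy(uniform))  # equals log(4), the maximum for four actions
print(entropy(peaked))   # much smaller
```

Adding a positive multiple of this entropy to the objective therefore penalizes policies that collapse onto a single action too early.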

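3. Synchronized parallel actor-critic vs. asynchronous parallel actor-critic

  • step 1: each worker copies the global parameters

  • step 2: each worker interacts with the environment and collects sample data

  • step 3: each worker computes gradients

  • step 4: each worker sends its gradients back to the global network to update the global parameters; it does not matter if the global parameters have already been updated by other workers in the meantime.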
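The update pattern in the four steps above can be sketched with threads. This is a toy illustration only: there is no environment or policy here, and `grad_J` is a hypothetical stand-in that returns the gradient of $J(\theta) = -(\theta - 3)^2$ so that the effect of possibly stale gradients is easy to see. The lock protects individual reads and writes, not a whole iteration, so a worker may apply a gradient computed from parameters that other workers have since changed.

```python
import threading


def grad_J(theta):
    # Hypothetical stand-in for a worker's policy gradient:
    # J(theta) = -(theta - 3)^2, so dJ/dtheta = -2 * (theta - 3).
    return -2.0 * (theta - 3.0)


def worker(shared, lock, steps, alpha):
    for _ in range(steps):
        with lock:
            local_theta = shared["theta"]  # step 1: copy global parameters
        # Steps 2-3: a real worker would collect samples and compute the
        # policy gradient here; the toy gradient stands in for that.
        g = grad_J(local_theta)
        with lock:
            # Step 4: apply the gradient even if theta changed meanwhile.
            shared["theta"] += alpha * g


shared = {"theta": 0.0}
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(shared, lock, 200, 0.05))
    for _ in range(4)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared["theta"])  # close to 3.0, the maximizer of the toy objective
```

A synchronized variant would instead wait for all workers, average their gradients, and apply one update per round.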


discount factors

Temporal difference policy gradient

\[y_{i,t} = r(s_t,a_t) + V^\pi_{\phi}(s_{t+1})\]

when $T$ (episode length) is $\infty$, $V^\pi_{\phi}(s_{t+1})$ can get infinitely large in many cases

we can use a discount factor $\gamma \in [0,1]$ (0.99 works well): better to get rewards sooner than later

\[y_{i,t} = r(s_t,a_t) + \gamma V^\pi_{\phi}(s_{t+1})\]

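Another view: $\gamma$ changes the MDP by adding a terminal state. The state value function is originally the expectation of the total reward; with a discount factor, each step can be viewed as continuing with probability $\gamma$ (still collecting the original reward) and entering the terminal state, where no further reward is received, with probability $1-\gamma$.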
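This equivalence can be written out as a short derivation. As an assumption for the sketch, termination is taken to happen after each step with probability $1-\gamma$, independently of the state and action, and $\tilde{V}^\pi$ denotes the undiscounted value in this modified MDP (a symbol introduced here, not in the original notes). The process is still running at time $t'$ with probability $\gamma^{t'-t}$, so

\[\begin{aligned} \tilde{V}^\pi(s_t) &= \sum_{t'=t}^{\infty} P(\text{not terminated before } t') \, E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}] \\ &= \sum_{t'=t}^{\infty} \gamma^{t'-t} E_{\pi_\theta}[r\left({s}_{t'}, {a}_{t'}\right) | {s}_{t}] \end{aligned}\]

which is the discounted value function of the original MDP.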

compute the gradient

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) \left( r(s_{i,t},a_{i,t}) + \gamma V^\pi_{\phi}(s_{i,t+1})- V^\pi_{\phi}(s_{i,t}) \right) \right]\]

Monte Carlo policy gradient

\[y_{i,t} = \sum_{t'=t}^{T} r\left({s}_{i,t'}, {a}_{i,t'}\right)\]

There are two ways to add the discount factor when computing the gradient.

Further reading: Philip Thomas, Bias in natural actor-critic algorithms. ICML 2014

option 1

Option 1 is the usual choice.

\[\begin{aligned} &\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} | s_{i, t}\right)\left(\sum_{t^{\prime}=t}^{T} \gamma^{t^{\prime}-t} r\left(s_{i, t^{\prime}}, a_{i, t^{\prime}}\right)\right)\\ \end{aligned}\]

option 2

Option 2 discounts not only the rewards but also the gradient terms themselves, so later time steps have less and less influence on the gradient.

\[\begin{aligned} \nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} | s_{i, t}\right)\right)\left(\sum_{t'=1}^{T} \gamma^{t'-1} r\left(s_{i, t^{\prime}}, a_{i, t^{\prime}}\right)\right)\\ &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} | s_{i, t}\right)\left(\sum_{t'=t}^{T} \gamma^{t^{\prime}-1} r\left(s_{i, t^{\prime}}, a_{i, t^{\prime}}\right)\right)\\ &= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} | s_{i, t}\right)\left(\sum_{t^{\prime}=t}^{T} \gamma^{t^{\prime}-t} r\left(s_{i, t^{\prime}}, a_{i, t^{\prime}}\right)\right) \end{aligned}\]
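The difference between the two options is easiest to see in the scalar that multiplies $\nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t})$ at each time step. A small sketch with made-up rewards (function names are assumptions; indices are 0-based in the code, so $\gamma^{t-1}$ becomes `gamma ** t`):

```python
def option1_weights(rewards, gamma):
    """Multiplier of grad log pi at step t: sum_{t'>=t} gamma^(t'-t) * r_{t'}."""
    weights, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running
    return weights


def option2_weights(rewards, gamma):
    """Option 1 multiplier, additionally scaled by gamma^(t-1) (1-based t)."""
    base = option1_weights(rewards, gamma)
    return [gamma ** t * w for t, w in enumerate(base)]


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards of one trajectory
print(option1_weights(rewards, 0.5))  # [1.75, 1.5, 1.0]
print(option2_weights(rewards, 0.5))  # [1.75, 0.75, 0.25]
```

Under option 2 the later decisions are down-weighted twice, once through their discounted reward to go and once through the extra $\gamma^{t-1}$ factor.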


state-dependent baselines

policy gradient

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) \left(\sum_{t'=t}^{T} \gamma^{t'-t}r\left({s}_{i,t'}, {a}_{i,t'}\right) - b\right) \right]\]
  • no bias
  • higher variance (because single-sample estimate)


\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) \left( r(s_{i,t},a_{i,t}) + \gamma V^\pi_{\phi}(s_{i,t+1})- V^\pi_{\phi}(s_{i,t}) \right) \right]\]
  • lower variance (due to critic)
  • not unbiased (if the critic is not perfect)

use $V^\pi_{\phi}(s_{i,t})$ and still keep the estimator unbiased

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) \left(\sum_{t'=t}^{T} \gamma^{t'-t}r\left({s}_{i,t'}, {a}_{i,t'}\right) - V^\pi_{\phi}(s_{i,t})\right) \right]\]
  • no bias
  • lower variance (baseline is closer to rewards)
  • but still higher variance (because single-sample estimate)

action-dependent baselines

Use a critic as an action-dependent baseline without introducing bias (the estimator stays unbiased), provided the second term below can be evaluated; see Gu et al. 2016 (Q-Prop).

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[\nabla_\theta \log \pi_{\theta}\left(a_{i,t} | s_{i,t}\right) \left(\sum_{t'=t}^{T} \gamma^{t'-t}r\left({s}_{i,t'}, {a}_{i,t'}\right) - Q^\pi_{\phi}(s_{i,t},a_{i,t})\right) \right]\]
  • goes to zero in expectation if critic is correct
  • not correct on its own: an action-dependent baseline does not vanish in expectation, so a correction term has to be added
\[\begin{aligned} \nabla_{\theta} J(\theta) \approx\;& \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} | s_{i, t}\right)\left(\hat{Q}_{i, t}-Q_{\phi}^{\pi}\left(s_{i, t}, a_{i, t}\right)\right)\\ &+\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} E_{a_{t} \sim \pi_{\theta}\left(a_{t} | s_{i, t}\right)}\left[Q_{\phi}^{\pi}\left(s_{i, t}, a_{t}\right)\right] \end{aligned}\]

Generalized advantage estimation

n-step returns

\[\hat{A}_{n}^{\pi}\left(s_{t}, a_{t}\right)=\sum_{t^{\prime}=t}^{t+n-1} \gamma^{t^{\prime}-t} r\left(s_{t^{\prime}}, a_{t^{\prime}}\right)-\hat{V}_{\phi}^{\pi}\left(s_{t}\right)+\gamma^{n} \hat{V}_{\phi}^{\pi}\left(s_{t+n}\right)\]
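An n-step advantage can be computed from one trajectory and the critic's values. The sketch below assumes the convention of $n$ sampled rewards followed by a bootstrap from $\hat{V}_{\phi}^{\pi}(s_{t+n})$, truncates at the end of the trajectory, and uses invented numbers; `values` holds one extra entry for the state after the last reward.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """rewards[k] = r(s_k, a_k); values[k] = V_phi(s_k), with one extra
    entry for the state reached after the last reward (0 if terminal)."""
    end = min(t + n, len(rewards))  # truncate at the end of the trajectory
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return ret + gamma ** (end - t) * values[end] - values[t]


rewards = [1.0, 2.0, 3.0]        # hypothetical sampled rewards
values = [0.5, 1.0, 2.0, 0.0]    # hypothetical critic outputs

print(n_step_advantage(rewards, values, t=0, n=1, gamma=0.5))  # 1.0
print(n_step_advantage(rewards, values, t=0, n=2, gamma=0.5))  # 2.0
print(n_step_advantage(rewards, values, t=0, n=3, gamma=0.5))  # 2.25
```

With $n=1$ this is the one-step TD advantage; as $n$ grows it relies more on sampled rewards and less on the critic.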

Generalized advantage estimation

\[\hat{A}_{\mathrm{GAE}}^{\pi}\left(s_{t}, a_{t}\right)=\sum_{n=1}^{\infty} w_{n} \hat{A}_{n}^{\pi}\left(s_{t}, a_{t}\right)\]

where $w_{n} \propto \lambda^{n-1}$

\[\begin{aligned} \hat{A}_{\mathrm{GAE}}^{\pi}\left(s_{t}, a_{t}\right)&=\sum_{t^{\prime}=t}^{\infty}(\gamma \lambda)^{t^{\prime}-t} \delta_{t^{\prime}} \\ \delta_{t^{\prime}}&=r\left(s_{t^{\prime}}, a_{t^{\prime}}\right)+\gamma \hat{V}_{\phi}^{\pi}\left(s_{t^{\prime}+1}\right)-\hat{V}_{\phi}^{\pi}\left(s_{t^{\prime}}\right) \end{aligned}\]

$(\gamma \lambda)$ has a similar effect to a discount factor, and discounting acts as variance reduction
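The sum over TD residuals $\delta_{t'}$ can be evaluated with a backward recursion. The following is a minimal sketch for a finite trajectory (the infinite sum is truncated at the last step); the function name and the numbers are illustrative assumptions, and `values` has one extra entry for the state after the final reward.

```python
def gae(rewards, values, gamma, lam):
    """A_t = sum_{t'>=t} (gamma * lam)^(t'-t) * delta_{t'}, computed backwards."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


rewards = [1.0, 2.0, 3.0]        # hypothetical sampled rewards
values = [0.5, 1.0, 2.0, 0.0]    # hypothetical critic outputs

print(gae(rewards, values, gamma=0.5, lam=0.0))  # [1.0, 2.0, 1.0]
print(gae(rewards, values, gamma=0.5, lam=1.0))  # [2.25, 2.5, 1.0]
```

With $\lambda = 0$ the result is the one-step TD advantage at every step, and with $\lambda = 1$ it is the discounted Monte Carlo return minus the critic's value, so $\lambda$ interpolates between the two estimators.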

Compatible Function Approximation

Theorem (Compatible Function Approximation Theorem)

If the following two conditions are satisfied:

1. The value function approximator is compatible with the policy

\[\nabla_{w} Q_{w}(s, a)=\nabla_{\theta} \log \pi_{\theta}(s, a)\]

2. The value function parameters $w$ minimise the mean-squared error

\[\varepsilon=\mathbb{E}_{\pi_{\theta}}\left[\left(Q^{\pi_{\theta}}(s, a)-Q_{w}(s, a)\right)^{2}\right]\]

Then the policy gradient is exact,

\[\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) Q_{w}(s, a)\right]\]

If $w$ is chosen to minimise mean-squared error, gradient of $\varepsilon$ w.r.t. $w$ must be zero,

\[\begin{aligned} \nabla_{w} \varepsilon&=0 \\ \mathbb{E}_{\pi_{\theta}}\left[\left(Q^{\pi_{\theta}}(s, a)-Q_{w}(s, a)\right) \nabla_{w} Q_{w}(s, a)\right]&=0 \\ \mathbb{E}_{\pi_{\theta}}\left[\left(Q^{\pi_{\theta}}(s, a)-Q_{w}(s, a)\right) \nabla_{\theta} \log \pi_{\theta}(s, a)\right]&=0 \\ \mathbb{E}_{\pi_{\theta}}\left[Q^{\pi_{\theta}}(s, a) \nabla_{\theta} \log \pi_{\theta}(s, a)\right]&=\mathbb{E}_{\pi_{\theta}}\left[Q_{w}(s, a) \nabla_{\theta} \log \pi_{\theta}(s, a)\right] \end{aligned}\]

So $Q_{w}(s, a)$ can be substituted directly into the policy gradient,

\[\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) Q_{w}(s, a)\right]\]
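As a concrete instance of condition 1, consider a softmax policy with linear preferences. The feature map $\phi(s,a)$ is an assumption introduced here for illustration; it does not appear in the theorem statement. With $\pi_{\theta}(s, a) \propto \exp\left(\phi(s, a)^{\top} \theta\right)$, the score function is

\[\nabla_{\theta} \log \pi_{\theta}(s, a)=\phi(s, a)-\sum_{a^{\prime}} \pi_{\theta}\left(s, a^{\prime}\right) \phi\left(s, a^{\prime}\right)\]

so a critic that is linear in the same centred features is compatible:

\[Q_{w}(s, a)=w^{\top}\left(\phi(s, a)-\sum_{a^{\prime}} \pi_{\theta}\left(s, a^{\prime}\right) \phi\left(s, a^{\prime}\right)\right), \qquad \nabla_{w} Q_{w}(s, a)=\nabla_{\theta} \log \pi_{\theta}(s, a)\]

Note that this $Q_{w}$ averages to zero over actions in every state, so in practice it behaves more like an approximation of the advantage than of $Q^{\pi_{\theta}}$ itself.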

Summary of Policy Gradient Algorithms

The policy gradient has many equivalent forms

\[\begin{aligned} \nabla_{\theta} J(\theta)&=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) v_{t}\right] \quad\quad\;\;\;\quad \text { REINFORCE } \\ &=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) Q^{w}(s, a)\right] \quad \text { Q Actor-Critic } \\ &=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) A^{w}(s, a)\right] \quad \text { Advantage Actor-Critic } \\ &=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) \delta\right] \quad \quad \quad \;\;\;\;\text { TD Actor-Critic } \\ &=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a) \delta e\right] \quad \quad \quad \;\;\text { TD($\lambda$) Actor-Critic } \\ G_{\theta}^{-1} \nabla_{\theta} J(\theta)&=w \quad \quad \quad \quad\quad \quad \quad \quad \quad \quad \quad \quad \;\text { Natural Actor-Critic } \end{aligned}\]

Each leads to a stochastic gradient ascent algorithm

Critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^{\pi}(s, a), A^{\pi}(s, a)$ or $V^{\pi}(s)$

