Thoughts on Reinforcement Learning (3): Markov Decision Processes


Some notes on Markov decision processes.


Basic Definitions

Reward

Reward Hypothesis

Definition (Reward Hypothesis):

All goals can be described by the maximisation of expected cumulative reward.

If a problem does not satisfy the reward hypothesis, it cannot be solved with reinforcement learning.

Expected Reward

In RL, we almost always care about expectations:

  • if there are two actions with $r(s,a_1)=1$ and $r(s,a_2)=-1$, and the policy assigns $\pi_\theta(a_1 \mid s)=\alpha$ and $\pi_\theta(a_2 \mid s)=1-\alpha$:

    • $r(s,a)$ is not smooth in the policy parameters, so it is difficult to optimize

    • $E_{a \sim \pi_\theta}[r(s,a)] = 2\alpha - 1$ is smooth in $\alpha$, so it is easy to optimize (see the sketch below)
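
A minimal numerical sketch of this point (the code and variable names below are mine, not from the post): any single sampled reward is just $+1$ or $-1$, but the expectation under $\pi_\theta$ is a smooth linear function of $\alpha$ with a well-defined gradient.

```python
import numpy as np

# Rewards of the two actions from the example above: r(s, a1) = 1, r(s, a2) = -1.
r = np.array([1.0, -1.0])

def expected_reward(alpha):
    """E_{a ~ pi_theta}[r(s, a)] = alpha * 1 + (1 - alpha) * (-1) = 2 * alpha - 1."""
    pi = np.array([alpha, 1.0 - alpha])
    return float(pi @ r)

# The expectation is a smooth (here, linear) function of the policy parameter
# alpha, with constant gradient d/d_alpha = 2, whereas a single sampled reward
# carries no gradient information by itself.
print([expected_reward(a) for a in np.linspace(0.0, 1.0, 5)])
# [-1.0, -0.5, 0.0, 0.5, 1.0]
```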

State and Observation

The state is a function of the history; it determines what happens next:

\[S_t =f(H_t)\]

States can be divided into the environment state and the agent state, where the agent state is also called the observation. Unless otherwise specified, "state" refers to the agent state.

The environment state $S_t^e$ is the environment’s private representation.

The agent state $S_t^a$ (observation $O_t$) is the agent’s internal representation.

From the agent's point of view, the environment state is usually unknown, while the agent state is known; the agent chooses its actions based on the agent state.

Note that the agent state generally has to be constructed by the agent itself; it can be any function of the history:

\[S_t^a =f(H_t)\]

Markov state

A Markov state contains all useful information from the history; a Markov state is also called an information state.

Definition (Markov state):

A state $S_t$ is Markov if and only if

\[P[S_{t+1} | S_t] = P[S_{t+1} | S_1,...,S_t]\]
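
To make the definition concrete, here is a toy check (my own construction, not from the original post): the position of a particle on a ring is not a Markov state on its own, because the next position also depends on the hidden direction of travel; conditioning on the pair (previous position, current position) recovers that missing information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden state: (position on a ring of 5 cells, direction of travel).
# We only record the position, which by itself is NOT a Markov state.
n, pos, direction = 5, 0, 1
positions = []
for _ in range(200_000):
    if rng.random() < 0.05:      # the direction occasionally reverses
        direction = -direction
    pos = (pos + direction) % n
    positions.append(pos)
positions = np.array(positions)

def next_dist(mask):
    """Empirical distribution of S_{t+1} given a condition on the past."""
    idx = np.flatnonzero(mask[:-1])
    return np.bincount(positions[idx + 1], minlength=n) / len(idx)

prev = np.roll(positions, 1)     # prev[t] = position at time t-1
print(next_dist(positions == 0))                  # ~[0, 0.5, 0, 0, 0.5]
print(next_dist((positions == 0) & (prev == 4)))  # ~[0, 0.95, 0, 0, 0.05]
print(next_dist((positions == 0) & (prev == 1)))  # ~[0, 0.05, 0, 0, 0.95]
```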

Model and Environment

A model predicts what the environment will do next. It is defined as:

\[\begin{aligned} P_{ss'}^a &= P[S_{t+1} = s' | S_t = s, A_t = a] \\ R_{s}^a &= \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \end{aligned}\]
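
As a minimal sketch (the sizes and random values below are placeholders), a tabular model can store $P_{ss'}^a$ and $R_s^a$ directly as arrays and use them to predict the environment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=-1, keepdims=True)   # normalise each row into a distribution

# R[a, s] = E[R_{t+1} | S_t = s, A_t = a]
R = rng.random((n_actions, n_states))

def model_step(s, a):
    """Use the model to predict the environment: sample s' and return the expected reward."""
    s_next = rng.choice(n_states, p=P[a, s])
    return s_next, R[a, s]

print(model_step(s=0, a=1))
```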

Observable Environments

A Markov decision process (MDP) is defined in a fully observable environment, where the agent can directly observe the environment state:

\[O_t = S_t^a = S_t^e \quad\text{($S_t^e$ is Markov)}\]

A partially observable Markov decision process (POMDP) is defined in a partially observable environment, where the agent cannot directly observe the environment state:

\[O_t = S_t^a \neq S_t^e \quad\text{($S_t^e$ is Markov)}\]

In a POMDP, the agent needs to construct its own agent state, which can be done in several ways:

  • Complete history: $S_t^a = H_t$
  • Beliefs of environment state: $S_t^a = (P[S_t^e = s_1],...,P[S_t^e = s_n])$
  • Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$

For example, the DQN paper stacks the last 4 consecutive game frames as the network input, which is exactly a way of using part of the history to construct the agent state.
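
Below is a minimal sketch of two of these constructions (the shapes, random weights, and the choice of $\tanh$ as the nonlinearity are my own placeholders): stacking the last $k$ observations as in DQN, and a recurrent update of the agent state.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)

# 1) DQN-style agent state: stack the last k observations (frames).
k, obs_dim = 4, 8                        # hypothetical sizes
frames = deque(maxlen=k)
for _ in range(k):                       # fill the buffer with initial observations
    frames.append(rng.random(obs_dim))
agent_state = np.concatenate(list(frames))   # S_t^a built from part of the history

# 2) RNN-style agent state: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
state_dim = 16
W_s = rng.normal(size=(state_dim, state_dim)) * 0.1
W_o = rng.normal(size=(obs_dim, state_dim)) * 0.1
s_prev = np.zeros(state_dim)
o_t = rng.random(obs_dim)
s_t = np.tanh(s_prev @ W_s + o_t @ W_o)  # tanh plays the role of sigma
print(agent_state.shape, s_t.shape)
```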

Extensions to MDPs

Infinite MDP

  • Countably infinite state and/or action spaces
    • Straightforward
  • Continuous state and/or action spaces
    • Closed form for linear quadratic model (LQR)
  • Continuous time
    • Requires partial differential equations
    • Hamilton-Jacobi-Bellman (HJB) equation
    • Limiting case of Bellman equation as time-step $\to 0$ (sketched below)
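
To make the last point concrete, here is a brief sketch (my own, assuming deterministic dynamics $\dot{s} = f(s,a)$ and a discount factor of $e^{-\beta \Delta t}$ per time-step $\Delta t$) of how the discounted Bellman equation turns into the HJB equation as $\Delta t \to 0$:

\[V(s) = \max_a \left[ r(s,a)\,\Delta t + e^{-\beta \Delta t}\, V\big(s + f(s,a)\,\Delta t\big) \right]\]

Expanding $e^{-\beta \Delta t} \approx 1 - \beta \Delta t$ and $V(s + f\,\Delta t) \approx V(s) + \nabla V(s) \cdot f(s,a)\,\Delta t$, then subtracting $V(s)$, dividing by $\Delta t$ and letting $\Delta t \to 0$ gives the HJB equation:

\[\beta\, V(s) = \max_a \left[ r(s,a) + \nabla V(s) \cdot f(s,a) \right]\]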

POMDP

A POMDP is a tuple $\langle \cal{S}, \cal{A}, \cal{O}, \cal{P}, \cal{R}, \cal{Z}, \gamma \rangle$, where $\cal{O}$ is a finite set of observations and $\cal{Z}$ is an observation function:

\[Z_{s'o}^a = P[O_{t+1} = o | S_{t+1} = s', A_t = a]\]

The history is defined as follows:

Definition (History):

A history $H_t$ is a sequence of actions, observations and rewards,

\[H_t = A_0,O_1,R_1,...,A_{t-1}, O_t, R_t\]

The belief state is defined as follows:

Definition (Belief State):

A belief state $b(h)$ is a probability distribution over states, conditioned on the history $h$

\[b(h) = (P[S_t = s_1|H_t=h],...,P[S_t = s_n|H_t=h])\]

  • The history $H_t$ satisfies the Markov property

  • The belief state $b(H_t)$ satisfies the Markov property

  • A POMDP can be reduced to an (infinite) history tree

  • A POMDP can be reduced to an (infinite) belief state tree
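
As a minimal sketch of how a belief state is maintained in practice (a standard Bayes filter; the transition model, observation model, and sizes below are hypothetical), the belief can be updated recursively after each action and observation:

```python
import numpy as np

n_states, n_obs = 3, 2

# P[a][s, s'] : transition model, Z[a][s', o] : observation model (made-up values)
P = {0: np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])}
Z = {0: np.array([[0.8, 0.2],
                  [0.5, 0.5],
                  [0.2, 0.8]])}

def belief_update(b, a, o):
    """b'(s') ∝ Z[a][s', o] * sum_s b(s) * P[a][s, s'] — the belief after taking a and seeing o."""
    predicted = b @ P[a]              # predict: distribution over s' before seeing o
    updated = Z[a][:, o] * predicted  # correct: weight by the likelihood of the observation
    return updated / updated.sum()

b = np.ones(n_states) / n_states      # start from a uniform belief
b = belief_update(b, a=0, o=1)
print(b)
```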

Ergodic MDP

Definition (Ergodic MDP):

An MDP is ergodic if the Markov chain induced by any policy is ergodic.

An ergodic Markov process is

  • Recurrent: each state is visited an infinite number of times

  • Aperiodic: each state is visited without any systematic period

Theorem:

An ergodic Markov process has a limiting stationary distribution $d^\pi(s)$ with the property

\[d^\pi(s)=\sum_{s'\in \cal{S}}d^\pi(s')P_{s's}\]

$d^\pi$ is a left eigenvector of the transition matrix $P$ (equivalently, an eigenvector of $P^\top$) with eigenvalue $1$; it always exists under mild regularity conditions.
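
A minimal numerical check of the theorem (the transition matrix below is a made-up example): $d^\pi$ can be computed as the eigenvector of $P^\top$ associated with eigenvalue $1$, normalised to sum to one.

```python
import numpy as np

# Transition matrix of the Markov chain induced by some policy pi (made-up values).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# d satisfies d = d P, i.e. d is the eigenvector of P^T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d = d / d.sum()

print(d)                      # limiting stationary distribution d^pi
print(np.allclose(d, d @ P))  # True: d^pi(s) = sum_{s'} d^pi(s') P_{s's}
```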

Average Reward

For any policy $\pi$, an ergodic MDP has an average reward per time-step $\rho^\pi$ that is independent of start state.

\[\rho^\pi = \lim_{T \to \infty} \frac{1}{T} \Bbb{E}\left[ \sum_{t=1}^T{R_t} \right]\]

Average Reward Value Function

The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward.

$\tilde{v}_{\pi}(s)$ is the extra reward due to starting from state $s$,

\[\tilde{v}_{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=1}^{\infty}\left(R_{t+k}-\rho^{\pi}\right) | S_{t}=s\right]\]

There is a corresponding average reward Bellman equation,

\[\begin{aligned} \tilde{v}_{\pi}(s) &=\mathbb{E}_{\pi}\left[\left(R_{t+1}-\rho^{\pi}\right)+\sum_{k=1}^{\infty}\left(R_{t+k+1}-\rho^{\pi}\right) | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[\left(R_{t+1}-\rho^{\pi}\right)+\tilde{v}_{\pi}\left(S_{t+1}\right) | S_{t}=s\right] \end{aligned}\]
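
As a minimal sketch (the two-state chain and rewards below are made up), for a fixed policy one can compute $\rho^\pi$ from the stationary distribution and then solve the average-reward Bellman equation, pinning $\tilde{v}_\pi$ at one state since it is only defined up to an additive constant:

```python
import numpy as np

# Markov reward process induced by a fixed policy (made-up values).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
r = np.array([1.0, 5.0])          # expected reward in each state

# Average reward: rho = sum_s d(s) r(s), with d the stationary distribution.
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d /= d.sum()
rho = d @ r

# Average-reward Bellman equation: v = (r - rho) + P v, pinned by v[0] = 0.
A = np.eye(2) - P
A[0] = [1.0, 0.0]                 # replace one equation with the constraint v[0] = 0
b = r - rho
b[0] = 0.0
v = np.linalg.solve(A, b)

print(rho, v)
print(np.allclose(v, (r - rho) + P @ v))   # True: the Bellman equation holds
```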

References and Acknowledgements

All references have been listed in the preface to this series; once again, my heartfelt thanks to the masters and pioneers of reinforcement learning!