Thoughts on Reinforcement Learning (3): Markov Decision Processes
Notes on Markov decision processes.
Table of Contents
- Thoughts on Reinforcement Learning (1): Preface
- Thoughts on Reinforcement Learning (2): An Introduction to Reinforcement Learning
- Thoughts on Reinforcement Learning (3): Markov Decision Processes
- Thoughts on Reinforcement Learning (4): Imitation Learning and Supervised Learning
- Thoughts on Reinforcement Learning (5): Dynamic Programming
- Thoughts on Reinforcement Learning (6): Monte Carlo and Temporal Difference
- Thoughts on Reinforcement Learning (7): Policy Gradient
- Thoughts on Reinforcement Learning (8): Actor-Critic Methods
- Thoughts on Reinforcement Learning (9): Value Function Methods
- Thoughts on Reinforcement Learning (10): Deep Q Network
- Thoughts on Reinforcement Learning (11): Advanced Policy Gradient
Basic Definitions
Reward
Reward Hypothesis
Definition (Reward Hypothesis):
All goals can be described by the maximisation of expected cumulative reward.
If a problem does not satisfy the reward hypothesis, it cannot be solved with reinforcement learning.
Expected Reward
In RL, we almost always care about expectations:
If there are two actions with $r(s,a_1)=1$ and $r(s,a_2)=-1$, and the policy picks them with $\pi_\theta(a_1 \mid s)=\alpha$ and $\pi_\theta(a_2 \mid s)=1-\alpha$:
- The sampled reward $r(s,a)$ is not smooth in $\theta$, so it is difficult to optimize directly.
- The expectation $E_{a \sim \pi_\theta}[r(s,a)] = \alpha \cdot 1 + (1-\alpha) \cdot (-1) = 2\alpha - 1$ is smooth in $\alpha$, so it is easy to optimize.
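A minimal sketch of this point, assuming the two-action setup above with $\alpha$ parameterized as a sigmoid (the parameterization and step size are our own illustrative choices):

```python
import numpy as np

# Two actions: r(s, a1) = 1, r(s, a2) = -1, with pi_theta(a1|s) = alpha =
# sigmoid(theta). A single sampled reward jumps between +1 and -1, but the
# expectation E[r] = 2*alpha - 1 is a smooth function of theta, so plain
# gradient ascent on the expectation works.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = -2.0                          # start with alpha ~ 0.12
for _ in range(500):
    alpha = sigmoid(theta)
    # dE[r]/dtheta = 2 * dalpha/dtheta = 2 * alpha * (1 - alpha)
    theta += 0.5 * 2 * alpha * (1 - alpha)

print(sigmoid(theta))                 # -> close to 1: mass moves onto a1
```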
State and Observation
The state is a function of the history and determines what happens next:

\[S_t = f(H_t)\]

States come in two kinds, the environment state and the agent state; the agent state is also called the observation. Unless otherwise noted, "state" below always means the agent state.
The environment state $S_t^e$ is the environment’s private representation.
The agent state $S_t^a$ (observation $O_t$) is the agent’s internal representation.
To the agent, the environment state is usually unknown while the agent state is known; the agent chooses its actions based on the agent state.
Note that the agent state generally has to be constructed by the agent itself, and it can be any function of the history:

\[S_t^a = f(H_t)\]

Markov state
A Markov state contains all useful information from the history; a Markov state is also called an information state.
Definition (Markov state):
A state $S_t$ is Markov if and only if
\[P[S_{t+1} | S_t] = P[S_{t+1} | S_1,...,S_t]\]
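To make the definition concrete, here is a toy sketch (our own example, not from the original post): a point that tends to keep its direction of motion. The position alone is not a Markov state, because conditioning on one more step of history changes the distribution of the next position; the pair (position, direction) would be Markov.

```python
import random
from collections import Counter

# A point moves left or right; each step it keeps its direction with
# probability 0.9 and flips it with probability 0.1.
def rollout(T=50):
    pos, d = 0, random.choice([-1, 1])
    traj = [pos]
    for _ in range(T):
        if random.random() < 0.1:
            d = -d
        pos += d
        traj.append(pos)
    return traj

up_pos, n_pos = Counter(), Counter()      # condition on S_t only
up_hist, n_hist = Counter(), Counter()    # condition on (S_{t-1}, S_t)
for _ in range(5000):
    t = rollout()
    for i in range(1, len(t) - 1):
        n_pos[t[i]] += 1
        up_pos[t[i]] += t[i + 1] > t[i]
        n_hist[(t[i - 1], t[i])] += 1
        up_hist[(t[i - 1], t[i])] += t[i + 1] > t[i]

print(up_pos[0] / n_pos[0])               # ~0.5: ambiguous given position alone
print(up_hist[(-1, 0)] / n_hist[(-1, 0)]) # ~0.9: history reveals the direction
```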
Model and Environment
The model predicts what the environment will do next; it is defined as:
\[P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a] \\ R_{s}^a = E[R_{t+1} | S_t = s, A_t = a]\]
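As a sketch of what such a model looks like in code, here is a tabular, count-based estimate of $P_{ss'}^a$ and $R_s^a$ (the structure and names are our own):

```python
import numpy as np

n_states, n_actions = 3, 2
counts = np.zeros((n_states, n_actions, n_states))  # N(s, a, s')
reward_sum = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

def update(s, a, r, s_next):
    """Record one observed transition (s, a, r, s')."""
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r
    visits[s, a] += 1

def P(s, a):
    """Estimated P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]."""
    return counts[s, a] / counts[s, a].sum()

def R(s, a):
    """Estimated R_s^a = E[R_{t+1} | S_t = s, A_t = a]."""
    return reward_sum[s, a] / visits[s, a]

update(0, 1, 1.0, 2)
update(0, 1, 0.0, 1)
print(P(0, 1))   # [0.  0.5 0.5]
print(R(0, 1))   # 0.5
```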
Observable Environments
A Markov decision process (MDP) is defined for fully observable environments, where the agent directly observes the environment state:
\[O_t = S_t^a = S_t^e \quad\text{($S_t^e$ is Markov)}\]

A partially observable Markov decision process (POMDP) is defined for partially observable environments, where the agent cannot directly observe the environment state:
\[O_t = S_t^a \neq S_t^e \quad\text{($S_t^e$ is Markov)}\]

In a POMDP the agent must construct its own agent state, for example in one of the following ways:
- Complete history: $S_t^a = H_t$
- Beliefs of environment state: $S_t^a = (P[S_t^e = s_1],\ldots,P[S_t^e = s_n])$
- Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
For example, the DQN paper stacks 4 consecutive game frames as the network input, which uses part of the history to construct the agent state.
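Below is a minimal sketch of the recurrent construction from the list above, $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$, with random untrained weights and arbitrary dimensions purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim = 8, 4
W_s = 0.1 * rng.normal(size=(state_dim, state_dim))
W_o = 0.1 * rng.normal(size=(obs_dim, state_dim))

def step(agent_state, obs):
    # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o), with sigma = tanh here
    return np.tanh(agent_state @ W_s + obs @ W_o)

agent_state = np.zeros(state_dim)
for t in range(5):
    obs = rng.normal(size=obs_dim)        # stand-in for the observation O_t
    agent_state = step(agent_state, obs)

print(agent_state.shape)  # (8,): a fixed-size summary of the whole history
```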
Extensions to MDPs
Infinite MDP
- Countably infinite state and/or action spaces
  - Straightforward
- Continuous state and/or action spaces
  - Closed form for the linear quadratic model (LQR); see the sketch after this list
- Continuous time
  - Requires partial differential equations
  - Hamilton-Jacobi-Bellman (HJB) equation
  - Limiting case of the Bellman equation as the time-step $\to 0$
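As a sketch of the LQR case above (the dynamics and cost matrices are arbitrary illustrative choices): for linear dynamics $x_{t+1} = Ax_t + Bu_t$ and quadratic cost $x^\top Qx + u^\top Ru$, the optimal value function is quadratic, $x^\top Px$, and value iteration reduces to the discrete Riccati recursion.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])      # double-integrator dynamics
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)                   # state cost
R = np.array([[1.0]])           # control cost

P = Q.copy()
for _ in range(500):            # value iteration = Riccati recursion
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain
    P = Q + A.T @ P @ (A - B @ K)

print(K)   # the optimal policy is linear: u = -K x
```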
POMDP
In a POMDP the agent works with histories and belief states. The history is defined as follows:
Definition (History):
A history $H_t$ is a sequence of actions, observations and rewards,
\[H_t = A_0, O_1, R_1, ..., A_{t-1}, O_t, R_t\]
The belief state is defined as follows:
Definition (Belief State):
A belief state $b(h)$ is a probability distribution over states, conditioned on the history $h$
\[b(h) = (P[S_t = s_1|H_t=h],...,P[S_t = s_n|H_t=h])\]
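The belief state can be maintained recursively with a Bayes filter: predict through the transition model, then reweight by the observation likelihood. The sketch below uses assumed transition and observation matrices ($T$ and $Z$ are our own illustrative names and values):

```python
import numpy as np

n_states = 2
T = {0: np.array([[0.9, 0.1],       # T[a][s, s'] = P[s' | s, a]
                  [0.2, 0.8]])}
Z = {0: np.array([[0.7, 0.3],       # Z[a][s', o] = P[o | s', a]
                  [0.1, 0.9]])}

def belief_update(b, a, o):
    # b'(s') is proportional to P[o | s', a] * sum_s P[s' | s, a] * b(s)
    b_pred = b @ T[a]               # predict step
    b_new = b_pred * Z[a][:, o]     # correct step
    return b_new / b_new.sum()

b = np.array([0.5, 0.5])            # initial belief
b = belief_update(b, a=0, o=1)
print(b)   # the belief is a sufficient statistic of the history
```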
- The history $H_t$ satisfies the Markov property
- The belief state $b(H_t)$ satisfies the Markov property
- A POMDP can be reduced to an (infinite) history tree
- A POMDP can be reduced to an (infinite) belief state tree
Ergodic MDP
Definition (Ergodic MDP):
An MDP is ergodic if the Markov chain induced by any policy is ergodic.
An ergodic Markov process is:
- Recurrent: each state is visited an infinite number of times
- Aperiodic: each state is visited without any systematic period
Theorem:
An ergodic Markov process has a limiting stationary distribution $d^\pi(s)$ with the property
\[d^\pi(s)=\sum_{s'\in \cal{S}}d^\pi(s')P_{s's}\]
That is, $d^\pi$ is a left eigenvector of the transition matrix $P$ with eigenvalue $1$ (it always exists under some regularity conditions).
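A minimal sketch of this property (the transition matrix is an arbitrary ergodic example): power iteration $d \leftarrow dP$ converges to the stationary distribution.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])   # an ergodic chain

d = np.ones(3) / 3                # any initial distribution works
for _ in range(1000):
    d = d @ P                     # power iteration on the left

print(d)                          # limiting stationary distribution d^pi
print(np.allclose(d, d @ P))      # True: d^pi(s) = sum_s' d^pi(s') P_{s's}
```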
Average Reward
For any policy $\pi$, an ergodic MDP has an average reward per time-step $\rho^\pi$ that is independent of start state.
\[\rho^\pi = \lim_{T \to \infty} \frac{1}{T} \Bbb{E}\left[ \sum_{t=1}^T{R_t} \right]\]
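To see the start-state independence, here is a small simulation sketch (the two-state chain and rewards are our own example):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])          # reward received on entering each state

def average_reward(start, T=100_000):
    s, total = start, 0.0
    for _ in range(T):
        s = rng.choice(2, p=P[s])
        total += r[s]
    return total / T

# Both estimates converge to the same rho^pi = 2/7, whatever the start state.
print(average_reward(0), average_reward(1))
```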
Average Reward Value Function
The value function of an undiscounted, ergodic MDP can be expressed in terms of the average reward.
$\tilde{v}_{\pi}(s)$ is the extra reward due to starting from state $s$,
\[\tilde{v}_{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=1}^{\infty}\left(R_{t+k}-\rho^{\pi}\right) | S_{t}=s\right]\]

There is a corresponding average-reward Bellman equation:
\[\begin{aligned} \tilde{v}_{\pi}(s) &=\mathbb{E}_{\pi}\left[\left(R_{t+1}-\rho^{\pi}\right)+\sum_{k=1}^{\infty}\left(R_{t+k+1}-\rho^{\pi}\right) | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[\left(R_{t+1}-\rho^{\pi}\right)+\tilde{v}_{\pi}\left(S_{t+1}\right) | S_{t}=s\right] \end{aligned}\]
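Solutions of this equation are only defined up to an additive constant ($I - P$ is singular), so one can pin $\tilde{v}_\pi(s_0) = 0$ and solve the resulting linear system. A sketch with an illustrative two-state chain:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])      # r(s) = E[R_{t+1} | S_t = s], illustrative

d = np.ones(2) / 2
for _ in range(1000):
    d = d @ P                 # stationary distribution d^pi
rho = d @ r                   # average reward rho^pi = 2/7

A = np.eye(2) - P             # Bellman equation: (I - P) v = r - rho
b = r - rho
A[0] = [1.0, 0.0]             # replace first row to enforce v(s0) = 0
b[0] = 0.0
v = np.linalg.solve(A, b)
print(v)                      # extra reward of each start state, relative to s0
```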
References and Acknowledgements
All references were listed in the preface of this series. Once again, my sincere thanks to the reinforcement learning masters who came before!