Contribution
This paper demonstrates that, if there are only a finite number of control intervals remaining, then the optimal payoff function is a piecewise-linear, convex function of the current belief.
(Note: this paper considers the finite-horizon case.)
Assumption
Discrete state, action, and observation spaces; finite horizon:
- the underlying Markov process is a discrete-time finite-state Markov process
- the number of possible outputs at each observation is finite
Properties of the Model
- $p_{ij}^{a}$ — transition function: the probability of making a transition to state $j$ when the process is in state $i$ and action $a$ is selected
- $r_{j\theta}^{a}$ — observation function: the probability of observing $\theta$ after selecting action $a$ and transitioning to state $j$
- $w_{ij\theta}^{a}$ — immediate reward when the process is in state $i$, action $a$ is selected, the process transitions to state $j$, and $\theta$ is observed
- $\pi = [\pi_{1}, \pi_{2}, ..., \pi_{N}]$ — the information vector, where $\pi_{i}$ is the probability that the current internal state is $i$
(The information vector here is essentially the belief space.)
- Conclusion: the current information vector $\pi$ is a sufficient statistic for the past history of observations of a POMDP.
Proof:
First, derive the update rule for the information vector.

$\epsilon(t)$ — the total available information about the process at the end of control interval $t$.

In each control interval, the only new information we obtain is the action $a$ taken and the observation $z$ received, so the update is:
$$\epsilon(t) = [a(t), z(t), \epsilon(t-1)]$$
By the definition of the information vector, $\pi_{j}(t) = \mathrm{Pr}(s(t) = j \mid \epsilon(t))$.

Combining these two equations with Bayes' rule:
$$\pi_{j}(t) = \mathrm{Pr}(s(t) = j \mid a(t), z(t), \epsilon(t-1)) = \frac{\mathrm{Pr}(s(t) = j, a(t), z(t) = \theta, \epsilon(t-1))}{\mathrm{Pr}(a(t), z(t), \epsilon(t-1))}$$
$$= \frac{\mathrm{Pr}(s(t) = j, z(t) = \theta \mid \epsilon(t-1), a(t)) \cdot \mathrm{Pr}(\epsilon(t-1), a(t))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1)) \cdot \mathrm{Pr}(\epsilon(t-1), a(t))} = \frac{\mathrm{Pr}(s(t) = j, z(t) = \theta \mid \epsilon(t-1), a(t))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1))}$$
Expanding the numerator over all possible states $i$ at step $(t-1)$:
$$\pi_{j}(t) = \sum_{i} \frac{\mathrm{Pr}(s(t) = j, z(t) = \theta, s(t-1) = i \mid \epsilon(t-1), a(t))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1))}$$
$$= \sum_{i} \frac{\mathrm{Pr}(s(t-1) = i \mid a(t), \epsilon(t-1)) \cdot \mathrm{Pr}(s(t) = j \mid s(t-1) = i, a(t), \epsilon(t-1)) \cdot \mathrm{Pr}(z(t) = \theta \mid s(t) = j, s(t-1) = i, a(t), \epsilon(t-1))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1))}$$
In the numerator, the second factor is the transition function and the third is the observation function, so the expression can be written as:
$$\pi_{j}(t) = \frac{r_{j\theta}^{a(t)} \sum_{i} \pi_{i}(t-1) p_{ij}^{a(t)}}{\sum_{j} \left[ r_{j\theta}^{a(t)} \sum_{i} \pi_{i}(t-1) p_{ij}^{a(t)} \right]} \tag{1}$$
This equation is important: it is exactly the update rule for $b(s)$. Its key feature is that computing the information vector at time $t$ requires only the information vector at $t-1$. Therefore $\pi(t-1)$ summarizes all the information obtained before $t$ and represents a sufficient statistic for the complete past history of the process, $\epsilon(t-1)$.

Moreover, equation (1) is the transition function of a continuous-state Markov process whose state is $\pi(t)$. For this process, the denominator of (1) is the probability of the transition $\pi(t-1) \rightarrow T(\pi(t-1) \mid a(t), \theta)$. This is a special kind of continuous-state Markov process: the state is continuous, but the state transition probabilities are discrete.

The derivation above also shows that the information vector itself evolves as a discrete-time, continuous-state Markov process. This is a key point.
Equation (1) defines the transition function of the information vector (componentwise in $j$):
$$\pi' = T(\pi \mid a, \theta) = \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}{\sum_{ij} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}$$
A special representation helps illustrate some properties of this transformation:
Represent the space of $\pi$ as an equilateral triangle in which every point is an information vector. For a given $\pi$, the distance from the point to the side opposite vertex $i$ is the probability of being in state $i$ (each vertex corresponds to a pure state), as shown in the figure below.

The transition function can then be understood as a mapping of points within this information space; moreover, each observation corresponds to one such mapping.
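As a concrete illustration, the transformation $T(\pi \mid a, \theta)$ takes only a few lines to implement. The sketch below assumes transition matrices `P[a][i, j]` $= p_{ij}^{a}$ and observation matrices `R[a][j, theta]` $= r_{j\theta}^{a}$; the toy numbers are hypothetical, not from the paper:

```python
import numpy as np

def belief_update(pi, P, R, a, theta):
    """Compute T(pi | a, theta) from equation (1).

    Returns the updated information vector and Pr(theta | pi, a),
    which is the denominator of (1).
    """
    # Numerator of (1): r_{j,theta}^a * sum_i pi_i p_{ij}^a, for each j
    unnormalized = R[a][:, theta] * (pi @ P[a])
    prob_theta = unnormalized.sum()  # Pr(theta | pi, a)
    return unnormalized / prob_theta, prob_theta

# Toy two-state model with one action and two observations (hypothetical)
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
R = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}
pi = np.array([0.5, 0.5])
new_pi, p_obs = belief_update(pi, P, R, a=0, theta=0)
# new_pi sums to 1, i.e. it is again a point in the probability simplex
```

Note that the denominator comes out for free: it is exactly the probability $\Pr(\theta \mid \pi, a)$ needed later in the value-function recursion.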
- Next, we introduce the $\alpha$-vector.
First, define the value function $V_{n}(\pi)$ — the maximum expected reward, where $\pi$ is the current information vector and $n$ is the number of control intervals remaining:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i=1}^{N} \pi_{i} \sum_{j=1}^{N} p_{ij}^{a} \sum_{\theta} r_{j\theta}^{a} \left( w_{ij\theta}^{a} + V_{n-1}(T(\pi \mid a, \theta)) \right) \right]$$
To simplify this formula, define the expected immediate reward:
$$q_{i}^{a} = \sum_{j, \theta} p_{ij}^{a} r_{j\theta}^{a} w_{ij\theta}^{a}$$
The value function then simplifies to:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i=1}^{N} \pi_{i} q_{i}^{a} + \sum_{i, j, \theta} \pi_{i} p_{ij}^{a} r_{j\theta}^{a} V_{n-1}(T(\pi \mid a, \theta)) \right] \tag{2}$$
To write this in matrix form, note that the inner sum over $i, j$ is exactly $\Pr(\theta \mid \pi, a)$, giving:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \pi q^{a} + \sum_{\theta} \Pr(\theta \mid \pi, a)\, V_{n-1}(T(\pi \mid a, \theta)) \right]$$
Now comes the most important result: $V_{n}(\pi)$ is piecewise linear and convex, and can be written as:
$$V_{n}(\pi) = \max_{k} \left[ \sum_{i=1}^{N} \alpha_{i}^{k}(n) \pi_{i} \right] \tag{3}$$
where $\alpha^{k}(n) = [\alpha_{1}^{k}(n), \alpha_{2}^{k}(n), ..., \alpha_{N}^{k}(n)], k = 1, 2, ...$ are the famous $\alpha$-vectors.
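Since (3) is a maximum of finitely many linear functions, evaluating $V_{n}$ at a given $\pi$ is just a maximum of inner products. A minimal sketch with made-up $\alpha$-vectors for a two-state problem:

```python
import numpy as np

def value(pi, alphas):
    """V_n(pi) = max_k <alpha^k(n), pi>  -- equation (3)."""
    return max(float(np.dot(alpha, pi)) for alpha in alphas)

# Hypothetical alpha-vectors; V is the upper envelope of these hyperplanes,
# hence piecewise linear and convex over the belief simplex.
alphas = [np.array([1.0, 0.0]),
          np.array([0.0, 1.0]),
          np.array([0.4, 0.6])]
v_mid = value(np.array([0.5, 0.5]), alphas)     # 0.5
v_vertex = value(np.array([1.0, 0.0]), alphas)  # 1.0
```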
- Proof of the $\alpha$-vector form
By induction, assume that $V_{n-1}(\pi)$ can be written in $\alpha$-vector form; then we only need to show that $V_{n}(\pi)$ has the same form.
Then,
$$V_{n-1}[T(\pi \mid a, \theta)] = \max_{k} \left[ \sum_{j} \alpha_{j}^{k}(n-1) \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}{\sum_{ij} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}} \right] \tag{6}$$
As the figure above suggests, if $V_{n-1}(\cdot)$ is piecewise linear and convex, the information vector space can be partitioned into a finite set of convex regions separated by linear hyperplanes, such that $V_{n-1}(\pi) = \pi \cdot \alpha^{k}(n-1)$ within each region for a single index $k$.
To simplify the proof, define a function $l(\pi, a, \theta)$ equal to the $\alpha$-vector index for the region containing the transformed information vector $T(\pi \mid a, \theta)$, so that:
$$V_{n-1}[T(\pi \mid a, \theta)] = \sum_{j} \alpha_{j}^{l(\pi, a, \theta)}(n-1) \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}{\sum_{ij} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}} \tag{4}$$
Substituting (4) into (2) gives:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i} \pi_{i} \left[ q_{i}^{a} + \sum_{\theta, j} p_{ij}^{a} r_{j\theta}^{a} \alpha_{j}^{l(\pi, a, \theta)}(n-1) \right] \right] \tag{5}$$
It remains to show that (5) has the same form as (3); see the paper for the detailed argument. Two points from the analysis above are worth noting:
- Once the set of $\alpha$-vectors of $V_{n-1}(\cdot)$ has been computed, (5) and (6) yield the optimal policy, as well as the corresponding $\alpha$-vector for any specified information vector $\pi$ in the $n$-horizon case.
- When (5) is used to compute new $\alpha$-vectors, each new $\alpha$-vector comes with an associated optimal action.
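One straightforward (if exponential) way to realize the recursion is to enumerate, for each action, every assignment of a $V_{n-1}$ $\alpha$-vector to each observation; each choice produces one candidate hyperplane of the form appearing in (5). This is not the paper's own algorithm, which instead searches the simplex region by region, but it generates the same set of vectors. The sketch below assumes `P[a]`, `R[a]`, and `Q[a]` hold $p_{ij}^{a}$, $r_{j\theta}^{a}$, and $q_{i}^{a}$ respectively (all names illustrative):

```python
import itertools
import numpy as np

def exact_backup(alphas_prev, P, R, Q, actions, observations):
    """Enumerate candidate alpha-vectors of V_n from those of V_{n-1}.

    For action a and an assignment k_theta of a previous alpha-vector to
    each observation theta, the candidate hyperplane has components
        alpha_i = q_i^a + sum_{theta, j} p_{ij}^a r_{j,theta}^a alpha_j^{k_theta}(n-1).
    Dominated vectors are not pruned in this sketch.
    """
    new_alphas = []
    for a in actions:
        # g[th][k][i] = sum_j p_{ij}^a r_{j,th}^a alpha_j^k
        g = {th: [P[a] @ (R[a][:, th] * alpha) for alpha in alphas_prev]
             for th in observations}
        for choice in itertools.product(range(len(alphas_prev)),
                                        repeat=len(observations)):
            alpha_new = Q[a] + sum(g[th][k] for th, k in zip(observations, choice))
            new_alphas.append((a, alpha_new))
    return new_alphas

# Degenerate check: with V_0 = 0 (a single all-zero alpha-vector), the backup
# returns just the expected immediate reward q^a for each action.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {0: np.array([[0.8, 0.2], [0.3, 0.7]])}
Q = {0: np.array([1.0, 0.0])}
out = exact_backup([np.zeros(2)], P, R, Q, actions=[0], observations=[0, 1])
```

Keeping the action alongside each generated vector records the mapping from $\alpha$-vectors to optimal actions noted in the second bullet above.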
Examples
A machine contains two identical internal components whose states are independent; each performs an operation on the product before it is output. We model the machine as a three-state, discrete-time Markov process, where the states correspond to 0, 1, or 2 broken components. During each production run, each component breaks with probability 0.1.

If a component is broken, each operation it performs damages the product with probability 0.5. Therefore, if the finished product is inspected after production, the probabilities of observing a good product from states $[0, 1, 2]$ are $[1.0, 0.5, 0.25]$.

A good product earns a profit of 1; a defective one earns nothing. The expected profit of a production run started in states $[0, 1, 2]$ (components may also break during the run) is $[0.9025, 0.475, 0.25]$.
Action space:
- Produce without examining the resulting product
- Produce and examine the resulting product, at a cost of 0.25
- Stop production and inspect both components, replacing any broken ones; the inspection costs 0.5, and each replacement costs 1
- Replace both components without inspection, at a cost of 2
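The expected-profit figures above follow directly from the stated failure probabilities; the short check below derives them (the transition matrix over the number of broken components is built from the 0.1 failure probability, not copied from the paper):

```python
import numpy as np

# State = number of broken components (0, 1, 2); each working component
# breaks with probability 0.1 during a production run.
P = np.array([
    [0.81, 0.18, 0.01],  # both survive / exactly one breaks / both break
    [0.0,  0.9,  0.1],   # the remaining working component survives or breaks
    [0.0,  0.0,  1.0],   # both already broken
])
# Probability of a good product given the state reached *after* the run
good = np.array([1.0, 0.5, 0.25])
# Expected profit (a good product earns 1) for each starting state
expected_profit = P @ good  # [0.9025, 0.475, 0.25]
```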
An Algorithm for Computing $V_{n}(\pi)$
To compute $V_{n}(\pi)$, the main tasks of the algorithm are:
- computing the α \alpha α-vectors
- the corresponding mapping of these vectors onto the set of actions.
Assume the $\alpha^{k}(n-1)$ are known; the goal is then to compute the $\alpha^{k}(n)$.