[1971] The Optimal Control of Partially Observable Markov Processes over a Finite Horizon


This paper demonstrates that, if there are only a finite number of control intervals remaining, then the optimal payoff function is a piecewise-linear, convex function of the current belief.
(Note: this paper considers only the finite-horizon case.)


Assumptions: discrete state, action, and observation spaces; finite horizon. In particular:

  1. the underlying Markov process is a discrete-time finite-state Markov process
  2. the number of possible outputs at each observation is finite

Properties of the Model

  • First, define the POMDP:

$p_{ij}^{a}$: transition function, the probability of moving to state $j$ when action $a$ is selected in state $i$
$r_{j\theta}^{a}$: observation function, the probability of observing $\theta$ when action $a$ is selected and the process moves to state $j$
$w_{ij\theta}^{a}$: immediate reward received when action $a$ is selected in state $i$, the process moves to state $j$, and $\theta$ is observed

  • Next, introduce the information vector

$\pi = [\pi_{1}, \pi_{2}, \ldots, \pi_{N}]$, where $\pi_{i}$ is the probability that the current internal state is $i$
(The information vector here seems to be the same thing as the belief, i.e. a point in belief space.)

  • Conclusion: the current information-state vector $\pi$ is a sufficient statistic for the past history of observations of a POMDP

    First, derive the update formula for the information-state vector.
    $\epsilon(t)$: the total available information about the process at the end of control interval $t$
    In each control interval, the only new information is the action taken, $a(t)$, and the observation received, $z(t)$, so the information accumulates as:
    $$\epsilon(t) = [a(t), z(t), \epsilon(t-1)]$$

    By the definition of the information vector: $\pi_{j}(t) = \Pr(s(t) = j \mid \epsilon(t))$

    Combining the two expressions above with Bayes' rule, for an observed value $z(t) = \theta$:
    $$\begin{aligned}
    \pi_{j}(t) &= \Pr(s(t) = j \mid a(t), z(t) = \theta, \epsilon(t-1)) \\
    &= \frac{\Pr(s(t) = j, a(t), z(t) = \theta, \epsilon(t-1))}{\Pr(a(t), z(t) = \theta, \epsilon(t-1))} \\
    &= \frac{\Pr(s(t) = j, z(t) = \theta \mid \epsilon(t-1), a(t)) \cdot \Pr(\epsilon(t-1), a(t))}{\Pr(z(t) = \theta \mid a(t), \epsilon(t-1)) \cdot \Pr(\epsilon(t-1), a(t))} \\
    &= \frac{\Pr(s(t) = j, z(t) = \theta \mid \epsilon(t-1), a(t))}{\Pr(z(t) = \theta \mid a(t), \epsilon(t-1))}
    \end{aligned}$$

    Expand the numerator over all possible states at step $t-1$:
    $$\begin{aligned}
    \pi_{j}(t) &= \sum_{i} \frac{\Pr(s(t) = j, z(t) = \theta, s(t-1) = i \mid \epsilon(t-1), a(t))}{\Pr(z(t) = \theta \mid a(t), \epsilon(t-1))} \\
    &= \sum_{i} \frac{\Pr(s(t-1) = i \mid a(t), \epsilon(t-1)) \cdot \Pr(s(t) = j \mid s(t-1) = i, a(t), \epsilon(t-1)) \cdot \Pr(z(t) = \theta \mid s(t) = j, s(t-1) = i, a(t), \epsilon(t-1))}{\Pr(z(t) = \theta \mid a(t), \epsilon(t-1))}
    \end{aligned}$$

    In the numerator above, the first factor is $\pi_{i}(t-1)$ (the choice of $a(t)$ adds no information about $s(t-1)$ beyond $\epsilon(t-1)$), the second factor is the transition function, and the third is the observation function. The denominator is the numerator summed over $j$. The expression can therefore be written as:
    $$\pi_{j}(t) = \frac{r_{j \theta}^{a(t)} \sum_{i} \pi_{i}(t-1) p_{ij}^{a(t)}}{\sum_{j} \left[ r_{j \theta}^{a(t)} \sum_{i} \pi_{i}(t-1) p_{ij}^{a(t)} \right]} \tag{1}$$

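    Equation (1) is important: it is exactly the update rule for the belief $b(s)$. Its key property is that computing the information vector at time $t$ requires only the information vector at time $t-1$ (together with the latest action and observation). Therefore $\pi(t-1)$ summarizes all information available before time $t$ and represents a sufficient statistic for the complete past history of the process, $\epsilon(t-1)$.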
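    The update in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `belief_update`, the nested-list layout (`P_a[i][j]` for $p_{ij}^{a}$, `R_a[j][theta]` for $r_{j\theta}^{a}$), and the observation indices (0 = good product, 1 = defective product) are choices made here, and the numbers are derived from the machine example later in these notes for an action that manufactures and then inspects the product.

```python
def belief_update(pi, P_a, R_a, theta):
    """Apply eq. (1) for one fixed action a.

    pi    : current information vector, pi[i] = Pr(state = i)
    P_a   : P_a[i][j] = p_ij^a, transition probabilities under action a
    R_a   : R_a[j][theta] = r_jtheta^a, observation probabilities under action a
    theta : index of the observation actually received
    Returns (new information vector, Pr(theta | pi, a)).
    """
    n = len(pi)
    # Numerator of (1) for every j: r_jtheta^a * sum_i pi_i * p_ij^a
    unnorm = [R_a[j][theta] * sum(pi[i] * P_a[i][j] for i in range(n))
              for j in range(n)]
    # Denominator of (1): the numerator summed over j
    prob_theta = sum(unnorm)
    if prob_theta == 0.0:
        raise ValueError("observation has zero probability for this belief and action")
    return [u / prob_theta for u in unnorm], prob_theta


# Assumed numbers: states are 0, 1, 2 failed components; each working
# component fails with probability 0.1 during one production run.
P_a = [[0.81, 0.18, 0.01],
       [0.00, 0.90, 0.10],
       [0.00, 0.00, 1.00]]
# Columns: theta = 0 (good product), theta = 1 (defective product).
R_a = [[1.00, 0.00],
       [0.50, 0.50],
       [0.25, 0.75]]

pi = [1.0, 0.0, 0.0]  # start certain that no component has failed
new_pi, prob = belief_update(pi, P_a, R_a, theta=1)
print(new_pi, prob)
```

    Returning the normalizer alongside the new vector is a design choice made here because the same quantity, $\Pr(\theta \mid \pi, a)$, reappears as the weight of each successor belief in the value recursion.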

    In addition, (1) is the transition function of a continuous-state Markov process whose state is $\pi(t)$. For this process, the denominator of (1) is the probability of the transition $\pi(t-1) \rightarrow T(\pi(t-1) \mid a(t), \theta)$. This is a special case of a continuous-state Markov process: the state is continuous, but the state-transition probabilities are discrete, because each belief has only finitely many successors, one per observation.

    The derivation above also shows that the dynamics of the information vector itself form a discrete-time, continuous-state Markov process. This is a key point.

    Equation (1) defines the transition function of the information vector, written componentwise as
    $$\pi'_{j} = [T(\pi \mid a, \theta)]_{j} = \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}}{\sum_{i,j} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}}$$
    A geometric representation helps illustrate some properties of this transformation:

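    For three states, the space of $\pi$ can be drawn as an equilateral triangle in which every point is an information vector. For each information vector $\pi$, the distance from the point to the side opposite the $i$-th vertex gives the probability of being in state $i$ (for a triangle of unit height; vertex $i$ corresponds to being in state $i$ with probability 1).

    [Figure: equilateral-triangle representation of the information-vector space]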


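    The transition function above can therefore be understood as a transformation of points within the information-vector space. For a given action, each observation corresponds to one such transformation.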

  • Next, introduce the $\alpha$-vector

    First define the value function $V_{n}(\pi)$: the maximum expected reward when $\pi$ is the current information vector and $n$ control intervals remain:
    $$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i=1}^{N} \pi_{i} \sum_{j=1}^{N} p_{ij}^{a} \sum_{\theta} r_{j \theta}^{a} \left( w_{ij\theta}^{a} + V_{n-1}(T(\pi \mid a, \theta)) \right) \right]$$

    To simplify this formula, define the expected immediate reward:
    $$q_{i}^{a} = \sum_{j, \theta} p_{ij}^{a} r_{j \theta}^{a} w_{ij\theta}^{a}$$
    The recursion then becomes:
    $$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i=1}^{N} \pi_{i} q_{i}^{a} + \sum_{i, j, \theta} \pi_{i} p_{ij}^{a} r_{j \theta}^{a} V_{n-1}(T(\pi \mid a, \theta)) \right] \tag{2}$$

    To write the recursion in vector form, define $\Pr(\theta \mid \pi, a) = \sum_{i,j} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}$, the probability of observing $\theta$ after taking action $a$ from information vector $\pi$ (this is the denominator of $T(\pi \mid a, \theta)$). Then:
    $$V_{n}(\pi) = \max_{a \in A(n)} \left[ \pi q^{a} + \sum_{\theta} \Pr(\theta \mid \pi, a) \, V_{n-1}(T(\pi \mid a, \theta)) \right]$$
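    The two quantities that drive the recursion, the expected immediate reward $q_{i}^{a}$ and the observation probability $\Pr(\theta \mid \pi, a)$, can be computed directly from their definitions. The sketch below is illustrative only: the function names and the nested-list layout (`W_a[i][j][theta]` for $w_{ij\theta}^{a}$) are assumptions made here, and the reward numbers assume an action that manufactures a product worth 1 if good and always pays an inspection cost of 0.25.

```python
def expected_immediate_reward(P_a, R_a, W_a):
    """q_i^a = sum over j, theta of p_ij^a * r_jtheta^a * w_ijtheta^a."""
    n = len(P_a)
    m = len(R_a[0])
    return [sum(P_a[i][j] * R_a[j][th] * W_a[i][j][th]
                for j in range(n) for th in range(m))
            for i in range(n)]


def observation_probability(pi, P_a, R_a, theta):
    """Pr(theta | pi, a) = sum over i, j of pi_i * p_ij^a * r_jtheta^a."""
    n = len(pi)
    return sum(pi[i] * P_a[i][j] * R_a[j][theta]
               for i in range(n) for j in range(n))


# Assumed model: three states (0, 1, 2 failed components), two observations
# (theta = 0 good product, theta = 1 defective product).
P_a = [[0.81, 0.18, 0.01],
       [0.00, 0.90, 0.10],
       [0.00, 0.00, 1.00]]
R_a = [[1.00, 0.00],
       [0.50, 0.50],
       [0.25, 0.75]]
# Assumed rewards: 1 - 0.25 for a good product, -0.25 for a defective one,
# independent of the states i and j.
W_a = [[[0.75, -0.25] for _ in range(3)] for _ in range(3)]

q_a = expected_immediate_reward(P_a, R_a, W_a)
pi = [1.0, 0.0, 0.0]
obs_probs = [observation_probability(pi, P_a, R_a, th) for th in range(2)]
print(q_a, obs_probs)
```

    The observation probabilities for a fixed belief and action sum to 1, which is a quick sanity check on any model entered this way.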

    This leads to the central result: $V_{n}(\pi)$ is piecewise linear and convex, and can be written as
    $$V_{n}(\pi) = \max_{k} \left[ \sum_{i=1}^{N} \alpha_{i}^{k}(n) \pi_{i} \right] \tag{3}$$
    where $\alpha^{k}(n) = [\alpha_{1}^{k}(n), \alpha_{2}^{k}(n), \ldots, \alpha_{N}^{k}(n)]$, $k = 1, 2, \ldots$, are the well-known $\alpha$-vectors.
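    Evaluating (3) is just a maximum over dot products. A minimal sketch follows, with a hypothetical function name `value` and three made-up $\alpha$-vectors for a two-state problem (they are not taken from the paper); it also spot-checks convexity at one midpoint.

```python
def value(pi, alphas):
    """Evaluate eq. (3): return (V(pi), index k of the maximizing alpha-vector)."""
    def dot(alpha):
        return sum(a * p for a, p in zip(alpha, pi))
    best_k = max(range(len(alphas)), key=lambda k: dot(alphas[k]))
    return dot(alphas[best_k]), best_k


# Hypothetical alpha-vectors for a two-state problem (illustration only).
alphas = [[1.0, 0.0],
          [0.0, 1.0],
          [0.6, 0.6]]

v_left, k_left = value([1.0, 0.0], alphas)    # corner of the simplex
v_right, k_right = value([0.0, 1.0], alphas)  # opposite corner
v_mid, k_mid = value([0.5, 0.5], alphas)      # midpoint between them

# Convexity at the midpoint: V(mid) <= average of the endpoint values.
assert v_mid <= 0.5 * (v_left + v_right)
print(v_left, k_left, v_right, k_right, v_mid, k_mid)
```

    Each index `k` owns the region of the simplex where its vector attains the maximum, which is the partition into convex regions used in the proof.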

  • Proof of the $\alpha$-vector representation

    The proof is by induction. Assume that $V_{n-1}(\pi)$ can be written in the $\alpha$-vector form; it then suffices to show that $V_{n}(\pi)$ has the same form.
    Under this assumption,
    $$V_{n-1}[T(\pi \mid a, \theta)] = \max_{k} \left[ \sum_{j} \alpha_{j}^{k}(n-1) \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}}{\sum_{i,j} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}} \right] \tag{6}$$

    Geometrically, if $V_{n-1}(\cdot)$ is piecewise linear and convex, the information-vector space can be partitioned into a finite set of convex regions separated by linear hyperplanes such that $V_{n-1}(\pi) = \pi \cdot \alpha^{k}(n-1)$ within a region for a single index $k$.

    For convenience in the proof, define a function $l(\pi, a, \theta)$ that is equal to the $\alpha$-vector index for the region containing the transformed information vector $T(\pi \mid a, \theta)$, so that
    $$V_{n-1}[T(\pi \mid a, \theta)] = \sum_{j} \alpha_{j}^{l(\pi, a, \theta)}(n-1) \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}}{\sum_{i,j} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}} \tag{4}$$

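    Substituting (4) into (2), the denominator of $T(\pi \mid a, \theta)$ cancels against the weight $\sum_{i,j} \pi_{i} p_{ij}^{a} r_{j \theta}^{a}$, which gives:
    $$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i} \pi_{i} \left[ q_{i}^{a} + \sum_{\theta, j} p_{ij}^{a} r_{j \theta}^{a} \alpha_{j}^{l(\pi, a, \theta)}(n-1) \right] \right] \tag{5}$$
    Wherever the maximizing action and the indices $l(\pi, a, \theta)$ stay fixed, the inner bracket is a constant vector, so $V_{n}$ is linear in $\pi$ on that region; as a maximum of linear functions, it is piecewise linear and convex.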


    1. Once the set of $\alpha$-vectors for $V_{n-1}(\cdot)$ has been computed, (5) and (6) give the optimal policy, as well as the corresponding $\alpha$-vector for any specified information vector $\pi$ in the $n$-horizon case.
    2. Computing a new $\alpha$-vector with (5) also yields the optimal action associated with that $\alpha$-vector.
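    The two remarks above can be illustrated with a sketch of one backup at a single information vector: for each action, pick the index $l(\pi, a, \theta)$ for every observation, assemble the bracket in (5), and keep the best action. Everything here is an assumption made for illustration: the function names, the nested-list layout (`P[a][i][j]`, `R[a][j][theta]`, `q[a][i]`), and the toy two-action model. In particular, the "replace" action (move to state 0 with certainty, cost 2, uninformative observation) is a simplification invented for this example and is not the paper's model.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def backup(pi, P, R, q, alphas_prev):
    """Evaluate eq. (5) at pi: return (V_n(pi), alpha-vector, optimal action)."""
    n = len(pi)
    best = None
    for a in range(len(P)):
        alpha = list(q[a])  # start from the immediate rewards q_i^a
        for theta in range(len(R[a][0])):
            # Unnormalized T(pi | a, theta). Normalizing only rescales by a
            # positive constant, so the maximizing index l is unchanged.
            succ = [R[a][j][theta] * sum(pi[i] * P[a][i][j] for i in range(n))
                    for j in range(n)]
            l = max(range(len(alphas_prev)), key=lambda k: dot(alphas_prev[k], succ))
            for i in range(n):
                alpha[i] += sum(P[a][i][j] * R[a][j][theta] * alphas_prev[l][j]
                                for j in range(n))
        v = dot(pi, alpha)
        if best is None or v > best[0]:
            best = (v, alpha, a)
    return best


# Toy model (assumed): action 0 = manufacture and inspect the product,
# action 1 = replace both components (simplified, see the lead-in).
P = [[[0.81, 0.18, 0.01], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
     [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
R = [[[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]],
     [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]]
q = [[0.6525, 0.225, 0.0],
     [-2.0, -2.0, -2.0]]

pi = [1.0, 0.0, 0.0]
# One interval remaining: V_0 = 0, represented by a single zero alpha-vector.
v1, alpha1, a1 = backup(pi, P, R, q, [[0.0, 0.0, 0.0]])
# Two intervals remaining: with V_0 = 0 the horizon-1 alpha-vectors are the q^a.
v2, alpha2, a2 = backup(pi, P, R, q, [q[0], q[1]])
print(v1, a1, v2, a2)
```

    A full algorithm must also decide at which information vectors to run this backup so that every $\alpha$-vector is found; this sketch only covers the evaluation at one given $\pi$.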


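As an example, consider a machine with two identical internal components. The components' states are independent of each other, and each component operates on the product before it is finished. The machine is modeled as a three-state, discrete-time Markov process whose states correspond to 0, 1, or 2 failed components. Each component has a probability of 0.1 of failing during the manufacture of a product.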

During manufacture, a failed component damages the product with probability 0.5. So if the product's quality is inspected after manufacture, the probabilities of observing a good product in states $[0, 1, 2]$ are $[1.0, 0.5, 0.25]$.

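A good product yields a profit of 1 and a defective product yields no profit. The expected profit of starting a production run in states $[0, 1, 2]$ (note that components may also fail during the run) is therefore $[0.9025, 0.475, 0.25]$; for example, starting with one failed component gives $0.9 \times 0.5 + 0.1 \times 0.25 = 0.475$.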
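The expected-profit figures can be checked from first principles. The sketch below assumes, as the description suggests, that failed components stay failed during the run, that each working component fails independently with probability 0.1, and that the product's quality depends on the number of failed components after those failures; the function name `expected_profit` is a label chosen here. Under these assumptions the value for one initially failed component comes out as 0.475.

```python
from math import comb

FAIL = 0.1                  # per-component failure probability during one run
P_GOOD = [1.0, 0.5, 0.25]   # Pr(good product | number of failed components)


def expected_profit(start_failed, n_components=2):
    """Expected profit of one production run (good product = 1, defective = 0)."""
    working = n_components - start_failed
    total = 0.0
    for new_failures in range(working + 1):
        # Binomial probability that exactly new_failures working components fail.
        prob = (comb(working, new_failures)
                * FAIL ** new_failures
                * (1 - FAIL) ** (working - new_failures))
        total += prob * P_GOOD[start_failed + new_failures]
    return total


profits = [expected_profit(s) for s in range(3)]
print([round(p, 4) for p in profits])
```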


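Four actions are available:

  1. Manufacture without inspecting the finished product
  2. Manufacture and inspect the finished product, at an inspection cost of 0.25
  3. Stop production and inspect both components, replacing any that have failed; replacing a component costs 1 and inspecting both components costs 0.5
  4. Replace both components without inspecting them, at a total cost of 2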

An Algorithm for Computing $V_{n}(\pi)$

To compute $V_{n}(\pi)$, the algorithm has two main tasks:

  1. computing the $\alpha$-vectors
  2. the corresponding mapping of these vectors onto the set of actions.

Assume the vectors $\alpha^{k}(n-1)$ are known; the goal is to compute the vectors $\alpha^{k}(n)$.





