Contribution
This paper demonstrates that, if there are only a finite number of control intervals remaining, then the optimal payoff function is a piecewise-linear, convex function of the current belief.
(Note: this paper considers the finite-horizon case.)
Assumption
Discrete state, action, and observation spaces; finite horizon:
- the underlying Markov process is a discrete-time finite-state Markov process
- the number of possible outputs at each observation is finite
Properties of the Model
- $p_{ij}^{a}$ — transition function: the probability of making a transition to state $j$ when the process is in state $i$ and action $a$ is selected
- $r_{j\theta}^{a}$ — observation function: the probability of observing $\theta$ after selecting action $a$ and transitioning to state $j$
- $w_{ij\theta}^{a}$ — immediate reward when the process is in state $i$, action $a$ is selected, the process transitions to state $j$, and $\theta$ is observed
- $\pi = [\pi_{1}, \pi_{2}, ..., \pi_{N}]$ — the information vector, where $\pi_{i}$ is the probability that the current internal state is $i$
(The information vector here is essentially the belief space.)
- Conclusion: the current information vector $\pi$ is a sufficient statistic for the past history of observations of a POMDP.
Proof:
First, derive the update rule for the information vector.

$\epsilon(t)$ — the total available information about the process at the end of control interval $t$.

In each control interval, the only new information we obtain is the action $a$ taken and the observation $z$ received, so the update is:
$$\epsilon(t) = [a(t), z(t), \epsilon(t-1)]$$
By the definition of the information vector, $\pi_{j}(t) = \mathrm{Pr}(s(t) = j \mid \epsilon(t))$.

Combining these two equations with Bayes' rule:
$$\pi_{j}(t) = \mathrm{Pr}(s(t) = j \mid a(t), z(t), \epsilon(t-1)) = \frac{\mathrm{Pr}(s(t) = j, a(t), z(t) = \theta, \epsilon(t-1))}{\mathrm{Pr}(a(t), z(t), \epsilon(t-1))}$$
$$= \frac{\mathrm{Pr}(s(t) = j, z(t) = \theta \mid \epsilon(t-1), a(t)) \cdot \mathrm{Pr}(\epsilon(t-1), a(t))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1)) \cdot \mathrm{Pr}(\epsilon(t-1), a(t))} = \frac{\mathrm{Pr}(s(t) = j, z(t) = \theta \mid \epsilon(t-1), a(t))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1))}$$
Expanding the numerator over all possible states $i$ at step $(t-1)$:
$$\pi_{j}(t) = \sum_{i} \frac{\mathrm{Pr}(s(t) = j, z(t) = \theta, s(t-1) = i \mid \epsilon(t-1), a(t))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1))}$$
$$= \sum_{i} \frac{\mathrm{Pr}(s(t-1) = i \mid a(t), \epsilon(t-1)) \cdot \mathrm{Pr}(s(t) = j \mid s(t-1) = i, a(t), \epsilon(t-1)) \cdot \mathrm{Pr}(z(t) = \theta \mid s(t) = j, s(t-1) = i, a(t), \epsilon(t-1))}{\mathrm{Pr}(z(t) \mid a(t), \epsilon(t-1))}$$
In the numerator, the second factor is the transition function and the third is the observation function, so the expression can be written as:
$$\pi_{j}(t) = \frac{r_{j\theta}^{a(t)} \sum_{i} \pi_{i}(t-1) p_{ij}^{a(t)}}{\sum_{j} \left[ r_{j\theta}^{a(t)} \sum_{i} \pi_{i}(t-1) p_{ij}^{a(t)} \right]} \tag{1}$$
This equation is important: it is exactly the update rule for $b(s)$. Its key feature is that computing the information vector at time $t$ requires only the information vector at $t-1$. Therefore $\pi(t-1)$ summarizes all the information obtained before $t$ and represents a sufficient statistic for the complete past history of the process, $\epsilon(t-1)$.

Moreover, equation (1) is the transition function of a continuous-state Markov process whose state is $\pi(t)$. For this process, the denominator of (1) is the probability of the transition $\pi(t-1) \rightarrow T(\pi(t-1) \mid a(t), \theta)$. This is a special kind of continuous-state Markov process: the state is continuous, but the state transition probabilities are discrete.

The derivation above also shows that the information vector itself evolves as a discrete-time, continuous-state Markov process. This is a key point.
Equation (1) defines the transition function of the information vector (componentwise in $j$):
$$\pi' = T(\pi \mid a, \theta) = \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}{\sum_{ij} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}$$
A special representation helps illustrate some properties of this transformation:
Represent the space of $\pi$ as an equilateral triangle in which every point is an information vector. For a given $\pi$, the distance from the point to the side opposite vertex $i$ is the probability of being in state $i$ (each vertex corresponds to a pure state), as shown in the figure below.

The transition function can then be understood as a mapping of points within this information space; moreover, each observation corresponds to one such mapping.
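As a concrete illustration, the transformation $T(\pi \mid a, \theta)$ takes only a few lines to implement. The sketch below assumes transition matrices `P[a][i, j]` $= p_{ij}^{a}$ and observation matrices `R[a][j, theta]` $= r_{j\theta}^{a}$; the toy numbers are hypothetical, not from the paper:

```python
import numpy as np

def belief_update(pi, P, R, a, theta):
    """Compute T(pi | a, theta) from equation (1).

    Returns the updated information vector and Pr(theta | pi, a),
    which is the denominator of (1).
    """
    # Numerator of (1): r_{j,theta}^a * sum_i pi_i p_{ij}^a, for each j
    unnormalized = R[a][:, theta] * (pi @ P[a])
    prob_theta = unnormalized.sum()  # Pr(theta | pi, a)
    return unnormalized / prob_theta, prob_theta

# Toy two-state model with one action and two observations (hypothetical)
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
R = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}
pi = np.array([0.5, 0.5])
new_pi, p_obs = belief_update(pi, P, R, a=0, theta=0)
# new_pi sums to 1, i.e. it is again a point in the probability simplex
```

Note that the denominator comes out for free: it is exactly the probability $\Pr(\theta \mid \pi, a)$ needed later in the value-function recursion.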
- Next, we introduce the $\alpha$-vector.
First, define the value function $V_{n}(\pi)$ — the maximum expected reward, where $\pi$ is the current information vector and $n$ is the number of control intervals remaining:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i=1}^{N} \pi_{i} \sum_{j=1}^{N} p_{ij}^{a} \sum_{\theta} r_{j\theta}^{a} \left( w_{ij\theta}^{a} + V_{n-1}(T(\pi \mid a, \theta)) \right) \right]$$
To simplify this formula, define the expected immediate reward:
$$q_{i}^{a} = \sum_{j, \theta} p_{ij}^{a} r_{j\theta}^{a} w_{ij\theta}^{a}$$
The value function then simplifies to:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i=1}^{N} \pi_{i} q_{i}^{a} + \sum_{i, j, \theta} \pi_{i} p_{ij}^{a} r_{j\theta}^{a} V_{n-1}(T(\pi \mid a, \theta)) \right] \tag{2}$$
To write this in matrix form, note that the inner sum over $i, j$ is exactly $\Pr(\theta \mid \pi, a)$, giving:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \pi q^{a} + \sum_{\theta} \Pr(\theta \mid \pi, a)\, V_{n-1}(T(\pi \mid a, \theta)) \right]$$
Now comes the most important result: $V_{n}(\pi)$ is piecewise linear and convex, and can be written as:
$$V_{n}(\pi) = \max_{k} \left[ \sum_{i=1}^{N} \alpha_{i}^{k}(n) \pi_{i} \right] \tag{3}$$
where $\alpha^{k}(n) = [\alpha_{1}^{k}(n), \alpha_{2}^{k}(n), ..., \alpha_{N}^{k}(n)], k = 1, 2, ...$ are the famous $\alpha$-vectors.
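Since (3) is a maximum of finitely many linear functions, evaluating $V_{n}$ at a given $\pi$ is just a maximum of inner products. A minimal sketch with made-up $\alpha$-vectors for a two-state problem:

```python
import numpy as np

def value(pi, alphas):
    """V_n(pi) = max_k <alpha^k(n), pi>  -- equation (3)."""
    return max(float(np.dot(alpha, pi)) for alpha in alphas)

# Hypothetical alpha-vectors; V is the upper envelope of these hyperplanes,
# hence piecewise linear and convex over the belief simplex.
alphas = [np.array([1.0, 0.0]),
          np.array([0.0, 1.0]),
          np.array([0.4, 0.6])]
v_mid = value(np.array([0.5, 0.5]), alphas)     # 0.5
v_vertex = value(np.array([1.0, 0.0]), alphas)  # 1.0
```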
- Proof of the $\alpha$-vector form
By induction, assume that $V_{n-1}(\pi)$ can be written in $\alpha$-vector form; then we only need to show that $V_{n}(\pi)$ has the same form.
Then,
$$V_{n-1}[T(\pi \mid a, \theta)] = \max_{k} \left[ \sum_{j} \alpha_{j}^{k}(n-1) \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}{\sum_{ij} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}} \right] \tag{6}$$
As the figure above suggests, if $V_{n-1}(\cdot)$ is piecewise linear and convex, the information vector space can be partitioned into a finite set of convex regions separated by linear hyperplanes, such that $V_{n-1}(\pi) = \pi \cdot \alpha^{k}(n-1)$ within each region for a single index $k$.
To simplify the proof, define a function $l(\pi, a, \theta)$ equal to the $\alpha$-vector index for the region containing the transformed information vector $T(\pi \mid a, \theta)$, so that:
$$V_{n-1}[T(\pi \mid a, \theta)] = \sum_{j} \alpha_{j}^{l(\pi, a, \theta)}(n-1) \frac{\sum_{i} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}}{\sum_{ij} \pi_{i} p_{ij}^{a} r_{j\theta}^{a}} \tag{4}$$
Substituting (4) into (2) gives:
$$V_{n}(\pi) = \max_{a \in A(n)} \left[ \sum_{i} \pi_{i} \left[ q_{i}^{a} + \sum_{\theta, j} p_{ij}^{a} r_{j\theta}^{a} \alpha_{j}^{l(\pi, a, \theta)}(n-1) \right] \right] \tag{5}$$
It remains to show that (5) has the same form as (3); see the paper for the detailed argument. Two points from the analysis above are worth noting:
- Once the set of $\alpha$-vectors of $V_{n-1}(\cdot)$ has been computed, (5) and (6) yield the optimal policy, as well as the corresponding $\alpha$-vector for any specified information vector $\pi$ in the $n$-horizon case.
- When (5) is used to compute new $\alpha$-vectors, each new $\alpha$-vector comes with an associated optimal action.
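One straightforward (if exponential) way to realize the recursion is to enumerate, for each action, every assignment of a $V_{n-1}$ $\alpha$-vector to each observation; each choice produces one candidate hyperplane of the form appearing in (5). This is not the paper's own algorithm, which instead searches the simplex region by region, but it generates the same set of vectors. The sketch below assumes `P[a]`, `R[a]`, and `Q[a]` hold $p_{ij}^{a}$, $r_{j\theta}^{a}$, and $q_{i}^{a}$ respectively (all names illustrative):

```python
import itertools
import numpy as np

def exact_backup(alphas_prev, P, R, Q, actions, observations):
    """Enumerate candidate alpha-vectors of V_n from those of V_{n-1}.

    For action a and an assignment k_theta of a previous alpha-vector to
    each observation theta, the candidate hyperplane has components
        alpha_i = q_i^a + sum_{theta, j} p_{ij}^a r_{j,theta}^a alpha_j^{k_theta}(n-1).
    Dominated vectors are not pruned in this sketch.
    """
    new_alphas = []
    for a in actions:
        # g[th][k][i] = sum_j p_{ij}^a r_{j,th}^a alpha_j^k
        g = {th: [P[a] @ (R[a][:, th] * alpha) for alpha in alphas_prev]
             for th in observations}
        for choice in itertools.product(range(len(alphas_prev)),
                                        repeat=len(observations)):
            alpha_new = Q[a] + sum(g[th][k] for th, k in zip(observations, choice))
            new_alphas.append((a, alpha_new))
    return new_alphas

# Degenerate check: with V_0 = 0 (a single all-zero alpha-vector), the backup
# returns just the expected immediate reward q^a for each action.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {0: np.array([[0.8, 0.2], [0.3, 0.7]])}
Q = {0: np.array([1.0, 0.0])}
out = exact_backup([np.zeros(2)], P, R, Q, actions=[0], observations=[0, 1])
```

Keeping the action alongside each generated vector records the mapping from $\alpha$-vectors to optimal actions noted in the second bullet above.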
Examples
A machine contains two identical internal components whose states are independent; each performs an operation on the product before it is output. We model the machine as a three-state, discrete-time Markov process, where the states correspond to 0, 1, or 2 broken components. During each production run, each component breaks with probability 0.1.

If a component is broken, each operation it performs damages the product with probability 0.5. Therefore, if the finished product is inspected after production, the probabilities of observing a good product from states $[0, 1, 2]$ are $[1.0, 0.5, 0.25]$.

A good product earns a profit of 1; a defective one earns nothing. The expected profit of a production run started in states $[0, 1, 2]$ (components may also break during the run) is $[0.9025, 0.475, 0.25]$.
Action space:
- Produce without examining the resulting product
- Produce and examine the resulting product, at a cost of 0.25
- Stop production and inspect both components, replacing any broken ones; the inspection costs 0.5, and each replacement costs 1
- Replace both components without inspection, at a cost of 2
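The expected-profit figures above follow directly from the stated failure probabilities; the short check below derives them (the transition matrix over the number of broken components is built from the 0.1 failure probability, not copied from the paper):

```python
import numpy as np

# State = number of broken components (0, 1, 2); each working component
# breaks with probability 0.1 during a production run.
P = np.array([
    [0.81, 0.18, 0.01],  # both survive / exactly one breaks / both break
    [0.0,  0.9,  0.1],   # the remaining working component survives or breaks
    [0.0,  0.0,  1.0],   # both already broken
])
# Probability of a good product given the state reached *after* the run
good = np.array([1.0, 0.5, 0.25])
# Expected profit (a good product earns 1) for each starting state
expected_profit = P @ good  # [0.9025, 0.475, 0.25]
```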
An Algorithm for Computing $V_{n}(\pi)$
To compute $V_{n}(\pi)$, the main tasks of the algorithm are:
- computing the α \alpha α-vectors
- the corresponding mapping of these vectors onto the set of actions.
Assume the $\alpha^{k}(n-1)$ are known; the goal is then to compute the $\alpha^{k}(n)$.