Preface
Before we officially begin, I’d like to give everyone a sense of the general style of this article, as well as the main areas of mathematics we’ll be using. Let’s start with a fundamental theorem from mathematical analysis.
A commonly used inequality
$$-x > \ln(1 - x), \quad 0 < x < 1$$
Proof: Let $f(x) = \ln(1 - x) + x$ for $0 < x < 1$. Then $f(0) = 0$.
$$f'(x) = \frac{-1}{1 - x} + 1 = \frac{x}{x - 1} < 0$$
Since $f$ is strictly decreasing on $(0,1)$ and $f(0) = 0$, we have $f(x) < 0$ there. Hence $-x > \ln(1 - x)$ for $0 < x < 1$. Q.E.D.
Fundamental Theorem
If $a_n > -1$, then
$$\prod_{n=1}^\infty (1 + a_n) = 0 \;\Leftrightarrow\; \sum_{n=1}^\infty \ln(1 + a_n) = -\infty$$
Proof: Let $P_k = \prod_{n=1}^k (1 + a_n)$. Then
$$\ln P_k = \ln\left(\prod_{n=1}^k (1 + a_n)\right) = \sum_{n=1}^k \ln(1 + a_n)$$
Thus,
$$\sum_{n=1}^\infty \ln(1 + a_n) = -\infty \;\Leftrightarrow\; \lim_{k \to \infty} \sum_{n=1}^k \ln(1 + a_n) = -\infty \;\Leftrightarrow\; \lim_{k \to \infty} \ln P_k = -\infty \;\Leftrightarrow\; \lim_{k \to \infty} P_k = 0$$
Q.E.D.
Corollary
If $0 \le b_n < 1$ and $\sum_{n=1}^\infty b_n = +\infty$, then
$$\prod_{n=1}^\infty (1 - b_n) = 0$$
Proof: Consider the subsequence $\{b_{n_k}\}$ consisting of the non-zero terms $b_n$ (dropping the zero terms does not change the product, since each such factor equals 1). Since $-b_{n_k} > -1$, the fundamental theorem gives:
$$\prod_{n=1}^\infty (1 - b_n) = \prod_{k=1}^\infty (1 - b_{n_k}) = 0 \;\Leftrightarrow\; \sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$$
We now show $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$.
Given $0 < 1 - b_{n_k} < 1$, we have $\ln(1 - b_{n_k}) < 0$, and $\sum_{k=1}^\infty b_{n_k} = +\infty$ (the zero terms contribute nothing, so the divergence of $\sum_n b_n$ carries over to the subsequence). The conclusion is still not immediately obvious, so we proceed by contradiction:
Assume $\sum_{k=1}^\infty \ln(1 - b_{n_k}) \ne -\infty$. Since each term is negative, the partial sums are decreasing, so the assumption implies convergence, i.e.,
$$\sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty$$
But by the inequality from the preface, $-b_{n_k} > \ln(1 - b_{n_k})$ for each $k$ (since $0 < b_{n_k} < 1$), so
$$\sum_{k=1}^\infty (-b_{n_k}) \ge \sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty,$$
while the left-hand side equals $-\infty$, a contradiction.
Therefore,
$$\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty,$$
and so
$$\prod_{n=1}^\infty (1 - b_n) = 0$$
Q.E.D.
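To see the corollary in action, here is a minimal numerical sanity check (a sketch only, not part of the proof): with $b_n = 1/(n+1)$ we have $0 \le b_n < 1$ and $\sum_n b_n = +\infty$, and the partial products of $\prod (1 - b_n)$ visibly tend to 0.

```python
# Sanity check of the corollary with b_n = 1/(n+1):
# 0 <= b_n < 1 and sum b_n = +inf, so prod (1 - b_n) should tend to 0.
def partial_product(N):
    prod = 1.0
    for n in range(1, N + 1):
        prod *= 1.0 - 1.0 / (n + 1)
    return prod

for N in (10, 100, 1000, 10000):
    # The exact value is 1 / (N + 1), so the printed numbers shrink toward 0.
    print(N, partial_product(N))
```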
The Essence of Mathematical Truth: Induction
Observe a linear-looking relation, conjecture boldly, then coldly examine whether the guess truly holds.
Given $X_1$ and the recursive formula:
$$X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n)X_n + \beta_n \xi_n$$
Show that
$$X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)$$
Proof:
- Base case: $n = 1$
$$X_2 = (1 - \beta_1)X_1 + \beta_1 \xi_1 = \xi_1 \beta_1 + X_1 (1 - \beta_1)$$
holds.
- Inductive step: assume the formula holds for $n$; prove it for $n + 1$:
$$X_{n+2} = (1 - \beta_{n+1})X_{n+1} + \beta_{n+1} \xi_{n+1}$$
Plugging in the inductive hypothesis:
$$= (1 - \beta_{n+1})\left[\sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)\right] + \beta_{n+1} \xi_{n+1}$$
$$= \sum_{j=1}^{n+1} \xi_j \beta_j \prod_{i=j}^{n} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n+1} (1 - \beta_i)$$
- By induction, the formula holds for all positive integers $n$. Q.E.D.
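The algebra can also be checked numerically. The sketch below (with arbitrary illustrative sequences $\beta_n$ and $\xi_n$ of my own choosing) unrolls the recursion and compares it with the closed form.

```python
import random

random.seed(0)
N = 20
beta = [random.uniform(0.0, 0.9) for _ in range(N + 1)]  # beta[1..N], entry 0 unused
xi = [random.uniform(-1.0, 1.0) for _ in range(N + 1)]   # xi[1..N], entry 0 unused
X1 = 0.5

# Unroll the recursion X_{n+1} = (1 - beta_n) X_n + beta_n xi_n
X = X1
for n in range(1, N + 1):
    X = (1.0 - beta[n]) * X + beta[n] * xi[n]

# Closed form: X_{N+1} = sum_j xi_j beta_j prod_{i=j}^{N-1} (1 - beta_{i+1})
#                        + X1 prod_{i=1}^{N} (1 - beta_i)
closed = X1
for i in range(1, N + 1):
    closed *= 1.0 - beta[i]
for j in range(1, N + 1):
    term = xi[j] * beta[j]
    for i in range(j, N):          # i = j .. N-1
        term *= 1.0 - beta[i + 1]
    closed += term

print(abs(X - closed) < 1e-12)     # True: recursion and closed form agree
```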
Now, let’s relax for a while — it’s movie time.
1. Definition of Action Replay Process
Given an $n$-step finite MDP with a possibly varying learning rate $\alpha$: at step $i$, the agent is in state $x_i$, takes action $a_i$, receives random reward $r_i$, and transitions to a new state $y_i$.
The Action Replay Process (ARP) is a re-examination of a state $x$ and an action $a$ within a given MDP.
Suppose we focus on state $x$ and action $a$, and consider an MDP run consisting of $n$ steps.
We add a step $0$ in which the agent immediately terminates and receives reward $Q_0(x,a)$.
During steps $1$ to $n$, due to the randomness of the MDP, the agent may take action $a$ in state $x$ at time steps $1 \le n^{i_1}, n^{i_2}, \dots, n^{i_*} \le n$.
If action $a$ is never taken at $x$ in this episode, the only opportunity to take it is at step $0$.
When $i_* \ge 1$, to determine the ARP's next reward and state, we sample an index $n^{i_e}$ as follows:
$$n^{i_e} = \begin{cases} n^{i_*}, & \text{with probability } \alpha_{n^{i_*}} \\ n^{i_{*-1}}, & \text{with probability } (1 - \alpha_{n^{i_*}})\,\alpha_{n^{i_{*-1}}} \\ \;\vdots & \\ 0, & \text{with probability } \prod_{i=1}^{i_*}(1 - \alpha_{n^i}) \end{cases}$$
Then, after one ARP step, the state $<x, n>$ transitions to $<y_{n^{i_e}}, n^{i_e} - 1>$, and the reward is $r_{n^{i_e}}$.
Clearly $n^{i_e} - 1 < n$, so the ARP terminates with probability 1; it is almost surely a finite process.
To summarize, the core transition formula is:
$$<x,n> \;\overset{a}{\rightarrow}\; <y_{n^{i_e}},\, n^{i_e} - 1>, \quad \text{reward } r_{n^{i_e}}$$
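A minimal sketch of one ARP transition under the description above; the data structure `occurrences` (the list of steps at which $(x,a)$ occurred, with their learning rates, rewards, and successor states) is an illustrative assumption, not notation from the original.

```python
import random

def arp_step(occurrences, Q0_xa):
    """One ARP transition from <x, n> after taking action a.

    `occurrences` is a list of tuples (step, alpha, reward, next_state) for the
    steps 1 <= n^{i_1} < ... < n^{i_*} <= n at which action a was taken in state x.
    Returns (reward, (y, step - 1)), or (Q0_xa, None) if the sampling falls
    through to step 0 (immediate termination).
    """
    # Walk the occurrences from the latest one downwards; occurrence i is
    # selected with probability alpha_i, given that all later ones were rejected,
    # which reproduces the sampling distribution of n^{i_e} above.
    for step, alpha, reward, next_state in reversed(occurrences):
        if random.random() < alpha:
            return reward, (next_state, step - 1)
    # With probability prod_i (1 - alpha_{n^i}) we fall through to step 0.
    return Q0_xa, None

# Illustrative usage: three occurrences of (x, a) below level n.
occurrences = [(3, 0.5, 1.0, "y3"), (7, 0.3, 0.0, "y7"), (9, 0.2, 2.0, "y9")]
print(arp_step(occurrences, Q0_xa=0.0))
```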
2. Properties of the Action Replay Process
We now examine the ARP's properties, particularly in comparison with the underlying MDP. Given the rules of an MDP and one (non-terminating) run of it, we can construct an ARP accordingly.
Property 1
$$\forall n, x, a: \quad Q^*_{ARP}(<x, n>, a) = Q_n(x, a)$$
Proof:
We use mathematical induction on $n$:
- Base case $n = 1$:
  - If the MDP did not take $a$ at $x$ in step 1, the ARP immediately gives reward $Q_0(x,a) = 0 = Q_1(x,a)$.
  - If $(x,a) = (x_1, a_1)$, then:
$$Q^*_{ARP}(<x,1>, a) = \alpha_1 r_1 + (1 - \alpha_1) Q_0(x,a) = \alpha_1 r_1 = Q_1(x,a)$$
- Inductive step: assume $Q^*_{ARP}(<x, k-1>, a) = Q_{k-1}(x,a)$; show the same holds for $k$:
  - If $(x,a) \ne (x_k, a_k)$, then:
$$Q_k(x,a) = Q_{k-1}(x,a) = Q^*_{ARP}(<x, k>, a)$$
  - If $(x,a) = (x_k, a_k)$, then:
$$Q^*_{ARP}(<x,k>, a) = \alpha_k \left[ r_k + \gamma \max_{a'} Q_{k-1}(y_k,a') \right] + (1 - \alpha_k) Q_{k-1}(x,a) = Q_k(x,a)$$
- Therefore $Q^*_{ARP}(<x,n>, a) = Q_n(x,a)$ for all $n$. Q.E.D.
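Property 1 identifies the optimal ARP value at level $n$ with the $n$-th Q-learning iterate $Q_n$. The recursion for $Q_n$ used in the proof can be written as a short sketch; the trace and parameter names below are hypothetical illustrations.

```python
from collections import defaultdict

def q_learning_trace(trace, alphas, gamma=0.9):
    """Compute Q_n(x, a) from a fixed trace [(x_i, a_i, r_i, y_i)], i = 1..n, using
    Q_k(x,a) = (1 - alpha_k) Q_{k-1}(x,a) + alpha_k [r_k + gamma * max_a' Q_{k-1}(y_k, a')],
    with Q_0 identically 0. By Property 1 this equals Q*_ARP(<x, n>, a)."""
    Q = defaultdict(float)                       # Q_0 = 0 for every (state, action)
    actions = {a for (_, a, _, _) in trace}      # illustrative finite action set
    for k, (x, a, r, y) in enumerate(trace, start=1):
        best_next = max(Q[(y, b)] for b in actions)
        Q[(x, a)] = (1 - alphas[k - 1]) * Q[(x, a)] + alphas[k - 1] * (r + gamma * best_next)
    return Q

# Illustrative 3-step trace.
trace = [("s0", "a", 1.0, "s1"), ("s1", "a", 0.0, "s0"), ("s0", "a", 2.0, "s1")]
print(dict(q_learning_trace(trace, alphas=[0.5, 0.5, 0.5])))
```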
Property 2
In the ARP $\{<x_i, n_i>\}$, for all $l, s, \epsilon > 0$ there exists $h > l$ such that for all $n_1 > h$,
$$P(n_{s+1} < l) < \epsilon$$
Proof:
Let us first consider the final step, i.e., the case where the sampled index drops to $n^{i_e} < n^{i_l}$ or even lower.
In the ARP, starting from $<x, h>$ and taking action $a$, the probability of reaching a level lower than $l$ in one step is (here $i_l$ and $i_h$ index the occurrences of $(x,a)$: $n^{i_l}$ is the first occurrence at level $\ge l$ and $n^{i_h}$ the last occurrence at level $\le h$):
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] = \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right]$$
But note that the bracketed terms in the last sum are exactly the index-sampling probabilities when only the occurrences below $n^{i_l}$ (together with step 0) are available, so they sum to 1:
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] = 1$$
Therefore,
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}}$$
As long as the sum of $\{\alpha_n\}$ diverges along every infinite subsequence (in particular along the occurrences of $(x,a)$), then as $h \to \infty$:
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}} \to 0$$
Moreover, since the MDP is finite, we have:
$$\forall l_j \in \mathbb{N}^*,\; \forall \eta_j > 0,\; \exists M_j > 0,\; \forall n_j > M_j,\; \forall X_j, a_j:$$
starting from $<X_j, n_j>$, after taking action $a_j$,
$$P(n_{j+1} \ge l_j) > 1 - \eta_j$$
Using the index $j$, we apply this conclusion recursively from step $s$ back to step $1$, choosing the $\eta_j$ small enough that
$$\prod_{j=1}^{s} (1 - \eta_j) \ge 1 - \epsilon.$$
Then, starting from any $n_1 > h := M_1$, the probability that $n_{s+1} \ge l = l_s$ is at least $1 - \epsilon$, where at each intermediate step $n_{j+1} \ge l_j$ and $<X_{j+1}, n_{j+1}>$ is the state reached from $<X_j, n_j>$ after executing $a_j$. Q.E.D.
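The quantitative heart of this proof is the bound $\prod_{i=i_l}^{i_h}(1-\alpha_{n^i}) < e^{-\sum_i \alpha_{n^i}} \to 0$. A small sketch, assuming for illustration $\alpha_i = 1/i$ along the occurrences:

```python
import math

# Probability of dropping below level l in one ARP step equals
# prod_{i=i_l}^{i_h} (1 - alpha_{n^i}); with a divergent sum of alphas it
# vanishes as i_h grows.
def drop_prob(i_l, i_h):
    prod, s = 1.0, 0.0
    for i in range(i_l, i_h + 1):
        alpha = 1.0 / i          # illustrative choice: alpha_i = 1/i, sum diverges
        prod *= 1.0 - alpha
        s += alpha
    return prod, math.exp(-s)

for i_h in (10, 100, 1000, 10000):
    p, bound = drop_prob(5, i_h)
    print(i_h, p, bound)          # both shrink toward 0, and p <= bound
```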
Now, define:
$$P_{xy}^{(n)}[a] = \sum_{m=1}^{n-1} P_{<x,n>,<y,m>}^{ARP}[a]$$
Lemma 1 (Robbins-Monro):
Let $\{\xi_n\}$ be a sequence of bounded random variables with expectation $\mathfrak{E}$, and let $0 \le \beta_n < 1$ satisfy $\sum_{i=1}^{\infty} \beta_i = +\infty$ and $\sum_{i=1}^{\infty} \beta_i^2 < +\infty$.
Define the sequence $X_{n+1} = X_n + \beta_n(\xi_n - X_n)$. Then:
$$P\left( \lim_{n \to \infty} X_n = \mathfrak{E} \right) = 1$$
This lemma is actually a corollary of a theorem related to the Robbins-Monro algorithm. The proof is rather intricate, so we will dedicate a separate, complete blog post to discuss it—from the scenario setup, to key insights, and down to the technical details.
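As an illustration of what Lemma 1 asserts (a simulation sketch only; the proof itself is deferred as noted), the following runs the averaging recursion with $\beta_n = 1/(n+1)$, which satisfies both summability conditions, on bounded i.i.d. samples with mean 0.25.

```python
import random

random.seed(1)

# Simulate X_{n+1} = X_n + beta_n (xi_n - X_n) with beta_n = 1/(n+1):
# 0 <= beta_n < 1, sum beta_n = inf, sum beta_n^2 < inf.
# The xi_n are bounded i.i.d. samples with mean 0.25, so Lemma 1 predicts X_n -> 0.25.
X = 0.0
N = 100_000
for n in range(1, N + 1):
    xi = random.uniform(-0.5, 1.0)   # bounded, mean 0.25
    beta = 1.0 / (n + 1)
    X = X + beta * (xi - X)

print(X)   # close to 0.25 for large N
```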
Property 3
$$P\left\{\lim_{n\to\infty}P_{xy}^{(n)}[a]=P_{xy}[a]\right\}=1, \quad P\left[\lim_{n\to\infty}\mathfrak{R}_{x}^{(n)}(a)=\mathfrak{R}_{x}(a)\right]=1$$
The two convergence statements inside the probability operator $P$ in fact hold uniformly over $x$, $y$, and $a$; in other words, the convergence is almost surely uniform.
Proof:
We will use Lemma 1 and construct a clever proof to demonstrate this property.
From the definition of ARP, we have:
$$\mathfrak{R}_{x}^{(n^{i+1})}(a)=\alpha_{n^{i+1}}r_{n^{i+1}}+(1-\alpha_{n^{i+1}})\mathfrak{R}_{x}^{(n^{i})}(a)=\mathfrak{R}_{x}^{(n^{i})}(a)+\alpha_{n^{i+1}}\left[r_{n^{i+1}}-\mathfrak{R}_{x}^{(n^{i})}(a)\right]$$
Although, once the MDP run is complete, $r_{n^{i+1}}$ is just a deterministic sequence in $i$, its values can be viewed as samples of a random variable. Therefore, as long as:
- $0 \le \alpha_n < 1$
- $\sum_i \alpha_{n^i} = +\infty$ along the occurrences of $(x,a)$ (needed for the divergence hypothesis of Lemma 1)
- $\sum_{i=1}^\infty \alpha_i^2 < +\infty$
then
$$P\left[\lim_{i\to\infty}\mathfrak{R}_{x}^{(n^i)}(a)=\mathfrak{R}_{x}(a)\right]=1,$$
which implies
$$P\left[\lim_{n\to\infty}\mathfrak{R}_{x}^{(n)}(a)=\mathfrak{R}_{x}(a)\right]=1.$$
Now define $X_n(x,a,y)$: starting from state $x$, if executing action $a$ at step $n$ results in reaching $y = y_n$, this value is 1; otherwise it is 0.
Then:
$$P_{xy}^{(n^{i+1})}[a] = (1-\alpha_{n^{i+1}})P_{xy}^{(n^i)}[a] + \alpha_{n^{i+1}}X_{n^{i+1}}(x,a,y) = P_{xy}^{(n^i)}[a] + \alpha_{n^{i+1}}\left\{X_{n^{i+1}}(x,a,y) - P_{xy}^{(n^i)}[a]\right\}$$
Although $X_{n^{i+1}}(x,a,y)$ is a deterministic sequence depending on $i$, it can likewise be regarded as a sample value of a random variable.
Therefore,
$$P\left\{\lim_{i\to\infty}P_{xy}^{(n^{i+1})}[a]=P_{xy}[a]\right\}=1,$$
so
$$P\left\{\lim_{n\to\infty}P_{xy}^{(n)}[a]=P_{xy}[a]\right\}=1.$$
In fact, since the MDP is finite, the expressions inside the probability operator converge uniformly over all $x$, $y$, and $a$. Q.E.D.
Lemma 2
Now consider $s$ Markov chains that share the same state and action spaces (they differ only in their rewards and state transition probabilities). Suppose that in the $i$-th chain, the expected reward for taking action $a$ in state $x$ is $R^i_x(a)$, and the probability of transitioning to state $y$ is $p^i_{xy}[a]$. At each step, the index of the chain being used increases by 1.
If, for a given $\eta > 0$ and for all $a, i, x, y$,
$$|R_x^i(a)-R_x(a)| < \eta, \quad |p_{xy}^i[a]-p_{xy}[a]| < \frac{\eta}{R}$$
(where $R_x(a)$ and $p_{xy}[a]$ are the corresponding quantities of a reference chain outside this collection, and $R = \sup_{x,a}|R_x(a)|$), and the MDP has $n$ states, then for any state $x$, the $s$-step value estimate $\overline{Q}'(x,a_1,\dots,a_s)$ (under the action sequence $a_1,\dots,a_s$) satisfies:
$$|\overline{Q}'(x,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)| < \eta,$$
where $\overline{Q}(x,a_1,\dots,a_s)$ is the corresponding value on the reference chain.
Proof:
Let
$$\overline{Q}(x,a_1,a_2) = R_x(a_1) + \gamma \sum_y p_{xy}[a_1] R_y(a_2)$$
and
$$\overline{Q}'(x,a_1,a_2) = R^1_x(a_1) + \gamma \sum_y p^1_{xy}[a_1] R^2_y(a_2)$$
Then,
$$|\overline{Q}(x,a_1,a_2) - \overline{Q}'(x,a_1,a_2)| \le |R_x(a_1) - R_x^1(a_1)| + \gamma \sum_y \left|p_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R^2_y(a_2)\right|$$
$$< \eta + \gamma \sum_y \left|p_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R_y(a_2) + p^1_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R^2_y(a_2)\right|$$
$$\le \eta + \gamma \sum_y \left|p_{xy}[a_1] - p^1_{xy}[a_1]\right| |R_y(a_2)| + \gamma \sum_y p^1_{xy}[a_1]\,|R_y(a_2) - R^2_y(a_2)|$$
$$\le \eta + \gamma \sum_y \frac{\eta}{R}R + \gamma\eta = \eta + \gamma(n\eta) + \gamma\eta = \eta(1 + \gamma + n\gamma) < \eta(n+2)$$
Since $\eta$ is arbitrary, we may run the same argument with $\eta/(n+2)$ in place of $\eta$ in the hypotheses, so the bound can be stated simply as $|\overline{Q}(x,a_1,a_2) - \overline{Q}'(x,a_1,a_2)| < \eta$.
Thus, by mathematical induction (hint: the $Q$-values are bounded, and the recursion has the same Bellman-like form with $R$ replaced by the corresponding $Q$-values), we get:
$$|\overline{Q}'(x,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)|<\eta$$
Q.E.D.