Preface
Before we officially begin, I’d like to give everyone a sense of the general style of this article, as well as the main areas of mathematics we’ll be using. Let’s start with a fundamental theorem from mathematical analysis.
A commonly used inequality
$$-x > \ln(1 - x), \quad 0 < x < 1$$
Proof: Let $f(x) = \ln(1 - x) + x$ for $0 < x < 1$. Then $f(0) = 0$.
$$f'(x) = \frac{-1}{1 - x} + 1 = \frac{x}{x - 1} < 0$$
Since $f$ is strictly decreasing on $(0,1)$ and $f(0) = 0$, we have $f(x) < 0$ there. Hence $-x > \ln(1 - x)$ for $0 < x < 1$. Q.E.D.
Fundamental Theorem
If $a_n > -1$, then
$$\prod_{n=1}^\infty (1 + a_n) = 0 \;\Leftrightarrow\; \sum_{n=1}^\infty \ln(1 + a_n) = -\infty$$
Proof: Let $P_k = \prod_{n=1}^k (1 + a_n)$. Then
$$\ln P_k = \ln\left(\prod_{n=1}^k (1 + a_n)\right) = \sum_{n=1}^k \ln(1 + a_n)$$
Thus,
$$\sum_{n=1}^\infty \ln(1 + a_n) = -\infty \;\Leftrightarrow\; \lim_{k \to \infty} \sum_{n=1}^k \ln(1 + a_n) = -\infty \;\Leftrightarrow\; \lim_{k \to \infty} \ln P_k = -\infty \;\Leftrightarrow\; \lim_{k \to \infty} P_k = 0$$
Q.E.D.
Corollary
If $0 \le b_n < 1$ and $\sum_{n=1}^\infty b_n = +\infty$, then
$$\prod_{n=1}^\infty (1 - b_n) = 0$$
Proof: Consider the subsequence $\{b_{n_k}\}$ consisting of the non-zero terms $b_n$ (dropping the zero terms does not change the product, since each such factor equals 1). Since $-b_{n_k} > -1$, the fundamental theorem gives:
$$\prod_{n=1}^\infty (1 - b_n) = \prod_{k=1}^\infty (1 - b_{n_k}) = 0 \;\Leftrightarrow\; \sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$$
We now show $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$.
Given $0 < 1 - b_{n_k} < 1$, we have $\ln(1 - b_{n_k}) < 0$, and $\sum_{k=1}^\infty b_{n_k} = +\infty$ (the zero terms contribute nothing, so the divergence of $\sum_n b_n$ carries over to the subsequence). The conclusion is still not immediately obvious, so we proceed by contradiction:
Assume $\sum_{k=1}^\infty \ln(1 - b_{n_k}) \ne -\infty$. Since each term is negative, the partial sums are decreasing, so the assumption implies convergence, i.e.,
$$\sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty$$
But by the inequality from the preface, $-b_{n_k} > \ln(1 - b_{n_k})$ for each $k$ (since $0 < b_{n_k} < 1$), so
$$\sum_{k=1}^\infty (-b_{n_k}) \ge \sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty,$$
while the left-hand side equals $-\infty$, a contradiction.
Therefore,
$$\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty,$$
and so
$$\prod_{n=1}^\infty (1 - b_n) = 0$$
Q.E.D.
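To see the corollary in action, here is a minimal numerical sanity check (a sketch only, not part of the proof): with $b_n = 1/(n+1)$ we have $0 \le b_n < 1$ and $\sum_n b_n = +\infty$, and the partial products of $\prod (1 - b_n)$ visibly tend to 0.

```python
# Sanity check of the corollary with b_n = 1/(n+1):
# 0 <= b_n < 1 and sum b_n = +inf, so prod (1 - b_n) should tend to 0.
def partial_product(N):
    prod = 1.0
    for n in range(1, N + 1):
        prod *= 1.0 - 1.0 / (n + 1)
    return prod

for N in (10, 100, 1000, 10000):
    # The exact value is 1 / (N + 1), so the printed numbers shrink toward 0.
    print(N, partial_product(N))
```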
The Essence of Mathematical Truth: Induction
Observe a linear-looking relation, conjecture boldly, then coldly examine whether the guess truly holds.
Given $X_1$ and the recursive formula:
$$X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n)X_n + \beta_n \xi_n$$
Show that
$$X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)$$
Proof:
- Base case: $n = 1$
$$X_2 = (1 - \beta_1)X_1 + \beta_1 \xi_1 = \xi_1 \beta_1 + X_1 (1 - \beta_1)$$
holds.
- Inductive step: assume the formula holds for $n$; prove it for $n + 1$:
$$X_{n+2} = (1 - \beta_{n+1})X_{n+1} + \beta_{n+1} \xi_{n+1}$$
Plugging in the inductive hypothesis:
$$= (1 - \beta_{n+1})\left[\sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)\right] + \beta_{n+1} \xi_{n+1}$$
$$= \sum_{j=1}^{n+1} \xi_j \beta_j \prod_{i=j}^{n} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n+1} (1 - \beta_i)$$
- By induction, the formula holds for all positive integers $n$. Q.E.D.
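The algebra can also be checked numerically. The sketch below (with arbitrary illustrative sequences $\beta_n$ and $\xi_n$ of my own choosing) unrolls the recursion and compares it with the closed form.

```python
import random

random.seed(0)
N = 20
beta = [random.uniform(0.0, 0.9) for _ in range(N + 1)]  # beta[1..N], entry 0 unused
xi = [random.uniform(-1.0, 1.0) for _ in range(N + 1)]   # xi[1..N], entry 0 unused
X1 = 0.5

# Unroll the recursion X_{n+1} = (1 - beta_n) X_n + beta_n xi_n
X = X1
for n in range(1, N + 1):
    X = (1.0 - beta[n]) * X + beta[n] * xi[n]

# Closed form: X_{N+1} = sum_j xi_j beta_j prod_{i=j}^{N-1} (1 - beta_{i+1})
#                        + X1 prod_{i=1}^{N} (1 - beta_i)
closed = X1
for i in range(1, N + 1):
    closed *= 1.0 - beta[i]
for j in range(1, N + 1):
    term = xi[j] * beta[j]
    for i in range(j, N):          # i = j .. N-1
        term *= 1.0 - beta[i + 1]
    closed += term

print(abs(X - closed) < 1e-12)     # True: recursion and closed form agree
```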
Now, let’s relax for a while — it’s movie time.
1. Definition of Action Replay Process
Given an $n$-step finite MDP with a possibly varying learning rate $\alpha$: at step $i$, the agent is in state $x_i$, takes action $a_i$, receives random reward $r_i$, and transitions to a new state $y_i$.
The Action Replay Process (ARP) is a re-examination of a state $x$ and an action $a$ within a given MDP.
Suppose we focus on state $x$ and action $a$, and consider an MDP run consisting of $n$ steps.
We add a step $0$ in which the agent immediately terminates and receives reward $Q_0(x,a)$.
During steps $1$ to $n$, due to the randomness of the MDP, the agent may take action $a$ in state $x$ at time steps $1 \le n^{i_1}, n^{i_2}, \dots, n^{i_*} \le n$.
If action $a$ is never taken at $x$ in this episode, the only opportunity to take it is at step $0$.
When $i_* \ge 1$, to determine the ARP's next reward and state, we sample an index $n^{i_e}$ as follows:
$$n^{i_e} = \begin{cases} n^{i_*}, & \text{with probability } \alpha_{n^{i_*}} \\ n^{i_{*-1}}, & \text{with probability } (1 - \alpha_{n^{i_*}})\,\alpha_{n^{i_{*-1}}} \\ \;\vdots & \\ 0, & \text{with probability } \prod_{i=1}^{i_*}(1 - \alpha_{n^i}) \end{cases}$$
Then, after one ARP step, the state $<x, n>$ transitions to $<y_{n^{i_e}}, n^{i_e} - 1>$, and the reward is $r_{n^{i_e}}$.
Clearly $n^{i_e} - 1 < n$, so the ARP terminates with probability 1; it is almost surely a finite process.
To summarize, the core transition formula is:
$$<x,n> \;\overset{a}{\rightarrow}\; <y_{n^{i_e}},\, n^{i_e} - 1>, \quad \text{reward } r_{n^{i_e}}$$
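A minimal sketch of one ARP transition under the description above; the data structure `occurrences` (the list of steps at which $(x,a)$ occurred, with their learning rates, rewards, and successor states) is an illustrative assumption, not notation from the original.

```python
import random

def arp_step(occurrences, Q0_xa):
    """One ARP transition from <x, n> after taking action a.

    `occurrences` is a list of tuples (step, alpha, reward, next_state) for the
    steps 1 <= n^{i_1} < ... < n^{i_*} <= n at which action a was taken in state x.
    Returns (reward, (y, step - 1)), or (Q0_xa, None) if the sampling falls
    through to step 0 (immediate termination).
    """
    # Walk the occurrences from the latest one downwards; occurrence i is
    # selected with probability alpha_i, given that all later ones were rejected,
    # which reproduces the sampling distribution of n^{i_e} above.
    for step, alpha, reward, next_state in reversed(occurrences):
        if random.random() < alpha:
            return reward, (next_state, step - 1)
    # With probability prod_i (1 - alpha_{n^i}) we fall through to step 0.
    return Q0_xa, None

# Illustrative usage: three occurrences of (x, a) below level n.
occurrences = [(3, 0.5, 1.0, "y3"), (7, 0.3, 0.0, "y7"), (9, 0.2, 2.0, "y9")]
print(arp_step(occurrences, Q0_xa=0.0))
```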
2. Properties of the Action Replay Process
We now examine the ARP's properties, particularly in comparison with the underlying MDP. Given the rules of an MDP and one (non-terminating) run of it, we can construct an ARP accordingly.
Property 1
$$\forall n, x, a: \quad Q^*_{ARP}(<x, n>, a) = Q_n(x, a)$$
Proof:
We use mathematical induction on $n$:
- Base case $n = 1$:
  - If the MDP did not take $a$ at $x$ in step 1, the ARP immediately gives reward $Q_0(x,a) = 0 = Q_1(x,a)$.
  - If $(x,a) = (x_1, a_1)$, then:
$$Q^*_{ARP}(<x,1>, a) = \alpha_1 r_1 + (1 - \alpha_1) Q_0(x,a) = \alpha_1 r_1 = Q_1(x,a)$$
- Inductive step: assume $Q^*_{ARP}(<x, k-1>, a) = Q_{k-1}(x,a)$; show the same holds for $k$:
  - If $(x,a) \ne (x_k, a_k)$, then:
$$Q_k(x,a) = Q_{k-1}(x,a) = Q^*_{ARP}(<x, k>, a)$$
  - If $(x,a) = (x_k, a_k)$, then:
$$Q^*_{ARP}(<x,k>, a) = \alpha_k \left[ r_k + \gamma \max_{a'} Q_{k-1}(y_k,a') \right] + (1 - \alpha_k) Q_{k-1}(x,a) = Q_k(x,a)$$
- Therefore $Q^*_{ARP}(<x,n>, a) = Q_n(x,a)$ for all $n$. Q.E.D.
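Property 1 identifies the optimal ARP value at level $n$ with the $n$-th Q-learning iterate $Q_n$. The recursion for $Q_n$ used in the proof can be written as a short sketch; the trace and parameter names below are hypothetical illustrations.

```python
from collections import defaultdict

def q_learning_trace(trace, alphas, gamma=0.9):
    """Compute Q_n(x, a) from a fixed trace [(x_i, a_i, r_i, y_i)], i = 1..n, using
    Q_k(x,a) = (1 - alpha_k) Q_{k-1}(x,a) + alpha_k [r_k + gamma * max_a' Q_{k-1}(y_k, a')],
    with Q_0 identically 0. By Property 1 this equals Q*_ARP(<x, n>, a)."""
    Q = defaultdict(float)                       # Q_0 = 0 for every (state, action)
    actions = {a for (_, a, _, _) in trace}      # illustrative finite action set
    for k, (x, a, r, y) in enumerate(trace, start=1):
        best_next = max(Q[(y, b)] for b in actions)
        Q[(x, a)] = (1 - alphas[k - 1]) * Q[(x, a)] + alphas[k - 1] * (r + gamma * best_next)
    return Q

# Illustrative 3-step trace.
trace = [("s0", "a", 1.0, "s1"), ("s1", "a", 0.0, "s0"), ("s0", "a", 2.0, "s1")]
print(dict(q_learning_trace(trace, alphas=[0.5, 0.5, 0.5])))
```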
Property 2
In the ARP $\{<x_i, n_i>\}$, for all $l, s, \epsilon > 0$ there exists $h > l$ such that for all $n_1 > h$,
$$P(n_{s+1} < l) < \epsilon$$
Proof:
Let us first consider the final step, i.e., the case where the sampled index drops to $n^{i_e} < n^{i_l}$ or even lower.
In the ARP, starting from $<x, h>$ and taking action $a$, the probability of reaching a level lower than $l$ in one step is (here $i_l$ and $i_h$ index the occurrences of $(x,a)$: $n^{i_l}$ is the first occurrence at level $\ge l$ and $n^{i_h}$ the last occurrence at level $\le h$):
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] = \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right]$$
But note that the bracketed terms in the last sum are exactly the index-sampling probabilities when only the occurrences below $n^{i_l}$ (together with step 0) are available, so they sum to 1:
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] = 1$$
Therefore,
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}}$$
As long as the sum of $\{\alpha_n\}$ diverges along every infinite subsequence (in particular along the occurrences of $(x,a)$), then as $h \to \infty$:
$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}} \to 0$$
Moreover, since the MDP is finite, we have:
$$\forall l_j \in \mathbb{N}^*,\; \forall \eta_j > 0,\; \exists M_j > 0,\; \forall n_j > M_j,\; \forall X_j, a_j:$$
starting from $<X_j, n_j>$, after taking action $a_j$,
$$P(n_{j+1} \ge l_j) > 1 - \eta_j$$
Using the index $j$, we apply this conclusion recursively from step $s$ back to step $1$, choosing the $\eta_j$ small enough that
$$\prod_{j=1}^{s} (1 - \eta_j) \ge 1 - \epsilon.$$
Then, starting from any $n_1 > h := M_1$, the probability that $n_{s+1} \ge l = l_s$ is at least $1 - \epsilon$, where at each intermediate step $n_{j+1} \ge l_j$ and $<X_{j+1}, n_{j+1}>$ is the state reached from $<X_j, n_j>$ after executing $a_j$. Q.E.D.
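The quantitative heart of this proof is the bound $\prod_{i=i_l}^{i_h}(1-\alpha_{n^i}) < e^{-\sum_i \alpha_{n^i}} \to 0$. A small sketch, assuming for illustration $\alpha_i = 1/i$ along the occurrences:

```python
import math

# Probability of dropping below level l in one ARP step equals
# prod_{i=i_l}^{i_h} (1 - alpha_{n^i}); with a divergent sum of alphas it
# vanishes as i_h grows.
def drop_prob(i_l, i_h):
    prod, s = 1.0, 0.0
    for i in range(i_l, i_h + 1):
        alpha = 1.0 / i          # illustrative choice: alpha_i = 1/i, sum diverges
        prod *= 1.0 - alpha
        s += alpha
    return prod, math.exp(-s)

for i_h in (10, 100, 1000, 10000):
    p, bound = drop_prob(5, i_h)
    print(i_h, p, bound)          # both shrink toward 0, and p <= bound
```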
Now, define:
$$P_{xy}^{(n)}[a] = \sum_{m=1}^{n-1} P_{<x,n>,<y,m>}^{ARP}[a]$$
Lemma 1 (Robbins-Monro):
Let $\{\xi_n\}$ be a sequence of bounded random variables with expectation $\mathfrak{E}$, and let $0 \le \beta_n < 1$ satisfy $\sum_{i=1}^{\infty} \beta_i = +\infty$ and $\sum_{i=1}^{\infty} \beta_i^2 < +\infty$.
Define the sequence $X_{n+1} = X_n + \beta_n(\xi_n - X_n)$. Then:
$$P\left( \lim_{n \to \infty} X_n = \mathfrak{E} \right) = 1$$
This lemma is actually a corollary of a theorem related to the Robbins-Monro algorithm. The proof is rather intricate, so we will dedicate a separate, complete blog post to discuss it—from the scenario setup, to key insights, and down to the technical details.
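As an illustration of what Lemma 1 asserts (a simulation sketch only; the proof itself is deferred as noted), the following runs the averaging recursion with $\beta_n = 1/(n+1)$, which satisfies both summability conditions, on bounded i.i.d. samples with mean 0.25.

```python
import random

random.seed(1)

# Simulate X_{n+1} = X_n + beta_n (xi_n - X_n) with beta_n = 1/(n+1):
# 0 <= beta_n < 1, sum beta_n = inf, sum beta_n^2 < inf.
# The xi_n are bounded i.i.d. samples with mean 0.25, so Lemma 1 predicts X_n -> 0.25.
X = 0.0
N = 100_000
for n in range(1, N + 1):
    xi = random.uniform(-0.5, 1.0)   # bounded, mean 0.25
    beta = 1.0 / (n + 1)
    X = X + beta * (xi - X)

print(X)   # close to 0.25 for large N
```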
Property 3
$$P\left\{\lim_{n\to\infty}P_{xy}^{(n)}[a]=P_{xy}[a]\right\}=1, \quad P\left[\lim_{n\to\infty}\mathfrak{R}_{x}^{(n)}(a)=\mathfrak{R}_{x}(a)\right]=1$$
The two convergence statements inside the probability operator $P$ in fact hold uniformly over $x$, $y$, and $a$; in other words, the convergence is almost surely uniform.
Proof:
We will use Lemma 1 and construct a clever proof to demonstrate this property.
From the definition of ARP, we have:
$$\mathfrak{R}_{x}^{(n^{i+1})}(a)=\alpha_{n^{i+1}}r_{n^{i+1}}+(1-\alpha_{n^{i+1}})\mathfrak{R}_{x}^{(n^{i})}(a)=\mathfrak{R}_{x}^{(n^{i})}(a)+\alpha_{n^{i+1}}\left[r_{n^{i+1}}-\mathfrak{R}_{x}^{(n^{i})}(a)\right]$$
Although, once the MDP run is complete, $r_{n^{i+1}}$ is just a deterministic sequence in $i$, its values can be viewed as samples of a random variable. Therefore, as long as:
- $0 \le \alpha_n < 1$
- $\sum_i \alpha_{n^i} = +\infty$ along the occurrences of $(x,a)$ (needed for the divergence hypothesis of Lemma 1)
- $\sum_{i=1}^\infty \alpha_i^2 < +\infty$
then
$$P\left[\lim_{i\to\infty}\mathfrak{R}_{x}^{(n^i)}(a)=\mathfrak{R}_{x}(a)\right]=1,$$
which implies
$$P\left[\lim_{n\to\infty}\mathfrak{R}_{x}^{(n)}(a)=\mathfrak{R}_{x}(a)\right]=1.$$
Now define $X_n(x,a,y)$: starting from state $x$, if executing action $a$ at step $n$ results in reaching $y = y_n$, this value is 1; otherwise it is 0.
Then:
$$P_{xy}^{(n^{i+1})}[a] = (1-\alpha_{n^{i+1}})P_{xy}^{(n^i)}[a] + \alpha_{n^{i+1}}X_{n^{i+1}}(x,a,y) = P_{xy}^{(n^i)}[a] + \alpha_{n^{i+1}}\left\{X_{n^{i+1}}(x,a,y) - P_{xy}^{(n^i)}[a]\right\}$$
Although $X_{n^{i+1}}(x,a,y)$ is a deterministic sequence depending on $i$, it can likewise be regarded as a sample value of a random variable.
Therefore,
$$P\left\{\lim_{i\to\infty}P_{xy}^{(n^{i+1})}[a]=P_{xy}[a]\right\}=1,$$
so
$$P\left\{\lim_{n\to\infty}P_{xy}^{(n)}[a]=P_{xy}[a]\right\}=1.$$
In fact, since the MDP is finite, the expressions inside the probability operator converge uniformly over all $x$, $y$, and $a$. Q.E.D.
Lemma 2
Now consider $s$ Markov chains that share the same state and action spaces (they differ only in their rewards and state transition probabilities). Suppose that in the $i$-th chain, the expected reward for taking action $a$ in state $x$ is $R^i_x(a)$, and the probability of transitioning to state $y$ is $p^i_{xy}[a]$. At each step, the index of the chain being used increases by 1.
If, for a given $\eta > 0$ and for all $a, i, x, y$,
$$|R_x^i(a)-R_x(a)| < \eta, \quad |p_{xy}^i[a]-p_{xy}[a]| < \frac{\eta}{R}$$
(where $R_x(a)$ and $p_{xy}[a]$ are the corresponding quantities of a reference chain outside this collection, and $R = \sup_{x,a}|R_x(a)|$), and the MDP has $n$ states, then for any state $x$, the $s$-step value estimate $\overline{Q}'(x,a_1,\dots,a_s)$ (under the action sequence $a_1,\dots,a_s$) satisfies:
$$|\overline{Q}'(x,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)| < \eta,$$
where $\overline{Q}(x,a_1,\dots,a_s)$ is the corresponding value on the reference chain.
Proof:
Let
$$\overline{Q}(x,a_1,a_2) = R_x(a_1) + \gamma \sum_y p_{xy}[a_1] R_y(a_2)$$
and
$$\overline{Q}'(x,a_1,a_2) = R^1_x(a_1) + \gamma \sum_y p^1_{xy}[a_1] R^2_y(a_2)$$
Then,
$$|\overline{Q}(x,a_1,a_2) - \overline{Q}'(x,a_1,a_2)| \le |R_x(a_1) - R_x^1(a_1)| + \gamma \sum_y \left|p_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R^2_y(a_2)\right|$$
$$< \eta + \gamma \sum_y \left|p_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R_y(a_2) + p^1_{xy}[a_1]R_y(a_2) - p^1_{xy}[a_1]R^2_y(a_2)\right|$$
$$\le \eta + \gamma \sum_y \left|p_{xy}[a_1] - p^1_{xy}[a_1]\right| |R_y(a_2)| + \gamma \sum_y p^1_{xy}[a_1]\,|R_y(a_2) - R^2_y(a_2)|$$
$$\le \eta + \gamma \sum_y \frac{\eta}{R}R + \gamma\eta = \eta + \gamma(n\eta) + \gamma\eta = \eta(1 + \gamma + n\gamma) < \eta(n+2)$$
Since $\eta$ is arbitrary, we may run the same argument with $\eta/(n+2)$ in place of $\eta$ in the hypotheses, so the bound can be stated simply as $|\overline{Q}(x,a_1,a_2) - \overline{Q}'(x,a_1,a_2)| < \eta$.
Thus, by mathematical induction (hint: the $Q$-values are bounded, and the recursion has the same Bellman-like form with $R$ replaced by the corresponding $Q$-values), we get:
$$|\overline{Q}'(x,a_1,\dots,a_s)-\overline{Q}(x,a_1,\dots,a_s)|<\eta$$
Q.E.D.