The Action Replay Process

Preface

A commonly used inequality

$$-x > \ln(1 - x), \qquad 0 < x < 1$$

Proof: Let $f(x) = \ln(1 - x) + x$ for $0 \le x < 1$. Then $f(0) = 0$, and for $0 < x < 1$,

$$f'(x) = \frac{-1}{1 - x} + 1 = \frac{x}{x - 1} < 0,$$

so $f$ is strictly decreasing on $(0, 1)$. Hence $f(x) < f(0) = 0$, i.e., $-x > \ln(1 - x)$ for $0 < x < 1$. Q.E.D.
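
As a quick numerical sanity check (not part of the proof), the inequality can be spot-checked at a few points of $(0, 1)$:

```python
import math

# Spot-check -x > ln(1 - x) at a few points of (0, 1).
for x in (0.1, 0.5, 0.9, 0.99):
    assert -x > math.log(1 - x)
    print(f"x = {x:<5}  -x = {-x:+.4f}  >  ln(1-x) = {math.log(1 - x):+.4f}")
```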


Fundamental Theorem
If $a_n > -1$, then

$$\prod_{n=1}^\infty (1 + a_n) = 0 \iff \sum_{n=1}^\infty \ln(1 + a_n) = -\infty$$

Proof: Let $P_k = \prod_{n=1}^k (1 + a_n)$; then

$$\ln P_k = \ln\left(\prod_{n=1}^k (1 + a_n)\right) = \sum_{n=1}^k \ln(1 + a_n)$$

Thus,

$$\sum_{n=1}^\infty \ln(1 + a_n) = -\infty \iff \lim_{k \to \infty} \sum_{n=1}^k \ln(1 + a_n) = -\infty \iff \lim_{k \to \infty} \ln P_k = -\infty \iff \lim_{k \to \infty} P_k = 0$$

Q.E.D.
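
To make the equivalence concrete, here is a small numerical illustration (my own example, with $a_n = -\tfrac{1}{n+1}$): the partial products shrink toward 0 exactly as the partial sums of $\ln(1 + a_n)$ run off to $-\infty$.

```python
import math

# With a_n = -1/(n+1), the partial product telescopes to 1/(k+1) -> 0,
# and the partial sum of ln(1 + a_n) equals -ln(k+1) -> -infinity.
P, S = 1.0, 0.0
for k in range(1, 100001):
    a = -1.0 / (k + 1)
    P *= 1 + a
    S += math.log(1 + a)
    if k in (10, 1000, 100000):
        print(f"k = {k:>6}   P_k = {P:.6f}   sum ln(1+a_n) = {S:.3f}")
```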


Corollary
If $0 \le b_n < 1$ and $\sum_{n=1}^\infty b_n = +\infty$, then

$$\prod_{n=1}^\infty (1 - b_n) = 0$$

Proof: Consider the subsequence $\{b_{n_k}\}$ of non-zero terms of $\{b_n\}$. Since $-b_{n_k} > -1$, the fundamental theorem gives:

$$\prod_{n=1}^\infty (1 - b_n) = \prod_{k=1}^\infty (1 - b_{n_k}) = 0 \iff \sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$$

We now show that $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$.
Since $0 < 1 - b_{n_k} < 1$, each term $\ln(1 - b_{n_k})$ is negative, and dropping the zero terms does not change the sum, so $\sum_{k=1}^\infty b_{n_k} = +\infty$. The conclusion is not immediate, so we argue by contradiction.

Assume $\sum_{k=1}^\infty \ln(1 - b_{n_k}) \ne -\infty$. Since every term is negative, the partial sums are decreasing, so this assumption means the series converges, i.e.,

$$\sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty$$

But by the inequality from the preface, $\ln(1 - b_{n_k}) < -b_{n_k}$, so

$$\sum_{k=1}^\infty (-b_{n_k}) = -\infty \ge \sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty,$$

a contradiction. Therefore $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$, and so

$$\prod_{n=1}^\infty (1 - b_n) = 0$$

Q.E.D.
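
The corollary can also be seen numerically (an illustrative sketch, with sequences of my own choosing): a divergent $\sum b_n$ drives the product to 0, while a convergent one does not.

```python
# Divergent sum (b_n = 1/(2n)): the running product collapses toward 0.
# Convergent sum (b_n = 1/(n+1)^2): the running product stays bounded away from 0.
def running_product(b, N=200000):
    P = 1.0
    for n in range(1, N + 1):
        P *= 1 - b(n)
    return P

print("sum b_n = +inf :", running_product(lambda n: 1 / (2 * n)))       # ~ 0
print("sum b_n < +inf :", running_product(lambda n: 1 / (n + 1) ** 2))  # ~ 0.5
```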


The Essence of Mathematical Truth: Induction
Observe a linear-looking relation, fantasize wildly, then coldly examine whether it is truly valid.

Given $X_1$ and the recursion

$$X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n)X_n + \beta_n \xi_n,$$

show that

$$X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)$$

Proof:

  1. Base case $n = 1$:

     $$X_2 = (1 - \beta_1)X_1 + \beta_1 \xi_1 = \xi_1 \beta_1 + X_1 (1 - \beta_1),$$

     so the formula holds (the empty product multiplying $\xi_1 \beta_1$ is $1$).

  2. Inductive step: assume the formula holds for $n$; prove it for $n + 1$. By the recursion,

     $$X_{n+2} = (1 - \beta_{n+1})X_{n+1} + \beta_{n+1} \xi_{n+1}$$

     Plugging in the inductive hypothesis:

     $$X_{n+2} = (1 - \beta_{n+1})\left[\sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)\right] + \beta_{n+1} \xi_{n+1}$$

     $$= \sum_{j=1}^{n+1} \xi_j \beta_j \prod_{i=j}^{n} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n+1} (1 - \beta_i)$$

  3. By induction, the formula holds for all positive integers $n$. Q.E.D. (A numerical check is given in the sketch below.)
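
The closed form is easy to verify numerically; the following sketch (with arbitrary test values for $\beta_n$, $\xi_n$, and $X_1$, chosen only for illustration) compares direct iteration of the recursion with the formula just proved.

```python
import random

random.seed(0)
N = 50
beta = [0.0] + [random.uniform(0.0, 0.99) for _ in range(N)]  # beta[1..N]
xi = [0.0] + [random.gauss(0, 1) for _ in range(N)]           # xi[1..N]
X1 = random.gauss(0, 1)

# Direct iteration of X_{n+1} = (1 - beta_n) X_n + beta_n xi_n.
X = X1
for n in range(1, N + 1):
    X = (1 - beta[n]) * X + beta[n] * xi[n]

def prod(factors):
    p = 1.0
    for f in factors:
        p *= f
    return p

# Closed form, using prod_{i=j}^{N-1}(1 - beta_{i+1}) = prod_{i=j+1}^{N}(1 - beta_i).
closed = sum(xi[j] * beta[j] * prod(1 - beta[i] for i in range(j + 1, N + 1))
             for j in range(1, N + 1)) + X1 * prod(1 - beta[i] for i in range(1, N + 1))

print(abs(X - closed) < 1e-9)  # True
```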


Now, let’s relax for a while — it’s movie time.


1. Definition of Action Replay Process

Given an $n$-step finite MDP with a possibly varying learning rate $\alpha$: at step $i$, the agent is in state $x_i$, takes action $a_i$, receives a random reward $r_i$, and transitions to a new state $y_i$.

The Action Replay Process (ARP) re-examines a given state $x$ and action $a$ within a recorded run of the MDP.

Suppose we focus on state $x$ and action $a$, and consider a run of the MDP consisting of $n$ steps.

We add a step 0, in which the agent immediately terminates and receives reward $Q_0(x, a)$.

During steps 1 to $n$, due to the randomness of the MDP, the agent may take action $a$ in state $x$ at time steps $1 \le n^{i_1} < n^{i_2} < \cdots < n^{i_*} \le n$.

If action $a$ is never taken at $x$ in this episode, the only opportunity for replay is step 0.

When $i_* \ge 1$, to determine the ARP's next reward and state, we sample an index $n^{i_e}$ as follows:

$$n^{i_e} = \begin{cases} n^{i_*}, & \text{with probability } \alpha_{n^{i_*}} \\ n^{i_{*-1}}, & \text{with probability } (1 - \alpha_{n^{i_*}})\,\alpha_{n^{i_{*-1}}} \\ \vdots \\ 0, & \text{with probability } \prod_{i=1}^{i_*}(1 - \alpha_{n^i}) \end{cases}$$

Then, after one ARP step, the state $\langle x, n \rangle$ transitions to $\langle y_{n^{i_e}}, n^{i_e} - 1 \rangle$, and the reward is $r_{n^{i_e}}$.
Clearly $n^{i_e} - 1 < n$, so the level index strictly decreases at every step and the ARP terminates with probability 1; it is a finite process almost surely.

To summarize, the core transition formula is:

$$\langle x, n \rangle \xrightarrow{\;a\;} \langle y_{n^{i_e}},\, n^{i_e} - 1 \rangle, \qquad \text{reward } r_{n^{i_e}}$$
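
As an illustration, here is a minimal sketch of one ARP transition in Python (the function and argument names are my own, not from the original construction): it scans the occurrences of $(x, a)$ from the latest one downwards, accepting step $t$ with probability $\alpha_t$, which reproduces the case probabilities above.

```python
import random

def arp_step(n, a, occurrences, alpha, y, r, Q0=0.0):
    """One ARP transition from level n under action a.

    occurrences: time steps 1 <= n^{i_1} < ... < n^{i_*} <= n at which the
                 recorded MDP run took action a in state x.
    alpha[t], y[t], r[t]: learning rate, successor state, and reward of MDP step t.
    Returns (next_state, reward); next_state is None when the replay bottoms out
    at step 0, where the ARP terminates with reward Q0 = Q_0(x, a).
    """
    for t in sorted((t for t in occurrences if t <= n), reverse=True):
        if random.random() < alpha[t]:
            return (y[t], t - 1), r[t]   # move to <y_t, t - 1> with reward r_t
    return None, Q0                      # all candidates rejected: step 0
```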


2. Properties of the Action Replay Process

We now examine the ARP's properties, particularly in comparison with the underlying MDP. Given the MDP's transition rule and one (non-terminating) recorded run of it, we can construct the corresponding ARP.

Property 1

$$\forall n, x, a: \quad Q^*_{ARP}(\langle x, n \rangle, a) = Q_n(x, a)$$

Proof:
We use induction on $n$:

  1. Base case $n = 1$:

     • If the MDP did not take action $a$ at $x$ in step 1, the ARP gives reward $Q_0(x, a) = 0 = Q_1(x, a)$.

     • If $(x, a) = (x_1, a_1)$, then:

       $$Q^*_{ARP}(\langle x, 1 \rangle, a) = \alpha_1 r_1 + (1 - \alpha_1) Q_0(x, a) = \alpha_1 r_1 = Q_1(x, a)$$

  2. Inductive step: assume $Q^*_{ARP}(\langle x, k-1 \rangle, a) = Q_{k-1}(x, a)$; show the claim for $k$:

     • If $(x, a) \ne (x_k, a_k)$, then:

       $$Q_k(x, a) = Q_{k-1}(x, a) = Q^*_{ARP}(\langle x, k \rangle, a)$$

     • If $(x, a) = (x_k, a_k)$, then:

       $$Q^*_{ARP}(\langle x, k \rangle, a) = \alpha_k \left[ r_k + \gamma \max_{a'} Q_{k-1}(y_k, a') \right] + (1 - \alpha_k) Q_{k-1}(x, a) = Q_k(x, a)$$

  3. Therefore, $Q^*_{ARP}(\langle x, n \rangle, a) = Q_n(x, a)$ for all $n$. Q.E.D. (A sketch of the $Q_n$ recursion appears below.)
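
Property 1 identifies the ARP's optimal values at level $n$ with the one-step Q-learning iterates $Q_n$. For reference, here is a short sketch of that $Q_n$ recursion computed from a recorded run (hypothetical helper names; the update rule is the one appearing in the inductive step above).

```python
from collections import defaultdict

def q_iterates(episode, alpha, gamma, actions):
    """Compute Q_n from a recorded run, with Q_0 identically zero.

    episode: list of tuples (x_k, a_k, r_k, y_k) for k = 1..n.
    alpha:   list such that alpha[k-1] is the learning rate alpha_k.
    """
    Q = defaultdict(float)                                   # Q_0(x, a) = 0
    for k, (x, a, r, y) in enumerate(episode, start=1):
        target = r + gamma * max(Q[(y, b)] for b in actions)
        Q[(x, a)] = (1 - alpha[k - 1]) * Q[(x, a)] + alpha[k - 1] * target
    return Q
```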


Property 2

In the ARP $\{\langle x_i, n_i \rangle\}$, for all $l, s, \epsilon > 0$ there exists $h > l$ such that for all $n_1 > h$,

$$P(n_{s+1} < l) < \epsilon$$

Proof:

Let us first consider a single step, namely the event that the replayed index $n^{i_e}$ falls below $n^{i_l}$, or lower.
In the ARP, starting from $\langle x, h \rangle$ and taking action $a$, the probability of reaching a level lower than $l$ in one step is:

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] = \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right]$$

But note that, with the convention $\alpha_{n^0} = 1$ for the terminating step 0, the sum telescopes:

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] = 1$$

Therefore, using $1 - x < e^{-x}$ (the inequality from the preface),

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}}$$

As long as the learning rates summed along every such subsequence of $\{\alpha_n\}$ diverge (i.e., $\sum_{i} \alpha_{n^i} = +\infty$), then as $h \to \infty$:

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}} \to 0$$

Moreover, since the MDP is finite, the bound above can be made uniform over the finitely many state-action pairs, and we have:

$$\forall l_j \in \mathbb{N}^*,\ \forall \eta_j > 0,\ \exists M_j > 0,\ \forall n_j > M_j,\ \forall X_j, a_j:$$

starting from $\langle X_j, n_j \rangle$, after taking action $a_j$,

$$P(n_{j+1} \ge l_j) \ge 1 - \eta_j$$

Indexing by $j$, we apply this conclusion recursively from step $s$ back to step 1, where $\langle X_{j+1}, n_{j+1} \rangle$ is the state reached from $\langle X_j, n_j \rangle$ after executing $a_j$ and $n_{j+1} \ge l_j$. Choosing the $\eta_j$ small enough, the probability that the final level satisfies $n_{s+1} \ge l = l_s$ is at least:

$$\prod_{j=1}^{s} (1 - \eta_j) \ge 1 - \epsilon$$

Q.E.D. (The vanishing product bound is illustrated numerically in the sketch below.)
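
Numerically, the key one-step bound behaves as claimed; the following sketch (with an illustrative learning-rate schedule $\alpha_i = 1/i$, my own choice) shows the drop-below-level probability $\prod_{i=i_l}^{i_h}(1 - \alpha_{n^i})$ and its exponential upper bound both vanishing as $h$ grows.

```python
import math

alpha = lambda i: 1.0 / i      # divergent-sum schedule, illustrative only
i_l = 10
for i_h in (10**2, 10**4, 10**6):
    product, total = 1.0, 0.0
    for i in range(i_l, i_h + 1):
        product *= 1 - alpha(i)
        total += alpha(i)
    print(f"i_h = {i_h:>7}   product = {product:.3e}   exp(-sum alpha) = {math.exp(-total):.3e}")
```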

Now, define:

$$P_{xy}^{(n)}[a] = \sum_{m=1}^{n-1} P_{\langle x,n \rangle, \langle y,m \rangle}^{ARP}[a]$$
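
Assuming the `arp_step` helper sketched in the definition section is in scope, $P_{xy}^{(n)}[a]$ can be estimated by simple Monte Carlo (again an illustrative sketch with hypothetical names, not part of the original proof):

```python
def estimate_P_xy(n, a, y_target, occurrences, alpha, y, r, trials=100_000):
    """Monte Carlo estimate of P_xy^(n)[a]: the probability that one ARP step
    from <x, n> under a lands in <y_target, m> for some 1 <= m <= n - 1."""
    hits = 0
    for _ in range(trials):
        nxt, _ = arp_step(n, a, occurrences, alpha, y, r)
        if nxt is not None and nxt[0] == y_target and nxt[1] >= 1:
            hits += 1
    return hits / trials
```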

Lemma:
Let $\{\xi_n\}$ be a sequence of bounded random variables with expectation $\mathfrak{E}$, and let $0 \le \beta_n < 1$ satisfy $\sum_{i=1}^{\infty} \beta_i = +\infty$ and $\sum_{i=1}^{\infty} \beta_i^2 < +\infty$.
Define the sequence $X_{n+1} = X_n + \beta_n(\xi_n - X_n)$. Then:

$$P\left( \lim_{n \to \infty} X_n = \mathfrak{E} \right) = 1$$
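
Before the attempt below, here is a quick simulation of the lemma's claim (my own illustrative setup: bounded noise around a known mean, with $\beta_n = 1/(n+1)$, which has divergent sum and summable squares):

```python
import random

random.seed(1)
mean = 0.7
X = 0.0                                     # X_1, arbitrary start
for n in range(1, 200001):
    xi = mean + random.uniform(-1, 1)       # bounded, expectation = mean
    beta = 1.0 / (n + 1)                    # 0 <= beta_n < 1, sum = inf, sum of squares < inf
    X = X + beta * (xi - X)
print(X)                                    # close to 0.7
```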

My attempt:

$$X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n) X_n + \beta_n \xi_n$$

By the induction argument above, we obtain (re-indexing the product):

$$X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n} (1 - \beta_i) = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) + X_1 \prod_{i=1}^{n} (1 - \beta_i)$$

From the corollary of the fundamental theorem (applicable since $0 \le \beta_i < 1$ and $\sum_{i=1}^{\infty} \beta_i = +\infty$):

$$\prod_{i=1}^{\infty} (1 - \beta_i) = 0$$

Hence:

$$\lim_{n \to \infty} X_n = \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) = \frac{ \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - 0}$$

$$= \frac{ \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - \prod_{i=1}^{\infty} (1 - \beta_i)} = \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \cdot \frac{ \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - \prod_{i=1}^{n} (1 - \beta_i)}$$
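
A supplementary step (not in the original attempt) explains why dividing by $1 - \prod_{i=1}^{n}(1-\beta_i)$ is natural: the coefficients of the $\xi_j$ telescope,

$$\sum_{j=1}^{n} \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) = \sum_{j=1}^{n} \left[ \prod_{i=j+1}^{n} (1 - \beta_i) - \prod_{i=j}^{n} (1 - \beta_i) \right] = 1 - \prod_{i=1}^{n} (1 - \beta_i),$$

so the normalized weights $\dfrac{\beta_j \prod_{i=j+1}^{n}(1 - \beta_i)}{1 - \prod_{i=1}^{n}(1 - \beta_i)}$ sum to 1, and the last expression exhibits the limit of $X_n$ as a limit of convex combinations of the $\xi_j$.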

Property 3

$$P\left\{ \lim_{n \to \infty} P_{xy}^{(n)}[a] = P_{xy}[a] \right\} = 1, \qquad P\left[ \lim_{n \to \infty} \mathfrak{R}_{x}^{(n)}(a) = \mathfrak{R}_{x}(a) \right] = 1$$

where $P_{xy}[a]$ and $\mathfrak{R}_x(a)$ are the transition probability and expected immediate reward of the underlying MDP, and $\mathfrak{R}_x^{(n)}(a)$ is the corresponding expected immediate reward in the ARP at level $n$.
