The Action Replay Process

Preface

A commonly used inequality

$$-x > \ln(1 - x), \qquad 0 < x < 1$$

Proof: Let $f(x) = \ln(1 - x) + x$ for $0 \le x < 1$. Then $f(0) = 0$, and for $0 < x < 1$,

$$f'(x) = \frac{-1}{1 - x} + 1 = \frac{x}{x - 1} < 0,$$

so $f$ is strictly decreasing on $(0, 1)$. Hence $f(x) < f(0) = 0$, i.e., $-x > \ln(1 - x)$ for $0 < x < 1$. Q.E.D.
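
As a quick numerical sanity check (not part of the proof), the inequality can be spot-checked at a few points of $(0, 1)$:

```python
import math

# Spot-check -x > ln(1 - x) at a few points of (0, 1).
for x in (0.1, 0.5, 0.9, 0.99):
    assert -x > math.log(1 - x)
    print(f"x = {x:<5}  -x = {-x:+.4f}  >  ln(1-x) = {math.log(1 - x):+.4f}")
```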


Fundamental Theorem
If $a_n > -1$, then

$$\prod_{n=1}^\infty (1 + a_n) = 0 \iff \sum_{n=1}^\infty \ln(1 + a_n) = -\infty$$

Proof: Let $P_k = \prod_{n=1}^k (1 + a_n)$; then

$$\ln P_k = \ln\left(\prod_{n=1}^k (1 + a_n)\right) = \sum_{n=1}^k \ln(1 + a_n)$$

Thus,

$$\sum_{n=1}^\infty \ln(1 + a_n) = -\infty \iff \lim_{k \to \infty} \sum_{n=1}^k \ln(1 + a_n) = -\infty \iff \lim_{k \to \infty} \ln P_k = -\infty \iff \lim_{k \to \infty} P_k = 0$$

Q.E.D.
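
To make the equivalence concrete, here is a small numerical illustration (my own example, with $a_n = -\tfrac{1}{n+1}$): the partial products shrink toward 0 exactly as the partial sums of $\ln(1 + a_n)$ run off to $-\infty$.

```python
import math

# With a_n = -1/(n+1), the partial product telescopes to 1/(k+1) -> 0,
# and the partial sum of ln(1 + a_n) equals -ln(k+1) -> -infinity.
P, S = 1.0, 0.0
for k in range(1, 100001):
    a = -1.0 / (k + 1)
    P *= 1 + a
    S += math.log(1 + a)
    if k in (10, 1000, 100000):
        print(f"k = {k:>6}   P_k = {P:.6f}   sum ln(1+a_n) = {S:.3f}")
```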


Corollary
If $0 \le b_n < 1$ and $\sum_{n=1}^\infty b_n = +\infty$, then

$$\prod_{n=1}^\infty (1 - b_n) = 0$$

Proof: Consider the subsequence $\{b_{n_k}\}$ of non-zero terms of $\{b_n\}$. Since $-b_{n_k} > -1$, the fundamental theorem gives:

$$\prod_{n=1}^\infty (1 - b_n) = \prod_{k=1}^\infty (1 - b_{n_k}) = 0 \iff \sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$$

We now show that $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$.
Since $0 < 1 - b_{n_k} < 1$, each term $\ln(1 - b_{n_k})$ is negative, and dropping the zero terms does not change the sum, so $\sum_{k=1}^\infty b_{n_k} = +\infty$. The conclusion is not immediate, so we argue by contradiction.

Assume $\sum_{k=1}^\infty \ln(1 - b_{n_k}) \ne -\infty$. Since every term is negative, the partial sums are decreasing, so this assumption means the series converges, i.e.,

$$\sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty$$

But by the inequality from the preface, $\ln(1 - b_{n_k}) < -b_{n_k}$, so

$$\sum_{k=1}^\infty (-b_{n_k}) = -\infty \ge \sum_{k=1}^\infty \ln(1 - b_{n_k}) > -\infty,$$

a contradiction. Therefore $\sum_{k=1}^\infty \ln(1 - b_{n_k}) = -\infty$, and so

$$\prod_{n=1}^\infty (1 - b_n) = 0$$

Q.E.D.
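
The corollary can also be seen numerically (an illustrative sketch, with sequences of my own choosing): a divergent $\sum b_n$ drives the product to 0, while a convergent one does not.

```python
# Divergent sum (b_n = 1/(2n)): the running product collapses toward 0.
# Convergent sum (b_n = 1/(n+1)^2): the running product stays bounded away from 0.
def running_product(b, N=200000):
    P = 1.0
    for n in range(1, N + 1):
        P *= 1 - b(n)
    return P

print("sum b_n = +inf :", running_product(lambda n: 1 / (2 * n)))       # ~ 0
print("sum b_n < +inf :", running_product(lambda n: 1 / (n + 1) ** 2))  # ~ 0.5
```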


The Essence of Mathematical Truth: Induction
Observe a linear-looking relation, fantasize wildly, then coldly examine whether it is truly valid.

Given $X_1$ and the recursion

$$X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n)X_n + \beta_n \xi_n,$$

show that

$$X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)$$

Proof:

  1. Base case $n = 1$:

     $$X_2 = (1 - \beta_1)X_1 + \beta_1 \xi_1 = \xi_1 \beta_1 + X_1 (1 - \beta_1),$$

     so the formula holds (the empty product multiplying $\xi_1 \beta_1$ is $1$).

  2. Inductive step: assume the formula holds for $n$; prove it for $n + 1$. By the recursion,

     $$X_{n+2} = (1 - \beta_{n+1})X_{n+1} + \beta_{n+1} \xi_{n+1}$$

     Plugging in the inductive hypothesis:

     $$X_{n+2} = (1 - \beta_{n+1})\left[\sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^n (1 - \beta_i)\right] + \beta_{n+1} \xi_{n+1}$$

     $$= \sum_{j=1}^{n+1} \xi_j \beta_j \prod_{i=j}^{n} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n+1} (1 - \beta_i)$$

  3. By induction, the formula holds for all positive integers $n$. Q.E.D. (A numerical check is given in the sketch below.)
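
The closed form is easy to verify numerically; the following sketch (with arbitrary test values for $\beta_n$, $\xi_n$, and $X_1$, chosen only for illustration) compares direct iteration of the recursion with the formula just proved.

```python
import random

random.seed(0)
N = 50
beta = [0.0] + [random.uniform(0.0, 0.99) for _ in range(N)]  # beta[1..N]
xi = [0.0] + [random.gauss(0, 1) for _ in range(N)]           # xi[1..N]
X1 = random.gauss(0, 1)

# Direct iteration of X_{n+1} = (1 - beta_n) X_n + beta_n xi_n.
X = X1
for n in range(1, N + 1):
    X = (1 - beta[n]) * X + beta[n] * xi[n]

def prod(factors):
    p = 1.0
    for f in factors:
        p *= f
    return p

# Closed form, using prod_{i=j}^{N-1}(1 - beta_{i+1}) = prod_{i=j+1}^{N}(1 - beta_i).
closed = sum(xi[j] * beta[j] * prod(1 - beta[i] for i in range(j + 1, N + 1))
             for j in range(1, N + 1)) + X1 * prod(1 - beta[i] for i in range(1, N + 1))

print(abs(X - closed) < 1e-9)  # True
```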


Now, let’s relax for a while — it’s movie time.


1. Definition of Action Replay Process

Given an $n$-step finite MDP with a possibly varying learning rate $\alpha$: at step $i$, the agent is in state $x_i$, takes action $a_i$, receives a random reward $r_i$, and transitions to a new state $y_i$.

The Action Replay Process (ARP) re-examines a given state $x$ and action $a$ within a recorded run of the MDP.

Suppose we focus on state $x$ and action $a$, and consider a run of the MDP consisting of $n$ steps.

We add a step 0, in which the agent immediately terminates and receives reward $Q_0(x, a)$.

During steps 1 to $n$, due to the randomness of the MDP, the agent may take action $a$ in state $x$ at time steps $1 \le n^{i_1} < n^{i_2} < \cdots < n^{i_*} \le n$.

If action $a$ is never taken at $x$ in this episode, the only opportunity for replay is step 0.

When $i_* \ge 1$, to determine the ARP's next reward and state, we sample an index $n^{i_e}$ as follows:

$$n^{i_e} = \begin{cases} n^{i_*}, & \text{with probability } \alpha_{n^{i_*}} \\ n^{i_{*-1}}, & \text{with probability } (1 - \alpha_{n^{i_*}})\,\alpha_{n^{i_{*-1}}} \\ \vdots \\ 0, & \text{with probability } \prod_{i=1}^{i_*}(1 - \alpha_{n^i}) \end{cases}$$

Then, after one ARP step, the state $\langle x, n \rangle$ transitions to $\langle y_{n^{i_e}}, n^{i_e} - 1 \rangle$, and the reward is $r_{n^{i_e}}$.
Clearly $n^{i_e} - 1 < n$, so the level index strictly decreases at every step and the ARP terminates with probability 1; it is a finite process almost surely.

To summarize, the core transition formula is:

$$\langle x, n \rangle \xrightarrow{\;a\;} \langle y_{n^{i_e}},\, n^{i_e} - 1 \rangle, \qquad \text{reward } r_{n^{i_e}}$$
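
As an illustration, here is a minimal sketch of one ARP transition in Python (the function and argument names are my own, not from the original construction): it scans the occurrences of $(x, a)$ from the latest one downwards, accepting step $t$ with probability $\alpha_t$, which reproduces the case probabilities above.

```python
import random

def arp_step(n, a, occurrences, alpha, y, r, Q0=0.0):
    """One ARP transition from level n under action a.

    occurrences: time steps 1 <= n^{i_1} < ... < n^{i_*} <= n at which the
                 recorded MDP run took action a in state x.
    alpha[t], y[t], r[t]: learning rate, successor state, and reward of MDP step t.
    Returns (next_state, reward); next_state is None when the replay bottoms out
    at step 0, where the ARP terminates with reward Q0 = Q_0(x, a).
    """
    for t in sorted((t for t in occurrences if t <= n), reverse=True):
        if random.random() < alpha[t]:
            return (y[t], t - 1), r[t]   # move to <y_t, t - 1> with reward r_t
    return None, Q0                      # all candidates rejected: step 0
```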


2. Properties of the Action Replay Process

We now examine the ARP's properties, particularly in comparison with the underlying MDP. Given the MDP's transition rule and one (non-terminating) recorded run of it, we can construct the corresponding ARP.

Property 1

$$\forall n, x, a: \quad Q^*_{ARP}(\langle x, n \rangle, a) = Q_n(x, a)$$

Proof:
We use induction on $n$:

  1. Base case $n = 1$:

     • If the MDP did not take action $a$ at $x$ in step 1, the ARP gives reward $Q_0(x, a) = 0 = Q_1(x, a)$.

     • If $(x, a) = (x_1, a_1)$, then:

       $$Q^*_{ARP}(\langle x, 1 \rangle, a) = \alpha_1 r_1 + (1 - \alpha_1) Q_0(x, a) = \alpha_1 r_1 = Q_1(x, a)$$

  2. Inductive step: assume $Q^*_{ARP}(\langle x, k-1 \rangle, a) = Q_{k-1}(x, a)$; show the claim for $k$:

     • If $(x, a) \ne (x_k, a_k)$, then:

       $$Q_k(x, a) = Q_{k-1}(x, a) = Q^*_{ARP}(\langle x, k \rangle, a)$$

     • If $(x, a) = (x_k, a_k)$, then:

       $$Q^*_{ARP}(\langle x, k \rangle, a) = \alpha_k \left[ r_k + \gamma \max_{a'} Q_{k-1}(y_k, a') \right] + (1 - \alpha_k) Q_{k-1}(x, a) = Q_k(x, a)$$

  3. Therefore, $Q^*_{ARP}(\langle x, n \rangle, a) = Q_n(x, a)$ for all $n$. Q.E.D. (A sketch of the $Q_n$ recursion appears below.)
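
Property 1 identifies the ARP's optimal values at level $n$ with the one-step Q-learning iterates $Q_n$. For reference, here is a short sketch of that $Q_n$ recursion computed from a recorded run (hypothetical helper names; the update rule is the one appearing in the inductive step above).

```python
from collections import defaultdict

def q_iterates(episode, alpha, gamma, actions):
    """Compute Q_n from a recorded run, with Q_0 identically zero.

    episode: list of tuples (x_k, a_k, r_k, y_k) for k = 1..n.
    alpha:   list such that alpha[k-1] is the learning rate alpha_k.
    """
    Q = defaultdict(float)                                   # Q_0(x, a) = 0
    for k, (x, a, r, y) in enumerate(episode, start=1):
        target = r + gamma * max(Q[(y, b)] for b in actions)
        Q[(x, a)] = (1 - alpha[k - 1]) * Q[(x, a)] + alpha[k - 1] * target
    return Q
```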


Property 2

In the ARP $\{\langle x_i, n_i \rangle\}$, for all $l, s, \epsilon > 0$ there exists $h > l$ such that for all $n_1 > h$,

$$P(n_{s+1} < l) < \epsilon$$

Proof:

Let us first consider a single step, namely the event that the replayed index $n^{i_e}$ falls below $n^{i_l}$, or lower.
In the ARP, starting from $\langle x, h \rangle$ and taking action $a$, the probability of reaching a level lower than $l$ in one step is:

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] = \left[ \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) \right] \sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right]$$

But note that, with the convention $\alpha_{n^0} = 1$ for the terminating step 0, the sum telescopes:

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_l - 1} (1 - \alpha_{n^k}) \right] = 1$$

Therefore, using $1 - x < e^{-x}$ (the inequality from the preface),

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}}$$

As long as the learning rates summed along every such subsequence of $\{\alpha_n\}$ diverge (i.e., $\sum_{i} \alpha_{n^i} = +\infty$), then as $h \to \infty$:

$$\sum_{j=0}^{i_l - 1} \left[ \alpha_{n^j} \prod_{k=j+1}^{i_h} (1 - \alpha_{n^k}) \right] = \prod_{i=i_l}^{i_h} (1 - \alpha_{n^i}) < e^{-\sum_{i=i_l}^{i_h} \alpha_{n^i}} \to 0$$

Moreover, since the MDP is finite, the bound above can be made uniform over the finitely many state-action pairs, and we have:

$$\forall l_j \in \mathbb{N}^*,\ \forall \eta_j > 0,\ \exists M_j > 0,\ \forall n_j > M_j,\ \forall X_j, a_j:$$

starting from $\langle X_j, n_j \rangle$, after taking action $a_j$,

$$P(n_{j+1} \ge l_j) \ge 1 - \eta_j$$

Indexing by $j$, we apply this conclusion recursively from step $s$ back to step 1, where $\langle X_{j+1}, n_{j+1} \rangle$ is the state reached from $\langle X_j, n_j \rangle$ after executing $a_j$ and $n_{j+1} \ge l_j$. Choosing the $\eta_j$ small enough, the probability that the final level satisfies $n_{s+1} \ge l = l_s$ is at least:

$$\prod_{j=1}^{s} (1 - \eta_j) \ge 1 - \epsilon$$

Q.E.D. (The vanishing product bound is illustrated numerically in the sketch below.)
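
Numerically, the key one-step bound behaves as claimed; the following sketch (with an illustrative learning-rate schedule $\alpha_i = 1/i$, my own choice) shows the drop-below-level probability $\prod_{i=i_l}^{i_h}(1 - \alpha_{n^i})$ and its exponential upper bound both vanishing as $h$ grows.

```python
import math

alpha = lambda i: 1.0 / i      # divergent-sum schedule, illustrative only
i_l = 10
for i_h in (10**2, 10**4, 10**6):
    product, total = 1.0, 0.0
    for i in range(i_l, i_h + 1):
        product *= 1 - alpha(i)
        total += alpha(i)
    print(f"i_h = {i_h:>7}   product = {product:.3e}   exp(-sum alpha) = {math.exp(-total):.3e}")
```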

Now, define:

$$P_{xy}^{(n)}[a] = \sum_{m=1}^{n-1} P_{\langle x,n \rangle, \langle y,m \rangle}^{ARP}[a]$$
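
Assuming the `arp_step` helper sketched in the definition section is in scope, $P_{xy}^{(n)}[a]$ can be estimated by simple Monte Carlo (again an illustrative sketch with hypothetical names, not part of the original proof):

```python
def estimate_P_xy(n, a, y_target, occurrences, alpha, y, r, trials=100_000):
    """Monte Carlo estimate of P_xy^(n)[a]: the probability that one ARP step
    from <x, n> under a lands in <y_target, m> for some 1 <= m <= n - 1."""
    hits = 0
    for _ in range(trials):
        nxt, _ = arp_step(n, a, occurrences, alpha, y, r)
        if nxt is not None and nxt[0] == y_target and nxt[1] >= 1:
            hits += 1
    return hits / trials
```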

Lemma:
Let $\{\xi_n\}$ be a sequence of bounded random variables with expectation $\mathfrak{E}$, and let $0 \le \beta_n < 1$ satisfy $\sum_{i=1}^{\infty} \beta_i = +\infty$ and $\sum_{i=1}^{\infty} \beta_i^2 < +\infty$.
Define the sequence $X_{n+1} = X_n + \beta_n(\xi_n - X_n)$. Then:

$$P\left( \lim_{n \to \infty} X_n = \mathfrak{E} \right) = 1$$
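
Before the attempt below, here is a quick simulation of the lemma's claim (my own illustrative setup: bounded noise around a known mean, with $\beta_n = 1/(n+1)$, which has divergent sum and summable squares):

```python
import random

random.seed(1)
mean = 0.7
X = 0.0                                     # X_1, arbitrary start
for n in range(1, 200001):
    xi = mean + random.uniform(-1, 1)       # bounded, expectation = mean
    beta = 1.0 / (n + 1)                    # 0 <= beta_n < 1, sum = inf, sum of squares < inf
    X = X + beta * (xi - X)
print(X)                                    # close to 0.7
```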

My attempt:

$$X_{n+1} = X_n + \beta_n(\xi_n - X_n) = (1 - \beta_n) X_n + \beta_n \xi_n$$

By the induction argument above, we obtain (re-indexing the product):

$$X_{n+1} = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j}^{n-1} (1 - \beta_{i+1}) + X_1 \prod_{i=1}^{n} (1 - \beta_i) = \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) + X_1 \prod_{i=1}^{n} (1 - \beta_i)$$

From the corollary of the fundamental theorem (applicable since $0 \le \beta_i < 1$ and $\sum_{i=1}^{\infty} \beta_i = +\infty$):

$$\prod_{i=1}^{\infty} (1 - \beta_i) = 0$$

Hence:

$$\lim_{n \to \infty} X_n = \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) = \frac{ \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - 0}$$

$$= \frac{ \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - \prod_{i=1}^{\infty} (1 - \beta_i)} = \lim_{n \to \infty} \sum_{j=1}^{n} \xi_j \cdot \frac{ \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) }{1 - \prod_{i=1}^{n} (1 - \beta_i)}$$
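
A supplementary step (not in the original attempt) explains why dividing by $1 - \prod_{i=1}^{n}(1-\beta_i)$ is natural: the coefficients of the $\xi_j$ telescope,

$$\sum_{j=1}^{n} \beta_j \prod_{i=j+1}^{n} (1 - \beta_i) = \sum_{j=1}^{n} \left[ \prod_{i=j+1}^{n} (1 - \beta_i) - \prod_{i=j}^{n} (1 - \beta_i) \right] = 1 - \prod_{i=1}^{n} (1 - \beta_i),$$

so the normalized weights $\dfrac{\beta_j \prod_{i=j+1}^{n}(1 - \beta_i)}{1 - \prod_{i=1}^{n}(1 - \beta_i)}$ sum to 1, and the last expression exhibits the limit of $X_n$ as a limit of convex combinations of the $\xi_j$.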

Property 3

$$P\left\{ \lim_{n \to \infty} P_{xy}^{(n)}[a] = P_{xy}[a] \right\} = 1, \qquad P\left[ \lim_{n \to \infty} \mathfrak{R}_{x}^{(n)}(a) = \mathfrak{R}_{x}(a) \right] = 1$$

where $P_{xy}[a]$ and $\mathfrak{R}_x(a)$ are the transition probability and expected immediate reward of the underlying MDP, and $\mathfrak{R}_x^{(n)}(a)$ is the corresponding expected immediate reward in the ARP at level $n$.
