

Original: Building Rainbow Step by Step with TensorFlow 2.0
Rainbow: Combining Improvements in Deep Reinforcement Learning
Journal: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)
Year: 2017
Institute: DeepMind
Author: Matteo Hessel, Joseph Modayil, Hado van Hasselt
#Deep Reinforcement Learning


This paper examines six main extensions to the DQN algorithm and empirically studies their combination. (It is a good paper that summarizes several important techniques for alleviating the problems that remain in DQN and offers valuable insights into this research area.)
Baseline: Deep Q-Network(DQN) Algorithm Implementation in CS234 Assignment 2


Because traditional tabular methods are not applicable to arbitrarily large state spaces, we turn to approximate solution methods (linear and nonlinear approximators, value-function approximation and policy approximation), whose goal is to find a good approximate solution using limited computational resources. We can use a linear function, a multi-layer artificial neural network, or a decision tree as a parameterized function to approximate the value function or the policy. (For more, read Chapter 9 of Sutton’s book Reinforcement Learning: An Introduction.)

The following methods are all gradient-based value-function approximation methods (the parameters are updated using gradients), and they all use experience replay and a target network to reduce the correlations present in the sequence of observations.
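To make the two shared ingredients concrete, here is a minimal sketch of a uniform replay buffer and a hard target-network update. The names `ReplayBuffer` and `sync_target` are hypothetical, and the "network parameters" are plain Python lists standing in for real tensors; this is an illustration, not the implementation used in the linked code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Transpose a list of transitions into one tuple per field.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params, target_params):
    """Hard update: copy every online parameter into the target network."""
    for name, value in online_params.items():
        target_params[name] = list(value)  # copy, so later online updates do not leak


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(s=t, a=0, r=1.0, s_next=t + 1, done=(t == 4))
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)

online = {"w": [0.1, 0.2]}
target = {"w": [0.0, 0.0]}
sync_target(online, target)
online["w"][0] = 9.9  # the target copy stays unchanged until the next sync
```

Keeping the target parameters frozen between syncs is what makes the bootstrap target change slowly compared with the online network.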


Using a linear function to approximate the value function (in practice, usually the action-value function).
$$\hat{v}(s, w) \doteq w^T x(s) \doteq \sum_{i=1}^{d} w_i x_i(s)$$
Here $w$ is the parameter vector and $x(s)$ is the feature vector representing state $s$; in most cases the state $s$ is the image (frame) observed by the agent. A linear approximator implemented with TensorFlow can therefore be just a fully connected layer.

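import tensorflow as tf
import tensorflow.contrib.layers as layers  # TensorFlow 1.x API
# state: a sequence of images (frames)
inputs = tf.layers.flatten(state)
# scope is used to distinguish q_params from target_q_params;
# activation_fn=None keeps the layer linear (the default activation is ReLU)
out = layers.fully_connected(inputs, num_actions, activation_fn=None, scope=scope, reuse=reuse)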


Deep Q-Network. The main difference between DQN and the linear approximator is the architecture that produces the Q-values: it is nonlinear.

The full algorithm is given in the paper below.

Paper: Human-level control through deep reinforcement learning.

The approximator of DeepMind DQN implemented with tensorflow as described in their Nature paper can be:

import tensorflow as tf
import tensorflow.contrib.layers as layers  # TensorFlow 1.x API
with tf.variable_scope(scope, reuse=reuse):
    # three convolutional layers followed by two fully connected layers
    conv1 = layers.conv2d(state, num_outputs=32, kernel_size=(8, 8), stride=4, activation_fn=tf.nn.relu)
    conv2 = layers.conv2d(conv1, num_outputs=64, kernel_size=(4, 4), stride=2, activation_fn=tf.nn.relu)
    conv3 = layers.conv2d(conv2, num_outputs=64, kernel_size=(3, 3), stride=1, activation_fn=tf.nn.relu)
    full_inputs = layers.flatten(conv3)
    full_layer = layers.fully_connected(full_inputs, num_outputs=512, activation_fn=tf.nn.relu)
    # activation_fn=None: Q-values can be negative, so the output layer must stay linear (the default is ReLU)
    out = layers.fully_connected(full_layer, num_outputs=num_actions, activation_fn=None)

Do DQN from scratch(basic version)


Double DQN. The main difference between DDQN and DQN is the way the target Q-value is calculated.
As a reminder, in Q-learning:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
$$Y_t^{Q} = R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') = R_{t+1} + \gamma Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1},a'))$$

In DQN, the target is computed with the target network parameters $\theta_{i-1}$, which are usually written $\theta_t^-$:
$$Y_t^{DQN} = R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a';\theta_t^-)$$
The problem with deep Q-learning is that, as the DDQN paper puts it, “It is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values”.
The idea of Double Q-learning is to reduce overestimation by decomposing the max operation in the target into action selection and action evaluation:
$$Y_t^{DoubleQ} = R_{t+1} + \gamma Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1},a';\theta_t);\theta_t^-)$$
Implementation with TensorFlow (the minimal possible change to the DQN in CS234 Assignment 2):

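# DQN: the target network both selects and evaluates the next action
q_samp = tf.where(self.done_mask, self.r, self.r + self.config.gamma * tf.reduce_max(target_q, axis=1))
actions = tf.one_hot(self.a, num_actions)
q = tf.reduce_sum(tf.multiply(q, actions), axis=1)
self.loss = tf.reduce_mean(tf.squared_difference(q_samp, q))

# Double DQN: the online network selects the next action, the target network evaluates it.
# next_q is a name assumed here: it must hold the online network's Q-values for the next
# state s' (e.g. the online network applied to the next-state input with reuse=True),
# not the Q-values of the current state.
max_q_idxes = tf.argmax(next_q, axis=1)
max_actions = tf.one_hot(max_q_idxes, num_actions)
q_samp = tf.where(self.done_mask, self.r, self.r + self.config.gamma * tf.reduce_sum(tf.multiply(target_q, max_actions), axis=1))
actions = tf.one_hot(self.a, num_actions)
q = tf.reduce_sum(tf.multiply(q, actions), axis=1)
self.loss = tf.reduce_mean(tf.squared_difference(q_samp, q))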
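The difference between the two targets is easy to see on a tiny batch. The following NumPy sketch uses made-up Q-values (all numbers are illustrative assumptions) and computes both targets side by side.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0])
done = np.array([False, True])

# Q-values at the next state s', shape (batch, num_actions).
online_q_next = np.array([[1.0, 2.0, 0.5],
                          [0.3, 0.1, 0.2]])
target_q_next = np.array([[3.0, 1.5, 0.0],
                          [0.4, 0.6, 0.5]])

# DQN: the target network both selects and evaluates the action.
dqn_bootstrap = target_q_next.max(axis=1)

# Double DQN: the online network selects, the target network evaluates.
best_actions = online_q_next.argmax(axis=1)
ddqn_bootstrap = target_q_next[np.arange(len(best_actions)), best_actions]

# Terminal transitions do not bootstrap.
not_done = 1.0 - done.astype(np.float64)
dqn_target = rewards + gamma * not_done * dqn_bootstrap
ddqn_target = rewards + gamma * not_done * ddqn_bootstrap
```

Because the Double DQN bootstrap value is one entry of the target network's row rather than its maximum, it can never exceed the DQN bootstrap value for the same transition.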

Do Double DQN from scratch(basic version)

4>Prioritized experience replay

Prioritized experience replay. Improve data efficiency by replaying more often the transitions from which there is more to learn.
The full algorithm is given in the paper below.

Paper: Prioritized Experience Replay

  • Prioritizing with Temporal-Difference (TD) Error
    TD error: how far the value is from its next-step bootstrap estimate, $\delta = r + \gamma V(s') - V(s)$,
    where the value $r + \gamma V(s')$ is known as the TD target.
    Experiences with a high-magnitude TD error are therefore replayed more often. TD errors have also been used as a prioritization mechanism for determining where to focus resources, for example when choosing where to explore or which features to select. However, the TD error can be a poor estimate in some circumstances as well, e.g. when rewards are noisy.

  • Stochastic Prioritization
    Greedy prioritization replays high-error transitions too frequently, and the resulting lack of diversity can lead to over-fitting. Stochastic prioritization is introduced to add diversity and to strike a balance between greedy prioritization and uniform random sampling.
    We ensure that the probability of being sampled is monotonic in a transition’s priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, the probability of sampling transition $i$ is
    $$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$
    where $p_i > 0$ is the priority of transition $i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to the uniform case.
    (Note: in a sum-tree replay buffer the value stored for transition $i$ is usually $p_i^{\alpha}$, so drawing a leaf in proportion to its stored value samples transition $i$ with exactly the probability $P(i)$. The same $P(i)$ is then used to calculate the importance-sampling (IS) weight.)

    • proportional prioritization: $p_i = |\delta_i| + \epsilon$
    • rank-based prioritization: $p_i = \frac{1}{rank(i)}$, where $rank(i)$ is the rank of transition $i$ when the transitions are sorted according to $|\delta_i|$.
  • Importance Sampling (IS)
    Prioritized replay introduces bias because it changes the distribution of the updates in an uncontrolled way. This can be corrected by using importance-sampling (IS) weights:
    $$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$
    where $N$ is the number of transitions in the replay buffer. These weights fully compensate for the non-uniform probabilities $P(i)$ if $\beta = 1$. They can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$. For stability reasons, we always normalize the weights by $1 / \max_i w_i$ so that they only scale the update downwards.
    The exponent $\beta$ is annealed from $\beta_0$ to $1$, so the correction is felt most strongly at the end of training; this is because the unbiased nature of the updates in RL is most important near convergence.
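The three pieces above (proportional priorities, stochastic sampling, IS weights) fit in a few lines of NumPy. This is a sketch on a four-transition buffer; the helper names and the values of `alpha`, `beta` and `eps` are illustrative assumptions, and a real buffer would use a sum tree instead of recomputing the full probability vector.

```python
import numpy as np


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = |delta_i| + eps, P(i) = p_i^alpha / sum_k p_k^alpha."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()


def is_weights(probs, beta):
    """w_i = (1 / (N * P(i)))^beta, normalized by max_i w_i so that every w_i <= 1."""
    n = len(probs)
    weights = (1.0 / (n * probs)) ** beta
    return weights / weights.max()


td_errors = np.array([2.0, -0.5, 0.1, 0.0])
probs = per_probabilities(td_errors)
uniform = per_probabilities(td_errors, alpha=0.0)  # alpha = 0 recovers uniform sampling

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=2, p=probs)  # stochastic prioritized sampling
weights = is_weights(probs, beta=0.4)[batch_idx]     # scale each sampled TD error by w_i
```

With `beta=1` the product `w_i * P(i)` is the same for every transition, which is exactly the sense in which the weights fully compensate for non-uniform sampling.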

Do Double DQN with prioritized experience replay from scratch(basic version)

5>Dueling network architecture

Dueling network architecture. Generalize across actions by separately representing state values and action advantages.
The dueling network is a neural network architecture designed for value-based RL; its output is $|A|$-dimensional, one Q-value per action. It features two streams of computation, the state value stream and the action advantage stream, which share a convolutional encoder and are merged by a special aggregator layer.

The aggregator can be expressed as:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)$$
where $\theta$, $\beta$ and $\alpha$ are, respectively, the parameters of the shared convolutional encoder, the value stream, and the action advantage stream.
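As a small numeric check of the aggregator, the sketch below combines a value column and an advantage matrix in NumPy. The function name `dueling_q` and the input numbers are assumptions made for illustration.

```python
import numpy as np


def dueling_q(value, advantage):
    """Combine V(s), shape (batch, 1), and A(s, a), shape (batch, |A|), into Q(s, a)."""
    return value + advantage - advantage.mean(axis=1, keepdims=True)


value = np.array([[2.0], [-1.0]])
advantage = np.array([[1.0, 3.0, 2.0],
                      [0.0, 0.0, 6.0]])
q = dueling_q(value, advantage)
# Adding a constant to every advantage leaves Q unchanged.
q_shifted = dueling_q(value, advantage + 5.0)
```

Subtracting the mean advantage makes the decomposition identifiable: the mean of each row of `q` equals the corresponding state value, and a constant shift of the advantages has no effect on the output.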
The details of dueling network architecture for Atari:

Since both the value and the advantage stream propagate gradients to the last convolutional layer in the backward pass, we rescale the combined gradient entering the last convolutional layer by $1/\sqrt{2}$. This simple heuristic mildly increases stability. In addition, we clip the gradients to have their norm less than or equal to $10$.

Other tricks:

  • Human Starts: Using 100 starting points sampled from a human expert’s trajectory.
  • Saliency maps: To better understand the roles of the value and the advantage streams.

Do Dueling Double DQN with prioritized experience replay from scratch(basic version)

6>Multi-step bootstrapping

Multi-step bootstrap targets. Shift the bias-variance tradeoff and help to propagate newly observed rewards faster to earlier visited states.
The best methods are often intermediate between the two extremes. The n-step TD learning method lies between Monte Carlo and one-step TD methods.

  • Monte Carlo methods perform an update for each state based on the entire sequence of observed rewards from that state until the end of the episode:
    $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$

  • The update of one-step TD methods (also called TD(0) methods), on the other hand, is based on just the one next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards:
    $$G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})$$

  • n-step TD methods strike a tradeoff between the two: they update each state after n time steps, based on the next n rewards, bootstrapping from the value of the state n steps later as a proxy for the remaining rewards:
    $$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$$
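All three returns above are the same computation with a different number of real rewards before the bootstrap. A minimal pure-Python sketch, with a hypothetical helper `n_step_return` and made-up rewards and value estimates:

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G_{t:t+n} = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards holds [R_{t+1}, ..., R_{t+n}]; bootstrap_value is the estimate
    V(S_{t+n}), or 0.0 if the episode has terminated by then.
    """
    g = bootstrap_value
    # Work backwards so that each step applies exactly one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


episode_rewards = [1.0, 0.0, 2.0, 1.0]   # R_{t+1} .. R_T
values_after = [0.5, 1.5, 0.8, 0.0]      # V(S_{t+1}) .. V(S_T); the terminal value is 0
gamma = 0.9

td0 = n_step_return(episode_rewards[:1], values_after[0], gamma)       # one-step TD target
two_step = n_step_return(episode_rewards[:2], values_after[1], gamma)  # n = 2
monte_carlo = n_step_return(episode_rewards, 0.0, gamma)               # full return, no bootstrap
```

A larger n relies more on observed rewards (higher variance, less bias from the value estimate), while a smaller n relies more on the bootstrap.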

We know that Q-learning is a kind of TD learning. All the implementations so far are based on one-step TD(0) updates. Now we are going to implement an n-step deep Q-learning method; the main difference is how the target Q-value is calculated.
In one-step DQN, the target is:
$$q_{target} = R_{t+1} + \gamma_{t+1} \max_{a'} q_{\bar{\theta}}(S_{t+1}, a')$$

In one-step Double DQN, the target is:
$$q_{target} = R_{t+1} + \gamma_{t+1} q_{\bar{\theta}}(S_{t+1}, \arg\max_{a'} q_{\theta}(S_{t+1}, a'))$$

In multi-step Double DQN, the target is built from the truncated n-step return:
$$R^{(n)}_t = \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}$$

$$q_{target} = R^{(n)}_t + \gamma_t^{(n)} q_{\bar{\theta}}(S_{t+n}, \arg\max_{a'} q_{\theta}(S_{t+n}, a'))$$
where $\gamma_t^{(k)}$ is the discount accumulated over $k$ steps (simply $\gamma^k$ for a constant discount).
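In a replay-based agent, the multi-step target is usually obtained by storing n-step transitions instead of one-step ones. The sketch below is one possible way to build them, assuming a constant discount; the helper names (`summarize`, `make_n_step_transitions`, `double_q_target`) and the toy trajectory are hypothetical.

```python
from collections import deque


def summarize(window, gamma):
    """Collapse a window of one-step transitions into one multi-step transition."""
    ret, discount = 0.0, 1.0
    for _, _, r, _, _ in window:
        ret += discount * r   # R_t^(n) = sum_k gamma^k R_{t+k+1}
        discount *= gamma     # ends as gamma^k, the discount applied to the bootstrap
    s, a = window[0][0], window[0][1]
    s_last, done = window[-1][3], window[-1][4]
    return (s, a, ret, s_last, done, discount)


def make_n_step_transitions(trajectory, n, gamma):
    """Turn (s, a, r, s_next, done) steps into (s, a, R^(n), s_{t+k}, done, gamma^k) tuples."""
    out, window = [], deque()
    for step in trajectory:
        window.append(step)
        if len(window) == n:
            out.append(summarize(window, gamma))
            window.popleft()
        if step[4]:
            # The episode ended: flush the remaining, shorter windows.
            while window:
                out.append(summarize(window, gamma))
                window.popleft()
    return out


def double_q_target(ret, discount, done, online_q_next, target_q_next):
    """Multi-step Double DQN target for a single transition."""
    if done:
        return ret  # no bootstrap after a terminal state
    a_star = max(range(len(online_q_next)), key=online_q_next.__getitem__)
    return ret + discount * target_q_next[a_star]


trajectory = [(t, 0, 1.0, t + 1, t == 3) for t in range(4)]
transitions = make_n_step_transitions(trajectory, n=3, gamma=0.5)
target = double_q_target(1.75, 0.125, False, [0.2, 0.9], [4.0, 8.0])
```

Storing the accumulated discount next to each transition keeps the target correct for the shorter windows near the end of an episode, where fewer than n rewards are available.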

(The algorithm looks easy to implement and stable, but in practice it fluctuates a lot and seems sensitive to the learning rate when used to train an agent on CartPole-v0. If you try this model, you may need to pay a little more attention to this.)

Do Multi-Step Dueling Double DQN with prioritized experience replay from scratch(basic version)

7>Distributional Q-learning

Distributional Q-learning. Learn a categorical distribution of discounted returns, instead of its expectation.
In Q-learning:
$$Q(s, a) = \sum_{i=0}^{n} p_{r_i} r_i(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \max_{a' \in A(s')} Q(s', a')$$
$$Q(s, a) = E_{s, a}[ r(s, a) ] + \gamma E_{s, a, s'}[ \max_{a' \in A(s')} Q(s', a') ]$$
$$Q(s, a) = E_{s, a, s'}[ r(s, a) + \gamma \max_{a' \in A(s')} Q(s', a') ]$$
where $Q(s, a)$ is the expectation of the discounted returns.
Now, in distributional RL, instead of calculating the expectation, we work directly with the full distribution of the returns obtained from state $s$ and action $a$ when following the current policy $\pi$, denoted by a random variable $Z(s, a)$.

The distribution is represented by $N$ atoms $z_i$, each carrying a probability $p_i(s, a)$, i.e. by the pairs $(z_i, p_i(s, a))$. We assume that the returns range from $V_{min}$ to $V_{max}$, so the atoms are evenly spaced with $z_i - z_{i-1} = \Delta z = (V_{max} - V_{min}) / (N - 1)$. Now, for each state-action pair $(s, a)$, there is a corresponding distribution of its returns rather than a single expected value. We calculate the action value of $(s, a)$ as $Q(s, a) = E[Z(s, a)]$. Even though we still use the expected value, what we optimize is the distance between distributions:
$$\sup_{s, a} dist\big(R(s, a) + \gamma Z_{\bar{\theta}}(s', a^*), Z_{\theta}(s, a)\big)$$
$$a^* = \arg\max_{a'} Q(s', a') = \arg\max_{a'} E[Z(s', a')]$$
The difference is obvious: we still use a deep neural network for function approximation, but in traditional DQN the output for each input state $s$ is an $|A|$-dimensional vector in which each element corresponds to an action value $q(s, a)$, whereas now the output for each input state $s$ is an $|A| \times N$ matrix in which each row is an $N$-dimensional vector representing the return distribution $Z(s, a)$. We then calculate the action value of $(s, a)$ through:
$$q(s, a) = E[Z(s, a)] = \sum_{i=1}^{N} p_i(s, a) z_i$$
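To see how the action values are read off the distributional output, here is a NumPy sketch for a single state. It assumes $N$ evenly spaced atoms that include both $V_{min}$ and $V_{max}$; the values of `v_min`, `v_max` and `n_atoms` and the random logits (a stand-in for the network output) are illustrative assumptions.

```python
import numpy as np

v_min, v_max, n_atoms = -10.0, 10.0, 51
support = np.linspace(v_min, v_max, n_atoms)   # the atoms z_1 .. z_N
delta_z = (v_max - v_min) / (n_atoms - 1)      # spacing between neighbouring atoms


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
num_actions = 3
logits = rng.normal(size=(num_actions, n_atoms))  # stand-in for the |A| x N network output
probs = softmax(logits, axis=1)                   # each row is p(s, a) over the atoms
q_values = probs @ support                        # q(s, a) = sum_i p_i(s, a) z_i
greedy_action = int(np.argmax(q_values))
```

The greedy action is still chosen from the expected values, so action selection works exactly as in DQN once `q_values` is available.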
KL Divergence
Now, we need to minimize the distance between the current distribution and the target distribution.
Note: the following content is mainly taken from a great blog post.
If $p$ and $q$ are two distributions with the same support (i.e. their pdfs are non-zero at the same points), then their KL divergence is defined as follows:
$$KL(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$
$$KL(p||q) = \sum_{i=1}^{N} p(x_i) \log\frac{p(x_i)}{q(x_i)} = \sum_{i=1}^{N} p(x_i) [ \log p(x_i) - \log q(x_i) ]$$
“Now say we’re using DQN and extract $(s, a, r, s')$ from the replay buffer. A “sample of the target distribution” is $r + \gamma Z_{\bar{\theta}}(s', a^*)$. We want to move $Z_{\theta}(s, a)$ towards this target (by keeping the target fixed).”
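One practical detail is that the shifted atoms $r + \gamma z_i$ generally no longer lie on the fixed support, so the target has to be projected back onto the atoms before it can be compared with the prediction. The NumPy sketch below follows the usual categorical projection for a single transition; the function name and the toy inputs are assumptions for illustration, not the code from the linked implementation.

```python
import numpy as np


def project_distribution(next_probs, reward, done, gamma, support):
    """Project the distribution of r + gamma * z onto the fixed support.

    next_probs: shape (N,), p(s', a*) from the target network.
    Returns m, shape (N,), the projected target distribution.
    """
    v_min, v_max = support[0], support[-1]
    n_atoms = len(support)
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Shift and shrink every atom; a terminal transition collapses onto the reward.
    tz = np.clip(reward + (0.0 if done else gamma) * support, v_min, v_max)
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)  # fractional index of each atom
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    m = np.zeros(n_atoms)
    # Split each atom's mass between its two neighbours in proportion to proximity.
    np.add.at(m, lower, next_probs * (upper - b))
    np.add.at(m, upper, next_probs * (b - lower))
    # When b is an integer, lower == upper and both weights above are zero,
    # so the full mass is assigned explicitly.
    exact = lower == upper
    np.add.at(m, lower[exact], next_probs[exact])
    return m


support = np.linspace(0.0, 4.0, 5)                  # atoms 0, 1, 2, 3, 4
next_probs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # all mass on z = 2
m = project_distribution(next_probs, reward=0.5, done=False, gamma=0.5, support=support)
```

In this example the single atom at 2 moves to 0.5 + 0.5 * 2 = 1.5, which lies halfway between the atoms 1 and 2, so its mass is split evenly between them and the mean of the target is preserved.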

Let $m$ be the target distribution after it has been projected onto the atoms $z_i$, and $p_{\theta}$ the distribution predicted for $(s, a)$. Then, their KL loss is:
$$KL(m||p_{\theta}) = \sum_{i=1}^{N} m_i \log\frac{m_i}{p_{\theta, i}} = \sum_{i=1}^{N} m_i [ \log m_i - \log p_{\theta, i} ] = H(m, p_{\theta}) - H(m)$$
The gradient of the KL loss is:
$$\nabla_{\theta} KL(m||p_{\theta}) = \nabla_{\theta} \sum_{i=1}^{N} m_i \log\frac{m_i}{p_{\theta, i}} = \nabla_{\theta} [ H(m, p_{\theta}) - H(m) ] = \nabla_{\theta} H(m, p_{\theta})$$
because the entropy $H(m)$ of the fixed target does not depend on $\theta$. So, we can just use the cross-entropy:
$$H(m, p_{\theta}) = - \sum_{i=1}^{N} m_i \log p_i(s, a; \theta)$$
as the loss function.
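The identity behind this simplification can be checked numerically. In the sketch below, the two distributions are made-up examples and the small `eps` is an assumption added only to avoid `log(0)`.

```python
import numpy as np


def cross_entropy(m, p, eps=1e-12):
    """H(m, p) = -sum_i m_i log p_i; eps guards against log(0)."""
    return -np.sum(m * np.log(p + eps))


def kl_divergence(m, p, eps=1e-12):
    """KL(m || p) = sum_i m_i (log m_i - log p_i), treating 0 log 0 as 0."""
    mask = m > 0
    return np.sum(m[mask] * (np.log(m[mask]) - np.log(p[mask] + eps)))


m = np.array([0.0, 0.5, 0.5, 0.0, 0.0])   # projected target distribution (held fixed)
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # predicted distribution p_theta(s, a)

entropy_m = cross_entropy(m, m)                   # H(m), independent of theta
gap = cross_entropy(m, p) - kl_divergence(m, p)   # equals H(m) up to rounding
```

Since the gap between the two losses is the constant `entropy_m`, minimizing the cross-entropy with respect to the prediction moves the parameters in the same direction as minimizing the KL divergence.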

The total algorithm is as follows:

Do Distributional RL Based on Multi-Step Dueling Double DQN with Prioritized Experience Replay from scratch(basic version)
I am really sorry to say that this is actually a failed implementation, provided only as a reference, but I still hope it will be helpful to someone, and I promise I will try my best to fix it. Furthermore, I really hope someone can check my code, find what is wrong, or even become a contributor and help make it work. Thanks a lot.

8>Noisy DQN

Noisy DQN. Use stochastic network layers for exploration.
So far, the exploration methods we have used are all ε-greedy, but in games such as Montezuma’s Revenge, where many actions must be executed to collect the first reward, the limitations of exploring with ε-greedy policies are clear. Noisy Nets propose a noisy linear layer that combines a deterministic stream and a noisy stream.
A normal linear layer with $p$ inputs and $q$ outputs is represented by:
$$y = wx + b$$
A noisy linear layer is:
$$y = (\mu^w + \sigma^w \odot \epsilon^w)x + (\mu^b + \sigma^b \odot \epsilon^b)$$
where $\mu^w + \sigma^w \odot \epsilon^w$ and $\mu^b + \sigma^b \odot \epsilon^b$ replace $w$ and $b$, respectively. The parameters $\mu^w \in R^{q \times p}$, $\mu^b \in R^q$, $\sigma^w \in R^{q \times p}$ and $\sigma^b \in R^q$ are learnable, whereas $\epsilon^w \in R^{q \times p}$ and $\epsilon^b \in R^q$ are noise random variables. There are two kinds of Gaussian noise:

  • Independent Gaussian Noise:
    The noise applied to each weight and bias is independent: each entry $\epsilon^w_{i,j}$ (respectively each entry $\epsilon^b_j$) of the random matrix $\epsilon^w$ (respectively of the random vector $\epsilon^b$) is drawn from a unit Gaussian distribution. This means that for each noisy linear layer, there are $pq + q$ noise variables (for $p$ inputs to the layer and $q$ outputs).

  • Factorised Gaussian Noise:
    By factorising $\epsilon^w_{i,j}$, we can use $p$ unit Gaussian variables $\epsilon_i$ for the noise of the inputs and $q$ unit Gaussian variables $\epsilon_j$ for the noise of the outputs (thus $p + q$ unit Gaussian variables in total). Each $\epsilon^w_{i,j}$ and $\epsilon^b_j$ can then be written as:
    $$\epsilon^w_{i,j} = f(\epsilon_i) f(\epsilon_j)$$
    $$\epsilon^b_j = f(\epsilon_j)$$
    where $f$ is a real-valued function. The Noisy Nets paper uses $f(x) = sgn(x) \sqrt{|x|}$. Note that for the bias $\epsilon^b_j$ one could have set $f(x) = x$, but the authors decided to keep the same output noise for weights and biases.
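A factorised noisy layer can be sketched in NumPy as follows (forward pass only, no gradient computation). The class name `NoisyLinear` is hypothetical, and the initialisation (uniform $\mu$ in $\pm 1/\sqrt{p}$, constant $\sigma = \sigma_0/\sqrt{p}$ with $\sigma_0 = 0.5$) is an assumption based on what the Noisy Nets paper suggests for the factorised variant.

```python
import numpy as np


def f(x):
    """Noise-shaping function f(x) = sgn(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))


class NoisyLinear:
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b) with factorised noise."""

    def __init__(self, p, q, sigma0=0.5, seed=0):
        self.p, self.q = p, q
        self.rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(p)
        self.mu_w = self.rng.uniform(-bound, bound, size=(q, p))
        self.mu_b = self.rng.uniform(-bound, bound, size=q)
        self.sigma_w = np.full((q, p), sigma0 / np.sqrt(p))
        self.sigma_b = np.full(q, sigma0 / np.sqrt(p))
        self.reset_noise()

    def reset_noise(self):
        eps_in = f(self.rng.standard_normal(self.p))    # p input noise variables
        eps_out = f(self.rng.standard_normal(self.q))   # q output noise variables
        self.eps_w = np.outer(eps_out, eps_in)          # eps_w[j, i] = f(eps_j) * f(eps_i)
        self.eps_b = eps_out

    def __call__(self, x, noisy=True):
        w = self.mu_w + (self.sigma_w * self.eps_w if noisy else 0.0)
        b = self.mu_b + (self.sigma_b * self.eps_b if noisy else 0.0)
        return x @ w.T + b


layer = NoisyLinear(p=4, q=3)
x = np.ones((2, 4))
y_noisy = layer(x)               # uses the current noise sample
y_mean = layer(x, noisy=False)   # deterministic stream only
```

Only `p + q` Gaussian samples are drawn per reset, yet every weight receives its own noise value through the outer product.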

The full algorithm is given in the paper below.
Paper: Noisy Networks for Exploration

Do Noisy Network Based on Multi-Step Dueling Double DQN with Prioritized Experience Replay from scratch(basic version)


Finally, we get the integrated agent: Rainbow. It uses a multi-step distributional loss:
$$D_{KL}(\Phi_z d_t^{(n)} || d_t)$$
where $\Phi_z$ is the projection onto $z$, and the target distribution $d_t^{(n)}$ is:
$$d_t^{(n)} = \big(R_t^{(n)} + \gamma_t^{(n)} z, \; p_{\bar{\theta}}(S_{t+n}, a^{*}_{t+n})\big)$$
Using double Q-learning, the greedy action $a^*_{t+n}$ in $S_{t+n}$ is selected by the online network and evaluated by the target network.

Rainbow prioritizes transitions by the KL loss instead of the absolute TD error, which may be more robust to noisy stochastic environments because the loss can continue to decrease even when the returns are not deterministic.
$$p_t \propto \big(D_{KL}(\Phi_z d_t^{(n)} || d_t)\big)^w$$

The network architecture is a dueling network architecture adapted for use with return distributions. The network has a shared representation $f_{\xi}(s)$, which is then fed into a value stream $v_{\eta}$ with $N_{atoms}$ outputs, and into an advantage stream $a_{\psi}$ with $N_{atoms} \times N_{actions}$ outputs, where $a_{\psi}^i(f_{\xi}(s), a)$ denotes the output corresponding to atom $i$ and action $a$. For each atom $z^i$, the value and advantage streams are aggregated, as in dueling DQN, and then passed through a softmax layer to obtain the normalised parametric distributions used to estimate the returns’ distributions:
$$p_{\theta}^i(s, a) = \frac{\exp(v_{\eta}^i(\phi) + a_{\psi}^i(\phi, a) - \bar{a}_{\psi}^i(s))}{\sum_j \exp(v_{\eta}^j(\phi) + a_{\psi}^j(\phi, a) - \bar{a}_{\psi}^j(s))}$$
where $\phi = f_{\xi}(s)$, and $\bar{a}_{\psi}^i(s) = \frac{1}{N_{actions}} \sum_{a'} a_{\psi}^i(\phi, a')$.
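The aggregation above can be sketched for a single state in NumPy. The function name `rainbow_head` and the random stream outputs are assumptions standing in for the real value and advantage streams.

```python
import numpy as np


def rainbow_head(value_logits, advantage_logits):
    """Distributional dueling aggregation for one state.

    value_logits:     shape (N_atoms,), output of the value stream
    advantage_logits: shape (N_actions, N_atoms), output of the advantage stream
    Returns p with shape (N_actions, N_atoms); each row sums to 1 over the atoms.
    """
    # For each atom, subtract the mean advantage over the actions.
    centred = advantage_logits - advantage_logits.mean(axis=0, keepdims=True)
    logits = value_logits[None, :] + centred
    # Softmax over the atoms of each action (shifted by the max for stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_actions, n_atoms = 3, 5
value_logits = rng.normal(size=n_atoms)
advantage_logits = rng.normal(size=(n_actions, n_atoms))
p = rainbow_head(value_logits, advantage_logits)

# With all-zero advantages every action gets the same return distribution.
p_same = rainbow_head(value_logits, np.zeros((n_actions, n_atoms)))
```

Note that the mean is taken over the actions separately for each atom, while the softmax normalises over the atoms separately for each action.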

Then replace all linear layers with their noisy equivalents (the factorised Gaussian noise version).

Done, and thanks for reading, I hope it could be helpful to someone.
Any suggestion is more than welcome, thanks again.


