Notations
- input $v$
- output $r$
- weight parameter $W \in \mathbb{R}^{d \times m}$
- activation function $a$
- mask $m$ for a vector and $M$ for a matrix
Dropout
- Randomly set activations of each layer to zero with probability $1-p$.
$$r = m \circ a(Wv), \qquad m_j \sim \text{Bernoulli}(p).$$
- Since many activation functions have the property that $a(0) = 0$, we equivalently have
$$r = a(m \circ Wv).$$
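To make the masking concrete, here is a minimal NumPy sketch of a training-time dropout forward pass; the function name `dropout_forward` and the choice of ReLU for $a$ are illustrative assumptions, not taken from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(W, v, p=0.5):
    """Training-time dropout: keep each activation with probability p."""
    m = rng.binomial(1, p, size=W.shape[0])  # m_j ~ Bernoulli(p), one entry per output unit
    a = lambda x: np.maximum(x, 0.0)         # ReLU, which satisfies a(0) = 0
    r1 = m * a(W @ v)                        # r = m ∘ a(Wv)
    r2 = a(m * (W @ v))                      # equivalent form, since a(0) = 0
    assert np.allclose(r1, r2)
    return r1

W = rng.normal(size=(4, 3))  # W ∈ R^{d×m} with d = 4, m = 3
v = rng.normal(size=3)       # input v ∈ R^m
print(dropout_forward(W, v))
```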
DropConnect
- Randomly set the weights of each layer to zero with probability $1-p$.
$$r = a((M \circ W)v), \qquad M_{ij} \sim \text{Bernoulli}(p).$$
- Each $M_{ij}$ is drawn independently for each example during training.
- The memory requirement for the masks $M$ grows with the size of each mini-batch, so the implementation needs to be designed carefully (see the sketch below).
- overall model: $f(x;\theta,M)$, where $\theta = \{W_g, W, W_s\}$ ($W_g$ parameterizes the feature extractor producing $v$, and $s(\cdot\,; W_s)$ is the softmax output layer)
$$\begin{aligned} o = \mathbb{E}_M[f(x;\theta,M)] &= \sum_M p(M)\, f(x;\theta,M)\\ &= \frac{1}{|M|}\sum_M s(a((M \circ W)v); W_s) \quad \text{if } p = 0.5, \end{aligned}$$
since with $p = 0.5$ every mask is equally likely, i.e. $p(M) = 1/|M|$.
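Below is a minimal sketch of the DropConnect training-time forward pass for a mini-batch, which also illustrates why mask memory scales with batch size; `dropconnect_forward`, the ReLU activation, and the batch layout are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_forward(W, V, p=0.5):
    """Training-time DropConnect for a mini-batch V of shape (batch, m).

    A fresh mask M (same shape as W) is drawn per example, which is why
    mask memory grows with the mini-batch size."""
    batch = V.shape[0]
    M = rng.binomial(1, p, size=(batch,) + W.shape)  # M_ij ~ Bernoulli(p), per example
    u = np.einsum('bdm,bm->bd', M * W, V)            # u = (M ∘ W) v for each example
    return np.maximum(u, 0.0)                        # r = a((M ∘ W) v) with ReLU as a

W = rng.normal(size=(4, 3))             # W ∈ R^{d×m}
V = rng.normal(size=(8, 3))             # mini-batch of 8 inputs
print(dropconnect_forward(W, V).shape)  # (8, 4): one r per example
```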

- inference (test stage)
$$\begin{aligned} r &= \frac{1}{|M|} \sum_M a((M \circ W)v)\\ r &\approx \frac{1}{Z} \sum_{z=1}^Z r_z \\ &\approx \frac{1}{Z} \sum_{z=1}^Z a(u_z), \end{aligned}$$
where $u_z \sim \mathcal{N}\big(pWv,\; p(1-p)(W \circ W)(v \circ v)\big)$, and $Z$ denotes the number of random samples drawn from the Gaussian distribution.
Idea: each component of the pre-activation $(M \circ W)v$ is a sum of weighted Bernoulli random variables, which is approximated by a Gaussian random variable; this is partially justified by the central limit theorem.
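A minimal sketch of this Gaussian moment-matching inference step, again assuming ReLU for $a$; the function name and the value of $Z$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_inference(W, v, p=0.5, Z=1000):
    """Approximate r = E_M[a((M ∘ W) v)] by moment matching: sample the
    pre-activation u_z from a Gaussian instead of enumerating all masks."""
    mean = p * (W @ v)                        # E[(M ∘ W) v] = p W v
    var = p * (1 - p) * ((W * W) @ (v * v))   # Var[(M ∘ W) v], element-wise
    u = rng.normal(mean, np.sqrt(var), size=(Z, W.shape[0]))  # u_z ~ N(mean, var)
    return np.maximum(u, 0.0).mean(axis=0)    # (1/Z) Σ_z a(u_z) with ReLU as a

W = rng.normal(size=(4, 3))
v = rng.normal(size=3)
print(dropconnect_inference(W, v))
```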
$\textcolor{red}{\text{Limitation}}$: Both techniques are suitable for fully connected layers only.