Notations
- input $v$
- output $r$
- weight parameter $W \in \mathbb{R}^{d \times m}$
- activation function $a$
- mask: $m$ for vectors, $M$ for matrices
Dropout
- Randomly set the activations of each layer to zero with probability $1-p$:
$$r = m \circ a(Wv), \qquad m_j \sim \text{Bernoulli}(p).$$
- Since many activation functions satisfy $a(0) = 0$, we have
$$r = a(m \circ Wv).$$
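A minimal NumPy sketch of this formulation, assuming a `tanh` activation (which satisfies $a(0)=0$); the function name, shapes, and keep probability below are illustrative, not from the source:

```python
import numpy as np

def dropout_layer(v, W, p, rng, activation=np.tanh):
    """One layer with dropout: r = m ∘ a(Wv), with m_j ~ Bernoulli(p).

    Here p is the keep probability, so each activation is zeroed
    with probability 1 - p.
    """
    pre = W @ v                             # pre-activation Wv
    m = rng.binomial(1, p, size=pre.shape)  # Bernoulli(p) mask
    # Since tanh(0) = 0, masking before or after the activation
    # is equivalent: m * a(Wv) == a(m * Wv).
    return m * activation(pre)

# Example: d = 4 output units, input dimension 3
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
v = rng.standard_normal(3)
r = dropout_layer(v, W, p=0.8, rng=rng)
```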