Training Neural Networks (Part 1): Activation Functions and Data Preprocessing
Activation Functions
Sigmoid
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
- Squashes numbers to range [0,1]
- Historically popular
3 problems:
- Saturated neurons kill the gradient
- Sigmoid outputs are not zero-centered
- exp() is a bit computationally expensive
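To make the saturation problem concrete, here is a minimal numpy sketch (the helper names are just for illustration): the local gradient $\sigma(x)(1-\sigma(x))$ is at most 0.25 and goes to essentially zero once the neuron saturates.

```python
import numpy as np

def sigmoid(x):
    # squash inputs to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient: sigmoid(x) * (1 - sigmoid(x)), peaks at 0.25 around x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))       # [~0.00  0.27  0.50  0.73  ~1.00]  -> all outputs positive (not zero-centered)
print(sigmoid_grad(x))  # [~0.00  0.20  0.25  0.20  ~0.00]  -> saturated ends kill the gradient
```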
tanh(x)
- Squashes numbers to range [-1, 1]
- zero-centered 😃
- still kills gradients when saturated 😦
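A quick numpy check of both points (values rounded, purely illustrative): the outputs straddle zero, but the local gradient $1 - \tanh^2(x)$ still vanishes at the saturated ends.

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.tanh(x))           # [-1.00 -0.76  0.00  0.76  1.00]  -> zero-centered outputs
print(1.0 - np.tanh(x)**2)  # [~0.00  0.42  1.00  0.42 ~0.00]  -> gradient still dies when saturated
```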
ReLU
$$f(x) = \max(0, x)$$
- Does not saturate 😃
- very computationally efficient 😃
- Converges much faster than sigmoid/tanh in practice 😃
- Actually more biologically plausible than sigmoid 😃
problems:
- Not zero-centered output
- Dead ReLUs: a unit that only ever receives x ≤ 0 gets zero gradient and may never update again (the Leaky ReLU below addresses this)
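A small numpy sketch of the forward pass and local gradient (helper names are illustrative): the gradient is 1 for every positive input, so there is no saturation on that side, but a unit stuck in the x ≤ 0 regime receives no gradient at all.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # local gradient: 1 where x > 0 (no saturation), exactly 0 where x <= 0
    return (x > 0).astype(x.dtype)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x))       # [ 0.  0.  0.  1. 10.]  -> outputs are never negative (not zero-centered)
print(relu_grad(x))  # [ 0.  0.  0.  1.  1.]  -> no gradient in the x <= 0 regime (dead ReLU)
```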
Leaky ReLU
$$f(x) = \max(0.01x, x)$$
Exponential Linear Units (ELU)
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(\exp(x) - 1) & \text{if } x \leq 0 \end{cases}$$
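A minimal numpy sketch of both fixes (the 0.01 slope and $\alpha = 1$ are common defaults, chosen here only for illustration): each keeps some signal for negative inputs, so units are less likely to die.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # small negative slope keeps a nonzero gradient for x < 0
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # linear for x > 0, smoothly saturates toward -alpha for very negative x
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))  # [-0.10  -0.01  0.  1.  10.]
print(elu(x))         # [-1.00  -0.63  0.  1.  10.]
```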
Maxout Neuron
$$\max(w_1^T x + b_1, \; w_2^T x + b_2)$$
- doubles the number of parameters per neuron 😦
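A sketch of one maxout layer in plain numpy (the input and output sizes are arbitrary, picked only for illustration); the two weight matrices below are exactly where the parameter count doubles.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                # one input with 4 features

# two independent affine branches -> twice the parameters of a single linear layer
W1, b1 = rng.standard_normal((4, 3)), np.zeros(3)
W2, b2 = rng.standard_normal((4, 3)), np.zeros(3)

out = np.maximum(x @ W1 + b1, x @ W2 + b2)  # elementwise max of the two affine maps
print(out)  # 3 output units, no extra nonlinearity applied on top
```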
Data Preprocessing
Preprocess the data
- zero-centered data
- normalized data
- PCA
- Whitening
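A compact numpy sketch of all four steps on toy data (the small epsilon in the whitening step is a standard safeguard against division by zero, not part of the notes above):

```python
import numpy as np

X = np.random.randn(1000, 50)        # toy data: N = 1000 examples, D = 50 features

# zero-centered data: subtract the per-feature mean
X = X - np.mean(X, axis=0)

# normalized data: divide by the per-feature standard deviation
X = X / np.std(X, axis=0)

# PCA: rotate the centered data into the eigenbasis of its covariance matrix (decorrelates features)
cov = X.T @ X / X.shape[0]
U, S, Vt = np.linalg.svd(cov)
X_pca = X @ U

# Whitening: additionally scale each dimension to unit variance
X_white = X_pca / np.sqrt(S + 1e-5)
```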
Weight Initialization
- First idea: Small random numbers
(Gaussian with zero mean and 1e-2 standard deviation)
`W = 0.01 * np.random.randn(D, H)`
Works okay for small networks, but problems with deeper networks: with tanh-like units the activations collapse toward zero in the later layers, and the gradients shrink with them.
- Xavier initialization
`W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)`
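To see why the `1 / np.sqrt(fan_in)` scaling matters, here is a small experiment in the spirit of the lecture (depth and layer width are arbitrary choices): push random data through a deep tanh network and track the standard deviation of the activations under each initialization.

```python
import numpy as np

def activation_stds(init, depth=10, width=500):
    # push random data through a deep tanh net and record each layer's activation std
    x = np.random.randn(1000, width)
    stds = []
    for _ in range(depth):
        W = init(x.shape[1], width)
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

small  = lambda fan_in, fan_out: 0.01 * np.random.randn(fan_in, fan_out)
xavier = lambda fan_in, fan_out: np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

print(activation_stds(small))   # stds collapse toward zero layer after layer
print(activation_stds(xavier))  # stds stay in a healthy range through all layers
```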