Introduction
- Classification/regression $\iff$ a function $f_i : \mathbb{R}^d \to \mathbb{R}$
- Classes, probabilities, one-hot embedding, softmax + cross entropy.
- Losses (a quick PyTorch sketch follows this group)
- mean squared error
- mean absolute error
- cross entropy (for probabilities in $[0,1]$)
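A minimal PyTorch sketch of the three losses; values and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# Regression-style prediction and target for MSE/MAE.
pred = torch.tensor([0.2, 0.7, 0.1])
target = torch.tensor([0.0, 1.0, 0.0])
mse = F.mse_loss(pred, target)  # mean squared error
mae = F.l1_loss(pred, target)   # mean absolute error

# Cross entropy takes raw logits plus a class index (here class 1),
# applying softmax internally; equivalent to one-hot + log-softmax.
logits = torch.tensor([[1.0, 3.0, 0.5]])
ce = F.cross_entropy(logits, torch.tensor([1]))
print(mse.item(), mae.item(), ce.item())
```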
- Gradient descent
- zero gradient: local minimum vs. saddle point; gradients extremely close to zero.
- $\nabla L(\theta_t) = \left( \frac{\partial L}{\partial \theta} \Big|_{\theta=\theta_t} \right)$
- vanilla
$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$
$\theta_{t+1} = \theta_t + G_t$, $G_t = -\eta \nabla L(\theta_t)$.
- momentum
$\theta_{t+1} = \theta_t - \eta \sum\limits_{s=1}^{t} \lambda^{t-s} \nabla L(\theta_s)$
$\theta_{t+1} = \theta_t + G_t$, $G_t = \lambda G_{t-1} - \eta \nabla L(\theta_t)$.
- adam = rmsprop + momentum (see the optimizer sketch after this group)
- adaptive learning rate
$\theta_{t+1} = \theta_t + \frac{1}{H_t} G_t$
- adagrad
$H_t = \sqrt{\frac{1}{t} \sum\limits_{s=1}^{t} |\nabla L(\theta_s)|^2}$
- rmsprop
$H_t = \sqrt{\alpha^t |\nabla L(\theta_1)|^2 + (1-\alpha) \sum\limits_{s=1}^{t} \alpha^{t-s} |\nabla L(\theta_s)|^2}$
${H_t}^2 = \alpha {H_{t-1}}^2 + (1-\alpha) |\nabla L(\theta_t)|^2$
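A sketch tying the update rules above together in plain PyTorch, i.e. an adam-like combination of momentum and rmsprop; the toy quadratic loss and hyperparameters are illustrative assumptions:

```python
import torch

# Toy objective L(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = torch.tensor([2.0, -3.0])
eta, lam, alpha, eps = 0.1, 0.9, 0.99, 1e-8  # illustrative hyperparameters
G = torch.zeros_like(theta)    # momentum buffer G_t
H2 = torch.zeros_like(theta)   # rmsprop running average of squared gradients

for t in range(500):
    grad = theta                              # nabla L(theta_t) for the toy loss
    G = lam * G - eta * grad                  # momentum: G_t = lam G_{t-1} - eta grad
    H2 = alpha * H2 + (1 - alpha) * grad**2   # rmsprop: H_t^2 recursion
    theta = theta + G / (H2.sqrt() + eps)     # adaptive step: theta_t + G_t / H_t
print(theta)  # should have moved close to the minimum at 0
```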
- learning rate scheduling
- decay ↘
- warm up ↗ ↘
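A warm-up-then-decay schedule sketched with torch.optim.lr_scheduler.LambdaLR; the warm-up length and decay shape are illustrative assumptions:

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
warmup_steps = 10  # illustrative

def schedule(step):
    # Multiplier on the base lr: linear warm-up, then 1/sqrt decay.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return (warmup_steps / (step + 1)) ** 0.5

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=schedule)
for step in range(30):
    opt.step()              # gradient computation omitted in this sketch
    sched.step()
    print(sched.get_last_lr())  # rises to 0.1, then decays
```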
- batch
- one epoch = one full pass over the samples, i.e. (#samples / batch size) updates
- large batch: fast (parallel) training but sharp minima; small batch: noisy gradients but often better optimization and generalization.
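The epoch/batch bookkeeping, sketched with torch.utils.data (sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 samples with batch size 100 -> one epoch = 10 gradient updates.
data = TensorDataset(torch.randn(1000, 4), torch.randn(1000, 1))
loader = DataLoader(data, batch_size=100, shuffle=True)
print(len(loader))  # 10 batches per epoch
for epoch in range(3):
    for x, y in loader:  # each iteration is one (noisy) mini-batch gradient step
        pass
```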
- Activation functions
- hard sigmoid
- sigmoid
- ReLU
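The three activations evaluated in PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(F.hardsigmoid(x))   # piecewise-linear: clamp((x + 3) / 6, 0, 1)
print(torch.sigmoid(x))   # 1 / (1 + exp(-x))
print(F.relu(x))          # max(0, x)
```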
- Neural networks
- $z = \sigma(b + Wx)$, $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \frac{\partial \sigma}{\partial (b + Wx)} \frac{\partial (b + Wx)}{\partial W} = \frac{\partial L}{\partial z} \sigma(1-\sigma) x^T$, $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \frac{\partial \sigma}{\partial (b + Wx)} \frac{\partial (b + Wx)}{\partial b} = \frac{\partial L}{\partial z} \sigma(1-\sigma)$ (verified numerically in the sketch below).
- deep learning vs. representation learning
- deep networks vs. wide networks
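The backprop identity above can be checked against autograd; a minimal sketch using $L = \sum z$ so that $\partial L / \partial z = 1$ (this also previews the computational-graph and x.grad bullets below):

```python
import torch

# Check dL/dW = (dL/dz) * sigma(1-sigma) * x^T for z = sigma(b + Wx),
# with L = sum(z) so dL/dz = 1 for every component.
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(4)

z = torch.sigmoid(b + W @ x)
z.sum().backward()  # autograd walks the computational graph, filling .grad

s = (z * (1 - z)).detach()  # sigma(1-sigma), elementwise
print(torch.allclose(W.grad, s.unsqueeze(1) * x.unsqueeze(0)))  # outer product: True
print(torch.allclose(b.grad, s))                                # True
```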
- pytorch
- computational graph
- x.grad
- torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None)
- torch.nn.LSTM(input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, proj_size=0)
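Instantiating both layers to check tensor shapes (all sizes are illustrative):

```python
import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(8, 3, 32, 32)    # (batch, channels, height, width)
print(conv(img).shape)             # torch.Size([8, 16, 32, 32]); padding=1 keeps 32x32

lstm = torch.nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
seq = torch.randn(5, 8, 10)        # (seq_len, batch, input_size); batch_first=False
out, (h, c) = lstm(seq)
print(out.shape, h.shape, c.shape)  # [5, 8, 20], [2, 8, 20], [2, 8, 20]
```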
- Validation and testing
- OOF (out-of-fold) and CV (cross-validation)
- train loss vs. test loss (optimization vs. generalization)
large train loss (cannot optimize): bias (model too simple); use optimization techniques.
small train loss but large test loss (cannot generalize): variance (model too complex), or distribution shift.
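A minimal OOF/CV sketch with scikit-learn; the Ridge model and random data are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

# Out-of-fold predictions: each sample is predicted by the model
# whose training fold did not contain it.
X, y = np.random.randn(100, 5), np.random.randn(100)
oof = np.zeros_like(y)
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict(X[val_idx])
cv_score = np.mean((oof - y) ** 2)  # CV estimate of the test error
```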
CNN
- receptive field
- parameter sharing
- subsampling / pooling
- feature map
- activation maximization vs saliency map (gradient)
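These ideas in a two-layer sketch: one shared 3x3 kernel per output channel sweeps the whole image (parameter sharing), and pooling subsamples the feature maps (sizes are illustrative):

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),  # feature maps: 8 x 28 x 28
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2),                            # subsample:    8 x 14 x 14
)
x = torch.randn(1, 1, 28, 28)
print(net(x).shape)  # torch.Size([1, 8, 14, 14])
```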
- Image tasks
- LeNet, AlexNet, VGG, GoogLeNet, ResNet.
- LeNet (LeCun): convolution + pooling; convolutional layers first, then fully connected.
- AlexNet (Hinton): ReLU against vanishing gradients in deep nets, dropout against overfitting.
- VGG (Oxford): 3x3 kernels.
- GoogLeNet (Google): multiple kernel sizes.
- ResNet (Kaiming He, Jian Sun): residual connections tackle both vanishing gradients and overfitting in deep nets (a minimal block follows).
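A minimal residual block sketch; channel counts are illustrative, and batch norm is omitted for brevity:

```python
import torch

class ResidualBlock(torch.nn.Module):
    """Minimal residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        return torch.relu(self.conv2(h) + x)  # gradients also flow through the shortcut

print(ResidualBlock(8)(torch.randn(1, 8, 16, 16)).shape)  # unchanged: [1, 8, 16, 16]
```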
In memoriam
Jian Sun passed away today.
RNN
- elman: $y_t = f(c_{t-1}, x_t)$
jordan: $y_t = f(y_{t-1}, x_t)$
- bidirectional: $\bar{y} = y_{\leftarrow} + y_{\rightarrow}$
- LSTM (a step-by-step cell sketch follows)
$$\begin{aligned} g_t &= \tanh(W^{g}_x x_t + W^{g}_h h_{t-1} + b^{g}) \\ i_t &= \sigma(W^{i}_x x_t + W^{i}_h h_{t-1} + b^{i}) \\ f_t &= \sigma(W^{f}_x x_t + W^{f}_h h_{t-1} + b^{f}) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ o_t &= \sigma(W^{o}_x x_t + W^{o}_h h_{t-1} + b^{o}) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$
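A direct transcription of these six equations into PyTorch; the dict-of-weights layout and sizes are illustrative assumptions and do not match torch.nn.LSTM's packed parameters:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    # Wx, Wh, b are dicts keyed by gate name ('g', 'i', 'f', 'o').
    pre = {k: Wx[k] @ x_t + Wh[k] @ h_prev + b[k] for k in "gifo"}
    g = torch.tanh(pre["g"])                          # candidate values g_t
    i, f, o = (torch.sigmoid(pre[k]) for k in "ifo")  # input/forget/output gates
    c = f * c_prev + i * g                            # c_t = f*c_{t-1} + i*g_t
    h = o * torch.tanh(c)                             # h_t = o*tanh(c_t)
    return h, c

d_in, d_h = 4, 3  # illustrative sizes
Wx = {k: torch.randn(d_h, d_in) for k in "gifo"}
Wh = {k: torch.randn(d_h, d_h) for k in "gifo"}
b = {k: torch.zeros(d_h) for k in "gifo"}
h, c = lstm_step(torch.randn(d_in), torch.zeros(d_h), torch.zeros(d_h), Wx, Wh, b)
print(h, c)
```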