Deep Learning 2023/07/07
1. Function with Unknown Parameters
$$y = b + wx$$
where $b$ and $w$ are the unknown parameters.
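A minimal sketch of this model in Python (the function name and the example values are illustrative, not from the notes):

```python
def model(x, b, w):
    """Linear model with unknown parameters b and w: y = b + w * x."""
    return b + w * x

print(model(2.0, b=0.1, w=1.0))  # 2.1
```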
2. Define Loss from Training Data
- Loss is a function of parameters: $L(b, w)$
- Loss: how good a set of values is.
- Loss function:
$$L = \frac{1}{N}\sum_{n} e_n$$
where $e_n$ is the error on the $n$-th training example.
- $L$ is the mean absolute error (MAE) when $e = \mid y - \hat{y} \mid$.
- $L$ is the mean squared error (MSE) when $e = (y - \hat{y})^2$.
- If both the ground truth and the prediction are probability distributions, cross-entropy is used as the loss function to measure the difference between them.
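A minimal sketch of these losses with NumPy; the example arrays and the small `eps` guard in the cross-entropy are illustrative assumptions:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: L = (1/N) * sum(|y - y_hat|)."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean squared error: L = (1/N) * sum((y - y_hat)^2)."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a ground-truth distribution p and a predicted one q."""
    return -np.sum(p * np.log(q + eps))

# illustrative values, not from the notes
y     = np.array([1.0, 2.0, 3.0])   # ground truth
y_hat = np.array([1.1, 1.9, 3.2])   # predictions
print(mae(y, y_hat), mse(y, y_hat))

p = np.array([1.0, 0.0, 0.0])       # one-hot ground-truth distribution
q = np.array([0.7, 0.2, 0.1])       # predicted distribution
print(cross_entropy(p, q))          # ≈ 0.357
```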
3. Optimization
$$w^*, b^* = \arg\min_{w,b} L$$
- Gradient Descent
  - Randomly pick an initial value $w^0$.
  - Compute the derivative of the loss $L$ with respect to $w$ at that point.
  - If the slope there is negative, increase $w$; if it is positive, decrease $w$ (see the sketch after this list).
- Learning Rate
$$w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w = w^0}$$
$\eta$: learning rate
- Hyperparameters: values we set ourselves, such as the learning rate $\eta$.
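A minimal sketch of gradient descent on $b$ and $w$ for the linear model with an MSE loss; the synthetic data, learning rate, and step count are assumptions for illustration:

```python
import numpy as np

# hypothetical training data, roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

b, w = 0.0, 0.0   # arbitrary initial point
eta = 0.01        # learning rate (a hyperparameter)

for step in range(1000):
    y_hat = b + w * x
    # partial derivatives of L = mean((y - y_hat)^2) with respect to b and w
    grad_b = np.mean(-2.0 * (y - y_hat))
    grad_w = np.mean(-2.0 * (y - y_hat) * x)
    # move against the slope, scaled by the learning rate
    b -= eta * grad_b
    w -= eta * grad_w

print(b, w)  # approaches roughly b ≈ 1, w ≈ 2
```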
4. Model Bias
- Linear models have severe limitations, so we need a more flexible model!
$$y = b + wx \;\Rightarrow\; y = b + \sum_{i} c_i\,\mathrm{sigmoid}(b_i + w_i x)$$

$$y = b + \sum_j w_j x_j \;\Rightarrow\; y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$$
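A minimal sketch of the more flexible model, assuming NumPy and hypothetical shapes (3 sigmoid units, 2 input features); `c`, `b_i`, and `W` hold the $c_i$, $b_i$, and $w_{ij}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flexible_model(x, b, c, b_i, W):
    """y = b + sum_i c[i] * sigmoid(b_i[i] + sum_j W[i, j] * x[j])"""
    return b + c @ sigmoid(b_i + W @ x)

# hypothetical parameter values
x   = np.array([1.0, 2.0])
b   = 0.5
c   = np.array([1.0, -1.0, 0.5])
b_i = np.array([0.1, 0.2, 0.3])
W   = np.array([[ 0.2, -0.1],
                [ 0.4,  0.3],
                [-0.5,  0.6]])
print(flexible_model(x, b, c, b_i, W))
```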
5. Backpropagation
- Backpropagation: an efficient way to compute $\partial L / \partial w$ in neural networks.
- Gradient Descent
- Chain Rule
Case 1: $y = g(x)$, $z = h(y)$
$$\Delta x \rightarrow \Delta y \rightarrow \Delta z \qquad \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$$
Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$
$$\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$$
- $L(\theta) = \sum_{n=1}^{N} C^n(\theta) \;\rightarrow\; \dfrac{\partial L(\theta)}{\partial w} = \sum_{n=1}^{N} \dfrac{\partial C^n(\theta)}{\partial w}$
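A minimal sketch that checks Case 1 of the chain rule numerically; the functions `g` and `h` are arbitrary choices for illustration:

```python
import numpy as np

g = lambda x: x ** 2        # y = g(x)
h = lambda y: np.sin(y)     # z = h(y)

x0 = 1.3
y0 = g(x0)

# chain rule: dz/dx = dz/dy * dy/dx = cos(y) * 2x
analytic = np.cos(y0) * 2.0 * x0

# finite-difference estimate of dz/dx for comparison
eps = 1e-6
numeric = (h(g(x0 + eps)) - h(g(x0 - eps))) / (2.0 * eps)

print(analytic, numeric)    # the two values should agree closely
```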
6.
7. Ups and Downs of Deep Learning
- 1958: Perceptron (linear model; a type of artificial neural network)
- 1969: Perceptron has limitations
- 1980s: Multi-layer perceptron
  - Not significantly different from today's DNNs
- 1986: Backpropagation
  - Usually more than 3 hidden layers did not help
- 1989: 1 hidden layer is “good enough”, why deep?
- 2006: RBM initialization (breakthrough)
- 2009: GPU
- 2011: Started to become popular in speech recognition
- 2012: Won the ILSVRC image competition
Gradient Descent for Deep Learning
- The gradient in deep learning is obtained by taking the partial derivative of the loss with respect to each parameter, summed over the training examples; each parameter is then updated by subtracting the product of the learning rate and its partial derivative from its current value.
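Written out in the notation of the backpropagation section above, with $\theta$ collecting all parameters and $t$ indexing the update step:

$$\nabla L(\theta) = \sum_{n=1}^{N} \nabla C^{n}(\theta), \qquad \theta^{t+1} = \theta^{t} - \eta\,\nabla L(\theta^{t})$$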