Fundamentals: Mathematical Derivation

- Dataset

Given a dataset $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$, where $x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \cdots, x_n^{(i)}\}$ and $y^{(i)} \in \mathbb{R}$.
- Hypothesis to fit (a linear function)

$$
\begin{aligned}
h_{\theta}(x^{(i)}) &= \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_n x_n^{(i)} \\
&= \sum_{j=0}^{n} \theta_j x_j^{(i)} \\
&= \theta^T x^{(i)}
\end{aligned}
$$

where we use the convention $x_0^{(i)} = 1$ so the bias $\theta_0$ folds into the sum.
- Loss function (objective)

$$
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2, \qquad \min_{\theta} J(\theta)
$$
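As a quick sanity check, the hypothesis and loss above can be evaluated with NumPy. This is a sketch: the toy data and names are our own, and the bias convention $x_0^{(i)} = 1$ is stored as a leading column of ones.

```python
import numpy as np

# Toy data: m = 4 samples, n = 1 feature; the first column is x_0 = 1 (bias).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])           # shape (m, n+1)
y = np.array([2.0, 4.0, 6.0, 8.0])   # generated exactly by y = 2x

def h(theta, X):
    """Hypothesis h_theta(x^(i)) = theta^T x^(i), evaluated for all rows at once."""
    return X @ theta

def J(theta, X, y):
    """Loss J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    r = h(theta, X) - y
    return 0.5 * float(r @ r)

print(J(np.array([0.0, 2.0]), X, y))  # 0.0: the true parameters fit perfectly
print(J(np.array([0.0, 0.0]), X, y))  # 60.0: any other theta gives positive loss
```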
Solving with Gradient Descent

- Batch Gradient Descent ($m$ is the number of samples)
$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \\
&= \theta_j - \alpha \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \\
&= \theta_j - \alpha \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
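A minimal NumPy sketch of the batch update above; the learning rate, iteration count, and toy data are our own choices, not part of the derivation.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=5000):
    """theta_j := theta_j - alpha * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    using all m samples in every update."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        errors = X @ theta - y           # h_theta(x^(i)) - y^(i), shape (m,)
        theta -= alpha * (X.T @ errors)  # full-batch gradient step, all j at once
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # bias column + feature
y = np.array([3.0, 5.0, 7.0, 9.0])                              # generated by y = 1 + 2x
print(batch_gradient_descent(X, y))  # approaches [1., 2.]
```

Note that `X.T @ errors` computes the sum over all $m$ samples for every component $j$ simultaneously, which is exactly the last line of the derivation in vector form.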
- Stochastic Gradient Descent (one randomly drawn sample per update)
$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \\
&= \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \\
&= \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
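The per-sample update can be sketched as follows (a simple shuffled-pass variant with a fixed learning rate; hyperparameters and data are our own assumptions):

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=2000, seed=0):
    """Per-sample update: theta := theta - alpha * (h_theta(x^(i)) - y^(i)) * x^(i)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):      # visit samples in random order each epoch
            error = X[i] @ theta - y[i]   # scalar residual for one sample
            theta -= alpha * error * X[i]
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2x, noise-free
print(sgd(X, y))  # noisy path, but approaches [1., 2.] on this noise-free data
```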
- Mini-batch Stochastic Gradient Descent ($b$ is the batch size)
$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \\
&= \theta_j - \alpha \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \\
&= \theta_j - \alpha \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
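The mini-batch variant interpolates between the two sketches above: each step sums the gradient over $b$ samples rather than one or all $m$. A sketch under the same toy-data assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, b=2, alpha=0.01, epochs=2000, seed=0):
    """Each update uses a random mini-batch of b samples instead of 1 or all m."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                # reshuffle every epoch
        for start in range(0, m, b):
            batch = order[start:start + b]        # indices of the next mini-batch
            errors = X[batch] @ theta - y[batch]  # residuals for b samples
            theta -= alpha * (X[batch].T @ errors)
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2x
print(minibatch_sgd(X, y))  # approaches [1., 2.]
```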
Comparison of the three gradient descent variants:
- BGD: finds the global optimum of this convex loss, and every update is guaranteed to decrease the loss; however, training is slow when the number of samples is large, since each step touches all $m$ samples.
- SGD: each update is cheap, so training is fast per step; but individual updates are noisy and not guaranteed to move toward the overall optimum, so more iterations are needed (and on non-convex losses it may settle at a local optimum).
- Mini-batch SGD: a compromise between the two.