Gradient Descent
1. What Is Gradient Descent
A search-based optimization method whose purpose is to minimize a loss function.
It is not a machine learning algorithm in itself!
- $\eta$ is called the learning rate, a hyperparameter of gradient descent. Its value affects how quickly the optimum is reached; a poorly chosen value may fail to reach the optimum at all.
- Not every function has a unique extremum.
  Solution: run the algorithm several times with randomized initial points (the initial point is itself a hyperparameter).
2. Implementing Gradient Descent in Code
```python
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta)            # derivative of the loss w.r.t. theta
        last_theta = theta
        theta = theta - eta * gradient  # gradient descent step
        theta_history.append(theta)
        # stop once the loss barely changes between iterations
        if abs(J(theta) - J(last_theta)) < epsilon:
            break
        i_iter += 1
    return theta
```
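To make the loop concrete, here is a self-contained sketch that reuses the same routine on a hypothetical one-dimensional loss $J(\theta) = (\theta - 2.5)^2 + 1$, whose minimum sits at $\theta = 2.5$ (the loss, its derivative, and the starting point are illustrative choices, not from the original text):

```python
import numpy as np

# Hypothetical 1-D loss with its minimum at theta = 2.5
def J(theta):
    return (theta - 2.5) ** 2 + 1

def dJ(theta):
    return 2 * (theta - 2.5)

theta_history = []

def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta)            # derivative of the loss w.r.t. theta
        last_theta = theta
        theta = theta - eta * gradient  # gradient descent step
        theta_history.append(theta)
        # stop once the loss barely changes between iterations
        if abs(J(theta) - J(last_theta)) < epsilon:
            break
        i_iter += 1
    return theta

theta = gradient_descent(initial_theta=0.0, eta=0.1)
```

With $\eta = 0.1$ each step shrinks the distance to the minimum by a constant factor, so `theta` ends up very close to 2.5; `theta_history` records the whole trajectory, which is handy for plotting.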
3. Gradient Descent for Linear Regression
Loss function:

$$J(\theta) = MSE(y, \hat{y}) = \frac{1}{m}\sum_{i=1}^m \left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^m \left(y^{(i)} - \theta_0 - \theta_1 X_1^{(i)} - \dots - \theta_n X_n^{(i)}\right)^2$$
Gradient:

$$\nabla J(\theta) = \begin{pmatrix} \partial J/\partial \theta_0 \\ \partial J/\partial \theta_1 \\ \partial J/\partial \theta_2 \\ \vdots \\ \partial J/\partial \theta_n \end{pmatrix} = \frac{1}{m}\begin{pmatrix} \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot(-1) \\ \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_1^{(i)}\right) \\ \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_2^{(i)}\right) \\ \vdots \\ \sum_{i=1}^m 2\left(y^{(i)} - X_b^{(i)}\theta\right)\cdot\left(-X_n^{(i)}\right) \end{pmatrix} = \frac{2}{m}\cdot\begin{pmatrix} \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix}$$
where $X_0^{(i)} \equiv 1$ (the intercept column) and

$$X_b = \begin{pmatrix} X_0^{(1)} & X_1^{(1)} & \dots & X_n^{(1)} \\ X_0^{(2)} & X_1^{(2)} & \dots & X_n^{(2)} \\ X_0^{(3)} & X_1^{(3)} & \dots & X_n^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ X_0^{(m)} & X_1^{(m)} & \dots & X_n^{(m)} \end{pmatrix}$$
Vectorized form:

$$\nabla J(\theta) = \frac{2}{m}\cdot X_b^T\cdot (X_b\theta - y)$$
Before using gradient descent, it is best to normalize the data first (otherwise features on very different scales can make the gradient blow up).
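As a sanity check on the vectorized form, the sketch below (on made-up random data) computes the gradient both component by component from the summation formula and via $\frac{2}{m} X_b^T (X_b\theta - y)$; the two must agree:

```python
import numpy as np

np.random.seed(42)
m, n = 100, 3
X = np.random.random((m, n))
X_b = np.hstack([np.ones((m, 1)), X])      # prepend the X_0 = 1 column
theta = np.random.random(n + 1)            # arbitrary current parameters
y = X_b.dot(np.array([4.0, 3.0, 2.0, 1.0]))  # hypothetical targets

# Component-wise gradient, following the summation formula
grad_loop = np.empty(n + 1)
for j in range(n + 1):
    grad_loop[j] = 2 / m * np.sum((X_b.dot(theta) - y) * X_b[:, j])

# Vectorized gradient: (2/m) * X_b^T (X_b theta - y)
grad_vec = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
```

The vectorized version is both shorter and much faster, since it replaces the Python loop with a single matrix product.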
4. Batch Gradient vs. Stochastic Gradient
Batch gradient (uses all $m$ rows):

$$\nabla J(\theta) = \frac{2}{m}\cdot\begin{pmatrix} \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^m \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = \frac{2}{m}\cdot X_b^T\cdot (X_b\theta - y)$$
Stochastic gradient (uses a single random row $i$):

$$\nabla J(\theta) = 2\cdot\begin{pmatrix} \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_0^{(i)} \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_2^{(i)} \\ \vdots \\ \left(X_b^{(i)}\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{pmatrix} = 2\cdot \left(X_b^{(i)}\right)^T\cdot \left(X_b^{(i)}\theta - y^{(i)}\right)$$
5. Stochastic Gradient Descent
- Advantages: runs faster, and the noise in each step helps it jump out of local optima.
- In stochastic gradient descent the learning rate $\eta$ needs to decay over time; a common schedule is

$$\eta = \frac{a}{i\_iters + b}$$

where $a$ and $b$ are two hyperparameters.
- Code implementation

```python
def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying schedule eta = t0 / (t + t1)
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # pick one random sample and step along its (noisy) gradient
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta
```
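A self-contained way to try this out: the `dJ_sgd` helper below implements the single-sample gradient from section 4, and the synthetic line-fitting data ($y \approx 4x + 3$ plus noise) is an illustrative assumption, not from the original text:

```python
import numpy as np

# Stochastic gradient for one sample: 2 * (X_b_i)^T (X_b_i theta - y_i)
def dJ_sgd(theta, X_b_i, y_i):
    return 2.0 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    def learning_rate(t):
        # decaying schedule eta = t0 / (t + t1)
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # pick one random sample and step along its (noisy) gradient
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta

np.random.seed(666)
m = 10000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)
y = 4.0 * x + 3.0 + np.random.normal(0, 1, size=m)  # true line: y = 4x + 3
X_b = np.hstack([np.ones((len(X), 1)), X])

theta = sgd(X_b, y, np.zeros(X_b.shape[1]), n_iters=len(X_b) // 3)
```

Even though each step sees only one sample, and only a third as many steps as there are samples are taken, `theta` still lands near the true intercept 3 and slope 4.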
6. SGD in scikit-learn
```python
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=100)  # the old n_iter parameter is now max_iter
sgd_reg.fit(X_train, y_train)
sgd_reg.score(X_test, y_test)
```
Note: SGDRegressor lives in linear_model, so it can only fit linear models.
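A fuller runnable version of the snippet above might look like the following; the synthetic data and the StandardScaler step are assumptions added here (SGD is sensitive to feature scale, echoing the normalization advice in section 3):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(666)
m = 1000
x = np.random.normal(size=(m, 1))
y = (4.0 * x).ravel() + 3.0 + np.random.normal(0, 1, size=m)  # y = 4x + 3 + noise

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=666)

# Standardize features using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

sgd_reg = SGDRegressor(max_iter=100)
sgd_reg.fit(X_train_std, y_train)
score = sgd_reg.score(X_test_std, y_test)  # R^2 on the held-out set
```

Fitting the scaler on the training split and reusing it on the test split avoids leaking test statistics into training.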
7. Mini-Batch Gradient Descent
Batch gradient descent computes the gradient from all rows; its advantage is that the loss decreases stably.
Stochastic gradient descent computes the gradient from a single row; its advantage is speed.
Mini-batch gradient descent combines the two: it computes the gradient from k rows at a time.
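A minimal sketch of the mini-batch idea, assuming the same linear-regression setting and the decaying learning-rate schedule from section 5 (the batch size k, the helper names, and the synthetic data are all illustrative):

```python
import numpy as np

def dJ_mini_batch(theta, X_b_batch, y_batch):
    # Gradient estimated from k rows: (2/k) * X_batch^T (X_batch theta - y_batch)
    k = len(X_b_batch)
    return 2.0 / k * X_b_batch.T.dot(X_b_batch.dot(theta) - y_batch)

def mini_batch_gd(X_b, y, initial_theta, n_epochs=50, k=16, t0=5, t1=50):
    theta = initial_theta
    t = 0
    for _ in range(n_epochs):
        # shuffle once per epoch, then sweep over batches of k rows
        indexes = np.random.permutation(len(X_b))
        X_b_shuffled, y_shuffled = X_b[indexes], y[indexes]
        for i in range(0, len(X_b), k):
            gradient = dJ_mini_batch(theta, X_b_shuffled[i:i + k],
                                     y_shuffled[i:i + k])
            theta = theta - t0 / (t + t1) * gradient  # decaying learning rate
            t += 1
    return theta

np.random.seed(666)
m = 2000
x = np.random.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4.0 * x + 3.0 + np.random.normal(0, 1, size=m)  # true line: y = 4x + 3

theta = mini_batch_gd(X_b, y, np.zeros(2))
```

Averaging over k rows makes each step less noisy than pure SGD while staying far cheaper than a full pass over all m rows.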
8. A General Way to Compute Gradients
When an analytic gradient is hard to derive (or you want to verify one), each derivative can be approximated numerically.
Computing a derivative: $\dfrac{dJ}{d\theta} \approx \dfrac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
Computing the gradient: apply the same two-point (central) difference to each component $\theta_j$ of $\theta$ in turn.
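Following that recipe, a debug-style numerical gradient can be sketched as below and checked against the analytic MSE gradient from section 3 (the small random dataset and the helper name `dJ_debug` are assumptions for illustration):

```python
import numpy as np

def dJ_debug(J, theta, epsilon=0.01):
    # Central difference per component: dJ/dtheta_j ~ (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    res = np.empty(len(theta))
    for j in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[j] += epsilon
        theta_2 = theta.copy()
        theta_2[j] -= epsilon
        res[j] = (J(theta_1) - J(theta_2)) / (2 * epsilon)
    return res

np.random.seed(42)
m, n = 50, 3
X_b = np.hstack([np.ones((m, 1)), np.random.random((m, n))])
y = np.random.random(m)
theta = np.random.random(n + 1)

def J(theta):
    # MSE loss from section 3
    return np.sum((y - X_b.dot(theta)) ** 2) / m

grad_math = 2.0 / m * X_b.T.dot(X_b.dot(theta) - y)  # analytic gradient
grad_debug = dJ_debug(J, theta)                      # numerical gradient
```

The numerical version only needs to evaluate $J$, so it works for any differentiable loss; it is much slower than the analytic gradient, which makes it a verification tool rather than a production choice.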