- Gradient descent is not itself a machine learning algorithm
- It is a search-based optimization method
- Purpose: minimize a loss function
- Gradient ascent: maximize a utility function
- Before using gradient descent, it is best to normalize the data first, which makes the search more efficient
Cost function

- Loss function: the error computed on a single sample
- Cost function: the average of the errors over all samples in the training set
- Least squares
- With a true value $y$ and a prediction $h_\theta(x)$, the squared error is $(y-h_\theta(x))^2$
- Find suitable parameters that minimize the sum of squared errors:

$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)^2$$
Purpose

Iterate repeatedly so that the cost function gets as close to its minimum as possible:

$$\theta_j=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$$

where $\alpha$ is the learning rate (step size), a hyperparameter.
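To make the update rule concrete, here is a minimal sketch that applies it to a one-parameter quadratic cost; the cost function, starting point, and learning rate are made-up choices for illustration only.

```python
# Minimal sketch of theta = theta - alpha * dJ/dtheta on the made-up cost
# J(theta) = (theta - 3)**2, whose derivative is dJ/dtheta = 2 * (theta - 3).
theta = 0.0    # arbitrary starting point
alpha = 0.1    # learning rate (step size), a hyperparameter
for _ in range(100):
    gradient = 2 * (theta - 3)        # derivative of the cost at the current theta
    theta = theta - alpha * gradient  # gradient descent update
print(theta)   # approaches the minimizer theta = 3
```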
Gradient descent for linear regression

- Gradient

$$\nabla J = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} = \frac{1}{m}\begin{bmatrix} \sum_{i=1}^{m}\left(X_b^{(i)}\cdot\theta - y^{(i)}\right) \\ \sum_{i=1}^{m}\left(X_b^{(i)}\cdot\theta - y^{(i)}\right)\cdot X_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(X_b^{(i)}\cdot\theta - y^{(i)}\right)\cdot X_n^{(i)} \end{bmatrix} = \frac{1}{m}\left(\left(X_b\cdot\theta - y\right)^{T}\cdot X_b\right)^{T} = \frac{1}{m}X_b^{T}\cdot\left(X_b\cdot\theta - y\right)$$
```python
# coding=utf-8
import numpy as np
from sklearn.metrics import r2_score


class MeGradientDescent:
    def __init__(self):
        self.coef_ = None        # feature coefficients theta_1 .. theta_n
        self.intercept_ = None   # intercept theta_0

    def _J(self, theta, X_b, y_train):
        # cost function: sum of squared errors divided by 2m
        return np.sum((y_train - X_b.dot(theta)) ** 2) / (2 * len(y_train))

    def _dJ(self, theta, X_b, y_train):
        # gradient: (1/m) * X_b^T * (X_b.theta - y)
        return X_b.T.dot(X_b.dot(theta) - y_train) / len(y_train)

    def fit_gd(self, X_train, y_train, alpha=0.01, cycle_index=1e4, interv=1e-8):
        start_index = 0
        # prepend a column of ones so that theta_0 acts as the intercept
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        theta = np.zeros(X_b.shape[1])
        while start_index < cycle_index:
            last_theta = theta
            theta = theta - alpha * self._dJ(theta, X_b, y_train)
            # stop once the decrease of the cost function is smaller than interv
            if abs(self._J(theta, X_b, y_train) - self._J(last_theta, X_b, y_train)) < interv:
                break
            start_index += 1
        self.coef_ = theta[1:]
        self.intercept_ = theta[0]
        return self

    def predict(self, X_test):
        return X_test.dot(self.coef_) + self.intercept_

    def score(self, y_predict, y_test):
        return r2_score(y_test, y_predict)
```
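A usage sketch for the class above on synthetic data; the feature matrix, the true coefficients, and the hyperparameter values are made up for illustration.

```python
import numpy as np

np.random.seed(666)
X = np.random.random(size=(100, 2))        # made-up feature matrix
y = X.dot(np.array([3.0, 4.0])) + 5.0      # underlying linear model: coefficients 3, 4, intercept 5

reg = MeGradientDescent()
reg.fit_gd(X, y, alpha=0.1)
print(reg.coef_)                           # expected to be close to [3, 4]
print(reg.intercept_)                      # expected to be close to 5
print(reg.score(reg.predict(X), y))        # R^2 on the training data, close to 1
```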
Stochastic gradient descent

Batch gradient descent, as described above, uses all samples in every step; stochastic gradient descent instead considers one randomly chosen sample per step.

- Learning rate

$$\alpha = \frac{t_0}{i\_iters + t_1}$$

where $i\_iters$ is the number of iterations performed so far.
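A minimal sketch of stochastic gradient descent using such a decaying learning rate; the single-sample gradient, the values of t0 and t1, the number of passes, and the synthetic data are all made-up choices for illustration.

```python
import numpy as np

def dJ_sgd(theta, X_b_i, y_i):
    # gradient of the squared error of one sample with respect to theta
    return X_b_i * (X_b_i.dot(theta) - y_i) * 2.0

def sgd(X_b, y, initial_theta, n_iters=5, t0=5.0, t1=50.0):
    def learning_rate(t):
        # alpha = t0 / (i_iters + t1): the step size shrinks as iterations accumulate
        return t0 / (t + t1)

    theta = initial_theta
    m = len(X_b)
    for i_iter in range(n_iters):
        # one pass over the data in a random order
        for i, idx in enumerate(np.random.permutation(m)):
            gradient = dJ_sgd(theta, X_b[idx], y[idx])
            theta = theta - learning_rate(i_iter * m + i) * gradient
    return theta

# example invocation on made-up linear data
np.random.seed(666)
X = 2 * np.random.random(size=(1000, 1))
y = 4.0 * X[:, 0] + 3.0 + np.random.normal(0.0, 0.5, size=1000)
X_b = np.hstack([np.ones((len(X), 1)), X])
print(sgd(X_b, y, np.zeros(X_b.shape[1])))   # roughly [3, 4]
```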
- Stochastic gradient descent in scikit-learn (linear regression)
```python
# coding=utf-8
from sklearn.datasets import load_boston   # note: removed in scikit-learn 1.2; this example assumes an older version
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

boston = load_boston()
x = boston.data
y = boston.target
# drop samples whose target value hits the cap of 50
x = x[y < 50]
y = y[y < 50]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=666)

# standardize the features before gradient descent
standardScaler = StandardScaler()
standardScaler.fit(x_train)
x_train_standard = standardScaler.transform(x_train)
x_test_standard = standardScaler.transform(x_test)

sgd = SGDRegressor(tol=1e-3)
sgd.fit(x_train_standard, y_train)
score = sgd.score(x_test_standard, y_test)
print(score)

# fewer iterations: faster training, but possibly a lower score
sgd = SGDRegressor(max_iter=50, tol=1e-3)
sgd.fit(x_train_standard, y_train)
score = sgd.score(x_test_standard, y_test)
print(score)
```
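With the default max_iter (1000 in recent scikit-learn versions) the regressor gets more passes over the data, so the first run usually reaches a slightly higher R² than the second, which caps training at 50 iterations in exchange for speed.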
- Advantages
  - It may jump out of a local optimum
  - Faster running speed
- Disadvantages
  - Lower accuracy
Mini-batch gradient descent

A combination of batch gradient descent and stochastic gradient descent: each step takes k samples and performs a gradient descent update on them, as in the sketch below.
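A minimal sketch of a mini-batch gradient descent loop; the batch size k, the learning rate, the number of passes, and the synthetic data are made-up choices for illustration.

```python
import numpy as np

def mini_batch_gd(X_b, y, initial_theta, k=16, n_iters=100, alpha=0.05):
    # each update uses a random mini-batch of k samples instead of all m (batch GD) or one (SGD)
    theta = initial_theta
    m = len(X_b)
    for _ in range(n_iters):
        indexes = np.random.permutation(m)        # reshuffle once per pass over the data
        for start in range(0, m, k):
            batch = indexes[start:start + k]
            X_batch, y_batch = X_b[batch], y[batch]
            # same gradient formula as batch GD, averaged over the k samples of the batch
            gradient = X_batch.T.dot(X_batch.dot(theta) - y_batch) / len(batch)
            theta = theta - alpha * gradient
    return theta

# example invocation on made-up linear data
np.random.seed(666)
X = np.random.random(size=(1000, 1))
y = 4.0 * X[:, 0] + 3.0 + np.random.normal(0.0, 0.1, size=1000)
X_b = np.hstack([np.ones((len(X), 1)), X])
print(mini_batch_gd(X_b, y, np.zeros(X_b.shape[1])))   # roughly [3, 4]
```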