1 介绍
设一函数为
y
=
(
x
−
4
)
2
+
1
y={ (x-4) }^{ 2 }+1
y=(x−4)2+1,则它的梯度,也就是一阶导数为
d
y
d
x
=
2
(
x
−
4
)
\frac { dy }{ dx } =2(x-4)
dxdy=2(x−4),如下图所示。
梯度,可以代表方向,对应着函数 y y y 增大的方向,当 x x x 变化式, y y y 也会有相应的变化。
在机器学习中,我们使用式 − η d J d θ -\eta \frac { dJ }{ d\theta } −ηdθdJ 来表示函数 J J J 减小的方向。
∇ J = − η d J d θ \nabla J=-\eta \frac { dJ }{ d\theta } ∇J=−ηdθdJ,其中 η \eta η 的含义
- η \eta η 表示学习率(learning rate)。
- 取值影响获得最优解的速度。
- 取值不合适,则可能得不到最优解。
- 梯度下降法的一个超参数。
- 若 η \eta η 太小,则会减慢收敛的速度;若 η \eta η 太大,则可能会导致不收敛。
并不是所有函数都有唯一的极值点,如下图:
上述解决方法:
- 多次运行,随机化初始点。梯度下降的初始点也是一个超参数。
2 在线性回归中使用梯度下降法
在线性回归中,损失函数具有唯一的最优解。
2.1 目标
求出 θ = ( θ 0 , θ 1 , θ 2 ⋯ θ n ) \theta =({ \theta }_{ 0 },{ \theta }_{ 1 },{ \theta }_{ 2 }\cdots { \theta }_{ n }) θ=(θ0,θ1,θ2⋯θn),使得 $J=\sum _{ i=1 }^{ m }{ { ({ y }^{ (i) }-{ \hat { y } }^{ (i) }) }^{ 2 } } $ 的值最小,其中 y ^ ( i ) = θ 0 + θ 1 x 1 ( i ) + θ 2 x 2 ( i ) + ⋯ + θ n x n ( i ) { \hat { y } }^{ (i) }={ \theta }_{ 0 }+{ \theta }_{ 1 }{ x }_{ 1 }^{ (i) }+{ \theta }_{ 2 }{ x }_{ 2 }^{ (i) }+\cdots +{ \theta }_{ n }{ x }_{ n }^{ (i) } y^(i)=θ0+θ1x1(i)+θ2x2(i)+⋯+θnxn(i)。
2.2 求解 J J J 的梯度
∇ J ( θ ) = ( ∂ J / ∂ θ 0 ∂ J / ∂ θ 1 ⋮ ∂ J / ∂ θ n ) = ( ∑ i = 1 m 2 ( y ( i ) − x b ( i ) θ ) ( − 1 ) ∑ i = 1 m 2 ( y ( i ) − x b ( i ) θ ) ( − x 1 ( i ) ) ⋮ ∑ i = 1 m 2 ( y ( i ) − x b ( i ) θ ) ( − x n ( i ) ) ) = 2 ( ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x 1 ( i ) ⋮ ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x n ( i ) ) \nabla J(\theta )=\begin{pmatrix} { \partial J }/{ \partial { \theta }_{ 0 } } \\ { \partial J }/{ \partial { \theta }_{ 1 } } \\ \vdots \\ { \partial J }/{ \partial { \theta }_{ n } } \end{pmatrix}=\begin{pmatrix} \sum _{ i=1 }^{ m }{ 2({ { y }^{ (i) }-x }_{ b }^{ (i) }\theta )(-1) } \\ \sum _{ i=1 }^{ m }{ 2({ { y }^{ (i) }-x }_{ b }^{ (i) }\theta ) } { (-x }_{ 1 }^{ (i) }) \\ \vdots \\ \sum _{ i=1 }^{ m }{ 2({ { y }^{ (i) }-x }_{ b }^{ (i) }\theta ) } (-{ x }_{ n }^{ (i) }) \end{pmatrix}=2\begin{pmatrix} \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ 1 }^{ (i) } \\ \vdots \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ n }^{ (i) } \end{pmatrix}\\ \\ ∇J(θ)=⎝⎜⎜⎜⎛∂J/∂θ0∂J/∂θ1⋮∂J/∂θn⎠⎟⎟⎟⎞=⎝⎜⎜⎜⎜⎛∑i=1m2(y(i)−xb(i)θ)(−1)∑i=1m2(y(i)−xb(i)θ)(−x1(i))⋮∑i=1m2(y(i)−xb(i)θ)(−xn(i))⎠⎟⎟⎟⎟⎞=2⎝⎜⎜⎜⎜⎛∑i=1m(xb(i)θ−y(i))∑i=1m(xb(i)θ−y(i))x1(i)⋮∑i=1m(xb(i)θ−y(i))xn(i)⎠⎟⎟⎟⎟⎞
令 J ( θ ) = 1 m ∑ i = 1 m ( y ( i ) − y ^ ( i ) ) 2 = M S E ( y , y ^ ) J(\theta )=\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { ({ y }^{ (i) }-{ \hat { y } }^{ (i) }) }^{ 2 } } =MSE(y,\hat { y } ) J(θ)=m1∑i=1m(y(i)−y^(i))2=MSE(y,y^),则,
∇ J ( θ ) = 2 m ( ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x 1 ( i ) ⋮ ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x n ( i ) ) = 2 m ( ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x 0 ( i ) ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x 1 ( i ) ⋮ ∑ i = 1 m ( x b ( i ) θ − y ( i ) ) x n ( i ) ) , x 0 ( i ) ≡ 1 = 2 m ( x b ( 1 ) θ − y ( 1 ) , x b ( 2 ) θ − y ( 2 ) , ⋯   , x b ( m ) θ − y ( m ) ) ⋅ ( x 0 ( 1 ) x 1 ( 1 ) ⋯ x n ( 1 ) x 0 ( 2 ) x 1 ( 2 ) ⋯ x n ( 2 ) ⋮ ⋮ ⋱ ⋮ x 0 ( m ) x 1 ( m ) ⋯ x n ( m ) ) = 2 m [ ( x b θ − y ) ⊺ x b ] ⊺ = 2 m x b ⊺ ( x b θ − y ) \nabla J(\theta )=\frac { 2 }{ m } \begin{pmatrix} \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ 1 }^{ (i) } \\ \vdots \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ n }^{ (i) } \end{pmatrix}\\ \qquad =\frac { 2 }{ m } \begin{pmatrix} \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }){ x }_{ 0 }^{ (i) } } \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ 1 }^{ (i) } \\ \vdots \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ n }^{ (i) } \end{pmatrix},{ x }_{ 0 }^{ (i) }\equiv 1\\ \qquad =\frac { 2 }{ m } (\begin{matrix} { x }_{ b }^{ (1) }\theta -{ y }^{ (1) }, & { x }_{ b }^{ (2) }\theta -{ y }^{ (2) }, & \cdots , & { x }_{ b }^{ (m) }\theta -{ y }^{ (m) } \end{matrix})\cdot \begin{pmatrix} { x }_{ 0 }^{ (1) } & { x }_{ 1 }^{ (1) } & \cdots & { x }_{ n }^{ (1) } \\ { x }_{ 0 }^{ (2) } & { x }_{ 1 }^{ (2) } & \cdots & { x }_{ n }^{ (2) } \\ \vdots & \vdots & \ddots & \vdots \\ { x }_{ 0 }^{ (m) } & { x }_{ 1 }^{ (m) } & \cdots & { x }_{ n }^{ (m) } \end{pmatrix}\\ \qquad =\frac { 2 }{ m } { { [({ x }_{ b }\theta -y) }^{ \intercal }{ x }_{ b }] }^{ \intercal }\\ \qquad =\frac { 2 }{ m } { { x }_{ b } }^{ \intercal }({ x }_{ b }\theta -y) ∇J(θ)=m2⎝⎜⎜⎜⎜⎛∑i=1m(xb(i)θ−y(i))∑i=1m(xb(i)θ−y(i))x1(i)⋮∑i=1m(xb(i)θ−y(i))xn(i)⎠⎟⎟⎟⎟⎞=m2⎝⎜⎜⎜⎜⎛∑i=1m(xb(i)θ−y(i))x0(i)∑i=1m(xb(i)θ−y(i))x1(i)⋮∑i=1m(xb(i)θ−y(i))xn(i)⎠⎟⎟⎟⎟⎞,x0(i)≡1=m2(xb(1)θ−y(1),xb(2)θ−y(2),⋯,xb(m)θ−y(m))⋅⎝⎜⎜⎜⎜⎛x0(1)x0(2)⋮x0(m)x1(1)x1(2)⋮x1(m)⋯⋯⋱⋯xn(1)xn(2)⋮xn(m)⎠⎟⎟⎟⎟⎞=m2[(xbθ−y)⊺xb]⊺=m2xb⊺(xbθ−y)
3 编程
3.1 模拟梯度下降法
import numpy as np
import matplotlib.pyplot as plt
# 1 准备数据
plot_x = np.linspace(-1, 6, 141)
# 2 定义损失函数
plot_y = (plot_x - 2.5) ** 2 - 1
plt.plot(plot_x, plot_y)
plt.show()
# 3 实现梯度下降法
# 定义损失函数的导数
def dj(theta):
return 2 * (theta - 2.5)
# 定义损失函数
def j(theta):
return (theta - 2.5) ** 2 - 1
# 使用梯度下降法求下降方向上的坐标
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
theta = initial_theta
theta_history.append(initial_theta)
i_iter = 0 # 防止eta过大造成死循环
while i_iter < n_iters:
gradient = dj(theta)
last_theta = theta
theta = theta - eta * gradient
theta_history.append(theta)
if (abs(j(theta) - j(last_theta)) < epsilon):
break
i_iter += 1
return
# 画图
def plot_theta_history():
plt.plot(plot_x, j(plot_x))
plt.plot(np.array(theta_history), j(np.array(theta_history)), color='r', marker='+')
plt.show()
eta = 0.1
theta_history = []
gradient_descent(0., eta)
plot_theta_history()
运行结果:
3.2 在线性回归中使用梯度下降法
1)LinearRegression_gd.py
import numpy as np
from comm_utils.testCapability import r2_score
class LinearRegressionGD:
def __init__(self):
""" 初识化 LR 模型"""
self.coef_ = None
self.intercept_ = None
self._theta = None
def fit_gd(self, x_train, y_train, eta=0.01, n_iters=1e4):
""" 根据训练数据集x_train,y_train,使用梯度下降法训练Linear Regression模型"""
assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to y_train"
# 定义损失函数
def j(theta, x_b, y):
try:
return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
except:
return float('inf')
# 定义梯度函数
def dj(theta, x_b, y):
# return (x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)).T
return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)
# 求梯度下的theta
def gradient_descent(x_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
theta = initial_theta
cur_iter = 0
while cur_iter < n_iters:
gradient = dj(theta, x_b, y)
last_theta = theta
theta = theta - eta * gradient
if (abs(j(theta, x_b, y) - j(last_theta, x_b, y)) < epsilon):
break
cur_iter += 1
return theta
x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
initial_theta = np.zeros(x_b.shape[1])
self._theta = gradient_descent(x_b, y_train, initial_theta, eta, n_iters)
self.intercept_ = self._theta[0] # 截距
self.coef_ = self._theta[1:] # theta
return self
def predict(self, x_predict):
""" 给定待预测数据集x_predict,返回表示x_predict的结果向量"""
assert self.intercept_ is not None and self.coef_ is not None, "must fit before predict."
assert x_predict.shape[1] == len(self.coef_), "the feature number of x_predict must be equal to x_train"
x_b = np.hstack([np.ones((len(x_predict), 1)), x_predict])
return x_b.dot(self._theta)
def score(self, x_test, y_test):
""" 根据测试数据集 x_test 和 y_test 确定当前模型的准确度"""
y_predict = self.predict(x_test)
return r2_score(y_test, y_predict)
def __repr__(self):
return "LinearRegression()"
2)test.py
import numpy as np
import matplotlib.pyplot as plt
import GradientDescent_pro.LinearRegression_gd
np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3. + 4. + np.random.normal(size=100) # 线性方程,最后是添加的噪音
x = x.reshape(-1, 1)
plt.scatter(x, y)
plt.show()
lrgd = GradientDescent_pro.LinearRegression_gd.LinearRegressionGD()
lrgd.fit_gd(x, y)
print(lrgd.coef_)
print(lrgd.intercept_)
运行结果:
[3.00706277]
4.021457858204859
3.3 随机梯度下降法(Stochastic Gradient Descent)
搜索方向: 2 ( ( x b ( i ) θ − y ( i ) ) ( x b ( i ) θ − y ( i ) ) x 1 ( i ) ⋮ ( x b ( i ) θ − y ( i ) ) x n ( i ) ) = 2 ( x b ( i ) ) ⊺ ( x b ( i ) θ − y ( i ) ) 2\begin{pmatrix} { ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } \\ { ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) }{ x }_{ 1 }^{ (i) } \\ \vdots \\ { ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) }{ x }_{ n }^{ (i) } \end{pmatrix}=2{ ({ x }_{ b }^{ (i) }) }^{ \intercal }({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) })\\ \qquad 2⎝⎜⎜⎜⎜⎛(xb(i)θ−y(i))(xb(i)θ−y(i))x1(i)⋮(xb(i)θ−y(i))xn(i)⎠⎟⎟⎟⎟⎞=2(xb(i))⊺(xb(i)θ−y(i))
η = 1 i _ i t e r s → 1 i _ i t e r s + b → a i _ i t e r s + b → t 0 i _ i t e r s + t 1 \eta =\frac { 1 }{ i\_ iters } \rightarrow \frac { 1 }{ i\_ iters+b } \rightarrow \frac { a }{ i\_ iters+b } \rightarrow \frac { { t }_{ 0 } }{ { i\_ iters+t }_{ 1 } } η=i_iters1→i_iters+b1→i_iters+ba→i_iters+t1t0
特点:
- 下降方向随机
- 跳出局部最优解
- 更快的运行速度
1)在LinearRegression_gd.py中添加下面代码:
def fit_sgd(self, x_train, y_train, n_iters=5, t0=5, t1=50):
""" 根据训练数据集,使用随机梯度下降法训练线性回归模型"""
assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to y_train"
assert n_iters >= 1
# 随机梯度
def dj_sgd(theta, x_b_i, y_i):
return x_b_i * (x_b_i.dot(theta) - y_i) * 2.
def sgd(x_b, y, initial_theta, n_iters, t0=5, t1=50):
def learning_rate(t):
return t0 / (t + t1)
theta = initial_theta
m = len(x_b)
for cur_iter in range(n_iters):
indexes = np.random.permutation(m)
x_b_new = x_b[indexes]
y_new = y[indexes]
for i in range(m):
gradient = dj_sgd(theta, x_b_new[i], y_new[i])
theta = theta - learning_rate(cur_iter * m + i) * gradient
return theta
x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
initial_theta = np.random.randn(x_b.shape[1])
self._theta = sgd(x_b, y_train, initial_theta, n_iters, t0, t1)
self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
return self
运行结果:
3.004638997983383
[4.00713831]
3.4 梯度下降法的调试
d J d θ = J ( θ + ε ) − J ( θ − ε ) 2 ε θ = ( θ 0 , θ 1 , θ 2 , ⋯   , θ n ) , ∂ J ∂ θ = ( ∂ J ∂ θ 0 , ∂ J ∂ θ 1 , ∂ J ∂ θ 2 , ⋯   , ∂ J ∂ θ n ) θ 0 + = ( θ 0 + ε , θ 1 , θ 2 , ⋯   , θ n ) , θ 1 + = ( θ 0 , θ 1 + ε , θ 2 , ⋯   , θ n ) θ 0 − = ( θ 0 − ε , θ 1 , θ 2 , ⋯   , θ n ) , θ 1 − = ( θ 0 , θ 1 − ε , θ 2 , ⋯   , θ n ) ∂ J ∂ θ 0 = J ( θ 0 + ) − J ( θ 0 − ) 2 ε , ∂ J ∂ θ 1 = J ( θ 1 + ) − J ( θ 1 − ) 2 ε \frac { dJ }{ d\theta } =\frac { J(\theta +\varepsilon )-J(\theta -\varepsilon ) }{ 2\varepsilon } \\ \\ \theta =({ \theta }_{ 0 },{ \theta }_{ 1 },{ \theta }_{ 2 },\cdots ,{ \theta }_{ n }),\frac { \partial J }{ \partial \theta } =(\frac { \partial J }{ \partial { \theta }_{ 0 } } ,\frac { \partial J }{ \partial { \theta }_{ 1 } } ,\frac { \partial J }{ \partial { \theta }_{ 2 } } ,\cdots ,\frac { \partial J }{ \partial { \theta }_{ n } } )\\ \\ { \theta }_{ 0 }^{ + }=({ \theta }_{ 0 }+\varepsilon ,{ \theta }_{ 1 },{ \theta }_{ 2 },\cdots ,{ \theta }_{ n }),{ \theta }_{ 1 }^{ + }=({ \theta }_{ 0 },{ \theta }_{ 1 }+\varepsilon ,{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })\\ \\ { \theta }_{ 0 }^{ - }=({ \theta }_{ 0 }-\varepsilon ,{ \theta }_{ 1 },{ \theta }_{ 2 },\cdots ,{ \theta }_{ n }),{ \theta }_{ 1 }^{ - }=({ \theta }_{ 0 },{ \theta }_{ 1 }-\varepsilon ,{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })\\ \\ \frac { \partial J }{ \partial { \theta }_{ 0 } } =\frac { J({ \theta }_{ 0 }^{ + })-J({ \theta }_{ 0 }^{ - }) }{ 2\varepsilon } ,\frac { \partial J }{ \partial { \theta }_{ 1 } } =\frac { J({ \theta }_{ 1 }^{ + })-J({ \theta }_{ 1 }^{ - }) }{ 2\varepsilon } dθdJ=2εJ(θ+ε)−J(θ−ε)θ=(θ0,θ1,θ2,⋯,θn),∂θ∂J=(∂θ0∂J,∂θ1∂J,∂θ2∂J,⋯,∂θn∂J)θ0+=(θ0+ε,θ1,θ2,⋯,θn),θ1+=(θ0,θ1+ε,θ2,⋯,θn)θ0−=(θ0−ε,θ1,θ2,⋯,θn),θ1−=(θ0,θ1−ε,θ2,⋯,θn)∂θ0∂J=2εJ(θ0+)−J(θ0−),∂θ1∂J=2εJ(θ1+)−J(θ1−)
# 调试法求梯度
def dj_debug(theta,x_b,y,epsilon=0.01):
res=np.empty(len(theta))
for i in range(len(theta)):
theta_1=theta.copy()
theta_1[i]+=epsilon
theta_2=theta.copy()
theta_2[i]-=epsilon
res[i]=(j(theta_1,x_b,y)-j(theta_2,x_b,y))/(2*epsilon)
return res
4 注意
- 使用梯度下降法前最好进行数据归一化。