梯度下降
梯度下降是一种非常通用的优化算法,能够为大范围的问题找到最优解。梯度下降中的目的就是通过迭代地调整参数从而使成本函数最小化。
假设你迷失在山上的迷雾中,你能感觉到的只有你脚下路面的坡度。快速到达山脚的一个策略就是沿着最陡的方向下坡。这就是梯度下降的思想:通过测量 θ \theta θ相关的误差函数的局部梯度,并不断沿着降梯度的方向调整,直到梯度降为0,到达最小值!
具体来说,首先使用一个随机
θ
\theta
θ值(称之为随机初始化),然后逐步改进,每次踏出yibu,每一步都尝试将第一点成本函数(如MSE1),直到算法收敛出一个最小值,如下图
梯度下降中一个重要参数是每一步的步长,梯度下降中称为学习率(超参数)。如果学习率太低,就需要更多次的迭代才能达到收敛,如果学习率太高可能直接越过山脚最低点,如果学习率太高非但不能走到山脚反而可能一不小心走过头走到了另一个比之前还要高的位置。
上图 可以看出 当学习率不能太低也不能太高 机器学习中一般为 0.01
由 MSE作为损失函数 而MSE正好又是一个 凸函数可以找到全局最小值
而在现实中 损失函数可能没有那么好的性质 只能找到局部最小值。
y
^
(
i
)
=
b
+
θ
1
X
1
(
i
)
+
θ
2
X
2
(
i
)
+
θ
3
X
3
(
i
)
,
.
.
.
.
.
,
θ
n
X
n
(
i
)
\hat{y}^{(i)} =b+\theta_{1}X_{1}^{(i)}+\theta_{2}X_{2}^{(i)}+\theta_{3}X_{3}^{(i)},.....,\theta_{n}X_{n}^{(i)}
y^(i)=b+θ1X1(i)+θ2X2(i)+θ3X3(i),.....,θnXn(i)
X
d
=
[
1
X
(
1
)
(
1
)
X
(
2
)
(
1
)
.
.
.
.
X
(
n
)
(
1
)
1
X
(
1
)
(
2
)
X
(
2
)
(
2
)
.
.
.
.
X
(
n
)
(
2
)
.
.
.
.
.
1
X
(
1
)
(
m
)
X
(
2
)
(
m
)
.
.
.
.
X
(
n
)
(
m
)
]
X_d = \left[ \begin{matrix} 1 & X^{(1)}_{(1)} & X^{(1)}_{(2)} & .... & X^{(1)}_{(n)}\\ 1 & X^{(2)}_{(1)} & X^{(2)}_{(2)} & .... & X^{(2)}_{(n)}\\ ..... \\1 & X^{(m)}_{(1)} & X^{(m)}_{(2)} & .... & X^{(m)}_{(n)}\\ \end{matrix} \right]
Xd=⎣⎢⎢⎢⎡11.....1X(1)(1)X(1)(2)X(1)(m)X(2)(1)X(2)(2)X(2)(m)............X(n)(1)X(n)(2)X(n)(m)⎦⎥⎥⎥⎤
θ
=
(
θ
0
,
θ
1
,
θ
2
,
θ
3
,
.
.
.
.
.
,
θ
n
)
T
\theta=(\theta_{0},\theta_{1},\theta_{2},\theta_{3},.....,\theta_{n})^T
θ=(θ0,θ1,θ2,θ3,.....,θn)T
y
^
(
i
)
=
X
d
θ
\hat{y}^{(i)}=X_d\theta
y^(i)=Xdθ
y
^
(
i
)
∈
m
∗
n
+
1
\hat{y}^{(i)} ∈ {m*n +1}
y^(i)∈m∗n+1维向量
目标使损失函数最小化:
a
r
g
m
i
n
∑
i
=
1
m
(
y
(
i
)
−
y
^
(
i
)
)
2
{argmin}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2
argmin∑i=1m(y(i)−y^(i))2
根据求导法则可得:
∇
J
(
θ
)
=
[
∂
J
∂
θ
1
∂
J
∂
θ
2
∂
J
∂
θ
3
.
.
.
∂
J
∂
θ
n
]
\nabla J(\theta) = \left[ \begin{matrix} \frac {\partial J}{\partial \theta_1} \\ \frac {\partial J}{\partial \theta_2} \\ \frac {\partial J}{\partial \theta_3} \\ ... \\ \frac {\partial J}{\partial \theta_n} \end{matrix} \right]
∇J(θ)=⎣⎢⎢⎢⎢⎢⎡∂θ1∂J∂θ2∂J∂θ3∂J...∂θn∂J⎦⎥⎥⎥⎥⎥⎤ =
[
∑
i
=
1
m
2
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
1
)
∑
i
=
1
m
2
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
X
1
(
i
)
)
∑
i
=
1
m
2
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
X
2
(
i
)
)
.
.
.
∑
i
=
1
m
2
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
X
n
(
i
)
)
]
\ \left[ \begin{matrix} \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1 ^{(i)}) \\ \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2 ^{(i)}) \\ ... \\ \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n ^{(i)}) \end{matrix} \right]
⎣⎢⎢⎢⎢⎢⎡∑i=1m2(y(i)−Xd(i)θ)⋅(−1)∑i=1m2(y(i)−Xd(i)θ)⋅(−X1(i))∑i=1m2(y(i)−Xd(i)θ)⋅(−X2(i))...∑i=1m2(y(i)−Xd(i)θ)⋅(−Xn(i))⎦⎥⎥⎥⎥⎥⎤=
2
m
[
∑
i
=
1
m
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
1
)
∑
i
=
1
m
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
X
1
(
i
)
)
∑
i
=
1
m
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
X
2
(
i
)
)
.
.
.
∑
i
=
1
m
(
y
(
i
)
−
X
d
(
i
)
θ
)
⋅
(
−
X
n
(
i
)
)
]
\frac{2}{m}\ \left[ \begin{matrix} \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1 ^{(i)}) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2 ^{(i)}) \\ ... \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n ^{(i)}) \end{matrix} \right]
m2 ⎣⎢⎢⎢⎢⎢⎡∑i=1m(y(i)−Xd(i)θ)⋅(−1)∑i=1m(y(i)−Xd(i)θ)⋅(−X1(i))∑i=1m(y(i)−Xd(i)θ)⋅(−X2(i))...∑i=1m(y(i)−Xd(i)θ)⋅(−Xn(i))⎦⎥⎥⎥⎥⎥⎤
式子前除以m 是因为 在计算梯度的时候随着样本量的增加梯度也会变大但梯度不应该和样本量挂钩所以除以m。
所以此时优化函数就变成了:
a
r
g
m
i
n
1
m
∑
i
=
1
m
(
y
(
i
)
−
y
^
(
i
)
)
2
{argmin}\frac {1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2
argminm1∑i=1m(y(i)−y^(i))2
而这个函数就是
J
(
θ
)
=
M
S
E
(
y
,
y
^
)
J(\theta)=MSE(y,\hat y)
J(θ)=MSE(y,y^)
还有一种常用的形式:
a
r
g
m
i
n
1
2
m
∑
i
=
1
m
(
y
(
i
)
−
y
^
(
i
)
)
2
{argmin}\frac {1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2
argmin2m1∑i=1m(y(i)−y^(i))2 这种方式 求导后 分母2会消掉。
批量梯度下降法
def J(theta,x_b,y):
try:
#通过一组theta计算损失
return np.sumy(y - x_b.dot(theta)) ** 2/len(x_b)
except:
return float('INF')
def dJ(theta,x_b,y):
#初始化梯度数组
res = np.empty(len(theta))
res[0] = np.sum((x_b.dot(theta) - y))
for i in range(1,len(theta)):
#根据初始化为一组 0 的theta求导数
res[1] = (x_b.dot(theta)- y).dot(x_b[:,i])
return res * 2 /len(x_b)
def gradient_descent(x_b,y,initial_theta,eta,iters_num = 1e4,epsilon = 1e-8):
theta = initial_theta
i_iter = 0
while i_iter < iters_num:
#计算梯度值
gradient = dJ(theta,x_b,y)
last_theta = theta
#不停地更新损失函数 eta 为学习率
theta = theta - eta * gradient
#迭代次数
i_iter += 1
#比较两次求出的theta值如果2次 下降的间隔 小于一定的值 则认为达到最小点
if(abs(J(theta,x_b,y) - J(last_theta,x_b,y))<epsilon):
break
return theta
梯度求导向量化
上面对theta点求导用的是循环 更简洁的直接使用矩阵乘法效率更高。
2 m [ ∑ i = 1 m ( y ( i ) − X d ( i ) θ ) ⋅ ( − 1 ) ∑ i = 1 m ( y ( i ) − X d ( i ) θ ) ⋅ ( − X 1 ( i ) ) ∑ i = 1 m ( y ( i ) − X d ( i ) θ ) ⋅ ( − X 2 ( i ) ) . . . ∑ i = 1 m ( y ( i ) − X d ( i ) θ ) ⋅ ( − X n ( i ) ) ] \frac{2}{m}\ \left[ \begin{matrix} \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1 ^{(i)}) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2 ^{(i)}) \\ ... \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n ^{(i)}) \end{matrix} \right] m2 ⎣⎢⎢⎢⎢⎢⎡∑i=1m(y(i)−Xd(i)θ)⋅(−1)∑i=1m(y(i)−Xd(i)θ)⋅(−X1(i))∑i=1m(y(i)−Xd(i)θ)⋅(−X2(i))...∑i=1m(y(i)−Xd(i)θ)⋅(−Xn(i))⎦⎥⎥⎥⎥⎥⎤= 2 m ( X d ( 1 ) θ − y ( 1 ) , X d ( 2 ) θ − y ( 2 ) , X d ( 3 ) θ − y ( 3 ) , X d ( 4 ) θ − y ( 4 ) , . . . . . , X d ( m ) θ − y ( m ) ) ⋅ [ 1 X ( 1 ) ( 1 ) X ( 2 ) ( 1 ) . . . . X ( n ) ( 1 ) 1 X ( 1 ) ( 2 ) X ( 2 ) ( 2 ) . . . . X ( n ) ( 2 ) . . . . . 1 X ( 1 ) ( m ) X ( 2 ) ( m ) . . . . X ( n ) ( m ) ] \frac{2}{m}(X_d^{(1)}\theta -y^{(1)},X_d^{(2)}\theta -y^{(2)},X_d^{(3)}\theta -y^{(3)},X_d^{(4)}\theta -y^{(4)},.....,X_d^{(m)}\theta -y^{(m)})\cdot\left[ \begin{matrix} 1 & X^{(1)}_{(1)} & X^{(1)}_{(2)} & .... & X^{(1)}_{(n)}\\ 1 & X^{(2)}_{(1)} & X^{(2)}_{(2)} & .... & X^{(2)}_{(n)}\\ ..... \\1 & X^{(m)}_{(1)} & X^{(m)}_{(2)} & .... & X^{(m)}_{(n)}\\ \end{matrix} \right] m2(Xd(1)θ−y(1),Xd(2)θ−y(2),Xd(3)θ−y(3),Xd(4)θ−y(4),.....,Xd(m)θ−y(m))⋅⎣⎢⎢⎢⎡11.....1X(1)(1)X(1)(2)X(1)(m)X(2)(1)X(2)(2)X(2)(m)............X(n)(1)X(n)(2)X(n)(m)⎦⎥⎥⎥⎤ = 2 m ⋅ ( X d θ − y ) T ⋅ X d \frac{2}{m}\cdot(X_d\theta - y)^T\cdot X_d m2⋅(Xdθ−y)T⋅Xd = 2 m X d T ⋅ ( X d θ − y ) \frac{2}{m} X_d^T\cdot (X_d\theta - y) m2XdT⋅(Xdθ−y)
#矩阵化后求导代码
return x_b.T.dot(x_b.dot(theta) -y) * 2. /len(y)
随机梯度下降法(Stochastic Gradient Descent)
前面讨论了 用所有样本点进行计算搜索梯度,当样本量非常大是可能非常耗时,有没有其他方法呢?
2
⋅
(
X
d
(
i
)
)
T
⋅
(
X
d
(
i
)
θ
−
y
(
i
)
)
2\cdot (X_d^{(i)})^T\cdot (X_d^{(i)}\theta - y^{(i)})
2⋅(Xd(i))T⋅(Xd(i)θ−y(i))
可以拿出一个单独数据来搜索方向。
随机梯度每次迭代不会朝着更小的点前进可能还会往梯度更大的值走但在现实情况下随机梯度下降法是能找到最小值的附近的。
在随机梯度下降中 学习率随着迭代次数的增加需要不断变小,否则就算到达最小点可能也会在下一次迭代中跳过。
η
=
a
i
_
i
t
e
r
s
+
b
\eta=\frac{a}{i\_iters+b}
η=i_iters+ba
a 和 b 为超参数 i_iters为循环迭代次数
随着循环次数增加,学习率是下降的
模拟退火的思想
η
=
t
0
i
_
i
t
e
r
s
+
t
1
\eta=\frac{t_0}{i\_iters+t_1}
η=i_iters+t1t0
def learing_rate(t):
return t0 / (t + t1)
def dJ_Sgd(theta, x_i_b, y_i):
return x_i_d.T.dot((x_i_d.dot(theta) - y_i)) * 2
梯度的调试
当2个蓝点取任意 都无限趋近于 红点时求极限 用2点之间的斜率 当做红点的斜率 。
y
1
−
y
2
x
1
−
x
2
\frac{y_1 - y_2}{x_1 -x_2}
x1−x2y1−y2 斜率公式
x
1
=
θ
+
ϵ
x_1 =\theta + \epsilon
x1=θ+ϵ
x
2
=
θ
−
ϵ
x_2 =\theta - \epsilon
x2=θ−ϵ
y
1
=
f
(
θ
+
ϵ
)
y_1 =f(\theta + \epsilon)
y1=f(θ+ϵ)
y
2
=
f
(
θ
−
ϵ
)
y_2 =f(\theta - \epsilon)
y2=f(θ−ϵ)
f
(
θ
+
ϵ
)
−
f
(
θ
−
ϵ
)
θ
+
ϵ
−
θ
+
ϵ
\frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{\theta + \epsilon - \theta + \epsilon}
θ+ϵ−θ+ϵf(θ+ϵ)−f(θ−ϵ)
可得:
d
f
d
θ
=
f
(
θ
+
ϵ
)
−
f
(
θ
−
ϵ
)
2
ϵ
\frac{df}{d\theta} = \frac {f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}
dθdf=2ϵf(θ+ϵ)−f(θ−ϵ)
假设有n维参数:
θ
=
(
θ
0
,
θ
1
,
θ
2
,
θ
3
,
⋯
,
θ
n
)
\theta =(\theta_0,\theta_1,\theta_2,\theta_3,\cdots,\theta_n)
θ=(θ0,θ1,θ2,θ3,⋯,θn)
θ = ( ∂ f θ 0 , ∂ f θ 1 , ∂ f θ 2 , ∂ f θ 3 , ⋯ , ∂ f θ n ) \theta =(\frac{\partial f}{\theta_0},\frac{\partial f}{\theta_1},\frac{\partial f}{\theta_2},\frac{\partial f}{\theta_3},\cdots,\frac{\partial f}{\theta_n}) θ=(θ0∂f,θ1∂f,θ2∂f,θ3∂f,⋯,θn∂f)
如果要求 t h e t a 0 theta_0 theta0维度导数
θ 0 + = ( θ 0 + ε , θ 1 , θ 2 , θ 3 , ⋯ , θ n ) \theta^+_0 =(\theta_0 +\varepsilon,\theta_1,\theta_2,\theta_3,\cdots,\theta_n) θ0+=(θ0+ε,θ1,θ2,θ3,⋯,θn)
θ 0 − = ( θ 0 − ε , θ 1 , θ 2 , θ 3 , ⋯ , θ n ) \theta^-_0 =(\theta_0 -\varepsilon,\theta_1,\theta_2,\theta_3,\cdots,\theta_n) θ0−=(θ0−ε,θ1,θ2,θ3,⋯,θn)
d f d θ = f ( θ 1 + ) − f ( θ 1 − 2 ϵ \frac{df}{d\theta} = \frac {f(\theta_1^+) - f(\theta_1^-}{2\epsilon} dθdf=2ϵf(θ1+)−f(θ1−
以此类推 通过不断的近似
要计算 对
d
f
d
θ
\frac{df}{d\theta}
dθdf 可以通过 将每个
θ
\theta
θ加上一个最小的值 和减去一个最小的值 然后带入
d
f
d
θ
=
f
(
θ
+
ϵ
)
−
f
(
θ
−
ϵ
)
2
ϵ
\frac{df}{d\theta} = \frac {f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}
dθdf=2ϵf(θ+ϵ)−f(θ−ϵ)求导.
然后再用 梯度下降的方式 一步步找到最小值。
def J(theta,x_b,y):
try:
return np.sum((y - x_b.dot(theta)) ** 2) /len(x_b)
except:
return float('inf')
def dj_math(theta,x_b,y):
return x_b.T.dot(x_b.dot(theta) -y) * 2. /len(y)
def dj_debug(theta,x_b,y,epsilon = 0.01):
res = np.empty(len(theta))
for i in range(len(theta)):
#拷贝 theta
theta_p = theta.copy()
theta_m = theta.copy()
#找到theta左边的点
theta_p[i] += epsilon
#找到theta右边的点
theta_m[i] -= epsilon
#通2个点 斜率公式 当 2个点 和 点theta 无限接近的时候 2点之间的斜率就是 j在 theta 的导数 (导数定义)
res[i] = (J(theta_p,x_b,y) - J(theta_m,x_b,y))/(2 * epsilon)
return res
def gradient_descent(dj,x_b,y,initial_theta,eta,iters_num = 1e4,epsilon = 1e-4):
theta = initial_theta
i_iter = 0
while i_iter < iters_num:
gradient = dj(theta,x_b,y)
last_theta = theta
#通过不断迭代 theta
theta = theta - eta * gradient
i_iter += 1
#当 2 次 theta 波动小于某个值时则认为 到达了局部最小值 当函数是凸函数则是全局最小值
if(abs(J(theta,x_b,y) - J(last_theta,x_b,y))<epsilon):
break
return theta
X_b = np.hstack((np.ones([len(X),1]),X))
initial_theta =np.zeros(x_b.shape[1])
theta gradient_descent(dj_debug,X_b,y,eta=0.01,initial_theta=initial_theta)
LinearRegression封装代码
import numpy as np
from ml_utils.metrics import r_Squared
class LinearRegression:
def __init__(self):
"""初始化linear Regression模型"""
self.coef_ = None
self.interception_ = None
self._theta = None
def fit_normal(self,X_train,y_train):
assert X_train.shape[0] == y_train.shape[0], \
"the size of x_train must be equal to the size of y_train"
#传入的训练样本没有截距 在每一行加一个截距项
X_b = np.hstack([np.ones((len(X_train),1)),X_train])
#根据公式计算 theta 的值
self._theta = np .linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
self.interception_=self._theta[0]
self.coef_ = self._theta[1:]
return self
def fit_gd(self,X_train,y_train,eta,iters_num = 1e4,epsilon = 1e-7):
def J( theta, x_b, y):
try:
return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
except:
return float('INF')
def dJ(theta, x_b, y):
return x_b.T.dot(x_b.dot(theta) -y) * 2./len(y)
assert X_train.shape[0] == y_train.shape[0], \
"the size of x_train must be equal to the size of y_train"
b = np.ones([X_train.shape[0], 1])
X_train = np.hstack((b,X_train))
i_iter = 0
initial_theta = np.zeros(X_train.shape[1])
theta = initial_theta
while i_iter < iters_num:
gradient = dJ(theta, X_train, y_train)
last_theta = theta
theta = theta - eta * gradient
i_iter += 1
if (abs(J(theta, X_train, y_train) - J(last_theta, X_train, y_train)) < epsilon):
break
self._theta = theta
self.coef_ = theta[1:]
self.interception_=theta[0]
def sgd(self,X_train,y_train,epoch_num = 2):
assert epoch_num >= 1,\
"epoch must bigger than one."
assert X_train.shape[0] == y_train.shape[0], \
"the size of X_predict must be equal to the size of X_train"
t0 = 5
t1 = 50
X_train = np.hstack((np.ones((X_train.shape[0],1)),X_train))
init_theta = np.zeros(X_train.shape[1])
theta = init_theta
def learing_rate(t):
return t0 / (t + t1)
def dJ_Sgd(theta, x_i_b, y_i):
return x_i_b.T.dot((x_i_b.dot(theta) - y_i)) * 2
for cur_epoch in range(epoch_num):
disRupt_index = np.random.permutation(len(X_train))
disRupt_X_train = X_train[disRupt_index]
disRupt_y_train =y_train[disRupt_index]
random_test = np.random.randint(0,X_train.shape[0],10)
for i in range(X_train.shape[0]):
gradient = dJ_Sgd(theta,disRupt_X_train[i],disRupt_y_train[i])
theta = theta - learing_rate(i) * gradient
self._theta = theta
self.coef_ = theta[1:]
self.interception_ = theta[0]
xt = disRupt_X_train[random_test][:,1:]
epoch_score = self.score(xt,disRupt_y_train[random_test])
print("now traing epoch:" + str(cur_epoch) +' loss:' + str(epoch_score))
return self
def predict(self,X_predict):
assert self.coef_ is not None and self.interception_ is not None, \
"you must run fit_normal before predict"
assert X_predict.shape[1] == len(self.coef_), \
"the size of X_predict must be equal to the size of coef_"
X_predict_b = np.hstack([np.ones((len(X_predict),1)), X_predict])
return X_predict_b.dot(self._theta)
def score(self, X_test,y_test):
y_predict = self.predict(X_test)
return r_Squared(y_test,y_predict)
def __repr__(self):
return "LinearRegression()"
1 m ∑ i = 1 m ( y t e s t ( i ) − y ^ t e t s ( i ) ) 2 \frac {1}{m}\sum_{i=1}^m(y^{(i)}_{test} - \hat {y}^{(i)}_{tets})^2 m1∑i=1m(ytest(i)−y^tets(i))2 ↩︎