Introduction to Machine Learning 04: Gradient Descent

1 Introduction

Consider the function $y = (x - 4)^2 + 1$. Its gradient, i.e. its first derivative, is $\frac{dy}{dx} = 2(x - 4)$, as shown in the figure below.
[Figure: plot of $y = (x - 4)^2 + 1$]

The gradient indicates a direction: the direction in which the function $y$ increases. When $x$ changes, $y$ changes accordingly.

In machine learning, we use the term $-\eta \frac{dJ}{d\theta}$ to move in the direction in which the loss function $J$ decreases.
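
For a concrete, purely illustrative step with the function above: starting from $x = 5$ with $\eta = 0.1$, the gradient is $2(5 - 4) = 2$, so one update gives $x \leftarrow 5 - 0.1 \times 2 = 4.8$, which moves toward the minimum at $x = 4$.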

In the update step $\theta \leftarrow \theta - \eta \frac{dJ}{d\theta}$, the meaning of $\eta$:

  • $\eta$ is the learning rate.
  • Its value affects how quickly the optimal solution is reached.
  • If its value is poorly chosen, the optimal solution may never be reached.
  • It is a hyperparameter of gradient descent.
  • If $\eta$ is too small, convergence is slow; if $\eta$ is too large, the algorithm may not converge at all (see the sketch below).
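
A minimal sketch of that last point, using a made-up one-dimensional loss $J(\theta) = (\theta - 2.5)^2 - 1$ (the same shape simulated in Section 3.1); the function names and step counts here are illustrative only:

def j(theta):
    # toy loss: a parabola with its minimum at theta = 2.5
    return (theta - 2.5) ** 2 - 1


def dj(theta):
    # analytic derivative of the toy loss
    return 2 * (theta - 2.5)


def run(eta, n_iters=50, theta=0.0):
    # take n_iters plain gradient-descent steps and return the final theta
    for _ in range(n_iters):
        theta = theta - eta * dj(theta)
    return theta


for eta in (0.01, 0.1, 1.1):
    print("eta =", eta, "-> theta after 50 steps:", run(eta))
# eta = 0.01 creeps slowly toward 2.5, eta = 0.1 is essentially converged,
# eta = 1.1 overshoots on every step and diverges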

Not every function has a unique extremum; some have several local minima.

Ways to deal with this:

  • Run the algorithm several times from randomized initial points and keep the best result. The initial point of gradient descent is also a hyperparameter. (A minimal sketch follows below.)
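
A minimal sketch of the restart idea, assuming a made-up non-convex one-dimensional loss with two local minima; every name and constant here is illustrative:

import numpy as np


def j(theta):
    # toy non-convex loss with two local minima (illustrative only)
    return 0.1 * theta ** 4 - theta ** 2 + 0.5 * theta


def dj(theta):
    # analytic derivative of the toy loss
    return 0.4 * theta ** 3 - 2 * theta + 0.5


def gradient_descent(theta, eta=0.01, n_iters=1000):
    for _ in range(n_iters):
        theta = theta - eta * dj(theta)
    return theta


np.random.seed(666)
# start from several random initial points and keep the theta with the lowest loss
candidates = [gradient_descent(t0) for t0 in np.random.uniform(-3, 3, size=10)]
best = min(candidates, key=j)
print(best, j(best))  # the run that found the deeper minimum wins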

2 Using Gradient Descent in Linear Regression

In linear regression, the loss function has a unique optimal solution.

2.1 Objective

Find $\theta = (\theta_0, \theta_1, \theta_2, \cdots, \theta_n)$ such that $J = \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$ is minimized, where $\hat{y}^{(i)} = \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_n x_n^{(i)}$.
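
To make the notation concrete, here is a minimal sketch that evaluates $\hat{y}^{(i)}$ and $J$ on a tiny made-up dataset (all array names and values are illustrative):

import numpy as np

# toy data: m = 3 samples, n = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
y = np.array([8.0, 6.5, 10.0])

theta = np.array([1.0, 2.0, 1.5])            # (theta_0, theta_1, theta_2)
X_b = np.hstack([np.ones((len(X), 1)), X])   # prepend x_0 = 1 to every sample

y_hat = X_b.dot(theta)                       # y_hat_i = theta_0 + theta_1*x_1 + theta_2*x_2
J = np.sum((y - y_hat) ** 2)                 # the objective to be minimized over theta
print(y_hat, J)                              # J evaluates to 6.8125 for this theta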

2.2 Deriving the Gradient of $J$

$$\nabla J(\theta) = \begin{pmatrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \vdots \\ \partial J / \partial \theta_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{m} 2\left( y^{(i)} - x_b^{(i)}\theta \right)(-1) \\ \sum_{i=1}^{m} 2\left( y^{(i)} - x_b^{(i)}\theta \right)\left( -x_1^{(i)} \right) \\ \vdots \\ \sum_{i=1}^{m} 2\left( y^{(i)} - x_b^{(i)}\theta \right)\left( -x_n^{(i)} \right) \end{pmatrix} = 2\begin{pmatrix} \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) \\ \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_n^{(i)} \end{pmatrix}$$

Taking $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \mathrm{MSE}(y, \hat{y})$ instead, we get

$$\begin{aligned} \nabla J(\theta) &= \frac{2}{m} \begin{pmatrix} \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) \\ \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_n^{(i)} \end{pmatrix} \\ &= \frac{2}{m} \begin{pmatrix} \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_0^{(i)} \\ \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} \left( x_b^{(i)}\theta - y^{(i)} \right) x_n^{(i)} \end{pmatrix}, \quad x_0^{(i)} \equiv 1 \\ &= \frac{2}{m} \begin{pmatrix} x_b^{(1)}\theta - y^{(1)}, & x_b^{(2)}\theta - y^{(2)}, & \cdots, & x_b^{(m)}\theta - y^{(m)} \end{pmatrix} \cdot \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{pmatrix} \\ &= \frac{2}{m} \left[ \left( x_b\theta - y \right)^\intercal x_b \right]^\intercal \\ &= \frac{2}{m} x_b^\intercal \left( x_b\theta - y \right) \end{aligned}$$
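
The last line is exactly what dj computes in Section 3.2 below. A minimal sketch on made-up data, checking the vectorized form $\frac{2}{m} x_b^\intercal (x_b\theta - y)$ against the component-wise sums above:

import numpy as np

np.random.seed(666)
m, n = 100, 3
X_b = np.hstack([np.ones((m, 1)), np.random.random((m, n))])  # x_0 = 1 prepended
y = np.random.random(m)
theta = np.random.random(n + 1)

# component-wise sums, one entry per theta_k, following the column vector above
grad_sum = np.array([np.sum((X_b.dot(theta) - y) * X_b[:, k]) for k in range(n + 1)]) * 2. / m

# vectorized form: (2/m) * X_b^T (X_b theta - y)
grad_vec = X_b.T.dot(X_b.dot(theta) - y) * 2. / m

print(np.allclose(grad_sum, grad_vec))  # True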

3 Implementation

3.1 Simulating Gradient Descent

import numpy as np
import matplotlib.pyplot as plt

# 1 Prepare the data
plot_x = np.linspace(-1, 6, 141)
# 2 Define the loss function (its values over plot_x, for plotting)
plot_y = (plot_x - 2.5) ** 2 - 1
plt.plot(plot_x, plot_y)
plt.show()


# 3 Implement gradient descent

# Derivative of the loss function
def dj(theta):
    return 2 * (theta - 2.5)


# Loss function
def j(theta):
    return (theta - 2.5) ** 2 - 1


# Gradient descent, recording every theta visited along the descent path
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0  # cap the iteration count so an overly large eta cannot cause an endless loop

    while i_iter < n_iters:
        gradient = dj(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if abs(j(theta) - j(last_theta)) < epsilon:
            break

        i_iter += 1
    return


# Plot the loss curve and the visited points
def plot_theta_history():
    plt.plot(plot_x, j(plot_x))
    plt.plot(np.array(theta_history), j(np.array(theta_history)), color='r', marker='+')
    plt.show()


eta = 0.1
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

Result:

[Figure: the loss curve, and the points visited by gradient descent marked with '+']

3.2 Using Gradient Descent in Linear Regression

1) LinearRegression_gd.py

import numpy as np
from comm_utils.testCapability import r2_score


class LinearRegressionGD:

    def __init__(self):
        """Initialize the Linear Regression model."""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_gd(self, x_train, y_train, eta=0.01, n_iters=1e4):
        """Train the Linear Regression model on x_train, y_train using (batch) gradient descent."""
        assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to y_train"

        # Loss function
        def j(theta, x_b, y):
            try:
                return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
            except Exception:
                return float('inf')

        # Gradient of the loss function
        def dj(theta, x_b, y):
            # return (x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)).T
            return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)

        # Run gradient descent to find theta
        def gradient_descent(x_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dj(theta, x_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if abs(j(theta, x_b, y) - j(last_theta, x_b, y)) < epsilon:
                    break
                cur_iter += 1

            return theta

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        initial_theta = np.zeros(x_b.shape[1])
        self._theta = gradient_descent(x_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]  # intercept
        self.coef_ = self._theta[1:]  # coefficients

        return self

    def predict(self, x_predict):
        """Return the predictions for the given data set x_predict."""
        assert self.intercept_ is not None and self.coef_ is not None, "must fit before predict."
        assert x_predict.shape[1] == len(self.coef_), "the feature number of x_predict must be equal to x_train"

        x_b = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return x_b.dot(self._theta)

    def score(self, x_test, y_test):
        """Evaluate the model on the test set x_test, y_test (R^2 score)."""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

2) test.py

import numpy as np
import matplotlib.pyplot as plt
import GradientDescent_pro.LinearRegression_gd

np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3. + 4. + np.random.normal(size=100)  # linear relation y = 3x + 4 plus Gaussian noise

x = x.reshape(-1, 1)

plt.scatter(x, y)
plt.show()

lrgd = GradientDescent_pro.LinearRegression_gd.LinearRegressionGD()
lrgd.fit_gd(x, y)

print(lrgd.coef_)
print(lrgd.intercept_)

Result:

[Figure: scatter plot of x against y]

[3.00706277]
4.021457858204859
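
As a sanity check (not part of the original script), the result can be compared with scikit-learn's closed-form LinearRegression fitted on the same x and y arrays from test.py; its estimates should closely match the gradient-descent values printed above:

from sklearn.linear_model import LinearRegression

# x, y are the arrays generated in test.py above
reg = LinearRegression()
reg.fit(x, y)
print(reg.coef_, reg.intercept_)  # expected to be close to the gradient-descent estimates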

3.3 Stochastic Gradient Descent

In each step, instead of summing over all $m$ samples, a single randomly chosen sample $i$ is used. Search direction:

$$2\begin{pmatrix} \left( x_b^{(i)}\theta - y^{(i)} \right) \\ \left( x_b^{(i)}\theta - y^{(i)} \right) x_1^{(i)} \\ \vdots \\ \left( x_b^{(i)}\theta - y^{(i)} \right) x_n^{(i)} \end{pmatrix} = 2\left( x_b^{(i)} \right)^\intercal \left( x_b^{(i)}\theta - y^{(i)} \right)$$

Because this direction is noisy, the learning rate should decay as the iterations proceed. A natural progression of schedules is

$$\eta = \frac{1}{i\_iters} \rightarrow \frac{1}{i\_iters + b} \rightarrow \frac{a}{i\_iters + b} \rightarrow \frac{t_0}{i\_iters + t_1}$$
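
A minimal sketch of the final schedule with the same defaults $t_0 = 5$, $t_1 = 50$ used by fit_sgd below, just to show how quickly the step size shrinks:

def learning_rate(t, t0=5, t1=50):
    # eta decays as the update counter t grows
    return t0 / (t + t1)


for t in (0, 10, 100, 1000, 10000):
    print(t, learning_rate(t))
# 0 -> 0.1, 100 -> 0.033..., 10000 -> 0.000497...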

Characteristics:

  • The descent direction is random rather than the exact gradient.
  • It can jump out of local optima.
  • It runs faster, since each step uses only one sample.

1) Add the following method to LinearRegression_gd.py (inside the LinearRegressionGD class):

def fit_sgd(self, x_train, y_train, n_iters=5, t0=5, t1=50):
    """Train the Linear Regression model using stochastic gradient descent; n_iters is the number of passes over the training set."""
    assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to y_train"
    assert n_iters >= 1

    # Gradient based on a single sample
    def dj_sgd(theta, x_b_i, y_i):
        return x_b_i * (x_b_i.dot(theta) - y_i) * 2.

    def sgd(x_b, y, initial_theta, n_iters, t0=5, t1=50):
        def learning_rate(t):
            return t0 / (t + t1)

        theta = initial_theta
        m = len(x_b)

        for cur_iter in range(n_iters):
            # shuffle the samples once per pass, then visit each exactly once
            indexes = np.random.permutation(m)
            x_b_new = x_b[indexes]
            y_new = y[indexes]
            for i in range(m):
                gradient = dj_sgd(theta, x_b_new[i], y_new[i])
                theta = theta - learning_rate(cur_iter * m + i) * gradient
        return theta

    x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
    initial_theta = np.random.randn(x_b.shape[1])
    self._theta = sgd(x_b, y_train, initial_theta, n_iters, t0, t1)

    self.intercept_ = self._theta[0]
    self.coef_ = self._theta[1:]

    return self

Result:

3.004638997983383
[4.00713831]

3.4 Debugging Gradient Descent

The analytic gradient can be checked against a numerical approximation based on central differences:

$$\frac{dJ}{d\theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$$

$$\theta = (\theta_0, \theta_1, \theta_2, \cdots, \theta_n), \qquad \frac{\partial J}{\partial \theta} = \left( \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2}, \cdots, \frac{\partial J}{\partial \theta_n} \right)$$

$$\theta_0^+ = (\theta_0 + \varepsilon, \theta_1, \theta_2, \cdots, \theta_n), \qquad \theta_1^+ = (\theta_0, \theta_1 + \varepsilon, \theta_2, \cdots, \theta_n)$$

$$\theta_0^- = (\theta_0 - \varepsilon, \theta_1, \theta_2, \cdots, \theta_n), \qquad \theta_1^- = (\theta_0, \theta_1 - \varepsilon, \theta_2, \cdots, \theta_n)$$

$$\frac{\partial J}{\partial \theta_0} \approx \frac{J(\theta_0^+) - J(\theta_0^-)}{2\varepsilon}, \qquad \frac{\partial J}{\partial \theta_1} \approx \frac{J(\theta_1^+) - J(\theta_1^-)}{2\varepsilon}, \qquad \cdots$$

# Estimate the gradient numerically (central differences), for debugging
def dj_debug(theta, x_b, y, epsilon=0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
        # perturb the i-th component of theta in both directions
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (j(theta_1, x_b, y) - j(theta_2, x_b, y)) / (2 * epsilon)
    return res
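
A minimal usage sketch on made-up data, comparing dj_debug with the analytic gradient derived in Section 2.2 (j and dj_math below repeat the loss and gradient from Section 3.2; everything else is illustrative):

import numpy as np


def j(theta, x_b, y):
    return np.sum((y - x_b.dot(theta)) ** 2) / len(y)


def dj_math(theta, x_b, y):
    # analytic gradient: (2/m) * X_b^T (X_b theta - y)
    return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)


np.random.seed(666)
x = np.random.random((1000, 10))
true_theta = np.arange(1, 12, dtype=float)
x_b = np.hstack([np.ones((len(x), 1)), x])
y = x_b.dot(true_theta) + np.random.normal(size=1000)

theta = np.random.random(x_b.shape[1])
print(np.allclose(dj_debug(theta, x_b, y), dj_math(theta, x_b, y)))  # True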

4 Notes

  • It is best to normalize (feature-scale) the data before using gradient descent; otherwise features with very different scales force an extremely small learning rate. A minimal sketch follows below.
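
A minimal sketch, assuming made-up data with two features on very different scales and reusing the LinearRegressionGD class from Section 3.2 (imported via the same package path as in test.py); scikit-learn's StandardScaler does the scaling:

import numpy as np
from sklearn.preprocessing import StandardScaler
from GradientDescent_pro.LinearRegression_gd import LinearRegressionGD

np.random.seed(666)
# two features on wildly different scales
x = np.hstack([np.random.uniform(0., 1., (100, 1)),
               np.random.uniform(0., 10000., (100, 1))])
y = x.dot(np.array([3., 0.002])) + 4. + np.random.normal(size=100)

scaler = StandardScaler()
x_std = scaler.fit_transform(x)   # zero mean, unit variance per feature

lrgd = LinearRegressionGD()
lrgd.fit_gd(x_std, y, eta=0.01)   # converges; on the raw x this eta would blow up
print(lrgd.coef_, lrgd.intercept_)  # note: coef_ is expressed in the scaled feature space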