机器学习入门 04 梯度下降法（Gradient Descent）

最新推荐文章于 2023-07-31 08:57:24 发布

豪华手抓饼

最新推荐文章于 2023-07-31 08:57:24 发布

阅读量377

点赞数

分类专栏： 08 机器学习入门文章标签：机器学习

本文链接：https://blog.csdn.net/lihaogn/article/details/81670333

版权

08 机器学习入门专栏收录该内容

7 篇文章 0 订阅

订阅专栏

1 介绍

设一函数为 $y={ (x-4) }^{ 2 }+1$ ，则它的梯度，也就是一阶导数为 $\frac { dy }{ dx } =2(x-4)$ ，如下图所示。
这里写图片描述

梯度，可以代表方向，对应着函数 $y$ 增大的方向，当 $x$ 变化式， $y$ 也会有相应的变化。

在机器学习中，我们使用式 $-\eta \frac { dJ }{ d\theta }$ 来表示函数 $J$ 减小的方向。

$\nabla J=-\eta \frac { dJ }{ d\theta }$ ，其中 $\eta$ 的含义

$\eta$ 表示学习率（learning rate）。
取值影响获得最优解的速度。
取值不合适，则可能得不到最优解。
梯度下降法的一个超参数。
若 $\eta$ 太小，则会减慢收敛的速度；若 $\eta$ 太大，则可能会导致不收敛。

并不是所有函数都有唯一的极值点，如下图：

上述解决方法：

多次运行，随机化初始点。梯度下降的初始点也是一个超参数。

2 在线性回归中使用梯度下降法

在线性回归中，损失函数具有唯一的最优解。

2.1 目标

求出 $\theta =({ \theta }_{ 0 },{ \theta }_{ 1 },{ \theta }_{ 2 }\cdots { \theta }_{ n })$ ，使得 $J=\sum _{ i=1 }^{ m }{ { ({ y }^{ (i) }-{ \hat { y } }^{ (i) }) }^{ 2 } } $ 的值最小，其中 $\hat { y } }^{ (i) }={ \theta }_{ 0 }+{ \theta }_{ 1 }{ x }_{ 1 }^{ (i) }+{ \theta }_{ 2 }{ x }_{ 2 }^{ (i) }+\cdots +{ \theta }_{ n }{ x }_{ n }^{ (i) }$ 。

2.2 求解 $J$ 的梯度

$\nabla J(\theta )=\begin{pmatrix} { \partial J }/{ \partial { \theta }_{ 0 } } \\ { \partial J }/{ \partial { \theta }_{ 1 } } \\ \vdots \\ { \partial J }/{ \partial { \theta }_{ n } } \end{pmatrix}=\begin{pmatrix} \sum _{ i=1 }^{ m }{ 2({ { y }^{ (i) }-x }_{ b }^{ (i) }\theta )(-1) } \\ \sum _{ i=1 }^{ m }{ 2({ { y }^{ (i) }-x }_{ b }^{ (i) }\theta ) } { (-x }_{ 1 }^{ (i) }) \\ \vdots \\ \sum _{ i=1 }^{ m }{ 2({ { y }^{ (i) }-x }_{ b }^{ (i) }\theta ) } (-{ x }_{ n }^{ (i) }) \end{pmatrix}=2\begin{pmatrix} \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ 1 }^{ (i) } \\ \vdots \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ n }^{ (i) } \end{pmatrix}\\ \\$

令 $J(\theta )=\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { ({ y }^{ (i) }-{ \hat { y } }^{ (i) }) }^{ 2 } } =MSE(y,\hat { y } )$ ，则，

$\nabla J(\theta )=\frac { 2 }{ m } \begin{pmatrix} \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ 1 }^{ (i) } \\ \vdots \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ n }^{ (i) } \end{pmatrix}\\ \qquad =\frac { 2 }{ m } \begin{pmatrix} \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }){ x }_{ 0 }^{ (i) } } \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ 1 }^{ (i) } \\ \vdots \\ \sum _{ i=1 }^{ m }{ ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } { x }_{ n }^{ (i) } \end{pmatrix}，{ x }_{ 0 }^{ (i) }\equiv 1\\ \qquad =\frac { 2 }{ m } (\begin{matrix} { x }_{ b }^{ (1) }\theta -{ y }^{ (1) }, & { x }_{ b }^{ (2) }\theta -{ y }^{ (2) }, & \cdots , & { x }_{ b }^{ (m) }\theta -{ y }^{ (m) } \end{matrix})\cdot \begin{pmatrix} { x }_{ 0 }^{ (1) } & { x }_{ 1 }^{ (1) } & \cdots & { x }_{ n }^{ (1) } \\ { x }_{ 0 }^{ (2) } & { x }_{ 1 }^{ (2) } & \cdots & { x }_{ n }^{ (2) } \\ \vdots & \vdots & \ddots & \vdots \\ { x }_{ 0 }^{ (m) } & { x }_{ 1 }^{ (m) } & \cdots & { x }_{ n }^{ (m) } \end{pmatrix}\\ \qquad =\frac { 2 }{ m } { { [({ x }_{ b }\theta -y) }^{ \intercal }{ x }_{ b }] }^{ \intercal }\\ \qquad =\frac { 2 }{ m } { { x }_{ b } }^{ \intercal }({ x }_{ b }\theta -y)$

3 编程

3.1 模拟梯度下降法

import numpy as np
import matplotlib.pyplot as plt

# 1 准备数据
plot_x = np.linspace(-1, 6, 141)
# 2 定义损失函数
plot_y = (plot_x - 2.5) ** 2 - 1
plt.plot(plot_x, plot_y)
plt.show()


# 3 实现梯度下降法

# 定义损失函数的导数
def dj(theta):
    return 2 * (theta - 2.5)


# 定义损失函数
def j(theta):
    return (theta - 2.5) ** 2 - 1


# 使用梯度下降法求下降方向上的坐标
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0  # 防止eta过大造成死循环

    while i_iter < n_iters:
        gradient = dj(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if (abs(j(theta) - j(last_theta)) < epsilon):
            break

        i_iter += 1
    return


# 画图
def plot_theta_history():
    plt.plot(plot_x, j(plot_x))
    plt.plot(np.array(theta_history), j(np.array(theta_history)), color='r', marker='+')
    plt.show()


eta = 0.1
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

运行结果：
这里写图片描述

3.2 在线性回归中使用梯度下降法

1）LinearRegression_gd.py

import numpy as np
from comm_utils.testCapability import r2_score


class LinearRegressionGD:

    def __init__(self):
        """ 初识化 LR 模型"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_gd(self, x_train, y_train, eta=0.01, n_iters=1e4):
        """ 根据训练数据集x_train,y_train,使用梯度下降法训练Linear Regression模型"""
        assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to y_train"

        # 定义损失函数
        def j(theta, x_b, y):
            try:
                return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
            except:
                return float('inf')

        # 定义梯度函数
        def dj(theta, x_b, y):
            # return (x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)).T
            return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(x_b)

        # 求梯度下的theta
        def gradient_descent(x_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dj(theta, x_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(j(theta, x_b, y) - j(last_theta, x_b, y)) < epsilon):
                    break
                cur_iter += 1

            return theta

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        initial_theta = np.zeros(x_b.shape[1])
        self._theta = gradient_descent(x_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]  # 截距
        self.coef_ = self._theta[1:]  # theta

        return self

    def predict(self, x_predict):
        """ 给定待预测数据集x_predict,返回表示x_predict的结果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, "must fit before predict."
        assert x_predict.shape[1] == len(self.coef_), "the feature number of x_predict must be equal to x_train"

        x_b = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return x_b.dot(self._theta)

    def score(self, x_test, y_test):
        """ 根据测试数据集 x_test 和 y_test 确定当前模型的准确度"""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

2）test.py

import numpy as np
import matplotlib.pyplot as plt
import GradientDescent_pro.LinearRegression_gd

np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3. + 4. + np.random.normal(size=100)  # 线性方程，最后是添加的噪音

x = x.reshape(-1, 1)

plt.scatter(x, y)
plt.show()

lrgd = GradientDescent_pro.LinearRegression_gd.LinearRegressionGD()
lrgd.fit_gd(x, y)

print(lrgd.coef_)
print(lrgd.intercept_)

运行结果：
这里写图片描述

[3.00706277]
4.021457858204859

3.3 随机梯度下降法（Stochastic Gradient Descent）

搜索方向： $2\begin{pmatrix} { ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) } \\ { ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) }{ x }_{ 1 }^{ (i) } \\ \vdots \\ { ({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) }) }{ x }_{ n }^{ (i) } \end{pmatrix}=2{ ({ x }_{ b }^{ (i) }) }^{ \intercal }({ x }_{ b }^{ (i) }\theta -{ y }^{ (i) })\\ \qquad$

$\eta =\frac { 1 }{ i\_ iters } \rightarrow \frac { 1 }{ i\_ iters+b } \rightarrow \frac { a }{ i\_ iters+b } \rightarrow \frac { { t }_{ 0 } }{ { i\_ iters+t }_{ 1 } }$

特点：

下降方向随机
跳出局部最优解
更快的运行速度

1）在LinearRegression_gd.py中添加下面代码：

def fit_sgd(self, x_train, y_train, n_iters=5, t0=5, t1=50):
     """ 根据训练数据集，使用随机梯度下降法训练线性回归模型"""
     assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to y_train"
     assert n_iters >= 1

     # 随机梯度
     def dj_sgd(theta, x_b_i, y_i):
         return x_b_i * (x_b_i.dot(theta) - y_i) * 2.

     def sgd(x_b, y, initial_theta, n_iters, t0=5, t1=50):
         def learning_rate(t):
             return t0 / (t + t1)

         theta = initial_theta
         m = len(x_b)

         for cur_iter in range(n_iters):
             indexes = np.random.permutation(m)
             x_b_new = x_b[indexes]
             y_new = y[indexes]
             for i in range(m):
                 gradient = dj_sgd(theta, x_b_new[i], y_new[i])
                 theta = theta - learning_rate(cur_iter * m + i) * gradient
         return theta

     x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
     initial_theta = np.random.randn(x_b.shape[1])
     self._theta = sgd(x_b, y_train, initial_theta, n_iters, t0, t1)

     self.intercept_ = self._theta[0]
     self.coef_ = self._theta[1:]

     return self

运行结果：

3.004638997983383
[4.00713831]

3.4 梯度下降法的调试

$\frac { dJ }{ d\theta } =\frac { J(\theta +\varepsilon )-J(\theta -\varepsilon ) }{ 2\varepsilon } \\ \\ \theta =({ \theta }_{ 0 },{ \theta }_{ 1 },{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })，\frac { \partial J }{ \partial \theta } =(\frac { \partial J }{ \partial { \theta }_{ 0 } } ,\frac { \partial J }{ \partial { \theta }_{ 1 } } ,\frac { \partial J }{ \partial { \theta }_{ 2 } } ,\cdots ,\frac { \partial J }{ \partial { \theta }_{ n } } )\\ \\ { \theta }_{ 0 }^{ + }=({ \theta }_{ 0 }+\varepsilon ,{ \theta }_{ 1 },{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })，{ \theta }_{ 1 }^{ + }=({ \theta }_{ 0 },{ \theta }_{ 1 }+\varepsilon ,{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })\\ \\ { \theta }_{ 0 }^{ - }=({ \theta }_{ 0 }-\varepsilon ,{ \theta }_{ 1 },{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })，{ \theta }_{ 1 }^{ - }=({ \theta }_{ 0 },{ \theta }_{ 1 }-\varepsilon ,{ \theta }_{ 2 },\cdots ,{ \theta }_{ n })\\ \\ \frac { \partial J }{ \partial { \theta }_{ 0 } } =\frac { J({ \theta }_{ 0 }^{ + })-J({ \theta }_{ 0 }^{ - }) }{ 2\varepsilon } ,\frac { \partial J }{ \partial { \theta }_{ 1 } } =\frac { J({ \theta }_{ 1 }^{ + })-J({ \theta }_{ 1 }^{ - }) }{ 2\varepsilon }$

# 调试法求梯度
def dj_debug(theta,x_b,y,epsilon=0.01):
    res=np.empty(len(theta))
    for i in range(len(theta)):
        theta_1=theta.copy()
        theta_1[i]+=epsilon
        theta_2=theta.copy()
        theta_2[i]-=epsilon
        res[i]=(j(theta_1,x_b,y)-j(theta_2,x_b,y))/(2*epsilon)
    return res

4 注意

使用梯度下降法前最好进行数据归一化。

豪华手抓饼

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
机器学习入门 04 梯度下降法（Gradient Descent）

1 介绍设一函数为 y=(x−4)2+1y=(x−4)2+1y={ (x-4) }^{ 2 }+1，则它的梯度，也就是一阶导数为 dydx=2(x−4)dydx=2(x−4)\frac { dy }{ dx } =2(x-4)，如下图所示。梯度，可以代表方向，对应着函数 yyy 增大的方向，当 xxx 变化式，yyy 也会有相应的变化。在机器学习中，我们使用式 −ηdJdθ−ηdJ...
复制链接

扫一扫