机器学习基础-4.梯度下降法

最新推荐文章于 2023-07-25 11:28:21 发布

闲人勿-

最新推荐文章于 2023-07-25 11:28:21 发布

阅读量1k

点赞数

分类专栏：机器学习文章标签：梯度下降法

本文链接：https://blog.csdn.net/qq_25560849/article/details/80454903

版权

机器学习专栏收录该内容

12 篇文章 6 订阅

订阅专栏

一、梯度下降法

1.梯度下降法概念

梯度下降法本身不是一个算法，而是一种基于搜索的最优化方法，作用是最小化一个损失函数，这个损失函数不能够或者很难通过公式推导的方式得出一个数学解，所以需要用基于搜索的方法一步一步得到最终解。相对的有梯度上升法，即最大化一个效用函数，思想是完全一样的。

梯度下降法只能找到局部的最优解，但是对于一般情况，可能存在多个极值点。如下图。

解决的方法：多次运行，随机初始点。由此可见，这个初始点也是一个超参数。

2.编程模拟梯度下降法

import numpy as np
import matplotlib.pyplot as plt
plot_x = np.linspace(-1,6,141)
plot_y = (plot_x-2.5)**2-1 #这里给出我们的函数，接下来用梯度下降法求得最小值
plt.plot(plot_x,plot_y)
plt.show()

def dj(theta):
    return 2*(theta-2.5)
def j(theta):
    return (theta-2.5)**2-1
theta =0.0
eta = 0.1
epsilon = 1e-8
while True:
    gradient = dj(theta)
    last_theta = theta
    theta = theta - eta*gradient
    if (abs(j(theta)-j(last_theta))<epsilon):
        break
print(theta)
print(j(theta))

为了方便可视化，将上面代码可以修改如下。

def dj(theta):
    return 2*(theta-2.5)
def j(theta):
    return (theta-2.5)**2-1

theta =0.0
theta_history =[theta]
eta = 0.1
epsilon = 1e-8
while True:
    gradient = dj(theta)
    last_theta = theta
    theta_history.append(theta)
    theta = theta - eta*gradient
    if (abs(j(theta)-j(last_theta))<epsilon):
        break
plt.plot(plot_x,j(plot_x))
plt.plot(np.array(theta_history),j(np.array(theta_history)),color='red',marker='+')

可以清楚地看到下降的过程，开始时候下降比较快，这是因为斜率比较大的原因，可以通过len(theta_history)查看下降了多次得到最终的最优解。

通过调整eta值，查看效果。左图eta=0.01，右图eta=0.8，如果eta=1，将会直接出现值过大的异常。

二、多元线性回归梯度下降法

1.求解思路

为了使得梯度和样本的数目m无关，对函数进行调整。最终的损失函数即求MSE。

2.代码实现

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = 2* np.random.random(size=100)
y = x*3. + 4. + np.random.normal(size=100)

X = x.reshape(-1,1)
plt.scatter(x,y)
plt.show()

def j(theta,X_b,y):
    try:
        return np.sum((y-X_b.dot(theta))**2)/len(X_b)
    except:
        return float('inf')
    
def dj(theta,X_b,y):
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta)-y)
    for i in range(1,len(theta)):
        res[i] = (X_b.dot(theta)-y).dot(X_b[:,i])
    return res * 2 / len(X_b)

def gradient_descent(X_b,y,initial_theta,eta,n_iters=1e4,epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    while i_iter< n_iters:
        gradient = dj(theta,X_b,y)
        last_theta = theta
        theta = theta-eta*gradient
       
        if(abs(j(last_theta,X_b,y)-j(theta,X_b,y))<epsilon):
            break
        i_iter +=1
    return theta

X_b = np.hstack([np.ones((len(x),1)),x.reshape(-1,1)])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b,y,initial_theta,eta)
theta

3.LinearRegression

创建LinearRegression将代码进行封装。

import numpy as np
from .metrics import r2_score
class LinearRegression:
    def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
            except:
                return float('inf')

        def dJ(theta, X_b, y):
            res = np.empty(len(theta))
            res[0] = np.sum(X_b.dot(theta) - y)
            for i in range(1, len(theta)):
                res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            return res * 2 / len(X_b)
            
        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                cur_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

测试代码，最终的截距和系数和之前得到的一致。

from ml.LinearRegression import LinearRegression
linreg = LinearRegression()
linreg.fit_gd(X,y)
linreg.coef_
linreg.intercept_

三、线性回归中梯度下降法的向量化

1.向量化

对求得的梯度公式进行变形，对于第一项乘以值为1的X0。

根据线性代数交换两个项的位置，可以得到。

因此上面求dj的代码可以改变成如下。

def dJ(theta, X_b, y):
   return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(X_b)

现在对代码进行实际使用，调用的数据为波士顿房价。

import numpy as np  
import matplotlib.pyplot as plt  
from sklearn import datasets  
boston = datasets.load_boston()  
x = boston.data #使用所有特征值  
y = boston.target  
x = x[ y < 50.0]   
y = y[ y < 50.0]  
from ml.model_selection import train_test_split  
x_train,x_test,y_train,y_test = train_test_split(x,y,seed=666)  
from ml.LinearRegression import LinearRegression  
reg = LinearRegression()  
reg.fit_gd(x_train,y_train)  
reg.coef_  
reg.intercept_  
reg.score(x_test,y_test)

运行发现值溢出。原因是步长过大。

将eta值改为0.000001，由于eta变小，所以迭代次数需要相应的增加。

reg.fit_gd(x_train,y_train,eta=0.000001,n_iters=1e6)

最终得到的分值为0.7541852353980764，依然没有达到最高的分数（根据线性回归博客中，得到的分数是0.8129）。主要原因是数据规模不在一个量级上。

2.归一化

归一化的方式采用knn博客中的方式，这里调用sklearn的归一化函数模块。

from sklearn.preprocessing import StandardScaler
standscaler = StandardScaler()
standscaler.fit(x_train)
x_train_standard = standscaler.transform(x_train)
reg = LinearRegression()
reg.fit_gd(x_train_standard,y_train)
x_test_standard = standscaler.transform(x_test)
reg.score(x_test_standard,y_test)

四、随机梯度下降法

梯度下降法每次求梯度都需要将所有的样本代入进行运算，当样本的数量较大时，耗费的时间将会很多，为了改进，提出了随机梯度下降法。它采用的公式如下，即每次只取1个样本进行运算。由于是随机的，所以不能保证每次损失函数都能是下降的趋势。

1.思想

这里对每次的步长进行动态的调整，使之随着迭代次数不断地减小，这里取a=5，b=50。【模拟退火】

2.代码实现

ef j(theta,X_b,y):  
    try:  
        return np.sum((y-X_b.dot(theta))**2)/len(X_b)  
    except:  
        return float('inf')  
      
def dj_sgd(theta,X_b_i,y_i):  
     return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2. 
  
def sgd(X_b,y,initial_theta,n_iters=1e4):  
    t0=5
    t1=50
    def learning_rate(t):
        return t0/(t+t1)
    theta = initial_theta  
    i_iter = 0  
    while i_iter< n_iters: 
        rand_i = np.random.randint(len(X_b))
        gradient = dj_sgd(theta,X_b[rand_i],y[rand_i])  
        i_iter +=1  
    return theta

测试效果：

%%time
X_b = np.hstack([np.ones((len(X),1)),X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b,y,initial_theta,n_iters=len(X_b)//3)

，可以将结果和梯度下降法进行比较，发现快了非常多。

3.封装我们的代码，并做改进

import numpy as np
from .metrics import r2_score

class LinearRegression:
    def fit_sgd(self, X_train, y_train, n_iters=5, t0=5, t1=50): #这里的n_iters代表将样本循环看几遍
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert n_iters >= 1

        def dJ_sgd(theta, X_b_i, y_i):
            return X_b_i * (X_b_i.dot(theta) - y_i) * 2.

        def sgd(X_b, y, initial_theta, n_iters, t0=5, t1=50):

            def learning_rate(t):
                return t0 / (t + t1)

            theta = initial_theta
            m = len(X_b)

            for cur_iter in range(n_iters):
                indexes = np.random.permutation(m)#将样本乱序，保证每个样本被看一遍，也就是模拟了随机梯度
                X_b_new = X_b[indexes]
                y_new = y[indexes]
                for i in range(m):
                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(cur_iter * m + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.random.randn(X_b.shape[1])
        self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

4.sklearn中随机梯度下降

from sklearn.linear_model import SGDRegressor #因为是在线性回归中个，所以sklearn中的随机梯度下降只能运用在线性回归中
sgd_reg = SGDRegressor(n_iter=100) #可传入参数，代表将全部样本看100遍
%time sgd_reg.fit(X_train_standard,y_train)
sgd_reg.score(X_test_standard,y_test)

需要注意的是：sklearn中随机梯度更加优化和复杂，上面封装的代码只是为了学习基本的原理。

五、关于梯度公式的验证

有时候梯度求解可能会出错，但是运算过程不会报错。为了验证所求梯度公式的正确性，需要先用梯度求解的原始思想来求解梯度【即逼近的思想】，用这种方式求得的梯度代入数据进行运算，并记录结果。之后再将求得的梯度公式也用同样的数据进行运算，检查所得的结果是否和之前求得一致。如果一致，说明梯度求解的公式是正确的。

这里采用求导数最原始的思想，即逼近的思想，当epsilon逼近0的时候，就是在该处的导数。对于更高纬度的有如下形式。

对于某个方向上的导数公式如下。

import numpy as np  
import matplotlib.pyplot as plt  
np.random.seed(666)
x = np.random.random(size=(1000,10))
t_theta = np.arange(1,12,dtype=float)
x_b = np.hstack([np.ones((len(x),1)),x])
y = x_b.dot(t_theta)+np.random.normal(size=1000)
def j(theta,x_b,y):  
    try:  
        return np.sum((y-x_b.dot(theta))**2)/len(x_b)  
    except:  
        return float('inf')  
def dj_math(theta,x_b,y):
    return x_b.T.dot(x_b.dot(theta) - y) * 2. /len(x_b)  
def dj_debug(theta,x_b,y,epsilon=0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_2 = theta.copy()
        theta_1[i] += epsilon
        theta_2[i] -= epsilon
        res[i] =(j(theta_1,x_b,y)-j(theta_2,x_b,y))/(2*epsilon)
    return res 
def gd (dj,x_b,y,initial_theta,eta,n_iters =1e4,epsilon=1e-8):
    theta = initial_theta
    cur_iter =0
    while cur_iter < n_iters:
        gradient = dj(theta,x_b,y)
        last_theta = theta
        theta = theta-eta*gradient
        if (abs(j(theta,x_b,y)-j(last_theta,x_b,y))<epsilon):
            break
        cur_iter +=1
    return theta

代码中，dj_debug代表调试梯度的代码，dj_math代表用数学推导得出的梯度。下面进行比较。可以发现调试时候的运算时间比较长，因此这仅仅是适用于验证公式正确性时候采用的。

闲人勿-

关注

0
点赞
踩
6

收藏

觉得还不错? 一键收藏
1
评论
机器学习基础-4.梯度下降法

一、梯度下降法梯度下降法本身不是一个算法，而是一种基于搜索的最优化方法，作用是最小化一个损失函数，这个损失函数不能够或者很难通过公式推导的方式得出一个解，所以需要用基于搜索的方法一步一步得到最终解。相对的有梯度上升法，即最大化一个效用函数，思想是完全一样的。...
复制链接

扫一扫