Machine Learning (4) - Gradient Descent

Gradient Descent

Gradient descent is a very general optimization algorithm capable of finding optimal solutions to a wide range of problems. Its goal is to minimize a cost function by iteratively adjusting the parameters.

Suppose you are lost on a mountain in thick fog and all you can feel is the slope of the ground under your feet. A good strategy for reaching the bottom quickly is to head downhill in the direction of the steepest slope. That is exactly the idea behind gradient descent: measure the local gradient of the error function with respect to the parameter vector $\theta$ and keep stepping in the direction of the descending gradient until the gradient reaches 0, at which point a minimum has been reached.

Concretely, start from a random $\theta$ (this is called random initialization) and then improve it gradually, one step at a time, each step decreasing the cost function (for example the MSE, see footnote 1) a little, until the algorithm converges to a minimum, as shown in the figure below.
(figure: gradient descent stepping toward the minimum of the cost function)
An important parameter of gradient descent is the size of each step, called the learning rate (a hyperparameter). If the learning rate is too low, many iterations are needed before the algorithm converges. If it is too high, you may jump right across the valley floor, and can even end up higher than where you started.
(figure: a learning rate that is too low vs. one that is too high)
As the figures above show, the learning rate should be neither too low nor too high; in practice 0.01 is a common default, as the short sketch below illustrates.
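To make this concrete, here is a minimal sketch (the one-dimensional loss $J(\theta)=\theta^2$ and the specific learning rates are illustrative assumptions, not part of the original example) showing how the learning rate affects convergence:

def gd_1d(eta, theta=5.0, n_steps=50):
    # plain gradient descent on J(theta) = theta ** 2, whose gradient is 2 * theta
    for _ in range(n_steps):
        theta = theta - eta * 2 * theta
    return theta
for eta in (0.001, 0.1, 1.1):
    print(eta, gd_1d(eta))
# 0.001 -> still far from the minimum (too slow)
# 0.1   -> essentially 0 (converged)
# 1.1   -> enormous value (diverged)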
With MSE as the loss function we are lucky: MSE is convex, so gradient descent can find the global minimum.
In practice, loss functions are often not so well behaved, and gradient descent may only find a local minimum.

$\hat{y}^{(i)} = b + \theta_{1}X_{1}^{(i)} + \theta_{2}X_{2}^{(i)} + \theta_{3}X_{3}^{(i)} + \cdots + \theta_{n}X_{n}^{(i)}$
$X_d = \begin{bmatrix} 1 & X_{1}^{(1)} & X_{2}^{(1)} & \cdots & X_{n}^{(1)} \\ 1 & X_{1}^{(2)} & X_{2}^{(2)} & \cdots & X_{n}^{(2)} \\ \vdots \\ 1 & X_{1}^{(m)} & X_{2}^{(m)} & \cdots & X_{n}^{(m)} \end{bmatrix}$
$\theta = (\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}, \dots, \theta_{n})^T$
$\hat{y} = X_d\theta$
where $X_d$ is an $m \times (n+1)$ matrix, $\theta$ is an $(n+1)$-dimensional vector, and $\hat{y}$ is an $m$-dimensional vector of predictions.
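As a minimal NumPy sketch of this matrix form (the sizes and random data below are just for illustration):

import numpy as np

m, n = 100, 3                              # m samples, n features
X = np.random.random((m, n))
theta = np.zeros(n + 1)                    # theta[0] is the intercept b
X_d = np.hstack([np.ones((m, 1)), X])      # prepend a column of ones
y_hat = X_d.dot(theta)                     # shape (m,): one prediction per sample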

The objective is to minimize the loss function: $\arg\min \sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$
Applying the rules of differentiation to each component gives:
$\nabla J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^m 2(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m 2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1^{(i)}) \\ \sum_{i=1}^m 2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2^{(i)}) \\ \vdots \\ \sum_{i=1}^m 2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n^{(i)}) \end{bmatrix}$
These sums grow with the number of samples, but the gradient should not depend on the sample size, so we divide by $m$:
$\nabla J(\theta) = \frac{2}{m} \begin{bmatrix} \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1^{(i)}) \\ \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2^{(i)}) \\ \vdots \\ \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n^{(i)}) \end{bmatrix}$
The optimization objective therefore becomes: $\arg\min \frac{1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$
which is exactly $J(\theta) = MSE(y, \hat{y})$.
Another common form is $\arg\min \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$, where the extra factor $\frac{1}{2}$ cancels the 2 that appears after differentiation.
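For example, differentiating the $\frac{1}{2m}$ form with respect to a single parameter $\theta_j$ (with $\hat{y}^{(i)} = X_d^{(i)}\theta$, and $X_0^{(i)} = 1$ for the intercept):
$\frac{\partial}{\partial \theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{2m}\sum_{i=1}^{m}2(y^{(i)} - \hat{y}^{(i)})\cdot(-X_j^{(i)}) = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})\,X_j^{(i)}$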

Batch Gradient Descent

import numpy as np

def J(theta, x_b, y):
    try:
        # compute the loss for a given theta
        return np.sum((y - x_b.dot(theta)) ** 2) / len(x_b)
    except:
        return float('inf')
def dJ(theta, x_b, y):
    # gradient vector, one entry per parameter
    res = np.empty(len(theta))
    res[0] = np.sum(x_b.dot(theta) - y)
    for i in range(1, len(theta)):
        # partial derivative with respect to theta_i
        res[i] = (x_b.dot(theta) - y).dot(x_b[:, i])
    return res * 2 / len(x_b)
def gradient_descent(x_b, y, initial_theta, eta, iters_num=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    while i_iter < iters_num:
        # compute the gradient at the current theta
        gradient = dJ(theta, x_b, y)
        last_theta = theta
        # move against the gradient; eta is the learning rate
        theta = theta - eta * gradient
        i_iter += 1
        # if the loss decreased by less than epsilon between two consecutive
        # steps, we consider the minimum reached
        if abs(J(theta, x_b, y) - J(last_theta, x_b, y)) < epsilon:
            break
    return theta
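A minimal usage sketch of the two functions above on synthetic data (the data-generating line $y = 4 + 3x$ plus noise is an assumption for illustration):

np.random.seed(0)
x = 2 * np.random.random(100)
y = 4. + 3. * x + np.random.normal(size=100)               # true intercept 4, slope 3
x_b = np.hstack([np.ones((len(x), 1)), x.reshape(-1, 1)])
initial_theta = np.zeros(x_b.shape[1])
theta = gradient_descent(x_b, y, initial_theta, eta=0.01)
print(theta)                                               # roughly [4, 3]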
Vectorizing the Gradient

Above, the gradient was computed with an explicit loop over the components of $\theta$. Using a single matrix product is both more concise and much faster.

$\frac{2}{m} \begin{bmatrix} \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1^{(i)}) \\ \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2^{(i)}) \\ \vdots \\ \sum_{i=1}^m (y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n^{(i)}) \end{bmatrix} = \frac{2}{m}\,(X_d^{(1)}\theta - y^{(1)},\ X_d^{(2)}\theta - y^{(2)},\ \dots,\ X_d^{(m)}\theta - y^{(m)}) \cdot \begin{bmatrix} 1 & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)} \\ \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)} \end{bmatrix} = \frac{2}{m}\,(X_d\theta - y)^T X_d$, which, written as a column vector, is $\frac{2}{m}\,X_d^T(X_d\theta - y)$.

# vectorized gradient, wrapped as a function (the same dj_math reused in the debugging section below)
def dj_math(theta, x_b, y):
    return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(y)
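A quick sanity check on arbitrary random data (illustrative only) that the vectorized gradient matches the loop-based dJ defined earlier:

np.random.seed(0)
x_b = np.hstack([np.ones((10, 1)), np.random.random((10, 3))])
y = np.random.random(10)
theta = np.random.random(x_b.shape[1])
print(np.allclose(dJ(theta, x_b, y), dj_math(theta, x_b, y)))   # True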

Stochastic Gradient Descent

So far every gradient computation has used all of the samples. When the sample size is very large this can be very slow. Is there an alternative?
$2\cdot (X_d^{(i)})^T \cdot (X_d^{(i)}\theta - y^{(i)})$
We can take a single sample at a time and use its gradient as the search direction.

A stochastic gradient step does not always move toward a smaller loss; some steps may even move uphill. In practice, however, stochastic gradient descent still ends up in the neighborhood of the minimum.
In stochastic gradient descent the learning rate must shrink as the number of iterations grows; otherwise, even if a step lands at the minimum, the next step may jump right past it.
$\eta = \frac{a}{i\_iters + b}$
Here $a$ and $b$ are hyperparameters and $i\_iters$ is the iteration count. As the number of iterations grows the learning rate decreases, which is the idea behind simulated annealing.
$\eta = \frac{t_0}{i\_iters + t_1}$

t0 = 5
t1 = 50
def learing_rate(t):
    return t0 / (t + t1)
def dJ_Sgd(theta, x_i_b, y_i):
    # gradient estimated from a single sample (x_i_b, y_i)
    return x_i_b.T.dot(x_i_b.dot(theta) - y_i) * 2
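A minimal sketch of a complete SGD loop built on these two helpers (the synthetic data and the number of iterations are illustrative assumptions):

import numpy as np

def sgd(x_b, y, initial_theta, n_iters=10000):
    theta = initial_theta
    for t in range(n_iters):
        i = np.random.randint(len(x_b))          # one random sample per step
        theta = theta - learing_rate(t) * dJ_Sgd(theta, x_b[i], y[i])
    return theta

np.random.seed(0)
x = 2 * np.random.random(1000)
y = 4. + 3. * x + np.random.normal(size=1000)
x_b = np.hstack([np.ones((len(x), 1)), x.reshape(-1, 1)])
print(sgd(x_b, y, np.zeros(x_b.shape[1])))       # roughly [4, 3]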

Debugging the Gradient

(figure: the secant through two points on either side of a point approaches the tangent at that point)
When the two blue points are both moved arbitrarily close to the red point, the slope of the line through them can be used as the slope at the red point: $\frac{y_1 - y_2}{x_1 - x_2}$ (the slope formula).

$x_1 = \theta + \epsilon$
$x_2 = \theta - \epsilon$
$y_1 = f(\theta + \epsilon)$
$y_2 = f(\theta - \epsilon)$

$\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{(\theta + \epsilon) - (\theta - \epsilon)}$
which gives:
$\frac{df}{d\theta} = \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$
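For example, for $f(\theta) = \theta^2$ the exact derivative at $\theta = 3$ is $2\theta = 6$, and the central-difference approximation reproduces it:

def f(theta):
    return theta ** 2
epsilon = 1e-4
theta = 3.0
print((f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon))   # ~6.0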

Now suppose the parameter $\theta$ is a vector with many dimensions:
$\theta = (\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_n)$

$\nabla f = (\frac{\partial f}{\partial \theta_0}, \frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, \frac{\partial f}{\partial \theta_3}, \cdots, \frac{\partial f}{\partial \theta_n})$

To approximate the derivative along the $\theta_0$ dimension:

$\theta_0^+ = (\theta_0 + \epsilon, \theta_1, \theta_2, \theta_3, \cdots, \theta_n)$

$\theta_0^- = (\theta_0 - \epsilon, \theta_1, \theta_2, \theta_3, \cdots, \theta_n)$

$\frac{\partial f}{\partial \theta_0} = \frac{f(\theta_0^+) - f(\theta_0^-)}{2\epsilon}$

The same is done for every other dimension, approximating each partial derivative in turn.

So to approximate $\frac{df}{d\theta}$, add a small value $\epsilon$ to, and subtract it from, each component of $\theta$ in turn, plug the two resulting vectors into
$\frac{df}{d\theta} = \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$,
and then run gradient descent with these approximate derivatives to find the minimum step by step.

def J(theta, x_b, y):
    try:
        return np.sum((y - x_b.dot(theta)) ** 2) / len(x_b)
    except:
        return float('inf')
def dj_math(theta, x_b, y):
    # analytic, vectorized gradient
    return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(y)
def dj_debug(theta, x_b, y, epsilon=0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
        # copy theta twice
        theta_p = theta.copy()
        theta_m = theta.copy()
        # point slightly to the right of theta along dimension i
        theta_p[i] += epsilon
        # point slightly to the left of theta along dimension i
        theta_m[i] -= epsilon
        # by the definition of the derivative, as the two points approach theta
        # the slope between them approaches the partial derivative of J at theta
        res[i] = (J(theta_p, x_b, y) - J(theta_m, x_b, y)) / (2 * epsilon)
    return res
def gradient_descent(dj, x_b, y, initial_theta, eta, iters_num=1e4, epsilon=1e-4):
    theta = initial_theta
    i_iter = 0
    while i_iter < iters_num:
        gradient = dj(theta, x_b, y)
        last_theta = theta
        # keep updating theta
        theta = theta - eta * gradient
        i_iter += 1
        # if the loss changes by less than epsilon between two steps, treat this
        # as a local minimum (the global minimum when J is convex)
        if abs(J(theta, x_b, y) - J(last_theta, x_b, y)) < epsilon:
            break
    return theta
# X, y: the training data (features and targets)
X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = gradient_descent(dj_debug, X_b, y, eta=0.01, initial_theta=initial_theta)
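As a quick check (still assuming X and y are the training data above), the numerical gradient and the analytic gradient should drive gradient descent to essentially the same $\theta$, with dj_debug being far slower:

theta_math = gradient_descent(dj_math, X_b, y, eta=0.01, initial_theta=initial_theta)
print(np.allclose(theta, theta_math))   # True: the two gradients agree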

A LinearRegression Wrapper Class

import numpy as np
from ml_utils.metrics import r_Squared

class LinearRegression:
    def __init__(self):
        """初始化linear Regression模型"""
        self.coef_ = None
        self.interception_ = None
        self._theta = None
    def fit_normal(self,X_train,y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        # the incoming training samples have no intercept column, so prepend a column of ones
        X_b = np.hstack([np.ones((len(X_train),1)),X_train])
        # compute theta directly from the normal equation
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception_=self._theta[0]
        self.coef_ = self._theta[1:]
        return self
    def fit_gd(self,X_train,y_train,eta,iters_num = 1e4,epsilon = 1e-7):
        def J( theta, x_b, y):
            try:
                return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
            except:
                return float('INF')
        def dJ(theta, x_b, y):
            return x_b.T.dot(x_b.dot(theta) -y) * 2./len(y)
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        b = np.ones([X_train.shape[0], 1])
        X_train = np.hstack((b,X_train))
        i_iter = 0
        initial_theta = np.zeros(X_train.shape[1])
        theta = initial_theta
        while i_iter < iters_num:
            gradient = dJ(theta, X_train, y_train)
            last_theta = theta
            theta = theta - eta * gradient
            i_iter += 1
            if (abs(J(theta, X_train, y_train) - J(last_theta, X_train, y_train)) < epsilon):
                break
        self._theta = theta
        self.coef_ = theta[1:]
        self.interception_ = theta[0]
        return self
    def sgd(self,X_train,y_train,epoch_num = 2):
        assert epoch_num >= 1,\
            "epoch must bigger than one."
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_predict must be equal to the size of X_train"
        t0 = 5
        t1 =  50
        X_train = np.hstack((np.ones((X_train.shape[0],1)),X_train))
        init_theta = np.zeros(X_train.shape[1])
        theta = init_theta
        def learing_rate(t):
            return t0 / (t + t1)
        def dJ_Sgd(theta, x_i_b, y_i):
            return  x_i_b.T.dot((x_i_b.dot(theta) - y_i)) * 2
        for cur_epoch in range(epoch_num):
            disRupt_index = np.random.permutation(len(X_train))
            disRupt_X_train = X_train[disRupt_index]
            disRupt_y_train =y_train[disRupt_index]
            random_test = np.random.randint(0,X_train.shape[0],10)
            for i in range(X_train.shape[0]):
                gradient = dJ_Sgd(theta,disRupt_X_train[i],disRupt_y_train[i])
                # decay the learning rate with the total number of updates so far,
                # not just the index within the current epoch
                theta = theta - learing_rate(cur_epoch * X_train.shape[0] + i) * gradient
            self._theta = theta
            self.coef_ = theta[1:]
            self.interception_ = theta[0]
            xt = disRupt_X_train[random_test][:,1:]
            epoch_score = self.score(xt,disRupt_y_train[random_test])
            print("now traing epoch:" + str(cur_epoch) +' loss:' + str(epoch_score))
        return self
    def predict(self,X_predict):
        assert self.coef_ is not None and  self.interception_ is not None, \
            "you must run fit_normal before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the size of X_predict must be equal to the size of coef_"
        X_predict_b = np.hstack([np.ones((len(X_predict),1)), X_predict])
        return X_predict_b.dot(self._theta)

    def score(self, X_test,y_test):
        y_predict = self.predict(X_test)
        return r_Squared(y_test,y_predict)
    def __repr__(self):
        return "LinearRegression()"


  1. $\frac{1}{m}\sum_{i=1}^m(y^{(i)}_{test} - \hat{y}^{(i)}_{test})^2$ ↩︎
