An Introduction to Gradient Descent

Gradient Descent is a useful optimization method in machine learning and deep learning. It is a first-order iterative optimization algorithm for finding the minimum of a function. To understand the gradient, you need some knowledge of mathematical analysis.

So let us start with the definition of the gradient. According to Wikipedia, the gradient is a multi-variable generalization of the derivative. For example, given a function \(f(x,y,z) = x^2 + 2y+1/z\), the gradient of \(f\) is \(\nabla f(x,y,z)=[2x,2,-\frac{1}{z^2}]^T\). At the point \((x_0,y_0,z_0)=(0,1,2)\), the gradient is \(\nabla f(x_0,y_0,z_0)=[0,2,-0.25]^T\). In mathematics, the gradient points in the direction of the greatest rate of increase of the function. Hence, we can approach the minimum by moving in the direction opposite to the gradient.
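As a quick sanity check (this snippet is my own addition, and the helper numerical_gradient is an illustrative name, not a library function), we can approximate the gradient at \((0,1,2)\) with central differences and compare it with the analytical result above.

# numerical check of the gradient example above using central differences
import numpy as np

def f(v):
    # f(x, y, z) = x^2 + 2y + 1/z
    x, y, z = v
    return x**2 + 2*y + 1/z

def numerical_gradient(func, v, eps=1e-6):
    # central-difference approximation of the gradient of func at point v
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (func(v + step) - func(v - step)) / (2*eps)
    return grad

print(numerical_gradient(f, [0.0, 1.0, 2.0]))  # approximately [0, 2, -0.25]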

Gradient Descent

Denote the objective function by \(f(x)\), \(x\in \mathcal{R}^p\). A typical gradient descent iteration can be represented as \[x_{n+1} = x_n - \lambda \nabla f(x_n)\]
Here, \(\lambda\) is called the learning rate. It is vital to choose an appropriate learning rate, since a learning rate that is either too small or too large will lead to poor performance in finding the optimal solution.
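To illustrate the update rule, here is a minimal one-dimensional sketch of my own (not part of the original post) for \(f(x)=x^2\), whose gradient is \(2x\); with a suitably small learning rate, the iterates approach the minimizer \(x=0\).

# minimal sketch of x_{n+1} = x_n - lambda * grad f(x_n) for f(x) = x^2
def grad_f(x):
    return 2*x            # gradient of x^2

x = 5.0                   # initial point
learning_rate = 0.1       # lambda in the formula above
for _ in range(100):
    x = x - learning_rate * grad_f(x)
print(x)                  # close to 0, the minimizer of x^2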

Let us demonstrate gradient descent with a regression example whose independent variables are centered.
The regression model is \[y=\beta_1 x_1+\beta_2 x_2+\varepsilon\]
\(\beta_1\) and \(\beta_2\) are the parameters we want to estimate. And there are in total \(n\) observations.

In this case, from the definition of ordinary least squares, the objective function is \[f(\beta_1,\beta_2) = \sum_{i=1}^{n}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i})^2\]
The gradient is, by the chain rule, \[\nabla f(\beta_1,\beta_2) = \left[-2\sum_{i=1}^n x_{1,i}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i});\ \ -2\sum_{i=1}^n x_{2,i}(y_i-\beta_1 x_{1,i} - \beta_2 x_{2,i})\right]^T\]
The iteration formulas are:
\[\beta_{1,n+1}=\beta_{1,n}-\lambda \times\left(-2\sum_{i=1}^n x_{1,i}(y_i-\beta_{1,n} x_{1,i} - \beta_{2,n} x_{2,i})\right)\]
\[\beta_{2,n+1}=\beta_{2,n}-\lambda \times\left(-2\sum_{i=1}^n x_{2,i}(y_i-\beta_{1,n} x_{1,i} - \beta_{2,n} x_{2,i})\right)\]
The overall steps of gradient descent:

  1. set an initial point \((\beta_{1,1},\beta_{2,1})\)
  2. generate the next point using the formula listed above
  3. iterate step 2
  4. stop when the objective function converges or the specified number of iterations is reached
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate random variables
x1 = np.array(np.random.randn(100))
x2 = np.array(np.random.randn(100))
residual = np.array(np.random.randn(100))
y = 2*x1-x2+residual
def gradient_descent(beta1,beta2,y,x1,x2):
    # gradient of the sum-of-squares objective with respect to (beta1, beta2)
    res = y-beta1*x1-beta2*x2
    return(-2*sum(np.multiply(x1,res)),-2*sum(np.multiply(x2,res)))
def objective_func(beta1,beta2,y,x1,x2):
    # sum of squared residuals
    res = y-beta1*x1-beta2*x2
    return(sum(res**2))
# set the initial point as (0,0)
beta1 = [0]
beta2 = [0]
beta1_iter = 0
beta2_iter = 0
object_value = [objective_func(beta1_iter,beta2_iter,y,x1,x2)]
chg_obj = 1
count = 0
learning_rate = 0.0001
while count==0 or chg_obj>0.000001:
    count += 1
    grad1, grad2 = gradient_descent(beta1_iter,beta2_iter,y,x1,x2)
    beta1_iter = beta1_iter - learning_rate * grad1
    beta2_iter = beta2_iter - learning_rate * grad2
    object_value.append(objective_func(beta1_iter,beta2_iter,y,x1,x2))
    chg_obj = abs(object_value[count] / object_value[count-1] - 1)
    beta1.append(beta1_iter)
    beta2.append(beta2_iter)
# the parameters estimated by Gradient Descent
print(beta1[count],beta2[count])
1.8970883788098543 -1.053406380437433
print('The objective function estimated by Gradient Descent: ',objective_func(beta1[count],beta2[count],y,x1,x2))
The objective function estimated by Gradient Descent:  72.096375419068

Here, we would like to visualize the path taken by gradient descent.

n = 200
xlin = np.linspace(-1, 5, n)
ylin = np.linspace(-4, 2, n)
xlin, ylin = np.meshgrid(xlin, ylin)
obj = np.zeros((n, n))
for i in range(0,n):
    for j in range(0,n):
        obj[i][j] = objective_func(xlin[i][j],ylin[i][j],y,x1,x2)

plt.contourf(xlin, ylin, obj, 20, alpha = 0.75, cmap = 'coolwarm')
plt.plot(beta1,beta2,'b')
plt.show()

[Figure: gradient descent path overlaid on a contour plot of the objective function]

plt.plot(np.arange(0,count+1,10),object_value[0:(count+1):10], ls = '-',marker = 'o')
plt.title('Objective Function Value versus iteration')
plt.xlabel('Iteration')
plt.show()

[Figure: objective function value versus iteration for gradient descent]

Let us compare Gradient Descent with Linear Regression

# This part is Linear Regression estimation
X = pd.concat([pd.DataFrame(data = x1,columns = ['x1']),pd.DataFrame(data = x2, columns = ['x2'])],axis = 1)
Y = pd.DataFrame(data = y,columns = ['y'])
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X,Y)
print(lm.intercept_[0],lm.coef_[0][0],lm.coef_[0][1])
print('\nThe objective function estimated by Linear Regression: ',objective_func(lm.coef_[0][0],lm.coef_[0][1],y,x1,x2))
0.04112020730288102 2.026324766879077 -1.1180692401234138

The objective function estimated by Linear Regression:  92.76335553520987
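As an extra sanity check (my own addition, reusing x1, x2, y, and objective_func defined above), the no-intercept least-squares solution, which minimizes exactly the same objective as gradient descent, can also be computed in closed form with np.linalg.lstsq:

# closed-form least squares without an intercept, minimizing the same objective
A = np.column_stack([x1, x2])
beta_ls, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(beta_ls)  # should be close to the gradient descent estimates
print('Objective at the closed-form solution: ',
      objective_func(beta_ls[0], beta_ls[1], y, x1, x2))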

Stochastic Gradient Descent

For gradient descent, we use all the data to compute the gradient. This raises the concern that it can be quite time-consuming when the objective function is complex and the data set is very large. Therefore, the Stochastic Gradient Descent (hereinafter SGD) algorithm is introduced to speed up the process. In fact, SGD is widely used in machine learning nowadays.

The major difference between SGD and gradient descent is that, in every iteration, gradient descent updates the parameters using the gradient computed over all training samples, whereas SGD uses a randomly chosen batch of samples to compute the gradient. The formulas below show the parameter iteration:
\[x_{n+1} = x_n - \lambda \nabla f_{t_n}(x_n)\]
\[f_{t_n}(x_n) = \sum_{i\in t_n} f_i(x_n),\]
where \(t_n\) is a random subset (mini-batch) of the data and \(f_i\) denotes the contribution of the \(i\)-th observation to the objective function.

The general steps for SGD:

  1. set an initial point \((\beta_{1,1},\beta_{2,1})\)
  2. choose a random batch of samples from all the training data
  3. generate the next point using the formula listed above
  4. iterate steps 2 and 3
  5. stop when the objective function converges

Let us implement SGD using the data from the previous case.

beta1 = [0]
beta2 = [0]
beta1_iter = 0
beta2_iter = 0
object_value = [objective_func(beta1_iter,beta2_iter,y,x1,x2)]
chg_obj = 1
count = 0
learning_rate = 0.0001
while count==0 or chg_obj>0.000001:
    count += 1
    flag = np.random.randint(0, 100, size=30)
    grad1, grad2 = gradient_descent(beta1_iter,beta2_iter,y[flag],x1[flag],x2[flag])
    beta1_iter = beta1_iter - learning_rate * grad1
    beta2_iter = beta2_iter - learning_rate * grad2
    object_value.append(objective_func(beta1_iter,beta2_iter,y,x1,x2))
    chg_obj = abs(object_value[count] / object_value[count-1] - 1)
    beta1.append(beta1_iter)
    beta2.append(beta2_iter)
# the parameters estimated by Stochastic Gradient Descent
print(beta1[count],beta2[count])
1.8763735882460528 -0.9968233184396047
print('The objective function estimated by Stochastic Gradient Descent: ',objective_func(beta1[count],beta2[count],y,x1,x2))
The objective function estimated by Stochastic Gradient Descent:  72.4169302773315
plt.contourf(xlin, ylin, obj, 20, alpha = 0.75, cmap = 'coolwarm')
plt.plot(beta1[0:(count+1):50],beta2[0:(count+1):50],'b')
plt.show()

[Figure: SGD path overlaid on a contour plot of the objective function]

plt.plot(np.arange(0,count+1,10),object_value[0:(count+1):10], ls = '-',marker = 'o')
plt.title('Objective Function Value versus iteration')
plt.xlabel('Iteration')
plt.show()

[Figure: objective function value versus iteration for SGD]

So far, we have only introduced the basic gradient descent algorithms. There are other extensions of gradient descent, such as AdaGrad and Momentum, which I will introduce in the future.

References

  1. Wikipedia, Gradient [https://en.wikipedia.org/wiki/Gradient#Definition]
  2. Wikipedia, Gradient descent [https://en.wikipedia.org/wiki/Gradient_descent]
  3. Slides from NUS Dept. of Statistics, ST 4240 (2015), lectured by Alexandre Hoang THIERY
  4. Large-Scale Machine Learning with Stochastic Gradient Descent [http://leon.bottou.org/publications/pdf/compstat-2010.pdf]

Reposted from: https://www.cnblogs.com/PeterShengShijie/p/9243120.html
