0 Introduction
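Problem solving in many areas of artificial intelligence is, in essence, a search process. Search techniques fall into global search and local search, and gradient descent (GD) is a local search algorithm based on first-order derivatives. Because it is simple and easy to use, gradient descent has become one of the core algorithms of deep learning.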
1 Mathematical derivation of gradient descent (GD)
Gradient descent can be derived from the Taylor expansion.
Suppose $f(x)$ can be expanded in a Taylor series at every point of the real line. Its expansion at $x_{t}$ is:
$$
f(x)=\sum\limits_{k=0}^{\infty}\frac{f^{(k)}(x_{t})}{k!}(x-x_{t})^{k}\tag{1}
$$
Dropping the terms of second order and higher ($k\ge 2$) gives:
$$
f(x)\approx f(x_{t})+f'(x_{t})(x-x_{t})\tag{2}
$$
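To get a feel for how good the first-order approximation is, the following minimal sketch compares a function with its linearization around $x_{t}$. The choice $f=\sin$, the expansion point 0.5, and the helper name `linearize` are illustrative assumptions, not part of the derivation.

```python
import math

def linearize(f, df, x_t):
    """Return the first-order Taylor approximation of f around x_t."""
    return lambda x: f(x_t) + df(x_t) * (x - x_t)

# Illustrative choice: f = sin, f' = cos, expansion point x_t = 0.5
f_lin = linearize(math.sin, math.cos, 0.5)

errors = []
for h in (0.1, 0.01, 0.001):
    x = 0.5 + h
    errors.append(abs(math.sin(x) - f_lin(x)))
    print(f"h = {h:<6} |f(x) - linearization| = {errors[-1]:.3e}")
```

Each tenfold reduction of the distance $h$ shrinks the error by roughly a factor of one hundred, which is why the approximation is only trusted for small steps.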
Moving from the continuous to the discrete setting, let $x=x_{t+1}$ and substitute into (2). Treating the approximation as an equality, which is reasonable when $x_{t+1}$ is close to $x_{t}$, gives:
$$
f(x_{t+1})=f(x_{t})+f'(x_{t})(x_{t+1}-x_{t})\tag{3}
$$
Generalizing (3) to a vector argument $X$, with the product read as an inner product, gives:
$$
f(X_{t+1})=f(X_{t})+(X_{t+1}-X_{t})^{\top}\nabla f(X_{t})\tag{4}
$$
Gradient descent requires every step to decrease the objective function, so $f(X_{t+1})<f(X_{t})$. It follows that $(X_{t+1}-X_{t})^{\top}\nabla f(X_{t})<0$, so the angle between the vectors $(X_{t+1}-X_{t})$ and $\nabla f(X_{t})$ exceeds $90^{\circ}$. To make $f(X_{t+1})-f(X_{t})$ as small as possible for a given step length, the angle between $(X_{t+1}-X_{t})$ and $\nabla f(X_{t})$ should be the straight angle $180^{\circ}$, that is, the two vectors point in opposite directions.
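The direction argument can be checked numerically. The sketch below scans unit directions in the plane in one-degree steps and looks for the one with the smallest inner product with a gradient vector. The gradient $(3, 4)$ is an arbitrary illustrative value.

```python
import math

grad = (3.0, 4.0)  # illustrative gradient vector, not taken from the text

best_angle, best_dot = None, float("inf")
for k in range(360):
    theta = math.radians(k)
    d = (math.cos(theta), math.sin(theta))   # unit direction at angle theta
    dot = d[0] * grad[0] + d[1] * grad[1]    # first-order change of f per unit step
    if dot < best_dot:
        best_angle, best_dot = theta, dot

best_dir = (math.cos(best_angle), math.sin(best_angle))
norm = math.hypot(*grad)
neg_unit_grad = (-grad[0] / norm, -grad[1] / norm)
print("best direction:", best_dir)
print("-grad/|grad|  :", neg_unit_grad)
```

Up to the one-degree resolution of the scan, the best direction coincides with the normalized negative gradient.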
Because the two vectors point in opposite directions:
$$
X_{t+1}-X_{t}=-\gamma\nabla f(X_{t})\tag{5}
$$
where $\gamma$ is a positive number, called the learning rate in machine learning, which controls the size of each descent step. Applying (5) repeatedly updates the parameter $X$; this procedure is the gradient descent algorithm.
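As a minimal sketch of update rule (5), the loop below minimizes a simple two-variable quadratic. The objective, the starting point, the learning rate, and the function name `gradient_descent` are all illustrative assumptions.

```python
def gradient_descent(grad_f, x0, lr=0.1, n_steps=100):
    """Iterate X_{t+1} = X_t - lr * grad_f(X_t), starting from x0."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad_f(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Illustrative objective: f(x, y) = (x - 1)^2 + 2 * (y + 3)^2, minimum at (1, -3)
def grad_f(p):
    x, y = p
    return [2.0 * (x - 1.0), 4.0 * (y + 3.0)]

x_min = gradient_descent(grad_f, x0=[0.0, 0.0], lr=0.1, n_steps=100)
print(x_min)
```

With this learning rate the iterates approach the minimizer $(1, -3)$; a learning rate that is too large for the curvature of the objective would make them diverge instead.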
2 Fitting a function with gradient descent
Problem setup: use gradient descent to fit the function $g(x)=\sin(x)$ with the cubic polynomial $f(x)=ax^{3}+bx^{2}+cx+d$. Take $N=2000$ equally spaced points on $g(x)$ as samples (the code below uses the interval $[-\pi,\pi]$) and define the loss function as follows:
$$
L(a,b,c,d)=\frac{1}{N}\sum\limits_{n=1}^{N}[f(x_{n})-g(x_{n})]^{2}\tag{6}
$$
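A minimal sketch of loss (6) in plain Python follows. The function names `cubic` and `mse_loss` are hypothetical, and the sampling interval $[-\pi,\pi]$ is taken from the NumPy listing further down rather than from the formula itself.

```python
import math

def cubic(x, a, b, c, d):
    """f(x) = a x^3 + b x^2 + c x + d, as defined in the text."""
    return a * x**3 + b * x**2 + c * x + d

def mse_loss(params, xs):
    """Equation (6): mean squared difference between f and g = sin over the samples."""
    a, b, c, d = params
    return sum((cubic(x, a, b, c, d) - math.sin(x)) ** 2 for x in xs) / len(xs)

# N = 2000 equally spaced samples on [-pi, pi] (interval assumed, see lead-in)
N = 2000
xs = [-math.pi + 2.0 * math.pi * n / (N - 1) for n in range(N)]

# With all coefficients zero, f = 0 and the loss is the mean of sin^2 over the samples
loss_zero = mse_loss((0.0, 0.0, 0.0, 0.0), xs)
print(loss_zero)  # close to 0.5
```

The value for the all-zero polynomial gives a baseline: any useful fit should end up well below it.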
The fitting problem then becomes solving:
$$
\min_{a,b,c,d}\ L(a,b,c,d)\tag{7}
$$
Applying gradient descent gives the following vectorized update rule for $a,b,c,d$:
$$
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}_{t+1}
=\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}_{t}
-\gamma
\begin{pmatrix}
\frac{\partial L}{\partial a} \\[4pt]
\frac{\partial L}{\partial b} \\[4pt]
\frac{\partial L}{\partial c} \\[4pt]
\frac{\partial L}{\partial d}
\end{pmatrix}\tag{8}
$$
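The partial derivatives in (8) follow from the chain rule applied to (6); for example, $\frac{\partial L}{\partial a}=\frac{2}{N}\sum_{n}[f(x_{n})-g(x_{n})]\,x_{n}^{3}$, with $x_{n}^{2}$, $x_{n}$, and $1$ in place of $x_{n}^{3}$ for $b$, $c$, and $d$. The sketch below checks these formulas against central finite differences. The function names, the evaluation point, and the reduced sample count are illustrative assumptions.

```python
import math

def loss(params, xs):
    """Equation (6) with f(x) = a x^3 + b x^2 + c x + d and g(x) = sin(x)."""
    a, b, c, d = params
    return sum((a * x**3 + b * x**2 + c * x + d - math.sin(x)) ** 2 for x in xs) / len(xs)

def analytic_grad(params, xs):
    """Partial derivatives of (6) with respect to a, b, c, d via the chain rule."""
    a, b, c, d = params
    grad = [0.0, 0.0, 0.0, 0.0]
    for x in xs:
        r = 2.0 * (a * x**3 + b * x**2 + c * x + d - math.sin(x)) / len(xs)
        for i, power in enumerate((3, 2, 1, 0)):
            grad[i] += r * x**power
    return grad

def numeric_grad(params, xs, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for i in range(len(params)):
        hi = list(params)
        lo = list(params)
        hi[i] += eps
        lo[i] -= eps
        grad.append((loss(hi, xs) - loss(lo, xs)) / (2.0 * eps))
    return grad

# 200 samples on [-pi, pi] keep the check fast; the point below is arbitrary
xs = [-math.pi + 2.0 * math.pi * n / 199 for n in range(200)]
params = [0.3, -0.2, 0.5, 0.1]
g_analytic = analytic_grad(params, xs)
g_numeric = numeric_grad(params, xs)
print(g_analytic)
print(g_numeric)
```

Agreement between the two lists is a quick sanity check worth running before trusting a hand-written gradient inside a training loop.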
2.1 Gradient descent with NumPy
def numpy_GD():
    import numpy as np
    import math
    import matplotlib.pyplot as plt
    import time

    # Sample the target: 2000 equally spaced points of sin(x) on [-pi, pi]
    x = np.linspace(-math.pi, math.pi, 2000)
    y = np.sin(x)

    # Randomly initialize weights
    a = np.random.randn()
    b = np.random.randn()
    c = np.random.randn()
    d = np.random.randn()

    learning_rate = 1e-6
    t0 = time.time()
    for t in range(2000):
        # Forward pass: compute predicted y
        # y = a + b x + c x^2 + d x^3 (here a is the constant term and d the cubic one)
        y_pred = a + b * x + c * x ** 2 + d * x ** 3

        # Compute and print the loss: summed squared error, i.e. N times equation (6)
        loss = np.square(y_pred - y).sum()
        if t % 100 == 99:
            print(t, loss)

        # Backprop: gradients of the loss with respect to a, b, c, d
        grad_y_pred = 2.0 * (y_pred - y)
        grad_a = grad_y_pred.sum()
        grad_b = (grad_y_pred * x).sum()
        grad_c = (grad_y_pred * x ** 2).sum()
        grad_d = (grad_y_pred * x ** 3).sum()

        # Update weights
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
        c -= learning_rate * grad_c
        d -= learning_rate * grad_d
    t1 = time.time()
    print("Elapsed time: %.2fs" % (t1 - t0))
    print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')

    # Plot the target curve against the fitted polynomial
    plt.plot(x, y, label='real')
    plt.plot(x, y_pred, label='pred')
    plt.legend(loc='best')
    plt.grid(True)
    plt.show()
Running on the CPU took 0.33 s.
Result:
2.2 Gradient descent with PyTorch
def torch_GD():
    import torch
    import math
    import matplotlib.pyplot as plt
    import time

    dtype = torch.float
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(device)

    # Sample the target: 2000 equally spaced points of sin(x) on [-pi, pi]
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
    y = torch.sin(x)

    # Scalar weights tracked by autograd
    a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
    b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
    c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
    d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

    learning_rate = 1e-6
    t0 = time.time()
    for t in range(2000):
        # Forward pass
        y_pred = a + b * x + c * x ** 2 + d * x ** 3
        loss = (y_pred - y).pow(2).sum()
        if t % 100 == 99:
            print(t, loss.item())

        # Backward pass: autograd fills a.grad, b.grad, c.grad, d.grad
        loss.backward()

        # Update weights without recording the update in the graph
        with torch.no_grad():
            a -= learning_rate * a.grad
            b -= learning_rate * b.grad
            c -= learning_rate * c.grad
            d -= learning_rate * d.grad
            # Reset gradients so they do not accumulate across iterations
            a.grad = None
            b.grad = None
            c.grad = None
            d.grad = None
    t1 = time.time()
    print("Elapsed time: %.2fs" % (t1 - t0))
    print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

    # Move tensors to the CPU and detach them from the graph before plotting
    plt.plot(x.detach().cpu().numpy(), y.detach().cpu().numpy(), label='real')
    plt.plot(x.detach().cpu().numpy(), y_pred.detach().cpu().numpy(), label='pred')
    plt.grid(True)
    plt.legend(loc='best')
    plt.show()
Running on the GPU took 1.02 s.
Result:
The loss curves of the hand-written GD and the PyTorch autograd-based GD are visualized below.
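A loss curve of this kind can be obtained by recording the loss at every iteration. The sketch below does so for a plain-Python variant of the hand-written GD. The function name `fit_cubic`, the zero initialization, and the reduced problem size are assumptions made to keep the example deterministic and fast; it is not the code that produced the original figure.

```python
import math

def fit_cubic(n_points=2000, n_steps=2000, lr=1e-6):
    """Hand-written GD for y = a + b x + c x^2 + d x^3 against sin(x).

    Returns the coefficients and the loss recorded at every iteration.
    """
    xs = [-math.pi + 2.0 * math.pi * n / (n_points - 1) for n in range(n_points)]
    ys = [math.sin(x) for x in xs]
    w = [0.0, 0.0, 0.0, 0.0]  # deterministic start (the listings above use random init)
    history = []
    for _ in range(n_steps):
        grad = [0.0, 0.0, 0.0, 0.0]
        loss = 0.0
        for x, y in zip(xs, ys):
            err = w[0] + w[1] * x + w[2] * x**2 + w[3] * x**3 - y
            loss += err * err  # summed squared error, as in the listings above
            for i in range(4):
                grad[i] += 2.0 * err * x**i
        history.append(loss)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, history

# Smaller problem so that pure Python finishes quickly. The learning rate is scaled
# up by the factor that n_points is scaled down, because the summed loss and its
# gradient grow with the number of samples.
coeffs, losses = fit_cubic(n_points=200, n_steps=500, lr=1e-5)
print(losses[0], losses[-1])
```

The returned `losses` list can be passed to `matplotlib.pyplot.plot` to draw the curve, and the same recording step can be added to the PyTorch loop by appending `loss.item()` at each iteration.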