Gradient Descent Algorithm
Core Formula
$$\omega = \omega - \alpha\frac{\partial cost}{\partial\omega}$$
Understanding the Formula
- $\alpha$ is the learning rate
- $\frac{\partial cost}{\partial\omega}$ is the slope of the cost at the current point; it enters the formula with a negative sign, meaning each update moves $\omega$ in the direction that decreases the cost (a one-step numeric sketch follows)
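A one-step sketch of the update rule in Python (the slope value here is just an assumed number for illustration):

```python
w = 1.0
alpha = 0.01  # learning rate
slope = -2.0  # assumed value of d(cost)/dw at the current w
w = w - alpha * slope  # step against the slope, toward lower cost
print(w)  # 1.02
```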
Formula Derivation
Derivation steps
$$\begin{aligned}
\frac{\partial cost(\omega)}{\partial\omega} &= \frac{\partial}{\partial\omega}\frac{1}{N}\sum_{n=1}^{N}(x_n\cdot\omega-y_n)^2 \\
&= \frac{1}{N}\sum_{n=1}^{N}\frac{\partial}{\partial\omega}(x_n\cdot\omega-y_n)^2 \\
&= \frac{1}{N}\sum_{n=1}^{N}2\cdot(x_n\cdot\omega-y_n)\cdot\frac{\partial(x_n\cdot\omega-y_n)}{\partial\omega} \\
&= \frac{1}{N}\sum_{n=1}^{N}2\cdot x_n\cdot(x_n\cdot\omega-y_n)
\end{aligned}$$
Result
$$\omega = \omega - \alpha\cdot\frac{1}{N}\sum_{n=1}^{N}2\cdot x_n\cdot(x_n\cdot\omega-y_n)$$
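As a quick sanity check with the toy dataset used in the code below ($x = [1, 2, 3]$, $y = [2, 4, 6]$, starting from $\omega = 1$):

$$\frac{\partial cost(1)}{\partial\omega} = \frac{1}{3}\bigl(2\cdot1\cdot(1-2) + 2\cdot2\cdot(2-4) + 2\cdot3\cdot(3-6)\bigr) = -\frac{28}{3} \approx -9.33$$

so the first update with $\alpha = 0.01$ gives $\omega \leftarrow 1 - 0.01\cdot(-9.33) \approx 1.093$.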
Several Problems with Gradient Descent
- It may converge to a local minimum rather than the global one, although local minima are relatively rare in real networks.
- When the cost surface has a saddle point, the iteration stalls there, because the slope at a saddle point is 0, i.e. $\frac{\partial cost}{\partial\omega} = 0$, so the update leaves $\omega$ unchanged (a minimal demo follows this list).
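A minimal illustration of the stall in Python (the cubic cost is a made-up example whose slope is exactly 0 at $w = 0$):

```python
# Hypothetical 1-D cost with a zero-slope point at w = 0: cost(w) = w ** 3
def grad(w):
    return 3 * w ** 2  # derivative of w ** 3

w = 0.0  # start exactly at the zero-slope point
for _ in range(100):
    w -= 0.01 * grad(w)  # grad(0.0) == 0, so w never moves
print(w)  # still 0.0
```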
Gradient Descent Code
```python
import matplotlib.pyplot as plt

x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]
w = 1.0
Epoch = []
Cost = []

def forward(x):
    return x * w

# The cost, i.e. the MSE over the whole dataset
def cost(xs, ys):
    cost = 0
    for x, y in zip(xs, ys):
        y_pred = forward(x)
        cost += (y_pred - y) ** 2
    return cost / len(xs)

# The gradient, averaged over the whole dataset
def gradient(xs, ys):
    grad = 0
    for x, y in zip(xs, ys):
        grad += 2 * x * (x * w - y)
    return grad / len(xs)

print('Predict (before training)', 4, forward(4))  # predict y at x = 4 with the initial w = 1.0
for epoch in range(100):
    cost_val = cost(x_data, y_data)
    grad_val = gradient(x_data, y_data)
    w -= 0.01 * grad_val  # 0.01 is the learning rate alpha
    Epoch.append(epoch)
    Cost.append(cost_val)
    print('Epoch:', epoch, 'w=', w, ' loss=', cost_val)
print('Predict (after training)', 4, forward(4))  # predict y at x = 4 with the converged w

# Plot cost against epoch
plt.xticks(range(0, 101, 10))
plt.yticks(range(0, 6))
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.grid()
plt.plot(Epoch, Cost)
plt.show()
```
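For reference, the averaged gradient above can be written as a single vectorized expression; a sketch assuming NumPy is available (it mirrors the loop in gradient() exactly):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = 1.0
for epoch in range(100):
    grad = (2 * x * (x * w - y)).mean()  # same averaged gradient, no Python loop
    w -= 0.01 * grad
print(w)  # approaches 2.0
```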
Stochastic Gradient Descent
- Plain gradient descent uses the cost over all samples, i.e. an average over the whole dataset, while stochastic gradient descent uses the loss of a single randomly chosen sample.
- The random noise this introduces can effectively get the iteration past saddle points.
Formula
$$\omega = \omega - \alpha\frac{\partial loss}{\partial\omega}$$

$$\frac{\partial loss_n}{\partial\omega} = 2\cdot x_n\cdot(x_n\cdot\omega-y_n)$$
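For example, with the first sample $(x_1, y_1) = (1, 2)$ of the toy dataset and $\omega = 1$: $\frac{\partial loss_1}{\partial\omega} = 2\cdot1\cdot(1\cdot1-2) = -2$, so one update gives $\omega \leftarrow 1 - 0.01\cdot(-2) = 1.02$.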
Stochastic Gradient Descent Code (iterating in order, not actually "random")
```python
import matplotlib.pyplot as plt

x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]
w = 1.0
Epoch = []
Loss = []

def forward(x):
    return x * w

# Loss of a single sample
def loss(x, y):
    y_pred = forward(x)
    return (y_pred - y) ** 2

# Gradient of a single sample
def gradient(x, y):
    return 2 * x * (x * w - y)

print('Predict (before training)', 4, forward(4))
for epoch in range(100):
    for x, y in zip(x_data, y_data):  # walk through the samples in order
        grad = gradient(x, y)
        w -= 0.01 * grad  # update w after every single sample
        print('\tgrad: ', x, y, grad)
        l = loss(x, y)
    Epoch.append(epoch)
    Loss.append(l)
    print('Epoch:', epoch, 'w=', w, ' loss=', l)
print('Predict (after training)', 4, forward(4))

plt.xticks(range(0, 101, 10))
plt.yticks(range(0, 6))
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid()
plt.plot(Epoch, Loss)
plt.show()
```
Stochastic Gradient Descent Code (truly random)
```python
import matplotlib.pyplot as plt
import random

x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]
w = 1.0
Epoch = []
Loss = []

def forward(x):
    return x * w

# Loss of a single sample
def loss(x, y):
    y_pred = forward(x)
    return (y_pred - y) ** 2

# Gradient of a single sample
def gradient(x, y):
    return 2 * x * (x * w - y)

print('Predict (before training)', 4, forward(4))
for epoch in range(100):
    rc = random.randrange(0, 3)  # pick a random sample index
    x = x_data[rc]
    y = y_data[rc]
    grad = gradient(x, y)
    w -= 0.01 * grad
    print('\tgrad: ', x, y, grad)
    l = loss(x, y)
    Epoch.append(epoch)
    Loss.append(l)
    print('Epoch:', epoch, 'w=', w, ' loss=', l)
print('Predict (after training)', 4, forward(4))

plt.xticks(range(0, 101, 10))
plt.yticks(range(0, 6))
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Stochastic Gradient Descent\ny = x * w, w = 1.0, a = 0.01')
plt.grid()
plt.plot(Epoch, Loss)
plt.show()
```
Comparing Gradient Descent and Stochastic Gradient Descent
- In gradient descent, each update of $w$ is computed from the average cost over all samples, so within one step the per-sample terms $f(x_i)$ and $f(x_{i+1})$ do not affect each other; they can be computed in parallel, which makes it efficient (see the sketch after this list).
- In stochastic gradient descent, how much $w$ changes depends on the current sample, i.e. on the $w$ produced by the previous update, so the updates cannot simply be split $n$ ways and parallelized the way gradient descent can; it performs better (e.g. at escaping saddle points) but has a higher time cost.
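A sketch of this dependency argument in code (illustrative only, same toy dataset): in one batch update every per-sample term reads the same $w$, whereas SGD's updates form a chain.

```python
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]
w = 1.0

# Batch GD: every term reads the same w, so the terms could be
# computed in parallel before being averaged into a single update.
terms = [2 * x * (x * w - y) for x, y in zip(x_data, y_data)]
w -= 0.01 * sum(terms) / len(terms)

# SGD: each update reads the w written by the previous one,
# so these iterations cannot be reordered or parallelized.
for x, y in zip(x_data, y_data):
    w -= 0.01 * 2 * x * (x * w - y)
```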
A Compromise
- Use a Batch: batched stochastic gradient descent. Split the samples into groups and use one group at a time to compute the gradient (a minimal sketch follows this list).
- What is called "stochastic gradient descent" in practice defaults to this Mini-Batch approach.
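A minimal mini-batch sketch under the same linear model (the 4-sample dataset and batch_size = 2 are made up so the data splits evenly):

```python
import random

x_data = [1.0, 2.0, 3.0, 4.0]  # hypothetical data, for illustration only
y_data = [2.0, 4.0, 6.0, 8.0]
w = 1.0
batch_size = 2

for epoch in range(100):
    idx = list(range(len(x_data)))
    random.shuffle(idx)  # a new random grouping each epoch
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        # average the gradient over this mini-batch only
        grad = sum(2 * x_data[i] * (x_data[i] * w - y_data[i]) for i in batch) / len(batch)
        w -= 0.01 * grad  # one update per mini-batch
print(w)  # approaches 2.0
```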