Neural Network Parameter Optimizers
Optimizers are the methods that guide how the network parameters are updated during training. All of the optimizers below fit one template: compute the gradient g_t = ∂loss/∂W_t, build a first-order momentum m_t and a second-order momentum v_t from it, and update W_{t+1} = W_t - η_t with η_t = lr · m_t / √v_t.
1. SGD
Plain SGD, without momentum:
W_{t+1}=W_t-lr\cdot\frac{\partial loss}{\partial W_t}
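In the manual training loops used later in these notes, this is just a direct assign_sub on each variable. A minimal sketch (assuming w1 and b1 are tf.Variable and grads was computed with tf.GradientTape, as in the later snippets):

# Vanilla SGD: step straight down the gradient
w1.assign_sub(lr * grads[0])  # w1 <- w1 - lr * d(loss)/d(w1)
b1.assign_sub(lr * grads[1])  # b1 <- b1 - lr * d(loss)/d(b1)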
2. SGDM
SGD with momentum: adds first-order momentum on top of SGD.
β is usually set to 0.9.
m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot g_t\;,\quad v_t=1
\eta_t=lr\cdot\frac{m_t}{\sqrt{v_t}}=lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)
W_{t+1}=W_t-\eta_t=W_t-lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)
# SGDM update: accumulate first-order momentum, then step with it
m_w = beta * m_w + (1 - beta) * grads[0]
m_b = beta * m_b + (1 - beta) * grads[1]
w1.assign_sub(lr * m_w)
b1.assign_sub(lr * m_b)
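The snippet assumes the momentum state and β were already created before the training loop; a sketch of that initialization (not part of the original snippet):

m_w, m_b = 0, 0  # first-order momentum accumulators, one per variable
beta = 0.9       # momentum coefficient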
3. Adagrad
Adds second-order momentum; g_t denotes the gradient.
m_t=g_t\;,\quad v_t=\sum_{\tau=1}^{t}g_\tau^2
\eta_t=lr\cdot\frac{m_t}{\sqrt{v_t}}=lr\cdot\frac{g_t}{\sqrt{\sum_{\tau=1}^{t}g_\tau^2}}
W_{t+1}=W_t-\eta_t=W_t-lr\cdot\frac{g_t}{\sqrt{\sum_{\tau=1}^{t}g_\tau^2}}
# Adagrad update: accumulate squared gradients, scale the step by their square root
v_w += tf.square(grads[0])
v_b += tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
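Here v_w and v_b start at zero and only ever grow, so the effective step size lr / √v_t shrinks monotonically over training. A sketch of the initialization assumed before the loop (not in the original snippet):

v_w, v_b = 0, 0  # cumulative sums of squared gradients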
4. RMSProp
Adds second-order momentum as an exponential moving average of the squared gradients.
m_t=g_t\;,\quad v_t=\beta\cdot v_{t-1}+(1-\beta)\cdot g_t^2
\eta_t=lr\cdot\frac{m_t}{\sqrt{v_t}}=lr\cdot\frac{g_t}{\sqrt{\beta\cdot v_{t-1}+(1-\beta)\cdot g_t^2}}
W_{t+1}=W_t-\eta_t=W_t-lr\cdot\frac{g_t}{\sqrt{\beta\cdot v_{t-1}+(1-\beta)\cdot g_t^2}}
# RMSProp update: exponential moving average of squared gradients
v_w = beta * v_w + (1 - beta) * tf.square(grads[0])
v_b = beta * v_b + (1 - beta) * tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
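In practice a small constant is usually added next to the square root to avoid dividing by zero in the first steps; a hedged variant of the update above (the epsilon value is an assumption, not from the original notes):

eps = 1e-7  # numerical-stability constant (assumed value)
w1.assign_sub(lr * grads[0] / (tf.sqrt(v_w) + eps))
b1.assign_sub(lr * grads[1] / (tf.sqrt(v_b) + eps))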
5. Adam
Combines SGDM's first-order momentum with RMSProp's second-order momentum.
m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot g_t
Bias correction of the first-order momentum: \hat{m}_t=\frac{m_t}{1-\beta_1^{\,t}}
v_t=\beta_2\cdot v_{t-1}+(1-\beta_2)\cdot g_t^2
Bias correction of the second-order momentum: \hat{v}_t=\frac{v_t}{1-\beta_2^{\,t}}
\eta_t=lr\cdot\frac{\hat{m}_t}{\sqrt{\hat{v}_t}}
W_{t+1}=W_t-\eta_t=W_t-lr\cdot\frac{\hat{m}_t}{\sqrt{\hat{v}_t}}
# Learning rate and containers for recording training curves
lr = 0.1
train_loss_results = []
test_acc = []
epoch = 500
loss_all = 0
# Adam optimizer state
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1 = 0.9
beta2 = 0.999
delta_w, delta_b = 0, 0
global_step = 0
# Training: the outer loop runs over the whole dataset (epochs), the inner loop over one batch at a time
for epoch in range(epoch):
    for step, (x_train, y_train) in enumerate(train_db):
        # Count optimization steps (used for the bias correction)
        global_step += 1
        with tf.GradientTape() as tape:
            y = tf.matmul(x_train, w1) + b1
            y = tf.nn.softmax(y)
            y_ = tf.one_hot(y_train, depth=3)
            loss = tf.reduce_mean(tf.square(y_ - y))
            loss_all += loss.numpy()
        grads = tape.gradient(loss, [w1, b1])
        # Adam update: first- and second-order momenta with bias correction
        m_w = beta1 * m_w + (1 - beta1) * grads[0]
        m_b = beta1 * m_b + (1 - beta1) * grads[1]
        v_w = beta2 * v_w + (1 - beta2) * tf.square(grads[0])
        v_b = beta2 * v_b + (1 - beta2) * tf.square(grads[1])
        m_w_correction = m_w / (1 - tf.pow(beta1, int(global_step)))
        m_b_correction = m_b / (1 - tf.pow(beta1, int(global_step)))
        v_w_correction = v_w / (1 - tf.pow(beta2, int(global_step)))
        v_b_correction = v_b / (1 - tf.pow(beta2, int(global_step)))
        w1.assign_sub(lr * m_w_correction / tf.sqrt(v_w_correction))
        b1.assign_sub(lr * m_b_correction / tf.sqrt(v_b_correction))
    # Print the mean loss once per epoch (loss_all is summed over the 4 batches)
    print("Epoch {}, loss: {}".format(epoch, loss_all / 4))
    train_loss_results.append(loss_all / 4)
    loss_all = 0
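For comparison, the same Adam update rule is available as a built-in Keras optimizer, which also handles the bias correction internally. A minimal sketch (assuming the same w1, b1, and train_db as above; illustrative only, not part of the original code):

import tensorflow as tf

# Built-in Adam with the same hyperparameters as the manual loop above
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1, beta_1=0.9, beta_2=0.999)
for x_train, y_train in train_db:
    with tf.GradientTape() as tape:
        y = tf.nn.softmax(tf.matmul(x_train, w1) + b1)
        loss = tf.reduce_mean(tf.square(tf.one_hot(y_train, depth=3) - y))
    grads = tape.gradient(loss, [w1, b1])
    optimizer.apply_gradients(zip(grads, [w1, b1]))  # applies the Adam rule to w1 and b1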