For a binary classification problem, the cross-entropy loss function is

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\ln\hat{y}_i+(1-y_i)\ln(1-\hat{y}_i)\right]$$

The model's output is expressed as
$$y=\beta_0+\beta_1x+\beta_2x^2+\cdots+\beta_nx^n+\varepsilon$$

We add a penalty (constraint) term to the loss function, so that minimizing the loss also minimizes this term.
Here $\theta$ denotes the parameters of $y$, i.e., the $\beta$ coefficients. The modified loss function is
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\ln\hat{y}_i+(1-y_i)\ln(1-\hat{y}_i)\right]+\lambda\sum_{i=1}^n|\theta_i|$$

The penalty term can take the L1-norm form shown above, or the L2-norm form
$$\frac{\lambda}{2}\sum_{i=1}^{n}\|\theta\|_2^2$$
```python
for step, (x, y) in enumerate(db):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.losses.categorical_crossentropy(y_onehot, out, from_logits=True))
        # L2 penalty: sum tf.nn.l2_loss over all trainable variables
        loss_regularization = []
        for p in network.trainable_variables:
            loss_regularization.append(tf.nn.l2_loss(p))
        loss_regularization = tf.reduce_sum(tf.stack(loss_regularization))
        loss = loss + 0.001 * loss_regularization
    grads = tape.gradient(loss, network.trainable_variables)
    optimizer.apply_gradients(zip(grads, network.trainable_variables))
```
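Independent of the TensorFlow snippet above, the two penalty forms can be computed directly from the formulas; a minimal NumPy sketch, where the arrays, `penalized_loss` helper, and λ = 0.01 are all hypothetical:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy averaged over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def penalized_loss(y, y_hat, theta, lam=0.01, norm="l2"):
    """Cross-entropy plus an L1 or L2 penalty on the parameters theta."""
    if norm == "l1":
        penalty = lam * np.sum(np.abs(theta))     # lambda * sum |theta_i|
    else:
        penalty = lam / 2 * np.sum(theta ** 2)    # (lambda / 2) * sum theta_i^2
    return cross_entropy(y, y_hat) + penalty

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8])
theta = np.array([0.5, -1.0, 2.0])
print(penalized_loss(y, y_hat, theta, norm="l1"))
print(penalized_loss(y, y_hat, theta, norm="l2"))
```

Note that either penalty only ever increases the loss, so minimizing the penalized objective trades data fit against parameter size.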
3.3 Momentum and Learning Rate
Momentum
The previous parameter update rule was

$$w^{k+1}=w^{k}-\alpha \nabla f(w^k)$$

Adding a momentum term gives
$$z^{k+1}=\beta z^{k}+\nabla f(w^k)$$

and the parameter update then becomes
$$w^{k+1}=w^{k}-\alpha z^{k+1}$$
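The two-step momentum update above can be sketched on a toy objective; here $f(w)=w^2$ (gradient $2w$), and the choices $\alpha=0.1$, $\beta=0.9$ are hypothetical:

```python
def grad_f(w):
    # gradient of the toy objective f(w) = w^2
    return 2.0 * w

w, z = 5.0, 0.0          # initial parameter and momentum buffer
alpha, beta = 0.1, 0.9

for k in range(100):
    z = beta * z + grad_f(w)   # z^{k+1} = beta * z^k + grad f(w^k)
    w = w - alpha * z          # w^{k+1} = w^k - alpha * z^{k+1}

print(w)  # approaches the minimum at w = 0
```

Because `z` accumulates past gradients, the iterate overshoots and oscillates before settling, which is the characteristic heavy-ball behavior momentum introduces.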
```python
for step, (x, y) in enumerate(db):
    with tf.GradientTape() as tape:
        x = tf.reshape(x, (-1, 28 * 28))
        out = network(x, training=True)   # training mode
        # at test time: out = network(x, training=False)
```
The previous parameter update is as follows:

$$\frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^m(\hat{y}^i-y^i)x^i_j$$
$$\theta_j = \theta_j -\alpha \frac{\partial}{\partial \theta_j}J(\theta)$$

After switching to stochastic gradient descent,
$$\theta_j = \theta_j -\alpha(\hat{y}^i-y^i)x^i_j$$

where
$(\hat{y}^i-y^i)x^i_j$ is the contribution of a single sample; looping this expression over a batch yields the mini-batch update.
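That per-sample term, looped over mini-batches, can be sketched for a logistic-regression model; the toy dataset, learning rate, and batch size below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy binary-classification data: label is 1 when the feature sum is positive
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(float)

theta = np.zeros(3)
alpha, batch_size = 0.1, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        y_hat = sigmoid(xb @ theta)          # predictions for the batch
        # per-sample term (y_hat^i - y^i) * x_j^i, averaged over the batch
        grad = xb.T @ (y_hat - yb) / len(xb)
        theta = theta - alpha * grad         # theta_j <- theta_j - alpha * grad_j

accuracy = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1))
print(accuracy)
```

Setting `batch_size = 1` recovers the pure per-sample SGD update from the formula; larger batches simply average the same term over more samples.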