1. L2 Regularization
Cost function without regularization:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + \left(1-y^{(i)}\right)\log\left(1- a^{[L](i)}\right) \right) \tag{1}$$
Cost function with L2 regularization:
$$J_{regularized} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + \left(1-y^{(i)}\right)\log\left(1- a^{[L](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum_{l}\sum_{k}\sum_{j} \left(W_{k,j}^{[l]}\right)^2}_{\text{L2 regularization cost}} \tag{2}$$
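In code (assuming a 3-layer network whose weight matrices W1, W2, W3 and output activations A3 have already been computed by forward propagation), the regularized cost is the cross-entropy cost plus the L2 term: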
cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
L2_regularization_cost = 1./m * lambd/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
cost = cross_entropy_cost + L2_regularization_cost
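Because the regularized cost adds the term $\frac{\lambda}{2m}\sum_l \|W^{[l]}\|_F^2$, backpropagation changes only in the weight gradients: each one picks up the derivative of that term,

$$\frac{d}{dW^{[l]}}\left(\frac{\lambda}{2m}\left\|W^{[l]}\right\|_F^2\right) = \frac{\lambda}{m}W^{[l]},$$

which is why every dW below carries the extra lambd / m * W term (the snippet assumes ReLU hidden layers and a sigmoid output layer, so dZ3 = A3 - Y):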
dZ3 = A3 - Y
dW3 = 1./m * np.dot(dZ3, A2.T) + lambd / m * W3
db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
dA2 = np.dot(W3.T, dZ3)
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1./m * np.dot(dZ2, A1.T) + lambd / m * W2
db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
dA1 = np.dot(W2.T, dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1./m * np.dot(dZ1, X.T) + lambd / m * W1
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
2. He Initialization
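He initialization draws the weights of layer $l$ from a standard normal distribution and scales them by $\sqrt{2/n^{[l-1]}}$, where $n^{[l-1]}$ is the number of units in the previous layer; this keeps the scale of ReLU activations roughly constant from layer to layer instead of shrinking or exploding. Inside the loop over layers: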
parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2.0/layers_dims[l-1])
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
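As a quick sanity check (a minimal sketch, not part of the original; the layer sizes, seed, and the 0.01 "naive" scale are arbitrary assumptions), one can compare the spread of ReLU activations under He initialization against a naive small random initialization:

import numpy as np

np.random.seed(0)
n_in, n_out, m = 500, 500, 1000
x = np.random.randn(n_in, m)                                # unit-variance inputs

W_he = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)   # He initialization
W_small = np.random.randn(n_out, n_in) * 0.01               # naive small initialization

a_he = np.maximum(0, W_he @ x)                              # ReLU activations
a_small = np.maximum(0, W_small @ x)

print(np.var(a_he))      # stays on the order of 1
print(np.var(a_small))   # much smaller, so the signal would shrink layer after layer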
3. Adam Gradient Descent
How does Adam work?
- It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
- It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
- It updates parameters in a direction based on combining information from “1” and “2”.
The update rule is, for $l = 1, ..., L$:
$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \dfrac{\partial J}{\partial W^{[l]}} \\
v^{corrected}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\dfrac{\partial J}{\partial W^{[l]}}\right)^2 \\
s^{corrected}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\
W^{[l]} = W^{[l]} - \alpha \dfrac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{cases}$$

The biases $b^{[l]}$ are updated in exactly the same way, using $v_{db^{[l]}}$ and $s_{db^{[l]}}$.
where:
- $t$ counts the number of steps taken by Adam
- $L$ is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid dividing by zero
Note: the first and second steps are essentially momentum gradient descent and RMSprop. Both rely on exponentially weighted averages that tie the current update to past gradients, so the update either keeps some of its earlier momentum (trend) or uses a smoother average that avoids overcorrecting. The "corrected" quantities are the bias-corrected versions of these averages.
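In code, one Adam step over all layers looks like the following (the snippet assumes the dictionaries parameters, grads, v, s and the scalars t, learning_rate, beta1, beta2, epsilon are already defined):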
for l in range(L):
    # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
    v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1-beta1) * grads['dW' + str(l+1)]
    v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1-beta1) * grads['db' + str(l+1)]
    # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
    v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1-np.power(beta1,t))
    v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1-np.power(beta1,t))
    # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
    s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1-beta2) * grads['dW' + str(l+1)]**2
    s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1-beta2) * grads['db' + str(l+1)]**2
    # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
    s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1-np.power(beta2,t))
    s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1-np.power(beta2,t))
    # Update parameters with the bias-corrected moments. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
    parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
    parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)