Adding regularization will often help to prevent overfitting (the high-variance problem).
1. Logistic regression
Recall the optimization objective used during training:
$$\min_{w,b} J(w,b), \quad w \in \mathbb{R}^{n_x},\ b \in \mathbb{R} \tag{1-1}$$
where
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \tag{1-2}$$
L2 regularization (most commonly used):
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2 \tag{1-3}$$
where
$$\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w \tag{1-4}$$
Why do we regularize only the parameter w? Because w is usually a high-dimensional parameter vector while b is a scalar; almost all of the parameters are in w rather than in b.
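As a minimal NumPy sketch of equation (1-3) (the function name, argument layout, and the choice of cross-entropy loss are illustrative assumptions, not taken from the original notes):

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost for logistic regression plus an L2 penalty on w (eq. 1-3).

    X: inputs of shape (n_x, m); Y: labels of shape (1, m);
    w: weights of shape (n_x, 1); b: scalar bias; lambd: regularization strength.
    """
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))        # sigmoid activations, y_hat
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # (lambda / 2m) * ||w||_2^2
    return cross_entropy + l2_penalty
```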
L1 regularization:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{m}\|w\|_1 \tag{1-5}$$
where
$$\|w\|_1 = \sum_{j=1}^{n_x} |w_j| \tag{1-6}$$
With L1 regularization, w will end up being sparse; in other words, the w vector will have a lot of zeros in it. This can help compress the model a little. A sketch of the penalty term follows below.
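A corresponding sketch of the L1 penalty term from equations (1-5) and (1-6) (helper name and signature are illustrative):

```python
import numpy as np

def l1_penalty(w, lambd, m):
    """L1 penalty term (lambda / m) * ||w||_1 added to the cost, as in eq. (1-5)."""
    return (lambd / m) * np.sum(np.abs(w))
```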
2. Neural network: “Frobenius norm”
$$J(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L} \left\|w^{[l]}\right\|_F^2 \tag{2-1}$$
where
$$\left\|w^{[l]}\right\|_F^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} \left(w_{ij}^{[l]}\right)^2 \tag{2-2}$$
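A minimal sketch of the summed Frobenius penalty in equations (2-1) and (2-2), assuming the weight matrices are stored in a dict keyed "W1", ..., "WL" (that storage convention and the function name are assumptions for illustration):

```python
import numpy as np

def frobenius_penalty(parameters, lambd, m, L):
    """Sum of squared Frobenius norms over all layers, scaled by lambda / (2m)."""
    total = 0.0
    for l in range(1, L + 1):
        total += np.sum(np.square(parameters["W" + str(l)]))  # ||W^[l]||_F^2
    return (lambd / (2 * m)) * total
```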
L2 regularization is also called weight decay:
$$
\begin{aligned}
dw^{[l]} &= (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \\
w^{[l]} &:= w^{[l]} - \alpha\, dw^{[l]} = \left(1 - \frac{\alpha\lambda}{m}\right) w^{[l]} - \alpha\,(\text{from backprop})
\end{aligned} \tag{2-3}
$$
This keeps the weights w from growing too large, which helps avoid overfitting. A sketch of this update step follows below.
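A minimal sketch of the weight-decay update in equation (2-3) (the function name and argument names are illustrative):

```python
def update_with_weight_decay(W, dW_backprop, alpha, lambd, m):
    """Gradient-descent step with L2 regularization, matching eq. (2-3).

    dW_backprop is the gradient of the unregularized cost ("from backprop");
    the L2 term adds (lambda / m) * W, so each step shrinks W by a factor
    (1 - alpha * lambda / m) before applying the usual gradient step.
    """
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW   # == (1 - alpha*lambd/m) * W - alpha * dW_backprop
```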
3. Inverted dropout
For each training example, a random subset of nodes can be eliminated.
Inverted dropout (dropout must be applied in both the forward and backward passes):
By dividing by keep_prob, the inverted dropout technique ensures that the expected value of the activations (e.g. a3) remains the same. This makes test time easier because there is less of a scaling problem.
Dropout is not used at test time.
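A minimal NumPy sketch of the training-time forward step (the function name, shapes, and the example a3 matrix are illustrative; at test time this step is simply skipped):

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob=0.8):
    """Apply inverted dropout to an activation matrix a during training.

    Each unit is kept with probability keep_prob; dividing by keep_prob keeps
    the expected value of the activations unchanged, so no extra rescaling is
    needed at test time. Returns the dropped-out activations and the mask
    (the same mask and scaling are reused on da in the backward pass).
    """
    d = np.random.rand(*a.shape) < keep_prob   # boolean mask, True with prob keep_prob
    a = (a * d) / keep_prob                    # zero out dropped units, then scale up
    return a, d

# Illustrative usage with a made-up layer-3 activation matrix a3 of shape (n3, m):
a3 = np.random.randn(5, 10)
a3, d3 = inverted_dropout_forward(a3, keep_prob=0.8)
```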