Logistic / Sigmoid function
$$g(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{1+e^x},\qquad g'(x)=g(x)\,[1-g(x)]$$
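A minimal NumPy sketch of $g$, plus a finite-difference check of the derivative identity $g'(x)=g(x)[1-g(x)]$ that the derivation below relies on. Function and variable names here are illustrative, not from the original:

```python
import numpy as np

def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x)), computed stably."""
    # For x >= 0 use 1/(1+exp(-x)); for x < 0 use exp(x)/(1+exp(x))
    # so that exp() never overflows.
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

# Check g'(x) = g(x) * (1 - g(x)) against central finite differences.
x = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
analytic = sigmoid(x) * (1.0 - sigmoid(x))
print(np.max(np.abs(numeric - analytic)))   # close to zero
```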
Cost function
Logistic Regression
$$h_\theta(X)=g(X^\top\theta)=P(y=1\mid X;\theta)$$
Let $z = X^\top\theta$. Then

$$\begin{aligned}
\ln P(y=y\mid X;\theta)
&= y\ln P(y=1\mid X;\theta) + (1-y)\ln P(y=0\mid X;\theta)\\
&= y\ln h_\theta(X) + (1-y)\ln[1-h_\theta(X)]\\
&= y\ln g(z) + (1-y)\ln[1-g(z)]
\end{aligned}$$
Therefore, using $g'(z)=g(z)[1-g(z)]$,

$$\begin{aligned}
d\ln P(y=y\mid X;\theta)
&= y\,d\ln g(z) + (1-y)\,d\ln[1-g(z)]\\
&= y\cdot\frac{1}{g(z)}\,g(z)[1-g(z)]\,dz + (1-y)\,\frac{1}{1-g(z)}\,(-1)\,g(z)[1-g(z)]\,dz\\
&= \big\{y\,[1-g(z)] - (1-y)\,g(z)\big\}\,dz\\
&= [y - g(z)]\,dz\\
&= [y - g(X^\top\theta)]\,X^\top d\theta
\end{aligned}$$
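The last line says the per-example gradient is $\nabla_\theta \ln P(y\mid X;\theta) = [y-g(X^\top\theta)]\,X$. A small numerical check of that formula with finite differences; all names are hypothetical and only illustrate the conventions above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(theta, x, y):
    """ln P(y | x; theta) for a single example (y is 0 or 1)."""
    p = sigmoid(x @ theta)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one example X
theta = rng.normal(size=4)
y = 1

analytic = (y - sigmoid(x @ theta)) * x   # [y - g(X^T theta)] X

# Central finite differences of ln P with respect to each theta_j.
h = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    e = np.zeros_like(theta)
    e[j] = h
    numeric[j] = (log_prob(theta + e, x, y) - log_prob(theta - e, x, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # close to zero
```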
The log-likelihood function is

$$L(\theta)=\ln\left[\prod_{i=1}^m P(y=y_i\mid X_i;\theta)\right]=\sum_{i=1}^m \ln P(y=y_i\mid X_i;\theta)$$
Let

$$\begin{aligned}
\mathrm{cost}(\theta) &= -\frac{1}{m}L(\theta) = -\frac{1}{m}\sum_{i=1}^m \ln P(y=y_i\mid X_i;\theta)\\
&= -\frac{1}{m}\sum_{i=1}^m \Big\{y_i\ln h_\theta(X_i) + (1-y_i)\ln[1-h_\theta(X_i)]\Big\}\\
&= -\frac{1}{m}\sum_{i=1}^m \Big\{y_i\ln g(z_i) + (1-y_i)\ln[1-g(z_i)]\Big\}
\end{aligned}$$

where $z_i = X_i^\top\theta$.
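A vectorized sketch of $\mathrm{cost}(\theta)$ under the conventions above ($X$ an $m\times(n+1)$ matrix whose rows are $X_i^\top$, $y\in\{0,1\}^m$); the names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """cost(theta) = -(1/m) sum_i [ y_i ln g(z_i) + (1 - y_i) ln(1 - g(z_i)) ],
    with z_i = X_i^T theta."""
    m = X.shape[0]
    p = sigmoid(X @ theta)                      # vector of g(z_i)
    return -(y @ np.log(p) + (1 - y) @ np.log(1 - p)) / m

# Toy usage: at theta = 0 every prediction is 0.5, so cost = ln 2 ≈ 0.693.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
print(cost(np.zeros(2), X, y))
```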
Then $\max_\theta L(\theta) = -m\,\min_\theta \mathrm{cost}(\theta)$: maximizing the likelihood is equivalent to minimizing $\mathrm{cost}(\theta)$, which serves as the cost function.
Let $J(\theta) = -L(\theta)$ denote the (un-normalized) negative log-likelihood. Then

$$\begin{aligned}
d\,J(\theta) &= -\sum_{i=1}^m \big[y_i - g(X_i^\top\theta)\big]\,X_i^\top d\theta\\
&= \sum_{i=1}^m \big[g(X_i^\top\theta) - y_i\big]\,X_i^\top d\theta
\end{aligned}$$
Therefore

$$\nabla J(\theta) = \sum_{i=1}^m \big[g(X_i^\top\theta) - y_i\big]\,X_i = X^\top\big[g(X\theta) - y\big]$$

where

$$X = \begin{pmatrix}X_1^\top\\ \vdots\\ X_m^\top\end{pmatrix},\qquad
y = \begin{pmatrix}y_1\\ \vdots\\ y_m\end{pmatrix},\qquad
g(X\theta) = \begin{pmatrix}g(X_1^\top\theta)\\ \vdots\\ g(X_m^\top\theta)\end{pmatrix}$$

with $g$ applied componentwise.
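The vectorized gradient $\nabla J(\theta)=X^\top[g(X\theta)-y]$ translates directly into code; a sketch that also runs a few batch gradient-descent steps on $\mathrm{cost}(\theta)=\frac{1}{m}J(\theta)$ (function names and the learning rate are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_J(theta, X, y):
    """grad J(theta) = X^T (g(X theta) - y): gradient of the
    (un-normalized) negative log-likelihood."""
    return X.T @ (sigmoid(X @ theta) - y)

def fit(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent on cost(theta) = J(theta) / m."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * grad_J(theta, X, y) / m
    return theta
```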
Then

$$\begin{aligned}
d\,\{\nabla J(\theta)\} &= \sum_{i=1}^m d\big[g(X_i^\top\theta)\big]\,X_i\\
&= \sum_{i=1}^m g'(X_i^\top\theta)\,(X_i^\top d\theta)\,X_i\\
&= \sum_{i=1}^m g'(X_i^\top\theta)\,X_i X_i^\top\, d\theta
\end{aligned}$$

Therefore the Hessian is

$$H_J(\theta) = \sum_{i=1}^m g'(X_i^\top\theta)\,X_i X_i^\top$$
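In matrix form this Hessian is $H_J(\theta)=X^\top W X$ with $W=\mathrm{diag}\big(g'(X_i^\top\theta)\big)$; a short sketch with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_J(theta, X):
    """H_J(theta) = sum_i g'(X_i^T theta) X_i X_i^T = X^T diag(w) X,
    where w_i = g(z_i) (1 - g(z_i))."""
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)                 # g'(z_i)
    return X.T @ (w[:, None] * X)     # avoids forming diag(w) explicitly
```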
Note:

$$\frac{\partial}{\partial\theta_j} J(\theta) = \sum_{i=1}^m \big[g(X_i^\top\theta) - y_i\big]\,x_{ij},\qquad j\in\mathbb{N},\ 0\le j\le n$$
Regularized Logistic Regression
$$\mathrm{cost}(\theta) = -\frac{1}{m}\sum_{i=1}^m \Big\{y_i\ln h_\theta(X_i) + (1-y_i)\ln[1-h_\theta(X_i)]\Big\} + \frac{\lambda}{2n}\sum_{j=1}^n \theta_j^2$$
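A sketch of the regularized cost and its gradient, keeping the document's $\frac{\lambda}{2n}$ scaling and leaving $\theta_0$ unpenalized (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_cost_and_grad(theta, X, y, lam):
    """Regularized cost and gradient; theta[0] (the bias) is not penalized.
    Regularization term: (lam / (2 n)) * sum_{j>=1} theta_j^2."""
    m, n_plus_1 = X.shape
    n = n_plus_1 - 1
    p = sigmoid(X @ theta)
    reg = theta.copy()
    reg[0] = 0.0                                   # do not penalize theta_0
    cost = -(y @ np.log(p) + (1 - y) @ np.log(1 - p)) / m \
           + lam / (2 * n) * (reg @ reg)
    grad = X.T @ (p - y) / m + (lam / n) * reg
    return cost, grad
```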
Then

$$H_{\mathrm{cost}}(\theta) = \frac{1}{m}\sum_{i=1}^m g'(X_i^\top\theta)\,X_i X_i^\top + \frac{\lambda}{n}\begin{pmatrix}0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1\end{pmatrix}$$
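With this Hessian available, one Newton update is $\theta \leftarrow \theta - H_{\mathrm{cost}}(\theta)^{-1}\nabla\mathrm{cost}(\theta)$; a minimal sketch using the $\frac{1}{m}$ and $\frac{\lambda}{n}$ factors derived above (names are illustrative). The positive-definiteness property shown next is what makes the linear solve well-posed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X, y, lam):
    """One Newton update for the regularized cost."""
    m, n_plus_1 = X.shape
    n = n_plus_1 - 1
    p = sigmoid(X @ theta)
    reg = theta.copy()
    reg[0] = 0.0
    grad = X.T @ (p - y) / m + (lam / n) * reg
    D = np.eye(n_plus_1)
    D[0, 0] = 0.0                                  # diag(0, 1, ..., 1)
    H = X.T @ ((p * (1 - p))[:, None] * X) / m + (lam / n) * D
    return theta - np.linalg.solve(H, grad)
```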
Property: $H_{\mathrm{cost}}(\theta)$ is positive definite.

Proof:
For all $Z = (z_0, \dots, z_n)^\top \in \mathbb{R}^{n+1}$,

$$\begin{aligned}
Z^\top H_{\mathrm{cost}}(\theta)\,Z
&= \frac{1}{m}\sum_{i=1}^m g'(X_i^\top\theta)\,Z^\top X_i X_i^\top Z + \frac{\lambda}{n}\sum_{j=1}^n z_j^2\\
&= \frac{1}{m}\sum_{i=1}^m g'(X_i^\top\theta)\,(X_i^\top Z)^2 + \frac{\lambda}{n}\sum_{j=1}^n z_j^2 \;\ge\; 0
\end{aligned}$$
If $Z^\top H_{\mathrm{cost}}(\theta)\,Z = 0$, then every (nonnegative) term in the sum must vanish, so $z_j = 0$ for all $j\in\mathbb{N},\ 1\le j\le n$. Hence, using $x_{i0}=1$,

$$Z^\top H_{\mathrm{cost}}(\theta)\,Z = \frac{1}{m}\sum_{i=1}^m g'(X_i^\top\theta)\,z_0^2 = 0 \;\Rightarrow\; z_0 = 0$$

since $g'(\cdot) > 0$. Thus $Z = 0$, and therefore $H_{\mathrm{cost}}(\theta)$ is positive definite.
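A numerical sanity check of the claim: for random data and $\lambda>0$, a Cholesky factorization of $H_{\mathrm{cost}}(\theta)$ should succeed and all eigenvalues should be positive (illustrative sketch, not part of the proof):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # x_{i0} = 1
theta = rng.normal(size=n + 1)
lam = 0.1

p = sigmoid(X @ theta)
D = np.eye(n + 1)
D[0, 0] = 0.0
H = X.T @ ((p * (1 - p))[:, None] * X) / m + (lam / n) * D

np.linalg.cholesky(H)                      # raises LinAlgError if not PD
print(np.linalg.eigvalsh(H).min() > 0)     # True
```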
Neural Network for Classification
$$\begin{aligned}
\mathrm{cost}(\theta) = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K &\Big\{y_{ik}\,\big(\ln h_\theta(X_i)\big)_k + (1-y_{ik})\,\big(\ln[1-h_\theta(X_i)]\big)_k\Big\}\\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\big(\theta^{(l)}_{ij}\big)^2
\end{aligned}$$
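A sketch of this multi-class cost for network outputs $h_\theta(X_i)\in(0,1)^K$ and one-vs-all targets $y_{ik}\in\{0,1\}$. It assumes each weight matrix $\theta^{(l)}$ is stored with its bias weights in column 0, which the regularization term skips; the forward pass producing `H` and all names are illustrative, not from the original:

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """H: (m, K) outputs h_theta(X_i); Y: (m, K) one-hot labels;
    thetas: list of weight matrices theta^(l), bias weights in column 0."""
    m = H.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term
```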