Logistic regression
2.1 logistic regression
2.2 goodness of a function
When computing the likelihood of the training data, $w^*$ and $b^*$ are the parameters that maximize the likelihood function, e.g. $L(w,b) = f_{w,b}(x^1)f_{w,b}(x^2)(1-f_{w,b}(x^3))\cdots$ when $x^1, x^2$ belong to class 1 and $x^3$ belongs to class 2. This is equivalent to
$$w^*, b^* = \arg\min_{w,b} -\ln L(w,b)$$
$$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln(1-f_{w,b}(x^3)) - \cdots$$
Each term can be rewritten in a single unified form:
$$-\ln f_{w,b}(x^1) \to -[\hat{y}^1 \ln f(x^1) + (1-\hat{y}^1)\ln(1-f(x^1))]$$
$$-\ln f_{w,b}(x^2) \to -[\hat{y}^2 \ln f(x^2) + (1-\hat{y}^2)\ln(1-f(x^2))]$$
$$-\ln(1-f_{w,b}(x^3)) \to -[\hat{y}^3 \ln f(x^3) + (1-\hat{y}^3)\ln(1-f(x^3))] \quad \cdots$$
*$\hat{y}^n = 1$ when $x^n$ is in class 1 and $\hat{y}^n = 0$ when $x^n$ is in class 2 (here $x^1$ and $x^2$ are in class 1 and $x^3$ is in class 2).
This lets all the terms be combined into one sum:
$$-\ln L(w,b) = \sum_n -[\hat{y}^n \ln f(x^n) + (1-\hat{y}^n)\ln(1-f(x^n))]$$
The same expression can be obtained as the cross entropy between two Bernoulli distributions: interpret it as how similar $\hat{y}^n$ is to $f(x^n)$, and how similar $1-\hat{y}^n$ is to $1-f(x^n)$.
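To make the loss concrete, here is a minimal NumPy sketch of the cross-entropy loss above (the function names and the clipping constant `eps` are my own choices, not from the note):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, b, X, y_hat, eps=1e-12):
    """-ln L(w,b) = sum_n -[y_hat^n ln f(x^n) + (1 - y_hat^n) ln(1 - f(x^n))]."""
    f = sigmoid(X @ w + b)        # f_{w,b}(x^n) for every training example
    f = np.clip(f, eps, 1 - eps)  # avoid ln(0)
    return -np.sum(y_hat * np.log(f) + (1 - y_hat) * np.log(1 - f))
```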
2.3 find the best function (gradient descent)
Find the $w_i$ that minimizes the negative log-likelihood:
$$\frac{\partial(-\ln L(w,b))}{\partial w_i} = \sum_n -\left[\hat{y}^n \frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} + (1-\hat{y}^n)\frac{\partial \ln(1-f_{w,b}(x^n))}{\partial w_i}\right]$$
By the chain rule, with $z = w \cdot x + b$ so that $f_{w,b}(x) = \sigma(z)$:
$$\frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x^n)}{\partial z}\frac{\partial z}{\partial w_i}$$
Because
$$\frac{\partial \ln f_{w,b}(x^n)}{\partial z} = \frac{\partial \ln\sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\sigma(z)(1-\sigma(z)) = 1-\sigma(z)$$
and similarly $\frac{\partial \ln(1-\sigma(z))}{\partial z} = -\sigma(z)$, together with
$$\frac{\partial z}{\partial w_i} = x_i$$
we obtain
$$\frac{\partial(-\ln L(w,b))}{\partial w_i} = \sum_n -\left[\hat{y}^n(1-f_{w,b}(x^n))x_i^n - (1-\hat{y}^n)f_{w,b}(x^n)x_i^n\right]$$
$$=\sum_n -\left[\hat{y}^n - \hat{y}^n f_{w,b}(x^n) - f_{w,b}(x^n) + \hat{y}^n f_{w,b}(x^n)\right]x_i^n$$
$$=\sum_n -\left[\hat{y}^n - f_{w,b}(x^n)\right]x_i^n$$
The gradient descent update is therefore
$$w_i \leftarrow w_i - \eta \sum_n -\left(\hat{y}^n - f_{w,b}(x^n)\right)x_i^n$$
*$\eta$ is the learning rate
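As a sanity check on the derivation, here is a small batch gradient-descent sketch implementing the update above (self-contained; all names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, y_hat, eta=0.1):
    """One update using d(-lnL)/dw_i = sum_n -(y_hat^n - f(x^n)) x_i^n."""
    f = sigmoid(X @ w + b)       # predictions f_{w,b}(x^n)
    grad_w = -(y_hat - f) @ X    # shape (d,): sum over n of -(y_hat^n - f^n) x^n
    grad_b = -np.sum(y_hat - f)  # bias gradient: same formula with x_i = 1
    return w - eta * grad_w, b - eta * grad_b
```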
2.4 why we don't use square error for logistic regression
Suppose the true class is $\hat{y}^n = 0$. With square error, even when the prediction $f_{w,b}(x^n)$ is far from the target, the derivative vanishes:
$$\frac{\partial L}{\partial w_i} = 0$$
This happens because the square-error gradient carries a factor $f_{w,b}(x^n)(1-f_{w,b}(x^n))$, which goes to 0 whenever the sigmoid saturates, regardless of whether the prediction is right or wrong.
In the loss-surface plot, the red surface is the square error: it is flat even far from the target, so the gradient is tiny, updates are slow, and the result is poor. The black surface is the cross entropy: the farther from the target, the steeper the surface and the faster the updates.
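A quick numeric illustration, under my assumed loss definitions (square error $\frac{1}{2}(f-\hat{y})^2$ versus cross entropy); $z = 5$ stands for a prediction far on the wrong side when $\hat{y} = 0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y_hat, x_i = 5.0, 0.0, 1.0  # f ~ 0.993 although the true class is 0
f = sigmoid(z)

# Square error 0.5*(f - y_hat)^2: the gradient carries a f*(1-f) factor.
grad_se = (f - y_hat) * f * (1 - f) * x_i
# Cross entropy: the gradient is (f - y_hat) * x_i, large while the prediction is wrong.
grad_ce = (f - y_hat) * x_i

print(f"square error gradient: {grad_se:.4f}")   # ~0.0066, nearly flat
print(f"cross entropy gradient: {grad_ce:.4f}")  # ~0.9933, still strong
```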
2.5 discriminative vs. generative
Logistic regression is a discriminative method, while describing the classes with Gaussians is a generative method.
Both essentially rest on the same expression:
$$P(C_1|x) = \sigma(w \cdot x + b)$$
With logistic regression, $w$ and $b$ are essentially found by gradient descent.
With Gaussians, maximum likelihood estimation essentially finds the best means and covariance matrix, which are then plugged into
$$w^T = (\mu^1-\mu^2)^T\Sigma^{-1}$$
and
$$b = -\frac{1}{2}(\mu^1)^T(\Sigma^1)^{-1}\mu^1 + \frac{1}{2}(\mu^2)^T(\Sigma^2)^{-1}\mu^2 + \ln\frac{N_1}{N_2}$$
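A minimal sketch of the generative route, assuming a shared covariance matrix (so $\Sigma^1 = \Sigma^2 = \Sigma$ in the formula above) and that each row of `X1`/`X2` is one training example of its class; all names are my own:

```python
import numpy as np

def generative_params(X1, X2):
    """Closed-form w, b from Gaussian MLE with a shared covariance matrix."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Shared covariance: weighted average of the per-class MLE covariances.
    S1 = np.cov(X1, rowvar=False, bias=True)
    S2 = np.cov(X2, rowvar=False, bias=True)
    S = (N1 * S1 + N2 * S2) / (N1 + N2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)          # w^T = (mu1 - mu2)^T Sigma^{-1}
    b = (-0.5 * mu1 @ S_inv @ mu1
         + 0.5 * mu2 @ S_inv @ mu2
         + np.log(N1 / N2))
    return w, b                      # P(C1|x) = sigmoid(w @ x + b)
```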
Even on the same data, the two approaches find different parameters; in general, the discriminative model performs better than the generative model.
- Benefits of the generative model (with little training data, or with a lot of noisy data, the generative model can actually do better, because it relies on a distributional assumption):
  - with the assumption of a probability distribution, less training data is needed
  - with the assumption of a probability distribution, it is more robust to noise
  - priors and class-dependent probabilities can be estimated from different sources
2.6 multi-class classification (3-class example)
$$C_1: w^1, b_1 \qquad z_1 = w^1 \cdot x + b_1$$
$$C_2: w^2, b_2 \qquad z_2 = w^2 \cdot x + b_2$$
$$C_3: w^3, b_3 \qquad z_3 = w^3 \cdot x + b_3$$
$z_1, z_2, z_3$ can take any real value. We feed them into a softmax, which exponentiates each one and then normalizes:
$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
After the softmax transform, each output is $\in [0,1]$ and $\sum_i y_i = 1$.
Why it is called softmax: it strengthens the largest value, so the gap between large and small values is stretched further apart; the outputs can be used to estimate posterior probabilities.
My own understanding: the sigmoid function is for the binary case, while softmax is for the multi-class case.
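A short, numerically stable softmax sketch (subtracting the maximum before exponentiating is a standard trick I've added, not something from the note):

```python
import numpy as np

def softmax(z):
    """Exponentiate each z_i, then normalize; outputs lie in [0,1] and sum to 1."""
    e = np.exp(z - np.max(z))  # shifting by max(z) avoids overflow
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~[0.88, 0.12, 0.00]: gaps are widened
```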
2.7 Limitation of Logistic Regression
Logistic regression cannot separate points distributed this way (an XOR-style layout) into the correct classes, because its decision boundary is a straight line.
The fix is feature transformation, e.g.:
$$x_1': \text{distance to } \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
$$x_2': \text{distance to } \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
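To see why this works, here is a sketch on the classic XOR layout (the class assignment of the four points is my assumption; the note only gives the two distance features):

```python
import numpy as np

# XOR-style data: no straight line separates the two classes in (x1, x2).
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])  # assumed labels: {(0,0),(1,1)} vs {(0,1),(1,0)}

# Feature transformation from the note:
# x1' = distance to [0, 0], x2' = distance to [1, 1]
x1p = np.linalg.norm(X - np.array([0.0, 0.0]), axis=1)
x2p = np.linalg.norm(X - np.array([1.0, 1.0]), axis=1)
X_new = np.stack([x1p, x2p], axis=1)

print(X_new)
# (0,0) -> (0, 1.41), (1,1) -> (1.41, 0), (0,1) -> (1, 1), (1,0) -> (1, 1)
# In the transformed space one class collapses onto (1, 1) while the other
# moves away from it, so a single logistic regression can now separate them.
```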
How to let the machine decide the feature transformation by itself: cascading logistic regression models.