1. Logistic Regression
In logistic regression, the predicted probability is computed with the sigmoid function:
$$h_{\theta}(x^{(i)}) = \frac{1}{1+e^{-\theta^{T}x^{(i)}}}$$
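As a minimal NumPy sketch (the names `sigmoid` and `lr_hypothesis`, and the convention that `X` stacks the samples $x^{(i)}$ as rows, are my own assumptions), the hypothesis can be evaluated for all samples at once:

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)), applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def lr_hypothesis(theta, X):
    # h_theta(x^(i)) = sigmoid(theta^T x^(i)) for every row x^(i) of the design matrix X.
    # X: (m, n) array of samples, theta: (n,) parameter vector.
    return sigmoid(X @ theta)
```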
The probabilities of the positive and negative classes are then defined as:
$$P(y^{(i)}=1\mid x^{(i)};\theta) = h_{\theta}(x^{(i)})$$
$$P(y^{(i)}=0\mid x^{(i)};\theta) = 1 - h_{\theta}(x^{(i)})$$
The loss function of LR can then be written as:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right)\right] = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=0}^{1} 1\{y^{(i)}=j\}\log P(y^{(i)}=j\mid x^{(i)};\theta)$$
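A minimal sketch of this loss, reusing `lr_hypothesis` from the snippet above; the `eps` guard against $\log 0$ is an added assumption, not part of the formula:

```python
def lr_loss(theta, X, y):
    # Binary cross-entropy J(theta), averaged over the m samples.
    # y: (m,) array of 0/1 labels; eps guards against log(0).
    eps = 1e-12
    p = lr_hypothesis(theta, X)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```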
Here, $1\{\cdot\}$ is an indicator function: it evaluates to 1 when the statement inside the braces is true, and to 0 when it is false.
That covers the core of logistic regression. Logistic regression handles binary classification. For a multi-class task, it can still be used by training one classifier per class: split the data into two parts, labeling the samples of that class as 1 and all remaining samples as 0. When there are many classes, however, this one-vs-rest scheme becomes unwieldy. Is there a model that handles multi-class classification directly? Besides building a neural network, the simplest choice is Softmax regression.
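A small sketch of the one-vs-rest relabeling just described (the helper name `one_vs_rest_labels` is hypothetical); each column of the result would be fed to its own binary logistic-regression classifier:

```python
def one_vs_rest_labels(y, k):
    # For each class c, samples of class c get label 1 and every other sample gets 0.
    # Column c of the result is the binary target vector for the c-th classifier.
    return np.stack([(y == c).astype(float) for c in range(k)], axis=1)
```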
2. Softmax Regression
In softmax regression, the predicted probability of class $j$ is written as:
$$P(y^{(i)}=j\mid x^{(i)};\theta) = \frac{e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k}e^{\theta_{l}^{T}x^{(i)}}}$$
where $\theta_{j}$ is the parameter vector of class $j$ and $k$ is the number of classes.
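A minimal sketch, assuming the per-class parameter vectors $\theta_j$ are stored as the columns of a matrix `Theta`; the max-shift is a standard numerical-stability trick and does not change the probabilities:

```python
import numpy as np

def softmax_probs(Theta, X):
    # P(y^(i)=j | x^(i); theta) for every sample i and class j.
    # Theta: (n, k) matrix whose column j is theta_j; X: (m, n) design matrix.
    logits = X @ Theta                                   # theta_j^T x^(i) for all i, j
    logits = logits - logits.max(axis=1, keepdims=True)  # shift for stability; ratios unchanged
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)              # (m, k), each row sums to 1
```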
Mirroring the form of the LR loss function, the softmax loss can be written as:
$$
\begin{aligned}
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\log P(y^{(i)}=j\mid x^{(i)};\theta) \\
&= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\log \frac{e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k}e^{\theta_{l}^{T}x^{(i)}}} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\left[\log e^{\theta_{j}^{T}x^{(i)}}-\log \sum_{l=1}^{k}e^{\theta_{l}^{T}x^{(i)}}\right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\left[\theta_{j}^{T}x^{(i)}-\log \sum_{l=1}^{k}e^{\theta_{l}^{T}x^{(i)}}\right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k}\left[1\{y^{(i)}=j\}\theta_{j}^{T}x^{(i)}- 1\{y^{(i)}=j\}\log \sum_{l=1}^{k}e^{\theta_{l}^{T}x^{(i)}}\right]
\end{aligned}
$$
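Because the indicator selects only the true class, the double sum collapses to the mean negative log-probability of the correct label. A sketch under the same assumptions as above, with `y` holding integer class labels in $0,\dots,k-1$:

```python
def softmax_loss(Theta, X, y):
    # J(theta): the indicator 1{y^(i)=j} keeps only the log-probability of the true class,
    # so the loss is the average negative log-probability assigned to the correct label.
    m = X.shape[0]
    P = softmax_probs(Theta, X)          # (m, k) predicted probabilities
    eps = 1e-12                          # guard against log(0)
    return -np.mean(np.log(P[np.arange(m), y] + eps))
```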
The partial derivative of $J(\theta)$ with respect to $\theta_j$ is:
$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j} &= -\frac{1}{m}\sum_{i=1}^{m}\left[1\{y^{(i)}=j\}x^{(i)}-\frac{x^{(i)}e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k}e^{\theta_{l}^{T}x^{(i)}}}\sum_{j'=1}^{k}1\{y^{(i)}=j'\}\right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[1\{y^{(i)}=j\}x^{(i)}-x^{(i)}P(y^{(i)}=j\mid x^{(i)};\theta)\right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}x^{(i)}\left[1\{y^{(i)}=j\}-P(y^{(i)}=j\mid x^{(i)};\theta)\right]
\end{aligned}
$$
The second step uses $\sum_{j'=1}^{k}1\{y^{(i)}=j'\}=1$, since each sample belongs to exactly one class.
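In matrix form the gradient for all classes at once is $-\frac{1}{m}X^{T}(Y-P)$, where $Y$ is the one-hot indicator matrix and $P$ the predicted probabilities. A sketch under the same conventions as the earlier snippets:

```python
def softmax_grad(Theta, X, y):
    # Gradient dJ/dtheta_j for every class j, stacked as the columns of an (n, k) matrix:
    # -(1/m) * sum_i x^(i) * (1{y^(i)=j} - P(y^(i)=j | x^(i); theta)).
    m, k = X.shape[0], Theta.shape[1]
    P = softmax_probs(Theta, X)          # (m, k) predicted probabilities
    Y = np.zeros((m, k))
    Y[np.arange(m), y] = 1.0             # one-hot indicators 1{y^(i)=j}
    return -(X.T @ (Y - P)) / m          # (n, k) gradient matrix
```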
Running gradient descent then learns the parameters $\theta$:
$$\theta_j := \theta_j - \eta \frac{\partial J(\theta)}{\partial \theta_j}$$
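A minimal batch gradient-descent loop tying the pieces together; the learning rate `eta`, the iteration count, and the zero initialization are illustrative assumptions:

```python
def train_softmax(X, y, k, eta=0.1, num_iters=1000):
    # Batch gradient descent: theta_j := theta_j - eta * dJ/dtheta_j for all classes at once.
    n = X.shape[1]
    Theta = np.zeros((n, k))
    for _ in range(num_iters):
        Theta = Theta - eta * softmax_grad(Theta, X, y)
    return Theta
```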