Basic Concepts
Logistic regression (also called log-odds regression) can be used to solve both binary and multiclass classification problems. In classification, the output set is no longer continuous but discrete, i.e. $\mathcal{Y} \in \{0,1,2,\cdots\}$. Taking binary classification as an example, the output set is typically $\mathcal{Y} \in \{0,1\}$.
To handle binary classification, logistic regression builds on linear regression by introducing the sigmoid (logistic) function, where $\exp(\cdot)$ is the natural exponential:
$$g(z) = \dfrac{1}{1 + \exp(-z)}$$
The range of this function is $(0,1)$, as shown in the figure below:
Therefore, the hypothesis of logistic regression is defined as:
$$h_\theta(x) = g(\theta^T x)$$
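As a minimal sketch, the sigmoid and the hypothesis above can be written in a few lines of NumPy (the function names `sigmoid` and `hypothesis` are illustrative, not from the text):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): the predicted probability that y = 1."""
    return sigmoid(np.dot(theta, x))

# The sigmoid squashes any real input into (0, 1), with g(0) = 0.5.
print(sigmoid(0.0))                            # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # values near 0, 0.5, near 1
```

Note that `sigmoid` is vectorized for free: passing an array of scores returns an array of probabilities, which is convenient later when evaluating all samples at once.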
In fact, $h_{\theta}(x)$ gives the probability that the label $y=1$, conditioned on the parameters $\theta$ and the sample $x$.
$$\begin{aligned}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \\& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{aligned}$$
Loss Function
The loss function of logistic regression is:
$$J(\theta) = \dfrac{1}{n} \sum_{i=1}^n \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) =\left\{ \begin{aligned} &-\log(h_\theta(x^{(i)})) \; & \text{if }y^{(i)} = 1\\ &-\log(1-h_\theta(x^{(i)})) \; & \text{if } y^{(i)} = 0 \end{aligned} \right.$$
This loss function is derived by maximum likelihood. For a given input set $\mathcal{X}$ and output set $\mathcal{Y}$, the likelihood function is:
$$\prod _{i = 1}^n \left[h_\theta(x^{(i)})\right]^{y^{(i)}}\left[1 - h_\theta(x^{(i)})\right]^{1 - y^{(i)}}$$
Since a product is hard to optimize, we take the logarithm of the expression above to turn it into a sum, which gives the log-likelihood:
$$L(\theta)=\frac{1}{n} \sum _{i=1}^n \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\theta(x^{(i)})) \right ]$$
Maximizing this log-likelihood yields the optimal parameters $\theta$. Since maximizing $L(\theta)$ is equivalent to minimizing $-L(\theta)$, we arrive at the following form of the loss function:
$$J(\theta) = -\frac{1}{n} \sum _{i=1}^n \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\theta(x^{(i)})) \right ]$$
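This cross-entropy loss is a one-liner in NumPy; the sketch below (my own names, with a made-up two-sample dataset) evaluates it at $\theta = 0$, where $h_\theta(x) = 0.5$ for every sample and the loss must equal $\log 2$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    """Cross-entropy loss: J = -(1/n) * sum[y log h + (1-y) log(1-h)]."""
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every row of X at once
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# With theta = 0 every prediction is 0.5, so J = -log(0.5) = log 2 ≈ 0.6931.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
print(loss(np.zeros(2), X, y))  # 0.6931...
```

In practice one would clip `h` away from exactly 0 and 1 before taking logarithms to avoid `-inf`; that guard is omitted here for brevity.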
Parameter Learning
With the loss function in hand, we use gradient descent to find its minimum. First, simplify the loss function:
$$\begin{aligned} J(\theta) &=-\frac{1}{n} \sum _{i=1}^n \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1 - h_\theta(x^{(i)})) \right ] \\ &=-\frac{1}{n} \sum _{i=1}^n \left[ y^{(i)}\log \frac {h_\theta(x^{(i)})} {1 - h_\theta(x^{(i)})} + \log(1 - h_\theta(x^{(i)})) \right ] \\ &=-\frac{1}{n} \sum _{i=1}^n \left[ y^{(i)} \log \frac { {\exp(\theta\cdot x^{(i)})} / (1 + \exp(\theta\cdot x^{(i)}))} {{1} /(1 + \exp(\theta\cdot x^{(i)}))} + \log(1 - h_\theta(x^{(i)})) \right ] \\ &=-\frac{1}{n} \sum _{i=1}^n \left[ y^{(i)} (\theta\cdot x^{(i)}) - \log(1 + \exp (\theta\cdot x^{(i)})) \right ] \end{aligned}$$
Next, take the partial derivative of the loss $J(\theta)$ with respect to the parameters $\theta$:
$$\begin{aligned} \frac{\partial}{\partial \theta}J(\theta) &=-\frac{1}{n} \sum _{i=1}^n \left [y^{(i)} \cdot x^{(i)} - \frac {1} {1 + \exp(\theta \cdot x^{(i)})} \cdot \exp(\theta \cdot x^{(i)}) \cdot x^{(i)}\right ] \\ &=-\frac{1}{n} \sum _{i=1}^n \left [y^{(i)} \cdot x^{(i)} - \frac {\exp(\theta \cdot x^{(i)})} {1 + \exp(\theta \cdot x^{(i)})} \cdot x^{(i)}\right ] \\ &=-\frac{1}{n} \sum _{i=1}^n \left (y^{(i)} - \frac {\exp(\theta \cdot x^{(i)})} {1 + \exp(\theta \cdot x^{(i)})} \right ) x^{(i)}\\ &=\frac{1}{n} \sum _{i=1}^n \left (h_\theta(x^{(i)})-y^{(i)} \right )x^{(i)} \end{aligned}$$
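A useful way to gain confidence in this derivation is to compare the analytic gradient $\frac{1}{n}\sum_i (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$ against a central finite difference of the loss. The sketch below does exactly that on small random data (all names and the dataset are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Analytic gradient: (1/n) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Numerically check the derivation with central differences.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)
eps = 1e-6
num = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
# The difference should be tiny if the analytic gradient is correct.
print(np.max(np.abs(num - gradient(theta, X, y))))
```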
Finally, use gradient descent to update each parameter:
$$\theta_j \coloneqq \theta_j - \frac{\alpha}{n} \sum_{i=1}^n \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
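Putting the pieces together, batch gradient descent with this update rule can be sketched as follows. The synthetic dataset (labels determined by the sign of $x_1 + x_2$, no bias term) and the hyperparameters are my own choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary classification problem: y = 1 exactly when x1 + x2 > 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(2)
alpha = 0.5                      # learning rate
for _ in range(1000):            # batch gradient descent
    h = sigmoid(X @ theta)       # h_theta(x^(i)) for all samples
    theta -= alpha / len(y) * (X.T @ (h - y))  # the update rule above, vectorized

pred = (sigmoid(X @ theta) >= 0.5).astype(float)
print((pred == y).mean())        # training accuracy, close to 1.0
```

The vectorized update `X.T @ (h - y)` computes all components $\theta_j$ at once rather than looping over $j$, which is the standard way to implement the per-coordinate rule above.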