Introduction
Logistic regression (LR) is a widely used and refreshingly simple classification algorithm in machine learning. It is applied extensively in industry, e.g. in CTR (click-through-rate) prediction and recommender systems. The model's prediction function is:
$$h_{w,b}(x) = \frac{1}{1+e^{-(w^\mathrm{T}x+b)}}$$
where $w$ and $b$ are the model parameters.
The Sigmoid Function
First, $f(x)=\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}}$ has domain $\mathbb{R}$ and range $(0,1)$, with neither endpoint attained.
$\mathrm{sigmoid}(x)$ is symmetric about the point $(0,\frac{1}{2})$. Its graph is the familiar S-shaped curve.
Moreover, one very important property is its derivative:
$$\mathrm{sigmoid}'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \mathrm{sigmoid}(x)\,\big(1-\mathrm{sigmoid}(x)\big)$$
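As a quick sanity check, here is a minimal Python sketch (the function names are my own, not from any particular library) that implements the sigmoid and numerically verifies this derivative identity:

```python
import numpy as np

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

# Central-difference check of sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
analytic = sigmoid(x) * (1.0 - sigmoid(x))
print(np.max(np.abs(numeric - analytic)))  # tiny, so the identity holds
```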
The LR Model
Consider a binary classification task with $y\in\{0,1\}$. We have:
$$P(y=1\mid x;w,b) = \frac{e^{(w^\mathrm{T}x+b)}}{1+e^{(w^\mathrm{T}x+b)}}$$
$$P(y=0\mid x;w,b) = \frac{1}{1+e^{(w^\mathrm{T}x+b)}}$$
Note that $P(y=1\mid x;w,b)$ is exactly the prediction function $h_{w,b}(x)$ above, and the two probabilities sum to 1.
Since $w$ is a vector, we can absorb the bias into it: let $\Theta = (w_{1},w_{2},w_{3},w_{4},\dots,b)$, where each $w_{j}$ is the weight of feature $x_{j}$. Correspondingly, augment each sample as $\mathbf{x}=(x_{1},x_{2},x_{3},x_{4},\dots,1)$; the prediction function can then be rewritten as:
$$h_{\Theta}(x) = \frac{1}{1+e^{-\Theta^\mathrm{T}\mathbf{x}}}$$
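A minimal sketch of this bias-absorption trick (the helper names `augment` and `predict_proba` are my own): append a constant 1 to every sample, and $h_{\Theta}(x)$ becomes a single dot product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def augment(X):
    """Append a constant-1 column so the bias b rides along as the last entry of Theta."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict_proba(theta, X_aug):
    """h_Theta(x) = sigmoid(Theta^T x), computed row-wise over the whole design matrix."""
    return sigmoid(X_aug @ theta)
```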
Next, we estimate the model parameters $\Theta$ by maximum likelihood. Maximizing the likelihood is equivalent to minimizing the average negative log-likelihood, which gives the loss:
$$L(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log h_{\Theta}(x^{(i)})+(1-y^{(i)})\log\big(1-h_{\Theta}(x^{(i)})\big)\Big)$$
Our optimization objective is then the parameter vector that minimizes this loss:
$$\hat\Theta=\mathop{\arg\min}_{\Theta} L(\Theta)$$
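Building on the sketch above, the loss can be computed directly from the predicted probabilities (the small `eps` clip is my addition, to keep $\log(0)$ out of the picture):

```python
def loss(theta, X_aug, y, eps=1e-12):
    """Average negative log-likelihood L(Theta)."""
    h = np.clip(predict_proba(theta, X_aug), eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```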
Parameter Estimation
The LR loss is a convex, continuously differentiable function of $\Theta$, so gradient descent can be used to find its global minimum. Differentiating:
$$\frac{\partial L(\Theta)}{\partial \Theta_{j}}=-\frac{1}{m}\sum_{i=1}^{m}\left(\frac{y^{(i)}}{h_{\Theta}(x^{(i)})}-\frac{1-y^{(i)}}{1-h_{\Theta}(x^{(i)})}\right)\frac{\partial h_{\Theta}(x^{(i)})}{\partial \Theta_{j}}$$
By the sigmoid derivative property above, $\frac{\partial h_{\Theta}(x^{(i)})}{\partial \Theta_{j}} = h_{\Theta}(x^{(i)})\big(1-h_{\Theta}(x^{(i)})\big)x_{j}^{(i)}$, so the bracket collapses and
$$\frac{\partial L(\Theta)}{\partial \Theta_{j}}=-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}-h_{\Theta}(x^{(i)})\right)x_{j}^{(i)}$$
Each gradient-descent step then updates the parameters in the direction of the negative gradient:
$$\Theta_{j} := \Theta_{j} +\frac{\alpha}{m}\sum_{i=1}^{m}\left(y^{(i)}-h_{\Theta}(x^{(i)})\right)x_{j}^{(i)}$$
where $\alpha$ is the learning rate.
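Putting everything together, here is a self-contained batch-gradient-descent sketch; the learning rate, iteration count, and toy data are my own choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.1, n_iters=2000):
    """Fit logistic regression by batch gradient descent on the average NLL."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb bias: last entry of theta is b
    m, n = X_aug.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X_aug @ theta)           # h_Theta(x^(i)) for every sample
        grad = -(X_aug.T @ (y - h)) / m      # the gradient derived above
        theta -= lr * grad                   # i.e. theta_j += lr/m * sum (y - h) x_j
    return theta

# Tiny usage example on a linearly separable toy problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
theta = fit_lr(X, y)
h = sigmoid(np.hstack([X, np.ones((200, 1))]) @ theta)
print(((h > 0.5) == (y > 0.5)).mean())  # accuracy should be close to 1.0
```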