Preface
Although Logistic Regression (LR) is called "regression," it is in fact a classification model, most commonly used for binary classification. LR is also a model that interviewers frequently ask candidates to derive by hand. This article derives the relevant LR formulas from several angles; I hope it helps (to be continued).
Model Formulation
Start from the definition of linear regression, where $h_{\theta}(x)$ is the predicted value:
$$
h_{\theta}(x)=\sum_{i=0}^{n}\theta_i x_i=\theta^T x
$$
The sigmoid function (an S-shaped function):
$$
g(x)=\frac{1}{1+e^{-x}}
$$
Differentiating the sigmoid yields the following result (used later in the derivation):
$$
\begin{aligned}
g^{\prime}(x)&=\left( \frac{1}{1+e^{-x}} \right)^{\prime}=\frac{e^{-x}}{\left( 1+e^{-x} \right)^2}\\
&=\frac{1}{1+e^{-x}}\cdot \frac{e^{-x}}{1+e^{-x}}\\
&=\frac{1}{1+e^{-x}}\cdot \left( 1-\frac{1}{1+e^{-x}} \right)\\
&=g(x)\cdot \left( 1-g(x) \right)
\end{aligned}
$$
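As a quick sanity check, the identity $g'(x)=g(x)(1-g(x))$ can be verified numerically against a central finite difference (a minimal sketch; the function names are my own):

```python
import numpy as np

def sigmoid(x):
    """g(x) = 1 / (1 + e^{-x})"""
    return 1.0 / (1.0 + np.exp(-x))

# Analytic derivative g(x) * (1 - g(x)) vs. a central finite difference.
xs = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
analytic = sigmoid(xs) * (1.0 - sigmoid(xs))
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2.0 * eps)
print(np.max(np.abs(analytic - numeric)))  # on the order of 1e-11
```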
Substituting $\theta^T x$ into the sigmoid function gives the hypothesis:
$$
h_{\theta}(x)=g\left( \theta^T x \right) =\frac{1}{1+e^{-\theta^T x}}
$$
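In code, the hypothesis is just the sigmoid applied to the linear predictor. A minimal sketch, assuming `X` is an $m\times(n+1)$ design matrix whose first column is all ones (the intercept feature $x_0=1$) and `theta` is the parameter vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x), evaluated for every row of X at once."""
    return sigmoid(X @ theta)
```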
Model Fitting via Maximum Likelihood
A generic binary classification problem can be written as:
$$
\begin{aligned}
P(y=1\mid x;\theta )&=h_{\theta}(x)\\
P(y=0\mid x;\theta )&=1-h_{\theta}(x)
\end{aligned}
$$
Here $h_{\theta}(x)$, parameterized by $\theta$, plays the role of the probability that $y=1$.
The two expressions can be merged into a single formula:
$$
p(y\mid x;\theta )=\left( h_{\theta}(x) \right)^y \left( 1-h_{\theta}(x) \right)^{1-y}
$$
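The exponents $y$ and $1-y$ simply switch one factor off, which is easy to sanity-check (a toy snippet with an arbitrary probability value):

```python
h = 0.8  # an arbitrary value of h_theta(x) = P(y=1 | x; theta)
for y in (0, 1):
    p = (h ** y) * ((1 - h) ** (1 - y))
    print(y, p)  # y=1 -> 0.8 (= h); y=0 -> 0.2 (= 1-h), up to float rounding
```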
Assuming the samples are independent, the likelihood function is:
$$
\begin{aligned}
L(\theta )&=p(\vec{y}\mid X;\theta )\\
&=\prod_{i=1}^m p\left( y^{(i)}\mid x^{(i)};\theta \right)\\
&=\prod_{i=1}^m \left( h_{\theta}\left( x^{(i)} \right) \right)^{y^{(i)}} \left( 1-h_{\theta}\left( x^{(i)} \right) \right)^{1-y^{(i)}}
\end{aligned}
$$
Taking the logarithm of both sides:
$$
l(\theta )=\log L(\theta )=\sum_{i=1}^m y^{(i)}\log h\left( x^{(i)} \right) +\left( 1-y^{(i)} \right) \log \left( 1-h\left( x^{(i)} \right) \right)
$$
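This log-likelihood translates directly into a vectorized NumPy expression. A sketch under the same assumptions as before (design matrix `X`, labels `y` in {0, 1}); the clipping is only there to guard the logarithm when predictions saturate:

```python
import numpy as np

def log_likelihood(theta, X, y, eps=1e-12):
    """l(theta) = sum_i [ y_i * log h_i + (1 - y_i) * log(1 - h_i) ]."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0) at saturated predictions
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```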
Now take the partial derivative (note: with respect to a single component $\theta_j$):
$$
\begin{aligned}
\frac{\partial l(\theta )}{\partial \theta_j}&=\sum_{i=1}^m \left( \frac{y^{(i)}}{h\left( x^{(i)} \right)}-\frac{1-y^{(i)}}{1-h\left( x^{(i)} \right)} \right) \cdot \frac{\partial h\left( x^{(i)} \right)}{\partial \theta_j}\\
&=\sum_{i=1}^m \left( \frac{y^{(i)}}{g\left( \theta^T x^{(i)} \right)}-\frac{1-y^{(i)}}{1-g\left( \theta^T x^{(i)} \right)} \right) \cdot \frac{\partial g\left( \theta^T x^{(i)} \right)}{\partial \theta_j}\\
&=\sum_{i=1}^m \left( \frac{y^{(i)}}{g\left( \theta^T x^{(i)} \right)}-\frac{1-y^{(i)}}{1-g\left( \theta^T x^{(i)} \right)} \right) \cdot g\left( \theta^T x^{(i)} \right) \cdot \left( 1-g\left( \theta^T x^{(i)} \right) \right) \cdot \frac{\partial\, \theta^T x^{(i)}}{\partial \theta_j}\\
&=\sum_{i=1}^m \left( y^{(i)}\left( 1-g\left( \theta^T x^{(i)} \right) \right) -\left( 1-y^{(i)} \right) g\left( \theta^T x^{(i)} \right) \right) \cdot x_j^{(i)}\\
&=\sum_{i=1}^m \left( y^{(i)}-g\left( \theta^T x^{(i)} \right) \right) \cdot x_j^{(i)}
\end{aligned}
$$
Note that differentiating the final factor $\theta^T x^{(i)}$ with respect to $\theta_j$ leaves only $x_j^{(i)}$.
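Stacking this result over all components $j$ gives the whole gradient as a single matrix product, built from the residual $y^{(i)}-g(\theta^T x^{(i)})$. A vectorized sketch (helper names are my own):

```python
import numpy as np

def gradient(theta, X, y):
    """dl/dtheta_j = sum_i (y_i - g(theta^T x_i)) * x_ij, for all j at once."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (y - h)  # shape (n+1,), one entry per theta_j
```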
To find the maximum likelihood estimate we move along the gradient; since we are maximizing, this is gradient ascent (hence the plus sign). The per-sample (stochastic) update is:
$$
\theta_j := \theta_j+\alpha \left( y^{(i)}-h_{\theta}\left( x^{(i)} \right) \right) x_j^{(i)}
$$
Compare the gradient update rules of linear regression and logistic regression:
$$
\theta_j := \theta_j+\alpha \sum_{i=1}^m \left( y^{(i)}-h_{\theta}\left( x^{(i)} \right) \right) x_j^{(i)}
$$
$$
\theta_j := \theta_j+\alpha \left( y^{(i)}-h_{\theta}\left( x^{(i)} \right) \right) x_j^{(i)}
$$
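Putting the pieces together, a minimal batch gradient-ascent trainer might look like the sketch below (not a production implementation; the learning rate, iteration count, and toy data are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, n_iters=2000):
    """Batch gradient ascent on the log-likelihood l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta)) / len(y)  # averaged gradient
        theta += alpha * grad  # ascent: move *up* the gradient
    return theta

# Toy usage: a 1-D problem with an intercept column of ones.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(fit_logistic(X, y))  # large positive slope: P(y=1) grows with x
```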
A Few Thoughts
Observe that the hypothesis $h_{\theta}$ differs between the two models, yet the learning rule has exactly the same form. So what is the real difference? Logistic regression assumes the labels follow a binomial (Bernoulli) distribution, while linear regression assumes a Gaussian. Together with the Poisson distribution, these share a common property: all belong to the exponential family, and the corresponding models are all generalized linear models, which is why the update rules coincide in form.
The Loss-Function Perspective
A. Suppose the labels take values in $\{-1, 1\}$:
$$
y_i\in \{-1,1\},\qquad
\hat{y}_i=\begin{cases} p_i & y_i=1\\ 1-p_i & y_i=-1 \end{cases}
$$
The two cases can be packed into a single likelihood:
$$
L(\theta )=\prod_{i=1}^m p_i^{\,(y_i+1)/2} \left( 1-p_i \right)^{-(y_i-1)/2}
$$
Taking the logarithm of both sides:
$$
l(\theta )=\sum_{i=1}^m \ln \left[ p_i^{\,(y_i+1)/2} \left( 1-p_i \right)^{-(y_i-1)/2} \right]
$$
Substitute $p_i=\frac{1}{1+e^{-f_i}}$ and simplify the second factor using $1-p_i=\frac{1}{1+e^{f_i}}$:
$$
l(\theta )=\sum_{i=1}^m \ln \left[ \left( \frac{1}{1+e^{-f_i}} \right)^{(y_i+1)/2} \left( \frac{1}{1+e^{f_i}} \right)^{-(y_i-1)/2} \right]
$$
Maximizing this is maximum likelihood estimation; negating it and minimizing gives the negative log-likelihood, which is the loss function:
$$
\therefore\ \mathrm{loss}\left( y_i,\hat{y}_i \right) =-l(\theta )
=\sum_{i=1}^m \left[ \frac{1}{2}\left( y_i+1 \right) \ln \left( 1+e^{-f_i} \right) -\frac{1}{2}\left( y_i-1 \right) \ln \left( 1+e^{f_i} \right) \right]
$$
Writing it out as the two cases:
$$
=\begin{cases}
\sum_{i=1}^m \ln \left( 1+e^{-f_i} \right) & y_i=1\\[4pt]
\sum_{i=1}^m \ln \left( 1+e^{f_i} \right) & y_i=-1
\end{cases}
$$
Observe that in both cases the exponent is exactly $-y_i f_i$ (multiplying $f_i$ by $y_i\in\{-1,1\}$ produces the required sign flip), so the two branches can be written as one. The final loss function:
$$
\Rightarrow\ \mathrm{loss}\left( y_i,\hat{y}_i \right) =\sum_{i=1}^m \ln \left( 1+e^{-y_i\cdot f_i} \right)
$$
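This unified form is also numerically convenient: $\ln(1+e^{z})$ is exactly `np.logaddexp(0, z)`, which avoids overflow for large $|z|$. A sketch of the $\{-1,1\}$ loss (the name `f` stands for the vector of raw scores $f_i=\theta^T x^{(i)}$, as above):

```python
import numpy as np

def loss_pm1(y, f):
    """sum_i ln(1 + e^{-y_i * f_i}) for labels y_i in {-1, +1}."""
    return np.sum(np.logaddexp(0.0, -y * f))  # stable even for large |f|
```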
B. If the labels instead take values in $\{0, 1\}$:
$$
y_i\in \{0,1\},\qquad
\hat{y}_i=\begin{cases} p_i & y_i=1\\ 1-p_i & y_i=0 \end{cases}
$$
then the loss function is derived as follows:
$$
L(\theta )=\prod_{i=1}^m p_i^{\,y_i} \left( 1-p_i \right)^{1-y_i}
$$
$$
\Rightarrow\ l(\theta )=\sum_{i=1}^m \ln \left[ p_i^{\,y_i} \left( 1-p_i \right)^{1-y_i} \right]
$$
Substituting $p_i=\frac{1}{1+e^{-f_i}}$:
$$
l(\theta )=\sum_{i=1}^m \ln \left[ \left( \frac{1}{1+e^{-f_i}} \right)^{y_i} \left( \frac{1}{1+e^{f_i}} \right)^{1-y_i} \right]
$$
$$
\therefore\ \mathrm{loss}\left( y_i,\hat{y}_i \right) =-l(\theta )
=\sum_{i=1}^m \left[ y_i\ln \left( 1+e^{-f_i} \right) +\left( 1-y_i \right) \ln \left( 1+e^{f_i} \right) \right]
$$
The two formulations say the same thing; the first ($\{-1,1\}$ labels) is simply more compact.
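That equivalence is easy to confirm numerically via the relabeling $y\in\{0,1\}\mapsto 2y-1\in\{-1,1\}$ (a minimal check; helper names are my own):

```python
import numpy as np

def loss_01(y, f):
    """sum_i [ y_i ln(1+e^{-f_i}) + (1-y_i) ln(1+e^{f_i}) ], y_i in {0,1}."""
    return np.sum(y * np.logaddexp(0.0, -f) + (1.0 - y) * np.logaddexp(0.0, f))

def loss_pm1(y, f):
    """sum_i ln(1 + e^{-y_i * f_i}), y_i in {-1,+1}."""
    return np.sum(np.logaddexp(0.0, -y * f))

rng = np.random.default_rng(1)
f = rng.normal(size=10)                           # raw scores f_i
y01 = rng.integers(0, 2, size=10).astype(float)   # labels in {0,1}
print(np.isclose(loss_01(y01, f), loss_pm1(2.0 * y01 - 1.0, f)))  # True
```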
Every sample has a true label and an estimated value, so logistic regression can also be interpreted from the perspective of cross-entropy.