The Logistic Distribution
Definition: let $X$ be a continuous random variable. $X$ follows the logistic distribution if it has the following distribution function:
$$F(x)=P(X\leqslant x)=\frac{1}{1+e^{-(x-\mu)/\gamma}}$$

$$f(x)=\frac{\mathrm{d}F(x)}{\mathrm{d}x}=\frac{e^{-(x-\mu)/\gamma}}{\gamma\left(1+e^{-(x-\mu)/\gamma}\right)^2}$$
where $\mu$ is a location parameter and $\gamma>0$ is a scale parameter. The distribution function is an S-shaped curve symmetric about the point $(\mu,\tfrac{1}{2})$, and the density is bell-shaped with its peak at $x=\mu$. (The original figure is omitted here.)
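The two formulas above can be checked numerically: the density should agree with a finite-difference derivative of the distribution function. A minimal sketch (the default parameter values $\mu=0$, $\gamma=1$ are arbitrary choices for illustration):

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma))."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density f(x) = exp(-(x - mu)/gamma) / (gamma * (1 + exp(-(x - mu)/gamma))**2)."""
    e = np.exp(-(x - mu) / gamma)
    return e / (gamma * (1.0 + e) ** 2)

# F is symmetric about mu, so F(mu) = 1/2, and f is the derivative of F.
x = 1.5
h = 1e-6
numeric = (logistic_cdf(x + h) - logistic_cdf(x - h)) / (2 * h)
print(abs(numeric - logistic_pdf(x)) < 1e-6)  # central difference matches f(x)
```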
The Binomial Logistic Regression Model
The binomial logistic regression model is a classification model, represented by the conditional probability distribution $P(Y\mid X)$.
With the augmented notation $w=(w^{(1)},w^{(2)},\ldots,w^{(n)},b)^T$ and $x=(x^{(1)},x^{(2)},\ldots,x^{(n)},1)^T$, the conditional probability distribution of the model is:
$$\begin{aligned} P(Y=1\mid x)&=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)}\\ P(Y=0\mid x)&=\frac{1}{1+\exp(w\cdot x)} \end{aligned}$$
This is a logistic distribution. As the graph of the logistic distribution shows, the closer the linear function $w\cdot x$ is to positive infinity, the closer the probability is to 1; the closer it is to negative infinity, the closer the probability is to 0.
For a given input instance $x$, compute the two conditional probabilities above and assign $x$ to the class with the larger probability.
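The decision rule can be sketched directly from the two formulas. In the sketch below, the weight and input values are hypothetical (not fitted to any data); the trailing components are the bias $b$ and the constant 1 from the augmented notation:

```python
import numpy as np

def predict_binary(w, x):
    """Classify x by comparing P(Y=1|x) with P(Y=0|x).

    w and x are augmented vectors: w = (w1..wn, b), x = (x1..xn, 1).
    """
    z = np.dot(w, x)
    p1 = np.exp(z) / (1.0 + np.exp(z))   # P(Y=1|x)
    p0 = 1.0 / (1.0 + np.exp(z))         # P(Y=0|x)
    return (1 if p1 > p0 else 0), p1

# Illustrative weights (hypothetical values):
w = np.array([2.0, -1.0, 0.5])           # (w1, w2, b)
x = np.array([1.0, 0.3, 1.0])            # features plus the constant 1
label, p1 = predict_binary(w, x)
print(label, round(p1, 3))
```

Since $P(Y=1\mid x)>P(Y=0\mid x)$ exactly when $w\cdot x>0$, the comparison reduces to checking the sign of the linear function.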
The odds of an event are defined as the ratio of the probability that the event occurs to the probability that it does not occur, so the log-odds (logit) is:
$$\mathrm{logit}(p)=\log\frac{p}{1-p}$$
The log-odds of logistic regression is:
$$\log\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)}=\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)}=w\cdot x$$
This shows that in the logistic regression model, the log-odds of the output $Y=1$ is a linear function of the input $x$. Solving the equation above for the probability also recovers:
$$P(Y=1\mid x)=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)}$$
This is why logistic regression is also called "log-odds regression".
Parameter Estimation
Given a training set $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$ with $x_i\in\mathbb{R}^n$ and $y_i\in\{0,1\}$, the model parameters can be estimated by maximum likelihood.
Let:
$$P(Y=1\mid x)=\pi(x),\qquad P(Y=0\mid x)=1-\pi(x)$$
The likelihood function is:
$$\prod_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i}$$
The log-likelihood function is:
$$\begin{aligned} L(w)&=\log\prod_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i}\\ &=\sum_{i=1}^N\left[y_i\log\pi(x_i)+(1-y_i)\log(1-\pi(x_i))\right]\\ &=\sum_{i=1}^N\left[y_i\log\frac{\pi(x_i)}{1-\pi(x_i)}+\log(1-\pi(x_i))\right]\\ &=\sum_{i=1}^N\left[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))\right] \end{aligned}$$
Note: the last step uses the log-odds relation $\log\frac{\pi(x_i)}{1-\pi(x_i)}=w\cdot x_i$ together with $1-\pi(x_i)=\frac{1}{1+\exp(w\cdot x_i)}$.
Maximizing $L(w)$ yields the estimate of $w$. The problem thus becomes an optimization problem whose objective is the log-likelihood, which can be solved by Newton's method.
Maximizing the likelihood is equivalent to minimizing:
$$\ell(w)=\sum_{i=1}^N\left[-y_i(w\cdot x_i)+\log(1+\exp(w\cdot x_i))\right]$$
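This objective can be evaluated directly. A small sketch on made-up toy data (the samples and labels below are purely illustrative); `np.logaddexp(0, z)` computes $\log(1+e^z)$ without overflow for large $z$:

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """l(w) = sum_i [ -y_i (w . x_i) + log(1 + exp(w . x_i)) ].

    X holds one augmented sample x_i per row; y contains 0/1 labels.
    """
    z = X @ w
    return np.sum(-y * z + np.logaddexp(0.0, z))

# Toy data (hypothetical): two features plus the constant 1.
X = np.array([[0.5, 1.0, 1.0],
              [1.5, -0.2, 1.0],
              [-1.0, 0.8, 1.0]])
y = np.array([1, 1, 0])
print(neg_log_likelihood(np.zeros(3), X, y))  # at w = 0 every term is log 2
```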
$\ell(w)$ is a convex function of $w$ with continuous derivatives of all orders, so Newton's method can be applied iteratively.
The first and second derivatives of $\ell(w)$ with respect to $w$ are:
$$\begin{aligned} \frac{\partial\ell(w)}{\partial w}&=\sum_{i=1}^N\left[-y_ix_i+\frac{x_i\exp(w\cdot x_i)}{1+\exp(w\cdot x_i)}\right]\\ &=-\sum_{i=1}^N x_i\left(y_i-\frac{\exp(w\cdot x_i)}{1+\exp(w\cdot x_i)}\right)\\ &=-\sum_{i=1}^N x_i\left(y_i-P(Y=1\mid x_i)\right) \end{aligned}$$
$$\begin{aligned} \frac{\partial^2\ell(w)}{\partial w\,\partial w^T}&=\frac{\partial}{\partial w^T}\sum_{i=1}^N\frac{x_i\exp(w\cdot x_i)}{1+\exp(w\cdot x_i)}\\ &=\sum_{i=1}^N x_i\,\frac{(1+\exp(w\cdot x_i))\exp(w\cdot x_i)x_i^T-\exp(w\cdot x_i)\exp(w\cdot x_i)x_i^T}{(1+\exp(w\cdot x_i))^2}\\ &=\sum_{i=1}^N x_ix_i^T\,\frac{\exp(w\cdot x_i)}{(1+\exp(w\cdot x_i))^2}\\ &=\sum_{i=1}^N x_ix_i^T\,P(Y=1\mid x_i)\left(1-P(Y=1\mid x_i)\right) \end{aligned}$$
The update at iteration $t+1$ is:
$$w^{(t+1)}=w^{(t)}-\left(\frac{\partial^2\ell(w)}{\partial w\,\partial w^T}\right)^{-1}\frac{\partial\ell(w)}{\partial w}$$
with both derivatives evaluated at $w=w^{(t)}$.
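The gradient, Hessian, and update above translate almost line by line into code. A minimal sketch on toy one-dimensional data (the samples below are hypothetical; a production implementation would also add a convergence check and possibly regularization):

```python
import numpy as np

def newton_logistic(X, y, n_iter=10):
    """Fit w by Newton's method for the objective l(w) above.

    Gradient:  g = -sum_i x_i (y_i - p_i),   p_i = P(Y=1|x_i)
    Hessian:   H =  sum_i x_i x_i^T p_i (1 - p_i)
    Update:    w <- w - H^{-1} g
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # P(Y=1|x_i) for each i
        g = -X.T @ (y - p)                       # gradient of l(w)
        H = X.T @ (X * (p * (1 - p))[:, None])   # Hessian of l(w)
        w = w - np.linalg.solve(H, g)            # Newton step
    return w

# Toy data (hypothetical): one feature plus the constant 1.
X = np.array([[x, 1.0] for x in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = newton_logistic(X, y, n_iter=6)
print((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(int))
```

Solving $H\Delta = g$ with `np.linalg.solve` avoids forming the inverse Hessian explicitly, which is both cheaper and numerically safer than `np.linalg.inv`.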
Multinomial Logistic Regression
How is logistic regression used for multiclass classification? Mainly by combining several binary classifiers into a multiclass one. Suppose there are 4 classes to separate: A, B, C, and D. First choose one class as the reference class, say D. Then regress each of A, B, and C against the reference class D, which yields the following models:
(1) a binomial logistic regression model for classes A and D, with parameters $w_1$;
(2) a binomial logistic regression model for classes B and D, with parameters $w_2$;
(3) a binomial logistic regression model for classes C and D, with parameters $w_3$.
Then compute:
$$\begin{aligned} P(Y=k\mid x)&=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\quad k=1,2,3\\ P(Y=K\mid x)&=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\quad K=4 \end{aligned}$$
This gives the probabilities $P(Y=1\mid x),P(Y=2\mid x),P(Y=3\mid x),P(Y=4\mid x)$; the class with the largest probability is the final prediction.
This example shows the general recipe: for $K$-class classification, choose a reference class and run a binomial logistic regression between the reference class and each of the other $K-1$ classes. The multinomial logistic regression model is therefore:
$$\begin{aligned} P(Y=k\mid x)&=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\quad k=1,2,\ldots,K-1\\ P(Y=K\mid x)&=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)} \end{aligned}$$
Let us see how this model is derived. Using the log-odds of logistic regression, write the log-ratio of the probability of each of the $K-1$ outcomes to the probability of outcome $K$:
$$\begin{aligned} \ln\frac{P(Y=1\mid x)}{P(Y=K\mid x)}&=w_1\cdot x\\ \ln\frac{P(Y=2\mid x)}{P(Y=K\mid x)}&=w_2\cdot x\\ &\;\;\vdots\\ \ln\frac{P(Y=K-1\mid x)}{P(Y=K\mid x)}&=w_{K-1}\cdot x \end{aligned}$$
Therefore:
$$\begin{aligned} P(Y=1\mid x)&=P(Y=K\mid x)\exp(w_1\cdot x)\\ P(Y=2\mid x)&=P(Y=K\mid x)\exp(w_2\cdot x)\\ &\;\;\vdots\\ P(Y=K-1\mid x)&=P(Y=K\mid x)\exp(w_{K-1}\cdot x) \end{aligned}$$
These equations can be written compactly as:
$$P(Y=k\mid x)=P(Y=K\mid x)\exp(w_k\cdot x),\quad k=1,2,\ldots,K-1$$
Since the probabilities must sum to 1:
$$\begin{aligned} P(Y=K\mid x)&=1-\sum_{j=1}^{K-1}P(Y=j\mid x)\\ &=1-P(Y=K\mid x)\sum_{j=1}^{K-1}\exp(w_j\cdot x) \end{aligned}$$
Solving for $P(Y=K\mid x)$ gives:
$$P(Y=K\mid x)=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)}$$
Substituting $P(Y=K\mid x)$ into $P(Y=k\mid x),\ k=1,2,\ldots,K-1$, gives:
$$\begin{aligned} P(Y=k\mid x)&=P(Y=K\mid x)\exp(w_k\cdot x)\\ &=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)}\exp(w_k\cdot x)\\ &=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\quad k=1,2,\ldots,K-1 \end{aligned}$$
This yields the final multiclass model:
$$\begin{aligned} P(Y=k\mid x)&=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\quad k=1,2,\ldots,K-1\\ P(Y=K\mid x)&=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)} \end{aligned}$$
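The multiclass model above can be sketched in a few lines. The weight matrix below is hypothetical (one made-up weight vector per non-reference class, for $K=4$ as in the A/B/C/D example):

```python
import numpy as np

def multinomial_probs(W, x):
    """P(Y=k|x) for k = 1..K, with class K as the reference class.

    W has K-1 rows, one weight vector w_k per non-reference class;
    x is the augmented input (features plus a trailing 1).
    """
    scores = np.exp(W @ x)           # exp(w_k . x), k = 1..K-1
    denom = 1.0 + scores.sum()       # 1 + sum_j exp(w_j . x)
    return np.append(scores / denom, 1.0 / denom)

# Hypothetical weights for K = 4 classes (A, B, C vs reference class D):
W = np.array([[1.0, -0.5, 0.2],
              [0.3, 0.8, -0.1],
              [-0.7, 0.1, 0.4]])
x = np.array([0.5, 1.0, 1.0])        # two features plus the constant 1
p = multinomial_probs(W, x)
print(int(np.argmax(p)))             # index of the predicted class
```

By construction the $K$ probabilities sum to 1, and prediction is simply the argmax over them, matching the decision rule described in the example.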