Logistic Regression

Although it is called "regression", logistic regression is actually a classification model, most commonly used for binary classification.
1. Logistic Distribution
Let $X$ be a continuous random variable. $X$ follows a logistic distribution if $X$ has the following distribution function and density function:
$$F(x) = P(X \leq x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$
The distribution function is a sigmoid curve. It is symmetric about the point $(\mu, \frac{1}{2})$, grows quickly near the center, and grows slowly at both ends. The smaller the shape parameter $\gamma$, the faster the curve grows near the center.
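As a quick numerical check, here is a minimal sketch of both functions in NumPy, using the illustrative values $\mu = 0$ and $\gamma = 1$ (chosen arbitrarily for this example):

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    # F(x) = 1 / (1 + exp(-(x - mu) / gamma))
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    # f(x) = exp(-(x - mu) / gamma) / (gamma * (1 + exp(-(x - mu) / gamma))^2)
    z = np.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

print(logistic_cdf(0.0))  # 0.5: the center of symmetry (mu, 1/2)
print(logistic_pdf(0.0))  # 0.25: the density peaks at x = mu
```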
2. Binomial Logistic Regression Model
The binomial logistic regression model is the following conditional probability distribution:
$$P(Y=1|x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$$

$$P(Y=0|x) = \frac{1}{1 + \exp(w \cdot x + b)}$$
where $x \in \mathbf{R}^n$ is the input, $Y \in \{0, 1\}$ is the output, and $w \in \mathbf{R}^n$ and $b \in \mathbf{R}$ are the parameters: $w$ is the weight vector, $b$ is the bias, and $w \cdot x$ is the inner product of $w$ and $x$.
For convenience, the weight vector and input vector are sometimes augmented, still written as $w$ and $x$, i.e.

$$w = (w^{(1)}, w^{(2)}, \ldots, w^{(n)}, b)^T, \quad x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)}, 1)^T$$

The logistic regression model then becomes:
$$P(Y=1|x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}$$

$$P(Y=0|x) = \frac{1}{1 + \exp(w \cdot x)}$$
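A minimal sketch of this augmentation in NumPy (the values of `w`, `b`, and `x` below are illustrative): folding the bias into the weight vector leaves the model unchanged.

```python
import numpy as np

w = np.array([0.5, -1.2])  # weight vector (illustrative values)
b = 0.3                    # bias
x = np.array([2.0, 1.0])   # input

# Augmented forms: append b to w and a constant 1 to x
w_aug = np.append(w, b)    # (w^(1), ..., w^(n), b)^T
x_aug = np.append(x, 1.0)  # (x^(1), ..., x^(n), 1)^T

# Both expressions give the same P(Y=1|x)
p1 = np.exp(w @ x + b) / (1 + np.exp(w @ x + b))
p2 = np.exp(w_aug @ x_aug) / (1 + np.exp(w_aug @ x_aug))
assert np.isclose(p1, p2)
```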
The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur. If the probability of the event is $p$, its odds are $\frac{p}{1-p}$, and its log-odds (logit) function is

$$\mathrm{logit}(p) = \log \frac{p}{1-p}$$
For logistic regression,

$$\log \frac{P(Y=1|x)}{1 - P(Y=1|x)} = w \cdot x$$
That is, in the logistic regression model, the log-odds of the output $Y=1$ is a linear function of the input $x$. Viewed another way, the linear function $w \cdot x$ used to classify the input $x$ has the whole real line as its range; the logistic regression model converts this linear function into a probability:

$$P(Y=1|x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}$$

The closer the value of the linear function is to positive infinity, the closer the probability is to 1; the closer it is to negative infinity, the closer the probability is to 0.
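The sketch below (with arbitrary illustrative parameters) checks this relationship numerically: taking the log-odds of $P(Y=1|x)$ recovers the linear function $w \cdot x$.

```python
import numpy as np

w = np.array([1.5, -0.8, 0.2])  # augmented weights (illustrative)
x = np.array([0.7, 2.0, 1.0])   # augmented input (last entry is 1)

p = np.exp(w @ x) / (1 + np.exp(w @ x))  # P(Y=1|x)
log_odds = np.log(p / (1 - p))           # logit of P(Y=1|x)
assert np.isclose(log_odds, w @ x)       # the log-odds is linear in x
```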
3. Parameter Estimation
Given a training dataset

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where $x_i \in \mathbf{R}^n$ and $y_i \in \{0, 1\}$, the model parameters can be estimated by maximum likelihood, yielding the logistic regression model.
Let:

$$P(Y=1|x) = \pi(x), \quad P(Y=0|x) = 1 - \pi(x)$$
The likelihood function is:

$$\prod_{i=1}^N [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1-y_i}$$
The log-likelihood function is:

$$\begin{aligned} L(w) &= \sum_{i=1}^N \left[ y_i \log \pi(x_i) + (1-y_i) \log(1 - \pi(x_i)) \right] \\ &= \sum_{i=1}^N \left[ y_i \log \frac{\pi(x_i)}{1 - \pi(x_i)} + \log(1 - \pi(x_i)) \right] \\ &= \sum_{i=1}^N \left[ y_i (w \cdot x_i) - \log(1 + \exp(w \cdot x_i)) \right] \end{aligned}$$
Maximizing $L(w)$ gives the estimate of $w$. Gradient descent and quasi-Newton methods are the usual choices.
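As an illustration only (not a reference implementation), here is a minimal gradient-ascent sketch that maximizes $L(w)$ directly; the gradient of the log-likelihood is $\nabla L(w) = \sum_i (y_i - \pi(x_i)) x_i$, and the learning rate and iteration count below are arbitrary choices.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Maximize the log-likelihood L(w) by gradient ascent on augmented data."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append constant 1 for the bias
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        pi = 1.0 / (1.0 + np.exp(-X_aug @ w))  # pi(x_i) = P(Y=1|x_i)
        w += lr * X_aug.T @ (y - pi)           # gradient: sum_i (y_i - pi(x_i)) x_i
    return w

# Tiny illustrative dataset: one feature, two classes
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic(X, y))  # learned (weight, bias)
```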
4. Multinomial Logistic Regression
Suppose the discrete random variable $Y$ takes values in the set $\{1, 2, \ldots, K\}$. The multinomial logistic regression model is:
$$P(Y=k|x) = \frac{\exp(w_k \cdot x)}{1 + \sum_{j=1}^{K-1} \exp(w_j \cdot x)}, \quad k = 1, 2, \ldots, K-1$$

$$P(Y=K|x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_j \cdot x)}$$
where $x \in \mathbf{R}^{n+1}$ and $w_k \in \mathbf{R}^{n+1}$.
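A minimal sketch of these $K$ class probabilities in NumPy, assuming the $K-1$ weight vectors are stacked into a matrix `W` (the values below are illustrative); the probabilities sum to 1 by construction.

```python
import numpy as np

K = 3                               # number of classes (illustrative)
W = np.array([[0.4, -0.2, 0.1],     # w_1 in R^{n+1} (augmented)
              [-0.3, 0.5, 0.0]])    # w_2
x = np.array([1.0, 2.0, 1.0])       # augmented input (last entry is 1)

scores = np.exp(W @ x)              # exp(w_k . x) for k = 1, ..., K-1
denom = 1.0 + scores.sum()
p = np.append(scores, 1.0) / denom  # P(Y=k|x) for k = 1, ..., K
assert np.isclose(p.sum(), 1.0)
print(p)
```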
5. Algorithm Implementation
```python
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df
```
|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|-----|-------------------|------------------|-------------------|------------------|-------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |

150 rows × 5 columns
```python
# Visualize the first two classes against the first two features
x_idx = iris.feature_names[0]
y_idx = iris.feature_names[1]
plt.scatter(df[:50][x_idx], df[:50][y_idx], label='0')
plt.scatter(df[50:100][x_idx], df[50:100][y_idx], label='1')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()
plt.show()
```
```python
# Prepare the data: first 100 rows (classes 0 and 1), first two features
data = np.array(df.iloc[:100, [0, 1, -1]])
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

test_point = [[6, 3]]
plt.scatter(df[:50][x_idx], df[:50][y_idx], label='0')
plt.scatter(df[50:100][x_idx], df[50:100][y_idx], label='1')
plt.plot(test_point[0][0], test_point[0][1], 'bo', label='test_point')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()
plt.show()
```
```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.predict(test_point)  # array([1.])
```
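The fitted classifier predicts class 1 for the test point. One could also check accuracy on the held-out split (the value will vary between runs, since no `random_state` is fixed in `train_test_split`):

```python
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```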