Linear Regression
Given a dataset $\{(\boldsymbol x_1,y_1),\cdots,(\boldsymbol x_N,y_N)\}$ with $\boldsymbol x\in\mathbb R^n$, find $\boldsymbol w=(w_1, w_2, \cdots, w_n)$ such that the following regression model holds:

$$y \approx f(\boldsymbol x)=\boldsymbol w^T \boldsymbol x+b$$
Here $y$ is the true value and $f(\boldsymbol x)$ is the predicted value.
The matrix form of the model is
$$X= \left[\begin{array}{ccc|c} x_{11} &\cdots &x_{1n} &1\\ x_{21} &\cdots &x_{2n} &1\\ \vdots &\ddots &\vdots &\vdots\\ x_{N1} &\cdots &x_{Nn} &1 \end{array}\right],\quad \boldsymbol w= \left[\begin{array}{c} w_1 \\ w_2 \\ \vdots \\ b \end{array}\right]$$
Solving by least squares (minimizing the sum of squared errors):
$$\boldsymbol w^*=\arg \min_{\boldsymbol w}\,(\boldsymbol y-X\boldsymbol w)^T(\boldsymbol y-X\boldsymbol w)$$
Let $E=(\boldsymbol y-X\boldsymbol w)^T(\boldsymbol y-X\boldsymbol w)$. Then
$$\frac{\partial E}{\partial\boldsymbol w}=2X^T(X\boldsymbol w-\boldsymbol y)$$
Setting the partial derivative to $\boldsymbol 0$ at the extremum gives

$$\boldsymbol w^*=(X^T X)^{-1}X^T\boldsymbol y$$
When $X^T X$ is not full rank (more columns than rows, i.e., more features than samples), a regularization term must be introduced (see vector differentiation).
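As a minimal sketch of the closed-form solution (the synthetic data, the `rng` seed, and the ridge coefficient `lam` are illustrative assumptions, not from the text):

```python
import numpy as np

# Illustrative synthetic data: N = 100 samples, n = 3 features.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = X_raw @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so that b is absorbed into w, as in the text.
X = np.hstack([X_raw, np.ones((100, 1))])

# Ordinary least squares: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge-regularized variant for the rank-deficient case:
# w* = (X^T X + lam * I)^{-1} X^T y
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_star, w_ridge)
```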
More generally, a mapping $\boldsymbol x\to y$ can be expressed as

$$y=g^{-1}(\boldsymbol w^T\boldsymbol x+b)$$

where $g(\cdot)$ is monotonic and differentiable; the above is equivalent to $g(y)=\boldsymbol w^T\boldsymbol x+b$. For example, taking $g(\cdot)=\ln(\cdot)$ yields the log-linear model $y=e^{\boldsymbol w^T\boldsymbol x+b}$.
Logistic Regression
A binary classification task requires the model to output $y\in\{0,1\}$, while linear regression outputs a continuous value, so a suitable mapping function is needed to map the model output to 0/1.
Let $z=\boldsymbol w^T \boldsymbol x+b$. Ideally, $g^{-1}(z)$ would behave like the unit step function:

$$g^{-1}(z)= \begin{cases} 0, & z<0 \\ 0.5, & z=0 \\ 1, & z>0\end{cases}$$
For the binary task, let $y$ denote the probability that sample $\boldsymbol x$ is predicted positive, and define the model as

$$\ln\frac{y}{1-y}=\boldsymbol w^T\boldsymbol x + b \implies y=\dfrac{1}{1+e^{-(\boldsymbol w^T\boldsymbol x+b)}}$$
Here $\ln\frac{y}{1-y}$ is the log odds, and $y$ is given by the logistic (sigmoid) function; its output is equivalent to an estimate of the class posterior probability $p(y=1|\boldsymbol x)$.
The logistic function closely approximates the unit step function while being monotonic and differentiable.
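A quick visual check (a minimal plotting sketch; the range of $z$ is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 400)
sigmoid = 1.0 / (1.0 + np.exp(-z))
step = np.where(z < 0, 0.0, np.where(z > 0, 1.0, 0.5))

plt.plot(z, sigmoid, label='sigmoid')       # smooth and differentiable
plt.plot(z, step, '--', label='unit step')  # discontinuous at z = 0
plt.xlabel('z')
plt.legend()
plt.show()
```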
Suppose we imitated linear regression's way of solving for the parameters $(\boldsymbol w, b)$ and took the sum of squared errors as the cost function:

$$J(\boldsymbol w,b)=\sum_{i=1}^N (f_i - y_i)^2$$
With the sigmoid plugged in, this cost is non-convex and can have many local minima, which makes it hard to optimize.
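A minimal numeric counterexample (the single sample $x=1$, $y=0$ and the probe points $w_1=0$, $w_2=10$ are illustrative choices): convexity would require $J\big(\tfrac{w_1+w_2}{2}\big)\le\tfrac{J(w_1)+J(w_2)}{2}$, but the check below fails.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single sample x = 1, y = 0, b = 0, so J(w) = sigmoid(w)^2.
def J(w):
    return sigmoid(w) ** 2

w1, w2 = 0.0, 10.0
print(J((w1 + w2) / 2))      # ~0.987
print((J(w1) + J(w2)) / 2)   # ~0.625 -> midpoint value is larger: non-convex
```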
The sigmoid output estimates the class-1 posterior probability. To distinguish the model's prediction $\hat y$ from the true label $y$, take

$$\phi(z)=\phi(\boldsymbol w^T \boldsymbol x + b)=p(y=1|\boldsymbol x;\boldsymbol w, b)=\frac{1}{1+e^{-(\boldsymbol w^T \boldsymbol x + b)}}$$
Since $y\in\{0,1\}$ follows a Bernoulli (0-1) distribution, letting $z=\boldsymbol w^T \boldsymbol x + b$, the probability function is

$$p(y|\boldsymbol x;\boldsymbol w, b)=\phi(z)^{y}(1-\phi(z))^{1-y}$$
The model parameters are found by maximum likelihood estimation:

$$L(\boldsymbol w,b)=\prod_{i=1}^N p(y_i|\boldsymbol x_i;\boldsymbol w,b)=\prod_{i=1}^N\phi(z_i)^{y_i}(1-\phi(z_i))^{1-y_i}$$
Taking the logarithm,

$$\ln L(\boldsymbol w,b) =\sum_{i=1}^N {\large[}y_i\ln \phi(z_i) + (1-y_i) \ln(1-\phi(z_i)){\large]}$$
Maximizing the log-likelihood $\ln L$ is equivalent to minimizing $-\ln L$:

$$J(\boldsymbol w, b)=-\ln L (\boldsymbol w,b)=-\sum_{i=1}^N {\large[}y_i\ln \phi(z_i) + (1-y_i) \ln(1-\phi(z_i)){\large]}$$
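As a minimal sketch of this cost (the function name `cost` and the `eps` guard against $\ln 0$ are my additions; `X_hat` uses the augmented convention introduced just below):

```python
import numpy as np

def cost(w_hat, X_hat, y):
    """Negative log-likelihood J from the text.

    X_hat: (N, n+1) inputs with a trailing column of ones.
    w_hat: (n+1,) parameters with b as the last entry.
    y:     (N,) labels in {0, 1}.
    """
    phi = 1.0 / (1.0 + np.exp(-(X_hat @ w_hat)))
    eps = 1e-12  # avoid log(0)
    return -np.sum(y * np.log(phi + eps) + (1 - y) * np.log(1 - phi + eps))
```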
Let $\boldsymbol {\hat w}=(\boldsymbol w, b)$ and $\boldsymbol {\hat x}=(\boldsymbol x, 1)$, so that $\boldsymbol{\hat w}^T\boldsymbol{\hat x}=\boldsymbol w^T\boldsymbol x + b$. From $\phi(z)=1/(1+e^{-z})$ it follows that $\phi'(z)=\phi(z)(1-\phi(z))$, hence

$$\begin{aligned} \frac{\partial J}{\partial \boldsymbol {\hat w}} & =-\sum_{i=1}^N{\big[}y_i \frac{1}{\phi(z_i)} - (1-y_i) \frac{1}{1-\phi(z_i)}{\big]}\frac{\partial \phi(z_i)}{\partial \boldsymbol{\hat w}} \\ & =-\sum_{i=1}^N{\big[}y_i \frac{1}{\phi(z_i)} - (1-y_i) \frac{1}{1-\phi(z_i)}{\big]}\phi(z_i)(1-\phi(z_i)) \frac{\partial z_i}{\partial \boldsymbol{\hat w}} \\ & = -\sum_{i=1}^N(y_i-\phi(z_i))\boldsymbol {\hat x}_i \end{aligned}$$
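In code, the last line of the derivation collapses into a single matrix product (a sketch matching `cost` above; the name `grad` is my addition):

```python
import numpy as np

def grad(w_hat, X_hat, y):
    """Gradient dJ/dw_hat = -sum_i (y_i - phi(z_i)) x_hat_i = -X_hat^T (y - phi)."""
    phi = 1.0 / (1.0 + np.exp(-(X_hat @ w_hat)))
    return -X_hat.T @ (y - phi)
```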
Understanding the Cost Function
Since $y \in\{0, 1\}$, the cost function can be written equivalently as

$$J(\boldsymbol w,b)= \begin{cases} -\ln \phi(z), & y=1 \\ -\ln(1-\phi(z)), & y=0 \end{cases}$$
If the true label is 1, the closer the estimate $\phi(z)$ is to 1, the smaller the cost;
if the true label is 0, the closer the estimate $\phi(z)$ is to 0, the smaller the cost.
Loss Function and Cross-Entropy
For a true distribution $P(X)$ and a predicted distribution $Q(X)$, the cross-entropy is defined as

$$H(P,Q) = -\sum_i P(\boldsymbol x_i)\ln (Q(\boldsymbol x_i))$$
The cross-entropy attains its minimum if and only if $Q(X)$ has the same distribution as $P(X)$. Since $y \in\{0, 1\}$ and $\phi(z) \in [0, 1]$, the loss function obtained from logistic regression's maximum likelihood estimation is equivalent to the cross-entropy!
Updating the Model Parameters
The model parameters are solved by steepest gradient descent, with iterative update

$$\boldsymbol{\hat w}_{t+1}=\boldsymbol{\hat w}_t + \Delta \boldsymbol{\hat w}_t,\quad\Delta\boldsymbol{\hat w}_t=-\eta\,\frac{\partial J(\boldsymbol{\hat w}_t)}{\partial \boldsymbol{\hat w}}$$
where $\eta$ is the learning rate. Therefore

$$\boldsymbol{\hat w}_{t+1}=\boldsymbol{\hat w}_{t}+\eta\sum_{i=1}^N(y_i-\phi(z_i))\boldsymbol{\hat x}_{i}$$
In matrix form,

$$\boldsymbol{\hat w}_{t+1} = \boldsymbol{\hat w}_{t}+\eta \sum_{i=1}^N e_i \cdot \boldsymbol{\hat x}_{i}= \boldsymbol{\hat w}_{t} + \eta \hat X^T E$$
where $e_i=y_i-\phi(z_i)$ is the difference between the true and predicted values, $E=(e_1,\cdots,e_N)^T$, and $\hat X^T =(\boldsymbol {\hat x}_1,\cdots, \boldsymbol {\hat x}_N)$.
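The update rule translates directly into a short loop (a sketch only; the name `fit`, the learning rate `eta`, and the iteration count are illustrative defaults, and the full implementation follows in the next section):

```python
import numpy as np

def fit(X_hat, y, eta=0.01, iters=1000):
    """Batch gradient descent on the augmented parameters (w; b)."""
    w_hat = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        phi = 1.0 / (1.0 + np.exp(-(X_hat @ w_hat)))  # phi(z_i) for every sample
        w_hat += eta * X_hat.T @ (y - phi)            # w <- w + eta * X^T E
    return w_hat
```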
Python Implementation
```python
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt


def load_data_sets():
    """
    Load the dataset.
    n samples, m features.
    Returns X with shape (n, m) and y with shape (n,).
    """
    X, y = [], []
    with open('testSet.txt', encoding='utf8') as f:
        for line in f:
            line = line.strip().split()
            if len(line):
                X.append(line[:-1])
                y.append(line[-1])
    return np.array(X, dtype=float), np.array(y, dtype=float)


def plt_data_sets(X, y, w):
    """
    Scatter plot of the dataset.
    X: inputs, shape (n, 2)
    y: labels, shape (n,)
    w: model parameters [b, w1, w2]
    """
    true = y == 1
    x1, x2 = X[:, 0], X[:, 1]
    plt.scatter(x1[true], x2[true], marker='+', c='b')
    plt.scatter(x1[~true], x2[~true], marker='o', c='r')
    # Add the decision boundary, w^T * x + b = 0.
    # With two features: w1*x1 + w2*x2 + b = 0, where x2 is the vertical axis,
    # i.e. x2 = -(w1*x1 + b) / w2
    x1 = np.arange(x1.min(), x1.max(), 0.1)
    x2 = -(w[0] + w[1] * x1) / w[2]
    plt.plot(x1, x2)
    plt.show()


class LogisticRegression:
    """Logistic regression."""

    def __init__(self, max_iter=300, tol=1e-3, alpha=0.001):
        """
        Constructor.
        :param max_iter: maximum number of iterations
        :param tol: error tolerance for early stopping
        :param alpha: learning rate
        """
        self.max_iter = max_iter
        self.tol = tol
        self.alpha = alpha
        self.w = None
        self.b = None

    def train(self, X, y):
        """
        Train the model.
        :param X: inputs, shape (n, m)
        :param y: labels, shape (n,)
        :return: None; the learned parameters are stored in self.w and self.b
        """
        n, m = np.shape(X)
        # Augment: X = (X, 1), w = (w; b), so that w^T*x + b => w^T*X
        X = np.mat(np.concatenate((X, np.ones((n, 1))), axis=1))
        y = np.mat(y).T
        w = np.ones((m + 1, 1))
        alpha = self.alpha
        tol = self.tol
        records = []
        for i in range(self.max_iter):
            # Difference between true values and predictions
            error = y - 1.0 / (1 + np.exp(-X * w))
            # Record the squared error
            records.append((error.T * error)[0, 0])
            if records[-1] <= tol:
                break
            # Update the parameters along the negative gradient
            w += alpha * X.T * error
            # Live error curve (uncomment to watch training)
            """
            plt.title('Error Curve')
            plt.xlabel('iterations')
            plt.ylabel('error')
            plt.plot(range(len(records)), records, 'b')
            plt.pause(0.01)
            """
        self.w = w[:-1]
        self.b = w[-1]

    def prob(self, X):
        """Probability of class 1."""
        if self.w is None or self.b is None:
            raise Exception('Please train the model.')
        # n*1 matrix
        results = 1.0 / (1 + np.exp(-np.mat(X) * self.w - self.b))
        # Convert the n*1 matrix to a length-n array
        return np.array([res[0] for res in np.array(results)])

    def predict(self, X):
        """Predict class labels."""
        prob = self.prob(X)
        # Class-1 probability below 0.5 => 0, otherwise 1
        return np.where(prob < 0.5, 0, 1)

    def score(self, X, y):
        """
        Accuracy on the given inputs.
        :param X: inputs
        :param y: true labels
        :return: fraction of correct predictions
        """
        predict = self.predict(X)
        # Count misclassified samples (the original np.where(...)[0][0]
        # returned the index of the first error, not the error count)
        error_num = np.count_nonzero(predict != y)
        return 1 - 1. * error_num / len(y)


if __name__ == '__main__':
    # Binary classification test
    X, y = load_data_sets()
    # Split the dataset
    num = int(len(X) * 0.7)
    LR = LogisticRegression()
    LR.train(X[:num], y[:num])
    # Predict outputs
    predict = LR.predict(X[num:])
    # Compute the score
    score = LR.score(X[num:], y[num:])
    print(score)
    # Plot the decision boundary
    plt_data_sets(X, y, (LR.b, LR.w[0], LR.w[1]))
```
Model training results (figure omitted).
Dataset download link: https://pan.baidu.com/s/1Xd-DXpw8xvm9mVWuQ0CU8Q