8 数值预测 - 回归(Predicting numeric values regression)
8.1 利用线性逻辑回归计算最佳匹配边界(Finding best-fit lines with linear regression)
支持向量机(support vector machines):
优点:结果容易解释,计算成本低
缺点:处理线性不可分数据表现差
适用范围:数值和标称值
回归(regression):计算回归权值的过程
回归步骤:
- 收集数据
- 准备:数值型数据,标称值需映射为二进制数值
- 分析:
- 训练:计算回归权值
- 测试:计算 L 2 L_2 L2范数或预测值与数据的相关性,用以评估
- 使用:
样本数据:
X
=
[
x
i
T
]
T
\mathbf{X} = \left[ \mathbf{x}_i^{\mathrm{T}} \right]^{\mathrm{T}}
X=[xiT]T,
x
i
T
=
[
x
i
,
0
,
x
i
,
1
,
x
i
,
2
,
⋯
 
,
x
i
,
n
]
\mathbf{x}_i^{\mathrm{T}} = \left[ x_{i, 0}, x_{i, 1}, x_{i, 2}, \cdots, x_{i, n} \right]
xiT=[xi,0,xi,1,xi,2,⋯,xi,n]
权值向量:
w
T
=
[
w
0
,
w
1
,
w
2
,
⋯
 
,
w
n
]
\mathbf{w}^{\mathrm{T}} = \left[ w_0, w_1, w_2, \cdots, w_n \right]
wT=[w0,w1,w2,⋯,wn]
标签向量:
y
T
=
[
y
1
,
y
2
,
⋯
 
,
y
m
]
\mathbf{y}^{\mathrm{T}} = \left[ y_1, y_2, \cdots, y_m \right]
yT=[y1,y2,⋯,ym]
平方误差损失函数:
L
=
∑
i
=
1
m
(
y
i
−
x
i
T
w
)
2
=
(
y
−
X
w
)
T
(
y
−
X
w
)
{L} = \sum_{i = 1}^{m} \left( y_i - \mathbf{x}^{\mathrm{T}}_i \mathbf{w} \right)^2 = \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)^{\mathrm{T}} \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)
L=∑i=1m(yi−xiTw)2=(y−Xw)T(y−Xw)
平方误差损失函数关于 w \mathbf{w} w的偏导数(partial derivative):
∂ L ∂ w = − 2 ( y − X w ) T X \frac{\partial {L}}{\partial \mathbf{w}} = -2 \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)^{\mathrm{T}} \mathbf{X} ∂w∂L=−2(y−Xw)TX
令 ∂ L ∂ w = 0 \frac{\partial {L}}{\partial \mathbf{w}} = 0 ∂w∂L=0,则 ( y − X w ) T = 0 \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)^{\mathrm{T}} = 0 (y−Xw)T=0。
通常样本数远大于样本特征维度, m ≫ n m \gg n m≫n, X \mathbf{X} X满足列满秩条件,则 X \mathbf{X} X存在左伪逆矩阵 ( X T X ) − 1 X T \left( \mathbf{X}^{\mathrm{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathrm{T}} (XTX)−1XT,故有:
w ∗ = ( X T X ) − 1 X T y \mathbf{w}^{*} = \left( \mathbf{X}^{\mathrm{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{y} w∗=(XTX)−1XTy
∗ * ∗表示最优解。实践中,应确认 ( X T X ) − 1 \left( \mathbf{X}^{\mathrm{T}} \mathbf{X} \right)^{-1} (XTX)−1是否存在。
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
X = []
with open("ex0.txt", 'r') as fr:
for line in fr.readlines():
X.append([float(item)
for item in line.strip().split("\t")])
X = np.array(X)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(X[:, 1], X[:, 2], c="red", marker="o", edgecolors="black")
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
plt.show()
- OLS(ordinary least squares)求解回归问题:
# Listing 8.1 Standard regression function and data-importing functions
import numpy as np
def loadDataSet(fileName):
dataMat = []
labelMat = []
with open(fileName, 'r') as fr:
for line in fr.readlines():
curLine = line.strip().split('\t')
lineArr = [float(item) for item in curLine]
dataMat.append(lineArr[: -1])
labelMat.append(lineArr[-1])
return dataMat, labelMat
def standRegres(xArr, yArr):
xMat = np.matrix(xArr)
yMat = np.matrix(yArr).T
xTx = np.dot(xMat.T, xMat)
if np.linalg.det(xTx) == 0:
print("This matrix is singular, cannot do inverse")
return None
ws = xTx.I * np.dot(xMat.T, yMat)
# ws = np.linalg.solve(xTx, np.dot(xMat.T, yMat))
return ws
import numpy as np
xArr, yArr = loadDataSet("ex0.txt")
print("X: {}".format(xArr[0: 2]))
ws = standRegres(xArr, yArr)
print("weights: {}".format(ws))
X: [[1.0, 0.067732], [1.0, 0.42781]]
weights: [[3.00774324]
[1.69532264]]
xMat = np.matrix(xArr)
yMat = np.matrix(yArr)
yHat = xMat * ws
print("y hat: {}".format(yHat.T))
y hat: [[3.12257084 3.73301922 4.69582855 4.25946098 4.67099547 3.89977285
3.65007895 3.06514517 3.23286683 3.24221654 3.42785945 4.10677005
4.24737842 3.40925159 4.6518468 4.03763819 3.61572324 3.25838991
4.08905393 3.39829811 3.12681762 3.12159094 4.57689489 4.22453225
3.03380205 3.575795 3.07638008 3.36812645 4.05412689 3.92524508
3.08463121 3.22264573 3.95116656 4.53547307 4.63266931 3.47321272
3.67637171 4.48687834 3.70271024 4.54873897 3.29055527 4.13515992
3.4548456 3.39972557 3.9451346 3.08868473 3.62780072 3.84693134
3.84432733 4.35061 4.31256527 3.43466109 3.31620889 4.37786401
3.57417766 3.58183035 3.77501915 4.18560255 4.00951278 3.52978055
3.25926809 4.1981649 4.23099312 4.13833018 3.2195416 4.0559375
3.65380697 3.66153255 4.32530562 3.20455155 3.20353097 4.13260677
3.43985386 4.69270407 3.50859919 4.62257705 3.52828697 3.47779009
3.48090949 3.30565212 3.27286288 4.25069447 4.44561758 4.31514894
3.73587075 3.28268389 4.56246261 3.54184447 3.34930498 3.51131849
3.57846852 4.13701291 3.99762179 3.15189822 4.17244007 4.61771487
3.32970536 4.43864133 3.57277394 3.73170535 3.79838917 3.94128791
3.47940404 4.56535313 4.51074001 3.97320914 3.12027706 3.57694952
3.73141375 3.6786604 4.18467182 3.47015945 3.86132972 3.1215316
3.07271818 4.02344833 4.61330533 4.14577773 3.74488818 3.87130839
4.27828923 3.10609569 3.82654678 3.79907916 3.61427374 4.61681295
3.71523529 4.56822162 3.27294765 4.1209972 4.05277064 4.45457905
3.23372127 4.1826137 3.31027187 3.34600419 3.13188155 3.49347861
3.31633604 4.06142526 3.5919243 3.68342425 3.82212368 3.26772097
3.29907766 4.47772699 3.38724969 3.90338389 4.5169059 3.1880442
4.56385616 3.03292726 3.97014739 3.12447299 4.03092979 3.90973965
4.11201199 4.53561039 3.35416039 4.18782512 4.64190543 4.67151593
4.68257113 4.68771813 4.25553631 3.43763299 4.15161473 4.60027508
3.4068544 3.195112 3.37736257 4.48229758 3.3414336 4.6321641
4.6567378 4.27210978 3.76330447 3.80548918 3.93887547 3.5768478
4.5076613 4.56560234 3.75676222 3.20541447 3.61682859 4.38911078
3.66404502 4.67296204 4.33158679 4.11405146 4.4827248 4.03299809
3.44040992 4.24625442 3.84494273 3.44346998 4.42119307 3.12657688
3.90129695 3.20467701]]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:, 1].A, yMat.A)
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
xCopy = xMat.copy()
xCopy.sort(0)
yHat = np.dot(xCopy, ws)
ax.plot(xCopy[:, 1], yHat)
plt.show()
yHat = np.dot(xMat, ws)
print(yHat.shape)
print(yMat.shape)
print(np.corrcoef(yHat.T, yMat))
(200, 1)
(1, 200)
[[1. 0.98647356]
[0.98647356 1. ]]
8.2 局部加权线性回归(Locally weighted linear regression)
在最小均方误差(lowest mean-squared error)意义下,线性回归实现对统计量的无偏估计,但线性回归倾向于欠拟合(underfit)。通过增加偏置项,线性回归能够进一步减小均方误差。
*局部加权线性回归(Locally weighted linear regression,LWLR)*通过对感兴趣数据点附近的数据赋予权值,减少均方误差:
w ∗ = ( X T W X ) − 1 X T W y \mathbf{w}^{*} = \left( \mathbf{X}^{\mathrm{T}} \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{W} \mathbf{y} w∗=(XTWX)−1XTWy
其中, W \mathbf{W} W为数据点的权值矩阵。LWLR通过类似“核”技巧的方式,使邻近数据点(nearby points)相比其它数据点或的更大的权重。常用核为高斯(Gaussian)核函数:
w ( i , i ) = exp ( − ∥ x ( i ) − x ∥ 2 2 2 k 2 ) w(i, i) = \exp \left( - \frac{\left\| \mathbf{x}^{(i)} - \mathbf{x} \right\|_2^2}{2 k^2} \right) w(i,i)=exp(−2k2∥∥x(i)−x∥∥22)
该方程定义了一个对角(diagonal)权值矩阵 W \mathbf{W} W,数据点 x ( i ) \mathbf{x}^{(i)} x(i)与 x \mathbf{x} x越近,其权值 w ( i , i ) w(i, i) w(i,i)越大。预定义的参数 k k k用于指定 x \mathbf{x} x的影响范围(权值衰减速度)。
fig = plt.figure()
ax = fig.add_subplot(411)
ax.scatter(xMat[:, 1].A, yMat.A)
x = np.linspace(0, 1)
y = np.exp(- (x - 0.5) ** 2 / 2 / 0.5 **2)
ax = fig.add_subplot(412)
ax.plot(x, y, label="$k = 0.5$")
ax.legend()
y = np.exp(- (x - 0.5) ** 2 / 2 / 0.1 **2)
ax = fig.add_subplot(413)
ax.plot(x, y, label="$k = 0.1$")
ax.legend()
y = np.exp(- (x - 0.5) ** 2 / 2 / 0.01 **2)
ax = fig.add_subplot(414)
ax.plot(x, y, label="$k = 0.01$")
ax.legend()
plt.show()
# Listing 8.2 Locally weighted linear regression function
def lwlr(testPoint, xArr, yArr, k=1.0):
xMat = np.matrix(xArr)
yMat = np.matrix(yArr).T
m = xMat.shape[0]
# create diagonal matrix
weights = np.matrix(np.eye(m))
for j in range(m):
# populate weights with exponentially decaying values
diffMat = testPoint - xMat[j, :]
weights[j, j] = np.exp(
- np.dot(diffMat, diffMat.T) / 2 / k**2)
xTx = xMat.T * weights * xMat
if np.linalg.det(xTx) == 0:
print("this matrix is singular, cannot do inverse")
return None
ws = xTx.I * xMat.T * weights * yMat
return ws
def lwlrTest(testArr, xArr, yArr, k=1.0):
m = np.shape(testArr)[0]
yHat = np.zeros(m)
for i in range(m):
yHat[i] = np.dot(testArr[i], lwlr(testArr[i], xArr, yArr, k))
return yHat
xArr, yArr = loadDataSet("ex0.txt")
print(lwlr(xArr[0], xArr, yArr, 1.0))
# 单点估计
print("y_0: {}".format(yArr[0]))
print("y_hat_0: {}".format(
np.dot(xArr[0], lwlr(xArr[0], xArr, yArr, 1.0))))
yHat = lwlrTest(xArr, xArr, yArr, 1)
xMat = np.matrix(xArr)
srtInd = xMat[:, 1].argsort(0)
xSort = xMat[srtInd][:, 0, :]
fig = plt.figure(figsize=(5, 8))
ax = fig.add_subplot(311)
ax.plot(xSort[:, 1], yHat[srtInd])
ax.scatter(xMat[:, 1].A, yArr, s=2, c="red")
yHat = lwlrTest(xArr, xArr, yArr, 0.01)
ax = fig.add_subplot(312)
ax.plot(xSort[:, 1], yHat[srtInd])
ax.scatter(xMat[:, 1].A, yArr, s=2, c="red")
yHat = lwlrTest(xArr, xArr, yArr, 0.003)
ax = fig.add_subplot(313)
ax.plot(xSort[:, 1], yHat[srtInd])
ax.scatter(xMat[:, 1].A, yArr, s=2, c="red")
plt.show()
[[3.00714496]
[1.69638809]]
y_0: 3.176513
y_hat_0: [[3.12204471]]
8.3 例:预测鲍鱼年龄(Example: predicting the age of an abalone)
def rssError(yArr, yHatArr):
return ((yArr - yHatArr)**2).sum()
abX, abY = loadDataSet("abalone.txt")
yHat01 = lwlrTest(abX[0 : 99], abX[0 : 99], abY[0 : 99], 0.1)
yHat1 = lwlrTest(abX[0 : 99], abX[0 : 99], abY[0 : 99], 1)
yHat10 = lwlrTest(abX[0 : 99], abX[0 : 99], abY[0 : 99], 10)
print(rssError(abY[0 : 99], yHat01.T))
print(rssError(abY[0 : 99], yHat1.T))
print(rssError(abY[0 : 99], yHat10.T))
56.81549669032892
429.8905618702056
549.1181708826584
- k k k值过小时将导致过拟合
abX, abY = loadDataSet("abalone.txt")
abY = np.array(abY)
yHat01 = lwlrTest(abX[100 : 199], abX[0 : 99], abY[0 : 99], 0.1)
yHat1 = lwlrTest(abX[100 : 199], abX[0 : 99], abY[0 : 99], 1)
yHat10 = lwlrTest(abX[100 : 199], abX[0 : 99], abY[0 : 99], 10)
print(rssError(abY[100 : 199], yHat01.T))
print(rssError(abY[100 : 199], yHat1.T))
print(rssError(abY[100 : 199], yHat10.T))
40659.27596541478
573.5261441898057
517.5711905382693
- 与常规线性回归比较
ws = standRegres(abX[0 : 99], abY[0 : 99])
yHat = np.dot(np.matrix(abX[100 : 199]), ws)
print(rssError(abY[100 : 199], yHat.A.T))
518.6363153249081
此例中,常规线性回归与局部加权线性回归性能几乎相同。
8.4 系数收缩(Shrinking coefficients to understand our data)
当特征数 n n n大于样本数 m m m时, ( X T X ) − 1 \left( \mathbf{X}^{\mathrm{T}} \mathbf{X} \right)^{-1} (XTX)−1不存在,统计学家通过引入岭回归(ridge regression)进行系数收缩来解决这一问题。还有一种系数收缩法是套索回归(lasso regression),但套索回归计算困难,可通过前向分段回归(forward stagewise regression)近似。
8.4.1 岭回归
*岭回归(ridge regression)*向矩阵 X T X \mathbf{X}^{\mathrm{T}} \mathbf{X} XTX中添加附加项 λ I \lambda \mathbf{I} λI,使其变为非奇异矩阵:
X T X + λ I \mathbf{X}^{\mathrm{T}} \mathbf{X} + \lambda \mathbf{I} XTX+λI
其中, I \mathbf{I} I为 m × m m \times m m×m维单位方阵, λ \lambda λ为预定义标量。此时权值向量表示为:
w ∗ = ( X T X + λ I ) − 1 X T y \mathbf{w}^{\ast} = \left( \mathbf{X}^{\mathrm{T}} \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{y} w∗=(XTX+λI)−1XTy
岭回归能够抑制权值向量和的最大值,防止过拟合。统计学中,通过引入惩罚项,减小非重要参数的方法,称为收缩(shrinkage)。
# Listing 8.3 Ridge regression
def ridgeRegres(xMat, yMat, lam=0.2):
m, n = xMat.shape
xTx = np.dot(xMat.T, xMat)
denom = xTx + lam * np.eye(n)
if np.linalg.det(denom) == 0:
print("This matrix is singular, can not do inverse")
return None
ws = denom.I * np.dot(xMat.T, yMat)
return ws
def ridgeTest(xArr, yArr):
xMat = np.matrix(xArr)
yMat = np.matrix(yArr).T
m, n = xMat.shape
yMat = yMat - yMat.mean()
xMat = (xMat - xMat.mean(axis=0)) / xMat.var(axis=0)
numTestPts = 30
wMat = np.zeros((numTestPts, n))
for i in range(numTestPts):
ws = ridgeRegres(xMat, yMat, np.exp(i - 10))
wMat[i, :] = ws.T
return wMat
abX, abY = loadDataSet("abalone.txt")
ridgeWeights = ridgeTest(abX, abY)
log_lam = [item - 10 for item in range(30)]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(log_lam, ridgeWeights)
ax.set_xlabel("log($\\lambda$)")
plt.show()
8.4.2 套索
岭回归与一般最小平方误差回归的区别在于引入限制条件:
∑ k = 1 n w k 2 ≤ λ \sum_{k = 1}^{n} w_{k}^{2} \leq \lambda k=1∑nwk2≤λ
即系数平方和不大于 λ \lambda λ。当两个或两个以上的特征相关时,一般最小平方误差回归可能导致某个特征的权值为很大的正数,与其相关的特征权值为很大的负数,而岭回归可以避免这个问题。
套索(lasso)引入的限制条件为:
∑ k = 1 n ∣ w k ∣ ≤ λ \sum_{k = 1}^{n} \left| w_{k} \right| \leq \lambda k=1∑n∣wk∣≤λ
套索回归限制条件使得权值向量稀疏化,有利于理解数据。套索回归求解困难,可用前向分段回归近似。
8.4.3 前向分段回归
前向分段回归(forward stagewise regression)是一种贪婪算法(greedy algorithm)。每次更新时,前向分段回归都试图将当前误差最小化。权值向量初始化为零,
- 前向分段回归伪代码:
1. 数据标准化
2. FOR迭代:
3. SET lowestError = inf
4. FOR 遍历特征:
5. FOR 增加、减小特征权值:
6. 修改该特征的权值,得到新的权值向量W
7. 计算新特征向量的误差Error
8. IF Error < lowestERROR:
9. SET Wbest = W
10. 更新:SET W = Wbest
# Listing 8.4 Forward stagewise linear regression
def stageWise(xArr, yArr, eps=0.01, numIt=100):
xMat = np.matrix(xArr)
yMat = np.matrix(yArr).T
m, n = xMat.shape
xMat = (xMat - xMat.mean(axis=0)) / xMat.var(axis=0)
yMat -= yMat.mean()
returnMat = np.zeros((numIt, n))
ws = np.zeros((n, 1))
wsMax = ws.copy()
for i in range(numIt):
# print(ws.T)
lowestError = np.inf
for j in range(n):
for sign in [-1, 1]:
wsTest = ws.copy()
wsTest[j] += eps * sign
yTest = np.dot(xMat, wsTest)
rssE = rssError(yMat.A, yTest.A)
if rssE < lowestError:
lowestError = rssE
wsMax = wsTest
ws = wsMax.copy()
returnMat[i, :] = ws.T
return returnMat
前向分段回归的原理是坐标提升算法(coordinate ascent),每次迭代( i i i)中,仅更新一个使误差减小最大的特征的权值。
xArr, yArr = loadDataSet("abalone.txt")
stageWise(xArr, yArr, 0.01, 200)
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
...,
[ 0.05, 0. , 0.09, ..., -0.64, 0. , 0.36],
[ 0.04, 0. , 0.09, ..., -0.64, 0. , 0.36],
[ 0.05, 0. , 0.09, ..., -0.64, 0. , 0.36]])
stageWise(xArr, yArr, 0.001, 5000)
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
...,
[ 0.043, -0.011, 0.12 , ..., -0.963, -0.105, 0.187],
[ 0.044, -0.011, 0.12 , ..., -0.963, -0.105, 0.187],
[ 0.043, -0.011, 0.12 , ..., -0.963, -0.105, 0.187]])
wMat = stageWise(xArr, yArr, 0.005, 1000)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(list(range(1000)), wMat)
plt.show()
xMat = np.matrix(xArr)
yMat = np.matrix(yArr).T
xMat = (xMat - xMat.mean(axis=0)) / xMat.var(axis=0)
yMat -= yMat.mean(axis=0)
weights = standRegres(xMat, yMat.T)
print(weights.T)
[[ 0.0430442 -0.02274163 0.13214087 0.02075182 2.22403814 -0.99895312
-0.11725427 0.16622915]]
收缩(shrinkage)法相当于通过增加模型的偏置(bias)换取模型方差(variance)的减小。
8.5 平衡偏置与方差(The bias/variance tradeoff)
通常认为误差由偏置(bias)、variance(方差)和随机噪声(random noise)三部分组成。
局部加权线性回归增加了模型的复杂度,因此增加了模型的方差;系数收缩减小了模型的复杂度,因此增加了模型的偏置。
模型方差的测量:从训练数据集中随机抽取两个子集,用两个子集分别训练同一模型,得到两组模型参数,通过比较两组模型参数的差异总量即可得到模型的方差。