Introduction to Ridge Regression
When linear regression is solved with the normal equation, the matrix X^T X must be inverted; with strongly collinear features that matrix is close to singular and the solution becomes numerically unstable.
Ridge regression is a regularized version of linear regression: it adds a regularization term to the ordinary linear-regression cost function.
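Written out explicitly, a common formulation of the regularized cost is the following (the scaling of the penalty term varies between texts; this notation is not from the original notebook):

```latex
J(w) = \frac{1}{m}\sum_{i=1}^{m}\bigl(x^{(i)} w - y^{(i)}\bigr)^{2} + \lambda \lVert w \rVert_{2}^{2}
```

Minimizing this cost gives the ridge normal-equation solution, which the closed-form section later in this document implements directly:

```latex
\hat{w} = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y
```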
We apply ridge regression to the Longley dataset. The dataset comes from a paper by J.W. Longley published in JASA in 1967 and consists of strongly collinear macroeconomic data: GNP deflator, GNP (gross national product), Unemployed (unemployment), Armed.Forces (armed forces), Population, Year, and Employed (employment). Because of its severe multicollinearity, the Longley dataset was often used in the early days to test the numerical precision of algorithms and computers.
Ridge regression in sklearn
Import packages
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
Read in the data
data = pd.read_csv("longley.csv",delimiter=',')
print(data)
Unnamed: 0 GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
0 1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1 1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
2 1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
3 1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
4 1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
5 1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
6 1953 99.0 365.385 187.0 354.7 115.094 1953 64.989
7 1954 100.0 363.112 357.8 335.0 116.219 1954 63.761
8 1955 101.2 397.469 290.4 304.8 117.388 1955 66.019
9 1956 104.6 419.180 282.2 285.7 118.734 1956 67.857
10 1957 108.4 442.769 293.6 279.8 120.445 1957 68.169
11 1958 110.8 444.546 468.1 263.7 121.950 1958 66.513
12 1959 112.6 482.704 381.3 255.2 123.366 1959 68.655
13 1960 114.2 502.601 393.1 251.4 125.368 1960 69.564
14 1961 115.7 518.173 480.6 257.2 127.852 1961 69.331
15 1962 116.9 554.894 400.7 282.7 130.081 1962 70.551
Split the data
# features: every column from GNP onward
x_data = data.iloc[:,2:]
# target: GNP.deflator
y_data = data.iloc[:,1]
print(x_data)
print(y_data)
Create the model
# generate 50 candidate ridge coefficients (np.linspace defaults to 50 points)
alphas_to_test = np.linspace(0.001, 1)
# create the model and keep the cross-validation errors for every alpha
# (in scikit-learn >= 1.5 this parameter is renamed store_cv_results)
estimator = linear_model.RidgeCV(alphas=alphas_to_test, store_cv_values=True)
estimator.fit(x_data, y_data)
# selected ridge coefficient
print(estimator.alpha_)
# per-sample loss values, one column per candidate alpha
print(estimator.cv_values_.shape)
0.40875510204081633
(16, 50)
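RidgeCV evaluates every candidate alpha by leave-one-out cross-validation and keeps the minimizer. The selection logic can be sketched with plain Ridge and LeaveOneOut on synthetic data (all names and values below are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(16, 3)                         # 16 samples, like Longley
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(16)

alphas = np.linspace(0.001, 1)               # 50 candidates, as above
mean_errors = []
for a in alphas:
    # leave-one-out MSE for this alpha (cross_val_score returns negated MSE)
    scores = cross_val_score(Ridge(alpha=a), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    mean_errors.append(-scores.mean())

# the alpha with the smallest mean LOO error is what RidgeCV reports as alpha_
best_alpha = alphas[np.argmin(mean_errors)]
```

For ridge regression RidgeCV computes the same leave-one-out errors with an efficient closed-form shortcut rather than an explicit loop, which is why it scales to larger alpha grids.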
# plot the mean loss against each candidate ridge coefficient
plt.plot(alphas_to_test, estimator.cv_values_.mean(axis=0))
# mark the position of the selected ridge coefficient
plt.plot(estimator.alpha_, min(estimator.cv_values_.mean(axis=0)), 'ro')
plt.show()
estimator.predict([x_data.iloc[2]])
array([88.11216213])
Ridge regression via the normal equation
Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Read in the data
data = pd.read_csv("longley.csv",delimiter=',')
print(data)
Unnamed: 0 GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
0 1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1 1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
2 1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
3 1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
4 1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
5 1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
6 1953 99.0 365.385 187.0 354.7 115.094 1953 64.989
7 1954 100.0 363.112 357.8 335.0 116.219 1954 63.761
8 1955 101.2 397.469 290.4 304.8 117.388 1955 66.019
9 1956 104.6 419.180 282.2 285.7 118.734 1956 67.857
10 1957 108.4 442.769 293.6 279.8 120.445 1957 68.169
11 1958 110.8 444.546 468.1 263.7 121.950 1958 66.513
12 1959 112.6 482.704 381.3 255.2 123.366 1959 68.655
13 1960 114.2 502.601 393.1 251.4 125.368 1960 69.564
14 1961 115.7 518.173 480.6 257.2 127.852 1961 69.331
15 1962 116.9 554.894 400.7 282.7 130.081 1962 70.551
Split the data
x_data = data.iloc[:,2:]
y_data = data.iloc[:,1]
# turn the target Series into a column vector (indexing a Series with
# np.newaxis is no longer supported in newer pandas)
y_data = y_data.values[:, np.newaxis]
print(x_data)
print(y_data)
GNP Unemployed Armed.Forces Population Year Employed
0 234.289 235.6 159.0 107.608 1947 60.323
1 259.426 232.5 145.6 108.632 1948 61.122
2 258.054 368.2 161.6 109.773 1949 60.171
3 284.599 335.1 165.0 110.929 1950 61.187
4 328.975 209.9 309.9 112.075 1951 63.221
5 346.999 193.2 359.4 113.270 1952 63.639
6 365.385 187.0 354.7 115.094 1953 64.989
7 363.112 357.8 335.0 116.219 1954 63.761
8 397.469 290.4 304.8 117.388 1955 66.019
9 419.180 282.2 285.7 118.734 1956 67.857
10 442.769 293.6 279.8 120.445 1957 68.169
11 444.546 468.1 263.7 121.950 1958 66.513
12 482.704 381.3 255.2 123.366 1959 68.655
13 502.601 393.1 251.4 125.368 1960 69.564
14 518.173 480.6 257.2 127.852 1961 69.331
15 554.894 400.7 282.7 130.081 1962 70.551
[[ 83. ]
[ 88.5]
[ 88.2]
[ 89.5]
[ 96.2]
[ 98.1]
[ 99. ]
[100. ]
[101.2]
[104.6]
[108.4]
[110.8]
[112.6]
[114.2]
[115.7]
[116.9]]
print(np.mat(x_data).shape)
print(np.mat(y_data).shape)
# add a bias (intercept) column to the samples
X_data = np.concatenate((np.ones((16,1)),x_data),axis=1)
print(X_data.shape)
(16, 6)
(16, 1)
(16, 7)
print(X_data[:3])
[[1.00000e+00 2.34289e+02 2.35600e+02 1.59000e+02 1.07608e+02 1.94700e+03
  6.03230e+01]
 [1.00000e+00 2.59426e+02 2.32500e+02 1.45600e+02 1.08632e+02 1.94800e+03
  6.11220e+01]
 [1.00000e+00 2.58054e+02 3.68200e+02 1.61600e+02 1.09773e+02 1.94900e+03
  6.01710e+01]]
Solving for the regression parameters with the ridge normal equation
def weights(xArr, yArr, lam=0.2):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr)
    xTx = xMat.T * xMat  # matrix multiplication
    rxTx = xTx + np.eye(xMat.shape[1]) * lam
    # if the determinant is 0, the matrix has no inverse
    if np.linalg.det(rxTx) == 0.0:
        print("This matrix cannot do inverse")
        return
    # rxTx.I is the inverse of rxTx
    ws = rxTx.I * xMat.T * yMat
    return ws
ws = weights(X_data,y_data)
print(ws)
[[ 7.38107648e-04]
[ 2.07703836e-01]
[ 2.10076376e-02]
[ 5.05385441e-03]
[-1.59173066e+00]
[ 1.10442920e-01]
[-2.42280461e-01]]
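As a sanity check (not part of the original notebook), the closed-form solution that weights() computes can be compared against sklearn's Ridge on synthetic data. With the bias column included in X and fit_intercept=False, sklearn penalizes the bias exactly as the formula above does:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(1)
# augmented design matrix: a bias column of ones plus 3 random features
X = np.concatenate((np.ones((16, 1)), rng.randn(16, 3)), axis=1)
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + 0.1 * rng.randn(16)

lam = 0.2
# closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y,
# the same quantity weights() computes with np.mat
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# sklearn solves the identical problem when the intercept column is supplied
# explicitly and fit_intercept=False
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
```

Note that sklearn's usual default, fit_intercept=True, would instead fit an unpenalized intercept, so its coefficients would not match the closed form used here.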
Compute the predictions
np.mat(X_data)*np.mat(ws)
matrix([[ 83.55075226],
[ 86.92588689],
[ 88.09720228],
[ 90.95677622],
[ 96.06951002],
[ 97.81955375],
[ 98.36444357],
[ 99.99814266],
[103.26832266],
[105.03165135],
[107.45224671],
[109.52190685],
[112.91863666],
[113.98357055],
[115.29845063],
[117.64279933]])