Machine Learning (10): Ridge Regression

This article introduces ridge regression as a regularized version of linear regression, which adds a regularization term to handle collinearity. Using Python's sklearn library, it shows how to apply ridge regression to the Longley dataset and find the best fit by tuning the coefficient α. It also explores solving the ridge regression parameters directly with the normal equation.

Introduction to Ridge Regression

When solving linear regression with the normal equation, the weights are given by

$$\hat{w} = (X^T X)^{-1} X^T y$$

This requires $X^T X$ to be invertible. When the features are strongly collinear (or there are more features than samples), $X^T X$ is singular or nearly so, and the solution is undefined or numerically unstable.

Ridge regression is a regularized version of linear regression: it adds a regularization term to the ordinary linear regression cost function:

$$J(w) = \sum_{i=1}^{m} \left( y^{(i)} - w^T x^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} w_j^2$$

Minimizing this cost yields the closed-form solution

$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$$

where adding $\lambda I$ (with $\lambda > 0$) guarantees that the matrix being inverted is nonsingular.
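The effect of the regularization term can be sketched with a toy example (synthetic data, not the article's dataset): with two perfectly collinear columns, $X^T X$ is singular and the plain normal equation breaks down, while adding $\lambda I$ makes the system solvable.

```python
import numpy as np

# A minimal sketch (synthetic data, not the Longley set): with two
# perfectly collinear columns, X^T X is singular, so the plain normal
# equation has no solution; adding lambda * I makes it invertible.
rng = np.random.default_rng(0)
x1 = rng.normal(size=10)
X = np.column_stack([x1, 2 * x1])          # second column = 2 * first
y = 3 * x1 + rng.normal(scale=0.1, size=10)

XtX = X.T @ X
print(np.linalg.det(XtX))                  # effectively 0: no inverse

lam = 0.1
w_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)
print(w_ridge)                             # finite, well-defined weights
```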
We will apply ridge regression to the Longley dataset. The dataset comes from a 1967 paper by J.W. Longley published in JASA and consists of strongly collinear macroeconomic data: GNP deflator, GNP (gross national product), Unemployed, Armed Forces, Population, Year, and Employed. Because of its severe multicollinearity, the Longley dataset was often used in the early days to test the numerical accuracy of algorithms and computers.
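The severity of such collinearity can be quantified with a condition number. A hedged illustration on synthetic columns (not the actual Longley data, which the snippet below does not load):

```python
import numpy as np

# Synthetic illustration of multicollinearity (not the Longley data):
# when two columns are nearly proportional, the condition number of
# X^T X explodes, so the normal-equation solution becomes numerically
# unstable -- the situation ridge regression is designed to handle.
rng = np.random.default_rng(1)
base = rng.normal(size=50)
X_ok = np.column_stack([base, rng.normal(size=50)])                 # independent columns
X_bad = np.column_stack([base, base + 1e-6 * rng.normal(size=50)])  # near-duplicates

print(np.linalg.cond(X_ok.T @ X_ok))    # modest
print(np.linalg.cond(X_bad.T @ X_bad))  # enormous
```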
Ridge Regression in sklearn

Import packages


import numpy as np
from numpy import genfromtxt
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

Read in the data

data = pd.read_csv("longley.csv",delimiter=',')
print(data)

    Unnamed: 0  GNP.deflator      GNP  Unemployed  Armed.Forces  Population  Year  Employed
0         1947          83.0  234.289       235.6         159.0     107.608  1947    60.323
1         1948          88.5  259.426       232.5         145.6     108.632  1948    61.122
2         1949          88.2  258.054       368.2         161.6     109.773  1949    60.171
3         1950          89.5  284.599       335.1         165.0     110.929  1950    61.187
4         1951          96.2  328.975       209.9         309.9     112.075  1951    63.221
5         1952          98.1  346.999       193.2         359.4     113.270  1952    63.639
6         1953          99.0  365.385       187.0         354.7     115.094  1953    64.989
7         1954         100.0  363.112       357.8         335.0     116.219  1954    63.761
8         1955         101.2  397.469       290.4         304.8     117.388  1955    66.019
9         1956         104.6  419.180       282.2         285.7     118.734  1956    67.857
10        1957         108.4  442.769       293.6         279.8     120.445  1957    68.169
11        1958         110.8  444.546       468.1         263.7     121.950  1958    66.513
12        1959         112.6  482.704       381.3         255.2     123.366  1959    68.655
13        1960         114.2  502.601       393.1         251.4     125.368  1960    69.564
14        1961         115.7  518.173       480.6         257.2     127.852  1961    69.331
15        1962         116.9  554.894       400.7         282.7     130.081  1962    70.551

Split the data

x_data = data.iloc[:, 2:]   # features: GNP through Employed
y_data = data.iloc[:, 1]    # target: GNP.deflator
print(x_data)
print(y_data)

Create the model

# generate 50 candidate values for the ridge coefficient alpha
alphas_to_test = np.linspace(0.001, 1)
# create the model and keep the per-sample cross-validation errors
# (newer scikit-learn versions rename this parameter to store_cv_results)
estimator = linear_model.RidgeCV(alphas=alphas_to_test, store_cv_values=True)
estimator.fit(x_data, y_data)

# the selected ridge coefficient
print(estimator.alpha_)
# shape of the cross-validation error grid
print(estimator.cv_values_.shape)

0.40875510204081633
(16, 50)
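The (16, 50) shape means one leave-one-out squared error per sample for each of the 50 candidate α values; averaging over axis 0 gives the curve plotted next. A rough sketch of that procedure in plain NumPy, on synthetic data (intercept omitted for brevity):

```python
import numpy as np

# Sketch of what the (n_samples, n_alphas) error grid means: for each
# candidate alpha, refit ridge with one sample held out and record that
# sample's squared error (intercept omitted for brevity).
rng = np.random.default_rng(42)
n = 16
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

alphas = np.linspace(0.001, 1, 50)
cv_errors = np.empty((n, alphas.size))
for j, a in enumerate(alphas):
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        w = np.linalg.solve(Xtr.T @ Xtr + a * np.eye(3), Xtr.T @ ytr)
        cv_errors[i, j] = (X[i] @ w - y[i]) ** 2

best_alpha = alphas[cv_errors.mean(axis=0).argmin()]
print(cv_errors.shape)   # (16, 50), same layout as cv_values_
print(best_alpha)
```

Note that RidgeCV does not literally refit n times; it uses an efficient closed-form leave-one-out computation, but the resulting error grid has the same meaning.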

# plot the mean cross-validation loss against the ridge coefficient
plt.plot(alphas_to_test, estimator.cv_values_.mean(axis=0))
# mark the selected ridge coefficient
plt.plot(estimator.alpha_, min(estimator.cv_values_.mean(axis=0)), 'ro')
plt.show()
plt.show()

[Figure: mean cross-validation loss versus the ridge coefficient α, with the selected α ≈ 0.409 marked as a red dot]

estimator.predict([x_data.iloc[2]])

array([88.11216213])
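As a sanity check, predict is just the learned linear function, intercept plus dot product with the coefficients. A sketch on synthetic data (not the Longley set):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sanity check on synthetic data: a Ridge prediction is simply
# intercept_ + x @ coef_, the learned linear function.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 1.5

model = Ridge(alpha=0.4).fit(X, y)
manual = model.intercept_ + X[2] @ model.coef_
print(np.allclose(manual, model.predict(X[2:3])[0]))  # True
```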

Ridge Regression via the Normal Equation

Import packages

import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt  

Read in the data

data = pd.read_csv("longley.csv",delimiter=',')
print(data)
  Unnamed: 0  GNP.deflator      GNP  Unemployed  Armed.Forces  Population    Year  Employed  
0         1947          83.0  234.289       235.6         159.0     107.608    1947    60.323  
1         1948          88.5  259.426       232.5         145.6     108.632    1948    61.122  
2         1949          88.2  258.054       368.2         161.6     109.773    1949    60.171  
3         1950          89.5  284.599       335.1         165.0     110.929    1950    61.187  
4         1951          96.2  328.975       209.9         309.9     112.075    1951    63.221  
5         1952          98.1  346.999       193.2         359.4     113.270    1952    63.639  
6         1953          99.0  365.385       187.0         354.7     115.094    1953    64.989  
7         1954         100.0  363.112       357.8         335.0     116.219    1954    63.761  
8         1955         101.2  397.469       290.4         304.8     117.388    1955    66.019  
9         1956         104.6  419.180       282.2         285.7     118.734    1956    67.857  
10        1957         108.4  442.769       293.6         279.8     120.445    1957    68.169  
11        1958         110.8  444.546       468.1         263.7     121.950    1958    66.513  
12        1959         112.6  482.704       381.3         255.2     123.366    1959    68.655  
13        1960         114.2  502.601       393.1         251.4     125.368    1960    69.564  
14        1961         115.7  518.173       480.6         257.2     127.852    1961    69.331  
15        1962         116.9  554.894       400.7         282.7     130.081    1962    70.551  

Split the data

x_data = data.iloc[:, 2:]
y_data = data.iloc[:, 1]
# reshape the target into a column vector; .to_numpy() is needed because
# a pandas Series does not support np.newaxis indexing directly
y_data = y_data.to_numpy()[:, np.newaxis]
print(x_data)
print(y_data)
        GNP  Unemployed  Armed.Forces  Population  Year  Employed
0   234.289       235.6         159.0     107.608  1947    60.323
1   259.426       232.5         145.6     108.632  1948    61.122
2   258.054       368.2         161.6     109.773  1949    60.171
3   284.599       335.1         165.0     110.929  1950    61.187
4   328.975       209.9         309.9     112.075  1951    63.221
5   346.999       193.2         359.4     113.270  1952    63.639
6   365.385       187.0         354.7     115.094  1953    64.989
7   363.112       357.8         335.0     116.219  1954    63.761
8   397.469       290.4         304.8     117.388  1955    66.019
9   419.180       282.2         285.7     118.734  1956    67.857
10  442.769       293.6         279.8     120.445  1957    68.169
11  444.546       468.1         263.7     121.950  1958    66.513
12  482.704       381.3         255.2     123.366  1959    68.655
13  502.601       393.1         251.4     125.368  1960    69.564
14  518.173       480.6         257.2     127.852  1961    69.331
15  554.894       400.7         282.7     130.081  1962    70.551
[[ 83. ]
 [ 88.5]
 [ 88.2]
 [ 89.5]
 [ 96.2]
 [ 98.1]
 [ 99. ]
 [100. ]
 [101.2]
 [104.6]
 [108.4]
 [110.8]
 [112.6]
 [114.2]
 [115.7]
 [116.9]]
print(np.mat(x_data).shape)
print(np.mat(y_data).shape)
# prepend a bias (intercept) column of ones to the samples
X_data = np.concatenate((np.ones((16, 1)), x_data), axis=1)
print(X_data.shape)

(16, 6)
(16, 1)
(16, 7)

print(X_data[:3])

[[1.00000e+00 2.34289e+02 2.35600e+02 1.59000e+02 1.07608e+02 1.94700e+03
  6.03230e+01]
 [1.00000e+00 2.59426e+02 2.32500e+02 1.45600e+02 1.08632e+02 1.94800e+03
  6.11220e+01]
 [1.00000e+00 2.58054e+02 3.68200e+02 1.61600e+02 1.09773e+02 1.94900e+03
  6.01710e+01]]

Solving the Ridge Regression Parameters with the Normal Equation

def weights(xArr, yArr, lam=0.2):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr)
    xTx = xMat.T * xMat  # matrix multiplication
    rxTx = xTx + np.eye(xMat.shape[1]) * lam
    # compute the determinant; if it is 0, the matrix has no inverse
    if np.linalg.det(rxTx) == 0.0:
        print("This matrix cannot do inverse")
        return
    # rxTx.I is the inverse of rxTx
    ws = rxTx.I * xMat.T * yMat
    return ws
ws = weights(X_data,y_data)
print(ws)

[[ 7.38107648e-04]
 [ 2.07703836e-01]
 [ 2.10076376e-02]
 [ 5.05385441e-03]
 [-1.59173066e+00]
 [ 1.10442920e-01]
 [-2.42280461e-01]]

Compute predictions

np.mat(X_data)*np.mat(ws)

matrix([[ 83.55075226],
        [ 86.92588689],
        [ 88.09720228],
        [ 90.95677622],
        [ 96.06951002],
        [ 97.81955375],
        [ 98.36444357],
        [ 99.99814266],
        [103.26832266],
        [105.03165135],
        [107.45224671],
        [109.52190685],
        [112.91863666],
        [113.98357055],
        [115.29845063],
        [117.64279933]])
