Hands-On Multiple Linear Regression with Ordinary Least Squares

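I. Notes

I did this work in Jupyter and then exported it to markdown. The command for converting an ipynb file to markdown is:

jupyter nbconvert --to markdown xxx.ipynb

The source code and data file can be downloaded here.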

II. Data Field Descriptions

	Name		Data Type	Meas.	Description
	----		---------	-----	-----------
	Sex		nominal			M, F, and I (infant)
	Length		continuous	mm	Longest shell measurement
	Diameter	continuous	mm	perpendicular to length
	Height		continuous	mm	with meat in shell
	Whole weight	continuous	grams	whole abalone
	Shucked weight	continuous	grams	weight of meat
	Viscera weight	continuous	grams	gut weight (after bleeding)
	Shell weight	continuous	grams	after being dried
	Rings		integer			+1.5 gives the age in years

There are 9 data fields: the first 8 are features, and the last one, Rings, is the prediction target. See the data file for details.
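As a small illustration of the Rings field, the sketch below turns ring counts into an estimated age using the "+1.5" rule from the table above. The five ring counts are the first five values of the data file shown later in this post; the helper name `rings_to_age` is made up for this example.

```python
import pandas as pd


def rings_to_age(rings):
    """Estimated age in years: number of rings plus 1.5."""
    return rings + 1.5


# First five Rings values from the data file
rings = pd.Series([15, 7, 9, 10, 7], name="Rings")
age = rings_to_age(rings)
print(age.tolist())
```

The model below predicts Rings directly, so this conversion would only be applied to its output afterwards.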

III. Hands-On

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataframe01 = pd.read_excel('abalone.xlsx', sheet_name='data')
dataframe01.head(10)
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7
5    I   0.425     0.300   0.095        0.3515          0.1410          0.0775         0.120      8
6    F   0.530     0.415   0.150        0.7775          0.2370          0.1415         0.330     20
7    F   0.545     0.425   0.125        0.7680          0.2940          0.1495         0.260     16
8    M   0.475     0.370   0.125        0.5095          0.2165          0.1125         0.165      9
9    F   0.550     0.440   0.150        0.8945          0.3145          0.1510         0.320     19
# Check the size of the data
dataframe01.shape
(4177, 9)
dataframe01.columns # column names
Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings'],
      dtype='object')
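# Clean the data
# Recode the Sex feature from strings to integers: I -> 0, F -> 1, M -> 2
dataframe02 = dataframe01.copy()

# map() recodes all three values in one step and avoids chained assignment,
# which raises SettingWithCopyWarning and may leave dataframe02 unchanged
dataframe02['Sex'] = dataframe01['Sex'].map({'I': 0, 'F': 1, 'M': 2})

dataframe02.head(10)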
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    2   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    2   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    1   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    2   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    0   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7
5    0   0.425     0.300   0.095        0.3515          0.1410          0.0775         0.120      8
6    1   0.530     0.415   0.150        0.7775          0.2370          0.1415         0.330     20
7    1   0.545     0.425   0.125        0.7680          0.2940          0.1495         0.260     16
8    2   0.475     0.370   0.125        0.5095          0.2165          0.1125         0.165      9
9    1   0.550     0.440   0.150        0.8945          0.3145          0.1510         0.320     19
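Sex is a nominal field, so the integer codes 0/1/2 impose an order on categories that have none. One common alternative is one-hot encoding. The sketch below shows it with `pd.get_dummies` on a tiny hand-made frame (the three rows are illustrative, not the full data file). It is a different technique from the one used in the rest of this post, which keeps the integer codes.

```python
import pandas as pd

# Toy frame with one row per Sex category (illustrative values only)
toy = pd.DataFrame({"Sex": ["M", "F", "I"], "Length": [0.455, 0.530, 0.330]})

# One 0/1 indicator column per category; the other columns are kept as they are
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(encoded)
```

With an intercept in the model, one of the three indicator columns is redundant; `pd.get_dummies` accepts `drop_first=True` to drop it.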
# Import the linear regression library
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
data_index = list(dataframe01.columns)
data_index
['Sex',
 'Length',
 'Diameter',
 'Height',
 'Whole weight',
 'Shucked weight',
 'Viscera weight',
 'Shell weight',
 'Rings']
# Column names for the feature matrix X and for the target Y
X_index = data_index[0:-1]
Y_index = data_index[-1]
X_index, Y_index
(['Sex',
  'Length',
  'Diameter',
  'Height',
  'Whole weight',
  'Shucked weight',
  'Viscera weight',
  'Shell weight'],
 'Rings')
X = dataframe02[X_index]
X.head()
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight
0    2   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150
1    2   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070
2    1   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210
3    2   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155
4    0   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055
Y = dataframe02[Y_index]
Y.head()
0    15
1     7
2     9
3    10
4     7
Name: Rings, dtype: int64
# Split into training and test sets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.2,random_state=420)
Xtrain.head()
      Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight
2763    0   0.550     0.425   0.135        0.6560          0.2570          0.1700         0.203
439     2   0.500     0.415   0.165        0.6885          0.2490          0.1380         0.250
1735    2   0.670     0.520   0.165        1.3900          0.7110          0.2865         0.300
751     2   0.485     0.355   0.120        0.5470          0.2150          0.1615         0.140
1626    1   0.570     0.450   0.135        0.7805          0.3345          0.1850         0.210
Ytrain.head()
2763    10
439     13
1735    11
751     10
1626     8
Name: Rings, dtype: int64
# Reset the index of every split so that it runs from 0 again
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])
Xtrain.head()   # look at the head of the X training set
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight
0    0   0.550     0.425   0.135        0.6560          0.2570          0.1700         0.203
1    2   0.500     0.415   0.165        0.6885          0.2490          0.1380         0.250
2    2   0.670     0.520   0.165        1.3900          0.7110          0.2865         0.300
3    2   0.485     0.355   0.120        0.5470          0.2150          0.1615         0.140
4    1   0.570     0.450   0.135        0.7805          0.3345          0.1850         0.210
Ytrain.head()
0    10
1    13
2    11
3    10
4     8
Name: Rings, dtype: int64
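# Standardization would go here: fit the scaler class on the training set first, then use the fitted scaler to transform the training set and the test set separately (this step is not carried out below; the model is fit on the raw features)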
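The fit-then-transform pattern described in the comment above can be sketched as follows. This is a minimal illustration with scikit-learn's `StandardScaler` on randomly generated data; the names `X_train_demo` and `X_test_demo` are placeholders for this example and are not the Xtrain and Xtest of this post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a training set and a test set
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train_demo)

# Apply the same training statistics to both sets
X_train_std = scaler.transform(X_train_demo)
X_test_std = scaler.transform(X_test_demo)

print(X_train_std.mean(axis=0).round(6))
print(X_train_std.std(axis=0).round(6))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step. For plain least squares, standardization changes the scale of the coefficients but not the predictions.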

# Build the model
reg = LR().fit(Xtrain, Ytrain)
yhat = reg.predict(Xtest) # predictions (yhat) for the test set
yhat.min()
4.22923686878166
yhat.max()
22.656846035572762
reg.coef_ # w, the coefficient vector
array([  0.40527178,  -0.88791132,  13.01662939,  10.39250886,
         9.64127293, -20.87747601, -10.50683081,   7.70632772])
Xtrain.columns
Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight'],
      dtype='object')
[*zip(Xtrain.columns,reg.coef_)]
[('Sex', 0.4052717783379893),
 ('Length', -0.8879113179582045),
 ('Diameter', 13.016629389061475),
 ('Height', 10.39250886428478),
 ('Whole weight', 9.64127293101552),
 ('Shucked weight', -20.87747600529615),
 ('Viscera weight', -10.506830809919672),
 ('Shell weight', 7.706327719866024)]
# Feature descriptions

Name            Data Type   Meas.   Description
----            ---------   -----   -----------
Sex             nominal             M, F, and I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years

# Intercept
reg.intercept_
2.7888240054011835
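The post imports `cross_val_score` but stops at inspecting the coefficients and the intercept. A minimal sketch of how such a fit could be scored is given below. It runs on synthetic data rather than the abalone file, and the variable names are placeholders for this example, so the printed numbers say nothing about the model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem: y = X @ true_w + 3 + small noise
rng = np.random.default_rng(420)
X_demo = rng.uniform(0, 1, size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=420)
model = LinearRegression().fit(Xtr, ytr)
pred = model.predict(Xte)

# Hold-out metrics on the test split
mse = mean_squared_error(yte, pred)
r2 = r2_score(yte, pred)

# 5-fold cross-validated R^2 on the whole synthetic set
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(mse, r2, cv_r2.mean())
```

On the data of this post, the same calls would take `Ytest` and `yhat` for the hold-out metrics and `X`, `Y` for the cross-validation.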
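# Custom least squares attempt
def my_least_squares(x_array, y_array):
    '''
    :param x_array: array-like, an m*n feature matrix
    :param y_array: array-like, the m target values
    :return: coef: list of the n regression coefficients   intercept: float
    '''
    # Convert to float arrays (a column stored as object dtype would break np.linalg.inv)
    arr_x_01 = np.array(x_array, dtype=float)
    arr_y_01 = np.array(y_array, dtype=float)

    # Turn the m*n matrix into an m*(n+1) matrix whose first column is all ones
    # Number of rows
    row_num = arr_x_01.shape[0]

    # Prepend the column of ones, so the first weight is the intercept
    arr_x_02 = np.insert(arr_x_01, 0, values=np.ones(row_num), axis=1)

    # Normal equation: w = (X^T X)^-1 X^T y
    w = np.linalg.inv(np.matmul(arr_x_02.T, arr_x_02))
    w = np.matmul(w, arr_x_02.T)
    w = np.matmul(w, arr_y_01)

    # w has n+1 entries: w[0] is the intercept, w[1:] are the coefficients
    result = list(w)
    intercept = result.pop(0)
    coef = result

    return coef, intercept
# Try it on the training set; the result can be compared with reg.coef_ and reg.intercept_
my_least_squares(Xtrain,list(Ytrain))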
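Inverting X^T X explicitly can be numerically fragile when features are strongly correlated. As a hedged cross-check, the sketch below solves the same least squares problem three ways on synthetic data: with the normal equation, with `np.linalg.lstsq`, and with scikit-learn's `LinearRegression`. The data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 4 features, known weights, intercept 0.7, small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 3.0]) + 0.7 + rng.normal(scale=0.05, size=100)

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# 1) Normal equation: w = (A^T A)^-1 A^T y
w_normal = np.linalg.inv(A.T @ A) @ A.T @ y_demo

# 2) Least squares solver that avoids forming the inverse
w_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# 3) scikit-learn reference fit
ref = LinearRegression().fit(X_demo, y_demo)

print(w_normal[0], w_lstsq[0], ref.intercept_)
print(w_normal[1:], ref.coef_)
```

On well-conditioned data like this, the three solutions agree to within floating-point error; the first entry of each weight vector is the intercept.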
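# Gradient descent attempt
# X, Y and theta are expected to be np.matrix objects (e.g. built with np.asmatrix),
# so that * below means matrix multiplication: X is m*n, Y is m*1, theta is 1*n
def costFunc(X,Y,theta):
    '''
    Cost function: half of the mean squared error
    '''
    inner = np.power((X*theta.T)-Y,2)
    return np.sum(inner)/(2*len(X))

def gradientDescent(X,Y,theta,alpha,iters):
    '''
    Batch gradient descent
    alpha: learning rate, iters: number of iterations
    Returns the final theta and the cost after each iteration
    '''
    # np.asmatrix replaces np.mat, which was removed in NumPy 2.0
    temp = np.asmatrix(np.zeros(theta.shape))
    cost = np.zeros(iters)
    thetaNums = int(theta.shape[1])
    for i in range(iters):
        # Error under the current theta, shared by all parameter updates
        error = (X*theta.T-Y)
        for j in range(thetaNums):
            derivativeInner = np.multiply(error,X[:,j])
            temp[0,j] = theta[0,j] - (alpha*np.sum(derivativeInner)/len(X))

        # Copy so that theta and temp do not end up as the same object
        theta = temp.copy()
        cost[i] = costFunc(X,Y,theta)

    return theta,cost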
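The post ends without calling the gradient descent code. The sketch below is a vectorized variant of the same batch update written with plain NumPy arrays instead of `np.matrix`, run on noise-free synthetic data so that the recovered weights can be checked. The function name `gradient_descent`, the learning rate and the iteration count are choices made for this example.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Batch gradient descent for the cost sum((X @ theta - y)**2) / (2 * m)."""
    m, n = X.shape
    theta = np.zeros(n)
    cost = np.zeros(iters)
    for i in range(iters):
        error = X @ theta - y                      # residuals under the current theta
        theta = theta - alpha * (X.T @ error) / m  # update all parameters at once
        cost[i] = np.sum((X @ theta - y) ** 2) / (2 * m)
    return theta, cost


# Noise-free synthetic data: the first column of ones carries the intercept
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 2))
X_demo = np.column_stack([np.ones(200), features])
y_demo = X_demo @ np.array([4.0, 2.0, -3.0])

theta, cost = gradient_descent(X_demo, y_demo)
print(theta)
print(cost[0], cost[-1])
```

The features here are already on a common scale. On raw features with very different ranges, a fixed learning rate may converge slowly or diverge, which is one reason to standardize before using gradient descent.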