使用多项式回归预测钻石数据集(附详细代码和数据集)

一、实验环境

1、本文使用的是python3.7 使用的编辑器是Spyder

实验数据集:本文采用的数据集是由kaggle上面的diamonds数据集,该数据集收集了约54000颗钻石的价格和质量的信息。每条记录由十个变量构成,其中有三个是名义变量,分别描述钻石的切工,颜色和净度;carat:克拉,重量cut:切工,color:颜色,clarity:净度,depth:深度,table:钻石宽度,以及X,Y,Z

数据集下载链接:链接:百度网盘,数据集下载链接
提取码:uy7j 
 

实验步骤:

1、使用导入数据集,并查看数据集情况并去除有缺失值的行

dataset=pd.read_csv('diamonds.csv')
dataset = dataset[(dataset[['x','y','z']] != 0).all(axis=1)]
print(dataset.describe())#查看数据集情况
dataset.count()# 查看非缺失值的数量
#data.info()
X =dataset.iloc[:,:-1].values
y = dataset.iloc[:,9].values

2、对数据集中数值为字符的列进行lable编码

X =dataset.iloc[:,:-1].values
y = dataset.iloc[:,9].values
from sklearn import preprocessing
#对第二列进行lable编码(cut)
X[:,0]=X[:,0]*100
le = preprocessing.LabelEncoder()
le.fit(["Fair", "Good", "Very Good","Premium", "Ideal"])
X[:,1]=le.transform(X[:,1])

#对第三列进行lable编码(color)
a=pd.DataFrame(X[:,2])
a=a.replace(["J","I","H", "G", "F","E", "D"],[1,2,3,4,5,6,7])
X[:,2]=a.iloc[:,0]
#对第四列进行编码
a=pd.DataFrame(X[:,3])
a=a.replace(["I1","SI2","SI1","VS2","VS1","VVS2","VVS1","IF"],[1,2,3,4,5,6,7,8])
X[:,3]=a.iloc[:,0]

3、选出自变量中与y相关性前n的值

X=pd.DataFrame(X)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
SelectKBest = SelectKBest(f_regression, k=9)
bestFeature = SelectKBest.fit_transform(X,y)
SelectKBest.get_support()
X.columns[SelectKBest.get_support()]
X1=X.loc[:,[0, 1, 2, 3, 4, 5, 6, 7, 8]]

4、在使用多项式回归前要对数据集进行多项式处理,代码如下:

#多项式回归的数据预处理
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X1)

5、划分训练集和测试集设置初始random_state,使用回归器进行预测,使用R^2系数进行评判模型准确度。这里需要不断调整random_state的值以得到最佳的预测值因此这里我把它写成一个循环,避免手动调节

for i in range(0,100):
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size = 0.2, random_state = i)
    lin_reg_2 = LinearRegression()
    lin_reg_2.fit(X_train, y_train)
    y_pred=lin_reg_2.predict(X_test)
#    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    print(i," ",r2_score(y_test, y_pred))

6、参数调节

通过对步骤3中n值的调整、步骤4中degree值的调整以及步骤5中random_state的调整我们可以得出当 n为9时,degree取2,random_state取4时模型的准确度最高,其值达到了96.63%。

最后附上完整代码:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
dataset=pd.read_csv('diamonds.csv')
dataset = dataset[(dataset[['x','y','z']] != 0).all(axis=1)]
print(dataset.describe())#查看数据集情况
dataset.count()# 查看非缺失值的数量
#data.info()
X =dataset.iloc[:,:-1].values
y = dataset.iloc[:,9].values
from sklearn import preprocessing
#对第二列进行lable编码(cut)
X[:,0]=X[:,0]*100
le = preprocessing.LabelEncoder()
le.fit(["Fair", "Good", "Very Good","Premium", "Ideal"])
X[:,1]=le.transform(X[:,1])
#对第三列进行lable编码(color)
a=pd.DataFrame(X[:,2])
a=a.replace(["J","I","H", "G", "F","E", "D"],[1,2,3,4,5,6,7])
X[:,2]=a.iloc[:,0]
#对第四列进行编码
a=pd.DataFrame(X[:,3])
a=a.replace(["I1","SI2","SI1","VS2","VS1","VVS2","VVS1","IF"],[1,2,3,4,5,6,7,8])
X[:,3]=a.iloc[:,0]
X=pd.DataFrame(X)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
SelectKBest = SelectKBest(f_regression, k=9)
bestFeature = SelectKBest.fit_transform(X,y)
SelectKBest.get_support()
X.columns[SelectKBest.get_support()]
X1=X.loc[:,[0, 1, 2, 3, 4, 5, 6, 7, 8]]
#多项式回归的数据预处理
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X1)
for i in range(0,100):
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size = 0.2, random_state = i)
    lin_reg_2 = LinearRegression()
    lin_reg_2.fit(X_train, y_train)
    y_pred=lin_reg_2.predict(X_test)
#    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    print(i," ",r2_score(y_test, y_pred))

 

评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值