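Project: House price prediction (Ames, Iowa) (predict sale prices and practice feature engineering, random forests (RF) and gradient boosting)
Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/tutorials
Data overview: train is 1460 × 81, test is 1459 × 80
1. Import the required libraries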
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
sns.set(style = "darkgrid")
plt.rcParams["font.family"] = "SimHei"
plt.rcParams["axes.unicode_minus"] = False
warnings.filterwarnings("ignore")
2. Read the data
# 2. Read the data
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
# Get a first look at the data
print(train.shape)
print(test.shape)
# display(train.head(2))
# display(test.head(2))
# train.info()
test.info()  # both datasets contain missing values
(1460, 81)
(1459, 80)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id 1459 non-null int64
MSSubClass 1459 non-null int64
MSZoning 1455 non-null object
LotFrontage 1232 non-null float64
LotArea 1459 non-null int64
Street 1459 non-null object
Alley 107 non-null object
LotShape 1459 non-null object
LandContour 1459 non-null object
Utilities 1457 non-null object
LotConfig 1459 non-null object
LandSlope 1459 non-null object
Neighborhood 1459 non-null object
Condition1 1459 non-null object
Condition2 1459 non-null object
BldgType 1459 non-null object
HouseStyle 1459 non-null object
OverallQual 1459 non-null int64
OverallCond 1459 non-null int64
YearBuilt 1459 non-null int64
YearRemodAdd 1459 non-null int64
RoofStyle 1459 non-null object
RoofMatl 1459 non-null object
Exterior1st 1458 non-null object
Exterior2nd 1458 non-null object
MasVnrType 1443 non-null object
MasVnrArea 1444 non-null float64
ExterQual 1459 non-null object
ExterCond 1459 non-null object
Foundation 1459 non-null object
BsmtQual 1415 non-null object
BsmtCond 1414 non-null object
BsmtExposure 1415 non-null object
BsmtFinType1 1417 non-null object
BsmtFinSF1 1458 non-null float64
BsmtFinType2 1417 non-null object
BsmtFinSF2 1458 non-null float64
BsmtUnfSF 1458 non-null float64
TotalBsmtSF 1458 non-null float64
Heating 1459 non-null object
HeatingQC 1459 non-null object
CentralAir 1459 non-null object
Electrical 1459 non-null object
1stFlrSF 1459 non-null int64
2ndFlrSF 1459 non-null int64
LowQualFinSF 1459 non-null int64
GrLivArea 1459 non-null int64
BsmtFullBath 1457 non-null float64
BsmtHalfBath 1457 non-null float64
FullBath 1459 non-null int64
HalfBath 1459 non-null int64
BedroomAbvGr 1459 non-null int64
KitchenAbvGr 1459 non-null int64
KitchenQual 1458 non-null object
TotRmsAbvGrd 1459 non-null int64
Functional 1457 non-null object
Fireplaces 1459 non-null int64
FireplaceQu 729 non-null object
GarageType 1383 non-null object
GarageYrBlt 1381 non-null float64
GarageFinish 1381 non-null object
GarageCars 1458 non-null float64
GarageArea 1458 non-null float64
GarageQual 1381 non-null object
GarageCond 1381 non-null object
PavedDrive 1459 non-null object
WoodDeckSF 1459 non-null int64
OpenPorchSF 1459 non-null int64
EnclosedPorch 1459 non-null int64
3SsnPorch 1459 non-null int64
ScreenPorch 1459 non-null int64
PoolArea 1459 non-null int64
PoolQC 3 non-null object
Fence 290 non-null object
MiscFeature 51 non-null object
MiscVal 1459 non-null int64
MoSold 1459 non-null int64
YrSold 1459 non-null int64
SaleType 1458 non-null object
SaleCondition 1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB
3. Data cleaning
Data integration (horizontal/vertical)
# Keep each dataset's Id column so the rows can be retrieved and merged later
train_ID = train["Id"]
test_ID = test["Id"]
# print(test_ID)
train_y = train["SalePrice"]
# print(train_y)
# print(test.shape)
# Row counts, so each dataset can later be sliced back out by position
ntrain=train.shape[0]
ntrain
ntest = test.shape[0]
ntest
1459
# Merge the two datasets
# train.drop("Id",axis = 1,inplace = True)
# test.drop("Id",axis = 1,inplace = True)
all_data = pd.concat((train,test)).reset_index(drop = True)
all_data.shape
all_data.drop(['SalePrice'],axis = 1,inplace = True)
all_data.head()
![a0d516f2e5830ed61f52893815aa1d8b.png](https://i-blog.csdnimg.cn/blog_migrate/1d9a9d144308035ca176912b0ff34308.jpeg)
3.1 Duplicates
# Neither dataset contains duplicate rows
train.duplicated().sum()
test.duplicated().sum()
0
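3.2 Missing values
- Missing at random
- Missing systematically
3.2.1 Locate missing values and examine their distribution
info()/isnull()
# Compute the missing rate (%) of each column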
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio.drop(missing_ratio[missing_ratio == 0].index).sort_values(ascending=False)
missing_ratio.shape
(34,)
# Plot the missing rates
plt.figure(figsize=(15,8))
sns.barplot(x=missing_ratio.index,y=missing_ratio)
plt.xticks(rotation=90,fontsize=12)
plt.xlabel("Feature",fontsize=15)
plt.ylabel("missing_ratio",fontsize=15)
Text(0, 0.5, 'missing_ratio')
![b3ace372064e37e5ee613cacd8cc9355.png](https://i-blog.csdnimg.cn/blog_migrate/f35eb8721c96bd48b62375361495c295.jpeg)
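The same information can also be tabulated instead of read off the chart. The following is a minimal sketch on a small made-up DataFrame; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": counts / len(df) * 100,
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("pct_missing", ascending=False)

# Toy data (hypothetical values), not the Ames dataset
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```

Columns without missing values (here `LotArea`) are left out, which mirrors the filtering step applied to `missing_ratio` above.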
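Correlation exploration
- Guard against multicollinearity and understand how each independent variable relates to the dependent variable.
  - Remedies: https://zhuanlan.zhihu.com/p/72722146
  - Manual removal / stepwise regression / a larger sample / ridge regression
- Check how strongly the variables with a high missing rate correlate with the dependent variable, to decide how to handle their missing values (drop them / add an indicator column)
- Check whether the strongly correlated variables contain outliers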
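One common way to quantify multicollinearity is the variance inflation factor (VIF). For standardized predictors, the VIF of each feature equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below applies this to synthetic data; the column names and the helper `vif_from_corr` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def vif_from_corr(X: pd.DataFrame) -> pd.Series:
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "area": area,
    "area_copy": area + rng.normal(0, 30, n),  # almost a duplicate of "area"
    "age": rng.normal(40, 10, n),              # generated independently of the other two
})
vif = vif_from_corr(toy)
print(vif.round(2))
```

A VIF close to 1 means a feature is barely explained by the others; values above roughly 10 are often treated as a sign of strong collinearity, although that cutoff is a rule of thumb rather than a fixed standard.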
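# Draw a heatmap to inspect the correlations
plt.figure(figsize=(15,12))
corrmat = train.select_dtypes(include="number").corr()  # correlations are only computed for numeric columns
ax = sns.heatmap(corrmat,vmax=1)
# Widen the y-limits by half a cell: a workaround for matplotlib versions (such as 3.1.1) that cut off the top and bottom heatmap rows
a,b = ax.get_ylim()
ax.set_ylim(a+0.5,b-0.5)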
(38.0, 0.0)
![de4650d519355550ccae47d06b010505.png](https://i-blog.csdnimg.cn/blog_migrate/0b69c5a95560a450f2cece8ba3bd5b1f.jpeg)
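3.2.2 Handling missing values
- Drop missing values
  - For data with only a few missing entries
- Fill missing values
  - Numeric
    - Mean / conditional (group) mean
    - Median
  - Categorical
    - Mode via mode() / a value taken from similar records
    - A dedicated value ("None")
- Add a separate column flagging whether the value is missing (missing rate > 80%)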
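The strategies listed above can be sketched on a toy DataFrame. The values are made up and the column `LotFrontage_missing` is a hypothetical name; the conditional (group-wise) fill is left out here because the notebook applies it to `LotFrontage` further down.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not the Ames dataset
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 70.0, 100.0, 120.0, np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd", "Attchd", np.nan],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA", "SBrkr", "SBrkr"],
})

filled = toy.copy()
# Indicator column: record which rows were missing before anything is filled
filled["LotFrontage_missing"] = toy["LotFrontage"].isnull().astype(int)
# Numeric: fill with the median (use .mean() instead for a mean fill)
filled["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())
# Categorical: a dedicated "None" level when missing means "no such facility"
filled["GarageType"] = toy["GarageType"].fillna("None")
# Categorical: the most frequent level
filled["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

# Deletion: drop rows only when very few values are missing in the column
dropped = toy.dropna(subset=["Electrical"])
print(filled)
```

Creating the indicator before filling matters: once the gaps are filled, the information about which rows were originally missing is gone.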
# print(train["Alley"].value_counts())
train["Alley"] = train["Alley"].fillna("None")  # fill NaN first so the missing group shows up in the plot
# print(train["Alley"].value_counts())
sns.stripplot(x='Alley',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7655a88>
![26c1899a0791577d4f357a284e9968cd.png](https://i-blog.csdnimg.cn/blog_migrate/257f60337f1f48c24257be267ec6a6d2.png)
# print(train["PoolQC"].value_counts())
train["PoolQC"] = train["PoolQC"].fillna("None")
# print(train["PoolQC"].value_counts())
sns.stripplot(x='PoolQC',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f79faf48>
![9002b63934184795fafdae4a6a1dee99.png](https://i-blog.csdnimg.cn/blog_migrate/cd788d4ff9e172fa7604930fb03f7238.png)
# print(train["MiscFeature"].value_counts())
train["MiscFeature"] = train["MiscFeature"].fillna("None")
# print(train["MiscFeature"].value_counts())
sns.stripplot(x='MiscFeature',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7a66e48>
![9345affe6aa4f0304b7b7b29f93da997.png](https://i-blog.csdnimg.cn/blog_migrate/0563a5f18e88f22d32549dcc8c9fe936.png)
# print(train["Fence"].value_counts())
train["Fence"] = train["Fence"].fillna("None")
# print(train["Fence"].value_counts())
sns.stripplot(x='Fence',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7d47488>
![387193f136348f8798a1930d56d2a887.png](https://i-blog.csdnimg.cn/blog_migrate/bc1d7068d9c3c9c71e0a593c3d6ecabf.png)
missing_ratio.head()
PoolQC 99.657417
MiscFeature 96.402878
Alley 93.216855
Fence 80.438506
FireplaceQu 48.646797
dtype: float64
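# For "PoolQC", "MiscFeature", "Alley" and "Fence", a missing value means the house has no such facility.
# The plots above show that the missing rate is very high, and that houses with a missing value span a wide SalePrice range that does not differ clearly from houses with a value.
# These columns can either be dropped or replaced by an indicator column marking whether the value is missing.
# Here they are dropped.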
all_data = all_data.drop(["PoolQC","MiscFeature","Alley","Fence"],axis = 1)
all_data.shape
(2919, 76)
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio.head(5)
FireplaceQu 48.646797
LotFrontage 16.649538
GarageCond 5.447071
GarageFinish 5.447071
GarageQual 5.447071
dtype: float64
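- Fill missing values in numeric variables (mean/median)
  - Continuous
  - Discrete
# Summary statistics and column names of the numeric features with a missing rate above 0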
all_data[missing_ratio.index].describe()
![fe788a2f87ff521bb20e07f0334ab097.png](https://i-blog.csdnimg.cn/blog_migrate/8656f639cf497915df9981f45dc16f04.jpeg)
# (missing rate 16.6%)
# LotFrontage: linear feet of street connected to property
# (missing rate below 6%)
# GarageYrBlt
# GarageArea
# GarageCars
# BsmtFullBath
# BsmtHalfBath
# MasVnrArea: Masonry veneer area in square feet
# Basement-related
# BsmtUnfSF: Unfinished square feet of basement area
# TotalBsmtSF: Total square feet of basement area
# BsmtFinSF2
# BsmtFinSF1
# For the features with a missing rate below 6%, a missing value most likely means the house has no such facility, so fill with 0
cols = ["GarageYrBlt","MasVnrArea","BsmtFullBath","BsmtHalfBath","BsmtUnfSF","TotalBsmtSF","GarageArea","GarageCars","BsmtFinSF2","BsmtFinSF1"]
for col in cols:
    all_data[col] = all_data[col].fillna(0)
# Fill "LotFrontage" conditionally, using the median of the house's neighborhood
plt.scatter(x=train["LotFrontage"],y=train["SalePrice"])
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
all_data["LotFrontage"].isnull().sum()
0
![6b464c99d42c6c7c3a59ff272a8e1f5f.png](https://i-blog.csdnimg.cn/blog_migrate/5e0cec39c9e777ada171fe33192c5582.png)
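The neighborhood-level fill can be traced on a toy frame. One edge case is worth noting: if every `LotFrontage` value in a neighborhood is missing, the group median is itself NaN and the gap survives. The fallback to the overall median in the sketch below is an assumption added for illustration; the notebook did not need it, since the check above returned 0 remaining missing values. The neighborhood names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not the Ames dataset
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "Blueste"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Step 1: fill each gap with the median of its own neighborhood
by_group = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median()))

# Step 2 (assumed fallback): a neighborhood with no observed value stays NaN,
# so fall back to the median over all observed values
toy["LotFrontage_filled"] = by_group.fillna(toy["LotFrontage"].median())
print(toy)
```

Here the single "Blueste" row has no observed neighbor, so only the second step can fill it.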
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio
FireplaceQu 48.646797
GarageQual 5.447071
GarageFinish 5.447071
GarageCond 5.447071
GarageType 5.378554
BsmtCond 2.809181
BsmtExposure 2.809181
BsmtQual 2.774923
BsmtFinType2 2.740665
BsmtFinType1 2.706406
MasVnrType 0.822199
MSZoning 0.137033
Utilities 0.068517
Functional 0.068517
Electrical 0.034258
Exterior1st 0.034258
Exterior2nd 0.034258
SaleType 0.034258
KitchenQual 0.034258
dtype: float64
- Fill missing values in categorical variables
all_data[missing_ratio.index].describe()
![0ee904df0173ef35d3cc2c08876db291.png](https://i-blog.csdnimg.cn/blog_migrate/b8b070411808f046ed8d15b9f9dbddb0.jpeg)
cols = ["FireplaceQu","GarageQual","GarageFinish","GarageCond","GarageType","BsmtCond","BsmtExposure","BsmtQual","BsmtFinType2","BsmtFinType1"]
# These three groups (fireplace, garage, basement) are missing because the house has no such facility, so fill with "None"
for col in cols:
    all_data[col] = all_data[col].fillna("None")
# MasVnrType: Masonry veneer type (a missing value means the house has none)
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
# MSZoning: Identifies the general zoning classification of the sale
# Utilities: Type of utilities available
# Functional: Home functionality (Assume typical unless deductions are warranted)
# Electrical: Electrical system
# Exterior1st: Exterior covering on house
# Exterior2nd: Exterior covering on house (if more than one material)
# SaleType: Type of sale
# KitchenQual: Kitchen quality
cols = ["MSZoning","Utilities","Functional","Electrical","Exterior1st","Exterior2nd","SaleType","KitchenQual"]
# Only a handful of values are missing in these features, so fill each with its mode
for col in cols:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio
Series([], dtype: float64)
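3.3 Outliers
# Inspect the correlations
plt.figure(figsize=(15,12))
corrmat = train.select_dtypes(include="number").corr()  # correlations are only computed for numeric columns
sns.heatmap(corrmat)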
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7dc0648>
![cb7be440b013744db5f83f1abfd13221.png](https://i-blog.csdnimg.cn/blog_migrate/53dd7725a1166047af3adb371060a2dc.jpeg)
# The 15 variables most correlated with SalePrice (SalePrice itself included)
top15_corr = corrmat.nlargest(15,'SalePrice')["SalePrice"].index
# print(top15_corr)
plt.figure(figsize=(12,8))
ax = sns.heatmap(train[top15_corr].corr(),annot=True,square=True,fmt=".2f",annot_kws={"size":10})
a,b = ax.get_ylim()
ax.set_ylim(a+0.5,b-0.5)
(15.0, 0.0)
![25c1c4fe2a8190917d2f225493900f93.png](https://i-blog.csdnimg.cn/blog_migrate/65db7c6a38cdbf67acb2f057177ad49c.jpeg)
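# SalePrice distribution across OverallQual (overall material and finish quality) and GarageCars, two of the features most strongly correlated with price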
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax1 = sns.boxplot(x = train["OverallQual"],y = train["SalePrice"])
ax2 = fig.add_subplot(122)
ax2 = sns.boxplot(x = train["GarageCars"],y = train["SalePrice"])
![6bb91d6c8cf7f8dd1fddd3fae82bdb7a.png](https://i-blog.csdnimg.cn/blog_migrate/901b4f73368edf89ba494bec27f54d28.png)
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax1 = sns.boxplot(x = train["FullBath"],y = train["SalePrice"])
ax2 = fig.add_subplot(122)
ax2 = sns.boxplot(x = train["TotRmsAbvGrd"],y = train["SalePrice"])
![dc85b285fc72fe5b3c4588a1b4b259ed.png](https://i-blog.csdnimg.cn/blog_migrate/c727f0411793e710c114e25dc189c432.png)
fig = plt.figure(figsize=(15,8))
sns.boxplot(x = train["YearBuilt"],y = train["SalePrice"])
plt.xticks(rotation=90)
![6d65f56a4a1f0b9d9a4218e1417278ac.png](https://i-blog.csdnimg.cn/blog_migrate/ccabcf7000bb822c5f5dc5881c863f45.jpeg)
# 'GrLivArea', 'GarageArea','TotalBsmtSF', '1stFlrSF'
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax1.scatter(train["GrLivArea"],train["SalePrice"])
<matplotlib.collections.PathCollection at 0x1d2f9d25348>
![3b51fb102baa75e0086820e6019efdae.png](https://i-blog.csdnimg.cn/blog_migrate/bad52d6fd355f6ba9501d637c662a501.png)
train[(train["GrLivArea"]>4000)&(train["SalePrice"]<200000)]
# Remove the outliers
all_data.drop([523,1298],inplace=True)
all_data[all_data["Id"]==1299]
![e825988d7823ca1cfc595008ef72f5ab.png](https://i-blog.csdnimg.cn/blog_migrate/15700c8a33020d6ff62dfec63a759b38.png)
train_y.drop([523,1298],inplace=True)
train_y.shape
(1458,)
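Hard-coded row labels such as 523 and 1298 only stay valid while the index is unchanged. A sketch of the same rule applied through a boolean mask, so that features and target are filtered together, might look like the following. The toy rows are illustrative; the thresholds are the ones used in the filter above.

```python
import pandas as pd

# Toy rows (illustrative values), not the full Ames dataset
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "GrLivArea": [1710, 4676, 1786, 5642, 4316],
    "SalePrice": [208500, 184750, 223500, 160000, 755000],
})

# Same rule as the filter above: very large living area but a low price
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 200000)
outlier_index = toy.index[is_outlier]

features = toy.drop(columns="SalePrice")
target = toy["SalePrice"]

# Drop the same row labels from both objects so they stay aligned
features_clean = features.drop(index=outlier_index)
target_clean = target.drop(index=outlier_index)
print(features_clean)
```

The large but expensive house in the last row is kept, because it follows the general area-price trend rather than contradicting it.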
# 'GrLivArea', 'GarageArea','TotalBsmtSF', '1stFlrSF'
# plt.scatter(train["GarageArea"],train["SalePrice"])
# plt.scatter(train["TotalBsmtSF"],train["SalePrice"])
plt.scatter(train["1stFlrSF"],train["SalePrice"])
# train[(train["1stFlrSF"]>4000)&(train["SalePrice"]<200000)]
# The outlier has Id 1299 and has already been removed from all_data
# No outliers remain
<matplotlib.collections.PathCollection at 0x1d2fa4bca08>
![85cd7142d6efc4028966ebe85b3995b6.png](https://i-blog.csdnimg.cn/blog_migrate/b6bed0bcec2a5d5a35e4d8e283ab4b46.png)
sns.boxplot(x=train["1stFlrSF"])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fa468e48>
![7ef5f9389e158a0c6aaf82a1c6f62151.png](https://i-blog.csdnimg.cn/blog_migrate/da910519219d64ff07c85af2eb38d56e.png)
sns.histplot(train["1stFlrSF"],kde=True,stat="density")  # histplot replaces the deprecated distplot
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fa1f16c8>
![ebb48309b4be826f3e71aca6f243083c.png](https://i-blog.csdnimg.cn/blog_migrate/a62ec43682d7c6e04f25938c497cf486.png)
4. Feature engineering
all_data.head()
all_data.shape
(2917, 76)
Feature extraction
train[["GarageYrBlt","YearBuilt","YearRemodAdd"]]
train[["1stFlrSF","2ndFlrSF","GrLivArea","TotalBsmtSF"]].describe()
all_data["1stFlrSF_Perc"] = all_data["1stFlrSF"]/all_data["GrLivArea"]
all_data["totalSF_Perc"] = (all_data["1stFlrSF"]+all_data["2ndFlrSF"])/all_data["GrLivArea"]
# all_data["GrLivArea"].isnull().sum()
all_data.shape
(2917, 78)
Data transformation
- Categorical (labels)
  - Ordinal labels -> label encoding
  - Nominal labels -> one-hot encoding
- Numeric
all_cols = all_data.columns.tolist()
# all_cols[0]
lab_cols = []
for col in all_cols:
    if all_data[col].dtype == "object":
        lab_cols.append(col)
lab_cols  # names of all categorical columns
# lab_cols = pd.DataFrame(lab_cols).T
len(lab_cols)
39
# Ordinal labels -> label encoding
from sklearn.preprocessing import LabelEncoder
lab_cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
'ExterQual', 'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1',
'BsmtFinType2', 'Functional', 'BsmtExposure', 'GarageFinish', 'LandSlope',
'LotShape', 'PavedDrive', 'Street', 'CentralAir', 'MSSubClass','YrSold' ,
'MoSold')
for col in lab_cols:
    lab = LabelEncoder()
    all_data[col] = lab.fit_transform(list(all_data[col].values))
print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 78)
all_data.head(1)
![e9e953b3caf84bdbe778d1ae70c20427.png](https://i-blog.csdnimg.cn/blog_migrate/cdb105caefb75a936fb7704a1d74f028.png)
# Nominal variables -> one-hot encoding
all_data = pd.get_dummies(all_data)
all_data.shape
(2917, 216)
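One detail of the two encodings is easy to miss. For string labels, LabelEncoder numbers the sorted unique values, so the codes follow alphabetical order rather than quality order. The sketch below reproduces those alphabetical codes with `pd.factorize(..., sort=True)` to stay within pandas, and contrasts them with an explicit ordinal mapping. The quality order Po < Fa < TA < Gd < Ex with "None" = 0 is an assumption based on the dataset's quality scale, and the toy rows are made up.

```python
import pandas as pd

# Toy data (hypothetical rows), not the Ames dataset
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa", "TA"],
    "MSZoning": ["RL", "RM", "RL", "FV", "RL"],
})

# Alphabetical codes, matching what LabelEncoder yields for string labels
alpha_codes, alpha_levels = pd.factorize(toy["KitchenQual"], sort=True)
print(list(alpha_levels), list(alpha_codes))

# Explicit ordinal mapping (assumed order: Po < Fa < TA < Gd < Ex, "None" = 0)
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["KitchenQual_ord"] = toy["KitchenQual"].map(quality_order)

# Nominal feature: one indicator column per level
encoded = pd.get_dummies(toy, columns=["MSZoning"])
print(encoded)
```

With alphabetical codes, "Ex" (the best level) gets the smallest number. Tree-based models can cope with that, but a linear model such as ridge regression reads the codes as magnitudes, so an explicit mapping may be the safer choice there.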
Feature selection
# Correlation
corr_matrix = all_data.corr().abs()
corr_matrix.head()
![cc9fa8df0fc9ff2afaf9c7277bcbeea8.png](https://i-blog.csdnimg.cn/blog_migrate/0946fb82bd78a37c714f1397d55684a1.jpeg)
threshold = 0.9
# Keep only the upper triangle of the correlation matrix
upper_corr = corr_matrix.where(np.triu(corr_matrix,k=1).astype(bool))  # np.bool was removed from NumPy; use the built-in bool
upper_corr.head()
![cf38dde8f31118cd2976d8aeda42acf5.png](https://i-blog.csdnimg.cn/blog_migrate/a9cb116ab19f0ef447f45b78db0fd594.jpeg)
# Drop the features whose correlation with an earlier feature exceeds 0.9
corr_drop = [column for column in upper_corr.columns if any(upper_corr[column]>threshold)]
corr_drop
['1stFlrSF_Perc',
'totalSF_Perc',
'Exterior2nd_CmentBd',
'Exterior2nd_MetalSd',
'Exterior2nd_VinylSd',
'GarageType_None',
'RoofStyle_Hip',
'SaleType_New',
'Utilities_NoSeWa']
all_data = all_data.drop(columns = corr_drop)
all_data.shape
(2917, 207)
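To see which member of a correlated pair this filter removes, here is a minimal sketch on synthetic data (the column names are hypothetical). It builds the mask from `np.ones` so that the mask does not depend on the correlation values themselves.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + rng.normal(scale=0.01, size=n),  # nearly collinear with "a"
    "b": rng.normal(size=n),                             # generated independently
})

corr = toy.corr().abs()
# True strictly above the diagonal, so each pair of columns is checked once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because only the upper triangle is inspected, the column that appears later in the frame is the one dropped from each highly correlated pair, and the earlier one is kept.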
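5. Modeling
Background concepts
- Overfitting and underfitting
  - Overfitting: the model is too complex, so it performs well on the training set but poorly on the test set, which means it generalizes badly. Remedies: https://blog.csdn.net/u010899985/article/details/79471909
- Cross-validation
- How each model works, its strengths and weaknesses, and where it applies (how to choose a model, and by what process?)
- Meaning of the model parameters: https://scikit-learn.org/stable/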
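Overfitting and cross-validation can be made concrete with a small experiment. The sketch below fits polynomials of increasing degree to noisy synthetic data and reports the mean training and validation RMSE over K folds. It implements K-fold splitting by hand with NumPy for illustration only; the helper `kfold_rmse` and the data are assumptions, not part of the original notebook.

```python
import numpy as np

def kfold_rmse(x, y, degree, k=5, seed=0):
    """Mean train and validation RMSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    train_scores, val_scores = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        # Score the same fit on the training folds and on the held-out fold
        for idx, scores in ((train_idx, train_scores), (val_idx, val_scores)):
            residual = y[idx] - np.polyval(coefs, x[idx])
            scores.append(np.sqrt(np.mean(residual ** 2)))
    return float(np.mean(train_scores)), float(np.mean(val_scores))

# Synthetic data: a smooth curve plus noise
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

for degree in (1, 3, 9):
    train_rmse, val_rmse = kfold_rmse(x, y, degree)
    print(f"degree={degree}: train RMSE={train_rmse:.3f}, validation RMSE={val_rmse:.3f}")
```

The training error can only go down as the degree grows, while the validation error is measured on points the fit has not seen. A widening gap between the two is the usual symptom of overfitting.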
train = all_data[all_data["Id"]<=1460]
print(train.shape)
test = all_data[all_data["Id"]>1460]
print(test.shape)
# train["Id"]
(1458, 207)
(1459, 207)
fig,ax = plt.subplots(1,2)
fig.set_size_inches(15,5)
sns.histplot(train_y,kde=True,stat="density",ax=ax[0])  # histplot replaces the deprecated distplot
sns.histplot(np.log(train_y),kde=True,stat="density",ax=ax[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fb188848>
![8ac599782479e13ad7bc6e560e59a449.png](https://i-blog.csdnimg.cn/blog_migrate/c4ec8b751ed4146ce08416a5622021ec.png)
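The two plots compare the raw and the log-transformed target. A minimal sketch of that comparison on synthetic right-skewed prices follows; the data are generated, not taken from the Ames set. It also shows the back-transformation that would be needed if the model were trained on the log target, a step the notebook leaves commented out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (hypothetical), not the Ames data
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))

log_prices = np.log(prices)
print("skew of raw prices:", round(prices.skew(), 2))
print("skew of log prices:", round(log_prices.skew(), 2))

# Predictions made on the log scale must be mapped back with the inverse
# transform (np.exp for np.log) before they are reported or submitted
restored = np.exp(log_prices)
```

A skewness near zero indicates a roughly symmetric distribution, which tends to suit linear models with a squared-error loss better than a long right tail.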
# train_y = np.log(train_y)
# Import the modeling libraries
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import train_test_split,cross_val_score
train
![ee5ac477d6422913b3998d0a0e5ed91a.png](https://i-blog.csdnimg.cn/blog_migrate/e17010693a9d817b05beb986895c700e.jpeg)
train_y
0 208500
1 181500
2 223500
3 140000
4 250000
...
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, Length: 1458, dtype: int64
# Per-fold RMSE from 5-fold cross-validation on the training data
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model,train,train_y,scoring = "neg_mean_squared_error",cv = 5))
    return (rmse)
# model_ridge = Ridge()
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean()
            for alpha in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot()
plt.xlabel("alpha")
plt.ylabel("rmse")
Text(0, 0.5, 'rmse')
![6786424dabc630f6804d81f8a806b26a.png](https://i-blog.csdnimg.cn/blog_migrate/4a68a5af413151d9fa29e2db1a67c78f.png)
cv_ridge
0.05 27045.423937
0.10 26980.425644
0.30 26792.870318
1.00 26457.515190
3.00 26142.818396
5.00 26046.790986
10.00 26002.523512
15.00 26028.112302
30.00 26159.737196
50.00 26323.603320
75.00 26492.612852
dtype: float64
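The table can also be read programmatically. The sketch below copies the RMSE values from the output above into a Series and looks up the alpha with the lowest mean RMSE; nothing is refitted.

```python
import pandas as pd

# Mean 5-fold RMSE per alpha, copied from the output above
cv_ridge = pd.Series(
    [27045.423937, 26980.425644, 26792.870318, 26457.515190, 26142.818396,
     26046.790986, 26002.523512, 26028.112302, 26159.737196, 26323.603320,
     26492.612852],
    index=[0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75],
)

best_alpha = cv_ridge.idxmin()
print("best alpha:", best_alpha, "RMSE:", round(cv_ridge.min(), 2))

# Relative gap between alpha=5 and the best alpha, in percent
gap_pct = (cv_ridge.loc[5] - cv_ridge.min()) / cv_ridge.min() * 100
print("gap at alpha=5: {:.2f}%".format(gap_pct))
```

On this grid the minimum sits at alpha = 10, and the curve is flat around it: the RMSE at alpha = 5 is less than 0.2% higher.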
clf = Ridge(alpha=5)
clf.fit(train,train_y)
Ridge(alpha=5, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
random_state=None, solver='auto', tol=0.001)
predict = clf.predict(test)
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = predict
sub.head(10)
![e64dc007c355952d14d6d05d76e0571e.png](https://i-blog.csdnimg.cn/blog_migrate/6a2b8998aa720d9909803519438eab8d.jpeg)
# sub.to_csv('./submission.csv',index=False)