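Kaggle House Price Prediction: What the Features Mean

Project: house price prediction (Ames, Iowa). Predict sale prices and practice feature engineering, random forests (RF), and gradient boosting.

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/tutorials

Data overview: train is 1460 × 81, test is 1459 × 80

1. Import libraries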

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline

sns.set(style = "darkgrid")

plt.rcParams["font.family"] = "SimHei"
plt.rcParams["axes.unicode_minus"] = False
warnings.filterwarnings("ignore")

2. Read the data

# 2. Read the data
train = pd.read_csv(".train.csv")
test = pd.read_csv(".test.csv")
# Get a first look at the data

print(train.shape)
print(test.shape)
# display(train.head(2))
# display(test.head(2))

# train.info()
test.info()  # both datasets contain missing values
(1460, 81)
(1459, 80)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id               1459 non-null int64
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-null object
Exterior2nd      1458 non-null object
MasVnrType       1443 non-null object
MasVnrArea       1444 non-null float64
ExterQual        1459 non-null object
ExterCond        1459 non-null object
Foundation       1459 non-null object
BsmtQual         1415 non-null object
BsmtCond         1414 non-null object
BsmtExposure     1415 non-null object
BsmtFinType1     1417 non-null object
BsmtFinSF1       1458 non-null float64
BsmtFinType2     1417 non-null object
BsmtFinSF2       1458 non-null float64
BsmtUnfSF        1458 non-null float64
TotalBsmtSF      1458 non-null float64
Heating          1459 non-null object
HeatingQC        1459 non-null object
CentralAir       1459 non-null object
Electrical       1459 non-null object
1stFlrSF         1459 non-null int64
2ndFlrSF         1459 non-null int64
LowQualFinSF     1459 non-null int64
GrLivArea        1459 non-null int64
BsmtFullBath     1457 non-null float64
BsmtHalfBath     1457 non-null float64
FullBath         1459 non-null int64
HalfBath         1459 non-null int64
BedroomAbvGr     1459 non-null int64
KitchenAbvGr     1459 non-null int64
KitchenQual      1458 non-null object
TotRmsAbvGrd     1459 non-null int64
Functional       1457 non-null object
Fireplaces       1459 non-null int64
FireplaceQu      729 non-null object
GarageType       1383 non-null object
GarageYrBlt      1381 non-null float64
GarageFinish     1381 non-null object
GarageCars       1458 non-null float64
GarageArea       1458 non-null float64
GarageQual       1381 non-null object
GarageCond       1381 non-null object
PavedDrive       1459 non-null object
WoodDeckSF       1459 non-null int64
OpenPorchSF      1459 non-null int64
EnclosedPorch    1459 non-null int64
3SsnPorch        1459 non-null int64
ScreenPorch      1459 non-null int64
PoolArea         1459 non-null int64
PoolQC           3 non-null object
Fence            290 non-null object
MiscFeature      51 non-null object
MiscVal          1459 non-null int64
MoSold           1459 non-null int64
YrSold           1459 non-null int64
SaleType         1458 non-null object
SaleCondition    1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB

3. Data cleaning

Data integration (horizontal/vertical)

# Keep each dataset's Id column so it can be retrieved after the merge
train_ID = train["Id"]
test_ID = test["Id"]
# print(test_ID)

train_y = train["SalePrice"]
# print(train_y)
# print(test.shape)

# Record the row counts so each dataset can be extracted by position later
ntrain=train.shape[0]
ntrain
ntest = test.shape[0]
ntest
1459
# Merge the data
# train.drop("Id",axis = 1,inplace = True)
# test.drop("Id",axis = 1,inplace = True)
all_data = pd.concat((train,test)).reset_index(drop = True)
all_data.shape
all_data.drop(['SalePrice'],axis = 1,inplace = True)
all_data.head()

[Output: first five rows of all_data]
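The row counts saved above allow the combined frame to be split back by position. A minimal sketch on toy data follows; the two small frames, their values, and the names `toy_train`, `toy_test`, and `n_toy_train` are made up for illustration. Note that a count taken before the merge stops being valid if training rows are later dropped from the combined frame.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are illustrative only).
toy_train = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [8450, 9600, 11250],
                          "SalePrice": [208500, 181500, 223500]})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [11622, 14267]})

n_toy_train = toy_train.shape[0]

# Stack the frames, reset the index, and drop the target column.
combined = pd.concat((toy_train, toy_test)).reset_index(drop=True)
combined = combined.drop(columns=["SalePrice"])

# Split back by position: the first n_toy_train rows came from the training set.
train_part = combined.iloc[:n_toy_train]
test_part = combined.iloc[n_toy_train:]

print(train_part.shape, test_part.shape)
```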

3.1 Duplicates

# No duplicate rows
train.duplicated().sum()
test.duplicated().sum()
0

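3.2 Missing values

  • Missing at random
  • Missing systematically

3.2.1 Find missing values and their distribution

info()/isnull()

# Compute the missing ratio (in percent)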
missing_ratio = all_data.isnull().sum()/len(all_data)*100

missing_ratio = missing_ratio.drop(missing_ratio[missing_ratio == 0].index).sort_values(ascending=False)
missing_ratio.shape
(34,)
# Plot the missing ratios
plt.figure(figsize=(15,8))
sns.barplot(x=missing_ratio.index,y=missing_ratio)
plt.xticks(rotation=90,fontsize=12)
plt.xlabel("Feature",fontsize=15)
plt.ylabel("missing_ratio",fontsize=15)
Text(0, 0.5, 'missing_ratio')

[Figure: bar chart of missing ratio (%) by feature]

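Correlation exploration

  1. Guard against multicollinearity and understand how each independent variable relates to the dependent variable.
  • Remedies: https://zhuanlan.zhihu.com/p/72722146
    • Manual removal / stepwise regression / a larger sample / ridge regression
  2. Check how variables with a high missing ratio correlate with the dependent variable, to decide how to handle their missing values (drop them or add an indicator column).
  3. Check whether strongly correlated variables contain outliers.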
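One common way to quantify multicollinearity is the variance inflation factor (VIF). The original post does not use it; the sketch below is only an illustration on synthetic data. It relies on the fact that the VIFs equal the diagonal entries of the inverse of the predictors' correlation matrix. The column names `x1`, `x2`, and `x3` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

print(vif.round(2))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others; large values, here expected for `x1` and `x3`, flag redundant predictors.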
# Draw a heatmap to inspect the correlations
plt.figure(figsize=(15,12))

corrmat = train.corr()

ax = sns.heatmap(corrmat,vmax=1)
a,b = ax.get_ylim()
ax.set_ylim(a+0.5,b-0.5)
(38.0, 0.0)

[Figure: correlation heatmap of the numeric training features]

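3.2.2 Handling missing values

  • Drop missing values
    • When only a few values are missing
  • Fill missing values
    • Numeric
      • Mean / conditional mean
      • Median
    • Categorical
      • Mode, mode() / a value from similar rows
      • A dedicated value (None)
  • Add a column that flags whether the value is missing (missing ratio > 80%)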
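The strategies in this list can be sketched on a toy frame. The column names are borrowed from the dataset, but the values and the flag column `PoolQC_missing` are made up for illustration; which strategy fits which real column is a judgment call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
})

# Numeric: fill with the median.
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

# Categorical where NaN means "no such facility": use a dedicated label.
toy["GarageType"] = toy["GarageType"].fillna("None")

# Categorical with a few unknown values: fill with the mode.
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

# Mostly missing: keep only a 0/1 flag for missingness, then drop the column.
toy["PoolQC_missing"] = toy["PoolQC"].isnull().astype(int)
toy = toy.drop(columns=["PoolQC"])

print(toy)
```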
# print(train["Alley"].value_counts())
train["Alley"] = train["Alley"].fillna("None")  # fill NaN first, then plot to inspect the distribution
# print(train["Alley"].value_counts())
sns.stripplot(x='Alley',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7655a88>

[Figure: strip plot of SalePrice by Alley]
# print(train["PoolQC"].value_counts())
train["PoolQC"] = train["PoolQC"].fillna("None")
# print(train["PoolQC"].value_counts())
sns.stripplot(x='PoolQC',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f79faf48>

[Figure: strip plot of SalePrice by PoolQC]
# print(train["MiscFeature"].value_counts())
train["MiscFeature"] = train["MiscFeature"].fillna("None")
# print(train["MiscFeature"].value_counts())
sns.stripplot(x='MiscFeature',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7a66e48>

[Figure: strip plot of SalePrice by MiscFeature]
# print(train["Fence"].value_counts())
train["Fence"] = train["Fence"].fillna("None")
# print(train["Fence"].value_counts())
sns.stripplot(x='Fence',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7d47488>

[Figure: strip plot of SalePrice by Fence]
missing_ratio.head()
PoolQC         99.657417
MiscFeature    96.402878
Alley          93.216855
Fence          80.438506
FireplaceQu    48.646797
dtype: float64
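# In "PoolQC", "MiscFeature", "Alley", and "Fence", a missing value means the house has no such facility.
# The plots above show that most values are missing, and that the sale prices of rows with missing
# values are widely spread and not clearly different from those of rows with a value.
# So we can either drop these features or add a column that flags missingness.

# For this dataset, drop them for now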
all_data = all_data.drop(["PoolQC","MiscFeature","Alley","Fence"],axis = 1)
all_data.shape
(2919, 76)
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio.head(5)
FireplaceQu     48.646797
LotFrontage     16.649538
GarageCond       5.447071
GarageFinish     5.447071
GarageQual       5.447071
dtype: float64
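  • Fill missing values in numeric variables (mean/median)
    • Continuous
    • Discrete
# Summarize the features whose missing ratio is above 0 (describe() reports only the numeric ones)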
all_data[missing_ratio.index].describe()

[Output: describe() summary of the numeric features with missing values]
# (Missing ratio 16.6%)
# LotFrontage:linear feet of street connected to property

# (Missing ratio below 6%)
# GarageYrBlt 
# GarageArea 
# GarageCars

# BsmtFullBath 
# BsmtHalfBath

# MasVnrArea:Masonry veneer area in square feet

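# Basement-related
# BsmtUnfSF:Unfinished square feet of basement area 
# TotalBsmtSF:Total square feet of basement area
# BsmtFinSF2
# BsmtFinSF1
# Take the features with a missing ratio below 6%. Judging from the data description, the values are probably missing because the house lacks the facility, so fill with 0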
cols = ["GarageYrBlt","MasVnrArea","BsmtFullBath","BsmtHalfBath","BsmtUnfSF","TotalBsmtSF","GarageArea","GarageCars","BsmtFinSF2","BsmtFinSF1"]
for col in cols:
    all_data[col] = all_data[col].fillna(0)
# Fill "LotFrontage" with a conditional median (the median of houses in the same Neighborhood)
plt.scatter(x=train["LotFrontage"],y=train["SalePrice"])

all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

all_data["LotFrontage"].isnull().sum()
0
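Conditional imputation with `groupby(...).transform(...)` can be checked on a toy frame. The sketch below uses made-up neighborhoods and values. The final fallback to the overall median is an addition for groups whose values are all missing; the original code does not need it, since the check above returned 0.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Fill each missing value with the median of its own neighborhood.
group_filled = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median()))

# Neighborhood C has no observed value, so its median is NaN;
# fall back to the overall median of the observed values.
toy["LotFrontage"] = group_filled.fillna(toy["LotFrontage"].median())

print(toy)
```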

[Figure: scatter plot of SalePrice against LotFrontage]
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio
FireplaceQu     48.646797
GarageQual       5.447071
GarageFinish     5.447071
GarageCond       5.447071
GarageType       5.378554
BsmtCond         2.809181
BsmtExposure     2.809181
BsmtQual         2.774923
BsmtFinType2     2.740665
BsmtFinType1     2.706406
MasVnrType       0.822199
MSZoning         0.137033
Utilities        0.068517
Functional       0.068517
Electrical       0.034258
Exterior1st      0.034258
Exterior2nd      0.034258
SaleType         0.034258
KitchenQual      0.034258
dtype: float64
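  • Fill categorical missing values
all_data[missing_ratio.index].describe()

[Output: describe() summary of the remaining features with missing values]
cols = ["FireplaceQu","GarageQual","GarageFinish","GarageCond","GarageType","BsmtCond","BsmtExposure","BsmtQual","BsmtFinType2","BsmtFinType1"]
# In these three groups (fireplace, garage, basement) a missing value means the house lacks the facility, so fill with "None"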
for col in cols:
    all_data[col] = all_data[col].fillna("None")
# MasVnrType: Masonry veneer type (a missing value means there is none)
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")

# MSZoning: Identifies the general zoning classification of the sale
# Utilities: Type of utilities available
# Functional: Home functionality (Assume typical unless deductions are warranted)
# Electrical: Electrical system
# Exterior1st: Exterior covering on house
# Exterior2nd: Exterior covering on house (if more than one material)
# SaleType: Type of sale
# KitchenQual: Kitchen quality
cols = ["MSZoning","Utilities","Functional","Electrical","Exterior1st","Exterior2nd","SaleType","KitchenQual"]
# These features have only a few missing values, so fill them with the mode
for col in cols:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio
Series([], dtype: float64)

3.3 Outliers

# Inspect the correlations
plt.figure(figsize=(15,12))
corrmat = train.corr()
sns.heatmap(corrmat)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7dc0648>

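[Figure: correlation heatmap of the numeric training features]
# The 15 columns most correlated with SalePrice (SalePrice itself included)
top15_corr = corrmat.nlargest(15,'SalePrice')["SalePrice"].index
# print(top15_corr)
plt.figure(figsize=(12,8))
ax = sns.heatmap(train[top15_corr].corr(),annot=True,square=True,fmt=".2f",annot_kws={"size":10})
a,b = ax.get_ylim()
ax.set_ylim(a+0.5,b-0.5)
(15.0, 0.0)

[Figure: annotated correlation heatmap of the 15 columns most correlated with SalePrice]
# Distribution of SalePrice across OverallQual (overall material quality) and GarageCars, two of the features most correlated with the price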
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121) 
ax1 = sns.boxplot(x = train["OverallQual"],y = train["SalePrice"])

ax2 = fig.add_subplot(122)
ax2 = sns.boxplot(x = train["GarageCars"],y = train["SalePrice"])

[Figure: box plots of SalePrice by OverallQual (left) and by GarageCars (right)]
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121) 
ax1 = sns.boxplot(x = train["FullBath"],y = train["SalePrice"])

ax2 = fig.add_subplot(122)
ax2 = sns.boxplot(x = train["TotRmsAbvGrd"],y = train["SalePrice"])

[Figure: box plots of SalePrice by FullBath (left) and by TotRmsAbvGrd (right)]
fig = plt.figure(figsize=(15,8))
sns.boxplot(x = train["YearBuilt"],y = train["SalePrice"])
plt.xticks(rotation=90)
[Figure: box plot of SalePrice by YearBuilt, with 112 tick labels on the x-axis]
# 'GrLivArea', 'GarageArea','TotalBsmtSF', '1stFlrSF'
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax1.scatter(train["GrLivArea"],train["SalePrice"])
<matplotlib.collections.PathCollection at 0x1d2f9d25348>

[Figure: scatter plot of SalePrice against GrLivArea]
train[(train["GrLivArea"]>4000)&(train["SalePrice"]<200000)]

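# Remove the outliers found above (row labels 523 and 1298: GrLivArea > 4000 with SalePrice < 200000)
all_data.drop([523,1298],inplace=True)
all_data[all_data["Id"]==1299]  # returns no rows once the outlier is gone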

train_y.drop([523,1298],inplace=True)
train_y.shape
(1458,)
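The hard-coded row labels above can also be derived from the filter condition itself, which avoids typing labels by hand. A small sketch on toy data follows; the frame and its values are made up, while the thresholds are the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "GrLivArea": [1710, 4676, 2198, 5642, 4316],
    "SalePrice": [208500, 184750, 250000, 160000, 755000],
})

# Large living area but low price: the pattern treated as an outlier above.
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 200000)
outlier_idx = toy.index[is_outlier]

# Drop the flagged rows by label; the large but expensive house stays.
toy_clean = toy.drop(index=outlier_idx)

print(list(outlier_idx), toy_clean.shape)
```

The same `outlier_idx` can then be passed to `drop` on both the feature frame and the target series so that the two stay aligned.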
# 'GrLivArea', 'GarageArea','TotalBsmtSF', '1stFlrSF'
# plt.scatter(train["GarageArea"],train["SalePrice"])
# plt.scatter(train["TotalBsmtSF"],train["SalePrice"])
plt.scatter(train["1stFlrSF"],train["SalePrice"])

# train[(train["1stFlrSF"]>4000)&(train["SalePrice"]<200000)]
# The outlier with Id 1299 has already been removed from all_data
# No outliers remain
<matplotlib.collections.PathCollection at 0x1d2fa4bca08>

[Figure: scatter plot of SalePrice against 1stFlrSF]
sns.boxplot(x=train["1stFlrSF"])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fa468e48>

[Figure: box plot of 1stFlrSF]
sns.distplot(train["1stFlrSF"])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fa1f16c8>

[Figure: distribution plot of 1stFlrSF]

4. Feature engineering

all_data.head()
all_data.shape
(2917, 76)

Feature extraction

train[["GarageYrBlt","YearBuilt","YearRemodAdd"]]
train[["1stFlrSF","2ndFlrSF","GrLivArea","TotalBsmtSF"]].describe()
all_data["1stFlrSF_Perc"] = all_data["1stFlrSF"]/all_data["GrLivArea"]
all_data["totalSF_Perc"] = (all_data["1stFlrSF"]+all_data["2ndFlrSF"])/all_data["GrLivArea"]
# all_data["GrLivArea"].isnull().sum()
all_data.shape
(2917, 78)

Data transformation

  • Categorical (labels)
    • Ordinal labels -> label encoding
    • Nominal labels -> one-hot encoding
  • Numeric
all_cols = all_data.columns.tolist()
# all_cols[0]
lab_cols = []
for i in range(len(all_cols)):
    if all_data[all_cols[i]].dtype == "object":
        lab_cols.append(all_cols[i])

lab_cols  # names of all categorical columns
# lab_cols = pd.DataFrame(lab_cols).T
len(lab_cols)
39
# Ordinal labels -> label encoding

from sklearn.preprocessing import LabelEncoder

lab_cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'CentralAir', 'MSSubClass','YrSold' , 
         'MoSold')
for col in lab_cols:
    lab = LabelEncoder()
    all_data[col] = lab.fit_transform(list(all_data[col].values))

print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 78)
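`LabelEncoder` numbers the classes in sorted (alphabetical) order, so the resulting codes do not follow the quality ranking of features such as `ExterQual`. If the order matters to the model, an explicit mapping is one alternative. The sketch below assumes the quality codes from the competition's data description (Po < Fa < TA < Gd < Ex) plus the "None" label introduced earlier; the mapping name `quality_order` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["TA", "Gd", "Ex", "None", "Fa", "Po"])

# LabelEncoder sorts the classes alphabetically before numbering them.
alphabetical = LabelEncoder().fit_transform(quality)

# Explicit mapping that preserves the quality ranking (an assumption, see above).
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal = quality.map(quality_order)

print(pd.DataFrame({"label": quality,
                    "alphabetical": alphabetical,
                    "ordinal": ordinal}))
```

With the alphabetical codes, "Ex" receives the smallest code and "TA" the largest, which is why the two columns disagree.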
all_data.head(1)

[Output: first row of all_data after label encoding]
# Nominal variables -> one-hot encoding
all_data = pd.get_dummies(all_data)
all_data.shape
(2917, 216)

Feature selection

# Correlation
corr_matrix = all_data.corr().abs()
corr_matrix.head()

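[Output: first five rows of corr_matrix]
threshold = 0.9
# Keep only the upper triangle of the correlation matrix
upper_corr = corr_matrix.where(np.triu(corr_matrix,k=1).astype(bool))  # np.bool was removed in NumPy 1.24; the built-in bool works everywhere
upper_corr.head()

[Output: first five rows of upper_corr]
# Drop one feature from each pair whose correlation exceeds 0.9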
corr_drop = [column for column in upper_corr.columns if any(upper_corr[column]>threshold)]
corr_drop
['1stFlrSF_Perc',
 'totalSF_Perc',
 'Exterior2nd_CmentBd',
 'Exterior2nd_MetalSd',
 'Exterior2nd_VinylSd',
 'GarageType_None',
 'RoofStyle_Hip',
 'SaleType_New',
 'Utilities_NoSeWa']
all_data = all_data.drop(columns = corr_drop)
all_data.shape
(2917, 207)
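The correlation filter above can be wrapped in a small helper and checked on synthetic data. This is a sketch, not the author's code: the function name `drop_highly_correlated` and the toy columns are hypothetical, and the mask is built from `np.ones` rather than from the correlation values.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop every column whose |corr| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Boolean mask selecting the strict upper triangle.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b,
                    "a_scaled": 2 * a + 0.01 * rng.normal(size=200)})

reduced, dropped = drop_highly_correlated(toy)
print(dropped, reduced.columns.tolist())
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped.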

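5. Modeling

Related concepts

  1. Overfitting and underfitting
  • Overfitting: the model is too complex; it performs well on the training set but poorly on the test set, so it generalizes badly. Remedies: https://blog.csdn.net/u010899985/article/details/79471909
  2. Cross-validation
  3. How each model works, its strengths and weaknesses, and when to use it (how to choose a model, and by what process?)
  4. What the model parameters mean: https://scikit-learn.org/stable/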
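Overfitting and cross-validation can be illustrated together on synthetic data with many features and few samples. The sketch below compares the RMSE on the training data with the cross-validated RMSE for a weakly and a strongly regularized `Ridge` model; the data, the two alpha values, and the `results` dictionary are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 60, 40
X = rng.normal(size=(n_samples, n_features))
# Only the first three features carry signal; the rest are noise.
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(size=n_samples)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for alpha in (0.001, 10.0):
    # Error on the same data the model was fitted on.
    fitted = Ridge(alpha=alpha).fit(X, y)
    train_rmse = np.sqrt(np.mean((fitted.predict(X) - y) ** 2))
    # Error on held-out folds, averaged over the five splits.
    cv_mse = -cross_val_score(Ridge(alpha=alpha), X, y,
                              scoring="neg_mean_squared_error", cv=cv)
    cv_rmse = np.sqrt(cv_mse).mean()
    results[alpha] = (train_rmse, cv_rmse)
    print(f"alpha={alpha}: train RMSE={train_rmse:.2f}, CV RMSE={cv_rmse:.2f}")
```

A training error far below the cross-validated error is the usual sign of overfitting; stronger regularization typically narrows that gap.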
train = all_data[all_data["Id"]<=1460]
print(train.shape)
test = all_data[all_data["Id"]>1460]
print(test.shape)
# train["Id"]
(1458, 207)
(1459, 207)
fig,ax = plt.subplots(1,2)
fig.set_size_inches(15,5)
sns.distplot(train_y,ax=ax[0])
sns.distplot(np.log(train_y),ax=ax[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fb188848>

[Figure: distribution of SalePrice (left) and of log(SalePrice) (right)]
# train_y = np.log(train_y)
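The competition scores submissions by the RMSE between the logarithm of the predicted price and the logarithm of the observed price, which is one reason to consider modeling the log of `SalePrice`. Below is a minimal sketch of that metric and of converting log-scale predictions back to prices; the helper name `rmse_log` and the sample numbers are made up for illustration.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true)."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

y_true = np.array([208500.0, 181500.0, 223500.0])
y_log = np.log(y_true)                 # target a model would be trained on

# Pretend a model produced these predictions on the log scale.
pred_log = y_log + np.array([0.05, -0.02, 0.01])
pred_price = np.exp(pred_log)          # back to the original price scale

print(rmse_log(y_true, pred_price))
```

If the model is trained on the log target, its predictions must be passed through `np.exp` before they are written to the submission file.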
# Import the modeling libraries
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import train_test_split,cross_val_score
train

[Output: the train DataFrame]
train_y
0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1458, dtype: int64
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model,train,train_y,scoring = "neg_mean_squared_error",cv = 5))
    return (rmse)
# model_ridge = Ridge()
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() 
            for alpha in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot()
plt.xlabel("alpha")
plt.ylabel("rmse")
Text(0, 0.5, 'rmse')

[Figure: cross-validated RMSE against alpha]
cv_ridge
0.05     27045.423937
0.10     26980.425644
0.30     26792.870318
1.00     26457.515190
3.00     26142.818396
5.00     26046.790986
10.00    26002.523512
15.00    26028.112302
30.00    26159.737196
50.00    26323.603320
75.00    26492.612852
dtype: float64
clf = Ridge(alpha=5)
clf.fit(train,train_y)
Ridge(alpha=5, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)
predict = clf.predict(test)
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = predict
sub.head(10)

[Output: first ten rows of sub]
# sub.to_csv('.submission.csv',index=False)