Kaggle house price prediction: what the features mean (kaggle house price prediction)

Latest recommended article published 2024-03-18 17:28:26

weixin_39632212

Article tag: kaggle house price prediction, feature meanings

Article link: https://blog.csdn.net/weixin_39632212/article/details/111662275

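Project: house price prediction (Ames, Iowa). The goal is to predict sale prices and to practice feature engineering, random forests (RF), and gradient boosting.

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/tutorials

Data overview: train has 1460 rows × 81 columns; test has 1459 rows × 80 columns.

1. Import the libraries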

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline

sns.set(style = "darkgrid")

plt.rcParams["font.family"] = "SimHei"
plt.rcParams["axes.unicode_minus"] = False
warnings.filterwarnings("ignore")

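2. Read the data

# 2. Read the data (assumes train.csv and test.csv are in the working directory)
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
# Get a first look at the data

print(train.shape)
print(test.shape)
# display(train.head(2))
# display(test.head(2))

# train.info()
test.info()  # both datasets contain missing values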
(1460, 81)
(1459, 80)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id               1459 non-null int64
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-null object
Exterior2nd      1458 non-null object
MasVnrType       1443 non-null object
MasVnrArea       1444 non-null float64
ExterQual        1459 non-null object
ExterCond        1459 non-null object
Foundation       1459 non-null object
BsmtQual         1415 non-null object
BsmtCond         1414 non-null object
BsmtExposure     1415 non-null object
BsmtFinType1     1417 non-null object
BsmtFinSF1       1458 non-null float64
BsmtFinType2     1417 non-null object
BsmtFinSF2       1458 non-null float64
BsmtUnfSF        1458 non-null float64
TotalBsmtSF      1458 non-null float64
Heating          1459 non-null object
HeatingQC        1459 non-null object
CentralAir       1459 non-null object
Electrical       1459 non-null object
1stFlrSF         1459 non-null int64
2ndFlrSF         1459 non-null int64
LowQualFinSF     1459 non-null int64
GrLivArea        1459 non-null int64
BsmtFullBath     1457 non-null float64
BsmtHalfBath     1457 non-null float64
FullBath         1459 non-null int64
HalfBath         1459 non-null int64
BedroomAbvGr     1459 non-null int64
KitchenAbvGr     1459 non-null int64
KitchenQual      1458 non-null object
TotRmsAbvGrd     1459 non-null int64
Functional       1457 non-null object
Fireplaces       1459 non-null int64
FireplaceQu      729 non-null object
GarageType       1383 non-null object
GarageYrBlt      1381 non-null float64
GarageFinish     1381 non-null object
GarageCars       1458 non-null float64
GarageArea       1458 non-null float64
GarageQual       1381 non-null object
GarageCond       1381 non-null object
PavedDrive       1459 non-null object
WoodDeckSF       1459 non-null int64
OpenPorchSF      1459 non-null int64
EnclosedPorch    1459 non-null int64
3SsnPorch        1459 non-null int64
ScreenPorch      1459 non-null int64
PoolArea         1459 non-null int64
PoolQC           3 non-null object
Fence            290 non-null object
MiscFeature      51 non-null object
MiscVal          1459 non-null int64
MoSold           1459 non-null int64
YrSold           1459 non-null int64
SaleType         1458 non-null object
SaleCondition    1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB

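3. Data cleaning

Data integration (horizontal/vertical)

# Keep each dataset's Id column so the rows can be identified and the sets merged later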
train_ID = train["Id"]
test_ID = test["Id"]
# print(test_ID)

train_y = train["SalePrice"]
# print(train_y)
# print(test.shape)

# Record the row counts so each dataset can later be sliced back out by position
ntrain=train.shape[0]
ntrain
ntest = test.shape[0]
ntest
1459
# Merge the two datasets
# train.drop("Id",axis = 1,inplace = True)
# test.drop("Id",axis = 1,inplace = True)
all_data = pd.concat((train,test)).reset_index(drop = True)
all_data.shape
all_data.drop(['SalePrice'],axis = 1,inplace = True)
all_data.head()
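
The merge step above stacks the training and test rows so that cleaning and encoding are applied to both at once. As a minimal sketch of the full round trip, the example below merges two toy frames and then splits them back by position. The frames, their values, and the names `train_demo`, `test_demo`, and `n_train_demo` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train/test; all values are made up for illustration.
train_demo = pd.DataFrame({"Id": [1, 2, 3],
                           "LotArea": [8450, 9600, 11250],
                           "SalePrice": [208500, 181500, 223500]})
test_demo = pd.DataFrame({"Id": [4, 5], "LotArea": [11622, 14267]})

n_train_demo = train_demo.shape[0]       # number of training rows
y_demo = train_demo["SalePrice"].copy()  # keep the target separately

# Stack the rows, renumber the index, and drop the target column.
combined = pd.concat([train_demo, test_demo]).reset_index(drop=True)
combined = combined.drop(columns=["SalePrice"])

# ... shared cleaning / encoding steps would go here ...

# Split back by position: the first n_train_demo rows are the training rows.
train_part = combined.iloc[:n_train_demo]
test_part = combined.iloc[n_train_demo:]
print(train_part.shape, test_part.shape)
```

Splitting by position only works while the row order and row count stay in step with the counter, so if rows are dropped from the merged frame the counter has to be updated as well.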

3.1 Duplicates

# No duplicate rows
train.duplicated().sum()
test.duplicated().sum()
0

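3.2 Missing values

Missing at random
Systematically missing

3.2.1 Finding missing values and their distribution

info()/isnull()

# Compute the missing-value ratio (in percent)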
missing_ratio = all_data.isnull().sum()/len(all_data)*100

missing_ratio = missing_ratio.drop(missing_ratio[missing_ratio == 0].index).sort_values(ascending=False)
missing_ratio.shape
(34,)
# Plot the missing-value ratios
plt.figure(figsize=(15,8))
sns.barplot(x=missing_ratio.index,y=missing_ratio)
plt.xticks(rotation=90,fontsize=12)
plt.xlabel("Feature",fontsize=15)
plt.ylabel("missing_ratio",fontsize=15)
Text(0, 0.5, 'missing_ratio')
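
To make the ratio calculation concrete, here is a small sketch on a made-up frame. The column names are borrowed from the dataset, but the values and the variable names `demo` and `ratio` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: PoolQC is mostly missing, LotFrontage partly, LotArea not at all.
demo = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing rows per column; times 100 gives percent.
ratio = demo.isnull().mean() * 100

# Keep only the columns that actually have missing values, largest first.
ratio = ratio[ratio > 0].sort_values(ascending=False)
print(ratio)
```

`isnull().mean()` is equivalent to `isnull().sum() / len(df)` and avoids repeating the frame length.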

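3.2.2 Handling missing values

Drop missing values
- for features with very few missing entries

Fill missing values
- Numeric
  - mean / conditional mean
  - median
- Categorical
  - mode() / a value taken from similar rows
  - a separate value (None)

Add a new column that flags whether the value is missing (when more than 80% is missing)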
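
The strategies listed above can be sketched on a toy frame as follows. The columns `area`, `quality`, and `pool` and their values are hypothetical and chosen only to show one strategy each.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one numeric, one categorical, one mostly missing.
demo = pd.DataFrame({
    "area": [100.0, np.nan, 300.0, 200.0],
    "quality": ["Gd", np.nan, "TA", "Gd"],
    "pool": [np.nan, np.nan, np.nan, "Ex"],
})

# Mostly-missing column: record missingness in a separate 0/1 flag first.
demo["pool_missing"] = demo["pool"].isnull().astype(int)

# Numeric column: fill with the median of the observed values.
demo["area"] = demo["area"].fillna(demo["area"].median())

# Categorical column with a few gaps: fill with the most frequent value.
demo["quality"] = demo["quality"].fillna(demo["quality"].mode()[0])

# Categorical column where NaN means "not present": use an explicit label.
demo["pool"] = demo["pool"].fillna("None")

print(demo)
```

The flag column is created before the fill so that the information about which rows were missing is not lost.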

# print(train["Alley"].value_counts())
train["Alley"] = train["Alley"].fillna("None")  # fill NaN first, then plot to inspect the distribution
# print(train["Alley"].value_counts())
sns.stripplot(x='Alley',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7655a88>

# print(train["PoolQC"].value_counts())
train["PoolQC"] = train["PoolQC"].fillna("None")
# print(train["PoolQC"].value_counts())
sns.stripplot(x='PoolQC',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f79faf48>

# print(train["MiscFeature"].value_counts())
train["MiscFeature"] = train["MiscFeature"].fillna("None")
# print(train["MiscFeature"].value_counts())
sns.stripplot(x='MiscFeature',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7a66e48>

# print(train["Fence"].value_counts())
train["Fence"] = train["Fence"].fillna("None")
# print(train["Fence"].value_counts())
sns.stripplot(x='Fence',y='SalePrice',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7d47488>

missing_ratio.head()
PoolQC         99.657417
MiscFeature    96.402878
Alley          93.216855
Fence          80.438506
FireplaceQu    48.646797
dtype: float64
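# In "PoolQC", "MiscFeature", "Alley", and "Fence", a missing value means the house has no such facility.
# The plots above show that too many values are missing, and the rows with missing values span a wide
# range of SalePrice, with no clear difference from the rows that have a value.
# Options: drop the columns outright, or add a column that flags whether the value is missing.

# Here the columns are dropped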
all_data = all_data.drop(["PoolQC","MiscFeature","Alley","Fence"],axis = 1)
all_data.shape
(2919, 76)
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio.head(5)
FireplaceQu     48.646797
LotFrontage     16.649538
GarageCond       5.447071
GarageFinish     5.447071
GarageQual       5.447071
dtype: float64

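Filling missing values in numeric variables (mean/median)
- Continuous
- Discrete

# Summary statistics and column names for the numeric features that still have missing values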
all_data[missing_ratio.index].describe()

# (missing ratio 16.6%)
# LotFrontage: linear feet of street connected to property

# (missing ratio below 6%)
# GarageYrBlt 
# GarageArea 
# GarageCars

# BsmtFullBath 
# BsmtHalfBath

# MasVnrArea: Masonry veneer area in square feet

# Basement-related
# BsmtUnfSF: Unfinished square feet of basement area 
# TotalBsmtSF: Total square feet of basement area
# BsmtFinSF2
# BsmtFinSF1
# For the features with a missing ratio below 6%, the data description suggests the value is missing because the house has no such facility, so fill with 0
cols = ["GarageYrBlt","MasVnrArea","BsmtFullBath","BsmtHalfBath","BsmtUnfSF","TotalBsmtSF","GarageArea","GarageCars","BsmtFinSF2","BsmtFinSF1"]
for col in cols:
    all_data[col] = all_data[col].fillna(0)
# Fill "LotFrontage" with a conditional value: the median of the houses in the same Neighborhood
plt.scatter(x=train["LotFrontage"],y=train["SalePrice"])

all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

all_data["LotFrontage"].isnull().sum()
0
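
The group-wise fill used for `LotFrontage` can be illustrated on a tiny frame. This is a sketch with made-up neighborhoods "A" and "B". The final fallback to the overall median is an added assumption for groups that have no observed value at all and is not part of the original code.

```python
import numpy as np
import pandas as pd

# Made-up data: two neighborhoods, each with one missing frontage value.
demo = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 20.0, 30.0, np.nan],
})

# Median of the observed values within each neighborhood, aligned row by row.
group_median = demo.groupby("Neighborhood")["LotFrontage"].transform("median")

# Fill each gap with the median of its own neighborhood.
demo["LotFrontage"] = demo["LotFrontage"].fillna(group_median)

# Fallback (assumption): a group with no observed value has a NaN median,
# so any remaining gaps are filled with the overall median.
demo["LotFrontage"] = demo["LotFrontage"].fillna(demo["LotFrontage"].median())

print(demo)
```

Passing the string `"median"` to `transform` gives the same result as the lambda in the code above, since both skip NaN when computing the median.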

missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio
FireplaceQu     48.646797
GarageQual       5.447071
GarageFinish     5.447071
GarageCond       5.447071
GarageType       5.378554
BsmtCond         2.809181
BsmtExposure     2.809181
BsmtQual         2.774923
BsmtFinType2     2.740665
BsmtFinType1     2.706406
MasVnrType       0.822199
MSZoning         0.137033
Utilities        0.068517
Functional       0.068517
Electrical       0.034258
Exterior1st      0.034258
Exterior2nd      0.034258
SaleType         0.034258
KitchenQual      0.034258
dtype: float64

Filling missing values in categorical variables

all_data[missing_ratio.index].describe()

cols = ["FireplaceQu","GarageQual","GarageFinish","GarageCond","GarageType","BsmtCond","BsmtExposure","BsmtQual","BsmtFinType2","BsmtFinType1"]
# These three groups (fireplace, garage, basement) are missing because the house has no such facility, so fill with "None"
for col in cols:
    all_data[col] = all_data[col].fillna("None")
# MasVnrType: Masonry veneer type (a missing value means the house has none)
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")

# MSZoning: Identifies the general zoning classification of the sale
# Utilities: Type of utilities available
# Functional: Home functionality (Assume typical unless deductions are warranted)
# Electrical: Electrical system
# Exterior1st: Exterior covering on house
# Exterior2nd: Exterior covering on house (if more than one material)
# SaleType: Type of sale
# KitchenQual: Kitchen quality
cols = ["MSZoning","Utilities","Functional","Electrical","Exterior1st","Exterior2nd","SaleType","KitchenQual"]
# Only a handful of values are missing in these features, so fill with the mode
for col in cols:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])
missing_ratio = all_data.isnull().sum()/len(all_data)*100
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
missing_ratio
Series([], dtype: float64)

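3.3 Outliers

# Inspect the correlations (numeric columns only)
plt.figure(figsize=(15,12))
corrmat = train.select_dtypes(include="number").corr()
sns.heatmap(corrmat)
<matplotlib.axes._subplots.AxesSubplot at 0x1d2f7dc0648>

# The 15 columns most correlated with SalePrice (SalePrice itself included)
top10_corr = corrmat.nlargest(15,'SalePrice')["SalePrice"].index
# print(top10_corr)
plt.figure(figsize=(12,8))
ax = sns.heatmap(train[top10_corr].corr(),annot=True,square=True,fmt=".2f",annot_kws={"size":10})
# Widen the y-limits so the top and bottom rows are not cut off (needed in some matplotlib versions)
a,b = ax.get_ylim()
ax.set_ylim(a+0.5,b-0.5)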
(15.0, 0.0)
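
The ranking behind the heatmap can also be read off directly as a sorted list. The sketch below uses synthetic data: the `Noise` column, the random seed, and the linear relation between `GrLivArea` and `SalePrice` are assumptions made for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: SalePrice depends on GrLivArea, while Noise is unrelated.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
demo = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 100 * area + rng.normal(0, 5000, n),
    "Street": ["Pave"] * n,  # non-numeric column, excluded below
})

# Correlation matrix over the numeric columns only.
corr = demo.select_dtypes(include="number").corr()

# Rank the other features by absolute correlation with the target.
top = corr["SalePrice"].drop("SalePrice").abs().sort_values(ascending=False)
print(top)
```

Sorting by absolute value also surfaces strongly negative correlations, which `nlargest` on the raw values would place at the bottom.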

# SalePrice distribution by OverallQual (overall material quality) and GarageCars, the features most correlated with the price
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121) 
ax1 = sns.boxplot(x = train["OverallQual"],y = train["SalePrice"])

ax2 = fig.add_subplot(122)
ax2 = sns.boxplot(x = train["GarageCars"],y = train["SalePrice"])

fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121) 
ax1 = sns.boxplot(x = train["FullBath"],y = train["SalePrice"])

ax2 = fig.add_subplot(122)
ax2 = sns.boxplot(x = train["TotRmsAbvGrd"],y = train["SalePrice"])

fig = plt.figure(figsize=(15,8))
sns.boxplot(x = train["YearBuilt"],y = train["SalePrice"])
plt.xticks(rotation=90)
[Output: box plot of SalePrice by YearBuilt, with 112 x-axis tick labels]

# 'GrLivArea', 'GarageArea','TotalBsmtSF', '1stFlrSF'
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax1.scatter(train["GrLivArea"],train["SalePrice"])
<matplotlib.collections.PathCollection at 0x1d2f9d25348>

train[(train["GrLivArea"]>4000)&(train["SalePrice"]<200000)]

# Drop the outliers (index 523 and 1298, i.e. Id 524 and 1299)
all_data.drop([523,1298],inplace=True)
all_data[all_data["Id"]==1299]

train_y.drop([523,1298],inplace=True)
train_y.shape
(1458,)
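
The drop above hard-codes the two row labels. As an alternative sketch, the same rule (very large living area but low price) can be written as a boolean mask so that the features and the target are filtered with the same index. The frame and series below are small made-up stand-ins, and the names `X_demo`, `y_demo`, and `outlier_mask` are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the feature frame and the target series.
X_demo = pd.DataFrame({"GrLivArea": [1700, 4676, 2000, 5642]})
y_demo = pd.Series([208500, 184750, 250000, 160000], name="SalePrice")

# Rows with a very large living area but a low sale price.
outlier_mask = (X_demo["GrLivArea"] > 4000) & (y_demo < 200000)
outlier_idx = X_demo.index[outlier_mask]

# Drop the same labels from both objects so they stay aligned.
X_clean = X_demo.drop(index=outlier_idx)
y_clean = y_demo.drop(index=outlier_idx)

print(list(outlier_idx), X_clean.shape, y_clean.shape)
```

After removing training rows from a merged frame, any stored training row count (such as `ntrain` in this post) no longer matches and should be recomputed before splitting by position.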
# 'GrLivArea', 'GarageArea','TotalBsmtSF', '1stFlrSF'
# plt.scatter(train["GarageArea"],train["SalePrice"])
# plt.scatter(train["TotalBsmtSF"],train["SalePrice"])
plt.scatter(train["1stFlrSF"],train["SalePrice"])

# train[(train["1stFlrSF"]>4000)&(train["SalePrice"]<200000)]
# The outlier has Id 1299 and was already dropped from all_data
# No outliers remain
<matplotlib.collections.PathCollection at 0x1d2fa4bca08>

sns.boxplot(train["1stFlrSF"])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fa468e48>

sns.distplot(train["1stFlrSF"])
<matplotlib.axes._subplots.AxesSubplot at 0x1d2fa1f16c8>

4. Feature engineering

all_data.head()
all_data.shape
(2917, 76)

Feature extraction

train[["GarageYrBlt","YearBuilt","YearRemodAdd"]]
train[["1stFlrSF","2ndFlrSF","GrLivArea","TotalBsmtSF"]].describe()
all_data["1stFlrSF_Perc"] = all_data["1stFlrSF"]/all_data["GrLivArea"]
all_data["totalSF_Perc"] = (all_data["1stFlrSF"]+all_data["2ndFlrSF"])/all_data["GrLivArea"]
# all_data["GrLivArea"].isnull().sum()
all_data.shape
(2917, 78)
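
The two ratio features can be checked on a toy frame before trusting them. In the sketch below the areas are made up so that the two floors add up exactly to `GrLivArea`. Whether that also holds for most rows of the real data is an assumption worth checking, because in that case the second ratio is almost constant.

```python
import pandas as pd

# Made-up floor areas for three houses (illustrative values only).
demo = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "GrLivArea": [1710, 1262, 1786],
})

# Share of the above-ground living area that is on the first floor.
demo["first_floor_share"] = demo["1stFlrSF"] / demo["GrLivArea"]

# Share covered by the first and second floors together.
demo["two_floor_share"] = (demo["1stFlrSF"] + demo["2ndFlrSF"]) / demo["GrLivArea"]

# A derived column with a single distinct value carries no information.
print(demo[["first_floor_share", "two_floor_share"]].round(3))
print(demo["two_floor_share"].nunique())
```

In this post both derived columns end up in the list removed by the later correlation filter, which is consistent with them adding little beyond the existing area features.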

Data transformation

Categorical (labels)
- Ordered labels -> label encoding
- Unordered labels -> one-hot

Numeric

all_cols = all_data.columns.tolist()
# all_cols[0]
lab_cols = []
for i in range(len(all_cols)):
    if all_data[all_cols[i]].dtype == "object":
        lab_cols.append(all_cols[i])

lab_cols  # names of all categorical columns
# lab_cols = pd.DataFrame(lab_cols).T
len(lab_cols)
39
# Ordered labels -> label encoding

from sklearn.preprocessing import LabelEncoder

lab_cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'CentralAir', 'MSSubClass','YrSold' , 
         'MoSold')
for col in lab_cols:
    lab = LabelEncoder()
    all_data[col] = lab.fit_transform(list(all_data[col].values))

print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 78)
all_data.head(1)

# Unordered variables -> one-hot
# all_date.get_dum
all_data = pd.get_dummies(all_data)
all_data.shape
(2917, 216)
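
One detail of the two encodings is worth a small sketch. `LabelEncoder` assigns integer codes in sorted label order, which for quality grades is alphabetical rather than worst to best. The example below reproduces that sorted-order coding with pandas categories, then shows an explicit mapping that keeps the intended order, and finally one-hot encodes an unordered column. The mapping `quality_order` is an illustrative assumption, not taken from the original code.

```python
import pandas as pd

demo = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa"],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
})

# Codes by sorted label (Ex=0, Fa=1, Gd=2, TA=3), the same ordering
# that sklearn's LabelEncoder uses.
alphabetical = demo["KitchenQual"].astype("category").cat.codes

# Explicit mapping (assumed order): Fa < TA < Gd < Ex.
quality_order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
demo["KitchenQual_ord"] = demo["KitchenQual"].map(quality_order)

# Unordered labels: one indicator column per category.
onehot = pd.get_dummies(demo[["Neighborhood"]])

print(list(alphabetical))
print(list(demo["KitchenQual_ord"]))
print(list(onehot.columns))
```

Tree-based models such as random forests are fairly tolerant of arbitrary integer codes, while linear models tend to benefit from codes that follow the real order.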

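Feature selection

# Correlation
corr_matrix = all_data.corr().abs()
corr_matrix.head()

threshold = 0.9
# Keep only the upper triangle of the correlation matrix (np.bool was removed from NumPy, so use the built-in bool)
upper_corr = corr_matrix.where(np.triu(corr_matrix,k=1).astype(bool))
upper_corr.head()

# Drop the features whose correlation with an earlier feature exceeds 0.9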
corr_drop = [column for column in upper_corr.columns if any(upper_corr[column]>threshold)]
corr_drop
['1stFlrSF_Perc',
 'totalSF_Perc',
 'Exterior2nd_CmentBd',
 'Exterior2nd_MetalSd',
 'Exterior2nd_VinylSd',
 'GarageType_None',
 'RoofStyle_Hip',
 'SaleType_New',
 'Utilities_NoSeWa']
all_data = all_data.drop(columns = corr_drop)
all_data.shape
(2917, 207)
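
The selection step can be wrapped in a small helper and tried on a toy frame. This is a sketch under the same rule as above (drop the later column of any pair whose absolute correlation exceeds the threshold). The function name `drop_highly_correlated` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of every pair with |correlation| > threshold."""
    corr = df.corr().abs()
    # Boolean mask for the strict upper triangle (excludes the diagonal).
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: b is almost an exact multiple of a, c is only weakly related.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced, dropped = drop_highly_correlated(demo)
print(dropped)
print(list(reduced.columns))
```

Building the mask from `np.ones` rather than from the correlation values keeps pairs with a correlation of exactly zero inside the triangle, which does not change which columns exceed the threshold.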