Data Analysis and Mining Exercise 2: Kaggle Competition House Prices Prediction

Eric_zh69

Categories: Machine Learning, Data Analysis and Mining. Tags: data analysis, kaggle

Article link: https://blog.csdn.net/shaiguchun9503/article/details/81273765

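Problem description:

The dataset uses 79 variables to describe (almost) every aspect of residential homes in Ames, Iowa. The competition asks you to predict the final sale price of each home and submit the predictions. The official data documentation at http://ww2.amstat.org/publications/jse/v19n3/Decock/DataDocumentation.txt explains each variable and its data type.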

The data has 82 columns which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers).

1. Dataset exploration

2. Data processing

3. Feature engineering

4. Model selection

5. Model ensembling

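I. Dataset exploration

The training set has 1460 rows and the test set has 1459 rows. There are 79 feature columns, of which 35 are numeric and 44 are categorical.

1.1 Import the required libraries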

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')  # silence noisy warnings (from sklearn and seaborn)

from scipy import stats
from scipy.stats import norm, skew #for some statistics

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points

1.2 Load the training and test data

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()
test.head()

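train.info()  # train has 1460 rows and 81 columns; dtypes: object = string, int = integer, float = floating point, datetime = date/time, bool = boolean
# The non-null counts in the output also show which columns have missing values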
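
The dtype and missing-value summary that `train.info()` prints can also be computed directly. The following is a minimal sketch, assuming a small made-up frame `toy` in place of `train`; the column names are borrowed from the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [np.nan, np.nan, 'Pave', np.nan],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# Split the columns into numeric and non-numeric (string-like) groups.
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

# Fraction of missing values per column, largest first,
# keeping only the columns that actually have gaps.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
missing_ratio = missing_ratio[missing_ratio > 0]

print(numeric_cols)
print(categorical_cols)
print(missing_ratio)
```

Note that a split by dtype alone treats any numerically coded category as numeric, so the counts it gives on the real data may differ from a manual classification of the variables.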

The Id column carries no predictive information, but it is required in the final submission file, so it should be extracted and kept separately.
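
One way to set the identifier aside is sketched below on a made-up frame. The names `toy_train`, `toy_ids`, and `toy_features` are hypothetical and the values are invented; the same pattern would presumably apply to `train` and `test`, whose identifier column is `Id`.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy_train = pd.DataFrame({
    'Id': [1, 2, 3],
    'GrLivArea': [1500, 1200, 1800],
    'SalePrice': [200000, 180000, 220000],
})

# Keep a copy of the identifiers for the submission file.
toy_ids = toy_train['Id'].copy()

# Drop the identifier so it is not used as a feature.
toy_features = toy_train.drop(columns=['Id'])

print(toy_ids.tolist())
print(toy_features.columns.tolist())
```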

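II. Data processing

Outlier handling

(In the model-ensembling approach later on, leaving these outliers in did not lower the score. I am not sure why; my guess is that this feature does not weigh that heavily on the final result.)

The data contains a few outliers that add too much noise to the prediction, so we delete them. Points that are only mildly unusual should be kept, though: removing every noisy point hurts the model's robustness and reduces its ability to generalize to the test data. According to the documentation mentioned at the start, the main outliers are in GrLivArea.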

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

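The two points at the bottom right have a very large living area but an implausibly low price, perhaps because they lie in remote, sparsely populated areas far from the city. Outliers like these, whose removal does not hurt the final prediction, can safely be deleted.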

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
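
To confirm that the filter removes only the intended rows, the same boolean mask can be counted before anything is dropped. This is a minimal sketch on invented data: `toy` is a hypothetical stand-in for `train`, and the thresholds are the ones used in the deletion step above.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    'GrLivArea': [1700, 4500, 5600, 4300, 2100],
    'SalePrice': [210000, 150000, 160000, 750000, 200000],
})

# Flag rows with a very large living area combined with a low price.
outlier_mask = (toy['GrLivArea'] > 4000) & (toy['SalePrice'] < 300000)
n_outliers = int(outlier_mask.sum())

# Keep only the rows that were not flagged, then renumber the index.
cleaned = toy[~outlier_mask].reset_index(drop=True)

print(n_outliers)
print(len(toy), len(cleaned))
```

In this toy frame the large house with a high price is kept, because it fails the price condition of the mask.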
