Here are the data-analysis notes from the previous post: https://blog.csdn.net/Nyte2018/article/details/90045166
There are many good analyses among the Kaggle kernels, and it is worth reading several of them. The kernel covered this time is Stacked Regressions : Top 4% on LeaderBoard.
1. Preparation
Importing the libraries:
#import some necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #for some statistics
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
import os
print(os.listdir('./data'))#check the files available in the directory
Note: sns.color_palette() returns the list of colors that defines the current palette. set_style() accepts one of seaborn's five themes: darkgrid (gray grid), whitegrid (white grid), dark, white, and ticks. The os.listdir call is used here to show the files in the data directory; when I ran this earlier I kept getting file-not-found errors, and listing the directory this way confirms the paths are right, so it works fine.
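The kernel silences warnings by monkey-patching warnings.warn. A minimal sketch of the more conventional stdlib alternative (my substitution, not what the kernel uses) is warnings.filterwarnings:

```python
import warnings

# Idiomatic alternative to replacing warnings.warn with a no-op:
# register an "ignore" filter so noisy library warnings are dropped.
warnings.filterwarnings('ignore')

with warnings.catch_warnings(record=True) as caught:
    warnings.warn('sklearn-style deprecation notice')  # silently dropped

# Nothing was recorded because the ignore filter swallowed the warning.
print('warnings captured:', len(caught))
```

Unlike the monkey-patch, filters can later be scoped or reset with warnings.resetwarnings().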
['data_description.doc', 'data_description.txt', 'sample_submission.csv', 'test.csv', 'train.csv']
Then the files can be loaded:
#Now let's import and put the train and test datasets in pandas dataframe
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
Display the first five rows of the training set:
##display the first five rows of the train dataset.
train.head(5)
Display the first five rows of the test set:
##display the first five rows of the test dataset.
test.head(5)
Drop the Id column, and print the data shapes before and after:
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)
The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
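The Id column is saved before being dropped because the submission file needs it back at the end. A small sketch with toy frames (hypothetical data, not the competition files) shows the round trip:

```python
import pandas as pd

# Toy test frame standing in for the real test.csv.
test = pd.DataFrame({'Id': [1461, 1462], 'GrLivArea': [896, 1329]})

test_ID = test['Id']             # keep for the submission file
test = test.drop('Id', axis=1)   # the model never sees the Id

preds = [120000.0, 151000.0]     # stand-in predictions
submission = pd.DataFrame({'Id': test_ID, 'SalePrice': preds})
print(submission.shape)
```

Dropping Id before modeling avoids the model treating an arbitrary row index as a predictive feature.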
2. Data processing
Outliers
First, look at a scatter plot of GrLivArea (above-ground living area) against SalePrice:
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
As the plot shows, there are two points at the bottom right with very large GrLivArea but very low prices; they can be judged to be outliers and deleted.
Delete the outliers and check the plot again:
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
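The kernel removes these two points with hand-picked thresholds after inspecting the plot. A rule-based alternative worth knowing (my assumption, not what the kernel does) is the 1.5×IQR convention:

```python
import pandas as pd

# Toy data with one extreme GrLivArea value (hypothetical, for illustration).
df = pd.DataFrame({'GrLivArea': [900, 1200, 1500, 1700, 5600],
                   'SalePrice': [100, 150, 180, 200, 160]})

# Keep rows whose GrLivArea lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df['GrLivArea'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['GrLivArea'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]
print(len(df))
```

Hand-picked thresholds are defensible here because only the training set is filtered and the two points visibly break the trend; an automatic rule risks deleting legitimate expensive houses.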
SalePrice analysis
sns.distplot(train['SalePrice'] , fit=norm);
#Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The plots show that SalePrice is right-skewed. Since linear models prefer normally distributed targets, we apply a log transform:
#We use the numpy function log1p which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])
#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
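The effect of log1p on a right-skewed target can also be checked numerically with scipy's skew, which the kernel already imports. A sketch on a synthetic lognormal sample (standing in for SalePrice):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (lognormal stand-in for SalePrice).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=5000)

# log1p pulls the long right tail in, bringing skewness close to zero.
print('skew before: {:.2f}'.format(skew(prices)))
print('skew after : {:.2f}'.format(skew(np.log1p(prices))))
```

A skewness near 0 after the transform is what makes the fitted normal curve and the QQ-plot line up so much better in the second set of plots.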