Kaggle House Price Prediction (House Prices: Advanced Regression Techniques) Data Analysis (Part 2)

These are analysis notes for the Kaggle House Prices competition. They cover data preprocessing in detail, including removing outliers, feature engineering, and handling missing values; several regression models are built, such as LASSO, Elastic Net, and KRR, and then combined, with a stacked model ultimately used to improve prediction accuracy.

My previous data-analysis notes are here: https://blog.csdn.net/Nyte2018/article/details/90045166
There are many good analyses among the Kaggle kernels, and it is worth studying several of them. The kernel covered this time is Stacked Regressions : Top 4% on LeaderBoard.

1. Preparation

Import the required libraries:

#import some necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #for some statistics
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
import os
print(os.listdir('./data'))#check the files available in the directory 

Note: sns.color_palette() returns the list of colors that defines the current palette. set_style() takes one of seaborn's five themes: darkgrid (gray grid), whitegrid (white grid), dark, white, and ticks. The os module is used here to list the files in the data directory because I kept getting a file-not-found error when running the notebook before; listing the directory this way works just as well.
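As a quick illustration of the two seaborn calls above, here is a minimal sketch (not from the original kernel) that prints the default palette and draws the same toy plot under each of the five built-in styles:

import matplotlib.pyplot as plt
import seaborn as sns

# color_palette() with no arguments returns the current palette as a list of RGB tuples.
print(sns.color_palette())

# Draw the same toy line plot under each of seaborn's five built-in styles.
for style in ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']:
    sns.set_style(style)
    fig, ax = plt.subplots(figsize=(3, 2))
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title(style)
    plt.show()

sns.set_style('darkgrid')  # restore the style used in the rest of the notebook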

['data_description.doc', 'data_description.txt', 'sample_submission.csv', 'test.csv', 'train.csv']

Then the data files can be loaded:

#Now let's import and put the train and test datasets in  pandas dataframe
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

Display the first five rows of the training set:

##display the first five rows of the train dataset.
train.head(5)

[Output of train.head(5): the first five rows of the training set]
Display the first five rows of the test set:

##display the first five rows of the test dataset.
test.head(5)

[Output of test.head(5): the first five rows of the test set]
Drop the Id column and print the data shapes before and after:

#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81) 
The test data size before dropping Id feature is : (1459, 80) 

The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79) 
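The Id column carries no predictive information, but test_ID is saved so that the submission file can be rebuilt at the end. A minimal sketch of how it would be reused (the predictions here are just a dummy placeholder, not output from the kernel's models):

import numpy as np
import pandas as pd

# Dummy placeholder predictions, only to show how test_ID is reattached;
# in the actual workflow these come from the trained (stacked) model.
dummy_predictions = np.full(len(test_ID), 180000.0)

submission = pd.DataFrame({'Id': test_ID, 'SalePrice': dummy_predictions})
submission.to_csv('submission.csv', index=False)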

2. Data Processing

Outliers

First, look at the scatter plot of GrLivArea (above-ground living area) against SalePrice:

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

[Scatter plot of GrLivArea vs. SalePrice]
As the plot shows, there are two points at the bottom right with very large GrLivArea but very low prices; they can be judged to be outliers and removed.
Delete the outliers and check the plot again:

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

[Scatter plot of GrLivArea vs. SalePrice after removing the outliers]
SalePrice Analysis

sns.distplot(train['SalePrice'] , fit=norm);
#Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

[SalePrice distribution fitted with a normal curve, and the corresponding QQ-plot]
The plots show that SalePrice is right-skewed. Linear models prefer normally distributed data, so the target needs to be log-transformed:

#We use the numpy function log1p which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])
#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
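The skew function imported at the top can be used to quantify the effect of the transform. A small check (not in the original kernel, added here for illustration) comparing skewness before and after log1p:

import numpy as np
from scipy.stats import skew

# Skewness of the (already log1p-transformed) target; values near 0
# indicate a roughly symmetric distribution.
print("Skewness after log1p: {:.3f}".format(skew(train['SalePrice'])))

# Undo the transform with expm1 to measure the skewness of the raw prices.
print("Skewness of the raw prices: {:.3f}".format(skew(np.expm1(train['SalePrice']))))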