Here are the data-analysis notes from the previous post: https://blog.csdn.net/Nyte2018/article/details/90045166
There are many good analyses among the Kaggle kernels, and it is worth reading several of them. The kernel covered this time is Stacked Regressions : Top 4% on LeaderBoard.
1. Preparation
Importing the libraries:
#import some necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #for some statistics
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
import os
print(os.listdir('./data'))#check the files available in the directory
Note: sns.color_palette() returns the list of colors that defines the current palette. set_style() accepts one of seaborn's five themes: darkgrid (gray grid), whitegrid (white grid), dark, white, and ticks. The os.listdir call is used here to show the files in the data directory; when I ran this earlier I kept getting file-not-found errors, and listing the directory this way confirms the paths are right, so it works fine.
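The kernel silences warnings by monkey-patching warnings.warn. A minimal sketch of the more conventional stdlib alternative (my substitution, not what the kernel uses) is warnings.filterwarnings:

```python
import warnings

# Idiomatic alternative to replacing warnings.warn with a no-op:
# register an "ignore" filter so noisy library warnings are dropped.
warnings.filterwarnings('ignore')

with warnings.catch_warnings(record=True) as caught:
    warnings.warn('sklearn-style deprecation notice')  # silently dropped

# Nothing was recorded because the ignore filter swallowed the warning.
print('warnings captured:', len(caught))
```

Unlike the monkey-patch, filters can later be scoped or reset with warnings.resetwarnings().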
['data_description.doc', 'data_description.txt', 'sample_submission.csv', 'test.csv', 'train.csv']
Then the files can be loaded:
#Now let's import and put the train and test datasets in pandas dataframe
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
Display the first five rows of the training set:
##display the first five rows of the train dataset.
train.head(5)
Display the first five rows of the test set:
##display the first five rows of the test dataset.
test.head(5)
Drop the Id column, and print the data shapes before and after:
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']
#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)
The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
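The Id column is saved before being dropped because the submission file needs it back at the end. A small sketch with toy frames (hypothetical data, not the competition files) shows the round trip:

```python
import pandas as pd

# Toy test frame standing in for the real test.csv.
test = pd.DataFrame({'Id': [1461, 1462], 'GrLivArea': [896, 1329]})

test_ID = test['Id']             # keep for the submission file
test = test.drop('Id', axis=1)   # the model never sees the Id

preds = [120000.0, 151000.0]     # stand-in predictions
submission = pd.DataFrame({'Id': test_ID, 'SalePrice': preds})
print(submission.shape)
```

Dropping Id before modeling avoids the model treating an arbitrary row index as a predictive feature.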
2. Data processing
Outliers
First, look at a scatter plot of GrLivArea (above-ground living area) against SalePrice:
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
As the plot shows, there are two points at the bottom right with very large GrLivArea but very low prices; they can be judged to be outliers and deleted.
Delete the outliers and check the plot again:
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
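The kernel removes these two points with hand-picked thresholds after inspecting the plot. A rule-based alternative worth knowing (my assumption, not what the kernel does) is the 1.5×IQR convention:

```python
import pandas as pd

# Toy data with one extreme GrLivArea value (hypothetical, for illustration).
df = pd.DataFrame({'GrLivArea': [900, 1200, 1500, 1700, 5600],
                   'SalePrice': [100, 150, 180, 200, 160]})

# Keep rows whose GrLivArea lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df['GrLivArea'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['GrLivArea'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]
print(len(df))
```

Hand-picked thresholds are defensible here because only the training set is filtered and the two points visibly break the trend; an automatic rule risks deleting legitimate expensive houses.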
SalePrice analysis
sns.distplot(train['SalePrice'] , fit=norm);
#Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The plots show that SalePrice is right-skewed. Since linear models prefer normally distributed targets, we apply a log transform:
#We use the numpy function log1p which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])
#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
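The effect of log1p on a right-skewed target can also be checked numerically with scipy's skew, which the kernel already imports. A sketch on a synthetic lognormal sample (standing in for SalePrice):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (lognormal stand-in for SalePrice).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=5000)

# log1p pulls the long right tail in, bringing skewness close to zero.
print('skew before: {:.2f}'.format(skew(prices)))
print('skew after : {:.2f}'.format(skew(np.log1p(prices))))
```

A skewness near 0 after the transform is what makes the fitted normal curve and the QQ-plot line up so much better in the second set of plots.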