DataExploration

The complete machine learning algorithms series is available at fenghaootong-github


  • We know that data is central to data science, but exploring it properly is time-consuming.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df_train = pd.read_csv('../DATA/SalePrice_train.csv')
df_train.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',  
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',  
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
# help(df_train.columns)
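Before examining individual columns, a quick structural overview can help. A minimal sketch using a toy DataFrame as a stand-in for df_train (the three columns here are illustrative; the real data comes from SalePrice_train.csv):

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_train
df = pd.DataFrame({
    'SalePrice': [208500, 181500, 223500],
    'MSZoning': ['RL', 'RL', 'RM'],
    'LotArea': [8450, 9600, 11250],
})

print(df.shape)                  # (rows, columns)
print(df.dtypes.value_counts())  # how many numeric vs. object columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
print(list(numeric_cols))        # the columns usable for correlations later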

1 What can we expect?

  • In order to understand our data, we can look at each variable and try to understand its meaning and relevance to this problem. I know this is time-consuming, but it will give us a feel for our dataset.

2 First: Analysing ‘SalePrice’

  • First, we need to look at 'SalePrice', because it is our target variable
  • Summary statistics for 'SalePrice':
df_train['SalePrice'].describe()
count      1460.000000   
mean     180921.195890   
std       79442.502883  
min       34900.000000
25%      129975.000000  
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
  • We can plot the distribution of 'SalePrice':
sns.distplot(df_train['SalePrice'])
<matplotlib.axes._subplots.AxesSubplot at 0x2410a439be0>

[Figure: distribution of 'SalePrice']

  • Calculate skewness (偏度) and kurtosis (峰度)
    • Skewness (third moment) measures the direction and degree of asymmetry of a distribution: the third central moment divided by the cube of the standard deviation. A normal distribution has skewness 0; skewness below 0 is negative skew, with more mass to the left of the mean; positive skew is the opposite. The plot above is positively skewed.
      • Skewness of 0 means the distribution is as symmetric as a normal distribution. Skewness above 0 means the distribution is positively (right-) skewed: a long tail trails off to the right and there are more extreme values at the right end. Skewness below 0 means the distribution is negatively (left-) skewed: a long tail trails off to the left and there are more extreme values at the left end. The larger the absolute value of the skewness, the more asymmetric the distribution.
    • Kurtosis (fourth moment) describes how peaked or flat a distribution is, i.e. the height of the probability density at the mean. It is commonly defined as the fourth central moment divided by the square of the variance, minus three.
      • A normal distribution has (raw) kurtosis 3. Taking the normal distribution as reference, kurtosis describes the steepness of the distribution: bk < 3 indicates insufficient peakedness, bk > 3 indicates excess peakedness. When a distribution may deviate from normality in its peakedness, kurtosis can be used to test for normality. At the same standard deviation, a larger kurtosis coefficient means more extreme values.
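The moment definitions above can be checked numerically against scipy's uncorrected estimators. A small sketch on synthetic right-skewed data (not the housing data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)  # right-skewed sample

# Plain moment definitions from the text:
#   skewness        = E[(x - mu)^3] / sigma^3
#   excess kurtosis = E[(x - mu)^4] / sigma^4 - 3
mu, sigma = x.mean(), x.std()
skew_manual = np.mean((x - mu) ** 3) / sigma ** 3
kurt_manual = np.mean((x - mu) ** 4) / sigma ** 4 - 3

# scipy's bias=True estimators use these same formulas
assert np.isclose(skew_manual, stats.skew(x, bias=True))
assert np.isclose(kurt_manual, stats.kurtosis(x, bias=True, fisher=True))
print(skew_manual, kurt_manual)  # both positive for this right-skewed sample
```

Note that pandas' `.skew()` and `.kurt()`, used below, apply a small-sample bias correction, so they differ slightly from the raw moment formulas.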
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
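Given the strong positive skew, a common follow-up step (not performed in this notebook) is to log-transform the target. A sketch on synthetic right-skewed prices standing in for df_train['SalePrice']:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed prices (lognormal), a stand-in for SalePrice
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, size=1460)))

print("skew before:", prices.skew())
log_prices = np.log1p(prices)  # log(1 + x), common for price-like targets
print("skew after: ", log_prices.skew())
assert abs(log_prices.skew()) < abs(prices.skew())
```

The transform pulls in the long right tail, which many models (and normality-based diagnostics) prefer.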

‘SalePrice’, her buddies and her interests

  • GrLivArea
  • TotalBsmtSF

Relationship with numerical variables

var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));

[Figure: scatter plot of 'SalePrice' vs. 'GrLivArea']

var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));  

[Figure: scatter plot of 'SalePrice' vs. 'TotalBsmtSF']

#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

[Figure: box plot of 'SalePrice' by 'OverallQual']

var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);   

[Figure: box plot of 'SalePrice' by 'YearBuilt']

  • Feature selection and feature engineering can help us analyze the data
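As a taste of what feature selection looks like in code, here is a minimal sketch with scikit-learn's univariate tools on synthetic data (not the housing set; the two "informative" columns are constructed for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The target depends almost entirely on columns 0 and 2
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the k features with the strongest univariate relation to y
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the two best features
```

With such a strong signal, the selector recovers columns 0 and 2; on real data the correlation analysis in the next section plays a similar role.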

3 Keep calm and work smart

Next, we do a more objective analysis.

Raw Data like soup

  • Raw data is like soup: at first we know only a little about it
  • We will analyze it using:
    • Correlation matrix (heatmap style).
    • ‘SalePrice’ correlation matrix (zoomed heatmap style).
    • Scatter plots between the most correlated variables (move like Jagger style).

Correlation matrix

#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)   
<matplotlib.axes._subplots.AxesSubplot at 0x2410a1c99b0>    

[Figure: correlation matrix heatmap]

SalePrice correlation matrix (zoomed heatmap style)

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)  
cols
Index(['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 
       'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt'], 
      dtype='object')
cm
array([[ 1.        ,  0.7909816 ,  0.70862448,  0.6404092 ,  0.62343144,   
         0.61358055,  0.60585218,  0.56066376,  0.53372316,  0.52289733],
       [ 0.7909816 ,  1.        ,  0.59300743,  0.60067072,  0.56202176,
         0.5378085 ,  0.47622383,  0.55059971,  0.42745234,  0.57232277],
       [ 0.70862448,  0.59300743,  1.        ,  0.46724742,  0.46899748,
         0.4548682 ,  0.56602397,  0.63001165,  0.82548937,  0.19900971],
       [ 0.6404092 ,  0.60067072,  0.46724742,  1.        ,  0.88247541,
         0.43458483,  0.43931681,  0.46967204,  0.36228857,  0.53785009],
       [ 0.62343144,  0.56202176,  0.46899748,  0.88247541,  1.        ,
         0.48666546,  0.48978165,  0.40565621,  0.33782212,  0.47895382],
       [ 0.61358055,  0.5378085 ,  0.4548682 ,  0.43458483,  0.48666546, 
         1.        ,  0.81952998,  0.32372241,  0.28557256,  0.391452  ], 
       [ 0.60585218,  0.47622383,  0.56602397,  0.43931681,  0.48978165, 
         0.81952998,  1.        ,  0.38063749,  0.40951598,  0.28198586],
       [ 0.56066376,  0.55059971,  0.63001165,  0.46967204,  0.40565621, 
         0.32372241,  0.38063749,  1.        ,  0.55478425,  0.46827079],
       [ 0.53372316,  0.42745234,  0.82548937,  0.36228857,  0.33782212,
         0.28557256,  0.40951598,  0.55478425,  1.        ,  0.09558913],
       [ 0.52289733,  0.57232277,  0.19900971,  0.53785009,  0.47895382,
         0.391452  ,  0.28198586,  0.46827079,  0.09558913,  1.        ]])
#help(corrmat.nlargest)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

[Figure: zoomed 'SalePrice' correlation matrix heatmap]

From the heatmap above we can see:

  • 'OverallQual', 'GrLivArea' and 'TotalBsmtSF' are strongly correlated with 'SalePrice', as we already found in the earlier analysis.
  • 'GarageCars' and 'GarageArea' are also among the most correlated variables. However, as discussed in the last point, the number of cars that fit in the garage is a consequence of the garage area. 'GarageCars' and 'GarageArea' are like twin brothers: you can never tell them apart. So we only need one of them in our analysis (we can keep 'GarageCars', since its correlation with 'SalePrice' is higher).
  • 'TotalBsmtSF' and '1stFlrSF' also seem to be twins. We can keep 'TotalBsmtSF', which just confirms that our first guess was right.
  • 'FullBath'? Not sure.
  • 'TotRmsAbvGrd' and 'GrLivArea'? Twins again.
  • 'YearBuilt' seems slightly correlated with 'SalePrice'.
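The "twins" argument can be turned into a small selection rule: of each highly correlated pair, keep the member with the higher correlation to the target. A sketch on synthetic data, with 'GarageCars'/'GarageArea' stand-ins built so the two columns are nearly collinear:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
df = pd.DataFrame({
    'SalePrice': cars * 50_000 + rng.normal(0, 10_000, 200),
    'GarageCars': cars,
    'GarageArea': cars * 250 + rng.normal(0, 20, 200),  # "twin" of GarageCars
})

corr = df.corr()['SalePrice']
# Keep whichever twin is more correlated with the target, drop the other
keep = 'GarageCars' if corr['GarageCars'] >= corr['GarageArea'] else 'GarageArea'
drop = 'GarageArea' if keep == 'GarageCars' else 'GarageCars'
df = df.drop(columns=[drop])
print("kept:", keep)
```

The same rule generalized over all near-duplicate pairs is a simple form of redundancy-based feature selection.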

Scatter plots between ‘SalePrice’ and correlated variables (move like Jagger style)

#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size = 2.5)
plt.show()

[Figure: pair plot of 'SalePrice' and the most correlated variables]

4 Missing Data

Important questions when thinking about missing data:
- How prevalent is the missing data?
- Is missing data random or does it have a pattern?

#missing data 
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis= 1, keys=['Total', 'Percent'])
missing_data.head(20) 
              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageCond       81  0.055479
GarageType       81  0.055479
GarageYrBlt      81  0.055479
GarageFinish     81  0.055479
GarageQual       81  0.055479
BsmtExposure     38  0.026027
BsmtFinType2     38  0.026027
BsmtFinType1     37  0.025342
BsmtCond         37  0.025342
BsmtQual         37  0.025342
MasVnrArea        8  0.005479
MasVnrType        8  0.005479
Electrical        1  0.000685
Utilities         0  0.000000
  • When more than 15% of a variable's values are missing, the variable should be dropped.
  • The GarageX variables all have the same number of missing values, so the missing entries must refer to the same set of observations. Since 'GarageCars' already expresses the most important garage information, we can drop these variables with about 5% missing data.
  • The BsmtX variables can be dropped in the same way.
  • 'MasVnrArea' and 'MasVnrType' are strongly correlated with 'YearBuilt' and 'OverallQual', so they can be dropped as well.
  • 'Electrical' has only one missing value, so we can drop that single record rather than the variable.
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()
0  
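Dropping is not the only option: for variables with only a little missingness, imputation is a common alternative. A sketch on a toy frame (not the training data; the values are made up):

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_train with a few missing values
df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, np.nan],
    'Electrical':  ['SBrkr', 'FuseA', None, 'SBrkr'],
})

# Numeric column: fill with the median; categorical: fill with the mode
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
assert df.isnull().sum().max() == 0
```

Which strategy is appropriate depends on why the data is missing, which is exactly the second question posed at the top of this section.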

The complete version is available at fenghaootong-github
