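Overview
While working on this, I read through many excellent reports and learned a great deal from them. The main articles I consulted include: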
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from scipy.stats import skew
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# import warnings
# warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
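Inspecting the data
Once we have the data, the first step is to get a rough picture of it. There are 1,460 training rows and 1,459 test rows. The training file has 81 columns; setting aside Id and the target SalePrice leaves 79 features, of which 35 are numeric and 44 are categorical once MSSubClass is cast to a string.
Reading the data description shows that the columns MSSubClass, OverallQual, and OverallCond could all be converted to categorical types.
On closer inspection, however, OverallQual and OverallCond have no missing values, so they can be kept as int.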
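One way to confirm this before deciding on the types is to count the missing values and list the distinct levels of the rating columns. The sketch below is only an illustration: `demo_df` is a hypothetical stand-in for the real data, and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the real data; values are illustrative only.
demo_df = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 120],
    "OverallQual": [7, 6, 7, 8],
    "OverallCond": [5, 8, 5, 5],
})

rating_cols = ["OverallQual", "OverallCond"]

# Number of missing values per rating column (0 means nothing to impute).
null_counts = demo_df[rating_cols].isnull().sum()

# Distinct rating levels; they are ordered integers, so int keeps the order.
rating_levels = {col: sorted(demo_df[col].unique().tolist()) for col in rating_cols}

# MSSubClass is a code for the dwelling type, so it is cast to a string.
demo_df["MSSubClass"] = demo_df["MSSubClass"].astype(str)

print(null_counts)
print(rating_levels)
```

On the real data, the same check would be run on `all_df` after the concatenation.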
all_df = pd.concat((train_df.loc[:,'MSSubClass':'SaleCondition'], test_df.loc[:,'MSSubClass':'SaleCondition']), axis=0,ignore_index=True)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
quantitative = [f for f in all_df.columns if all_df.dtypes[f] != 'object']
qualitative = [f for f in all_df.columns if all_df.dtypes[f] == 'object']
print("quantitative: {}, qualitative: {}".format(len(quantitative), len(qualitative)))
quantitative: 35, qualitative: 44
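Handling missing data
The approach to missing values depends on how much is missing:
If a very large share of rows is missing, drop the column
If only a few values are missing, fill them with the mean
If the amount missing is somewhere in between, and the column is categorical, treat "missing" as a new category and one-hot encode it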
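The three rules above can be sketched on a toy frame. This is a minimal illustration, not the processing used later in this post: the frame `toy`, its column names, and the cutoff `DROP_THRESHOLD` are all assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "few_missing": [10.0, np.nan, 30.0, 20.0],
    "category": ["a", None, "b", "a"],
})

DROP_THRESHOLD = 0.5  # assumed cutoff, not a value taken from the text

# 1. Drop columns whose missing fraction exceeds the threshold.
missing_frac = toy.isnull().mean()
cleaned = toy.drop(columns=missing_frac[missing_frac > DROP_THRESHOLD].index)

# 2. Fill the remaining numeric gaps with the column mean.
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())

# 3. One-hot encode the categorical column, with missing as its own category.
cleaned = pd.get_dummies(cleaned, columns=["category"], dummy_na=True)

print(cleaned)
```

Passing `dummy_na=True` to `pd.get_dummies` is what adds the extra indicator column for missing values; without it, rows with a missing category would be all zeros.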
missing = all_df.isnull().sum()
missing.sort_values(inplace=True,ascending=False)
missing = missing[missing > 0]
types = all_df[missing.index].dtypes
percent = (all_df[missing.index].isnull().sum()/all_df[missing.index].isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing, percent,types], axis=1, keys=['Total', 'Percent','Types'])
missing_data.sort_values('Total',ascending=False,inplace=True)
missing_data
[Image: the missing_data table, with Total, Percent, and Types for each column]
missing.plot.bar()
[Image: bar chart of missing-value counts per column]
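Six of the columns listed above have a missing rate greater than 15%. Most of the rest fall into two groups, BsmtX and GarageX. Before deciding how to handle these columns, let's look at some characteristics of the price we want to predict.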
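The split described here, columns above the 15% threshold versus the BsmtX and GarageX groups, can be sketched as follows. The Series `missing_rate` is a hypothetical stand-in: the column names exist in the dataset, but the rates are made up for illustration.

```python
import pandas as pd

# Hypothetical missing-rate table; rates are illustrative, not real results.
missing_rate = pd.Series({
    "PoolQC": 0.99,
    "Alley": 0.93,
    "GarageType": 0.05,
    "GarageYrBlt": 0.05,
    "BsmtQual": 0.03,
    "BsmtCond": 0.03,
})

THRESHOLD = 0.15  # the 15% cutoff mentioned in the text

# Columns whose missing rate exceeds the cutoff.
high = missing_rate[missing_rate > THRESHOLD].index.tolist()
rest = missing_rate[missing_rate <= THRESHOLD].index

# Group the remaining columns by their name prefix.
groups = {
    prefix: [c for c in rest if c.startswith(prefix)]
    for prefix in ("Bsmt", "Garage")
}

print(high)
print(groups)
```

With the real data, `missing_rate` would be the `Percent` column of `missing_data`.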
Statistical analysis of the data
Univariate analysis
First, some summary statistics for the price we want to predict:
train_df.describe()['SalePrice']
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, d