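I'm very interested in data mining. After teaching myself Python's data-processing modules and some machine learning algorithms, I tried putting them into practice on Kaggle with the house price prediction project. Many people have shared good approaches online, and working through the whole project myself gave me a first understanding of data processing and mining.
Preprocessing
Import libraries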
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as statsMM
import sklearn.linear_model as linear_model
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
from sklearn.cluster import KMeans
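numpy and pandas are the standard data-processing libraries, matplotlib and seaborn help with plotting and analysis, and K-Means is what I used when handling the Neighborhood feature.
Read the training and test data
train = pd.read_csv('~/train.csv')  # training data
test = pd.read_csv('~/test.csv')  # test data
# sizes of the training and test sets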
print('original train set: ',train.shape)
print('original test set:',test.shape)
#since Id is unnecessary, save and drop it.
test_Id = test['Id']
test.drop('Id',axis=1,inplace=True)
train_Id = train['Id']
train.drop('Id',axis=1,inplace=True)
print('train drop Id:',train.shape)
print('test drop Id:',test.shape)
original train set: (1460, 81)
original test set: (1459, 80)
train drop Id: (1460, 80)
test drop Id: (1459, 79)
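Feature correlation heatmap analysis
corrmat = train.select_dtypes(include=[np.number]).corr()  # numeric columns only; recent pandas raises on string columns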
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
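Sorted by correlation coefficient, the 10 variables most strongly correlated with SalePrice (SalePrice itself included, since nlargest picks it first) are: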
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={
'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
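Features with relatively high correlation with SalePrice include OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd, GarageYrBlt, MasVnrArea, and Fireplaces.
- Among these, GarageArea, GarageCars, and GarageYrBlt are strongly correlated with one another, so GarageCars can be kept as the representative;
- TotalBsmtSF and 1stFlrSF are strongly correlated and can be analyzed together as a new feature, Basement;
- TotRmsAbvGrd is similar to GrLivArea, so GrLivArea is used for the analysis.
The target feature: SalePrice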
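The redundancy check described in the list above can be sketched in code. This is only an illustration on synthetic data, not the author's code: the helper `correlated_pairs`, the 0.8 threshold, and the toy columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, corr) for pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # strongest pairs first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data (not the Kaggle set): garage area grows with the number of cars,
# while lot frontage is unrelated to both.
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    'GarageCars': cars,
    'GarageArea': cars * 250 + rng.normal(0, 30, size=200),
    'LotFrontage': rng.normal(70, 20, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real training set the same helper could be called as `correlated_pairs(train)`; which feature of each pair to keep is still a judgment call.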
#since SalePrice is the target,drop and save it
y = train.SalePrice
train_labels = y.values.copy()
#train.drop('SalePrice',axis=1,inplace=True)
print(y.describe())
sns.distplot(y)
print('Skewness: %f' %y.skew())
print('Kurtosis: %f' %y.kurt())
#sns.distplot(train['SalePrice'])
#print("Skewness: %f" % train['SalePrice'].skew())
#print("Kurtosis: %f" % train['SalePrice'].kurt())
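SalePrice is positively skewed (positive skewness), so we can take its logarithm to bring it closer to a normal distribution, which suits many machine learning algorithms better.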
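A minimal sketch of that log transform, assuming synthetic log-normal prices rather than the real SalePrice column. Using `np.log1p` instead of a plain `np.log` is my choice for the example; the post does not say which variant is used.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (an assumption for illustration only).
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000), name='SalePrice')

# log1p(x) = log(1 + x); it compresses the long right tail.
log_price = np.log1p(price)

skew_before = price.skew()
skew_after = log_price.skew()
print('skewness before: %f' % skew_before)
print('skewness after:  %f' % skew_after)

# Predictions made on the log scale must be mapped back with the inverse, expm1.
restored = np.expm1(log_price)
```

With the real data this would be `np.log1p(train['SalePrice'])`, and the model's predictions would go through `np.expm1` before submission.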
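Get the feature types (numerical and categorical)
# get the numerical and categorical features of the training set
num_f = [f for f in train.columns if train.dtypes[f] != 'object']  # numerical features
# the target SalePrice is numeric too, so remove it from the feature list
num_f.remove('SalePrice')
print('numerical feature length:',len(num_f))
category_f = [f for f in train.columns if train.dtypes[f] == 'object']  # categorical features of the training set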
print('category feature length:',len(category_f))
numerical feature length: 36
category feature length: 43
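As an alternative sketch (not the author's code), the same split can be written with `select_dtypes`, which avoids comparing dtypes to the string 'object'. The helper name `split_features` and the toy frame are assumptions for the example. Note that `'number'` does not cover boolean columns, which would end up in the categorical list.

```python
import pandas as pd

def split_features(df, target='SalePrice'):
    """Split the columns of df (minus the target) into numerical and categorical names."""
    features = df.drop(columns=[target])
    num = features.select_dtypes(include='number').columns.tolist()
    cat = features.select_dtypes(exclude='number').columns.tolist()
    return num, cat

# Toy frame with the same kinds of columns as the Kaggle data.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr'],
    'OverallQual': [7, 6, 7],
    'SalePrice': [208500, 181500, 223500],
})

num_cols, cat_cols = split_features(toy)
print('numerical:', num_cols)
print('categorical:', cat_cols)
```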
Analysis of the numerical features num_f
def jointplot(x,y,**kwargs):
    try:
        sns.regplot(x=x,y=y)
    except Exception:
        print(x.value_counts())
f = pd.melt(train, id_vars=['SalePrice'], value_vars=num_f)
# `height` replaces the old `size` argument, which recent seaborn no longer accepts
g = sns.FacetGrid(f,col='variable',col_wrap=3,sharex=False,sharey=False,height=5)
g = g.map(jointplot,'value','SalePrice')
Living area GrLivArea (numerical feature) vs. price
var = 'GrLivArea'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000))
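Living area and price show a clear linear relationship. The exceptions are the two points in the lower right corner: large area but low sale price, which does not match how the market actually behaves, so they can be removed: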
train.drop(train[(train['GrLivArea']>4000)&(train.SalePrice<300000)].index,inplace=True)
Later it turned out that noisy points affect the result a great deal, so finding and removing them is critical.
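The same filter can be tried on a small made-up frame to see exactly which rows it removes. The toy values are invented for illustration; only the two thresholds (4000 and 300000) come from the code above.

```python
import pandas as pd

# Made-up rows: two large-but-cheap houses and one large, expensive one.
toy = pd.DataFrame({
    'GrLivArea': [1500, 2000, 2500, 4700, 5600, 4300],
    'SalePrice': [180000, 230000, 300000, 185000, 160000, 750000],
})

# Flag only the rows that are both very large AND unusually cheap.
mask = (toy['GrLivArea'] > 4000) & (toy['SalePrice'] < 300000)
print(toy[mask])  # inspect the candidates before dropping them

# Keep everything else; reset_index avoids gaps in the row index.
cleaned = toy[~mask].reset_index(drop=True)
print(cleaned.shape)
```

Building the mask first makes it easy to check what is about to be dropped. The large house that sold for a high price stays, because it follows the trend.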
Overall quality (numerical feature) vs. price
var = 'OverallQual'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
Year built vs. price
var = 'YearBuilt'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90)
Categorical feature analysis
Overview of the categorical features
'''
for c in category_f:
    train[c] = train[c].astype('category')
    if train[c].isnull().any():
        train[c] = train[c].cat.add_categories(['MISSING'])
        train[c] = train[c].fillna('MISSING')
'''
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)
f = pd.melt(train, id_vars=['SalePrice'], value_vars=category_f)
# `height` replaces the old `size` argument, which recent seaborn no longer accepts
g = sns.FacetGrid(f, col="variable", col_wrap=3, sharex=False,sharey=False,height=5)
g = g.map(boxplot, "value", "SalePrice")
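The features with the largest effect on price are: Neighborhood, which shows marked variation; SaleCondition, where Partial is somewhat higher; and the quality-type (Qual) features (presence of a pool, fireplace, or basement).
One-way ANOVA on the categorical features
def anova(frame):
    anv = pd.DataFrame()
    anv['feature'] = category_f
    pvals = []
    for c in category_f:
        samples = []
        # skip NaN levels: `frame[c] == NaN` matches no rows and would hand f_oneway an empty sample
        for cls in frame[c].dropna().unique():
            s = frame[frame[c] == cls]['SalePrice'].values
            samples.append(s)
        pval = statsMM.f_oneway(*samples)[1]
        pvals.append(pval)
    anv['pval'] = pvals
    return anv.sort_values('pval')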
a = anova(train)
a['disparity'] = np.log(1./a['pval'].values)
sns.barplot(data=a, x='feature', y='disparity')
x=plt.xticks(rotation=90)
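To make the p-values easier to read, here is a small sketch of one-way ANOVA on synthetic data. The helper `anova_pvalue` and the toy frame are assumptions for illustration: prices are generated so that the group mean depends on Neighborhood but not on Street.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    # three neighborhoods with clearly different mean prices
    'Neighborhood': np.repeat(['A', 'B', 'C'], n),
    # street type alternates row by row, so it is unrelated to price
    'Street': np.tile(['Pave', 'Grvl'], 3 * n // 2),
    'SalePrice': np.concatenate([
        rng.normal(150000, 20000, n),
        rng.normal(200000, 20000, n),
        rng.normal(300000, 20000, n),
    ]),
})

def anova_pvalue(frame, feature, target='SalePrice'):
    """p-value of a one-way ANOVA of `target` across the levels of `feature`."""
    groups = [g[target].values for _, g in frame.groupby(feature)]
    return stats.f_oneway(*groups)[1]

p_neigh = anova_pvalue(toy, 'Neighborhood')
p_street = anova_pvalue(toy, 'Street')
print('Neighborhood p-value:', p_neigh)
print('Street p-value:', p_street)
```

A very small p-value means the group means are unlikely to be equal, which is why the log(1/pval) "disparity" score plotted above puts such features first.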
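As the figure above shows, the specific values of quite a few discrete variables have a large effect on the final house price (for example, Neighborhood implicitly encodes location, an important factor in house prices). So, for each discrete variable, we can rank its values by the mean house price observed under each value and assign them 1, 2, 3, 4, and so on to quantify their effect on price. In other words, we convert the discrete variables into numerical ordinal variables: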