# Correlation between two features: pd.DataFrame({"full_data": p1, "red_data": p2}).corr()
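As a minimal sketch of the one-liner above: p1 and p2 are not defined anywhere in this post, so the values below are hypothetical one-dimensional arrays of equal length, chosen only to make the call runnable.

```python
import pandas as pd

# Hypothetical values; p1 and p2 are not defined in the original post.
p1 = [1.0, 2.0, 3.0, 4.0, 5.0]
p2 = [1.1, 1.9, 3.2, 3.8, 5.1]

# Each dict key becomes a column; .corr() returns the pairwise
# Pearson correlation matrix of those columns.
corr = pd.DataFrame({"full_data": p1, "red_data": p2}).corr()
print(corr)
```

The off-diagonal entry is the correlation between the two columns; the diagonal is always 1.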
Analyze the correlations among the features in the provided file:
Id | MSSubClass | MSZoning | LotFrontage | LotArea | LandContour | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | MasVnrArea | ExterQual | ExterCond | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | EnclosedPorch | MiscVal | MoSold | SalePrice |
1 | 60 | RL | 65 | 8450 | Lvl | 5 | 2003 | 2003 | Gable | 196 | Gd | TA | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 0 | 0 | 2 | 208500 |
2 | 20 | RL | 80 | 9600 | Lvl | 8 | 1976 | 1976 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 0 | 5 | 181500 |
3 | 60 | RL | 68 | 11250 | Lvl | 5 | 2001 | 2002 | Gable | 162 | Gd | TA | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 0 | 0 | 9 | 223500 |
4 | 70 | RL | 60 | 9550 | Lvl | 5 | 1915 | 1970 | Gable | 0 | TA | TA | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 272 | 0 | 2 | 140000 |
5 | 60 | RL | 84 | 14260 | Lvl | 5 | 2000 | 2000 | Gable | 350 | Gd | TA | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 0 | 0 | 12 | 250000 |
6 | 50 | RL | 85 | 14115 | Lvl | 5 | 1993 | 1995 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 0 | 700 | 10 | 143000 |
7 | 20 | RL | 75 | 10084 | Lvl | 5 | 2004 | 2005 | Gable | 186 | Gd | TA | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 0 | 8 | 307000 |
8 | 60 | RL | NA | 10382 | Lvl | 6 | 1973 | 1973 | Gable | 240 | TA | TA | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 228 | 350 | 11 | 200000 |
9 | 50 | RM | 51 | 6120 | Lvl | 5 | 1931 | 1950 | Gable | 0 | TA | TA | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 205 | 0 | 4 | 129900 |
10 | 190 | RL | 50 | 7420 | Lvl | 6 | 1939 | 1950 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 0 | 1 | 118000 |
11 | 20 | RL | 70 | 11200 | Lvl | 5 | 1965 | 1965 | Hip | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1040 | 0 | 0 | 1040 | 1 | 0 | 1 | 0 | 0 | 2 | 129500 |
12 | 60 | RL | 85 | 11924 | Lvl | 5 | 2005 | 2006 | Hip | 286 | Ex | TA | GasA | Ex | Y | SBrkr | 1182 | 1142 | 0 | 2324 | 1 | 0 | 3 | 0 | 0 | 7 | 345000 |
13 | 20 | RL | NA | 12968 | Lvl | 6 | 1962 | 1962 | Hip | 0 | TA | TA | GasA | TA | Y | SBrkr | 912 | 0 | 0 | 912 | 1 | 0 | 1 | 0 | 0 | 9 | 144000 |
14 | 20 | RL | 91 | 10652 | Lvl | 5 | 2006 | 2007 | Gable | 306 | Gd | TA | GasA | Ex | Y | SBrkr | 1494 | 0 | 0 | 1494 | 0 | 0 | 2 | 0 | 0 | 8 | 279500 |
15 | 20 | RL | NA | 10920 | Lvl | 5 | 1960 | 1960 | Hip | 212 | TA | TA | GasA | TA | Y | SBrkr | 1253 | 0 | 0 | 1253 | 1 | 0 | 1 | 176 | 0 | 5 | 157000 |
16 | 45 | RM | 51 | 6120 | Lvl | 8 | 1929 | 2001 | Gable | 0 | TA | TA | GasA | Ex | Y | FuseA | 854 | 0 | 0 | 854 | 0 | 0 | 1 | 0 | 0 | 7 | 132000 |
17 | 20 | RL | NA | 11241 | Lvl | 7 | 1970 | 1970 | Gable | 180 | TA | TA | GasA | Ex | Y | SBrkr | 1004 | 0 | 0 | 1004 | 1 | 0 | 1 | 0 | 700 | 3 | 149000 |
18 | 90 | RL | 72 | 10791 | Lvl | 5 | 1967 | 1967 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 1296 | 0 | 0 | 1296 | 0 | 0 | 2 | 0 | 500 | 10 | 90000 |
19 | 20 | RL | 66 | 13695 | Lvl | 5 | 2004 | 2004 | Gable | 0 | TA | TA | GasA | Ex | Y | SBrkr | 1114 | 0 | 0 | 1114 | 1 | 0 | 1 | 0 | 0 | 6 | 159000 |
20 | 20 | RL | 70 | 7560 | Lvl | 6 | 1958 | 1965 | Hip | 0 | TA | TA | GasA | TA | Y | SBrkr | 1339 | 0 | 0 | 1339 | 0 | 0 | 1 | 0 | 0 | 5 | 139000 |
21 | 60 | RL | 101 | 14215 | Lvl | 5 | 2005 | 2006 | Gable | 380 | Gd | TA | GasA | Ex | Y | SBrkr | 1158 | 1218 | 0 | 2376 | 0 | 0 | 3 | 0 | 0 | 11 | 325300 |
22 | 45 | RM | 57 | 7449 | Bnk | 7 | 1930 | 1950 | Gable | 0 | TA | TA | GasA | Ex | Y | FuseF | 1108 | 0 | 0 | 1108 | 0 | 0 | 1 | 205 | 0 | 6 | 139400 |
23 | 20 | RL | 75 | 9742 | Lvl | 5 | 2002 | 2002 | Hip | 281 | Gd | TA | GasA | Ex | Y | SBrkr | 1795 | 0 | 0 | 1795 | 0 | 0 | 2 | 0 | 0 | 9 | 230000 |
24 | 120 | RM | 44 | 4224 | Lvl | 7 | 1976 | 1976 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 1060 | 0 | 0 | 1060 | 1 | 0 | 1 | 0 | 0 | 6 | 129900 |
25 | 20 | RL | NA | 8246 | Lvl | 8 | 1968 | 2001 | Gable | 0 | TA | Gd | GasA | Ex | Y | SBrkr | 1060 | 0 | 0 | 1060 | 1 | 0 | 1 | 0 | 0 | 5 | 154000 |
26 | 20 | RL | 110 | 14230 | Lvl | 5 | 2007 | 2007 | Gable | 640 | Gd | TA | GasA | Ex | Y | SBrkr | 1600 | 0 | 0 | 1600 | 0 | 0 | 2 | 0 | 0 | 7 | 256300 |
27 | 20 | RL | 60 | 7200 | Lvl | 7 | 1951 | 2000 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 900 | 0 | 0 | 900 | 0 | 1 | 1 | 0 | 0 | 5 | 134800 |
28 | 20 | RL | 98 | 11478 | Lvl | 5 | 2007 | 2008 | Gable | 200 | Gd | TA | GasA | Ex | Y | SBrkr | 1704 | 0 | 0 | 1704 | 1 | 0 | 2 | 0 | 0 | 5 | 306000 |
29 | 20 | RL | 47 | 16321 | Lvl | 6 | 1957 | 1997 | Gable | 0 | TA | TA | GasA | TA | Y | SBrkr | 1600 | 0 | 0 | 1600 | 1 | 0 | 1 | 0 | 0 | 12 | 207500 |
30 | 30 | RM | 60 | 6324 | Lvl | 6 | 1927 | 1950 | Gable | 0 | TA | TA | GasA | Fa | N | SBrkr | 520 | 0 | 0 | 520 | 0 | 0 | 1 | 87 | 0 | 5 | 68500 |
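1. See which variables are strongly related to the prediction target (in this analysis, SalePrice).
2. See which variables are strongly correlated with each other.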
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
import scipy.stats as stats
# Load the training data
df_train = pd.read_csv('train1.csv')
print(df_train)
# Correlate the numeric columns only; text columns such as MSZoning have no Pearson correlation
corrmat = df_train.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()
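The output, shown in the figure below, is the correlation between each pair of features:
1. The last row shows which features are most strongly correlated with SalePrice, for example GrLivArea.
2. YearBuilt and YearRemodAdd are strongly correlated with each other, so when there are many features, keeping just one of the two is worth considering.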
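The two observations above can also be read off numerically instead of from the heatmap. The following is a minimal sketch, not the original author's code: the helper names rank_by_target and strongly_correlated_pairs and the 0.8 threshold are assumptions made for illustration, and the sample is just the first eight rows of the table above, which is far too few rows to draw conclusions from.

```python
from itertools import combinations

import pandas as pd

# First eight rows of the table above, restricted to four numeric columns.
sample = pd.DataFrame({
    "YearBuilt":    [2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973],
    "GrLivArea":    [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090],
    "SalePrice":    [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})


def rank_by_target(df, target):
    """Correlation of every numeric column with `target`, highest first."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


def strongly_correlated_pairs(df, target, threshold=0.8):
    """Feature pairs (target excluded) whose |correlation| reaches `threshold`."""
    corr = df.select_dtypes(include="number").drop(columns=[target]).corr()
    return [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) >= threshold
    ]


print(rank_by_target(sample, "SalePrice"))
print(strongly_correlated_pairs(sample, "SalePrice"))
```

On the full training file, the same two calls would take df_train in place of sample.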
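# Number of columns to keep (SalePrice itself counts as one of them)
k = 5
# Names of the k columns with the largest correlation to SalePrice, in descending order
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# Correlation matrix restricted to those columns
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()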
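This picks the columns most strongly correlated with SalePrice and sorts them in descending order. SalePrice correlates perfectly with itself, so with k=5 the result is SalePrice plus the four most correlated features.
Checking a feature for outliers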
# Scatter plot of the chosen feature against SalePrice; isolated points are outlier candidates
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
plt.show()
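The scatter plot above is a visual check. As a complementary sketch, the snippet below flags candidate outliers numerically with the common 1.5 × IQR rule. This rule is not part of the original post; it is swapped in here purely for illustration, and the helper name iqr_outliers is hypothetical. The input is the GrLivArea column of the 30 rows shown in the table above.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the entries lying more than k * IQR outside the quartiles."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return s[mask]


# GrLivArea for the 30 rows in the table above.
gr_liv_area = [
    1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077,
    1040, 2324, 912, 1494, 1253, 854, 1004, 1296, 1114, 1339,
    2376, 1108, 1795, 1060, 1060, 1600, 900, 1704, 1600, 520,
]
print(iqr_outliers(gr_liv_area))
```

A flagged point is only a candidate: whether to drop it still depends on how it sits relative to SalePrice in the scatter plot.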
A very important step is to transform variables that are not normally distributed so that they become approximately normal.
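# Histogram with a KDE curve plus a fitted normal curve (sns.distplot is deprecated)
mu, sigma = norm.fit(df_train['SalePrice'])
ax = sns.histplot(df_train['SalePrice'], stat='density', kde=True)
x = np.linspace(*ax.get_xlim(), 200)
ax.plot(x, norm.pdf(x, mu, sigma), color='black')
# Normal probability (Q-Q) plot in a separate figure
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()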
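The plot shows that the distribution of 'SalePrice' is positively skewed. For positive skew, taking the logarithm brings the distribution closer to normal: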
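# Replace SalePrice with its natural logarithm
df_train['SalePrice'] = np.log(df_train['SalePrice'])
# Same two plots as before, now on the transformed values (sns.distplot is deprecated)
mu, sigma = norm.fit(df_train['SalePrice'])
ax = sns.histplot(df_train['SalePrice'], stat='density', kde=True)
x = np.linspace(*ax.get_xlim(), 200)
ax.plot(x, norm.pdf(x, mu, sigma), color='black')
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()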
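The effect of the log transform on skewness can also be checked with a number rather than a plot. The sketch below uses synthetic, log-normally distributed values as a stand-in for SalePrice; the distribution parameters and the seed are arbitrary assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for SalePrice: log-normal values are positively skewed.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# Sample skewness before and after taking the natural logarithm.
skew_before = prices.skew()
skew_after = np.log(prices).skew()

print(f"skewness before log: {skew_before:.3f}")
print(f"skewness after log:  {skew_after:.3f}")
```

On the real data, the same comparison would be df_train['SalePrice'].skew() before and after the np.log step. A skewness near 0 is consistent with a symmetric distribution, although it does not prove normality on its own.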
Before the transformation, the plot of GrLivArea against the target SalePrice is roughly cone-shaped:
After the transformation, it looks much better.
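One way to put a number on "cone-shaped" is to compare how far the points scatter around a fitted line for small versus large values of the feature. The sketch below is an illustration under stated assumptions, not the original analysis: the data are synthetic (multiplicative noise, which produces a cone), both variables are log-transformed in the second call, and the helper name spread_ratio is hypothetical.

```python
import numpy as np


def spread_ratio(x, y):
    """Residual std for x above its median divided by that for x below it,
    after fitting a straight line y ~ x. A value near 1 means an even spread."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    upper = x > np.median(x)
    return resid[upper].std() / resid[~upper].std()


# Synthetic data: the noise multiplies the price, so the scatter widens with area.
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=2000)
price = 100 * area * np.exp(rng.normal(0, 0.2, size=2000))

raw_ratio = spread_ratio(area, price)
log_ratio = spread_ratio(np.log(area), np.log(price))

print(f"spread ratio, raw values: {raw_ratio:.2f}")
print(f"spread ratio, log values: {log_ratio:.2f}")
```

A ratio well above 1 on the raw values corresponds to the cone shape; a ratio close to 1 after the transform corresponds to a band of roughly constant width.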
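This post draws on Yang Xi (杨熹)'s Kaggle competition summary, "开发者自述:我是如何从 0 到 1 走进 Kaggle 的" ("A Developer's Account: How I Got into Kaggle from 0 to 1").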