Comprehensive Data Exploration with Python


Up_梅子酒


Article link: https://blog.csdn.net/eerywh/article/details/114554337


import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import missingno as mno
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

df_train = pd.read_csv('./train.csv')

df_train.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

df_train.head()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

mno.matrix(df_train)

<AxesSubplot:>

[Figure: missing-value matrix of df_train (output_5_1.png)]

mno.heatmap(df_train)

<AxesSubplot:>

[Figure: missing-value correlation heatmap of df_train (output_6_1.png)]
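The matrix and heatmap above are visual summaries. A numeric table of missing counts per column can complement them. The following is a minimal sketch: the helper name `missing_summary` and the toy frame are assumptions made for illustration, and the values are invented rather than taken from train.csv.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    total = df.isnull().sum()        # number of NaN entries per column
    fraction = df.isnull().mean()    # share of rows that are NaN per column
    summary = pd.DataFrame({"missing": total, "fraction": fraction})
    # Keep only the columns that actually contain gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for df_train (hypothetical values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df_train)`.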

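Data exploration

To understand the data, we can look at each variable and try to work out what it means and how relevant it is to the problem.
To keep the analysis disciplined, we can build a table with the following columns: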

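Variable: the variable name.
Type: the variable's type. This field has two possible values, "numerical" or "categorical". "Numerical" means the variable takes numbers as values; "categorical" means it takes categories as values.
Segment: the variable's segment. We can define three possible segments: building, space, or location. "Building" covers variables that describe the physical characteristics of the building (e.g. overall quality, 'OverallQual'). "Space" covers variables that report the space properties of the house (e.g. total space). "Location" covers variables that say where the house is located (e.g. 'Neighborhood').
Expectation: our expectation of the variable's influence on 'SalePrice'. We can use "High", "Medium", and "Low" as possible values.
Conclusion: our conclusion about the variable's importance after a quick look at the data. We can keep the same scale as in "Expectation".
Comments: any general remarks that occur to us.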

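While "Type" and "Segment" are only there for possible future reference, the "Expectation" column matters because it helps us develop a "sixth sense". To fill it in, we should read the description of every variable and ask ourselves, one variable at a time: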

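Would we think about this variable when buying a house? (For example, when we picture our dream house, do we care about its masonry veneer type, 'MasVnrType'?)
If so, how important is it? (For example, what difference does "Excellent" material make compared with "Poor"? And "Excellent" compared with "Good"?)
Is this information already captured by another variable? (For example, if 'LandContour' gives the flatness of the property, do we really need 'LandSlope'?)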
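The overview table described above can be started in pandas and then filled in by hand. The sketch below is one possible layout, not the author's own: the helper name `make_overview` is hypothetical, the "Type" column is only a first guess derived from the dtype, and the toy values are illustrative.

```python
import pandas as pd


def make_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Build an empty variable-overview table, one row per column of df."""
    types = [
        "numerical" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
        for col in df.columns
    ]
    return pd.DataFrame({
        "Variable": list(df.columns),
        "Type": types,        # first guess from dtype; review by hand
        "Segment": "",        # building / space / location
        "Expectation": "",    # High / Medium / Low, filled in before plotting
        "Conclusion": "",     # High / Medium / Low, filled in after a quick look
        "Comments": "",
    })


# Toy frame standing in for df_train (hypothetical values).
toy = pd.DataFrame({
    "OverallQual": [7, 6],
    "GrLivArea": [1710, 1262],
    "Neighborhood": ["CollgCr", "Veenker"],
})
overview = make_overview(toy)
print(overview)
```

The dtype guess needs a manual pass: a column such as 'MSSubClass' is stored as integers even though its values are category codes.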

First: analyzing 'SalePrice'

df_train['SalePrice'].describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

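# Histogram with a KDE overlay (sns.distplot is deprecated in recent seaborn releases)
sns.histplot(df_train['SalePrice'], kde=True, stat='density');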

[Figure: distribution of SalePrice (output_11_0.png)]

SalePrice is right-skewed (positively skewed): the mean (about 180,921) sits well above the median (163,000).

# Skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())  # sample skewness (0 for a symmetric distribution)
print("Kurtosis: %f" % df_train['SalePrice'].kurt())  # excess kurtosis (0 for a normal distribution)

Skewness: 1.882876
Kurtosis: 6.536282
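To make the two numbers above less of a black box, the sketch below recomputes them from central moments. It assumes that pandas' `skew()` and `kurt()` return the bias-corrected sample skewness and the bias-corrected excess kurtosis, and it checks that assumption against synthetic data. The function names and the lognormal sample are illustrative and do not come from train.csv.

```python
import numpy as np
import pandas as pd


def sample_skew(x: np.ndarray) -> float:
    """Bias-corrected (adjusted Fisher-Pearson) sample skewness."""
    n = len(x)
    d = x - x.mean()
    m2 = np.mean(d ** 2)           # second central moment
    m3 = np.mean(d ** 3)           # third central moment
    g1 = m3 / m2 ** 1.5            # biased skewness
    return float(np.sqrt(n * (n - 1)) / (n - 2) * g1)


def sample_excess_kurtosis(x: np.ndarray) -> float:
    """Bias-corrected excess kurtosis (0 for a normal distribution)."""
    n = len(x)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)           # fourth central moment
    g2 = m4 / m2 ** 2 - 3.0        # biased excess kurtosis
    return float((n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0))


# Synthetic right-skewed "prices" (hypothetical data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

print("manual skew:", sample_skew(prices.to_numpy()), "pandas:", prices.skew())
print("manual kurt:", sample_excess_kurtosis(prices.to_numpy()), "pandas:", prices.kurt())
```

A positive skewness means a longer right tail, and a positive excess kurtosis means heavier tails than a normal distribution. Both apply to the SalePrice values reported above.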

Relationship between SalePrice and numerical variables

GrLivArea vs. SalePrice

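var = 'GrLivArea'
# Pair the target with the variable under study
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
fig = plt.figure()
plt.ylim((0, 800000))  # fixed y-range so the scatter plots are comparable
plt.scatter(data[var], data['SalePrice'])
plt.xlabel(var)
plt.ylabel('SalePrice')
plt.show()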

[Figure: scatter plot of GrLivArea vs. SalePrice (output_16_0.png)]

The plot shows a fairly strong positive linear relationship between SalePrice and GrLivArea.
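A visual impression of linearity can be backed with a number. The sketch below computes the Pearson correlation coefficient and a least-squares line on synthetic data. The slope, noise level, and sample size are made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train (hypothetical values, not from train.csv).
rng = np.random.default_rng(42)
area = rng.uniform(500, 4000, size=200)
price = 100.0 * area + rng.normal(0, 40000, size=200)
toy = pd.DataFrame({"GrLivArea": area, "SalePrice": price})

# Series.corr uses the Pearson coefficient by default.
r = toy["GrLivArea"].corr(toy["SalePrice"])

# Degree-1 polynomial fit, i.e. an ordinary least-squares line.
slope, intercept = np.polyfit(toy["GrLivArea"], toy["SalePrice"], deg=1)

print(f"Pearson r = {r:.3f}")
print(f"fitted line: SalePrice = {slope:.1f} * GrLivArea + {intercept:.0f}")
```

On the real data, `df_train['GrLivArea'].corr(df_train['SalePrice'])` gives the corresponding coefficient. A value close to 1 would support the reading of the scatter plot.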

TotalBsmtSF vs. SalePrice

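var = 'TotalBsmtSF'
# Pair the target with the variable under study
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
fig = plt.figure()
plt.ylim((0, 800000))  # fixed y-range so the scatter plots are comparable
plt.scatter(data[var], data['SalePrice'])
plt.xlabel(var)
plt.ylabel('SalePrice')
plt.show()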

[Figure: scatter plot of TotalBsmtSF vs. SalePrice (output_19_0.png)]

Categorical features

OverallQual

var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
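For a categorical feature such as OverallQual, one way to inspect its relationship with SalePrice is to compare the price distribution within each category. A box plot is a common choice for this. The sketch below shows a numeric counterpart using `groupby`. It is an assumption about a reasonable next step, not the author's own code, and the toy values are invented.

```python
import pandas as pd

# Toy frame standing in for df_train (hypothetical values).
toy = pd.DataFrame({
    "OverallQual": [5, 5, 6, 6, 7, 7, 8, 8],
    "SalePrice": [130000, 140000, 155000, 165000,
                  200000, 215000, 270000, 290000],
})

# One row per quality level with count, median, min, and max of SalePrice.
per_level = (
    toy.groupby("OverallQual")["SalePrice"]
       .agg(["count", "median", "min", "max"])
)
print(per_level)
```

If the median rises steadily with the quality level, as it does in this toy example, the feature is a plausible candidate for a "High" rating in the "Conclusion" column.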
