A First Look at Data Analysis

天伤星

Article link: https://blog.csdn.net/qq_39862505/article/details/105416110

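Exploratory data analysis (EDA) is an approach to data analysis that explores the structure and patterns of existing data under as few prior assumptions as possible, using plots, tables, equation fitting, and summary statistics.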

Data and Background

https://tianchi.aliyun.com/competition/entrance/231784/information (Alibaba Tianchi: Data Mining for Beginners)
Dataset download link: https://pan.baidu.com/s/16Bfl6wgCSW1mclkL_j8goQ
Extraction code: 5xoo

Goals of EDA

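Get familiar with the dataset and validate it, to confirm that it can be used for the machine learning or deep learning steps that follow.
Understand the relationships among the variables and between the variables and the target.
Guide the data processing and feature engineering steps, so that the structure of the dataset and its feature set make the subsequent prediction problem more reliable.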

Loading and Overviewing the Data

Load the data science and visualization libraries

The missingno library visualizes the distribution of missing values. It is built on matplotlib and accepts pandas data structures as input.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # visualizes the distribution of missing values
import scipy.stats as st

Load the data

# In a raw string a single backslash is enough; the files are space-separated
train_data = pd.read_csv(r'C:\Users\lenovo\Desktop\used_car_train_20200313.csv', sep=' ')
test_data = pd.read_csv(r'C:\Users\lenovo\Desktop\used_car_testA_20200313.csv', sep=' ')

All features have been anonymized; after anonymization they are in label-encoded, that is numeric, form.

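Field	Description
SaleID	Transaction ID, unique code
name	Car listing name, anonymized
regDate	Car registration date
model	Model code
brand	Car brand
bodyType	Body type
fuelType	Fuel type
gearbox	Gearbox
power	Engine power
kilometer	Distance the car has been driven
notRepairedDamage	Whether the car has damage that has not been repaired
regionCode	Region code
seller	Seller
offerType	Offer type
creatDate	Date the car was listed
price	Used-car transaction price
v-series features	Anonymous features, 15 in total: v_0 through v_14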

Data overview

Take a quick look at the data with head() and shape

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([train_data.head(), train_data.tail()])
pd.concat([test_data.head(), test_data.tail()])
train_data.shape
test_data.shape

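Use describe() to get familiar with the summary statistics
describe() reports the statistics of each numeric column: count, mean, standard deviation (std), minimum (min), quartiles (25%, 50%, 75%; the 50% value is the median), and maximum (max). These figures give a quick sense of each column's range and help spot suspicious values. For example, values such as 999, 9999, or -1 sometimes turn up; these are often just another way of encoding NaN.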

train_data.describe()

The output is:

SaleID           name       regDate          model  \
count  150000.000000  150000.000000  1.500000e+05  149999.000000   
mean    74999.500000   68349.172873  2.003417e+07      47.129021   
std     43301.414527   61103.875095  5.364988e+04      49.536040   
min         0.000000       0.000000  1.991000e+07       0.000000   
25%     37499.750000   11156.000000  1.999091e+07      10.000000   
50%     74999.500000   51638.000000  2.003091e+07      30.000000   
75%    112499.250000  118841.250000  2.007111e+07      66.000000   
max    149999.000000  196812.000000  2.015121e+07     247.000000   

               brand       bodyType       fuelType        gearbox  \
count  150000.000000  145494.000000  141320.000000  144019.000000   
mean        8.052733       1.792369       0.375842       0.224943   
std         7.864956       1.760640       0.548677       0.417546   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         6.000000       1.000000       0.000000       0.000000   
75%        13.000000       3.000000       1.000000       0.000000   
max        39.000000       7.000000       6.000000       1.000000   

               power      kilometer  ...            v_5            v_6  \
count  150000.000000  150000.000000  ...  150000.000000  150000.000000   
mean      119.316547      12.597160  ...       0.248204       0.044923   
std       177.168419       3.919576  ...       0.045804       0.051743   
min         0.000000       0.500000  ...       0.000000       0.000000   
25%        75.000000      12.500000  ...       0.243615       0.000038   
50%       110.000000      15.000000  ...       0.257798       0.000812   
75%       150.000000      15.000000  ...       0.265297       0.102009   
max     19312.000000      15.000000  ...       0.291838       0.151420   

                 v_7            v_8            v_9           v_10  \
count  150000.000000  150000.000000  150000.000000  150000.000000   
mean        0.124692       0.058144       0.061996      -0.001000   
std         0.201410       0.029186       0.035692       3.772386   
min         0.000000       0.000000       0.000000      -9.168192   
25%         0.062474       0.035334       0.033930      -3.722303   
50%         0.095866       0.057014       0.058484       1.624076   
75%         0.125243       0.079382       0.087491       2.844357   
max         1.404936       0.160791       0.222787      12.357011   

                v_11           v_12           v_13           v_14  
count  150000.000000  150000.000000  150000.000000  150000.000000  
mean        0.009035       0.004813       0.000313      -0.000688  
std         3.286071       2.517478       1.288988       1.038685  
min        -5.558207      -9.639552      -4.153899      -6.546556  
25%        -1.951543      -1.871846      -1.057789      -0.437034  
50%        -0.358053      -0.130753      -0.036245       0.141246  
75%         1.255022       1.776933       0.942813       0.680378  
max        18.819042      13.847792      11.147669       8.658418  

[8 rows x 30 columns]
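
The remark about 999, 9999, and -1 standing in for NaN can be checked directly. The following is a minimal sketch on a made-up toy frame; the sentinel list and the toy values are assumptions for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up for illustration.
toy = pd.DataFrame({
    "power": [75, 110, 9999, 150, -1],
    "kilometer": [12.5, 15.0, 15.0, 999, 0.5],
})

# Candidate sentinel values (an assumption; adjust to the dataset at hand).
sentinels = [999, 9999, -1]

# Count how often each column takes one of the sentinel values.
sentinel_counts = toy.isin(sentinels).sum()
print(sentinel_counts)

# Replace the sentinels with NaN so that they are counted as missing.
cleaned = toy.replace(sentinels, np.nan)
print(cleaned.isnull().sum())
```

Whether a given value really is a sentinel depends on the column: -1 may be a legitimate value for a centered feature, so the counts are a prompt for inspection rather than a rule.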

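Use info() to get familiar with the data types
info() shows the type of each column, which helps reveal whether placeholder symbols other than NaN are present.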

train_data.info()

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 34.9+ MB

Missing Values and Outliers

Missing values

Check how many NaN values each column contains

train_data.isnull().sum()
test_data.isnull().sum()

The output is (train_data first, then test_data):

SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

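The sort_values() function
sort_values() sorts a dataset by the values of a given field. It can sort by the values in a specified column or by the values in a specified row.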

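Parameter	Description
by	Column label(s) to sort by (axis=0 or 'index'), or index label(s) (axis=1 or 'columns')
axis	If axis=0 or 'index', rows are sorted by the values in the specified column(s); if axis=1 or 'columns', columns are sorted by the values in the specified row(s). Default: axis=0
ascending	Whether to sort in ascending order. Default: True
inplace	Whether to replace the original data with the sorted data. Default: False
na_position	'first' or 'last': where missing values are placed. Default: 'last'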
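
A short sketch of these parameters on toy data may help. The frames and their column names (col, n_missing, row0) are hypothetical and only illustrate the calls.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"col": ["a", "b", "c", "d"],
                    "n_missing": [5.0, np.nan, 1.0, 3.0]})

# Sort rows by one column, in descending order, with NaN placed first.
by_column = toy.sort_values(by="n_missing", ascending=False, na_position="first")
print(by_column)

# inplace=False (the default) returns a new object and leaves `toy` untouched.
print(toy["col"].tolist())

# With axis=1, `by` names a row label and the columns are reordered.
scores = pd.DataFrame([[3, 1, 2]], columns=["x", "y", "z"], index=["row0"])
by_row = scores.sort_values(by="row0", axis=1)
print(by_row)
```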

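The following code shows at a glance which columns contain NaN and how many. The point is to see whether the number of NaN values is really large. If it is small, the values are usually filled in; with tree models such as LightGBM (lgb) they can be left missing, and the tree will handle them itself. If a column has too many NaN values, consider dropping it.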

# Visualize the NaN counts
missing = train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
plt.show()
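
The fill-or-drop decision described above can be made explicit by looking at the fraction of missing values per column rather than the raw count. This is a minimal sketch on toy data; the 0.5 cut-off is a hypothetical value chosen for the example, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are illustrative only).
toy = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, 0.0, 3.0],
    "fuelType": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "power": [75, 110, 90, 150, 60],
})

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()
print(missing_ratio)

# Hypothetical cut-off: columns above it are candidates for removal.
DROP_THRESHOLD = 0.5
to_drop = missing_ratio[missing_ratio > DROP_THRESHOLD].index.tolist()
to_fill = missing_ratio[(missing_ratio > 0) & (missing_ratio <= DROP_THRESHOLD)].index.tolist()
print("candidates to drop:", to_drop)
print("fill, or leave to a tree model:", to_fill)
```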

# Visualize missing values
msno.matrix(train_data.sample(250))
msno.bar(train_data.sample(1000))
msno.matrix(test_data.sample(250))
msno.bar(test_data.sample(1000))
plt.show()

The train_data.info() output above shows that every column is numeric except notRepairedDamage, which has dtype object. Its distinct values are:

train_data['notRepairedDamage'].value_counts()

The output is:

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

The '-' entries are also missing values. Many models handle NaN directly, so for now we simply replace '-' with NaN and do no further processing.

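# Assign the result back; an inplace replace on a selected column may not update train_data
train_data['notRepairedDamage'] = train_data['notRepairedDamage'].replace('-', np.nan)
train_data['notRepairedDamage'].value_counts()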

The output is:

0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

train_data.isnull().sum()

SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64

Outliers

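The following two categorical features are severely skewed and generally contribute nothing to prediction, so they are dropped here. You can dig further into them, but it is rarely worthwhile.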

train_data['seller'].value_counts()
train_data['offerType'].value_counts()

0    149999
1         1
Name: seller, dtype: int64

0    150000
Name: offerType, dtype: int64
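
The listing above only prints the counts; the drop itself is not shown. One way to do it is sketched below on a toy frame. The 0.75 dominance threshold and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for train_data (values are illustrative only).
toy = pd.DataFrame({
    "seller": [0, 0, 0, 1],
    "offerType": [0, 0, 0, 0],
    "price": [1850, 3600, 6222, 2400],
})

# Share of the most frequent value in each column.
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])
print(top_share)

# Hypothetical rule: drop columns dominated by a single value (never the target).
skewed_cols = [c for c in toy.columns if c != "price" and top_share[c] >= 0.75]
toy = toy.drop(columns=skewed_cols)
print(skewed_cols, list(toy.columns))
```

On the real data the equivalent call would be train_data.drop(columns=['seller', 'offerType']), applied to test_data as well so that both sets keep the same columns.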

Distribution of the Target

Overall distribution

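When data follow a normal distribution, the sample mean and the sample variance are independent of each other. The normal distribution has many convenient properties, and many models assume normally distributed data.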

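Take linear regression as an example. It assumes the errors are normally distributed, so the probability of each sample point can be written in normal form. Multiplying these probabilities over the sample points and taking the logarithm gives the conditional log-likelihood of the training set, and maximizing it is the problem linear regression ultimately solves. The final form of this expression is the familiar sum of squared errors.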
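
Written out, the argument runs as follows. The notation is introduced here for illustration and is not from the original post: $w$ is the weight vector, $x_i$ and $y_i$ are the features and target of sample $i$, and the errors $\varepsilon_i$ are assumed independent with a common variance $\sigma^2$.

```latex
\begin{aligned}
y_i &= w^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
p(y_i \mid x_i; w) &= \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right) \\
\ell(w) = \log \prod_{i=1}^{n} p(y_i \mid x_i; w)
    &= -n \log\!\left(\sqrt{2\pi}\,\sigma\right)
       - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 \\
\arg\max_{w} \ell(w) &= \arg\min_{w} \sum_{i=1}^{n} (y_i - w^\top x_i)^2
\end{aligned}
```

The first term of $\ell(w)$ does not depend on $w$, so maximizing the log-likelihood is the same as minimizing the sum of squared errors.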

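In short, many machine learning models assume that the data or the parameters are normally distributed. When a sample is not normally distributed, the following transformations can be applied: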

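z-score standardization (a linear transformation; it rescales the data but does not change the shape of the distribution)
Box-Cox transformation
Yeo-Johnson transformation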
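
The three transformations listed above can be compared on a synthetic sample. This is a sketch that assumes SciPy's scipy.stats.boxcox and scipy.stats.yeojohnson; the lognormal sample is generated for the example and is not the price column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive sample (a stand-in for a skewed target).
x = rng.lognormal(mean=2.0, sigma=1.0, size=2000)

# z-score: a linear rescaling, so the skewness stays the same.
z = (x - x.mean()) / x.std()

# Box-Cox requires strictly positive input; lambda is estimated by maximum likelihood.
x_bc, lam_bc = stats.boxcox(x)

# Yeo-Johnson also accepts zero and negative values.
x_yj, lam_yj = stats.yeojohnson(x)

for name, values in [("raw", x), ("z-score", z), ("Box-Cox", x_bc), ("Yeo-Johnson", x_yj)]:
    print(f"{name:12s} skewness = {stats.skew(values):.3f}")
```

The estimated lambdas (lam_bc, lam_yj) need to be kept if predictions made on the transformed scale have to be mapped back to the original scale.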

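Blindly assuming normality can lead to inaccurate results, so the assumption should be checked against the problem. For example, stock prices cannot be assumed normal, because a price cannot be negative; they can instead be modeled as lognormal, which guarantees values $\ge 0$. Stock returns can be negative, so returns may be modeled as normal.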

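When the sample data show that a quality characteristic is not normally distributed, methods based on the normal distribution lead to incorrect decisions. The Johnson family consists of the probability distributions of random variables that become normal after a Johnson transformation. The Johnson system defines three families: bounded $\mathbf{S}_B$, lognormal $\mathbf{S}_L$, and unbounded $\mathbf{S}_U$.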

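The target in this case is price, which is clearly not normally distributed. We therefore fit three candidates: the unbounded Johnson distribution (Johnson SU), the normal distribution, and the lognormal distribution. Overall, Johnson SU fits price best.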

y = train_data['price']
plt.figure(1)
plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2)
plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3)
plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
plt.show()
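
sns.distplot is deprecated in recent seaborn releases, and the visual comparison can be backed by a number. The sketch below fits the same three candidates with scipy.stats and ranks them by log-likelihood. It runs on a synthetic lognormal sample, so its ranking says nothing about the real price column; fixing loc at 0 for the lognormal fit is an assumption made to keep that fit stable.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for train_data['price'].
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Candidate distributions with optional keyword arguments for fit().
candidates = {
    "normal": (st.norm, {}),
    "lognormal": (st.lognorm, {"floc": 0}),   # loc fixed at 0 (assumption)
    "Johnson SU": (st.johnsonsu, {}),
}

log_likelihood = {}
for name, (dist, fit_kwargs) in candidates.items():
    params = dist.fit(y, **fit_kwargs)        # maximum-likelihood estimates
    log_likelihood[name] = float(np.sum(dist.logpdf(y, *params)))

# Higher log-likelihood means a better fit to this sample.
for name, ll in sorted(log_likelihood.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} log-likelihood = {ll:.1f}")
```

Johnson SU has four parameters against two for the normal, so a penalized criterion such as AIC is a fairer comparison when the log-likelihoods are close.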

Skewness and Kurtosis

sns.distplot(train_data['price'])
print("Skewness:%f" % train_data['price'].skew())
print("Kurtosis:%f" % train_data['price'].kurt())
sns.distplot(train_data.skew(), color='blue', axlabel='Skewness')
sns.distplot(train_data.kurt(), color='orange', axlabel='Kurtosis')
train_data.skew()
train_data.kurt()

The output is:

Skewness:3.346487
Kurtosis:18.995183

SaleID                 -1.200000
name                   -1.039945
regDate                -0.697308
model                   1.740483
brand                   1.076201
bodyType                0.206937
fuelType                5.880049
gearbox                -0.264161
power                5733.451054
kilometer               1.141934
notRepairedDamage       3.908072
regionCode             -0.340832
creatDate            6881.080328
price                  18.995183
v_0                     3.993841
v_1                    -1.753017
v_2                    23.860591
v_3                    -0.418006
v_4                    -0.197295
v_5                    22.934081
v_6                    -1.742567
v_7                    25.845489
v_8                    -0.636225
v_9                    -0.321491
v_10                   -0.577935
v_11                   12.568731
v_12                    0.268937
v_13                   -0.438274
v_14                    2.393526
dtype: float64
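
To put these numbers in context, here is a small sketch comparing a symmetric sample with a right-skewed one. The samples are synthetic. Note that pandas' kurt() reports excess kurtosis, so a normal sample scores roughly 0 rather than 3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=5000))          # bell-shaped sample
right_skewed = pd.Series(rng.exponential(size=5000))  # long right tail

for name, s in [("normal", symmetric), ("exponential", right_skewed)]:
    # Positive skewness: long right tail. Positive excess kurtosis: heavier tails than normal.
    print(f"{name:12s} skewness = {s.skew():6.3f}   kurtosis = {s.kurt():6.3f}")
```

Against that baseline, the price skewness of about 3.35 and kurtosis of about 19 printed above indicate a long right tail with heavy extremes.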

Frequency of target values

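Values above 20000 are rare. They could be treated as outliers and either imputed or removed. Here, a log transformation makes the distribution much more even, and predictions can be made on that scale. This is a common trick in prediction problems.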

plt.hist(train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()
plt.hist(np.log(train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()
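
When a model is trained on the log scale, its predictions have to be mapped back to prices. A minimal sketch of the round trip follows, using np.log1p and np.expm1 in place of np.log; that substitution is an assumption made here so that a price of 0 stays defined, and the prices are made-up values.

```python
import numpy as np
import pandas as pd

# Illustrative prices, including a zero to show why log1p is used.
price = pd.Series([0, 500, 1850, 3600, 6222, 99999], name="price")

log_price = np.log1p(price)      # log(1 + x), defined at x = 0
restored = np.expm1(log_price)   # inverse transform, back to the price scale

print(log_price.round(3).tolist())
print(restored.round(0).tolist())
```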

Feature Analysis

Numeric features

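'seller' and 'offerType' have already been dropped, and the remaining features are label encoded. If the data to be processed were not label encoded, the features could be separated with the following code: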

# Numeric features
numeric_feature = train_data.select_dtypes(include=[np.number])
numeric_feature.columns
# Categorical features (np.object was removed in NumPy 1.24; use the built-in object)
categorical_feature = train_data.select_dtypes(include=[object])
categorical_feature.columns

The data here are already label encoded, so the features are separated by hand:

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']

Overview

numeric_features.append('price')
print(numeric_features)

['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'price']

Correlation analysis

price_numeric = train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending=False), '\n')
f, ax = plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
del price_numeric['price']
plt.show()
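
DataFrame.corr() computes Pearson correlation by default, which measures linear association only. For skewed features it can be worth checking a rank correlation as well. The sketch below uses method="spearman" on synthetic data; the feature and price columns are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
# A monotonic but strongly non-linear relationship.
toy = pd.DataFrame({"feature": x, "price": np.exp(x)})

pearson = toy.corr(method="pearson").loc["feature", "price"]
spearman = toy.corr(method="spearman").loc["feature", "price"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

A large gap between the two suggests a monotonic but non-linear relationship, which a transformation of the feature or the target may straighten out.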

Skewness and kurtosis of each feature

for col in numeric_features:
    print('{:15}'.format(col), 'Skewness: {:05.2f}'.format(train_data[col].skew()), '    ',
          'Kurtosis:{:06.2f}'.format(train_data[col].kurt()))

power           Skewness: 65.86      Kurtosis:5733.45
kilometer       Skewness: -1.53      Kurtosis:001.14
v_0             Skewness: -1.32      Kurtosis:003.99
v_1             Skewness: 00.36      Kurtosis:-01.75
v_2             Skewness: 04.84      Kurtosis:023.86
v_3             Skewness: 00.11      Kurtosis:-00.42
v_4             Skewness: 00.37      Kurtosis:-00.20
v_5             Skewness: -4.74      Kurtosis:022.93
v_6             Skewness: 00.37      Kurtosis:-01.74
v_7             Skewness: 05.13      Kurtosis:025.85
v_8             Skewness: 00.20      Kurtosis:-00.64
v_9             Skewness: 00.42      Kurtosis:-00.32
v_10            Skewness: 00.03      Kurtosis:-00.58
v_11            Skewness: 03.03      Kurtosis:012.57
v_12            Skewness: 00.37      Kurtosis:000.27
v_13            Skewness: 00.27      Kurtosis:-00.44
v_14            Skewness: -1.19      Kurtosis:002.39
price           Skewness: 03.35      Kurtosis:019.00

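Visualizing the distribution of each numeric feature
pd.melt(): reshapes a table from wide to long format to simplify later analysis. In the result, the key column holds the feature names and the value column holds the corresponding values.
sns.FacetGrid(): first build the grid with sns.FacetGrid(), then fill it with map().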

f = pd.melt(train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
plt.show()
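
What pd.melt() produces is easiest to see on a tiny table. The two-row frame below is made up for illustration.

```python
import pandas as pd

# Wide format: one column per feature.
wide = pd.DataFrame({
    "power": [75, 110],
    "kilometer": [12.5, 15.0],
    "price": [1850, 3600],
})

# Long format: one row per (feature, value) pair; id_vars columns are repeated.
long = pd.melt(wide, id_vars=["price"], value_vars=["power", "kilometer"])
print(long)
```

The "variable" column is what FacetGrid uses through col="variable" to give each feature its own panel.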

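Distribution of the anonymous features
sns.pairplot() shows the pairwise relationships between variables (linear or nonlinear, and whether there is a clear correlation):

Diagonal: the distribution of each attribute; the diag_kind parameter controls the plot type, with options 'auto', 'hist', 'kde', and None
Off-diagonal: the relationship between two different attributes; the kind parameter controls the plot type, with options 'scatter', 'kde', 'hist', and 'reg'
hue: groups the data by the given field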

sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(train_data[columns], height=2, kind='scatter', diag_kind='kde')  # `size` was renamed to `height`
plt.show()

Regression of price on several variables

# Features to plot against price, one subplot each
reg_columns = ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14', 'v_13']
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
for col, ax in zip(reg_columns, axes.flatten()):
    # Scatter plot with a fitted regression line for this feature
    sns.regplot(x=col, y='price', data=train_data, scatter=True, fit_reg=True, ax=ax)
plt.show()

Categorical features

Check the nunique distribution

for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print('{} has {} distinct values'.format(cat_fea, train_data[cat_fea].nunique()))
    print(train_data[cat_fea].value_counts())

name feature distribution:
name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
         ... 
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
model feature distribution:
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
brand feature distribution:
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
bodyType feature distribution:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
fuelType feature distribution:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
gearbox feature distribution:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
notRepairedDamage feature distribution:
notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
regionCode feature distribution:
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
       ... 
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64
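
name and regionCode have far more distinct values than the other categorical features, which makes per-category plots unreadable. A sketch of flagging such high-cardinality columns automatically follows; the toy frame and the 0.5 ratio threshold are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for train_data (values are illustrative only).
toy = pd.DataFrame({
    "name": [708, 387, 55, 1541, 203, 5074],
    "gearbox": [0, 0, 1, 0, 1, 0],
})

# Ratio of distinct values to rows: close to 1 means almost every row is its own category.
cardinality_ratio = toy.nunique() / len(toy)
print(cardinality_ratio)

HIGH_CARDINALITY = 0.5  # hypothetical threshold
high_card = cardinality_ratio[cardinality_ratio > HIGH_CARDINALITY].index.tolist()
low_card = [c for c in toy.columns if c not in high_card]
print("high cardinality:", high_card)
print("suitable for per-category plots:", low_card)
```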

View box plots

Identify outliers in the data at a glance
Judge the spread of the data and understand its distribution at a glance
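
The points a box plot draws beyond its whiskers follow, by default, the 1.5 × IQR rule. The same rule can be applied numerically, as in this minimal sketch; the values are made up for the example.

```python
import pandas as pd

values = pd.Series([1200, 1850, 2400, 3600, 4100, 6222, 99999])

# Quartiles and interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by a standard box plot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"limits: [{lower}, {upper}]")
print(outliers.tolist())
```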

categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
for c in categorical_features:
    train_data[c] = train_data[c].astype('category')
    if train_data[c].isnull().any():
        train_data[c] = train_data[c].cat.add_categories(['MISSING'])
        train_data[c] = train_data[c].fillna('MISSING')
        
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation = 90)
    
f = pd.melt(train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)  # `size` was renamed to `height`
g = g.map(boxplot, 'value', 'price')
plt.show()

View violin plots

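Show the distribution of the data and its probability density
This chart combines features of the box plot and the density plot, and is mainly used to show the shape of the distribution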

catg_list = categorical_features
target = 'price'  # the target column of this dataset
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=train_data)
    plt.show()

View bar charts

def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)
    
f = pd.melt(train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)  # `size` was renamed to `height`
g = g.map(bar_plot, 'value', 'price')
plt.show()

Visualizing category frequencies

def count_plot(x, **kwargs):
    # Bar chart of how often each category occurs
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)
    
f = pd.melt(train_data, value_vars=categorical_features)
# `size` was renamed to `height` in newer seaborn releases
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, 'value')
plt.show()