Common Plots for Data Visualization and Data Analysis with seaborn


weixin_37763484


Categories: machine learning, python, data mining. Tags: python, data mining

Article link: https://blog.csdn.net/weixin_37763484/article/details/128298841


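This article introduces several statistical plots commonly used during data analysis. They can be used to check data distributions, discover relationships between features, and detect outliers.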

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt	
import seaborn as sns
from scipy import stats
import math

import warnings 
warnings.filterwarnings("ignore")

Data preparation

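First, prepare two versions of the iris dataset.

The first, iris, is the original data; the second, iris_z, floors every feature to an integer.

Then inspect basic information about the data, such as missing values and means.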

1. Build the data

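from sklearn.datasets import load_iris

# Load the dataset once and reuse it
dataset=load_iris()
data_f=pd.DataFrame(dataset.data)
target_f=pd.DataFrame(dataset.target)

# iris: the original measurements plus the class label
iris=pd.concat([data_f,target_f],axis=1)

# Note: load_iris orders the features as sepal length, sepal width,
# petal length, petal width; the short names below are only labels
data_f.columns=["w1","w2","l1","l2"]
# iris_z: floor every feature to an integer (iris above holds its own copy of the values)
for column in data_f.columns:
    data_f[column]=data_f[column].apply(math.floor)

iris_z=pd.concat([data_f,target_f],axis=1)

iris.columns=["w1","w2","l1","l2","target"]
iris.head()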

	w1	w2	l1	l2
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

iris_z.columns=["w1","w2","l1","l2","target"]
iris_z.head()

	w1	w2	l1
0	5	3	1
1	4	3	1
2	4	3	1
3	4	3	1
4	5	3	1
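The column-by-column apply used to build iris_z can also be written as one vectorized call. The following is a minimal sketch on a hypothetical five-row frame (`demo` and `demo_z` are illustrative names, not part of the original code), assuming every column is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of the iris features
demo = pd.DataFrame({
    "w1": [5.1, 4.9, 4.7, 4.6, 5.0],
    "w2": [3.5, 3.0, 3.2, 3.1, 3.6],
})

# np.floor works on the whole frame at once; astype(int) drops the ".0"
demo_z = np.floor(demo).astype(int)
print(demo_z)
```

The result should match what the loop produces, just without the per-element Python call.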

2. Inspect basic information

iris.describe()

	w1	w2	l1	l2	target
count	150.000000	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667	1.000000
std	0.828066	0.433594	1.764420	0.763161	0.819232
min	4.300000	2.000000	1.000000	0.100000	0.000000
25%	5.100000	2.800000	1.600000	0.300000	0.000000
50%	5.800000	3.000000	4.350000	1.300000	1.000000
75%	6.400000	3.300000	5.100000	1.800000	2.000000
max	7.900000	4.400000	6.900000	2.500000	2.000000

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
w1        150 non-null float64
w2        150 non-null float64
l1        150 non-null float64
l2        150 non-null float64
target    150 non-null int32
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
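The info() output reports non-null counts, which shows that iris has no missing values. When a dataset does have gaps, counting them per column is often more direct. A small sketch on a hypothetical frame (`demo` is an illustrative name, and the NaN values are made up for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
demo = pd.DataFrame({
    "w1": [5.1, 4.9, np.nan],
    "w2": [3.5, np.nan, np.nan],
})

# isna() marks missing cells; sum() counts them per column
missing = demo.isna().sum()
print(missing)
```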

Plotting

1.countplot

Analyze the w1 feature: for each value of w1, count how many samples fall into each target class.

# Pass the column as x=...; recent seaborn versions no longer accept it positionally
sns.countplot(x="w1",hue="target",data=iris_z)
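The numbers behind a count plot can be inspected as a table as well. The sketch below uses pd.crosstab on a hypothetical miniature stand-in for iris_z (`demo` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical miniature stand-in for iris_z (floored w1 plus class label)
demo = pd.DataFrame({
    "w1": [4, 4, 5, 5, 5, 6, 6, 7],
    "target": [0, 0, 0, 1, 1, 1, 2, 2],
})

# Rows: w1 values; columns: target classes; cells: number of samples
counts = pd.crosstab(demo["w1"], demo["target"])
print(counts)
```

Each bar in the count plot corresponds to one cell of this table.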


2. Pivot view

For each combination of w1 and w2 values, look at the mean of target.

k=iris_z.groupby(["w1","w2"])["target"].mean()
k

w1  w2
4   2     0.750000
    3     0.000000
5   2     1.208333
    3     0.242424
    4     0.000000
6   2     1.440000
    3     1.689655
7   2     2.000000
    3     1.888889
Name: target, dtype: float64
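The grouped result is a Series with a two-level index. If a two-dimensional table is easier to read, pivot_table gives the same means with w1 as rows and w2 as columns. A minimal sketch on hypothetical data (`demo` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical miniature stand-in for iris_z
demo = pd.DataFrame({
    "w1": [4, 4, 5, 5, 5, 6],
    "w2": [2, 3, 2, 3, 3, 3],
    "target": [1, 0, 1, 0, 1, 2],
})

# Rows: w1, columns: w2, cells: mean of target for that combination.
# Combinations that never occur show up as NaN.
table = demo.pivot_table(index="w1", columns="w2", values="target", aggfunc="mean")
print(table)
```

Calling .unstack() on the grouped Series should give an equivalent layout.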

3. Distribution plot

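Analyze the w1 feature of both datasets. The blue line is the estimated distribution of the data, and the black line is the fitted normal distribution.

After discretization, iris_z deviates noticeably from the normal distribution.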

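# Note: distplot is deprecated since seaborn 0.11; its replacements
# (histplot/displot) have no fit= argument, so the original call is kept
sns.distplot(iris["w1"],fit=stats.norm)
plt.show()
sns.distplot(iris_z["w1"],fit=stats.norm)
plt.show()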
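The black curve drawn by fit=stats.norm is a normal density whose parameters are estimated from the data. The sketch below reproduces that estimate by hand with scipy, using only the first five w1 values shown earlier as sample input; `grid` and `pdf` are illustrative names.

```python
import numpy as np
from scipy import stats

# First five w1 values from iris.head(), used here only as sample input
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Maximum-likelihood fit of a normal distribution: returns (mean, std)
mu, sigma = stats.norm.fit(values)

# Density of the fitted normal on a grid, ready to be drawn with plt.plot
grid = np.linspace(values.min(), values.max(), 50)
pdf = stats.norm.pdf(grid, mu, sigma)

print("mu:", mu, "sigma:", sigma)
```

With a newer seaborn, one option is to draw a density-scaled histogram and overlay this curve manually.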


4. Q-Q plot

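In a Q-Q plot, the closer the blue points lie to the red line, the better the data matches a normal distribution.

For iris["w1"] from the previous section, the points follow the line closely, so the feature is approximately normal.

iris_z["w1"], in contrast, does not follow a normal distribution.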

x=stats.probplot(iris["w1"],plot=plt)
plt.show()
x=stats.probplot(iris_z["w1"],plot=plt)
plt.show()

5. Distribution comparison plot

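Split the data into a training set and a test set, and check whether feature w1 has the same distribution in both.

If the two distributions differ noticeably, a model trained on this feature may not generalize to the test set, so the feature is a candidate for removal.

In the resulting plot the difference is small, so w1 can be used.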

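iris_copy=iris.copy()
# Shuffle the rows, then use the first 120 as the training set and the rest as the test set
iris_shuffle=iris_copy.sample(frac=1)
x_train=iris_shuffle.iloc[0:120]
x_test=iris_shuffle.iloc[120:]
# fill=True replaces the shade=True argument, which is deprecated since seaborn 0.11
ax=sns.kdeplot(x_train["w1"],color="red",fill=True,label="train")
ax=sns.kdeplot(x_test["w1"],color="blue",fill=True,label="test")
plt.legend()
plt.show()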
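Comparing two density curves by eye is subjective. One possible numeric complement, which is not part of the original article, is the two-sample Kolmogorov-Smirnov test from scipy. The sketch below runs it on synthetic values standing in for iris["w1"]; the seed, mean and scale are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Hypothetical feature values standing in for iris["w1"]
values = rng.normal(loc=5.8, scale=0.8, size=150)
rng.shuffle(values)
train, test = values[:120], values[120:]

# Two-sample Kolmogorov-Smirnov test: a small statistic and a large
# p-value mean no evidence that the two samples differ in distribution
result = stats.ks_2samp(train, test)
print("KS statistic:", result.statistic)
print("p-value:", result.pvalue)
```

A small p-value would suggest that the feature is distributed differently in the two sets, which is the situation the plot is meant to reveal.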


6. Correlation heatmap

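Used to compare how strongly features are correlated with each other.

The larger the absolute value, the stronger the correlation; for example, a coefficient of -1 indicates a stronger (negative) linear relationship than 0.

In the resulting heatmap, target is highly correlated with l1 and l2.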

plt.figure(figsize=[5,5])
sns.heatmap(iris[["l1","l2","w1","w2","target"]].corr(),annot=True)
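When there are many features, ranking them by the absolute value of their correlation with target can be easier to scan than a heatmap. A minimal sketch on hypothetical data (`demo` and its values are made up so that l1 tracks target and w2 does not):

```python
import pandas as pd

# Hypothetical data: l1 rises with target, w2 is only weakly related to it
demo = pd.DataFrame({
    "l1": [1, 2, 3, 4, 5, 6],
    "w2": [3, 1, 2, 3, 1, 2],
    "target": [0, 0, 1, 1, 2, 2],
})

# Pearson correlation of every feature with target, ranked by absolute value
ranked = (
    demo.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Taking the absolute value puts strong negative correlations at the top alongside strong positive ones.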


7. Box plot

Look at the spread of w2 from the previous step. A box plot can be used to find outliers, which can then be removed if necessary.

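fig = plt.figure(figsize=(6,4))
w2=pd.DataFrame(iris["w2"])
# return_type="both" returns a namedtuple with fields ax and lines
box=w2.boxplot(
    return_type="both",
    notch=True,          # draw a notched box
    sym='r*',            # mark outliers with red stars
    showmeans=True,      # show the mean
    patch_artist=False,  # do not fill the box with color
    meanline=True,       # draw the mean as a line
    widths=0.5,          # box width
    vert=True)           # draw the box vertically

t = plt.title('Box plot')
# Source: https://blog.csdn.net/opp003/article/details/84959020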


The lower and upper bounds for outliers can be read from box.lines["whiskers"][0] and box.lines["whiskers"][1]:

print("line1:",box.lines["whiskers"][0].get_ydata())
print("line2:",box.lines["whiskers"][1].get_ydata())

low_value=box.lines["whiskers"][0].get_ydata()[1]
high_value=box.lines["whiskers"][1].get_ydata()[1]
print("low_value:",low_value)
print("high_value:",high_value)

line1: [2.8 2.2]
line2: [3.3 4. ]
low_value: 2.2
high_value: 4.0
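The whisker ends come from the usual 1.5 × IQR rule: a whisker reaches to the most extreme data point that is still within 1.5 times the interquartile range of the box. The sketch below recomputes this by hand on a hypothetical sample (`values` is made up for illustration), assuming the default whisker setting of 1.5.

```python
import numpy as np

# Hypothetical sample with one low and one high outlier
values = np.array([2.0, 2.2, 2.8, 3.0, 3.0, 3.1, 3.3, 3.4, 4.0, 4.4])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside them is treated as an outlier
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= low_fence) & (values <= high_fence)]
low_whisker, high_whisker = inside.min(), inside.max()
outliers = values[(values < low_fence) | (values > high_fence)]

print("whiskers:", low_whisker, high_whisker)
print("outliers:", outliers)
```

The same boolean mask can be used to drop outlier rows from a DataFrame if that is appropriate for the analysis.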
