Data Mining Day 20-21: Introduction to Data Mining, Chapter 3, Exploring Data

偲偲粑

Article link: https://blog.csdn.net/weixin_43329319/article/details/98474403

This post uses the iris dataset to implement, in Python, the visualization techniques covered in the book.

3.3.3-1. Visualizing a Small Number of Attributes

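1.1 Stem-and-leaf plot

A stem-and-leaf plot was already implemented in the notes on Statistics for Business and Economics (13th edition, Python), chapters 01-02. The code below is a slightly modified version.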

import seaborn as sns

iris = sns.load_dataset("iris")
# Scale to whole tenths so stems and leaves come from exact integer arithmetic
data = (iris['sepal_length'] * 10).round().astype(int)
# Distinct stems (the integer part of each length), in ascending order
stems = sorted(set(data // 10))
for m in stems:
    # Leaves are the last digits of every value that shares this stem
    leaves = sorted(n % 10 for n in data if n // 10 == m)
    print(m, '|', ''.join(str(leaf) for leaf in leaves))

4 | 3444566667788888999999
5 | 0000000000111111111222234444445555555666666777777778888888999
6 | 000000111111222233333333344444445555566777777778889999
7 | 0122234677779
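
The same grouping can be wrapped in a small reusable function. This is only a minimal sketch, assuming non-negative values with one decimal place; the name `stem_and_leaf` and the sample list are made up for illustration and do not come from the book or the earlier notes.

```python
from collections import defaultdict


def stem_and_leaf(values, scale=10):
    """Group values into stems and sorted leaves.

    `scale` converts each value to an integer first (10 for one decimal
    place), so stems and leaves come from exact integer arithmetic.
    """
    groups = defaultdict(list)
    for v in values:
        n = round(v * scale)          # e.g. 5.1 -> 51
        groups[n // 10].append(n % 10)
    # Sort the stems, and sort the leaves inside each stem
    return {stem: sorted(leaves) for stem, leaves in sorted(groups.items())}


# Hypothetical sample, not the iris data
sample = [4.3, 4.4, 5.0, 5.1, 5.1, 4.9, 7.0]
for stem, leaves in stem_and_leaf(sample).items():
    print(stem, '|', ''.join(map(str, leaves)))
```

Returning a dictionary instead of printing directly keeps the grouping step separate from the display step, which makes it easier to check.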

1.2 Histogram

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
bins = 10
plt.figure(figsize=(20, 4))
# One histogram per attribute, side by side
for i in range(len(cols)):
    plt.subplot(1, 4, i + 1)
    plt.hist(iris[cols[i]], bins, histtype='bar', facecolor='yellowgreen', alpha=0.75, rwidth=0.95)
    plt.title(cols[i])
plt.show()
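
To see what the histogram counts before anything is drawn, here is a pure-Python sketch of equal-width binning. It assumes the convention documented for NumPy's histogram, where every interval is half-open except the last one, which also includes the maximum. The name `equal_width_counts` is hypothetical, the values must not all be identical, and results may differ from NumPy by rounding when a value sits exactly on a bin edge.

```python
def equal_width_counts(values, bins=10):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Index of the interval containing v; the maximum goes into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + k * width for k in range(bins + 1)]
    return counts, edges


# Hypothetical sample, not the iris data
counts, edges = equal_width_counts([1.0, 1.5, 2.0, 2.5, 3.0, 5.0], bins=4)
print(counts)
print(edges)
```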

1.3 Two-dimensional histogram

The data is the same as before. This section additionally uses the Axes3D toolkit, following the official matplotlib example.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
bins = 3
hist, xedges, yedges = np.histogram2d(iris['petal_width'], iris['petal_length'], bins=bins)
# Lower-left corner of every bin (drop the last edge); indexing='ij' matches hist[i, j]
xpos, ypos = np.meshgrid(xedges[:-1], yedges[:-1], indexing='ij')
xpos = xpos.ravel()
ypos = ypos.ravel()
zpos = np.zeros_like(xpos)
# The footprint of each bar is the actual bin width along each axis
dx = (xedges[1] - xedges[0]) * np.ones_like(zpos)
dy = (yedges[1] - yedges[0]) * np.ones_like(zpos)
dz = hist.ravel()
ax.bar3d(xpos, ypos, zpos, dx, dy, dz, color='yellowgreen', zsort='average')
# Reverse the x axis direction instead of reordering the data and relabeling ticks by hand
ax.invert_xaxis()
ax.set_xlabel('petal width')
ax.set_ylabel('petal length')
plt.show()
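
The bar heights come from counting points per grid cell. The sketch below does the same counting in pure Python for an equal-width grid, which also shows which index belongs to which axis. The name `hist2d_counts` and the sample points are hypothetical, and the function assumes that neither axis has all-identical values.

```python
def hist2d_counts(xs, ys, bins=3):
    """Count points per cell of a bins x bins equal-width grid."""
    def bin_index(v, lo, hi):
        # Cell index along one axis; the maximum falls into the last cell
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        counts[bin_index(x, x_lo, x_hi)][bin_index(y, y_lo, y_hi)] += 1
    return counts   # counts[i][j]: i-th x interval, j-th y interval


# Hypothetical points, not the iris data
counts = hist2d_counts([0, 1, 2.5, 3, 5, 6], [0, 0.5, 3, 3, 6, 5.5], bins=3)
print(counts)
```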

1.4 Box plot

Box plots are fairly simple, so this one also gets some color.

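cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Arbitrary fill colors, one per box (the original color list was not shown)
colors = ['pink', 'lightblue', 'lightgreen', 'khaki']
# boxplot returns a dict of artists; patch_artist=True makes the boxes fillable
box = plt.boxplot([iris[c] for c in cols], patch_artist=True)
plt.xticks([1, 2, 3, 4], cols)
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)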
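
For reference, the numbers behind one box can be computed by hand. This is a rough sketch, assuming matplotlib's default whisker rule (1.5 times the interquartile range) and quartiles obtained by linear interpolation. The name `box_stats` and the sample values are hypothetical.

```python
from statistics import quantiles


def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws for one attribute."""
    data = sorted(values)
    # method='inclusive' interpolates linearly between order statistics
    q1, med, q3 = quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        'q1': q1, 'median': med, 'q3': q3,
        # Whiskers end at the most extreme points still inside the fences
        'whisker_low': inside[0], 'whisker_high': inside[-1],
        # Points beyond the fences are drawn individually as outliers
        'outliers': [v for v in data if v < lo_fence or v > hi_fence],
    }


# Hypothetical sample, not the iris data
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```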

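1.5 Pie chart

The nicer-looking pie charts were already collected in the notes on Statistics for Business and Economics (13th edition, Python), chapters 01-02. Here value_counts() summarizes the data, and a legend is added as well.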

# Number of flowers per species; the index holds the species names
counts = iris.species.value_counts()
plt.pie(counts, labels=counts.index)
plt.legend(loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

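1.6 Empirical cumulative distribution function (ECDF)

The data has to be constructed by hand. Calling reduce inside the loop adds redundant computation, but with so little data it does not matter. The result is then drawn with plt.step.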

from functools import reduce
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
plt.figure(figsize=(10, 6))
for n in range(len(cols)):
    # Build the data: count of each distinct value, ordered by value
    data = iris[cols[n]].value_counts().sort_index()
    len_data = len(data)
    y_max = reduce(lambda a, b: a + b, data)
    y = [data.iloc[0] / y_max]
    for i in range(1, len_data):
        # Cumulative fraction of observations <= the i-th distinct value
        y.append(reduce(lambda a, b: a + b, data.iloc[:i+1]) / y_max)
    plt.subplot(2, 2, n + 1)
    # where='post' holds each level until the next value, as an ECDF requires
    plt.step(data.index, y, where='post')
    plt.grid(axis='both', linestyle='-')
    # plt.plot(data.index, y, 'C1o', alpha=0.5)
    plt.title(cols[n])
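
The repeated summation can be avoided with a running total. The sketch below is one possible alternative, not the approach used in the code above. It relies only on the standard library, and the name `ecdf` and the sample list are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def ecdf(values):
    """Return (xs, ys): distinct sorted values and the fraction of data <= each."""
    counts = Counter(values)
    xs = sorted(counts)
    # A running total of the counts replaces the repeated summation
    running = accumulate(counts[x] for x in xs)
    n = len(values)
    ys = [c / n for c in running]
    return xs, ys


# Hypothetical sample, not the iris data
xs, ys = ecdf([2, 1, 2, 3, 2, 4, 1, 4])
print(xs)
print(ys)
```

The returned lists can be passed to plt.step directly. With pandas, `data.cumsum() / data.sum()` on the sorted value counts should give the same fractions.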

1.6 Percentile plot

cols=['sepal_length','sepal_width','petal_length','petal_width']
marker=['o','v','s','D']
x=list(range(0,101,10))
for n in range(len(cols)):
    data_per=[]
    for i in x:
        data_per.append(np.percentile(iris[cols[n]],i))
    plt.plot(x,data_per,marker=marker[n])
plt.legend(cols)
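
Each plotted point is a percentile of one attribute. As a rough illustration of what that number means, here is a pure-Python sketch assuming linear interpolation between ranks, which is the default method of np.percentile. The function name `percentile` and the sample are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0 <= q <= 100) by linear interpolation between ranks."""
    data = sorted(values)
    # Fractional rank of the requested percentile in the sorted data
    rank = (len(data) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    frac = rank - lower
    return data[lower] + (data[upper] - data[lower]) * frac


# Hypothetical sample, not the iris data
sample = [15, 20, 35, 40, 50]
deciles = [percentile(sample, q) for q in range(0, 101, 10)]
print(deciles)
```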

1.7 Scatter plot matrix

The seaborn.PairGrid example is itself built on the iris data, but I could not work out a good place to put the legend.

g = sns.PairGrid(iris, hue="species", palette="Set2",hue_kws={"marker": ["o", "s", "D"]})
g = g.map_offdiag(plt.scatter, linewidths=1, edgecolor="w", s=40)
g.add_legend()

1.8 Scatter plot

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
species = ['versicolor', 'virginica', 'setosa']
# Create the axes explicitly so the labels below attach to this figure
fig, ax = plt.subplots()
for c, m, i in [('r', 'o', 0), ('b', '^', 1), ('y', '*', 2)]:
    iris_1 = iris[iris.species == species[i]]
    ax.scatter(iris_1[cols[2]], iris_1[cols[3]], c=c, marker=m)
ax.legend(species, loc='upper left')
ax.set_xlabel('petal_length')
ax.set_ylabel('petal_width')

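1.9 Three-dimensional scatter plot

This feels a bit clumsy, but I could not find a better way. The for loop iterates over a list of tuples, mainly as a reminder that this form exists.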

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
species = ['versicolor', 'virginica', 'setosa']
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for c, m, i in [('r', 'o', 0), ('b', '^', 1), ('y', '*', 2)]:
    iris_1 = iris[iris.species == species[i]]
    ax.scatter(iris_1[cols[0]], iris_1[cols[1]], iris_1[cols[2]], c=c, marker=m)
# Each scatter call draws one species, so the legend entries are the species names
ax.legend(species, loc='upper left')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.set_zlabel('petal_length')

2. Visualizing Spatial Data

2.1 Contour plot

The code below is copied from a matplotlib gallery example, cited here as Contour plot of irregularly spaced data.

origin = 'lower'
delta = 0.025
x = y = np.arange(-3.0, 3.01, delta)
X, Y = np.meshgrid(x, y)
Z1 = np.exp(-X**2 - Y**2)
Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
Z = (Z1 - Z2) * 2
fig1, ax2 = plt.subplots(constrained_layout=True)
CS = ax2.contourf(X, Y, Z, 10, cmap=plt.cm.bone, origin=origin)
CS2 = ax2.contour(CS, levels=CS.levels[::2], colors='r', origin=origin)
ax2.set_title('Nonsense (3 masked regions)')
ax2.set_xlabel('word length anomaly')
ax2.set_ylabel('sentence length anomaly')
cbar = fig1.colorbar(CS)
cbar.ax.set_ylabel('verbosity coefficient')
cbar.add_lines(CS2)

2.2 Surface plot

This will be covered in Chapter 9. For now, here is a kernel density plot instead.

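x = [4, 6, 1, 2, 4, 6, 7, 1, 2, 4, 6, 7]
y = [1, 1, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5]
plt.scatter(x, y)
# Recent seaborn versions require the two variables as keyword arguments
sns.kdeplot(x=x, y=y)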
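
The contours show an estimated density: high where points cluster, low elsewhere. The following is a simplified sketch of a two-dimensional Gaussian kernel density estimate with a single fixed bandwidth `h`. That is an assumption made for illustration and is not how seaborn chooses its bandwidth, so the values will not match the plot exactly. The name `kde2d` is hypothetical.

```python
import math


def kde2d(px, py, xs, ys, h=1.0):
    """Gaussian kernel density estimate at (px, py) with one fixed bandwidth h."""
    total = 0.0
    for x, y in zip(xs, ys):
        # Squared distance from the query point to one sample, in bandwidth units
        d2 = ((px - x) ** 2 + (py - y) ** 2) / h ** 2
        total += math.exp(-d2 / 2)
    # Normalize so the estimate integrates to 1 over the plane
    return total / (len(xs) * 2 * math.pi * h ** 2)


x = [4, 6, 1, 2, 4, 6, 7, 1, 2, 4, 6, 7]
y = [1, 1, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5]
print(kde2d(4, 4, x, y))    # density near the cluster of points
print(kde2d(10, 10, x, y))  # density far from every point
```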

2.2 Parallel coordinates

This uses pandas.plotting.parallel_coordinates.

from pandas.plotting import parallel_coordinates
fig,axes = plt.subplots()
parallel_coordinates(iris,'species',ax=axes)

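2.3 Star coordinates

I could not find a library for drawing Chernoff faces, so I built a star coordinates plot by hand instead. The samples are not drawn at random; the first five flowers of each species are used.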

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
species = ['versicolor', 'virginica', 'setosa']
# plt.figure(figsize=(15,15))
for i in range(3):
    # Row numbers of the first five flowers of this species
    numbers = list(iris[iris.species == species[i]].index)[:5]
    plt.figure(figsize=(10, 3))
    for n in range(len(numbers)):
        ir = iris.iloc[numbers[n]]
        # Four vertices, one per attribute, joined in the order 1-2-3-4-1-3-2-4
        x = [ir.iloc[0], 0, -ir.iloc[2], 0, ir.iloc[0], -ir.iloc[2], 0, 0]
        y = [0, ir.iloc[1], 0, -ir.iloc[3], 0, 0, ir.iloc[1], -ir.iloc[3]]
        plt.subplot(1, 5, n + 1)
        plt.scatter(x, y)
        plt.plot(x, y, c='r')
        # Same axis limits for every panel
        plt.xlim(-7, 8)
        plt.ylim(-3, 5)
        # Hide the tick marks
        plt.xticks([])
        plt.yticks([])
        plt.title('%s %i' % (species[i], numbers[n]))
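
The four hard-coded directions generalize to any number of attributes by spacing the axes evenly around the origin. Here is a minimal sketch under that assumption; the name `star_vertices` is hypothetical. With four attributes it reproduces the right, up, left, down layout used above.

```python
import math


def star_vertices(values):
    """Place each attribute value on its own axis, axes evenly spaced around the origin."""
    n = len(values)
    points = []
    for k, v in enumerate(values):
        angle = 2 * math.pi * k / n   # direction of the k-th axis
        points.append((v * math.cos(angle), v * math.sin(angle)))
    return points


# First iris record: sepal length, sepal width, petal length, petal width
for x, y in star_vertices([5.1, 3.5, 1.4, 0.2]):
    print(round(x, 2), round(y, 2))
```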
