4-03-2 Pandas - Scatter Plots and Andrews Curves

Yehchitsai

Last modified 2023-12-05 13:47:45


First published 2022-04-18 10:44:02

Article link: https://blog.csdn.net/m0_50614038/article/details/124244029


Correlation: Scatter Plots

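A scatter plot is the classic, fundamental chart for studying the relationship between two variables. When the data contains several groups, each group can be drawn in its own color. The example below uses county-level population data for the US Midwest states (midwest.csv). The data is first grouped by category, of which there are 16, and each category is assigned a color. Printing the dataset's information shows that midwest has 437 rows and 28 columns. The share of the population below the poverty line (percbelowpoverty, percent below poverty) is then plotted against the share of people with a high school diploma (perchsd, the percentage of people with a high school diploma). Most points clearly cluster toward the upper left: the higher the share of high school graduates, the lower the share of people in poverty, which indicates a negative correlation between high school education and poverty.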
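The negative relationship read off the scatter plot can also be checked numerically with a Pearson correlation coefficient. The following is a minimal sketch, not part of the original article: the helper name `poverty_education_corr` is hypothetical, and the toy values are invented purely for illustration (they are not taken from midwest.csv). With the real data, the same function could be called on the `midwest` DataFrame once it has been loaded.

```python
import pandas as pd

def poverty_education_corr(df, x="percbelowpoverty", y="perchsd"):
    """Return the Pearson correlation between two columns of df."""
    return df[x].corr(df[y])

# Toy values invented for illustration only; NOT taken from midwest.csv.
toy = pd.DataFrame({
    "percbelowpoverty": [5.0, 10.0, 15.0, 20.0],
    "perchsd": [85.0, 80.0, 72.0, 65.0],
})

r = poverty_education_corr(toy)
print(f"Pearson r = {r:.3f}")  # a negative value means the two columns move in opposite directions
```

A value close to -1 would indicate a strong negative linear relationship, a value near 0 little or no linear relationship.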

Description of the midwest dataset

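Column	Description
county	County name
state	State abbreviation
area	County area
poptotal	Total county population
popdensity	County population density
popwhite	Number of white residents in the county
popblack	Number of Black residents in the county
popamerindian	Number of American Indian residents in the county
popasian	Number of Asian residents in the county
popother	Number of residents of other races in the county
percwhite	Share of white residents in the county
percblack	Share of Black residents in the county
percamerindan	Share of American Indian residents in the county
percasian	Share of Asian residents in the county
percother	Share of residents of other races in the county
popadults	Number of adults
perchsd	Share with a high school diploma
percollege	Share with a college education
percprof	Share with a professional qualification
poppovertyknown	Population with known poverty status
percpovertyknown	Share of the population with known poverty status
percbelowpoverty	Share below the poverty line
percchildbelowpovert	Share of children below the poverty line
percadultpoverty	Share of adults in poverty
percelderlypoverty	Share of elderly people in poverty
inmetro	Whether the county is in a metropolitan area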

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv")

# Prepare the data: pick one color per category.
# Note: tab10 has only 10 distinct colors, so with more than 10 categories some colors repeat.
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

print(colors)
# Show the dataset information: 437 rows and 28 columns
midwest.info()

# Draw every category's scatter plot on the same axes so they can be compared;
# the goal is to see how education level relates to poverty
ax = midwest.loc[midwest.category==categories[0], :].plot.scatter(x='percbelowpoverty', y='perchsd', color=colors[0], label=str(categories[0]))
for i in range(1, len(categories)):
    ax = midwest.loc[midwest.category==categories[i], :].plot.scatter(x='percbelowpoverty', y='perchsd', color=colors[i], label=str(categories[i]), ax=ax)
  
The output of midwest.info() is as follows:
  
RangeIndex: 437 entries, 0 to 436
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PID                   437 non-null    int64  
 1   county                437 non-null    object 
 2   state                 437 non-null    object 
 3   area                  437 non-null    float64
 4   poptotal              437 non-null    int64  
 5   popdensity            437 non-null    float64
 6   popwhite              437 non-null    int64  
 7   popblack              437 non-null    int64  
 8   popamerindian         437 non-null    int64  
 9   popasian              437 non-null    int64  
 10  popother              437 non-null    int64  
 11  percwhite             437 non-null    float64
 12  percblack             437 non-null    float64
 13  percamerindan         437 non-null    float64
 14  percasian             437 non-null    float64
 15  percother             437 non-null    float64
 16  popadults             437 non-null    int64  
 17  perchsd               437 non-null    float64
 18  percollege            437 non-null    float64
 19  percprof              437 non-null    float64
 20  poppovertyknown       437 non-null    int64  
 21  percpovertyknown      437 non-null    float64
 22  percbelowpoverty      437 non-null    float64
 23  percchildbelowpovert  437 non-null    float64
 24  percadultpoverty      437 non-null    float64
 25  percelderlypoverty    437 non-null    float64
 26  inmetro               437 non-null    int64  
 27  category              437 non-null    object 
dtypes: float64(15), int64(10), object(3)
memory usage: 95.7+ KB


Figure 4-3-6: A scatter plot drawn with pandas
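One detail of the color assignment is worth a closer look. Looking up tab10 with evenly spaced floats can return at most 10 distinct colors, so with 16 categories some groups end up sharing a color. The sketch below illustrates this and shows one possible alternative: integer indexing into the larger tab20 colormap. The helper name `distinct_colors` is hypothetical, the count of 16 is taken from the text above, and `matplotlib.colormaps` is assumed to be available (Matplotlib 3.5 or later).

```python
import matplotlib

def distinct_colors(n, cmap_name="tab20"):
    """Return n RGBA tuples by integer-indexing a qualitative colormap."""
    cmap = matplotlib.colormaps[cmap_name]
    if n > cmap.N:
        raise ValueError(f"{cmap_name} only has {cmap.N} distinct colors")
    # Integer arguments index the color table directly, so indices never collide.
    return [cmap(i) for i in range(n)]

n_categories = 16  # category count stated in the article text

# Float lookup into tab10, as in the plotting code of this section
tab10 = matplotlib.colormaps["tab10"]
float_colors = [tab10(i / float(n_categories - 1)) for i in range(n_categories)]

print(len(set(float_colors)))                   # number of distinct colors from the float lookup
print(len(set(distinct_colors(n_categories))))  # number of distinct colors from integer indexing
```

The list returned by `distinct_colors(len(categories))` could be used in place of `colors` in the plotting loop if one color per category is wanted.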

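Grouping: Andrews Curves

Andrews curves make it possible to plot multivariate data as a large number of curves, each created by using one sample's attributes as the coefficients of a Fourier series. Coloring the curves by group makes clusters in the data visible: curves of samples from the same group usually lie close together and form larger structures. The following example uses mtcars with the number of cylinders (cyl) as the grouping variable, and the curves do indeed cluster by group.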
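To make the idea of "attributes as Fourier series coefficients" concrete, here is a minimal sketch of the standard Andrews definition, f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..., evaluated on [-pi, pi]. The function name `andrews_curve` and the three sample vectors are made up for illustration; this is not the code pandas runs internally.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    result = np.full_like(t, x[0] / np.sqrt(2.0))  # constant term from the first attribute
    for k, coef in enumerate(x[1:], start=1):
        harmonic = (k + 1) // 2                    # 1, 1, 2, 2, 3, 3, ...
        if k % 2 == 1:
            result += coef * np.sin(harmonic * t)  # odd positions feed the sine terms
        else:
            result += coef * np.cos(harmonic * t)  # even positions feed the cosine terms
    return result

t = np.linspace(-np.pi, np.pi, 200)

# Hypothetical samples: a and b are nearly identical, c is quite different.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [1.1, 2.0, 3.1, 4.0]
sample_c = [4.0, 1.0, 0.5, 2.0]

gap_ab = np.max(np.abs(andrews_curve(sample_a, t) - andrews_curve(sample_b, t)))
gap_ac = np.max(np.abs(andrews_curve(sample_a, t) - andrews_curve(sample_c, t)))
print(gap_ab, gap_ac)  # the similar pair stays much closer together than the dissimilar pair
```

Because each curve is linear in the sample's attributes, samples with similar attribute values produce curves that stay close for every t, which is why curves from the same group tend to bundle together.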

import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves

# Load the data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
# Drop the string columns and keep only the numeric ones, which the Fourier series needs
df.drop(['cars', 'carname'], axis=1, inplace=True)

# Plot one curve per car, colored by number of cylinders
andrews_curves(df, 'cyl', colormap='Set1')