Day 10 - Ordered Data in Python (DataWhale)

Latest recommended article published on 2023-03-07 16:02:14

liying_tt

Category: Python. Tags: python, data analysis

Article link: https://blog.csdn.net/liying_tt/article/details/112321008

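Summary (generated automatically by CSDN): This article describes in detail how to work with categorical data in pandas, including creating, modifying, and managing categorical objects: conversion with astype, the cat accessor, adding, removing, and modifying categories, setting ordered categories, and constructing intervals. It also shows how to construct intervals with cut and qcut and how to use IntervalIndex and Interval objects. In addition, worked examples demonstrate how to handle categories that do not appear in the data, how to sort and compare intervals, and how to bin by quantiles and by cut points.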

import pandas as pd 
import numpy as np

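Categorical data

I. The cat object

1. Attributes of the cat object

(1) pandas provides the category dtype for handling categorical variables.

(2) astype converts an ordinary Series into a categorical one.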

df = pd.read_csv('data/learn_pandas.csv',
                usecols=['Grade','Name','Gender','Height','Weight'])
s = df.Grade.astype('category')
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']
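
The same conversion can be tried without data/learn_pandas.csv. The following is a minimal sketch on a small in-memory Series; the sample values and the name s_demo are made up for illustration and are not part of the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Grade column of learn_pandas.csv
grades = pd.Series(
    ['Freshman', 'Freshman', 'Senior', 'Sophomore', 'Sophomore'],
    name='Grade',
)

# astype('category') infers the categories from the distinct values
s_demo = grades.astype('category')

print(s_demo.dtype)                 # category
print(list(s_demo.cat.categories))  # the distinct labels, sorted lexically
```

Only the labels that actually occur become categories, which is why this toy Series has three categories while the full Grade column has four.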

1. The cat object

s.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x000001F06C2F7880>

# The categories themselves, stored as an Index
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

# Whether the categories are ordered
s.cat.ordered

False

# Category codes: integers assigned according to the order of categories
s.cat.codes.head()

0    0
1    0
2    2
3    3
4    3
dtype: int8
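
To make the relationship between codes and categories concrete, here is a small sketch that rebuilds the labels from the integer codes. The sample Series and the names s_demo, cats, codes, and labels are illustrative assumptions. It also includes one missing value, which pandas encodes as -1.

```python
import pandas as pd

# Hypothetical sample with one missing value
s_demo = pd.Series(['Freshman', 'Senior', None, 'Sophomore']).astype('category')

cats = list(s_demo.cat.categories)   # ['Freshman', 'Senior', 'Sophomore']
codes = s_demo.cat.codes.tolist()    # position of each value in cats, -1 for missing

# Rebuild the labels by looking each code up in the categories
labels = [cats[c] if c != -1 else None for c in codes]

print(codes)
print(labels)
```

Storing small integers plus one copy of each label is what makes the category dtype compact for columns with few distinct values.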

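2. Adding, removing, and modifying categories

(1) An Index cannot be modified with index_obj[0] = item. Because categories are stored in an Index, they cannot be changed that way either; use the cat methods described next instead.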
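
A short sketch of that restriction: item assignment on an Index raises a TypeError. The Index contents and the flag name mutated are made up for illustration.

```python
import pandas as pd

idx = pd.Index(['Freshman', 'Junior'])

try:
    idx[0] = 'Graduate'   # Index objects are immutable
    mutated = True
except TypeError as err:
    mutated = False
    print(type(err).__name__)   # TypeError

print(list(idx))   # unchanged
```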

1. Adding categories: add_categories

s = s.cat.add_categories('Graduate')
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

2. Removing categories: remove_categories

s = s.cat.remove_categories('Freshman')
s.cat.categories

Index(['Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

s.head()

0          NaN
1          NaN
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Junior', 'Senior', 'Sophomore', 'Graduate']

3. Setting new categories for a Series: set_categories

Any element whose current category is not among the new categories is set to missing (NaN).

s = s.cat.set_categories(['Sophomore','PhD'])
s.cat.categories

Index(['Sophomore', 'PhD'], dtype='object')

s.head()

0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'PhD']

4. Removing categories that do not occur in the Series: remove_unused_categories

s = s.cat.remove_unused_categories()
s.cat.categories

Index(['Sophomore'], dtype='object')
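
The interplay between set_categories and remove_unused_categories can be sketched on a toy Series (sample values and names are illustrative assumptions). set_categories replaces the whole category list, and remove_unused_categories then drops any label that no value uses.

```python
import pandas as pd

s_demo = pd.Series(['Junior', 'Senior', 'Sophomore']).astype('category')

# Replace the category list: values outside it become NaN,
# and 'PhD' is allowed even though no value uses it yet
s_set = s_demo.cat.set_categories(['Sophomore', 'PhD'])

# Drop categories that no value refers to ('PhD' here)
s_trimmed = s_set.cat.remove_unused_categories()

print(list(s_set.cat.categories))
print(list(s_trimmed.cat.categories))
```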

5. Renaming categories: rename_categories

Renaming a category also changes the corresponding values in the Series.

s = s.cat.rename_categories({
   'Sophomore':'本科二年级学生'})
s.head()

0        NaN
1        NaN
2        NaN
3    本科二年级学生
4    本科二年级学生
Name: Grade, dtype: category
Categories (1, object): ['本科二年级学生']
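
A minimal sketch of why the values change as well: renaming only swaps the label stored in the categories, while the integer codes behind the values stay the same. The sample data and the label 'Second-year' are made up for illustration.

```python
import pandas as pd

s_demo = pd.Series(['Sophomore', 'Junior', 'Sophomore']).astype('category')

# A dict renames only the listed categories; others are kept as they are
s_renamed = s_demo.cat.rename_categories({'Sophomore': 'Second-year'})

print(s_renamed.tolist())
print(s_demo.cat.codes.tolist() == s_renamed.cat.codes.tolist())   # codes unchanged
```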

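II. Ordered categories

1. Establishing an order

(1) as_unordered converts an ordered categorical into an unordered one.

(2) reorder_categories sets the order of the categories. Its argument must be a list made up of exactly the current categories of the Series: no new category may be added and none may be omitted. To make the result ordered, also pass ordered=True.

Note: if you prefer not to pass ordered=True, first convert to an ordered categorical with s.cat.as_ordered(), then use reorder_categories to adjust the relative order.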

s = df.Grade.astype('category')
s = s.cat.reorder_categories(['Freshman','Sophomore','Junior','Senior'],ordered=True)
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

s.cat.as_unordered().head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']
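
The two routes to an ordered categorical described above can be compared directly. This is a sketch on made-up sample data; the names order, via_flag, via_as_ordered, and rejected are illustrative. It also checks that an incomplete category list is rejected.

```python
import pandas as pd

s_demo = pd.Series(['Freshman', 'Senior', 'Sophomore', 'Junior']).astype('category')
order = ['Freshman', 'Sophomore', 'Junior', 'Senior']

# Route 1: reorder and mark as ordered in one call
via_flag = s_demo.cat.reorder_categories(order, ordered=True)

# Route 2: mark as ordered first, then adjust the relative order
via_as_ordered = s_demo.cat.as_ordered().cat.reorder_categories(order)

print(via_flag.dtype == via_as_ordered.dtype)   # both routes give the same dtype
print(via_flag.min(), via_flag.max())           # min/max follow the category order

# Omitting an existing category is not allowed
try:
    s_demo.cat.reorder_categories(['Freshman', 'Sophomore'], ordered=True)
    rejected = False
except ValueError:
    rejected = True
```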

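2. Sorting and comparison

1. Sorting

To sort a categorical variable, convert the column to the category dtype and assign the desired order to its categories; sort_index and sort_values then sort by that order.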

df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman','Sophomore','Junior','Senior'],ordered=True)
df.sort_values('Grade').head() # sort by value

	Grade	Name	Gender	Height	Weight
0	Freshman	Gaopeng Yang	Female	158.9	46.0
105	Freshman	Qiang Shi	Female	164.5	52.0
96	Freshman	Changmei Feng	Female	163.8	56.0
88	Freshman	Xiaopeng Han	Female	164.1	53.0
81	Freshman	Yanli Zhang	Female	165.1	52.0

df.set_index('Grade').sort_index().head() # sort by index

	Name	Gender	Height	Weight
Grade
Freshman	Gaopeng Yang	Female	158.9	46.0
Freshman	Qiang Shi	Female	164.5	52.0
Freshman	Changmei Feng	Female	163.8	56.0
Freshman	Xiaopeng Han	Female	164.1	53.0
Freshman	Yanli Zhang	Female	165.1	52.0
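
The effect of the ordering is easier to see on a frame where every grade occurs once. The sketch below uses a small made-up DataFrame (the Height values and the name df_demo are hypothetical) and contrasts the lexical order of plain strings with the order defined by the categories.

```python
import pandas as pd

# Hypothetical four-row frame; values are made up for illustration
df_demo = pd.DataFrame({
    'Grade': ['Senior', 'Freshman', 'Sophomore', 'Junior'],
    'Height': [170.0, 158.9, 165.2, 162.4],
})

# Plain strings sort alphabetically
lexical = df_demo.sort_values('Grade')['Grade'].tolist()

# After assigning an order, sorting follows the category order instead
df_demo['Grade'] = df_demo['Grade'].astype('category').cat.reorder_categories(
    ['Freshman', 'Sophomore', 'Junior', 'Senior'], ordered=True)
by_value = df_demo.sort_values('Grade')['Grade'].tolist()
by_index = df_demo.set_index('Grade').sort_index().index.tolist()

print(lexical)
print(by_value)
```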

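2. Comparison

(1) Comparisons with == or !=: the other operand can be a scalar, or a Series (or list) of the same length.

(2) Comparisons with >, >=, <, <=: the other operand can be a scalar or another categorical Series with the same categories and order. Every element being compared must belong to the categories of the original Series, and a Series operand must have the same index as the original Series.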

# scalar
res1 = df.Grade == 'Sophomore'
res1.head()

0    False
1    False
2    False
3     True
4     True
Name: Grade, dtype: bool

# list of the same length
res2 = df.Grade == ['PhD']*df.shape[0]
res2.head()
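
The comparison rules can be sketched on a small ordered categorical Series. The sample data and variable names below are made up for illustration and do not reproduce the output of the original dataset.

```python
import pandas as pd

order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
dtype = pd.CategoricalDtype(order, ordered=True)
s_demo = pd.Series(['Freshman', 'Senior', 'Sophomore', 'Junior']).astype(dtype)

# Equality works against any scalar, even one that is not a category
eq_known = (s_demo == 'Sophomore').tolist()
eq_unknown = (s_demo == 'PhD').tolist()

# Ordering comparisons use the category order
ge_scalar = (s_demo >= 'Sophomore').tolist()

# Ordering comparison against a categorical Series with the same dtype and index
other = s_demo.iloc[::-1].reset_index(drop=True)
le_series = (s_demo <= other).tolist()

# An ordering comparison with a scalar outside the categories is rejected
try:
    s_demo < 'PhD'
    raised = False
except TypeError:
    raised = True
```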
