[Study Notes] "Pandas in Depth, Made Simple" (《深入浅出Pandas》), Chapter 12: Categorical Data in Pandas

Schanappi

Last modified 2022-10-31 11:20:08

First published 2022-09-25 18:50:33

Article link: https://blog.csdn.net/weixin_43894455/article/details/127040601

Contents

12.1 Categorical Data
12.1.1 Creating Categorical Data
12.2 Operations on Categories

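12.1 Categorical Data

12.1.1 Creating Categorical Data

A categorical Series has dtype category, and its repr also lists the categories themselves; in the first example below there are three categories of object type.

import pandas as pd

# Build a Series; dtype="category" sets the data type
s = pd.Series(["x", "y", "z", "x"], dtype="category")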

"""
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
"""

# Build a DataFrame in which every column is categorical
df = pd.DataFrame({'A': list('xyzz'), 'B':list('aabc')}, dtype="category")
"""
	A	B
0	x	a
1	y	a
2	z	b
3	z	c
"""

# Check the dtype of each column
df.dtypes
"""
A    category
B    category
dtype: object
""" 

# Inspect a single column
df.B  # a categorical Series, just like s
"""
0    a
1    a
2    b
3    c
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']
"""
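
To make the "category information" mentioned above more concrete, the following minimal sketch inspects the two pieces a categorical Series is built from: the list of distinct labels and an integer code per row. It only uses the `.cat.categories` and `.cat.codes` accessors on the same sample values as above.

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# The distinct labels, stored once
print(s.cat.categories.tolist())  # ['x', 'y', 'z']

# One integer code per row: the position of that row's label in categories
print(s.cat.codes.tolist())       # [0, 1, 2, 0]
```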

Binning operations such as pd.cut() automatically produce data of categorical dtype.

# Build a binned Series
pd.Series(pd.cut(range(1, 10, 2), [0, 4, 6, 10]))
"""
0     (0, 4]
1     (0, 4]
2     (4, 6]
3    (6, 10]
4    (6, 10]
dtype: category
Categories (3, interval[int64]): [(0, 4] < (4, 6] < (6, 10]]
"""
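
As a small variation on the binning example, the sketch below passes `labels` to `pd.cut()` so that the resulting categories are named strings instead of intervals. The label names (`low`, `mid`, `high`) and the variable names are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

values = pd.Series([1, 3, 5, 7, 9])

# Same bin edges as above, but each bin gets a (hypothetical) readable label
binned = pd.cut(values, bins=[0, 4, 6, 10], labels=["low", "mid", "high"])

print(binned.tolist())     # ['low', 'low', 'mid', 'high', 'high']
print(binned.dtype)        # category
print(binned.cat.ordered)  # True: bins from pd.cut are ordered by default
```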

12.1.2 pd.Categorical()

# Categorical data with an explicit list of categories
pd.Categorical(["x", "y", "z", "x"], categories=["y", "z", "x"])
"""
['x', 'y', 'z', 'x']
Categories (3, object): ['y', 'z', 'x']
"""

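Categorical data can take only a limited, fixed set of values. Categories can also be ordered, where the order is the order in which the categories are listed. Categorical values cannot take part in numeric operations such as addition and subtraction.

# Build a Series with an explicit category order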
pd.Series(pd.Categorical(["x", "y", "z", "x"], categories=["y", "z", "x"], ordered=True))
"""
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['y' < 'z' < 'x']
"""
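
One way to see what the category order is used for: on an ordered categorical, `min()` and `max()` follow the declared order rather than alphabetical order. This is a minimal sketch on the same values as above; the variable name is arbitrary.

```python
import pandas as pd

ordered_s = pd.Series(
    pd.Categorical(["x", "y", "z", "x"], categories=["y", "z", "x"], ordered=True)
)

# The declared order is y < z < x, so 'y' is the smallest and 'x' the largest
print(ordered_s.min())  # y
print(ordered_s.max())  # x
```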

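12.1.3 The CategoricalDtype Object

CategoricalDtype is the pandas type object that describes categorical data. It accepts the following parameters:

categories: a sequence of unique values containing no missing values;
ordered: a boolean that controls whether the categories are ordered; the default is False.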

from pandas.api.types import CategoricalDtype
CategoricalDtype(['a', 'b', 'c'])
"""
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
"""

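CategoricalDtype can be used anywhere pandas accepts a dtype, for example in pd.read_csv(), df.astype(), or the Series constructor. Categorical data is unordered by default, and the string 'category' can be used in place of a CategoricalDtype; that is, dtype='category' is equivalent to dtype=CategoricalDtype(), with the categories inferred from the data.

import pandas as pd
from pandas.api.types import CategoricalDtype
# Define a CategoricalDtype object
c = CategoricalDtype(['a', 'b', 'c'])

# Pass the CategoricalDtype object as the dtype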
pd.Series(list('abcabc'), dtype=c)
"""
0    a
1    b
2    c
3    a
4    b
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
"""
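
The following sketch shows the pd.read_csv() case mentioned above, reading from an in-memory string so that it runs without any file. The CSV content and the column names (`name`, `grade`) are made up for illustration. A value that is not among the declared categories is read as a missing value.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical CSV content; 'd' is deliberately not a declared category
csv_text = "name,grade\nann,b\nbob,a\ncat,d\n"

grade_type = CategoricalDtype(["a", "b", "c"], ordered=True)
df = pd.read_csv(io.StringIO(csv_text), dtype={"grade": grade_type})

print(df["grade"].cat.categories.tolist())  # ['a', 'b', 'c']
print(df["grade"].isna().tolist())          # [False, False, True]
```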

12.1.4 Type Conversion

The simplest way to convert data to categorical dtype is s.astype('category').

# Source data
df = pd.read_excel('https://www.gairuo.com/file/data/dataset/team.xlsx')
df.team.astype('category')  # convert the column to categorical
"""
0     E
1     C
2     A
3     C
4     D
     ..
95    C
96    C
97    C
98    E
99    E
Name: team, Length: 100, dtype: category
Categories (5, object): ['A', 'B', 'C', 'D', 'E']
"""

A CategoricalDtype object can also be used for the conversion:

# Define a CategoricalDtype object
c = CategoricalDtype(['A', 'B', 'C', 'D', 'E'])
# Apply the conversion
df.team.astype(c)
# Same result as above

Categorical data can also be converted to other types; for example, s.astype(str) converts it to text.
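
A minimal sketch of the round trip, using a few made-up team labels instead of the downloaded dataset so that it runs offline:

```python
import pandas as pd

# A few illustrative team labels (not the real dataset)
team = pd.Series(["E", "C", "A", "C"], dtype="category")

# Convert back to plain text; the values stay the same, the dtype changes
as_text = team.astype(str)

print(as_text.tolist())                               # ['E', 'C', 'A', 'C']
print(isinstance(as_text.dtype, pd.CategoricalDtype)) # False
```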

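12.2 Operations on Categories

12.2.1 Modifying Categories

(1) Assign to s.cat.categories. Note that this direct assignment is no longer supported in pandas 2.0 and later; on current versions, use s = s.cat.rename_categories(['x', 'y', 'z']) instead (see (2) below):

s = pd.Series(["a", "b", "c", "a"], dtype="category")
# Rename the categories by assigning to s.cat.categories (older pandas only)
s.cat.categories = ['x', 'y', 'z']
s
"""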
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
"""

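(2) Use rename_categories. This renames the categories and the values follow, but it returns a new Series and does not modify s itself:

# Rename the categories with s.cat.rename_categories
s.cat.rename_categories(['h', 'i', 'j'])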
"""
0    h
1    i
2    j
3    h
dtype: category
Categories (3, object): ['h', 'i', 'j']
"""

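A dictionary mapping old names to new names also works. The keys must be current categories (here x, y, z); keys that are not current categories are ignored:

# Rename the categories with a dictionary
s.cat.rename_categories({'x': 'h', 'y': 'i', 'z': 'j'})  # same result as above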
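
The sketch below illustrates two points about rename_categories on a fresh Series: the original is left untouched, and a dictionary may rename only some of the categories. It also passes a callable (`str.upper`), which rename_categories accepts in addition to lists and dictionaries. Variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Rename only 'a' and 'b'; 'c' is not in the mapping and keeps its name
partly = s.cat.rename_categories({"a": "x", "b": "y"})
print(partly.tolist())  # ['x', 'y', 'c', 'x']

# A callable is applied to every category name
upper = s.cat.rename_categories(str.upper)
print(upper.tolist())   # ['A', 'B', 'C', 'A']

# s itself is unchanged because a new Series is returned each time
print(s.tolist())       # ['a', 'b', 'c', 'a']
```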

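(3) Use set_categories. This replaces the set of categories without relabeling the values, so any value that is not among the new categories becomes NaN. At this point s holds x, y and z, none of which appear in the new categories, so every value becomes NaN. As before, s itself is not modified:

# Set the categories
s.cat.set_categories(['b', 'c', 'a'])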
"""
0    NaN
1    NaN
2    NaN
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'a']
"""

Note that the categories you pass must be unique and must not contain NaN; otherwise a ValueError is raised.
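
To separate the three behaviours of set_categories, here is a minimal sketch on a fresh Series: reordering the same labels keeps every value, dropping a label turns its rows into NaN, and duplicate labels raise ValueError. Variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Same labels in a new order: all values survive
kept = s.cat.set_categories(["b", "c", "a"])
print(kept.tolist())           # ['a', 'b', 'c', 'a']

# 'c' is no longer a category, so that row becomes NaN
dropped = s.cat.set_categories(["a", "b"])
print(dropped.isna().tolist()) # [False, False, True, False]

# Duplicate categories are rejected
try:
    s.cat.set_categories(["a", "a"])
    duplicate_error = None
except ValueError as err:
    duplicate_error = err
print(type(duplicate_error).__name__)  # ValueError
```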

12.2.2 Appending New Categories

Use the add_categories() method to append a new category to the existing ones:

s = s.cat.add_categories(['t'])
s.cat.categories
"""
Index(['x', 'y', 'z', 't'], dtype='object')
"""
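
A short sketch of what the appended category means in practice, starting from a fresh Series with the same x, y, z values: the new category exists but is unused (its count is 0) until a value is assigned to it.

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")
s = s.cat.add_categories(["t"])

# value_counts on categorical data also reports unused categories
counts = s.value_counts()
print(int(counts["t"]))  # 0

# Now that 't' is a known category, it can be assigned as a value
s.iloc[1] = "t"
print(s.tolist())        # ['x', 't', 'z', 'x']
```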

12.2.3 Removing Categories

Use the remove_categories() method to remove categories; values that belonged to a removed category are replaced with np.nan.

s = s.cat.remove_categories(['y'])
s
"""
0      x
1    NaN
2      z
3      x
dtype: category
Categories (3, object): ['x', 'z', 't']
"""

Use s.cat.remove_unused_categories() to drop categories that no value uses:

s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
s
"""
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
"""
s.cat.remove_unused_categories()
"""
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
"""
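
Unused categories typically appear after filtering, because filtering rows does not shrink the category list. The following minimal sketch shows that situation; variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filter out the rows equal to 'c'; the category 'c' is still registered
subset = s[s != "c"]
print(subset.cat.categories.tolist())   # ['a', 'b', 'c']

# Drop the categories that no remaining row uses
cleaned = subset.cat.remove_unused_categories()
print(cleaned.cat.categories.tolist())  # ['a', 'b']
```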

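12.2.4 Order

Newly created categorical data is unordered by default; pass ordered=True (for example to pd.Categorical or CategoricalDtype) to mark the categories as ordered: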

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Inspect the categories
s.cat.categories
# Index(['a', 'b', 'c'], dtype='object')

# Are the categories ordered?
s.cat.ordered
# False

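The categories can also be listed in a specific sequence. Listing them this way does not by itself make them ordered: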

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

s.cat.categories
# Index(['c', 'b', 'a'], dtype='object')

s.cat.ordered
# False

Use as_ordered() to mark categorical data as ordered, or as_unordered() to mark it as unordered. Both return a new object:

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Mark as ordered
s.cat.as_ordered()
"""
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
"""

# Mark as unordered
s.cat.as_unordered()
"""
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
"""
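
The practical difference between the two states shows up in comparisons. In this minimal sketch, a greater-than comparison works on the ordered version and raises TypeError on the unordered one; variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
ordered = s.cat.as_ordered()

# With order a < b < c, 'b' and 'c' are greater than 'a'
print((ordered > "a").tolist())  # [False, True, True, False]

# Unordered categoricals only support equality comparisons
try:
    s > "a"
    unordered_error = None
except TypeError as err:
    unordered_error = err
print(type(unordered_error).__name__)  # TypeError
```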

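Use reorder_categories() to change the order of the categories; pass ordered=True so that the result is marked as ordered:

# Reorder the categories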
s.cat.reorder_categories(['b', 'a', 'c'], ordered=True)
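
As a sketch of the effect, sorting the reordered Series follows the new category order. The example also shows that reorder_categories expects exactly the existing categories; passing a different set raises ValueError. Variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)

# Sorting now follows b < a < c instead of alphabetical order
print(reordered.sort_values().tolist())  # ['b', 'a', 'a', 'c']

# The new list must contain the same categories as the old one
try:
    s.cat.reorder_categories(["b", "a"])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
print(type(mismatch_error).__name__)     # ValueError
```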