pd.Categorical - CSDN Blog

Article link: https://blog.csdn.net/ouyang20110913/article/details/109708784

pd.Categorical

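pd.Categorical encodes a list as a collection of unique categories, similar to a set, plus one integer code per element. It is commonly used on string label columns in pandas to find out which labels a dataset contains,
and is often combined with pd.Series.cat.codes to convert string labels to numbers.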

import pandas as pd
import numpy as np

pd.Categorical([1, 2, 3, 1, 2, 3])

[1, 2, 3, 1, 2, 3]
Categories (3, int64): [1, 2, 3]

cats = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
cats

['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

cats.categories is an Index:

cats.categories

Index(['a', 'b', 'c'], dtype='object')

The line Categories (3, object): ['a', 'b', 'c'] shows that the default category order is a, b, c (the sorted unique values). This can be verified by sorting:

cats.sort_values()

['a', 'a', 'b', 'b', 'c', 'c']
Categories (3, object): ['a', 'b', 'c']

# Error: 'Categorical' object has no attribute 'cat' (the .cat accessor exists only on a Series)
# cats.cat.codes
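
As a small sketch of the point above: a Categorical exposes its integer codes directly through its `codes` attribute, while the `.cat` accessor only becomes available once the data is wrapped in a Series. The variable name `series` is introduced here purely for illustration.

```python
import pandas as pd

cats = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])

# A Categorical carries its integer codes itself; no .cat accessor is needed.
codes = cats.codes
print(codes.tolist())  # [0, 1, 2, 0, 1, 2]

# The .cat accessor exists only on a Series with dtype category.
series = pd.Series(cats)  # hypothetical name, for illustration
print(series.cat.codes.tolist())  # [0, 1, 2, 0, 1, 2]
```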

# cats is unordered, so its values cannot be compared; this raises an error
# cats.min()
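
The failure mentioned above can be observed safely with a try/except. This is only a sketch: it assumes the error raised for an unordered Categorical is a TypeError, and it uses `as_ordered()` as one possible way to get a comparable copy whose order is the existing category order.

```python
import pandas as pd

cats = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])

# min() is refused on an unordered Categorical.
try:
    cats.min()
    raised = False
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

# as_ordered() returns an ordered copy that keeps the category order a, b, c.
ordered_min = cats.as_ordered().min()
print(ordered_min)  # a
```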

# To make a Categorical comparable, set ordered=True and pass the categories from smallest to largest
cats2 = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
                       categories=['c', 'b', 'a'])
cats2

['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['c' < 'b' < 'a']

cats2.min()

'c'
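
Once the order is declared, other comparisons work as well. The sketch below rebuilds `cats2` and shows `max()` plus an elementwise comparison against a scalar that is one of the categories. The names `smallest`, `largest` and `greater_than_b` are illustrative only.

```python
import pandas as pd

cats2 = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
                       categories=['c', 'b', 'a'])

smallest = cats2.min()  # first category in the declared order
largest = cats2.max()   # last category in the declared order
print(smallest, largest)  # c a

# Elementwise comparison follows the declared order c < b < a, not the alphabet.
greater_than_b = (cats2 > 'b').tolist()
print(greater_than_b)  # [True, False, False, True, False, False]
```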

The line Categories (3, object): ['c' < 'b' < 'a'] shows that the category order is c, b, a. This can be verified by sorting:

cats2.sort_values()

['c', 'c', 'b', 'b', 'a', 'a']
Categories (3, object): ['c' < 'b' < 'a']

A Categorical can be converted to a Series as shown below. Note that the resulting Series has dtype category:

series2 = pd.Series(cats2)
series2

0    a
1    b
2    c
3    a
4    b
5    c
dtype: category
Categories (3, object): ['c' < 'b' < 'a']

series2.cat.categories

Index(['c', 'b', 'a'], dtype='object')

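To convert a category Series to numbers, use .cat.codes as shown below. Note the dtype change: each code is the position of the value in series2.cat.categories, so c is 0, b is 1, and a is 2.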

codes = series2.cat.codes
print(type(codes))
codes

<class 'pandas.core.series.Series'>

0    2
1    1
2    0
3    2
4    1
5    0
dtype: int8
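
To make the code-to-label correspondence explicit, the following sketch builds a lookup table from `series2.cat.categories` and uses it to decode the codes again. It also shows that a value outside the declared categories becomes missing and receives the code -1. The names `code_to_label`, `decoded` and `with_unknown`, and the value 'x', are made up for illustration.

```python
import pandas as pd

series2 = pd.Series(pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
                                   categories=['c', 'b', 'a']))

# The position of a label in .cat.categories is its code.
code_to_label = dict(enumerate(series2.cat.categories))
print(code_to_label)  # {0: 'c', 1: 'b', 2: 'a'}

# Decoding the codes recovers the original labels.
decoded = [code_to_label[c] for c in series2.cat.codes.tolist()]
print(decoded)  # ['a', 'b', 'c', 'a', 'b', 'c']

# A value that is not among the categories ('x' here) is stored as missing, with code -1.
with_unknown = pd.Categorical(['a', 'x'], categories=['c', 'b', 'a'])
print(with_unknown.codes.tolist())  # [2, -1]
```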

Categorical Series

# Create a categorical Series directly
series_cat = pd.Series(['B', 'D', 'C', 'A'], dtype='category')
# Display the Series
series_cat

0    B
1    D
2    C
3    A
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']

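series_cat has dtype category, but no order was declared. Sorting the Series then follows the order of its categories, which for inferred categories is lexical order: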

series_cat.sort_values()

3    A
0    B
2    C
1    D
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
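
If a different sort order is wanted, one option is to cast the Series to an explicit `pd.CategoricalDtype`. This is a sketch under the assumption that the desired order is D, C, B, A; the names `custom_dtype` and `reordered` are illustrative.

```python
import pandas as pd

series_cat = pd.Series(['B', 'D', 'C', 'A'], dtype='category')

# Declare the categories in the desired order (assumed here: D < C < B < A).
custom_dtype = pd.CategoricalDtype(categories=['D', 'C', 'B', 'A'], ordered=True)
reordered = series_cat.astype(custom_dtype)

# Sorting now follows the declared order instead of lexical order.
sorted_values = reordered.sort_values().tolist()
print(sorted_values)  # ['D', 'C', 'B', 'A']

# The codes follow the new category positions as well.
print(reordered.cat.codes.tolist())  # [2, 0, 1, 3]
```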

To convert series_cat to numbers, each value is encoded by its position in Categories (4, object): ['A', 'B', 'C', 'D'], so A maps to 0, B maps to 1, and so on. The result is:

series_cat.cat.codes

0    1
1    3
2    2
3    0
dtype: int8

Converting a DataFrame column to category

df = pd.DataFrame(np.random.randint(0, 5, size=[8, 2]), columns=list('AB'))
# Convert the column to strings, then append 'a' to each string
df['A'] = df['A'].astype(str) + 'a'
df

	A	B
0	0a	2
1	0a	4
2	3a	3
3	2a	0
4	4a	1
5	3a	0
6	4a	2
7	4a	1

# df.A and df['A'] are equivalent
print(type(df.A))
print(type(df['A']))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Method 1:

col_1 = pd.Categorical(df.A)
pd.Series(col_1)
# df.A is a Series. pd.Categorical(df.A) can be assigned directly to df.A,
# and pandas converts it to a Series automatically. It is not assigned here because Method 2 is tried next.

0    0a
1    0a
2    3a
3    2a
4    4a
5    3a
6    4a
7    4a
dtype: category
Categories (4, object): ['0a', '2a', '3a', '4a']
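
The direct assignment mentioned in the comment of Method 1 could look like the sketch below. It uses a small fixed DataFrame instead of the random one above so that the result is reproducible; the data values are made up for illustration.

```python
import pandas as pd

# Small fixed example data (hypothetical) instead of random integers.
df = pd.DataFrame({'A': ['0a', '0a', '3a', '2a'], 'B': [2, 4, 3, 0]})

# Assigning a Categorical to a column stores it as a category column.
df['A'] = pd.Categorical(df['A'])

print(df['A'].dtype)  # category
print(df['A'].cat.categories.tolist())  # ['0a', '2a', '3a']
```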

Method 2:

col_2 = df.A.astype('category')
col_2

0    0a
1    0a
2    3a
3    2a
4    4a
5    3a
6    4a
7    4a
Name: A, dtype: category
Categories (4, object): ['0a', '2a', '3a', '4a']

Converting the column type in place

df.A = df.A.astype('category')
df.A

0    0a
1    0a
2    3a
3    2a
4    4a
5    3a
6    4a
7    4a
Name: A, dtype: category
Categories (4, object): ['0a', '2a', '3a', '4a']

# Convert the string column to numbers (cast to category first, then take the codes)
df.A = df.A.astype('category').cat.codes
df.A

0    0
1    0
2    2
3    1
4    3
5    2
6    3
7    3
Name: A, dtype: int8

df

	A	B
0	0	2
1	0	4
2	2	3
3	1	0
4	3	1
5	2	0
6	3	2
7	3	1
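
Overwriting the column with its codes discards the original labels. If they are needed later, one possible approach is to keep the categories before overwriting, as in the sketch below. The example data and the names `as_cat`, `labels` and `decoded` are assumptions made for illustration.

```python
import pandas as pd

# Small fixed example data (hypothetical) so the result is reproducible.
df = pd.DataFrame({'A': ['0a', '0a', '3a', '2a', '4a'], 'B': [2, 4, 3, 0, 1]})

as_cat = df['A'].astype('category')
labels = as_cat.cat.categories  # keep the code -> label table
df['A'] = as_cat.cat.codes      # the column now holds integer codes

print(df['A'].tolist())  # [0, 0, 2, 1, 3]

# Index the saved categories with the codes to recover the labels.
decoded = labels[df['A'].to_numpy()].tolist()
print(decoded)  # ['0a', '0a', '3a', '2a', '4a']
```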