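1. Learning objectives
1. Learn how categorical types are created and what properties they have
2. Learn how to sort and compare categorical types
2. Preparation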
import pandas as pd
import numpy as np
df = pd.read_csv('data/table.csv')
df.head()
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
3 S_1 C_1 1104 F street_2 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
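3. Creating categorical variables and their properties
3.1 Creation
A categorical variable can be created in several ways: from a Series, from a specified column of a DataFrame, with the built-in Categorical type, or with the cut() function.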
pd.Series(["a", "b", "c", "a"], dtype = "category")
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
temp_df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], \
dtype = "category"), 'B': list('abcd')})
temp_df.dtypes
A category
B object
dtype: object
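Creating a categorical from an existing DataFrame column can also be done by converting the column in place with astype. The following is a minimal sketch; demo_df and its contents are made up for illustration and are not part of the original data set.

```python
import pandas as pd

# Hypothetical toy table; the column names mirror the tutorial's data but the rows are invented.
demo_df = pd.DataFrame({'Gender': ['M', 'F', 'M'], 'Height': [173, 192, 186]})

# Convert an existing object column to the category dtype.
demo_df['Gender'] = demo_df['Gender'].astype('category')

print(demo_df.dtypes)
print(list(demo_df['Gender'].cat.categories))  # ['F', 'M']
```

Categories inferred this way are the sorted unique values of the column, and the result is unordered.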
cat = pd.Categorical(["a", "b", "c", "a"], categories = ['a', 'b', 'c'])
pd.Series(cat)
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
# By default the intervals themselves are used as labels, but custom string labels can also be specified
pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60])
[(0, 10], (30, 60], (30, 60], (30, 60], (30, 60]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]
pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60], \
right = False, labels = ['0-10', '10-30', '30-60'])
[10-30, 0-10, 30-60, 0-10, 30-60]
Categories (3, object): [0-10 < 10-30 < 30-60]
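Because the two cut() calls above use random input, their output changes from run to run. The sketch below uses fixed, made-up input values to show how the right parameter decides which bin a boundary value such as 10 falls into.

```python
import pandas as pd

values = [5, 10, 45]        # illustrative input, chosen to include a bin edge
bins = [0, 10, 30, 60]

# Default (right=True): intervals are right-closed, so 10 falls into (0, 10].
right_closed = pd.cut(values, bins)

# right=False: intervals are left-closed, so 10 falls into [10, 30).
left_closed = pd.cut(values, bins, right=False,
                     labels=['0-10', '10-30', '30-60'])

print(right_closed.codes)    # [0 0 2]
print(list(left_closed))     # ['0-10', '10-30', '30-60']
print(left_closed.ordered)   # True
```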
3.2 性质
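A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered). Categorical variables created with the cut function are ordered by default.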
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.describe()
count 4
unique 3
top a
freq 2
dtype: object
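The three parts described above can be inspected directly through the cat accessor. This is a small sketch using the same data as the example above; s_demo is just an illustrative name.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                                  categories=['a', 'b', 'c', 'd']))

# Each value is stored as an integer code pointing into the categories;
# missing values get the code -1.
print(s_demo.cat.codes.tolist())     # [0, 1, 2, 0, -1]
print(list(s_demo.cat.categories))   # ['a', 'b', 'c', 'd']
print(s_demo.cat.ordered)            # False
```

Note that 'd' is a valid category even though no element takes that value, which is why describe() reports 3 unique values while there are 4 categories.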
3.2.1 Viewing the categories and whether they are ordered
print(s.cat.categories)
print(s.cat.ordered)
Index(['a', 'b', 'c', 'd'], dtype='object')
False
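3.2.2 Modifying categories
# Modify with set_categories. This replaces the set of categories without renaming any values; values absent from the new categories become NaN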
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.set_categories(['new_a', 'c'])
0 NaN
1 NaN
2 c
3 NaN
4 NaN
dtype: category
Categories (2, object): [new_a, c]
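The output above can be surprising at first: set_categories does not rename anything, so every value that is not in the new category list turns into a missing value. A minimal sketch of that behaviour, using illustrative variable names:

```python
import pandas as pd

s_demo = pd.Series(pd.Categorical(["a", "b", "c", "a"],
                                  categories=['a', 'b', 'c', 'd']))

# 'a' and 'b' are not among the new categories, so they become NaN.
changed = s_demo.cat.set_categories(['new_a', 'c'])
print(changed.isna().tolist())            # [True, True, False, True]

# The call returns a new Series; the original is left untouched.
print(list(s_demo.cat.categories))        # ['a', 'b', 'c', 'd']
```

When the goal is to relabel existing values, rename_categories is the more suitable method.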
# Modify with rename_categories. Note that this method renames the categories and the values together
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.rename_categories(['new_%s' % i for i in s.cat.categories])
0 new_a
1 new_b
2 new_c
3 new_a
4 NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]
# Rename selected categories with a dict
s.cat.rename_categories({'a': 'new_a', 'b': 'new_b'})
0 new_a
1 new_b
2 c
3 new_a
4 NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]
3.2.3 Adding categories
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.add_categories(['e'])
0 a
1 b
2 c
3 a
4 NaN
dtype: category
Categories (5, object): [a, b, c, d, e]
3.2.4 Removing categories
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.remove_categories(['d'])
0 a
1 b
2 c
3 a
4 NaN
dtype: category
Categories (3, object): [a, b, c]
# Remove the categories that do not appear among the element values
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.remove_unused_categories()
0 a
1 b
2 c
3 a
4 NaN
dtype: category
Categories (3, object): [a, b, c]
4. Sorting categorical variables
4.1 Establishing and removing an order
4.1.1 Establishing an order
s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a < c < d]
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.set_categories(['a', 'c', 'd'], ordered = True)
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a < c < d]
# reorder_categories requires the new categories to be exactly the same set as the original categories
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.reorder_categories(['a', 'c', 'd'],ordered = True)
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a < c < d]
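The same-set requirement of reorder_categories can be checked with a short sketch. The variable names are illustrative; the second call deliberately omits a category to trigger the error.

```python
import pandas as pd

s_demo = pd.Series(["a", "d", "c", "a"]).astype('category')

# Same set of categories in a different order: allowed.
reordered = s_demo.cat.reorder_categories(['d', 'c', 'a'], ordered=True)
print(list(reordered.cat.categories))   # ['d', 'c', 'a']

# A list that drops 'd' is not the same set, so pandas raises ValueError.
try:
    s_demo.cat.reorder_categories(['a', 'c'], ordered=True)
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)                     # True
```

If categories need to be added or dropped while setting the order, set_categories is the method that permits it.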
4.1.2 Removing an order
s.cat.as_unordered()
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a, c, d]
4.2 Sorting
s = pd.Series(np.random.choice(['perfect', 'good', 'fair', 'bad', 'awful'], 50)).astype('category')
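# set_categories returns a new Series; s itself stays unordered because the result is not assigned back
s.cat.set_categories(['perfect', 'good', 'fair', 'bad', 'awful'][::-1], ordered = True).head()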
0 good
1 perfect
2 fair
3 good
4 fair
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]
s.sort_values(ascending = False).head()
37 perfect
9 perfect
19 perfect
18 perfect
17 perfect
dtype: category
Categories (5, object): [awful, bad, fair, good, perfect]
df_sort = pd.DataFrame({'cat': s.values, 'value': np.random.randn(50)})
df_sort.set_index('cat').head()
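# set_index returns a new DataFrame, so chain sort_index to sort by the categorical index
df_sort.set_index('cat').sort_index().head()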
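Since the example above relies on random data, here is a small deterministic sketch of how sorting follows the category order instead of alphabetical order. The 'low' / 'medium' / 'high' levels and the variable names are hypothetical, chosen because their intended order differs from their alphabetical order.

```python
import pandas as pd

# Hypothetical ordered levels: low < medium < high.
levels = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
s_demo = pd.Series(['medium', 'high', 'low'], dtype=levels)

# sort_values follows the category order, not alphabetical order.
print(s_demo.sort_values().tolist())     # ['low', 'medium', 'high']

# min and max are defined for ordered categoricals.
print(s_demo.min(), s_demo.max())        # low high

# The same order applies when the categorical is used as an index.
demo_df = pd.DataFrame({'cat': s_demo, 'value': [1, 2, 3]})
by_index = demo_df.set_index('cat').sort_index()
print(by_index['value'].tolist())        # [3, 1, 2]
```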
5. Comparing categorical variables
5.1 Comparison with a scalar or an equal-length sequence
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'
0 True
1 False
2 False
3 True
dtype: bool
s == list('abcd')
0 True
1 False
2 True
3 False
dtype: bool
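5.2 Comparison with another categorical variable
5.2.1 Equality comparison
An equality comparison between two categorical variables requires their categories to be identical.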
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == s
0 True
1 True
2 True
3 True
dtype: bool
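To illustrate the requirement on categories, the sketch below compares two Series whose categories differ and then aligns them first. The names left and right are illustrative only.

```python
import pandas as pd

left = pd.Series(["a", "d", "c", "a"]).astype('category')    # categories: a, c, d
right = pd.Series(["a", "d", "c", "b"]).astype('category')   # categories: a, b, c, d

# Different category sets: the comparison raises TypeError.
try:
    left == right
    error_raised = False
except TypeError:
    error_raised = True
print(error_raised)                        # True

# Giving both sides the same categories makes the comparison valid.
left_aligned = left.cat.set_categories(right.cat.categories)
print((left_aligned == right).tolist())    # [True, True, True, False]
```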
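5.2.2 Inequality comparison
An inequality comparison between two categorical variables requires two conditions: identical categories and identical ordering.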
s = pd.Series(["a", "d", "c", "a"]).astype('category')
# s >= s  # raises TypeError because s is unordered
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s = s.cat.reorder_categories(['a', 'c', 'd'], ordered = True)
s >= s
0 True
1 True
2 True
3 True
dtype: bool
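The ordering condition can be seen in a short sketch: two ordered Series with the same categories but opposite orderings cannot be compared with an inequality, while a comparison against a scalar uses the category order. The names s1 and s2 are illustrative.

```python
import pandas as pd

base = pd.Series(["a", "d", "c", "a"]).astype('category')
s1 = base.cat.reorder_categories(['a', 'c', 'd'], ordered=True)   # a < c < d
s2 = base.cat.reorder_categories(['d', 'c', 'a'], ordered=True)   # d < c < a

# Against a scalar that is one of the categories, the category order is used.
print((s1 > 'a').tolist())     # [False, True, True, False]

# Same categories but a different ordering: TypeError.
try:
    s1 >= s2
    error_raised = False
except TypeError:
    error_raised = True
print(error_raised)            # True
```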