Study link: Chapter 8, Categorical Data.
8.1 Creating categoricals and their properties
- Creating categorical variables
(a) Create from a Series:
    pd.Series(["a", "b", "c", "a"], dtype="category")
(b) Create by specifying the dtype of a DataFrame column:
    temp_df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], dtype="category"), 'B': list('abcd')})
(c) Create with the built-in Categorical type:
Input:
    cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
    pd.Series(cat)
Output:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a, b, c]
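The creation routes above can be checked side by side. The following is a minimal sketch, not part of the original notes: the variable names are illustrative, and `astype("category")` is included as a further route because it is used later in these notes. It also shows what happens when a value is missing from an explicit `categories` list.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Three routes to a categorical Series
s1 = pd.Series(values, dtype="category")
s2 = pd.Series(values).astype("category")
s3 = pd.Series(pd.Categorical(values, categories=["a", "b", "c"]))

for s in (s1, s2, s3):
    print(s.dtype, list(s.cat.categories))

# A value that is absent from the explicit categories list becomes NaN
s4 = pd.Series(pd.Categorical(values, categories=["a", "b"]))
print(s4.isna().tolist())  # [False, False, True, False]
```

When the categories are inferred rather than given, pandas sorts the distinct values, which is why all three series end up with the same categories here.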
(d) Create with the pd.cut() function:
Input:
    pd.cut(pd.Series([1,35,56]), [0,10,30,60], right=False, labels=['0-10','10-30','30-60'])
Output:
    0     0-10
    1    30-60
    2    30-60
    dtype: category
    Categories (3, object): [0-10 < 10-30 < 30-60]
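First argument: the numeric values to be binned
Second argument: the bin edges
right parameter: if True, each bin is open on the left and closed on the right; if False, each bin is closed on the left and open on the right
labels parameter: the labels assigned to the bins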
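The effect of the right parameter is easiest to see on a value that sits exactly on a bin edge. A small sketch with made-up values (the series `ages` is hypothetical, not from the original notes):

```python
import pandas as pd

ages = pd.Series([1, 10, 35, 56])  # 10 lies exactly on a bin edge
bins = [0, 10, 30, 60]
labels = ["0-10", "10-30", "30-60"]

# right=False: bins are [0, 10), [10, 30), [30, 60)
left_closed = pd.cut(ages, bins, right=False, labels=labels)
# right=True (the default): bins are (0, 10], (10, 30], (30, 60]
right_closed = pd.cut(ages, bins, right=True, labels=labels)

print(left_closed.tolist())    # ['0-10', '10-30', '30-60', '30-60']
print(right_closed.tolist())   # ['0-10', '0-10', '30-60', '30-60']
print(left_closed.cat.ordered) # True
```

Only the edge value 10 changes bins between the two calls.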
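[Note]: Categorical variables created with cut are ordered by default.
- Structure of a categorical variable
A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered).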
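(a) Inspect with describe()
This method summarizes a categorical series: the number of non-missing values, the number of distinct element values (not the number of declared categories), and the most frequent element with its frequency.
Input:
    s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a','b','c','d']))
    s.describe()
Output:
    count     4
    unique    3
    top       a
    freq      2
    dtype: object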
(b) cat.categories: view the categories
Input:
    s.cat.categories
Output:
    Index(['a', 'b', 'c', 'd'], dtype='object')
(c) cat.ordered: check whether the variable is ordered
Input:
    s.cat.ordered
Output:
    False
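To make the three parts concrete, the sketch below inspects the same series. It also uses the `cat.codes` accessor, which the original notes do not mention: it exposes the integer position of each value within the categories, with -1 marking a missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=["a", "b", "c", "d"]))

print(s.cat.categories.tolist())  # declared categories, including the unused 'd'
print(s.cat.codes.tolist())       # [0, 1, 2, 0, -1]
print(s.cat.ordered)              # False

# describe() reports distinct values that actually occur, not declared categories
n_values = s.nunique()
n_categories = len(s.cat.categories)
print(n_values, n_categories)     # 3 4
```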
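8.2 Categories of a categorical variable
- cat.set_categories(): replaces the set of categories. The existing values are not renamed; any value whose category is missing from the new list becomes NaN.
Input:
    s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a','b','c','d']))
    s.cat.set_categories(['new_a','c'])
Output:
    0    NaN
    1    NaN
    2      c
    3    NaN
    4    NaN
    dtype: category
    Categories (2, object): [new_a, c]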
- cat.rename_categories(): renames the categories, so the values and the categories change together
Input:
    s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a','b','c','d']))
    # rename every category directly
    s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])
Output:
    0    new_a
    1    new_b
    2    new_c
    3    new_a
    4      NaN
    dtype: category
    Categories (4, object): [new_a, new_b, new_c, new_d]
Input:
    # rename selected categories with a dict
    s.cat.rename_categories({'a':'new_a','b':'new_b'})
Output:
    0    new_a
    1    new_b
    2        c
    3    new_a
    4      NaN
    dtype: category
    Categories (4, object): [new_a, new_b, c, d]
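The difference between the two methods is easy to miss, so here is a short side-by-side sketch (variable names are illustrative). Both methods return a new Series and leave the original untouched.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# set_categories swaps the category list; values not in the new list become NaN
replaced = s.cat.set_categories(["new_a", "c"])
# rename_categories relabels a category, and the values follow the new label
renamed = s.cat.rename_categories({"a": "new_a"})

print(replaced.isna().tolist())  # [True, True, False, True]
print(renamed.tolist())          # ['new_a', 'b', 'c', 'new_a']
print(s.tolist())                # ['a', 'b', 'c', 'a']
```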
- cat.add_categories(): adds categories
    s.cat.add_categories(['e'])
- cat.remove_categories(): removes categories
    s.cat.remove_categories(['d'])
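The original calls do not show their results. As a hedged illustration, the sketch below prints what each call returns, and additionally removes a category that is in use ('a'), a case the notes do not cover: the affected values become NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=["a", "b", "c", "d"]))

added = s.cat.add_categories(["e"])
print(added.cat.categories.tolist())    # ['a', 'b', 'c', 'd', 'e']

# Removing an unused category leaves the values alone
trimmed = s.cat.remove_categories(["d"])
print(trimmed.cat.categories.tolist())  # ['a', 'b', 'c']

# Removing a category that is in use turns its values into NaN
removed = s.cat.remove_categories(["a"])
print(removed.isna().tolist())          # [True, False, False, True, True]
```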
- cat.remove_unused_categories(): removes categories that do not occur among the element values
Input:
    s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a','b','c','d']))
    s.cat.remove_unused_categories()
Output:
    0      a
    1      b
    2      c
    3      a
    4    NaN
    dtype: category
    Categories (3, object): [a, b, c]
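One practical reason to drop unused categories is that they still show up in summaries. The sketch below uses `value_counts()` to illustrate this; it is an added example, not something the original notes demonstrate.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=["a", "b", "c", "d"]))

# value_counts() on a categorical lists every declared category, even with count 0
counts_before = s.value_counts()
counts_after = s.cat.remove_unused_categories().value_counts()

print(counts_before["d"])         # 0
print("d" in counts_after.index)  # False
```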
- union_categoricals(): merges categoricals, combining both the values and the categories
Input:
    s1 = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd', 'e']))
    s2 = pd.Series(pd.Categorical(["a", "b", "c", "d", "e"], categories=['a', 'b', 'c', 'd', 'r']))
    from pandas.api.types import union_categoricals
    union_categoricals([s1, s2])
Output:
    [a, b, c, a, NaN, a, b, c, d, NaN]
    Categories (6, object): [a, b, c, d, e, r]
The last value of s2, "e", is not among the categories of s2, so it is already NaN before the merge.
Input:
    pd.concat([s1, s2])
Output:
    0      a
    1      b
    2      c
    3      a
    4    NaN
    0      a
    1      b
    2      c
    3      d
    4    NaN
    dtype: object
After merging with concat(), the dtype is converted to object.
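`union_categoricals` returns a Categorical rather than a Series, so it usually needs wrapping. A minimal sketch with two small made-up series; the exact non-categorical dtype produced by `concat` may depend on the pandas version, so the code only checks whether the result is still categorical.

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["b", "c"], dtype="category")

# Wrap the merged Categorical back into a Series
merged = pd.Series(union_categoricals([s1, s2]))
print(merged.dtype)                    # category
print(merged.cat.categories.tolist())  # ['a', 'b', 'c']

# concat keeps the category dtype only when the categories match
mixed = pd.concat([s1, s2])
same = pd.concat([s1, s1])
print(str(mixed.dtype) == "category")  # False
print(str(same.dtype) == "category")   # True
```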
8.3 Ordering categorical variables
- Establishing an order
(a) cat.as_ordered() converts a series to an ordered variable; cat.as_unordered() converts it to an unordered one
    s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()  # ordered
    s.cat.as_unordered()  # unordered
(b) Use the ordered parameter of set_categories
    pd.Series(["a", "d", "c", "a"]).astype('category').cat.set_categories(['a','c','d'], ordered=True)
(c) cat.reorder_categories(): the new categories must be exactly the same set as the original categories, with none added and none removed
Input:
    s = pd.Series(["a", "d", "c", "a"]).astype('category')
    s.cat.reorder_categories(['a','d','c'], ordered=True)
Output:
    0    a
    1    d
    2    c
    3    a
    dtype: category
    Categories (3, object): [a < d < c]
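Once an order exists, order-aware operations such as `min()` and `max()` follow it rather than alphabetical order. The sketch below also demonstrates the same-set constraint on `reorder_categories`; the `raised` flag is just an illustrative helper.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype("category")

# Custom order a < d < c
ordered = s.cat.reorder_categories(["a", "d", "c"], ordered=True)
print(ordered.min(), ordered.max())  # a c

# Dropping a category violates the same-set requirement
raised = False
try:
    s.cat.reorder_categories(["a", "d"])
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Note that `max()` returns 'c', not 'd', because 'c' is last in the custom order.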
- Sorting
The value sorting and index sorting introduced in Chapter 1 both apply.
    # Sort by value
    s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'], 50)).astype('category')
    # set_categories returns a new Series, so assign the result to keep the custom order
    s = s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1], ordered=True)
    s.head()
    s.sort_values(ascending=False).head()  # False: descending, True: ascending
    # Sort by index
    df_sort = pd.DataFrame({'cat': s.values, 'value': np.random.randn(50)}).set_index('cat')
    df_sort.head()
    df_sort.sort_index().head()
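In the example above the custom order happens to coincide with alphabetical order, so the effect of the categories on sorting is hard to see. The following sketch uses hypothetical labels ('low', 'medium', 'high') where the two orders differ.

```python
import pandas as pd

raw = pd.Series(["medium", "high", "low", "medium"])

# Declare the order low < medium < high
levels = ["low", "medium", "high"]
s = raw.astype("category").cat.set_categories(levels, ordered=True)

print(raw.sort_values().tolist())  # alphabetical: ['high', 'low', 'medium', 'medium']
print(s.sort_values().tolist())    # by category:  ['low', 'medium', 'medium', 'high']
```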
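8.4 Comparison operations on categorical variables
- Comparison with a scalar or an equal-length sequence
(a) Comparison with a scalar
Input:
    s = pd.Series(["a", "d", "c", "a"]).astype('category')
    s == 'a'
Output:
    0     True
    1    False
    2    False
    3     True
    dtype: bool
(b) Comparison with an equal-length sequence
Input:
    s == list('abcd')
Output:
    0     True
    1    False
    2     True
    3    False
    dtype: bool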
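The boolean result of such a comparison is typically used as a mask. A brief sketch; the comparison against 'z', a value that is not one of the categories, is an added case showing that equality simply yields False everywhere.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype("category")

# Use the comparison result as a filter
mask = s == "a"
print(s[mask].index.tolist())  # [0, 3]

# Equality against a value outside the categories is False for every element
any_z = bool((s == "z").any())
print(any_z)                   # False
```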
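- Comparison with another categorical variable
(a) Equality tests (== and !=)
Equality tests between two categorical variables require their categories to be exactly the same.
(b) Inequality tests (>=, <=, <, >)
Inequality tests between two categorical variables require two conditions: (1) the categories are exactly the same and (2) the ordering is exactly the same.
Input:
    s = pd.Series(["a", "d", "c", "a"]).astype('category')
    # s >= s  # raises an error because s is unordered
    s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.reorder_categories(['a','c','d'], ordered=True)
    s >= s
Output:
    0    True
    1    True
    2    True
    3    True
    dtype: bool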
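Both conditions can be exercised in a few lines. This is a sketch with made-up series x, y, z and u; the two flags only record that pandas raised a TypeError in the cases the rules above forbid.

```python
import pandas as pd

levels = ["a", "c", "d"]
x = pd.Series(["a", "d", "c"]).astype("category").cat.set_categories(levels, ordered=True)
y = pd.Series(["c", "c", "c"]).astype("category").cat.set_categories(levels, ordered=True)

# Same categories and same order: element-wise inequality works
print((x > y).tolist())  # [False, True, False]

# Different categories: even an equality test is rejected
z = pd.Series(["c", "c", "c"]).astype("category")  # categories are just ['c']
mismatch_raises = False
try:
    x == z
except TypeError:
    mismatch_raises = True

# Unordered categoricals: inequality is rejected
u = pd.Series(["a", "d", "c"]).astype("category")
unordered_raises = False
try:
    u >= u
except TypeError:
    unordered_raises = True

print(mismatch_raises, unordered_raises)  # True True
```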
Problems and exercises
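import numpy as np
import pandas as pd

# 1(a): bin the depth column ('深度') into seven ordered levels, then sort by level
df = pd.read_csv(r"D:\python\python3.6\pysl\Pre_\data\Earthquake.csv")
s = pd.cut(df["深度"], [-1e-10,5,10,15,20,30,50,np.inf], labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ'])
df['深度'] = s
df.sort_values(by='深度', ascending=True)
# 1(b): bin the intensity column ('烈度'), then sort on the two-level index
df['烈度'] = pd.cut(df['烈度'], [-1e-10,3,4,5,np.inf])
df.set_index(['深度', '烈度']).sort_index()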
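Since the CSV file is not available here, the binning step of exercise 1(a) can be tried on a few made-up depth values. The series `depth` below is hypothetical; the bin edges and labels are the ones used in the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical depth values standing in for the real column
depth = pd.Series([3, 12, 60, 0, 25])

levels = pd.cut(depth, [-1e-10, 5, 10, 15, 20, 30, 50, np.inf],
                labels=["Ⅰ", "Ⅱ", "Ⅲ", "Ⅳ", "Ⅴ", "Ⅵ", "Ⅶ"])

print(levels.tolist())                # ['Ⅰ', 'Ⅲ', 'Ⅶ', 'Ⅰ', 'Ⅴ']
print(levels.sort_values().tolist())  # ['Ⅰ', 'Ⅰ', 'Ⅲ', 'Ⅴ', 'Ⅶ']
```

The tiny negative left edge keeps a depth of exactly 0 inside the first bin, because bins are open on the left by default.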
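# 2
def my_crosstab(foo, bar):
    # foo and bar are expected to be pd.Categorical objects of equal length
    num = len(foo)
    # Row and column labels: all declared categories plus any observed values
    s1 = pd.Series([i for i in list(foo.categories.union(set(foo)))], name='1st var')
    s2 = [i for i in list(bar.categories.union(set(bar)))]
    # Start from an all-zero table, then count each (foo, bar) pair
    df = pd.DataFrame({i: [0]*len(s1) for i in s2}, index=s1)
    for i in range(num):
        df.at[foo[i], bar[i]] += 1
    return df.rename_axis('2nd var', axis=1)
my_crosstab(foo, bar)  # foo and bar must be defined beforehand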
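The counting idea behind the function can be tried on small inputs. The following is a compact variant rather than the author's function: it assumes both inputs are pd.Categorical objects without missing values, and the sample values of foo and bar are hypothetical.

```python
import pandas as pd

# Hypothetical sample inputs; 'c' and 'f' are declared but never observed
foo = pd.Categorical(["b", "a"], categories=["a", "b", "c"])
bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

# Zero-filled table with one row/column per declared category
table = pd.DataFrame(0, index=foo.categories, columns=bar.categories)

# Count each observed (row, column) pair
for row_label, col_label in zip(foo, bar):
    table.at[row_label, col_label] += 1

print(table)
```

Rows and columns for unused categories stay in the table with zero counts, which is the point of building the frame from the categories rather than from the observed values.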