P8

瘦与狗

Category: Notes. Tags: python

Article link: https://blog.csdn.net/weixin_51503843/article/details/111247880

Knowledge points…
1. Creating categorical variables (category)

① Create with Series by specifying dtype='category' (similar to creating a string Series)

import numpy as np
import pandas as pd

s = pd.Series([1, 'a', 'uuu', 4.3], dtype='category')
print(s)
'''
0      1
1      a
2    uuu
3    4.3
dtype: category
Categories (4, object): [1, 4.3, 'a', 'uuu']
'''

② Specify the dtype when building a DataFrame (cat is an accessor on Series, so the dtype has to be set on the columns of the DataFrame, each of which is a Series)

df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], dtype='category'), 'B': pd.Series(list('abcd'), dtype='category')})
print(df.dtypes)
'''
A    category
B    category
dtype: object
'''

③ Create with the built-in Categorical type

cat = pd.Categorical(["a", "b", "c", "a"])
print(pd.Series(cat))
'''
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
'''

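④ Create with the cut function (the resulting categories are ordered; in everyday data processing, the bin labels returned by pd.cut or pd.qcut are of category type by default)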

cat = pd.cut(np.random.randint(0, 60 ,5), [0, 10, 30, 60])
print(cat)
'''
[(30, 60], (0, 10], (30, 60], (0, 10], (0, 10]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]
'''

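2. Properties of categorical variables

A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered)

① The describe method

This method summarizes a categorical Series: the number of non-missing values, the number of distinct categories that actually occur among the values (not the number of declared categories), and the most frequent value together with its frequency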

s = pd.Series(pd.Categorical(["a", "b", "c", 'a',np.nan], categories=['a','b','c','d']))
print(s.describe())
'''
count     4
unique    3
top       a
freq      2
dtype: object
'''

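The unique count is the number of declared categories that actually appear among the values.
For example, with the categories set to a, b, c, d: if the values are a, a, b then unique = 2 (only a and b appear); if the values are a, b, c, d, b, c then unique = 4 (all four appear, and repeats are not counted twice)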
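The two cases just described can be checked directly. This is a minimal sketch; the variable names (cats, s_two, s_four) are made up for illustration.

```python
import pandas as pd

cats = ['a', 'b', 'c', 'd']  # declared categories (hypothetical example data)

# Only 'a' and 'b' occur among the values
s_two = pd.Series(pd.Categorical(['a', 'a', 'b'], categories=cats))
# All four categories occur, some more than once
s_four = pd.Series(pd.Categorical(['a', 'b', 'c', 'd', 'b', 'c'], categories=cats))

unique_two = s_two.describe()['unique']
unique_four = s_four.describe()['unique']
n_declared = len(s_two.cat.categories)  # declared categories, used or not

print(unique_two, unique_four, n_declared)
```

So unique follows the values, while cat.categories always reports what was declared.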

② The categories and ordered attributes show the categories and whether they are ordered

s = pd.Series(pd.Categorical(["a", "b", "c", 'a',np.nan], categories=['a','b','c','d']))
print(s.cat.categories)
'''
Index(['a', 'b', 'c', 'd'], dtype='object')
'''

print(s.cat.ordered)
'''
False
'''

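③ Modifying the categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))

1) With set_categories (this replaces the set of categories and returns a new object; the values themselves are not renamed, but any value whose category is no longer listed becomes NaN)

s.cat.set_categories(['new_a','c'])

2) With rename_categories (this changes the values and the categories together; the values follow the renamed categories)

s.cat.rename_categories(['new_%s' % i for i in s.cat.categories])

3) Renaming with a dict

s.cat.rename_categories({'a':'new_a','b':'new_b'})

4) Adding with add_categories

s.cat.add_categories(['e'])

5) Removing with remove_categories

s.cat.remove_categories(['d'])

6) Dropping categories that do not occur among the values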

s.cat.remove_unused_categories()
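None of the calls above print their result, so here is a small sketch that runs them on the same Series and inspects what comes back. The result variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'c', 'a', np.nan],
                             categories=['a', 'b', 'c', 'd']))

# set_categories: 'a' and 'b' are no longer categories, so they turn into NaN
set_result = s.cat.set_categories(['new_a', 'c'])

# rename_categories: the values follow the renamed categories
renamed = s.cat.rename_categories({'a': 'new_a', 'b': 'new_b'})

# add / remove / remove_unused only touch the list of categories here
added = s.cat.add_categories(['e'])
removed = s.cat.remove_categories(['d'])
trimmed = s.cat.remove_unused_categories()

print(set_result.tolist())
print(renamed.tolist())
print(list(added.cat.categories), list(removed.cat.categories), list(trimmed.cat.categories))
print(list(s.cat.categories))  # the original Series is left untouched
```

Each method returns a new object, which is why s still has its four original categories at the end.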

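3. Ordering categorical variables (this means the order of the categories, not of the values)

① Establishing an order

1) The as_ordered method turns a Series into an ordered variable; it returns a new object and leaves the original Series unchanged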

s = pd.Series([5, 7, 2, 6]).astype('category')
print(s.cat.as_ordered())
'''
0    5
1    7
2    2
3    6
dtype: category
Categories (4, int64): [2 < 5 < 6 < 7]
'''
print(s)
'''
0    5
1    7
2    2
3    6
dtype: category
Categories (4, int64): [2, 5, 6, 7]
'''

The as_unordered method turns an ordered variable back into an unordered one (honestly, I don't find it very useful)

2) With the ordered parameter of set_categories (returns a new object)

s = pd.Series(["a", "d", "c", "a"]).astype('category')
print(s.cat.set_categories(['a', 'd', 'c'], ordered=True))
'''
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'd' < 'c']
'''

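3) With the reorder_categories method

Note: with this method the new categories must be the same set as the original categories, that is, the same number and the same members; otherwise an error is raised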

s = pd.Series(["a", "d", "c", "a"]).astype('category')
print(s.cat.reorder_categories(['a', 'd', 'c'], ordered=True))
'''
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'd' < 'c']
'''
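To see the restriction in the note in action, here is a rough sketch that passes an incomplete list to reorder_categories. The flag and variable names are made up; set_categories is shown only as a contrast, since it accepts a different set of categories.

```python
import pandas as pd

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')

try:
    # 'c' is missing, so the new list is not the same set as the old one
    s.cat.reorder_categories(['a', 'd'], ordered=True)
    reorder_failed = False
except ValueError as err:
    reorder_failed = True
    print('reorder_categories raised:', err)

# set_categories accepts the shorter list, at the cost of turning 'c' into NaN
fallback = s.cat.set_categories(['a', 'd'], ordered=True)
print(fallback.tolist())
```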

② Sorting

Not much to say here, so let me just perform a copy-and-paste reverse ordering

s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'], 50)).astype('category')
print(s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1], ordered=True).head())
'''
0    perfect
1        bad
2       fair
3       fair
4    perfect
dtype: category
Categories (5, object): ['awful' < 'bad' < 'fair' < 'good' < 'perfect']
'''
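The snippet above only sets the order of the categories. As a small follow-up sketch (with a fixed, made-up list of five ratings instead of random data), sort_values on the result sorts the values by category order rather than alphabetically:

```python
import pandas as pd

# Assumed order for this sketch: best rating first
levels = ['perfect', 'good', 'fair', 'bad', 'awful']

s = pd.Series(['good', 'awful', 'perfect', 'bad', 'fair']).astype('category')
ordered = s.cat.set_categories(levels, ordered=True)

sorted_by_level = ordered.sort_values()  # follows the order given in levels
lowest, highest = ordered.min(), ordered.max()  # min/max need an ordered categorical

print(sorted_by_level.tolist())
print(lowest, highest)
```

Passing levels[::-1] instead would flip both the sort and the min/max.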

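4. Comparison operations on categorical variables

① Comparison with a scalar (each element of the Series is compared with the scalar in turn, and the result is a boolean Series)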

cat = pd.Categorical(["a", "b", "c", "a"])
s = pd.Series(cat)
print(s == 'b')
'''
0    False
1     True
2    False
3    False
dtype: bool
'''

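② Comparison with a sequence of equal length (each element of the Series is compared with the element at the same position in the sequence; the result is again a boolean Series)

cat = pd.Categorical(["a", "b", "c", "a"])
s = pd.Series(cat)
print(s == list('aaba'))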
'''
0     True
1    False
2    False
3     True
dtype: bool
'''

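③ Comparison with another categorical variable

1) Equality tests (== and !=)

An equality test between two categorical variables requires that their categories are exactly the same

2) Inequality tests (>=, <=, <, >)

An inequality test between two categorical variables requires two conditions: ① the categories are exactly the same ② the ordering is exactly the same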
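No code was given for this case, so here is a minimal sketch of both rules. The Series s1, s2, s3 are made-up examples; s3 has the same categories as s1 but in a different order.

```python
import pandas as pd

cats = ['a', 'b', 'c']
s1 = pd.Series(pd.Categorical(['a', 'b', 'c'], categories=cats, ordered=True))
s2 = pd.Series(pd.Categorical(['b', 'b', 'a'], categories=cats, ordered=True))

# Same categories and same ordering: both kinds of comparison work elementwise
equal_mask = (s1 == s2)
less_mask = (s1 < s2)
print(equal_mask.tolist(), less_mask.tolist())

# Same members, different ordering: the inequality is rejected
s3 = pd.Series(pd.Categorical(['a', 'b', 'c'], categories=['c', 'b', 'a'], ordered=True))
try:
    s1 < s3
    mismatch_raised = False
except TypeError as err:
    mismatch_raised = True
    print('comparison raised:', err)
```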

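Questions…

[Question 1] How is the union_categoricals method used? What does it do?

Well, nothing above introduced union_categoricals at all; if this question hadn't mentioned it, I wouldn't even have known it existed. From the official documentation:

Combine list-like of Categorical-like, unioning categories. All categories must have the same dtype. In other words, it concatenates a list of categorical objects and takes the union of their categories, which must all share the same dtype

union_categoricals(to_union, sort_categories=False, ignore_order=False)

Parameters:
to_union: the objects to combine; a list-like of Categorical, CategoricalIndex, or Series with dtype='category'

sort_categories: default False; if True, the resulting categories are sorted, otherwise they are kept in the order in which they appear in the data.

ignore_order: default False; if True, the ordered attribute of the categoricals is ignored and the result is an unordered categorical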

from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
print(union_categoricals([a, b]))
'''
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']
'''
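The two optional parameters are described above but not tried out, so here is a small sketch of both. The ordered inputs x and y are made-up examples, modelled on the kind of case the parameter descriptions cover.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['b', 'c'])
b = pd.Categorical(['a', 'b'])

# sort_categories=True sorts the resulting categories; the values keep their order
sorted_union = union_categoricals([a, b], sort_categories=True)
print(list(sorted_union), list(sorted_union.categories))

# Ordered categoricals with different categories cannot be combined directly
x = pd.Categorical(['a', 'b'], ordered=True)
y = pd.Categorical(['a', 'b', 'c'], ordered=True)
try:
    union_categoricals([x, y])
    strict_raised = False
except TypeError as err:
    strict_raised = True
    print('union_categoricals raised:', err)

# ignore_order=True drops the ordering and returns an unordered result
unordered_union = union_categoricals([x, y], ignore_order=True)
print(list(unordered_union), unordered_union.ordered)
```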

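[Question 2] When two Series are concatenated vertically with concat, is the result always a categorical variable? When is it not?

The result is a categorical variable only when both Series have the same categories (same number and same members); otherwise it is not, as the second example here shows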

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s1 = pd.Series(["a", "d", "c", "d"]).astype('category')
print(pd.concat([s, s1]))
'''
0    a
1    d
2    c
3    a
0    a
1    d
2    c
3    d
dtype: category
Categories (3, object): ['a', 'c', 'd']
'''

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s1 = pd.Series(["a", "d", "c", "b"]).astype('category')
print(pd.concat([s, s1]))
'''
0    a
1    d
2    c
3    a
0    a
1    d
2    c
3    b
dtype: object
'''
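If the category dtype should survive even when the categories differ, one possible workaround (an assumption on my part, not something the question asks for) is to combine the values with union_categoricals and wrap the result in a Series:

```python
import pandas as pd
from pandas.api.types import union_categoricals

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')
s1 = pd.Series(['a', 'd', 'c', 'b']).astype('category')

# Plain concat loses the category dtype because the categories differ
plain = pd.concat([s, s1])

# union_categoricals merges the categories first; reuse concat's index for the result
combined = pd.Series(union_categoricals([s, s1]), index=plain.index)

print(plain.dtype)
print(combined.dtype, list(combined.cat.categories))
```

The merged categories are listed in order of appearance unless sort_categories=True is passed.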

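[Question 3] When using the groupby method or the value_counts method, how do the statistics of a categorical variable differ from those of an ordinary variable?

When every category actually occurs in the data, as in the example here, there seems to be no difference: the categorical variable gives the same counts as the corresponding ordinary variable. A difference does appear when a declared category never occurs: value_counts on a categorical Series still lists that category with a count of 0, and groupby can likewise produce empty groups depending on its observed parameter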

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s1 = pd.Series(["a", "d", "c", "a"])
print(s.value_counts())
'''
a    2
d    1
c    1
dtype: int64
'''
print(s1.value_counts())
'''
a    2
d    1
c    1
dtype: int64
'''
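To make a difference visible, the sketch below declares one extra category 'b' that never occurs in the values. The variable names are made up, and observed=False is passed explicitly so the result does not depend on the default of the installed pandas version.

```python
import pandas as pd

values = ['a', 'd', 'c', 'a']
# 'b' is declared as a category but never appears among the values
cat_s = pd.Series(pd.Categorical(values, categories=['a', 'b', 'c', 'd']))
plain_s = pd.Series(values)

cat_counts = cat_s.value_counts()      # includes 'b' with a count of 0
plain_counts = plain_s.value_counts()  # only the values that occur

# Grouping by the categorical itself also keeps the empty group
grouped = cat_s.groupby(cat_s, observed=False).size()

print(cat_counts)
print(plain_counts)
print(grouped)
```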

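[Question 4] What "defect" of creating a categorical variable through Series does the exercise's code illustrate? How can it be avoided? (Hint: use the copy parameter of Series)

Defect: when the Series is modified, the original cat changes along with it
Fix: pass copy=True when creating the Series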

s = pd.Series(cat, name="cat", copy=True)
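The exercise's code is not reproduced here, so the following is only a sketch with made-up values for cat. It shows that with copy=True, modifying the Series leaves the original Categorical alone; what happens without copy=True may depend on the pandas version, so that case is not shown.

```python
import pandas as pd

# Hypothetical data for illustration
cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# copy=True gives the Series its own copy of the data
s = pd.Series(cat, name='cat', copy=True)
s.iloc[0:2] = 10  # modify the Series only

print(s.tolist())
print(list(cat))  # the original Categorical keeps its values
```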

[Exercise 1] Continuing with the earthquake dataset from Chapter 4, solve the following problems:

df = pd.read_csv(r'C:\Users\YANG\Desktop\joyful-pandas-master\data\Earthquake.csv')
print(df.head())
'''
     日期           时间     维度     经度        方向   距离   深度   烈度
0  2003.05.20  12:17:44 AM  39.04   40.38        west   0.1   10.0   0.0
1  2007.08.01  12:03:08 AM  40.79   30.09        west   0.1    5.2   4.0
2  1978.05.07  12:41:37 AM  38.58   27.61  south_west   0.1    0.0   0.0
3  1997.03.22  12:31:45 AM  39.47   36.44  south_west   0.1   10.0   0.0
4  2000.04.02  12:57:38 AM  40.80   30.24  south_west   0.1    7.0   0.0
'''

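(a) Split the depth (the 深度 column) into seven levels: [0,5,10,15,20,30,50,np.inf]. Use the depth levels Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ as the index and sort from shallow to deep.

df['深度'] = pd.cut(df['深度'], [-0.01, 5, 10, 15, 20, 30, 50, np.inf], labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ'])  # bin the depth as required; the lower edge is -0.01 so that a depth of 0 falls into level Ⅰ
df.set_index(['深度'], inplace=True)  # use the depth level as the index
df.sort_index(inplace=True)  # sort by the index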
print(df.head())
'''
            日期       时间     维度     经度    方向    距离   烈度
深度                                                              
Ⅰ     2009.09.09  12:54:13 AM  42.42  43.03  north_east  95.4  0.0
Ⅰ     1997.06.16  12:18:04 AM  37.92  29.17  north_east   3.2  0.0
Ⅰ     2011.10.25  12:29:45 AM  38.96  43.64  south_east   1.6  3.9
Ⅰ     1995.07.23  12:05:04 AM  37.61  29.29  north_east   3.2  0.0
Ⅰ     2013.06.10  12:39:19 AM  38.53  43.85  south_east   1.6  3.7
'''
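The lower bin edge in (a) is -0.01 rather than 0. A small sketch on made-up depths (not the earthquake data) shows why: pd.cut bins are open on the left by default, so a depth of exactly 0 would otherwise get no level. include_lowest=True is an alternative to shifting the edge.

```python
import numpy as np
import pandas as pd

depth = pd.Series([0.0, 5.2, 10.0, 7.0, 60.0])  # made-up depths for illustration
bins = [0, 5, 10, 15, 20, 30, 50, np.inf]
labels = ['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ']

# Default: the first bin is (0, 5], so 0.0 falls outside and becomes NaN
open_left = pd.cut(depth, bins, labels=labels)

# include_lowest=True closes the first bin on the left instead
closed_left = pd.cut(depth, bins, labels=labels, include_lowest=True)

print(open_left.tolist())
print(closed_left.tolist())
print(closed_left.cat.ordered)  # cut returns ordered categories
```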

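(b) Building on (a), split the intensity (the 烈度 column) into 4 levels: [0,3,4,5,np.inf]. Then, for the southern region, build a multi-level index from the depth and intensity levels, in that order, and sort by it.

As before, first bin the intensity and turn it into an index, then finally combine the two indexes into a multi-level index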

df['烈度'] = pd.cut(df['烈度'], [-0.01, 3, 4, 5, np.inf], labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ'])
df = df[df['方向'].isin(['south'])]
df.set_index(['烈度'], append=True, inplace=True)
df.sort_index(inplace=True)
print(df.head())
'''
           日期           时间     维度     经度   方向   距离
深度 烈度                                                   
Ⅰ   Ⅰ   1995.04.09  12:52:44 AM  37.02  27.43   south  1.6
     Ⅰ   1999.04.05  12:14:35 AM  37.02  27.43   south  1.6
     Ⅰ   1999.07.17  12:37:09 AM  38.35  40.04   south  1.6
     Ⅰ   1994.03.01  12:24:06 AM  38.63  26.47   south  3.0
     Ⅰ   1970.04.26  12:36:06 AM  39.03  29.77   south  3.0

'''
