Text Classification

lukem44

Article link: https://blog.csdn.net/lukem44/article/details/107010832

Chapter 8: Categorical Data

import pandas as pd
import numpy as np
df = pd.read_csv('data/table.csv')
df.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

I. Creating category data and its properties

1. Creating categorical variables

(a) Creating from a Series

pd.Series(["a", "b", "c", "a"], dtype="category")

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

(b) Creating by specifying the dtype of a DataFrame column

temp_df = pd.DataFrame({'A':pd.Series(["a", "b", "c", "a"], dtype="category"),'B':list('abcd')})
temp_df.dtypes

A    category
B      object
dtype: object

(c) Creating with the built-in Categorical type

cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
pd.Series(cat)

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
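
As a supplementary sketch (the values and the category list below are illustrative and not taken from the article's data), the categories argument also decides which values are valid. Anything not listed is treated as missing, and ordered=True can be passed at construction time.

```python
import pandas as pd

# Illustrative input: 'a' is deliberately left out of the categories
cat = pd.Categorical(["a", "b", "c", "a"], categories=['c', 'b'], ordered=True)

codes = cat.codes.tolist()      # position of each value in categories; -1 means "not a category"
missing = cat.isna().tolist()   # values outside the categories are treated as NaN

print(cat)
print(codes)
print(missing)
```

The codes follow the position in the categories list ('c' is 0, 'b' is 1), not the alphabetical order of the labels.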

(d) Creating with the cut function

By default, the intervals themselves are used as labels

pd.cut(np.random.randint(0,60,5), [0,10,30,60])

[(10, 30], (0, 10], (10, 30], (30, 60], (30, 60]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]

Strings can be given as labels instead

pd.cut(np.random.randint(0,60,5), [0,10,30,60], right=False, labels=['0-10','10-30','30-60'])

[10-30, 30-60, 30-60, 10-30, 30-60]
Categories (3, object): [0-10 < 10-30 < 30-60]
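
Because the two calls above use random input, the effect of the right parameter is easy to miss. The sketch below uses fixed, illustrative values that sit exactly on the bin edges to show which side of each interval is closed.

```python
import pandas as pd

# Fixed illustrative values chosen to land on the bin edges
values = [0, 10, 30, 59]
bins = [0, 10, 30, 60]

# right=True (the default): intervals are (0, 10], (10, 30], (30, 60]
right_closed = pd.cut(values, bins)
# right=False: intervals are [0, 10), [10, 30), [30, 60)
left_closed = pd.cut(values, bins, right=False, labels=['0-10', '10-30', '30-60'])

print(right_closed.isna().tolist())  # 0 lies outside (0, 10], so it becomes NaN
print(list(left_closed))             # 10 and 30 move up into the next bin
print(left_closed.ordered)           # cut returns an ordered categorical
```

With the default right-closed intervals, a value equal to the lowest edge is not assigned to any bin, which is worth keeping in mind when the data contains that boundary value.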

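2. Structure of a categorical variable

A categorical variable consists of three parts: the element values (values), the categories (categories), and whether it is ordered (ordered)

As the output above shows, categorical variables created with the cut function are ordered by default

The following shows how to inspect or modify these properties

(a) The describe method

This method summarizes a categorical series: the number of non-missing values, the number of distinct values that actually occur (not the number of categories), the most frequent value, and its frequency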

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.describe()

count     4
unique    3
top       a
freq      2
dtype: object

(b) The categories and ordered attributes

Inspect the categories and whether they are ordered

s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

s.cat.ordered

False
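
The three parts described in this section can also be read side by side. The following is a small sketch using the same series as above; cat.codes is an accessor the article does not cover, and it exposes how each value is stored as a position in the categories.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd']))

codes = s.cat.codes.tolist()            # integer position of each value; -1 marks a missing value
categories = s.cat.categories.tolist()  # all declared categories, including the unused 'd'
ordered = s.cat.ordered                 # False unless an order was requested

print(codes)
print(categories)
print(ordered)
```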

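3. Modifying categories

(a) Using set_categories

This replaces the set of categories without renaming the existing values; any value that is not in the new categories becomes NaN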

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.set_categories(['new_a','c'])

0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): [new_a, c]
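
To make the behaviour of set_categories more concrete, the sketch below (with illustrative labels such as 'z') compares a call that keeps every old label with one that drops some of them.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Every old label is still present, so no value is lost; only the category list changes
kept = s.cat.set_categories(['c', 'b', 'a', 'z'])
# 'b' and 'c' are no longer categories, so those elements become NaN
dropped = s.cat.set_categories(['a', 'z'])

print(kept.tolist())
print(kept.cat.categories.tolist())
print(dropped.isna().tolist())
```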

(b) Using rename_categories

Note that this method changes the values together with the categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])

0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]

Renaming with a dictionary (categories not listed in the dictionary are left unchanged)

s.cat.rename_categories({'a':'new_a','b':'new_b'})

0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]
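
Besides a list or a dictionary, rename_categories also accepts a callable that is applied to every category. A minimal sketch, using upper-casing as an arbitrary example transformation:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd']))

# The function is applied to each category label; missing values stay missing
renamed = s.cat.rename_categories(lambda label: label.upper())

print(renamed.cat.categories.tolist())
print(renamed.dropna().tolist())
```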

(c) Adding with add_categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.add_categories(['e'])

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]

(d) Removing with remove_categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_categories(['d'])

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
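
The example above removes a category that no element uses. As a complementary sketch, removing a category that is in use turns the affected elements into NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd']))

# 'a' is used by two elements; after removal they become missing values
removed = s.cat.remove_categories(['a'])

remaining = removed.cat.categories.tolist()
na_mask = removed.isna().tolist()

print(remaining)
print(na_mask)
```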

(e) Removing categories that do not appear among the values

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_unused_categories()

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]

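II. Ordering of categorical variables

As mentioned earlier, categorical data is either ordered or unordered. This is easy to understand: for example, score ranges from low to high form an ordered variable, while exam subjects are generally treated as an unordered variable

1. Establishing an order

(a) A series is usually converted to an ordered variable with the as_ordered method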

s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s

0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]

To go back to an unordered variable, use as_unordered

s.cat.as_unordered()

0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a, c, d]

(b) Using the ordered parameter of the set_categories method

pd.Series(["a", "d", "c", "a"]).astype('category').cat.set_categories(['a','c','d'],ordered=True)

0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]

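(c) Using the reorder_categories method

The distinguishing feature of this method is that the new categories must form exactly the same set as the original categories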

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.reorder_categories(['a','c','d'],ordered=True)

0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]

#s.cat.reorder_categories(['a','c'],ordered=True) # raises ValueError (a category is missing)
#s.cat.reorder_categories(['a','c','d','e'],ordered=True) # raises ValueError (an extra category is given)
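
The two commented-out calls can be checked without interrupting a session by catching the exception. The sketch below also reorders the categories in reverse (an illustrative choice) to show that min and max follow the declared order rather than the alphabet.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')

# Both a missing and an extra category are rejected
errors = []
for new_categories in (['a', 'c'], ['a', 'c', 'd', 'e']):
    try:
        s.cat.reorder_categories(new_categories, ordered=True)
    except ValueError as exc:
        errors.append(type(exc).__name__)

# A valid reordering: same set of labels, reversed order d < c < a
reversed_order = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)

print(errors)
print(reversed_order.min(), reversed_order.max())
```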

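2. Sorting

The value sorting and index sorting introduced in Chapter 1 both apply here

s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
# Note: the result is only displayed, not assigned back, so s itself stays unordered
s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True).head()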

0       good
1       fair
2        bad
3    perfect
4    perfect
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]

s.sort_values(ascending=False).head()

29    perfect
17    perfect
31    perfect
3     perfect
4     perfect
dtype: category
Categories (5, object): [awful, bad, fair, good, perfect]

df_sort = pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
df_sort.head()

	value
cat
good	-1.746975
fair	0.836732
bad	0.094912
perfect	-0.724338
perfect	-1.456362

df_sort.sort_index().head()

	value
cat
awful	0.245782
awful	0.063991
awful	1.541862
awful	-0.062976
awful	0.472542
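
In the example above the intended order (awful, bad, fair, good, perfect) happens to coincide with alphabetical order, so the effect of the category order on sorting is not visible. A small sketch with illustrative labels whose logical order differs from the alphabetical one:

```python
import pandas as pd

s = pd.Series(['medium', 'high', 'low', 'medium'], dtype='category')

# Default categories are sorted lexically: high, low, medium
lexical = s.sort_values().tolist()

# Declaring the categories as low < medium < high changes the sort result
ranked = s.cat.set_categories(['low', 'medium', 'high'], ordered=True)
logical = ranked.sort_values().tolist()

print(lexical)
print(logical)
```

Sorting follows the position of each label in the categories, so fixing that order once is enough for sort_values and sort_index to behave as intended.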

III. Comparison operations on categorical variables

1. Comparison with a scalar or an equal-length sequence

(a) Scalar comparison

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'

0     True
1    False
2    False
3     True
dtype: bool

(b) Equal-length sequence comparison

s == list('abcd')

0     True
1    False
2     True
3    False
dtype: bool

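2. Comparison with another categorical variable

(a) Equality tests (== and !=)

An equality test between two categorical variables requires their categories to be exactly the same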

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == s

0    True
1    True
2    True
3    True
dtype: bool

s != s

0    False
1    False
2    False
3    False
dtype: bool

s_new = s.cat.set_categories(['a','d','e'])
#s == s_new # raises TypeError: the categories differ
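
The failing comparison can be demonstrated safely with try/except. If an element-wise comparison is still needed, one possible workaround (an assumption of this sketch, not something the article proposes) is to compare the plain object values instead:

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s_new = s.cat.set_categories(['a', 'd', 'e'])  # 'c' is dropped and becomes NaN

try:
    s == s_new
    raised = False
except TypeError:
    raised = True

# Comparing the underlying values sidesteps the category check; NaN never equals anything
elementwise = (s.astype(object) == s_new.astype(object)).tolist()

print(raised)
print(elementwise)
```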

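(b) Inequality tests (>=, <=, <, >)

An inequality test between two categorical variables requires two conditions: ① identical categories ② identical ordering

s = pd.Series(["a", "d", "c", "a"]).astype('category')
#s >= s # raises TypeError: s is unordered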

s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.reorder_categories(['a','c','d'],ordered=True)
s >= s

0    True
1    True
2    True
3    True
dtype: bool
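
An ordered categorical can also be compared with a scalar, as long as the scalar is one of its categories. The sketch below uses a reversed order (d < c < a, chosen only for illustration) so that the result visibly depends on the declared order rather than on string comparison.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)  # d < c < a

# Only 'a' ranks above 'c' in this ordering
above_c = (s > 'c').tolist()
# 'd' is the lowest category, so every element is >= 'd'
at_least_d = (s >= 'd').tolist()

print(above_c)
print(at_least_d)
```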

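IV. Questions and exercises

[Question 1] How is the union_categoricals method used? What does it do?

# After operations such as pd.concat, categorical data whose categories differ is converted to object dtype; union_categoricals can be used to keep the categorical type
from pandas.api.types import union_categoricals 
union_categoricals([type1, type2]) # type1 and type2 are placeholders for two categoricals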
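
A runnable sketch of the answer, with type1 and type2 filled in by two small illustrative series (their contents are an assumption made for this example):

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Illustrative inputs with different category sets
type1 = pd.Series(['a', 'b'], dtype='category')
type2 = pd.Series(['b', 'c'], dtype='category')

# The categories differ, so concat falls back to a non-categorical dtype
concatenated = pd.concat([type1, type2], ignore_index=True)

# union_categoricals merges the category sets and returns a Categorical
combined = pd.Series(union_categoricals([type1, type2]))

print(concatenated.dtype)
print(combined.dtype)
print(combined.cat.categories.tolist())
```

When both inputs already share the same categories, pd.concat keeps the category dtype on its own, so union_categoricals is mainly useful when the category sets differ.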

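[Exercise 1] Continuing with the earthquake dataset from Chapter 4, solve the following problem:

Divide the depth into seven levels using the bins [0,5,10,15,20,30,50,np.inf], then use the depth levels Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ as the index and sort from shallow to deep. (Column names in the dataset: 日期 date, 时间 time, 维度 latitude, 经度 longitude, 方向 direction, 距离 distance, 深度 depth, 烈度 intensity.)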

data = pd.read_csv('data/Earthquake.csv')
data.head()

	日期	时间	维度	经度	方向	距离	深度	烈度
0	2003.05.20	12:17:44 AM	39.04	40.38	west	0.1	10.0	0.0
1	2007.08.01	12:03:08 AM	40.79	30.09	west	0.1	5.2	4.0
2	1978.05.07	12:41:37 AM	38.58	27.61	south_west	0.1	0.0	0.0
3	1997.03.22	12:31:45 AM	39.47	36.44	south_west	0.1	10.0	0.0
4	2000.04.02	12:57:38 AM	40.80	30.24	south_west	0.1	7.0	0.0

data.describe()

	维度	经度	距离	深度	烈度
count	10062.000000	10062.000000	10062.000000	10062.000000	10062.000000
mean	38.809421	33.623419	3.175015	12.502057	1.678513
std	1.237242	5.796679	4.715461	15.316634	2.068193
min	35.770000	25.700000	0.100000	0.000000	0.000000
25%	37.860000	28.960000	1.400000	5.000000	0.000000
50%	38.850000	30.980000	2.300000	8.100000	0.000000
75%	39.600000	39.280000	3.600000	12.000000	3.800000
max	42.770000	45.000000	95.400000	172.000000	7.200000

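# -np.inf replaces the 0 in the stated bins so that a depth of exactly 0 falls into level Ⅰ (the intervals are right-closed)
# Sorting by depth before set_index avoids '深度' being both the index name and a column label, which recent pandas versions reject as ambiguous in sort_values
data.sort_values(by='深度').pipe(lambda d: d.set_index(pd.cut(d['深度'], [-np.inf,5,10,15,20,30,50,np.inf], right=True, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])))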

	日期	时间	维度	经度	方向	距离	深度	烈度
深度
Ⅰ	2000.03.31	12:24:51 AM	39.49	43.00	north_east	2.7	0.0	0.0
Ⅰ	2000.01.17	12:19:56 AM	38.07	31.94	north_west	3.2	0.0	0.0
Ⅰ	2000.06.06	12:47:36 AM	40.44	33.09	west	5.7	0.0	0.0
Ⅰ	1997.04.04	12:03:38 AM	40.17	29.27	north_west	2.1	0.0	0.0
Ⅰ	1996.08.14	12:17:56 AM	40.69	35.41	north_west	2.1	0.0	0.0
Ⅰ	1995.07.23	12:05:04 AM	37.61	29.29	north_east	3.2	0.0	0.0
Ⅰ	1997.06.16	12:18:04 AM	37.92	29.17	north_east	3.2	0.0	0.0
Ⅰ	1982.10.31	12:48:24 AM	39.28	27.73	east	0.9	0.0	0.0
Ⅰ	1998.11.18	12:32:24 AM	39.17	40.07	east	0.9	0.0	0.0
Ⅰ	1976.03.06	12:20:32 AM	36.94	30.38	north_east	5.6	0.0	0.0
Ⅰ	1996.08.14	12:54:32 AM	40.67	35.37	west	1.0	0.0	0.0
Ⅰ	2000.03.04	12:50:33 AM	37.24	27.59	west	1.0	0.0	0.0
Ⅰ	1982.01.16	12:06:47 AM	39.27	29.00	south	2.1	0.0	0.0
Ⅰ	1997.07.17	12:29:07 AM	36.89	30.40	north_west	5.6	0.0	0.0
Ⅰ	1976.08.24	12:53:30 AM	39.20	29.50	south_west	1.0	0.0	0.0
Ⅰ	1977.02.21	12:36:15 AM	37.64	29.10	south_west	1.0	0.0	0.0
Ⅰ	1996.04.26	12:12:36 AM	39.71	39.20	north_west	5.6	0.0	0.0
Ⅰ	1976.09.30	12:21:14 AM	41.80	26.30	north_west	5.6	0.0	0.0
Ⅰ	1981.06.04	12:40:37 AM	38.61	27.20	south_east	1.0	0.0	3.9
Ⅰ	1979.10.16	12:27:05 AM	37.20	27.80	south_east	1.0	0.0	0.0
Ⅰ	1981.03.28	12:59:18 AM	37.20	29.90	south_east	5.6	0.0	0.0
Ⅰ	1976.05.05	12:12:21 AM	39.40	29.01	south_east	1.0	0.0	3.5
Ⅰ	1995.05.15	12:42:49 AM	39.45	26.23	south	5.6	0.0	0.0
Ⅰ	1978.07.05	12:18:24 AM	39.49	33.20	south_east	2.1	0.0	0.0
Ⅰ	1995.01.06	12:52:26 AM	36.94	29.02	north_west	3.2	0.0	0.0
Ⅰ	1980.04.21	12:58:00 AM	39.26	30.12	south_east	2.1	0.0	0.0
Ⅰ	1977.03.16	12:17:43 AM	37.20	28.10	west	3.3	0.0	0.0
Ⅰ	1975.03.28	12:11:07 AM	40.39	26.39	north_west	5.6	0.0	0.0
Ⅰ	1996.08.14	12:24:56 AM	40.71	35.28	south_west	1.0	0.0	0.0
Ⅰ	1996.04.10	12:29:40 AM	40.41	32.25	south_west	1.0	0.0	0.0
...	...	...	...	...	...	...	...	...
Ⅶ	2011.06.23	12:29:07 AM	37.05	30.98	south_east	2.1	110.4	0.0
Ⅶ	1965.03.26	12:29:23 AM	36.82	30.94	south_east	6.6	111.0	5.1
Ⅶ	1990.02.26	12:00:45 AM	37.17	30.21	north_east	2.8	111.0	0.0
Ⅶ	2011.08.17	12:27:48 AM	36.89	30.99	west	1.8	111.4	0.0
Ⅶ	1991.03.11	12:33:43 AM	37.01	30.96	south_west	0.8	113.0	0.0
Ⅶ	2005.07.20	12:42:08 AM	36.98	30.91	north_west	2.1	116.0	0.0
Ⅶ	2010.02.23	12:26:37 AM	36.87	31.30	south_west	1.5	117.6	0.0
Ⅶ	1991.01.24	12:32:33 AM	37.11	31.00	south	4.8	118.0	0.0
Ⅶ	1951.02.02	12:59:23 AM	36.83	30.54	south_east	2.7	120.0	0.0
Ⅶ	2013.12.25	12:11:35 AM	36.97	31.10	north_west	1.1	123.2	4.4
Ⅶ	1989.07.12	12:42:54 AM	37.16	31.09	north_east	3.5	125.0	0.0
Ⅶ	1981.07.10	12:39:32 AM	37.10	31.16	north_east	4.1	126.0	0.0
Ⅶ	1964.12.30	12:14:03 AM	36.40	34.20	south_east	8.7	128.0	4.9
Ⅶ	1936.08.12	12:24:28 AM	37.44	29.44	north_east	2.9	130.0	4.9
Ⅶ	1925.09.01	12:16:30 AM	37.56	29.17	north	5.4	130.0	5.3
Ⅶ	1966.08.31	12:54:16 AM	38.90	41.50	north_west	1.6	131.0	4.1
Ⅶ	1966.05.09	12:51:10 AM	37.05	30.98	south_east	2.2	132.0	4.9
Ⅶ	1989.01.16	12:33:12 AM	37.17	30.95	north_east	1.4	132.0	0.0
Ⅶ	1964.08.22	12:14:05 AM	36.93	30.60	north_east	3.2	133.0	4.1
Ⅶ	2004.04.21	12:41:47 AM	36.73	31.49	west	4.3	133.0	0.0
Ⅶ	1992.10.20	12:26:13 AM	37.20	31.21	north_east	5.6	136.0	0.0
Ⅶ	1985.06.04	12:15:17 AM	37.15	31.08	north_east	2.2	139.0	0.0
Ⅶ	1978.01.17	12:09:03 AM	39.40	41.40	south_west	7.3	139.0	0.0
Ⅶ	1962.08.18	12:29:07 AM	36.97	32.52	south_east	1.4	140.0	4.7
Ⅶ	2007.12.02	12:40:22 AM	37.32	31.09	north_west	3.6	141.1	0.0
Ⅶ	1982.01.24	12:37:03 AM	36.61	27.52	south_east	10.1	146.0	3.7
Ⅶ	1936.02.02	12:08:26 AM	37.69	38.82	north_west	2.7	160.0	4.9
Ⅶ	1975.09.06	12:11:02 AM	38.60	40.80	south_west	1.8	166.0	0.0
Ⅶ	1969.07.23	12:54:11 AM	38.90	41.00	north_east	2.5	169.0	4.2
Ⅶ	1966.08.19	12:38:56 AM	38.40	41.20	south_east	1.6	172.0	4.6

10062 rows × 8 columns
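
Since Earthquake.csv may not be at hand, the same idea can be tried on a tiny made-up frame. In this sketch the column name depth, the index name level and the five values are all hypothetical, and sort_index is used as an alternative that orders rows by level through the ordered categorical index.

```python
import numpy as np
import pandas as pd

# Hypothetical depths, including the boundary values 0 and 5
quakes = pd.DataFrame({'depth': [12.0, 0.0, 60.0, 5.0, 25.0]})

levels = pd.cut(
    quakes['depth'],
    [-np.inf, 5, 10, 15, 20, 30, 50, np.inf],
    labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ'],
)

# Give the index its own name so it cannot be confused with the depth column
by_level = quakes.set_index(levels.rename('level')).sort_index()

print(by_level)
```

sort_index orders the rows by level only; rows inside the same level are not guaranteed to be sorted by depth, which is why the solution above sorts on the depth column itself.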
