Learning pandas: Chapter 8

lx12633036

Article link: https://blog.csdn.net/lx12633036/article/details/106984461

import pandas as pd
import numpy as np

data=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\table.csv')

data.head()

	Unnamed: 0	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

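data.info  # no parentheses, so this shows the bound method (whose repr contains the whole frame); data.info() would print the column summary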

<bound method DataFrame.info of     Unnamed: 0 School Class    ID Gender   Address  Height  Weight  Math  \
0            0    S_1   C_1  1101      M  street_1     173      63  34.0   
1            1    S_1   C_1  1102      F  street_2     192      73  32.5   
2            2    S_1   C_1  1103      M  street_2     186      82  87.2   
3            3    S_1   C_1  1104      F  street_2     167      81  80.4   
4            4    S_1   C_1  1105      F  street_4     159      64  84.8   
5            5    S_1   C_2  1201      M  street_5     188      68  97.0   
6            6    S_1   C_2  1202      F  street_4     176      94  63.5   
7            7    S_1   C_2  1203      M  street_6     160      53  58.8   
8            8    S_1   C_2  1204      F  street_5     162      63  33.8   
9            9    S_1   C_2  1205      F  street_6     167      63  68.4   
10          10    S_1   C_3  1301      M  street_4     161      68  31.5   
11          11    S_1   C_3  1302      F  street_1     175      57  87.7   
12          12    S_1   C_3  1303      M  street_7     188      82  49.7   
13          13    S_1   C_3  1304      M  street_2     195      70  85.2   
14          14    S_1   C_3  1305      F  street_5     187      69  61.7   
15          15    S_2   C_1  2101      M  street_7     174      84  83.3   
16          16    S_2   C_1  2102      F  street_6     161      61  50.6   
17          17    S_2   C_1  2103      M  street_4     157      61  52.5   
18          18    S_2   C_1  2104      F  street_5     159      97  72.2   
19          19    S_2   C_1  2105      M  street_4     170      81  34.2   
20          20    S_2   C_2  2201      M  street_5     193     100  39.1   
21          21    S_2   C_2  2202      F  street_7     194      77  68.5   
22          22    S_2   C_2  2203      M  street_4     155      91  73.8   
23          23    S_2   C_2  2204      M  street_1     175      74  47.2   
24          24    S_2   C_2  2205      F  street_7     183      76  85.4   
25          25    S_2   C_3  2301      F  street_4     157      78  72.3   
26          26    S_2   C_3  2302      M  street_5     171      88  32.7   
27          27    S_2   C_3  2303      F  street_7     190      99  65.9   
28          28    S_2   C_3  2304      F  street_6     164      81  95.5   
29          29    S_2   C_3  2305      M  street_4     187      73  48.9   
30          30    S_2   C_4  2401      F  street_2     192      62  45.3   
31          31    S_2   C_4  2402      M  street_7     166      82  48.7   
32          32    S_2   C_4  2403      F  street_6     158      60  59.7   
33          33    S_2   C_4  2404      F  street_2     160      84  67.7   
34          34    S_2   C_4  2405      F  street_6     193      54  47.6   

   Physics  
0       A+  
1       B+  
2       B+  
3       B-  
4       B+  
5       A-  
6       B-  
7       A+  
8        B  
9       B-  
10      B+  
11      A-  
12       B  
13       A  
14      B-  
15       C  
16      B+  
17      B-  
18      B+  
19       A  
20       B  
21      B+  
22      A+  
23      B-  
24       B  
25      B+  
26       A  
27       C  
28      A-  
29       B  
30       A  
31       B  
32      B+  
33       B  
34       B  >

data.drop(columns=['Unnamed: 0'],inplace=True)  # columns= already selects the axis, so axis=1 is not needed

data.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

Creating categoricals and their properties

Creating categorical variables

Creating from a Series

pd.Series(['a','b','c','a'],dtype='category')

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

Creating by specifying the dtype in a DataFrame

temp_data=pd.DataFrame({'A':pd.Series(['a','b','c','a'],dtype='category'),'B':list('abcd')})
temp_data.dtypes

A    category
B      object
dtype: object

Creating with the built-in Categorical type

cat=pd.Categorical(['a','b','c','a'],categories=['a','b','c'])
pd.Series(cat)

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

Creating with the cut function

pd.cut(np.random.randint(0,60,5),[0,10,30,60])  # use the intervals themselves as labels

[(10, 30], (30, 60], (10, 30], (30, 60], (10, 30]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]

# categorical variables created with cut are ordered
pd.cut(np.random.randint(0,60,5),[0,10,30,60],right=False,labels=['0-10','10-30','30-60'])

[0-10, 30-60, 30-60, 10-30, 0-10]
Categories (3, object): [0-10 < 10-30 < 30-60]
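Because the inputs above are random, the printed bins change from run to run. A minimal sketch with fixed values (the array `ages` is made up for illustration) shows how the `right` parameter decides which bin a boundary value falls into:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values, chosen to sit on or near the bin edges
ages = np.array([0, 10, 29, 30, 59])
bins = [0, 10, 30, 60]

# Default right=True: bins are (0, 10], (10, 30], (30, 60]; 0 matches no bin
closed_right = pd.cut(ages, bins)

# right=False: bins are [0, 10), [10, 30), [30, 60)
closed_left = pd.cut(ages, bins, right=False, labels=['0-10', '10-30', '30-60'])

print(closed_right.codes)    # a code of -1 marks a missing value
print(list(closed_left))
print(closed_left.ordered)   # cut returns an ordered categorical
```

With the default setting the value 0 is left without a bin, which is worth keeping in mind whenever the data can sit exactly on the lowest edge.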

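Structure of a categorical variable

A categorical variable has three parts: its element values (values), its categories (categories), and whether it is ordered (ordered).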
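The three parts can be inspected directly on a `pd.Categorical`. The sketch below is only an illustration; it also prints `codes`, the integer positions pandas stores for each element:

```python
import pandas as pd

c = pd.Categorical(['a', 'b', 'c', 'a'], categories=['a', 'b', 'c'])

print(c.categories)   # the category labels
print(c.codes)        # position of each element within categories
print(c.ordered)      # False unless an order was requested
```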

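The describe method

This method summarizes a categorical series: the number of non-missing values, the number of distinct element values (not the number of categories), the most frequent element, and its frequency.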

s=pd.Series(pd.Categorical(['a','b','c','a',np.nan],categories=['a','b','c','d']))
s.describe()

count     4
unique    3
top       a
freq      2
dtype: object
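The difference between the `unique` figure and the number of categories can be checked separately. A small sketch on the same data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'c', 'a', np.nan],
                             categories=['a', 'b', 'c', 'd']))

n_values_seen = s.nunique()            # distinct non-missing values: a, b, c
n_categories = len(s.cat.categories)   # declared categories: a, b, c, d
print(n_values_seen, n_categories)
```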

The categories and ordered attributes

s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

s.cat.ordered

False

Modifying categories

Modifying with set_categories

s.cat.set_categories(['new_a','c'])

0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): [new_a, c]

cat

[a, b, c, a]
Categories (3, object): [a, b, c]

s  # set_categories returned a new Series, so s itself is unchanged

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (4, object): [a, b, c, d]
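Two things are easy to miss here: `set_categories` replaces the category list outright, and it returns a new Series instead of changing `s`. A short sketch of both points, using the same data as above:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'c', 'a', np.nan],
                             categories=['a', 'b', 'c', 'd']))

# 'new_a' is treated as a brand-new category, not as a renamed 'a',
# so only the 'c' value survives and everything else becomes NaN
replaced = s.cat.set_categories(['new_a', 'c'])
n_missing = int(replaced.isna().sum())
print(n_missing)

# The original Series still has its four categories
before = list(s.cat.categories)
print(before)

# To keep a change, assign the result back
s = s.cat.set_categories(['a', 'b', 'c'])
after = list(s.cat.categories)
print(after)
```

For changing labels while keeping the values, `rename_categories` is the better fit.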

Modifying with rename_categories

s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])

0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]

Renaming values with a dictionary

s.cat.rename_categories({'a':'new_a','b':'new_b'})

0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]

Adding with add_categories

s.cat.add_categories(['e'])

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]

Removing with remove_categories

s.cat.remove_categories(['d'])

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]

Removing categories that no element uses

s.cat.remove_unused_categories()

0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
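In the examples above only the unused category `d` is removed, so the values never change. The sketch below, on a small made-up series, shows what happens when a category that is in use gets removed, and contrasts it with `remove_unused_categories`:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Removing a category that is in use turns its values into NaN
without_a = s.cat.remove_categories(['a'])
print(without_a.tolist())

# remove_unused_categories only drops labels that no element uses
padded = s.cat.add_categories(['e'])
trimmed = padded.cat.remove_unused_categories()
print(list(padded.cat.categories), list(trimmed.cat.categories))
```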

Sorting categorical variables

Establishing an order

To turn a series into an ordered variable, use the as_ordered method.

s=pd.Series(['a','b','c','d']).astype('category').cat.as_ordered()
s
# This could be used on my own time series (to put that string of times in order)

0    a
1    b
2    c
3    d
dtype: category
Categories (4, object): [a < b < c < d]

To go back to an unordered variable, just use as_unordered.

s.cat.as_unordered()

0    a
1    b
2    c
3    d
dtype: category
Categories (4, object): [a, b, c, d]

Using the ordered parameter of set_categories

pd.Series(['a','b','c','a']).astype('category').cat.set_categories(['a','c','d'],ordered=True)

0      a
1    NaN
2      c
3      a
dtype: category
Categories (3, object): [a < c < d]

Using the reorder_categories method

s=pd.Series(['a','d','c','a']).astype('category')
s.cat.reorder_categories(['a','c','d'],ordered=True)
# With this method, the new categories must be exactly the same set as the original ones

0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]
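The same-set requirement can be seen by passing a list that does not match. A minimal sketch, assuming the series from the cell above:

```python
import pandas as pd

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')

# Same set of labels in a new order: allowed
reordered = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)
print(list(reordered.cat.categories))

# 'b' is not an existing category and 'd' is missing, so this raises ValueError
try:
    s.cat.reorder_categories(['a', 'b', 'c'], ordered=True)
    raised = False
except ValueError:
    raised = True
print(raised)
```

When the category set itself has to change, `set_categories` is the method to reach for instead.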

Sorting

s=pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
# The result of set_categories is not assigned back here, so s itself stays unordered
s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True).head(20)

0        fair
1         bad
2       awful
3       awful
4     perfect
5     perfect
6     perfect
7         bad
8        good
9        good
10        bad
11        bad
12      awful
13      awful
14        bad
15    perfect
16       good
17      awful
18      awful
19       fair
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]

s.sort_values(ascending=True).head()

29    awful
26    awful
2     awful
3     awful
25    awful
dtype: category
Categories (5, object): [awful, bad, fair, good, perfect]
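In the cell above the sort comes out right partly by coincidence: `s` is still unordered, and its default alphabetical categories happen to run from awful to perfect. A sketch with hypothetical labels whose meaningful order is not alphabetical shows why the ordered categories should be assigned back before sorting:

```python
import pandas as pd

# Hypothetical labels; alphabetical order would be high, low, medium
ratings = pd.Series(['medium', 'high', 'low', 'medium']).astype('category')
alphabetical = ratings.sort_values().tolist()
print(alphabetical)

# Assign the ordered categories back, then sort by rank
ratings = ratings.cat.set_categories(['low', 'medium', 'high'], ordered=True)
by_rank = ratings.sort_values().tolist()
print(by_rank)

# min and max are defined for ordered categoricals
print(ratings.min(), ratings.max())
```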

s.describe()

count      50
unique      5
top       bad
freq       15
dtype: object

data_sort=pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
data_sort.head(50)

	value
cat
fair	-2.811450
bad	0.783449
awful	1.721855
awful	0.118119
perfect	-1.317575
perfect	-0.500355
perfect	0.432203
bad	-1.554244
good	-1.164727
good	-1.244924
bad	-0.060621
bad	-0.779449
awful	0.200825
awful	-1.588781
bad	-1.185241
perfect	-0.478165
good	-1.119550
awful	0.766782
awful	0.147791
fair	0.042877
good	1.005659
perfect	0.383797
good	-0.796085
fair	-0.725092
bad	0.036638
awful	-2.128523
awful	-0.763601
fair	1.242681
bad	1.703662
awful	1.150105
fair	-0.361344
awful	-1.371411
bad	-0.258389
bad	-1.519332
fair	1.727239
bad	1.310984
awful	-0.833761
fair	-0.272145
good	-0.348845
bad	-1.244156
good	0.520846
bad	0.284184
perfect	0.582454
bad	0.969094
good	2.186216
perfect	-1.115565
awful	-0.840769
bad	0.631311
bad	0.245196
good	0.386491

data_sort.info()

<class 'pandas.core.frame.DataFrame'>
CategoricalIndex: 50 entries, fair to good
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   value   50 non-null     float64
dtypes: float64(1)
memory usage: 1.9 KB

data_sort.sort_index()

	value
cat
awful	1.150105
awful	-0.763601
awful	1.721855
awful	0.118119
awful	-2.128523
awful	-0.840769
awful	-1.371411
awful	0.147791
awful	0.200825
awful	-1.588781
awful	0.766782
awful	-0.833761
bad	0.245196
bad	-0.258389
bad	-1.519332
bad	1.310984
bad	0.036638
bad	1.703662
bad	-0.779449
bad	-0.060621
bad	-1.244156
bad	0.284184
bad	-1.554244
bad	0.969094
bad	0.631311
bad	0.783449
bad	-1.185241
fair	1.727239
fair	-0.361344
fair	-0.272145
fair	-2.811450
fair	-0.725092
fair	1.242681
fair	0.042877
good	2.186216
good	-1.164727
good	-1.244924
good	-0.348845
good	0.520846
good	-1.119550
good	1.005659
good	-0.796085
good	0.386491
perfect	0.383797
perfect	0.582454
perfect	0.432203
perfect	-1.115565
perfect	-0.500355
perfect	-1.317575
perfect	-0.478165

Comparison operations on categorical variables

Comparing with a scalar or an equal-length sequence

Scalar comparison

s=pd.Series(['a','d','c','a']).astype('category')
s=='a'

0     True
1    False
2    False
3     True
dtype: bool

Equal-length sequence comparison

s==list('abcd')

0     True
1    False
2     True
3    False
dtype: bool

Comparing with another categorical variable

Equality tests (== and !=)

s=pd.Series(['a','d','c','a']).astype('category')
s==s

0    True
1    True
2    True
3    True
dtype: bool
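Comparing a series with itself always succeeds, so it hides the restrictions pandas places on categorical comparisons. The sketch below (the helper `raises_type_error` is hypothetical, written only for this illustration) probes three cases: an order comparison on an ordered categorical, an order comparison on an unordered one, and an equality test against a categorical with a different category set:

```python
import pandas as pd

unordered = pd.Series(['a', 'd', 'c', 'a']).astype('category')
ordered = unordered.cat.as_ordered()   # categories a < c < d

# Order comparisons work on an ordered categorical
greater_than_a = (ordered > 'a').tolist()
print(greater_than_a)

def raises_type_error(compare):
    """Return True if calling compare() raises TypeError."""
    try:
        compare()
    except TypeError:
        return True
    return False

# Unordered categoricals only support == and !=
lt_fails = raises_type_error(lambda: unordered < unordered)
print(lt_fails)

# Equality against a categorical with a different category set is rejected too
other = unordered.cat.add_categories(['e'])
eq_fails = raises_type_error(lambda: unordered == other)
print(eq_fails)
```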

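Questions and exercises

Questions

1. How is the union_categoricals method used, and what does it do?

To keep the categorical dtype when combining categoricals, use union_categoricals, which merges them into a single categorical whose categories are the union of the inputs' categories.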

from pandas.api.types import union_categoricals
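A minimal usage sketch, with two small categoricals made up for illustration:

```python
import pandas as pd
from pandas.api.types import union_categoricals

first = pd.Categorical(['a', 'b'])
second = pd.Categorical(['b', 'c'])

combined = union_categoricals([first, second])
print(list(combined))              # values of both inputs, in order
print(list(combined.categories))   # union of the two category sets
```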

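2. When two series are concatenated vertically with concat, is the result always a categorical variable? When is it not?

Not necessarily. The series being concatenated need not be categorical in the first place; for example, concatenating a datetime series with a categorical series does not give a categorical result.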
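Even two categorical series can lose the dtype: `pd.concat` keeps it when the inputs share the same categories and otherwise falls back to a non-categorical dtype. A small sketch with made-up series:

```python
import pandas as pd

same_1 = pd.Series(['a', 'b'], dtype='category')
same_2 = pd.Series(['b', 'a'], dtype='category')
different = pd.Series(['b', 'c'], dtype='category')

kept = pd.concat([same_1, same_2], ignore_index=True)
lost = pd.concat([same_1, different], ignore_index=True)

kept_is_cat = isinstance(kept.dtype, pd.CategoricalDtype)
lost_is_cat = isinstance(lost.dtype, pd.CategoricalDtype)
print(kept_is_cat, lost_is_cat)
```

This is the situation where `union_categoricals` helps, since it merges the category sets instead of dropping the dtype.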

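3. When using groupby or value_counts, how do the results for a categorical variable differ from those for an ordinary variable?

s.value_counts  # no parentheses, so this shows the bound method rather than the counts; s.value_counts() would count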

<bound method IndexOpsMixin.value_counts of 0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a, c, d]>

s_1=pd.Series(['a','d','c','a']).astype('string')

s_1

0    a
1    d
2    c
3    a
dtype: string

s_1.value_counts  # again the bound method; s_1.value_counts() would return the counts

<bound method IndexOpsMixin.value_counts of 0    a
1    d
2    c
3    a
dtype: string>
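The cells above display the bound methods, so they do not yet show a difference. A sketch that calls the methods, using the same values plus one unused category `e` (added here as an assumption, since the difference only appears when some category has no elements):

```python
import pandas as pd

cat_s = pd.Series(pd.Categorical(['a', 'd', 'c', 'a'],
                                 categories=['a', 'c', 'd', 'e']))
obj_s = cat_s.astype(object)

# value_counts on a categorical also reports categories with zero elements
cat_counts = cat_s.value_counts().to_dict()
obj_counts = obj_s.value_counts().to_dict()
print(cat_counts)
print(obj_counts)

# groupby on a categorical key: observed=False keeps the empty category 'e'
df = pd.DataFrame({'key': cat_s, 'x': [1, 2, 3, 4]})
sizes = df.groupby('key', observed=False).size().to_dict()
print(sizes)
```

Passing `observed` explicitly avoids depending on its default, which differs between pandas versions.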

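Exercises

1. Continuing with the earthquake dataset from Chapter 4, solve the following problems. The dataset's column names are Chinese: 深度 is depth, 烈度 is intensity, 方向 is direction.

(a) Divide depth into seven levels with bin edges [0,5,10,15,20,30,50,np.inf], use the depth levels Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ as the index, and sort from shallow to deep.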

data_earth=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\Earthquake.csv')

data_earth.head(20)

	日期	时间	维度	经度	方向	距离	深度	烈度
0	2003.05.20	12:17:44 AM	39.04	40.38	west	0.1	10.0	0.0
1	2007.08.01	12:03:08 AM	40.79	30.09	west	0.1	5.2	4.0
2	1978.05.07	12:41:37 AM	38.58	27.61	south_west	0.1	0.0	0.0
3	1997.03.22	12:31:45 AM	39.47	36.44	south_west	0.1	10.0	0.0
4	2000.04.02	12:57:38 AM	40.80	30.24	south_west	0.1	7.0	0.0
5	2005.01.21	12:04:03 AM	37.11	27.75	south_west	0.1	32.8	0.0
6	2012.06.24	12:07:22 AM	38.75	43.61	south_west	0.1	9.4	4.5
7	1987.12.31	12:49:54 AM	39.43	27.98	south_east	0.1	26.0	0.0
8	2000.02.07	12:11:45 AM	40.05	34.07	south_east	0.1	1.0	0.0
9	2011.10.28	12:47:56 AM	38.76	43.54	south_east	0.1	3.1	4.2
10	2013.05.01	12:47:56 AM	37.31	37.11	south_east	0.1	9.5	3.5
11	1989.04.27	12:45:19 AM	37.04	28.04	south	0.1	9.0	0.0
12	1999.11.26	12:42:20 AM	37.77	38.54	south	0.1	13.0	0.0
13	1999.12.20	12:41:56 AM	40.86	30.99	south	0.1	9.0	0.0
14	1984.02.02	12:10:29 AM	37.21	30.81	north_west	0.1	15.0	0.0
15	2011.05.22	12:49:49 AM	39.13	29.04	north_west	0.1	7.2	3.9
16	1971.05.20	12:08:46 AM	37.72	30.00	north_east	0.1	5.0	0.0
17	1985.01.28	12:20:56 AM	38.85	29.06	north_east	0.1	4.0	0.0
18	1997.05.31	12:59:03 AM	39.89	39.79	north_east	0.1	26.0	0.0
19	2005.07.24	12:36:10 AM	36.96	36.03	north_east	0.1	22.0	4.1

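# With the default right=True the first bin is (0, 5], so a depth of exactly 0 matches no bin and becomes NaN (row 2 below)
data_earth['深度']=pd.cut(data_earth['深度'],[0,5,10,15,20,30,50,np.inf],labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])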

data_earth.head()

	日期	时间	维度	经度	方向	距离	深度	烈度
0	2003.05.20	12:17:44 AM	39.04	40.38	west	0.1	Ⅱ	0.0
1	2007.08.01	12:03:08 AM	40.79	30.09	west	0.1	Ⅱ	4.0
2	1978.05.07	12:41:37 AM	38.58	27.61	south_west	0.1	NaN	0.0
3	1997.03.22	12:31:45 AM	39.47	36.44	south_west	0.1	Ⅱ	0.0
4	2000.04.02	12:57:38 AM	40.80	30.24	south_west	0.1	Ⅱ	0.0

data_earth.set_index('深度').sort_index()

	日期	时间	维度	经度	方向	距离	烈度
深度
Ⅰ	1969.08.29	12:46:05 AM	38.00	36.50	south_east	2.3	0.0
Ⅰ	2007.02.11	12:43:00 AM	38.41	39.13	south_east	2.2	4.2
Ⅰ	2007.03.08	12:57:56 AM	39.08	40.40	south_east	2.2	3.9
Ⅰ	2007.04.14	12:30:37 AM	38.31	39.29	south_east	2.2	4.2
Ⅰ	2011.05.20	12:39:03 AM	39.11	29.10	south_east	2.2	3.5
...	...	...	...	...	...	...	...
NaN	1977.03.09	12:21:25 AM	36.34	29.04	south_west	15.8	0.0
NaN	1975.03.22	12:25:25 AM	40.38	25.94	north_east	17.0	3.9
NaN	1991.10.10	12:07:02 AM	40.41	25.87	north	20.0	0.0
NaN	2000.10.22	12:18:40 AM	38.54	44.56	north_east	23.4	0.0
NaN	1978.04.02	12:35:03 AM	42.40	26.60	north	48.5	0.0

10062 rows × 7 columns
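The NaN rows at the end of the sorted output come from depths of exactly 0. One possible way to keep them, sketched on a hypothetical five-row stand-in for the earthquake table (the column `level` is introduced only for this illustration), is `include_lowest=True`, which closes the first bin on the left; `right=False` is an alternative, but it also moves every other boundary value up one level:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the earthquake table; 深度 is the depth column
quakes = pd.DataFrame({'深度': [12.0, 5.2, 0.0, 32.8, 60.0]})

edges = [0, 5, 10, 15, 20, 30, 50, np.inf]
levels = ['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ']

# include_lowest=True makes the first bin include 0, so no depth is lost
quakes['level'] = pd.cut(quakes['深度'], edges, include_lowest=True, labels=levels)

by_level = quakes.set_index('level').sort_index()
print(by_level)
```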

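(b) Building on (a), divide intensity into four levels with bin edges [0,3,4,5,np.inf], then build a multi-level index on the depth and intensity levels for the southern region and sort by it.

# -1e-10 as the lowest edge keeps an intensity of exactly 0 inside the first bin
# Note: no filter on 方向 (direction) is applied here, so every row is sorted, not only the southern region
data_earth['烈度']=pd.cut(data_earth['烈度'],[-1e-10,3,4,5,np.inf],labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ'])
data_earth.set_index(['深度','烈度']).sort_index()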

		日期	时间	维度	经度	方向	距离
深度	烈度
Ⅰ	Ⅰ	2000.02.07	12:11:45 AM	40.05	34.07	south_east	0.1
	Ⅰ	1971.05.20	12:08:46 AM	37.72	30.00	north_east	0.1
	Ⅰ	1985.01.28	12:20:56 AM	38.85	29.06	north_east	0.1
	Ⅰ	1990.07.05	12:43:04 AM	37.87	29.18	east	0.1
	Ⅰ	1985.01.07	12:37:08 AM	39.24	27.80	west	0.2
...	...	...	...	...	...	...	...
NaN	Ⅲ	1981.12.26	12:53:35 AM	40.15	28.74	east	2.6
	Ⅲ	1975.04.19	12:52:58 AM	37.69	27.30	north_west	3.3
	Ⅲ	1976.05.15	12:05:57 AM	39.33	29.05	north_east	3.6
	Ⅲ	1975.01.30	12:51:25 AM	39.82	28.60	south_east	4.7
	Ⅲ	1980.03.03	12:15:06 AM	38.13	27.75	north_east	4.8

10062 rows × 6 columns
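The output above still contains northern and eastern rows. One way to restrict it, sketched on hypothetical rows and under the assumption that "southern region" means a direction label containing `south`, is to filter before building the index:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 方向 = direction, 深度 = depth, 烈度 = intensity
quakes = pd.DataFrame({
    '方向': ['south', 'north', 'south_east', 'south_west', 'west'],
    '深度': [12.0, 3.0, 2.0, 2.5, 40.0],
    '烈度': [4.5, 0.0, 3.5, 0.0, 5.5],
})

# Assumption: the southern region is every direction label containing 'south'
south = quakes[quakes['方向'].str.contains('south')].copy()

south['深度'] = pd.cut(south['深度'], [0, 5, 10, 15, 20, 30, 50, np.inf],
                     include_lowest=True,
                     labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ'])
south['烈度'] = pd.cut(south['烈度'], [0, 3, 4, 5, np.inf],
                     include_lowest=True,
                     labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ'])

# Sort by depth level first, then by intensity level
result = south.set_index(['深度', '烈度']).sort_index()
print(result)
```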

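Summary

I had not studied categorical variables as a topic in their own right before; I had only used them in a few algorithms.
They are not especially important in themselves and are mainly just another data type, so some care is needed when using them.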
