pandas Study Notes: Chapter 8

import pandas as pd
import numpy as np 
data=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\table.csv')
data.head()
   Unnamed: 0 School Class    ID Gender   Address  Height  Weight  Math Physics
0           0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1           1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2           2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3           3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4           4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
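data.info  # without parentheses this displays the bound method together with the frame's repr; call data.info() for the column summary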
<bound method DataFrame.info of     Unnamed: 0 School Class    ID Gender   Address  Height  Weight  Math  \
0            0    S_1   C_1  1101      M  street_1     173      63  34.0   
1            1    S_1   C_1  1102      F  street_2     192      73  32.5   
2            2    S_1   C_1  1103      M  street_2     186      82  87.2   
3            3    S_1   C_1  1104      F  street_2     167      81  80.4   
4            4    S_1   C_1  1105      F  street_4     159      64  84.8   
5            5    S_1   C_2  1201      M  street_5     188      68  97.0   
6            6    S_1   C_2  1202      F  street_4     176      94  63.5   
7            7    S_1   C_2  1203      M  street_6     160      53  58.8   
8            8    S_1   C_2  1204      F  street_5     162      63  33.8   
9            9    S_1   C_2  1205      F  street_6     167      63  68.4   
10          10    S_1   C_3  1301      M  street_4     161      68  31.5   
11          11    S_1   C_3  1302      F  street_1     175      57  87.7   
12          12    S_1   C_3  1303      M  street_7     188      82  49.7   
13          13    S_1   C_3  1304      M  street_2     195      70  85.2   
14          14    S_1   C_3  1305      F  street_5     187      69  61.7   
15          15    S_2   C_1  2101      M  street_7     174      84  83.3   
16          16    S_2   C_1  2102      F  street_6     161      61  50.6   
17          17    S_2   C_1  2103      M  street_4     157      61  52.5   
18          18    S_2   C_1  2104      F  street_5     159      97  72.2   
19          19    S_2   C_1  2105      M  street_4     170      81  34.2   
20          20    S_2   C_2  2201      M  street_5     193     100  39.1   
21          21    S_2   C_2  2202      F  street_7     194      77  68.5   
22          22    S_2   C_2  2203      M  street_4     155      91  73.8   
23          23    S_2   C_2  2204      M  street_1     175      74  47.2   
24          24    S_2   C_2  2205      F  street_7     183      76  85.4   
25          25    S_2   C_3  2301      F  street_4     157      78  72.3   
26          26    S_2   C_3  2302      M  street_5     171      88  32.7   
27          27    S_2   C_3  2303      F  street_7     190      99  65.9   
28          28    S_2   C_3  2304      F  street_6     164      81  95.5   
29          29    S_2   C_3  2305      M  street_4     187      73  48.9   
30          30    S_2   C_4  2401      F  street_2     192      62  45.3   
31          31    S_2   C_4  2402      M  street_7     166      82  48.7   
32          32    S_2   C_4  2403      F  street_6     158      60  59.7   
33          33    S_2   C_4  2404      F  street_2     160      84  67.7   
34          34    S_2   C_4  2405      F  street_6     193      54  47.6   

   Physics  
0       A+  
1       B+  
2       B+  
3       B-  
4       B+  
5       A-  
6       B-  
7       A+  
8        B  
9       B-  
10      B+  
11      A-  
12       B  
13       A  
14      B-  
15       C  
16      B+  
17      B-  
18      B+  
19       A  
20       B  
21      B+  
22      A+  
23      B-  
24       B  
25      B+  
26       A  
27       C  
28      A-  
29       B  
30       A  
31       B  
32      B+  
33       B  
34       B  >
data.drop(columns=['Unnamed: 0'], inplace=True)  # columns= already selects the axis, so axis=1 is not needed
data.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

Creating categoricals and their properties

Creating categorical variables

  • Create from a Series
pd.Series(['a','b','c','a'],dtype='category')
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
  • Create by specifying the dtype of a DataFrame column
temp_data=pd.DataFrame({'A':pd.Series(['a','b','c','a'],dtype='category'),'B':list('abcd')})
temp_data.dtypes
A    category
B      object
dtype: object
  • Create with the built-in Categorical type
cat=pd.Categorical(['a','b','c','a'],categories=['a','b','c'])
pd.Series(cat)
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
  • Create with the cut function
pd.cut(np.random.randint(0,60,5),[0,10,30,60])  # the intervals themselves are used as labels
[(10, 30], (30, 60], (10, 30], (30, 60], (10, 30]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]
# categorical variables created by cut are ordered
pd.cut(np.random.randint(0,60,5),[0,10,30,60],right=False,labels=['0-10','10-30','30-60'])
[0-10, 30-60, 30-60, 10-30, 0-10]
Categories (3, object): [0-10 < 10-30 < 30-60]
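Because the calls above draw random integers, their output changes on every run. The following sketch uses fixed, hypothetical values (the names `depths`, `right_closed`, and `left_closed` are illustrative, not from the original notes) to show how `right=False` moves a boundary value into the next bin and that the result of `cut` is ordered.

```python
import pandas as pd

depths = [3, 12, 45, 30]  # hypothetical fixed values instead of random ones
bins = [0, 10, 30, 60]

# Default right=True: intervals are (0, 10], (10, 30], (30, 60]
right_closed = pd.cut(depths, bins)

# right=False: intervals are [0, 10), [10, 30), [30, 60), so 30 moves up one bin
left_closed = pd.cut(depths, bins, right=False, labels=['0-10', '10-30', '30-60'])

print(right_closed.ordered)         # True
print(right_closed.codes.tolist())  # [0, 1, 2, 1]
print(list(left_closed))            # ['0-10', '10-30', '30-60', '30-60']
```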

Structure of a categorical variable

  • A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered)
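A small sketch of these three parts on a hypothetical Series `s_demo`. The `.cat.codes` accessor is an addition not covered in these notes; it exposes the integer position of each value within the categories.

```python
import pandas as pd

s_demo = pd.Series(pd.Categorical(['a', 'b', 'c', 'a'],
                                  categories=['a', 'b', 'c', 'd']))

print(s_demo.tolist())                 # the values: ['a', 'b', 'c', 'a']
print(s_demo.cat.categories.tolist())  # the categories: ['a', 'b', 'c', 'd']
print(s_demo.cat.ordered)              # whether it is ordered: False
print(s_demo.cat.codes.tolist())       # position of each value: [0, 1, 2, 0]
```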

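The describe method

  • This method summarizes a categorical Series: the number of non-missing values, the number of distinct element values (not the number of categories), the most frequent element, and its frequency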
s=pd.Series(pd.Categorical(['a','b','c','a',np.nan],categories=['a','b','c','d']))
s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
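The difference between distinct element values and categories can be checked directly. This is a minimal sketch on a hypothetical Series `s_demo` built the same way as `s` above.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(pd.Categorical(['a', 'b', 'c', 'a', np.nan],
                                  categories=['a', 'b', 'c', 'd']))

print(s_demo.count())                # 4 non-missing values
print(s_demo.nunique())              # 3 distinct values actually present
print(len(s_demo.cat.categories))    # 4 declared categories ('d' is unused)
```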

The categories and ordered attributes

s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
s.cat.ordered
False

Modifying categories

  • Modify with set_categories
s.cat.set_categories(['new_a','c'])
0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): [new_a, c]
cat
[a, b, c, a]
Categories (3, object): [a, b, c]
s
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (4, object): [a, b, c, d]
  • Rename with rename_categories
s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])
0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]
  • Rename selected categories with a dictionary
s.cat.rename_categories({'a':'new_a','b':'new_b'})
0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]
  • Add categories with add_categories
s.cat.add_categories(['e'])
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]
  • Remove categories with remove_categories
s.cat.remove_categories(['d'])
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
  • Remove categories that do not appear among the element values
s.cat.remove_unused_categories()
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]

Sorting categorical variables

Establishing an order

  • To turn a Series into an ordered variable, use the as_ordered method
s=pd.Series(['a','b','c','d']).astype('category').cat.as_ordered()
s
# this could be applied to my own time series (to put that sequence of times in order)
0    a
1    b
2    c
3    d
dtype: category
Categories (4, object): [a < b < c < d]
  • To go back to an unordered variable, use as_unordered
s.cat.as_unordered()
0    a
1    b
2    c
3    d
dtype: category
Categories (4, object): [a, b, c, d]
  • Use the ordered parameter of set_categories
pd.Series(['a','b','c','a']).astype('category').cat.set_categories(['a','c','d'],
                                                                   ordered=True)
0      a
1    NaN
2      c
3      a
dtype: category
Categories (3, object): [a < c < d]
  • Use the reorder_categories method
s=pd.Series(['a','d','c','a']).astype('category')
s.cat.reorder_categories(['a','c','d'],ordered=True)
# the distinguishing feature of this method: the new categories must be the same set as the original categories
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]
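The same-set requirement can be seen in a short sketch. The names `s_demo`, `reordered`, and `raised` are illustrative; the second call passes 'b', which was never a category, and pandas rejects it with a ValueError.

```python
import pandas as pd

s_demo = pd.Series(['a', 'd', 'c', 'a']).astype('category')

# Same set of categories in a new order: accepted
reordered = s_demo.cat.reorder_categories(['d', 'c', 'a'], ordered=True)
print(reordered.cat.categories.tolist())  # ['d', 'c', 'a']

# A different set of categories: rejected
try:
    s_demo.cat.reorder_categories(['a', 'b', 'c'], ordered=True)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

When the category set itself has to change, set_categories is the method shown earlier for that purpose.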

Sorting

s=pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True).head(20)  # returns a new Series; s itself stays unordered because the result is not assigned back
0        fair
1         bad
2       awful
3       awful
4     perfect
5     perfect
6     perfect
7         bad
8        good
9        good
10        bad
11        bad
12      awful
13      awful
14        bad
15    perfect
16       good
17      awful
18      awful
19       fair
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]
s.sort_values(ascending=True).head()
29    awful
26    awful
2     awful
3     awful
25    awful
dtype: category
Categories (5, object): [awful, bad, fair, good, perfect]
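Note that the sorted result above still reports unordered categories, since `set_categories` returns a new Series and `s` was never reassigned; the order only looks right because the default categories are alphabetical. A minimal sketch with a hypothetical scale (`sizes`, not from the notes) where alphabetical order and logical order differ:

```python
import pandas as pd

sizes = pd.Series(['medium', 'high', 'low', 'medium']).astype('category')

# Default categories are sorted alphabetically: high < low < medium
alphabetical = sizes.sort_values().tolist()
print(alphabetical)  # ['high', 'low', 'medium', 'medium']

# Assign the result back so the custom order is actually kept
sizes = sizes.cat.set_categories(['low', 'medium', 'high'], ordered=True)
logical = sizes.sort_values().tolist()
print(logical)       # ['low', 'medium', 'medium', 'high']
```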
s.describe()
count      50
unique      5
top       bad
freq       15
dtype: object
data_sort=pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
data_sort.head(50)
             value
cat
fair     -2.811450
bad       0.783449
awful     1.721855
awful     0.118119
perfect  -1.317575
perfect  -0.500355
perfect   0.432203
bad      -1.554244
good     -1.164727
good     -1.244924
bad      -0.060621
bad      -0.779449
awful     0.200825
awful    -1.588781
bad      -1.185241
perfect  -0.478165
good     -1.119550
awful     0.766782
awful     0.147791
fair      0.042877
good      1.005659
perfect   0.383797
good     -0.796085
fair     -0.725092
bad       0.036638
awful    -2.128523
awful    -0.763601
fair      1.242681
bad       1.703662
awful     1.150105
fair     -0.361344
awful    -1.371411
bad      -0.258389
bad      -1.519332
fair      1.727239
bad       1.310984
awful    -0.833761
fair     -0.272145
good     -0.348845
bad      -1.244156
good      0.520846
bad       0.284184
perfect   0.582454
bad       0.969094
good      2.186216
perfect  -1.115565
awful    -0.840769
bad       0.631311
bad       0.245196
good      0.386491
data_sort.info()
<class 'pandas.core.frame.DataFrame'>
CategoricalIndex: 50 entries, fair to good
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   value   50 non-null     float64
dtypes: float64(1)
memory usage: 1.9 KB
data_sort.sort_index()
             value
cat
awful     1.150105
awful    -0.763601
awful     1.721855
awful     0.118119
awful    -2.128523
awful    -0.840769
awful    -1.371411
awful     0.147791
awful     0.200825
awful    -1.588781
awful     0.766782
awful    -0.833761
bad       0.245196
bad      -0.258389
bad      -1.519332
bad       1.310984
bad       0.036638
bad       1.703662
bad      -0.779449
bad      -0.060621
bad      -1.244156
bad       0.284184
bad      -1.554244
bad       0.969094
bad       0.631311
bad       0.783449
bad      -1.185241
fair      1.727239
fair     -0.361344
fair     -0.272145
fair     -2.811450
fair     -0.725092
fair      1.242681
fair      0.042877
good      2.186216
good     -1.164727
good     -1.244924
good     -0.348845
good      0.520846
good     -1.119550
good      1.005659
good     -0.796085
good      0.386491
perfect   0.383797
perfect   0.582454
perfect   0.432203
perfect  -1.115565
perfect  -0.500355
perfect  -1.317575
perfect  -0.478165

Comparison operations on categorical variables

Comparison with a scalar or an equal-length sequence

  • Scalar comparison
s=pd.Series(['a','d','c','a']).astype('category')
s=='a'
0     True
1    False
2    False
3     True
dtype: bool
  • Comparison with an equal-length sequence
s==list('abcd')
0     True
1    False
2     True
3    False
dtype: bool

Comparison with another categorical variable

  • Equality tests (both == and !=)
s=pd.Series(['a','d','c','a']).astype('category')
s==s
0    True
1    True
2    True
3    True
dtype: bool
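The example above only compares a Series with itself. As a hedged sketch of the general case (the names `cat_type`, `left`, `right`, and `other` are hypothetical): two categoricals can be compared when they share the same categories, order comparisons such as `<` additionally need ordered categories, and a mismatch in categories raises a TypeError.

```python
import pandas as pd

cat_type = pd.CategoricalDtype(['a', 'c', 'd'], ordered=True)
left = pd.Series(['a', 'd', 'c', 'a'], dtype=cat_type)
right = pd.Series(['c', 'd', 'a', 'a'], dtype=cat_type)

# Same categories and order: both equality and order comparisons work
not_equal = (left != right).tolist()
less_than = (left < right).tolist()
print(not_equal)  # [True, False, True, False]
print(less_than)  # [True, False, False, False]

# Different categories: the comparison is rejected
other = pd.Series(['a', 'd', 'c', 'a'],
                  dtype=pd.CategoricalDtype(['a', 'c', 'd', 'e'], ordered=True))
try:
    left == other
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```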

Questions and exercises

Questions

1. How is union_categoricals used? What does it do?

  • To combine categoricals while keeping the categorical dtype, use union_categoricals
from pandas.api.types import union_categoricals
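A minimal usage sketch, with hypothetical inputs `c1` and `c2`: `union_categoricals` takes a list of categoricals and returns a single Categorical whose categories are the union of the inputs' categories.

```python
import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(['a', 'b'])
c2 = pd.Categorical(['b', 'c'])

combined = union_categoricals([c1, c2])
print(list(combined))                # ['a', 'b', 'b', 'c']
print(combined.categories.tolist())  # ['a', 'b', 'c']
```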

2. When two Series are concatenated vertically with concat, is the result always a categorical variable? When is it not?

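  • No. The input Series need not be categorical to begin with; for example, concatenating a time Series with a categorical Series does not give a categorical result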
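A further case, sketched here with hypothetical Series: even when both inputs are categorical, the result keeps the category dtype only if their categories match; otherwise it falls back to object.

```python
import pandas as pd

a = pd.Series(['a', 'b'], dtype='category')
b = pd.Series(['b', 'a'], dtype='category')   # same categories: a, b
c = pd.Series(['b', 'c'], dtype='category')   # different categories: b, c

same = pd.concat([a, b], ignore_index=True)
different = pd.concat([a, c], ignore_index=True)

print(same.dtype)       # category
print(different.dtype)  # object
```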

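3. When using groupby or value_counts, how do the results for a categorical variable differ from those for an ordinary variable?

s.value_counts  # without parentheses this displays the bound method and the Series repr, not the counts; call s.value_counts() to count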
<bound method IndexOpsMixin.value_counts of 0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a, c, d]>
s_1=pd.Series(['a','d','c','a']).astype('string')
s_1
0    a
1    d
2    c
3    a
dtype: string
s_1.value_counts  # same issue: the method is not called, so no counts are shown
<bound method IndexOpsMixin.value_counts of 0    a
1    d
2    c
3    a
dtype: string>
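Since the calls above never invoke the method, the actual difference is not visible in the output. The following sketch, using hypothetical data with an unused category 'b', suggests one answer: for a categorical, `value_counts()` reports every category, including unused ones with a count of 0, and `groupby` with `observed=False` likewise keeps the empty group, while an ordinary object Series only reports values that occur.

```python
import pandas as pd

s_cat = pd.Series(pd.Categorical(['a', 'd', 'c', 'a'],
                                 categories=['a', 'b', 'c', 'd']))
s_obj = s_cat.astype(object)

cat_counts = s_cat.value_counts().to_dict()
obj_counts = s_obj.value_counts().to_dict()
print(cat_counts)  # includes 'b' with a count of 0
print(obj_counts)  # only 'a', 'c', 'd'

df = pd.DataFrame({'key': s_cat, 'x': [1, 2, 3, 4]})
# observed=False is passed explicitly because the default differs between pandas versions
group_sums = df.groupby('key', observed=False)['x'].sum().to_dict()
print(group_sums)  # the unused category 'b' appears with sum 0
```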

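Exercises

1. Continuing with the earthquake dataset from Chapter 4, solve the following:

  • Divide depth (the 深度 column) into seven levels using the bins [0,5,10,15,20,30,50,np.inf]. Use the depth levels Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ as the index and sort from shallow to deep.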
data_earth=pd.read_csv(r'D:\jupyter Notebook\天池比赛\pandas学习\joyful-pandas-master\data\Earthquake.csv')
data_earth.head(20)
            日期           时间     维度     经度          方向   距离    深度   烈度
0   2003.05.20  12:17:44 AM  39.04  40.38        west  0.1  10.0  0.0
1   2007.08.01  12:03:08 AM  40.79  30.09        west  0.1   5.2  4.0
2   1978.05.07  12:41:37 AM  38.58  27.61  south_west  0.1   0.0  0.0
3   1997.03.22  12:31:45 AM  39.47  36.44  south_west  0.1  10.0  0.0
4   2000.04.02  12:57:38 AM  40.80  30.24  south_west  0.1   7.0  0.0
5   2005.01.21  12:04:03 AM  37.11  27.75  south_west  0.1  32.8  0.0
6   2012.06.24  12:07:22 AM  38.75  43.61  south_west  0.1   9.4  4.5
7   1987.12.31  12:49:54 AM  39.43  27.98  south_east  0.1  26.0  0.0
8   2000.02.07  12:11:45 AM  40.05  34.07  south_east  0.1   1.0  0.0
9   2011.10.28  12:47:56 AM  38.76  43.54  south_east  0.1   3.1  4.2
10  2013.05.01  12:47:56 AM  37.31  37.11  south_east  0.1   9.5  3.5
11  1989.04.27  12:45:19 AM  37.04  28.04       south  0.1   9.0  0.0
12  1999.11.26  12:42:20 AM  37.77  38.54       south  0.1  13.0  0.0
13  1999.12.20  12:41:56 AM  40.86  30.99       south  0.1   9.0  0.0
14  1984.02.02  12:10:29 AM  37.21  30.81  north_west  0.1  15.0  0.0
15  2011.05.22  12:49:49 AM  39.13  29.04  north_west  0.1   7.2  3.9
16  1971.05.20  12:08:46 AM  37.72  30.00  north_east  0.1   5.0  0.0
17  1985.01.28  12:20:56 AM  38.85  29.06  north_east  0.1   4.0  0.0
18  1997.05.31  12:59:03 AM  39.89  39.79  north_east  0.1  26.0  0.0
19  2005.07.24  12:36:10 AM  36.96  36.03  north_east  0.1  22.0  4.1
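# bins are right-closed by default, so a depth of exactly 0 falls outside (0, 5] and becomes NaN; include_lowest=True would keep it in level Ⅰ
data_earth['深度']=pd.cut(data_earth['深度'],[0,5,10,15,20,30,50,np.inf],labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])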
data_earth.head()
           日期           时间     维度     经度          方向   距离   深度   烈度
0  2003.05.20  12:17:44 AM  39.04  40.38        west  0.1    Ⅱ  0.0
1  2007.08.01  12:03:08 AM  40.79  30.09        west  0.1    Ⅱ  4.0
2  1978.05.07  12:41:37 AM  38.58  27.61  south_west  0.1  NaN  0.0
3  1997.03.22  12:31:45 AM  39.47  36.44  south_west  0.1    Ⅱ  0.0
4  2000.04.02  12:57:38 AM  40.80  30.24  south_west  0.1    Ⅱ  0.0
data_earth.set_index('深度').sort_index()
             日期           时间     维度     经度          方向    距离   烈度
深度
Ⅰ    1969.08.29  12:46:05 AM  38.00  36.50  south_east   2.3  0.0
Ⅰ    2007.02.11  12:43:00 AM  38.41  39.13  south_east   2.2  4.2
Ⅰ    2007.03.08  12:57:56 AM  39.08  40.40  south_east   2.2  3.9
Ⅰ    2007.04.14  12:30:37 AM  38.31  39.29  south_east   2.2  4.2
Ⅰ    2011.05.20  12:39:03 AM  39.11  29.10  south_east   2.2  3.5
...         ...          ...    ...    ...         ...   ...  ...
NaN  1977.03.09  12:21:25 AM  36.34  29.04  south_west  15.8  0.0
NaN  1975.03.22  12:25:25 AM  40.38  25.94  north_east  17.0  3.9
NaN  1991.10.10  12:07:02 AM  40.41  25.87       north  20.0  0.0
NaN  2000.10.22  12:18:40 AM  38.54  44.56  north_east  23.4  0.0
NaN  1978.04.02  12:35:03 AM  42.40  26.60       north  48.5  0.0

10062 rows × 7 columns
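The NaN rows at the end of the sorted result are consistent with depths of exactly 0, which fall outside the first right-closed bin (0, 5]. A hedged sketch on hypothetical depth values (not the real dataset) shows how `include_lowest=True` keeps them in level Ⅰ:

```python
import numpy as np
import pandas as pd

depth = pd.Series([0.0, 1.0, 5.2, 32.8, 60.0])  # hypothetical depths
bins = [0, 5, 10, 15, 20, 30, 50, np.inf]
labels = ['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ']

default_cut = pd.cut(depth, bins, labels=labels)
inclusive_cut = pd.cut(depth, bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())  # [True, False, False, False, False]
print(inclusive_cut.tolist())       # ['Ⅰ', 'Ⅰ', 'Ⅱ', 'Ⅵ', 'Ⅶ']
```

Lowering the first bin edge slightly, as the intensity bins in the next exercise do with -1e-10, has a similar effect.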

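2. Building on (a), divide intensity (the 烈度 column) into 4 levels using the bins [0,3,4,5,np.inf], then build a multi-level index from the depth and intensity levels for the southern region and sort by it.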

data_earth['烈度']=pd.cut(data_earth['烈度'],[-1e-10,3,4,5,np.inf],labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ'])
data_earth.set_index(['深度','烈度']).sort_index()
                  日期           时间     维度     经度          方向   距离
深度   烈度
Ⅰ    Ⅰ    2000.02.07  12:11:45 AM  40.05  34.07  south_east  0.1
          1971.05.20  12:08:46 AM  37.72  30.00  north_east  0.1
          1985.01.28  12:20:56 AM  38.85  29.06  north_east  0.1
          1990.07.05  12:43:04 AM  37.87  29.18        east  0.1
          1985.01.07  12:37:08 AM  39.24  27.80        west  0.2
...              ...          ...    ...    ...         ...  ...
NaN       1981.12.26  12:53:35 AM  40.15  28.74        east  2.6
          1975.04.19  12:52:58 AM  37.69  27.30  north_west  3.3
          1976.05.15  12:05:57 AM  39.33  29.05  north_east  3.6
          1975.01.30  12:51:25 AM  39.82  28.60  south_east  4.7
          1980.03.03  12:15:06 AM  38.13  27.75  north_east  4.8

10062 rows × 6 columns
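The result above covers all 10062 rows, so the restriction to the southern region asked for in the exercise is not applied. One possible sketch on a small hypothetical frame follows; the English column names and the data are made up, and it assumes "southern region" means the direction column equals 'south', whereas the real dataset uses the columns 方向, 深度, and 烈度.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the earthquake data
quakes = pd.DataFrame({
    'direction': ['south', 'north', 'south', 'south', 'west'],
    'depth': [12.0, 3.0, 1.0, 1.0, 8.0],
    'intensity': [4.2, 0.0, 0.0, 3.5, 4.0],
})

quakes['depth_level'] = pd.cut(
    quakes['depth'], [-1e-10, 5, 10, 15, 20, 30, 50, np.inf],
    labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ'])
quakes['intensity_level'] = pd.cut(
    quakes['intensity'], [-1e-10, 3, 4, 5, np.inf],
    labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ'])

# Keep only the southern rows, then index and sort by both levels
south = quakes[quakes['direction'] == 'south']
result = south.set_index(['depth_level', 'intensity_level']).sort_index()

print(result.index.tolist())  # [('Ⅰ', 'Ⅰ'), ('Ⅰ', 'Ⅱ'), ('Ⅲ', 'Ⅲ')]
```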

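Summary

  1. I had not studied categorical variables themselves before; I had only used them in certain algorithms
  2. On their own they are not that important; they are mainly a data type, so some care is needed when using them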
