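Hierarchical indexing lets data carry more than one index level on a single axis. Consider the following example:
import pandas as pd
import numpy as np
# The values are random, so the numbers in the outputs below differ from run to run.
data = pd.Series(np.random.randn(9),
                 index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]])
print(data)
# a  1    0.956148
#    2   -0.722317
#    3    0.349010
# b  1    0.226239
#    3   -0.529405
# c  1    1.047720
#    2    0.125013
# d  2    1.106569
#    3    1.529121
# dtype: float64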
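To see what the two index levels look like internally, the index object itself can be inspected. The sketch below is only an illustration: the name s and the deterministic values (range(9) instead of random numbers) are assumptions made so the result is reproducible.

```python
import pandas as pd

# Same index as above, but with deterministic values (an assumption for reproducibility).
s = pd.Series(range(9),
              index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = s.index
print(type(idx).__name__)       # MultiIndex
print(idx.nlevels)              # 2
print(idx.levels[0].tolist())   # ['a', 'b', 'c', 'd']
print(idx.levels[1].tolist())   # [1, 2, 3]
```

Each level stores its distinct labels once; the nine rows are combinations of those labels.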
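With a hierarchical index, subsets can be selected conveniently by partial indexing:
# Select one key of the outer level.
print(data['a'])
# 1    0.470976
# 2   -1.218771
# 3    0.244200
# dtype: float64
# Slice the outer level by label (both endpoints are included).
print(data['a':'c'])
# a  1    0.082245
#    2    0.776815
#    3    0.461049
# b  1   -1.141243
#    3    0.862922
# c  1   -0.703534
#    2   -1.186413
# dtype: float64
# Select several outer keys with a list.
print(data.loc[['a', 'b']])
# a  1    1.199505
#    2   -0.224409
#    3    1.005648
# b  1   -0.178657
#    3   -0.524276
# dtype: float64
# Select on the inner level: every outer key that has inner key 2.
print(data.loc[:, 2])
# a   -0.536039
# c   -1.384110
# d   -0.738628
# dtype: float64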
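Because the outputs above come from random data, the selections are easier to verify with fixed values. The following is a minimal sketch under that assumption; the names s, outer, inner, inner_xs and one are illustrative, and xs is shown as an alternative spelling of the inner-level selection.

```python
import pandas as pd

# Deterministic values (an assumption) so each selection can be checked by eye.
s = pd.Series(range(9),
              index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]])

outer = s['a']                 # partial indexing on the outer level
inner = s.loc[:, 2]            # every outer key that has inner key 2
inner_xs = s.xs(2, level=1)    # the same selection written with xs
one = s.loc[('b', 3)]          # a full key tuple selects a single value

print(outer.tolist())            # [0, 1, 2]
print(inner.to_dict())           # {'a': 1, 'c': 6, 'd': 7}
print(inner.equals(inner_xs))    # True
print(one)                       # 4
```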
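The unstack method rearranges the data into a DataFrame, turning the inner index level into columns. Combinations that do not exist in the Series become NaN:
df = data.unstack()
print(df)
#           1         2         3
# a -2.381009 -0.998445  0.156583
# b -0.394105       NaN  0.629349
# c  1.404356  0.970075       NaN
# d       NaN  0.355427  0.601178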
The inverse operation is stack. In the output below, the NaN entries that unstack introduced are gone again:
df = data.unstack()
print(df.stack())
# a  1   -1.328935
#    2   -2.237134
#    3   -0.029331
# b  1   -0.774613
#    3   -0.437482
# c  1   -0.270287
#    2   -0.307728
# d  2    0.264248
#    3    1.083429
# dtype: float64
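The round trip between unstack and stack can be checked with fixed values. This is a sketch under stated assumptions: the names s, wide and restored are illustrative, and the explicit dropna() and sort_index() calls are added because, as far as I know, the handling of NaN and ordering in stack depends on the pandas version (newer releases have a reimplemented stack).

```python
import pandas as pd

# Deterministic values (an assumption) so the round trip can be verified.
s = pd.Series(range(9),
              index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]])

wide = s.unstack()                          # inner level becomes the columns
n_missing = int(wide.isna().sum().sum())    # combinations absent from s

# dropna() and sort_index() make the result independent of how the
# installed pandas version treats NaN and row order in stack().
restored = wide.stack().dropna().sort_index().astype('int64')

print(wide.shape)                          # (4, 3)
print(n_missing)                           # 3
print(restored.tolist() == s.tolist())     # True
```

The values come back as integers only after the explicit astype call, because introducing NaN in the wide form converts the data to floating point.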
In a DataFrame, either axis can have a hierarchical index:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(12).reshape((4, 3)),
                    index=[list('aabb'), [1, 2, 1, 2]],
                    columns=[['O', 'O', 'C'],
                             ['G', 'R', 'G']])
print(data)
#      O       C
#      G   R   G
# a 1  0   1   2
#   2  3   4   5
# b 1  6   7   8
#   2  9  10  11
The levels can be given names. If a level has a name, it is shown in the console output:
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']
print(data)
# state      O       C
# color      G   R   G
# key1 key2
# a    1     0   1   2
#      2     3   4   5
# b    1     6   7   8
#      2     9  10  11
Partial indexing on the columns selects a group of columns:
print(data['O'])
# color      G   R
# key1 key2
# a    1     0   1
#      2     3   4
# b    1     6   7
#      2     9  10
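Selecting by the outer column level is the simplest case; the inner level and single cells can be reached as well. The sketch below rebuilds the same DataFrame under the illustrative name frame (an assumption, to keep the block self-contained) and uses xs for the inner column level.

```python
import numpy as np
import pandas as pd

# Same layout as the DataFrame above, rebuilt so the block runs on its own.
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[list('aabb'), [1, 2, 1, 2]],
                     columns=[['O', 'O', 'C'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

by_state = frame['O']                             # outer column level
by_color = frame.xs('G', axis=1, level='color')   # inner column level
cell = frame.loc[('b', 1), ('O', 'R')]            # full keys on both axes

print(list(by_state.columns))   # ['G', 'R']
print(list(by_color.columns))   # ['O', 'C']
print(cell)                     # 7
```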
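Reordering and sorting levels
The swaplevel method takes two level numbers or names and returns a new object with those two levels interchanged. The data itself is not reordered. Called without arguments, it swaps the two innermost levels:
print(data.swaplevel('key1', 'key2'))
# state      O       C
# color      G   R   G
# key2 key1
# 1    a     0   1   2
# 2    a     3   4   5
# 1    b     6   7   8
# 2    b     9  10  11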
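That swaplevel only relabels the rows and leaves them in place can be confirmed directly. A minimal sketch, with frame and swapped as illustrative names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[list('aabb'), [1, 2, 1, 2]],
                     columns=[['O', 'O', 'C'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']

# No arguments: the defaults i=-2, j=-1 swap the two innermost levels.
swapped = frame.swaplevel()

print(list(swapped.index.names))                          # ['key2', 'key1']
print(swapped.values.tolist() == frame.values.tolist())   # True
print(swapped.index.is_monotonic_increasing)              # False
```

The last line shows why a sort usually follows a swap: after swapping, the index is no longer in sorted order.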
By default, sort_index sorts lexicographically on all levels, starting with the outermost. Pass level to sort on a specific level first; the remaining levels are then used to break ties:
print(data)
# state      O       C
# color      G   R   G
# key1 key2
# a    1     0   1   2
#      2     3   4   5
# b    1     6   7   8
#      2     9  10  11
print(data.sort_index(level=1))
# state      O       C
# color      G   R   G
# key1 key2
# a    1     0   1   2
# b    1     6   7   8
# a    2     3   4   5
# b    2     9  10  11
print(data.swaplevel(0, 1).sort_index(level=0))
# state      O       C
# color      G   R   G
# key2 key1
# 1    a     0   1   2
#      b     6   7   8
# 2    a     3   4   5
#      b     9  10  11
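The two sorting routes shown above reorder the rows in the same way and differ only in which level ends up outermost. The sketch below checks this; frame, by_key2 and all_levels are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[list('aabb'), [1, 2, 1, 2]],
                     columns=[['O', 'O', 'C'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']

# Sort on key2 first; key1 breaks ties.
by_key2 = frame.sort_index(level='key2')
# Swap the levels, then sort on all levels (the default).
all_levels = frame.swaplevel('key1', 'key2').sort_index()

print(by_key2.index.get_level_values('key1').tolist())   # ['a', 'b', 'a', 'b']
print(by_key2.index.get_level_values('key2').tolist())   # [1, 1, 2, 2]
print(all_levels.index.is_monotonic_increasing)          # True
print(all_levels[('O', 'G')].tolist())                   # [0, 6, 3, 9]
```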
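Summary statistics by level
Data can be aggregated by level along the rows or the columns. Older pandas versions accepted a level argument on methods such as sum (data.sum(level='key2')); that argument was deprecated and then removed in pandas 2.0, so the examples here use groupby(level=...), which gives the same result:
print(data.groupby(level='key2').sum())
# state   O       C
# color   G   R   G
# key2
# 1       6   8  10
# 2      12  14  16
# Column-wise: transpose, group on the (now row) level, and transpose back.
print(data.T.groupby(level='color').sum().T)
# color       G   R
# key1 key2
# a    1      2   1
#      2      8   4
# b    1     14   7
#      2     20  10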
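Grouping on an index level is not limited to sums; any reduction that groupby supports can be applied per level. A minimal sketch with mean on the rows and max on the columns, where frame, row_mean and col_max are illustrative names and the transpose-based column grouping is one possible spelling rather than the only one:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[list('aabb'), [1, 2, 1, 2]],
                     columns=[['O', 'O', 'C'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Mean over all rows that share the same key2 value.
row_mean = frame.groupby(level='key2').mean()
# Max over all columns that share the same color (via a transpose).
col_max = frame.T.groupby(level='color').max().T

print(row_mean.values.tolist())   # [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
print(col_max['G'].tolist())      # [2, 5, 8, 11]
print(col_max['R'].tolist())      # [1, 4, 7, 10]
```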
Indexing with a DataFrame's columns
Sometimes we want to use one or more columns of a DataFrame as the row index, or to move the row index back into the columns. Consider the following DataFrame:
import pandas as pd
data = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                     'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                     'd': [0, 1, 2, 0, 1, 2, 3]})
print(data)
#    a  b    c  d
# 0  0  7  one  0
# 1  1  6  one  1
# 2  2  5  one  2
# 3  3  4  two  0
# 4  4  3  two  1
# 5  5  2  two  2
# 6  6  1  two  3
The set_index method returns a new DataFrame that uses one or more of the columns as its index. By default, the columns used for the index are removed from the DataFrame; pass drop=False to keep them:
print(data.set_index(['c', 'd']))
#        a  b
# c   d
# one 0  0  7
#     1  1  6
#     2  2  5
# two 0  3  4
#     1  4  3
#     2  5  2
#     3  6  1
print(data.set_index(['c', 'd'], drop=False))
#        a  b    c  d
# c   d
# one 0  0  7  one  0
#     1  1  6  one  1
#     2  2  5  one  2
# two 0  3  4  two  0
#     1  4  3  two  1
#     2  5  2  two  2
#     3  6  1  two  3
Conversely, reset_index moves the levels of the row index back into the columns:
f = data.set_index(['c', 'd'])
print(f)
#        a  b
# c   d
# one 0  0  7
#     1  1  6
#     2  2  5
# two 0  3  4
#     1  4  3
#     2  5  2
#     3  6  1
print(f.reset_index())
#      c  d  a  b
# 0  one  0  0  7
# 1  one  1  1  6
# 2  one  2  2  5
# 3  two  0  3  4
# 4  two  1  4  3
# 5  two  2  5  2
# 6  two  3  6  1
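The two methods undo each other except for column order, since reset_index places the former index levels first. The sketch below checks this and also moves only one level back with the level argument; df, indexed, flat and partial are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one'] * 3 + ['two'] * 4,
                   'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = df.set_index(['c', 'd'])        # columns c and d become index levels
flat = indexed.reset_index()              # both levels return as columns
partial = indexed.reset_index(level='d')  # only d returns; c stays in the index

print(list(indexed.index.names))   # ['c', 'd']
print(list(flat.columns))          # ['c', 'd', 'a', 'b']
# Same data as the original once the columns are put back in order.
print(flat[list(df.columns)].values.tolist() == df.values.tolist())   # True
print(list(partial.columns))       # ['d', 'a', 'b']
```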