第三章 索引
import numpy as np
import pandas as pd
一、索引器
1. 表的列索引
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
取出多列:
df[['Gender', 'Name']].head()
Gender | Name | |
---|---|---|
0 | Female | Gaopeng Yang |
1 | Male | Changqiang You |
2 | Male | Mei Sun |
3 | Female | Xiaojuan Sun |
4 | Male | Gaojuan You |
取出单列:
df.Name.head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
2. 序列的行索引
【a】Series
s = pd.Series([1, 2, 3, 4, 5, 6,7], index=['a','a', 'b', 'a', 'a', 'a', 'c'])
s['a']
a 1
a 2
a 4
a 5
a 6
dtype: int64
多个索引:
s[['c', 'b']]
c 6
b 2
dtype: int64
【b】使用[int]
或[int_list]
,则可以取出对应索引元素的值:
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]
1 a
1 c
dtype: object
整数切片,则会取出对应索引位置的值:
3. loc索引器
s[1:-1:2]
a 1
b 3
a 5
dtype: int64
df_demo = df.set_index('Name')
df_demo.head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
df_demo.loc['Quan Zhao'] # 名字唯一
School Shanghai Jiao Tong University
Grade Junior
Gender Female
Weight 53
Transfer N
Name: Quan Zhao, dtype: object
也可以同时选择行和列:
df_demo.loc['Qiang Sun', 'School'] # 返回Series
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao', 'School'] # 返回单个元素
'Shanghai Jiao Tong University'
4. iloc索引器
df_demo.iloc[1, 1] # 第二行第二列
'Freshman'
df_demo.iloc[1: 4, 2:4] # 切片不包含结束端点
Gender | Weight | |
---|---|---|
Name | ||
Changqiang You | Male | 70.0 |
Mei Sun | Male | 89.0 |
Xiaojuan Sun | Female | 41.0 |
对Series
而言同样也可以通过iloc
返回相应位置的值或子序列:
df_demo.School.iloc[1]
'Peking University'
5. query方法
df.query('((School == "Fudan University")&'
' (Grade == "Senior")&'
' (Weight > 70))|'
'((School == "Peking University")&'
' (Grade != "Senior")&'
' (Weight > 80))')
School | Grade | Name | Gender | Weight | Transfer | |
---|---|---|---|---|---|---|
38 | Peking University | Freshman | Qiang Han | Male | 87.0 | N |
66 | Fudan University | Senior | Chengpeng Zhou | Male | 81.0 | N |
99 | Peking University | Freshman | Changpeng Zhao | Male | 83.0 | N |
131 | Fudan University | Senior | Chengpeng Qian | Male | 73.0 | Y |
在query
表达式中,帮用户注册了所有来自DataFrame
的列名,所有属于该Series
的方法都可以被调用,和正常的函数调用并没有区别,例如查询体重超过均值的学生:
df.query('Weight > Weight.mean()').head()
School | Grade | Name | Gender | Weight | Transfer | |
---|---|---|---|---|---|---|
1 | Peking University | Freshman | Changqiang You | Male | 70.0 | N |
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 89.0 | N |
4 | Fudan University | Sophomore | Gaojuan You | Male | 74.0 | N |
10 | Shanghai Jiao Tong University | Freshman | Xiaopeng Zhou | Male | 74.0 | N |
14 | Tsinghua University | Senior | Xiaomei Zhou | Female | 57.0 | N |
二、多级索引
1. 多级索引及其表的结构
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'), df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), (np.random.randn(8,4)*5 + 65).tolist()],
index = multi_index, columns = multi_column).round(1)
df_multi
Indicator | Height | Weight | |||||||
---|---|---|---|---|---|---|---|---|---|
Grade | Freshman | Senior | Sophomore | Junior | Freshman | Senior | Sophomore | Junior | |
School | Gender | ||||||||
A | Female | 171.8 | 165.0 | 167.9 | 174.2 | 60.6 | 55.1 | 63.3 | 65.8 |
Male | 172.3 | 158.1 | 167.8 | 162.2 | 71.2 | 71.0 | 63.1 | 63.5 | |
B | Female | 162.5 | 165.1 | 163.7 | 170.3 | 59.8 | 57.9 | 56.5 | 74.8 |
Male | 166.8 | 163.6 | 165.2 | 164.7 | 62.5 | 62.8 | 58.7 | 68.9 | |
C | Female | 170.5 | 162.0 | 164.6 | 158.7 | 56.9 | 63.9 | 60.5 | 66.9 |
Male | 150.2 | 166.3 | 167.3 | 159.3 | 62.4 | 59.1 | 64.9 | 67.1 | |
D | Female | 174.3 | 155.7 | 163.2 | 162.1 | 65.3 | 66.5 | 61.8 | 63.2 |
Male | 170.7 | 170.3 | 163.8 | 164.9 | 61.6 | 63.2 | 60.9 | 56.4 |
3. IndexSlice对象
np.random.seed(0)
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)), index=mul_index1, columns=mul_index2)
df_ex
Big | D | E | F | |||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | d | e | f | d | e | f | d | e | f | |
Upper | Lower | |||||||||
A | a | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 | -5 |
b | -3 | 3 | -8 | -3 | -2 | 5 | 8 | -4 | 4 | |
c | -1 | 0 | 7 | -4 | 6 | 6 | -9 | 9 | -6 | |
B | a | 8 | 5 | -2 | -9 | -8 | 0 | -9 | 1 | -6 |
b | 2 | 9 | -7 | -9 | -9 | -5 | -4 | -3 | -1 | |
c | 8 | 6 | -5 | 0 | 1 | -8 | -8 | -2 | 0 | |
C | a | -6 | -3 | 2 | 5 | 9 | -9 | 5 | -6 | 3 |
b | 1 | 2 | -5 | -3 | -5 | 6 | -6 | 3 | -5 | |
c | -1 | 5 | 6 | -6 | 6 | 4 | 7 | 8 | -4 |
为了使用silce
对象,先要进行定义:
idx = pd.IndexSlice
【a】loc[idx[*,*]]
型
这种情况并不能进行多层分别切片,前一个*
表示行的选择,后一个*
表示列的选择,与单纯的loc
是类似的:
df_ex.loc[idx['C':, ('D', 'f'):]]
Big | D | E | F | |||||
---|---|---|---|---|---|---|---|---|
Small | f | d | e | f | d | e | f | |
Upper | Lower | |||||||
C | a | 2 | 5 | 9 | -9 | 5 | -6 | 3 |
b | -5 | -3 | -5 | 6 | -6 | 3 | -5 | |
c | 6 | -6 | 6 | 4 | 7 | 8 | -4 |
另外,也支持布尔序列的索引:
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # 列和大于0
Big | D | F | ||
---|---|---|---|---|
Small | d | e | e | |
Upper | Lower | |||
A | a | 3 | 6 | 9 |
b | -3 | 3 | -4 | |
c | -1 | 0 | 9 |
【b】loc[idx[*,*],idx[*,*]]
型
这种情况能够分层进行切片,前一个idx
指代的是行索引,后一个是列索引。
df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
Big | E | F | |||
---|---|---|---|---|---|
Small | e | f | e | f | |
Upper | Lower | ||||
A | b | -2 | 5 | -4 | 4 |
c | 6 | 6 | 9 | -6 |
4. 多级索引的构造
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_arrays
指根据传入列表中,对应层的列表进行构造:
my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_product
指根据给定多个列表的笛卡尔积进行构造:
my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1, my_list2], names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
三、索引的常用方法
1. 索引层的交换和删除
np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3], names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6], names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)), index=mul_index1, columns=mul_index2)
df_ex
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
beta | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 | ||
b | alpha | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 | |
beta | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
索引层的交换由swaplevel
和reorder_levels
完成,前者只能交换两个层,而后者可以交换任意层,两者都可以指定交换的是轴是哪一个,即行索引或列索引:
2. 索引的变形
在某些场合下,需要对索引做一些扩充或者剔除,更具体地要求是给定一个新的索引,把原表中相应的索引对应元素填充到新索引构成的表中。例如,下面的表中给出了员工信息,需要重新制作一张新的表,要求增加一名员工的同时去掉身高列并增加性别列:
df_reindex = pd.DataFrame({"Weight":[60,70,80], "Height":[176,180,179]}, index=['1001','1003','1002'])
df_reindex
Weight | Height | |
---|---|---|
1001 | 60 | 176 |
1003 | 70 | 180 |
1002 | 80 | 179 |
df_reindex.reindex(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
Weight | Gender | |
---|---|---|
1001 | 60.0 | NaN |
1002 | 80.0 | NaN |
1003 | 70.0 | NaN |
1004 | NaN | NaN |
这种需求常出现在时间序列索引的时间点填充以及ID
编号的扩充。另外,需要注意的是原来表中的数据和新表中会根据索引自动对其,例如原先的1002号位置在1003号之后,而新表中相反,那么reindex
中会根据元素对其,与位置无关。
还有一个与reindex
功能类似的函数是reindex_like
,其功能是仿照传入的表的索引来进行被调用表索引的变形。例如,现在以及存在一张表具备了目标索引的条件,那么上述功能可以如下等价地写出:
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
Weight | Gender | |
---|---|---|
1001 | 60.0 | NaN |
1002 | 80.0 | NaN |
1003 | 70.0 | NaN |
1004 | NaN | NaN |