Chapter 3: Indexing

import numpy as np
import pandas as pd

I. Indexers

1. Column indexing of a table

df = pd.read_csv('../data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

Selecting multiple columns:

df[['Gender', 'Name']].head()
   Gender            Name
0  Female    Gaopeng Yang
1    Male  Changqiang You
2    Male         Mei Sun
3  Female    Xiaojuan Sun
4    Male     Gaojuan You

Selecting a single column with attribute access:

df.Name.head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object
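
As a small sketch of the two access styles above, the snippet below compares bracket access with attribute access. The table toy and its column names are hypothetical and are not part of learn_pandas.csv; the column "Test Number" is chosen on purpose to show that a name containing a space can only be reached through brackets.

```python
import pandas as pd

# Hypothetical toy table; "Test Number" deliberately contains a space.
toy = pd.DataFrame({"Name": ["Ann", "Bob"], "Test Number": [1, 2]})

by_key = toy["Name"]    # bracket access works for any column name
by_attr = toy.Name      # attribute access needs an identifier-like name
same = by_key.equals(by_attr)

# A list of column names returns a DataFrame, in the requested column order.
multi = toy[["Test Number", "Name"]]
```

Bracket access is the safer default, since attribute access can also collide with existing DataFrame attributes and methods.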

2. Row indexing of a Series

【a】Series with a string index

s = pd.Series([1, 2, 3, 4, 5, 6,7], index=['a','a', 'b', 'a', 'a', 'a', 'c'])
s['a']
a    1
a    2
a    4
a    5
a    6
dtype: int64

Multiple labels:

s[['c', 'b']]
c    7
b    3
dtype: int64

【b】Using [int] or [int_list] on an integer-indexed Series returns the values whose index labels match:

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]
1    a
1    c
dtype: object
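
The label-based behaviour shown so far can be summarised in a short sketch. The Series ser below is a hypothetical example, not data from the chapter: a repeated label yields a Series, a unique label yields a scalar, and a list of labels yields a Series in the requested order.

```python
import pandas as pd

# Hypothetical Series with a repeated label "a" and a unique label "c".
ser = pd.Series([10, 20, 30, 40], index=["a", "b", "a", "c"])

repeated = ser["a"]        # repeated label -> Series holding every match
unique = ser["c"]          # unique label -> a single scalar
picked = ser[["c", "b"]]   # list of labels -> Series in the requested order
```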

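With an integer slice, the values at the corresponding positions are taken, so the slice is positional rather than label-based:

s[1:-1:2]
3    b
2    d
dtype: object

3. The loc indexer
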
df_demo = df.set_index('Name')
df_demo.head()
                                       School      Grade  Gender  Weight  Transfer
Name
Gaopeng Yang    Shanghai Jiao Tong University   Freshman  Female    46.0         N
Changqiang You              Peking University   Freshman    Male    70.0         N
Mei Sun         Shanghai Jiao Tong University     Senior    Male    89.0         N
Xiaojuan Sun                 Fudan University  Sophomore  Female    41.0         N
Gaojuan You                  Fudan University  Sophomore    Male    74.0         N
df_demo.loc['Quan Zhao'] # this name is unique
School      Shanghai Jiao Tong University
Grade                              Junior
Gender                             Female
Weight                                 53
Transfer                                N
Name: Quan Zhao, dtype: object

Rows and columns can also be selected at the same time:

df_demo.loc['Qiang Sun', 'School'] # returns a Series
Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao', 'School'] # returns a single element
'Shanghai Jiao Tong University'
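
The difference between a repeated and a unique row label can be reproduced on a tiny table. This is only a sketch under assumed data: the frame toy, its names, and its values are hypothetical stand-ins for df_demo.

```python
import pandas as pd

# Hypothetical stand-in for df_demo: "Ann" appears twice, "Bob" once.
toy = pd.DataFrame(
    {"School": ["X", "Y", "Z"], "Weight": [50.0, 60.0, 70.0]},
    index=pd.Index(["Ann", "Bob", "Ann"], name="Name"),
)

dup = toy.loc["Ann", "School"]   # repeated label -> Series with two entries
one = toy.loc["Bob", "School"]   # unique label -> a single element
row = toy.loc["Bob"]             # row label only -> the whole row as a Series
```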

4. The iloc indexer

df_demo.iloc[1, 1] # second row, second column
'Freshman'
df_demo.iloc[1: 4, 2:4] # the slice excludes the end point
                Gender  Weight
Name
Changqiang You    Male    70.0
Mei Sun           Male    89.0
Xiaojuan Sun    Female    41.0

For a Series, iloc likewise returns the value or sub-series at the given positions:

df_demo.School.iloc[1]
'Peking University'
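
A minimal sketch of positional indexing follows, using a hypothetical frame toy. It also contrasts the half-open position slices of iloc with label slices in loc, which include both end points.

```python
import pandas as pd

# Hypothetical 4x3 table with string row labels.
toy = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]},
    index=list("wxyz"),
)

cell = toy.iloc[1, 1]                  # second row, second column
block = toy.iloc[1:3, 0:2]             # positions 1..2 and 0..1, end excluded
by_label = toy.loc["x":"y", "A":"B"]   # label slices include both end points
last = toy["C"].iloc[-1]               # iloc on a Series, last position
```

Here block and by_label select the same cells even though the slices are written differently, because only the iloc slice drops its end point.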

5. The query method

df.query('((School == "Fudan University")&'
         ' (Grade == "Senior")&'
         ' (Weight > 70))|'
         '((School == "Peking University")&'
         ' (Grade != "Senior")&'
         ' (Weight > 80))')
                School     Grade            Name  Gender  Weight  Transfer
38   Peking University  Freshman       Qiang Han    Male    87.0         N
66    Fudan University    Senior  Chengpeng Zhou    Male    81.0         N
99   Peking University  Freshman  Changpeng Zhao    Male    83.0         N
131   Fudan University    Senior  Chengpeng Qian    Male    73.0         Y

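Inside a query expression, every column name of the DataFrame is registered for the user, so all methods of the corresponding Series can be called exactly as in an ordinary function call. For example, to find the students whose weight is above the mean: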

df.query('Weight > Weight.mean()').head()
                           School      Grade            Name  Gender  Weight  Transfer
1               Peking University   Freshman  Changqiang You    Male    70.0         N
2   Shanghai Jiao Tong University     Senior         Mei Sun    Male    89.0         N
4                Fudan University  Sophomore     Gaojuan You    Male    74.0         N
10  Shanghai Jiao Tong University   Freshman   Xiaopeng Zhou    Male    74.0         N
14            Tsinghua University     Senior    Xiaomei Zhou  Female    57.0         N
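
To make the link between query and ordinary boolean indexing concrete, the sketch below writes the same filter both ways. The frame toy and its values are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical data reusing the column names School and Weight.
toy = pd.DataFrame(
    {
        "School": ["Fudan", "Peking", "Fudan", "Peking"],
        "Weight": [80.0, 85.0, 60.0, 55.0],
    }
)

# Column names are available directly inside the expression string.
by_query = toy.query('(School == "Fudan") & (Weight > Weight.mean())')

# The same condition as an explicit boolean mask.
mask = (toy["School"] == "Fudan") & (toy["Weight"] > toy["Weight"].mean())
by_mask = toy[mask]
```

Both forms return the same rows; the query string mainly saves repeating the table name in every condition.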

II. Multi-level Indexes

1. Multi-level indexes and the structure of their tables

np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'), df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), (np.random.randn(8,4)*5 + 65).tolist()],
                        index = multi_index, columns = multi_column).round(1)
df_multi
Indicator       Height                             Weight
Grade         Freshman Senior Sophomore Junior  Freshman Senior Sophomore Junior
School Gender
A      Female    171.8  165.0     167.9  174.2      60.6   55.1      63.3   65.8
       Male      172.3  158.1     167.8  162.2      71.2   71.0      63.1   63.5
B      Female    162.5  165.1     163.7  170.3      59.8   57.9      56.5   74.8
       Male      166.8  163.6     165.2  164.7      62.5   62.8      58.7   68.9
C      Female    170.5  162.0     164.6  158.7      56.9   63.9      60.5   66.9
       Male      150.2  166.3     167.3  159.3      62.4   59.1      64.9   67.1
D      Female    174.3  155.7     163.2  162.1      65.3   66.5      61.8   63.2
       Male      170.7  170.3     163.8  164.9      61.6   63.2      60.9   56.4

3. The IndexSlice object

np.random.seed(0)
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)), index=mul_index1, columns=mul_index2)
df_ex
Big          D        E        F
Small        d  e  f  d  e  f  d  e  f
Upper Lower
A     a      3  6 -9 -6 -6 -2  0  9 -5
      b     -3  3 -8 -3 -2  5  8 -4  4
      c     -1  0  7 -4  6  6 -9  9 -6
B     a      8  5 -2 -9 -8  0 -9  1 -6
      b      2  9 -7 -9 -9 -5 -4 -3 -1
      c      8  6 -5  0  1 -8 -8 -2  0
C     a     -6 -3  2  5  9 -9  5 -6  3
      b      1  2 -5 -3 -5  6 -6  3 -5
      c     -1  5  6 -6  6  4  7  8 -4

To use the slice object, it must first be defined:

idx = pd.IndexSlice

【a】loc[idx[*,*]]

In this form the levels cannot be sliced separately: the first * selects rows and the second * selects columns, much like plain loc:

df_ex.loc[idx['C':, ('D', 'f'):]]
Big          D  E        F
Small        f  d  e  f  d  e  f
Upper Lower
C     a      2  5  9 -9  5 -6  3
      b     -5 -3 -5  6 -6  3 -5
      c      6 -6  6  4  7  8 -4

Indexing with a boolean sequence is also supported:

df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # columns whose sum is greater than 0
Big          D     F
Small        d  e  e
Upper Lower
A     a      3  6  9
      b     -3  3 -4
      c     -1  0  9

【b】loc[idx[*,*],idx[*,*]]

This form allows each level to be sliced separately: the first idx refers to the row index and the second to the column index.

df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
Big          E     F
Small        e  f  e  f
Upper Lower
A     b     -2  5 -4  4
      c      6  6  9 -6
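
The two forms can be compared side by side on a smaller table whose values are easy to check by hand. This is a sketch under assumed data: the 4x4 frame toy is hypothetical and filled with 0 to 15 so that each selected cell can be traced back to its position.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 table with two-level row and column indexes.
rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

idx = pd.IndexSlice

# Form [a]: one idx, first entry slices rows, second entry slices columns.
flat = toy.loc[idx["B":, ("D", "e"):]]

# Form [b]: two idx objects, each slicing the levels of one axis separately.
layered = toy.loc[idx[:"A", "b":], idx["E":, "e":]]
```

With this data, flat keeps both B rows and the last three columns, while layered keeps only the cell at row (A, b) and column (E, e). Both selections rely on the indexes being sorted, which from_product guarantees here because the input lists are already in order.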

4. Constructing a multi-level index

my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_arrays builds the index from a list of lists, one list for each level:

my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_product builds the index from the Cartesian product of the given lists:

my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1, my_list2], names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
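
Since the three constructors above print the same index, a quick check can confirm that the results are in fact equal. The variable names below are illustrative only.

```python
import pandas as pd

names = ["First", "Second"]

# One tuple per row.
from_tuples = pd.MultiIndex.from_tuples(
    [("a", "cat"), ("a", "dog"), ("b", "cat"), ("b", "dog")], names=names
)
# One list per level, all of the same length.
from_arrays = pd.MultiIndex.from_arrays(
    [list("aabb"), ["cat", "dog"] * 2], names=names
)
# Cartesian product of the per-level values.
from_product = pd.MultiIndex.from_product([["a", "b"], ["cat", "dog"]], names=names)

all_equal = from_tuples.equals(from_arrays) and from_arrays.equals(from_product)
```

from_product is the shortest to write when every combination is wanted; from_tuples and from_arrays are needed when only some combinations exist.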

III. Common Index Methods

1. Swapping and removing index levels

np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3], names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6], names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)), index=mul_index1,  columns=mul_index2)
df_ex
Big                 C               D
Small               c       d       c       d
Other             cat dog cat dog cat dog cat dog
Upper Lower Extra
A     a     alpha   3   6  -9  -6  -6  -2   0   9
            beta   -5  -3   3  -8  -3  -2   5   8
      b     alpha  -4   4  -1   0   7  -4   6   6
            beta   -9   9  -6   8   5  -2  -9  -8
B     a     alpha   0  -9   1  -6   2   9  -7  -9
            beta   -9  -5  -4  -3  -1   8   6  -5
      b     alpha   0   1  -8  -8  -2   0  -6  -3
            beta    2   5   9  -9   5  -6   3   1

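Index levels are exchanged with swaplevel and reorder_levels. The former can only swap two levels, while the latter can rearrange any number of levels. Both accept an axis argument that selects whether the row index or the column index is affected.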
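
A minimal sketch of these operations on the row index is given below. The frame toy is a hypothetical, smaller stand-in for df_ex with a single column, and droplevel is included as one way to remove a level, which is an assumption about what the section heading refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a three-level row index and one value column.
rows = pd.MultiIndex.from_product(
    [["A", "B"], ["a", "b"], ["alpha", "beta"]], names=("Upper", "Lower", "Extra")
)
toy = pd.DataFrame({"value": np.arange(8)}, index=rows)

# swaplevel exchanges exactly two levels; axis=0 targets the row index.
swapped = toy.swaplevel("Upper", "Extra", axis=0)

# reorder_levels takes the full new order, so any permutation is possible.
reordered = toy.reorder_levels(["Extra", "Upper", "Lower"], axis=0)

# droplevel removes a level from the chosen axis.
dropped = toy.droplevel("Extra", axis=0)
```

None of these calls moves the data: the rows stay in their original order and only the level layout of the index changes.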

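2. Reshaping an index

In some situations an index needs to be extended or trimmed. More precisely, given a new index, the elements of the original table are placed at the matching labels of a table built on that new index. For example, the table below holds employee information, and a new table is required that adds one employee, drops the Height column, and adds a Gender column: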

df_reindex = pd.DataFrame({"Weight":[60,70,80], "Height":[176,180,179]}, index=['1001','1003','1002'])
df_reindex
      Weight  Height
1001      60     176
1003      70     180
1002      80     179
df_reindex.reindex(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
      Weight  Gender
1001    60.0     NaN
1002    80.0     NaN
1003    70.0     NaN
1004     NaN     NaN

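This need often arises when filling in missing time points of a time-series index or when extending a range of ID numbers. Note also that the data of the original table is aligned to the new table automatically by index label: employee 1002 originally sits after 1003, whereas the new table has the opposite order, yet reindex aligns the values by label regardless of position.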
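
The label-based alignment described above can be checked directly. The sketch below uses a reduced, hypothetical version of the employee table with only the Weight column.

```python
import pandas as pd

# Reduced version of the employee table: note that 1003 comes before 1002.
toy = pd.DataFrame({"Weight": [60, 70, 80]}, index=["1001", "1003", "1002"])

# The new index lists the IDs in sorted order and adds the unseen ID 1004.
new = toy.reindex(index=["1001", "1002", "1003", "1004"])

weight_1002 = new.loc["1002", "Weight"]         # still 80, matched by label
missing_1004 = pd.isna(new.loc["1004", "Weight"])  # no source row, so NaN
```

The column turns into floats because the missing entry for 1004 is stored as NaN.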

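A function similar to reindex is reindex_like, which reshapes the index of the calling table to match that of the table passed in. For example, if a table with the target index already exists, the operation above can be written equivalently as: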

df_existed = pd.DataFrame(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
      Weight  Gender
1001    60.0     NaN
1002    80.0     NaN
1003    70.0     NaN
1004     NaN     NaN
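
As a rough check of the equivalence between the two calls, the sketch below builds both results and compares them. The name template is illustrative and plays the role of df_existed.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Weight": [60, 70, 80], "Height": [176, 180, 179]},
    index=["1001", "1003", "1002"],
)
# An empty table that only carries the target index and columns.
template = pd.DataFrame(
    index=["1001", "1002", "1003", "1004"], columns=["Weight", "Gender"]
)

via_like = toy.reindex_like(template)
via_reindex = toy.reindex(index=template.index, columns=template.columns)

# equals treats NaN values in the same positions as equal.
same = via_like.equals(via_reindex)
```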