第三章 索引

第三章 索引

import numpy as np
import pandas as pd

索引器

列索引(for表)

【df[‘col1’]】 == 【df.列名】(列名中不包含空格):取相应列
【df[[‘col1’, ‘col2’]]】:取多列组成DataFrame

行索引(for序列)

以字符串为索引的Series

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=['a', 'b', 'a', 'a', 'a', 'c'])
s
a    1
b    2
a    3
a    4
a    5
c    6
dtype: int64
# 取单个索引
s['a']
a    1
a    3
a    4
a    5
dtype: int64
# 取多个索引对应值
s[['c', 'b']]
c    6
b    2
dtype: int64
# 取某两个索引之间的元素,并且两索引在整个索引中唯一出现
# 可使用切片,切片包含两端点
s['c': 'b': -2]
# 以步长为2的长度进行倒序切片
c    6
a    4
b    2
dtype: int64

以整数位索引的Series

如不特别指定列作为索引,则会生成从 0 开始的整数索引作为默认索引
使用 [int] 或 [int_list] ,则可取出对应索引元素的值

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'],
              index=[1, 3, 1, 2, 5, 4])
s
1    a
3    b
1    c
2    d
5    e
4    f
dtype: object
s[1]
1    a
1    c
dtype: object
s[[2,3]]
2    d
3    b
dtype: object
# 切片不包含右端点【索引是字符串,切片包含两端点;索引是整数,切片不包含右端点】
s[1:-1:2]
3    b
2    d
dtype: object

补充说明:
如果不想陷入麻烦,请不要把纯浮点以及任何混合类型(字符串、整数、浮点类型等的混合) 作为索引,否则可能会在具体的操作时报错或者返回非预期的结果

loc 索引

基于元素的 loc 索引器,基于位置的 iloc 索引器

loc[‘行名’, ‘列名’]
loc[,]

  • 的位置:单个元素、元素列表、元素切片、布尔列表、函数
df = pd.read_csv('learn_pandas.csv',
                 usecols = ['School', 'Grade', 'Name', 'Gender',
                            'Weight', 'Transfer'])
df.head()
SchoolGradeNameGenderWeightTransfer
0Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
1Peking UniversityFreshmanChangqiang YouMale70.0N
2Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
3Fudan UniversitySophomoreXiaojuan SunFemale41.0N
4Fudan UniversitySophomoreGaojuan YouMale74.0N

先利用 set_index 方法把 Name 列设为索引

df_demo = df.set_index('Name')
df_demo
SchoolGradeGenderWeightTransfer
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshmanFemale46.0N
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaojuan SunFudan UniversitySophomoreFemale41.0N
Gaojuan YouFudan UniversitySophomoreMale74.0N
..................
Xiaojuan SunFudan UniversityJuniorFemale46.0N
Li ZhaoTsinghua UniversitySeniorFemale50.0N
Chengqiang ChuShanghai Jiao Tong UniversitySeniorFemale45.0N
Chengmei ShenShanghai Jiao Tong UniversitySeniorMale71.0N
Chunpeng LvTsinghua UniversitySophomoreMale51.0N

200 rows × 5 columns

【a】* 为单个元素

# 直接取出相应的行或列,如果该元素在索引中重复则结果为 DataFrame,否则为 Series
df_demo.loc['Qiang Sun'] 
# 多个人叫此名字
SchoolGradeGenderWeightTransfer
Name
Qiang SunTsinghua UniversityJuniorFemale53.0N
Qiang SunTsinghua UniversitySophomoreFemale40.0N
Qiang SunShanghai Jiao Tong UniversityJuniorFemaleNaNN
df_demo.loc['Quan Zhao'] 
# 名字唯一
School      Shanghai Jiao Tong University
Grade                              Junior
Gender                             Female
Weight                                 53
Transfer                                N
Name: Quan Zhao, dtype: object
# 同时选择行和列:
df_demo.loc['Qiang Sun', 'School'] 
# 返回 Series
Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object

【b】* 为元素列表

df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]
SchoolGender
Name
Qiang SunTsinghua UniversityFemale
Qiang SunTsinghua UniversityFemale
Qiang SunShanghai Jiao Tong UniversityFemale
Quan ZhaoShanghai Jiao Tong UniversityFemale

[[‘行名1’,‘行名2’],[‘列名1’,‘列名2’]]【返回2行属性2列属性】
[‘行名1’:‘行名2’,‘列名1’:‘列名2’]【返回行1到行2
列1到列2矩阵】
【c】* 为切片
Series 使用字符串索引提过
如果是唯一值的起点和终点字符,那么就可以使用切片,并且包含两个端点。
如果不唯一则报错。

df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']
SchoolGradeGender
Name
Gaojuan YouFudan UniversitySophomoreMale
Xiaoli QianTsinghua UniversityFreshmanFemale
Qiang ChuShanghai Jiao Tong UniversityFreshmanFemale
Gaoqiang QianTsinghua UniversityJuniorFemale

DataFrame 使用整数索引
使用整数切片的时字符串索引的要求一致。
是元素切片,【包含端点】且起点、终点不允许有重复值。

df_loc_slice_demo = df_demo.copy()
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1)
# 自定义索引,索引第一个数为【df_demo.shape[0]】df_demo行数,【-1】倒序排列
df_loc_slice_demo
SchoolGradeGenderWeightTransfer
200Shanghai Jiao Tong UniversityFreshmanFemale46.0N
199Peking UniversityFreshmanMale70.0N
198Shanghai Jiao Tong UniversitySeniorMale89.0N
197Fudan UniversitySophomoreFemale41.0N
196Fudan UniversitySophomoreMale74.0N
..................
5Fudan UniversityJuniorFemale46.0N
4Tsinghua UniversitySeniorFemale50.0N
3Shanghai Jiao Tong UniversitySeniorFemale45.0N
2Shanghai Jiao Tong UniversitySeniorMale71.0N
1Tsinghua UniversitySophomoreMale51.0N

200 rows × 5 columns

df_loc_slice_demo.loc[5:3]
SchoolGradeGenderWeightTransfer
5Fudan UniversityJuniorFemale46.0N
4Tsinghua UniversitySeniorFemale50.0N
3Shanghai Jiao Tong UniversitySeniorFemale45.0N
df_loc_slice_demo.loc[3:5] 
# 没有返回,说明不是整数位置切片【'3','5'视为字符串形态的元素位置切片】
SchoolGradeGenderWeightTransfer
df_loc_slice_demo.iloc[5:3:-1]
# 顺数第4,5位
SchoolGradeGenderWeightTransfer
195Tsinghua UniversityFreshmanFemale51.0N
196Fudan UniversitySophomoreMale74.0N

【d】* 为布尔列表
传入 loc 的布尔列表与 DataFrame 长度相同
列表为 True 的位置所对应的行会被选中,False 则会被剔除

# 选出体重超过 70kg 的学生:
df_demo.loc[df_demo.Weight>70].head()
SchoolGradeGenderWeightTransfer
Name
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Gaojuan YouFudan UniversitySophomoreMale74.0N
Xiaopeng ZhouShanghai Jiao Tong UniversityFreshmanMale74.0N
Xiaofeng SunTsinghua UniversitySeniorMale71.0N
Qiang ZhengShanghai Jiao Tong UniversitySeniorMale87.0N

传入元素列表
也可通过 isin 方法返回的布尔列表等价写出
【isin()】接收一个列表,判断该列中元素是否在列表中,同时对多个列过滤
df[df[某列].isin(条件)&df[某列].isin(条件)]

# 选出所有大一和大四的同学信息
df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()
# 选出【Grade】列中'Freshman'和'Senior'的所有信息
SchoolGradeGenderWeightTransfer
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshmanFemale46.0N
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaoli QianTsinghua UniversityFreshmanFemale51.0N
Qiang ChuShanghai Jiao Tong UniversityFreshmanFemale52.0N

对于复合条件
可以用 |(或), &(且), ~(取反)的组合来实现

# 选出复旦大学中体重超过 70kg 的大四学生
# 或者北大男生中体重超过 80kg 的非大四的学生
condition_1_1 = df_demo.School == 'Fudan University'
condition_1_2 = df_demo.Grade == 'Senior'
condition_1_3 = df_demo.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = df_demo.School == 'Peking University'
condition_2_2 = df_demo.Grade == 'Senior'
condition_2_3 = df_demo.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
df_demo.loc[condition_1 | condition_2]
SchoolGradeGenderWeightTransfer
Name
Qiang HanPeking UniversityFreshmanMale87.0N
Chengpeng ZhouFudan UniversitySeniorMale81.0N
Changpeng ZhaoPeking UniversityFreshmanMale83.0N
Chengpeng QianFudan UniversitySeniorMale73.0Y

练一练

# 选出所有数值型的列.select_dtypes('number') 
df_demo.select_dtypes('number') 
Weight
Name
Gaopeng Yang46.0
Changqiang You70.0
Mei Sun89.0
Xiaojuan Sun41.0
Gaojuan You74.0
......
Xiaojuan Sun46.0
Li Zhao50.0
Chengqiang Chu45.0
Chengmei Shen71.0
Chunpeng Lv51.0

200 rows × 1 columns

【e】* 为函数
*函数,必须以前面的四种合法形式之一为返回值,函数的输入值为 DataFrame 本身。
假设仍然是上述复合条件筛选的例子,可以把逻辑写入一个函数中再返回。
需要注意的是函数的形式参数 x 本质上即为 df_demo

def condition(x):
    condition_1_1 = x.School == 'Fudan University'
    condition_1_2 = x.Grade == 'Senior'
    condition_1_3 = x.Weight > 70
    condition_1 = condition_1_1 & condition_1_2 & condition_1_3
    condition_2_1 = x.School == 'Peking University'
    condition_2_2 = x.Grade == 'Senior'
    condition_2_3 = x.Weight > 80
    condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
    result = condition_1 | condition_2
    return result

df_demo.loc[condition]
SchoolGradeGenderWeightTransfer
Name
Qiang HanPeking UniversityFreshmanMale87.0N
Chengpeng ZhouFudan UniversitySeniorMale81.0N
Changpeng ZhaoPeking UniversityFreshmanMale83.0N
Chengpeng QianFudan UniversitySeniorMale73.0Y

支持使用 lambda 表达式
其返回值也同样必须是先前提到的四种形式之一

df_demo.loc[lambda x:'Quan Zhao', lambda x:'Gender']
'Female'

用函数进行切片操作
由于函数无法返回如 start: end: step 的切片形式
故返回切片时要用 slice 对象进行包装

df_demo.loc[lambda x: slice('Gaojuan You', 'Gaoqiang Qian')]
SchoolGradeGenderWeightTransfer
Name
Gaojuan YouFudan UniversitySophomoreMale74.0N
Xiaoli QianTsinghua UniversityFreshmanFemale51.0N
Qiang ChuShanghai Jiao Tong UniversityFreshmanFemale52.0N
Gaoqiang QianTsinghua UniversityJuniorFemale50.0N

不要使用链式赋值

赋值时,应当在使用一层索引器后直接进行赋值操作
这样做是由于进行多次索引后赋值是赋在临时返回的 copy 副本上的。

df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain
AB
000
110
2-10
df_chain[df_chain.A!=0].B = 1
df_chain
D:\MyDownloads\Anaconda\lib\site-packages\pandas\core\generic.py:5170: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
AB
000
110
2-10
df_chain.loc[df_chain.A!=0,'B'] = 1 
df_chain
AB
000
111
2-11

iloc 索引

针对位置进行筛选
*位置处也有五类合法对:整数、整数列表、整数切片、布尔列表、函数

df_demo.head()
SchoolGradeGenderWeightTransfer
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshmanFemale46.0N
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaojuan SunFudan UniversitySophomoreFemale41.0N
Gaojuan YouFudan UniversitySophomoreMale74.0N
df_demo.iloc[1, 1] 
# 第二行第二列
'Freshman'
df_demo.iloc[[0, 1], [0, 1]] 
# 前两行前两列
SchoolGrade
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshman
Changqiang YouPeking UniversityFreshman
df_demo.iloc[1: 4, 2:4] 
# 切片不包含结束端点
GenderWeight
Name
Changqiang YouMale70.0
Mei SunMale89.0
Xiaojuan SunFemale41.0
# 取2-3行
df_demo.iloc[lambda x: slice(1, 4)] 
# 传入切片为返回值的函数
SchoolGradeGenderWeightTransfer
Name
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaojuan SunFudan UniversitySophomoreFemale41.0N

布尔筛选优先考虑 loc
在使用布尔列表的时
不能传入 Series 而必须传入序列的 values ,否则会报错。
因此,在使用布尔筛选的时候还是应当优先考虑 loc 的方式。

# 选出体重超过 80kg 的学生
df_demo.iloc[(df_demo.Weight>80).values].head()
# 对比loc法
# df_demo.loc[df_demo.Weight>80].head()
SchoolGradeGenderWeightTransfer
Name
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Qiang ZhengShanghai Jiao Tong UniversitySeniorMale87.0N
Qiang HanPeking UniversityFreshmanMale87.0N
Chengpeng ZhouFudan UniversitySeniorMale81.0N
Feng HanShanghai Jiao Tong UniversitySophomoreMale82.0N
# 对 Series 而言同样也可以通过 iloc 返回相应位置的值
df_demo.School.iloc[1]
# 对比loc法
# df_demo.School.loc['Changqiang You'] 
'Peking University'
df_demo.head()
SchoolGradeGenderWeightTransfer
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshmanFemale46.0N
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaojuan SunFudan UniversitySophomoreFemale41.0N
Gaojuan YouFudan UniversitySophomoreMale74.0N
# 对 Series 而言同样也可以通过 iloc 返回相应位置的子序列
df_demo.School.iloc[1:5:2]
Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Name: School, dtype: object
# 对比loc法:不能取步长,只能确定行名输入,并且存在重复行无法去除
df_demo.School.loc[['Changqiang You','Xiaojuan Sun']]
Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Xiaojuan Sun       Fudan University
Name: School, dtype: object

query 方法(query询问)

把字符串形式的查询表达式传入 query 方法来查询数据,其表达式的执行结果必须返回布尔列表,会使代码长度在不降低可读性的前提下有所减少。
将 loc 一节中的复合条件查询例子可以如下改写:

df.query('((School == "Fudan University")&'
         ' (Grade == "Senior")&'
         ' (Weight > 70))|'
         '((School == "Peking University")&'
         ' (Grade != "Senior")&'
         ' (Weight > 80))')
SchoolGradeNameGenderWeightTransfer
38Peking UniversityFreshmanQiang HanMale87.0N
66Fudan UniversitySeniorChengpeng ZhouMale81.0N
99Peking UniversityFreshmanChangpeng ZhaoMale83.0N
131Fudan UniversitySeniorChengpeng QianMale73.0Y

在 query 表达式中,帮用户注册了所有来自 DataFrame 的列名
查询体重超过均值的学生:

df.query('Weight > Weight.mean()').head()
SchoolGradeNameGenderWeightTransfer
1Peking UniversityFreshmanChangqiang YouMale70.0N
2Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
4Fudan UniversitySophomoreGaojuan YouMale74.0N
10Shanghai Jiao Tong UniversityFreshmanXiaopeng ZhouMale74.0N
14Tsinghua UniversitySeniorXiaomei ZhouFemale57.0N

query 中引用带空格的列名
对于含有空格的列名,需要使用 col name 的方式进行引用
query 中还注册了若干英语,如:or, and, or, is in, not in 。
字符串中出现与列表的比较时,== 和 != 分别表示元素出现在列表和没有出现在列表,等价于 is in 和 not in。对于 query 中的字符串,如果要引用外部变量,只需在变量名前加 @ 符号

# 取出体重位于 70kg 到 80kg 之间的学生
low, high =70, 80
df.query('((Weight >= 70))&((Weight >= 80))').head()
# 括号,引号用法不明白
# df.query('Weight.between(@low, @high)').head()
SchoolGradeNameGenderWeightTransfer
2Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
23Shanghai Jiao Tong UniversitySeniorQiang ZhengMale87.0N
38Peking UniversityFreshmanQiang HanMale87.0N
66Fudan UniversitySeniorChengpeng ZhouMale81.0N
71Shanghai Jiao Tong UniversitySophomoreFeng HanMale82.0N

随机抽样

把 DataFrame 的每一行看作一个样本,或把每一列看作一个特征
再把整个 DataFrame 看作总体,想要对样本或特征进行随机抽样就可以用 【sample】 函数
统计特征在等概率不放回的简单随机抽样条件下,是总体统计特征的无偏估计,
如样本均值和总体均值,则可以先从整张表中抽出一部分来做近似估计。【sample】函数
主要参数
【n】:抽样数量
【axis】:抽样的方向(0 为 行、1 为列)
【frac】:抽样比例
【replace】:是否放回(True 表示有放回抽样)
【weights】:每个样本的抽样相对概率(0.3从总体中抽30%样本)

df_sample = pd.DataFrame({'id': list('abcde'),
                        'value': [1, 2, 3, 4, 90]})
df_sample
idvalue
0a1
1b2
2c3
3d4
4e90
df_sample.sample(3, replace = True, weights = df_sample.value)
idvalue
4e90
3d4
3d4

多级索引

表结构(多级索引)

# 构造表,忽略这里的构造方法,第4小结讲解
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'),
                                          df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'],
                                           df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(),
                              (np.random.randn(8,4)*5 + 65).tolist()],
                        index = multi_index,
                        columns = multi_column).round(1)
# df_multi
# 索引的名字和值属性分别可以通过 names 和 values 获得
df_multi.index.names
FrozenList(['School', 'Gender'])
df_multi.columns.names
FrozenList(['Indicator', 'Grade'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)
df_multi.columns.values
array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
# 如果想要得到某一层的索引,则需要通过 get_level_values 获得:
df_multi.index.get_level_values(0)
Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')

不能通过赋值来修改索引名,第3节讲解修改索引方式

loc索引(多级索引筛选)

# 将学校和年级设为索引,此时的行为多级索引,列为单级索引
df_multi = df.set_index(['School', 'Grade'])
df_multi.head()
NameGenderWeightTransfer
SchoolGrade
Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
Peking UniversityFreshmanChangqiang YouMale70.0N
Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
Fudan UniversitySophomoreXiaojuan SunFemale41.0N
SophomoreGaojuan YouMale74.0N
# 多级索引中的单个元素以元组为单位,loc 和 iloc 方法完全可以照搬
# 在索引前最好对 MultiIndex 进行排序以避免性能警告
df_multi = df_multi.sort_index()
df_multi.loc[('Fudan University', 'Junior')].head()
# 一类'Fudan University'的大三
NameGenderWeightTransfer
SchoolGrade
Fudan UniversityJuniorYanli YouFemale48.0N
JuniorChunqiang ChuMale72.0N
JuniorChangfeng LvMale76.0N
JuniorYanjuan LvFemale49.0NaN
JuniorGaoqiang ZhouFemale43.0N
df_multi.loc[[('Fudan University', 'Senior'),
              ('Shanghai Jiao Tong University', 'Freshman')]]
# 两类:'Fudan University'的大四和'Shanghai Jiao Tong University'的大一
NameGenderWeightTransfer
SchoolGrade
Fudan UniversitySeniorChengpeng ZhengFemale38.0N
SeniorFeng ZhouFemale47.0N
SeniorGaomei LvFemale34.0N
SeniorChunli LvFemale56.0N
SeniorChengpeng ZhouMale81.0N
SeniorGaopeng QinFemale52.0N
SeniorChunjuan XuFemale47.0N
SeniorJuan ZhangFemale47.0N
SeniorChengpeng QianMale73.0Y
SeniorXiaojuan QianFemale50.0N
SeniorQuan XuFemale44.0N
Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
FreshmanQiang ChuFemale52.0N
FreshmanXiaopeng ZhouMale74.0N
FreshmanYanpeng LvMale65.0N
FreshmanXiaopeng ZhaoFemale53.0N
FreshmanChunli ZhaoMale83.0N
FreshmanPeng ZhangFemaleNaNN
FreshmanXiaoquan SunFemale40.0N
FreshmanChunmei ShiFemale52.0N
FreshmanXiaomei YangFemale49.0N
FreshmanXiaofeng QianFemale49.0N
FreshmanChangmei LvMale75.0N
FreshmanQiang FengMale80.0N
df_multi.loc[df_multi.Weight > 70].head() 
# 布尔列表也是可用的
NameGenderWeightTransfer
SchoolGrade
Fudan UniversityFreshmanFeng WangMale74.0N
JuniorChunqiang ChuMale72.0N
JuniorChangfeng LvMale76.0N
SeniorChengpeng ZhouMale81.0N
SeniorChengpeng QianMale73.0Y
df_multi.loc[lambda x:('Fudan University','Junior')].head()
NameGenderWeightTransfer
SchoolGrade
Fudan UniversityJuniorYanli YouFemale48.0N
JuniorChunqiang ChuMale72.0N
JuniorChangfeng LvMale76.0N
JuniorYanjuan LvFemale49.0NaN
JuniorGaoqiang ZhouFemale43.0N

对多层的元素进行交叉组合后索引,但同时需要指定 loc 的列,全选则用 : 表示
传入 loc 的形式为 [(level_0_list, level_1_list), cols]

# 想要得到所有北大和复旦的大二大三学生
res = df_multi.loc[(['Peking University', 'Fudan University'],
                    ['Sophomore', 'Junior']), :]
res.head()
NameGenderWeightTransfer
SchoolGrade
Peking UniversitySophomoreChangmei XuFemale43.0N
SophomoreXiaopeng QinMaleNaNN
SophomoreMei XuFemale39.0N
SophomoreXiaoli ZhouFemale55.0N
SophomorePeng HanFemale34.0NaN
res.shape
(33, 4)
# 选出北大的大三学生和复旦的大二学生:
res = df_multi.loc[[('Peking University', 'Junior'),
                    ('Fudan University', 'Sophomore')]]
res.head()
NameGenderWeightTransfer
SchoolGrade
Peking UniversityJuniorJuan XuFemaleNaNN
JuniorChangjuan YouFemale47.0N
JuniorGaoli XuFemale48.0N
JuniorGaoquan ZhouMale70.0N
JuniorQiang YouFemale56.0N
res.shape
(16, 4)

IndexSlice 对象(对索引切片)

前面方法只能对元组整体进行切片,不能对每层进行切片,也不能将切片和布尔列表混合使用。
引入 IndexSlice 对象就能解决这个问题
【Slice】 对象一共有两种形式,
第一种为 loc[idx[,]] 型,
第二种为 loc[idx[,],idx[,]] 型

# 构造索引不重复df
np.random.seed(0)
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
                     index=mul_index1,
                     columns=mul_index2)
df_ex
BigDEF
Smalldefdefdef
UpperLower
Aa36-9-6-6-209-5
b-33-8-3-258-44
c-107-466-99-6
Ba85-2-9-80-91-6
b29-7-9-9-5-4-3-1
c86-501-8-8-20
Ca-6-3259-95-63
b12-5-3-56-63-5
c-156-66478-4
# 引入 IndexSlice
idx = pd.IndexSlice

【a】loc[idx[ * , * ]] 型
这种情况并不能进行多层分别切片,
前一个 * 表示行的选择,
后一个 * 表示列的选择,
与单纯的 loc 类似

df_ex.loc[idx['C':, ('D', 'f'):]]
BigDEF
Smallfdefdef
UpperLower
Ca259-95-63
b-5-3-56-63-5
c6-66478-4
# 支持布尔序列的索引
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] 
# 列和大于 0
BigDF
Smalldee
UpperLower
Aa369
b-33-4
c-109

【b】loc[idx[ * , * ],idx[ * , * ]] 型
这种情况能够分层进行切片,
前一个 idx 指代的是行索引,
后一个是列索引。

df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
BigEF
Smallefef
UpperLower
Ab-25-44
c669-6

多级索引构造

那么除了使用 set_index 之外,如何自己构造多级索引

df_multi = df.set_index(['School', 'Grade'])
df_multi.head()
# 将已有的列设置为索引
NameGenderWeightTransfer
SchoolGrade
Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
Peking UniversityFreshmanChangqiang YouMale70.0N
Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
Fudan UniversitySophomoreXiaojuan SunFemale41.0N
SophomoreGaojuan YouMale74.0N
my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])

np.random.seed(0)
L1,L2 = ['A','B'],['a','b']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E'],['d','e']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(4,4)),
                     index=my_tuple,
                     columns=my_tuple)
df_ex
(a, cat)(a, dog)(b, cat)(b, dog)
(a, cat)36-9-6
(a, dog)-6-209
(b, cat)-5-33-8
(b, dog)-3-258

【from_tuples】:根据传入由元组组成的列表进行构造
【from_arrays】:根据传入列表中,对应层的列表进行构造
【from_product】:根据给定多个列表的笛卡尔积进行构造

# 【from_tuples】
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
# 【from_arrays】
my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
# 【from_product】
my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1,
                            my_list2],
                           names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

索引常用方法

交换、删除(索引层)

# 三级索引例子
np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
                                        names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
                                        names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
                     index=mul_index1, 
                     columns=mul_index2)
df_ex
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9
beta-9-5-4-3-186-5
balpha01-8-8-20-6-3
beta259-95-631

索引层内部交换

【swaplevel】、【reorder_levels】
索引层间的交换由 swaplevel 和 reorder_levels 完成,
前者只能交换两个层,而后者可以交换任意层,
两者都可以指定交换行索引或列索引

df_ex.swaplevel(0,2,axis=1).head() 
# 列索引的第一层和第三层交换
Othercatdogcatdogcatdogcatdog
Smallccddccdd
BigCCCCDDDD
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9
df_ex.reorder_levels([2,0,1],axis=0).head() 
# axis=0行索引顺序调换,原先[0,1,2]顺序的行索引顺序变换成[2,0,1]
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
ExtraUpperLower
alphaAa36-9-6-6-209
betaAa-5-33-8-3-258
alphaAb-44-107-466
betaAb-99-685-2-9-8
alphaBa0-91-629-7-9
df_ex.reorder_levels([2,0,1],axis=1).head() 
# axis=1列索引顺序调换,原先[0,1,2]顺序的列索引顺序变换成[2,0,1]
Othercatdogcatdogcatdogcatdog
BigCCCCDDDD
Smallccddccdd
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

删除某层索引

【droplevel】

df_ex.droplevel(1,axis=1)
# 1:第2层索引
# axis=1:列索引
# 将第2层列索引删除
BigCD
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9
beta-9-5-4-3-186-5
balpha01-8-8-20-6-3
beta259-95-631
df_ex.droplevel([0,1],axis=0)
# 将第1,2层行索引删除
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
Extra
alpha36-9-6-6-209
beta-5-33-8-3-258
alpha-44-107-466
beta-99-685-2-9-8
alpha0-91-629-7-9
beta-9-5-4-3-186-5
alpha01-8-8-20-6-3
beta259-95-631

属性修改(索引)

【rename_axis】修改索引层名
【rename_axis】可以对索引层的名字进行修改
修改方式:传入字典映射

df_ex.rename_axis(index={'Upper':'000'},
                  columns={'Other':'003'}).head()
# 将行索引名'Upper'改成'000',将列索引名'Other'改成'003'
BigCD
Smallcdcd
003catdogcatdogcatdogcatdog
000LowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

【rename】修改索引值
【rename】对索引进行修改,
如果是多级索引需要指定修改的层号 level

df_ex.rename(columns={'cat':'猫'},
             level=2).head()
# 将第【2】+1层【columns】列索引【'cat'】改成'猫'
BigCD
Smallcdcd
Otherdogdogdogdog
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

传入参数也可以是函数,其输入值就是索引元素

df_ex.rename(index=lambda x:str.upper(x),
             level=2).head()
# 将第【2】+1层的【index】行索引【str.upper(x)】字符串都改成英文大写
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
AaALPHA36-9-6-6-209
BETA-5-33-8-3-258
bALPHA-44-107-466
BETA-99-685-2-9-8
BaALPHA0-91-629-7-9

整个索引的元素替换,迭代器实现:

new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values),
             level=2)
# 替换【index】行索引【level=2】第3层索引名
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaa36-9-6-6-209
b-5-33-8-3-258
bc-44-107-466
d-99-685-2-9-8
Bae0-91-629-7-9
f-9-5-4-3-186-5
bg01-8-8-20-6-3
h259-95-631

对索引元素进行修改,
【单层索引】容易实现:
先取出索引的 values 属性,
再给对得到的列表进行修改,
最后再对 index 对象重新赋值。
【多级索引】进行修改(有些麻烦):
先把某一层索引临时转为表的元素,
然后再进行修改,
最后重新设定为索引。map 函数,定义在 Index 上的方法,
与前面 rename 方法中层的函数式用法类似,
只不过传入的不是层的标量值,而是直接传入索引的元组,
这为用户进行跨层的修改提供了遍历。

# 【rename】方法
df_ex.rename(index=lambda x:str.upper(x),
             level=2).head()
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
AaALPHA36-9-6-6-209
BETA-5-33-8-3-258
bALPHA-44-107-466
BETA-99-685-2-9-8
BaALPHA0-91-629-7-9
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0],
                                       x[1],
                                       str.upper(x[2])))
df_temp.index = new_idx
df_temp.head()
# 【index.map】遍历行索引,
# 【x[0],x[1],str.upper(x[2])】第1,2位索引保持原状,第3位置索引字母改为大写
# 【df_temp.index = new_idx】赋值为新索引
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
AaALPHA36-9-6-6-209
BETA-5-33-8-3-258
bALPHA-44-107-466
BETA-99-685-2-9-8
BaALPHA0-91-629-7-9

map 的另一个使用方法是对多级索引的压缩

df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0]+'-'+
                                       x[1]+'-'+
                                       x[2]))
df_temp.index = new_idx
df_temp.head() 
# 多层索引压缩成单层索引
# 行索引层名消失
# 【(x[0]+'-'+x[1]+'-'+x[2])】索引压缩连接方式
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
A-a-alpha36-9-6-6-209
A-a-beta-5-33-8-3-258
A-b-alpha-44-107-466
A-b-beta-99-685-2-9-8
B-a-alpha0-91-629-7-9

也可以反向地展开:

new_idx = df_temp.index.map(lambda x:tuple(x.split('-')))
df_temp.index = new_idx
df_temp.head() 
# 单层索引展开成三层索引
# 【tuple】元组
# 【tuple(x.split('-'))】以'-'为分割,拆分为元组
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

设置与重设(索引)

# 构造新表
df_new = pd.DataFrame({'A':list('aacd'),
                       'B':list('PQRT'),
                       'C':[1,2,3,4]})
df_new
ABC
0aP1
1aQ2
2cR3
3dT4

【set_index】索引的设置
【append】参数:是否来保留原来的索引
直接把新设定的添加到原索引的内层

df_new.set_index('A')
# 不保留原索引
BC
A
aP1
aQ2
cR3
dT4
df_new.set_index('A', append=True)
# 【append=True】保留原索引
BC
A
0aP1
1aQ2
2cR3
3dT4

可以同时指定多个列作为索引:

df_new.set_index(['A', 'B'])
C
AB
aP1
Q2
cR3
dT4

新设索引
如果添加索引的列原表中没有,直接参数导入新Series :

my_index = pd.Series(list('WXYZ'), name='D')
df_new = df_new.set_index(['A', my_index])
df_new
BC
AD
aWP1
XQ2
cYR3
dZT4

【reset_index】set_index 的逆函数,
【drop】参数:是否丢弃将去掉的索引层,而不是添加到列中:

df_new.reset_index(['D'])
# 默认放回列中
DBC
A
aWP1
aXQ2
cYR3
dZT4
df_new.reset_index(['D'], drop=True)
# 【drop=True】丢弃索引层
BC
A
aP1
aQ2
cR3
dT4

如果重置了所有的索引,那么 pandas 会直接重新生成一个默认索引

df_new.reset_index()
ADBC
0aWP1
1aXQ2
2cYR3
3dZT4

变形(索引)

给定一个新的索引,
把原表中相应的索引对应元素填充到新索引构成的表中

# 需要重新制作一张新的表,要求增加一名员工的同时去掉身高列并增加性别列
df_reindex = pd.DataFrame({"Weight":[60,70,80],
                           "Height":[176,180,179]},
                          index=['1001','1003','1002'])
df_reindex
WeightHeight
100160176
100370180
100280179
df_reindex.reindex(index=['1001','1002','1003','1004'],
                   columns=['Weight','Gender'])
WeightGender
100160.0NaN
100280.0NaN
100370.0NaN
1004NaNNaN

还有一个与 reindex 功能类似的函数是
【reindex_like】:仿照传入的表的索引来进行被调用表索引的变形

# 现在以及存在一张表具备了目标索引的条件,那么上述功能可以如下等价地写出:
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'],
                          columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
WeightGender
100160.0NaN
100280.0NaN
100370.0NaN
1004NaNNaN

索引运算

集合运算

利用集合运算来取出符合条件行
例如有两张表 A 和 B ,它们的索引都是员工编号,现在需要筛选出两表索引交集的所有员工信息,此时通过 Index 上的运算操作就很容易实现。SA.intersection(SB) = SA ∩ SB ⇔ {x|x ∈ SA and x ∈ SB}
SA.union(SB) = SA ∪ SB ⇔ {x|x ∈ SA or x ∈ SB}
SA.difference(SB) = SA − SB ⇔ {x|x ∈ SA and x /∈ SB}
SA.symmetric_difference(SB) = SA△SB ⇔ {x|x ∈ SA ∪ SB − SA ∩ SB}

索引运算

先用 unique 去重后再进行运算

# 构造简单示例
df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]],
                        index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]],
                        index = pd.Index(['b','b','c'],name='id2'))
id1, id2 = df_set_1.index.unique(), df_set_2.index.unique()

id1, id2
(Index(['a', 'b'], dtype='object', name='id1'),
 Index(['b', 'c'], dtype='object', name='id2'))
df_set_1
01
id1
a01
b12
a34
df_set_2
01
id2
b45
b26
c71
id1.intersection(id2)
# A ∩ B
Index(['b'], dtype='object')
id1.union(id2)
# A ∪ B
Index(['a', 'b', 'c'], dtype='object')
id1.difference(id2)
# A - B
Index(['a'], dtype='object')
id1.symmetric_difference(id2)
# A ∪ B − A ∩ B
Index(['a', 'c'], dtype='object')

上述的四类运算还可以用等价的符号表示代替如下:

id1 & id2
Index(['b'], dtype='object')
id1 | id2
Index(['a', 'b', 'c'], dtype='object')
(id1 ^ id2) & id1
Index(['a'], dtype='object')
id1 ^ id2 
# ^ 符号即对称差
Index(['a', 'c'], dtype='object')

无索引情况
两张表需要做集合运算的列并没有被设置索引
方法一:先转成索引,运算后再恢复
方法二:利用 isin 函数

df_set_1
01
id1
a01
b12
a34
df_set_2
01
id2
b45
b26
c71
# 在重置索引的第一张表中选出 id 列交集的所在行
df_set_in_col_1 = df_set_1.reset_index()
df_set_in_col_2 = df_set_2.reset_index()
# 如果重置了所有的索引,那么 pandas 会直接重新生成一个默认索引
# 【.reset_index()】重置所有索引
df_set_in_col_1
id101
0a01
1b12
2a34
df_set_in_col_2
id201
0b45
1b26
2c71
id1
Index(['a', 'b'], dtype='object', name='id1')

【isin()】接收一个列表,判断该列中元素是否在列表中,同时对多个列过滤
df[df[某列].isin(条件)&df[某列].isin(条件)]

df_set_in_col_1[df_set_in_col_1.id1.isin(df_set_in_col_2.id2)]
# 【新表1[新表1.id1.isin(新表2.id2)]】
# 【id1】原索引层名新列名
# A ∩ B
id101
1b12

练习

Ex1:公司员工数据集

df = pd.read_csv('company.csv')
df.head()
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
013181/3/195461VancouverExecutiveCEOM
113191/3/195758VancouverExecutiveVP StoresF
213201/2/195560VancouverExecutiveLegal CounselF
313211/2/195956VancouverExecutiveVP Human ResourcesM
413221/9/195857VancouverExecutiveVP FinanceM

1、分别只使用 query 和 loc 选出年龄不超过四十岁且工作部门为 Dairy 或 Bakery 的男性。

df.query('((age <= 40)&'
         '(department == "Dairy"))|'
         '((age <= 40)&'
         '(department == "Bakery"))')
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
360857881/12/197540ChilliwackDairyDairy PersonF
360957891/13/197540ChilliwackDairyDairy PersonF
361057901/13/197540KelownaDairyDairy PersonF
361157911/14/197540KelownaDairyDairy PersonM
361357931/22/197540RichmondBakeryBakerM
........................
6136833512/28/199421VancouverDairyDairy PersonF
6137833612/31/199421VancouverDairyDairy PersonM
627063125/14/197936Grand ForksDairyDairy PersonM
627165402/14/198134VictoriaBakeryBakerM
6278780110/18/199025AbbotsfordDairyDairy PersonF

866 rows × 7 columns

df1 = df.loc[df.department.isin(['Dairy','Bakery'])]
df1.loc[df1.age <= 40]
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
3113633/20/194966VictoriaBakeryBakery ManagerM
4013774/15/194966VictoriaDairyDairy PersonF
6414086/12/194966NanaimoBakeryBakery ManagerF
7314227/18/194966VancouverDairyDairy PersonM
7414257/20/194966VancouverDairyDairy PersonM
........................
626552457/13/197045KelownaBakeryBakerF
626653476/28/197144TerraceDairyDairy PersonF
627063125/14/197936Grand ForksDairyDairy PersonM
627165402/14/198134VictoriaBakeryBakerM
6278780110/18/199025AbbotsfordDairyDairy PersonF

1931 rows × 7 columns

2、选出员工 ID 号为奇数所在行的第 1、第 3 和倒数第 2 列。

df[df['EmployeeID']%2==1].iloc[:,[0,2,-2]]
EmployeeIDagejob_title
1131958VP Stores
3132156VP Human Resources
5132353Exec Assistant, VP Stores
6132551Exec Assistant, Legal Counsel
8132948Store Manager
............
6276765926Cashier
6277774125Cashier
6278780125Dairy Person
6280818122Cashier
6281822321Cashier

3126 rows × 3 columns

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值