第三章 索引
import numpy as np
import pandas as pd
索引器
列索引(for表)
【df[‘col1’]】 == 【df.列名】(列名中不包含空格):取相应列 【df[[‘col1’, ‘col2’]]】:取多列组成DataFrame
行索引(for序列)
以字符串为索引的Series
s = pd. Series( [ 1 , 2 , 3 , 4 , 5 , 6 ] ,
index= [ 'a' , 'b' , 'a' , 'a' , 'a' , 'c' ] )
s
a 1
b 2
a 3
a 4
a 5
c 6
dtype: int64
s[ 'a' ]
a 1
a 3
a 4
a 5
dtype: int64
s[ [ 'c' , 'b' ] ]
c 6
b 2
dtype: int64
s[ 'c' : 'b' : - 2 ]
c 6
a 4
b 2
dtype: int64
以整数位索引的Series
如不特别指定列作为索引,则会生成从 0 开始的整数索引作为默认索引 使用 [int] 或 [int_list] ,则可取出对应索引元素的值
s = pd. Series( [ 'a' , 'b' , 'c' , 'd' , 'e' , 'f' ] ,
index= [ 1 , 3 , 1 , 2 , 5 , 4 ] )
s
1 a
3 b
1 c
2 d
5 e
4 f
dtype: object
s[ 1 ]
1 a
1 c
dtype: object
s[ [ 2 , 3 ] ]
2 d
3 b
dtype: object
s[ 1 : - 1 : 2 ]
3 b
2 d
dtype: object
补充说明: 如果不想陷入麻烦,请不要把纯浮点以及任何混合类型(字符串、整数、浮点类型等的混合) 作为索引,否则可能会在具体的操作时报错或者返回非预期的结果
loc 索引
基于元素的 loc 索引器,基于位置的 iloc 索引器
loc[‘行名’, ‘列名’] loc[, ]
的位置:单个元素、元素列表、元素切片、布尔列表、函数
df = pd. read_csv( 'learn_pandas.csv' ,
usecols = [ 'School' , 'Grade' , 'Name' , 'Gender' ,
'Weight' , 'Transfer' ] )
df. head( )
School Grade Name Gender Weight Transfer 0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N 1 Peking University Freshman Changqiang You Male 70.0 N 2 Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N 3 Fudan University Sophomore Xiaojuan Sun Female 41.0 N 4 Fudan University Sophomore Gaojuan You Male 74.0 N
先利用 set_index 方法把 Name 列设为索引
df_demo = df. set_index( 'Name' )
df_demo
School Grade Gender Weight Transfer Name Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N Changqiang You Peking University Freshman Male 70.0 N Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Xiaojuan Sun Fudan University Sophomore Female 41.0 N Gaojuan You Fudan University Sophomore Male 74.0 N ... ... ... ... ... ... Xiaojuan Sun Fudan University Junior Female 46.0 N Li Zhao Tsinghua University Senior Female 50.0 N Chengqiang Chu Shanghai Jiao Tong University Senior Female 45.0 N Chengmei Shen Shanghai Jiao Tong University Senior Male 71.0 N Chunpeng Lv Tsinghua University Sophomore Male 51.0 N
200 rows × 5 columns
【a】* 为单个元素
df_demo. loc[ 'Qiang Sun' ]
School Grade Gender Weight Transfer Name Qiang Sun Tsinghua University Junior Female 53.0 N Qiang Sun Tsinghua University Sophomore Female 40.0 N Qiang Sun Shanghai Jiao Tong University Junior Female NaN N
df_demo. loc[ 'Quan Zhao' ]
School Shanghai Jiao Tong University
Grade Junior
Gender Female
Weight 53
Transfer N
Name: Quan Zhao, dtype: object
df_demo. loc[ 'Qiang Sun' , 'School' ]
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
【b】* 为元素列表
df_demo. loc[ [ 'Qiang Sun' , 'Quan Zhao' ] , [ 'School' , 'Gender' ] ]
School Gender Name Qiang Sun Tsinghua University Female Qiang Sun Tsinghua University Female Qiang Sun Shanghai Jiao Tong University Female Quan Zhao Shanghai Jiao Tong University Female
[[‘行名1’,‘行名2’],[‘列名1’,‘列名2’]]【返回2行属性2列属性】 [‘行名1’:‘行名2’,‘列名1’:‘列名2’]【返回行1到行2 列1到列2矩阵】 【c】* 为切片 Series 使用字符串索引提过 如果是唯一值的起点和终点字符,那么就可以使用切片,并且包含两个端点。 如果不唯一则报错。
df_demo. loc[ 'Gaojuan You' : 'Gaoqiang Qian' , 'School' : 'Gender' ]
School Grade Gender Name Gaojuan You Fudan University Sophomore Male Xiaoli Qian Tsinghua University Freshman Female Qiang Chu Shanghai Jiao Tong University Freshman Female Gaoqiang Qian Tsinghua University Junior Female
DataFrame 使用整数索引 使用整数切片的时字符串索引的要求一致。 是元素切片,【包含端点】且起点、终点不允许有重复值。
df_loc_slice_demo = df_demo. copy( )
df_loc_slice_demo. index = range ( df_demo. shape[ 0 ] , 0 , - 1 )
df_loc_slice_demo
School Grade Gender Weight Transfer 200 Shanghai Jiao Tong University Freshman Female 46.0 N 199 Peking University Freshman Male 70.0 N 198 Shanghai Jiao Tong University Senior Male 89.0 N 197 Fudan University Sophomore Female 41.0 N 196 Fudan University Sophomore Male 74.0 N ... ... ... ... ... ... 5 Fudan University Junior Female 46.0 N 4 Tsinghua University Senior Female 50.0 N 3 Shanghai Jiao Tong University Senior Female 45.0 N 2 Shanghai Jiao Tong University Senior Male 71.0 N 1 Tsinghua University Sophomore Male 51.0 N
200 rows × 5 columns
df_loc_slice_demo. loc[ 5 : 3 ]
School Grade Gender Weight Transfer 5 Fudan University Junior Female 46.0 N 4 Tsinghua University Senior Female 50.0 N 3 Shanghai Jiao Tong University Senior Female 45.0 N
df_loc_slice_demo. loc[ 3 : 5 ]
School Grade Gender Weight Transfer
df_loc_slice_demo. iloc[ 5 : 3 : - 1 ]
School Grade Gender Weight Transfer 195 Tsinghua University Freshman Female 51.0 N 196 Fudan University Sophomore Male 74.0 N
【d】* 为布尔列表 传入 loc 的布尔列表与 DataFrame 长度相同 列表为 True 的位置所对应的行会被选中,False 则会被剔除
df_demo. loc[ df_demo. Weight> 70 ] . head( )
School Grade Gender Weight Transfer Name Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Gaojuan You Fudan University Sophomore Male 74.0 N Xiaopeng Zhou Shanghai Jiao Tong University Freshman Male 74.0 N Xiaofeng Sun Tsinghua University Senior Male 71.0 N Qiang Zheng Shanghai Jiao Tong University Senior Male 87.0 N
传入元素列表 也可通过 isin 方法返回的布尔列表等价写出 【isin()】接收一个列表,判断该列中元素是否在列表中,同时对多个列过滤 df[df[某列].isin(条件)&df[某列].isin(条件)]
df_demo. loc[ df_demo. Grade. isin( [ 'Freshman' , 'Senior' ] ) ] . head( )
School Grade Gender Weight Transfer Name Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N Changqiang You Peking University Freshman Male 70.0 N Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Xiaoli Qian Tsinghua University Freshman Female 51.0 N Qiang Chu Shanghai Jiao Tong University Freshman Female 52.0 N
对于复合条件 可以用 |(或), &(且), ~(取反)的组合来实现
condition_1_1 = df_demo. School == 'Fudan University'
condition_1_2 = df_demo. Grade == 'Senior'
condition_1_3 = df_demo. Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = df_demo. School == 'Peking University'
condition_2_2 = df_demo. Grade == 'Senior'
condition_2_3 = df_demo. Weight > 80
condition_2 = condition_2_1 & ( ~ condition_2_2) & condition_2_3
df_demo. loc[ condition_1 | condition_2]
School Grade Gender Weight Transfer Name Qiang Han Peking University Freshman Male 87.0 N Chengpeng Zhou Fudan University Senior Male 81.0 N Changpeng Zhao Peking University Freshman Male 83.0 N Chengpeng Qian Fudan University Senior Male 73.0 Y
练一练
df_demo. select_dtypes( 'number' )
Weight Name Gaopeng Yang 46.0 Changqiang You 70.0 Mei Sun 89.0 Xiaojuan Sun 41.0 Gaojuan You 74.0 ... ... Xiaojuan Sun 46.0 Li Zhao 50.0 Chengqiang Chu 45.0 Chengmei Shen 71.0 Chunpeng Lv 51.0
200 rows × 1 columns
【e】* 为函数 *函数,必须以前面的四种合法形式之一为返回值,函数的输入值为 DataFrame 本身。 假设仍然是上述复合条件筛选的例子,可以把逻辑写入一个函数中再返回。 需要注意的是函数的形式参数 x 本质上即为 df_demo
def condition ( x) :
condition_1_1 = x. School == 'Fudan University'
condition_1_2 = x. Grade == 'Senior'
condition_1_3 = x. Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = x. School == 'Peking University'
condition_2_2 = x. Grade == 'Senior'
condition_2_3 = x. Weight > 80
condition_2 = condition_2_1 & ( ~ condition_2_2) & condition_2_3
result = condition_1 | condition_2
return result
df_demo. loc[ condition]
School Grade Gender Weight Transfer Name Qiang Han Peking University Freshman Male 87.0 N Chengpeng Zhou Fudan University Senior Male 81.0 N Changpeng Zhao Peking University Freshman Male 83.0 N Chengpeng Qian Fudan University Senior Male 73.0 Y
支持使用 lambda 表达式 其返回值也同样必须是先前提到的四种形式之一
df_demo. loc[ lambda x: 'Quan Zhao' , lambda x: 'Gender' ]
'Female'
用函数进行切片操作 由于函数无法返回如 start: end: step 的切片形式 故返回切片时要用 slice 对象进行包装
df_demo. loc[ lambda x: slice ( 'Gaojuan You' , 'Gaoqiang Qian' ) ]
School Grade Gender Weight Transfer Name Gaojuan You Fudan University Sophomore Male 74.0 N Xiaoli Qian Tsinghua University Freshman Female 51.0 N Qiang Chu Shanghai Jiao Tong University Freshman Female 52.0 N Gaoqiang Qian Tsinghua University Junior Female 50.0 N
不要使用链式赋值
赋值时,应当在使用一层索引器后直接进行赋值操作 这样做是由于进行多次索引后赋值是赋在临时返回的 copy 副本上的。
df_chain = pd. DataFrame( [ [ 0 , 0 ] , [ 1 , 0 ] , [ - 1 , 0 ] ] , columns= list ( 'AB' ) )
df_chain
df_chain[ df_chain. A!= 0 ] . B = 1
df_chain
D:\MyDownloads\Anaconda\lib\site-packages\pandas\core\generic.py:5170: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[name] = value
df_chain. loc[ df_chain. A!= 0 , 'B' ] = 1
df_chain
iloc 索引
针对位置进行筛选 *位置处也有五类合法对:整数、整数列表、整数切片、布尔列表、函数
df_demo. head( )
School Grade Gender Weight Transfer Name Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N Changqiang You Peking University Freshman Male 70.0 N Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Xiaojuan Sun Fudan University Sophomore Female 41.0 N Gaojuan You Fudan University Sophomore Male 74.0 N
df_demo. iloc[ 1 , 1 ]
'Freshman'
df_demo. iloc[ [ 0 , 1 ] , [ 0 , 1 ] ]
School Grade Name Gaopeng Yang Shanghai Jiao Tong University Freshman Changqiang You Peking University Freshman
df_demo. iloc[ 1 : 4 , 2 : 4 ]
Gender Weight Name Changqiang You Male 70.0 Mei Sun Male 89.0 Xiaojuan Sun Female 41.0
df_demo. iloc[ lambda x: slice ( 1 , 4 ) ]
School Grade Gender Weight Transfer Name Changqiang You Peking University Freshman Male 70.0 N Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Xiaojuan Sun Fudan University Sophomore Female 41.0 N
布尔筛选优先考虑 loc 在使用布尔列表的时 不能传入 Series 而必须传入序列的 values ,否则会报错。 因此,在使用布尔筛选的时候还是应当优先考虑 loc 的方式。
df_demo. iloc[ ( df_demo. Weight> 80 ) . values] . head( )
School Grade Gender Weight Transfer Name Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Qiang Zheng Shanghai Jiao Tong University Senior Male 87.0 N Qiang Han Peking University Freshman Male 87.0 N Chengpeng Zhou Fudan University Senior Male 81.0 N Feng Han Shanghai Jiao Tong University Sophomore Male 82.0 N
df_demo. School. iloc[ 1 ]
'Peking University'
df_demo. head( )
School Grade Gender Weight Transfer Name Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N Changqiang You Peking University Freshman Male 70.0 N Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N Xiaojuan Sun Fudan University Sophomore Female 41.0 N Gaojuan You Fudan University Sophomore Male 74.0 N
df_demo. School. iloc[ 1 : 5 : 2 ]
Name
Changqiang You Peking University
Xiaojuan Sun Fudan University
Name: School, dtype: object
df_demo. School. loc[ [ 'Changqiang You' , 'Xiaojuan Sun' ] ]
Name
Changqiang You Peking University
Xiaojuan Sun Fudan University
Xiaojuan Sun Fudan University
Name: School, dtype: object
query 方法(query询问)
把字符串形式的查询表达式传入 query 方法来查询数据,其表达式的执行结果必须返回布尔列表,会使代码长度在不降低可读性的前提下有所减少。 将 loc 一节中的复合条件查询例子可以如下改写:
df. query( '((School == "Fudan University")&'
' (Grade == "Senior")&'
' (Weight > 70))|'
'((School == "Peking University")&'
' (Grade != "Senior")&'
' (Weight > 80))' )
School Grade Name Gender Weight Transfer 38 Peking University Freshman Qiang Han Male 87.0 N 66 Fudan University Senior Chengpeng Zhou Male 81.0 N 99 Peking University Freshman Changpeng Zhao Male 83.0 N 131 Fudan University Senior Chengpeng Qian Male 73.0 Y
在 query 表达式中,帮用户注册了所有来自 DataFrame 的列名 查询体重超过均值的学生:
df. query( 'Weight > Weight.mean()' ) . head( )
School Grade Name Gender Weight Transfer 1 Peking University Freshman Changqiang You Male 70.0 N 2 Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N 4 Fudan University Sophomore Gaojuan You Male 74.0 N 10 Shanghai Jiao Tong University Freshman Xiaopeng Zhou Male 74.0 N 14 Tsinghua University Senior Xiaomei Zhou Female 57.0 N
query 中引用带空格的列名 对于含有空格的列名,需要使用 col name
的方式进行引用 query 中还注册了若干英语,如:or, and, or, is in, not in 。 字符串中出现与列表的比较时,== 和 != 分别表示元素出现在列表和没有出现在列表,等价于 is in 和 not in。对于 query 中的字符串,如果要引用外部变量,只需在变量名前加 @ 符号
low, high = 70 , 80
df. query( '((Weight >= 70))&((Weight >= 80))' ) . head( )
School Grade Name Gender Weight Transfer 2 Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N 23 Shanghai Jiao Tong University Senior Qiang Zheng Male 87.0 N 38 Peking University Freshman Qiang Han Male 87.0 N 66 Fudan University Senior Chengpeng Zhou Male 81.0 N 71 Shanghai Jiao Tong University Sophomore Feng Han Male 82.0 N
随机抽样
把 DataFrame 的每一行看作一个样本,或把每一列看作一个特征 再把整个 DataFrame 看作总体,想要对样本或特征进行随机抽样就可以用 【sample】 函数 统计特征在等概率不放回的简单随机抽样条件下,是总体统计特征的无偏估计, 如样本均值和总体均值,则可以先从整张表中抽出一部分来做近似估计。【sample】函数 主要参数 【n】:抽样数量 【axis】:抽样的方向(0 为 行、1 为列) 【frac】:抽样比例 【replace】:是否放回(True 表示有放回抽样) 【weights】:每个样本的抽样相对概率(0.3从总体中抽30%样本)
df_sample = pd. DataFrame( { 'id' : list ( 'abcde' ) ,
'value' : [ 1 , 2 , 3 , 4 , 90 ] } )
df_sample
df_sample. sample( 3 , replace = True , weights = df_sample. value)
多级索引
表结构(多级索引)
np. random. seed( 0 )
multi_index = pd. MultiIndex. from_product( [ list ( 'ABCD' ) ,
df. Gender. unique( ) ] , names= ( 'School' , 'Gender' ) )
multi_column = pd. MultiIndex. from_product( [ [ 'Height' , 'Weight' ] ,
df. Grade. unique( ) ] , names= ( 'Indicator' , 'Grade' ) )
df_multi = pd. DataFrame( np. c_[ ( np. random. randn( 8 , 4 ) * 5 + 163 ) . tolist( ) ,
( np. random. randn( 8 , 4 ) * 5 + 65 ) . tolist( ) ] ,
index = multi_index,
columns = multi_column) . round ( 1 )
df_multi. index. names
FrozenList(['School', 'Gender'])
df_multi. columns. names
FrozenList(['Indicator', 'Grade'])
df_multi. index. values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
dtype=object)
df_multi. columns. values
array([('Height', 'Freshman'), ('Height', 'Senior'),
('Height', 'Sophomore'), ('Height', 'Junior'),
('Weight', 'Freshman'), ('Weight', 'Senior'),
('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
df_multi. index. get_level_values( 0 )
Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')
不能通过赋值来修改索引名,第3节讲解修改索引方式
loc索引(多级索引筛选)
df_multi = df. set_index( [ 'School' , 'Grade' ] )
df_multi. head( )
Name Gender Weight Transfer School Grade Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N Peking University Freshman Changqiang You Male 70.0 N Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N Fudan University Sophomore Xiaojuan Sun Female 41.0 N Sophomore Gaojuan You Male 74.0 N
df_multi = df_multi. sort_index( )
df_multi. loc[ ( 'Fudan University' , 'Junior' ) ] . head( )
Name Gender Weight Transfer School Grade Fudan University Junior Yanli You Female 48.0 N Junior Chunqiang Chu Male 72.0 N Junior Changfeng Lv Male 76.0 N Junior Yanjuan Lv Female 49.0 NaN Junior Gaoqiang Zhou Female 43.0 N
df_multi. loc[ [ ( 'Fudan University' , 'Senior' ) ,
( 'Shanghai Jiao Tong University' , 'Freshman' ) ] ]
Name Gender Weight Transfer School Grade Fudan University Senior Chengpeng Zheng Female 38.0 N Senior Feng Zhou Female 47.0 N Senior Gaomei Lv Female 34.0 N Senior Chunli Lv Female 56.0 N Senior Chengpeng Zhou Male 81.0 N Senior Gaopeng Qin Female 52.0 N Senior Chunjuan Xu Female 47.0 N Senior Juan Zhang Female 47.0 N Senior Chengpeng Qian Male 73.0 Y Senior Xiaojuan Qian Female 50.0 N Senior Quan Xu Female 44.0 N Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N Freshman Qiang Chu Female 52.0 N Freshman Xiaopeng Zhou Male 74.0 N Freshman Yanpeng Lv Male 65.0 N Freshman Xiaopeng Zhao Female 53.0 N Freshman Chunli Zhao Male 83.0 N Freshman Peng Zhang Female NaN N Freshman Xiaoquan Sun Female 40.0 N Freshman Chunmei Shi Female 52.0 N Freshman Xiaomei Yang Female 49.0 N Freshman Xiaofeng Qian Female 49.0 N Freshman Changmei Lv Male 75.0 N Freshman Qiang Feng Male 80.0 N
df_multi. loc[ df_multi. Weight > 70 ] . head( )
Name Gender Weight Transfer School Grade Fudan University Freshman Feng Wang Male 74.0 N Junior Chunqiang Chu Male 72.0 N Junior Changfeng Lv Male 76.0 N Senior Chengpeng Zhou Male 81.0 N Senior Chengpeng Qian Male 73.0 Y
df_multi. loc[ lambda x: ( 'Fudan University' , 'Junior' ) ] . head( )
Name Gender Weight Transfer School Grade Fudan University Junior Yanli You Female 48.0 N Junior Chunqiang Chu Male 72.0 N Junior Changfeng Lv Male 76.0 N Junior Yanjuan Lv Female 49.0 NaN Junior Gaoqiang Zhou Female 43.0 N
对多层的元素进行交叉组合后索引,但同时需要指定 loc 的列,全选则用 : 表示 传入 loc 的形式为 [(level_0_list, level_1_list), cols]
res = df_multi. loc[ ( [ 'Peking University' , 'Fudan University' ] ,
[ 'Sophomore' , 'Junior' ] ) , : ]
res. head( )
Name Gender Weight Transfer School Grade Peking University Sophomore Changmei Xu Female 43.0 N Sophomore Xiaopeng Qin Male NaN N Sophomore Mei Xu Female 39.0 N Sophomore Xiaoli Zhou Female 55.0 N Sophomore Peng Han Female 34.0 NaN
res. shape
(33, 4)
res = df_multi. loc[ [ ( 'Peking University' , 'Junior' ) ,
( 'Fudan University' , 'Sophomore' ) ] ]
res. head( )
Name Gender Weight Transfer School Grade Peking University Junior Juan Xu Female NaN N Junior Changjuan You Female 47.0 N Junior Gaoli Xu Female 48.0 N Junior Gaoquan Zhou Male 70.0 N Junior Qiang You Female 56.0 N
res. shape
(16, 4)
IndexSlice 对象(对索引切片)
前面方法只能对元组整体进行切片,不能对每层进行切片,也不能将切片和布尔列表混合使用。 引入 IndexSlice 对象就能解决这个问题 【Slice】 对象一共有两种形式, 第一种为 loc[idx[, ]] 型, 第二种为 loc[idx[, ],idx[, ]] 型
np. random. seed( 0 )
L1, L2 = [ 'A' , 'B' , 'C' ] , [ 'a' , 'b' , 'c' ]
mul_index1 = pd. MultiIndex. from_product( [ L1, L2] , names= ( 'Upper' , 'Lower' ) )
L3, L4 = [ 'D' , 'E' , 'F' ] , [ 'd' , 'e' , 'f' ]
mul_index2 = pd. MultiIndex. from_product( [ L3, L4] , names= ( 'Big' , 'Small' ) )
df_ex = pd. DataFrame( np. random. randint( - 9 , 10 , ( 9 , 9 ) ) ,
index= mul_index1,
columns= mul_index2)
df_ex
Big D E F Small d e f d e f d e f Upper Lower A a 3 6 -9 -6 -6 -2 0 9 -5 b -3 3 -8 -3 -2 5 8 -4 4 c -1 0 7 -4 6 6 -9 9 -6 B a 8 5 -2 -9 -8 0 -9 1 -6 b 2 9 -7 -9 -9 -5 -4 -3 -1 c 8 6 -5 0 1 -8 -8 -2 0 C a -6 -3 2 5 9 -9 5 -6 3 b 1 2 -5 -3 -5 6 -6 3 -5 c -1 5 6 -6 6 4 7 8 -4
idx = pd. IndexSlice
【a】loc[idx[ * , * ]] 型 这种情况并不能进行多层分别切片, 前一个 * 表示行的选择, 后一个 * 表示列的选择, 与单纯的 loc 类似
df_ex. loc[ idx[ 'C' : , ( 'D' , 'f' ) : ] ]
Big D E F Small f d e f d e f Upper Lower C a 2 5 9 -9 5 -6 3 b -5 -3 -5 6 -6 3 -5 c 6 -6 6 4 7 8 -4
df_ex. loc[ idx[ : 'A' , lambda x: x. sum ( ) > 0 ] ]
Big D F Small d e e Upper Lower A a 3 6 9 b -3 3 -4 c -1 0 9
【b】loc[idx[ * , * ],idx[ * , * ]] 型 这种情况能够分层进行切片, 前一个 idx 指代的是行索引, 后一个是列索引。
df_ex. loc[ idx[ : 'A' , 'b' : ] , idx[ 'E' : , 'e' : ] ]
Big E F Small e f e f Upper Lower A b -2 5 -4 4 c 6 6 9 -6
多级索引构造
那么除了使用 set_index 之外,如何自己构造多级索引
df_multi = df. set_index( [ 'School' , 'Grade' ] )
df_multi. head( )
Name Gender Weight Transfer School Grade Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N Peking University Freshman Changqiang You Male 70.0 N Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N Fudan University Sophomore Xiaojuan Sun Female 41.0 N Sophomore Gaojuan You Male 74.0 N
my_array = [ list ( 'aabb' ) , [ 'cat' , 'dog' ] * 2 ]
pd. MultiIndex. from_arrays( my_array, names= [ 'First' , 'Second' ] )
np. random. seed( 0 )
L1, L2 = [ 'A' , 'B' ] , [ 'a' , 'b' ]
mul_index1 = pd. MultiIndex. from_product( [ L1, L2] , names= ( 'Upper' , 'Lower' ) )
L3, L4 = [ 'D' , 'E' ] , [ 'd' , 'e' ]
mul_index2 = pd. MultiIndex. from_product( [ L3, L4] , names= ( 'Big' , 'Small' ) )
df_ex = pd. DataFrame( np. random. randint( - 9 , 10 , ( 4 , 4 ) ) ,
index= my_tuple,
columns= my_tuple)
df_ex
(a, cat) (a, dog) (b, cat) (b, dog) (a, cat) 3 6 -9 -6 (a, dog) -6 -2 0 9 (b, cat) -5 -3 3 -8 (b, dog) -3 -2 5 8
【from_tuples】:根据传入由元组组成的列表进行构造 【from_arrays】:根据传入列表中,对应层的列表进行构造 【from_product】:根据给定多个列表的笛卡尔积进行构造
my_tuple = [ ( 'a' , 'cat' ) , ( 'a' , 'dog' ) , ( 'b' , 'cat' ) , ( 'b' , 'dog' ) ]
pd. MultiIndex. from_tuples( my_tuple, names= [ 'First' , 'Second' ] )
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
my_array = [ list ( 'aabb' ) , [ 'cat' , 'dog' ] * 2 ]
pd. MultiIndex. from_arrays( my_array, names= [ 'First' , 'Second' ] )
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
my_list1 = [ 'a' , 'b' ]
my_list2 = [ 'cat' , 'dog' ]
pd. MultiIndex. from_product( [ my_list1,
my_list2] ,
names= [ 'First' , 'Second' ] )
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
索引常用方法
交换、删除(索引层)
np. random. seed( 0 )
L1, L2, L3 = [ 'A' , 'B' ] , [ 'a' , 'b' ] , [ 'alpha' , 'beta' ]
mul_index1 = pd. MultiIndex. from_product( [ L1, L2, L3] ,
names= ( 'Upper' , 'Lower' , 'Extra' ) )
L4, L5, L6 = [ 'C' , 'D' ] , [ 'c' , 'd' ] , [ 'cat' , 'dog' ]
mul_index2 = pd. MultiIndex. from_product( [ L4, L5, L6] ,
names= ( 'Big' , 'Small' , 'Other' ) )
df_ex = pd. DataFrame( np. random. randint( - 9 , 10 , ( 8 , 8 ) ) ,
index= mul_index1,
columns= mul_index2)
df_ex
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Upper Lower Extra A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9 beta -9 -5 -4 -3 -1 8 6 -5 b alpha 0 1 -8 -8 -2 0 -6 -3 beta 2 5 9 -9 5 -6 3 1
索引层内部交换
【swaplevel】、【reorder_levels】 索引层间的交换由 swaplevel 和 reorder_levels 完成, 前者只能交换两个层,而后者可以交换任意层, 两者都可以指定交换行索引或列索引
df_ex. swaplevel( 0 , 2 , axis= 1 ) . head( )
Other cat dog cat dog cat dog cat dog Small c c d d c c d d Big C C C C D D D D Upper Lower Extra A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9
df_ex. reorder_levels( [ 2 , 0 , 1 ] , axis= 0 ) . head( )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Extra Upper Lower alpha A a 3 6 -9 -6 -6 -2 0 9 beta A a -5 -3 3 -8 -3 -2 5 8 alpha A b -4 4 -1 0 7 -4 6 6 beta A b -9 9 -6 8 5 -2 -9 -8 alpha B a 0 -9 1 -6 2 9 -7 -9
df_ex. reorder_levels( [ 2 , 0 , 1 ] , axis= 1 ) . head( )
Other cat dog cat dog cat dog cat dog Big C C C C D D D D Small c c d d c c d d Upper Lower Extra A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9
删除某层索引
【droplevel】
df_ex. droplevel( 1 , axis= 1 )
Big C D Other cat dog cat dog cat dog cat dog Upper Lower Extra A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9 beta -9 -5 -4 -3 -1 8 6 -5 b alpha 0 1 -8 -8 -2 0 -6 -3 beta 2 5 9 -9 5 -6 3 1
df_ex. droplevel( [ 0 , 1 ] , axis= 0 )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Extra alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 alpha 0 -9 1 -6 2 9 -7 -9 beta -9 -5 -4 -3 -1 8 6 -5 alpha 0 1 -8 -8 -2 0 -6 -3 beta 2 5 9 -9 5 -6 3 1
属性修改(索引)
【rename_axis】修改索引层名 【rename_axis】可以对索引层的名字进行修改 修改方式:传入字典映射
df_ex. rename_axis( index= { 'Upper' : '000' } ,
columns= { 'Other' : '003' } ) . head( )
Big C D Small c d c d 003 cat dog cat dog cat dog cat dog 000 Lower Extra A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9
【rename】修改索引值 【rename】对索引进行修改, 如果是多级索引需要指定修改的层号 level
df_ex. rename( columns= { 'cat' : '猫' } ,
level= 2 ) . head( )
Big C D Small c d c d Other 猫 dog 猫 dog 猫 dog 猫 dog Upper Lower Extra A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9
传入参数也可以是函数,其输入值就是索引元素
df_ex. rename( index= lambda x: str . upper( x) ,
level= 2 ) . head( )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Upper Lower Extra A a ALPHA 3 6 -9 -6 -6 -2 0 9 BETA -5 -3 3 -8 -3 -2 5 8 b ALPHA -4 4 -1 0 7 -4 6 6 BETA -9 9 -6 8 5 -2 -9 -8 B a ALPHA 0 -9 1 -6 2 9 -7 -9
整个索引的元素替换,迭代器实现:
new_values = iter ( list ( 'abcdefgh' ) )
df_ex. rename( index= lambda x: next ( new_values) ,
level= 2 )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Upper Lower Extra A a a 3 6 -9 -6 -6 -2 0 9 b -5 -3 3 -8 -3 -2 5 8 b c -4 4 -1 0 7 -4 6 6 d -9 9 -6 8 5 -2 -9 -8 B a e 0 -9 1 -6 2 9 -7 -9 f -9 -5 -4 -3 -1 8 6 -5 b g 0 1 -8 -8 -2 0 -6 -3 h 2 5 9 -9 5 -6 3 1
对索引元素进行修改, 【单层索引】容易实现: 先取出索引的 values 属性, 再给对得到的列表进行修改, 最后再对 index 对象重新赋值。 【多级索引】进行修改(有些麻烦): 先把某一层索引临时转为表的元素, 然后再进行修改, 最后重新设定为索引。map 函数,定义在 Index 上的方法, 与前面 rename 方法中层的函数式用法类似, 只不过传入的不是层的标量值,而是直接传入索引的元组, 这为用户进行跨层的修改提供了遍历。
df_ex. rename( index= lambda x: str . upper( x) ,
level= 2 ) . head( )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Upper Lower Extra A a ALPHA 3 6 -9 -6 -6 -2 0 9 BETA -5 -3 3 -8 -3 -2 5 8 b ALPHA -4 4 -1 0 7 -4 6 6 BETA -9 9 -6 8 5 -2 -9 -8 B a ALPHA 0 -9 1 -6 2 9 -7 -9
df_temp = df_ex. copy( )
new_idx = df_temp. index. map ( lambda x: ( x[ 0 ] ,
x[ 1 ] ,
str . upper( x[ 2 ] ) ) )
df_temp. index = new_idx
df_temp. head( )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog Upper Lower Extra A a ALPHA 3 6 -9 -6 -6 -2 0 9 BETA -5 -3 3 -8 -3 -2 5 8 b ALPHA -4 4 -1 0 7 -4 6 6 BETA -9 9 -6 8 5 -2 -9 -8 B a ALPHA 0 -9 1 -6 2 9 -7 -9
map 的另一个使用方法是对多级索引的压缩
df_temp = df_ex. copy( )
new_idx = df_temp. index. map ( lambda x: ( x[ 0 ] + '-' +
x[ 1 ] + '-' +
x[ 2 ] ) )
df_temp. index = new_idx
df_temp. head( )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog A-a-alpha 3 6 -9 -6 -6 -2 0 9 A-a-beta -5 -3 3 -8 -3 -2 5 8 A-b-alpha -4 4 -1 0 7 -4 6 6 A-b-beta -9 9 -6 8 5 -2 -9 -8 B-a-alpha 0 -9 1 -6 2 9 -7 -9
也可以反向地展开:
new_idx = df_temp. index. map ( lambda x: tuple ( x. split( '-' ) ) )
df_temp. index = new_idx
df_temp. head( )
Big C D Small c d c d Other cat dog cat dog cat dog cat dog A a alpha 3 6 -9 -6 -6 -2 0 9 beta -5 -3 3 -8 -3 -2 5 8 b alpha -4 4 -1 0 7 -4 6 6 beta -9 9 -6 8 5 -2 -9 -8 B a alpha 0 -9 1 -6 2 9 -7 -9
设置与重设(索引)
df_new = pd. DataFrame( { 'A' : list ( 'aacd' ) ,
'B' : list ( 'PQRT' ) ,
'C' : [ 1 , 2 , 3 , 4 ] } )
df_new
【set_index】索引的设置 【append】参数:是否来保留原来的索引 直接把新设定的添加到原索引的内层
df_new. set_index( 'A' )
df_new. set_index( 'A' , append= True )
可以同时指定多个列作为索引:
df_new. set_index( [ 'A' , 'B' ] )
新设索引 如果添加索引的列原表中没有,直接参数导入新Series :
my_index = pd. Series( list ( 'WXYZ' ) , name= 'D' )
df_new = df_new. set_index( [ 'A' , my_index] )
df_new
【reset_index】set_index 的逆函数, 【drop】参数:是否丢弃将去掉的索引层,而不是添加到列中:
df_new. reset_index( [ 'D' ] )
df_new. reset_index( [ 'D' ] , drop= True )
如果重置了所有的索引,那么 pandas 会直接重新生成一个默认索引
df_new. reset_index( )
变形(索引)
给定一个新的索引, 把原表中相应的索引对应元素填充到新索引构成的表中
df_reindex = pd. DataFrame( { "Weight" : [ 60 , 70 , 80 ] ,
"Height" : [ 176 , 180 , 179 ] } ,
index= [ '1001' , '1003' , '1002' ] )
df_reindex
Weight Height 1001 60 176 1003 70 180 1002 80 179
df_reindex. reindex( index= [ '1001' , '1002' , '1003' , '1004' ] ,
columns= [ 'Weight' , 'Gender' ] )
Weight Gender 1001 60.0 NaN 1002 80.0 NaN 1003 70.0 NaN 1004 NaN NaN
还有一个与 reindex 功能类似的函数是 【reindex_like】:仿照传入的表的索引来进行被调用表索引的变形
df_existed = pd. DataFrame( index= [ '1001' , '1002' , '1003' , '1004' ] ,
columns= [ 'Weight' , 'Gender' ] )
df_reindex. reindex_like( df_existed)
Weight Gender 1001 60.0 NaN 1002 80.0 NaN 1003 70.0 NaN 1004 NaN NaN
索引运算
集合运算
利用集合运算来取出符合条件行 例如有两张表 A 和 B ,它们的索引都是员工编号,现在需要筛选出两表索引交集的所有员工信息,此时通过 Index 上的运算操作就很容易实现。SA.intersection(SB) = SA ∩ SB ⇔ {x|x ∈ SA and x ∈ SB} SA.union(SB) = SA ∪ SB ⇔ {x|x ∈ SA or x ∈ SB} SA.difference(SB) = SA − SB ⇔ {x|x ∈ SA and x /∈ SB} SA.symmetric_difference(SB) = SA△SB ⇔ {x|x ∈ SA ∪ SB − SA ∩ SB}
索引运算
先用 unique 去重后再进行运算
df_set_1 = pd. DataFrame( [ [ 0 , 1 ] , [ 1 , 2 ] , [ 3 , 4 ] ] ,
index = pd. Index( [ 'a' , 'b' , 'a' ] , name= 'id1' ) )
df_set_2 = pd. DataFrame( [ [ 4 , 5 ] , [ 2 , 6 ] , [ 7 , 1 ] ] ,
index = pd. Index( [ 'b' , 'b' , 'c' ] , name= 'id2' ) )
id1, id2 = df_set_1. index. unique( ) , df_set_2. index. unique( )
id1, id2
(Index(['a', 'b'], dtype='object', name='id1'),
Index(['b', 'c'], dtype='object', name='id2'))
df_set_1
df_set_2
id1. intersection( id2)
Index(['b'], dtype='object')
id1. union( id2)
Index(['a', 'b', 'c'], dtype='object')
id1. difference( id2)
Index(['a'], dtype='object')
id1. symmetric_difference( id2)
Index(['a', 'c'], dtype='object')
上述的四类运算还可以用等价的符号表示代替如下:
id1 & id2
Index(['b'], dtype='object')
id1 | id2
Index(['a', 'b', 'c'], dtype='object')
( id1 ^ id2) & id1
Index(['a'], dtype='object')
id1 ^ id2
Index(['a', 'c'], dtype='object')
无索引情况 两张表需要做集合运算的列并没有被设置索引 方法一:先转成索引,运算后再恢复 方法二:利用 isin 函数
df_set_1
df_set_2
df_set_in_col_1 = df_set_1. reset_index( )
df_set_in_col_2 = df_set_2. reset_index( )
df_set_in_col_1
df_set_in_col_2
id1
Index(['a', 'b'], dtype='object', name='id1')
【isin()】接收一个列表,判断该列中元素是否在列表中,同时对多个列过滤 df[df[某列].isin(条件)&df[某列].isin(条件)]
df_set_in_col_1[ df_set_in_col_1. id1. isin( df_set_in_col_2. id2) ]
练习
Ex1:公司员工数据集
df = pd. read_csv( 'company.csv' )
df. head( )
EmployeeID birthdate_key age city_name department job_title gender 0 1318 1/3/1954 61 Vancouver Executive CEO M 1 1319 1/3/1957 58 Vancouver Executive VP Stores F 2 1320 1/2/1955 60 Vancouver Executive Legal Counsel F 3 1321 1/2/1959 56 Vancouver Executive VP Human Resources M 4 1322 1/9/1958 57 Vancouver Executive VP Finance M
1、分别只使用 query 和 loc 选出年龄不超过四十岁且工作部门为 Dairy 或 Bakery 的男性。
df. query( '((age <= 40)&'
'(department == "Dairy"))|'
'((age <= 40)&'
'(department == "Bakery"))' )
EmployeeID birthdate_key age city_name department job_title gender 3608 5788 1/12/1975 40 Chilliwack Dairy Dairy Person F 3609 5789 1/13/1975 40 Chilliwack Dairy Dairy Person F 3610 5790 1/13/1975 40 Kelowna Dairy Dairy Person F 3611 5791 1/14/1975 40 Kelowna Dairy Dairy Person M 3613 5793 1/22/1975 40 Richmond Bakery Baker M ... ... ... ... ... ... ... ... 6136 8335 12/28/1994 21 Vancouver Dairy Dairy Person F 6137 8336 12/31/1994 21 Vancouver Dairy Dairy Person M 6270 6312 5/14/1979 36 Grand Forks Dairy Dairy Person M 6271 6540 2/14/1981 34 Victoria Bakery Baker M 6278 7801 10/18/1990 25 Abbotsford Dairy Dairy Person F
866 rows × 7 columns
df1 = df. loc[ df. department. isin( [ 'Dairy' , 'Bakery' ] ) ]
df1. loc[ df1. age <= 40 ]
EmployeeID birthdate_key age city_name department job_title gender 31 1363 3/20/1949 66 Victoria Bakery Bakery Manager M 40 1377 4/15/1949 66 Victoria Dairy Dairy Person F 64 1408 6/12/1949 66 Nanaimo Bakery Bakery Manager F 73 1422 7/18/1949 66 Vancouver Dairy Dairy Person M 74 1425 7/20/1949 66 Vancouver Dairy Dairy Person M ... ... ... ... ... ... ... ... 6265 5245 7/13/1970 45 Kelowna Bakery Baker F 6266 5347 6/28/1971 44 Terrace Dairy Dairy Person F 6270 6312 5/14/1979 36 Grand Forks Dairy Dairy Person M 6271 6540 2/14/1981 34 Victoria Bakery Baker M 6278 7801 10/18/1990 25 Abbotsford Dairy Dairy Person F
1931 rows × 7 columns
2、选出员工 ID 号为奇数所在行的第 1、第 3 和倒数第 2 列。
df[ df[ 'EmployeeID' ] % 2 == 1 ] . iloc[ : , [ 0 , 2 , - 2 ] ]
EmployeeID age job_title 1 1319 58 VP Stores 3 1321 56 VP Human Resources 5 1323 53 Exec Assistant, VP Stores 6 1325 51 Exec Assistant, Legal Counsel 8 1329 48 Store Manager ... ... ... ... 6276 7659 26 Cashier 6277 7741 25 Cashier 6278 7801 25 Dairy Person 6280 8181 22 Cashier 6281 8223 21 Cashier
3126 rows × 3 columns