第2章 索引
大致内容
import numpy as np
import pandas as pd
df = pd. read_csv( 'data/table.csv' , index_col= 'ID' )
df. head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+
一、单级索引
1. loc方法、iloc方法、[]操作符
最常用的索引方法可能就是这三类,其中iloc表示位置索引,loc表示标签索引,[]也具有很大的便利性,各有特点
(a)loc方法(注意:所有在loc中使用的切片全部包含右端点!)
① 单行索引:
返回对应行的Series
df. loc[ 1103 ]
School S_1
Class C_1
Gender M
Address street_2
Height 186
Weight 82
Math 87.2
Physics B+
Name: 1103 , dtype: object
② 多行索引:
返回1102和2304行 组成DataFrame
df. loc[ [ 1102 , 2304 ] ]
School Class Gender Address Height Weight Math Physics ID 1102 S_1 C_1 F street_2 192 73 32.5 B+ 2304 S_2 C_3 F street_6 164 81 95.5 A-
切片访问 开头与结尾都取到
重点差别
df. loc[ 1304 : ] . head( )
School Class Gender Address Height Weight Math Physics ID 1304 S_1 C_3 M street_2 195 70 85.2 A 1305 S_1 C_3 F street_5 187 69 61.7 B- 2101 S_2 C_1 M street_7 174 84 83.3 C 2102 S_2 C_1 F street_6 161 61 50.6 B+ 2103 S_2 C_1 M street_4 157 61 52.5 B-
df. loc[ 2402 : : - 1 ] . head( 60 )
School Class Gender Address Height Weight Math Physics ID 2402 S_2 C_4 M street_7 166 82 48.7 B 2401 S_2 C_4 F street_2 192 62 45.3 A 2305 S_2 C_3 M street_4 187 73 48.9 B 2304 S_2 C_3 F street_6 164 81 95.5 A- 2303 S_2 C_3 F street_7 190 99 65.9 C 2302 S_2 C_3 M street_5 171 88 32.7 A 2301 S_2 C_3 F street_4 157 78 72.3 B+ 2205 S_2 C_2 F street_7 183 76 85.4 B 2204 S_2 C_2 M street_1 175 74 47.2 B- 2203 S_2 C_2 M street_4 155 91 73.8 A+ 2202 S_2 C_2 F street_7 194 77 68.5 B+ 2201 S_2 C_2 M street_5 193 100 39.1 B 2105 S_2 C_1 M street_4 170 81 34.2 A 2104 S_2 C_1 F street_5 159 97 72.2 B+ 2103 S_2 C_1 M street_4 157 61 52.5 B- 2102 S_2 C_1 F street_6 161 61 50.6 B+ 2101 S_2 C_1 M street_7 174 84 83.3 C 1305 S_1 C_3 F street_5 187 69 61.7 B- 1304 S_1 C_3 M street_2 195 70 85.2 A 1303 S_1 C_3 M street_7 188 82 49.7 B 1302 S_1 C_3 F street_1 175 57 87.7 A- 1301 S_1 C_3 M street_4 161 68 31.5 B+ 1205 S_1 C_2 F street_6 167 63 68.4 B- 1204 S_1 C_2 F street_5 162 63 33.8 B 1203 S_1 C_2 M street_6 160 53 58.8 A+ 1202 S_1 C_2 F street_4 176 94 63.5 B- 1201 S_1 C_2 M street_5 188 68 97.0 A- 1105 S_1 C_1 F street_4 159 64 84.8 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1101 S_1 C_1 M street_1 173 63 34.0 A+
③ 单列索引:
Series类型
df. loc[ : , 'Height' ] . head( )
ID
1101 173
1102 192
1103 186
1104 167
1105 159
Name: Height, dtype: int64
④ 多列索引:
df. loc[ : , [ 'Height' , 'Math' ] ] . head( )
Height Math ID 1101 173 34.0 1102 192 32.5 1103 186 87.2 1104 167 80.4 1105 159 84.8
这里取得是切片
从Height列到Math列 组成DataFrame
df. loc[ : , 'Height' : 'Math' ] . head( )
Height Weight Math ID 1101 173 63 34.0 1102 192 73 32.5 1103 186 82 87.2 1104 167 81 80.4 1105 159 64 84.8
⑤ 联合索引:
对于行列同时切片 [start:end:step]
df. loc[ 1102 : 2401 : 3 , 'Height' : 'Math' ] . head( )
Height Weight Math ID 1102 192 73 32.5 1105 159 64 84.8 1203 160 53 58.8 1301 161 68 31.5 1304 195 70 85.2
标签值索引与位置索引区别
s = pd. Series( [ 1 , 2 , 3 ] , index= [ 1.1 , 2.2 , 3.3 ] )
s = pd. Series( [ 1 , 2 , 3 ] , index= [ 1 , 3 , 5 ] )
⑥ 函数式索引:
loc中使用的函数操作的对象
是前面的DataFrame
df. loc[ lambda x: x[ 'Gender' ] == 'M' ] . head( 10 )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1201 S_1 C_2 M street_5 188 68 97.0 A- 1203 S_1 C_2 M street_6 160 53 58.8 A+ 1301 S_1 C_3 M street_4 161 68 31.5 B+ 1303 S_1 C_3 M street_7 188 82 49.7 B 1304 S_1 C_3 M street_2 195 70 85.2 A 2101 S_2 C_1 M street_7 174 84 83.3 C 2103 S_2 C_1 M street_4 157 61 52.5 B- 2105 S_2 C_1 M street_4 170 81 34.2 A
df. loc[ [ x for x in [ 1101 , 1103 ] ] ]
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1103 S_1 C_1 M street_2 186 82 87.2 B+
⑦ 布尔索引(将重点在第2节介绍)
df. loc[ df[ 'Address' ] . isin( [ 'street_7' , 'street_4' ] ) ] . head( )
School Class Gender Address Height Weight Math Physics ID 1105 S_1 C_1 F street_4 159 64 84.8 B+ 1202 S_1 C_2 F street_4 176 94 63.5 B- 1301 S_1 C_3 M street_4 161 68 31.5 B+ 1303 S_1 C_3 M street_7 188 82 49.7 B 2101 S_2 C_1 M street_7 174 84 83.3 C
df[ 'Address' ] . values
array(['street_1', 'street_2', 'street_2', 'street_2', 'street_4',
'street_5', 'street_4', 'street_6', 'street_5', 'street_6',
'street_4', 'street_1', 'street_7', 'street_2', 'street_5',
'street_7', 'street_6', 'street_4', 'street_5', 'street_4',
'street_5', 'street_7', 'street_4', 'street_1', 'street_7',
'street_4', 'street_5', 'street_7', 'street_6', 'street_4',
'street_2', 'street_7', 'street_6', 'street_2', 'street_6'],
dtype=object)
df. loc[ [ True if i[ - 1 ] == '4' or i[ - 1 ] == '7' else False for i in df[ 'Address' ] . values] ] . head( )
School Class Gender Address Height Weight Math Physics ID 1105 S_1 C_1 F street_4 159 64 84.8 B+ 1202 S_1 C_2 F street_4 176 94 63.5 B- 1301 S_1 C_3 M street_4 161 68 31.5 B+ 1303 S_1 C_3 M street_7 188 82 49.7 B 2101 S_2 C_1 M street_7 174 84 83.3 C 2103 S_2 C_1 M street_4 157 61 52.5 B- 2105 S_2 C_1 M street_4 170 81 34.2 A 2202 S_2 C_2 F street_7 194 77 68.5 B+ 2203 S_2 C_2 M street_4 155 91 73.8 A+ 2205 S_2 C_2 F street_7 183 76 85.4 B 2301 S_2 C_3 F street_4 157 78 72.3 B+ 2303 S_2 C_3 F street_7 190 99 65.9 C 2305 S_2 C_3 M street_4 187 73 48.9 B 2402 S_2 C_4 M street_7 166 82 48.7 B
小节:本质上说,loc中能传入的只有布尔列表和索引子集构成的列表,只要把握这个原则就很容易理解上面那些操作
(b)iloc方法(注意与loc不同,切片右端点不包含)
① 单行索引:
df. iloc[ 3 ]
School S_1
Class C_1
Gender F
Address street_2
Height 167
Weight 81
Math 80.4
Physics B-
Name: 1104, dtype: object
ID
1102 192
1103 186
1104 167
1105 159
1201 188
1202 176
1203 160
1204 162
1205 167
1301 161
1302 175
1303 188
1304 195
1305 187
2101 174
2102 161
2103 157
2104 159
2105 170
2201 193
2202 194
2203 155
2204 175
2205 183
2301 157
2302 171
2303 190
2304 164
2305 187
2401 192
2402 166
2403 158
2404 160
2405 193
Name: Height, dtype: int64
② 多行索引:
df. iloc[ : 5 ]
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+
③ 单列索引:
df. iloc[ : , 3 ] . head( )
ID
1101 street_1
1102 street_2
1103 street_2
1104 street_2
1105 street_4
Name: Address, dtype: object
④ 多列索引:
df. iloc[ : , 7 : : - 2 ] . head( 10 )
Physics Weight Address Class ID 1101 A+ 63 street_1 C_1 1102 B+ 73 street_2 C_1 1103 B+ 82 street_2 C_1 1104 B- 81 street_2 C_1 1105 B+ 64 street_4 C_1 1201 A- 68 street_5 C_2 1202 B- 94 street_4 C_2 1203 A+ 53 street_6 C_2 1204 B 63 street_5 C_2 1205 B- 63 street_6 C_2
⑤ 混合索引:
df. iloc[ 3 : : 4 , 7 : : - 2 ] . head( )
Physics Weight Address Class ID 1104 B- 81 street_2 C_1 1203 A+ 53 street_6 C_2 1302 A- 57 street_1 C_3 2101 C 84 street_7 C_1 2105 A 81 street_4 C_1
⑥ 函数式索引:
df. iloc[ lambda x: [ 3 ] ] . head( )
School Class Gender Address Height Weight Math Physics ID 1104 S_1 C_1 F street_2 167 81 80.4 B-
小节:由上所述,iloc中接收的参数只能为整数或整数列表,不能使用布尔索引 纯元素Series布尔索引报错 将其转为list列表就可以 或者在后边加上values就可
(c) []操作符
如果不想陷入困境,请不要在行索引为浮点时使用[]操作符,因为在Series中的浮点[]并不是进行位置比较,而是值比较,非常特殊
(c.1)Series的[]操作
① 单元素索引:
s = pd. Series( df[ 'Math' ] , index= df. index)
s[ 1101 ]
34.0
② 多行索引:
使用的是绝对位置的整数切片,与元素无关,这里容易混淆
s[ 0 : 4 ]
ID
1101 34.0
1102 32.5
1103 87.2
1104 80.4
Name: Math, dtype: float64
③ 函数式索引:
注意使用lambda函数时,直接切片(如:s[lambda x: 16::-6])就报错,此时使用的不是绝对位置切片,而是元素切片,非常易错
s[ lambda x: x. index[ 16 : : - 6 ] ]
ID
2102 50.6
1301 31.5
1105 84.8
Name: Math, dtype: float64
④ 布尔索引:
s[ s> 80 ]
ID
1103 87.2
1104 80.4
1105 84.8
1201 97.0
1302 87.7
1304 85.2
2101 83.3
2205 85.4
2304 95.5
Name: Math, dtype: float64
(c.2)DataFrame的[]操作
① 单行索引:
df[ 1 : 2 ]
School Class Gender Address Height Weight Math Physics ID 1102 S_1 C_1 F street_2 192 73 32.5 B+
row = df. index. get_loc( 1102 )
df[ row: row+ 1 ]
School Class Gender Address Height Weight Math Physics ID 1102 S_1 C_1 F street_2 192 73 32.5 B+
② 多行索引:
用切片,如果是选取指定的某几行,推荐使用loc,否则很可能报错
df[ 3 : 5 ]
School Class Gender Address Height Weight Math Physics ID 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+
③ 单列索引:
df[ 'School' ] . head( )
ID
1101 S_1
1102 S_1
1103 S_1
1104 S_1
1105 S_1
Name: School, dtype: object
④ 多列索引:
df[ [ 'School' , 'Math' ] ] . head( )
School Math ID 1101 S_1 34.0 1102 S_1 32.5 1103 S_1 87.2 1104 S_1 80.4 1105 S_1 84.8
⑤函数式索引:
df[ lambda x: [ 'Math' , 'Physics' ] ] . head( )
Math Physics ID 1101 34.0 A+ 1102 32.5 B+ 1103 87.2 B+ 1104 80.4 B- 1105 84.8 B+
⑥ 布尔索引:
df[ df[ 'Gender' ] == 'F' ] . head( )
<class 'pandas.core.series.Series'>
小节:一般来说,[]操作符常用于列选择或布尔选择,尽量避免行的选择
2. 布尔索引
(a)布尔符号:’&’,’|’,’~’:分别代表和and,或or,取反not
df[ ( df[ 'Gender' ] == 'F' ) & ( df[ 'Address' ] == 'street_2' ) ] . head( )
School Class Gender Address Height Weight Math Physics ID 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 2401 S_2 C_4 F street_2 192 62 45.3 A 2404 S_2 C_4 F street_2 160 84 67.7 B
df[ ( df[ 'Math' ] > 85 ) | ( df[ 'Address' ] == 'street_7' ) ] . head( )
School Class Gender Address Height Weight Math Physics ID 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1201 S_1 C_2 M street_5 188 68 97.0 A- 1302 S_1 C_3 F street_1 175 57 87.7 A- 1303 S_1 C_3 M street_7 188 82 49.7 B 1304 S_1 C_3 M street_2 195 70 85.2 A
df[ ~ ( ( df[ 'Math' ] > 75 ) | ( df[ 'Address' ] == 'street_1' ) ) ] . head( )
School Class Gender Address Height Weight Math Physics ID 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1202 S_1 C_2 F street_4 176 94 63.5 B- 1203 S_1 C_2 M street_6 160 53 58.8 A+ 1204 S_1 C_2 F street_5 162 63 33.8 B 1205 S_1 C_2 F street_6 167 63 68.4 B-
loc和[]中相应位置都能使用布尔列表选择:
df[ : 8 ] [ 'Address' ] == 'street_6'
ID
1101 False
1102 False
1103 False
1104 False
1105 False
1201 False
1202 False
1203 True
Name: Address, dtype: bool
( df[ : 8 ] [ 'Address' ] == 'street_6' ) . values
array([False, False, False, False, False, False, False, True])
如果不加values就会索引对齐发生错误,Pandas中的索引对齐是一个重要特征,很多时候非常使用
df. loc[ df[ 'Math' ] > 60 , ( df[ : 8 ] [ 'Address' ] == 'street_6' ) . values] . head( )
Physics ID 1103 B+ 1104 B- 1105 B+ 1201 A- 1202 B-
(b) isin方法
df[ df[ 'Address' ] . isin( [ 'street_1' , 'street_4' ] ) & df[ 'Physics' ] . isin( [ 'A' , 'A+' ] ) ]
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 2105 S_2 C_1 M street_4 170 81 34.2 A 2203 S_2 C_2 M street_4 155 91 73.8 A+
all与&的思路是类似的,其中的1代表按照跨列方向判断是否全为True
df[ df[ [ 'Address' , 'Physics' ] ] . isin( { 'Address' : [ 'street_1' , 'street_4' ] , 'Physics' : [ 'A' , 'A+' ] } ) . all ( 1 ) ]
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 2105 S_2 C_1 M street_4 170 81 34.2 A 2203 S_2 C_2 M street_4 155 91 73.8 A+
3. 快速标量索引
当只需要取一个元素时,at和iat方法能够提供更快的实现:
display( df. at[ 1101 , 'School' ] )
df. at[ 1101 , 'School' ]
'S_1'
'S_1'
4. 区间索引
此处介绍并不是说只能在单级索引中使用区间索引,只是作为一种特殊类型的索引方式,在此处先行介绍
(a)利用interval_range方法
pd. interval_range( start= 0 , end= 5 )
IntervalIndex( [ ( 0 , 1 ] , ( 1 , 2 ] , ( 2 , 3 ] , ( 3 , 4 ] , ( 4 , 5 ] ] ,
closed= 'right' ,
dtype= 'interval[int64]' )
pd. interval_range( start= 0 , periods= 10 , freq= 5 )
IntervalIndex( [ ( 0 , 5 ] , ( 5 , 10 ] , ( 10 , 15 ] , ( 15 , 20 ] , ( 20 , 25 ] , ( 25 , 30 ] , ( 30 , 35 ] , ( 35 , 40 ] , ( 40 , 45 ] , ( 45 , 50 ] ] ,
closed= 'right' ,
dtype= 'interval[int64]' )
(b)利用cut将数值列转为区间为元素的分类变量,例如统计数学成绩的区间情况:
math_interval = pd. cut( df[ 'Math' ] , bins= [ 0 , 40 , 60 , 80 , 100 ] )
math_interval. head( )
ID
1101 ( 0 , 40 ]
1102 ( 0 , 40 ]
1103 ( 80 , 100 ]
1104 ( 80 , 100 ]
1105 ( 80 , 100 ]
Name: Math, dtype: category
Categories ( 4 , interval[ int64] ) : [ ( 0 , 40 ] < ( 40 , 60 ] < ( 60 , 80 ] < ( 80 , 100 ] ]
df. join( math_interval, rsuffix= '_interval' )
School Class Gender Address Height Weight Math Physics Math_interval ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ (0, 40] 1102 S_1 C_1 F street_2 192 73 32.5 B+ (0, 40] 1103 S_1 C_1 M street_2 186 82 87.2 B+ (80, 100] 1104 S_1 C_1 F street_2 167 81 80.4 B- (80, 100] 1105 S_1 C_1 F street_4 159 64 84.8 B+ (80, 100] 1201 S_1 C_2 M street_5 188 68 97.0 A- (80, 100] 1202 S_1 C_2 F street_4 176 94 63.5 B- (60, 80] 1203 S_1 C_2 M street_6 160 53 58.8 A+ (40, 60] 1204 S_1 C_2 F street_5 162 63 33.8 B (0, 40] 1205 S_1 C_2 F street_6 167 63 68.4 B- (60, 80] 1301 S_1 C_3 M street_4 161 68 31.5 B+ (0, 40] 1302 S_1 C_3 F street_1 175 57 87.7 A- (80, 100] 1303 S_1 C_3 M street_7 188 82 49.7 B (40, 60] 1304 S_1 C_3 M street_2 195 70 85.2 A (80, 100] 1305 S_1 C_3 F street_5 187 69 61.7 B- (60, 80] 2101 S_2 C_1 M street_7 174 84 83.3 C (80, 100] 2102 S_2 C_1 F street_6 161 61 50.6 B+ (40, 60] 2103 S_2 C_1 M street_4 157 61 52.5 B- (40, 60] 2104 S_2 C_1 F street_5 159 97 72.2 B+ (60, 80] 2105 S_2 C_1 M street_4 170 81 34.2 A (0, 40] 2201 S_2 C_2 M street_5 193 100 39.1 B (0, 40] 2202 S_2 C_2 F street_7 194 77 68.5 B+ (60, 80] 2203 S_2 C_2 M street_4 155 91 73.8 A+ (60, 80] 2204 S_2 C_2 M street_1 175 74 47.2 B- (40, 60] 2205 S_2 C_2 F street_7 183 76 85.4 B (80, 100] 2301 S_2 C_3 F street_4 157 78 72.3 B+ (60, 80] 2302 S_2 C_3 M street_5 171 88 32.7 A (0, 40] 2303 S_2 C_3 F street_7 190 99 65.9 C (60, 80] 2304 S_2 C_3 F street_6 164 81 95.5 A- (80, 100] 2305 S_2 C_3 M street_4 187 73 48.9 B (40, 60] 2401 S_2 C_4 F street_2 192 62 45.3 A (40, 60] 2402 S_2 C_4 M street_7 166 82 48.7 B (40, 60] 2403 S_2 C_4 F street_6 158 60 59.7 B+ (40, 60] 2404 S_2 C_4 F street_2 160 84 67.7 B (60, 80] 2405 S_2 C_4 F street_6 193 54 47.6 B (40, 60]
df. join( math_interval, rsuffix= '_interval' ) [ [ 'Math' , 'Math_interval' ] ]
Math Math_interval ID 1101 34.0 (0, 40] 1102 32.5 (0, 40] 1103 87.2 (80, 100] 1104 80.4 (80, 100] 1105 84.8 (80, 100] 1201 97.0 (80, 100] 1202 63.5 (60, 80] 1203 58.8 (40, 60] 1204 33.8 (0, 40] 1205 68.4 (60, 80] 1301 31.5 (0, 40] 1302 87.7 (80, 100] 1303 49.7 (40, 60] 1304 85.2 (80, 100] 1305 61.7 (60, 80] 2101 83.3 (80, 100] 2102 50.6 (40, 60] 2103 52.5 (40, 60] 2104 72.2 (60, 80] 2105 34.2 (0, 40] 2201 39.1 (0, 40] 2202 68.5 (60, 80] 2203 73.8 (60, 80] 2204 47.2 (40, 60] 2205 85.4 (80, 100] 2301 72.3 (60, 80] 2302 32.7 (0, 40] 2303 65.9 (60, 80] 2304 95.5 (80, 100] 2305 48.9 (40, 60] 2401 45.3 (40, 60] 2402 48.7 (40, 60] 2403 59.7 (40, 60] 2404 67.7 (60, 80] 2405 47.6 (40, 60]
(c)区间索引的选取
df_i = df. join( math_interval, rsuffix= '_interval' ) [ [ 'Math' , 'Math_interval' ] ] \
. reset_index( ) . set_index( 'Math_interval' )
df_i. head( )
ID Math Math_interval (0, 40] 1101 34.0 (0, 40] 1102 32.5 (80, 100] 1103 87.2 (80, 100] 1104 80.4 (80, 100] 1105 84.8
df_i. loc[ 65 ] . head( )
ID Math Math_interval (60, 80] 1202 63.5 (60, 80] 1205 68.4 (60, 80] 1305 61.7 (60, 80] 2104 72.2 (60, 80] 2202 68.5
df_i. loc[ [ 65 , 90 ] ] . head( )
ID Math Math_interval (60, 80] 1202 63.5 (60, 80] 1205 68.4 (60, 80] 1305 61.7 (60, 80] 2104 72.2 (60, 80] 2202 68.5
df_i. index. astype( 'interval' ) . overlaps( pd. Interval( 70 , 85 ) )
array( [ False , False , True , True , True , True , True , False , False ,
True , False , True , False , True , True , True , False , False ,
True , False , False , True , True , False , True , True , False ,
True , True , False , False , False , False , True , False ] )
如果想要选取某个区间,先要把分类变量转为区间变量,再使用overlap方法:
overlaps Check whether two Interval objects overlap.如果重叠返回true
df_i[ df_i. index. astype( 'interval' ) . overlaps( pd. Interval( 70 , 85 ) ) ] . head( )
ID Math Math_interval (80, 100] 1103 87.2 (80, 100] 1104 80.4 (80, 100] 1105 84.8 (80, 100] 1201 97.0 (60, 80] 1202 63.5
二、多级索引
1. 创建多级索引
(a)通过from_tuple或from_arrays
① 直接创建元组
tuples = [ ( 'A' , 'a' ) , ( 'A' , 'b' ) , ( 'B' , 'a' ) , ( 'B' , 'b' ) ]
mul_index = pd. MultiIndex. from_tuples( tuples, names= ( 'Upper' , 'Lower' ) )
mul_index
pandas.core.indexes.multi.MultiIndex
pd. DataFrame( { 'Score' : [ 'perfect' , 'good' , 'fair' , 'bad' ] } , index= mul_index)
Score Upper Lower A a perfect b good B a fair b bad
L1 = list ( 'AABB' )
L2 = list ( 'abab' )
list ( zip ( L1, L2) )
[ ( 'A' , 'a' ) , ( 'A' , 'b' ) , ( 'B' , 'a' ) , ( 'B' , 'b' ) ]
② 利用zip创建元组
L1 = list ( 'AABB' )
L2 = list ( 'abab' )
tuples = list ( zip ( L1, L2) )
mul_index = pd. MultiIndex. from_tuples( tuples, names= ( 'Upper' , 'Lower' ) )
pd. DataFrame( { 'Score' : [ 'perfect' , 'good' , 'fair' , 'bad' ] } , index= mul_index)
Score Upper Lower A a perfect b good B a fair b bad
③ 通过Array创建
arrays = [ [ 'A' , 'a' ] , [ 'A' , 'b' ] , [ 'B' , 'a' ] , [ 'B' , 'b' ] ]
mul_index = pd. MultiIndex. from_tuples( arrays, names= ( 'Upper' , 'Lower' ) )
pd. DataFrame( { 'Score' : [ 'perfect' , 'good' , 'fair' , 'bad' ] } , index= mul_index)
Score Upper Lower A a perfect b good B a fair b bad
mul_index
MultiIndex( [ ( 'A' , 'a' ) ,
( 'A' , 'b' ) ,
( 'B' , 'a' ) ,
( 'B' , 'b' ) ] ,
names= [ 'Upper' , 'Lower' ] )
(b)通过from_product
L1 = [ 'A' , 'B' ]
L2 = [ 'a' , 'b' ]
pd. MultiIndex. from_product( [ L1, L2] , names= ( 'Upper' , 'Lower' ) )
MultiIndex( levels= [ [ 'A' , 'B' ] , [ 'a' , 'b' ] ] ,
codes= [ [ 0 , 0 , 1 , 1 ] , [ 0 , 1 , 0 , 1 ] ] ,
names= [ 'Upper' , 'Lower' ] )
(c)指定df中的列创建(set_index方法)
df_using_mul = df. set_index( [ 'Class' , 'Address' ] )
df_using_mul. head( )
School Gender Height Weight Math Physics Class Address C_1 street_1 S_1 M 173 63 34.0 A+ street_2 S_1 F 192 73 32.5 B+ street_2 S_1 M 186 82 87.2 B+ street_2 S_1 F 167 81 80.4 B- street_4 S_1 F 159 64 84.8 B+
2. 多层索引切片
df_using_mul. head( )
School Gender Height Weight Math Physics Class Address C_1 street_1 S_1 M 173 63 34.0 A+ street_2 S_1 F 192 73 32.5 B+ street_2 S_1 M 186 82 87.2 B+ street_2 S_1 F 167 81 80.4 B- street_4 S_1 F 159 64 84.8 B+
(a)一般切片
当索引不排序时,单个索引会报出性能警告 is_lexsorted()该函数检查是否排序
df_using_mul. sort_index( ) . loc[ 'C_2' , 'street_5' ]
School Gender Height Weight Math Physics Class Address C_2 street_5 S_1 M 188 68 97.0 A- street_5 S_1 F 162 63 33.8 B street_5 S_2 M 193 100 39.1 B
df_using_mul. sort_index( ) . loc[ ( 'C_2' , 'street_6' ) : ( 'C_3' , 'street_4' ) ]
School Gender Height Weight Math Physics Class Address C_2 street_6 S_1 M 160 53 58.8 A+ street_6 S_1 F 167 63 68.4 B- street_7 S_2 F 194 77 68.5 B+ street_7 S_2 F 183 76 85.4 B C_3 street_1 S_1 F 175 57 87.7 A- street_2 S_1 M 195 70 85.2 A street_4 S_1 M 161 68 31.5 B+ street_4 S_2 F 157 78 72.3 B+ street_4 S_2 M 187 73 48.9 B
df_using_mul. sort_index( ) . loc[ ( 'C_2' , 'street_7' ) : 'C_3' ] . head( )
School Gender Height Weight Math Physics Class Address C_2 street_7 S_2 F 194 77 68.5 B+ street_7 S_2 F 183 76 85.4 B C_3 street_1 S_1 F 175 57 87.7 A- street_2 S_1 M 195 70 85.2 A street_4 S_1 M 161 68 31.5 B+
(b)第一类特殊情况:由元组构成列表
df_using_mul. sort_index( ) . loc[ [ ( 'C_2' , 'street_7' ) , ( 'C_3' , 'street_2' ) ] ]
School Gender Height Weight Math Physics Class Address C_2 street_7 S_2 F 194 77 68.5 B+ street_7 S_2 F 183 76 85.4 B C_3 street_2 S_1 M 195 70 85.2 A
(c)第二类特殊情况:由列表构成元组
df_using_mul. sort_index( ) . loc[ ( [ 'C_2' , 'C_3' ] , [ 'street_4' , 'street_7' ] ) , : ]
School Gender Height Weight Math Physics Class Address C_2 street_4 S_1 F 176 94 63.5 B- street_4 S_2 M 155 91 73.8 A+ street_7 S_2 F 194 77 68.5 B+ street_7 S_2 F 183 76 85.4 B C_3 street_4 S_1 M 161 68 31.5 B+ street_4 S_2 F 157 78 72.3 B+ street_4 S_2 M 187 73 48.9 B street_7 S_1 M 188 82 49.7 B street_7 S_2 F 190 99 65.9 C
3. 多层索引中的slice对象
L1, L2 = [ 'A' , 'B' , 'C' ] , [ 'a' , 'b' , 'c' ]
mul_index1 = pd. MultiIndex. from_product( [ L1, L2] , names= ( 'Upper' , 'Lower' ) )
L3, L4 = [ 'D' , 'E' , 'F' ] , [ 'd' , 'e' , 'f' ]
mul_index2 = pd. MultiIndex. from_product( [ L3, L4] , names= ( 'Big' , 'Small' ) )
df_s = pd. DataFrame( np. random. rand( 9 , 9 ) , index= mul_index1, columns= mul_index2)
df_s
Big D E F Small d e f d e f d e f Upper Lower A a 0.669897 0.872137 0.821429 0.395501 0.120418 0.244982 0.778986 0.035890 0.446660 b 0.455974 0.665127 0.634520 0.519962 0.662663 0.812809 0.901754 0.552220 0.994960 c 0.273936 0.465171 0.697102 0.724623 0.525177 0.012609 0.535589 0.089462 0.650650 B a 0.640397 0.836209 0.398517 0.800238 0.267954 0.217449 0.539515 0.016317 0.507975 b 0.527949 0.939956 0.944304 0.969345 0.738103 0.007159 0.256861 0.815255 0.449877 c 0.799722 0.657790 0.515653 0.825300 0.333965 0.499731 0.822523 0.946890 0.950979 C a 0.904004 0.815088 0.090696 0.141877 0.191694 0.941201 0.131053 0.504041 0.815668 b 0.126503 0.638547 0.721800 0.169264 0.799974 0.557650 0.498308 0.971533 0.892235 c 0.207226 0.085262 0.041191 0.465851 0.718353 0.891853 0.185869 0.037784 0.625961
idx= pd. IndexSlice
idx
<pandas.core.indexing._IndexSlice at 0x2a41a8c1588>
索引Slice的使用非常灵活:
df_s. loc[ idx[ 'B' : , df_s[ 'D' ] [ 'd' ] > 0.3 ] , idx[ df_s. sum ( ) > 4 ] ]
Big D E F Small e f d e d f Upper Lower B c 0.344853 0.029392 0.646919 0.760591 0.360384 0.168119 C a 0.472030 0.987852 0.747845 0.466530 0.764040 0.795936
4. 索引层的交换
(a)swaplevel方法(两层交换)
df_using_mul. head( )
School Gender Height Weight Math Physics Class Address C_1 street_1 S_1 M 173 63 34.0 A+ street_2 S_1 F 192 73 32.5 B+ street_2 S_1 M 186 82 87.2 B+ street_2 S_1 F 167 81 80.4 B- street_4 S_1 F 159 64 84.8 B+
df_using_mul. swaplevel( i= 1 , j= 0 , axis= 0 ) . sort_index( ) . head( )
School Gender Height Weight Math Physics Address Class street_1 C_1 S_1 M 173 63 34.0 A+ C_2 S_2 M 175 74 47.2 B- C_3 S_1 F 175 57 87.7 A- street_2 C_1 S_1 F 192 73 32.5 B+ C_1 S_1 M 186 82 87.2 B+
(b)reorder_levels方法(多层交换)
df_muls = df. set_index( [ 'School' , 'Class' , 'Address' ] )
df_muls. head( )
Gender Height Weight Math Physics School Class Address S_1 C_1 street_1 M 173 63 34.0 A+ street_2 F 192 73 32.5 B+ street_2 M 186 82 87.2 B+ street_2 F 167 81 80.4 B- street_4 F 159 64 84.8 B+
df_muls. reorder_levels( [ 2 , 0 , 1 ] , axis= 0 ) . sort_index( ) . head( )
Gender Height Weight Math Physics Address School Class street_1 S_1 C_1 M 173 63 34.0 A+ C_3 F 175 57 87.7 A- S_2 C_2 M 175 74 47.2 B- street_2 S_1 C_1 F 192 73 32.5 B+ C_1 M 186 82 87.2 B+
df_muls. reorder_levels( [ 'Address' , 'School' , 'Class' ] , axis= 0 ) . sort_index( ) . head( )
Gender Height Weight Math Physics Address School Class street_1 S_1 C_1 M 173 63 34.0 A+ C_3 F 175 57 87.7 A- S_2 C_2 M 175 74 47.2 B- street_2 S_1 C_1 F 192 73 32.5 B+ C_1 M 186 82 87.2 B+
三、索引设定
1. index_col参数
index_col是read_csv中的一个参数,而不是某一个方法:
pd. read_csv( 'data/table.csv' ) . head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 1101 M street_1 173 63 34.0 A+ 1 S_1 C_1 1102 F street_2 192 73 32.5 B+ 2 S_1 C_1 1103 M street_2 186 82 87.2 B+ 3 S_1 C_1 1104 F street_2 167 81 80.4 B- 4 S_1 C_1 1105 F street_4 159 64 84.8 B+
2. reindex和reindex_like
reindex是指重新索引,它的重要特性在于索引对齐,很多时候用于重新排序
df. head( 10 )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+ 1201 S_1 C_2 M street_5 188 68 97.0 A- 1202 S_1 C_2 F street_4 176 94 63.5 B- 1203 S_1 C_2 M street_6 160 53 58.8 A+ 1204 S_1 C_2 F street_5 162 63 33.8 B 1205 S_1 C_2 F street_6 167 63 68.4 B-
df. reindex( index= [ 1101 , 1203 , 1206 , 2402 ] )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+ 1203 S_1 C_2 M street_6 160.0 53.0 58.8 A+ 1206 NaN NaN NaN NaN NaN NaN NaN NaN 2402 S_2 C_4 M street_7 166.0 82.0 48.7 B
df. reindex( columns= [ 'Height' , 'Gender' , 'Average' ] ) . head( )
Height Gender Average ID 1101 173 M NaN 1102 192 F NaN 1103 186 M NaN 1104 167 F NaN 1105 159 F NaN
可以选择缺失值的填充方法:fill_value和method(bfill/ffill/nearest),其中method参数必须索引单调
bfill表示用所在索引1206的后一个有效行填充,ffill为前一个有效行,nearest是指最近的
df. reindex( index= [ 1101 , 1203 , 1206 , 2402 ] , method= 'bfill' )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1203 S_1 C_2 M street_6 160 53 58.8 A+ 1206 S_1 C_2 F street_6 167 63 68.4 B- 2402 S_2 C_4 M street_7 166 82 48.7 B
数值上1205比1301更接近1206,因此用前者填充
df. reindex( index= [ 1101 , 1203 , 1206 , 2402 ] , method= 'nearest' )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1203 S_1 C_2 M street_6 160 53 58.8 A+ 1206 S_1 C_2 F street_6 167 63 68.4 B- 2402 S_2 C_4 M street_7 166 82 48.7 B
reindex_like的作用为生成一个横纵索引完全与参数列表一致的DataFrame,数据使用被调用的表
df_temp = pd. DataFrame( { 'Weight' : np. zeros( 5 ) ,
'Height' : np. zeros( 5 ) ,
'ID' : [ 1101 , 1104 , 1103 , 1106 , 1102 ] } ) . set_index( 'ID' )
df_temp. reindex_like( df[ 0 : 5 ] [ [ 'Weight' , 'Height' ] ] )
Weight Height ID 1101 0.0 0.0 1102 0.0 0.0 1103 0.0 0.0 1104 0.0 0.0 1105 NaN NaN
如果df_temp单调还可以使用method参数:
df_temp = pd. DataFrame( { 'Weight' : range ( 5 ) ,
'Height' : range ( 5 ) ,
'ID' : [ 1101 , 1104 , 1103 , 1106 , 1102 ] } ) . set_index( 'ID' ) . sort_index( )
df_temp. reindex_like( df[ 0 : 5 ] [ [ 'Weight' , 'Height' ] ] , method= 'bfill' )
Weight Height ID 1101 0 0 1102 4 4 1103 2 2 1104 1 1 1105 3 3
3. set_index和reset_index
先介绍set_index:从字面意思看,就是将某些列作为索引
使用表内列作为索引:
df. head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+
df. set_index( 'Class' ) . head( )
School Gender Address Height Weight Math Physics Class C_1 S_1 M street_1 173 63 34.0 A+ C_1 S_1 F street_2 192 73 32.5 B+ C_1 S_1 M street_2 186 82 87.2 B+ C_1 S_1 F street_2 167 81 80.4 B- C_1 S_1 F street_4 159 64 84.8 B+
利用append参数可以将当前索引维持不变
df. set_index( 'Class' , append= True ) . head( )
School Gender Address Height Weight Math Physics ID Class 1101 C_1 S_1 M street_1 173 63 34.0 A+ 1102 C_1 S_1 F street_2 192 73 32.5 B+ 1103 C_1 S_1 M street_2 186 82 87.2 B+ 1104 C_1 S_1 F street_2 167 81 80.4 B- 1105 C_1 S_1 F street_4 159 64 84.8 B+
当使用与表长相同的列作为索引(需要先转化为Series,否则报错):
df. set_index( pd. Series( range ( df. shape[ 0 ] ) ) ) . head( )
School Class Gender Address Height Weight Math Physics 0 S_1 C_1 M street_1 173 63 34.0 A+ 1 S_1 C_1 F street_2 192 73 32.5 B+ 2 S_1 C_1 M street_2 186 82 87.2 B+ 3 S_1 C_1 F street_2 167 81 80.4 B- 4 S_1 C_1 F street_4 159 64 84.8 B+
可以直接添加多级索引:
df. set_index( [ pd. Series( range ( df. shape[ 0 ] ) ) , pd. Series( np. ones( df. shape[ 0 ] ) ) ] ) . head( )
School Class Gender Address Height Weight Math Physics 0 1.0 S_1 C_1 M street_1 173 63 34.0 A+ 1 1.0 S_1 C_1 F street_2 192 73 32.5 B+ 2 1.0 S_1 C_1 M street_2 186 82 87.2 B+ 3 1.0 S_1 C_1 F street_2 167 81 80.4 B- 4 1.0 S_1 C_1 F street_4 159 64 84.8 B+
下面介绍reset_index方法,它的主要功能是将索引重置
默认状态直接恢复到自然数索引:
df. reset_index( ) . head( )
ID School Class Gender Address Height Weight Math Physics 0 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1 1102 S_1 C_1 F street_2 192 73 32.5 B+ 2 1103 S_1 C_1 M street_2 186 82 87.2 B+ 3 1104 S_1 C_1 F street_2 167 81 80.4 B- 4 1105 S_1 C_1 F street_4 159 64 84.8 B+
用level参数指定哪一层被reset,用col_level参数指定set到哪一层:
L1, L2 = [ 'A' , 'B' , 'C' ] , [ 'a' , 'b' , 'c' ]
mul_index1 = pd. MultiIndex. from_product( [ L1, L2] , names= ( 'Upper' , 'Lower' ) )
L3, L4 = [ 'D' , 'E' , 'F' ] , [ 'd' , 'e' , 'f' ]
mul_index2 = pd. MultiIndex. from_product( [ L3, L4] , names= ( 'Big' , 'Small' ) )
df_temp = pd. DataFrame( np. random. rand( 9 , 9 ) , index= mul_index1, columns= mul_index2)
df_temp. head( )
Big D E F Small d e f d e f d e f Upper Lower A a 0.639608 0.511061 0.439781 0.436492 0.473135 0.858310 0.538744 0.637985 0.441126 b 0.740495 0.942201 0.681934 0.087504 0.600452 0.993468 0.611108 0.736948 0.177016 c 0.965554 0.301755 0.659689 0.017442 0.110830 0.301041 0.205102 0.471527 0.052899 B a 0.057613 0.303613 0.242287 0.435452 0.289861 0.514301 0.859227 0.477511 0.655530 b 0.746643 0.091802 0.450144 0.255623 0.583973 0.398016 0.127545 0.313311 0.638950
df_temp1. columns
MultiIndex(levels=[['D', 'E', 'F', ''], ['d', 'e', 'f', 'Lower']],
codes=[[3, 0, 0, 0, 1, 1, 1, 2, 2, 2], [3, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
names=['Big', 'Small'])
df_temp1 = df_temp. reset_index( level= 0 , col_level= 1 )
df_temp1. head( )
Big D E F Small Upper d e f d e f d e f Lower a A 0.639608 0.511061 0.439781 0.436492 0.473135 0.858310 0.538744 0.637985 0.441126 b A 0.740495 0.942201 0.681934 0.087504 0.600452 0.993468 0.611108 0.736948 0.177016 c A 0.965554 0.301755 0.659689 0.017442 0.110830 0.301041 0.205102 0.471527 0.052899 a B 0.057613 0.303613 0.242287 0.435452 0.289861 0.514301 0.859227 0.477511 0.655530 b B 0.746643 0.091802 0.450144 0.255623 0.583973 0.398016 0.127545 0.313311 0.638950
df_temp1. columns
MultiIndex(levels=[['D', 'E', 'F', ''], ['d', 'e', 'f', 'Upper']],
codes=[[3, 0, 0, 0, 1, 1, 1, 2, 2, 2], [3, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
names=['Big', 'Small'])
df_temp1. index
Index(['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'], dtype='object', name='Lower')
4. rename_axis和rename
rename_axis是针对多级索引的方法,作用是修改某一层的索引名,而不是索引标签
df_temp. rename_axis( index= { 'Lower' : 'LowerLower' } , columns= { 'Big' : 'BigBig' } )
BigBig D E F Small d e f d e f d e f Upper LowerLower A a 0.036300 0.111297 0.509819 0.322065 0.107846 0.393002 0.951695 0.352045 0.055245 b 0.585976 0.817781 0.526512 0.560763 0.647126 0.801922 0.520511 0.708981 0.774692 c 0.859907 0.824712 0.675348 0.238558 0.869172 0.965363 0.803485 0.568771 0.734648 B a 0.129040 0.278234 0.981728 0.903423 0.676240 0.371291 0.876571 0.338353 0.021567 b 0.221386 0.744765 0.080138 0.013936 0.623034 0.314859 0.520199 0.742233 0.834835 c 0.905252 0.037512 0.377849 0.225882 0.781182 0.232511 0.778518 0.572787 0.858842 C a 0.678289 0.324638 0.165628 0.737036 0.591936 0.312173 0.319689 0.289072 0.954463 b 0.647861 0.527387 0.505945 0.488335 0.631082 0.639539 0.785094 0.026073 0.881210 c 0.990722 0.691715 0.697617 0.115831 0.129802 0.994152 0.176841 0.449053 0.145351
rename方法用于修改列或者行索引标签,而不是索引名:
df_temp. rename( index= { 'A' : 'T' } , columns= { 'e' : 'changed_e' } ) . head( )
Big D E F Small d changed_e f d changed_e f d changed_e f Upper Lower T a 0.036300 0.111297 0.509819 0.322065 0.107846 0.393002 0.951695 0.352045 0.055245 b 0.585976 0.817781 0.526512 0.560763 0.647126 0.801922 0.520511 0.708981 0.774692 c 0.859907 0.824712 0.675348 0.238558 0.869172 0.965363 0.803485 0.568771 0.734648 B a 0.129040 0.278234 0.981728 0.903423 0.676240 0.371291 0.876571 0.338353 0.021567 b 0.221386 0.744765 0.080138 0.013936 0.623034 0.314859 0.520199 0.742233 0.834835
四、常用索引型函数
1. where函数
当对条件为False的单元进行填充:
df. head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+
不满足条件的行全部被设置为NaN 通过other来改变不符合条件的值
df. where( df[ 'Gender' ] == 'M' , ' ' ) . head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34 A+ 1102 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 1105
通过这种方法筛选结果和[]操作符的结果完全一致:
df. where( df[ 'Gender' ] == 'M' ) . dropna( ) . head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+ 1103 S_1 C_1 M street_2 186.0 82.0 87.2 B+ 1201 S_1 C_2 M street_5 188.0 68.0 97.0 A- 1203 S_1 C_2 M street_6 160.0 53.0 58.8 A+ 1301 S_1 C_3 M street_4 161.0 68.0 31.5 B+
第一个参数为布尔条件,第二个参数为填充值:
df. where( df[ 'Gender' ] == 'M' , np. random. rand( df. shape[ 0 ] , df. shape[ 1 ] ) ) . head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173.000000 63.000000 34.000000 A+ 1102 0.55092 0.303216 0.824431 0.755427 0.433119 0.997173 0.358161 0.699621 1103 S_1 C_1 M street_2 186.000000 82.000000 87.200000 B+ 1104 0.313753 0.481069 0.335257 0.645973 0.551690 0.562617 0.684399 0.312336 1105 0.0705601 0.067762 0.818158 0.352117 0.200073 0.076810 0.885933 0.142629
2. mask函数
mask函数与where功能上相反,其余完全一致,即对条件为True的单元进行填充
df. mask( df[ 'Gender' ] == 'M' ) . dropna( ) . head( )
School Class Gender Address Height Weight Math Physics ID 1102 S_1 C_1 F street_2 192.0 73.0 32.5 B+ 1104 S_1 C_1 F street_2 167.0 81.0 80.4 B- 1105 S_1 C_1 F street_4 159.0 64.0 84.8 B+ 1202 S_1 C_2 F street_4 176.0 94.0 63.5 B- 1204 S_1 C_2 F street_5 162.0 63.0 33.8 B
df. mask( df[ 'Gender' ] == 'M' , np. random. rand( df. shape[ 0 ] , df. shape[ 1 ] ) ) . head( )
School Class Gender Address Height Weight Math Physics ID 1101 0.273962 0.25028 0.587471 0.977206 0.442403 0.319460 0.460991 0.842498 1102 S_1 C_1 F street_2 192.000000 73.000000 32.500000 B+ 1103 0.436674 0.741524 0.46996 0.688603 0.938241 0.531811 0.794352 0.17495 1104 S_1 C_1 F street_2 167.000000 81.000000 80.400000 B- 1105 S_1 C_1 F street_4 159.000000 64.000000 84.800000 B+
3. query函数
df. head( )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1103 S_1 C_1 M street_2 186 82 87.2 B+ 1104 S_1 C_1 F street_2 167 81 80.4 B- 1105 S_1 C_1 F street_4 159 64 84.8 B+
query函数中的布尔表达式中,下面的符号都是合法的:行列索引名、字符串、and/not/or/&/|/~/not in/in/==/!=、四则运算符
df. query( '(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])' )
School Class Gender Address Height Weight Math Physics ID 1303 S_1 C_3 M street_7 188 82 49.7 B 2304 S_2 C_3 F street_6 164 81 95.5 A- 2402 S_2 C_4 M street_7 166 82 48.7 B
五、重复元素处理
1. duplicated方法
该方法返回了是否重复的布尔列表
df. duplicated( 'Class' ) . head( 10 )
ID
1101 False
1102 True
1103 True
1104 True
1105 True
1201 False
1202 True
1203 True
1204 True
1205 True
dtype: bool
df. duplicated( 'Class' , keep= 'first' ) . head( 10 )
ID
1101 False
1102 True
1103 True
1104 True
1105 True
1201 False
1202 True
1203 True
1204 True
1205 True
dtype: bool
可选参数keep默认为first,即首次出现设为不重复,若为last,则最后一次设为不重复,若为False,则所有重复项为False
df. duplicated( 'Class' , keep= 'last' ) . tail( )
ID
2401 True
2402 True
2403 True
2404 True
2405 False
dtype: bool
df. duplicated( 'Class' , keep= False ) . head( )
ID
1101 True
1102 True
1103 True
1104 True
1105 True
dtype: bool
2. drop_duplicates方法
从名字上看出为剔除重复项,这在后面章节中的分组操作中可能是有用的,例如需要保留每组的第一个值:
df. drop_duplicates( 'Class' )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1201 S_1 C_2 M street_5 188 68 97.0 A- 1301 S_1 C_3 M street_4 161 68 31.5 B+ 2401 S_2 C_4 F street_2 192 62 45.3 A
参数与duplicate函数类似:
df. drop_duplicates( 'Class' , keep= 'last' )
School Class Gender Address Height Weight Math Physics ID 2105 S_2 C_1 M street_4 170 81 34.2 A 2205 S_2 C_2 F street_7 183 76 85.4 B 2305 S_2 C_3 M street_4 187 73 48.9 B 2405 S_2 C_4 F street_6 193 54 47.6 B
在传入多列时等价于将多列共同视作一个多级索引,比较重复项:
df. drop_duplicates( [ 'School' , 'Class' ] , keep = 'first' , inplace= False )
School Class Gender Address Height Weight Math Physics ID 1101 S_1 C_1 M street_1 173 63 34.0 A+ 1201 S_1 C_2 M street_5 188 68 97.0 A- 1301 S_1 C_3 M street_4 161 68 31.5 B+ 2101 S_2 C_1 M street_7 174 84 83.3 C 2201 S_2 C_2 M street_5 193 100 39.1 B 2301 S_2 C_3 F street_4 157 78 72.3 B+ 2401 S_2 C_4 F street_2 192 62 45.3 A
六、抽样函数
这里的抽样函数指的就是sample函数
(a)n为样本量
df. sample( n= 5 )
School Class Gender Address Height Weight Math Physics ID 2103 S_2 C_1 M street_4 157 61 52.5 B- 1102 S_1 C_1 F street_2 192 73 32.5 B+ 1301 S_1 C_3 M street_4 161 68 31.5 B+ 1304 S_1 C_3 M street_2 195 70 85.2 A 1105 S_1 C_1 F street_4 159 64 84.8 B+
(b)frac为抽样比
df. sample( frac= 0.05 )
School Class Gender Address Height Weight Math Physics ID 1105 S_1 C_1 F street_4 159 64 84.8 B+ 2402 S_2 C_4 M street_7 166 82 48.7 B
(c)replace为是否放回
df. sample( n= df. shape[ 0 ] , replace= True ) . head( )
School Class Gender Address Height Weight Math Physics ID 2403 S_2 C_4 F street_6 158 60 59.7 B+ 2404 S_2 C_4 F street_2 160 84 67.7 B 2405 S_2 C_4 F street_6 193 54 47.6 B 2303 S_2 C_3 F street_7 190 99 65.9 C 1203 S_1 C_2 M street_6 160 53 58.8 A+
df. sample( n= 35 , replace= True ) . index. is_unique
False
(d)axis为抽样维度,默认为0,即抽行
df. sample( n= 3 , axis= 1 ) . head( )
Address Weight School ID 1101 street_1 63 S_1 1102 street_2 73 S_1 1103 street_2 82 S_1 1104 street_2 81 S_1 1105 street_4 64 S_1
(e)weights为样本权重,自动归一化
df. sample( n= 3 , weights= np. random. rand( df. shape[ 0 ] ) ) . head( )
School Class Gender Address Height Weight Math Physics ID 1302 S_1 C_3 F street_1 175 57 87.7 A- 1305 S_1 C_3 F street_5 187 69 61.7 B- 2404 S_2 C_4 F street_2 160 84 67.7 B
df. sample( n= 3 , weights= df[ 'Math' ] ) . head( )
School Class Gender Address Height Weight Math Physics ID 2205 S_2 C_2 F street_7 183 76 85.4 B 2404 S_2 C_4 F street_2 160 84 67.7 B 1203 S_1 C_2 M street_6 160 53 58.8 A+