pandas学习-第二章

import numpy as np 
import pandas as pd 
df=pd.read_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.csv',index_col='ID')
df.head(20)
Unnamed: 0SchoolClassGenderAddressHeightWeightMathPhysics
ID
11010S_1C_1Mstreet_11736334.0A+
11021S_1C_1Fstreet_21927332.5B+
11032S_1C_1Mstreet_21868287.2B+
11043S_1C_1Fstreet_21678180.4B-
11054S_1C_1Fstreet_41596484.8B+
12015S_1C_2Mstreet_51886897.0A-
12026S_1C_2Fstreet_41769463.5B-
12037S_1C_2Mstreet_61605358.8A+
12048S_1C_2Fstreet_51626333.8B
12059S_1C_2Fstreet_61676368.4B-
130110S_1C_3Mstreet_41616831.5B+
130211S_1C_3Fstreet_11755787.7A-
130312S_1C_3Mstreet_71888249.7B
130413S_1C_3Mstreet_21957085.2A
130514S_1C_3Fstreet_51876961.7B-
210115S_2C_1Mstreet_71748483.3C
210216S_2C_1Fstreet_61616150.6B+
210317S_2C_1Mstreet_41576152.5B-
210418S_2C_1Fstreet_51599772.2B+
210519S_2C_1Mstreet_41708134.2A
df.columns
df=df.drop(columns='Unnamed: 0')
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+

单级索引

loc方法,iloc方法,[]操作符

  • 最常用的三种索引方法,其中**iloc表示位置索引,loc表示标签索引,[]具有良好的便利性

loc方法(注意:所有在loc中使用的切片是包含右端点的!!!)

1.单行索引

df.loc[1103]#注意是[]
School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object

2.多行索引

df.loc[[1102,1104,2304]]#注意是元组,多行索引则是[[]]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
1104S_1C_1Fstreet_21678180.4B-
2304S_2C_3Fstreet_61648195.5A-
df.loc[1103:]  # 表示从1103一直到最后
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1201S_1C_2Mstreet_51886897.0A-
1202S_1C_2Fstreet_41769463.5B-
1203S_1C_2Mstreet_61605358.8A+
1204S_1C_2Fstreet_51626333.8B
1205S_1C_2Fstreet_61676368.4B-
1301S_1C_3Mstreet_41616831.5B+
1302S_1C_3Fstreet_11755787.7A-
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
1305S_1C_3Fstreet_51876961.7B-
2101S_2C_1Mstreet_71748483.3C
2102S_2C_1Fstreet_61616150.6B+
2103S_2C_1Mstreet_41576152.5B-
2104S_2C_1Fstreet_51599772.2B+
2105S_2C_1Mstreet_41708134.2A
2201S_2C_2Mstreet_519310039.1B
2202S_2C_2Fstreet_71947768.5B+
2203S_2C_2Mstreet_41559173.8A+
2204S_2C_2Mstreet_11757447.2B-
2205S_2C_2Fstreet_71837685.4B
2301S_2C_3Fstreet_41577872.3B+
2302S_2C_3Mstreet_51718832.7A
2303S_2C_3Fstreet_71909965.9C
2304S_2C_3Fstreet_61648195.5A-
2305S_2C_3Mstreet_41877348.9B
2401S_2C_4Fstreet_21926245.3A
2402S_2C_4Mstreet_71668248.7B
2403S_2C_4Fstreet_61586059.7B+
2404S_2C_4Fstreet_21608467.7B
2405S_2C_4Fstreet_61935447.6B
df.loc[2402::-1]#从2402向前索引,包含所有的行
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2402S_2C_4Mstreet_71668248.7B
2401S_2C_4Fstreet_21926245.3A
2305S_2C_3Mstreet_41877348.9B
2304S_2C_3Fstreet_61648195.5A-
2303S_2C_3Fstreet_71909965.9C
2302S_2C_3Mstreet_51718832.7A
2301S_2C_3Fstreet_41577872.3B+
2205S_2C_2Fstreet_71837685.4B
2204S_2C_2Mstreet_11757447.2B-
2203S_2C_2Mstreet_41559173.8A+
2202S_2C_2Fstreet_71947768.5B+
2201S_2C_2Mstreet_519310039.1B
2105S_2C_1Mstreet_41708134.2A
2104S_2C_1Fstreet_51599772.2B+
2103S_2C_1Mstreet_41576152.5B-
2102S_2C_1Fstreet_61616150.6B+
2101S_2C_1Mstreet_71748483.3C
1305S_1C_3Fstreet_51876961.7B-
1304S_1C_3Mstreet_21957085.2A
1303S_1C_3Mstreet_71888249.7B
1302S_1C_3Fstreet_11755787.7A-
1301S_1C_3Mstreet_41616831.5B+
1205S_1C_2Fstreet_61676368.4B-
1204S_1C_2Fstreet_51626333.8B
1203S_1C_2Mstreet_61605358.8A+
1202S_1C_2Fstreet_41769463.5B-
1201S_1C_2Mstreet_51886897.0A-
1105S_1C_1Fstreet_41596484.8B+
1104S_1C_1Fstreet_21678180.4B-
1103S_1C_1Mstreet_21868287.2B+
1102S_1C_1Fstreet_21927332.5B+
1101S_1C_1Mstreet_11736334.0A+

3.单列索引

df.loc[:,'Height']#包含所有行的height列
ID
1101    173
1102    192
1103    186
1104    167
1105    159
1201    188
1202    176
1203    160
1204    162
1205    167
1301    161
1302    175
1303    188
1304    195
1305    187
2101    174
2102    161
2103    157
2104    159
2105    170
2201    193
2202    194
2203    155
2204    175
2205    183
2301    157
2302    171
2303    190
2304    164
2305    187
2401    192
2402    166
2403    158
2404    160
2405    193
Name: Height, dtype: int64

4.多列索引

df.loc[:,['Height','Math']]#所有行+两列,还是多行索引时,如果是单独选取用[[]]
HeightMath
ID
110117334.0
110219232.5
110318687.2
110416780.4
110515984.8
120118897.0
120217663.5
120316058.8
120416233.8
120516768.4
130116131.5
130217587.7
130318849.7
130419585.2
130518761.7
210117483.3
210216150.6
210315752.5
210415972.2
210517034.2
220119339.1
220219468.5
220315573.8
220417547.2
220518385.4
230115772.3
230217132.7
230319065.9
230416495.5
230518748.9
240119245.3
240216648.7
240315859.7
240416067.7
240519347.6
df.loc[:,'Height':'Math']#所有行,height-Math 所有列 注意 包含右端
HeightWeightMath
ID
11011736334.0
11021927332.5
11031868287.2
11041678180.4
11051596484.8
12011886897.0
12021769463.5
12031605358.8
12041626333.8
12051676368.4
13011616831.5
13021755787.7
13031888249.7
13041957085.2
13051876961.7
21011748483.3
21021616150.6
21031576152.5
21041599772.2
21051708134.2
220119310039.1
22021947768.5
22031559173.8
22041757447.2
22051837685.4
23011577872.3
23021718832.7
23031909965.9
23041648195.5
23051877348.9
24011926245.3
24021668248.7
24031586059.7
24041608467.7
24051935447.6

5.联合索引

df.loc[1102:2401,'Height':'Math']#联合行列索引,步长为1
HeightWeightMath
ID
11021927332.5
11031868287.2
11041678180.4
11051596484.8
12011886897.0
12021769463.5
12031605358.8
12041626333.8
12051676368.4
13011616831.5
13021755787.7
13031888249.7
13041957085.2
13051876961.7
21011748483.3
21021616150.6
21031576152.5
21041599772.2
21051708134.2
220119310039.1
22021947768.5
22031559173.8
22041757447.2
22051837685.4
23011577872.3
23021718832.7
23031909965.9
23041648195.5
23051877348.9
24011926245.3
df.loc[1102:2401:2,'Height':'Math']#与之前相比,在行索引范围内步长改为2
HeightWeightMath
ID
11021927332.5
11041678180.4
12011886897.0
12031605358.8
12051676368.4
13021755787.7
13041957085.2
21011748483.3
21031576152.5
21051708134.2
22021947768.5
22041757447.2
23011577872.3
23031909965.9
23051877348.9

6.函数式索引

df.loc[lambda x:x['Gender']=='M']
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1103S_1C_1Mstreet_21868287.2B+
1201S_1C_2Mstreet_51886897.0A-
1203S_1C_2Mstreet_61605358.8A+
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
2101S_2C_1Mstreet_71748483.3C
2103S_2C_1Mstreet_41576152.5B-
2105S_2C_1Mstreet_41708134.2A
2201S_2C_2Mstreet_519310039.1B
2203S_2C_2Mstreet_41559173.8A+
2204S_2C_2Mstreet_11757447.2B-
2302S_2C_3Mstreet_51718832.7A
2305S_2C_3Mstreet_41877348.9B
2402S_2C_4Mstreet_71668248.7B
def f(x):
    return[1101,1103]
df.loc[f]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1103S_1C_1Mstreet_21868287.2B+

7.布尔索引

df.loc[df['Address'].isin(['street_7','street_4'])].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1105S_1C_1Fstreet_41596484.8B+
1202S_1C_2Fstreet_41769463.5B-
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
2101S_2C_1Mstreet_71748483.3C
df.loc[[True if i[-1]=='4' or i[-1]=='7' else False for i in df['Address'].values]].head()#意义与上面是一样的
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1105S_1C_1Fstreet_41596484.8B+
1202S_1C_2Fstreet_41769463.5B-
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
2101S_2C_1Mstreet_71748483.3C

总结:loc中能传入只有布尔列表和索引子集构成的列表

iloc方法-切片法,不含右端

1.单行索引

df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
df.iloc[3]
School          S_1
Class           C_1
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 1104, dtype: object

2.多行索引

df.iloc[3:5]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+

3.混合索引

df.iloc[3::2,7::-2].head()
PhysicsWeightAddressClass
ID
1104B-81street_2C_1
1201A-68street_5C_2
1203A+53street_6C_2
1205B-63street_6C_2
1302A-57street_1C_3

4.函数式索引

df.iloc[lambda x:[3]]
#iloc中能接受的参数只能为整数或者整数列表,不能使用布尔索引
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1104S_1C_1Fstreet_21678180.4B-

[]操作符

  • 不能再行索引为浮点时使用,因为在Series中的浮点[]并不是进行位置比较,而是值比较

1.单元素索引

s=pd.Series(df['Math'],index=df.index)
s[1101]#索引标签
34.0

2.多行索引

s[0:4]
#使用的是绝对位置的整数切片,与元素无关
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64

3.函数式索引

s[lambda x:x.index[16::-6]]
#在使用lambda函数的时候,直接切片报错(比如16::-6就是错误的),此时使用的不是绝对位置切片,而是元素切片
ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64

4.布尔式索引

s[s>80]
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64

data与[]操作

df[1:2]
#如果这里写成df['label'](df['Math'])会报错
#这里与series相同,使用绝对位置切片
#如果要找到某个元素,采用get_loc方法
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
row=df.index.get_loc(1102)#shou
df[row:row+1]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+

2.多行索引

#用切片,如果是选取指定的某几行,推荐使用loc,否则可能报错
df[3:5]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+

3.单列索引

df['School'].head()
ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object

4.多列索引

df[['School','Math']]
SchoolMath
ID
1101S_134.0
1102S_132.5
1103S_187.2
1104S_180.4
1105S_184.8
1201S_197.0
1202S_163.5
1203S_158.8
1204S_133.8
1205S_168.4
1301S_131.5
1302S_187.7
1303S_149.7
1304S_185.2
1305S_161.7
2101S_283.3
2102S_250.6
2103S_252.5
2104S_272.2
2105S_234.2
2201S_239.1
2202S_268.5
2203S_273.8
2204S_247.2
2205S_285.4
2301S_272.3
2302S_232.7
2303S_265.9
2304S_295.5
2305S_248.9
2401S_245.3
2402S_248.7
2403S_259.7
2404S_267.7
2405S_247.6

5.布尔索引

df[df['Gender']=='F'].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1202S_1C_2Fstreet_41769463.5B-
1204S_1C_2Fstreet_51626333.8B

总结,[]操作常用于列选择或布尔选择,尽量避免行选择!

布尔索引

布尔符号:‘&’,‘\’,’~’:分别代表 and,or 反not

df[(df['Gender']=='F') &(df['Math']>80)]#二者之间不同的 使用()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1302S_1C_3Fstreet_11755787.7A-
2205S_2C_2Fstreet_71837685.4B
2304S_2C_3Fstreet_61648195.5A-
df[(df['Gender']=='F')|(df['Math']>80)]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1201S_1C_2Mstreet_51886897.0A-
1202S_1C_2Fstreet_41769463.5B-
1204S_1C_2Fstreet_51626333.8B
1205S_1C_2Fstreet_61676368.4B-
1302S_1C_3Fstreet_11755787.7A-
1304S_1C_3Mstreet_21957085.2A
1305S_1C_3Fstreet_51876961.7B-
2101S_2C_1Mstreet_71748483.3C
2102S_2C_1Fstreet_61616150.6B+
2104S_2C_1Fstreet_51599772.2B+
2202S_2C_2Fstreet_71947768.5B+
2205S_2C_2Fstreet_71837685.4B
2301S_2C_3Fstreet_41577872.3B+
2303S_2C_3Fstreet_71909965.9C
2304S_2C_3Fstreet_61648195.5A-
2401S_2C_4Fstreet_21926245.3A
2403S_2C_4Fstreet_61586059.7B+
2404S_2C_4Fstreet_21608467.7B
2405S_2C_4Fstreet_61935447.6B
df[~((df['Gender']=='F')|(df['Math']>80))]#用韦恩图 很容易理解
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1203S_1C_2Mstreet_61605358.8A+
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
2103S_2C_1Mstreet_41576152.5B-
2105S_2C_1Mstreet_41708134.2A
2201S_2C_2Mstreet_519310039.1B
2203S_2C_2Mstreet_41559173.8A+
2204S_2C_2Mstreet_11757447.2B-
2302S_2C_3Mstreet_51718832.7A
2305S_2C_3Mstreet_41877348.9B
2402S_2C_4Mstreet_71668248.7B

loc和[]中相应位置都能使用布尔列表选择:

df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head(30)
Physics
ID
1103B+
1104B-
1105B+
1201A-
1202B-
1205B-
1302A-
1304A
1305B-
2101C
2104B+
2202B+
2203A+
2205B
2301B+
2303C
2304A-
2404B
df
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1201S_1C_2Mstreet_51886897.0A-
1202S_1C_2Fstreet_41769463.5B-
1203S_1C_2Mstreet_61605358.8A+
1204S_1C_2Fstreet_51626333.8B
1205S_1C_2Fstreet_61676368.4B-
1301S_1C_3Mstreet_41616831.5B+
1302S_1C_3Fstreet_11755787.7A-
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
1305S_1C_3Fstreet_51876961.7B-
2101S_2C_1Mstreet_71748483.3C
2102S_2C_1Fstreet_61616150.6B+
2103S_2C_1Mstreet_41576152.5B-
2104S_2C_1Fstreet_51599772.2B+
2105S_2C_1Mstreet_41708134.2A
2201S_2C_2Mstreet_519310039.1B
2202S_2C_2Fstreet_71947768.5B+
2203S_2C_2Mstreet_41559173.8A+
2204S_2C_2Mstreet_11757447.2B-
2205S_2C_2Fstreet_71837685.4B
2301S_2C_3Fstreet_41577872.3B+
2302S_2C_3Mstreet_51718832.7A
2303S_2C_3Fstreet_71909965.9C
2304S_2C_3Fstreet_61648195.5A-
2305S_2C_3Mstreet_41877348.9B
2401S_2C_4Fstreet_21926245.3A
2402S_2C_4Mstreet_71668248.7B
2403S_2C_4Fstreet_61586059.7B+
2404S_2C_4Fstreet_21608467.7B
2405S_2C_4Fstreet_61935447.6B

isin方法

df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
2105S_2C_1Mstreet_41708134.2A
2203S_2C_2Mstreet_41559173.8A+
#也可以使用字典的方法写
df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(1)]
#all与&的思路是类似的,其中1代表按照列的方向判断是否全为True
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
2105S_2C_1Mstreet_41708134.2A
2203S_2C_2Mstreet_41559173.8A+

快速标量索引

display(df.at[1101,'School'])
'S_1'

区间索引

利用interbal_range方法

pd.interval_range(start=0,end=5)
#closed参数可选‘left’,'both','right','neighter',默认左开又闭
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')
pd.interval_range(start=0,periods=8,freq=5)
#periods控制区间个数,ferq控制步长
IntervalIndex([(0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30], (30, 35], (35, 40]],
              closed='right',
              dtype='interval[int64]')

利用cut将数值转换为区间元素的分类变量,例如统计数学成绩的区间情况

math_interval=pd.cut(df['Math'],bins=[0,40,60,80,100])
#此时的类型是category
math_interval
ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
1201    (80, 100]
1202     (60, 80]
1203     (40, 60]
1204      (0, 40]
1205     (60, 80]
1301      (0, 40]
1302    (80, 100]
1303     (40, 60]
1304    (80, 100]
1305     (60, 80]
2101    (80, 100]
2102     (40, 60]
2103     (40, 60]
2104     (60, 80]
2105      (0, 40]
2201      (0, 40]
2202     (60, 80]
2203     (60, 80]
2204     (40, 60]
2205    (80, 100]
2301     (60, 80]
2302      (0, 40]
2303     (60, 80]
2304    (80, 100]
2305     (40, 60]
2401     (40, 60]
2402     (40, 60]
2403     (40, 60]
2404     (60, 80]
2405     (40, 60]
Name: Math, dtype: category
Categories (4, interval[int64]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]

区间索引的选取

df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
 .reset_index().set_index('Math_interval')
df_i.head()# rsuffix : string,当左右两个表的列名有重复时,用于右边表列名的前缀。
IDMath
Math_interval
(0, 40]110134.0
(0, 40]110232.5
(80, 100]110387.2
(80, 100]110480.4
(80, 100]110584.8
df_i.loc[65]#所有包含这个值的都会被选中
IDMath
Math_interval
(60, 80]120263.5
(60, 80]120568.4
(60, 80]130561.7
(60, 80]210472.2
(60, 80]220268.5
(60, 80]220373.8
(60, 80]230172.3
(60, 80]230365.9
(60, 80]240467.7
df_i.loc[[65,90]]#所有包含65-90的都会被选中
IDMath
Math_interval
(60, 80]120263.5
(60, 80]120568.4
(60, 80]130561.7
(60, 80]210472.2
(60, 80]220268.5
(60, 80]220373.8
(60, 80]230172.3
(60, 80]230365.9
(60, 80]240467.7
(80, 100]110387.2
(80, 100]110480.4
(80, 100]110584.8
(80, 100]120197.0
(80, 100]130287.7
(80, 100]130485.2
(80, 100]210183.3
(80, 100]220585.4
(80, 100]230495.5
#如果要选取某个区间,先要把分类变量转换为区间变量,然后使用overlap方法
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70,85))]
IDMath
Math_interval
(80, 100]110387.2
(80, 100]110480.4
(80, 100]110584.8
(80, 100]120197.0
(60, 80]120263.5
(60, 80]120568.4
(80, 100]130287.7
(80, 100]130485.2
(60, 80]130561.7
(80, 100]210183.3
(60, 80]210472.2
(60, 80]220268.5
(60, 80]220373.8
(80, 100]220585.4
(60, 80]230172.3
(60, 80]230365.9
(80, 100]230495.5
(60, 80]240467.7

多级索引

创建多级索引

通过from_tuple或from_arrays

1.直接创立元组

tuples=[('A','a'),('A','b'),('B','a'),('B','b')]
mul_index=pd.MultiIndex.from_tuples(tuples,names=('Upper','Lower'))
mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])
pd.DataFrame({'Score':['Perfect','Good','Fair','Bad']},index=mul_index)
Score
UpperLower
AaPerfect
bGood
BaFair
bBad

2.利用zip创建元组

L1=list('AABB')
L2=list('abab')
tuples=list(zip(L1,L2))
mul_index=pd.MultiIndex.from_tuples(tuples,names=('Upper','Lower'))
pd.DataFrame({'Score':['Perfect','Good','Fair','Bad']},index=mul_index)
Score
UpperLower
AaPerfect
bGood
BaFair
bBad

3.通过Array创建

通过 from_prodect

L1=['A','B']
L2=['a','b']
pd.MultiIndex.from_product([L1,L2],names=('Upper','Lower'))
#两两相乘,笛卡尔积
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])

指定df中的列进行创建(set_index)

df_using_mul=df.set_index(['Class','Address'])
df_using_mul
#通过两列进行拆分,进行多级索引,注意,有先后顺序
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_1street_1S_1M1736334.0A+
street_2S_1F1927332.5B+
street_2S_1M1868287.2B+
street_2S_1F1678180.4B-
street_4S_1F1596484.8B+
C_2street_5S_1M1886897.0A-
street_4S_1F1769463.5B-
street_6S_1M1605358.8A+
street_5S_1F1626333.8B
street_6S_1F1676368.4B-
C_3street_4S_1M1616831.5B+
street_1S_1F1755787.7A-
street_7S_1M1888249.7B
street_2S_1M1957085.2A
street_5S_1F1876961.7B-
C_1street_7S_2M1748483.3C
street_6S_2F1616150.6B+
street_4S_2M1576152.5B-
street_5S_2F1599772.2B+
street_4S_2M1708134.2A
C_2street_5S_2M19310039.1B
street_7S_2F1947768.5B+
street_4S_2M1559173.8A+
street_1S_2M1757447.2B-
street_7S_2F1837685.4B
C_3street_4S_2F1577872.3B+
street_5S_2M1718832.7A
street_7S_2F1909965.9C
street_6S_2F1648195.5A-
street_4S_2M1877348.9B
C_4street_2S_2F1926245.3A
street_7S_2M1668248.7B
street_6S_2F1586059.7B+
street_2S_2F1608467.7B
street_6S_2F1935447.6B

多层索引切片

df_using_mul
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_1street_1S_1M1736334.0A+
street_2S_1F1927332.5B+
street_2S_1M1868287.2B+
street_2S_1F1678180.4B-
street_4S_1F1596484.8B+
C_2street_5S_1M1886897.0A-
street_4S_1F1769463.5B-
street_6S_1M1605358.8A+
street_5S_1F1626333.8B
street_6S_1F1676368.4B-
C_3street_4S_1M1616831.5B+
street_1S_1F1755787.7A-
street_7S_1M1888249.7B
street_2S_1M1957085.2A
street_5S_1F1876961.7B-
C_1street_7S_2M1748483.3C
street_6S_2F1616150.6B+
street_4S_2M1576152.5B-
street_5S_2F1599772.2B+
street_4S_2M1708134.2A
C_2street_5S_2M19310039.1B
street_7S_2F1947768.5B+
street_4S_2M1559173.8A+
street_1S_2M1757447.2B-
street_7S_2F1837685.4B
C_3street_4S_2F1577872.3B+
street_5S_2M1718832.7A
street_7S_2F1909965.9C
street_6S_2F1648195.5A-
street_4S_2M1877348.9B
C_4street_2S_2F1926245.3A
street_7S_2M1668248.7B
street_6S_2F1586059.7B+
street_2S_2F1608467.7B
street_6S_2F1935447.6B

一般切片

#df_using_mul.loc['C_2','street_5']
#当索引不排序时,单个索引会出现性能警告
df_using_mul.index.is_lexsorted()
#以上函数是检验是否排序
df_using_mul.sort_index().loc['C_2','street_5']
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_5S_1M1886897.0A-
street_5S_1F1626333.8B
street_5S_2M19310039.1B
#当不排序时,不能使用多层切片
df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_6S_1M1605358.8A+
street_6S_1F1676368.4B-
street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_1S_1F1755787.7A-
street_2S_1M1957085.2A
street_4S_1M1616831.5B+
street_4S_2F1577872.3B+
street_4S_2M1877348.9B
df_using_mul.sort_index().loc[('C_2','street_7'):'C_3']
#可以选取该层中所有的元素,c_3
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_1S_1F1755787.7A-
street_2S_1M1957085.2A
street_4S_1M1616831.5B+
street_4S_2F1577872.3B+
street_4S_2M1877348.9B
street_5S_1F1876961.7B-
street_5S_2M1718832.7A
street_6S_2F1648195.5A-
street_7S_1M1888249.7B
street_7S_2F1909965.9C

第一类特殊情况:又元组构成的列表

df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
#表示选出某几个元素,精确到最内层的索引
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_2S_1M1957085.2A

第二类特殊情况:由列表构成元组

df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
#选出第一层在‘C_2’和‘C_2’中且第二层在‘4’'7'中的行
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_4S_1F1769463.5B-
street_4S_2M1559173.8A+
street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_4S_1M1616831.5B+
street_4S_2F1577872.3B+
street_4S_2M1877348.9B
street_7S_1M1888249.7B
street_7S_2F1909965.9C

多层索引中的slice对象

L1,L2=['A','B','C'],['a','b','c']
mul_index1=pd.MultiIndex.from_product([L1,L2],names=('Upper','Lower'))
L3,L4=['D','E','F'],['d','e','f']
mul_index2=pd.MultiIndex.from_product([L3,L4],names=('Big','Small'))
df_s=pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_s
BigDEF
Smalldefdefdef
UpperLower
Aa0.0862900.0164340.7709530.9057860.7747930.4319680.0588170.6392990.270593
b0.4694270.2236890.0280100.6657540.6630300.5461660.5938340.4224240.261654
c0.2169100.5888300.2269570.8943200.2826740.8743500.4586560.8659670.617002
Ba0.4297230.1771350.9672810.5733280.6812090.5634890.9058600.3209770.992234
b0.0633230.1185900.2398370.7660580.8231490.5283630.5215570.5340790.338802
c0.9266390.7592750.3963880.8516460.1178740.9400120.0457740.6789960.723390
Ca0.9397710.1800970.1439230.4098550.8465890.0667320.6644380.8550660.376750
b0.9410220.2328780.2348200.4030610.7633590.0199180.3924240.4392210.545601
c0.4809660.7304750.4231530.1941070.7563160.5673780.9835340.3287700.850334
idx=pd.IndexSlice
#这个函数的意义是对切片进行结合
df_s.loc[idx['B':,df_s['D']['d']>0.12],idx['E',df_s.sum()>2]]#先确定第一层,然后第二才是确定具体的切片
BigE
Smalldef
UpperLower
Ba0.5733280.6812090.563489
c0.8516460.1178740.940012
Ca0.4098550.8465890.066732
b0.4030610.7633590.019918
c0.1941070.7563160.567378

索引层的交换

swaplevel方法(两层交换)

df_using_mul
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_1street_1S_1M1736334.0A+
street_2S_1F1927332.5B+
street_2S_1M1868287.2B+
street_2S_1F1678180.4B-
street_4S_1F1596484.8B+
C_2street_5S_1M1886897.0A-
street_4S_1F1769463.5B-
street_6S_1M1605358.8A+
street_5S_1F1626333.8B
street_6S_1F1676368.4B-
C_3street_4S_1M1616831.5B+
street_1S_1F1755787.7A-
street_7S_1M1888249.7B
street_2S_1M1957085.2A
street_5S_1F1876961.7B-
C_1street_7S_2M1748483.3C
street_6S_2F1616150.6B+
street_4S_2M1576152.5B-
street_5S_2F1599772.2B+
street_4S_2M1708134.2A
C_2street_5S_2M19310039.1B
street_7S_2F1947768.5B+
street_4S_2M1559173.8A+
street_1S_2M1757447.2B-
street_7S_2F1837685.4B
C_3street_4S_2F1577872.3B+
street_5S_2M1718832.7A
street_7S_2F1909965.9C
street_6S_2F1648195.5A-
street_4S_2M1877348.9B
C_4street_2S_2F1926245.3A
street_7S_2M1668248.7B
street_6S_2F1586059.7B+
street_2S_2F1608467.7B
street_6S_2F1935447.6B
df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index()#指的是不同层的交换,axis=0,l列交换
SchoolGenderHeightWeightMathPhysics
AddressClass
street_1C_1S_1M1736334.0A+
C_2S_2M1757447.2B-
C_3S_1F1755787.7A-
street_2C_1S_1F1927332.5B+
C_1S_1M1868287.2B+
C_1S_1F1678180.4B-
C_3S_1M1957085.2A
C_4S_2F1926245.3A
C_4S_2F1608467.7B
street_4C_1S_1F1596484.8B+
C_1S_2M1576152.5B-
C_1S_2M1708134.2A
C_2S_1F1769463.5B-
C_2S_2M1559173.8A+
C_3S_1M1616831.5B+
C_3S_2F1577872.3B+
C_3S_2M1877348.9B
street_5C_1S_2F1599772.2B+
C_2S_1M1886897.0A-
C_2S_1F1626333.8B
C_2S_2M19310039.1B
C_3S_1F1876961.7B-
C_3S_2M1718832.7A
street_6C_1S_2F1616150.6B+
C_2S_1M1605358.8A+
C_2S_1F1676368.4B-
C_3S_2F1648195.5A-
C_4S_2F1586059.7B+
C_4S_2F1935447.6B
street_7C_1S_2M1748483.3C
C_2S_2F1947768.5B+
C_2S_2F1837685.4B
C_3S_1M1888249.7B
C_3S_2F1909965.9C
C_4S_2M1668248.7B

reorder_levels方法(多层交换)

df_muls=df.set_index(['School','Class','Address'])
df_muls
GenderHeightWeightMathPhysics
SchoolClassAddress
S_1C_1street_1M1736334.0A+
street_2F1927332.5B+
street_2M1868287.2B+
street_2F1678180.4B-
street_4F1596484.8B+
C_2street_5M1886897.0A-
street_4F1769463.5B-
street_6M1605358.8A+
street_5F1626333.8B
street_6F1676368.4B-
C_3street_4M1616831.5B+
street_1F1755787.7A-
street_7M1888249.7B
street_2M1957085.2A
street_5F1876961.7B-
S_2C_1street_7M1748483.3C
street_6F1616150.6B+
street_4M1576152.5B-
street_5F1599772.2B+
street_4M1708134.2A
C_2street_5M19310039.1B
street_7F1947768.5B+
street_4M1559173.8A+
street_1M1757447.2B-
street_7F1837685.4B
C_3street_4F1577872.3B+
street_5M1718832.7A
street_7F1909965.9C
street_6F1648195.5A-
street_4M1877348.9B
C_4street_2F1926245.3A
street_7M1668248.7B
street_6F1586059.7B+
street_2F1608467.7B
street_6F1935447.6B
df_muls.reorder_levels([2,0,1],axis=0).sort_index()#针对索引层级,由原来的(0,1,2)变为(2,0,1),也可以使用名字(‘Address’,'School','Class')
GenderHeightWeightMathPhysics
AddressSchoolClass
street_1S_1C_1M1736334.0A+
C_3F1755787.7A-
S_2C_2M1757447.2B-
street_2S_1C_1F1927332.5B+
C_1M1868287.2B+
C_1F1678180.4B-
C_3M1957085.2A
S_2C_4F1926245.3A
C_4F1608467.7B
street_4S_1C_1F1596484.8B+
C_2F1769463.5B-
C_3M1616831.5B+
S_2C_1M1576152.5B-
C_1M1708134.2A
C_2M1559173.8A+
C_3F1577872.3B+
C_3M1877348.9B
street_5S_1C_2M1886897.0A-
C_2F1626333.8B
C_3F1876961.7B-
S_2C_1F1599772.2B+
C_2M19310039.1B
C_3M1718832.7A
street_6S_1C_2M1605358.8A+
C_2F1676368.4B-
S_2C_1F1616150.6B+
C_3F1648195.5A-
C_4F1586059.7B+
C_4F1935447.6B
street_7S_1C_3M1888249.7B
S_2C_1M1748483.3C
C_2F1947768.5B+
C_2F1837685.4B
C_3F1909965.9C
C_4M1668248.7B

索引设定

index_col参数

  • index_col是read_csv中的一个参数,而不是一个方法
pd.read_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.csv',index_col=['Address','School'])
#这个不是多层索引
Unnamed: 0ClassIDGenderHeightWeightMathPhysics
AddressSchool
street_1S_10C_11101M1736334.0A+
street_2S_11C_11102F1927332.5B+
S_12C_11103M1868287.2B+
S_13C_11104F1678180.4B-
street_4S_14C_11105F1596484.8B+
street_5S_15C_21201M1886897.0A-
street_4S_16C_21202F1769463.5B-
street_6S_17C_21203M1605358.8A+
street_5S_18C_21204F1626333.8B
street_6S_19C_21205F1676368.4B-
street_4S_110C_31301M1616831.5B+
street_1S_111C_31302F1755787.7A-
street_7S_112C_31303M1888249.7B
street_2S_113C_31304M1957085.2A
street_5S_114C_31305F1876961.7B-
street_7S_215C_12101M1748483.3C
street_6S_216C_12102F1616150.6B+
street_4S_217C_12103M1576152.5B-
street_5S_218C_12104F1599772.2B+
street_4S_219C_12105M1708134.2A
street_5S_220C_22201M19310039.1B
street_7S_221C_22202F1947768.5B+
street_4S_222C_22203M1559173.8A+
street_1S_223C_22204M1757447.2B-
street_7S_224C_22205F1837685.4B
street_4S_225C_32301F1577872.3B+
street_5S_226C_32302M1718832.7A
street_7S_227C_32303F1909965.9C
street_6S_228C_32304F1648195.5A-
street_4S_229C_32305M1877348.9B
street_2S_230C_42401F1926245.3A
street_7S_231C_42402M1668248.7B
street_6S_232C_42403F1586059.7B+
street_2S_233C_42404F1608467.7B
street_6S_234C_42405F1935447.6B

reindex和reindex_like

  • reindex是指重新索引,重要性与索引对齐类似,用于重新排序
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
df.reindex(columns=['Height','Gender','Average'])#可以产生空的列与行
HeightGenderAverage
ID
1101173MNaN
1102192FNaN
1103186MNaN
1104167FNaN
1105159FNaN
1201188MNaN
1202176FNaN
1203160MNaN
1204162FNaN
1205167FNaN
1301161MNaN
1302175FNaN
1303188MNaN
1304195MNaN
1305187FNaN
2101174MNaN
2102161FNaN
2103157MNaN
2104159FNaN
2105170MNaN
2201193MNaN
2202194FNaN
2203155MNaN
2204175MNaN
2205183FNaN
2301157FNaN
2302171MNaN
2303190FNaN
2304164FNaN
2305187MNaN
2401192FNaN
2402166MNaN
2403158FNaN
2404160FNaN
2405193FNaN

可以选择缺失值的填充方法:fill_value and method(bfill/ffill/nearest),其中method参数必须索引单调

df.reindex(index=[1101,1203,1206,2402])
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.063.034.0A+
1203S_1C_2Mstreet_6160.053.058.8A+
1206NaNNaNNaNNaNNaNNaNNaNNaN
2402S_2C_4Mstreet_7166.082.048.7B
df.reindex(index=[1101,1203,1206,2402],method='bfill')
#bfill表示用所在索引1206的后一个有效行填充,ffill为前一个有效行填充,nearest是指最近的
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1203S_1C_2Mstreet_61605358.8A+
1206S_1C_3Mstreet_41616831.5B+
2402S_2C_4Mstreet_71668248.7B
df.reindex(index=[1101,1203,1206,2402],method='ffill')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1203S_1C_2Mstreet_61605358.8A+
1206S_1C_2Fstreet_61676368.4B-
2402S_2C_4Mstreet_71668248.7B
df.reindex(index=[1101,1203,1206,2402],method='nearest')
#数值上1205比1301更接近1206,因此用前者填充
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1203S_1C_2Mstreet_61605358.8A+
1206S_1C_2Fstreet_61676368.4B-
2402S_2C_4Mstreet_71668248.7B

reindex_like的作用为生成一个横纵索引完全与参数列表一致的DataFrame ,数据使用被调用的表

df_temp=pd.DataFrame({'Weight':np.zeros(5),'Height':np.zeros(5),'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
df_temp.reindex_like(df[0:5][['Weight','Height']])
WeightHeight
ID
11010.00.0
11020.00.0
11030.00.0
11040.00.0
1105NaNNaN

如果df_temp单调还可以使用method参数

df_temp=pd.DataFrame({'Weight':range(5),'Height':range(5),'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
df_temp.reindex_like(df[0:5][['Weight','Height']],method='bfill')
WeightHeight
ID
110100
110244
110322
110411
110533

set_index和reset_index

  • 先介绍set_index:从字面意思看,就是将某些列作为索引
  • 使用表内列作为索引:
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
df.set_index('Class')
SchoolGenderAddressHeightWeightMathPhysics
Class
C_1S_1Mstreet_11736334.0A+
C_1S_1Fstreet_21927332.5B+
C_1S_1Mstreet_21868287.2B+
C_1S_1Fstreet_21678180.4B-
C_1S_1Fstreet_41596484.8B+
C_2S_1Mstreet_51886897.0A-
C_2S_1Fstreet_41769463.5B-
C_2S_1Mstreet_61605358.8A+
C_2S_1Fstreet_51626333.8B
C_2S_1Fstreet_61676368.4B-
C_3S_1Mstreet_41616831.5B+
C_3S_1Fstreet_11755787.7A-
C_3S_1Mstreet_71888249.7B
C_3S_1Mstreet_21957085.2A
C_3S_1Fstreet_51876961.7B-
C_1S_2Mstreet_71748483.3C
C_1S_2Fstreet_61616150.6B+
C_1S_2Mstreet_41576152.5B-
C_1S_2Fstreet_51599772.2B+
C_1S_2Mstreet_41708134.2A
C_2S_2Mstreet_519310039.1B
C_2S_2Fstreet_71947768.5B+
C_2S_2Mstreet_41559173.8A+
C_2S_2Mstreet_11757447.2B-
C_2S_2Fstreet_71837685.4B
C_3S_2Fstreet_41577872.3B+
C_3S_2Mstreet_51718832.7A
C_3S_2Fstreet_71909965.9C
C_3S_2Fstreet_61648195.5A-
C_3S_2Mstreet_41877348.9B
C_4S_2Fstreet_21926245.3A
C_4S_2Mstreet_71668248.7B
C_4S_2Fstreet_61586059.7B+
C_4S_2Fstreet_21608467.7B
C_4S_2Fstreet_61935447.6B

利用append参数可以将当前索引维持不变

df.set_index('Class',append=True)
SchoolGenderAddressHeightWeightMathPhysics
IDClass
1101C_1S_1Mstreet_11736334.0A+
1102C_1S_1Fstreet_21927332.5B+
1103C_1S_1Mstreet_21868287.2B+
1104C_1S_1Fstreet_21678180.4B-
1105C_1S_1Fstreet_41596484.8B+
1201C_2S_1Mstreet_51886897.0A-
1202C_2S_1Fstreet_41769463.5B-
1203C_2S_1Mstreet_61605358.8A+
1204C_2S_1Fstreet_51626333.8B
1205C_2S_1Fstreet_61676368.4B-
1301C_3S_1Mstreet_41616831.5B+
1302C_3S_1Fstreet_11755787.7A-
1303C_3S_1Mstreet_71888249.7B
1304C_3S_1Mstreet_21957085.2A
1305C_3S_1Fstreet_51876961.7B-
2101C_1S_2Mstreet_71748483.3C
2102C_1S_2Fstreet_61616150.6B+
2103C_1S_2Mstreet_41576152.5B-
2104C_1S_2Fstreet_51599772.2B+
2105C_1S_2Mstreet_41708134.2A
2201C_2S_2Mstreet_519310039.1B
2202C_2S_2Fstreet_71947768.5B+
2203C_2S_2Mstreet_41559173.8A+
2204C_2S_2Mstreet_11757447.2B-
2205C_2S_2Fstreet_71837685.4B
2301C_3S_2Fstreet_41577872.3B+
2302C_3S_2Mstreet_51718832.7A
2303C_3S_2Fstreet_71909965.9C
2304C_3S_2Fstreet_61648195.5A-
2305C_3S_2Mstreet_41877348.9B
2401C_4S_2Fstreet_21926245.3A
2402C_4S_2Mstreet_71668248.7B
2403C_4S_2Fstreet_61586059.7B+
2404C_4S_2Fstreet_21608467.7B
2405C_4S_2Fstreet_61935447.6B

可以直接添加多级索引

df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))])
SchoolClassGenderAddressHeightWeightMathPhysics
01.0S_1C_1Mstreet_11736334.0A+
11.0S_1C_1Fstreet_21927332.5B+
21.0S_1C_1Mstreet_21868287.2B+
31.0S_1C_1Fstreet_21678180.4B-
41.0S_1C_1Fstreet_41596484.8B+
51.0S_1C_2Mstreet_51886897.0A-
61.0S_1C_2Fstreet_41769463.5B-
71.0S_1C_2Mstreet_61605358.8A+
81.0S_1C_2Fstreet_51626333.8B
91.0S_1C_2Fstreet_61676368.4B-
101.0S_1C_3Mstreet_41616831.5B+
111.0S_1C_3Fstreet_11755787.7A-
121.0S_1C_3Mstreet_71888249.7B
131.0S_1C_3Mstreet_21957085.2A
141.0S_1C_3Fstreet_51876961.7B-
151.0S_2C_1Mstreet_71748483.3C
161.0S_2C_1Fstreet_61616150.6B+
171.0S_2C_1Mstreet_41576152.5B-
181.0S_2C_1Fstreet_51599772.2B+
191.0S_2C_1Mstreet_41708134.2A
201.0S_2C_2Mstreet_519310039.1B
211.0S_2C_2Fstreet_71947768.5B+
221.0S_2C_2Mstreet_41559173.8A+
231.0S_2C_2Mstreet_11757447.2B-
241.0S_2C_2Fstreet_71837685.4B
251.0S_2C_3Fstreet_41577872.3B+
261.0S_2C_3Mstreet_51718832.7A
271.0S_2C_3Fstreet_71909965.9C
281.0S_2C_3Fstreet_61648195.5A-
291.0S_2C_3Mstreet_41877348.9B
301.0S_2C_4Fstreet_21926245.3A
311.0S_2C_4Mstreet_71668248.7B
321.0S_2C_4Fstreet_61586059.7B+
331.0S_2C_4Fstreet_21608467.7B
341.0S_2C_4Fstreet_61935447.6B

reset_index方法,主要功能是索引重置

df.reset_index()
IDSchoolClassGenderAddressHeightWeightMathPhysics
01101S_1C_1Mstreet_11736334.0A+
11102S_1C_1Fstreet_21927332.5B+
21103S_1C_1Mstreet_21868287.2B+
31104S_1C_1Fstreet_21678180.4B-
41105S_1C_1Fstreet_41596484.8B+
51201S_1C_2Mstreet_51886897.0A-
61202S_1C_2Fstreet_41769463.5B-
71203S_1C_2Mstreet_61605358.8A+
81204S_1C_2Fstreet_51626333.8B
91205S_1C_2Fstreet_61676368.4B-
101301S_1C_3Mstreet_41616831.5B+
111302S_1C_3Fstreet_11755787.7A-
121303S_1C_3Mstreet_71888249.7B
131304S_1C_3Mstreet_21957085.2A
141305S_1C_3Fstreet_51876961.7B-
152101S_2C_1Mstreet_71748483.3C
162102S_2C_1Fstreet_61616150.6B+
172103S_2C_1Mstreet_41576152.5B-
182104S_2C_1Fstreet_51599772.2B+
192105S_2C_1Mstreet_41708134.2A
202201S_2C_2Mstreet_519310039.1B
212202S_2C_2Fstreet_71947768.5B+
222203S_2C_2Mstreet_41559173.8A+
232204S_2C_2Mstreet_11757447.2B-
242205S_2C_2Fstreet_71837685.4B
252301S_2C_3Fstreet_41577872.3B+
262302S_2C_3Mstreet_51718832.7A
272303S_2C_3Fstreet_71909965.9C
282304S_2C_3Fstreet_61648195.5A-
292305S_2C_3Mstreet_41877348.9B
302401S_2C_4Fstreet_21926245.3A
312402S_2C_4Mstreet_71668248.7B
322403S_2C_4Fstreet_61586059.7B+
332404S_2C_4Fstreet_21608467.7B
342405S_2C_4Fstreet_61935447.6B

用level参数指定哪一层被reset,用col_level参数指定set到哪一层

L1,L2=['A','B','C'],['a','b','c']
mul_index1=pd.MultiIndex.from_product([L1,L2],names=('Upper','Lower'))
L3,L4=['D','E','F'],['d','e','f']
mul_index2=pd.MultiIndex.from_product([L3,L4],names=('Big','Small'))
df_temp=pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_temp
BigDEF
Smalldefdefdef
UpperLower
Aa0.0784390.6064730.0273360.8205400.7248650.0135410.8122420.0594550.411638
b0.9240120.6556970.2884710.0562280.2014860.8323350.5211110.2072240.846760
c0.1351540.5586030.1571170.1487680.0593790.1569310.9032500.8007080.427384
Ba0.6081380.6797600.0796420.8769100.7331280.5446120.6198750.8958870.232243
b0.8556680.7011330.3674700.7729940.9361750.4879150.8145330.1533070.711710
c0.5555240.6454120.0923940.6374090.8480080.5264160.5150750.1757140.480151
Ca0.7037970.4080540.7341730.4256740.8073810.1646090.7426900.0496130.115544
b0.4255310.7889190.7595880.4869300.1023130.5183170.1974880.4241220.339473
c0.1309660.8682250.8406920.4945510.9425970.1091350.4759030.1642800.973699
df_temp1=df_temp.reset_index(level=1,col_level=1)
df_temp1
BigDEF
SmallLowerdefdefdef
Upper
Aa0.0784390.6064730.0273360.8205400.7248650.0135410.8122420.0594550.411638
Ab0.9240120.6556970.2884710.0562280.2014860.8323350.5211110.2072240.846760
Ac0.1351540.5586030.1571170.1487680.0593790.1569310.9032500.8007080.427384
Ba0.6081380.6797600.0796420.8769100.7331280.5446120.6198750.8958870.232243
Bb0.8556680.7011330.3674700.7729940.9361750.4879150.8145330.1533070.711710
Bc0.5555240.6454120.0923940.6374090.8480080.5264160.5150750.1757140.480151
Ca0.7037970.4080540.7341730.4256740.8073810.1646090.7426900.0496130.115544
Cb0.4255310.7889190.7595880.4869300.1023130.5183170.1974880.4241220.339473
Cc0.1309660.8682250.8406920.4945510.9425970.1091350.4759030.1642800.973699

rename_axis和rename

  • rename_axis是针对多级索引的方法,作用是修改某一层的索引名,而不是索引标签!
df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})
BigBigDEF
Smalldefdefdef
UpperLowerLower
Aa0.0784390.6064730.0273360.8205400.7248650.0135410.8122420.0594550.411638
b0.9240120.6556970.2884710.0562280.2014860.8323350.5211110.2072240.846760
c0.1351540.5586030.1571170.1487680.0593790.1569310.9032500.8007080.427384
Ba0.6081380.6797600.0796420.8769100.7331280.5446120.6198750.8958870.232243
b0.8556680.7011330.3674700.7729940.9361750.4879150.8145330.1533070.711710
c0.5555240.6454120.0923940.6374090.8480080.5264160.5150750.1757140.480151
Ca0.7037970.4080540.7341730.4256740.8073810.1646090.7426900.0496130.115544
b0.4255310.7889190.7595880.4869300.1023130.5183170.1974880.4241220.339473
c0.1309660.8682250.8406920.4945510.9425970.1091350.4759030.1642800.973699
  • rename方法用于修改列或者行索引标签,而不是索引名
df_temp.rename(index={'A':"AS"},columns={'F':'FF'})#可以针对任一层级,无须专门说明层级!
BigDEFF
Smalldefdefdef
UpperLower
ASa0.0784390.6064730.0273360.8205400.7248650.0135410.8122420.0594550.411638
b0.9240120.6556970.2884710.0562280.2014860.8323350.5211110.2072240.846760
c0.1351540.5586030.1571170.1487680.0593790.1569310.9032500.8007080.427384
Ba0.6081380.6797600.0796420.8769100.7331280.5446120.6198750.8958870.232243
b0.8556680.7011330.3674700.7729940.9361750.4879150.8145330.1533070.711710
c0.5555240.6454120.0923940.6374090.8480080.5264160.5150750.1757140.480151
Ca0.7037970.4080540.7341730.4256740.8073810.1646090.7426900.0496130.115544
b0.4255310.7889190.7595880.4869300.1023130.5183170.1974880.4241220.339473
c0.1309660.8682250.8406920.4945510.9425970.1091350.4759030.1642800.973699

常用索引型函数

where函数

  • 当对条件为False的单元进行填充:
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
df.where(df['Gender']=='M')
#不满足条件的行全部设置为NAN-筛选
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.063.034.0A+
1102NaNNaNNaNNaNNaNNaNNaNNaN
1103S_1C_1Mstreet_2186.082.087.2B+
1104NaNNaNNaNNaNNaNNaNNaNNaN
1105NaNNaNNaNNaNNaNNaNNaNNaN
1201S_1C_2Mstreet_5188.068.097.0A-
1202NaNNaNNaNNaNNaNNaNNaNNaN
1203S_1C_2Mstreet_6160.053.058.8A+
1204NaNNaNNaNNaNNaNNaNNaNNaN
1205NaNNaNNaNNaNNaNNaNNaNNaN
1301S_1C_3Mstreet_4161.068.031.5B+
1302NaNNaNNaNNaNNaNNaNNaNNaN
1303S_1C_3Mstreet_7188.082.049.7B
1304S_1C_3Mstreet_2195.070.085.2A
1305NaNNaNNaNNaNNaNNaNNaNNaN
2101S_2C_1Mstreet_7174.084.083.3C
2102NaNNaNNaNNaNNaNNaNNaNNaN
2103S_2C_1Mstreet_4157.061.052.5B-
2104NaNNaNNaNNaNNaNNaNNaNNaN
2105S_2C_1Mstreet_4170.081.034.2A
2201S_2C_2Mstreet_5193.0100.039.1B
2202NaNNaNNaNNaNNaNNaNNaNNaN
2203S_2C_2Mstreet_4155.091.073.8A+
2204S_2C_2Mstreet_1175.074.047.2B-
2205NaNNaNNaNNaNNaNNaNNaNNaN
2301NaNNaNNaNNaNNaNNaNNaNNaN
2302S_2C_3Mstreet_5171.088.032.7A
2303NaNNaNNaNNaNNaNNaNNaNNaN
2304NaNNaNNaNNaNNaNNaNNaNNaN
2305S_2C_3Mstreet_4187.073.048.9B
2401NaNNaNNaNNaNNaNNaNNaNNaN
2402S_2C_4Mstreet_7166.082.048.7B
2403NaNNaNNaNNaNNaNNaNNaNNaN
2404NaNNaNNaNNaNNaNNaNNaNNaN
2405NaNNaNNaNNaNNaNNaNNaNNaN
  • 通过这种方法筛选结果和[]操作符的结果完全一致
df.where(df['Gender']=='M').dropna()#剔除空值的行
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.063.034.0A+
1103S_1C_1Mstreet_2186.082.087.2B+
1201S_1C_2Mstreet_5188.068.097.0A-
1203S_1C_2Mstreet_6160.053.058.8A+
1301S_1C_3Mstreet_4161.068.031.5B+
1303S_1C_3Mstreet_7188.082.049.7B
1304S_1C_3Mstreet_2195.070.085.2A
2101S_2C_1Mstreet_7174.084.083.3C
2103S_2C_1Mstreet_4157.061.052.5B-
2105S_2C_1Mstreet_4170.081.034.2A
2201S_2C_2Mstreet_5193.0100.039.1B
2203S_2C_2Mstreet_4155.091.073.8A+
2204S_2C_2Mstreet_1175.074.047.2B-
2302S_2C_3Mstreet_5171.088.032.7A
2305S_2C_3Mstreet_4187.073.048.9B
2402S_2C_4Mstreet_7166.082.048.7B
df[df['Gender']=='M']#与之前结果一致
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1103S_1C_1Mstreet_21868287.2B+
1201S_1C_2Mstreet_51886897.0A-
1203S_1C_2Mstreet_61605358.8A+
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
2101S_2C_1Mstreet_71748483.3C
2103S_2C_1Mstreet_41576152.5B-
2105S_2C_1Mstreet_41708134.2A
2201S_2C_2Mstreet_519310039.1B
2203S_2C_2Mstreet_41559173.8A+
2204S_2C_2Mstreet_11757447.2B-
2302S_2C_3Mstreet_51718832.7A
2305S_2C_3Mstreet_41877348.9B
2402S_2C_4Mstreet_71668248.7B
  • 第一个参数为布尔条件,第二个参数为填充值:
df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1]))
#首先判断Gender是否是M,然后对于非M的,进行填充(均匀值分布)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.00000063.00000034.000000A+
11020.1617380.3902470.6570450.08818980.1112760.1792490.6716460.0845629
1103S_1C_1Mstreet_2186.00000082.00000087.200000B+
11040.6913230.05949060.5804460.1086470.4874080.8910300.3233220.462188
11050.2497550.9240710.9410960.1365370.2209080.1327540.5902900.227241
1201S_1C_2Mstreet_5188.00000068.00000097.000000A-
12020.3759640.02259060.5912330.7382390.4884560.8501550.1339230.196627
1203S_1C_2Mstreet_6160.00000053.00000058.800000A+
12040.986370.126430.38010.4577920.4272300.0865000.6653440.0635772
12050.2604350.5474760.4271860.4326370.4017690.2859260.8165760.543128
1301S_1C_3Mstreet_4161.00000068.00000031.500000B+
13020.05824590.4472840.4674350.7813730.4092790.6014200.2590860.694789
1303S_1C_3Mstreet_7188.00000082.00000049.700000B
1304S_1C_3Mstreet_2195.00000070.00000085.200000A
13050.02159510.5763860.399190.4174880.7030720.1327060.4491120.946308
2101S_2C_1Mstreet_7174.00000084.00000083.300000C
21020.9567070.07825640.1455320.303580.6072910.2961910.4302930.0231763
2103S_2C_1Mstreet_4157.00000061.00000052.500000B-
21040.5626540.541940.5936740.6240790.8724370.3923480.3333590.459291
2105S_2C_1Mstreet_4170.00000081.00000034.200000A
2201S_2C_2Mstreet_5193.000000100.00000039.100000B
22020.7135380.2627850.1065740.8305320.8437100.9284490.5777970.372659
2203S_2C_2Mstreet_4155.00000091.00000073.800000A+
2204S_2C_2Mstreet_1175.00000074.00000047.200000B-
22050.2557740.7857940.7749710.1477580.4211190.0360380.3602230.657362
23010.8446610.150340.6436870.3230730.8794500.5734110.2210430.123979
2302S_2C_3Mstreet_5171.00000088.00000032.700000A
23030.1695790.1018330.3102070.4779930.1434840.5751780.7764760.941298
23040.3491150.8472120.03588210.8976380.2522170.3592010.3929410.813281
2305S_2C_3Mstreet_4187.00000073.00000048.900000B
24010.08982090.1568820.3026670.5437480.8852450.8161170.1322350.944388
2402S_2C_4Mstreet_7166.00000082.00000048.700000B
24030.8605980.8727370.6195510.9803740.9396240.8755740.1560610.840507
24040.7965950.8455910.9670.8896820.1220280.6042790.3363560.663724
24050.5191060.7845990.9840370.9298750.9538320.3985720.6525830.135617
df.shape[0]
35
df.shape[1]
8

np.random? 的解释

  • random 在[0,1)上取均匀浮点值
  • bytes np.random 的解释
  • permutation 生成随机序列
  • choice 生成随机样本
  • rand值均匀分布。
  • randn正态分布的值。
  • ranf均匀分布的浮点数。

mask函数

  • mask函数与where功能上相反,其余完全一致,即对条件为True的单元进行填充

query函数

df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
  • query函数中的布尔表达式中,下面的符号都是合法的:行列索引名,字符串,and/not/or/&/|/~/not in /in/==/!=、四则运算
df.query('(Address in ["street_4","street_7"])&(Weight>=(20+40))&(ID in [1101,1102,1103,2402])')
#类似于数据库的查询功能
#整个是一个语句 '',这点需要注意,里面的含有字符串的用“”,其余的不用
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2402S_2C_4Mstreet_71668248.7B
df.query('(Class in ["C_1","C_2"])&(Math>45) and (Physics in ["A+","A"]) ')#= 后面只能是一个值
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1203S_1C_2Mstreet_61605358.8A+
2203S_2C_2Mstreet_41559173.8A+
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35 entries, 1101 to 2405
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    35 non-null     object 
 2   Gender   35 non-null     object 
 3   Address  35 non-null     object 
 4   Height   35 non-null     int64  
 5   Weight   35 non-null     int64  
 6   Math     35 non-null     float64
 7   Physics  35 non-null     object 
dtypes: float64(1), int64(2), object(5)
memory usage: 3.7+ KB

重复元素处理

duplicater方法

  • 该方法返回了是否重复的布尔列表
df.duplicated('Class')
#返回了元素之前是否有重复的
ID
1101    False
1102     True
1103     True
1104     True
1105     True
1201    False
1202     True
1203     True
1204     True
1205     True
1301    False
1302     True
1303     True
1304     True
1305     True
2101     True
2102     True
2103     True
2104     True
2105     True
2201     True
2202     True
2203     True
2204     True
2205     True
2301     True
2302     True
2303     True
2304     True
2305     True
2401    False
2402     True
2403     True
2404     True
2405     True
dtype: bool
  • 可选keep默认为first,即首次出现设为不重复,若为last,则最后一次设为不重复,若为False,则所有重复项为True

drop_duplicates方法

  • 剔除重复项,例如保留每组的第一个值(可以通过keep参数控制)、最后一个值
df.drop_duplicates('Class')#保留第一个值
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1201S_1C_2Mstreet_51886897.0A-
1301S_1C_3Mstreet_41616831.5B+
2401S_2C_4Fstreet_21926245.3A
df.drop_duplicates('Class',keep='last')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2105S_2C_1Mstreet_41708134.2A
2205S_2C_2Fstreet_71837685.4B
2305S_2C_3Mstreet_41877348.9B
2405S_2C_4Fstreet_6193
  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值