第2章 索引

第2章 索引

文章目录

大致内容

在这里插入图片描述

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+

一、单级索引

1. loc方法、iloc方法、[]操作符

最常用的索引方法可能就是这三类,其中iloc表示位置索引,loc表示标签索引,[]也具有很大的便利性,各有特点
(a)loc方法(注意:所有在loc中使用的切片全部包含右端点!)
① 单行索引:

返回对应行的Series


df.loc[1103]
School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object
② 多行索引:

返回1102和2304行 组成DataFrame


df.loc[[1102,2304]]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
2304S_2C_3Fstreet_61648195.5A-

切片访问
开头与结尾都取到重点差别


df.loc[1304:].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1304S_1C_3Mstreet_21957085.2A
1305S_1C_3Fstreet_51876961.7B-
2101S_2C_1Mstreet_71748483.3C
2102S_2C_1Fstreet_61616150.6B+
2103S_2C_1Mstreet_41576152.5B-
## 逆序切片
df.loc[2402::-1].head(60)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2402S_2C_4Mstreet_71668248.7B
2401S_2C_4Fstreet_21926245.3A
2305S_2C_3Mstreet_41877348.9B
2304S_2C_3Fstreet_61648195.5A-
2303S_2C_3Fstreet_71909965.9C
2302S_2C_3Mstreet_51718832.7A
2301S_2C_3Fstreet_41577872.3B+
2205S_2C_2Fstreet_71837685.4B
2204S_2C_2Mstreet_11757447.2B-
2203S_2C_2Mstreet_41559173.8A+
2202S_2C_2Fstreet_71947768.5B+
2201S_2C_2Mstreet_519310039.1B
2105S_2C_1Mstreet_41708134.2A
2104S_2C_1Fstreet_51599772.2B+
2103S_2C_1Mstreet_41576152.5B-
2102S_2C_1Fstreet_61616150.6B+
2101S_2C_1Mstreet_71748483.3C
1305S_1C_3Fstreet_51876961.7B-
1304S_1C_3Mstreet_21957085.2A
1303S_1C_3Mstreet_71888249.7B
1302S_1C_3Fstreet_11755787.7A-
1301S_1C_3Mstreet_41616831.5B+
1205S_1C_2Fstreet_61676368.4B-
1204S_1C_2Fstreet_51626333.8B
1203S_1C_2Mstreet_61605358.8A+
1202S_1C_2Fstreet_41769463.5B-
1201S_1C_2Mstreet_51886897.0A-
1105S_1C_1Fstreet_41596484.8B+
1104S_1C_1Fstreet_21678180.4B-
1103S_1C_1Mstreet_21868287.2B+
1102S_1C_1Fstreet_21927332.5B+
1101S_1C_1Mstreet_11736334.0A+
③ 单列索引:

Series类型

## 对于全部行取Height列
df.loc[:,'Height'].head()
#type(df.loc[:,'Height'])
ID
1101    173
1102    192
1103    186
1104    167
1105    159
Name: Height, dtype: int64
④ 多列索引:
## 注意和下面例子区别 这里取得是Height列和Math列 组成DataFrame
df.loc[:,['Height','Math']].head()
#type(df.loc[:,['Height','Math']])
HeightMath
ID
110117334.0
110219232.5
110318687.2
110416780.4
110515984.8

这里取得是切片从Height列到Math列 组成DataFrame

## 包括左右两端
df.loc[:,'Height':'Math'].head()
HeightWeightMath
ID
11011736334.0
11021927332.5
11031868287.2
11041678180.4
11051596484.8
⑤ 联合索引:

对于行列同时切片 [start:end:step]


df.loc[1102:2401:3,'Height':'Math'].head()
HeightWeightMath
ID
11021927332.5
11051596484.8
12031605358.8
13011616831.5
13041957085.2

标签值索引与位置索引区别

s = pd.Series([1,2,3],index=[1.1,2.2,3.3])

## 标签值索引
# s.loc[1:2:1]
# s.loc[1:2.2:1]
#s = pd.Series([1,2,3],index=[1.1,3.2,5.3])
## 报错 标签值不存在s[1]
#s[1.1]
## 标签值切片
#s[1:]
## 位置索引 
s = pd.Series([1,2,3],index=[1,3,5])
#s[2:]
#s[3]
## 总的来说 对于整数index讲 单个索引访问按元素值索引 切片访问按位置索引
## 对于浮点数index讲 单个浮点数索引按元素值索引 切片访问按元素值索引
⑥ 函数式索引:
  • loc中使用的函数操作的对象是前面的DataFrame
## 可以通过内置函数来查找符合条件的值
df.loc[lambda x:x['Gender']=='M'].head(10)

SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1103S_1C_1Mstreet_21868287.2B+
1201S_1C_2Mstreet_51886897.0A-
1203S_1C_2Mstreet_61605358.8A+
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
2101S_2C_1Mstreet_71748483.3C
2103S_2C_1Mstreet_41576152.5B-
2105S_2C_1Mstreet_41708134.2A
  • isin 确定某列值在某个范围中
# def f(x):
#     return [1101,1103]
# df.loc[f]
# df.loc[[1101,1103]]
## 
# df.loc[df.index.isin([1101,1103])]
## 列表表达式
df.loc[[x for x in [1101,1103]]]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1103S_1C_1Mstreet_21868287.2B+
⑦ 布尔索引(将重点在第2节介绍)
df.loc[df['Address'].isin(['street_7','street_4'])].head()
# df.loc[lambda x:x['Address'].isin(['street_7','street_4'])].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1105S_1C_1Fstreet_41596484.8B+
1202S_1C_2Fstreet_41769463.5B-
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
2101S_2C_1Mstreet_71748483.3C
df['Address'].values
array(['street_1', 'street_2', 'street_2', 'street_2', 'street_4',
       'street_5', 'street_4', 'street_6', 'street_5', 'street_6',
       'street_4', 'street_1', 'street_7', 'street_2', 'street_5',
       'street_7', 'street_6', 'street_4', 'street_5', 'street_4',
       'street_5', 'street_7', 'street_4', 'street_1', 'street_7',
       'street_4', 'street_5', 'street_7', 'street_6', 'street_4',
       'street_2', 'street_7', 'street_6', 'street_2', 'street_6'],
      dtype=object)
df.loc[[True if i[-1]=='4' or i[-1]=='7' else False for i in df['Address'].values]].head()
## df[df['Address'].str.contains('4|7')]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1105S_1C_1Fstreet_41596484.8B+
1202S_1C_2Fstreet_41769463.5B-
1301S_1C_3Mstreet_41616831.5B+
1303S_1C_3Mstreet_71888249.7B
2101S_2C_1Mstreet_71748483.3C
2103S_2C_1Mstreet_41576152.5B-
2105S_2C_1Mstreet_41708134.2A
2202S_2C_2Fstreet_71947768.5B+
2203S_2C_2Mstreet_41559173.8A+
2205S_2C_2Fstreet_71837685.4B
2301S_2C_3Fstreet_41577872.3B+
2303S_2C_3Fstreet_71909965.9C
2305S_2C_3Mstreet_41877348.9B
2402S_2C_4Mstreet_71668248.7B
  • 提问的问题如何把缺失值替换为前后值的平均值
## 
#s = pd.Series([1,2,1,np.nan,7,4,3,np.nan,7,4,5])
#s[s.isna()] = (s[s[s.isna()].index+1].values + s[s[s.isna()].index-1].values)/2
小节:本质上说,loc中能传入的只有布尔列表和索引子集构成的列表,只要把握这个原则就很容易理解上面那些操作
(b)iloc方法(注意与loc不同,切片右端点不包含)
① 单行索引:
df.iloc[3]
School          S_1
Class           C_1
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 1104, dtype: object
#df.loc[1102:,'Height']
#df.iloc[1:]['Height']
#df.iloc[1:,4]
#df[1:]['Height']
ID
1102    192
1103    186
1104    167
1105    159
1201    188
1202    176
1203    160
1204    162
1205    167
1301    161
1302    175
1303    188
1304    195
1305    187
2101    174
2102    161
2103    157
2104    159
2105    170
2201    193
2202    194
2203    155
2204    175
2205    183
2301    157
2302    171
2303    190
2304    164
2305    187
2401    192
2402    166
2403    158
2404    160
2405    193
Name: Height, dtype: int64
② 多行索引:
df.iloc[:5]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
③ 单列索引:
df.iloc[:,3].head()
ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
Name: Address, dtype: object
④ 多列索引:
  • 对于全部行 从第七列每次减少两列
df.iloc[:,7::-2].head(10)
PhysicsWeightAddressClass
ID
1101A+63street_1C_1
1102B+73street_2C_1
1103B+82street_2C_1
1104B-81street_2C_1
1105B+64street_4C_1
1201A-68street_5C_2
1202B-94street_4C_2
1203A+53street_6C_2
1204B63street_5C_2
1205B-63street_6C_2
⑤ 混合索引:
## 每次隔三行 每次隔一列倒数
df.iloc[3::4,7::-2].head()
PhysicsWeightAddressClass
ID
1104B-81street_2C_1
1203A+53street_6C_2
1302A-57street_1C_3
2101C84street_7C_1
2105A81street_4C_1
⑥ 函数式索引:
  • 默认值取3 即取第三行
df.iloc[lambda x:[3]].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1104S_1C_1Fstreet_21678180.4B-
小节:由上所述,iloc中接收的参数只能为整数或整数列表,不能使用布尔索引 纯元素Series布尔索引报错 将其转为list列表就可以 或者在后边加上values就可
(c) []操作符
如果不想陷入困境,请不要在行索引为浮点时使用[]操作符,因为在Series中的浮点[]并不是进行位置比较,而是值比较,非常特殊
(c.1)Series的[]操作
① 单元素索引:
s = pd.Series(df['Math'],index=df.index)
s[1101]
#使用的是索引标签
34.0
② 多行索引:
  • 使用的是绝对位置的整数切片,与元素无关,这里容易混淆
s[0:4]
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64
③ 函数式索引:
  • 注意使用lambda函数时,直接切片(如:s[lambda x: 16::-6])就报错,此时使用的不是绝对位置切片,而是元素切片,非常易错
s[lambda x: x.index[16::-6]]
## cannot do slice indexing on with these indexers
#s[lambda x:16::-6]
ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64
④ 布尔索引:
s[s>80]
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64
(c.2)DataFrame的[]操作
① 单行索引:
  • 同Series使用了绝对位置切片
df[1:2]
#这里非常容易写成df['label'],会报错
#如果想要获得某一个元素,可用如下get_loc方法:
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
row = df.index.get_loc(1102)
df[row:row+1]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
② 多行索引:
  • 用切片,如果是选取指定的某几行,推荐使用loc,否则很可能报错
df[3:5]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
③ 单列索引:
df['School'].head()
ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object
④ 多列索引:
df[['School','Math']].head()
SchoolMath
ID
1101S_134.0
1102S_132.5
1103S_187.2
1104S_180.4
1105S_184.8
⑤函数式索引:
df[lambda x:['Math','Physics']].head()
MathPhysics
ID
110134.0A+
110232.5B+
110387.2B+
110480.4B-
110584.8B+
⑥ 布尔索引:
#print(type(df['Gender']=='F'))
df[df['Gender']=='F'].head()
<class 'pandas.core.series.Series'>
小节:一般来说,[]操作符常用于列选择或布尔选择,尽量避免行的选择

2. 布尔索引

(a)布尔符号:’&’,’|’,’~’:分别代表和and,或or,取反not
df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
1104S_1C_1Fstreet_21678180.4B-
2401S_2C_4Fstreet_21926245.3A
2404S_2C_4Fstreet_21608467.7B
df[(df['Math']>85)|(df['Address']=='street_7')].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1103S_1C_1Mstreet_21868287.2B+
1201S_1C_2Mstreet_51886897.0A-
1302S_1C_3Fstreet_11755787.7A-
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
df[~((df['Math']>75)|(df['Address']=='street_1'))].head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
1202S_1C_2Fstreet_41769463.5B-
1203S_1C_2Mstreet_61605358.8A+
1204S_1C_2Fstreet_51626333.8B
1205S_1C_2Fstreet_61676368.4B-
loc和[]中相应位置都能使用布尔列表选择:
df[:8]['Address']=='street_6'
ID
1101    False
1102    False
1103    False
1104    False
1105    False
1201    False
1202    False
1203     True
Name: Address, dtype: bool
(df[:8]['Address']=='street_6').values
array([False, False, False, False, False, False, False,  True])
  • 如果不加values就会索引对齐发生错误,Pandas中的索引对齐是一个重要特征,很多时候非常使用
## 选择 Math分数大于60的行,选择最后一列其他列为False
# df.loc[:,(df[:8]['Address']=='street_6').values].head()
# df.loc[df['Math']>60,:].head()
df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head()
#但是若不加以留意,就会埋下隐患
## 这里我的理解是如果不加values是代之前八行的布尔Series 而对应的位置在列上产生矛盾
## Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
Physics
ID
1103B+
1104B-
1105B+
1201A-
1202B-
(b) isin方法
df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
2105S_2C_1Mstreet_41708134.2A
2203S_2C_2Mstreet_41559173.8A+
  • all与&的思路是类似的,其中的1代表按照跨列方向判断是否全为True
#上面也可以用字典方式写:
df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(1)]
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
2105S_2C_1Mstreet_41708134.2A
2203S_2C_2Mstreet_41559173.8A+

3. 快速标量索引

当只需要取一个元素时,at和iat方法能够提供更快的实现:
# 访问行/列标签对的单个值。
## 必须输入两个参数
display(df.at[1101,'School'])
df.at[1101,'School']
# display(df.loc[1101,'School'])
# # 通过整数位置访问行/列对的单个值
# display(df.iat[0,0])
# display(df.iloc[0,0])
# #可尝试去掉注释对比时间
# %timeit df.at[1101,'School']
# %timeit df.loc[1101,'School']
# %timeit df.iat[0,0]
# %timeit df.iloc[0,0]
'S_1'
'S_1'

4. 区间索引

此处介绍并不是说只能在单级索引中使用区间索引,只是作为一种特殊类型的索引方式,在此处先行介绍
(a)利用interval_range方法
pd.interval_range(start=0,end=5)
#closed参数可选'left''right''both''neither',默认左开右闭
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')
pd.interval_range(start=0,periods=10,freq=5)
#periods参数控制区间个数,freq控制步长
IntervalIndex([(0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30], (30, 35], (35, 40], (40, 45], (45, 50]],
              closed='right',
              dtype='interval[int64]')
(b)利用cut将数值列转为区间为元素的分类变量,例如统计数学成绩的区间情况:
math_interval = pd.cut(df['Math'],bins=[0,40,60,80,100])
#注意,如果没有类型转换,此时并不是区间类型,而是category类型
math_interval.head()
ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]
df.join(math_interval,rsuffix='_interval')
SchoolClassGenderAddressHeightWeightMathPhysicsMath_interval
ID
1101S_1C_1Mstreet_11736334.0A+(0, 40]
1102S_1C_1Fstreet_21927332.5B+(0, 40]
1103S_1C_1Mstreet_21868287.2B+(80, 100]
1104S_1C_1Fstreet_21678180.4B-(80, 100]
1105S_1C_1Fstreet_41596484.8B+(80, 100]
1201S_1C_2Mstreet_51886897.0A-(80, 100]
1202S_1C_2Fstreet_41769463.5B-(60, 80]
1203S_1C_2Mstreet_61605358.8A+(40, 60]
1204S_1C_2Fstreet_51626333.8B(0, 40]
1205S_1C_2Fstreet_61676368.4B-(60, 80]
1301S_1C_3Mstreet_41616831.5B+(0, 40]
1302S_1C_3Fstreet_11755787.7A-(80, 100]
1303S_1C_3Mstreet_71888249.7B(40, 60]
1304S_1C_3Mstreet_21957085.2A(80, 100]
1305S_1C_3Fstreet_51876961.7B-(60, 80]
2101S_2C_1Mstreet_71748483.3C(80, 100]
2102S_2C_1Fstreet_61616150.6B+(40, 60]
2103S_2C_1Mstreet_41576152.5B-(40, 60]
2104S_2C_1Fstreet_51599772.2B+(60, 80]
2105S_2C_1Mstreet_41708134.2A(0, 40]
2201S_2C_2Mstreet_519310039.1B(0, 40]
2202S_2C_2Fstreet_71947768.5B+(60, 80]
2203S_2C_2Mstreet_41559173.8A+(60, 80]
2204S_2C_2Mstreet_11757447.2B-(40, 60]
2205S_2C_2Fstreet_71837685.4B(80, 100]
2301S_2C_3Fstreet_41577872.3B+(60, 80]
2302S_2C_3Mstreet_51718832.7A(0, 40]
2303S_2C_3Fstreet_71909965.9C(60, 80]
2304S_2C_3Fstreet_61648195.5A-(80, 100]
2305S_2C_3Mstreet_41877348.9B(40, 60]
2401S_2C_4Fstreet_21926245.3A(40, 60]
2402S_2C_4Mstreet_71668248.7B(40, 60]
2403S_2C_4Fstreet_61586059.7B+(40, 60]
2404S_2C_4Fstreet_21608467.7B(60, 80]
2405S_2C_4Fstreet_61935447.6B(40, 60]
df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]
MathMath_interval
ID
110134.0(0, 40]
110232.5(0, 40]
110387.2(80, 100]
110480.4(80, 100]
110584.8(80, 100]
120197.0(80, 100]
120263.5(60, 80]
120358.8(40, 60]
120433.8(0, 40]
120568.4(60, 80]
130131.5(0, 40]
130287.7(80, 100]
130349.7(40, 60]
130485.2(80, 100]
130561.7(60, 80]
210183.3(80, 100]
210250.6(40, 60]
210352.5(40, 60]
210472.2(60, 80]
210534.2(0, 40]
220139.1(0, 40]
220268.5(60, 80]
220373.8(60, 80]
220447.2(40, 60]
220585.4(80, 100]
230172.3(60, 80]
230232.7(0, 40]
230365.9(60, 80]
230495.5(80, 100]
230548.9(40, 60]
240145.3(40, 60]
240248.7(40, 60]
240359.7(40, 60]
240467.7(60, 80]
240547.6(40, 60]
(c)区间索引的选取
## rsuffix Suffix to use from right frame’s overlapping columns. 规定表合成时使用的后缀
## 使用调用框架的索引默认使用左边
## 选取 Math  Math_interval列
## 重置索引为Math_interval
df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
            .reset_index().set_index('Math_interval')
df_i.head()
IDMath
Math_interval
(0, 40]110134.0
(0, 40]110232.5
(80, 100]110387.2
(80, 100]110480.4
(80, 100]110584.8
df_i.loc[65].head()
#包含该值就会被选中
IDMath
Math_interval
(60, 80]120263.5
(60, 80]120568.4
(60, 80]130561.7
(60, 80]210472.2
(60, 80]220268.5
df_i.loc[[65,90]].head()
IDMath
Math_interval
(60, 80]120263.5
(60, 80]120568.4
(60, 80]130561.7
(60, 80]210472.2
(60, 80]220268.5
df_i.index.astype('interval').overlaps(pd.Interval(70, 85))
array([False, False,  True,  True,  True,  True,  True, False, False,
        True, False,  True, False,  True,  True,  True, False, False,
        True, False, False,  True,  True, False,  True,  True, False,
        True,  True, False, False, False, False,  True, False])
如果想要选取某个区间,先要把分类变量转为区间变量,再使用overlap方法:
  • overlaps Check whether two Interval objects overlap.如果重叠返回true
#df_i.loc[pd.Interval(70,75)].head() 报错
## astype将dtype变换为interval
## 获得到布尔列表
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()
IDMath
Math_interval
(80, 100]110387.2
(80, 100]110480.4
(80, 100]110584.8
(80, 100]120197.0
(60, 80]120263.5

二、多级索引

1. 创建多级索引

(a)通过from_tuple或from_arrays
① 直接创建元组
tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
#type(mul_index)
mul_index
pandas.core.indexes.multi.MultiIndex
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
UpperLower
Aaperfect
bgood
Bafair
bbad
L1 = list('AABB')
L2 = list('abab')
#type(zip(L1,L2)) 按列压缩
list(zip(L1,L2))

[('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')]
② 利用zip创建元组
L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1,L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
UpperLower
Aaperfect
bgood
Bafair
bbad
③ 通过Array创建
arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
UpperLower
Aaperfect
bgood
Bafair
bbad
  • 由此看出内部自动转成元组
mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])
(b)通过from_product
## 根据多个可迭代项的笛卡尔积建立MultiIndex。
L1 = ['A','B']
L2 = ['a','b']
pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
#两两相乘 笛卡尔乘积
MultiIndex(levels=[['A', 'B'], ['a', 'b']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['Upper', 'Lower'])
(c)指定df中的列创建(set_index方法)
df_using_mul = df.set_index(['Class','Address'])
df_using_mul.head()
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_1street_1S_1M1736334.0A+
street_2S_1F1927332.5B+
street_2S_1M1868287.2B+
street_2S_1F1678180.4B-
street_4S_1F1596484.8B+

2. 多层索引切片

df_using_mul.head()
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_1street_1S_1M1736334.0A+
street_2S_1F1927332.5B+
street_2S_1M1868287.2B+
street_2S_1F1678180.4B-
street_4S_1F1596484.8B+
(a)一般切片
  • 当索引不排序时,单个索引会报出性能警告
  • is_lexsorted()该函数检查是否排序
# df_using_mul.loc['C_2','street_5']
#df_using_mul.index.is_lexsorted()
df_using_mul.sort_index().loc['C_2','street_5']
#df_using_mul.sort_index().index.is_lexsorted()
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_5S_1M1886897.0A-
street_5S_1F1626333.8B
street_5S_2M19310039.1B
  • 当不排序时,不能使用多层切片
#df_using_mul.loc[('C_2','street_5'):] 报错
df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
#注意此处由于使用了loc,因此仍然包含右端点
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_6S_1M1605358.8A+
street_6S_1F1676368.4B-
street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_1S_1F1755787.7A-
street_2S_1M1957085.2A
street_4S_1M1616831.5B+
street_4S_2F1577872.3B+
street_4S_2M1877348.9B
  • 非元组也是合法的,表示选中该层所有元素
df_using_mul.sort_index().loc[('C_2','street_7'):'C_3'].head()
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_1S_1F1755787.7A-
street_2S_1M1957085.2A
street_4S_1M1616831.5B+
(b)第一类特殊情况:由元组构成列表
df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
#表示选出某几个元素,精确到最内层索引
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_2S_1M1957085.2A
(c)第二类特殊情况:由列表构成元组
df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
#选出第一层在‘C_2’和'C_3'中且第二层在'street_4'和'street_7'中的行
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_2street_4S_1F1769463.5B-
street_4S_2M1559173.8A+
street_7S_2F1947768.5B+
street_7S_2F1837685.4B
C_3street_4S_1M1616831.5B+
street_4S_2F1577872.3B+
street_4S_2M1877348.9B
street_7S_1M1888249.7B
street_7S_2F1909965.9C

3. 多层索引中的slice对象

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_s
BigDEF
Smalldefdefdef
UpperLower
Aa0.6698970.8721370.8214290.3955010.1204180.2449820.7789860.0358900.446660
b0.4559740.6651270.6345200.5199620.6626630.8128090.9017540.5522200.994960
c0.2739360.4651710.6971020.7246230.5251770.0126090.5355890.0894620.650650
Ba0.6403970.8362090.3985170.8002380.2679540.2174490.5395150.0163170.507975
b0.5279490.9399560.9443040.9693450.7381030.0071590.2568610.8152550.449877
c0.7997220.6577900.5156530.8253000.3339650.4997310.8225230.9468900.950979
Ca0.9040040.8150880.0906960.1418770.1916940.9412010.1310530.5040410.815668
b0.1265030.6385470.7218000.1692640.7999740.5576500.4983080.9715330.892235
c0.2072260.0852620.0411910.4658510.7183530.8918530.1858690.0377840.625961
idx=pd.IndexSlice
idx
<pandas.core.indexing._IndexSlice at 0x2a41a8c1588>
索引Slice的使用非常灵活:
df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]]
#df_s.sum()默认为对列求和,因此返回一个长度为9的数值列表
BigDEF
Smallefdedf
UpperLower
Bc0.3448530.0293920.6469190.7605910.3603840.168119
Ca0.4720300.9878520.7478450.4665300.7640400.795936

4. 索引层的交换

(a)swaplevel方法(两层交换)
df_using_mul.head()
SchoolGenderHeightWeightMathPhysics
ClassAddress
C_1street_1S_1M1736334.0A+
street_2S_1F1927332.5B+
street_2S_1M1868287.2B+
street_2S_1F1678180.4B-
street_4S_1F1596484.8B+
## 交换 i j 级别序列 在特定的axis=0轴上 即列轴
df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()
SchoolGenderHeightWeightMathPhysics
AddressClass
street_1C_1S_1M1736334.0A+
C_2S_2M1757447.2B-
C_3S_1F1755787.7A-
street_2C_1S_1F1927332.5B+
C_1S_1M1868287.2B+
(b)reorder_levels方法(多层交换)
df_muls = df.set_index(['School','Class','Address'])
df_muls.head()
GenderHeightWeightMathPhysics
SchoolClassAddress
S_1C_1street_1M1736334.0A+
street_2F1927332.5B+
street_2M1868287.2B+
street_2F1678180.4B-
street_4F1596484.8B+
## 按指定顺序更改标签 
df_muls.reorder_levels([2,0,1],axis=0).sort_index().head()
GenderHeightWeightMathPhysics
AddressSchoolClass
street_1S_1C_1M1736334.0A+
C_3F1755787.7A-
S_2C_2M1757447.2B-
street_2S_1C_1F1927332.5B+
C_1M1868287.2B+
#如果索引有name,可以直接使用name
df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()
GenderHeightWeightMathPhysics
AddressSchoolClass
street_1S_1C_1M1736334.0A+
C_3F1755787.7A-
S_2C_2M1757447.2B-
street_2S_1C_1F1927332.5B+
C_1M1868287.2B+

三、索引设定

1. index_col参数

index_col是read_csv中的一个参数,而不是某一个方法:
#pd.read_csv('data/table.csv',index_col=['Address','School']).head()
## 使用默认顺序
pd.read_csv('data/table.csv').head()
SchoolClassIDGenderAddressHeightWeightMathPhysics
0S_1C_11101Mstreet_11736334.0A+
1S_1C_11102Fstreet_21927332.5B+
2S_1C_11103Mstreet_21868287.2B+
3S_1C_11104Fstreet_21678180.4B-
4S_1C_11105Fstreet_41596484.8B+

2. reindex和reindex_like

reindex是指重新索引,它的重要特性在于索引对齐,很多时候用于重新排序
df.head(10)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1201S_1C_2Mstreet_51886897.0A-
1202S_1C_2Fstreet_41769463.5B-
1203S_1C_2Mstreet_61605358.8A+
1204S_1C_2Fstreet_51626333.8B
1205S_1C_2Fstreet_61676368.4B-
df.reindex(index=[1101,1203,1206,2402])
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.063.034.0A+
1203S_1C_2Mstreet_6160.053.058.8A+
1206NaNNaNNaNNaNNaNNaNNaNNaN
2402S_2C_4Mstreet_7166.082.048.7B
df.reindex(columns=['Height','Gender','Average']).head()
HeightGenderAverage
ID
1101173MNaN
1102192FNaN
1103186MNaN
1104167FNaN
1105159FNaN
可以选择缺失值的填充方法:fill_value和method(bfill/ffill/nearest),其中method参数必须索引单调
  • bfill表示用所在索引1206的后一个有效行填充,ffill为前一个有效行,nearest是指最近的
#df.reindex(index=[1101,1203,1206,2402],method='nearest')
df.reindex(index=[1101,1203,1206,2402],method='bfill')
#df.reindex(index=[1101,1203,1206,2402],method='nearest')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1203S_1C_2Mstreet_61605358.8A+
1206S_1C_2Fstreet_61676368.4B-
2402S_2C_4Mstreet_71668248.7B
  • 数值上1205比1301更接近1206,因此用前者填充
df.reindex(index=[1101,1203,1206,2402],method='nearest')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1203S_1C_2Mstreet_61605358.8A+
1206S_1C_2Fstreet_61676368.4B-
2402S_2C_4Mstreet_71668248.7B
reindex_like的作用为生成一个横纵索引完全与参数列表一致的DataFrame,数据使用被调用的表
df_temp = pd.DataFrame({'Weight':np.zeros(5),
                        'Height':np.zeros(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
#df_temp
df_temp.reindex_like(df[0:5][['Weight','Height']])
WeightHeight
ID
11010.00.0
11020.00.0
11030.00.0
11040.00.0
1105NaNNaN
如果df_temp单调还可以使用method参数:
df_temp = pd.DataFrame({'Weight':range(5),
                        'Height':range(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
#df_temp
df_temp.reindex_like(df[0:5][['Weight','Height']],method='bfill')
#可以自行检验这里的1105的值是否是由bfill规则填充
WeightHeight
ID
110100
110244
110322
110411
110533

3. set_index和reset_index

先介绍set_index:从字面意思看,就是将某些列作为索引
使用表内列作为索引:
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
df.set_index('Class').head()
SchoolGenderAddressHeightWeightMathPhysics
Class
C_1S_1Mstreet_11736334.0A+
C_1S_1Fstreet_21927332.5B+
C_1S_1Mstreet_21868287.2B+
C_1S_1Fstreet_21678180.4B-
C_1S_1Fstreet_41596484.8B+
利用append参数可以将当前索引维持不变
df.set_index('Class',append=True).head()
SchoolGenderAddressHeightWeightMathPhysics
IDClass
1101C_1S_1Mstreet_11736334.0A+
1102C_1S_1Fstreet_21927332.5B+
1103C_1S_1Mstreet_21868287.2B+
1104C_1S_1Fstreet_21678180.4B-
1105C_1S_1Fstreet_41596484.8B+
当使用与表长相同的列作为索引(需要先转化为Series,否则报错):
df.set_index(pd.Series(range(df.shape[0]))).head()
SchoolClassGenderAddressHeightWeightMathPhysics
0S_1C_1Mstreet_11736334.0A+
1S_1C_1Fstreet_21927332.5B+
2S_1C_1Mstreet_21868287.2B+
3S_1C_1Fstreet_21678180.4B-
4S_1C_1Fstreet_41596484.8B+
可以直接添加多级索引:
df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))]).head()
SchoolClassGenderAddressHeightWeightMathPhysics
01.0S_1C_1Mstreet_11736334.0A+
11.0S_1C_1Fstreet_21927332.5B+
21.0S_1C_1Mstreet_21868287.2B+
31.0S_1C_1Fstreet_21678180.4B-
41.0S_1C_1Fstreet_41596484.8B+
下面介绍reset_index方法,它的主要功能是将索引重置
默认状态直接恢复到自然数索引:
df.reset_index().head()
IDSchoolClassGenderAddressHeightWeightMathPhysics
01101S_1C_1Mstreet_11736334.0A+
11102S_1C_1Fstreet_21927332.5B+
21103S_1C_1Mstreet_21868287.2B+
31104S_1C_1Fstreet_21678180.4B-
41105S_1C_1Fstreet_41596484.8B+
用level参数指定哪一层被reset,用col_level参数指定set到哪一层:
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_temp = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_temp.head()
BigDEF
Smalldefdefdef
UpperLower
Aa0.6396080.5110610.4397810.4364920.4731350.8583100.5387440.6379850.441126
b0.7404950.9422010.6819340.0875040.6004520.9934680.6111080.7369480.177016
c0.9655540.3017550.6596890.0174420.1108300.3010410.2051020.4715270.052899
Ba0.0576130.3036130.2422870.4354520.2898610.5143010.8592270.4775110.655530
b0.7466430.0918020.4501440.2556230.5839730.3980160.1275450.3133110.638950
df_temp1.columns
MultiIndex(levels=[['D', 'E', 'F', ''], ['d', 'e', 'f', 'Lower']],
           codes=[[3, 0, 0, 0, 1, 1, 1, 2, 2, 2], [3, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['Big', 'Small'])
df_temp1 = df_temp.reset_index(level=0,col_level=1)
df_temp1.head()
BigDEF
SmallUpperdefdefdef
Lower
aA0.6396080.5110610.4397810.4364920.4731350.8583100.5387440.6379850.441126
bA0.7404950.9422010.6819340.0875040.6004520.9934680.6111080.7369480.177016
cA0.9655540.3017550.6596890.0174420.1108300.3010410.2051020.4715270.052899
aB0.0576130.3036130.2422870.4354520.2898610.5143010.8592270.4775110.655530
bB0.7466430.0918020.4501440.2556230.5839730.3980160.1275450.3133110.638950
df_temp1.columns
#看到的确插入了level2
MultiIndex(levels=[['D', 'E', 'F', ''], ['d', 'e', 'f', 'Upper']],
           codes=[[3, 0, 0, 0, 1, 1, 1, 2, 2, 2], [3, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['Big', 'Small'])
df_temp1.index
#最内层索引被移出
Index(['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'], dtype='object', name='Lower')

4. rename_axis和rename

rename_axis是针对多级索引的方法,作用是修改某一层的索引名,而不是索引标签
df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})
BigBigDEF
Smalldefdefdef
UpperLowerLower
Aa0.0363000.1112970.5098190.3220650.1078460.3930020.9516950.3520450.055245
b0.5859760.8177810.5265120.5607630.6471260.8019220.5205110.7089810.774692
c0.8599070.8247120.6753480.2385580.8691720.9653630.8034850.5687710.734648
Ba0.1290400.2782340.9817280.9034230.6762400.3712910.8765710.3383530.021567
b0.2213860.7447650.0801380.0139360.6230340.3148590.5201990.7422330.834835
c0.9052520.0375120.3778490.2258820.7811820.2325110.7785180.5727870.858842
Ca0.6782890.3246380.1656280.7370360.5919360.3121730.3196890.2890720.954463
b0.6478610.5273870.5059450.4883350.6310820.6395390.7850940.0260730.881210
c0.9907220.6917150.6976170.1158310.1298020.9941520.1768410.4490530.145351
rename方法用于修改列或者行索引标签,而不是索引名:
df_temp.rename(index={'A':'T'},columns={'e':'changed_e'}).head()
BigDEF
Smalldchanged_efdchanged_efdchanged_ef
UpperLower
Ta0.0363000.1112970.5098190.3220650.1078460.3930020.9516950.3520450.055245
b0.5859760.8177810.5265120.5607630.6471260.8019220.5205110.7089810.774692
c0.8599070.8247120.6753480.2385580.8691720.9653630.8034850.5687710.734648
Ba0.1290400.2782340.9817280.9034230.6762400.3712910.8765710.3383530.021567
b0.2213860.7447650.0801380.0139360.6230340.3148590.5201990.7422330.834835

四、常用索引型函数

1. where函数

当对条件为False的单元进行填充:
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
  • 不满足条件的行全部被设置为NaN
  • 通过other来改变不符合条件的值
df.where(df['Gender']=='M',' ').head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334A+
1102
1103S_1C_1Mstreet_21868287.2B+
1104
1105
通过这种方法筛选结果和[]操作符的结果完全一致:
df.where(df['Gender']=='M').dropna().head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.063.034.0A+
1103S_1C_1Mstreet_2186.082.087.2B+
1201S_1C_2Mstreet_5188.068.097.0A-
1203S_1C_2Mstreet_6160.053.058.8A+
1301S_1C_3Mstreet_4161.068.031.5B+
第一个参数为布尔条件,第二个参数为填充值:
  • 用随机值替换NaN值
df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_1173.00000063.00000034.000000A+
11020.550920.3032160.8244310.7554270.4331190.9971730.3581610.699621
1103S_1C_1Mstreet_2186.00000082.00000087.200000B+
11040.3137530.4810690.3352570.6459730.5516900.5626170.6843990.312336
11050.07056010.0677620.8181580.3521170.2000730.0768100.8859330.142629

2. mask函数

mask函数与where功能上相反,其余完全一致,即对条件为True的单元进行填充
df.mask(df['Gender']=='M').dropna().head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_2192.073.032.5B+
1104S_1C_1Fstreet_2167.081.080.4B-
1105S_1C_1Fstreet_4159.064.084.8B+
1202S_1C_2Fstreet_4176.094.063.5B-
1204S_1C_2Fstreet_5162.063.033.8B
df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
11010.2739620.250280.5874710.9772060.4424030.3194600.4609910.842498
1102S_1C_1Fstreet_2192.00000073.00000032.500000B+
11030.4366740.7415240.469960.6886030.9382410.5318110.7943520.17495
1104S_1C_1Fstreet_2167.00000081.00000080.400000B-
1105S_1C_1Fstreet_4159.00000064.00000084.800000B+

3. query函数

df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
query函数中的布尔表达式中,下面的符号都是合法的:行列索引名、字符串、and/not/or/&/|/~/not in/in/==/!=、四则运算符
## 封装了类似于数据库查询的方法
df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1303S_1C_3Mstreet_71888249.7B
2304S_2C_3Fstreet_61648195.5A-
2402S_2C_4Mstreet_71668248.7B

五、重复元素处理

1. duplicated方法

该方法返回了是否重复的布尔列表
##默认为first
df.duplicated('Class').head(10)
ID
1101    False
1102     True
1103     True
1104     True
1105     True
1201    False
1202     True
1203     True
1204     True
1205     True
dtype: bool
df.duplicated('Class',keep='first').head(10)
ID
1101    False
1102     True
1103     True
1104     True
1105     True
1201    False
1202     True
1203     True
1204     True
1205     True
dtype: bool
可选参数keep默认为first,即首次出现设为不重复,若为last,则最后一次设为不重复,若为False,则所有重复项为False
df.duplicated('Class',keep='last').tail()
ID
2401     True
2402     True
2403     True
2404     True
2405    False
dtype: bool
df.duplicated('Class',keep=False).head()
ID
1101    True
1102    True
1103    True
1104    True
1105    True
dtype: bool

2. drop_duplicates方法

从名字上看出为剔除重复项,这在后面章节中的分组操作中可能是有用的,例如需要保留每组的第一个值:
df.drop_duplicates('Class')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1201S_1C_2Mstreet_51886897.0A-
1301S_1C_3Mstreet_41616831.5B+
2401S_2C_4Fstreet_21926245.3A
参数与duplicate函数类似:
df.drop_duplicates('Class',keep='last')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2105S_2C_1Mstreet_41708134.2A
2205S_2C_2Fstreet_71837685.4B
2305S_2C_3Mstreet_41877348.9B
2405S_2C_4Fstreet_61935447.6B
在传入多列时等价于将多列共同视作一个多级索引,比较重复项:
df.drop_duplicates(['School','Class'],keep = 'first',inplace=False)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1201S_1C_2Mstreet_51886897.0A-
1301S_1C_3Mstreet_41616831.5B+
2101S_2C_1Mstreet_71748483.3C
2201S_2C_2Mstreet_519310039.1B
2301S_2C_3Fstreet_41577872.3B+
2401S_2C_4Fstreet_21926245.3A

六、抽样函数

这里的抽样函数指的就是sample函数
(a)n为样本量
df.sample(n=5)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2103S_2C_1Mstreet_41576152.5B-
1102S_1C_1Fstreet_21927332.5B+
1301S_1C_3Mstreet_41616831.5B+
1304S_1C_3Mstreet_21957085.2A
1105S_1C_1Fstreet_41596484.8B+
(b)frac为抽样比
df.sample(frac=0.05)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1105S_1C_1Fstreet_41596484.8B+
2402S_2C_4Mstreet_71668248.7B
(c)replace为是否放回
df.sample(n=df.shape[0],replace=True).head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2403S_2C_4Fstreet_61586059.7B+
2404S_2C_4Fstreet_21608467.7B
2405S_2C_4Fstreet_61935447.6B
2303S_2C_3Fstreet_71909965.9C
1203S_1C_2Mstreet_61605358.8A+
df.sample(n=35,replace=True).index.is_unique
False
(d)axis为抽样维度,默认为0,即抽行
df.sample(n=3,axis=1).head()
AddressWeightSchool
ID
1101street_163S_1
1102street_273S_1
1103street_282S_1
1104street_281S_1
1105street_464S_1
(e)weights为样本权重,自动归一化
df.sample(n=3,weights=np.random.rand(df.shape[0])).head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1302S_1C_3Fstreet_11755787.7A-
1305S_1C_3Fstreet_51876961.7B-
2404S_2C_4Fstreet_21608467.7B
#以某一列为权重,这在抽样理论中很常见
df.sample(n=3,weights=df['Math']).head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2205S_2C_2Fstreet_71837685.4B
2404S_2C_4Fstreet_21608467.7B
1203S_1C_2Mstreet_61605358.8A+
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值