第2章索引

最新推荐文章于 2021-12-24 17:46:25 发布

数分虐我千百遍

最新推荐文章于 2021-12-24 17:46:25 发布

阅读量602

点赞数

分类专栏： python pandas

本文链接：https://blog.csdn.net/qq_45019698/article/details/105695618

版权

python 同时被 2 个专栏收录

20 篇文章 4 订阅

订阅专栏

pandas

8 篇文章 0 订阅

订阅专栏

第2章索引

文章目录

第2章索引

大致内容

在这里插入图片描述

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

一、单级索引

1. loc方法、iloc方法、[]操作符

最常用的索引方法可能就是这三类，其中iloc表示位置索引，loc表示标签索引，[]也具有很大的便利性，各有特点

（a）loc方法（注意：所有在loc中使用的切片全部包含右端点！）

① 单行索引：

返回对应行的Series


df.loc[1103]

School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object

② 多行索引：

返回1102和2304行组成DataFrame


df.loc[[1102,2304]]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+
2304	S_2	C_3	F	street_6	164	81	95.5	A-

切片访问
开头与结尾都取到重点差别


df.loc[1304:].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1304	S_1	C_3	M	street_2	195	70	85.2	A
1305	S_1	C_3	F	street_5	187	69	61.7	B-
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-

## 逆序切片
df.loc[2402::-1].head(60)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2402	S_2	C_4	M	street_7	166	82	48.7	B
2401	S_2	C_4	F	street_2	192	62	45.3	A
2305	S_2	C_3	M	street_4	187	73	48.9	B
2304	S_2	C_3	F	street_6	164	81	95.5	A-
2303	S_2	C_3	F	street_7	190	99	65.9	C
2302	S_2	C_3	M	street_5	171	88	32.7	A
2301	S_2	C_3	F	street_4	157	78	72.3	B+
2205	S_2	C_2	F	street_7	183	76	85.4	B
2204	S_2	C_2	M	street_1	175	74	47.2	B-
2203	S_2	C_2	M	street_4	155	91	73.8	A+
2202	S_2	C_2	F	street_7	194	77	68.5	B+
2201	S_2	C_2	M	street_5	193	100	39.1	B
2105	S_2	C_1	M	street_4	170	81	34.2	A
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2101	S_2	C_1	M	street_7	174	84	83.3	C
1305	S_1	C_3	F	street_5	187	69	61.7	B-
1304	S_1	C_3	M	street_2	195	70	85.2	A
1303	S_1	C_3	M	street_7	188	82	49.7	B
1302	S_1	C_3	F	street_1	175	57	87.7	A-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1205	S_1	C_2	F	street_6	167	63	68.4	B-
1204	S_1	C_2	F	street_5	162	63	33.8	B
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1101	S_1	C_1	M	street_1	173	63	34.0	A+

③ 单列索引：

Series类型

## 对于全部行取Height列
df.loc[:,'Height'].head()
#type(df.loc[:,'Height'])

ID
1101    173
1102    192
1103    186
1104    167
1105    159
Name: Height, dtype: int64

④ 多列索引：

## 注意和下面例子区别 这里取得是Height列和Math列 组成DataFrame
df.loc[:,['Height','Math']].head()
#type(df.loc[:,['Height','Math']])

	Height	Math
ID
1101	173	34.0
1102	192	32.5
1103	186	87.2
1104	167	80.4
1105	159	84.8

这里取得是切片从Height列到Math列组成DataFrame

## 包括左右两端
df.loc[:,'Height':'Math'].head()

	Height	Weight	Math
ID
1101	173	63	34.0
1102	192	73	32.5
1103	186	82	87.2
1104	167	81	80.4
1105	159	64	84.8

⑤ 联合索引：

对于行列同时切片 [start:end:step]


df.loc[1102:2401:3,'Height':'Math'].head()

	Height	Weight	Math
ID
1102	192	73	32.5
1105	159	64	84.8
1203	160	53	58.8
1301	161	68	31.5
1304	195	70	85.2

标签值索引与位置索引区别

s = pd.Series([1,2,3],index=[1.1,2.2,3.3])

## 标签值索引
# s.loc[1:2:1]
# s.loc[1:2.2:1]
#s = pd.Series([1,2,3],index=[1.1,3.2,5.3])
## 报错 标签值不存在s[1]
#s[1.1]
## 标签值切片
#s[1:]
## 位置索引 
s = pd.Series([1,2,3],index=[1,3,5])
#s[2:]
#s[3]
## 总的来说 对于整数index讲 单个索引访问按元素值索引 切片访问按位置索引
## 对于浮点数index讲 单个浮点数索引按元素值索引 切片访问按元素值索引

⑥ 函数式索引：

loc中使用的函数操作的对象是前面的DataFrame

## 可以通过内置函数来查找符合条件的值
df.loc[lambda x:x['Gender']=='M'].head(10)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1303	S_1	C_3	M	street_7	188	82	49.7	B
1304	S_1	C_3	M	street_2	195	70	85.2	A
2101	S_2	C_1	M	street_7	174	84	83.3	C
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2105	S_2	C_1	M	street_4	170	81	34.2	A

isin 确定某列值在某个范围中

# def f(x):
#     return [1101,1103]
# df.loc[f]
# df.loc[[1101,1103]]
## 
# df.loc[df.index.isin([1101,1103])]
## 列表表达式
df.loc[[x for x in [1101,1103]]]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1103	S_1	C_1	M	street_2	186	82	87.2	B+

⑦ 布尔索引（将重点在第2节介绍）

df.loc[df['Address'].isin(['street_7','street_4'])].head()
# df.loc[lambda x:x['Address'].isin(['street_7','street_4'])].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1303	S_1	C_3	M	street_7	188	82	49.7	B
2101	S_2	C_1	M	street_7	174	84	83.3	C

df['Address'].values

array(['street_1', 'street_2', 'street_2', 'street_2', 'street_4',
       'street_5', 'street_4', 'street_6', 'street_5', 'street_6',
       'street_4', 'street_1', 'street_7', 'street_2', 'street_5',
       'street_7', 'street_6', 'street_4', 'street_5', 'street_4',
       'street_5', 'street_7', 'street_4', 'street_1', 'street_7',
       'street_4', 'street_5', 'street_7', 'street_6', 'street_4',
       'street_2', 'street_7', 'street_6', 'street_2', 'street_6'],
      dtype=object)

df.loc[[True if i[-1]=='4' or i[-1]=='7' else False for i in df['Address'].values]].head()
## df[df['Address'].str.contains('4|7')]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1303	S_1	C_3	M	street_7	188	82	49.7	B
2101	S_2	C_1	M	street_7	174	84	83.3	C
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2105	S_2	C_1	M	street_4	170	81	34.2	A
2202	S_2	C_2	F	street_7	194	77	68.5	B+
2203	S_2	C_2	M	street_4	155	91	73.8	A+
2205	S_2	C_2	F	street_7	183	76	85.4	B
2301	S_2	C_3	F	street_4	157	78	72.3	B+
2303	S_2	C_3	F	street_7	190	99	65.9	C
2305	S_2	C_3	M	street_4	187	73	48.9	B
2402	S_2	C_4	M	street_7	166	82	48.7	B

提问的问题如何把缺失值替换为前后值的平均值

## 
#s = pd.Series([1,2,1,np.nan,7,4,3,np.nan,7,4,5])
#s[s.isna()] = (s[s[s.isna()].index+1].values + s[s[s.isna()].index-1].values)/2

小节：本质上说，loc中能传入的只有布尔列表和索引子集构成的列表，只要把握这个原则就很容易理解上面那些操作

（b）iloc方法（注意与loc不同，切片右端点不包含）

① 单行索引：

df.iloc[3]

School          S_1
Class           C_1
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 1104, dtype: object

#df.loc[1102:,'Height']
#df.iloc[1:]['Height']
#df.iloc[1:,4]
#df[1:]['Height']

ID
1102    192
1103    186
1104    167
1105    159
1201    188
1202    176
1203    160
1204    162
1205    167
1301    161
1302    175
1303    188
1304    195
1305    187
2101    174
2102    161
2103    157
2104    159
2105    170
2201    193
2202    194
2203    155
2204    175
2205    183
2301    157
2302    171
2303    190
2304    164
2305    187
2401    192
2402    166
2403    158
2404    160
2405    193
Name: Height, dtype: int64

② 多行索引：

df.iloc[:5]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

③ 单列索引：

df.iloc[:,3].head()

ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
Name: Address, dtype: object

④ 多列索引：

对于全部行从第七列每次减少两列

df.iloc[:,7::-2].head(10)

	Physics	Weight	Address	Class
ID
1101	A+	63	street_1	C_1
1102	B+	73	street_2	C_1
1103	B+	82	street_2	C_1
1104	B-	81	street_2	C_1
1105	B+	64	street_4	C_1
1201	A-	68	street_5	C_2
1202	B-	94	street_4	C_2
1203	A+	53	street_6	C_2
1204	B	63	street_5	C_2
1205	B-	63	street_6	C_2

⑤ 混合索引：

## 每次隔三行 每次隔一列倒数
df.iloc[3::4,7::-2].head()

	Physics	Weight	Address	Class
ID
1104	B-	81	street_2	C_1
1203	A+	53	street_6	C_2
1302	A-	57	street_1	C_3
2101	C	84	street_7	C_1
2105	A	81	street_4	C_1

⑥ 函数式索引：

默认值取3 即取第三行

df.iloc[lambda x:[3]].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1104	S_1	C_1	F	street_2	167	81	80.4	B-

小节：由上所述，iloc中接收的参数只能为整数或整数列表，不能使用布尔索引纯元素Series布尔索引报错将其转为list列表就可以或者在后边加上values就可

（c） []操作符

如果不想陷入困境，请不要在行索引为浮点时使用[]操作符，因为在Series中的浮点[]并不是进行位置比较，而是值比较，非常特殊

（c.1）Series的[]操作

① 单元素索引：

s = pd.Series(df['Math'],index=df.index)
s[1101]
#使用的是索引标签

34.0

② 多行索引：

使用的是绝对位置的整数切片，与元素无关，这里容易混淆

s[0:4]

ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64

③ 函数式索引：

注意使用lambda函数时，直接切片(如：s[lambda x: 16::-6])就报错，此时使用的不是绝对位置切片，而是元素切片，非常易错

s[lambda x: x.index[16::-6]]
## cannot do slice indexing on with these indexers
#s[lambda x:16::-6]

ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64

④ 布尔索引：

s[s>80]

ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64

（c.2）DataFrame的[]操作

① 单行索引：

同Series使用了绝对位置切片

df[1:2]
#这里非常容易写成df['label']，会报错
#如果想要获得某一个元素，可用如下get_loc方法：

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+

row = df.index.get_loc(1102)
df[row:row+1]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+

② 多行索引：

用切片，如果是选取指定的某几行，推荐使用loc，否则很可能报错

df[3:5]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

③ 单列索引：

df['School'].head()

ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object

④ 多列索引：

df[['School','Math']].head()

	School	Math
ID
1101	S_1	34.0
1102	S_1	32.5
1103	S_1	87.2
1104	S_1	80.4
1105	S_1	84.8

⑤函数式索引：

df[lambda x:['Math','Physics']].head()

	Math	Physics
ID
1101	34.0	A+
1102	32.5	B+
1103	87.2	B+
1104	80.4	B-
1105	84.8	B+

⑥ 布尔索引：

#print(type(df['Gender']=='F'))
df[df['Gender']=='F'].head()

<class 'pandas.core.series.Series'>

小节：一般来说，[]操作符常用于列选择或布尔选择，尽量避免行的选择

2. 布尔索引

（a）布尔符号：’&’,’|’,’~’：分别代表和and，或or，取反not

df[(df['Gender']=='F')&(df['Address']=='street_2')].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
2401	S_2	C_4	F	street_2	192	62	45.3	A
2404	S_2	C_4	F	street_2	160	84	67.7	B

df[(df['Math']>85)|(df['Address']=='street_7')].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1302	S_1	C_3	F	street_1	175	57	87.7	A-
1303	S_1	C_3	M	street_7	188	82	49.7	B
1304	S_1	C_3	M	street_2	195	70	85.2	A

df[~((df['Math']>75)|(df['Address']=='street_1'))].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1204	S_1	C_2	F	street_5	162	63	33.8	B
1205	S_1	C_2	F	street_6	167	63	68.4	B-

loc和[]中相应位置都能使用布尔列表选择：

df[:8]['Address']=='street_6'

ID
1101    False
1102    False
1103    False
1104    False
1105    False
1201    False
1202    False
1203     True
Name: Address, dtype: bool

(df[:8]['Address']=='street_6').values

array([False, False, False, False, False, False, False,  True])

如果不加values就会索引对齐发生错误，Pandas中的索引对齐是一个重要特征，很多时候非常使用

## 选择 Math分数大于60的行，选择最后一列其他列为False
# df.loc[:,(df[:8]['Address']=='street_6').values].head()
# df.loc[df['Math']>60,:].head()
df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head()
#但是若不加以留意，就会埋下隐患
## 这里我的理解是如果不加values是代之前八行的布尔Series 而对应的位置在列上产生矛盾
## Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

	Physics
ID
1103	B+
1104	B-
1105	B+
1201	A-
1202	B-

（b） isin方法

df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
2105	S_2	C_1	M	street_4	170	81	34.2	A
2203	S_2	C_2	M	street_4	155	91	73.8	A+

all与&的思路是类似的，其中的1代表按照跨列方向判断是否全为True

#上面也可以用字典方式写：
df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(1)]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
2105	S_2	C_1	M	street_4	170	81	34.2	A
2203	S_2	C_2	M	street_4	155	91	73.8	A+

3. 快速标量索引

当只需要取一个元素时，at和iat方法能够提供更快的实现：

# 访问行/列标签对的单个值。
## 必须输入两个参数
display(df.at[1101,'School'])
df.at[1101,'School']
# display(df.loc[1101,'School'])
# # 通过整数位置访问行/列对的单个值
# display(df.iat[0,0])
# display(df.iloc[0,0])
# #可尝试去掉注释对比时间
# %timeit df.at[1101,'School']
# %timeit df.loc[1101,'School']
# %timeit df.iat[0,0]
# %timeit df.iloc[0,0]

'S_1'
'S_1'

4. 区间索引

此处介绍并不是说只能在单级索引中使用区间索引，只是作为一种特殊类型的索引方式，在此处先行介绍

（a）利用interval_range方法

pd.interval_range(start=0,end=5)
#closed参数可选'left''right''both''neither'，默认左开右闭

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')

pd.interval_range(start=0,periods=10,freq=5)
#periods参数控制区间个数，freq控制步长

IntervalIndex([(0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30], (30, 35], (35, 40], (40, 45], (45, 50]],
              closed='right',
              dtype='interval[int64]')

（b）利用cut将数值列转为区间为元素的分类变量，例如统计数学成绩的区间情况：

math_interval = pd.cut(df['Math'],bins=[0,40,60,80,100])
#注意，如果没有类型转换，此时并不是区间类型，而是category类型
math_interval.head()

ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]

df.join(math_interval,rsuffix='_interval')

	School	Class	Gender	Address	Height	Weight	Math	Physics	Math_interval
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+	(0, 40]
1102	S_1	C_1	F	street_2	192	73	32.5	B+	(0, 40]
1103	S_1	C_1	M	street_2	186	82	87.2	B+	(80, 100]
1104	S_1	C_1	F	street_2	167	81	80.4	B-	(80, 100]
1105	S_1	C_1	F	street_4	159	64	84.8	B+	(80, 100]
1201	S_1	C_2	M	street_5	188	68	97.0	A-	(80, 100]
1202	S_1	C_2	F	street_4	176	94	63.5	B-	(60, 80]
1203	S_1	C_2	M	street_6	160	53	58.8	A+	(40, 60]
1204	S_1	C_2	F	street_5	162	63	33.8	B	(0, 40]
1205	S_1	C_2	F	street_6	167	63	68.4	B-	(60, 80]
1301	S_1	C_3	M	street_4	161	68	31.5	B+	(0, 40]
1302	S_1	C_3	F	street_1	175	57	87.7	A-	(80, 100]
1303	S_1	C_3	M	street_7	188	82	49.7	B	(40, 60]
1304	S_1	C_3	M	street_2	195	70	85.2	A	(80, 100]
1305	S_1	C_3	F	street_5	187	69	61.7	B-	(60, 80]
2101	S_2	C_1	M	street_7	174	84	83.3	C	(80, 100]
2102	S_2	C_1	F	street_6	161	61	50.6	B+	(40, 60]
2103	S_2	C_1	M	street_4	157	61	52.5	B-	(40, 60]
2104	S_2	C_1	F	street_5	159	97	72.2	B+	(60, 80]
2105	S_2	C_1	M	street_4	170	81	34.2	A	(0, 40]
2201	S_2	C_2	M	street_5	193	100	39.1	B	(0, 40]
2202	S_2	C_2	F	street_7	194	77	68.5	B+	(60, 80]
2203	S_2	C_2	M	street_4	155	91	73.8	A+	(60, 80]
2204	S_2	C_2	M	street_1	175	74	47.2	B-	(40, 60]
2205	S_2	C_2	F	street_7	183	76	85.4	B	(80, 100]
2301	S_2	C_3	F	street_4	157	78	72.3	B+	(60, 80]
2302	S_2	C_3	M	street_5	171	88	32.7	A	(0, 40]
2303	S_2	C_3	F	street_7	190	99	65.9	C	(60, 80]
2304	S_2	C_3	F	street_6	164	81	95.5	A-	(80, 100]
2305	S_2	C_3	M	street_4	187	73	48.9	B	(40, 60]
2401	S_2	C_4	F	street_2	192	62	45.3	A	(40, 60]
2402	S_2	C_4	M	street_7	166	82	48.7	B	(40, 60]
2403	S_2	C_4	F	street_6	158	60	59.7	B+	(40, 60]
2404	S_2	C_4	F	street_2	160	84	67.7	B	(60, 80]
2405	S_2	C_4	F	street_6	193	54	47.6	B	(40, 60]

df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]

	Math	Math_interval
ID
1101	34.0	(0, 40]
1102	32.5	(0, 40]
1103	87.2	(80, 100]
1104	80.4	(80, 100]
1105	84.8	(80, 100]
1201	97.0	(80, 100]
1202	63.5	(60, 80]
1203	58.8	(40, 60]
1204	33.8	(0, 40]
1205	68.4	(60, 80]
1301	31.5	(0, 40]
1302	87.7	(80, 100]
1303	49.7	(40, 60]
1304	85.2	(80, 100]
1305	61.7	(60, 80]
2101	83.3	(80, 100]
2102	50.6	(40, 60]
2103	52.5	(40, 60]
2104	72.2	(60, 80]
2105	34.2	(0, 40]
2201	39.1	(0, 40]
2202	68.5	(60, 80]
2203	73.8	(60, 80]
2204	47.2	(40, 60]
2205	85.4	(80, 100]
2301	72.3	(60, 80]
2302	32.7	(0, 40]
2303	65.9	(60, 80]
2304	95.5	(80, 100]
2305	48.9	(40, 60]
2401	45.3	(40, 60]
2402	48.7	(40, 60]
2403	59.7	(40, 60]
2404	67.7	(60, 80]
2405	47.6	(40, 60]

（c）区间索引的选取

## rsuffix Suffix to use from right frame’s overlapping columns. 规定表合成时使用的后缀
## 使用调用框架的索引默认使用左边
## 选取 Math  Math_interval列
## 重置索引为Math_interval
df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
            .reset_index().set_index('Math_interval')
df_i.head()

	ID	Math
Math_interval
(0, 40]	1101	34.0
(0, 40]	1102	32.5
(80, 100]	1103	87.2
(80, 100]	1104	80.4
(80, 100]	1105	84.8

df_i.loc[65].head()
#包含该值就会被选中

	ID	Math
Math_interval
(60, 80]	1202	63.5
(60, 80]	1205	68.4
(60, 80]	1305	61.7
(60, 80]	2104	72.2
(60, 80]	2202	68.5

df_i.loc[[65,90]].head()

	ID	Math
Math_interval
(60, 80]	1202	63.5
(60, 80]	1205	68.4
(60, 80]	1305	61.7
(60, 80]	2104	72.2
(60, 80]	2202	68.5

df_i.index.astype('interval').overlaps(pd.Interval(70, 85))

array([False, False,  True,  True,  True,  True,  True, False, False,
        True, False,  True, False,  True,  True,  True, False, False,
        True, False, False,  True,  True, False,  True,  True, False,
        True,  True, False, False, False, False,  True, False])

如果想要选取某个区间，先要把分类变量转为区间变量，再使用overlap方法：

overlaps Check whether two Interval objects overlap.如果重叠返回true

#df_i.loc[pd.Interval(70,75)].head() 报错
## astype将dtype变换为interval
## 获得到布尔列表
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()

	ID	Math
Math_interval
(80, 100]	1103	87.2
(80, 100]	1104	80.4
(80, 100]	1105	84.8
(80, 100]	1201	97.0
(60, 80]	1202	63.5

二、多级索引

1. 创建多级索引

（a）通过from_tuple或from_arrays

① 直接创建元组

tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
#type(mul_index)
mul_index

pandas.core.indexes.multi.MultiIndex

pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

		Score
Upper	Lower
A	a	perfect
A	b	good
B	a	fair
B	b	bad

L1 = list('AABB')
L2 = list('abab')
#type(zip(L1,L2)) 按列压缩
list(zip(L1,L2))

[('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')]

② 利用zip创建元组

L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1,L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

		Score
Upper	Lower
A	a	perfect
A	b	good
B	a	fair
B	b	bad

③ 通过Array创建

arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

		Score
Upper	Lower
A	a	perfect
A	b	good
B	a	fair
B	b	bad

由此看出内部自动转成元组

mul_index

MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])

（b）通过from_product

## 根据多个可迭代项的笛卡尔积建立MultiIndex。
L1 = ['A','B']
L2 = ['a','b']
pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
#两两相乘 笛卡尔乘积

MultiIndex(levels=[['A', 'B'], ['a', 'b']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['Upper', 'Lower'])

（c）指定df中的列创建（set_index方法）

df_using_mul = df.set_index(['Class','Address'])
df_using_mul.head()

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_1	street_1	S_1	M	173	63	34.0	A+
	street_2	S_1	F	192	73	32.5	B+
	street_2	S_1	M	186	82	87.2	B+
	street_2	S_1	F	167	81	80.4	B-
	street_4	S_1	F	159	64	84.8	B+

2. 多层索引切片

df_using_mul.head()

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_1	street_1	S_1	M	173	63	34.0	A+
	street_2	S_1	F	192	73	32.5	B+
	street_2	S_1	M	186	82	87.2	B+
	street_2	S_1	F	167	81	80.4	B-
	street_4	S_1	F	159	64	84.8	B+

（a）一般切片

当索引不排序时，单个索引会报出性能警告
is_lexsorted()该函数检查是否排序

# df_using_mul.loc['C_2','street_5']
#df_using_mul.index.is_lexsorted()
df_using_mul.sort_index().loc['C_2','street_5']
#df_using_mul.sort_index().index.is_lexsorted()

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_2	street_5	S_1	M	188	68	97.0	A-
	street_5	S_1	F	162	63	33.8	B
	street_5	S_2	M	193	100	39.1	B

当不排序时，不能使用多层切片

#df_using_mul.loc[('C_2','street_5'):] 报错
df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
#注意此处由于使用了loc，因此仍然包含右端点

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_2	street_6	S_1	M	160	53	58.8	A+
	street_6	S_1	F	167	63	68.4	B-
	street_7	S_2	F	194	77	68.5	B+
	street_7	S_2	F	183	76	85.4	B
C_3	street_1	S_1	F	175	57	87.7	A-
	street_2	S_1	M	195	70	85.2	A
	street_4	S_1	M	161	68	31.5	B+
	street_4	S_2	F	157	78	72.3	B+
	street_4	S_2	M	187	73	48.9	B

非元组也是合法的，表示选中该层所有元素

df_using_mul.sort_index().loc[('C_2','street_7'):'C_3'].head()

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_2	street_7	S_2	F	194	77	68.5	B+
C_2	street_7	S_2	F	183	76	85.4	B
C_3	street_1	S_1	F	175	57	87.7	A-
	street_2	S_1	M	195	70	85.2	A
	street_4	S_1	M	161	68	31.5	B+

（b）第一类特殊情况：由元组构成列表

df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
#表示选出某几个元素，精确到最内层索引

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_2	street_7	S_2	F	194	77	68.5	B+
C_2	street_7	S_2	F	183	76	85.4	B
C_3	street_2	S_1	M	195	70	85.2	A

（c）第二类特殊情况：由列表构成元组

df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
#选出第一层在‘C_2’和'C_3'中且第二层在'street_4'和'street_7'中的行

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_2	street_4	S_1	F	176	94	63.5	B-
	street_4	S_2	M	155	91	73.8	A+
	street_7	S_2	F	194	77	68.5	B+
	street_7	S_2	F	183	76	85.4	B
C_3	street_4	S_1	M	161	68	31.5	B+
	street_4	S_2	F	157	78	72.3	B+
	street_4	S_2	M	187	73	48.9	B
	street_7	S_1	M	188	82	49.7	B
	street_7	S_2	F	190	99	65.9	C

3. 多层索引中的slice对象

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_s

	Big	D			E			F
	Small	d	e	f	d	e	f	d	e	f
Upper	Lower
A	a	0.669897	0.872137	0.821429	0.395501	0.120418	0.244982	0.778986	0.035890	0.446660
	b	0.455974	0.665127	0.634520	0.519962	0.662663	0.812809	0.901754	0.552220	0.994960
	c	0.273936	0.465171	0.697102	0.724623	0.525177	0.012609	0.535589	0.089462	0.650650
B	a	0.640397	0.836209	0.398517	0.800238	0.267954	0.217449	0.539515	0.016317	0.507975
	b	0.527949	0.939956	0.944304	0.969345	0.738103	0.007159	0.256861	0.815255	0.449877
	c	0.799722	0.657790	0.515653	0.825300	0.333965	0.499731	0.822523	0.946890	0.950979
C	a	0.904004	0.815088	0.090696	0.141877	0.191694	0.941201	0.131053	0.504041	0.815668
	b	0.126503	0.638547	0.721800	0.169264	0.799974	0.557650	0.498308	0.971533	0.892235
	c	0.207226	0.085262	0.041191	0.465851	0.718353	0.891853	0.185869	0.037784	0.625961

idx=pd.IndexSlice
idx

<pandas.core.indexing._IndexSlice at 0x2a41a8c1588>

索引Slice的使用非常灵活：

df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]]
#df_s.sum()默认为对列求和，因此返回一个长度为9的数值列表

	Big	D		E		F
	Small	e	f	d	e	d	f
Upper	Lower
B	c	0.344853	0.029392	0.646919	0.760591	0.360384	0.168119
C	a	0.472030	0.987852	0.747845	0.466530	0.764040	0.795936

4. 索引层的交换

（a）swaplevel方法（两层交换）

df_using_mul.head()

		School	Gender	Height	Weight	Math	Physics
Class	Address
C_1	street_1	S_1	M	173	63	34.0	A+
	street_2	S_1	F	192	73	32.5	B+
	street_2	S_1	M	186	82	87.2	B+
	street_2	S_1	F	167	81	80.4	B-
	street_4	S_1	F	159	64	84.8	B+

## 交换 i j 级别序列 在特定的axis=0轴上 即列轴
df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()

		School	Gender	Height	Weight	Math	Physics
Address	Class
street_1	C_1	S_1	M	173	63	34.0	A+
	C_2	S_2	M	175	74	47.2	B-
	C_3	S_1	F	175	57	87.7	A-
street_2	C_1	S_1	F	192	73	32.5	B+
street_2	C_1	S_1	M	186	82	87.2	B+

（b）reorder_levels方法（多层交换）

df_muls = df.set_index(['School','Class','Address'])
df_muls.head()

			Gender	Height	Weight	Math	Physics
School	Class	Address
S_1	C_1	street_1	M	173	63	34.0	A+
		street_2	F	192	73	32.5	B+
		street_2	M	186	82	87.2	B+
		street_2	F	167	81	80.4	B-
		street_4	F	159	64	84.8	B+

## 按指定顺序更改标签 
df_muls.reorder_levels([2,0,1],axis=0).sort_index().head()

			Gender	Height	Weight	Math	Physics
Address	School	Class
street_1	S_1	C_1	M	173	63	34.0	A+
	S_1	C_3	F	175	57	87.7	A-
	S_2	C_2	M	175	74	47.2	B-
street_2	S_1	C_1	F	192	73	32.5	B+
street_2	S_1	C_1	M	186	82	87.2	B+

#如果索引有name，可以直接使用name
df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()

			Gender	Height	Weight	Math	Physics
Address	School	Class
street_1	S_1	C_1	M	173	63	34.0	A+
	S_1	C_3	F	175	57	87.7	A-
	S_2	C_2	M	175	74	47.2	B-
street_2	S_1	C_1	F	192	73	32.5	B+
street_2	S_1	C_1	M	186	82	87.2	B+

三、索引设定

1. index_col参数

index_col是read_csv中的一个参数，而不是某一个方法：

#pd.read_csv('data/table.csv',index_col=['Address','School']).head()
## 使用默认顺序
pd.read_csv('data/table.csv').head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

2. reindex和reindex_like

reindex是指重新索引，它的重要特性在于索引对齐，很多时候用于重新排序

df.head(10)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1204	S_1	C_2	F	street_5	162	63	33.8	B
1205	S_1	C_2	F	street_6	167	63	68.4	B-

df.reindex(index=[1101,1203,1206,2402])

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173.0	63.0	34.0	A+
1203	S_1	C_2	M	street_6	160.0	53.0	58.8	A+
1206	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2402	S_2	C_4	M	street_7	166.0	82.0	48.7	B

df.reindex(columns=['Height','Gender','Average']).head()

	Height	Gender	Average
ID
1101	173	M	NaN
1102	192	F	NaN
1103	186	M	NaN
1104	167	F	NaN
1105	159	F	NaN

可以选择缺失值的填充方法：fill_value和method（bfill/ffill/nearest），其中method参数必须索引单调

bfill表示用所在索引1206的后一个有效行填充，ffill为前一个有效行，nearest是指最近的

#df.reindex(index=[1101,1203,1206,2402],method='nearest')
df.reindex(index=[1101,1203,1206,2402],method='bfill')
#df.reindex(index=[1101,1203,1206,2402],method='nearest')

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1206	S_1	C_2	F	street_6	167	63	68.4	B-
2402	S_2	C_4	M	street_7	166	82	48.7	B

数值上1205比1301更接近1206，因此用前者填充

df.reindex(index=[1101,1203,1206,2402],method='nearest')

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1206	S_1	C_2	F	street_6	167	63	68.4	B-
2402	S_2	C_4	M	street_7	166	82	48.7	B

reindex_like的作用为生成一个横纵索引完全与参数列表一致的DataFrame，数据使用被调用的表

df_temp = pd.DataFrame({'Weight':np.zeros(5),
                        'Height':np.zeros(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
#df_temp
df_temp.reindex_like(df[0:5][['Weight','Height']])

	Weight	Height
ID
1101	0.0	0.0
1102	0.0	0.0
1103	0.0	0.0
1104	0.0	0.0
1105	NaN	NaN

如果df_temp单调还可以使用method参数：

df_temp = pd.DataFrame({'Weight':range(5),
                        'Height':range(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
#df_temp
df_temp.reindex_like(df[0:5][['Weight','Height']],method='bfill')
#可以自行检验这里的1105的值是否是由bfill规则填充

	Weight	Height
ID
1101	0	0
1102	4	4
1103	2	2
1104	1	1
1105	3	3

3. set_index和reset_index

先介绍set_index：从字面意思看，就是将某些列作为索引

使用表内列作为索引：

df.head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

df.set_index('Class').head()

	School	Gender	Address	Height	Weight	Math	Physics
Class
C_1	S_1	M	street_1	173	63	34.0	A+
C_1	S_1	F	street_2	192	73	32.5	B+
C_1	S_1	M	street_2	186	82	87.2	B+
C_1	S_1	F	street_2	167	81	80.4	B-
C_1	S_1	F	street_4	159	64	84.8	B+

利用append参数可以将当前索引维持不变

df.set_index('Class',append=True).head()

		School	Gender	Address	Height	Weight	Math	Physics
ID	Class
1101	C_1	S_1	M	street_1	173	63	34.0	A+
1102	C_1	S_1	F	street_2	192	73	32.5	B+
1103	C_1	S_1	M	street_2	186	82	87.2	B+
1104	C_1	S_1	F	street_2	167	81	80.4	B-
1105	C_1	S_1	F	street_4	159	64	84.8	B+

当使用与表长相同的列作为索引（需要先转化为Series，否则报错）：

df.set_index(pd.Series(range(df.shape[0]))).head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	M	street_1	173	63	34.0	A+
1	S_1	C_1	F	street_2	192	73	32.5	B+
2	S_1	C_1	M	street_2	186	82	87.2	B+
3	S_1	C_1	F	street_2	167	81	80.4	B-
4	S_1	C_1	F	street_4	159	64	84.8	B+

可以直接添加多级索引：

df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))]).head()

		School	Class	Gender	Address	Height	Weight	Math	Physics
0	1.0	S_1	C_1	M	street_1	173	63	34.0	A+
1	1.0	S_1	C_1	F	street_2	192	73	32.5	B+
2	1.0	S_1	C_1	M	street_2	186	82	87.2	B+
3	1.0	S_1	C_1	F	street_2	167	81	80.4	B-
4	1.0	S_1	C_1	F	street_4	159	64	84.8	B+

下面介绍reset_index方法，它的主要功能是将索引重置

默认状态直接恢复到自然数索引：

df.reset_index().head()

	ID	School	Class	Gender	Address	Height	Weight	Math	Physics
0	1101	S_1	C_1	M	street_1	173	63	34.0	A+
1	1102	S_1	C_1	F	street_2	192	73	32.5	B+
2	1103	S_1	C_1	M	street_2	186	82	87.2	B+
3	1104	S_1	C_1	F	street_2	167	81	80.4	B-
4	1105	S_1	C_1	F	street_4	159	64	84.8	B+

用level参数指定哪一层被reset，用col_level参数指定set到哪一层：

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_temp = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_temp.head()

	Big	D			E			F
	Small	d	e	f	d	e	f	d	e	f
Upper	Lower
A	a	0.639608	0.511061	0.439781	0.436492	0.473135	0.858310	0.538744	0.637985	0.441126
	b	0.740495	0.942201	0.681934	0.087504	0.600452	0.993468	0.611108	0.736948	0.177016
	c	0.965554	0.301755	0.659689	0.017442	0.110830	0.301041	0.205102	0.471527	0.052899
B	a	0.057613	0.303613	0.242287	0.435452	0.289861	0.514301	0.859227	0.477511	0.655530
B	b	0.746643	0.091802	0.450144	0.255623	0.583973	0.398016	0.127545	0.313311	0.638950

df_temp1.columns

MultiIndex(levels=[['D', 'E', 'F', ''], ['d', 'e', 'f', 'Lower']],
           codes=[[3, 0, 0, 0, 1, 1, 1, 2, 2, 2], [3, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['Big', 'Small'])

df_temp1 = df_temp.reset_index(level=0,col_level=1)
df_temp1.head()

Big		D			E			F
Small	Upper	d	e	f	d	e	f	d	e	f
Lower
a	A	0.639608	0.511061	0.439781	0.436492	0.473135	0.858310	0.538744	0.637985	0.441126
b	A	0.740495	0.942201	0.681934	0.087504	0.600452	0.993468	0.611108	0.736948	0.177016
c	A	0.965554	0.301755	0.659689	0.017442	0.110830	0.301041	0.205102	0.471527	0.052899
a	B	0.057613	0.303613	0.242287	0.435452	0.289861	0.514301	0.859227	0.477511	0.655530
b	B	0.746643	0.091802	0.450144	0.255623	0.583973	0.398016	0.127545	0.313311	0.638950

df_temp1.columns
#看到的确插入了level2

MultiIndex(levels=[['D', 'E', 'F', ''], ['d', 'e', 'f', 'Upper']],
           codes=[[3, 0, 0, 0, 1, 1, 1, 2, 2, 2], [3, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['Big', 'Small'])

df_temp1.index
#最内层索引被移出

Index(['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'], dtype='object', name='Lower')

4. rename_axis和rename

rename_axis是针对多级索引的方法，作用是修改某一层的索引名，而不是索引标签

df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})

	BigBig	D			E			F
	Small	d	e	f	d	e	f	d	e	f
Upper	LowerLower
A	a	0.036300	0.111297	0.509819	0.322065	0.107846	0.393002	0.951695	0.352045	0.055245
	b	0.585976	0.817781	0.526512	0.560763	0.647126	0.801922	0.520511	0.708981	0.774692
	c	0.859907	0.824712	0.675348	0.238558	0.869172	0.965363	0.803485	0.568771	0.734648
B	a	0.129040	0.278234	0.981728	0.903423	0.676240	0.371291	0.876571	0.338353	0.021567
	b	0.221386	0.744765	0.080138	0.013936	0.623034	0.314859	0.520199	0.742233	0.834835
	c	0.905252	0.037512	0.377849	0.225882	0.781182	0.232511	0.778518	0.572787	0.858842
C	a	0.678289	0.324638	0.165628	0.737036	0.591936	0.312173	0.319689	0.289072	0.954463
	b	0.647861	0.527387	0.505945	0.488335	0.631082	0.639539	0.785094	0.026073	0.881210
	c	0.990722	0.691715	0.697617	0.115831	0.129802	0.994152	0.176841	0.449053	0.145351

rename方法用于修改列或者行索引标签，而不是索引名：

df_temp.rename(index={'A':'T'},columns={'e':'changed_e'}).head()

	Big	D			E			F
	Small	d	changed_e	f	d	changed_e	f	d	changed_e	f
Upper	Lower
T	a	0.036300	0.111297	0.509819	0.322065	0.107846	0.393002	0.951695	0.352045	0.055245
	b	0.585976	0.817781	0.526512	0.560763	0.647126	0.801922	0.520511	0.708981	0.774692
	c	0.859907	0.824712	0.675348	0.238558	0.869172	0.965363	0.803485	0.568771	0.734648
B	a	0.129040	0.278234	0.981728	0.903423	0.676240	0.371291	0.876571	0.338353	0.021567
B	b	0.221386	0.744765	0.080138	0.013936	0.623034	0.314859	0.520199	0.742233	0.834835

四、常用索引型函数

1. where函数

当对条件为False的单元进行填充：

df.head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

不满足条件的行全部被设置为NaN
通过other来改变不符合条件的值

df.where(df['Gender']=='M',' ').head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34	A+
1102
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104
1105

通过这种方法筛选结果和[]操作符的结果完全一致：

df.where(df['Gender']=='M').dropna().head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173.0	63.0	34.0	A+
1103	S_1	C_1	M	street_2	186.0	82.0	87.2	B+
1201	S_1	C_2	M	street_5	188.0	68.0	97.0	A-
1203	S_1	C_2	M	street_6	160.0	53.0	58.8	A+
1301	S_1	C_3	M	street_4	161.0	68.0	31.5	B+

第一个参数为布尔条件，第二个参数为填充值：

用随机值替换NaN值

df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173.000000	63.000000	34.000000	A+
1102	0.55092	0.303216	0.824431	0.755427	0.433119	0.997173	0.358161	0.699621
1103	S_1	C_1	M	street_2	186.000000	82.000000	87.200000	B+
1104	0.313753	0.481069	0.335257	0.645973	0.551690	0.562617	0.684399	0.312336
1105	0.0705601	0.067762	0.818158	0.352117	0.200073	0.076810	0.885933	0.142629

2. mask函数

mask函数与where功能上相反，其余完全一致，即对条件为True的单元进行填充

df.mask(df['Gender']=='M').dropna().head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192.0	73.0	32.5	B+
1104	S_1	C_1	F	street_2	167.0	81.0	80.4	B-
1105	S_1	C_1	F	street_4	159.0	64.0	84.8	B+
1202	S_1	C_2	F	street_4	176.0	94.0	63.5	B-
1204	S_1	C_2	F	street_5	162.0	63.0	33.8	B

df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	0.273962	0.25028	0.587471	0.977206	0.442403	0.319460	0.460991	0.842498
1102	S_1	C_1	F	street_2	192.000000	73.000000	32.500000	B+
1103	0.436674	0.741524	0.46996	0.688603	0.938241	0.531811	0.794352	0.17495
1104	S_1	C_1	F	street_2	167.000000	81.000000	80.400000	B-
1105	S_1	C_1	F	street_4	159.000000	64.000000	84.800000	B+

3. query函数

df.head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

query函数中的布尔表达式中，下面的符号都是合法的：行列索引名、字符串、and/not/or/&/|/~/not in/in/==/!=、四则运算符

## 封装了类似于数据库查询的方法
df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1303	S_1	C_3	M	street_7	188	82	49.7	B
2304	S_2	C_3	F	street_6	164	81	95.5	A-
2402	S_2	C_4	M	street_7	166	82	48.7	B

五、重复元素处理

1. duplicated方法

该方法返回了是否重复的布尔列表

##默认为first
df.duplicated('Class').head(10)

ID
1101    False
1102     True
1103     True
1104     True
1105     True
1201    False
1202     True
1203     True
1204     True
1205     True
dtype: bool

df.duplicated('Class',keep='first').head(10)

ID
1101    False
1102     True
1103     True
1104     True
1105     True
1201    False
1202     True
1203     True
1204     True
1205     True
dtype: bool

可选参数keep默认为first，即首次出现设为不重复，若为last，则最后一次设为不重复，若为False，则所有重复项为False

df.duplicated('Class',keep='last').tail()

ID
2401     True
2402     True
2403     True
2404     True
2405    False
dtype: bool

df.duplicated('Class',keep=False).head()

ID
1101    True
1102    True
1103    True
1104    True
1105    True
dtype: bool

2. drop_duplicates方法

从名字上看出为剔除重复项，这在后面章节中的分组操作中可能是有用的，例如需要保留每组的第一个值：

df.drop_duplicates('Class')

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
2401	S_2	C_4	F	street_2	192	62	45.3	A

参数与duplicate函数类似：

df.drop_duplicates('Class',keep='last')

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2105	S_2	C_1	M	street_4	170	81	34.2	A
2205	S_2	C_2	F	street_7	183	76	85.4	B
2305	S_2	C_3	M	street_4	187	73	48.9	B
2405	S_2	C_4	F	street_6	193	54	47.6	B

在传入多列时等价于将多列共同视作一个多级索引，比较重复项：

df.drop_duplicates(['School','Class'],keep = 'first',inplace=False)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
2101	S_2	C_1	M	street_7	174	84	83.3	C
2201	S_2	C_2	M	street_5	193	100	39.1	B
2301	S_2	C_3	F	street_4	157	78	72.3	B+
2401	S_2	C_4	F	street_2	192	62	45.3	A

六、抽样函数

这里的抽样函数指的就是sample函数

（a）n为样本量

df.sample(n=5)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2103	S_2	C_1	M	street_4	157	61	52.5	B-
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1304	S_1	C_3	M	street_2	195	70	85.2	A
1105	S_1	C_1	F	street_4	159	64	84.8	B+

（b）frac为抽样比

df.sample(frac=0.05)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1105	S_1	C_1	F	street_4	159	64	84.8	B+
2402	S_2	C_4	M	street_7	166	82	48.7	B

（c）replace为是否放回

df.sample(n=df.shape[0],replace=True).head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2404	S_2	C_4	F	street_2	160	84	67.7	B
2405	S_2	C_4	F	street_6	193	54	47.6	B
2303	S_2	C_3	F	street_7	190	99	65.9	C
1203	S_1	C_2	M	street_6	160	53	58.8	A+

df.sample(n=35,replace=True).index.is_unique

False

（d）axis为抽样维度，默认为0，即抽行

df.sample(n=3,axis=1).head()

	Address	Weight	School
ID
1101	street_1	63	S_1
1102	street_2	73	S_1
1103	street_2	82	S_1
1104	street_2	81	S_1
1105	street_4	64	S_1

（e）weights为样本权重，自动归一化

df.sample(n=3,weights=np.random.rand(df.shape[0])).head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1302	S_1	C_3	F	street_1	175	57	87.7	A-
1305	S_1	C_3	F	street_5	187	69	61.7	B-
2404	S_2	C_4	F	street_2	160	84	67.7	B

#以某一列为权重，这在抽样理论中很常见
df.sample(n=3,weights=df['Math']).head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2205	S_2	C_2	F	street_7	183	76	85.4	B
2404	S_2	C_4	F	street_2	160	84	67.7	B
1203	S_1	C_2	M	street_6	160	53	58.8	A+

数分虐我千百遍

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
第2章索引

第2章索引文章目录第2章索引大致内容一、单级索引1. loc方法、iloc方法、[]操作符最常用的索引方法可能就是这三类，其中iloc表示位置索引，loc表示标签索引，[]也具有很大的便利性，各有特点（a）loc方法（注意：所有在loc中使用的切片全部包含右端点！）① 单行索引：② 多行索引：③ 单列索引：④ 多列索引：⑤ 联合索引：⑥ 函数式索引：⑦ 布尔索引（将重点在第2节介绍）小节：本...
复制链接

扫一扫