一、索引器
1. 表的列索引
列索引是最常见的索引形式,一般通过[]
来实现。通过[列名]
可以从DataFrame
中取出相应的列,返回值为Series
import numpy as np
import pandas as pd
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
如果要取出多个列,则可以通过[列名组成的列表]
,其返回值为一个DataFrame
,例如从表中取出性别和姓名两列:
df[['Gender', 'Name']].head()
Gender | Name | |
---|---|---|
0 | Female | Gaopeng Yang |
1 | Male | Changqiang You |
2 | Male | Mei Sun |
3 | Female | Xiaojuan Sun |
4 | Male | Gaojuan You |
此外,若要取出单列,且列名中不包含空格,则可以用.列名
取出,这和[列名]
是等价的:
df.Name.head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
2. 序列的行索引
【a】以字符串为索引的Series
如果取出单个索引的对应元素,则可以使用[item]
,若Series
只有单个值对应,则返回这个标量值,如果有多个值对应,则返回一个Series
:
s = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'a', 'a', 'a', 'c'])
s['a']
a 1
a 3
a 4
a 5
dtype: int64
s['b']
2
如果取出多个索引的对应元素,则可以使用[items的列表]
:
s[['c', 'b']]
c 6
b 2
dtype: int64
如果想要取出某两个索引之间的元素,并且这两个索引是在整个索引中唯一出现,则可以使用切片,,同时需要注意这里的切片会包含两个端点:
s['c': 'b': -2]
c 6
a 4
b 2
dtype: int64
【b】以整数为索引的Series
在使用数据的读入函数时,如果不特别指定所对应的列作为索引,那么会生成从0开始的整数索引作为默认索引。当然,任意一组符合长度要求的整数都可以作为索引。
和字符串一样,如果使用[int]
或[int_list]
,则可以取出对应索引元素的值:
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]
1 a
1 c
dtype: object
s[[2,3]]
2 d
3 b
dtype: object
如果使用整数切片,则会取出对应索引位置的值,注意这里的整数切片同Python
中的切片一样不包含右端点:
s[1:-1:2]
3 b
2 d
dtype: object
3. loc索引器
前面讲到了对DataFrame
的列进行选取,下面要讨论其行的选取。对于表而言,有两种索引器,一种是基于元素的loc
索引器,另一种是基于位置的iloc
索引器。
loc
索引器的一般形式是loc[*, *]
,其中第一个*
代表行的选择,第二个*
代表列的选择,如果省略第二个位置写作loc[*]
,这个*
是指行的筛选。其中,*
的位置一共有五类合法对象,分别是:单个元素、元素列表、元素切片、布尔列表以及函数,下面将依次说明。
为了演示相应操作,先利用set_index
方法把Name
列设为索引,关于该函数的其他用法将在多级索引一章介绍。
df_demo = df.set_index('Name')
df_demo.head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
【a】*
为单个元素
此时,直接取出相应的行或列,如果该元素在索引中重复则结果为DataFrame
,否则为Series
:
df_demo.loc['Qiang Sun'] # 多个人叫此名字
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Qiang Sun | Tsinghua University | Junior | Female | 53.0 | N |
Qiang Sun | Tsinghua University | Sophomore | Female | 40.0 | N |
Qiang Sun | Shanghai Jiao Tong University | Junior | Female | NaN | N |
df_demo.loc['Quan Zhao'] # 名字唯一
School Shanghai Jiao Tong University
Grade Junior
Gender Female
Weight 53
Transfer N
Name: Quan Zhao, dtype: object
也可以同时选择行和列:
df_demo.loc['Qiang Sun', 'School'] # 返回Series
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao', 'School'] # 返回单个元素
'Shanghai Jiao Tong University'
【b】*
为元素列表
此时,取出列表中所有元素值对应的行或列:
df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]
School | Gender | |
---|---|---|
Name | ||
Qiang Sun | Tsinghua University | Female |
Qiang Sun | Tsinghua University | Female |
Qiang Sun | Shanghai Jiao Tong University | Female |
Quan Zhao | Shanghai Jiao Tong University | Female |
【c】*
为切片
之前的Series
使用字符串索引时提到,如果是唯一值的起点和终点字符,那么就可以使用切片,并且包含两个端点,如果不唯一则报错:
df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']
School | Grade | Gender | |
---|---|---|---|
Name | |||
Gaojuan You | Fudan University | Sophomore | Male |
Xiaoli Qian | Tsinghua University | Freshman | Female |
Qiang Chu | Shanghai Jiao Tong University | Freshman | Female |
Gaoqiang Qian | Tsinghua University | Junior | Female |
需要注意的是,如果DataFrame
使用整数索引,其使用整数切片的时候和上面字符串索引的要求一致,都是元素切片,包含端点且起点、终点不允许有重复值。
df_loc_slice_demo = df_demo.copy()
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1)
df_loc_slice_demo.loc[5:3]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
5 | Fudan University | Junior | Female | 46.0 | N |
4 | Tsinghua University | Senior | Female | 50.0 | N |
3 | Shanghai Jiao Tong University | Senior | Female | 45.0 | N |
df_loc_slice_demo.loc[3:5] # 没有返回,说明不是整数位置切片
School | Grade | Gender | Weight | Transfer |
---|
【d】*
为布尔列表
在实际的数据处理中,根据条件来筛选行是极其常见的,此处传入loc
的布尔列表与DataFrame
长度相同,且列表为True
的位置所对应的行会被选中,False
则会被剔除。
df_demo.loc[df_demo.Weight>70].head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
Xiaopeng Zhou | Shanghai Jiao Tong University | Freshman | Male | 74.0 | N |
Xiaofeng Sun | Tsinghua University | Senior | Male | 71.0 | N |
Qiang Zheng | Shanghai Jiao Tong University | Senior | Male | 87.0 | N |
前面所提到的传入元素列表,也可以通过isin
方法返回的布尔列表等价写出,例如选出所有大一和大四的同学信息:
df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaoli Qian | Tsinghua University | Freshman | Female | 51.0 | N |
Qiang Chu | Shanghai Jiao Tong University | Freshman | Female | 52.0 | N |
对于复合条件而言,可以用|(或), &(且), ~(取反)
的组合来实现
condition_1_1 = df_demo.School == 'Fudan University'
condition_1_2 = df_demo.Grade == 'Senior'
condition_1_3 = df_demo.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = df_demo.School == 'Peking University'
condition_2_2 = df_demo.Grade == 'Senior'
condition_2_3 = df_demo.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
df_demo.loc[condition_1 | condition_2]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Qiang Han | Peking University | Freshman | Male | 87.0 | N |
Chengpeng Zhou | Fudan University | Senior | Male | 81.0 | N |
Changpeng Zhao | Peking University | Freshman | Male | 83.0 | N |
Chengpeng Qian | Fudan University | Senior | Male | 73.0 | Y |
【练一练】
select_dtypes
是一个实用函数,它能够从表中选出相应类型的列,若要选出所有数值型的列,只需使用.select_dtypes('number')
,请利用布尔列表选择的方法结合DataFrame
的dtypes
属性在learn_pandas
数据集上实现这个功能。
import numpy as np
import pandas as pd
df = pd.read_csv('../data/learn_pandas.csv')
df.select_dtypes('number')
Height | Weight | Test_Number | |
---|---|---|---|
0 | 158.9 | 46.0 | 1 |
1 | 166.5 | 70.0 | 1 |
2 | 188.9 | 89.0 | 2 |
3 | NaN | 41.0 | 2 |
4 | 174.0 | 74.0 | 2 |
... | ... | ... | ... |
195 | 153.9 | 46.0 | 2 |
196 | 160.9 | 50.0 | 3 |
197 | 153.9 | 45.0 | 1 |
198 | 175.3 | 71.0 | 2 |
199 | 155.7 | 51.0 | 1 |
200 rows × 3 columns
df[df.columns[(df.dtypes == 'int64') | (df.dtypes == 'float64')]]
Height | Weight | Test_Number | |
---|---|---|---|
0 | 158.9 | 46.0 | 1 |
1 | 166.5 | 70.0 | 1 |
2 | 188.9 | 89.0 | 2 |
3 | NaN | 41.0 | 2 |
4 | 174.0 | 74.0 | 2 |
... | ... | ... | ... |
195 | 153.9 | 46.0 | 2 |
196 | 160.9 | 50.0 | 3 |
197 | 153.9 | 45.0 | 1 |
198 | 175.3 | 71.0 | 2 |
199 | 155.7 | 51.0 | 1 |
200 rows × 3 columns
【e】*
为函数
这里的函数,必须以前面的四种合法形式之一为返回值,并且函数的输入值为DataFrame
本身。假设仍然是上述复合条件筛选的例子,可以把逻辑写入一个函数中再返回,需要注意的是函数的形式参数x
本质上即为df_demo
:
def condition(x):
condition_1_1 = x.School == 'Fudan University'
condition_1_2 = x.Grade == 'Senior'
condition_1_3 = x.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = x.School == 'Peking University'
condition_2_2 = x.Grade == 'Senior'
condition_2_3 = x.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
result = condition_1 | condition_2
return result
df_demo.loc[condition]
此外,还支持使用lambda
表达式,其返回值也同样必须是先前提到的四种形式之一:
df_demo.loc[lambda x:'Quan Zhao', lambda x:'Gender']
'Female'
由于函数无法返回如start: end: step
的切片形式,故返回切片时要用slice
对象进行包装:
df_demo.loc[lambda x: slice('Gaojuan You', 'Gaoqiang Qian')]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
Xiaoli Qian | Tsinghua University | Freshman | Female | 51.0 | N |
Qiang Chu | Shanghai Jiao Tong University | Freshman | Female | 52.0 | N |
Gaoqiang Qian | Tsinghua University | Junior | Female | 50.0 | N |
最后需要指出的是,对于Series
也可以使用loc
索引,其遵循的原则与DataFrame
中用于行筛选的loc[*]
完全一致,此处不再赘述。
【WARNING】不要使用链式赋值
在对表或者序列赋值时,应当在使用一层索引器后直接进行赋值操作,这样做是由于进行多次索引后赋值是赋在临时返回的copy
副本上的,而没有真正修改元素从而报出SettingWithCopyWarning
警告。例如,下面给出的例子:
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain
import warnings
with warnings.catch_warnings():
warnings.filterwarnings('error')
try:
df_chain[df_chain.A!=0].B = 1 # 使用方括号列索引后,再使用点的列索引
except Warning as w:
Warning_Msg = w
print(Warning_Msg)
df_chain
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A | B | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | -1 | 0 |
df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain
A | B | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | -1 | 1 |
【END】
4. iloc索引器
iloc
的使用与loc
完全类似,只不过是针对位置进行筛选,在相应的*
位置处一共也有五类合法对象,分别是:整数、整数列表、整数切片、布尔列表以及函数,函数的返回值必须是前面的四类合法对象中的一个,其输入同样也为DataFrame
本身。
df_demo.iloc[1, 1] # 第二行第二列
'Freshman'
df_demo.iloc[[0, 1], [0, 1]] # 前两行前两列
School | Grade | |
---|---|---|
Name | ||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman |
Changqiang You | Peking University | Freshman |
df_demo.iloc[1: 4, 2:4] # 切片不包含结束端点
Gender | Weight | |
---|---|---|
Name | ||
Changqiang You | Male | 70.0 |
Mei Sun | Male | 89.0 |
Xiaojuan Sun | Female | 41.0 |
df_demo.iloc[lambda x: slice(1, 4)] # 传入切片为返回值的函数
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
在使用布尔列表的时候要特别注意&#x