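I. Indexers
1. Column indexing on a DataFrame
Column indexing is the most common form of indexing. Using [column_name] on a DataFrame extracts the corresponding column and returns a Series. For a single column whose name contains no spaces (more precisely, a name that is a valid Python identifier and does not clash with an existing DataFrame attribute or method), .column_name is an equivalent shorthand for [column_name].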
df['Name'].head() # select the column with [column_name]
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
df.Name.head() # select the column with .column_name
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
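To extract several columns, pass a list of column names, [list_of_column_names]; the result is a DataFrame.
df[['Gender', 'Name']].head() # select several columns with a list of column names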
Gender Name
0 Female Gaopeng Yang
1 Male Changqiang You
2 Male Mei Sun
3 Female Xiaojuan Sun
4 Male Gaojuan You
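The return types described above can be checked on a small table. This is a minimal sketch; df_small and its column "Test Number" are hypothetical stand-ins for the tutorial's df, introduced only to show a column name that contains a space.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's df
df_small = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Gender": ["Female", "Male", "Male"],
    "Test Number": [1, 2, 3],  # column name containing a space
})

single = df_small["Name"]                 # one name: a Series
multiple = df_small[["Gender", "Name"]]   # list of names: a DataFrame
one_col_frame = df_small[["Name"]]        # one-element list: still a DataFrame

print(type(single).__name__, type(multiple).__name__, type(one_col_frame).__name__)

# Attribute access is equivalent to bracket access for a simple name
print(df_small.Name.equals(single))

# A name with a space cannot be written as an attribute, so brackets are required
print(df_small["Test Number"].tolist())
```

Bracket access works for every column name, so it is the safer default when column names are not known in advance.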
2. Row indexing on a Series
a. Series with a string index
To extract the element(s) for a single index label, use [item]:
s = pd.Series([1, 2, 3, 4, 5, 6],
index = ['a', 'b', 'a', 'a', 'a', 'c'])
s['a'] # elements for label 'a' (several matches, so a Series is returned)
a 1
a 3
a 4
a 5
dtype: int64
s['b'] # element for label 'b' (a single match, so a scalar is returned)
>>> 2
To extract the elements for several index labels, use [list_of_items]:
s[['c', 'b']] # elements for labels 'c' and 'b', in the order requested
c 6
b 2
dtype: int64
To extract the elements between two index labels, both of which occur exactly once in the index, use a slice. Note that a label slice includes both endpoints.
s['c': 'b': -1]
c 6
a 5
a 4
a 3
b 2
dtype: int64
s['c': 'b': -2]
c 6
a 4
b 2
dtype: int64
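The uniqueness requirement on the slice bounds can be seen by reusing the same data. A minimal sketch; s_demo is just a renamed copy of the Series above, and the flag name raised is illustrative.

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 4, 5, 6],
                   index=["a", "b", "a", "a", "a", "c"])

# 'b' and 'c' each occur once, so the label slice works and includes both ends
forward = s_demo["b":"c"]
print(forward.tolist())

# 'a' occurs several times at non-adjacent positions in this unsorted index,
# so it cannot serve as a slice bound
try:
    s_demo["a":"c"]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```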
b. Series with an integer index
If no index is explicitly specified, a default integer index starting from 0 is generated.
s1 = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])
s1[1]
>>> 'b'
s1.index # inspect the index
>>> RangeIndex(start=0, stop=6, step=1)
As with strings, [int] or [int_list] extracts the values whose index labels match:
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'],
index=[1, 3, 1, 2, 5, 4])
s[1]
1 a
1 c
dtype: object
s[[2, 5]]
2 d
5 e
dtype: object
An integer slice, by contrast, selects by position rather than by label, and it excludes the right endpoint:
s[0:-1:1] # excludes 'f'
1 a
3 b
1 c
2 d
5 e
dtype: object
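A note on index types:
To stay out of trouble, do not use a pure floating-point index or any mixed-type index (a mix of strings, integers, floats, and so on); otherwise specific operations may raise errors or return unexpected results.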
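Because plain brackets treat an integer key as a label but an integer slice as positions, it can help to state the intent explicitly. The sketch below is a suggestion rather than part of the original notes; it uses .loc for labels and .iloc for positions on the same data, with s_int as a renamed copy of the Series above.

```python
import pandas as pd

s_int = pd.Series(["a", "b", "c", "d", "e", "f"],
                  index=[1, 3, 1, 2, 5, 4])

by_label = s_int.loc[1]        # label 1 appears twice, so a Series is returned
by_position = s_int.iloc[1]    # the element at position 1
first_five = s_int.iloc[0:-1]  # positional slice, right endpoint excluded

print(by_label.tolist())
print(by_position)
print(first_five.tolist())
```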
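3. The loc indexer (label-based)
The general form of the loc indexer is loc[*, *], where the first * selects rows and the second * selects columns. If the second position is omitted, as in loc[*], the * selects rows.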
df_demo = df.set_index('Name')
df_demo.head()
School Grade Gender Weight Transfer
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N
Changqiang You Peking University Freshman Male 70.0 N
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Xiaojuan Sun Fudan University Sophomore Female 41.0 N
Gaojuan You Fudan University Sophomore Male 74.0 N
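a. * is a single element
This directly extracts the corresponding row or column. If the element is duplicated in the index, the result is a DataFrame; otherwise it is a Series:
df_demo.loc['Qiang Sun'] # duplicated label, so the result is a DataFrame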
School Grade Gender Weight Transfer
Name
Qiang Sun Tsinghua University Junior Female 53.0 N
Qiang Sun Tsinghua University Sophomore Female 40.0 N
Qiang Sun Shanghai Jiao Tong University Junior Female NaN N
df_demo.loc[:, ['Grade']].head() # select a column (passed as a one-element list, so the result is a DataFrame)
Grade
Name
Gaopeng Yang Freshman
Changqiang You Freshman
Mei Sun Senior
Xiaojuan Sun Sophomore
Gaojuan You Sophomore
df_demo.loc['Quan Zhao'] # the name is unique, so the result is a Series
School Shanghai Jiao Tong University
Grade Junior
Gender Female
Weight 53
Transfer N
Name: Quan Zhao, dtype: object
df_demo.loc['Qiang Sun', 'School'] # returns a Series
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Qiang Sun', ['School']] # returns a DataFrame
School
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
df_demo.loc['Quan Zhao', 'School'] # returns a single element
>>> 'Shanghai Jiao Tong University'
b. * is a list of elements
This extracts the rows or columns for all the elements in the list:
df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]
School Gender
Name
Qiang Sun Tsinghua University Female
Qiang Sun Tsinghua University Female
Qiang Sun Shanghai Jiao Tong University Female
Quan Zhao Shanghai Jiao Tong University Female
df_demo.loc[['Quan Zhao'], ['School']] # returns a DataFrame
School
Name
Quan Zhao Shanghai Jiao Tong University
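The return type of loc depends on whether each selector is a single label or a list, and on whether the row label is duplicated. The sketch below summarizes the cases; records and its values are hypothetical and stand in for df_demo.

```python
import pandas as pd

# Hypothetical miniature of df_demo: "Qiang" is duplicated, "Quan" is unique
records = pd.DataFrame(
    {"School": ["Tsinghua", "SJTU", "Tsinghua"],
     "Grade": ["Junior", "Junior", "Senior"]},
    index=["Qiang", "Quan", "Qiang"],
)

dup_rows = records.loc["Qiang"]                      # duplicated label: DataFrame
unique_row = records.loc["Quan"]                     # unique label: Series
dup_cell = records.loc["Qiang", "School"]            # duplicated label, one column: Series
unique_cell = records.loc["Quan", "School"]          # unique label, one column: scalar
unique_as_frame = records.loc[["Quan"], ["School"]]  # lists on both axes: DataFrame

for result in (dup_rows, unique_row, dup_cell, unique_cell, unique_as_frame):
    print(type(result).__name__)
```

Passing lists on both axes always yields a DataFrame, which is convenient when downstream code should not have to handle scalars and Series separately.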
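c. * is a slice
A slice can be used when the start and end labels are each unique in the index, and it includes both endpoints. If a bound is duplicated (and the index is not sorted), an error is raised: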
df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']
School Grade Gender
Name
Gaojuan You Fudan University Sophomore Male
Xiaoli Qian Tsinghua University Freshman Female
Qiang Chu Shanghai Jiao Tong University Freshman Female
Gaoqiang Qian Tsinghua University Junior Female
df_demo.loc['Qiang Sun':'Gaoqiang Qian', 'School':'Gender'] # 'Qiang Sun' is not unique, so this raises an error
>>> KeyError: "Cannot get left slice bound for non-unique label: 'Qiang Sun'"
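Note that if the DataFrame has an integer index, an integer slice passed to loc follows the same rules as the string-index case above: it slices by label, includes both endpoints, and the start and end labels must not be duplicated.
df_loc_slice_demo = df_demo.copy() # work on a copy
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1) # descending integer index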
df_loc_slice_demo.head()
School Grade Gender Weight Transfer
200 Shanghai Jiao Tong University Freshman Female 46.0 N
199 Peking University Freshman Male 70.0 N
198 Shanghai Jiao Tong University Senior Male 89.0 N
197 Fudan University Sophomore Female 41.0 N
196 Fudan University Sophomore Male 74.0 N
df_loc_slice_demo.loc[7:4]
School Grade Gender Weight Transfer
7 Tsinghua University Senior Male 79.0 N
6 Peking University Senior Female 49.0 NaN
5 Fudan University Junior Female 46.0 N
4 Tsinghua University Senior Female 50.0 N
df_loc_slice_demo.loc[4:7] # empty result: label 4 comes after label 7 in this descending index
School Grade Gender Weight Transfer
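The difference between label slicing and positional slicing on an integer index can be reproduced on a few rows. A minimal sketch; frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical table with a descending integer index
frame = pd.DataFrame({"value": list("abcde")}, index=[5, 4, 3, 2, 1])

label_slice = frame.loc[4:2]      # labels 4, 3, 2: both endpoints included
position_slice = frame.iloc[1:3]  # positions 1 and 2: right endpoint excluded
empty = frame.loc[2:4]            # label 2 sits after label 4, so nothing is selected

print(label_slice.index.tolist())
print(position_slice.index.tolist())
print(empty.empty)
```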
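d. * is a boolean list
Filtering rows by a condition is extremely common. The boolean list passed to loc must have the same length as the DataFrame; rows at positions where the list is True are selected, and rows where it is False are dropped.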
df_demo.loc[df_demo.Gender=='Female'].head()
School Grade Gender Weight Transfer
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N
Xiaojuan Sun Fudan University Sophomore Female 41.0 N
Xiaoli Qian Tsinghua University Freshman Female 51.0 N
Qiang Chu Shanghai Jiao Tong University Freshman Female 52.0 N
Gaoqiang Qian Tsinghua University Junior Female 50.0 N
When matching against a list of elements, the equivalent boolean list can be produced with the isin method:
df_demo.loc[df_demo.Gender.isin(['Female'])].head()
Compound conditions can be built by combining | (or), & (and), and ~ (not):
condition_1 = (df_demo.School == 'Fudan University') & (df_demo.Grade == 'Senior') & (df_demo.Weight > 70)
condition_2 = (df_demo.School == 'Peking University') & (df_demo.Grade != 'Senior') & (df_demo.Weight > 80)
df_demo.loc[condition_1 | condition_2]
School Grade Gender Weight Transfer
Name
Qiang Han Peking University Freshman Male 87.0 N
Chengpeng Zhou Fudan University Senior Male 81.0 N
Changpeng Zhao Peking University Freshman Male 83.0 N
Chengpeng Qian Fudan University Senior Male 73.0 Y
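Each comparison in a compound condition needs its own parentheses, because & and | bind more tightly than == and >. The sketch below shows the same pattern on a hypothetical four-row table, using ~ for negation and isin for membership; students and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not the learn_pandas dataset
students = pd.DataFrame(
    {"School": ["Fudan", "Peking", "Fudan", "Peking"],
     "Grade": ["Senior", "Freshman", "Junior", "Senior"],
     "Weight": [81.0, 87.0, 60.0, 90.0]},
    index=["A", "B", "C", "D"],
)

heavy_fudan_senior = (
    (students.School == "Fudan")
    & (students.Grade == "Senior")
    & (students.Weight > 70)
)
heavy_peking_non_senior = (
    (students.School == "Peking")
    & ~(students.Grade == "Senior")   # ~ negates the boolean Series
    & (students.Weight > 80)
)

selected = students.loc[heavy_fudan_senior | heavy_peking_non_senior]
upper_years = students.loc[students.Grade.isin(["Junior", "Senior"])]

print(selected.index.tolist())
print(upper_years.index.tolist())
```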
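Exercise:
select_dtypes is a handy function that selects the columns of the given types from a table; to select all numeric columns, simply use .select_dtypes('number'). Using boolean-list selection together with the DataFrame's dtypes attribute, implement this functionality on the learn_pandas dataset.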
df_demo.select_dtypes('number').head()
Weight
Name
Gaopeng Yang 46.0
Changqiang You 70.0
Mei Sun 89.0
Xiaojuan Sun 41.0
Gaojuan You 74.0
df_demo.loc[:,(df_demo.dtypes == 'int') | (df_demo.dtypes == 'float')].head() # build the mask from df_demo's own dtypes
Weight
Name
Gaopeng Yang 46.0
Changqiang You 70.0
Mei Sun 89.0
Xiaojuan Sun 41.0
Gaojuan You 74.0
# Method 2: from teammate rain (np.number)
df_demo.loc[:,df_demo.dtypes == np.number].head()
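Another way to build the column mask, offered here as an alternative rather than as the exercise's intended answer, is to test each dtype with pandas.api.types.is_numeric_dtype instead of comparing dtypes to a string or to np.number. The table below is hypothetical and only mimics the shape of the dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical table with one text column and two numeric columns
table = pd.DataFrame({
    "School": ["A", "B"],
    "Weight": [46.0, 70.0],
    "Test_Number": [1, 2],
})

# One boolean per column, in column order
numeric_mask = [is_numeric_dtype(dtype) for dtype in table.dtypes]

by_mask = table.loc[:, numeric_mask]
by_builtin = table.select_dtypes("number")

print(numeric_mask)
print(by_mask.columns.tolist())
```

On this table the mask-based selection and select_dtypes('number') pick the same columns.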
e. * is a function
The function must return one of the four valid forms above, and its input is the DataFrame itself.
def condition(x):
    condition_1 = (x.School == 'Fudan University') & (x.Grade == 'Senior') & (x.Weight > 70)
    condition_2 = (x.School == 'Peking University') & (x.Grade != 'Senior') & (x.Weight > 80)
    result = condition_1 | condition_2
    return result
df_demo.loc[condition]
School Grade Gender Weight Transfer
Name
Qiang Han Peking University Freshman Male 87.0 N
Chengpeng Zhou Fudan University Senior Male 81.0 N
Changpeng Zhao Peking University Freshman Male 83.0 N
Chengpeng Qian Fudan University Senior Male 73.0 Y
lambda expressions are also supported; their return value must likewise be one of the four forms above:
df_demo.loc[lambda x:'Quan Zhao', lambda x:'Gender']
>>> 'Female'
Because a function cannot return slice syntax directly, wrap the slice in a slice object when returning one:
# rows from 'Gaojuan You' to 'Gaoqiang Qian', endpoints included
df_demo.loc[lambda x: slice('Gaojuan You', 'Gaoqiang Qian')]
School Grade Gender Weight Transfer
Name
Gaojuan You Fudan University Sophomore Male 74.0 N
Xiaoli Qian Tsinghua University Freshman Female 51.0 N
Qiang Chu Shanghai Jiao Tong University Freshman Female 52.0 N
Gaoqiang Qian Tsinghua University Junior Female 50.0 N
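One situation where a function selector is handy is a method chain, where the intermediate table has no name to refer to. A minimal sketch; roster, the derived column Weight_lb, and the 2.2 conversion factor are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data
roster = pd.DataFrame(
    {"Grade": ["Senior", "Junior", "Senior"], "Weight": [81.0, 60.0, 55.0]},
    index=["A", "B", "C"],
)

heavy_seniors = (
    roster
    .assign(Weight_lb=lambda x: x.Weight * 2.2)  # column that exists only inside the chain
    .loc[lambda x: (x.Grade == "Senior") & (x.Weight_lb > 150)]  # x is the intermediate table
)

# A slice must be wrapped in a slice object when returned from a function
first_two = roster.loc[lambda x: slice("A", "B")]

print(heavy_seniors.index.tolist())
print(first_two.index.tolist())
```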
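Do not use chained assignment
When assigning to a table or a Series, apply a single indexer and assign directly. After multiple indexing steps, the assignment lands on a temporary copy and does not modify the original elements, which triggers a SettingWithCopyWarning.
# original example: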
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain
A B
0 0 0
1 1 0
2 -1 0
# triggers a warning, and the assignment does not take effect
df_chain[df_chain.A!=0].B = 1
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py:5170: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[name] = value
df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain
A B
0 0 0
1 1 1
2 -1 1
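The two assignments can be compared side by side by checking column B after each one. A minimal sketch: the warning is silenced only to keep the demonstration quiet, the chained form is written with brackets instead of attribute access, and the exact warning class may differ between pandas versions.

```python
import warnings
import pandas as pd

df_chain_demo = pd.DataFrame([[0, 0], [1, 0], [-1, 0]], columns=list("AB"))

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demonstration only
    # Two indexing steps: the boolean filter returns a copy, and the write goes to that copy
    df_chain_demo[df_chain_demo.A != 0]["B"] = 1
after_chained = df_chain_demo.B.tolist()

# One indexing step: rows and column are chosen together, so the original is modified
df_chain_demo.loc[df_chain_demo.A != 0, "B"] = 1
after_loc = df_chain_demo.B.tolist()

print(after_chained)
print(after_loc)
```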
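4. The iloc indexer
iloc is used in exactly the same way as loc, except that it selects by position; positional slices exclude the right endpoint.
a. * is an integer
df_demo.iloc[:,2].head() # select a column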
Name
Gaopeng Yang Female
Changqiang You Male
Mei Sun Male
Xiaojuan Sun Female
Gaojuan You Male
Name: Gender, dtype: object
df_demo.iloc[1,:] # select a row
School Peking University
Grade Freshman
Gender Male
Weight 70
Transfer N
Name: Changqiang You, dtype: object
df_demo.iloc[1, 1] # second row, second column
>>> 'Freshman'
b. * is a list of integers
df_demo.iloc[[0, 2], [0, 1]] # rows 1 and 3, first two columns
School Grade
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman
Mei Sun Shanghai Jiao Tong University Senior
df_demo.iloc[[0, 1, 2], [0, 1]] # first three rows, first two columns
School Grade
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman
Changqiang You Peking University Freshman
Mei Sun Shanghai Jiao Tong University Senior
c. * is an integer slice
df_demo.iloc[0:3, 0:2] # equivalent to the previous snippet (first three rows, first two columns)
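d. * is a boolean list
Take special care with boolean lists here: a boolean Series cannot be passed directly; pass the Series' values instead, otherwise an error is raised. For boolean filtering, loc should therefore remain the first choice.
# the earlier exercise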
df_demo.iloc[:,(df_demo.dtypes == np.number).values].head()
Weight
Name
Gaopeng Yang 46.0
Changqiang You 70.0
Mei Sun 89.0
Xiaojuan Sun 41.0
Gaojuan You 74.0
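The same restriction applies when filtering rows. The sketch below, on hypothetical data, converts the mask to a NumPy array with to_numpy() before passing it to iloc, and checks that a raw boolean Series is rejected. The exception type is caught broadly here because it is assumed it may vary with the mask's index type.

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame(
    {"Gender": ["Female", "Male", "Female"], "Weight": [46.0, 70.0, 41.0]},
    index=["A", "B", "C"],
)
mask = people.Gender == "Female"  # boolean Series indexed by A, B, C

via_loc = people.loc[mask]                # loc accepts the Series directly
via_iloc = people.iloc[mask.to_numpy()]   # iloc needs a plain boolean array

try:
    people.iloc[mask]
    series_mask_rejected = False
except (ValueError, NotImplementedError):
    series_mask_rejected = True

print(via_iloc.index.tolist())
print(series_mask_rejected)
```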
e. * is a function
df_demo.iloc[lambda x: slice(0, 3)] # select the first three rows
School Grade Gender Weight Transfer
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N
Changqiang You Peking University Freshman Male 70.0 N
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Reference:
https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch3.html