import numpy as np
import pandas as pd
一、索引器
1.列索引
对于DataFrame而言,可以通过传入某个列名、或者多个列名的列表来获取该列的值。返回的是Series或者DataFrame。
df = pd.read_csv('data/learn_pandas.csv', usecols=['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
df[['School','Grade']].head()
| School | Grade |
|---|
| 0 | Shanghai Jiao Tong University | Freshman |
|---|
| 1 | Peking University | Freshman |
|---|
| 2 | Shanghai Jiao Tong University | Senior |
|---|
| 3 | Fudan University | Sophomore |
|---|
| 4 | Fudan University | Sophomore |
|---|
2.行索引
这里主要是Series的行索引,针对Series的行索引与DataFrame的列索引相似。
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'a', 'a', 'c'])
s['a']
a 1
a 3
a 4
dtype: int64
s['b':'c']
b 2
a 3
a 4
c 5
dtype: int64
3.loc iloc选择器
针对DataFrame取行,可利用loc(元素)与iloc(位置)选择器。
loc[*,*]中的*代表行与列,有五类合法对象:单个元素、元素列表、元素切片、布尔列表以及函数.可以组合使用(比如行用切片,列用布尔列表)
df_demo = df.set_index('Name')
df_demo.head()
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
|---|
| Changqiang You | Peking University | Freshman | Male | 70.0 | N |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
|---|
| Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
|---|
df_demo.loc['Mei Sun'].head()
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Mei Sun | Shanghai Jiao Tong University | Junior | Female | 50.0 | N |
|---|
同时选择行与列。返回 是一个Series或者一个元素。
df_demo.loc['Qiang Sun', 'School']
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao', 'School']
'Shanghai Jiao Tong University'
df_demo.loc['Gaopeng Yang':'Gaojuan You','School':'Gender']
| School | Grade | Gender |
|---|
| Name | | | |
|---|
| Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female |
|---|
| Changqiang You | Peking University | Freshman | Male |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male |
|---|
| Xiaojuan Sun | Fudan University | Sophomore | Female |
|---|
| Gaojuan You | Fudan University | Sophomore | Male |
|---|
布尔列表,也就是根据某些条件来筛选出行来,很常见.布尔列表要与DataFrame的行数,(列数)相同
df_demo.loc[df_demo.Weight > 70].head()
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
|---|
| Xiaopeng Zhou | Shanghai Jiao Tong University | Freshman | Male | 74.0 | N |
|---|
| Xiaofeng Sun | Tsinghua University | Senior | Male | 71.0 | N |
|---|
| Qiang Zheng | Shanghai Jiao Tong University | Senior | Male | 87.0 | N |
|---|
df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()
isin是Series的一个函数,用于数据清洗。传入的是一个列表值,然后去看看isin前面的Grade列的每个值是否包含传入的列表值,返回的是一个布尔列表,长度为样本数,因为每行的Grade都判断。
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
|---|
| Changqiang You | Peking University | Freshman | Male | 70.0 | N |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Xiaoli Qian | Tsinghua University | Freshman | Female | 51.0 | N |
|---|
| Qiang Chu | Shanghai Jiao Tong University | Freshman | Female | 52.0 | N |
|---|
对于复合条件,用|,&,~(取反)来组合,这里启发了task01的EX1的第3小问,为什么用的是&而不是and。
[练一练]select_dtypes 是一个实用函数,它能够从表中选出相应类型的列,若要选出所有数值型的列,只需使用 .select_dtypes(‘number’) ,请利用布尔列表选择的方法结合 DataFrame 的 dtypes 属性在 learn_pandas 数据集上实现这个功能。
df_demo.loc[:,df_demo.dtypes.isin(['float64'])].head()
思路:选择出数值型的列,首先要先找出DataFrame所有列的类型,这里用dtypes得到(DataFrame列本质也是Series)。然后利用isin判断哪些列类型是数值的(这里只设置了float64),设置对应的布尔值,然后利用loc的布尔列表形式去访问就行了。
| Weight |
|---|
| Name | |
|---|
| Gaopeng Yang | 46.0 |
|---|
| Changqiang You | 70.0 |
|---|
| Mei Sun | 89.0 |
|---|
| Xiaojuan Sun | 41.0 |
|---|
| Gaojuan You | 74.0 |
|---|
df_demo.select_dtypes('number').head()
| Weight |
|---|
| Name | |
|---|
| Gaopeng Yang | 46.0 |
|---|
| Changqiang You | 70.0 |
|---|
| Mei Sun | 89.0 |
|---|
| Xiaojuan Sun | 41.0 |
|---|
| Gaojuan You | 74.0 |
|---|
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]],columns=list('AB'))
df_chain
df_chain.loc[df_chain.A != 0, 'B'] = 1
df_chain
df_demo.iloc[1, 2]
'Male'
df_demo.iloc[lambda x: slice(1, 4)]
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Changqiang You | Peking University | Freshman | Male | 70.0 | N |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
|---|
df_demo.iloc[(df_demo.Weight > 70).values].head()
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
|---|
| Xiaopeng Zhou | Shanghai Jiao Tong University | Freshman | Male | 74.0 | N |
|---|
| Xiaofeng Sun | Tsinghua University | Senior | Male | 71.0 | N |
|---|
| Qiang Zheng | Shanghai Jiao Tong University | Senior | Male | 87.0 | N |
|---|
(df_demo.Weight>70).dtype
dtype('bool')
通过报错’numpy.ndarray’ object has no attribute 'head’发现传入的是一个np数组,所以就可以理解为什么要是values而不能是Series了,因为iloc代表的是索引,所以用np列表。
4.query方法
df_demo.loc[df_demo.Weight > df_demo.Weight.mean()].head()
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Changqiang You | Peking University | Freshman | Male | 70.0 | N |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
|---|
| Xiaopeng Zhou | Shanghai Jiao Tong University | Freshman | Male | 74.0 | N |
|---|
| Xiaomei Zhou | Tsinghua University | Senior | Female | 57.0 | N |
|---|
df_demo.query('(Weight > Weight.mean())').head()
这里query创建了DataFrame(可以理解为query前面的DataFrame),因此传入的字符串可以省略DataFrame,减少代码量。
| School | Grade | Gender | Weight | Transfer |
|---|
| Name | | | | | |
|---|
| Changqiang You | Peking University | Freshman | Male | 70.0 | N |
|---|
| Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
|---|
| Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
|---|
| Xiaopeng Zhou | Shanghai Jiao Tong University | Freshman | Male | 74.0 | N |
|---|
| Xiaomei Zhou | Tsinghua University | Senior | Female | 57.0 | N |
|---|
二、多级索引
1.多级索引及其表的结构
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'),df.Gender.unique()],names=('School','Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()],names=('Indicator','Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5+163).tolist(),
(np.random.randn(8,4)*5+65).tolist()],
index=multi_index,
columns=multi_column).round(1)
df_multi
| Indicator | Height | Weight |
|---|
| Grade | Freshman | Senior | Sophomore | Junior | Freshman | Senior | Sophomore | Junior |
|---|
| School | Gender | | | | | | | | |
|---|
| A | Female | 171.8 | 165.0 | 167.9 | 174.2 | 60.6 | 55.1 | 63.3 | 65.8 |
|---|
| Male | 172.3 | 158.1 | 167.8 | 162.2 | 71.2 | 71.0 | 63.1 | 63.5 |
|---|
| B | Female | 162.5 | 165.1 | 163.7 | 170.3 | 59.8 | 57.9 | 56.5 | 74.8 |
|---|
| Male | 166.8 | 163.6 | 165.2 | 164.7 | 62.5 | 62.8 | 58.7 | 68.9 |
|---|
| C | Female | 170.5 | 162.0 | 164.6 | 158.7 | 56.9 | 63.9 | 60.5 | 66.9 |
|---|
| Male | 150.2 | 166.3 | 167.3 | 159.3 | 62.4 | 59.1 | 64.9 | 67.1 |
|---|
| D | Female | 174.3 | 155.7 | 163.2 | 162.1 | 65.3 | 66.5 | 61.8 | 63.2 |
|---|
| Male | 170.7 | 170.3 | 163.8 | 164.9 | 61.6 | 63.2 | 60.9 | 56.4 |
|---|
多级索引的名字和值可以通过names与values来获取
df_multi.index.names
FrozenList(['School', 'Gender'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
dtype=object)
df_multi.index.get_level_values(1)
Index(['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'], dtype='object', name='Gender')
df_multi = df.set_index(['School', 'Grade'])
df_multi = df_multi.sort_index()
df_multi.head()
| | Name | Gender | Weight | Transfer |
|---|
| School | Grade | | | | |
|---|
| Fudan University | Freshman | Changqiang Yang | Female | 49.0 | N |
|---|
| Freshman | Gaoqiang Qin | Female | 63.0 | N |
|---|
| Freshman | Gaofeng Zhao | Female | 43.0 | N |
|---|
| Freshman | Yanquan Wang | Female | 55.0 | N |
|---|
| Freshman | Feng Wang | Male | 74.0 | N |
|---|
2.多级索引的loc选择器
df_multi.loc[('Fudan University', 'Junior')].head()
| | Name | Gender | Weight | Transfer |
|---|
| School | Grade | | | | |
|---|
| Fudan University | Junior | Yanli You | Female | 48.0 | N |
|---|
| Junior | Chunqiang Chu | Male | 72.0 | N |
|---|
| Junior | Changfeng Lv | Male | 76.0 | N |
|---|
| Junior | Yanjuan Lv | Female | 49.0 | NaN |
|---|
| Junior | Gaoqiang Zhou | Female | 43.0 | N |
|---|
df_multi.loc[[('Fudan University', 'Senior'),('Shanghai Jiao Tong University', 'Freshman')]].head()
| | Name | Gender | Weight | Transfer |
|---|
| School | Grade | | | | |
|---|
| Fudan University | Senior | Chengpeng Zheng | Female | 38.0 | N |
|---|
| Senior | Feng Zhou | Female | 47.0 | N |
|---|
| Senior | Gaomei Lv | Female | 34.0 | N |
|---|
| Senior | Chunli Lv | Female | 56.0 | N |
|---|
| Senior | Chengpeng Zhou | Male | 81.0 | N |
|---|
df_multi.loc[df_multi.Weight > 70].head()
| | Name | Gender | Weight | Transfer |
|---|
| School | Grade | | | | |
|---|
| Fudan University | Freshman | Feng Wang | Male | 74.0 | N |
|---|
| Junior | Chunqiang Chu | Male | 72.0 | N |
|---|
| Junior | Changfeng Lv | Male | 76.0 | N |
|---|
| Senior | Chengpeng Zhou | Male | 81.0 | N |
|---|
| Senior | Chengpeng Qian | Male | 73.0 | Y |
|---|
df_multi.loc[(['Peking University','Fudan University'],['Sophomore', 'Junior']),:].head()
| | Name | Gender | Weight | Transfer |
|---|
| School | Grade | | | | |
|---|
| Peking University | Sophomore | Changmei Xu | Female | 43.0 | N |
|---|
| Sophomore | Xiaopeng Qin | Male | NaN | N |
|---|
| Sophomore | Mei Xu | Female | 39.0 | N |
|---|
| Sophomore | Xiaoli Zhou | Female | 55.0 | N |
|---|
| Sophomore | Peng Han | Female | 34.0 | NaN |
|---|
df_multi.loc[[('Peking University', 'Junior'),('Fudan University', 'Sophomore')]].head()
| | Name | Gender | Weight | Transfer |
|---|
| School | Grade | | | | |
|---|
| Peking University | Junior | Juan Xu | Female | NaN | N |
|---|
| Junior | Changjuan You | Female | 47.0 | N |
|---|
| Junior | Gaoli Xu | Female | 48.0 | N |
|---|
| Junior | Gaoquan Zhou | Male | 70.0 | N |
|---|
| Junior | Qiang You | Female | 56.0 | N |
|---|
3.多级索引的IndexSlice对象
多级索引中,即使在索引不重复的时候,也只能对元组整体进行切片,而不能对每层进行切片,也不允许将切片和布尔列表混合使用,引入 IndexSlice 对象就能解决这个问题。
np.random.seed(0)
L1,L2 = ['A', 'B', 'C'],['a', 'b', 'c']
mul_index1 = pd.MultiIndex.from_product([L1, L2],names=('Upper', ' Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9, 10, (9, 9)),
index = mul_index1,
columns = mul_index2)
df_ex
| Big | D | E | F |
|---|
| Small | d | e | f | d | e | f | d | e | f |
|---|
| Upper | Lower | | | | | | | | | |
|---|
| A | a | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 | -5 |
|---|
| b | -3 | 3 | -8 | -3 | -2 | 5 | 8 | -4 | 4 |
|---|
| c | -1 | 0 | 7 | -4 | 6 | 6 | -9 | 9 | -6 |
|---|
| B | a | 8 | 5 | -2 | -9 | -8 | 0 | -9 | 1 | -6 |
|---|
| b | 2 | 9 | -7 | -9 | -9 | -5 | -4 | -3 | -1 |
|---|
| c | 8 | 6 | -5 | 0 | 1 | -8 | -8 | -2 | 0 |
|---|
| C | a | -6 | -3 | 2 | 5 | 9 | -9 | 5 | -6 | 3 |
|---|
| b | 1 | 2 | -5 | -3 | -5 | 6 | -6 | 3 | -5 |
|---|
| c | -1 | 5 | 6 | -6 | 6 | 4 | 7 | 8 | -4 |
|---|
loc[idx[*,*]]两个*代表行与列。
idx = pd.IndexSlice
df_ex.loc[idx['C':,('D','f'):]]
| Big | D | E | F |
|---|
| Small | f | d | e | f | d | e | f |
|---|
| Upper | Lower | | | | | | | |
|---|
| C | a | 2 | 5 | 9 | -9 | 5 | -6 | 3 |
|---|
| b | -5 | -3 | -5 | 6 | -6 | 3 | -5 |
|---|
| c | 6 | -6 | 6 | 4 | 7 | 8 | -4 |
|---|
loc[idx[*,*],idx[*,*]] 两个的话,idx就代表行与列,然后里面的*代表索引层
df_ex.loc[idx[:'A','b':], idx['E':,'e':]]
| Big | E | F |
|---|
| Small | e | f | e | f |
|---|
| Upper | Lower | | | | |
|---|
| A | b | -2 | 5 | -4 | 4 |
|---|
| c | 6 | 6 | 9 | -6 |
|---|
4.多级索引构造
除了使用set_index(‘索引名字’)之外(将DataFrame列索引当做新行索引),自己构造。
my_tuple = [('a', 'cat'),('a', 'dog'),('b', 'cat'),('b', 'dog')]
my_index = pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
df_multi_tuples = pd.DataFrame(np.random.randint(1,9,(4,2)),
index=my_index)
df_multi_tuples
| | 0 | 1 |
|---|
| First | Second | | |
|---|
| a | cat | 2 | 4 |
|---|
| dog | 1 | 8 |
|---|
| b | cat | 6 | 7 |
|---|
| dog | 1 | 2 |
|---|
my_array = [list('aabb'),['cat', 'dog'] * 2]
my_array_index = pd.MultiIndex.from_arrays(my_array, names=['First', 'Sceond'])
df_multi_arrays = pd.DataFrame(np.random.randint(1,8,(4,2)),
index = my_array_index)
df_multi_arrays
| | 0 | 1 |
|---|
| First | Sceond | | |
|---|
| a | cat | 3 | 5 |
|---|
| dog | 3 | 1 |
|---|
| b | cat | 6 | 4 |
|---|
| dog | 3 | 3 |
|---|
my_list1 = ['a' ,'b']
my_list2 = ['cat', 'dog']
my_product_index = pd.MultiIndex.from_product([my_list1, my_list2],
names=['First','Second'])
df_multi_products = pd.DataFrame(np.random.randn(4,2),
index=my_product_index)
df_multi_products
| | 0 | 1 |
|---|
| First | Second | | |
|---|
| a | cat | 0.066517 | 0.302472 |
|---|
| dog | -0.634322 | -0.362741 |
|---|
| b | cat | -0.672460 | -0.359553 |
|---|
| dog | -0.813146 | -1.726283 |
|---|
三、索引的常用方法.
1.索引层的交换和删除
np.random.seed(0)
L1, L2, L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1, L2, L3],
names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4, L5, L6],
names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9, 10,(8,8)),
index = mul_index1,
columns=mul_index2)
df_ex.columns
MultiIndex([('C', 'c', 'cat'),
('C', 'c', 'dog'),
('C', 'd', 'cat'),
('C', 'd', 'dog'),
('D', 'c', 'cat'),
('D', 'c', 'dog'),
('D', 'd', 'cat'),
('D', 'd', 'dog')],
names=['Big', 'Small', 'Other'])
df_ex.swaplevel(0 ,2, axis= 1).head()
| | Other | cat | dog | cat | dog | cat | dog | cat | dog |
|---|
| | Small | c | c | d | d | c | c | d | d |
|---|
| | Big | C | C | C | C | D | D | D | D |
|---|
| Upper | Lower | Extra | | | | | | | | |
|---|
| A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
df_ex.reorder_levels([2,1,0],axis=1).head()
| | Other | cat | dog | cat | dog | cat | dog | cat | dog |
|---|
| | Small | c | c | d | d | c | c | d | d |
|---|
| | Big | C | C | C | C | D | D | D | D |
|---|
| Upper | Lower | Extra | | | | | | | | |
|---|
| A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
2.索引属性的修改
df_ex.rename_axis(index={'Extra':'changed_row'},
columns={'Other':'changed_col'}).head()
| | Big | C | D |
|---|
| | Small | c | d | c | d |
|---|
| | changed_col | cat | dog | cat | dog | cat | dog | cat | dog |
|---|
| Upper | Lower | changed_row | | | | | | | | |
|---|
| A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
df_ex.rename(columns={'cat':'not_cat'},level=2).head()
| | Big | C | D |
|---|
| | Small | c | d | c | d |
|---|
| | Other | not_cat | dog | not_cat | dog | not_cat | dog | not_cat | dog |
|---|
| Upper | Lower | Extra | | | | | | | | |
|---|
| A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
df_ex.rename(index=lambda x: str.upper(x),level=2).head()
| | Big | C | D |
|---|
| | Small | c | d | c | d |
|---|
| | Other | cat | dog | cat | dog | cat | dog | cat | dog |
|---|
| Upper | Lower | Extra | | | | | | | | |
|---|
| A | a | ALPHA | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| BETA | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | ALPHA | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| BETA | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | ALPHA | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
对整个索引的元素替换,可以利用迭代器实现。这里iter用于生成迭代器;next从迭代器选元素。这里我以为迭代如果数量不够会从头再循环着来,发现并不是,会报错(还是没有理解到位的缘故)。
new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values),level=2)
| | Big | C | D |
|---|
| | Small | c | d | c | d |
|---|
| | Other | cat | dog | cat | dog | cat | dog | cat | dog |
|---|
| Upper | Lower | Extra | | | | | | | | |
|---|
| A | a | a | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| b | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | c | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| d | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | e | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
| f | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 |
|---|
| b | g | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 |
|---|
| h | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
|---|
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x:(x[0],
x[1],
str.upper(x[2])))
df_temp.index = new_idx
df_temp.head()
| | Big | C | D |
|---|
| | Small | c | d | c | d |
|---|
| | Other | cat | dog | cat | dog | cat | dog | cat | dog |
|---|
| Upper | Lower | Extra | | | | | | | | |
|---|
| A | a | ALPHA | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
|---|
| BETA | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
|---|
| b | ALPHA | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
|---|
| BETA | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
|---|
| B | a | ALPHA | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
|---|
3.索引的设置与重置
df_new = pd.DataFrame({'A':list('aabb'),
'B':list('PQRT'),
'C':[1,2,3,4]})
df_new
df_new.set_index(['A'],append=True)
my_index = pd.Series(list('WXYZ'), name='D')
df_new = df_new.set_index(['A',my_index])
df_new
df_new.reset_index(['D'],drop=True)
df_reindex = pd.DataFrame({"Weight":[60,70,80],
"Height":[176,180,179]},
index=['1001','1003','1002'])
df_reindex
| Weight | Height |
|---|
| 1001 | 60 | 176 |
|---|
| 1003 | 70 | 180 |
|---|
| 1002 | 80 | 179 |
|---|
df_reindex.reindex(index=['1001','1002','1003','1004'],
columns=['Weight','Gender'])
| Weight | Gender |
|---|
| 1001 | 60.0 | NaN |
|---|
| 1002 | 80.0 | NaN |
|---|
| 1003 | 70.0 | NaN |
|---|
| 1004 | NaN | NaN |
|---|
df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]],
index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]],
index = pd.Index(['b','b','c'],name='id2'))
df_set_in_col_1 = df_set_1.reset_index()
df_set_in_col_2 = df_set_2.reset_index()
df_set_in_col_1[df_set_in_col_1.id1.isin(df_set_in_col_2.id2)]
练习
Ex1:公司员工数据集
1.分别只使用 query 和 loc 选出年龄不超过四十岁且工作部门为 Dairy 或 Bakery 的男性。
df_demo = df.loc[(df.age <= 40) & (df.gender == 'M') & df.department.isin(['Dairy', 'Bakery'])]
print(df_demo.head())
用loc方法直接将几个条件连接起来就得到了。
df_demo = df.query('((age <= 40) & (gender == \'M\') & department in([\'Dairy\', \'Bakery\']))')
print(df_demo.head())
用query方法可以省略对于df的书写。我第一次时仍然像loc一样使用isin,但发现报错,所以在字符串内应该使用in。
参考答案的方法如下:
dpt = ['Dairy', 'Bakery']
df.query("(age <= 40)&(department == @dpt)&(gender=='M')").head(3)
2.选出员工 ID 号 为奇数所在行的第1、第3和倒数第2列。
df_demo = df.iloc[(df.EmployeeID.values % 2 != 0), [0, 2, -2]]
print(df_demo)
通过题意,看出选择iloc比较合适。因为关心的是第几列,而不是某列的值。参考答案的方法也是这样的:
df.iloc[(df.EmployeeID%2==1).values,[0,2,-2]].head()
答案是先得到布尔的df,然后求values得到列表;我先求values得到列表,然后对列表求布尔。
3.按照以下步骤进行索引操作:
a.把后三列设为索引后交换内外两层
df_demo = df.set_index(['department', 'job_title', 'gender']).swaplevel(0, 2, axis=0)
df_demo = df.set_index(df.columns[-3:].tolist()).swaplevel(0, 2, axis=0)
这里要选择后三列,先通过列索引得到然后将其转化为列表型。(我一开始想到了通过列索引得到后三列,但忘了tolist这一步,否则是Series不是列表。
b.恢复中间一层
df_demo = df_demo.reset_index(['job_title'])
df_demo = df_demo.reset_index(level=1)
c.修改外层索引名为 Gender
df_demo = df_demo.rename_axis(index={'gender': 'Gender'})
d.用下划线合并两层行索引
df_temp = df_demo.copy()
new_index = df_temp.index.map(lambda x: (x[0]+'_'+x[1]))
df_temp.index = new_index
e.把行索引拆分为原状态
re_index = df_temp.index.map(lambda x: tuple(x.split('_')))
df_temp.index = re_index
这里展开之后索引没名字了(怪不得下面还有两问)
f.修改索引名为原表名称
df_temp = df_temp.reindex_like(df_demo)
df_temp = df_temp.rename_axis(index=['gender', 'department'])
还是答案正确
g.恢复默认索引并将列保持为原表的相对位置
df_temp = df_temp.reset_index()
df_temp = df_temp.reorder_levels([3, 4, 5, 6, 0, 1, 2], axis=1)
df_temp = df_temp.reset_index().reindex(df.columns, axis=1)
总结
这次学习任务感觉索引的方法很多。就算看过并敲过一遍,然后在题目需要用到的时候还是想不到。还是不熟练的原因吧,还需多加练习,学习的路还很长。