Day3-Python索引(Datawhale)

一、索引器

import numpy as np
import pandas as pd

1.1 表的索引

列索引,通过[列名]实现:返回值为Series

[列名组成的列表]:返回值为DataFrame

.列名:取出单列并且列名不包含空格,等价于[列名]

df = pd.read_csv('data/learn_pandas.csv',
                usecols=['School','Grade','Name','Gender','Weight','Transfer'])
df['Name'].head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object
df[['Grade','Name']].head()
GradeName
0FreshmanGaopeng Yang
1FreshmanChangqiang You
2SeniorMei Sun
3SophomoreXiaojuan Sun
4SophomoreGaojuan You
df.Name.head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

1.2 序列的行索引

1.2.1 以字符串为索引的Series

如果取出单个索引的对应元素,可以使用[item],

若Series只有单个值对应,返回标量值,

多个值对应,返回Series

如果取出某两个索引之间的元素,并且这两个索引是在整个索引中唯一出现,则可以使用切片,切片包含两个端点

s = pd.Series([1, 2, 3, 4, 5, 6],
             index=['a','b','a','a','a','c'])
print(s)
s['a']
a    1
b    2
a    3
a    4
a    5
c    6
dtype: int64





a    1
a    3
a    4
a    5
dtype: int64
s['b']
2
s[['a','c']]
a    1
a    3
a    4
a    5
c    6
dtype: int64
s['c':'b':-2]
c    6
a    4
b    2
dtype: int64

1.2.2 以整数为索引的Series

不指定索引,会生成从0开始的整数索引

和字符串一样取出对应索引元素的值

整数切片不包含右端点

s = pd.Series(['a','b','c','d','e','f'],
             index=[1,3,1,2,5,4])
s[1]
1    a
1    c
dtype: object
s[1:-1:2]
3    b
2    d
dtype: object

说明:不要将纯浮点以及任何混合类型作为索引

1.3 loc索引器

基于元素的loc索引器,形式loc[,],第一个代表行的选择,第二个代表列的索引

loc[*]代表行的筛选

*的五类合法对象:单个元素、元素列表、元素切片、布尔列表、函数

Series可以使用loc索引

df_demo = df.set_index('Name')
df_demo.head()
SchoolGradeGenderWeightTransfer
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshmanFemale46.0N
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaojuan SunFudan UniversitySophomoreFemale41.0N
Gaojuan YouFudan UniversitySophomoreMale74.0N

1.3.1 *为单个元素

df_demo.loc['Qiang Sun']
SchoolGradeGenderWeightTransfer
Name
Qiang SunTsinghua UniversityJuniorFemale53.0N
Qiang SunTsinghua UniversitySophomoreFemale40.0N
Qiang SunShanghai Jiao Tong UniversityJuniorFemaleNaNN

选择行和列

df_demo.loc['Qiang Sun','School']
Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao','School']
'Shanghai Jiao Tong University'

1.3.2 *为元素列表

df_demo.loc[['Qiang Sun','Quan Zhao'],['School','Gender']]
SchoolGender
Name
Qiang SunTsinghua UniversityFemale
Qiang SunTsinghua UniversityFemale
Qiang SunShanghai Jiao Tong UniversityFemale
Quan ZhaoShanghai Jiao Tong UniversityFemale

1.3.3 *为切片

唯一值的起点和终点字符,可以使用切片,并且包含两个端点

df_demo.loc['Gaojuan You':'Gaoqiang Qian','School':'Gender']
SchoolGradeGender
Name
Gaojuan YouFudan UniversitySophomoreMale
Xiaoli QianTsinghua UniversityFreshmanFemale
Qiang ChuShanghai Jiao Tong UniversityFreshmanFemale
Gaoqiang QianTsinghua UniversityJuniorFemale

注意:整数索引的切片也是包含端点且起点、终点、不允许重复

df_demo_copy = df_demo.copy()
df_demo_copy.index = range(df_demo.shape[0],0,-1) #倒序
df_demo_copy.head()
SchoolGradeGenderWeightTransfer
200Shanghai Jiao Tong UniversityFreshmanFemale46.0N
199Peking UniversityFreshmanMale70.0N
198Shanghai Jiao Tong UniversitySeniorMale89.0N
197Fudan UniversitySophomoreFemale41.0N
196Fudan UniversitySophomoreMale74.0N
df_demo_copy.loc[5:3]
SchoolGradeGenderWeightTransfer
5Fudan UniversityJuniorFemale46.0N
4Tsinghua UniversitySeniorFemale50.0N
3Shanghai Jiao Tong UniversitySeniorFemale45.0N
df_demo_copy.loc[3:5] 
#没有返回,说明不存在切片的顺序,上面我们使用的是倒序,要保持一致,或者使用-1
SchoolGradeGenderWeightTransfer
df_demo_copy.loc[3:5:-1]
SchoolGradeGenderWeightTransfer
3Shanghai Jiao Tong UniversitySeniorFemale45.0N
4Tsinghua UniversitySeniorFemale50.0N
5Fudan UniversityJuniorFemale46.0N

1.3.4 *为布尔列表

根据条件来筛选行,传入loc的布尔列表与DataFrame长度一致,且True的位置被选中,False被剔除

#体重超过70kg的学生
df_demo.loc[df_demo.Weight>70].head()
SchoolGradeGenderWeightTransfer
Name
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Gaojuan YouFudan UniversitySophomoreMale74.0N
Xiaopeng ZhouShanghai Jiao Tong UniversityFreshmanMale74.0N
Xiaofeng SunTsinghua UniversitySeniorMale71.0N
Qiang ZhengShanghai Jiao Tong UniversitySeniorMale87.0N

使用isin方法返回的布尔列表

df_demo.loc[df_demo.Grade.isin(['Freshman','Senior'])].head()
SchoolGradeGenderWeightTransfer
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshmanFemale46.0N
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaoli QianTsinghua UniversityFreshmanFemale51.0N
Qiang ChuShanghai Jiao Tong UniversityFreshmanFemale52.0N

对于复合条件可以使用|(或),&(且),~(反)的组合来实现

#选出复旦大学体重超过70的大四学生或者北大男生体重超过80的非大四学生
x1 = df_demo.School == 'Fudan University'
x2 = df_demo.Grade == 'Senior'
x3 = df_demo.Weight>70
x = x1 & x2 & x3

y1 = df_demo.School == 'Peking University'
y2 = df_demo.Grade == 'Senior'
y3 = df_demo.Weight >80
y = y1 & (~y2) & y3

df_demo.loc[x | y]

SchoolGradeGenderWeightTransfer
Name
Qiang HanPeking UniversityFreshmanMale87.0N
Chengpeng ZhouFudan UniversitySeniorMale81.0N
Changpeng ZhaoPeking UniversityFreshmanMale83.0N
Chengpeng QianFudan UniversitySeniorMale73.0Y

1.3.5 练一练

select_dtypes是一个实用函数,能从表中选出相应类型的列,若要选出所有数值型的列,只需要使用.select_dtypes(‘number’),请使用布尔列表选择的方法集合DataFrame的dtypes属性在learn_pandas数据上实现功能

df_demo.select_dtypes('number').head()
Weight
Name
Gaopeng Yang46.0
Changqiang You70.0
Mei Sun89.0
Xiaojuan Sun41.0
Gaojuan You74.0
df_demo[df_demo.columns[df_demo.dtypes == 'float64']].head()
Weight
Name
Gaopeng Yang46.0
Changqiang You70.0
Mei Sun89.0
Xiaojuan Sun41.0
Gaojuan You74.0

1.3.6 *为函数

这里的函数,必须以前面的四种合法形式(单个元素、元素列表、切片、布尔型)之一为返回值,并给函数的输入值为DataFrame本身。函数的形式参数x本质即为df_demo

def condition(z):
    x1 = z.School == 'Fudan University'
    x2 = z.Grade == 'Senior'
    x3 = z.Weight>70
    x = x1 & x2 & x3
    y1 = z.School == 'Peking University'
    y2 = z.Grade == 'Senior'
    y3 = z.Weight >80
    y = y1 & (~y2) & y3
    result = x | y
    return result
df_demo.loc[condition]
SchoolGradeGenderWeightTransfer
Name
Qiang HanPeking UniversityFreshmanMale87.0N
Chengpeng ZhouFudan UniversitySeniorMale81.0N
Changpeng ZhaoPeking UniversityFreshmanMale83.0N
Chengpeng QianFudan UniversitySeniorMale73.0Y

支持lambda表达式

df_demo.loc[lambda x :'Quan Zhao',lambda x:'Gender']
'Female'

slice为切片函数,slice(start,stop,step)

df_demo.loc[lambda x: slice('Gaojuan You','Gaoqiang Qian')]
SchoolGradeGenderWeightTransfer
Name
Gaojuan YouFudan UniversitySophomoreMale74.0N
Xiaoli QianTsinghua UniversityFreshmanFemale51.0N
Qiang ChuShanghai Jiao Tong UniversityFreshmanFemale52.0N
Gaoqiang QianTsinghua UniversityJuniorFemale50.0N
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]],columns=list('AB'))
df_chain
AB
000
110
2-10
df_chain.loc[df_chain.A!=0,'B'] = 1 
#将df_chain中A列不为0的B列的值赋值为1
df_chain
AB
000
111
2-11

1.4 iloc索引器

iloc针对位置进行索引

五类合法对象:整数、整数列表、整数切片、布尔列表、函数

函数的返回值必须是整数、整数列表、整数切片、布尔列表之一

切片不好含结束端点

1.4.1 整数

df_demo.iloc[1,1] #第二行第二列
'Freshman'

1.4.2 整数列表

df_demo.iloc[[0,1],[0,1]] #前两行前两列
SchoolGrade
Name
Gaopeng YangShanghai Jiao Tong UniversityFreshman
Changqiang YouPeking UniversityFreshman

1.4.3 切片

df_demo.iloc[1:4,2:4] #1-3行,2-3列
GenderWeight
Name
Changqiang YouMale70.0
Mei SunMale89.0
Xiaojuan SunFemale41.0

1.4.4 布尔类型

布尔列表不能传入Series,必须传入values,布尔筛选时优先考虑从loc

df_demo.iloc[(df_demo.Weight>80).values].head()
SchoolGradeGenderWeightTransfer
Name
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Qiang ZhengShanghai Jiao Tong UniversitySeniorMale87.0N
Qiang HanPeking UniversityFreshmanMale87.0N
Chengpeng ZhouFudan UniversitySeniorMale81.0N
Feng HanShanghai Jiao Tong UniversitySophomoreMale82.0N

1.4.5 函数

df_demo.iloc[lambda x : slice(1,4)]  #1至3行所有的列
SchoolGradeGenderWeightTransfer
Name
Changqiang YouPeking UniversityFreshmanMale70.0N
Mei SunShanghai Jiao Tong UniversitySeniorMale89.0N
Xiaojuan SunFudan UniversitySophomoreFemale41.0N

Series序列可以通过iloc返回相应位置的值或子序列

df_demo.School
Name
Gaopeng Yang      Shanghai Jiao Tong University
Changqiang You                Peking University
Mei Sun           Shanghai Jiao Tong University
Xiaojuan Sun                   Fudan University
Gaojuan You                    Fudan University
                              ...              
Xiaojuan Sun                   Fudan University
Li Zhao                     Tsinghua University
Chengqiang Chu    Shanghai Jiao Tong University
Chengmei Shen     Shanghai Jiao Tong University
Chunpeng Lv                 Tsinghua University
Name: School, Length: 200, dtype: object
df_demo.School.iloc[1] #第一行的值
'Peking University'
df_demo.School.iloc[1:5:2] #不包含结尾,返回1、3行
Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Name: School, dtype: object

1.5 query方法

pandas支持将字符串形式的查询表达式传入query方法来查询数据,其表达式的执行结果必须返回布尔列表。

df.query('((School == "Fudan University")& ' #字符串需要使用双引号
         '(Grade == "Senior")& '
         '(Weight >70))|'
         '((School == "Peking University")&'
         '(Grade != "Senior")&'
         '(Weight > 80))')  
SchoolGradeNameGenderWeightTransfer
38Peking UniversityFreshmanQiang HanMale87.0N
66Fudan UniversitySeniorChengpeng ZhouMale81.0N
99Peking UniversityFreshmanChangpeng ZhaoMale83.0N
131Fudan UniversitySeniorChengpeng QianMale73.0Y

query可以直接调用列名和正常的函数调用无差别

df.query('Weight>Weight.mean()').head()
SchoolGradeNameGenderWeightTransfer
1Peking UniversityFreshmanChangqiang YouMale70.0N
2Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
4Fudan UniversitySophomoreGaojuan YouMale74.0N
10Shanghai Jiao Tong UniversityFreshmanXiaopeng ZhouMale74.0N
14Tsinghua UniversitySeniorXiaomei ZhouFemale57.0N

对于含有空格的列名,需要使用(`col name`)的方式引用

可以使用英语的字面用法:or、and、in 、not in

df.query('(Grade not in ["Freshman","Sophomore"]) and'
        '(Gender == "Male")').head()
SchoolGradeNameGenderWeightTransfer
2Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
16Tsinghua UniversityJuniorXiaoqiang QinMale68.0N
17Tsinghua UniversityJuniorPeng WangMale65.0N
18Tsinghua UniversitySeniorXiaofeng SunMale71.0N
21Shanghai Jiao Tong UniversitySeniorXiaopeng ShenMale62.0NaN

query引入外部变量,在变量前加@符号。

例如:取出体重位于70kg到80kg之间的学生

low, high = 70, 80
df.query('Weight >= @low and Weight <= @high').head()
#df.query('Weight.between (@low,@high)')的查询没有实现,转换为>=&<=
SchoolGradeNameGenderWeightTransfer
1Peking UniversityFreshmanChangqiang YouMale70.0N
4Fudan UniversitySophomoreGaojuan YouMale74.0N
10Shanghai Jiao Tong UniversityFreshmanXiaopeng ZhouMale74.0N
18Tsinghua UniversitySeniorXiaofeng SunMale71.0N
35Peking UniversityFreshmanGaoli ZhaoMale78.0N

1.6 随机抽样

将每一行看作样本,每一列看作一个特征,整个DataFrame看作总体,可以使用sample函数进行随机抽样。

sample函数参数:

n : 抽样数量

axis : 抽样的方向(0为行,1为列)

frac : 抽样比例(0.3表示从总体抽30%)

replace : 是否有放回,replace=True为有放回抽样

weights : 每个样本的抽样相对概率

例如:构造df_sample以value值的相对大小为抽样概率进行有放回抽样,抽样数量为3

df_sample = pd.DataFrame({'id' : list('abcde'),
                           'value' : [1, 2, 3, 4, 90]})
df_sample
idvalue
0a1
1b2
2c3
3d4
4e90
df_sample.sample(3, replace=True, weights=df_sample.value)
idvalue
4e90
4e90
4e90

二、多级索引

2.1 多级索引及其表的结构

索引的名字names

索引的值values

获取单层索引get_level_values(无法修改索引值)

np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'),
                                         df.Gender.unique()],names=('School','Gender'))
multi_column = pd.MultiIndex.from_product([['Height','Weight'],
                                          df.Grade.unique()],names=('Indicator','Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5+163).tolist(),
                             (np.random.randn(8,4)*5+65).tolist()],
                             index = multi_index,
                              columns = multi_column).round(1)
df_multi
IndicatorHeightWeight
GradeFreshmanSeniorSophomoreJuniorFreshmanSeniorSophomoreJunior
SchoolGender
AFemale171.8165.0167.9174.260.655.163.365.8
Male172.3158.1167.8162.271.271.063.163.5
BFemale162.5165.1163.7170.359.857.956.574.8
Male166.8163.6165.2164.762.562.858.768.9
CFemale170.5162.0164.6158.756.963.960.566.9
Male150.2166.3167.3159.362.459.164.967.1
DFemale174.3155.7163.2162.165.366.561.863.2
Male170.7170.3163.8164.961.663.260.956.4

在这里插入图片描述

注意:外层索引的值在第一次出现后会被隐藏

df_multi.index.names #列索引
df_multi.columns.names #行名称
FrozenList(['Indicator', 'Grade'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)
df_multi.columns.values
array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
df_multi.index.get_level_values(0)
Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')

2.2 多级索引中的loc索引器

df_multi_t = df.set_index(['School','Grade'])
df_multi_t.head()
NameGenderWeightTransfer
SchoolGrade
Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
Peking UniversityFreshmanChangqiang YouMale70.0N
Shanghai Jiao Tong UniversitySeniorMei SunMale89.0N
Fudan UniversitySophomoreXiaojuan SunFemale41.0N
SophomoreGaojuan YouMale74.0N

多级索引中的单个元素以元组为单位,使用loc和iloc方法时,只需要将标量的位置替换为对应的元组,不过在索引前最好对MultiIndex进行排序

df_multi_t = df_multi_t.sort_index() #对索引进行排序
df_multi_t.head()
NameGenderWeightTransfer
SchoolGrade
Fudan UniversityFreshmanChangqiang YangFemale49.0N
FreshmanGaoqiang QinFemale63.0N
FreshmanGaofeng ZhaoFemale43.0N
FreshmanYanquan WangFemale55.0N
FreshmanFeng WangMale74.0N

2.2.1 单一值

df_multi_t.loc[('Fudan University','Junior'),['Weight','Gender']].head()
#列名还是需要使用列表进行获取
WeightGender
SchoolGrade
Fudan UniversityJunior48.0Female
Junior72.0Male
Junior76.0Male
Junior49.0Female
Junior43.0Female

2.2.2 多值列表

df_multi_t.loc[[('Fudan University','Senior'),
                ('Shanghai Jiao Tong University','Freshman')]]
NameGenderWeightTransfer
SchoolGrade
Fudan UniversitySeniorChengpeng ZhengFemale38.0N
SeniorFeng ZhouFemale47.0N
SeniorGaomei LvFemale34.0N
SeniorChunli LvFemale56.0N
SeniorChengpeng ZhouMale81.0N
SeniorGaopeng QinFemale52.0N
SeniorChunjuan XuFemale47.0N
SeniorJuan ZhangFemale47.0N
SeniorChengpeng QianMale73.0Y
SeniorXiaojuan QianFemale50.0N
SeniorQuan XuFemale44.0N
Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
FreshmanQiang ChuFemale52.0N
FreshmanXiaopeng ZhouMale74.0N
FreshmanYanpeng LvMale65.0N
FreshmanXiaopeng ZhaoFemale53.0N
FreshmanChunli ZhaoMale83.0N
FreshmanPeng ZhangFemaleNaNN
FreshmanXiaoquan SunFemale40.0N
FreshmanChunmei ShiFemale52.0N
FreshmanXiaomei YangFemale49.0N
FreshmanXiaofeng QianFemale49.0N
FreshmanChangmei LvMale75.0N
FreshmanQiang FengMale80.0N

2.2.3 布尔类型

df_multi_t.loc[df_multi_t.Weight>70].head()
NameGenderWeightTransfer
SchoolGrade
Fudan UniversityFreshmanFeng WangMale74.0N
JuniorChunqiang ChuMale72.0N
JuniorChangfeng LvMale76.0N
SeniorChengpeng ZhouMale81.0N
SeniorChengpeng QianMale73.0Y

2.2.4 函数

df_multi_t.loc[lambda x:('Fudan University','Junior')].head()
NameGenderWeightTransfer
SchoolGrade
Fudan UniversityJuniorYanli YouFemale48.0N
JuniorChunqiang ChuMale72.0N
JuniorChangfeng LvMale76.0N
JuniorYanjuan LvFemale49.0NaN
JuniorGaoqiang ZhouFemale43.0N

2.2.5 练一练

与单层索引类似,若存在重复元素,则不能使用切片,请去除重复索引后给出一个元素切片的例子

df_multi_uni = df.drop_duplicates(['School','Grade']).set_index(['School', 'Grade'])  
df_multi_uni = df_multi_uni.sort_index()
df_multi_uni.loc[('Shanghai Jiao Tong University','Freshman'):('Tsinghua University','Junior')] 
NameGenderWeightTransfer
SchoolGrade
Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale46.0N
JuniorFeng ZhengFemale51.0N
SeniorMei SunMale89.0N
SophomoreYanfeng QianFemale48.0N
Tsinghua UniversityFreshmanXiaoli QianFemale51.0N
JuniorGaoqiang QianFemale50.0N

2.2.6 交叉组合索引

需要指定loc的列,全选为 : ,每一层需要选中的元素用列表存放,传入loc的形式为[(level_0_list,level_1_list),cols]

例:实现所有北大和复旦的大二大三同学

#使用之前的列表方法
res = df_multi_t.loc[[('Peking University','Sophomore'),
                     ('Peking University','Junior'),
                     ('Fudan University','Sophomore'),
                     ('Fudan University','Junior')]]
print(res.head())
print(res.shape)
                                     Name  Gender  Weight Transfer
School            Grade                                           
Peking University Sophomore   Changmei Xu  Female    43.0        N
                  Sophomore  Xiaopeng Qin    Male     NaN        N
                  Sophomore        Mei Xu  Female    39.0        N
                  Sophomore   Xiaoli Zhou  Female    55.0        N
                  Sophomore      Peng Han  Female    34.0      NaN
(33, 4)
#使用交叉组合
res_t = df_multi_t.loc[(['Peking University','Fudan University'],
                      ['Sophomore','Junior']),:]
print(res_t.head())
print(res_t.shape)
                                     Name  Gender  Weight Transfer
School            Grade                                           
Peking University Sophomore   Changmei Xu  Female    43.0        N
                  Sophomore  Xiaopeng Qin    Male     NaN        N
                  Sophomore        Mei Xu  Female    39.0        N
                  Sophomore   Xiaoli Zhou  Female    55.0        N
                  Sophomore      Peng Han  Female    34.0      NaN
(33, 4)

2.3 IndexSlice对象

对每层进行切片,将切片和布尔列表混用。slice的两种形式:loc[idx[,]]和loc[idx[,],idx[,]]

#构建索引不重复的DataFrame
np.random.seed(0)
L1, L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper','Lower'))
L3, L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big','Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
                     index = mul_index1,
                     columns = mul_index2)
df_ex
BigDEF
Smalldefdefdef
UpperLower
Aa36-9-6-6-209-5
b-33-8-3-258-44
c-107-466-99-6
Ba85-2-9-80-91-6
b29-7-9-9-5-4-3-1
c86-501-8-8-20
Ca-6-3259-95-63
b12-5-3-56-63-5
c-156-66478-4

定义slice

idx = pd.IndexSlice

2.3.1 loc[idx[,]]

这种情况不能进行多层切片,[,]表示[行,列],与loc一致

df_ex.loc[idx['C':,('D','f'):]]
BigDEF
Smallfdefdef
UpperLower
Ca259-95-63
b-5-3-56-63-5
c6-66478-4

布尔序列索引

此处存在疑问点???

df_ex.loc[idx[:'A',lambda x:x.sum()>0]] 
#列和大于0,为什么第一列会出现
BigDF
Smalldee
UpperLower
Aa369
b-33-4
c-109

2.3.2 loc[idx[,],idx[,]]

这种可以分层进行切片,前一个idx指行索引,后一个idx指列索引

df_ex.loc[idx[:'A','b':],idx['E':,'e':]]
BigEF
Smallefef
UpperLower
Ab-25-44
c669-6

2.4 多级索引的构造

set_index

pd.MultiIndex对象下:

from_tuples

from_arrays

from_product

2.4.1 from_tuples

根据传入由元组组成的列表进行构造

my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple , names=['first','second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['first', 'second'])

2.4.2 from_arrays

根据传入列表中的对应层的列表进行构造

my_array = [list('aabb'),['cat','dog']*2]
pd.MultiIndex.from_arrays(my_array,names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

2.4.3 from_product

根据给定多个列表的笛卡尔积进行构造

my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1,
                           my_list2],
                          names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

三、索引的常用方法

3.1 索引层的交换和删除

np.random.seed(0)
L1,L2,L3=['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
                                       names=('Upper','Lower','Extra'))
L4,L5,L6=['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
                                       names=('Big','Small','Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
                    index=mul_index1,
                    columns=mul_index2)
df_ex
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9
beta-9-5-4-3-186-5
balpha01-8-8-20-6-3
beta259-95-631

3.1.1 索引层的交换

由swaplevel和reorder_levels完成,前者交换两层,后者交换任意层,两者都可以指定交换的轴式哪一个,即行索引和列索引

df_ex.swaplevel(0,2,axis=1).head() 
#axis=1表示列,0代表第一行,2代表第三行。整个的意思是将列的第一行和第三行互换
Othercatdogcatdogcatdogcatdog
Smallccddccdd
BigCCCCDDDD
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9
df_ex.reorder_levels([2,0,1],axis=0).head() 
#[2,0,1]代表现在排在第1位的是原来的第3位,排在第2位的是原来的第1位,排在第3位的是原来的第2位,选取行索引
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
ExtraUpperLower
alphaAa36-9-6-6-209
betaAa-5-33-8-3-258
alphaAb-44-107-466
betaAb-99-685-2-9-8
alphaBa0-91-629-7-9

3.1.2 索引层的删除

droplevel

df_ex.droplevel(1,axis=1) #删除列索引的第2个
BigCD
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9
beta-9-5-4-3-186-5
balpha01-8-8-20-6-3
beta259-95-631
df_ex.droplevel([0,1],axis=0) #删除行索引的第1个和第2个
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
Extra
alpha36-9-6-6-209
beta-5-33-8-3-258
alpha-44-107-466
beta-99-685-2-9-8
alpha0-91-629-7-9
beta-9-5-4-3-186-5
alpha01-8-8-20-6-3
beta259-95-631

3.2 索引属性的修改

3.2.1 索引层名字的修改

使用rename_axis对索引层的名字进行修改,常见形式是传入字典的映射

df_ex.rename_axis(index={'Upper':'Changed_row'},
                 columns={'Other':'Change_Col'}).head()
BigCD
Smallcdcd
Change_Colcatdogcatdogcatdogcatdog
Changed_rowLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

3.2.2 索引层值的修改

rename对索引的值进行修改,如果是多级索引需要指定修改的层号level

df_ex.rename(columns={'cat':'not_cat'},
            level=2).head()
BigCD
Smallcdcd
Othernot_catdognot_catdognot_catdognot_catdog
UpperLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

传入参数为函数,输入值就是索引元素

df_ex.rename(index=lambda x:str.upper(x),
            level=2).head()
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
AaALPHA36-9-6-6-209
BETA-5-33-8-3-258
bALPHA-44-107-466
BETA-99-685-2-9-8
BaALPHA0-91-629-7-9

3.2.3 练一练

在rename_axis中使用函数完成和例子中一样的功能

df_ex.rename_axis(index=lambda x:'Changed_Row' if x=='Upper' else x,
                  columns=lambda x:'Changed_Col' if x=='Other' else x ).head() 
BigCD
Smallcdcd
Changed_Colcatdogcatdogcatdogcatdog
Changed_RowLowerExtra
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

3.2.4 对整个索引的元素进行替换

  1. 使用迭代器
new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values),
            level=2)
#迭代器的元素一定要与替换的元素数量保持一致
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaa36-9-6-6-209
b-5-33-8-3-258
bc-44-107-466
d-99-685-2-9-8
Bae0-91-629-7-9
f-9-5-4-3-186-5
bg01-8-8-20-6-3
h259-95-631

思考:如果对于第一层索引的修改,设置的值是几个?首先考虑设置的值为8个,测试时发现使用两个会报错,因此要使用8个

new_values = iter(list('AB'*4))
df_ex.rename(index=lambda x:next(new_values),
            level=0)
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
Aaalpha36-9-6-6-209
Babeta-5-33-8-3-258
Abalpha-44-107-466
Bbbeta-99-685-2-9-8
Aaalpha0-91-629-7-9
Babeta-9-5-4-3-186-5
Abalpha01-8-8-20-6-3
Bbbeta259-95-631

单层索引的迭代可以取出索引的values属性,再给定列表,最后进行index对象重新赋值

多级索引,先把某一层索引临时转为表的元素,再进行修改,最后重新设定为索引

2.map

定义在Index上的方法,与rename方法中函数用法类似,它传入的不是层的标量值,是直接传入索引的元组,可以提供跨层的遍历修改

#字符串转大写
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x:(x[0],
                                     x[1],
                                     str.upper(x[2])))
df_temp.index = new_idx
df_temp.head()
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
UpperLowerExtra
AaALPHA36-9-6-6-209
BETA-5-33-8-3-258
bALPHA-44-107-466
BETA-99-685-2-9-8
BaALPHA0-91-629-7-9

map的另一个使用方法:对多级索引的压缩

df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x :(x[0]+'-'+
                                      x[1]+'-'+
                                      x[2]))
df_temp.index = new_idx
df_temp.head()#变为单层索引
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
A-a-alpha36-9-6-6-209
A-a-beta-5-33-8-3-258
A-b-alpha-44-107-466
A-b-beta-99-685-2-9-8
B-a-alpha0-91-629-7-9

反向展开

new_idx = df_temp.index.map(lambda x :tuple(x.split('-')))
#使用split将数据分隔开,然后再通过tuple转换为数组
df_temp.index = new_idx
df_temp.head() #三层索引
BigCD
Smallcdcd
Othercatdogcatdogcatdogcatdog
Aaalpha36-9-6-6-209
beta-5-33-8-3-258
balpha-44-107-466
beta-99-685-2-9-8
Baalpha0-91-629-7-9

3.3 索引的设置和重置

df_new = pd.DataFrame({'A':list('aacd'),
                      'B':list('PQRT'),
                      'C':[1,2,3,4]})
df_new
ABC
0aP1
1aQ2
2cR3
3dT4

3.3.1 索引的设置

使用set_index完成,参数:append,表示是否保留原来的索引

df_new.set_index('A')
BC
A
aP1
aQ2
cR3
dT4
df_new.set_index('A',append=True)
BC
A
0aP1
1aQ2
2cR3
3dT4

指定多列为索引

df_new.set_index(['A','B'])
C
AB
aP1
Q2
cR3
dT4

添加新列作为索引,在参数中传入Series

df_new = pd.DataFrame({'A':list('aacd'),
                      'B':list('PQRT'),
                      'C':[1,2,3,4]})
my_index = pd.Series(list('WXYZ'),name='D')
df_new = df_new.set_index(['A',my_index])
df_new
BC
AD
aWP1
XQ2
cYR3
dZT4

3.3.2 索引的重置

reset_index是set_index的逆函数,主要参数是drop,表示是否将去掉索引层丢弃(不添加到列中)

df_new.reset_index(['D'])
DBC
A
aWP1
aXQ2
cYR3
dZT4
df_new.reset_index(['A'],drop=True)
BC
D
WP1
XQ2
YR3
ZT4

重置索引

df_new.reset_index()
ADBC
0aWP1
1aXQ2
2cYR3
3dZT4

3.4 索引的变形

对索引进行扩充或者剔除,指定一个新的索引或将原表相应的索引对应元素填充到新索引构成的表中

使用reindex

例:在新表中增加一名员工的同时去掉身高列并增加性别列

df_reindex = pd.DataFrame({'Weight':[60,70,80],
                           'Height':[176,180,179]},
                         index = ['1001','1003','1002'])
df_reindex
WeightHeight
100160176
100370180
100280179
df_reindex.reindex(index=['1001','1003','1002','1004'],
                   columns=['Weight','Gender'])
WeightGender
100160.0NaN
100370.0NaN
100280.0NaN
1004NaNNaN

reindex_like

仿照传入的表的索引来进行被调用表索引的变形

df_reindex
WeightHeight
100160176
100370180
100280179
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'],
                         columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
WeightGender
100160.0NaN
100280.0NaN
100370.0NaN
1004NaNNaN

四、索引运算

4.1 集合的运算法则

A并B: 属于A并且属于B (A.intersection(B) <> A&B)

A交B: 属于A或者属于B (A.union(B) <
> A|B)

A减B: 属于A但是不属于B (A.difference(B) <> (A^B)&A)

A和B的对称差: A交B-A并B (A.symmetric_difference(B) <
> A^B)

4.2 一般的索引运算

由于集合的元素是不重复的,先用unique去重后再进行运算

df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]],
                       index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]],
                       index = pd.Index(['b','b','c'],name='id2'))
id1,id2 = df_set_1.index.unique(),df_set_2.index.unique()

id1.intersection(id2) 
Index(['b'], dtype='object')
id1.union(id2)
Index(['a', 'b', 'c'], dtype='object')
id1.difference(id2)
Index(['a'], dtype='object')
id1.symmetric_difference(id2)
Index(['a', 'c'], dtype='object')

若两张表需要做集合运算的列并没有被设置为索引,一种方法:先转换为索引set_index,运算后再恢复。第二种:使用isin函数

例:选出id列交集所在的行

df_set_in_col1 = df_set_1.reset_index()
df_set_in_col1
id101
0a01
1b12
2a34
df_set_in_col2 = df_set_2.reset_index()
df_set_in_col2
id201
0b45
1b26
2c71
df_set_in_col1[df_set_in_col1.id1.isin(df_set_in_col2.id2)]
id101
1b12

五、练习

5.1 公司员工数据集

df = pd.read_csv('data/Company.csv')
df.head()
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
013181/3/195461VancouverExecutiveCEOM
113191/3/195758VancouverExecutiveVP StoresF
213201/2/195560VancouverExecutiveLegal CounselF
313211/2/195956VancouverExecutiveVP Human ResourcesM
413221/9/195857VancouverExecutiveVP FinanceM

1.分别使用query和loc选出年龄不超过四十岁且工作部门为Dairy或Bakery的男性

使用query

思路:就按照条件一级一级的去拼写,但是在将==换成is in 的时候会报错,之后测试发现正确的写法是in 没有is

df.query('((age <= 40)&'
        ' (department in ["Dairy","Bakery"])&'
         '(gender in "M"))'
        ).shape
(441, 7)

使用loc

思路:loc需要使用布尔型来找到符合条件的值,最后再通过shape去判断查询的结果是否一致

x1 = df.age<=40
x2 = df.department.isin(["Dairy","Bakery"])
x3 = df.gender == 'M'
x = x1 & x2 & x3
df.loc[x].shape
(441, 7)

2.选出员工ID号为奇数所在行的第1、第3和倒数第2列

思路:首先使用不被2整除来判断ID号为奇数,这个是列值的判断,我们需要使用loc;然后找出第1,3,倒数第2列,这个是位置的索引,需要使用iloc,注意是从0开始计算

(df.loc[df.EmployeeID%2!=0]).iloc[:,[0,2,-2]].head()
EmployeeIDagejob_title
1131958VP Stores
3132156VP Human Resources
5132353Exec Assistant, VP Stores
6132551Exec Assistant, Legal Counsel
8132948Store Manager

3.按照步骤进行索引操作

3.1 将后三列设为索引后交换内外两层

思路:首先使用columns取出后三列,然后使用set_index将后三列设置为索引,因为只交换两层,我们使用swaplevel对列索引的位置进行交换

df.columns[-4:-1].values
array(['city_name', 'department', 'job_title'], dtype=object)
df_demo = df.set_index(list(df.columns[-4:-1].values))
df_demo.head()
EmployeeIDbirthdate_keyagegender
city_namedepartmentjob_title
VancouverExecutiveCEO13181/3/195461M
VP Stores13191/3/195758F
Legal Counsel13201/2/195560F
VP Human Resources13211/2/195956M
VP Finance13221/9/195857M
df_demo.swaplevel(0,2,axis=0).head()
EmployeeIDbirthdate_keyagegender
job_titledepartmentcity_name
CEOExecutiveVancouver13181/3/195461M
VP StoresExecutiveVancouver13191/3/195758F
Legal CounselExecutiveVancouver13201/2/195560F
VP Human ResourcesExecutiveVancouver13211/2/195956M
VP FinanceExecutiveVancouver13221/9/195857M

3.2 恢复中间一层

恢复就是使用reset_index, 可以通过df_demo.index.names[1]找到索引第1列的名字

df_demo1 = df_demo.reset_index(df_demo.index.names[1])
df_demo1.head()
departmentEmployeeIDbirthdate_keyagegender
city_namejob_title
VancouverCEOExecutive13181/3/195461M
VP StoresExecutive13191/3/195758F
Legal CounselExecutive13201/2/195560F
VP Human ResourcesExecutive13211/2/195956M
VP FinanceExecutive13221/9/195857M

3.3 修改外索引名为Gender

直接使用rename_axis,对index进行修改

df_demo1.rename_axis(index={'city_name':'Gender'}).head()
departmentEmployeeIDbirthdate_keyagegender
Genderjob_title
VancouverCEOExecutive13181/3/195461M
VP StoresExecutive13191/3/195758F
Legal CounselExecutive13201/2/195560F
VP Human ResourcesExecutive13211/2/195956M
VP FinanceExecutive13221/9/195857M

3.4 是下划线合并两层行索引

合并索引使用map函数

df_demo1.head()
departmentEmployeeIDbirthdate_keyagegender
city_namejob_title
VancouverCEOExecutive13181/3/195461M
VP StoresExecutive13191/3/195758F
Legal CounselExecutive13201/2/195560F
VP Human ResourcesExecutive13211/2/195956M
VP FinanceExecutive13221/9/195857M
df_temp = df_demo1.copy()
new_idx = df_temp.index.map(lambda x :(x[0]+'_'+
                            x[1]))
df_temp.index = new_idx
df_temp.head()
departmentEmployeeIDbirthdate_keyagegender
Vancouver_CEOExecutive13181/3/195461M
Vancouver_VP StoresExecutive13191/3/195758F
Vancouver_Legal CounselExecutive13201/2/195560F
Vancouver_VP Human ResourcesExecutive13211/2/195956M
Vancouver_VP FinanceExecutive13221/9/195857M

3.5 将行索引拆分为原状态

反向使用map和split

new_idx = df_temp.index.map(lambda x :tuple(x.split('_')))
df_temp.index = new_idx
df_temp.head()
departmentEmployeeIDbirthdate_keyagegender
VancouverCEOExecutive13181/3/195461M
VP StoresExecutive13191/3/195758F
Legal CounselExecutive13201/2/195560F
VP Human ResourcesExecutive13211/2/195956M
VP FinanceExecutive13221/9/195857M

3.6 修改索引名为原表名称

之前使用的是copy出来的df_temp,现在先将df_demo1的索引名称选取出来,再将名称赋值给df_temp.index.names

new_name = df_demo1.index.names
new_name
FrozenList(['city_name', 'job_title'])
df_temp.index.names = new_name
df_temp.head(1)
departmentEmployeeIDbirthdate_keyagegender
city_namejob_title
VancouverCEOExecutive13181/3/195461M

3.7 恢复默认索引并将列保持为原表的相对位置

先将索引还原,然后利用loc按照顺序获取值之后再将它赋值到本身

df_new = df_temp.reset_index().head()
df_new
city_namejob_titledepartmentEmployeeIDbirthdate_keyagegender
0VancouverCEOExecutive13181/3/195461M
1VancouverVP StoresExecutive13191/3/195758F
2VancouverLegal CounselExecutive13201/2/195560F
3VancouverVP Human ResourcesExecutive13211/2/195956M
4VancouverVP FinanceExecutive13221/9/195857M
cols = list(df.columns)
df_new = df_new.loc[:,cols]
df_new
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
013181/3/195461VancouverExecutiveCEOM
113191/3/195758VancouverExecutiveVP StoresF
213201/2/195560VancouverExecutiveLegal CounselF
313211/2/195956VancouverExecutiveVP Human ResourcesM
413221/9/195857VancouverExecutiveVP FinanceM

5.2 巧克力数据集

df = pd.read_csv('data/chocolate.csv')
df.head(3)
CompanyReview\nDateCocoa\nPercentCompany\nLocationRating
0A. Morin201663%France3.75
1A. Morin201570%France2.75
2A. Morin201570%France3.00
  1. 把列索引名中的\n替换为空格

    思路:列索引的替换思维,将原列索引取出来后进行置换后再赋值回原列索引
df_demo = df.copy()
df_demo.columns = df_demo.columns.map(lambda x:x.replace('\n',' '))
df_demo.head()
CompanyReview DateCocoa PercentCompany LocationRating
0A. Morin201663%France3.75
1A. Morin201570%France2.75
2A. Morin201570%France3.00
3A. Morin201570%France3.50
4A. Morin201570%France3.50
  1. 巧克力Rating评分为1至5,每0.25分一个档,请选出2.75分及以下且可可含量Coco Percent高于中位数的样本

我们发现Cocoa Percent存的是字符,我们需要将它转换为float的型进行中位数的计算;字段包含由空格,要注意使用"`"

df_demo['Cocoa Percent'] = df_demo['Cocoa Percent'].apply(lambda x: float(x[:-1])/100)
df_demo.head()
CompanyReview DateCocoa PercentCompany LocationRating
0A. Morin20160.63France3.75
1A. Morin20150.70France2.75
2A. Morin20150.70France3.00
3A. Morin20150.70France3.50
4A. Morin20150.70France3.50
df_demo.query('((Rating <=2.75) &'
              '(`Cocoa Percent` >= `Cocoa Percent`.median()))').head()
CompanyReview DateCocoa PercentCompany LocationRating
1A. Morin20150.70France2.75
5A. Morin20140.70France2.75
10A. Morin20130.70France2.75
14A. Morin20130.70France2.75
33Akesson's (Pralus)20100.75Switzerland2.75
  1. 将Review Date和Company Location设为索引后,选出Review Data在2012年之后且Company Location不属于France,Canada,Amsterdam,Belgium

思路:我考虑的是通过将2012年以后的年份和需要筛选的Location条件筛选后作为需要去查找的条件,再通过两者求得两者的笛卡尔积找出最后符合条件的值。(我觉得这不是一个很好的方法,但是是目前所能想到的)

df_demo1 = df.copy()
df_demo1.columns = df_demo1.columns.map(lambda x:x.replace('\n',' '))
df_demo1 = df_demo1.set_index(['Review Date','Company Location'])
date = df_demo1.index.get_level_values(0)
date1 = list(set(date))
location = df_demo1.index.get_level_values(1)
location1 = list(set(location))
x = [i for i in date1 if i>2012 ]
y = [i for i in location1 if i not in  ['France','Canada','Amsterdam','Belgium']]
df_demo1.loc[(x,y),:].head(2)
CompanyCocoa PercentRating
Review DateCompany Location
2016AustriaMartin Mayer76%2.75
AustriaMartin Mayer82%3.00
  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值