5 pandas single-level indexing: loc, iloc, [] + interval indexing

Michael_Flemming

Last modified 2022-07-23 21:02:21


First published 2022-07-23 13:05:25

Article link: https://blog.csdn.net/weixin_44360866/article/details/125938787


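Single-level indexing comes in three forms: loc (label-based), iloc (position-based), and the [] operator.

loc, in essence: besides single labels and label slices, loc takes either a boolean list or a list of labels drawn from the index.
iloc, in essence: iloc takes only an integer, a list of integers, an integer slice, or a boolean list or array. It does not accept a boolean Series; to use one, pull out its .values first.
In a slice with two colons, the last position is the step: ::2 takes every second element.
In general, [] is used for column selection or boolean selection; avoid it for selecting rows.

Knowing loc and iloc covers most needs; the behavior of [] is somewhat inconsistent.

Each indexer is covered below in five forms: single row, multiple rows, slice, callable, and boolean.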

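Functions used along the way:

all() and any()
Both default to axis=0, which reduces down the rows: all() checks whether every value in a column is true and returns True if so; any() returns True if at least one value is true.
all(1) passes 1 as axis, which reduces across the columns: a row gives True only if all of its values are True. Writing it as all(axis=1) is clearer.
join() merges tables; it is covered in the chapter on merging.
reset_index() turns the current index into a column and replaces it with integers starting at 0.
set_index(column name) makes a column the index.
To swap the index with a value column, chain the two: .reset_index().set_index(column name)
astype(dtype) converts the data type. The dtype can be given as a string, for example df.index.astype('interval').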
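As a rough sketch of the helpers listed above, here they are on a tiny made-up frame. The name `toy` and its columns are hypothetical and are not the student table used in the rest of these notes.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {"a": [True, True, False], "b": [True, False, False]},
    index=[10, 20, 30],
)

col_all = toy.all()         # axis=0: one result per column
row_all = toy.all(axis=1)   # axis=1: one result per row
row_any = toy.any(axis=1)

# Swap the index with column "a": the old index becomes a column named "index"
swapped = toy.reset_index().set_index("a")

print(col_all, row_all, row_any, swapped, sep="\n\n")
```

Passing axis by keyword keeps the direction of the reduction visible at the call site.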

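1. loc: label-based indexing

Row indexing

print(df.loc[1105])  # single row
print()
print(df.loc[[1101, 1105]])  # multiple rows: the labels go in an inner list
print()
print(df.loc[1110:1220])  # slice: label slices in loc include both endpoints
print()
print(df.loc[2402::-1].head())  # note the two colons: -1 is the step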

School          S_1
Class           C_1
Gender            F
Address    street_4
Height          159
Weight           64
Math           84.8
Physics          B+
Name: 1105, dtype: object

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1105    S_1   C_1      F  street_4     159      64  84.8      B+

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1204    S_1   C_2      F  street_5     162      63  33.8       B
1205    S_1   C_2      F  street_6     167      63  68.4      B-

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2402    S_2   C_4      M  street_7     166      82  48.7       B
2401    S_2   C_4      F  street_2     192      62  45.3       A
2305    S_2   C_3      M  street_4     187      73  48.9       B
2304    S_2   C_3      F  street_6     164      81  95.5      A-
2303    S_2   C_3      F  street_7     190      99  65.9       C
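The difference between label slices and position slices is easy to check on a small frame. This is only a sketch; `toy` and its values are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame with a sorted integer index
toy = pd.DataFrame(
    {"Height": [173, 192, 186, 167]},
    index=[1101, 1102, 1103, 1104],
)

by_label = toy.loc[1102:1104]       # label slice: both endpoints are included
by_position = toy.iloc[1:3]         # position slice: the stop is excluded
reversed_rows = toy.loc[1103::-1]   # start at label 1103 and walk backwards

print(by_label, by_position, reversed_rows, sep="\n\n")
```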

Column indexing

print(df.loc[:, 'Height'].head())  # single column
print(df.loc[:, ['Height', 'Weight']].head())  # multiple columns: again wrapped in an inner list
print(df.loc[:, 'Height':'Math'].head())  # slice

ID
1101    173
1102    192
1103    186
1104    167
1105    159
Name: Height, dtype: int64
      Height  Weight
ID                  
1101     173      63
1102     192      73
1103     186      82
1104     167      81
1105     159      64
      Height  Weight  Math
ID                        
1101     173      63  34.0
1102     192      73  32.5
1103     186      82  87.2
1104     167      81  80.4
1105     159      64  84.8

Combined row and column indexing

print(df.loc[1104:2401, 'Height':'Math'].head())

      Height  Weight  Math
ID                        
1104     167      81  80.4
1105     159      64  84.8
1201     188      68  97.0
1202     176      94  63.5
1203     160      53  58.8

Callable indexing

print(df.loc[lambda x: x['Gender'] == 'F'].head())
# equivalent to the line below
# print(df.loc[df['Gender']=='F'])

    School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1204    S_1   C_2      F  street_5     162      63  33.8       B

Boolean indexing ★★
For boolean indexing, loc accepts a Series of dtype bool, an ndarray, or a list.

print(df.loc[df['Address'].isin(['street_7', 'street_4'])].head())
print(df.loc[[x[-1] in '47' for x in df['Address']]].head())
# the two lines above give the same result

    School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
1303    S_1   C_3      M  street_7     188      82  49.7       B
2101    S_2   C_1      M  street_7     174      84  83.3       C
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
1303    S_1   C_3      M  street_7     188      82  49.7       B
2101    S_2   C_1      M  street_7     174      84  83.3       C
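A small check that the three mask types really are interchangeable in loc. This is a sketch on a made-up frame; `toy` and the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Address": ["street_4", "street_1", "street_7"]},
    index=[1105, 1101, 1303],
)

mask_series = toy["Address"].isin(["street_4", "street_7"])  # boolean Series
mask_array = mask_series.to_numpy()                          # boolean ndarray
mask_list = mask_array.tolist()                              # plain list of bool

# loc returns the same rows for all three mask types
picked = [toy.loc[m] for m in (mask_series, mask_array, mask_list)]
print(picked[0])
```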

2. iloc: position-based indexing

Row indexing

print(df.iloc[3])  # single row
print(df.iloc[[3, 5]])  # multiple rows
print(df.iloc[3:5])  # slice: the stop position is excluded

School          S_1
Class           C_1
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 1104, dtype: object
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1201    S_1   C_2      M  street_5     188      68  97.0      A-
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

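Column indexing

print(df.iloc[:, 3].head())  # single column
print(df.iloc[:, 7::-2].head())  # multiple columns: from position 7 backwards in steps of 2

Combined row and column indexing

print(df.iloc[::2, 7::-2].head())

Callable indexing

print(df.iloc[lambda x: [3]])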

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1104    S_1   C_1      F  street_2     167      81  80.4      B-

Boolean indexing ★

iloc takes only an integer, a list of integers, or a boolean list or array. It does not accept a boolean Series; to use one, pull out its .values as shown here.

print(df.iloc[(df['Gender'] == 'M').values].head())
# print(df.iloc[df['Gender']=='M'].head())  # wrong! It raises:
# NotImplementedError: iLocation based boolean indexing on an integer type is not available
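The same restriction can be reproduced on a made-up frame. This is a sketch: `toy` is hypothetical, and `to_numpy()` is used here in place of `.values` for the same purpose. The exact exception type may differ depending on the index, so both are caught.

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["M", "F", "M"]}, index=[1101, 1102, 1103])
mask = toy["Gender"] == "M"           # boolean Series

males = toy.iloc[mask.to_numpy()]     # works: a plain boolean ndarray

try:
    toy.iloc[mask]                    # a boolean Series is rejected by iloc
    series_rejected = False
except (NotImplementedError, ValueError):
    series_rejected = True

print(males)
print(series_rejected)
```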

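3. The [] operator

[] on a Series

Summary: do not use [] for row selection on a Series or a DataFrame. The rules are hard to remember, so just use loc and iloc.

Note: loc and iloc work on a Series as well.

For rows, [] on a DataFrame is positional only; it has no label-based form.
Take one column of df as a Series; its index is still df's index:

s = pd.Series(df['Math'], index=df.index)  # gives the same data as s = df['Math']

Row indexing on the Series follows (a Series has only row indexing, since it is a single column).

# [] on a Series
s = pd.Series(df['Math'], index=df.index)
print(s[1101])  # label lookup, single element
print(s[0:4])  # an integer slice is positional
# print(s[0])  # error: with an integer index a bare integer is looked up as a label, and 0 is not a label here
# print(s[0, 2])  # also raises an error

print(s[lambda x: x.index[16::-6]])  # callable indexing
# I have not fully understood the note below
# Note: with a lambda, slicing directly (for example s[lambda x: 16::-6]) raises an error. In the working form above, x.index[16::-6] slices the index by position and returns labels, so the selection itself is done by label rather than by absolute position. This is very easy to get wrong.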

Boolean indexing

print((s > 80).head())  # the boolean mask itself
print(s[s > 80].head())  # the elements where the mask is True

ID
1101    False
1102    False
1103     True
1104     True
1105     True
Name: Math, dtype: bool
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
Name: Math, dtype: float64

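The following is hard to remember.

# [Note] To stay out of trouble, do not use [] when the row index is float. In the pandas version used here, a slice in [] on a float-indexed Series is not treated as positions but compared against the index values, which is a very special case.
s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
s_float = pd.Series([1, 2, 3, 4], index=[1., 3., 5., 6.])
print(s_int[2:])  # integer index: 2 is a position
print(s_float[2:])  # float index: 2 is compared with the index values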
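One way to sidestep the ambiguity is to state the intent with iloc or loc instead of []. A minimal sketch, reusing the two Series from the note above (the result names are made up):

```python
import pandas as pd

s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
s_float = pd.Series([1, 2, 3, 4], index=[1.0, 3.0, 5.0, 6.0])

by_position = s_int.iloc[2:]   # always positions: the elements at positions 2 and 3
by_label = s_float.loc[2:]     # always labels: every label >= 2

print(by_position)
print(by_label)
```

With the indexer spelled out, the result does not depend on the dtype of the index.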

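[] on a DataFrame

Summary: use [] on a DataFrame only for column selection; its row selection is awkward. In general, [] is used for column selection or boolean selection; avoid it for rows.
Selecting a single row is a special case: only positional indexing works, and it must be written as a slice, otherwise the key is treated as a column label and raises an error.
Note: there is no df[label] form for rows; that belongs to loc. So row selection with [] on a DataFrame is positional only, never label-based.

print(df[1:2])

If you insist on using a label with [] to get a row: first find the position of that label, then take the row with [].

row = df.index.get_loc(1102)  # returns the position of the label
print(df[row:row + 1])

Multiple rows: only the slice form is available

# Multiple rows, slice form. To pick specific, non-adjacent rows, loc is recommended; [] is error-prone
print(df[2:5])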

Column indexing

# single column
print(df['School'].head())

# multiple columns
print(df[['School', 'Math']].head())  # as with loc and iloc, an extra inner [] is needed

Callable indexing

print(df[lambda x: ['Math', 'Physics']].head())

Boolean indexing ★

print(df[df['Gender'] == 'F'].head())  # much like loc: test the values of a column to get a boolean Series, then keep the rows where it is True
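To see how [] on a DataFrame dispatches on the type of its key, here is a sketch on a small made-up frame (`toy` and the result names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"School": ["S_1", "S_1", "S_2"], "Math": [34.0, 32.5, 87.2]},
    index=[1101, 1102, 1103],
)

one_column = toy["Math"]               # string key: a column, returned as a Series
two_columns = toy[["School", "Math"]]  # list key: several columns
row_slice = toy[1:2]                   # slice key: rows by position
masked = toy[toy["Math"] > 33]         # boolean Series key: rows

pos = toy.index.get_loc(1102)          # label to position
same_row = toy[pos:pos + 1]

print(row_slice)
print(masked)
```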

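4. Boolean indexing

The operators & | ~ combine boolean data element by element.
How it works: loc, iloc and [] all support boolean indexing. Each takes a one-dimensional boolean sequence and returns the rows where it is True.
Note that iloc does not accept a boolean Series, but .values converts the Series to an ndarray, which it does accept.
loc and [] handle boolean indexing in the same way.

Using the boolean operators

In a single position. Each comparison needs its own parentheses, because & | ~ bind more tightly than == and >.

print('&:')
print(df.iloc[((df['Gender'] == 'F') & (df['Address'] == 'street_2')).values].head())
print('|:')
print(df[(df['Math'] > 85) | (df['Address'] == 'street_1')].head())
print('~:')
print(df[(~(df['Physics'] == 'B-') | (df['Gender'] == 'M'))].head())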

&:
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
2401    S_2   C_4      F  street_2     192      62  45.3       A
2404    S_2   C_4      F  street_2     160      84  67.7       B
|:
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1302    S_1   C_3      F  street_1     175      57  87.7      A-
1304    S_1   C_3      M  street_2     195      70  85.2       A
~:
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
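The three operators can be tried out on a made-up frame small enough to check by eye. This is a sketch; `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Math": [90.0, 50.0, 70.0], "Gender": ["F", "M", "F"]},
    index=[1, 2, 3],
)

# Every comparison is wrapped in parentheses before combining
both = toy[(toy["Math"] > 60) & (toy["Gender"] == "F")]    # and
either = toy[(toy["Math"] > 80) | (toy["Gender"] == "M")]  # or
negated = toy[~(toy["Gender"] == "F")]                     # not

print(both, either, negated, sep="\n\n")
```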

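In two positions: loc accepts a boolean mask for both the rows and the columns

# With loc, both the row position and the column position accept a boolean mask
# df.columns == 'Physics' is a one-dimensional ndarray of dtype bool
print(df.loc[df['Math'] > 60, df.columns == 'Physics'].head())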

     Physics
ID          
1103      B+
1104      B-
1105      B+
1201      A-
1202      B-
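A sketch of the two-mask form on a made-up frame, printing the column mask so its shape is visible (`toy` and the mask names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Math": [34.0, 87.2, 80.4], "Physics": ["A+", "B+", "B-"]},
    index=[1101, 1103, 1104],
)

row_mask = toy["Math"] > 60            # boolean Series aligned on the index
col_mask = toy.columns == "Physics"    # 1-D boolean ndarray, one entry per column

subset = toy.loc[row_mask, col_mask]

print(col_mask)
print(subset)
```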

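The isin() method

# Two conditions, the usual way: combine the two boolean Series with & first, then index with the result
print(df[df.loc[:, 'Address'].isin(['street_1', 'street_4']) & df['Physics'].isin(['A', 'A+'])].head())
# Dictionary form: select both columns and call isin() with a dict. The result is a boolean DataFrame with two columns, so all() is needed to reduce it to a one-dimensional Series.
# all() defaults to axis=0 (down the rows); pass axis=1 here because we want to reduce across the columns.
print(df[df[['Address', 'Physics']].isin({'Address': ['street_1', 'street_4'], 'Physics': ['A', 'A+']}).all(axis=1)])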

isin:
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2105    S_2   C_1      M  street_4     170      81  34.2       A
2203    S_2   C_2      M  street_4     155      91  73.8      A+
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2105    S_2   C_1      M  street_4     170      81  34.2       A
2203    S_2   C_2      M  street_4     155      91  73.8      A+
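The intermediate boolean DataFrame in the dictionary form is worth looking at once. A sketch on made-up data (`toy`, `wanted` and the other names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Address": ["street_1", "street_4", "street_2"], "Physics": ["A+", "B", "A"]},
    index=[1101, 1105, 1103],
)

wanted = {"Address": ["street_1", "street_4"], "Physics": ["A", "A+"]}
per_cell = toy[["Address", "Physics"]].isin(wanted)  # boolean DataFrame, one column per key
row_mask = per_cell.all(axis=1)                      # True only where every column matched
result = toy[row_mask]

print(per_cell)
print(result)
```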

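5. Fast scalar access: at (label) and iat (position)

When only a single element is needed, at and iat provide a faster implementation.
timeit can be used to compare the timings.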

print(df.at[1101, 'School'])
print(df.loc[1101, 'School'])
print(df.iat[5, 5])
print(df.iloc[5, 5])


S_1
S_1
68
68
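A rough way to run that comparison with timeit is sketched below on a made-up two-row frame. The name `toy` and the repeat count are arbitrary choices, and the timings depend on the machine and the pandas version, so no numbers are quoted here.

```python
import timeit

import pandas as pd

toy = pd.DataFrame(
    {"School": ["S_1", "S_2"], "Weight": [63, 73]},
    index=[1101, 1102],
)

# at/iat return the same scalars as loc/iloc
same_label = toy.at[1101, "School"] == toy.loc[1101, "School"]
same_position = toy.iat[1, 1] == toy.iloc[1, 1]

# Time 2000 scalar lookups with each accessor
t_at = timeit.timeit(lambda: toy.at[1101, "School"], number=2000)
t_loc = timeit.timeit(lambda: toy.loc[1101, "School"], number=2000)
print(f"at: {t_at:.4f}s  loc: {t_loc:.4f}s")
```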

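6. Interval indexing

a. The interval_range() function

So far I have not really felt the benefit of interval indexing; to pull out the rows whose values fall in a range, boolean indexing seems entirely sufficient.

pd.Interval() creates a single interval; its type is Interval.
interval_range() builds a sequence of intervals from a start, an end, a step and a count: periods sets the number of intervals and freq sets the step.
The closed parameter takes 'left', 'right', 'both' or 'neither'; the default is 'right', that is, open on the left and closed on the right (both of the above accept this parameter).

print(pd.Interval(70, 85, closed='left'))  # Interval creates a single interval
# closed takes 'left', 'right', 'both' or 'neither'; the default is open on the left and closed on the right
# periods sets the number of intervals, freq sets the step
print(pd.interval_range(start=0, periods=10, freq=5, closed='both'))
print(pd.interval_range(start=0, end=5, periods=2, closed='neither'))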

[70, 85)
IntervalIndex([[0, 5], [5, 10], [10, 15], [15, 20], [20, 25], [25, 30], [30, 35], [35, 40], [40, 45], [45, 50]],
              closed='both',
              dtype='interval[int64]')
IntervalIndex([(0.0, 2.5), (2.5, 5.0)],
              closed='neither',
              dtype='interval[float64]')
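What closed means in practice shows up when testing membership with `in`. A small sketch (the variable names are made up):

```python
import pandas as pd

left_closed = pd.Interval(70, 85, closed="left")  # [70, 85)
default = pd.Interval(70, 85)                     # (70, 85], since closed defaults to "right"

# Membership of the two endpoints in each interval
contains = (70 in left_closed, 85 in left_closed, 70 in default, 85 in default)

bins = pd.interval_range(start=0, end=100, periods=4)  # four equal-width intervals

print(contains)
print(bins)
```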

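b. cut(): converting numbers to interval values

Suppose we want to bin the math scores in df and pick out the rows whose score falls in certain intervals.
First take the df['Math'] column, then use cut() to convert its values to intervals.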

math_interval = pd.cut(df['Math'], bins=[0, 40, 60, 80, 100])
print(math_interval.head())

ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]
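By default cut() produces intervals closed on the right; the right=False argument flips that, which matters for values that sit exactly on a bin edge. A sketch with made-up scores (`scores` and `edges` are hypothetical names):

```python
import pandas as pd

scores = pd.Series([34.0, 60.0, 87.2], index=[1101, 1102, 1103])  # hypothetical scores
edges = [0, 40, 60, 80, 100]

right_closed = pd.cut(scores, bins=edges)              # (0, 40], (40, 60], ...
left_closed = pd.cut(scores, bins=edges, right=False)  # [0, 40), [40, 60), ...

# The score 60.0 lands in a different bin depending on which side is closed
print(right_closed.loc[1102])
print(left_closed.loc[1102])
```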

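Note that the dtype returned by cut above is category.
Now we want to use math_interval as the index.
First join, then take the two columns 'Math' and 'Math_interval', then make 'Math_interval' the index while keeping the original ID index as a column, as follows: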

df_i = df.join(math_interval, rsuffix='_interval')[['Math', 'Math_interval']].reset_index().set_index('Math_interval')
print(df_i.head())

                 ID  Math
Math_interval            
(0, 40]        1101  34.0
(0, 40]        1102  32.5
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8

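Indexing df_i with a number selects the rows whose interval contains that number.

print(df_i.head())
print(df_i.loc[65])  # rows whose interval contains 65
print(df_i.loc[[65, 90]].head())  # rows whose interval contains either value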

                 ID  Math
Math_interval            
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5
(60, 80]       2203  73.8
(60, 80]       2301  72.3
(60, 80]       2303  65.9
(60, 80]       2404  67.7
                 ID  Math
Math_interval            
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5
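The lookup by a contained number can also be sketched on a frame whose index is a genuine IntervalIndex. This is an assumption made for the example: the notes above index a categorical of intervals instead, and `scores`, `by_bin` and the values are made up.

```python
import pandas as pd

scores = pd.Series([34.0, 63.5, 87.2, 68.4], index=[1101, 1202, 1103, 1205])
bins = pd.cut(scores, bins=[0, 40, 60, 80, 100])

# Build the frame directly on an IntervalIndex (one interval per row)
by_bin = pd.DataFrame(
    {"ID": list(scores.index), "Math": scores.tolist()},
    index=pd.IntervalIndex(list(bins)),
)

hits = by_bin.loc[65]  # every row whose interval contains 65
print(hits)
```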

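To select every row whose interval overlaps a given interval (this example is arguably boolean indexing again):

# To select by an interval, first convert the categorical index to an interval index, then use the overlaps method (in effect this is boolean indexing)
# The idea: we want to pick rows of df_i with an interval, but df_i[interval] raises an error, so overlaps is used instead
# df_i[pd.Interval(60, 80, closed='right')]  # raises an error
# "Convert to an interval index" means: although df_i's index is made up of intervals, its type is still a categorical index, and it has to be converted to the interval dtype
print(df[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))])
# df_i.index.astype('interval').overlaps(pd.Interval(70, 85)) above is a boolean array
# astype() converts the data type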
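The overlap test can be checked by hand on three made-up scores. In this sketch the IntervalIndex is built with the pd.IntervalIndex constructor rather than astype('interval'), which is a swapped-in route to the same dtype; all names are hypothetical.

```python
import pandas as pd

scores = pd.Series([34.0, 63.5, 87.2], index=[1101, 1202, 1103])
bins = pd.cut(scores, bins=[0, 40, 60, 80, 100])   # categorical of intervals

as_intervals = pd.IntervalIndex(list(bins))        # genuine interval dtype
mask = as_intervals.overlaps(pd.Interval(70, 85))  # boolean ndarray, one entry per row
selected = scores[mask]

print(mask)
print(selected)
```

Here (60, 80] and (80, 100] both share points with (70, 85], while (0, 40] does not.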