The Three Musketeers of Python Data Analysis, Pandas (Part 2): The Index Object and Indexing Operations

marraybug

Article link: https://blog.csdn.net/marraybug/article/details/118527983

The NumPy and Matplotlib series are also complete:

- NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
- Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

Recommended learning resources and sites (the author helped translate parts of the documentation):

- NumPy Chinese-language site: https://www.numpy.org.cn/
- Pandas Chinese-language site: https://www.pypandas.cn/
- Matplotlib Chinese-language site: https://www.matplotlib.org.cn/
- NumPy, Matplotlib, and Pandas cheat sheets: https://github.com/TRHX/Python-quick-reference-table

Contents

- 【1】The Index Object
- 【2】General Indexing in Pandas
  - 【2.1】Series Indexing
    - 【2.1.1】head() / tail()
    - 【2.1.2】Row Indexing
    - 【2.1.3】Slice Indexing
    - 【2.1.4】Fancy Indexing
    - 【2.1.5】Boolean Indexing
  - 【2.2】DataFrame Indexing
    - 【2.2.1】head() / tail()
    - 【2.2.2】Column Indexing
    - 【2.2.3】Slice Indexing
    - 【2.2.4】Fancy Indexing
    - 【2.2.5】Boolean Indexing
- 【3】Indexers: loc and iloc
  - 【3.1】loc: Label-Based Indexing
    - 【3.1.1】Series.loc
    - 【3.1.2】DataFrame.loc
  - 【3.2】iloc: Position-Based Indexing
    - 【3.2.1】Series.iloc
    - 【3.2.2】DataFrame.iloc
- 【4】Reindexing in Pandas

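    This is an anti-scraping notice; readers may ignore it.
    This article was first published on CSDN. Author: TRHX.
    Blog home page: https://itrhx.blog.csdn.net/
    Original article: https://itrhx.blog.csdn.net/article/details/106698307
    Do not repost without permission. Respect original work.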

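【1】The Index Object

The index of both a Series and a DataFrame is an Index object. To keep the data safe, Index objects are immutable, and trying to modify one in place raises an error. Common Index types include the generic Index, the integer index (Int64Index), the hierarchical index (MultiIndex), and the timestamp index (DatetimeIndex). The outputs in this article appear to come from a pandas 1.x release; pandas 2.0 removed Int64Index and Float64Index, so the same calls there display a plain Index with the matching dtype.

The following code demonstrates an Index object and its immutability: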

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj.index
    Index(['a', 'b', 'c', 'd'], dtype='object')
    >>> type(obj.index)
    <class 'pandas.core.indexes.base.Index'>
    >>> obj.index[0] = 'e'
    Traceback (most recent call last):
      File "<pyshell#28>", line 1, in <module>
        obj.index[0] = 'e'
      File "C:\Users\...\base.py", line 3909, in __setitem__
        raise TypeError("Index does not support mutable operations")
    TypeError: Index does not support mutable operations
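
Although individual elements cannot be changed, the index as a whole can be replaced. The sketch below illustrates this under the same setup as above; the variable names `mutated` and `new_labels` are made up for the example.

```python
import pandas as pd

obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])

# Assigning to a single element of an Index raises TypeError.
try:
    obj.index[0] = 'e'
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index with a new sequence of the same length is allowed.
obj.index = ['e', 'b', 'c', 'd']
new_labels = list(obj.index)
print(mutated, new_labels)
```

Replacing the index attaches a new Index object to the Series; the old Index object itself is never modified.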

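Common attributes of the Index object

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.Index.html

Attribute	Description
T	Transpose; for a one-dimensional Index this is the Index itself
array	The array that backs the index
dtype	Data type of the index values
hasnans	Whether the index contains missing values (NaN)
inferred_type	Type inferred from the values, returned as a string
is_monotonic	Alias for is_monotonic_increasing (removed in pandas 2.0)
is_monotonic_decreasing	Whether each value is less than or equal to the previous one
is_monotonic_increasing	Whether each value is greater than or equal to the previous one
is_unique	Whether all values are unique
nbytes	Number of bytes in the underlying data
ndim	Number of dimensions, always 1 for an Index
nlevels	Number of levels (1 unless the index is a MultiIndex)
shape	Shape of the underlying data, as a tuple
size	Number of elements
values	The index values as a NumPy array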

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj.index
    Index(['a', 'b', 'c', 'd'], dtype='object')
    >>> 
    >>> obj.index.array
    <PandasArray>
    ['a', 'b', 'c', 'd']
    Length: 4, dtype: object
    >>> 
    >>> obj.index.dtype
    dtype('O')
    >>> 
    >>> obj.index.hasnans
    False
    >>>
    >>> obj.index.inferred_type
    'string'
    >>> 
    >>> obj.index.is_monotonic
    True
    >>>
    >>> obj.index.is_monotonic_decreasing
    False
    >>> 
    >>> obj.index.is_monotonic_increasing
    True
    >>> 
    >>> obj.index.is_unique
    True
    >>> 
    >>> obj.index.nbytes
    16
    >>>
    >>> obj.index.ndim
    1
    >>>
    >>> obj.index.nlevels
    1
    >>>
    >>> obj.index.shape
    (4,)
    >>> 
    >>> obj.index.size
    4
    >>> 
    >>> obj.index.values
    array(['a', 'b', 'c', 'd'], dtype=object)
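
The attributes are more informative on an index that is not as tidy as the one above. The following sketch uses a made-up float index with a duplicate and a NaN; it relies on `is_monotonic_increasing` rather than the `is_monotonic` alias, which newer pandas releases no longer provide.

```python
import numpy as np
import pandas as pd

# Hypothetical index with an unsorted order, a duplicate (1.0), and a NaN.
idx = pd.Index([3.0, 1.0, np.nan, 1.0])

has_missing = idx.hasnans                 # a NaN is present
all_unique = idx.is_unique                # 1.0 appears twice
increasing = idx.is_monotonic_increasing  # 3.0 is followed by 1.0
print(has_missing, all_unique, increasing)
print(idx.shape, idx.size, idx.ndim)
```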

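Common methods of the Index object

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.Index.html

Method	Description
all(self, *args, **kwargs)	Whether all elements are truthy; any 0 makes the result False
any(self, *args, **kwargs)	Whether at least one element is truthy; the result is False only when every element is 0
append(self, other)	Concatenate another index, producing a new index
argmax(self[, axis, skipna])	Integer position of the largest value in the index
argmin(self[, axis, skipna])	Integer position of the smallest value in the index
argsort(self, *args, **kwargs)	Integer positions that would sort the index in ascending order
delete(self, loc)	Remove the element at the given position and return the new index
difference(self, other[, sort])	Elements of the first index that are not in the second (set difference)
drop(self, labels[, errors])	Return a new index with the given labels removed
drop_duplicates(self[, keep])	Remove duplicates. keep options: 'first' keeps the first occurrence; 'last' keeps the last occurrence; False drops every value that is duplicated
duplicated(self[, keep])	Flag duplicates. keep options: 'first' marks every occurrence except the first as True; 'last' marks every occurrence except the last as True; False marks all duplicated values as True
dropna(self[, how])	Remove missing values (NaN)
fillna(self[, value, downcast])	Fill missing values (NaN) with the given value
equals(self, other)	Whether two indexes contain the same elements
insert(self, loc, item)	Insert an item at the given position and return the new index
intersection(self, other[, sort])	Intersection of two indexes
isna(self)	Detect which elements are missing values (NaN)
isnull(self)	Detect which elements are missing values (NaN); alias of isna
max(self[, axis, skipna])	Largest value in the index
min(self[, axis, skipna])	Smallest value in the index
union(self, other[, sort])	Union of two indexes
unique(self[, level])	Unique values in the index, that is, with duplicates removed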

all(self, *args, **kwargs) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([1, 2, 3]).all()
    True
    >>>
    >>> pd.Index([0, 1, 2]).all()
    False

any(self, *args, **kwargs) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([0, 0, 1]).any()
    True
    >>>
    >>> pd.Index([0, 0, 0]).any()
    False

append(self, other) 【official docs】

    >>> import pandas as pd
    >>> pd.Index(['a', 'b', 'c']).append(pd.Index([1, 2, 3]))
    Index(['a', 'b', 'c', 1, 2, 3], dtype='object')

argmax(self[, axis, skipna]) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).argmax()
    3

argmin(self[, axis, skipna]) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).argmin()
    4

argsort(self, *args, **kwargs) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).argsort()
    array([4, 1, 2, 0, 3], dtype=int32)

delete(self, loc) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).delete(0)
    Int64Index([2, 3, 9, 1], dtype='int64')

difference(self, other[, sort]) 【official docs】

    >>> import pandas as pd
    >>> idx1 = pd.Index([2, 1, 3, 4])
    >>> idx2 = pd.Index([3, 4, 5, 6])
    >>> idx1.difference(idx2)
    Int64Index([1, 2], dtype='int64')
    >>> idx1.difference(idx2, sort=False)
    Int64Index([2, 1], dtype='int64')

drop(self, labels[, errors]) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).drop([2, 1])
    Int64Index([5, 3, 9], dtype='int64')

drop_duplicates(self[, keep]) 【official docs】

    >>> import pandas as pd
    >>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'])
    >>> idx.drop_duplicates(keep='first')
    Index(['lama', 'cow', 'beetle', 'hippo'], dtype='object')
    >>> idx.drop_duplicates(keep='last')
    Index(['cow', 'beetle', 'lama', 'hippo'], dtype='object')
    >>> idx.drop_duplicates(keep=False)
    Index(['cow', 'beetle', 'hippo'], dtype='object')

duplicated(self[, keep]) 【official docs】

    >>> import pandas as pd
    >>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
    >>> idx.duplicated()
    array([False, False,  True, False,  True])
    >>> idx.duplicated(keep='first')
    array([False, False,  True, False,  True])
    >>> idx.duplicated(keep='last')
    array([ True, False,  True, False, False])
    >>> idx.duplicated(keep=False)
    array([ True, False,  True, False,  True])

dropna(self[, how]) 【official docs】

    >>> import numpy as np
    >>> import pandas as pd
    >>> pd.Index([2, 5, np.nan, 6, np.nan, np.nan]).dropna()
    Float64Index([2.0, 5.0, 6.0], dtype='float64')

fillna(self[, value, downcast]) 【official docs】

    >>> import numpy as np
    >>> import pandas as pd
    >>> pd.Index([2, 5, np.nan, 6, np.nan, np.nan]).fillna(5)
    Float64Index([2.0, 5.0, 5.0, 6.0, 5.0, 5.0], dtype='float64')

equals(self, other) 【official docs】

    >>> import pandas as pd
    >>> idx1 = pd.Index([5, 2, 3, 9, 1])
    >>> idx2 = pd.Index([5, 2, 3, 9, 1])
    >>> idx1.equals(idx2)
    True
    >>> 
    >>> idx1 = pd.Index([5, 2, 3, 9, 1])
    >>> idx2 = pd.Index([5, 2, 4, 9, 1])
    >>> idx1.equals(idx2)
    False

intersection(self, other[, sort]) 【official docs】

    >>> import pandas as pd
    >>> idx1 = pd.Index([1, 2, 3, 4])
    >>> idx2 = pd.Index([3, 4, 5, 6])
    >>> idx1.intersection(idx2)
    Int64Index([3, 4], dtype='int64')

insert(self, loc, item) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).insert(2, 'A')
    Index([5, 2, 'A', 3, 9, 1], dtype='object')

isna(self) 【official docs】, isnull(self) 【official docs】

    >>> import numpy as np
    >>> import pandas as pd
    >>> pd.Index([2, 5, np.nan, 6, np.nan, np.nan]).isna()
    array([False, False,  True, False,  True,  True])
    >>> pd.Index([2, 5, np.nan, 6, np.nan, np.nan]).isnull()
    array([False, False,  True, False,  True,  True])

max(self[, axis, skipna]) 【official docs】, min(self[, axis, skipna]) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 2, 3, 9, 1]).max()
    9
    >>> pd.Index([5, 2, 3, 9, 1]).min()
    1

union(self, other[, sort]) 【official docs】

    >>> import pandas as pd
    >>> idx1 = pd.Index([1, 2, 3, 4])
    >>> idx2 = pd.Index([3, 4, 5, 6])
    >>> idx1.union(idx2)
    Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

unique(self[, level]) 【official docs】

    >>> import pandas as pd
    >>> pd.Index([5, 1, 3, 5, 1]).unique()
    Int64Index([5, 1, 3], dtype='int64')
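
Because an Index is immutable, each of the methods above returns a new object and leaves the original untouched. The sketch below puts the set-style methods side by side; `symmetric_difference` is not covered in the table above and is included here only as an additional illustration.

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])

common = idx1.intersection(idx2)            # labels present in both
only_first = idx1.difference(idx2)          # labels only in idx1
either = idx1.union(idx2)                   # labels in at least one
exclusive = idx1.symmetric_difference(idx2) # labels in exactly one

print(list(common), list(only_first), list(either), list(exclusive))
# idx1 itself is unchanged by all of the calls above.
print(list(idx1))
```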

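【2】General Indexing in Pandas

Pandas also offers more advanced indexing operations, such as reindexing and hierarchical indexing, so this article groups ordinary slice indexing, fancy indexing, and boolean indexing under the label "general indexing".

【2.1】Series Indexing

【2.1.1】head() / tail()

Series.head() and Series.tail() return the first five and last five rows by default; passing an integer n returns the first or last n rows instead: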

    >>> import pandas as pd
    >>> import numpy as np
    >>> obj = pd.Series(np.random.randn(8))
    >>> obj
    0   -0.643437
    1   -0.365652
    2   -0.966554
    3   -0.036127
    4    1.046095
    5   -2.048362
    6   -1.865551
    7    1.344728
    dtype: float64
    >>> 
    >>> obj.head()
    0   -0.643437
    1   -0.365652
    2   -0.966554
    3   -0.036127
    4    1.046095
    dtype: float64
    >>> 
    >>> obj.head(3)
    0   -0.643437
    1   -0.365652
    2   -0.966554
    dtype: float64
    >>>
    >>> obj.tail()
    3   -0.036127
    4    1.046095
    5   -2.048362
    6   -1.865551
    7    1.344728
    dtype: float64
    >>>
    >>> obj.tail(3)
    5   -2.048362
    6   -1.865551
    7    1.344728
    dtype: float64
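
The argument to head() and tail() may also be negative, in which case it means "everything except that many rows". A minimal sketch with deterministic data (the variable names are illustrative only):

```python
import pandas as pd

obj = pd.Series(range(8))

first_three = obj.head(3)           # rows 0, 1, 2
last_two = obj.tail(2)              # rows 6, 7
all_but_last_three = obj.head(-3)   # everything except the last three rows
all_but_first_three = obj.tail(-3)  # everything except the first three rows

print(list(all_but_last_three.index))
print(list(all_but_first_three.index))
```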

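【2.1.2】Row Indexing

A Series can be indexed by position or by index label, and it also supports dictionary-style expressions and methods for retrieving values. Newer pandas releases deprecate treating an integer key as a position when the index is not integer-based (as with obj[2] below), so obj.iloc[2] is the explicit way to index by position: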

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj
    a    1
    b    5
    c   -8
    d    2
    dtype: int64
    >>> obj['c']
    -8
    >>> obj[2]
    -8
    >>> 'b' in obj
    True
    >>> obj.keys()
    Index(['a', 'b', 'c', 'd'], dtype='object')
    >>> list(obj.items())
    [('a', 1), ('b', 5), ('c', -8), ('d', 2)]
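
The dictionary analogy extends to lookups with a fallback. The sketch below uses Series.get, which returns a default instead of raising KeyError, and uses iloc for the positional lookup; the label 'z' is simply a label that does not exist in this Series.

```python
import pandas as pd

obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])

value_by_label = obj['c']          # label lookup
value_by_position = obj.iloc[2]    # explicit positional lookup
missing = obj.get('z', default=0)  # 'z' is absent, so the default is returned
no_default = obj.get('z')          # without a default, None is returned

print(value_by_label, value_by_position, missing, no_default)
```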

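【2.1.3】Slice Indexing

There are two ways to slice: by position and by index label. Note that a positional slice excludes the end position, whereas a label slice includes the end label.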

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj
    a    1
    b    5
    c   -8
    d    2
    dtype: int64
    >>>
    >>> obj[1:3]
    b    5
    c   -8
    dtype: int64
    >>>
    >>> obj[0:3:2]
    a    1
    c   -8
    dtype: int64
    >>>
    >>> obj['b':'d']
    b    5
    c   -8
    d    2
    dtype: int64
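
The different endpoint rules can be checked directly. In this sketch the positional slice is written with iloc and the label slice with loc so that the intent is explicit; both select the same two elements.

```python
import pandas as pd

obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])

by_position = obj.iloc[1:3]    # positions 1 and 2; position 3 is excluded
by_label = obj.loc['b':'c']    # labels 'b' and 'c'; the end label is included

print(by_position.equals(by_label))
print(len(obj.loc['b':'d']), len(obj.iloc[1:3]))
```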

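【2.1.4】Fancy Indexing

Fancy indexing selects elements that are spaced apart or non-contiguous: pass a list of index labels or a list of positions to retrieve several elements at once: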

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj
    a    1
    b    5
    c   -8
    d    2
    dtype: int64
    >>> 
    >>> obj[[0, 2]]
    a    1
    c   -8
    dtype: int64
    >>> 
    >>> obj[['a', 'c', 'd']]
    a    1
    c   -8
    d    2
    dtype: int64
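
The same selections can be written with the explicit indexers, which avoids any ambiguity between labels and positions. A short sketch (note that the result follows the order of the list that is passed in):

```python
import pandas as pd

obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])

by_positions = obj.iloc[[0, 2]]      # positions 0 and 2
by_labels = obj.loc[['a', 'c', 'd']] # labels 'a', 'c', 'd'
reordered = obj.loc[['d', 'a']]      # result order follows the list

print(list(by_positions.index), list(by_labels.index), list(reordered.index))
```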

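【2.1.5】Boolean Indexing

A boolean array can be used to index a Series: a boolean expression (for example, a comparison) produces the mask, and indexing with the mask returns the elements that satisfy the condition.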

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2, -3], index=['a', 'b', 'c', 'd', 'e'])
    >>> obj
    a    1
    b    5
    c   -8
    d    2
    e   -3
    dtype: int64
    >>> 
    >>> obj[obj > 0]
    a    1
    b    5
    d    2
    dtype: int64
    >>> 
    >>> obj > 0
    a     True
    b     True
    c    False
    d     True
    e    False
    dtype: bool
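
Conditions can be combined with the element-wise operators `&`, `|`, and `~`. Each comparison needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on the same data:

```python
import pandas as pd

obj = pd.Series([1, 5, -8, 2, -3], index=['a', 'b', 'c', 'd', 'e'])

small_positive = obj[(obj > 0) & (obj < 5)]      # both conditions hold
negative_or_large = obj[(obj < 0) | (obj >= 5)]  # at least one holds
not_positive = obj[~(obj > 0)]                   # negated mask

print(list(small_positive.index))
print(list(negative_or_large.index))
print(list(not_positive.index))
```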

【2.2】DataFrame Indexing

【2.2.1】head() / tail()

As with a Series, DataFrame.head() and DataFrame.tail() return the first five and last five rows by default; passing an integer n returns the first or last n rows:

    >>> import pandas as pd
    >>> import numpy as np
    >>> obj = pd.DataFrame(np.random.randn(8,4), columns = ['a', 'b', 'c', 'd'])
    >>> obj
              a         b         c         d
    0 -1.399390  0.521596 -0.869613  0.506621
    1 -0.748562 -0.364952  0.188399 -1.402566
    2  1.378776 -1.476480  0.361635  0.451134
    3 -0.206405 -1.188609  3.002599  0.563650
    4  0.993289  1.133748  1.177549 -2.562286
    5 -0.482157  1.069293  1.143983 -1.303079
    6 -1.199154  0.220360  0.801838 -0.104533
    7 -1.359816 -2.092035  2.003530 -0.151812
    >>> 
    >>> obj.head()
              a         b         c         d
    0 -1.399390  0.521596 -0.869613  0.506621
    1 -0.748562 -0.364952  0.188399 -1.402566
    2  1.378776 -1.476480  0.361635  0.451134
    3 -0.206405 -1.188609  3.002599  0.563650
    4  0.993289  1.133748  1.177549 -2.562286
    >>> 
    >>> obj.head(3)
              a         b         c         d
    0 -1.399390  0.521596 -0.869613  0.506621
    1 -0.748562 -0.364952  0.188399 -1.402566
    2  1.378776 -1.476480  0.361635  0.451134
    >>>
    >>> obj.tail()
              a         b         c         d
    3 -0.206405 -1.188609  3.002599  0.563650
    4  0.993289  1.133748  1.177549 -2.562286
    5 -0.482157  1.069293  1.143983 -1.303079
    6 -1.199154  0.220360  0.801838 -0.104533
    7 -1.359816 -2.092035  2.003530 -0.151812
    >>> 
    >>> obj.tail(3)
              a         b         c         d
    5 -0.482157  1.069293  1.143983 -1.303079
    6 -1.199154  0.220360  0.801838 -0.104533
    7 -1.359816 -2.092035  2.003530 -0.151812

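【2.2.2】Column Indexing

A DataFrame can be indexed by column label (columns). A single label returns a Series, while a list of labels returns a DataFrame: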

    >>> import pandas as pd
    >>> import numpy as np
    >>> obj = pd.DataFrame(np.random.randn(7,2), columns = ['a', 'b'])
    >>> obj
              a         b
    0 -1.198795  0.928378
    1 -2.878230  0.014650
    2  2.267475  0.370952
    3  0.639340 -1.301041
    4 -1.953444  0.148934
    5 -0.445225  0.459632
    6  0.097109 -2.592833
    >>>
    >>> obj['a']
    0   -1.198795
    1   -2.878230
    2    2.267475
    3    0.639340
    4   -1.953444
    5   -0.445225
    6    0.097109
    Name: a, dtype: float64
    >>> 
    >>> obj[['a']]
              a
    0 -1.198795
    1 -2.878230
    2  2.267475
    3  0.639340
    4 -1.953444
    5 -0.445225
    6  0.097109
    >>> 
    >>> type(obj['a'])
    <class 'pandas.core.series.Series'>
    >>> type(obj[['a']])
    <class 'pandas.core.frame.DataFrame'>

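【2.2.3】Slice Indexing

Slicing a DataFrame with [] operates on rows. There are two ways to slice: by position and by index label. Note that a positional slice excludes the end position, whereas a label slice includes the end label.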

    >>> import pandas as pd
    >>> import numpy as np
    >>> data = np.random.randn(5,4)
    >>> index = ['I1', 'I2', 'I3', 'I4', 'I5']
    >>> columns = ['a', 'b', 'c', 'd']
    >>> obj = pd.DataFrame(data, index, columns)
    >>> obj
               a         b         c         d
    I1  0.828676 -1.663337  1.753632  1.432487
    I2  0.368138  0.222166  0.902764 -1.436186
    I3  2.285615 -2.415175 -1.344456 -0.502214
    I4  3.224288 -0.500268  1.293596 -1.235549
    I5 -0.938833 -0.804433 -0.170047 -0.566766
    >>> 
    >>> obj[0:3]
               a         b         c         d
    I1  0.828676 -1.663337  1.753632  1.432487
    I2  0.368138  0.222166  0.902764 -1.436186
    I3  2.285615 -2.415175 -1.344456 -0.502214
    >>>
    >>> obj[0:4:2]
               a         b         c         d
    I1  0.828676 -1.663337  1.753632  1.432487
    I3  2.285615 -2.415175 -1.344456 -0.502214
    >>>
    >>> obj['I2':'I4']
               a         b         c         d
    I2  0.368138  0.222166  0.902764 -1.436186
    I3  2.285615 -2.415175 -1.344456 -0.502214
    I4  3.224288 -0.500268  1.293596 -1.235549
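
One consequence is easy to trip over: a slice inside [] selects rows, but a single label inside [] is looked up among the columns. The sketch below uses deterministic data in place of the random values; `single_label_is_row` is a made-up flag for the example.

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(20).reshape(5, 4),
                   index=['I1', 'I2', 'I3', 'I4', 'I5'],
                   columns=['a', 'b', 'c', 'd'])

rows = obj['I2':'I4']  # label slice: rows I2 through I4, end included

# A single label in [] is treated as a column name, so a row label fails.
try:
    obj['I2']
    single_label_is_row = True
except KeyError:
    single_label_is_row = False

print(list(rows.index), single_label_is_row)
```

To fetch a single row by label, obj.loc['I2'] is the appropriate form.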

【2.2.4】Fancy Indexing

As with a Series, fancy indexing selects non-contiguous items: pass a list of column labels (columns) to retrieve several columns at once:

    >>> import pandas as pd
    >>> import numpy as np
    >>> data = np.random.randn(5,4)
    >>> index = ['I1', 'I2', 'I3', 'I4', 'I5']
    >>> columns = ['a', 'b', 'c', 'd']
    >>> obj = pd.DataFrame(data, index, columns)
    >>> obj
               a         b         c         d
    I1 -1.083223 -0.182874 -0.348460 -1.572120
    I2 -0.205206 -0.251931  1.180131  0.847720
    I3 -0.980379  0.325553 -0.847566 -0.882343
    I4 -0.638228 -0.282882 -0.624997 -0.245980
    I5 -0.229769  1.002930 -0.226715 -0.916591
    >>> 
    >>> obj[['a', 'd']]
               a         d
    I1 -1.083223 -1.572120
    I2 -0.205206  0.847720
    I3 -0.980379 -0.882343
    I4 -0.638228 -0.245980
    I5 -0.229769 -0.916591

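【2.2.5】Boolean Indexing

A boolean DataFrame can likewise be used as a mask: a boolean expression (for example, a comparison) marks the elements that satisfy the condition, and positions where the mask is False become NaN in the result.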

    >>> import pandas as pd
    >>> import numpy as np
    >>> data = np.random.randn(5,4)
    >>> index = ['I1', 'I2', 'I3', 'I4', 'I5']
    >>> columns = ['a', 'b', 'c', 'd']
    >>> obj = pd.DataFrame(data, index, columns)
    >>> obj
               a         b         c         d
    I1 -0.602984 -0.135716  0.999689 -0.339786
    I2  0.911130 -0.092485 -0.914074 -0.279588
    I3  0.849606 -0.420055 -1.240389 -0.179297
    I4  0.249986 -1.250668  0.329416 -1.105774
    I5 -0.743816  0.430647 -0.058126 -0.337319
    >>> 
    >>> obj[obj > 0]
               a         b         c   d
    I1       NaN       NaN  0.999689 NaN
    I2  0.911130       NaN       NaN NaN
    I3  0.849606       NaN       NaN NaN
    I4  0.249986       NaN  0.329416 NaN
    I5       NaN  0.430647       NaN NaN
    >>> 
    >>> obj > 0
            a      b      c      d
    I1  False  False   True  False
    I2   True  False  False  False
    I3   True  False  False  False
    I4   True  False   True  False
    I5  False   True  False  False
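
A common variation is to build the mask from a single column, which filters whole rows instead of masking individual cells. The sketch below contrasts the two on a small made-up frame:

```python
import pandas as pd

obj = pd.DataFrame({'a': [1, -2, 3], 'b': [-4, 5, 6]},
                   index=['I1', 'I2', 'I3'])

# Boolean Series as the mask: keeps the rows where column 'a' is positive.
rows = obj[obj['a'] > 0]

# Boolean DataFrame as the mask: keeps the shape, non-matching cells become NaN.
masked = obj[obj > 0]

print(list(rows.index))
print(int(masked.isna().sum().sum()))
```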

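【3】Indexers: loc and iloc

loc indexes by label and iloc indexes by position. Note that pandas formerly also had the ix indexer, which accepted both labels and positions; it was deprecated in pandas 0.20.0 and removed in pandas 1.0.0.

【3.1】loc: Label-Based Indexing

loc selects data by label, that is, by the values in index and columns.

【3.1.1】Series.loc

For a Series, loc accepts:

- a single label, such as 5 or 'a' (note that 5 is interpreted as an index label, not as a position);
- a list or array of labels, such as ['a', 'b', 'c'];
- a slice object with labels, such as 'a':'f'.

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.loc.html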

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj
    a    1
    b    5
    c   -8
    d    2
    dtype: int64
    >>> 
    >>> obj.loc['a']
    1
    >>> 
    >>> obj.loc['a':'c']
    a    1
    b    5
    c   -8
    dtype: int64
    >>>
    >>> obj.loc[['a', 'd']]
    a    1
    d    2
    dtype: int64
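
The remark that an integer passed to loc is a label and not a position is easiest to see on a Series whose index is made of integers. The data below is invented for the illustration:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
obj = pd.Series(['x', 'y', 'z'], index=[5, 3, 1])

by_label_5 = obj.loc[5]      # label 5 is the first element
by_position_0 = obj.iloc[0]  # position 0 is also the first element
by_label_1 = obj.loc[1]      # label 1 is the last element
by_position_1 = obj.iloc[1]  # position 1 is the middle element

print(by_label_5, by_position_0, by_label_1, by_position_1)
```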

【3.1.2】DataFrame.loc

For a DataFrame, the first argument selects rows and the second selects columns; the accepted inputs are much the same as for a Series.

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

    >>> import pandas as pd
    >>> obj = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
    >>> obj
       A  B  C
    a  1  2  3
    b  4  5  6
    c  7  8  9
    >>> 
    >>> obj.loc['a']
    A    1
    B    2
    C    3
    Name: a, dtype: int64
    >>> 
    >>> obj.loc['a':'c']
       A  B  C
    a  1  2  3
    b  4  5  6
    c  7  8  9
    >>> 
    >>> obj.loc[['a', 'c']]
       A  B  C
    a  1  2  3
    c  7  8  9
    >>> 
    >>> obj.loc['b', 'B']
    5
    >>> obj.loc['b', 'A':'C']
    A    4
    B    5
    C    6
    Name: b, dtype: int64
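
loc also accepts a boolean mask for the rows and can be used on the left-hand side of an assignment. Neither use appears in the session above, so the following is only a sketch of how they might look on the same frame:

```python
import pandas as pd

obj = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['a', 'b', 'c'], columns=['A', 'B', 'C'])

# Boolean mask for the rows, list of labels for the columns.
subset = obj.loc[obj['A'] > 1, ['A', 'C']]

# Assigning through loc updates a single cell of the original frame.
obj.loc['b', 'B'] = 50

print(subset.values.tolist())
print(obj.loc['b', 'B'])
```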

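【3.2】iloc: Position-Based Indexing

iloc plays the same role as loc but selects by integer position, that is, by the positional numbers of the rows and columns.

【3.2.1】Series.iloc

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.iloc.html

For a Series, iloc accepts:

- an integer, such as 5;
- a list or array of integers, such as [4, 3, 0];
- a slice object with integers, such as 1:7.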

    >>> import pandas as pd
    >>> obj = pd.Series([1, 5, -8, 2], index=['a', 'b', 'c', 'd'])
    >>> obj
    a    1
    b    5
    c   -8
    d    2
    dtype: int64
    >>> 
    >>> obj.iloc[1]
    5
    >>> 
    >>> obj.iloc[0:2]
    a    1
    b    5
    dtype: int64
    >>> 
    >>> obj.iloc[[0, 1, 3]]
    a    1
    b    5
    d    2
    dtype: int64

【3.2.2】DataFrame.iloc

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

For a DataFrame, the first argument selects rows and the second selects columns; the accepted inputs are much the same as for a Series:

    >>> import pandas as pd
    >>> obj = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
    >>> obj
       A  B  C
    a  1  2  3
    b  4  5  6
    c  7  8  9
    >>> 
    >>> obj.iloc[1]
    A    4
    B    5
    C    6
    Name: b, dtype: int64
    >>> 
    >>> obj.iloc[0:2]
       A  B  C
    a  1  2  3
    b  4  5  6
    >>> 
    >>> obj.iloc[[0, 2]]
       A  B  C
    a  1  2  3
    c  7  8  9
    >>> 
    >>> obj.iloc[1, 2]
    6
    >>> 
    >>> obj.iloc[1, 0:2]
    A    4
    B    5
    Name: b, dtype: int64
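
To connect iloc back to loc, the sketch below selects the same 2x2 block both ways. The positional slice stops before the end position while the label slice includes the end label, and iloc also accepts negative positions counted from the end.

```python
import pandas as pd

obj = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['a', 'b', 'c'], columns=['A', 'B', 'C'])

block_by_position = obj.iloc[0:2, 0:2]        # rows 0-1, columns 0-1
block_by_label = obj.loc['a':'b', 'A':'B']    # rows a-b, columns A-B
last_row = obj.iloc[-1]                       # negative position: last row

print(block_by_position.equals(block_by_label))
print(last_row.tolist())
```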

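【4】Reindexing in Pandas

An important method on pandas objects is reindex, which creates a new object whose data conforms to a new index. Taking DataFrame.reindex as the example (Series.reindex is similar), the basic signature is:

DataFrame.reindex(self, labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)

Some of the parameters are described below (see the official documentation for the full list):

Parameter	Description
index	New sequence to use as the index; it may be an Index instance or any other sequence-like Python data structure
method	Fill method: None leaves gaps unfilled; pad / ffill propagates the last valid observation forward; backfill / bfill fills a gap with the next valid observation; nearest fills a gap with the nearest valid observation
fill_value	Substitute value to use where reindexing introduces missing data
limit	Maximum number of consecutive elements to fill when forward-filling or back-filling
tolerance	Maximum absolute distance allowed between an original label and a new label when filling inexact matches
level	Match a simple index on the given level of a MultiIndex; otherwise select a subset
copy	Defaults to True, which always copies; if False, no copy is made when the new index equals the old one

reindex rearranges the data according to the new index. If a label is not present in the original index, a missing value is introduced: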

    >>> import pandas as pd
    >>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
    >>> obj
    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64
    >>> 
    >>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
    >>> obj2
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    NaN
    dtype: float64
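
When NaN is not wanted for the new labels, the fill_value parameter from the table above supplies a replacement. A minimal sketch on the same Series (the choice of 0 is arbitrary):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# 'e' does not exist in obj, so it receives the fill value instead of NaN.
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

print(filled.tolist())
# reindex returns a new object; obj still has only its four original labels.
print(list(obj.index))
```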

For ordered data such as time series, you may want to fill in values when reindexing. The method option does this; for example, ffill forward-fills values:

    >>> import pandas as pd
    >>> obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
    >>> obj
    0      blue
    2    purple
    4    yellow
    dtype: object
    >>> 
    >>> obj2 = obj.reindex(range(6), method='ffill')
    >>> obj2
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
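
For comparison, here is the same reindexing with method='bfill', which takes each missing label's value from the next existing label. The last new label has nothing after it, so it stays missing. This is only a sketch on the data above:

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Back-fill: label 1 takes the value at 2, label 3 takes the value at 4.
back = obj.reindex(range(6), method='bfill')

print(back.iloc[:5].tolist())
# Label 5 has no later label to borrow from, so it is NaN.
print(pd.isna(back.loc[5]))
```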

With a DataFrame, reindex can change the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows:

    >>> import pandas as pd
    >>> import numpy as np
    >>> obj = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
    >>> obj
       Ohio  Texas  California
    a     0      1           2
    c     3      4           5
    d     6      7           8
    >>> 
    >>> obj2 = obj.reindex(['a', 'b', 'c', 'd'])
    >>> obj2
       Ohio  Texas  California
    a   0.0    1.0         2.0
    b   NaN    NaN         NaN
    c   3.0    4.0         5.0
    d   6.0    7.0         8.0

The columns can be reindexed with the columns keyword:

    >>> import pandas as pd
    >>> import numpy as np
    >>> obj = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
    >>> obj
       Ohio  Texas  California
    a     0      1           2
    c     3      4           5
    d     6      7           8
    >>> 
    >>> states = ['Texas', 'Utah', 'California']
    >>> obj.reindex(columns=states)
       Texas  Utah  California
    a      1   NaN           2
    c      4   NaN           5
    d      7   NaN           8
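
Rows and columns can also be reindexed in a single call by passing both the index and columns keywords. The sketch below additionally sets fill_value=0, an arbitrary choice for the illustration, so that the new row and the new column hold 0 instead of NaN:

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   index=['a', 'c', 'd'],
                   columns=['Ohio', 'Texas', 'California'])

# Add row 'b' and column 'Utah', drop column 'Ohio', and fill new cells with 0.
both = obj.reindex(index=['a', 'b', 'c', 'd'],
                   columns=['Texas', 'Utah', 'California'],
                   fill_value=0)

print(both.shape)
print(both.loc['b'].tolist())
print(both['Utah'].tolist())
```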
