Python Data Processing: pandas Index Types

Article link: https://blog.csdn.net/dxs18459111694/article/details/134631844

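Sorting a MultiIndex

A MultiIndex can be sorted with sort_index(). With no arguments it sorts by every level in order; the level argument selects which level to sort by first.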

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]: 
foo  one    0.206053
qux  two   -0.251905
bar  two   -2.213588
baz  one    1.063327
bar  one    1.266143
foo  two    0.299368
qux  one   -0.863838
baz  two    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]: 
bar  one    1.266143
     two   -2.213588
baz  one    1.063327
     two    0.408204
foo  one    0.206053
     two    0.299368
qux  one   -0.863838
     two   -0.251905
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]: 
bar  one    1.266143
     two   -2.213588
baz  one    1.063327
     two    0.408204
foo  one    0.206053
     two    0.299368
qux  one   -0.863838
     two   -0.251905
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]: 
bar  one    1.266143
baz  one    1.063327
foo  one    0.206053
qux  one   -0.863838
bar  two   -2.213588
baz  two    0.408204
foo  two    0.299368
qux  two   -0.251905
dtype: float64

If the levels of the MultiIndex are named, you can also pass a level name to sort_index

In [108]: s.index.set_names(["L1", "L2"], inplace=True)

In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one    1.266143
     two   -2.213588
baz  one    1.063327
     two    0.408204
foo  one    0.206053
     two    0.299368
qux  one   -0.863838
     two   -0.251905
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one    1.266143
baz  one    1.063327
foo  one    0.206053
qux  one   -0.863838
bar  two   -2.213588
baz  two    0.408204
foo  two    0.299368
qux  two   -0.251905
dtype: float64
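
The session above relies on a `tuples` list defined earlier in the article. As a self-contained sketch of the same idea, the snippet below builds its own labels (the `tuples` values and the integer data are made up for illustration) and sorts by all levels and by a named level.

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for the `tuples` list used above.
tuples = [
    ("qux", "two"), ("bar", "one"), ("foo", "two"), ("bar", "two"),
    ("baz", "one"), ("qux", "one"), ("foo", "one"), ("baz", "two"),
]
index = pd.MultiIndex.from_tuples(tuples, names=["L1", "L2"])
s = pd.Series(np.arange(8), index=index)

by_all = s.sort_index()            # sort by L1, then by L2
by_l2 = s.sort_index(level="L2")   # sort by L2 first, then by the remaining level

print(by_all.index.tolist())
print(by_l2.index.tolist())
```

Sorting by a single level still orders the remaining levels within each group, which is why the L1 labels appear alphabetically inside each L2 block.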

On higher-dimensional objects, you can also sort any of the other axes by level, provided that axis has a MultiIndex

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688
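
The `df` used above comes from an earlier part of the article. A minimal stand-alone sketch of sorting the column axis by its inner level, using a hypothetical frame named `frame` with made-up integer data, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose columns carry a two-level MultiIndex.
columns = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
frame = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)

# axis=1 targets the columns; level=1 sorts by the inner level first.
by_inner = frame.sort_index(level=1, axis=1)

print(by_inner.columns.tolist())
print(by_inner.iloc[0].tolist())
```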

Indexing works even if the data is not sorted, but it is rather inefficient (and emits a PerformanceWarning). It also returns a copy of the data rather than a view

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....: 

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

>>> dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

           jolie
jim joe
1   z    0.53702

Furthermore, if you try to index something that is not fully lexsorted, this can raise an error

>>> dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

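The is_lexsorted() method on a MultiIndex shows whether the index is sorted, and the lexsort_depth property returns the sort depth. Both belong to older pandas releases: they were deprecated in pandas 1.3 and removed in 2.0, where MultiIndex.is_monotonic_increasing is the check to use instead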

In [115]: dfm.index.is_lexsorted()
Out[115]: False

In [116]: dfm.index.lexsort_depth
Out[116]: 1

In [117]: dfm = dfm.sort_index()

In [118]: dfm
Out[118]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [119]: dfm.index.is_lexsorted()
Out[119]: True

In [120]: dfm.index.lexsort_depth
Out[120]: 2
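
The check-then-sort workflow can be sketched in a version-neutral way with `is_monotonic_increasing`, which exists in both old and new pandas releases. The `jolie` values below are made up; everything else mirrors the frame above.

```python
import pandas as pd

dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])

# (1, "z") comes before (1, "y"), so the index is not sorted.
was_sorted = dfm.index.is_monotonic_increasing

try:
    dfm.loc[(0, "y"):(1, "z")]
    raised = False
except pd.errors.UnsortedIndexError:
    # Slicing with a full key needs a fully lexsorted index.
    raised = True

dfm_sorted = dfm.sort_index()
sliced = dfm_sorted.loc[(0, "y"):(1, "z")]

print(was_sorted, raised, dfm_sorted.index.is_monotonic_increasing)
print(sliced)
```

Sorting once up front is usually cheaper than paying the unsorted-lookup cost on every access.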

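The take method

Like NumPy ndarrays, pandas Index, Series, and DataFrame also provide a take() method, which retrieves elements along a given axis at the given indices

The given indices must be either a list or an ndarray of integer positions. take also accepts negative integers, which are interpreted as positions relative to the end of the object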

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Int64Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Int64Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
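
The examples above only use non-negative positions. As a small sketch of the negative-position case, with a hypothetical Series whose labels are letters so that positions and labels cannot be confused:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

# take works on positions, not labels; -1 is the last element.
picked = ser.take([0, -1, 2])
print(picked.index.tolist(), picked.tolist())

idx = pd.Index(list("abcde"))
print(idx.take([-1, -2]).tolist())
```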

For DataFrames, the given indices should be a 1-D list or ndarray that specifies row or column positions

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]: 
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]: 
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

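Note that the take method on pandas objects is not intended to work with boolean indices. Booleans are treated as the integers 0 and 1, so passing a boolean mask may return unexpected results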

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64
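
When the starting point is a boolean mask, one option is to use ordinary boolean indexing, and another is to convert the mask to integer positions before calling take. The sketch below assumes a hypothetical mask and data; `np.flatnonzero` returns the positions where the mask is True.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, -0.3, 2.2, 0.7])
mask = np.array([False, False, True, True])

by_mask = ser[mask]                        # boolean indexing
by_take = ser.take(np.flatnonzero(mask))   # mask converted to positions [2, 3]

print(by_take)
```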

Finally, a note on performance: because the take method handles a narrower range of inputs, it can be considerably faster than fancy indexing

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....: 
127 us +- 535 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
37.6 us +- 224 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....: 
71.2 us +- 624 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
62.8 us +- 565 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
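
The `%timeit` magic only works inside IPython. Outside of it, a rough equivalent can be sketched with the standard `timeit` module. The repeat count of 200 is an arbitrary choice, and the timings depend on the machine and the NumPy version, so no particular ratio should be expected.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((10000, 5))
indexer = rng.permutation(10000)   # shuffled row positions

# Both approaches select the same rows in the same order.
fancy = arr[indexer]
taken = arr.take(indexer, axis=0)

t_fancy = timeit.timeit(lambda: arr[indexer], number=200)
t_take = timeit.timeit(lambda: arr.take(indexer, axis=0), number=200)
print(f"fancy indexing: {t_fancy:.4f} s, take: {t_take:.4f} s")
```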

Index types

The previous sections covered MultiIndex quite extensively

The following subsections highlight some of the other index types

1 CategoricalIndex

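CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements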

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting this column as the index creates a CategoricalIndex

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates.

The indexers must be in the categories, otherwise the operation raises a KeyError

In [153]: df2.loc["a"]
Out[153]: 
   A
B   
a  0
a  1
a  5

The CategoricalIndex is preserved after indexing

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting the index sorts by the order of the categories. We created the index with CategoricalDtype(list('cab')), so the sorted order is c, a, b

In [155]: df2.sort_index()
Out[155]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3
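
As a compact, self-contained restatement of the two points above (sort order follows the categories, and a label outside the categories raises a KeyError), here is a small sketch. The label "z" is a made-up value chosen because it is not among the categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": range(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

# Sorting follows the category order c, a, b rather than alphabetical order.
sorted_labels = df2.sort_index().index.tolist()
print(sorted_labels)

try:
    df2.loc["z"]            # "z" is not one of the categories
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)
```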

Groupby operations on the index preserve the nature of the index as well

In [156]: df2.groupby(level=0).sum()
Out[156]: 
   A
B   
c  4
a  6
b  5

In [157]: df2.groupby(level=0).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

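Reindexing operations return a resulting index based on the type of the passed indexer. Passing a list returns a plain Index

Indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the passed Categorical dtype.

This allows you to reindex with arbitrary values, even values that are not in the categories, similarly to how you can reindex any pandas index.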

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
   A
B   
a  0
b  1
c  2

In [161]: df3.reindex(["a", "e"])
Out[161]: 
     A
B     
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
     A
B     
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, name='B', dtype='category')

Note:

Reshaping and comparison operations on a CategoricalIndex require the same categories on both sides, otherwise a TypeError is raised

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, name='B', dtype='category')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, name='B', dtype='category')

>>> pd.concat([df4, df5])
TypeError: categories must match existing categories when appending

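2 Int64Index and RangeIndex

Int64Index is a fundamental basic index in pandas. It is an immutable array implementing an ordered, sliceable set.

RangeIndex is a subclass of Int64Index that provides the default index for all NDFrame objects.

RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. It is analogous to Python's range type. Note that this describes pandas 1.x: in pandas 2.0 and later, Int64Index no longer exists as a separate class and its role is filled by Index with an integer dtype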
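
A short sketch of the RangeIndex behaviour described here: a DataFrame built without an explicit index gets a RangeIndex, which stores only start, stop, and step, much like Python's range. The frame contents are made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]})
default_index = df.index

print(type(default_index).__name__)
print(default_index.start, default_index.stop, default_index.step)

# A RangeIndex can also be built explicitly, like range(0, 10, 2).
evens = pd.RangeIndex(start=0, stop=10, step=2)
print(list(evens))
```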

3 Float64Index

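By default, a Float64Index is created automatically when floating-point values, or a mix of integer and floating-point values, are passed at index creation

This enables a pure label-based slicing paradigm in which [] and .loc work exactly the same way for scalar indexing and slicing (the same applied to the legacy .ix accessor, which was removed in pandas 1.0)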

In [173]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [174]: indexf
Out[174]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [175]: sf = pd.Series(range(5), index=indexf)

In [176]: sf
Out[176]: 
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selection with [] and .loc is always label-based. An integer matches an equal float index (for example, 3 is equivalent to 3.0)

In [177]: sf[3]
Out[177]: 2

In [178]: sf[3.0]
Out[178]: 2

In [179]: sf.loc[3]
Out[179]: 2

In [180]: sf.loc[3.0]
Out[180]: 2

The only positional indexing is via iloc

In [181]: sf.iloc[3]
Out[181]: 3

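A scalar index that is not found raises a KeyError. With [] and loc, slicing is primarily based on the values of the index, whereas with iloc it is always positional.

The exception is when the slice is boolean, in which case it is always positional.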

In [182]: sf[2:4]
Out[182]: 
2.0    1
3.0    2
dtype: int64

In [183]: sf.loc[2:4]
Out[183]: 
2.0    1
3.0    2
dtype: int64

In [184]: sf.iloc[2:4]
Out[184]: 
3.0    2
4.5    3
dtype: int64

With a float index, slicing with floats is allowed

In [185]: sf[2.1:4.6]
Out[185]: 
3.0    2
4.5    3
dtype: int64

In [186]: sf.loc[2.1:4.6]
Out[186]: 
3.0    2
4.5    3
dtype: int64
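
The label-versus-position distinction can be condensed into one small sketch. It uses only .loc and .iloc, whose behaviour is explicit, and the same five values as the `sf` Series above.

```python
import pandas as pd

sf = pd.Series(range(5), index=[1.5, 2.0, 3.0, 4.5, 5.0])

by_label = sf.loc[2:4]        # labels between 2 and 4, both ends included
by_position = sf.iloc[2:4]    # positions 2 and 3, end excluded
by_float = sf.loc[2.1:4.6]    # float bounds need not exist in the index

print(by_label.index.tolist())
print(by_position.index.tolist())
print(by_float.index.tolist())
print(sf.loc[3])              # the integer 3 matches the label 3.0
```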

With a non-float index, indexing or slicing with floats raises a TypeError

In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

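Here is a typical use case for this type of indexing. Suppose you have an irregular, timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets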

In [187]: dfir = pd.concat(
   .....:     [
   .....:         pd.DataFrame(
   .....:             np.random.randn(5, 2), index=np.arange(5) * 250.0, columns=list("AB")
   .....:         ),
   .....:         pd.DataFrame(
   .....:             np.random.randn(6, 2),
   .....:             index=np.arange(4, 10) * 250.1,
   .....:             columns=list("AB"),
   .....:         ),
   .....:     ]
   .....: )
   .....: 

In [188]: dfir
Out[188]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
1000.4 -0.179734  0.993962
1250.5 -0.212673  0.909872
1500.6 -0.733333 -0.349893
1750.7  0.456434 -0.306735
2000.8  0.553396  0.166221
2250.9 -0.101684 -0.734907

Selection operations then always work on a value basis, for all selection operators

In [189]: dfir[0:1000.4]
Out[189]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
1000.4 -0.179734  0.993962

In [190]: dfir.loc[0:1001, "A"]
Out[190]: 
0.0      -0.435772
250.0    -0.808286
500.0    -1.815703
750.0    -0.243487
1000.0    1.162969
1000.4   -0.179734
Name: A, dtype: float64

In [191]: dfir.loc[1000.4]
Out[191]: 
A   -0.179734
B    0.993962
Name: 1000.4, dtype: float64

You can retrieve the first 1 second (1000 ms) of data like this

In [192]: dfir[0:1000]
Out[192]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725

If you need selection based on integer position, you should use iloc

In [193]: dfir.iloc[0:5]
Out[193]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725

4 IntervalIndex

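IntervalIndex, together with its own dtype IntervalDtype and the Interval scalar type, provides first-class support for interval notation in pandas

IntervalIndex allows some unique forms of indexing and is also used as the return type for the categories in cut() and qcut()

4.1 Indexing with an IntervalIndex

An IntervalIndex can be used as the index of a Series or a DataFrame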

In [194]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....: 

In [195]: df
Out[195]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

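Label-based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval. Because the intervals are closed on the right, the label 2 belongs to (1, 2] and not to (2, 3]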

In [196]: df.loc[2]
Out[196]: 
A    2
Name: (1, 2], dtype: int64

In [197]: df.loc[[2, 3]]
Out[197]: 
        A
(1, 2]  2
(2, 3]  3

If you select a label that is contained within an interval, this also selects that interval

In [198]: df.loc[2.5]
Out[198]: 
A    3
Name: (2, 3], dtype: int64

In [199]: df.loc[[2.5, 3.5]]
Out[199]: 
        A
(2, 3]  3
(3, 4]  4

Selecting with an Interval returns only exact matches

In [200]: df.loc[pd.Interval(1, 2)]
Out[200]: 
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex raises a KeyError

In [7]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError: Interval(0.5, 2.5, closed='right')

To select all Intervals that overlap a given Interval, use the overlaps() method to create a boolean indexer

In [201]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [202]: idxr
Out[202]: array([ True,  True,  True, False])

In [203]: df[idxr]
Out[203]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
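
The three lookup styles from this subsection (a scalar inside an interval, a scalar on a closed edge, and an overlap query) can be put side by side in a short sketch that rebuilds the same frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)

inside = int(df.loc[2.5]["A"])    # 2.5 falls inside (2, 3]
on_edge = int(df.loc[2]["A"])     # 2 is the closed right edge of (1, 2]

try:
    df.loc[pd.Interval(0.5, 2.5)]  # not an exact match for any interval
    exact_match_found = True
except KeyError:
    exact_match_found = False

# overlaps() returns a boolean array usable as a row mask.
overlapping = df[df.index.overlaps(pd.Interval(0.5, 2.5))]

print(inside, on_edge, exact_match_found)
print(overlapping["A"].tolist())
```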

4.2 Binning data with cut and qcut

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute

In [204]: c = pd.cut(range(4), bins=2)

In [205]: c
Out[205]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [206]: c.categories
Out[206]: 
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

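cut() also accepts an IntervalIndex for its bins argument, which enables a useful idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins.

Then, we pass the value of .categories as the bins argument in subsequent calls to cut(), so that new data is binned into the same bins.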

In [207]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[207]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value that falls outside all of the bins is assigned NaN
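
The bin-reuse idiom can be sketched as two calls, one that derives the bins and one that applies them to new data. The `new_values` list is made up; the integer codes make it easy to see which bin each value landed in, with -1 marking values outside every bin.

```python
import pandas as pd

# Step 1: let cut() derive two equal-width bins from reference data.
reference = pd.cut(range(4), bins=2)
bins = reference.categories          # an IntervalIndex

# Step 2: reuse exactly those bins for new (hypothetical) values.
new_values = [0.5, 2, 4, -1]
binned = pd.cut(new_values, bins=bins)

print(binned.codes.tolist())         # -1 marks values outside all bins
print(binned.isna().tolist())
```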

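4.3 Generating ranges of intervals

If you need intervals at a regular frequency, you can use the interval_range() function to create an IntervalIndex from various combinations of start, end, and periods.

The default frequency for interval_range is 1 for numeric intervals and one calendar day for datetime-like intervals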

In [208]: pd.interval_range(start=0, end=5)
Out[208]: 
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')

In [209]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[209]: 
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [210]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[210]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

The freq parameter can be used to specify a non-default frequency, and it accepts a variety of frequency aliases for datetime-like intervals

In [211]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[211]: 
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
              closed='right',
              dtype='interval[float64]')

In [212]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[212]: 
IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [213]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9H")
Out[213]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

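Additionally, the closed parameter specifies which side(s) of the intervals are closed. By default, intervals are closed on the right side (open on the left)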

In [214]: pd.interval_range(start=0, end=4, closed="both")
Out[214]: 
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
              closed='both',
              dtype='interval[int64]')

In [215]: pd.interval_range(start=0, end=4, closed="neither")
Out[215]: 
IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
              closed='neither',
              dtype='interval[int64]')

Specifying start, end, and periods together generates evenly spaced intervals from start to end inclusive, with periods elements in the resulting IntervalIndex

In [216]: pd.interval_range(start=0, end=6, periods=4)
Out[216]: 
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              closed='right',
              dtype='interval[float64]')

In [217]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[217]: 
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')
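
As a quick way to check what interval_range produced, the resulting IntervalIndex exposes its edges through the left and right attributes and its closure through closed. A minimal sketch, with parameter values chosen only for illustration:

```python
import pandas as pd

# Four evenly spaced intervals covering 0 to 6.
even = pd.interval_range(start=0, end=6, periods=4)
print(even.left.tolist(), even.right.tolist())

# closed="left" includes the left edge and excludes the right edge.
left_closed = pd.interval_range(start=0, end=4, closed="left")
first = left_closed[0]
print(left_closed.closed, 0 in first, 1 in first)
```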