Pandas Part 12: What the Index Is For

蔡大远

Article link: https://blog.csdn.net/cai_and_luo/article/details/117082023

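Question: the index is used to look up data, but what is the difference between storing a column of data as an ordinary column and storing it in the index?

If the index is only used to filter data, the difference is indeed small.

However, the index has other uses as well:

More convenient data lookups;
Better lookup performance;
Automatic data alignment;
Support for more, and more powerful, data structures;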

import pandas as pd

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")

df.head()

userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

df.count()

userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64

1. Querying data with the index

# drop=False keeps userId as a regular column in addition to making it the index
df.set_index("userId", inplace=True, drop=False)

df.head()

userId	movieId	rating	timestamp
userId				
1	1	1	4.0	964982703
1	1	3	4.0	964981247
1	1	6	4.0	964982224
1	1	47	5.0	964983815
1	1	50	5.0	964982931

df.index

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userId', length=100836)

# Lookup through the index
df.loc[500].head(5)

userId	movieId	rating	timestamp
userId				
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065

# Lookup with a boolean condition on the column
df.loc[df["userId"] == 500].head()

userId	movieId	rating	timestamp
userId				
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065
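The two lookups above select the same rows. A minimal sketch on a small made-up table (the `toy` frame and its values are illustrative and not taken from ratings.csv) checks this directly:

```python
import pandas as pd

# Small made-up ratings table (illustrative values only)
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Keep userId as a column as well as the index
toy = toy.set_index("userId", drop=False)

by_index = toy.loc[3]                  # label lookup through the index
by_mask = toy.loc[toy["userId"] == 3]  # boolean mask on the column

# Both approaches select the same rows
same = by_index.equals(by_mask)
print(same)
```

One difference to keep in mind: when a label occurs exactly once, `toy.loc[label]` returns a Series for that single row, while the boolean-mask form always returns a DataFrame.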

2. Using the index improves lookup performance

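If the index is unique, Pandas uses a hash table, so a lookup is O(1);
If the index is not unique but is sorted, Pandas uses binary search, so a lookup is O(log N);
If the index is neither unique nor sorted, every lookup has to scan the whole index, so a lookup is O(N);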
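One way to see these three cases from the outside is the return type of `Index.get_loc`, which differs depending on whether the index is unique, sorted with duplicates, or unsorted with duplicates. The sketch below uses three small made-up indexes; it shows the shape of the result only, and the internal lookup strategy may differ between pandas versions.

```python
import pandas as pd

unique_idx = pd.Index([10, 20, 30])            # unique labels
sorted_dup_idx = pd.Index([10, 20, 20, 30])    # duplicates, but sorted
unsorted_dup_idx = pd.Index([20, 10, 30, 20])  # duplicates, not sorted

pos_unique = unique_idx.get_loc(20)
pos_sorted = sorted_dup_idx.get_loc(20)
pos_unsorted = unsorted_dup_idx.get_loc(20)

print(pos_unique)    # a single integer position
print(pos_sorted)    # a slice covering the contiguous run of matches
print(pos_unsorted)  # a boolean mask over the whole index
```

The boolean mask in the last case has one entry per index element, which matches the full-scan cost described above.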

Experiment 1: lookup on a randomly ordered index

# Shuffle the rows into random order
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

df_shuffle.head()

userId	movieId	rating	timestamp
userId				
266	266	260	4.0	945670679
452	452	1608	5.0	1019580899
414	414	168	2.0	961514438
362	362	2329	4.0	1530640231
525	525	588	3.5	1476476589

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing

False

df_shuffle.index.is_unique

False

# Time the lookup of the rows with userId == 500
%timeit df_shuffle.loc[500]
309 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Experiment 2: lookup after sorting the index

df_sorted = df_shuffle.sort_index()

df_sorted.head()

userId	movieId	rating	timestamp
userId				
1	1	2161	5.0	964981710
1	1	2273	4.0	964982310
1	1	457	5.0	964981909
1	1	70	3.0	964982400
1	1	1617	5.0	964982951

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

True

df_sorted.index.is_unique

False

%timeit df_sorted.loc[500]
179 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
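The two experiments can also be run outside a notebook, without scikit-learn or the `%timeit` magic. The sketch below is an approximation under stated assumptions: it uses synthetic data in place of ratings.csv (the `demo` frame, its column values, and the seed are made up), shuffles with `DataFrame.sample(frac=1)` instead of `sklearn.utils.shuffle`, and times with the standard `timeit` module. Absolute timings depend on the machine and the pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for ratings.csv: 100,000 rows, user ids 1..610
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "userId": rng.integers(1, 611, size=100_000),
    "rating": rng.integers(1, 11, size=100_000) / 2,
})
demo = demo.set_index("userId", drop=False)

demo_shuffle = demo.sample(frac=1, random_state=0)  # random row order
demo_sorted = demo_shuffle.sort_index()             # monotonic index

# Time 200 lookups of one label on each frame (results vary by machine)
t_shuffle = timeit.timeit(lambda: demo_shuffle.loc[500], number=200)
t_sorted = timeit.timeit(lambda: demo_sorted.loc[500], number=200)
print(f"shuffled: {t_shuffle:.4f}s, sorted: {t_sorted:.4f}s")
```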

3. The index aligns data automatically

This applies to both Series and DataFrame.
s1 = pd.Series([1,2,3], index=list("abc"))

s1

a    1
b    2
c    3
dtype: int64

s2 = pd.Series([2,3,4], index=list("bcd"))

s2

b    2
c    3
d    4
dtype: int64

s1+s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
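In the sum above, labels present on only one side become NaN. Two follow-ups may help: `Series.add` with `fill_value` treats a missing label as a given value instead, and DataFrames align on both row labels and column labels. The sketch below reuses `s1` and `s2`; `df1` and `df2` are made-up frames for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Treat a label missing on one side as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)
print(filled)

# DataFrames align on both the row index and the column labels
df1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["b", "c"])
total = df1 + df2
print(total)
```

In `total`, only the cell at row `b`, column `y` exists in both frames, so it is the only non-NaN value.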

4. The index supports more, and more powerful, data structures

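Pandas provides several powerful index types:

CategoricalIndex, an index based on categorical data, which can improve performance;
MultiIndex, a multi-level index, used for example for the result of a groupby over several keys;
DatetimeIndex, a datetime index with rich support for date and time operations;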
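A short sketch of each of the three index types follows. All of the data is made up for illustration (`by_level`, `sales`, and `daily` do not come from the ratings dataset), and it shows only one typical use of each type.

```python
import pandas as pd

# CategoricalIndex: labels drawn from a fixed set of categories
cat_idx = pd.CategoricalIndex(["low", "high", "low", "mid"],
                              categories=["low", "mid", "high"])
by_level = pd.Series([1, 2, 3, 4], index=cat_idx)
low_total = by_level.loc["low"].sum()

# MultiIndex: a groupby over two keys yields a two-level index
sales = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "amount": [10, 20, 30, 40],
})
grouped = sales.groupby(["city", "year"])["amount"].sum()
b_2021 = grouped.loc[("B", 2021)]

# DatetimeIndex: partial-string selection of a whole month
days = pd.date_range("2021-01-30", periods=4, freq="D")
daily = pd.Series([1, 2, 3, 4], index=days)
february = daily.loc["2021-02"]

print(low_total, b_2021, february.tolist())
```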