Python: a faster alternative to the Pandas `isin` function

Latest recommended article published 2024-06-24 21:18:01

weixin_39857876

Edit 2: Here is a link showing the performance of various pandas operations, though so far it does not seem to include merge and join.

Edit 1: These benchmarks were run against a fairly old version of pandas and are probably no longer relevant. See Mike's comment about merge.
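Mike's comment is not reproduced here, so the following is only a guess at what a merge-based filter might look like, not his actual suggestion. It is a minimal sketch in which the sample frame and the names `keep` and `filtered` are hypothetical: the IDs to keep are placed in a one-column DataFrame, de-duplicated, and inner-merged on `ID`.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({"ID": [1, 2, 3, 4], "value": [0.1, 0.2, 0.3, 0.4]})

# IDs to keep. Duplicates are dropped first, because an inner merge would
# otherwise repeat each matching row once per duplicate key.
keep = pd.DataFrame({"ID": [2, 4, 4, 9]}).drop_duplicates()

# An inner merge keeps only the rows whose ID appears on both sides.
filtered = df.merge(keep, on="ID", how="inner")

# Reference result using isin, for comparison.
expected = df[df["ID"].isin([2, 4, 4, 9])]
```

Whether this beats `isin` on a current pandas release would have to be measured; nothing here claims that it does.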

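It depends on the size of your data, but for large datasets DataFrame.join seems to be the way to go. This requires your DataFrame's index to be your "ID", and the Series or DataFrame you are joining against to have an index that is your "ID_list". The Series must also have a name to be used with join, and it is pulled in as a new column with that name. You also need to specify an inner join to get something like isin, because join defaults to a left join. The query `in` syntax seems to have the same speed characteristics as isin for large datasets.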
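The requirements in the paragraph above can be sketched as follows. This is an illustration under assumptions, not the benchmark code: the sample values and the names `id_list`, `wanted`, `via_isin` and `via_join` are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: a frame with an ID column and a list of IDs to keep.
df = pd.DataFrame({"ID": [10, 20, 30, 40], "value": ["a", "b", "c", "d"]})
id_list = [20, 40, 50]

# Baseline: boolean mask built with isin.
via_isin = df[df["ID"].isin(id_list)]

# join-based variant: both sides are indexed by the ID, and the Series has a
# name, which join requires.
wanted = pd.Series(True, index=pd.Index(id_list, name="ID"), name="ID_list")
via_join = df.set_index("ID").join(wanted, how="inner")

# The Series arrives as an extra column named "ID_list"; drop it and turn the
# index back into an ordinary ID column.
via_join = via_join.drop(columns="ID_list").reset_index()
```

One difference from `isin` to keep in mind: if `id_list` contained duplicate values, the join would repeat the matching rows, whereas `isin` would not.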

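If you are working with small datasets, the behavior is different: a list comprehension, or apply against a dictionary, is actually faster than isin.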
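For the small-data case, the three ways of building the mask can be sketched like this. The sample frame and the names `lookup`, `mask_isin`, `mask_comp` and `mask_apply` are assumptions for illustration; the sketch only shows that the approaches select the same rows and makes no claim about their relative speed on a current pandas version.

```python
import pandas as pd

# Hypothetical small frame; 20000 is deliberately outside the lookup range.
df = pd.DataFrame({"ID": [1, 2, 3, 4, 20000]})

# A dict gives constant-time membership tests on its keys.
lookup = dict.fromkeys(range(10000))

# 1. isin against a plain list of the same keys.
mask_isin = df["ID"].isin(list(lookup))

# 2. List comprehension testing each value against the dict.
mask_comp = [x in lookup for x in df["ID"]]

# 3. apply with the same membership test.
mask_apply = df["ID"].apply(lambda x: x in lookup)

# A list of booleans of matching length works as a row mask.
small_result = df[mask_comp]
```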

Otherwise, you can try to get more speed with Cython.

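import pandas as pd

# I'm ignoring that the index is defaulting to a sequential number. You

# would need to explicitly assign your IDs to the index here, e.g.:

# >>> l_series.index = ID_list

mil = list(range(1000000))  # a real list, so isin(l) below is timed against a list

l = mil

l_series = pd.Series(l)

l_series.name = 'ID_list'  # join requires the Series to have a name

df = pd.DataFrame(mil, columns=['ID'])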

In [247]: %timeit df[df.index.isin(l)]

1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]

1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here

In [304]: %timeit df[df.ID.isin(l_series)]

1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]

1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'

In [249]: %timeit df.query('index in @l')

1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')

1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.

# join defaults to a left join so we need to specify inner for existence.

In [251]: %timeit df.join(l_series, how='inner')

10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.

df = pd.DataFrame([1,2,3,4], columns=['ID'])

l = list(range(10000))  # materialized as a list

l_dict = dict(zip(l, l))

l_series = pd.Series(l)

l_series.name = 'ID_list'

In [363]: %timeit df.join(l_series, how='inner')

1000 loops, best of 3: 733 µs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]

1000 loops, best of 3: 742 µs per loop

In [292]: %timeit df[df.ID.isin(l)]

1000 loops, best of 3: 771 µs per loop

In [294]: %timeit df[df.ID.isin(l_series)]

100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.

In [296]: %timeit df[[x in l_dict for x in df.ID]]

1000 loops, best of 3: 203 µs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]

1000 loops, best of 3: 297 µs per loop
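Since the timings above come from an old pandas version, it may be worth re-running the comparison locally. The following is a minimal sketch of such a harness using the standard `timeit` module instead of IPython's `%timeit`. The function name `run_benchmarks`, the default size and the labels are assumptions for illustration, and no expected timings are given because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd


def run_benchmarks(n=100_000, repeat=3):
    """Time several ways of filtering a frame by membership; returns seconds."""
    ids = list(range(n))
    df = pd.DataFrame({"ID": ids})
    id_series = pd.Series(ids, name="ID_list")  # named, as join requires

    def via_query():
        # query resolves @names from the calling frame's local variables.
        wanted = id_series
        return df.query("index in @wanted")

    candidates = {
        "isin(list)": lambda: df[df.index.isin(ids)],
        "isin(Series)": lambda: df[df.index.isin(id_series)],
        "query": via_query,
        "join": lambda: df.join(id_series, how="inner"),
    }

    # Best of `repeat` single runs, similar in spirit to %timeit's "best of 3".
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in run_benchmarks().items():
        print(f"{name:>12}: {seconds * 1e3:.1f} ms")
```

Taking the minimum over a few repeats reduces noise from other processes; for the small-dataset cases a larger `number` per repeat would give more stable figures.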
