Pandas pd.Series.isin performance with a set versus an array


weixin_39719732


Article link: https://blog.csdn.net/weixin_39719732/article/details/111537332


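This article takes a close look at the performance of pandas' pd.Series.isin and at its running-time complexity in different scenarios. The analysis shows how a Cython implementation of isin can, in specific cases, beat both the built-in Python set and the numpy-array approach. The conclusion recommends choosing the method according to the sizes of the data involved.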


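This might not be obvious, but pd.Series.isin uses an O(1) look-up per element.

After an analysis that proves the above statement, we will use its insights to create a Cython prototype which easily beats the fastest out-of-the-box solution.

Let's assume that the "set" has n elements and the "series" has m elements. The running time is then: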

T(n,m)=T_preprocess(n)+m*T_lookup(n)

For the pure-python version, that means:

> T_preprocess(n) = 0 - no preprocessing needed

> T_lookup(n) = O(1) - the well-known behavior of python's set

> this results in T(n,m) = O(m)
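As a minimal sketch of this baseline (the function name `isin_pure_python` is mine and purely illustrative), the whole pure-python version is a single list comprehension over an already-built set:

```python
def isin_pure_python(values, candidates):
    """Return one membership flag per value; `candidates` is an existing set."""
    # No preprocessing: the set already exists, so each test is one hash look-up.
    return [v in candidates for v in values]


x_set = {1, 7, 42}
flags = isin_pure_python([1, 5, 7], x_set)
print(flags)  # [True, False, True]
```

The cost depends only on the number of values tested (m), not on the size of the set (n).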

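What happens for pd.Series.isin(x_arr)? Obviously, if we skipped the preprocessing and searched in linear time, we would get O(n*m), which is not acceptable.

With the help of a debugger or a profiler (I used valgrind-callgrind + kcachegrind) it is easy to see what is going on: the workhorse is the function __pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64, which, as the mangled name shows, lives in pandas' _libs.hashtable module. It does the following:

> In a preprocessing step, a hash map (pandas uses khash from klib) is created out of the n elements of x_arr, i.e. in running time O(n).

> The m look-ups in the constructed hash map take O(1) each, or O(m) in total.

> this results in T(n,m) = O(m) + O(n)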
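The two phases can be sketched in plain Python. This is only a toy model under my own assumptions: `IntHashSet` and `isin_two_phase` are hypothetical names, and the linear-probing table below is not khash's actual implementation. It merely makes the O(n) build step and the O(1) expected look-ups visible:

```python
class IntHashSet:
    """Toy open-addressing set of ints with linear probing (illustrative only)."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self):
        self._slots = [self._EMPTY] * 8  # capacity stays a power of two
        self._size = 0

    def _probe(self, key):
        # Walk from the home slot until the key or an empty slot is found.
        mask = len(self._slots) - 1
        i = hash(key) & mask
        while self._slots[i] is not self._EMPTY and self._slots[i] != key:
            i = (i + 1) & mask
        return i

    def _grow(self):
        # Double the table and re-insert every stored key.
        old = self._slots
        self._slots = [self._EMPTY] * (len(old) * 2)
        self._size = 0
        for key in old:
            if key is not self._EMPTY:
                self._slots[self._probe(key)] = key
                self._size += 1

    def add(self, key):
        # Keep the load factor at or below 1/2 so probing always terminates.
        if (self._size + 1) * 2 > len(self._slots):
            self._grow()
        i = self._probe(key)
        if self._slots[i] is self._EMPTY:
            self._slots[i] = key
            self._size += 1

    def __contains__(self, key):
        return self._slots[self._probe(key)] is not self._EMPTY


def isin_two_phase(values, candidates):
    table = IntHashSet()
    for c in candidates:                 # preprocessing: n insertions, O(n)
        table.add(c)
    return [v in table for v in values]  # m look-ups, O(1) expected each


print(isin_two_phase([1, 5, 7, -3], [7, 1, 100, -3]))  # [True, False, True, True]
```

When the candidate array is much larger than the series, the first loop dominates, which is exactly the O(n) term in the formula above.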

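We must remember that the elements of a numpy array are raw C integers and not Python objects as in the original set, so we cannot use the set as it is.

An alternative to converting the set of Python objects to a set of C ints is to convert the single C ints to Python objects, which makes it possible to use the original set. That is what happens in the [i in x_set for i in ser.values] variant:

> No preprocessing.

> The m look-ups take O(1) each, or O(m) in total, but each look-up is slower because a Python object has to be created first.

> this results in T(n,m) = O(m)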
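The effect can be sketched without numpy. The standard-library `array` module also stores raw C integers, so I use it here as a stand-in for `ser.values` (an assumption made only to keep the example free of dependencies). Iterating over it creates a fresh Python int for every element, and only then can the original set be queried:

```python
from array import array

x_set = {2, 3, 5, 7}               # set of ordinary Python ints
raw = array("q", [1, 2, 3, 4, 5])  # 'q' = signed 64-bit C integers

# Each iteration boxes one raw C integer into a Python int object,
# which is then hashed and looked up in the unchanged set.
flags = [i in x_set for i in raw]
print(flags)  # [False, True, True, False, True]
```

The look-ups themselves are as cheap as in the pure-python case; the extra time goes into creating the temporary objects.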

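Clearly, this version could be sped up a little by using Cython.

But enough theory, let's take a look at the running times for different values of n with m fixed:

We can see that the linear preprocessing time dominates the numpy version for big n. The version with conversion from numpy to pure python (numpy->python) shows the same constant behavior as the pure-python version but is slower because of the necessary conversion. All of this is in accordance with our analysis.

What cannot be seen well in the diagram: if n < m, the numpy version becomes faster. In this case the faster look-up of the khash library plays the most important role, not the preprocessing part.

My take-aways from this analysis: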

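> n < m: pd.Series.isin should be used, because the O(n) preprocessing isn't that costly.

> n > m: (a probably cythonized version of) [i in x_set for i in ser.values] should be used, and the O(n) preprocessing thus avoided.

> Clearly there is a gray zone where n and m are approximately equal, and it is hard to tell which solution is best without testing.

> If you have it under your control: the best thing would be to build the set directly as a C-integer set (khash, which is already wrapped in pandas, or maybe even some C++ implementation), thus eliminating the need for preprocessing. I don't know whether there is something in pandas you could reuse, but it is probably not a big deal to write the function in Cython.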
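The first three take-aways can be condensed into a small helper. This is only a sketch: the function name and the `gray_factor` threshold are made up for illustration, and the real crossover point has to be measured on your own data:

```python
def choose_isin_strategy(n, m, gray_factor=2.0):
    """Suggest an approach for a set of n elements and a series of m elements.

    `gray_factor` is a made-up threshold: when n and m are within this factor
    of each other, the sketch refuses to guess and asks for a benchmark.
    """
    if n * gray_factor <= m:
        return "pd.Series.isin"                    # O(n) preprocessing is cheap
    if n >= m * gray_factor:
        return "[i in x_set for i in ser.values]"  # avoids the O(n) step
    return "benchmark both"


print(choose_isin_strategy(100, 10_000))        # pd.Series.isin
print(choose_isin_strategy(2_000_000, 10_000))  # [i in x_set for i in ser.values]
print(choose_isin_strategy(10_000, 12_000))     # benchmark both
```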

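The problem is that the last suggestion (building the set directly as a C-integer set) doesn't work out of the box, as neither pandas nor numpy has a notion of a set in its interface (at least to my limited knowledge). But having raw C set interfaces would be the best of both worlds:

> no preprocessing is needed, because the values are already passed as a set

> no conversion is needed, because the passed set consists of raw C values

I've coded a quick and dirty Cython wrapper for khash (inspired by the wrapper in pandas), which can be installed via pip install https://github.com/realead/cykhash/zipball/master and then used with Cython for a faster isin version: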

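%%cython
import numpy as np
cimport numpy as np

from cykhash.khashsets cimport Int64Set

def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b):
    # Boolean result buffer, accessed as uint8 (cast=True).
    # np.bool_ is used because the plain np.bool alias is missing in some numpy versions.
    cdef np.ndarray[np.uint8_t, ndim=1, cast=True] res = np.empty(a.shape[0], dtype=np.bool_)
    cdef int i
    for i in range(a.size):
        # raw C integer look-up, no Python object is created
        res[i] = b.contains(a[i])
    return res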

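As a further possibility, C++'s std::unordered_set can be wrapped (see listing C), which has the disadvantage of needing the C++ libraries and (as we will see) is slightly slower.

Comparing the approaches (see listing D for how the timings were created):

khash is about a factor of 20 faster than numpy->python, about a factor of 6 faster than pure python (but pure python is not what we want anyway) and even about a factor of 3 faster than the cpp version.

Listings

A: profiling with valgrind: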

# isin.py
import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(2*10**6)}
x_arr = np.array(list(x_set))

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)

# repeat the call so that isin dominates the profile
for _ in range(10):
    ser.isin(x_arr)

And now:

>>> valgrind --tool=callgrind python isin.py

>>> kcachegrind

This leads to the following call graph:

B: ipython code for producing the running times:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n = 10**3
result = []
while n < 3*10**6:
    # rebuild the set and its array copy with n elements
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)

    t1 = %timeit -o ser.isin(x_arr)
    t2 = %timeit -o [i in x_set for i in lst]
    t3 = %timeit -o [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average])
    n *= 2

# plotting result:
for_plot = np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')
plt.plot(for_plot[:,0], for_plot[:,2], label='python')
plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
plt.legend()
plt.show()
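Outside of IPython, where the %timeit magic is not available, a similar measurement can be sketched with the standard-library timeit module. Note the assumptions: the function `measure` is my own illustrative helper, and the preprocessing cost is imitated by building a Python set from a list, not by pandas' khash table. Only the shape of the two curves is comparable, not the absolute numbers:

```python
import random
import timeit


def measure(ns, m=10_000, repeat=3):
    """Return rows (n, seconds_lookup_only, seconds_build_plus_lookup)."""
    rng = random.Random(0)
    lst = [rng.randrange(20_000) for _ in range(m)]
    rows = []
    for n in ns:
        x_list = list(range(n))
        x_set = set(x_list)

        # Set already built: only the m look-ups are timed.
        t_lookup = min(timeit.repeat(lambda: [i in x_set for i in lst],
                                     number=1, repeat=repeat))

        # Set built from scratch on every call: O(n) preprocessing + m look-ups.
        def build_and_lookup():
            table = set(x_list)
            return [i in table for i in lst]

        t_build = min(timeit.repeat(build_and_lookup, number=1, repeat=repeat))
        rows.append((n, t_lookup, t_build))
    return rows


for n, t_lookup, t_build in measure([10**3, 10**4, 10**5]):
    print(f"n={n:>7}  lookup only: {t_lookup:.5f}s  build+lookup: {t_build:.5f}s")
```

Taking the minimum over a few repetitions is a common way to reduce noise from other processes.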

C: cpp-wrapper:

%%cython --cplus -c=-std=c++11 -a

from libcpp.unordered_set cimport unordered_set

cdef class HashSet:
    # thin wrapper around std::unordered_set<long long>
    cdef unordered_set[long long int] s

    cpdef add(self, long long int z):
        self.s.insert(z)

    cpdef bint contains(self, long long int z):
        return self.s.count(z) > 0

import numpy as np
cimport numpy as np

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):
    # np.bool_ is used because the plain np.bool alias is missing in some numpy versions
    cdef np.ndarray[np.uint8_t, ndim=1, cast=True] res = np.empty(a.shape[0], dtype=np.bool_)
    cdef int i
    for i in range(a.size):
        res[i] = b.contains(a[i])
    return res

D: plotting results with different set-wrappers:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

from cykhash import Int64Set

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n = 10**3
result = []
while n < 3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))

    # fill both C-level sets with the same n elements
    cpp_set = HashSet()
    khash_set = Int64Set()
    for i in x_set:
        cpp_set.add(i)
        khash_set.add(i)

    # sanity check: both wrappers must agree with pd.Series.isin
    assert((ser.isin(x_arr).values == isin_cpp(ser.values, cpp_set)).all())
    assert((ser.isin(x_arr).values == isin_khash(ser.values, khash_set)).all())

    t1 = %timeit -o isin_khash(ser.values, khash_set)
    t2 = %timeit -o isin_cpp(ser.values, cpp_set)
    t3 = %timeit -o [i in x_set for i in lst]
    t4 = %timeit -o [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average, t4.average])
    n *= 2

# plotting result:
for_plot = np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='khash')
plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')
plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')
plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
ymin, ymax = plt.ylim()
plt.ylim(0, ymax)
plt.legend()
plt.show()