Enhancing pandas Performance, Explained (Part 1)

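This article shows how to use Cython to significantly speed up processing of a pandas DataFrame. The series covers three techniques (Cython, Numba and pandas.eval); this part focuses on Cython, moving from a pure Python function to a Cythonized version and then to one that operates on ndarrays, for a final speedup of roughly 200 times.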

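In this series we show how to speed up operations on a pandas DataFrame with three different techniques: Cython, Numba and pandas.eval. On a test function applied row by row to a DataFrame, Cython and Numba run roughly 200 times faster than pure Python.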

This article focuses on Cython.

When should you use Cython?

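For many programs that handle modest amounts of data and have no strict performance requirements, pandas used from pure Python and numpy is enough; its rich, easy-to-use methods let you process data quickly. This article assumes you have already refactored your code to use numpy methods and removed for loops wherever possible.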
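As an illustration of the refactoring this article assumes has already happened, here is a minimal sketch that replaces an explicit Python loop with a single NumPy expression. The function names f_loop and f_vectorized, the fixed seed and the array size are assumptions made for the example, not part of the original code.

```python
import numpy as np

def f_loop(values):
    """Apply x * (x - 1) element by element with an explicit Python loop."""
    out = []
    for x in values:
        out.append(x * (x - 1))
    return out

def f_vectorized(values):
    """Same computation as one NumPy expression; the loop runs in compiled code."""
    arr = np.asarray(values, dtype=np.float64)
    return arr * (arr - 1)

rng = np.random.default_rng(0)      # fixed seed, an arbitrary choice for the sketch
data = rng.standard_normal(1000)

loop_result = f_loop(data)
vec_result = f_vectorized(data)

# Both versions should agree up to floating-point rounding.
same = np.allclose(loop_result, vec_result)
print(same)
```

When a computation can be written this way, vectorizing it is usually the first thing to try; Cython becomes interesting for loops that do not vectorize easily, such as the row-dependent loop length used in the rest of this article.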

Using pure Python code

We create a DataFrame, process it row by row, and measure how fast the program runs.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': np.random.randn(1000),
                   'N': np.random.randint(100, 1000, (1000)),
                   'x': 'x'})

df.head()

Out:
     N         a         b  x
0  623  0.932671  1.663997  x
1  789 -1.034634 -0.899380  x
2  348  1.416209 -0.642386  x
3  766 -0.687798 -1.834033  x
4  687  1.204980 -0.059116  x

Here are the pure Python functions:

def f(x):
    return x * (x-1)

def integrate_f(a, b, N):
    s = 0     
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
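To make clear what integrate_f computes: it is a left Riemann sum approximating the integral of x * (x - 1) from a to b using N slices. The sketch below compares it against the closed-form integral; the helper exact_integral and the chosen interval [0, 2] are assumptions added for illustration.

```python
def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

def exact_integral(a, b):
    """Closed form: an antiderivative of x**2 - x is x**3 / 3 - x**2 / 2."""
    def F(x):
        return x ** 3 / 3 - x ** 2 / 2
    return F(b) - F(a)

exact = exact_integral(0.0, 2.0)
coarse_error = abs(integrate_f(0.0, 2.0, 1000) - exact)
fine_error = abs(integrate_f(0.0, 2.0, 100000) - exact)

# The approximation error shrinks as the number of slices N grows.
print(coarse_error, fine_error)
```

The cost of one call grows linearly with N, which is why the inner loop dominates the profile later in the article.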

We then use apply to call integrate_f on each row of the DataFrame created above:

%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)

Out:
151 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(%timeit is an IPython magic command that times a statement; it only works in IPython and Jupyter Notebook.)
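Outside IPython, a similar measurement can be taken with the standard-library timeit module. The sketch below is one way to do it, timing integrate_f directly rather than the DataFrame apply; the arguments (0.0, 1.0, 500) are arbitrary values picked for the example.

```python
import statistics
import timeit

def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

# 7 measurements of 10 calls each, mirroring "7 runs, 10 loops each" from %timeit.
loops = 10
timings = timeit.repeat(lambda: integrate_f(0.0, 1.0, 500), repeat=7, number=loops)

per_loop = [t / loops for t in timings]
mean_s = statistics.mean(per_loop)
std_s = statistics.pstdev(per_loop)
print(f"{mean_s * 1e6:.1f} µs ± {std_s * 1e6:.1f} µs per loop")
```

The absolute numbers depend on the machine, so only the ratios between versions are meaningful.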

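This takes about 151 ms, which is clearly not fast enough. Next we use the %prun magic to find which parts of the program take the most time (showing the four most time-consuming calls):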

%prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)

Out:
         661861 function calls (656852 primitive calls) in 0.267 seconds
    
    Ordered by: internal time
    List reduced from 141 to 4 due to restriction <4>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1000    0.136    0.000    0.196    0.000 <ipython-input-10-d33f40f5bef5>:1(integrate_f)
  543296    0.061    0.000    0.061    0.000 <ipython-input-9-30a8062c568e>:1(f)
    3000    0.008    0.000    0.046    0.000 base.py:2454(get_value)
    3000    0.005    0.000    0.053    0.000 series.py:598(__getitem__)

As the profile shows, most of the run time is spent in integrate_f and f, so these are the two functions we should Cythonize.
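The same kind of profile can be produced outside IPython with the standard-library cProfile and pstats modules. This is a minimal sketch under the assumption that profiling integrate_f alone is representative; the workload function and its loop bounds are made up for the example.

```python
import cProfile
import io
import pstats

def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

def workload():
    # Hypothetical stand-in for the row-wise apply: 100 calls with varying N.
    for n in range(100, 200):
        integrate_f(0.0, 1.0, n)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by internal time and keep the top 4 entries, like "%prun -l 4".
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(4)
report = buffer.getvalue()
print(report)
```

The tottime column counts time spent inside a function itself, while cumtime also includes the functions it calls, which is why integrate_f and f both rank high.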

Using Cython

First, load the Cython extension, which provides the %%cython cell magic:

%load_ext Cython

Then we copy the functions defined above, unchanged, into a Cython cell:

%%cython
def f(x):
    return x * (x-1)
def integrate_f(a, b, N):
    s = 0     
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

Next, we add type declarations to the functions:

%%cython
cdef double f_typed(double x):
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx

Then we run the timing again:

%timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
24.3 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With this simple Cython typing, the run is about 6 times faster (151 ms down to 24.3 ms). Let's look at where the time goes now:

%prun -l 4 df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)

Out:
         119310 function calls (114290 primitive calls) in 0.068 seconds

   Ordered by: internal time
   List reduced from 211 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3000    0.009    0.000    0.043    0.000 base.py:3090(get_value)
     3000    0.005    0.000    0.050    0.000 series.py:764(__getitem__)
        1    0.004    0.004    0.064    0.064 {pandas._libs.reduction.reduce}
     3000    0.004    0.000    0.004    0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
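The profile above suggests the remaining time is spent indexing the per-row Series (get_value, __getitem__) rather than in the integration itself. The sketch below, in plain Python without Cython, shows the idea behind the next step: pull each column out as an array once and loop over the arrays. The smaller DataFrame size, the seed and the names via_apply and via_arrays are assumptions for the example.

```python
import numpy as np
import pandas as pd

def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
df = pd.DataFrame({"a": rng.standard_normal(50),
                   "b": rng.standard_normal(50),
                   "N": rng.integers(100, 1000, size=50),
                   "x": "x"})

# Row-wise apply: pandas builds a Series per row, then indexes it three times.
via_apply = df.apply(
    lambda row: integrate_f(row["a"], row["b"], int(row["N"])), axis=1)

# Column-wise access: convert each column to an ndarray once, then iterate.
col_a = df["a"].to_numpy()
col_b = df["b"].to_numpy()
col_N = df["N"].to_numpy()
via_arrays = np.array([integrate_f(a, b, int(n))
                       for a, b, n in zip(col_a, col_b, col_N)])

same = np.allclose(via_apply.to_numpy(dtype=float), via_arrays)
print(same)
```

Both paths give the same numbers; the array version simply avoids creating and indexing a Series for every row, which is the overhead the Cython rewrite removes as well.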

Optimizing further with ndarray

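The profile shows that the time now goes to get_value and __getitem__, that is, to indexing the Series that apply builds for every row. Iterating over DataFrame rows in Python is expensive. Since ndarray is implemented in C, we rewrite the loop in Cython so that it operates directly on ndarrays: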

%%cython
cimport numpy as np
import numpy as np
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b, np.ndarray col_N):
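    # Check dtypes before looping (the np.float / np.int aliases no longer exist in NumPy >= 1.24)
    assert (col_a.dtype == np.float64 and col_b.dtype == np.float64 and col_N.dtype == np.dtype(int))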
    cdef Py_ssize_t i, n = len(col_N)
    assert (len(col_a) == len(col_b) == n)
    cdef np.ndarray[double] res = np.empty(n)
    for i in range(len(col_a)):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
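Before calling the Cython function, it can help to check in plain Python that the column arrays satisfy what its assertions expect: float64 for a and b, the platform default integer for N, and equal lengths. The sketch below uses to_numpy(), which returns the same ndarray as .values for these columns; the variable names dtypes_ok and lengths_ok are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(1000),
                   "b": np.random.randn(1000),
                   "N": np.random.randint(100, 1000, 1000),
                   "x": "x"})

# One ndarray per column; no per-row Series objects are involved.
col_a = df["a"].to_numpy()
col_b = df["b"].to_numpy()
col_N = df["N"].to_numpy()

dtypes_ok = (col_a.dtype == np.float64
             and col_b.dtype == np.float64
             and col_N.dtype == np.dtype(int))
lengths_ok = len(col_a) == len(col_b) == len(col_N)

print(col_a.dtype, col_b.dtype, col_N.dtype, dtypes_ok, lengths_ok)
```

The integer dtype printed here differs between platforms (int32 or int64), which matters once the array arguments are given explicit element types in Cython.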

Note that each column is converted to an ndarray through its .values attribute. Now let's look at the run time of the rewritten version:

%timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
1 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Rewriting the loop to use ndarrays brings the run time down to about 1 ms. Exciting, isn't it? Profiling again:

%prun -l 4 apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)

Out:
         214 function calls in 0.003 seconds

   Ordered by: internal time
   List reduced from 54 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 {built-in method _cython_magic_661a12b675f8fc2280d89ec29d0e4b5e.apply_integrate_f}
        1    0.000    0.000    0.003    0.003 {built-in method builtins.exec}
        1    0.000    0.000    0.003    0.003 <string>:1(<module>)
        3    0.000    0.000    0.000    0.000 frame.py:3100(_box_col_values)

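From the profile above, nearly all of the run time is now spent inside apply_integrate_f itself, so we can improve it further by typing the array arguments and turning off bounds checking and negative-index wraparound: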

%%cython
cimport cython
cimport numpy as np
import numpy as np
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
@cython.boundscheck(False)
@cython.wraparound(False)
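# np.ndarray[int] expects a buffer of C int (usually 32-bit). If df['N'] is int64,
# Cython raises "Buffer dtype mismatch"; in that case declare np.ndarray[np.int64_t]
# here, or create the N column with dtype='int32'.
cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a, np.ndarray[double] col_b, np.ndarray[int] col_N):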
    cdef int i, n = len(col_N)
    assert len(col_a) == len(col_b) == n
    cdef np.ndarray[double] res = np.empty(n)
    for i in range(n):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

%timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
696 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

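With these optimizations, the run time dropped from 151 ms to 0.696 ms, a speedup of roughly 200 times. When pure Python is too slow, rewriting the hot loops this way will resolve most performance problems. To learn more about Cython, see the official documentation.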
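The speedup figures follow directly from the timings reported in this article. The short sketch below recomputes them; the dictionary labels are descriptive names chosen for the example, while the millisecond values are the ones measured above.

```python
# Mean run times reported by %timeit in this article, in milliseconds.
timings_ms = {
    "pure Python + apply": 151.0,
    "typed Cython + apply": 24.3,
    "Cython over ndarrays": 1.0,
    "Cython, boundscheck/wraparound off": 0.696,
}

baseline = timings_ms["pure Python + apply"]
speedups = {name: baseline / t for name, t in timings_ms.items()}

for name, t in timings_ms.items():
    print(f"{name}: {t} ms, {speedups[name]:.1f}x faster than the baseline")
```

The largest single jump comes from dropping the row-wise apply in favor of ndarrays, not from the type declarations alone.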