Speeding up apply when deriving variables in Python

Latest recommended article published 2023-02-23 21:17:13

jin_tmac

Categories: Machine Learning and Data Mining, python. Tags: python, data analysis, data modeling

Article link: https://blog.csdn.net/jin_tmac/article/details/119679663

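When generating derived variables for data analysis in Python, the apply method is very slow, especially when batch-generating several thousand variables on a fairly large dataset.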

import numpy as np
import pandas as pd

N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
#     A   B
# 0  78  50
# 1  23  91
# 2  55  62
# 3  82  64
# 4  99  80

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a)/b

df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)

Below is a comparison of several approaches, taken from Stack Overflow:

# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0

np.random.seed(0)
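N = 10**5
# Rebuild df at the benchmark size (the frame above only has 10 rows)
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})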

%timeit list(map(divide, df['A'], df['B']))                                   # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B'])                                # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])]                      # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]     # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True)                  # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1)              # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]  # 11.6 s
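Before comparing speed, it is worth confirming that the approaches agree on the result. The following is a minimal sketch of such a check, not part of the original benchmark; the smaller frame size and the zeros allowed in column B (to exercise the `b == 0` guard) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def divide(a, b):
    # Same scalar helper as in the post: guard against division by zero.
    if b == 0:
        return 0.0
    return float(a) / b

rng = np.random.RandomState(0)
# Small illustrative frame; B may contain zeros here (an assumption).
df = pd.DataFrame({'A': rng.randint(1, 100, 1000),
                   'B': rng.randint(0, 100, 1000)})

results = {
    'map': list(map(divide, df['A'], df['B'])),
    'vectorize': np.vectorize(divide)(df['A'], df['B']),
    'zip': [divide(a, b) for a, b in zip(df['A'], df['B'])],
    'itertuples': [divide(a, b)
                   for a, b in df[['A', 'B']].itertuples(index=False)],
    'apply_raw': df.apply(lambda row: divide(*row), axis=1, raw=True),
    'apply': df.apply(lambda row: divide(row['A'], row['B']), axis=1),
}

# Compare every approach against the map-based result.
baseline = np.asarray(results['map'], dtype=float)
all_equal = all(np.allclose(np.asarray(r, dtype=float), baseline)
                for r in results.values())
print(all_equal)
```

One caveat with np.vectorize: unless otypes is given, it infers the output dtype from the first call, so the helper should return a float on every branch, as divide does here.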

Some takeaways:

The tuple-based methods (the first 4) are far more efficient than the pd.Series-based methods (the last 3): in the timings above the gap ranges from roughly 7x (112 ms vs. 760 ms) to more than 250x (43.9 ms vs. 11.6 s).
The np.vectorize, list comprehension + zip, and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuples and bypass some of the Pandas overhead incurred by pd.DataFrame.itertuples.
There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
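The last point can be seen directly by recording what type the function receives for each row. This is a small sketch on a hypothetical three-row frame; the lists `seen_default` and `seen_raw` are names introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

seen_default = []  # type names seen without raw=True
seen_raw = []      # type names seen with raw=True

# list.append returns None, so "or 0" makes each lambda return 0.
df.apply(lambda row: seen_default.append(type(row).__name__) or 0, axis=1)
df.apply(lambda row: seen_raw.append(type(row).__name__) or 0,
         axis=1, raw=True)

print(set(seen_default))  # {'Series'}
print(set(seen_raw))      # {'ndarray'}
```

Because the rows arrive as plain arrays, labels such as row['A'] are not available with raw=True; the function has to use positions, which is why the benchmark calls divide(*row).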

There is an even faster approach:

from numba import njit

@njit
def divide(a, b):
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values)  # 717 µs
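If numba is not available, the same guarded division can usually be written with plain NumPy array operations. The sketch below is an alternative that does not come from the original comparison, so no timing is claimed for it; `divide_vectorized` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def divide_vectorized(a, b):
    # Work on float arrays so the output buffer has a float dtype.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Start from zeros; positions where b == 0 keep that default.
    out = np.zeros_like(a)
    # Divide only where the denominator is non-zero.
    np.divide(a, b, out=out, where=(b != 0))
    return out

df = pd.DataFrame({'A': [10, 7, 3], 'B': [2, 0, 4]})
df['result'] = divide_vectorized(df['A'].values, df['B'].values)
print(df['result'].tolist())  # [5.0, 0.0, 0.75]
```

Using the where argument together with a pre-filled out array avoids both the Python-level loop and the divide-by-zero warning that a plain a / b would raise.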