python数据分析生成衍生变量的时候,使用apply的方法速度很慢,尤其是遇到批量生成好几千变量,且数据量比较大的情况下。
N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
# A B
# 0 78 50
# 1 23 91
# 2 55 62
# 3 82 64
# 4 99 80
def divide(a, b):
if b == 0:
return 0.0
return float(a)/b
df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
下面是stackoverflow上几个方法的对比:
# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0
np.random.seed(0)
N = 10**5
%timeit list(map(divide, df['A'], df['B'])) # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B']) # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])] # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)] # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True) # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1) # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()] # 11.6 s
Some takeaways:
- The tuple-based methods (the first 4) are a factor more efficient than pd.Series-based methods (the last 3).
- np.vectorize, list comprehension + zip and map methods, i.e. the top 3, all have roughly the same performance. This is because they use tuple and bypass some Pandas overhead from pd.DataFrame.itertuples.
- There is a significant speed improvement from using raw=True with pd.DataFrame.apply versus without. This option feeds NumPy arrays to the custom function instead of pd.Series objects.
还有更快的方法:
from numba import njit
@njit
def divide(a, b):
res = np.empty(a.shape)
for i in range(len(a)):
if b[i] != 0:
res[i] = a[i] / b[i]
else:
res[i] = 0
return res
%timeit divide(df['A'].values, df['B'].values) # 717 µs