When should I use eval()? DataFrame.eval() vs. pandas.eval() vs. Python's eval()

I have a few dozen conditions (e.g., foo > bar) that I need to evaluate against ~1MM rows of a DataFrame, and the most concise way to write this is to store the conditions as a list of strings and build a DataFrame of boolean results (one row per record x one column per condition). (No user input is being evaluated.)

In the quest for premature optimization, I am trying to determine whether I should write these conditions for evaluation within the DataFrame (e.g., df.eval("foo > bar")) or just leave it to Python, as in eval("df.foo > df.bar").
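For concreteness, the "list of condition strings → boolean DataFrame" approach looks roughly like this (a toy-sized sketch; the column names and row count are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# stand-in for the real ~1MM-row frame
df = pd.DataFrame({'foo': np.random.uniform(size=1000),
                   'bar': np.random.uniform(size=1000)})

# each condition is a string that df.eval() can parse
conditions = ["foo > bar", "foo + bar > 1"]

# one row per record x one column per condition
results = pd.DataFrame({cond: df.eval(cond) for cond in conditions})
```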

You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions/objects than plain ol' Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.

It would be nice to be able to use the df.eval("foo > bar") syntax, because my list would be a little more readable, but I can never find a case where it isn't slower to evaluate. The documentation shows examples where pandas.eval() is faster than Python's eval() (which matches my experience), but none for DataFrame.eval() (which is listed as 'Experimental').

For example, DataFrame.eval() is still a clear loser on a non-trivial expression over a largish DataFrame:

import pandas as pd
import numpy as np
import numexpr
import timeit

someDf = pd.DataFrame({'a': np.random.uniform(size=int(1e6)),
                       'b': np.random.uniform(size=int(1e6))})

%timeit -n100 someDf.eval("a**b - a*b > b**a - b/a")  # DataFrame.eval() on notional expression
%timeit -n100 eval("someDf['a']**someDf['b'] - someDf['a']*someDf['b'] > someDf['b']**someDf['a'] - someDf['b']/someDf['a']")
%timeit -n100 pd.eval("someDf.a**someDf.b - someDf.a*someDf.b > someDf.b**someDf.a - someDf.b/someDf.a")

100 loops, best of 3: 29.9 ms per loop
100 loops, best of 3: 18.7 ms per loop
100 loops, best of 3: 15.4 ms per loop

So is the benefit of DataFrame.eval() merely in simplifying the input, or can we identify circumstances where using this method is actually faster?

Are there any other guidelines for when to use which eval()? (I'm aware that pandas.eval() does not support the complete set of operations.)

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
pandas: 0.18.0
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

Solution

So is the benefit of DataFrame.eval() merely in simplifying the input, or can we identify circumstances where using this method is actually faster?

The source code for DataFrame.eval() shows that it actually just creates arguments to pass to pd.eval():

def eval(self, expr, inplace=None, **kwargs):
    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

Where _eval() is just an alias for pd.eval() which is imported at the beginning of the module:

from pandas.core.computation.eval import eval as _eval

So anything that you can do with df.eval(), you could do with pd.eval() + a few extra lines to set things up. As things currently stand, df.eval() is never strictly faster than pd.eval(). But that doesn't mean there can't be cases where df.eval() is just as good as pd.eval(), yet more convenient to write.
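To illustrate the equivalence, here is a sketch of what those "few extra lines" can look like. The local_dict keyword is one way to hand pd.eval() the column name resolution that df.eval() performs automatically:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.uniform(size=100),
                   'b': np.random.uniform(size=100)})

# df.eval() resolves the bare column names 'a' and 'b' against the frame for you
via_df = df.eval("a**b > b/a")

# pd.eval() needs the name resolution supplied explicitly, e.g. via local_dict
via_pd = pd.eval("a**b > b/a",
                 local_dict={col: df[col] for col in df.columns})

# both routes produce the same boolean results
assert bool((via_df == via_pd).all())
```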

However, after playing around with the %prun magic it appears that the call by df.eval() to df._get_index_resolvers() adds on a fair bit of time to the df.eval() method. Ultimately, _get_index_resolvers() ends up calling the .copy() method of numpy.ndarray, which is what ends up slowing things down. Meanwhile, pd.eval() does call numpy.ndarray.copy() at some point, but it takes a negligible amount of time (on my machine at least).

Long story short, it appears that df.eval() tends to be slower than pd.eval() because under the hood it's just pd.eval() with extra steps, and these steps are non-trivial.
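The overhead can be checked outside IPython with the timeit module (a sketch; the absolute numbers will vary by machine and pandas version, so only the relative comparison is meaningful):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.uniform(size=int(1e5)),
                   'b': np.random.uniform(size=int(1e5))})

# time df.eval() vs. pd.eval() on the same expression
t_df = timeit.timeit(lambda: df.eval("a > b"), number=50)
t_pd = timeit.timeit(lambda: pd.eval("df.a > df.b", local_dict={'df': df}),
                     number=50)

print("df.eval: %.4fs   pd.eval: %.4fs" % (t_df, t_pd))
```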
