modin
I recently ran across a neat little library called Modin that claims to run pandas faster. The one-line description they use for the project is:
Speed up your Pandas workflows by changing a single line of code
Interesting…and important if true.
Using modin only requires importing modin instead of pandas, and that’s it…no other changes are required to your existing code.
One caveat: modin currently uses pandas 0.20.3 (at least, it installs pandas 0.20 when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to hold off on trying modin, or experiment with getting it to work with the latest version of pandas (I haven’t done that yet).
To install modin:
pip install modin
To use modin:
import modin.pandas as pd
That’s it. Rather than import pandas as pd, you import modin.pandas as pd and you get all the advantages of additional speed.
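If a script has to run on machines where modin may not be installed, the one-line swap can be wrapped in a fallback. This is only a minimal sketch, not something from the modin documentation: the helper name first_importable is made up for illustration, and it assumes your code sticks to the part of the pandas API that modin also provides.

```python
import importlib


def first_importable(candidates):
    """Return the first module in `candidates` that imports cleanly.

    `first_importable` is a hypothetical helper, not part of modin or pandas.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module (or one of its dependencies) is missing; try the next one.
            continue
    raise ImportError("none of these modules could be imported: %s" % (candidates,))
```

With that in place, pd = first_importable(["modin.pandas", "pandas"]) gives you modin when it is available and plain pandas otherwise.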
According to the documentation, modin takes advantage of the multiple cores on modern machines, which pandas does not do. From their website:
In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.
Let’s give it a shot and see how it works.
For this test, I’m going to try out their read_csv method, since it’s something they highlight. I have a 105 MB csv file; let’s time both pandas and modin and see how they compare.
We’ll start with pandas.
from functools import reduce  # reduce is no longer a builtin in Python 3
from timeit import default_timer as timer
import pandas as pd

# run 25 iterations of read_csv to get an average
times = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    times.append(end - start)

# print out the average time taken
# I *think* I got this little trick from
# https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, times) / len(times))
With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.
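Since the same timing loop gets reused for modin, it could be pulled into a small helper. The sketch below is one way to do that; average_runtime is a name I made up, and it measures wall-clock time for any zero-argument callable rather than read_csv specifically.

```python
from statistics import mean
from timeit import default_timer as timer


def average_runtime(func, repeats=25):
    """Call func() `repeats` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = timer()
        func()
        durations.append(timer() - start)
    return mean(durations)
```

You would call it as average_runtime(lambda: pd.read_csv('OSMV-20190206.csv')) with whichever pd you imported. Keep in mind that reading the same file 25 times in a row probably benefits from the operating system's file cache, so the average mostly reflects warm reads.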
Now, let’s take a look at modin.
Before continuing, I should share that I had to do a couple of extra steps beyond pip install modin to get modin to work: I had to install typing and dask as well.
pip install "modin[dask]"
pip install typing
The code is exactly the same as above, except for one minor change: import modin.pandas as pd instead of import pandas as pd.
from functools import reduce  # reduce is no longer a builtin in Python 3
from timeit import default_timer as timer
import modin.pandas as pd

# run 25 iterations of read_csv to get an average
times = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    times.append(end - start)

# print out the average time taken
# I *think* I got this little trick from
# https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, times) / len(times))
With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.
Using modin, in this example, I was able to shave 0.3 seconds off the time it takes to read that 105MB csv file. That may not seem like a lot, but if you’ve got 5000 csv files of similar size to read, that’s a savings of 1500 seconds on average…that’s 25 minutes saved just reading files.
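The arithmetic behind those numbers is easy to check. The snippet below just plugs in the two averages measured above; the 5000-file workload is the hypothetical one from the paragraph, not something I actually ran.

```python
pandas_avg = 1.26   # seconds per file, measured above with pandas
modin_avg = 0.96    # seconds per file, measured above with modin
n_files = 5000      # hypothetical workload from the text

saved_per_file = pandas_avg - modin_avg
total_saved = saved_per_file * n_files
speedup = pandas_avg / modin_avg

print("saved per file: %.2f s" % saved_per_file)                            # 0.30 s
print("total saved: %.0f s (%.0f min)" % (total_saved, total_saved / 60))   # 1500 s (25 min)
print("speedup: %.2fx" % speedup)                                           # 1.31x
```

Put another way, modin was roughly 1.3 times as fast as pandas on this particular file.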
Modin uses Ray to speed pandas up, so there could be even more savings if you dig in and play around with some of Ray’s settings.
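As a rough idea of where that tuning could start: newer modin releases read the MODIN_ENGINE and MODIN_CPUS environment variables at import time to pick the execution engine and the number of cores. That is an assumption about the modin version you have installed (the release I tested here may predate these variables), and configure_modin is a made-up helper name, so treat this as a sketch rather than a recipe.

```python
import os


def configure_modin(engine="ray", cpus=None):
    """Set modin's environment variables; call before importing modin.pandas.

    Hypothetical helper. Assumes a modin release that reads MODIN_ENGINE
    and MODIN_CPUS.
    """
    os.environ["MODIN_ENGINE"] = engine
    if cpus is not None:
        os.environ["MODIN_CPUS"] = str(cpus)


configure_modin(engine="ray", cpus=4)
# import modin.pandas as pd  # import only after the variables are set
```

Re-running the timing test with a few different core counts would show whether the defaults are already the best choice for your machine.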
Translated from: https://www.pybloggers.com/2019/02/quick-tip-speed-up-pandas-using-modin/