I'm trying out some tests of dask.bag to prepare for a big text processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that dask is running about 5 to 6 times slower than a straight single-threaded text processing function.
Can someone explain where I'll see the speed benefits of running dask over a large amount of text files? How many files would I have to process before it starts getting faster? Is 150,000 small text files simply too few? What sort of performance parameters should I be tweaking to get dask to speed up when processing files? What could account for a 5x decrease in performance over straight single-threaded text processing?
Here's an example of the code I'm using to test dask out. This is running against a test set of data from Reuters located at:
This data isn't exactly the same as the data I'm working with. In my real case it's a bunch of individual text files, one document per file, but the performance decrease I'm seeing is about the same. Here's the code:
import dask.bag as db
from collections import Counter
import string
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

start = datetime.datetime.now()
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(str([w for w in wordcount]))
print(str(datetime.datetime.now() - start))
Here were my results:
[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]
0:00:02.958721
[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]
0:00:17.877077
Solution
Dask incurs roughly 1 ms of overhead per task. By default, the dask.bag.read_text function creates one task per filename, so with 150,000 small files you pay that overhead 150,000 times. I suspect you're simply being swamped by per-task overhead.
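You can check how many tasks you're creating by looking at the bag's partition count. This small sketch reuses the glob pattern from the question; with the default settings, read_text puts each file in its own partition:

import dask.bag as db

# With the defaults, read_text creates one partition (and therefore at
# least one task) per matching file.
b = db.read_text("./reuters/*.ascii")
print(b.npartitions)  # roughly one partition per file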
The solution here is probably to process several files in each task. The read_text function doesn't give you an option to do this, but you can drop down to dask.delayed, which provides a bit more flexibility, and then convert back to a dask.bag afterwards if you prefer. A sketch of that approach follows.
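Here is a minimal sketch of that idea, assuming the same Reuters glob as above. The read_batch helper, the batch size of 100, and the use of db.from_delayed are illustrative choices to show the pattern, not something prescribed by the answer; each delayed call returns a list of words, and from_delayed turns each of those lists into one partition of the bag, so the frequencies/topk pipeline works as before.

import glob
import dask.bag as db
from dask import delayed

filenames = glob.glob("./reuters/*.ascii")

@delayed
def read_batch(batch):
    # Read an entire batch of files inside one task, so the ~1 ms
    # per-task overhead is paid once per batch rather than once per file.
    words = []
    for fn in batch:
        with open(fn) as f:
            words.extend(f.read().split())
    return words

# A batch size of 100 is an arbitrary starting point; tune it for your data.
batch_size = 100
batches = [filenames[i:i + batch_size] for i in range(0, len(filenames), batch_size)]

# Each delayed value becomes one partition of the resulting bag.
b = db.from_delayed([read_batch(batch) for batch in batches])
wordcount = b.frequencies().topk(5, lambda x: x[1])
print(wordcount.compute())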