Does using the Python Dask package decrease performance?

I'm trying out some tests of dask.bag to prepare for a big text processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that dask is running about 5 to 6 times slower than a straight single-threaded text processing function.

Can someone explain where I'll see the speed benefits of running dask over a large amount of text files? How many files would I have to process before it starts getting faster? Is 150,000 small text files simply too few? What sort of performance parameters should I be tweaking to get dask to speed up when processing files? What could account for a 5x decrease in performance over straight single-threaded text processing?

Here's an example of the code I'm using to test dask out. This is running against a test set of data from Reuters located at:

This data isn't exactly the same as the data I'm working against. In my other case it's a bunch of individual text files, one document per file, but the performance decrease I'm seeing is about the same. Here's the code:

import dask.bag as db

from collections import Counter
import string
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

start = datetime.datetime.now()
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(str([w for w in wordcount]))
print(str(datetime.datetime.now() - start))

Here are my results (single-threaded first, then dask):

[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]

0:00:02.958721

[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]

0:00:17.877077

Solution

Dask incurs roughly 1 ms of scheduling overhead per task. By default, the dask.bag.read_text function creates one task per filename, so 150,000 files means about 150 seconds of overhead alone. I suspect that you're simply being swamped by per-task overhead.

The solution here is probably to process many files in a single task. The read_text function doesn't give you an option to do this, but you can switch to dask.delayed, which provides a bit more flexibility, and then convert the result back to a dask.bag afterwards if you prefer.
