I'm trying out some tests of dask.bag to prepare for a big text processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that dask is running about 5 to 6 times slower than a straight single-threaded text processing function.
Can someone explain where I'll see the speed benefits of running dask over a large amount of text files? How many files would I have to process before it starts getting faster? Is 150,000 small text files simply too few? What sort of performance parameters should I be tweaking to get dask to speed up when processing files? What could account for a 5x decrease in performance over straight single-threaded text processing?
Here's an example of the code I'm using to test dask out. This is running against a test set of data from Reuters located at:
This data isn't exactly the same as the data I'm working with. In my real case it's a bunch of individual text files, one document per file, but the performance decrease I'm seeing is about the same. Here's the code:
import dask.bag as db
from collections import Counter
import string
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

start = datetime.datetime.now()
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(str([w for w in wordcount]))
print(str(datetime.datetime.now() - start))
Here were my results:
[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]
0:00:02.958721
[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]
0:00:17.877077
Solution
Dask incurs roughly 1 ms of overhead per task. By default, the dask.bag.read_text function creates one task per filename, so with 150,000 small files you pay that overhead 150,000 times. I suspect you're simply being swamped by per-task overhead.
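You can check how many tasks you're creating by looking at the bag's partition count. This small sketch reuses the glob pattern from the question; with the default settings, read_text puts each file in its own partition:

import dask.bag as db

# With the defaults, read_text creates one partition (and therefore at
# least one task) per matching file.
b = db.read_text("./reuters/*.ascii")
print(b.npartitions)  # roughly one partition per file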
The solution here is probably to process several files in each task. The read_text function doesn't give you an option to do this, but you can drop down to dask.delayed, which provides a bit more flexibility, and then convert back to a dask.bag afterwards if you prefer. A sketch of that approach follows.
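Here is a minimal sketch of that idea, assuming the same Reuters glob as above. The read_batch helper, the batch size of 100, and the use of db.from_delayed are illustrative choices to show the pattern, not something prescribed by the answer; each delayed call returns a list of words, and from_delayed turns each of those lists into one partition of the bag, so the frequencies/topk pipeline works as before.

import glob
import dask.bag as db
from dask import delayed

filenames = glob.glob("./reuters/*.ascii")

@delayed
def read_batch(batch):
    # Read an entire batch of files inside one task, so the ~1 ms
    # per-task overhead is paid once per batch rather than once per file.
    words = []
    for fn in batch:
        with open(fn) as f:
            words.extend(f.read().split())
    return words

# A batch size of 100 is an arbitrary starting point; tune it for your data.
batch_size = 100
batches = [filenames[i:i + batch_size] for i in range(0, len(filenames), batch_size)]

# Each delayed value becomes one partition of the resulting bag.
b = db.from_delayed([read_batch(batch) for batch in batches])
wordcount = b.frequencies().topk(5, lambda x: x[1])
print(wordcount.compute())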