Some Notes on Tim Bray's Wide Finder Benchmark
Fredrik Lundh | Updated October 12, 2007 | Originally posted October 6, 2007
The Problem
Tim Bray recently posted about his experiences from using Erlang to do some straightforward parsing of a large log file, inspired by a chapter he wrote for the book Beautiful Code. As it turned out, Erlang isn’t exactly optimized for tasks like this. After trying to parse a 1,000,000-line log file, Tim notes:
“My first cut in Erlang, based on the per-process dictionary, took around eight minutes of CPU, and kept one of my MacBook’s Core Duo processors pegged at 97% while it was running. Ouch!”
That’s less than a half megabyte per second. Not very impressive. Let’s see if we can come up with something better in Python.
A Single-Threaded Python Solution
Santiago Gala followed up on Tim’s original post with a nice map/reduce-based implementation in Python:
http://memojo.com/~sgala/blog/2007/09/29/Python-Erlang-Map-Reduce
Santiago’s script uses a series of nested generators to do filtering and mapping, and then uses a for-in-loop to reduce the mapped stream into a dictionary.
To benchmark the script, I created a sample by concatenating 100 copies of Tim’s original 10,000-line sample file. With that file, Santiago’s script needs about 6.7 seconds wall-time to parse 200 megabytes of log data on my Core Duo laptop (using Windows XP, warmed-up disk caches, and the final print statement replaced with a pass).
Tim’s 1.67 GHz Core Duo L2400 MacBook should match my 1.66 GHz Core Duo T2300 HP notebook pretty well, so that’s about 70 times faster than his Erlang program, and about twice as fast as his Ruby version. Not too shabby.
But we can speed things up a bit more, of course.
Compiling the RE
Python’s RE engine caches compiled expressions, but it’s usually a good idea to move the cache lookup out of the inner loop anyway. And while we’re at it, we can move the method lookup out of the loop as well:
pat = r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) "
search = re.compile(pat).search
matches = (search(line) for line in file("o10k.ap"))
With these changes, the script finishes in 4.1 seconds.
Skipping lines that cannot match
Somewhat less obvious is the fact that we can use Python’s in operator to filter out lines that cannot match:
matches = (search(line) for line in file("o10k.ap")
           if "GET /ongoing/When" in line)
The RE engine does indeed use special code for literal prefixes, but the sublinear substring search algorithm that was introduced in Python 2.5 is a lot faster in cases like this, so this simple change gives a noticeable speedup; the script now runs in 2.9 seconds.
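As a rough illustration of the two tricks so far, here is a minimal Python 3 sketch (the article's own code is Python 2). The `count_hits` name and the sample log lines are made up for the example, and the sketch says nothing about timings:

```python
import re
from collections import Counter

PAT = r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) "
search = re.compile(PAT).search  # compile once, look up the method once

def count_hits(lines):
    """Count article fetches, skipping lines that cannot match."""
    hits = Counter()
    for line in lines:
        # cheap substring test first; the RE only runs on candidate lines
        if "GET /ongoing/When" in line:
            m = search(line)
            if m:
                hits[m.group(1)] += 1
    return hits

# hypothetical log lines, for demonstration only
sample = [
    '1.2.3.4 - - [01/Oct/2007] "GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1" 200 1',
    '1.2.3.4 - - [01/Oct/2007] "GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1" 200 1',
    '1.2.3.4 - - [01/Oct/2007] "GET /ongoing/ongoing.atom HTTP/1.1" 304 -',
    '1.2.3.4 - - [01/Oct/2007] "GET /ongoing/When/200x/2007/09/29/pic.png HTTP/1.1" 200 1',
]
print(count_hits(sample))
```

The third line fails the substring test and never reaches the RE engine; the fourth passes the substring test but is rejected by the pattern, because the page name contains a dot.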
Reading files in binary mode (Windows)
On Windows (and in theory, on other platforms that distinguish between text files and binary files), data read via the standard file object are scanned for Windows-style line endings (“\r\n”). Any such character combination is then translated to a single newline, for consistency.
This is of course very convenient, since it allows you to treat text files in the same way no matter what platform you’re on, but on files this large, the performance penalty starts to become noticeable.
We can turn this off simply by passing the “rb” mode flag (read binary) when opening the file:
matches = (search(line) for line in file("o10k.ap", "rb")
           if "GET /ongoing/When" in line)
The file object will still break things up in lines, and our code doesn’t look at the line endings, so we still get the same result. Just a bit quicker.
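To make the point concrete, here is a small Python 3 sketch. It assumes a made-up two-line log with Windows-style line endings, written to a temporary file. In binary mode the lines still end at the newline, the carriage return simply stays in the data, and a bytes pattern matches as before:

```python
import os
import re
import tempfile

# bytes pattern, since binary mode yields bytes objects in Python 3
search = re.compile(
    rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ").search

def hits_in_binary_mode(path):
    """Return (raw_lines, matched_pages), read without newline translation."""
    with open(path, "rb") as f:
        lines = list(f)  # still split at b"\n"; b"\r" stays in the data
    pages = [m.group(1) for m in map(search, lines) if m]
    return lines, pages

# hypothetical sample data with "\r\n" line endings
data = (b'"GET /ongoing/When/200x/2007/10/06/Notes HTTP/1.1" 200\r\n'
        b'"GET /favicon.ico HTTP/1.1" 404\r\n')
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    lines, pages = hits_in_binary_mode(path)
finally:
    os.remove(path)

print(len(lines), lines[0].endswith(b"\r\n"), pages)
```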
The Code
Here’s the final version of Santiago’s script:
import re
from collections import defaultdict

FILE = "o1000k.ap"

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
search = pat.search

# map
matches = (search(line) for line in file(FILE, "rb") if "GET /ongoing/When" in line)
mapp = (match.group(1) for match in matches if match)

# reduce
count = defaultdict(int)
for page in mapp:
    count[page] += 1

# print the ten most frequently fetched pages
for key in sorted(count, key=count.get, reverse=True)[:10]:
    print "%40s = %s" % (key, count[key])
To get a version that’s set up for benchmarking, get the wf-2.py file from this directory:
This version of the script finishes in 1.9 seconds. This is a 3.5x speedup over Santiago’s version, and over 250x faster than Tim’s Erlang version. Pretty good for a short single-threaded script, don’t you think?
But I’m running this on a Core Duo machine. Two CPU cores, that is. What about using them both for this task?
A Multi-Threaded Python Solution
To run multiple subtasks in parallel, we need to split the task up in some way. Since the program reads a single text file, the easiest way to do that is to split the file into multiple pieces on the way in. Here’s a simple function that rushes through the file, splitting it up in 1 megabyte chunks, and returns chunk offsets and sizes:
def getchunks(file, size=1024*1024):
    # binary mode, so that tell and seek work on exact byte offsets
    f = open(file, "rb")
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            break
By default, this splits the file in megabyte-sized chunks:
>>> for chunk in getchunks("o1000k.ap"):
...     print chunk
(0L, 1048637L)
(1048637L, 1048810L)
(2097447L, 1048793L)
(3146240L, 1048603L)
Note the use of readline to make sure that each chunk ends at a newline character. (Without this, there’s a small chance that we’ll miss some entries here and there. This is probably not much of a problem in practice, but let’s stick to the exact solution for now.)
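The chunking idea can be checked with a short Python 3 sketch. This is a variation on the function above, not a copy of it: it seeks to absolute offsets and clamps the last chunk to the file size, so that the chunks tile the file exactly. The sample data and the small chunk size are assumptions made for the demonstration:

```python
import os
import tempfile

def getchunks(path, size=1024 * 1024):
    """Yield (offset, length) pairs that tile the file on line boundaries."""
    total = os.path.getsize(path)
    with open(path, "rb") as f:      # binary mode keeps offsets byte-exact
        start = 0
        while start < total:
            f.seek(start + size)     # jump roughly one chunk ahead...
            f.readline()             # ...then run on to the end of that line
            end = min(f.tell(), total)
            yield start, end - start
            start = end

# hypothetical sample data: 200 short lines of varying length
data = "".join("line number %d\n" % i for i in range(200)).encode("ascii")
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    chunks = list(getchunks(path, size=256))
    pieces = []
    with open(path, "rb") as f:
        for offset, length in chunks:
            f.seek(offset)
            pieces.append(f.read(length))
finally:
    os.remove(path)

print(len(chunks), all(piece.endswith(b"\n") for piece in pieces))
```

Every piece ends with a newline, and joining the pieces gives back the original data, so no log entry is split across two chunks.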
So, given a list of chunks, we need something that takes a chunk, and produces a partial result. Here’s a first attempt, where the map and reduce steps are combined into a single loop:
pat = re.compile(...)

def process(file, chunk):
    # binary mode, to match the byte offsets produced by getchunks
    f = open(file, "rb")
    f.seek(chunk[0])
    d = defaultdict(int)
    search = pat.search
    for line in f.read(chunk[1]).splitlines():
        if "GET /ongoing/When" in line:
            m = search(line)
            if m:
                d[m.group(1)] += 1
    return d
Note that we cannot loop over the file itself, since we need to stop when we reach the end of the chunk, not the end of the file. The above version solves this by reading the entire chunk, and then splitting it into lines.
To test this code, we can loop over the chunks and feed them to the process function, one by one, and combine the result:
count = defaultdict(int)
for chunk in getchunks(file):
    for key, value in process(file, chunk).items():
        count[key] += value
This version is a bit slower than the non-chunked version on my machine; one pass over the 200 megabyte file takes about 2.6 seconds.
However, since a chunk is guaranteed to contain a full set of lines, we can speed things up a bit more by looking for matches in the chunk itself instead of splitting it into lines:
def process(file, chunk):
    # binary mode, to match the byte offsets produced by getchunks
    f = open(file, "rb")
    f.seek(chunk[0])
    d = defaultdict(int)
    for page in pat.findall(f.read(chunk[1])):
        d[page] += 1
    return d
With this change, the time drops to 1.8 seconds (3.7x faster than the original version).
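For readers on Python 3, the per-chunk map step and the merge step might look roughly like the sketch below. It swaps `defaultdict(int)` for `collections.Counter`, whose `update` method adds counts. The `count_all` helper, the sample lines and the hand-written chunk list are assumptions made for the example:

```python
import os
import re
import tempfile
from collections import Counter

pat = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

def process(path, chunk):
    """Count the matches in one (offset, length) chunk of the file."""
    offset, length = chunk
    with open(path, "rb") as f:
        f.seek(offset)
        # search the whole chunk in one call instead of line by line
        return Counter(pat.findall(f.read(length)))

def count_all(path, chunks):
    """Merge the partial results from every chunk."""
    total = Counter()
    for chunk in chunks:
        total.update(process(path, chunk))  # update() adds the counts
    return total

# hypothetical three-line log, split by hand at a line boundary
line_a = b'"GET /ongoing/When/200x/2007/10/06/A HTTP/1.1" 200\n'
line_b = b'"GET /ongoing/When/200x/2007/10/06/B HTTP/1.1" 200\n'
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(line_a + line_b + line_a)
    chunks = [(0, len(line_a)), (len(line_a), len(line_b) + len(line_a))]
    total = count_all(path, chunks)
finally:
    os.remove(path)

print(total)
```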
The next step is to set things up so we can do the processing in parallel. First, we’ll call the process function from a standard “worker thread” wrapper:
import threading, Queue

# job queue
queue = Queue.Queue()

# result queue
result = []

class Worker(threading.Thread):
    def run(self):
        while 1:
            args = queue.get()
            if args is None:
                break
            result.append(process(*args))
            queue.task_done()
This uses the standard “worker thread” pattern, with a thread-safe Queue for pending jobs, and a plain list object to collect the results (list.append is an atomic operation in CPython).
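As a standalone illustration of the pattern, here is a Python 3 sketch (where the module is named `queue`). The `work` function is a placeholder standing in for `process`, and the explicit sentinel shutdown at the end is an addition for the demo rather than something the article's script does:

```python
import queue
import threading

jobs = queue.Queue()    # thread-safe queue of pending jobs
results = []            # plain list; list.append is atomic in CPython

def work(n):
    # placeholder job (an assumption for the demo); stands in for process()
    return n * n

class Worker(threading.Thread):
    def run(self):
        while True:
            args = jobs.get()
            try:
                if args is None:     # sentinel: shut this worker down
                    break
                results.append(work(*args))
            finally:
                jobs.task_done()     # acknowledge jobs and sentinels alike

workers = [Worker(daemon=True) for _ in range(4)]
for w in workers:
    w.start()
for n in range(10):
    jobs.put((n,))
jobs.join()             # block until every job has been processed
for _ in workers:
    jobs.put(None)      # one sentinel per worker
for w in workers:
    w.join()
print(sorted(results))
```

Results arrive in whatever order the workers finish, which is why the demo sorts them before printing.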
To finish the script, just create a bunch of workers, give them something to do (via the queue), and collect the results into a single dictionary:
for i in range(4):
    w = Worker()
    w.setDaemon(1)
    w.start()

for chunk in getchunks(file):
    queue.put((file, chunk))

queue.join()

count = defaultdict(int)
for item in result:
    for key, value in item.items():
        count[key] += value
With a single thread, this runs in about 1.8 seconds (same as the non-threaded version). When we increase the number of threads, things are improved:
- Two threads: 1.9 seconds
- Three: 1.7 seconds
- Four to eight: 1.6 seconds
For this specific test, the ideal number appears to be three threads per CPU. With fewer threads, the CPU:s will occasionally get stuck waiting for I/O.
Or perhaps they’re waiting for the interpreter itself; Python uses a global interpreter lock to protect the interpreter internals from simultaneous access, so there’s probably some fighting over the interpreter going on as well. To get even more performance out of this, we need to get around the lock in some way.
Luckily, for this kind of problem, the solution is straightforward.
A Multi-Processor Python Solution
To fully get around the interpreter lock, we need to run each subtask in a separate process. An easy way to do that is to let each worker thread start an associated process, send it a chunk, and read back the result. To make things really simple, and also portable, we’ll use the script itself as the subprocess, and use a special option to enter “subprocess” mode.
Here’s the updated worker thread:
import subprocess, sys

executable = [sys.executable]
if sys.platform == "win32":
    executable.append("-u") # use raw mode on windows

class Worker(threading.Thread):
    def run(self):
        process = subprocess.Popen(
            executable + [sys.argv[0], "--process"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE
            )
        stdin = process.stdin
        stdout = process.stdout
        while 1:
            cmd = queue.get()
            if cmd is None:
                putobject(stdin, None)
                break
            putobject(stdin, cmd)
            result.append(getobject(stdout))
            queue.task_done()
where the getobject and putobject helpers are defined as:
import marshal, struct

def putobject(file, object):
    data = marshal.dumps(object)
    file.write(struct.pack("I", len(data)))
    file.write(data)
    file.flush()

def getobject(file):
    try:
        n = struct.unpack("I", file.read(4))[0]
    except struct.error:
        return None
    return marshal.loads(file.read(n))
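The framing used by these helpers (a 4-byte length followed by the marshalled object) can be tried without any subprocess at all. The Python 3 sketch below round-trips two objects through an in-memory stream; the explicit little-endian `"<I"` format and the length check in `getobject` are small changes made for the example, not the article's exact code:

```python
import io
import marshal
import struct

def putobject(stream, obj):
    data = marshal.dumps(obj)
    stream.write(struct.pack("<I", len(data)))  # fixed 4-byte length prefix
    stream.write(data)
    stream.flush()

def getobject(stream):
    header = stream.read(4)
    if len(header) < 4:         # end of stream, or a truncated header
        return None
    (n,) = struct.unpack("<I", header)
    return marshal.loads(stream.read(n))

buf = io.BytesIO()              # stands in for the pipe between processes
putobject(buf, ("o1000k.ap", (0, 1048637)))
putobject(buf, {"2007/10/06/Notes": 3})
buf.seek(0)
first, second, third = getobject(buf), getobject(buf), getobject(buf)
print(first, second, third)
```

The third read hits the end of the stream and returns None, which is the same value the article uses as its "no more work" signal.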
The worker thread runs a copy of the script itself, and passes in the “--process” option. To enter subprocess mode, we need to look for that before we do anything else:
if "--process" in sys.argv:
    stdin = sys.stdin
    stdout = sys.stdout
    while 1:
        args = getobject(stdin)
        if args is None:
            sys.exit(0) # done
        result = process(*args)
        putobject(stdout, result)
else:
    ... create worker threads ...
With this approach, the processing time drops to 1.2 seconds, when using two threads/processes (one per CPU). But that’s about as good as it gets; adding more processes doesn’t really improve things on this machine.
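The parent/child conversation can be sketched in Python 3 as follows. To keep the example in one file, the child program is passed to the interpreter with `-c` instead of re-running the script with a “--process” option, and the child's "work" is a placeholder that just adds up a list of numbers; both are assumptions made for the demo:

```python
import marshal
import struct
import subprocess
import sys

# Child program, kept in a string so that the example is a single file.
CHILD = r'''
import marshal, struct, sys
stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
while True:
    header = stdin.read(4)
    if len(header) < 4:
        break
    job = marshal.loads(stdin.read(struct.unpack("<I", header)[0]))
    if job is None:
        break
    data = marshal.dumps(sum(job))   # placeholder work: add the numbers
    stdout.write(struct.pack("<I", len(data)) + data)
    stdout.flush()
'''

def send(stream, obj):
    data = marshal.dumps(obj)
    stream.write(struct.pack("<I", len(data)) + data)
    stream.flush()

def receive(stream):
    (n,) = struct.unpack("<I", stream.read(4))
    return marshal.loads(stream.read(n))

child = subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
answers = []
for job in ([1, 2, 3], [10, 20]):
    send(child.stdin, job)           # one job out...
    answers.append(receive(child.stdout))  # ...one result back
send(child.stdin, None)              # tell the child to exit
child.stdin.close()
child.stdout.close()
exit_code = child.wait()
print(answers, exit_code)
```

Since each child is a separate interpreter with its own lock, several such children can run on separate cores at the same time.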
Memory Mapping
So, is this the best we can get? Not quite. We can speed up the file access as well, by switching to memory mapping:
import mmap, os

filemap = fileobj = None

def process(file, chunk):
    global filemap, fileobj
    # map the file on first use, or when a different file is requested
    if filemap is None or fileobj.name != file:
        fileobj = open(file)
        filemap = mmap.mmap(
            fileobj.fileno(),
            os.path.getsize(file),
            access=mmap.ACCESS_READ
            )
    d = defaultdict(int)
    for page in pat.findall(filemap, chunk[0], chunk[0]+chunk[1]):
        d[page] += 1
    return d
Note that findall can be applied directly to the mapped region, thanks to Python’s internal memory buffer interface. Also note that the mmap module doesn’t support windowing, so the code needs to map the entire file in each subprocess. This can result in excessive use of virtual memory on some platforms. (Running this on your own log files if you’re on a shared web server is not necessarily a good idea. Yes, I’ve tried ;-)
Anyway, this gets the job done in 0.9 seconds, with the original chunk size. But since we’re mapping the entire file anyway in each subprocess, we can increase the chunk size to reduce the process communication overhead. With 50 megabyte chunks, the script runs in just under 0.8 seconds.
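A Python 3 sketch of the same idea is shown below. For simplicity it maps the file on every call instead of caching the map between calls, as the version above does, and it uses a made-up three-line log. The point it illustrates is that the compiled pattern's findall accepts start and end positions, so a chunk can be searched in place:

```python
import mmap
import os
import re
import tempfile
from collections import Counter

pat = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

def process(path, chunk):
    """Count matches inside one (offset, length) window of a mapped file."""
    offset, length = chunk
    with open(path, "rb") as f:
        # length 0 maps the whole file; the mapping is read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # pos/endpos limit the search without copying the chunk
            return Counter(pat.findall(m, offset, offset + length))

# hypothetical sample data
line_a = b'"GET /ongoing/When/200x/2007/10/06/A HTTP/1.1" 200\n'
line_b = b'"GET /ongoing/When/200x/2007/10/06/B HTTP/1.1" 200\n'
data = line_a + line_b + line_a
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    whole = process(path, (0, len(data)))
    middle = process(path, (len(line_a), len(line_b)))
finally:
    os.remove(path)

print(whole, middle)
```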
Summary
In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:
- Pre-compiled RE patterns
- Fast filtering of candidate lines
- Chunked reading
- Multiple processes
- Memory mapping, combined with support for RE operations on mapped buffers
This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.
However, it should be noted that the benchmark I’ve been using focuses on processing speed, not disk speed. The code will most likely behave differently on cold caches (and will definitely take longer to run), on machines with different disk systems, and of course also on machines with additional cores.
If you have some time to spare and some interesting hardware to run it on, feel free to grab the code and take it for a ride:
(see the README.txt file for details.)
Addenda
2007-10-07: Stanley Seibert has adapted the code to use the processing library, which provides multiprocess functionality with a lot less (user) code; see Parallel Processing in Python with processing for details.
2007-10-07: Bioinformatics veteran and fellow Python string-type hacker Andrew Dalke points out, via mail, that it’s possible to shave off a few more cycles by extracting all URL:s that start with “/ongoing/When/” (which we’re looking for anyway), and then removing bogus URL:s during post-processing. Andrew has also written a custom parser based on mxTextTools, which is quite a bit faster than the RE solution. Hopefully, he’ll turn his findings into a blog post, so I can link to his work ;-) See More notes on Wide Finder for the full story (which is more about fast “narrow finding” than “wide finding”, though).
2007-10-07: Bill de hÓra has some code too.
2007-10-07: And Steve Vinoski has tried the code from this article on some big iron: “I ran his wf-6.py on an 8-core 2.33 GHz Intel Xeon Linux box with 8GB of RAM, and it ran best at 5 processes, clocking in at 0.336 sec. Another process-based approach, wf-5.py, executed best with 8 processes, presumably one per core, in 0.358 sec. The multithreaded approach, wf-4.py, ran best with 5 threads, at 1.402 sec (but also got the same result with 19 threads, go figure). Using the same dataset, I get 11.8 sec from my best Erlang effort, which is obviously considerably slower.”
2007-10-08: Paul Boddie provides code and results using a different parallelization library, pprocess.
2007-10-08: Tim Bray summarizes recent developments.
2007-10-12: Updated the article to use binary mode on Windows. This makes the chunk calculations a bit more reliable (tell can misbehave on text files), and speeds things up quite a bit, since the I/O layer no longer needs to convert line endings.
2007-10-31: Tim Bray has tested a bunch of implementations on a multicore Solaris box. When I write this, Python’s in the lead ;-)
Python multiprocessing: sharing a large read-only object between processes?
Do child processes spawned via multiprocessing share objects created earlier in the program?
I have the following setup:
import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in file(filename):
        if line.split(',')[0] in big_lookup_object:
            # something here
            pass

if __name__ == '__main__':
    big_lookup_object = marshal.load(open('file.bin', 'rb'))
    pool = Pool(processes=4)
    print pool.map(do_some_processing, glob.glob('*.data'))
I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.
My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?
Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split up is reading lots of other large files and looking up the items in those large files against the lookup object.
Further update: database is a fine solution, memcached might be a better solution, and file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in memory solution. For the final solution I'll be using hadoop, but I wanted to see if I can have a local in-memory version as well.
Answer
"Do child processes spawned via multiprocessing share objects created earlier in the program?"
No.
Processes have independent memory space.
Solution 1
To make best use of a large structure with lots of workers, do this.
Write each worker as a "filter" -- reads intermediate results from stdin, does work, writes intermediate results on stdout.
Connect all the workers as a pipeline:
process1 <source | process2 | process3 | ... | processn >result
Each process reads, does work and writes.
This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
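One stage of such a pipeline might look like the Python 3 sketch below. The `run_filter` name, the comma-separated input and the lookup set are all assumptions for illustration; in a real pipeline the function would be called with `sys.stdin` and `sys.stdout` instead of the in-memory streams used here:

```python
import io

def run_filter(source, sink, lookup):
    """Copy to sink only the lines whose first CSV field is in lookup."""
    for line in source:
        if line.split(",")[0] in lookup:
            sink.write(line)

# in-memory streams stand in for stdin and stdout
source = io.StringIO("a,1\nb,2\nc,3\n")
sink = io.StringIO()
run_filter(source, sink, lookup={"a", "c"})
print(sink.getvalue(), end="")
```

Because the stage only reads from one stream and writes to another, it can be tested in isolation and then dropped into a shell pipeline unchanged.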
Solution 2
In some cases, you have a more complex structure -- often a "fan-out" structure. In this case you have a parent with multiple children.
Parent opens source data. Parent forks a number of children.
Parent reads source, farms parts of the source out to each concurrently running child.
When parent reaches the end, close the pipe. Child gets end of file and finishes normally.
The child parts are pleasant to write because each child simply reads sys.stdin.
The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.
Reading from many named pipes is often done using the select module to see which pipes have pending input.
Solution 3
Shared lookup is the definition of a database.
Solution 3A -- load a database. Let the workers process the data in the database.
Solution 3B -- create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.
Solution 4
Shared filesystem object. Unix OS offers shared memory objects. These are just files that are mapped to memory so that swapping I/O is done instead of more conventional buffered reads.
You can do this from a Python context in several ways:
- Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.
- Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to ensure that individual sections are easy to find with simple seeks. This is what a database engine does -- break the data into pages, make each page easy to locate via a seek. Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do its work there.
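A very small version of the "page-structured file" idea is sketched below in Python 3. It assumes fixed-size records (a 16-byte key and a 32-bit value, a layout invented for this example), so that record number n always starts at byte n times the record size and can be found with a single seek:

```python
import os
import struct
import tempfile

# hypothetical record layout: 16-byte key (NUL padded), unsigned 32-bit value
RECORD = struct.Struct("<16sI")

def write_records(path, items):
    """Write (key, value) pairs as fixed-size records."""
    with open(path, "wb") as f:
        for key, value in items:
            f.write(RECORD.pack(key.encode("ascii"), value))

def read_record(path, index):
    """Fetch one record by position, without reading the ones before it."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)      # a single seek finds the record
        key, value = RECORD.unpack(f.read(RECORD.size))
    return key.rstrip(b"\0").decode("ascii"), value

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    write_records(path, [("alpha", 1), ("beta", 2), ("gamma", 3)])
    size = os.path.getsize(path)
    third = read_record(path, 2)
finally:
    os.remove(path)

print(size, third)
```

Each worker can open the file independently and read only the records it needs, so the lookup data is never loaded into any process as a whole.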
http://www.doughellmann.com/PyMOTW/multiprocessing/mapreduce.html
Implementing MapReduce with multiprocessing
The Pool class can be used to create a simple single-server MapReduce implementation. Although it does not give the full benefits of distributed processing, it does illustrate how easy it is to break some problems down into distributable units of work.
SimpleMapReduce
In MapReduce, input data is broken down into chunks for processing by different worker instances. Each chunk of input data is mapped to an intermediate state using a simple transformation. The intermediate data is then collected together and partitioned based on a key value so that all of the related values are together. Finally, the partitioned data is reduced to a result set.
import collections
import itertools
import multiprocessing

class SimpleMapReduce(object):

    def __init__(self, map_func, reduce_func, num_workers=None):
        """
        map_func

          Function to map inputs to intermediate data. Takes as
          argument one input value and returns a tuple with the key
          and a value to be reduced.

        reduce_func

          Function to reduce partitioned version of intermediate data
          to final output. Takes as argument a key as produced by
          map_func and a sequence of the values associated with that
          key.

        num_workers

          The number of workers to create in the pool. Defaults to the
          number of CPUs available on the current host.
        """
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        """Organize the mapped values by their key.
        Returns an unsorted sequence of tuples with a key and a sequence of values.
        """
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        """Process the inputs through the map and reduce functions given.

        inputs
          An iterable containing the input data to be processed.

        chunksize=1
          The portion of the input data to hand to each worker. This
          can be used to tune performance during the mapping phase.
        """
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        partitioned_data = self.partition(itertools.chain(*map_responses))
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values
Counting Words in Files
The following example script uses SimpleMapReduce to count the “words” in the reStructuredText source for this article, ignoring some of the markup.
import multiprocessing
import string

from multiprocessing_mapreduce import SimpleMapReduce

def file_to_words(filename):
    """Read a file and return a sequence of (word, occurances) values.
    """
    STOP_WORDS = set([
            'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if', 'in',
            'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the', 'to', 'with',
            ])
    TR = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

    print multiprocessing.current_process().name, 'reading', filename
    output = []

    with open(filename, 'rt') as f:
        for line in f:
            if line.lstrip().startswith('..'): # Skip rst comment lines
                continue
            line = line.translate(TR) # Strip punctuation
            for word in line.split():
                word = word.lower()
                if word.isalpha() and word not in STOP_WORDS:
                    output.append( (word, 1) )
    return output


def count_words(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurances.
    """
    word, occurances = item
    return (word, sum(occurances))


if __name__ == '__main__':
    import operator
    import glob

    input_files = glob.glob('*.rst')

    mapper = SimpleMapReduce(file_to_words, count_words)
    word_counts = mapper(input_files)
    word_counts.sort(key=operator.itemgetter(1))
    word_counts.reverse()

    print '\nTOP 20 WORDS BY FREQUENCY\n'
    top20 = word_counts[:20]
    longest = max(len(word) for word, count in top20)
    for word, count in top20:
        print '%-*s: %5s' % (longest+1, word, count)
Each input filename is converted to a sequence of (word, 1) pairs by file_to_words. The data is partitioned by SimpleMapReduce.partition() using the word as the key, so the partitioned data consists of a key and a sequence of 1 values representing the number of occurrences of the word. The reduction phase converts that to a pair of (word, count) values by calling count_words for each element of the partitioned data set.
$ python multiprocessing_wordcount.py
PoolWorker-2 reading communication.rst
PoolWorker-2 reading index.rst
PoolWorker-1 reading basics.rst
PoolWorker-1 reading mapreduce.rst

TOP 20 WORDS BY FREQUENCY

process         :    75
multiprocessing :    40
worker          :    35
after           :    30
running         :    29
start           :    28
processes       :    26
python          :    26
literal         :    25
header          :    25
pymotw          :    25
end             :    25
daemon          :    23
now             :    21
consumer        :    19
starting        :    18
exiting         :    16
event           :    15
value           :    14
run             :    13
See also
- MapReduce - Wikipedia: Overview of MapReduce on Wikipedia.
- MapReduce: Simplified Data Processing on Large Clusters: Google Labs presentation and paper on MapReduce.
- operator: Operator tools such as itemgetter().
Warning
Some of this package’s functionality requires a functioning shared semaphore implementation on the host operating system. Without one, the multiprocessing.synchronize module will be disabled, and attempts to import it will result in an ImportError. See issue 3770 for additional information.
Note
Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines; however, it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples, will not work in the interactive interpreter. For example:
Extended Slices
Ever since Python 1.4, the slicing syntax has supported an optional third "step" or "stride" argument. For example, these are all legal Python syntax: L[1:10:2], L[:-1:1], L[::-1]. This was added to Python at the request of the developers of Numerical Python, which uses the third argument extensively. However, Python's built-in list, tuple, and string sequence types have never supported this feature, raising a TypeError if you tried it. Michael Hudson contributed a patch to fix this shortcoming.
For example, you can now easily extract the elements of a list that have even indexes:
>>> L = range(10)
>>> L[::2]
[0, 2, 4, 6, 8]
Negative values also work to make a copy of the same list in reverse order:
>>> L[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
This also works for tuples, arrays, and strings:
>>> s='abcd'
>>> s[::2]
'ac'
>>> s[::-1]
'dcba'
If you have a mutable sequence such as a list or an array you can assign to or delete an extended slice, but there are some differences between assignment to extended and regular slices. Assignment to a regular slice can be used to change the length of the sequence:
>>> a = range(3)
>>> a
[0, 1, 2]
>>> a[1:3] = [4, 5, 6]
>>> a
[0, 4, 5, 6]
Extended slices aren't this flexible. When assigning to an extended slice, the list on the right hand side of the statement must contain the same number of items as the slice it is replacing:
>>> a = range(4)
>>> a
[0, 1, 2, 3]
>>> a[::2]
[0, 2]
>>> a[::2] = [0, -1]
>>> a
[0, 1, -1, 3]
>>> a[::2] = [0,1,2]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: attempt to assign sequence of size 3 to extended slice of size 2
Deletion is more straightforward:
>>> a = range(4)
>>> a
[0, 1, 2, 3]
>>> a[::2]
[0, 2]
>>> del a[::2]
>>> a
[1, 3]
One can also now pass slice objects to the __getitem__ methods of the built-in sequences:
>>> range(10).__getitem__(slice(0, 5, 2))
[0, 2, 4]
Or use slice objects directly in subscripts:
>>> range(10)[slice(0, 5, 2)]
[0, 2, 4]
To simplify implementing sequences that support extended slicing, slice objects now have a method indices(length) which, given the length of a sequence, returns a (start, stop, step) tuple that can be passed directly to range(). indices() handles omitted and out-of-bounds indices in a manner consistent with regular slices (and this innocuous phrase hides a welter of confusing details!). The method is intended to be used like this:
class FakeSeq:
    ...
    def calc_item(self, i):
        ...
    def __getitem__(self, item):
        if isinstance(item, slice):
            indices = item.indices(len(self))
            return FakeSeq([self.calc_item(i) for i in range(*indices)])
        else:
            # a plain index: compute that single item
            return self.calc_item(item)
From this example you can also see that the built-in slice object is now the type object for the slice type, and is no longer a function. This is consistent with Python 2.2, where int, str, etc., underwent the same change.
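To see what indices() actually returns, here is a short sketch, written for Python 3 (where range() is lazy, hence the list() call). The `expand` helper is a made-up name used only for this illustration:

```python
def expand(s, length):
    """Return the concrete indexes a slice selects for a given length."""
    return list(range(*s.indices(length)))

print(slice(None, None, 2).indices(10))   # (0, 10, 2)
print(slice(None, None, -1).indices(5))   # (4, -1, -1)
print(slice(2, 100).indices(10))          # (2, 10, 1)
print(expand(slice(None, None, -1), 5))   # [4, 3, 2, 1, 0]
```

Omitted bounds are filled in according to the sign of the step, and the out-of-range stop value of 100 is clipped to the sequence length.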