I'm processing very large amounts of data, stored in a dictionary, using multiprocessing. Basically all I'm doing is loading some signatures, stored in a dictionary, building a shared dict object out of it (getting the 'proxy' object returned by Manager.dict() ) and passing this proxy as argument to the function that has to be executed in multiprocessing.
Just to clarify:
signatures = dict()
load_signatures(signatures)
[...]
manager = Manager()
signaturesProxy = manager.dict(signatures)
[...]
result = pool.map ( myfunction , [ signaturesProxy ]*NUM_CORES )
Now, everything works perfectly if signatures is less than 2 million entries or so. Anyways, I have to process a dictionary with 5.8M keys (pickling signatures in binary format generates a 4.8 GB file). In this case, the process dies during the creation of the proxy object:
Traceback (most recent call last):
File "matrix.py", line 617, in
signaturesProxy = manager.dict(signatures)
File "/usr/lib/python2.6/multiprocessing/managers.py", line 634, in temp
token, exp = self._create(typeid, *args, **kwds)
File "/usr/lib/python2.6/multiprocessing/managers.py", line 534, in _create
id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.6/multiprocessing/managers.py", line 173, in handle_request
request = c.recv()
EOFError
---------------------------------------------------------------------------
I know the data structure is huge but I'm working on a machine equipped w/ 32GB of RAM, and running top I see that the process, after loading the signatures, occupies 7GB of RAM. It then starts building the proxy object and the RAM usage goes up to ~17GB of RAM but never gets close to 32. At this point, the RAM usage starts diminishing quickly and the process terminates with the above error. So I guess this is not due to an out-of-memory error...
Any idea or suggestion?
Thank you,
Davide
解决方案
If the dictionaries are read-only, you don't need proxy objects in most operating systems.
Just load the dictionaries before starting the workers, and put them somewhere they'll be reachable; the simplest place is globally to a module. They'll be readable from the workers.
from multiprocessing import Pool
buf = ""
def f(x):
buf.find("x")
return 0
if __name__ == '__main__':
buf = "a" * 1024 * 1024 * 1024
pool = Pool(processes=1)
result = pool.apply_async(f, [10])
print result.get(timeout=5)
This only uses 1GB of memory combined, not 1GB for each process, because any modern OS will make a copy-on-write shadow of the data created before the fork. Just remember that changes to the data won't be seen by other workers, and memory will, of course, be allocated for any data you change.
It will use some memory: the page of each object containing the reference count will be modified, so it'll be allocated. Whether this matters depends on the data.
This will work on any OS that implements ordinary forking. It won't work on Windows; its (crippled) process model requires relaunching the entire process for each worker, so it's not very good at sharing data.