Python: sharing a huge dictionary using multiprocessing

I'm processing very large amounts of data, stored in a dictionary, using multiprocessing. Basically, all I'm doing is loading some signatures into a dictionary, building a shared dict object out of it (getting the 'proxy' object returned by Manager.dict()), and passing this proxy as an argument to the function that is executed via multiprocessing.

Just to clarify:

signatures = dict()
load_signatures(signatures)

[...]

manager = Manager()
signaturesProxy = manager.dict(signatures)

[...]

result = pool.map(myfunction, [signaturesProxy] * NUM_CORES)

Now, everything works perfectly if signatures holds fewer than 2 million entries or so. However, I have to process a dictionary with 5.8M keys (pickling signatures in binary format generates a 4.8 GB file). In this case, the process dies during the creation of the proxy object:

Traceback (most recent call last):
  File "matrix.py", line 617, in <module>
    signaturesProxy = manager.dict(signatures)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 634, in temp
    token, exp = self._create(typeid, *args, **kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 534, in _create
    id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 173, in handle_request
    request = c.recv()
EOFError
---------------------------------------------------------------------------

I know the data structure is huge, but I'm working on a machine equipped with 32 GB of RAM, and running top I see that the process, after loading the signatures, occupies 7 GB of RAM. It then starts building the proxy object and the RAM usage goes up to ~17 GB, but never gets close to 32. At that point, the RAM usage starts dropping quickly and the process terminates with the above error. So I guess this is not due to an out-of-memory error...

Any ideas or suggestions?

Thank you,

Davide

Solution

If the dictionaries are read-only, you don't need proxy objects in most operating systems.

Just load the dictionaries before starting the workers and put them somewhere the workers can reach; the simplest place is a module-level global. They'll be readable from the workers.

from multiprocessing import Pool

buf = ""

def f(x):
    # Read the module-level buffer from the worker process.
    buf.find("x")
    return 0

if __name__ == '__main__':
    # Allocate ~1 GB in the parent before the workers are forked.
    buf = "a" * 1024 * 1024 * 1024
    pool = Pool(processes=1)
    result = pool.apply_async(f, [10])
    print result.get(timeout=5)

This only uses 1GB of memory combined, not 1GB for each process, because any modern OS will make a copy-on-write shadow of the data created before the fork. Just remember that changes to the data won't be seen by other workers, and memory will, of course, be allocated for any data you change.

It will use some memory: the page containing each object's reference count gets modified when the object is touched, so those pages get copied. Whether this matters depends on the data.

This will work on any OS that implements ordinary forking. It won't work on Windows; its (crippled) process model requires relaunching the entire process for each worker, so it's not very good at sharing data.
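Applied to the question's setup, a minimal sketch of this approach might look like the following (assuming a Unix fork-based platform; the key chunking and the lookup inside myfunction are hypothetical placeholders for whatever the real worker does):

from multiprocessing import Pool

# Module-level global; populated in the parent before the Pool is created,
# so forked workers inherit it copy-on-write instead of pickling it.
signatures = {}

def myfunction(keys):
    # Placeholder worker: read-only lookups against the inherited dictionary.
    return [signatures[k] for k in keys]

if __name__ == '__main__':
    # Stand-in for load_signatures(signatures); the real code would load
    # the 5.8M entries here, before the workers are forked.
    signatures.update((i, str(i)) for i in range(100000))

    NUM_CORES = 4
    keys = list(signatures)
    chunks = [keys[i::NUM_CORES] for i in range(NUM_CORES)]

    pool = Pool(processes=NUM_CORES)   # fork happens here, after loading
    results = pool.map(myfunction, chunks)
    print(sum(len(r) for r in results))

Because each worker receives only its own chunk of keys, the huge dictionary itself never has to be pickled or pushed through a pipe; only the key lists and the per-worker results cross the process boundary.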
