Python memory allocation failure: OSError: [Errno 12] Cannot allocate memory when using a Python multiprocessing pool

I am trying to apply a function to 5 cross validation sets in parallel using Python's multiprocessing and repeat that for different parameter values, like so:

import pandas as pd
import numpy as np
import multiprocessing as mp
from sklearn.model_selection import StratifiedKFold

# simulated datasets
X = pd.DataFrame(np.random.randint(2, size=(3348, 868), dtype='int8'))
y = pd.Series(np.random.randint(2, size=3348, dtype='int64'))

# dummy function to apply
def _work(args):
    del(args)

for C in np.arange(0.0, 2.0e-3, 1.0e-6):
    splitter = StratifiedKFold(n_splits=5)
    with mp.Pool(processes=5) as pool:
        pool_results = \
            pool.map(
                func=_work,
                iterable=((C, X.iloc[train_index], X.iloc[test_index])
                          for train_index, test_index in splitter.split(X, y))
            )

However, halfway through execution I get the following error:

Traceback (most recent call last):
  File "mre.py", line 19, in <module>
    with mp.Pool(processes=5) as pool:
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

I'm running this on Ubuntu 16.04 with 32 GB of memory, and watching htop during execution it never goes above 18.5 GB, so I don't think I'm running out of memory.

It is definitely caused by slicing my dataframes with the indexes from splitter.split(X, y), since no error is thrown when I pass my dataframes to the Pool object directly.

I saw this answer that says it might be due to too many file descriptors being created, but I have no idea how I might go about fixing that, and isn't the context manager supposed to help avoid this sort of problem?

Solution

os.fork() makes a copy of a process, so if you're sitting at about 18 GB of usage, and want to call fork, you need another 18 GB. Twice 18 is 36 GB, which is well over 32 GB. While this analysis is (intentionally) naive—some things don't get copied on fork—it's probably sufficient to explain the problem.

The solution is either to make the pools earlier, when less memory needs to be copied, or to work harder at sharing the largest objects. Or, of course, add more memory (perhaps just virtual memory, i.e., swap space) to the system.
