python3多进程下载_python3多进程共享numpy数组（只读）

最新推荐文章于 2024-04-30 16:58:32 发布

weixin_39682944

最新推荐文章于 2024-04-30 16:58:32 发布

阅读量373

点赞数

文章标签： python3多进程下载

本文链接：https://blog.csdn.net/weixin_39682944/article/details/111451466

版权

I'm not sure if this title is appropriate for my situation: the reason why I want to share numpy array is that it might be one of the potential solutions to my case, but if you have other solutions that would also be nice.

My task: I need to implement an iterative algorithm with multiprocessing, while each of these processes need to have a copy of data(this data is large, and read-only, and won't change during the iterative algorithm).

I've written some pseudo code to demonstrate my idea:

import multiprocessing

def worker_func(data, args):

# do sth...

return res

def compute(data, process_num, niter):

data

result = []

args = init()

for iter in range(niter):

args_chunk = split_args(args, process_num)

pool = multiprocessing.Pool()

for i in range(process_num):

result.append(pool.apply_async(worker_func,(data, args_chunk[i])))

pool.close()

pool.join()

# aggregate result and update args

for res in result:

args = update_args(res.get())

if __name__ == "__main__":

compute(data, 4, 100)

The problem is in each iteration, I have to pass the data to subprocess, which is very time-consuming.

I've come up with two potential solutions:

share data among processes (it's ndarray), that's the title of this question.

Keep subprocess alive, like a daemon process or something...and wait for call. By doing that, I only need to pass the data at the very beginning.

So, is there any way to share a read-only numpy array among process? Or if you have a good implementation of solution 2, it also works.

Thanks in advance.

解决方案

If you absolutely must use Python multiprocessing, then you can use Python multiprocessing along with Arrow's Plasma object store to store the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.

If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it will work out of the box not just with arrays but also with Python objects that contain arrays.

Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.

Here is a modified version of your example that runs.

import numpy as np

import ray

ray.init()

@ray.remote

def worker_func(data, i):

# Do work. This function will have read-only access to

# the data array.

return 0

data = np.zeros(10**7)

# Store the large array in shared memory once so that it can be accessed

# by the worker tasks without creating copies.

data_id = ray.put(data)

# Run worker_func 10 times in parallel. This will not create any copies

# of the array. The tasks will run in separate processes.

result_ids = []

for i in range(10):

result_ids.append(worker_func.remote(data_id, i))

# Get the results.

results = ray.get(result_ids)

Note that if we omitted the line data_id = ray.put(data) and instead called worker_func.remote(data, i), then the data array would be stored in shared memory once per function call, which would be inefficient. By first calling ray.put, we can store the object in the object store a single time.

weixin_39682944

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python3多进程下载_python3多进程共享numpy数组（只读）

I'm not sure if this title is appropriate for my situation: the reason why I want to share numpy array is that it might be one of the potential solutions to my case, but if you have other solutions th...
复制链接

扫一扫

python3多进程下载_python3多进程共享numpy数组（只读）

“相关推荐”对你有帮助么？