Python 3 multiprocessing: sharing a read-only numpy array between processes

I'm not sure this title fits my situation: sharing a numpy array is just one potential solution to my problem, and other solutions would be welcome too.

My task: I need to implement an iterative algorithm with multiprocessing, where each process needs a copy of the data (the data is large and read-only, and won't change during the iterative algorithm).

I've written some pseudo code to demonstrate my idea:

    import multiprocessing

    def worker_func(data, args):
        # do sth...
        return res

    def compute(data, process_num, niter):
        args = init()
        for it in range(niter):
            args_chunk = split_args(args, process_num)
            result = []  # collect this iteration's results only
            pool = multiprocessing.Pool()
            for i in range(process_num):
                # data is passed to the subprocess again on every call
                result.append(pool.apply_async(worker_func, (data, args_chunk[i])))
            pool.close()
            pool.join()
            # aggregate result and update args
            for res in result:
                args = update_args(res.get())

    if __name__ == "__main__":
        compute(data, 4, 100)

The problem is that in each iteration I have to pass the data to the subprocesses, which is very time-consuming.
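The cost comes from how `Pool.apply_async` delivers its arguments: each call pickles them and sends the bytes to a worker, so the whole array is serialized once per task. A rough sketch of that overhead, assuming a stand-in array of 10**6 float64 values (the real data and its size are not given here):

```python
import pickle
import time

import numpy as np

# Stand-in for the large read-only array (the size is an assumption).
data = np.zeros(10**6)

start = time.perf_counter()
payload = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.perf_counter() - start

# The pickled payload carries every byte of the array plus a small header,
# and this work is repeated for each task that receives `data`.
print(f"array: {data.nbytes} bytes, pickled: {len(payload)} bytes, "
      f"{elapsed * 1e3:.2f} ms per dump")
```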

I've come up with two potential solutions:

1. Share the data (an ndarray) among the processes, which is the title of this question.

2. Keep the subprocesses alive, like daemon processes, and have them wait for calls. That way I only need to pass the data once, at the very beginning.

So, is there any way to share a read-only numpy array among processes? A good implementation of solution 2 would also work.

Thanks in advance.

Solution

If you absolutely must use Python multiprocessing, you can combine it with Arrow's Plasma object store: store the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.
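Plasma aside, Python 3.8 and later ship `multiprocessing.shared_memory` in the standard library, which covers the same idea for a plain ndarray. The sketch below uses that module rather than Plasma; the helper names (`share_array`, `chunk_sum`, `parallel_sum`) and the summing workload are hypothetical, chosen only for illustration.

```python
import multiprocessing as mp
from multiprocessing import shared_memory

import numpy as np


def share_array(arr):
    """Copy arr into a new shared-memory block once; return (shm, meta)."""
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    shared[:] = arr  # the only copy of the data
    meta = (shm.name, arr.shape, arr.dtype.str)  # small and cheap to pickle
    return shm, meta


def chunk_sum(task):
    """Worker: attach to the block by name and sum elements [lo, hi)."""
    (name, shape, dtype), lo, hi = task
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    view.flags.writeable = False  # enforce read-only use in the worker
    total = float(view[lo:hi].sum())
    del view  # drop the array before closing, or close() raises BufferError
    shm.close()  # detach only; the parent owns the block
    return total


def parallel_sum(arr, n_procs=4):
    """Sum a 1-D array in n_procs workers that read the same shared block."""
    shm, meta = share_array(arr)
    try:
        bounds = np.linspace(0, len(arr), n_procs + 1, dtype=int)
        tasks = [(meta, int(lo), int(hi))
                 for lo, hi in zip(bounds[:-1], bounds[1:])]
        with mp.Pool(n_procs) as pool:
            return sum(pool.map(chunk_sum, tasks))
    finally:
        shm.close()
        shm.unlink()  # free the block once every worker is done
```

Only the block name, shape and dtype are pickled per task, so the per-iteration cost no longer depends on the size of the array. `parallel_sum` should be called from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method.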

If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it will work out of the box not just with arrays but also with Python objects that contain arrays.

Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.
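The "read-only access without copies" idea can be seen in plain numpy, independent of Ray: an array built over an existing immutable buffer reuses that memory and rejects writes. The following only illustrates that concept with `np.frombuffer`; it is not a description of Ray's internals.

```python
import numpy as np

source = np.arange(6, dtype=np.int64)
raw = source.tobytes()  # immutable bytes object holding the array's data

# frombuffer wraps the existing bytes instead of copying them.
view = np.frombuffer(raw, dtype=np.int64)

print(view.flags.owndata)    # False: the memory belongs to `raw`
print(view.flags.writeable)  # False: bytes are immutable, so is the view

try:
    view[0] = 99
    write_rejected = False
except ValueError:
    # numpy refuses to write through a read-only view
    write_rejected = True
```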

Here is a modified version of your example that runs.

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def worker_func(data, i):
        # Do work. This function will have read-only access to
        # the data array.
        return 0

    data = np.zeros(10**7)

    # Store the large array in shared memory once so that it can be accessed
    # by the worker tasks without creating copies.
    data_id = ray.put(data)

    # Run worker_func 10 times in parallel. This will not create any copies
    # of the array. The tasks will run in separate processes.
    result_ids = []
    for i in range(10):
        result_ids.append(worker_func.remote(data_id, i))

    # Get the results.
    results = ray.get(result_ids)

Note that if we omitted the line data_id = ray.put(data) and instead called worker_func.remote(data, i), then the data array would be stored in shared memory once per function call, which would be inefficient. By first calling ray.put, we can store the object in the object store a single time.
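The second option from the question, keeping the workers alive so the data is passed only once, can also be done with the standard library alone. A minimal sketch, assuming a pool `initializer` that stores the array in a module-level global; `init_worker`, the body of `worker_func`, and the way `args` is updated are placeholders, since the question does not define `init`, `split_args`, or `update_args`.

```python
import multiprocessing as mp

import numpy as np

_DATA = None  # per-process global, filled in once by init_worker


def init_worker(data):
    """Runs once in each worker when the pool starts."""
    global _DATA
    view = data.view()
    view.flags.writeable = False  # guard against accidental writes
    _DATA = view


def worker_func(arg):
    """Placeholder work: scale the sum of the shared data by arg."""
    return float(_DATA.sum()) * arg


def compute(data, process_num, niter):
    args = [1.0] * process_num  # placeholder for init()
    # The pool is created once; `data` reaches each worker a single time,
    # at pool start-up, instead of once per task.
    with mp.Pool(process_num, initializer=init_worker,
                 initargs=(data,)) as pool:
        for _ in range(niter):
            results = pool.map(worker_func, args)  # only small args travel
            # Placeholder for "aggregate result and update args".
            args = [r / (1.0 + abs(r)) for r in results]
    return args
```

As in the question, `compute(data, 4, 100)` belongs under an `if __name__ == "__main__":` guard. With the fork start method the workers inherit the array from the parent; with spawn it is pickled once per worker rather than once per task.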
