Mixed Python/C++/CUDA programming: Python multiprocessing with PyCUDA

I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back;

What I've set up is a GPU class, with functions that perform operations on the GPU (strange that). These operations are of the style

for iteration in range(maxval):
    result[iteration] = gpuinstance.gpufunction(arguments, iteration)
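For context, a stripped-down sketch of the sort of class I mean, running on a single device; the scale kernel and the body of gpufunction here are just illustrative placeholders, not my real code:

import numpy as np
import pycuda.autoinit                  # single-device initialisation, as in my original setup
import pycuda.driver as drv
from pycuda.compiler import SourceModule

class gpuinstance(object):
    def __init__(self):
        # trivial kernel standing in for whatever the real class compiles
        self._mod = SourceModule("""
            __global__ void scale(float *out, const float *in, float factor)
            {
                int i = threadIdx.x;
                out[i] = in[i] * factor;
            }
        """)
        self._scale = self._mod.get_function("scale")

    def gpufunction(self, arguments, iteration):
        # one small launch per call, mirroring the per-iteration loop above
        out = np.empty_like(arguments)
        self._scale(drv.Out(out), drv.In(arguments), np.float32(iteration),
                    block=(arguments.size, 1, 1), grid=(1, 1))
        return out

gpu = gpuinstance()
args = np.arange(8, dtype=np.float32)
result = [gpu.gpufunction(args, it) for it in range(4)]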

I'd imagined that there would be N gpuinstances for N devices, but I don't know enough about multiprocessing to see the simplest way of applying this so that each device is assigned work asynchronously. Strangely few of the examples I came across gave concrete demonstrations of collating results after processing.
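A rough sketch of the kind of split-and-collate pattern I have in mind, using one worker process per device; everything below (the worker body, the device count, the dummy gpuarray call standing in for gpufunction) is an illustrative assumption, not working code from my project:

import multiprocessing
import numpy as np

def worker(args):
    devid, iterations = args
    # import and initialise CUDA inside the child process, so the parent
    # never owns a context
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    drv.init()
    ctx = drv.Device(devid).make_context()
    try:
        out = []
        for it in iterations:
            # trivial stand-in for gpuinstance.gpufunction(arguments, it)
            a = gpuarray.to_gpu(np.full(16, float(it), dtype=np.float32))
            out.append((it, float((a * 2.0).get()[0])))
    finally:
        ctx.pop()
    return out

if __name__ == "__main__":
    maxval, ndev = 100, 2                      # assumed problem size and device count
    chunks = [(d, list(range(d, maxval, ndev))) for d in range(ndev)]
    pool = multiprocessing.Pool(processes=ndev)
    result = {}
    for partial in pool.map(worker, chunks):   # collate per-device results
        result.update(dict(partial))

Splitting the iteration range round-robin across devices and keying each result by its iteration index makes the final reassembly trivial.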

Can anyone give me any pointers in this area?

UPDATE

Thank you Kaloyan for your guidance on the multiprocessing side; if CUDA weren't specifically the sticking point I'd be marking you as answered. Sorry.

Previous to playing with this implementation, the gpuinstance class initiated the CUDA device with import pycuda.autoinit, but that didn't appear to work, throwing invalid context errors as soon as each (correctly scoped) thread met a CUDA command. I then tried manual initialisation in the __init__ constructor of the class with...

pycuda.driver.init()
self.mydev = pycuda.driver.Device(devid)  # devid is passed in at instantiation of the class
self.ctx = self.mydev.make_context()
self.ctx.push()

My assumption here is that the context is preserved between when the list of gpuinstances is created and when the threads use them, so each device is sitting pretty in its own context.

(I also implemented a destructor to take care of pop/detach cleanup)

Problem is, invalid context exceptions are still appearing as soon as the thread tries to touch CUDA.

Any ideas, folks? And thanks for getting this far. Automatic upvotes for people working 'banana' into their answer! :P

Solution

You need to get all your bananas lined up on the CUDA side of things first, then think about the best way to get this done in Python [shameless rep whoring, I know].

The CUDA multi-GPU model is pretty straightforward pre 4.0 - each GPU has its own context, and each context must be established by a different host thread. So the idea in pseudocode is:

1. Application starts, process uses the API to determine the number of usable GPUs (beware things like compute mode in Linux)

2. Application launches a new host thread per GPU, passing a GPU id. Each thread implicitly/explicitly calls the equivalent of cuCtxCreate(), passing the GPU id it has been assigned

3. Profit!

In Python, this might look something like this:

import threading
from pycuda import driver

class gpuThread(threading.Thread):
    def __init__(self, gpuid):
        threading.Thread.__init__(self)
        # create and push a context on the chosen device
        self.ctx = driver.Device(gpuid).make_context()
        self.device = self.ctx.get_device()

    def run(self):
        print("%s has device %s, api version %s"
              % (self.getName(), self.device.name(), self.ctx.get_api_version()))
        # Profit!

    def join(self):
        # tear the context down once the thread has finished
        self.ctx.detach()
        threading.Thread.join(self)

driver.init()
ngpus = driver.Device.count()
for i in range(ngpus):
    t = gpuThread(i)
    t.start()
    t.join()

This assumes it is safe to just establish a context without any checking of the device beforehand. Ideally you would check the compute mode to make sure it is safe to try, then use an exception handler in case a device is busy. But hopefully this gives the basic idea.
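A minimal sketch of that kind of pre-flight check, assuming PyCUDA's device_attribute.COMPUTE_MODE attribute and compute_mode enum, and treating any driver error from make_context() as "device busy":

import pycuda.driver as drv

drv.init()

usable = []
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    mode = dev.get_attribute(drv.device_attribute.COMPUTE_MODE)
    if mode == drv.compute_mode.PROHIBITED:
        continue                          # device cannot accept contexts at all
    try:
        ctx = dev.make_context()          # may fail if the device is busy
    except drv.Error:
        continue
    ctx.pop()                             # keep the main thread's context stack clean
    usable.append(i)

print("usable devices: %s" % usable)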
