Table of Contents
Elementwise Functions on GPUArray Instances
Rounding and Absolute Value
Exponentials, Logarithms and Roots
Trigonometric Functions
Floating Point Decomposition and Assembly
Single-pass Custom Expression Evaluation
Parallel Scan / Prefix Sum
Custom data types in Reduction and Scan
Vector Types
All CUDA vector types (such as float3 and long4) are available as numpy data types within the class pycuda.gpuarray.vec. These numpy.dtype instances have field names x, y, z and w, just like their CUDA counterparts. They work both for passing arguments to kernels and for passing data back and forth between kernels and Python code. For each type, a make_TYPE function is also provided (e.g. make_float3(x, y, z)).
In [97]: pycuda.gpuarray.vec.make_int1(np.array([1,2,3],dtype=np.float32))
Out[97]: array([(1,), (2,), (3,)], dtype=[('x', '<i4')])
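Since the text above mentions make_float3(x, y, z), here is a minimal sketch of the intended scalar usage; the field access by name is how I read the numpy structured dtype, so verify it on your install:

import pycuda.gpuarray as gpuarray

v = gpuarray.vec.make_float3(1.0, 2.0, 3.0)   # numpy scalar of dtype gpuarray.vec.float3
print(v["x"], v["y"], v["z"])                 # -> 1.0 2.0 3.0
print(gpuarray.vec.float3)                    # the underlying numpy.dtype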
The GPUArray class
pycuda.gpuarray.GPUArray(shape, dtype, *, allocator=None, order="C")
A numpy.ndarray work-alike that stores its data and performs its computations on the compute device. shape and dtype work exactly as in numpy. Arithmetic methods on GPUArray support broadcasting of scalars (e.g. array + 5). If allocator is given, it must be a callable that, when called with the number of bytes to allocate, returns an object that can be cast to an int representing the address of the newly allocated memory. Both pycuda.driver.mem_alloc() and pycuda.tools.DeviceMemoryPool.allocate() are models of this interface.
All arguments beyond allocator should be considered keyword-only. The attributes and methods of this class are listed below.
gpudata
The pycuda.driver.DeviceAllocation instance created for the memory that backs this GPUArray.
In [117]: a
Out[117]: array([1., 2., 3., 4., 5.], dtype=float32)
In [119]: a.gpudata
Out[119]: <pycuda._driver.DeviceAllocation at 0x16b7fbbbcb0>
shape
The tuple of lengths of each dimension in the array.
In [118]: a.shape
Out[118]: (5,)
dtype
The numpy.dtype of the items in the GPU array.
In [120]: a.dtype
Out[120]: dtype('float32')
size
The number of meaningful entries in the array. Can also be computed by multiplying up the numbers in shape.
In [122]: a.size
Out[122]: 5
mem_size
The total number of entries, including padding, that are present in the array. Padding may arise for example because of pitch adjustment by pycuda.driver.mem_alloc_pitch().
In [123]: a.mem_size
Out[123]: 5
nbytes
The size of the entire array in bytes. Computed as size times dtype.itemsize.
In [124]: a.nbytes
Out[124]: 20
ndim
The number of dimensions in the array (the length of shape).
In [125]: a.ndim
Out[125]: 1
strides
Tuple of bytes to step in each dimension when traversing an array.
In [126]: a.strides
Out[126]: (4,)
flags
Return an object with attributes c_contiguous, f_contiguous and forc, which may be used to query contiguity properties in analogy to numpy.ndarray.flags.
In [130]: a.flags
Out[130]: <pycuda.compyte.array.ArrayFlags at 0x16b7fbc6e48>
ptr
Return an int reflecting the address in device memory where this array resides.
In [131]: a.ptr
Out[131]: 30154949120
__len__()
Returns the size of the leading dimension of self.
In [133]: a.__len__
Out[133]: <bound method GPUArray.__len__ of array([1., 2., 3., 4., 5.], dtype=float32)>
Warning: this method existed in version 0.93 and below, but it returned the value of size instead of its current value. The change was made in order to match numpy.
reshape(shape, order="C")
Returns an array containing the same data with a new shape.
In [140]: a.reshape((1,5))
Out[140]: array([[1., 2., 3., 4., 5.]], dtype=float32)
In [141]: a.reshape((5,1))
Out[141]:
array([[1.],
       [2.],
       [3.],
       [4.],
       [5.]], dtype=float32)
ravel()
Returns a flattened array containing the same data.
In [144]: a.ravel
Out[144]: <bound method GPUArray.ravel of array([1., 2., 3., 4., 5.], dtype=float32)>
In [145]: a.ravel()
Out[145]: array([1., 2., 3., 4., 5.], dtype=float32)
view(dtype=None)
Returns a view of the array with the same data. If dtype is different from the current dtype, the actual bytes of memory will be reinterpreted.
In [148]: a.view
Out[148]: <bound method GPUArray.view of array([1., 2., 3., 4., 5.], dtype=float32)>
In [149]: a.view()
Out[149]: array([1., 2., 3., 4., 5.], dtype=float32)
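For instance, viewing float32 data as int32 reinterprets the same bytes; a short sketch (the printed integers are the IEEE-754 bit patterns of 1.0 and 2.0):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.array([1., 2.], dtype=np.float32))
print(a.view(np.int32).get())   # -> [1065353216 1073741824], the raw float32 bit patterns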
squeeze()
Returns a view of the array with dimensions of length 1 removed.
In [151]: a.squeeze()
Out[151]: array([1., 2., 3., 4., 5.], dtype=float32)
set(ary)
Transfer the contents of the numpy.ndarray object ary onto the device.
ary must have the same dtype and size (not necessarily shape) as self.
In [168]: a.set(np.array([5,5,6,7,8],dtype=np.float32))
In [169]: a
Out[169]: array([5., 5., 6., 7., 8.], dtype=float32)
set_async(ary, stream=None)
Asynchronously transfer the contents of the numpy.ndarray object ary onto the device, optionally sequenced on stream. ary must have the same dtype and size (not necessarily shape) as self.
In [195]: d=a
In [196]: d
Out[196]: array([1., 2., 3., 4., 5.], dtype=float32)
In [197]: d.set_async(b)
In [198]: d
Out[198]: array([5., 5., 6., 7., 8.], dtype=float32)
For comparison, the same host-to-device copy done with the raw driver API:
In [178]: c = pycuda.driver.mem_alloc(a.nbytes)
In [188]: pycuda.driver.memcpy_htod(c,np.ones(5).astype(np.float32))
In [189]: c
Out[189]: <pycuda._driver.DeviceAllocation at 0x16b16179580>
get(ary=None, pagelocked=False)
Transfer the contents of self into ary or a newly allocated numpy.ndarray. If ary is given, it must have the same shape and dtype. If it is not given, pagelocked specifies whether the new array is allocated page-locked.
Changed in version 2015.2: ary with different shape was deprecated.
In [199]: d.get()
Out[199]: array([5., 5., 6., 7., 8.], dtype=float32)
In [205]: type(d.get())
Out[205]: numpy.ndarray
get_async(stream=None, ary=None)
Same as get(), but the transfer is performed asynchronously, optionally sequenced on stream.
In [202]: d.get_async()
Out[202]: array([5., 5., 6., 7., 8.], dtype=float32)
In [203]: type(d.get_async())
Out[203]: numpy.ndarray
copy()
New in version 2013.1.
In [171]: b = a.copy()
In [172]: b
Out[172]: array([5., 5., 6., 7., 8.], dtype=float32)
mul_add(self, selffac, other, otherfac, add_timer=None, stream=None)
Return selffac*self + otherfac*other. add_timer, if given, is invoked with the result from pycuda.driver.Function.prepared_timed_call().

__add__(other)
__sub__(other)
__iadd__(other)
__isub__(other)
__neg__()
__mul__(other)
__div__(other)
__rdiv__(other)
__pow__(other)
The usual arithmetic operators, applied elementwise. For example:
In [175]: a
Out[175]: array([1., 2., 3., 4., 5.], dtype=float32)
In [176]: b
Out[176]: array([5., 5., 6., 7., 8.], dtype=float32)
In [177]: a.__add__(b)
Out[177]: array([ 6., 7., 9., 11., 13.], dtype=float32)

__abs__()
Return a GPUArray containing the absolute value of each element of self.
fill(scalar, stream=None)
Fill the array with scalar.
In [207]: d.fill(4)
Out[207]: array([4., 4., 4., 4., 4.], dtype=float32)
astype(dtype, stream=None)
Return self, cast to dtype.
In [208]: a.dtype
Out[208]: dtype('float32')
In [210]: a.astype(np.int8)
Out[210]: array([4, 4, 4, 4, 4], dtype=int8)
real
Return the real part of self, or self if it is real.
In [218]: d
Out[218]: array([1.+1.j, 2.+2.j, 3.+3.j, 4.+4.j, 5.+5.j, 6.+6.j])
In [219]: d.real
Out[219]: array([1., 2., 3., 4., 5., 6.])
New in version 0.94.
imag
Return the imaginary part of self, or zeros_like(self) if it is real.
conj()
Return the complex conjugate of self, or self if it is real.
In [226]: d
Out[226]: array([1.+1.j, 2.+2.j, 3.+3.j, 4.+4.j, 5.+5.j, 6.+6.j])
In [227]: d.conj()
Out[227]: array([1.-1.j, 2.-2.j, 3.-3.j, 4.-4.j, 5.-5.j, 6.-6.j])
bind_to_texref(texref, allow_offset=False)
Bind self to the pycuda.driver.TextureReference texref.
Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in units of self's data type. If allow_offset is False, a nonzero value of this offset will cause an exception to be raised.
Note: it is recommended to use bind_to_texref_ext() instead of this method.
bind_to_texref_ext(texref, channels=1, allow_double_hack=False, allow_offset=False)
Bind self to the pycuda.driver.TextureReference texref. In addition, set the texture reference's format to match dtype and its channel count to channels. This routine also sets the texture reference's pycuda.driver.TRSF_READ_AS_INTEGER flag, if necessary.
Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in units of self's data type. If allow_offset is False, a nonzero value of this offset will cause an exception to be raised.
New in version 0.93.
As of this writing, CUDA textures do not natively support double-precision floating point data. To remedy this deficiency, PyCUDA contains a workaround, which can be enabled by passing True for allow_double_hack. In this case, use the following code for texture access in your kernel code:
#include <pycuda-helpers.hpp>

texture<fp_tex_double, 1, cudaReadModeElementType> my_tex;

__global__ void f()
{
  ...
  fp_tex1Dfetch(my_tex, threadIdx.x);
  ...
}
Slicing and assignment
GPUArray supports slicing, indexing, assignment and so on; for these operations it can largely be treated like a numpy array.
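A minimal sketch of what that looks like, assuming a recent PyCUDA in which GPUArray.__getitem__/__setitem__ support basic slices:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a_gpu = gpuarray.to_gpu(np.arange(10, dtype=np.float32))
sub = a_gpu[2:7]          # the slice is itself a GPUArray viewing device memory
a_gpu[2:7] = 0.0          # fill part of the array with a scalar
print(a_gpu.get())        # -> [0. 1. 0. 0. 0. 0. 0. 7. 8. 9.]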
Constructing GPUArray instances
pycuda.gpuarray.to_gpu(ary, allocator=None)
Return a GPUArray that is an exact copy of the numpy.ndarray instance ary.
pycuda.gpuarray.to_gpu_async(ary, allocator=None, stream=None)
Same as above, but the copy is performed asynchronously, optionally sequenced into stream.
pycuda.gpuarray.empty(shape, dtype, *, allocator=None, order="C")
Allocate an uninitialized (empty) array of the given shape and dtype on the device.
pycuda.gpuarray.zeros(shape, dtype, *, allocator=None, order="C")
Same as empty(), but the array is zero-initialized.
pycuda.gpuarray.empty_like(other_ary, dtype=None, order="K")
Make a new, uninitialized array that has the same shape as other_ary. The dtype and order attributes allow these aspects to be set independently of their values in other_ary. For order, "A" means retain Fortran-ordering if the input is Fortran-contiguous, otherwise use "C" ordering. The default, order or "K", tries to match the strides of other_ary as closely as possible.
pycuda.gpuarray.zeros_like(other_ary, dtype=None, order="K")
Same as empty_like(), but the new array is zero-initialized.
pycuda.gpuarray.ones_like(other_ary, dtype=None, order="K")
Same as empty_like(), but the new array is filled with ones.
pycuda.gpuarray.arange(start, stop, step, dtype=None, stream=None)
Behaves like numpy.arange: create a GPUArray filled with numbers spaced step apart, starting from start and ending at stop.
For floating point arguments, the length of the result is ceil((stop - start)/step). This rule may result in the last element of the result being greater than stop. dtype, if not specified, is taken as the largest common type of start, stop and step.
pycuda.gpuarray.take(a, indices, stream=None)
Return the values of a (a GPUArray) at the positions given by indices (also a GPUArray).
Note that a zero-filled array (zeros) is not the same thing as an uninitialized (empty) one. A short usage example follows below.
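Here is a minimal sketch that exercises a few of these constructors; the comments show the values I expect, not verified output:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.array([10., 20., 30., 40.], dtype=np.float32))
z = gpuarray.zeros_like(a)                    # same shape/dtype as a, zero-filled
r = gpuarray.arange(0, 4, 1, dtype=np.int32)  # like numpy.arange, on the device
picked = gpuarray.take(a, r)                  # gather a[r[0]], a[r[1]], ...
print(z.get(), r.get(), picked.get())         # -> [0. 0. 0. 0.] [0 1 2 3] [10. 20. 30. 40.]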
Conditionals
pycuda.gpuarray.if_positive(criterion, then_, else_, out=None, stream=None)
Return an array like then_, which, for the element at index i, contains then_[i] if criterion[i] > 0, else else_[i]. (added in 0.94)
pycuda.gpuarray.maximum(a, b, out=None, stream=None)
Return the elementwise maximum of a and b.
pycuda.gpuarray.minimum(a, b, out=None, stream=None)
Return the elementwise minimum of a and b.
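For instance (a sketch; the commented results follow from the definitions above):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.array([-2., 1., 4.], dtype=np.float32))
b = gpuarray.to_gpu(np.array([ 3., 0., 2.], dtype=np.float32))
print(gpuarray.maximum(a, b).get())           # -> [3. 1. 4.]
print(gpuarray.minimum(a, b).get())           # -> [-2. 0. 2.]
print(gpuarray.if_positive(a, b, a).get())    # where a > 0 take b, else a -> [-2. 0. 2.]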
Reductions
pycuda.gpuarray.sum(a, dtype=None, stream=None)
Sum of the elements of a.
pycuda.gpuarray.subset_sum(subset, a, dtype=None, stream=None)
New in version 2013.1.
pycuda.gpuarray.dot(a, b, dtype=None, stream=None)
Dot product of a and b.
pycuda.gpuarray.subset_dot(subset, a, b, dtype=None, stream=None)
pycuda.gpuarray.max(a, stream=None)
Maximum of the elements of a.
pycuda.gpuarray.min(a, stream=None)
Minimum of the elements of a.
pycuda.gpuarray.subset_max(subset, a, stream=None)
pycuda.gpuarray.subset_min(subset, a, stream=None)
I have not fully figured out the subset_* variants; my reading is that subset is a GPUArray of indices into a, so that the reduction only runs over those entries.
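A small sketch of the plain reductions; the subset_sum call at the end is based on the reading above, so treat it as an assumption:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.arange(1, 6, dtype=np.float32))   # [1 2 3 4 5]
print(gpuarray.sum(a).get())        # -> 15.0  (reductions return a 0-d GPUArray)
print(gpuarray.dot(a, a).get())     # -> 55.0
print(gpuarray.max(a).get())        # -> 5.0

idx = gpuarray.to_gpu(np.array([0, 2, 4], dtype=np.int32))
print(gpuarray.subset_sum(idx, a).get())   # -> 9.0, if subset really is an index array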
Elementwise Functions on GPUArray Instances
The pycuda.cumath module contains elementwise workalikes for the functions in the standard math module.
Rounding and Absolute Value
pycuda.cumath.fabs(array, *, out=None, stream=None)
pycuda.cumath.ceil(array, *, out=None, stream=None)
pycuda.cumath.floor(array, *, out=None, stream=None)
Exponentials, Logarithms and Roots
pycuda.cumath.exp(array, *, out=None, stream=None)
pycuda.cumath.log(array, *, out=None, stream=None)
pycuda.cumath.log10(array, *, out=None, stream=None)
pycuda.cumath.sqrt(array, *, out=None, stream=None)
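These are used exactly like their math-module namesakes, just on whole arrays; a quick sketch:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

x = gpuarray.to_gpu(np.array([1., 4., 9.], dtype=np.float32))
print(cumath.sqrt(x).get())    # -> [1. 2. 3.]
print(cumath.fabs(-x).get())   # -> [1. 4. 9.]
print(cumath.exp(gpuarray.zeros(3, np.float32)).get())   # -> [1. 1. 1.]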
Trigonometric Functions
pycuda.cumath.sin(array, *, out=None, stream=None)
pycuda.cumath.cos(array, *, out=None, stream=None)
pycuda.cumath.tan(array, *, out=None, stream=None)
pycuda.cumath.asin(array, *, out=None, stream=None)
pycuda.cumath.acos(array, *, out=None, stream=None)
pycuda.cumath.atan(array, *, out=None, stream=None)
Hyperbolic Functions
pycuda.cumath.sinh(array, *, out=None, stream=None)
pycuda.cumath.cosh(array, *, out=None, stream=None)
pycuda.cumath.tanh(array, *, out=None, stream=None)
Floating Point Decomposition and Assembly
pycuda.cumath.fmod(arg, mod, stream=None)
Return the floating point remainder of the division arg/mod, for each element in arg and mod.
pycuda.cumath.frexp(arg, stream=None)
Return a tuple (significands, exponents) such that arg == significand * 2**exponent.
pycuda.cumath.ldexp(significand, exponent, stream=None)
Return a new array of floating point values composed from the entries of significand and exponent, paired together as result = significand * 2**exponent.
pycuda.cumath.modf(arg, stream=None)
Return a tuple (fracpart, intpart) of arrays containing the fractional and integer parts of arg.
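A sketch of the decomposition/assembly round trip, with expected values in the comments:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

x = gpuarray.to_gpu(np.array([2.5, 7.0], dtype=np.float32))
m = gpuarray.to_gpu(np.array([2.0, 3.0], dtype=np.float32))
print(cumath.fmod(x, m).get())        # -> [0.5 1. ]

frac, intpart = cumath.modf(x)        # fractional / integer parts, elementwise
print(frac.get(), intpart.get())      # -> [0.5 0. ] [2. 7.]

sig, ex = cumath.frexp(x)             # x == sig * 2**ex, elementwise
print(cumath.ldexp(sig, ex).get())    # reassembles x -> [2.5 7. ]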
Single-pass Custom Expression Evaluation
Evaluating involved expressions on GPUArray instances can be somewhat inefficient, because a new temporary is created for each intermediate result. The functionality in the module pycuda.elementwise contains tools to help generate kernels that evaluate multi-stage expressions on one or several operands in a single pass.
pycuda.elementwise.ElementwiseKernel(arguments, operation, name="kernel", keep=False, options=[], preamble="")
Generate a kernel that takes a number of scalar or vector arguments and performs the scalar operation on each entry of its arguments, if that argument is a vector.
arguments is specified as a string formatted as a C argument list. operation is specified as a C assignment statement, without a semicolon. Vectors in operation should be indexed by the variable i.
name specifies the name under which the kernel is compiled; keep and options are passed unmodified to pycuda.compiler.SourceModule.
preamble specifies some source code that is included before the elementwise kernel specification. You may use it to include other files and/or define functions that are used by operation.
__call__(*args, range=None, slice=None)
Invoke the generated scalar kernel. The arguments may either be scalars or GPUArray instances.
If range is given, it must be a slice object and specifies the range of indices i for which the operation is carried out.
If slice is given, it must be a slice object and specifies the range of indices i for which the operation is carried out, truncated to the container. In addition, slice may contain negative indices to index relative to the end of the array.
If stream is given, it must be a pycuda.driver.Stream object, in which the execution will be serialized.
Here is a usage example:
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))
from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
Custom Reductions
pycuda.reduction.ReductionKernel(dtype_out, neutral, reduce_expr, map_expr=None, arguments=None, name="reduce_kernel", keep=False, options=[], preamble="", allocator=None)
Generate a kernel that takes a number of scalar or vector arguments (at least one vector argument), performs map_expr on each entry of the vector argument, and then reduce_expr on the outcome of that. neutral serves as the initial value. preamble offers the possibility to add preprocessor directives and other code (such as helper functions) before the actual reduction kernel code.
Vectors in map_expr should be indexed by the variable i. reduce_expr uses the formal values "a" and "b" to indicate two operands of a binary reduction operation. If you do not specify a map_expr, "in[i]" is automatically assumed and treated as the only input argument.
dtype_out specifies the numpy.dtype in which the reduction is performed and in which the result is returned. neutral is specified as a float or integer formatted as a string. reduce_expr and map_expr are specified as string-formatted operations, and arguments is specified as a string formatted as a C argument list. name specifies the name under which the kernel is compiled; keep and options are passed unmodified to pycuda.compiler.SourceModule. preamble is specified as a string of code.
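No example was given above, so here is the standard dot-product-style sketch: map_expr forms x[i]*y[i] and reduce_expr sums the results.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.curandom import rand as curand
from pycuda.reduction import ReductionKernel

a = curand((400,), dtype=np.float32)
b = curand((400,), dtype=np.float32)

krnl = ReductionKernel(np.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")

my_dot_prod = krnl(a, b).get()
assert np.allclose(my_dot_prod, np.dot(a.get(), b.get()), rtol=1e-4)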
Parallel Scan / Prefix Sum
pycuda.scan.ExclusiveScanKernel(dtype, scan_expr, neutral, name_prefix="scan", options=[], preamble="")
Generates a kernel that can compute a prefix sum using any associative operation given as scan_expr. scan_expr uses the formal values "a" and "b" to indicate two operands of an associative binary operation. neutral is the neutral element of scan_expr, obeying scan_expr(a, neutral) == a. dtype specifies the type of the arrays being operated on. name_prefix is used for kernel names to ensure recognizability in profiles and logs. options is a list of compiler options to use when building. preamble specifies a string of code that is inserted before the actual kernels.
__call__(self, input, output, allocator=None, queue=None)
pycuda.scan.InclusiveScanKernel(dtype, scan_expr, neutral=None, name_prefix="scan", options=[], preamble="", devices=None)
Works like ExclusiveScanKernel. Unlike the exclusive case, neutral is not required.
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")
n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)
knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Custom data types in Reduction and Scan
If you would like to use your own (struct/union/whatever) data types in scan and reduction, my understanding is that you first have to register them so that PyCUDA knows how the numpy dtype maps to a C type name:
pycuda.tools.register_dtype(dtype, name)
dtype is a numpy.dtype.
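A minimal, hedged sketch of that registration step; the struct name and fields here are made up for illustration, and the matching C-side struct would still have to be declared in the kernel preamble:

import numpy as np
import pycuda.tools

# Hypothetical two-field record type; after registering it, scan/reduction
# code can refer to it by the C type name "int_pair".
int_pair = np.dtype([("first", np.int32), ("second", np.int32)])
pycuda.tools.register_dtype(int_pair, "int_pair")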
GPGPU Algorithms
reikna offers numpy-like functionality (FFT, random number generation, matrix multiplication) that can operate on pycuda.gpuarray.GPUArray objects.
reikna: https://pypi.org/project/reikna/