Usage of computations
All reikna computation classes are derived from the Computation class and therefore share the same API and behavior. A computation object is an opaque typed function-like object containing all the information necessary to generate GPU kernels that implement some algorithm, along with the necessary internal temporary and persistent memory buffers. Before use it needs to be compiled by calling compile() for a given Thread (thus using its associated device and queue). This method returns a ComputationCallable object which takes GPU arrays and scalar parameters and calls its internal kernels.
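For illustration, here is a minimal sketch of this workflow using the FFT computation from reikna.fft (any other computation follows the same pattern; the tolerance in the final check is an arbitrary choice for single precision):

import numpy
from reikna import cluda
from reikna.fft import FFT

# Create a Thread on the first available OpenCL device
api = cluda.ocl_api()
thr = api.Thread.create()

data = (numpy.random.rand(256) + 1j * numpy.random.rand(256)).astype(numpy.complex64)

# The computation object only records the types and shapes of its parameters;
# compile() binds it to the Thread and returns a ComputationCallable.
fft = FFT(data)
cfft = fft.compile(thr)

data_dev = thr.to_device(data)
res_dev = thr.empty_like(data_dev)
cfft(res_dev, data_dev)  # outputs come first, then inputs

reference = numpy.fft.fft(data)
result = thr.from_device(res_dev)
assert numpy.linalg.norm(result - reference) / numpy.linalg.norm(reference) < 1e-5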
Computations and transformations
One often needs to perform some simple processing of the input or output values of a computation: scaling, splitting complex values into components, padding, and so on. Some of these operations require additional memory to store intermediate results, and all of them involve the overhead of an extra kernel call and of passing values to and from device memory. Reikna provides an API to define such transformations and attach them to “core” computations, effectively compiling the transformation code into the main kernel(s) and thus avoiding all these drawbacks.
Transformation tree
Before talking about transformations themselves, we need to take a closer look at computation signatures. Every Computation object has a signature attribute containing a funcsigs.Signature object. It is the same signature object as can be extracted from any Python function using the funcsigs.signature function (or inspect.signature from the standard library for Python >= 3.3). When the computation object is compiled, the resulting callable will have this exact signature.
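For comparison, the same Signature class describes ordinary Python functions; a small sketch (the function f here is purely illustrative):

import funcsigs

def f(out, in1, in2, param):
    pass

sig = funcsigs.signature(f)
print(list(sig.parameters))
# ['out', 'in1', 'in2', 'param']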
The base signature for any computation can be found in its documentation (and, sometimes, can depend on the arguments passed to its constructor; see, for example, PureParallel). The signature can change if a user connects transformations to some parameter via connect(); in this case the signature attribute will change accordingly.
All attached transformations form a tree whose roots are the base parameters the computation has right after creation, and whose leaves form the user-visible signature, which the compiled ComputationCallable will have.
As an example, let us consider a pure parallel computation object with one output, two inputs and a scalar parameter, which performs the calculation out = in1 + in2 + param:
from __future__ import print_function
import numpy
from reikna import cluda
from reikna.cluda import Snippet
from reikna.core import Transformation, Type, Annotation, Parameter
from reikna.algorithms import PureParallel
import reikna.transformations as transformations

arr_t = Type(numpy.float32, shape=128)
carr_t = Type(numpy.complex64, shape=128)

comp = PureParallel(
    [Parameter('out', Annotation(carr_t, 'o')),
     Parameter('in1', Annotation(carr_t, 'i')),
     Parameter('in2', Annotation(carr_t, 'i')),
     Parameter('param', Annotation(numpy.float32))],
    """
    VSIZE_T idx = ${idxs[0]};
    ${out.store_idx}(
        idx, ${in1.load_idx}(idx) + ${in2.load_idx}(idx) + ${param});
    """)
The details of creating the computation itself are not important for this example; they are provided here just for the sake of completeness. The initial transformation tree of the comp object looks like this:
   | out   | >>
>> | in1   |
>> | in2   |
>> | param |
Here the insides of the | | boxes are the base computation (the one defined by the developer), and >> denotes the inputs and outputs provided by the user. The computation signature is:
>>> for param in comp.signature.parameters.values():
... print(param.name + ":" + repr(param.annotation))
out:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='o')
in1:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
in2:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
param:Annotation(float32)
Now let us attach a transformation to the output which will split the complex values into their real and imaginary parts: out1 = Re(out), out2 = Im(out):
tr = transformations.split_complex(comp.parameter.out)
comp.parameter.out.connect(tr, tr.input, out1=tr.real, out2=tr.imag)
We have used a pre-created transformation here for simplicity; writing custom transformations is described in Writing a transformation.
In addition, we want in2 to be scaled before being passed to the main computation. To achieve this, we connect a scaling transformation to it:
tr = transformations.mul_param(comp.parameter.in2, numpy.float32)
comp.parameter.in2.connect(tr, tr.output, in2_prime=tr.input, param2=tr.param)
The transformation tree now looks like this:
                   | out   | ----> out1 >>
                   |       |  \--> out2 >>
                >> | in1   |
>> in2_prime ----> | in2   |
>> param2 ----/    |       |
                >> | param |
As can be seen, nothing has changed from the base computation’s point of view: it still gets the same inputs and outputs. But the user-supplied parameters (>>) have changed, which can also be seen in the value of the signature:
>>> for param in comp.signature.parameters.values():
... print(param.name + ":" + repr(param.annotation))
out1:Annotation(Type(float32, shape=(128,), strides=(4,)), role='o')
out2:Annotation(Type(float32, shape=(128,), strides=(4,)), role='o')
in1:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
in2_prime:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
param2:Annotation(float32)
param:Annotation(float32)
Notice that the order of the parameters in the final signature is obtained by traversing the transformation tree depth-first, starting from the base parameters. For more details see the note in the documentation for connect().
The resulting computation returns the value in1 + (in2_prime * param2) + param split into its real and imaginary parts. In order to run it, we have to compile it first. When compile() is called, the data types and shapes of the leaf parameters are propagated to the roots and used to prepare the original computation.
api = cluda.ocl_api()
thr = api.Thread.create()

# Leaf parameters carry the types and shapes needed to allocate the arrays
in1_t = comp.parameter.in1
in2p_t = comp.parameter.in2_prime

out1 = thr.empty_like(comp.parameter.out1)
out2 = thr.empty_like(comp.parameter.out2)
in1 = thr.to_device(numpy.ones(in1_t.shape, in1_t.dtype))
in2_prime = thr.to_device(numpy.ones(in2p_t.shape, in2p_t.dtype))

c_comp = comp.compile(thr)
c_comp(out1, out2, in1, in2_prime, 4, 3)  # positional: param2=4, param=3
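With all input elements equal to 1, param2 = 4 and param = 3, every element of out equals 8 + 0j, so out1 should contain eights and out2 zeros. A quick host-side sketch of a check, using thr.from_device() to pull the results back:

# Compare the device results with a NumPy reference
ref = numpy.ones(128, numpy.complex64) + numpy.ones(128, numpy.complex64) * 4 + 3
assert numpy.allclose(thr.from_device(out1), ref.real)  # all 8.0
assert numpy.allclose(thr.from_device(out2), ref.imag)  # all 0.0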
Transformation restrictions
There are some limitations to the transformation mechanics:
- Transformations are purely parallel, that is, they cannot use local memory. In fact, they are very much like PureParallel computations, except that the indices they use are defined by the main computation and not set by the GPU driver (see the sketch after this list for what a transformation body looks like).
- External endpoints of output transformations cannot point to existing nodes in the transformation tree. This is a direct consequence of the first limitation: it would unavoidably create races between memory writes from different branches. On the other hand, input transformations can be safely connected to existing nodes, including base nodes (although note that inputs are not cached, so even if you load twice from the same index of the same input node, global memory will be queried twice).
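For reference, here is a minimal sketch of such a purely parallel transformation: one that doubles a float32 array (reusing the arr_t type and the Transformation class imported in the example above; connecting it would follow the same connect() pattern shown earlier):

# A transformation has no indices of its own: store_same/load_same access
# "the same" element that the main computation is currently processing.
double = Transformation(
    [Parameter('output', Annotation(arr_t, 'o')),
     Parameter('input', Annotation(arr_t, 'i'))],
    "${output.store_same}(${input.load_same} * 2);")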