General-purpose programming on GPU
First steps in CUDA
Giuseppe Bilotta, Eugenio Rustico, Alexis Hérault
DMI — Università di Catania
Sezione di Catania — INGV
Vector types
CUDA has built-in support for vector types: multi-dimensional data with 1 to 4 components, addressed by .x
, .y
, .z
, .w
. Some definitions:
struct uchar1
{
unsigned char x;
};
struct __align__(4) ushort2
{
unsigned short x, y;
};
struct uint3
{
unsigned int x, y, z;
};
struct __builtin_align__(16) float4
{
float x, y, z, w;
};
You can make them available in CPU code by including the appropriate header:
#include <vector_types.h>
Example usages:
- images with 8-bit color (
uchar3
) (RGB => x, y, z); - images with 32-bit color and transparency (
uint4
) (RGBA => x, y, z, w); - particle systems (
float3
) (x, y, z: physical coordinates);
Texture memory
3D graphics uses textures to ‘draw’ fancy stuff on 3D surfaces. Texture allocation and reading are also made available to the computing part of a GPU and can improve memory access or reduce computations in some use cases.
Kernels only have read-only access to texture memory. Reading a texel (texture element) is done with a fetch at given coordinates on atexture reference.
Before it can be used, a texture reference must be bound to a memory region. Binding is done by the CPU. Multiple texture references can be bound to the same or to overlapping memory areas.
A texture reference can have one, two or three dimensions. They can be bound either to standard linear memory addresses, or to special memory allocations called CUDA Arrays.
Texture references can only be used for integers (signed and unsigned, 8-, 16- or 32-bit wide), floats, and for the corresponding 1-, 2- and 4-component (but not 3-component) vector types.
Normalization of the values (mapping the value range to [0.0, 1.0] or [-1.0, 1.0]) can also be done automatically by texture references for 8-bit or 16-bit signed and unsigned integers. In this mode, for example, an unsigned 8-bit value of 0xcd
(decimal 205) will be fetched as 0.803921569f
(205/255).
Texture coordinates are floats in the range [0, N) where N is the texture size in that dimension. Example: a 64×32 texture will have coordinates [0,63]×[0,31]. Textures can be set to use normalized coordinates, mapping the actual size to [0, 1) in all dimensions.
Out-of-bounds coordinates are clamped, i.e. replaced with the closest in-bound coordinate. Example: a fetch for (-3.3, 33.3) on the previous texture would retrieve (0, 31). With normalized coordinates, textures can be set to wrap out of bounds coordinates. A fetch for (1.25, -1.25) would retrieve (0.25, 0.75).
Coordinates that do not fall exactly on a texel can return either the nearest neighbour or a value interpolated linearly from the neighbouring texels.
Some high-level examples. A 1-dimensional texture of float
elements, returning the corresponding element type:
texture<float, 1, cudaReadModeElementType> posTex;
A 2-dimensional texture of char4
elements, returning a float4
with components in [-1.0, 1.0]:
texture<char4, 2, cudaReadModeNormalizedFloat> pixTex;
Texture references are always static (they have file scope) and must be global (i.e. do not declare them inside a function or structure).
The texture<>
construct is a high-level interface to the structure
struct textureReference {
int normalized;
enum cudaTextureFilterMode filterMode;
enum cudaTextureAddressMode addressMode[3];
struct cudaChannelFormatDesc channelDesc;
}
that allow you to choose if you want to normalize coordinates (normalized = 1
), interpolate coordinates (filterMode = cudaFilterModeLinear
) or set out-of-bounds coordinates to wrap in any particular direction (addressMode[i] = cudaAddressModeWrap
).
Binding textures
Examples for binding textures to linear memory:
texture<float, 1> oneTex;
texture<float, 2> twoTex;
float *dVector;
cudaMalloc(&dVector, width*height*sizeof(float));
cudaBindTexture(NULL, oneTex, dVector, vecsize);
cudaBindTexture2D(NULL, twoTex, dVector, twoTex.channelDesc,
width, height, pitch);
cudaUnbindTexture(twoTex);
cudaUnbindTexture(oneTex);
In the 2D case it is necessary to specify the pitch, i.e. the byte length of a row. This is typically width*sizeof(element)
, but may be larger is the rows are padded in memory to ensure a given alignment.
The first parameter is used to retrieve the offset that must be used to access elements, but it's only needed when the memory was not allocated with cudaMalloc
(e.g. a texture pointing to a subset of an existing memory area) to comply with the memory alignment requirements of textures.
Transposing an image, with and without textures:
- CPU code to transpose an image
- GPU code to transpose an image (no textures)
- GPU code to transpose an image (1D textures)
- GPU code to transpose an image (2D textures)
Further texture examples: increasing the depth of an image, and various address modes
- include file to read/write images in PAM format
- example PAM image
- base GPU code with 8-to-16 bitdepth conversion
- further texture usage examples
PAM is a simple but flexible uncompressed bitmap format developed within the netpbm image manipulation toolkit . ImageMagick can be used to display and convert PAM images.
CUDA Arrays
CUDA arrays are memory areas dedicate to textures. They are read-only for the GPU (and are only accessible through texture fetches), and can be written to by the CPU using cudaMemcpyToArray
. Devices with capability 2.0 or higher can write to CUDA arrays using surfaces, a read-write feature similar to textures.
Allocating a CUDA array requires the specification of a channel format description matching the one used by the texture that needs to be mapped on it.
Allocation:
texture<> someTex;
cudaArray *dArray;
cudaMallocArray(&dArray, &someTex.channelDesc, width, height);
Use (can copy at an offset wo, ho
in the array):
cudaMemcpyToArray(dArray, wo, ho, source, size, cudaMemcpyHostToDevice);
cudaBindTextureToArray(someTex, dArray);
Release:
cudaFreeArray(dArray);
The channel format description
The cudaChannelFormatDesc
describes the format of a texture element.
struct cudaChannelFormatDesc {
int x, y, z, w;
enum cudaChannelFormatKind f;
}
where x, y, z, w
are set to the number of bits for each component, and f
is one of cudaChannelFormatKindSigned
,cudaChannelFormatKindUnsigned
, cudaChannelFormatKindFloat
.
Example, for float
texels we could create a channel with
cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
while for short4
texels this would be
cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindSigned);