VexCL is a vector expression template library for OpenCL. It has been created to ease OpenCL development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop OpenCL applications. The library provides convenient and intuitive notation for vector arithmetic, reductions, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under the permissive MIT license.
The code is available at https://github.com/ddemidov/vexcl.
Doxygen-generated documentation: http://ddemidov.github.io/vexcl.
Slides from VexCL-related talks:
The paper Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries compares both convenience and performance of several GPGPU libraries, including VexCL.
Table of contents
- Context initialization
- Memory allocation
- Copies between host and devices
- Vector expressions
- Reductions
- Sparse matrix-vector products
- Stencil convolutions
- Fast Fourier Transform
- Multivectors
- Converting generic C++ algorithms to OpenCL
- Custom kernels
- Interoperability with other libraries
- Supported compilers
Context initialization
VexCL can transparently work with multiple compute devices present in the system. A VexCL context is initialized with a device filter, which is simply a functor that takes a reference to a `cl::Device` and returns a `bool`. Several standard filters are provided, but one can easily add a custom functor. Filters may be combined with logical operators. All compute devices that satisfy the provided filter are added to the created context. In the example below, all GPU devices that support double precision arithmetic are selected:
One of the most convenient filters is `vex::Filter::Env`, which selects compute devices based on environment variables. It allows switching compute devices without recompiling the program.
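For instance, assuming the program creates its context with `vex::Filter::Env`, the device can be chosen at launch time; the variable names below are taken from the VexCL documentation (`./my_app` is a placeholder for the user's executable):

```shell
# Select devices whose name contains "Tesla":
OCL_DEVICE=Tesla ./my_app

# Restrict the context to a single device:
OCL_MAX_DEVICES=1 ./my_app
```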
Memory allocation
The `vex::vector<T>` class constructor accepts a const reference to `std::vector<cl::CommandQueue>`. A `vex::Context` instance may be conveniently converted to this type, but it is also possible to initialize the command queues elsewhere, completely eliminating the need to create a `vex::Context`. Each command queue in the list should uniquely identify a single compute device.
The contents of the created vector will be partitioned across all devices present in the queue list. The size of each partition will be proportional to the device bandwidth, which is measured the first time the device is used. All vectors of the same size are guaranteed to be partitioned consistently, which minimizes inter-device communication.
In the example below, three device vectors of the same size are allocated. Vector `A` is copied from host vector `a`, and the other vectors are created uninitialized:
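A sketch of this allocation pattern; the constructor signatures follow the VexCL documentation, and the vector size `n` is chosen arbitrarily here:

```cpp
#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::DoublePrecision);

    const size_t n = 1024 * 1024;
    std::vector<double> a(n, 1.0);  // host vector

    vex::vector<double> A(ctx, a);  // device vector, copied from host vector a
    vex::vector<double> B(ctx, n);  // created uninitialized
    vex::vector<double> C(ctx, n);  // created uninitialized
}
```

Passing `ctx` where a `std::vector<cl::CommandQueue>` is expected relies on the implicit conversion mentioned above; a manually built queue list would work the same way.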