VexCL: Vector expression template library for OpenCL

Reposted from: http://www.codeproject.com/Articles/415058/VexCL-Vector-expression-template-library-for-OpenC

Introduction

VexCL is a vector expression template library for OpenCL. It has been created to ease C++-based OpenCL development. Multi-device (and multi-platform) computations are supported. The source code of the library is publicly available under the MIT license. See the Doxygen-generated documentation at http://ddemidov.github.com/vexcl.

This is the first of two articles describing the VexCL library. This first part is an introduction to the VexCL interface. The second part compares VexCL performance with existing GPGPU implementations. The comparison is based on odeint, a modern C++ library for the numerical solution of ordinary differential equations.

To quote the Khronos Group, the organization behind the OpenCL standard: "OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software."

The weakest sides of OpenCL are probably the lack of tools and libraries around it and the amount of boilerplate code needed to develop OpenCL applications. The VexCL library tries to solve the latter issue. To start with an example, consider the "hello world" problem of OpenCL: addition of two vectors. The following code block contains a program implementing the task in pure OpenCL. Note that the official C++ bindings are used here; the C variant would be much more verbose.

#include <iostream>
#include <vector>
#include <string>
#include <cstring> // for strlen() used when building the program source

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>

// Compute c = a + b.
static const char source[] =
    "#if defined(cl_khr_fp64)\n"
    "#  pragma OPENCL EXTENSION cl_khr_fp64: enable\n"
    "#elif defined(cl_amd_fp64)\n"
    "#  pragma OPENCL EXTENSION cl_amd_fp64: enable\n"
    "#else\n"
    "#  error double precision is not supported\n"
    "#endif\n"
    "kernel void add(\n"
    "       uint n,\n"
    "       global const double *a,\n"
    "       global const double *b,\n"
    "       global double *c\n"
    "       )\n"
    "{\n"
    "    uint i = get_global_id(0);\n"
    "    if (i < n) {\n"
    "       c[i] = a[i] + b[i];\n"
    "    }\n"
    "}\n";

int main() {
    const unsigned int N = 1 << 20;

    try {
	// Get list of OpenCL platforms.
	std::vector<cl::Platform> platform;
	cl::Platform::get(&platform);

	if (platform.empty()) {
	    std::cerr << "OpenCL platforms not found." << std::endl;
	    return 1;
	}

	// Get first available GPU device which supports double precision.
	cl::Context context;
	std::vector<cl::Device> device;
	for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
	    std::vector<cl::Device> pldev;

	    try {
		p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);

		for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
		    if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;

		    std::string ext = d->getInfo<CL_DEVICE_EXTENSIONS>();

		    if (
			    ext.find("cl_khr_fp64") == std::string::npos &&
			    ext.find("cl_amd_fp64") == std::string::npos
		       ) continue;

		    device.push_back(*d);
		    context = cl::Context(device);
		}
	    } catch(...) {
		device.clear();
	    }
	}

	if (device.empty()) {
	    std::cerr << "GPUs with double precision not found." << std::endl;
	    return 1;
	}

	std::cout << device[0].getInfo<CL_DEVICE_NAME>() << std::endl;

	// Create command queue.
	cl::CommandQueue queue(context, device[0]);

	// Compile OpenCL program for found device.
	cl::Program program(context, cl::Program::Sources(
		    1, std::make_pair(source, strlen(source))
		    ));

	try {
	    program.build(device);
	} catch (const cl::Error&) {
	    std::cerr
		<< "OpenCL compilation error" << std::endl
		<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
		<< std::endl;
	    return 1;
	}

	cl::Kernel add = cl::Kernel(program, "add");

	// Prepare input data.
	std::vector<double> a(N, 1);
	std::vector<double> b(N, 2);
	std::vector<double> c(N);

	// Allocate device buffers and transfer input data to device.
	cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		a.size() * sizeof(double), a.data());

	cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		b.size() * sizeof(double), b.data());

	cl::Buffer C(context, CL_MEM_READ_WRITE,
		c.size() * sizeof(double));

	// Set kernel arguments.
	add.setArg(0, N);
	add.setArg(1, A);
	add.setArg(2, B);
	add.setArg(3, C);

	// Launch kernel on the compute device.
	queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);

	// Get result back to host.
	queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(double), c.data());

	// Should get '3' here.
	std::cout << c[42] << std::endl;
    } catch (const cl::Error &err) {
	std::cerr
	    << "OpenCL error: "
	    << err.what() << "(" << err.err() << ")"
	    << std::endl;
	return 1;
    }
}

I am sorry for wasting page space, but that is the point of this example. Feel free to collapse this ugly monster away as soon as you notice that it contains 133 lines of code. As you can see, the basic steps of an OpenCL program are: selecting the compute device, initializing the OpenCL context, building the OpenCL program, allocating and initializing memory on the selected device, and running the compute kernel. VexCL strives to simplify these steps (and to get rid of them where possible). A program that solves the same problem with the help of the VexCL library is presented below.

#include <iostream>
#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    const unsigned int N = 1 << 20;

    try {
	// Init VexCL context: grab one GPU with double precision.
	vex::Context ctx(
		vex::Filter::Type(CL_DEVICE_TYPE_GPU) &&
		vex::Filter::DoublePrecision &&
		vex::Filter::Count(1)
		);

	if (ctx.queue().empty()) {
	    std::cerr << "GPUs with double precision not found." << std::endl;
	    return 1;
	}

	std::cout << ctx << std::endl;

	// Prepare input data.
	std::vector<double> a(N, 1);
	std::vector<double> b(N, 2);
	std::vector<double> c(N);

	// Allocate device vectors and transfer input data to device.
	vex::vector<double> A(ctx.queue(), a);
	vex::vector<double> B(ctx.queue(), b);
	vex::vector<double> C(ctx.queue(), N);

	// Launch kernel on compute device.
	C = A + B;

	// Get result back to host.
	vex::copy(C, c);

	// Should get '3' here.
	std::cout << c[42] << std::endl;
    } catch (const cl::Error &err) {
	std::cerr << "OpenCL error: " << err << std::endl;
	return 1;
    }
}

This program is much more concise (45 lines, to be precise) and almost as efficient. VexCL uses template metaprogramming techniques, so most of the work is done at compile time. The only considerable overhead in the VexCL example is the dynamic construction of the kernel source. But that is a relatively lightweight operation, and it is performed only once, when an expression is first encountered in your program.

VexCL makes heavy use of C++11 features, so your compiler has to be modern enough. GCC versions 4.6 and above are fully supported. Microsoft Visual C++  2010 manages to compile the project with some features disabled: since it does not support variadic templates, only one-argument built-in functions are enabled;  user functions are not available at all.

VexCL interface description

Compute device selection

VexCL supports multi-device computations. The compute devices you use may even belong to different OpenCL platforms; for example, a single VexCL context may include AMD and NVIDIA graphics cards as well as an Intel CPU. The easiest way to initialize VexCL is through the vex::Context class. Its constructor accepts a device filter, i.e. a functor acting on cl::Device. Several standard filters are defined: compute devices may be filtered by name, vendor, platform, type (CPU or GPU), etc. Device filters may be combined with logical operators. For example, the following piece of code lists all available GPUs supporting double precision computations:

vex::Context ctx(
    vex::Filter::Type(CL_DEVICE_TYPE_GPU) && vex::Filter::DoublePrecision
    );
std::cout << ctx << std::endl;

In case the built-in functionality is not enough, users may provide their own filters.
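For example, a minimal sketch of a custom filter selecting devices with at least 1 GB of global memory might look as follows. The functor name and threshold are illustrative; as stated above, the filter is just a functor acting on cl::Device.

// A custom device filter: a plain functor acting on cl::Device.
// MinMemory and its threshold are illustrative, not part of VexCL.
struct MinMemory {
    cl_ulong min_bytes;
    bool operator()(const cl::Device &d) const {
        return d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>() >= min_bytes;
    }
};

vex::Context ctx(MinMemory{1UL << 30}); // devices with at least 1 GB of memory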

Almost any class in VexCL accepts a std::vector of cl::CommandQueues as a constructor argument, where each command queue is presumably located on a separate compute device. This list of queues may be obtained by a call to the queue() method of an initialized vex::Context, or it may be supplied by the user. This makes it easy to incorporate VexCL into existing projects that have their own means of OpenCL initialization.
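A minimal sketch of the latter scenario, assuming context and device come from your project's own initialization code:

// Wrap an existing context/device pair into a queue list for VexCL.
const size_t N = 1 << 20;
std::vector<cl::CommandQueue> queues;
queues.push_back(cl::CommandQueue(context, device));

vex::vector<double> x(queues, N); // VexCL objects accept the queue list directly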

Vector arithmetic

Once you have initialized the VexCL context, you can allocate device vectors on the associated devices. Each device in  the VexCL context gets its own share of vector  memory and computational work. Vectors in VexCL are partitioned across all devices given at construction time. All vectors use  the same partitioning scheme.  This guarantees that corresponding elements of two equally sized vectors are located on  the same compute devices and no inter-device data transfer is required for computations.

In the example below, device vector x is allocated across all devices supporting double precision.

vex::Context ctx(vex::Filter::DoublePrecision);
const size_t N = 1024 * 1024;
vex::vector<double> x(ctx.queue(), N);

The default partitioning scheme is based on measuring the performance of the code a = b + c; on every device in the context, where a, b, and c are vectors residing on the benchmarked device. This test is performed the first time any multi-device vector is allocated. Each device then gets a part of each vector proportional to its performance. This is another overhead introduced by using the VexCL library in a multi-device context, but it is easily removed by providing a device weighting function. If, for example, you have a homogeneous set of compute devices, then equal partitioning is the best choice. To skip the bandwidth test, you assign the same weight to each device:

vex::partitioning_scheme<>::set(
    [](const cl::Context&, const cl::Device&) { return 1.0; }
    );

Note that you have to set the partitioning scheme before you allocate any vectors; otherwise, the default partitioning scheme will be used. The partitioning behaviour may be altered only once. This guarantees that all vectors are partitioned consistently.

Once the device vectors are allocated, simple and intuitive arithmetic expressions may be used to alter their state. For every expression you use, the appropriate kernel is generated, compiled (the first time the expression is encountered in your program), and called automatically. If you want the sources of the generated kernels to be written to the standard output stream, define the VEXCL_SHOW_KERNELS macro before including the VexCL headers.

const size_t n = 1 << 20;
std::vector<float> x(n);
std::generate(x.begin(), x.end(), [](){ return (float)rand() / RAND_MAX; });

vex::vector<float> X(ctx.queue(), x);
vex::vector<float> Y(ctx.queue(), n);
vex::vector<float> Z(ctx.queue(), n);

Y = 42;
Z = sqrt(2 * X) + pow(cos(Y), 2.0);
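For instance, to have the kernel generated for the expression above printed at run time, the macro can be defined before the library header is included:

#define VEXCL_SHOW_KERNELS // must come before the VexCL headers
#include <vexcl/vexcl.hpp>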

Computational work is split between the devices. The vector expressions supported by VexCL are embarrassingly parallel, so if you have two or more compute devices, you should get linear speedup for your code. You can copy the result back to the host, or you can use vex::vector::operator[] to read (or write) vector elements directly. The latter technique is very inefficient, though, and should be used for debugging purposes only:

copy(Z, x);
assert(x[42] == Z[42]);

Reductions

Reduction of a vector to a single value is a commonly used operation. You use reduction to obtain the sum of vector elements, or to find the maximum/minimum element of a vector. The class template vex::Reductor allows you to perform reduction of an arbitrary expression. The supported types of reduction are SUM, MIN, and MAX. This is how you obtain the sum of all elements in an expression:

vex::vector<double> x(ctx.queue(), 4096);
vex::vector<double> y(ctx.queue(), 4096);

// ...

vex::Reductor<double,vex::SUM> sum(ctx.queue());

// Sum:
std::cout << sum(x) << std::endl;

// Inner product:
std::cout << sum(x * y) << std::endl;

// How many elements in x are greater than 1?
std::cout << sum(x > 1) << std::endl;

Another example is finding the maximum distance from the origin for a set of 2D points. Point coordinates are stored in x and y vectors:

vex::Reductor<double, vex::MAX> max(ctx.queue());
std::cout << max(sqrt(pow(x, 2.0) + pow(y, 2.0))) << std::endl;

Note that a vex::Reductor instance allocates a small temporary buffer on each device in its queue list at construction time, so for performance reasons it should be allocated once and used throughout the application's lifetime.
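For example, here is a sketch of hoisting the reductor out of an iterative loop; max_iters and tolerance are illustrative names, not part of VexCL.

// Construct the reductor once and reuse it on every iteration,
// avoiding repeated allocation of its temporary buffers.
vex::Reductor<double, vex::MAX> max(ctx.queue());

for(int iter = 0; iter < max_iters; ++iter) {
    // ... update the device vector x ...
    if (max(fabs(x)) < tolerance) break; // no per-iteration buffer allocation
}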

User-defined functions

Simple functions to be used in kernels are easily defined in VexCL. All you need to do is define the function body, its return type, and the types of its arguments. After that, the function may be used in vector expressions in the same way built-in functions are. Note that the function body has to be defined at global scope and has to be of type extern const char[]. These requirements are necessary in order to use the body string as a template parameter. The following example counts the 2D points located in the first quadrant:

extern const char between_body[] = "return prm1 <= prm2 && prm2 <= prm3;";
UserFunction<between_body, bool(double, double, double)> between;

// Number of points in first quadrant
size_t points_in_1q(
    const vex::Reductor<size_t, vex::SUM> &sum,
    const vex::vector<double> &x,
    const vex::vector<double> &y
    )
{
    return sum(between(0.0, atan2(y, x), M_PI/2));
}

Any valid vector expression may be used as a function argument. Note that the function parameters in the body definition are always named prm1, prm2, etc. The function body does not have to be a one-liner: any valid sequence of OpenCL statements is allowed. Since OpenCL kernels are compiled at runtime, you will not get any compilation errors regarding your function until the first expression using it is encountered in your program.
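As a sketch of a multi-line body, here is a hypothetical clamping function; the name and body are illustrative, following the prm1, prm2, ... naming convention described above.

// Clamp prm1 to the interval [prm2, prm3].
extern const char clamp_body[] =
    "if (prm1 < prm2) return prm2;\n"
    "if (prm1 > prm3) return prm3;\n"
    "return prm1;";
UserFunction<clamp_body, double(double, double, double)> clamp;

// Usage in a vector expression, e.g.: y = clamp(x, 0.0, 1.0);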

Custom functions may be used not only to introduce new functionality, but also for performance reasons. See the Kernel generation section below for further explanation.

Sparse matrix-vector products and stencil convolutions

One of the most common operations in linear algebra is matrix-vector multiplication. The class vex::SpMat holds the representation of a sparse matrix spanning several devices. Its constructor accepts a sparse matrix stored in CRS format. On the compute devices, the matrix is stored in a hybrid ELL-CRS format. In the example below, the device matrix is constructed from host vectors and is used to compute the maximum residual value:

// Construct the matrix.
vex::SpMat<double> A(ctx.queue(), n, row.data(), col.data(), val.data());

// Compute residual. r, f, and u are device vectors.
r = f - A * u;
double res = max(fabs(r));

See the file examples/cg.cpp for an example implementing the conjugate gradient method for the numerical solution of a system of linear equations.
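For orientation, the core of such a solver might look like the following sketch. This is illustrative only, not the actual code of examples/cg.cpp; the matrix A, the vectors u and f, the size n, and the stopping criterion are assumed to be set up beforehand.

// Sketch of a conjugate gradient loop built from VexCL primitives
// (SpMV, vector arithmetic, reductions).
vex::vector<double> r(ctx.queue(), n), p(ctx.queue(), n), q(ctx.queue(), n);
vex::Reductor<double, vex::SUM> sum(ctx.queue());
vex::Reductor<double, vex::MAX> max(ctx.queue());

r = f - A * u;                           // initial residual
double rho1 = sum(r * r), rho2 = 0;

for(size_t iter = 0; iter < n; ++iter) {
    if (iter == 0)
        p = r;
    else
        p = r + (rho1 / rho2) * p;       // update search direction

    q = A * p;                           // sparse matrix-vector product
    double alpha = rho1 / sum(p * q);

    u += alpha * p;                      // improve solution
    r -= alpha * q;                      // update residual

    rho2 = rho1;
    rho1 = sum(r * r);

    if (max(fabs(r)) < 1e-8) break;      // converged
}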

Another commonly used operation is the convolution of a vector with a stencil. VexCL supports three kinds of stencils: simple stencils, generalized stencils, and user-defined stencil operators. A simple stencil is based on a one-dimensional array and is used as is. A generalized stencil is based on a dense two-dimensional matrix and is used together with a built-in or user-defined function of one argument. To define a stencil operator, the user has to provide its dimensions (width and center) and a body string.

Simple stencil convolution comes in handy in many situations. For example, it allows us to apply a moving average filter to a device vector. All you need to do is construct a vex::stencil object from a std::vector, a pair of iterators, or an initializer list. You also need to specify the position of the stencil center:

// Moving average with a 5-point window.
std::vector<double> sdata(5, 0.2);
vex::stencil<double> s(ctx.queue(), sdata, sdata.size() / 2);

// Convolve x with s. x and y are device vectors.
y = x * s;

A generalized stencil is basically a small dense coefficient matrix. It is used in combination with a built-in or user-defined function of one argument. For example, the following nonlinear vector operator:

y[i] = sin(x[i-1] - x[i]) + sin(x[i] - x[i+1]);

may be implemented as:

const uint rows   = 2;
const uint cols   = 3;
const uint center = 1;

vex::gstencil<double> S(ctx.queue(), rows, cols, center, {
    1, -1,  0,	// stencil coefficients for first sin argument.
    0,  1, -1	// stencil coefficients for second sin argument.
    });

y = sin(x * S);

A user-defined function may be used together with generalized stencils. To implement the following operator:

y[i] = x[i] + pow(x[i-1] + x[i+1], 3);

you would write:

extern const char pow3_body[] = "return pow(prm1, 3);";
UserFunction<pow3_body, double(double)> pow3;

// ctx is assumed to be a vex::Context visible at this scope.
void pow3_oper(const vex::vector<double> &x, vex::vector<double> &y) {
    vex::gstencil<double> S(ctx.queue(), 1, 3, 1, {1, 0, 1});
    y = x + pow3(x * S);
}

User-defined stencil operators allow you to define arbitrary stencils efficiently. You need to specify the stencil dimensions and provide the body of a function returning the local value of the stencil. Let us implement the previous example with the help of a stencil operator. Note that in the body string, elements of the vector that the stencil is applied to are referred to as X[i], where i is relative to the stencil center:

// Defined at global scope:
extern const char pow3_op_body[] =
    "double t = X[-1] + X[1];\n"
    "return X[0] + t * t * t;";

// Define stencil operator:
const uint width  = 3;
const uint center = 1;
vex::StencilOperator<double, width, center, pow3_op_body> pow3_op(ctx.queue());

// Apply stencil operator to device vector x:
y = pow3_op(x);

The latter implementation is more efficient because it uses a single kernel launch. Stencil convolution, like sparse matrix-vector multiplication, is a two-step operation: first, halo points have to be exchanged between neighboring compute devices; second, the OpenCL kernel has to be launched. Because of this, stencil convolution is allowed only in additive expressions. Internally, such expressions are computed separately: first, all terms except the stencil convolution are computed and stored in the result vector; second, the stencil convolution is computed and added to the result. So the expression y = x + pow3(x * S); is functionally equivalent to:

y = x;              // First kernel launch.
y += pow3(x * S);   // Second kernel launch.

Kernel generation

Each expression you use in your code results in an OpenCL kernel. For example, the following expression:

x = 2 * y - sin(z);

triggers generation, compilation, and launch of this kernel:

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
kernel void Sub_Mul_cvsinv(
	ulong n,
	global double *res,
	int prmll,
	global double *prmlr,
	global double *prmr1
	)
{
	size_t i = get_global_id(0);
	size_t grid_size = get_num_groups(0) * get_local_size(0);
	while (i < n) {
		res[i] = ((prmll * prmlr[i]) - sin(prmr1[i]));
		i += grid_size;
	}
}

The kernel for each expression is generated and compiled only once (per OpenCL context), when the expression is first encountered in your program. The expression in the above example corresponds to the expression tree in the figure below:

[Figure: expression tree for the expression x = 2 * y - sin(z)]

As you can see, the type of an expression tree depends on its operations and on the types of its operands. So the expression 2.0 * y - sin(z) would result in a similar, but different, tree, since the type of the constant here is double instead of int. This should be kept in mind if you want to limit the number of generated kernels. Another thing to remember is that there is no way to tell whether several terminals of a tree refer to the same data. Consider the example from the User-defined functions section, where we had to count the points in the first quadrant. We could implement the example without using a custom function:

return sum(atan2(y, x) >= 0.0 && atan2(y, x) <= M_PI/2);

The resulting kernel would read the data in vectors x and y twice, which is not very nice from a performance standpoint. The introduction of the between function allows us to halve the number of global memory reads here!
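As a minimal illustration of the constant-type point made above:

x = 2   * y; // int constant: one expression type, one generated kernel
x = 2.0 * y; // double constant: a different tree type, so a second kernel is compiled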

Custom kernels

As Kozma Prutkov repeatedly said, "One cannot embrace the unembraceable." So in order to be usable, VexCL has to support custom kernels. When writing an OpenCL kernel, you have to remember that VexCL vectors are distributed across all compute devices present in the context. vex::vector::operator() returns the cl::Buffer object that stores the part of the vector residing on a given device. Other methods useful for kernel authors are vex::vector::part_size() and vex::vector::part_start(). The former returns the size of the vector partition located on a given device; the latter returns the index of the first vector element located on a given device, so x.part_start(d+1) - x.part_start(d) is always equal to x.part_size(d). If the result depends on neighboring points, you have to remember that these points may be located on a different compute device; in this case, the exchange of the halo points has to be arranged manually.

The following example builds and launches a custom kernel for each device in the context:

// Use environment variables to control device selection.
vex::Context ctx( vex::Filter::Env );

const size_t n = 1 << 20;
vex::vector<float> x(ctx.queue(), n);

std::vector<cl::Kernel> kernel(ctx.size());

for(uint d = 0; d < ctx.size(); d++) {
    cl::Program program = vex::build_sources(ctx.context(d), std::string(
	"kernel void dummy(ulong size, global float *x)\n"
	"{\n"
	"    size_t i = get_global_id(0);\n"
	"    if (i < size) x[i] = 4.2;\n"
	"}\n"
	));
    kernel[d] = cl::Kernel(program, "dummy");
}

for(uint d = 0; d < ctx.size(); d++) {
    kernel[d].setArg(0, static_cast<cl_ulong>(x.part_size(d)));
    kernel[d].setArg(1, x(d));
    ctx.queue(d).enqueueNDRangeKernel(kernel[d], cl::NullRange, n, cl::NullRange);
}

In the rare case that your kernel contains a syntax error, the error description will be written to the standard error stream and a cl::Error exception will be thrown. By the way, VexCL overloads the stream output operator for cl::Error objects to provide more readable error messages.

Scalability

The scalability of the library with respect to the number of compute devices is presented in this section. Effective performance (GFLOPS) and bandwidth (GB/sec) were measured by launching a large number of test kernels on one, two, or three NVIDIA Tesla C2070 cards. The effect of adding a fourth, somewhat slower, device (an Intel Core i7 CPU) was tested as well. The results shown are averaged over 20 runs. The details of the experiments may be found in the file examples/benchmark.cpp. Basically, the performance of the following code was measured:

// Vector arithmetic
a += b + c * d;
// Reduction
double s = sum(a * b);
// Stencil convolution
y = x * s;
// SpMV
y += A * x;

[Figure: effective performance (GFLOPS) vs. number of compute devices]

[Figure: effective bandwidth (GB/sec) vs. number of compute devices]

As you can see, VexCL scales almost linearly with the addition of compute devices. Note that the performance and bandwidth of the stencil convolution operation are much higher than those of the other primitives. This is because the algorithm uses the much faster local (shared) memory, and the formulas for effective performance and bandwidth do not take this into account. Another thing worth noting is the overall degradation of performance after the Intel CPU is added to the VexCL context. The only primitive gaining speed from this addition is vector arithmetic. This is probably because the performance of vector arithmetic was used as the basis for problem partitioning.

Conclusion

VexCL allows you to write compact and readable code without sacrificing too much performance. In the next article, headmyshoulder will talk about integrating VexCL into odeint, a modern C++ library for the numerical solution of ordinary differential equations. odeint now supports GPU computations both with CUDA (using the Thrust library) and with OpenCL (using VexCL). This allows us to compare the performance of the two solutions.
