介绍
在本教程中,我们将会看到在OpenCL中使用长数据类型。我们也会触及如何和为何要在OpenCL中使用原子变量。在这个过程中,我们介绍了Mali-T600系列GPU支持的64位原子扩展。
长数据类型
在OpenCL嵌入式配置文档中,64位整数(例如,long,ulong)类型是可选的(包括相应的向量数据类型和操作)。然而,Mali-T600系列GPU实现OpenCL完全简档,支持64位整数类型是必须的。64位整数在Mali-T600系列GPU上被支持,并且全硬件加速。
长数据类型被用于需要非常大的整数计算的场合。示例用例包括:
> 定点算术
> 加密/解密
> 散列
> 64位算术
64位原子操作
这一教程需要为跨内核的累加值进行原子操作。32位整数的原子操作是OpenCL 1.1完全简档核心的一部分,因此被所有完全简档实现所支持(包括Mali-T600系列GPU)。然而,我们需要64位整数的原子操作,这在OpenCL 1.1中是可选扩展(cl_khr_int64_base_atomics)。所有Mali-T600系列GPU实现了64位原子操作的扩展。
注意:在OpenCL嵌入式简档中,32位和64位原子操作都是可选扩展。
实现
除非另作说明,所有的代码片段都来自"samples\64_bit_integer\64_bit_integer.cl"。
图像尺寸
我们已经包括了一张512x512大小的输入位图用于这个样例(为保持安装程序的尺寸较小)。然而,在较大的图像使用时,你更有可能看到性能的提升(相比运行在CPU上的C代码)。在使用OpenCL时,有一些相关的启动开销。当输入数据大小比较小时,这些开销可能会大于并行处理带来的好处。
这个样例已经被编码为允许使用任何输入位图。简单修改"sample\assets"目录下的input.bmp为你选择的输入图像。当使用更大图像时,你将会看到更大的计算性能提升。
64位算术
一些脸部识别技术,例如Robust实时对象检测(Viola和Jones, 2001)框架需要使用下面的方程计算一个子窗口的变量:
Variance = ((Σ p) / N )2 - 1/N * Σ(p2)
p是像素值,N是像素总的数目。
对于这个例子,我们仅仅计算像素值的和,平方后的像素的和。我们在一个OpenCL内核中计算这些变量。
如果我们考虑8 bit每像素,最大的像素值为255。平方这个值(255 * 288 = 65025)适合一个ushort(16位类型,最大值为65535)。我们使用ushort8,因为8 * 16位 = 128位,推荐向量宽度。
然而,平方的和与像素的和可能会溢出一个short和int型。因此,我们将它们转换成ulong类型,累加所有在向量中的值,直到得到一个可以加到累加器中的一个单独的值(分别是sumOfPixels和squareOfPixels)。
/* Load 8 pixels (char) and convert them to shorts to calculate the square.*/
ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
/* Square of 255 < 2 ^ 16. */
ushort8 newSquareShort = pixelShort * pixelShort;
/*
* Convert original pixel value and the square to longs to sum
* all the vectors together and add the final values to the
* respective accumulators.
*/
ulong8 pixelLong = convert_ulong8(pixelShort);
ulong8 newSquareLong = convert_ulong8(newSquareShort);
/*
* Use vector data type suffixes (.lo and .hi) to get smaller vector types,
* until we obtain one single value.
*/
ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;
ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;
如果所有的内核在同一时间访问累加器,内存访问冲突会发生。这会导致竞态条件,数据会被丢失。
为避免于此,我们使用atom_add来加一个整数值到一个通过指针引用的值。这确保在加法操作期间,没有其他执行在相同设备上的内核读或写那个内存。原子操作也对其它函数存在(例如,乘法,减法,递增,递减)。这意味着这种操作代价非常高昂,因此只在必要时使用它。
atom_add(sumOfPixels, sumLongPixels3);
atom_add(squareOfPixels, sumLongSquares3);
为了使能64位整数的atom_add,我们使用这个pragma到内核代码中:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
运行样例
参考样例1。
附录1:内核源码
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
/**
* \brief Long data type (64-bit integer) kernel.
* \details This kernel loads 8 pixel values to calculate the square of each pixel value. Then it accumulates the
* square of pixels and the sum of pixels values in the respective accumulators.
* \param[in] imagePixels Input array with image pixels.
* \param[in] squareOfPixels Sum of the square of pixel values.
* \param[out] sumOfPixels Sum of pixel values.
*/
/* [Enable atom_add extension] */
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
/* [Enable atom_add extension] */
__kernel void long_vectors(__global uchar* restrict imagePixels,
__global ulong* restrict squareOfPixels,
__global ulong* restrict sumOfPixels)
{
/*
* Set i to be the ID of the kernel instance.
* If the global work size (set by clEnqueueNDRangeKernel) is n,
* then n kernels will be run and i will be in the range [0, n - 1].
*/
int i = get_global_id(0);
/* [Squares and sums]*/
/* Load 8 pixels (char) and convert them to shorts to calculate the square.*/
ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
/* Square of 255 < 2 ^ 16. */
ushort8 newSquareShort = pixelShort * pixelShort;
/*
* Convert original pixel value and the square to longs to sum
* all the vectors together and add the final values to the
* respective accumulators.
*/
ulong8 pixelLong = convert_ulong8(pixelShort);
ulong8 newSquareLong = convert_ulong8(newSquareShort);
/*
* Use vector data type suffixes (.lo and .hi) to get smaller vector types,
* until we obtain one single value.
*/
ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;
ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;
/* [Squares and sums]*/
/*
* As all the kernels are accessing sumOfPixels
* and squareOfPixels at the same time,
* we use atom_add to ensure only one kernel
* at a time can access the given variables.
* This means that this operation is very expensive,
* so we want to use it only when necessary.
*/
/* [Atomic transaction] */
atom_add(sumOfPixels, sumLongPixels3);
atom_add(squareOfPixels, sumLongSquares3);
/* [Atomic transaction] */
}
附录2:宿主机源码
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
#include "common.h"
#include "image.h"
#include <CL/cl.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <cstddef>
#include <cmath>
using namespace std;
/**
* \brief Long data type (64-bit integer) OpenCL example.
* \details An example to calculate, for an image:
* - the sum of the squares of the pixels values
* - sum of the pixels values.
* Makes use of the long data type and 64-bit atomics.
* The main calculation code is in an OpenCL kernel which is executed on a GPU device.
* \return The exit code of the application, non-zero if a problem occurred.
*/
int main(void)
{
string filename = "assets/input.bmp";
cl_context context = 0;
cl_command_queue commandQueue = 0;
cl_program program = 0;
cl_device_id device = 0;
cl_kernel kernel = 0;
const int numberOfMemoryObjects = 3;
/* Index values for the memory objects. */
const unsigned int imagePixelsIndex = 0;
const unsigned int squareIndex = 1;
const unsigned int sumIndex = 2;
cl_mem memoryObjects[3] = {0, 0, 0};
cl_int errorNumber;
if (!createContext(&context))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create an OpenCL context. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!createCommandQueue(context, &commandQueue, &device))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create the OpenCL command queue. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Checking 64-bit integer atomics extension support. */
if (!isExtensionSupported (device, "cl_khr_int64_base_atomics"))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "cl_khr_int64_base_atomics is not supported on this device. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!createProgram(context, device, "assets/64_bit_integer.cl", &program))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL program." << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
kernel = clCreateKernel(program, "long_vectors", &errorNumber);
if (!checkSuccess(errorNumber))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Load 24-bits per pixel RGB data from a bitmap. */
cl_int width;
cl_int height;
unsigned char* loadedRGBData = NULL;
if (!loadFromBitmap(filename, &width, &height, &loadedRGBData))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed loading bitmap. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Buffer for the image pixels. */
size_t bufferSizeChar = width * height * sizeof(unsigned char);
/* Buffer for the accumulators*/
size_t bufferSizeLong = sizeof(cl_ulong);
/*
* Ask the OpenCL implementation to allocate buffers for the data.
* We ask the OpenCL implementation to allocate memory rather than allocating
* it on the CPU to avoid having to copy the data later.
* The read/write flags relate to accesses to the memory from within the kernel.
*/
bool createMemoryObjectsSuccess = true;
memoryObjects[imagePixelsIndex] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSizeChar, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[squareIndex] = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, bufferSizeLong, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[sumIndex] = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, bufferSizeLong, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!createMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Map the input memory objects to a host side pointers. */
bool mapMemoryObjectsSuccess = true;
cl_uchar* inputImagePixels = (cl_uchar*)clEnqueueMapBuffer(commandQueue, memoryObjects[imagePixelsIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeChar, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
cl_ulong* inputSquareOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[squareIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
cl_ulong* inputSumOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[sumIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!mapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Mapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/*
* Convert 24-bits per pixel RGB into 8-bits per pixel luminance data
* and fill the array for the kernel.
*/
RGBToLuminance(loadedRGBData, inputImagePixels, width, height);
delete [] loadedRGBData;
/* Ensure the accumulators are initialized to zero. */
*inputSquareOfPixels = 0;
*inputSumOfPixels = 0;
/* Unmap the memory so we can pass it to the kernel. */
bool unmapMemoryObjectsSuccess = true;
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[imagePixelsIndex], inputImagePixels, 0, NULL, NULL));
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[squareIndex], inputSquareOfPixels, 0, NULL, NULL));
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[sumIndex], inputSumOfPixels, 0, NULL, NULL));
if (!unmapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Set the kernel arguments */
bool setKernelArgumentsSuccess = true;
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, imagePixelsIndex, sizeof(cl_mem), &memoryObjects[imagePixelsIndex]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, squareIndex, sizeof(cl_mem), &memoryObjects[squareIndex]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, sumIndex, sizeof(cl_mem), &memoryObjects[sumIndex]));
if (!setKernelArgumentsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* An event to associate with the Kernel. Allows us to retrieve profiling information later. */
cl_event event = 0;
/*
* Each instance of the kernel operates on a 8 * 1 portion of the image.
* Therefore, the global work size must be 1.
*/
size_t globalWorksize[1] = {(width * height) / 8};
int work_dim = 1;
/* Enqueue the kernel */
if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, work_dim, NULL, globalWorksize, NULL, 0, NULL, &event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Wait for kernel execution completion. */
if (!checkSuccess(clFinish(commandQueue)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed waiting for kernel execution to finish. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Print the profiling information for the event. */
printProfilingInfo(event);
/* Release the event object. */
if (!checkSuccess(clReleaseEvent(event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed releasing the event object. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Get pointers to the output data. */
mapMemoryObjectsSuccess = true;
cl_ulong* squareOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[squareIndex], CL_TRUE, CL_MAP_READ, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
cl_ulong* sumOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[sumIndex], CL_TRUE, CL_MAP_READ, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!mapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to map buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Output the results] */
cout << "Square of the pixel values = " << *squareOfPixels << "\n";
cout << "Sum of the pixel values = " << *sumOfPixels << endl;
/* [Output the results] */
/* Unmap the memory object as we are finished using them from the CPU side. */
unmapMemoryObjectsSuccess = true;
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[squareIndex], squareOfPixels, 0, NULL, NULL));
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[sumIndex], sumOfPixels, 0, NULL, NULL));
if (!unmapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Release OpenCL objects. */
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
}