《Mali OpenCL SDK v1.1.0》教程样例之五“64位整数与原子操作”


介绍


  在本教程中,我们将会看到在OpenCL中使用长数据类型。我们也会触及如何和为何要在OpenCL中使用原子变量。在这个过程中,我们介绍了Mali-T600系列GPU支持的64位原子扩展。


长数据类型


  在OpenCL嵌入式配置文档中,64位整数(例如,long,ulong)类型是可选的(包括相应的向量数据类型和操作)。然而,Mali-T600系列GPU实现OpenCL完全简档,支持64位整数类型是必须的。64位整数在Mali-T600系列GPU上被支持,并且全硬件加速。


  长数据类型被用于需要非常大的整数计算的场合。示例用例包括:

 > 定点算术

 > 加密/解密

 > 散列

 > 64位算术


64位原子操作


  这一教程需要为跨内核的累加值进行原子操作。32位整数的原子操作是OpenCL 1.1完全简档核心的一部分,因此被所有完全简档实现所支持(包括Mali-T600系列GPU)。然而,我们需要64位整数的原子操作,这在OpenCL 1.1中是可选扩展(cl_khr_int64_base_atomics)。所有Mali-T600系列GPU实现了64位原子操作的扩展。

注意:在OpenCL嵌入式简档中,32位和64位原子操作都是可选扩展。



实现


  除非另作说明,所有的代码片段都来自"samples\64_bit_integer\64_bit_integer.cl"。


图像尺寸


  我们已经包括了一张512x512大小的输入位图用于这个样例(为保持安装程序的尺寸较小)。然而,在较大的图像使用时,你更有可能看到性能的提升(相比运行在CPU上的C代码)。在使用OpenCL时,有一些相关的启动开销。当输入数据大小比较小时,这些开销可能会大于并行处理带来的好处。


  这个样例已经被编码为允许使用任何输入位图。简单修改"sample\assets"目录下的input.bmp为你选择的输入图像。当使用更大图像时,你将会看到更大的计算性能提升。


64位算术


  一些脸部识别技术,例如Robust实时对象检测(Viola和Jones, 2001)框架需要使用下面的方程计算一个子窗口的变量:

Variance = ((Σ p) / N )2 - 1/N * Σ(p2)

p是像素值,N是像素总的数目。


  对于这个例子,我们仅仅计算像素值的和,平方后的像素的和。我们在一个OpenCL内核中计算这些变量。


  如果我们考虑8 bit每像素,最大的像素值为255。平方这个值(255 * 288 = 65025)适合一个ushort(16位类型,最大值为65535)。我们使用ushort8,因为8 * 16位 = 128位,推荐向量宽度。


  然而,平方的和与像素的和可能会溢出一个short和int型。因此,我们将它们转换成ulong类型,累加所有在向量中的值,直到得到一个可以加到累加器中的一个单独的值(分别是sumOfPixels和squareOfPixels)。

    /* Load 8 pixels (char) and convert them to shorts to calculate the square.*/
    ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
    /* Square of 255 < 2 ^ 16. */
    ushort8 newSquareShort = pixelShort * pixelShort;
    /*
     * Convert original pixel value and the square to longs to sum
     * all the vectors together and add the final values to the
     * respective accumulators.
     */
    ulong8 pixelLong = convert_ulong8(pixelShort);
    ulong8 newSquareLong = convert_ulong8(newSquareShort);
    /*
     * Use vector data type suffixes (.lo and .hi) to get smaller vector types,
     * until we obtain one single value.
     */
    ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
    ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
    ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;
    ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
    ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
    ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;
   如果所有的内核在同一时间访问累加器,内存访问冲突会发生。这会导致竞态条件,数据会被丢失。


  为避免于此,我们使用atom_add来加一个整数值到一个通过指针引用的值。这确保在加法操作期间,没有其他执行在相同设备上的内核读或写那个内存。原子操作也对其它函数存在(例如,乘法,减法,递增,递减)。这意味着这种操作代价非常高昂,因此只在必要时使用它。

    atom_add(sumOfPixels, sumLongPixels3);
    atom_add(squareOfPixels, sumLongSquares3);
   为了使能64位整数的atom_add,我们使用这个pragma到内核代码中:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable


运行样例

  参考样例1。



附录1:内核源码

/*
 * This confidential and proprietary software may be used only as
 * authorised by a licensing agreement from ARM Limited
 *    (C) COPYRIGHT 2013 ARM Limited
 *        ALL RIGHTS RESERVED
 * The entire notice above must be reproduced on all authorised
 * copies and copies may only be made to the extent permitted
 * by a licensing agreement from ARM Limited.
 */

/**
 * \brief  Long data type (64-bit integer) kernel.
 * \details This kernel loads 8 pixel values to calculate the square of each pixel value. Then it accumulates the
 * square of pixels and the sum of pixels values in the respective accumulators.
 * \param[in] imagePixels Input array with image pixels.
 * \param[in] squareOfPixels Sum of the square of pixel values.
 * \param[out] sumOfPixels Sum of pixel values.
 */

/* [Enable atom_add extension] */
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
/* [Enable atom_add extension] */

__kernel void long_vectors(__global uchar* restrict imagePixels,
                           __global ulong* restrict squareOfPixels,
                           __global ulong* restrict sumOfPixels)
{
    /*
     * Set i to be the ID of the kernel instance.
     * If the global work size (set by clEnqueueNDRangeKernel) is n,
     * then n kernels will be run and i will be in the range [0, n - 1].
     */
    int i = get_global_id(0);

    /* [Squares and sums]*/
    /* Load 8 pixels (char) and convert them to shorts to calculate the square.*/
    ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
    /* Square of 255 < 2 ^ 16. */
    ushort8 newSquareShort = pixelShort * pixelShort;

    /*
     * Convert original pixel value and the square to longs to sum
     * all the vectors together and add the final values to the
     * respective accumulators.
     */
    ulong8 pixelLong = convert_ulong8(pixelShort);
    ulong8 newSquareLong = convert_ulong8(newSquareShort);

    /*
     * Use vector data type suffixes (.lo and .hi) to get smaller vector types,
     * until we obtain one single value.
     */
    ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
    ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
    ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;

    ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
    ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
    ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;
    /* [Squares and sums]*/

    /*
     * As all the kernels are accessing sumOfPixels
     * and squareOfPixels at the same time,
     * we use atom_add to ensure only one kernel
     * at a time can access the given variables.
     * This means that this operation is very expensive,
     * so we want to use it only when necessary.
     */
    /* [Atomic transaction] */
    atom_add(sumOfPixels, sumLongPixels3);
    atom_add(squareOfPixels, sumLongSquares3);
    /* [Atomic transaction] */
}

附录2:宿主机源码

/*
 * This confidential and proprietary software may be used only as
 * authorised by a licensing agreement from ARM Limited
 *    (C) COPYRIGHT 2013 ARM Limited
 *        ALL RIGHTS RESERVED
 * The entire notice above must be reproduced on all authorised
 * copies and copies may only be made to the extent permitted
 * by a licensing agreement from ARM Limited.
 */

#include "common.h"
#include "image.h"

#include <CL/cl.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <cstddef>
#include <cmath>

using namespace std;

/**
 * \brief  Long data type (64-bit integer) OpenCL example.
 * \details An example to calculate, for an image:
 *          - the sum of the squares of the pixels values
 *          - sum of the pixels values.
 *          Makes use of the long data type and 64-bit atomics.
 *          The main calculation code is in an OpenCL kernel which is executed on a GPU device.
 * \return The exit code of the application, non-zero if a problem occurred.
 */
int main(void)
{
    string filename = "assets/input.bmp";

    cl_context context = 0;
    cl_command_queue commandQueue = 0;
    cl_program program = 0;
    cl_device_id device = 0;
    cl_kernel kernel = 0;
    const int numberOfMemoryObjects = 3;
    /* Index values for the memory objects. */
    const unsigned int imagePixelsIndex = 0;
    const unsigned int squareIndex = 1;
    const  unsigned int sumIndex = 2;
    cl_mem memoryObjects[3] = {0, 0, 0};
    cl_int errorNumber;


    if (!createContext(&context))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create an OpenCL context. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    if (!createCommandQueue(context, &commandQueue, &device))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create the OpenCL command queue. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Checking 64-bit integer atomics extension support. */
    if (!isExtensionSupported (device, "cl_khr_int64_base_atomics"))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "cl_khr_int64_base_atomics is not supported on this device. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    if (!createProgram(context, device, "assets/64_bit_integer.cl", &program))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL program." << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    kernel = clCreateKernel(program, "long_vectors", &errorNumber);
    if (!checkSuccess(errorNumber))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL kernel. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Load 24-bits per pixel RGB data from a bitmap. */
    cl_int width;
    cl_int height;
    unsigned char* loadedRGBData = NULL;
    if (!loadFromBitmap(filename, &width, &height, &loadedRGBData))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed loading bitmap. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Buffer for the image pixels. */
    size_t bufferSizeChar = width * height * sizeof(unsigned char);
    /* Buffer for the accumulators*/
    size_t bufferSizeLong = sizeof(cl_ulong);

    /*
     * Ask the OpenCL implementation to allocate buffers for the data.
     * We ask the OpenCL implementation to allocate memory rather than allocating
     * it on the CPU to avoid having to copy the data later.
     * The read/write flags relate to accesses to the memory from within the kernel.
     */
    bool createMemoryObjectsSuccess = true;

    memoryObjects[imagePixelsIndex] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSizeChar, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    memoryObjects[squareIndex] = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, bufferSizeLong, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    memoryObjects[sumIndex] = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, bufferSizeLong, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    if (!createMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Map the input memory objects to a host side pointers. */
    bool mapMemoryObjectsSuccess = true;
    cl_uchar* inputImagePixels = (cl_uchar*)clEnqueueMapBuffer(commandQueue, memoryObjects[imagePixelsIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeChar, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    cl_ulong* inputSquareOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[squareIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    cl_ulong* inputSumOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[sumIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

    if (!mapMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Mapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /*
     * Convert 24-bits per pixel RGB into 8-bits per pixel luminance data
     * and fill the array for the kernel.
     */
    RGBToLuminance(loadedRGBData, inputImagePixels, width, height);
    delete [] loadedRGBData;

    /* Ensure the accumulators are initialized to zero. */
    *inputSquareOfPixels = 0;
    *inputSumOfPixels = 0;

    /* Unmap the memory so we can pass it to the kernel. */
    bool unmapMemoryObjectsSuccess = true;
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[imagePixelsIndex], inputImagePixels, 0, NULL, NULL));
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[squareIndex], inputSquareOfPixels, 0, NULL, NULL));
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[sumIndex], inputSumOfPixels, 0, NULL, NULL));

    if (!unmapMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Set the kernel arguments */
    bool setKernelArgumentsSuccess = true;
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, imagePixelsIndex, sizeof(cl_mem), &memoryObjects[imagePixelsIndex]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, squareIndex, sizeof(cl_mem), &memoryObjects[squareIndex]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, sumIndex, sizeof(cl_mem), &memoryObjects[sumIndex]));

    if (!setKernelArgumentsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* An event to associate with the Kernel. Allows us to retrieve profiling information later. */
    cl_event event = 0;

    /*
      * Each instance of the kernel operates on a 8 * 1 portion of the image.
     * Therefore, the global work size must be 1.
     */
    size_t globalWorksize[1] = {(width * height) / 8};
    int work_dim = 1;
    /* Enqueue the kernel */
    if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, work_dim, NULL, globalWorksize, NULL, 0, NULL, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Wait for kernel execution completion. */
    if (!checkSuccess(clFinish(commandQueue)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed waiting for kernel execution to finish. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Print the profiling information for the event. */
    printProfilingInfo(event);
    /* Release the event object. */
    if (!checkSuccess(clReleaseEvent(event)))
    {
       cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
       cerr << "Failed releasing the event object. " << __FILE__ << ":"<< __LINE__ << endl;
       return 1;
    }

    /* Get pointers to the output data. */
    mapMemoryObjectsSuccess = true;
    cl_ulong* squareOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[squareIndex], CL_TRUE, CL_MAP_READ, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

    cl_ulong* sumOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[sumIndex], CL_TRUE, CL_MAP_READ, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    if (!mapMemoryObjectsSuccess)
    {
       cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
       cerr << "Failed to map buffer. " << __FILE__ << ":"<< __LINE__ << endl;
       return 1;
    }

    /* [Output the results] */
    cout << "Square of the pixel values = " <<  *squareOfPixels << "\n";
    cout << "Sum of the pixel values = " <<  *sumOfPixels << endl;
    /* [Output the results] */

    /* Unmap the memory object as we are finished using them from the CPU side. */
    unmapMemoryObjectsSuccess = true;
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[squareIndex], squareOfPixels, 0, NULL, NULL));
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[sumIndex], sumOfPixels, 0, NULL, NULL));

    if (!unmapMemoryObjectsSuccess)
    {
       cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
       cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
       return 1;
    }

    /* Release OpenCL objects. */
    cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
}
























评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值