《Mali OpenCL SDK v1.1.0》教程样例之五“64位整数与原子操作”

最新推荐文章于 2022-12-02 10:43:28 发布

Leon Lu

最新推荐文章于 2022-12-02 10:43:28 发布

阅读量2.9k

点赞数

分类专栏： OpenCL 文章标签： opencl

本文链接：https://blog.csdn.net/cloud_desktop/article/details/20471963

版权

OpenCL 专栏收录该内容

11 篇文章 35 订阅

订阅专栏

介绍

　　在本教程中，我们将会看到在OpenCL中使用长数据类型。我们也会触及如何和为何要在OpenCL中使用原子变量。在这个过程中，我们介绍了Mali-T600系列GPU支持的64位原子扩展。

长数据类型

　　在OpenCL嵌入式配置文档中，64位整数(例如，long，ulong)类型是可选的(包括相应的向量数据类型和操作)。然而，Mali-T600系列GPU实现OpenCL完全简档，支持64位整数类型是必须的。64位整数在Mali-T600系列GPU上被支持，并且全硬件加速。

　　长数据类型被用于需要非常大的整数计算的场合。示例用例包括：

　> 定点算术

　> 加密/解密

　> 散列

　> 64位算术

64位原子操作

　　这一教程需要为跨内核的累加值进行原子操作。32位整数的原子操作是OpenCL 1.1完全简档核心的一部分，因此被所有完全简档实现所支持(包括Mali-T600系列GPU)。然而，我们需要64位整数的原子操作，这在OpenCL 1.1中是可选扩展(cl_khr_int64_base_atomics)。所有Mali-T600系列GPU实现了64位原子操作的扩展。

注意：在OpenCL嵌入式简档中，32位和64位原子操作都是可选扩展。

实现

　　除非另作说明，所有的代码片段都来自"samples\64_bit_integer\64_bit_integer.cl"。

图像尺寸

　　我们已经包括了一张512x512大小的输入位图用于这个样例(为保持安装程序的尺寸较小)。然而，在较大的图像使用时，你更有可能看到性能的提升(相比运行在CPU上的C代码)。在使用OpenCL时，有一些相关的启动开销。当输入数据大小比较小时，这些开销可能会大于并行处理带来的好处。

　　这个样例已经被编码为允许使用任何输入位图。简单修改"sample\assets"目录下的input.bmp为你选择的输入图像。当使用更大图像时，你将会看到更大的计算性能提升。

64位算术

　　一些脸部识别技术，例如Robust实时对象检测(Viola和Jones, 2001)框架需要使用下面的方程计算一个子窗口的变量：

Variance = ((Σ p) / N )² - 1/N * Σ(p²)

p是像素值，N是像素总的数目。

　　对于这个例子，我们仅仅计算像素值的和，平方后的像素的和。我们在一个OpenCL内核中计算这些变量。

　　如果我们考虑8 bit每像素，最大的像素值为255。平方这个值(255 * 288 = 65025)适合一个ushort(16位类型，最大值为65535)。我们使用ushort8，因为8 * 16位 = 128位，推荐向量宽度。

　　然而，平方的和与像素的和可能会溢出一个short和int型。因此，我们将它们转换成ulong类型，累加所有在向量中的值，直到得到一个可以加到累加器中的一个单独的值(分别是sumOfPixels和squareOfPixels)。

    /* Load 8 pixels (char) and convert them to shorts to calculate the square.*/
    ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
    /* Square of 255 < 2 ^ 16. */
    ushort8 newSquareShort = pixelShort * pixelShort;
    /*
     * Convert original pixel value and the square to longs to sum
     * all the vectors together and add the final values to the
     * respective accumulators.
     */
    ulong8 pixelLong = convert_ulong8(pixelShort);
    ulong8 newSquareLong = convert_ulong8(newSquareShort);
    /*
     * Use vector data type suffixes (.lo and .hi) to get smaller vector types,
     * until we obtain one single value.
     */
    ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
    ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
    ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;
    ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
    ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
    ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;

　　如果所有的内核在同一时间访问累加器，内存访问冲突会发生。这会导致竞态条件，数据会被丢失。

　　为避免于此，我们使用atom_add来加一个整数值到一个通过指针引用的值。这确保在加法操作期间，没有其他执行在相同设备上的内核读或写那个内存。原子操作也对其它函数存在(例如，乘法，减法，递增，递减)。这意味着这种操作代价非常高昂，因此只在必要时使用它。

    atom_add(sumOfPixels, sumLongPixels3);
    atom_add(squareOfPixels, sumLongSquares3);

　　为了使能64位整数的atom_add，我们使用这个pragma到内核代码中：

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

运行样例

　　参考样例1。

附录1：内核源码

/*
 * This confidential and proprietary software may be used only as
 * authorised by a licensing agreement from ARM Limited
 *    (C) COPYRIGHT 2013 ARM Limited
 *        ALL RIGHTS RESERVED
 * The entire notice above must be reproduced on all authorised
 * copies and copies may only be made to the extent permitted
 * by a licensing agreement from ARM Limited.
 */

/**
 * \brief  Long data type (64-bit integer) kernel.
 * \details This kernel loads 8 pixel values to calculate the square of each pixel value. Then it accumulates the
 * square of pixels and the sum of pixels values in the respective accumulators.
 * \param[in] imagePixels Input array with image pixels.
 * \param[in] squareOfPixels Sum of the square of pixel values.
 * \param[out] sumOfPixels Sum of pixel values.
 */

/* [Enable atom_add extension] */
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
/* [Enable atom_add extension] */

__kernel void long_vectors(__global uchar* restrict imagePixels,
                           __global ulong* restrict squareOfPixels,
                           __global ulong* restrict sumOfPixels)
{
    /*
     * Set i to be the ID of the kernel instance.
     * If the global work size (set by clEnqueueNDRangeKernel) is n,
     * then n kernels will be run and i will be in the range [0, n - 1].
     */
    int i = get_global_id(0);

    /* [Squares and sums]*/
    /* Load 8 pixels (char) and convert them to shorts to calculate the square.*/
    ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
    /* Square of 255 < 2 ^ 16. */
    ushort8 newSquareShort = pixelShort * pixelShort;

    /*
     * Convert original pixel value and the square to longs to sum
     * all the vectors together and add the final values to the
     * respective accumulators.
     */
    ulong8 pixelLong = convert_ulong8(pixelShort);
    ulong8 newSquareLong = convert_ulong8(newSquareShort);

    /*
     * Use vector data type suffixes (.lo and .hi) to get smaller vector types,
     * until we obtain one single value.
     */
    ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
    ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
    ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;

    ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
    ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
    ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;
    /* [Squares and sums]*/

    /*
     * As all the kernels are accessing sumOfPixels
     * and squareOfPixels at the same time,
     * we use atom_add to ensure only one kernel
     * at a time can access the given variables.
     * This means that this operation is very expensive,
     * so we want to use it only when necessary.
     */
    /* [Atomic transaction] */
    atom_add(sumOfPixels, sumLongPixels3);
    atom_add(squareOfPixels, sumLongSquares3);
    /* [Atomic transaction] */
}

附录2：宿主机源码

/*
 * This confidential and proprietary software may be used only as
 * authorised by a licensing agreement from ARM Limited
 *    (C) COPYRIGHT 2013 ARM Limited
 *        ALL RIGHTS RESERVED
 * The entire notice above must be reproduced on all authorised
 * copies and copies may only be made to the extent permitted
 * by a licensing agreement from ARM Limited.
 */

#include "common.h"
#include "image.h"

#include <CL/cl.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <cstddef>
#include <cmath>

using namespace std;

/**
 * \brief  Long data type (64-bit integer) OpenCL example.
 * \details An example to calculate, for an image:
 *          - the sum of the squares of the pixels values
 *          - sum of the pixels values.
 *          Makes use of the long data type and 64-bit atomics.
 *          The main calculation code is in an OpenCL kernel which is executed on a GPU device.
 * \return The exit code of the application, non-zero if a problem occurred.
 */
int main(void)
{
    string filename = "assets/input.bmp";

    cl_context context = 0;
    cl_command_queue commandQueue = 0;
    cl_program program = 0;
    cl_device_id device = 0;
    cl_kernel kernel = 0;
    const int numberOfMemoryObjects = 3;
    /* Index values for the memory objects. */
    const unsigned int imagePixelsIndex = 0;
    const unsigned int squareIndex = 1;
    const  unsigned int sumIndex = 2;
    cl_mem memoryObjects[3] = {0, 0, 0};
    cl_int errorNumber;


    if (!createContext(&context))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create an OpenCL context. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    if (!createCommandQueue(context, &commandQueue, &device))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create the OpenCL command queue. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Checking 64-bit integer atomics extension support. */
    if (!isExtensionSupported (device, "cl_khr_int64_base_atomics"))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "cl_khr_int64_base_atomics is not supported on this device. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    if (!createProgram(context, device, "assets/64_bit_integer.cl", &program))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL program." << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    kernel = clCreateKernel(program, "long_vectors", &errorNumber);
    if (!checkSuccess(errorNumber))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL kernel. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Load 24-bits per pixel RGB data from a bitmap. */
    cl_int width;
    cl_int height;
    unsigned char* loadedRGBData = NULL;
    if (!loadFromBitmap(filename, &width, &height, &loadedRGBData))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed loading bitmap. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Buffer for the image pixels. */
    size_t bufferSizeChar = width * height * sizeof(unsigned char);
    /* Buffer for the accumulators*/
    size_t bufferSizeLong = sizeof(cl_ulong);

    /*
     * Ask the OpenCL implementation to allocate buffers for the data.
     * We ask the OpenCL implementation to allocate memory rather than allocating
     * it on the CPU to avoid having to copy the data later.
     * The read/write flags relate to accesses to the memory from within the kernel.
     */
    bool createMemoryObjectsSuccess = true;

    memoryObjects[imagePixelsIndex] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSizeChar, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    memoryObjects[squareIndex] = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, bufferSizeLong, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    memoryObjects[sumIndex] = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, bufferSizeLong, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    if (!createMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Map the input memory objects to a host side pointers. */
    bool mapMemoryObjectsSuccess = true;
    cl_uchar* inputImagePixels = (cl_uchar*)clEnqueueMapBuffer(commandQueue, memoryObjects[imagePixelsIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeChar, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    cl_ulong* inputSquareOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[squareIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    cl_ulong* inputSumOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[sumIndex], CL_TRUE, CL_MAP_WRITE, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

    if (!mapMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Mapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /*
     * Convert 24-bits per pixel RGB into 8-bits per pixel luminance data
     * and fill the array for the kernel.
     */
    RGBToLuminance(loadedRGBData, inputImagePixels, width, height);
    delete [] loadedRGBData;

    /* Ensure the accumulators are initialized to zero. */
    *inputSquareOfPixels = 0;
    *inputSumOfPixels = 0;

    /* Unmap the memory so we can pass it to the kernel. */
    bool unmapMemoryObjectsSuccess = true;
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[imagePixelsIndex], inputImagePixels, 0, NULL, NULL));
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[squareIndex], inputSquareOfPixels, 0, NULL, NULL));
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[sumIndex], inputSumOfPixels, 0, NULL, NULL));

    if (!unmapMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Set the kernel arguments */
    bool setKernelArgumentsSuccess = true;
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, imagePixelsIndex, sizeof(cl_mem), &memoryObjects[imagePixelsIndex]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, squareIndex, sizeof(cl_mem), &memoryObjects[squareIndex]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, sumIndex, sizeof(cl_mem), &memoryObjects[sumIndex]));

    if (!setKernelArgumentsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* An event to associate with the Kernel. Allows us to retrieve profiling information later. */
    cl_event event = 0;

    /*
      * Each instance of the kernel operates on a 8 * 1 portion of the image.
     * Therefore, the global work size must be 1.
     */
    size_t globalWorksize[1] = {(width * height) / 8};
    int work_dim = 1;
    /* Enqueue the kernel */
    if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, work_dim, NULL, globalWorksize, NULL, 0, NULL, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Wait for kernel execution completion. */
    if (!checkSuccess(clFinish(commandQueue)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed waiting for kernel execution to finish. " << __FILE__ << ":"<< __LINE__ << endl;
        return 1;
    }

    /* Print the profiling information for the event. */
    printProfilingInfo(event);
    /* Release the event object. */
    if (!checkSuccess(clReleaseEvent(event)))
    {
       cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
       cerr << "Failed releasing the event object. " << __FILE__ << ":"<< __LINE__ << endl;
       return 1;
    }

    /* Get pointers to the output data. */
    mapMemoryObjectsSuccess = true;
    cl_ulong* squareOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[squareIndex], CL_TRUE, CL_MAP_READ, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

    cl_ulong* sumOfPixels = (cl_ulong*)clEnqueueMapBuffer(commandQueue, memoryObjects[sumIndex], CL_TRUE, CL_MAP_READ, 0, bufferSizeLong, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
    if (!mapMemoryObjectsSuccess)
    {
       cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
       cerr << "Failed to map buffer. " << __FILE__ << ":"<< __LINE__ << endl;
       return 1;
    }

    /* [Output the results] */
    cout << "Square of the pixel values = " <<  *squareOfPixels << "\n";
    cout << "Sum of the pixel values = " <<  *sumOfPixels << endl;
    /* [Output the results] */

    /* Unmap the memory object as we are finished using them from the CPU side. */
    unmapMemoryObjectsSuccess = true;
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[squareIndex], squareOfPixels, 0, NULL, NULL));
    unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[sumIndex], sumOfPixels, 0, NULL, NULL));

    if (!unmapMemoryObjectsSuccess)
    {
       cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
       cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
       return 1;
    }

    /* Release OpenCL objects. */
    cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
}