《OpenCL异构计算》学习笔记

最新推荐文章于 2024-05-20 23:08:52 发布

GayKoo

最新推荐文章于 2024-05-20 23:08:52 发布

阅读量672

点赞数 1

文章标签： opencl

本文链接：https://blog.csdn.net/GayKoo/article/details/117307311

版权

引言

以下仅为书中一些基本概念，具体应参考官方specification文档、C++ binding文档和api参考文档。以下OpenCL基于OpenCL 1.1

OpenCL简介

kernel和OpenCL执行模型

kernel是OpenCL程序在设备上实际执行的代码，每个kernel称为work-item，多个work-item组成一个work-group
kernel执行期间，需指定一个n维索引空间（NDRange）来声明需要新建多少个work-item
NDRange是work-item的索引空间，被定义为数据类型为size_t，长度为N的数组，N代表所描述的work-item的维度

//work-item维度为2维，有1024个work-item
size_t SpaceSize[2]={1024,1};

一个work-group包含多个work-item，work-group维度与NDRange维度一致，NDRange元素长度要能被work-group长度整除

//一个work-group有64个work-item 总共分为16个work-group(1024/64)
size_t WorkGroupSize[2]={64,1};

每个work-item都是独立运行的，即使在同一个work-group内

平台和设备

主机-设备交互

平台模型将一个设备定义一系列计算单元（功能独立），每个计算单元中进一步划分为处理部件
clGetPlatformIDs()用来获取指定系统上的可用计算平台，在应用程序中应当调用两次，从而获取平台相关信息
clGetDeviceIDs()获取平台上支持的设备，如GPU、CPU和其他等要参考OpenCL 1.1 参考文档，与clGetPlatformIDs()类似调用方法

执行环境

上下文

上下文（context）存在于主机端，协调主机-设备的交互机制，管理设备上的可用内存对象，跟踪针对每个设备新建的kernel和程序
clCreateContext()用于新建上下文。其参数properties用于限定上下文范围，将上下文局限于某个平台可令开发者为多个平台提供上下文。开发者必须提供与context关联的设备数量和设备ID，也可设置用户回调函数

命令队列

一旦主机端指定运行kernel的设备且context已经新建，则每个设备必须新建命令队列才可运行，每个命令队列仅关联一个设备
主机端需要对设备进行操作，就需要把命令提交到设备对应的命令队列上
clCreateCommandQueue()用于新建命令队列并且关联到某个设备，支持乱序执行命令
所有指定主机与设备交互的api，会以’clEnqueue’开头，且需要一个命令队列作为参数

事件

任何操作作为命令入队到命令队列中，都会产生事件
作用：表示依赖；提供程序剖析机制（profiling）

内存对象

新建的内存对象，只在一个上下文中有效，运行时根据数据依赖关系的需要管理特定设备的数据传输

buffer

类似数组，由malloc()新建，buffer中数据在内存中连续存储
clCreateBuffer()分配缓冲区，返回内存对象；clEnqueueWriteBuffer()将主机端内存的数据传到OpenCL buffer中；clEnqueueReadBuffer()将OpenCL buffer数据传回主机端内存
如果在独立的加速器设备（如GPU）上执行的kernel依赖某个buffer，该buffer必须传输到设备上
buffer链接的是context，而不是设备，应当在OpenCL运行时决定数据移动

image

不透明对象，利于进行数据填充和其他可能的优化
未必所有OpenCL设备支持image，应当先使用clGetDeviceInfo()查看设备是否支持image对象
image不能像数组直接引用，相邻元素不保证放在连续内存中
image元素使用cl_image_format表示
通过clCreateImage2D()或clCreateImage3D()新建OpenCL image
读取image数据时，要用sampler对象；数据写入image时需强制转换为对应格式

flush命令和finish命令

clFinish()函数，阻塞直到命令队列中的所有命令完成
clFlush()函数，阻塞直到命令队列中的所有命令被移出队列，命令准备就绪但不保证命令已经执行完毕

新建一个OpenCL程序对象

OpenCL C代码被称为Program，是kernel函数的集合，kernel是被调度到设备上的执行单位
通过clCreateProgramWithSource()（较方便，直接读取源代码文件）或clCreateProgramWithBinary()使OpenCL 源代码转换为cl_program对象
通过clBuildProgram()编译对应Program，如果编译错误，会报告错误信息

OpenCL的kernel

从cl_program中抽取kernel，kernel的名称和program对象传递给clCreateKernel()，如果对象有效且kernel被发现，会返回kernel对象
kernel执行前，需先用clSetKernelArg()指定kernel的参数
通过clEnqueueNDRangeKernel()执行kernel，该函数的参数需根据参考文档认真研究

内存模型

全局内存，对设备上所有计算单位可见，主机->设备和设备->主机的数据驻留在全局内存上。__global关键字定义的数据，驻留在全局内存上
常量内存，方便所有work-item同时访问常量数据，全局内存的一部分。__constant关键字定义的数据，驻留在常量内存上
本地内存，供work-group共享。比全局内存访问延迟短，带宽更高。__local关键字定义的数据，驻留在本地内存上
私有内存，对单个work-item可见，局部变量和非指针类型的kernel参数默认为私有

编写kernel

函数以__kernel开始，返回值必须为void，必须指定指针所指向的地址空间（全局、常量和本地）
同一个work-group中的work-item多次使用的数据缓存到本地内存（__local）上可以提高性能。一旦work-item执行完成，其本地内存上数据是临时的，要保存结果则需把结果传回全局内存

本章向量相加程序源码

//kernel file code
__kernel void vecadd(__global int* A, __global int* B, __global int* C)
{
	int idx = get_global_id(0); //get the unique ID of the work-item
	C[idx] = A[idx] + B[idx];
}

//main file code
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main()
{
	int *A = NULL;  //input
	int *B = NULL;  //input
	int *C = NULL;  //output
	const int elements = 2048; //elements in each array
	
	size_t datasize = sizeof(int)*elements; //size of data
	//allocate space for arrays
	A = (int*)malloc(datasize);
	B = (int*)malloc(datasize);
	C = (int*)malloc(datasize);
	for(int i=0;i<elements;i++){ //initialize the input
		A[i] = i;
		B[i] = i;
	}
	
	cl_int status; //for error checking but not checked in this example
	//step 1
	//discover and initialize platforms
	cl_uint numPlatforms=0;
	cl_platform_id *platforms=NULL;
	status = clGetPlatformIDs(0,NULL,&numPlatforms); //get the number of platforms
	plarforms = (cl_platform_id*)malloc(numPlatforms*sizeof(cl_platform_id)); //allocate space for each platform
	status = clGetPlatformIDs(numPlatforms,platforms,NULL); //fill in platforms

	//step 2
	//discover and initialize devices
	cl_uint numDevices = 0;
	cl_device_id *devices = NULL;
	status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices); //get the number of all devices
	devices = (cl_device_id*)malloc(numDevices*sizeof(cl_device_id)); //allocate space for each device
	status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL); //fill in devices

	//step 3
	//create context
	cl_context context=NULL;
	context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status); //create a context and associate it with the devices

	//step 4
	//create a command queue
	cl_command_queue cmdQueue;
	cmdQueue = clCreateCommandQueue(context, devices[0], 0, &status); //create a command queue and associate it with the device you want to execute on
}

	//step 5
	//create device buffers
	cl_mem bufferA;
	cl_mem bufferB;
	cl_mem bufferC;

	bufferA = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status); //create a buffer object that will contain the data from the host array A
	bufferB = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status); 
	bufferC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status); 

	//step 6
	//write host data to device buffers
	status = clEnqueueWriteBuffer(cmdQueue, bufferA, CL_FALSE, 0, datasize, A, 0, NULL, NULL);
	status = clEnqueueWriteBuffer(cmdQueue, bufferB, CL_FALSE, 0, datasize, B, 0, NULL, NULL);

	//step 7
	//create and compile the program
	cl_program program = clCreateProgramWithSource(context, 1, 'kernel file path', NULL, &status); //create the program from the kernel file
	status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);

	//step 8
	//create the kernel
	cl_kernel kernel = NULL;
	kernel = clCreateKernel(program, "vecadd", &status); //create the kernel from the function named "vecadd"

	//step 9
	//set the kernel arguments
	status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferA); //associate the buffers with the kernel
	status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferB);
	status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufferC);

	//step 10
	//configure the work-item structure
	size_t globalWorkSize[1];
	globalWorkSize[0] = elements; //number of work-item

	//step 11
	//enqueue the kernel execution
	status = clEnqueueNDRangeKernel(cmnQueue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL); //globalWorkSize is the 1D dimension of the work-item

	//step 12
	//read the output buffer back to the host
	clEnqueueReadBuffer(cmdQueue, bufferC, CL_TRUE, 0, datasize, C, 0, NULL, NULL);
	bool result = true;
	for(int i=0;i<elements;i++){
		if(C[i] != i+i){
			result = false;
			break;
		}
	}
	if(result) printf("correct\n");
	else printf("incorrect\n");

	//step 13
	//release the OpenCL Resources
	clReleaseKernel(kernel);
	clReleaseProgram(program);
	clReleaseCommandQueue(cmdQueue);
	clReleaseMemObject(bufferA);
	clReleaseMemObject(bufferB);
	clReleaseMemObject(bufferC);
	clReleaseContext(context);

	//free host resources
	free(A);
	free(B);
	free(C);
	free(platforms);
	free(devices);
}