Differences from OpenCL 1.1 to 1.2

This article will be of interest if you don’t want to read the whole new specification [PDF] for OpenCL 1.2.

As always, feedback will be much appreciated.

After the many meetings of the many members of the OpenCL task force, a lot of ideas sprout, and every 17 or 18 months a new version of OpenCL comes out to give form to those ideas. You see totally new ideas come up, some of which a member has already brought out in its own product. You also see ideas not appearing at all, because other members voted against them. That last category is the most interesting one, and hopefully we’ll soon see plenty of forum discussion about what should be in the next version but is missing now.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration into high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to head for OpenCL 2.0. In a follow-up I will discuss these phases, what you as a user, programmer or customer can expect… and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA product. In another article I will tell you everything there is to know about that. For now, let’s concentrate on the actual software-side differences in this version and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.

New Kernel-functions

The most rudimentary debug-tool, printf, previously needed a vendor-specific extension to be enabled, but now you can flood standard output without it. For those who have not tried printf yet: use a global size of 1000, let the CPU print “ping\n” and the kernels “pong\n”. Then you’ll know exactly why you need to be careful with this function.
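To give an idea, here is a minimal sketch (not from the original article) of the kernel side of that experiment; enqueue it with a global size of 1000 while the host prints “ping\n” in a loop, and compare the output:

// OpenCL C 1.2: every work-item writes a line to standard output.
__kernel void pong(void)
{
    printf("pong %d\n", (int)get_global_id(0));
}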

The function popcount returns the number of one-bits in a variable. So if x is 5 (binary 101), then popcount(x) is 2. A nice explanation of a fast popcount on SSE is here. It counts bits regardless of what the value represents, so it also counts the sign bit.
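A minimal kernel sketch of how popcount can be used (the kernel name and buffers are made up for illustration):

// OpenCL C 1.2 kernel using the new popcount() built-in.
__kernel void count_bits(__global const int *in, __global int *out)
{
    size_t i = get_global_id(0);
    out[i] = popcount(in[i]);   // popcount(5) == 2, popcount(-1) == 32
}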

Replaced functions

The OpenCL group prefers to change the name of functions when the parameter-list changes. Below you’ll find the “new” functions I encountered.

clEnqueueMarker, clEnqueueBarrier and clEnqueueWaitForEvents have been merged into clEnqueueMarkerWithWaitList and clEnqueueBarrierWithWaitList. The barrier and marker functionality is still the same, but if a non-empty wait list is given, the command waits only for those events instead of for everything enqueued before it. This was tricky to program before. A new option is that you can fire an event when all previous events have occurred.
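A host-side sketch of the new call (the helper function and event names are assumptions for illustration):

#include <CL/cl.h>

/* Fire a single event once two earlier transfers have completed, without
 * blocking the rest of the queue. */
cl_int marker_after_transfers(cl_command_queue queue,
                              cl_event transfer_a, cl_event transfer_b,
                              cl_event *marker_done)
{
    cl_event wait_list[2] = { transfer_a, transfer_b };

    /* With a non-empty wait list the marker waits only for these events;
     * pass 0/NULL to get the old clEnqueueMarker behaviour of waiting for
     * everything enqueued before it. */
    return clEnqueueMarkerWithWaitList(queue, 2, wait_list, marker_done);
}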

clCreateImage2D and clCreateImage3D have been merged into clCreateImage, and clCreateFromGLTexture2D and clCreateFromGLTexture3D have been merged into clCreateFromGLTexture. As the functions were comparable and the differences are handled by the new image descriptor (and the parameter texture_target for the GL variant), not much has changed. What is new (and a major reason for merging these functions) is the addition of 1D images and support for image arrays (see below for an explanation of how they work). 1D images were introduced to be compliant with OpenGL’s 1D textures.
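As a sketch of how the merged call looks (the helper name is made up and error handling is left out), creating one of the new 2D image arrays now goes through a cl_image_desc:

#include <CL/cl.h>

cl_mem create_rgba_image_array(cl_context context,
                               size_t width, size_t height, size_t num_images)
{
    cl_image_format format = { CL_RGBA, CL_UNORM_INT8 };

    cl_image_desc desc = { 0 };
    desc.image_type       = CL_MEM_OBJECT_IMAGE2D_ARRAY; /* or IMAGE1D, IMAGE2D, ... */
    desc.image_width      = width;
    desc.image_height     = height;
    desc.image_array_size = num_images;                  /* number of slices */

    cl_int err;
    return clCreateImage(context, CL_MEM_READ_WRITE, &format, &desc, NULL, &err);
}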

The mem-flags CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY and CL_MEM_HOST_NO_ACCESS have been added to describe how the host may access the memory object, where 1.1 only described how the device could access the object and whether the memory was allocated on the device or the host.
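For example (a small sketch with a made-up helper name), a buffer the kernels use as scratch space and that the host never touches can now be declared as such:

#include <CL/cl.h>

cl_mem create_device_only_buffer(cl_context context, size_t bytes)
{
    cl_int err;
    /* Device may read and write; the host promises not to map or read it back. */
    return clCreateBuffer(context,
                          CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                          bytes, NULL, &err);
}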

clUnloadCompiler got renamed to clUnloadPlatformCompiler, and clGetExtensionFunctionAddress to clGetExtensionFunctionAddressForPlatform; both now require a valid platform reference. This seems logical, as clUnloadCompiler probably removed the compilers of all platforms, and the function address was effectively unspecified when multiple platforms were loaded.

DirectX

Besides the fancy 1D images, support for DirectX 9 and 11 textures has also been added. DX9 is an interesting choice, but this way such software can be given a longer life by adding OpenCL to speed it up. I still disagree with the idea that it has official KHR support, as it only works on Microsoft platforms. Under Linux (and all its derivatives like Android) and OSX it is not supported.

The new functions clCreateFromDX9MediaSurfaceKHR, clEnqueueAcquireDX9MediaSurfacesKHR and clEnqueueReleaseDX9MediaSurfacesKHR are comparable to clCreateFromD3D10Texture2DKHR, clEnqueueAcquireD3D10ObjectsKHR and clEnqueueReleaseD3D10ObjectsKHR. clCreateFromD3D11BufferKHR, clCreateFromD3D11Texture2DKHR, clCreateFromD3D11Texture3DKHR, clEnqueueAcquireD3D11ObjectsKHR and clEnqueueReleaseD3D11ObjectsKHR are like their D3D10 counterparts.

Sharing like cl_khr_d3d10_sharing for DX9 and 11 is enabled with cl_khr_dx9_media_sharing and cl_khr_d3d11_sharing. The counterparts of clGetDeviceIDsFromD3D10KHR are clGetDeviceIDsFromD3D11KHR and clGetDeviceIDsFromDX9MediaAdapterKHR.

Multi-user and Multi-device

As OpenCL devices get more powerful, it becomes more likely that a single device will be shared. It also gets more common to have multiple GPUs in a system, and/or several capable devices, now that CPUs get better support.

clEnqueueMigrateMemObjects helps in multi-device setups to move memory objects from one device to another; previously this had to be done by copying via the host.
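A small sketch of the call (the helper name is made up for illustration):

#include <CL/cl.h>

/* Move a buffer to the device that owns destination_queue, ahead of time. */
cl_int migrate_buffer(cl_command_queue destination_queue, cl_mem buffer,
                      cl_event *done)
{
    /* Flags 0 migrates the contents as well; CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED
     * would move only the allocation and skip the data transfer. */
    return clEnqueueMigrateMemObjects(destination_queue, 1, &buffer,
                                      0, 0, NULL, done);
}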

clCreateSubDevices partitions a device into sub-devices. It can be partitioned into equal parts, into specified sizes, or along specific hardware boundaries. The last option splits the device based on, for example, the cache hierarchy, so that the resulting sub-devices share a cache at the given level. The functions clRetainDevice and clReleaseDevice have been altered to handle sub-devices. Previously this was available through the device_fission extension.
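A sketch of equal partitioning (the helper name and the choice of 4 compute units are just examples):

#include <CL/cl.h>

cl_uint split_device(cl_device_id device, cl_device_id *sub_devices,
                     cl_uint max_sub_devices)
{
    /* 4 compute units per sub-device; the property list is terminated with 0. */
    const cl_device_partition_property props[] =
        { CL_DEVICE_PARTITION_EQUALLY, 4, 0 };

    cl_uint num_created = 0;
    clCreateSubDevices(device, props, max_sub_devices, sub_devices, &num_created);
    return num_created;   /* release each sub-device with clReleaseDevice() */
}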

Initialisation of data

clEnqueueFillBuffer and clEnqueueFillImage help with initialising data by filling a buffer or image with a pattern or a colour. Previously this was best done on the host, done with a kernel specially written for it, or just ignored. Our lives have improved.
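For instance (a minimal sketch, helper name made up), zeroing a float buffer on the device is now a one-liner:

#include <CL/cl.h>

cl_int zero_buffer(cl_command_queue queue, cl_mem buffer, size_t bytes)
{
    const cl_float zero = 0.0f;   /* the pattern is repeated to fill the range */
    /* bytes must be a multiple of the pattern size */
    return clEnqueueFillBuffer(queue, buffer, &zero, sizeof(zero),
                               0, bytes, 0, NULL, NULL);
}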

Building

It seems more effort has been put into making sure kernels are better protected. The work of clBuildProgram can now be split between clCompileProgram and clLinkProgram. If I understand correctly, this is comparable to how clCreateProgramWithBinary works, as that one also takes compiled binaries.
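A rough sketch of the two-step build, assuming a program object created from source and leaving error handling out:

#include <CL/cl.h>

cl_program compile_and_link(cl_context context, cl_device_id device,
                            cl_program source_program)
{
    /* Step 1: source -> compiled object (this used to be half of clBuildProgram). */
    clCompileProgram(source_program, 1, &device, "", 0, NULL, NULL, NULL, NULL);

    /* Step 2: link one or more compiled objects into an executable program. */
    cl_int err;
    return clLinkProgram(context, 1, &device, "", 1, &source_program,
                         NULL, NULL, &err);
}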

clGetProgramInfo and clGetProgramBuildInfo have been extended to get information on how the program has been built. The new function clGetKernelArgInfo returns information about the arguments of a kernel. This is useful when the building of the kernels is separated from the rest of the program, as is the case when binaries are used.
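A quick sketch of such a query (the helper name is made up; note that the program must be built with -cl-kernel-arg-info for this information to be recorded):

#include <CL/cl.h>
#include <stdio.h>

void print_first_arg_type(cl_kernel kernel)
{
    char type_name[64];
    if (clGetKernelArgInfo(kernel, 0, CL_KERNEL_ARG_TYPE_NAME,
                           sizeof(type_name), type_name, NULL) == CL_SUCCESS)
        printf("arg 0 has type %s\n", type_name);
}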

Image arrays

An array of 1D or 2D images can be written with write_image{f|i|ui|h}. The image index is given in the y-coordinate (for 1D arrays) or the z-coordinate (for 2D arrays). With read_image{f|i|ui|h} you specify the coordinates plus the image index: an int2 for 1D arrays and an int4 for 2D arrays (the fourth component is ignored).

The kernel-function get_image_array_size returns the number of images in an array. It remains the responsibility of your software to keep track of which image is which, as there is no list of image numbers to query.
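A kernel sketch to make this concrete (the kernel name and arguments are made up): it copies one slice of a 2D image array, with the slice selected via the z-component of the coordinate.

// OpenCL C 1.2: copy one slice of a 2D image array into another array.
__kernel void copy_slice(__read_only image2d_array_t src,
                         __write_only image2d_array_t dst,
                         int slice)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int4 coord = (int4)((int)get_global_id(0), (int)get_global_id(1), slice, 0);
    float4 pixel = read_imagef(src, smp, coord);
    write_imagef(dst, coord, pixel);
    // get_image_array_size(src) returns the number of slices in src.
}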

Other

Macros CL_VERSION_1_2 and __OPENCL_C_VERSION__ have been added. The first indicates that version 1.2 is supported; the latter gives 120 when compiling for version 1.2.
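Inside a .cl file you can use this, for example, to guard 1.2-only features (a sketch; the DEBUG_PRINT macro name is my own):

// Only use printf when the compiler actually targets OpenCL C 1.2 or newer.
#if __OPENCL_C_VERSION__ >= 120
    #define DEBUG_PRINT(msg) printf(msg)
#else
    #define DEBUG_PRINT(msg)
#endif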

Double-precision is now an optional core feature instead of an extension. This means you still need to check whether the device supports it, but you no longer need to enable it with a pragma.

CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE has been deprecated. It gives the smallest alignment in bytes which can be used for any data type. It is quite comparable to CL_DEVICE_MEM_BASE_ADDR_ALIGN. This could help select the best device for an alignment-optimised kernel, but is rarely used.

A new flag CL_MAP_WRITE_INVALIDATE_REGION has been added to cl_map_flags. It is comparable to CL_MAP_WRITE, but the existing contents of the mapped region are not guaranteed to be preserved, so the implementation does not need to copy the current data to the host first.
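A sketch of the intended usage (helper name made up): map, overwrite the whole region, unmap.

#include <CL/cl.h>
#include <string.h>

void overwrite_buffer(cl_command_queue queue, cl_mem buffer, size_t bytes,
                      const void *new_data)
{
    cl_int err;
    /* We promise to write the entire region, so the runtime can skip
     * transferring the old contents to the host. */
    void *ptr = clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                   CL_MAP_WRITE_INVALIDATE_REGION,
                                   0, bytes, 0, NULL, NULL, &err);
    if (err == CL_SUCCESS) {
        memcpy(ptr, new_data, bytes);
        clEnqueueUnmapMemObject(queue, buffer, ptr, 0, NULL, NULL);
    }
}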

Storage-class specifiers extern and static are now supported. A storage class determines the scope and lifetime of a variable (C definition here).

Video

Tim Mattson of Intel explains some of the highlights of OpenCL 1.2 in this 12-minute video.


Vincent Hindriksen

Mr. Vincent Hindriksen M.Sc is the founder of StreamComputing, a collective of performance engineering experts.

Do you need expertise in performance engineering? We have several experts available (HPC, GPGPU, OpenCL, CUDA, MPI, OpenMP) and solve any kind of performance problem. Contact me via phone or mail to discuss further: +31 854865760 or vincent@streamcomputing.eu
