Getting started with OpenCL and GPU Computing

OpenCL (Open Computing Language) is a framework for writing programs that execute in parallel on different compute devices (such as CPUs and GPUs) from different vendors (AMD/ATI, Intel, Nvidia, etc.). The framework defines a language for writing “kernels”: the functions that run on the compute devices. In this post I explain how to get started with OpenCL and how to write a small OpenCL program that computes the sum of two lists in parallel.

Installing and setting up OpenCL on your computer

First of all, you need to download the newest drivers for your graphics card. This is important because OpenCL will not work without drivers that support it.

To install OpenCL you need to download an implementation of OpenCL. The major GPU vendors, Nvidia and AMD/ATI, have both released implementations of OpenCL for their GPUs. These implementations come as software development kits (SDKs) and often include useful tools such as a visual profiler. The next step is to download and install the SDK for the GPU in your computer. Note that not all graphics cards are supported; a list of supported cards can be found on each vendor's website.

For AMD/ATI GPUs download the AMD APP SDK (formerly known as AMD Stream SDK)
For Nvidia GPUs download the CUDA Toolkit

The installation steps differ for each SDK and the OS you are running. Follow the installation manual of the SDK carefully. Personally, I use Ubuntu Linux with an AMD 7970 graphics card. Below are the installation steps for this specific setup.

Installing OpenCL on Ubuntu Linux with AMD graphics card

To install the latest AMD drivers on Ubuntu 12.04, open Additional Drivers and install/activate the one called “ATI/AMD proprietary FGLRX graphics driver (post-release updates)”.
After that is done, restart, then download and extract the AMD APP SDK.

AMD APP SDK 2.8 includes an installer. Run this with the command:

sudo sh Install-AMD-APP.sh
Next, install the OpenCL header files:

sudo apt-get install opencl-headers
And you're done! Note that the AMD APP SDK and its samples are located in /opt/AMDAPP.

Installing OpenCL on Ubuntu Linux with NVIDIA graphics card

Download the CUDA Toolkit for Ubuntu from NVIDIA's CUDA site. Open a terminal and run the installation file with the command:

sudo sh cudatoolkit_3.1_linux_64_ubuntu9.10.run
Download the Developer Drivers for Linux from the same website and install them by first stopping X, running the file, and then starting X again. To stop X use:

sudo /etc/init.d/gdm stop
Then open a terminal by pressing CTRL+ALT+F5, log in, navigate to where you downloaded the driver, and type:

sudo sh devdriver_3.1_linux_64_256.40.run
After the driver has been installed, start X again by typing:

startx
Before compiling an OpenCL application you need to add the CUDA lib folder to LD_LIBRARY_PATH like so:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Your first OpenCL program – Vector addition

To demonstrate OpenCL, I will explain how to perform the simple task of vector addition. Suppose we have two lists of numbers, A and B, of equal size. Vector addition adds each element of A to the corresponding element of B and puts the result in the corresponding element of a new list C of the same size. The figure below illustrates the operation.

Two lists A and B and the result list C of vector addition on A and B

The naive way to perform this operation is to loop through the lists and process one element at a time, as in the C code below:

for(int i = 0; i < LIST_SIZE; i++) {
    C[i] = A[i] + B[i];
}
This algorithm is simple but has linear time complexity, O(n), where n is the size of the list. But since each iteration of the loop is independent of the others, the operation is data parallel, meaning that all iterations can be computed simultaneously. So if we have n cores on a processor, the operation can in principle be performed in constant time, O(1).

To make OpenCL perform this operation in parallel, we need to write a kernel: the function that will run on the compute device.

The kernel

The kernel is written in the OpenCL C language, which is a subset of C with many built-in math and vector functions. The kernel that performs the vector addition operation is defined below.

__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) {
 
    // Get the index of the current element to be processed
    int i = get_global_id(0);
 
    // Do the operation
    C[i] = A[i] + B[i];
}


The host program

The host program controls the execution of kernels on the compute devices. The host program is written in C, but bindings for other languages such as C++ and Python exist. The OpenCL API is defined in the cl.h header file (opencl.h on Apple platforms). Below is the code for the host program that executes the kernel above on a compute device. I will not go into detail on each step, as this is meant to be an introductory article, but I can recommend the book “The OpenCL Programming Book” if you want to dive into the details. The main steps of a host program are as follows:

  • Get information about the platforms and devices available on the computer
  • Select the device(s) to use for execution
  • Create an OpenCL context
  • Create a command queue
  • Create memory buffer objects
  • Transfer data (lists A and B) to the memory buffers on the device
  • Create a program object
  • Load the kernel source code and compile it (online compilation), or load a precompiled binary OpenCL program (offline compilation)
  • Create a kernel object
  • Set the kernel arguments
  • Execute the kernel
  • Read the memory objects; in this case, read the list C back from the compute device

#include <stdio.h>
#include <stdlib.h>
 
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
 
#define MAX_SOURCE_SIZE (0x100000)
 
int main(void) {
    // Create the two input vectors
    int i;
    const int LIST_SIZE = 1024;
    int *A = (int*)malloc(sizeof(int)*LIST_SIZE);
    int *B = (int*)malloc(sizeof(int)*LIST_SIZE);
    for(i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }
 
    // Load the kernel source code into the array source_str
    FILE *fp;
    char *source_str;
    size_t source_size;
 
    fp = fopen("vector_add_kernel.cl", "r");
    if (!fp) {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    source_str = (char*)malloc(MAX_SOURCE_SIZE);
    source_size = fread( source_str, 1, MAX_SOURCE_SIZE, fp);
    fclose( fp );
 
    // Get platform and device information
    cl_platform_id platform_id = NULL;
    cl_device_id device_id = NULL;   
    cl_uint ret_num_devices;
    cl_uint ret_num_platforms;
    cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_DEFAULT, 1, 
            &device_id, &ret_num_devices);
 
    // Create an OpenCL context
    cl_context context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret);
 
    // Create a command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
 
    // Create memory buffers on the device for each vector 
    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            LIST_SIZE * sizeof(int), NULL, &ret);
    cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int), NULL, &ret);
    cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, 
            LIST_SIZE * sizeof(int), NULL, &ret);
 
    // Copy the lists A and B to their respective memory buffers
    ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
            LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0, 
            LIST_SIZE * sizeof(int), B, 0, NULL, NULL);
 
    // Create a program from the kernel source
    cl_program program = clCreateProgramWithSource(context, 1, 
            (const char **)&source_str, (const size_t *)&source_size, &ret);
 
    // Build the program
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
 
    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);
 
    // Set the arguments of the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);
 
    // Execute the OpenCL kernel on the list
    size_t global_item_size = LIST_SIZE; // Process the entire lists
    size_t local_item_size = 64; // Divide work items into groups of 64
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, 
            &global_item_size, &local_item_size, 0, NULL, NULL);
 
    // Read the memory buffer C on the device to the local variable C
    int *C = (int*)malloc(sizeof(int)*LIST_SIZE);
    ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0, 
            LIST_SIZE * sizeof(int), C, 0, NULL, NULL);
 
    // Display the result to the screen
    for(i = 0; i < LIST_SIZE; i++)
        printf("%d + %d = %d\n", A[i], B[i], C[i]);
 
    // Clean up
    ret = clFlush(command_queue);
    ret = clFinish(command_queue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(a_mem_obj);
    ret = clReleaseMemObject(b_mem_obj);
    ret = clReleaseMemObject(c_mem_obj);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);
    free(A);
    free(B);
    free(C);
    return 0;
}

To make OpenCL run the kernel on the GPU, change the constant CL_DEVICE_TYPE_DEFAULT to CL_DEVICE_TYPE_GPU in the clGetDeviceIDs call. To run on the CPU, set it to CL_DEVICE_TYPE_CPU. This shows how easy OpenCL makes it to run the same program on different compute devices.

The source code for this example can be downloaded here.

Compiling an OpenCL program

If the OpenCL header and library files are located in their standard folders (/usr/include and /usr/lib), the following command will compile the vectorAddition program:

gcc main.c -o vectorAddition -lOpenCL
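If the SDK put its headers and libraries somewhere else, you can point the compiler at them explicitly. The /opt/AMDAPP paths below match the AMD APP SDK default mentioned earlier but are otherwise assumptions; adjust them for your SDK:

```shell
# Assumed AMD APP SDK locations (see /opt/AMDAPP above); adjust as needed.
export LD_LIBRARY_PATH=/opt/AMDAPP/lib/x86_64:$LD_LIBRARY_PATH
gcc main.c -o vectorAddition -I/opt/AMDAPP/include -L/opt/AMDAPP/lib/x86_64 -lOpenCL
./vectorAddition
```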

How to learn more

To learn more about OpenCL, I recommend the book from Fixstars called The OpenCL Programming Book.


Comments

  1. shilpa

    Hi Erik,

    Your program is pretty good to understand . I ran this program on a octacore machine but could not see the expected parallel processing of instructions.
    The time taken for this code was more than a sequential code for vector addition. I doubt , this program is not running on multiple cores parallely on my system. Could you plz help ?

    Regards,
    Shilpa

    • Erik Smistad

      This is probably because the vectors are so small (only 1024 items). Try increasing the size of the vectors to, let's say, 1024*1024*1024, and you will probably see a speedup.

  2. KSSR

    hello all,
    I need instruction about setting up opencl environemnt for Multicore system i.e on GPU.

  3. kevinkit

    Sorry, I was wrong about the cl_mem stuff, but nevertheless the “+1” is missing. (THIS IS NOT C++!)

  4. kevinkit

    Doesn’t it has to be

    int *A = (int*)malloc(sizeof(int)*(LIST_SIZE +1));

    and this in every other memory allocation furthermore when you allocate the memory objects it should be

    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
    (LIST_SIZE+1) * sizeof(cl_mem), NULL, &ret);

    instead of cl_int.

  5. Hi, I have a question.
    When you call clEnqueueReadBuffer, you don’t have to pass as parameter also the list of events to wait before read the buffer? I mean, you can get the event identifier from clEnqueueNDRangeKernel and pass it to clEnqueueReadBuffer, otherwise the program may read the results before the sum is completed.
    If it’s not needed, why?

    • Erik Smistad

      The third argument of clEnqueueReadBuffer with the value CL_TRUE ensures that this call is blocking. That means that the function will not return until it has read the buffer and thus explicit synchronization is not needed. However, if you set this argument to CL_FALSE you have to do explicit synchronization using events as you suggest.

      • I know that the third argument make the call blocking, but I think that means that (as you say) “the function will not return until it has read the buffer”. Howerer the documentation doesn’t say anything about the previous enqueued operations for this argument. Maybe is not so clear.
        The documentation says also that you must ensure that “All commands that use this buffer object have finished execution before the read command begins execution”

        • Erik Smistad

          Ah yes, that is a good point. The clue here is that your command queue is created with in-order execution (this is the default, and most devices don't support out-of-order execution). In-order execution guarantees that all of the queued commands will be executed in the order in which they were added to the queue. Thus, for clEnqueueReadBuffer to finish, all other queued operations have to finish first.

          See http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateCommandQueue.html for more info on this

        • Aaahh sorry :)
          Your queue is not out-of-order.

          I’m working with an out-of-order queue and I have some problems, so I’m trying to understand…
          Thanks

  6. Anonymous

    Have you found out how to fix the failure?


    I added a printf() to your code after line 43:

        if (ret != CL_SUCCESS) {
            printf("Error: Failed to query platforms! (%d)\n", ret);
            return EXIT_FAILURE;
        }

    After compiling, running it gives me this error: “Failed to query platforms (-1001)”

  7. meenu

    I am currently using ubuntu13.04 and have a VGA compatible controller: NVIDIA Corporation GK107 [GeForce GT 630 OEM] (rev al) … My open CL samples are running fine for CL_DEVICE_TYPE_DEFAULT and CL_DEVICE_TYPE_CPU… But they are not able to find OPENCL devices for CL_DEVICE_TYPE_GPU…

    • Erik Smistad

      The NVIDIA OpenCL platform only supports GPUs. So most likely you have more than one platform installed (or the NVIDIA platform is not installed). Try to select the correct platform. You can do this by increasing the number of entries in the clGetPlatformIDs function; see https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetPlatformIDs.html

      • Anonymous

        Okie i will tell u what exactly my configuration is and what i have observed then probably u might help me out… My CPU is intel core processor -i7 and i have an inbuilt gpu of nvidia. Now i have installed both intel sdk for opencl and nvidia propeitary drivers for ubuntu13.04. How can i create a run time so that it identifies both CPU and GPU. Currently i feel that only intel platform is getting recognised … hence opencl is working fine for option CPU and not GPU. Is there a way around where both my devices are identified and probably i can transfer data between my CPU and GPU. Also in my vendor directory i can observe both intelocl64.icd and nvidia.icd.

        • Erik Smistad

          An OpenCL context can only be associated with ONE platform. You have TWO platforms installed, and the code above only selects ONE platform, whichever comes first, which in your case is the Intel platform. To select the NVIDIA platform you need to increase the number of entries:

          cl_platform_id platform_ids[2];
          clGetPlatformIDs(2, platform_ids, &ret_num_platforms);

          and then select the other platform like this:

          clGetDeviceIDs( platform_ids[1], CL_DEVICE_TYPE_DEFAULT, 1, &device_id, &ret_num_devices);

  8. Gustavo Rozolin da Silva

    Excellent post Erik,

    Erik what I need to change in this code for to pass __local arg to kernel.

    Thanks you.

    • Erik Smistad

      Local memory, or shared memory as it is also called, is not accessible from the host. So if you want to use it you have to transfer data from global to local memory explicitly in a kernel.

  9. Laxator2

    Here is how it worked on my machine, using a (rather old) Nvidia card under PCLinuxOS :

    gcc main.c -I/usr/local/cuda-5.0/include/ -L/usr/lib/nvidia-current -l OpenCL -o vectorAddition

    Great example, short and to the point.

  10. Kareti

    Hi Erik, The blog was very helpful. Thanks for that.

    I have no GPU on my laptop, so is there a way to practise the opencl programs by emulating a GPU! I am using Fedora 18, 64 bit !

    Thank you.

    • Erik Smistad

      Yes, you can use your CPU instead. That’s the nice thing about OpenCL: The code can run on different types of processors. To do so simply install the Intel or AMD OpenCL runtime, depending on which type of processor you have. Afterwards execute the example above and it should run on the CPU.

  11. Victor

    Hi Erik. Could u tell me what does this mean?
    gcc -c -I /usr/include/CL main.c -o main.o
    gcc -o local16 main.o -L /usr/lib -l OpenCL
    /usr/lib64/gcc/x86_64-suse-linux/4.7/../../../../x86_64-suse-linux/bin/ld: skipping incompatible /usr/lib/libOpenCL.so when searching for -lOpenCL
    /usr/lib64/gcc/x86_64-suse-linux/4.7/../../../../x86_64-suse-linux/bin/ld: skipping incompatible /usr/lib/libc.so when searching for -lc

    I dont know if I successfuly link to the lib. What’s weird is that the result keeps the same even if I comment the whole kernel code.

    • Erik Smistad

      This is a problem with the linking, not the source code. Not sure what the problem is, never seen that error message before

  12. Xocoatzin

    Wow, it just worked. Awesome!

  13. Anonymous

    Very nice article, thanks!

    I’m wondering whether the statements:

    ret = clFlush(command_queue);
    ret = clFinish(command_queue);

    are actually needed. If I’m getting it right, since the command queue is in-order, when clEnqueueReadBuffer (which is blocking thanks to the CL_TRUE parameter) returns, the command queue should be empty.

    Another point that would be worth explaining is why there is no clFinish
    between clEnqueueNDRangeKernel and clEnqueueReadBuffer.

    • Erik Smistad

      Since the program terminates right after all the clean up statements, none of them are actually needed.

      In either case, you are correct, the flush and finish statements are not necessary.

      When the command queue is in-order, OpenCL will make sure that all the enqueue commands are performed in the order in which they are called in your C program.

  14. Rohit Vashistha

    Hi
    All those getting ‘zero’ results: change the option “GPU” to “CPU” or vice versa
    Regards,
    Rohit

  15. Jai

    I have a desktop pc with configuration as:
    Intel(R) Core(TM) 2 Duo CPU E700@2.93GHz, 2GB RAM 32 bit OS

    and I am willing to purchase a graphic card with config as:
    Sapphire AMD/ATI RADEON HD 6450 2GB RAM

    Can you please tell me is it compatible for my pc…?
    Thanks in advance.

  16. Jack

    Can i run openCL on a CPU? (I do not have a GPU but want to experiment with openCL before buying one).
    I have an intel i3 processor with a dual boot for Windows 7 and Ubuntu 12.04

  17. Anonymous

    The sample code all worked fine.
    But when I changed CL_DEVICE_TYPE_DEFAULT into CL_DEVICE_TYPE_GPU, it runs, but give me:
    0 + 1024 = 763461144
    1 + 1023 = 32716
    2 + 1022 = 763461144
    3 + 1021 = 32716
    4 + 1020 = 15489024
    5 + 1019 = 0
    6 + 1018 = 15489024
    7 + 1017 = 0
    8 + 1016 = 0
    9 + 1015 = 0
    10 + 1014 = 0
    11 + 1013 = 0
    12 + 1012 = 0
    13 + 1011 = 0
    14 + 1010 = 0
    15 + 1009 = 0
    16 + 1008 = 0
    17 + 1007 = 0
    18 + 1006 = 0
    19 + 1005 = 0
    20 + 1004 = 0
    21 + 1003 = 0
    22 + 1002 = 0
    23 + 1001 = 0
    24 + 1000 = 0
    25 + 999 = 0
    26 + 998 = 124817
    27 + 997 = 0
    28 + 996 = 1801415779
    29 + 995 = 1717531240
    30 + 994 = 540292720
    31 + 993 = 1633643619
    32 + 992 = 1717527661
    33 + 991 = 540292720
    34 + 990 = 1801415779
    35 + 989 = 1734308456
    36 + 988 = 1633841004
    37 + 987 = 1852399468
    38 + 986 = 1597125492
    39 + 985 = 1702060386
    40 + 984 = 1869898079
    41 + 983 = 1935894893
    42 + 982 = 1600938784
    43 + 981 = 1601333355
    44 + 980 = 1651469415
    45 + 979 = 1767861345
    46 + 978 = 842232942
    47 + 977 = 1954047327
    48 + 976 = 1701080677
    49 + 975 = 1952538468
    50 + 974 = 1667853679
    ….
    i tried CL_DEVICE_TYPE_CPU, and it worked fine.

    why is GPU not working?

  18. safwan

    Thank you for this brief example. It works for me, but I have a problem: when I change the value of LIST_SIZE to another value, the program executes but doesn’t execute the kernel function, and finally the result I get is C[i]=0:
    0+1024=0
    1+1023=0

    ..
    how can I resolve this problem?
    Thanks

    • Erik Smistad

      If you change the LIST_SIZE variable to another value that is not divisible by 64, it will not run, because the local size is set to 64 (the local_item_size variable in the host code). This means that 64 work-items are grouped together.

      If you only get 0, you probably haven't installed OpenCL correctly.

  19. Yaknan

    Hi Erik,
    I am enjoying your tutorials on OpenCl. Thanks for the good work. Please, am testing a hypothesis here for my thesis work on enhancing graphic rendering of the ray tracing algorithm. Am wondering if it is possible to integrate cilk++ with openCL. The idea is to see if cilk++ will take full and better advantage of CPUs while OpnCL takes advantage of the GPUs more efficiently. Thanks!

    • Erik Smistad

      As far as I know, cilk++ can run regular C/C++ code as well. And OpenCL host code is written in C, so I think it should work.

  20. prince

    i am using gpu of type nvidia,i am using opencl but when i run the program using ctrl+F5(start without debugging )then i get result in which gpu takes more time than cpu but when i run the program cpu takes more time than gpu and result is also i am giving
    start without debugging -> cpu time=6127 ms gpu time= 6240 ms
    start with debug-> cpu time= 18354 ms gpu time= 9125 ms

    wt is the reason in this difference……
    visual studio 2010 i am using
    the code is here. wt is going wrong.?..thanks

    // Hello.cpp : Defines the entry point for the console application.
    //

    //#include
    #include
    #include
    #include
    #include
    #include “CL/cl.h”
    #define DATA_SIZE 100000
    const char *KernelSource =
    “kernel void hello(global float *input , global float *output)\n”\
    “{\n”\
    ” size_t id =get_global_id(0);\n”\
    “output[id] =input[id]*input[id];\n”\
    “} ”
    “\n”\
    “\n”;
    //float start_time,end_time;

    int main(void)
    {
    double start_time,end_time;
    start_time=clock();
    cl_context context;
    cl_context_properties properties[3];
    cl_kernel kernel;
    cl_command_queue command_queue;
    cl_program program;
    cl_int err;
    cl_uint num_of_platforms=0;
    cl_platform_id platform_id;
    cl_device_id device_id;
    cl_uint num_of_devices=0;
    cl_mem input,output;
    size_t global;
    float inputData[100000];
    for(int j=0;j<100000;j++)
    {
    inputData[j]=(float)j;
    }

    float results[DATA_SIZE];//={0};

    // int i;

    //retrieve a list of platform variable
    if(clGetPlatformIDs(1,&platform_id,&num_of_platforms)!=CL_SUCCESS)
    {
    printf("Unable to get platform_id\n");
    return 1;
    }

    //try to get supported GPU DEvice
    if(clGetDeviceIDs(platform_id,CL_DEVICE_TYPE_CPU,1,&device_id,
    &num_of_devices)!=CL_SUCCESS)
    {
    printf("unable to get device_id\n");
    return 1;
    }

    //context properties list -must be terminated with 0
    properties[0]=CL_CONTEXT_PLATFORM;
    properties[1]=(cl_context_properties) platform_id;
    properties[2]=0;

    //create a context with the GPU device
    context=clCreateContext(properties,1,&device_id,NULL,NULL,&err);

    //create command queue using the context and device
    command_queue=clCreateCommandQueue(context,device_id,0,&err);

    //create a program from the kernel source code
    program=clCreateProgramWithSource(context,1,(const char**)
    &KernelSource,NULL,&err);

    //compile the program
    err=clBuildProgram(program,0,NULL,NULL,NULL,NULL);
    if((err!=CL_SUCCESS))
    {
    printf("build error \n",err);
    size_t len;
    char buffer[4096];
    //get the build log
    clGetProgramBuildInfo(program,device_id,CL_PROGRAM_BUILD_LOG,sizeof(buffer),buffer,&len);
    printf("—-build Log—\n%s\n",buffer);
    exit(1);

    // return 1;
    }

    //specify which kernel from the program to execute
    kernel=clCreateKernel(program,"hello",&err);

    //create buffers for the input and output
    input=clCreateBuffer(context,CL_MEM_READ_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

    output=clCreateBuffer(context,CL_MEM_WRITE_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

    //load data into the input buffer

    clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,
    sizeof(float)*DATA_SIZE,inputData,0,NULL,NULL);

    //set the argument list for the kernel command
    clSetKernelArg(kernel,0,sizeof(cl_mem),&input);
    clSetKernelArg(kernel,1,sizeof(cl_mem),&output);
    global=DATA_SIZE;

    //enqueue the kernel command for execution
    clEnqueueNDRangeKernel(command_queue,kernel,1,NULL,&global,NULL,0,NULL,NULL);
    clFinish(command_queue);

    //copy the results from out of the buffer
    clEnqueueReadBuffer(command_queue,output,CL_TRUE,0,sizeof(float)*DATA_SIZE,results,0,
    NULL,NULL);

    //print the results
    printf("output:");
    for(int i=0;i<DATA_SIZE;i++)
    {
    printf("%f\n",results[i]);
    //printf("no. of times loop run %d\n",count);
    }

    //cleanup-release OpenCL resources

    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    end_time=clock();
    printf("execution time is%f",end_time-start_time);
    _getch();
    return 0;

    }

    • Erik Smistad

      It is normal for the execution time to increase when debugging an application. It doesn’t mean that anything is wrong with the program

  21. swap

    How will i execute same program on windows 7 with intel HD 4000 GPU.
    I have installed Intel opencl SDK

    • Erik Smistad

      Just remember to link to the lib files included in the Intel OpenCL SDK and add the include folder. If you are using Visual Studio you can add these things in the project settings menu.


  23. Thanks!

    You officially got me started with OpenCL

    (Hilsen fra Bergen)

  24. Ricardas

    Great Tutorial. Thanks ! :)

  25. Shinchan

    Hi,
    I followed your steps and was able to get main.o.
    But when i did
    gcc main.o -o host -L /home/mydir/Downloads/ati-stream-sdk-v2.1-lnx64/lib/x86_64 -l OpenCL

    I got
    /usr/bin/ld: cannot find -lopenCL
    collect2: ld returned 1 exit status

    I have no idea what this means. Please help!

    • Erik Smistad

      It means that the linker can't find the libOpenCL file, which should be in your /home/mydir/Downloads/ati-stream-sdk-v2.1-lnx64/lib/x86_64 folder. Make sure you are using a capital O in “-lOpenCL” and not “-lopenCL”, as it says in your error message: “/usr/bin/ld: cannot find -lopenCL”

  26. spandei

    Thank you!
    You write very understandable code:)
    To tell the truth your article helped me to understand the OpenCL Mechanism better than the whole AMD SKD. Keep it up!

  27. Bishoy Mikhael

    i failed to compile the example,can you please review my configuration for the IDE, i’ve tried MS Visual Studio 2010, NetBeans 7.1 and Eclipse Indigo using both AMD and NVIDIA SDKs on Windows 7 x64 with Nvidia GeForce 330M graphics card.
    i’ve declared the environment variables for CUDA SDK as follows $(CUDA_LIB_PATH), $(CUDA_BIN_PATH) and $(CUDA_INC_PATH) for (CUDA installation path\lib\x64), (CUDA installation path\bin), (CUDA installation path\include) respectively. in NetBeans TOOLS–>Options in C/C++ Code Assistance tab i’ve added the include directory for CUDA SDK, then in the project properties in “C Compiler” tab i’ve added the include directory path in “Include Directory”, and in the “Linker” tab i’ve added the library path in “Additional Library Directories” then “opencl.lib” in “Additional Dependencies”, i din’t know what to add in the “Compilation Line” or if there is another settings i’m missing.

    when i build the project i get an error:

    “/bin/sh: -c: line 0: syntax error near unexpected token `(‘
    /bin/sh: -c: line 0: `gcc.exe -m64 -c -I(CUDA_INC_PATH\) -MMD -MP -MF build/Release/MinGW-Windows/_ext/141055651/vector_add.o.d -o build/Release/MinGW-Windows/_ext/141055651/vector_add.o /C/Users/Arch/Documents/NetBeansProjects/vector_add/vector_add.c’
    make[2]: *** [build/Release/MinGW-Windows/_ext/141055651/vector_add.o] Error 2″

    • Erik Smistad

      Always a nightmare to compile on Windows… but it should work in Visual Studio 2010, and setting the include and library paths should be enough. Don't know what's wrong in your case

      • Bishoy Mikhael

        i’ve uninstalled VS 2010, netbeans, MinGW, cygwin and the newly installed Microsoft Visual C++ 2008 Redistributables and .NET frameworks, then installed Code::Blocks then i copied the CL directory from the include directory of NVIDIA Toolkit and set the additional libraries in Code::Blocks to point at the OpenCL.lib, guess what, it worked fine without any errors

  28. Hi,

    I’m trying to run OpenCL program on a simulator (gem5).

    The simulator supports Full-System mode, that is, first boot Linux from a disk-image and then run the program.

    I borrowed the ICD record and libOpenCL.so from AMD SDK, and put them into the proper place in the image file.

    But the simulation trace shows that, it fails to find a platform, and then crashes when trying to create a context.

    Do you have any suggestions on my situation>
    Thank you.

    • Erik Smistad

      Hi

      Make sure the ICD files can be read by the program and that the AMD APP SDK and display drivers are properly installed.

  29. rupam

    hi, I an pretty new in GPU.Currently I am working on AMD RADEON 6500 series…. I have written some program on MATLAB 2011b…I want to run the codes on GPU…can u plz instruct me how to run .m codes on GPU…thanx in advance…

  30. vegihat

    Hello Erik,

    i try to understand what is the physical meaning of below clEnqueueNDRangeKernel’s arguments

    const size_t *global_work_size
    const size_t *local_work_size

    you used values
    size_t global_item_size = LIST_SIZE;
    size_t local_item_size = 64

    which means that we have LIST_SIZE/64 work-groups,right?

    what’s the difference between local_item_size=1 and local_item_size = 64?

    i want to understand which is the perfect value of the local_item_size .

    • Erik Smistad

      The “perfect” value of local_item_size is very system and application dependent. You can omit this parameter and let OpenCL decide by itself. Note that for NVIDIA GPUs the local_item_size should be a multiple of 32 (one warp) or else some work-items will be idle. The same applies to AMD GPUs, but with a multiple of 64 (one wavefront, as they call it). This is because NVIDIA and AMD schedule work-items in groups of 32 and 64, respectively, on each compute unit.

  31. The Ham

    Hi, to all people that get wrong output (garbage or all 0)
    I used this code under gForce M8400 GS and get garbage and error -52 in clEnqueueNDRangeKernel which is wrong kernel arguments.

    that is what i changed in this code:
    add 4-th argument for the kernel

    ret = clSetKernelArg(kernel, 3, sizeof(int), &LIST_SIZE);

    remember that in this code LIST_SIZE must be 64 * m (m is integer)

    This code works ok on my AMD HD 6670 without any changes,
    dont know why (just started with OpenCL)

    (cant rly add comment on this site!!)


    • The Ham

      i just can get it right…

      another thing in the kernel
      change output buffer type
      __global int *C to __global float *C 
      or you get all = 0

  33. Hello. ..

    I compiled some sample applications, and get this error when running any OpenCL application. …

    OpenCL SW Info:

    Error -1001 in clGetPlatformIDs Call !!!

    I was googling for several hours, got some useful info’s but could not solve the problem yet. Any ideas? … Thanks.

    • Erik Smistad

      Hi

      I think it means that it can't find any OpenCL platforms. Check to see if the .icd files are properly installed. They should exist under /etc/OpenCL/vendors. If no .icd files exist there, you have to find them in the downloaded SDK and copy them manually to this location.
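A quick way to check this (the path is the standard ICD registry location on Linux; exact file names vary by vendor):

```shell
# Each .icd file in the registry names a vendor's OpenCL runtime library.
if ls /etc/OpenCL/vendors/*.icd >/dev/null 2>&1; then
    echo "Registered OpenCL platforms:"
    cat /etc/OpenCL/vendors/*.icd
else
    echo "No .icd files found - no OpenCL platform is registered"
fi
```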

  34. Otto

    The OpenCL library seems to slow initial execution down a bit. The example above does not really work the GPU. I created the same thing in pure C and it’s almost 3 times faster.
