From AMD forums:
OpenCL Mingw
============
In fact, it is quite easy to make a libOpenCL.a for MinGW. I've done it, and now I can compile OpenCL examples with MinGW.
The idea is to use the tool reimp found in MinGW, which creates an import library for a DLL (i.e. it creates libXXX.a from XXX.lib, where XXX.lib is only the import library for XXX.dll). Two caveats: it won't create a libXXX.a when XXX.lib is a general library, since it only works for import libraries; and the names exported by the DLL must not use C++ mangling (simple bare function names are OK).
1. Open a command prompt whose PATH contains mingw\bin, go to ATIStreamSDK\lib\x86 where OpenCL.lib is located, and type: reimp OpenCL.lib
You get OpenCL.def and libopencl.a; the latter is what you want to link against.
2. When you compile your .c / .cpp files that use cl.h, add the compiler define _MSC_VER so that the stdcall calling convention is selected; otherwise the linker will not look for the right names in libopencl.a.
That's all; it worked for me.
If reimp reports "bad or corrupt import lib" or something similar, use dlltool (included in MinGW) to generate libopencl.a from OpenCL.def:
dlltool -l libopencl.a -d OpenCL.def -A -k
where OpenCL.def is, for instance, this file (a .def file contains the names exported by the DLL):
http://pastebin.com/f2ac38b2f
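To illustrate why the calling convention in step 2 changes which names the linker looks for, here is a minimal sketch of 32-bit Windows symbol decoration. The helper decorated_name and its parameters are hypothetical names made up for this illustration and are not part of MinGW or any SDK. It assumes the usual x86 rules: a stdcall function becomes _name@N (N is the total size of its arguments in bytes), while a cdecl function becomes just _name.

```python
# Sketch of 32-bit Windows symbol decoration (an illustration only).
# decorated_name and its parameters are hypothetical names for this example.

def decorated_name(name, arg_bytes, convention):
    """Return the x86 linker-level symbol for a C function."""
    if convention == "stdcall":
        # stdcall: leading underscore, then '@' and the argument size in bytes
        return "_{}@{}".format(name, arg_bytes)
    if convention == "cdecl":
        # cdecl: leading underscore only
        return "_" + name
    raise ValueError("unknown convention: " + convention)

# clGetPlatformIDs(cl_uint, cl_platform_id *, cl_uint *) takes three 4-byte
# arguments on 32-bit x86, so 12 bytes in total.
stdcall_symbol = decorated_name("clGetPlatformIDs", 12, "stdcall")
cdecl_symbol = decorated_name("clGetPlatformIDs", 12, "cdecl")

print(stdcall_symbol)
print(cdecl_symbol)
```

If the headers declare the functions with one convention while the import library was built for the other, the two names differ, which is presumably the mismatch step 2 works around.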
OpenCL and AMD constant mem
==============
Q:
I have about 32KB total worth of 16-bit (short int) constants in 4 or so lookup tables of different sizes. I'd like to be able to access them in parallel from different threads in as quick a way as possible. Architecturally it would seem like the texture cache is ideal, but if I just place them in the CL kernel file and tag it with the __constant specifier, will they be located somewhere that will be accessed quickly?
A:
Although this is not in the current release, data placed in a constant address space array in the kernel file will go into a constant buffer once this is fully implemented. Peak constant buffer bandwidth is roughly 10x the L1 bandwidth on the 770 (L1 is ~480GB/s), but it is still slower than register file access.
See "A compiler for parallel execution of numerical Python programs on graphics processing units"
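What the question describes (lookup tables placed in the kernel file and tagged with __constant) could look like the OpenCL C source generated by the sketch below. The names constant_table_decl, build_kernel_source, lut_a and the lookup kernel are hypothetical and chosen only for this example. Whether the table really ends up in a hardware constant buffer depends on the implementation, as the answer above points out.

```python
# Sketch: generate OpenCL C source with a lookup table in the __constant
# address space. All names here are hypothetical, for illustration only.

def constant_table_decl(name, values):
    """Build a program-scope '__constant short' array declaration."""
    if not values:
        raise ValueError("the table must not be empty")
    for v in values:
        if not -32768 <= v <= 32767:
            raise ValueError("value does not fit in a 16-bit short: %d" % v)
    body = ", ".join(str(v) for v in values)
    return "__constant short %s[%d] = { %s };" % (name, len(values), body)

KERNEL_TEMPLATE = """
%(tables)s

__kernel void lookup(__global const ushort *idx, __global short *out)
{
    size_t i = get_global_id(0);
    /* every work-item reads the table; indices are assumed to be in range */
    out[i] = lut_a[idx[i]];
}
"""

def build_kernel_source(values):
    """Return kernel source text with the table embedded as lut_a."""
    return KERNEL_TEMPLATE % {"tables": constant_table_decl("lut_a", values)}

source = build_kernel_source([3, -1, 4, 1])
print(source)
```

The resulting string would then be handed to the OpenCL runtime as program source; that host-side step is not shown here.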
OpenCL and CAL
================
The quick answer is yes (OpenCL is written on top of CAL, so it can't be faster). The full answer is a little longer.
On the 4xxx family, CAL gets you almost the full power of the card. But be warned: it will be rather painful. Documentation is really bad or missing (with regard to optimization), and the compiler sometimes does strange things (so you need to trick it to get quality code). On the other hand, OpenCL for 4xxx is really bad (it lacks cached memory access and LDS): it's about 3x slower than Brook+.
With the 5xxx family it's hard to say. Some results (search the streamsdk forum) suggest there is a problem with memory transfer speed (we will see if a new CAL version corrects it). So, with the exception of memory transfer, you can get almost the full power of 5xxx with CAL.
OpenCL on 5xxx is again a problem. In theory OpenCL on 5xxx should work like a charm (it isn't missing LDS or the new memory access instructions), but the results don't support that (maybe memory problems again, who knows). At the moment, performance for some applications is comparable to OpenCL on an 8800GT.
smallpt bugs in AMD OpenCL
==========================
smallpt 1.2 has bugs with AMD OpenCL; they are currently fixed in trunk:
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=123480&enterthread=y
Nvidia bug?:
writing to Image obtained via CL/GL
==================================
Is it possible yet to write to an image that's been obtained via clCreateFromGLTexture2D?
I get error -30 (CL_INVALID_VALUE) when I try writing to it with clEnqueueWriteImage or clEnqueueNDRangeKernel (after acquiring it), but writing to an image that's been created with clCreateImage2D works OK. Writing to buffers acquired from OpenGL works OK too.
Am I missing some extra step needed, or is it just not supported yet?
If it's not supported, is there a list of the current issues in NVidia's OpenCL implementation?