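JOCL (Java bindings for OpenCL) provides an API that stays as close as possible to the original OpenCL API. All functions are implemented as static methods, and the semantics and signatures of these methods match the original library functions, except where limitations of the Java language make that impossible.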
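Why I chose JOCL:
1. I am a Java developer and do not know C++.
2. My laptop only has a low-end AMD graphics card.
I tried aparapi earlier, but it kept throwing errors and never worked, possibly because of the graphics card.
My graphics card is an AMD Radeon HD 7600M series with driver version 15.200.1062.1004; the CPU is an i7 3740QM.
First install the correct driver, then install the AMD APP SDK provided by AMD (version 2.9 at the time of writing). Download both from the official site yourself: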
http://support.amd.com/en-us/kb-articles/Pages/OpenCL2-Driver.aspx
Once that is done, run clinfo -v in cmd. If it prints results, the installation is OK.
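The same check can be scripted from Java. This is only a minimal sketch: the class name ClinfoCheck and the method commandSucceeds are made up for illustration, and it assumes that clinfo is on the PATH and exits with code 0 when it works.

```java
import java.io.IOException;

// Hypothetical helper (not part of JOCL): checks whether a command runs successfully.
final class ClinfoCheck {

    /** Returns true if the command could be started and exited with code 0. */
    static boolean commandSucceeds(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // drop output so the child never blocks
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the SDK installer put clinfo on the PATH.
        System.out.println("clinfo works: " + commandSucceeds("clinfo", "-v"));
    }
}
```

If this prints false, check the driver and SDK installation before touching any Java OpenCL code.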
Then download JOCL:
http://www.jocl.org/downloads/downloads.html
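For details, see the page below. Note that jocl.org and JogAmp JOCL are two separate libraries; the jars and the demo used in this post come from JogAmp JOCL (package com.jogamp.opencl).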
https://jogamp.org/jocl/doc/HowToBuild.html
If you do not want to build from source, use the prebuilt archive directly. Download address:
http://jogamp.org/deployment/jogamp-current/archive/
Download the largest file there and unpack it.
From \jogamp-all-platforms\jar, copy these four jars into your project's lib directory: gluegen-rt.jar, gluegen-rt-natives-windows-amd64.jar, jocl.jar and jocl-natives-windows-amd64.jar.
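To confirm that the jars are really on the classpath, a quick check like the following can help. It is a sketch only: JoclClasspathCheck and isLoadable are hypothetical names. It looks up classes without initializing them, so it verifies gluegen-rt.jar and jocl.jar but says nothing about the two natives jars.

```java
// Hypothetical helper: checks whether a class can be found on the classpath.
final class JoclClasspathCheck {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, JoclClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // com.jogamp.common.nio.Buffers ships in gluegen-rt.jar,
        // com.jogamp.opencl.CLContext ships in jocl.jar.
        System.out.println("gluegen-rt.jar: " + isLoadable("com.jogamp.common.nio.Buffers"));
        System.out.println("jocl.jar:       " + isLoadable("com.jogamp.opencl.CLContext"));
    }
}
```

Run it with the same classpath as the demo, for example java -cp "lib/*;." JoclClasspathCheck on Windows.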
Then write a demo using the code below.
import com.jogamp.opencl.CLBuffer;
import com.jogamp.opencl.CLCommandQueue;
import com.jogamp.opencl.CLContext;
import com.jogamp.opencl.CLDevice;
import com.jogamp.opencl.CLKernel;
import com.jogamp.opencl.CLProgram;
import java.io.IOException;
import java.nio.FloatBuffer;
import java.util.Random;
import static java.lang.System.*;
import static com.jogamp.opencl.CLMemory.Mem.*;
import static java.lang.Math.*;
/**
* Hello Java OpenCL example. Adds all elements of buffer A to buffer B
* and stores the result in buffer C.<br/>
* Sample was inspired by the Nvidia VectorAdd example written in C/C++
* which is bundled in the Nvidia OpenCL SDK.
*
* @author Michael Bien
*/
public class HelloJOCL {
    public static void main(String[] args) throws IOException {
        // set up (uses default CLPlatform and creates context for all devices)
        CLContext context = CLContext.create();
        out.println("created "+context);
        // always make sure to release the context under all circumstances
        // not needed for this particular sample but recommended
        try{
            // select fastest device
            CLDevice device = null;
            device = context.getMaxFlopsDevice(CLDevice.Type.GPU);
            //device = context.getMaxFlopsDevice(CLDevice.Type.CPU);
            out.println("using "+device);
            // create command queue on device.
            CLCommandQueue queue = device.createCommandQueue();
            int elementCount = 59449477; // Length of arrays to process
            int localWorkSize = min(device.getMaxWorkGroupSize(), 256); // Local work size dimensions
            int globalWorkSize = roundUp(localWorkSize, elementCount); // rounded up to the nearest multiple of the localWorkSize
            // load sources, create and build program
            CLProgram program = context.createProgram(HelloJOCL.class.getResourceAsStream("VectorAdd.cl")).build();
            // A, B are input buffers, C is for the result
            CLBuffer<FloatBuffer> clBufferA = context.createFloatBuffer(globalWorkSize, READ_ONLY);
            CLBuffer<FloatBuffer> clBufferB = context.createFloatBuffer(globalWorkSize, READ_ONLY);
            CLBuffer<FloatBuffer> clBufferC = context.createFloatBuffer(globalWorkSize, WRITE_ONLY);
            out.println("used device memory: "
                + (clBufferA.getCLSize()+clBufferB.getCLSize()+clBufferC.getCLSize())/1000000 +"MB");
            // fill input buffers with random numbers
            // (just to have test data; seed is fixed -> results will not change between runs).
            fillBuffer(clBufferA.getBuffer(), 12345);
            fillBuffer(clBufferB.getBuffer(), 67890);
            // get a reference to the kernel function with the name 'VectorAdd'
            // and map the buffers to its input parameters.
            CLKernel kernel = program.createCLKernel("VectorAdd");
            kernel.putArgs(clBufferA, clBufferB, clBufferC).putArg(elementCount);
            // asynchronous write of data to GPU device,
            // followed by blocking read to get the computed results back.
            long time = nanoTime();
            queue.putWriteBuffer(clBufferA, false)
                 .putWriteBuffer(clBufferB, false)
                 .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
                 .putReadBuffer(clBufferC, true);
            time = nanoTime() - time;
            // print first few elements of the resulting buffer to the console.
            out.println("a+b=c results snapshot: ");
            for(int i = 0; i < 10; i++)
                out.print(clBufferC.getBuffer().get() + ", ");
            out.println("...; " + clBufferC.getBuffer().remaining() + " more");
            out.println("computation took: "+(time/1000000)+"ms");
        }finally{
            // cleanup all resources associated with this context.
            context.release();
        }
    }
    private static void fillBuffer(FloatBuffer buffer, int seed) {
        Random rnd = new Random(seed);
        while(buffer.remaining() != 0)
            buffer.put(rnd.nextFloat()*100);
        buffer.rewind();
    }
    private static int roundUp(int groupSize, int globalSize) {
        int r = globalSize % groupSize;
        if (r == 0) {
            return globalSize;
        } else {
            return globalSize + groupSize - r;
        }
    }
}
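The demo loads a kernel file named VectorAdd.cl through getResourceAsStream, but the post does not show that file. The sketch below writes one plausible version of it to disk. The kernel text is an assumption reconstructed from the argument order in putArgs(clBufferA, clBufferB, clBufferC).putArg(elementCount), not the author's actual file, and VectorAddKernelFile is a made-up helper name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes an assumed VectorAdd.cl so the demo can find it.
final class VectorAddKernelFile {

    // OpenCL C source (assumed): one work-item adds one pair of elements.
    static final String SOURCE = String.join("\n",
        "// Element-wise vector addition: c[i] = a[i] + b[i]",
        "kernel void VectorAdd(global const float* a,",
        "                      global const float* b,",
        "                      global float* c,",
        "                      int numElements) {",
        "    int i = get_global_id(0);",
        "    // The global size is rounded up to a multiple of the local size,",
        "    // so the surplus work-items must do nothing.",
        "    if (i >= numElements) return;",
        "    c[i] = a[i] + b[i];",
        "}",
        "");

    /** Writes VectorAdd.cl into dir (created if missing) and returns its path. */
    static Path writeTo(Path dir) {
        try {
            Files.createDirectories(dir);
            return Files.write(dir.resolve("VectorAdd.cl"),
                               SOURCE.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("wrote " + writeTo(dir));
    }
}
```

Because HelloJOCL is in the default package, the file has to sit at the root of the classpath, that is, next to HelloJOCL.class.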
This program performs roughly 59 million float additions (elementCount = 59449477).
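To make the sizes concrete, the arithmetic from the demo can be replayed without any OpenCL device. This sketch copies roundUp from the demo and assumes the device reports a maximum work-group size of at least 256, so that localWorkSize is 256. WorkSizeMath and bufferBytes are names invented for the example.

```java
// Replays the demo's size calculations on the host only.
final class WorkSizeMath {

    /** Same logic as roundUp in the demo: next multiple of groupSize >= globalSize. */
    static int roundUp(int groupSize, int globalSize) {
        int r = globalSize % groupSize;
        return r == 0 ? globalSize : globalSize + groupSize - r;
    }

    /** Total bytes needed for bufferCount float buffers of the given length. */
    static long bufferBytes(int elements, int bufferCount) {
        return (long) elements * Float.BYTES * bufferCount;
    }

    public static void main(String[] args) {
        int elementCount = 59449477;
        int localWorkSize = 256; // assumption: device max work-group size >= 256
        int globalWorkSize = roundUp(localWorkSize, elementCount);

        System.out.println("globalWorkSize  = " + globalWorkSize);                  // 59449600
        System.out.println("work-groups     = " + globalWorkSize / localWorkSize);  // 232225
        System.out.println("idle work-items = " + (globalWorkSize - elementCount)); // 123
        System.out.println("buffers A+B+C   = "
                + bufferBytes(globalWorkSize, 3) / 1000000 + "MB");                 // 713MB
    }
}
```

The 123 surplus work-items are the reason the kernel receives elementCount as a fourth argument: it needs a bound to skip indices past the real data. The three buffers together take about 713MB, which is a lot for a laptop graphics card.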
device = context.getMaxFlopsDevice(CLDevice.Type.GPU);
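This line selects whether the computation runs on the CPU or the GPU; to use the CPU, switch to the commented-out CLDevice.Type.CPU line.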
The test results are shown in the figure. Without any code optimization, this graphics card was actually slower than my CPU.
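For a baseline that involves no OpenCL at all, the same addition can be timed as a plain Java loop. This is a sketch with made-up names (CpuVectorAdd, add, randomArray), not part of the original demo. Keep in mind that the timed region in the demo also covers copying A and B to the device and reading C back, so for a memory-bound operation like this the transfers probably account for a large share of the measured time.

```java
import java.util.Random;

// Plain-Java baseline for the same a + b = c computation.
final class CpuVectorAdd {

    /** Element-wise sum of two equally long arrays. */
    static float[] add(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    /** Test data in [0, 100), generated the same way as fillBuffer in the demo. */
    static float[] randomArray(int n, int seed) {
        Random rnd = new Random(seed);
        float[] x = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = rnd.nextFloat() * 100;
        }
        return x;
    }

    public static void main(String[] args) {
        // The demo's element count needs about 713MB of heap for three arrays;
        // pass a smaller count as the first argument or raise -Xmx if needed.
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 59449477;
        float[] a = randomArray(n, 12345);
        float[] b = randomArray(n, 67890);

        long time = System.nanoTime();
        float[] c = add(a, b);
        time = System.nanoTime() - time;

        System.out.println("c[0] = " + c[0]);
        System.out.println("CPU loop took: " + (time / 1000000) + "ms");
    }
}
```

A single run like this is only a rough indication, since the JIT has not warmed up; repeat the call a few times before comparing numbers.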
There are other demos as well; see the official documentation.