Debug tools: CodeXL
Arch : GCN
1 CU: 4 x SIMD (16 lanes each; a 64-thread wavefront runs over 4 consecutive cycles)
1 CU: 4 x VALU + 1 x SALU + LDS (max 32 KB per work-group) + 16 KB L1 cache
1 wavefront == 64 threads
max 256 VGPR
max 101 SGPR
inline constants: integer [-16, 64]; float [±0.5, ±1.0, ±2.0, ±4.0, 1/(2*PI) ...]
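The 256 VGPR limit above is what usually bounds occupancy. A minimal sketch of the arithmetic, assuming (these two numbers are not in the notes) a cap of 10 wavefronts per SIMD and VGPR allocation in blocks of 4; the function name is hypothetical:

```c
/* Assumptions, not from the notes: at most 10 wavefronts per SIMD,
   and VGPRs are handed out in blocks of 4. */
enum { VGPR_BUDGET = 256, MAX_WAVES_PER_SIMD = 10, VGPR_GRANULE = 4 };

/* Estimate how many wavefronts fit on one SIMD when each thread
   needs vgprs_per_thread vector registers. */
int waves_per_simd(int vgprs_per_thread)
{
    if (vgprs_per_thread <= 0)
        return MAX_WAVES_PER_SIMD;          /* no VGPR pressure at all */
    if (vgprs_per_thread > VGPR_BUDGET)
        return 0;                           /* kernel cannot be scheduled */

    /* round the request up to the allocation granule */
    int alloc = (vgprs_per_thread + VGPR_GRANULE - 1) / VGPR_GRANULE * VGPR_GRANULE;
    int waves = VGPR_BUDGET / alloc;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under these assumptions a kernel using 24 VGPRs still reaches the cap, while one using 129 VGPRs leaves room for a single wavefront per SIMD, so there is nothing left to hide memory latency with.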
Memory types:
__private   registers
__constant  uniform (read-only constants)
__local     LDS
__global    SSBO (shader storage buffer)
__image
flat memory (gfx9 and above)
1. intrinsics / built-in functions (assembly-level optimization): use intrinsics to reach special hardware instructions
ballot EXEC
mbcnt D = CountOneBits(S0 & ThreadMask[31:0]) + S1
bcnt
barycentric coordinate
readfirstlane copies a VGPR value to an SGPR; can reduce VGPR pressure
interp vertex parameter
swizzle
permute(ds,
fp16
med3,min3,max3
fma
msad
cube
div_fixup
?trig_
bit operations
......
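To make the ballot and mbcnt entries above concrete, here is a plain C reference model of what they compute across one 64-lane wavefront, used for stream compaction. This is a CPU-side sketch, not the hardware instruction: the real mbcnt is split into a low and a high 32-bit half, which this model merges into one 64-bit mask, and all function names are hypothetical.

```c
#include <stdint.h>

#define WAVE_SIZE 64

/* Number of set bits in x. */
static int count_one_bits(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* ballot: bit i of the result is set when lane i's predicate is nonzero. */
uint64_t ballot(const int pred[WAVE_SIZE])
{
    uint64_t mask = 0;
    for (int lane = 0; lane < WAVE_SIZE; lane++)
        if (pred[lane])
            mask |= (uint64_t)1 << lane;
    return mask;
}

/* mbcnt: count the set bits of mask that belong to lanes below `lane`,
   then add base. ThreadMask has a 1 for every lower-numbered lane. */
int mbcnt(uint64_t mask, int lane, int base)
{
    uint64_t thread_mask = (lane == 0) ? 0 : (~(uint64_t)0 >> (64 - lane));
    return count_one_bits(mask & thread_mask) + base;
}

/* Compaction: every lane holding a positive value writes it to the slot
   given by mbcnt, so the survivors end up densely packed in out[]. */
int compact_positive(const int in[WAVE_SIZE], int out[WAVE_SIZE])
{
    int pred[WAVE_SIZE];
    for (int lane = 0; lane < WAVE_SIZE; lane++)
        pred[lane] = in[lane] > 0;

    uint64_t mask = ballot(pred);
    for (int lane = 0; lane < WAVE_SIZE; lane++)
        if (pred[lane])
            out[mbcnt(mask, lane, 0)] = in[lane];

    return count_one_bits(mask);    /* number of surviving elements */
}
```

On the GPU the two loops disappear: every lane evaluates its own predicate and its own mbcnt in parallel, so the write index is obtained without atomics or a prefix-sum pass through LDS.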
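2. Instruction latency
Floating-point instructions have much lower latency than integer instructions, so there is no need for 'integer quantization' optimizations.
Taking the Trinity APU as an example, FP add latency on the GPU is 5.7x that of the CPU, while int add latency on the GPU is 45.3x that of the CPU.
Memory latency is also somewhat longer on the GPU than on the CPU.
Branch latency is especially long.
Launch latency is long, because a launch involves uploading the kernel and buffers and setting up CU registers. (If a game compiled all of its shaders together, or into separately uploadable subroutines, would startup be much faster?)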
3. Channel, rank and bank conflicts
Avoid channel and bank conflicts; memory is best laid out in tiles, one tile per work-group.
Specify the required local_size:
__attribute__((reqd_work_group_size(X, Y, Z)))
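A rough way to see why layout matters is to count how many simultaneous accesses land on the same channel or bank. The model below is a deliberately simplified sketch: it assumes the unit (channel or bank) is chosen by (address / interleave) mod units, and the interleave and unit counts passed in are illustrative parameters, not values taken from these notes. The function name is hypothetical.

```c
#define MAX_UNITS 64

/* For `accessors` work-items reading addresses 0, stride, 2*stride, ...
   return the largest number of them that map to the same unit
   (memory channel or LDS bank). 1 means conflict-free.
   Returns -1 for unsupported parameters. */
int worst_case_hits(unsigned stride_bytes, int accessors,
                    unsigned interleave_bytes, int units)
{
    int hits[MAX_UNITS] = {0};
    int worst = 0;

    if (units <= 0 || units > MAX_UNITS || interleave_bytes == 0)
        return -1;

    for (int i = 0; i < accessors; i++) {
        unsigned long long addr = (unsigned long long)i * stride_bytes;
        int unit = (int)((addr / interleave_bytes) % (unsigned)units);
        if (++hits[unit] > worst)
            worst = hits[unit];
    }
    return worst;
}
```

With an illustrative 256-byte interleave over 8 channels, a row pitch of 2048 bytes sends all 8 accessors to one channel, while padding the pitch to 2304 bytes spreads them over all 8. The same call with a 4-byte interleave and 32 units shows the familiar effect of padding a 32-word row to 33 words.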
4. Branch elimination
GPU branches are very inefficient; eliminate jumps with select instructions and loop unrolling.
loop unroll
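A small sketch of both ideas in plain C. select_f is a hypothetical helper that mirrors the scalar form of the OpenCL built-in select(a, b, c), which yields b when c is true and a otherwise. Whether a given compiler really emits a conditional move instead of a jump is not guaranteed; the point is the shape of the code.

```c
/* Scalar stand-in for OpenCL select(a, b, c): c ? b : a. */
static float select_f(float a, float b, int c)
{
    return c ? b : a;
}

/* Branchy version: two data-dependent jumps. */
float clamp_branchy(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Select version: both comparisons are always evaluated, no jumps needed. */
float clamp_select(float x, float lo, float hi)
{
    x = select_f(x, lo, x < lo);
    x = select_f(x, hi, x > hi);
    return x;
}

/* Loop over 4 elements unrolled by hand, with the `if (v[i] > 0)` test
   replaced by a select, so there is neither a loop branch nor a data branch. */
float sum_positive4(const float v[4])
{
    return select_f(0.0f, v[0], v[0] > 0.0f)
         + select_f(0.0f, v[1], v[1] > 0.0f)
         + select_f(0.0f, v[2], v[2] > 0.0f)
         + select_f(0.0f, v[3], v[3] > 0.0f);
}
```

The select form does slightly more arithmetic than the branchy one, but every lane of a wavefront follows the same instruction stream, which is what matters when lanes would otherwise diverge.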
5. bit operations:
bfm ballot mbcnt bfe bitcmp wqm quadmask ff0 ff1 flbit bitset0 bitset1 saveexec
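For reference, a few of the names above can be modelled in portable C. These are sketches of the usual meaning of bfe (bit field extract), bfm (bit field mask) and ff1 (find first one, from the LSB), written from the general ISA descriptions rather than from these notes; the C function names are hypothetical and corner cases on real hardware may differ.

```c
#include <stdint.h>

/* bfe: extract `width` bits of src starting at bit `offset`.
   Only the low 5 bits of offset and width are used. */
uint32_t bfe_u32(uint32_t src, uint32_t offset, uint32_t width)
{
    return (src >> (offset & 31u)) & ((1u << (width & 31u)) - 1u);
}

/* bfm: build a mask of `width` ones shifted left by `offset`. */
uint32_t bfm_b32(uint32_t width, uint32_t offset)
{
    return ((1u << (width & 31u)) - 1u) << (offset & 31u);
}

/* ff1: index of the lowest set bit, or -1 when x is zero. */
int ff1_b32(uint32_t x)
{
    for (int i = 0; i < 32; i++)
        if (x & (1u << i))
            return i;
    return -1;
}
```

For example, bfe_u32(0xABCD1234, 8, 8) pulls out the byte 0x12, and bfm_b32(4, 8) builds the mask 0xF00 that selects a 4-bit field at bit 8. On the GPU each of these is a single instruction, which is the reason to prefer the intrinsic over hand-written shift-and-mask sequences.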
6. Registers, local
Do not declare global data; declare temporaries locally.
Example:
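1. measure kernel time
cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
queue = clCreateCommandQueueWithProperties(context, device, props, &err);
clEnqueueNDRangeKernel(queue, kernel, ..., &event);
clWaitForEvents(1, &event);
cl_ulong startTime, endTime;  /* timestamps in nanoseconds */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endTime, NULL);
/* kernel time in ms = (endTime - startTime) * 1e-6 */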
2. integration Histogram
http://blog.csdn.net/10km/article/details/51610735
http://www.doc88.com/p-9933630209054.html
REFERENCE:
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472054