AMD OpenCL on GCN: Optimization Summary

weixin_30292843

Original link: http://www.cnblogs.com/neophy/p/6245426.html

Debug tools: CodeXL
Arch : GCN
　　1 CU = 4 x SIMD (16 lanes each; a wavefront runs over 4 consecutive cycles) = 64 threads
　　1 CU = 4 x VALU + 1 x SALU + LDS (max 32 KB per work-group) + 16 KB L1 cache
　　1 wavefront == 64 threads
　　max 256 VGPRs
　　max 101 SGPRs
　　inline constants: integer [-16, 64], float [1/(2*PI), -4.0, -2.0, -1.0, -0.5, ...]
Memory type:
　　__private registers
　　__constant uniform (read-only constant)
　　__local LDS
　　__global SSBO (shader storage buffer)
　　__image
　　flat memory (gfx9 and above)

1. Intrinsics / built-in functions (assembly-level optimization): use intrinsics to take advantage of special hardware instructions
　　ballot EXEC
　　mbcnt D = CountOneBits(S0 & ThreadMask[31:0]) + S1
　　bcnt
　　barycentric coordinate
　　readfirstlane converts a VGPR value to an SGPR, which can reduce VGPR pressure
　　interp vertex parameter
　　swizzle
　　permute (ds, ...)
　　fp16
　　med3,min3,max3
　　fma
　　msad
　　cube
　　div_fixup
　　?trig_
　　bit operations
　　......
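
To make the ballot and mbcnt entries above concrete, here is a minimal CPU-side sketch in plain C that emulates them for one 64-lane wavefront and uses them for stream compaction. The names `wave_ballot`, `wave_mbcnt` and `wave_compact` are hypothetical helpers for illustration, not AMD intrinsics, and the loop over lanes stands in for what the hardware does in parallel.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64  /* one GCN wavefront */

/* Portable population count for a 64-bit mask (what bcnt computes). */
static unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1; ++n; }  /* clear the lowest set bit each step */
    return n;
}

/* Emulated ballot: bit i is set when lane i's predicate is non-zero. */
static uint64_t wave_ballot(const int pred[WAVE_SIZE]) {
    uint64_t mask = 0;
    for (int lane = 0; lane < WAVE_SIZE; ++lane)
        if (pred[lane]) mask |= (uint64_t)1 << lane;
    return mask;
}

/* Emulated mbcnt: number of set bits of `mask` below `lane`, plus `base`. */
static unsigned wave_mbcnt(uint64_t mask, int lane, unsigned base) {
    assert(lane >= 0 && lane < WAVE_SIZE);
    uint64_t thread_mask = ((uint64_t)1 << lane) - 1;  /* bits [lane-1:0] */
    return popcount64(mask & thread_mask) + base;
}

/* Stream compaction: keep the positive inputs, preserving lane order.
 * Returns how many values were written to `out`. */
static unsigned wave_compact(const int in[WAVE_SIZE], int out[WAVE_SIZE]) {
    int pred[WAVE_SIZE];
    for (int lane = 0; lane < WAVE_SIZE; ++lane)
        pred[lane] = in[lane] > 0;
    uint64_t mask = wave_ballot(pred);
    for (int lane = 0; lane < WAVE_SIZE; ++lane)
        if (pred[lane])
            out[wave_mbcnt(mask, lane, 0)] = in[lane];  /* slot = rank among kept lanes */
    return popcount64(mask);
}
```

Each surviving lane gets its output slot from a single mask and a bit count, with no atomics and no scan through local memory, which is why these instructions are worth reaching for.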

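2. Instruction latency

Floating-point instructions have much lower latency than integer instructions, so there is no need for "integer quantization" optimizations.

Taking the Trinity APU as an example, FP Add latency on the GPU is 5.7x that of the CPU, while Int Add latency on the GPU is 45.3x that of the CPU.
Memory latency is also somewhat longer on the GPU than on the CPU.
Branch latency is especially long.
Launch latency is long, because launching involves uploading the kernel and buffers, setting CU registers, and so on. (If a game compiled all of its shaders together, or into subroutines that can be uploaded separately, would startup be much faster?)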

3. Channel / rank / bank conflicts
Avoid channel and bank conflicts. Memory is best laid out in tiles, with one tile per work-group.
Specify the default local_size:
__attribute__((reqd_work_group_size(X, Y, Z)))
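
One possible reading of "one tile per group" is sketched below in plain C: the image is reordered so that every tile is contiguous in memory, and each work-group then touches a single contiguous block. The 8 x 8 tile shape (64 elements, matching one wavefront) and the helper names `tiled_offset` and `linear_to_tiled` are assumptions made for illustration; the best tile size depends on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile shape: 8 x 8 = 64 elements, one wavefront per tile. */
#define TILE_W 8
#define TILE_H 8

/* Map element (x, y) of a `width`-wide image to its offset in tiled storage.
 * Tiles are stored one after another in row-major tile order, and elements
 * inside a tile are row-major too, so every tile is contiguous. */
static size_t tiled_offset(size_t x, size_t y, size_t width) {
    assert(width % TILE_W == 0);  /* assumption: no partial tiles */
    size_t tiles_per_row = width / TILE_W;
    size_t tile_index = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t in_tile = (y % TILE_H) * TILE_W + (x % TILE_W);
    return tile_index * (size_t)(TILE_W * TILE_H) + in_tile;
}

/* Copy a row-major image into tiled order on the host before upload. */
static void linear_to_tiled(const float *src, float *dst,
                            size_t width, size_t height) {
    assert(height % TILE_H == 0);  /* assumption: no partial tiles */
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            dst[tiled_offset(x, y, width)] = src[y * width + x];
}
```

With this layout, a kernel declared with reqd_work_group_size(8, 8, 1) would find the data for group g in the 64 elements starting at offset 64 * g.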

4. Branch elimination
GPU branches are very inefficient; eliminate jumps by using select instructions and loop unrolling.
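
The idea can be sketched in plain C as below: a data-dependent `if` is replaced by a bitwise select, and the loop is unrolled by 4 with independent accumulators. The function names are hypothetical; in OpenCL C the built-in select(a, b, c) plays the role of `select_i32` here, and the compiler may already do some of this on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy reference: sum of the elements greater than `threshold`. */
static int32_t sum_above_branchy(const int32_t *v, int n, int32_t threshold) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        if (v[i] > threshold) sum += v[i];
    }
    return sum;
}

/* Bitwise select: returns a when cond is 1, b when cond is 0. */
static int32_t select_i32(int32_t cond, int32_t a, int32_t b) {
    int32_t mask = -cond;  /* 1 -> all ones, 0 -> all zeros */
    return (a & mask) | (b & ~mask);
}

/* Same result with no data-dependent branch, unrolled by 4. */
static int32_t sum_above_select(const int32_t *v, int n, int32_t threshold) {
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* independent accumulators */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += select_i32(v[i]     > threshold, v[i],     0);
        s1 += select_i32(v[i + 1] > threshold, v[i + 1], 0);
        s2 += select_i32(v[i + 2] > threshold, v[i + 2], 0);
        s3 += select_i32(v[i + 3] > threshold, v[i + 3], 0);
    }
    for (; i < n; ++i)  /* leftover elements when n is not a multiple of 4 */
        s0 += select_i32(v[i] > threshold, v[i], 0);
    return s0 + s1 + s2 + s3;
}
```

Both variants do the same arithmetic; the second simply never lets the data decide which instructions run, so all lanes of a wavefront stay on the same path.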
5. Bit operations:
bfm ballot mbcnt bfe bitcmp wqm quadmask ff0 ff1 flbit bitset0 bitset1 saveexec
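
For readers unfamiliar with these mnemonics, the following plain C sketch emulates a few of them (bfm, bfe, bcnt, ff1, ff0) as they are commonly described. The function names and exact edge-case behaviour are assumptions for illustration; the ISA manual is the authority on the real instructions.

```c
#include <assert.h>
#include <stdint.h>

/* bfm: bit-field mask, `width` ones shifted left by `offset` (both mod 32). */
static uint32_t bfm_u32(uint32_t width, uint32_t offset) {
    return ((1u << (width & 31u)) - 1u) << (offset & 31u);
}

/* bfe (unsigned): extract `width` bits of `src` starting at bit `offset`. */
static uint32_t bfe_u32(uint32_t src, uint32_t offset, uint32_t width) {
    return (src >> (offset & 31u)) & ((1u << (width & 31u)) - 1u);
}

/* bcnt: number of set bits. */
static int bcnt_u32(uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1u; ++n; }  /* clear the lowest set bit each step */
    return n;
}

/* ff1: index of the lowest set bit, or -1 when no bit is set. */
static int ff1_u32(uint32_t x) {
    if (x == 0) return -1;
    int pos = 0;
    while ((x & 1u) == 0) { x >>= 1; ++pos; }
    return pos;
}

/* ff0: index of the lowest clear bit, or -1 when every bit is set. */
static int ff0_u32(uint32_t x) {
    return ff1_u32(~x);
}
```

A typical use is packing several small fields into one register: bfm builds the mask for a field, bfe reads it back, and ff1 on a ballot mask finds the first active lane.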
6. Registers, local
Do not declare global data; declare temporaries locally.

Example:
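1. measure kernel time

cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };  /* profiling must be enabled on the queue */
queue = clCreateCommandQueueWithProperties(context, device, props, &err);
clEnqueueNDRangeKernel(queue, kernel, ..., &event);
clWaitForEvents(1, &event);  /* timestamps are valid only after the command completes */

cl_ulong startTime, endTime;  /* device timestamps in nanoseconds */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endTime, NULL);
/* kernel time in ns = endTime - startTime */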
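
The two timestamps queried above are in nanoseconds, so a small helper is usually enough to turn them into something readable. The sketch below is plain C with hypothetical names (`elapsed_ms`, `average_ms`); it uses uint64_t in place of cl_ulong so that it compiles without the OpenCL headers, and it skips a caller-chosen number of warm-up launches, since the first launches tend to include one-time setup cost.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed kernel time in milliseconds from START/END timestamps (ns). */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns) {
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1e6;
}

/* Average time over `runs` launches, ignoring the first `warmup` of them.
 * start_ns[i] and end_ns[i] are the timestamps of launch i. */
static double average_ms(const uint64_t *start_ns, const uint64_t *end_ns,
                         int runs, int warmup) {
    assert(warmup >= 0 && warmup < runs);
    double total = 0.0;
    for (int i = warmup; i < runs; ++i)
        total += elapsed_ms(start_ns[i], end_ns[i]);
    return total / (double)(runs - warmup);
}
```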

2. integration Histogram
http://blog.csdn.net/10km/article/details/51610735
http://www.doc88.com/p-9933630209054.html

REFERENCE:
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472054

Reposted from: https://www.cnblogs.com/neophy/p/6245426.html