On data alignment (__align__)
Reference:
https://www.quora.com/In-CUDA-programming-what-are-Alignment-Requirements
Memory alignment requirements on GPUs are fairly strict, and making memory access efficient is usually the first step in optimising. This is similar to CPUs, where well-aligned memory access yields better timings; on GPUs, 16-byte alignment in particular helps memory access. To get it, define a structure or class type whose alignment is 16 bytes, and organise the data as an array of structures or a structure of arrays. A single 16-byte access can then fetch four 32-bit ints, four floats, or two doubles at once.
Some processors require that objects must be stored in memory at an address that is evenly divisible by some number, which is called the alignment of that object.
For example an array of 32-bit integers with 4-byte alignment must be stored at an address that is evenly divisible by 4.
Many CPUs support a small alignment size for all objects, maybe 1 byte.
GPUs often have more stringent restrictions, for example that primitive objects have natural alignment (the alignment is equal to the size of the object).
Alignment restrictions exist because they can be implemented more efficiently in hardware. In particular, accesses to objects with natural alignment that are smaller than the cache line size can always be satisfied with one cache access. If the alignment is smaller than the object size, then the object can span cache lines, and the hardware required to access it becomes more complex.
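To make the two rules above concrete, here is a minimal sketch in plain C++ (it also builds as CUDA host code). The names is_aligned, lines_touched and kCacheLine are made up for this note, and the 128-byte cache line is only an assumed value; real hardware varies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size for illustration only; real hardware varies.
constexpr std::size_t kCacheLine = 128;

// True when addr is evenly divisible by alignment.
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    return addr % alignment == 0;
}

// Number of cache lines touched by an object of `size` bytes stored at addr.
constexpr std::size_t lines_touched(std::uintptr_t addr, std::size_t size) {
    return (addr + size - 1) / kCacheLine - addr / kCacheLine + 1;
}

// Self-check of the cases discussed in the text.
void check_alignment_examples() {
    // A 32-bit integer with 4-byte alignment must sit at a multiple of 4.
    assert(is_aligned(0x1000, 4));
    assert(!is_aligned(0x1002, 4));
    // An 8-byte object with natural (8-byte) alignment fits in one line...
    assert(is_aligned(120, 8) && lines_touched(120, 8) == 1);
    // ...but with only 4-byte alignment it can straddle two lines.
    assert(!is_aligned(124, 8) && lines_touched(124, 8) == 2);
}
```

The second pair of checks is the interesting one: the same 8-byte object costs one cache access at address 120 and two at address 124.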
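The built-in variables we normally use are laid out contiguously and aligned in memory, but for a user-defined type or struct this is not guaranteed: the compiler may insert padding between members, and the struct as a whole is only aligned as strictly as its most strictly aligned member. We can explicitly use the __align__ specifier
so that objects of a struct we define are stored at aligned addresses.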
For structures, the size and alignment requirements can be enforced by the compiler using the alignment specifiers __align__(8)
or __align__(16).
For example:
struct __align__(16) Vec4   // Vec4 is a placeholder name; the original struct was unnamed
{
    float a;
    float b;
    float c;
    float d;
};
The members a, b, c, and d are contiguous in memory, and every object of this type starts at a 16-byte boundary.
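A quick way to see what the specifier changes is to compare the same four floats with and without it. This is only a sketch: PlainVec4, AlignedVec4 and on_16_byte_boundary are names made up for this note, and the #define is a stand-in (an assumption, not part of CUDA) so the code also builds with a plain C++11 compiler when nvcc is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if !defined(__CUDACC__) && !defined(__align__)
// Assumption: stand-in so the sketch also builds without nvcc.
#define __align__(n) alignas(n)
#endif

// Hypothetical structs for illustration.
struct PlainVec4 { float a, b, c, d; };                 // default alignment
struct __align__(16) AlignedVec4 { float a, b, c, d; }; // forced to 16 bytes

// Both hold the same four contiguous floats...
static_assert(sizeof(PlainVec4) == 4 * sizeof(float), "no padding expected");
static_assert(sizeof(AlignedVec4) == sizeof(PlainVec4), "same size");
static_assert(offsetof(AlignedVec4, d) == 3 * sizeof(float), "d is last");
// ...but only the second one is guaranteed a 16-byte boundary.
static_assert(alignof(PlainVec4) == alignof(float), "float alignment only");
static_assert(alignof(AlignedVec4) == 16, "16-byte alignment");

// True when p sits on a 16-byte boundary.
bool on_16_byte_boundary(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Every element of an array stays aligned, because the size of the
// struct is a multiple of its alignment.
void check_array_alignment() {
    AlignedVec4 v[4] = {};
    for (const AlignedVec4& e : v) assert(on_16_byte_boundary(&e));
}
```

The layout of the members is identical in both structs; what __align__(16) adds is the guarantee about where each object starts.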
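Purpose:
Coalesced memory access. Coalescing is especially important on the GPU side. With 16-byte alignment, each thread can fetch the whole struct in one 16-byte access instead of four separate 4-byte ones.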
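To get a feel for why alignment matters for coalescing, the sketch below counts how many memory segments one warp touches when thread i reads the i-th 16-byte element. It is a host-side counting model, not GPU code, and a simplification: segments_touched is a made-up name, and the 128-byte segment size is an assumption (the real transaction granularity depends on the GPU architecture).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::size_t kWarpSize = 32;       // threads per warp on NVIDIA GPUs
constexpr std::size_t kSegmentBytes = 128;  // assumed; varies by architecture

// Counts the distinct segments touched when thread i of one warp reads
// elem_bytes bytes starting at base + i * elem_bytes.
std::size_t segments_touched(std::uintptr_t base, std::size_t elem_bytes) {
    std::set<std::uintptr_t> segments;
    for (std::size_t i = 0; i < kWarpSize; ++i) {
        const std::uintptr_t first = base + i * elem_bytes;
        const std::uintptr_t last = first + elem_bytes - 1;
        for (std::uintptr_t s = first / kSegmentBytes;
             s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Self-check: an aligned base needs the minimum number of segments,
// a base shifted by 4 bytes spills into one extra segment.
void check_coalescing_model() {
    assert(segments_touched(0, 16) == 4);  // 32 * 16 B = 512 B = 4 segments
    assert(segments_touched(4, 16) == 5);  // bytes 4..515 cover 5 segments
}
```

In this model the misaligned array costs one extra segment per warp, which is the kind of waste that aligned data avoids.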