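1 Each SM supports a maximum of 8 resident blocks
2 Each SM supports a maximum of 1024 resident threads (to confirm; the limit depends on compute capability)
3 The SM splits each block into warps of 32 threads
4 Maximum shared memory: 16 KB per SM
5 Maximum registers per SM? (to check)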
6 Ratio of memory I/O to computation
7 Bank conflicts
8 Reduction
9 Memory coalescing -> load global data into shared memory with consecutive (coalesced) accesses
10 Issue long-latency instructions earlier?
11 Loop unrolling
12 Prefetching
13 Use texture and constant memory
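
Items 1 to 5 combine into a rough occupancy estimate: the number of blocks resident on one SM is the tightest of the per-resource limits. A minimal host-side C++ sketch (it should also compile as CUDA host code). `SmLimits`, `warpsPerBlock`, and `residentBlocksPerSm` are hypothetical names; the register file size is not given in these notes, so it is a caller-supplied assumption, and allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits. The first three fields mirror items 1, 2 and 4;
// the register count is NOT in the notes and must be supplied by the caller.
struct SmLimits {
    int maxBlocks;       // item 1: 8
    int maxThreads;      // item 2: 1024 (marked uncertain in the notes)
    int sharedMemBytes;  // item 4: 16 KB
    int registers;       // item 5: open question, assumed value
};

constexpr int kWarpSize = 32;  // item 3

// Number of warps a block is split into (a partial warp still occupies a slot).
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Blocks that fit on one SM at once: the minimum over all resource limits.
// Simplification: ignores register and shared-memory allocation granularity.
int residentBlocksPerSm(const SmLimits& sm, int threadsPerBlock,
                        int sharedBytesPerBlock, int regsPerThread) {
    int byBlocks = sm.maxBlocks;
    int byWarps = (sm.maxThreads / kWarpSize) / warpsPerBlock(threadsPerBlock);
    int byShared = sharedBytesPerBlock > 0
                       ? sm.sharedMemBytes / sharedBytesPerBlock
                       : sm.maxBlocks;  // no shared memory used: not a limit
    int byRegs = regsPerThread > 0
                     ? sm.registers / (regsPerThread * threadsPerBlock)
                     : sm.maxBlocks;    // register use unknown: not a limit
    return std::min({byBlocks, byWarps, byShared, byRegs});
}
```

With the limits from the notes and an assumed 16384 registers, a 256-thread block using 4 KB of shared memory and 10 registers per thread is capped at 4 resident blocks by both the thread and shared-memory limits. On real hardware the CUDA runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` gives the authoritative answer.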
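
For item 7, a small host-side C++ model of how a strided shared-memory access maps onto banks. This is a simplified sketch, not a hardware-exact model: it assumes successive 32-bit words fall in successive banks and that threads reading the same word are served together. `bankConflictDegree` is a hypothetical helper, and the bank count is a parameter because it differs between GPU generations (16 on the oldest parts, 32 later).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Thread t reads 32-bit word (t * strideWords) of shared memory.
// Returns the number of distinct words that land in the busiest bank,
// i.e. how many serialized passes the access needs (1 = conflict-free).
int bankConflictDegree(int numThreads, int strideWords, int numBanks) {
    std::vector<std::set<int>> wordsPerBank(numBanks);
    for (int t = 0; t < numThreads; ++t) {
        int word = t * strideWords;
        wordsPerBank[word % numBanks].insert(word);  // same word counts once
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

In this model, stride 1 and any odd stride are conflict-free, stride 2 gives a 2-way conflict, and a stride equal to the bank count sends every thread to the same bank. That is why a 32-word-wide tile is often padded to 33 words per row: `bankConflictDegree(32, 32, 32)` is 32, while `bankConflictDegree(32, 33, 32)` is 1.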
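
For item 8, the index pattern of a tree reduction with sequential addressing, emulated on the host in C++. This is only a sketch of the access pattern, not a GPU kernel: the inner loop stands in for the active threads of one block, and on the device each step would be followed by `__syncthreads()`. `blockReduceSum` is a hypothetical name, and the input length is assumed to be a non-zero power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one block reducing its shared-memory array to a single sum.
// Precondition (assumed): data.size() is a non-zero power of two.
float blockReduceSum(std::vector<float> data) {
    // The stride halves each step: n/2, n/4, ..., 1.
    for (std::size_t stride = data.size() / 2; stride > 0; stride /= 2) {
        // "Threads" 0..stride-1 are active; each adds the element one
        // stride away. Active threads touch consecutive words.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            data[tid] += data[tid + stride];
        }
        // On the GPU: __syncthreads() here before the next step.
    }
    return data[0];
}
```

Because the active threads are always the contiguous range 0..stride-1 and read consecutive words, this variant keeps whole warps either active or idle and avoids the strided shared-memory accesses of the interleaved version.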