GPU Basic Concepts Explained

Latest recommended article published 2024-05-01 00:03:32

weixin_30449453

Original link: http://www.cnblogs.com/qingsunny/p/3402732.html

1 multiprocessor <-> 1 instruction unit <-> 8 processors <-> executes one warp at a time <-> 32 threads

Each processor handles one thread at a time, so a multiprocessor with 8 processors needs 4 clock cycles to execute one instruction for a 32-thread warp (32 / 8 = 4).
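The arithmetic above can be written out as a tiny host-side calculation. This is only a sketch in Python, not device code; the function name `cycles_per_warp_instruction` and its parameters are made up for illustration, and it assumes each processor handles exactly one thread per clock cycle.

```python
import math


def cycles_per_warp_instruction(warp_size: int = 32, processors: int = 8) -> int:
    """Clock cycles needed to run one instruction for every thread in a warp,
    assuming each processor handles one thread per cycle (hypothetical model)."""
    return math.ceil(warp_size / processors)


# 32 threads spread over 8 processors
print(cycles_per_warp_instruction())  # 4
```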

A multiprocessor can process multiple blocks, but a given block can reside on only one multiprocessor.

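Each block has its own shared memory, which is divided into 16 banks, exactly enough to serve a half-warp (a warp has 32 threads, so a half-warp has 16). When several threads access different addresses in the same bank at the same time, a bank conflict occurs: the accesses are queued and served one after another. That is effectively sequential execution, no different from a CPU, and the GPU's advantage is lost, so this situation should be avoided wherever possible. (When the threads read the same address, the bank broadcasts the value and no conflict occurs.)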

A processor can handle only one thread at a time; the GPU is faster than a CPU because it has many processors.

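Coalescing: global memory transactions are issued per half-warp (16 threads). For the accesses to coalesce, the address accessed by the first thread must be a multiple of 64 bytes (for 4-byte words), and the remaining 15 threads must access the following words one-to-one, in order. Only then can the half-warp be served by a batched global memory transaction; otherwise each thread performs its own global memory transaction one at a time, which is very inefficient.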
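The coalescing condition can be expressed as a small check over the byte addresses requested by one half-warp. The following is a host-side sketch in Python, not device code; `is_coalesced` and `transaction_count` are hypothetical helper names, and the model assumes 4-byte words, so that 16 threads cover exactly one 64-byte segment.

```python
HALF_WARP = 16                           # threads per global memory transaction unit
WORD_BYTES = 4                           # assumed element size
SEGMENT_BYTES = HALF_WARP * WORD_BYTES   # 64 bytes


def is_coalesced(byte_addresses) -> bool:
    """True if thread k accesses base + k * 4 and base is 64-byte aligned."""
    if len(byte_addresses) != HALF_WARP:
        return False
    base = byte_addresses[0]
    if base % SEGMENT_BYTES != 0:
        return False
    return all(addr == base + k * WORD_BYTES
               for k, addr in enumerate(byte_addresses))


def transaction_count(byte_addresses) -> int:
    """One batched transaction if coalesced, otherwise one per thread."""
    return 1 if is_coalesced(byte_addresses) else len(byte_addresses)


aligned = [128 + 4 * k for k in range(HALF_WARP)]      # starts on a 64-byte boundary
misaligned = [132 + 4 * k for k in range(HALF_WARP)]   # shifted by one word
print(transaction_count(aligned))     # 1
print(transaction_count(misaligned))  # 16
```

In this simplified model a single shifted word is enough to fall back to 16 separate transactions, which is why the starting address of an array and the thread-to-element mapping both matter.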

- Each active thread is allocated some registers for the entire lifetime of the thread

- Each active block is allocated shared memory for the entire lifetime of the block

    - Very fast on-chip memory

    - Can be used to avoid non-coalesced global memory accesses

    - Can be used to reduce global memory accesses

    - Shared memory is organized into 16 banks, where successive 4-byte words are assigned to successive banks

    - A memory load or store of n addresses by a half-warp that spans n distinct memory banks can be serviced simultaneously

    - If multiple addresses map to the same memory bank, the accesses are serialized

    - If multiple threads request the same memory address, a broadcast occurs
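The bank rules listed above can be tried out with a small host-side model. This is a sketch in Python, not device code; `bank_of` and `conflict_degree` are hypothetical names, and broadcast is simplified: threads that read the same 4-byte word are counted as a single access.

```python
from collections import defaultdict

NUM_BANKS = 16   # shared memory banks, as stated above
WORD_BYTES = 4   # successive 4-byte words map to successive banks


def bank_of(byte_address: int) -> int:
    """Bank that holds the 4-byte word containing this byte address."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(byte_addresses) -> int:
    """Serialized passes needed for one half-warp request.

    Distinct words in the same bank are served one after another;
    identical words count once (simplified broadcast assumption).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        words_per_bank[bank_of(addr)].add(addr // WORD_BYTES)
    return max(len(words) for words in words_per_bank.values())


print(conflict_degree([4 * t for t in range(16)]))   # stride 1 word: 1 (no conflict)
print(conflict_degree([8 * t for t in range(16)]))   # stride 2 words: 2
print(conflict_degree([64 * t for t in range(16)]))  # stride 16 words: 16
print(conflict_degree([0] * 16))                     # same address: 1 (broadcast)
```

A degree of 1 means the half-warp is served in a single pass; a degree of 16 means the accesses are fully serialized.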

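Threads in the same block share that block's shared memory, and only threads in the same block can synchronize with __syncthreads(). That is the idea behind shared memory. Global memory is visible to every thread but has no notion of a block: it does not group threads into blocks so that threads in the same block can cooperate.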

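In device code, a variable declared without a memory-space qualifier is placed in registers (the compiler may move large arrays or spilled values to local memory instead).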

Reposted from: https://www.cnblogs.com/qingsunny/p/3402732.html