CUDA Example: Accelerating Matrix Multiplication

momo大魔王

Article link: https://blog.csdn.net/weixin_38087754/article/details/104790818

Ref: CUDA C Programming Guide

https://docs.nvidia.com/cuda/cuda-runtime-api/index.html

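1. What is CUDA?

CUDA and the NVIDIA CUDA Compiler (NVCC) let users run SIMT (single instruction, multiple threads) programs on CUDA-enabled GPUs.

Host: the ordinary computer (the CPU and its memory)

Device: the graphics card (the GPU and its memory)

SIMT: multiple threads execute the same instruction at the same time, each on different data.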
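
To make the SIMT idea concrete, here is a minimal host-side sketch in plain C++. It is not CUDA code: the names `add_one_element` and `simt_add` are hypothetical, and an ordinary loop stands in for the threads that a GPU would run concurrently. It assumes both inputs have the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread body: the single instruction stream every thread runs.
// `tid` plays the role of the thread index that CUDA would supply.
inline void add_one_element(std::size_t tid,
                            const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::vector<float>& c) {
    c[tid] = a[tid] + b[tid];  // same instruction, different data per thread
}

// Host-side emulation: a GPU would run these "threads" concurrently;
// here a plain loop visits them one after another.
// Assumes a.size() == b.size().
inline std::vector<float> simt_add(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    std::vector<float> c(a.size());
    for (std::size_t tid = 0; tid < a.size(); ++tid) {
        add_one_element(tid, a, b, c);
    }
    return c;
}
```

In a real CUDA program the loop disappears: the per-thread body becomes the kernel, and the hardware supplies one thread index per element.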

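2. GPU vs. CPU

If the CPU is the brain of the PC, the GPU is its soul. As technology has advanced, GPUs are no longer confined to the PC: they have fueled the deep learning boom and are a key part of modern supercomputers.

In essence, a CPU is fast and versatile, and typically handles tasks that need a lot of interactivity (which generally cannot be parallelized). A GPU instead breaks a complex task into many small tasks that can run in parallel.

Architecturally, a CPU has few cores and a lot of cache, while a GPU has many cores and little cache. The CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.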

3. CUDA Programming Model

A GPU can be viewed as an N * M matrix of threads that run in SIMT fashion.
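
The thread-matrix view maps naturally onto matrix multiplication: one thread per output element. The following is a host-side sketch in plain C++ under that assumption, not the author's kernel. The names `matmul_thread` and `matmul_emulated` are hypothetical, matrices are assumed to be stored row-major, and nested loops stand in for the threads a GPU would run in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread body: the thread at (row, col) computes one element
// of C = A * B. A is n x k, B is k x m, C is n x m, all stored row-major.
inline void matmul_thread(std::size_t row, std::size_t col,
                          const std::vector<float>& A,
                          const std::vector<float>& B,
                          std::vector<float>& C,
                          std::size_t k, std::size_t m) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < k; ++i) {
        sum += A[row * k + i] * B[i * m + col];
    }
    C[row * m + col] = sum;
}

// Host-side emulation of an n x m thread matrix: each (row, col) pair is one
// "thread". On a GPU these would execute concurrently instead of in a loop.
inline std::vector<float> matmul_emulated(const std::vector<float>& A,
                                          const std::vector<float>& B,
                                          std::size_t n, std::size_t k,
                                          std::size_t m) {
    std::vector<float> C(n * m, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < m; ++col) {
            matmul_thread(row, col, A, B, C, k, m);
        }
    }
    return C;
}
```

Because no thread writes an element that another thread reads or writes, the per-element work is independent, which is what makes this problem a good fit for the GPU.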

Threads：

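A thread is the most basic unit in CUDA. Each thread has its own program counter and registers. Threads in the same block share a region of memory (shared memory), which is very fast.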

GMEM Coalescing: access global memory (GMEM) in as few transactions as possible

Memory accesses are handled per half-warp (on compute capability 1.x devices; later devices handle them per warp)
Carry out the smallest possible number of transactions
Reduce transaction size when possible
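
Why the access pattern matters can be illustrated with a deliberately simplified model in plain C++. The assumptions are mine, not the exact hardware rule: memory is split into aligned segments, a group of threads needs one transaction per distinct segment it touches, and thread `t` reads the element at index `t * stride_elems`. The function name `transactions_needed` is hypothetical.

```cpp
#include <cstddef>
#include <set>

// Simplified coalescing model (an assumption, not the exact hardware rule):
// memory is divided into aligned segments of `segment_bytes`, and a group of
// threads needs one transaction per distinct segment it touches.
inline std::size_t transactions_needed(std::size_t threads,
                                       std::size_t stride_elems,
                                       std::size_t elem_bytes,
                                       std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        // Byte address read by thread t, relative to an aligned base address.
        std::size_t byte_addr = t * stride_elems * elem_bytes;
        segments.insert(byte_addr / segment_bytes);
    }
    return segments.size();
}
```

Under this model, a half-warp of 16 threads reading consecutive 4-byte values falls into a single 64-byte segment, while a stride of 16 elements puts every thread in its own segment and needs 16 transactions.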

Blocks：

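CUDA schedules work in units of blocks, so threads must be grouped into blocks in order to run.
Compute capability describes the feature set of the device (GPU), not of an individual block. Devices with compute capability below 2.0 allow at most 512 threads per block; devices with 2.0 or higher allow up to 1024.
A block must be given a dimensionality (1, 2, or 3), and all blocks in a launch must have the same dimensions. When the data does not divide evenly into blocks, the surplus threads are left idle.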
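
The block arithmetic can be sketched on the host in plain C++. All function names here are hypothetical helpers for illustration; `global_index` mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x` for the one-dimensional case.

```cpp
#include <cstddef>

// Maximum threads per block, following the limits stated above:
// 512 below compute capability 2.0, 1024 from 2.0 onward.
inline std::size_t max_threads_per_block(int cc_major) {
    return cc_major < 2 ? 512 : 1024;
}

// Number of equally sized blocks needed to cover n elements (ceiling division).
inline std::size_t blocks_needed(std::size_t n, std::size_t block_size) {
    return (n + block_size - 1) / block_size;
}

// Global index of a thread in a 1D launch,
// mirroring blockIdx.x * blockDim.x + threadIdx.x.
inline std::size_t global_index(std::size_t block_idx,
                                std::size_t block_dim,
                                std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Threads launched beyond n have no element to work on and must stay idle.
inline std::size_t idle_threads(std::size_t n, std::size_t block_size) {
    return blocks_needed(n, block_size) * block_size - n;
}
```

For example, covering 1000 elements with blocks of 256 threads takes 4 blocks, and the last 24 threads have nothing to do. In a kernel this is usually handled with a bounds check such as `if (idx < n)` before touching memory.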

Grids：

A matrix made up of blocks (1, 2
