CUDA Programming: SM (Streaming Multiprocessor)

physuleo

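This article introduces the basics of CUDA programs. A typical CUDA program has five main steps, including allocating GPU memory, copying data, and launching a kernel. When a kernel is launched from the CPU side, execution moves to the GPU, where a large number of threads are generated. CUDA organizes threads in a two-level hierarchy: thread blocks, and grids made up of blocks. The article also covers how a kernel function is defined, the restrictions on kernels, and the differences between CPU and GPU cores.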

Parts of this article are taken from:

http://ju.outofmemory.cn/entry/292387

http://www.hds.bme.hu/~fhegedus/C++/Professional%20CUDA%20C%20Programming.pdf

A GPU contains multiple Streaming Multiprocessors (SMs), and each SM in turn contains multiple cores. An SM supports the concurrent execution of hundreds of threads.
A thread block is scheduled onto exactly one SM and stays there until the block has finished running. A single SM can run multiple thread blocks at the same time.
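
The SM count and per-SM thread limits of a particular GPU can be read through the CUDA runtime. The following is a minimal sketch, assuming device 0 is the GPU of interest; the helper name `smCount` is hypothetical and not part of CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of SMs on `device`, or -1 if the query fails
// (for example when no CUDA-capable GPU is present).
int smCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}

int main() {
    const int device = 0;  // assumption: device 0 is the GPU we care about
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::printf("device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s\n", prop.name);
    std::printf("  SMs:                        %d\n", smCount(device));
    std::printf("  max resident threads / SM:  %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("  max threads / block:        %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

Because one block lives entirely on one SM, the per-block thread limit is never larger than the per-SM limit, and several smaller blocks can share an SM.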

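There are two ways to partition the data here. Block partitioning divides the data evenly by the number of threads: 10 threads split the data into 10 chunks, and each thread processes one chunk. Cyclic partitioning uses more chunks than threads: for example, 10 threads split the data into 20 chunks, the first thread processes chunks 1 and 11, the second thread processes chunks 2 and 12, and so on, so each thread loops over several chunks.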
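
The two partitioning schemes can be sketched as two kernels that scale an array in place. This is an illustrative sketch, not code from the sources above: the names `blockBegin`, `blockEnd`, `scaleBlock`, `scaleCyclic`, and `runBoth` are hypothetical, and in the cyclic kernel each "chunk" is assumed to be a single element.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Block partitioning: thread t owns one contiguous range [begin, end).
__host__ __device__ inline int blockBegin(int t, int nThreads, int n) {
    int chunk = (n + nThreads - 1) / nThreads;  // ceil(n / nThreads)
    int b = t * chunk;
    return b < n ? b : n;
}
__host__ __device__ inline int blockEnd(int t, int nThreads, int n) {
    int chunk = (n + nThreads - 1) / nThreads;
    int e = (t + 1) * chunk;
    return e < n ? e : n;
}

__global__ void scaleBlock(float* x, int n, float a) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    for (int i = blockBegin(t, nThreads, n); i < blockEnd(t, nThreads, n); ++i)
        x[i] *= a;
}

// Cyclic partitioning: thread t handles elements t, t + nThreads, t + 2*nThreads, ...
__global__ void scaleCyclic(float* x, int n, float a) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    for (int i = t; i < n; i += nThreads)
        x[i] *= a;
}

// Hypothetical host helper: scales n ones by 2 (block scheme) and then by 3
// (cyclic scheme) and checks that every element ends up as 6.
// Assumes n > 0 and 1 <= nThreads <= the device's per-block thread limit.
bool runBoth(int n, int nThreads) {
    std::vector<float> h(n, 1.0f);
    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    bool ok = cudaMalloc(&d, bytes) == cudaSuccess &&
              cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scaleBlock<<<1, nThreads>>>(d, n, 2.0f);
        scaleCyclic<<<1, nThreads>>>(d, n, 3.0f);  // same stream, so it runs after scaleBlock
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    for (float v : h) ok = ok && (v == 6.0f);
    return ok;
}

int main() {
    // 10 threads over 20 elements, matching the example in the text.
    std::printf("both schemes %s\n", runBoth(20, 10) ? "succeeded" : "failed");
    return 0;
}
```

With 10 threads and 20 elements, the block scheme gives thread 0 elements 0 and 1, while the cyclic scheme gives it elements 0 and 10. On a GPU the cyclic layout is often preferred for global memory, since neighboring threads then touch neighboring addresses in each iteration.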

Even though "many-core" and "multicore" are used to label GPU and CPU architectures, a GPU core is quite different from a CPU core.
A CPU core is relatively heavyweight: it is designed for very complex control logic and seeks to optimize the execution of sequential programs.
A GPU core is relatively lightweight: it is optimized for data-parallel tasks with simpler control logic and focuses on the throughput of parallel programs.

A typical CUDA program structure consists of five main steps:
1. Allocate GPU memory.
2. Copy data from CPU memory to GPU memory.
3. Invoke the CUDA kernel to perform the program-specific computation.
4. Copy the results back from GPU memory to CPU memory.
5. Free the GPU memory.
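
The five steps above can be sketched with a vector addition. This is a minimal sketch under stated assumptions: the kernel `vecAdd` and the host helper `addOnGpu` are hypothetical names, 256 threads per block is an arbitrary choice, and error handling is reduced to a single success flag.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partly idle
}

// Hypothetical helper: computes c = a + b on the GPU, returns true on success.
bool addOnGpu(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& c) {
    if (a.empty() || a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    c.resize(n);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    // Step 1: allocate GPU memory.
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;

    // Step 2: copy the inputs from CPU memory to GPU memory.
    ok = ok && cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Step 3: invoke the kernel.
    if (ok) {
        const int threads = 256;                         // assumption: arbitrary block size
        const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Step 4: copy the result back (this cudaMemcpy waits for the kernel to finish).
    ok = ok && cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // Step 5: free GPU memory (cudaFree on a null pointer is a no-op).
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}

int main() {
    std::vector<float> a(1000), b(1000), c;
    for (int i = 0; i < 1000; ++i) { a[i] = float(i); b[i] = 2.0f * i; }
    if (!addOnGpu(a, b, c)) { std::printf("GPU add failed\n"); return 1; }
    std::printf("c[999] = %.1f\n", c[999]);
    return 0;
}
```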

When a kernel function is launched from the host side (the CPU), execution moves to the device (the GPU), where a large number of threads are generated and each thread executes the statements specified by the kernel function. CUDA exposes a thread hierarchy abstraction that lets you organize your threads. It is a two-level hierarchy, decomposed into blocks of threads and grids of blocks:

All threads spawned by a single kernel launch are collectively called a grid. All threads in a grid share the same global memory space. A grid is made up of many thread blocks. A thread block is a group of threads that can cooperate with each other using:
➤ Block-local synchronization
➤ Block-local shared memory
Threads from different blocks cannot cooperate.
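
Both cooperation mechanisms show up in a per-block sum. The sketch below is illustrative, not taken from the sources: `blockSum`, `sumOnGpu`, and `kThreads` are hypothetical names, and the block size is assumed to be a power of two. Threads of one block combine their values through `__shared__` memory and `__syncthreads()`; since blocks cannot cooperate inside the kernel, each block writes one partial sum and the host adds the partials.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumption: block size, must be a power of two

__global__ void blockSum(const int* in, int* partial, int n) {
    __shared__ int cache[kThreads];             // block-local shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();                            // block-local synchronization

    // Tree reduction: halve the number of active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                        // reached by every thread in the block
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical helper: returns the sum of v, or -1 if a CUDA call fails.
long long sumOnGpu(const std::vector<int>& v) {
    if (v.empty()) return 0;
    const int n = static_cast<int>(v.size());
    const int blocks = (n + kThreads - 1) / kThreads;
    std::vector<int> partial(blocks);
    int *dIn = nullptr, *dPartial = nullptr;

    bool ok = cudaMalloc(&dIn, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&dPartial, blocks * sizeof(int)) == cudaSuccess &&
              cudaMemcpy(dIn, v.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(partial.data(), dPartial, blocks * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dPartial);
    if (!ok) return -1;

    long long total = 0;
    for (int p : partial) total += p;           // combine per-block results on the host
    return total;
}

int main() {
    std::vector<int> v(1000, 1);
    std::printf("sum = %lld\n", sumOnGpu(v));
    return 0;
}
```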

About kernels

A kernel function is the code to be executed on the device side. In a kernel function, you define the computation for a single thread and the data access for that thread. When the kernel is called, many different CUDA threads perform the same computation in parallel.

The following restrictions apply to all kernels:
➤ Access to device memory only
➤ Must have a void return type
➤ No support for a variable number of arguments
➤ No support for static variables
➤ No support for function pointers
➤ Exhibit asynchronous behavior
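
Two of these restrictions shape how results come back from a kernel. The sketch below is an illustration with hypothetical names (`square`, `squareOnGpu`): because the return type is void, the kernel writes its result through a pointer into device memory, and because the launch is asynchronous, the host waits explicitly before reading the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel cannot return a value, so the result goes out through `out`,
// which must point to device memory.
__global__ void square(int x, int* out) {
    *out = x * x;
}

// Hypothetical helper: returns x * x computed on the GPU, or -1 on failure.
int squareOnGpu(int x) {
    int* dOut = nullptr;
    int result = -1;
    if (cudaMalloc(&dOut, sizeof(int)) != cudaSuccess) return -1;

    square<<<1, 1>>>(x, dOut);
    // The launch is asynchronous: control returns to the host here, possibly
    // before the kernel has run. cudaGetLastError reports launch failures.
    bool ok = cudaGetLastError() == cudaSuccess;

    // Block the host until all previously issued device work has finished.
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;

    int h = 0;
    ok = ok && cudaMemcpy(&h, dOut, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    if (ok) result = h;

    cudaFree(dOut);
    return result;
}

int main() {
    std::printf("7 squared on the GPU: %d\n", squareOnGpu(7));
    return 0;
}
```

The explicit `cudaDeviceSynchronize()` is there to make the asynchronous behavior visible; a blocking `cudaMemcpy` on the default stream would also wait for the kernel to finish.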