CUDA C++ Programming Guide (Version 10.0): 2. Programming Model

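GPUs are commonly used for parallel computing and can run a very large number of threads. The notes below mainly describe how those threads are organized.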


CUDA C extends C by allowing the programmer to define C functions, called kernels, that when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel is the C function that defines what each CUDA thread executes.

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // A, B, and C are device pointers to N floats (allocation omitted here)
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

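A kernel is declared with the __global__ specifier, and the execution configuration of the call, written <<<...>>>, specifies the number of threads. Inside the kernel, the built-in variable threadIdx provides the index of the current thread.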
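To make the VecAdd fragment above runnable end to end, here is a minimal sketch. It assumes unified (managed) memory via cudaMallocManaged purely for brevity, which is not what these notes use later; the helper runVecAdd and the test values are hypothetical names chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the notes: thread i adds element i of A and B.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical helper: runs VecAdd on n elements with a single block and
// returns the number of wrong results, or -1 if a CUDA call fails.
// Assumes 1 <= n <= 1024, since one block holds at most 1024 threads
// on current GPUs.
int runVecAdd(int n)
{
    float *A = nullptr, *B = nullptr, *C = nullptr;
    size_t bytes = n * sizeof(float);

    // Managed memory is reachable from both host and device (an assumption
    // made for brevity; the notes use cudaMalloc and cudaMemcpy instead).
    bool ok = cudaMallocManaged(&A, bytes) == cudaSuccess &&
              cudaMallocManaged(&B, bytes) == cudaSuccess &&
              cudaMallocManaged(&C, bytes) == cudaSuccess;

    int mismatches = -1;
    if (ok) {
        for (int i = 0; i < n; ++i) {
            A[i] = static_cast<float>(i);
            B[i] = static_cast<float>(2 * i);
        }

        // Execution configuration: 1 block, n threads.
        VecAdd<<<1, n>>>(A, B, C);

        // Kernel launches are asynchronous; wait before reading C on the host.
        if (cudaDeviceSynchronize() == cudaSuccess) {
            mismatches = 0;
            for (int i = 0; i < n; ++i) {
                if (C[i] != static_cast<float>(3 * i)) ++mismatches;
            }
        }
    }

    cudaFree(A);  // cudaFree(nullptr) is a no-op
    cudaFree(B);
    cudaFree(C);
    return mismatches;
}

int main()
{
    int mismatches = runVecAdd(256);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Because the launch uses one block, threadIdx.x alone identifies the element; once more than one block is involved, blockIdx and blockDim have to enter the index as well.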

Thread Hierarchy


// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Memory Hierarchy



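In terms of dimensions, threads indexed in 1D, 2D, or 3D form a block, and multiple blocks form a grid, as illustrated by the "Grid of Thread Blocks" figure in the Programming Guide.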
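As a rough illustration of how blocks and a grid combine, the sketch below adds two n x n matrices using a 2D grid of 2D blocks. It assumes flat row-major arrays and a 16 x 16 block size; the names MatAddFlat and runMatAdd are hypothetical, and the bounds check is there because n need not be a multiple of the block size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one matrix element; its global coordinates combine
// the block index, the block size, and the thread index within the block.
__global__ void MatAddFlat(const float* A, const float* B, float* C, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {          // threads past the edge do nothing
        int idx = row * n + col;       // row-major flattening (an assumption)
        C[idx] = A[idx] + B[idx];
    }
}

// Hypothetical helper: returns the number of wrong results, or -1 on error.
int runMatAdd(int n)
{
    size_t count = static_cast<size_t>(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count, 1.0f), hB(count, 2.0f), hC(count, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // Round the grid size up so the whole matrix is covered.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((n + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (n + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAddFlat<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, n);

    // This blocking copy also reports errors from the preceding launch.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < count; ++i) {
        if (hC[i] != 3.0f) ++mismatches;
    }
    return mismatches;
}

int main()
{
    // 100 is deliberately not a multiple of 16, so the guard matters.
    int mismatches = runMatAdd(100);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```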

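Beyond the memory spaces in the guide's "Memory Hierarchy" figure (per-thread local memory, per-block shared memory, and global memory), all threads can also read two global read-only memory spaces: constant memory and texture memory.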
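Constant memory can be sketched briefly as follows. This example assumes a small coefficient table declared with __constant__ and filled from the host with cudaMemcpyToSymbol; the kernel evalCubic, the helper runEvalCubic, and the coefficient values are hypothetical. Texture memory is not covered here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read-only from kernels; written from the host with cudaMemcpyToSymbol.
__constant__ float coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 for each input (Horner form).
__global__ void evalCubic(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

// Hypothetical helper: returns the number of wrong results, or -1 on error.
int runEvalCubic()
{
    const int n = 8;
    const float hCoeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> hx(n), hy(n, 0.0f);
    for (int i = 0; i < n; ++i) hx[i] = static_cast<float>(i);

    size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);

    // Copy the coefficient table into the __constant__ symbol.
    cudaMemcpyToSymbol(coeffs, hCoeffs, sizeof(hCoeffs));

    evalCubic<<<1, n>>>(dx, dy, n);
    cudaError_t err = cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    if (err != cudaSuccess) return -1;

    // Small integer inputs keep every intermediate value exact in float.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float v = hx[i];
        float expected =
            hCoeffs[0] + v * (hCoeffs[1] + v * (hCoeffs[2] + v * hCoeffs[3]));
        if (hy[i] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    int mismatches = runEvalCubic();
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```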

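CUDA uses the <<< >>> execution configuration to set how threads are laid out and indexed. There are 15 commonly used indexing schemes: 3 thread-only, 3 block-only, and 9 block-thread combinations.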

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <iostream>

using namespace std;

// Note: none of these kernels checks bounds, so every launch must create
// exactly as many threads as there are elements.

//thread 1D
__global__ void testThread1(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = b[i] - a[i];
}

//thread 2D
__global__ void testThread2(int *c, const int *a, const int *b)
{
    int i = threadIdx.x + threadIdx.y*blockDim.x;
    c[i] = b[i] - a[i];
}

//thread 3D
__global__ void testThread3(int *c, const int *a, const int *b)
{
    int i = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    c[i] = b[i] - a[i];
}

//block 1D
__global__ void testBlock1(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    c[i] = b[i] - a[i];
}

//block 2D
__global__ void testBlock2(int *c, const int *a, const int *b)
{
    int i = blockIdx.x + blockIdx.y*gridDim.x;
    c[i] = b[i] - a[i];
}

//block 3D
__global__ void testBlock3(int *c, const int *a, const int *b)
{
    int i = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    c[i] = b[i] - a[i];
}

//block-thread 1D-1D
__global__ void testBlockThread1(int *c, const int *a, const int *b)
{
    int i = threadIdx.x + blockDim.x*blockIdx.x;
    c[i] = b[i] - a[i];
}

//block-thread 1D-2D
__global__ void testBlockThread2(int *c, const int *a, const int *b)
{
    int threadId_2D = threadIdx.x + threadIdx.y*blockDim.x;
    int i = threadId_2D + (blockDim.x*blockDim.y)*blockIdx.x;
    c[i] = b[i] - a[i];
}

//block-thread 1D-3D
__global__ void testBlockThread3(int *c, const int *a, const int *b)
{
    int threadId_3D = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    int i = threadId_3D + (blockDim.x*blockDim.y*blockDim.z)*blockIdx.x;
    c[i] = b[i] - a[i];
}

//block-thread 2D-1D
__global__ void testBlockThread4(int *c, const int *a, const int *b)
{
    int blockId_2D = blockIdx.x + blockIdx.y*gridDim.x;
    int i = threadIdx.x + blockDim.x*blockId_2D;
    c[i] = b[i] - a[i];
}

//block-thread 3D-1D
__global__ void testBlockThread5(int *c, const int *a, const int *b)
{
    int blockId_3D = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    int i = threadIdx.x + blockDim.x*blockId_3D;
    c[i] = b[i] - a[i];
}

//block-thread 2D-2D
__global__ void testBlockThread6(int *c, const int *a, const int *b)
{
    int threadId_2D = threadIdx.x + threadIdx.y*blockDim.x;
    int blockId_2D = blockIdx.x + blockIdx.y*gridDim.x;
    int i = threadId_2D + (blockDim.x*blockDim.y)*blockId_2D;
    c[i] = b[i] - a[i];
}

//block-thread 2D-3D
__global__ void testBlockThread7(int *c, const int *a, const int *b)
{
    int threadId_3D = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    int blockId_2D = blockIdx.x + blockIdx.y*gridDim.x;
    int i = threadId_3D + (blockDim.x*blockDim.y*blockDim.z)*blockId_2D;
    c[i] = b[i] - a[i];
}

//block-thread 3D-2D
__global__ void testBlockThread8(int *c, const int *a, const int *b)
{
    int threadId_2D = threadIdx.x + threadIdx.y*blockDim.x;
    int blockId_3D = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    int i = threadId_2D + (blockDim.x*blockDim.y)*blockId_3D;
    c[i] = b[i] - a[i];
}

//block-thread 3D-3D
__global__ void testBlockThread9(int *c, const int *a, const int *b)
{
    int threadId_3D = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    int blockId_3D = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    int i = threadId_3D + (blockDim.x*blockDim.y*blockDim.z)*blockId_3D;
    c[i] = b[i] - a[i];
}

// Computes c = b - a on the device. Enable exactly one launch at a time.
// Each configuration assumes size is divisible by the constants it uses
// (main passes size = 1000), so that the launch creates exactly size threads.
void addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;

    // Allocate device buffers for the two inputs and the output.
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    // Copy the inputs from host memory to device memory.
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    //testThread1<<<1, size>>>(dev_c, dev_a, dev_b);

    //uint3 s; s.x = size / 5; s.y = 5; s.z = 1;
    //testThread2<<<1, s>>>(dev_c, dev_a, dev_b);

    //uint3 s; s.x = size / 10; s.y = 5; s.z = 2;
    //testThread3<<<1, s>>>(dev_c, dev_a, dev_b);

    //testBlock1<<<size, 1>>>(dev_c, dev_a, dev_b);

    //uint3 s; s.x = size / 5; s.y = 5; s.z = 1;
    //testBlock2<<<s, 1>>>(dev_c, dev_a, dev_b);

    //uint3 s; s.x = size / 10; s.y = 5; s.z = 2;
    //testBlock3<<<s, 1>>>(dev_c, dev_a, dev_b);

    //testBlockThread1<<<size / 10, 10>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = size / 100; s1.y = 1; s1.z = 1;
    //uint3 s2; s2.x = 10; s2.y = 10; s2.z = 1;
    //testBlockThread2<<<s1, s2>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = size / 100; s1.y = 1; s1.z = 1;
    //uint3 s2; s2.x = 10; s2.y = 5; s2.z = 2;
    //testBlockThread3<<<s1, s2>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = 10; s1.y = 10; s1.z = 1;
    //uint3 s2; s2.x = size / 100; s2.y = 1; s2.z = 1;
    //testBlockThread4<<<s1, s2>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = 10; s1.y = 5; s1.z = 2;
    //uint3 s2; s2.x = size / 100; s2.y = 1; s2.z = 1;
    //testBlockThread5<<<s1, s2>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = size / 100; s1.y = 10; s1.z = 1;
    //uint3 s2; s2.x = 5; s2.y = 2; s2.z = 1;
    //testBlockThread6<<<s1, s2>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = size / 100; s1.y = 5; s1.z = 1;
    //uint3 s2; s2.x = 5; s2.y = 2; s2.z = 2;
    //testBlockThread7<<<s1, s2>>>(dev_c, dev_a, dev_b);

    //uint3 s1; s1.x = 5; s1.y = 2; s1.z = 2;
    //uint3 s2; s2.x = size / 100; s2.y = 5; s2.z = 1;
    //testBlockThread8<<<s1, s2>>>(dev_c, dev_a, dev_b);

    // Active case: 3D grid of 5*2*2 blocks, 3D blocks of (size/200)*5*2 threads.
    uint3 s1; s1.x = 5; s1.y = 2; s1.z = 2;
    uint3 s2; s2.x = size / 200; s2.y = 5; s2.z = 2;
    testBlockThread9<<<s1, s2>>>(dev_c, dev_a, dev_b);

    // Blocking copy: waits for the kernel, then brings the result to the host.
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

    // Release device memory (missing in the original listing).
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}



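int main()
{
    const int n = 1000;

    int *a = new int[n];
    int *b = new int[n];
    int *c = new int[n];   // reference result computed on the host
    int *cc = new int[n];  // result computed on the device

    for (int i = 0; i < n; i++)
    {
        a[i] = rand() % 100;
        b[i] = rand() % 100;
        c[i] = b[i] - a[i];
    }

    addWithCuda(cc, a, b, n);

    // Write host and device results side by side for inspection.
    FILE *fp = fopen("out.txt", "w");
    if (fp != NULL)
    {
        for (int i = 0; i < n; i++)
            fprintf(fp, "%d %d\n", c[i], cc[i]);
        fclose(fp);
    }

    // Compare the device result against the host reference.
    bool flag = true;
    for (int i = 0; i < n; i++)
    {
        if (c[i] != cc[i])
            flag = false;
    }

    if (flag == false)
        printf("no pass\n");
    else
        printf("pass\n");

    delete[] a;
    delete[] b;
    delete[] c;
    delete[] cc;

    return 0;
}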
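All 15 schemes above are special cases of the 3D-3D formula, since an unused dimension has size 1 and index 0. The sketch below captures that in one helper and checks on the host that a launch shape assigns every index exactly once. The names flatIndex, coversEachIndexOnce, subGuarded, and runGuardedSub are hypothetical, and the bounds-checked kernel is an addition not present in the listing above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// General 3D-3D flattening, usable on both host and device.
__host__ __device__ inline unsigned int flatIndex(dim3 grid, dim3 block,
                                                  uint3 bIdx, uint3 tIdx)
{
    unsigned int threadId = tIdx.x + tIdx.y * block.x + tIdx.z * block.x * block.y;
    unsigned int blockId = bIdx.x + bIdx.y * grid.x + bIdx.z * grid.x * grid.y;
    return threadId + (block.x * block.y * block.z) * blockId;
}

// Host-side check: every index in [0, total) is produced exactly once.
bool coversEachIndexOnce(dim3 grid, dim3 block)
{
    unsigned int total = grid.x * grid.y * grid.z * block.x * block.y * block.z;
    std::vector<int> hits(total, 0);
    for (unsigned int bz = 0; bz < grid.z; ++bz)
    for (unsigned int by = 0; by < grid.y; ++by)
    for (unsigned int bx = 0; bx < grid.x; ++bx)
    for (unsigned int tz = 0; tz < block.z; ++tz)
    for (unsigned int ty = 0; ty < block.y; ++ty)
    for (unsigned int tx = 0; tx < block.x; ++tx) {
        unsigned int i = flatIndex(grid, block, make_uint3(bx, by, bz),
                                   make_uint3(tx, ty, tz));
        if (i >= total) return false;
        ++hits[i];
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}

// Same subtraction as the kernels above, but safe when the launch creates
// more threads than there are elements.
__global__ void subGuarded(int* c, const int* a, const int* b, unsigned int n)
{
    unsigned int i = flatIndex(dim3(gridDim.x, gridDim.y, gridDim.z),
                               dim3(blockDim.x, blockDim.y, blockDim.z),
                               make_uint3(blockIdx.x, blockIdx.y, blockIdx.z),
                               make_uint3(threadIdx.x, threadIdx.y, threadIdx.z));
    if (i < n) c[i] = b[i] - a[i];
}

// Launches 20 blocks of 64 threads (1280 threads) for n <= 1280 elements.
// Returns the number of wrong results, or -1 on error.
int runGuardedSub(unsigned int n)
{
    std::vector<int> a(n), b(n), c(n, 0);
    for (unsigned int i = 0; i < n; ++i) { a[i] = (int)i; b[i] = 3 * (int)i; }

    size_t bytes = n * sizeof(int);
    int *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);

    subGuarded<<<dim3(5, 2, 2), dim3(8, 4, 2)>>>(dc, da, db, n);
    cudaError_t err = cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (unsigned int i = 0; i < n; ++i) {
        if (c[i] != 2 * (int)i) ++mismatches;
    }
    return mismatches;
}

int main()
{
    // Shape used by the active testBlockThread9 launch when size = 1000.
    bool ok = coversEachIndexOnce(dim3(5, 2, 2), dim3(5, 5, 2));
    printf("3D-3D shape covers each index once: %s\n", ok ? "yes" : "no");
    printf("guarded kernel mismatches: %d\n", runGuardedSub(1000));
    return ok ? 0 : 1;
}
```

The guard costs one comparison per thread and removes the requirement that the element count divide evenly into the launch shape.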

Heterogeneous Programming

The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes device memory allocation and deallocation as well as data transfer between host and device memory.
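The allocation, transfer, and deallocation steps named in that paragraph can be sketched with error checking, which the listing earlier in these notes omits. The CUDA_CHECK macro, the scale kernel, and the scaleOnDevice helper are hypothetical names; the runtime calls themselves (cudaMalloc, cudaMemcpy, cudaFree, cudaGetLastError, cudaGetErrorString) are part of the CUDA runtime API. The sketch assumes a non-empty input.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical convenience macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,             \
                    cudaGetErrorString(err_));                             \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Multiplies every element by factor.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host memory in, host memory out; device memory lives only inside the call.
// Assumes host is not empty (a zero-block launch would be rejected).
std::vector<float> scaleOnDevice(const std::vector<float>& host, float factor)
{
    int n = static_cast<int>(host.size());
    size_t bytes = n * sizeof(float);

    float* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, bytes));                        // allocate
    CUDA_CHECK(cudaMemcpy(dev, host.data(), bytes,
                          cudaMemcpyHostToDevice));             // host -> device

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, factor, n);
    CUDA_CHECK(cudaGetLastError());                             // launch errors

    std::vector<float> result(n);
    CUDA_CHECK(cudaMemcpy(result.data(), dev, bytes,
                          cudaMemcpyDeviceToHost));             // device -> host
    CUDA_CHECK(cudaFree(dev));                                  // deallocate
    return result;
}

int main()
{
    std::vector<float> out = scaleOnDevice({1.0f, 2.0f, 3.0f, 4.0f}, 2.0f);
    for (float v : out) printf("%g ", v);
    printf("\n");
    return 0;
}
```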

Compute Capability

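The compute capability of a device is represented by a version number, sometimes also called its "SM version". This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.

The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.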


  • The major revision number is 7 for devices based on the Volta architecture
  • The major revision number is 6 for devices based on the Pascal architecture
  • The major revision number is 5 for devices based on the Maxwell architecture
  • The major revision number is 3 for devices based on the Kepler architecture
  • The major revision number is 2 for devices based on the Fermi architecture
  • The major revision number is 1 for devices based on the Tesla architecture
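A program can read the compute capability at runtime through cudaGetDeviceProperties, whose cudaDeviceProp result has major and minor fields. The sketch below prints them for each device; archName is a hypothetical helper that only encodes the list above, so any other major revision is reported as not listed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper mapping a major revision to the architecture names
// listed in these notes. Devices of compute capability 7.5 are Turing, an
// incremental update based on Volta, and also report major revision 7.
const char* archName(int major)
{
    switch (major) {
        case 7: return "Volta";
        case 6: return "Pascal";
        case 5: return "Maxwell";
        case 3: return "Kepler";
        case 2: return "Fermi";
        case 1: return "Tesla";
        default: return "not listed in these notes";
    }
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // prop.major is X and prop.minor is Y in the X.Y notation.
        printf("Device %d: %s, compute capability %d.%d (%s)\n",
               dev, prop.name, prop.major, prop.minor, archName(prop.major));
    }
    return 0;
}
```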


CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, by Shane Cook.

Over the past five years there has been a revolution in computing brought about by a company that for successive years has emerged as one of the premier gaming hardware manufacturers: NVIDIA. With the introduction of the CUDA (Compute Unified Device Architecture) programming language, for the first time these hugely powerful graphics coprocessors could be used by everyday C programmers to offload computationally expensive work. From the embedded device industry, to home users, to supercomputers, everything has changed as a result of this.

One of the major changes in the computer software industry has been the move from serial programming to parallel programming. Here, CUDA has produced great advances. The graphics processor unit (GPU) by its very nature is designed for high-speed graphics, which are inherently parallel. CUDA takes a simple model of data parallelism and incorporates it into a programming model without the need for graphics primitives. In fact, CUDA, unlike its predecessors, does not require any understanding or knowledge of graphics or graphics primitives. You do not have to be a games programmer either. The CUDA language makes the GPU look just like another programmable device.

Throughout this book I will assume readers have no prior knowledge of CUDA, or of parallel programming. I assume they have only an existing knowledge of the C/C++ programming language. As we progress and you become more competent with CUDA, we’ll cover more advanced topics, taking you from a parallel unaware programmer to one who can exploit the full potential of CUDA. For programmers already familiar with parallel programming concepts and CUDA, we’ll be discussing in detail the architecture of the GPUs and how to get the most from each, including the latest Fermi and Kepler hardware. Literally anyone who can program in C or C++ can program with CUDA in a few hours given a little training.

Getting from a novice CUDA programmer with a several-times speedup to a 10-times-plus speedup is what you should be capable of by the end of this book. The book is very much aimed at learning CUDA, but with a focus on performance, having first achieved correctness. Your level of skill and understanding of writing high-performance code, especially for GPUs, will hugely benefit from this text. This book is a practical guide to using CUDA in real applications, by real practitioners. At the same time, however, we cover the necessary theory and background so everyone, no matter what their background, can follow along and learn how to program in CUDA, making this book ideal for both professionals and those studying GPUs or parallel programming.




