A Simple CUDA 11 Programming Example


109702008


Categories: Artificial Intelligence # CUDA. Tags: parallel computing, cuda

Article link: https://blog.csdn.net/eidolon_foot/article/details/108514182


1. A standard C++ program (no CUDA):

#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

Save the file as add.cpp, then compile and run it with a C++ compiler.
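The CUDA versions below are timed with nvprof, so it can be useful to have a rough CPU baseline for comparison. The following is a minimal sketch, assuming wall-clock timing with std::chrono is good enough for that purpose; the helper time_add_ms is hypothetical and not part of the original program.

```cpp
#include <cassert>
#include <chrono>

// Same element-wise add as in add.cpp
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

// Hypothetical helper (not in the original program): returns the
// wall-clock time of one add() call in milliseconds.
double time_add_ms(int n, float *x, float *y)
{
  assert(n >= 0);
  auto start = std::chrono::steady_clock::now();
  add(n, x, y);
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In add.cpp, the call add(N, x, y) in main could be replaced by time_add_ms(N, x, y) and the returned value printed. A single run is noisy, so the number is only indicative.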

2. A CUDA program that uses a single thread:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU with one block of one thread
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

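Save the file as add.cu.

Compile it with the CUDA compiler: nvcc add.cu -o add_cuda.

(I ran into a problem when compiling. Opening a terminal through "x64 Native Tools Command Prompt for VS 2019" and compiling from there worked.)

Use nvprof to view the run time: nvprof add_cuda.

(At run time, cupti64_2020.1.1.dll was reported missing. I reinstalled CUDA 11 using the custom installation option, found cupti64_2020.1.1.dll in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\CUPTI\lib64, and copied it to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin.)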

3. A CUDA program that uses multiple threads:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 256>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

Save the file as add_block.cu.

Compile it with the CUDA compiler: nvcc add_block.cu -o add_block_cuda.

Use nvprof to view the run time: nvprof add_block_cuda.

Reference: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
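To see how the kernel's index/stride loop divides the array among the 256 threads of the single block, the same loop can be replayed on the CPU. The sketch below is an illustration only: the "threads" run one after another rather than in parallel, and the helper names add_as_thread, add_simulated and elements_for_thread are hypothetical.

```cpp
#include <cassert>

// CPU stand-in for the kernel body: `index` plays the role of
// threadIdx.x and `stride` the role of blockDim.x.
void add_as_thread(int index, int stride, int n, float *x, float *y)
{
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

// Hypothetical helper: runs every simulated thread in turn, so each
// element is updated exactly once.
void add_simulated(int num_threads, int n, float *x, float *y)
{
  assert(num_threads > 0);
  for (int t = 0; t < num_threads; t++)
    add_as_thread(t, num_threads, n, x, y);
}

// Hypothetical helper: counts how many elements one thread handles.
int elements_for_thread(int index, int stride, int n)
{
  int count = 0;
  for (int i = index; i < n; i += stride)
    count++;
  return count;
}
```

With N = 1<<20 and 256 threads, thread t touches elements t, t + 256, t + 512 and so on, which comes to 4096 elements per thread. No two threads write to the same element, which is why the kernel needs no synchronization between threads.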
