Comparing CUDA Parallelization Approaches

MusicLEEE

Categories: C++, CUDA. Tags: cuda, c++

Article link: https://blog.csdn.net/qq_39731130/article/details/121373497

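We have an integer array of length 100,000 and want 10 GPU threads to increment it in parallel. Each element must be incremented 100 times, so each thread is responsible for 10,000 elements. There are two ways to do this.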

#define ARRAY_SIZE 100000
#define THREAD_SIZE 10

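Approach 1: Loop inside the kernel

Each GPU thread computes the start and end index of its own slice of the array, then increments every element in that slice 100 times.

Kernel function: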

__global__ void kernel_1(int* array, int len, int threadCount)
{
    int tid = threadIdx.x;
    // Elements per thread, rounded up so the whole array is covered.
    int chunk = (len + threadCount - 1) / threadCount;
    int start = tid * chunk;
    int end = min(start + chunk, len);  // clamp the last slice to the array length
    for (int i = start; i < end; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            array[i]++;
        }
    }
}

Host-side launch:

kernel_1<<< 1, THREAD_SIZE >>> (d_array, ARRAY_SIZE, THREAD_SIZE);
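
The slice arithmetic is the easiest part of Approach 1 to get wrong, so it can help to check it on the host first. The following is a minimal C++ sketch, assuming a ceiling-division split in which each thread takes ceil(len / threadCount) contiguous elements. The helper names chunkRange and coverageCounts are hypothetical and do not appear in the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: half-open range [start, end) owned by thread `tid`
// when `len` elements are split into chunks of ceil(len / threadCount).
std::pair<int, int> chunkRange(int tid, int len, int threadCount) {
    assert(threadCount > 0 && tid >= 0 && tid < threadCount);
    int chunk = (len + threadCount - 1) / threadCount;  // ceiling division
    int start = std::min(tid * chunk, len);
    int end = std::min(start + chunk, len);
    return {start, end};
}

// Counts how many threads would touch each element.
// A correct split yields a count of exactly 1 everywhere.
std::vector<int> coverageCounts(int len, int threadCount) {
    std::vector<int> hits(len, 0);
    for (int tid = 0; tid < threadCount; ++tid) {
        std::pair<int, int> range = chunkRange(tid, len, threadCount);
        for (int i = range.first; i < range.second; ++i) {
            ++hits[i];
        }
    }
    return hits;
}
```

With len = 100000 and threadCount = 10, thread 0 gets [0, 10000) and thread 9 gets [90000, 100000). Trying a length that is not a multiple of the thread count is a quick way to confirm that the last slice is clamped rather than running past the end of the array.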

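Approach 2: Loop on the host

Each GPU thread increments a single array element 100 times per launch, and the element's position is supplied by the host. The host calls the kernel repeatedly in a loop until every element has been incremented.

Kernel function: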

__global__ void kernel_2(int* array, int firstPos)
{
    int tid = threadIdx.x;
    int pos = firstPos + tid;
    for(int j=0; j<100; j++)
    {
        array[pos]++;
    }
}

Host-side launch:

for (int firstPos = 0; firstPos < ARRAY_SIZE; firstPos += THREAD_SIZE)
{
    kernel_2<<< 1, THREAD_SIZE >>> (d_array, firstPos);
}
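
To make the cost of Approach 2 concrete, the host loop can be replayed on the CPU to count how many launches it issues. This is a host-side emulation in plain C++, not device code. The functions emulateKernel2 and runApproach2 are hypothetical names, and the bounds guard is an addition for the sketch that the original kernel does not have.

```cpp
#include <vector>

// Host-side stand-in for kernel_2 (an emulation, not real device code):
// each of `threadCount` "threads" increments one element 100 times.
void emulateKernel2(std::vector<int>& array, int firstPos, int threadCount) {
    const int len = static_cast<int>(array.size());
    for (int tid = 0; tid < threadCount; ++tid) {
        int pos = firstPos + tid;
        if (pos >= len) continue;  // guard added for the sketch only
        for (int j = 0; j < 100; ++j) {
            ++array[pos];
        }
    }
}

// Replays the host loop of Approach 2 and returns the number of "launches".
int runApproach2(std::vector<int>& array, int threadCount) {
    const int len = static_cast<int>(array.size());
    int launches = 0;
    for (int firstPos = 0; firstPos < len; firstPos += threadCount) {
        emulateKernel2(array, firstPos, threadCount);
        ++launches;
    }
    return launches;
}
```

For an array of 100,000 elements and 10 threads per launch, the loop runs 100000 / 10 = 10,000 times, compared with the single launch of Approach 1.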

We measured the execution time of both approaches. The results are as follows.

Approach 1:

Starting program: /home/leee/test/threadTest
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffef7e9700 (LWP 25370)]
[New Thread 0x7fffeefe8700 (LWP 25371)]
Time:369674
[Thread 0x7fffeefe8700 (LWP 25371) exited]
[Thread 0x7fffef7e9700 (LWP 25370) exited]
[Inferior 1 (process 25338) exited with code 01]

Approach 2:

Starting program: /home/leee/test/threadTest
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffef7e9700 (LWP 25195)]
[New Thread 0x7fffeefe8700 (LWP 25196)]
Time:395915
[Thread 0x7fffeefe8700 (LWP 25196) exited]
[Thread 0x7fffef7e9700 (LWP 25195) exited]
[Inferior 1 (process 25164) exited with code 01]

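The two approaches take a similar amount of time, with Approach 1 slightly faster (369674 versus 395915). The likely reason is that Approach 2 requires the CPU and GPU to interact many times: the host issues 10,000 separate kernel launches, each with its own overhead, whereas Approach 1 launches once. The result also shows that ordinary for loops work fine inside a GPU kernel.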

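In addition, if Approach 2 is run synchronously, with the host waiting for each launch to finish before issuing the next one, it takes even longer.

Host-side launch: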

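for (int firstPos = 0; firstPos < ARRAY_SIZE; firstPos += THREAD_SIZE)
{
    kernel_2<<< 1, THREAD_SIZE >>> (d_array, firstPos);
    // Block the host until this launch has finished.
    // err is assumed to be a cudaError_t declared earlier (not shown).
    err = cudaDeviceSynchronize();
}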

Result:

Starting program: /home/leee/test/threadTest
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffef7e9700 (LWP 27645)]
[New Thread 0x7fffeefe8700 (LWP 27646)]
Time:597882
[Thread 0x7fffeefe8700 (LWP 27646) exited]
[Thread 0x7fffef7e9700 (LWP 27645) exited]
[Inferior 1 (process 27612) exited with code 01]
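
As a rough reading of the three timings, the ratios and per-launch differences can be worked out as below. This is only back-of-the-envelope arithmetic on single runs taken under a debugger, and the post does not state the unit of the Time value. Here $T_1$, $T_2$, and $T_{2,\text{sync}}$ denote the reported times for Approach 1, Approach 2, and synchronous Approach 2, and $N = 10000$ is the number of launches in Approach 2.

```latex
\begin{align*}
\frac{T_2}{T_1} &= \frac{395915}{369674} \approx 1.071, &
\frac{T_{2,\text{sync}}}{T_1} &= \frac{597882}{369674} \approx 1.617, \\
\frac{T_2 - T_1}{N} &= \frac{26241}{10000} \approx 2.6, &
\frac{T_{2,\text{sync}} - T_2}{N} &= \frac{201967}{10000} \approx 20.2
\end{align*}
```

On these numbers, the host-side loop costs about 7% over the in-kernel loop, and synchronizing after every launch adds roughly 20 time units per launch on top of that.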
