Comparing CPU and GPU Memory Bandwidth (CPU vs CUDA GPU memory bandwidth)

Original link: http://blog.cudachess.org/2009/07/cpu-vs-cuda-gpu-memory-bandwidth/

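Introduction:

I have recently been planning to learn CUDA, but while chatting with a classmate she mentioned that GPUs are not suited to certain kinds of computation because the bottleneck is I/O. Yet when I looked at GPU specifications, the memory bandwidth is very high, so how can that be? The article below answers this question.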


How do the memory bandwidths of a modern CPU and a CUDA-architecture GPU compare, and how should the gap be interpreted?


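My impression from my own research so far was that, although GPU memory bandwidth is large, the CPU's L1 cache might in practice be more efficient than the CUDA architecture.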


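CUDA GPUs reach speeds measured in gigaflops (billions of floating-point operations per second), up to ten times that of Core i7/Nehalem processors. To make full use of this computing power, we need to feed them data from memory (global video memory or the computer's main memory) as fast as possible.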

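Through an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, I found that with triple-channel DDR3 the L1 cache peaks at about 50 GB/s for reads or for writes, but the two can happen at the same time, so the combined peak reaches 100 GB/s, whereas main memory (triple-channel DDR3) manages only 16 GB/s. This is surprising: the L1 cache of a three-year-old Athlon X2 3800+ (2×2 GHz) is faster than today's newest main memory! (Translator's note: I suspect a typo in the original; the surprise should be that the three-year-old part is faster than today's, not the reverse.)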


CUDA shared memory (16 KB per 8 scalar processors) and the CPU's L1 cache (32 KB) run at about the same speed, around 50 GB/s each.


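The GPU's global memory bandwidth reaches 100 GB/s to 150 GB/s, nearly 8 times the computer's main memory bandwidth, owing to more 64-bit interfaces (8 versus 3) and higher clock frequencies.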
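As a rough check on the "nearly 8 times" figure, the ratio can be split into the part explained by interface count and the part left over for clock rate. The sketch below uses only the numbers quoted in the article (16 GB/s for main memory, 100 to 150 GB/s on the GPU side, 8 versus 3 interfaces). The function names are invented for illustration, and it assumes bandwidth scales linearly with the number of 64-bit interfaces, which is a simplification.

```python
# Figures quoted in the article, in GB/s.
MAIN_MEMORY_GBPS = 16.0
GPU_MEMORY_GBPS = (100.0, 150.0)

# Number of 64-bit memory interfaces quoted in the article.
GPU_INTERFACES = 8
CPU_INTERFACES = 3


def ratio_range(gpu_range, cpu_value):
    """Return (low, high) GPU-to-CPU bandwidth ratios."""
    low, high = gpu_range
    return low / cpu_value, high / cpu_value


def implied_clock_factor(total_ratio, gpu_interfaces, cpu_interfaces):
    """Split a total ratio into an interface-count part and a remainder.

    Assumption (not from the article): bandwidth is proportional to
    interface count times per-interface transfer rate.
    """
    width_factor = gpu_interfaces / cpu_interfaces
    return total_ratio / width_factor


low_ratio, high_ratio = ratio_range(GPU_MEMORY_GBPS, MAIN_MEMORY_GBPS)
clock_factor = implied_clock_factor(8.0, GPU_INTERFACES, CPU_INTERFACES)

print(f"GPU / CPU memory bandwidth: {low_ratio:.2f}x to {high_ratio:.2f}x")
print(f"Interface count alone: {GPU_INTERFACES / CPU_INTERFACES:.2f}x")
print(f"Left for clock rate at 8x overall: {clock_factor:.2f}x")
```

Under that assumption the interface count alone accounts for roughly 2.7 times, leaving about 3 times to be explained by higher effective clock rates.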


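Next, compare the GPU's shared memory read/write speed with the CPU's L1 cache read/write speed. On the i7, each of the four cores has its own L1 cache, so the aggregate peak can reach 200 to 400 GB/s. A CUDA GTX 285, with 30 groups of 8 scalar processors, can reach a shared memory bandwidth of 1500 GB/s, about 4 times that of an overclocked i7.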
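The aggregate figures in this comparison are per-unit bandwidth multiplied by the number of units. A minimal sketch, using only the numbers quoted in the article and treating every core or scalar-processor group as able to hit its peak at the same time (an idealization); `aggregate_gbps` is a name invented here:

```python
def aggregate_gbps(units, per_unit_gbps):
    """Idealized aggregate bandwidth: every unit peaks simultaneously."""
    return units * per_unit_gbps


# Core i7: 4 cores, each L1 cache at 50 GB/s one-way or 100 GB/s read+write.
cpu_l1_low = aggregate_gbps(4, 50.0)
cpu_l1_high = aggregate_gbps(4, 100.0)

# GTX 285: 30 groups of 8 scalar processors, about 50 GB/s of shared memory each.
gpu_shared = aggregate_gbps(30, 50.0)

# Ratio against each end of the CPU range.
ratio_vs_high = gpu_shared / cpu_l1_high
ratio_vs_low = gpu_shared / cpu_l1_low

print(f"CPU L1 aggregate: {cpu_l1_low:.0f} to {cpu_l1_high:.0f} GB/s")
print(f"GPU shared memory aggregate: {gpu_shared:.0f} GB/s")
print(f"GPU / CPU: {ratio_vs_high:.2f}x to {ratio_vs_low:.2f}x")
```

The "about 4 times" in the text corresponds to comparing against the 400 GB/s end of the CPU range; against the 200 GB/s end the gap would be larger.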


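To summarize, a CUDA GPU's global memory is up to 8 times as fast as the computer's main memory, and its shared memory is about 4 times as fast as a modern CPU's L1 cache.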


The original text follows:

What is the memory bandwidth of a modern CPU versus that of a CUDA-enabled GPU?

As far as I have been able to figure out, GPU memory bandwidth is huge, but I suspected that the memory bandwidth of a CPU's L1 cache could effectively be better than that of the current CUDA architecture.

With all the horsepower delivered by CUDA GPUs, up to 10X the gigaflops on a GTX compared with current Core i7/Nehalem processors, we need to be able to feed them with data and unload results as fast as possible in memory (global videocard memory or the computer’s main memory).

I found an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, in triple-channel with fast DDR3: the L1 cache peaks around 50 GB/s reading or writing but can do both at once, peaking at 100 GB/s, while main computer memory (triple-channel DDR3) was limited to 16 GB/s. That’s astonishing anyway: the L1 cache of a 3-year-old Athlon X2 3800+ (2×2 GHz) doesn’t deliver more than the main memory of today!

The fair counterpart to a CPU's L1 cache (32 KB) is CUDA Shared Memory (16 KB per 8 Scalar Processors), which also delivers around 50 GB/s, a strangely similar value.

The counterpart to the computer's main memory is Global Memory, which delivers between 100 GB/s and 150 GB/s, nearly 8X the computer’s main memory bandwidth, due to more 64-bit interfaces (8 instead of 3) and higher clock rates.

But when you test shared memory or L1-cache access speed, remember that there are 4 cores on a Core i7, each with its own dedicated L1 cache, peaking at 200 to 400 GB/s in aggregate depending on the task.

On the other hand, with 30 groups of 8 Scalar Processors, the Shared Memory of a CUDA GTX 285 may deliver 1500 GB/s, around 4X the aggregated L1 cache of an overclocked Core i7!

To sum up, a CUDA-enabled GPU offers up to 8X the speed of main memory and 4X the speed of L1 cache compared to a modern CPU, and it shows!

