A Comparison of the Respective Strengths of GPUs and CPUs

http://stackoverflow.com/questions/7091958/cpu-vs-gpu-when-cpu-is-better

Based on my experience, I will summarize the key performance differences between parallel programs on CPUs and on GPUs. Bear in mind that such a comparison can change from generation to generation, so I will just point out what is good and what is bad for CPUs and GPUs. Of course, if you write a program at one extreme, i.e., one having only good or only bad sides, it will definitely run faster on one platform. But a mixture of the two requires very complicated reasoning.

Host program level

One key difference is memory transfer cost. GPU devices require memory transfers between the host and the device, and this cost is non-trivial in some cases, for example when you have to frequently transfer big arrays. In my experience, this cost can be minimized by pushing most of the host code into device code. The only cases where you cannot do so are when the program has to interact with the host operating system, such as outputting to the monitor.
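
As a rough illustration, here is a minimal sketch (the array size and variable names are hypothetical, not from the original answer) that times a single host-to-device copy with CUDA events; multiplying such a figure by the number of transfers in your program shows why keeping data resident on the device pays off.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 26;                        // ~256 MB of floats (arbitrary size)
    float *h_data = (float *)calloc(n, sizeof(float));
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time one host-to-device copy; frequent transfers of big arrays
    // pay this cost every single time.
    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy of %zu MB took %.2f ms\n", (n * sizeof(float)) >> 20, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```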

Device program level

Now we come to a complex picture that hasn't been fully revealed yet. What I mean is that many details of GPU internals haven't been disclosed. Still, there is a lot that distinguishes CPU and GPU (kernel) code in terms of performance.

There are a few factors that I have noticed contribute dramatically to the difference.

  • Workload distribution

GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have only a little work, say a few sequential tasks, and put those tasks on a GPU, only a few of its many execution units will be busy, so it will be slower than a CPU. CPUs, on the other hand, are better at handling short and sequential tasks. The reason is simple: CPUs are much more complicated and able to exploit instruction-level parallelism, whereas GPUs exploit thread-level parallelism. I have heard the NVIDIA GF104 can do superscalar execution, but I have had no chance to experiment with it.

It is worth noting that, in GPUs, the workload is divided into small blocks (or workgroups in OpenCL), and blocks are arranged in chunks, each of which is executed on one streaming multiprocessor (I am using NVIDIA terminology). On CPUs, those blocks are executed sequentially - essentially in a single loop.

Thus, programs that have a small number of blocks are likely to run faster on CPUs; the launch sketch below illustrates the contrast.
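
To make the block-count point concrete, here is a minimal sketch (the kernel, problem size, and launch configurations are hypothetical). The same kernel launched with a single block keeps only one streaming multiprocessor busy, while a large grid can occupy all of them.

```cuda
#include <cuda_runtime.h>

// Trivial kernel: each thread handles one element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                       // arbitrary problem size
    size_t bytes = (size_t)n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // A large grid: enough blocks to keep every streaming multiprocessor busy.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;    // 4096 blocks here
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // By contrast, a launch like vec_add<<<1, threads>>>(...) uses a single
    // block, leaving all but one multiprocessor idle - the kind of workload
    // where a plain CPU loop is likely the faster choice.
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```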

  • Control flow instructions

Branches are always bad for GPUs. Please bear in mind that GPUs prefer uniformity: equal blocks, equal threads within a block, and equal threads within a warp. But what matters the most?

                            ***Branch divergence.***

CUDA/OpenCL programmers hate branch divergence. All the threads are divided into sets of 32 threads, called warps, and all threads within a warp execute in lockstep. A branch divergence causes some threads in the warp to be serialized, so the warp's execution time is multiplied accordingly.

Unlike GPUs, each core in a CPU can follow its own path. Furthermore, branches can be executed efficiently because CPUs have branch prediction.

Thus, programs that have more warp divergence are likely to run faster on CPUs; see the sketch below.
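
Here is a minimal sketch of the divergence point (both kernels are hypothetical, and the warp size of 32 is the NVIDIA convention mentioned above). In the first kernel, every warp is split between the two paths, so the two sides run one after the other; in the second, whole warps take the same path and nothing is serialized.

```cuda
#include <cuda_runtime.h>

// Divergent: branching on threadIdx.x % 2 splits every warp, so each
// warp executes both branch bodies serially.
__global__ void divergent(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)
        out[i] = sinf((float)i);   // half the warp runs this...
    else
        out[i] = cosf((float)i);   // ...then the other half runs this
}

// Uniform: branching on the warp index keeps every warp on one path.
__global__ void uniform(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / 32) % 2 == 0)   // all 32 threads of a warp agree
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}
```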

  • Memory access instructions

This really is complicated, so let's keep it brief.

Remember that global memory accesses have very high latency (400-800 cycles). So in older generations of GPUs, whether memory accesses were coalesced was a critical matter. Now your GTX 560 (Fermi) has two levels of cache, so the global memory access cost can be reduced in many cases. However, caches in CPUs and GPUs are different, so their effects are also different.

What I can say is that it really, really depends on your memory access pattern and your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc.) whether a given piece of code runs faster on GPUs or on CPUs.

But, roughly speaking, you can expect a huge number of cache misses on the GPU to have a very bad effect (how bad depends on your code); the sketch below contrasts coalesced and strided access.
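
A minimal sketch of the access-pattern point (hypothetical kernels and stride): in the first kernel, consecutive threads of a warp read consecutive addresses, so the warp's loads coalesce into a few wide memory transactions; in the second, a large stride scatters the reads and each one may need its own transaction.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread k reads element k, so a warp touches one
// contiguous 128-byte region.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: with stride = 32, neighboring threads read addresses 128
// bytes apart, defeating coalescing. (Sizes here assume i * stride
// stays within int range, e.g. n = 1 << 20.)
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}
```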

Additionally, shared memory is an important feature of GPUs. Accessing shared memory is as fast as accessing the GPU's L1 cache, so kernels that make good use of shared memory benefit quite a bit; the reduction sketch below shows a typical pattern.
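
As a sketch of the shared-memory point (a hypothetical kernel, assuming a block size of 256 threads), the reduction below loads each input element from global memory once and then iterates entirely over fast on-chip storage.

```cuda
#include <cuda_runtime.h>

// Block-level sum reduction staged in shared memory. Launch with
// 256 threads per block; out receives one partial sum per block.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];              // matches blockDim.x
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: every step halves the active threads, and all
    // traffic stays in shared memory (L1-speed).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}
```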

There are other factors I haven't really covered that can also have a big impact on performance in many cases, such as bank conflicts, memory transaction size, and GPU occupancy...
