Understanding GPU Cards for Deep Learning (Part 1)

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/langb2014/article/details/53885175

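For a long time I was not sure which NVIDIA card to buy. As a hardware novice, I knew that buying the card with the most memory would be safe, but I had no idea which GPU suits which deep model. I happened to find an expert who really understands GPUs, and since getting around the firewall from within China is a hassle, I have reposted several of his blog posts here. Source: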

《Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning》

It is amazing, again and again, to see how much speedup you get when you use GPUs for deep learning: compared to CPUs, 5x speedups are typical, but on larger problems one can achieve 10x speedups. With GPUs you can try out new ideas, algorithms and experiments much faster than usual and get almost immediate feedback as to what works and what does not. If you are serious about deep learning you should definitely get a GPU. But which one should you get? In this blog post I will guide you through the choices to the GPU which is best for you.

Having a fast GPU is a very important aspect when one begins to learn deep learning, as this allows for a rapid gain in practical experience, which is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback it just takes too much time to learn from one’s mistakes, and it can be discouraging and frustrating to go on with deep learning. With GPUs I quickly learned how to apply deep learning on a range of Kaggle competitions and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In the competition I used a rather large two-layer deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory. More details on my approach can be found here.

Should I get multiple GPUs?

Excited by what deep learning can do with GPUs I plunged myself into multi-GPU territory by assembling a small GPU cluster with InfiniBand 40Gbit/s interconnect. I was thrilled to see if even better results can be obtained with multiple GPUs.

I quickly found that it is not only very difficult to parallelize neural networks on multiple GPUs efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition received almost no speedup.

Later I ventured further down the road and I developed a new 8-bit compression technique which enables you to parallelize dense or fully connected layers much more efficiently with model parallelism compared to 32-bit methods.

However, I also found that parallelization can be horribly frustrating. I naively optimized parallel algorithms for a range of problems, only to find that even with optimized custom code it does not work out well, given the effort that you have to put in. You need to be very aware of your hardware and how it interacts with deep learning algorithms to gauge if you can benefit from parallelization in the first place.

Figure: Setup in my main computer. You can see three GTX Titans and an InfiniBand card. Is this a good setup for doing deep learning?

However, there are other, less specialized problems where parallelization works great. For example convolutional layers can be parallelized quite easily and they scale well. Many frameworks support this kind of parallelism and if you use 4 GPUs you usually will see a speedup of about 2.5-3 for most frameworks like TensorFlow, Caffe, Theano, and Torch while for optimized research code you will get a speedup of about 3.6-3.8. Microsoft’s CNTK offers the best parallelization performance which is close to research code. CNTK has the downside that it is currently very difficult to use as it uses config files instead of a library API.

Currently I am doing an internship at Microsoft Research where I will be working on CNTK, and I can tell you that the API will be improved significantly and that the parallelization performance might be improved further still.

With these changes the parallelization on multiple GPUs and multiple computers might finally be on a level where normal users will easily profit from fast training of all different kinds of deep learning models, be it convolutional nets, recurrent nets, or fully connected nets.

Another advantage of using multiple GPUs is that you can run multiple algorithms or experiments separately on each GPU. You gain no speedups, but you get more information about performance by trying different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.

This is psychologically important if you want to learn deep learning. The shorter the interval between performing a task and receiving feedback for that task, the better the brain is able to integrate relevant memory pieces for that task into a coherent picture. If you train two convolutional nets on separate GPUs on small datasets you will more quickly get a feel for what is important to perform well; you will more readily be able to detect patterns in the cross validation error and interpret them correctly, that is, to know for which pattern you need to adjust which parameter or which layer needs to be added, removed, or adjusted.

So overall, one can say that one GPU should be sufficient for almost any task but that multiple GPUs are becoming more and more important to accelerate your deep learning models. Multiple cheap GPUs are also excellent if you want to learn deep learning quickly.

So what kind of accelerator should I get? NVIDIA, AMD, or Xeon Phi?

NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. Right now, there are just no good deep learning libraries for AMD cards, so NVIDIA it is. Even if some OpenCL libraries become available in the future I would stick with NVIDIA: the GPU computing or GPGPU community is very large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good open source solutions and solid advice for your programming are readily available.

Additionally, NVIDIA now goes all-in with respect to deep learning. They bet that deep learning will become big in the next 10 years or so. You will not see such commitment from AMD.

In the case of Xeon Phi it is advertised that you will be able to use standard C code and transform that code easily into accelerated Xeon Phi code. This feature might sound quite interesting because you might think that you can rely on the vast resources of C code. However, in reality only very small portions of C code are supported, so this feature is not really useful, and most portions of C that you will be able to run will be slow.

I worked on a Xeon Phi cluster with over 500 Xeon Phis and the frustrations with it had been endless. I could not run my unit tests because Xeon Phi MKL is not compatible with numpy; I had to refactor large portions of code because the Intel Xeon Phi compiler is unable to make proper reductions for templates — for example for switch statements; I had to change my C interface because some C++11 features are just not supported by the Intel compiler. All this led to frustrating refactorings which I had to perform without unit tests. It took ages. It was hell.

And then when my code finally executed, everything ran very slowly. There are bugs(?) or just problems in the thread scheduler(?) which cripple performance if the tensor sizes on which you operate change in succession. For example if you have differently sized fully connected layers, or dropout layers the Xeon Phi is slower than the CPU. So stay away from Xeon Phis if you want to do deep learning!

Understanding the basic memory requirements of convolutional nets

When you want to choose the GPU which is right for you, you need to understand how much memory you will need to use deep learning for your problems. So the next two sections are dedicated to exploring the memory consumption of convolutional nets so that you can make sure you get a GPU with as much memory as you need, but no more, so that you can save money.

The memory requirements of convolutional networks are very different from those of simple neural networks. You may think that they have far fewer parameters and thus require less memory. That is generally true if you just want to store the network, but not if you want to train it.

The activations and errors of each convolutional layer are huge compared to simple neural networks and this is the main memory footprint. Just summing up the activations and errors we can determine the approximate memory requirement. However, it is difficult to think about what the activation and error size is at which state in the network. In general the first few layers can eat up a lot of memory, hence the main memory requirement stems from the input size of your data so it makes good sense to first think about your input data.

ImageNet usually takes 224x224x3 as input dimensions, that is, 224 by 224 pixel images with 3 color channels. While 12GB of memory is essential for state-of-the-art results on ImageNet, on a similar dataset with 112x112x3 dimensions we might get state-of-the-art results with just 4-6GB of memory. On the other hand, for a video dataset with inputs of size 25x75x75x3, 12GB of memory might be well short of what you would need for good results.
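To make the effect of input size concrete, here is a minimal sketch that computes the memory taken by one mini-batch of raw inputs. The batch size of 128 and the 32-bit floats are assumptions for illustration, and the function name `tensor_mib` is hypothetical.

```python
def tensor_mib(batch, height, width, channels, bytes_per_value=4):
    """Memory of one dense tensor in MiB (1 MiB = 2**20 bytes)."""
    return batch * height * width * channels * bytes_per_value / 2**20

# Assumed: batch size of 128 and 32-bit (4-byte) floats.
imagenet = tensor_mib(128, 224, 224, 3)
half_res = tensor_mib(128, 112, 112, 3)
# A video sample is 25 frames of 75x75x3, so fold the frames into the height.
video = tensor_mib(128, 25 * 75, 75, 3)

print(f"224x224x3:  {imagenet:.1f} MiB")
print(f"112x112x3:  {half_res:.1f} MiB")
print(f"25x75x75x3: {video:.1f} MiB")
print(f"224 vs 112 ratio: {imagenet / half_res:.0f}x")
```

Halving each side of the image quarters the input tensor, and the activation maps of the early layers shrink along with it. The video input is larger per batch than the ImageNet input even though each frame is small.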

However, another important aspect is how many samples your dataset has. For example, if you only take 10% of the images of the ImageNet dataset, then your model overfits very quickly (it just does not have enough examples to generalize well), so that a smaller network that consumes much less memory would be sufficient to do as well as a convolutional net can do, and about 4GB of memory or less would be good for this task. So this means that the fewer images you have, the less memory you need in turn.

The same is true for the number of classes you have for your labels. If you took just two classes from the ImageNet dataset and built a model for them, then the model would consume much less memory than the model for 1000 classes. This is because overfitting occurs much faster if you have fewer classes that need to be distinguished from each other, or in other words, you need far fewer parameters to distinguish two classes from each other than to distinguish 1000.

One practical example for these guidelines is the Kaggle plankton detection competition. At first I thought about entering the competition as I might have a huge advantage through my 4 GPU system. I reasoned I might be able to train a very large convolutional net in a very short time – one thing that others cannot do because they lack the hardware. However, due to the small data set (about 50×50 pixels, 2 color channels, 400k training images; about 100 classes) I quickly found that overfitting was an issue even for small nets that neatly fit into a small GPU and which are fast to train. So there was hardly a speed advantage of multiple GPUs and not any advantage at all of having a large GPU memory. So a small GPU with 4-6GB of memory would have been quite sufficient to achieve very good results on this task.

While in this example my memory was sufficient you will eventually encounter datasets where your GPU memory will not suffice. However, you do not necessarily need to buy a new GPU to do well on these problems. All that is needed might be the use of a simple memory reduction technique.

Memory reduction techniques and their effect

One technique is to use larger strides for the convolutional kernels, that is we apply the patch-wise convolution not for every pixel, but every two or four pixels (stride of 2 or 4) so that we generate less output data. This is usually used for input layers because these use most of the memory.
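The effect of the stride on the output size follows from the standard output-size formula for convolutions. The sketch below assumes an unpadded 3×3 kernel on a 250-pixel-wide input; both values are chosen for illustration, and the function name is hypothetical.

```python
def conv_output_side(input_side, kernel, stride=1, padding=0):
    """Output side length of a convolution along one spatial dimension."""
    return (input_side + 2 * padding - kernel) // stride + 1

for stride in (1, 2, 4):
    side = conv_output_side(250, kernel=3, stride=stride)
    print(f"stride {stride}: {side}x{side} outputs per feature map "
          f"({side * side} values)")
```

A stride of 2 halves each side of the output, so each feature map holds roughly a quarter of the values; a stride of 4 leaves roughly a sixteenth.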

Another trick to reduce the memory footprint is to introduce a 1×1 convolutional kernel layer, which reduces the channels. For example 64x64x256 inputs can be reduced to 64x64x96 inputs by 96 1×1 kernels.
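A 1×1 convolution is simply a weighted sum over the channels of each pixel, which is why it can change the channel count without touching the spatial size. The following is a minimal pure-Python sketch of that idea, not how a framework implements it. It uses a tiny 4×4 map instead of 64×64 to keep it fast, and the random weights are placeholders.

```python
import random

def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: nested list [height][width][in_channels]
    weights:     nested list [out_channels][in_channels]
    Each output value is a weighted sum over the input channels of the
    same pixel, so height and width are unchanged.
    """
    return [
        [
            [sum(w * v for w, v in zip(kernel, pixel)) for kernel in weights]
            for pixel in row
        ]
        for row in feature_map
    ]

random.seed(0)
height, width, in_ch, out_ch = 4, 4, 256, 96  # tiny map, real channel counts
fmap = [[[random.random() for _ in range(in_ch)] for _ in range(width)]
        for _ in range(height)]
kernels = [[random.gauss(0, 0.1) for _ in range(in_ch)] for _ in range(out_ch)]

out = conv1x1(fmap, kernels)
print(len(out), len(out[0]), len(out[0][0]))     # height, width, channels
print(f"memory ratio: {out_ch / in_ch:.3f}")     # fraction of values kept
```

Going from 256 to 96 channels keeps 37.5% of the values, so the activations after this layer take well under half of the memory they took before it.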

One obvious technique is pooling. A 2×2 pooling layer will reduce the amount of data for the layer by a factor of four and thus significantly reduce the memory footprint of subsequent layers.
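As a small illustration of the factor of four, here is a sketch of 2×2 max pooling with stride 2 on a single channel. It assumes even height and width; the function name and the sample values are made up for the example.

```python
def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 on one channel (a list of rows).

    Assumes the height and width are both even.
    """
    return [
        [
            max(channel[r][c], channel[r][c + 1],
                channel[r + 1][c], channel[r + 1][c + 1])
            for c in range(0, len(channel[0]), 2)
        ]
        for r in range(0, len(channel), 2)
    ]

channel = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(channel)
print(pooled)  # [[4, 2], [2, 8]]

values_before = len(channel) * len(channel[0])
values_after = len(pooled) * len(pooled[0])
print(values_before // values_after)  # 4
```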

If everything fails you can always reduce the mini-batch size. The mini-batch size is a very significant factor for memory. Using a batch size of 64 instead of 128 halves memory consumption. However, training may also take longer, especially the last stages of training where it becomes more and more important to have accurate gradients. Most convolution operations are also optimized for mini-batch sizes of 64 or greater so that starting from a batch size of 32 the training speed is greatly reduced. So shrinking the mini-batch size to or even below 32 should only be used as an option of last resort.

Another often overlooked choice is to change the data type that the convolutional net uses. By switching from 32-bit to 16-bit you can easily halve the memory consumption without degrading classification performance. On Tesla P100 cards this will even give you a hefty speedup.

So what do these memory reduction techniques look like when used on real data?

If we take a batch size of 128, images with 250×250 pixels and three colors (250x250x3) as inputs and we use 3×3 kernels that increase in steps of 32,64,96… in number, then we would have roughly the following memory footprint just for the errors and activations:

92MB->1906MB->3720MB->5444MB->…
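A rough way to sanity-check figures like these is to count values and multiply by the bytes per value. The sketch below assumes unpadded 3×3 convolutions with stride 1, 32-bit floats, one error tensor of the same size as each activation tensor, and MB taken as 2^20 bytes. None of these conventions is stated explicitly, so the results land within a few percent of the quoted numbers rather than exactly on them.

```python
def activation_and_error_mib(batch, side, channels, bytes_per_value=4):
    """MiB for a layer's activations plus an equally sized error tensor."""
    return 2 * batch * side * side * channels * bytes_per_value / 2**20

batch, side = 128, 250

# The input batch itself carries no error tensor, so it is counted once.
input_mib = batch * side * side * 3 * 4 / 2**20
print(f"input: {input_mib:.0f} MiB")

footprints = []
for kernels in (32, 64, 96):
    side = side - 3 + 1  # assumed: unpadded 3x3 convolution, stride 1
    footprints.append(activation_and_error_mib(batch, side, kernels))
    print(f"{kernels} kernels, {side}x{side}: {footprints[-1]:.0f} MiB")
```

The growth is driven almost entirely by the channel count: the spatial size shrinks by only two pixels per layer while the number of feature maps goes up by 32 each time.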

So our memory would blow up very quickly. If we now would use 16-bit instead of 32-bit the numbers above would be halved; same for a batch size of 64. Using both a batch size of 64 and using 16-bit would quarter all numbers. But we would still be off by a lot of memory to train a deep network with many more layers.
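Since the footprint is linear in both the batch size and the width of each value, the scaling can be written down directly. This sketch applies it to the figures quoted above; the function name and its parameters are hypothetical.

```python
def scaled_footprint(base_mb, batch=128, bits=32, base_batch=128, base_bits=32):
    """Scale a footprint linearly with batch size and value width."""
    return base_mb * (batch / base_batch) * (bits / base_bits)

chain = [92, 1906, 3720, 5444]  # MB figures quoted in the text

for label, kwargs in [
    ("16-bit", {"bits": 16}),
    ("batch 64", {"batch": 64}),
    ("16-bit + batch 64", {"bits": 16, "batch": 64}),
]:
    print(label, [scaled_footprint(mb, **kwargs) for mb in chain])
```

Even with both reductions applied, the third convolutional layer alone still needs well over 1GB for activations and errors, which is why the architectural tricks matter as well.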

How does this change if we add a stride of 2 for the first layer, followed by a 2×2 max pooling?

92MB (input)->952MB (conv)->238MB (pool)->240MB (conv)->340MB (conv)->….

Which looks much more manageable. We would still run into memory problems if we have something like 20-30 layers, but you could just apply another max pooling or other technique. For example 32 1×1 kernels would reduce the last layer from 340MB to just 113MB so that we could easily extend our network with many more layers without any problems.

However, if you use max pooling, striding and 1×1 kernels extensively you will throw away so much information during these layers that you will hurt your predictive performance as the network has much less data to work with. So while these techniques are very efficient to reduce memory consumption you need to use them with care. One of the things that you will learn over time when training convolutional nets is how to best mix these techniques to get a network with good results without having any memory problems.

Understanding the temporary memory requirements of convolutional nets

What I explained above were the main sources of memory consumption of convolutional nets and how you can alleviate memory problems. However, there is another layer of memory consumption which is less important and more difficult to understand but which might also serve you to get the most out of your networks and which might help you to determine how much memory you actually need for your deep learning work.

There are generally three types of convolution implementations. One uses Fourier transforms; the other two compute directly on the data after first realigning it in memory. This realignment produces either patch-like structures for pixel-by-pixel calculations or matrix-like structures so that convolution can be performed as a matrix multiplication.

\[
\mathrm{featuremap}(\mathbf{x}, \mathbf{x_0}) = \int\limits_{-\infty}^{\infty} \mathrm{input}(\mathbf{x} - \mathbf{x_0})\,\mathrm{kernel}(\mathbf{x_0})\, d\mathbf{x} = \sqrt{2\pi} \times \mathrm{input}^\star \times \mathrm{kernel}^\star
\]
Continuous convolution theorem with abuse of notation: The input represents an image or feature map, and the subtraction in the argument can be thought of as creating image patches of width $x_0$ with respect to some $x$, which are then multiplied by the kernel. The integration turns into a multiplication in the Fourier domain; here $f^\star(x)$ denotes a Fourier-transformed function. For discrete “dimensions” ($x$) we have a sum instead of an integral, but the idea is the same.

The mathematical operation of convolution can be described by a simple element-wise matrix multiplication in the Fourier frequency domain. So one can perform a fast Fourier transform on the inputs and on each kernel and multiply them element-wise to obtain feature maps – the outputs of a convolutional layer. During the backward pass we do an inverse fast Fourier transform to receive gradients in the standard domain so that we can update the weights. Ideally, we store all these Fourier transforms in memory to save the time of allocating the memory during each pass. This can amount to a lot of extra memory and this is the chunk of memory that is added for the Fourier method for convolutional nets – holding all this memory is just required to make everything run smoothly.
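The element-wise multiplication in the frequency domain can be checked on a tiny one-dimensional example. The sketch below uses a naive O(n²) discrete Fourier transform from the standard library rather than a real FFT, and it compares against a direct circular convolution. It illustrates the theorem only and is not how GPU libraries implement it; with the unnormalized transform used here no extra constant factor is needed.

```python
import cmath

def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (for illustration only)."""
    n = len(signal)
    sign = 1 if inverse else -1
    result = [
        sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n))
        for k in range(n)
    ]
    return [value / n for value in result] if inverse else result

def circular_convolution(signal, kernel):
    """Direct circular convolution; the kernel is zero-padded to len(signal)."""
    n = len(signal)
    padded = kernel + [0.0] * (n - len(kernel))
    return [sum(signal[(i - j) % n] * padded[j] for j in range(n))
            for i in range(n)]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [0.5, 0.25, 0.25]

direct = circular_convolution(signal, kernel)

# Fourier route: transform both, multiply element-wise, transform back.
padded_kernel = kernel + [0.0] * (len(signal) - len(kernel))
product = [a * b for a, b in zip(dft(signal), dft(padded_kernel))]
via_fourier = [value.real for value in dft(product, inverse=True)]

print([round(v, 6) for v in direct])
print([round(v, 6) for v in via_fourier])
```

Note that the kernel has to be padded to the size of the input before it is transformed, which is one reason the Fourier route needs the extra memory described above.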

This method is clearly the fastest method for convolution. With Winograd fast Fourier Transform this convolution technique is very fast for the popular 3×3 convolutional kernels. Also other kernels are usually fastest with the fast Fourier transform method.

This method can use quite some memory, but since research in this area is still active, the kernels that use the fast Fourier transform are quite efficient in both performance and memory, and they are becoming better and better.

The other two methods that operate directly on image patches realign memory for overlapping patches to allow contiguous memory access. Slow memory access is probably the thing that hurts the performance of an algorithm the most and prefetching and aligning of memory into contiguous memory makes the convolution run much faster.

Contiguous memory means that all memory addresses lie next to each other (there is no “skipping” of indices), and this allows much faster memory reads. Alternatively you can lay the memory out in matrices and then matrix multiply them to achieve the same end. Since matrix multiplications are highly optimized already, this is also a good strategy for a well-performing convolutional operation. There is much more going on in the CUDA code for this approach of calculating convolutions, but prefetching of inputs or pixels is the main reason for the increased memory usage.

The matrix multiplication variant uses a bit more memory since some entries are repeated, but this method is often a bit faster than the now rather outdated method to compute the convolution patch-wise.
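The matrix layout is often called im2col. A minimal pure-Python sketch for a single-channel image shows both the idea and where the repeated entries come from; the function names are hypothetical, and real implementations work on batched multi-channel tensors in CUDA.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row of a matrix."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
        for r in range(rows - k + 1)
        for c in range(cols - k + 1)
    ]

def conv_via_matmul(image, kernel):
    """Unpadded cross-correlation (what deep learning calls convolution)."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    out_cols = len(image[0]) - k + 1
    # One dot product per patch row: a matrix-vector multiplication.
    flat = [sum(p * w for p, w in zip(patch, flat_kernel))
            for patch in im2col(image, k)]
    return [flat[i:i + out_cols] for i in range(0, len(flat), out_cols)]

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

patches = im2col(image, 3)
print(len(patches), "patches of", len(patches[0]), "values")  # 4 patches of 9 values
print(conv_via_matmul(image, kernel))  # [[18, 21], [30, 33]]
```

The 16-value image becomes a 4×9 patch matrix with 36 entries, because overlapping patches copy the same pixels several times. That duplication is the extra memory this variant pays for.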

I hope this gives an idea about what is going on with memory in convolutional neural nets. Now we have a look at what practical advice might look like.

Fastest GPU for a given budget

Processing performance is most often measured in floating-point operations per second (FLOPS). This measure is often advertised in GPU computing and it is also the measure which determines which supercomputer enters the TOP500 list of the fastest supercomputers. However, this measure is misleading, as it measures processing power on problems that do not occur in practice.


It turns out that the most important practical measure for GPU performance is memory bandwidth in GB/s, which measures how much memory can be read and written per second. Memory bandwidth is so important because almost all mathematical operations, such as matrix multiplication, dot product, sum, addition etcetera, are bandwidth bound, that is, limited by how many numbers can be fetched from memory rather than by how many calculations can be performed on those numbers.
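A back-of-the-envelope estimate shows what bandwidth bound means for a simple operation such as element-wise addition. The device figures below (300 GB/s and 8000 GFLOPS) are hypothetical round numbers chosen for illustration, not the specification of any particular card, and the function name is made up.

```python
def bound_times(n_values, flops_per_value, bytes_per_value,
                bandwidth_gbs, peak_gflops):
    """Return (memory_seconds, compute_seconds) lower bounds for one pass."""
    memory_s = n_values * bytes_per_value / (bandwidth_gbs * 1e9)
    compute_s = n_values * flops_per_value / (peak_gflops * 1e9)
    return memory_s, compute_s

# Element-wise addition c = a + b on 32-bit floats:
# per output value, 2 reads + 1 write = 12 bytes moved, and 1 FLOP.
# HYPOTHETICAL device: 300 GB/s memory bandwidth, 8000 GFLOPS peak.
memory_s, compute_s = bound_times(
    n_values=100_000_000, flops_per_value=1, bytes_per_value=12,
    bandwidth_gbs=300, peak_gflops=8000,
)

print(f"memory-bound time:  {memory_s * 1e3:.2f} ms")
print(f"compute-bound time: {compute_s * 1e3:.4f} ms")
print(f"memory traffic dominates by {memory_s / compute_s:.0f}x")
```

Under these assumptions the arithmetic units would finish hundreds of times sooner than the memory system can deliver the data, so a faster memory bus helps this operation far more than additional FLOPS would.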

There are other reasons why GPUs are so well suited for many computing tasks, including deep learning. If you want to get a deeper understanding of GPUs you can read my Quora answer to the question “Why are GPUs well-suited to deep learning?”.

 

Figure: Comparison of bandwidth for CPUs and GPUs over time. Bandwidth is one of the main reasons why GPUs are faster for computing than CPUs are.

Bandwidth can be compared directly within an architecture: for example, the performance of Pascal cards like the GTX 1080 and GTX 1070 can be compared by looking at their memory bandwidth alone. Across architectures, however, for example Pascal vs. Maxwell (GTX 1080 vs. GTX Titan X), cards cannot be compared directly, because different architectures utilize the given memory bandwidth differently. This makes everything a bit tricky, but overall, bandwidth alone will give you a good overview of roughly how fast a GPU is.

To determine the fastest GPU for a given budget one can use this Wikipedia page and look at Bandwidth in GB/s; the listed prices are quite accurate for newer cards (900 and 1000 series), but older cards are significantly cheaper than the listed prices, especially if you buy those cards via eBay. For example, a regular GTX Titan X goes for around $700 on eBay.

Another important factor to consider, however, is that not all architectures are compatible with cuDNN. Since almost all deep learning libraries make use of cuDNN for convolutional operations, this restricts the choice of GPUs to Kepler GPUs or better, that is, GTX 600 series or above. On top of that, Kepler GPUs are generally quite slow. So this means you should prefer GPUs of the 900 or 1000 series for good performance.

To give a rough estimate of how the cards perform with respect to each other on deep learning tasks I constructed a simple list of GPU equivalence. How to read this? For example one GTX 980 is as fast as 0.35 Titan X Pascal, or in other terms, Titan X Pascal is almost three times faster than a GTX 980.

Titan X Pascal = 0.7 GTX 1080 = 0.55 GTX 1070 = 0.5 GTX Titan X = 0.5 GTX 980 Ti = 0.4 GTX 1060 = 0.35 GTX 980

GTX 1080 = 0.3 GTX 970 = 0.25 GTX Titan = 0.175 AWS GPU instance (g2.2 and g2.8) = 0.175 GTX 960
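One way to read the first equivalence line is as a table of relative speeds with the Titan X Pascal set to 1.0, which is how the worked GTX 980 example reads. Under that interpretation, pairwise speedups are just ratios. The sketch below uses only the first line; the dictionary and function names are hypothetical.

```python
# Relative speeds from the first equivalence line (Titan X Pascal = 1.0).
relative_speed = {
    "Titan X Pascal": 1.0,
    "GTX 1080": 0.7,
    "GTX 1070": 0.55,
    "GTX Titan X": 0.5,
    "GTX 980 Ti": 0.5,
    "GTX 1060": 0.4,
    "GTX 980": 0.35,
}

def speedup(card_a, card_b):
    """How many times faster card_a is than card_b."""
    return relative_speed[card_a] / relative_speed[card_b]

print(f"{speedup('Titan X Pascal', 'GTX 980'):.2f}")  # 2.86
print(f"{speedup('GTX 1080', 'GTX 1060'):.2f}")       # 1.75
```

The first ratio matches the statement that a Titan X Pascal is almost three times faster than a GTX 980.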

Generally, I would recommend the GTX 1080 or GTX 1070. They are both excellent cards and if you have the money for a GTX 1080 you should go ahead with that. The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X. These cards currently have poor half-float performance and thus do not have any other advantage than speed over a regular GTX Titan X. Both cards should be preferred over the GTX 980 Ti due to their increased memory of 8GB (instead of 6GB).

The GTX 1060 is the best entry GPU for when you want to try deep learning for the first time, or if you want to use it occasionally for Kaggle competitions. Its 6GB of memory can be quite limiting, but for many applications it is sufficient. The GTX 1060 is slower than a regular Titan X, but it is comparable to the GTX 980 in both performance and eBay price. Overall this is clearly the best bang for the buck you can get. A very solid choice.

However, these new Pascal cards are not ideal for deep learning with their 6GB or 8GB of memory. That memory is quite sufficient for most tasks, for example for Kaggle competitions, most image datasets, deep style and natural language understanding tasks. But for researchers, especially if they work on ImageNet, video data or natural language understanding tasks with huge context data, the regular GTX Titan X or GTX Titan X Pascal with 12GB of memory will be better; memory is just too important here.

Researchers should definitely go with the new GTX Titan X Pascal due to its better speed. However, the cost difference is quite significant and if you need the memory a GTX Titan X from eBay is a very solid choice. If you already own regular GTX Titan Xs, then you should think again if the additional investment and the hassle with selling the old cards and buying the new is really worth it. I will probably keep my regular GTX Titan Xs.

The options are now more limited for people that have very little money for a GPU. GPU instances on Amazon web services are quite expensive and slow now and no longer pose a good option if you have less money. I do not recommend a GTX 970 as it is slow, still rather expensive even if bought in used condition and there are memory problems associated with the card to boot. Instead, try to get the additional money to buy a GTX 1060 which is faster, has a larger memory and has no memory problems.

Prices might drop again so that the regular GTX Titan from eBay might be a viable choice, but currently it is too pricey at about $300. Instead I would recommend buying a GTX 1060. The GTX 680 and the GTX 960 are also options if you can find them cheap on eBay, but I would rather recommend going for a GTX 1060.

Amazon Web Services (AWS) GPU instances

In the previous version of this blog post I recommended AWS GPU spot instances, but I would no longer recommend this option. The GPUs on AWS are now rather slow (one GTX 1080 is five times faster than an AWS GPU) and prices have shot up dramatically in recent months. It now again seems much more sensible to buy your own GPU.

Conclusion

With all the information in this article you should be able to reason which GPU to choose by balancing the required memory size, bandwidth in GB/s for speed and the price of the GPU, and this reasoning will be solid for many years to come. But right now my recommendation is to get a GTX 1080, or GTX 1070, whichever you can afford; a GTX 1060 for learning deep learning and Kaggle; and if you are a researcher you might want to get a Titan X Pascal (or stick to existing GTX Titan Xs).

If you have little money try to scratch together enough money for the GTX 1060. It might be a bit more expensive than other good used cards like the GTX 970, GTX 960, GTX 680, or regular GTX Titan, but it is much faster, offers at least the same memory, and has better support due to its newer architecture (this is not important currently, but might be for future libraries).

TL;DR advice

Best GPU overall: Titan X Pascal
Cost efficient but expensive: GTX 1080, GTX 1070
Cost efficient and cheap: GTX 1060
Cheapest card with no troubles: GTX 1060
I work with data sets > 250GB: Regular GTX Titan X or Titan X Pascal
I have little money: GTX 1060
I have almost no money: Try to get enough money to buy a GTX 1060; other options are only a little bit cheaper but much worse otherwise
I do Kaggle: GTX 1060 or GTX 1070 for any “normal” competition, or regular GTX Titan X for “deep learning competitions”
I am a researcher: Titan X Pascal or regular GTX Titan X; you might want to skip the Pascal upgrade if you already have regular GTX Titan Xs
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with a GTX 1060. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate

Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs

Acknowledgements

I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances.
