Continuing the repost series, third article: "How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism"
In my last blog post I showed what to look out for when you build a GPU cluster. Most importantly, you want a fast network connection between your servers, and using MPI in your programming will make things much easier than using the options available in CUDA itself.
In this blog post I explain how to utilize such a cluster to parallelize neural networks in different ways and what the advantages and drawbacks of such algorithms are. The two different algorithms are data and model parallelism. In this blog entry I will focus on data parallelism.
So what are these two? Data parallelism is when you use the same model for every thread, but feed it with different parts of the data; model parallelism is when you use the same data for every thread, but split the model among threads.
For neural networks this means that data parallelism uses the same weights but different mini-batches in each thread; the gradients need to be synchronized, i.e. averaged, after each pass through a mini-batch.
Model parallelism splits the weights of the net equally among the threads and all threads work on a single mini-batch; here the generated output after each layer needs to be synchronized, i.e. stacked, to provide the input to the next layer.
Each method has its advantages and disadvantages, which change from architecture to architecture. Let us look at data parallelism and its bottlenecks first; in the next post I will look at model parallelism.
Severity of the network bottleneck of data parallelism
The idea of data parallelism is simple. If you have, say, 4 GPUs you split a mini-batch into parts for each of them, say, you split a mini-batch with 128 examples into 32 examples for each GPU. Then you feed the respective batch through the net and obtain gradients for each split of the mini-batch. You then use MPI to collect all the gradients and update the parameters with the overall average.
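As a rough illustration of the split-and-average step, here is a minimal NumPy sketch that simulates four GPUs in a single process. The linear model, the squared-error loss, the layer sizes and the helper name `gradient` are assumptions made for the example; in a real cluster the averaging would be an MPI collective rather than `np.mean`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, batch, n_in, n_out = 4, 128, 20, 5   # sizes chosen only for illustration

X = rng.standard_normal((batch, n_in))    # one mini-batch of inputs
Y = rng.standard_normal((batch, n_out))   # matching targets
W = rng.standard_normal((n_in, n_out))    # weights, replicated on every simulated GPU

def gradient(x, y, w):
    """Gradient of 0.5 * mean over examples of ||x @ w - y||^2 with respect to w."""
    return x.T @ (x @ w - y) / len(x)

# Each simulated GPU gets 32 of the 128 examples and computes its own gradient.
shards = zip(np.split(X, n_gpus), np.split(Y, n_gpus))
local_grads = [gradient(x, y, W) for x, y in shards]

# Synchronization step: average the per-GPU gradients.
averaged = np.mean(local_grads, axis=0)

# With equally sized shards the average equals the full mini-batch gradient.
full = gradient(X, Y, W)
print(np.allclose(averaged, full))
```

The check at the end only holds because every shard has the same number of examples; with uneven shards the average would have to be weighted by shard size.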
Data parallelism diagram. There is no communication in the forward pass, and during the backward pass you synchronize gradients.
The biggest problem with this approach is that during the backward pass you have to pass the whole gradient to all other GPUs. If you have a 1000×1000 weight matrix of 32-bit floats then you need to pass 4000000 bytes to each GPU. If we take a 40Gbit/s network card, which is already quite fast, then you will need about 0.75ms to pass the data from one node to another node (however, there is some additional overhead that is neglected here). If you have six GPUs in two nodes you need to pass the data to five other GPUs, three of which need to go through the network card (3x 0.75ms), while the other two can be reached through PCIe 3.0 (about three times as fast; 2x 0.25ms). However, the PCIe pass is independent of the network card pass, so the time needed is determined by the network card time alone, i.e. 2.25ms. However, only one GPU can transfer data through the network card at any one time in any one node, so we have to multiply that time by three, i.e. 6.75ms. Now the bottom line is that we need just about 0.2ms for a matrix multiply through that layer (100×1000 dot 1000×1000) and about twice as much for the backward pass. We can pass the gradient while we work on the next layer, but in the end the network card speed limits our overall computation by quite a bit. This is more marked the larger you scale your system: a four node system working on the same problem needs about 20.25ms to pass the gradients around to the other GPUs. One can easily see that data parallelism does not scale with the size of the cluster.
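The estimate can be retraced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 32-bit floats, reads "40Gbit/s" with a binary prefix, ignores all protocol overhead, and assumes three GPUs per node; the function name `network_time_ms` is made up for the example.

```python
BYTES_PER_FLOAT = 4                           # assumption: 32-bit floats
grad_bytes = 1000 * 1000 * BYTES_PER_FLOAT    # 4,000,000 bytes for a 1000x1000 gradient

link_bits_per_s = 40 * 1024**3                # assumption: 40 Gbit/s with a binary prefix
one_transfer_ms = grad_bytes * 8 / link_bits_per_s * 1000
print(round(one_transfer_ms, 3))              # close to the 0.75 ms used in the text

def network_time_ms(n_nodes, gpus_per_node, transfer_ms=0.75):
    """Time until every GPU in one node has sent its gradient to all remote GPUs."""
    remote_gpus = (n_nodes - 1) * gpus_per_node   # peers reachable only via the network card
    per_gpu_ms = remote_gpus * transfer_ms        # one GPU sending to every remote peer
    return gpus_per_node * per_gpu_ms             # GPUs in a node take turns on the card

print(network_time_ms(2, 3))   # six GPUs in two nodes: 3 x (3 x 0.75 ms)
print(network_time_ms(4, 3))   # twelve GPUs in four nodes: 3 x (9 x 0.75 ms)
```

Because the number of remote peers grows with the cluster while the per-layer compute time stays fixed, the estimated network time grows linearly with the number of nodes.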
One way to counter this bottleneck is to reduce the number of parameters in the gradient through max pooling, maxout units or by simply using convolution. Another way is to increase the computational time/network time ratio by other means, e.g. by using computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spent on computation, thus increasing the utility of the fast GPUs.
Another thing you can do when you use computationally intensive optimization techniques is to hide the latency of networking under the computation of the gradients. This means that while you are passing the first gradient to all other nodes, you can already start a big RMSProp computation asynchronously for the next layer. This technique can give a speedup of about 0-20 % depending on network architecture.
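To get a feeling for how much such overlap can hide, here is a toy timing model. It is only a sketch: the per-layer times are invented, the function `backward_time` is hypothetical, and it assumes a single network link that sends one gradient at a time while the GPU keeps computing.

```python
def backward_time(compute_ms, comm_ms, overlap):
    """Total time for a backward pass over layers, listed in processing order."""
    if not overlap:
        # Compute a layer, then wait for its gradient to be sent, layer by layer.
        return sum(compute_ms) + sum(comm_ms)
    compute_done = 0.0   # time at which the GPU finishes the current layer
    link_free = 0.0      # time at which the network link becomes idle again
    for c, t in zip(compute_ms, comm_ms):
        compute_done += c
        # A gradient can be sent once it is computed and the link is idle.
        link_free = max(link_free, compute_done) + t
    return link_free

compute = [4.0, 4.0, 4.0]   # made-up per-layer compute times in ms
comm = [1.0, 1.0, 1.0]      # made-up per-layer gradient transfer times in ms

serial = backward_time(compute, comm, overlap=False)
hidden = backward_time(compute, comm, overlap=True)
print(serial, hidden, serial / hidden)
```

In this model only the transfer of the last layer's gradient remains exposed; the more compute there is per transferred byte, the smaller the relative gain from overlapping.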
But this is not the only problem with data parallelism. There is a very technical bottleneck hidden in the GPU architecture which took me quite a while to understand. To understand why the GPU architecture is a problem we first need to look at the usage and purpose of mini-batches.
A digression: Why do we use mini-batches?
If we start with randomly initialized parameters or even if we start with pretrained parameters, we do not need a pass through all the data to get an accurate gradient update that will head into the direction of a local minimum. If we take MNIST as an example, if we have a gradient which includes 10 common mistakes that the network makes for each class (mini-batch size of about 128), then we will go into a direction that reduces the error greatly already, as the gradient captures rough and common mistakes. If we choose a greater batch size (say 512) then we not only capture common errors, but also catch errors that are more subtle. However, it is not very sensible to fine-tune a system if you know it still has major errors. So overall we gain little by increasing the batch size. We need more computation to do roughly the same and this is the main argument why we use a mini-batch size as small as possible. However, if we choose a mini-batch size that is too small, then we do not capture all the common errors which are relevant for the data set and thus our gradient might not head near a local optimum, so there is a limit to how small you can make mini-batches.
How does this relate to data parallelism? If we want a mini-batch size of 128 and use data parallelism to divide it among, say, eight GPUs, then each net calculates gradients for 16 samples, which are then averaged with the gradients from the other GPUs. And this is exactly where the hardware bottleneck kicks in.
Memory tiles: Patches of fast GPU memory for efficient dot product calculations
To calculate dot products on the GPU, you need to copy small patches, called memory tiles, into shared memory, i.e. very fast but very small memory (limited to a few kilobytes). The problem is that the standard cuBLAS uses 64×128 memory tiles, and when you have a batch size less than 64 you waste a lot of precious shared memory. Also, if you use a batch size not equal to a multiple of 32 you equally waste shared memory (threads are only started in blocks of 32 threads), so one should use a batch size which is a multiple of 32, or a multiple of 64 if possible. For data parallelism this means that you lose significant processing speed once you go below a batch size of 64 for each GPU. If you have many GPUs this can be quite limiting and this is yet another reason why the data parallelism approach does not scale well beyond a certain point.
All in all this sounds quite dire for data parallelism, but data parallelism has its uses. If you know the bottlenecks, you can wield data parallelism as a mighty tool for certain applications. This is demonstrated by Alex Krizhevsky in his paper where he uses data parallelism in the convolutional layers of his net, and thus achieves a speedup of 3.74x by using four GPUs and 6.25x using eight GPUs. His system features two CPUs and 8 GPUs in one node, so he can use the full PCIe speed for the two sets of four GPUs and a relatively fast PCIe connection between CPUs to distribute the data among all eight GPUs.
Besides convolutional neural networks, another use of data parallelism might be in recurrent neural networks, which typically have fewer parameters and highly computationally intensive gradient updates; both are wins for data parallelism.
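A simplified way to see the effect is to compute which fraction of a tile's batch dimension holds real data once a mini-batch of 128 is divided among GPUs. This is a toy model, not a description of cuBLAS internals: it assumes the per-GPU batch is padded up to the next multiple of 64 rows, and the helper names are invented for the example.

```python
import math

def padded_batch(batch_per_gpu, tile_rows=64):
    """Per-GPU batch rounded up to the next multiple of the tile's batch dimension."""
    return math.ceil(batch_per_gpu / tile_rows) * tile_rows

def tile_utilization(batch_per_gpu, tile_rows=64):
    """Fraction of the padded batch dimension that carries real examples."""
    return batch_per_gpu / padded_batch(batch_per_gpu, tile_rows)

total_batch = 128
for n_gpus in (1, 2, 4, 8):
    per_gpu = total_batch // n_gpus
    print(n_gpus, per_gpu, tile_utilization(per_gpu))
```

Under this assumption the tiles stay full with one or two GPUs, while with eight GPUs (16 examples each) three quarters of the batch dimension of each tile would be padding.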
In my next blog post I will focus on model parallelism, which is efficient for large networks and scales well to larger clusters.
"How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism"
In my last blog post I explained what model and data parallelism are and analysed how to use data parallelism effectively in deep learning. In this blog post I will focus on model parallelism.
To recap, model parallelism is when you split the model among GPUs and use the same data for each part of the model; so each GPU works on a part of the model rather than a part of the data. In deep learning, one approach is to do this by splitting the weights, e.g. a 1000×1000 weight matrix would be split into four 1000×250 matrices if you use four GPUs.
Model parallelism diagram. Synchronizing communication is needed after each dot product with the weight matrix for both forward and backward pass.
One advantage of this approach is immediately apparent: if we split the weights among the GPUs we can have very large neural networks whose weights would not fit into the memory of a single GPU. I mentioned this in part in an earlier blog post, where I also said that such large neural networks are largely unnecessary. However, for very big unsupervised learning tasks, which will become quite important in the near future, such large networks will be needed in order to learn fine grained features that could learn “intelligent” behavior.
How do the forward and backward passes work with such split matrices? This is most obvious when we do the matrix algebra step by step:
We start by looking at the matrix multiply of the input with the weight matrix, which is the usual forward pass case. The dimensions for using model parallelism with two GPUs for a batch size of 128 and a 1000×500 weight matrix would be:
Standard: 128×1000 dot 1000×500 = 128×500
Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> add matrices
Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -> stack matrices
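The three forward-pass cases above can be checked numerically. The following NumPy sketch simulates the two GPUs in one process; the random matrices and variable names are placeholders for real activations and weights.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 1000))   # input: batch of 128 with 1000 features
W = rng.standard_normal((1000, 500))   # weight matrix of the layer

standard = A @ W                       # 128x500, single-GPU result

# Split W by its first dimension: each GPU holds a 500x500 block of rows
# and needs the matching 128x500 half of the input columns.
A1, A2 = np.hsplit(A, 2)
W1, W2 = np.vsplit(W, 2)
by_first_dim = A1 @ W1 + A2 @ W2       # two 128x500 partial results, added

# Split W by its second dimension: each GPU holds a 1000x250 block of columns
# and works on the full 128x1000 input.
Wa, Wb = np.hsplit(W, 2)
by_second_dim = np.hstack([A @ Wa, A @ Wb])   # two 128x250 results, stacked

print(np.allclose(standard, by_first_dim), np.allclose(standard, by_second_dim))
```

The addition or stacking in the last step of each variant is the point where the GPUs have to exchange data.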
To calculate the errors in the layer below we need to pass the current error back through the weights, or more mathematically, we calculate the deltas by taking the dot product of the error of the current layer and the transposed weight matrix that connects the two layers, i.e.:
Standard: 128×500 dot 500×1000 = 128×1000
Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> stack matrices
Split by weight matrix second dimension: 128×250 dot 250×1000 = 128×1000 -> add matrices
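The backward-pass cases can be verified the same way. In this NumPy sketch `delta` stands for the 128×500 error matrix arriving at the layer, and the dot product is taken with the transposed weights; all values are random placeholders and the two GPUs are simulated in one process.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.standard_normal((128, 500))   # error at the output of the layer
W = rng.standard_normal((1000, 500))      # weight matrix of the layer

standard = delta @ W.T                    # 128x1000, single-GPU result

# W split by its first dimension: each GPU holds a 500x500 block of rows and
# produces a 128x500 slice of the result; the slices are stacked.
W1, W2 = np.vsplit(W, 2)
by_first_dim = np.hstack([delta @ W1.T, delta @ W2.T])

# W split by its second dimension: each GPU holds a 1000x250 block of columns
# and the matching 128x250 half of delta; the 128x1000 partial results are added.
Wa, Wb = np.hsplit(W, 2)
d_a, d_b = np.hsplit(delta, 2)
by_second_dim = d_a @ Wa.T + d_b @ Wb.T

print(np.allclose(standard, by_first_dim), np.allclose(standard, by_second_dim))
```

Note that the roles are reversed compared to the forward pass: the split that required adding now requires stacking, and vice versa.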
We see here that we need to synchronize (adding or stacking the partial results) after each dot product, and you may think that this is slow when compared to data parallelism, where we synchronize only once. But one can quickly see that this is not so for most cases if we do the math: in data parallelism a 1000×500 gradient needs to be transferred once for the 1000×500 layer, which is 500000 elements; for model parallelism we just need to transfer a small matrix for each forward and backward pass, with a total of 128000 or 160000 elements depending on the split, which is up to nearly 4 times less data! So the network card bandwidth is still the main bottleneck in the whole application, but much less so than in the data parallelism case.
This is of course all relative and depends on the network architecture. Data parallelism will be quite fast for small networks and very slow for large networks; the opposite is true for model parallelism. The more parameters we have, the more beneficial model parallelism is. Its true strength comes into play if you have neural networks where the weights do not fit into the memory of a single GPU. Here model parallelism might achieve what would otherwise require thousands of CPUs.
However, if you run small networks where the GPUs are not saturated and have some free capacity (not all cores are running), then model parallelism will be slow. Unlike data parallelism, there are no tricks you can use to hide the communication needed for synchronization. This is because each GPU has only partial information for the whole batch. With this partial information we cannot compute the activations in the next layer and thus have to wait for the completion of the synchronization to move forward.
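The element counts can be retraced as follows. This is a small sketch for the two-GPU, batch-size-128, 1000×500 example used above; it counts the matrix each GPU has to send per synchronization and ignores how the exchange is actually implemented.

```python
batch, n_in, n_out, n_gpus = 128, 1000, 500, 2

# Data parallelism: the full weight gradient is exchanged once per layer.
data_parallel = n_in * n_out

# Model parallelism, weights split by the first dimension:
# forward sends a 128x500 partial sum, backward sends a 128x500 slice.
split_first = batch * n_out + batch * (n_in // n_gpus)

# Model parallelism, weights split by the second dimension:
# forward sends a 128x250 slice, backward sends a 128x1000 partial sum.
split_second = batch * (n_out // n_gpus) + batch * n_in

print(data_parallel, split_first, split_second)
print(data_parallel / split_first, data_parallel / split_second)
```

The advantage depends on the ratio of batch size to layer size: the model-parallel traffic scales with the batch size, while the data-parallel traffic scales with the number of weights.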
How the advantages and disadvantages can be combined is best shown by Alex Krizhevsky who demonstrates the efficiency of using data parallelism in the convolutional layers and model parallelism in the dense layers of a convolutional neural network.