Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Background
  • Challenge 1: A larger mini-batch size often leads to lower test accuracy, as there exists a generalization gap [^p9].
  • Challenge 2: When using large clusters, it is harder to achieve near-linear scalability as the number of machines increases, especially for models with a high communication-to-computation ratio.
System overview

Jizhi Training System Overview
Our system contains the following three modules: 1) the input pipeline module; 2) the training module; and 3) the communication module.

  • The input pipeline module delivers data for the next step before the current step has finished. It uses pipelining in order to minimize both CPU and GPU idle time.
  • The training module includes model construction and variable management. In this module, we have incorporated optimizations such as forward/backward computation with mixed-precision and model update with LARS.
  • The communication module uses tensor fusion and hybrid all-reduce to optimize the scaling efficiency according to tensor size and cluster size.
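The input pipeline's overlap of data preparation with GPU compute can be sketched as a simple background-thread prefetcher. This is a minimal illustration of the pipelining idea, not the paper's Jizhi implementation; the function and buffer size are assumptions.

```python
import threading
import queue

def prefetch(loader, buffer_size=2):
    """Wrap a data loader so batches for the next step are prepared
    on a background thread while the current step runs on the GPU."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for batch in loader:
            q.put(batch)          # blocks when the buffer is full
        q.put(sentinel)           # signal end of the epoch

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

# Batches keep flowing while the consumer "trains" on the current one.
batches = list(prefetch(iter([[1, 2], [3, 4], [5, 6]])))
```

The bounded queue is what keeps both sides busy: the producer fills the buffer ahead of time, and the consumer never waits unless the disk truly cannot keep up.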
SYSTEM IMPLEMENTATION AND OPTIMIZATIONS
Mixed-Precision Training with LARS
  • In our strategy, the operations in forward and backward propagation are performed in FP16, while the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward.
    [Figure: mixed-precision training with LARS (imgs/mtl.png)]
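The cast-up-then-cast-back flow around LARS can be sketched with NumPy. This is a toy illustration under assumed hyperparameters (trust coefficient, weight decay); the LARS rule shown scales the layer-wise learning rate by the ratio of weight norm to gradient norm, as in the original LARS formulation, while the forward/backward toy computation stands in for real FP16 kernels.

```python
import numpy as np

def lars_step(w32, g32, lr=0.1, trust=0.001, weight_decay=1e-4):
    """One LARS update on FP32 master weights: the layer-wise
    learning rate is scaled by ||w|| / (||g|| + wd * ||w||)."""
    w_norm = np.linalg.norm(w32)
    g_norm = np.linalg.norm(g32)
    local_lr = trust * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
    return w32 - lr * local_lr * (g32 + weight_decay * w32)

# FP32 master copy of the weights; compute runs in FP16.
w32 = np.array([0.5, -0.3, 0.8], dtype=np.float32)
x16 = np.array([1.0, 2.0, 3.0], dtype=np.float16)

grad16 = (w32.astype(np.float16) * x16).astype(np.float16)  # toy FP16 backward
w32 = lars_step(w32, grad16.astype(np.float32))             # cast up for LARS
w16 = w32.astype(np.float16)                                # cast back for FP16 compute
```

Keeping an FP32 master copy preserves small updates that would round to zero in FP16, while the FP16 compute path keeps memory traffic and arithmetic cheap.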
Improvements on Model Architecture
  • leave the batch normalization parameters unregularized
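Excluding batch-normalization parameters from regularization amounts to splitting the parameter list before applying weight decay. A minimal sketch, assuming parameters can be identified by a "bn" name prefix (the naming convention is illustrative, not from the paper):

```python
# Split parameters so batch-norm scale/offset terms skip weight decay,
# while convolution and fully-connected weights are still regularized.
params = {
    "conv1_weight": [0.1, 0.2],
    "bn1_gamma": [1.0, 1.0],
    "bn1_beta": [0.0, 0.0],
    "fc_weight": [0.3],
}

decayed = [name for name in params if not name.startswith("bn")]
no_decay = [name for name in params if name.startswith("bn")]
```

The two groups would then be handed to the optimizer with weight decay enabled only for `decayed`.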
Improvements on Communication Strategies

Problem with ring-based all-reduce: the traditional ring-based all-reduce implementation does not scale, for the following reason. In a cluster with k GPUs, ring all-reduce splits the data on each GPU into k chunks and performs the reduction in k − 1 iterations [24]. As k grows, the messages passed between nodes become smaller and fail to utilize the full bandwidth of the network.
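The shrinking-message effect is easy to see numerically. A small sketch (the tensor size is illustrative):

```python
def ring_allreduce_chunk_bytes(tensor_bytes, k):
    """In ring all-reduce over k GPUs, the tensor is split into k chunks,
    so each message carries tensor_bytes / k bytes across the 2*(k-1)
    send/receive steps of the reduce-scatter + all-gather phases."""
    return tensor_bytes / k

# A 1 MB gradient tensor: per-message size shrinks as the cluster grows.
for k in (8, 64, 1024):
    print(k, ring_allreduce_chunk_bytes(1_000_000, k))
```

At k = 1024 each message is under 1 KB, small enough that per-message latency, not bandwidth, dominates the transfer time.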
Strategy:

  • Tensor fusion: gradient tensors for convolution layers are usually much smaller than those for fully-connected layers. Sending too many small tensors over the network not only under-utilizes the bandwidth but also increases latency. The core idea of tensor fusion is to pack multiple small tensors together before all-reduce to better utilize the network bandwidth. We set a parameter θ. In the backward phase, as tensors from each layer arrive, we fuse them into a buffer pool while the total size is less than θ, and only send the fused tensor out for all-reduce once the total size exceeds θ.
  • Hierarchical all-reduce: hierarchical all-reduce can solve this problem for small-tensor communication. Instead of using ring all-reduce, where each GPU sends and receives m/p bytes of data in 2(p − 1) steps, we can group k GPUs together and then use a three-phase algorithm to do the all-reduce across all GPUs.
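The tensor-fusion buffering described above can be sketched as a simple threshold-based batcher. Tensor sizes and θ are illustrative; real implementations fuse the tensor payloads themselves, not just their sizes.

```python
def fuse_tensors(tensor_sizes, theta):
    """Pack incoming gradient tensors into buffers; flush a fused
    buffer for all-reduce once its total size exceeds theta."""
    fused, buffer, total = [], [], 0
    for size in tensor_sizes:
        buffer.append(size)
        total += size
        if total > theta:
            fused.append(buffer)      # send fused tensor for all-reduce
            buffer, total = [], 0
    if buffer:
        fused.append(buffer)          # flush remainder at end of backward pass
    return fused

# Small conv-layer gradients get packed together before communication.
groups = fuse_tensors([2, 3, 1, 8, 2, 2], theta=5)
```

Each inner list is one network transfer, so many small tensors cost one latency hit instead of one per tensor.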