Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Background
  • Challenge 1: A larger mini-batch size often leads to lower test accuracy, as there exists a generalization gap [^p9].
  • Challenge 2: When using large clusters, it is harder to achieve near-linear scalability as the number of machines increases, especially for models with a high communication-to-computation ratio.
System overview

Jizhi Training System Overview
Our system contains the following three modules: 1) the input pipeline module; 2) the training module; and 3) the communication module.

  • The input pipeline module delivers data for the next step before the current step has finished. It uses pipelining in order to minimize both CPU and GPU idle time.
  • The training module includes model construction and variable management. In this module, we have incorporated optimizations such as forward/backward computation with mixed-precision and model update with LARS.
  • The communication module uses tensor fusion and hybrid all-reduce to optimize the scaling efficiency according to tensor size and cluster size.
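The input pipeline's overlap of data preparation with GPU compute can be sketched as a simple background-thread prefetcher. This is a minimal illustration of the pipelining idea, not the paper's Jizhi implementation; the function and buffer size are assumptions.

```python
import threading
import queue

def prefetch(loader, buffer_size=2):
    """Wrap a data loader so batches for the next step are prepared
    on a background thread while the current step runs on the GPU."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for batch in loader:
            q.put(batch)          # blocks when the buffer is full
        q.put(sentinel)           # signal end of the epoch

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

# Batches keep flowing while the consumer "trains" on the current one.
batches = list(prefetch(iter([[1, 2], [3, 4], [5, 6]])))
```

The bounded queue is what keeps both sides busy: the producer fills the buffer ahead of time, and the consumer never waits unless the disk truly cannot keep up.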
SYSTEM IMPLEMENTATION AND OPTIMIZATIONS
Mixed-Precision Training with LARS
  • In our strategy, the operations in forward and backward propagation are performed in FP16, while the weights and gradients are cast to single-precision (FP32) format before applying LARS and cast back to FP16 afterward.
    [Figure: mixed-precision training with LARS (imgs/mtl.png)]
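The cast-up-then-cast-back flow around LARS can be sketched with NumPy. This is a toy illustration under assumed hyperparameters (trust coefficient, weight decay); the LARS rule shown scales the layer-wise learning rate by the ratio of weight norm to gradient norm, as in the original LARS formulation, while the forward/backward toy computation stands in for real FP16 kernels.

```python
import numpy as np

def lars_step(w32, g32, lr=0.1, trust=0.001, weight_decay=1e-4):
    """One LARS update on FP32 master weights: the layer-wise
    learning rate is scaled by ||w|| / (||g|| + wd * ||w||)."""
    w_norm = np.linalg.norm(w32)
    g_norm = np.linalg.norm(g32)
    local_lr = trust * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
    return w32 - lr * local_lr * (g32 + weight_decay * w32)

# FP32 master copy of the weights; compute runs in FP16.
w32 = np.array([0.5, -0.3, 0.8], dtype=np.float32)
x16 = np.array([1.0, 2.0, 3.0], dtype=np.float16)

grad16 = (w32.astype(np.float16) * x16).astype(np.float16)  # toy FP16 backward
w32 = lars_step(w32, grad16.astype(np.float32))             # cast up for LARS
w16 = w32.astype(np.float16)                                # cast back for FP16 compute
```

Keeping an FP32 master copy preserves small updates that would round to zero in FP16, while the FP16 compute path keeps memory traffic and arithmetic cheap.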
Improvements on Model Architecture
  • leave the batch normalization parameters unregularized
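Excluding batch-normalization parameters from regularization amounts to splitting the parameter list before applying weight decay. A minimal sketch, assuming parameters can be identified by a "bn" name prefix (the naming convention is illustrative, not from the paper):

```python
# Split parameters so batch-norm scale/offset terms skip weight decay,
# while convolution and fully-connected weights are still regularized.
params = {
    "conv1_weight": [0.1, 0.2],
    "bn1_gamma": [1.0, 1.0],
    "bn1_beta": [0.0, 0.0],
    "fc_weight": [0.3],
}

decayed = [name for name in params if not name.startswith("bn")]
no_decay = [name for name in params if name.startswith("bn")]
```

The two groups would then be handed to the optimizer with weight decay enabled only for `decayed`.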
Improvements on Communication Strategies

Problem with ring-based all-reduce: the traditional ring-based all-reduce implementation does not scale, for the following reason. In a cluster with k GPUs, ring all-reduce splits the data on each GPU into k chunks and performs the reduction in k − 1 iterations [24]. As k grows, the messages passed between nodes become smaller and fail to utilize the full bandwidth of the network.
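The shrinking-message effect is easy to see numerically. A small sketch (the tensor size is illustrative):

```python
def ring_allreduce_chunk_bytes(tensor_bytes, k):
    """In ring all-reduce over k GPUs, the tensor is split into k chunks,
    so each message carries tensor_bytes / k bytes across the 2*(k-1)
    send/receive steps of the reduce-scatter + all-gather phases."""
    return tensor_bytes / k

# A 1 MB gradient tensor: per-message size shrinks as the cluster grows.
for k in (8, 64, 1024):
    print(k, ring_allreduce_chunk_bytes(1_000_000, k))
```

At k = 1024 each message is under 1 KB, small enough that per-message latency, not bandwidth, dominates the transfer time.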
Strategy:

  • Tensor fusion: gradient tensors for convolution layers are usually much smaller than those for fully-connected layers. Sending too many small tensors over the network not only under-utilizes the bandwidth but also increases latency. The core idea of tensor fusion is to pack multiple small tensors together before all-reduce to better utilize the network bandwidth. We set a parameter θ. In the backward phase, as tensors from each layer arrive, we fuse them into a buffer pool while the total size is less than θ, and only send the fused tensor out for all-reduce once the total size exceeds θ.
  • Hierarchical all-reduce: hierarchical all-reduce can solve this problem for small-tensor communication. Instead of using ring all-reduce, where each GPU sends and receives m/p bytes of data in 2(p − 1) steps, we can group k GPUs together and then use a three-phase algorithm to do the all-reduce across all GPUs.
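The tensor-fusion buffering described above can be sketched as a simple threshold-based batcher. Tensor sizes and θ are illustrative; real implementations fuse the tensor payloads themselves, not just their sizes.

```python
def fuse_tensors(tensor_sizes, theta):
    """Pack incoming gradient tensors into buffers; flush a fused
    buffer for all-reduce once its total size exceeds theta."""
    fused, buffer, total = [], [], 0
    for size in tensor_sizes:
        buffer.append(size)
        total += size
        if total > theta:
            fused.append(buffer)      # send fused tensor for all-reduce
            buffer, total = [], 0
    if buffer:
        fused.append(buffer)          # flush remainder at end of backward pass
    return fused

# Small conv-layer gradients get packed together before communication.
groups = fuse_tensors([2, 3, 1, 8, 2, 2], theta=5)
```

Each inner list is one network transfer, so many small tensors cost one latency hit instead of one per tensor.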