Understanding GPUs for Deep Learning (Part 2)

Continuing the reposts with a second installment:

"How To Build and Use a Multi GPU System for Deep Learning"

When I started using GPUs for deep learning, my deep learning skills improved quickly. When you can run experiments with different algorithms and parameters and get rapid feedback, you simply learn much faster. At the beginning, deep learning is a lot of trial and error: you have to get a feel for which parameters need to be adjusted, or which puzzle piece is missing to get a good result. A GPU helps you fail quickly and learn important lessons so that you can keep improving. Soon my deep learning skills were sufficient to take 2nd place in the Crowdflower competition, where the task was to predict weather labels from given tweets (sunny, raining, etc.).

After this success I was tempted to use multiple GPUs to train deep learning algorithms even faster. I also took an interest in learning very large models that do not fit into a single GPU. I thus wanted to build a small GPU cluster and explore the possibilities of speeding up deep learning with multiple nodes, each holding multiple GPUs. At the same time I was offered contract work as a database developer through my old employer, which gave me the money to build the GPU cluster I had in mind.

Important components in a GPU cluster

When I did my research on which hardware to buy, I soon realized that the main bottleneck would be network bandwidth, i.e. how much data can be transferred from computer to computer per second. The bandwidth of affordable network cards (about 4 GB/s) does not come even close to PCIe 3.0 bandwidth (15.75 GB/s). So GPU-to-GPU communication within a computer will be fast, but between computers it will be slow. On top of that, most network cards only work with memory that is registered with the CPU, so a GPU-to-GPU transfer between two nodes goes like this: GPU 1 to CPU 1 to network card 1 to network card 2 to CPU 2 to GPU 2. This means that if one chooses a slow network card, there may be no speedup over a single computer. Even with fast network cards, if the cluster is large, one does not even get speedups from GPUs compared to CPUs, because the GPUs work too fast for the network cards to keep up with them.
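To make the gap concrete, here is a back-of-envelope sketch comparing how long one gradient synchronization would take at the two bandwidths quoted above. The 100M-parameter model size is a hypothetical example of mine, not a figure from the article:

```cpp
#include <cstdio>

int main() {
    // Hypothetical model: 100M float32 parameters => 400 MB of gradients
    // to move per synchronization step (illustrative numbers only).
    const double grad_bytes = 100e6 * 4.0;
    const double nic_bw     = 4.0e9;    // affordable network card, ~4 GB/s
    const double pcie3_bw   = 15.75e9;  // PCIe 3.0 x16, ~15.75 GB/s

    std::printf("over the network card: %.1f ms\n", grad_bytes / nic_bw   * 1e3); // ~100 ms
    std::printf("over PCIe 3.0:         %.1f ms\n", grad_bytes / pcie3_bw * 1e3); // ~25 ms
    return 0;
}
```

At ~100 ms per synchronization over the network card, a fast GPU can easily finish its compute before the transfer completes, which is exactly the bottleneck described above.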

This is the reason why many big companies like Google and Microsoft use CPU rather than GPU clusters to train their big neural networks. Luckily, Mellanox and NVIDIA recently came together to work on that problem, and the result is GPUDirect RDMA, a network card driver that understands GPU memory addresses and can thus transfer data directly from GPU to GPU between computers.

[Figure] NVIDIA GPUDirect RDMA can bypass the CPU for inter-node communication – data is transferred directly between two GPUs.

Generally your best bet for cheap network cards is eBay. I won an auction for a set of 40 Gbit/s Mellanox network cards that support GPUDirect RDMA, along with the fitting fibre cable. I already had two GTX Titan GPUs with 6GB of memory each, and since I wanted to build huge models that do not fit into a single GPU's memory, I decided to keep the 6GB cards and buy more of them to build a cluster with 24GB of total GPU memory. In retrospect this was a rather foolish (and expensive) idea, but little did I know about the performance of such large models and how to evaluate the performance of GPUs. All the lessons I learned from this can be found here. Besides that, the hardware is rather straightforward. For fast inter-node communication PCIe 3.0 is faster than PCIe 2.0, so I got a PCIe 3.0 board. It is also a good idea to have about twice as much RAM as you have GPU memory, so you can work more freely when handling big nets. As deep learning programs mostly use a single thread per GPU, a CPU with as many cores as you have GPUs is often sufficient.
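As a quick illustration of that rule of thumb, here is a small CUDA runtime sketch of mine (not from the original article) that totals up the GPU memory in a box and suggests a host RAM size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    size_t total = 0;
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // query name and memory of GPU i
        total += prop.totalGlobalMem;
        std::printf("GPU %d: %s, %.1f GB\n", i, prop.name, prop.totalGlobalMem / 1e9);
    }
    // rule of thumb from the text: host RAM ~ 2x total GPU memory
    std::printf("total GPU memory: %.1f GB -> suggested host RAM: ~%.0f GB\n",
                total / 1e9, 2.0 * total / 1e9);
    return 0;
}
```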

Hardware: Check. Software: ?

There are basically two options for multi-GPU programming. You can do it in CUDA with a single thread and manage the GPUs directly, by setting the current device and by declaring and assigning a dedicated memory stream to each GPU; or you can use CUDA-aware MPI, where a separate process is spawned for each GPU and all communication and synchronization is handled by MPI. The first method is rather complicated, as you need to create efficient abstractions where you loop through the GPUs and handle streaming and computing. Even with efficient abstractions, your code can quickly blow up in line count, making it less readable and maintainable.
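A minimal sketch of that first approach, assuming error checking is omitted and the kernel is hypothetical: one host thread selects each device in turn and queues asynchronous work on a per-GPU stream.

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    std::vector<cudaStream_t> streams(n_gpus);
    std::vector<float*> d_buf(n_gpus);
    const size_t bytes = 1 << 20;  // 1 MiB per GPU, illustrative

    for (int i = 0; i < n_gpus; ++i) {
        cudaSetDevice(i);               // make GPU i the current device
        cudaStreamCreate(&streams[i]);  // dedicated stream per GPU
        cudaMalloc(&d_buf[i], bytes);
        // queue async work on GPU i without blocking the loop
        cudaMemsetAsync(d_buf[i], 0, bytes, streams[i]);
        // my_kernel<<<grid, block, 0, streams[i]>>>(d_buf[i]);  // hypothetical kernel
    }

    for (int i = 0; i < n_gpus; ++i) {  // wait for all GPUs to finish
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(d_buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Even this toy version shows why the approach gets unwieldy: every operation has to be wrapped in a loop over devices with explicit stream bookkeeping.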

[Figure] Some sample MPI code. The first call spreads one chunk of data to all other computers in the network; the second call receives one chunk of data from every process. That is all you need to do – it is very easy!
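The original image is missing here, but the caption describes MPI's scatter and gather operations, so a sketch along these lines (buffer sizes are illustrative) is probably close to what it showed:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 1024;
    float* piece = malloc(chunk * sizeof(float));
    float* full  = NULL;
    if (rank == 0) full = malloc(chunk * size * sizeof(float));

    // root spreads one chunk of data to every process ...
    MPI_Scatter(full, chunk, MPI_FLOAT, piece, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    // ... and receives one chunk back from every process
    MPI_Gather(piece, chunk, MPI_FLOAT, full, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(piece); free(full);
    MPI_Finalize();
    return 0;
}
```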

The second option is much more efficient and clean. MPI is the standard in high performance computing, and its standardized library means you can be sure that an MPI method really does what it is supposed to do. Under the hood MPI uses the same principles as the first method described above, but the abstraction is so good that it is quite easy to adapt single-GPU code to multi-GPU code, at least for data parallelism (see the sketch below). The result is clean and maintainable code, and as such I would always recommend using MPI for multi-GPU computing. MPI libraries come in many languages, so you can pair them with the language of your choice. With these two components you are ready to go and can immediately start programming deep learning algorithms for multiple GPUs.
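For data parallelism, the usual pattern is to average gradients across GPUs after each backward pass. Here is a hedged sketch assuming one MPI process per node-local GPU and a CUDA-aware MPI build (so device pointers can be passed straight to MPI_Allreduce); the training step itself is elided:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank);   // assumes one process per node-local GPU
    const int n = 1 << 20; // number of gradient values, illustrative
    float* d_grads;
    cudaMalloc(&d_grads, n * sizeof(float));
    // ... forward/backward pass fills d_grads on this GPU ...

    // sum gradients across all GPUs; every process receives the result
    // (with GPUDirect RDMA this goes NIC-to-GPU without touching the CPU)
    MPI_Allreduce(MPI_IN_PLACE, d_grads, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    // divide by size on the GPU (scaling kernel omitted) to get the average

    cudaFree(d_grads);
    MPI_Finalize();
    return 0;
}
```

Compared with the single-thread CUDA version, the only multi-GPU-specific line is the MPI_Allreduce call, which is what makes the MPI route so much easier to maintain.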

