Caffe: Multi-GPU Usage

Thuzxy

Article link: https://blog.csdn.net/hfutzxy/article/details/71637393

Currently Multi-GPU is only supported via the C/C++ paths and only for training.

The GPUs to be used for training can be set with the "-gpu" flag on the command line to the 'caffe' tool. For example, "build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu=0,1" will train on GPUs 0 and 1.
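When launching training from a script, it can help to build that command line from a list of device ids. The following is a minimal sketch in Python; the helper name caffe_train_command and its caffe_bin parameter are hypothetical and only assemble the same command shown above.

```python
import shlex


def caffe_train_command(solver_path, gpu_ids, caffe_bin="build/tools/caffe"):
    """Return the argv list for a `caffe train` run on the given GPUs.

    Hypothetical helper: it only formats the command line quoted in the text.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    # The flag takes a comma-separated list of device ids, e.g. "0,1".
    gpu_arg = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    return [caffe_bin, "train", f"--solver={solver_path}", f"--gpu={gpu_arg}"]


cmd = caffe_train_command("models/bvlc_alexnet/solver.prototxt", [0, 1])
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues.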

NOTE: each GPU runs the batch size specified in your train_val.prototxt, so going from 1 GPU to 2 GPUs doubles your effective batch size. For example, if your train_val.prototxt specifies a batch size of 256, running on 2 GPUs gives an effective batch size of 512. You therefore need to adjust the batch size when running on multiple GPUs and/or adjust your solver parameters, specifically the learning rate.
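The arithmetic in this note can be sketched in a few lines of Python. The function names are hypothetical. The last helper applies a linear learning-rate scaling heuristic, which is an assumption on my part and not something the Caffe documentation prescribes; treat it as a starting point for tuning only.

```python
def effective_batch_size(per_gpu_batch, num_gpus):
    """Each GPU processes the full per-GPU batch, so the total multiplies."""
    return per_gpu_batch * num_gpus


def per_gpu_batch_for(target_effective_batch, num_gpus):
    """Per-GPU batch size that keeps the effective batch size unchanged."""
    if target_effective_batch % num_gpus != 0:
        raise ValueError("target batch size must be divisible by the GPU count")
    return target_effective_batch // num_gpus


def linearly_scaled_lr(base_lr, base_batch, new_batch):
    """ASSUMPTION: linear scaling heuristic, not taken from the Caffe docs."""
    return base_lr * new_batch / base_batch


print(effective_batch_size(256, 2))           # batch of 256 on 2 GPUs
print(per_gpu_batch_for(256, 2))              # keep the effective batch at 256
print(linearly_scaled_lr(0.01, 256, 512))     # heuristic starting point only
```

Either adjustment addresses the same issue: halving the per-GPU batch size keeps the original training regime, while rescaling the learning rate accepts the larger effective batch.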

Hardware Configuration Assumptions

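For best performance, P2P DMA access between devices is needed. Without P2P access, for example when crossing a PCIe root complex, data is copied through the host and the effective exchange bandwidth is greatly reduced.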

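The current implementation has a "soft" assumption that the devices being used are homogeneous. In practice, any devices of the same general class should work together, but performance and total size are limited by the smallest device being used. For example, if you combine a TitanX and a GTX980, performance will be limited by the 980.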

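"nvidia-smi topo -m" will show you the connectivity matrix. You can do P2P through PCIe bridges, but not across socket-level links at this time, for example across CPU sockets on a multi-socket motherboard.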
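To capture the connectivity matrix from a script, for instance to log it next to a training run, the command can be wrapped as below. This is a minimal sketch: the function name gpu_topology_matrix is hypothetical, and the output is returned as raw text because its exact layout is not described here.

```python
import shutil
import subprocess

# The command quoted in the text above.
TOPO_COMMAND = ["nvidia-smi", "topo", "-m"]


def gpu_topology_matrix():
    """Return the raw text of `nvidia-smi topo -m`, or None if unavailable."""
    if shutil.which(TOPO_COMMAND[0]) is None:
        return None  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        TOPO_COMMAND, capture_output=True, text=True, check=False
    )
    return result.stdout if result.returncode == 0 else None


matrix = gpu_topology_matrix()
print(matrix if matrix is not None else "nvidia-smi is not available here")
```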

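Performance is heavily dependent on the PCIe topology of the system, the configuration of the neural network you are training, and the speed of each of the layers. Systems like the DIGITS DevBox have an optimized PCIe topology (X99-E WS chipset). In general, scaling on 2 GPUs tends to be about 1.8x on average for networks like AlexNet, CaffeNet, VGG, and GoogleNet. With 4 GPUs, scaling begins to fall off. Generally, with "weak scaling", where the batch size increases with the number of GPUs, you will see roughly 3.5x scaling. With "strong scaling", the system can become communication bound, especially with layer performance optimizations like those in cuDNNv3, and you will see scaling closer to the mid 2.x range. Networks that have heavy computation compared to the number of parameters tend to have the best scaling performance.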
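One way to compare these figures is parallel efficiency, the speedup divided by the number of GPUs. The short Python sketch below uses only the approximate 2-GPU and 4-GPU weak-scaling numbers quoted above; the function name is hypothetical.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / num_gpus


# Approximate figures quoted in the text: (GPU count, observed speedup).
quoted_results = [(2, 1.8), (4, 3.5)]

for num_gpus, speedup in quoted_results:
    efficiency = scaling_efficiency(speedup, num_gpus)
    print(f"{num_gpus} GPUs: {speedup}x speedup, {efficiency:.1%} efficiency")
```

On these numbers, efficiency drops slightly from 2 to 4 GPUs, which matches the falloff described above.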