Parameter Servers in Distributed Deep Learning

A few days ago I read the DistBelief paper carefully and found that the following article analyzes it quite well. If you find the paper hard to follow, see http://blog.csdn.net/itplus/article/details/31831661; I repost its content here:

In addition, you can refer to the blog post by wangyi at http://cxwangyi.github.io/2013/04/09/asynchronous-parameter-updating-with-gradient-based-methods/, reposted below.

In this NIPS 2012 paper, Large Scale Distributed Deep Networks, researchers at Google presented their work on distributed learning of deep neural networks.

One of the most interesting points in this paper is the asynchronous SGD algorithm, which enables a parallel (distributed) software architecture that is scalable and can make use of thousands of CPUs.

To apply SGD to large data sets, we introduce Downpour SGD, a variant of asynchronous stochastic gradient descent that uses multiple replicas of a single DistBelief model. The basic approach is as follows: We divide the training data into a number of subsets and run a copy of the model on each of these subsets. The models communicate updates through a centralized parameter server, which keeps the current state of all parameters for the model, sharded across many machines (e.g., if we have 10 parameter server shards, each shard is responsible for storing and applying updates to 1/10th of the model parameters) (Figure 2). This approach is asynchronous in two distinct aspects: the model replicas run independently of each other, and the parameter server shards also run independently of one another.
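
To make the sharded-parameter-server idea concrete, here is a minimal single-process sketch (my own illustration, not the paper's implementation) in which threads stand in for model replicas: each shard applies updates to its slice of the parameters independently, while each replica pulls possibly stale parameters, computes a gradient on its own data subset, and pushes it back with no synchronization across replicas. The class and function names (ParameterShard, model_replica, push_gradient) and the toy linear-regression model are assumptions for illustration.

```python
import threading
import numpy as np


class ParameterShard:
    """Holds 1/num_shards of the model parameters and applies updates independently."""

    def __init__(self, values, lr=0.01):
        self.values = values
        self.lr = lr
        self.lock = threading.Lock()  # each shard serializes only its own updates

    def pull(self):
        with self.lock:
            return self.values.copy()

    def push_gradient(self, grad):
        with self.lock:
            self.values -= self.lr * grad  # plain SGD step applied by the shard


def model_replica(shards, data_subset, steps):
    """One model replica: pull (possibly stale) parameters, compute a gradient on
    its own data subset, and push it back -- no sync with other replicas."""
    for _ in range(steps):
        params = np.concatenate([s.pull() for s in shards])
        x, y = data_subset[np.random.randint(len(data_subset))]
        grad = 2.0 * (params @ x - y) * x  # toy model: linear regression, squared error
        offset = 0
        for s in shards:  # send each slice of the gradient to the shard that owns it
            n = len(s.values)
            s.push_gradient(grad[offset:offset + n])
            offset += n


if __name__ == "__main__":
    dim, num_shards, num_workers = 4, 2, 3
    true_w = np.array([1.0, -2.0, 0.5, 3.0])
    data = [(x, true_w @ x) for x in np.random.randn(200, dim)]

    shards = [ParameterShard(np.zeros(dim // num_shards)) for _ in range(num_shards)]
    subsets = np.array_split(np.arange(len(data)), num_workers)
    replicas = [threading.Thread(target=model_replica,
                                 args=(shards, [data[i] for i in idx], 2000))
                for idx in subsets]
    for t in replicas:
        t.start()
    for t in replicas:
        t.join()
    print("learned parameters:", np.concatenate([s.values for s in shards]))
```

Note how the two kinds of asynchrony from the quote show up: replicas never wait for each other, and each shard locks and updates only its own slice, so the parameters a replica pulls may mix different update generations across shards.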

Intuitively, the asynchronous algorithm looks like a hack, or a compromise between the effectiveness of the mathematical algorithm and the scalability of the distributed system. But to our surprise, the authors claimed that the asynchronous algorithm works more effectively than synchronous SGD.

Why??

My understanding is that traditional gradient-based optimization is like a bee flying along the direction defined by the current gradient. In batch learning, the direction is computed using the whole training data set. In SGD, the direction is computed using a randomly selected mini-batch of the data.
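
To pin down what "the direction" means in the two cases, here is a small sketch (my own illustration for a least-squares model, not from the paper); the function names, the squared-error loss, and the batch size are assumptions for illustration.

```python
import numpy as np


def batch_gradient(params, X, y):
    """Direction computed from the whole training set (batch learning)."""
    residual = X @ params - y
    return 2.0 * X.T @ residual / len(y)


def minibatch_gradient(params, X, y, batch_size=32):
    """Direction computed from a randomly selected mini-batch (SGD)."""
    idx = np.random.choice(len(y), size=batch_size, replace=False)
    residual = X[idx] @ params - y[idx]
    return 2.0 * X[idx].T @ residual / batch_size


# In both cases one step of the "bee" is the same update rule; only the
# direction (the gradient estimate) differs:
#     params = params - learning_rate * gradient
```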

In contrast, the asynchronous parallel SGD works like a swarm of bees, each flying along a distinct direction. These directions vary because they are computed from the asynchronously updated parameters at the beginning of each mini-batch. For the same reason, these bees never stray far from one another.

The swarm of bees optimizes collaboratively and covers a region, much like region-based optimization, where the region is composed of a set of points. This, I think, is the reason that parallel asynchronous SGD works better than traditional gradient-based optimization algorithms.

We can see that the parameter-server idea is used here. You can also refer to the blog post by jl at http://hunch.net/?p=151364, which is worth studying carefully when there is time.


  • 2
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值