Distributed TensorFlow - TensorFlow Dev Summit 2017

Compiled from the TensorFlow Dev Summit 2017 "Distributed TensorFlow" YouTube video; the quoted text comes from the English subtitles.
The topic is how to use the low-level API to build distributed TensorFlow programs.

Goals

1) Model replicas
2) How to place variables on different devices
3) Sessions and servers
4) Fault tolerance

A quick comparison of single-machine and multi-machine

This part shows how TensorFlow goes from a single machine to a distributed setting. On a single machine, tf.device is used to split the graph across different devices, and the runtime automatically takes care of running the subgraphs.

I’m going to show you how the core concepts you might be used to in single-process TensorFlow translate to the distributed world. I’ll give you some ideas for how to deal with the complexity that ensues. So I just claimed that distributed TensorFlow has a minimalist core. What did I mean by that? Well, let’s say I’ve got just one computer, and it’s got a CPU device and a GPU device in it. And if I want to write a TensorFlow program to use these devices, I can put these little with tf.device annotations in my code, so that for this example, the variables go on the CPU, and the math goes on the GPU, where it’s going to run faster. Then, when I come to run this program, TensorFlow splits up the graph between the devices and it puts in the necessary DMA between the devices to run it for me.
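As a concrete illustration of that placement, here is a minimal single-machine sketch in TF 1.x style; the variable names, shapes, and model are illustrative, not taken from the talk:

```python
import tensorflow as tf

# Variables live on the CPU; the math is placed on the GPU.
with tf.device("/cpu:0"):
    weights = tf.get_variable("weights", shape=[784, 10])
    bias = tf.get_variable("bias", shape=[10])

with tf.device("/gpu:0"):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    logits = tf.matmul(x, weights) + bias  # the math runs on the GPU

# TensorFlow splits the graph between the devices and inserts the
# necessary transfers (DMA) when the session runs it.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
```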

In the multi-machine case, splitting the graph across machines is transparent to the developer; the only difference is that the device name passed to tf.device is slightly longer, and the machines exchange tensors with each other over gRPC.

So what happens if you have multiple machines? Let’s say, for reasons that will become apparent later, that we want to take those variables and put them on the CPU device of a different process. Well, TensorFlow treats the remote devices exactly the same as the local ones. All I have to do is add just a little bit of information to these device names, and the runtime will put the variables in a different process, splitting up the graph between the devices in the different processes and adding the necessary communication. In this case, it will be using gRPC to transfer tensors between the processes instead of DMA from the GPU device. So there you have it. Using distributed TensorFlow is just a simple matter of getting all of your device placements exactly right. And yeah, I heard a wry chuckle. Yeah, I’m sure you know exactly how easy that can be, if you’ve ever written a TensorFlow program.
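The same sketch, moved to the multi-process case: only the device strings change, plus a cluster definition so the runtime knows where the other tasks live. The host names and ports are placeholders, and each address is assumed to be running a tf.train.Server created with this cluster spec:

```python
import tensorflow as tf

# Placeholder cluster definition: one parameter-server task, one worker task.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

# The variables now live on the CPU of the remote ps task...
with tf.device("/job:ps/task:0/cpu:0"):
    weights = tf.get_variable("weights", shape=[784, 10])
    bias = tf.get_variable("bias", shape=[10])

# ...while the math stays on the worker's GPU.
with tf.device("/job:worker/task:0/gpu:0"):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    logits = tf.matmul(x, weights) + bias

# Tensors cross the process boundary over gRPC.
with tf.Session("grpc://worker0.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
```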

graph replication

in-graph replication and between-graph replication are two important concepts in distributed TensorFlow. Both are ways of executing the model that implement data-parallel training; in practice, between-graph replication is the more scalable approach when training across a large number of machines.
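The transcript excerpts below only cover in-graph replication, so the following is just a rough sketch of what between-graph replication typically looks like with the standard TF 1.x primitives (tf.train.Server and tf.train.replica_device_setter); the addresses, shapes, and model are illustrative. Every worker process runs the same kind of script with its own job_name and task_index, builds its own copy of the graph, and shares only the variables stored on the ps task:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
# In the ps process you would instead create the server with job_name="ps"
# and call server.join().
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins the variables to the ps task and leaves the
# rest of the graph on this worker's own device.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    y = tf.placeholder(tf.int64, shape=[None])
    weights = tf.get_variable("weights", shape=[784, 10])
    bias = tf.get_variable("bias", shape=[10])
    loss = tf.losses.sparse_softmax_cross_entropy(y, tf.matmul(x, weights) + bias)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Each worker drives training through a session on its own server.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```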

in-graph replication

So the first idea that works pretty well for distributed training, particularly when a single model will fit in a single machine, is replication. So just like in DistBelief, we take the compute-intensive part of the model training, the forwards and the backprop, and we make a copy of it in multiple worker tasks, so that each task works on a different subset of the data. This is data parallel training, like we were doing for that Inception example back at the start.

And the simplest way we can achieve this is by doing something I’m going to call in-graph replication. The reason for this name will hopefully become self-explanatory when I tell you what the code does.

  1. We start by putting the variables on a PS task, like the earlier example. And this is just so that they’re in a central location that they can be accessed by all of the workers.
  2. And then the easiest way to do the in-graph replication is just to split up a batch of input data into equal-sized chunks, loop over the worker tasks, and use this tf.device string here to put a subgraph on each worker to compute a partial result.
  3. And then finally, we combine together all of the partial results into a single loss value that we optimize by using a standard TensorFlow optimizer. And sure enough, when you tell it to compute the loss, TensorFlow will split up the graph across the workers, and it will run across these worker tasks and the PS all in parallel.
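Put together, a minimal in-graph replication sketch of these three steps might look like the following (assuming one ps task and two worker tasks; the model, shapes, and cluster layout are illustrative):

```python
import tensorflow as tf

NUM_WORKERS = 2

# 1. Variables go on the ps task so every worker can reach them.
with tf.device("/job:ps/task:0/cpu:0"):
    weights = tf.get_variable("weights", shape=[784, 10])
    bias = tf.get_variable("bias", shape=[10])

# 2. Split one input batch into equal-sized chunks, one per worker
#    (assumes the batch size is divisible by NUM_WORKERS), and put a
#    subgraph on each worker to compute a partial result.
images = tf.placeholder(tf.float32, shape=[None, 784])
labels = tf.placeholder(tf.int64, shape=[None])
image_chunks = tf.split(images, NUM_WORKERS)
label_chunks = tf.split(labels, NUM_WORKERS)

partial_losses = []
for i in range(NUM_WORKERS):
    with tf.device("/job:worker/task:%d/gpu:0" % i):
        logits = tf.matmul(image_chunks[i], weights) + bias
        partial_losses.append(
            tf.losses.sparse_softmax_cross_entropy(label_chunks[i], logits))

# 3. Combine the partial results into a single loss and hand it to a
#    standard optimizer; the runtime runs the worker subgraphs in parallel.
loss = tf.reduce_mean(tf.stack(partial_losses))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```

Note that a single client builds and runs this one graph spanning all the workers, which is where the name in-graph replication comes from.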

So in-graph replication is pretty easy to achieve. It’s not a big modification to your existing programs. And it works pretty well up to a small number of replicas. If you want to replicate across a much larger number of workers, though, the single client driving one big graph starts to become a bottleneck, which is where between-graph replication comes in.
