Community Share | A Detailed Look at the Last Line of Defense for Placement in TensorFlow: the Placer Algorithm...

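The author of this article, Wang Siyu, is an algorithm expert at Alibaba. He works on deep learning platform development, TensorFlow distributed architecture design and large-scale distributed performance optimization, and is a contributor to the open-source TensorFlow project.

Reposted from: 互联网西门二少 (id: ximen_yushao)

Note: it helps to walk through the source code while reading this article.

1. Introducing the Problem

When building a model with TensorFlow, you may write something like the following in order to use a GPU device.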

with tf.device('/gpu:0'):
  a = tf.get_variable(.....)
  b = .......
  c = .......

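So, will a, b and c in the code above really end up on GPU:0? What happens if c has no GPU implementation? Going further, are there other constraints that can override the user's setting?

In fact, if you turn on the log_device_placement option in the session config and check, one by one, where each Op was placed, you will find that some Ops did not go where you asked. They were quietly moved to another device.

This is not a bug. It is the Placer module doing its protective job. The Placer algorithm is the last line of defense for Placement settings in TensorFlow. It works at the lower level of TensorFlow and silently corrects unreasonable Placements while satisfying the user's request as far as possible.

Let me walk you through it, from the design motivation to the source code.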

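2. Design Motivation for Placement

Because a single device is limited in compute power and memory, model sharding is an important requirement. In essence it splits the model and its computation across different devices. This solves the problem of a large model not fitting on one device, and it may also speed up computation.

Among deep learning frameworks, model sharding is clearly easier in TensorFlow than in Caffe, mainly thanks to TensorFlow's Placement mechanism. Placement is a concept introduced by TensorFlow: it specifies the binding between an Op and a concrete device. Model sharding is therefore really the problem of choosing a Placement for every Op in the model.

At the Python level there are two Placement-related APIs (tf.device and tf.colocate_with, shown in section 6). They appear widely in framework code and can also be called directly by users.

However, user-specified Placement information is not always reliable. It often conflicts with what an Op can actually do, and resolving such conflicts is the job of the Placer module in TensorFlow.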

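3. What the Placer Does

After the graph has been built in Python, print the GraphDef and look at the NodeDef structure of each Node (the original post shows a screenshot of one here). Two parts of it relate to Placement.

device attribute: explicitly specifies the device this Node should be placed on. The user sets it through with tf.device.
The string marker loc:@xxxx: a Placement constraint that implicitly states which Nodes this Node's Placement must agree with. xxxx is the name of a Group, and the Node should have the same Placement as every Node in Group xxxx.

As you can imagine, these two pieces of information can contradict each other.

The Placer must not only resolve such contradictions but also apply some rules to avoid, as far as possible, performance problems caused by poor Placement. Once the Placer has processed a Node, the Node has its final Placement, which overwrites the device attribute in its NodeDef.

Put simply, the Placer's job is to infer and fill in the device attribute of every NodeDef.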
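To make the two fields concrete, here is a minimal sketch that reads them from a NodeDef-like dictionary. The dictionary layout and the helper name placement_hints are assumptions made for illustration and are not the protobuf API; in a real GraphDef the colocation strings are stored in the node's _class attribute as a list of loc:@... entries.

```python
# Hypothetical, simplified stand-in for a NodeDef; not the real protobuf API.
COLOCATION_PREFIX = "loc:@"


def placement_hints(node_def):
    """Return (explicit_device, colocation_group_names) for one node."""
    device = node_def.get("device", "")
    class_attr = node_def.get("attr", {}).get("_class", [])
    # Keep only the colocation entries and strip the "loc:@" marker.
    groups = [entry[len(COLOCATION_PREFIX):] for entry in class_attr
              if entry.startswith(COLOCATION_PREFIX)]
    return device, groups


node = {
    "name": "MatMul",
    "op": "MatMul",
    "device": "/device:GPU:0",
    "attr": {"_class": ["loc:@global_step"]},
}
device, groups = placement_hints(node)
print(device, groups)
```

A node with neither field set yields an empty device string and an empty group list, which is the case where the Placer has the most freedom.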

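4. Some Prerequisites

While working through the logic you will run into terms coined specifically for this problem, as well as some classic algorithms. Before reading the Placer code, make sure the following are clear, to avoid detours.

Explicit Placement: Placement information the user specifies directly through with tf.device. It is written into the device attribute of the NodeDef described in the previous section.
Implicit Placement: Placement information specified indirectly, corresponding to loc:@xxxx in the NodeDef from the previous section. As noted there, xxxx is the name of a Group, and all Nodes in that Group are required to have the same Placement. Such a Group is called a Colocation Group and is a kind of constraint.
Find-Union algorithm: the disjoint-set (union-find) algorithm, the most important algorithm inside the Placer. TensorFlow uses it to handle Node colocation efficiently. Put simply, Nodes that share a Colocation Group are logically "unioned" into one set, so that a "find" on any Node quickly returns the information of the whole group. Designing a good data structure and implementing "union" and "find" efficiently is the core of the algorithm.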
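For readers who have not met union-find before, the following is a generic textbook sketch in Python with path compression and union by rank. It is not TensorFlow's implementation; the class name DisjointSet and the integer node ids are chosen only for illustration.

```python
class DisjointSet:
    """Union-find over the integers 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.rank = [0] * n           # upper bound on the height of each tree

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1
        return root_a


# Nodes 0..4; colocation constraints link 0-1, 3-4 and 1-4.
groups = DisjointSet(5)
for a, b in [(0, 1), (3, 4), (1, 4)]:
    groups.union(a, b)
print(groups.find(0) == groups.find(3))  # nodes 0 and 3 share a group
print(groups.find(2) == groups.find(0))  # node 2 is still on its own
```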

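5. Basic Principles Behind Placer Decisions

The Placer analyzes the Graph to some extent and, taking the user's requests into account, fine-tunes the Placement of each Node. The principles can be summarized in four points:

Satisfy the user's request where possible (User Requirement First): each Node's Placement follows the user's request as far as it can
Use the faster device where possible (High Performance Device): if the user did not specify a Placement for a Node, the faster device is preferred
Keep the program runnable (Runnable): if a Node has no implementation for the Placement the user requested, fall back to another implementation so that the program can still run
Consider locality where possible (Near When Possible): when fine-tuning Placement, take a node's neighbors into account and minimize pointless copies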

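6. The Principles in Detail

1. Satisfy the user's request where possible (User Requirement First)

User requests come in two forms. One is explicit: the device set on the Node. The other is implicit: the loc:@xxxx attribute, that is, the Colocation Group.

The Placer uses both kinds of request, together with the actual situation, to complete and fine-tune the Placement information.

The NodeDef screenshot shown earlier in the article is an example: an Op of type MatMul that the user explicitly placed on '/device:GPU:0' and also asked to join the Colocation Group named global_step.

The device attribute and the loc:@xxxx attribute in NodeDef are introduced by the two Python-level APIs below. Both are under the user's control, and some uses are wrapped inside higher-level APIs.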

# device attributes
@tf_export("device")
def device(device_name_or_function):

# colocation attributes
@tf_export("colocate_with")
def colocate_with(op, ignore_existing=False):

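2. Use the faster device where possible (High Performance Device)

If a Node's device attribute does not contain a device_type (that is, GPU or CPU), the Placer must decide which kind of device to use. Every device type is registered with TensorFlow with a priority, and a higher-priority device usually has better compute performance.

When an Op has implementations for several device types, the Placer picks the implementation with the highest priority. It does so by setting device_type to the highest-priority device type among all available implementations.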
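The rule can be sketched in a few lines of Python. Everything here is an illustration under stated assumptions: parse_device_spec is a simplified parser that only understands the /key:value and /device:TYPE:ID forms, fill_device_type is a hypothetical helper, and the priority numbers are made up (only their ordering matters).

```python
# Hypothetical priorities; only the ordering (GPU above CPU) matters here.
DEVICE_PRIORITY = {"GPU": 2, "CPU": 1}


def parse_device_spec(spec):
    """Split a spec such as '/job:worker/task:0/device:GPU:0' into fields."""
    fields = {}
    for part in spec.split("/"):
        if not part:
            continue
        key, _, value = part.partition(":")
        if key == "device":
            dev_type, _, dev_id = value.partition(":")
            fields["type"] = dev_type
            if dev_id.isdigit():
                fields["id"] = int(dev_id)
        else:
            fields[key] = value
    return fields


def fill_device_type(spec, supported_types):
    """Fill in the device type when the user's spec leaves it out."""
    fields = parse_device_spec(spec)
    if "type" not in fields:
        # No type requested: take the highest-priority type with a kernel.
        fields["type"] = max(supported_types, key=DEVICE_PRIORITY.__getitem__)
    return fields


print(fill_device_type("/job:worker/task:1", ["CPU", "GPU"]))
print(fill_device_type("/device:CPU:0", ["CPU", "GPU"]))
```

In the first call the user named only a job and task, so the sketch picks GPU; in the second the explicit CPU request is left alone.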

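3. Keep the program runnable (Runnable)

This is guaranteed by the Soft Placement mechanism, which is controlled by allow_soft_placement in the session config.

If a Node is explicitly pinned to a particular device but no implementation exists for that device, Soft Placement kicks in to keep the program usable. It ignores the device type and rewrites the Placement with another available implementation, chosen by device priority.

For example, suppose a Node's op is SparseToDense and its device_type is set to GPU, but SparseToDense currently has only a CPU implementation in TensorFlow. Soft Placement then rewrites the Node's device_type to CPU.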
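The fallback can be sketched as follows. This is a toy model rather than TensorFlow code: resolve_device_type is a hypothetical helper, the priorities are made-up numbers, and the list of kernel types is passed in by hand instead of being looked up in a kernel registry.

```python
# Hypothetical priorities; only the ordering matters.
DEVICE_PRIORITY = {"GPU": 2, "CPU": 1}


def resolve_device_type(requested, kernel_types, allow_soft_placement):
    """Return the device type a node ends up with.

    requested:    device type asked for by the user, e.g. "GPU"
    kernel_types: device types the op actually has implementations for
    """
    if requested in kernel_types:
        return requested
    if not allow_soft_placement:
        raise ValueError(f"no {requested} kernel and soft placement is off")
    # Soft placement: ignore the requested type and fall back by priority.
    return max(kernel_types, key=DEVICE_PRIORITY.__getitem__)


# SparseToDense with a CPU-only kernel, as in the example above.
print(resolve_device_type("GPU", ["CPU"], allow_soft_placement=True))
```

With allow_soft_placement=False the same call raises an error instead, which mirrors the choice between failing loudly and silently moving the Op.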

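4. Consider locality where possible (Near When Possible)

This part is more involved, but it stays manageable if we focus on the key point: three special kinds of Op whose nature means that their neighbors need special handling. They are:

Generator Ops: Ops with in-degree 0 and out-degree 1
MetaData Ops: Ops that operate directly on a Tensor's metadata without changing the Tensor's contents (for example Reshape)
Ref or Resource Ops: Ops such as Variable that may be assigned to (lvalues, so to speak)

The Placer applies the following three heuristics to these three kinds of Op.

If a Node is a GeneratorNode, placing it on the same device as its Consumer avoids a pointless cross-device copy. The algorithm calls this heuristic A;
If a Node is a MetaDataNode, placing it on the same device as its Producer likewise avoids a pointless cross-device copy. The algorithm calls this heuristic B;
If a Node's input is of Reference type or Resource type, try to put the Node in the same Colocation Group as that input. Take a Variable: an assign on it should obviously run where the Variable lives, and having the Variable at A while its assign runs at B makes no sense. The algorithm gives this step no name; for convenience we call it heuristic C.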

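7. Overall Flow of Placer Decisions

The overall flow has four steps; the original post shows a high-level flowchart here. The last two steps are more complex, and the next section breaks them down further.

8. The Placer Step by Step, with Key Code

Note: when reading the source in this section, focus on the structure rather than getting caught up in every detail.

Step 1: Aggregate Groups according to externally specified Colocation

In general, a Node for which the user specified no Colocation Group is placed in a Group of its own as the only member, and the Group takes the Node's name. Every Node in the Graph therefore has a Colocation Group.

Logically, merging several Groups is simple. But a Group here is not just a set of Nodes. It also carries attributes; for example, a Group's possible devices is the set of all devices the Group can use.

We therefore need a data structure and an algorithm that make it easy to produce the new Group and its attributes when two Groups are merged (convenient Union), and that let us quickly look up all attributes of the Group a Node belongs to (fast Find). This is exactly what Find-Union is good at.

The theory of Find-Union is not covered here. We only show the basic data structure the code uses for it, Member, which describes a Group's basic information. Before reading the comments in the code below, you should have a basic understanding of the tree structure used in Find-Union.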

// Represents a node in the disjoint node set forest, and the
// accumulated constraints on the device used by that node.
struct Member {
  Member() = default;
  // The id of the node that is the parent of this one, or its own
  // id if it is a root. parent <= 0 indicates that this member is invalid.
  int parent = -1;

  // A proxy for the depth of the tree that is used to prefer
  // connecting smaller trees to larger trees when merging disjoint
  // sets.
  int rank = 0;

  // The intersection of all device types supported by this node,
  // and those of all of its children, in priority order
  // of the preferred device.
  DeviceTypeVector supported_device_types;

  // The merged form of the device requested for this node, with
  // those of all of its children.
  DeviceNameUtils::ParsedName device_name;

  // If this node is a root, stores a list of Devices to which this node
  // and all of its children have been assigned, or nullptr if this
  // has not yet been computed.
  std::vector<Device*> possible_devices;
};
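To show how group attributes travel with a union, here is a small Python sketch modeled loosely on the Member struct. It is an illustration, not a port of the C++ code: it keeps only parent, rank and supported_device_types, leaves out device_name and possible_devices, and the function names find_root and colocate are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    parent: int
    rank: int = 0
    # Device types every node in the group supports, best first.
    supported_device_types: list = field(default_factory=list)


def find_root(members, node_id):
    while members[node_id].parent != node_id:
        node_id = members[node_id].parent
    return node_id


def colocate(members, x, y):
    """Merge the groups of x and y and return the id of the new root."""
    rx, ry = find_root(members, x), find_root(members, y)
    if rx == ry:
        return rx
    # Union by rank: the root with the larger rank stays the root.
    if members[rx].rank < members[ry].rank:
        rx, ry = ry, rx
    new_root, old_root = members[rx], members[ry]
    # The merged group supports only what both groups support.
    common = [t for t in new_root.supported_device_types
              if t in old_root.supported_device_types]
    if not common:
        raise ValueError("no device type supports every node in the group")
    old_root.parent = rx
    if new_root.rank == old_root.rank:
        new_root.rank += 1
    new_root.supported_device_types = common
    return rx


members = [
    Member(parent=0, supported_device_types=["GPU", "CPU"]),
    Member(parent=1, supported_device_types=["CPU"]),
    Member(parent=2, supported_device_types=["GPU", "CPU"]),
]
root = colocate(members, 0, 1)
print(root, members[root].supported_device_types)
```

Because node 1 only supports CPU, the merged group of nodes 0 and 1 is restricted to CPU even though node 0 on its own could have used a GPU.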

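The code below is the core of this step. It first creates a ColocationGraph object, a utility class for handling Colocation Groups that uses the Find-Union algorithm internally to aggregate Groups.

After calling InitializeMembers to initialize the basic Find-Union data structures, it calls ColocateAllNodes to aggregate Groups according to all the colocation information the user specified.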

ColocationGraph colocation_graph(
    graph_, devices_,
    options_ == nullptr || options_->config.allow_soft_placement(),
    default_device_);

TF_RETURN_IF_ERROR(colocation_graph.InitializeMembers());

// 1. First add all of the nodes. Note that steps (1) and (2)
// requires two passes over the nodes because the graph (and hence
// the constraints) may not be acyclic.
TF_RETURN_IF_ERROR(colocation_graph.ColocateAllNodes());

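Step 2: Apply heuristic C (Placement of Ref Ops)

This step adjusts the Colocation Groups. While traversing each Node of the Graph, the Placer looks at the Node's inputs to decide whether to merge the Node's Group with the Group of the source Node.

If a Node's input is of Reference type or DT_RESOURCE, the Placer tries to merge the two Groups.

A note on DT_RESOURCE: you generally only meet it when using ResourceVariable. Compared with Variable, ResourceVariable has many new features, which are promoted in TF 2.0. We will not go into its advantages here and only describe its Op type. At the C++ level, the Op type of Variable is VariableV2, while the Op type of ResourceVariable is VarHandleOp. The Tensor produced by the latter is a DT_RESOURCE.

Before merging, the Placer performs the necessary feasibility checks and raises errors where appropriate. For example, besides the connection between this pair of nodes, it also has to consider whether the Node's other inputs are of Reference type or DT_RESOURCE. The code for this part is long but the logic is simple, so it is not shown here.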
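Since the source for this step is not shown, here is a toy Python sketch of the merging rule alone. It is an assumption-laden illustration: edges are plain (src, dst, dtype) triples, the dtype labels "ref" and "resource" are invented for the example, and the feasibility checks described above are left out entirely.

```python
REF_OR_RESOURCE = {"ref", "resource"}  # hypothetical dtype labels


def apply_heuristic_c(num_nodes, edges):
    """Return a group id for each node after colocating ref/resource edges.

    edges: (src, dst, dtype) triples describing the tensor on each edge.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst, dtype in edges:
        if dtype in REF_OR_RESOURCE:
            # The consumer joins the group of the node that owns the state.
            parent[find(dst)] = find(src)
    return [find(node) for node in range(num_nodes)]


# 0 = Variable, 1 = Assign, 2 = Const, 3 = MatMul
edges = [(0, 1, "ref"), (2, 1, "float"), (0, 3, "float")]
group_ids = apply_heuristic_c(4, edges)
print(group_ids)
```

Only the Assign node is pulled into the Variable's group; the Const and the MatMul, which see ordinary values, keep their own groups.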

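Step 3: Apply heuristic B (Placement of MetaData Ops)

Only from this step on does the Placer actually assign a device to each Node. The original post illustrates the step with a flowchart; the logic is as follows.

If the current Node already has a device assigned, the Placer does not assign it again and skips the Node;
If the current Node is a GeneratorNode, it is first put into a vector named second_pass;
Otherwise, the Node is exactly what this step needs to handle. The Placer first obtains the usable devices from the Node's Colocation Group as candidates (Soft Placement affects what is returned). If the Node is a MetaData node, the Placer tries heuristic B; otherwise it assigns the highest-priority device in the candidate set.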

int assigned_device = -1;

// Heuristic B: If the node only operates on metadata, not data,
// then it is desirable to place that metadata node with its
// input.
if (IsMetadata(node)) {
  // Make sure that the input device type is in the list of supported
  // device types for this node.
  const Node* input = (*node->in_edges().begin())->src();
  // TODO(vrv): if the input is empty, consider postponing this
  // node's assignment to the second pass, so that we handle the
  // case where a metadata node's input comes from a backedge
  // of a loop.
  if (CanAssignToDevice(input->assigned_device_name(), *devices)) {
    assigned_device = input->assigned_device_name_index();
  }
}

// Provide the default, if necessary.
if (assigned_device == -1) {
  assigned_device = graph_->InternDeviceName((*devices)[0]->name());
}

AssignAndLog(assigned_device, node);

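Step 4: Apply heuristic A (Placement of Generator Ops)

This step assigns a device to every Node in the second_pass vector. The original post illustrates the step with a flowchart.

Every Node in second_pass is a GeneratorNode, so only heuristic A needs to be applied. As in step 3, heuristic A is applied on a best-effort basis: if it cannot be satisfied, the Node simply gets the highest-priority device among the candidates. Below is the part of the code that applies heuristic A.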

int assigned_device = -1;

// Heuristic A application.
if (IsGeneratorNode(node)) {
  const Node* output = (*node->out_edges().begin())->dst();
  int output_device_name = output->assigned_device_name_index();

  const bool consumers_on_same_device = std::all_of(
      node->out_edges().begin(), node->out_edges().end(),
      [output_device_name](const Edge* e) {
        return e->dst()->assigned_device_name_index() == output_device_name;
      });

  if (consumers_on_same_device &&
      CanAssignToDevice(output->assigned_device_name(), *devices)) {
    assigned_device = output_device_name;
  }
}

// Provide the default, if necessary.
if (assigned_device == -1) {
  assigned_device = graph_->InternDeviceName((*devices)[0]->name());
}

AssignAndLog(assigned_device, node);

At this point, every Node's Placement has been assigned and fine-tuned.
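Steps 3 and 4 can be replayed on a toy graph in Python. The sketch below is a simplification under several assumptions: every node has a single output, a node without inputs counts as a generator, only Reshape is treated as a metadata op, candidate device lists are given by hand in priority order, and the check for nodes that already have a device is omitted. The function name place is hypothetical.

```python
GPU, CPU = "/device:GPU:0", "/device:CPU:0"


def place(ops, inputs, candidates, metadata_ops=frozenset({"Reshape"})):
    """Assign a device to every node in two passes.

    ops:        {node: op type}, listed in topological order
    inputs:     {node: [input nodes]}
    candidates: {node: [allowed devices, best first]}
    """
    consumers = {node: [] for node in ops}
    for node, srcs in inputs.items():
        for src in srcs:
            consumers[src].append(node)

    assigned, second_pass = {}, []
    for node, op in ops.items():
        srcs = inputs.get(node, [])
        if not srcs:
            second_pass.append(node)  # generator: wait for its consumers
            continue
        device = None
        # Heuristic B: a metadata op follows its first input if allowed.
        if op in metadata_ops and assigned.get(srcs[0]) in candidates[node]:
            device = assigned[srcs[0]]
        assigned[node] = device or candidates[node][0]

    for node in second_pass:
        device = None
        consumer_devices = {assigned[c] for c in consumers[node]}
        # Heuristic A: join the consumers if they all sit on one device.
        if len(consumer_devices) == 1:
            (only,) = consumer_devices
            if only in candidates[node]:
                device = only
        assigned[node] = device or candidates[node][0]
    return assigned


ops = {"w": "Const", "x": "Const", "shape": "Const",
       "y": "MatMul", "z": "Reshape"}
inputs = {"y": ["w", "x"], "z": ["y", "shape"]}
candidates = {node: [GPU, CPU] for node in ops}
candidates["y"] = [CPU]  # pretend the MatMul is restricted to the CPU
placement = place(ops, inputs, candidates)
print(placement)
```

Although the GPU has higher priority for every node except y, the Reshape follows its producer and the three constants follow their consumers, so in this toy graph no tensor crosses a device boundary.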

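9. Summary

A GraphDef that has gone through the Placer has had all conflicts between explicit and implicit Placement information resolved, which is why the Placer can be called the last line of defense.

After the Placer, the GraphDef is passed to the GraphPartitioner module, which splits it into subgraphs according to each Node's device and inserts Send, Recv and any necessary ControlFlow nodes. The Placer step is therefore indispensable.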
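The link between Placement and partitioning can be illustrated with a few lines of Python. This is only a sketch of the idea, not the GraphPartitioner itself: the function name partition is an assumption, and instead of inserting Send and Recv nodes it merely lists the edges where such a pair would be needed.

```python
from collections import defaultdict

GPU, CPU = "/device:GPU:0", "/device:CPU:0"


def partition(placement, edges):
    """Group nodes by device and list the edges that cross devices."""
    subgraphs = defaultdict(list)
    for node, device in placement.items():
        subgraphs[device].append(node)
    # Each cross-device edge is where a Send/Recv pair would be inserted.
    cross_device = [(src, dst) for src, dst in edges
                    if placement[src] != placement[dst]]
    return dict(subgraphs), cross_device


placement = {"a": GPU, "b": GPU, "c": CPU}
edges = [("a", "b"), ("b", "c")]
subgraphs, cross_device = partition(placement, edges)
print(subgraphs)
print(cross_device)
```

Every entry in cross_device stands for a tensor that has to be copied between devices, which is why a poor Placement translates directly into communication cost.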

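We can also see that the core of the Placer is fine-tuning Placement. Because the heuristics are fairly simple, performance problems are not fully solved. Indeed, it is easy to see that in distributed mode a crude Placement scheme can make a job perform very badly, because it introduces communication overhead on top of the computation.

TensorFlow's highly flexible Placement control interface leaves a great deal of room for designing model-parallel strategies, and this is one of the hot topics in DL systems research. Automating the Placement strategy and hiding it inside the framework seems to be something users care about a great deal. It would make the framework easier to use and let users focus entirely on the model and the algorithm, and it would also keep beginners from writing poorly performing programs.

But automatically searching for the best Placement strategy is very hard. The search has to take into account the cluster's communication bandwidth and the compute cost of every Op, so it is a complex problem tightly coupled to hardware and environment. On top of that, deep learning models usually contain thousands or tens of thousands of Nodes, which makes the search space enormous.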
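A back-of-the-envelope calculation shows the scale. Assuming, for illustration only, that each of n Nodes can go on any of k devices and that no colocation or kernel constraints apply, there are k^n possible assignments. The node and device counts below are hypothetical.

```python
import math


def placement_count(num_nodes, num_devices):
    """Number of unconstrained assignments of nodes to devices."""
    return num_devices ** num_nodes


def log10_placement_count(num_nodes, num_devices):
    """Base-10 logarithm of the count, usable when the count is huge."""
    return num_nodes * math.log10(num_devices)


print(placement_count(3, 2))
# Hypothetical model with 10,000 nodes on 8 devices.
print(math.floor(log10_placement_count(10_000, 8)) + 1, "decimal digits")
```

Real constraints shrink this space, but not by enough to make exhaustive search practical.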

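There are many competing approaches to this problem at present. If you are interested in placement strategies, I recommend a paper published by Google that uses reinforcement learning to search for better sharding strategies: the ICML paper Device Placement Optimization with Reinforcement Learning.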