Building a Deep Learning Inference Framework from Scratch: The Operator Execution Flow

qq_32901731

Last modified: 2023-02-23 20:23:01


First published: 2023-02-20 21:13:41

Article link: https://blog.csdn.net/qq_32901731/article/details/129131694


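This article describes the computation graph design in a self-built deep learning inference framework, including the structure of an Operator and the concepts of data flow and control flow. The execution order of the graph is determined by breadth-first search. The article focuses on how the ProbeNextLayer function finds successor nodes and copies the previous node's output to them, and how the Forward function uses an execution queue to schedule nodes. A test case shows that the actual execution order matches the expected one.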


Getting the code for this lesson

	git clone https://github.com/zjhellofss/KuiperCourse
	cd KuiperCourse
	git checkout eleven

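Design of the computation graph

Structure of a Graph

Operators: all the nodes in the graph
Input operator: the designated input node
Output operator: the designated output node
Global input data: the model's external input (supplied by the user)

Structure of an Operator

Input data: the node's input data
Output data: the node's output data
Operator params: the parameters of the compute node
Next operators: the node's successors; a node may have several, and a node at the end of a branch has none
Layer:
- The component that actually performs each Operator's computation. The layer first takes its input from input data, then runs the computation it defines and writes the result to output data
- The parameters needed during the computation have already been stored in Operator params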
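
To make the two lists above concrete, here is a minimal sketch of how these fields could be laid out in C++. Every name in it (ToyTensor, ToyOperator, ToyGraph, MakeTwoNodeGraph) is made up for illustration and is not the project's actual class; the Layer member is left out to keep the sketch short.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the course's Tensor class: just a flat buffer.
struct ToyTensor {
  std::vector<float> values;
};

// Sketch of an operator: inputs, outputs, parameters and successors.
struct ToyOperator {
  std::string name;
  std::vector<std::shared_ptr<ToyTensor>> input_data;
  std::vector<std::shared_ptr<ToyTensor>> output_data;
  std::map<std::string, float> params;                       // operator params
  std::vector<std::shared_ptr<ToyOperator>> next_operators;  // may be empty
};

// Sketch of a graph: all operators plus the designated entry and exit nodes.
struct ToyGraph {
  std::vector<std::shared_ptr<ToyOperator>> operators;
  std::shared_ptr<ToyOperator> input_operator;
  std::shared_ptr<ToyOperator> output_operator;
  std::vector<std::shared_ptr<ToyTensor>> global_input_data;
};

// Builds the smallest possible graph: input -> output.
inline ToyGraph MakeTwoNodeGraph() {
  auto in = std::make_shared<ToyOperator>();
  in->name = "input";
  auto out = std::make_shared<ToyOperator>();
  out->name = "output";
  in->next_operators.push_back(out);

  ToyGraph graph;
  graph.operators.push_back(in);
  graph.operators.push_back(out);
  graph.input_operator = in;
  graph.output_operator = out;
  return graph;
}
```

Calling MakeTwoNodeGraph() gives a graph whose input operator has exactly one successor, the output operator, which itself has none.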

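Data flow in a Graph

As the figure below shows, a Graph has two ingredients: a set of operators, and the data paths that connect them.

In other words, the output of one operator becomes the input of the next. The data passed between outputs and inputs is carried by the Tensor class introduced earlier in this course.

[Figure: operators in a Graph and the data paths between them]

Data flow and control flow in a Graph

[Figure: control flow and data flow in a Graph]

As the figure shows, execution of a Graph can be divided logically into two paths: control flow and data flow. In the data flow, the output produced by one operator is passed to the following operators as their input.

So how does the Graph know which operators follow a given operator? The Operator definition above contains a field called Next operators, which records an operator's successors. In the figure, op1 has two successors, op2 and op3, and both are obtained through op1.next_operators.

Two things therefore matter when executing a graph:

Computing the current layer's output from its input via op.layer
Passing the current layer's output correctly to its successors. The path is previous op.output to next op.input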
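
The two steps above can be sketched in a few lines. This is only a toy under stated assumptions: Node, Data and RunAndPropagate are hypothetical names, the layer is modelled as a plain std::function, and data is a vector of floats rather than the course's Tensor.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Data = std::vector<float>;

// Hypothetical node: a computation plus input/output slots and successors.
struct Node {
  std::function<Data(const Data&)> layer;  // the computation of this node
  Data input;
  Data output;
  std::vector<Node*> next;  // successors (not owned)
};

// Step 1: compute this node's output from its input.
// Step 2: hand the output to every successor as its input.
inline void RunAndPropagate(Node& op) {
  op.output = op.layer(op.input);
  for (Node* successor : op.next) {
    successor->input = op.output;
  }
}
```

With op1 wired to op2 and op3, a single RunAndPropagate(op1) call leaves both successors holding op1's output as their input.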

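Executing the computation graph

Q: In what order should the nodes of a computation graph be executed?

A: The nodes are executed by breadth-first search; some would say this is simply one way of implementing a topological sort.

Execution of a computation graph, illustrated

Let's use a concrete graph to see how its nodes are scheduled.
[Figure: example graph with seven operators, op1 to op7]

The graph to execute has seven ops in total, op1 through op7.

The arrows show their dependencies. For example, op2, op3 and op4 are all successors of op1. In other words, they can only start once op1 has finished, and all three take their input from op1's output. The graph is executed in the following order.

By the definition of graph.input_operator, op1 is the starting node, so op1 is put into the execution queue first
op1 is taken out of the queue and executed, and its result is stored in op1.output_data. Using op1.output_operators we locate op1's three successors, op2, op3 and op4, copy op1.output_data into their inputs, and add them to the queue
The execution queue now holds three nodes: op2, op3 and op4. Following first-in-first-out order we take out op2 and execute it. Since op2 has no successors, the next iteration starts as soon as it finishes
We take op3 from the head of the queue. After it finishes, op3.output_data is copied to op5.input_data and op5 is put into the execution queue

…

The remaining steps proceed as shown in the figure. In every case, once a node finishes, its successors are found through current_op.output_operators and its output is copied to their inputs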

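Topological order in node execution

[Figure: op5 and op6 both feeding into op7]

In the figure above, op5 and op6 both have op7 as a successor. op5 executes first and copies its output to op7.input_data, but op7 cannot be enqueued yet, because its input also depends on its other predecessor, op6.

Only after op6 has also finished can op7 enter the execution queue. Viewed as a topological order: a node may enter the execution queue only when its in-degree is 0. Each predecessor that finishes reduces op7's in-degree by one, so only after both op5 and op6 have executed does op7's in-degree reach 0 and op7 can wait in the queue for its turn.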
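
The in-degree rule can be tried out on the seven-node example with a small queue-based scheduler. This is a rough sketch, not the project's code: ScheduleByInDegree and SevenNodeExample are hypothetical names, nodes are plain strings, and because the figure is not reproduced here the edge op4 -> op6 is an assumption (the text only states op1 -> op2/op3/op4, op3 -> op5, and op5/op6 -> op7).

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

using Edges = std::map<std::string, std::vector<std::string>>;

// Returns the order in which nodes leave the queue, starting from `start`
// (assumed to have no predecessors).
inline std::vector<std::string> ScheduleByInDegree(const Edges& successors,
                                                   const std::string& start) {
  // Count incoming edges for every node that appears as a successor.
  std::map<std::string, int> in_degree;
  for (const auto& entry : successors) {
    for (const auto& next : entry.second) {
      in_degree[next] += 1;
    }
  }

  std::vector<std::string> order;
  std::deque<std::string> queue;
  queue.push_back(start);
  while (!queue.empty()) {
    const std::string current = queue.front();
    queue.pop_front();
    order.push_back(current);

    const auto it = successors.find(current);
    if (it == successors.end()) continue;  // no successors, e.g. op2
    for (const auto& next : it->second) {
      in_degree[next] -= 1;  // one more predecessor has finished
      if (in_degree[next] == 0) queue.push_back(next);
    }
  }
  return order;
}

// The seven-node example from the text.
inline Edges SevenNodeExample() {
  Edges edges;
  edges["op1"] = {"op2", "op3", "op4"};
  edges["op3"] = {"op5"};
  edges["op4"] = {"op6"};  // assumed edge, not stated in the text
  edges["op5"] = {"op7"};
  edges["op6"] = {"op7"};
  return edges;
}
```

Under these assumed edges, ScheduleByInDegree(SevenNodeExample(), "op1") visits op1 through op7 in numeric order, and op7 is only enqueued after op6, its second predecessor, has finished.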

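Graph scheduling as implemented in the project

The project's scheduler reproduces the example above. In this section we read the code to see how the breadth-first search (topological sort) is done.

Finding successors and copying the previous output to them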

void RuntimeGraph::ProbeNextLayer(
    const std::shared_ptr<RuntimeOperator> &current_op,
    std::deque<std::shared_ptr<RuntimeOperator>> &operator_queue,
    std::vector<std::shared_ptr<Tensor<float>>> layer_output_datas) {
  const auto &next_ops = current_op->output_operators;

  std::vector<std::vector<std::shared_ptr<ftensor>>> next_input_datas_arr;
  for (const auto &next_op : next_ops) {
    const auto &next_rt_operator = next_op.second;
    const auto &next_input_operands = next_rt_operator->input_operands;
    // Locate a successor that takes input from current_op
    if (next_input_operands.find(current_op->name) !=
        next_input_operands.end()) {
      std::vector<std::shared_ptr<ftensor>> next_input_datas =
          next_input_operands.at(current_op->name)->datas;
      next_input_datas_arr.push_back(next_input_datas);
      next_rt_operator->meet_num += 1;
      if (std::find(operator_queue.begin(), operator_queue.end(),
                    next_rt_operator) == operator_queue.end()) {
        if (CheckOperatorReady(next_rt_operator)) {
          operator_queue.push_back(next_rt_operator);
        }
      }
    }
  }
  SetOpInputData(layer_output_datas, next_input_datas_arr);
}

This step is implemented in the ProbeNextLayer function in runtime_ir.cpp. Let's first look at its parameters.

void RuntimeGraph::ProbeNextLayer(
    const std::shared_ptr<RuntimeOperator> &current_op,
    std::deque<std::shared_ptr<RuntimeOperator>> &operator_queue,
    std::vector<std::shared_ptr<Tensor<float>>> layer_output_datas)

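The function takes three parameters: current_op, operator_queue and layer_output_datas.

current_op is the node that has just finished executing, operator_queue is the execution queue described in the previous section, and layer_output_datas is the output produced by executing current_op.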

 const auto &next_ops = current_op->output_operators;
 std::vector<std::vector<std::shared_ptr<ftensor>>> next_input_datas_arr;

This gets next_ops, the successors of the current node current_op.

  std::vector<std::vector<std::shared_ptr<ftensor>>> next_input_datas_arr;
  for (const auto &next_op : next_ops) {
    const auto &next_rt_operator = next_op.second;
    const auto &next_input_operands = next_rt_operator->input_operands;

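Here we iterate over next_ops, taking one successor next_op at a time, and obtain a reference to its input operands.

Why do we need next_op.input_operands? Because that is where current_op.output_data will be copied, completing the transfer from current_op's output to its successor's input.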

next_rt_operator->meet_num += 1;
if (std::find(operator_queue.begin(), operator_queue.end(),
              next_rt_operator) == operator_queue.end()) {
  if (CheckOperatorReady(next_rt_operator)) {
    operator_queue.push_back(next_rt_operator);
  }
}

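Note meet_num here. For a node next_operator, once meet_num equals the number of its predecessors, the node can be put into the execution queue.

[Figure: op5 and op6 both feeding into op7]

For op7, when op7.meet_num equals 2, the number of its predecessors, op7 can be enqueued. When does meet_num increase? Each time one of the op's predecessors visits it, meet_num is incremented by 1.

When op5 finishes and calls ProbeNextLayer, op7.meet_num = 1. When op6 finishes and calls ProbeNextLayer, op7.meet_num = 2, and only then is op7 put into the execution queue.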

bool RuntimeGraph::CheckOperatorReady(
    const std::shared_ptr<RuntimeOperator> &op) {
  CHECK(op != nullptr);
  CHECK(op->meet_num <= op->input_operands.size());
  if (op->meet_num == op->input_operands.size()) {
    return true;
  } else {
    return false;
  }
}

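The check itself lives in CheckOperatorReady, which returns true once meet_num equals the number of predecessors. ProbeNextLayer calls it and, when it returns true, puts the successor into the execution queue.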
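
The op7 case can be replayed with a stripped-down counter. The sketch below is only an illustration of the meet_num idea: ToyOp, IsReady and VisitFromPredecessor are hypothetical names, and predecessor_count stands in for input_operands.size().

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node that only tracks how many predecessors have visited it.
struct ToyOp {
  std::size_t predecessor_count = 0;  // stands in for input_operands.size()
  std::size_t meet_num = 0;           // predecessors that have finished so far
};

// Ready once every predecessor has visited the node.
inline bool IsReady(const ToyOp& op) {
  return op.meet_num == op.predecessor_count;
}

// Called once for each predecessor that finishes.
// Returns true when the node may be put into the execution queue.
inline bool VisitFromPredecessor(ToyOp& op) {
  assert(op.meet_num < op.predecessor_count);
  op.meet_num += 1;
  return IsReady(op);
}
```

For a node with two predecessors, the first call returns false (meet_num is 1) and the second returns true (meet_num is 2), which mirrors op5 and then op6 visiting op7.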

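The output of the current layer current_op, layer_output_datas, is passed to the next level by SetOpInputData.

We won't go through this function in detail. In short, it copies the output tensors in layer_output_datas into next_input_datas_arr, the array of tensors that serve as the successors' inputs. Copying pointers costs almost nothing.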

 SetOpInputData(layer_output_datas, next_input_datas_arr);
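
Since SetOpInputData itself is not shown here, the following is not its implementation. It is just a small sketch of why handing over std::shared_ptr values is cheap: only the pointer is copied, and every successor ends up looking at the same tensor. ToyTensor, TensorPtr and ShareOutputs are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using ToyTensor = std::vector<float>;  // stand-in for Tensor<float>
using TensorPtr = std::shared_ptr<ToyTensor>;

// Hypothetical helper: copy the output pointers into each successor's
// input slots. No tensor contents are duplicated.
inline void ShareOutputs(const std::vector<TensorPtr>& outputs,
                         std::vector<std::vector<TensorPtr>>& next_inputs) {
  for (auto& inputs : next_inputs) {
    assert(inputs.size() == outputs.size());
    for (std::size_t i = 0; i < outputs.size(); ++i) {
      inputs[i] = outputs[i];  // copies the pointer, not the data
    }
  }
}
```

After sharing one output tensor with two successors, all three pointers refer to the same object and its use_count() is 3.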

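Implementing the breadth-first execution order

[Figure: the seven-operator example graph and its execution order]

Take the graph from earlier: it consists of seven nodes, op1 to op7, executed in the order shown. What we implement is a breadth-first search that reproduces this process.

The breadth-first execution uses a queue: successors whose in-degree has dropped to 0 are put into the queue, and in each subsequent iteration the nodes in the queue are executed in first-in-first-out order.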


std::vector<std::shared_ptr<Tensor<float>>>
RuntimeGraph::Forward(const std::vector<std::shared_ptr<Tensor<float>>> &inputs,
                      bool debug) {
  if (graph_state_ < GraphState::Complete) {
    LOG(FATAL) << "Graph need be build!";
  }
  CHECK(graph_state_ == GraphState::Complete)
      << "Graph status error, current state is " << int(graph_state_);

  std::shared_ptr<RuntimeOperator> input_op;
  if (input_operators_maps_.find(input_name_) == input_operators_maps_.end()) {
    LOG(FATAL) << "Can not find the input node: " << input_name_;
  } else {
    input_op = input_operators_maps_.at(input_name_);
  }

  std::shared_ptr<RuntimeOperator> output_op;
  if (output_operators_maps_.find(output_name_) ==
      output_operators_maps_.end()) {
    LOG(FATAL) << "Can not find the output node: " << output_name_;
  } else {
    output_op = output_operators_maps_.at(output_name_);
  }

  std::deque<std::shared_ptr<RuntimeOperator>> operator_queue;
  operator_queue.push_back(input_op);
  std::map<std::string, double> run_duration_infos;



  while (!operator_queue.empty()) {
    std::shared_ptr<RuntimeOperator> current_op = operator_queue.front();
    operator_queue.pop_front();

    if (!current_op || current_op == output_op) {
      LOG(INFO) << "Model Inference End";
      break;
    }

    if (current_op == input_op) {
      ProbeNextLayer(current_op, operator_queue, inputs);
    } else {
      std::string current_op_name = current_op->name;
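      if (!CheckOperatorReady(current_op)) {
        if (operator_queue.empty()) {
          // current_op is the last node in the queue, so it can never become ready
          LOG(FATAL) << "Current operator is not ready!";
          break;
        } else {
          // Other nodes are still queued, so current_op may become ready later
          operator_queue.push_back(current_op);
          // Skip execution for now and retry on a later iteration
          continue;
        }
      }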

      const std::vector<std::shared_ptr<RuntimeOperand>> &input_operand_datas =
          current_op->input_operands_seq;
      std::vector<std::shared_ptr<Tensor<float>>> layer_input_datas;
      for (const auto &input_operand_data : input_operand_datas) {
        for (const auto &input_data : input_operand_data->datas) {
          layer_input_datas.push_back(input_data);
        }
      }

      CHECK(!layer_input_datas.empty()) << "Layer input data is empty";
      CHECK(current_op->output_operands != nullptr &&
            !current_op->output_operands->datas.empty())
          << "Layer output data is empty";

      const auto &start = std::chrono::steady_clock::now();
      ProbeNextLayer(current_op, operator_queue,
                     current_op->output_operands->datas);
    }
  }

  for (const auto &op : this->operators_) {
    op->meet_num = 0;
  }

  CHECK(output_op->input_operands.size() == 1)
      << "The graph only support one path to the output node yet!";
  const auto &output_op_input_operand = output_op->input_operands.begin();
  const auto &output_operand = output_op_input_operand->second;
  return output_operand->datas;
}

This logic lives in the Forward function. It takes two parameters: inputs is the model's input tensors, and debug controls whether debug logging is enabled.

std::vector<std::shared_ptr<Tensor<float>>> RuntimeGraph::Forward(
    const std::vector<std::shared_ptr<Tensor<float>>> &inputs, bool debug)

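Forward first checks the graph state. Scheduling can only run when the state is Complete, which is reached once:

All compute nodes in the graph have been initialized
The input and output operators have had their storage allocated

input_op is the starting node of the whole graph, that is, the model's entry point.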

  if (graph_state_ < GraphState::Complete) {
    LOG(FATAL) << "Graph need be build!";
  }
  CHECK(graph_state_ == GraphState::Complete)
          << "Graph status error, current state is " << int(graph_state_);

  std::shared_ptr<RuntimeOperator> input_op;
  if (input_operators_maps_.find(input_name_) == input_operators_maps_.end()) {
    LOG(FATAL) << "Can not find the input node: " << input_name_;
  } else {
    input_op = input_operators_maps_.at(input_name_);
  }

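The input node is pushed into the execution queue. The queue is the variable operator_queue, a deque, which makes it easy to insert at the back and remove from the front (first-in-first-out).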

std::deque<std::shared_ptr<RuntimeOperator>> operator_queue;
operator_queue.push_back(input_op);
  
std::map<std::string, double> run_duration_infos;

while (!operator_queue.empty()) {
    std::shared_ptr<RuntimeOperator> current_op = operator_queue.front();
    operator_queue.pop_front();

    if (!current_op || current_op == output_op) {
      if (debug) {
        LOG(INFO) << "Model Inference End";
      }
      break;
    }  
    ......
}

std::shared_ptr<RuntimeOperator> current_op = operator_queue.front(); takes the next node to execute from the queue, in first-in-first-out order.

if (current_op == input_op) {
	ProbeNextLayer(current_op, operator_queue, inputs);
}

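There are two cases. If the current node is the input node, ProbeNextLayer is used directly to copy the model input into the layer that follows the input node.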

std::string current_op_name = current_op->name;
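if (!CheckOperatorReady(current_op)) {
  if (operator_queue.empty()) {
    // current_op is the last node in the queue, so it can never become ready
    LOG(FATAL) << "Current operator is not ready!";
    break;
  } else {
    // Other nodes are still queued, so current_op may become ready later
    operator_queue.push_back(current_op);
    // Skip execution for now and retry on a later iteration
    continue;
  }
}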

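If the current node (current_op) is not the input node (input_operator), we check whether it is ready. The check again uses CheckOperatorReady, which tests whether meet_num equals the number of predecessors, i.e. whether the node's remaining in-degree is 0. If so, the node is allowed to execute.

If the node is not ready yet, it has to be put back into operator_queue.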
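
The put-it-back behaviour can be seen in isolation with a toy queue. This is a sketch under simplifying assumptions, not the project's Forward loop: ToyOp and DrainQueue are hypothetical names, all nodes are placed in the queue up front, and an exception replaces LOG(FATAL).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <string>
#include <vector>

// Toy node: all names here are hypothetical, not the project's RuntimeOperator.
struct ToyOp {
  std::string name;
  std::size_t predecessor_count;
  std::size_t meet_num;
  std::vector<ToyOp*> next;  // successors (not owned)
};

// Drains the queue, deferring any node whose predecessors have not all run.
inline std::vector<std::string> DrainQueue(std::deque<ToyOp*> queue) {
  std::vector<std::string> order;
  while (!queue.empty()) {
    ToyOp* current = queue.front();
    queue.pop_front();
    if (current->meet_num != current->predecessor_count) {
      if (queue.empty()) {
        // Nothing else is left to run, so this node can never become ready.
        throw std::runtime_error("operator is not ready: " + current->name);
      }
      queue.push_back(current);  // retry after the other queued nodes
      continue;
    }
    order.push_back(current->name);
    for (ToyOp* successor : current->next) {
      successor->meet_num += 1;  // tell each successor one input has arrived
    }
  }
  return order;
}
```

If b depends on a but happens to sit ahead of it in the queue, b is deferred once and the resulting order is a, then b. Note that this toy, like any retry-by-requeue scheme, would spin forever if two queued nodes could never become ready, so it relies on the graph being acyclic and fully connected to the input.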

const std::vector<std::shared_ptr<RuntimeOperand>> &input_operand_datas 
    = current_op->input_operands_seq;
std::vector<std::shared_ptr<Tensor<float>>> layer_input_datas;
for (const auto &input_operand_data : input_operand_datas) {
   for (const auto &input_data : input_operand_data->datas) {
       layer_input_datas.push_back(input_data);
   }
}

The inputs of the current op are gathered into layer_input_datas, that is, from op->input_operands_seq into layer_input_datas. Only pointers are copied, so the cost is negligible.

InferStatus status = current_op->layer->Forward(layer_input_datas, current_op->output_operands->datas);

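Once the op itself is ready and its inputs have been gathered into layer_input_datas, the operator is executed. Operator execution itself is not covered in this lesson.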

ProbeNextLayer(current_op, operator_queue, current_op->output_operands->datas);

After execution, the output of the current operator current_op is synchronized to the inputs of its successors.

while (!operator_queue.empty())

The loop exits once every node in the execution queue has been executed and no unexecuted nodes remain in the graph.

  CHECK(output_op->input_operands.size() == 1)
      << "The graph only support one path to the output node yet!";
  const auto &output_op_input_operand = output_op->input_operands.begin();
  const auto &output_operand = output_op_input_operand->second;
  return output_operand->datas;

The input operand of the output operator is returned as the final result. Put another way, the input tensors of the output node are the final result.

Debugging graph execution

TEST(test_forward, forward1) {
  using namespace kuiper_infer;
  const std::string &param_path = "tmp/resnet18_hub.pnnx.param";
  const std::string &weight_path = "tmp/resnet18_hub.pnnx.bin";
  RuntimeGraph graph(param_path, weight_path);
  graph.Build("pnnx_input_0", "pnnx_output_0");
  const auto &operators = graph.operators();
  LOG(INFO) << "operator size: " << operators.size();
  uint32_t batch_size = 2;
  std::vector<sftensor> inputs(batch_size);
  for (uint32_t i = 0; i < batch_size; ++i) {
    inputs.at(i) = std::make_shared<ftensor>(3, 224, 224);
    inputs.at(i)->Fill(1.f);
  }
  const std::vector<sftensor>& outputs = graph.Forward(inputs, true);
}

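Here we are checking the execution order of a ResNet graph; part of its structure is shown in the figure. To make it easier to follow, each node's name is marked to its right.

[Figure: part of the ResNet graph, with node names marked on the right]
The execution order printed by our code matches the order in the figure.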

I20230222 12:56:27.897986 10787 runtime_ir.cpp:541] current operator: convbn2d_0
I20230222 12:56:27.898545 10787 runtime_ir.cpp:541] current operator: relu
I20230222 12:56:27.898821 10787 runtime_ir.cpp:541] current operator: maxpool
I20230222 12:56:27.898998 10787 runtime_ir.cpp:541] current operator: convbn2d_1
I20230222 12:56:27.899176 10787 runtime_ir.cpp:541] current operator: layer1.0.relu
I20230222 12:56:27.899349 10787 runtime_ir.cpp:541] current operator: convbn2d_2
I20230222 12:56:27.899520 10787 runtime_ir.cpp:541] current operator: pnnx_expr_14