Intelligent Computing Systems Lab (1): BANGC Operator Implementation and TensorFlow Integration


两个幽灵


Article link: https://blog.csdn.net/u010099177/article/details/113096813


This lab dates from June 2020. I have now decided to publish it.

1.1 Operator Implementation and Testing

1.1.1 Operator Implementation

(1) Initialize the environment: cd /opt/AICSE-demo-student/env; source env.sh

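(2) Open plugin_power_difference_kernel.h and complete the header file. This requires deciding on the parameter list. PowerDifference computes $(X - Y)^Z$, where $Z$ is a scalar, so at least 3 parameters are needed. I later found the declaration of PowerDifference in the course slides, which confirmed that it takes 5 parameters.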

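(3) Understand the memory hierarchy. As Figure 1-1 shows, data in device memory (DRAM) and in shared memory (SRAM) must be copied to NRAM or WRAM before it can be used in computation.
Figure 1-1 Memory hierarchy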

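(4) Algorithm design

The factors to consider in this step are:

NRAM is limited, so the data in DRAM must be copied to NRAM block by block for computation;
the Task dimension must be set, with different Tasks handled by different Cores;
the memory taken by intermediate data should be kept small.

The first two factors mainly determine ① how many blocks the data is split into and ② how the offset of each block is computed; the last factor mainly determines ③ how many intermediate variables are needed.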

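① The data is split into len/ONELINE blocks, and each core gets len/ONELINE/taskDim blocks. ONELINE is the size of one block.
② Suppose there are 80 blocks and 8 tasks. Task 0 computes blocks 0 to 9, task 1 computes blocks 10 to 19, and so on. With the outer loop variable i running from 0 to 9 and taskId as the task index, the block to process is i + taskId × 10.
③ First, input1_nram holds a block of $\mathbf{X}$ and input2_nram holds a block of $\mathbf{Y}$. $\mathbf{X} - \mathbf{Y}$ is stored into input1_nram, so input1_nram now holds $(\mathbf{X} - \mathbf{Y})^1$. Then input1_nram is multiplied by itself and the result is stored into input2_nram, so input2_nram holds $(\mathbf{X} - \mathbf{Y})^2$. For larger Z, input1_nram is repeatedly multiplied with input2_nram and the result is stored back into input2_nram.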
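
The block-to-task mapping in ① and ② can be sketched on the host in plain C++. The helper names below (ceil_div, blocks_per_task, block_offset) are hypothetical and appear nowhere in the lab code. The sketch also uses ceiling division rather than the plain division described above, on the assumption that a partial last block should still be counted.

```cpp
#include <cassert>

// Hypothetical helpers mirroring the partitioning described above.
// Ceiling division for non-negative a and positive b.
int ceil_div(int a, int b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// Number of blocks each task handles when `len` elements are cut into
// blocks of `oneline` elements and spread over `task_dim` tasks.
int blocks_per_task(int len, int oneline, int task_dim) {
  return ceil_div(ceil_div(len, oneline), task_dim);
}

// Element offset of the i-th block handled by task `task_id`.
int block_offset(int i, int task_id, int per_task, int oneline) {
  return oneline * (i + per_task * task_id);
}
```

With 80 blocks of 512 elements and 8 tasks, blocks_per_task gives 10, and the last block of task 7 starts at element 79 × 512. When the length is not a multiple of the block size, some (i, task_id) pairs point past the end of the data, so the caller has to add a bounds check.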

Listing 1-1 PowerDifference operator implementation

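#define ONELINE 512
// PowerDifference BCL kernel: output = (input1 - input2)^pow, element-wise.
// The data is processed in blocks of ONELINE elements spread across taskDim tasks.
__mlu_entry__ void PowerDifferenceKernel(half* input1, half* input2, int32_t pow, half* output, int32_t len)
{
  // taskId and taskDim are built-in variables.
  // Only pow >= 1 is handled; pow == 0 (an all-ones output) is not implemented.
  if (pow < 1) return;
  // __bang_printf("total length %d, task dim %d\n", len, taskDim);
  int32_t total_blocks = (len + ONELINE - 1) / ONELINE;       // round up to count a partial tail block
  int32_t quotient = (total_blocks + taskDim - 1) / taskDim;  // blocks per task

  __nram__ half input1_nram[ONELINE];
  __nram__ half input2_nram[ONELINE];

  for (int32_t i = 0; i < quotient; i++) {
    int32_t block = i + quotient * taskId;
    if (block >= total_blocks) break;  // no blocks left for this task
    int32_t offset = ONELINE * block;
    // Copy only the valid elements; the last block may be shorter than ONELINE.
    int32_t count = (len - offset < ONELINE) ? (len - offset) : ONELINE;
    int32_t copy_size = count * sizeof(half);
    __memcpy(input1_nram, input1 + offset, copy_size, GDRAM2NRAM);
    __memcpy(input2_nram, input2 + offset, copy_size, GDRAM2NRAM);
    // Compute on the whole buffer; tail values beyond `count` are never copied out.
    __bang_sub(input1_nram, input1_nram, input2_nram, ONELINE);  // input1_nram = (X - Y)^1
    if (pow == 1) {
      __memcpy(output + offset, input1_nram, copy_size, NRAM2GDRAM);
      continue;
    }
    __bang_mul(input2_nram, input1_nram, input1_nram, ONELINE);  // input2_nram = (X - Y)^2
    for (int32_t j = 2; j < pow; j++)
    {
      __bang_mul(input2_nram, input2_nram, input1_nram, ONELINE);
    }
    __memcpy(output + offset, input2_nram, copy_size, NRAM2GDRAM);
  }
}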
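
To reason about the kernel above without an MLU, the same two-buffer scheme can be simulated on the host. This is only a sketch under stated assumptions: float stands in for half, plain loops stand in for __bang_sub and __bang_mul, std::copy stands in for __memcpy, and the function name simulate_task, the tiny block size kOneLine and the bounds and short-tail handling are choices made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kOneLine = 4;  // small block size for illustration (the kernel uses 512)

// Host-side simulation of the work done by one task (hypothetical helper).
void simulate_task(const std::vector<float>& in1, const std::vector<float>& in2,
                   int pow, std::vector<float>& out, int task_id, int task_dim) {
  assert(pow >= 1 && task_dim > 0);
  assert(in1.size() == in2.size() && in1.size() == out.size());
  const int len = static_cast<int>(in1.size());
  const int total_blocks = (len + kOneLine - 1) / kOneLine;
  const int per_task = (total_blocks + task_dim - 1) / task_dim;
  float buf1[kOneLine];  // plays the role of input1_nram
  float buf2[kOneLine];  // plays the role of input2_nram
  for (int i = 0; i < per_task; ++i) {
    const int block = i + per_task * task_id;
    if (block >= total_blocks) break;                    // this task has no more blocks
    const int offset = block * kOneLine;
    const int count = std::min(kOneLine, len - offset);  // the last block may be short
    for (int k = 0; k < count; ++k) buf1[k] = in1[offset + k] - in2[offset + k];
    const float* result = buf1;                          // holds (X - Y)^1
    if (pow >= 2) {
      for (int k = 0; k < count; ++k) buf2[k] = buf1[k] * buf1[k];  // (X - Y)^2
      for (int j = 2; j < pow; ++j) {
        for (int k = 0; k < count; ++k) buf2[k] *= buf1[k];         // raise the power by one
      }
      result = buf2;
    }
    std::copy(result, result + count, out.begin() + offset);
  }
}
```

Calling simulate_task once for every task_id from 0 to task_dim - 1 fills the whole output, which makes it easy to check that no block is skipped or processed twice.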

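1.1.2 Operator Testing

PowerDiff.cpp needs to be completed. The main work is the cnrtInvokeKernel call and the data copies. The CNRT documentation contains an example of cnrtInvokeKernel_V2 that can be used as a reference. The data copies must move data between host memory and device DRAM in both directions.

Listing 1-2 Data copy and kernel invocation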

  // Copy the inputs from host memory to device DRAM
  cnrtMemcpy(mlu_input1, input1_half, dims_a * sizeof(half), CNRT_MEM_TRANS_DIR_HOST2DEV);
  cnrtMemcpy(mlu_input2, input2_half, dims_a * sizeof(half), CNRT_MEM_TRANS_DIR_HOST2DEV);
  // Optional: initializes the device output buffer, which the kernel is expected to overwrite
  cnrtMemcpy(mlu_output, output_half, dims_a * sizeof(half), CNRT_MEM_TRANS_DIR_HOST2DEV);
  ......
  cnrtInvokeKernel_V2((void *)&PowerDifferenceKernel, dim, params, c, pQueue);
  ......
  // Copy the result from device DRAM back to host memory
  cnrtMemcpy(output_half, mlu_output, dims_a * sizeof(half), CNRT_MEM_TRANS_DIR_DEV2HOST);
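
Once the output is back in host memory, one way to check it is against a CPU reference of $(X - Y)^Z$. The sketch below is an assumption of this write-up rather than part of PowerDiff.cpp: the names power_difference_ref and all_close are hypothetical, and float stands in for the device's half type, so the values would first have to be converted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: out[i] = (x[i] - y[i])^z for a non-negative integer z.
std::vector<float> power_difference_ref(const std::vector<float>& x,
                                        const std::vector<float>& y,
                                        int z) {
  assert(x.size() == y.size());
  assert(z >= 0);
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float d = x[i] - y[i];
    float acc = 1.0f;              // d^0
    for (int k = 0; k < z; ++k) {  // multiply z times to reach d^z
      acc *= d;
    }
    out[i] = acc;
  }
  return out;
}

// True when every element of `got` is within `tol` of the reference.
bool all_close(const std::vector<float>& got, const std::vector<float>& ref, float tol) {
  if (got.size() != ref.size()) return false;
  for (std::size_t i = 0; i < got.size(); ++i) {
    if (std::fabs(got[i] - ref[i]) > tol) return false;
  }
  return true;
}
```

Because half has far less precision than float, the tolerance passed to all_close has to be chosen fairly loosely when comparing real device output.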

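Figure 1-2 Single-operator test

I wrote a script that runs the single-operator test 50 times; the script is shown in Figure 1-2. Over the 50 runs, the average run time is about 30 ms.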
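
The original script survives only in the figure, so the following is a C++ stand-in rather than a reconstruction of it. It shows one way to average the wall-clock time of repeated runs; mean_runtime_ms is a hypothetical helper, and the callable passed in would wrap a single run of the operator test.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run `fn` `runs` times and return the mean
// wall-clock time of one run in milliseconds.
double mean_runtime_ms(const std::function<void()>& fn, int runs) {
  assert(runs > 0);
  using clock = std::chrono::steady_clock;
  double total_ms = 0.0;
  for (int i = 0; i < runs; ++i) {
    const auto start = clock::now();
    fn();
    const auto stop = clock::now();
    total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
  }
  return total_ms / runs;
}
```

steady_clock is used because it is monotonic, so a system clock adjustment in the middle of the measurement cannot produce a negative interval.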

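1.2 Operator Integration and Framework Operator Testing

Operator integration consists of CNPlugin integration and TensorFlow operator integration. The CNPlugin integration task is to complete plugin_power_difference_op.cc and cnplugin.h and then build the new Cambricon-CNPlugin.

plugin_power_difference_op.cc contains 4 functions, 3 of which need to be completed.

① cnmlCreatePluginPowerDifferenceOpParam must take the parameters that the kernel needs. Following the pattern of other operators, it then allocates a struct through *param and stores the arguments in it.

Listing 1-3 The cnmlCreatePluginPowerDifferenceOpParam function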

cnmlStatus_t cnmlCreatePluginPowerDifferenceOpParam(  // create the operator parameters
  cnmlPluginPowerDifferenceOpParam_t *param,
  // added parameters
  half* input1, half* input2, int pow, half* output, int len
) {
  *param = new cnmlPluginPowerDifferenceOpParam();
  // store the arguments in the struct
  (*param)->input1 = input1;
  (*param)->input2 = input2;
  (*param)->pow = pow;
  (*param)->output = output;
  (*param)->len = len;
  return CNML_STATUS_SUCCESS;
}
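
The create function allocates with new, so the matching destroy function (declared in cnplugin.h but not shown in this post) presumably has to release the struct. The sketch below illustrates that create/destroy pairing with stand-in types so that it compiles without the Cambricon headers. DemoParam and the demo_* functions are hypothetical and are not CNPlugin APIs.

```cpp
#include <cassert>

// Stand-in parameter struct; the real one also carries the tensor pointers.
struct DemoParam {
  int pow;
  int len;
};
typedef DemoParam* DemoParam_t;

// Allocate the struct and hand it back through the out-pointer.
bool demo_create_param(DemoParam_t* param, int pow, int len) {
  if (param == nullptr) return false;
  *param = new DemoParam();
  (*param)->pow = pow;
  (*param)->len = len;
  return true;
}

// Release the struct and null the handle so it cannot be reused by mistake.
bool demo_destroy_param(DemoParam_t* param) {
  if (param == nullptr) return false;
  delete *param;  // deleting a null pointer is a no-op
  *param = nullptr;
  return true;
}
```

Passing the handle by pointer in both functions lets the destroy side reset the caller's handle, which is why the declared destroy function also takes a cnmlPluginPowerDifferenceOpParam_t*.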

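② cnmlCreatePluginPowerDifferenceOp: first fill in the parameter list according to how the function is called in mlu_lib_ops.cc, then mark the input and output parameters in the same order as the kernel function's parameters, and finally complete the call to cnmlCreatePluginOp.

Listing 1-4 The cnmlCreatePluginPowerDifferenceOp function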

cnmlStatus_t cnmlCreatePluginPowerDifferenceOp( // create the operator
  cnmlBaseOp_t *op,
  // added parameters
  cnmlTensor** inputs_ptr, int pow, cnmlTensor** outputs_ptr, int len
) {
  cnrtKernelParamsBuffer_t params;
  cnrtGetKernelParamsBuffer(&params);
  // Register the parameters in the order of the kernel signature:
  // input1, input2, pow, output, len
  cnrtKernelParamsBufferMarkInput(params);
  cnrtKernelParamsBufferMarkInput(params);
  cnrtKernelParamsBufferAddParam(params, &pow, sizeof(int));
  cnrtKernelParamsBufferMarkOutput(params);
  cnrtKernelParamsBufferAddParam(params, &len, sizeof(int));
  void **InterfacePtr = reinterpret_cast<void **>(&PowerDifferenceKernel);

  // 2 input tensors, 1 output tensor, no static tensors
  cnmlCreatePluginOp(op,
                     "PowerDifference",
                     InterfacePtr, params,
                     inputs_ptr, 2,
                     outputs_ptr, 1,
                     nullptr, 0);
  cnrtDestroyKernelParamsBuffer(params);
  return CNML_STATUS_SUCCESS;
}
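
The point that is easy to get wrong here is the registration order, which has to follow the kernel signature (input1, input2, pow, output, len). The sketch below uses a hypothetical stand-in buffer that only records the order of the calls. DemoParamsBuffer is not a CNRT type, and the sketch says nothing about how the real buffer is laid out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a kernel parameter buffer: it only records
// the order in which slots are registered.
struct DemoParamsBuffer {
  std::vector<std::string> slots;
  void mark_input() { slots.push_back("input"); }
  void mark_output() { slots.push_back("output"); }
  void add_scalar(const std::string& name) { slots.push_back(name); }
};

// Register slots in the same order as the kernel's parameter list.
DemoParamsBuffer build_power_difference_params() {
  DemoParamsBuffer buf;
  buf.mark_input();       // input1
  buf.mark_input();       // input2
  buf.add_scalar("pow");  // scalar exponent
  buf.mark_output();      // output
  buf.add_scalar("len");  // element count
  return buf;
}
```

If, for example, the two scalars were registered together after the tensors, the recorded order would no longer match the kernel signature.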

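③ cnmlComputePluginPowerDifferenceOpForward: first fill in the parameter list according to how the function is called in mlu_lib_ops.cc, then, following the pattern of other operators, fill in the arguments of cnmlComputePluginOpForward_V4.

Listing 1-5 The cnmlComputePluginPowerDifferenceOpForward function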

cnmlStatus_t cnmlComputePluginPowerDifferenceOpForward(
  cnmlBaseOp_t op,
  // added parameters
  void** inputs_ptr, void** outputs_ptr,
  cnrtQueue_t queue
) {
  // Run the forward computation: 2 inputs, 1 output
  cnmlComputePluginOpForward_V4(op,
                                nullptr, inputs_ptr, 2,
                                nullptr, outputs_ptr, 1,
                                queue,
                                nullptr);
  return CNML_STATUS_SUCCESS;
}

To complete cnplugin.h, first define the cnmlPluginPowerDifferenceOpParam struct, then copy the declarations of the 4 functions in plugin_power_difference_op.cc into the header.

Listing 1-6 Completing cnplugin.h

struct cnmlPluginPowerDifferenceOpParam {
  half* input1;
  half* input2;
  int pow;
  half* output;
  int len;
};

typedef cnmlPluginPowerDifferenceOpParam* cnmlPluginPowerDifferenceOpParam_t;

cnmlStatus_t cnmlCreatePluginPowerDifferenceOpParam(  
  cnmlPluginPowerDifferenceOpParam_t *param,
  half* input1, half* input2, int pow, half* output, int len
);
cnmlStatus_t cnmlDestroyPluginPowerDifferenceOpParam( // destroy the operator parameters
  cnmlPluginPowerDifferenceOpParam_t *param
);
cnmlStatus_t cnmlCreatePluginPowerDifferenceOp(
  cnmlBaseOp_t *op,
  cnmlTensor** inputs_ptr, int pow, cnmlTensor** outputs_ptr, int len
);
cnmlStatus_t cnmlComputePluginPowerDifferenceOpForward(
  cnmlBaseOp_t op,
  void** inputs_ptr, void** outputs_ptr,
  cnrtQueue_t queue
);