A Quick Try of TensorFlow Runtime (TFRT)

Author: RainMaker

Article link: https://blog.csdn.net/mengkevin/article/details/115393842

Table of contents

Preface
I. Compilation flow
II. Verification flow
- 1. Building the tools
- 2. Testing and verification
Update on April 25

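Preface

Over the past few days I tried out TensorFlow Runtime (TFRT). I will not go over its pros and cons, since the official documentation covers them in detail. At its core, TFRT still walks a DAG, inspects each operation in turn, finds the corresponding kernel, and runs it on the target machine. With this quick experiment I wanted answers to two questions:

How does TFRT integrate with MLIR?
How does TFRT produce machine code?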

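I. Compilation flow

In a sense, TFRT is a runtime tailored to MLIR, and it exists as a set of dialects itself. What makes it special is that it is designed as a final dialect: it is not lowered to any other in-tree dialect. Instead, running backend kernels is supported within TFRT itself.

So how do other dialects hook into TFRT? Taking the CPU backend as an example, two pairs of ops are provided:

cpurt.compile and cpurt.execute
cpurt.corert.compile and cpurt.corert.execute

compile compiles MLIR IR expressed in other dialects, and execute runs the compiled result. compile proceeds in the following steps:

1. Lower the IR expressed in other dialects to the dialects that ship with LLVM/MLIR.
2. Lower the standard dialect and similar dialects to the llvm dialect.
3. Use the MLIR execution engine to turn the llvm dialect into machine code.

Step 1 is customized by the user, who passes a pass pipeline to the compile op as a parameter; steps 2 and 3 are done inside the compile op.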
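
The split between the caller-supplied stage and the two built-in stages can be pictured with a toy model. The sketch below is not TFRT or MLIR code: ToyModule, Pass, lower and toyCompile are hypothetical names, and a module is reduced to the set of dialect names it still uses.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for an MLIR module: it only tracks which dialects are in use.
struct ToyModule {
  std::set<std::string> dialects;
};

using Pass = std::function<void(ToyModule&)>;

// Builds a pass that rewrites one dialect into another (names only).
Pass lower(const std::string& from, const std::string& to) {
  return [from, to](ToyModule& m) {
    if (m.dialects.erase(from) > 0) m.dialects.insert(to);
  };
}

// Mimics the three steps: (1) the caller-supplied pipeline, (2) the built-in
// lowering to the llvm dialect, (3) code generation, which in this toy model
// only checks that nothing but the llvm dialect is left.
bool toyCompile(ToyModule& m, const std::vector<Pass>& userPipeline) {
  for (const Pass& p : userPipeline) p(m);          // step 1
  for (const char* d : {"std", "memref", "math"})   // step 2
    lower(d, "llvm")(m);
  return m.dialects == std::set<std::string>{"llvm"};  // step 3
}
```

In this model, a module that uses linalg, scf and memref compiles only if the caller passes a pipeline such as {lower("linalg", "scf"), lower("scf", "std")}. With an empty pipeline, linalg is left over and toyCompile returns false.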

II. Verification flow

1. Building the tools

Build the following tools:

$ bazel build //tools:tfrt_translate
$ bazel build //tools:bef_executor

tfrt_translate can save IR expressed in the TFRT dialects as a Binary Executable Format (BEF) file, and can also convert BEF back to MLIR; bef_executor interprets and executes BEF files.

2. Testing and verification

Take backends/cpu/mlir_tests/jit/compile.corert.mlir as an example:

tfrt_translate --mlir-to-bef compile.corert.mlir > compile.corert.bef
bef_executor compile.corert.bef

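This produces the correct result.

I remember that nGraph could dump intermediate object files, so TFRT should be able to do something similar. After digging through the source, I added the following code: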

diff --git a/backends/cpu/lib/jit/cpurt.cc b/backends/cpu/lib/jit/cpurt.cc
index 39372a2..59ea602 100644
--- a/backends/cpu/lib/jit/cpurt.cc
+++ b/backends/cpu/lib/jit/cpurt.cc
@@ -400,6 +400,8 @@ Error CompilationResult::Execute(ArrayRef<MemrefDesc> operands,
   Execute(exec_ctx, &call_frame);
+  engine_->dumpToObjectFile("./cpu-genereated-code.o");
+
   // Convert compiled function return values into results.
   if (auto err = ReturnResults(results, &call_frame)) return err;

After bef_executor finishes, the file cpu-genereated-code.o appears in the current directory.

$ file cpu-genereated-code.o
cpu-genereated-code.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped

$ readelf -s  cpu-genereated-code.o
Symbol table '.symtab' contains 26 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS LLVMDialectModule
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    2
     3: 00000000000002a0   351 FUNC    LOCAL  DEFAULT    2 async_execute_fn.resume
     4: 0000000000000400    35 FUNC    LOCAL  DEFAULT    2 async_execute_fn.destroy
     5: 0000000000000430    31 FUNC    LOCAL  DEFAULT    2 async_execute_fn.cleanup
    11: 0000000000000000    84 FUNC    GLOBAL DEFAULT    2 main
    12: 0000000000000060   252 FUNC    GLOBAL DEFAULT    2 async_execute_fn
    13: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeDropRef
    14: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeCreateTok
    15: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeCreateVal
    16: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND malloc
    17: 0000000000000160     5 FUNC    GLOBAL DEFAULT    2 __resume
    18: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeExecute
    19: 0000000000000170   121 FUNC    GLOBAL DEFAULT    2 _mlir_main
    20: 00000000000001f0   130 FUNC    GLOBAL DEFAULT    2 _mlir_async_execute_fn
    21: 0000000000000280    11 FUNC    GLOBAL DEFAULT    2 _mlir___resume
    22: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeGetValueS
    23: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeEmplaceVa
    24: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mlirAsyncRuntimeEmplaceTo
    25: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND free

Can this object file be linked into an executable? After some fiddling, I found that it depends on the following libraries:

bazel build //backends/cpu:async_runtime_api
bazel build //backends/cpu:async_runtime
bazel build @llvm-project//llvm:Support
bazel build :hostcontext
bazel build :support

In the end, the following command links successfully:

clang++  -g cpu-genereated-code.o \
     -lasync_runtime_api  -lasync_runtime -lhostcontext -lsupport -lSupport -lLLVM-10  \
     -o my_code_share

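However, running my_code_share fails with an error. After more fiddling, including static linking, I got past that error but then hit a crash, which looks like a calling-convention mismatch between cpu-genereated-code.o and the libraries. I stopped there.

The problem is probably that main cannot be defined directly in MLIR; instead, main has to be defined in a C/C++ file and call the functions implemented in MLIR.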

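Update on April 25

I updated the repo today and found that the generated object file is now optimized: the intermediate functions that used to be there are gone.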

$ readelf -s cpu-genereated-code.o
Symbol table '.symtab' contains 15 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS LLVMDialectModule
    10: 0000000000000000    88 FUNC    GLOBAL DEFAULT    2 alloc_filled_i32
    11: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND malloc
    12: 0000000000000060    93 FUNC    GLOBAL DEFAULT    2 my_entry
    13: 00000000000000c0    95 FUNC    GLOBAL DEFAULT    2 _mlir_alloc_filled_i32
    14: 0000000000000120    96 FUNC    GLOBAL DEFAULT    2 _mlir_my_entry
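
The `_mlir_`-prefixed symbols are, as far as I know, the packed-argument wrappers that the MLIR ExecutionEngine generates so that any function can be invoked through a uniform void(void**) signature. The sketch below imitates that convention in plain C++ for a hypothetical add_i32; none of these functions come from the object file above, and callPacked is only an illustrative helper.

```cpp
#include <cstdint>

// Hypothetical kernel with an ordinary signature.
extern "C" int32_t add_i32(int32_t a, int32_t b) { return a + b; }

// Hand-written imitation of a packed wrapper: args[i] points to the i-th
// argument, and the slot after the last argument points to result storage.
extern "C" void _mlir_add_i32(void** args) {
  int32_t a = *static_cast<int32_t*>(args[0]);
  int32_t b = *static_cast<int32_t*>(args[1]);
  *static_cast<int32_t*>(args[2]) = add_i32(a, b);
}

// Caller side: pack pointers to the arguments and the result, then invoke.
int32_t callPacked(int32_t a, int32_t b) {
  int32_t result = 0;
  void* packed[] = {&a, &b, &result};
  _mlir_add_i32(packed);
  return result;
}
```

Calling callPacked(2, 3) goes through the wrapper and gives the same value as calling add_i32(2, 3) directly.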

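Following the article "使用MLIR完成一个端到端的编译流程 – 一条通路" (roughly, "An end-to-end compilation flow with MLIR: one path"), I reproduced the compile-and-run flow on top of TFRT.

The MLIR file I used is as follows: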

#map = affine_map<(d0) -> (d0)>

module @kernels attributes { tfrt.compiled } {
  // Allocates a 10-element buffer and fills it with %arg0.
  func @alloc_filled_i32(%arg0: i32) -> memref<10xi32> {
    %c0 = constant 0 : index
    %c1 = constant 1 : index
    %c10 = constant 10 : index
    %0 = memref.alloc() : memref<10xi32>
    scf.for %idx = %c0 to %c10 step %c1 {
      memref.store %arg0, %0[%idx] : memref<10xi32>
    }
    return %0 : memref<10xi32>
  }

  // Adds a buffer of 2s and a buffer of 3s element-wise.
  func @my_entry() -> memref<10xi32> {
    %0 = memref.alloc() : memref<10xi32>
    %i2 = constant 2 : i32
    %i3 = constant 3 : i32
    %arg0 = call @alloc_filled_i32(%i2) : (i32) -> memref<10xi32>
    %arg1 = call @alloc_filled_i32(%i3) : (i32) -> memref<10xi32>
    linalg.generic {
      indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]
    }
    ins(%arg0, %arg1 : memref<10xi32>, memref<10xi32>) outs(%0 : memref<10xi32>) {
    ^bb0(%arg2: i32, %arg3: i32, %arg4: i32):
      %1 = addi %arg2, %arg3 : i32
      linalg.yield %1 : i32
    }
    memref.dealloc %arg0 : memref<10xi32>
    memref.dealloc %arg1 : memref<10xi32>
    return %0 : memref<10xi32>
  }
}

func @compiled_add_f32_tensors() {
  %compilation_result = cpurt.corert.compile { kernel = @kernels::@my_entry }
  %result = cpurt.corert.execute %compilation_result ()
              : () -> !corert.tensorhandle
  tfrt.return
}

The file is fairly simple: it implements a plain Add and uses cpurt.corert.compile to compile the my_entry function.
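
For reference, the computation that my_entry describes can be written directly in C++. This is only an equivalent sketch for checking the expected values, not what TFRT generates; the names Buffer, allocFilledI32 and myEntryReference are made up here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Buffer = std::array<int32_t, 10>;

// Mirrors @alloc_filled_i32: a 10-element buffer filled with `value`.
Buffer allocFilledI32(int32_t value) {
  Buffer buf{};
  buf.fill(value);
  return buf;
}

// Mirrors @my_entry: element-wise sum of a buffer of 2s and a buffer of 3s.
Buffer myEntryReference() {
  const Buffer lhs = allocFilledI32(2);
  const Buffer rhs = allocFilledI32(3);
  Buffer out{};
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = lhs[i] + rhs[i];
  return out;
}
```

Every element of the buffer returned by myEntryReference() is 2 + 3 = 5, which is the value to look for when the compiled kernel is run.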

However, tfrt_translate failed with an error because it did not recognize the scf.for op, so I added the following patch to support the scf dialect.

diff --git a/BUILD b/BUILD
index 38b3e941..a40f1852 100644
--- a/BUILD
+++ b/BUILD
@@ -1441,6 +1441,7 @@ tfrt_cc_library(
         "@llvm-project//mlir:MathDialect",
         "@llvm-project//mlir:MemRefDialect",
         "@llvm-project//mlir:StandardOps",
+        "@llvm-project//mlir:SCFDialect",
         "@tf_runtime//backends/cpu:cpurt_opdefs",

diff --git a/lib/init_tfrt_dialects.cc b/lib/init_tfrt_dialects.cc
index 13e559f6..9642b6ee 100644
--- a/lib/init_tfrt_dialects.cc
+++ b/lib/init_tfrt_dialects.cc
@@ -38,6 +38,9 @@
 #include "tfrt/tensor/opdefs/tensor_shape.h"
 #include "tfrt/test_kernels/opdefs/test_kernels.h"
+#include "mlir/Dialect/SCF/SCF.h"
 namespace tfrt {
 void RegisterTFRTDialects(mlir::DialectRegistry &registry) {
@@ -62,6 +65,7 @@ void RegisterTFRTCompiledDialects(mlir::DialectRegistry &registry) {
   registry.insert<mlir::linalg::LinalgDialect>();
   registry.insert<mlir::math::MathDialect>();
   registry.insert<mlir::memref::MemRefDialect>();
+  registry.insert<mlir::scf::SCFDialect>();

After tfrt_translate and bef_executor ran successfully, I got cpu-genereated-code.o. To call my_entry, I wrote the following main function:

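#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Rank-1 memref descriptor as returned by the compiled MLIR function.
typedef struct MemRef_descriptor_ {
  int *basePtr;        // allocated pointer, the one passed to free()
  int *data;           // aligned pointer used for element access
  int64_t offset;
  int64_t sizes[1];
  int64_t strides[1];
} Memref;

// Implemented in cpu-genereated-code.o.
extern "C" Memref my_entry();

int main()
{
        Memref v = my_entry();
        printf("sizeof(int) = %zu, sizoef(void*)= %zu\n", sizeof(int), sizeof(void*));
        printf("%p, %p, 0x%lx, 0x%lx\n", (void*)v.basePtr, (void*)v.data,
               (unsigned long)v.offset, (unsigned long)v.sizes[0]);
        for (int64_t i = 0; i < v.sizes[0]; ++i) {
                printf("0x%x ", (unsigned)v.data[i]);
        }
        printf("\n");
        free(v.basePtr);
        return 0;
}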

This gives the correct result:

sizeof(int) = 4, sizoef(void*)= 8
0x1492280, 0x1492280, 0x0, 0xa
0x5 0x5 0x5 0x5 0x5 0x5 0x5 0x5 0x5 0x5
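
The loop above indexes v.data[i] directly, which works here because the offset is 0 and the freshly allocated buffer is contiguous. A more general reader would honor offset and strides[0]. The sketch below shows that access pattern on a hand-built descriptor; MemRef1D repeats the layout of the struct above with fixed-width element types, and elementAt is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Rank-1 memref descriptor with i32 elements, same field order as above.
struct MemRef1D {
  int32_t* basePtr;    // pointer returned by the allocator
  int32_t* data;       // aligned pointer used for element access
  int64_t offset;      // offset, in elements, of the first element from data
  int64_t sizes[1];    // number of elements
  int64_t strides[1];  // distance, in elements, between consecutive elements
};

// Reads logical element i, taking offset and stride into account.
int32_t elementAt(const MemRef1D& m, int64_t i) {
  assert(i >= 0 && i < m.sizes[0]);
  return m.data[m.offset + i * m.strides[0]];
}
```

With offset 0 and stride 1 this reduces to data[i], which is why the simpler loop in main is enough for this kernel.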
