All registered passes:
REGISTER_OPTIMIZATION(OptimizationPassRegistry::PRE_PLACEMENT, 26,
    EncapsulateXlaComputationsPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::PRE_PLACEMENT, 25,
    IntroduceFloatingPointJitterPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 5,
    CloneConstantsForBetterClusteringPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 9,
    ClusterScopingPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 10,
    MarkForCompilationPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 20,
    IncreaseDynamismForAutoJitPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 30,
    PartiallyDeclusterPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 40,
    ReportClusteringInfoPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 45,
    AsyncIoConversionPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 50,
    EncapsulateSubgraphsPass);
REGISTER_OPTIMIZATION(OptimizationPassRegistry::POST_REWRITE_FOR_EXEC, 60,
    BuildXlaOpsPass);
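In execution order (PRE_PLACEMENT passes run first; within a grouping, passes run in ascending order of their phase number):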
IntroduceFloatingPointJitterPass
EncapsulateXlaComputationsPass
CloneConstantsForBetterClusteringPass
ClusterScopingPass
MarkForCompilationPass: marks the ops that will be compiled by XLA
IncreaseDynamismForAutoJitPass
PartiallyDeclusterPass
ReportClusteringInfoPass
AsyncIoConversionPass
EncapsulateSubgraphsPass: wraps the marked XLA ops into subgraphs
BuildXlaOpsPass: turns each subgraph into an XLA compile op and an XLA run op
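The ordering above can be reproduced with a small sketch. This is not TensorFlow's registry: ToyGrouping, ToyPassRegistry and JitPassOrder are hypothetical names invented for illustration, and the only assumption carried over is that passes are sorted first by grouping and then by phase number.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the registry's grouping enum. Only the two
// groupings used in the listing are modeled; a smaller value runs earlier.
enum class ToyGrouping { PRE_PLACEMENT = 0, POST_REWRITE_FOR_EXEC = 1 };

// Toy registry keyed by (grouping, phase). std::map keeps its keys sorted, so
// iterating it visits the passes in execution order.
class ToyPassRegistry {
 public:
  void Register(ToyGrouping grouping, int phase, const std::string& name) {
    assert(phase >= 0);
    passes_[std::make_pair(static_cast<int>(grouping), phase)].push_back(name);
  }

  std::vector<std::string> ExecutionOrder() const {
    std::vector<std::string> order;
    for (const auto& entry : passes_) {
      order.insert(order.end(), entry.second.begin(), entry.second.end());
    }
    return order;
  }

 private:
  std::map<std::pair<int, int>, std::vector<std::string>> passes_;
};

// Registers the passes in the same (unsorted) order as the listing and
// returns them in the order they would run.
inline std::vector<std::string> JitPassOrder() {
  ToyPassRegistry r;
  r.Register(ToyGrouping::PRE_PLACEMENT, 26, "EncapsulateXlaComputationsPass");
  r.Register(ToyGrouping::PRE_PLACEMENT, 25, "IntroduceFloatingPointJitterPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 5, "CloneConstantsForBetterClusteringPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 9, "ClusterScopingPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 10, "MarkForCompilationPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 20, "IncreaseDynamismForAutoJitPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 30, "PartiallyDeclusterPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 40, "ReportClusteringInfoPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 45, "AsyncIoConversionPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 50, "EncapsulateSubgraphsPass");
  r.Register(ToyGrouping::POST_REWRITE_FOR_EXEC, 60, "BuildXlaOpsPass");
  return r.ExecutionOrder();
}
```

Calling JitPassOrder() yields the eleven pass names in the order listed above, with IntroduceFloatingPointJitterPass (phase 25) ahead of EncapsulateXlaComputationsPass (phase 26).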
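1. IntroduceFloatingPointJitterPass
A debug-only pass for checking floating-point-related errors: it deliberately introduces small errors (jitter) into floating-point outputs so that you can see how sensitive the graph is to rounding differences.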
2. EncapsulateXlaComputationsPass
Rewrites the XLA nodes defined on the Python side that are tagged with the _xla_compile_id attribute. (TODO)
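3. CloneConstantsForBetterClusteringPass
Clones constants, for two purposes:
- Reduce dependencies between clusters. Suppose a constant C is consumed by both x and y. Whether C is clustered with x or with y, x and y, which could otherwise run in parallel, end up running serially.
- On multi-GPU setups, once a const is clustered with ops on the same device, other ops that depend on that const could have started immediately, but because of the clustering they must wait until the whole cluster has finished executing.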
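The idea of cloning a shared constant can be sketched on a toy graph. ToyNode, ToyGraph, CloneSharedConstants and MakeSharedConstExample are hypothetical names, not TensorFlow types, and the sketch clones unconditionally, whereas the real pass presumably applies extra conditions before cloning.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy graph: a node refers to its inputs by index.
struct ToyNode {
  std::string name;
  bool is_const;
  std::vector<int> inputs;  // indices into ToyGraph::nodes
};

struct ToyGraph {
  std::vector<ToyNode> nodes;
};

// Gives every consumer of a constant, except the first one, its own private
// copy of that constant. Returns the number of clones created.
inline int CloneSharedConstants(ToyGraph* g) {
  assert(g != nullptr);
  const std::size_t original_size = g->nodes.size();
  std::vector<bool> has_consumer(original_size, false);
  int clones = 0;
  for (std::size_t n = 0; n < original_size; ++n) {
    for (std::size_t k = 0; k < g->nodes[n].inputs.size(); ++k) {
      const int src = g->nodes[n].inputs[k];
      if (!g->nodes[src].is_const) continue;
      if (!has_consumer[src]) {
        has_consumer[src] = true;  // the first consumer keeps the original
        continue;
      }
      ToyNode clone = g->nodes[src];  // copy first: push_back may reallocate
      clone.name += "/clone_" + std::to_string(clones);
      g->nodes.push_back(clone);
      g->nodes[n].inputs[k] = static_cast<int>(g->nodes.size()) - 1;
      ++clones;
    }
  }
  return clones;
}

// The example from the text: constant C consumed by both x and y.
inline ToyGraph MakeSharedConstExample() {
  ToyGraph g;
  g.nodes.push_back({"C", true, {}});
  g.nodes.push_back({"x", false, {0}});
  g.nodes.push_back({"y", false, {0}});
  return g;
}
```

After CloneSharedConstants runs on the example, x still reads C while y reads a clone of C, so clustering C with x no longer forces y to wait for that cluster.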
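4. ClusterScopingPass
This pass must run before MarkForCompilationPass. It attaches scope information to each node in the graph to guide the clustering that follows. Without this step the parallelism of the graph would drop significantly, which hurts performance.
It keeps the enqueue and dequeue sides of a pipeline pipelined instead of letting them be clustered together.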
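A minimal sketch of how scope labels can constrain clustering, assuming that clustering only merges two connected nodes when their scopes match. FindRoot and CountClusters are hypothetical helpers, the enqueue/dequeue nodes themselves are left out, and the real pass is more involved than this.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Union-find root lookup with path halving.
inline int FindRoot(std::vector<int>* parent, int x) {
  while ((*parent)[x] != x) {
    (*parent)[x] = (*parent)[(*parent)[x]];
    x = (*parent)[x];
  }
  return x;
}

// Merges the two endpoints of every edge whose nodes carry the same scope and
// returns the number of resulting clusters (single nodes count as clusters).
inline int CountClusters(const std::vector<int>& scope,
                         const std::vector<std::pair<int, int>>& edges) {
  const int n = static_cast<int>(scope.size());
  std::vector<int> parent(scope.size());
  std::iota(parent.begin(), parent.end(), 0);
  for (const auto& e : edges) {
    assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
    if (scope[e.first] != scope[e.second]) continue;  // different scopes: never merge
    const int a = FindRoot(&parent, e.first);
    const int b = FindRoot(&parent, e.second);
    parent[a] = b;
  }
  int clusters = 0;
  for (int i = 0; i < n; ++i) {
    if (FindRoot(&parent, i) == i) ++clusters;
  }
  return clusters;
}
```

For example, take nodes p1, p2 on the enqueue side and c1, c2 on the dequeue side (indices 0 to 3) with edges 0-1, 2-3 and 0-3. With scopes {0, 0, 1, 1} the two sides stay in two clusters and can overlap; with a single scope {0, 0, 0, 0} everything collapses into one cluster.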
5. MarkForCompilationPass
6. IncreaseDynamismForAutoJitPass
Moves slice-related ops out of the cluster, because a dynamic slice would otherwise cause the XLA cluster to be recompiled.
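Why a changing slice argument costs recompilations can be sketched with a toy compile cache. This assumes the cache key includes the values of compile-time-constant arguments; ToyCompileCache and CompilationsForSlices are hypothetical names, not the XLA compilation cache API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy cache: one "compiled executable" per distinct key.
class ToyCompileCache {
 public:
  // Returns true if this call had to compile (cache miss).
  bool Run(const std::string& cluster_name,
           const std::vector<int>& constant_args) {
    std::string key = cluster_name;
    for (int v : constant_args) key += "," + std::to_string(v);
    return compiled_.insert(key).second;
  }

  std::size_t num_compilations() const { return compiled_.size(); }

 private:
  std::set<std::string> compiled_;
};

// Runs a cluster containing a slice once per entry of `begins`. If the slice
// start is a compile-time constant it becomes part of the cache key;
// otherwise it is an ordinary runtime input and does not affect the key.
inline std::size_t CompilationsForSlices(const std::vector<int>& begins,
                                         bool begin_is_compile_time_constant) {
  ToyCompileCache cache;
  for (int begin : begins) {
    assert(begin >= 0);
    std::vector<int> constant_args;
    if (begin_is_compile_time_constant) constant_args.push_back(begin);
    cache.Run("cluster_0", constant_args);
  }
  return cache.num_compilations();
}
```

With slice starts 0, 4, 8, 4 the toy cache compiles three times when the start is a compile-time constant and only once when it is not.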
7. PartiallyDeclusterPass
Moves some ops out of their cluster, for the following reasons:
- Clones nodes to outside their cluster to avoid device-to-host copies.
- Moving some Function ops out of the cluster reduces the number of compilations. See, for example, tensorflow/compiler/jit/partially_decluster_pass.cc:278
- Moves RootShapeConsumer nodes out of the cluster.
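The first reason can be sketched as a selection rule: a clustered node is a candidate for cloning when one of its consumers sits outside its cluster and reads the value from host memory. ToyEdge, kNoCluster and NodesToCloneOutsideCluster are hypothetical names, and the rule is a simplification under that assumption, not the logic in partially_decluster_pass.cc.

```cpp
#include <cassert>
#include <set>
#include <vector>

constexpr int kNoCluster = -1;  // marks a node that is not in any cluster

// Hypothetical edge description: src produces a value consumed by dst.
struct ToyEdge {
  int src;
  int dst;
  bool dst_reads_from_host_memory;
};

// Returns, in ascending order, the clustered nodes that feed a consumer which
// is outside their cluster and wants the value in host memory.
inline std::vector<int> NodesToCloneOutsideCluster(
    const std::vector<int>& cluster_of, const std::vector<ToyEdge>& edges) {
  const int n = static_cast<int>(cluster_of.size());
  std::set<int> picked;
  for (const ToyEdge& e : edges) {
    assert(e.src >= 0 && e.src < n && e.dst >= 0 && e.dst < n);
    const bool src_clustered = cluster_of[e.src] != kNoCluster;
    const bool leaves_cluster = cluster_of[e.dst] != cluster_of[e.src];
    if (src_clustered && leaves_cluster && e.dst_reads_from_host_memory) {
      picked.insert(e.src);
    }
  }
  return std::vector<int>(picked.begin(), picked.end());
}
```

In a graph where nodes 0 and 1 share cluster 0 and node 1 feeds an unclustered node that reads from host memory, only node 1 is selected. Cloning it outside the cluster lets the host-side consumer use the clone instead of waiting for a device-to-host copy of the cluster output.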
8. ReportClusteringInfoPass
Performs no optimization; it only reports information about the clustering.
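9. AsyncIoConversionPass
Converts synchronous XLA outputs (for example, ret val) into asynchronous outputs (for example, _XlaAsyncOutSend or _XlaAsyncOutRecv).
With asynchronous send/recv, the op that follows the XLA cluster can start earlier instead of waiting for the whole XLA op to finish.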
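The benefit can be illustrated with a small timeline calculation. ConsumerStartTimes is a hypothetical helper and the time units are arbitrary; the sketch assumes each downstream consumer needs only its own output and nothing else from the cluster.

```cpp
#include <cassert>
#include <vector>

// output_ready_at[i] is the time at which the cluster has produced output i;
// cluster_finish_at is the time at which the whole cluster finishes.
// Synchronous outputs: every consumer starts when the cluster finishes.
// Asynchronous outputs: each consumer starts as soon as its value is sent.
inline std::vector<int> ConsumerStartTimes(
    const std::vector<int>& output_ready_at, int cluster_finish_at,
    bool async_outputs) {
  std::vector<int> start;
  for (int ready : output_ready_at) {
    assert(ready <= cluster_finish_at);
    start.push_back(async_outputs ? ready : cluster_finish_at);
  }
  return start;
}
```

For outputs ready at times 2 and 5 in a cluster that finishes at time 10, the consumers start at 10 and 10 with synchronous outputs, but at 2 and 5 with asynchronous ones.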