Error when converting an ONNX model for Huawei Ascend AI: The Add_718 op dtype is not same, type1:DT_INT64, type2:DT_INT32

Article link: https://blog.csdn.net/wentinghappyday/article/details/127205896

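Background:

The model was developed on NVIDIA hardware and has already been exported to ONNX. It now needs to run inference on Huawei Ascend AI hardware, so it has to be converted to the om format that Ascend requires.

Official tutorial: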

https://support.huawei.com/enterprise/zh/doc/EDOC1100232270?idPath=23710424%7C251366513%7C22892968%7C251168373

The problem:

The model was trained with autocv and saved after a torch -> onnx conversion.
It is then converted with the ATC tool on the Huawei Ascend platform:

atc --mode=0 --model=./out.onnx \
    --framework=5 \
    --input_format="NCHW" \
    --input_shape="input:1,3,608,608" \
    --output=./om_models/road_hole_detection_310 \
    --soc_version=Ascend310 \
    --output_type=FP32
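
When the conversion is launched from a script, assembling the argument list in Python avoids shell line-continuation mistakes. The following is a minimal sketch that only rebuilds the arguments shown above. `build_atc_args` is a hypothetical helper name, and actually running the command assumes `atc` is available on the PATH.

```python
import shlex


def build_atc_args(model, output, input_shape, soc_version="Ascend310"):
    """Assemble the atc command line from this article as an argv list."""
    return [
        "atc",
        "--mode=0",
        f"--model={model}",
        "--framework=5",  # value used for the ONNX model in this article
        "--input_format=NCHW",
        f"--input_shape={input_shape}",
        f"--output={output}",
        f"--soc_version={soc_version}",
        "--output_type=FP32",
    ]


args = build_atc_args(
    model="./out.onnx",
    output="./om_models/road_hole_detection_310",
    input_shape="input:1,3,608,608",
)
# Print a shell-quoted version for logging.
print(shlex.join(args))
# To actually run the conversion: subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` means no quoting or backslashes are needed at all.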

The atc conversion fails with the following error:

ATC run failed, Make sure this is what you want: '--mode=0'. Please check the detail log, Try 'atc --help' for more information
E89999: Inner Error!
E89999  op[Add_718], The Add_718 op dtype is not same, type1:DT_INT64, type2:DT_INT32[FUNC:CheckTwoInputDtypeSame][FILE:util.cc][LINE:103]
        Verifying Add_718 failed.[FUNC:InferShapeAndType][FILE:infershape_pass.cc][LINE:135]
        Call InferShapeAndType for node:Add_718(Add) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:117]
        process pass InferShapePass on node:Add_718 failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:507]
        build graph failed, graph id:0, ret:1343242270[FUNC:BuildModel][FILE:ge_generator.cc][LINE:1426]

Similar errors were already being reported at the end of 2021, but no code-level solution had been published:
https://bbs.huaweicloud.com/forum/thread-172573-1-1.html
https://gitee.com/ascend/modelzoo/issues/I5OOHG?from=project-issue

Approach:

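The error says that the two inputs of an Add node have different dtypes. Netron (https://netron.app/) can display the structure of the ONNX model, which makes it easy to locate the failing node. The inputs of Add_718 are the indices output of a TopK node and the output of a Mul node. In the ONNX model both of them are int64, so in principle nothing should go wrong.
The type information shown by Netron confirms this: because the ONNX model was exported from PyTorch, the indices output of TopK is indeed int64.
[Figure: type information shown by Netron for this part of the model]
However, during conversion the ATC tool forces the TopK indices output to int32, so the two inputs of the Add node end up with different dtypes.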
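
The same lookup can be done in code instead of in Netron. The sketch below is an illustration rather than part of the original fix: `input_producers` is a hypothetical helper that works on any objects exposing the `name`, `op_type`, `input` and `output` fields of an ONNX node, and the toy graph is a stand-in whose node names (other than Add_718) and the tensor name '1327' are made up.

```python
from types import SimpleNamespace


def input_producers(nodes, target_name):
    """Map each input tensor of the node named target_name to the op type producing it.

    Inputs with no producing node (graph inputs or initializers) map to None.
    """
    produced_by = {out: node for node in nodes for out in node.output}
    target = next(node for node in nodes if node.name == target_name)
    return {
        tensor: (produced_by[tensor].op_type if tensor in produced_by else None)
        for tensor in target.input
    }


# Toy stand-in for model.graph.node (assumed names, see the note above).
toy_nodes = [
    SimpleNamespace(name="TopK_0", op_type="TopK", input=["x", "k"], output=["1327", "1328"]),
    SimpleNamespace(name="Mul_0", op_type="Mul", input=["a", "b"], output=["1340"]),
    SimpleNamespace(name="Add_718", op_type="Add", input=["1328", "1340"], output=["1341"]),
]
print(input_producers(toy_nodes, "Add_718"))
# {'1328': 'TopK', '1340': 'Mul'}
```

With a real model the call would be `input_producers(onnx.load("out.onnx").graph.node, "Add_718")`.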

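The same issue has also been raised on Gitee: https://gitee.com/ascend/parser/issues/I4M8N9?from=project-issue. Ideally the Ascend platform would insert a Cast operator automatically when it converts the TopK operator. Users would then never notice the difference, and the TopK output types of both TensorFlow and PyTorch would be supported.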

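The Ascend platform may have its own reasons, though, and the recommendation given is to convert all types to int32. That makes the approach clear: make the inputs of the Add node in the ONNX model share a single dtype. Following the official recommendation, the first choice is to convert the inputs to int32.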

The actual fix:

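In practice, casting the Mul output to int32 so that it matches TopK before it enters the Add node lets atc finish without errors and produce a model, but that model has problems at run time.
The fix that finally worked was to add a Cast operator after the TopK output in the ONNX model and cast it to int64. (This looks odd in ONNX, because TopK already outputs int64 there. But atc turns the TopK output into int32 during conversion, and that is where the added Cast becomes useful.)
[Figure: model structure around the TopK node and Add_718]
In the structure diagram, the indices output of TopK is the tensor "1328". The insertion rule is therefore to add a Cast operator after 1328 and then replace the Add input 1328 with the output of the Cast operator.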
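
The indices tensor name can also be found programmatically rather than read off the diagram. This is a minimal sketch under the same assumptions as any duck-typed ONNX traversal: `topk_index_consumers` is a hypothetical helper, and the toy node list only mimics the tensor names from this article (the node names other than Add_718 and the tensor '1327' are made up).

```python
from types import SimpleNamespace


def topk_index_consumers(nodes):
    """Return {indices_tensor: [names of nodes that read it]} for every TopK node."""
    result = {}
    for node in nodes:
        if node.op_type == "TopK" and len(node.output) > 1:
            indices = node.output[1]  # ONNX TopK outputs are (Values, Indices)
            result[indices] = [n.name for n in nodes if indices in n.input]
    return result


# Toy stand-in for model.graph.node (assumed names, see the note above).
toy_nodes = [
    SimpleNamespace(name="TopK_0", op_type="TopK", input=["x", "k"], output=["1327", "1328"]),
    SimpleNamespace(name="Mul_0", op_type="Mul", input=["a", "b"], output=["1340"]),
    SimpleNamespace(name="Add_718", op_type="Add", input=["1328", "1340"], output=["1341"]),
]
print(topk_index_consumers(toy_nodes))
# {'1328': ['Add_718']}
```

Every consumer listed is a candidate for the same rewiring. Whether each one needs it depends on the model.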

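import onnx
from onnx import TensorProto
from onnx import checker as ch
from onnx import helper as h


def convert_topk_to_int64(nodes):
    """Insert a Cast(to=INT64) between the TopK indices output and Add_718."""
    new_nodes = []
    for node in nodes:
        if node.name == "Add_718":
            # Cast the TopK indices tensor '1328' to int64. In ONNX this is a
            # no-op, but once atc has turned the TopK output into int32 it
            # restores the dtype that Add expects.
            cast_node = h.make_node(
                "Cast",
                inputs=["1328"],
                outputs=["8888"],  # new tensor name; must not already exist in the graph
                to=TensorProto.INT64,
            )
            # Rebuild Add_718 so that it reads the Cast output instead of '1328'.
            new_add_node = h.make_node(
                "Add",
                inputs=["8888", "1340"],
                outputs=["1341"],
                name=node.name,
            )
            # Cast is placed right before Add, so producers still precede consumers.
            new_nodes += [cast_node, new_add_node]
        else:
            new_nodes.append(node)
    return new_nodes


if __name__ == "__main__":
    model = onnx.load("yolox_model.onnx")
    graph = model.graph
    opset_version = model.opset_import[0].version

    # Rebuild the graph from the patched node list; inputs, outputs and
    # initializers are reused unchanged.
    graph_name = f"{graph.name}-int32"
    new_nodes = convert_topk_to_int64(graph.node)
    patched_graph = h.make_graph(
        new_nodes,
        graph_name,
        graph.input,
        graph.output,
        initializer=graph.initializer,
    )

    patched_model = h.make_model(patched_graph, producer_name="onnx-typecast")
    # Keep the opset of the original model.
    patched_model.opset_import[0].version = opset_version
    ch.check_model(patched_model)
    # out.onnx is the file passed to atc via --model.
    onnx.save_model(patched_model, "out.onnx")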
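
The script inserts the Cast node immediately before Add_718 because ONNX expects the node list to be topologically sorted, meaning every tensor is produced before it is read. `checker.check_model` validates the model as a whole; the sketch below only makes that ordering requirement explicit. `first_unsorted_node` is a hypothetical helper, and the toy node names other than Add_718 are made up.

```python
from types import SimpleNamespace


def first_unsorted_node(nodes, available):
    """Return the name of the first node that reads a tensor before it exists, or None.

    available: tensor names that exist before any node runs
    (graph inputs and initializers).
    """
    known = set(available)
    for node in nodes:
        for tensor in node.input:
            # An empty string marks an omitted optional input in ONNX.
            if tensor and tensor not in known:
                return node.name
        known.update(node.output)
    return None


cast = SimpleNamespace(name="Cast_new", input=["1328"], output=["8888"])
add = SimpleNamespace(name="Add_718", input=["8888", "1340"], output=["1341"])
available = ["1328", "1340"]  # treated as already produced in this toy example

print(first_unsorted_node([cast, add], available))  # None
print(first_unsorted_node([add, cast], available))  # Add_718
```

For a real model, `available` would be the names in `graph.input` plus the names in `graph.initializer`.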

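Other approaches:

I did not find the right approach straight away. Along the way I tried a number of other methods, such as directly changing node attributes and types as described in the references below, and converting every node of the ONNX model to int32 with the open-source onnx-typecast project (https://github.com/aadhithya/onnx-typecast). None of them solved the problem properly.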