1. NPU inference approach on the Ascend 310B1
For inference with PyTorch on the Ascend 310B1 NPU, the torch_npu extension is used to replace the original GPU/CUDA operations.
torch_npu reference documentation: pytorch: Ascend Extension for PyTorch
2. Problems that may arise during inference and their solutions
1. The NPU device does not support the double data type
- The error log reports:
  `Warning: Device do not support double dtype now, dtype cast repalce with float.`
  This indicates that the NPU device (Ascend 310B1) does not support the `double` (i.e. `float64`) data type, so the runtime automatically casts it to `float` (i.e. `float32`).
- Solution: make sure every tensor's dtype is `float32` rather than `double`. The dtype can be specified explicitly when creating tensors:

```python
x = torch.randn(2, 3, dtype=torch.float32).to(device)
y = torch.randn(2, 3, dtype=torch.float32).to(device)
```
2. Environment variables not configured correctly
- The error log reports:
  `ImportError: libhccl.so: cannot open shared object file: No such file or directory`
  A missing `libhccl.so` is generally caused by environment-variable configuration. The environment-setup command must be run before executing any code that imports torch_npu.
- Solution: run the environment-setup command in the shell before launching the Python script.
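The original note does not record the exact command. On a default CANN install (consistent with the `/usr/local/Ascend/ascend-toolkit/7.0.RC1` path visible in the log below), the setup script is typically sourced as follows; the install path is an assumption and should be adjusted to the actual toolkit location:

```shell
# Assumed default CANN toolkit install path; adjust to your environment.
# Sourcing set_env.sh exports LD_LIBRARY_PATH and related variables so
# that libhccl.so and the other CANN shared libraries can be found.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```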
3. The NPU device does not support certain operators
- The error log reports:
  `RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is MaxPoolWithArgmaxV1.`
  The likely cause is that `MaxPoolWithArgmaxV1` is not supported by the current CANN toolkit version.
- Solution: when an operator is unsupported, try substituting another operator with similar functionality. Here `MaxPool2d`, which maps to `MaxPoolWithArgmaxV1` on the NPU, is unsupported, so `AvgPool2d` can be used instead.
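A sketch of the substitution. Note that average pooling is numerically different from max pooling, so accuracy should be re-validated; only the output shapes are guaranteed to match when kernel, stride, and padding are kept identical:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Original layer; compiles to MaxPoolWithArgmaxV1 on the NPU and fails there
max_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

# Replacement with the same kernel/stride/padding, so downstream shapes match
avg_pool = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)

print(max_pool(x).shape, avg_pool(x).shape)  # both torch.Size([1, 64, 28, 28])
```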
- The error log reports:
  `[W compiler_depend.ts:387] Warning: E40021: Failed to compile Op [DropOutDoMask]. (oppath: [Compile /usr/local/Ascend/ascend-toolkit/7.0.RC1/opp/built-in/op_impl/ai_core/tbe/impl/drop_out_do_mask.py failed with errormsg/stack: File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/tbe/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__ raise get_last_ffi_error()`
  The likely cause is that `DropOutDoMask` is not supported by the current CANN toolkit version.
- Solution: dropout is a relatively simple operator, so it can be implemented manually:
```python
# Custom DropoutLayer
class DropoutLayer(nn.Module):
    def __init__(self, p=0.5):
        super(DropoutLayer, self).__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # apply dropout only in training mode
            # Build the mask (uniform random values; positions above p are kept)
            mask = (torch.rand_like(x) > self.p).float()
            # Zero the dropped elements and rescale the survivors
            return x * mask / (1 - self.p)
        return x  # in eval mode, return the input unchanged
```
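A quick sanity check of the custom layer (redefined here so the snippet is self-contained): in eval mode it must be the identity, and in train mode every surviving activation is scaled by 1/(1 - p):

```python
import torch
import torch.nn as nn

class DropoutLayer(nn.Module):  # same definition as in the text
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            mask = (torch.rand_like(x) > self.p).float()
            return x * mask / (1 - self.p)
        return x

layer = DropoutLayer(p=0.5)
x = torch.ones(4, 4)

layer.eval()
assert torch.equal(layer(x), x)          # identity in eval mode

layer.train()
out = layer(x)
# Survivors are scaled to 1 / (1 - p) = 2.0; dropped entries become 0
assert set(out.unique().tolist()) <= {0.0, 2.0}
```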
3. Actual performance of CPU and NPU
1. AlexNet
The model structure is as follows:
------------------------------------------------------------------
1-Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(1, 1))
------------------------------------------------------------------
2-Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
------------------------------------------------------------------
3-Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
------------------------------------------------------------------
4-Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
------------------------------------------------------------------
5-Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
------------------------------------------------------------------
6-Linear(in_features=9216, out_features=4096, bias=True)
------------------------------------------------------------------
7-Linear(in_features=4096, out_features=4096, bias=True)
------------------------------------------------------------------
8-Linear(in_features=4096, out_features=1000, bias=True)
| Device \ Module | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Total (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | 5.349 | 8.966 | 5.186 | 5.349 | 4.071 | 186.742 | 81.339 | 9.404 | 306.406 |
| NPU | 0.456 | 0.922 | 3.265 | 4.786 | 2.954 | 13.936 | 7.239 | 1.836 | 35.394 |
2. LeNet
The model structure is as follows:
------------------------------------------------------------------
1-Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
------------------------------------------------------------------
2-Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
------------------------------------------------------------------
3-Linear(in_features=46656, out_features=1024, bias=True)
------------------------------------------------------------------
4-Linear(in_features=1024, out_features=1024, bias=True)
------------------------------------------------------------------
5-Linear(in_features=1024, out_features=1000, bias=True)
| Device \ Module | 1 | 2 | 3 | 4 | 5 | Total (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| CPU | 6.490 | 4.001 | 226.313 | 2.632 | 1.814 | 241.250 |
| NPU | 0.434 | 0.245 | 11.871 | 0.446 | 0.416 | 13.412 |
3. ResNet
The model structure is as follows:
------------------------------------------------------------------
1-Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): AvgPool2d(kernel_size=3, stride=2, padding=1)
)
------------------------------------------------------------------
2-BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
3-BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
4-BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
5-BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
------------------------------------------------------------------
6-BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
7-BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
8-BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
9-BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
------------------------------------------------------------------
10-BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
11-BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
12-BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
13-BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
14-BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
15-BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
------------------------------------------------------------------
16-BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
17-BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU()
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
------------------------------------------------------------------
18-Sequential(
(0): AdaptiveAvgPool2d(output_size=(1, 1))
(1): Flatten(start_dim=1, end_dim=-1)
(2): Linear(in_features=512, out_features=1000, bias=True)
)
| Device \ Module | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | Total (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | 16.252 | 16.860 | 16.917 | 17.021 | 16.515 | 15.840 | 15.983 | 16.053 | 22.174 | 24.503 | 26.659 | 24.864 | 25.237 | 25.101 | 63.421 | 57.389 | 57.276 | 1.259 | 458.324 |
| NPU | 3.313 | 1.062 | 0.993 | 0.979 | 1.633 | 1.520 | 1.499 | 1.509 | 4.218 | 5.088 | 5.087 | 5.074 | 5.075 | 5.087 | 16.119 | 21.140 | 21.282 | 0.364 | 101.042 |
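The original note does not include the benchmarking script. Per-module latencies like those in the tables above can be collected with forward hooks, roughly as in this CPU-only sketch (the toy model and hook-based approach are assumptions, not the author's actual harness; on the NPU, the timestamps should bracket `torch.npu.synchronize()` calls, since NPU kernels launch asynchronously):

```python
import time
import torch
import torch.nn as nn

def profile_children(model, x, warmup=2, runs=10):
    """Average forward latency (ms) per top-level child module, via hooks."""
    times, starts = {}, {}

    def make_pre(name):
        def pre(module, inputs):
            starts[name] = time.perf_counter()
        return pre

    def make_post(name):
        def post(module, inputs, output):
            elapsed_ms = (time.perf_counter() - starts[name]) * 1e3
            times[name] = times.get(name, 0.0) + elapsed_ms
        return post

    handles = []
    for name, child in model.named_children():
        handles.append(child.register_forward_pre_hook(make_pre(name)))
        handles.append(child.register_forward_hook(make_post(name)))

    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        times.clear()              # discard warm-up measurements
        for _ in range(runs):
            model(x)

    for h in handles:
        h.remove()
    return {name: total / runs for name, total in times.items()}

# Toy stand-in model; the real benchmark would pass AlexNet/LeNet/ResNet here
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Flatten(),
                    nn.Linear(8 * 32 * 32, 10))
print(profile_children(net, torch.randn(1, 3, 32, 32)))
```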