NotImplementedError: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend.

Together_CZ

Today, while loading a previously trained model to run detection inference, I hit a strange error:

NotImplementedError: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

The full error output goes on to list where the kernel for each dispatch key is registered:

CPU: registered at aten\src\ATen\RegisterCPU.cpp:18433 [kernel]
Meta: registered at aten\src\ATen\RegisterMeta.cpp:12703 [kernel]
BackendSelect: registered at aten\src\ATen\RegisterBackendSelect.cpp:665 [kernel]
Python: registered at ..\aten\src\ATen\core\PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ..\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at ..\aten\src\ATen\ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at ..\aten\src\ATen\native\NegateFallback.cpp:22 [kernel]
ADInplaceOrView: fallthrough registered at ..\aten\src\ATen\core\VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradCPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradCUDA: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradXLA: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradLazy: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradXPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradMLC: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradHPU: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradNestedTensor: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradPrivateUse1: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradPrivateUse2: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
AutogradPrivateUse3: registered at ..\torch\csrc\autograd\generated\VariableType_2.cpp:10483 [autograd kernel]
Tracer: registered at ..\torch\csrc\autograd\generated\TraceType_2.cpp:11423 [kernel]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at ..\aten\src\ATen\autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at ..\aten\src\ATen\autocast_mode.cpp:305 [backend fallback]
Batched: registered at ..\aten\src\ATen\BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ..\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
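The key detail in this output is the list of backends: it contains CPU and AutogradCUDA but no plain CUDA entry. A small sketch of how that check could be automated is shown below. The helper name `available_backends` is hypothetical, the sample message is an abridged copy of the one above, and reading a missing CUDA entry as "this PyTorch build has no CUDA kernels" is my interpretation rather than something the message states.

```python
import re
from typing import List


def available_backends(error_message: str) -> List[str]:
    """Extract the names listed after 'only available for these backends:'."""
    match = re.search(
        r"only available for these backends: \[([^\]]*)\]", error_message
    )
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


# Abridged copy of the message above (most of the list is left out here).
message = (
    "'aten::empty_strided' is only available for these backends: "
    "[CPU, Meta, BackendSelect, AutogradCPU, AutogradCUDA, VmapMode]."
)

backends = available_backends(message)
print("CUDA" in backends)  # False: only AutogradCUDA is listed, not CUDA
print("CPU" in backends)   # True
```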

Tracing the error back showed that it was raised at the point where torch loads the model.

The exact location in the code is:

model = torch.jit.load(model_path).to(device)

I rarely use the jit API; in the past I mostly loaded models directly with torch.load. So I looked up the official PyTorch documentation, which reads as follows:

torch.jit.load

torch.jit.load(f, map_location=None, _extra_files=None, _restore_shapes=False)

Load a ScriptModule or ScriptFunction previously saved with torch.jit.save.

All previously saved modules, no matter their device, are first loaded onto CPU, and then are moved to the devices they were saved from. If this fails (e.g. because the run time system doesn’t have certain devices), an exception is raised.

Parameters

f – a file-like object (has to implement read, readline, tell, and seek), or a string containing a file name
map_location (string or torch.device) – A simplified version of map_location in torch.jit.save used to dynamically remap storages to an alternative set of devices.
_extra_files (dictionary of filename to content) – The extra filenames given in the map would be loaded and their content would be stored in the provided map.
_restore_shapes (bool) – Whether or not to retrace the module on load using stored inputs

Returns

A ScriptModule object.

Example code:

import torch
import io

torch.jit.load('scriptmodule.pt')

# Load ScriptModule from io.BytesIO object
with open('scriptmodule.pt', 'rb') as f:
    buffer = io.BytesIO(f.read())

# Load all tensors to the original device
torch.jit.load(buffer)

# Load all tensors onto CPU, using a device
buffer.seek(0)
torch.jit.load(buffer, map_location=torch.device('cpu'))

# Load all tensors onto CPU, using a string
buffer.seek(0)
torch.jit.load(buffer, map_location='cpu')

# Load with extra files.
extra_files = {'foo.txt': ''}  # values will be replaced with data
torch.jit.load('scriptmodule.pt', _extra_files=extra_files)
print(extra_files['foo.txt'])

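The important part of that description, in short: every module saved with torch.jit.save is first loaded onto the CPU, whatever device it came from, and is then moved back to the device it was saved from. If that move fails (for example, because the runtime system does not have that device), an exception is raised.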

Because the model had been stored with torch.jit.save, it also has to be loaded with torch.jit.load. Had it been stored directly with torch.save, it would be loaded with torch.load instead.
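The pairing of save and load functions can be sketched as follows. This is only an illustration under stated assumptions: the two function names are hypothetical, a tiny `torch.nn.Linear` layer stands in for the real detection model, and an in-memory buffer replaces the model file.

```python
import io


def roundtrip_scripted_module():
    """Save with torch.jit.save, reload with torch.jit.load."""
    # Imported inside the function so that defining it does not require PyTorch.
    import torch

    scripted = torch.jit.script(torch.nn.Linear(4, 2))  # stand-in model
    buffer = io.BytesIO()
    torch.jit.save(scripted, buffer)
    buffer.seek(0)
    return torch.jit.load(buffer, map_location="cpu")


def roundtrip_state_dict():
    """Save with torch.save, reload with torch.load."""
    import torch

    layer = torch.nn.Linear(4, 2)  # stand-in model
    buffer = io.BytesIO()
    torch.save(layer.state_dict(), buffer)
    buffer.seek(0)
    return torch.load(buffer, map_location="cpu")
```

Both loaders accept map_location, so the same CPU remapping is available whichever pair is used.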

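The official documentation in fact already explains the cause. The original model was trained and saved on a GPU. torch.jit.load first loads the model onto the CPU and then moves it to the device it was saved from, which is the GPU. My local machine has no GPU, so that move fails and the error is raised. This matches the error message, in which CUDA is missing from the list of backends available for aten::empty_strided.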

The fix is simple. Change the loading code to:

model = torch.jit.load(model_path, map_location='cpu')

Or:

model = torch.jit.load(model_path, map_location=torch.device('cpu')).to(torch.device('cpu'))
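If the same script has to run both on machines with a GPU and on machines without one, the target device can be picked at run time instead of hard-coding 'cpu'. The following is a minimal sketch, assuming the model file was written by torch.jit.save; `choose_map_location` and `load_scripted_model` are hypothetical helper names, not part of PyTorch.

```python
def choose_map_location(cuda_available: bool) -> str:
    """Return the device string to pass as map_location."""
    return "cuda" if cuda_available else "cpu"


def load_scripted_model(model_path: str):
    """Load a TorchScript model onto the GPU if one is usable, else the CPU."""
    # Imported inside the function so the helper above works without PyTorch.
    import torch

    device = choose_map_location(torch.cuda.is_available())
    # map_location remaps the saved storages, so no CUDA tensor is created
    # on a machine where CUDA is unavailable.
    model = torch.jit.load(model_path, map_location=device)
    model.eval()  # inference mode for layers such as dropout and batch norm
    return model, device
```

A call such as `model, device = load_scripted_model(model_path)` would then replace the original loading line, and input tensors should be moved to the returned `device` before inference.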