Jetson The call to cuMemHostRegister(0xace67b80, 33792, 0) failed
Running MPI on a Jetson fails with the following error:
--------------------------------------------------------------------------
The call to cuMemHostRegister(0xace67b80, 33792, 0) failed.
Host: tegra-ubuntu
cuMemHostRegister return value: 801
Memory Pool: sm
--------------------------------------------------------------------------
[tegra-ubuntu:03031] 11 more processes have sent help message help-mpi-common-cuda.txt / cuMemHostRegister failed
[tegra-ubuntu:03031] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
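This happens because the Jetson does not support cuMemHostRegister (return value 801 is CUDA_ERROR_NOT_SUPPORTED). The fix is to use cuMemHostAlloc instead.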
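To make the return value in the log easier to read, here is a minimal sketch of a lookup for a few CUresult codes. The helper name cu_result_name is hypothetical and only covers a handful of values from cuda.h; in a program linked against the CUDA driver you would normally call cuGetErrorName instead.

```c
/* Hypothetical helper (not part of Open MPI or CUDA): returns the cuda.h
 * name for a few CUresult values, enough to decode the log above. */
const char *cu_result_name(int code)
{
    switch (code) {
    case 0:   return "CUDA_SUCCESS";
    case 1:   return "CUDA_ERROR_INVALID_VALUE";
    case 2:   return "CUDA_ERROR_OUT_OF_MEMORY";
    case 801: return "CUDA_ERROR_NOT_SUPPORTED";
    default:  return "unknown (look it up in cuda.h)";
    }
}
```

Calling cu_result_name(801) gives CUDA_ERROR_NOT_SUPPORTED, which matches the explanation that the call itself is unavailable on this device rather than failing because of bad arguments.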
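Edit openmpi-4.1.2/opal/mca/common/cuda/common_cuda.c:
- Option 1: replace every cuMemHostRegister with cuMemHostAlloc
- Option 2: use the __aarch64__ macro, which is defined on 64-bit ARM builds, to distinguish a PC from a Jetson, for example: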
#ifdef __aarch64__ // cuMemHostRegister is not supported on Jetson
OPAL_CUDA_DLSYM(libcuda_handle, cuMemHostAlloc);
#else
OPAL_CUDA_DLSYM(libcuda_handle, cuMemHostRegister);
#endif
Apply the same kind of replacement to every use of cuMemHostRegister in openmpi-4.1.2/opal/mca/common/cuda/common_cuda.c.
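One detail to watch when replacing call sites: the two driver functions do not take the same arguments. cuMemHostRegister(void *p, size_t bytesize, unsigned int Flags) pins a buffer that already exists, while cuMemHostAlloc(void **pp, size_t bytesize, unsigned int Flags) allocates a new pinned buffer and returns it through pp. The sketch below shows one possible shape for a call-site wrapper. Everything in it is an assumption made for illustration: pin_host_buffer is a hypothetical name, and the stub_ functions are stand-ins with the same parameter lists as the driver calls so that the example compiles without cuda.h. It is not the code from common_cuda.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for cuMemHostRegister: same parameter list, no real pinning. */
int stub_cuMemHostRegister(void *p, size_t bytesize, unsigned int flags)
{
    (void)p; (void)bytesize; (void)flags;
    return 0; /* 0 = CUDA_SUCCESS */
}

/* Stand-in for cuMemHostAlloc: the real call returns page-locked memory,
 * here plain malloc is used so the sketch runs anywhere. */
int stub_cuMemHostAlloc(void **pp, size_t bytesize, unsigned int flags)
{
    (void)flags;
    *pp = malloc(bytesize);
    return *pp ? 0 : 2; /* 2 = CUDA_ERROR_OUT_OF_MEMORY */
}

/* Hypothetical wrapper. On 64-bit ARM it allocates a new buffer and sets
 * *owned to 1, so the caller knows it must release it later (cuMemFreeHost
 * in real code). Elsewhere it registers the caller's buffer in place. */
int pin_host_buffer(void *existing, size_t size, void **out, int *owned)
{
#ifdef __aarch64__
    (void)existing;
    *owned = 1;
    return stub_cuMemHostAlloc(out, size, 0);
#else
    *owned = 0;
    *out = existing;
    return stub_cuMemHostRegister(existing, size, 0);
#endif
}
```

The owned flag is there because the two paths differ in who owns the memory afterwards. Any data already stored in the original buffer is not carried over by the allocation path, so each replaced call site needs to be checked for that.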
Edit openmpi-4.1.2/opal/mca/common/cuda/help-mpi-common-cuda.txt, changing:
#
[cuMemHostRegister during init failed]
The call to cuMemHostRegister(%p, %d, 0) failed.
Host: %s
cuMemHostRegister return value: %d
Registration cache: %s
#
[cuMemHostRegister failed]
The call to cuMemHostRegister(%p, %d, 0) failed.
Host: %s
cuMemHostRegister return value: %d
Registration cache: %s
to:
#
[cuMemHostRegister during init failed]
The call to cuMemHostRegister(%p, %d, 0) failed.
Host: %s
cuMemHostRegister return value: %d
Registration cache: %s
#
[cuMemHostRegister failed]
The call to cuMemHostRegister(%p, %d, 0) failed.
Host: %s
cuMemHostRegister return value: %d
Registration cache: %s
#
[cuMemHostAlloc during init failed]
The call to cuMemHostAlloc(%p, %d, 0) failed.
Host: %s
cuMemHostAlloc return value: %d
Registration cache: %s
#
[cuMemHostAlloc failed]
The call to cuMemHostAlloc(%p, %d, 0) failed.
Host: %s
cuMemHostAlloc return value: %d
Registration cache: %s
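Each new help topic carries five placeholders in a fixed order (pointer, size, host name, return value, cache name), so any code that reports the error has to supply its arguments in that same order. As a rough check of the template, the sketch below fills in the "cuMemHostAlloc failed" text with snprintf. The function render_host_alloc_error is a hypothetical name used only here; Open MPI formats help topics through its own show-help machinery, not through this function.

```c
#include <stdio.h>

/* Hypothetical helper: renders the "cuMemHostAlloc failed" template from
 * the help file into dst. Returns what snprintf returns. */
int render_host_alloc_error(char *dst, size_t cap, void *ptr, int size,
                            const char *host, int rc, const char *cache)
{
    return snprintf(dst, cap,
                    "The call to cuMemHostAlloc(%p, %d, 0) failed.\n"
                    "Host: %s\n"
                    "cuMemHostAlloc return value: %d\n"
                    "Registration cache: %s\n",
                    ptr, size, host, rc, cache);
}
```

Called with the values from the log at the top (size 33792, host tegra-ubuntu, return value 801, cache sm), this produces a message of the same shape as the original error, with cuMemHostAlloc in place of cuMemHostRegister.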