Problem Description:
When using the Python pynvml package to monitor the GPU usage of subprocesses, calling nvmlDeviceGetComputeRunningProcesses() raises the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "inference_pool_process.py", line 227, in <module>
main(args.gpu, args.duration)
File "inference_pool_process.py", line 199, in main
collected_results = [result.get() for result in results if result.ready()]
File "inference_pool_process.py", line 199, in <listcomp>
collected_results = [result.get() for result in results if result.ready()]
File "/data2/zyli86/anaconda3/envs/onnxgpu37/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
pynvml.NVMLLibraryMismatchError: Unversioned function called and the pyNVML version does not match the NVML lib version. Either use matching pyNVML and NVML lib versions or use a versioned function such as nvmlDeviceGetComputeRunningProcesses_v2
According to the error message, this looks like a version mismatch. However, trying the suggested nvmlDeviceGetComputeRunningProcesses_v2 function also fails (with a "Function Not Found" error).
Problem Analysis:
Looking at the relevant part of the pynvml.py source, the functions in question are defined as follows:
# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)

    if (ret == NVML_SUCCESS):
        # special case, no running processes
        return []
    elif (ret == NVML_ERROR_INSUFFICIENT_SIZE):
        # typical case
        # oversize the array incase more processes are created
        c_count.value = c_count.value * 2 + 5
        proc_array = c_nvmlProcessInfo_v2_t * c_count.value
        c_procs = proc_array()

        # make the call again
        ret = fn(handle, byref(c_count), c_procs)
        _nvmlCheckReturn(ret)

        procs = []
        for i in range(c_count.value):
            # use an alternative struct for this object
            obj = nvmlStructToFriendlyObject(c_procs[i])
            if (obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value):
                # special case for WDDM on Windows, see comment above
                obj.usedGpuMemory = None
            procs.append(obj)
        return procs
    else:
        # error case
        raise NVMLError(ret)

# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v3(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    ret = fn(handle, byref(c_count), None)

    if (ret == NVML_SUCCESS):
        # special case, no running processes
        return []
    elif (ret == NVML_ERROR_INSUFFICIENT_SIZE):
        # typical case
        # oversize the array incase more processes are created
        c_count.value = c_count.value * 2 + 5
        proc_array = c_nvmlProcessInfo_v3_t * c_count.value
        c_procs = proc_array()

        # make the call again
        ret = fn(handle, byref(c_count), c_procs)
        _nvmlCheckReturn(ret)

        procs = []
        for i in range(c_count.value):
            # use an alternative struct for this object
            obj = nvmlStructToFriendlyObject(c_procs[i])
            if (obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value):
                # special case for WDDM on Windows, see comment above
                obj.usedGpuMemory = None
            procs.append(obj)
        return procs
    else:
        # error case
        raise NVMLError(ret)

@throwOnVersionMismatch
def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)
Solutions:
Method 1: The error is caused by a newer package combined with an older driver, so installing an older nvidia-ml-py avoids it. I have verified that this version works:
pip install nvidia-ml-py==11.450.51
Method 2: Following the workaround suggested in this issue, change the argument passed to the underlying _nvmlGetFunctionPointer() call to "nvmlDeviceGetComputeRunningProcesses" (i.e., drop the version suffix): undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2 · Issue #43 · gpuopenanalytics/pynvml · GitHub
To avoid modifying the package source, you can write a wrapper function instead:
from pynvml import *
from pynvml import _nvmlGetFunctionPointer, _nvmlCheckReturn
from ctypes import byref, c_uint

"""
This is a wrapper function for old-driver compatibility (Driver Version: 418.39, CUDA Version: 10.1).
"""
def get_compute_running_processes(handle):
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")  # Edited
    ret = fn(handle, byref(c_count), None)

    if ret == NVML_SUCCESS:
        return []
    elif ret == NVML_ERROR_INSUFFICIENT_SIZE:
        c_count.value = c_count.value * 2 + 5
        proc_array = c_nvmlProcessInfo_t * c_count.value
        c_procs = proc_array()

        ret = fn(handle, byref(c_count), c_procs)
        _nvmlCheckReturn(ret)

        procs = []
        for i in range(c_count.value):
            obj = nvmlStructToFriendlyObject(c_procs[i])
            if obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value:
                obj.usedGpuMemory = None
            procs.append(obj)
        return procs
    else:
        raise NVMLError(ret)
But if you want to use nvmlDeviceGetComputeRunningProcesses this way, the change above is not enough by itself. The returned process records come back scrambled, with the pid and usedGpuMemory values interleaved:
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23878, usedGpuMemory: 187695104 B, gpuInstanceId: 23874, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 187695104, usedGpuMemory: 23881 B, gpuInstanceId: 187695104, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23876, usedGpuMemory: 187695104 B, gpuInstanceId: 23872, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 187695104, usedGpuMemory: 23873 B, gpuInstanceId: 456654848, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23880, usedGpuMemory: 457703424 B, gpuInstanceId: 23875, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 456654848, usedGpuMemory: 23877 B, gpuInstanceId: 187695104, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23879, usedGpuMemory: 457703424 B, gpuInstanceId: 0, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 0, usedGpuMemory: 0 B, gpuInstanceId: 0, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 0, usedGpuMemory: 0 B, gpuInstanceId: 0, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 0, usedGpuMemory: 0 B, gpuInstanceId: 0, computeInstanceId: 0)
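This scrambling can be reproduced with plain ctypes, which also explains it: on this driver, NVML fills the buffer with the old two-field record layout, while newer pynvml reads the bytes back through the four-field c_nvmlProcessInfo_v2_t, so the per-record stride is wrong and neighboring records bleed into each other's fields. A minimal sketch of the effect (the struct names below are illustrative stand-ins rather than the real pynvml classes, and the layout assumes the 8-byte c_ulonglong alignment typical of 64-bit platforms):

```python
from ctypes import Structure, c_uint, c_ulonglong, sizeof, memmove

# Old-driver layout: what NVML actually writes into the buffer
class OldProcInfo(Structure):
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong)]

# Newer layout assumed by recent pynvml: two extra fields per record
class NewProcInfo(Structure):
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong),
                ('gpuInstanceId', c_uint), ('computeInstanceId', c_uint)]

# Simulate the driver writing two old-layout records
old = (OldProcInfo * 2)()
old[0].pid, old[0].usedGpuMemory = 23878, 187695104
old[1].pid, old[1].usedGpuMemory = 23874, 456654848

# Reinterpret those bytes through the larger struct, as mismatched pynvml does
new = (NewProcInfo * 2)()
memmove(new, old, sizeof(old))

# Record 0 looks plausible, but its gpuInstanceId is really record 1's pid
assert new[0].pid == 23878
assert new[0].usedGpuMemory == 187695104
assert new[0].gpuInstanceId == 23874      # actually old[1].pid

# Record 1 starts at the wrong offset: its pid is the low 32 bits
# of record 1's memory value, and its memory field reads past the data
assert new[1].pid == 456654848
assert new[1].usedGpuMemory == 0
```

This reproduces the interleaved pid/usedGpuMemory pattern in the output above: a real pid followed by a memory-sized "pid" in the next record.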
You also need to redefine the c_nvmlProcessInfo_t class, dropping the last two fields:
from ctypes import c_uint, c_ulonglong
from pynvml import _PrintableStructure

class c_nvmlProcessInfo_t_edited(_PrintableStructure):
    _fields_ = [
        ('pid', c_uint),
        ('usedGpuMemory', c_ulonglong),
    ]
    _fmt_ = {'usedGpuMemory': "%d B"}

With the edited struct, the process records parse correctly:
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21232, usedGpuMemory: 457703424 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21235, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21234, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21238, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21237, usedGpuMemory: 187695104 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21233, usedGpuMemory: 187695104 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21231, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21230, usedGpuMemory: 457703424 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21236, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21229, usedGpuMemory: 187695104 B)
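The same ctypes experiment shows why the two-field struct fixes it: its stride now matches what the old driver wrote, so every record lands on its own pid/usedGpuMemory pair. A sketch under the same layout assumptions as before (illustrative struct names, 64-bit alignment):

```python
from ctypes import Structure, c_uint, c_ulonglong, sizeof, memmove

# Layout written by the old driver
class OldProcInfo(Structure):
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong)]

# Same two fields as c_nvmlProcessInfo_t_edited above
class EditedProcInfo(Structure):
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong)]

src = (OldProcInfo * 2)()
src[0].pid, src[0].usedGpuMemory = 21232, 457703424
src[1].pid, src[1].usedGpuMemory = 21235, 456654848

dst = (EditedProcInfo * 2)()
memmove(dst, src, sizeof(src))

# Matching strides: every record parses cleanly, no field bleed
assert dst[0].pid == 21232 and dst[0].usedGpuMemory == 457703424
assert dst[1].pid == 21235 and dst[1].usedGpuMemory == 456654848
```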
Discussion:
The root cause is most likely an outdated GPU driver; my environment is Driver Version: 418.39, CUDA Version: 10.1. Since I am an unprivileged user on a machine whose driver cannot be updated, patching the function was the only option. An earlier release of pynvml might also avoid the problem, but I have not tried that yet.