Fixing pynvml.NVMLLibraryMismatchError: "Unversioned function called and the pyNVML version does not match"

Problem:

        While using Python's pynvml package to monitor the GPU usage of subprocesses, calling nvmlDeviceGetComputeRunningProcesses() raises the following error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "inference_pool_process.py", line 227, in <module>
    main(args.gpu, args.duration)
  File "inference_pool_process.py", line 199, in main
    collected_results = [result.get() for result in results if result.ready()]
  File "inference_pool_process.py", line 199, in <listcomp>
    collected_results = [result.get() for result in results if result.ready()]
  File "/data2/zyli86/anaconda3/envs/onnxgpu37/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
pynvml.NVMLLibraryMismatchError: Unversioned function called and the pyNVML version does not match the NVML lib version. Either use matching pyNVML and NVML lib versions or use a versioned function such as nvmlDeviceGetComputeRunningProcesses_v2

        Judging from the message, this looks like a version mismatch. Following its suggestion and calling nvmlDeviceGetComputeRunningProcesses_v2 directly also fails, this time with "Function Not Found".
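Before patching anything, it helps to confirm which variants the installed NVML library actually exports. The diagnostic sketch below (a hypothetical helper, not part of pynvml; library resolution differs per system) probes the shared library directly with ctypes:

```python
import ctypes
import ctypes.util

def probe_symbols(names, libname=None):
    """Return {symbol: present?} for a shared library, or None if it cannot be loaded."""
    libname = libname or ctypes.util.find_library("nvidia-ml")
    if libname is None:
        return None
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    # Attribute lookup on a CDLL performs a dlsym(); a missing symbol raises
    # AttributeError, so hasattr() doubles as a presence check.
    return {name: hasattr(lib, name) for name in names}

print(probe_symbols([
    "nvmlDeviceGetComputeRunningProcesses",
    "nvmlDeviceGetComputeRunningProcesses_v2",
    "nvmlDeviceGetComputeRunningProcesses_v3",
]))
```

On a 418.xx driver, only the unversioned symbol should be present, which is consistent with the "Function Not Found" error above.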

Analysis:

        Let's look at the relevant part of the pynvml.py source, where the functions involved are defined:

# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)
    if (ret == NVML_SUCCESS):
        # special case, no running processes
        return []
    elif (ret == NVML_ERROR_INSUFFICIENT_SIZE):
        # typical case
        # oversize the array incase more processes are created
        c_count.value = c_count.value * 2 + 5
        proc_array = c_nvmlProcessInfo_v2_t * c_count.value
        c_procs = proc_array()
        # make the call again
        ret = fn(handle, byref(c_count), c_procs)
        _nvmlCheckReturn(ret)
        procs = []
        for i in range(c_count.value):
            # use an alternative struct for this object
            obj = nvmlStructToFriendlyObject(c_procs[i])
            if (obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value):
                # special case for WDDM on Windows, see comment above
                obj.usedGpuMemory = None
            procs.append(obj)
        return procs
    else:
        # error case
        raise NVMLError(ret)

# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v3(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    ret = fn(handle, byref(c_count), None)

    if (ret == NVML_SUCCESS):
        # special case, no running processes
        return []
    elif (ret == NVML_ERROR_INSUFFICIENT_SIZE):
        # typical case
        # oversize the array incase more processes are created
        c_count.value = c_count.value * 2 + 5
        proc_array = c_nvmlProcessInfo_v3_t * c_count.value
        c_procs = proc_array()

        # make the call again
        ret = fn(handle, byref(c_count), c_procs)
        _nvmlCheckReturn(ret)

        procs = []
        for i in range(c_count.value):
            # use an alternative struct for this object
            obj = nvmlStructToFriendlyObject(c_procs[i])
            if (obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value):
                # special case for WDDM on Windows, see comment above
                obj.usedGpuMemory = None
            procs.append(obj)

        return procs
    else:
        # error case
        raise NVMLError(ret)

@throwOnVersionMismatch
def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)

Solutions:

        Option 1. The error is caused by a newer package running against an older driver, so installing an older nvidia-ml-py fixes it. I have verified that the version below works:

pip install nvidia-ml-py==11.450.51

        Option 2. As suggested in this GitHub issue, change the name passed to the underlying _nvmlGetFunctionPointer() call to "nvmlDeviceGetComputeRunningProcesses", i.e. drop the version suffix: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2 · Issue #43 · gpuopenanalytics/pynvml · GitHub

        To avoid patching the package source, you can write a wrapper function instead:

from pynvml import *
from pynvml import _nvmlGetFunctionPointer, _nvmlCheckReturn
from ctypes import byref, c_uint

"""
This is A Wrapper Function for old driver compatibility with Driver Version: 418.39 CUDA Version: 10.1.
"""
def get_compute_running_processes(handle):
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")  # Edited
    ret = fn(handle, byref(c_count), None)

    if ret == NVML_SUCCESS:
        return []
    elif ret == NVML_ERROR_INSUFFICIENT_SIZE:
        c_count.value = c_count.value * 2 + 5
        proc_array = c_nvmlProcessInfo_t * c_count.value
        c_procs = proc_array()

        ret = fn(handle, byref(c_count), c_procs)
        _nvmlCheckReturn(ret)

        procs = []
        for i in range(c_count.value):
            obj = nvmlStructToFriendlyObject(c_procs[i])
            if obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value:
                obj.usedGpuMemory = None
            procs.append(obj)
        return procs
    else:
        raise NVMLError(ret)

        However, if you want to use nvmlDeviceGetComputeRunningProcesses this way, that change alone is not enough. The process records now come back scrambled, with the pid and usedGpuMemory values interleaved:

c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23878, usedGpuMemory: 187695104 B, gpuInstanceId: 23874, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 187695104, usedGpuMemory: 23881 B, gpuInstanceId: 187695104, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23876, usedGpuMemory: 187695104 B, gpuInstanceId: 23872, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 187695104, usedGpuMemory: 23873 B, gpuInstanceId: 456654848, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23880, usedGpuMemory: 457703424 B, gpuInstanceId: 23875, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 456654848, usedGpuMemory: 23877 B, gpuInstanceId: 187695104, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 23879, usedGpuMemory: 457703424 B, gpuInstanceId: 0, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 0, usedGpuMemory: 0 B, gpuInstanceId: 0, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 0, usedGpuMemory: 0 B, gpuInstanceId: 0, computeInstanceId: 0)
c_procs[i]: <class 'pynvml.c_nvmlProcessInfo_v2_t'> c_nvmlProcessInfo_v2_t(pid: 0, usedGpuMemory: 0 B, gpuInstanceId: 0, computeInstanceId: 0)
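The interleaving is a struct-layout mismatch: the old driver writes 16-byte {pid, usedGpuMemory} records, while the newer pynvml reads the same buffer as 24-byte v2 records that also carry gpuInstanceId and computeInstanceId. A minimal, GPU-free ctypes sketch reproduces the misread (field names follow NVML; sizes assume a typical 64-bit ABI):

```python
import ctypes
from ctypes import c_uint, c_ulonglong

class ProcInfoV1(ctypes.Structure):      # layout the old driver fills (16 bytes)
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong)]

class ProcInfoV2(ctypes.Structure):      # layout newer pynvml expects (24 bytes)
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong),
                ('gpuInstanceId', c_uint), ('computeInstanceId', c_uint)]

# The driver writes two v1 records back to back...
buf = (ProcInfoV1 * 2)()
buf[0].pid, buf[0].usedGpuMemory = 1111, 100 * 1024 ** 2
buf[1].pid, buf[1].usedGpuMemory = 2222, 200 * 1024 ** 2

# ...but the caller reinterprets those bytes as a wider v2 record.
view = ctypes.cast(ctypes.byref(buf), ctypes.POINTER(ProcInfoV2))
print(ctypes.sizeof(ProcInfoV1), ctypes.sizeof(ProcInfoV2))  # 16 24
print(view[0].pid, view[0].gpuInstanceId)  # 1111 2222: the neighbour's pid leaks in
```

This is exactly the pattern in the log above, where gpuInstanceId holds another process's pid.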

        You also need to redefine the c_nvmlProcessInfo_t class with the last two fields removed:

class c_nvmlProcessInfo_t_edited(_PrintableStructure):
    _fields_ = [
        ('pid', c_uint),
        ('usedGpuMemory', c_ulonglong),
    ]
    _fmt_ = {'usedGpuMemory': "%d B"}

With this struct, the records parse correctly:

c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21232, usedGpuMemory: 457703424 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21235, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21234, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21238, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21237, usedGpuMemory: 187695104 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21233, usedGpuMemory: 187695104 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21231, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21230, usedGpuMemory: 457703424 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21236, usedGpuMemory: 456654848 B)
c_procs[i]: <class '__main__.c_nvmlProcessInfo_t_edited'> c_nvmlProcessInfo_t_edited(pid: 21229, usedGpuMemory: 187695104 B)
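Putting the two changes together, here is a self-contained sketch of the final wrapper (a hypothetical helper under the same 418.xx-driver assumption; for simplicity it returns plain (pid, bytes) tuples rather than pynvml's friendly objects):

```python
import ctypes
from ctypes import byref, c_uint, c_ulonglong

class c_nvmlProcessInfo_t_edited(ctypes.Structure):
    # Only the two fields the old driver actually fills.
    _fields_ = [('pid', c_uint), ('usedGpuMemory', c_ulonglong)]

def get_compute_running_processes(handle):
    """List (pid, usedGpuMemory) via the unversioned NVML entry point."""
    import pynvml  # imported lazily so the struct stays usable on GPU-less machines
    from pynvml import _nvmlGetFunctionPointer, _nvmlCheckReturn
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
    ret = fn(handle, byref(c_count), None)
    if ret == pynvml.NVML_SUCCESS:      # special case: no running processes
        return []
    if ret != pynvml.NVML_ERROR_INSUFFICIENT_SIZE:
        raise pynvml.NVMLError(ret)
    c_count.value = c_count.value * 2 + 5          # oversize, as pynvml does
    c_procs = (c_nvmlProcessInfo_t_edited * c_count.value)()
    _nvmlCheckReturn(fn(handle, byref(c_count), c_procs))
    return [(p.pid, p.usedGpuMemory) for p in c_procs[:c_count.value]]
```

Call it as get_compute_running_processes(nvmlDeviceGetHandleByIndex(0)) after nvmlInit(), wherever the original function was used.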

Discussion:

        The root cause is most likely an outdated GPU driver; my environment runs Driver Version: 418.39 with CUDA Version: 10.1. Since I am an unprivileged user and the GPU is too old for a driver update, patching the call was the only option. An even earlier pynvml release might also avoid the problem, but I have not tried that.
