The full error output is as follows:
[1747037005.199765] [user-G5500-V7:3956167:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.199877] [user-G5500-V7:3956167:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.200048] [user-G5500-V7:3956167:a] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7f5d64200000, length=6291456, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.200062] [user-G5500-V7:3956167:a] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1747037005.306430] [user-G5500-V7:3956167:0] ib_iface.c:1230 UCX ERROR mlx5_0: ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.306458] [user-G5500-V7:3956167:0] ucp_worker.c:1413 UCX ERROR uct_iface_open(rc_verbs/mlx5_0:1) failed: Input/output error
[1747037005.353530] [user-G5500-V7:3956167:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353564] [user-G5500-V7:3956167:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.353621] [user-G5500-V7:3956167:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353635] [user-G5500-V7:3956167:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.353669] [user-G5500-V7:3956167:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353679] [user-G5500-V7:3956167:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.353724] [user-G5500-V7:3956167:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353734] [user-G5500-V7:3956167:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
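The errors above appear when the service is run under Supervisor.
Once started, the service still runs normally, so the code itself is not the problem. After some digging, the cause turned out to be that the underlying UCX layer cannot obtain enough pinned (locked) memory when it registers large memory blocks. In my case the likely reason is that the supervised service relies mainly on trtllm, and the amount of memory a process may lock (as with mlock()) is limited: ulimit -l defaults to only 8192 KB, so UCX cannot pin the memory it asks for.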
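To put numbers on this, the sizes in the log can be compared with the limit it reports. The snippet below is only arithmetic on values copied from the error lines above; the names `REQUESTS` and `to_mib` are made up for illustration.

```python
# Values copied from the UCX error lines above.
MEMLOCK_LIMIT_KB = 8192          # "current: 8192 kbytes"
REQUESTS = {                     # ibv_reg_mr lengths in bytes, keyed by memory pool name
    "rc_recv_desc": 37748736,
    "ud_recv_skb": 6291456,
}


def to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / (1024 * 1024)


limit_bytes = MEMLOCK_LIMIT_KB * 1024
for name, size in REQUESTS.items():
    print(f"{name}: {to_mib(size):.0f} MiB requested, "
          f"memlock limit is {to_mib(limit_bytes):.0f} MiB")
```

The rc_recv_desc pool alone asks for 36 MiB against an 8 MiB limit. The 6 MiB ud_recv_skb request is also rejected, presumably because the limit applies to the total locked memory of the process rather than to each registration separately.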
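Verification:
ulimit -l
If the result is not 8192 (in my interactive shell, ulimit -l does report 24704980) but the process started by Supervisor is still limited to 8192 kbytes, the reasons are:
- The shell session's ulimit
  The value you get from ulimit -l in a terminal is the one in effect after a login shell, where PAM has applied /etc/security/limits.conf. That is a different thing from the default process resource limits Supervisor uses when it starts the service.
- Supervisor does not inherit the login shell's limits by default
  Supervisor runs as its own daemon (usually started at system boot). It does not go through the PAM login flow, so it never reads /etc/security/limits.conf, and its processes end up with the default 8 MB memlock limit.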
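Because the shell and the supervised process can disagree, it may help to have the service report its own limit at startup. A minimal sketch using Python's standard `resource` module; the helper name `describe_memlock` is hypothetical, and the kbytes formatting just mirrors what `ulimit -l` prints.

```python
import resource


def describe_memlock():
    """Return the calling process's memlock limit in a ulimit-like form."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # getrlimit reports bytes; ulimit -l reports kbytes.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} kbytes"

    return f"memlock soft={fmt(soft)} hard={fmt(hard)}"


print(describe_memlock())
```

Calling this early in the service entry point, and reading the result from the Supervisor log, shows the limit the process really runs with, independent of what an interactive shell reports.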
# Find the PID
sudo supervisorctl status
cat /proc/3956167/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             771726               771726               processes
Max open files            1024                 524288               files
Max locked memory         8388608              8388608              bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       771726               771726               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
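As you can see, Max locked memory is 8388608, which means the child process started by Supervisor is still held to the default 8 MB memlock limit. To make it truly unlimited, there are two common approaches: lift the limit directly in the Supervisor configuration, or call setrlimit manually in the startup script.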
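The same check can be scripted so it is easy to repeat after each configuration change. The sketch below parses the text of /proc/<pid>/limits; `parse_memlock` and `read_memlock` are hypothetical helper names, and the values are returned as strings because they may be the word "unlimited".

```python
def parse_memlock(limits_text):
    """Return (soft, hard) for 'Max locked memory' from /proc/<pid>/limits text.

    Returns None if the row is not present.
    """
    prefix = "Max locked memory"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Remaining fields are: soft limit, hard limit, units.
            soft, hard = line[len(prefix):].split()[:2]
            return soft, hard
    return None


def read_memlock(pid):
    """Read the memlock limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_memlock(f.read())


sample = "Max locked memory         8388608              8388608              bytes"
print(parse_memlock(sample))
```

With the PID taken from `sudo supervisorctl status`, `read_memlock(pid)` gives the pair to compare before and after the fix.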
Method: lift the limit directly in the Supervisor configuration.
Edit supervisord.conf and add:
[program:tts-server]
; Supervisor does not run command through a shell, so the variable is set via environment=
environment=CUDA_VISIBLE_DEVICES="3"
command=python -m src.service.websocket_server --host=0.0.0.0 --port=9998 --workers=1
directory=/home/wangguisen/projects/f5_tts_faster
user=wangguisen
# Lift the memlock limit
rlimit_memlock=-1:-1 ; -1 means unlimited; alternatively rlimit_memlock=4194304:4194304 (unit: KB)
...
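If -1:-1 has no effect, the installed Supervisor may not support -1 (or may not recognize this key at all, which is worth checking against the configuration reference for your version). A numeric value can be used instead:
16777216:16777216
16 * 1024 * 1024 = 16777216 (with the KB unit noted in the config comment, that is 16 GiB)
If that still does not work, the Supervisor version may be too old, or the limit may not have been granted system-wide in /etc/security/limits.conf.
Note that the version should preferably be ≥ 4.2; check it with:
sudo supervisorctl version
Then: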
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart tts-server
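The second approach mentioned earlier, calling setrlimit in the startup script, could be sketched as follows with Python's standard `resource` module. `raise_memlock` is a hypothetical helper, not part of the service. One assumption matters here: an unprivileged process cannot raise its hard limit, so with a hard limit of 8388608 and user=wangguisen this can only lift the soft limit up to the hard limit; going to unlimited needs the call to happen with sufficient privileges (for example before dropping to the service user).

```python
import resource


def raise_memlock(target=resource.RLIM_INFINITY):
    """Try to raise RLIMIT_MEMLOCK; return the (soft, hard) pair now in effect."""
    _, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    try:
        # Works if the process is privileged, or if target does not exceed
        # the current hard limit.
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (target, target))
    except (ValueError, OSError):
        # Not allowed to raise the hard limit: at least lift soft up to hard.
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    # Call this before importing anything that initializes UCX.
    print(raise_memlock())
```

The call has to run before the libraries that register memory are initialized, since the limit is checked at registration time.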
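One more point: if the command in turn starts something like Gunicorn/Uvicorn that forks worker processes, those workers are covered too. Resource limits are inherited across fork() and preserved across execve(), so the workers carry the same memlock limit as the main process Supervisor launched. If in doubt, check /proc/<worker pid>/limits for a worker as well.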
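To see what limit a forked worker actually ends up with, the parent's and child's values can be compared directly. This is a small standalone experiment, not code from the service; `memlock_in_forked_child` is a hypothetical name, and it relies on os.fork, so it only runs on Unix-like systems.

```python
import os
import resource


def memlock_in_forked_child():
    """Fork once and return (parent_limits, child_limits) for RLIMIT_MEMLOCK."""
    parent = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report its own limits through the pipe, then exit immediately.
        os.close(read_fd)
        soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
        os.write(write_fd, f"{soft} {hard}".encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the child's report and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as f:
        data = f.read()
    os.waitpid(pid, 0)
    child = tuple(int(x) for x in data.split())
    return parent, child


parent_limits, child_limits = memlock_in_forked_child()
print("parent:", parent_limits, "child:", child_limits)
```

If the two pairs differ in a real deployment, something between the parent and the worker is changing the limit explicitly, which is worth tracking down.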