failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8

The full error output is as follows:

[1747037005.199765] [user-G5500-V7:3956167:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.199877] [user-G5500-V7:3956167:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.200048] [user-G5500-V7:3956167:a]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7f5d64200000, length=6291456, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.200062] [user-G5500-V7:3956167:a]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1747037005.306430] [user-G5500-V7:3956167:0]        ib_iface.c:1230 UCX  ERROR mlx5_0: ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.306458] [user-G5500-V7:3956167:0]      ucp_worker.c:1413 UCX  ERROR uct_iface_open(rc_verbs/mlx5_0:1) failed: Input/output error
[1747037005.353530] [user-G5500-V7:3956167:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353564] [user-G5500-V7:3956167:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.353621] [user-G5500-V7:3956167:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353635] [user-G5500-V7:3956167:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.353669] [user-G5500-V7:3956167:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353679] [user-G5500-V7:3956167:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1747037005.353724] [user-G5500-V7:3956167:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7f5d75a00000, length=37748736, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1747037005.353734] [user-G5500-V7:3956167:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error

The error above appears when the service is run under Supervisor.

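Once started, the service itself runs normally, so the code is not at fault. After some searching, the cause turned out to be that the underlying UCX library cannot obtain enough pinned (locked) memory when it registers large memory blocks. In my case the likely reason is that the supervised service mainly uses trtllm and is subject to the mlock limit on lockable memory: ulimit -l defaults to only 8192 kbytes, so UCX cannot pin enough memory.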

To verify:

ulimit -l

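If the result is not 8192 (in my interactive shell, ulimit -l reports 24704980) while the process started by Supervisor still has only 8192 kbytes, the reason is:

  • The shell session's ulimit
    The value you get from running ulimit -l in a terminal is the one in effect for a login shell, where PAM has applied /etc/security/limits.conf. It is unrelated to the default process resource limits that Supervisor uses when it starts a service.

  • Supervisor does not inherit the login shell's limits by default
    Supervisor starts as its own daemon (usually at boot) without going through the PAM login flow, so it never reads /etc/security/limits.conf, and its processes run with the default 8 MB memlock limit.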
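The same check can be made from inside the service itself, which avoids any confusion about which shell the number came from. The following is a minimal sketch using Python's standard resource module; describe_memlock is a hypothetical helper name, not something from the original service.

```python
import resource


def describe_memlock():
    """Return the (soft, hard) RLIMIT_MEMLOCK of the current process as readable strings."""

    def fmt(value):
        # The kernel reports the limit in bytes; ulimit -l and UCX print kbytes.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} kbytes"

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_memlock()
    print(f"memlock soft={soft} hard={hard}")
```

Logging this once at service startup shows the limit the supervised process really runs with, independent of what an interactive shell reports.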

# Find the PID
sudo supervisorctl status
cat /proc/3956167/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             771726               771726               processes 
Max open files            1024                 524288               files     
Max locked memory         8388608              8388608              bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       771726               771726               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 
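To pull just the relevant row out of a listing like this, the text of /proc/<pid>/limits can be parsed directly. This is only a sketch under the assumption that the file keeps the column layout shown above; parse_locked_memory and read_locked_memory are hypothetical helper names.

```python
def parse_locked_memory(limits_text):
    """Extract (soft, hard) for 'Max locked memory' from /proc/<pid>/limits text."""
    label = "Max locked memory"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # After the label the columns are: soft limit, hard limit, units.
            fields = line[len(label):].split()
            return fields[0], fields[1]
    raise ValueError("no 'Max locked memory' line found")


def read_locked_memory(pid="self"):
    """Read the locked-memory limits of a process (default: the current one)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_locked_memory(fh.read())


if __name__ == "__main__":
    print(read_locked_memory())
```

The values are returned as strings because the file may contain the word unlimited instead of a byte count.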

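Max locked memory is 8388608 bytes (8388608 / 1024 = 8192 kbytes, the value in the UCX error), which shows that the child process started by Supervisor is still held to the default 8 MB memlock limit. To make it truly unlimited there are two common approaches: lift the limit directly in the Supervisor configuration, or call setrlimit manually in the startup script.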
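The setrlimit route is not spelled out in this post, so here is a minimal sketch of what it could look like with Python's standard resource module. Both function names are hypothetical. Note the constraint: an unprivileged process can only raise its soft limit up to the hard limit, and in the listing above both are 8388608, so for that process only the second function (which needs root or CAP_SYS_RESOURCE) would change anything.

```python
import resource


def raise_memlock_soft_limit():
    """Raise the soft RLIMIT_MEMLOCK up to the hard limit; return the resulting (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft != hard:
        # Allowed without privileges: the soft limit may go as high as the hard limit.
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def try_unlimited_memlock():
    """Try to lift both limits; return True on success, False if the kernel refuses."""
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
        return True
    except (ValueError, OSError):
        # Raising a finite hard limit requires root or CAP_SYS_RESOURCE.
        return False


if __name__ == "__main__":
    if not try_unlimited_memlock():
        print("could not lift hard limit; now at", raise_memlock_soft_limit())
```

These calls would have to run at the very start of the service, before any library that pins memory (UCX here) is initialized.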

Method: lift the limit directly in the Supervisor configuration:

Edit supervisord.conf and add:

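[program:tts-server]
# Supervisor runs the command without a shell, so the variable is set via environment= rather than a VAR=value prefix
environment=CUDA_VISIBLE_DEVICES="3"
command=python -m src.service.websocket_server --host=0.0.0.0 --port=9998 --workers=1
directory=/home/wangguisen/projects/f5_tts_faster
user=wangguisen

# Lift the memlock limit
rlimit_memlock=-1:-1  ; -1 means unlimited; alternatively rlimit_memlock=4194304:4194304 (unit: KB)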

...

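If -1:-1 does not take effect, the installed Supervisor may not accept -1; use a numeric value instead, for example 16777216:16777216
(16 * 1024 * 1024 = 16777216).
If that still does not work, the Supervisor version may be too old, or the limit has not been raised system-wide in /etc/security/limits.conf.

Note: the version should preferably be ≥ 4.2. Check it with:

sudo supervisorctl version

Whether the setting took effect should be confirmed from /proc/<pid>/limits once the service has been restarted. rlimit_memlock is not among the [program:x] keys listed in the Supervisor configuration reference, so do not assume it is honored.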

Then:

sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart tts-server



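Also, if the command itself launches something like Gunicorn/Uvicorn that forks worker processes, keep in mind that resource limits are inherited across fork and preserved across exec, so the workers run with whatever limit the main process had when it forked them. If a worker still reports 8192 kbytes, check the limit of the main process first.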
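A quick way to see what a forked worker actually gets is to fork once and read the limit from inside the child. This is a minimal sketch using only the Python standard library on Linux; memlock_in_forked_child is a hypothetical helper name, and a real Gunicorn/Uvicorn worker is better inspected through /proc/<worker-pid>/limits.

```python
import os
import resource


def memlock_in_forked_child():
    """Fork once and return the child's (soft, hard) RLIMIT_MEMLOCK as seen inside the child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report its own limit through the pipe, then exit immediately.
        os.close(read_fd)
        soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
        os.write(write_fd, f"{soft} {hard}".encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read what the child reported and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        soft, hard = (int(x) for x in reader.read().split())
    os.waitpid(pid, 0)
    return soft, hard


if __name__ == "__main__":
    print("parent:", resource.getrlimit(resource.RLIMIT_MEMLOCK))
    print("child: ", memlock_in_forked_child())
```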
