Connection error
ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
** KD training takes 6 seconds.
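Solution
Use top to find the processes using the most memory (usually a leftover training process that is still holding port 29500) and kill them.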
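Before killing anything, it can help to confirm that the port really is occupied and to pick a replacement. The following is a minimal sketch using only Python's standard library; the helper names port_in_use and find_free_port are hypothetical, and it assumes the launcher runs on a single node so that checking 127.0.0.1 is enough.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists
        return sock.connect_ex((host, port)) == 0


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    default_port = 29500  # the port named in the error message
    if port_in_use(default_port):
        print(f"port {default_port} is busy; try {find_free_port()}")
    else:
        print(f"port {default_port} is free")
```

The port it prints could then be passed through the main_process_port option that the error message mentions. As that message also says, setting main_process_port to 0 lets the launcher pick the next open port on a single node, which avoids the check entirely.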
CUDA error
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
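Solution
Change the setting to 1. "invalid device ordinal" means the script asked for a GPU index that does not exist on the machine, so the value to lower is presumably the number of processes or GPUs configured for the launch.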
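To see why a value of 1 works, compare the GPU indices a launch would request with the number of GPUs that are actually visible. This is only a sketch: invalid_ordinals is a hypothetical helper, the requested list is made up for illustration, and the torch import is optional (without it the script assumes a single-GPU machine).

```python
from typing import List


def invalid_ordinals(requested: List[int], available: int) -> List[int]:
    """Return the requested GPU indices that do not exist on this machine."""
    return [i for i in requested if i < 0 or i >= available]


if __name__ == "__main__":
    try:
        import torch  # optional; only used to query the real GPU count
        available = torch.cuda.device_count()
    except ImportError:
        available = 1  # assumption for illustration: a single-GPU machine
    requested = [0, 1]  # hypothetical: a launch configured for two processes
    bad = invalid_ordinals(requested, available)
    if bad:
        print(f"{available} GPU(s) visible, but indices {bad} were requested")
    else:
        print("all requested GPU indices exist")
```

With one visible GPU only index 0 is valid, so a launch that spawns two processes will request index 1 and fail with the error shown.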