- Problem:
Training the TDNN fails with an error at iteration 110.
The corresponding log file shows:
ERROR (nnet3-chain-train[5.5.0-]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 2502950912 bytes. Possibly this is due to sharing the GPU. Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py. Memory info: free:4773M, used:6244M, total:11018M, free/total:0.433275 CUDA error: 'out of memory'
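To compare the figures in that message, here is a minimal shell sketch that extracts the requested size and the reported free memory. The `log_line` string is an abridged copy of the error above, the variable names are illustrative only, and the `M` suffix is assumed to mean MiB.

```shell
#!/usr/bin/env bash
# Abridged copy of the error line quoted above (illustrative).
log_line='Failed to allocate a memory region of 2502950912 bytes. Memory info: free:4773M, used:6244M, total:11018M'

# Extract the requested size in bytes and the reported free memory in M.
requested_bytes=$(printf '%s\n' "$log_line" | sed -n 's/.*memory region of \([0-9][0-9]*\) bytes.*/\1/p')
free_m=$(printf '%s\n' "$log_line" | sed -n 's/.*free:\([0-9][0-9]*\)M.*/\1/p')

# Convert bytes to MiB with integer arithmetic.
requested_mib=$(( requested_bytes / 1024 / 1024 ))

echo "requested: ${requested_mib} MiB, reported free: ${free_m}M"
```

The request works out to 2387 MiB against 4773M reported free, so the card was not simply full. That may be consistent with the message's own hint that the GPU was being shared with another process.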
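- Solution:
Switch the GPU to exclusive mode:
sudo nvidia-smi -c 3
Modify run_e2e_tdnn.sh (the error message above suggests passing --use-gpu=wait to steps/nnet3/chain/train.py).
Then rerun the script.
This resolved the error.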
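The steps above can be sketched in shell as follows. The note does not record exactly what was changed in run_e2e_tdnn.sh, so the `--use-gpu=wait` option here is an assumption taken from the error message, and `use_gpu` and `train_opts` are hypothetical variable names, not names from the script.

```shell
#!/usr/bin/env bash
# Show each GPU's current compute mode, if the NVIDIA tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,compute_mode --format=csv \
    || echo "nvidia-smi query failed"
  # Switching modes needs root and resets to the default after a reboot.
  # Uncomment to apply (3 = exclusive process):
  # sudo nvidia-smi -c 3
fi

# Assumption: the edit to run_e2e_tdnn.sh adds the option named in the
# error message to the existing steps/nnet3/chain/train.py call.
use_gpu=wait
train_opts="--use-gpu=${use_gpu}"
echo "option to add to the steps/nnet3/chain/train.py call: ${train_opts}"
```

Because the compute mode does not persist across reboots, the `nvidia-smi -c 3` step has to be repeated after a restart before training again.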