Apex in Practice

weixin_34149796

Tags: python, artificial intelligence

Original link: https://juejin.im/post/5cb04cd15188251af26d25d6

You are welcome to visit my personal blog, Alex Chiu's Learning Space.

apex is an open-source module from NVIDIA for mixed-precision training in PyTorch. It makes FP16 training convenient.

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.

Its API documentation is at nvidia.github.io/apex

Pitfalls during installation

I ran into several problems while building and installing apex, and solved them by searching the project's issues.

Segmentation fault at runtime

Try gcc 5, which can be installed with conda install -c psi4 gcc-5. See github.com/NVIDIA/apex…

If you hit the "`GLIBCXX_3.4.20' not found" error

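Try locating path_to_anaconda3/lib/libstdc++.so.6 and symlinking it into the path that apex loads it from, or add that lib directory to the library search path yourself (for example through LD_LIBRARY_PATH).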
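Before creating the symlink, it can help to confirm that the Anaconda copy of libstdc++.so.6 really provides the missing version tag. The following is a minimal sketch, not part of apex: the helper name has_version_tag is hypothetical, and it simply assumes that the version string is stored null-terminated in the shared library's bytes.

```python
from pathlib import Path


def has_version_tag(lib_path, tag="GLIBCXX_3.4.20"):
    """Return True if the library file contains the given version tag.

    Hypothetical helper: it scans the raw bytes of the file for the tag
    followed by a null terminator, so that "GLIBCXX_3.4.2" does not
    match when we are looking for "GLIBCXX_3.4.20".
    """
    data = Path(lib_path).read_bytes()
    return tag.encode("ascii") + b"\x00" in data


# Example usage (path_to_anaconda3 is the placeholder used in the text):
# print(has_version_tag("path_to_anaconda3/lib/libstdc++.so.6"))
```

If the check returns False for the system library and True for the Anaconda one, linking to the Anaconda copy is the more promising fix.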

If you hit errors related to FusedLayerNorm

This is probably because the CUDA extensions were not installed. The advice from the issue is:

"Try a full pip uninstall apex, then cd apex_repo_dir; rm -rf build; python setup.py install --cuda_ext --cpp_ext and see if the segfault persists."

Reference: https://github.com/huggingface/pytorch-pretrained-BERT/issues/284
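After reinstalling, one quick way to see whether the compiled extension is visible to Python is to look it up by name. This is only a sketch: extension_available is a hypothetical helper, and the module name fused_layer_norm_cuda is my assumption about what the --cuda_ext build installs, so adjust it if your apex version uses a different name.

```python
import importlib.util


def extension_available(module_name):
    """Return True if a top-level module with this name can be found.

    Hypothetical helper: it only checks that Python can locate the
    module, it does not import it or verify that it works.
    """
    return importlib.util.find_spec(module_name) is not None


# Example usage (module name is an assumption, see the note above):
# print(extension_available("fused_layer_norm_cuda"))
```

If this prints False, the install most likely ran without the CUDA extension and the rebuild above is worth repeating.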

Pitfalls during use

AttributeError: 'NoneType' object has no attribute 'contiguous'

The model contains unused layers (weights) (example: github.com/FDecaYed/py…). After the backward pass the gradients of these weights are None, which raises "AttributeError: 'NoneType' object has no attribute 'contiguous'". See https://github.com/NVIDIA/apex/issues/131

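Solutions: 1. patch the apex source so that it checks whether a gradient is None; 2. change the model to remove the unused weights. The second approach is the better one. Alternatively, wait for an apex update.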
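To find which weights are unused, one option is to list the parameters whose gradient is still None after a backward pass. The sketch below is an illustration rather than apex code: params_without_grad is a hypothetical helper that accepts (name, parameter) pairs such as those returned by model.named_parameters() and relies only on the requires_grad and grad attributes.

```python
def params_without_grad(named_params):
    """Return the names of trainable parameters that received no gradient.

    Hypothetical helper: call it after loss.backward(). A parameter that
    requires a gradient but still has grad == None did not take part in
    computing the loss, so it is a candidate for removal from the model.
    """
    return [
        name
        for name, param in named_params
        if param.requires_grad and param.grad is None
    ]


# Example usage inside a training step:
# loss.backward()
# print(params_without_grad(model.named_parameters()))
```

Any name this returns points at a layer that can be removed from the model, which is the second solution above.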

p.type().is_cuda() ASSERT FAILED at csrc/fused_adam_cuda.cpp:12

This error was my own mistake: model.cuda() must be called before the FusedAdam optimizer is constructed, otherwise this error is raised.
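The ordering can be guarded with a small check before the optimizer is built. This is a minimal sketch under my own assumptions: params_off_gpu is a hypothetical helper that relies only on the is_cuda attribute of each parameter, and the commented usage assumes FusedAdam is imported from apex.optimizers.

```python
def params_off_gpu(named_params):
    """Return the names of parameters that are not on a CUDA device.

    Hypothetical helper: an empty result means every parameter is already
    on the GPU, so it should be safe to construct FusedAdam.
    """
    return [name for name, param in named_params if not param.is_cuda]


# Intended order (commented out because it needs apex and a GPU):
# from apex.optimizers import FusedAdam
# model = model.cuda()                        # 1. move parameters to the GPU
# assert not params_off_gpu(model.named_parameters())
# optimizer = FusedAdam(model.parameters())   # 2. only then build the optimizer
```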

cuda runtime error (77) : an illegal memory access

I am currently hitting this error and do not know how to track it down. Someone else has also reported it at github.com/huggingface…