A summary of errors encountered when accelerating training with the GPU build of DGL, with solutions to common errors

五阿哥爱跳舞

Last modified 2022-06-10 17:45:53


Categories: python, graph neural networks / graph representation learning, operating systems. Tags: DGL

First published 2022-03-23 22:40:27

Article link: https://blog.csdn.net/adreammaker/article/details/123698333


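Contents

1. Where to look for solutions
- 1.1 Official DGL discussion forum
- 1.2 GitHub issues page
2. Problems I ran into
3. How to use the GPU in PyTorch, and what can be moved to the GPU when training a neural network
4. Saving and loading models in PyTorch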

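1. Where to look for solutions

1.1 Official DGL discussion forum

First, go to the official DGL discussion forum and paste the error message from your code into the search box in the top-right corner. In my experience this resolves roughly 80% of problems.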
https://discuss.dgl.ai/

1.2 GitHub issues page

https://github.com/dmlc/dgl/issues

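2. Problems I ran into

2.1 RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_mm)

This means the tensors involved in one operation are spread across devices: some are on the CPU and some are on CUDA.
The related thread I found on the official forum:
https://discuss.dgl.ai/t/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least-two-devices-cuda-0-and-cpu/2813

The fix is to move the graph onto the CUDA device:
g = g.to(device)
Calling g.to(device) on its own is not enough, because DGLGraph.to() returns a new graph instead of modifying g in place. The result must be assigned back, as in the line above.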

2.2 dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

Solution: every DGL graph used in forward that is not yet on the CUDA device must be moved there, together with the features assigned to it. The main issue in my case was that newly created subgraphs were on the CPU and had to be moved:

https://discuss.dgl.ai/t/error-while-trying-to-run-graphsage-train-sampling-unsupervised-py-example/1177/12
A reply to a similar question in that thread:
Okay so I added feat = feat.to('cuda') in the forward method along with the graph = graph.to('cuda') and it worked. Thank you for all your help!!

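2.3 ml_edges = G0.filter_edges(lambda edges: edges.data['ml'])

In my case, the edge IDs that DGL returned from this filter were on the CPU, so they had to be moved to the graph's device before being used with a graph on the GPU.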

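3. How to use the GPU in PyTorch, and what can be moved to the GPU when training a neural network

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

# (1) Check whether a GPU is available
if torch.cuda.is_available():
    device = torch.device('cuda')
    cudnn.benchmark = True  # let cuDNN auto-tune its kernels
else:
    device = torch.device('cpu')

# (2) When building the network, move the model and the loss function to the GPU
model = CNN().to(device)
loss = nn.CrossEntropyLoss().to(device)

# (3) During training, move each batch of data to the GPU
x, y = x.to(device), y.to(device)

# Note: only tensors can be moved to the GPU, so NumPy data must first be converted to a tensor
# with torch.tensor(x) or torch.from_numpy(x)
# For the difference between the two, see https://blog.csdn.net/github_28260175/article/details/105382060

# (4) To post-process the output with NumPy functions, first detach it, move it to the CPU, and convert it to a NumPy array
output = model(x).detach().cpu().numpy()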
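The four steps can be put together into one runnable training step. This is only a sketch: the `nn.Linear` model stands in for the `CNN()` class, which is not defined in the post, and the random NumPy data, SGD optimizer, and learning rate are assumptions made for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

# (1) Pick the device, falling back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# (2) Move the model to the device. nn.Linear is a stand-in for CNN().
model = nn.Linear(8, 3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# (3) Convert NumPy data to tensors, then move them to the device.
x_np = np.random.rand(16, 8).astype(np.float32)
y_np = np.random.randint(0, 3, size=16)
x = torch.from_numpy(x_np).to(device)         # from_numpy shares memory with x_np while on the CPU
y = torch.from_numpy(y_np).long().to(device)  # CrossEntropyLoss expects int64 class indices

# One optimisation step.
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()

# (4) Bring the output back to the CPU as a NumPy array, without tracking gradients.
with torch.no_grad():
    output = model(x).cpu().numpy()
```

Calling `.numpy()` directly on a tensor that requires gradients raises an error, which is why the last step runs under `torch.no_grad()`; calling `.detach()` first is an equivalent alternative.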

4. Saving and loading models in PyTorch

Reference
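The post gives no code for this section, so the following is only a sketch of the commonly used `state_dict` approach. The `nn.Linear` model and the temporary file path are illustrative. Passing `map_location` to `torch.load` lets a checkpoint written on a GPU machine load on a CPU-only machine, and the other way around:

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative model; any nn.Module is saved the same way.
model = nn.Linear(4, 2).to(device)

# Save only the parameters (the state_dict), not the whole pickled module.
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), path)

# To load, rebuild the same architecture, then restore the parameters onto the target device.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location=device))
restored = restored.to(device)
restored.eval()  # switch to inference mode before evaluating
```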
