How to upgrade PyTorch with conda: a summary of the PyTorch tutorials

Latest recommended article published on 2022-11-05 10:04:32

weixin_39605191


1. Some important concepts

Tensor

autograd

Variable

nn -- high-level abstraction

pytorch_with_examples

The nn package defines a set of Modules, which are roughly equivalent to neural network layers.
torch.nn.Linear
torch.nn.ReLU

saving_loading_models

A common PyTorch convention is to save models using either a .pt or .pth file extension.
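As a minimal sketch of that convention (the tiny nn.Linear model and the temporary file name model.pt are placeholders chosen for illustration, not taken from the tutorial), saving and restoring a state_dict might look like this:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model for illustration

# Save only the parameters (the state_dict), using the conventional .pt extension.
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to inference mode before evaluating
```

Saving the state_dict rather than the whole pickled model keeps the file independent of the class definition's import path.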

nn_tutorial

A trailing _ in PyTorch signifies that the operation is performed in-place. View is PyTorch's version of numpy's reshape.
A Sequential object runs each of the modules contained within it, in a sequential manner.
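The three notes above can be tried out in a few lines. This is only an illustrative sketch; the tensor shapes and layer sizes are arbitrary choices.

```python
import torch
import torch.nn as nn

x = torch.ones(2, 3)
y = x.add(1)   # out-of-place: returns a new tensor, x is unchanged
x.add_(1)      # in-place (trailing underscore): x itself is modified

flat = x.view(6)      # same data, new shape, like numpy's reshape
grid = x.view(-1, 2)  # -1 lets PyTorch infer this dimension (3 here)

# Sequential calls Linear -> ReLU -> Linear in the order given.
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
out = model(x)
```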

When using torch.optim, setting requires_grad to False on a parameter means it is not updated, i.e. freeze some layers.
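A minimal sketch of freezing a layer follows. The two-layer model, the choice of SGD, and the learning rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))

# Freeze the first layer: autograd will not compute gradients for it.
for p in model[0].parameters():
    p.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

frozen_before = model[0].weight.clone()
loss = model(torch.randn(8, 3)).pow(2).mean()
loss.backward()
optimizer.step()
```

After the step, the frozen layer's weights are unchanged and its .grad stays None, while the last layer receives gradients as usual.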

Load data with a Dataset and a DataLoader.
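A small sketch of that pattern, assuming an in-memory TensorDataset with made-up data rather than any dataset from the tutorials:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten samples with two features each, plus integer labels (toy data).
features = torch.arange(20, dtype=torch.float32).view(10, 2)
labels = torch.arange(10)

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Iterating the loader yields (features, labels) mini-batches.
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```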

2. Some operations

Tensor.topk to get the index of the greatest value:

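def categoryFromOutput(output):
    # output holds one row of scores; all_categories is defined elsewhere
    top_n, top_i = output.topk(1)  # largest value and its index
    category_i = top_i[0][0].item()  # convert the 0-dim tensor to a Python int
    return all_categories[category_i], category_i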

print(categoryFromOutput(output))

The loss that pairs with nn.LogSoftmax is criterion = nn.NLLLoss().
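To illustrate the pairing, the sketch below (with random logits and arbitrary targets) checks that nn.LogSoftmax followed by nn.NLLLoss gives the same value as nn.CrossEntropyLoss applied to the raw logits:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3)               # 5 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1, 0])  # class index per sample

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss combines LogSoftmax and NLLLoss in one step.
ce = nn.CrossEntropyLoss()(logits, targets)
```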

nn.LSTM

nn.GRU

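Selecting GPUs from Python, as follows

Suppose a server has several GPUs available, but we only want to use the 2nd and 4th ones, and we want the code to see just two GPUs, numbered 0 and 1. The environment variable CUDA_VISIBLE_DEVICES solves this.
For example:
CUDA_VISIBLE_DEVICES=1: only GPU 1 is visible to the program, and gpu[0] in the code refers to that GPU.
CUDA_VISIBLE_DEVICES=0,2,3: only GPUs 0, 2, and 3 are visible; in the code gpu[0] is GPU 0, gpu[1] is GPU 2, and gpu[2] is GPU 3.
CUDA_VISIBLE_DEVICES=2,0,3: only GPUs 0, 2, and 3 are visible, but in the code gpu[0] is GPU 2, gpu[1] is GPU 0, and gpu[2] is GPU 3.

In a Python program, we can write: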

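import os
# Set this before CUDA is initialized, ideally before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
# Fall back to the CPU when no GPU is visible.
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")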
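The renumbering rule for CUDA_VISIBLE_DEVICES can be sketched in plain Python. The helper logical_to_physical is hypothetical and written only for illustration; it assumes the variable holds comma-separated integer indices and that "2nd and 4th GPU" means physical indices 1 and 3.

```python
def logical_to_physical(cuda_visible_devices):
    """Map in-program GPU indices to physical GPU indices (hypothetical helper)."""
    ids = [int(s) for s in cuda_visible_devices.split(",") if s.strip()]
    # Position in the list is the index the program sees.
    return dict(enumerate(ids))

print(logical_to_physical("2,0,3"))  # {0: 2, 1: 0, 2: 3}
print(logical_to_physical("1,3"))    # {0: 1, 1: 3}
```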

torch.nn.Embedding

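2.1 About DataParallel

There is a discussion here: dataparallel-imbalanced-memory-usage.

With multiple GPUs, training an LSTM under DP concatenates the outputs along dim=0. In sequence recognition tasks, such as the open-source code below

Today I found another problem: when a model is parallelized with DataParallel, any custom member functions on the model are cleared and only the original nn.Module member functions remain; I do not know why yet. When visualizing some of the model's custom layers (such as stn), they cannot be accessed as model.stn.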
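One likely explanation for the missing attributes: nn.DataParallel is itself an nn.Module that stores the original model in its .module attribute, so custom layers and methods live one level down. The sketch below uses a made-up Net class whose stn layer is just a stand-in nn.Linear, not a real spatial transformer.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.stn = nn.Linear(4, 4)   # stand-in for a custom layer such as an STN
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        return self.head(self.stn(x))

model = nn.DataParallel(Net())

has_direct_attr = hasattr(model, "stn")  # False: the wrapper has no stn
inner = model.module                      # the original Net instance
stn_layer = inner.stn                     # custom layer is reachable here
```

Custom methods defined on the original class can be called the same way, through model.module.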

2.2 Data type conversion

PyTorch: data type conversion

2.3 Miscellaneous notes

ceil_mode in PyTorch max pooling

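3. Pitfalls encountered

3.1 While testing pytorch-yolo2 I hit this error (now resolved)

PyTorch socket.error [Errno 111] Connection refused

3.2 While testing CornerNet, I installed PyTorch and the rest of the environment with conda,

and had to upgrade the gcc version inside conda, because PyTorch requires gcc>=4.9.

The approach was to download gcc 4.9 from Anaconda Cloud, install it, and then create symlinks, as follows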

ln -s /home/20xxx/anaconda2/envs/CornerNet/bin/gcc-4.9 /home/20xxx/anaconda2/envs/CornerNet/bin/gcc
ln -s /home/20xxx/anaconda2/envs/CornerNet/bin/g++-4.9 /home/20xxx/anaconda2/envs/CornerNet/bin/g++

3.2 PyTorch and Caffe compute the pooling output size differently: PyTorch rounds down by default, while Caffe rounds up by default.
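The rounding difference shows up whenever the input size is not evenly covered by the pooling windows. In this sketch (a 5x5 input with a 2x2 window and stride 2, chosen arbitrarily), ceil_mode=True makes PyTorch round up instead of down:

```python
import torch
import torch.nn as nn

x = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)

# Default: floor((5 - 2) / 2) + 1 = 2 outputs per spatial dimension.
floor_out = nn.MaxPool2d(kernel_size=2, stride=2)(x)

# ceil_mode=True: ceil((5 - 2) / 2) + 1 = 3 outputs per spatial dimension.
ceil_out = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)(x)

print(floor_out.shape)  # torch.Size([1, 1, 2, 2])
print(ceil_out.shape)   # torch.Size([1, 1, 3, 3])
```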

3.3 RuntimeError

3.3.1 RuntimeError: Only Tensors of floating point dtype can require gradients

Running xmfbit/captcha-recognition raises this error. Change the following part of test() in main.py:

# Original call, which triggers the error:
# x, act_lengths, flatten_target, target_lengths = tensor_to_variable(
#             (x, act_lengths, flatten_target, target_lengths), volatile=True)
# Modified call:
x, act_lengths, flatten_target, target_lengths = tensor_to_variable(
            (x, act_lengths, flatten_target, target_lengths), volatile=False)

I have not yet looked into how requires_grad and volatile differ and relate.
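As background for that open question: the volatile flag was deprecated in PyTorch 0.4, and torch.no_grad() is the replacement for disabling gradient tracking during inference. The sketch below shows that, plus the way an integer tensor refuses requires_grad. It uses toy tensors and is not a fix for captcha-recognition itself.

```python
import torch

weights = torch.randn(3, requires_grad=True)

# Inference without building an autograd graph (the modern stand-in for volatile=True).
with torch.no_grad():
    out = weights * 2

# Integer tensors (e.g. lengths or label indices) cannot require gradients.
lengths = torch.tensor([3, 5])
try:
    lengths.requires_grad_(True)
    int_error = None
except RuntimeError as exc:
    int_error = str(exc)
```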

In addition, when building the Python bindings for warpctc, captcha-recognition used the CPU build of PyTorch 1.0. The GPU build, currently pytorch10-py36-cuda8.0-cudnn7.1.2, fails at runtime.

3.4 Error

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

Fix: this error occurs when training code runs inside Docker on a server with too large a batch size, so shared memory runs out (Docker limits shm). The fix is to lower the DataLoader's num_workers.

RuntimeError: DataLoader worker (pid 27) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
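A minimal sketch of the workaround, using a toy TensorDataset as a placeholder for the real training data: with num_workers=0 the batches are loaded in the main process, so no worker processes exchange data through shared memory.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(16, 3), torch.arange(16))  # toy data

# num_workers=0 loads batches in the main process; raise it gradually
# once the shared-memory limit is known to be sufficient.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

n_batches = sum(1 for _ in loader)
```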

