pytorch resnet18 train note

torch and torchvision have officially recommended matching versions. How do you find the right pair? A handy trick: count backwards in lockstep from the latest releases of both, since every new torch release ships with a corresponding torchvision release.

https://download.pytorch.org/whl/torch_stable.html

Correspondence rules: the torch and torchvision versions must match each other; the CUDA version torch was built against must match the local CUDA; the conda cudatoolkit version matters less; and the Python and PyTorch versions must also be compatible.

torch    torchvision   python                  cuda
1.5.1    0.6.1         >=3.6                   9.2, 10.1, 10.2
1.5.0    0.6.0         >=3.6                   9.2, 10.1, 10.2
1.4.0    0.5.0         ==2.7, >=3.5, <=3.8     9.2, 10.0
1.3.1    0.4.2         ==2.7, >=3.5, <=3.7     9.2, 10.0
1.3.0    0.4.1         ==2.7, >=3.5, <=3.7     9.2, 10.0
1.2.0    0.4.0         ==2.7, >=3.5, <=3.7     9.2, 10.0
1.1.0    0.3.0         ==2.7, >=3.5, <=3.7     9.0, 10.0
<1.0.1   0.2.2         ==2.7, >=3.5, <=3.7     9.0, 10.0
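
The table above can be encoded as a small lookup for sanity-checking a planned install. A minimal sketch; the pairs are transcribed from the table above, not queried from any API:

```python
# torch -> torchvision pairs, transcribed from the compatibility table above
TORCH_TO_TORCHVISION = {
    "1.5.1": "0.6.1",
    "1.5.0": "0.6.0",
    "1.4.0": "0.5.0",
    "1.3.1": "0.4.2",
    "1.3.0": "0.4.1",
    "1.2.0": "0.4.0",
    "1.1.0": "0.3.0",
}

def matching_torchvision(torch_version):
    """Return the torchvision version paired with a given torch version,
    or None if the pair is not in the table."""
    return TORCH_TO_TORCHVISION.get(torch_version)

print(matching_torchvision("1.3.1"))  # -> 0.4.2
```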

torch and torchvision download page:
https://download.pytorch.org/whl/torch_stable.html

PyTorch 0.4.0 migration guide (programming differences from earlier versions): https://blog.csdn.net/sunqiande88/article/details/80172391

Installing an old environment:

conda create -n env_27 python=2.7.13

conda install pytorch=0.3.0 torchvision cuda80 cudatoolkit=8.0 six=1.12 numpy matplotlib  pandas

This failed — all sorts of library functions were incompatible. So it's better to set up a reasonably new environment, and once it works, leave it alone.

Installing OpenCV in a conda Python 3.6 environment:
conda install -c https://conda.anaconda.org/menpo opencv3  # install opencv3

Another conda virtual environment:

Python 3.5.4 |Continuum Analytics, Inc.| (default, Aug 14 2017, 13:26:58) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> print(torch.__version__)
0.4.1
>>> print(torchvision.__version__)
0.2.2
>>> print(torch.version.cuda)
8.0.61


cat /usr/local/cuda/version.txt

CUDA Version 8.0.44

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

#define CUDNN_MAJOR      6
#define CUDNN_MINOR      0
#define CUDNN_PATCHLEVEL 21
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
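
The `CUDNN_VERSION` macro above is just `MAJOR * 1000 + MINOR * 100 + PATCHLEVEL`. A small sketch that parses such `#define` lines and packs the version number the same way (the sample header text is copied from the grep output above):

```python
import re

def cudnn_version(header_text):
    """Parse the CUDNN_MAJOR/MINOR/PATCHLEVEL #defines out of cudnn.h text
    and pack them the same way the CUDNN_VERSION macro does."""
    defines = dict(re.findall(r"#define\s+CUDNN_(\w+)\s+(\d+)", header_text))
    return (int(defines["MAJOR"]) * 1000
            + int(defines["MINOR"]) * 100
            + int(defines["PATCHLEVEL"]))

header = """
#define CUDNN_MAJOR      6
#define CUDNN_MINOR      0
#define CUDNN_PATCHLEVEL 21
"""
print(cudnn_version(header))  # -> 6021
```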


resnet50 train: https://www.jianshu.com/p/b935e108ba7d

resnet18 regression train.py

#coding:utf-8
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import Dataset, DataLoader
from PIL import Image

import time
import os
import numpy as np
import matplotlib.pyplot as plt


batch_size= 64
num_classes= 50

class MyDataset(Dataset):
    def __init__(self, path, transform=None, target_transform=None):
        imgs = []
        with open(path, 'r') as fh:  # close the file when done
            for line in fh:
                line = line.rstrip()
                words = line.split()
                imgs.append((words[0], int(words[1])))
        # assign once, after the whole file is read (the original
        # re-assigned these attributes inside the loop)
        self.imgs = imgs
        self.transform = transform
        self.target_transform = target_transform
    def __getitem__(self, index):
        fn, label= self.imgs[index]
        img= Image.open(fn).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, label
    def __len__(self):
        return len(self.imgs)



train_transforms = transforms.Compose([
        transforms.Resize(224),
        #transforms.RandomHorizontalFlip(),  # random horizontal flip
        transforms.ToTensor()  # convert to a tensor
        #transforms.Normalize([0.485, 0.456, 0.406],  # normalize
        #                     [0.229, 0.224, 0.225])
])

test_valid_transforms = transforms.Compose(
        [transforms.Resize(224),
         transforms.ToTensor()
         #transforms.Normalize([0.485, 0.456, 0.406],
         #                    [0.229, 0.224, 0.225])
])

train_dataset= MyDataset('/home/zmz/model_09/train.txt', transform= train_transforms)
valid_dataset= MyDataset('/home/zmz/model_09/valid.txt' , transform= test_valid_transforms)

train_data_size= len(train_dataset)
valid_data_size= len(valid_dataset)

train_loader = DataLoader( train_dataset, batch_size= batch_size, shuffle=True)
valid_loader = DataLoader( valid_dataset, batch_size= batch_size, shuffle=True)

 
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = True  # True is the default; set False to freeze the backbone

fc_inputs= model.fc.in_features
model.fc= nn.Sequential(
    nn.Linear(fc_inputs, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, num_classes)
    #nn.LogSoftmax(dim=1)
)

device= torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model= model.to(device)

#criterion = nn.NLLLoss()
criterion=nn.CrossEntropyLoss()
optimizer=optim.SGD(model.parameters(),lr=0.01,momentum=0.8)

def train_and_valid(model, loss_function, optimizer, epochs):
    history= []
    best_acc= 0.0
    best_epoch= 0

    for epoch in range(epochs):
        epoch_start= time.time()
        print("Epoch: {}/{}".format(epoch+1, epochs))

        model.train()

        train_loss= 0.0
        train_acc= 0.0
        valid_loss= 0.0
        valid_acc= 0.0

        for batch_index, data in enumerate( train_loader, 0):
            inputs, target= data
            inputs= inputs.to(device)
            target= target.to(device)

            optimizer.zero_grad()
            outputs= model(inputs)
            loss= criterion(outputs,target)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()* inputs.size(0)
            
            ret, prediction= torch.max(outputs.data, 1)
            correct_count= prediction.eq(target.data.view_as(prediction))
            acc= torch.mean(correct_count.type(torch.FloatTensor))
            train_acc+= acc.item()*inputs.size(0)

        correct= 0
        total= 0
        with torch.no_grad():  # no gradient computation
            model.eval()

            for batch_index, (inputs, labels) in enumerate(valid_loader, 0):  # don't shadow the global batch_size
                inputs = inputs.to(device)
                labels = labels.to(device)
 
                outputs = model(inputs)
                loss = loss_function(outputs, labels)
 
                valid_loss += loss.item() * inputs.size(0)
 
                ret, predictions = torch.max(outputs.data, 1)
                correct_counts = predictions.eq(labels.data.view_as(predictions))
                acc = torch.mean(correct_counts.type(torch.FloatTensor))
                valid_acc += acc.item() * inputs.size(0)

        avg_train_loss = train_loss/train_data_size
        avg_train_acc = train_acc/train_data_size
 
        avg_valid_loss = valid_loss/valid_data_size
        avg_valid_acc = valid_acc/valid_data_size
 
        history.append([avg_train_loss, avg_valid_loss, avg_train_acc, avg_valid_acc])
 
        if best_acc < avg_valid_acc:
            best_acc = avg_valid_acc
            best_epoch = epoch + 1

        epoch_end = time.time()
 
        print("Epoch: {:03d}, Training: Loss: {:.4f}, Accuracy: {:.4f}%, \n Validation: Loss: {:.4f}, Accuracy: {:.4f}%, Time: {:.4f}s".format(
            epoch+1, avg_train_loss, avg_train_acc*100, avg_valid_loss, avg_valid_acc*100, epoch_end-epoch_start
        ))
        print("Best Accuracy for validation : {:.4f} at epoch {:03d}".format(best_acc, best_epoch))
 
        torch.save(model, '/home/zmz/model_09/models/'+'model_'+str(epoch+1)+'.pkl')
    return model, history


num_epochs = 50
trained_model, history = train_and_valid(model, criterion, optimizer, num_epochs)
torch.save(history, '/home/zmz/model_09/models/'+'_history.pkl')
 
history = np.array(history)

plt.plot(history[:, 0:2])
plt.legend(['Tr Loss', 'Val Loss'])
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
#plt.ylim(0, 1)
plt.savefig('_loss_curve.png')
plt.show()
 
plt.plot(history[:, 2:4])
plt.legend(['Tr Accuracy', 'Val Accuracy'])
plt.xlabel('Epoch Number')
plt.ylabel('Accuracy')
#plt.ylim(0, 1)
plt.savefig('_accuracy_curve.png')

plt.show()

The train/valid data is listed in a txt file, one `path label` pair per line, like this:

/home/zmz/model_09/cut2/2720.jpg 3000
/home/zmz/model_09/cut2/3095.jpg 3000
/home/zmz/model_09/cut2/3470.jpg 3000
/home/zmz/model_09/cut2/3845.jpg 3000
/home/zmz/model_09/cut2/4595.jpg 3000
/home/zmz/model_09/cut2/4970.jpg 3000
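
MyDataset above splits each line on whitespace into a path and a label. A standalone sketch of that parsing (the sample lines are copied from above; the `as_float` flag mirrors the int-vs-float difference between train.py and test.py):

```python
def read_label_file(lines, as_float=False):
    """Parse 'path label' lines, like the txt above, into (path, label) tuples."""
    cast = float if as_float else int
    samples = []
    for line in lines:
        line = line.rstrip()
        if not line:           # skip blank lines
            continue
        words = line.split()
        samples.append((words[0], cast(words[1])))
    return samples

lines = ["/home/zmz/model_09/cut2/2720.jpg 3000",
         "/home/zmz/model_09/cut2/3095.jpg 3000"]
print(read_label_file(lines))
# -> [('/home/zmz/model_09/cut2/2720.jpg', 3000), ('/home/zmz/model_09/cut2/3095.jpg', 3000)]
```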

test.py 

#coding:utf-8
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import Dataset, DataLoader
from PIL import Image

import time
import os
import numpy as np
import matplotlib.pyplot as plt


batch_size= 1
num_classes= 50

class MyDataset(Dataset):
    def __init__(self, path, transform=None, target_transform=None):
        imgs = []
        with open(path, 'r') as fh:  # close the file when done
            for line in fh:
                line = line.rstrip()
                words = line.split()
                imgs.append((words[0], float(words[1])))  # float labels for regression
        # assign once, after the whole file is read
        self.imgs = imgs
        self.transform = transform
        self.target_transform = target_transform
    def __getitem__(self, index):
        fn, label= self.imgs[index]
        img= Image.open(fn).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, label
    def __len__(self):
        return len(self.imgs)



test_valid_transforms = transforms.Compose(
        [transforms.Resize(224),
         transforms.ToTensor()
         #transforms.Normalize([0.485, 0.456, 0.406],
         #                    [0.229, 0.224, 0.225])
])

valid_dataset= MyDataset('/home/zmz/model_09/reg_valid.txt' , transform= test_valid_transforms)


valid_data_size= len(valid_dataset)


valid_loader = DataLoader( valid_dataset, batch_size= batch_size, shuffle=True)

 
model = torch.load('./fig_reg/model_47.pkl')  # = models.resnet18(pretrained=True)


device= torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model= model.to(device)

#criterion = nn.NLLLoss()
criterion = nn.MSELoss(size_average=False)  # sums instead of averaging; newer torch spells this reduction='sum'

def valid(model, loss_function):
    model.eval()
    with torch.no_grad():  # no gradient computation
        for batch_index, (inputs, labels) in enumerate(valid_loader, 0):
            inputs = inputs.to(device)
            labels = labels.float().to(device)

            outputs = model(inputs)
            loss = loss_function(outputs.squeeze(1), labels)

            # batch_size is 1, so each iteration prints one sample
            # (the original looped over the whole loader valid_data_size
            # times and printed only the last batch)
            print("Sample: {:03d}, output: {:.4f}, label: {:.4f}, Loss: {:.4f}".format(
                batch_index + 1, outputs.item(), labels.item(), loss.item()))

valid(model, criterion)


 

When the pretrained weights were frozen (not updated) during training, results were poor: only about 40% accuracy.
With the backbone locked, GPU memory use drops noticeably, about 2300/11000 MB; with it unfrozen, about 10000/11000 MB.
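
Freezing vs. updating the pretrained weights comes down to setting `requires_grad` on each parameter. A minimal sketch; the `fc` prefix assumes the torchvision ResNet naming, where the classifier head is `model.fc`:

```python
def set_backbone_trainable(model, trainable):
    """Freeze or unfreeze every parameter outside the final fc head.

    Frozen parameters store no gradients, which is why GPU memory
    drops sharply when the backbone is locked."""
    for name, param in model.named_parameters():
        if not name.startswith("fc"):
            param.requires_grad = trainable

# Usage with the script above (assumes torch/torchvision are available):
# model = models.resnet18(pretrained=True)
# set_backbone_trainable(model, False)  # freeze backbone, train only the head
```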

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1532498777990/work/aten/src/THC/generated/../generic/THCTensorMathReduce.cu:18
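
Error 59 ("device-side assert triggered") during CrossEntropyLoss training is very often a class label outside the range [0, num_classes). Note that the sample txt above stores raw values like 3000 while num_classes is 50, so presumably the labels need remapping to 0..49 first. A quick pre-flight check (`check_labels` is a hypothetical helper, not from the script above):

```python
def check_labels(labels, num_classes):
    """Return the labels that would trigger a device-side assert in
    CrossEntropyLoss, i.e. anything outside the range [0, num_classes)."""
    return [l for l in labels if not (0 <= l < num_classes)]

bad = check_labels([0, 12, 49, 3000], num_classes=50)
print(bad)  # -> [3000]
```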

On a 1080 Ti, batch sizes above 64 give out-of-memory errors.
Watch GPU memory in real time with:
watch -n 1 nvidia-smi

Issues to watch during training:

Data:

Data volume: is 1400 training images enough, or prone to overfitting?

Data augmentation.

Preprocessing / color-space conversion: the random crop, flip, center crop, color jitter, and normalization options in transforms.

Do the images and labels actually correspond?

Are the training and test sets drawn from the same distribution? Sample randomly, e.g. an 8:1:1 split into train / valid / test.
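
The 8:1:1 random split can be sketched with nothing but the standard library (the ratios and function name are mine, matching the rule of thumb above):

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle and split a sample list into train/valid/test by ratio.

    A fixed seed keeps the split reproducible across runs, so the
    same images never leak between train and test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_valid = len(samples) * ratios[1] // total
    train = samples[:n_train]
    valid = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_dataset(range(1000))
print(len(train), len(valid), len(test))  # -> 800 100 100
```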

Training:

Network input and output sizes.

Whether to use pretrained weights, and whether to update them.

Choose the loss to match the task: classification vs. regression.

Optimizer: SGD with momentum, or Adam (works really well).

Batch size: generally larger is better, around 128 is ideal, but it depends on GPU memory; here ResNet-50 at 224x224 with batch 64 nearly filled 10 GB.

Learning rate: start at 0.001 and scan by factors of 10; momentum around 0.8. Adam doesn't need this tuning.

Save the model every epoch. A good stopping point is where train loss is still falling and valid loss has bottomed out and is about to rise; if train loss itself is poor, suspect underfitting.
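
The "keep the epoch where valid loss bottoms out" rule can be tracked automatically alongside the per-epoch saves. A small bookkeeping sketch (the class and names are mine, not from the training script above):

```python
class BestTracker:
    """Remember the epoch with the lowest validation loss seen so far."""
    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None

    def update(self, epoch, valid_loss):
        """Return True when this epoch is a new best (worth saving)."""
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.best_epoch = epoch
            return True
        return False

tracker = BestTracker()
for epoch, loss in enumerate([0.9, 0.5, 0.6, 0.4, 0.7], start=1):
    if tracker.update(epoch, loss):
        pass  # here the training loop would torch.save(model, ...)
print(tracker.best_epoch, tracker.best_loss)  # -> 4 0.4
```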

 

Handling overfitting: dealing with resnet50 overfitting, https://blog.csdn.net/weixin_43610118/article/details/99561227

How to improve ResNet validation and test accuracy? https://www.zhihu.com/question/278563008

So I simply applied random flips, crops, and color shifts to every image in the training set, doubling it from 3000 to 6000 samples, and after hard-training again — it works! Val acc climbed steadily, peaking at 96.7%, and test acc ended up at 96.2%. That may sound like fooling myself — all I did was augment the data. But I consider it a legitimate technique; at least I didn't have to laboriously scrape extra images off the web. So the earlier runs really were overfitting: even with regularization and BN, and even with the network cut down to only 8 conv layers, the model still overfit, and dropout didn't help (with BN in place, dropout makes little difference). That's why nothing worked before — training accuracy quickly pinned at 100%. It also shows ResNet is a powerful tool for training networks. Apparently the dataset size and the model must match, or overfitting comes easily. With the augmented data, training was visibly harder; validation and training losses stumbled their way down toward the bottom together, but in the end the result is what matters.

Author: C-Walk
Link: https://www.zhihu.com/question/278563008/answer/401790505
Source: Zhihu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, cite the source.

torchvision ToTensor: https://blog.csdn.net/WhiffeYF/article/details/104747845

PyTorch in practice (model training, loading, testing): https://blog.csdn.net/public669/article/details/97752226

matplotlib: https://www.cnblogs.com/BackingStar/p/10986955.html

Regression and classification metrics: https://blog.csdn.net/weixin_41012399/article/details/91472569
