train
Contents:
- x.1 Preface
- x.2 A tiny train.py script
- x.3 Completing dataset.py
- x.4 Completing train.py
- x.5 [Code] Iterations of the two scripts in the train file (dataset.py; train.py)
- x.6 Function reference
Recommended reading: x.2 and x.5
Recommended reading time: 15 min
x.1 Preface
We consider the most minimal setup. The train file contains two basic scripts: train.py and dataset.py. train.py sets the hyperparameters and runs the training loop; dataset.py loads the data.
x.2 A tiny train file (merging train.py and dataset.py)
Look at the code below. It is a tiny train file that covers the essential steps: building a dataset -> dataloader, setting the necessary hyperparameters, and running the training epochs.
20230406: add FLOPs to calculate computational complexity
# test
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset
from fvcore.nn import FlopCountAnalysis

class TrainSet(Dataset):
    def __init__(self, X, Y):
        # store the data (or set up the image paths here)
        self.X, self.Y = X, Y

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)

# calculate parameters
# reference: `https://blog.csdn.net/qq_41979513/article/details/102369396`
def get_parameter_number(model):
    total_num = sum(p.numel() for p in model.parameters())
    trainable_num = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {'Total': total_num, 'Trainable': trainable_num}

def main():
    # 1) parse args
    ...
    # 2) DataLoader
    X_tensor = torch.ones((16, 3, 224, 224))
    Y_tensor = torch.zeros((16, 5))
    mydataset = TrainSet(X_tensor, Y_tensor)
    train_loader = DataLoader(mydataset, batch_size=8, shuffle=True)
    # 3) model (swin_tiny_patch4_window7_224 must be imported or defined elsewhere)
    model = swin_tiny_patch4_window7_224(5)
    # printing the model only shows the nn.Module children defined in its __init__
    print(model)
    # 4) set hyperparameters
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    # [x] count the params (fewer is better at equal accuracy)
    # reference: `https://blog.csdn.net/qq_41979513/article/details/102369396`
    params_dict = get_parameter_number(model)
    # parameter size, assuming torch.float32 params (4 bytes each)
    size = params_dict["Trainable"] * 4 / (1024 * 1024)
    print("the total number of params is {}, trainable params is {}.\nthe size of params is {} MB".format(
        params_dict["Total"], params_dict["Trainable"], size))
    # [x] compute FLOPs (a measure of computational cost; lower means a cheaper model)
    flops1 = FlopCountAnalysis(model, X_tensor)
    print("Model's FLOPs:", flops1.total())
    # 5) Training loop
    for epoch in range(10):
        for i, (X, y) in enumerate(train_loader):
            # predict = forward pass with our model
            pred = model(X)
            loss = loss_fn(pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print('epoch={},i={}'.format(epoch, i))

if __name__ == '__main__':
    main()
20230403: add calculate parameters
# test
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset

class TrainSet(Dataset):
    def __init__(self, X, Y):
        # store the data (or set up the image paths here)
        self.X, self.Y = X, Y

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)

# calculate parameters
# reference: `https://blog.csdn.net/qq_41979513/article/details/102369396`
def get_parameter_number(model):
    total_num = sum(p.numel() for p in model.parameters())
    trainable_num = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {'Total': total_num, 'Trainable': trainable_num}

def main():
    # 1) parse args
    ...
    # 2) DataLoader
    X_tensor = torch.ones((16, 3, 224, 224))
    Y_tensor = torch.zeros((16, 5))
    mydataset = TrainSet(X_tensor, Y_tensor)
    train_loader = DataLoader(mydataset, batch_size=8, shuffle=True)
    # 3) model (swin_tiny_patch4_window7_224 must be imported or defined elsewhere)
    model = swin_tiny_patch4_window7_224(5)
    # printing the model only shows the nn.Module children defined in its __init__
    print(model)
    # 4) set hyperparameters
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    # [x] count the parameters
    # reference: `https://blog.csdn.net/qq_41979513/article/details/102369396`
    params_dict = get_parameter_number(model)
    # parameter size, assuming torch.float32 params (4 bytes each)
    size = params_dict["Trainable"] * 4 / (1024 * 1024)
    print("the total number of params is {}, trainable params is {}.\nthe size of params is {} MB".format(
        params_dict["Total"], params_dict["Trainable"], size))
    # 5) Training loop
    for epoch in range(10):
        for i, (X, y) in enumerate(train_loader):
            # predict = forward pass with our model
            pred = model(X)
            loss = loss_fn(pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print('epoch={},i={}'.format(epoch, i))

if __name__ == '__main__':
    main()
20230214: origin
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset
# from torchsummary import summary

class TrainSet(Dataset):
    def __init__(self, X, Y):
        # store the data (or set up the image paths here)
        self.X, self.Y = X, Y

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)

def main():
    X_tensor = torch.ones((4, 1, 32, 256, 256))
    Y_tensor = torch.zeros((4, 1, 32, 256, 256))
    mydataset = TrainSet(X_tensor, Y_tensor)
    train_loader = DataLoader(mydataset, batch_size=2, shuffle=True)
    net = Net()  # Net must be defined elsewhere
    print(net)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
    # 3) Training loop
    for epoch in range(10):
        for i, (X, y) in enumerate(train_loader):
            # predict = forward pass with our model
            pred = net(X)
            loss = loss_fn(pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print('epoch={},i={}'.format(epoch, i))

if __name__ == '__main__':
    main()
x.3 Completing dataset.py
dataset.py consists of two parts: a class DIYDataset(Dataset): that locates your own dataset, and an if __name__ == "__main__": block used for testing.
x.3.1 if __name__ == "__main__":
The main block is used for testing. You need to set the arguments that let the dataset run successfully: the data path to search, and the transforms to pass in (if you use ToTensor() or torch.tensor()). Example:
if __name__ == "__main__":
    # from PIL import PILLOW_VERSION
    from torchvision import transforms
    transform = (transforms.Compose([transforms.ToTensor()]), transforms.Compose([transforms.ToTensor()]))
    path = "/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/input"
    signal, target = MyDataset(path=path, train=True, transforms=transform)[0]  # runs both __init__ and __getitem__
x.3.2 class DIYDataset(Dataset):
This is the most important part of dataset.py. It does the following:
- __init__: set up the transform from the incoming arguments; choose the training or validation set; build the path list for signal and the path list for target;
- __len__: return how many signal-target pairs there are;
- __getitem__: read the actual signal-target data from the paths set in __init__, returning one matching signal-target pair at a time.
Example:
class MyDataset(Dataset):
    """
    introduce:
        this class generates your own data.
    """
    def __init__(self, path: str, train: bool, transforms: tuple) -> None:
        """
        introduce:
            generate your dataset in path, either the train dataset or the validation dataset.
            the dataset will pass through transforms for some preprocessing.
        args:
            :param str path: path that includes your training dataset and validation dataset.
            :param bool train: True, return the training dataset; False, return the validation dataset.
            :param tuple(torchvision.transforms.Compose, torchvision.transforms.Compose) transforms:
                two transforms, one for the input and one for the label. Remember the transforms for the
                training dataset may not be the same as the transforms for the validation dataset.
        return:
            void.
        """
        super().__init__()
        self.train = train
        self.transforms = transforms
        self.signal: list = []  # same as "input", "images"
        self.target: list = []  # same as "label"
        if self.train:
            # the training dataset: a list containing the paths of the training samples.
            df = pd.read_csv(os.path.join(path, "input_train.csv"))
        else:
            # the validation dataset: also a list of paths.
            df = pd.read_csv(os.path.join(path, "input_test.csv"))
        self.signal = list(df[["DIR_PATH"]].to_numpy().squeeze(1))
        self.target = list(df.loc[:, ["LABEL"]].to_numpy().squeeze(1))

    def __len__(self) -> int:
        """
        introduce:
            return the length of labels, i.e. how many pairs are in your dataset.
        args:
            void.
        return:
            :param int pairs: how many pairs are in your dataset.
        """
        pairs = len(self.target)
        return pairs

    def __getitem__(self, index):
        """
        introduce:
            return the pair [signal, target] at the given index.
        args:
            :param int index: the index^th pair of [signal, target].
        return:
            :param torch.Tensor signal: the input of your dataset.
            :param torch.Tensor target: the label of your dataset.
        """
        # load your data according to self.signal and self.target
        signal = cv2.imread(self.signal[index])
        target = np.array(self.target[index], dtype="int64")
        # before preprocessing, convert the image back to PIL
        signal = Image.fromarray(signal)
        # preprocessing in transform
        if self.transforms is not None:
            signal = self.transforms[0](signal)
            # target = self.transforms[1](target)
        target = torch.from_numpy(target)  # returns a 0-dimensional torch.Tensor
        return signal, target
x.4 Completing train.py
When writing train.py there are two functions plus an entry point: def parse_args():, def main(args):, and the if __name__ == "__main__": block that starts everything.
x.4.1 if __name__ == "__main__":
The entry point is nearly always written the same way:
if __name__ == "__main__":
    args = parse_args()
    main(args)
x.4.2 def parse_args():
parse_args() is mainly used to pass arguments. Template:
def parse_args():
    """
    introduce:
        parse your arguments.
    args:
        void.
    return:
        :param argparse.Namespace args: the args you will need later.
    """
    import argparse
    parser = argparse.ArgumentParser(description="pytorch AlexNet training")
    parser.add_argument("--data-path", default="/home/yingmuzhi/_data", help="DRIVE root")
    # exclude background
    parser.add_argument("--num-classes", default=2, type=int)
    parser.add_argument("--device", default="cuda:0", help="training device")
    parser.add_argument("-b", "--batch-size", default=32, type=int)
    parser.add_argument("--epochs", default=30, type=int, metavar="N",
                        help="number of total epochs to train")
    parser.add_argument('--lr', default=0.01, type=float, help='initial learning rate')
    parser.add_argument('--momentum', default=0.09, type=float, metavar='M',
                        help='momentum')
    parser.add_argument('--resume', default='/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/output/model/model.pth', help='resume from checkpoint')
    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    # parser.add_argument('--save-best', default=True, type=bool, help='only save best dice weights')  # can add early stopping to avoid overfitting
    # Mixed precision training parameters
    # parser.add_argument("--amp", default=False, type=bool,
    #                     help="Use torch.cuda.amp for mixed precision training")
    # parser.add_argument("--result-file", default="/home/yingmuzhi/unet/")
    # parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
    #                     metavar='W', help='weight decay (default: 1e-4)',
    #                     dest='weight_decay')
    # parser.add_argument('--print-freq', default=1, type=int, help='print frequency')
    args = parser.parse_args()
    return args
x.4.3 def main(args):
This function runs the actual training. It usually consists of the following steps:
- Parse the incoming args and configure the hyperparameters (the train script needs to set batch_size, device, dataloader, loss_function, optimizer, epoch, etc.)
- Set up the DataLoader (write the dataloader; configure transforms, dataset, dataloader, batch_size, etc., since the dataloader needs them)
- Create the model, load pre-trained weights into it, and finish configuring all hyperparameters.
- Run the epoch loop (fetch data; forward pass via loss(model()); backward pass with optimizer.zero_grad() + loss.backward() to update the gradients; optimizer.step() to update the parameters.)
def main(args):
    """
    introduce:
        train your model.
    args:
        :param argparse.Namespace args: the args you will need later.
    return:
        void.
    """
    # -------------------------
    # --- | 1. parse args | ---
    # -------------------------
    device = torch.device(args.device if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    num_workers = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])
    num_classes = args.num_classes
    data_path = args.data_path
    start_epoch = args.start_epoch  # may be overwritten by the value saved in a pre-trained checkpoint
    epochs = args.epochs
    resume = args.resume
    if args.lr:
        lr = args.lr
    if args.momentum:
        momentum = args.momentum

    # -------------------------
    # --- | 2. DataLoader | ---
    # -------------------------
    # DataLoader --- make the input (224, 224, 3)
    train_transforms = (
        transforms.Compose([
            transforms.Resize(256),
            transforms.RandomCrop((224, 224)),
            transforms.ToTensor(),
        ]),
        transforms.Compose([transforms.ToTensor()]))
    train_dataset = MyDataset(
        path="/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/input",
        train=True,
        transforms=train_transforms,
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
    )
    validation_transforms = (
        transforms.Compose([
            transforms.Resize(256),
            transforms.RandomCrop((224, 224)),
            transforms.ToTensor(),
        ]),
        transforms.Compose([transforms.ToTensor()]))
    validation_dataset = MyDataset(
        path="/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/input",
        train=False,
        transforms=validation_transforms,
    )
    validation_loader = DataLoader(
        validation_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
    )

    # ------------------------
    # --- | 3. set model | ---
    # ------------------------
    # model and hyperparameters; the loss calculation could also live in a "train one epoch" function
    model = AlexNet(num_classes=5, init_weights=True).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    lr_scheduler = None
    loss_function = nn.CrossEntropyLoss()
    # load a pre-trained model
    if os.path.exists(resume):
        checkpoint = torch.load(resume, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        # lr_scheduler.load_state_dict(checkpoint["lr_scheduler"])
        start_epoch = checkpoint["epoch"] + 1
        print("load pre-trained model successfully!")
    else:
        print("load pre-trained model failed.")

    # -----------------------------------
    # --- | 4. start epoch training | ---
    # -----------------------------------
    train_steps = len(train_loader)
    validation_steps = len(validation_loader)
    start_time = time.time()
    for epoch in range(start_epoch, epochs):
        # training dataset
        model.train()
        train_loss = 0.
        validation_loss = 0.
        for _, (signal, target) in enumerate(train_loader, start=0):
            signal = signal.to(device)
            target = target.to(device)
            output = model(signal)
            loss = loss_function(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # validation dataset
        model.eval()
        with torch.no_grad():  # do not track gradients
            for _, (signal, target) in enumerate(validation_loader, start=0):
                signal = signal.to(device)
                target = target.to(device)
                output = model(signal)
                loss = loss_function(output, target)
                validation_loss += loss.item()
        # terminal output once per epoch
        train_loss = train_loss / train_steps
        validation_loss = validation_loss / validation_steps
        print("In epoch {}, your training loss is {}, validation loss is {}".format(epoch, train_loss, validation_loss))
        # save the model
        save_files = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # "lr_scheduler": lr_scheduler.state_dict(),
            "epoch": epoch,
            "args": args,
        }
        torch.save(save_files, resume)
    # timing
    total_time = time.time() - start_time
    print("it takes total time: {}".format(total_time))
x.5 [Code] Iterations of the two scripts in the train file (dataset.py; train.py)
x.5.1 version1 AlexNet
version1 uses AlexNet as the network and the 5-class flower dataset as the data to complete the scripts.
dataset.py:
import os
import cv2
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset
from PIL import Image

class MyDataset(Dataset):
    """
    introduce:
        this class generates your own data.
    """
    def __init__(self, path: str, train: bool, transforms: tuple) -> None:
        """
        introduce:
            generate your dataset in path, either the train dataset or the validation dataset.
            the dataset will pass through transforms for some preprocessing.
        args:
            :param str path: path that includes your training dataset and validation dataset.
            :param bool train: True, return the training dataset; False, return the validation dataset.
            :param tuple(torchvision.transforms.Compose, torchvision.transforms.Compose) transforms:
                two transforms, one for the input and one for the label. Remember the transforms for the
                training dataset may not be the same as the transforms for the validation dataset.
        return:
            void.
        """
        super().__init__()
        self.train = train
        self.transforms = transforms
        self.signal: list = []  # same as "input", "images"
        self.target: list = []  # same as "label"
        if self.train:
            # the training dataset: a list containing the paths of the training samples.
            df = pd.read_csv(os.path.join(path, "input_train.csv"))
        else:
            # the validation dataset: also a list of paths.
            df = pd.read_csv(os.path.join(path, "input_test.csv"))
        self.signal = list(df[["DIR_PATH"]].to_numpy().squeeze(1))
        self.target = list(df.loc[:, ["LABEL"]].to_numpy().squeeze(1))

    def __len__(self) -> int:
        """
        introduce:
            return the length of labels, i.e. how many pairs are in your dataset.
        args:
            void.
        return:
            :param int pairs: how many pairs are in your dataset.
        """
        pairs = len(self.target)
        return pairs

    def __getitem__(self, index):
        """
        introduce:
            return the pair [signal, target] at the given index.
        args:
            :param int index: the index^th pair of [signal, target].
        return:
            :param torch.Tensor signal: the input of your dataset.
            :param torch.Tensor target: the label of your dataset.
        """
        # load your data according to self.signal and self.target
        signal = cv2.imread(self.signal[index])
        target = np.array(self.target[index], dtype="int64")
        # before preprocessing, convert the image back to PIL
        signal = Image.fromarray(signal)
        # preprocessing in transform
        if self.transforms is not None:
            signal = self.transforms[0](signal)
            # target = self.transforms[1](target)
        target = torch.from_numpy(target)  # returns a 0-dimensional torch.Tensor
        return signal, target

if __name__ == "__main__":
    # from PIL import PILLOW_VERSION
    from torchvision import transforms
    path = "/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/input"
    transform = (transforms.Compose([transforms.ToTensor()]), transforms.Compose([transforms.ToTensor()]))
    signal, target = MyDataset(path=path, train=True, transforms=transform)[0]
train.py:
'''
signal: dtype float32
target: dtype int64'''
# --- add path
import os, sys
project_path = os.path.dirname(os.path.dirname(__file__))
root_path = os.path.dirname(project_path)
sys.path.append(project_path)

# import
import time, torch
from torch.utils.data import DataLoader
from torchvision import transforms
from dataset import MyDataset
from model.model import AlexNet
import torch.optim as optim
import torch.nn as nn

def parse_args():
    """
    introduce:
        parse your arguments.
    args:
        void.
    return:
        :param argparse.Namespace args: the args you will need later.
    """
    import argparse
    parser = argparse.ArgumentParser(description="pytorch AlexNet training")
    parser.add_argument("--data-path", default="/home/yingmuzhi/_data", help="DRIVE root")
    # exclude background
    parser.add_argument("--num-classes", default=2, type=int)
    parser.add_argument("--device", default="cuda:0", help="training device")
    parser.add_argument("-b", "--batch-size", default=32, type=int)
    parser.add_argument("--epochs", default=30, type=int, metavar="N",
                        help="number of total epochs to train")
    parser.add_argument('--lr', default=0.01, type=float, help='initial learning rate')
    parser.add_argument('--momentum', default=0.09, type=float, metavar='M',
                        help='momentum')
    # parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
    #                     metavar='W', help='weight decay (default: 1e-4)',
    #                     dest='weight_decay')
    # parser.add_argument('--print-freq', default=1, type=int, help='print frequency')
    parser.add_argument('--resume', default='/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/output/model/model.pth', help='resume from checkpoint')
    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    # parser.add_argument('--save-best', default=True, type=bool, help='only save best dice weights')  # can add early stopping to avoid overfitting
    # Mixed precision training parameters
    # parser.add_argument("--amp", default=False, type=bool,
    #                     help="Use torch.cuda.amp for mixed precision training")
    # parser.add_argument("--result-file", default="/home/yingmuzhi/unet/")
    args = parser.parse_args()
    return args

def main(args):
    """
    introduce:
        train your model.
    args:
        :param argparse.Namespace args: the args you will need later.
    return:
        void.
    """
    # -------------------------
    # --- | 1. parse args | ---
    # -------------------------
    device = torch.device(args.device if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    num_workers = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])
    num_classes = args.num_classes
    data_path = args.data_path
    start_epoch = args.start_epoch  # may be overwritten by the value saved in a pre-trained checkpoint
    epochs = args.epochs
    resume = args.resume
    if args.lr:
        lr = args.lr
    if args.momentum:
        momentum = args.momentum

    # -------------------------
    # --- | 2. DataLoader | ---
    # -------------------------
    # DataLoader --- make the input (224, 224, 3)
    train_transforms = (
        transforms.Compose([
            transforms.Resize(256),
            transforms.RandomCrop((224, 224)),
            transforms.ToTensor(),
        ]),
        transforms.Compose([transforms.ToTensor()]))
    train_dataset = MyDataset(
        path="/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/input",
        train=True,
        transforms=train_transforms,
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
    )
    validation_transforms = (
        transforms.Compose([
            transforms.Resize(256),
            transforms.RandomCrop((224, 224)),
            transforms.ToTensor(),
        ]),
        transforms.Compose([transforms.ToTensor()]))
    validation_dataset = MyDataset(
        path="/home/yingmuzhi/_learning/pytorch/_pipeline/_example_AlexNet/data/input",
        train=False,
        transforms=validation_transforms,
    )
    validation_loader = DataLoader(
        validation_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
    )

    # ------------------------
    # --- | 3. set model | ---
    # ------------------------
    # model and hyperparameters; the loss calculation could also live in a "train one epoch" function
    model = AlexNet(num_classes=5, init_weights=True).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    lr_scheduler = None
    loss_function = nn.CrossEntropyLoss()
    # load a pre-trained model
    if os.path.exists(resume):
        checkpoint = torch.load(resume, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        # lr_scheduler.load_state_dict(checkpoint["lr_scheduler"])
        start_epoch = checkpoint["epoch"] + 1
        print("load pre-trained model successfully!")
    else:
        print("load pre-trained model failed.")

    # -----------------------------------
    # --- | 4. start epoch training | ---
    # -----------------------------------
    train_steps = len(train_loader)
    validation_steps = len(validation_loader)
    start_time = time.time()
    for epoch in range(start_epoch, epochs):
        # training dataset
        model.train()
        train_loss = 0.
        validation_loss = 0.
        for _, (signal, target) in enumerate(train_loader, start=0):
            signal = signal.to(device)
            target = target.to(device)
            output = model(signal)
            loss = loss_function(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # validation dataset
        model.eval()
        with torch.no_grad():  # do not track gradients
            for _, (signal, target) in enumerate(validation_loader, start=0):
                signal = signal.to(device)
                target = target.to(device)
                output = model(signal)
                loss = loss_function(output, target)
                validation_loss += loss.item()
        # terminal output once per epoch
        train_loss = train_loss / train_steps
        validation_loss = validation_loss / validation_steps
        print("In epoch {}, your training loss is {}, validation loss is {}".format(epoch, train_loss, validation_loss))
        # save the model
        save_files = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # "lr_scheduler": lr_scheduler.state_dict(),
            "epoch": epoch,
            "args": args,
        }
        torch.save(save_files, resume)
    # timing
    total_time = time.time() - start_time
    print("it takes total time: {}".format(total_time))

if __name__ == "__main__":
    args = parse_args()
    main(args)
x.6 Function reference
DataLoader
In the DataLoader step we pick suitable Transforms to pass into the Dataset, then pass the Dataset and a batch size into the DataLoader; the DataLoader will then draw one batch of data from the Dataset at a time. The key is choosing the right Transforms for the Dataset and configuring the DataLoader appropriately.
A DataLoader example:
# dataloader
device = torch.device(args.device if torch.cuda.is_available() else "cpu")
batch_size = args.batch_size
num_workers = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])
# segmentation: num_classes + background
num_classes = args.num_classes + 1
# computed using compute_mean_std.py
mean = (0.709, 0.381, 0.224)
std = (0.127, 0.079, 0.043)
train_dataset = DriveDataset(args.data_path,
                             train=True,
                             transforms=get_transform(train=True, mean=mean, std=std))
val_dataset = DriveDataset(args.data_path,
                           train=False,
                           transforms=get_transform(train=False, mean=mean, std=std))
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=batch_size,
                                           num_workers=num_workers,
                                           shuffle=True,
                                           pin_memory=True,
                                           collate_fn=train_dataset.collate_fn)
val_loader = torch.utils.data.DataLoader(val_dataset,
                                         batch_size=1,
                                         num_workers=num_workers,
                                         pin_memory=True,
                                         collate_fn=val_dataset.collate_fn)
Training setup
In this step you specify the model and optimizer, and load pre-trained weights (transfer learning).
# model
model = create_model(num_classes=num_classes)
model.to(device)
params_to_optimize = [p for p in model.parameters() if p.requires_grad]
# optimizer
optimizer = torch.optim.SGD(
    params_to_optimize,
    lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay
)
We use torch.save() and torch.load() to save and load training state. Inside save and load we call model.state_dict() and model.load_state_dict() to store and restore the trained weights in dictionary form, as follows:
# save a .pth file
# define the storage format
save_file = {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),  # the optimizer's internal parameters
             "lr_scheduler": lr_scheduler.state_dict(),
             "epoch": epoch,
             "args": args}
torch.save(save_file, "save_weights/best_model.pth")

# load a .pth file
# fetch the data from the .pth file
checkpoint = torch.load(args.resume, map_location='cpu')  # args.resume="save_weights/best_model.pth"; map_location='cpu' loads the checkpoint onto the CPU
model.load_state_dict(checkpoint['model'])  # take the value out of the dictionary by key; anything stored via .state_dict() must be restored via .load_state_dict()
optimizer.load_state_dict(checkpoint['optimizer'])
lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
args.start_epoch = checkpoint['epoch'] + 1  # values not stored via .state_dict() can be read and used directly
epoch
All training happens inside the epoch loop. Each epoch runs training and evaluation, stores the results, and records how long the round took.
The full training code:
# used to record information from training and validation
results_file = "/home/yingmuzhi/unet/results{}.txt".format(datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
best_dice = 0.
start_time = time.time()
for epoch in range(args.start_epoch, args.epochs):
    mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch, num_classes,
                                    lr_scheduler=lr_scheduler, print_freq=args.print_freq, scaler=scaler)
    confmat, dice = evaluate(model, val_loader, device=device, num_classes=num_classes)
    val_info = str(confmat)
    print(val_info)
    print(f"dice coefficient: {dice:.3f}")
    # write into txt
    with open(results_file, "a") as f:
        # record each epoch's train_loss, lr, and the validation metrics
        train_info = f"[epoch: {epoch}]\n" \
                     f"train_loss: {mean_loss:.4f}\n" \
                     f"lr: {lr:.6f}\n" \
                     f"dice coefficient: {dice:.3f}\n"
        f.write(train_info + val_info + "\n\n")
    if args.save_best is True:
        if best_dice < dice:
            best_dice = dice
        else:
            continue
    save_file = {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "lr_scheduler": lr_scheduler.state_dict(),
                 "epoch": epoch,
                 "args": args}
    if args.amp:
        save_file["scaler"] = scaler.state_dict()
    if args.save_best is True:
        torch.save(save_file, "save_weights/best_model.pth")
    else:
        torch.save(save_file, "save_weights/model_{}.pth".format(epoch))
total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print("training time {}".format(total_time_str))
Configuring arguments with argparse
Use argparse to wrap the parameters you need.
See https://blog.csdn.net/qq_43369406/article/details/127787799
An argparse example:
def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description="pytorch unet training")
    parser.add_argument("--data-path", default="./", help="DRIVE root")
    # exclude background
    parser.add_argument("--num-classes", default=1, type=int)
    parser.add_argument("--device", default="cuda", help="training device")
    parser.add_argument("-b", "--batch-size", default=4, type=int)
    parser.add_argument("--epochs", default=200, type=int, metavar="N",
                        help="number of total epochs to train")
    parser.add_argument('--lr', default=0.01, type=float, help='initial learning rate')
    parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                        help='momentum')
    parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                        metavar='W', help='weight decay (default: 1e-4)',
                        dest='weight_decay')
    parser.add_argument('--print-freq', default=1, type=int, help='print frequency')
    parser.add_argument('--resume', default='', help='resume from checkpoint')
    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    parser.add_argument('--save-best', default=True, type=bool, help='only save best dice weights')
    # Mixed precision training parameters
    parser.add_argument("--amp", default=False, type=bool,
                        help="Use torch.cuda.amp for mixed precision training")
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_args()
    args.data_path
torch.utils.data.DataLoader
DataLoader takes a dataset object and produces a DataLoader object. Its signature:
torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None,
                            batch_sampler=None, num_workers=0, collate_fn=None,
                            pin_memory=False, drop_last=False, timeout=0,
                            worker_init_fn=None, multiprocessing_context=None, generator=None,
                            *, prefetch_factor=2, persistent_workers=False, pin_memory_device='')
In practice it is enough to know that DataLoader takes a dataset and returns a DataLoader. You need to specify the dataset object; batch_size, the number of samples loaded per iteration (one batch, not one epoch), which is limited by GPU memory and, within that limit, a larger batch_size usually trains better; shuffle, whether to shuffle the data; and num_workers, the number of worker processes for loading data (configurable on Linux; set it to 0 on Windows).
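The parameters above can be sketched with a toy dataset in place of real images (a minimal sketch; the shapes and values are illustrative only):

```python
# A minimal sketch of the DataLoader parameters described above,
# using a toy TensorDataset in place of a real image dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.ones((16, 3, 8, 8))               # 16 toy "images"
Y = torch.zeros((16,), dtype=torch.int64)   # 16 toy labels
dataset = TensorDataset(X, Y)
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

batches = list(loader)
print(len(batches))         # 16 samples / batch_size 4 = 4 batches
print(batches[0][0].shape)  # each image batch is (4, 3, 8, 8)
```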
iter() and next()
iter() and next() are Python built-ins. iter() takes an iterable collection object (note that a list itself is not an iterator) and returns an iterator. Its definition:
iter(object[, sentinel])
next() retrieves the next element from an iterator.
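A quick sketch of the two functions (the same pattern is handy for pulling a single batch out of a DataLoader without a for loop):

```python
# iter() turns an iterable into an iterator; next() pulls elements one at a time.
seq = ['one', 'two', 'three']
it = iter(seq)       # the list itself is iterable but not an iterator
first = next(it)     # 'one'
second = next(it)    # 'two'
print(first, second)
```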
The torch.nn.CrossEntropyLoss class
Specifies CrossEntropyLoss as the loss function. Its definition:
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)
Usually no arguments need to be passed; the defaults are fine.
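With the defaults, a minimal sketch looks like this: the loss expects raw logits of shape (batch, num_classes) and integer class labels of shape (batch,), and returns one averaged scalar (the values below are illustrative only):

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.1, 3.0, 0.2]])  # batch of 2, 3 classes (raw logits, no softmax)
labels = torch.tensor([0, 1])             # correct class index per sample
loss = loss_fn(logits, labels)
print(loss.item())                        # a single averaged scalar
```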
The torch.optim.Adam class
Defines the Adam optimizer. Its signature:
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, *, foreach=None, maximize=False, capturable=False)
You need to supply params and lr: params is usually the network's full parameter set net.parameters() (a method inherited from torch.nn.Module that yields all parameters), and lr=0.001 sets the initial learning rate.
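A minimal sketch of the update cycle with Adam (the toy layer and target are illustrative only): one backward/step pair visibly moves the parameters.

```python
import torch
import torch.nn as nn

net = nn.Linear(2, 1)
net.weight.data.fill_(0.0)  # deterministic init for the sketch
net.bias.data.fill_(0.0)
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

before = net.weight.detach().clone()
out = net(torch.ones(1, 2))          # currently 0, target is 1.0
loss = (out - 1.0).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
after = net.weight.detach().clone()
print(torch.equal(before, after))    # False: one step moved the weights
```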
The Python enumerate() function
enumerate() wraps an iterable (such as a list, tuple, or string) into an indexed sequence, yielding both the index and the item as a tuple. It is typically used in a for loop:
>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print(i, element)
...
0 one
1 two
2 three
A dataloader is an iterator, so passing it in yields one batch per iteration; for this dataset each batch is a 4-dimensional tensor (images) and a 1-dimensional tensor (labels).
The torch.optim.Adam.zero_grad() method
This method must be called for each batch of images to clear the gradients. The loss is accumulated batch by batch and the running loss is reset at the start of each epoch, but net.parameters() is updated on every batch of every epoch.
FP, BP
# do this for every batch
outputs = net(inputs)                  # forward pass: compute y_hat
loss = loss_function(outputs, labels)  # compute the loss, i.e. the residual between y_hat and y
loss.backward()                        # backpropagate the error; this step and the next complete BP: compute the gradients and update the parameters
optimizer.step()                       # update the parameters in net
running_loss += loss.item()
loss is a 0-dimensional tensor like tensor(1.1); item() converts it to a scalar.
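The .item() conversion above can be sketched on its own:

```python
# A 0-dimensional tensor becomes a plain Python number via .item().
import torch

loss = torch.tensor(1.1)     # 0-dimensional tensor
print(loss.dim())            # 0
print(loss.item())           # a Python float close to 1.1
running_loss = 0.0
running_loss += loss.item()  # accumulate as a scalar, not a tensor
```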
with torch.no_grad()
Disables gradient tracking while computing accuracy on the validation set.
predict_y = torch.max(outputs, dim=1)[1]
The final network output outputs is a [batch, labels] tensor, e.g. [10000, 10]. torch.max(outputs, dim=1) takes the maximum over dimension 1 (across the 10 class scores), and torch.max(outputs, dim=1)[1] returns the indices (0-9) of those maxima as predict_y. So predict_y is a tensor of length 10000.
accuracy = (predict_y == test_label).sum().item() / test_label.size(0)
predict_y and test_label are both tensors of torch.Size([10000]). Indexing a tensor works like list indexing (list[0], list[1]); to read a value out of torch.Size, use tensor.size(0). As a special case, for a 0-dimensional tensor like tensor(1), use .item() to convert it to a number.
.sum() counts the elements where predict_y equals test_label, returning a 0-dimensional tensor such as tensor(12); .item() extracts its value. test_label.size(0) gives the size of the first dimension (10000). Dividing the two yields the accuracy (note that precision is a different metric).
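The accuracy computation above can be sketched on a toy batch of 4 samples and 3 classes instead of 10000 samples and 10 classes (the scores and labels are illustrative only):

```python
import torch

outputs = torch.tensor([[0.1, 0.8, 0.1],
                        [0.9, 0.05, 0.05],
                        [0.2, 0.2, 0.6],
                        [0.3, 0.4, 0.3]])   # [batch=4, classes=3]
test_label = torch.tensor([1, 0, 2, 2])

predict_y = torch.max(outputs, dim=1)[1]    # indices of the per-row maxima
accuracy = (predict_y == test_label).sum().item() / test_label.size(0)
print(predict_y.tolist())  # [1, 0, 2, 1]
print(accuracy)            # 3 correct out of 4 -> 0.75
```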
vscode step-into cannot enter the code
Add this to launch.json:
"purpose": ["debug-in-terminal"]