This article covers visualization in PyTorch. Contents:
- TensorBoard usage guide
1.1 Installation
1.2 Startup
1.3 Image visualization
- profiler
- TensorBoard usage guide
As deep learning has advanced, network models have grown ever more complex, and it has become hard to keep track of the input/output structure of each layer. A clear view of the model structure and of how the data changes helps us develop models more efficiently. TensorBoard was built for exactly this: it saves the data you care about during a run into a folder, then reads that folder back and displays it in a browser. Besides visualizing the model structure, TensorBoard can also record and visualize the loss curve during training, images, continuous variables, parameter distributions, and more; it is both a recorder and a painter.
1.1 Installation
Install it from the command line:
pip install tensorboardX
Alternatively, you can use the tensorboard integration bundled with PyTorch (torch.utils.tensorboard), in which case tensorboardX is not needed.
1.2 Startup
Launching tensorboard involves the following steps:
- Create a writer
# Option 1: install and use tensorboardX
from tensorboardX import SummaryWriter
# Option 2: use the tensorboard integration bundled with PyTorch
from torch.utils.tensorboard import SummaryWriter
# Create the writer
log_dir = '/path/to/logs' # output path for the log files; customize as needed
writer = SummaryWriter(log_dir)
- Write and save the log files
writer.add_graph(model, input_to_model=torch.rand(1, 3, 512, 512))
writer.close()
- In the command line of the corresponding environment, run the following command to launch tensorboard
tensorboard --logdir=/path/to/logs --port=xxxx # "/path/to/logs" matches the log_dir above
# port is the port for accessing tensorboard from outside; it only needs to be set when using a remote server
- Open the visualization page
Copy the URL printed in step 3 and open it in a browser: http://localhost:6006/
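Putting the four steps together, here is a minimal end-to-end sketch (the log directory is a temporary folder and the tag name 'train/loss' is arbitrary) that writes a fake loss curve for tensorboard's SCALARS tab:

```python
import math
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter  # or: from tensorboardX import SummaryWriter

log_dir = tempfile.mkdtemp()  # stand-in for '/path/to/logs'
writer = SummaryWriter(log_dir)

# log a fake, exponentially decaying loss curve
losses = [math.exp(-0.1 * step) for step in range(50)]
for step, loss in enumerate(losses):
    writer.add_scalar('train/loss', loss, step)
writer.close()

# SummaryWriter persists the data as "events.out.tfevents.*" files;
# `tensorboard --logdir=<log_dir>` reads them and serves the web UI
event_files = [f for f in os.listdir(log_dir) if f.startswith('events')]
print(event_files)
```

Running `tensorboard --logdir` on that folder then renders the curve in the browser.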
Step 1: download the resnet34 model:
import torchvision.models as models
import os
os.environ['TORCH_HOME'] = r'E:\pytorch\Data' # change the model cache path; once downloaded, the model will be loaded from this directory
# pretrained=True downloads and uses the pretrained weights; False uses random initialization
# (newer torchvision versions deprecate `pretrained=` in favor of `weights=models.ResNet34_Weights.DEFAULT`)
resnet34 = models.resnet34(pretrained=True)
print(resnet34)
ResNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(2): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
...
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=1000, bias=True)
)
Step 2: create the writer, then write and save the log files:
import torch
import torch.nn as nn
# Option 1: install and use tensorboardX
from tensorboardX import SummaryWriter
# Option 2: use the tensorboard integration bundled with PyTorch
# from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('./runs')
writer.add_graph(resnet34, input_to_model=torch.rand(1, 3, 512, 512))
writer.close()
Step 3: in the anaconda prompt for the current environment, run the following command to launch tensorboard:
tensorboard --logdir=runs
Step 4: copy the URL printed by the previous command into a browser to view tensorboard:
1.3 Image visualization
tensorboard also makes it easy to display images.
A single image: add_image
Multiple images: add_images
Many-in-one: stitch multiple images into one with torchvision.utils.make_grid, then display it with writer.add_image
Using torchvision's CIFAR10 dataset as an example:
import torchvision
from tensorboardX import SummaryWriter
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
transform1 = transforms.Compose([transforms.ToTensor()])
train_data = datasets.CIFAR10(".", train=True, download=True, transform=transform1)
test_data = datasets.CIFAR10(".", train=False, download=True, transform=transform1)
batch_size = 256
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_data, batch_size=batch_size)
images, labels = next(iter(train_loader))
print(images.shape, labels.shape)
# view a single image
writer = SummaryWriter('./picture')
writer.add_image('images[0]', images[0])
writer.close()
Stitch multiple images into one, separated by a black grid:
writer = SummaryWriter('./picture')
img_grid = torchvision.utils.make_grid(images)
writer.add_image('images_grid', img_grid)
writer.close()
Write multiple images in directly:
writer = SummaryWriter('./picture')
writer.add_images('images', images, global_step=0)
writer.close()
Next, we modify the output layer of resnet34 and run a CIFAR10 classification task:
import numpy as np
import time
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Device: ', device)
model = resnet34
model.fc = nn.Linear(512, 10) # replace the final layer: CIFAR10 has 10 classes
model = model.to(device)
writer = SummaryWriter('./runs')
writer.add_graph(model, input_to_model=images.to(device))
writer.close()
loss_fn = nn.CrossEntropyLoss()
lr = 1e-3
max_epochs = 5
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optimizer, step_size=5, gamma=0.8)
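As a sanity check on this configuration: StepLR multiplies the learning rate by gamma=0.8 every step_size=5 epochs, so the expected schedule can be computed by hand with plain Python, no training required:

```python
# expected StepLR schedule: lr * gamma ** (epoch // step_size)
base_lr, gamma, step_size = 1e-3, 0.8, 5

def expected_lr(epoch):
    return base_lr * gamma ** (epoch // step_size)

print([round(expected_lr(e), 6) for e in [0, 4, 5, 9, 10]])
# [0.001, 0.001, 0.0008, 0.0008, 0.00064]
```

With max_epochs = 5 the decay therefore fires exactly once, right after the last epoch, which is why the learning-rate curve logged below stays flat.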
def train(max_epochs, save_dir):
    writer = SummaryWriter(save_dir)
    for epoch in range(max_epochs):
        model.train()
        train_loss, val_loss = 0, 0
        pred_label, true_label = [], []
        start_time = time.time()
        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            out = model(images)
            pred = torch.argmax(out, 1)
            loss = loss_fn(out, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * images.size(0)
            pred_label.append(pred.cpu().data.numpy())
            true_label.append(labels.cpu().data.numpy())
        train_loss = train_loss / len(train_loader.dataset)
        true_label, pred_label = np.concatenate(true_label), np.concatenate(pred_label)
        train_acc = np.sum(true_label == pred_label) / len(pred_label)
        scheduler.step()
        lr0 = scheduler.get_last_lr()
        end_time = time.time()
        cost_time = end_time - start_time
        writer.add_scalar('train_loss', train_loss, epoch)
        writer.add_scalar('learning rate', lr0[-1], epoch)
        writer.add_scalar('Train_acc', train_acc, epoch)
        with torch.no_grad():
            model.eval()
            pred_labels, true_labels = [], []
            for images, labels in test_loader:  # evaluate on the held-out test set
                images = images.to(device)
                labels = labels.to(device)
                out = model(images)
                pred = torch.argmax(out, 1)
                loss = loss_fn(out, labels)
                val_loss += loss.item() * images.size(0)
                pred_labels.append(pred.cpu().data.numpy())
                true_labels.append(labels.cpu().data.numpy())
            val_loss = val_loss / len(test_loader.dataset)
            true_labels, pred_labels = np.concatenate(true_labels), np.concatenate(pred_labels)
            val_acc = np.sum(true_labels == pred_labels) / len(pred_labels)
        writer.add_scalar('Val_loss', val_loss, epoch)
        writer.add_scalar('Val_acc', val_acc, epoch)
        print('Epoch: {}, Train_loss: {:.6f}, Train_acc:{:.2f}%, Val_loss: {:.6f}, Val_acc: {:.2f}%, time: {:.3f}s'.format(
            epoch, train_loss, train_acc * 100, val_loss, val_acc * 100, cost_time))
    writer.close()
%%time
import datetime
t0 = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M')
save_dir = './runs/' + t0 + '/'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
print('save_dir: ', save_dir)
train(max_epochs=max_epochs, save_dir=save_dir)
save_dir:  ./runs/2022-06-24-12-20/
Epoch: 0, Train_loss: 0.137615, Train_acc:95.85%, Val_loss: 0.066162, Val_acc: 97.97%, time: 50.250s
Epoch: 1, Train_loss: 0.080220, Train_acc:97.26%, Val_loss: 0.066573, Val_acc: 97.76%, time: 53.099s
Epoch: 2, Train_loss: 0.069566, Train_acc:97.70%, Val_loss: 0.065918, Val_acc: 97.72%, time: 49.805s
Epoch: 3, Train_loss: 0.069826, Train_acc:97.61%, Val_loss: 0.074443, Val_acc: 97.47%, time: 54.373s
Epoch: 4, Train_loss: 0.065152, Train_acc:97.75%, Val_loss: 0.065349, Val_acc: 97.85%, time: 55.126s
CPU times: total: 5min 52s
Wall time: 6min 23s
The results show how strong resnet34 is: it reaches around 97% accuracy on both the training and validation sets.
tensorboard can also overlay the results of different runs for comparison. Here the green curve is the resnet34 run: after only 5 epochs its accuracy already stays above 97%, which is remarkable.
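This run comparison works because tensorboard treats each subdirectory of --logdir as a separate run, which is why the code above puts every training session in its own timestamped folder under ./runs. A minimal sketch of the layout (the run names and accuracy values here are made up for illustration):

```python
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

root = tempfile.mkdtemp()  # stand-in for './runs'

# one subdirectory per run; tensorboard overlays their curves and
# lets you toggle individual runs in the left-hand panel
for run_name, final_acc in [('resnet34', 0.97), ('baseline', 0.80)]:
    writer = SummaryWriter(os.path.join(root, run_name))
    writer.add_scalar('Val_acc', final_acc, 0)
    writer.close()

print(sorted(os.listdir(root)))  # ['baseline', 'resnet34']
```

Launching `tensorboard --logdir` on the root folder then shows both curves on the same chart.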