K同学 [365-Day Deep Learning Training Camp] Week 8 Notes, J1: ResNet-50 in Practice and Explained, Bird Recognition

54afive

Last modified: 2024-02-03 08:56:56

Tags: deep learning, artificial intelligence

First published: 2024-02-02 22:46:53

Article link: https://blog.csdn.net/afive54/article/details/136001235

- System environment: WIN10-WSL2-Ubuntu22.04

- Language environment: Python 3.9.18

- Editor: VS Code + Jupyter Notebook

- Deep learning framework: PyTorch 2.1.2

- GPU: NVIDIA GeForce RTX 2080

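Residual neural networks

A residual network adds the input of a group of convolutional layers directly to that group's output through an identity shortcut, so the layers only have to learn a residual F(x) and the block outputs F(x) + x. This eases the degradation and vanishing-gradient problems that make plain deep networks hard to train, which makes it possible to build much deeper networks.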
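
The identity shortcut can be sketched in a few lines. This is a minimal illustration, not the block used later in this post: the class name `TinyResidual` and the zero initialization are assumptions made for the example. With the residual branch zeroed, the block reduces to an exact identity mapping, which is the sense in which an extra residual block can start out leaving its input unchanged.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical minimal residual block: y = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        # F(x): a single 3x3 convolution that preserves the spatial size
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(x) + x  # residual branch plus identity shortcut

block = TinyResidual(4)
nn.init.zeros_(block.conv.weight)  # force F(x) = 0 so only the shortcut contributes

x = torch.randn(1, 4, 8, 8)
y = block(x)
print(torch.equal(y, x))  # compares the output with the input element by element
```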

The ResNet-50 model built in this session of the training camp is a model with a depth of 50 layers.
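
The figure of 50 can be checked with a little arithmetic. The sketch below assumes the usual counting convention (weighted layers on the main path only, so the shortcut projections and batch-norm layers are not counted) and uses the same per-stage block counts as the PyTorch model later in this post.

```python
# Bottleneck blocks per stage, matching blocks=3, 4, 6, 3 in the model below
blocks_per_stage = [3, 4, 6, 3]
convs_per_block = 3  # each bottleneck block is 1x1 -> 3x3 -> 1x1

stem_conv = 1  # the initial 7x7 convolution
fc_layer = 1   # the final fully connected classifier

depth = stem_conv + convs_per_block * sum(blocks_per_stage) + fc_layer
print(depth)
```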

The original TensorFlow code is simple and runs directly once the inputs are set correctly.
Its output was:

Epoch 1/10
2024-02-02 22:03:01.962842: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
2024-02-02 22:03:02.156792: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-02-02 22:03:03.783534: I external/local_xla/xla/service/service.cc:168] XLA service 0x8e845e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-02-02 22:03:03.783584: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2080, Compute Capability 7.5
2024-02-02 22:03:03.798336: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1706882583.915908  142587 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2024-02-02 22:03:04.077082: W external/local_tsl/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2024-02-02 22:03:06.653049: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:06.653111: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:06.661182: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:06.661247: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:10.373743: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.57GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:10.373805: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.57GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:10.401941: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.57GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:10.402007: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.57GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:13.737407: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.34GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-02-02 22:03:13.737466: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.34GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
57/57 [==============================] - 35s 167ms/step - loss: 1.5137 - accuracy: 0.6681 - val_loss: 11.6719 - val_accuracy: 0.2478
Epoch 2/10
57/57 [==============================] - 5s 95ms/step - loss: 0.4082 - accuracy: 0.8695 - val_loss: 18.7063 - val_accuracy: 0.3009
Epoch 3/10
57/57 [==============================] - 5s 92ms/step - loss: 0.1347 - accuracy: 0.9646 - val_loss: 2.3750 - val_accuracy: 0.6283
Epoch 4/10
57/57 [==============================] - 5s 86ms/step - loss: 0.0404 - accuracy: 0.9912 - val_loss: 0.5619 - val_accuracy: 0.8673
Epoch 5/10
57/57 [==============================] - 4s 79ms/step - loss: 0.1209 - accuracy: 0.9602 - val_loss: 2.2072 - val_accuracy: 0.6195
Epoch 6/10
57/57 [==============================] - 5s 80ms/step - loss: 0.2262 - accuracy: 0.9204 - val_loss: 6.5125 - val_accuracy: 0.3982
Epoch 7/10
57/57 [==============================] - 5s 80ms/step - loss: 0.2314 - accuracy: 0.9204 - val_loss: 22.8252 - val_accuracy: 0.2743
Epoch 8/10
57/57 [==============================] - 5s 83ms/step - loss: 0.1196 - accuracy: 0.9690 - val_loss: 6.4240 - val_accuracy: 0.4690
Epoch 9/10
57/57 [==============================] - 5s 83ms/step - loss: 0.1115 - accuracy: 0.9624 - val_loss: 5.8264 - val_accuracy: 0.6106
Epoch 10/10
57/57 [==============================] - 4s 78ms/step - loss: 0.1057 - accuracy: 0.9735 - val_loss: 2.5852 - val_accuracy: 0.6106

The PyTorch version takes considerably more code. The converted PyTorch code is as follows:

import pathlib
import warnings

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torchvision
from PIL import Image

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Font settings so that matplotlib can render Chinese text
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

warnings.filterwarnings('ignore')  # suppress warnings that do not need to be printed

data_dir = '/home/wangjh/CNN/训练营/J1ResNet-50算法实战与解析/第8天/bird_photos'
data_dir = pathlib.Path(data_dir)

# Preliminary step: inspect the data
image_count = len(list(data_dir.glob('*/*.jpg')))
print("Total number of images:", image_count)
image_list = list(data_dir.glob('Bananaquit/*.jpg'))
image = Image.open(str(image_list[1]))
# Inspect the attributes of one sample image
print(image.format, image.size, image.mode)
plt.imshow(image)
plt.axis("off")
plt.show()

import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

# Parameters
batch_size = 8
img_height = 224
img_width = 224

# Preprocessing: resize to a fixed size and convert to a tensor with values in [0, 1]
transform = transforms.Compose([
    transforms.Resize((img_height, img_width)),
    transforms.ToTensor(),
])

# Load the dataset; each subfolder name becomes a class label
dataset = datasets.ImageFolder(root=data_dir, transform=transform)

# Split the dataset into training and validation sets
val_size = int(len(dataset) * 0.2)  # 20% for validation
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=torch.Generator().manual_seed(123))

# Create the data loaders
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)

# Get the class names
class_names = dataset.classes
print(class_names)
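
The split arithmetic can be checked without touching the image files. The post does not state the dataset size; the value 565 below is an assumption, inferred from the logged accuracies being multiples of 1/452 (training) and 1/113 (validation).

```python
import math

num_images = 565  # assumption: inferred from the logged accuracies, not stated in the post
batch_size = 8

val_size = int(num_images * 0.2)     # same rule as the split above
train_size = num_images - val_size

# A DataLoader with the default drop_last=False yields a final partial batch
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size, train_batches, val_batches)
```

Under that assumption the training set has 57 batches per epoch, which agrees with the `57/57` steps in the TensorFlow log if that run used the same split and batch size.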

from torch.utils.data import DataLoader

# train_ds and val_ds were created above.
# Recreate the loaders with worker processes and pinned memory to speed up loading.
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size=8, shuffle=False, num_workers=4, pin_memory=True)
import matplotlib.pyplot as plt
import numpy as np
import torchvision

# train_loader and class_names are defined above

def imshow(inp, title=None):
    """Display a (C, H, W) image tensor with values in [0, 1]."""
    inp = inp.numpy().transpose((1, 2, 0))  # (C, H, W) -> (H, W, C)
    # The transform pipeline applies no Normalize, so there is no mean/std to undo
    inp = np.clip(inp, 0, 1)
    plt.imshow(inp)
    if title is not None:
        plt.title(title)
    plt.axis('off')  # hide the axes

# Fetch a single batch from the DataLoader
images, labels = next(iter(train_loader))

# Arrange the batch into one image grid
out = torchvision.utils.make_grid(images)

plt.figure(figsize=(10, 5))
imshow(out, title=", ".join(class_names[x] for x in labels))
plt.show()
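
Undoing a mean/std normalization in a display helper is only appropriate when the preprocessing pipeline applied one in the first place, for example through `transforms.Normalize`. The following is a minimal sketch of that round trip on a random tensor. The ImageNet statistics and the helper names `normalize` and `denormalize` are illustrative assumptions and are not part of the pipeline in this post.

```python
import torch

# ImageNet channel statistics, shaped to broadcast over a (C, H, W) image.
# Assumption for illustration: the pipeline in this post does not normalize.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    """Per-channel (x - mean) / std."""
    return (img - MEAN) / STD

def denormalize(img):
    """Invert normalize() and clamp to the displayable range [0, 1]."""
    return (img * STD + MEAN).clamp(0, 1)

img = torch.rand(3, 224, 224)           # stand-in for a ToTensor() output
restored = denormalize(normalize(img))  # round trip back to the original values
print(torch.allclose(restored, img, atol=1e-5))
```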

import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityBlock(nn.Module):
    def __init__(self, in_channels, filters, kernel_size, stage, block):
        super(IdentityBlock, self).__init__()
        filters1, filters2, filters3 = filters
        # stage and block mirror the Keras layer-naming arguments; they are not used here

        self.conv1 = nn.Conv2d(in_channels, filters1, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn1 = nn.BatchNorm2d(filters1)
        
        # padding = kernel_size // 2 preserves the spatial size for any odd kernel size
        self.conv2 = nn.Conv2d(filters1, filters2, kernel_size=kernel_size, stride=1, padding=kernel_size // 2, bias=False)
        self.bn2 = nn.BatchNorm2d(filters2)
        
        self.conv3 = nn.Conv2d(filters2, filters3, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn3 = nn.BatchNorm2d(filters3)
        
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += identity
        out = self.relu(out)

        return out

class ConvBlock(nn.Module):
    def __init__(self, in_channels, filters, kernel_size, stage, block, strides=2):
        super(ConvBlock, self).__init__()
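        filters1, filters2, filters3 = filters
        # stage and block mirror the Keras layer-naming arguments; they are not used here

        # The stride is applied in the first 1x1 convolution and in the shortcut
        self.conv1 = nn.Conv2d(in_channels, filters1, kernel_size=1, stride=strides, bias=False)
        self.bn1 = nn.BatchNorm2d(filters1)

        # padding = kernel_size // 2 preserves the spatial size for any odd kernel size
        self.conv2 = nn.Conv2d(filters1, filters2, kernel_size=kernel_size, stride=1, padding=kernel_size // 2, bias=False)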
        self.bn2 = nn.BatchNorm2d(filters2)

        self.conv3 = nn.Conv2d(filters2, filters3, kernel_size=1, stride=1, bias=False)
        self.bn3 = nn.BatchNorm2d(filters3)

        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, filters3, kernel_size=1, stride=strides, bias=False),
            nn.BatchNorm2d(filters3)
        )
        
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += identity
        out = self.relu(out)

        return out

class ResNet50(nn.Module):
    def __init__(self, input_shape=(3, 224, 224), num_classes=1000):
        super(ResNet50, self).__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(input_shape[0], self.in_channels, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(self.in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Define the resnet blocks
        self.layer1 = self._make_layer([64, 64, 256], blocks=3, stage=2, stride=1)
        self.layer2 = self._make_layer([128, 128, 512], blocks=4, stage=3, stride=2)
        self.layer3 = self._make_layer([256, 256, 1024], blocks=6, stage=4, stride=2)
        self.layer4 = self._make_layer([512, 512, 2048], blocks=3, stage=5, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(2048, num_classes)

    def _make_layer(self, filters, blocks, stage, stride):
        layers = []

        # First block is a ConvBlock with stride
        layers.append(ConvBlock(self.in_channels, filters, kernel_size=3, stage=stage, block='a', strides=stride))
        self.in_channels = filters[2]

        # Remaining blocks are IdentityBlocks
        for b in range(1, blocks):
            layers.append(IdentityBlock(self.in_channels, filters, kernel_size=3, stage=stage, block=chr(97+b)))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

# Example
model = ResNet50()
print(model)
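
To see how the feature map shrinks through the network, the spatial size after each stage can be traced with the standard output-size formula, floor((n + 2p - k) / s) + 1. This is a sketch under the layer settings used in the model above; the helper name `conv_out` is made up for the example.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

trace = {}
size = 224  # input height and width

size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
trace["conv1"] = size
size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
trace["maxpool"] = size

# In each stage only the first block changes the size: its 1x1 convolution
# carries the stride, and the 3x3 convolutions (padding 1) keep the size.
for name, stride in [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
    size = conv_out(size, kernel=1, stride=stride, padding=0)
    trace[name] = size

print(trace)
```

The final 7x7 map is what `AdaptiveAvgPool2d((1, 1))` averages down to a single 2048-dimensional vector before the fully connected layer.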

import torch
from torch import nn, optim

# train_loader, val_loader, and class_names are defined above

# Initial learning rate
initial_learning_rate = 1e-3

# Build the model with one output per bird class
model = ResNet50(num_classes=len(class_names))

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=initial_learning_rate)

# Loss function
criterion = nn.CrossEntropyLoss()

# Number of training epochs
epochs = 10


def train_one_epoch(model, train_loader):
    """Train for one epoch and return (mean batch loss, accuracy)."""
    model.train()  # training mode: BatchNorm uses batch statistics
    running_loss, correct, total = 0.0, 0, 0

    for data, target in train_loader:
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        correct += (output.argmax(1) == target).sum().item()
        total += target.size(0)

    # criterion returns a per-batch mean, so divide by the number of batches
    return running_loss / len(train_loader), correct / total


def validate(model, val_loader):
    """Evaluate on the validation set and return (mean batch loss, accuracy)."""
    model.eval()  # evaluation mode: BatchNorm uses running statistics
    running_loss, correct, total = 0.0, 0, 0

    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = criterion(output, target)

            running_loss += loss.item()
            correct += (output.argmax(1) == target).sum().item()
            total += target.size(0)

    return running_loss / len(val_loader), correct / total


# Lists that store the training and validation accuracy and loss per epoch
train_acc = []
val_acc = []
train_loss = []
val_loss = []

for epoch in range(epochs):
    # Training phase
    epoch_loss, epoch_acc = train_one_epoch(model, train_loader)
    train_loss.append(epoch_loss)
    train_acc.append(epoch_acc)

    # Validation phase
    epoch_loss, epoch_acc = validate(model, val_loader)
    val_loss.append(epoch_loss)
    val_acc.append(epoch_acc)

    print(f'Epoch {epoch+1}, Train Loss: {train_loss[-1]}, Train Acc: {train_acc[-1]}, Val Loss: {val_loss[-1]}, Val Acc: {val_acc[-1]}')
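
`nn.CrossEntropyLoss` returns the mean loss over a batch by default, so the epoch loss comes from dividing the accumulated value by the number of batches, as the training loop does with `len(train_loader)`. Dividing the same sum by the number of samples gives a value that is too small by roughly the batch size. The sketch below illustrates the difference with made-up numbers: the batch sizes and the constant loss of 1.0 are hypothetical.

```python
# Hypothetical epoch: 56 full batches of 8 samples and one final batch of 4
batch_sizes = [8] * 56 + [4]
batch_means = [1.0] * len(batch_sizes)  # pretend every batch has a mean loss of 1.0

running_loss = sum(batch_means)  # what `running_loss += loss.item()` accumulates
num_samples = sum(batch_sizes)

per_batch = running_loss / len(batch_means)  # average of the batch means
mixed_units = running_loss / num_samples     # sum of means divided by the sample count

# Exact per-sample mean: weight each batch mean by its batch size
weighted = sum(m * n for m, n in zip(batch_means, batch_sizes)) / num_samples

print(per_batch, mixed_units, weighted)
```

Averaging the batch means is slightly inexact when the last batch is smaller, whereas the size-weighted form gives the true per-sample mean.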

epochs_range = range(epochs)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')

plt.show()

model.eval()  # switch the model to evaluation mode

# Take one batch from the validation loader
images, labels = next(iter(val_loader))

# Show each image together with its predicted class
plt.figure(figsize=(10, 5))
plt.suptitle("Prediction results", fontsize=10)

for i in range(min(8, len(images))):
    ax = plt.subplot(2, 4, i + 1)
    img = images[i]

    # imshow expects the layout (H, W, C)
    plt.imshow(img.permute(1, 2, 0).numpy())

    # Predict: add a batch dimension and move the image to the model's device
    with torch.no_grad():
        predictions = model(img.unsqueeze(0).to(device))
    predicted_class = predictions.argmax(1).item()

    plt.title(class_names[predicted_class], fontsize=10)
    plt.axis("off")

plt.show()

The PyTorch code produced the following output:

Epoch 1, Train Loss: 1.9462667912767644, Train Acc: 0.46017699115044247, Val Loss: 4.998708836237589, Val Acc: 0.3185840707964602
Epoch 2, Train Loss: 1.0981759057756055, Train Acc: 0.5707964601769911, Val Loss: 1.3721581081549326, Val Acc: 0.4690265486725664
Epoch 3, Train Loss: 1.0875106004246495, Train Acc: 0.581858407079646, Val Loss: 1.2485054910182953, Val Acc: 0.6460176991150443
Epoch 4, Train Loss: 0.8475891780434993, Train Acc: 0.665929203539823, Val Loss: 1.0908969322840372, Val Acc: 0.6283185840707964
Epoch 5, Train Loss: 0.7583002661142433, Train Acc: 0.7256637168141593, Val Loss: 1.174149598677953, Val Acc: 0.7610619469026548
Epoch 6, Train Loss: 0.882950791141443, Train Acc: 0.6637168141592921, Val Loss: 1.8050361315409342, Val Acc: 0.4424778761061947
Epoch 7, Train Loss: 0.764416627455176, Train Acc: 0.7477876106194691, Val Loss: 0.6479228059450786, Val Acc: 0.7699115044247787
Epoch 8, Train Loss: 0.7154606750659775, Train Acc: 0.7588495575221239, Val Loss: 1.1928787171840667, Val Acc: 0.6460176991150443
Epoch 9, Train Loss: 0.7729809545634085, Train Acc: 0.75, Val Loss: 0.8592363484203815, Val Acc: 0.7787610619469026
Epoch 10, Train Loss: 0.7094438946560809, Train Acc: 0.7743362831858407, Val Loss: 0.6289080878098806, Val Acc: 0.8141592920353983

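Reflections:
Training was very unstable and the accuracy fluctuated a lot. I do not yet know why; I have only just started working with residual neural networks.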
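
The fluctuation can be put into numbers from the PyTorch log. The sketch below uses the validation accuracies copied from that log, rounded to four decimals, and reports the largest epoch-to-epoch drop and the best epoch. It only describes the logged run and does not explain the cause.

```python
# Validation accuracy per epoch, copied from the PyTorch log (rounded to 4 decimals)
val_acc = [0.3186, 0.4690, 0.6460, 0.6283, 0.7611,
           0.4425, 0.7699, 0.6460, 0.7788, 0.8142]

# Change in accuracy from each epoch to the next
deltas = [b - a for a, b in zip(val_acc, val_acc[1:])]

worst_drop = min(deltas)
worst_epoch = deltas.index(worst_drop) + 2  # deltas[0] is the change going into epoch 2
best_epoch = val_acc.index(max(val_acc)) + 1

print(f"largest drop: {worst_drop:.4f} going into epoch {worst_epoch}")
print(f"best validation accuracy: {max(val_acc):.4f} at epoch {best_epoch}")
```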
