13. Training with a GPU

13.1 GPU Training (Method 1)

① GPU training involves moving three kinds of objects onto the GPU: the network model, the data (inputs and labels), and the loss function.
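
A self-contained sketch of that pattern, using a stand-in linear model instead of the real network (illustrative names only; the full script below uses tudui, loss_fn, imgs and targets):

import torch
from torch import nn

model = nn.Linear(4, 2)              # stand-in for the network model
loss_fn = nn.CrossEntropyLoss()      # the loss function
imgs = torch.randn(8, 4)             # stand-in for a batch of inputs
targets = torch.randint(0, 2, (8,))  # stand-in for the labels

if torch.cuda.is_available():
    model = model.cuda()             # 1. move the model
    loss_fn = loss_fn.cuda()         # 2. move the loss function
    imgs = imgs.cuda()               # 3. move the data (inputs and labels)
    targets = targets.cuda()

loss = loss_fn(model(imgs), targets) # everything now runs on the same device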

import torchvision
import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

# from model import * would pull in everything from model.py; here the model is defined inline instead
class Tudui(nn.Module):
    def __init__(self):
        super(Tudui, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(3,32,5,1,2),  # in_channels 3, out_channels 32, kernel 5×5, stride 1, padding 2
            nn.MaxPool2d(2),
            nn.Conv2d(32,32,5,1,2),
            nn.MaxPool2d(2),
            nn.Conv2d(32,64,5,1,2),
            nn.MaxPool2d(2),
            nn.Flatten(),  # flattened size is 64*4*4
            nn.Linear(64*4*4,64),
            nn.Linear(64,10)
        )

    def forward(self, x):
        x = self.model1(x)
        return x

# Prepare the datasets
train_data = torchvision.datasets.CIFAR10("./dataset",train=True,transform=torchvision.transforms.ToTensor(),download=True)
test_data = torchvision.datasets.CIFAR10("./dataset",train=False,transform=torchvision.transforms.ToTensor(),download=True)

# dataset lengths
train_data_size = len(train_data)
test_data_size = len(test_data)
# e.g. if train_data_size = 10, the first line below prints a length of 10
print("训练数据集的长度:{}".format(train_data_size))
print("测试数据集的长度:{}".format(test_data_size))

# Load the datasets with DataLoader
train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

# Create the network model
tudui = Tudui()
if torch.cuda.is_available():
    tudui = tudui.cuda() # move the network model to CUDA

# Loss function
loss_fn = nn.CrossEntropyLoss() # cross-entropy; fn is short for function
if torch.cuda.is_available():
    loss_fn = loss_fn.cuda()        # move the loss function to CUDA

# Optimizer
learning = 0.01  # 1e-2 means 0.01
optimizer = torch.optim.SGD(tudui.parameters(),learning)   # stochastic gradient descent

# Bookkeeping
# count of training steps
total_train_step = 0
# count of test runs
total_test_step = 0

# number of training epochs
epoch = 10

# add TensorBoard logging
writer = SummaryWriter("logs")

for i in range(epoch):
    print("-----第 {} 轮训练开始-----".format(i+1))

    # Training phase
    tudui.train() # puts layers such as Dropout and BatchNorm into training mode
    for data in train_dataloader:
        imgs, targets = data
        if torch.cuda.is_available():
            imgs = imgs.cuda()  # move the input data to CUDA
            targets = targets.cuda() # move the labels to CUDA
        outputs = tudui(imgs)
        loss = loss_fn(outputs, targets) # gap between actual output and target output

        # Optimization step
        optimizer.zero_grad()  # clear the gradients
        loss.backward() # backpropagate to compute gradients of the loss
        optimizer.step()   # update the network parameters from the gradients

        total_train_step = total_train_step + 1
        if total_train_step % 100 == 0:
            print("训练次数:{},Loss:{}".format(total_train_step,loss.item()))  # loss.item() converts the one-element tensor to a Python number
            writer.add_scalar("train_loss",loss.item(),total_train_step)

    # Evaluation phase (after each epoch, check the loss on the test set)
    tudui.eval()  # puts layers such as Dropout and BatchNorm into evaluation mode
    total_test_loss = 0
    total_accuracy = 0
    with torch.no_grad():  # no gradients are tracked during evaluation
        for data in test_dataloader: # iterate over the test set
            imgs, targets = data
            if torch.cuda.is_available():
                imgs = imgs.cuda() # move the input data to CUDA
                targets = targets.cuda()
            outputs = tudui(imgs)
            loss = loss_fn(outputs, targets) # loss for this batch only
            total_test_loss = total_test_loss + loss.item() # accumulate the total test loss
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy

    print("整体测试集上的Loss:{}".format(total_test_loss))
    print("整体测试集上的正确率:{}".format(total_accuracy/test_data_size))
    writer.add_scalar("test_loss",total_test_loss,total_test_step)
    writer.add_scalar("test_accuracy",total_accuracy/test_data_size,total_test_step)
    total_test_step = total_test_step + 1

    torch.save(tudui, "./model/tudui_{}.pth".format(i)) # save the model after every epoch (the ./model directory must already exist)
    # torch.save(tudui.state_dict(), "tudui_{}.pth".format(i)) # save method 2: weights only
    print("模型已保存")

writer.close()
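
Because torch.save(tudui, ...) pickles the whole module, loading the file later needs the Tudui class definition in scope. A brief sketch of both loading styles (paths assumed from the save calls above):

import torch

# Load method 1: the file holds the whole module, so the Tudui class
# definition must be importable/defined before calling torch.load.
model = torch.load("./model/tudui_9.pth")

# Load method 2 (matching the commented-out state_dict save):
# model = Tudui()
# model.load_state_dict(torch.load("tudui_9.pth"))

model.eval()  # switch to evaluation mode before inference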

Output:

Files already downloaded and verified
Files already downloaded and verified
训练数据集的长度:50000
测试数据集的长度:10000
-----第 1 轮训练开始-----
训练次数:100,Loss:2.289992094039917
训练次数:200,Loss:2.2927844524383545
训练次数:300,Loss:2.2730984687805176
训练次数:400,Loss:2.2006278038024902
训练次数:500,Loss:2.1675028800964355
训练次数:600,Loss:2.116072416305542
训练次数:700,Loss:2.04477596282959
整体测试集上的Loss:317.0560564994812
整体测试集上的正确率:0.28700000047683716
模型已保存
-----第 2 轮训练开始-----
训练次数:800,Loss:1.893830418586731
训练次数:900,Loss:1.8772207498550415
训练次数:1000,Loss:1.9800275564193726
训练次数:1100,Loss:2.007078170776367
训练次数:1200,Loss:1.7352533340454102
训练次数:1300,Loss:1.6947956085205078
训练次数:1400,Loss:1.756855845451355
训练次数:1500,Loss:1.8372352123260498
整体测试集上的Loss:299.94190883636475
整体测试集上的正确率:0.31619998812675476
模型已保存
-----第 3 轮训练开始-----
训练次数:1600,Loss:1.7673416137695312
训练次数:1700,Loss:1.6654351949691772
训练次数:1800,Loss:1.9246405363082886
训练次数:1900,Loss:1.7132933139801025
训练次数:2000,Loss:1.93990159034729
训练次数:2100,Loss:1.4903961420059204
训练次数:2200,Loss:1.4754142761230469
训练次数:2300,Loss:1.7652970552444458
整体测试集上的Loss:272.9526561498642
整体测试集上的正确率:0.37139999866485596
模型已保存
-----第 4 轮训练开始-----
训练次数:2400,Loss:1.7254819869995117
训练次数:2500,Loss:1.3386430740356445
训练次数:2600,Loss:1.5852587223052979
训练次数:2700,Loss:1.648303508758545
训练次数:2800,Loss:1.4971883296966553
训练次数:2900,Loss:1.5891362428665161
训练次数:3000,Loss:1.3380193710327148
训练次数:3100,Loss:1.542701005935669
整体测试集上的Loss:278.19843327999115
整体测试集上的正确率:0.36139997839927673
模型已保存
-----第 5 轮训练开始-----
训练次数:3200,Loss:1.3419318199157715
训练次数:3300,Loss:1.468044400215149
训练次数:3400,Loss:1.484485149383545
训练次数:3500,Loss:1.54210364818573
训练次数:3600,Loss:1.5797978639602661
训练次数:3700,Loss:1.3390973806381226
训练次数:3800,Loss:1.3077597618103027
训练次数:3900,Loss:1.4766919612884521
整体测试集上的Loss:269.36583971977234
整体测试集上的正确率:0.3871999979019165
模型已保存
-----第 6 轮训练开始-----
训练次数:4000,Loss:1.439847469329834
训练次数:4100,Loss:1.436941146850586
训练次数:4200,Loss:1.5766061544418335
训练次数:4300,Loss:1.249019742012024
训练次数:4400,Loss:1.164270281791687
训练次数:4500,Loss:1.4175126552581787
训练次数:4600,Loss:1.4056789875030518
整体测试集上的Loss:252.13275730609894
整体测试集上的正确率:0.4244000017642975
模型已保存
-----第 7 轮训练开始-----
训练次数:4700,Loss:1.3679763078689575
训练次数:4800,Loss:1.526027798652649
训练次数:4900,Loss:1.3590809106826782
训练次数:5000,Loss:1.4296003580093384
训练次数:5100,Loss:0.9916519522666931
训练次数:5200,Loss:1.3147145509719849
训练次数:5300,Loss:1.2122020721435547
训练次数:5400,Loss:1.3860883712768555
整体测试集上的Loss:235.14292180538177
整体测试集上的正确率:0.46209999918937683
模型已保存
-----第 8 轮训练开始-----
训练次数:5500,Loss:1.2311736345291138
训练次数:5600,Loss:1.2175472974777222
训练次数:5700,Loss:1.2189043760299683
训练次数:5800,Loss:1.2750414609909058
训练次数:5900,Loss:1.3556095361709595
训练次数:6000,Loss:1.5370352268218994
训练次数:6100,Loss:1.025504231452942
训练次数:6200,Loss:1.0661875009536743
整体测试集上的Loss:222.47956597805023
整体测试集上的正确率:0.4927999973297119
模型已保存
-----第 9 轮训练开始-----
训练次数:6300,Loss:1.4051152467727661
训练次数:6400,Loss:1.1392022371292114
训练次数:6500,Loss:1.6226587295532227
训练次数:6600,Loss:1.0815491676330566
训练次数:6700,Loss:1.048026442527771
训练次数:6800,Loss:1.1510660648345947
训练次数:6900,Loss:1.1476961374282837
训练次数:7000,Loss:0.9481611847877502
整体测试集上的Loss:212.00453734397888
整体测试集上的正确率:0.5181999802589417
模型已保存
-----第 10 轮训练开始-----
训练次数:7100,Loss:1.2802095413208008
训练次数:7200,Loss:0.9643581509590149
训练次数:7300,Loss:1.098695993423462
训练次数:7400,Loss:0.8831453323364258
训练次数:7500,Loss:1.19520902633667
训练次数:7600,Loss:1.2724679708480835
训练次数:7700,Loss:0.8894400000572205
训练次数:7800,Loss:1.205102801322937
整体测试集上的Loss:202.72463756799698
整体测试集上的正确率:0.54339998960495
模型已保存
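
A detail the script above depends on: Module.cuda() moves a module's parameters in place (so reassigning tudui is optional), while Tensor.cuda() returns a new tensor that must be assigned back. A small demonstration:

import torch
from torch import nn

layer = nn.Linear(4, 2)
t = torch.ones(4)
if torch.cuda.is_available():
    layer.cuda()         # in place: the layer's parameters are now on the GPU
    t.cuda()             # does nothing useful: the returned GPU copy is discarded
    t = t.cuda()         # correct: keep the returned GPU tensor
    print(next(layer.parameters()).device, t.device)  # cuda:0 cuda:0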

13.2 GPU Training Time

① The same training script as in 13.1, with time.time() calls added to measure how long GPU training takes.

import torchvision
import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import time

# from model import * would pull in everything from model.py; here the model is defined inline instead
class Tudui(nn.Module):
    def __init__(self):
        super(Tudui, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(3,32,5,1,2),  # in_channels 3, out_channels 32, kernel 5×5, stride 1, padding 2
            nn.MaxPool2d(2),
            nn.Conv2d(32,32,5,1,2),
            nn.MaxPool2d(2),
            nn.Conv2d(32,64,5,1,2),
            nn.MaxPool2d(2),
            nn.Flatten(),  # flattened size is 64*4*4
            nn.Linear(64*4*4,64),
            nn.Linear(64,10)
        )

    def forward(self, x):
        x = self.model1(x)
        return x

# Prepare the datasets
train_data = torchvision.datasets.CIFAR10("./dataset",train=True,transform=torchvision.transforms.ToTensor(),download=True)
test_data = torchvision.datasets.CIFAR10("./dataset",train=False,transform=torchvision.transforms.ToTensor(),download=True)

# dataset lengths
train_data_size = len(train_data)
test_data_size = len(test_data)
# e.g. if train_data_size = 10, the first line below prints a length of 10
print("训练数据集的长度:{}".format(train_data_size))
print("测试数据集的长度:{}".format(test_data_size))

# Load the datasets with DataLoader
train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

# Create the network model
tudui = Tudui()
if torch.cuda.is_available():
    tudui = tudui.cuda() # move the network model to CUDA

# Loss function
loss_fn = nn.CrossEntropyLoss() # cross-entropy; fn is short for function
if torch.cuda.is_available():
    loss_fn = loss_fn.cuda()        # move the loss function to CUDA

# Optimizer
learning = 0.01  # 1e-2 means 0.01
optimizer = torch.optim.SGD(tudui.parameters(),learning)   # stochastic gradient descent

# Bookkeeping
# count of training steps
total_train_step = 0
# count of test runs
total_test_step = 0

# number of training epochs
epoch = 10

# add TensorBoard logging
writer = SummaryWriter("logs")
start_time = time.time()  # record the starting wall-clock time

for i in range(epoch):
    print("-----第 {} 轮训练开始-----".format(i+1))

    # Training phase
    tudui.train() # puts layers such as Dropout and BatchNorm into training mode
    for data in train_dataloader:
        imgs, targets = data
        if torch.cuda.is_available():
            imgs = imgs.cuda()  # move the input data to CUDA
            targets = targets.cuda() # move the labels to CUDA
        outputs = tudui(imgs)
        loss = loss_fn(outputs, targets) # gap between actual output and target output

        # Optimization step
        optimizer.zero_grad()  # clear the gradients
        loss.backward() # backpropagate to compute gradients of the loss
        optimizer.step()   # update the network parameters from the gradients

        total_train_step = total_train_step + 1
        if total_train_step % 100 == 0:
            end_time = time.time()
            print(end_time - start_time) # elapsed time since training started, printed every 100 steps
            print("训练次数:{},Loss:{}".format(total_train_step,loss.item()))  # loss.item() converts the one-element tensor to a Python number
            writer.add_scalar("train_loss",loss.item(),total_train_step)

    # Evaluation phase (after each epoch, check the loss on the test set)
    tudui.eval()  # puts layers such as Dropout and BatchNorm into evaluation mode
    total_test_loss = 0
    total_accuracy = 0
    with torch.no_grad():  # no gradients are tracked during evaluation
        for data in test_dataloader: # iterate over the test set
            imgs, targets = data
            if torch.cuda.is_available():
                imgs = imgs.cuda() # move the input data to CUDA
                targets = targets.cuda()
            outputs = tudui(imgs)
            loss = loss_fn(outputs, targets) # loss for this batch only
            total_test_loss = total_test_loss + loss.item() # accumulate the total test loss
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy

    print("整体测试集上的Loss:{}".format(total_test_loss))
    print("整体测试集上的正确率:{}".format(total_accuracy/test_data_size))
    writer.add_scalar("test_loss",total_test_loss,total_test_step)
    writer.add_scalar("test_accuracy",total_accuracy/test_data_size,total_test_step)
    total_test_step = total_test_step + 1

    torch.save(tudui, "./model/tudui_{}.pth".format(i)) # save the model after every epoch (the ./model directory must already exist)
    # torch.save(tudui.state_dict(), "tudui_{}.pth".format(i)) # save method 2: weights only
    print("模型已保存")

writer.close()
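
One caveat about these time.time() readings: CUDA kernels are launched asynchronously, so the CPU clock can be read before the GPU has actually finished its work. For stricter GPU timings, call torch.cuda.synchronize() before reading the clock; a hedged sketch:

import time
import torch

if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for all queued GPU work before starting the clock
start_time = time.time()
# ... run some training steps here ...
if torch.cuda.is_available():
    torch.cuda.synchronize()  # make sure the GPU work has really finished
print(time.time() - start_time)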

Output:

Files already downloaded and verified
Files already downloaded and verified
训练数据集的长度:50000
测试数据集的长度:10000
-----第 1 轮训练开始-----
1.0935008525848389
训练次数:100,Loss:2.2871038913726807
2.1766483783721924
训练次数:200,Loss:2.2836720943450928
3.27374267578125
训练次数:300,Loss:2.259164333343506
4.42803692817688
训练次数:400,Loss:2.170818328857422
5.506956577301025
训练次数:500,Loss:2.1002814769744873
6.58754301071167
训练次数:600,Loss:2.0413668155670166
7.650376319885254
训练次数:700,Loss:2.0200154781341553
整体测试集上的Loss:316.68364894390106
整体测试集上的正确率:0.2789999842643738
模型已保存
-----第 2 轮训练开始-----
10.175889730453491
训练次数:800,Loss:1.8918509483337402
11.24414849281311
训练次数:900,Loss:1.8798954486846924
12.356922149658203
训练次数:1000,Loss:1.970682978630066
13.43547511100769
训练次数:1100,Loss:2.0064470767974854
14.509244680404663
训练次数:1200,Loss:1.7197221517562866
15.598143815994263
训练次数:1300,Loss:1.6999645233154297
16.67508888244629
训练次数:1400,Loss:1.7595139741897583
17.747746229171753
训练次数:1500,Loss:1.849331259727478
整体测试集上的Loss:304.3353645801544
整体测试集上的正确率:0.31610000133514404
模型已保存
-----第 3 轮训练开始-----
20.33113145828247
训练次数:1600,Loss:1.7673357725143433
21.411443948745728
训练次数:1700,Loss:1.6436196565628052
22.475884914398193
训练次数:1800,Loss:1.9101005792617798
23.543425798416138
训练次数:1900,Loss:1.7177188396453857
24.60761523246765
训练次数:2000,Loss:1.9782830476760864
25.691354751586914
训练次数:2100,Loss:1.523171067237854
26.782272815704346
训练次数:2200,Loss:1.4762014150619507
27.82503628730774
训练次数:2300,Loss:1.7781658172607422
整体测试集上的Loss:272.44360399246216
整体测试集上的正确率:0.37199997901916504
模型已保存
-----第 4 轮训练开始-----
30.293652772903442
训练次数:2400,Loss:1.7340704202651978
31.373929500579834
训练次数:2500,Loss:1.3520257472991943
32.44764447212219
训练次数:2600,Loss:1.574364423751831
33.513572454452515
训练次数:2700,Loss:1.6468950510025024
34.61698246002197
训练次数:2800,Loss:1.4663115739822388
35.69143986701965
训练次数:2900,Loss:1.6123905181884766
36.75266122817993
训练次数:3000,Loss:1.3316911458969116
37.8302538394928
训练次数:3100,Loss:1.5095850229263306
整体测试集上的Loss:264.94398534297943
整体测试集上的正确率:0.3986999988555908
模型已保存
-----第 5 轮训练开始-----
40.43262219429016
训练次数:3200,Loss:1.3727346658706665
41.48542404174805
训练次数:3300,Loss:1.443982481956482
42.52226686477661
训练次数:3400,Loss:1.5196319818496704
43.57080316543579
训练次数:3500,Loss:1.5449475049972534
44.60450720787048
训练次数:3600,Loss:1.568708062171936
45.64966917037964
训练次数:3700,Loss:1.3194901943206787
46.709717750549316
训练次数:3800,Loss:1.2732317447662354
47.74911880493164
训练次数:3900,Loss:1.415683388710022
整体测试集上的Loss:253.18030643463135
整体测试集上的正确率:0.42249998450279236
模型已保存
-----第 6 轮训练开始-----
50.21744728088379
训练次数:4000,Loss:1.3912277221679688
51.265125036239624
训练次数:4100,Loss:1.410901665687561
52.28390049934387
训练次数:4200,Loss:1.521787405014038
53.33956241607666
训练次数:4300,Loss:1.2260788679122925
54.391708850860596
训练次数:4400,Loss:1.1339644193649292
55.45666837692261
训练次数:4500,Loss:1.3752398490905762
56.52565860748291
训练次数:4600,Loss:1.4126766920089722
整体测试集上的Loss:236.17250859737396
整体测试集上的正确率:0.45719999074935913
模型已保存
-----第 7 轮训练开始-----
58.975016832351685
训练次数:4700,Loss:1.327752947807312
60.01860165596008
训练次数:4800,Loss:1.5265493392944336
61.06228733062744
训练次数:4900,Loss:1.382441520690918
62.13616943359375
训练次数:5000,Loss:1.4380030632019043
63.18708825111389
训练次数:5100,Loss:1.0084904432296753
64.28091526031494
训练次数:5200,Loss:1.312524437904358
65.38735771179199
训练次数:5300,Loss:1.1935137510299683
66.48723554611206
训练次数:5400,Loss:1.3607358932495117
整体测试集上的Loss:223.93975222110748
整体测试集上的正确率:0.4819999933242798
模型已保存
-----第 8 轮训练开始-----
69.10827493667603
训练次数:5500,Loss:1.1847436428070068
70.24965786933899
训练次数:5600,Loss:1.2199389934539795
71.35597825050354
训练次数:5700,Loss:1.2233123779296875
72.44557046890259
训练次数:5800,Loss:1.2635695934295654
73.50763511657715
训练次数:5900,Loss:1.3924380540847778
74.50808930397034
训练次数:6000,Loss:1.5825486183166504
75.47654867172241
训练次数:6100,Loss:1.035813570022583
76.46864604949951
训练次数:6200,Loss:1.1380523443222046
整体测试集上的Loss:212.88353633880615
整体测试集上的正确率:0.513700008392334
模型已保存
-----第 9 轮训练开始-----
78.92930197715759
训练次数:6300,Loss:1.4175732135772705
80.02452445030212
训练次数:6400,Loss:1.1150474548339844
81.0898060798645
训练次数:6500,Loss:1.5558857917785645
82.11342310905457
训练次数:6600,Loss:1.095849633216858
83.19743394851685
训练次数:6700,Loss:1.061813235282898
84.28776097297668
训练次数:6800,Loss:1.160451054573059
85.28279232978821
训练次数:6900,Loss:1.1402560472488403
86.31971287727356
训练次数:7000,Loss:0.9515166282653809
整体测试集上的Loss:203.51595824956894
整体测试集上的正确率:0.5372999906539917
模型已保存
-----第 10 轮训练开始-----
88.85867476463318
训练次数:7100,Loss:1.2563235759735107
89.92428064346313
训练次数:7200,Loss:1.028809905052185
90.95707082748413
训练次数:7300,Loss:1.08479642868042
91.98656606674194
训练次数:7400,Loss:0.8235641717910767
92.97793579101562
训练次数:7500,Loss:1.2311100959777832
93.9680666923523
训练次数:7600,Loss:1.2486273050308228
94.95079374313354
训练次数:7700,Loss:0.9207454919815063
95.94353938102722
训练次数:7800,Loss:1.2435222864151
整体测试集上的Loss:194.90294301509857
整体测试集上的正确率:0.557200014591217
模型已保存

13.3 CPU Training Time

① The same script with all the CUDA transfers removed, for comparing training speed on the CPU.

import torchvision
import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import time

# from model import * would pull in everything from model.py; here the model is defined inline instead
class Tudui(nn.Module):
    def __init__(self):
        super(Tudui, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(3,32,5,1,2),  # in_channels 3, out_channels 32, kernel 5×5, stride 1, padding 2
            nn.MaxPool2d(2),
            nn.Conv2d(32,32,5,1,2),
            nn.MaxPool2d(2),
            nn.Conv2d(32,64,5,1,2),
            nn.MaxPool2d(2),
            nn.Flatten(),  # flattened size is 64*4*4
            nn.Linear(64*4*4,64),
            nn.Linear(64,10)
        )

    def forward(self, x):
        x = self.model1(x)
        return x

# Prepare the datasets
train_data = torchvision.datasets.CIFAR10("./dataset",train=True,transform=torchvision.transforms.ToTensor(),download=True)
test_data = torchvision.datasets.CIFAR10("./dataset",train=False,transform=torchvision.transforms.ToTensor(),download=True)

# dataset lengths
train_data_size = len(train_data)
test_data_size = len(test_data)
# e.g. if train_data_size = 10, the first line below prints a length of 10
print("训练数据集的长度:{}".format(train_data_size))
print("测试数据集的长度:{}".format(test_data_size))

# Load the datasets with DataLoader
train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

# Create the network model
tudui = Tudui()

# Loss function
loss_fn = nn.CrossEntropyLoss() # cross-entropy; fn is short for function

# Optimizer
learning = 0.01  # 1e-2 means 0.01
optimizer = torch.optim.SGD(tudui.parameters(),learning)   # stochastic gradient descent

# Bookkeeping
# count of training steps
total_train_step = 0
# count of test runs
total_test_step = 0

# number of training epochs
epoch = 10

# add TensorBoard logging
writer = SummaryWriter("logs")
start_time = time.time()  # record the starting wall-clock time

for i in range(epoch):
    print("-----第 {} 轮训练开始-----".format(i+1))

    # Training phase
    tudui.train() # puts layers such as Dropout and BatchNorm into training mode
    for data in train_dataloader:
        imgs, targets = data
        outputs = tudui(imgs)
        loss = loss_fn(outputs, targets) # gap between actual output and target output

        # Optimization step
        optimizer.zero_grad()  # clear the gradients
        loss.backward() # backpropagate to compute gradients of the loss
        optimizer.step()   # update the network parameters from the gradients

        total_train_step = total_train_step + 1
        if total_train_step % 100 == 0:
            end_time = time.time()
            print(end_time - start_time) # elapsed time since training started, printed every 100 steps
            print("训练次数:{},Loss:{}".format(total_train_step,loss.item()))  # loss.item() converts the one-element tensor to a Python number
            writer.add_scalar("train_loss",loss.item(),total_train_step)

    # Evaluation phase (after each epoch, check the loss on the test set)
    tudui.eval()  # puts layers such as Dropout and BatchNorm into evaluation mode
    total_test_loss = 0
    total_accuracy = 0
    with torch.no_grad():  # no gradients are tracked during evaluation
        for data in test_dataloader: # iterate over the test set
            imgs, targets = data
            outputs = tudui(imgs)
            loss = loss_fn(outputs, targets) # loss for this batch only
            total_test_loss = total_test_loss + loss.item() # accumulate the total test loss
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy

    print("整体测试集上的Loss:{}".format(total_test_loss))
    print("整体测试集上的正确率:{}".format(total_accuracy/test_data_size))
    writer.add_scalar("test_loss",total_test_loss,total_test_step)
    writer.add_scalar("test_accuracy",total_accuracy/test_data_size,total_test_step)
    total_test_step = total_test_step + 1

    torch.save(tudui, "./model/tudui_{}.pth".format(i)) # save the model after every epoch (the ./model directory must already exist)
    # torch.save(tudui.state_dict(), "tudui_{}.pth".format(i)) # save method 2: weights only
    print("模型已保存")

writer.close()

Output:

Files already downloaded and verified
Files already downloaded and verified
训练数据集的长度:50000
测试数据集的长度:10000
-----第 1 轮训练开始-----
3.761235237121582
训练次数:100,Loss:2.291699171066284
7.478676080703735
训练次数:200,Loss:2.2810616493225098
11.149278163909912
训练次数:300,Loss:2.2673659324645996
14.876582384109497
训练次数:400,Loss:2.210559606552124
18.794732332229614
训练次数:500,Loss:2.074248790740967
22.666887521743774
训练次数:600,Loss:2.029463052749634
26.518835306167603
训练次数:700,Loss:2.025493860244751
整体测试集上的Loss:315.7099049091339
整体测试集上的正确率:0.2777999937534332
模型已保存
-----第 2 轮训练开始-----
33.49093294143677
训练次数:800,Loss:1.8920475244522095
37.37390112876892
训练次数:900,Loss:1.8434715270996094
41.431575775146484
训练次数:1000,Loss:1.9236050844192505
45.389270067214966
训练次数:1100,Loss:2.011040687561035
49.43605923652649
训练次数:1200,Loss:1.6993070840835571
53.62735366821289
训练次数:1300,Loss:1.6654363870620728
58.2660493850708
训练次数:1400,Loss:1.753265142440796
62.52872014045715
训练次数:1500,Loss:1.813820481300354
整体测试集上的Loss:304.07691729068756
整体测试集上的正确率:0.3098999857902527
模型已保存
-----第 3 轮训练开始-----
70.04687976837158
训练次数:1600,Loss:1.7496393918991089
74.19148874282837
训练次数:1700,Loss:1.6370826959609985
78.51184940338135
训练次数:1800,Loss:1.8948217630386353
83.03685450553894
训练次数:1900,Loss:1.7091740369796753
87.36472058296204
训练次数:2000,Loss:1.9168915748596191
91.5152907371521
训练次数:2100,Loss:1.5194813013076782
95.88392543792725
训练次数:2200,Loss:1.4738638401031494
100.08612132072449
训练次数:2300,Loss:1.7649239301681519
整体测试集上的Loss:266.925869345665
整体测试集上的正确率:0.38499999046325684
模型已保存
-----第 4 轮训练开始-----
107.81971716880798
训练次数:2400,Loss:1.7411062717437744
111.95616102218628
训练次数:2500,Loss:1.3490957021713257
116.07963228225708
训练次数:2600,Loss:1.577816367149353
120.41316413879395
训练次数:2700,Loss:1.6967650651931763
124.64287948608398
训练次数:2800,Loss:1.4929475784301758
128.7123486995697
训练次数:2900,Loss:1.6131006479263306
132.94610214233398
训练次数:3000,Loss:1.347227931022644
137.22871589660645
训练次数:3100,Loss:1.4926567077636719
整体测试集上的Loss:260.8921568393707
整体测试集上的正确率:0.40639999508857727
模型已保存
-----第 5 轮训练开始-----
145.22107672691345
训练次数:3200,Loss:1.3609188795089722
149.55124926567078
训练次数:3300,Loss:1.459675669670105
153.86187386512756
训练次数:3400,Loss:1.4940723180770874
158.21399784088135
训练次数:3500,Loss:1.5735642910003662
162.51304960250854
训练次数:3600,Loss:1.6013926267623901
166.73556113243103
训练次数:3700,Loss:1.3678141832351685
170.68037581443787
训练次数:3800,Loss:1.2831741571426392
174.55300641059875
训练次数:3900,Loss:1.4196735620498657
整体测试集上的Loss:258.5555330514908
整体测试集上的正确率:0.4147000014781952
模型已保存
-----第 6 轮训练开始-----
181.89517664909363
训练次数:4000,Loss:1.394544243812561
185.81528973579407
训练次数:4100,Loss:1.4785242080688477
189.75436854362488
训练次数:4200,Loss:1.504089593887329
193.7331829071045
训练次数:4300,Loss:1.1989901065826416
197.86846470832825
训练次数:4400,Loss:1.169187068939209
202.10944604873657
训练次数:4500,Loss:1.3368093967437744
206.46737694740295
训练次数:4600,Loss:1.4030650854110718
整体测试集上的Loss:248.35702466964722
整体测试集上的正确率:0.43479999899864197
模型已保存
-----第 7 轮训练开始-----
213.79702472686768
训练次数:4700,Loss:1.2863177061080933
217.70893836021423
训练次数:4800,Loss:1.5342319011688232
221.56816983222961
训练次数:4900,Loss:1.412546157836914
225.4557182788849
训练次数:5000,Loss:1.435633897781372
229.29314064979553
训练次数:5100,Loss:1.050623893737793
233.22323894500732
训练次数:5200,Loss:1.327545166015625
237.1871302127838
训练次数:5300,Loss:1.2706438302993774
241.23810291290283
训练次数:5400,Loss:1.3970144987106323
整体测试集上的Loss:238.9216102361679
整体测试集上的正确率:0.4553000032901764
模型已保存
-----第 8 轮训练开始-----
248.59216332435608
训练次数:5500,Loss:1.1989145278930664
252.57087922096252
训练次数:5600,Loss:1.2739124298095703
256.6464595794678
训练次数:5700,Loss:1.2550328969955444
260.85662841796875
训练次数:5800,Loss:1.2594654560089111
264.96409726142883
训练次数:5900,Loss:1.352506399154663
269.10122084617615
训练次数:6000,Loss:1.5692474842071533
273.26241970062256
训练次数:6100,Loss:1.051681399345398
277.37177181243896
训练次数:6200,Loss:1.1093714237213135
整体测试集上的Loss:229.03875291347504
整体测试集上的正确率:0.48089998960494995
模型已保存
-----第 9 轮训练开始-----
285.03535556793213
训练次数:6300,Loss:1.438887119293213
289.1406488418579
训练次数:6400,Loss:1.1292884349822998
293.3350794315338
训练次数:6500,Loss:1.5554381608963013
297.4605076313019
训练次数:6600,Loss:1.12319815158844
301.41761565208435
训练次数:6700,Loss:1.0609500408172607
305.4384708404541
训练次数:6800,Loss:1.1414461135864258
309.32322096824646
训练次数:6900,Loss:1.0653573274612427
313.22136521339417
训练次数:7000,Loss:0.9645416140556335
整体测试集上的Loss:217.3968950510025
整体测试集上的正确率:0.508400022983551
模型已保存
-----第 10 轮训练开始-----
320.61516642570496
训练次数:7100,Loss:1.252223253250122
324.5729761123657
训练次数:7200,Loss:1.0116769075393677
328.631311416626
训练次数:7300,Loss:1.1434015035629272
332.65182423591614
训练次数:7400,Loss:0.8558588624000549
336.61728739738464
训练次数:7500,Loss:1.2400795221328735
340.65006160736084
训练次数:7600,Loss:1.3492536544799805
344.64593052864075
训练次数:7700,Loss:0.9260987043380737
348.731153011322
训练次数:7800,Loss:1.3142049312591553
整体测试集上的Loss:208.29399240016937
整体测试集上的正确率:0.5317999720573425
模型已保存
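
Comparing the two logs: reaching training step 7800 took about 349 s on the CPU versus about 96 s on the GPU, roughly a 3.6× speedup for this small network. A hedged micro-benchmark sketch that isolates the per-step cost on dummy data (it assumes the Tudui class defined above):

import time
import torch
from torch import nn

def time_steps(device, n=100):
    # Time n forward/backward/update steps of the Tudui model on one device.
    model = Tudui().to(device)                        # Tudui as defined above
    loss_fn = nn.CrossEntropyLoss().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    imgs = torch.randn(64, 3, 32, 32, device=device)  # dummy CIFAR10-sized batch
    targets = torch.randint(0, 10, (64,), device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()                      # start from an idle GPU
    start = time.time()
    for _ in range(n):
        optimizer.zero_grad()
        loss = loss_fn(model(imgs), targets)
        loss.backward()
        optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()                      # wait for the GPU to finish
    return time.time() - start

print("cpu :", time_steps(torch.device("cpu")))
if torch.cuda.is_available():
    print("cuda:", time_steps(torch.device("cuda")))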

13.4 GPU Training (Method 2)

① When the machine has two graphics cards, a specific one can be selected with cuda:0 or cuda:1. The .to(device) style below also makes it easy to switch between CPU and GPU in one place.

import torchvision
import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import time

# Define the training device
# device = torch.device("cpu")
# device = torch.device("cuda")   # use the GPU, option 1
# device = torch.device("cuda:0") # use the GPU, option 2 (explicit device index)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# from model import * would pull in everything from model.py; here the model is defined inline instead
class Tudui(nn.Module):
    def __init__(self):
        super(Tudui, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(3,32,5,1,2),  # in_channels 3, out_channels 32, kernel 5×5, stride 1, padding 2
            nn.MaxPool2d(2),
            nn.Conv2d(32,32,5,1,2),
            nn.MaxPool2d(2),
            nn.Conv2d(32,64,5,1,2),
            nn.MaxPool2d(2),
            nn.Flatten(),  # flattened size is 64*4*4
            nn.Linear(64*4*4,64),
            nn.Linear(64,10)
        )

    def forward(self, x):
        x = self.model1(x)
        return x

# Prepare the datasets
train_data = torchvision.datasets.CIFAR10("./dataset",train=True,transform=torchvision.transforms.ToTensor(),download=True)
test_data = torchvision.datasets.CIFAR10("./dataset",train=False,transform=torchvision.transforms.ToTensor(),download=True)

# dataset lengths
train_data_size = len(train_data)
test_data_size = len(test_data)
# e.g. if train_data_size = 10, the first line below prints a length of 10
print("训练数据集的长度:{}".format(train_data_size))
print("测试数据集的长度:{}".format(test_data_size))

# Load the datasets with DataLoader
train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

# Create the network model
tudui = Tudui()
tudui = tudui.to(device) # for an nn.Module, plain tudui.to(device) also works (parameters are moved in place)

# Loss function
loss_fn = nn.CrossEntropyLoss() # cross-entropy; fn is short for function
loss_fn = loss_fn.to(device) # likewise, loss_fn.to(device) alone would work here

# Optimizer
learning = 0.01  # 1e-2 means 0.01
optimizer = torch.optim.SGD(tudui.parameters(),learning)   # stochastic gradient descent

# Bookkeeping
# count of training steps
total_train_step = 0
# count of test runs
total_test_step = 0

# number of training epochs
epoch = 10

# add TensorBoard logging
writer = SummaryWriter("logs")
start_time = time.time()  # record the starting wall-clock time

for i in range(epoch):
    print("-----第 {} 轮训练开始-----".format(i+1))

    # Training phase
    tudui.train() # puts layers such as Dropout and BatchNorm into training mode
    for data in train_dataloader:
        imgs, targets = data
        imgs = imgs.to(device) # Tensor.to returns a new tensor, so assign the result back
        targets = targets.to(device) # same for the labels
        outputs = tudui(imgs)
        loss = loss_fn(outputs, targets) # gap between actual output and target output

        # Optimization step
        optimizer.zero_grad()  # clear the gradients
        loss.backward() # backpropagate to compute gradients of the loss
        optimizer.step()   # update the network parameters from the gradients

        total_train_step = total_train_step + 1
        if total_train_step % 100 == 0:
            end_time = time.time()
            print(end_time - start_time) # elapsed time since training started, printed every 100 steps
            print("训练次数:{},Loss:{}".format(total_train_step,loss.item()))  # loss.item() converts the one-element tensor to a Python number
            writer.add_scalar("train_loss",loss.item(),total_train_step)

    # Evaluation phase (after each epoch, check the loss on the test set)
    tudui.eval()  # puts layers such as Dropout and BatchNorm into evaluation mode
    total_test_loss = 0
    total_accuracy = 0
    with torch.no_grad():  # no gradients are tracked during evaluation
        for data in test_dataloader: # iterate over the test set
            imgs, targets = data
            imgs = imgs.to(device) # Tensor.to returns a new tensor, so assign the result back
            targets = targets.to(device)
            outputs = tudui(imgs)
            loss = loss_fn(outputs, targets) # loss for this batch only
            total_test_loss = total_test_loss + loss.item() # accumulate the total test loss
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy

    print("整体测试集上的Loss:{}".format(total_test_loss))
    print("整体测试集上的正确率:{}".format(total_accuracy/test_data_size))
    writer.add_scalar("test_loss",total_test_loss,total_test_step)
    writer.add_scalar("test_accuracy",total_accuracy/test_data_size,total_test_step)
    total_test_step = total_test_step + 1

    torch.save(tudui, "./model/tudui_{}.pth".format(i)) # save the model after every epoch (the ./model directory must already exist)
    # torch.save(tudui.state_dict(), "tudui_{}.pth".format(i)) # save method 2: weights only
    print("模型已保存")

writer.close()
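
When several GPUs are visible, the device index picks one of them. A hedged sketch (it assumes the Tudui class defined above, and shows nn.DataParallel as one simple multi-GPU option):

import torch
from torch import nn

print(torch.cuda.device_count())      # number of visible GPUs
if torch.cuda.device_count() > 1:
    device = torch.device("cuda:1")   # second card
elif torch.cuda.is_available():
    device = torch.device("cuda:0")   # first (and only) card
else:
    device = torch.device("cpu")
model = Tudui().to(device)

# One simple way to train on all visible GPUs at once:
# if torch.cuda.device_count() > 1:
#     model = nn.DataParallel(model)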

Output:

Files already downloaded and verified
Files already downloaded and verified
训练数据集的长度:50000
测试数据集的长度:10000
-----第 1 轮训练开始-----
1.1190404891967773
训练次数:100,Loss:2.2926671504974365
2.2812979221343994
训练次数:200,Loss:2.291703701019287
3.386057138442993
训练次数:300,Loss:2.2745745182037354
4.541907548904419
训练次数:400,Loss:2.221169948577881
5.640037298202515
训练次数:500,Loss:2.143411159515381
6.726482629776001
训练次数:600,Loss:2.0441091060638428
7.838879585266113
训练次数:700,Loss:2.0090014934539795
整体测试集上的Loss:312.4657955169678
整体测试集上的正确率:0.28279998898506165
模型已保存
-----第 2 轮训练开始-----
10.41140604019165
训练次数:800,Loss:1.8645917177200317
11.455690383911133
训练次数:900,Loss:1.827837347984314
12.512084007263184
训练次数:1000,Loss:1.9033353328704834
13.599088907241821
训练次数:1100,Loss:2.0170090198516846
14.64348030090332
训练次数:1200,Loss:1.7100862264633179
15.72208046913147
训练次数:1300,Loss:1.6826354265213013
16.752166986465454
训练次数:1400,Loss:1.7191925048828125
17.81931185722351
训练次数:1500,Loss:1.8116774559020996
整体测试集上的Loss:306.7045053243637
整体测试集上的正确率:0.3068999946117401
模型已保存
-----第 3 轮训练开始-----
20.318028688430786
训练次数:1600,Loss:1.7589811086654663
21.38711452484131
训练次数:1700,Loss:1.6722180843353271
22.505618572235107
训练次数:1800,Loss:1.9415262937545776
23.604503393173218
训练次数:1900,Loss:1.7454909086227417
24.74000310897827
训练次数:2000,Loss:1.9074403047561646
25.785309076309204
训练次数:2100,Loss:1.5321683883666992
26.833311796188354
训练次数:2200,Loss:1.4686038494110107
27.883039236068726
训练次数:2300,Loss:1.8088748455047607
整体测试集上的Loss:264.2274956703186
整体测试集上的正确率:0.3926999866962433
模型已保存
-----第 4 轮训练开始-----
30.434141159057617
训练次数:2400,Loss:1.7530766725540161
31.50102210044861
训练次数:2500,Loss:1.3466917276382446
32.588942766189575
训练次数:2600,Loss:1.5937833786010742
33.64913892745972
训练次数:2700,Loss:1.6885923147201538
34.69320559501648
训练次数:2800,Loss:1.5292593240737915
35.72002124786377
训练次数:2900,Loss:1.6046268939971924
36.74435377120972
训练次数:3000,Loss:1.3702434301376343
37.789002656936646
训练次数:3100,Loss:1.5583586692810059
整体测试集上的Loss:247.68864715099335
整体测试集上的正确率:0.42879998683929443
模型已保存
-----第 5 轮训练开始-----
40.23552346229553
训练次数:3200,Loss:1.3889607191085815
41.28690481185913
训练次数:3300,Loss:1.4547197818756104
42.32324028015137
训练次数:3400,Loss:1.487451434135437
43.36912536621094
训练次数:3500,Loss:1.6039626598358154
44.43635702133179
训练次数:3600,Loss:1.5406546592712402
45.52009439468384
训练次数:3700,Loss:1.355963110923767
46.61804127693176
训练次数:3800,Loss:1.293853521347046
47.66825032234192
训练次数:3900,Loss:1.4567005634307861
整体测试集上的Loss:239.61021220684052
整体测试集上的正确率:0.44669997692108154
模型已保存
-----第 6 轮训练开始-----
50.18902587890625
训练次数:4000,Loss:1.4021949768066406
51.221325397491455
训练次数:4100,Loss:1.4686369895935059
52.25768494606018
训练次数:4200,Loss:1.5711930990219116
53.29710626602173
训练次数:4300,Loss:1.2274739742279053
54.35805821418762
训练次数:4400,Loss:1.1256041526794434
55.45258617401123
训练次数:4500,Loss:1.346487045288086
56.498899936676025
训练次数:4600,Loss:1.4574103355407715
整体测试集上的Loss:229.56566536426544
整体测试集上的正确率:0.4640999734401703
模型已保存
-----第 7 轮训练开始-----
58.9901008605957
训练次数:4700,Loss:1.3305902481079102
60.09166860580444
训练次数:4800,Loss:1.5128451585769653
61.15304517745972
训练次数:4900,Loss:1.4225473403930664
62.24405121803284
训练次数:5000,Loss:1.4352083206176758
63.328041315078735
训练次数:5100,Loss:1.0108458995819092
64.43191266059875
训练次数:5200,Loss:1.2999461889266968
65.55889964103699
训练次数:5300,Loss:1.2483041286468506
66.67005276679993
训练次数:5400,Loss:1.40975821018219
整体测试集上的Loss:221.8911657333374
整体测试集上的正确率:0.4901999831199646
模型已保存
-----第 8 轮训练开始-----
69.31889057159424
训练次数:5500,Loss:1.2309132814407349
70.37002444267273
训练次数:5600,Loss:1.2406929731369019
71.45024251937866
训练次数:5700,Loss:1.206421136856079
72.53801417350769
训练次数:5800,Loss:1.2449841499328613
73.61350750923157
训练次数:5900,Loss:1.382934331893921
74.64801716804504
训练次数:6000,Loss:1.5476189851760864
75.68919968605042
训练次数:6100,Loss:1.0594358444213867
76.78617668151855
训练次数:6200,Loss:1.1037648916244507
整体测试集上的Loss:214.6394373178482
整体测试集上的正确率:0.5138999819755554
模型已保存
-----第 9 轮训练开始-----
79.35270118713379
训练次数:6300,Loss:1.4193459749221802
80.38360047340393
训练次数:6400,Loss:1.1300890445709229
81.4340546131134
训练次数:6500,Loss:1.5622072219848633
82.51292634010315
训练次数:6600,Loss:1.119008183479309
83.57669281959534
训练次数:6700,Loss:1.0774811506271362
84.61026763916016
训练次数:6800,Loss:1.1881333589553833
85.65419411659241
训练次数:6900,Loss:1.116170048713684
86.69365286827087
训练次数:7000,Loss:0.9820349812507629
整体测试集上的Loss:204.89984810352325
整体测试集上的正确率:0.5370000004768372
模型已保存
-----第 10 轮训练开始-----
89.25331830978394
训练次数:7100,Loss:1.339141607284546
90.34024834632874
训练次数:7200,Loss:0.8925604224205017
91.38928580284119
训练次数:7300,Loss:1.134442925453186
92.44890975952148
训练次数:7400,Loss:0.8384325504302979
93.53598165512085
训练次数:7500,Loss:1.2126699686050415
94.57306551933289
训练次数:7600,Loss:1.2007839679718018
95.60608768463135
训练次数:7700,Loss:0.8869692087173462
96.65610480308533
训练次数:7800,Loss:1.3008511066436768
整体测试集上的Loss:195.62357383966446
整体测试集上的正确率:0.5604999661445618
模型已保存

13.5 Running Terminal Commands

① Commands that would normally be typed into a terminal can be run from a notebook code cell by prefixing them with an exclamation mark.

② Running !nvidia-smi shows the GPU configuration.

!nvidia-smi
Thu Mar 31 17:24:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 471.35       Driver Version: 471.35       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   61C    P0    47W /  N/A |   2913MiB / 16384MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1868    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     14152    C+G   ...4__htrsf667h5kn2\AWCC.exe    N/A      |
|    0   N/A  N/A     14904    C+G   ...2\extracted\WeChatApp.exe    N/A      |
|    0   N/A  N/A     19304    C+G   ...y\AccountsControlHost.exe    N/A      |
|    0   N/A  N/A     21816    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     23044    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     23480    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     24180    C+G   ...tracted\WechatBrowser.exe    N/A      |
|    0   N/A  N/A     24376    C+G   ...erver\YourPhoneServer.exe    N/A      |
|    0   N/A  N/A     24912    C+G   ...kzcwy\mcafee-security.exe    N/A      |
|    0   N/A  N/A     25524    C+G   ...me\Application\chrome.exe    N/A      |
|    0   N/A  N/A     27768    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
|    0   N/A  N/A     27788      C   ...a\envs\py3.6.3\python.exe    N/A      |
|    0   N/A  N/A     27960    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     31320    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     32796    C+G   ...e\StoreExperienceHost.exe    N/A      |
|    0   N/A  N/A     35728    C+G   ...artMenuExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+
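
The same kind of information can also be queried from PyTorch itself, without leaving Python; a short sketch:

import torch

print(torch.cuda.is_available())           # True if a usable GPU is present
print(torch.cuda.device_count())           # number of visible GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce ..."
    print(torch.version.cuda)              # CUDA version this PyTorch build uses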