1. Constructing the dataset
Step 1: build the dataset with the Dataset class from torch.utils.data by defining a custom dataset class (here, DiabetesDataset).
Step 2: load the dataset with the DataLoader class from torch.utils.data.
import torch
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class DiabetesDataset(Dataset):
    def __init__(self):
        pass
    def __getitem__(self, index):
        # return the sample (and label) at position index
        pass
    def __len__(self):
        # return the number of samples in the dataset
        pass

dataset = DiabetesDataset()
# batch_size is set when the loader is actually built (see the full example below)
train_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
The signature of DataLoader is as follows:
torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='')
Parameters
- dataset (Dataset) – dataset from which to load the data.
- batch_size (int, optional) – how many samples per batch to load (default: 1).
- shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).
- sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.
- batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
- num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
- collate_fn (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
- pin_memory (bool, optional) – if True, the data loader will copy Tensors into device/CUDA pinned memory before returning them.
- drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
- timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
- worker_init_fn (Callable, optional) – if not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
- multiprocessing_context (str or multiprocessing.context.BaseContext, optional) – if None, the default multiprocessing context of your operating system will be used. (default: None)
- generator (torch.Generator, optional) – if not None, this RNG will be used by RandomSampler to generate random indexes and by multiprocessing to generate base_seed for workers. (default: None)
- prefetch_factor (int, optional, keyword-only arg) – number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (the default depends on num_workers: if num_workers=0 the default is None, otherwise the default is 2)
- persistent_workers (bool, optional) – if True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. (default: False)
- pin_memory_device (str, optional) – the device to pin_memory to if pin_memory is True.
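A small illustration of how a few of these parameters interact (a minimal sketch on toy data; TensorDataset and the sizes below are only for demonstration and are unrelated to the diabetes example):
import torch
from torch.utils.data import TensorDataset, DataLoader

toy_dataset = TensorDataset(torch.randn(10, 3), torch.randn(10, 1))   # 10 samples, 3 features, 1 label each
# batch_size=4 with drop_last=True: 10 // 4 = 2 full batches, the last 2 samples are dropped
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0, drop_last=True)
for inputs, labels in toy_loader:
    print(inputs.shape, labels.shape)   # torch.Size([4, 3]) torch.Size([4, 1])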
The following is the complete implementation of data loading with Dataset and DataLoader:
import torch
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class DiabetesDataset(Dataset):
    def __init__(self, file_path):
        xy = np.loadtxt(file_path, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]                        # number of samples
        self.x_data = torch.from_numpy(xy[:, :-1])    # all columns except the last are features
        self.y_data = torch.from_numpy(xy[:, [-1]])   # the last column is the label, kept 2-D
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    def __len__(self):
        return self.len

file_path = r"D:\jupyter\pytorch基础\diabetes.csv.gz"
dataset = DiabetesDataset(file_path)
train_loader = DataLoader(dataset=dataset, batch_size=20, shuffle=True)
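A quick sanity check of the loader (assuming the file has 8 feature columns plus one label column, as the network in the next section expects):
inputs, labels = next(iter(train_loader))   # fetch one mini-batch
print(inputs.shape)    # torch.Size([20, 8])
print(labels.shape)    # torch.Size([20, 1])
print(len(dataset))    # total number of samples, reported by __len__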
2. Constructing the network
A network is defined by subclassing torch.nn.Module. A minimal single-feature example:
class LinearModel(torch.nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()    # call the parent class __init__
        self.linear = torch.nn.Linear(1, 1)    # Linear(number of input features, number of output features)
    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred

model = LinearModel()
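Calling the instance as model(x) dispatches to forward through Module.__call__. A quick check on dummy single-feature inputs (the values here are arbitrary, only to show the shapes):
x = torch.Tensor([[1.0], [2.0], [3.0]])   # 3 samples, 1 feature each, matching Linear(1, 1)
y_pred = model(x)                          # equivalent to running forward, plus hooks
print(y_pred.shape)                        # torch.Size([3, 1])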
torch.nn provides the linear layer Linear and activation functions such as Sigmoid and ReLU. Below is a network built from these components:
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(8, 6)   # 8 input features -> 6
        self.linear2 = torch.nn.Linear(6, 4)   # 6 -> 4
        self.linear3 = torch.nn.Linear(4, 1)   # 4 -> 1 output
        self.activate1 = torch.nn.Sigmoid()    # final activation, squashes the output to (0, 1)
        self.activate2 = torch.nn.ReLU()       # hidden-layer activation
    def forward(self, x):
        x = self.activate2(self.linear1(x))
        x = self.activate2(self.linear2(x))
        x = self.activate1(self.linear3(x))
        return x

model = Model()
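A quick shape check on a dummy batch (random values, only to confirm the 8 -> 6 -> 4 -> 1 pipeline; not part of training):
x = torch.randn(20, 8)    # a batch of 20 samples with 8 features
y_pred = model(x)
print(y_pred.shape)       # torch.Size([20, 1]); values lie in (0, 1) because of the final Sigmoid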
3. Constructing the criterion and optimizer
# BCE loss
criterion = torch.nn.BCELoss(reduction='mean')
# SGD optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
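BCELoss(reduction='mean') averages the binary cross entropy -[y*log(p) + (1-y)*log(1-p)] over the batch. A toy hand check (the probabilities and labels below are made up):
p = torch.tensor([[0.9], [0.2]])   # predicted probabilities, must lie in (0, 1)
y = torch.tensor([[1.0], [0.0]])   # ground-truth labels
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
print(criterion(p, y).item(), manual.item())   # both print the same value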
4. Training
for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        # 1 prepare data
        inputs, labels = data
        # 2 forward
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        print(epoch, i, loss.item())
        # 3 backward
        optimizer.zero_grad()
        loss.backward()
        # 4 update
        optimizer.step()
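After training, a rough check of how well the model fits the data (a sketch only; it evaluates on the training set itself because no separate test split is built in this example, and the 0.5 threshold is an assumption):
with torch.no_grad():                                    # no gradients needed for evaluation
    y_pred = model(dataset.x_data)
    predicted = (y_pred >= 0.5).float()                  # threshold the sigmoid output
    accuracy = (predicted == dataset.y_data).float().mean()
    print('accuracy:', accuracy.item())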