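Hung-yi Lee Machine Learning HW1: Simple Baseline
Requirements
Predict the number of confirmed cases on day n+1 from the data of the previous n days.
Preparation
Setting up the virtual environment
Run the following commands in the Anaconda Prompt: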
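# List the existing virtual environments
conda env list
# Create a virtual environment
conda create -n new_name python=3.6
# Activate the virtual environment
conda activate new_name
# Deactivate the virtual environment
conda deactivate
# Delete the virtual environment
conda remove -n new_name --all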
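Dataset
covid.train.csv: training data, including state, symptom, behavior, and mood features as well as the confirmed-case data
covid.test.csv: test data
Code walkthrough
Imports
The dataset is downloaded to the local machine and read directly from disk. pandas is imported below, but the code in this post actually reads the CSV files with the csv module.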
import torch  # the rest of the code refers to the module as `torch`, so no alias is used
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import csv
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
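Reproducibility
- See the Reproducibility page in the official PyTorch documentation.
- Warning: completely reproducible results are not guaranteed across PyTorch versions or across hardware platforms, even with identical seeds. The most you can do is make results reproducible on one specific hardware and software platform.
- Reproducibility mainly involves:
  (1) Controlling sources of randomness
  (2) Configuring PyTorch to avoid nondeterministic algorithms for some operations (this may make those operations slower)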
# With PyTorch
import torch
torch.manual_seed(0)
# With Python's random module
import random
random.seed(0)
# With NumPy
import numpy as np
np.random.seed(0)
# Or define one helper that seeds every library at once
def set_random_seed(seed):
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_random_seed(seed=42)
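As a quick sanity check of the seeding approach above, the sketch below reseeds all three libraries and draws random numbers twice, then compares the draws. The helper name `draw_numbers` is hypothetical and not part of the original post, and only the CPU generators are exercised here.

```python
import random

import numpy as np
import torch


def set_random_seed(seed):
    """Seed Python, NumPy, and PyTorch (CPU and, if present, CUDA)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def draw_numbers(seed):
    """Hypothetical helper: reseed, then draw one sample from each library."""
    set_random_seed(seed)
    return random.random(), np.random.rand(3), torch.rand(3)


first = draw_numbers(42)
second = draw_numbers(42)

same_python = first[0] == second[0]
same_numpy = np.array_equal(first[1], second[1])
same_torch = torch.equal(first[2], second[2])
print(same_python, same_numpy, same_torch)  # True True True
```

This only shows that the same seed reproduces the same draws within one process on one machine; it says nothing about reproducibility across PyTorch versions or hardware.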
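CUDA convolution benchmarking
- See the notes on cuDNN benchmarking in the official PyTorch documentation.
- When a cuDNN convolution is called with a new set of size parameters, the benchmark feature can run several convolution algorithms, time them, and pick the fastest. That algorithm is then used for every later call with the same size parameters.
- Because of benchmarking noise and differences between hardware platforms, the benchmark may select different algorithms on different runs.
- Disabling benchmarking makes cuDNN select an algorithm deterministically, possibly at the cost of reduced performance.
torch.backends.cudnn.benchmark = False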
CUDA convolution determinism
- Disabling benchmarking only guarantees that the same algorithm is selected each time; the selected algorithm may itself be nondeterministic. Setting torch.backends.cudnn.deterministic = True makes convolution operations behave deterministically. This setting affects only convolution operations, whereas torch.use_deterministic_algorithms applies to all operations.
torch.backends.cudnn.deterministic = True
Avoiding nondeterministic algorithms
- Calling torch.use_deterministic_algorithms(True) forces PyTorch to use deterministic algorithms instead of nondeterministic ones where they are available.
- If an operation is known to be nondeterministic and has no deterministic implementation, calling it raises an error.
torch.use_deterministic_algorithms(True)
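CUDA RNN and LSTM
- With some versions of CUDA, RNN and LSTM networks may behave nondeterministically. Check the official documentation for details:
- torch.nn.RNN in the official PyTorch documentation
- torch.nn.LSTM in the official PyTorch documentation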
DataLoader
- See the DataLoader page in the official PyTorch documentation.
- Randomness in multi-process data loading: By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG (thereby consuming an RNG state mandatorily) or a specified generator. However, seeds for other libraries may be duplicated upon initializing workers, causing each worker to return identical random numbers.
  In worker_init_fn, you may access the PyTorch seed set for each worker with either torch.utils.data.get_worker_info().seed or torch.initial_seed(), and use it to seed other libraries before data loading.
- When num_workers > 0, DataLoader reseeds its workers following the algorithm above. Use worker_init_fn() and generator to preserve reproducibility, and make sure that your dataloader loads samples in the same order on every call:
def seed_worker(worker_id):
    # Derive a 32-bit seed from the worker's PyTorch seed and reuse it for NumPy and random
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
g = torch.Generator()
g.manual_seed(0)
DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    worker_init_fn=seed_worker,
    generator=g,
)
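To illustrate the effect of the `generator` argument, the sketch below builds a shuffling DataLoader twice with the same generator seed and compares the order in which samples come out. The toy `TensorDataset` and the helper name `shuffled_order` are assumptions made for this illustration, and it uses the default num_workers=0, so worker seeding is not exercised.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def shuffled_order(seed):
    """Return the order in which a shuffling DataLoader yields 10 samples."""
    dataset = TensorDataset(torch.arange(10))
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)
    # Each batch is a one-element list because the dataset has a single tensor
    return [int(v) for (batch,) in loader for v in batch]


order_a = shuffled_order(0)
order_b = shuffled_order(0)
print(order_a == order_b)  # True
```

Without a seeded generator, the shuffle order depends on the state of the global PyTorch RNG at the time the loader is iterated.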
Accordingly, the code fixes a seed and configures cuDNN so that algorithm selection is fixed and deterministic:
seed = 230701  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
Selecting the device
def get_device():
    ''' Use the GPU if one is available, otherwise fall back to the CPU '''
    return 'cuda' if torch.cuda.is_available() else 'cpu'
Visualization: model training
def plot_learning_curve(loss_record, title=''):
    ''' Plot learning curve of your DNN (train & dev loss) '''
    total_steps = len(loss_record['train'])
    x_1 = range(total_steps)
    # The dev loss is recorded once per epoch, so spread its points evenly over the training steps
    x_2 = x_1[::len(loss_record['train']) // len(loss_record['dev'])]
    figure(figsize=(6, 4))  # set the width and height of the figure
    plt.plot(x_1, loss_record['train'], c='tab:red', label='train')
    plt.plot(x_2, loss_record['dev'], c='tab:cyan', label='dev')
    plt.ylim(0.0, 5.)
    plt.xlabel('Training steps')
    plt.ylabel('MSE loss')
    plt.title('Learning curve of {}'.format(title))
    plt.legend()  # display the legend with the labels set above
    plt.show()
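The slicing that aligns the per-epoch dev losses with the per-step training losses can be checked on a toy record. The numbers below are made up for illustration: 12 training steps and 3 epochs, i.e. 4 steps per epoch.

```python
# Toy loss record: 12 training steps, 3 validation passes (one per epoch)
train_losses = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.38, 0.36, 0.35, 0.34, 0.33]
dev_losses = [0.75, 0.5, 0.4]

stride = len(train_losses) // len(dev_losses)  # training steps per epoch
x_train = range(len(train_losses))
x_dev = list(x_train[::stride])  # x positions at which the dev losses are drawn

print(stride, x_dev)  # 4 [0, 4, 8]
```

The lengths match here because the number of training steps is an exact multiple of the number of epochs; with the post's settings (2430 training samples, batch size 270) that is also the case.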
Visualization: model validation
def plot_pred(dv_set, model, device, lim=35., preds=None, targets=None):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval()  # set the model to evaluation mode
        preds, targets = [], []
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                # detach() separates the tensor from the computation graph; cpu() moves it back to the CPU
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        # torch.cat() concatenates the given sequence of tensors along the given dimension (dim)
        preds = torch.cat(preds, dim=0).numpy()
        targets = torch.cat(targets, dim=0).numpy()
    figure(figsize=(5, 5))
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()
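Data processing
- The data consists of a training set (train), a validation set (dev), and a test set (test).
- Both the training set and the validation set are drawn from covid.train.csv.
Defining the COVID19Dataset class
- Three methods must be implemented: __init__(), __len__(), and __getitem__()
- __init__()
- path is the path where the data is stored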
class COVID19Dataset(Dataset):
    def __init__(self,
                 path,
                 mode='train',
                 target_only=False):
        self.mode = mode
        # Read the data into a NumPy array, skipping the header row and the id column
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))
            data = np.array(data[1:])[:, 1:].astype(float)
        # With target_only=False (the default), all 93 features are used for training
        if not target_only:
            feats = list(range(93))
        else:
            # Left unimplemented in this post: feats is undefined when target_only=True
            pass
        if mode == 'test':
            # Testing data
            # data: 893 x 93 (40 states + day 1 (18) + day 2 (18) + day 3 (17))
            data = data[:, feats]
            # Convert the NumPy array to a torch.FloatTensor
            self.data = torch.FloatTensor(data)
        else:
            # Training data (train/dev sets)
            # data: 2700 x 94 (40 states + day 1 (18) + day 2 (18) + day 3 (18))
            target = data[:, -1]
            data = data[:, feats]
            # Split covid.train.csv 9:1 into training and validation data
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            elif mode == 'dev':
                indices = [i for i in range(len(data)) if i % 10 == 0]
            # Convert the NumPy arrays to torch.FloatTensor
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])
        # Z-score standardization of the non-state features (columns 40 onward),
        # using the mean and std of the current split
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)
        self.dim = self.data.shape[1]
        print('Finished reading the {} set of COVID19 Dataset ({} samples found, each dim = {})'
              .format(mode, len(self.data), self.dim))
For background on normalization, standardization, and regularization, see: https://zhuanlan.zhihu.com/p/29957294
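The Z-score step used in the dataset class can be tried on a tiny tensor. The 3 x 2 feature matrix below is made up for illustration; the calls are the same `mean`/`std` with `dim=0, keepdim=True` as in the class.

```python
import torch

# Toy feature matrix: 3 samples, 2 features on very different scales
features = torch.tensor([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0]])

mean = features.mean(dim=0, keepdim=True)  # per-column mean, shape (1, 2)
std = features.std(dim=0, keepdim=True)    # per-column sample std (n - 1 in the denominator)
standardized = (features - mean) / std

print(standardized)  # both columns become [-1, 0, 1]
```

Note that the dataset class computes these statistics separately for each split (train, dev, test). Reusing the training-set statistics for the other splits is a common alternative, but it is not what the code in this post does.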
- __getitem__() & __len__()
    def __getitem__(self, index):
        # Return one sample at a time, selected by the given index
        if self.mode in ['train', 'dev']:
            # For training and validation, return the features and the target
            return self.data[index], self.target[index]
        else:
            # For testing, there is no target
            return self.data[index]
    def __len__(self):
        return len(self.data)
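The 9:1 train/dev split in `__init__` relies only on the row index modulo 10. A minimal sketch with a made-up sample count of 20 shows which rows end up in each split:

```python
n_samples = 20  # toy value; the post's training file has 2700 rows

# Every 10th row (index 0, 10, ...) goes to dev, the rest to train
train_idx = [i for i in range(n_samples) if i % 10 != 0]
dev_idx = [i for i in range(n_samples) if i % 10 == 0]

print(len(train_idx), dev_idx)  # 18 [0, 10]
```

Because the split is a fixed function of the row index, it is the same on every run and does not depend on any random seed.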
DataLoader
def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Build a dataset and wrap it in a DataLoader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)  # Construct dataloader
    return dataloader
Defining the neural network
class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()
        # Sequential is an ordered container of modules
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        # Loss function: MSE
        # reduction (optional): 'none' | 'mean' | 'sum'
        # 'none': no reduction is applied
        # 'mean': the sum of the output is divided by the number of elements (default)
        # 'sum': the output is summed
        self.criterion = nn.MSELoss(reduction='mean')
    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)
    # Compute the loss
    def cal_loss(self, pred, target):
        return self.criterion(pred, target)
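The three `reduction` modes of `nn.MSELoss` and the effect of `squeeze(1)` in `forward` can be seen on small tensors. The values below are made up for illustration.

```python
import torch
import torch.nn as nn

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 1.0, 1.0])

loss_none = nn.MSELoss(reduction='none')(pred, target)  # per-element squared errors: [0, 1, 4]
loss_mean = nn.MSELoss(reduction='mean')(pred, target)  # (0 + 1 + 4) / 3
loss_sum = nn.MSELoss(reduction='sum')(pred, target)    # 0 + 1 + 4 = 5
loss_default = nn.MSELoss()(pred, target)               # same as reduction='mean'

# squeeze(1) turns a (batch_size, 1) output into (batch_size,) so it matches the target shape
out = torch.zeros(4, 1).squeeze(1)
print(loss_none, loss_mean.item(), loss_sum.item(), tuple(out.shape))
```

Matching the shapes matters: comparing a (batch_size, 1) prediction with a (batch_size,) target would broadcast to a (batch_size, batch_size) difference and give a wrong loss.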
Defining the training function
def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''
    n_epochs = config['n_epochs']  # Maximum number of epochs
    # Set up the optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])
    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}  # for recording the losses
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()                           # set model to training mode
        for x, y in tr_set:                     # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = x.to(device), y.to(device)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())
        # After each epoch, evaluate the model on the validation set
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save the model parameters whenever the validation loss improves
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                  .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # save the model to the given path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1  # count consecutive epochs without improvement
        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # The model has gone more than the allowed number of consecutive epochs
            # without improving, so stop training
            break
    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
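The early-stopping bookkeeping in `train` can be isolated from the network and replayed on a plain list of validation losses. The function name `run_early_stopping` and the loss values are hypothetical; the counter logic mirrors the loop above, including the strict `>` comparison.

```python
def run_early_stopping(dev_losses, patience):
    """Replay the early-stopping logic on a sequence of per-epoch dev losses.

    Returns the best loss seen and the number of epochs actually run.
    """
    best = float('inf')
    stale = 0        # consecutive epochs without improvement
    epochs_run = 0
    for loss in dev_losses:
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
        if stale > patience:  # stop once patience is exceeded
            break
    return best, epochs_run


losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
print(run_early_stopping(losses, patience=2))  # (3.0, 6)
print(run_early_stopping(losses, patience=3))  # (2.0, 7)
```

With patience 2 the loop stops after the third non-improving epoch and never sees the later improvement to 2.0, which is the usual trade-off when choosing the patience value.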
Defining the validation function
def dev(dv_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:                         # iterate through the dataloader
        x, y = x.to(device), y.to(device)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss, weighted by batch size
    total_loss = total_loss / len(dv_set.dataset)               # compute averaged loss
    return total_loss
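The reason `dev` multiplies each batch loss by `len(x)` before dividing by the dataset size is that the last batch may be smaller than the others. A small arithmetic sketch with made-up numbers shows the difference from a plain average of batch means:

```python
# (mean loss of the batch, number of samples in the batch); toy values
batches = [(1.0, 4), (3.0, 2)]

total = sum(loss * size for loss, size in batches)  # 1.0*4 + 3.0*2 = 10.0
n_samples = sum(size for _, size in batches)        # 6
weighted = total / n_samples                        # per-sample average, 10/6
naive = sum(loss for loss, _ in batches) / len(batches)  # 2.0, overweights the small batch

print(weighted, naive)
```

The weighted value equals the mean loss over all samples, which is what the function returns.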
Defining the test function
def test(tt_set, model, device):
    model.eval()
    preds = []
    for x in tt_set:                       # iterate through the dataloader
        x = x.to(device)
        with torch.no_grad():
            pred = model(x)
            preds.append(pred.detach().cpu())    # collect the predictions of each batch
    preds = torch.cat(preds, dim=0).numpy()  # concatenate all batches and convert to a NumPy array
    return preds
Setting the hyperparameters
device = get_device()
os.makedirs('models', exist_ok=True)  # The trained model will be saved to ./models/
target_only = False
config = {
    'n_epochs': 3000,                # maximum number of epochs
    'batch_size': 270,               # mini-batch size for dataloader
    'optimizer': 'SGD',              # optimizer (name of a class in torch.optim)
    'optim_hparas': {
        'lr': 0.001,                 # learning rate
        'momentum': 0.9              # momentum
    },
    'early_stop': 200,               # early-stopping patience (epochs since the model's last improvement)
    'save_path': 'models/model.pth'
}
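The `train` function turns the `'optimizer'` string and the `'optim_hparas'` dictionary into an optimizer object with `getattr`. A minimal sketch of that lookup, using a throwaway `nn.Linear` model as a stand-in for the real network:

```python
import torch
import torch.nn as nn

config = {
    'optimizer': 'SGD',
    'optim_hparas': {'lr': 0.001, 'momentum': 0.9},
}

model = nn.Linear(3, 1)  # stand-in model, only used to provide parameters

optimizer_cls = getattr(torch.optim, config['optimizer'])  # resolves to torch.optim.SGD
optimizer = optimizer_cls(model.parameters(), **config['optim_hparas'])

print(type(optimizer).__name__)  # SGD
```

Switching optimizers then only requires changing the config, for example `'optimizer': 'Adam'` with hyperparameters that `torch.optim.Adam` accepts (it has no `momentum` argument).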
Creating the data loaders
tr_path = 'covid.train.csv' # path to training data
tt_path = 'covid.test.csv' # path to testing data
tr_set = prep_dataloader(tr_path, 'train', config['batch_size'], target_only=target_only)
dv_set = prep_dataloader(tr_path, 'dev', config['batch_size'], target_only=target_only)
tt_set = prep_dataloader(tt_path, 'test', config['batch_size'], target_only=target_only)
Creating the DNN
model = NeuralNet(tr_set.dataset.dim).to(device)
Training the model
model_loss, model_loss_record = train(tr_set, dv_set, model, config, device)
plot_learning_curve(model_loss_record, title='deep model')
Loading the best saved model and validating it
del model
model = NeuralNet(tr_set.dataset.dim).to(device)
ckpt = torch.load(config['save_path'], map_location='cpu')  # Load your best model
model.load_state_dict(ckpt)
plot_pred(dv_set, model, device)  # Show prediction on the validation set
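The save-and-reload pattern used here (`state_dict` written during training, then loaded into a freshly constructed model) can be checked in isolation. The sketch below uses a small `nn.Linear` stand-in and a temporary directory instead of the post's `models/model.pth`; both are assumptions made for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # stand-in for the trained network
x = torch.ones(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pth')
    torch.save(model.state_dict(), path)  # save only the parameters

    restored = nn.Linear(3, 1)            # fresh model with different random weights
    restored.load_state_dict(torch.load(path, map_location='cpu'))

restored.eval()
with torch.no_grad():
    same_output = torch.equal(model(x), restored(x))
print(same_output)  # True
```

Because only the parameters are stored, the model class must be constructed with the same architecture before `load_state_dict` is called.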
Saving the results
def save_pred(preds, file):
    print('Saving results to {}'.format(file))
    # newline='' prevents the csv module from writing blank lines between rows on Windows
    with open(file, 'w', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])
preds = test(tt_set, model, device)
save_pred(preds, 'pred.csv')
References
[1] https://blog.csdn.net/weixin_42437114/article/details/129000271
[2] Hung-yi Lee, Machine Learning course and homework code
[3] https://zhuanlan.zhihu.com/p/422706397