[Hung-yi Lee Machine Learning HW1: COVID-19 Cases Prediction (Regression)]

Task

Predict the number of confirmed cases on day n+1 from the confirmed-case data of the previous n days.

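Preparation

Setting up the virtual environment

Run the following in the Anaconda command prompt:
# List the existing virtual environments
conda env list
# Create a virtual environment
conda create -n new_name python=3.6
# Activate the virtual environment
conda activate new_name
# Leave the virtual environment
conda deactivate
# Delete the virtual environment
conda remove -n new_name --all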

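Dataset

covid.train.csv: training data, including the state, symptoms, behaviors, mood, and confirmed-case data
covid.test.csv: test data

Code walkthrough

Imports

The dataset is downloaded to the local disk. pandas is imported for reading it, although the Dataset class below actually parses the CSV with the csv module.
import torch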
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader

import pandas as pd
import numpy as np
import csv
import os

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

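Reproducibility

  • See the official PyTorch documentation: Reproducibility.
    • Important: completely reproducible results are not guaranteed across PyTorch versions or hardware platforms, even with identical seeds. The goal is only to make results as reproducible as possible on one specific hardware and software platform.

    • Reproducibility mainly involves two things:

    • (1) Controlling sources of randomness

    • (2) Configuring PyTorch to avoid nondeterministic algorithms for some operations (this may make those operations slower)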

# Seed PyTorch's random number generator
import torch
torch.manual_seed(0)

# Seed Python's built-in random module
import random
random.seed(0)

# Seed NumPy's global random number generator
import numpy as np
np.random.seed(0)

# Alternatively, wrap all of the seeding in one function
def set_random_seed(seed):
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

set_random_seed(seed=42)
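
As a quick check of what seeding buys you, here is a minimal sketch that uses only Python's built-in random module, so it says nothing about PyTorch or CUDA behavior. The helper name draw_numbers is hypothetical. Re-seeding with the same value replays the same sequence, while a different seed gives a different one.

```python
import random


def draw_numbers(seed, count=5):
    """Seed Python's RNG, then draw `count` pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(count)]


first_run = draw_numbers(42)
second_run = draw_numbers(42)   # same seed: the sequence repeats
other_seed = draw_numbers(43)   # different seed: a different sequence

print(first_run == second_run)  # True
print(first_run == other_seed)  # False
```

The same idea carries over to torch.manual_seed and np.random.seed, subject to the platform caveats listed above.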
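  • CUDA convolution benchmarking

    • See the official PyTorch documentation: benchmarking.
    • When a cuDNN convolution is called with a new set of size parameters, the benchmark feature runs several convolution algorithms, times them, and picks the fastest; that algorithm is then reused for every later call with the same size parameters.
      Because of benchmarking noise and differences between hardware platforms, the benchmark may select a different algorithm on different runs.
      Disabling benchmarking makes cuDNN select an algorithm deterministically, possibly at the cost of reduced performance.
torch.backends.cudnn.benchmark = False
  • CUDA convolution determinism

    • Disabling benchmarking only fixes which algorithm is chosen; the chosen algorithm may itself be nondeterministic. Setting torch.backends.cudnn.deterministic makes convolution operations behave deterministically. (This setting affects only convolution operations, whereas torch.use_deterministic_algorithms applies to all operations.)
torch.backends.cudnn.deterministic = True
  • Avoiding nondeterministic algorithms

    • Calling torch.use_deterministic_algorithms(True) forces PyTorch to use deterministic algorithms instead of nondeterministic ones where they are available.
    • If an operation is nondeterministic and has no deterministic implementation, it raises an error when called.
torch.use_deterministic_algorithms(True)
  • CUDA RNN and LSTM

    • In some versions of CUDA, RNNs and LSTMs may show nondeterministic behavior; check the official documentation for details.
    • See the official PyTorch documentation: torch.nn.RNN.
    • See the official PyTorch documentation: torch.nn.LSTM.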
  • DataLoader

    • See the official PyTorch documentation: DataLoader.

    • DataLoader reseeds its workers following the "Randomness in multi-process data loading" algorithm. Use worker_init_fn() and generator to preserve reproducibility.

    • Randomness in multi-process data loading

    • By default, each worker has its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG (thereby consuming an RNG state) or a specified generator. However, seeds for other libraries may be duplicated when the workers are initialized, causing each worker to return identical random numbers.

      In worker_init_fn, you can read the PyTorch seed set for each worker with either torch.utils.data.get_worker_info().seed or torch.initial_seed(), and use it to seed other libraries before data loading.

    • When num_workers > 0, configure the loader as follows, and make sure it loads samples in the same order on every call:

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    worker_init_fn=seed_worker,
    generator=g,
)
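
To see the effect of the generator argument in isolation, the following is a small sketch, assuming PyTorch is installed. It uses a toy TensorDataset and the default num_workers=0 to stay lightweight, so it exercises only the shuffling order, not worker_init_fn. The helper name epoch_order is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def epoch_order(seed):
    """Return the sample order of one shuffled epoch for a given generator seed."""
    dataset = TensorDataset(torch.arange(10))   # toy dataset holding 0..9
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)
    # Each batch is a one-element collection holding a tensor of samples
    return [int(i) for (batch,) in loader for i in batch]


order_a = epoch_order(0)
order_b = epoch_order(0)

print(order_a == order_b)                    # same seed, same order
print(sorted(order_a) == list(range(10)))    # every sample appears exactly once
```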

The script therefore fixes a seed and configures cuDNN so that the chosen algorithms are fixed and deterministic:

seed = 230701  # set a random seed for reproducibility

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
torch.manual_seed(seed)

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

Selecting the device

def get_device():
    ''' Use the GPU if one is available, otherwise the CPU '''
    return 'cuda' if torch.cuda.is_available() else 'cpu'

Visualization: training

def plot_learning_curve(loss_record, title=''):
    ''' Plot learning curve of your DNN (train & dev loss) '''
    total_steps = len(loss_record['train'])
    x_1 = range(total_steps)
    x_2 = x_1[::len(loss_record['train']) // len(loss_record['dev'])]
    figure(figsize=(6, 4)) # set the figure width and height (in inches)
    plt.plot(x_1, loss_record['train'], c='tab:red', label='train')
    plt.plot(x_2, loss_record['dev'], c='tab:cyan', label='dev')
    plt.ylim(0.0, 5.)
    plt.xlabel('Training steps')
    plt.ylabel('MSE loss')
    plt.title('Learning curve of {}'.format(title))
    plt.legend() # draw the legend from the labels set above
    plt.show()
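
The x_2 line is the least obvious part of this function: the training loss is recorded once per step, but the dev loss only once per epoch, so the dev points have to be spread out along the step axis. The sketch below replays just that slicing on a hypothetical loss record (3 epochs, 4 training steps each); the numbers are made up for illustration.

```python
# Hypothetical loss record: 3 epochs with 4 training steps per epoch
loss_record = {
    'train': [5.0, 4.0, 3.5, 3.0, 2.8, 2.5, 2.3, 2.2, 2.1, 2.0, 1.9, 1.8],
    'dev': [3.2, 2.4, 1.9],
}

total_steps = len(loss_record['train'])
stride = total_steps // len(loss_record['dev'])  # training steps per dev point
x_1 = range(total_steps)
x_2 = x_1[::stride]                              # one x position per dev loss

print(stride)      # 4
print(list(x_2))   # [0, 4, 8]
```

Note that each dev point is drawn at the first step of its epoch, although the loss is measured after the epoch ends.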

Visualization: validation

def plot_pred(dv_set, model, device, lim=35., preds=None, targets=None):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval() # set the model to evaluation mode
        preds, targets = [], []
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                # detach() separates the tensor from the computation graph; cpu() moves it back to the CPU
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        # torch.cat() concatenates the given sequence of tensors along the given dimension (dim)
        preds = torch.cat(preds, dim=0).numpy()
        targets = torch.cat(targets, dim=0).numpy()

    figure(figsize=(5, 5))
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()

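Data processing

  • The data consist of a training set (train), a validation set (dev), and a test set (test)
  • The training and validation sets are both drawn from covid.train.csv

Defining the COVID19Dataset class

  • Three methods must be implemented: __init__(), __len__(), and __getitem__()
  • __init__()
    • path is the path to the data file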
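class COVID19Dataset(Dataset):
    def __init__(self,
                 path,
                 mode='train',
                 target_only=False):
        self.mode = mode
        # Read the CSV into a NumPy array, dropping the header row and the first column
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))
            data = np.array(data[1:])[:, 1:].astype(float)

        # With the default target_only=False, all 93 features are used
        if not target_only:
            feats = list(range(93))
        else:
            # Feature selection is not implemented in this walkthrough
            raise NotImplementedError('target_only=True is not implemented')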

        if mode == 'test':
            # Testing data
            # data: 893 x 93 (40 states + day 1 (18) + day 2 (18) + day 3 (17))
            data = data[:, feats]
            
            # Convert the NumPy array to a torch.FloatTensor
            self.data = torch.FloatTensor(data)
        else:
            # Training data (train/dev sets)
            # data: 2700 x 94 (40 states + day 1 (18) + day 2 (18) + day 3 (18))
            target = data[:, -1]
            data = data[:, feats]
            
            # Split covid.train.csv 9:1 into training and validation data
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            elif mode == 'dev':
                indices = [i for i in range(len(data)) if i % 10 == 0]
            
            # Convert the NumPy arrays to torch.FloatTensor
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Z-score standardization of every column after the 40 state columns
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)

        self.dim = self.data.shape[1]

        print('Finished reading the {} set of COVID19 Dataset ({} samples found, each dim = {})'
              .format(mode, len(self.data), self.dim))
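
The 9:1 split in __init__ relies on nothing more than the row index modulo 10. As a minimal sketch, the snippet below replays the two list comprehensions on the 2700 rows quoted in the code comment above and checks the resulting sizes.

```python
n_rows = 2700  # number of training rows quoted in the code comment

# Every 10th row goes to the dev set; the rest go to the training set
train_indices = [i for i in range(n_rows) if i % 10 != 0]
dev_indices = [i for i in range(n_rows) if i % 10 == 0]

print(len(train_indices), len(dev_indices))          # 2430 270
print(dev_indices[:3])                               # [0, 10, 20]
print(set(train_indices).isdisjoint(dev_indices))    # True
```

Because the split is index-based rather than random, it is the same on every run regardless of the seed.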

Normalization, standardization, and regularization: https://zhuanlan.zhihu.com/p/29957294
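
For reference, the Z-score step in the Dataset class computes (x - mean) / std for each column. Here is a plain-Python sketch of the same formula on one hypothetical column. It uses statistics.stdev, which is the sample standard deviation (dividing by n - 1) and so matches the default behavior of torch.std; the function name z_score is an assumption.

```python
import statistics


def z_score(column):
    """Standardize one feature column to zero mean and unit sample std."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # sample std (n - 1), like torch.std's default
    return [(value - mean) / std for value in column]


column = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
scaled = z_score(column)

print(scaled)
print(statistics.mean(scaled), statistics.stdev(scaled))  # approximately 0 and 1
```

Note that the class above computes the mean and std separately for whichever split it is loading, so the train, dev, and test sets are each scaled with their own statistics.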

  • __getitem__() & __len__()
    def __getitem__(self, index):
        # Return one sample at a time, selected by index

        if self.mode in ['train', 'dev']:
            # Training/validation: return features and target
            return self.data[index], self.target[index]
        else:
            # Testing: no target is available
            return self.data[index]

    def __len__(self):
        # Number of samples in the dataset
        return len(self.data)

DataLoader

def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Generates a dataset, then is put into a dataloader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)                            # Construct dataloader
    return dataloader
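
With drop_last=False, the loader yields ceil(n_samples / batch_size) batches per pass, the last of which may be smaller. The sketch below works this out for the set sizes quoted in the Dataset comments (2700 training rows split 9:1, 893 test rows) and the batch size of 270 used in the config later in this post. The helper name num_batches is hypothetical.

```python
import math


def num_batches(n_samples, batch_size, drop_last=False):
    """Number of mini-batches a loader yields for one pass over the data."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


print(num_batches(2430, 270))  # 9  (train split: 2700 rows minus every 10th)
print(num_batches(270, 270))   # 1  (dev split)
print(num_batches(893, 270))   # 4  (test set; the last batch has 83 samples)
```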

Defining the neural network

class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # nn.Sequential is an ordered container of modules
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

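        # Loss function: mean squared error (MSE)
        # reduction (optional): 'none' | 'mean' | 'sum'
            # 'none': no reduction is applied
            # 'mean': the sum of the output is divided by the number of elements (default)
            # 'sum' : the output is summed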
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    # Compute the loss
    def cal_loss(self, pred, target):
        return self.criterion(pred, target)
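
To make the three reduction modes concrete, here is a plain-Python sketch of squared error with 'none', 'mean', and 'sum'. It mirrors the semantics described in the comments above but is not PyTorch's implementation; the function name and the sample values are hypothetical.

```python
def mse_loss(pred, target, reduction='mean'):
    """Plain-Python squared error with 'none', 'mean', or 'sum' reduction."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    if reduction == 'none':
        return squared                      # one value per element
    if reduction == 'sum':
        return sum(squared)                 # total squared error
    if reduction == 'mean':
        return sum(squared) / len(squared)  # average squared error
    raise ValueError('unknown reduction: {}'.format(reduction))


pred = [2.0, 4.0, 7.0]     # hypothetical predictions
target = [1.0, 4.0, 5.0]   # hypothetical targets

print(mse_loss(pred, target, 'none'))  # [1.0, 0.0, 4.0]
print(mse_loss(pred, target, 'sum'))   # 5.0
print(mse_loss(pred, target, 'mean'))  # 1.6666666666666667
```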

Defining the training function

def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''

    n_epochs = config['n_epochs']  # Maximum number of epochs

    # Optimizer, looked up by name in torch.optim
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}      #recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()                           # set the model to training mode
        for x, y in tr_set:                     # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = x.to(device), y.to(device)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())

        # After each epoch, evaluate the model on the validation set
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # The validation loss improved, so save the model parameters
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # save the parameters to the given path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1 # count consecutive epochs without improvement

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # No improvement for more than config['early_stop'] consecutive epochs: further training is unlikely to help, so stop
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
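
The early-stopping bookkeeping can be replayed without any model. The sketch below applies the same rule as train() to a hypothetical list of per-epoch dev losses. It starts from float('inf') rather than 1000., and the function name run_early_stopping is an assumption.

```python
def run_early_stopping(dev_losses, patience):
    """Replay the stopping rule on a list of per-epoch dev losses.

    Returns (best_loss, epochs_run).
    """
    min_mse = float('inf')
    early_stop_cnt = 0
    epochs_run = 0
    for dev_mse in dev_losses:
        epochs_run += 1
        if dev_mse < min_mse:
            min_mse = dev_mse       # improvement: train() saves the model here
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1     # one more epoch without improvement
        if early_stop_cnt > patience:
            break
    return min_mse, epochs_run


losses = [3.0, 2.0, 2.5, 2.1, 2.2, 2.3, 1.0]   # hypothetical dev losses
print(run_early_stopping(losses, patience=2))   # (2.0, 5)
```

Because the comparison is a strict >, training stops only after patience + 1 consecutive epochs without improvement, so the later loss of 1.0 in this example is never reached.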

Defining the validation function

def dev(dv_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:                         # iterate through the dataloader
        x, y = x.to(device), y.to(device)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)              # compute averaged loss
    return total_loss
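
The multiplication by len(x) matters when the batches are not all the same size: each batch's mean loss has to be weighted by its sample count before dividing by the dataset size. The following is a minimal sketch with hypothetical numbers; the function name averaged_loss is an assumption.

```python
def averaged_loss(batch_means, batch_sizes):
    """Sample-weighted average of per-batch mean losses, as in dev()."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


batch_means = [1.0, 4.0]   # hypothetical mean loss of each batch
batch_sizes = [270, 30]    # the last batch is smaller

print(averaged_loss(batch_means, batch_sizes))   # 1.3
print(sum(batch_means) / len(batch_means))       # 2.5 (naive mean of batch means)
```

The naive mean of batch means overweights the small final batch, which is why dev() accumulates mse_loss * len(x) instead.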

Defining the test function

def test(tt_set, model, device):
    model.eval()
    preds = []
    for x in tt_set:                            # iterate through the dataloader
        x = x.to(device)
        with torch.no_grad():
            pred = model(x)
            preds.append(pred.detach().cpu())   # collect the predictions of each batch
    preds = torch.cat(preds, dim=0).numpy()     # concatenate all batches and convert to a NumPy array
    return preds

Setting the hyperparameters

device = get_device()
os.makedirs('models', exist_ok=True)  # The trained model will be saved to ./models/
target_only = False

config = {
    'n_epochs': 3000,                # maximum number of epochs
    'batch_size': 270,               # mini-batch size for dataloader
    'optimizer': 'SGD',              # optimizer (name of a class in torch.optim)
    'optim_hparas': {
        'lr': 0.001,                 # learning rate
        'momentum': 0.9              # momentum
    },
    'early_stop': 200,               # early-stopping patience (epochs since the last improvement on the dev set)
    'save_path': 'models/model.pth'
}
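
To give a feel for what lr and momentum do, here is a plain-Python sketch of one SGD-with-momentum update, assuming the update form documented for torch.optim.SGD with dampening 0 and Nesterov off. The toy objective f(w) = w**2 and the helper name sgd_momentum_step are hypothetical; this is not the optimizer's actual implementation.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One update: v <- momentum * v + grad, then p <- p - lr * v."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Minimize f(w) = w**2 (gradient 2 * w) from a hypothetical start w = 1.0
w, v = 1.0, 0.0
for step in range(3):
    w, v = sgd_momentum_step(w, 2 * w, v)
    print(step, w, v)
```

The velocity accumulates past gradients, so successive steps in the same direction grow larger than plain SGD's lr * grad.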

Creating the data loaders

tr_path = 'covid.train.csv'  # path to training data
tt_path = 'covid.test.csv'   # path to testing data

tr_set = prep_dataloader(tr_path, 'train', config['batch_size'], target_only=target_only)
dv_set = prep_dataloader(tr_path, 'dev', config['batch_size'], target_only=target_only)
tt_set = prep_dataloader(tt_path, 'test', config['batch_size'], target_only=target_only)

Creating the DNN

model = NeuralNet(tr_set.dataset.dim).to(device)

Training the model

model_loss, model_loss_record = train(tr_set, dv_set, model, config, device)

plot_learning_curve(model_loss_record, title='deep model')

Loading the saved model and validating it

del model
model = NeuralNet(tr_set.dataset.dim).to(device)
ckpt = torch.load(config['save_path'], map_location='cpu')  # Load your best model
model.load_state_dict(ckpt)
plot_pred(dv_set, model, device)  # Show prediction on the validation set

Saving the results

def save_pred(preds, file):
    print('Saving results to {}'.format(file))
    with open(file, 'w', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

preds = test(tt_set, model, device)
save_pred(preds, 'pred.csv')
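
csv.writer terminates each row with '\r\n' by default, which is why the csv module documentation recommends opening output files with newline=''; otherwise blank lines can appear between rows on Windows. The sketch below writes the same two-column layout to an in-memory buffer so the raw text can be inspected. The helper name preds_to_csv_text and the sample predictions are hypothetical.

```python
import csv
import io


def preds_to_csv_text(preds):
    """Write (id, prediction) rows with csv.writer into an in-memory buffer."""
    buffer = io.StringIO(newline='')   # no newline translation, as with open(..., newline='')
    writer = csv.writer(buffer)
    writer.writerow(['id', 'tested_positive'])
    for i, p in enumerate(preds):
        writer.writerow([i, p])
    return buffer.getvalue()


text = preds_to_csv_text([1.5, 2.25])  # hypothetical predictions
print(repr(text))  # 'id,tested_positive\r\n0,1.5\r\n1,2.25\r\n'
```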

References
[1]: https://blog.csdn.net/weixin_42437114/article/details/129000271
[2]: Hung-yi Lee, Machine Learning course and homework code
[3]: https://zhuanlan.zhihu.com/p/422706397
