Hyperparameter Optimization Tools: wandb / wandb (2)
Note: this section draws on https://zhuanlan.zhihu.com/p/522355820, which covers wandb comprehensively and also provides reference code.
wandb is a free tool for logging experiment data. Compared with tools like TensorBoard, wandb offers much richer user and team management features, which makes team collaboration easier. To use wandb, first create a team on the website, then create a project under that team; the detailed data of each experiment is then recorded under the project.
Installation
pip install wandb
Logging in
Register an account on the wandb website, then obtain the account's API key.
wandb login
(If your code uses wandb, it will prompt for login automatically, but that doesn't work when the script is detached with nohup, so it's better to log in ahead of time.)
The API key can be found at: https://wandb.ai/authorize
(Note that this configuration appears to be global: after logging in under one virtual environment, a fresh wandb install in another virtual environment still works without logging in again. Presumably the key is cached locally, e.g. in ~/.netrc.)
Use wandb login --relogin to force a re-login.
In a Jupyter notebook:
!pip install wandb
import wandb
wandb.login()
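If you want to avoid the interactive prompt altogether (e.g. for the nohup case above), wandb.login also accepts the key as an argument. A minimal sketch, assuming you have exported WANDB_API_KEY in the shell yourself:
import os
import wandb

# Assumption: WANDB_API_KEY was exported in the shell beforehand
wandb.login(key=os.environ["WANDB_API_KEY"])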
Basic usage
Start a new project, track metrics and hyperparameters, and add alerts
The flow is essentially: init a new project; log metrics with log (as the examples below show, metrics can be grouped, e.g. into train/ and val/); track hyperparameters with config; send anything that needs an alert with alert; write final results to summary; and call finish at the end. Quite convenient.
init docs: Launch Experiments with wandb.init - Documentation
Overview of the dashboard layout (run page): Run Page - Documentation
Data visualization / metric tracking: Data Visualization - Documentation
Hyperparameter tracking: Configure Experiments with wandb.config - Documentation
Alert docs: Send Alerts with wandb.alert - Documentation
When testing code, if you temporarily don't want to sync with wandb, set the environment variable WANDB_MODE=offline to switch wandb into offline mode (concretely, prepend it to the command that runs your script, e.g. WANDB_MODE=offline python train.py, where train.py is your script).
import wandb

wandb.init(project="my-awesome-project")  # other args: name (see example below), config (see below)
wandb.log({'accuracy': train_acc, 'loss': train_loss})
wandb.config.dropout = 0.2
wandb.alert(
    title="Low accuracy",
    text=f"Accuracy {acc} is below the acceptable threshold {thresh}"
)
wandb.summary['test_accuracy'] = test_acc
wandb.finish()
In a Jupyter notebook, a mock example to simulate several runs:
import random
import wandb

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
    # 🐝 1️⃣ Start a new run to track this script
    wandb.init(
        # Set the project where this run will be logged
        project="wandbexample1",
        # We pass a run name (otherwise it'll be randomly assigned, like sunshine-lollypop-10)
        name=f"experiment_{run}",
        # Track hyperparameters and run metadata
        config={
            "learning_rate": 0.02,
            "architecture": "CNN",
            "dataset": "CIFAR-100",
            "epochs": 10,
        })

    # This simple block simulates a training loop logging metrics
    epochs = 10
    offset = random.random() / 5
    for epoch in range(2, epochs):
        acc = 1 - 2 ** -epoch - random.random() / epoch - offset
        loss = 2 ** -epoch + random.random() / epoch + offset
        # 🐝 2️⃣ Log metrics from your script to W&B
        wandb.log({"acc": acc, "loss": loss})

    # Mark the run as finished
    wandb.finish()
The output includes a link to the wandb project; open it in a browser to view the runs.
What gets recorded:
Saving data offline and syncing later
Set init to offline mode, controlled mainly through environment variables:
- WANDB_API_KEY=$KEY
- WANDB_MODE="offline"
import wandb
import os

os.environ["WANDB_API_KEY"] = "YOUR_KEY_HERE"
os.environ["WANDB_MODE"] = "offline"

config = {
    "dataset": "CIFAR10",
    "machine": "offline cluster",
    "model": "CNN",
    "learning_rate": 0.01,
    "batch_size": 128,
}

wandb.init(project="offline-demo", config=config)

for i in range(100):
    wandb.log({"accuracy": i})
Sync it later with: wandb sync wandb/dryrun-folder-name
MNIST example
In a Jupyter notebook: an MNIST classifier
import wandb
import math
import random
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T
from tqdm.notebook import tqdm

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def get_dataloader(is_train, batch_size, slice=5):
    "Get a training or validation dataloader"
    full_dataset = torchvision.datasets.MNIST(root=".", train=is_train, transform=T.ToTensor(), download=True)
    sub_dataset = torch.utils.data.Subset(full_dataset, indices=range(0, len(full_dataset), slice))
    loader = torch.utils.data.DataLoader(dataset=sub_dataset,
                                         batch_size=batch_size,
                                         shuffle=True if is_train else False,
                                         pin_memory=True, num_workers=2)
    return loader

def get_model(dropout):
    "A simple model"
    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(28*28, 256),
                          nn.BatchNorm1d(256),
                          nn.ReLU(),
                          nn.Dropout(dropout),
                          nn.Linear(256, 10)).to(device)
    return model

def validate_model(model, valid_dl, loss_func, log_images=False, batch_idx=0):
    "Compute performance of the model on the validation dataset and log a wandb.Table"
    model.eval()
    val_loss = 0.
    with torch.inference_mode():
        correct = 0
        for i, (images, labels) in tqdm(enumerate(valid_dl), leave=False):
            images, labels = images.to(device), labels.to(device)

            # Forward pass ➡
            outputs = model(images)
            val_loss += loss_func(outputs, labels)*labels.size(0)

            # Compute accuracy and accumulate
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()

            # Log one batch of images to the dashboard, always same batch_idx.
            if i==batch_idx and log_images:
                log_image_table(images, predicted, labels, outputs.softmax(dim=1))
    return val_loss / len(valid_dl.dataset), correct / len(valid_dl.dataset)

def log_image_table(images, predicted, labels, probs):
    "Log a wandb.Table with (img, pred, target, scores)"
    # 🐝 Create a wandb Table to log images, labels and predictions to
    table = wandb.Table(columns=["image", "pred", "target"]+[f"score_{i}" for i in range(10)])
    for img, pred, targ, prob in zip(images.to("cpu"), predicted.to("cpu"), labels.to("cpu"), probs.to("cpu")):
        table.add_data(wandb.Image(img[0].numpy()*255), pred, targ, *prob.numpy())
    wandb.log({"predictions_table": table}, commit=False)
Training:
# Launch 5 experiments, trying different dropout rates
for i in range(5):
    # 🐝 initialise a wandb run
    wandb.init(
        project="wandbexample1",
        name="pytorch_example"+str(i),
        config={
            "epochs": 10,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": random.uniform(0.01, 0.80),
        })

    # Copy your config
    config = wandb.config

    # Get the data
    train_dl = get_dataloader(is_train=True, batch_size=config.batch_size)
    valid_dl = get_dataloader(is_train=False, batch_size=2*config.batch_size)
    n_steps_per_epoch = math.ceil(len(train_dl.dataset) / config.batch_size)

    # A simple MLP model
    model = get_model(config.dropout)

    # Make the loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)

    # Training
    example_ct = 0
    step_ct = 0
    for epoch in tqdm(range(config.epochs)):
        model.train()
        for step, (images, labels) in enumerate(tqdm(train_dl, leave=False)):
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            train_loss = loss_func(outputs, labels)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()

            example_ct += len(images)
            metrics = {"train/train_loss": train_loss,
                       "train/epoch": (step + 1 + (n_steps_per_epoch * epoch)) / n_steps_per_epoch,
                       "train/example_ct": example_ct}

            if step + 1 < n_steps_per_epoch:
                # 🐝 Log train metrics to wandb
                wandb.log(metrics)

            step_ct += 1

        val_loss, accuracy = validate_model(model, valid_dl, loss_func, log_images=(epoch==(config.epochs-1)))

        # 🐝 Log train and validation metrics to wandb
        val_metrics = {"val/val_loss": val_loss,
                       "val/val_accuracy": accuracy}
        wandb.log({**metrics, **val_metrics})

        print(f"Train Loss: {train_loss:.3f}, Valid Loss: {val_loss:.3f}, Accuracy: {accuracy:.2f}")

    # If you had a test set, this is how you could log it as a Summary metric
    wandb.summary['test_accuracy'] = 0.8

    # 🐝 Close your wandb run
    wandb.finish()
The wandb project home page:
Clicking into a single run:
**Jupyter notebook wandb alert example**
# Start a wandb run
wandb.init(project="wandbexample1")

# Simulating a model training loop
acc_threshold = 0.3
for training_step in range(1000):

    # Generate a random number for accuracy
    accuracy = round(random.random() + random.random(), 3)
    print(f'Accuracy is: {accuracy}, {acc_threshold}')

    # 🐝 Log accuracy to wandb
    wandb.log({"Accuracy": accuracy})

    # 🔔 If the accuracy is below the threshold, fire a W&B Alert and stop the run
    if accuracy <= acc_threshold:
        # 🐝 Send the wandb Alert
        wandb.alert(
            title='Low Accuracy',
            text=f'Accuracy {accuracy} at step {training_step} is below the acceptable threshold, {acc_threshold}',
        )
        print('Alert triggered')
        break

# Mark the run as finished (useful in Jupyter notebooks)
wandb.finish()
Here the accuracy came out as 0.155, below the 0.3 threshold, so an alert fired. Since I don't have Slack, it was delivered to my email:
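Alerts can also be given a severity level and a throttle interval. A minimal sketch, assuming your wandb version supports the level and wait_duration arguments of wandb.alert:
import wandb
from datetime import timedelta

wandb.init(project="wandbexample1")
wandb.alert(
    title='Low Accuracy',
    text='Accuracy dropped below the threshold',
    level=wandb.AlertLevel.WARN,          # assumed enum: INFO / WARN / ERROR
    wait_duration=timedelta(minutes=5),   # assumed: minimum gap between repeated alerts
)
wandb.finish()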
Hyperparameter tuning
Hyperparameter Tuning - Documentation
Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W&B.ipynb
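The linked notebook walks through sweeps in PyTorch; the typical flow is to define a sweep config, register it with wandb.sweep, and let wandb.agent call your training function repeatedly. A minimal sketch (the project name and search space here are made up for illustration):
import wandb

# Hypothetical search space; see the docs for the full config schema
sweep_config = {
    "method": "random",  # or "grid", "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-4, "max": 1e-1},
        "dropout": {"values": [0.1, 0.3, 0.5]},
    },
}

def train():
    # The agent injects the sampled hyperparameters into wandb.config
    wandb.init()
    config = wandb.config
    # ... build and train a model with config.lr / config.dropout ...
    wandb.log({"val_loss": 0.1})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="wandbexample1")
wandb.agent(sweep_id, function=train, count=5)  # launch 5 trials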
Collaborative reports
Collaborative Reports - Documentation
Tracking data and model versions in a pipeline
Data + Model Versioning - Documentation
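Versioning of datasets and models in wandb is built on artifacts: log an artifact in one run, then reference it by name and alias in later runs. A minimal sketch (the artifact and file names are made up):
import wandb

# Log a dataset version
run = wandb.init(project="wandbexample1", job_type="dataset-upload")
artifact = wandb.Artifact("mnist-subset", type="dataset")
artifact.add_file("data.csv")  # hypothetical local file
run.log_artifact(artifact)
run.finish()

# Consume the latest version in a later run
run = wandb.init(project="wandbexample1", job_type="training")
dataset = run.use_artifact("mnist-subset:latest")
data_dir = dataset.download()  # downloads the files and returns the local path
run.finish()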
Data visualization / metric logging
Data Visualization - Documentation
Configuration for automated deep learning platforms
Environment Variables - Documentation
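Such platforms usually configure wandb through environment variables set before the script starts. A sketch of a few commonly used ones (check the linked docs for the full list):
import os

# Set these before wandb.init is called
os.environ["WANDB_API_KEY"] = "YOUR_KEY_HERE"    # authenticate without the interactive login
os.environ["WANDB_PROJECT"] = "wandbexample1"    # default project for new runs
os.environ["WANDB_NAME"] = "env-configured-run"  # default run name
os.environ["WANDB_MODE"] = "offline"             # "online", "offline", or "disabled"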
Self-hosted solutions
I don't need this for now, but I'm listing it here for future reference:
Private Hosting - Documentation
Examples
Examples - Documentation
Example dashboard: wandb_example Workspace – Weights & Biases
Integrations
Importing and exporting data stored on wandb
Import & Export Data - Documentation
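Data already logged to wandb can be pulled back out through the public API. A minimal sketch, assuming a placeholder entity/project path of your own:
import wandb

api = wandb.Api()

# "my-team/wandbexample1" is a placeholder entity/project path
runs = api.runs("my-team/wandbexample1")
for run in runs:
    print(run.name, run.state)
    print(run.config)    # hyperparameters as a dict
    print(run.summary)   # final summary metrics
    df = run.history()   # per-step logged metrics as a pandas DataFrame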
References
https://zhuanlan.zhihu.com/p/522355820