Hyperparameter Optimization Tools: wandb / wandb (2)
Note: this section draws on https://zhuanlan.zhihu.com/p/522355820, which covers wandb comprehensively and also provides reference code.
wandb is a free tool for logging experiment data. Compared with tools like TensorBoard, wandb offers much richer user and team management features, which makes team collaboration easier. To use wandb, first create a team on the website, then create a project under that team; the detailed data of each experiment is then recorded under the project.
Installation
pip install wandb
Logging in
Register an account on the wandb website, then obtain the account's API key.
wandb login
(If your code uses wandb, it will prompt for login automatically, but that doesn't work when the script is detached with nohup, so it's better to log in ahead of time.)
The API key can be found at: https://wandb.ai/authorize
(Note that this configuration appears to be global: after logging in under one virtual environment, a fresh wandb install in another virtual environment still works without logging in again. Presumably the key is cached locally, e.g. in ~/.netrc.)
Use wandb login --relogin to force a re-login.
In a Jupyter notebook:
!pip install wandb
import wandb
wandb.login()
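If you want to avoid the interactive prompt altogether (e.g. for the nohup case above), wandb.login also accepts the key as an argument. A minimal sketch, assuming you have exported WANDB_API_KEY in the shell yourself:
import os
import wandb

# Assumption: WANDB_API_KEY was exported in the shell beforehand
wandb.login(key=os.environ["WANDB_API_KEY"])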
Basic usage
Start a new project, track metrics and hyperparameters, and add alerts
The flow is essentially: init a new project; log metrics with log (as the examples below show, metrics can be grouped, e.g. into train/ and val/); track hyperparameters with config; send anything that needs an alert with alert; write final results to summary; and call finish at the end. Quite convenient.
init docs: Launch Experiments with wandb.init - Documentation
Overview of the dashboard layout (run page): Run Page - Documentation
Data visualization / metric tracking: Data Visualization - Documentation
Hyperparameter tracking: Configure Experiments with wandb.config - Documentation
Alert docs: Send Alerts with wandb.alert - Documentation
When testing code, if you temporarily don't want to sync with wandb, set the environment variable WANDB_MODE=offline to switch wandb into offline mode (concretely, prepend it to the command that runs your script, e.g. WANDB_MODE=offline python train.py, where train.py is your script).
import wandb

wandb.init(project="my-awesome-project")  # other args: name (see example below), config (see below)
wandb.log({'accuracy': train_acc, 'loss': train_loss})
wandb.config.dropout = 0.2
wandb.alert(
    title="Low accuracy",
    text=f"Accuracy {acc} is below the acceptable threshold {thresh}"
)
wandb.summary['test_accuracy'] = test_acc
wandb.finish()
In a Jupyter notebook, a mock example to simulate several runs:
import random
import wandb

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
    # 🐝 1️⃣ Start a new run to track this script
    wandb.init(
        # Set the project where this run will be logged
        project="wandbexample1",
        # We pass a run name (otherwise it'll be randomly assigned, like sunshine-lollypop-10)
        name=f"experiment_{run}",
        # Track hyperparameters and run metadata
        config={
            "learning_rate": 0.02,
            "architecture": "CNN",
            "dataset": "CIFAR-100",
            "epochs": 10,
        })

    # This simple block simulates a training loop logging metrics
    epochs = 10
    offset = random.random() / 5
    for epoch in range(2, epochs):
        acc = 1 - 2 ** -epoch - random.random() / epoch - offset
        loss = 2 ** -epoch + random.random() / epoch + offset
        # 🐝 2️⃣ Log metrics from your script to W&B
        wandb.log({"acc": acc, "loss": loss})

    # Mark the run as finished
    wandb.finish()
The output includes a link to the wandb project; open it in a browser to view the runs.
What gets recorded:
Saving data offline and syncing later
Set init to offline mode, controlled mainly through environment variables:
- WANDB_API_KEY=$KEY
- WANDB_MODE="offline"
import wandb
import os

os.environ["WANDB_API_KEY"] = "YOUR_KEY_HERE"
os.environ["WANDB_MODE"] = "offline"

config = {
    "dataset": "CIFAR10",
    "machine": "offline cluster",
    "model": "CNN",
    "learning_rate": 0.01,
    "batch_size": 128,
}

wandb.init(project="offline-demo", config=config)

for i in range(100):
    wandb.log({"accuracy": i})
Sync it later with: wandb sync wandb/dryrun-folder-name
MNIST example
In a Jupyter notebook: an MNIST classifier
import wandb
import math
import random
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T
from tqdm.notebook import tqdm

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def get_dataloader(is_train, batch_size, slice=5):
    "Get a training or validation dataloader"
    full_dataset = torchvision.datasets.MNIST(root=".", train=is_train, transform=T.ToTensor(), download=True)
    sub_dataset = torch.utils.data.Subset(full_dataset, indices=range(0, len(full_dataset), slice))
    loader = torch.utils.data.DataLoader(dataset=sub_dataset,
                                         batch_size=batch_size,
                                         shuffle=True if is_train else False,
                                         pin_memory=True, num_workers=2)
    return loader

def get_model(dropout):
    "A simple model"
    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(28*28, 256),
                          nn.BatchNorm1d(256),
                          nn.ReLU(),
                          nn.Dropout(dropout),
                          nn.Linear(256, 10)).to(device)
    return model

def validate_model(model, valid_dl, loss_func, log_images=False, batch_idx=0):
    "Compute performance of the model on the validation dataset and log a wandb.Table"
    model.eval()
    val_loss = 0.
    with torch.inference_mode():
        correct = 0
        for i, (images, labels) in tqdm(enumerate(valid_dl), leave=False):
            images, labels = images.to(device), labels.to(device)

            # Forward pass ➡
            outputs = model(images)
            val_loss += loss_func(outputs, labels)*labels.size(0)

            # Compute accuracy and accumulate
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()

            # Log one batch of images to the dashboard, always same batch_idx.
            if i==batch_idx and log_images:
                log_image_table(images, predicted, labels, outputs.softmax(dim=1))
    return val_loss / len(valid_dl.dataset), correct / len(valid_dl.dataset)

def log_image_table(images, predicted, labels, probs):
    "Log a wandb.Table with (img, pred, target, scores)"
    # 🐝 Create a wandb Table to log images, labels and predictions to
    table = wandb.Table(columns=["image", "pred", "target"]+[f"score_{i}" for i in range(10)])
    for img, pred, targ, prob in zip(images.to("cpu"), predicted.to("cpu"), labels.to("cpu"), probs.to("cpu")):
        table.add_data(wandb.Image(img[0].numpy()*255), pred, targ, *prob.numpy())
    wandb.log({"predictions_table": table}, commit=False)
Training:
# Launch 5 experiments, trying different dropout rates
for i in range(5):
    # 🐝 initialise a wandb run
    wandb.init(
        project="wandbexample1",
        name="pytorch_example"+str(i),
        config={
            "epochs": 10,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": random.uniform(0.01, 0.80),
        })

    # Copy your config
    config = wandb.config

    # Get the data
    train_dl = get_dataloader(is_train=True, batch_size=config.batch_size)
    valid_dl = get_dataloader(is_train=False, batch_size=2*config.batch_size)
    n_steps_per_epoch = math.ceil(len(train_dl.dataset) / config.batch_size)

    # A simple MLP model
    model = get_model(config.dropout)

    # Make the loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)

    # Training
    example_ct = 0
    step_ct = 0
    for epoch in tqdm(range(config.epochs)):
        model.train()
        for step, (images, labels) in enumerate(tqdm(train_dl, leave=False)):
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            train_loss = loss_func(outputs, labels)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()

            example_ct += len(images)
            metrics = {"train/train_loss": train_loss,
                       "train/epoch": (step + 1 + (n_steps_per_epoch * epoch)) / n_steps_per_epoch,
                       "train/example_ct": example_ct}

            if step + 1 < n_steps_per_epoch:
                # 🐝 Log train metrics to wandb
                wandb.log(metrics)

            step_ct += 1

        val_loss, accuracy = validate_model(model, valid_dl, loss_func, log_images=(epoch==(config.epochs-1)))

        # 🐝 Log train and validation metrics to wandb
        val_metrics = {"val/val_loss": val_loss,
                       "val/val_accuracy": accuracy}
        wandb.log({**metrics, **val_metrics})

        print(f"Train Loss: {train_loss:.3f}, Valid Loss: {val_loss:.3f}, Accuracy: {accuracy:.2f}")

    # If you had a test set, this is how you could log it as a Summary metric
    wandb.summary['test_accuracy'] = 0.8

    # 🐝 Close your wandb run
    wandb.finish()
The wandb project home page:
Clicking into a single run:
**Jupyter notebook wandb alert example**
# Start a wandb run
wandb.init(project="wandbexample1")

# Simulating a model training loop
acc_threshold = 0.3
for training_step in range(1000):

    # Generate a random number for accuracy
    accuracy = round(random.random() + random.random(), 3)
    print(f'Accuracy is: {accuracy}, {acc_threshold}')

    # 🐝 Log accuracy to wandb
    wandb.log({"Accuracy": accuracy})

    # 🔔 If the accuracy is below the threshold, fire a W&B Alert and stop the run
    if accuracy <= acc_threshold:
        # 🐝 Send the wandb Alert
        wandb.alert(
            title='Low Accuracy',
            text=f'Accuracy {accuracy} at step {training_step} is below the acceptable threshold, {acc_threshold}',
        )
        print('Alert triggered')
        break

# Mark the run as finished (useful in Jupyter notebooks)
wandb.finish()
Here the accuracy came out as 0.155, below the 0.3 threshold, so an alert fired. Since I don't have Slack, it was delivered to my email:
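Alerts can also be given a severity level and a throttle interval. A minimal sketch, assuming your wandb version supports the level and wait_duration arguments of wandb.alert:
import wandb
from datetime import timedelta

wandb.init(project="wandbexample1")
wandb.alert(
    title='Low Accuracy',
    text='Accuracy dropped below the threshold',
    level=wandb.AlertLevel.WARN,          # assumed enum: INFO / WARN / ERROR
    wait_duration=timedelta(minutes=5),   # assumed: minimum gap between repeated alerts
)
wandb.finish()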
Hyperparameter tuning
Hyperparameter Tuning - Documentation
Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W&B.ipynb
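The linked notebook walks through sweeps in PyTorch; the typical flow is to define a sweep config, register it with wandb.sweep, and let wandb.agent call your training function repeatedly. A minimal sketch (the project name and search space here are made up for illustration):
import wandb

# Hypothetical search space; see the docs for the full config schema
sweep_config = {
    "method": "random",  # or "grid", "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-4, "max": 1e-1},
        "dropout": {"values": [0.1, 0.3, 0.5]},
    },
}

def train():
    # The agent injects the sampled hyperparameters into wandb.config
    wandb.init()
    config = wandb.config
    # ... build and train a model with config.lr / config.dropout ...
    wandb.log({"val_loss": 0.1})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="wandbexample1")
wandb.agent(sweep_id, function=train, count=5)  # launch 5 trials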
Collaborative reports
Collaborative Reports - Documentation
Tracking data and model versions in a pipeline
Data + Model Versioning - Documentation
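Versioning of datasets and models in wandb is built on artifacts: log an artifact in one run, then reference it by name and alias in later runs. A minimal sketch (the artifact and file names are made up):
import wandb

# Log a dataset version
run = wandb.init(project="wandbexample1", job_type="dataset-upload")
artifact = wandb.Artifact("mnist-subset", type="dataset")
artifact.add_file("data.csv")  # hypothetical local file
run.log_artifact(artifact)
run.finish()

# Consume the latest version in a later run
run = wandb.init(project="wandbexample1", job_type="training")
dataset = run.use_artifact("mnist-subset:latest")
data_dir = dataset.download()  # downloads the files and returns the local path
run.finish()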
Data visualization / metric logging
Data Visualization - Documentation
Configuration for automated deep learning platforms
Environment Variables - Documentation
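Such platforms usually configure wandb through environment variables set before the script starts. A sketch of a few commonly used ones (check the linked docs for the full list):
import os

# Set these before wandb.init is called
os.environ["WANDB_API_KEY"] = "YOUR_KEY_HERE"    # authenticate without the interactive login
os.environ["WANDB_PROJECT"] = "wandbexample1"    # default project for new runs
os.environ["WANDB_NAME"] = "env-configured-run"  # default run name
os.environ["WANDB_MODE"] = "offline"             # "online", "offline", or "disabled"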
Self-hosted solutions
I don't need this for now, but I'm listing it here for future reference:
Private Hosting - Documentation
Examples
Examples - Documentation
Example dashboard: wandb_example Workspace – Weights & Biases
Integrations
Importing and exporting data stored on wandb
Import & Export Data - Documentation
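Data already logged to wandb can be pulled back out through the public API. A minimal sketch, assuming a placeholder entity/project path of your own:
import wandb

api = wandb.Api()

# "my-team/wandbexample1" is a placeholder entity/project path
runs = api.runs("my-team/wandbexample1")
for run in runs:
    print(run.name, run.state)
    print(run.config)    # hyperparameters as a dict
    print(run.summary)   # final summary metrics
    df = run.history()   # per-step logged metrics as a pandas DataFrame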
References
https://zhuanlan.zhihu.com/p/522355820