一、About PyTorch Lightning
PyTorch Lightning is a deep learning framework for pretraining, fine-tuning, and deploying AI models.
NEW: Deploying models? Check out LitServe, the PyTorch Lightning for model serving.
- GitHub: https://github.com/Lightning-AI/pytorch-lightning
- Website: https://lightning.ai/
- Docs: https://lightning.ai/docs
Lightning has two core packages:
PyTorch Lightning: train and deploy PyTorch at scale.
Lightning Fabric: expert control.
Lightning gives you granular control over how much abstraction you want to add on top of PyTorch.
二、Quick Start
Install Lightning:
pip install lightning
Advanced install options
Install with optional dependencies:
pip install 'lightning[extra]'
Conda:
conda install lightning -c conda-forge
Install stable version
Install a future release from source:
pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/release/stable.zip -U
Install bleeding-edge
Install nightly from source (no guarantees):
pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U
Or from Test PyPI:
pip install -U -i https://test.pypi.org/simple/ pytorch-lightning
PyTorch Lightning example
Define the training workflow. Here's a toy example (explore real examples):
# main.py
# ! pip install torchvision
import torch, torch.nn as nn, torch.utils.data as data, torchvision as tv, torch.nn.functional as F
import lightning as L

# --------------------------------
# Step 1: Define a LightningModule
# --------------------------------
# A LightningModule (nn.Module subclass) defines a full *system*
# (ie: an LLM, diffusion model, autoencoder, or simple image classifier).
class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop. It is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# -------------------
# Step 2: Define data
# -------------------
dataset = tv.datasets.MNIST(".", download=True, transform=tv.transforms.ToTensor())
train, val = data.random_split(dataset, [55000, 5000])

# -------------------
# Step 3: Train
# -------------------
autoencoder = LitAutoEncoder()
trainer = L.Trainer()
trainer.fit(autoencoder, data.DataLoader(train), data.DataLoader(val))
Run the model on your terminal:
pip install torchvision
python main.py
三、Why PyTorch Lightning?
PyTorch Lightning is just organized PyTorch: Lightning disentangles PyTorch code to decouple the science from the engineering.
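The decoupling of science from engineering can be sketched in plain Python (a toy illustration with made-up names, not Lightning's actual API): the module class owns only the research code (how to compute a loss for one batch), while a reusable trainer owns the loop.

```python
class ToyModule:
    """The "science": defines how to compute a loss for one batch."""
    def __init__(self):
        self.weight = 0.0

    def training_step(self, batch):
        # toy "loss": squared distance between the weight and the target
        return (self.weight - batch) ** 2

class ToyTrainer:
    """The "engineering": owns the training loop itself."""
    def fit(self, module, batches, lr=0.1):
        losses = []
        for batch in batches:
            loss = module.training_step(batch)
            # hand-derived gradient of (w - b)^2 w.r.t. w is 2*(w - b)
            module.weight -= lr * 2 * (module.weight - batch)
            losses.append(loss)
        return losses

losses = ToyTrainer().fit(ToyModule(), [1.0] * 20)
print(losses[0] > losses[-1])  # loss decreases: True
```

Swapping hardware, precision, or logging then only touches the trainer, never the research code.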
Examples
Explore the kinds of training you can do with PyTorch Lightning. Pretrain and finetune any kind of model for any task, such as classification, segmentation, summarization, and more:
Task | Description | Run |
---|---|---|
Hello world | Pretrain - Hello world example | Open in Studio |
Image classification | Finetune - ResNet-34 model to classify images of cars | Open in Studio |
Image segmentation | Finetune - ResNet-50 model to segment images | Open in Studio |
Object detection | Finetune - Faster R-CNN model to detect objects | Open in Studio |
Text classification | Finetune - text classifier (BERT model) | Open in Studio |
Text summarization | Finetune - text summarization (Hugging Face Transformers model) | |
Audio generation | Finetune - audio generator (Transformers model) | Open in Studio |
LLM finetuning | Finetune - LLM (Meta Llama 3.1 8B) | Open in Studio |
Image generation | Pretrain - image generator (diffusion model) | Open in Studio |
Recommendation system | Train - recommendation system (factorization and embedding) | Open in Studio |
Time series forecasting | | Open in Studio |
Advanced features
Lightning has over 40 advanced features designed for professional AI research at scale.
Here are some examples:
Train on 1000s of GPUs without code changes
# 8 GPUs
# no code changes needed
trainer = Trainer(accelerator="gpu", devices=8)
# 256 GPUs
trainer = Trainer(accelerator="gpu", devices=8, num_nodes=32)
Train on other accelerators like TPUs without code changes
# no code changes needed
trainer = Trainer(accelerator="tpu", devices=8)
16-bit precision
# no code changes needed
trainer = Trainer(precision=16)
Experiment managers
from lightning.pytorch import loggers
# tensorboard
trainer = Trainer(logger=loggers.TensorBoardLogger("logs/"))
# weights and biases
trainer = Trainer(logger=loggers.WandbLogger())
# comet
trainer = Trainer(logger=loggers.CometLogger())
# mlflow
trainer = Trainer(logger=loggers.MLFlowLogger())
# neptune
trainer = Trainer(logger=loggers.NeptuneLogger())
# ... and dozens more
Early Stopping
from lightning.pytorch.callbacks import EarlyStopping
es = EarlyStopping(monitor="val_loss")
trainer = Trainer(callbacks=[es])
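The monitor/patience idea behind early stopping can be shown in plain Python (a conceptual sketch, not Lightning's EarlyStopping implementation): stop once the monitored value has failed to improve for `patience` consecutive checks.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index at which training would stop, or None if it never stops."""
    best = float("inf")
    bad_checks = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss       # improvement: remember it and reset the counter
            bad_checks = 0
        else:
            bad_checks += 1   # no improvement at this check
            if bad_checks >= patience:
                return i
    return None

# improves for 3 epochs, then plateaus -> stops after 3 non-improving checks
print(early_stop_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73]))  # 5
```

Lightning's callback adds details on top (which metric, check frequency, stopping thresholds), but this is the core loop it runs.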
Checkpointing
from lightning.pytorch.callbacks import ModelCheckpoint
checkpointing = ModelCheckpoint(monitor="val_loss")
trainer = Trainer(callbacks=[checkpointing])
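What `monitor` buys you here, in a plain-Python sketch (not ModelCheckpoint's actual code): keep only the state that achieved the best monitored metric, overwriting it whenever a better one appears.

```python
def track_best(states_and_losses):
    """states_and_losses: iterable of (state, val_loss); return the best pair."""
    best_state, best_loss = None, float("inf")
    for state, loss in states_and_losses:
        if loss < best_loss:
            best_state, best_loss = state, loss  # "save checkpoint" on improvement
    return best_state, best_loss

print(track_best([("epoch0", 1.2), ("epoch1", 0.7), ("epoch2", 0.9)]))
# ('epoch1', 0.7)
```

The real callback also handles serialization, `save_top_k`, filenames, and resuming, but this is the selection rule.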
Export to torchscript (JIT) (production use)
# torchscript
autoencoder = LitAutoEncoder()
torch.jit.save(autoencoder.to_torchscript(), "model.pt")
Export to ONNX (production use)
# onnx
import os, tempfile
with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as tmpfile:
    autoencoder = LitAutoEncoder()
    input_sample = torch.randn((1, 28 * 28))
    autoencoder.to_onnx(tmpfile.name, input_sample, export_params=True)
    os.path.isfile(tmpfile.name)
Advantages over unstructured PyTorch
- Models become hardware-agnostic
- Code is clear to read because engineering code is abstracted away
- Easier to reproduce
- Make fewer mistakes, because Lightning handles the tricky engineering
- Keeps all the flexibility (LightningModules are still PyTorch modules), but removes a ton of boilerplate
- Lightning has dozens of integrations with popular machine learning tools
- Rigorously tested with every new PR. We test every combination of supported PyTorch and Python versions, every OS, multi-GPU and even TPU
- Minimal running speed overhead (about 300 ms per epoch compared with pure PyTorch)
四、Lightning Fabric: Expert control
Run on any device at any scale with expert-level control over the PyTorch training loop and scaling strategy. You can even write your own Trainer.
Fabric is designed for the most complex models, like foundation model scaling, LLMs, diffusion, transformers, reinforcement learning, and active learning. Of any size.
What to change
+ import lightning as L
  import torch; import torchvision as tv

  dataset = tv.datasets.CIFAR10("data", download=True,
                                train=True,
                                transform=tv.transforms.ToTensor())

+ fabric = L.Fabric()
+ fabric.launch()

  model = tv.models.resnet18()
  optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model.to(device)
+ model, optimizer = fabric.setup(model, optimizer)

  dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
+ dataloader = fabric.setup_dataloaders(dataloader)

  model.train()
  num_epochs = 10
  for epoch in range(num_epochs):
      for batch in dataloader:
          inputs, labels = batch
-         inputs, labels = inputs.to(device), labels.to(device)
          optimizer.zero_grad()
          outputs = model(inputs)
          loss = torch.nn.functional.cross_entropy(outputs, labels)
-         loss.backward()
+         fabric.backward(loss)
          optimizer.step()
          print(loss.data)
Resulting Fabric code (copy me!)
import lightning as L
import torch; import torchvision as tv

dataset = tv.datasets.CIFAR10("data", download=True,
                              train=True,
                              transform=tv.transforms.ToTensor())

fabric = L.Fabric()
fabric.launch()

model = tv.models.resnet18()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model, optimizer = fabric.setup(model, optimizer)

dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
dataloader = fabric.setup_dataloaders(dataloader)

model.train()
num_epochs = 10
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs, labels = batch
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        fabric.backward(loss)
        optimizer.step()
        print(loss.data)
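In multi-process runs, `fabric.backward(loss)` is also where gradient synchronization hooks in. Conceptually (a plain-Python sketch of DDP-style averaging semantics, not Fabric's actual implementation), each parameter's gradient is averaged across workers so every replica applies the same update:

```python
def allreduce_mean(per_worker_grads):
    """per_worker_grads: one gradient list per worker; return the element-wise mean."""
    n_workers = len(per_worker_grads)
    # zip(*...) pairs up the i-th gradient from every worker
    return [sum(grads) / n_workers for grads in zip(*per_worker_grads)]

# two workers, two parameters
print(allreduce_mean([[0.5, -0.5], [0.25, 0.0]]))  # [0.375, -0.25]
```

The real collective runs over NCCL/Gloo on tensors, but the math per parameter is exactly this mean.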
Key features
Easily switch from running on CPU to GPU (Apple Silicon, CUDA, ...), TPU, multi-GPU, or even multi-node training
# Use your available hardware
# no code changes needed
fabric = Fabric()
# Run on GPUs (CUDA or MPS)
fabric = Fabric(accelerator="gpu")
# 8 GPUs
fabric = Fabric(accelerator="gpu", devices=8)
# 256 GPUs, multi-node
fabric = Fabric(accelerator="gpu", devices=8, num_nodes=32)
# Run on TPUs
fabric = Fabric(accelerator="tpu")
Use state-of-the-art distributed training strategies (DDP, FSDP, DeepSpeed) and mixed precision out of the box
# Use state-of-the-art distributed training techniques
fabric = Fabric(strategy="ddp")
fabric = Fabric(strategy="deepspeed")
fabric = Fabric(strategy="fsdp")
# Switch the precision
fabric = Fabric(precision="16-mixed")
fabric = Fabric(precision="64")
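As a side note on why the precision switch matters: 16-bit floats trade range and resolution for speed and memory. This can be seen with the standard library alone, since `struct`'s `"e"` format is IEEE 754 binary16, the storage format behind 16-bit training:

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest binary16 value via pack/unpack."""
    return struct.unpack("e", struct.pack("e", x))[0]

print(to_fp16(0.1))     # 0.0999755859375 -> only ~3 decimal digits survive
print(to_fp16(1000.1))  # 1000.0 -> at this magnitude, the spacing between values is 0.5
```

Mixed-precision modes like "16-mixed" exist precisely to get the speed of binary16 while keeping master weights and sensitive accumulations in full precision.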
All the device logic boilerplate is handled for you
# no more of this!
- model.to(device)
- batch.to(device)
Build your own custom Trainer using Fabric primitives for training, checkpointing, logging, and more
import lightning as L

class MyCustomTrainer:
    def __init__(self, accelerator="auto", strategy="auto", devices="auto", precision="32-true"):
        self.fabric = L.Fabric(accelerator=accelerator, strategy=strategy, devices=devices, precision=precision)

    def fit(self, model, optimizer, dataloader, max_epochs):
        self.fabric.launch()
        model, optimizer = self.fabric.setup(model, optimizer)
        dataloader = self.fabric.setup_dataloaders(dataloader)
        model.train()

        for epoch in range(max_epochs):
            for batch in dataloader:
                input, target = batch
                optimizer.zero_grad()
                output = model(input)
                # loss_fn: any loss of your choosing, e.g. torch.nn.functional.cross_entropy
                loss = loss_fn(output, target)
                self.fabric.backward(loss)
                optimizer.step()
You can find a more extensive example in our examples.
Examples
Self-supervised Learning
Convolutional Architectures
Reinforcement Learning
GANs
Classic ML
Continuous integration
Lightning is rigorously tested across multiple CPUs, GPUs and TPUs, and against major Python and PyTorch versions.
*Codecov is >90%, but build delays may show less
2025-01-19 (Sun)