Fast Transformer Inference with Better Transformer

Author: Michael Gschwind

This tutorial introduces Better Transformer (BT) as part of the PyTorch 1.12 release. In this tutorial, we show how to use Better Transformer for production inference with torchtext. Better Transformer is a production-ready fastpath to accelerate deployment of Transformer models with high performance on CPU and GPU. The fastpath feature works transparently for models built either directly on the PyTorch core nn.Module classes or with torchtext.

Models which can be accelerated by Better Transformer fastpath execution are those using the following PyTorch core torch.nn classes: TransformerEncoder, TransformerEncoderLayer, and MultiheadAttention. In addition, torchtext has been updated to use these core library modules to benefit from fastpath acceleration. (Additional modules may be enabled with fastpath execution in the future.)

Better Transformer offers two types of acceleration:

  • Native multihead attention (MHA) implementation for CPU and GPU to improve overall execution efficiency.
  • Exploiting sparsity in NLP inference. Because of variable input lengths, input tokens may contain a large number of padding tokens for which processing may be skipped, delivering significant speedups.

Fastpath execution is subject to some criteria. Most importantly, the model must be executed in inference mode and operate on input tensors that do not collect gradient tape information (e.g., running with torch.no_grad).
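
To make these criteria concrete, here is a minimal, self-contained sketch (not part of this tutorial's XLM-R pipeline; the dimensions and padding mask are illustrative assumptions) that builds a fastpath-eligible encoder from the core modules named above and runs it in inference mode without gradient tracking:

import torch
import torch.nn as nn

# Build a fastpath-eligible stack from the core modules named above.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
encoder.eval()  # inference mode is required for fastpath execution

x = torch.rand(2, 16, 512)  # (batch, sequence, embedding); sizes are illustrative
padding_mask = torch.zeros(2, 16, dtype=torch.bool)
padding_mask[1, 8:] = True  # mark the trailing half of the second sequence as padding

with torch.no_grad():  # no gradient tape, so the fastpath can be taken
    y = encoder(x, src_key_padding_mask=padding_mask)
print(y.shape)  # torch.Size([2, 16, 512])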

To follow this example in Google Colab, click here.

Better Transformer Features in This Tutorial

  • Load pretrained models (created before PyTorch version 1.12 without Better Transformer)
  • Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
  • Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)
  • Enable sparsity support
  • Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

Additional Information

Additional information about Better Transformer may be found in the PyTorch.Org blog A Better Transformer for Fast Transformer Inference.

1. Setup

1.1 Load pretrained models

We download the XLM-R model from the predefined torchtext models by following the instructions in torchtext.models. We also set the DEVICE to execute on-accelerator tests. (Enable GPU execution for your environment as appropriate.)

import torch
import torch.nn as nn

print(f"torch version: {torch.__version__}")

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

import torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

# Download the pretrained XLM-R large encoder and attach a two-class classification head.
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
classifier_head = RobertaClassificationHead(num_classes=2, input_dim=1024)
model = xlmr_large.get_model(head=classifier_head)
transform = xlmr_large.transform()

1.2 Dataset Setup

We set up two types of inputs: a small input batch and a big input batch with sparsity.

small_input_batch = [
               "Hello world",
               "How are you!"
]
big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]

Next, we select either the small or large input batch, preprocess the inputs and test the model.

input_batch = big_input_batch

# Tokenize, pad all sequences to a common length with padding value 1, and convert to a tensor.
model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape
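
As a quick illustrative check (not part of the original tutorial), you can measure how much of the padded batch consists of padding tokens; these are the positions that BT sparsity acceleration can skip. This assumes the padding value of 1 set above is not used by regular tokens:

# Fraction of positions in the padded batch that are padding (padding_value=1 above).
padding_fraction = (model_input == 1).float().mean().item()
print(f"fraction of padding tokens: {padding_fraction:.2%}")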

Finally, we set the benchmark iteration count:

ITERATIONS=10

2. Execution

2.1 Run and benchmark inference on CPU with and without BT fastpath (native MHA only)

We run the model on CPU, and collect profile information:

  • The first run uses traditional (“slow path”) execution.
  • The second run enables BT fastpath execution by putting the model in inference mode using model.eval() and disabling gradient collection with torch.no_grad().

You can see an improvement (whose magnitude will depend on the CPU model) when the model is executing on CPU. Notice that the fastpath profile shows most of the execution time in the native TransformerEncoderLayer implementation aten::_transformer_encoder_layer_fwd.
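
If the full profile printout is too verbose, a sorted summary makes the dominant operator easy to spot. This is a small optional addition using the profiler's key_averages API, not part of the original tutorial; run it after either profiling loop below:

# Summarize the profile collected in `prof`, sorted by self CPU time;
# on the fastpath run, aten::_transformer_encoder_layer_fwd should dominate.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))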

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

2.2 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)

We check the BT sparsity setting:

model.encoder.transformer.layers.enable_nested_tensor

We disable BT sparsity:

model.encoder.transformer.layers.enable_nested_tensor = False

We run the model on DEVICE, and collect profile information for native MHA execution on DEVICE:

  • The first run uses traditional (“slow path”) execution.
  • The second run enables BT fastpath execution by putting the model in inference mode using model.eval() and disabling gradient collection with torch.no_grad().

When executing on a GPU, you should see a significant speedup, in particular for the small input batch setting:

model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

2.3 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

We enable sparsity support:

model.encoder.transformer.layers.enable_nested_tensor = True

We run the model on DEVICE, and collect profile information for native MHA and sparsity support execution on DEVICE:

  • The first run uses traditional (“slow path”) execution.
  • The second run enables BT fastpath execution by putting the model in inference mode using model.eval() and disabling gradient collection with torch.no_grad().

When executing on a GPU, you should see a significant speedup, in particular for the large input batch setting which includes sparsity:

model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

Summary

In this tutorial, we have introduced fast transformer inference with Better Transformer fastpath execution in torchtext, using the PyTorch core Better Transformer support for Transformer Encoder models. We demonstrated the use of Better Transformer with models trained prior to the availability of BT fastpath execution, and we demonstrated and benchmarked both BT fastpath execution modes: native MHA execution and BT sparsity acceleration.
