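This article uses about 100 lines of code to give a minimal introduction to a fairly standard quantization workflow, covering how to use it, where to use it, and where it cannot be used.

Many of the quantization articles online either run an incomplete official example or rely on APIs so old that they are long out of fashion. The more current approach is to quantize with torch.fx. The point here is not to tell you what torch.fx is good for, since everyone already knows that; the open questions are how to use it, where to use it, and where it cannot be used. The 100 lines in this article are small but complete: whatever model you are quantizing, just apply the same recipe, and if something goes wrong, blame me.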

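Many older articles still insert stubs by hand to create quantization nodes, which is like sending messages by carrier pigeon in the 21st century. This article covers the following points in full:

  • how fx inserts the quantization nodes; don't be intimidated, it is a single line of code;
  • how to save the quantized model's weights to disk;
  • how to load the quantized weights back;
  • how to do calibration, and how much difference it makes;
  • whether fx has any limitations.

All of these questions are covered below.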
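On the last bullet, one limitation is easy to demonstrate. prepare_fx works on a symbolically traced graph, so a forward pass that branches on tensor values generally cannot be traced as written. The sketch below uses two toy modules; `StaticNet`, `DynamicNet` and the `is_traceable` helper are names made up for illustration and are not part of the code in this article.

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace


class StaticNet(nn.Module):
    def forward(self, x):
        # Same operations for every input: fx can record this as a graph.
        return torch.relu(x) + 1


class DynamicNet(nn.Module):
    def forward(self, x):
        # Branching on a tensor value: the tracer cannot know which path to record.
        if x.sum() > 0:
            return x + 1
        return x - 1


def is_traceable(module):
    """Return True if torch.fx can symbolically trace the module."""
    try:
        symbolic_trace(module)
        return True
    except Exception as exc:  # torch.fx reports a trace error for DynamicNet
        print(f"{type(module).__name__} is not traceable: {exc}")
        return False
```

A model that fails this check usually needs the data-dependent part rewritten or excluded from tracing before fx quantization can be applied to it.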

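Quantization background

Thirty thousand words are omitted here; please look the details up on Baidu. There is not much worth covering.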

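The current state of quantization

If you ask me what the best quantization tool is right now, my answer is that there isn't one. Really. Whether it is nni, NVIDIA's pytorch_quantization, nncf, and so on: it is not that any one of them is bad, it is that, as the meme goes, everyone in the room is rubbish.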

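Essentially these tools all do the same thing, at least from a quantization point of view, but in the end none of them is general-purpose. When you see that a model saved by pytorch_quantization is the same size as the float32 one, you start to question your life: what on earth is that? It is like an ordinary customer asking for a medium cup and being told that it is called a large.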
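For a sense of what size to expect, here is a back-of-the-envelope estimate. It assumes the ResNet-18 with a 10-class head used later in this article, counts only the learnable parameters, and ignores scales, zero points and file metadata, so real files will differ somewhat. The helper name `estimate_weight_mb` is made up for this sketch.

```python
def estimate_weight_mb(num_params, bits_per_weight):
    """Rough size of the weights alone, in MB (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# torchvision's ResNet-18 with the 1000-class ImageNet head.
RESNET18_IMAGENET_PARAMS = 11_689_512
fc_1000 = 512 * 1000 + 1000  # original fully connected layer (weights + bias)
fc_10 = 512 * 10 + 10        # replacement 10-class layer
num_params = RESNET18_IMAGENET_PARAMS - fc_1000 + fc_10

fp32_mb = estimate_weight_mb(num_params, 32)
int8_mb = estimate_weight_mb(num_params, 8)
print(f"float32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

If a saved "quantized" checkpoint is close to the float32 figure rather than roughly a quarter of it, the weights are most likely still being stored as float32.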

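When the existing wheels are no good, you have to build your own. All I can say is that torch.fx is the GOAT (yyds): everyone who uses it likes it, and you only find out by using it.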

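100 lines of code

Talk is cheap, so let's go straight to the code. Note that it is best to use the latest stable release for torch.fx, since the API in older releases may differ; I tested with `1.11`.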
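The version caveat matters in the other direction too. In releases from about 1.13 onward, prepare_fx expects an `example_inputs` tuple and a qconfig mapping, so the 1.11-style call used in this article may need adjusting. The wrapper below is a sketch of one way to handle both; `prepare_fx_compat` and `torch_version_tuple` are hypothetical helper names, and the 1.13 cut-off should be checked against the release notes for your install.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx


def torch_version_tuple():
    """Turn a version string such as '2.1.0+cu118' into (2, 1)."""
    major, minor = torch.__version__.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def prepare_fx_compat(model, example_inputs, backend="fbgemm"):
    """Insert observers with whichever prepare_fx signature this release expects."""
    model = model.eval()
    if torch_version_tuple() >= (1, 13):
        # Newer API: qconfig mapping object plus a tuple of example inputs.
        from torch.ao.quantization import get_default_qconfig_mapping

        qconfig_mapping = get_default_qconfig_mapping(backend)
        return prepare_fx(model, qconfig_mapping, example_inputs)
    # Older API, as used in this article with 1.11: a plain qconfig dict.
    return prepare_fx(model, {"": get_default_qconfig(backend)})
```

With this in place, `prepare_fx_compat(model, (sample_batch,))` can stand in for the `prepare_fx(model, qconfig_dict)` calls below; calibration and `convert_fx` stay the same.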

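We cannot calibrate PyTorch's built-in ImageNet models because we do not have the ImageNet data, so we use the smaller CIFAR10 instead. There is nothing to download by hand, since torchvision fetches it automatically, but it does mean we have to fine-tune the model ourselves.

First, get the fine-tuning code ready.

It only serves to fine-tune the model that we are going to quantize and calibrate: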

import torch
import torch.nn as nn
import torch.nn.functional as F
import copy
import torchvision
from torchvision import transforms
from torchvision.models.resnet import resnet50, resnet18
from torch.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization.fx.graph_module import ObservedGraphModule
from torch.quantization import (
    get_default_qconfig,
)
from torch import optim
import os
import time


def train_model(model, train_loader, test_loader, device):
    # The training configurations were not carefully selected.
    learning_rate = 1e-2
    num_epochs = 20
    criterion = nn.CrossEntropyLoss()
    model.to(device)
    # It seems that SGD optimizer is better than Adam optimizer for ResNet18 training on CIFAR10.
    optimizer = optim.SGD(
        model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-5
    )
    # optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
    for epoch in range(num_epochs):
        # Training
        model.train()

        running_loss = 0
        running_corrects = 0

        for inputs, labels in train_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # statistics
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

        train_loss = running_loss / len(train_loader.dataset)
        train_accuracy = running_corrects / len(train_loader.dataset)

        # Evaluation
        model.eval()
        eval_loss, eval_accuracy = evaluate_model(
            model=model, test_loader=test_loader, device=device, criterion=criterion
        )
        print(
            "Epoch: {:02d} Train Loss: {:.3f} Train Acc: {:.3f} Eval Loss: {:.3f} Eval Acc: {:.3f}".format(
                epoch, train_loss, train_accuracy, eval_loss, eval_accuracy
            )
        )
    return model

def prepare_dataloader(num_workers=8, train_batch_size=128, eval_batch_size=256):
    train_transform = transforms.Compose(
        [
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    )
    test_transform = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    )
    train_set = torchvision.datasets.CIFAR10(
        root="data", train=True, download=True, transform=train_transform
    )
    # We will use test set for validation and test in this project.
    # Do not use test set for validation in practice!
    test_set = torchvision.datasets.CIFAR10(
        root="data", train=False, download=True, transform=test_transform
    )
    train_sampler = torch.utils.data.RandomSampler(train_set)
    test_sampler = torch.utils.data.SequentialSampler(test_set)

    train_loader = torch.utils.data.DataLoader(
        dataset=train_set,
        batch_size=train_batch_size,
        sampler=train_sampler,
        num_workers=num_workers,
    )
    test_loader = torch.utils.data.DataLoader(
        dataset=test_set,
        batch_size=eval_batch_size,
        sampler=test_sampler,
        num_workers=num_workers,
    )
    return train_loader, test_loader

Then train the model:

if __name__ == "__main__":
    train_loader, test_loader = prepare_dataloader()

    # first fine-tune the model on CIFAR10; we don't have ImageNet, so CIFAR10 is used for testing
    model = resnet18(pretrained=True)
    model.fc = nn.Linear(512, 10)
    if os.path.exists("r18_row.pth"):
        model.load_state_dict(torch.load("r18_row.pth", map_location="cpu"))
    else:
        train_model(model, train_loader, test_loader, torch.device("cuda"))
        print("train finished.")
        torch.save(model.state_dict(), "r18_row.pth")

Next comes the core code:

def quant_fx(model):
    model.eval()
    qconfig = get_default_qconfig("fbgemm")
    qconfig_dict = {
        "": qconfig,
        # 'object_type': []
    }
    model_to_quantize = copy.deepcopy(model)
    prepared_model = prepare_fx(model_to_quantize, qconfig_dict)
    print("prepared model: ", prepared_model)

    quantized_model = convert_fx(prepared_model)
    print("quantized model: ", quantized_model)
    torch.save(model.state_dict(), "r18.pth")
    torch.save(quantized_model.state_dict(), "r18_quant.pth")

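Got it? That was quick: just like that, an int8 quantized model is generated.

That's right, it does not even take 100 lines; 15 are enough. torch.fx really is that good!

Let's run an evaluation to check what the accuracy looks like without calibration: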

def evaluate_model(model, test_loader, device=torch.device("cpu"), criterion=None):
    t0 = time.time()
    model.eval()
    model.to(device)
    running_loss = 0
    running_corrects = 0
    # gradients are not needed for evaluation
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)

            # the loss is only computed when a criterion is passed in
            if criterion is not None:
                loss = criterion(outputs, labels).item()
            else:
                loss = 0

            # statistics
            running_loss += loss * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

    eval_loss = running_loss / len(test_loader.dataset)
    eval_accuracy = running_corrects / len(test_loader.dataset)
    t1 = time.time()
    print(f"eval loss: {eval_loss}, eval acc: {eval_accuracy}, cost: {t1 - t0}")
    return eval_loss, eval_accuracy

Here are the evaluation results (float model first, then the uncalibrated int8 model):

eval loss: 0.0, eval acc: 0.8476999998092651, cost: 2.8914074897766113
eval loss: 0.0, eval acc: 0.15240000188350677, cost: 1.240293264389038
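A plausible explanation for the second line: observers that have never seen any data cannot know the activation range, and PyTorch's observers then appear to fall back to a default scale of 1.0 and zero point of 0 (treat this as an assumption to verify on your version). The pure-Python sketch below imitates uint8 affine quantization on a handful of made-up activation values to show how much worse that default is than parameters derived from the observed range. The function names are illustrative, not PyTorch APIs.

```python
def choose_qparams(x_min, x_max, qmin=0, qmax=255):
    """Affine uint8 parameters from an observed value range."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # the range must contain 0
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize to an integer in [qmin, qmax], then map back to float."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))
    return (q - zero_point) * scale


# Made-up activations, roughly in the range of normalized image data.
values = [-2.1, -0.4, 0.03, 0.6, 1.9]
settings = {
    "default": (1.0, 0),  # no calibration data seen
    "calibrated": choose_qparams(min(values), max(values)),
}

errors = {}
for name, (scale, zero_point) in settings.items():
    errors[name] = max(abs(v - fake_quantize(v, scale, zero_point)) for v in values)
    print(f"{name}: scale={scale:.4f}, zero_point={zero_point}, max error={errors[name]:.4f}")
```

With the default parameters every negative value is clipped to zero and everything else is rounded to a whole number, which is consistent with a network whose accuracy collapses to near chance.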

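As you can see, accuracy drops severely (from about 84.8% to about 15.2%). Calibration is needed at this point; here is the calibration function: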

def calib_quant_model(model, calib_dataloader):
    assert isinstance(
        model, ObservedGraphModule
    ), "model must be a prepared fx ObservedGraphModule."
    model.eval()
    with torch.inference_mode():
        for inputs, labels in calib_dataloader:
            model(inputs)
    print("calib done.")

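That's all. It is that simple.

If you have some other, non-classification model, you can pass its dataloader in directly as well. Note that the labels are not used here: only the distribution of the data needs to be collected.

Very simple.

Finally, let's evaluate once more: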
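Since only the inputs matter, calibration can be written so that it does not care what else a batch contains, and it rarely needs the whole dataset. The sketch below is one way to do that; `first_tensor`, `calibrate`, the `"image"` dictionary key and the default of 32 batches are all assumptions made for illustration, not part of the code in this article.

```python
import itertools

import torch


def first_tensor(batch):
    """Pull the model input out of a batch; labels or targets are ignored."""
    if isinstance(batch, torch.Tensor):
        return batch
    if isinstance(batch, dict):
        return batch["image"]  # hypothetical key, adjust to your dataset
    return batch[0]  # (inputs, labels) pairs, as in the CIFAR10 loader


def calibrate(prepared_model, dataloader, max_batches=32, get_inputs=first_tensor):
    """Run a limited number of batches through a prepared model; return the count."""
    prepared_model.eval()
    seen = 0
    with torch.no_grad():
        for batch in itertools.islice(dataloader, max_batches):
            prepared_model(get_inputs(batch))
            seen += 1
    return seen
```

For a detection or segmentation dataset, pass a different `get_inputs` function that returns whatever tensor the model's forward expects.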

def quant_calib_and_eval(model):
    # test only on CPU
    model.to(torch.device("cpu"))
    model.eval()

    qconfig = get_default_qconfig("fbgemm")
    qconfig_dict = {
        "": qconfig,
        # 'object_type': []
    }

    model2 = copy.deepcopy(model)
    model_prepared = prepare_fx(model2, qconfig_dict)
    model_int8 = convert_fx(model_prepared)
    model_int8.load_state_dict(torch.load("r18_quant.pth"))
    model_int8.eval()

    a = torch.randn([1, 3, 224, 224])
    o1 = model(a)
    o2 = model_int8(a)

    diff = torch.allclose(o1, o2, 1e-4)
    print(diff)
    print(o1.shape, o2.shape)
    print(o1, o2)
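    # get_output_from_logits is not defined in this article;
    # as a stand-in, print the predicted class index of each model
    print(o1.argmax(dim=1), o2.argmax(dim=1))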

    train_loader, test_loader = prepare_dataloader()
    evaluate_model(model, test_loader)
    evaluate_model(model_int8, test_loader)

    # calibrate the quantized model: prepare a fresh copy, feed it data, then convert
    model2 = copy.deepcopy(model)
    model_prepared = prepare_fx(model2, qconfig_dict)
    calib_quant_model(model_prepared, test_loader)
    model_int8 = convert_fx(model_prepared)
    torch.save(model_int8.state_dict(), "r18_quant_calib.pth")
    evaluate_model(model_int8, test_loader)

The results:

eval loss: 0.0, eval acc: 0.8476999998092651, cost: 2.8914074897766113
eval loss: 0.0, eval acc: 0.15240000188350677, cost: 1.240293264389038
calib done.
eval loss: 0.0, eval acc: 0.8442999720573425, cost: 1.2966759204864502

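Accuracy recovers immediately (about 84.4% versus 84.8% for the float model), and evaluation takes less than half the time (about 1.30 s versus 2.89 s).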
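These timings come from a single pass of evaluate_model, data loading included, so they are only a rough indication. A steadier comparison would time the forward pass alone over several runs after a warm-up. The helper below is a minimal sketch of that idea; the name `benchmark` and its default counts are arbitrary choices.

```python
import time


def benchmark(fn, *args, warmup=3, repeats=10):
    """Return the median wall-clock time of fn(*args), in seconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]  # upper middle value for even counts
```

Calling `benchmark(model, batch)` and `benchmark(model_int8, batch)` with the same CPU batch gives two numbers that can be compared directly.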

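Summary

OK, we completed the whole quantization process in a few dozen lines of code and used calibration to recover the accuracy. That shows how powerful fx is.

One open question, answers welcome in the comments:

  • How do you export a model quantized with torch.fx to ONNX and run it with another inference engine?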