Faster and smaller quantized NLP with Hugging Face and ONNX Runtime

This post was written by Morgan Funtowicz, Machine Learning Engineer from Hugging Face, and Yufeng Li, Senior Software Engineer from Microsoft.

Transformer models used for natural language processing (NLP) are big. BERT-base-uncased has ~110 million parameters, RoBERTa-base has ~125 million parameters, and GPT-2 has ~117 million parameters. Each parameter is a floating-point number that requires 32 bits (FP32). This means the file sizes of these models are huge, as is the memory they consume. Not to mention all the computation that needs to happen on all these bits.

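To put those parameter counts in perspective, here is a quick back-of-the-envelope calculation of the weight footprint at 32 bits (4 bytes) per parameter, using the approximate figures quoted above:

```python
# Rough weight-storage footprint at 4 bytes (32 bits) per FP32 parameter.
PARAMS = {
    "bert-base-uncased": 110_000_000,
    "roberta-base": 125_000_000,
    "gpt2": 117_000_000,
}

for name, n_params in PARAMS.items():
    fp32_mb = n_params * 4 / 2**20  # bytes -> MiB
    print(f"{name}: ~{fp32_mb:.0f} MB of weights in FP32")

# bert-base-uncased: ~420 MB, roberta-base: ~477 MB, gpt2: ~446 MB
```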

These challenges make it difficult to run transformer models on client devices with limited memory and compute resources. Growing awareness of privacy and data-transfer costs makes on-device inferencing appealing. Even in the cloud, latency and cost are very important, and any large-scale application needs to optimize for them.

Quantization and distillation are two techniques commonly used to deal with these size and performance challenges. They are complementary and can be used together. Distillation was covered in a previous blog post by Hugging Face. Here we discuss quantization, which can be applied to your models easily and without retraining. This work builds on the optimized inference with ONNX Runtime we previously shared, and can give you an additional performance boost as well as unblock inferencing on client devices.

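As a concrete sketch of how little is involved, here is one way to apply post-training dynamic quantization with ONNX Runtime's Python API. It assumes you already have an FP32 ONNX export of your model; the model.onnx and model-int8.onnx filenames are placeholders, and the exact module layout may differ across onnxruntime versions:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Post-training dynamic quantization: weights are converted to INT8 offline,
# while activations are quantized on the fly at inference time, so no
# retraining or calibration dataset is required.
quantize_dynamic(
    model_input="model.onnx",        # placeholder: your exported FP32 model
    model_output="model-int8.onnx",  # the quantized model is written here
    weight_type=QuantType.QInt8,     # store weights as signed 8-bit integers
)
```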

Quantization

Quantization approximates floating-point numbers with lower bit width numbers, dramatically reducing memory footprint and accelerating performance. Quantization can introduce accuracy loss since fewer bits limit the precision and range of values. However, researchers have extensively demonstrated that weights and activations can be represented using 8-bit integers (INT8) without incurring significant loss in accuracy.

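To make the approximation concrete, below is a minimal sketch of symmetric linear quantization to INT8, one common scheme (ONNX Runtime's quantizer picks scales and zero points per tensor or per channel, so treat this only as an illustration of the idea):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric linear quantization of an FP32 tensor to INT8."""
    scale = np.abs(x).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # recover an approximation of x

x = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(x)
print(x)                     # original FP32 values
print(dequantize(q, scale))  # close, but rounded to multiples of the scale
```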

Compared to FP32, INT8 representation reduces data storage and bandwidth by 4x, which also reduces the energy consumed. In terms of inference performance, integer computation is also generally faster than floating-point math.

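A quick way to see the performance effect is to time the FP32 and INT8 models side by side with onnxruntime's Python API. The sketch below assumes a BERT-style export: the input names, vocabulary size, and sequence length are placeholder assumptions, and the file names refer to the quantization example above; adjust them to match your own graph:

```python
import time

import numpy as np
import onnxruntime as ort

# Hypothetical BERT-style inputs; adjust names, shapes, and dtypes to
# whatever your exported graph actually expects.
batch = {
    "input_ids": np.random.randint(0, 30000, size=(1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

def bench(path: str, runs: int = 100) -> float:
    """Average CPU latency in milliseconds over `runs` inferences."""
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    sess.run(None, batch)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, batch)
    return (time.perf_counter() - start) / runs * 1000

print(f"FP32: {bench('model.onnx'):.1f} ms")
print(f"INT8: {bench('model-int8.onnx'):.1f} ms")
```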