https://en.wikipedia.org/wiki/Minifloat
In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.
FP Representation and Notation
Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
- A signed (meaning positive or negative) digit string of a given length in a given base (or radix). This digit string is referred to as the significand, mantissa, or coefficient.[nb 1] The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
- A signed integer exponent (also referred to as the characteristic, or scale),[nb 2] which modifies the magnitude of the number.
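As a small illustration (a sketch, not part of the source text), Python's math.frexp splits a float into exactly these two parts, a signed significand and a power-of-two exponent:

```python
import math

# math.frexp(x) returns (m, e) such that x == m * 2**e and 0.5 <= |m| < 1,
# i.e. it decomposes a float into its significand and exponent parts.
for x in (6.25, 0.15625, -1024.0):
    m, e = math.frexp(x)
    print(f"{x} = {m} * 2**{e}")
    assert x == m * 2**e
```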
Minifloat Notation
https://en.wikipedia.org/wiki/Minifloat
A minifloat is usually described using a tuple of four numbers, (S, E, M, B):
- S is the length of the sign field. It is usually either 0 or 1.
- E is the length of the exponent field.
- M is the length of the mantissa (significand) field.
- B is the exponent bias.
A minifloat format denoted by (S, E, M, B) is, therefore, S + E + M bits long.
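For reference, here is a minimal sketch (format names and tuples listed for illustration) of how the (S, E, M, B) notation describes some common formats; the total width is simply S + E + M:

```python
# (S, E, M, B) = (sign bits, exponent bits, mantissa bits, exponent bias)
FORMATS = {
    "binary32 (FP32)": (1, 8, 23, 127),
    "binary16 (FP16)": (1, 5, 10, 15),
    "bfloat16 (BF16)": (1, 8, 7, 127),
    "TF32":            (1, 8, 10, 127),  # 19 bits, but stored in FP32 containers
    "FP8 E4M3":        (1, 4, 3, 7),
    "FP8 E5M2":        (1, 5, 2, 15),
}

for name, (S, E, M, B) in FORMATS.items():
    print(f"{name}: {S + E + M} bits, bias {B}")
```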
FP Value Computation and Precision
Fixed Point and Floating Point Number Representations
let e denote the raw (biased) value stored in the Exponent field, and Exp the unbiased exponent, i.e. Exp = e - Bias;
let E denote the length (in bits) of the Exponent field, and M the length of the Mantissa field;
for a normal number the value is then
value = (-1)^sign * (1 + mantissa / 2^M) * 2^(e - Bias)
the raw field can hold e = 0 ... 2^E - 1, so max(e) = 2^E - 1;
by IEEE convention, we want the range of Exp to be divided (almost) evenly between positive and negative values, hence
Bias = floor(max(e) / 2) = 2^(E - 1) - 1
(e.g. E = 5 gives Bias = 15 for FP16, and E = 8 gives Bias = 127 for binary32 and bfloat16)
Of course, as the minifloat notation above demonstrates, the choice of E, M, and B can be arbitrary when crafting a datatype for domain-specific problems; even the sign bit is optional, although in the formats we will see later it is always kept.
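As a sketch of the value computation (assuming an IEEE-754-style layout with subnormals and reserved Inf/NaN encodings, which a custom format need not follow), a raw bit pattern can be decoded from the (S, E, M, B) parameters like this:

```python
def decode_minifloat(bits: int, S: int, E: int, M: int, B: int) -> float:
    """Decode a raw bit pattern under an assumed IEEE-style (S, E, M, B) layout."""
    m = bits & ((1 << M) - 1)                   # mantissa field
    e = (bits >> M) & ((1 << E) - 1)            # raw (biased) exponent field
    s = ((bits >> (M + E)) & 1) if S else 0     # sign bit, if present
    sign = -1.0 if s else 1.0
    if e == (1 << E) - 1:                       # all-ones exponent: Inf/NaN
        return sign * float("inf") if m == 0 else float("nan")
    if e == 0:                                  # all-zeros exponent: zero / subnormal
        return sign * (m / 2**M) * 2**(1 - B)
    return sign * (1 + m / 2**M) * 2**(e - B)   # normal: implicit leading 1

# Example: 0x3C00 under the FP16 parameters (S, E, M, B) = (1, 5, 10, 15) is 1.0
print(decode_minifloat(0x3C00, 1, 5, 10, 15))   # 1.0
```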
Precision
By the value equation above, the Mantissa field mainly controls fractional precision: the gap between adjacent representable values is 2^(Exp - M);
the representable range grows exponentially with the number of Exponent bits (roughly as 2^(2^E)), while each extra Mantissa bit only halves the step size, and how many bits of each are needed depends on the extremes of the dataset being represented;
==> for really short FP types such as FP8, the balance tips toward the Mantissa, and the trade-off in choosing E, M, B (or even whether to keep the sign bit S) is highly situational.
The absolute (incremental/step) precision and the overall range are primarily determined by the Exponent and the Bias;
==> which means the closer a value is to 0 (the smaller its exponent), the finer the incremental precision.
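To make this concrete, here is a minimal sketch (pure Python, with the spacing function and its clamping logic being my own illustration, not from the source) of how the absolute step between adjacent values grows with magnitude for an FP16-like format:

```python
import math

def spacing(x: float, E: int, M: int) -> float:
    """Gap between adjacent representable values near |x|, for an assumed
    IEEE-style format with E exponent bits and M mantissa bits."""
    bias = 2**(E - 1) - 1
    exp = math.floor(math.log2(abs(x)))   # binade that x falls into
    exp = max(exp, 1 - bias)              # below this, steps are constant (subnormals)
    return 2.0**(exp - M)                 # step = 2^(Exp - M)

# FP16-like format (E, M) = (5, 10): the step grows with magnitude,
# so absolute precision is finest close to 0.
for x in (0.001, 1.0, 100.0, 10000.0):
    print(f"step near {x:>8}: {spacing(x, 5, 10)}")
```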
Minifloats in ML (DNNs)
(from the Wikipedia article on bfloat16)
TF32
What is the TensorFloat-32 Precision Format? | NVIDIA Blog
TensorFloat-32 is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations used at the heart of AI and certain HPC applications. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x.
Understanding the New Math
It helps to step back for a second to see how TF32 works and where it fits.
Math formats are like rulers. The number of bits in a format’s exponent determines its range, how large an object it can measure. Its precision — how fine the lines are on the ruler — comes from the number of bits used for its mantissa, the part of a floating point number after the radix or decimal point.
A good format strikes a balance ==> often tuned to a highly specific application. It should use enough bits to deliver precision without using so many that it slows processing and bloats memory.
The chart below shows how TF32 is a hybrid that strikes this balance for tensor operations.
TF32 strikes a balance that delivers performance with range and accuracy.
TF32 uses the same 10-bit mantissa as the half-precision (FP16) math, shown to have more than sufficient margin for the precision requirements of AI workloads. And TF32 adopts the same 8-bit exponent as FP32 so it can support the same numeric range.
The combination makes TF32 a great alternative to FP32 for crunching through single-precision math, specifically the massive multiply-accumulate functions at the heart of deep learning and many HPC apps.
Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.
For maximum performance, the A100 also has enhanced 16-bit math capabilities. It supports both FP16 and Bfloat16 (BF16) at double the rate of TF32 ==> note that since TF32 is only 19 bits wide but its inputs and results stay in FP32 format in memory (see above), it is a compute-saving format rather than a memory-saving one: the 10-bit mantissa needs much narrower multiplier hardware than FP32's 23-bit mantissa. Employing Automatic Mixed Precision, users can get a further 2x higher performance with just a few lines of code.
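As a rough sketch of what the format change means numerically (an illustration only; the blog does not specify the exact hardware rounding, which may be round-to-nearest rather than truncation), TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits, which can be simulated by clearing the low 13 mantissa bits of an FP32 value:

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Simulate TF32 by truncating an FP32 mantissa from 23 to 10 bits (sketch only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # reinterpret float32 as uint32
    bits &= 0xFFFFE000                                    # clear the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(truncate_to_tf32(3.14159265))   # ~3.140625: only about 3 decimal digits survive
```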
TF32 Is Demonstrating Great Results Today
Compared to FP32, TF32 shows a 6x speedup training BERT, one of the most demanding conversational AI models. Applications-level results on other AI training and HPC apps that rely on matrix math will vary by workload.
To validate the accuracy of TF32, we used it to train a broad set of AI networks across a wide variety of applications from computer vision to natural language processing to recommender systems. All of them have the same convergence-to-accuracy behavior as FP32.
That’s why NVIDIA is making TF32 the default on its cuDNN library which accelerates key math operations for neural networks. At the same time, NVIDIA is working with the open-source communities that develop AI frameworks to enable TF32 as their default training mode on A100 GPUs, too.
In June, developers will be able to access a version of the TensorFlow framework and a version of the PyTorch framework with support for TF32 on NGC, NVIDIA’s catalog of GPU-accelerated software.
“TensorFloat-32 provides a huge out-of-the-box performance increase for AI applications for training and inference while preserving FP32 levels of accuracy,” said Kemal El Moujahid, director of Product Management for TensorFlow.
“We plan to make TensorFloat-32 supported natively in TensorFlow to enable data scientists to benefit from dramatically higher speedups in NVIDIA A100 Tensor Core GPUs without any code changes,” he added.
BF16
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
The bfloat16 (Brain Floating Point)[1][2] floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing.[3] It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only 8 bits of significand precision (7 explicitly stored mantissa bits plus the implicit leading bit) rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.[4]
The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google.[5] The bfloat16 format is utilized in Intel AI processors, such as Nervana NNP-L1000, Xeon processors (AVX-512 BF16 extensions), and Intel FPGAs,[6][7][8] Google Cloud TPUs,[9][10][11] and TensorFlow.[11][12] ARMv8.6-A,[13] AMD ROCm,[14] and CUDA[15] also support the bfloat16 format. On these platforms, bfloat16 may also be used in mixed-precision arithmetic, where bfloat16 numbers may be operated on and expanded to wider data types.
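Since bfloat16 is simply the top half of binary32, a conversion can be sketched by dropping the low 16 bits of the FP32 bit pattern (a minimal illustration; production libraries typically round to nearest even rather than truncating):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern obtained by truncating float32 (sketch only)."""
    f32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return f32_bits >> 16                      # keep sign, 8 exponent bits, top 7 mantissa bits

def bfloat16_bits_to_float(b16: int) -> float:
    """Expand a bfloat16 pattern back to float32 by zero-padding the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b16 << 16))[0]

x = 3.14159265
b = float32_to_bfloat16_bits(x)
print(hex(b), bfloat16_bits_to_float(b))       # 0x4049 -> 3.140625
```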
Example: Intel's Integration of bfloat16
FP8
summary paper: fp8 in inference
https://arxiv.org/ftp/arxiv/papers/2104/2104.07329.pdf
2018 IBM
https://proceedings.neurips.cc/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf
Our 8-bit floating point number (FP8) has a (sign, exponent, mantissa) format of (1, 5, 2) bits - where the format is chosen carefully to represent weights, activations, errors and gradients used in the three GEMMs
2022 NVIDIA Hopper
NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog
NVIDIA Hopper FP8 data format
The H100 GPU adds FP8 Tensor Cores to accelerate both AI training and inference. As shown in Figure 6, FP8 Tensor Cores support FP32 and FP16 accumulators, and two new FP8 input types:
- E4M3 with 4 exponent bits, 3 mantissa bits, and 1 sign bit
- E5M2, with 5 exponent bits, 2 mantissa bits, and 1 sign bit
E4M3 supports computations requiring less dynamic range with more precision, while E5M2 provides a wider dynamic range and less precision. FP8 halves data storage requirements and doubles throughput compared to FP16 or BF16.
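As a rough sketch of the range/precision trade-off (assuming a plain IEEE-style layout; the actual Hopper E4M3 deviates slightly by reclaiming the Inf encodings to extend its maximum to 448), the extremes of the two formats can be tabulated from their (E, M, B) parameters:

```python
def fp8_extremes(E: int, M: int, B: int):
    """Largest normal and smallest subnormal magnitude for an assumed
    IEEE-style layout (all-ones exponent reserved for Inf/NaN)."""
    max_normal = (2 - 2**-M) * 2**((2**E - 2) - B)   # top exponent code, mantissa all ones
    min_subnormal = 2**-M * 2**(1 - B)               # smallest nonzero subnormal
    return max_normal, min_subnormal

for name, (E, M, B) in {"E4M3": (4, 3, 7), "E5M2": (5, 2, 15)}.items():
    hi, lo = fp8_extremes(E, M, B)
    print(f"{name}: max normal ~{hi}, min subnormal {lo}")
```

This shows the stated trade-off directly: E5M2 reaches far larger magnitudes, while E4M3 spends its extra mantissa bit on finer relative precision.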
The new transformer engine described later in this post uses both FP8 and FP16 precisions to reduce memory usage and increase performance, while still maintaining accuracy for large language and other models.
Figure 6. New NVIDIA Hopper FP8 precisions: 2x throughput and half the footprint of H100 FP16 or BF16