Cracking C++ (7): Using the fp16 Type

ArthurBreeze

Article link: https://blog.csdn.net/baiyu33/article/details/131151131

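1. Purpose

The fp16 type exists mainly to speed up computation. On platforms that compute on the CPU, x86 hardware generally has no native fp16 support, but software emulations are easy to find online and are enough to try out basic fp16 arithmetic.

This article does not cover fp16 SIMD computation.

2. Platforms that support fp16

Here fp16 means the 16-bit half-precision format (binary16) introduced in IEEE 754-2008.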

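On x86 CPUs, fp16 is only partially supported:

- If only SSE2 is enabled at compile time, fp16 operations are emulated with the corresponding float instructions.
- If the -mavx512fp16 compiler option is enabled and the hardware supports it, AVX512-FP16 instructions are generated.

On Arm CPUs, the Armv8.2 architecture provides hardware fp16 instructions; the most common devices of this kind are mobile phones.

I have not yet looked into fp16 support on NVIDIA and AMD GPUs.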

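3. Emulated fp16 implementations

3.1 Overview of open-source libraries

In day-to-day development, where you write code and run it as you go, there are several ways to use an fp16 type and still get correct results:

- Use a CPU with AVX-512 (the AVX512-FP16 extension from section 2)
- Use an Android phone or an iPhone
- Use an NVIDIA or AMD GPU
- Use a library that emulates fp16

The first three options have a barrier to entry; an fp16 emulation library is much easier. The emulation libraries I know of are:

- half (see [1]): a C++ implementation that emulates the fp16 type and its arithmetic
- FP16 (see [2]): a C implementation that covers data type conversions and includes performance benchmarks against other emulations

At first glance, neither library uses the __fp16 keyword, so both are evidently pure emulations.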

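If the finished program is later moved to a device that supports fp16, these two emulation libraries cannot take advantage of the hardware's SIMD instructions. A better-suited fp16 C/C++ library would:

- Use the builtin functions for the fp16 type when the hardware and compiler flags support fp16
- Fall back to an emulation when the hardware and compiler flags do not support fp16:
  - On platforms with SSE2, SSE2 SIMD instructions can emulate fp16 instructions, which is at least faster than scalar computation (see [3])
  - Scalar operations are emulated entirely in software
- Provide a C++ class interface and, when the hardware and compiler flags support fp16, provide conversion operators to the native fp16 type and its pointer type

However, implementing this from scratch is time-consuming and difficult. The implementations in half and FP16 are a useful starting point that can speed up the work, and the ncnn source code shows how to use native fp16 types.

llama.cpp and whisper.cpp, which have recently become popular, are both built on ggml, and on platforms without fp16 support ggml uses code from the FP16 project as its emulation.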

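3.2 Compiler support for fp16 types on x86

Clang has provided the __fp16 type since version 3.5, so __fp16 can be used on an x64 PC. Clang 15.0 added support for the _Float16 type.

GCC does not support __fp16 on x86; it has supported the _Float16 type since GCC 12.1.

MSVC supports neither __fp16 nor _Float16.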

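3.3 Limitation of `__fp16`: it cannot be a function parameter

__fp16 is purely a storage type and cannot be passed by value, because only types permitted by the ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) can be passed by value.

// In this snippet FP16 is a plain alias for __fp16 (typedef __fp16 FP16;),
// not the wrapper struct defined in section 3.4.
void print(FP16 x)  // Compile error here: error: parameters cannot have __fp16 type; did you forget * ?
{
    //std::cout << "result " << (float)x << "\n";
    std::cout << "result " << fmt::format("{}", (float)x) << "\n";
}

See [4].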

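3.4 Wrapping the half library

The idea is to provide a class whose data member is of type __fp16 on platforms that support it (the code below also checks for _Float16), and of type half_float::half from the half library otherwise.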

// https://github.com/pytorch/glow/issues/1329

#include <fmt/core.h>
#include <iostream>
#include <stdint.h>
#include <bitset>

#include "half.hpp"

// _Float16
// GCC >= 12.1
// Clang >= 15.0

// __fp16
// GCC: 13.1 still N/A
// Clang >= 3.5

#define PHANTOM_CHECK_VERSION(wanted_major, wanted_minor, wanted_patch, current_major, current_minor, current_patch) \
                (((current_major) > (wanted_major)) || \
                    (((current_major) == (wanted_major)) \
                        && (((current_minor) > (wanted_minor)) || \
                            (((current_minor) == (wanted_minor)) \
                                && ((current_patch) >= (wanted_patch))))))

#define PHANTOM_CHECK_GCC_VERSION(wanted_major, wanted_minor, wanted_patch) \
                PHANTOM_CHECK_VERSION(wanted_major, wanted_minor, wanted_patch, \
                                      __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__)

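// Clang reports its own version through __clang_major__ and friends;
// under Clang, __GNUC__ is only a fixed GCC-compatibility value.
#define PHANTOM_CHECK_CLANG_VERSION(wanted_major, wanted_minor, wanted_patch) \
                PHANTOM_CHECK_VERSION(wanted_major, wanted_minor, wanted_patch, \
                                      __clang_major__, __clang_minor__, __clang_patchlevel__)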


#ifdef __clang__
#define PHANTOM_CLANG_COMPILER 1
#elif defined(__INTEL_COMPILER)
#define PHANTOM_INTEL_COMPILER 1
#elif defined(__MINGW64__)
#define PHANTOM_MINGW64_COMPILER 1
#elif defined(__MINGW32__)
#define PHANTOM_MINGW32_COMPILER 1
#elif defined(__GNUC__)  // GCC defines __GNUC__; there is no __GCC__ macro
#define PHANTOM_GCC_COMPILER 1
#elif defined(_MSC_VER)
#define PHANTOM_MSVC_COMPILER 1
#endif

#if PHANTOM_CLANG_COMPILER && PHANTOM_CHECK_CLANG_VERSION(15, 0, 0)
#define fp16_storage_type _Float16
#elif PHANTOM_CLANG_COMPILER && PHANTOM_CHECK_CLANG_VERSION(3, 5, 0)
#define fp16_storage_type __fp16
#elif PHANTOM_GCC_COMPILER && PHANTOM_CHECK_GCC_VERSION(12, 1, 0)
#define fp16_storage_type _Float16
#else
#define fp16_storage_type half_float::half
#endif

// https://github.com/pytorch/glow/issues/1329
struct FP16
{
    fp16_storage_type data_;
    FP16(float x = 0.)
        : data_(x)
    {
    }
    FP16 operator+(const FP16&) const;
    operator float() const
    {
        return data_;
    }
};

FP16 FP16::operator+(const FP16& c) const
{
    FP16 result;
    result.data_ = (this->data_ + c.data_);
    return result;
}

void print(FP16 x)
{
    //std::cout << "result " << (float)x << "\n";
    std::cout << "result " << fmt::format("{}", (float)x) << "\n";
}

3.5 Running the computation

Using the FP16 class wrapped in the previous step, run scalar and array computations.

int main()
{
    FP16 pi(3.1415926f);
    FP16 one(1.0f);
    FP16 res = pi + one;

    print(pi);
    print(one);
    print(res);

    FP16 data[4] = { pi, pi, pi, pi };
    for (int i = 0; i < 4; i++)
    {
        data[i] = data[i] + one;
    }
    for (int i = 0; i < 4; i++)
    {
        print(data[i]);
    }

    return 0;
}

Output

result 3.140625
result 1
result 4.140625
result 4.140625
result 4.140625
result 4.140625
result 4.140625

The half library documentation ([5]) was used as a reference: https://half.sourceforge.net/index.html

4. References

[1] https://sourceforge.net/projects/half/
[2] https://github.com/Maratyszcza/FP16
[3] Using Half Precision Floating Point on x86 CPUs
[4] https://lists.llvm.org/pipermail/llvm-dev/2021-March/149004.html
[5] https://half.sourceforge.net/index.html