CRC32 SIMD instruction set optimization

深海蓝河

Last modified 2024-05-03 15:39:03

First published 2024-05-01 18:02:56

Article link: https://blog.csdn.net/qq_38413468/article/details/138376884

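This article covers the basic principle of the CRC32 algorithm, the table-driven optimization, and how the Arm CRC32 and Intel SSE4.2 instructions can be used to speed it up further. A performance comparison shows the speed advantage of the CRC32 SIMD version when processing large amounts of data.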

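Basic principle

CRC generates a check value with polynomial modulo-2 division and appends that value to the data. After transmission, the receiver performs the same modulo-2 division and compares its result with the check value appended by the sender to verify the integrity of the data. The check value is C = D % P, where % is the remainder of modulo-2 (carry-less) division, D is the bit stream of length k to be checked, and P is a preset divisor (the generator polynomial). For CRC32, P is a 33-bit sequence. The basic CRC32 algorithm proceeds as follows: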

Load the register with zero bits.
Augment the message by appending W zero bits to the end of it.
While (more message bits)
  Begin
  Shift the register left by one bit, reading the next bit of the
      augmented message into register bit position 0.
  If (a 1 bit popped out of the register during the shift above)
      Register = Register XOR Poly.
  End
The register now contains the remainder.
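
The steps above can be sketched directly in C++. This is a minimal illustration rather than the article's own code: the function name crc32_bitwise is made up for this example, W is fixed at 32, and the polynomial is passed in by the caller (the test values below assume 0x04C11DB7, which the article does not use). It follows the pseudocode literally, so the register starts at zero, bits are read most significant first, and there is no final XOR.

```cpp
#include <cstddef>
#include <cstdint>

// Bit-at-a-time CRC that mirrors the pseudocode: zero initial register,
// message augmented with W = 32 zero bits, most significant bit first.
// crc32_bitwise is a hypothetical name used only for this sketch.
uint32_t crc32_bitwise(const uint8_t *msg, size_t len, uint32_t poly) {
  uint32_t reg = 0;
  // len message bytes followed by 4 zero bytes (the 32 augmented bits).
  for (size_t i = 0; i < len + 4; ++i) {
    uint8_t byte = (i < len) ? msg[i] : 0;
    for (int bit = 7; bit >= 0; --bit) {
      bool popped = (reg & 0x80000000u) != 0;   // bit about to leave the register
      reg = (reg << 1) | ((byte >> bit) & 1u);  // shift in the next message bit
      if (popped) reg ^= poly;                  // subtract the divisor (mod 2)
    }
  }
  return reg;  // the remainder
}
```

For the single byte 0x01 and polynomial 0x04C11DB7 the function returns 0x04C11DB7, because the augmented message is exactly x^32. Appending the returned remainder to the message (most significant byte first) and running the function again yields zero, which is the check the receiver performs.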

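Table-driven algorithm

The basic CRC32 algorithm has the following problems:

It works on individual bits, which makes it awkward to code;
It handles only 1 bit per loop iteration, which is very inefficient.

The table-driven algorithm was invented to solve these problems. It processes 8 bits, that is 1 byte, at a time. Here is a C implementation: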

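// Table-driven CRC32: one table lookup per input byte (reflected, LSB-first form).
uint32_t crc32_byte(const uint8_t *p, uint32_t bytelength) {
  uint32_t crc = 0xffffffff;  // initial register value
  while (bytelength-- != 0)
    crc = crctable[((uint8_t)crc ^ *(p++))] ^ (crc >> 8);
  return (crc ^ 0xffffffff);  // final XOR
}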

For crctable, see g_crc32_1EDC6F41. It can also be generated by hand:

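void crc32_fill(uint32_t *table) {
  uint8_t index = 0, z;
  do {  // index wraps from 255 back to 0, ending the loop after 256 entries
    table[index] = index;
    for (z = 8; z; z--)
      // The loop shifts right (reflected form), so it needs 0x82F63B78,
      // the bit-reversed form of the polynomial 0x1EDC6F41.
      table[index] = (table[index] & 1) ? (table[index] >> 1) ^ 0x82F63B78 : table[index] >> 1;
  } while (++index);
}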
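
To check that a generated table and the byte loop agree with each other, the two pieces can be combined and run on the ASCII string "123456789", whose commonly published CRC-32C check value is 0xE3069283. The sketch below is self-contained and assumes the reflected CRC-32C constant 0x82F63B78; the names crc32c_tab, crc32c_init_table and crc32c_table are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

static uint32_t crc32c_tab[256];  // hypothetical table name for this sketch

// Build the 256-entry table for the reflected CRC-32C constant 0x82F63B78.
void crc32c_init_table() {
  for (uint32_t i = 0; i < 256; ++i) {
    uint32_t r = i;
    for (int k = 0; k < 8; ++k)
      r = (r & 1u) ? (r >> 1) ^ 0x82F63B78u : r >> 1;
    crc32c_tab[i] = r;
  }
}

// Same update rule as crc32_byte: one table lookup per input byte.
uint32_t crc32c_table(const uint8_t *p, size_t n) {
  uint32_t crc = 0xffffffffu;
  while (n-- != 0)
    crc = crc32c_tab[static_cast<uint8_t>(crc) ^ *p++] ^ (crc >> 8);
  return crc ^ 0xffffffffu;
}
```

After calling crc32c_init_table() once, crc32c_table on the nine bytes of "123456789" should return 0xE3069283. A table built with a constant that does not match the shift direction will fail this check.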

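CRC32 SIMD algorithm

To speed up CRC32 further, both the Arm CRC32 extension and Intel SSE4.2 provide dedicated CRC32 instructions. On x86 they are exposed as the following intrinsics, declared in <nmmintrin.h>; they compute CRC-32C, whose polynomial is 0x1EDC6F41: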

unsigned int _mm_crc32_u16 (unsigned int crc, unsigned short v);
unsigned int _mm_crc32_u32 (unsigned int crc, unsigned int v);
unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v);
unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v);

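A single instruction can process up to 8 bytes, 8 times as many as one step of the table-driven algorithm.

Here is a reference implementation for x86: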

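#include <nmmintrin.h>  // SSE4.2 intrinsics: _mm_crc32_u8/u16/u32/u64
#include <stddef.h>
#include <stdint.h>

#ifdef __x86_64__
#define ALIGN_SIZE 8
#else
#define ALIGN_SIZE 4
#endif
#define ALIGN_MASK (ALIGN_SIZE - 1)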

uint32_t extend(uint32_t init_crc, const char *data, size_t n) {
  uint32_t res = init_crc ^ 0xffffffff;
  size_t i;
#ifdef __x86_64__
  uint64_t *ptr_u64;
  uint64_t tmp;
#endif
  uint32_t *ptr_u32;
  uint16_t *ptr_u16;
  uint8_t *ptr_u8;

  // Consume leading bytes one at a time until data + i reaches a machine word boundary.
  for (i = 0; (i < n) && ((uintptr_t)(data + i) & ALIGN_MASK); ++i) {
    res = _mm_crc32_u8(res, (uint8_t)data[i]);
  }

#ifdef __x86_64__
  tmp = res;
  while (n - i >= sizeof(uint64_t)) {
    ptr_u64 = (uint64_t *)&data[i];
    tmp = _mm_crc32_u64(tmp, *ptr_u64);
    i += sizeof(uint64_t);
  }
  res = (uint32_t)tmp;
#endif
  while (n - i >= sizeof(uint32_t)) {
    ptr_u32 = (uint32_t *)&data[i];
    res = _mm_crc32_u32(res, *ptr_u32);
    i += sizeof(uint32_t);
  }
  while (n - i >= sizeof(uint16_t)) {
    ptr_u16 = (uint16_t *)&data[i];
    res = _mm_crc32_u16(res, *ptr_u16);
    i += sizeof(uint16_t);
  }
  while (n - i >= sizeof(uint8_t)) {
    ptr_u8 = (uint8_t *)&data[i];
    res = _mm_crc32_u8(res, *ptr_u8);
    i += sizeof(uint8_t);
  }

  return res ^ 0xffffffff;
}
static inline uint32_t crc32_simd(const char *data, size_t n) {
  return extend(0, data, n);
}
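
The extend function mixes 1-, 2-, 4- and 8-byte steps freely. That works because a wide step gives the same result as feeding the same bytes one at a time, lowest byte first. The sketch below models this in portable software so it runs without SSE4.2. It is not Intel's implementation: crc32c_u8_sw, crc32c_u64_sw and crc32c_sw are hypothetical helpers written for this example, and the 8-byte word is assembled explicitly in little-endian order so the code does not depend on the host byte order.

```cpp
#include <cstddef>
#include <cstdint>

// Software model of one CRC-32C step over a single byte (reflected form,
// constant 0x82F63B78). Hypothetical helper, not an Intel API.
static uint32_t crc32c_u8_sw(uint32_t crc, uint8_t v) {
  crc ^= v;
  for (int k = 0; k < 8; ++k)
    crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
  return crc;
}

// Model of a 64-bit step: the eight bytes are consumed lowest byte first.
static uint32_t crc32c_u64_sw(uint32_t crc, uint64_t v) {
  for (int k = 0; k < 8; ++k)
    crc = crc32c_u8_sw(crc, static_cast<uint8_t>(v >> (8 * k)));
  return crc;
}

// Same shape as extend: 8-byte steps first, then single bytes for the tail.
uint32_t crc32c_sw(const char *data, size_t n) {
  uint32_t crc = 0xffffffffu;
  size_t i = 0;
  for (; n - i >= 8; i += 8) {
    uint64_t word = 0;
    for (int k = 0; k < 8; ++k)  // assemble the word in little-endian order
      word |= static_cast<uint64_t>(static_cast<uint8_t>(data[i + k])) << (8 * k);
    crc = crc32c_u64_sw(crc, word);
  }
  for (; i < n; ++i)
    crc = crc32c_u8_sw(crc, static_cast<uint8_t>(data[i]));
  return crc ^ 0xffffffffu;
}
```

Calling crc32c_sw("123456789", 9) takes one 8-byte step and one 1-byte step and returns 0xE3069283, the same value a pure byte-at-a-time CRC-32C produces.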

Performance test

This test compares the performance of the table-driven algorithm and the CRC32 SIMD algorithm. The test code is as follows:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
using std::cout;
using std::endl;

int main() {
  uint8_t packet[8192];
  for (size_t i = 0; i < 8192; i++) {
    packet[i] = static_cast<uint8_t>(i);  // repeating 0..255 pattern
  }
  // Time 1000 runs of the table-driven version.
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 1000; i++) {
    crc32_byte(packet, 8192);
  }
  auto end = std::chrono::steady_clock::now();
  auto t1 = end - start;
  cout << t1.count() << endl;  // elapsed time in steady_clock ticks
  // Time 1000 runs of the SIMD version.
  start = std::chrono::steady_clock::now();
  for (int i = 0; i < 1000; i++) {
    crc32_simd(reinterpret_cast<char *>(packet), 8192);
  }
  end = std::chrono::steady_clock::now();
  auto t2 = end - start;
  cout << t2.count() << endl;
  // Ratio of the table-driven time to the SIMD time.
  cout << "speed: " << t1 * 1.0 / t2 << endl;
  return 0;
}

Execution results (the two elapsed times in steady_clock ticks, then their ratio):

$ g++ crc32_test.cpp -msse4.2 -O2
$ ./a.out
57
29
speed: 1.96552
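
The test above discards every checksum it computes, so an optimizing compiler is free to drop or hoist the calls, and timings taken that way should be read with care. One way to guard against this is sketched below. The helper time_crc_ns is hypothetical and not part of the original test: it changes the first input byte on every iteration and folds each result into a volatile variable, so every call has an observable effect.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical timing helper (not from the original test). Calls
// fn(data, n) `iters` times and returns the elapsed nanoseconds.
// data[0] is overwritten each iteration so the calls are not loop-invariant,
// and each result is XOR-ed into a volatile sink so it cannot be treated as dead.
template <typename Fn>
long long time_crc_ns(Fn fn, uint8_t *data, size_t n, int iters, uint32_t *out) {
  volatile uint32_t sink = 0;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    data[0] = static_cast<uint8_t>(i);
    sink = sink ^ fn(data, n);
  }
  auto end = std::chrono::steady_clock::now();
  if (out) *out = sink;  // combined checksum, useful for comparing two versions
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

With the functions from this article, time_crc_ns(crc32_byte, packet, 8192, 1000, &checksum) would time the table-driven version; crc32_simd takes a const char *, so it would need a small wrapper lambda that casts the pointer. If both versions compute the same CRC, the two combined checksums match.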

References

[1]. A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS
[2]. crc32 table
[3]. Intel硬件指令加速计算CRC32 (Accelerating CRC32 computation with Intel hardware instructions)
[4]. Intel® Intrinsics Guide
