ForUtil: Principles and Usage

spring-hz

Category: lucene

Article link: https://blog.csdn.net/gs_albb/article/details/118446712


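ForUtil is Lucene's utility class for compressing doc IDs in the inverted index: it stores a group of consecutive int docIds as compactly as possible.

Principles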

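I read through the source of forUtil.encode and honestly could not follow much of it, so I would welcome an explanation from anyone who knows it well.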

// Inspired from https://fulmicoton.com/posts/bitpacking/
// Encodes multiple integers in a long to get SIMD-like speedups.
// If bitsPerValue <= 8 then we pack 8 ints per long
// else if bitsPerValue <= 16 we pack 4 ints per long
// else we pack 2 ints per long

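The comment in the source says the approach was inspired by https://fulmicoton.com/posts/bitpacking/: several int values are encoded into each long in order to get SIMD-like speedups.
If each int needs <= 8 bits, 8 ints are packed into one long;
if each int needs <= 16 bits, 4 ints are packed into one long;
otherwise, 2 ints are packed into one long.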
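
To make the three cases concrete, here is a minimal sketch of packing several small ints into the lanes of one long and then touching every lane with a single 64-bit operation. It is not Lucene's code: the class and method names (LanePackingSketch, lanesPerLong, pack, unpack, keepLowBits) are hypothetical, and the lane order is an arbitrary choice made for illustration.

```java
import java.util.Arrays;

class LanePackingSketch {
    // Lanes per long for a given bit width, following the source comment.
    static int lanesPerLong(int bitsPerValue) {
        if (bitsPerValue <= 8) return 8;
        if (bitsPerValue <= 16) return 4;
        return 2;
    }

    // Packs the first `lanes` non-negative values into one long.
    // Each value must fit in 64 / lanes bits; values[0] lands in the highest lane.
    static long pack(int[] values, int lanes) {
        int laneBits = 64 / lanes;
        long packed = 0L;
        for (int i = 0; i < lanes; i++) {
            packed = (packed << laneBits) | (long) values[i];
        }
        return packed;
    }

    // Reverses pack(): reads the lanes back out, lowest lane first.
    static int[] unpack(long packed, int lanes) {
        int laneBits = 64 / lanes;
        long laneMask = (1L << laneBits) - 1;
        int[] out = new int[lanes];
        for (int i = lanes - 1; i >= 0; i--) {
            out[i] = (int) (packed & laneMask);
            packed >>>= laneBits;
        }
        return out;
    }

    // Keeps only the low `bits` bits of every lane with a single 64-bit AND.
    // This is the "SIMD-like" part: one instruction, several values.
    static long keepLowBits(long packed, int lanes, int bits) {
        int laneBits = 64 / lanes;
        long perLane = (1L << bits) - 1;
        long mask = 0L;
        for (int i = 0; i < lanes; i++) {
            mask = (mask << laneBits) | perLane;
        }
        return packed & mask;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6, 7, 200};
        int lanes = lanesPerLong(8);
        long packed = pack(values, lanes);
        // prints [1, 2, 3, 4, 5, 6, 7, 200]
        System.out.println(Arrays.toString(unpack(packed, lanes)));
        // prints [1, 2, 3, 4, 5, 6, 7, 0] because 200 has no bits set in its low 3 bits
        System.out.println(Arrays.toString(unpack(keepLowBits(packed, lanes, 3), lanes)));
    }
}
```

The point of the lane layout is that a shift or mask applied to the long acts on all 8, 4 or 2 values at once, which is presumably where the SIMD-like speedup mentioned in the comment comes from.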

Usage

Lucene has a test case for ForUtil, shown below:

public void testEncodeDecode() throws IOException {
    final int iterations = RandomNumbers.randomIntBetween(random(), 50, 1000);
    final int[] values = new int[iterations * ForUtil.BLOCK_SIZE];
    // There are `iterations` groups, each holding BLOCK_SIZE non-negative ints;
    // the largest value in a group needs at most bpv bits.
    for (int i = 0; i < iterations; ++i) {
      final int bpv = TestUtil.nextInt(random(), 1, 31);
      for (int j = 0; j < ForUtil.BLOCK_SIZE; ++j) {
        // Random non-negative int that fits in bpv bits (bpv can never exceed 31 for a non-negative int)
        values[i * ForUtil.BLOCK_SIZE + j] = RandomNumbers.randomIntBetween(random(),
            0, (int) PackedInts.maxValue(bpv));
      }
    }

    final Directory d = new ByteBuffersDirectory();
    final long endPointer;

    {
      // encode
      IndexOutput out = d.createOutput("test.bin", IOContext.DEFAULT);
      final ForUtil forUtil = new ForUtil();

      for (int i = 0; i < iterations; ++i) {
        long[] source = new long[ForUtil.BLOCK_SIZE];
        long or = 0;
        for (int j = 0; j < ForUtil.BLOCK_SIZE; ++j) {
          source[j] = values[i*ForUtil.BLOCK_SIZE+j];
          or |= source[j];
        }
        // OR-ing all values of this block together shows how many bits (bpv) its largest value needs
        final int bpv = PackedInts.bitsRequired(or);
        out.writeByte((byte) bpv);
        // pack the block using bpv bits per value
        forUtil.encode(source, bpv, out);
      }
      endPointer = out.getFilePointer();
      System.out.println("Encoding used " + endPointer + " bytes for " + values.length + " non-negative ints (uncompressed: " + values.length*4 + " bytes)");
      out.close();
    }

    {
      // decode
      IndexInput in = d.openInput("test.bin", IOContext.READONCE);
      final ForUtil forUtil = new ForUtil();
      for (int i = 0; i < iterations; ++i) {
        final int bitsPerValue = in.readByte();
        final long currentFilePointer = in.getFilePointer();
        final long[] restored = new long[ForUtil.BLOCK_SIZE];
        forUtil.decode(bitsPerValue, in, restored);
        int[] ints = new int[ForUtil.BLOCK_SIZE];
        for (int j = 0; j < ForUtil.BLOCK_SIZE; ++j) {
          ints[j] = Math.toIntExact(restored[j]);
        }
        assertArrayEquals(Arrays.toString(ints),
            ArrayUtil.copyOfSubArray(values, i*ForUtil.BLOCK_SIZE, (i+1)*ForUtil.BLOCK_SIZE),
            ints);
        assertEquals(forUtil.numBytes(bitsPerValue), in.getFilePointer() - currentFilePointer);
      }
      assertEquals(endPointer, in.getFilePointer());
      in.close();
    }

    d.close();
  }
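
The encode step of the test can be approximated with plain bit packing, which may help when the real encoder is hard to read. The sketch below is a simplified stand-in, not Lucene's implementation: it assumes BLOCK_SIZE is 128, uses a hypothetical bitsRequired helper in place of PackedInts.bitsRequired, and writes values back to back without the lane layout that ForUtil uses. It does show why OR-ing the block is enough to find bpv: the OR has the same highest set bit as the largest value.

```java
import java.util.Arrays;
import java.util.Random;

class PlainBitPackingSketch {
    static final int BLOCK_SIZE = 128; // assumption: same block size as ForUtil.BLOCK_SIZE

    // Hypothetical stand-in for PackedInts.bitsRequired: index of the highest set bit, at least 1.
    static int bitsRequired(long orOfValues) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(orOfValues));
    }

    // Writes BLOCK_SIZE values back to back, bpv bits each (1 <= bpv <= 31).
    static long[] encode(long[] values, int bpv) {
        long[] packed = new long[BLOCK_SIZE * bpv / 64];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            int bitPos = i * bpv;
            int word = bitPos >>> 6;   // which long the value starts in
            int shift = bitPos & 63;   // bit offset inside that long
            packed[word] |= values[i] << shift;
            if (shift + bpv > 64) {    // value straddles two longs
                packed[word + 1] |= values[i] >>> (64 - shift);
            }
        }
        return packed;
    }

    // Reads the values back out of the packed longs.
    static long[] decode(long[] packed, int bpv) {
        long mask = (1L << bpv) - 1;
        long[] values = new long[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            int bitPos = i * bpv;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            long v = packed[word] >>> shift;
            if (shift + bpv > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            values[i] = v & mask;
        }
        return values;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        long[] source = new long[BLOCK_SIZE];
        long or = 0;
        for (int j = 0; j < BLOCK_SIZE; j++) {
            source[j] = random.nextInt(1 << 13); // values that fit in 13 bits
            or |= source[j];
        }
        int bpv = bitsRequired(or);
        long[] packed = encode(source, bpv);
        long[] restored = decode(packed, bpv);
        System.out.println("bpv=" + bpv + ", bytes=" + packed.length * 8
            + ", roundTrip=" + Arrays.equals(source, restored));
    }
}
```

With this layout a block always occupies bpv * BLOCK_SIZE / 8 bytes, which is the quantity the test compares against forUtil.numBytes(bitsPerValue).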

Sample output from a few random runs:

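Encoding used 48707 bytes for 24960 non-negative ints (uncompressed: 99840 bytes)
Encoding used 164001 bytes for 82048 non-negative ints (uncompressed: 328192 bytes)
Encoding used 48117 bytes for 23168 non-negative ints (uncompressed: 92672 bytes)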

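As the output shows, the encoded data takes roughly half the space of the raw ints (between about 49% and 52% in these runs). That ratio is a consequence of the test setup rather than a general property: bpv is drawn uniformly from 1 to 31, so blocks average about 16 bits per value instead of 32. Data whose values are mostly small would presumably compress further.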
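
The ratio can be checked against the printed numbers. The sketch below assumes that ForUtil.BLOCK_SIZE is 128 and that each block costs one header byte (the bpv byte the test writes) plus bpv * BLOCK_SIZE / 8 payload bytes; the class and method names are hypothetical. Under those assumptions it recovers the average bpv that each run must have had.

```java
class EncodedSizeSketch {
    static final int BLOCK_SIZE = 128; // assumption: value of ForUtil.BLOCK_SIZE in the tested version

    // Bytes for one block: 1 header byte plus bpv bits for each of BLOCK_SIZE values.
    static long blockBytes(int bpv) {
        return 1 + (long) bpv * BLOCK_SIZE / 8;
    }

    // Average bits per value implied by a reported total size.
    static double averageBpv(long totalBytes, int valueCount) {
        int blocks = valueCount / BLOCK_SIZE;
        long payloadBytes = totalBytes - blocks; // strip the per-block header bytes
        return payloadBytes * 8.0 / valueCount;
    }

    public static void main(String[] args) {
        long[] totals = {48707, 164001, 48117};
        int[] counts = {24960, 82048, 23168};
        for (int i = 0; i < totals.length; i++) {
            System.out.printf("values=%d avgBpv=%.2f ratio=%.3f%n",
                counts[i], averageBpv(totals[i], counts[i]), totals[i] / (counts[i] * 4.0));
        }
    }
}
```

For the three runs above this gives averages of about 15.5, 15.9 and 16.6 bits per value, all close to the mean of a uniform draw from 1 to 31, which is 16.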

References

Inverted Index Compression: The Improved PForDelta Algorithm (倒排索引压缩：改进的PForDelta算法)
