On-disk storage of postings lists in Lucene (PForDelta)

chuanyangwang

Last modified 2022-06-23 20:52:46


First published 2021-11-21 15:26:26

Article link: https://blog.csdn.net/chuanyangwang/article/details/121454411


Compression of doc IDs

org.apache.lucene.codecs.lucene84.ForDeltaUtil#encodeDeltas

Compression of freqs (term frequencies)

org.apache.lucene.codecs.lucene84.PForUtil#encode

    // We store the patch on a byte, so we can't decrease the number of bits required by more than 8
    final int patchedBitsRequired =  Math.max(PackedInts.bitsRequired(top4[0]), maxBitsRequired - 8);
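
The line above caps how far exceptions can lower the bit width: the high bits of each exception are stored in a single byte, so at most 8 bits can be shaved off. The following is a minimal sketch of that rule, not Lucene's implementation. The class and method names are hypothetical, and it assumes that `top4[0]` is the smallest of the four largest values in the block (so at most 3 values can become exceptions).

```java
import java.util.Arrays;

final class PatchBitsSketch {
  // Bits needed to represent a non-negative value (at least 1).
  static int bitsRequired(long v) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
  }

  // Assumption: the 4th largest value plays the role of top4[0].
  static int patchedBitsRequired(long[] block) {
    long[] sorted = block.clone();
    Arrays.sort(sorted);
    long fourthLargest = sorted[sorted.length - 4];
    long max = sorted[sorted.length - 1];
    // The patch is one byte, so the width cannot drop by more than 8 bits.
    return Math.max(bitsRequired(fourthLargest), bitsRequired(max) - 8);
  }

  public static void main(String[] args) {
    long[] freqs = new long[128];
    Arrays.fill(freqs, 3); // most freqs are small (2 bits)
    freqs[10] = 1000;      // outlier, 10 bits
    freqs[77] = 70000;     // outlier, 17 bits
    System.out.println(patchedBitsRequired(freqs)); // 9
  }
}
```

In this example the small values would fit in 2 bits, but the 17-bit outlier forces a width of 17 - 8 = 9 bits, so that its remaining high bits (70000 >>> 9 = 136) still fit in one byte.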

SIMD

In-depth code optimization (2): optimizing programs with SIMD (吴小锤's blog, CSDN)

SIMD-based PFOR-DELTA decompression and lookup (Zhihu)

Example 1:

Original array

values = new int[]{12,5,2,8,7,7,1,9,11,7,6,6,6,6,3,3,1,1,1,6,3,7,5,4,5,10,7,8,6,3,6,7,3,11,12,4,4,7,14,1,14,7,6,6,12,11,3,13,11,6,11,8,11,2,3,8,11,14,1,5,9,6,11,10,13,7,11,9,4,1,8,3,14,11,7,4,1,4,4,8,7,3,2,7,2,14,2,10,4,3,14,5,7,13,12,6,12,4,13,13,9,13,11,14,5,11,11,2,6,2,12,5,14,12,10,3,10,6,3,14,3,6,3,13,1,13,8,10};

After collapse

00001100 00000001 00000011 00001011 00001101 00000111 00001100 00001110
00000101 00000001 00001011 00000110 00000111 00000011 00000100 00001100
00000010 00000001 00001100 00001011 00001011 00000010 00001101 00001010
00001000 00000110 00000100 00001000 00001001 00000111 00001101 00000011
00000111 00000011 00000100 00001011 00000100 00000010 00001001 00001010
00000111 00000111 00000111 00000010 00000001 00001110 00001101 00000110
00000001 00000101 00001110 00000011 00001000 00000010 00001011 00000011
00001001 00000100 00000001 00001000 00000011 00001010 00001110 00001110

00001011 00000101 00001110 00001011 00001110 00000100 00000101 00000011
00000111 00001010 00000111 00001110 00001011 00000011 00001011 00000110
00000110 00000111 00000110 00000001 00000111 00001110 00001011 00000011
00000110 00001000 00000110 00000101 00000100 00000101 00000010 00001101
00000110 00000110 00001100 00001001 00000001 00000111 00000110 00000001
00000110 00000011 00001011 00000110 00000100 00001101 00000010 00001101
00000011 00000110 00000011 00001011 00000100 00001100 00001100 00001000
00000011 00000111 00001101 00001010 00001000 00000110 00000101 00001010

Second round of compression:

11001011 00010101 00111110 10111011 11011110 01110100 11000101 11100011 
01010111 00011010 10110111 01101110 01111011 00110011 01001011 11000110 
00100110 00010111 11000110 10110001 10110111 00101110 11011011 10100011 
10000110 01101000 01000110 10000101 10010100 01110101 11010010 00111101 
01110110 00110110 01001100 10111001 01000001 00100111 10010110 10100001 
01110110 01110011 01111011 00100110 00010100 11101101 11010010 01101101 
00010011 01010110 11100011 00111011 10000100 00101100 10111100 00111000 
10010011 01000111 00011101 10001010 00111000 10100110 11100101 11101010

Structure as stored on disk:

11100011 11000101 01110100 11011110 10111011 00111110 00010101 11001011 
11000110 01001011 00110011 01111011 01101110 10110111 00011010 01010111 
10100011 11011011 00101110 10110111 10110001 11000110 00010111 00100110 
00111101 11010010 01110101 10010100 10000101 01000110 01101000 10000110 
10100001 10010110 00100111 01000001 10111001 01001100 00110110 01110110 
01101101 11010010 11101101 00010100 00100110 01111011 01110011 01110110 
00111000 10111100 00101100 10000100 00111011 11100011 01010110 00010011 
11101010 11100101 10100110 00111000 10001010 00011101 01000111 10010011

Example 2:

Original data

values = new int[]{5,5,2,1,7,7,1,2,4,7,6,6,6,6,3,3,1,1,1,6,3,7,5,4,5,3,7,1,6,3,6,7,3,4,5,4,4,7,7,1,7,7,6,6,5,4,3,6,4,6,4,1,4,2,3,1,4,7,1,5,2,6,4,3,6,7,4,2,4,1,1,3,7,4,7,4,1,4,4,1,7,3,2,7,2,7,2,3,4,3,7,5,7,6,5,6,5,4,6,6,2,6,4,7,5,4,4,2,6,2,5,5,7,5,3,3,3,6,3,7,3,6,3,6,1,6,1,3};

After collapse

Part 1
00000101 00000001 00000011 00000100 00000110 00000111 00000101 00000111 
00000101 00000001 00000100 00000110 00000111 00000011 00000100 00000101 
00000010 00000001 00000101 00000100 00000100 00000010 00000110 00000011 
00000001 00000110 00000100 00000001 00000010 00000111 00000110 00000011 
00000111 00000011 00000100 00000100 00000100 00000010 00000010 00000011 
00000111 00000111 00000111 00000010 00000001 00000111 00000110 00000110
 
Part 2
00000001 00000101 00000111 00000011 00000001 00000010 00000100 00000011 
00000010 00000100 00000001 00000001 00000011 00000011 00000111 00000111 
00000100 00000101 00000111 00000100 00000111 00000100 00000101 00000011 
00000111 00000011 00000111 00000111 00000100 00000011 00000100 00000110 
00000110 00000111 00000110 00000001 00000111 00000111 00000100 00000011 
00000110 00000001 00000110 00000101 00000100 00000101 00000010 00000110 

Part 3
00000110 00000110 00000101 00000010 00000001 00000111 00000110 00000001 
00000110 00000011 00000100 00000110 00000100 00000110 00000010 00000110 
00000011 00000110 00000011 00000100 00000100 00000101 00000101 00000001 
00000011 00000111 00000110 00000011 00000001 00000110 00000101 00000011

Second round of compression:

The last two bits of every 8-bit lane are used to store Part 3; the mask is 00000011 00000011 00000011 00000011 00000011 00000011 00000011 00000011

10100111 00110111 01111110 10001101 11000100 11101011 10110011 11101100 
10101001 00110000 10000111 11000101 11101111 01101111 10011100 10111111 
01010010 00110111 10111100 10010010 10011100 01010010 11010110 01101110 
00111101 11001111 10011101 00111110 01010010 11101110 11010010 01111000 
11111010 01111101 10011011 10000100 10011100 01011111 01010011 01101110 
11111011 11100111 11111010 01010111 00110001 11110110 11001001 11011011

Stored data:

11101100 10110011 11101011 11000100 10001101 01111110 00110111 10100111 
10111111 10011100 01101111 11101111 11000101 10000111 00110000 10101001 
01101110 11010110 01010010 10011100 10010010 10111100 00110111 01010010 
01111000 11010010 11101110 01010010 00111110 10011101 11001111 00111101 
01101110 01010011 01011111 10011100 10000100 10011011 01111101 11111010 
11011011 11001001 11110110 00110001 01010111 11111010 11100111 11111011

Example 3:

Original array

values = new int[]{24,4,14,14,7,2,11,22,31,23,1,5,31,6,31,18,31,30,30,16,13,30,12,6,10,12,17,28,4,18,25,19,22,2,6,25,13,20,5,29,23,15,26,1,1,28,17,1,24,23,5,9,16,3,18,20,19,31,7,9,20,7,6,1,1,22,6,16,6,16,18,5,32,5,8,7,13,28,7,14,14,31,8,17,29,13,2,9,19,4,21,8,4,32,21,19,32,31,7,9,12,12,15,29,11,10,15,2,17,6,32,21,21,9,30,11,17,9,16,4,14,5,10,18,23,16,31,1};

After collapse

Part 1
00011000 00011111 00010110 00011000 00000001 00001110 00100000 00010101 
00000100 00011110 00000010 00010111 00010110 00011111 00011111 00001001 
00001110 00011110 00000110 00000101 00000110 00001000 00000111 00011110 
00001110 00010000 00011001 00001001 00010000 00010001 00001001 00001011 
00000111 00001101 00001101 00010000 00000110 00011101 00001100 00010001 
00000010 00011110 00010100 00000011 00010000 00001101 00001100 00001001 
00001011 00001100 00000101 00010010 00010010 00000010 00001111 00010000 
00010110 00000110 00011101 00010100 00000101 00001001 00011101 00000100 
00011111 00001010 00010111 00010011 00100000 00010011 00001011 00001110 
00010111 00001100 00001111 00011111 00000101 00000100 00001010 00000101 
00000001 00010001 00011010 00000111 00001000 00010101 00001111 00001010 
00000101 00011100 00000001 00001001 00000111 00001000 00000010 00010010 

Part 2
00011111 00000100 00000001 00010100 00001101 00000100 00010001 00010111 
00000110 00010010 00011100 00000111 00011100 00100000 00000110 00010000 
00011111 00011001 00010001 00000110 00000111 00010101 00100000 00011111 
00010010 00010011 00000001 00000001 00001110 00010011 00010101 00000001

Second round of compression:

01100001 01111100 01011000 01100001 00000100 00111000 10000001 01010101 
00010011 01111001 00001000 01011101 01011011 01111101 01111100 00100101 
00111011 01111000 00011001 00010100 00011001 00100000 00011101 01111011 
00111000 01000001 01100101 00100100 01000001 01000110 00100100 00101101 
00011101 00110100 00110111 01000001 00011011 01110100 00110001 01000100 
00001010 01111010 01010000 00001111 01000000 00110100 00110010 00100100 
00101101 00110001 00010101 01001000 01001000 00001001 00111110 01000001 
01011011 00011010 01110100 01010001 00010101 00100101 01110100 00010011 
01111111 00101001 01011101 01001110 10000011 01001101 00101100 00111011 
01011101 00110001 00111100 01111100 00010100 00010001 00101001 00010100 
00000100 01000100 01101000 00011100 00100011 01010100 00111101 00101000 
00010110 01110011 00000101 00100101 00011110 00100011 00001001 01001001

Stored result:

01010101 10000001 00111000 00000100 01100001 01011000 01111100 01100001 
00100101 01111100 01111101 01011011 01011101 00001000 01111001 00010011 
01111011 00011101 00100000 00011001 00010100 00011001 01111000 00111011 
00101101 00100100 01000110 01000001 00100100 01100101 01000001 00111000 
01000100 00110001 01110100 00011011 01000001 00110111 00110100 00011101 
00100100 00110010 00110100 01000000 00001111 01010000 01111010 00001010 
01000001 00111110 00001001 01001000 01001000 00010101 00110001 00101101 
00010011 01110100 00100101 00010101 01010001 01110100 00011010 01011011 
00111011 00101100 01001101 10000011 01001110 01011101 00101001 01111111 
00010100 00101001 00010001 00010100 01111100 00111100 00110001 01011101 
00101000 00111101 01010100 00100011 00011100 01101000 01000100 00000100 
01001001 00001001 00100011 00011110 00100101 00000101 01110011 00010110

The results above were obtained by running the testEncodeDecode test in org.apache.lucene.backward_codecs.lucene84.TestForDeltaUtil.

The steps are as follows:

1. Collapse, for example collapse8

  private static void collapse8(long[] arr) {
    for (int i = 0; i < 16; ++i) {
      arr[i] =
          (arr[i] << 56)
              | (arr[16 + i] << 48)
              | (arr[32 + i] << 40)
              | (arr[48 + i] << 32)
              | (arr[64 + i] << 24)
              | (arr[80 + i] << 16)
              | (arr[96 + i] << 8)
              | arr[112 + i];
    }
  }

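2. Second compression step: use every bit of each long

Take bitsPerValue = 3 as an example. Using every bit of each long requires (3*128)/64 = 6 longs. Every 8-bit lane can hold two 3-bit values, which leaves 2 bits over. These 2 bits have to be combined with the 2 leftover bits of the next long in order to hold the remaining values.

3. Finally, the bytes have to be reversed, because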

// Java longs are big endian and we want to read little endian longs, so we need to reverse
// bytes
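
The three steps can be sketched end to end for the simplest case, bitsPerValue = 4, using the array from Example 1. This is a simplified illustration, not Lucene's `ForUtil`: the class name and `encode4` are hypothetical, and the packing loop is hard-coded for 4 bits, where two values fill each 8-bit lane exactly and no leftover bits remain.

```java
final class PackFourBitsSketch {
  // The 128 values of Example 1 (every value fits in 4 bits).
  static final long[] EXAMPLE_1 = {
    12, 5, 2, 8, 7, 7, 1, 9, 11, 7, 6, 6, 6, 6, 3, 3,
    1, 1, 1, 6, 3, 7, 5, 4, 5, 10, 7, 8, 6, 3, 6, 7,
    3, 11, 12, 4, 4, 7, 14, 1, 14, 7, 6, 6, 12, 11, 3, 13,
    11, 6, 11, 8, 11, 2, 3, 8, 11, 14, 1, 5, 9, 6, 11, 10,
    13, 7, 11, 9, 4, 1, 8, 3, 14, 11, 7, 4, 1, 4, 4, 8,
    7, 3, 2, 7, 2, 14, 2, 10, 4, 3, 14, 5, 7, 13, 12, 6,
    12, 4, 13, 13, 9, 13, 11, 14, 5, 11, 11, 2, 6, 2, 12, 5,
    14, 12, 10, 3, 10, 6, 3, 14, 3, 6, 3, 13, 1, 13, 8, 10
  };

  // Step 1: fold 128 values into the first 16 longs, one value per 8-bit lane.
  static void collapse8(long[] arr) {
    for (int i = 0; i < 16; ++i) {
      arr[i] = (arr[i] << 56) | (arr[16 + i] << 48) | (arr[32 + i] << 40)
          | (arr[48 + i] << 32) | (arr[64 + i] << 24) | (arr[80 + i] << 16)
          | (arr[96 + i] << 8) | arr[112 + i];
    }
  }

  // Steps 2 and 3, hard-coded for bitsPerValue = 4.
  static long[] encode4(long[] values) {
    long[] arr = values.clone();
    collapse8(arr);
    long[] out = new long[8];
    for (int i = 0; i < 8; ++i) {
      long packed = (arr[i] << 4) | arr[8 + i]; // two 4-bit values per lane
      out[i] = Long.reverseBytes(packed);       // step 3: reverse the byte order
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Long.toHexString(encode4(EXAMPLE_1)[0])); // e3c574debb3e15cb
  }
}
```

The printed value, e3 c5 74 de bb 3e 15 cb, corresponds to the first row of the on-disk listing in Example 1 (11100011 11000101 ...).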

org.apache.lucene.codecs.MultiLevelSkipListReader#init

  /** Initializes the reader, for reuse on a new term. */
  public void init(long skipPointer, int df) throws IOException {
    this.skipPointer[0] = skipPointer;
    this.docCount = df;
    assert skipPointer >= 0 && skipPointer <= skipStream[0].length() 
    : "invalid skip pointer: " + skipPointer + ", length=" + skipStream[0].length();
    Arrays.fill(skipDoc, 0);
    Arrays.fill(numSkipped, 0);
    Arrays.fill(childPointer, 0);
    
    for (int i = 1; i < numberOfSkipLevels; i++) {
      skipStream[i] = null;
    }
    loadSkipLevels();
  }
  
  /** Loads the skip levels  */
  private void loadSkipLevels() throws IOException {
    if (docCount <= skipInterval[0]) {
      numberOfSkipLevels = 1;
    } else {
      numberOfSkipLevels = 1+MathUtil.log(docCount/skipInterval[0], skipMultiplier);
    }

    if (numberOfSkipLevels > maxNumberOfSkipLevels) {
      numberOfSkipLevels = maxNumberOfSkipLevels;
    }

    skipStream[0].seek(skipPointer[0]);
    
    int toBuffer = numberOfLevelsToBuffer;
    
    for (int i = numberOfSkipLevels - 1; i > 0; i--) {
      // the length of the current level
      long length = skipStream[0].readVLong();
      
      // the start pointer of the current level
      skipPointer[i] = skipStream[0].getFilePointer();
      if (toBuffer > 0) {
        // buffer this level
        skipStream[i] = new SkipBuffer(skipStream[0], (int) length);
        toBuffer--;
      } else {
        // clone this stream, it is already at the start of the current level
        skipStream[i] = skipStream[0].clone();
        if (inputIsBuffered && length < BufferedIndexInput.BUFFER_SIZE) {
          ((BufferedIndexInput) skipStream[i]).setBufferSize(Math.max(BufferedIndexInput.MIN_BUFFER_SIZE, (int) length));
        }
        
        // move base stream beyond the current level
        skipStream[0].seek(skipStream[0].getFilePointer() + length);
      }
    }
   
    // use base stream for the lowest level
    skipPointer[0] = skipStream[0].getFilePointer();
  }
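
The level count computed at the top of `loadSkipLevels` can be reproduced in isolation. The sketch below is a hedged illustration: `floorLog` is a hypothetical stand-in for `MathUtil.log` (assumed to return the floor of the logarithm), and the parameter values 128, 8 and 10 in `main` are assumptions chosen for the example, not values taken from the text above.

```java
final class SkipLevelsSketch {
  // Assumed behaviour of MathUtil.log: floor(log_base(x)), 0 for x < base.
  static int floorLog(long x, int base) {
    int ret = 0;
    while (x >= base) {
      x /= base;
      ret++;
    }
    return ret;
  }

  static int numberOfSkipLevels(int docCount, int skipInterval, int skipMultiplier, int maxLevels) {
    int levels;
    if (docCount <= skipInterval) {
      levels = 1;
    } else {
      levels = 1 + floorLog(docCount / skipInterval, skipMultiplier);
    }
    return Math.min(levels, maxLevels);
  }

  public static void main(String[] args) {
    // Assumed parameters: skipInterval = 128, skipMultiplier = 8, at most 10 levels.
    System.out.println(numberOfSkipLevels(100, 128, 8, 10));    // 1
    System.out.println(numberOfSkipLevels(1024, 128, 8, 10));   // 2
    System.out.println(numberOfSkipLevels(100000, 128, 8, 10)); // 4
  }
}
```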

    @Override
    public int advance(int target) throws IOException {
      if (target > nextSkipDoc) {
        advanceShallow(target);
      }
      if (docBufferUpto == BLOCK_SIZE) {
        if (seekTo >= 0) {
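          // seek to the specified position
          docIn.seek(seekTo);
          isFreqsRead = true; // reset isFreqsRead
          seekTo = -1;
        }

        // refill docBuffer with the next block
        refillDocs();
      }

      // find the first entry in docBuffer that is >= target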
      int next = findFirstGreater(docBuffer, target, docBufferUpto);
      this.doc = (int) docBuffer[next];
      docBufferUpto = next + 1;
      return doc;
    }

skipTo:

org.apache.lucene.codecs.MultiLevelSkipListReader#skipTo

Walk up the levels first, then back down

  /** Skips entries to the first beyond the current whose document number is
   *  greater than or equal to <i>target</i>. Returns the current doc count. 
   */
  public int skipTo(int target) throws IOException {

    // walk up the levels until highest level is found that has a skip
    // for this target
    int level = 0;
    while (level < numberOfSkipLevels - 1 && target > skipDoc[level + 1]) {
      level++;
    }    

    while (level >= 0) {
      if (target > skipDoc[level]) {
        if (!loadNextSkip(level)) {
          continue;
        }
      } else {
        // no more skips on this level, go down one level
        if (level > 0 && lastChildPointer > skipStream[level - 1].getFilePointer()) {
          seekChild(level - 1);
        } 
        level--;
      }
    }
    
    return numSkipped[0] - skipInterval[0] - 1;
  }
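
The idea behind the multi-level skip can be illustrated with a much simpler, stateless in-memory structure. The sketch below is not how `MultiLevelSkipListReader` works internally (the reader is stateful, streams its levels from disk, and first walks up from its current position); all names are hypothetical. It only shows the descent: take large strides on the highest level, then refine on each lower level.

```java
final class SkipListSketch {
  private final int[][] levels; // levels[k][j]: last doc ID covered by entry j on level k
  private final int multiplier;

  SkipListSketch(int[] blockLastDocs, int multiplier, int numLevels) {
    this.multiplier = multiplier;
    levels = new int[numLevels][];
    levels[0] = blockLastDocs;
    // Every `multiplier`-th entry of a level is promoted to the level above.
    for (int k = 1; k < numLevels; k++) {
      int[] below = levels[k - 1];
      levels[k] = new int[below.length / multiplier];
      for (int j = 0; j < levels[k].length; j++) {
        levels[k][j] = below[(j + 1) * multiplier - 1];
      }
    }
  }

  // Number of blocks whose last doc ID is < target, i.e. blocks that can be skipped entirely.
  int blocksBefore(int target) {
    int stride = 1;
    for (int k = 1; k < levels.length; k++) {
      stride *= multiplier;
    }
    int pos = 0; // number of level-0 blocks skipped so far
    for (int k = levels.length - 1; k >= 0; k--, stride /= multiplier) {
      while (pos / stride < levels[k].length && levels[k][pos / stride] < target) {
        pos += stride;
      }
    }
    return pos;
  }

  public static void main(String[] args) {
    int[] lastDocs = {10, 20, 30, 40, 50, 60, 70, 80};
    SkipListSketch skips = new SkipListSketch(lastDocs, 2, 3);
    System.out.println(skips.blocksBefore(55)); // 5
  }
}
```

For target 55 the top level skips four blocks in one step (40 < 55), the middle level cannot advance (60 >= 55), and the bottom level skips one more block (50 < 55).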

Refilling the buffer

    private void refillDocs() throws IOException {
      // Check if we skipped reading the previous block of freqBuffer, and if yes, position docIn after it
      if (isFreqsRead == false) {
        pforUtil.skip(docIn);
        isFreqsRead = true;
      }
      
      final int left = docFreq - blockUpto;
      assert left >= 0;

      if (left >= BLOCK_SIZE) {
        forDeltaUtil.decodeAndPrefixSum(docIn, accum, docBuffer);

        if (indexHasFreq) {
          if (needsFreq) {
            isFreqsRead = false;
          } else {
            pforUtil.skip(docIn); // skip over freqBuffer if we don't need them at all
          }
        }
        blockUpto += BLOCK_SIZE;
      } else if (docFreq == 1) {
        docBuffer[0] = singletonDocID;
        freqBuffer[0] = totalTermFreq;
        docBuffer[1] = NO_MORE_DOCS;
        blockUpto++;
      } else {
        // Read vInts:
        readVIntBlock(docIn, docBuffer, freqBuffer, left, indexHasFreq);
        prefixSum(docBuffer, left, accum);
        docBuffer[left] = NO_MORE_DOCS;
        blockUpto += left;
      }
      accum = docBuffer[BLOCK_SIZE - 1];
      docBufferUpto = 0;
      assert docBuffer[BLOCK_SIZE] == NO_MORE_DOCS;
    }

Reading:

  void decodeAndPrefixSum(DataInput in, long base, long[] longs) throws IOException {
    final int bitsPerValue = Byte.toUnsignedInt(in.readByte());
    if (bitsPerValue == 0) {
      prefixSumOfOnes(longs, base);
    } else {
      forUtil.decodeAndPrefixSum(bitsPerValue, in, base, longs);
    }
  }

  private static void decode6(DataInput in, long[] tmp, long[] longs) throws IOException {
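    // read the packed data from disk into tmp
    in.readLongs(tmp, 0, 12);
    // extract the upper 6 bits of every byte lane into longs
    shiftLongs(tmp, 12, longs, 0, 2, MASK8_6);
    // keep the lower 2 bits of every byte lane in tmp
    shiftLongs(tmp, 12, tmp, 0, 0, MASK8_2);
    // combine the 2-bit remainders three at a time into the last 4 longs,
    // so that longs ends up holding 16 elements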
    for (int iter = 0, tmpIdx = 0, longsIdx = 12; iter < 4; ++iter, tmpIdx += 3, longsIdx += 1) {
      long l0 = tmp[tmpIdx + 0] << 4;
      l0 |= tmp[tmpIdx + 1] << 2;
      l0 |= tmp[tmpIdx + 2] << 0;
      longs[longsIdx + 0] = l0;
    }
  }

  /**
   * The pattern that this shiftLongs method applies is recognized by the C2 compiler, which
   * generates SIMD instructions for it in order to shift multiple longs at once.
   */
    
  // Shifts right by shift, then ANDs with mask; C2 can compile this loop to SIMD instructions
  private static void shiftLongs(long[] a, int count, long[] b, int bi, int shift, long mask) {
    for (int i = 0; i < count; ++i) {
      b[bi + i] = (a[i] >>> shift) & mask;
    }
  }
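
Independently of any compiler vectorization, a single shift and mask on a long already processes eight byte lanes at once. The sketch below applies `shiftLongs` to the first packed long of Example 1 (the value before byte reversal) and recovers the two collapsed rows. `MASK8_4` is an assumption here, defined by analogy with `MASK8_6` and `MASK8_2` as the low 4 bits of every byte; the class name is hypothetical.

```java
final class ShiftLongsSketch {
  // Assumed definition: the low 4 bits of every 8-bit lane.
  static final long MASK8_4 = 0x0F0F0F0F0F0F0F0FL;

  static void shiftLongs(long[] a, int count, long[] b, int bi, int shift, long mask) {
    for (int i = 0; i < count; ++i) {
      b[bi + i] = (a[i] >>> shift) & mask;
    }
  }

  // Splits packed longs (two 4-bit values per lane) back into upper and lower nibbles.
  static long[] unpack4(long[] packed) {
    long[] out = new long[packed.length * 2];
    shiftLongs(packed, packed.length, out, 0, 4, MASK8_4);             // upper nibbles
    shiftLongs(packed, packed.length, out, packed.length, 0, MASK8_4); // lower nibbles
    return out;
  }

  public static void main(String[] args) {
    long[] out = unpack4(new long[] {0xCB153EBBDE74C5E3L});
    System.out.println(String.format("%016x", out[0])); // 0c01030b0d070c0e
    System.out.println(String.format("%016x", out[1])); // 0b050e0b0e040503
  }
}
```

The two results are rows 1 and 9 of the collapsed listing in Example 1: 12, 1, 3, 11, 13, 7, 12, 14 and 11, 5, 14, 11, 14, 4, 5, 3.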

  private static void expand8To32(long[] arr) {
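    // split the 16 longs apart so that each resulting long holds only two values
    for (int i = 0; i < 16; ++i) {
      long l = arr[i];
      // values 0 + i and 64 + i
      arr[i] = (l >>> 24) & 0x000000FF000000FFL;
      // values 16 + i and 80 + i
      arr[16 + i] = (l >>> 16) & 0x000000FF000000FFL;
      // values 32 + i and 96 + i
      arr[32 + i] = (l >>> 8) & 0x000000FF000000FFL;
      // values 48 + i and 112 + i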
      arr[48 + i] = l & 0x000000FF000000FFL;
    }
  }
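
To see what `expand8To32` does to a single element, the sketch below applies the same four shift-and-mask operations to one collapsed long, namely row 1 of Example 1. The class name and `expandOne` are hypothetical; the masks and shifts are the ones used above.

```java
final class Expand8To32Sketch {
  static final long MASK = 0x000000FF000000FFL;

  // Splits one long of eight 8-bit lanes into four longs of two 32-bit lanes each.
  static long[] expandOne(long l) {
    return new long[] {
      (l >>> 24) & MASK, // values i and 64 + i
      (l >>> 16) & MASK, // values 16 + i and 80 + i
      (l >>> 8) & MASK,  // values 32 + i and 96 + i
      l & MASK           // values 48 + i and 112 + i
    };
  }

  public static void main(String[] args) {
    // Row 1 of the collapsed Example 1: 12, 1, 3, 11, 13, 7, 12, 14
    for (long x : expandOne(0x0C01030B0D070C0EL)) {
      System.out.println((x >>> 32) + " " + (x & 0xFFFFFFFFL));
    }
    // prints: 12 13, then 1 7, then 3 12, then 11 14 (one pair per line)
  }
}
```

Each output long keeps one value in its upper 32 bits and one in its lower 32 bits, which is the layout the prefix-sum code that follows relies on.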

  private static void prefixSum32(long[] arr, long base) {
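    // add base to element 0 (in the upper 32 bits)
    arr[0] += base << 32;
    // prefix-sum the deltas over the 64 packed longs
    innerPrefixSum32(arr);

    expand32(arr);
    // element 63 is the starting point for the second half
    final long l = arr[BLOCK_SIZE / 2 - 1];
    // add it to obtain the real values of the last 64 elements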
    for (int i = BLOCK_SIZE / 2; i < BLOCK_SIZE; ++i) {
      arr[i] += l;
    }
  }


  // accumulates the deltas
  private static void innerPrefixSum32(long[] arr) {
    arr[1] += arr[0];
    arr[2] += arr[1];
    arr[3] += arr[2];
    arr[4] += arr[3];
    arr[5] += arr[4];
    arr[6] += arr[5];
    arr[7] += arr[6];
    arr[8] += arr[7];
    arr[9] += arr[8];
    arr[10] += arr[9];
    arr[11] += arr[10];
    arr[12] += arr[11];
    arr[13] += arr[12];
    arr[14] += arr[13];
    arr[15] += arr[14];
    arr[16] += arr[15];
    arr[17] += arr[16];
    arr[18] += arr[17];
    arr[19] += arr[18];
    arr[20] += arr[19];
    arr[21] += arr[20];
    arr[22] += arr[21];
    arr[23] += arr[22];
    arr[24] += arr[23];
    arr[25] += arr[24];
    arr[26] += arr[25];
    arr[27] += arr[26];
    arr[28] += arr[27];
    arr[29] += arr[28];
    arr[30] += arr[29];
    arr[31] += arr[30];
    arr[32] += arr[31];
    arr[33] += arr[32];
    arr[34] += arr[33];
    arr[35] += arr[34];
    arr[36] += arr[35];
    arr[37] += arr[36];
    arr[38] += arr[37];
    arr[39] += arr[38];
    arr[40] += arr[39];
    arr[41] += arr[40];
    arr[42] += arr[41];
    arr[43] += arr[42];
    arr[44] += arr[43];
    arr[45] += arr[44];
    arr[46] += arr[45];
    arr[47] += arr[46];
    arr[48] += arr[47];
    arr[49] += arr[48];
    arr[50] += arr[49];
    arr[51] += arr[50];
    arr[52] += arr[51];
    arr[53] += arr[52];
    arr[54] += arr[53];
    arr[55] += arr[54];
    arr[56] += arr[55];
    arr[57] += arr[56];
    arr[58] += arr[57];
    arr[59] += arr[58];
    arr[60] += arr[59];
    arr[61] += arr[60];
    arr[62] += arr[61];
    arr[63] += arr[62];
  }


  // splits each long into its two 32-bit halves
  private static void expand32(long[] arr) {
    for (int i = 0; i < 64; ++i) {
      long l = arr[i];
      // upper 32 bits
      arr[i] = l >>> 32;
      // lower 32 bits
      arr[64 + i] = l & 0xFFFFFFFFL;
    }
  }
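
The packed prefix sum can be tried on a small block. The sketch below uses 8 deltas instead of 128 and a loop instead of the unrolled additions; the names are hypothetical, and it assumes every running sum fits in 32 bits so that the lower lane never carries into the upper one. Each long holds delta i in its upper 32 bits and delta half + i in its lower 32 bits, so one pass of additions sums both halves at the same time.

```java
final class PackedPrefixSumSketch {
  static long[] prefixSum(long[] deltas, long base) {
    int half = deltas.length / 2;
    long[] packed = new long[half];
    for (int i = 0; i < half; i++) {
      packed[i] = (deltas[i] << 32) | deltas[half + i]; // two 32-bit lanes per long
    }
    packed[0] += base << 32;        // base only applies to the first half
    for (int i = 1; i < half; i++) {
      packed[i] += packed[i - 1];   // sums both lanes in one addition
    }
    long[] out = new long[deltas.length];
    for (int i = 0; i < half; i++) {
      out[i] = packed[i] >>> 32;               // upper 32 bits
      out[half + i] = packed[i] & 0xFFFFFFFFL; // lower 32 bits
    }
    long carry = out[half - 1];     // last value of the first half
    for (int i = half; i < out.length; i++) {
      out[i] += carry;              // turn the partial sums into real values
    }
    return out;
  }

  public static void main(String[] args) {
    long[] docs = prefixSum(new long[] {1, 2, 3, 4, 5, 6, 7, 8}, 100);
    System.out.println(java.util.Arrays.toString(docs));
    // [101, 103, 106, 110, 115, 121, 128, 136]
  }
}
```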

  // For blocks that contain exceptions (outliers)
  void decode(DataInput in, long[] longs) throws IOException {
    // read the token
    final int token = Byte.toUnsignedInt(in.readByte());
    // lower 5 bits: bitsPerValue
    final int bitsPerValue = token & 0x1f;
    // upper 3 bits: number of exceptions
    final int numExceptions = token >>> 5;
    if (bitsPerValue == 0) {
      Arrays.fill(longs, 0, ForUtil.BLOCK_SIZE, in.readVLong());
    } else {
      // same decoding steps as above
      forUtil.decode(bitsPerValue, in, longs);
    }
    for (int i = 0; i < numExceptions; ++i) {
      // OR the exception's high bits into the value at the stored index
      longs[Byte.toUnsignedInt(in.readByte())] |= Byte.toUnsignedLong(in.readByte()) << bitsPerValue;
    }
  }
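
The exception handling at the end of `decode` is easier to follow next to its encoding counterpart. The sketch below is a simplified round trip with hypothetical names: on the write side, values that do not fit in `bitsPerValue` bits are truncated and their high bits are recorded as (index, high bits) byte pairs; on the read side, the same OR as in `decode` restores them. It assumes the high bits fit in one byte, which is what the `maxBitsRequired - 8` rule in `PForUtil#encode` is there to guarantee.

```java
import java.io.ByteArrayOutputStream;

final class PatchExceptionsSketch {
  // Truncates oversized values in place and returns (index, highBits) byte pairs.
  static byte[] extractExceptions(long[] values, int bitsPerValue) {
    long maxUnpatched = (1L << bitsPerValue) - 1;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < values.length; i++) {
      if (values[i] > maxUnpatched) {
        out.write(i);                                  // position in the block
        out.write((int) (values[i] >>> bitsPerValue)); // bits above bitsPerValue
        values[i] &= maxUnpatched;                     // keep only the low bits
      }
    }
    return out.toByteArray();
  }

  // Mirrors the loop at the end of decode: OR the high bits back in.
  static void applyExceptions(long[] values, int bitsPerValue, byte[] exceptions) {
    for (int i = 0; i < exceptions.length; i += 2) {
      values[Byte.toUnsignedInt(exceptions[i])] |=
          Byte.toUnsignedLong(exceptions[i + 1]) << bitsPerValue;
    }
  }

  public static void main(String[] args) {
    long[] freqs = {3, 1, 1000, 2};
    byte[] exceptions = extractExceptions(freqs, 9);
    System.out.println(freqs[2]); // 488 (1000 & 511)
    applyExceptions(freqs, 9, exceptions);
    System.out.println(freqs[2]); // 1000
  }
}
```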

org.apache.lucene.backward_codecs.lucene84.Lucene84PostingsWriter

public void startDoc(int docID, int termDocFreq) throws IOException {
    // Have collected a block of docs, and get a new doc.
    // Should write skip data as well as postings list for
    // current block.
    if (lastBlockDocID != -1 && docBufferUpto == 0) {
      skipWriter.bufferSkip(
          lastBlockDocID,
          competitiveFreqNormAccumulator,
          docCount,
          lastBlockPosFP,
          lastBlockPayFP,
          lastBlockPosBufferUpto,
          lastBlockPayloadByteUpto);
      competitiveFreqNormAccumulator.clear();
    }

    final int docDelta = docID - lastDocID;

    if (docID < 0 || (docCount > 0 && docDelta <= 0)) {
      throw new CorruptIndexException(
          "docs out of order (" + docID + " <= " + lastDocID + " )", docOut);
    }

    docDeltaBuffer[docBufferUpto] = docDelta;
    if (writeFreqs) {
      freqBuffer[docBufferUpto] = termDocFreq;
    }

    docBufferUpto++;
    docCount++;

    if (docBufferUpto == BLOCK_SIZE) {
      // call encodeDeltas directly
      forDeltaUtil.encodeDeltas(docDeltaBuffer, docOut);
      if (writeFreqs) {
        // exceptions (outliers) are only considered when writing freqs; see the references for details
        pforUtil.encode(freqBuffer, docOut);
      }
      // NOTE: don't set docBufferUpto back to 0 here;
      // finishDoc will do so (because it needs to see that
      // the block was filled so it can save skip data)
    }

    lastDocID = docID;
    lastPosition = 0;
    lastStartOffset = 0;

    long norm;
    if (fieldHasNorms) {
      boolean found = norms.advanceExact(docID);
      if (found == false) {
        // This can happen if indexing hits a problem after adding a doc to the
        // postings but before buffering the norm. Such documents are written
        // deleted and will go away on the first merge.
        norm = 1L;
      } else {
        norm = norms.longValue();
        assert norm != 0 : docID;
      }
    } else {
      norm = 1L;
    }

    competitiveFreqNormAccumulator.add(writeFreqs ? termDocFreq : 1, norm);
  }

References

http://paperhub.s3.amazonaws.com/7558905a56f370848a04fa349dd8bb9d.pdf

The PForDelta inverted-index compression algorithm: same basic assumption as Huffman compression - bonelee - cnblogs https://www.cnblogs.com/bonelee/p/6882088.html

Inverted-index compression: the improved PForDelta algorithm - 胡潇 - cnblogs https://www.cnblogs.com/huxiao-tee/p/4644422.html#_label1

SIMD-based PFOR-DELTA decompression and lookup - Zhihu https://zhuanlan.zhihu.com/p/63662886

What is pipeline-friendly code? - pennyliang's column, CSDN https://blog.csdn.net/pennyliang/article/details/5785020

An introduction to the New PForDelta index compression algorithm and its optimization with SIMD - Alibaba Cloud Developer Community https://developer.aliyun.com/article/563081

C++ performance juicer: loop unrolling - Zhihu https://zhuanlan.zhihu.com/p/37582101
