OpenBitSet和OpenBitSetIterator

最新推荐文章于 2024-10-31 16:16:13 发布

chengqianl

最新推荐文章于 2024-10-31 16:16:13 发布

阅读量134

点赞数

分类专栏： lucene 文章标签：算法 lucene Apache Go

本文链接：https://blog.csdn.net/chengqianl/article/details/83762752

版权

lucene 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

OpenBitSet和OpenBitSetIterator

算法的思想是用一个long的数组的index和这个这个数组的某个值的某一位表示一个数，如果是一个long数组，如果不存在重复的情况下，最大可达到64倍的压缩，
算法的实现过程以long OpenBitSet这个类实现的一个上面提到的记录数据的数组

public OpenBitSet(long numBits) {
bits = new long[bits2words(numBits)]; //根据指定的长度创建数组
wlen = bits.length; //记录数组的长度
}

//计算数组的长度，给定一个长度，创建一个数组，长度除以64，通过移位来实现除法的
public static int bits2words(long numBits) {
return (int)(((numBits-1)>>>6)+1);
}

设置值的过程，首先查看数组的大小是否满足，如果满足设置对应的值，对应的值的设置过程是，先计算该数字在数组里面的位置为a，然后计算这个位置的值。
位置的是通过expandingWordNum方法得到的
值的设置的原理是：先计算该数字的后6位的值是多少为b，然后就设置该数组的第a个数的第b位的值为1
/** sets a bit, expanding the set size if necessary */
public void set(long index) {
int wordNum = expandingWordNum(index);
int bit = (int)index & 0x3f;
long bitmask = 1L << bit;
bits[wordNum] |= bitmask;
}

计算该数所在数组的位置，如果超过数组的长度，则通过ensureCapacity扩大数组的长度
protected int expandingWordNum(long index) {
int wordNum = (int)(index >> 6);
if (wordNum>=wlen) {
ensureCapacity(index+1);
wlen = wordNum+1;
}
return wordNum;
}

数组的长度的算法是在org.apache.lucene.util. ArrayUtil 里面实现的

public static int getNextSize(int targetSize) {
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize;
}

还原的算法通过OpenBitSetIterator 实现的构造方法如下，这个对象里面有二个属性，一个是保存long[] bits 的数组一个用来记录长度。

private final long[] arr;

private final int words;

public OpenBitSetIterator(OpenBitSet obs) {
this(obs.getBits(), obs.getNumWords());
}

public OpenBitSetIterator(long[] bits, int numWords) {
arr = bits;
words = numWords;
}

还原的算法：遍历这个数组，取出每个值，由这个值的每个bit的值和这个值的index还原存储的long的值。
计算的过程实际上通过以为的wordShift 加上最后一个byte的每一个位的位置的值。每个byte置的每个位的位置的值是通过一个int的数据表示的，因为一个byte中的bit的位置最多是8 ，所以可以用一个4个的bit表示8。那一个int 就可以表示8个位置。
public int nextDoc() {
if (indexArray == 0) { // indexArray 是一个int形的数据用他的四个bit 就是存储位置信息
if (word != 0) {
word >>>= 8;
wordShift += 8;
}

while (word == 0) {如果
if (++i >= words) {
return curDocId = NO_MORE_DOCS;
}
word = arr[i];
wordShift = -1; // loop invariant code motion should move this
}

// after the first time, should I go with a linear search, or
// stick with the binary search in shift?
shift();
}

int bitIndex = (indexArray & 0x0f) + wordShift;
indexArray >>>= 4;
// should i<<6 be cached as a separate variable?
// it would only save one cycle in the best circumstances.
return curDocId = (i<<6) + bitIndex;
}