Lucene NumericRangeQuery的初步理解

最新推荐文章于 2021-11-23 21:23:19 发布

njit_peiyuan

最新推荐文章于 2021-11-23 21:23:19 发布

阅读量463

点赞数 1

文章标签： java python 数据结构与算法

本文链接：https://blog.csdn.net/weixin_34161064/article/details/85106774

版权

理解NumericRangeQuery, 首先需要理解Lucene是如何存储数值类型. 文本初步探讨了Int和Float两种数值类型在Lucene中的存储实现,数值类型的分词原理, 最后给出NumericRangeQuery的简单理解.

Lucene最初设计是实现全文检索功能, 即只处理字符串. 因此, 在处理数值时, Lucene也是将数值编码为字符串。

将数值转换成字符串, Lucene-5.2.0对应的实现类为org.apache.lucene.util.NumericUtils。

其编码的方式如下：

Int类型的编码：

public static void main(String[] args){
    BytesRefBuilder act = new BytesRefBuilder();
    NumericUtils.intToPrefixCodedBytes(1, 0, act);

    BytesRef ref = act.get();
    System.out.println(ref.length);
}

可以发现NumericUtils把Int类型编码为6byte. 其中的1byte用于区别原数据类型为Int还是Long,

SHIFT_START_INT  = 0x60;
SHIFT_START_LONG = 0x20;

另外的5byte表示原数. 我们知道, Int是32位, 即4byte. 为什么这里需要5byte呢?

我们先思考另一个问题: 如何将Int转码成字符串, 并且保证其顺序呢? 即如果两个整数x,y编码成字符串a,b 要保证: Integer.compare(x,y) = String.compare(a,b)

首先,整数的取值范围(-2147483648,2147483647).

插一句, 除去符号位, -2147483648的补码与0的补码是一样的, 实际上2147483648是溢出了的. 换个角度 -2147483648 = -0 关于0和-2147483648的编码，可以看出，除符号位外，两者是一样的

public static void intToBytesRef(){

        BytesRefBuilder act1 = new BytesRefBuilder();
        NumericUtils.intToPrefixCodedBytes(Integer.MIN_VALUE, 0, act1);
        BytesRef ref1 = act1.get();
        System.out.println(ref1);

        BytesRefBuilder act2 = new BytesRefBuilder();
        NumericUtils.intToPrefixCodedBytes(0, 0, act2);
        BytesRef ref2 = act2.get();
        System.out.println(ref2.toString());
}

OK, 思路回到如何将Int转码成字符串, 并且保证其顺序的问题. 如果我们单独只关注正数和负数, 那么会发现:

对于正数, 其补码范围为: 0x00 00 00 01(1)到0x7f ff ff ff(2147483647), 是有序的, 保证了: Integer.compare(x,y) = String.compare(a,b).

对于负数, 其补码范围为: 0x80 00 00 00(-2147483648)到0xff ff ff ff(-1), 是有序的, 保证了: Integer.compare(x,y) = String.compare(a,b).

使用python的struct包, 可以很方便地查看一个整数的补码:

>>> from struct import *
>>> pack('>i',-2147483648)
'\x80\x00\x00\x00'
>>> pack('>i',0)
'\x00\x00\x00\x00'

如果希望直接查看32-bit的二进制码, 如下:

>>>"".join([bin(ord(i))[2:].rjust(8,'0') for i in pack('>i', -2)])
'11111111111111111111111111111110'

还有一个问题: 从整体上, 负数得到的编码是大于正数得到的编码, 这就不符合Integer.compare(x,y) = String.compare(a,b). 如何处理这一情况呢?

int sortableBits = val ^ 0x80000000;

采用二进制数的异域操作, 将正整数与负整数的符号位交换一下即可. 这样就可以保证整数编码后的字符串整体有序了. 所以这里取名sortableBits

接下来就回到将Int编码为5-byte的问题. For that integer values (32 bit or 64 bit) are made unsigned and the bits are converted to ASCII chars with each 7 bit.即每7bit为了一个单位.

这是因为Lucene保存Unicode时使用的是UTF-8编码，这种编码的特点是，unicode值为0-127的字符使用一个字节编码。其实我们可以把32位的int看出5个7位的整数，这样的utf8编码就只有5个字节了.

到这里, 再看NumericUtils.intToPrefixCodedBytes()的代码就会很清晰了.

  public static void intToPrefixCodedBytes(final int val, final int shift, final BytesRefBuilder bytes) {
    // ensure shift is 0..31
    if ((shift & ~0x1f) != 0) {
      throw new IllegalArgumentException("Illegal shift value, must be 0..31; got shift=" + shift);
    }
    int nChars = (((31-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
    bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
    bytes.grow(NumericUtils.BUF_SIZE_LONG);  // use the max
    bytes.setByteAt(0, (byte)(SHIFT_START_INT + shift));
    int sortableBits = val ^ 0x80000000;
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
      sortableBits >>>= 7;
    }
  }

关于shift参数, 由于是前缀编码PrefixCodedBytes, shift用于处理前缀问题,与本文讨论的主题无关, 暂不考虑.

浮点数(Float/Double)在计算机中的存储存储遵循IEEE-754标准. 通常我们用到的是单精度(float)和双精度(double)这两种,对应的字节数是4byte和和8byte. 下面以Float为例, 来了解计算机是如何存储浮点数. IEEE 754-1985 将存储空间分成三个部分，从左到右(最高位到最低位)的顺序依次是：符号位(sign)、exponent(指数位)、fraction(分数位)。其中sign占1-bit, exponent占8-bit, fraction占23-bit。

对于单精度: 1-8-23 (32)；对于双精度: 1-11-52 (64) 例如单精度浮点数5.5，二进制表示如下:

------------------------------------------------
|   0 |1000 0001 |011 0000 0000 0000 0000 0000 |
------------------------------------------------
|Sign | exponent |        fraction             |
------------------------------------------------

接下来,我们逆向思考: 上面这样的二进制数, 如何转换才得到5.5的呢? 首先给出计算公式:

v = (-1)^s * 2^E * M

首先处理符号位 s=0, 所以 (-1)^0 = 1 ；

然后处理指数位. 指数位单独拎出来计算, 其值为

>>> int('10000001',2)129

2^E = 2^(129-127) = 4 ; 为什么要减去127呢? 这里的指数位采用的是biased exponent, 翻译过来就是有偏移的指数(本来应该是129, 无端减去127, 当然偏移了). 本来指数的取值范围为[-127,127], 但是为了方便计算机对不同浮点数进行大小比较, 将指数偏移127, 使得负指数也表示成了一个正数.

最后处理分数位 23-bit fraction的处理与指数位不同, 我总结的8字秘诀就是exponent看值, fraction数个. 即对于23-bit fraction从左到右,

第 1位: 2^(-1) = 0.5

第 2位: 2^(-2) = 0.25 . .

第10位: 2^(-10) = 0.0009765625

. .

第23位: 2^(-23)= 1.1920928955078125e-07

所以对于fraction 011 0000 0000 0000 0000 0000

f = 1*2^(-2) + 1*2^(-3) = 0.375; 
M = f + 1 = 1.375

综上所述: 5.5 = 1 * 4 * 1.375

其实可以证明, fraction最大值近似为1. 即2^(-1) +2^(-2) + ... + 2^(-n)的极限为1.

对于fraction, 其值M的计算规则需要考虑exponent. 根据exponent的取值分为3种情况: e = 0 和 e =[1,254] 和 e=255. 由于Float的exponent只有8位, 所以其最大值为255.

e=[1,254] 是通常情况, 覆盖了99%以上的浮点数. 我们称之为规格化的值, 此时 M= 1 + f

e=0 是第一种特殊情况, 我们称之为非规格化的值, 此时 M = f

e=255是第二种特殊情况, 若fraction中23-bit全是0，表示无穷大(infinite); 否则表示NaN(Not a Number)

为了能够多看几个例子, 多做几个实验, 从而对这个转化过程形成感觉. 用python实现了两个简单的函数. 一个是将浮点数转换成二进制字符串, 一个是将二进制字符串转换成浮点数.感谢stackoverflow贡献了如此精妙的实现方法.

>>> import struct
>>> def float2bin(num):
...   return ''.join(bin(ord(c)).replace('0b', '').rjust(8, '0') for c in struct.pack('!f', num))
... 
>>> 
>>> def bin2float(bits):
...   return struct.unpack('f',struct.pack('I',int(bits,2)))
... 
>>> float2bin(0.1)
'00111101110011001100110011001101'
>>> float2bin(1.0)
'00111111100000000000000000000000'
>>> float2bin(0.5)
'00111111000000000000000000000000'
>>> float2bin(2.0)
'01000000000000000000000000000000'
>>>

当然, 也可以用Java查看一个Float的二进制字符串

System.out.println(Integer.toBinaryString(Float.floatToIntBits(5.5f)));

多解析几个实例后, 就能够理解Float的二进制存储机制.

了解了Float的存储原理后, 再学习Lucene对Float的处理方法, 就简明很多了.

首先看一个简单的浮点数存储和检索的例子

package learn.learn;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;

public class NumericRangeQueryDemo {
    static Directory d = null;
    public static void index() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter iw = new IndexWriter(d, conf);
        Document doc = new Document();
        doc.add(new FloatField("f2", 2.5f, Field.Store.YES));
        iw.addDocument(doc);
        doc = new Document();
        iw.close();

    }

    public static void search() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);

        BytesRefBuilder act = new BytesRefBuilder();
        NumericUtils.intToPrefixCodedBytes(NumericUtils.floatToSortableInt(2.5f), 0, act);

        TopDocs n = searcher.search(new TermQuery(new Term("f2",act.get())), 2);
        System.out.println(n.totalHits);
        Document doc = searcher.doc(0);
        System.out.println(doc);

    }

    public static void main(String[] args) throws IOException {
        index();
        search();
    }
}

前面讲到Lucene处理Int类型是将int转换成6字节有序的字符串. 对于Float类型, 则是先转换成int, 然后按int类型的方式处理. 关键点在于NumericUtils.floatToSortableInt() . 题外话: 理解Lucene处理数值的原理,关键点在于理解NumericUtils类.

分析Float型数据, 与前面分析Int型数据一样, 正负拆开. 如果这个float是正数，那么把它看成int也是正数，而且根据前面的说明，指数在前，所以顺序也是保持好的。如果它是个负数，把它看成int也是负数，但是顺序就反了. 例如:

 float2bin(-1.0) = '10111111100000000000000000000000'
 float2bin(-2.0) = '11000000000000000000000000000000'

-1.0 > -2.0 但是, '10111111100000000000000000000000' < '11000000000000000000000000000000' 因此NumericUtils.floatToSortableInt()作了相应的处理

  // Lucene-5.2.0
  public static int sortableFloatBits(int bits) {
    return bits ^ (bits >> 31) & 0x7fffffff;
  }

根据运算符优先级, 计算顺序为bits ^ ( (bits >> 31) & 0x7fffffff ); 注意这里的位移是算术位移, 即如果bits为负数，　则左移31位后，就变成了0xffffffff.

即 符号位不变，　正数保持, 负数翻转. 这样做虽然会导致负数二进制字符串 > 正数二进制字符串的情况出现, 但是NumericUtils.intToPrefixCoded()会做稍后的处理, 所以最终保证了 Integer.compare(x,y) = String.compare(a,b)

前面了解到Lucene对Int类型和Float类型处理机制如下: 1. 对于是Float类型, 将Float转成Int, 然后按照Int类型处理. 2. 对于Int类型, 将其转换成Sortable Int, 然后按照7-bit为一个单位转换成长度为6的字节数组.

本节的目标是了解Lucene对数值类型进行分词的过程. 了解了这一过程, 就很容易理解Lucene数值类型的查询原理, 比如NumericRangeQuery.

我们知道, Lucene对英文分词, 基本上就是按空格进行切分, 比如"show me the code", 分词后的形式就是["show", "me", "the", "code"] 数值类型分词与文本分词不同, 比如整数1, 转换成字节数组后,其值为[60 8 0 0 0 1](注意数组中是16进制, 而非10进制)

// Lucene-5.2.0
public static void main(String[] args) throws IOException {
    BytesRefBuilder bytes = new BytesRefBuilder();
    NumericUtils.intToPrefixCodedBytes(1, 0, bytes);
    System.out.println(bytes.toBytesRef()); // [60 8 0 0 0 1]
}

对于[60 8 0 0 0 1], 如果按照默认的precisionStep=8, 则分词的结果为:

[60 8 0 0 0 1]
[68 4 0 0 0]
[70 2 0 0]
[78 1 0]

分词的代码为:

public static void main(String[] args) throws IOException {
    final NumericTokenStream stream= new NumericTokenStream(8).setIntValue(1);
    final TermToBytesRefAttribute bytesAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    final TypeAttribute typeAtt = stream.getAttribute(TypeAttribute.class);
    final NumericTokenStream.NumericTermAttribute numericAtt = stream.getAttribute(NumericTokenStream.NumericTermAttribute.class);
    final BytesRef bytes = bytesAtt.getBytesRef();
    stream.reset();
    for (int shift=0; shift<32; shift+=NumericUtils.PRECISION_STEP_DEFAULT_32) {
      stream.incrementToken();
      bytesAtt.fillBytesRef();
      System.out.println(bytesAtt.getBytesRef());
    }
    stream.end();
    stream.close();

}

数值分词其实就是拆分前缀. 上面的结果不像通常理解的前缀关系,这是因为添加了shift信息. 如果同时对多个数进行分词, 排序后对比, 就能体会到前缀的意义了.

前缀的比特数由precisionStep决定, 对于NumericUtils.intToPrefixCodedBytes(), 对应着参数shift

  public static void intToPrefixCodedBytes(final int val, final int shift, final BytesRefBuilder bytes) {
    // ensure shift is 0..31
    if ((shift & ~0x1f) != 0) {
      throw new IllegalArgumentException("Illegal shift value, must be 0..31; got shift=" + shift);
    }
    int nChars = (((31-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
    bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
    bytes.grow(NumericUtils.BUF_SIZE_LONG);  // use the max
    bytes.setByteAt(0, (byte)(SHIFT_START_INT + shift));
    int sortableBits = val ^ 0x80000000;
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
      sortableBits >>>= 7;
    }
  }

上面的代码, 在Lucene处理Int类型数据的方法与原理一文中也贴过. 再看上面的代码, 是否觉得清晰了许多?

前缀具有什么优良的特性呢? 在数据结构上, 前缀属于典型的以空间换时间策略. 即通过存储空间的消耗,换取到极短的查询时间. 如果学习过Trie和线段数, 树状数组这些数据结构, 可能会更容易理解Lucene这里的做法.

(说明,本图来源于博客: http://blog.csdn.net/zhufenglonglove/article/details/51700898 致谢! )

我们知道, Lucene存储的是倒排索引, 即term ---> [docid, docid, ... ] . 假设有如下的需求: 查询价格在[421, 448]的商品. 假如商品的价格信息如下: A=423, B=445 对于前缀索引, 其索引结构是这样的:

423---> [A]
425 --> [A]
42  --> [A,B]
4   --> [A,B]

在查询的时候, 只需要查询前缀4, 就可以了.

为了对Lucene的前缀更有感觉, 可以对一系列的整数进行分词, 然后查看分词的结果. 代码如下:

    public static void tokenAnalyzer(Set<String> list , int val) throws IOException{

        final NumericTokenStream stream= new NumericTokenStream(8).setIntValue(val);
        final TermToBytesRefAttribute bytesAtt = stream.getAttribute(TermToBytesRefAttribute.class);
        final TypeAttribute typeAtt = stream.getAttribute(TypeAttribute.class);
        final NumericTokenStream.NumericTermAttribute numericAtt = stream.getAttribute(NumericTokenStream.NumericTermAttribute.class);
        final BytesRef bytes = bytesAtt.getBytesRef();
        stream.reset();
        for (int shift=0; shift<32; shift+=NumericUtils.PRECISION_STEP_DEFAULT_32) {
          stream.incrementToken();
          bytesAtt.fillBytesRef();
          list.add(bytesAtt.getBytesRef().toString());

        }
        stream.end();
        stream.close();
    }

    public static void main(String[] args) throws IOException {
        TreeSet<String> list = new TreeSet<String>();
        for(int i=1;i<512;i++){
            tokenAnalyzer(list, i);
        }
        System.out.println("size of list is "+list.size());
        for(String s: list)System.out.println(s);
    }

结果如下:

size of list is 515
[60 8 0 0 0 10]
    ...
[60 8 0 0 3 e]
[60 8 0 0 3 f]
[68 4 0 0 0]
[68 4 0 0 1]
[70 2 0 0]
[78 1 0]

如果查询区间[1,255]的文档信息, 则只需要查询[68 4 0 0 0]就OK了. 如果单纯地使用BooleanQuery,不构建前缀索引, 则需要拼接255个TermQuery.两者之间的查询性能, 可想而之.

前面说到, 前缀的缺点就是空间消耗. 这一点可以在建立索引时通过precisionStep参数来调整. precisionStep越小, 空间消耗越大, precisionStep越大, 空间消耗越小. 需要注意的是, 在业务中,并不是precisionStep越小, 查询性能越好. 究竟precisionStep设置多大才能达到最佳的平衡点, 需要具体业务, 具体对待.

对于NumericRangeQuery的分析, NumericUtils.splitRange()是核心

搜索的样例代码如下:

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NumericRangeQueryDemo {
    static Directory d = null;
    public static void index() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter iw = new IndexWriter(d, conf);
        Document doc =null;
        for(int i=0;i<512;i++)
        {
            doc = new Document();
             doc.add(new IntField("f2", i, Field.Store.YES));
             iw.addDocument(doc);
        }

        iw.close();

    }

    public static void search() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);

        Query  query = NumericRangeQuery.newIntRange("f2", 0, 255, true, true);
        TopDocs n = searcher.search(query, 2);
        System.out.println(n.totalHits);
        Document doc = searcher.doc(0);
        System.out.println(doc);

    }

    public static void main(String[] args) throws IOException {
        index();
        search();
    }
}

我们先不管splitRange()代码的细节, 先根据前面理解到的知识, 来预测对于某一个[minBound,maxBound], spiltRange后在NumericRangeQuery.NumericRangeTermsEnum.rangeBounds中生成的结果是什么?

例如:

当: precisitionStep=8, [minBound,maxBound]=[0, 16777215]时, rangeBounds=[[78 1 0], [78 1 0]]
当: precisitionStep=8, [minBound,maxBound]=[0, 65535]时, rangeBounds=[70 2 0 0], [70 2 0 0]
当: precisitionStep=8, [minBound,maxBound]=[0, 255]时, rangeBounds=[[68 4 0 0 0], [68 4 0 0 0]]
当: precisitionStep=8, [minBound,maxBound]=[0,1023]时, rangeBounds=[[68 4 0 0 0], [68 4 0 0 3]]
当: precisitionStep=8, [minBound,maxBound]=[0, 511]时, rangeBounds=[[68 4 0 0 0], [68 4 0 0 1]]
当: precisitionStep=8, [minBound,maxBound]=[0, 254]时, rangeBounds=[[60 8 0 0 0 0], [60 8 0 0 1 7e]]
当: precisitionStep=8, [minBound,maxBound]=[0, 127]时, rangeBounds=[[60 8 0 0 0 0], [60 8 0 0 0 7f]]
当: precisitionStep=8, [minBound,maxBound]=[10, 1023]时, rangeBounds=[[60 8 0 0 0 a], [60 8 0 0 1 7f], [68 4 0 0 1], [68 4 0 0 3]]

研究几个案例后, 关于splitRange()的逻辑, 就比较有感觉了. 例如: [minBound,maxBound]=[2, 1024]

首先会处理: [2,255], [1024,1024], 生成 [60 8 0 0 0 2], [60 8 0 0 1 7f], [60 8 0 0 8 0], [60 8 0 0 8 0]

然后会处理: [256,768], 生成 [68 4 0 0 1], [68 4 0 0 3] 所以最后splitRange生成的结果是[[60 8 0 0 0 2], [60 8 0 0 1 7f], [60 8 0 0 8 0], [60 8 0 0 8 0],[68 4 0 0 1], [68 4 0 0 3]] 结束.

总体的策略是先枝叶, 后主干.

通过上面的案例,结合前面理解的NumericTokenStream, 可以发现,在precisionStep=8时, [0,65535] 区间管理如下:

                 [0,65535]

[0,255], [256,511], ... , [62324,62579], [62580, 65535]

取值区间确定后, 当拿到的term比较多时, 一般是超过16个, 则使用bitset, 否则使用booleanQuery, 代码逻辑见MultiTermQueryConstantScoreWrapper.createWeight(). 在MultiTermQueryConstantScoreWrapper.createWeight()创建的ConstantScoreWeight对象的rewrite()方法.

最后, 再看具体代码的实现, 理解作者编码的细节, 每个变量的作用.

  /** This helper does the splitting for both 32 and 64 bit. */
  private static void splitRange(
    final Object builder, final int valSize,
    final int precisionStep, long minBound, long maxBound
  ) {
    if (precisionStep < 1)
      throw new IllegalArgumentException("precisionStep must be >=1");
    if (minBound > maxBound) return;
    for (int shift=0; ; shift += precisionStep) {
      // calculate new bounds for inner precision
      /*
       * diff的作用就是将每一轮的处理控制在算精度范围内, 以precisitionStep=8为例: 
       * diff=2^8
       * diff=2^16
       * diff=2^24
       * 即每一次扩大8-位
       * */
      final long diff = 1L << (shift+precisionStep),
        /*
         * mask, 直译就是掩码, 以precisionStep=8为例:
         * mask = 0x00000000000000ff
         * mask = 0x000000000000ff00
         * mask = 0x0000000000ff0000
         * */
        mask = ((1L<<precisionStep) - 1L) << shift;
      /*
       * hasLower/hasUpper 用于判别当前边界是枝叶还是树干. 主要作用于第一轮, 即shift=0时
       * */
      final boolean
        hasLower = (minBound & mask) != 0L,
        hasUpper = (maxBound & mask) != mask;
      /*
       * nextMinBound/nexMaxBound  可以形象理解为标记断点
       * */
      final long
        nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask,
        nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
      final boolean
        lowerWrapped = nextMinBound < minBound,
        upperWrapped = nextMaxBound > maxBound;
      /*
       * 这下面的逻辑就是真正的剪枝了, 需要注意的是, addRange会重新调整maxBound.
       * 例如: 对于区间[0,1024], 在这里看到的split后的区间是[0,768], [1024,1024],
       * 实际上,在addRange函数中,通过  maxBound |= (1L << shift) - 1L; 将区间修正为
       * [0,1023], [1024,1024]
       * */
      if (shift+precisionStep>=valSize || nextMinBound>nextMaxBound || lowerWrapped || upperWrapped) {
        // We are in the lowest precision or the next precision is not available.
        addRange(builder, valSize, minBound, maxBound, shift);
        // exit the split recursion loop
        break;
      }

      if (hasLower)
        addRange(builder, valSize, minBound, minBound | mask, shift);
      if (hasUpper)
        addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

      // recurse to next precision
      minBound = nextMinBound;
      maxBound = nextMaxBound;
    }
  }

参考:

http://blog.csdn.net/zhufenglonglove/article/details/51700898

http://blog.csdn.net/debiann/article/details/23012699

http://brokendreams.iteye.com/blog/2256239

njit_peiyuan

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Lucene NumericRangeQuery的初步理解

理解NumericRangeQuery, 首先需要理解Lucene是如何存储数值类型. 文本初步探讨了Int和Float两种数值类型在Lucene中的存储实现,数值类型的分词原理, 最后给出NumericRangeQuery的简单理解.Lucene最初设计是实现全文检索功能, 即只处理字符串. 因此, 在处理数值时, Lucene也是将数值编码为字符串。将数值转换成字符串, Lucene-5.2...
复制链接

扫一扫