lucene NumericUtils

Xiao_Qiang_

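The main classes involved are:

NumericRangeQuery: the numeric range query class; it contains the numeric term enumerator NumericRangeTermEnum

NumericUtils: numeric helper routines used at both index time and search time

NumericTokenStream: turns numeric field values into tokens at index time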

NumericField

1. Core functions

1.1 Numeric conversion function intToPrefixCoded

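// Key function: encodes a numeric value as a string using prefix coding.

public static int intToPrefixCoded(final int val, final int shift, final char[] buffer)
{
    if (shift > 31 || shift < 0)
        throw new IllegalArgumentException("Illegal shift value, must be 0..31");

    // Worked example for val = 1135626, shift = 0:
    //   val              = 0000 0000 0001 0001 0101 0100 0000 1010
    //   0x80000000       = 1000 0000 0000 0000 0000 0000 0000 0000
    //   val ^ 0x80000000 = 1000 0000 0001 0001 0101 0100 0000 1010
    // Read as a signed int the result is negative. Undoing two's complement
    // (invert and add one) gives its magnitude: subtract one, then invert.
    //   minus one        = 1000 0000 0001 0001 0101 0100 0000 1001
    //   inverted         = 0111 1111 1110 1110 1010 1011 1111 0110
    int nChars = (31 - shift) / 7 + 1, len = nChars + 1;
    buffer[0] = (char) (shift); // the first char records the shift

    int sortableBits = val ^ 0x80000000; // XOR flips the sign bit so unsigned order matches signed order
    sortableBits >>>= shift;             // logical (unsigned) right shift drops the low `shift` bits

    System.out.println(sortableBits);    // debug output

    while (nChars >= 1)
    {
        // Store 7 bits per character for good efficiency when UTF-8 encoding.
        // The whole number is right-justified so that lucene can prefix-encode
        // the terms more efficiently.

        buffer[nChars--] = (char) (sortableBits & 0x7f); // keep the low seven bits (mask 111 1111)

        sortableBits >>>= 7; // shift right by seven bits
    }

    // Low-index chars hold the high-order bits, so string comparison starts at the most significant bits.

    return len;
}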

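For example, take
int nMinLongitude = 1135626;
int nMaxLongitude = 1135632;
and encode both as strings with intToPrefixCoded (shift = 0).
Because low-index characters hold the high-order bits, numbers that agree in their high-order bits produce strings that share a prefix.

After the leading shift character, the character codes of the two strings, from first to last, are
8 0 69 40 10 and
8 0 69 40 16
The two strings share the prefix 8 0 69 40, which matters because Lucene prefix-compresses terms when it stores them.
The string terms for these two ints therefore share a prefix, and their lexicographic order matches the numeric order.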
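
The numbers above can be reproduced with a small standalone program. This is a minimal sketch: the class name PrefixCodedDemo and the helper codes() are made up for illustration, and the encoder is a copy of the listing above with the debug print removed.

```java
import java.util.Arrays;

final class PrefixCodedDemo {
    // Same logic as the intToPrefixCoded listing, without the debug print.
    static int intToPrefixCoded(int val, int shift, char[] buffer) {
        if (shift > 31 || shift < 0)
            throw new IllegalArgumentException("Illegal shift value, must be 0..31");
        int nChars = (31 - shift) / 7 + 1, len = nChars + 1;
        buffer[0] = (char) shift;
        int sortableBits = val ^ 0x80000000;
        sortableBits >>>= shift;
        while (nChars >= 1) {
            buffer[nChars--] = (char) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return len;
    }

    // Hypothetical helper: character codes of the encoded term, skipping the leading shift char.
    static int[] codes(int val, int shift) {
        char[] buf = new char[6]; // at most 1 shift char + 5 data chars for an int
        int len = intToPrefixCoded(val, shift, buf);
        int[] out = new int[len - 1];
        for (int i = 1; i < len; i++) {
            out[i - 1] = buf[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(codes(1135626, 0))); // [8, 0, 69, 40, 10]
        System.out.println(Arrays.toString(codes(1135632, 0))); // [8, 0, 69, 40, 16]
    }
}
```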

1.2 Bitmap marking function

How a bit is set in the bitmap

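public void set(long index)
{
    int wordNum = expandingWordNum(index); // index of the 64-bit word (long) that holds this bit
    int bit = (int)index & 0x3f;           // position of the bit inside that word (0..63)
    long bitmask = 1L << bit;
    bits[wordNum] |= bitmask;              // set the bit to 1
}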
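
To see the word/bit arithmetic in isolation, here is a minimal sketch of a growable bitmap. MiniBitSet is a made-up class for illustration, and the growth policy is an assumption that stands in for expandingWordNum; it is not the OpenBitSet implementation.

```java
import java.util.Arrays;

final class MiniBitSet {
    private long[] bits = new long[1];

    // Sets the bit at `index`, growing the backing array when needed.
    void set(long index) {
        int wordNum = (int) (index >> 6); // 64 bits per word, so divide by 64
        if (wordNum >= bits.length) {
            bits = Arrays.copyOf(bits, Math.max(bits.length * 2, wordNum + 1));
        }
        int bit = (int) index & 0x3f;     // remainder modulo 64
        bits[wordNum] |= 1L << bit;
    }

    // Returns true when the bit at `index` has been set.
    boolean get(long index) {
        int wordNum = (int) (index >> 6);
        if (wordNum >= bits.length) {
            return false;
        }
        return (bits[wordNum] & (1L << ((int) index & 0x3f))) != 0;
    }

    public static void main(String[] args) {
        MiniBitSet set = new MiniBitSet();
        set.set(3);
        set.set(64);
        set.set(200);
        System.out.println(set.get(3) + " " + set.get(4) + " " + set.get(64) + " " + set.get(200));
        // prints: true false true true
    }
}
```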

Call stack (outermost frame first)
IndexSearcher.search(Weight, Filter, Collector) line: 245
ConstantScoreQuery$ConstantWeight.scorer(IndexReader, boolean, boolean) line: 81
ConstantScoreQuery$ConstantScorer.<init>(ConstantScoreQuery, Similarity, IndexReader, Weight) line: 116
MultiTermQueryWrapperFilter.getDocIdSet(IndexReader) line: 171
MultiTermQueryWrapperFilter$2(MultiTermQueryWrapperFilter$TermGenerator).generate(IndexReader, TermEnum) line: 115
MultiTermQueryWrapperFilter$2.handleDoc(int) line: 169
OpenBitSet.set(long) line: 233

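2. Indexing
At index time, intToPrefixCoded converts each numeric value to a string.
In the encoded result, numbers that agree in their high-order bits yield strings that share a prefix, and string order follows numeric order. A query can therefore start scanning at the smallest value in the range; every term read after that
either shares the prefix and has a larger value, or simply has a larger value. This is exactly what a range scan needs.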
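
The ordering claim can be checked directly. The sketch below (class and method names are made up for illustration) encodes a handful of ints at shift 0 with the same bit layout as intToPrefixCoded, sorts the strings lexicographically, and compares the result with the numerically sorted input. It includes negative values, which is where flipping the sign bit matters.

```java
import java.util.Arrays;

final class SortOrderDemo {
    // Encodes val at shift 0 using the same layout as intToPrefixCoded.
    static String encode(int val) {
        char[] buf = new char[6];
        buf[0] = (char) 0; // shift char
        int sortableBits = val ^ 0x80000000;
        for (int i = 5; i >= 1; i--) {
            buf[i] = (char) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return new String(buf);
    }

    // True when lexicographic order of the encoded terms equals numeric order of the values.
    static boolean sameOrder(int[] values) {
        int[] sortedNums = values.clone();
        Arrays.sort(sortedNums);
        String[] terms = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            terms[i] = encode(values[i]);
        }
        Arrays.sort(terms); // String.compareTo compares char by char
        for (int i = 0; i < terms.length; i++) {
            if (!terms[i].equals(encode(sortedNums[i]))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] values = {1135632, -7, 0, 1135626, Integer.MIN_VALUE, Integer.MAX_VALUE, 42};
        System.out.println(sameOrder(values)); // prints: true
    }
}
```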

3. Searching
// Build the numeric query
Integer min = Integer.valueOf(nMinLongitude);
Integer max = Integer.valueOf(nMaxLongitude);

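// Create the NumericRangeQuery; the precision step of the query can also be set
Query query = NumericRangeQuery.newIntRange(field, min, max, true, true); // the two flags say whether the lower and upper bounds are inclusive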

// Rewriting the query creates NumericRangeTermEnum, the term enumerator used for numeric queries
The call stack is as follows
IndexSearcher(Searcher).createWeight(Query) line: 232
NumericRangeQuery(Query).weight(Searcher) line: 98
IndexSearcher.rewrite(Query) line: 306
NumericRangeQuery(MultiTermQuery).rewrite(IndexReader) line: 382
MultiTermQuery$1(MultiTermQuery$ConstantScoreAutoRewrite).rewrite(IndexReader, MultiTermQuery) line: 227
NumericRangeQuery.getEnum(IndexReader) line: 302

protected FilteredTermEnum getEnum(final IndexReader reader)
{
    // Create the term enumerator
    return new NumericRangeTermEnum(reader);
}

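// While the enumerator is being constructed, the numeric range is split according to the precision step into several sub-ranges
NumericUtils.splitIntRange(new NumericUtils.IntRangeBuilder()
// The split values are stored in rangeBounds; each sub-range has a lower and an upper bound

// Call stack, continuing from the previous step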
NumericRangeQuery$NumericRangeTermEnum.<init>(NumericRangeQuery, IndexReader) line: 449
NumericUtils.splitIntRange(NumericUtils$IntRangeBuilder, int, int, int) line: 359
NumericUtils.splitRange(Object, int, int, long, long) line: 367
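
To make the splitting step concrete, here is a minimal sketch of how a range can be cut into sub-ranges aligned to the precision step. It is written from the general idea behind splitRange and is not copied from the Lucene source: the class RangeSplitSketch, the Block type and the covered() helper are made up for illustration, and the inputs are assumed to be non-negative values that are already in sortable form.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeSplitSketch {
    // One sub-range [min, max], to be matched against terms indexed at the given shift.
    static final class Block {
        final long min, max;
        final int shift;

        Block(long min, long max, int shift) {
            this.min = min;
            this.max = max | ((1L << shift) - 1L); // fill in the low bits dropped at this shift
            this.shift = shift;
        }

        long size() {
            return max - min + 1;
        }
    }

    static List<Block> split(long minBound, long maxBound, int precisionStep, int valSize) {
        List<Block> out = new ArrayList<>();
        if (minBound > maxBound) {
            return out;
        }
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;   // lower edge not aligned at this level
            boolean hasUpper = (maxBound & mask) != mask; // upper edge not aligned at this level
            long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
            long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
            boolean wrapped = nextMin < minBound || nextMax > maxBound;
            if (shift + precisionStep >= valSize || nextMin > nextMax || wrapped) {
                out.add(new Block(minBound, maxBound, shift)); // nothing left for a coarser level
                break;
            }
            if (hasLower) out.add(new Block(minBound, minBound | mask, shift));
            if (hasUpper) out.add(new Block(maxBound & ~mask, maxBound, shift));
            minBound = nextMin;
            maxBound = nextMax;
        }
        return out;
    }

    // Total number of values covered by the blocks.
    static long covered(List<Block> blocks) {
        long total = 0;
        for (Block b : blocks) total += b.size();
        return total;
    }

    public static void main(String[] args) {
        for (Block b : split(5, 300, 4, 32)) {
            System.out.println(b.min + ".." + b.max + " at shift " + b.shift);
        }
    }
}
```

For 5..300 with a step of 4 this yields three blocks: 5..15 and 288..300 at shift 0, and 16..287 at shift 4. The longitude range from the earlier example, 1135626..1135632, is narrow enough to stay a single block at shift 0.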


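Iterate over all matching terms and, for each one, set bits in the bitmap from the term's postings

// The execution call stack is as follows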
IndexSearcher(Searcher).search(Query, Collector) line: 130
IndexSearcher.search(Weight, Filter, Collector) line: 245
ConstantScoreQuery$ConstantWeight.scorer(IndexReader, boolean, boolean) line: 81
ConstantScoreQuery$ConstantScorer.<init>(ConstantScoreQuery, Similarity, IndexReader, Weight) line: 116
MultiTermQueryWrapperFilter.getDocIdSet(IndexReader) line: 171
MultiTermQueryWrapperFilter$2(MultiTermQueryWrapperFilter$TermGenerator).generate(IndexReader, TermEnum) line: 100

// Bitmap marking (TermGenerator is an inner class of MultiTermQueryWrapperFilter)
abstract class TermGenerator
{
    public void generate(IndexReader reader, TermEnum enumerator) throws IOException
    {
        final int[] docs = new int[32];
        final int[] freqs = new int[32];

        TermDocs termDocs = reader.termDocs();

        try {
            int termCount = 0;
            do {
                Term term = enumerator.term(); // here the enumerator is a NumericRangeTermEnum

                if (term == null)
                    break;
                termCount++;
                termDocs.seek(term);

                while (true) {
                    // Read the next batch of this term's postings
                    final int count = termDocs.read(docs, freqs);

                    if (count != 0)
                    {
                        for (int i = 0; i < count; i++)
                        {
                            handleDoc(docs[i]); // mark the document
                        }
                    } else {
                        break;
                    }
                }
            } while (enumerator.next()); // advance to the next matching term

            query.incTotalNumberOfTerms(termCount); // record how many terms were visited

        } finally {
            termDocs.close();
        }
    }
    abstract public void handleDoc(int doc);
}
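
The same loop can be mimicked without Lucene. In the sketch below, a TreeMap from term to document ids stands in for the index, and java.util.BitSet stands in for OpenBitSet. Both are assumptions made for illustration, as are the class and method names.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

final class MarkDocsSketch {
    // Marks every document that contains a term in [lowerTerm, upperTerm], bounds inclusive.
    static BitSet markRange(TreeMap<String, int[]> postings, String lowerTerm, String upperTerm) {
        BitSet docs = new BitSet();
        // subMap walks the terms in sorted order, like the term enumerator does.
        for (Map.Entry<String, int[]> e : postings.subMap(lowerTerm, true, upperTerm, true).entrySet()) {
            for (int doc : e.getValue()) { // the postings of this term
                docs.set(doc);             // plays the role of handleDoc(doc)
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("a", new int[]{0, 3});
        postings.put("b", new int[]{1});
        postings.put("c", new int[]{3, 5});
        postings.put("d", new int[]{2});
        System.out.println(markRange(postings, "b", "c")); // prints: {1, 3, 5}
    }
}
```

A document that appears under several matching terms is still marked only once, which is why a bitmap suits this step.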
