Using OpenBitSet and OpenBitSetIterator in TermRangeQuery

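In the rewrite method used by MultiTermQuery, the choice between a filter and a BooleanQuery hinges on the check if (pendingTerms.size() >= termCountLimit || docVisitCount >= docCountCutoff). Here pendingTerms holds the terms matched so far, and docVisitCount is the running sum of their document frequencies (df). If the number of matched terms reaches termCountLimit, or docVisitCount reaches docCountCutoff, the query is rewritten to a MultiTermQueryWrapperFilter; otherwise it is rewritten to a BooleanQuery whose term clauses are combined with OR (Occur.SHOULD). MultiTermQueryWrapperFilter collects docIds into an OpenBitSet and later reads them back out with an OpenBitSetIterator.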

@Override
public Query rewrite(IndexReader reader, MultiTermQuery query) throws IOException {
  // Get the enum and start visiting terms. If we
  // exhaust the enum before hitting either of the
  // cutoffs, we use ConstantBooleanQueryRewrite; else,
  // ConstantFilterRewrite:
  final Collection<Term> pendingTerms = new ArrayList<Term>();
  final int docCountCutoff = (int) ((docCountPercent / 100.) * reader.maxDoc());
  final int termCountLimit = Math.min(BooleanQuery.getMaxClauseCount(), termCountCutoff);
  int docVisitCount = 0;

  FilteredTermEnum enumerator = query.getEnum(reader);
  try {
    while (true) {
      Term t = enumerator.term();
      if (t != null) {
        pendingTerms.add(t);
        // Loading the TermInfo from the terms dict here
        // should not be costly, because 1) the
        // query/filter will load the TermInfo when it
        // runs, and 2) the terms dict has a cache:
        docVisitCount += reader.docFreq(t);
      }

      if (pendingTerms.size() >= termCountLimit || docVisitCount >= docCountCutoff) {
        // Too many terms -- make a filter.
        Query result = new ConstantScoreQuery(new MultiTermQueryWrapperFilter<MultiTermQuery>(query));
        result.setBoost(query.getBoost());
        return result;
      } else if (!enumerator.next()) {
        // Enumeration is done, and we hit a small
        // enough number of terms & docs -- just make a
        // BooleanQuery, now
        BooleanQuery bq = new BooleanQuery(true);
        for (final Term term : pendingTerms) {
          TermQuery tq = new TermQuery(term);
          bq.add(tq, BooleanClause.Occur.SHOULD);
        }
        // Strip scores
        Query result = new ConstantScoreQuery(new QueryWrapperFilter(bq));
        result.setBoost(query.getBoost());
        query.incTotalNumberOfTerms(pendingTerms.size());
        return result;
      }
    }
  } finally {
    enumerator.close();
  }
}
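
To make the two cutoffs easier to see in isolation, here is a minimal Lucene-free sketch of the same decision. The class RewriteDecisionSketch, the method choose and the enum Choice are hypothetical names made up for illustration; the array docFreqs stands in for calling reader.docFreq(t) on each enumerated term, and the numbers passed in main are illustrative assumptions rather than values taken from the code above.

```java
// Simplified, Lucene-free model of the cutoff check in rewrite().
// All names here are hypothetical; only the arithmetic mirrors the original.
final class RewriteDecisionSketch {
    enum Choice { FILTER, BOOLEAN_QUERY }

    /**
     * @param docFreqs document frequency of each matching term, in enumeration order
     */
    static Choice choose(int[] docFreqs, int maxDoc, double docCountPercent,
                         int termCountCutoff, int maxClauseCount) {
        final int docCountCutoff = (int) ((docCountPercent / 100.) * maxDoc);
        final int termCountLimit = Math.min(maxClauseCount, termCountCutoff);
        int pendingTerms = 0;
        int docVisitCount = 0;
        int i = 0;
        while (true) {
            if (i < docFreqs.length) {
                // Equivalent of pendingTerms.add(t) and docVisitCount += reader.docFreq(t).
                pendingTerms++;
                docVisitCount += docFreqs[i];
            }
            if (pendingTerms >= termCountLimit || docVisitCount >= docCountCutoff) {
                return Choice.FILTER;        // too many terms or too many docs
            }
            i++;
            if (i >= docFreqs.length) {
                return Choice.BOOLEAN_QUERY; // enumeration finished below both cutoffs
            }
        }
    }

    public static void main(String[] args) {
        // Assumed settings: 1,000,000 docs, 0.1 percent, 350 terms, 1024 clauses.
        System.out.println(choose(new int[] {10, 20, 30}, 1_000_000, 0.1, 350, 1024));
        System.out.println(choose(new int[] {600, 600}, 1_000_000, 0.1, 350, 1024));
    }
}
```

With the assumed settings the doc cutoff is roughly 1000, so three rare terms stay a BooleanQuery while two terms with 600 documents each switch to the filter.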

The call path of the code that collects the docIds is shown in the figure below:
[img]http://dl.iteye.com/upload/attachment/349283/6e2c639f-d0ab-3efc-a8d6-d880bfd4e320.jpg[/img]

First, a new OpenBitSet is created, sized to the maxDoc of the current segment. Then the docIds and freqs are read through SegmentTermDocs's
public int read(final int[] docs, final int[] freqs)
method, and a for loop
for (int i = 0; i < count; i++) {
  bitSet.set(docs[i]);
}
puts each docId into the OpenBitSet.
[img]http://dl.iteye.com/upload/attachment/349285/6fec9035-14f2-310c-8712-341d7185dd9f.jpg[/img]
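
As a rough picture of what bitSet.set(docs[i]) does, the sketch below stores one bit per docId in a long[] array. SimpleBitSetSketch is a hypothetical, simplified stand-in written for this post; it is not Lucene's OpenBitSet and leaves out resizing and range checks.

```java
// Minimal stand-in for a long[]-backed bit set; not Lucene's OpenBitSet.
final class SimpleBitSetSketch {
    private final long[] bits;

    SimpleBitSetSketch(int numBits) {
        // One long holds 64 bits, so round up to whole words.
        bits = new long[(numBits + 63) >>> 6];
    }

    void set(int index) {
        // index >> 6 selects the word; Java masks the shift count of a
        // long to its low 6 bits, so 1L << index selects the bit within it.
        bits[index >> 6] |= 1L << index;
    }

    boolean get(int index) {
        return (bits[index >> 6] & (1L << index)) != 0;
    }

    int cardinality() {
        int total = 0;
        for (long word : bits) {
            total += Long.bitCount(word);
        }
        return total;
    }

    // Mirrors the for loop above: copy the first count docIds into the bit set.
    static SimpleBitSetSketch collect(int maxDoc, int[] docs, int count) {
        SimpleBitSetSketch bitSet = new SimpleBitSetSketch(maxDoc);
        for (int i = 0; i < count; i++) {
            bitSet.set(docs[i]);
        }
        return bitSet;
    }

    public static void main(String[] args) {
        SimpleBitSetSketch bitSet = collect(100, new int[] {3, 64, 99, 3}, 4);
        System.out.println(bitSet.get(64) + " " + bitSet.cardinality());
    }
}
```

Setting the same docId twice is harmless, which is why several terms matching the same document end up as a single hit.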


The code, with comments, is as follows:
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  // Returns a TermRangeTermEnum. It first builds a Term from the lower
  // bound string and positions itself at that term in the .tis file. The
  // loop below then walks the terms one by one, and for each term
  // SegmentTermDocs reads the docIds and freqs from the .frq file.
  final TermEnum enumerator = query.getEnum(reader);
  try {
    // if current term in enum is null, the enum is empty -> shortcut
    if (enumerator.term() == null)
      return DocIdSet.EMPTY_DOCIDSET;
    // else fill into a OpenBitSet
    final OpenBitSet bitSet = new OpenBitSet(reader.maxDoc());
    final int[] docs = new int[32];
    final int[] freqs = new int[32];

    // Creates a SegmentTermDocs instance; its read method is called below to read the docIds
    TermDocs termDocs = reader.termDocs();
    try {
      int termCount = 0;
      do {
        Term term = enumerator.term();
        if (term == null)
          break;
        termCount++;
        // SegmentTermDocs seeks to the start of this term's docIds in the .frq file
        termDocs.seek(term);
        while (true) {
          // Read up to 32 docIds at a time into the docs array;
          // if fewer than 32 remain, only the actual number is read
          final int count = termDocs.read(docs, freqs);
          if (count != 0) {
            for (int i = 0; i < count; i++) {
              bitSet.set(docs[i]);
            }
          } else {
            break;
          }
        }

      } while (enumerator.next());

      query.incTotalNumberOfTerms(termCount);

    } finally {
      termDocs.close();
    }
    return bitSet;
  } finally {
    enumerator.close();
  }
}
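
The batched read loop above can be tried without an index. In the sketch below, PostingsCursorSketch is a hypothetical in-memory stand-in for TermDocs that follows the same read(docs, freqs) contract (fill the arrays, return the number of entries written, return 0 when exhausted). java.util.BitSet is swapped in for OpenBitSet, and the frequencies are placeholder values.

```java
import java.util.BitSet;

// Hypothetical in-memory postings reader; not Lucene's TermDocs.
final class PostingsCursorSketch {
    private final int[] postings; // docIds for one term
    private int pos = 0;

    PostingsCursorSketch(int[] postings) {
        this.postings = postings;
    }

    // Writes up to docs.length entries and returns how many were written;
    // returns 0 once this term's postings are exhausted.
    int read(int[] docs, int[] freqs) {
        int count = Math.min(docs.length, postings.length - pos);
        for (int i = 0; i < count; i++) {
            docs[i] = postings[pos++];
            freqs[i] = 1; // placeholder frequency, unused by the caller
        }
        return count;
    }

    // Same shape as the loop in getDocIdSet: for every term, read in
    // batches of 32 and OR the docIds into one bit set.
    static BitSet unionDocs(int[][] postingsPerTerm, int maxDoc) {
        BitSet bitSet = new BitSet(maxDoc);
        int[] docs = new int[32];
        int[] freqs = new int[32];
        for (int[] postings : postingsPerTerm) {
            PostingsCursorSketch cursor = new PostingsCursorSketch(postings);
            while (true) {
                int count = cursor.read(docs, freqs);
                if (count == 0) {
                    break;
                }
                for (int i = 0; i < count; i++) {
                    bitSet.set(docs[i]);
                }
            }
        }
        return bitSet;
    }

    public static void main(String[] args) {
        int[] big = new int[40];
        for (int i = 0; i < big.length; i++) {
            big[i] = 100 + i;
        }
        BitSet result = unionDocs(new int[][] {{1, 5, 9}, {5, 70}, big}, 200);
        System.out.println(result.cardinality());
    }
}
```

A term with 40 postings takes two non-empty reads (32, then 8) before read returns 0 and the loop moves on to the next term.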


Screenshots of the enumerator.next() call are shown below. Here enumerator is a TermRangeTermEnum, which inherits the next method of its parent class FilteredTermEnum.
[img]http://dl.iteye.com/upload/attachment/349287/14f676ba-df01-3cc6-baa9-3cfcd24367c3.jpg[/img]
[img]http://dl.iteye.com/upload/attachment/349289/82f2eb57-96dc-3941-9ce9-ed4966f2a496.jpg[/img]

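FilteredTermEnum's next method is shown below. It uses actualEnum to read the next term from the .tis file and then calls termCompare, an abstract method left for subclasses to implement.
TermRangeTermEnum's implementation compares the term with the upper bound of the range to check whether the enumeration has moved past the range.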
public boolean next() throws IOException {
  if (actualEnum == null) return false; // the actual enumerator is not initialized!
  currentTerm = null;
  while (currentTerm == null) {
    if (endEnum()) return false;
    if (actualEnum.next()) {
      Term term = actualEnum.term();
      if (termCompare(term)) {
        currentTerm = term;
        return true;
      }
    }
    else return false;
  }
  currentTerm = null;
  return false;
}
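
The interplay of next, termCompare and endEnum for a range can be sketched over a plain sorted array of strings. Everything in RangeEnumSketch is hypothetical: the sorted array plays the role of the term dictionary, and the upper bound check is a simplified take on the comparison described above, not TermRangeTermEnum's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical range walk over a sorted "term dictionary"; not Lucene code.
final class RangeEnumSketch {
    static List<String> termsInRange(String[] sortedTerms, String lower, String upper,
                                     boolean includeLower, boolean includeUpper) {
        List<String> accepted = new ArrayList<String>();
        // Position at the first term >= lower, like seeking in the term dictionary.
        int pos = Arrays.binarySearch(sortedTerms, lower);
        if (pos < 0) {
            pos = -pos - 1; // insertion point when lower itself is absent
        }
        boolean endEnum = false;
        for (int i = pos; i < sortedTerms.length && !endEnum; i++) {
            String term = sortedTerms[i];
            if (!includeLower && term.equals(lower)) {
                continue; // exclusive lower bound: skip the bound itself
            }
            int cmp = upper.compareTo(term);
            if (cmp < 0 || (!includeUpper && cmp == 0)) {
                endEnum = true; // past the upper bound: stop enumerating
            } else {
                accepted.add(term);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry", "date", "fig"};
        System.out.println(termsInRange(terms, "banana", "date", true, true));
        System.out.println(termsInRange(terms, "banana", "date", false, false));
    }
}
```

Because the terms are sorted, the first term beyond the upper bound ends the whole enumeration, which is the job endEnum does in the next method above.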

The docIds are read back out of the bit set in ConstantScorer's nextDoc method, as shown in the figure below:
[img]http://dl.iteye.com/upload/attachment/349291/bf32a90a-5425-361b-ab48-216aa983d5ba.jpg[/img]

public int nextDoc() throws IOException {
  return docIdSetIterator.nextDoc();
}
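
To show what reading the docIds back out means, here is a minimal sketch of an iterator over the set bits of a long[] array. BitIteratorSketch is a hypothetical class; it finds bits with Long.numberOfTrailingZeros, which is a simple technique chosen for illustration and does not reproduce the implementation of Lucene's OpenBitSetIterator. The NO_MORE_DOCS sentinel follows the convention of DocIdSetIterator, where it equals Integer.MAX_VALUE.

```java
// Hypothetical iterator over set bits; not Lucene's OpenBitSetIterator.
final class BitIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final long[] bits;
    private int wordIndex = -1; // index of the word currently being drained
    private long word = 0;      // bits of the current word not yet returned

    BitIteratorSketch(long[] bits) {
        this.bits = bits;
    }

    // Returns the next set bit as a docId, in increasing order,
    // or NO_MORE_DOCS when every word has been drained.
    int nextDoc() {
        while (word == 0) {
            if (wordIndex + 1 >= bits.length) {
                return NO_MORE_DOCS;
            }
            word = bits[++wordIndex];
        }
        int bit = Long.numberOfTrailingZeros(word);
        word &= word - 1; // clear the lowest set bit
        return (wordIndex << 6) + bit;
    }

    // Helper for the demo: build the backing words from a list of docIds.
    static long[] wordsFor(int numBits, int... docs) {
        long[] words = new long[(numBits + 63) >>> 6];
        for (int doc : docs) {
            words[doc >> 6] |= 1L << doc;
        }
        return words;
    }

    public static void main(String[] args) {
        BitIteratorSketch it = new BitIteratorSketch(wordsFor(100, 99, 3, 64));
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println(doc);
        }
    }
}
```

The docIds come back in ascending order regardless of the order in which they were set, which is what a scorer's nextDoc contract requires.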