Sorting in Lucene and Its Use of the Cache

Latest recommended article published on 2021-02-25 05:07:09

chengqianl

Article link: https://blog.csdn.net/chengqianl/article/details/83708449

Lucene implements sorting through FieldComparator and its subclasses. Using StringOrdValComparator as the example, this post explains in detail how Lucene's sorting is built on the FieldCache.

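Idea: use one array to store, for every document, the largest term that the document has in a given field. The array index is the docId, and the value is an index into a second array that holds all terms of that field.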

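Inside the StringOrdValComparator class:
private String[] lookup; holds the values of all terms of a given field.

private int[] order; is indexed by docId, and each value is an index into lookup: the position in lookup of the largest term that the document has in that field.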

When sorting, each document's order[docId] is looked up by docId, and the documents are compared by those values.
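
The idea above can be sketched in a few lines of plain Java. This is only an illustration, not Lucene code: the class OrdSortSketch, its sample data, and the method names are hypothetical. It assumes lookup is sorted and that slot 0 is reserved for documents without a term, as in the createValue code shown later.

```java
import java.util.Arrays;

// Illustrative sketch only: OrdSortSketch and its members are hypothetical names,
// not part of Lucene.
class OrdSortSketch {
    // LOOKUP[0] is reserved for "no term"; the remaining entries are the sorted terms.
    static final String[] LOOKUP = {null, "apple", "banana", "cherry"};
    // ORDER[docId] is the index into LOOKUP of that document's term.
    static final int[] ORDER = {2, 0, 3, 1};

    // Compare two documents by ordinal only. No string comparison is needed
    // because LOOKUP is sorted, so a smaller ordinal means a smaller term.
    static int compareDocs(int doc1, int doc2) {
        return Integer.compare(ORDER[doc1], ORDER[doc2]);
    }

    // Return all docIds sorted by their term, using only the ordinals.
    static Integer[] sortedDocIds() {
        Integer[] docs = new Integer[ORDER.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Arrays.sort(docs, OrdSortSketch::compareDocs);
        return docs;
    }

    public static void main(String[] args) {
        // doc 1 has no term (ordinal 0), so it sorts first.
        System.out.println(Arrays.toString(sortedDocIds())); // prints [1, 3, 0, 2]
    }
}
```

Comparing two ints is much cheaper than comparing two strings, which is the point of keeping the ordinals in an array indexed by docId.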

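Initialization
IndexSearcher#search(Weight weight, Filter filter, Collector collector) calls collector.setNextReader(subReaders[i], docStarts[i]); for each sub-reader. Here the collector is TopFieldCollector#OneComparatorNonScoringCollector, and its setNextReader
calls FieldComparator#StringOrdValComparator.setNextReader.
That method calls
FieldCache.DEFAULT.getStringIndex(reader, field)
which loads all term information and frq information for the field. For a String field, terms are read starting from "" until all of the field's terms have been read.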
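
Because getStringIndex takes both a reader and a field, the expensive term scan only has to run once per (reader, field) pair if the result is cached. The following is a minimal sketch of such a cache, not the FieldCacheImpl source: PerReaderCacheSketch, its loads counter, and the use of a WeakHashMap keyed by the reader are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

// Hypothetical per-reader, per-field cache; an illustration, not Lucene's implementation.
class PerReaderCacheSketch<V> {
    // Weak keys (an assumption here) let an entry disappear once its reader is
    // no longer referenced anywhere else.
    private final Map<Object, Map<String, V>> cache = new WeakHashMap<>();
    // Counts how many times the expensive loader actually ran.
    int loads = 0;

    synchronized V get(Object reader, String field, Supplier<V> loader) {
        Map<String, V> perReader = cache.computeIfAbsent(reader, r -> new HashMap<>());
        V value = perReader.get(field);
        if (value == null) {
            value = loader.get(); // expensive: scans every term of the field
            perReader.put(field, value);
            loads++;
        }
        return value;
    }
}
```

With this shape, a second search that sorts on the same field against the same reader reuses the arrays instead of rebuilding them.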

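The algorithm is implemented in FieldCacheImpl#StringIndexCache.createValue.
The code is shown below. Its result is that each document is mapped to the largest term it has in the field.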
@Override
protected Object createValue(IndexReader reader, Entry entryKey)
    throws IOException {
  String field = StringHelper.intern(entryKey.field);
  final int[] retArray = new int[reader.maxDoc()];
  String[] mterms = new String[reader.maxDoc()+1];
  TermDocs termDocs = reader.termDocs();
  TermEnum termEnum = reader.terms (new Term (field));
  int t = 0;  // current term number

  // an entry for documents that have no terms in this field
  // should a document with no terms be at top or bottom?
  // this puts them at the top - if it is changed, FieldDocSortedHitQueue
  // needs to change as well.
  mterms[t++] = null;

  try {
    do {
      Term term = termEnum.term();
      if (term==null || term.field() != field) break;

      // store term text
      // we expect that there is at most one term per document
      if (t >= mterms.length) throw new RuntimeException ("there are more terms than " +
              "documents in field \"" + field + "\", but it's impossible to sort on " +
              "tokenized fields");
      mterms[t] = term.text();

      termDocs.seek (termEnum);
      while (termDocs.next()) {
        retArray[termDocs.doc()] = t;
      }

      t++;
    } while (termEnum.next());
  } finally {
    termDocs.close();
    termEnum.close();
  }

  if (t == 0) {
    // if there are no terms, make the term array
    // have a single null entry
    mterms = new String[1];
  } else if (t < mterms.length) {
    // if there are less terms than documents,
    // trim off the dead array space
    String[] terms = new String[t];
    System.arraycopy (mterms, 0, terms, 0, t);
    mterms = terms;
  }

  StringIndex value = new StringIndex (retArray, mterms);
  return value;
}
};
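
To see why each document ends up with its largest term, the loop above can be imitated without an index. In this sketch a TreeMap stands in for the sorted term enumeration and an int[] of docIds stands in for each term's postings. StringIndexSketch and build are hypothetical names, and the "more terms than documents" check from the original is left out for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the order/lookup pair; not a Lucene class.
class StringIndexSketch {
    final int[] order;     // order[docId] -> index into lookup
    final String[] lookup; // lookup[0] == null means "document has no term"

    StringIndexSketch(int[] order, String[] lookup) {
        this.order = order;
        this.lookup = lookup;
    }

    // postings maps each term to the docIds containing it; TreeMap iterates
    // terms in ascending order, like the term enumeration in the code above.
    static StringIndexSketch build(TreeMap<String, int[]> postings, int maxDoc) {
        int[] order = new int[maxDoc]; // defaults to 0, the "no term" slot
        String[] lookup = new String[postings.size() + 1];
        int t = 1; // slot 0 stays null
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            lookup[t] = e.getKey();
            for (int doc : e.getValue()) {
                // A later (larger) term overwrites an earlier one, so a document
                // with several terms keeps only its largest term.
                order[doc] = t;
            }
            t++;
        }
        return new StringIndexSketch(order, lookup);
    }
}
```

For example, with postings a -> {0, 1}, b -> {1}, c -> {3} and maxDoc 4, document 1 first receives ordinal 1 and is then overwritten with ordinal 2, while document 2 keeps ordinal 0 because it has no term.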

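Sorting implementation
Lucene keeps query results in a priority queue, and the queue entries are compared by org.apache.lucene.search.FieldComparator.StringOrdValComparator. The comparison is done in the StringOrdValComparator.compare method: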

@Override
public int compare(int slot1, int slot2) {
  if (readerGen[slot1] == readerGen[slot2]) {
    int cmp = ords[slot1] - ords[slot2];
    if (cmp != 0) {
      return cmp;
    }
  }

  final String val1 = values[slot1];
  final String val2 = values[slot2];
  if (val1 == null) {
    if (val2 == null) {
      return 0;
    }
    return -1;
  } else if (val2 == null) {
    return 1;
  }
  return val1.compareTo(val2);
}
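
The readerGen check matters because each reader gets its own lookup array, so ordinals taken from different readers are not comparable. The sketch below repeats the same logic as a static method over explicit arrays so it can be run in isolation. SlotCompareSketch and the sample data are hypothetical.

```java
// Standalone rendering of the compare logic above; the class name is hypothetical.
class SlotCompareSketch {
    static int compare(int[] readerGen, int[] ords, String[] values, int slot1, int slot2) {
        // Ordinals are only meaningful when both slots came from the same reader.
        if (readerGen[slot1] == readerGen[slot2]) {
            int cmp = ords[slot1] - ords[slot2];
            if (cmp != 0) {
                return cmp;
            }
        }
        // Otherwise fall back to comparing the strings; null sorts first.
        String v1 = values[slot1];
        String v2 = values[slot2];
        if (v1 == null) {
            return v2 == null ? 0 : -1;
        }
        if (v2 == null) {
            return 1;
        }
        return v1.compareTo(v2);
    }

    public static void main(String[] args) {
        int[] readerGen = {0, 1, 1};
        int[] ords = {2, 1, 3};
        String[] values = {"pear", "zebra", "zoo"};
        // Slots 0 and 1 come from different readers: ordinal 2 vs 1 would wrongly
        // put "pear" after "zebra", so the strings are compared instead.
        System.out.println(compare(readerGen, ords, values, 0, 1) < 0); // prints true
        // Slots 1 and 2 share a reader, so the ordinals decide.
        System.out.println(compare(readerGen, ords, values, 1, 2) < 0); // prints true
    }
}
```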

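Once the priority queue is full, each new hit is first compared with the bottom entry of the queue. If the hit is more competitive than the bottom entry (compareBottom, multiplied by reverseMul, returns a positive value), it replaces the bottom entry and the priority queue is then re-adjusted.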

This is the queueFull branch of TopFieldCollector#OneComparatorNonScoringCollector.collect:
@Override
public void collect(int doc) throws IOException {
  ++totalHits;
  if (queueFull) {
    if ((reverseMul * comparator.compareBottom(doc)) <= 0) {
      // since docs are visited in doc Id order, if compare is 0, it means
      // this document is larger than anything else in the queue, and
      // therefore not competitive.
      return;
    }

    // This hit is competitive - replace bottom element in queue & adjustTop
    comparator.copy(bottom.slot, doc);
    updateBottom(doc);
    comparator.setBottom(bottom.slot);
  } else {
    // Startup transient: queue hasn't gathered numHits yet
    final int slot = totalHits - 1;
    // Copy hit into queue
    comparator.copy(slot, doc);
    add(slot, doc, Float.NaN);
    if (queueFull) {
      comparator.setBottom(bottom.slot);
    }
  }
}
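
The same "compare with the bottom, replace if competitive" pattern can be sketched with java.util.PriorityQueue. This is a simplified illustration and not the TopFieldCollector code: TopNSketch is a hypothetical class, it collects plain ordinals instead of documents, it always sorts ascending (no reverseMul), and it assumes numHits is at least 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical bounded top-N collector; keeps the numHits smallest ordinals seen.
class TopNSketch {
    private final int numHits; // assumed to be >= 1
    // Reverse order puts the largest (least competitive) ordinal at the head,
    // which plays the role of the "bottom" entry.
    private final PriorityQueue<Integer> queue = new PriorityQueue<>(Collections.reverseOrder());
    int totalHits = 0;

    TopNSketch(int numHits) {
        this.numHits = numHits;
    }

    void collect(int ord) {
        totalHits++;
        if (queue.size() == numHits) {
            // Queue is full: a hit that is not strictly better than the bottom
            // is not competitive (ties lose, since earlier hits came first).
            if (ord >= queue.peek()) {
                return;
            }
            queue.poll();   // drop the current bottom
            queue.add(ord); // the queue re-adjusts, exposing the new bottom
        } else {
            // Startup: the queue has not gathered numHits entries yet.
            queue.add(ord);
        }
    }

    // The retained ordinals in ascending (best-first) order.
    List<Integer> results() {
        List<Integer> out = new ArrayList<>(queue);
        Collections.sort(out);
        return out;
    }
}
```

Collecting the ordinals 5, 1, 4, 2, 3 with numHits = 3 leaves 1, 2, 3 in the queue while totalHits still counts all five hits. Most hits cost only a single comparison against the bottom entry.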