Lucene's Hits class

iteye_7642



Article link: https://blog.csdn.net/iteye_7642/article/details/81722597


Reposted from:
http://daihaixiang.blog.163.com/blog/static/3830134200862394745683/

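About the Hits class.
The Hits class is very important: Lucene caches search results, and that cache is implemented in Hits. Internally, Hits uses an LRU algorithm. HitDoc objects are linked into a doubly-linked list, and an LRU replacement policy keeps track of the Documents the user has accessed most recently.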
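To make the LRU idea concrete, here is a minimal, self-contained sketch. It is not Lucene code: the names LruSketch, Node and loads are invented for illustration, and the string "doc-<id>" merely stands in for a Document read from the index. The list handling follows the pattern described above: unlink the entry, put it at the front, and evict the tail when the cache is over capacity.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative LRU cache on a hand-rolled doubly-linked list (hypothetical names). */
final class LruSketch {
    /** One cache entry; it plays the role HitDoc plays in Hits. */
    static final class Node {
        final int id;
        String doc;      // cached payload; null means "not in the list"
        Node prev, next; // links of the doubly-linked list
        Node(int id) { this.id = id; }
    }

    private final Map<Integer, Node> nodes = new HashMap<>();
    private final int maxDocs;   // capacity, assumed to be at least 1
    private Node first, last;    // first = most recently used, last = least recently used
    private int numDocs = 0;     // entries currently linked into the list
    int loads = 0;               // number of cache misses, for demonstration

    LruSketch(int maxDocs) { this.maxDocs = maxDocs; }

    String doc(int id) {
        Node n = nodes.computeIfAbsent(id, Node::new);
        remove(n);                   // unlink if currently cached
        addToFront(n);               // mark as most recently used
        if (numDocs > maxDocs) {     // over capacity: evict the tail
            Node oldLast = last;
            remove(last);
            oldLast.doc = null;
        }
        if (n.doc == null) {         // cache miss: "load" the document
            n.doc = "doc-" + id;
            loads++;
        }
        return n.doc;
    }

    private void addToFront(Node n) {
        if (first == null) last = n; else first.prev = n;
        n.next = first;
        first = n;
        n.prev = null;
        numDocs++;
    }

    private void remove(Node n) {
        if (n.doc == null) return;   // not in the list
        if (n.next == null) last = n.prev; else n.next.prev = n.prev;
        if (n.prev == null) first = n.next; else n.prev.next = n.next;
        numDocs--;
    }

    public static void main(String[] args) {
        LruSketch cache = new LruSketch(2);
        int[] accesses = {1, 2, 1, 3, 1, 2}; // 3 evicts 2; the final 2 evicts 3
        for (int id : accesses) cache.doc(id);
        System.out.println(cache.loads);     // prints 4
    }
}
```

With a capacity of 2, the access sequence 1, 2, 1, 3, 1, 2 misses four times: entry 2 is the least recently used when 3 arrives, so it is evicted and has to be loaded again at the end.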
Without further ado, here is the implementation of the Hits class.
package org.apache.lucene.search;

import java.io.IOException;
import java.util.Vector;
import java.util.Iterator;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;

public final class Hits {
    private Weight weight;
    private Searcher searcher;
    private Filter filter = null;
    private Sort sort = null;
    private int length; // length of Hits: the total number of results matching the query
    private Vector hitDocs = new Vector(); // cache of search results (HitDoc entries)
    private HitDoc first; // head of LRU cache
    private HitDoc last; // tail of LRU cache
    private int numDocs = 0; // number cached
    private int maxDocs = 200; // max to cache
    Hits(Searcher s, Query q, Filter f) throws IOException {
        weight = q.weight(s);
        searcher = s;
        filter = f;
        // retrieve 100 initially; the cache is empty at this point,
        // so this runs the search and adds the results to the cache
        getMoreDocs(50);
    }

    Hits(Searcher s, Query q, Filter f, Sort o) throws IOException {
        weight = q.weight(s);
        searcher = s;
        filter = f;
        sort = o;
        // retrieve 100 initially; the cache is empty at this point,
        // so this runs the search and adds the results to the cache
        getMoreDocs(50);
    }
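    /**
     * Adds the Documents matching the query to the hitDocs cache.
     */
    private final void getMoreDocs(int min) throws IOException {
        // debug output added for the test below
        System.out.println("Entering getMoreDocs(): hitDocs.size=" + hitDocs.size());
        if (hitDocs.size() > min) {
            min = hitDocs.size();
        }
        // Request twice as many hits as before (100 on the first call).
        // hitDocs only grows by the number of hits the search actually returns.
        int n = min * 2;
        TopDocs topDocs = (sort == null) ? searcher.search(weight, filter, n) : searcher.search(weight, filter, n, sort);
        length = topDocs.totalHits;
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        float scoreNorm = 1.0f;

        if (length > 0 && topDocs.getMaxScore() > 1.0f) {
            scoreNorm = 1.0f / topDocs.getMaxScore();
        }
        // The listing is truncated at this point in the source; the lines from here up to
        // the "if cache is full" check follow the Lucene 2.x source of Hits.
        int end = scoreDocs.length < length ? scoreDocs.length : length;
        for (int i = hitDocs.size(); i < end; i++) {
            hitDocs.addElement(new HitDoc(scoreDocs[i].score * scoreNorm, scoreDocs[i].doc));
        }
    }

    // Returns the total number of hits
    public final int length() {
        return length;
    }

    // Returns the n-th Document, updating the LRU cache
    public final Document doc(int n) throws CorruptIndexException, IOException {
        HitDoc hitDoc = hitDoc(n);
        remove(hitDoc); // remove from list, if there
        addToFront(hitDoc); // add to front of list
        if (numDocs > maxDocs) { // if cache is full
            HitDoc oldLast = last;
            remove(last); // flush last
            oldLast.doc = null; // let doc get gc'd
        }
        if (hitDoc.doc == null) {
            hitDoc.doc = searcher.doc(hitDoc.id); // cache miss: read document
        }
        return hitDoc.doc;
    }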
    // Returns the score of the n-th Document
    public final float score(int n) throws IOException {
        return hitDoc(n).score;
    }

    // Returns the id (document number) of the n-th Document
    public final int id(int n) throws IOException {
        return hitDoc(n).id;
    }

    public Iterator iterator() {
        return new HitIterator(this);
    }

    private final HitDoc hitDoc(int n) throws IOException {
        if (n >= length) {
            throw new IndexOutOfBoundsException("Not a valid hit number: " + n);
        }
        if (n >= hitDocs.size()) {
            getMoreDocs(n);
        }
        return (HitDoc) hitDocs.elementAt(n);
    }
    private final void addToFront(HitDoc hitDoc) { // insert at front of cache
        if (first == null) {
            last = hitDoc;
        } else {
            first.prev = hitDoc;
        }
        hitDoc.next = first;
        first = hitDoc;
        hitDoc.prev = null;
        numDocs++;
    }

    private final void remove(HitDoc hitDoc) { // remove from cache
        if (hitDoc.doc == null) { // it's not in the list
            return; // abort
        }
        if (hitDoc.next == null) {
            last = hitDoc.prev;
        } else {
            hitDoc.next.prev = hitDoc.prev;
        }
        if (hitDoc.prev == null) {
            first = hitDoc.next;
        } else {
            hitDoc.prev.next = hitDoc.next;
        }
        numDocs--;
    }
}

final class HitDoc {
    float score;
    int id;
    Document doc = null;
    HitDoc next; // in doubly-linked cache
    HitDoc prev; // in doubly-linked cache

    HitDoc(float s, int i) {
        score = s;
        id = i;
    }
}
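In the code above, the System.out.println debug statements (marked in red in the original post) were added for the test that follows.
Each search requires a Query instance. As the member variables of Hits show, that query is not executed only once: Hits keeps the Weight built from the Query, together with the Searcher, Filter and Sort, so that getMoreDocs() can run the same search again whenever more results are needed.
To make this more concrete, I wrote a test class to observe how the cache length grows: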
package org.shirdrn.lucene.learn.test;

import java.io.IOException;
import java.util.Date;
import java.util.Iterator;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.LockObtainFailedException;

public class MyHitsTest {

    public void create() throws CorruptIndexException, LockObtainFailedException, IOException {
        String indexPath = "H:\\index";
        IndexWriter writer = new IndexWriter(indexPath, new CJKAnalyzer(), true);
        for (int i = 0; i
[... the rest of the test class and its output are missing from the source ...]

On the first entry into getMoreDocs(), hitDocs.size() = 0 > min = 50 is false, so n = min*2 = 50*2 = 100, and hitDocs.size() = 100 on leaving the method;
On the second entry, the call comes from hitDoc(100), which invokes getMoreDocs(100). hitDocs.size() = 100 > min = 100 is false, so min stays 100 and n = min*2 = 100*2 = 200; hitDocs.size() = 200 on leaving the method;
On the third entry, the call comes from hitDoc(200), which invokes getMoreDocs(200). hitDocs.size() = 200 > min = 200 is false, so min stays 200 and n = min*2 = 200*2 = 400; hitDocs.size() = 400 on leaving the method;
If enough Documents match the query, the pattern continues:
On the fourth entry into getMoreDocs(), hitDocs.size() = 400; on leaving, hitDocs.size() = 800;
On the fifth entry into getMoreDocs(), hitDocs.size() = 800; on leaving, hitDocs.size() = 1600;
...
In the run above, on the last (fourth) entry into getMoreDocs(), hitDocs.size() = 400 > min = 400 is false, so n = min*2 = 400*2 = 800. Although up to 800 hits are requested this time, searcher.search(weight, filter, n) finds only 100 more matching Documents, so the cache actually ends up at hitDocs.size() = 500 on leaving the method. Had enough Documents matched the query, it could have reached hitDocs.size() = 800.
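The growth pattern described above can be replayed without Lucene. The following is a small simulation sketch, not part of the original test: the class HitsGrowthSketch and the method cacheSizes are hypothetical names, and the code assumes that hits are read strictly in order, so every later call is getMoreDocs(hitDocs.size()). The value 500 for the number of matching Documents is inferred from the final cache size reported above.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays the doubling arithmetic of getMoreDocs() (hypothetical helper, no Lucene needed). */
final class HitsGrowthSketch {

    /**
     * Returns hitDocs.size() after each simulated getMoreDocs() call,
     * assuming totalHits documents match and hits are accessed in order.
     */
    static List<Integer> cacheSizes(int totalHits) {
        List<Integer> sizes = new ArrayList<>();
        int size = 0;  // stands in for hitDocs.size()
        int min = 50;  // the constructor calls getMoreDocs(50)
        while (true) {
            if (size > min) {
                min = size;
            }
            int n = min * 2;                       // number of top hits requested
            int returned = Math.min(n, totalHits); // the search returns at most n hits
            size = Math.max(size, returned);       // hitDocs grows up to the returned count
            sizes.add(size);
            if (size >= totalHits) {
                break;                             // everything is cached; no further calls
            }
            min = size;                            // next call: getMoreDocs(hitDocs.size())
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(cacheSizes(500)); // prints [100, 200, 400, 500]
    }
}
```

With 2000 matching Documents the same method yields 100, 200, 400, 800, 1600, 2000, which matches the fourth and fifth steps sketched in the text.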
