[Reposted from http://blog.csdn.net/forfuture1978/article/details/5606136]
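6. FilteredQuery
FilteredQuery has two member variables:
- Query query: the query object.
- Filter filter: its method DocIdSet getDocIdSet(IndexReader reader) returns a set of document IDs, and every result document must come from that set. Note that the document IDs held by the filter are not the ones to be filtered out; they are the ones that remain after filtering.
The result set of a FilteredQuery is the same as an AND of the query and the filter, except that scoring only considers the query part and ignores the filter part.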
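To make the two points above concrete, here is a minimal sketch that does not use Lucene at all. Query hits are modeled as a map from document ID to score, and the filter as a java.util.BitSet of allowed document IDs. The class name FilteredQuerySketch and the method applyFilter are hypothetical names chosen for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FilteredQuerySketch {
    // Keeps only the hits whose document ID is set in the filter's bitset.
    // The scores come from the query alone and are copied through unchanged.
    static Map<Integer, Float> applyFilter(Map<Integer, Float> queryScores, BitSet allowedDocs) {
        Map<Integer, Float> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> hit : queryScores.entrySet()) {
            if (allowedDocs.get(hit.getKey())) {
                result.put(hit.getKey(), hit.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        scores.put(0, 1.5f);
        scores.put(2, 0.7f);
        scores.put(3, 0.2f);
        BitSet allowed = new BitSet();
        allowed.set(2);
        allowed.set(3);
        allowed.set(4);
        // Documents 2 and 3 remain, each with the score the query gave it.
        System.out.println(applyFilter(scores, allowed));
    }
}
```

Document 4 is allowed by the filter but not matched by the query, so it does not appear; document 0 is matched by the query but not allowed by the filter, so it is dropped.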
Filter comes in many varieties, described below:
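6.1 TermsFilter
It has a member variable Set<Term> terms = new TreeSet<Term>(); every document that contains any term in the terms set belongs to the document ID set.
Its getDocIdSet function is as follows: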
public DocIdSet getDocIdSet(IndexReader reader) throws IOException
{
//create a bitset sized to the number of documents in the index (maxDoc)
OpenBitSet result=new OpenBitSet(reader.maxDoc());
TermDocs td = reader.termDocs();
try
{
//walk the document list of every term and set each document ID in the bitset, so the bitset ends up holding all matching document IDs
for (Iterator<Term> iter = terms.iterator(); iter.hasNext();)
{
Term term = iter.next();
td.seek(term);
while (td.next())
{
result.set(td.doc());
}
}
}
finally
{
td.close();
}
return result;
}
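The same idea can be tried out without an index. The following is a minimal sketch under stated assumptions: posting lists are modeled as a Map from term text to an array of document IDs, and java.util.BitSet stands in for OpenBitSet. TermsFilterSketch and docIdSet are hypothetical names, not part of Lucene.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TermsFilterSketch {
    // ORs the posting lists of all given terms into one bitset.
    static BitSet docIdSet(Set<String> terms, Map<String, int[]> postings, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (String term : terms) {
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term does not occur in the index
            }
            for (int doc : docs) {
                result.set(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("a", new int[] {0, 2});
        postings.put("b", new int[] {2, 5});
        postings.put("c", new int[] {7});
        Set<String> terms = new TreeSet<>();
        terms.add("a");
        terms.add("b");
        terms.add("z");
        System.out.println(docIdSet(terms, postings, 8)); // {0, 2, 5}
    }
}
```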
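6.2 BooleanFilter
Like BooleanQuery, it holds should filters, must filters, and not filters. In getDocIdSet, the document ID sets of all should filters are first ORed together, then the document ID sets of the not filters are removed (AND NOT), and finally the result is ANDed with the document ID sets of the must filters, giving the final document set.
Its getDocIdSet function is as follows: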
public DocIdSet getDocIdSet(IndexReader reader) throws IOException
{
OpenBitSetDISI res = null;
if (shouldFilters != null) {
for (int i = 0; i < shouldFilters.size(); i++) {
if (res == null) {
res = new OpenBitSetDISI(getDISI(shouldFilters, i, reader), reader.maxDoc());
} else {
//OR the document IDs of each should filter into the bitset
DocIdSet dis = shouldFilters.get(i).getDocIdSet(reader);
if(dis instanceof OpenBitSet) {
res.or((OpenBitSet) dis);
} else {
res.inPlaceOr(getDISI(shouldFilters, i, reader));
}
}
}
}
if (notFilters!=null) {
for (int i = 0; i < notFilters.size(); i++) {
if (res == null) {
res = new OpenBitSetDISI(getDISI(notFilters, i, reader), reader.maxDoc());
res.flip(0, reader.maxDoc());
} else {
//remove (AND NOT) the document IDs of each not filter from the bitset
DocIdSet dis = notFilters.get(i).getDocIdSet(reader);
if(dis instanceof OpenBitSet) {
res.andNot((OpenBitSet) dis);
} else {
res.inPlaceNot(getDISI(notFilters, i, reader));
}
}
}
}
if (mustFilters!=null) {
for (int i = 0; i < mustFilters.size(); i++) {
if (res == null) {
res = new OpenBitSetDISI(getDISI(mustFilters, i, reader), reader.maxDoc());
} else {
//AND the document IDs of each must filter into the bitset
DocIdSet dis = mustFilters.get(i).getDocIdSet(reader);
if(dis instanceof OpenBitSet) {
res.and((OpenBitSet) dis);
} else {
res.inPlaceAnd(getDISI(mustFilters, i, reader));
}
}
}
}
if (res !=null)
return finalResult(res, reader.maxDoc());
return DocIdSet.EMPTY_DOCIDSET;
}
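The order of the three phases matters, so a small stand-alone sketch may help. It assumes each filter has already been evaluated to a java.util.BitSet; BooleanFilterSketch, bits, and combine are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class BooleanFilterSketch {
    static BitSet bits(int... docs) {
        BitSet b = new BitSet();
        for (int d : docs) {
            b.set(d);
        }
        return b;
    }

    // should sets are ORed, not sets are removed, must sets are ANDed, in that order.
    static BitSet combine(List<BitSet> should, List<BitSet> not, List<BitSet> must, int maxDoc) {
        BitSet res = null;
        for (BitSet s : should) {
            if (res == null) {
                res = (BitSet) s.clone();
            } else {
                res.or(s);
            }
        }
        for (BitSet n : not) {
            if (res == null) {
                // no should filter: start from the complement of the first not filter
                res = (BitSet) n.clone();
                res.flip(0, maxDoc);
            } else {
                res.andNot(n);
            }
        }
        for (BitSet m : must) {
            if (res == null) {
                res = (BitSet) m.clone();
            } else {
                res.and(m);
            }
        }
        return res == null ? new BitSet(maxDoc) : res;
    }

    public static void main(String[] args) {
        BitSet r = combine(
                Arrays.asList(bits(0, 1, 2), bits(3)),
                Arrays.asList(bits(1)),
                Arrays.asList(bits(2, 3, 4)),
                5);
        System.out.println(r); // {2, 3}
    }
}
```

Here the should phase gives {0, 1, 2, 3}, the not phase removes 1, and the must phase keeps only what is also in {2, 3, 4}.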
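6.3 DuplicateFilter
DuplicateFilter provides the following functionality:
Suppose we have a batch of documents, each split into several pages. Each document has an id, but every page is indexed as a separate Document. When two pages of the same document both contain the search keyword, that document's id appears twice in the result set, which is not what we want. DuplicateFilter takes a field such as id and keeps only one of the documents that share the same value in that field.
DuplicateFilter has the following member variables:
- String fieldName: the name of the field.
- int keepMode: KM_USE_FIRST_OCCURRENCE keeps the first of the duplicate documents; KM_USE_LAST_OCCURRENCE keeps the last.
- int processingMode:
- PM_FULL_VALIDATION first sets every document in the bitset to false, then sets the document to keep in each group of duplicates (the first or the last, depending on keepMode) to 1.
- PM_FAST_INVALIDATION first sets every document in the bitset to true, then sets every document in each group of duplicates to 0 except the one to keep.
- When every document contains the specified field, the two modes give the same result, but the latter does not have to process terms with docFreq=1 and is therefore faster.
- However, when some documents do not contain the specified field, the latter has set them to true and never gets a chance to clear them, so they are allowed through. In practice this situation should be avoided.
Its getDocIdSet function is as follows: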
public DocIdSet getDocIdSet(IndexReader reader) throws IOException
{
if(processingMode==PM_FAST_INVALIDATION)
{
return fastBits(reader);
}
else
{
return correctBits(reader);
}
}
private OpenBitSet correctBits(IndexReader reader) throws IOException
{
OpenBitSet bits=new OpenBitSet(reader.maxDoc());
Term startTerm=new Term(fieldName);
TermEnum te = reader.terms(startTerm);
if(te!=null)
{
Term currTerm=te.term();
//while the term still belongs to the specified field
while((currTerm!=null)&&(currTerm.field()==startTerm.field()))
{
int lastDoc=-1;
//fetch all documents that contain this term
TermDocs td = reader.termDocs(currTerm);
if(td.next())
{
if(keepMode==KM_USE_FIRST_OCCURRENCE)
{
//set the first document to true
bits.set(td.doc());
}
else
{
do
{
lastDoc=td.doc();
}while(td.next());
bits.set(lastDoc); //set the last document to true
}
}
if(!te.next())
{
break;
}
currTerm=te.term();
}
}
return bits;
}
private OpenBitSet fastBits(IndexReader reader) throws IOException
{
OpenBitSet bits=new OpenBitSet(reader.maxDoc());
bits.set(0,reader.maxDoc()); //set everything to true
Term startTerm=new Term(fieldName);
TermEnum te = reader.terms(startTerm);
if(te!=null)
{
Term currTerm=te.term();
//while the term still belongs to the specified field
while((currTerm!=null)&&(currTerm.field()==startTerm.field()))
{
if(te.docFreq()>1)
{
int lastDoc=-1;
//fetch all documents that contain this term
TermDocs td = reader.termDocs(currTerm);
td.next();
if(keepMode==KM_USE_FIRST_OCCURRENCE)
{
//skip the first document so it is not cleared
td.next();
}
do
{
lastDoc=td.doc();
bits.clear(lastDoc); //clear all the others
}while(td.next());
if(keepMode==KM_USE_LAST_OCCURRENCE)
{
bits.set(lastDoc); //set the last document back to true
}
}
if(!te.next())
{
break;
}
currTerm=te.term();
}
}
return bits;
}
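The difference between the two processing modes is easiest to see on a tiny data set that includes a document without the field. The sketch below is an illustration only: it models the field as a String[] indexed by document ID (null meaning the field is missing) and uses java.util.BitSet. DuplicateFilterSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateFilterSketch {
    // Groups document IDs by field value, in ascending document order.
    static Map<String, List<Integer>> postings(String[] fieldValues) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < fieldValues.length; doc++) {
            if (fieldValues[doc] != null) {
                postings.computeIfAbsent(fieldValues[doc], k -> new ArrayList<>()).add(doc);
            }
        }
        return postings;
    }

    // Full validation: start with all bits false, set only the kept document of each group.
    static BitSet fullValidation(String[] fieldValues, boolean keepFirst) {
        BitSet bits = new BitSet(fieldValues.length);
        for (List<Integer> docs : postings(fieldValues).values()) {
            int keep = keepFirst ? docs.get(0) : docs.get(docs.size() - 1);
            bits.set(keep);
        }
        return bits;
    }

    // Fast invalidation: start with all bits true, clear the non-kept documents
    // of every group that has more than one document.
    static BitSet fastInvalidation(String[] fieldValues, boolean keepFirst) {
        BitSet bits = new BitSet(fieldValues.length);
        bits.set(0, fieldValues.length);
        for (List<Integer> docs : postings(fieldValues).values()) {
            if (docs.size() > 1) {
                int keep = keepFirst ? docs.get(0) : docs.get(docs.size() - 1);
                for (int doc : docs) {
                    if (doc != keep) {
                        bits.clear(doc);
                    }
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        String[] ids = {"1", "1", "1", "2", "2", null}; // document 5 has no id field
        System.out.println(fullValidation(ids, true));   // {0, 3}
        System.out.println(fastInvalidation(ids, true)); // {0, 3, 5}
    }
}
```

Document 5 survives only in the fast mode, which is exactly the caveat described for PM_FAST_INVALIDATION.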
As an example, we index the following documents:
File indexDir = new File("TestDuplicateFilter/index");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", "page 1: hello world", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", "page 2: hello world", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", "page 3: hello world", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", "page 1: hello world", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", "page 2: hello world", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();
Searching with TermQuery tq = new TermQuery(new Term("contents","hello")) gives the following result:
id : 1
id : 1
id : 1
id : 2
id : 2
If we instead search as follows:
File indexDir = new File("TestDuplicateFilter/index");
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TermQuery tq = new TermQuery(new Term("contents","hello"));
DuplicateFilter filter = new DuplicateFilter("id");
FilteredQuery query = new FilteredQuery(tq, filter);
TopDocs docs = searcher.search(query, 50);
for (ScoreDoc doc : docs.scoreDocs) {
Document ldoc = reader.document(doc.doc);
String id = ldoc.get("id");
System.out.println("id : " + id);
}
the result is:
id : 1
id : 2
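6.4 FieldCacheRangeFilter and FieldCacheTermsFilter
Before introducing the FieldCache-related filters, we first introduce FieldCache.
FieldCache does not cache the content of stored fields; it caches the content of the terms in indexed fields. Terms in the index are of type String, but values of other types can be indexed as strings, for example "1" or "2.3", and then parsed back at search time.
FieldCache supports the following types: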
- byte[] getBytes (IndexReader reader, String field, ByteParser parser)
- double[] getDoubles(IndexReader reader, String field, DoubleParser parser)
- float[] getFloats (IndexReader reader, String field, FloatParser parser)
- int[] getInts (IndexReader reader, String field, IntParser parser)
- long[] getLongs(IndexReader reader, String field, LongParser parser)
- short[] getShorts (IndexReader reader, String field, ShortParser parser)
- String[] getStrings (IndexReader reader, String field)
- StringIndex getStringIndex (IndexReader reader, String field)
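StringIndex has two members:
- String[] lookup: all terms, sorted in lexicographic order.
- int[] order: the array index is the document ID, and order[i] is the position in lookup of the term contained in document i.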
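As a rough illustration of how lookup and order relate, the sketch below builds both arrays from per-document string values. It assumes one value per document (null when the field is missing) and reserves slot 0 of lookup for null, mirroring the mterms[t++] = null line in the StringIndexCache listing later in this section. StringIndexSketch is a hypothetical class, not Lucene's StringIndex.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class StringIndexSketch {
    final String[] lookup; // lookup[0] is null; the rest are the distinct terms in sorted order
    final int[] order;     // order[doc] is the position in lookup of the term of document doc

    StringIndexSketch(String[] docValues) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String v : docValues) {
            if (v != null) {
                sorted.add(v);
            }
        }
        lookup = new String[sorted.size() + 1];
        int t = 1;
        for (String term : sorted) {
            lookup[t++] = term;
        }
        order = new int[docValues.length];
        for (int doc = 0; doc < docValues.length; doc++) {
            // documents without a value keep order 0, which points at the null slot
            order[doc] = docValues[doc] == null
                    ? 0
                    : Arrays.binarySearch(lookup, 1, lookup.length, docValues[doc]);
        }
    }

    public static void main(String[] args) {
        StringIndexSketch idx = new StringIndexSketch(new String[] {"b", "a", null, "b"});
        System.out.println(Arrays.toString(idx.lookup)); // [null, a, b]
        System.out.println(Arrays.toString(idx.order));  // [2, 1, 0, 2]
    }
}
```

Because lookup is sorted, comparing two documents by order[doc] gives the same result as comparing their strings, which is what makes this structure useful for sorting and range checks.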
The default implementation of FieldCache is FieldCacheImpl, which has a member variable Map<Class<?>,Cache> caches that maps each type to its Cache.
private synchronized void init() {
caches = new HashMap<Class<?>,Cache>(7);
caches.put(Byte.TYPE, new ByteCache(this));
caches.put(Short.TYPE, new ShortCache(this));
caches.put(Integer.TYPE, new IntCache(this));
caches.put(Float.TYPE, new FloatCache(this));
caches.put(Long.TYPE, new LongCache(this));
caches.put(Double.TYPE, new DoubleCache(this));
caches.put(String.class, new StringCache(this));
caches.put(StringIndex.class, new StringIndexCache(this));
}
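Its getInts implementation is shown below: it first obtains the IntCache registered for the Integer type, then looks up the int values in it using the reader and an Entry made up of the field and the parser.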
public int[] getInts(IndexReader reader, String field, IntParser parser) throws IOException {
return (int[]) caches.get(Integer.TYPE).get(reader, new Entry(field, parser));
}
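Cache, the parent class of all these caches, has a member variable Map<Object,Map<Entry,Object>> readerCache. Its key is the IndexReader and its value is a Map whose key is an Entry (that is, the field) and whose value is the cached data, such as an int[] (in other words, for this field of this reader there is an array of ints, one entry per document).
The get function of Cache is as follows: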
public Object get(IndexReader reader, Entry key) throws IOException {
Map<Entry,Object> innerCache;
Object value;
final Object readerKey = reader.getFieldCacheKey(); //this method returns this, i.e. the IndexReader itself
synchronized (readerCache) {
innerCache = readerCache.get(readerKey); //look up the Map for this IndexReader
if (innerCache == null) { //if there is none, create a new Map
innerCache = new HashMap<Entry,Object>();
readerCache.put(readerKey, innerCache);
value = null;
} else {
value = innerCache.get(key); //the key of this Map is an Entry and the value is the cached data
}
//on a cache miss, insert a placeholder so that the value gets created below
if (value == null) {
value = new CreationPlaceholder();
innerCache.put(key, value);
}
}
if (value instanceof CreationPlaceholder) {
synchronized (value) {
CreationPlaceholder progress = (CreationPlaceholder) value;
if (progress.value == null) {
progress.value = createValue(reader, key); //call this function to create the cached value
synchronized (readerCache) {
innerCache.put(key, progress.value);
}
}
return progress.value;
}
}
return value;
}
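The two-level lookup (reader first, then field) can be sketched in a few lines. This is a simplified illustration, not the FieldCacheImpl code: it holds one lock for the whole lookup instead of using a CreationPlaceholder, it keys the inner map by field name only, and it assumes the outer map is a WeakHashMap so that entries can disappear once a reader is no longer referenced. TwoLevelCacheSketch and the creations counter are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

public class TwoLevelCacheSketch {
    // outer key: the reader; inner key: the field name; value: the cached array
    private final Map<Object, Map<String, Object>> readerCache = new WeakHashMap<>();
    int creations = 0; // counts cache misses, for demonstration only

    synchronized Object get(Object readerKey, String field, Supplier<Object> creator) {
        Map<String, Object> inner = readerCache.computeIfAbsent(readerKey, k -> new HashMap<>());
        Object value = inner.get(field);
        if (value == null) {
            // cache miss: build the value once and remember it
            value = creator.get();
            inner.put(field, value);
            creations++;
        }
        return value;
    }

    public static void main(String[] args) {
        TwoLevelCacheSketch cache = new TwoLevelCacheSketch();
        Object reader = new Object();
        Object first = cache.get(reader, "price", () -> new int[] {1, 2, 3});
        Object second = cache.get(reader, "price", () -> new int[] {9});
        System.out.println(first == second);  // true
        System.out.println(cache.creations);  // 1
    }
}
```

The placeholder in the original get function exists so that the expensive createValue call runs outside the readerCache lock; the sketch trades that concurrency for brevity.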
The createValue function of Cache differs by type; we only analyze the implementations in IntCache and StringIndexCache.
The createValue function of IntCache is as follows:
protected Object createValue(IndexReader reader, Entry entryKey) throws IOException {
Entry entry = entryKey;
String field = entry.field;
IntParser parser = (IntParser) entry.custom;
int[] retArray = null;
TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms (new Term (field));
try {
//take every term of the field in turn and parse it with the IntParser; the index into retArray is the document ID, and retArray[i] is the int value contained in document i
do {
Term term = termEnum.term();
if (term==null || term.field() != field) break;
int termval = parser.parseInt(term.text());
if (retArray == null)
retArray = new int[reader.maxDoc()];
termDocs.seek (termEnum);
while (termDocs.next()) {
retArray[termDocs.doc()] = termval;
}
} while (termEnum.next());
} catch (StopFillCacheException stop) {
} finally {
termDocs.close();
termEnum.close();
}
if (retArray == null)
retArray = new int[reader.maxDoc()];
return retArray;
}
};
The createValue function of StringIndexCache is as follows:
protected Object createValue(IndexReader reader, Entry entryKey) throws IOException {
String field = StringHelper.intern(entryKey.field);
final int[] retArray = new int[reader.maxDoc()];
String[] mterms = new String[reader.maxDoc()+1];
TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms (new Term (field));
int t = 0;
mterms[t++] = null;
try {
do {
Term term = termEnum.term();
if (term==null || term.field() != field) break;
mterms[t] = term.text(); //mterms[i] holds the string of the i-th term in lexicographic order
termDocs.seek (termEnum);
while (termDocs.next()) {
retArray[termDocs.doc()] = t; //retArray[i] holds the position in mterms of the string contained in document i
}
t++;
} while (termEnum.next());
} finally {
termDocs.close();
termEnum.close();
}
if (t == 0) {
mterms = new String[1];
} else if (t < mterms.length) {
String[] terms = new String[t];
System.arraycopy (mterms, 0, terms, 0, t);
mterms = terms;
}
StringIndex value = new StringIndex (retArray, mterms);
return value;
}
FieldCacheRangeFilter can be a range over various types; the int variant is created with the following function:
public static FieldCacheRangeFilter newIntRange(String field, FieldCache.IntParser parser, Integer lowerVal, Integer upperVal, boolean includeLower, boolean includeUpper) {
return new FieldCacheRangeFilter(field, parser, lowerVal, upperVal, includeLower, includeUpper) {
@Override
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
final int inclusiveLowerPoint, inclusiveUpperPoint;
//compute the lower bound
if (lowerVal != null) {
int i = lowerVal.intValue();
if (!includeLower && i == Integer.MAX_VALUE)
return DocIdSet.EMPTY_DOCIDSET;
inclusiveLowerPoint = includeLower ? i : (i + 1);
} else {
inclusiveLowerPoint = Integer.MIN_VALUE;
}
//compute the upper bound
if (upperVal != null) {
int i = upperVal.intValue();
if (!includeUpper && i == Integer.MIN_VALUE)
return DocIdSet.EMPTY_DOCIDSET;
inclusiveUpperPoint = includeUpper ? i : (i - 1);
} else {
inclusiveUpperPoint = Integer.MAX_VALUE;
}
if (inclusiveLowerPoint > inclusiveUpperPoint)
return DocIdSet.EMPTY_DOCIDSET;
//fetch values from the cache; values[i] is the value of document i in this field
final int[] values = FieldCache.DEFAULT.getInts(reader, field, (FieldCache.IntParser) parser);
return new FieldCacheDocIdSet(reader, (inclusiveLowerPoint <= 0 && inclusiveUpperPoint >= 0)) {
@Override
boolean matchDoc(int doc) {
//a document matches only when its value lies inside the range
return values[doc] >= inclusiveLowerPoint && values[doc] <= inclusiveUpperPoint;
}
};
}
};
}
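The bound handling in newIntRange guards against overflow when an exclusive bound sits at Integer.MAX_VALUE or Integer.MIN_VALUE. A stand-alone sketch of that logic, with the cached values passed in as a plain int[], might look as follows. IntRangeSketch, inclusiveBounds, and matchingDocs are hypothetical names, and returning null for an empty range is a choice made for this sketch.

```java
import java.util.BitSet;

public class IntRangeSketch {
    // Converts possibly exclusive, possibly open-ended bounds into inclusive ones.
    // Returns null when no int can fall inside the range.
    static int[] inclusiveBounds(Integer lowerVal, Integer upperVal,
                                 boolean includeLower, boolean includeUpper) {
        int lower = Integer.MIN_VALUE;
        int upper = Integer.MAX_VALUE;
        if (lowerVal != null) {
            int i = lowerVal;
            if (!includeLower && i == Integer.MAX_VALUE) {
                return null; // i + 1 would overflow, and nothing is greater than MAX_VALUE
            }
            lower = includeLower ? i : i + 1;
        }
        if (upperVal != null) {
            int i = upperVal;
            if (!includeUpper && i == Integer.MIN_VALUE) {
                return null; // i - 1 would overflow, and nothing is smaller than MIN_VALUE
            }
            upper = includeUpper ? i : i - 1;
        }
        return lower > upper ? null : new int[] {lower, upper};
    }

    // values[doc] plays the role of the FieldCache array.
    static BitSet matchingDocs(int[] values, int[] bounds) {
        BitSet bits = new BitSet(values.length);
        if (bounds == null) {
            return bits;
        }
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc] >= bounds[0] && values[doc] <= bounds[1]) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] values = {5, 12, 0, 20, 10};
        int[] bounds = inclusiveBounds(10, 20, false, true); // the range (10, 20]
        System.out.println(matchingDocs(values, bounds));    // {1, 3}
    }
}
```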
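FieldCacheRangeFilter is functionally similar to NumericRangeFilter and TermRangeFilter, except that the latter two build the docid bitset from the index, while the former works from cached values, which makes it faster.
Likewise, FieldCacheTermsFilter is functionally similar to TermsFilter; again the former works from the cache, which makes it faster.
6.5 MultiTermQueryWrapperFilter
MultiTermQueryWrapperFilter has a member variable Q query, and its getDocIdSet returns the bitset of document IDs that match this query.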
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
final TermEnum enumerator = query.getEnum(reader);
try {
if (enumerator.term() == null)
return DocIdSet.EMPTY_DOCIDSET;
final OpenBitSet bitSet = new OpenBitSet(reader.maxDoc());
final int[] docs = new int[32];
final int[] freqs = new int[32];
TermDocs termDocs = reader.termDocs();
try {
int termCount = 0;
//iterate over all terms that match the query
do {
Term term = enumerator.term();
if (term == null)
break;
termCount++;
termDocs.seek(term);
while (true) {
//read the document list of each term and put it into the bitset
final int count = termDocs.read(docs, freqs);
if (count != 0) {
for(int i=0;i<count;i++) {
bitSet.set(docs[i]);
}
} else {
break;
}
}
} while (enumerator.next());
query.incTotalNumberOfTerms(termCount);
} finally {
termDocs.close();
}
return bitSet;
} finally {
enumerator.close();
}
}
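MultiTermQueryWrapperFilter has three important subclasses:
- NumericRangeFilter: uses a NumericRangeQuery as the query
- PrefixFilter: uses a PrefixQuery as the query
- TermRangeFilter: uses a TermRangeQuery as the query
6.6 QueryWrapperFilter
It wraps a query object, and getDocIdSet returns the IDs of all documents that match this query: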
public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
final Weight weight = query.weight(new IndexSearcher(reader));
return new DocIdSet() {
public DocIdSetIterator iterator() throws IOException {
return weight.scorer(reader, true, false); //each call to the Scorer's next returns the next document ID
}
};
}
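6.7 SpanFilter
6.7.1 SpanQueryFilter
It wraps a SpanQuery query. As a filter, besides returning document IDs through getDocIdSet, its bitSpans function returns a SpanFilterResult that also carries position information, which can be used for filtering in a FilteredQuery.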
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
SpanFilterResult result = bitSpans(reader);
return result.getDocIdSet();
}
public SpanFilterResult bitSpans(IndexReader reader) throws IOException {
final OpenBitSet bits = new OpenBitSet(reader.maxDoc());
Spans spans = query.getSpans(reader);
List<SpanFilterResult.PositionInfo> tmp = new ArrayList<SpanFilterResult.PositionInfo>(20);
int currentDoc = -1;
SpanFilterResult.PositionInfo currentInfo = null;
while (spans.next())
{
//put the docid into the bitset
int doc = spans.doc();
bits.set(doc);
if (currentDoc != doc)
{
currentInfo = new SpanFilterResult.PositionInfo(doc);
tmp.add(currentInfo);
currentDoc = doc;
}
//add the start and end positions to the PositionInfo
currentInfo.addPosition(spans.start(), spans.end());
}
return new SpanFilterResult(bits, tmp);
}
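The grouping step in bitSpans, one PositionInfo per document with all of that document's spans attached, can be tried in isolation. The sketch below assumes the spans arrive as rows of {doc, start, end} ordered by document, as a Spans enumeration would deliver them. BitSpansSketch and its nested PositionInfo are hypothetical stand-ins for the Lucene classes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitSpansSketch {
    static final class PositionInfo {
        final int doc;
        final List<int[]> positions = new ArrayList<>(); // each entry is {start, end}

        PositionInfo(int doc) {
            this.doc = doc;
        }
    }

    // Sets a bit for every document that has a span and groups the spans by document.
    static List<PositionInfo> bitSpans(int[][] spans, BitSet bits) {
        List<PositionInfo> result = new ArrayList<>();
        int currentDoc = -1;
        PositionInfo current = null;
        for (int[] span : spans) {
            int doc = span[0];
            bits.set(doc);
            if (doc != currentDoc) {
                // first span of a new document: open a new PositionInfo
                current = new PositionInfo(doc);
                result.add(current);
                currentDoc = doc;
            }
            current.positions.add(new int[] {span[1], span[2]});
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        List<PositionInfo> infos = bitSpans(new int[][] {{1, 0, 2}, {1, 5, 6}, {4, 3, 4}}, bits);
        System.out.println(bits);                          // {1, 4}
        System.out.println(infos.get(0).positions.size()); // 2
    }
}
```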
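6.7.2 CachingSpanFilter
From the Filter interface DocIdSet getDocIdSet(IndexReader reader), we know that a docid bitset is tied to one reader.
As the earlier description of docids explains, a docid is only meaningful for one open reader.
CachingSpanFilter has a member variable Map<IndexReader,SpanFilterResult> cache that maps each reader to its SpanFilterResult, and another member variable SpanFilter filter that is used to compute the SpanFilterResult on a cache miss.
Its getDocIdSet is as follows: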
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
SpanFilterResult result = getCachedResult(reader);
return result != null ? result.getDocIdSet() : null;
}
private SpanFilterResult getCachedResult(IndexReader reader) throws IOException {
lock.lock();
try {
if (cache == null) {
cache = new WeakHashMap<IndexReader,SpanFilterResult>();
}
//on a cache hit, return the cached result
final SpanFilterResult cached = cache.get(reader);
if (cached != null) return cached;
} finally {
lock.unlock();
}
//on a cache miss, use the SpanFilter to compute the result directly from the reader
final SpanFilterResult result = filter.bitSpans(reader);
lock.lock();
try {
//put the newly computed result into the cache
cache.put(reader, result);
} finally {
lock.unlock();
}
return result;
}