Advanced Highlighting in Elasticsearch: A High-Performance Highlighter That Lets Elasticsearch Fly

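        In many applications, highlighting the matched text in search results noticeably improves the user experience. The two common enterprise search engines, Elasticsearch and Solr, both provide highlighting, and in both cases it comes from Lucene's highlighter module. Lucene can highlight the search terms in one or more fields, and it supports three highlighters: highlighter, fast-vector-highlighter, and postings-highlighter. In Solr, highlighter is the default. In Elasticsearch, highlighter is also the default.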


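     The highlighter is also called the plain highlighter. It has strengths and weaknesses; let's start with the weaknesses. The plain highlighter analyzes text at query time. When a user runs a query, ES first finds the matching doc ids, loads the content of the fields to be highlighted into memory, and runs each field's analyzer over it again. Once tokenization is done, it scores the candidate fragments with a similarity algorithm and returns the top n fragments with the matches highlighted. Take the ansj analyzer as an example: the official figure is 600,000 to 800,000 characters per second, but servers usually run slower than that (server CPUs tend to have lower clock speeds), and in production ansj mostly reaches 400,000 to 500,000 characters per second. Now suppose users search fairly large documents with highlighting enabled, at 40 hits per page and 20k per hit. That is roughly 800,000 characters to re-analyze, or 1.6 to 2 seconds at production speed. Even if scoring and sorting cost nothing, highlighting alone drags the query to nearly two seconds, which is hard to accept.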


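fast-vector-highlighter:
      To fix the plain highlighter's poor performance on large text fields, Lucene's highlighter module provides a vector-based highlighter, the fast-vector-highlighter (fvh). To use it, term vectors with positions and offsets must be stored when the data is indexed (in an Elasticsearch mapping, "term_vector": "with_positions_offsets"). At highlight time the fast-vector-highlighter works as follows:
    6. Read the field content (multiple values are joined with spaces) and use the extracted term vectors to locate and cut out the highlighted fragments directly. (Note: Lucene's native highlighting has bugs, in both the core and the highlighter projects. I have written before about how to patch them.)
     So the fast-vector-highlighter skips the real-time analysis step but adds disk reads, which means it also has both strengths and weaknesses.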
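Why stored offsets remove the analysis step can be shown with a minimal sketch. It assumes the character offsets of the matched terms are already known (as they would be after reading a term vector) and only splices tags into the raw text. The class and method names are hypothetical and are not part of Lucene or Elasticsearch.

```java
final class OffsetHighlightSketch {

    /**
     * Wraps each [start, end) range in pre/post tags.
     * Assumes the ranges are sorted by start and do not overlap.
     */
    static String highlight(String text, int[][] offsets, String pre, String post) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] range : offsets) {
            out.append(text, cursor, range[0]);                       // untouched text before the match
            out.append(pre).append(text, range[0], range[1]).append(post);
            cursor = range[1];
        }
        out.append(text, cursor, text.length());                      // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Offsets are supplied by hand here; a real highlighter would read them from the index.
        int[][] offsets = { {4, 8} };
        System.out.println(highlight("buy gome appliances", offsets, "<b>", "</b>"));
    }
}
```

No tokenizer runs in this path, so the cost is dominated by fetching the text and the offsets, which is the I/O trade-off discussed next.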


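(1) The fast-vector-highlighter needs stored term vectors, and in a system with a rich vocabulary, storing term vectors often takes about twice the space of not storing them.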

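(2) The fast-vector-highlighter issues at least twice as many I/O operations as the plain highlighter and reads at least twice as many bytes. A large number of I/O requests lowers the search engine's capacity for concurrent queries.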

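(1) When real-time tokenization is slower than random disk reads, the fast-vector-highlighter, which reads term vectors from disk, has a clear advantage. For example, the ansj analyzer needs about two seconds to process a document of one million characters. A current enterprise hard disk spins at about 10,000 rpm, which gives roughly 160 seeks per second, so one seek plus a 20k read takes about 7-10 ms. Reading 2M in total from disk in 40 separate reads therefore takes about 300 ms, and repeated reads are faster still because of I/O caching. Compared with the plain highlighter, fvh has a large advantage when the field content is large.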
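The two estimates can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a benchmark. It assumes that "20k" means about 20,000 characters per document and that each document costs one seek-plus-read. The class and method names are hypothetical.

```java
final class HighlightCostSketch {

    /** Time to re-analyze every hit at query time (plain highlighter model). */
    static double plainSeconds(int docs, int charsPerDoc, double charsPerSecond) {
        return docs * (double) charsPerDoc / charsPerSecond;
    }

    /** Time to fetch stored data with one seek-plus-read per hit (fvh model). */
    static double fvhSeconds(int reads, double millisPerRead) {
        return reads * millisPerRead / 1000.0;
    }

    public static void main(String[] args) {
        // 40 hits of 20,000 characters at 400,000-500,000 chars/s: 1.6 to 2.0 seconds
        System.out.printf("plain: %.2f - %.2f s%n",
                plainSeconds(40, 20_000, 500_000), plainSeconds(40, 20_000, 400_000));
        // 40 reads at 7-10 ms each: 0.28 to 0.40 seconds
        System.out.printf("fvh:   %.2f - %.2f s%n",
                fvhSeconds(40, 7.0), fvhSeconds(40, 10.0));
    }
}
```

Under these assumptions the disk-based path is several times faster for large fields, which matches the roughly 300 ms versus two seconds quoted above.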

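       The default plain highlighter takes little space but is slow on large fields. The fvh is fast on large fields but takes too much space. Is there a middle ground, one that does not take too much space and is not too slow on large fields? There is: Lucene also provides the postings-highlighter (postings). Like fvh, it works from index-time information instead of re-analyzing the text, but it reads the term offsets stored in the postings list (in Elasticsearch, "index_options": "offsets") and needs no separate term vectors. On medium and large fields it therefore saves about 20-30% of the storage that fvh needs. In practice neither its strengths nor its weaknesses stand out, so using the plain highlighter for small fields and the fast-vector-highlighter for large fields is enough to meet most needs.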

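       To sum up, Lucene's default plain highlighter takes little space but is too slow on large text, while fvh is fast but costs too much disk space and I/O. In production, neither throughput nor storage reaches a satisfactory level. I wanted the same space usage as the default highlighter and more speed than the fast-vector-highlighter, so I wrote my own highlighter on top of the structure of Lucene's highlighter implementation and named it fast-highlighter.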

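 fast-highlighter consists of several parts:
 1. FastPlainHighlighter, the plugin class that ES calls, which handles the highlighting context and options.

 1. Phrase highlighting (for quoted queries: the phrase is split into several terms, and they are highlighted only when they occur at consecutive positions)

 2. Best-fragment selection (the top n fragments that match best, or that contain the most highlighted terms, have to be computed)

 3. Matching that can ignore case and ignore the difference between full-width and half-width characters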
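The third feature, and the general idea of finding query terms in raw text without running an analyzer, can be sketched as follows. This is a simplified illustration, not the author's TreeAnalysis class. It assumes the query terms are loaded into a small trie, that matching is greedy longest-match from left to right, and that full-width ASCII variants (U+FF01 to U+FF5E) are folded to half-width by subtracting 65248. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermMatcherSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfTerm;
    }

    private final Node root = new Node();

    /** Folds full-width ASCII variants to half-width, then lower-cases. */
    static char normalize(char ch) {
        if (ch >= '\uFF01' && ch <= '\uFF5E') {
            ch = (char) (ch - 65248);
        }
        return Character.toLowerCase(ch);
    }

    /** Adds one query term to the trie in normalized form. */
    void add(String term) {
        if (term == null || term.isEmpty()) {
            return;
        }
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.computeIfAbsent(normalize(term.charAt(i)), k -> new Node());
        }
        node.endOfTerm = true;
    }

    /** Returns [start, end) offsets of the longest term found at each position. */
    List<int[]> find(String text) {
        List<int[]> hits = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            Node node = root;
            int end = -1;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(normalize(text.charAt(i)));
                if (node == null) {
                    break;
                }
                if (node.endOfTerm) {
                    end = i + 1; // remember the longest complete term so far
                }
            }
            if (end > start) {
                hits.add(new int[] { start, end });
                start = end;     // continue after the match
            } else {
                start++;
            }
        }
        return hits;
    }
}
```

Because offsets are taken from the original string, the tags can be inserted around the text exactly as the user wrote it, whatever its case or character width.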




   (2) Search for articles whose text contains the keyword “国美电器” and return 40 hits with highlighting


With the fast-vector-highlighter the query took 336 ms:

{
  "took": 336,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 115,
    "max_score": 0.19190195,
    "hits": [
      {
        "_index": "test_v1",
        "_type": "test",
        "_id": "51000508",
        "_score": 0.19190195,
        "highlight": {
          "text": [
            "且主要品类零售额增速均高于上年同期水平。(2) 6 月 12 日,<b>国美</b><b>电器</b>宣布,其股东特别大会已通过公司更名议案。中文名称由“<b>国美</b><b>电器</b>控股有限公司”更改为“<b>国美</b>零售控股有限公司”。同日公司宣布正式推出全球首家专业 VR 影院,地点位于国美旗下大中<b>电器</b>北京马甸店。<b>国美</b> VR 影院将打破售票入场形式,采用“时间售卖”的方式,正式对外营业后一小时将收费"
          ]
        }
      },
      ...
    ]
  }
}

With fast-highlighter the query took 132 ms:

{
  "took": 132,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 115,
    "max_score": 0.19190195,
    "hits": [
      {
        "_index": "test_v2",
        "_type": "test",
        "_id": "51000508",
        "_score": 0.19190195,
        "highlight": {
          "text": [
            "家用<b>电器</b>类零售额同比增长 1.6%,相比上年同期加快了 11.8 个百分点。(2) 6 月 12 日,<b>国美</b><b>电器</b>宣布,其股东特别大会已通过公司更名议案。中文名称由“<b>国美</b><b>电器</b>控股有限公司”更改为“<b>国美</b>零售控股有限公司”。同日,公司宣布正式推出全球首家专业 VR 影院,地点位于<b>国美</b>旗下大中<b>电器</b>北京马甸店。<b>国美</b>"
          ]
        }
      },
      ...
    ]
  }
}

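As the results show, the custom highlighter is more than twice as fast as the fast-vector-highlighter on this query (132 ms versus 336 ms). The core code of fast-highlighter follows.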

package org.elasticsearch.search.highlight;

import com.google.common.collect.Maps;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.vectorhighlight.BoundaryScanner;
import org.apache.lucene.search.vectorhighlight.CustomFieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.SimpleBoundaryScanner;
import org.apache.lucene.search.vectorhighlight.FieldQuery.Phrase;
import org.apache.lucene.util.BytesRefHash;
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.search.fetch.FetchPhaseExecutionException;
import org.elasticsearch.search.fetch.FetchSubPhase;
import org.elasticsearch.search.internal.SearchContext;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * @author jkuang.nj
 */
public class FastPlainHighlighter implements Highlighter {
	private static final String CACHE_KEY = "highlight-fast";
	// separator written between terms in the buffers used for phrase matching
	public static final char mark = 0;
	private static final SimpleBoundaryScanner DEFAULT_BOUNDARY_SCANNER = new SimpleBoundaryScanner();

	public HighlightField highlight(HighlighterContext highlighterContext) {
		SearchContextHighlight.Field field = highlighterContext.field;
		SearchContext context = highlighterContext.context;
		FetchSubPhase.HitContext hitContext = highlighterContext.hitContext;
		FieldMapper mapper = highlighterContext.mapper;
		Encoder encoder = field.fieldOptions().encoder().equals("html") ? HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT;

		if (!hitContext.cache().containsKey(CACHE_KEY)) {
			hitContext.cache().put(CACHE_KEY, new HighlighterEntry());
		}

		HighlighterEntry cache = (HighlighterEntry) hitContext.cache().get(CACHE_KEY);
		try {
			// the flattened query is built once and cached across hits
			FieldQuery fieldQuery;
			if (field.fieldOptions().requireFieldMatch()) {
				if (cache.fieldMatchFieldQuery == null) {
					cache.fieldMatchFieldQuery = new CustomFieldQuery(highlighterContext.query, hitContext.topLevelReader(), true,
							field.fieldOptions().requireFieldMatch());
				}
				fieldQuery = cache.fieldMatchFieldQuery;
			} else {
				if (cache.noFieldMatchFieldQuery == null) {
					cache.noFieldMatchFieldQuery = new CustomFieldQuery(highlighterContext.query, hitContext.topLevelReader(), true,
							field.fieldOptions().requireFieldMatch());
				}
				fieldQuery = cache.noFieldMatchFieldQuery;
			}

			// build the per-field term trie and phrase set on first use
			if (!cache.analysises.containsKey(field.field())) {
				cache.setPhrases(field.field(), fieldQuery.getPhrases(field.field()));
				cache.setWords(field.field(), fieldQuery.getTermSet(field.field()));
			}
			FastHighlighter entry = cache.mappers.get(mapper);
			if (entry == null) {
				BoundaryScanner boundaryScanner = DEFAULT_BOUNDARY_SCANNER;
				if (field.fieldOptions().boundaryMaxScan() != SimpleBoundaryScanner.DEFAULT_MAX_SCAN
						|| field.fieldOptions().boundaryChars() != SimpleBoundaryScanner.DEFAULT_BOUNDARY_CHARS) {
					boundaryScanner = new SimpleBoundaryScanner(field.fieldOptions().boundaryMaxScan(), field.fieldOptions().boundaryChars());
				}
				entry = new FastHighlighter(encoder, boundaryScanner);
				cache.mappers.put(mapper, entry);
			}

			String[] fragments;
			int numberOfFragments = field.fieldOptions().numberOfFragments() == 0 ? 1 : field.fieldOptions().numberOfFragments();
			int fragmentCharSize = field.fieldOptions().numberOfFragments() == 0 ? 50 : field.fieldOptions().fragmentCharSize();
			List<Object> textsToHighlight = null;
			try {
				textsToHighlight = HighlightUtils.loadFieldValues(field, mapper, context, hitContext);
				// multiple field values are joined with a space
				StringBuilder buffer = new StringBuilder();
				for (Object textToHighlight : textsToHighlight) {
					String text = textToHighlight.toString();
					buffer.append(text).append(" ");
				}
				fragments = entry.getBestBestFragments(cache.analysises.get(field.field()), cache.phrases.get(field.field()), buffer,
						numberOfFragments, fragmentCharSize, field.fieldOptions().preTags(), field.fieldOptions().postTags());
			} catch (Exception e) {
				if (ExceptionsHelper.unwrap(e, BytesRefHash.MaxBytesLengthExceededException.class) != null) {
					return null;
				} else {
					throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e);
				}
			}

			if (fragments != null && fragments.length > 0) {
				return new HighlightField(highlighterContext.fieldName, Text.convertFromStringArray(fragments));
			}

			int noMatchSize = highlighterContext.field.fieldOptions().noMatchSize();
			if (noMatchSize > 0 && textsToHighlight.size() > 0) {
				String fieldContents = textsToHighlight.get(0).toString();
				return new HighlightField(highlighterContext.fieldName,
						new Text[] { new Text(fieldContents.substring(0, Math.min(fragmentCharSize, fieldContents.length()))) });
			}

			return null;
		} catch (Exception e) {
			throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e);
		}
	}

	public boolean canHighlight(FieldMapper fieldMapper) {
		return true;
	}

	/** Per-request cache: flattened queries, phrases, term tries and highlighters. */
	private class HighlighterEntry {
		public FieldQuery noFieldMatchFieldQuery;
		public FieldQuery fieldMatchFieldQuery;
		public Map<String, Set<Phrase>> phrases = new HashMap<>();
		public Map<FieldMapper, FastHighlighter> mappers = Maps.newHashMap();
		public Map<String, TreeAnalysis> analysises = new HashMap<>();

		public void setPhrases(String field, Set<Phrase> phrases) {
			if (!this.phrases.containsKey(field)) {
				this.phrases.put(field, phrases);
			}
		}

		public void setWords(String field, Set<String> words) {
			if (!analysises.containsKey(field)) {
				TreeAnalysis analysis = new TreeAnalysis();
				if (words != null && words.size() > 0) {
					for (String word : words) {
						analysis.add(word);
					}
				}
				analysises.put(field, analysis);
			}
		}
	}

	/** A candidate fragment: a run of matched terms and its score. */
	static class FragmentScore implements Comparable<FragmentScore> {
		int point = 0;
		int distance = 0;
		List<Term> terms = new ArrayList<>();
		HashSet<String> set = new HashSet<>();
		StringBuffer buffer = new StringBuffer();

		public FragmentScore(int distance) {
			this.distance = distance;
		}

		/** Adds a bonus for every query phrase that occurs in this fragment. */
		public void updateScore(Set<Phrase> phrases) {
			for (Phrase phrase : phrases) {
				if (buffer.indexOf(phrase.toString()) >= 0) {
					this.point += 5 * phrase.list.size();
				}
			}
		}

		/** Adds the term if it still fits in the window; returns false otherwise. */
		public boolean add(Term term) {
			if (terms.size() == 0 || term.pos - terms.get(0).pos <= distance) {
				// position gap to the previous term; 1 means the two terms are adjacent
				int dis = 0;
				if (terms.size() == 0) {
					buffer.append(term.word);
				} else {
					dis = term.pos - terms.get(terms.size() - 1).pos;
					buffer.append(mark).append(dis).append(mark);
					buffer.append(term.word);
				}
				terms.add(term);
				this.point += term.length();
				if (set.size() > 0) {
					// bonus for a term not yet seen in this fragment
					if (!set.contains(term.word)) {
						this.point += 2;
					}
					// bonus for a term that directly follows the previous one
					if (dis == 1) {
						this.point += 2;
					}
				}
				// record the word so that later terms can be checked against it
				set.add(term.word);
				return true;
			}
			return false;
		}

		@Override
		public int compareTo(FragmentScore o) {
			// higher score sorts first
			return Integer.compare(o.point, this.point);
		}
	}

	public class FastHighlighter {
		BoundaryScanner boundaryScanner;
		Encoder encoder;

		public FastHighlighter(Encoder encoder, BoundaryScanner boundaryScanner) {
			this.encoder = encoder;
			this.boundaryScanner = boundaryScanner;
		}

		public String[] getBestBestFragments(TreeAnalysis analyzer, Set<Phrase> phrases, StringBuilder buffer, int maxNumFragments,
				int fragmentSize, String[] preTags, String[] postTags) {
			List<FragmentScore> fragmentScores;
			if (maxNumFragments <= 1) {
				fragmentScores = getBestFragments(analyzer, phrases, buffer.toString(), fragmentSize);
			} else {
				fragmentScores = getBestFragments(analyzer, buffer.toString(), maxNumFragments, fragmentSize);
			}
			return toString(buffer, fragmentSize, fragmentScores, preTags, postTags);
		}

		/** Cuts each fragment out of the text and wraps its terms in the tags. */
		public String[] toString(StringBuilder buffer, int fragmentSize, List<FragmentScore> fragmentScores, String[] preTags,
				String[] postTags) {
			List<String> list = new ArrayList<>();
			for (FragmentScore score : fragmentScores) {
				List<Term> terms = score.terms;
				Term head = terms.get(0);
				Term tail = terms.get(terms.size() - 1);
				int start = boundaryScanner.findStartOffset(buffer, head.startoffset);
				int end = boundaryScanner.findEndOffset(buffer, tail.endoffset());
				if (fragmentScores.size() == 1 && buffer.length() <= fragmentSize) {
					// the whole text fits into one fragment
					start = 0;
					end = buffer.length();
				} else if (fragmentSize - (tail.endoffset() - head.startoffset) > (fragmentSize / 10)) {
					// the matched span is clearly shorter than the fragment: pad it with context
					int size = fragmentSize - (tail.endoffset() - head.startoffset);
					if (head.startoffset < (size * 3 / 10)) {
						start = 0;
					} else {
						start = boundaryScanner.findStartOffset(buffer, head.startoffset);
					}
					if (buffer.length() - start < fragmentSize) {
						end = buffer.length();
					} else {
						end = boundaryScanner.findEndOffset(buffer, Math.max(start + fragmentSize, tail.endoffset()));
					}
				}
				StringBuffer result = new StringBuffer();
				for (int i = 0; i < terms.size(); i++) {
					Term term = terms.get(i);
					result.append(buffer.substring(start, term.startoffset));
					result.append(getTag(preTags, i));
					result.append(encoder.encodeText(buffer.substring(term.startoffset, term.endoffset())));
					result.append(getTag(postTags, i));
					start = term.endoffset();
				}
				result.append(buffer.substring(start, end));
				list.add(result.toString());
			}
			return list.toArray(new String[0]);
		}

		/** Returns the single best-scoring fragment, trying a window that starts at each term. */
		public final List<FragmentScore> getBestFragments(TreeAnalysis analyzer, Set<Phrase> phrases, String text, int fragmentSize) {
			if (analyzer == null) {
				return new ArrayList<>();
			}
			List<FragmentScore> fragments = new ArrayList<FragmentScore>();
			FragmentScore fragmentScore = null;
			List<Term> terms = analyzer.find(text);
			for (int i = 0, j = 0; i < terms.size(); i++) {
				FragmentScore fScore = new FragmentScore(fragmentSize);
				for (j = i; j < terms.size(); j++) {
					if (!fScore.add(terms.get(j))) {
						break;
					}
				}
				fScore.updateScore(phrases);
				if (fragmentScore == null || fragmentScore.compareTo(fScore) >= 0) {
					fragmentScore = fScore;
				}
				if (j >= terms.size()) {
					break;
				}
			}
			if (fragmentScore != null) {
				fragments.add(fragmentScore);
			}
			return fragments;
		}

		/** Splits the matched terms into consecutive windows and keeps the best maxNumFragments. */
		public final List<FragmentScore> getBestFragments(TreeAnalysis analyzer, String text, int maxNumFragments, int fragmentSize) {
			if (analyzer == null) {
				return new ArrayList<>();
			}
			List<Term> terms = analyzer.find(text);
			List<FragmentScore> fragments = new ArrayList<FragmentScore>();
			FragmentScore fScore = new FragmentScore(fragmentSize);
			for (int i = 0; i < terms.size(); i++) {
				if (!fScore.add(terms.get(i))) {
					fragments.add(fScore);
					fScore = new FragmentScore(fragmentSize);
					fScore.add(terms.get(i));
				}
			}
			// a fragment without terms cannot be rendered, so skip it
			if (!fScore.terms.isEmpty()) {
				fragments.add(fScore);
			}
			Collections.sort(fragments);
			while (fragments.size() > maxNumFragments) {
				fragments.remove(fragments.size() - 1);
			}
			return fragments;
		}

		protected String getTag(String[] tags, int num) {
			int n = num % tags.length;
			return tags[n];
		}
	}

	/** One matched term: character offset, pseudo position and text. */
	public static class Term {
		String word;
		int startoffset, pos;

		public Term(int startoffset, int pos, String word) {
			this.startoffset = startoffset;
			this.pos = pos;
			this.word = word;
		}

		public int endoffset() {
			return this.startoffset + word.length();
		}

		public int length() {
			return word.length();
		}

		public String toString() {
			return "start:" + startoffset + " pos:" + pos + " word:" + word;
		}
	}

	/** A trie of query terms that finds the longest match at each text position. */
	public static class TreeAnalysis {
		private TNode root = new TNode((char) 0, false);
		// nodes[c] is true when some term starts with character c
		boolean[] nodes = new boolean[64 * 1024];
		// bounds of the full-width ASCII block
		static final char ch0 = '\uFF00';
		static final char ch1 = '\uFF5F';

		public List<Term> find(String str) {
			int start = 0;
			int length = str.length();
			str = str.toLowerCase();
			char[] values = str.toCharArray();
			List<Term> terms = new ArrayList<>();
			int sumpos = 0;
			while (start < length) {
				char ch = values[start];
				// convert full-width characters to half-width
				ch = (char) (ch > ch0 && ch < ch1 ? ch - 65248 : ch);
				if (!nodes[ch]) {
					start++;
					continue;
				} else {
					int pos = root.find(values, start, -1);
					if (pos >= start) {
						// every matched term counts as one position, every other character as one
						terms.add(new Term(start, start - sumpos + terms.size(), str.substring(start, pos + 1)));
						sumpos += pos + 1 - start;
						start = pos + 1;
					} else {
						start++;
					}
				}
			}
			return terms;
		}

		public void add(String str) {
			if (str == null || str.length() == 0) {
				return;
			}
			str = str.toLowerCase();
			nodes[(int) str.charAt(0)] = true;
			root.insert(str, 0);
		}

		private static class TNode implements Comparable<TNode> {
			// whether this node is the last character of a term
			boolean mark;
			// the character of this node
			char value;
			// child nodes, kept sorted by character
			TNode[] nodes;
			int nodesize;

			public TNode(char ch, boolean mark) {
				this.value = ch;
				this.mark = mark;
			}

			/** Returns the index of the last character of the longest match, or leafoffset if none is longer. */
			public int find(char[] chs, int nextPos, int leafoffset) {
				if (nextPos >= chs.length || this.nodesize == 0) {
					return -1;
				}
				int size = 0;
				char ch = chs[nextPos];
				// convert full-width characters to half-width
				ch = (char) (ch > ch0 && ch < ch1 ? ch - 65248 : ch);
				while (size < this.nodesize && nodes[size++].value < ch)
					;
				int pos = nodes[size - 1].value == ch ? size - 1 : -1;
				// int pos = index(chs[nextPos]);
				if (pos >= 0) {
					if (nodes[pos].mark) {
						leafoffset = nextPos;
						if (nodes[pos].nodesize == 0) {
							return nextPos;
						}
					}
					int next = nodes[pos].find(chs, nextPos + 1, leafoffset);
					return next > leafoffset ? next : leafoffset;
				} else {
					return -1;
				}
			}

			/*public int index(char ch) {
				if (this.nodesize < 5) {
					int size = 0;
					while (size < this.nodesize && nodes[size++].value < ch) ;
					return nodes[size - 1].value == ch ? size - 1 : -1;
				} else {
					return indexOf(nodes, this.nodesize, ch, Type._index);
				}
			}*/

			/** Binary search; for _insert a found key is returned as -(index + 1). */
			int indexOf(TNode[] nodes, int size, char node, Type type) {
				int fromIndex = 0;
				int toIndex = size - 1;
				while (fromIndex <= toIndex) {
					int mid = (fromIndex + toIndex) >> 1;
					int cmp = nodes[mid].compareTo(node);
					if (cmp < 0)
						fromIndex = mid + 1;
					else if (cmp > 0)
						toIndex = mid - 1;
					else
						return type == Type._insert ? -(mid + 1) : mid; // key found
				}
				switch (type) {
				case _insert:
					return fromIndex;
				case _index:
					return -(fromIndex + 1);
				default:
					return toIndex;
				}
			}

			public void insert(String str, int pos) {
				char ch = str.charAt(pos);
				boolean isleaf = pos == str.length() - 1;
				if (this.nodesize == 0) {
					nodes = new TNode[1];
					nodes[0] = new TNode(ch, isleaf);
					if (!isleaf) {
						nodes[0].insert(str, pos + 1);
					}
					this.nodesize++;
				} else {
					int _index = indexOf(nodes, nodesize, ch, Type._insert);
					if (_index >= 0) {
						// new character: insert a child at the sorted position
						int moved = this.nodesize - _index;
						if (this.nodesize == nodes.length) {
							nodes = Arrays.copyOf(nodes, nodes.length + 1);
						}
						if (moved > 0) {
							System.arraycopy(nodes, _index, nodes, _index + 1, moved);
						}
						nodes[_index] = new TNode(ch, isleaf);
						if (!isleaf) {
							nodes[_index].insert(str, pos + 1);
						}
						this.nodesize++;
					} else {
						// existing character: the child sits at index -_index - 1
						if (isleaf) {
							nodes[-_index - 1].mark = true;
						} else {
							nodes[-_index - 1].insert(str, pos + 1);
						}
					}
				}
			}

			@Override
			public int compareTo(TNode o) {
				if (this.value > o.value) {
					return 1;
				} else if (this.value < o.value) {
					return -1;
				}
				return 0;
			}

			public int compareTo(char o) {
				if (this.value > o) {
					return 1;
				} else if (this.value < o) {
					return -1;
				}
				return 0;
			}

			enum Type {
				_insert, _index
			}
		}
	}
}

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.search.vectorhighlight;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack.TermInfo;

/**
 * FieldQuery breaks down query object into terms/phrases and keeps
 * them in a QueryPhraseMap structure.
 */
public class FieldQuery {

  final boolean fieldMatch;

  // fieldMatch==true,  Map<fieldName,QueryPhraseMap>
  // fieldMatch==false, Map<null,QueryPhraseMap>
  Map<String, QueryPhraseMap> rootMaps = new HashMap<>();

  // fieldMatch==true,  Map<fieldName,setOfTermsInQueries>
  // fieldMatch==false, Map<null,setOfTermsInQueries>
  Map<String, Set<String>> termSetMap = new HashMap<>();
  // stores the phrases
  Map<String, Set<Phrase>> phraseMap = new HashMap<>();

  int termOrPhraseNumber; // used for colored tag support

  // The maximum number of different matching terms accumulated from any one MultiTermQuery
  private static final int MAX_MTQ_TERMS = 1024;

  public static class Phrase {
    public List<Term> list = new ArrayList<>();
    StringBuffer buffer = new StringBuffer();

    public void add(String word, int position) {
      if (list.size() == 0) {
        buffer.append(word);
      } else {
        int pos = position - list.get(list.size() - 1).position;
        char z = 0;
        buffer.append(z).append(pos).append(z);
        buffer.append(word);
      }
      list.add(new Term(position, word));
    }

    public String toString() {
      return buffer.toString();
    }

    public static class Term {
      public int position;
      public String word;

      public Term(int position, String word) {
        this.position = position;
        this.word = word;
      }
    }
  }

  protected FieldQuery( Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch ) throws IOException {
    this.fieldMatch = fieldMatch;
    Set<Query> flatQueries = new LinkedHashSet<>();
    flatten( query, reader, flatQueries, 1f );
    saveTerms( flatQueries, reader );
    Collection<Query> expandQueries = expand( flatQueries );

    for( Query flatQuery : expandQueries ){
      QueryPhraseMap rootMap = getRootMap( flatQuery );
      rootMap.add( flatQuery, reader );
      float boost = 1f;
      while (flatQuery instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) flatQuery;
        flatQuery = bq.getQuery();
        boost *= bq.getBoost();
      }
      if( !phraseHighlight && flatQuery instanceof PhraseQuery ){
        PhraseQuery pq = (PhraseQuery)flatQuery;
        if( pq.getTerms().length > 1 ){
          for( Term term : pq.getTerms() )
            rootMap.addTerm( term, boost );
        }
      }
    }
  }

  /** For backwards compatibility you can initialize FieldQuery without
   * an IndexReader, which is only required to support MultiTermQuery
   */
  FieldQuery( Query query, boolean phraseHighlight, boolean fieldMatch ) throws IOException {
    this (query, null, phraseHighlight, fieldMatch);
  }

  void flatten( Query sourceQuery, IndexReader reader, Collection<Query> flatQueries, float boost ) throws IOException{
    while (true) {
      if (sourceQuery.getBoost() != 1f) {
        boost *= sourceQuery.getBoost();
        sourceQuery = sourceQuery.clone();
        sourceQuery.setBoost(1f);
      } else if (sourceQuery instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) sourceQuery;
        sourceQuery = bq.getQuery();
        boost *= bq.getBoost();
      } else {
        break;
      }
    }
    if( sourceQuery instanceof BooleanQuery ){
      BooleanQuery bq = (BooleanQuery)sourceQuery;
      for( BooleanClause clause : bq ) {
        if( !clause.isProhibited() ) {
          flatten( clause.getQuery(), reader, flatQueries, boost );
        }
      }
    } else if( sourceQuery instanceof DisjunctionMaxQuery ){
      DisjunctionMaxQuery dmq = (DisjunctionMaxQuery)sourceQuery;
      for( Query query : dmq ){
        flatten( query, reader, flatQueries, boost );
      }
    }
    else if( sourceQuery instanceof TermQuery ){
      if (boost != 1f) {
        sourceQuery = new BoostQuery(sourceQuery, boost);
      }
      if( !flatQueries.contains( sourceQuery ) )
        flatQueries.add( sourceQuery );
    }
    else if( sourceQuery instanceof PhraseQuery ){
      PhraseQuery pq = (PhraseQuery)sourceQuery;
      if( pq.getTerms().length == 1 )
        sourceQuery = new TermQuery( pq.getTerms()[0] );
      if (boost != 1f) {
        sourceQuery = new BoostQuery(sourceQuery, boost);
      }
      flatQueries.add(sourceQuery);
    } else if (sourceQuery instanceof ConstantScoreQuery) {
      final Query q = ((ConstantScoreQuery) sourceQuery).getQuery();
      if (q != null) {
        flatten( q, reader, flatQueries, boost);
      }
    } else if (sourceQuery instanceof FilteredQuery) {
      final Query q = ((FilteredQuery) sourceQuery).getQuery();
      if (q != null) {
        flatten( q, reader, flatQueries, boost);
      }
    } else if (sourceQuery instanceof CustomScoreQuery) {
      final Query q = ((CustomScoreQuery) sourceQuery).getSubQuery();
      if (q != null) {
        flatten( q, reader, flatQueries, boost);
      }
    } else if (reader != null) {
      Query query = sourceQuery;
      Query rewritten;
      if (sourceQuery instanceof MultiTermQuery) {
        rewritten = new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS).rewrite(reader, (MultiTermQuery) query);
      } else {
        rewritten = query.rewrite(reader);
      }
      if (rewritten != query) {
        // only rewrite once and then flatten again - the rewritten query could have a special treatment
        // if this method is overwritten in a subclass.
        flatten(rewritten, reader, flatQueries, boost);
      }
      // if the query is already rewritten we discard it
    }
    // else discard queries
  }

  /*
   * Create expandQueries from flatQueries.
   *
   * expandQueries := flatQueries + overlapped phrase queries
   *
   * ex1) flatQueries={a,b,c}
   *      => expandQueries={a,b,c}
   * ex2) flatQueries={a,"b c","c d"}
   *      => expandQueries={a,"b c","c d","b c d"}
   */
  Collection<Query> expand( Collection<Query> flatQueries ){
    Set<Query> expandQueries = new LinkedHashSet<>();
    for( Iterator<Query> i = flatQueries.iterator(); i.hasNext(); ){
      Query query = i.next();
      i.remove();
      expandQueries.add( query );
      float queryBoost = 1f;
      while (query instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) query;
        queryBoost *= bq.getBoost();
        query = bq.getQuery();
      }
      if( !( query instanceof PhraseQuery ) ) continue;
      for( Iterator<Query> j = flatQueries.iterator(); j.hasNext(); ){
        Query qj = j.next();
        float qjBoost = 1f;
        while (qj instanceof BoostQuery) {
          BoostQuery bq = (BoostQuery) qj;
          qjBoost *= bq.getBoost();
          qj = bq.getQuery();
        }
        if( !( qj instanceof PhraseQuery ) ) continue;
        checkOverlap( expandQueries, (PhraseQuery)query, queryBoost, (PhraseQuery)qj, qjBoost );
      }
    }
    return expandQueries;
  }

  /*
   * Check if PhraseQuery A and B have overlapped part.
   *
   * ex1) A="a b", B="b c" => overlap; expandQueries={"a b c"}
   * ex2) A="b c", B="a b" => overlap; expandQueries={"a b c"}
   * ex3) A="a b", B="c d" => no overlap; expandQueries={}
   */
  private void checkOverlap( Collection<Query> expandQueries, PhraseQuery a, float aBoost, PhraseQuery b, float bBoost ){
    if( a.getSlop() != b.getSlop() ) return;
    Term[] ats = a.getTerms();
    Term[] bts = b.getTerms();
    if( fieldMatch && !ats[0].field().equals( bts[0].field() ) ) return;
    checkOverlap( expandQueries, ats, bts, a.getSlop(), aBoost);
    checkOverlap( expandQueries, bts, ats, b.getSlop(), bBoost );
  }

  /*
   * Check if src and dest have overlapped part and if it is, create PhraseQueries and add expandQueries.
   *
   * ex1) src="a b", dest="c d"       => no overlap
   * ex2) src="a b", dest="a b c"     => no overlap
   * ex3) src="a b", dest="b c"       => overlap; expandQueries={"a b c"}
   * ex4) src="a b c", dest="b c d"   => overlap; expandQueries={"a b c d"}
   * ex5) src="a b c", dest="b c"     => no overlap
   * ex6) src="a b c", dest="b"       => no overlap
   * ex7) src="a a a a", dest="a a a" => overlap;
   *                                     expandQueries={"a a a a a","a a a a a a"}
   * ex8) src="a b c d", dest="b c"   => no overlap
   */
  private void checkOverlap( Collection<Query> expandQueries, Term[] src, Term[] dest, int slop, float boost ){
    // beginning from 1 (not 0) is safe because that the PhraseQuery has multiple terms
    // is guaranteed in flatten() method (if PhraseQuery has only one term, flatten()
    // converts PhraseQuery to TermQuery)
    for( int i = 1; i < src.length; i++ ){
      boolean overlap = true;
      for( int j = i; j < src.length; j++ ){
        if( ( j - i ) < dest.length && !src[j].text().equals( dest[j-i].text() ) ){
          overlap = false;
          break;
        }
      }
      if( overlap && src.length - i < dest.length ){
        PhraseQuery.Builder pqBuilder = new PhraseQuery.Builder();
        for( Term srcTerm : src )
          pqBuilder.add( srcTerm );
        for( int k = src.length - i; k < dest.length; k++ ){
          pqBuilder.add( new Term( src[0].field(), dest[k].text() ) );
        }
        pqBuilder.setSlop( slop );
        Query pq = pqBuilder.build();
        if (boost != 1f) {
          pq = new BoostQuery(pq, 1f);
        }
        if(!expandQueries.contains( pq ) )
          expandQueries.add( pq );
      }
    }
  }

  QueryPhraseMap getRootMap( Query query ){
    String key = getKey( query );
    QueryPhraseMap map = rootMaps.get( key );
    if( map == null ){
      map = new QueryPhraseMap( this );
      rootMaps.put( key, map );
    }
    return map;
  }

  /*
   * Return 'key' string. 'key' is the field name of the Query.
   * If not fieldMatch, 'key' will be null.
   */
  private String getKey( Query query ){
    if( !fieldMatch ) return null;
    while (query instanceof BoostQuery) {
      query = ((BoostQuery) query).getQuery();
    }
    if( query instanceof TermQuery )
      return ((TermQuery)query).getTerm().field();
    else if ( query instanceof PhraseQuery ){
      PhraseQuery pq = (PhraseQuery)query;
      Term[] terms = pq.getTerms();
      return terms[0].field();
    }
    else if (query instanceof MultiTermQuery) {
      return ((MultiTermQuery)query).getField();
    }
    else
      throw new RuntimeException( "query \"" + query.toString() + "\" must be flatten first." );
  }

  /*
   * Save the set of terms in the queries to termSetMap.
   *
   * ex1) q=name:john
   *      - fieldMatch==true
   *          termSetMap=Map<"name",Set<"john">>
   *      - fieldMatch==false
   *          termSetMap=Map<null,Set<"john">>
   *
   * ex2) q=name:john title:manager
   *      - fieldMatch==true
   *          termSetMap=Map<"name",Set<"john">,
   *                         "title",Set<"manager">>
   *      - fieldMatch==false
   *          termSetMap=Map<null,Set<"john","manager">>
   *
   * ex3) q=name:"john lennon"
   *      - fieldMatch==true
   *          termSetMap=Map<"name",Set<"john","lennon">>
   *      - fieldMatch==false
   *          termSetMap=Map<null,Set<"john","lennon">>
   */
  void saveTerms( Collection<Query> flatQueries, IndexReader reader ) throws IOException{
    for( Query query : flatQueries ){
      while (query instanceof BoostQuery) {
        query = ((BoostQuery) query).getQuery();
      }
      Set<Phrase> terms = getTerms( query );
      Set<String> termSet = getTermSet( query );
      if( query instanceof TermQuery ){
        termSet.add( ((TermQuery)query).getTerm().text() );
      }
      else if( query instanceof PhraseQuery ){
        int[] positions = ((PhraseQuery)query).getPositions();
        Term[] terms2 = ((PhraseQuery)query).getTerms();
        Phrase phrase = new Phrase();
        for (int i = 0; i < terms2.length; i++) {
          phrase.add(terms2[i].text(), positions[i]);
          termSet.add( terms2[i].text() );
        }
        if(terms2.length > 1){
          terms.add(phrase);
        }
      }
      else if (query instanceof MultiTermQuery && reader != null) {
        BooleanQuery mtqTerms = (BooleanQuery) query.rewrite(reader);
        for (BooleanClause clause : mtqTerms) {
          termSet.add (((TermQuery) clause.getQuery()).getTerm().text());
        }
      }
      else
        throw new RuntimeException( "query \"" + query.toString() + "\" must be flatten first." );
    }
  }

  private Set<String> getTermSet( Query query ){
    String key = getKey( query );
    Set<String> set = termSetMap.get( key );
    if( set == null ){
      set = new HashSet<>();
      termSetMap.put( key, set );
    }
    return set;
  }

  // used by fastHighlighter
  private Set<Phrase> getTerms( Query query ){
    String key = getKey( query );
    Set<Phrase> set = phraseMap.get( key );
    if( set == null ){
      set = new HashSet<>();
      phraseMap.put( key, set );
    }
    return set;
  }

  public Set<String> getTermSet( String field ){
    return termSetMap.get( fieldMatch ? field : null );
  }

  /**
   * Phrases are kept whole; they are not tokenized.
   * @param field
   * @return
   */
  public Set<Phrase> getPhrases( String field ){
    return phraseMap.get( fieldMatch ? field : null );
  }

  /**
   *
   * @return QueryPhraseMap
   */
  public QueryPhraseMap getFieldTermMap( String fieldName, String term ){
    QueryPhraseMap rootMap = getRootMap( fieldName );
    return rootMap == null ? null : rootMap.subMap.get( term );
  }

  /**
   *
   * @return QueryPhraseMap
   */
  public QueryPhraseMap searchPhrase( String fieldName, final List<TermInfo> phraseCandidate ){
    QueryPhraseMap root = getRootMap( fieldName );
    if( root == null ) return null;
    return root.searchPhrase( phraseCandidate );
  }

  private QueryPhraseMap getRootMap( String fieldName ){
    return rootMaps.get( fieldMatch ? fieldName : null );
  }

  public int nextTermOrPhraseNumber(){
    return termOrPhraseNumber++;
  }

  /**
   * Internal structure of a query for highlighting: represents
   * a nested query structure
   */
  public static class QueryPhraseMap {

    boolean terminal;
    int slop;   // valid if terminal == true and phraseHighlight == true
    float boost;  // valid if terminal == true
    int termOrPhraseNumber;   // valid if terminal == true
    FieldQuery fieldQuery;
    Map<String, QueryPhraseMap> subMap = new HashMap<>();

    public QueryPhraseMap( FieldQuery fieldQuery ){
      this.fieldQuery = fieldQuery;
    }

    void addTerm( Term term, float boost ){
      QueryPhraseMap map = getOrNewMap( subMap, term.text() );
      map.markTerminal( boost );
    }

    private QueryPhraseMap getOrNewMap( Map<String, QueryPhraseMap> subMap, String term ){
      QueryPhraseMap map = subMap.get( term );
      if( map == null ){
        map = new QueryPhraseMap( fieldQuery );
        subMap.put( term, map );
      }
      return map;
    }

    void add( Query query, IndexReader reader ) {
      float boost = 1f;
      while (query instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) query;
        query = bq.getQuery();
        boost = bq.getBoost();
      }
      if( query instanceof TermQuery ){
        addTerm( ((TermQuery)query).getTerm(), boost );
      }
      else if( query instanceof PhraseQuery ){
        PhraseQuery pq = (PhraseQuery)query;
        Term[] terms = pq.getTerms();
        Map<String, QueryPhraseMap>
                                                                 map = subMap; QueryPhraseMap qpm = null; for( Term term : terms ){ qpm = getOrNewMap( map, term.text() ); map = qpm.subMap; } qpm.markTerminal( pq.getSlop(), boost ); } else throw new RuntimeException( "query \"" + query.toString() + "\" must be flatten first." ); } public QueryPhraseMap getTermMap( String term ){ return subMap.get( term ); } private void markTerminal( float boost ){ markTerminal( 0, boost ); } private void markTerminal( int slop, float boost ){ this.terminal = true; this.slop = slop; this.boost = boost; this.termOrPhraseNumber = fieldQuery.nextTermOrPhraseNumber(); } public boolean isTerminal(){ return terminal; } public int getSlop(){ return slop; } public float getBoost(){ return boost; } public int getTermOrPhraseNumber(){ return termOrPhraseNumber; } public QueryPhraseMap searchPhrase( final List 
                                                                  phraseCandidate ){ QueryPhraseMap currMap = this; for( TermInfo ti : phraseCandidate ){ currMap = currMap.subMap.get( ti.getText() ); if( currMap == null ) return null; } return currMap.isValidTermOrPhrase( phraseCandidate ) ? currMap : null; } public boolean isValidTermOrPhrase( final List 
                                                                   phraseCandidate ){ // check terminal if( !terminal ) return false; // if the candidate is a term, it is valid if( phraseCandidate.size() == 1 ) return true; // else check whether the candidate is valid phrase // compare position-gaps between terms to slop int pos = phraseCandidate.get( 0 ).getPosition(); for( int i = 1; i < phraseCandidate.size(); i++ ){ int nextPos = phraseCandidate.get( i ).getPosition(); if( Math.abs( nextPos - pos - 1 ) > slop ) return false; pos = nextPos; } return true; } } } 
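The phrase lookup in QueryPhraseMap is easier to follow in isolation. The following is only a minimal sketch of the same idea, not the author's highlighter: the class PhraseTrieSketch and its Token, Node, add, matches and tok members are hypothetical names introduced for illustration, and boosts and per-field root maps are left out. It assumes, as the listing does, that a phrase is stored as a chain of trie nodes keyed by term text and that a candidate is accepted only when it ends on a terminal node and every gap between consecutive positions stays within the slop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the phrase trie kept by QueryPhraseMap. */
class PhraseTrieSketch {

    /** A token and its position in the field; plays the role of TermInfo. */
    static final class Token {
        final String text;
        final int position;
        Token(String text, int position) { this.text = text; this.position = position; }
    }

    /** One trie node: children keyed by term text; terminal nodes carry the slop. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal;
        int slop;
    }

    private final Node root = new Node();

    static Token tok(String text, int position) {
        return new Token(text, position);
    }

    /** Registers a term (one element) or a phrase (several elements) with the given slop. */
    void add(int slop, String... terms) {
        Node node = root;
        for (String term : terms) {
            node = node.children.computeIfAbsent(term, t -> new Node());
        }
        node.terminal = true;
        node.slop = slop;
    }

    /** True when the candidate spells a registered term or phrase within its slop. */
    boolean matches(List<Token> candidate) {
        Node node = root;
        for (Token token : candidate) {
            node = node.children.get(token.text);
            if (node == null) return false;   // no registered phrase continues this way
        }
        if (!node.terminal) return false;     // only a prefix of a registered phrase
        // Same gap rule as isValidTermOrPhrase: adjacent terms have gap 0.
        for (int i = 1; i < candidate.size(); i++) {
            int gap = candidate.get(i).position - candidate.get(i - 1).position - 1;
            if (Math.abs(gap) > node.slop) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        PhraseTrieSketch trie = new PhraseTrieSketch();
        trie.add(1, "john", "lennon");   // phrase "john lennon" with slop 1
        System.out.println(trie.matches(Arrays.asList(tok("john", 3), tok("lennon", 4)))); // true: gap 0
        System.out.println(trie.matches(Arrays.asList(tok("john", 3), tok("lennon", 5)))); // true: gap 1
        System.out.println(trie.matches(Arrays.asList(tok("john", 3), tok("lennon", 7)))); // false: gap 3
        System.out.println(trie.matches(Arrays.asList(tok("john", 3))));                   // false: not terminal
    }
}
```

Storing the slop on the terminal node means that phrases sharing a prefix can each carry their own slop, which appears to be why markTerminal in the listing records it per node rather than per query.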
