这里我们搜索的内容是“一步一步跟我学习lucene”,搜索引擎展示的结果中对用户的输入信息进行了配色方面的处理,这种区分正常文本和输入内容的效果即是高亮显示;
这样做的好处:
- 视觉上让人便于查找有搜索对应的文本块;
- 界面展示更友好;
lucene提供了highlighter插件来体现类似的效果;
highlighter对查询关键字高亮处理;
highlighter包包含了用于处理结果页查询内容高亮显示的功能,其中Highlighter类highlighter包的核心组件,借助Fragmenter, fragment Scorer, 和Formatter等类来支持用户自定义高亮展示的功能;
示例程序
这里边我利用了之前的做的目录文件索引
- package com.lucene.search.util;
- import java.io.IOException;
- import java.io.StringReader;
- import java.util.concurrent.ExecutorService;
- import java.util.concurrent.Executors;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.analysis.TokenStream;
- import org.apache.lucene.analysis.standard.StandardAnalyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.index.Term;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.ScoreDoc;
- import org.apache.lucene.search.TermQuery;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.search.highlight.Highlighter;
- import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
- import org.apache.lucene.search.highlight.QueryScorer;
- import org.apache.lucene.search.highlight.SimpleFragmenter;
- import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
- import org.apache.lucene.util.BytesRef;
- public class HighlighterTest {
- public static void main(String[] args) {
- IndexSearcher searcher;
- TopDocs docs;
- ExecutorService service = Executors.newCachedThreadPool();
- try {
- searcher = SearchUtil.getMultiSearcher("index", service);
- Term term = new Term("content",new BytesRef("lucene"));
- TermQuery termQuery = new TermQuery(term);
- docs = SearchUtil.getScoreDocsByPerPage(1, 30, searcher, termQuery);
- ScoreDoc[] hits = docs.scoreDocs;
- QueryScorer scorer = new QueryScorer(termQuery);
- SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>","</B>");//设定高亮显示的格式<B>keyword</B>,此为默认的格式
- Highlighter highlighter = new Highlighter(simpleHtmlFormatter,scorer);
- highlighter.setTextFragmenter(new SimpleFragmenter(20));//设置每次返回的字符数
- Analyzer analyzer = new StandardAnalyzer();
- for(int i=0;i<hits.length;i++){
- Document doc = searcher.doc(hits[i].doc);
- String str = highlighter.getBestFragment(analyzer, "content", doc.get("content")) ;
- System.out.println(str);
- }
- } catch (IOException e1) {
- // TODO Auto-generated catch block
- e1.printStackTrace();
- } catch (InvalidTokenOffsetsException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }finally{
- service.shutdown();
- }
- }
- }
lucene的highlighter高亮展示的原理:
- 根据Formatter和Scorer创建highlighter对象,formatter定义了高亮的显示方式,而scorer定义了高亮的评分;
评分的算法是先根据term的评分值获取对应的document的权重,在此基础上对文本的内容进行轮询,获取对应的文本出现的次数,和它在term对应的文本中出现的位置(便于高亮处理),评分并分词的算法为:
- public float getTokenScore() {
- position += posIncAtt.getPositionIncrement();//记录出现的位置
- String termText = termAtt.toString();
- WeightedSpanTerm weightedSpanTerm;
- if ((weightedSpanTerm = fieldWeightedSpanTerms.get(
- termText)) == null) {
- return 0;
- }
- if (weightedSpanTerm.positionSensitive &&
- !weightedSpanTerm.checkPosition(position)) {
- return 0;
- }
- float score = weightedSpanTerm.getWeight();//获取权重
- // found a query term - is it unique in this doc?
- if (!foundTerms.contains(termText)) {//结果排重处理
- totalScore += score;
- foundTerms.add(termText);
- }
- return score;
- }
formatter的原理为:对搜索的文本进行判断,如果scorer获取的totalScore不小于0,即查询内容在对应的term中存在,则按照格式拼接成preTag+查询内容+postTag的格式
详细算法如下:
- public String highlightTerm(String originalText, TokenGroup tokenGroup) {
- if (tokenGroup.getTotalScore() <= 0) {
- return originalText;
- }
- // Allocate StringBuilder with the right number of characters from the
- // beginning, to avoid char[] allocations in the middle of appends.
- StringBuilder returnBuffer = new StringBuilder(preTag.length() + originalText.length() + postTag.length());
- returnBuffer.append(preTag);
- returnBuffer.append(originalText);
- returnBuffer.append(postTag);
- return returnBuffer.toString();
- }
其默认格式为“<B></B>”的形式;
- Highlighter根据scorer和formatter,对document进行分析,查询结果调用getBestTextFragments,TokenStream tokenStream,String text,boolean mergeContiguousFragments,int maxNumFragments),其过程为
- scorer首先初始化查询内容对应的出现位置的下标,然后对tokenstream添加PositionIncrementAttribute,此类记录单词出现的位置;
- 对文本内容进行轮询,判断查询的文本长度是否超出限制,如果超出文本长度提示过长内容;
- 如果获取到指定的文本,先对单次查询的内容进行内容的截取(截取值根据setTextFragmenter指定的值决定),再调用formatter的highlightTerm方法对文本进行重新构建
- 在本次轮询和下次单词出现之前对文本内容进行处理
查询工具类
- package com.lucene.search.util;
- import java.io.File;
- import java.io.IOException;
- import java.nio.file.Paths;
- import java.util.Set;
- import java.util.concurrent.ExecutorService;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.analysis.standard.StandardAnalyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.index.DirectoryReader;
- import org.apache.lucene.index.IndexReader;
- import org.apache.lucene.index.MultiReader;
- import org.apache.lucene.index.Term;
- import org.apache.lucene.queryparser.classic.ParseException;
- import org.apache.lucene.queryparser.classic.QueryParser;
- import org.apache.lucene.search.BooleanQuery;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.MatchAllDocsQuery;
- import org.apache.lucene.search.NumericRangeQuery;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.ScoreDoc;
- import org.apache.lucene.search.TermQuery;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.search.BooleanClause.Occur;
- import org.apache.lucene.search.highlight.Highlighter;
- import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
- import org.apache.lucene.search.highlight.QueryScorer;
- import org.apache.lucene.search.highlight.SimpleFragmenter;
- import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
- import org.apache.lucene.store.FSDirectory;
- import org.apache.lucene.util.BytesRef;
- /**lucene索引查询工具类
- * @author lenovo
- *
- */
- public class SearchUtil {
- /**获取IndexSearcher对象
- * @param indexPath
- * @param service
- * @return
- * @throws IOException
- */
- public static IndexSearcher getIndexSearcherByParentPath(String parentPath,ExecutorService service) throws IOException{
- MultiReader reader = null;
- //设置
- try {
- File[] files = new File(parentPath).listFiles();
- IndexReader[] readers = new IndexReader[files.length];
- for (int i = 0 ; i < files.length ; i ++) {
- readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
- }
- reader = new MultiReader(readers);
- } catch (IOException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
- return new IndexSearcher(reader,service);
- }
- /**多目录多线程查询
- * @param parentPath 父级索引目录
- * @param service 多线程查询
- * @return
- * @throws IOException
- */
- public static IndexSearcher getMultiSearcher(String parentPath,ExecutorService service) throws IOException{
- File file = new File(parentPath);
- File[] files = file.listFiles();
- IndexReader[] readers = new IndexReader[files.length];
- for (int i = 0 ; i < files.length ; i ++) {
- readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
- }
- MultiReader multiReader = new MultiReader(readers);
- IndexSearcher searcher = new IndexSearcher(multiReader,service);
- return searcher;
- }
- /**根据索引路径获取IndexReader
- * @param indexPath
- * @return
- * @throws IOException
- */
- public static DirectoryReader getIndexReader(String indexPath) throws IOException{
- return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath, new String[0])));
- }
- /**根据索引路径获取IndexSearcher
- * @param indexPath
- * @param service
- * @return
- * @throws IOException
- */
- public static IndexSearcher getIndexSearcherByIndexPath(String indexPath,ExecutorService service) throws IOException{
- IndexReader reader = getIndexReader(indexPath);
- return new IndexSearcher(reader,service);
- }
- /**如果索引目录会有变更用此方法获取新的IndexSearcher这种方式会占用较少的资源
- * @param oldSearcher
- * @param service
- * @return
- * @throws IOException
- */
- public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher,ExecutorService service) throws IOException{
- DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
- DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
- return new IndexSearcher(newReader, service);
- }
- /**多条件查询类似于sql in
- * @param querys
- * @return
- */
- public static Query getMultiQueryLikeSqlIn(Query ... querys){
- BooleanQuery query = new BooleanQuery();
- for (Query subQuery : querys) {
- query.add(subQuery,Occur.SHOULD);
- }
- return query;
- }
- /**多条件查询类似于sql and
- * @param querys
- * @return
- */
- public static Query getMultiQueryLikeSqlAnd(Query ... querys){
- BooleanQuery query = new BooleanQuery();
- for (Query subQuery : querys) {
- query.add(subQuery,Occur.MUST);
- }
- return query;
- }
- /**从指定配置项中查询
- * @return
- * @param analyzer 分词器
- * @param field 字段
- * @param fieldType 字段类型
- * @param queryStr 查询条件
- * @param range 是否区间查询
- * @return
- */
- public static Query getQuery(String field,String fieldType,String queryStr,boolean range){
- Query q = null;
- try {
- if(queryStr != null && !"".equals(queryStr)){
- if(range){
- String[] strs = queryStr.split("\\|");
- if("int".equals(fieldType)){
- int min = new Integer(strs[0]);
- int max = new Integer(strs[1]);
- q = NumericRangeQuery.newIntRange(field, min, max, true, true);
- }else if("double".equals(fieldType)){
- Double min = new Double(strs[0]);
- Double max = new Double(strs[1]);
- q = NumericRangeQuery.newDoubleRange(field, min, max, true, true);
- }else if("float".equals(fieldType)){
- Float min = new Float(strs[0]);
- Float max = new Float(strs[1]);
- q = NumericRangeQuery.newFloatRange(field, min, max, true, true);
- }else if("long".equals(fieldType)){
- Long min = new Long(strs[0]);
- Long max = new Long(strs[1]);
- q = NumericRangeQuery.newLongRange(field, min, max, true, true);
- }
- }else{
- if("int".equals(fieldType)){
- q = NumericRangeQuery.newIntRange(field, new Integer(queryStr), new Integer(queryStr), true, true);
- }else if("double".equals(fieldType)){
- q = NumericRangeQuery.newDoubleRange(field, new Double(queryStr), new Double(queryStr), true, true);
- }else if("float".equals(fieldType)){
- q = NumericRangeQuery.newFloatRange(field, new Float(queryStr), new Float(queryStr), true, true);
- }else{
- Analyzer analyzer = new StandardAnalyzer();
- q = new QueryParser(field, analyzer).parse(queryStr);
- }
- }
- }else{
- q= new MatchAllDocsQuery();
- }
- System.out.println(q);
- } catch (ParseException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
- return q;
- }
- /**根据field和值获取对应的内容
- * @param fieldName
- * @param fieldValue
- * @return
- */
- public static Query getQuery(String fieldName,Object fieldValue){
- Term term = new Term(fieldName, new BytesRef(fieldValue.toString()));
- return new TermQuery(term);
- }
- /**根据IndexSearcher和docID获取默认的document
- * @param searcher
- * @param docID
- * @return
- * @throws IOException
- */
- public static Document getDefaultFullDocument(IndexSearcher searcher,int docID) throws IOException{
- return searcher.doc(docID);
- }
- /**根据IndexSearcher和docID
- * @param searcher
- * @param docID
- * @param listField
- * @return
- * @throws IOException
- */
- public static Document getDocumentByListField(IndexSearcher searcher,int docID,Set<String> listField) throws IOException{
- return searcher.doc(docID, listField);
- }
- /**分页查询
- * @param page 当前页数
- * @param perPage 每页显示条数
- * @param searcher searcher查询器
- * @param query 查询条件
- * @return
- * @throws IOException
- */
- public static TopDocs getScoreDocsByPerPage(int page,int perPage,IndexSearcher searcher,Query query) throws IOException{
- TopDocs result = null;
- if(query == null){
- System.out.println(" Query is null return null ");
- return null;
- }
- ScoreDoc before = null;
- if(page != 1){
- TopDocs docsBefore = searcher.search(query, (page-1)*perPage);
- ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
- if(scoreDocs.length > 0){
- before = scoreDocs[scoreDocs.length - 1];
- }
- }
- result = searcher.searchAfter(before, query, perPage);
- return result;
- }
- public static TopDocs getScoreDocs(IndexSearcher searcher,Query query) throws IOException{
- TopDocs docs = searcher.search(query, getMaxDocId(searcher));
- return docs;
- }
- /**高亮显示字段
- * @param searcher
- * @param field
- * @param keyword
- * @param preTag
- * @param postTag
- * @param fragmentSize
- * @return
- * @throws IOException
- * @throws InvalidTokenOffsetsException
- */
- public static String[] highlighter(IndexSearcher searcher,String field,String keyword,String preTag, String postTag,int fragmentSize) throws IOException, InvalidTokenOffsetsException{
- Term term = new Term("content",new BytesRef("lucene"));
- TermQuery termQuery = new TermQuery(term);
- TopDocs docs = getScoreDocs(searcher, termQuery);
- ScoreDoc[] hits = docs.scoreDocs;
- QueryScorer scorer = new QueryScorer(termQuery);
- SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter(preTag,postTag);//设定高亮显示的格式<B>keyword</B>,此为默认的格式
- Highlighter highlighter = new Highlighter(simpleHtmlFormatter,scorer);
- highlighter.setTextFragmenter(new SimpleFragmenter(fragmentSize));//设置每次返回的字符数
- Analyzer analyzer = new StandardAnalyzer();
- String[] result = new String[hits.length];
- for (int i = 0; i < result.length ; i++) {
- Document doc = searcher.doc(hits[i].doc);
- result[i] = highlighter.getBestFragment(analyzer, field, doc.get(field));
- }
- return result;
- }
- /**统计document的数量,此方法等同于matchAllDocsQuery查询
- * @param searcher
- * @return
- */
- public static int getMaxDocId(IndexSearcher searcher){
- return searcher.getIndexReader().maxDoc();
- }
- }