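Many search engines highlight the matched keywords in their results so that users can spot them quickly, and Lucene.NET provides highlighting as well.
1. Implementing highlighting
1.1. Installing Lucene.Net.Highlighter
Highlighting in Lucene.NET is provided by the Lucene.Net.Highlighter package. Install it with the NuGet package manager, preferably in the same version as Lucene.NET itself.
1.2. Modifying the search method
Highlighting is a nice-to-have feature, so whether to highlight is exposed as a configurable item on the search input, and the highlighting itself is implemented inside the concrete search method.
The search input class SingleSearchOption becomes: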
public class SingleSearchOption : SearchOptionBase
{
    /// <summary>
    /// Search keyword
    /// </summary>
    public string Keyword { get; set; }
    /// <summary>
    /// Fields the search is restricted to
    /// </summary>
    public List<string> Fields { get; set; }
    /// <summary>
    /// Whether to highlight matches
    /// </summary>
    public bool IsHightLight { get; set; }
    public SingleSearchOption(string keyword, List<string> fields, int maxHits = 100, bool isHightLight = false)
    {
        if (string.IsNullOrWhiteSpace(keyword))
        {
            throw new ArgumentException("The search keyword must not be empty");
        }
        Keyword = keyword;
        Fields = fields;
        MaxHits = maxHits;
        IsHightLight = isHightLight;
    }
}
The search method SingleSearch becomes:
public SingleSearchResult SingleSearch(SingleSearchOption option)
{
    SingleSearchResult result = new SingleSearchResult();
    Stopwatch watch = Stopwatch.StartNew();
    using (Lucene.Net.Index.DirectoryReader reader = DirectoryReader.Open(Directory))
    {
        // Create the index searcher
        IndexSearcher searcher = new IndexSearcher(reader);
        var queryParser = new MultiFieldQueryParser(LuceneVersion.LUCENE_48, option.Fields.ToArray(), Analyzer);
        Query query = queryParser.Parse(option.Keyword);
        var matches = searcher.Search(query, option.MaxHits).ScoreDocs;
        #region Highlighting
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(scorer);
        #endregion
        result.TotalHits = matches.Length;
        foreach (var match in matches)
        {
            var doc = searcher.Doc(match.Doc);
            SearchResultItem item = new SearchResultItem();
            item.Score = match.Score;
            item.EntityId = doc.GetField(CoreConstant.EntityId).GetStringValue();
            item.EntityName = doc.GetField(CoreConstant.EntityType).GetStringValue();
            // Only the first requested field is read back and highlighted
            String storedField = doc.Get(option.Fields[0]);
            if (option.IsHightLight && storedField != null) // highlight only when there is stored text
            {
                TokenStream stream = TokenSources.GetAnyTokenStream(reader, match.Doc, option.Fields[0], doc, Analyzer);
                IFragmenter fragmenter = new SimpleSpanFragmenter(scorer);
                highlighter.TextFragmenter = fragmenter;
                string fragment = highlighter.GetBestFragment(stream, storedField);
                // GetBestFragment returns null when this field contains no match, so fall back to the stored text
                item.FieldValue = fragment ?? storedField;
            }
            else
            {
                item.FieldValue = storedField;
            }
            result.Items.Add(item);
        }
    }
    watch.Stop();
    result.Elapsed = watch.ElapsedMilliseconds;
    return result;
}
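1.3. Testing the highlighting
That completes a basic highlighting feature. Test it through the WebAPI endpoint.
In the response, every occurrence of the keyword "设计" (design) in the results is wrapped in <B></B> tags.
2. How highlighting works
The example above uses the simplest highlighting that Lucene.NET offers. The principle is not complicated, and understanding it helps when building richer effects. In short, the search results are processed a second time: the positions of the matched keywords are located, and the result text is rewritten with markup added around them. The rough flow is as follows.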
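The idea can be sketched without Lucene.NET at all. The following is a minimal illustration only, not how the library is implemented: the class HighlightSketch and its naive tokenizer (split on non-alphanumeric characters, lowercase the term) are hypothetical and stand in for the Analyzer and the Highlighter.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only; not part of Lucene.NET.
public static class HighlightSketch
{
    // Splits on non-letter/digit characters and records each token's [start, end) offsets.
    public static List<(string Term, int Start, int End)> Tokenize(string text)
    {
        var tokens = new List<(string Term, int Start, int End)>();
        int i = 0;
        while (i < text.Length)
        {
            if (!char.IsLetterOrDigit(text[i])) { i++; continue; }
            int start = i;
            while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
            tokens.Add((text.Substring(start, i - start).ToLowerInvariant(), start, i));
        }
        return tokens;
    }

    // Rebuilds the text, wrapping every token whose term appears in queryTerms.
    public static string Highlight(string text, ISet<string> queryTerms,
                                   string pre = "<B>", string post = "</B>")
    {
        var sb = new StringBuilder();
        int last = 0;
        foreach (var (term, start, end) in Tokenize(text))
        {
            if (!queryTerms.Contains(term)) continue;
            sb.Append(text, last, start - last);                       // untouched text before the match
            sb.Append(pre).Append(text, start, end - start).Append(post); // original casing is preserved
            last = end;
        }
        sb.Append(text, last, text.Length - last);                     // remaining tail
        return sb.ToString();
    }
}
```

For example, HighlightSketch.Highlight("Good design beats clever Design.", new HashSet<string> { "design" }) yields "Good <B>design</B> beats clever <B>Design</B>.". The offsets recorded per token are what make the rewrite possible, which is why the token stream matters so much in the steps below.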
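2.1. Instantiating QueryScorer
QueryScorer scorer = new QueryScorer(query);
QueryScorer implements the Lucene.Net.Search.Highlight.IScorer interface and scores text fragments by the number of unique query terms found in them. Its constructor takes the current Query instance; optional arguments are an IndexReader instance and the name of the field to highlight.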
public QueryScorer(Query query) => this.Init(query, (string) null, (IndexReader) null, true);
public QueryScorer(Query query, string field) => this.Init(query, field, (IndexReader) null, true);
public QueryScorer(Query query, IndexReader reader, string field) => this.Init(query, field, reader, true);
public QueryScorer(Query query, IndexReader reader, string field, string defaultField)
{
    this.defaultField = defaultField.Intern();
    this.Init(query, field, reader, true);
}
public QueryScorer(Query query, string field, string defaultField)
{
    this.defaultField = defaultField.Intern();
    this.Init(query, field, (IndexReader) null, true);
}
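2.2. Instantiating Highlighter
Highlighter highlighter = new Highlighter(scorer);
As its name suggests, the Highlighter class marks the matching terms in a text. When no formatter or encoder is passed, it falls back to SimpleHTMLFormatter and DefaultEncoder: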
public Highlighter(IScorer fragmentScorer)
    : this((IFormatter) new SimpleHTMLFormatter(), fragmentScorer)
{
}
public Highlighter(IFormatter formatter, IScorer fragmentScorer)
    : this(formatter, (IEncoder) new DefaultEncoder(), fragmentScorer)
{
}
public Highlighter(IFormatter formatter, IEncoder encoder, IScorer fragmentScorer)
{
    this._formatter = formatter;
    this._encoder = encoder;
    this._fragmentScorer = fragmentScorer;
}
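Because the formatter is just a constructor argument, the default <B> tags can be swapped for other markup. The following is a minimal sketch under some assumptions: StandardAnalyzer, the field name "content", the sample text and the CSS class "hit" are placeholders, and the project is assumed to reference the Lucene.Net.Analysis.Common and Lucene.Net.Highlighter packages.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Lucene.Net.Util;

public static class CustomFormatterDemo
{
    public static void Main()
    {
        // Assumption: StandardAnalyzer and the field "content" are placeholders for illustration.
        Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        Query query = new TermQuery(new Term("content", "design"));

        // Score only terms of the "content" field and replace the default tags with custom markup.
        var scorer = new QueryScorer(query, "content");
        IFormatter formatter = new SimpleHTMLFormatter("<em class=\"hit\">", "</em>");
        var highlighter = new Highlighter(formatter, scorer);

        string text = "Good design makes search results easier to scan.";

        // This overload analyzes the text itself, so the demo needs no index.
        string fragment = highlighter.GetBestFragment(analyzer, "content", text);

        // GetBestFragment returns null when nothing matches; fall back to the raw text.
        Console.WriteLine(fragment ?? text);
    }
}
```

Any class implementing IFormatter could be passed instead if plain pre/post tags are not enough.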
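2.3. Obtaining the TokenStream
The TokenStream is arguably the core of the process. As covered earlier in the Lucene.NET workflow, text is split into a TokenStream that records the position of every token. With the TokenStream, the tokens that need highlighting can be located quickly.
TokenStream stream = TokenSources.GetAnyTokenStream(reader, match.Doc, option.Fields[0], doc, Analyzer);
GetAnyTokenStream returns the token stream for the given IndexReader, matched Document and Analyzer. It first tries the term vectors stored for the document and, if none are available for the field, falls back to re-analyzing the stored field value with the Analyzer: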
public static TokenStream GetAnyTokenStream(
    IndexReader reader,
    int docId,
    string field,
    Document doc,
    Analyzer analyzer)
{
    TokenStream tokenStream = (TokenStream) null;
    Terms terms = reader.GetTermVectors(docId)?.GetTerms(field);
    if (terms != null)
        tokenStream = TokenSources.GetTokenStream(terms);
    return tokenStream ?? TokenSources.GetTokenStream(doc, field, analyzer);
}
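The term-vector branch above only applies when the field was indexed with term vectors. A minimal sketch of such a field definition follows; the class name TermVectorFieldSketch and the field name "content" are hypothetical, and the sketch assumes that positions and offsets should be stored along with the vector, since rebuilding a token stream from term vectors relies on offsets as far as I know.

```csharp
using Lucene.Net.Documents;

// Hypothetical helper, for illustration only.
public static class TermVectorFieldSketch
{
    // Builds a document whose "content" field is stored, tokenized,
    // and keeps term vectors with positions and offsets.
    public static Document BuildDocument(string content)
    {
        // Copy the stored-text defaults, then switch on the term vector options.
        var type = new FieldType(TextField.TYPE_STORED)
        {
            StoreTermVectors = true,
            StoreTermVectorPositions = true,
            StoreTermVectorOffsets = true
        };
        type.Freeze(); // make the type read-only before it is used

        var doc = new Document();
        doc.Add(new Field("content", content, type));
        return doc;
    }
}
```

Storing term vectors enlarges the index; without them the fallback path simply re-analyzes the stored text at search time, so this is a trade-off between index size and highlighting cost.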
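2.4. Setting the fragmenter
IFragmenter fragmenter = new SimpleSpanFragmenter(scorer);
highlighter.TextFragmenter = fragmenter;
Instantiate a class that implements the IFragmenter interface. It splits the text into fragments of a given size, rather than into individual characters. SimpleSpanFragmenter uses a fragment size of 100 unless another value is passed: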
/// <param name="queryScorer"><see cref="T:Lucene.Net.Search.Highlight.QueryScorer" /> that was used to score hits</param>
public SimpleSpanFragmenter(QueryScorer queryScorer)
    : this(queryScorer, 100)
{
}
/// <param name="queryScorer"><see cref="T:Lucene.Net.Search.Highlight.QueryScorer" /> that was used to score hits</param>
/// <param name="fragmentSize">size in bytes of each fragment</param>
public SimpleSpanFragmenter(QueryScorer queryScorer, int fragmentSize)
{
    this.fragmentSize = fragmentSize;
    this.queryScorer = queryScorer;
}
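2.5. Applying the highlight markup
String storedField = doc.Get(option.Fields[0]);
string fragment = highlighter.GetBestFragment(stream, storedField);
The last step wraps the matched keywords in the highlight markup. GetBestFragment is a thin wrapper that asks GetBestFragments for a single fragment; only fragments with a score above zero are returned: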
public string[] GetBestFragments(TokenStream tokenStream, string text, int maxNumFragments)
{
    maxNumFragments = Math.Max(1, maxNumFragments);
    TextFragment[] bestTextFragments = this.GetBestTextFragments(tokenStream, text, true, maxNumFragments);
    List<string> stringList = new List<string>();
    for (int index = 0; index < bestTextFragments.Length; ++index)
    {
        if (bestTextFragments[index] != null && (double) bestTextFragments[index].Score > 0.0)
            stringList.Add(bestTextFragments[index].ToString());
    }
    return stringList.ToArray();
}
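For long field values it can be useful to return several short snippets instead of a single fragment. The following is a rough sketch rather than a recommended configuration: the fragment size of 40, the limit of 3 fragments, the " ... " separator, StandardAnalyzer and the field name "content" are all arbitrary assumptions.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Lucene.Net.Util;

public static class MultiFragmentDemo
{
    public static void Main()
    {
        // Assumption: analyzer, field name and sample text are placeholders.
        Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        Query query = new TermQuery(new Term("content", "design"));
        var scorer = new QueryScorer(query, "content");

        var highlighter = new Highlighter(scorer)
        {
            // Smaller fragments than the default of 100, so a long text yields several snippets.
            TextFragmenter = new SimpleSpanFragmenter(scorer, 40)
        };

        string text =
            "Interface design shapes the first impression of a product. " +
            "Many teams postpone performance work until late in a project. " +
            "A careful database design avoids expensive migrations later on.";

        // Ask for up to three fragments; only fragments that contain a match are returned.
        string[] fragments = highlighter.GetBestFragments(analyzer, "content", text, 3);

        // Fall back to the full text when nothing matched.
        Console.WriteLine(fragments.Length > 0 ? string.Join(" ... ", fragments) : text);
    }
}
```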
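Note: one thing needs attention throughout this process. The Analyzer (or, more precisely, the tokenizer it uses) must stay the same from indexing to searching. If the index is built with one tokenizer and queried with another, keywords may fail to match the results at all, and because the tokenization differs, the keyword positions will also be offset.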
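One way to make this hard to get wrong is to hand out the writer configuration, the query parser and the highlighter from a single place that owns the Analyzer. The class below is a hypothetical sketch, not part of the original project: the name SearchComponents and its members are made up for illustration, and it assumes the Lucene.Net.QueryParser and Lucene.Net.Highlighter packages are referenced.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Lucene.Net.Util;

// Hypothetical helper: every component that tokenizes text gets the same Analyzer instance.
public sealed class SearchComponents
{
    public Analyzer Analyzer { get; }

    public SearchComponents(Analyzer analyzer) => Analyzer = analyzer;

    // Indexing side: the writer tokenizes documents with the shared analyzer.
    public IndexWriterConfig CreateWriterConfig() =>
        new IndexWriterConfig(LuceneVersion.LUCENE_48, Analyzer);

    // Query side: the parser tokenizes the keyword with the same analyzer.
    public QueryParser CreateParser(string field) =>
        new QueryParser(LuceneVersion.LUCENE_48, field, Analyzer);

    // Highlighting side: the text is re-analyzed with the same analyzer,
    // so token offsets line up with what was indexed.
    public string Highlight(Query query, string field, string text)
    {
        var highlighter = new Highlighter(new QueryScorer(query, field));
        return highlighter.GetBestFragment(Analyzer, field, text) ?? text;
    }
}
```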