Lucene: Advanced Topics

摩西_玄晨
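Adjusting Lucene's relevance sorting

By default, Lucene sorts search results by relevance. This ordering is based on the internal Score and DocID, and the Score in turn depends on the internal scoring of the query terms and on the boost set at indexing time. Documents with a higher Score come first; when Scores are equal, they follow indexing order, with earlier-indexed documents first. So what if you want the earlier-indexed documents to come last? After studying the source code, I found this is quite simple. The code below is based on Lucene 2.0.
Look at the default constructor of Sort: relevance is simply the combination of SortField.FIELD_SCORE and SortField.FIELD_DOC.
Java code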
 
   /**
 * Sorts by computed relevance. This is the same sort criteria as calling
 * {@link Searcher#search(Query) Searcher#search()}without a sort criteria,
 * only with slightly more overhead.
 */  
 public Sort() {   
   this(new SortField[] { SortField.FIELD_SCORE, SortField.FIELD_DOC });   
 }  
 
So how do we construct the SortField we need? Look at one of SortField's constructors: its reverse parameter lets us flip the order of the result set.

Java code
 
    /** Creates a sort, possibly in reverse, by terms in the given field with the
     * type of term values explicitly given.
     * @param field   Name of field to sort by.   Can be null if
     *               type is SCORE or DOC.
     * @param type    Type of values in the terms.
     * @param reverse True if natural order should be reversed.
     */  
   public SortField (String field, int type, boolean reverse) {   
     this.field = (field != null) ? field.intern() : field;   
     this.type = type;   
     this.reverse = reverse;   
    }  
 
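So all we need is to build a SortField[] to get the behavior we want:

Java code

    // Score descending; on equal scores, later-indexed documents come first
    new SortField[] { SortField.FIELD_SCORE, new SortField(null, SortField.DOC, true) }

    // Score ascending; on equal scores, later-indexed documents come first.
    // Heh, this puts the least relevant results first, which is rather fun.
    new SortField[] { new SortField(null, SortField.SCORE, true), new SortField(null, SortField.DOC, true) }

Just pass this SortField[] to the Sort constructor to get a Sort instance, then pass that instance to searcher.search(query, sort) to get the expected results.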
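To make the effect of the reverse flag concrete, here is a minimal plain-Java sketch of the two sort keys. It does not use Lucene at all: Hit, byScoreThenDoc and order are hypothetical names introduced only for illustration, and the comparator merely models "score descending, then DocID ascending or descending".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of the two sort keys; Hit is a hypothetical stand-in, not a Lucene class.
class SortOrderSketch {
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Score descending; ties broken by docId, ascending by default or descending when reversed.
    static Comparator<Hit> byScoreThenDoc(boolean reverseDoc) {
        Comparator<Hit> byScore = (a, b) -> Float.compare(b.score, a.score);
        Comparator<Hit> byDoc = Comparator.comparingInt((Hit h) -> h.docId);
        return byScore.thenComparing(reverseDoc ? byDoc.reversed() : byDoc);
    }

    // Returns the docIds in sorted order without modifying the input list.
    static List<Integer> order(List<Hit> hits, boolean reverseDoc) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(byScoreThenDoc(reverseDoc));
        List<Integer> ids = new ArrayList<>();
        for (Hit h : copy) ids.add(h.docId);
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit(0, 1.0f), new Hit(1, 2.0f), new Hit(2, 1.0f));
        System.out.println(order(hits, false)); // [1, 0, 2]: earlier-indexed wins the tie
        System.out.println(order(hits, true));  // [1, 2, 0]: later-indexed wins the tie
    }
}
```

Documents 0 and 2 have the same score, so only the second sort key decides their relative order.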
 
1. Multi-field search

MultiFieldQueryParser lets you specify several fields to search.

    Query query = MultiFieldQueryParser.Parse("name*", new string[] { FieldName, FieldValue }, analyzer);
    IndexReader reader = IndexReader.Open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.Search(query);
   
2. Multi-condition search

Besides using QueryParser.Parse to parse complex search syntax, you can also combine multiple Query objects to achieve the same goal.

    Query query1 = new TermQuery(new Term(FieldValue, "name1")); // term search
    Query query2 = new WildcardQuery(new Term(FieldName, "name*")); // wildcard search
    //Query query3 = new PrefixQuery(new Term(FieldName, "name1")); // prefix search (Field:Keyword); a trailing * is added automatically
    //Query query4 = new RangeQuery(new Term(FieldNumber, NumberTools.LongToString(11L)), new Term(FieldNumber, NumberTools.LongToString(13L)), true); // range search
    //Query query5 = new FilteredQuery(query, filter); // search with a filter condition

    BooleanQuery query = new BooleanQuery();
    query.Add(query1, BooleanClause.Occur.MUST);
    query.Add(query2, BooleanClause.Occur.MUST);

    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.Search(query);
   
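3. Setting weights

You can add a weight (Boost) to a Document or a Field so that it ranks higher in search results. By default, search results are ordered by Document.Score; the larger the value, the higher the rank. The default Boost is 1.

    Score = Score * Boost

With the formula above, we can set different weights to influence ranking.

The following example sets the weight according to VIP level.

    Document document = new Document();
    switch (vip)
    {
        case VIP.Gold: document.SetBoost(2F); break;
        case VIP.Argentine: document.SetBoost(1.5F); break;
    }

As long as the Boost is large enough, a given hit can be made to always rank first; this is the "paid ranking" business of sites such as Baidu. Plainly unfair, and I look down on it.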
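The formula above can be illustrated with a tiny plain-Java sketch. It only applies the article's simplified formula (Score = Score * Boost) to made-up numbers; boosted is a hypothetical helper, and the real Lucene scoring involves more factors than this.

```java
// Applies the article's simplified formula to illustrative scores; not Lucene's actual scoring code.
class BoostSketch {
    // Final score = raw score * boost (the default boost is 1).
    static float boosted(float rawScore, float boost) {
        return rawScore * boost;
    }

    public static void main(String[] args) {
        float gold = boosted(0.4f, 2.0f);   // weaker match, but boosted as in the Gold case
        float plain = boosted(0.6f, 1.0f);  // stronger match with the default boost
        System.out.println(gold > plain);   // true: the boosted document now ranks first
    }
}
```

A document that matches the query less well can overtake a better match once its boost is large enough, which is exactly the effect described above.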

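4. Sorting

Through the SortField constructor parameters, we can set the sort field, the sort type, and whether to reverse the order.

    // Note: with type SortField.DOC the results are ordered by document number and the field name is not used;
    // types such as SortField.STRING, INT or FLOAT sort by the field's values.
    Sort sort = new Sort(new SortField(FieldName, SortField.DOC, false));
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.Search(query, sort);

Sorting has a significant impact on search speed; avoid using multiple sort criteria where possible.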

5. Filtering

Using a Filter on the search results gives more precise results within a narrower range.

For example, let's search for products listed between 2005-10-1 and 2005-10-30.
Date/time values must be converted before they can be added to the index, and the field must be an indexed field.

    // index
    document.Add(new Field(FieldDate, DateField.DateToString(date), Field.Store.YES, Field.Index.UN_TOKENIZED));
    //...

    // search
    Filter filter = new DateFilter(FieldDate, DateTime.Parse("2005-10-1"), DateTime.Parse("2005-10-30"));
    Hits hits = searcher.Search(query, filter);
   
Besides dates, integers can also be used, for example to search for products priced between 100 and 200.
Lucene.Net's NumberTools pads numbers to a fixed width; if you need floating-point numbers, you can write your own conversion using its source as a reference.

    // index
    document.Add(new Field(FieldNumber, NumberTools.LongToString((long)price), Field.Store.YES, Field.Index.UN_TOKENIZED));
    //...

    // search
    Filter filter = new RangeFilter(FieldNumber, NumberTools.LongToString(100L), NumberTools.LongToString(200L), true, true);
    Hits hits = searcher.Search(query, filter);
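Why the padding matters: range comparisons on terms are string comparisons, so numbers of different widths compare incorrectly unless they are padded. The plain-Java sketch below shows the idea with decimal zero-padding. This is not the NumberTools encoding (whose exact format is not reproduced here); pad and inRange are hypothetical helpers, and negative values are deliberately not handled.

```java
import java.util.Locale;

// Illustrates fixed-width padding for string range checks; not the NumberTools implementation.
class PaddedNumberSketch {
    // Zero-pads a non-negative long to 19 decimal digits (the width of Long.MAX_VALUE).
    static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%019d", value);
    }

    // Inclusive range check done purely on strings, the way a term range comparison works.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("150", "100", "200"));               // true: equal widths happen to work
        System.out.println(inRange("1500", "100", "200"));              // true: unpadded 1500 wrongly matches
        System.out.println(inRange(pad(1500), pad(100), pad(200)));     // false: padded 1500 is excluded
    }
}
```

For a fractional price, one option under the same assumption is to scale it to an integer first (for example, cents) and pad that value.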
   
Using a Query as a filter condition:

    QueryFilter filter = new QueryFilter(QueryParser.Parse("name2", FieldValue, analyzer));

We can also use FilteredQuery to filter on multiple conditions.

    Filter filter = new DateFilter(FieldDate, DateTime.Parse("2005-10-10"), DateTime.Parse("2005-10-15"));
    Filter filter2 = new RangeFilter(FieldNumber, NumberTools.LongToString(11L), NumberTools.LongToString(13L), true, true);

    Query query = QueryParser.Parse("name*", FieldName, analyzer);
    query = new FilteredQuery(query, filter);
    query = new FilteredQuery(query, filter2);

    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.Search(query);
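Wrapping a query in several FilteredQuery layers amounts to intersecting the query's matches with each filter's set of permitted documents. The plain-Java sketch below models that with java.util.BitSet, assuming each filter can be represented as a bit set of allowed document numbers; apply is a hypothetical helper, not a Lucene method.

```java
import java.util.BitSet;

// Models chained filters as bit-set intersection; the BitSets stand in for filter results.
class FilterChainSketch {
    // Keeps only the documents that match the query AND pass every filter.
    static BitSet apply(BitSet queryMatches, BitSet... filters) {
        BitSet result = (BitSet) queryMatches.clone(); // leave the caller's set untouched
        for (BitSet filter : filters) {
            result.and(filter);
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet query = new BitSet();
        query.set(0, 4);            // documents 0..3 match the query
        BitSet dateFilter = new BitSet();
        dateFilter.set(1, 4);       // documents 1..3 are inside the date range
        BitSet numberFilter = new BitSet();
        numberFilter.set(2, 5);     // documents 2..4 are inside the number range
        System.out.println(apply(query, dateFilter, numberFilter)); // {2, 3}
    }
}
```

Because intersection is commutative, the order in which the filters are stacked does not change which documents survive.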
   
6. Distributed search

We can use MultiReader or MultiSearcher to search multiple indexes.

    MultiReader reader = new MultiReader(new IndexReader[] { IndexReader.Open(@"c:\index"), IndexReader.Open(@"\\server\index") });
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.Search(query);

or

    IndexSearcher searcher1 = new IndexSearcher(reader1);
    IndexSearcher searcher2 = new IndexSearcher(reader2);
    MultiSearcher searcher = new MultiSearcher(new Searchable[] { searcher1, searcher2 });
    Hits hits = searcher.Search(query);

ParallelMultiSearcher can also be used for multi-threaded parallel search.
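Conceptually, searching several indexes means collecting hits from each one and merging them into a single ranked list. The plain-Java sketch below shows one way that merge could look, assuming document numbers from the second index are shifted by the size of the first so they stay unique. Hit and merge are hypothetical names, and this is an illustration of the idea rather than MultiSearcher's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Merges hits from two indexes into one list ranked by score; a conceptual model only.
class MultiSearchSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // firstMaxDoc is the number of documents in the first index, used as the offset
    // for document numbers coming from the second index.
    static List<Hit> merge(List<Hit> first, int firstMaxDoc, List<Hit> second) {
        List<Hit> all = new ArrayList<>(first);
        for (Hit h : second) {
            all.add(new Hit(h.doc + firstMaxDoc, h.score));
        }
        all.sort((a, b) -> Float.compare(b.score, a.score)); // highest score first
        return all;
    }

    public static void main(String[] args) {
        List<Hit> fromIndex1 = Arrays.asList(new Hit(0, 0.9f), new Hit(3, 0.2f));
        List<Hit> fromIndex2 = Arrays.asList(new Hit(1, 0.5f));
        for (Hit h : merge(fromIndex1, 5, fromIndex2)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```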

7. Merging indexes

Merge directory1 into directory2.

    Directory directory1 = FSDirectory.GetDirectory("index1", false);
    Directory directory2 = FSDirectory.GetDirectory("index2", false);

    IndexWriter writer = new IndexWriter(directory2, analyzer, false);
    writer.AddIndexes(new Directory[] { directory1 });
    Console.WriteLine(writer.DocCount());
    writer.Close();
   
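8. Displaying the search syntax string

Having combined many kinds of search conditions, we may want to see what the equivalent search syntax string looks like.

    BooleanQuery query = new BooleanQuery();
    query.Add(query1, true, false); // required = true, prohibited = false
    query.Add(query2, true, false);
    //...
    Console.WriteLine("Syntax: {0}", query.ToString());

Output:
Syntax: +(name:name* value:name*) +number:[0000000000000000b TO 0000000000000000d]

Heh, it's that simple.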

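9. Index operations

Deletion (a soft delete: only a deletion mark is added. The documents are actually removed after IndexWriter.Optimize() is called.)

    IndexReader reader = IndexReader.Open(directory);

    // Delete the Document with the given number (DocId).
    reader.Delete(123);

    // Delete Documents containing the given Term.
    reader.Delete(new Term(FieldValue, "Hello"));

    // Undo the soft deletes.
    reader.UndeleteAll();

    reader.Close();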
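The soft-delete behavior described above can be modeled in a few lines of plain Java: deleting only sets a mark, the marks can be cleared again, and an optimize step physically drops the marked documents. SoftDeleteSketch and its methods are hypothetical names for illustration; this is a conceptual model, not how Lucene stores its index.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Conceptual model of soft deletion: marks first, physical removal only on optimize().
class SoftDeleteSketch {
    private List<String> docs = new ArrayList<>();
    private BitSet deleted = new BitSet();

    void add(String doc) { docs.add(doc); }

    // Marks a document as deleted; the document itself stays in place.
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    // Clears all deletion marks; only works for documents not yet physically removed.
    void undeleteAll() { deleted.clear(); }

    int numDocs() { return docs.size() - deleted.cardinality(); } // live documents
    int maxDoc() { return docs.size(); }                          // includes marked documents

    // Physically removes the marked documents; after this they cannot be restored.
    void optimize() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) live.add(docs.get(i));
        }
        docs = live;
        deleted = new BitSet();
    }

    public static void main(String[] args) {
        SoftDeleteSketch index = new SoftDeleteSketch();
        index.add("a"); index.add("b"); index.add("c");
        index.delete(1);
        System.out.println(index.numDocs() + " live of " + index.maxDoc()); // 2 live of 3
        index.optimize();
        System.out.println(index.numDocs() + " live of " + index.maxDoc()); // 2 live of 2
    }
}
```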
   
Incremental update (just set the create parameter to false to add new data to an existing index.)

    Directory directory = FSDirectory.GetDirectory("index", false);
    IndexWriter writer = new IndexWriter(directory, analyzer, false);
    writer.AddDocument(doc1);
    writer.AddDocument(doc2);
    writer.Optimize();
    writer.Close();
   
10. Optimization

When bulk-adding documents to an FSDirectory index, increasing the merge factor (mergeFactor) and the minimum merge document count (minMergeDocs) helps improve performance and reduce indexing time.

    IndexWriter writer = new IndexWriter(directory, analyzer, true);

    writer.maxFieldLength = 1000; // maximum number of terms indexed per field
    writer.mergeFactor = 1000;
    writer.minMergeDocs = 1000;

    for (int i = 0; i < 10000; i++)
    {
        // Add documents...
    }

    writer.Optimize();
    writer.Close();
   
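Parameter notes

Reproduced from "深入 Lucene 索引机制" (A Closer Look at Lucene's Indexing Mechanism)

With Lucene, you can make full use of the machine's hardware resources during indexing to improve indexing efficiency. When you need to index a large number of files, you will notice that the bottleneck is writing the index files to disk. To address this, Lucene holds a buffer in memory. But how do we control that buffer? Fortunately, Lucene's IndexWriter class provides three parameters for adjusting the buffer size and how often index files are written to disk.

1. Merge factor (mergeFactor)

This parameter determines how many documents can be stored in one Lucene index segment and how often the segments on disk are merged into a larger segment. For example, if the merge factor is 10, then once the number of documents in memory reaches 10, all of them must be written to a new segment on disk. And once the number of segments on disk reaches 10, those 10 segments are merged into one new segment. The default value is 10, which is poorly suited to indexing a very large number of documents. For batch indexing, a larger value gives better indexing performance.

2. Minimum merge documents (minMergeDocs)

This parameter also affects indexing performance. It determines how many documents must accumulate in memory before they are written to disk. The default value is 10; if you have enough memory, setting it as large as practical will significantly improve indexing performance.

3. Maximum merge documents (maxMergeDocs)

This parameter determines the maximum number of documents in one index segment. Its default value is Integer.MAX_VALUE. Setting it to a large value improves indexing efficiency and search speed; since the default is already the maximum int value, we generally do not need to change it.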
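The interplay of these parameters can be sketched with a small plain-Java simulation. It is a simplified model under stated assumptions, not Lucene's actual merge logic: documents are buffered until minMergeDocs is reached and then flushed as one segment (as described under item 2), and whenever the last mergeFactor segments have the same size they are merged into one. maxMergeDocs is left out, and MergeSketch, simulate and mergeTail are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified simulation of buffering and segment merging; not Lucene's real merge policy.
class MergeSketch {
    // Returns the sizes (in documents) of the segments on "disk" after adding totalDocs documents.
    static List<Integer> simulate(int totalDocs, int minMergeDocs, int mergeFactor) {
        if (minMergeDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need minMergeDocs >= 1 and mergeFactor >= 2");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < totalDocs; i++) {
            buffered++;
            if (buffered == minMergeDocs) {      // buffer is full: flush it as a new segment
                segments.add(buffered);
                buffered = 0;
                mergeTail(segments, mergeFactor);
            }
        }
        if (buffered > 0) segments.add(buffered); // flush the remainder, as on close
        return segments;
    }

    // While the last mergeFactor segments have equal size, merge them into one larger segment.
    static void mergeTail(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i) != size) return; // sizes differ: nothing to merge yet
            }
            for (int i = 0; i < mergeFactor; i++) segments.remove(segments.size() - 1);
            segments.add(size * mergeFactor);
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate(250, 10, 10));     // [100, 100, 10, 10, 10, 10, 10]
        System.out.println(simulate(250, 1000, 1000)); // [250]
    }
}
```

With the defaults, 250 documents cause 25 small flushes and two merges; with both values at 1000, the same documents are written once. In this model, that difference is why larger values help batch indexing.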