Extending search in Lucene 3.6.0

zhongweijian
Categories: java, open-source frameworks. Tags: lucene, extension, filter, sorting, search, class
Article link: https://blog.csdn.net/zhongweijian/article/details/7622707
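Custom sorting

IndexSearcher.java: sorting is applied at search time, so a custom sort can, for example, dynamically compute which of the stored restaurants are nearest to or farthest from a given location.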
  /** Expert: Low-level search implementation with arbitrary sorting.  Finds
   * the top <code>n</code> hits for <code>query</code>, applying
   * <code>filter</code> if non-null, and sorting the hits by the criteria in
   * <code>sort</code>.
   *
   * <p>Applications should usually call {@link
   * Searcher#search(Query,Filter,int,Sort)} instead.
   * 
   * @throws BooleanQuery.TooManyClauses
   */
  @Override
  public TopFieldDocs search(Weight weight, Filter filter,
      final int nDocs, Sort sort) throws IOException {
    return search(weight, filter, nDocs, sort, true);
  }



SortField.java
  /** Creates a sort with a custom comparison function.
   * @param field Name of field to sort by; cannot be <code>null</code>.
   * @param comparator Returns a comparator for sorting hits.
   */
  public SortField(String field, FieldComparatorSource comparator) {
    initFieldType(field, CUSTOM);
    this.comparatorSource = comparator;
  }

FieldComparatorSource.java
/**
 * Provides a {@link FieldComparator} for custom field sorting.
 *
 * @lucene.experimental
 *
 */
public abstract class FieldComparatorSource implements Serializable {

  /**
   * Creates a comparator for the field in the given index.
   * 
   * @param fieldname
   *          Name of the field to create comparator for.
   * @return FieldComparator.
   * @throws IOException
   *           If an error occurs reading the index.
   */
  public abstract FieldComparator<?> newComparator(String fieldname, int numHits, int sortPos, boolean reversed)
      throws IOException;
}
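To make the restaurant example concrete, here is a minimal sketch of the computation a custom comparator would perform: derive a distance for each hit from its stored coordinates and order the hits by that value. It is a plain-Java model that runs without Lucene on the classpath. The `Restaurant` class, the integer grid coordinates, and the method names are all assumptions made for illustration, not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of distance-based sorting; no Lucene classes are used.
class DistanceSortSketch {

    // Hypothetical stored document: a restaurant name plus grid coordinates.
    static final class Restaurant {
        final String name;
        final int x;
        final int y;

        Restaurant(String name, int x, int y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }
    }

    // Euclidean distance from the query location to the restaurant.
    static double distance(Restaurant r, int originX, int originY) {
        int dx = r.x - originX;
        int dy = r.y - originY;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Names ordered nearest first, or farthest first when reversed is true.
    static List<String> sortByDistance(List<Restaurant> hits, int originX, int originY,
                                       boolean reversed) {
        Comparator<Restaurant> byDistance =
                Comparator.comparingDouble(r -> distance(r, originX, originY));
        if (reversed) {
            byDistance = byDistance.reversed();
        }
        List<Restaurant> sorted = new ArrayList<>(hits);
        sorted.sort(byDistance);
        List<String> names = new ArrayList<>();
        for (Restaurant r : sorted) {
            names.add(r.name);
        }
        return names;
    }

    // Made-up sample data used by the demo below.
    static List<Restaurant> sampleHits() {
        return Arrays.asList(
                new Restaurant("A", 1, 2),
                new Restaurant("B", 5, 9),
                new Restaurant("C", 9, 6),
                new Restaurant("D", 3, 8));
    }

    public static void main(String[] args) {
        System.out.println(sortByDistance(sampleHits(), 0, 0, false)); // [A, D, B, C]
        System.out.println(sortByDistance(sampleHits(), 0, 0, true));  // [C, B, D, A]
    }
}
```

In Lucene itself the same distance calculation would sit inside a `FieldComparator` returned by `newComparator`, and the `reversed` flag corresponds to the parameter of the same name in that method.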


Further computation or processing of query results
Collector.java
* <p><b>NOTE:</b> The doc that is passed to the collect
 * method is relative to the current reader. If your
 * collector needs to resolve this to the docID space of the
 * Multi*Reader, you must re-base it by recording the
 * docBase from the most recent setNextReader call.  Here's
 * a simple example showing how to collect docIDs into a
 * BitSet:</p>
 * 
 * <pre>
 * Searcher searcher = new IndexSearcher(indexReader);
 * final BitSet bits = new BitSet(indexReader.maxDoc());
 * searcher.search(query, new Collector() {
 *   private int docBase;
 * 
 *   <em>// ignore scorer</em>
 *   public void setScorer(Scorer scorer) {
 *   }
 *
 *   <em>// accept docs out of order (for a BitSet it doesn't matter)</em>
 *   public boolean acceptsDocsOutOfOrder() {
 *     return true;
 *   }
 * 
 *   public void collect(int doc) {
 *     bits.set(doc + docBase);
 *   }
 * 
 *   public void setNextReader(IndexReader reader, int docBase) {
 *     this.docBase = docBase;
 *   }
 * });
 * </pre>
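The re-basing described in the note above comes down to simple arithmetic, sketched here in plain Java without Lucene classes. The segment sizes and per-segment hit arrays are assumed inputs standing in for what `setNextReader` and `collect` would deliver; the method name `collect` is reused only to mirror the Javadoc example.

```java
import java.util.BitSet;

// Plain-Java model of docBase re-basing; no Lucene classes are used.
class DocBaseSketch {

    // segmentSizes[i] is the number of docs in segment i;
    // segmentHits[i] holds the segment-relative doc IDs that matched in segment i.
    static BitSet collect(int[] segmentSizes, int[][] segmentHits) {
        int totalDocs = 0;
        for (int size : segmentSizes) {
            totalDocs += size;
        }
        BitSet bits = new BitSet(totalDocs);
        int docBase = 0; // offset of the current segment in the top-level doc ID space
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            // Moving to a new segment plays the role of setNextReader(reader, docBase).
            for (int doc : segmentHits[seg]) {
                bits.set(doc + docBase); // plays the role of collect(doc)
            }
            docBase += segmentSizes[seg];
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 4, 2};
        int[][] hits = {{0, 2}, {1}, {0, 1}};
        System.out.println(collect(sizes, hits)); // {0, 2, 4, 7, 8}
    }
}
```

Without adding `docBase`, doc 0 of the first segment and doc 0 of the third segment would set the same bit, which is exactly the mistake the note warns about.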

Extending QueryParser
1. Disabling fuzzy and wildcard queries
    /**
   * Builds a new FuzzyQuery instance
   * @param term Term
   * @param minimumSimilarity minimum similarity
   * @param prefixLength prefix length
   * @return new FuzzyQuery Instance
   */
  protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    // FuzzyQuery doesn't yet allow constant score rewrite
    return new FuzzyQuery(term,minimumSimilarity,prefixLength);  // to disable fuzzy queries, replace this line with one that throws an exception
  }
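The override pattern can be sketched as follows. This is a plain-Java model, not Lucene code: `SimpleParser`, `StrictParser`, and the string return values are hypothetical stand-ins for `QueryParser`, a subclass of it, and `Query` objects. Since `newFuzzyQuery` as quoted above declares no checked exception, the sketch throws an unchecked one.

```java
// Plain-Java model of the override pattern; these are stand-ins, not Lucene classes.
class RestrictedParserSketch {

    // Toy parser: a trailing '~' means fuzzy, a '*' or '?' means wildcard.
    static class SimpleParser {
        String parse(String text) {
            if (text.endsWith("~")) {
                return newFuzzyQuery(text.substring(0, text.length() - 1));
            }
            if (text.indexOf('*') >= 0 || text.indexOf('?') >= 0) {
                return newWildcardQuery(text);
            }
            return "term:" + text;
        }

        // Factory hooks that a subclass may override.
        protected String newFuzzyQuery(String term) {
            return "fuzzy:" + term;
        }

        protected String newWildcardQuery(String pattern) {
            return "wildcard:" + pattern;
        }
    }

    // Subclass that rejects the expensive query types instead of building them.
    static class StrictParser extends SimpleParser {
        @Override
        protected String newFuzzyQuery(String term) {
            throw new UnsupportedOperationException("Fuzzy queries not allowed");
        }

        @Override
        protected String newWildcardQuery(String pattern) {
            throw new UnsupportedOperationException("Wildcard queries not allowed");
        }
    }

    // Returns the parsed form, or the rejection message if the parser refused the input.
    static String tryParse(SimpleParser parser, String text) {
        try {
            return parser.parse(text);
        } catch (UnsupportedOperationException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        SimpleParser strict = new StrictParser();
        System.out.println(tryParse(strict, "lucene")); // term:lucene
        System.out.println(tryParse(strict, "lucen~")); // rejected: Fuzzy queries not allowed
        System.out.println(tryParse(strict, "luc*"));   // rejected: Wildcard queries not allowed
    }
}
```

Plain term queries still pass through unchanged, so only the costly query types are blocked for end users.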

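Custom filters. Some properties of a search result change often, so a document may need to be filtered out during one period and not during another. If such a property is added to the index as a field, the indexed value goes stale and the results can become inaccurate. A better approach is to define a filter that applies specific rules at search time. Take best-selling books: a book may be a best seller for a while and then stop being one, so indexing an "is best seller" field is a poor fit. Instead, a custom Filter can compute for each doc whether it should be filtered out.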
  

/** 
 *  Abstract base class for restricting which documents may
 *  be returned during searching.
 */
public abstract class Filter implements java.io.Serializable {
  
  /**
   * Creates a {@link DocIdSet} enumerating the documents that should be
   * permitted in search results. <b>NOTE:</b> null can be
   * returned if no documents are accepted by this Filter.
   * <p>
   * Note: This method will be called once per segment in
   * the index during searching.  The returned {@link DocIdSet}
   * must refer to document IDs for that segment, not for
   * the top-level reader.
   * 
   * @param reader a {@link IndexReader} instance opened on the index currently
   *         searched on. Note, it is likely that the provided reader does not
   *         represent the whole underlying index i.e. if the index has more than
   *         one segment the given reader only represents a single segment.
   *          
   * @return a DocIdSet that provides the documents which should be permitted or
   *         prohibited in search results. <b>NOTE:</b> null can be returned if
   *         no documents will be accepted by this Filter.
   * 
   * @see DocIdBitSet
   */
  public abstract DocIdSet getDocIdSet(IndexReader reader) throws IOException;
}

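A DocIdSet can be thought of as a set of bits in which each bit position corresponds to a doc ID: if a bit is set to 1, that document may appear in the search results; otherwise it will not.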
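The best-seller case can be sketched as below: for one segment, build a bit set with one bit per doc and set the bit only when the doc's ISBN is in the current best-seller list. This is a plain-Java model without Lucene classes. The `segmentIsbns` array stands in for values read from the segment's reader, and `hotIsbns` stands in for an external, frequently changing source; both names are assumptions.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Plain-Java model of a per-segment filter; no Lucene classes are used.
class HotBookFilterSketch {

    // segmentIsbns[doc] is the ISBN stored for the segment-relative doc ID doc.
    // Returns a bit set in which bit doc is 1 when that doc is permitted.
    static BitSet permittedDocs(String[] segmentIsbns, Set<String> hotIsbns) {
        BitSet bits = new BitSet(segmentIsbns.length);
        for (int doc = 0; doc < segmentIsbns.length; doc++) {
            if (hotIsbns.contains(segmentIsbns[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        String[] segmentIsbns = {"111", "222", "333", "444"};

        // This week's best sellers, looked up outside the index.
        Set<String> thisWeek = new HashSet<>(Arrays.asList("222", "444"));
        System.out.println(permittedDocs(segmentIsbns, thisWeek)); // {1, 3}

        // The list changes; the index does not need to be rebuilt.
        Set<String> nextWeek = new HashSet<>(Arrays.asList("111"));
        System.out.println(permittedDocs(segmentIsbns, nextWeek)); // {0}
    }
}
```

Because the best-seller list lives outside the index, the same indexed documents yield different filtered results as the list changes, which is the point of computing the filter at search time.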

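FilteredQuery.java: a query with a filter applied. The query and the filter are combined into the final query expression that is actually executed.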

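Performance issues:
1. Internally, Lucene can rewrite a range query into a BooleanQuery, that is, an OR expression with one clause per matching term.

If the range expands to more than 1024 terms (the default maximum clause count), a TooManyClauses exception is thrown.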

  /** Thrown when an attempt is made to add more than {@link
   * #getMaxClauseCount()} clauses. This typically happens if
   * a PrefixQuery, FuzzyQuery, WildcardQuery, or TermRangeQuery 
   * is expanded to many terms during search. 
   */
  public static class TooManyClauses extends RuntimeException {
    public TooManyClauses() {
      super("maxClauseCount is set to " + maxClauseCount);
    }
  }
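A small model shows why a wide range hits the limit: each indexed term inside the range becomes one OR clause, and the expansion fails once the clause count would exceed the maximum. The sketch is plain Java without Lucene classes. The zero-padded sample terms and the `expand` method are assumptions, and the nested `TooManyClauses` only imitates the class quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Plain-Java model of range expansion into OR clauses; no Lucene classes are used.
class RangeExpansionSketch {

    static final int DEFAULT_MAX_CLAUSE_COUNT = 1024;

    // Imitation of BooleanQuery.TooManyClauses for this sketch only.
    static class TooManyClauses extends RuntimeException {
        TooManyClauses(int maxClauseCount) {
            super("maxClauseCount is set to " + maxClauseCount);
        }
    }

    // Expands the inclusive range [lower, upper] into one clause per indexed term.
    static List<String> expand(TreeSet<String> indexedTerms, String lower, String upper,
                               int maxClauseCount) {
        List<String> clauses = new ArrayList<>();
        for (String term : indexedTerms.subSet(lower, true, upper, true)) {
            if (clauses.size() >= maxClauseCount) {
                throw new TooManyClauses(maxClauseCount);
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Hypothetical index: zero-padded numeric terms "0000" through "1999".
    static TreeSet<String> sampleTerms() {
        TreeSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add(String.format(Locale.ROOT, "%04d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = sampleTerms();
        // Exactly 1024 terms fit within the limit.
        System.out.println(expand(terms, "0000", "1023", DEFAULT_MAX_CLAUSE_COUNT).size());
        try {
            // One more term pushes the expansion over the limit.
            expand(terms, "0000", "1024", DEFAULT_MAX_CLAUSE_COUNT);
        } catch (TooManyClauses e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In Lucene the limit can be raised with the static `BooleanQuery.setMaxClauseCount(int)`, at the cost of more memory and slower queries; narrowing the range or filtering instead of querying are the usual alternatives.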