Lucene Notes

kevinminow

Article link: https://blog.csdn.net/kevinminow/article/details/7469414

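MoreLikeThis example from Lucene in Action

The second edition of Lucene in Action introduces MoreLikeThis with an example that searches for similar books. The code is reproduced below for study and reference: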

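import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis; // contrib "queries" jar in Lucene 3.x
import org.apache.lucene.store.FSDirectory;

// Written against the Lucene 3.x API (IndexReader.open, IndexSearcher.close).
public class BooksMoreLikeThis {
    public static void main(String[] args) throws Throwable {
        String indexDir = System.getProperty("index.dir");
        FSDirectory directory = FSDirectory.open(new File(indexDir));
        IndexReader reader = IndexReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        int numDocs = reader.maxDoc();
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] {"title", "author"});
        mlt.setMinTermFreq(1);  // default is 2; lower it yourself, or the query may match nothing
        mlt.setMinDocFreq(1);   // default is 5; lower it yourself, or the query may match nothing
        for (int docID = 0; docID < numDocs; docID++) {
            System.out.println();
            Document doc = reader.document(docID);
            System.out.println(doc.get("title"));
            // Build a query from the most significant terms of this document
            Query query = mlt.like(docID);
            System.out.println(" query=" + query);
            TopDocs similarDocs = searcher.search(query, 10);
            if (similarDocs.totalHits == 0)
                System.out.println(" None like this");
            for (int i = 0; i < similarDocs.scoreDocs.length; i++) {
                // Skip the source document itself
                if (similarDocs.scoreDocs[i].doc != docID) {
                    doc = reader.document(similarDocs.scoreDocs[i].doc);
                    System.out.println(" -> " + doc.getField("title").stringValue());
                }
            }
        }
        searcher.close();
        reader.close();
        directory.close();
    }
}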

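How Lucene's similarity-search component MoreLikeThis works: principles and code analysis

MoreLikeThis is a contrib module of Lucene that usefully extends its query features. It provides a set of interfaces for similarity search, which makes it easy to implement a "more like this" search of our own.
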
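What is similarity search:

As I understand it, similarity search means finding other results related to a given search result. It gives users a path that differs from standard search (query -> results): starting from a result that matches their intent well, they search for new results (result -> results).
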
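MoreLikeThis design analysis:

First, to interoperate well with Lucene and to extend it, MoreLikeThis exposes a method that returns a Query object, that is, Lucene's own query type. Running that query through Lucene yields the similar results, so MoreLikeThis plugs into Lucene seamlessly; Solr provides a good example of this. The method MoreLikeThis provides is as follows:
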
/**
 * Return a query that will return docs like the passed lucene document ID.
 *
 * @param docNum the documentID of the lucene doc to generate the 'More Like This' query for.
 * @return a query that will return docs like the passed lucene document ID.
 */
public Query like(int docNum) throws IOException {
    if (fieldNames == null) {
        // gather list of valid fields from lucene
        Collection<String> fields = ir.getFieldNames(IndexReader.FieldOption.INDEXED);
        fieldNames = fields.toArray(new String[fields.size()]);
    }

    return createQuery(retrieveTerms(docNum));
}

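The parameter docNum is the ID of the search result for which you want to find similar results. fieldNames can be understood as the fields we have selected: the values of that result in those fields are extracted and used to analyze similarity. As the code shows, these fields are optional; if none are set, all indexed fields are used.

Next, let's look at the workflow, that is, how this similarity query (the returned Query) is produced. I drew a flowchart to explain it briefly:

[Figure: flowchart of the MoreLikeThis workflow; image not preserved]

The flowchart makes the overall process clear. Next, let's see how the MoreLikeThis source code implements it, along with some details.
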
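MoreLikeThis source code analysis:

The code implements the flow shown above mainly through four methods:
  1. PriorityQueue<Object[]> retrieveTerms(int docNum): extracts the values that the document docNum holds in the specified fields fieldNames.
  2. void addTermFrequencies(Map<String,Int> termFreqMap, TermFreqVector vector): called from method 1; it fills the Map<String,Int> structure mentioned in the flowchart, that is, each term and its frequency.
  3. PriorityQueue<Object[]> createQueue(Map<String,Int> words): also called from method 1; it takes the data out of the Map, performs some similarity calculations, and produces a PriorityQueue for the next step.
  4. Query createQuery(PriorityQueue<Object[]> q): builds the final Query, the last step in the flowchart.
Let's look at the source of each in turn:
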
01 /**
02  * Find words for a more-like-this query former.
03  *
04  * @param docNum the id of the lucene document from which to find terms
05  */
06 public PriorityQueue<Object[]> retrieveTerms(int docNum) throws IOException {
07     Map<String,Int> termFreqMap = new HashMap<String,Int>();
08     for (int i = 0; i < fieldNames.length; i++) {
09         String fieldName = fieldNames[i];
10         TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName);
11
12         // field does not store term vector info
13         if (vector == null) {
14             Document d = ir.document(docNum);
15             String text[] = d.getValues(fieldName);
16             if (text != null)
17             {
18                 for (int j = 0; j < text.length; j++) {
19                     addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
20                 }
21             }
22         }
23         else {
24             addTermFrequencies(termFreqMap, vector);
25         }
26
27     }
28
29     return createQueue(termFreqMap);
30 }

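Line 10 calls getTermFreqVector(docNum, fieldName), which returns a TermFreqVector object holding a string array and an int array: the terms that occur in the field fieldName and the number of times each term occurs. If the field stores no term vector (vector == null), lines 13 to 22 fall back to re-analyzing the stored text of the field.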
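
To make the "string array plus int array" shape concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name TermVectorSketch and its naive lowercase-and-split tokenizer are my own assumptions for illustration; a real index uses the Analyzer configured for the field, and this is not how Lucene stores term vectors internally.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a per-field term vector: parallel arrays of terms and counts. */
class TermVectorSketch {
    final String[] terms; // distinct terms of one field
    final int[] freqs;    // freqs[i] is how often terms[i] occurs in that field

    private TermVectorSketch(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
    }

    /** Builds the parallel arrays from raw field text with a naive tokenizer (an assumption). */
    static TermVectorSketch fromText(String text) {
        // TreeMap keeps the terms sorted so the output is deterministic
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        String[] terms = counts.keySet().toArray(new String[0]);
        int[] freqs = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            freqs[i] = counts.get(terms[i]);
        }
        return new TermVectorSketch(terms, freqs);
    }

    public static void main(String[] args) {
        TermVectorSketch v = fromText("Lucene in Action: Lucene search in practice");
        System.out.println(Arrays.toString(v.terms)); // [action, in, lucene, practice, search]
        System.out.println(Arrays.toString(v.freqs)); // [1, 2, 2, 1, 1]
    }
}
```

The two arrays are read by index in lockstep, which is exactly how the next method consumes them.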
  
      
      
       
       
        
        
         
01 /**
02  * Adds terms and frequencies found in vector into the Map termFreqMap
03  * @param termFreqMap a Map of terms and their frequencies
04  * @param vector List of terms and their frequencies for a doc/field
05  */
06 private void addTermFrequencies(Map<String,Int> termFreqMap, TermFreqVector vector)
07 {
08     String[] terms = vector.getTerms();
09     int freqs[] = vector.getTermFrequencies();
10     for (int j = 0; j < terms.length; j++) {
11         String term = terms[j];
12
13         if (isNoiseWord(term)) {
14             continue;
15         }
16         // increment frequency
17         Int cnt = termFreqMap.get(term);
18         if (cnt == null) {
19             cnt = new Int();
20             termFreqMap.put(term, cnt);
21             cnt.x = freqs[j];
22         }
23         else {
24             cnt.x += freqs[j];
25         }
26     }
27 }

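Lines 8 and 9 take the term array and the frequency array (terms, freqs) from the TermFreqVector obtained in the previous step; the two arrays correspond one to one. Lines 10 to 25 then check each term (noise words are skipped) and put it into the Map. The frequencies are accumulated, so a term that appears in several fields ends up with the sum of its counts.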
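
The accumulation step can be sketched in plain Java as follows. This is only an illustration under my own assumptions: the class name TermFrequencyMerge and the tiny NOISE word list are hypothetical, and Integer values with Map.merge replace the mutable Int holder used in the Lucene source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TermFrequencyMerge {
    /** Hypothetical noise-word list; MoreLikeThis applies its own isNoiseWord check. */
    static final Set<String> NOISE = new HashSet<>(Arrays.asList("in", "the", "of"));

    /** Adds one field's parallel (terms, freqs) arrays into the shared map, summing counts. */
    static void addTermFrequencies(Map<String, Integer> termFreqMap, String[] terms, int[] freqs) {
        for (int j = 0; j < terms.length; j++) {
            if (NOISE.contains(terms[j])) {
                continue; // skip noise words
            }
            // first occurrence stores freqs[j]; later occurrences add to it
            termFreqMap.merge(terms[j], freqs[j], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        // counts from a first field
        addTermFrequencies(termFreqMap, new String[] {"action", "in", "lucene"}, new int[] {1, 1, 1});
        // counts from a second field of the same document
        addTermFrequencies(termFreqMap, new String[] {"in", "lucene", "search"}, new int[] {3, 2, 1});
        System.out.println(new TreeMap<>(termFreqMap)); // {action=1, lucene=3, search=1}
    }
}
```

Note that after merging, the map no longer records which field a count came from; that is why the next method has to pick a field for each term again.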
  
      
      
       
       
        
        
         
01 /**
02  * Create a PriorityQueue from a word->tf map.
03  *
04  * @param words a map of words keyed on the word(String) with Int objects as the values.
05  */
06 private PriorityQueue<Object[]> createQueue(Map<String,Int> words) throws IOException {
07     // have collected all words in doc and their freqs
08     int numDocs = ir.numDocs();
09     FreqQ res = new FreqQ(words.size()); // will order words by score
10
11     Iterator<String> it = words.keySet().iterator();
12     while (it.hasNext()) { // for every word
13         String word = it.next();
14
15         int tf = words.get(word).x; // term freq in the source doc
16         if (minTermFreq > 0 && tf < minTermFreq) {
17             continue; // filter out words that don't occur enough times in the source
18         }
19
20         // go through all the fields and find the largest document frequency
21         String topField = fieldNames[0];
22         int docFreq = 0;
23         for (int i = 0; i < fieldNames.length; i++) {
24             int freq = ir.docFreq(new Term(fieldNames[i], word));
25             topField = (freq > docFreq) ? fieldNames[i] : topField;
26             docFreq = (freq > docFreq) ? freq : docFreq;
27         }
28
29         if (minDocFreq > 0 && docFreq < minDocFreq) {
30             continue; // filter out words that don't occur in enough docs
31         }
32
33         if (docFreq > maxDocFreq) {
34             continue; // filter out words that occur in too many docs
35         }
36
37         if (docFreq == 0) {
38             continue; // index update problem?
39         }
40
41         float idf = similarity.idf(docFreq, numDocs);
42         float score = tf * idf;
43
44         // only really need 1st 3 entries, other ones are for troubleshooting
45         res.insertWithOverflow(new Object[]{word, // the word
46             topField, // the top field
47             Float.valueOf(score), // overall score
48             Float.valueOf(idf), // idf
49             Integer.valueOf(docFreq), // freq in all docs
50             Integer.valueOf(tf)
51         });
52     }
53     return res;
54 }

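Line 9 creates a priority queue; from line 12 on, each term (word) is visited in turn. Along the way terms are filtered: lines 16 to 18 drop terms that occur too rarely in the source document, and lines 29 to 39 drop terms that occur in too few or too many documents.

Lines 21 to 27 find the field in which the term has the highest document frequency (docFreq) and use it as the field to search for that term. (As the earlier steps show, the frequency of a term may be the sum of its frequencies in several fields, but a TermQuery can target only one field, so the field with the highest value is chosen.)

Lines 41 and 42 compute a score, tf * idf, which later becomes the weight of the corresponding basic query object, TermQuery; when several Query objects are combined, it indicates which one matters more. The calculation borrows the idea behind the cosine formula, because Lucene's scoring is based on the vector space model, where similarity is measured by the cosine of the angle between two vectors. For details see these two blog posts, both very well written: http://blog.csdn.net/forfuture1978/article/details/5353126 and http://www.cnblogs.com/ansen/articles/1906353.html

Note: in Lucene, boosts can be applied to three kinds of elements to raise the ranking of the corresponding results: fields, documents, and queries.

Finally, the entries are placed in the queue, which is returned.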
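
The scoring step can be tried out without an index. The sketch below is my own simplification, not the Lucene code: TfIdfQueueSketch, ScoredTerm, and the nested map of per-field document frequencies are hypothetical stand-ins for IndexReader.docFreq, the maxDocFreq filter is left out, and java.util.PriorityQueue replaces Lucene's FreqQ. The idf method follows the formula used by DefaultSimilarity in Lucene 3.x, log(numDocs / (docFreq + 1)) + 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;

class TfIdfQueueSketch {
    /** One candidate term: the word, the field chosen for it, and its tf * idf score. */
    static final class ScoredTerm {
        final String word;
        final String topField;
        final float score;

        ScoredTerm(String word, String topField, float score) {
            this.word = word;
            this.topField = topField;
            this.score = score;
        }
    }

    /** idf as in Lucene 3.x DefaultSimilarity: log(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * @param termFreqs       word -> frequency in the source document
     * @param docFreqsByField field -> (word -> number of documents containing it); toy index statistics
     */
    static PriorityQueue<ScoredTerm> createQueue(Map<String, Integer> termFreqs,
            Map<String, Map<String, Integer>> docFreqsByField,
            int numDocs, int minTermFreq, int minDocFreq) {
        // highest score first
        PriorityQueue<ScoredTerm> queue =
                new PriorityQueue<ScoredTerm>((a, b) -> Float.compare(b.score, a.score));
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String word = e.getKey();
            int tf = e.getValue();
            if (tf < minTermFreq) {
                continue; // too rare in the source document
            }
            // pick the field with the largest document frequency for this word
            String topField = null;
            int docFreq = 0;
            for (Map.Entry<String, Map<String, Integer>> f : docFreqsByField.entrySet()) {
                int freq = f.getValue().getOrDefault(word, 0);
                if (freq > docFreq) {
                    topField = f.getKey();
                    docFreq = freq;
                }
            }
            if (docFreq == 0 || docFreq < minDocFreq) {
                continue; // not in the index, or in too few documents
            }
            queue.add(new ScoredTerm(word, topField, tf * idf(docFreq, numDocs)));
        }
        return queue;
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = new LinkedHashMap<>();
        termFreqs.put("lucene", 2);
        termFreqs.put("action", 1);
        termFreqs.put("search", 1);

        Map<String, Integer> title = new LinkedHashMap<>();
        title.put("lucene", 3);
        title.put("action", 1);
        Map<String, Integer> subject = new LinkedHashMap<>();
        subject.put("lucene", 5);
        subject.put("search", 9);
        Map<String, Map<String, Integer>> docFreqsByField = new LinkedHashMap<>();
        docFreqsByField.put("title", title);
        docFreqsByField.put("subject", subject);

        PriorityQueue<ScoredTerm> queue = createQueue(termFreqs, docFreqsByField, 10, 1, 1);
        while (!queue.isEmpty()) {
            ScoredTerm t = queue.poll();
            System.out.println(t.topField + ":" + t.word + " score=" + t.score);
        }
    }
}
```

In this toy data "lucene" is searched in the subject field because its document frequency there (5) is higher than in title (3), and the common word "search" ends up last because a high document frequency gives a low idf.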
 
  
      
      
       
       
        
        
         
01 /**
02  * Create the More like query from a PriorityQueue
03  */
04 private Query createQuery(PriorityQueue<Object[]> q) {
05     BooleanQuery query = new BooleanQuery();
06     Object cur;
07     int qterms = 0;
08     float bestScore = 0;
09
10     while (((cur = q.pop()) != null)) {
11         Object[] ar = (Object[]) cur;
12         TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
13
14         if (boost) {
15             if (qterms == 0) {
16                 bestScore = ((Float) ar[2]).floatValue();
17             }
18             float myScore = ((Float) ar[2]).floatValue();
19
20             tq.setBoost(boostFactor * myScore / bestScore);
21         }
22
23         try {
24             query.add(tq, BooleanClause.Occur.SHOULD);
25         }
26         catch (BooleanQuery.TooManyClauses ignore) {
27             break;
28         }
29
30         qterms++;
31         if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
32             break;
33         }
34     }
35
36     return query;
37 }

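Line 5 creates a compound query object, BooleanQuery, into which the basic TermQuery objects are added one by one.

From line 10 on, entries are taken from the queue one at a time and wrapped in TermQuery objects.

Lines 14 to 21 give each TermQuery its own boost, as mentioned earlier: the score of the term divided by the best score (that of the first entry popped), multiplied by boostFactor.

Finally the Query is returned.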
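
The boost normalization is easy to check by hand. The following is a minimal sketch under my own assumptions: BoostedQuerySketch and Clause are hypothetical names, the input is given as parallel arrays already sorted by descending score, and a plain list of clauses stands in for the BooleanQuery.

```java
import java.util.ArrayList;
import java.util.List;

class BoostedQuerySketch {
    /** Stand-in for a boosted TermQuery: field, term text, and boost. */
    static final class Clause {
        final String field;
        final String text;
        final float boost;

        Clause(String field, String text, float boost) {
            this.field = field;
            this.text = text;
            this.boost = boost;
        }

        @Override
        public String toString() {
            return field + ":" + text + "^" + boost;
        }
    }

    /** Inputs are parallel arrays sorted by descending score, mimicking the pop order of the queue. */
    static List<Clause> createClauses(String[] fields, String[] words, float[] scores,
            float boostFactor, int maxQueryTerms) {
        List<Clause> clauses = new ArrayList<>();
        float bestScore = 0f;
        for (int i = 0; i < words.length; i++) {
            if (i == 0) {
                bestScore = scores[i]; // the first entry carries the best score
            }
            // boosts are relative to the best term, so the top term gets boostFactor
            clauses.add(new Clause(fields[i], words[i], boostFactor * scores[i] / bestScore));
            if (maxQueryTerms > 0 && clauses.size() >= maxQueryTerms) {
                break; // cap the number of query terms
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<Clause> clauses = createClauses(
                new String[] {"subject", "title", "subject"},
                new String[] {"lucene", "action", "search"},
                new float[] {3.0f, 1.5f, 0.75f},
                1.0f, 0);
        System.out.println(clauses); // [subject:lucene^1.0, title:action^0.5, subject:search^0.25]
    }
}
```

Because every boost is divided by the best score, the absolute tf * idf values do not matter in the final query, only their ratios.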
 
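That concludes the analysis of the MoreLikeThis implementation. My impression is that MoreLikeThis is not used much in real-world search, but it offers one approach to finding similar results; with our own modifications and definitions, we may be able to use it to tune a search engine and make its results more satisfying.

Original blog post; when reposting, please credit http://my.oschina.net/BreathL/blog/41663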
