MoreLikeThis query in Lucene returns no results

    This post is written for Lucene beginners; veterans, please be kind.

    I've just started learning Lucene, so I went straight to the new 7.1.0 release. While still putting together my hello-world project, a MoreLikeThis query kept returning empty results: no matter what I tried, no data came back, and searching online turned up nothing but material for old versions. So I had no choice but to read the source code myself.

    If anyone has demos for the new version, please share them; figuring it all out alone is exhausting. Thanks in advance.

    Since I'm on the latest version and couldn't find a demo, all of my code is taken from the sample code in the API docs, such as this example from the MoreLikeThis Javadoc:

 IndexReader ir = ...
 IndexSearcher is = ...

 MoreLikeThis mlt = new MoreLikeThis(ir);
 Reader target = ... // orig source of doc you want to find similarities to
 Query query = mlt.like( target);
 
 Hits hits = is.search(query);
 // now the usual iteration thru 'hits' - the only thing to watch for is to make sure
 //you ignore the doc if it matches your 'target' document, as it should be similar to itself
    Copying that pattern (note that the Hits class in the sample no longer exists; in 7.x, IndexSearcher.search returns TopDocs), my test code looks like this:
    @Test
    public void testMoreLikeThis() throws IOException {
        final Path path = Paths.get(INDEX_DIR);
        Directory directory = FSDirectory.open(path);
        Analyzer analyzer = new StandardAnalyzer();

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        MoreLikeThis mlt = new MoreLikeThis(indexReader);
        mlt.setAnalyzer(analyzer);// this has to be set, otherwise an exception is thrown (see below)
        Query query = mlt.like("content",new StringReader("your doc content to search"));

        TopDocs topDocs = indexSearcher.search(query, 10);
        long count = topDocs.totalHits;
        System.out.println("total hits: " + count);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.print("相关度:"+scoreDoc.score);
            System.out.println(document.get("content"));
        }
    }
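
    For context, here is a minimal sketch of how an index like mine might be built. The field name "content" and the INDEX_DIR constant match the test above; the five sample strings are placeholders (not my actual data), and the usual org.apache.lucene.document and org.apache.lucene.index imports are assumed:

    @Test
    public void testBuildIndex() throws IOException {
        Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);// rebuild the index on every run
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            String[] texts = {"doc one text", "doc two text", "doc three text",
                    "doc four text", "doc five text"};// placeholder content
            for (String text : texts) {
                Document doc = new Document();
                // TextField is analyzed into terms, which MoreLikeThis needs to match against
                doc.add(new TextField("content", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }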

    Running it, the query returned 0 records, even though the content I searched for is definitely in the index. Time to look at the source:

  /**
   * Return a query that will return docs like the passed Readers.
   * This was added in order to treat multi-value fields.
   *
   * @return a query that will return docs like the passed Readers.
   */
  public Query like(String fieldName, Reader... readers) throws IOException {
    Map<String, Map<String, Int>> perFieldTermFrequencies = new HashMap<>();
    for (Reader r : readers) {
      addTermFrequencies(r, perFieldTermFrequencies, fieldName);// collects the term frequencies of the content we want to match
    }
    return createQuery(createQueue(perFieldTermFrequencies));
  }
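
    As the Javadoc comment notes, the varargs Reader parameter was added to handle multi-valued fields, so presumably each value of the field can be passed as its own Reader. A hypothetical call, reusing the mlt instance from my test:

 // hypothetical: one StringReader per value of a multi-valued "content" field
 Query query = mlt.like("content",
     new StringReader("first value of the field"),
     new StringReader("second value of the field"));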

    Now the source that adds those term frequencies:

  /**
   * Adds term frequencies found by tokenizing text from reader into the Map words
   *
   * @param r a source of text to be tokenized
   * @param perFieldTermFrequencies a Map of terms and their frequencies per field
   * @param fieldName Used by analyzer for any special per-field analysis
   */
  private void addTermFrequencies(Reader r, Map<String, Map<String, Int>> perFieldTermFrequencies, String fieldName)
      throws IOException {
    if (analyzer == null) {// this is the analyzer I mentioned above: if you don't set one you get this exception, and the Javadoc sample never sets it
      throw new UnsupportedOperationException("To use MoreLikeThis without " +
          "term vectors, you must provide an Analyzer");
    }
    Map<String, Int> termFreqMap = perFieldTermFrequencies.get(fieldName);
    if (termFreqMap == null) {
      termFreqMap = new HashMap<>();
      perFieldTermFrequencies.put(fieldName, termFreqMap);
    }
    try (TokenStream ts = analyzer.tokenStream(fieldName, r)) {
      int tokenCount = 0;
      // for every token
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        String word = termAtt.toString();
        tokenCount++;
        if (tokenCount > maxNumTokensParsed) {
          break;
        }
        if (isNoiseWord(word)) {
          continue;
        }

        // increment frequency
        Int cnt = termFreqMap.get(word);
        if (cnt == null) {
          termFreqMap.put(word, new Int());
        } else {
          cnt.x++;
        }
      }
      ts.end();
    }
  }
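
    Incidentally, the same tokenStream loop is handy for checking which terms your analyzer actually extracts from the query text before MoreLikeThis filters anything. A quick sketch (the field name "content" matches my test):

    Analyzer analyzer = new StandardAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("content", new StringReader("your doc content to search"))) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(termAtt.toString());// one analyzed term per line
        }
        ts.end();
    }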

This step just builds the term-frequency map. Straight on to the next step, which turns those frequencies into the terms to search for:

  /**
   * Create a PriorityQueue from a word->tf map.
   *
   * @param perFieldTermFrequencies a per field map of words keyed on the word(String) with Int objects as the values.
   */
  private PriorityQueue<ScoreTerm> createQueue(Map<String, Map<String, Int>> perFieldTermFrequencies) throws IOException {
    // have collected all words in doc and their freqs
    int numDocs = ir.numDocs();
    final int limit = Math.min(maxQueryTerms, this.getTermsCount(perFieldTermFrequencies));
    FreqQ queue = new FreqQ(limit); // will order words by score
    for (Map.Entry<String, Map<String, Int>> entry : perFieldTermFrequencies.entrySet()) {
      Map<String, Int> perWordTermFrequencies = entry.getValue();
      String fieldName = entry.getKey();

      for (Map.Entry<String, Int> tfEntry : perWordTermFrequencies.entrySet()) { // for every word
        String word = tfEntry.getKey();
        int tf = tfEntry.getValue().x; // term freq in the source doc, i.e. how many times this word occurs in that doc
        if (minTermFreq > 0 && tf < minTermFreq) {// ignore the word if it occurs fewer times than the configured minimum
          continue; // filter out words that don't occur enough times in the source
        }

        int docFreq = ir.docFreq(new Term(fieldName, word));// document frequency of the word, i.e. how many docs in the index contain it

        if (minDocFreq > 0 && docFreq < minDocFreq) {// ignore the word if it appears in fewer docs than the configured minimum
          continue; // filter out words that don't occur in enough docs
        }

        if (docFreq > maxDocFreq) {
          continue; // filter out words that occur in too many docs
        }

        if (docFreq == 0) {
          continue; // index update problem?
        }

        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;

        if (queue.size() < limit) {
          // there is still space in the queue
          queue.add(new ScoreTerm(word, fieldName, score, idf, docFreq, tf));
        } else {
          ScoreTerm term = queue.top();
          if (term.score < score) { // update the smallest in the queue in place and update the queue.
            term.update(word, fieldName, score, idf, docFreq, tf);
            queue.updateTop();
          }
        }
      }
    }
    return queue;
  }
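
    To see why nothing survives this loop in my case, take my toy index of 5 documents, all with different content, under the defaults of this version (minTermFreq = 2, minDocFreq = 5):

    // a typical term from my short query text, under the default thresholds:
    int tf = 1;      // occurs once in the source text -> tf < minTermFreq (2), filtered out
    int docFreq = 1; // appears in one of my 5 docs    -> docFreq < minDocFreq (5), filtered out
    // every candidate term is discarded, the queue stays empty,
    // the resulting query has no terms, and the search returns 0 hits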

    The comments I added above show where the problem lies: I never configured the minimum frequencies, and the defaults are far too strict for a hello-world project. I had only loaded 5 records, all with different content, so no term can pass the tf and docFreq filters. So I changed my code to lower the minimums:

mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

Running the test again, it finally returns results.
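
Putting it all together, the relevant MoreLikeThis setup in my test now reads:

    MoreLikeThis mlt = new MoreLikeThis(indexReader);
    mlt.setAnalyzer(analyzer);// required when passing raw text instead of term vectors
    mlt.setMinTermFreq(1);// default is 2: a term must occur at least twice in the source text
    mlt.setMinDocFreq(1);// default is 5: a term must appear in at least 5 indexed docs
    Query query = mlt.like("content", new StringReader("your doc content to search"));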

