How Lucene's MoreLikeThis Is Implemented

MoreLikeThis can be used to implement "similar articles" queries. Its implementation is dissected below.

MoreLikeThis lives in the queries directory of Lucene's contrib modules. The original rationale for the class, quoted from its documentation:

  Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.

  You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tfidf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.

  It all depends on what you're trying to do. If you're trying to eke out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.

  An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.

 

1) Call this method to generate the query

public Query like(Reader r, String fieldName) throws IOException {
  return createQuery(retrieveTerms(r, fieldName));
}
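
Before diving into the internals, here is a minimal usage sketch. The index path, field name ("content") and sample text are placeholders made up for illustration, and constructor/method signatures vary slightly between Lucene versions; this follows the 3.x/4.x-era API that the code quoted in this post is written against (older contrib releases put the class under org.apache.lucene.search.similar, and older analyzers need a Version argument).

import java.io.File;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class MoreLikeThisDemo {
  public static void main(String[] args) throws Exception {
    // Open an existing index (path is a placeholder).
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
    try {
      MoreLikeThis mlt = new MoreLikeThis(reader);
      mlt.setAnalyzer(new StandardAnalyzer());      // required when passing raw text
      mlt.setFieldNames(new String[] {"content"});  // fields to draw terms from

      // Build the "more like this" query from the source article's text.
      Query query = mlt.like(new StringReader("text of the source article"), "content");

      IndexSearcher searcher = new IndexSearcher(reader);
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        System.out.println(hit.doc + " -> " + hit.score);
      }
    } finally {
      reader.close();
    }
  }
}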

2) Collect the document's term frequencies and hand them to createQueue

public PriorityQueue<Object[]> retrieveTerms(Reader r, String fieldName) throws IOException {
  Map<String, Int> words = new HashMap<String, Int>();
  addTermFrequencies(r, words, fieldName);
  return createQueue(words);
}

3) Count how often each term occurs in the document, storing the counts in termFreqMap

private void addTermFrequencies(Reader r, Map<String, Int> termFreqMap, String fieldName)
throws IOException {
  if (analyzer == null) {
    throw new UnsupportedOperationException("To use MoreLikeThis without " +
      "term vectors, you must provide an Analyzer");
  }
  TokenStream ts = analyzer.tokenStream(fieldName, r);
  int tokenCount = 0;
  // for every token
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    String word = termAtt.toString();
    tokenCount++;
    if (tokenCount > maxNumTokensParsed) {
      break;
    }
    if (isNoiseWord(word)) {
      continue;
    }

    // increment frequency
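    // Int is MoreLikeThis's small mutable counter class; it is initialised to 1,
    // so the put() below already records the word's first occurrence.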
    Int cnt = termFreqMap.get(word);
    if (cnt == null) {
      termFreqMap.put(word, new Int());
    } else {
      cnt.x++;
    }
  }
  ts.end();
  ts.close();
}
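
The isNoiseWord(word) call above is where the word-length heuristic from the quoted rationale comes in: terms shorter than minWordLen, longer than maxWordLen, or contained in a user-supplied stop-word set are skipped (configured through setMinWordLen, setMaxWordLen and setStopWords). A simplified reconstruction, not the verbatim source:

private boolean isNoiseWord(String term) {
  int len = term.length();
  if (minWordLen > 0 && len < minWordLen) {
    return true;                                        // too short
  }
  if (maxWordLen > 0 && len > maxWordLen) {
    return true;                                        // too long
  }
  return stopWords != null && stopWords.contains(term); // explicit stop word
}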

4) Compute each term's TF, IDF and score, and rank the terms by score

private PriorityQueue<Object[]> createQueue(Map<String, Int> words) throws IOException {
  // have collected all words in doc and their freqs
  int numDocs = ir.numDocs();
  FreqQ res = new FreqQ(words.size()); // will order words by score

  for (String word : words.keySet()) { // for every word
    int tf = words.get(word).x; // term freq in the source doc
    if (minTermFreq > 0 && tf < minTermFreq) {
      continue; // filter out words that don't occur enough times in the source
    }

    // go through all the fields and find the largest document frequency
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
      int freq = ir.docFreq(new Term(fieldName, word));
      topField = (freq > docFreq) ? fieldName : topField;
      docFreq = (freq > docFreq) ? freq : docFreq;
    }

    if (minDocFreq > 0 && docFreq < minDocFreq) {
      continue; // filter out words that don't occur in enough docs
    }

    if (docFreq > maxDocFreq) {
      continue; // filter out words that occur in too many docs
    }

    if (docFreq == 0) {
      continue; // index update problem?
    }

    float idf = similarity.idf(docFreq, numDocs);
    float score = tf * idf;

    // only really need 1st 3 entries, other ones are for troubleshooting
    res.insertWithOverflow(new Object[]{word, // the word
      topField, // the top field
      score, // overall score
      idf, // idf
      docFreq, // freq in all docs
      tf
    });
  }
  return res;
}
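
The ranking criterion is the plain product tf * idf computed above. With the classic TF-IDF similarity the idf term works out to roughly 1 + ln(numDocs / (docFreq + 1)), so a word that is frequent in the source document but rare in the index floats to the top of the queue. A small illustration (the formula shown is the classic-similarity one; treat the exact constants as an approximation):

// Approximation of the classic similarity's idf, shown only to make the ranking
// concrete; the real code simply calls similarity.idf(docFreq, numDocs).
static float termScore(int tf, int docFreq, int numDocs) {
  float idf = (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  return tf * idf; // frequent in the source doc AND rare in the index => high score
}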

Tips: several key instance fields drive this method (a configuration sketch follows the list):

1. minTermFreq: if set (> 0), a term's frequency in the source document must be at least this value; otherwise the term is not used as a dimension of the document's term vector.

2. minDocFreq: if set (> 0), the term's document frequency in the index must be at least this value; otherwise the term is dropped.

3. maxDocFreq: the mirror image of minDocFreq: terms whose document frequency exceeds this value are dropped.

4. similarity: MoreLikeThis has a corresponding setter, so a different similarity can be plugged in to change how IDF is computed.
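
All of these knobs are exposed through setters. A hypothetical configuration, loosely following the frequency/length heuristics from the quoted rationale (reader and analyzer are assumed to exist as in the earlier sketch, and the numbers are illustrative placeholders, not recommendations):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setAnalyzer(analyzer);
mlt.setFieldNames(new String[] {"content"});
mlt.setMinTermFreq(2);      // term must occur at least twice in the source doc
mlt.setMinDocFreq(5);       // ...and in at least 5 indexed documents
mlt.setMaxDocFreq(100000);  // drop terms that occur in too many documents
mlt.setMinWordLen(3);       // skip very short tokens
mlt.setMaxQueryTerms(25);   // cap the number of clauses in the generated query
mlt.setBoost(true);         // weight each TermQuery by its relative tf*idf score
mlt.setBoostFactor(1.0f);
// mlt.setSimilarity(...);  // optionally plug in a different similarity for idf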

5) Build a BooleanQuery from the PriorityQueue

private Query createQuery(PriorityQueue<Object[]> q) {
  BooleanQuery query = new BooleanQuery();
  Object cur;
  int qterms = 0;
  float bestScore = 0;

  while ((cur = q.pop()) != null) {
    Object[] ar = (Object[]) cur;
    TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));

    if (boost) {
      if (qterms == 0) {
        bestScore = ((Float) ar[2]);
      }
      float myScore = ((Float) ar[2]);

      tq.setBoost(boostFactor * myScore / bestScore);
    }

    try {
      query.add(tq, BooleanClause.Occur.SHOULD);
    }
    catch (BooleanQuery.TooManyClauses ignore) {
      break;
    }

    qterms++;
    if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
      break;
    }
  }

  return query;
}

Tips: one more instance field deserves attention:

1. boost: if set to false, every TermQuery carries the same weight; if set to true, each term's weight is determined by its TF-IDF score (scaled relative to the best-scoring term).

 

Summary:

As the analysis above shows, Lucene ultimately reduces article-similarity computation to a single query, and Lucene executes that query using the vector space model.

I looked into MoreLikeThis because I am preparing a similar-article recommendation experiment, in order to evaluate whether Lucene can handle fast processing of massive data sets (on the order of tens of millions of documents).

Reposted from: https://www.cnblogs.com/zz-boy/archive/2012/08/16/2642416.html
