Learning the Lucene Source Code: Lucene's Classic Scoring Process

weixin_34061482

Article link: https://blog.csdn.net/weixin_34061482/article/details/85083761

Lucene's default scoring model is the VSM (Vector Space Model), with the following scoring formula:
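For reference, the practical scoring function documented for TFIDFSimilarity in Lucene 4.x can be written as below. The notation follows the Lucene Javadoc rather than any figure from this article, so treat it as a restatement under that assumption:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
```

In the example that follows no boosts are set, so every t.getBoost() is 1.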

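Many articles dissect this formula, but the real problem is that after reading a long analysis the details are still unclear. So let's start directly from an example: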

Build the following index:

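    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneDemo {
        Directory d;
        Analyzer analyzer;

        public LuceneDemo() throws IOException {
            d = new SimpleFSDirectory(new File("D:/lucene_test"));
            // WhitespaceAnalyzer only splits on whitespace, so term counts stay easy to verify by hand
            analyzer = new WhitespaceAnalyzer(Version.LUCENE_42);
        }

        // Index four documents that mix "common" and "term" in different proportions
        public void index() throws IOException {
            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_42, analyzer);
            IndexWriter iw = new IndexWriter(d, conf);
            Document doc = new Document();
            doc.add(new TextField("content", "common common common term", Store.YES));
            iw.addDocument(doc);
            doc = new Document();
            doc.add(new TextField("content", "common common term term", Store.YES));
            iw.addDocument(doc);
            doc = new Document();
            doc.add(new TextField("content", "common term term term", Store.YES));
            iw.addDocument(doc);
            doc = new Document();
            doc.add(new TextField("content", "term term term term", Store.YES));
            iw.addDocument(doc);
            iw.commit();
            iw.close();
        }

        // Run the two-term query and print each hit's score and content
        public void search() throws IOException, ParseException {
            IndexReader r = DirectoryReader.open(d);
            IndexSearcher is = new IndexSearcher(r);
            // TermQuery query = new TermQuery(new Term("content", "common"));
            Query query = new QueryParser(Version.LUCENE_42, "content", analyzer).parse("common term");
            TopDocs td = is.search(query, 10);
            ScoreDoc[] hits = td.scoreDocs;
            System.out.println("hits " + hits.length + " docs!");
            for (int i = 0; i < hits.length; i++) {
                Document doc = is.doc(hits[i].doc);
                System.out.println(hits[i].score);
                System.out.println(doc.get("content"));
            }
            r.close();
        }

        public static void main(String[] args) throws IOException, ParseException {
            LuceneDemo ld = new LuceneDemo();
            // ld.index(); // uncomment on the first run to build the index
            ld.search();
        }
    }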

Four documents are indexed in total:

common common common term

common common term term

common term term term

term term term term

The query has two terms:

common term

What does the search return?

hits 4 docs!

0.92219996

common common common term

0.89540654

common common term term

0.80759263

common term term term

0.2382957

term term term term

How are these scores computed?

Lucene's implementation does not follow exactly the steps the formula suggests; it rearranges the order of computation.

Step 1: compute queryNorm(q)

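This value is computed only once per search and is the same for every document, so queryNorm(q) does not affect the ranking among documents; it only serves as a normalization factor for the query vector.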

It is computed as follows:
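For reference, the definitions used by DefaultSimilarity in Lucene 4.x can be restated as below. This follows the Lucene Javadoc, not a figure from this article, and in this example all boosts are assumed to be 1:

```latex
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}}
\qquad
\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^{2} \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^{2}
\qquad
\mathrm{idf}(t) = 1 + \ln\!\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1} \right)
```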

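The query contains two terms, common and term. The computation proceeds as follows (sumOfSquaredWeights and queryNorm are per-query values, shared by both terms):

| term   | numDocs | docFreq | idf      | sumOfSquaredWeights | queryNorm |
|--------|---------|---------|----------|---------------------|-----------|
| common | 4       | 3       | 1.0      | 1.6035059           | 0.7897047 |
| term   | 4       | 4       | 0.776856 | 1.6035059           | 0.7897047 |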
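The numbers in Step 1 can be reproduced with plain arithmetic. The sketch below is not Lucene code; it assumes DefaultSimilarity's idf definition, 1 + ln(numDocs / (docFreq + 1)), and that all boosts are 1. The class and method names are made up for illustration, and it uses double where Lucene uses float, so the last digits may differ slightly.

```java
public class QueryNormSketch {
    // Assumed idf definition (DefaultSimilarity in Lucene 4.x): 1 + ln(numDocs / (docFreq + 1))
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Sum of squared idf values over the query terms, with every boost assumed to be 1
    static double sumOfSquaredWeights(double... idfs) {
        double sum = 0.0;
        for (double v : idfs) {
            sum += v * v;
        }
        return sum;
    }

    // queryNorm = 1 / sqrt(sumOfSquaredWeights)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        double idfCommon = idf(3, 4); // "common" occurs in 3 of the 4 documents
        double idfTerm = idf(4, 4);   // "term" occurs in all 4 documents
        double sum = sumOfSquaredWeights(idfCommon, idfTerm);
        System.out.println("idf(common)         = " + idfCommon);
        System.out.println("idf(term)           = " + idfTerm);
        System.out.println("sumOfSquaredWeights = " + sum);
        System.out.println("queryNorm           = " + queryNorm(sum));
    }
}
```

Note that the "+ 1" in the denominator is why a term that appears in every document (term) ends up with an idf below 1.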

Step 2: normalization.

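For each query term, a Weight object is created, and value = idf(t) * queryNorm * queryWeight is computed in advance and stored. Here queryWeight is simply the idf, so value = idf(t)^2 * queryNorm.

| term   | idf      | queryNorm | queryWeight | value     |
|--------|----------|-----------|-------------|-----------|
| common | 1.0      | 0.7897047 | 1.0         | 0.7897047 |
| term   | 0.776856 | 0.7897047 | 0.776856    | 0.4765914 |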
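The per-term value is just the product given above. A minimal sketch, assuming all boosts are 1 and taking idf and queryNorm from Step 1 as inputs (the class and method names are hypothetical, not Lucene API):

```java
public class TermValueSketch {
    // value = idf * queryNorm * queryWeight, where queryWeight equals idf when boosts are 1
    static double value(double idf, double queryNorm) {
        double queryWeight = idf;
        return idf * queryNorm * queryWeight;
    }

    public static void main(String[] args) {
        double queryNorm = 0.7897047; // from Step 1
        System.out.println("value(common) = " + value(1.0, queryNorm));
        System.out.println("value(term)   = " + value(0.776856, queryNorm));
    }
}
```

Because this value depends only on the query and on index-wide statistics, it can be computed once per term and reused for every matching document.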

Step 3: compute coord(q,d).

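This is a scoring factor whose value depends on how many of the query terms the document contains. In general, the more query terms a document contains, the higher its score. The computation is simple: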

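coord(q,d) = overlap / maxOverlap, where overlap is the number of query terms the document contains and maxOverlap is the total number of query terms (a term that appears twice in the query counts twice).

Lucene takes a shortcut in its implementation: it computes coord for every overlap value in [0, maxOverlap] up front and stores the results in an array for later use. In this example there are two query terms, so there are at most three possible outcomes:

|            | document contains no query terms | document contains 1 query term | document contains 2 query terms |
|------------|----------------------------------|--------------------------------|---------------------------------|
| coord(q,d) | 0                                | 0.5                            | 1                               |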
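The precomputed lookup table described above can be sketched as follows. This is an illustration of the idea rather than Lucene's own code, and the class and method names are hypothetical:

```java
public class CoordSketch {
    // Precompute coord = overlap / maxOverlap for every overlap in [0, maxOverlap]
    static float[] coordTable(int maxOverlap) {
        if (maxOverlap <= 0) {
            throw new IllegalArgumentException("maxOverlap must be positive");
        }
        float[] coord = new float[maxOverlap + 1];
        for (int overlap = 0; overlap <= maxOverlap; overlap++) {
            coord[overlap] = overlap / (float) maxOverlap;
        }
        return coord;
    }

    public static void main(String[] args) {
        float[] coord = coordTable(2); // two query terms: "common" and "term"
        for (int overlap = 0; overlap < coord.length; overlap++) {
            System.out.println("overlap=" + overlap + " coord=" + coord[overlap]);
        }
    }
}
```

At scoring time the matcher only needs to count how many query terms a document hit and index into the array, instead of dividing for every document.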

Step 4: initial document scoring.

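For each term in the query, compute tf(t in d) and norm(t,d). Note that idf(t) does not depend on the document, and norm(t,d) was already computed at index time; see TFIDFSimilarity.computeNorm() for how. The values are as follows: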

docId