Lucene 3.6 Similarity doc

 Expert: Scoring API.

Similarity defines the components of Lucene scoring. Overriding computation of these components is a convenient way to alter Lucene scoring.


Suggested reading: Introduction To Information Retrieval, Chapter 6.


The following describes how Lucene scoring evolves from underlying information retrieval models to (efficient) implementation. We first brief on VSM Score, then derive from it Lucene's Conceptual Scoring Formula, from which, finally, evolves Lucene's Practical Scoring Function (the latter is connected directly with Lucene classes and methods).


Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.


In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension, and weights are Tf-idf values.


VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality, and so Lucene is using Tf-idf. Tf and Idf are described in more detail below, but for now, for completeness, let's just say that for given term t and document (or query) x, Tf(t,x) varies with the number of occurrences of term t in x (when one increases so does the other) and idf(t) similarly varies with the inverse of the number of index documents containing term t.

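As a rough sketch of that relationship (the helper name and the raw-count tf below are illustrative assumptions, not Lucene's actual defaults, which are given later in this document):

```java
// Illustrative tf-idf weight of term t in a document (or query) x.
// tfIdf and its parameter names are hypothetical; they only demonstrate
// that the weight grows with term frequency and shrinks with docFreq.
public class TfIdfSketch {
    public static double tfIdf(int termFreqInX, int numDocs, int docFreq) {
        double tf = termFreqInX;                            // grows with occurrences of t in x
        double idf = Math.log(numDocs / (double) docFreq);  // grows as docFreq shrinks
        return tf * idf;
    }
}
```

A term that appears in every indexed document gets idf = log(1) = 0 under this toy formula, so it contributes nothing to the weight.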

VSM score of document d for query q is the Cosine Similarity of the weighted query vectors V(q) and V(d):


cosine-similarity(q,d)   =   ( V(q) · V(d) )  /  ( |V(q)| · |V(d)| )

VSM Score

 

Where V(q) · V(d) is the dot product of the weighted vectors, and |V(q)| and |V(d)| are their Euclidean norms.


Note: the above equation can be viewed as the dot product of the normalized weighted vectors, in the sense that dividing V(q) by its Euclidean norm is normalizing it to a unit vector.

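The VSM score above can be sketched directly over sparse term-weight vectors (the class name and `Map` representation here are illustrative, not how Lucene stores vectors):

```java
import java.util.Map;

// Cosine similarity of two sparse weighted vectors:
// map key = index term (one dimension), value = that term's weight.
public class CosineSketch {
    public static double cosine(Map<String, Double> vq, Map<String, Double> vd) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : vq.entrySet()) {
            Double w = vd.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // V(q) · V(d)
        }
        double nq = 0.0, nd = 0.0;
        for (double w : vq.values()) nq += w * w;     // |V(q)|²
        for (double w : vd.values()) nd += w * w;     // |V(d)|²
        return dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }
}
```

A vector scored against itself yields 1.0, and vectors with no shared dimension yield 0.0, as expected of a cosine.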

Lucene refines VSM score for both search quality and usability:


  • Normalizing V(d) to the unit vector is known to be problematic in that it removes all document length information. For some documents removing this info is probably ok, e.g. a document made by duplicating a certain paragraph 10 times, especially if that paragraph is made of distinct terms. But for a document which contains no duplicated paragraphs, this might be wrong. To avoid this problem, a different document length normalization factor is used, which normalizes to a vector equal to or larger than the unit vector: doc-len-norm(d).
  • At indexing, users can specify that certain documents are more important than others, by assigning a document boost. For this, the score of each document is also multiplied by its boost value doc-boost(d).
  • Lucene is field based, hence each query term applies to a single field, document length normalization is by the length of the certain field, and in addition to document boost there are also document field boosts.
  • The same field can be added to a document during indexing several times, and so the boost of that field is the multiplication of the boosts of the separate additions (or parts) of that field within the document.
  • At search time users can specify boosts to each query, sub-query, and each query term, hence the contribution of a query term to the score of a document is multiplied by the boost of that query term: query-boost(q).
  • A document may match a multi term query without containing all the terms of that query (this is correct for some of the queries), and users can further reward documents matching more query terms through a coordination factor, which is usually larger when more terms are matched: coord-factor(q,d).

Under the simplifying assumption of a single field in the index, we get Lucene's Conceptual scoring formula:


score(q,d)   =   coord-factor(q,d)  ·  query-boost(q)  ·  ( V(q) · V(d) / |V(q)| )  ·  doc-len-norm(d)  ·  doc-boost(d)

Lucene Conceptual Scoring Formula

 

The conceptual formula is a simplification in the sense that (1) terms and documents are fielded and (2) boosts are usually per query term rather than per query.


We now describe how Lucene implements this conceptual scoring formula, and derive from it Lucene's Practical Scoring Function.


For efficient score computation some scoring components are computed and aggregated in advance:


  • Query-boost for the query (actually for each query term) is known when search starts.
  • Query Euclidean norm |V(q)| can be computed when search starts, as it is independent of the document being scored. From search optimization perspective, it is a valid question why bother to normalize the query at all, because all scored documents will be multiplied by the same |V(q)|, and hence documents ranks (their order by score) will not be affected by this normalization. There are two good reasons to keep this normalization:
     
    • Recall that Cosine Similarity can be used to find how similar two documents are. One can use Lucene for, e.g., clustering, and use a document as a query to compute its similarity to other documents. In this use case it is important that the score of document d3 for query d1 is comparable to the score of document d3 for query d2. In other words, scores of a document for two distinct queries should be comparable. There are other applications that may require this. And this is exactly what normalizing the query vector V(q) provides: comparability (to a certain extent) of two or more queries.
    • Applying query normalization on the scores helps to keep the scores around the unit vector, hence preventing loss of score data because of floating point precision limitations.
  • Document length norm doc-len-norm(d) and document boost doc-boost(d) are known at indexing time. They are computed in advance and their multiplication is saved as a single value in the index: norm(d). (In the equations below, norm(t in d) means norm(field(t) in doc d) where field(t) is the field associated with term t.)
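A minimal sketch of that indexing-time precomputation, assuming DefaultSimilarity's default length normalization of 1/√(number of terms in the field); the class and parameter names are illustrative, not Lucene API:

```java
// norm(d): doc-len-norm(d), doc-boost(d) and the field boost folded into
// one value at indexing time and saved as a single number in the index.
public class NormSketch {
    // Assumes lengthNorm = 1 / sqrt(numTermsInField), as in DefaultSimilarity.
    public static float norm(float docBoost, float fieldBoost, int numTermsInField) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTermsInField)); // doc-len-norm(d)
        return docBoost * fieldBoost * lengthNorm;  // stored as norm(d)
    }
}
```

For example, with no boosts (both 1.0) a four-term field gets norm 1/√4 = 0.5, so shorter fields contribute more to the score.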

Lucene's Practical Scoring Function is derived from the above. Its components relate directly to those of the conceptual formula:


score(q,d)   =   coord(q,d)  ·  queryNorm(q)  ·  Σ_{t in q} ( tf(t in d)  ·  idf(t)²  ·  t.getBoost()  ·  norm(t,d) )

Lucene Practical Scoring Function

where

  1. tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. Note that tf(t in q) is assumed to be 1 and therefore does not appear in this equation. However, if a query contains the same term twice, there will be two term-queries with that same term, and hence the computation would still be correct (although not very efficient). The default computation for tf(t in d) in DefaultSimilarity is:
     
    tf(t in d)   =  frequency½

     
  2. idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give a higher contribution to the total score. idf(t) appears for t in both the query and the document, hence it is squared in the equation. The default computation for idf(t) in DefaultSimilarity is:
     
    idf(t)   =   1 + log ( numDocs / (docFreq + 1) )

     
  3. coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed in coord(q,d) by the Similarity in effect at search time.
     
  4. queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity produces a Euclidean norm:
     
    queryNorm(q)   =   queryNorm(sumOfSquaredWeights)   =   1 / sumOfSquaredWeights½

     
    The sum of squared weights (of the query terms) is computed by the query Weight object. For example, a BooleanQuery computes this value as:
     
    sumOfSquaredWeights   =   q.getBoost()²  ·  Σ_{t in q} ( idf(t)  ·  t.getBoost() )²

     
  5. t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax) or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multiple terms are represented in a query as multiple TermQuery objects, and so the boost of a term in the query is accessible by calling getBoost() on the sub-query.
     
  6. norm(t,d) encapsulates a few (indexing-time) boost and length factors:
    • Document boost - set by calling doc.setBoost() before adding the document to the index.
    • Field boost - set by calling field.setBoost() before adding the field to a document.
    • lengthNorm - computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. lengthNorm is computed by the Similarity class in effect at indexing.
    The computeNorm(java.lang.String, org.apache.lucene.index.FieldInvertState) method is responsible for combining all of these factors into a single float.

    When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:
     

    norm(t,d)   =   doc.getBoost()  ·  lengthNorm  ·  Π_{field f in d named as t} f.getBoost()

     
    However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
     
    Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.
     
    The rationale supporting such lossy compression of norm values is that, given the difficulty (and inaccuracy) of users expressing their true information need by a query, only big differences matter.
     
    Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
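The components above can be tied together in a self-contained sketch. The formulas mirror DefaultSimilarity's defaults as given in this document (√tf, 1 + log idf, 1/√sumOfSquaredWeights); the class and parameter names are illustrative only, not Lucene API:

```java
// Self-contained sketch of Lucene's practical scoring function,
// using the default formulas quoted in the text above.
public class PracticalScoreSketch {
    public static double tf(int freq) {                  // frequency½
        return Math.sqrt(freq);
    }
    public static double idf(int numDocs, int docFreq) { // 1 + log(numDocs / (docFreq+1))
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }
    public static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }
    // One query term's contribution: tf(t in d) · idf(t)² · t.getBoost() · norm(t,d)
    public static double termScore(int freq, int numDocs, int docFreq,
                                   double termBoost, double norm) {
        double i = idf(numDocs, docFreq);
        return tf(freq) * i * i * termBoost * norm;
    }
    // score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} termScore(t)
    public static double score(double coord, double queryNorm, double[] termScores) {
        double sum = 0.0;
        for (double s : termScores) sum += s;
        return coord * queryNorm * sum;
    }
}
```

This is a sketch of the math only; in real Lucene these pieces are computed by the Weight/Scorer machinery and the Similarity in effect at indexing and search time.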