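# Summary for 2012-03-22

Today I suddenly felt like reading some papers on data retrieval, partly just to broaden my horizons. The main topic was an improvement to the BM25 algorithm: it adds the proximity relationship between neighbouring terms, together with the number of unique query terms matched, to the calculation of certain weights. I found it quite interesting. I also wrote up my own summary, which I plan to use later when contributing to Xapian. I have joined the Xapian development mailing list and the IRC channel as well, and I hope to spend more time there in the future; I believe it will help me improve a lot. Below is my summary of the BM25 improvement, which I wrote in English myself.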

The paper suggests that the contribution of a proximity distance
measure should follow a function with a convex shape.
Their final function therefore uses a logarithm
to convert a proximity distance measure into a proximity feature value,
which is then combined with an existing retrieval function,
the Okapi BM25 model.

The minimum pair distance is defined as the smallest distance
value of all pairs of unique matched query terms.

1. Assume that a document matches K unique query terms, and that the total number of
occurrences of these K query terms is N. We can record the positions of these N occurrences in order
in the inverted index so that we can scan them one by one.
2. While scanning, we maintain a list of length K in which we store the last position of each query term seen so far. In other words, if a term t occurs twice, we record the location of the first
occurrence when the scan hits the first one and update it when the scan hits the second one.
3. In each step, we calculate the span solely from the information in the list, and finally select
the smallest span value obtained at any point during the scan.
4. Since K is often very small, the algorithm is close to linear in N.
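
The four steps above can be sketched in Python as follows. This is only my reading of the procedure, not the paper's reference code: the function name `smallest_span` and the input format (a dict from term to sorted positions) are hypothetical, and I assume that a span is computed only once every one of the K matched terms has been seen at least once, and that fewer than two matched terms yields no span.

```python
import heapq


def smallest_span(positions_by_term):
    """Return the smallest span covering every matched term, or None.

    positions_by_term maps each matched query term to the sorted list of
    its positions in one document (hypothetical input format).
    """
    k = len(positions_by_term)
    if k < 2:
        return None  # assumption: a span needs at least two distinct terms

    # Merge the K sorted position lists into one ordered stream of N items.
    streams = [[(pos, term) for pos in positions]
               for term, positions in positions_by_term.items()]
    last_seen = {}  # the "list of length K": term -> last position seen
    best = None
    for pos, term in heapq.merge(*streams):
        last_seen[term] = pos  # record or update the last position
        if len(last_seen) == k:  # assumption: wait until all K terms are seen
            span = max(last_seen.values()) - min(last_seen.values())
            if best is None or span < best:
                best = span
    return best


print(smallest_span({"a": [1, 10], "b": [5], "c": [11]}))
```

Each step does O(K) work to take the minimum and maximum of the list, so the scan costs roughly O(N·K), which matches the "close to linear in N" remark when K is small.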

Here is an example:
Document1: t1 t2 t1 t3 t5 t4 t2 t3 t4
Document2: t4 t3 t2 t1 t5 t1 t3 t6

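The inverted index should look like this (positions are listed for Document1 first, then Document2):

| Term | Document ID [occurrence count] | Positions of this term |
|------|--------------------------------|------------------------|
| t1 | 1[2], 2[2] | [1, 3, 4, 6] |
| t2 | 1[2], 2[1] | [2, 7, 3] |
| t3 | 1[2], 2[2] | [4, 8, 2, 7] |
| t5 | 1[1], 2[1] | [5, 5] |
| t4 | 1[2], 2[1] | [6, 9, 1] |
| t6 | 2[1] | [8] |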
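
As a quick check of the table, here is a minimal sketch that builds a positional inverted index for the two example documents. The function name `build_positional_index` and the nested-dict layout are my own choices for illustration; a real engine such as Xapian stores postings differently.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [1-based positions]} for whitespace-split docs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


docs = {
    1: "t1 t2 t1 t3 t5 t4 t2 t3 t4",
    2: "t4 t3 t2 t1 t5 t1 t3 t6",
}
index = build_positional_index(docs)
for term, postings in index.items():
    print(term, postings)
```

The occurrence count of a term in a document is simply the length of its position list, so it does not need to be stored separately in this sketch.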

Now assume that the query terms are t1, t3 and t6.
Then Document1 matches two unique terms (t1 and t3), and the total number of occurrences of
these two query terms is four. We already have the positions of these unique query terms
in the inverted index.

Then we scan the index entries of t1 and t3 and maintain a list of length two for Document1.
The positions of these four occurrences are in order, so we scan them one by one.
step1: the lowest unvisited position is 1 (t1); the list is {t1: 1}.
step2: the next position is 3 (t1 again); the entry is updated and the list is {t1: 3}.
step3: the next position is 4 (t3); the list is {t1: 3, t3: 4}, so the smallest span value is now 1.
step4: the next position is 8 (t3 again); the list is {t1: 3, t3: 8}, whose span is 5, so we do not update the smallest span value.

Finally we have the smallest span value, and we can use the new retrieval function to calculate weights:
R(Q, D) = BM25(Q, D) + π(Q, D)
π(Q, D) is the function: π(Q, D) = log(α + exp(-δ(Q, D)))
δ(Q, D) is the smallest span value of the matched query terms.
α is a parameter introduced here to allow for certain variations, and α = 0.3 works well for most data
sets (I got this from the paper).
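
To make the formula concrete, here is a small sketch of the proximity term and the combined score. The function names are hypothetical, the BM25 score is taken as a given number rather than computed, and I assume the natural logarithm.

```python
import math


def proximity_feature(delta, alpha=0.3):
    """pi(Q, D) = log(alpha + exp(-delta)); natural log is an assumption."""
    return math.log(alpha + math.exp(-delta))


def combined_score(bm25_score, delta, alpha=0.3):
    """R(Q, D) = BM25(Q, D) + pi(Q, D), with the BM25 score supplied by the caller."""
    return bm25_score + proximity_feature(delta, alpha)


# The proximity term shrinks as the span grows and levels off near log(alpha).
for delta in (1, 5, 50):
    print(delta, proximity_feature(delta))
```

A smaller span gives a larger proximity feature, and for very large spans the term flattens out at log(α), so distant matches are all penalised by about the same amount.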
