- Indexing
  - Manual indexing with fixed vocabularies (e.g. Dewey Decimal System, MeSH): labour- and training-intensive (indexers must be trained on the label vocabulary of the custom scheme)
  - Automatic indexing: term manipulation, term weighting, index terms
With no predefined set of index terms, automatic indexing uses natural language as the indexing language.
Application: inverted files.
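An inverted file maps each index term to the documents that contain it. The following is a minimal sketch, assuming whitespace tokenisation and a toy in-memory collection; the function name `build_inverted_index` and the sample documents are illustrative and not from the original notes.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive tokenisation (lowercase, split on whitespace).
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection: document id -> text.
docs = {
    1: "information retrieval with inverted files",
    2: "boolean retrieval models",
    3: "ranked retrieval and term weighting",
}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2, 3]
print(index["boolean"])    # [2]
```

Because the index terms are taken directly from the text, no predefined vocabulary is needed.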
2. Automatic retrieval models
| Boolean search | Ranked algorithms |
|---|---|
| Binary decision: is the document relevant or not? | Ranked algorithm: based on the frequency of document terms |
| Presence of the term is necessary and sufficient for a match | Not all search terms are necessarily present in the document |
| Boolean operators are set operations (AND, OR) | Incarnations: the vector space model (SMART, Salton et al., 1971); the probabilistic model (OKAPI, Robertson/Spärck Jones, 1976); Web search engines |
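A Boolean query provides a simple logical basis for deciding whether any document should be returned.
The Boolean model combines keywords with Boolean operators into a Boolean search command, and every document gets one of only two outcomes: match or no match. This model suits the bibliographic search engines that libraries use for searching their catalogues. When we borrow books from a library, we usually search by keyword and pick the books that match. In that setting a Boolean model should be more accurate than an automatic ranked-matching model.
However, Boolean models do not suit most users in most situations: 1. Boolean operators are not natural language and do not match how users normally behave. 2. Most users do not want a large set of unranked results unless the collection is very small, and a small collection is the opposite of what web search deals with.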
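Since Boolean operators are set operations, a Boolean query can be sketched directly with Python sets. This is only an illustration on hypothetical postings; the `postings` data and the `lookup` helper are assumptions made for the example.

```python
# Hypothetical toy postings: term -> set of ids of documents containing it.
postings = {
    "retrieval": {1, 2, 3},
    "boolean": {2},
    "ranked": {3},
}

def lookup(term):
    """Documents matching a single keyword (empty set if the term is unknown)."""
    return postings.get(term, set())

# AND is set intersection, OR is set union.
and_result = lookup("retrieval") & lookup("boolean")
or_result = lookup("boolean") | lookup("ranked")

print(sorted(and_result))  # [2]
print(sorted(or_result))   # [2, 3]
```

Every document is either in the result set or not, and the result carries no ranking.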
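The vector space model is one of the incarnations of automatic retrieval models. The bag-of-words approach is the standard approach to representing documents, also known as the vector space model. All documents and queries are represented as multi-dimensional vectors, and documents are then ranked by relevance to the query by computing the similarity between the vectors.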
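How is the similarity between vectors computed? There are two main methods:

1. The Euclidean method judges similarity by the straight-line distance between two vectors. With this method the length of the vectors themselves strongly affects the result (frequency of terms is overweighted).
2. The cosine of the angle between two vectors $\vec{x}$ and $\vec{y}$:

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$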
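The two measures can be compared on toy term-count vectors. The sketch below uses made-up vectors; `doc_long` has the same term proportions as the query but ten times the counts, which shows how vector length dominates the Euclidean distance while the cosine is unaffected.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical term-count vectors over a three-term vocabulary.
query = [1, 1, 0]
doc_long = [10, 10, 0]   # same direction as the query, much longer
doc_other = [0, 1, 1]    # shares only one term with the query

print(round(euclidean_distance(query, doc_long), 2),
      round(euclidean_distance(query, doc_other), 2))  # 12.73 1.41
print(round(cosine_similarity(query, doc_long), 2),
      round(cosine_similarity(query, doc_other), 2))   # 1.0 0.5
```

By Euclidean distance `doc_other` looks closer to the query than `doc_long`, whereas the cosine ranks `doc_long` first.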
- Term manipulation
  - stemming
  - stopwords
- Term weighting
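  - Binary: binary weights (0/1)
  - tf (term frequency): frequency of the term in the document
  - idf (inverse document frequency): sets frequency in the document against frequency in the collection, so terms that occur in many documents are weighted down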
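The weighting schemes can be combined into a tf-idf weight. The sketch below assumes one common variant, weight = tf × log(N / df), where tf is the raw count of the term in the document, df is the number of documents containing the term, and N is the collection size. The notes do not say which variant is meant, and the function name `tf_idf` and the sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenised = [text.lower().split() for text in docs]
    n_docs = len(tokenised)
    # df: number of documents in the collection that contain the term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)  # tf: raw count of the term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection.
docs = [
    "retrieval of documents",
    "boolean retrieval",
    "ranked retrieval ranked results",
]
weights = tf_idf(docs)
print(weights[2]["retrieval"])         # 0.0
print(round(weights[2]["ranked"], 3))  # 2.197
```

A term that appears in every document ("retrieval") gets weight zero, while a term that is frequent in one document and rare in the collection ("ranked") gets a high weight.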
- Evaluation
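Different IR systems are built from different algorithms and models, and researchers keep studying these algorithms and models in order to improve system performance. How to evaluate a system's capability is therefore especially important. Relevance is judged in a binary way: when searching across thousands or millions of web pages, ranking them is too subjective and difficult. Evaluation mainly assesses the relevance of the documents the system retrieves, and it also takes in other factors such as user effort/ease of use, response time, and form of presentation (how the search results are shown).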
Evaluation measures:
  - Average precision over all relevant documents in the system
  - Average precision over 11 recall levels
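The two averages can be sketched for a single ranked result list with binary relevance judgments. The code assumes the usual conventions: a relevant document that is never retrieved contributes precision 0, the 11 recall levels are 0.0, 0.1, ..., 1.0, and the interpolated precision at a level is the highest precision at any recall at or above it. The function names and the sample ranking are illustrative.

```python
def precision_recall(ranking, relevant):
    """(precision, recall) after each rank position of the result list."""
    if not relevant:
        return []
    hits = 0
    points = []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / rank, hits / len(relevant)))
    return points

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute 0.
    return total / len(relevant) if relevant else 0.0

def eleven_point_average(ranking, relevant):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = precision_recall(ranking, relevant)
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # Interpolation: best precision at any recall >= this level.
        candidates = [p for p, r in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)

# Hypothetical ranked result list and binary relevance judgments.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(round(average_precision(ranking, relevant), 3))     # 0.833
print(round(eleven_point_average(ranking, relevant), 3))  # 0.848
```

To evaluate a whole system, these per-query values would typically be averaged again over a set of test queries.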