TEXT PROCESSING : INFORMATION RETRIEVAL

TinaHer

Tags: text processing

Article link: https://blog.csdn.net/TinaHer/article/details/79266167

Indexing

manual	using fixed vocabularies (e.g. Dewey Decimal System, MeSH)
	labour- and training-intensive (indexers must be trained in the system's own label vocabulary)
automatic	term manipulation
	term weighting
	index terms

With no predefined set of index terms, automatic indexing uses natural language as the indexing language. Typical application: inverted files.
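As a minimal sketch of what an inverted file looks like, the snippet below maps every term to the IDs of the documents that contain it. The function name `build_inverted_index`, the sample documents, and the lowercase-and-split tokenization are assumptions made for illustration only; a real indexer would also apply the term manipulation steps described later.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical three-document collection.
docs = {
    1: "information retrieval with inverted files",
    2: "boolean retrieval model",
    3: "vector space model",
}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2]
print(index["model"])      # [2, 3]
```

Looking up a term is then a single dictionary access, which is why this structure is a natural fit for keyword search.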

Automatic retrieval models

The boolean model	binary decision: is document relevant or not?
( Boolean search )	presence of term is necessary and sufficient for match
	boolean operators are set operations (AND, OR)
Ranked retrieval model	ranked algorithm: frequency of document terms
( Ranked algorithms )	not all search terms necessarily present in document
	Incarnations: • The vector space model (SMART, Salton et al, 1971) • The probabilistic model (OKAPI, Robertson/Spärck Jones, 1976) • Web search engines

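A Boolean query provides a simple logical basis for deciding whether any document should be returned. The Boolean model combines keywords with Boolean operators into a single Boolean search expression, and every document gets one of only two outcomes: match or no match. This model suits bibliographic search engines in libraries well: when we borrow books from a library, we usually search by keyword and pick the books that match. In that setting, the Boolean model is likely to be more accurate than an automatic ranked matching model.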

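However, Boolean models do not suit most users in most situations: 1. Boolean operators are not natural language and do not match how users normally search. 2. Most users do not want a large set of unranked results unless the collection is very small, and a small collection is the opposite of what web search, with its massive scale, deals with.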
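The set-operation view of Boolean search can be sketched in a few lines. The helper names (`postings`, `boolean_and`, `boolean_or`) and the sample book titles are assumptions for illustration. Note that the result is an unordered set of matching IDs, with no ranking among them.

```python
def postings(docs):
    """Build a term -> set of document IDs mapping."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, a, b):
    """Documents containing both terms: set intersection."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Documents containing at least one term: set union."""
    return index.get(a, set()) | index.get(b, set())


# Hypothetical library catalogue.
books = {
    1: "introduction to information retrieval",
    2: "modern information theory",
    3: "retrieval of historical records",
}
book_index = postings(books)
and_hits = boolean_and(book_index, "information", "retrieval")
or_hits = boolean_or(book_index, "information", "retrieval")
print(sorted(and_hits))  # [1]
print(sorted(or_hits))   # [1, 2, 3]
```

The `OR` query already returns the whole catalogue here, which hints at why unranked results become hard to use as the collection grows.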

The vector space model is one incarnation of the automatic ranked retrieval models. It builds on the bag-of-words approach, the standard way of representing documents. All documents and the query are represented as multidimensional vectors, and the documents are then ranked by relevance to the query by computing the similarity between those vectors.
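A small sketch of the bag-of-words representation, assuming raw term counts as vector components and a vocabulary taken from the sample texts themselves. The function name `bag_of_words` and the sample sentences are hypothetical.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the term-count vector of `text` over a fixed vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical documents and query.
texts = ["the cat sat on the mat", "the dog sat"]
query = "cat mat"

# One dimension per distinct term, in alphabetical order.
vocabulary = sorted({t for text in texts + [query] for t in text.lower().split()})
doc_vectors = [bag_of_words(t, vocabulary) for t in texts]
query_vector = bag_of_words(query, vocabulary)

print(vocabulary)      # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(doc_vectors[0])  # [1, 0, 1, 1, 1, 2]
print(query_vector)    # [1, 0, 1, 0, 0, 0]
```

Word order is discarded: only which terms occur, and how often, survives in the vector.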

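How is the similarity between vectors computed? There are two main methods. 1. Euclidean distance: similarity is judged by the straight-line distance between the two vectors. With this method, however, the length of the vectors themselves strongly affects the result (the frequency of terms is overweighted).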

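2. Cosine of the angle between two vectors $\vec{x}$ and $\vec{y}$, which depends only on their direction and not on their length:

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$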
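The two measures can be compared on toy vectors. This is a sketch with made-up numbers: `long_doc` has the same term proportions as the query but ten times the counts, while `other_doc` shares only one term with it.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


q = [1, 1, 0]
long_doc = [10, 10, 0]  # same direction as q, ten times longer
other_doc = [0, 1, 1]   # shares only one term with q

# Euclidean distance prefers other_doc, purely because long_doc is long.
print(round(euclidean_distance(q, long_doc), 2))   # 12.73
print(round(euclidean_distance(q, other_doc), 2))  # 1.41

# Cosine similarity ignores length and prefers long_doc.
print(round(cosine_similarity(q, long_doc), 2))    # 1.0
print(round(cosine_similarity(q, other_doc), 2))   # 0.5
```

The two measures rank the documents in opposite order here, which is the length effect described above.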

Term manipulation

stemming
stopwords
term weighting

Binary	binary weights - 0/1
tf ( Term frequency )	Frequency of term in document
idf ( Inverse document frequency )	Frequency in document vs in collection: terms that occur in few documents of the collection get a higher weight
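The weighting schemes in the table can be combined into a tf-idf weight. The sketch below assumes one common variant, raw term frequency multiplied by $\log(N / n_t)$, where $N$ is the number of documents and $n_t$ the number of documents containing term $t$; other variants exist, and the function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document, using tf * log(N / n_t)."""
    n_docs = len(tokenized_docs)
    # Number of documents each term appears in.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights


# Hypothetical pre-tokenized corpus.
corpus = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(corpus)
print(weights[2]["the"])                         # 0.0
print(weights[2]["chased"] > weights[2]["cat"])  # True
```

A term such as "the" that appears in every document gets weight zero under this variant, however often it occurs, while a term confined to one document gets the highest weight.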

Evaluation

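Different IR systems are built from different algorithms and models, and researchers keep refining those algorithms and models in order to improve the systems' capability. How to evaluate that capability is therefore especially important. Relevance is judged in a binary way: when searching across thousands or millions of web pages, ranking them by degree of relevance is too subjective and difficult. Evaluation mainly assesses the relevance of the documents the system retrieves, and it also takes other factors into account, such as user effort / ease of use (how easy the system is to operate), response time, and form of presentation (how the search results are shown).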

Evaluation Measures：