Information Retrieval: An Overview

Do_Cool_Thing

Category: Classroom Notes

Article link: https://blog.csdn.net/menyangyang/article/details/18002627

These notes are compiled from the instructor's lecture slides.

IR (information retrieval)

IR deals with the representation, storage, and organization of, and access to, information items such as documents, web pages, online catalogs, structured and semi-structured records, and multimedia objects. The representation and organization of the information items should give users easy access to the information they are interested in.

The core task is to predict which documents are relevant and then rank them in a linear order.

Information retrieval vs. data retrieval (information retrieval // data retrieval):

A. Data: unstructured, open to interpretation // structured, with well-defined semantics

B. Query: usually incomplete or ambiguous (w.r.t. the information need) // well-defined semantics

C. Quality of results: partial match allowed, relevance-based ranking // exact match required, which yields either no results or many

D. Foundations: probabilistic underpinnings // algebra and logic

E. Application: library // accounting

Clustering vs. classification:

Clustering: given a set of docs, group them into clusters based on their contents.

Classification: given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

Outline

I: IR models

II: Retrieval evaluation

III: Text classification

IV: Indexing and searching

V: Web retrieval

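I. IR models (the goal is to produce a ranking function)

1. Boolean model

A document scores 1 if it is relevant and 0 if it is not.

Drawbacks: partial matches are not graded, so no ranking can be produced; the information need has to be translated into a Boolean expression, which is cumbersome; queries tend to return either too few or too many results.

No partial matches.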
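The binary nature of the Boolean model can be sketched in a few lines. This is only an illustration: the toy documents, the helper names `term_set` and `boolean_and_not`, and the query shape (required terms AND NOT excluded terms) are assumptions made for the example, not part of the lecture material.

```python
# Toy collection, made up for illustration.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "data retrieval requires exact match",
    "d3": "ranking web documents",
}

def term_set(text):
    """Represent a document as the set of terms it contains."""
    return set(text.lower().split())

def boolean_and_not(docs, must_have, must_not_have):
    """Score each document 1 if it satisfies the query, else 0.

    The query is: all terms in must_have AND NOT any term in must_not_have.
    """
    scores = {}
    for doc_id, text in docs.items():
        terms = term_set(text)
        matches = must_have <= terms and not (must_not_have & terms)
        scores[doc_id] = 1 if matches else 0
    return scores

scores = boolean_and_not(docs, must_have={"retrieval"}, must_not_have={"data"})
print(scores)  # {'d1': 1, 'd2': 0, 'd3': 0}
```

Every document ends up with either 1 or 0, so there is nothing to sort by: d3 is about ranking documents but gets the same score as a completely unrelated text.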

2. Vector model

Partial matching is possible.

Term weights are used to compute a degree of similarity between a query and each document.

Documents are ranked in decreasing order of similarity.
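A minimal sketch of vector-model ranking follows. It assumes raw term counts as the term weights and cosine similarity as the similarity measure; the notes do not fix either choice, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Map a text to a sparse vector of raw term counts (assumed weighting)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (doc_id, similarity) pairs in decreasing order of similarity."""
    q = to_vector(query)
    scored = [(doc_id, cosine(q, to_vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": "web search engine",
    "d2": "web page ranking",
    "d3": "library catalog",
}
ranking = rank("web search", docs)
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd2', 'd3']
```

Here d2 matches only one of the two query terms, yet it still receives a nonzero score and is ranked above d3, which is the partial matching the Boolean model cannot express.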

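3. Probabilistic model

Ranks documents by their estimated probability of being relevant to the query.

Assumes that there is an ideal answer set for each query, containing exactly the relevant documents.

4. Term weighting

A term weight expresses the importance of an index term in each document and is used to compute the ranking.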

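4.1. TF-IDF

TF = number of occurrences of term i in the document / total number of term occurrences in the document

IDF = log(total number of documents / number of documents containing term i)

TF-IDF is a statistical measure of how important a word is to one document within a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the word becomes more frequent across the corpus. Variants of TF-IDF weighting are widely used by search engines to measure or rate the relevance of a document to a user query. A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to keep the more distinctive words of a document and filter out common ones.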
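The two formulas above can be turned into code directly. The sketch below assumes whitespace tokenization and the natural logarithm (the notes do not specify a base); the function names and the three toy documents are made up for the example.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of the term divided by the total number of tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log(total documents / documents containing the term); 0.0 if the term never occurs."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
doc = corpus[0]
w_the = tf_idf("the", doc, corpus)  # appears in every document, so IDF is log(1) = 0
w_cat = tf_idf("cat", doc, corpus)  # appears in 2 of 3 documents
w_mat = tf_idf("mat", doc, corpus)  # appears in 1 of 3 documents
print(w_the, w_cat, w_mat)
```

Although "the" is the most frequent word in the first document, its weight is zero because it occurs in every document, while the rarer "mat" gets the highest weight of the three.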

4.2. BM25 (best match) is one of the most established probabilistic term weighting models. It emerged from a series of experiments on variations of the probabilistic model and assigns a score to each retrieved document.

Divergence from randomness

Not all words are equally important for describing the content of a document.

Document length normalization: long documents are more likely to be retrieved even when they are not the ones we want, so term weights are normalized by document length.
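To make the length normalization concrete, here is a sketch of BM25 scoring. The notes do not give the formula, so this uses the commonly cited Okapi BM25 form with a smoothed, non-negative IDF; the parameter values `k1=1.2` and `b=0.75` are conventional defaults assumed for the example, and the function name and toy corpus are hypothetical.

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query with a common BM25 variant.

    k1 controls term-frequency saturation and b controls how strongly
    the score is normalized by document length (both assumed defaults).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        freq = doc_tokens.count(term)
        # Greater than 1 for documents longer than average, less than 1 for shorter ones.
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score

short_doc = ["bm25", "ranking"]
long_doc = ["bm25"] + ["filler"] * 9
other_doc = ["unrelated", "text"]
corpus = [short_doc, long_doc, other_doc]

s_short = bm25_score(["bm25"], short_doc, corpus)
s_long = bm25_score(["bm25"], long_doc, corpus)
s_other = bm25_score(["bm25"], other_doc, corpus)
print(s_short, s_long, s_other)
```

Both the short and the long document contain the query term exactly once, but the short one scores higher because the single occurrence makes up a larger share of its content.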

II. Retrieval evaluation (a critical and integral component of any modern IR system)

To evaluate an IR system is to measure how well the system meets the information needs of its users (different users may judge the same result set differently).

The following measures can be used for evaluation.

Precision and recall (used to evaluate the retrieval performance of IR algorithms)

1. Precision and recall

I: an information request

R: the set of relevant documents for I

A: the answer set for I, generated by an IR system

Recall = |R ∩ A| / |R|

Precision = |R ∩ A| / |A|
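With R and A represented as sets of document ids, both measures reduce to a set intersection. The document ids and the function name below are made up for the example.

```python
def precision_recall(relevant, answer):
    """Return (precision, recall) for a relevant set R and an answer set A."""
    hits = len(relevant & answer)  # |R ∩ A|
    precision = hits / len(answer) if answer else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

R = {"d1", "d2", "d3", "d4"}  # relevant documents for the request
A = {"d1", "d2", "d5"}        # documents returned by the system
precision, recall = precision_recall(R, A)
print(precision, recall)
```

Two of the three returned documents are relevant, so precision is 2/3; two of the four relevant documents were found, so recall is 1/2.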

Computing recall requires detailed knowledge of all the documents in the collection, which is often not available. This motivates the following measure.

2. A single precision value

Precision at 5 or 10 (P@5, P@10) measures the precision after the top 5 or 10 documents have been seen.
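P@k only needs relevance judgments for the top k results. A small sketch, with a hypothetical ranking and relevant set, and dividing by k even when fewer than k results are returned (one common convention, assumed here):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranking[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

ranking = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant = {"d1", "d3", "d4", "d5"}

p_at_5 = precision_at_k(ranking, relevant, 5)    # d3, d1, d4 are relevant: 3/5
p_at_10 = precision_at_k(ranking, relevant, 10)  # d3, d1, d4, d5 are relevant: 4/10
print(p_at_5, p_at_10)  # 0.6 0.4
```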

III. Text classification

Reasons:

a. Large collections need their documents labeled

b. Labels allow searching for documents on a subject or topic

Machine learning

a. Algorithms that learn patterns in the data

b. The learned patterns are used to make predictions on new data

Learning algorithms differ in how they use training data and fall into three types.

a. Supervised learning (algorithms: SVM, Naive Bayes)

Training data (class labels for the input documents) is provided as input.

b. Unsupervised learning

No training data

Examples:

Neural network models

Independent component analysis

Clustering

c. Semi-supervised learning

A small amount of training data combined with a large amount of unlabeled data.
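As an illustration of the supervised case, the sketch below trains a tiny multinomial Naive Bayes text classifier with add-one smoothing. The four labeled documents, the class names, and the function names `train` and `predict` are assumptions made for the example; a real system would typically use a library implementation.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-class document counts and term counts from (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability for the text."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total_docs)  # class prior
        total_terms = sum(term_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((term_counts[label][token] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_data = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training_data)
print(predict("cheap pills", model), predict("monday meeting", model))  # spam ham
```

The class labels attached to the training documents are exactly the "training data provided as input" that distinguishes supervised learning from the unsupervised case.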

IV. Indexing and searching (key to efficiency in IR systems, especially in large-scale applications)

Efficiency in an IR system: processing user queries with minimal computational resources.

Index: a data structure built from the text to speed up searches. It should be updated at reasonably regular intervals.

Inverted indexes:

An inverted index (also referred to as a postings file or inverted file) is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, in a document, or in a set of documents. The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, and is used on a large scale, for example in search engines.
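A minimal in-memory sketch of the idea: each term maps to the set of documents that contain it, and a multi-term query is answered by intersecting those sets. The function names, the AND-only query semantics, and the toy documents are assumptions for the example; production indexes also store positions and compress the postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "new home sales",
    2: "home sales rise",
    3: "rise in new prices",
}
index = build_inverted_index(docs)
print(search(index, "home sales"))  # {1, 2}
print(search(index, "new rise"))    # {3}
```

The cost is paid at indexing time: adding a document means updating the entry of every term it contains, while a query only touches the entries for its own terms instead of scanning every document.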