Notes compiled from the instructor's lecture slides.
IR (information retrieval)
IR deals with the representation, storage, and organization of, and access to, information items such as documents, web pages, online catalogs, structured and semi-structured records, and multimedia objects. The representation and organization of the information items should give users easy access to the information they are interested in.
The central task is predicting which documents are relevant and then ranking them linearly.
Information vs data retrieval (each line gives information retrieval first, then data retrieval after the //):
A.Data: unstructured, open to interpretation // structured, with well-defined semantics
B.Query: usually incomplete or ambiguous (w.r.t. the information need) // well-defined semantics
C.Quality of results: partial match allowed, relevance-based ranking // exact match required, so a query returns either no results or many results
D.Foundations: probabilistic underpinnings // algebra/logic
E.Application: library // accounting
Clustering vs classification:
Clustering: given a set of docs, group them into clusters based on their contents.
Classification: given a set of topics plus a new doc D, decide which topic(s) D belongs to.
Outline
I: IR models
II: Retrieval evaluation
III: Text classification
IV: Indexing and searching
V: Web retrieval
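I. IR models (the goal is to produce a ranking function)
1.Boolean model
A document is either relevant (1) or not relevant (0).
Drawbacks: no partial matching; no scoring, so no ranked list can be produced; the information need has to be translated into a Boolean expression, which is tedious; a query tends to return either too few or too many results.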
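The Boolean model can be illustrated with plain set operations. The sketch below is only an illustration: the toy documents, the `postings` dictionary, and the helper names `boolean_and`, `boolean_or`, and `boolean_not` are all assumptions made for this example, not something from the slides.

```python
# Toy collection; the documents are made-up examples.
docs = {
    1: "information retrieval ranks documents",
    2: "data retrieval needs exact match",
    3: "information need is often ambiguous",
}

# Map each term to the set of ids of the documents that contain it.
postings = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Ids of documents that contain every given term."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Ids of documents that contain at least one given term."""
    return set().union(*(postings.get(t, set()) for t in terms))

def boolean_not(term):
    """Ids of documents that do not contain the term."""
    return set(docs) - postings.get(term, set())

# "information AND NOT retrieval"
result = boolean_and("information") & boolean_not("retrieval")
```

Every document is either in the result set or not, so there is nothing to sort by, which is the ranking drawback noted above.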
2.Vector model
Partial matching is possible
Term weights are used to compute a degree of similarity between a query and each document
Documents are ranked in decreasing order of similarity.
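A minimal sketch of vector-model ranking, assuming raw term counts as the term weights and cosine similarity as the similarity measure (both are common choices, but the notes do not fix either). The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, docs):
    """Return (doc_id, similarity) pairs in decreasing order of similarity."""
    q_vec = Counter(query.split())  # assumption: weight = raw term count
    scored = [(doc_id, cosine(q_vec, Counter(text.split())))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: -pair[1])

toy_docs = {"d1": "web search engine", "d2": "web page", "d3": "cooking recipes"}
ranking = rank_by_cosine("web search", toy_docs)
```

Here `d2` shares only one of the two query terms and still receives a nonzero score, which is the partial matching the Boolean model lacks.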
3.Probabilistic model
Uses a probabilistic framework.
Assumes there is an ideal answer set for each query and ranks documents by their estimated probability of belonging to it.
4.Term weighting
A term weight expresses the importance of an index term in each document and is used to compute the rank.
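4.1.TF-IDF
TF = (number of times term i occurs in the document) / (total number of term occurrences in the document)
IDF = log(total number of documents / number of documents in which term i occurs)
TF-IDF is a statistical measure used to evaluate how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Variants of TF-IDF weighting are often used by search engines as a measure or rating of the relevance between a document and a user query. A high term frequency within a particular document, together with a low document frequency of that term across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to keep the words that are distinctive to a document and to filter out common words.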
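The two formulas above can be sketched directly in code. This is an illustrative implementation of exactly those definitions (natural logarithm assumed); the function name `tf_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
w = tf_idf(corpus)
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes, while "dog", which occurs in only one document, gets the largest weight in its document.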
4.2.BM25 (best match) is one of the most established probabilistic term weighting models, developed through a series of experiments on variations of the probabilistic model. It assigns a score to each retrieved document.
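The notes do not give the BM25 formula, so the sketch below assumes one widely used variant: an IDF term of the form log((N - n + 0.5) / (n + 0.5) + 1) and the parameters k1 = 1.2 and b = 0.75. These choices, the function name `bm25_scores`, and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """query: list of terms; docs: list of token lists. Returns one score per doc."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        # Length normalization: documents longer than average are penalized.
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            f = counts[term]
            if f == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores

toy = [
    ["bm25", "ranking"],                    # short document
    ["bm25", "ranking"] + ["filler"] * 8,   # same query term, much longer
    ["other", "words"],                     # no query term
]
scores = bm25_scores(["bm25"], toy)
```

The short and the long document contain the query term the same number of times, but the long one scores lower because of the length normalization term.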
Divergence from randomness
Not all words are equally important for describing the content of the documents.
Normalize for document length, because a long document is more likely to be retrieved even though it may not be the one we want.
II. Retrieval evaluation (a critical and integral component of any modern IR system)
To evaluate an IR system is to measure how well the system meets the information needs of its users (different users may lead to different results).
The following measures can be used for evaluation.
Precision and recall (used to evaluate the retrieval performance of IR algorithms)
1. Precision & recall
I : an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR system
Recall = |R ∩ A| / |R|
Precision = |R ∩ A| / |A|
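As a small worked example of these two definitions, the sketch below treats R and A as Python sets of document ids. The function name and the ids are hypothetical, and returning 0.0 for an empty set is a convention chosen here to avoid division by zero.

```python
def precision_recall(relevant, answer):
    """relevant: the set R; answer: the set A. Returns (precision, recall)."""
    relevant, answer = set(relevant), set(answer)
    hits = len(relevant & answer)  # |R ∩ A|
    precision = hits / len(answer) if answer else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

R = {1, 2, 3, 4}   # relevant documents for the request I
A = {2, 3, 5}      # answer set returned by the system
precision, recall = precision_recall(R, A)
```

Two of the three returned documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).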
Computing these measures requires detailed knowledge of all the docs in the collection, which motivates the following.
2.A single precision value is used.
Precision at 5 or 10 (P@5, P@10) measures the precision when 5 or 10 docs have been seen.
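A short sketch of precision at k, assuming the system returns a ranked list of document ids and that the relevance judgments are available as a set. The function name and the data are made up for illustration; dividing by k even when fewer than k docs are returned is one common convention.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked docs that are relevant."""
    top = ranked[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

ranked = ["d3", "d1", "d7", "d2", "d9", "d4", "d5", "d8", "d6", "d0"]
relevant = {"d1", "d2", "d3"}
p_at_5 = precision_at_k(ranked, relevant, 5)
p_at_10 = precision_at_k(ranked, relevant, 10)
```

Only the judgments for the top 5 or 10 docs are needed, not for the whole collection.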
III. Text classification
Reasons:
a.Larger collections need their documents to be labeled
b.Allow searching documents on a subject or topic
Machine learning
a.Algorithms that learn patterns in the data
b.Use the pattern to predict the new data
Learning algorithms use training data and can be of three types.
a.Supervised learning (algorithms: SVM / Naive Bayes)
Training data (classes for the input documents) is provided as input.
b.Unsupervised learning
No training data
Examples:
Neural network models
Independent component analysis
Clustering
c.Semi-supervised learning
A small amount of labeled training data is combined with a large amount of unlabeled data.
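For the supervised case, here is a minimal sketch of a Naive Bayes text classifier. It assumes the multinomial variant with add-one (Laplace) smoothing and ignores query terms that never appeared in training; the notes name Naive Bayes but do not specify these details. Function names and training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs, i.e. the training data."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify_nb(tokens, model):
    """Return the label with the highest log posterior for the new document."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n in class_counts.items():
        total = sum(term_counts[label].values())
        logp = math.log(n / n_docs)  # class prior
        for t in tokens:
            if t not in vocab:
                continue  # assumption: skip terms unseen in training
            # Add-one smoothing keeps the probability above zero.
            logp += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

training = [
    (["goal", "match", "team"], "sports"),
    (["team", "win", "score"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["party", "policy", "vote"], "politics"),
]
model = train_nb(training)
```

Calling `classify_nb(["team", "score"], model)` then assigns the new doc to one of the classes seen in training.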
IV. Indexing and searching (key to efficiency in IR systems, especially in large-scale applications)
Efficiency in an IR system: processing user queries with minimal requirements of computational resources.
Index: a data structure built from the text to speed up searches. It should be updated at reasonably regular intervals.
Inverted indexes:
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale, for example, in search engines.
It is a word-oriented mechanism for indexing a text collection to speed up the searching task.
The structure consists of the vocabulary (the set of distinct words) and the occurrences (for each word, the list of places where it appears; together these lists act as a sparse matrix).
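The vocabulary/occurrences structure can be sketched with nested dictionaries. This is one possible layout, assuming whitespace tokenization and word positions as the stored locations; the function name and the sample documents are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [word positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

sample = {1: "to be or not to be", 2: "to do is to be"}
index = build_inverted_index(sample)
vocabulary = sorted(index)      # the distinct words
occurrences_of_to = index["to"]  # where "to" appears, per document
```

A query for a word is then a single dictionary lookup instead of a scan over every document, at the cost of updating the index whenever a document is added.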
Ranking
How to find the top-k documents and return them to the user, using the TF-IDF weights.
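Once every candidate document has a score, selecting the top k does not require sorting the whole list. The sketch below uses Python's standard `heapq.nlargest`; the function name `top_k` and the scores are invented for the example.

```python
import heapq

def top_k(scores, k):
    """scores: {doc_id: score}. Returns the k best (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

doc_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
best_two = top_k(doc_scores, 2)
```

If k is larger than the number of scored documents, all of them are returned in decreasing order of score.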
V. Web retrieval
The web:
very large, public, unstructured
Efficient tools are needed to manage it
Search engines are the central tool on the web
Problems:
Several characteristics make retrieval hard work:
Large and distributed volume of data available.
Fast pace of change.
Two types of challenges on the web:
data-centric: distributed data / high percentage of volatile data / large volume / unstructured and redundant data
interaction-centric:
expressing a query
interpreting results
Most popular formats on the web:
HTML, followed by GIF and JPG, ASCII text, PDF
Structure of the web:
Can be viewed as a graph. Nodes represent individual pages. Edges represent links between pages.
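The graph view can be sketched as an adjacency list, where each page maps to the pages it links to. The page names below are hypothetical, and counting incoming links is just one simple thing such a graph allows.

```python
# Each key is a page (node); each value lists the pages it links to (edges).
web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

def in_degree(graph):
    """Count how many links point to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts

incoming = in_degree(web)
```

In this toy graph `c.html` is linked from three pages, while `d.html` has no incoming links at all.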