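A very good book with very comprehensive coverage. I have been reading it for a long time and still have not finished; I have only just completed the first ten chapters and found that I had forgotten most of the earlier material, so I am coming back to take some notes.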
Boolean Retrieval
I. Definition of information retrieval:
Academic definition:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).
Categories:
By scale:
web search, domain-specific search, personal information retrieval
Basic needs:
1. To process large document collections quickly.
2. To allow more flexible matching operations.
3. To allow ranked retrieval.
Simple idea:
Build a term-document incidence matrix and combine the binary rows of the query terms with the logical operators AND, OR, and NOT, e.g. 110100 AND 110111 AND 101111 = 100100
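A minimal Python sketch of the bit-vector idea, assuming each term's row of the incidence matrix is stored as an integer with one bit per document. The names `WIDTH`, `MASK`, `bit_not`, and `to_bits` are illustrative assumptions; the three vectors are the ones from the line above.

```python
WIDTH = 6                 # assumed number of documents, one bit each
MASK = (1 << WIDTH) - 1   # keeps results within WIDTH bits

def bit_not(row):
    """Complement a row, i.e. the documents that do NOT contain the term."""
    return ~row & MASK

def to_bits(row):
    """Format a row as a fixed-width bit string for display."""
    return format(row, f"0{WIDTH}b")

# The three operands from the notes, combined with bitwise AND.
result = 0b110100 & 0b110111 & 0b101111
print(to_bits(result))

# A NOT operand is obtained by complementing a row first:
# 101111 is the complement of 010000.
print(to_bits(bit_not(0b010000)))
```

Each 1 bit in the result marks a document that satisfies the whole query.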
What is Boolean retrieval:
The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
Such queries effectively view each document as a set of words.
What a Boolean retrieval query looks like:
(Calpurnia AND Brutus) AND Caesar
How to assess an IR system:
Precision : What fraction of the returned results are relevant to the information
need?
Recall : What fraction of the relevant documents in the collection were returned
by the system?
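The two measures can be sketched in Python as follows, assuming the returned results and the relevant documents are both available as sets of docIDs. The function name `precision_recall` and the sample docIDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for two sets of docIDs."""
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents returned, 5 relevant in the collection,
# 2 of the returned ones are relevant.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(p, r)
```

Returning 0.0 for an empty set is a convention chosen here to avoid dividing by zero, not something the definitions prescribe.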
Vector space model: easy to rank.
Term-document matrix: not scalable.
Inverted index: a dictionary plus postings lists.
How to build an inverted index:
1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
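The four steps above can be sketched in Python roughly as follows. This is a simplified illustration: `tokenize`, `normalize`, and `build_inverted_index` are hypothetical helper names, the two sample documents split the quoted text arbitrarily, and normalization here only lowercases (real linguistic preprocessing would also map "friends" to "friend", which this sketch does not attempt).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: split the text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Step 3 (simplified): lowercase only, no stemming."""
    return token.lower()

def build_inverted_index(docs):
    """Step 4: map each term to a sorted list of the docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    # Sort the dictionary by term and each postings list by docID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Step 1: collect the documents (docIDs assigned by hand here).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Keeping every postings list sorted by docID is what makes the merge-style query processing possible.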
Processing Boolean queries:
AND operation:
Intersect two postings lists:
INTERSECT(p1, p2)
  answer ← ()
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
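A runnable Python version of the merge, assuming each postings list is a Python list of docIDs sorted in increasing order (list indices stand in for the pointers p1 and p2). The sample docIDs are illustrative.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID is in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
```

The linear running time depends on both lists being sorted by docID.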
Multiple-term AND operation:
Process terms in order of increasing document frequency:
If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.
INTERSECT(⟨t1, . . . , tn⟩)
  terms ← SORTBYINCREASINGFREQUENCY(⟨t1, . . . , tn⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms != NIL and result != NIL
  do result ← INTERSECT(result, postings(first(terms)))
     terms ← rest(terms)
  return result
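A Python sketch of this ordering heuristic, assuming the index is a dict from term to a sorted list of docIDs, so that document frequency is simply the length of the postings list. The function names and the sample index are illustrative assumptions.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, index):
    """AND together all terms, rarest term first."""
    # Sort by document frequency (postings list length), smallest first.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, index.get(term, []))
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))
```

Stopping as soon as the intermediate result is empty mirrors the `result != NIL` test in the pseudocode.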
OR operation:
The idea is the n-way merge used in merge sort, similar to the AND operation.
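A sketch of the OR operation in Python, assuming sorted lists of docIDs. `union` is a hand-written two-way merge; `union_many` uses the standard library's `heapq.merge` for the n-way case. Both function names are assumptions made for this illustration.

```python
import heapq

def union(p1, p2):
    """Two-way merge of sorted postings lists, keeping each docID once."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # Unlike AND, the leftover tail of either list belongs to the answer.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer

def union_many(postings_lists):
    """n-way merge of sorted postings lists with duplicates removed."""
    answer = []
    for doc_id in heapq.merge(*postings_lists):
        if not answer or answer[-1] != doc_id:
            answer.append(doc_id)
    return answer

print(union([1, 3, 5], [2, 3, 6]))
print(union_many([[1, 4], [2, 4], [4, 9]]))
```

The main difference from the AND merge is that every docID is emitted, and the remainder of the longer list is appended once the other runs out.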
The extended Boolean model versus ranked retrieval:
Proximity operator:
A proximity operator is a way of specifying that two terms in a query must occur in a document close to each other, where closeness may be measured
by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
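One way to check closeness measured in intervening words, sketched in Python under the assumption that the index stores, for each term and document, a sorted list of word positions. The function name `within` and the sample positions are hypothetical.

```python
def within(positions_a, positions_b, max_intervening):
    """True if some occurrence of term A and some occurrence of term B
    have at most max_intervening words between them, in either order."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j]) - 1  # words in between
        if gap <= max_intervening:
            return True
        # Advance the pointer at the smaller position to try a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

# Hypothetical word positions of two terms inside one document.
print(within([3, 20], [7, 40], max_intervening=3))
print(within([3, 20], [7, 40], max_intervening=2))
```

Closeness defined by a structural unit such as a sentence or paragraph would need unit boundaries in the index as well, which this sketch does not cover.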
Additional things we would like to do:
1. We would like to better determine the set of terms in the dictionary and
to provide retrieval that is tolerant to spelling mistakes and inconsistent
choice of words.
2. It is often useful to search for compounds or phrases that denote a concept
such as “operating system”. As the Westlaw examples show, we might also
wish to do proximity queries such as Gates NEAR Microsoft. To answer
such queries, the index has to be augmented to capture the proximities of
terms in documents.
3. A Boolean model only records term presence or absence, but often we
would like to accumulate evidence, giving more weight to documents that
have a term several times as opposed to ones that contain it only once. To
be able to do this we need the term frequency information (the number of
times a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly
we wish to have an effective method to order (or “rank”) the returned
results. This requires having a mechanism for determining a document
score which encapsulates how good a match a document is for a query.
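As a rough illustration of points 3 and 4, the sketch below stores a term frequency with every posting and ranks documents by the sum of the term frequencies of the query terms. That scoring rule is a deliberately naive assumption chosen for this example, not a method taken from the book; `build_tf_index`, `rank`, and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_tf_index(docs):
    """Map each term to {docID: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def rank(query_terms, index):
    """Score each document by summing the tf of every query term in it."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by smaller docID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    1: "caesar caesar brutus",
    2: "caesar calpurnia",
    3: "brutus",
}
index = build_tf_index(docs)
print(rank(["caesar", "brutus"], index))
```

Unlike a Boolean query, this returns an ordered list of (docID, score) pairs rather than an unordered set of matches.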