# Introduction

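An inverted index is built as follows:

1. Read a term from the file.
2. Reduce the term to its stem (word stemming), i.e. strip third-person, past-tense, progressive, and similar inflections so that only the stem remains, and discard stop words, i.e. words such as "a" and "is" that carry little meaning on their own.
3. Check whether the term is already in the dictionary.
4. If it is not, insert the term into the dictionary. Then update the index information for that term.
5. Once the index is complete, write the index file to disk.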

In pseudocode:

    while ( read a document D ) {
        while ( read a term T in D ) {
            if ( Find( Dictionary, T ) == false )
                Insert( Dictionary, T );
            Get T's posting list;
            Insert a node to T's posting list;
        }
    }
    Write the inverted index to disk;
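
The construction loop above can be sketched in Python. This is only a minimal illustration under several assumptions: `STOP_WORDS` is a tiny hypothetical stop-word list, `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, a Python dict plays the role of the dictionary, and JSON is an arbitrary choice of on-disk format. None of these names come from the notes.

```python
import json
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(docs):
    """Map each stemmed term to its posting list (ascending document IDs)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):          # read a document D
        for raw in text.lower().split():          # read a term T in D
            token = raw.strip(".,;:!?\"'()")
            if not token or token in STOP_WORDS:  # drop stop words
                continue
            term = naive_stem(token)
            postings = index[term]                # inserts T if it is missing
            # Append the document ID once per document.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(index)


def save_index(index, path):
    """Write the inverted index to disk as JSON (format is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False, indent=2)


docs = ["The cat is walking", "A cat walked", "Dogs walk"]
index = build_inverted_index(docs)
print(index)
```

With these three sample documents, "walking", "walked", and "walk" all collapse to the term `walk`, so its posting list covers every document, while `dog` appears only in the last one.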

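Precision: P = RR / (RR + IR)

Recall: R = RR / (RR + RN)

Here RR is the number of relevant documents that were retrieved, IR the number of irrelevant documents that were retrieved, and RN the number of relevant documents that were not retrieved.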
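
As a small worked example of the two formulas, the sketch below computes precision and recall from a set of retrieved document IDs and a set of relevant document IDs. The function name `precision_recall` and the sample IDs are made up for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not something the notes specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    rr = len(retrieved & relevant)   # relevant and retrieved
    ir = len(retrieved - relevant)   # irrelevant but retrieved
    rn = len(relevant - retrieved)   # relevant but not retrieved
    # Guard against empty sets (assumed convention: report 0.0).
    precision = rr / (rr + ir) if retrieved else 0.0
    recall = rr / (rr + rn) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 8 documents relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4],
                        relevant=[2, 4, 5, 6, 7, 8, 9, 10])
print(p, r)  # 0.5 0.25
```

In this example RR = 2, IR = 2, and RN = 6, so P = 2 / 4 and R = 2 / 8.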

# Exercises

While accessing a term, hashing is faster than search trees. (T or F)

In distributed indexing, the document-partitioned strategy is to store on each node all the documents that contain the terms in a certain range. (T or F)

When evaluating the performance of data retrieval, it is important to measure the relevancy of the answer set. (T or F)

When measuring the relevancy of the answer set, if the precision is high but the recall is low, it means that: (2 points)
A. most of the relevant documents are retrieved, but too many irrelevant documents are returned as well
B. most of the retrieved documents are relevant, but still a lot of relevant documents are missed
C. most of the relevant documents are retrieved, but the benchmark set is not large enough
D. most of the retrieved documents are relevant, but the benchmark set is not large enough

Which of the following is NOT a concern when evaluating a search engine? (2 points)
A. How fast does it index
B. How fast does it search
C. How friendly is the interface
D. How relevant is the answer set