根据信息搜索的向量模型(Vector model)1,基于倒排索引结构2,实现一个简单的搜索引擎(Vector Space Retrieval (VSR))。
实验要求
实验环境
操作系统:Windows 10
开发环境:JDK 1.7
Java库: ansj_seg-2.0.7.jar, nlp-lang-1.0.jar
实验过程
- 加载所有的文档
- 构建倒排索引
- 获取查询
- 根据向量模型进行搜索
- 输出搜索性能
索引构建
倒排索引(英语:Inverted index),也常被称为反向索引、置入档案或反向档案,是一种索引方法,被用来存储在全文搜索下某个单词在一个文档或者一组文档中的存储位置的映射。它是文档检索系统中最常用的数据结构。5
构建过程如下:
Create an empty HashMap, H;
For each document, D, (i.e. file in an input directory):
Create a HashMapVector, V, for D;
For each (non-zero) token, T, in V:
If T is not already in H, create an empty
TokenInfo for T and insert it into H;
Create a TokenOccurrence for T in D and
add it to the occList in the TokenInfo for T;
Compute IDF for all tokens in H:
Let N be the total number of Documents;
For each token, T, in H:
Determine the total number of documents, M,
in which T occurs (the length of T’s occList);
Set the IDF for T to log(N/M);
Compute vector lengths for all documents in H:
Assume the length of all document vectors (stored in the DocumentReference) are initialized to 0.0;
For each token T in H:
Let, I, be the IDF weight of T;
For each TokenOccurence of T in document D:
Let, C, be the count of T in D;
Increment the length of D by (I*C)^2 ;
For each document D in H:
Set the length of D to be the square-root of the current stored length;
其中H: tokenHash, D: document, V: hashMapVector, T: token,使用ansj_seg6对文档分词。去掉在数词、标点符号和停用词,得到的若干个词及其出现的次数保存为一个HashMap。
查询搜索
向量空间模型是一个把文本文件表示为标识符(比如索引)向量的代数模型。它应用于信息过滤、信息检索、索引以及相关排序。7
查询搜索过程如下:
Create a HashMapVector, Q, for the query.
Create empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
Let I be the IDF of T, and K be the count of T in Q;
Set the weight of T in Q: W = K * I;
Let L be the list of TokenOccurences of T from H;
For each TokenOccurence, O, in L:
Let D be the document of O, and C be the count of O (tf of T in D);
If D is not already in R (D was not previously retrieved)
Then add D to R and initialize score to 0.0;
Increment D’s score by W * I * C; (product of T-weight in Q and D)
Compute the length, L, of the vector Q (the sum of the squares of its weights).
For each retrieved document D in R:
Let S be the current accumulated score of D;
(S is the dot-product of D and Q)
Let Y be the length of D as stored in its DocumentReference;
Normalize D’s final score to S / (L * Y);
Sort retrieved documents in R by final score and return results in an array.
其中Q: hashMapVector, R: retrievalHashMap, T: token, L: queryLength,检索的文档使用TF-IDF的余弦相似度作为分数进行排序。
参考
- Vector space model. 维基百科. 最后修订于2016年4月5日. https://en.wikipedia.org/wiki/Vector_space_model ↩
- Inverted index. 维基百科. 最后修订于2016年3月28日. https://en.wikipedia.org/wiki/Inverted_index ↩
- tf–idf. 维基百科. 最后修订于2016年3月8日. https://en.wikipedia.org/wiki/Tf%E2%80%93idf ↩
- Cosine similarity. 维基百科. 最后修订于2016年1月17日. https://en.wikipedia.org/wiki/Cosine_similarity ↩
- Inverted index. 维基百科. 最后修订于2016年3月28日. https://en.wikipedia.org/wiki/Inverted_index ↩
- AnsjSeg 使用手册. http://nlpchina.github.io/ansj_seg/ ↩
- Vector space model. 维基百科. 最后修订于2016年4月5日. https://en.wikipedia.org/wiki/Vector_space_model ↩

本文介绍了一个基于向量空间模型的简单搜索引擎实现,该引擎采用TF-IDF权重计算及余弦相似度进行文档检索与排序。通过构建倒排索引加速搜索过程,并详细展示了索引构建与查询搜索的具体步骤。
415

被折叠的 条评论
为什么被折叠?



