首先看一个例子:
假设有3篇文章,file1, file2, file3,文件内容如下:
file1 (单词1,单词2,单词3,单词4....) file2 (单词a,单词b,单词c,单词d....) file3 (单词1,单词a,单词3,单词d....)
那么建立的倒排索引就是这个样子:
单词1 (file1,file3) 单词2 (file1) 单词3 (file1,file3) 单词a (file2, file3) ....
倒排索引的概念很简单:就是将文件中的单词作为关键字,然后建立单词与文件的映射关系。当然,你还可以添加文件中单词出现的频数等信息。倒排索引是搜索引擎中一个很基本的概念,几乎所有的搜索引擎都会使用到倒排索引。
下面是我对于倒排索引的一个简单的实现。该程序对于输入的一段文字,查找出该词所出现的行号以及出现的次数。
import java.io.*;
import java.util.HashMap;
import java.util.Map;
public class InvertedIndex {
private Map<String, Map<Integer, Integer>> index;
private Map<Integer, Integer> subIndex;
public void createIndex(String filePath) {
index = new HashMap<String, Map<Integer, Integer>>();
try {
File file = new File(filePath);
InputStream is = new FileInputStream(file);
BufferedReader read = new BufferedReader(new InputStreamReader(is));
String temp = null;
int line = 1;
while ((temp = read.readLine()) != null) {
String[] words = temp.split(" ");
for (String word : words) {
if (!index.containsKey(word)) {
subIndex = new HashMap<Integer, Integer>();
subIndex.put(line, 1);
index.put(word, subIndex);
} else {
subIndex = index.get(word);
if (subIndex.containsKey(line)) {
int count = subIndex.get(line);
subIndex.put(line, count+1);
} else {
subIndex.put(line, 1);
}
}
}
line++;
}
read.close();
is.close();
} catch (IOException e) {
System.out.println("error in read file");
}
}
public void find(String str) {
String[] words = str.split(" ");
for (String word : words) {
StringBuilder sb = new StringBuilder();
if (index.containsKey(word)) {
sb.append("word: " + word + " in ");
Map<Integer, Integer> temp = index.get(word);
for (Map.Entry<Integer, Integer> e : temp.entrySet()) {
sb.append("line " + e.getKey() + " [" + e.getValue() + "] , ");
}
} else {
sb.append("word: " + word + " not found");
}
System.out.println(sb);
}
}
public static void main(String[] args) {
InvertedIndex index = new InvertedIndex();
index.createIndex("news.txt");
index.find("I love Shanghai today");
}
}
其中,输入文件news.txt内容为:
I am eriol I live in Shanghai and I love Shanghai I also love travelling life in Shanghai is beautiful
输出结果为:
word: I in line 1 [1] , line 2 [2] , line 3 [1] , word: love in line 2 [1] , line 3 [1] , word: Shanghai in line 2 [2] , line 4 [1] , word: today not found