Inverted Indexes – Inside How Search Engines Work

GarfieldEr007

An inverted index is a structure used by search engines and databases to map search terms to the files or documents that contain them, trading slower writes when a document is added to the index for faster searches later on. There are two versions of an inverted index: a record-level index, which tells you which documents contain a term, and a fully inverted index, which tells you both which document a term is in and where in that document it appears. For example, if you built a search engine to search the contents of sentences and fed it these sentences:

{0} - "Turtles love pizza"
{1} - "I love my turtles"
{2} - "My pizza is good"

Then you would store them in an inverted index like this:

            Record Level     Fully Inverted
"turtles"   {0, 1}           { (0, 0), (1, 3) }
"love"      {0, 1}           { (0, 1), (1, 1) }
"pizza"     {0, 2}           { (0, 2), (2, 1) }
"i"         {1}              { (1, 0) }
"my"        {1, 2}           { (1, 2), (2, 0) }
"is"        {2}              { (2, 2) }
"good"      {2}              { (2, 3) }

The record-level sets hold just the ids of the documents that contain each word. In the fully inverted sets, the first number in each pair of parentheses is the document id and the second is the word's position within that document.
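To make the two layouts concrete, here is a minimal sketch of how both views could be built from the same postings. The `PositionalIndex` class and its `Positions` and `Documents` methods are hypothetical names chosen for this illustration, and lower-casing every word is an assumption made so that "Turtles" and "turtles" end up in one entry, as in the table above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration class; the names are assumptions, not a standard API.
public class PositionalIndex
{
    // term -> list of (document id, word position) pairs, in the order they were added
    private readonly Dictionary<string, List<(int DocId, int Position)>> _postings =
        new Dictionary<string, List<(int DocId, int Position)>>();

    private static readonly Regex FindWords = new Regex(@"[A-Za-z]+");

    public void Add(string text, int docId)
    {
        var words = FindWords.Matches(text);
        for (var position = 0; position < words.Count; position++)
        {
            // Assumption: terms are case-insensitive, so normalize to lower case.
            var term = words[position].Value.ToLowerInvariant();
            if (!_postings.TryGetValue(term, out var list))
            {
                list = new List<(int DocId, int Position)>();
                _postings[term] = list;
            }
            list.Add((docId, position));
        }
    }

    // Fully inverted view: every (document, position) pair recorded for a term.
    public IReadOnlyList<(int DocId, int Position)> Positions(string term)
    {
        return _postings.TryGetValue(term.ToLowerInvariant(), out var list)
            ? list
            : new List<(int DocId, int Position)>();
    }

    // Record-level view: only the distinct document ids for a term.
    public SortedSet<int> Documents(string term)
    {
        var docs = new SortedSet<int>();
        foreach (var posting in Positions(term))
            docs.Add(posting.DocId);
        return docs;
    }
}
```

Fed the three sentences above with ids 0, 1 and 2, `Positions("turtles")` yields (0, 0) and (1, 3), and `Documents("my")` yields 1 and 2, matching the table.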

So now if you wanted to search all three documents for the words “my turtles” you would grab the sets (looking at record level only):

"turtles"   {0, 1}
"my"        {1, 2}

Then you would intersect those sets, leaving document 1 as the only match. Using the fully inverted index would also tell us that in document 1 the word “my” appears at position 2 and the word “turtles” at position 3, in case word position is important to your search.
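One case where position matters is an exact phrase search: the words must appear next to each other and in order. The sketch below is one possible way to check that with a fully inverted index. `PhraseSearch`, `Build` and `Find` are hypothetical names introduced for this example, document ids are assumed to be the position of each text in the input array, and terms are assumed to be case-insensitive.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class PhraseSearch
{
    // Builds term -> (document id -> set of word positions).
    // Assumption: the document id is the index of the text in `documents`.
    public static Dictionary<string, Dictionary<int, HashSet<int>>> Build(params string[] documents)
    {
        var index = new Dictionary<string, Dictionary<int, HashSet<int>>>();
        for (var docId = 0; docId < documents.Length; docId++)
        {
            var words = Regex.Matches(documents[docId].ToLowerInvariant(), "[a-z]+");
            for (var position = 0; position < words.Count; position++)
            {
                var term = words[position].Value;
                if (!index.TryGetValue(term, out var docs))
                    index[term] = docs = new Dictionary<int, HashSet<int>>();
                if (!docs.TryGetValue(docId, out var positions))
                    docs[docId] = positions = new HashSet<int>();
                positions.Add(position);
            }
        }
        return index;
    }

    // Returns the ids of documents in which `terms` occur consecutively, in order.
    public static List<int> Find(
        Dictionary<string, Dictionary<int, HashSet<int>>> index, params string[] terms)
    {
        var matches = new List<int>();
        var wanted = terms.Select(t => t.ToLowerInvariant()).ToArray();
        if (wanted.Length == 0 || !index.TryGetValue(wanted[0], out var firstTermDocs))
            return matches;

        foreach (var doc in firstTermDocs)      // documents containing the first term
        {
            foreach (var start in doc.Value)    // each position of the first term
            {
                var isPhrase = true;
                for (var offset = 1; offset < wanted.Length && isPhrase; offset++)
                {
                    // The next term must sit exactly `offset` words after the start.
                    isPhrase = index.TryGetValue(wanted[offset], out var docs)
                        && docs.TryGetValue(doc.Key, out var positions)
                        && positions.Contains(start + offset);
                }
                if (isPhrase)
                {
                    matches.Add(doc.Key);
                    break;                      // one hit per document is enough
                }
            }
        }
        matches.Sort();
        return matches;
    }
}
```

With the three example sentences, `Find(index, "my", "turtles")` returns only document 1, because “my” is at position 2 and “turtles” at position 3 there, while `Find(index, "turtles", "my")` returns nothing.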

There is no standard implementation of an inverted index, as it's more of a concept than an actual algorithm. This does, however, give you a lot of options.

For the index you can choose to use things like hash tables, B-trees, or any other data structure with fast lookups.

Intersecting the sets is the more interesting problem. If 100% accuracy isn't needed, you can try Bloom filters. You can brute-force the problem by scanning both sets in full, which joins two sets in O(M+N) time. You can also try something a little more complicated. Rumor has it that search engines like Google and Bing only merge results until they have enough for a search page and then dump the sets they are loading, though I know very little about how they actually solve this problem.
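As a rough sketch of the O(M+N) scan, the method below walks two lists of document ids in step and can stop early once it has enough matches for a page. It assumes both lists are sorted in ascending order and contain no duplicates. `PostingLists.Intersect` and its `limit` parameter are hypothetical names for this example, and the early exit only illustrates the rumored idea; it is not a description of how any real search engine works.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class PostingLists
{
    // Intersects two ascending, duplicate-free lists of document ids.
    // Each step advances at least one cursor, so the scan is O(M+N).
    // Stops early once `limit` matches have been collected.
    public static List<int> Intersect(
        IReadOnlyList<int> a, IReadOnlyList<int> b, int limit = int.MaxValue)
    {
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count && result.Count < limit)
        {
            if (a[i] < b[j])
            {
                i++;                 // a[i] cannot be in b; skip it
            }
            else if (a[i] > b[j])
            {
                j++;                 // b[j] cannot be in a; skip it
            }
            else
            {
                result.Add(a[i]);    // present in both lists
                i++;
                j++;
            }
        }
        return result;
    }
}
```

For the example search, `PostingLists.Intersect(new[] { 0, 1 }, new[] { 1, 2 })` returns a list containing only document 1.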

Here is an example of a simple inverted index written in C# that uses a Dictionary as the index and the LINQ Intersect method:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class InvertedIndex
{
    // Maps each lower-cased word to the ids of the documents that contain it.
    private readonly Dictionary<string, HashSet<int>> _index = new Dictionary<string, HashSet<int>>();
    private readonly Regex _findWords = new Regex(@"[A-Za-z]+");

    public void Add(string text, int docId)
    {
        var words = _findWords.Matches(text);

        for (var i = 0; i < words.Count; i++)
        {
            // Normalize case so "Turtles" and "turtles" share one entry.
            var word = words[i].Value.ToLowerInvariant();

            if (!_index.TryGetValue(word, out var docIds))
                _index[word] = docIds = new HashSet<int>();

            // HashSet<int>.Add ignores ids that are already present.
            docIds.Add(docId);
        }
    }

    public List<int> Search(string keywords)
    {
        var words = _findWords.Matches(keywords);
        IEnumerable<int> rtn = null;

        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value.ToLowerInvariant();

            // A keyword missing from the index means no document can match.
            if (!_index.TryGetValue(word, out var docIds))
                return new List<int>();

            rtn = rtn == null ? docIds : rtn.Intersect(docIds);
        }

        return rtn != null ? rtn.ToList() : new List<int>();
    }
}

from: https://nullwords.wordpress.com/2013/04/18/inverted-indexes-inside-how-search-engines-work/
