倒排索引 -- 深入搜索引擎的工作原理 Inverted Indexes – Inside How Search Engines Work

An Inverted Index is a structure used by search engines and databases to make search terms to files or documents, trading the speed writing the document to the index for searching the index later on. There are two versions of an inverted index, a record-level index which tells you which documents contain the term and a fully inverted index which tells you both the document a term is contained in and where in the file it is. For example if you built a search engine to search the contents of sentences and it was fed these sentences:

{0} - "Turtles love pizza"
{1} - "I love my turtles"
{2} - "My pizza is good"

Then you would store them in a Inverted Indexes like this:

            Record Level     Fully Inverted
"turtles"   {0, 1}           { (0, 0), (1, 3) }
"love"      {0, 1}           { (0, 1), (1, 1) }
"pizza"     {0, 2}           { (0, 2), (2, 1) }
"i"         {1}              { (1, 0) }
"my"        {1, 2}           { (1, 2), (2, 0) }
"is"        {2}              { (2, 2) }
"good"      {2}              { (2, 3) }

The record level sets represent just the document ids where the words are stored, and the fully inverted sets represent the document in the first number inside the parentheses and the location in the document is stored in the second number.

So now if you wanted to search all three documents for the words “my turtles” you would grab the sets (looking at record level only):

"turtles"   {0, 1}
"my"        {1, 2}

Then you would intersect those sets, coming up with the only matching set being 1. Using the Fully Inverted Index would also let us know that the word “my” appeared at position 2 and the word “turtles” at position 3, assuming the word position is important your search.

There is no standard implementation for an Inverted Index as it’s more of a concept rather than an actual algorithm, this however gives you a lot of options.

For the index you can choose to use things like  Hashtables, BTrees, or any other fast search data structure.

The intersection becomes a more interesting problem. You can try using Bloom Filters if accuracy isn’t 100% needed, you can brute force the problem by doing a full scan of each set for O(M+N) time for joining two sets. You can also try to do something a little more complicated. Rumor has it that search engines like Google and Bing only merge results until they have enough for a search page and them dump the sets they are loading, though I know very little about how they actually solve this problem.

Here is an example of a simple Inverted Index written in C# that uses a Dictionary as the index and the Linq Intersect function:

public class InvertedIndex
{
    private readonly Dictionary<string, HashSet<int>> _index = new Dictionary<string, HashSet<int>>();
    private readonly Regex _findWords = new Regex(@"[A-Za-z]+");
 
    public void Add(string text, int docId)
    {
        var words = _findWords.Matches(text);
 
        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value;
 
            if (!_index.ContainsKey(word))
                _index[word] = new HashSet<int>();
 
            if (!_index[word].Contains(docId))
                _index[word].Add(docId);
        }
    }
 
    public List<int> Search(string keywords)
    {
        var words = _findWords.Matches(keywords);
        IEnumerable<int> rtn = null;
 
        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value;
            if (_index.ContainsKey(word))
            {
                rtn = rtn == null ? _index[word] : rtn.Intersect(_index[word]);
            }
            else
            {
                return new List<int>();
            }
        }
 
        return rtn != null ? rtn.ToList() : new List<int>();
    }
}

from: https://nullwords.wordpress.com/2013/04/18/inverted-indexes-inside-how-search-engines-work/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值