Implementing an Inverted Index

文盲奎托斯

Category: Data Structures. Tags: index

Article link: https://blog.csdn.net/u013701824/article/details/72780062

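An inverted index is an index that works in reverse. An ordinary (forward) index looks up a value from a record's position, whereas an inverted index looks up the record positions from the value of an attribute.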
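
To make the two lookup directions concrete, here is a minimal sketch. The function names ValueAt and PositionsOf are hypothetical and not part of the original post, and the linear scan only illustrates the direction of the lookup; an actual inverted index precomputes the answer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Forward lookup: position -> value.
// The record position is the index into the vector.
std::string ValueAt(const std::vector<std::string>& records, std::size_t pos)
{
    return records.at(pos);  // throws std::out_of_range for an invalid position
}

// Inverted lookup: value -> positions.
// A linear scan is used here only to show the direction of the lookup;
// an inverted index stores this result ahead of time.
std::vector<std::size_t> PositionsOf(const std::vector<std::string>& records,
                                     const std::string& value)
{
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < records.size(); ++i)
    {
        if (records[i] == value)
        {
            positions.push_back(i);
        }
    }
    return positions;
}
```

For the record list {"it", "is", "what", "it", "is"}, ValueAt(records, 2) yields "what", while PositionsOf(records, "it") yields positions 0 and 3.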

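For example, suppose we have a few articles and the words each of them contains.

Taking English as an example, here are the texts to be indexed:

Text 0    - "it is what it is"
Text 1    - "what is it"
Text 2    - "it is a banana"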

From these we get the following inverted file index:

 // inverted index
 "a":      {2}
 "banana": {2}
 "is":     {0, 1, 2}
 "it":     {0, 1, 2}
 "what":   {0, 1}

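The listing above can be reproduced directly from the raw strings. The following is a rough sketch under two assumptions that are not from the original post: words are separated by whitespace only (no punctuation or case handling), and ordered containers (std::map, std::set) are used so that words and IDs come out sorted as in the listing. The name BuildIndexFromTexts is hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds word -> set of text IDs from raw strings.
// Assumption: words are separated by whitespace only.
std::map<std::string, std::set<int>> BuildIndexFromTexts(const std::vector<std::string>& texts)
{
    std::map<std::string, std::set<int>> index;
    for (std::size_t id = 0; id < texts.size(); ++id)
    {
        std::istringstream words(texts[id]);
        std::string word;
        while (words >> word)  // extract one whitespace-delimited word at a time
        {
            // operator[] creates an empty set the first time a word is seen;
            // inserting the same ID twice has no effect.
            index[word].insert(static_cast<int>(id));
        }
    }
    return index;
}
```

Iterating over the returned map visits the words in alphabetical order, and each set holds its IDs in ascending order.
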
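The search terms "what", "is" and "it" correspond to the sets {0,1}, {0,1,2} and {0,1,2}.

The texts that contain all three words are therefore the intersection of these sets: texts 0 and 1 are the ones that contain "what", "is" and "it".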
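
The intersection step can be sketched as follows. This is one possible approach, not the original post's code: the alias PostingIndex and the function SearchAll are hypothetical names, the index stores plain sets rather than shared_ptr, and an empty query is assumed to match nothing.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical alias: word -> set of text IDs.
using PostingIndex = std::unordered_map<std::string, std::unordered_set<int>>;

// Hypothetical helper: returns the sorted IDs of the texts that contain
// every word in the query.
std::vector<int> SearchAll(const PostingIndex& index, const std::vector<std::string>& query)
{
    std::vector<int> result;
    if (query.empty())
    {
        return result;  // assumption: an empty query matches nothing
    }

    // Collect the set for each query word. A word that is not in the
    // index means that no text can contain all the words.
    std::vector<const std::unordered_set<int>*> sets;
    for (const auto& word : query)
    {
        auto it = index.find(word);
        if (it == index.end())
        {
            return result;
        }
        sets.push_back(&it->second);
    }

    // Scan the smallest set and keep the IDs that appear in every set.
    const std::unordered_set<int>* smallest = *std::min_element(
        sets.begin(), sets.end(),
        [](const std::unordered_set<int>* a, const std::unordered_set<int>* b)
        { return a->size() < b->size(); });

    for (int id : *smallest)
    {
        bool in_all = std::all_of(
            sets.begin(), sets.end(),
            [id](const std::unordered_set<int>* s) { return s->count(id) > 0; });
        if (in_all)
        {
            result.push_back(id);
        }
    }
    std::sort(result.begin(), result.end());  // deterministic output order
    return result;
}
```

With the index from the example, querying for "what", "is" and "it" gives IDs 0 and 1. Scanning the smallest set first keeps the number of membership checks low when one query word is rare.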

#include <unordered_map>
#include <unordered_set>
#include <string>
#include <vector>
#include <memory>
#include <iostream>
using namespace std;

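// Builds an inverted index: each word maps to the set of IDs of the
// documents that contain it. vec_vec[i] holds the words of document i.
unordered_map<string, shared_ptr<unordered_set<int>>> InvertedIndex(const vector<vector<string>>& vec_vec)
{
  unordered_map<string, shared_ptr<unordered_set<int>>> ret_map;
  int no = 0;  // ID of the document currently being indexed

  for(const auto& vec : vec_vec)    // iterate by reference to avoid copying each document
  {
    for(const auto& str : vec)
    {
        auto it = ret_map.find(str);
        if(it != ret_map.end())
        {
            // Word already indexed: add this document to its set
            it->second->insert(no);
        }
        else
        {
            // First occurrence of the word: create its set
            auto temp_ptr = make_shared<unordered_set<int>>();
            temp_ptr->insert(no);
            ret_map[str] = temp_ptr;
        }
    }
    ++no;
  }
  return ret_map;
}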

int main()
{
    // The three example texts, already split into words
    vector<string> vec1{"it", "is", "what", "it", "is"};
    vector<string> vec2{"what", "is", "it"};
    vector<string> vec3{"it", "is", "a", "banana"};
    vector<vector<string>> vec_vec{vec1, vec2, vec3};

    unordered_map<string, shared_ptr<unordered_set<int>>> ret_map =
        InvertedIndex(vec_vec);

    // Print each word followed by the IDs of the texts that contain it
    for(const auto& it : ret_map)
    {
        cout<<it.first<<" : [ ";
        for(auto itt : *(it.second))
        {
            cout<<itt<<" ";
        }
        cout<<"]"<<endl;
    }
    return 0;
}

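Output (unordered_map and unordered_set do not guarantee any iteration order, so the order of the words and IDs may differ on other systems):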

[liboyang@localhost test]$ g++ InvertedIndex.cpp -std=c++11
[liboyang@localhost test]$ ./a.out 
what : [ 1 0 ]
a : [ 2 ]
is : [ 2 1 0 ]
banana : [ 2 ]
it : [ 2 1 0 ]