[Data Structures] Introduction to the Inverted Index

名栩

Published 2025-01-10 00:10:01

Categories: Algorithms and Data Structures, System Design. Tags: Data Structures

Article link: https://blog.csdn.net/meanshe/article/details/145045000

Inverted Index Explained

1. Principles

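An inverted index is a data structure that stores the words occurring in a collection of documents together with, for each word, the list of documents in which it appears. It is called "inverted" because it maps words to documents rather than documents to words. It is commonly used in full-text search engines such as Elasticsearch and Solr to make text search fast.
How it works:

Tokenization: split the document content into words or phrases (tokens).
Building the mapping: for each word, create a list that records the IDs of the documents containing that word.
Storage structure: the words are usually stored in a trie or a hash table, and the associated document IDs are stored in a list or a tree structure.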
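The tokenization step above can be sketched as follows. This is only a minimal sketch, assuming that tokens are runs of letters and digits and that matching should be case-insensitive; `Tokenize` is a hypothetical helper name that does not appear in the original example. It is not suitable for languages written without spaces between words, which need a dedicated word segmenter.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Tokenize is a hypothetical helper (an assumption of this sketch).
// It lowercases the text and splits it on every rune that is neither
// a letter nor a digit, so punctuation never becomes part of a token.
func Tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(Tokenize("Hello, World! Hello Go."))
	// Prints: [hello world hello go]
}
```

Normalizing tokens this way means that "Hello," and "hello" end up under the same index entry, at the cost of losing case and punctuation information.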

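2. Use Cases

Search engines: quickly retrieve the documents that contain specific keywords.
Information retrieval systems: find information efficiently in large volumes of text.
Log analysis: quickly locate the log entries that contain specific information.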

3. Data Structure Model

In a simplified data structure model of an inverted index, InvertedIndex is a map from each word to a PostingList. A PostingList contains a list of Documents, and each Document has an ID and content.
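The model just described could be written down as Go types roughly like this. It is a sketch under assumptions: the type names follow the description, but the pointer-based layout and the `Add` method are hypothetical choices made for illustration, and `Add` does not deduplicate documents.

```go
package main

import "fmt"

// Document has an ID and its content.
type Document struct {
	ID      string
	Content string
}

// PostingList holds the documents that contain one particular word.
type PostingList struct {
	Documents []*Document
}

// InvertedIndex maps each word to its posting list.
type InvertedIndex map[string]*PostingList

// Add is a hypothetical helper: it appends doc to the posting list of
// word, creating the list on first use. It does not deduplicate.
func (idx InvertedIndex) Add(word string, doc *Document) {
	pl, ok := idx[word]
	if !ok {
		pl = &PostingList{}
		idx[word] = pl
	}
	pl.Documents = append(pl.Documents, doc)
}

func main() {
	idx := InvertedIndex{}
	d1 := &Document{ID: "1", Content: "hello world"}
	d2 := &Document{ID: "2", Content: "hello go"}
	idx.Add("hello", d1)
	idx.Add("world", d1)
	idx.Add("hello", d2)
	idx.Add("go", d2)

	// List every document recorded under "hello".
	for _, d := range idx["hello"].Documents {
		fmt.Println(d.ID, d.Content)
	}
}
```

Storing pointers to Document keeps a single copy of each document's content even when many words refer to it. Production indexes usually keep only document IDs in the posting list and store content elsewhere.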

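4. Recommended Technology Components

Elasticsearch: a search engine built on Lucene that provides powerful full-text search.
Solr: a search platform built on Lucene that supports complex search requirements.
Lucene: Apache's open-source search library, which implements the inverted index itself.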

5. Code Example

Below is a simple Go example that shows how to build a basic inverted index:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// Document represents a document with an ID and content.
type Document struct {
	ID      string
	Content string
}

// InvertedIndex maps each word to the set of IDs of the documents containing it.
type InvertedIndex map[string]map[string]bool

// BuildInvertedIndex builds an inverted index from a list of documents.
// Words are split on whitespace and matched case-sensitively.
func BuildInvertedIndex(docs []Document) InvertedIndex {
	index := InvertedIndex{}
	for _, doc := range docs {
		for _, word := range strings.Fields(doc.Content) {
			if _, ok := index[word]; !ok {
				index[word] = make(map[string]bool)
			}
			index[word][doc.ID] = true
		}
	}
	return index
}

// Search returns the IDs of the documents that contain at least one word of
// the query (OR semantics), sorted so that the output is deterministic.
func (index InvertedIndex) Search(query string) []string {
	docIDs := make(map[string]bool)
	for _, word := range strings.Fields(query) {
		// Ranging over the nil map of an unknown word is a no-op.
		for docID := range index[word] {
			docIDs[docID] = true
		}
	}
	result := make([]string, 0, len(docIDs))
	for docID := range docIDs {
		result = append(result, docID)
	}
	sort.Strings(result)
	return result
}

func main() {
	docs := []Document{
		{ID: "1", Content: "hello world"},
		{ID: "2", Content: "hello go"},
		{ID: "3", Content: "go language"},
	}
	index := BuildInvertedIndex(docs)
	query := "hello go"
	result := index.Search(query)
	// Prints: Documents containing any word of 'hello go': [1 2 3]
	fmt.Printf("Documents containing any word of '%s': %v\n", query, result)
}
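
The search in the example returns documents that contain any of the query words. A common variant returns only documents that contain all of them, by intersecting posting lists. The following is a minimal sketch of that idea, not part of the original example: it assumes integer document IDs (the slice position) so that each posting list is naturally sorted, and `Postings`, `Build`, `intersect` and `SearchAll` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Postings maps each word to a sorted slice of document IDs.
// Integer IDs are an assumption of this sketch.
type Postings map[string][]int

// Build indexes docs, using each document's slice position as its ID.
func Build(docs []string) Postings {
	p := Postings{}
	for id, content := range docs {
		seen := map[string]bool{} // avoid duplicate IDs within one list
		for _, w := range strings.Fields(content) {
			if !seen[w] {
				seen[w] = true
				p[w] = append(p[w], id) // IDs arrive in ascending order
			}
		}
	}
	return p
}

// intersect returns the IDs present in both sorted slices, walking
// them in step with two cursors.
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

// SearchAll returns the IDs of documents containing every query word.
func (p Postings) SearchAll(query string) []int {
	words := strings.Fields(query)
	if len(words) == 0 {
		return nil
	}
	result := p[words[0]]
	for _, w := range words[1:] {
		result = intersect(result, p[w])
	}
	return result
}

func main() {
	p := Build([]string{"hello world", "hello go", "go language"})
	fmt.Println(p.SearchAll("hello go")) // Prints: [1]
}
```

Because both lists are sorted, the intersection takes time proportional to the sum of their lengths, which is one reason posting lists are often kept sorted by document ID.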