Understanding inverted index||full-text search||Search Engines

Latest recommended article published on 2024-06-21 17:05:28

PerpetualLearner

Article link: https://blog.csdn.net/The_Time_Runner/article/details/111409415

Overview

According to the Elasticsearch documentation, Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
Notions
Full-text database vs Bibliographic database

A full-text database or a complete-text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents.

A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, books, etc.
Full-text search

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database.

Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).
Inverted Index

In computer science, an inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content).

The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.

The inverted file may be the database file itself, rather than its index.

It is the most popular data structure used in document retrieval systems.

The inverted index data structure is a central component of a typical search engine indexing algorithm.
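
To make the definition concrete, here is a minimal Python sketch of a word-level inverted index that maps each term to the documents and positions where it occurs. The class name `InvertedIndex` and its methods are hypothetical names chosen for illustration, and whitespace tokenization is a simplifying assumption; this is not how any particular search engine implements it.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy word-level inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # The indexing cost is paid here: every token of the new
        # document is visited once when the document is added.
        for position, token in enumerate(text.lower().split()):
            self.postings[token].setdefault(doc_id, []).append(position)

    def lookup(self, term):
        # A query reads one entry of the mapping; no document is scanned.
        return self.postings.get(term.lower(), {})


index = InvertedIndex()
index.add_document(1, "the cow says moo")
index.add_document(2, "the cat and the hat")

print(index.lookup("the"))  # {1: [0], 2: [0, 3]}
print(index.lookup("cat"))  # {2: [1]}
```

The sketch shows the trade-off described above: `add_document` does extra work per document so that `lookup` stays cheap.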
Forward Index

The forward index stores a list of words for each document.

Document    Words
Document1   the, cow, says, moo
Document2   the, cat, and, the, hat
Document3   the, dish, ran, away, with, the, spoon
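
As a rough illustration of how the two structures relate, the Python sketch below inverts the forward index from the table above. The function name `invert` is a hypothetical helper, and storing document names in sorted lists is an assumption made for readability.

```python
forward_index = {
    "Document1": ["the", "cow", "says", "moo"],
    "Document2": ["the", "cat", "and", "the", "hat"],
    "Document3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn {document: [words]} into {word: sorted list of documents}."""
    inverted = {}
    for doc, words in forward.items():
        for word in words:
            # A set removes duplicates such as the repeated "the".
            inverted.setdefault(word, set()).add(doc)
    return {word: sorted(docs) for word, docs in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["the"])  # ['Document1', 'Document2', 'Document3']
print(inverted_index["cat"])  # ['Document2']
```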
Search engine indexing algorithm

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
The Awesome Power of the Inverted Index

The inverted index is a wonder that helps find and make sense of information buried in mounds of data, text and binaries.

An Inverted Index is a simple but powerful way to search documents, images, media, and even data. Unlike just a keyword search, an inverted index allows you to search the inherent structure of any document.

There’s no need to use a table name or special query language to get the information you want. You just type it into a search box and the search engine figures out the rest.

Inverted indexes were invented decades ago, in the same era in which many of the first AI and machine learning algorithms were born. But the vast increase in computing power in recent years has made it possible to make use of the inverted index structure and generate fast search results from huge stores of indexed data and information.

One of the reasons they’ve become so popular is the Apache Solr open source project, which created a basic infrastructure for building inverted indexes and searching over them.

Inverted indexes should become an integral tool for IT innovators because they help companies make sense of the exploding landscape of data, especially data spread across many different forms and locations.
Traditional Database (Forward Indexes) vs Search Engines (Inverted Index)

Introduction to Inverted Indexes

"Understanding Search Engine vs Traditional Database"

In a traditional SQL database, the data looks something like this:

Doc ID   Doc Content
1        Welcome to the Hotel California Such a lovely place
2        she’s buying a stairway to Heaven
3        Hey Jude, don’t make it bad
4        Welcome to the heaven

Performance in traditional SQL databases is gained by querying by primary key or by building efficient “indexes” for traversing these tables.

You can use inverted indexes in SQL databases such as PostgreSQL, but they are generally not as efficient as those in search engines such as Elasticsearch or Lucene.

The indexes used in SQL databases, such as the B-Tree index (the default) and hash indexes, are a kind of forward index: the mapping generally goes from the document (its doc ID) to the whole data row.

In inverted indexes the mapping is from “terms” to the documents (as shown in the table below):

Term         Doc ID
buying       Doc2
california   Doc1
heaven       Doc2, Doc4
hotel        Doc1
jude         Doc3
lovely       Doc1
stairway     Doc2
welcome      Doc1, Doc4
and so on…   …

If you search for “welcome lovely”, no document in the database matches the phrase exactly, but the inverted index shows that the user is probably looking for Doc1 or Doc4. Doc1 gets the highest rank score because it appears in the document lists of both terms, welcome and lovely.
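
The "welcome lovely" example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `tokenize` and `search` are hypothetical helpers, the analyzer is deliberately naive, and ranking by the number of matching query terms is a simplification of the scoring that real search engines use.

```python
from collections import Counter, defaultdict

docs = {
    1: "Welcome to the Hotel California Such a lovely place",
    2: "she's buying a stairway to Heaven",
    3: "Hey Jude, don't make it bad",
    4: "Welcome to the heaven",
}


def tokenize(text):
    # Lowercase and strip surrounding punctuation (a naive analyzer).
    return [w.strip(".,!?").lower() for w in text.split()]


# Build the inverted index: term -> set of doc ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)


def search(query):
    """Return doc ids ordered by how many query terms they contain."""
    hits = Counter()
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Sort by descending match count, then by doc id for a stable order.
    return sorted(hits, key=lambda d: (-hits[d], d))


print(search("welcome lovely"))  # [1, 4]
```

Doc 1 comes first because it is in the posting lists of both query terms, while doc 4 matches only "welcome".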
Components of Inverted Indexes

The two main components of an inverted index are the dictionary and the posting lists.
Dictionary

The dictionary works as a lookup data structure on top of the posting lists.

There are two broad classes of solutions for implementing it: hashing and search trees.

Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary.
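
The two approaches can be compared with a small Python sketch. Python has no built-in search tree, so a sorted list with binary search (the standard `bisect` module) stands in for one here; that substitution, the sample vocabulary, and the helper names are all assumptions made for illustration.

```python
import bisect

vocabulary = ["buying", "california", "heaven", "hotel",
              "jude", "lovely", "stairway", "welcome"]

# Hashing: average constant-time exact lookup, but no ordering between terms.
hash_dictionary = {term: i for i, term in enumerate(vocabulary)}


def in_hash(term):
    return term in hash_dictionary


# Ordered lookup: a sorted list with binary search stands in for a search tree.
sorted_terms = sorted(vocabulary)


def in_sorted(term):
    i = bisect.bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term


def with_prefix(prefix):
    # Terms sharing a prefix are contiguous in sorted order,
    # which a hash table cannot exploit.
    start = bisect.bisect_left(sorted_terms, prefix)
    end = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[start:end]


print(in_hash("hotel"), in_sorted("hotel"))  # True True
print(in_sorted("motel"))                    # False
print(with_prefix("h"))                      # ['heaven', 'hotel']
```

The design choice comes down to query types: hashing is enough for exact term lookups, whereas an ordered structure also supports prefix-style queries.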
Posting Lists

The actual index data is stored in the posting lists.

They are accessed through the search engine’s dictionary. Each term has its own posting list assigned to it.

Posting lists are usually too large to keep entirely in memory, so it is better to store them on disk to reduce cost. Only during query processing are the query terms’ posting lists loaded into memory, as required by the query processing routines.
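
One way to picture this split between an in-memory dictionary and on-disk posting lists is the Python sketch below. The file layout (JSON-encoded lists concatenated in one file), the file name `postings.bin`, and the helper names are assumptions for illustration only; real engines use compact, compressed formats.

```python
import json
import os
import tempfile

postings = {
    "heaven": [2, 4],
    "lovely": [1],
    "welcome": [1, 4],
}


def write_postings(path, postings):
    """Write posting lists to disk; return an in-memory
    dictionary mapping term -> (offset, length)."""
    dictionary = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            data = json.dumps(postings[term]).encode("utf-8")
            dictionary[term] = (f.tell(), len(data))
            f.write(data)
    return dictionary


def read_posting_list(path, dictionary, term):
    """Load only the requested term's posting list from disk."""
    if term not in dictionary:
        return []
    offset, length = dictionary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length).decode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    dictionary = write_postings(path, postings)
    welcome_docs = read_posting_list(path, dictionary, "welcome")
    hotel_docs = read_posting_list(path, dictionary, "hotel")

print(welcome_docs)  # [1, 4]
print(hotel_docs)    # []
```

Only the small dictionary stays in memory; each query reads just the bytes belonging to its own terms.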
Stop words

Some extremely common words that appear to be of little value in helping select documents matching a query are excluded from the vocabulary entirely. Examples include a, an, and, are, and as.
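
As a minimal sketch, stop-word removal can be applied at tokenization time, before terms reach the index. The set below contains only the words named above, and `index_terms` is a hypothetical helper; real stop-word lists are longer and language-specific.

```python
STOP_WORDS = {"a", "an", "and", "are", "as"}  # only the words named in the text


def index_terms(text, stop_words=STOP_WORDS):
    """Tokenize text and drop stop words before indexing."""
    return [t for t in text.lower().split() if t not in stop_words]


print(index_terms("The cat and the hat"))  # ['the', 'cat', 'the', 'hat']
print(index_terms("Such a lovely place"))  # ['such', 'lovely', 'place']
```

The same filter has to be applied to queries, otherwise a query term could never match a term that was dropped at indexing time.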
References