Understanding inverted index||full-text search||Search Engines

  • Overview

    According to the Elasticsearch documentation, Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.

  • Notions

  • Full-text database vs Bibliographic database

    A full-text database or a complete-text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents.

    A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, books, etc.

  • Full-text search

    In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database.

    Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).

  • Inverted Index

    In computer science, an inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content).

    The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.

    The inverted file may be the database file itself, rather than its index.

    It is the most popular data structure used in document retrieval systems.

    The inverted index data structure is a central component of a typical search engine indexing algorithm.
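
    The mapping from content to locations described above can be sketched as follows. This is a minimal illustration, not how any particular search engine implements it: the function name build_inverted_index, the two sample documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "it is what it is",
    2: "what is it",
}

index = build_inverted_index(docs)
print(index["it"])    # {1: [0, 3], 2: [2]}
print(index["what"])  # {1: [2], 2: [0]}
```

    Every new document has to be tokenized and merged into the mapping, which is the extra processing at indexing time that buys fast lookups at query time.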

  • Forward Index

    The forward index stores a list of words for each document.

    Document     Words
    Document1    the, cow, says, moo
    Document2    the, cat, and, the, hat
    Document3    the, dish, ran, away, with, the, spoon

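
    To make the relationship between the two structures concrete, here is a small sketch that inverts the forward index shown above. The helper name invert is an assumption for illustration only.

```python
# Forward index: document -> list of words (the three documents from the table).
forward_index = {
    "Document1": ["the", "cow", "says", "moo"],
    "Document2": ["the", "cat", "and", "the", "hat"],
    "Document3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = {}
    for doc_id, words in forward.items():
        for word in words:
            # A set removes duplicates such as the repeated "the".
            inverted.setdefault(word, set()).add(doc_id)
    # Sort the document IDs so the output is deterministic.
    return {word: sorted(ids) for word, ids in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["the"])  # ['Document1', 'Document2', 'Document3']
print(inverted_index["cat"])  # ['Document2']
```
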
  • Search engine indexing algorithm

    Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.

  • The Awesome Power of the Inverted Index

    The inverted index is a wonder that helps find and make sense of information buried in mounds of data, text and binaries.

    An inverted index is a simple but powerful way to search documents, images, media, and even data. Unlike a plain keyword search, an inverted index allows you to search the inherent structure of any document.

    There’s no need to use a table name or special query language to get the information you want. You just type it into a search box and the search engine figures out the rest.

    Inverted indexes were invented decades ago, in the same era in which many of the first AI and machine learning algorithms were born. But the vast increase in computing power in recent years has made it possible to use the inverted index structure to generate fast search results from huge stores of indexed data and information.

    One of the reasons they’ve become so popular is the Apache Solr open source project (built on Apache Lucene), which created a basic infrastructure for building inverted indexes and running searches over them.

    Inverted indexes should become an integral tool for IT innovators because they help companies make sense of the exploding landscape of data, especially data spread across many different forms and locations.

  • Traditional Database (Forward Indexes) vs Search Engines (Inverted Index)

    Introduction to Inverted Indexes

    Understanding Search Engine vs Traditional Database

    In a traditional SQL database, the data looks something like this:

    Doc ID    Doc Content
    1         Welcome to the Hotel California Such a lovely place
    2         she’s buying a stairway to Heaven
    3         Hey Jude, don’t make it bad
    4         Welcome to the heaven

    Performance in traditional SQL databases is gained by querying by primary key or by building efficient “indexes” for traversing the database tables.

    You can use inverted indexes in SQL databases such as PostgreSQL (for example, through GIN indexes), but they are generally not as efficient for full-text search as they are in search engines such as Elasticsearch or Lucene.

    The indexes commonly used in SQL databases, such as the B-tree index (the default) and hash indexes, are a kind of forward index: the mapping generally goes from the document (that is, its doc ID) to the whole data row.

    In inverted indexes, the mapping goes from “terms” to the documents that contain them (as shown in the table below):

    Term          Doc Id
    buying        Doc2
    california    Doc1
    Heaven        Doc2, Doc4
    hotel         Doc1
    Jude          Doc3
    lovely        Doc1
    stairway      Doc2
    welcome       Doc1, Doc4
    and so on…

    If you search for “welcome lovely”, there is no exact match for that phrase in the database, but the inverted index shows that the user is probably looking for Doc1 or Doc4. Doc1 gets the highest rank score because it appears in the document lists of both terms, welcome and lovely, while Doc4 appears only in the list for welcome.
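
    The ranking described above can be sketched with a simple count of matching query terms per document. This is only an illustration: real search engines use more elaborate scoring, the function name search is hypothetical, and the terms are assumed to be normalized to lowercase.

```python
from collections import Counter

# Inverted index from the table above, with terms lowercased (an assumption).
inverted_index = {
    "buying": ["Doc2"],
    "california": ["Doc1"],
    "heaven": ["Doc2", "Doc4"],
    "hotel": ["Doc1"],
    "jude": ["Doc3"],
    "lovely": ["Doc1"],
    "stairway": ["Doc2"],
    "welcome": ["Doc1", "Doc4"],
}


def search(query, index):
    """Score each document by the number of query terms it contains."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search("welcome lovely", inverted_index)
print(results)  # [('Doc1', 2), ('Doc4', 1)]
```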

  • Components of Inverted Indexes

    The two main components of an inverted index are the dictionary and the posting lists.

  • Dictionary

    The dictionary works as a lookup data structure on top of the posting lists.

    There are two broad classes of solutions for implementing it: hashing and search trees.

    Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary.
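
    The two families of dictionary lookups can be contrasted in a short sketch. A Python dict stands in for the hashing approach, and a sorted list searched with bisect stands in for a search tree; both stand-ins, and all function names below, are assumptions made for illustration.

```python
from bisect import bisect_left

# Vocabulary in sorted order; a term's position doubles as its ID.
terms = ["buying", "california", "heaven", "hotel",
         "jude", "lovely", "stairway", "welcome"]

# Hashing: constant-time exact lookup on average, but no ordering.
hash_dictionary = {term: i for i, term in enumerate(terms)}


def hash_lookup(term):
    """Return the term's ID, or None if it is not in the vocabulary."""
    return hash_dictionary.get(term)


def sorted_lookup(term):
    """Binary search over the sorted vocabulary (search-tree stand-in)."""
    i = bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else None


def prefix_lookup(prefix):
    """All terms starting with prefix; this relies on the sorted order."""
    matches = []
    for term in terms[bisect_left(terms, prefix):]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


print(hash_lookup("hotel"))    # 3
print(sorted_lookup("hotel"))  # 3
print(sorted_lookup("moon"))   # None
print(prefix_lookup("h"))      # ['heaven', 'hotel']
```

    The ordered structure costs a logarithmic search per lookup but can answer prefix queries, which a plain hash table cannot do without scanning every key.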

  • Posting Lists

    The actual index data is stored in the posting lists.

    They are accessed through the search engine’s dictionary. Each term has its own posting list assigned to it.

    Because posting lists are usually too large to keep in memory, it is better to store them on disk to reduce cost. Only during query processing are the query terms’ posting lists loaded into memory, as required by the query processing routines.
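
    A rough sketch of this split between an in-memory dictionary and on-disk posting lists is shown below. The file layout (4-byte little-endian integers), the function names, and the sample postings are all assumptions for illustration; production engines use compressed and far more elaborate formats.

```python
import struct
import tempfile

# Hypothetical postings: term -> list of integer document IDs.
postings = {"heaven": [2, 4], "welcome": [1, 4], "lovely": [1]}


def write_postings(postings, fileobj):
    """Write posting lists to disk; return the in-memory dictionary
    mapping term -> (byte offset, number of doc IDs)."""
    dictionary = {}
    for term in sorted(postings):
        doc_ids = postings[term]
        dictionary[term] = (fileobj.tell(), len(doc_ids))
        # Each doc ID is stored as a 4-byte little-endian unsigned int.
        fileobj.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    fileobj.flush()
    return dictionary


def read_postings(term, dictionary, fileobj):
    """Load a single term's posting list from disk on demand."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    fileobj.seek(offset)
    return list(struct.unpack(f"<{count}I", fileobj.read(4 * count)))


with tempfile.TemporaryFile() as f:
    dictionary = write_postings(postings, f)
    welcome_docs = read_postings("welcome", dictionary, f)
    missing_docs = read_postings("moon", dictionary, f)

print(welcome_docs)  # [1, 4]
print(missing_docs)  # []
```

    Only the small dictionary stays resident; a query for one term reads just that term's bytes from the file.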

  • Stop words

    Some extremely common words that appear to be of little value in selecting documents that match a user’s need are excluded from the vocabulary entirely. Examples include a, an, and, are, and as.
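
    Stop-word removal can be sketched as a filter applied during tokenization. The stop list below contains only the five words mentioned above, and the function name index_terms and the sample sentence are assumptions for illustration.

```python
# Tiny stop list taken from the examples in the text; real lists are longer.
STOP_WORDS = {"a", "an", "and", "are", "as"}


def index_terms(text, stop_words=STOP_WORDS):
    """Tokenize naively and drop stop words before indexing."""
    return [t for t in text.lower().split() if t not in stop_words]


terms = index_terms("A cat and an owl are as quiet as mice")
print(terms)  # ['cat', 'owl', 'quiet', 'mice']
```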

  • References

  1. Introduction to Information Retrieval: A first take at building an inverted index
  2. GeeksforGeeks
  3. The Awesome Power of the Inverted Index