Index (search engine) In Wiki

An introduction to Index (search engine) from Wikipedia.

Search engine indexing entails how data is collected, parsed, and stored to
facilitate fast and accurate information retrieval. Index design incorporates
interdisciplinary concepts from linguistics, cognitive psychology, mathematics
, informatics, physics and computer science. An alternate name for the
process, in the context of search engines designed to find web pages on the
Internet, is Web indexing.


Popular engines focus on the full-text indexing of online, natural language
documents[1], yet there are other searchable media types such as video and
audio[2], and graphics[3][4].


Meta search engines reuse the indices of other services and do not store a
local index, whereas cache-based search engines permanently store the index
along with the corpus. Unlike full text indices, partial text services
restrict the depth indexed to reduce index size. Larger services typically
perform indexing at a predetermined interval due to the required time and
processing costs, while agent-based search engines index in real time.

Contents
1 Indexing
1.1 Index Design Factors
1.2 Index Data Structures
1.3 Challenges in Parallelism
1.4 Inverted indices
1.5 Index Merging
1.6 The Forward Index
1.7 Compression
2 Document Parsing
2.1 Challenges in Natural Language Processing
2.2 Tokenization
2.3 Language Recognition
2.4 Format Analysis
2.5 Section Recognition
2.6 Meta Tag Indexing
3 See also
4 Further reading
5 References

Indexing
The goal of storing an index is to optimize the speed and performance of
finding relevant documents for a search query. Without an index, the search
engine would scan every document in the corpus, which would take a
considerable amount of time and computing power. For example, an index of 10,
000 documents can be queried within milliseconds, while a sequential scan of
every word in 10,000 large documents could take hours. The trade-offs for the
time saved during information retrieval are the additional computer storage
required to store the index and a considerable increase in the time required
for an update to take place.


Index Design Factors
Major factors in designing a search engine's architecture include:

Merge factors

How data enters the index, or how words or subject features are added to the
index during text corpus traversal and whether multiple indexers can work
asynchronously. The indexer must first check whether it is updating old
content or adding new content. Traversal typically correlates to the data
collection policy. Search engine index merging is similar in concept to the
SQL Merge command and other merge algorithms.[5]

Storage techniques
How to store the index data - whether information should be data compressed
or filtered

Index size
How much computer storage is required to support the index.

Lookup speed
How quickly a word can be found in the inverted index. How quickly an entry
in a data structure can be found, versus how quickly it can be updated or
removed, is a central focus of computer science.

Maintenance
Maintaining the index over time[6].

Fault tolerance
How important it is for the service to be reliable, how to deal with index
corruption, whether bad data can be treated in isolation, dealing with bad
hardware, partitioning schemes such as hash-based or composite partitioning[7]
, replication.

Index Data Structures

Search engine architectures vary in how indexing is performed and in index
storage to meet the various design factors. Types of indices include:

Suffix tree

Figuratively structured like a tree, supports linear time lookup. Built by
storing the suffixes of words. Used for searching for patterns in DNA
sequences and clustering. A major drawback is that the storage of a word in
the tree may require more storage than storing the word itself.[8] An
alternate representation is a suffix array, which is considered to require
less virtual memory and supports data compression such as the BWT
(Burrows-Wheeler transform).


Trees
An ordered tree data structure used to store an associative array where the
keys are strings. Regarded as faster than a hash table but less space
efficient. The suffix tree is a type of trie. Tries support extendable
hashing, which is important for search engine indexing.[9]

Inverted indices
Stores a list of occurrences of each atomic search criterion[10], typically
in the form of a hash table or binary tree[11][12].

Citation indices
Stores the existence of citations or hyperlinks between documents to support
citation analysis, a subject of Bibliometrics.

N-gram indices
Stores sequences of data of a given length (n-grams) to support other types
of retrieval or text mining.[13]
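Character n-gram extraction can be sketched in a few lines (a hypothetical helper for illustration, not any engine's actual code):

```python
def ngrams(text, n):
    """All contiguous character substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Example: trigrams of "search"
trigrams = ngrams("search", 3)
```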

Term document matrices
Used in latent semantic analysis, stores the occurrences of words in
documents in a two dimensional sparse matrix.
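A sparse term-document count matrix can be sketched as a dictionary keyed by (term, document) pairs, so absent entries cost no storage (an illustrative choice for this sketch, not how latent semantic analysis toolkits actually store it):

```python
def term_document_matrix(documents):
    """Sparse term-document matrix: (term, doc_id) -> count.
    Entries that are absent are implicitly zero."""
    counts = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            key = (word, doc_id)
            counts[key] = counts.get(key, 0) + 1
    return counts


m = term_document_matrix({1: "the cow", 2: "the cat and the hat"})
```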


Challenges in Parallelism

A major challenge in the design of search engines is the management of
parallel computing processes. There are many opportunities for race
conditions and coherence faults. For example, a new document is added to the
corpus and the index must be updated, but the index simultaneously needs to
continue responding to search queries. This is a collision between two
competing tasks. Consider that authors are producers of information, and a
web crawler is the consumer of this information, grabbing the text and
storing it in a cache (or corpus). The forward index is the consumer of the
information produced by the corpus, and the inverted index is the consumer of
information produced by the forward index. This is commonly referred to as a
producer-consumer model. The indexer is the producer of searchable
information and users are the consumers that need to search. The challenge is
magnified when working with distributed storage and distributed processing.
In an effort to scale with larger amounts of indexed information, the search
engine's architecture may involve distributed computing, where the search
engine consists of several machines operating in unison. This increases the
possibilities for incoherency and makes it more difficult to maintain a fully
synchronized, distributed, parallel architecture.[14]
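The producer-consumer model above can be sketched with a thread-safe queue; this is a toy illustration of the pattern, not a real crawler or indexer:

```python
import queue
import threading

# A crawler thread produces documents; an indexer thread consumes them,
# so indexing can proceed while other parts of the system keep running.
docs = queue.Queue()
index = {}


def crawler(pages):
    for doc_id, text in pages:
        docs.put((doc_id, text))
    docs.put(None)  # sentinel: no more documents


def indexer():
    while (item := docs.get()) is not None:
        doc_id, text = item
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)


producer = threading.Thread(target=crawler,
                            args=([(1, "hello world"), (2, "hello moon")],))
consumer = threading.Thread(target=indexer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The sentinel value and the blocking queue are what keep the two tasks from colliding; a distributed system needs far heavier coordination for the same guarantee.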


Inverted indices
Many search engines incorporate an inverted index when evaluating a search
query to quickly locate documents containing the words in a query and rank
these documents by relevance. The inverted index stores a list of the
documents containing each word. The search engine can retrieve the matching
documents quickly using direct access to find the documents associated with
each word in the query. The following is a simplified illustration of an
inverted index:


Inverted Index
Word  Documents
the Document 1, Document 3, Document 4, Document 5
cow Document 2, Document 3, Document 4
says Document 5
moo Document 7
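An inverted index like the illustration above can be built from a tiny hypothetical corpus as follows (a sketch, not production index code):

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
index = build_inverted_index(docs)
```

A query for a word is then a direct dictionary lookup rather than a scan of every document.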


Using this index it can only be determined whether or not a word exists within a
particular document, it stores no information regarding the frequency and
position of the word and is therefore considered to be a boolean index. Such
an index could only serve to determine which documents match a query, but
could not contribute to ranking matched documents. In some designs the index
includes additional information such as the frequency of each word in each
document or the positions of a word in each document.[15] With position, the
search algorithm can identify word proximity to support searching for phrases
. Frequency can be used to help in ranking the relevance of documents to the
query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix given that words are not present in
each document. It is stored differently than a two dimensional array to
reduce computer storage memory requirements. The index is similar to the term
document matrices employed by latent semantic analysis. The inverted index
can be considered a form of a hash table. In some cases the index is a form
of a binary tree, which requires additional storage but may reduce the lookup
time. In larger indices the architecture is typically a distributed hash
table.[16]


Inverted indices can be programmed in several computer programming languages.[
17][18]


Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to
a merge but first deletes the contents of the inverted index. The
architecture may be designed to support incremental indexing[19], where a
merge involves identifying the document or documents to add into or update in
the index and parsing each document into words. For technical accuracy, a
merge involves the unison of newly indexed documents, typically residing in
virtual memory, with the index cache residing on one or more computer hard
drives.
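A merge in this sense can be sketched as folding a small in-memory (delta) index into the main index (an illustration of the concept only, not a disk-based implementation):

```python
def merge_indices(main_index, delta_index):
    """Fold newly indexed postings into the main index, updating
    existing postings lists and adding new ones."""
    for word, doc_ids in delta_index.items():
        main_index.setdefault(word, set()).update(doc_ids)
    return main_index


main = {"cow": {1}, "moo": {1}}
delta = {"cow": {4}, "hat": {2}}  # newly indexed documents
merge_indices(main, delta)
```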



After parsing, the indexer adds the containing document to the document list
for the appropriate words. The process of finding each word in the inverted
index in order to denote that it occurred within a document may be too time
consuming when designing a larger search engine, and so this process is
commonly split up into the development of a forward index and the process of
sorting the contents of the forward index for entry into the inverted index.
The inverted index is named inverted because it is an inversion of the
forward index.


The Forward Index
The forward index stores a list of words for each document. The following is
a simplified form of the forward index:


Forward Index
Document  Words
Document 1 the,cow,says,moo
Document 2 the,cat,and,the,hat
Document 3 the,dish,ran,away,with,the,spoon


The rationale behind developing a forward index is that as documents are
parsed, it is better to immediately store the words per document. The
delineation enables asynchronous system processing, which partially
circumvents the inverted index update bottleneck.[20] The forward index is
sorted to transform it to an inverted index. The forward index is essentially
a list of pairs consisting of a document and a word, collated by the document
. Converting the forward index to an inverted index is only a matter of
sorting the pairs by the words. In this regard, the inverted index is a word-
sorted forward index.
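The conversion described above, sorting (word, document) pairs collected from the forward index, can be sketched as:

```python
from itertools import groupby


def invert(forward_index):
    """Turn {doc: [words]} into {word: [docs]} by sorting (word, doc) pairs."""
    pairs = sorted(
        (word, doc_id)
        for doc_id, words in forward_index.items()
        for word in set(words)  # deduplicate within a document
    )
    return {
        word: [doc_id for _, doc_id in group]
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    }


forward = {
    1: ["the", "cow", "says", "moo"],
    2: ["the", "cat", "and", "the", "hat"],
}
inverted = invert(forward)
```

The sort is the whole trick: once the pairs are ordered by word, each word's postings list is a contiguous run.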


Compression
Generating or maintaining a large-scale search engine index represents a
significant storage and processing challenge. Many search engines utilize a
form of compression to reduce the size of the indices on disk.[21] Consider
the following scenario for a full-text Internet search engine.

An estimated 2,000,000,000 different web pages exist as of the year 2000[22]
A fictitious estimate of 250 words per webpage on average, based on the
assumption of being similar to the pages of a novel.[23]

It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2
bytes per character[24][25]

The average number of characters in any given word on a page can be estimated
at 5 (Wikipedia:Size comparisons)

The average personal computer comes with about 20 gigabytes of usable space[26
]

Given these estimates, generating an uncompressed index (assuming a
non-conflated, simple index) for 2 billion web pages would need to store 500
billion word entries. At 1 byte per character, or 5 bytes per word, this
would require 2500 gigabytes of storage space alone, more than the average
size a personal computer's free disk space. This space is further increased
in the case of a distributed storage architecture that is fault-tolerant.
Using compression, the index size can be reduced to a portion of its size,
depending on which compression techniques are chosen. The trade off is the
time and processing power required to perform compression and decompression.
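Checking the arithmetic from the estimates above (2 billion pages, 250 words per page, 5 bytes per word):

```python
pages = 2_000_000_000      # estimated web pages, year 2000
words_per_page = 250       # fictitious average
bytes_per_word = 5         # 5 characters at 1 byte each

word_entries = pages * words_per_page
total_gigabytes = word_entries * bytes_per_word / 10**9
```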


Notably, large-scale search engine designs incorporate the cost of storage
and the cost of electricity to power the storage. Compression, in this
regard, is a measure of cost as well.


Document Parsing
Document parsing involves breaking apart the components (words) of a document
or other form of media for insertion into the forward and inverted indices.
For example, if the full contents of a document consisted of the sentence "
Hello World", there would typically be two words found, the token "Hello" and
the token "World". In the context of search engine indexing and natural
language processing, parsing is more commonly referred to as tokenization,
and sometimes word boundary disambiguation, tagging, text segmentation,
content analysis, text analysis, text mining, concordance generation, Speech
segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing',
and 'tokenization' are used interchangeably in corporate slang.


Natural language processing, as of 2006, is the subject of continuous
research and technological improvement. There are a host of challenges in
tokenization, in extracting the necessary information from documents for
indexing to support quality searching. Tokenization for indexing involves
multiple technologies, the implementation of which are commonly kept as
corporate secrets.


Challenges in Natural Language Processing

Word Boundary Ambiguity - native English speakers can at first consider
tokenization to be a straightforward task, but this is not the case with
designing a multilingual indexer. In digital form, the text of other
languages such as Chinese, Japanese or Arabic represent a greater challenge
as words are not clearly delineated by whitespace. The goal during
tokenization is to identify words for which users will search. Language
specific logic is employed to properly identify the boundaries of words,
which is often the rationale for designing a parser for each language
supported (or for groups of languages with similar boundary markers and syntax
).
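One classical approach for scripts without whitespace is greedy longest-match (maximum matching) segmentation against a dictionary; a toy sketch with an invented dictionary, using English text with the spaces removed as a stand-in:

```python
def max_match(text, dictionary):
    """Greedy longest-match segmentation for text without word delimiters."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Take the longest dictionary word starting at i,
            # falling back to a single character if nothing matches.
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


segmented = max_match("thecowsaysmoo", {"the", "cow", "says", "moo"})
```

Greedy matching fails on genuinely ambiguous boundaries, which is why production segmenters use statistical models instead.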

Language Ambiguity - to assist with properly ranking matching documents, many
search engines collect additional information about each word, such as its
language or lexical category (part of speech). These techniques are
language-dependent as the syntax varies among languages. Documents do not always
clearly identify the language of the document or represent it accurately. In
tokenizing the document, some search engines attempt to automatically
identify the language of the document.

Diverse File Formats - in order to correctly identify what bytes of a
document represent characters, the file format must be correctly handled.
Search engines which support multiple file formats must be able to correctly
open and access the document and be able to tokenize the characters of the
document.


Faulty Storage - the quality of the natural language data is not always
assumed to be perfect. An unspecified number of documents, particularly on the
Internet, do not always closely obey proper file protocol. Binary characters
may be mistakenly encoded into various parts of a document. Without
recognition of these characters and appropriate handling, the index quality
or indexer performance could degrade.


Tokenization
Unlike literate human adults, computers are not inherently aware of the
structure of a natural language document and do not instantly recognize words
and sentences. To a computer, a document is only a big sequence of bytes.
Computers do not know that a space character between two sequences of
characters means that there are two separate words in the document. Instead,
a computer program is developed by humans which trains the computer, or
instructs the computer, how to identify what constitutes an individual or
distinct word, referred to as a token. This program is commonly referred to
as a tokenizer or parser or lexer. Many search engines, as well as other
natural language processing software, incorporate specialized programs for
parsing, such as YACC or Lex.



During tokenization, the parser identifies sequences of characters, which
typically represent words. Commonly recognized tokens include punctuation,
sequences of numerical characters, alphabetical characters, alphanumerical
characters, binary characters (backspace, null, print, and other antiquated
print commands), whitespace (space, tab, carriage return, line feed), and
entities such as email addresses, phone numbers, and URLs. When identifying
each token, several characteristics may be stored such as the token's case (
upper, lower, mixed, proper), language or encoding, lexical category (part of
speech, like 'noun' or 'verb'), position, sentence number, sentence position,
length, and line number.
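A toy tokenizer for a few of the token classes listed above can be sketched with a regular expression (the pattern and the chosen classes are simplifications for illustration, not a complete lexer):

```python
import re

# Ordered alternation: more specific token classes (email) come first,
# so "bob@example.com" is not split into separate words.
TOKEN_PATTERN = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+\.[\w.]+)"
    r"|(?P<number>\d+)"
    r"|(?P<word>[A-Za-z]+)"
)


def tokenize(text):
    """Return (token_class, token_text) pairs, skipping whitespace
    and punctuation between tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("Hello World 2006, mail bob@example.com")
```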


Language Recognition
If the search engine supports multiple languages, a common initial step
during tokenization is to identify each document's language, given that many
of the later steps are language dependent (such as stemming and part of
speech tagging). Language recognition is the process by which a computer
program attempts to automatically identify, or categorize, the language of a
document. Other names for language recognition include language classification
, language analysis, language identification, and language tagging. Automated
language recognition is the subject of ongoing research in natural language
processing. Finding which language the words belong to may involve the use
of a language recognition chart.
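A deliberately naive sketch of language recognition that scores candidate languages by how many of their common function words appear in the text; real systems typically rely on character n-gram statistics instead:

```python
# Tiny hand-picked stopword sets, invented for this illustration.
STOPWORDS = {
    "english": {"the", "and", "of", "is"},
    "french": {"le", "la", "et", "est"},
    "german": {"der", "und", "ist", "das"},
}


def guess_language(text):
    """Pick the language whose stopwords overlap the text the most."""
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))
```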


Format Analysis
Depending on whether the search engine supports multiple document formats,
documents must be prepared for tokenization. The challenge is that many
document formats contain, in addition to textual content, formatting
information. For example, HTML documents contain HTML tags, which specify
formatting information, like whether to start a new line, or display a word
in bold, or change the font size or family. If the search engine were to
ignore the difference between content and markup, the markup would also be
included in the index, leading to poor search results. Format analysis
involves the identification and handling of formatting content embedded
within documents which control how the document is rendered on a computer
screen or interpreted by a software program. Format analysis is also referred
to as structure analysis, format parsing, tag stripping, format stripping,
text normalization, text cleaning, or text preparation. The challenge of
format analysis is further complicated by the intricacies of various file
formats. Certain file formats are proprietary and very little information is
disclosed, while others are well documented. Common, well-documented file
formats that many search engines support include:


Microsoft Word
Microsoft Excel
Microsoft PowerPoint
IBM Lotus Notes
HTML
ASCII text files (a text document without any formatting)
Adobe's Portable Document Format (PDF)
PostScript (PS)
LaTeX
The UseNet archive (NNTP) and other deprecated bulletin board formats
XML and derivatives like RSS
SGML (this is more of a general protocol)
Multimedia meta data formats like ID3
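For the HTML case, tag stripping (format stripping) can be sketched with Python's standard-library parser; this is an illustration only, since real engines must cope with badly malformed markup:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only text content, discarding markup such as <b> or <div>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


extractor = TextExtractor()
extractor.feed("<p>Hello <b>World</b></p>")
```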


Techniques for dealing with various formats include:

Using a publicly available commercial parsing tool that is offered by the
organization which developed, maintains, or owns the format

Writing a custom parser
Some search engines support inspection of files that are stored in a
compressed, or encrypted, file format. If working with a compressed format,
then the indexer first decompresses the document, which may result in one or
more files, each of which must be indexed separately. Commonly supported
compressed file formats include:


ZIP - Zip File
RAR - Archive File
CAB - Microsoft Windows Cabinet File
Gzip - Gzip file
BZIP - Bzip file
TAR, GZ, and TAR.GZ - Unix Gzip'ped Archives

Format analysis can involve quality improvement methods to avoid including
'bad information' in the index. Content developers can manipulate the
formatting information to include additional content. Examples of abusing
document formatting for spamdexing:

Including hundreds or thousands of words in a section which is hidden from
view on the computer screen, but visible to the indexer, by use of formatting
(e.g. hidden "div" tag in HTML, which may incorporate the use of CSS or
Javascript to do so).

Setting the foreground font color of words to the same as the background color
, making words hidden on the computer screen to a person viewing the document
, but not hidden to the indexer.

Section Recognition
Some search engines incorporate section recognition, the identification of
major parts of a document, prior to tokenization. Not all the documents in a
corpus read like a well-written book, divided into organized chapters and
pages. Many documents on the web contain erroneous content and side-sections
which do not contain primary material, that which the document is about, such
as newsletters and corporate reports. For example, this article may display a
side menu with words inside links to other web pages. Some file formats, like
HTML or PDF, allow for content to be displayed in columns. Even though the
content is displayed, or rendered, in different areas of the view, the raw
markup content may store this information sequentially. Words that appear in
the raw source content sequentially are indexed sequentially, even though
these sentences and paragraphs are rendered in different parts of the
computer screen. If search engines index this content as if it were normal
content, the quality of the index and of the search results is degraded due
to the mixed content and improper word proximity.

Two primary problems are noted:

Content in different sections is treated as related in the index, when in
reality it is not
Organizational 'side bar' content is included in the index, but the side bar
content does not contribute to the meaning of the document, and the index is
filled with a poor representation of its documents, assuming the goal is to
go after the meaning of each document, a sub-goal of providing quality search
results.
Section analysis may require the search engine to implement the rendering
logic of each document, essentially an abstract representation of the actual
document, and then index the representation instead. For example, some
content on the Internet is rendered via Javascript. Viewers of web pages in
web browsers see this content. If the search engine does not render the page
and evaluate the Javascript within the page, it would not 'see' this content
in the same way, and index the document incorrectly. Given that some search
engines do not bother with rendering issues, many web page designers avoid
displaying content via Javascript or use the Noscript tag to ensure that the
web page is indexed properly. At the same time, this fact is also exploited
to cause the search engine indexer to 'see' different content than the viewer.


Meta Tag Indexing
Specific documents offer embedded meta information such as the author,
keywords, description, and language. For HTML pages, the meta tag contains
keywords which are also included in the index. During the early growth of the
Internet and search engine technology (more so, the hardware on which it
ran), search engines would only index the keywords in the meta tags for the
forward index (while still applying techniques such as stemming and stop
words); the full document would not be parsed. At this time, full-text
indexing was not as
well established, nor was the hardware to support such technology. The design
of the HTML markup language initially included support for meta tags for this
very purpose of being properly and easily indexed, without requiring
tokenization.[27]
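Extracting meta-tag keywords for indexing can be sketched as follows (the attribute names follow the standard HTML meta element; this is an illustration, not any engine's actual pipeline):

```python
from html.parser import HTMLParser


class MetaKeywords(HTMLParser):
    """Pull the keyword list out of <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name", "").lower() == "keywords":
            content = attributes.get("content", "")
            self.keywords = [k.strip() for k in content.split(",")]


parser = MetaKeywords()
parser.feed('<meta name="keywords" content="search, index">')
```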


As the Internet grew (the number of users capable of browsing the web and the
number of websites increased and the technology for making websites and
hosting websites improved), many brick-and-mortar corporations went 'online'
in the mid 1990s and established corporate websites. The keywords used to
describe webpages (many of which were corporate-oriented webpages similar to
product brochures) changed from descriptive keywords to marketing-oriented
keywords designed to drive sales by placing the webpage high in the search
results for specific search queries. The fact that these keywords were
subjectively specified led to spamdexing, which drove many search engines to
adopt full-text indexing technologies in the 1990s. Search engine
designers and companies could only place so many 'marketing keywords' into
the content of a webpage before draining it of all interesting and useful
information. Given that conflict of interest with the business goal of
designing user-oriented websites which were 'sticky', the customer lifetime
value equation was changed to incorporate more useful content into the
website in hopes of retaining the visitor. In this sense, full-text indexing
was more objective and increased the quality of search engine results, as it
was one more step away from subjective control of search engine result
placement, which in turn furthered research of full-text indexing technologies.

In the context of Desktop search, many solutions incorporate meta tags to
provide a way for authors to further customize how the search engine will
index content from various files that is not evident from the file content.
Desktop search is under the control of the user and the changes in that
context only serve to help, unlike Internet search engines which must focus
more on the full text index.
