lucene Index Store TermVector 说明

最新的lucene 3.0的field是这样的:

Field options for indexing
Index.ANALYZED – use the analyzer to break the Field’s value into a stream of separate tokens and make each token searchable.
Index.NOT_ANALYZED – do index the field, but do not analyze the String. Instead, treat the Field’s entire value as a single token and make that token searchable. 
Index.ANALYZED_NO_NORMS – an advanced variant of Index.ANALYZED which does not store norms information in the index. 
Index.NOT_ANALYZED_NO_NORMS – just like , but also do not store Norms.
Index.NO – don’t make this field’s value available for searching at all.

Field options for storing fields
Store.YES — store the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader.
Store.NO – do not store the value. This is often used along with Index.ANALYZED to index a large text field that doesn’t need to be retrieved in its original form.

Field options for term vectors
TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start & end character position) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
TermVector.NO – do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.

具一些例子来说明这些怎么用
Index                   Store  TermVector                                Example usage 
NOT_ANALYZED     YES         NO                                        Identifiers (file names, primary keys),
                                                                                         Telephone and Social Security
                                                                                         numbers, URLs, personal names, Dates
ANALYZED              YES     WITH_POSITIONS_OFFSETS    Document title, document abstract
ANALYZED              NO      WITH_POSITIONS_OFFSETS    Document body
NO                         YES        NO                                        Document type, database primary key
NOT_ANALYZED     NO         NO                                         Hidden keywords

When Lucene builds the inverted index, by default it stores all necessary information to implement the Vector Space model. This model requires the count of every term that occurred in the document, as well as the positions of each occurrence (needed for phrase searches).
You can tell Lucene to skip indexing the term frequency and positions by calling:
Field.setOmitTermFreqAndPositions(true)

 

摘自:http://www.cnblogs.com/fxjwind/archive/2011/07/04/2097705.html

转载于:https://www.cnblogs.com/bonelee/p/6604399.html

Lucene是一个全文检索引擎,它的核心数据结构包括倒排索引和正排索引。其中,倒排索引是Lucene最重要的数据结构之一,它通过将文档中的每个词都映射到包含该词的文档列表来实现快速的文本搜索。 Lucene中的Term Dictionary和Term Index是倒排索引中的两个重要组成部分。Term Dictionary用于存储所有唯一的词项(term),而Term Index则用于快速定位某个词项的位置。 在Lucene中,Term Dictionary和Term Index通常存储在磁盘上。Term Dictionary通常使用一种称为Trie树的数据结构来实现。Trie树是一种树形数据结构,它可以快速地查找某个字符串是否存在,以及在字符串集合中查找前缀匹配的字符串。 Term Index则通常存储在一个称为倒排索引表(Inverted Index Table)的结构中。倒排索引表是由一系列的倒排索引条目(Inverted Index Entry)组成的,每个倒排索引条目包含了一个词项及其在倒排索引中的位置信息,例如该词项在文档列表中出现的次数、该词项在哪些文档中出现等。 当进行文本搜索时,Lucene会首先在Term Dictionary中查找搜索关键词是否存在,然后通过Term Index快速定位到包含该词的文档列表,最后根据文档列表中的文档ID查找正排索引中具体的文档内容。这种基于倒排索引的搜索方式可以实现非常高效的文本搜索,是Lucene等全文检索引擎的核心技术之一。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值