Lucene and Search Engine Technology (the index package in detail)

A Lucene index rests on a few fundamental concepts: the index, the document, the field, and the term.

An Index is a sequence of Documents;
a Document is a sequence of Fields;
a Field is a sequence of Terms;
a Term is simply a string.

The same string occurring in two different Fields is considered two different Terms. A Term is therefore really a pair of strings: the first is the field name, the second is the text within that field. Since Term is so important, let us get acquainted with it first.

Getting to know Term

The best way is to look at how its source code represents it:
public final class Term implements Comparable, java.io.Serializable {
  String field;
  String text;

  public Term(String fld, String txt) { this(fld, txt, true); }

  Term(String fld, String txt, boolean intern) {
    field = intern ? fld.intern() : fld;   // field names are interned
    text = txt;
  }

  public final String field() { return field; }
  public final String text() { return text; }

  // overrides equals(): two Terms are equal iff field and text both match
  public final boolean equals(Object o) {
    if (!(o instanceof Term)) return false;
    Term other = (Term) o;
    return field.equals(other.field) && text.equals(other.text);
  }

  // overrides hashCode()
  public final int hashCode() { return field.hashCode() + text.hashCode(); }

  public int compareTo(Object other) { return compareTo((Term) other); }

  // orders terms first by field name, then by text
  public final int compareTo(Term other) {
    if (field.equals(other.field))
      return text.compareTo(other.text);
    return field.compareTo(other.field);
  }

  final void set(String fld, String txt) { field = fld; text = txt; }

  public final String toString() { return field + ":" + text; }

  private void readObject(java.io.ObjectInputStream in)
      throws java.io.IOException, ClassNotFoundException {
    in.defaultReadObject();
    field = field.intern();   // re-intern the field name after deserialization
  }
}
From the code we can see that a Term is essentially the pair <fieldName, text>.
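For instance (a minimal sketch; the field names and text are invented), the same text under two different field names yields two unequal Terms:

import org.apache.lucene.index.Term;

public class TermPairDemo {
    public static void main(String[] args) {
        Term a = new Term("title", "lucene");
        Term b = new Term("contents", "lucene");
        System.out.println(a.equals(b));         // false: same text, different field
        System.out.println(a.compareTo(b) > 0);  // true: "title" sorts after "contents"
        System.out.println(a);                   // prints "title:lucene"
    }
}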
The inverted index

To make term-based search more efficient, the index stores statistics about terms. Lucene's index belongs to the family known as inverted indexes, because for a given term it can list the documents that contain it. This is exactly the inverse of the natural relationship, in which documents list their terms.
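To make the inversion concrete, here is a toy inverted index, a from-scratch sketch rather than Lucene code, mapping each term's text to the numbers of the documents containing it:

import java.util.*;

public class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    // index one document: record, for every token, the containing doc number
    void add(int docNum, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new ArrayList<Integer>();
                postings.put(token, docs);
            }
            docs.add(docNum);   // one entry per occurrence
        }
    }

    // the inverted lookup: which documents contain this term?
    List<Integer> docsContaining(String term) {
        List<Integer> docs = postings.get(term);
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(0, "i love china i love tianjin");
        idx.add(1, "i love nankai");
        System.out.println(idx.docsContaining("love"));   // [0, 0, 1]
    }
}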
Field types

In Lucene, a Field's text may be stored verbatim, non-inverted, in the index. A Field that has been inverted is said to be indexed. A Field may be both stored and indexed. A Field's text may be broken into many Terms before being indexed, or it may be indexed as a single Term. Most Fields are tokenized, but it is sometimes useful to index an identifier field as a single Term.
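In the Lucene 1.4 API these combinations are reached through static factory methods on Field. A small sketch (the field names and values are invented):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldKinds {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(Field.Text("contents", "i love tianjin"));  // stored, indexed, tokenized
        doc.add(Field.Keyword("id", "DOC-0042"));           // stored, indexed as one Term
        doc.add(Field.UnIndexed("rawPath", "e:/lucene"));   // stored only, never inverted
        doc.add(Field.UnStored("body", "large text..."));   // indexed + tokenized, not stored
        System.out.println(doc);
    }
}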
A class-by-class analysis of the index package
CompoundFileReader

Provides methods for reading .cfs files.
CompoundFileWriter

Builds the .cfs file. Starting with Lucene 1.4, the various per-segment files described below (.tii, .tis, and so on) can be merged into a single .cfs file. Its structure is as follows; here and in the grammars below, a trailing ^Count means the preceding element is repeated Count times:

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount
FileCount --> VInt
DataOffset --> Long
FileName --> String
FileData --> raw file data
DocumentWriter

Builds the .frq, .prx and .f files:
1. The frequency file, or .frq file:

FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta, Freq?
SkipData --> <SkipDatum>^(DocFreq/SkipInterval)
SkipDatum --> DocSkip, FreqSkip, ProxSkip
DocDelta, Freq, DocSkip, FreqSkip, ProxSkip --> VInt
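Everything above bottoms out in VInts: variable-length integers that store 7 bits per byte, the high bit flagging a continuation byte, with doc numbers stored as gaps (DocDelta). The sketch below illustrates both ideas; writeVInt mirrors the behavior of Lucene's OutputStream.writeVInt, while the postings array and the plain gap coding are simplified for illustration (the real TermFreq coding additionally folds a freq==1 flag into DocDelta, which is why Freq is optional above):

import java.io.ByteArrayOutputStream;

public class VIntDemo {
    // write i as a VInt: low 7 bits per byte, high bit set while more bytes follow
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] docs = {5, 9, 200};          // postings: ascending doc numbers
        int prev = 0;
        for (int doc : docs) {
            writeVInt(out, doc - prev);    // DocDelta: store the gap, not the doc
            prev = doc;
        }
        // gaps 5 and 4 fit in one byte each; the gap of 191 takes two -> 4 bytes
        System.out.println(out.size() + " bytes");
    }
}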
2. The .prx file contains the lists of positions at which each term occurs within documents:

ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta>^Freq
PositionDelta --> VInt
3. There is a norm file for each indexed field, with one byte per document. The .f[0-9]* file contains, for each document, a byte that encodes a value which is multiplied into the score for hits on that field:

Norms (.f[0-9]*) --> <Byte>^SegSize

Each byte encodes a floating-point value: bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent. The byte is converted to an IEEE single-precision float as follows:

1. If the byte is zero, use a zero float.
2. Otherwise, set the sign bit of the float to zero;
3. add 48 to the exponent and use this as the float's exponent;
4. map the mantissa to the high-order 3 bits of the float's mantissa; and
5. set the low-order 21 bits of the float's mantissa to zero.
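In code, the decoding described by these five steps looks roughly as follows (a sketch; it reproduces what Lucene's norm decoding computes, e.g. byte 124 decodes to 1.0):

public class NormDecode {
    // decode one norm byte into a float, following the five steps above
    static float decodeNorm(byte b) {
        if (b == 0) return 0.0f;            // step 1: zero byte -> zero float
        int bits = (b & 0xff) << 21;        // steps 4-5: mantissa into the high bits,
                                            // low 21 bits of the mantissa left zero
        bits += 48 << 24;                   // step 3: add 48 to the exponent
        return Float.intBitsToFloat(bits);  // step 2: sign bit stays zero
    }

    public static void main(String[] args) {
        for (byte b : new byte[]{0, 1, 120, 124}) {
            System.out.println(b + " -> " + decodeNorm(b));  // 124 -> 1.0
        }
    }
}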
FieldInfo

Holds part of a Field's metadata; it is the quadruple <name, isIndexed, number, storeTermVector>.
FieldInfos

Describes whether a Document's fields are indexed. Each segment has its own FieldInfos file. The class claims to be thread-safe for multiple readers, so long as only one thread at a time adds documents while no other reader or writer is active. Yet it maintains two containers, an ArrayList and a HashMap, neither of which is synchronized, so in what sense it is thread-safe is puzzling.
Its write() method shows the layout of the .fnm file:

FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount
FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte
FieldsReader

Reads the .fdx and .fdt files.
FieldsWriter

Creates two files, .fdx and .fdt.

The field index file (.fdx) contains, for each document, a pointer (really just an integer) into the field data file:

FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize
FieldValuesPosition --> UInt64

so the field pointer of the n-th document sits at byte offset n*8.
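Because every entry is a fixed 8 bytes, random access is a single seek. A hypothetical stand-alone reader (the class, file handle, and method here are illustrative, not Lucene API):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FdxLookup {
    // return the .fdt pointer for document n by seeking to n*8 in the .fdx file
    static long fieldDataPointer(RandomAccessFile fdx, int n) throws IOException {
        fdx.seek((long) n * 8);   // each entry is a fixed 8-byte UInt64
        return fdx.readLong();    // Lucene writes these big-endian, as readLong expects
    }
}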
The field data file (.fdt) contains the stored fields of every document:

FieldData (.fdt) --> <DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt

Lucene <= 1.4:

Bits --> Byte
Value --> String

Only the low-order bit of Bits is used: it is one for tokenized fields and zero for non-tokenized fields.
FilterIndexReader

Extends IndexReader, supplying concrete implementations of its methods.
IndexReader

An abstract class! It reads an indexed Directory and can return all kinds of information from it: Terms, TermPositions, and so on.
IndexWriter

IndexWriter creates and maintains an index.

The third argument of the IndexWriter constructor determines whether a new index is created, or whether an existing index is opened so new documents can be added to it.

New documents are added with addDocument(); once you are finished adding, call close().

If an index will receive no further documents and query performance matters, call optimize() before close() to optimize the index.
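A minimal usage sketch against the Lucene 1.4 API (the index path and field text are invented):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        // third argument true: create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter("testIndex", new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "i love tianjin"));
        writer.addDocument(doc);
        writer.optimize();   // merge segments for faster searching
        writer.close();
    }
}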
Structure of the deletable file:

A file named "deletable" contains the names of files that are no longer used by the index, but which could not be deleted. This is only used on Win32, where a file may not be deleted while it is still open; on other platforms the file contains only null bytes.

Deletable --> DeletableCount, <DeletableName>^DeletableCount
DeletableCount --> UInt32
DeletableName --> String
MultipleTermPositions

Used specifically by PhrasePrefixQuery in the search package.
MultiReader

Extends IndexReader; reads several indexes at once and concatenates their contents.
SegmentInfo

Some information about a segment: the triple <segmentName, docCount, dir>.
SegmentInfos

Extends Vector, so it is simply a vector whose elements are SegmentInfo objects. It builds the segments file, of which every index has exactly one, and provides read and write methods. Its contents:

Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
Format, NameCounter, SegCount, SegSize --> UInt32
Version --> UInt64
SegName --> String

Format is -1 in Lucene 1.4.

Version counts how often the index has been changed by adding or deleting documents.

NameCounter is used to generate names for new segment files.

SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.

SegSize is the number of documents contained in the segment index.
SegmentMergeInfo

Records bookkeeping information about a segment merge.
SegmentMergeQueue

Extends PriorityQueue (ordered ascending).
SegmentMerger

Merges several segments into a single segment; instances are created by IndexWriter.addIndexes(). If compoundFile is true, the merge also creates a .cfs file and folds almost all of the other files into it.
SegmentReader

Extends IndexReader and provides many methods for reading an index.
SegmentTermDocs

Implements the TermDocs interface.
SegmentTermEnum

Extends TermEnum.
SegmentTermPositions

Implements the TermPositions interface.
SegmentTermVector

Implements the TermFreqVector interface.
Term

A Term is the pair <fieldName, text>. Fields come in several kinds, but every kind carries at least <fieldName, fieldValue>, which is how Terms and Fields are related. A Term is the unit of search; its text can be any string: words, dates, e-mail addresses, URLs, and so on.
TermDocs

TermDocs is an interface. It enumerates the <document, frequency> pairs for a term.

In each pair, the document part names a document containing the term; documents are identified by document number. The frequency part gives the number of occurrences of the term in that document. The pairs are ordered by document number.
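A usage sketch (the index path and term are invented; IndexReader.termDocs() is the usual way to obtain a TermDocs):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class EnumerateTermDocs {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("testIndex");
        TermDocs termDocs = reader.termDocs(new Term("contents", "love"));
        while (termDocs.next()) {
            // one <document, frequency> pair per containing document
            System.out.println("<" + termDocs.doc() + "," + termDocs.freq() + ">");
        }
        termDocs.close();
        reader.close();
    }
}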
TermEnum

An abstract class used to enumerate terms. The enumeration is ordered by Term.compareTo(); each term in the enumeration is greater than all terms that precede it.
TermFreqVector

An interface for accessing the term vector of a document Field.
TermInfo

Stores information about a Term; in effect the five-tuple <Term, docFreq, freqPointer, proxPointer, skipOffset>.
TermInfosReader

Not yet read closely; to be revisited after finishing SegmentTermEnum.
TermInfosWriter

Builds the .tis and .tii files, which together make up the term dictionary.
1. The term infos, or .tis file:

TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.

TIVersion names the version of the format of this file and is -2 in Lucene 1.4.

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
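A sketch of rebuilding term texts from <PrefixLength, Suffix> records (the record list is an invented example):

public class PrefixDecode {
    public static void main(String[] args) {
        // each record is {prefixLength, suffix}, as stored in the .tis file
        Object[][] records = {{0, "bone"}, {2, "y"}, {3, "cott"}};
        String previous = "";
        for (Object[] r : records) {
            int prefixLength = (Integer) r[0];
            String suffix = (String) r[1];
            // prepend prefixLength chars of the previous term to the suffix
            previous = previous.substring(0, prefixLength) + suffix;
            System.out.println(previous);   // bone, boy, boycott
        }
    }
}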
FieldNumber determines the term's field, whose name is stored in the .fdt file.

DocFreq is the count of documents which contain the term.

FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data.
2. The term info index, or .tii file.

This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.

The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta:

TermInfoIndex (.tii) --> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VLong

IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

TODO: document skipInterval information

The IndexDelta, then, is the one thing a .tii record carries beyond its .tis counterpart.
TermPositions

An interface extending TermDocs; it enumerates the triples <document, frequency, <position>*> for a term. The document and frequency parts are the same as in TermDocs, while the positions part lists each ordinal position at which the term occurs within the document. The triple is the occurrence-list representation of an inverted file.
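A usage sketch, parallel to the TermDocs one above (index path and term again invented):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class EnumerateTermPositions {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("testIndex");
        TermPositions tp = reader.termPositions(new Term("contents", "love"));
        while (tp.next()) {
            System.out.print("doc=" + tp.doc() + " freq=" + tp.freq() + " pos=");
            for (int i = 0; i < tp.freq(); i++) {
                System.out.print(tp.nextPosition() + " ");  // one position per occurrence
            }
            System.out.println();
        }
        tp.close();
        reader.close();
    }
}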
TermPositionVector

Extends TermFreqVector, adding the ability to report the positions at which each term occurs.
TermVectorsReader

Reads the .tvd, .tvf and .tvx files.
TermVectorsWriter

Builds the .tvd, .tvf and .tvx files, which together make up the term vectors:
1. The document index, or .tvx file.

This contains, for each document, a pointer to the document data in the .tvd file:

DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition>^NumDocs
TVXVersion --> Int
DocumentPosition --> UInt64

This is used to find the position of the document's data in the .tvd file.
2. The document data, or .tvd file.

This contains, for each document, the number of fields, a list of the fields that have term vector info, and finally a list of pointers to the field information in the .tvf (term vector fields) file:

Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs
TVDVersion --> Int
NumFields --> VInt
FieldNums --> <FieldNumDelta>^NumFields
FieldNumDelta --> VInt
FieldPositions --> <FieldPosition>^NumFields
FieldPosition --> VLong

The .tvd file is used to map out which fields have term vectors stored, and where the field information lives in the .tvf file.
3. The field data, or .tvf file.

This file contains, for each field that has a term vector stored, a list of the terms and their frequencies:

Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs>^NumFields
TVFVersion --> Int
NumTerms --> VInt
NumDistinct --> VInt (reserved for future use)
TermFreqs --> <TermText, TermFreq>^NumTerms
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt

Term text prefixes are shared here exactly as in the .tis file: the PrefixLength is the number of initial characters from the previous term that must be pre-pended to the suffix to form the term's text.
Good: that covers every class in the index package. Now let's revisit all of it in code; the following program wraps up this chapter's discussion.
package org.apache.lucene.index;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.demo.*;
import org.apache.lucene.search.*;
import java.io.*;

/** Exercises as many classes of the index package as possible:
 *  DocumentWriter (the user-facing class is IndexWriter),
 *  FieldInfo (and FieldInfos),
 *  SegmentTermDocs (implements TermDocs),
 *  SegmentReader (extends IndexReader; users normally see IndexReader),
 *  SegmentMerger,
 *  SegmentTermEnum (extends TermEnum),
 *  SegmentTermPositions (implements TermPositions),
 *  SegmentTermVector (implements TermFreqVector).
 */
public class TestIndexPackage
{
  // adds a Document to the index
  public static void indexDocument(String segment, String fileName) throws Exception
  {
    // the second argument controls whether the directory is created if it cannot be found
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    Analyzer analyzer = new SimpleAnalyzer();
    // the last argument is the maximum number of tokens per Field
    DocumentWriter writer = new DocumentWriter(directory, analyzer, Similarity.getDefault(), 1000);
    File file = new File(fileName);
    // FileDocument wraps the file into a Document with three fields: path, modified, contents
    Document doc = FileDocument.Document(file);
    writer.addDocument(segment, doc);
    directory.close();
  }

  // merges two segments into one
  public static void merge(String segment1, String segment2, String segmentMerged) throws Exception
  {
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    SegmentReader segmentReader1 = new SegmentReader(new SegmentInfo(segment1, 1, directory));
    SegmentReader segmentReader2 = new SegmentReader(new SegmentInfo(segment2, 1, directory));
    // the third argument controls whether a .cfs compound file is created
    SegmentMerger segmentMerger = new SegmentMerger(directory, segmentMerged, false);
    segmentMerger.add(segmentReader1);
    segmentMerger.add(segmentReader2);
    segmentMerger.merge();
    segmentMerger.closeReaders();
    directory.close();
  }

  // dumps the full contents of one segment (a sub-index) of the index
  public static void printSegment(String segment) throws Exception
  {
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    SegmentReader segmentReader = new SegmentReader(new SegmentInfo(segment, 1, directory));
    // display documents
    for (int i = 0; i < segmentReader.numDocs(); i++)
      System.out.println(segmentReader.document(i));
    TermEnum termEnum = segmentReader.terms(); // actually a SegmentTermEnum
    // display each term with its positions and its TermDocs
    while (termEnum.next())
    {
      // toString2() is the author's own debugging variant of toString()
      System.out.print(termEnum.term().toString2());
      System.out.println(" DocumentFrequency=" + termEnum.docFreq());
      TermPositions termPositions = segmentReader.termPositions(termEnum.term());
      int i = 0;
      while (termPositions.next())
        System.out.println((i++) + "->" + termPositions);
      TermDocs termDocs = segmentReader.termDocs(termEnum.term()); // actually a SegmentTermDocs
      while (termDocs.next())
        System.out.println((i++) + "->" + termDocs);
    }
    // display field info
    FieldInfos fieldInfos = segmentReader.fieldInfos;
    FieldInfo pathFieldInfo = fieldInfos.fieldInfo("path");
    FieldInfo modifiedFieldInfo = fieldInfos.fieldInfo("modified");
    FieldInfo contentsFieldInfo = fieldInfos.fieldInfo("contents");
    System.out.println(pathFieldInfo);
    System.out.println(modifiedFieldInfo);
    System.out.println(contentsFieldInfo);
    // display the TermFreqVector of each document
    for (int i = 0; i < segmentReader.numDocs(); i++)
    {
      // the terms produced by tokenizing contents are stored in a TermFreqVector
      TermFreqVector termFreqVector = segmentReader.getTermFreqVector(i, "contents");
      System.out.println(termFreqVector);
    }
  }

  public static void main(String[] args)
  {
    try
    {
      Directory directory = FSDirectory.getDirectory("testIndexPackage", true);
      directory.close();
      indexDocument("segmentOne", "e://lucene//test.txt");
      //printSegment("segmentOne");
      indexDocument("segmentTwo", "e://lucene//test2.txt");
      //printSegment("segmentTwo");
      merge("segmentOne", "segmentTwo", "merge");
      printSegment("merge");
    }
    catch (Exception e)
    {
      System.out.println("caught a " + e.getCause() + "\n with message: " + e.getMessage());
      e.printStackTrace();
    }
  }
}
Its output looks like this:
Document<Text<path:e:/lucene/test.txt> Keyword<modified:0eg4e221c>>
Document<Text<path:e:/lucene/test2.txt> Keyword<modified:0eg4ee8b4>>
<Term:FieldName,text>=<contents,china> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=2>
1-><docNumber,freq>=<0,1>
<Term:FieldName,text>=<contents,i> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=2 Pos=0,3>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>
2-><docNumber,freq>=<0,2>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<contents,love> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=2 Pos=1,4>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=1>
2-><docNumber,freq>=<0,2>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<contents,nankai> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=2>
1-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<contents,tianjin> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=5>
1-><docNumber,freq>=<0,1>
<Term:FieldName,text>=<modified,0eg4e221c> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=0>
1-><docNumber,freq>=<0,1>
<Term:FieldName,text>=<modified,0eg4ee8b4> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>
1-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,e> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=0>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,lucene> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=1>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=1>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,test> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=2>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=2>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,txt> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=3>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=3>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<fieldName,isIndexed,fieldNumber,storeTermVector>=<path,true,3,false>
<fieldName,isIndexed,fieldNumber,storeTermVector>=<modified,true,2,false>
<fieldName,isIndexed,fieldNumber,storeTermVector>=<contents,true,1,true>
{contents: china/1, i/2, love/2, tianjin/1}
{contents: i/1, love/1, nankai/1}
Study this output carefully and Lucene's underlying index structure will become much clearer.

Reference: Lucene File Formats