From the earlier analysis of how Lucene adds a document, we already know that adding a document breaks down into adding its fields, and that adding a field is where the inverted index is built. This article therefore takes field addition as the entry point for analyzing the inverted-indexing process. Let's start with the entry method for processing a field:
private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
  String fieldName = field.name();
  IndexableFieldType fieldType = field.fieldType();
  PerField fp = null;

  if (fieldType.indexOptions() == null) {
    throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
  }

  // Invert indexed fields:
  if (fieldType.indexOptions() != IndexOptions.NONE) {
    // if the field omits norms, the boost cannot be indexed.
    if (fieldType.omitNorms() && field.boost() != 1.0f) {
      throw new UnsupportedOperationException("You cannot set an index-time boost: norms are omitted for field '" + field.name() + "'");
    }
    fp = getOrAddField(fieldName, fieldType, true);
    boolean first = fp.fieldGen != fieldGen;
    fp.invert(field, first);
    if (first) {
      fields[fieldCount++] = fp;
      fp.fieldGen = fieldGen;
    }
  } else {
    verifyUnIndexedFieldType(fieldName, fieldType);
  }

  // Add stored fields:
  if (fieldType.stored()) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    if (fieldType.stored()) {
      String value = field.stringValue();
      if (value != null && value.length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
        throw new IllegalArgumentException("stored field \"" + field.name() + "\" is too large (" + value.length() + " characters) to store");
      }
      try {
        storedFieldsConsumer.writeField(fp.fieldInfo, field);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }
  }

  DocValuesType dvType = fieldType.docValuesType();
  if (dvType == null) {
    throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")");
  }

  if (dvType != DocValuesType.NONE) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    indexDocValue(fp, dvType, field);
  }

  if (fieldType.pointDimensionCount() != 0) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    indexPoint(fp, field);
  }

  return fieldCount;
}
1. Get the field name fieldName and the field type fieldType. fieldType carries the index option IndexOptions; there are five index options in total (a short configuration sketch follows this list):
a) NONE: do not index the field
b) DOCS: index document IDs only
c) DOCS_AND_FREQS: index document IDs and term frequencies
d) DOCS_AND_FREQS_AND_POSITIONS: index document IDs, term frequencies, and term positions
e) DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: index document IDs, term frequencies, term positions, and term offsets
2. If the index option is NONE, the field is not indexed at all, so it can never be used as a search condition to find documents. Otherwise the field must be inverted: the PerField for this fieldName is looked up in the PerField hash array (and created first if it does not exist yet); PerField holds the field's metadata together with its indexing state. fp.invert(field, first) is then called to build the inverted index, and the inverted data is kept inside the PerField. Inverted indexing is the focus of this article and is analyzed in detail below.
3. fieldType.stored() decides whether the field value should be stored. If so, the PerField for this fieldName is again fetched (or created) from the PerField hash array, and storedFieldsConsumer.writeField(fp.fieldInfo, field) writes the field value into the in-memory stored-fields buffer.
4. After all fields have been processed as above, finishStoredFields() is called so that the buffered stored-field data can be written out to the stored-fields files (the .fdt data file and its .fdx index).
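To make steps 1-3 concrete, here is a minimal, hypothetical usage sketch (not part of processField itself) showing how IndexOptions and stored() are typically configured on a field before it ever reaches this method. It assumes an already opened IndexWriter named writer and uses the Lucene 6.x field API; the field names and values are made up:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;

void addSampleDoc(IndexWriter writer) throws IOException {
  FieldType ft = new FieldType();
  ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); // steps 1-2: this field will be inverted
  ft.setTokenized(true);                                         // run the value through the analyzer
  ft.setStored(true);                                            // step 3: storedFieldsConsumer keeps the raw value
  ft.freeze();

  Document doc = new Document();
  doc.add(new Field("title", "Lucene in Action", ft));
  doc.add(new StoredField("isbn", "978-1933988177"));            // stored only: its IndexOptions is NONE
  writer.addDocument(doc);
}

The point is simply that every branch in processField is driven by flags on the field's FieldType.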
Next, let's analyze the inverted-indexing process itself:
public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
  if (first) {
    // First time we're seeing this field (indexed) in
    // this document:
    invertState.reset();
  }

  IndexableFieldType fieldType = field.fieldType();
  IndexOptions indexOptions = fieldType.indexOptions();
  fieldInfo.setIndexOptions(indexOptions);

  if (fieldType.omitNorms()) {
    fieldInfo.setOmitsNorms();
  }

  final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;

  // only bother checking offsets if something will consume them.
  // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
  final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

  /*
   * To assist people in tracking down problems in analysis components, we wish to write the field name to the infostream
   * when we fail. We expect some caller to eventually deal with the real exception, so we don't want any 'catch' clauses,
   * but rather a finally that takes note of the problem.
   */
  boolean succeededInProcessingField = false;
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
    // reset the TokenStream to the first token
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);

    while (stream.incrementToken()) {

      // If we hit an exception in stream.next below
      // (which is fairly common, e.g. if analyzer
      // chokes on a given document), then it's
      // non-aborting and (above) this one document
      // will be marked as deleted, but still
      // consume a docID

      int posIncr = invertState.posIncrAttribute.getPositionIncrement();
      invertState.position += posIncr;
      if (invertState.position < invertState.lastPosition) {
        if (posIncr == 0) {
          throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '" + field.name() + "'");
        } else if (posIncr < 0) {
          throw new IllegalArgumentException("position increment must be >= 0 (got " + posIncr + ") for field '" + field.name() + "'");
        } else {
          throw new IllegalArgumentException("position overflowed Integer.MAX_VALUE (got posIncr=" + posIncr + " lastPosition=" + invertState.lastPosition + " position=" + invertState.position + ") for field '" + field.name() + "'");
        }
      } else if (invertState.position > IndexWriter.MAX_POSITION) {
        throw new IllegalArgumentException("position " + invertState.position + " is too large for field '" + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
      }
      invertState.lastPosition = invertState.position;
      if (posIncr == 0) {
        invertState.numOverlap++;
      }

      if (checkOffsets) {
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
          throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
              + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset=" + invertState.lastStartOffset + " for field '" + field.name() + "'");
        }
        invertState.lastStartOffset = startOffset;
      }

      invertState.length++;
      if (invertState.length < 0) {
        throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
      }
      //System.out.println(" term=" + invertState.termAttribute);

      // If we hit an exception in here, we abort
      // all buffered documents since the last
      // flush, on the likelihood that the
      // internal state of the terms hash is now
      // corrupt and should not be flushed to a
      // new segment:
      try {
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        byte[] prefix = new byte[30];
        BytesRef bigTerm = invertState.termAttribute.getBytesRef();
        System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
        String msg = "Document contains at least one immense term in field=\"" + fieldInfo.name + "\" (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
        if (docState.infoStream.isEnabled("IW")) {
          docState.infoStream.message("IW", "ERROR: " + msg);
        }
        // Document will be deleted above:
        throw new IllegalArgumentException(msg, e);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }

    // trigger streams to perform end-of-stream operations
    stream.end();

    // TODO: maybe add some safety? then again, it's already checked
    // when we come back around to the field...
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();
    invertState.offset += invertState.offsetAttribute.endOffset();

    /* if there is an exception coming through, we won't set this to true here:*/
    succeededInProcessingField = true;

  } finally {
    if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
      docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
    }
  }

  if (analyzed) {
    invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
    invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
  }

  invertState.boost *= field.boost();
}
1. tokenStream = field.tokenStream(docState.analyzer, tokenStream) uses the analyzer to turn the field's value into a token stream. The TokenStream behaves much like an enumeration: each call to stream.incrementToken() advances to the next token. (A standalone sketch of this loop is given at the end of this section.)
2. FieldInvertState invertState and TermsHashPerField termsHashPerField are objects held by the PerField; invertState records how many tokens the field produced and their position information.
invertState.posIncrAttribute carries the position increment between the previous token and the current one. It is normally 1, meaning there is no gap between the two tokens. A value greater than 1 means there is a gap, typically left where the analyzer removed stop words. A value of 0 usually indicates a synonym stacked on the previous token; in that case invertState.numOverlap, the counter of overlapping tokens, is incremented. invertState.length counts the total number of tokens.
3. termsHashPerField.add() performs the actual term-indexing work, storing the indexed data in in-memory buffers.
a) bytesHash.add stores the term's bytes and their length in bytesHash and returns a termID. A negative termID means this term has been seen before (it is not being added for the first time).
b) newTerm(termID) creates the entry for a brand-new term: its term frequency, document ID, offset, and position are recorded in FreqProxPostingsArray, which maintains parallel arrays for exactly these statistics; the slot at index termID in each array holds the information for that one term. The FreqProxPostingsArray code is shown below, followed by a simplified model of the parallel-array idea:
static final class FreqProxPostingsArray extends ParallelPostingsArray {
  public FreqProxPostingsArray(int size, boolean writeFreqs, boolean writeProx, boolean writeOffsets) {
    super(size);
    if (writeFreqs) {
      termFreqs = new int[size];
    }
    lastDocIDs = new int[size];
    lastDocCodes = new int[size];
    if (writeProx) {
      lastPositions = new int[size];
      if (writeOffsets) {
        lastOffsets = new int[size];
      }
    } else {
      assert !writeOffsets;
    }
    //System.out.println("PA init freqs=" + writeFreqs + " pos=" + writeProx + " offs=" + writeOffsets);
  }

  int termFreqs[];      // # times this term occurs in the current doc
  int lastDocIDs[];     // Last docID where this term occurred
  int lastDocCodes[];   // Code for prior doc
  int lastPositions[];  // Last position where this term occurred
  int lastOffsets[];    // Last endOffset where this term occurred

  // ... (remaining members of FreqProxPostingsArray omitted)
}
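To make the parallel-array idea concrete, here is a deliberately simplified, hypothetical model (this is not Lucene code; a plain HashMap stands in for BytesRefHash) of how a termID is assigned the first time a term is seen and how per-term statistics live in arrays indexed by that termID:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ToyPostings {
  private final Map<String, Integer> termIds = new HashMap<>(); // stands in for bytesHash
  private int nextTermId = 0;
  private int[] termFreqs = new int[8];   // frequency of the term in the current doc
  private int[] lastDocIDs = new int[8];  // last docID in which the term occurred

  void add(String term, int docId) {
    Integer id = termIds.get(term);
    if (id == null) {                      // first occurrence ever -> analogous to newTerm(termID)
      id = nextTermId++;
      termIds.put(term, id);
      if (id == termFreqs.length) {        // grow the parallel arrays together
        termFreqs = Arrays.copyOf(termFreqs, id * 2);
        lastDocIDs = Arrays.copyOf(lastDocIDs, id * 2);
      }
      termFreqs[id] = 1;
      lastDocIDs[id] = docId;
    } else if (lastDocIDs[id] != docId) {  // first occurrence in this document
      termFreqs[id] = 1;
      lastDocIDs[id] = docId;
    } else {                               // repeated occurrence in the same document
      termFreqs[id]++;
    }
  }
}

The real FreqProxPostingsArray additionally tracks lastDocCodes, lastPositions and lastOffsets, and the encoded postings bytes live in separate byte pools, but the termID-indexed parallel-array layout is the same idea.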
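Finally, connecting back to points 1 and 2 above, the following standalone sketch uses the public analysis API (not the invert() internals) to run the same incrementToken() loop and print the position-increment and offset attributes that invertState tracks. The field name "body" and the sample text are made up, and it assumes Lucene 6.x, where StandardAnalyzer removes English stop words by default:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

static void showTokens() throws IOException {
  Analyzer analyzer = new StandardAnalyzer();
  try (TokenStream stream = analyzer.tokenStream("body", "The quick brown fox")) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncrAtt = stream.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
    stream.reset();
    int position = -1;
    while (stream.incrementToken()) {
      int posIncr = posIncrAtt.getPositionIncrement(); // 1 = adjacent, >1 = gap left by a removed stop word, 0 = overlapping token such as a synonym
      position += posIncr;
      System.out.println(term + " posIncr=" + posIncr + " position=" + position
          + " offsets=[" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");
    }
    stream.end();
  }
}

With the stop word "The" removed, the first emitted token "quick" carries posIncr=2, which is exactly the kind of gap described in point 2 above.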
Reference: Lucene索引过程分析 (Analysis of the Lucene Indexing Process)