Lucene Source Code Analysis -- Analyzer Class: An Introduction to IndexingChain (Part 1)

clmaykr95629

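Documents are indexed by the internal processing chain of DocumentsWriter. Below we trace the code to show how this indexing chain is created.

1. The following code is a simple example of building an index. The statements that construct the IndexWriterConfig and the IndexWriter (highlighted in red in the original post) are the ones we will trace.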

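import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexTest {
    public static void main(String[] args) {
        try {
            File fileDir = new File("F:\\document");
            // Traced below: building the writer configuration
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
            config.setInfoStream(System.out);
            config.setOpenMode(OpenMode.CREATE);
            // Traced below: constructing the IndexWriter
            IndexWriter writer = new IndexWriter(FSDirectory.open(new File("F:\\index")), config);
            for (File file : fileDir.listFiles()) {
                Document document = new Document();
                document.add(new TextField("content", new FileReader(file)));
                document.add(new StringField("title", file.getName(), Store.YES));
                writer.addDocument(document);
            }
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}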
2. IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43))
The configuration object is created through the IndexWriterConfig constructor. Let us step into it:

public IndexWriterConfig(Version matchVersion, Analyzer analyzer) {
    super(analyzer, matchVersion);
}

It simply calls the constructor of its parent class, LiveIndexWriterConfig. We keep tracing:

LiveIndexWriterConfig(Analyzer analyzer, Version matchVersion) {
    this.analyzer = analyzer;
    this.matchVersion = matchVersion;
    ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB;
    maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS;
    maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
    readerTermsIndexDivisor = IndexWriterConfig.DEFAULT_READER_TERMS_INDEX_DIVISOR;
    mergedSegmentWarmer = null;
    termIndexInterval = IndexWriterConfig.DEFAULT_TERM_INDEX_INTERVAL; // TODO: this should be private to the codec, not settable here
    delPolicy = new KeepOnlyLastCommitDeletionPolicy();
    commit = null;
    openMode = OpenMode.CREATE_OR_APPEND;
    similarity = IndexSearcher.getDefaultSimilarity();
    mergeScheduler = new ConcurrentMergeScheduler();
    writeLockTimeout = IndexWriterConfig.WRITE_LOCK_TIMEOUT;
    indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
    codec = Codec.getDefault();
    if (codec == null) {
        throw new NullPointerException();
    }
    infoStream = InfoStream.getDefault();
    mergePolicy = new TieredMergePolicy();
    flushPolicy = new FlushByRamOrCountsPolicy();
    readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
    indexerThreadPool = new ThreadAffinityDocumentsWriterThreadPool(IndexWriterConfig.DEFAULT_MAX_THREAD_STATES);
    perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;
}

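The assignments we care about here (highlighted in red in the original post) are indexingChain and indexerThreadPool.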
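
The delegation pattern above can be sketched without Lucene on the classpath. The classes below are hypothetical, simplified stand-ins (LiveConfig and Config are not the real Lucene types, and the fields are reduced to the two this article follows); they only illustrate that the subclass constructor adds nothing and every default is assigned in the parent.

```java
// A minimal sketch with hypothetical stand-in classes, not the Lucene implementation.
class ConfigDefaultsSketch {

    // Stand-in for the parent configuration class: it owns all defaults.
    static class LiveConfig {
        static final int DEFAULT_MAX_THREAD_STATES = 8; // value quoted in this article
        final String analyzerName;
        final String indexingChainName;
        final int maxThreadStates;

        LiveConfig(String analyzerName) {
            this.analyzerName = analyzerName;
            // The two defaults this article traces: the chain and the pool size.
            this.indexingChainName = "defaultIndexingChain";
            this.maxThreadStates = DEFAULT_MAX_THREAD_STATES;
        }
    }

    // Stand-in for the public configuration class: it only delegates.
    static class Config extends LiveConfig {
        Config(String analyzerName) {
            super(analyzerName);
        }
    }

    public static void main(String[] args) {
        Config config = new Config("standard");
        System.out.println(config.indexingChainName + ", " + config.maxThreadStates);
    }
}
```

Because the caller never mentions the chain or the pool, both are already fixed by the time the configuration object is handed to the writer.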

3. First we look at indexerThreadPool = new ThreadAffinityDocumentsWriterThreadPool(IndexWriterConfig.DEFAULT_MAX_THREAD_STATES), which builds the thread pool used for indexing. We trace the ThreadAffinityDocumentsWriterThreadPool constructor:

public ThreadAffinityDocumentsWriterThreadPool(int maxNumPerThreads) {
    super(maxNumPerThreads);
    assert getMaxThreadStates() >= 1;
}

It calls the constructor of its parent class, DocumentsWriterPerThreadPool:

DocumentsWriterPerThreadPool(int maxNumThreadStates) {
    if (maxNumThreadStates < 1) {
        throw new IllegalArgumentException("maxNumThreadStates must be >= 1 but was: " + maxNumThreadStates);
    }
    threadStates = new ThreadState[maxNumThreadStates];
    numThreadStatesActive = 0;
}

At this point a ThreadState array has been created; its default size is 8 (IndexWriterConfig.DEFAULT_MAX_THREAD_STATES). Reading ThreadState shows that each ThreadState is associated with one DocumentsWriterPerThread, and DocumentsWriterPerThread holds the key part of the indexing chain.
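
One detail worth making explicit is that the pool constructor only allocates the array; every element is still null until a later initialization step fills it. The sketch below models that with hypothetical names (ThreadStatePoolSketch and Slot are illustrative stand-ins, not Lucene classes).

```java
// A minimal sketch, assuming a simplified pool; names are hypothetical.
class ThreadStatePoolSketch {
    static final int DEFAULT_MAX_THREAD_STATES = 8; // default quoted in this article

    // Stand-in for a per-thread state slot.
    static final class Slot { }

    private final Slot[] slots;

    ThreadStatePoolSketch(int maxSlots) {
        // Same guard as the constructor shown above: at least one slot is required.
        if (maxSlots < 1) {
            throw new IllegalArgumentException("maxSlots must be >= 1 but was: " + maxSlots);
        }
        this.slots = new Slot[maxSlots]; // allocated, but every element is null
    }

    int getMaxSlots() {
        return slots.length;
    }

    boolean isBound(int index) {
        return slots[index] != null;
    }

    public static void main(String[] args) {
        ThreadStatePoolSketch pool = new ThreadStatePoolSketch(DEFAULT_MAX_THREAD_STATES);
        System.out.println("slots=" + pool.getMaxSlots() + ", slot 0 bound=" + pool.isBound(0));
    }
}
```
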
Next we look at how each object in the ThreadState array gets associated with a DocumentsWriterPerThread. Go back to this line of the indexing example:
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("F:\\index")), config);
Tracing the IndexWriter constructor, we find this statement, which creates the DocumentsWriter object:
docWriter = new DocumentsWriter(codec, config, directory, this, globalFieldNumberMap, bufferedDeletesStream);

4. We continue into the DocumentsWriter constructor:

DocumentsWriter(Codec codec, LiveIndexWriterConfig config, Directory directory, IndexWriter writer, FieldNumbers globalFieldNumbers,
        BufferedDeletesStream bufferedDeletesStream) {
    this.codec = codec;
    this.directory = directory;
    this.indexWriter = writer;
    this.infoStream = config.getInfoStream();
    this.similarity = config.getSimilarity();
    this.perThreadPool = config.getIndexerThreadPool();
    this.chain = config.getIndexingChain();
    this.perThreadPool.initialize(this, globalFieldNumbers, config);
    flushPolicy = config.getFlushPolicy();
    assert flushPolicy != null;
    flushPolicy.init(this);
    flushControl = new DocumentsWriterFlushControl(this, config);
}

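The call this.perThreadPool.initialize(this, globalFieldNumbers, config) (highlighted in red in the original post) initializes the indexing thread pool. Let us see what the initialization does: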

void initialize(DocumentsWriter documentsWriter, FieldNumbers globalFieldMap, LiveIndexWriterConfig config) {
    this.documentsWriter.set(documentsWriter); // thread pool is bound to DW
    this.globalFieldMap.set(globalFieldMap);
    for (int i = 0; i < threadStates.length; i++) {
        final FieldInfos.Builder infos = new FieldInfos.Builder(globalFieldMap);
        threadStates[i] = new ThreadState(new DocumentsWriterPerThread(documentsWriter.directory, documentsWriter, infos, documentsWriter.chain));
    }
}

As we can see, every element of the pool's threadStates array is initialized here and bound to its own DocumentsWriterPerThread instance.
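
The binding loop can be modelled in a few lines. In this sketch all names (Chain, PerThreadWriter, Slot) are hypothetical stand-ins; the point it illustrates is that each slot receives a separate writer object, while all writers are handed the same chain object that came from the configuration.

```java
// A minimal sketch of the binding loop; every type here is a hypothetical stand-in.
class PoolInitializeSketch {

    // Stand-in for the indexing chain object taken from the configuration.
    static final class Chain { }

    // Stand-in for the per-thread writer: one instance per slot.
    static final class PerThreadWriter {
        final int id;
        final Chain chain;

        PerThreadWriter(int id, Chain chain) {
            this.id = id;
            this.chain = chain;
        }
    }

    // Stand-in for a thread state: it wraps exactly one writer.
    static final class Slot {
        final PerThreadWriter writer;

        Slot(PerThreadWriter writer) {
            this.writer = writer;
        }
    }

    // Fill every slot with a fresh writer that shares the given chain.
    static Slot[] initialize(int size, Chain chain) {
        Slot[] slots = new Slot[size];
        for (int i = 0; i < slots.length; i++) {
            slots[i] = new Slot(new PerThreadWriter(i, chain));
        }
        return slots;
    }

    public static void main(String[] args) {
        Slot[] slots = initialize(8, new Chain());
        System.out.println("bound slots: " + slots.length);
    }
}
```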

5. Now let us look at the DocumentsWriterPerThread constructor:

public DocumentsWriterPerThread(Directory directory, DocumentsWriter parent,
        FieldInfos.Builder fieldInfos, IndexingChain indexingChain) {
    this.directoryOrig = directory;
    this.directory = new TrackingDirectoryWrapper(directory);
    this.parent = parent;
    this.fieldInfos = fieldInfos;
    this.writer = parent.indexWriter;
    this.infoStream = parent.infoStream;
    this.codec = parent.codec;
    this.docState = new DocState(this, infoStream);
    this.docState.similarity = parent.indexWriter.getConfig().getSimilarity();
    bytesUsed = Counter.newCounter();
    byteBlockAllocator = new DirectTrackingAllocator(bytesUsed);
    pendingDeletes = new BufferedDeletes();
    intBlockAllocator = new IntBlockAllocator(bytesUsed);
    initialize();
    consumer = indexingChain.getChain(this);
}

The last statement of the constructor, consumer = indexingChain.getChain(this), gives each DocumentsWriterPerThread its own indexing chain.
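
The getChain(this) call is a factory call: one shared chain object builds a fresh consumer for whichever writer asks. A rough sketch of that pattern follows; ChainFactory, Consumer, and Writer are hypothetical names chosen for illustration and do not mirror the real Lucene signatures.

```java
// A minimal sketch of the factory pattern, using hypothetical stand-in types.
class ChainFactorySketch {

    // Stand-in for the object at the head of a per-writer chain.
    interface Consumer {
        String describe();
    }

    // Stand-in for the chain factory: builds one consumer per writer.
    interface ChainFactory {
        Consumer getChain(Writer writer);
    }

    static final class Writer {
        final String name;
        final Consumer consumer;

        Writer(String name, ChainFactory factory) {
            this.name = name;
            // Called last, so the factory sees an otherwise fully initialized writer.
            this.consumer = factory.getChain(this);
        }
    }

    // One shared factory; each call returns a new consumer bound to its writer.
    static final ChainFactory DEFAULT_CHAIN = new ChainFactory() {
        @Override
        public Consumer getChain(final Writer writer) {
            return new Consumer() {
                @Override
                public String describe() {
                    return "chain for " + writer.name;
                }
            };
        }
    };

    public static void main(String[] args) {
        Writer first = new Writer("dwpt-0", DEFAULT_CHAIN);
        Writer second = new Writer("dwpt-1", DEFAULT_CHAIN);
        System.out.println(first.consumer.describe());
        System.out.println(second.consumer.describe());
    }
}
```

Since every writer owns its consumer objects, writers do not need to coordinate with each other while they process documents through their chains.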

6. Finally, let us look at what the indexing chain contains:

static final IndexingChain defaultIndexingChain = new IndexingChain() {
    @Override
    DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) {
        final TermsHashConsumer termVectorsWriter = new TermVectorsConsumer(documentsWriterPerThread);
        final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();
        final InvertedDocConsumer termsHash = new TermsHash(documentsWriterPerThread, freqProxWriter, true,
                new TermsHash(documentsWriterPerThread, termVectorsWriter, false, null));
        final NormsConsumer normsWriter = new NormsConsumer();
        final DocInverter docInverter = new DocInverter(documentsWriterPerThread.docState, termsHash, normsWriter);
        final StoredFieldsConsumer storedFields = new TwoStoredFieldsConsumers(
                new StoredFieldsProcessor(documentsWriterPerThread),
                new DocValuesProcessor(documentsWriterPerThread.bytesUsed));
        return new DocFieldProcessor(documentsWriterPerThread, docInverter, storedFields);
    }
};
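
The nesting of the constructor calls in getChain is easier to read as a tree. The sketch below does not use Lucene; it records only the class names and the "who is passed to whom" relationships visible in the code above, using a hypothetical Node type, and prints them with indentation.

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch: the names are copied from getChain above, the Node type is hypothetical.
class ChainTreeSketch {

    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Mirrors the constructor nesting in getChain: a child is an object passed to its parent.
    static Node defaultChain() {
        Node nestedTermsHash = new Node("TermsHash (nested)", new Node("TermVectorsConsumer"));
        Node termsHash = new Node("TermsHash", new Node("FreqProxTermsWriter"), nestedTermsHash);
        Node docInverter = new Node("DocInverter", termsHash, new Node("NormsConsumer"));
        Node storedFields = new Node("TwoStoredFieldsConsumers",
                new Node("StoredFieldsProcessor"), new Node("DocValuesProcessor"));
        return new Node("DocFieldProcessor", docInverter, storedFields);
    }

    static String render(Node root) {
        StringBuilder out = new StringBuilder();
        render(root, 0, out);
        return out.toString();
    }

    private static void render(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.name).append('\n');
        for (Node child : node.children) {
            render(child, depth + 1, out);
        }
    }

    static int count(Node node) {
        int total = 1;
        for (Node child : node.children) {
            total += count(child);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.print(render(defaultChain()));
    }
}
```

Read top-down, DocFieldProcessor is the entry point returned to the writer, with the inverted-index branch (DocInverter) and the stored-fields branch (TwoStoredFieldsConsumers) beneath it.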

For the call flow of the indexing chain, see the figure below.

7. To sum up: when an IndexWriter is created, it is given a pool with a default size of 8. The pool holds DocumentsWriterPerThread instances, one per ThreadState, and each of them has a default IndexingChain associated with it.

ch.jpg