Line 51 of the mergeSegments method from the previous article,

mergedDocCount = merger.merge(); // the SegmentMerger merger returns the number of documents in the merged segment

calls the merge method of the SegmentMerger object, which performs the deeper field & term merging of the index.

SegmentMerger's merge method is as follows:
final int merge() throws IOException {
  int value;
  value = mergeFields(); // merge stored fields; returns the merged document count
  mergeTerms();          // merge the term dictionary, frequency and position files
  mergeNorms();          // merge the normalization factors
  if (fieldInfos.hasVectors())
    mergeVectors();      // merge term vectors, if any field stores them
  return value;
}
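Before walking through each step, here is a minimal sketch of how this method gets driven, mirroring the IndexWriter code quoted further down (the readers and the segment name are placeholders):

SegmentMerger merger = new SegmentMerger(writer, "mergedName"); // hypothetical driver
merger.add(readerA); // one SegmentReader per source segment
merger.add(readerB);
int mergedDocCount = merger.merge(); // fields -> terms -> norms -> vectors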
Taking the steps in order, first comes the mergeFields method:
private final int mergeFields() throws IOException {
  fieldInfos = new FieldInfos(); // merge field names
  int docCount = 0;
  /**
   * Rebuild the field information and collect it in the fieldInfos container.
   */
  for (int i = 0; i < readers.size(); i++) {
    IndexReader reader = (IndexReader) readers.elementAt(i); // @?
    addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET), true, true, true);
    addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION), true, true, false);
    addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET), true, false, true);
    addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR), true, false, false);
    addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.INDEXED), false, false, false);
    fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
  }
  // Rewrite the fields' .fnm file.
  fieldInfos.write(directory, segment + ".fnm");
  FieldsWriter fieldsWriter = // merge field values
          new FieldsWriter(directory, segment, fieldInfos);
  /**
   * For merging we don't want to compress/uncompress the data,
   * so to tell the FieldsReader that we're in merge mode,
   * we use this FieldSelector.
   */
  FieldSelector fieldSelectorMerge = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
      return FieldSelectorResult.LOAD_FOR_MERGE;
    }
  };
  try {
    for (int i = 0; i < readers.size(); i++) {
      IndexReader reader = (IndexReader) readers.elementAt(i);
      int maxDoc = reader.maxDoc();
      for (int j = 0; j < maxDoc; j++)
        if (!reader.isDeleted(j)) { // skip deleted docs
          // Rewrite the .fdt and .fdx files.
          fieldsWriter.addDocument(reader.document(j, fieldSelectorMerge));
          // Count the live documents across all the segments.
          docCount++;
        }
    }
  } finally {
    fieldsWriter.close();
  }
  return docCount;
}
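Note how deleted documents are simply skipped, which renumbers the surviving documents consecutively in the merged segment; the same renumbering reappears later as docMap in appendPostings. A toy model of that renumbering (not Lucene code, deletion flags made up):

public class DocRenumberDemo {
  public static void main(String[] args) {
    boolean[] deleted = {false, true, false, true, false}; // made-up flags
    int[] docMap = new int[deleted.length];                // old id -> new id
    int docCount = 0;                                      // counted as in mergeFields()
    for (int j = 0; j < deleted.length; j++)
      docMap[j] = deleted[j] ? -1 : docCount++;
    System.out.println(java.util.Arrays.toString(docMap)); // [0, -1, 1, -1, 2]
    System.out.println(docCount);                          // 3 live documents
  }
}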
Where does the readers variable marked @? come from? This question puzzled me for quite a while. First, look at the relevant field of SegmentMerger:
private Vector readers = new Vector();
and at SegmentMerger's add method:
final void add(IndexReader reader) {
readers.addElement(reader);
}
This traces back to the mergeSegments method of IndexWriter from the previous article:
if (doMerge) {
  if (infoStream != null) infoStream.print("merging segments");
  merger = new SegmentMerger(this, mergedName);
  for (int i = minSegment; i < end; i++) {
    SegmentInfo si = sourceSegments.info(i);
    if (infoStream != null)
      infoStream.print(" " + si.name + " (" + si.docCount + " docs)");
    IndexReader reader = SegmentReader.get(si); // no need to set deleter (yet)
    merger.add(reader);
    if ((reader.directory() == this.directory) || // if we own the directory
        (reader.directory() == this.ramDirectory))
      segmentsToDelete.addElement(reader); // queue segment for deletion
  }
}
Merging term information

The term merging step is performed by SegmentMerger's mergeTerms method.
private final void mergeTerms() throws IOException {
  try {
    freqOutput = directory.createOutput(segment + ".frq"); // create the merged .frq file
    proxOutput = directory.createOutput(segment + ".prx"); // create the merged .prx file
    /**
     * Rebuild the termInfosWriter, creating a new pair of .tis/.tii files.
     */
    termInfosWriter =
            new TermInfosWriter(directory, segment, fieldInfos,
                    termIndexInterval);
    // Fetch the skip interval (the stride of the skip list).
    skipInterval = termInfosWriter.skipInterval;
    // A priority queue holding one SegmentMergeInfo entry per reader.
    queue = new SegmentMergeQueue(readers.size());
    // Merge the terms.
    mergeTermInfos();
  } finally {
    if (freqOutput != null) freqOutput.close();
    if (proxOutput != null) proxOutput.close();
    if (termInfosWriter != null) termInfosWriter.close();
    if (queue != null) queue.close();
  }
}
A question may come up here: what does skipInterval mean?

[Best answer]

skipInterval is the stride of the skip list used for fast positioning when reading the frequency and position files.
An example of how the skip levels are built (for one term's frequency data: with skipInterval = 3 a skip entry appears every 3 documents at level 0, every 9 at level 1 and every 27 at level 2, so three levels show up):
skipInterval = 3:
                                                    c            (skip level 2)
                c                 c                 c            (skip level 1)
    x     x     x     x     x     x     x     x     x     x      (skip level 0)
d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
    3     6     9     12    15    18    21    24    27    30     (document numbers)

d - document
x - skip data
c - skip data with child pointer
It is like having 27 rooms with a control room every 3 rooms. With n = skipInterval, about log_n(N) skip levels appear (here three: level 0, level 1, level 2). The top level is small enough to keep in memory at little cost; locating an element takes at most log_n(N) skip hops down to a bottom-level group of skipInterval elements, inside which a linear scan of at most n entries finds it.
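A tiny demo of that cost estimate (toy code, not part of Lucene):

public class SkipLevelDemo {
  public static void main(String[] args) {
    int n = 3;  // skipInterval
    int N = 27; // number of documents containing the term
    int levels = 0;
    for (long span = n; span <= N; span *= n)
      levels++; // spans 3, 9, 27 -> three levels
    System.out.println("skip levels: " + levels);            // 3
    System.out.println("worst-case steps: " + (levels + n)); // hops plus final scan
  }
}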
Merging TermInfo

The code of SegmentMerger's mergeTermInfos method is as follows:
private final void mergeTermInfos() throws IOException {
  int base = 0;
  for (int i = 0; i < readers.size(); i++) {
    IndexReader reader = (IndexReader) readers.elementAt(i);
    TermEnum termEnum = reader.terms();
    SegmentMergeInfo smi = new SegmentMergeInfo(base, termEnum, reader);
    base += reader.numDocs();
    if (smi.next())
      queue.put(smi); // initialize queue
    else
      smi.close();
  }
  SegmentMergeInfo[] match = new SegmentMergeInfo[readers.size()];
  while (queue.size() > 0) {
    int matchSize = 0; // pop matching terms
    match[matchSize++] = (SegmentMergeInfo) queue.pop();
    Term term = match[0].term;
    SegmentMergeInfo top = (SegmentMergeInfo) queue.top();
    /**
     * The match array collects the entries whose current term is identical,
     * i.e. matchSize is the number of segments whose enumerators are
     * positioned on this term right now.
     */
    while (top != null && term.compareTo(top.term) == 0) {
      match[matchSize++] = (SegmentMergeInfo) queue.pop();
      top = (SegmentMergeInfo) queue.top();
    }
    mergeTermInfo(match, matchSize); // add new TermInfo (the core step) @
    // Advance the matched segments and put them back into the queue.
    while (matchSize > 0) {
      SegmentMergeInfo smi = match[--matchSize];
      if (smi.next())
        queue.put(smi); // restore queue
      else
        smi.close(); // done with a segment
    }
  }
}
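The loop above is the classic k-way merge of sorted streams through a priority queue. The same pattern in isolation, with plain strings standing in for terms (toy code, unrelated to Lucene's classes):

import java.util.*;

public class KWayMergeDemo {
  public static void main(String[] args) {
    // Each iterator plays the role of one segment's sorted TermEnum.
    List<Iterator<String>> segments = Arrays.asList(
        Arrays.asList("apple", "cat", "dog").iterator(),
        Arrays.asList("cat", "zebra").iterator());
    // The queue plays the role of SegmentMergeQueue, ordered by current term.
    PriorityQueue<String[]> queue =
        new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
    for (int i = 0; i < segments.size(); i++)
      if (segments.get(i).hasNext())
        queue.add(new String[] {segments.get(i).next(), Integer.toString(i)});
    while (!queue.isEmpty()) {
      String term = queue.peek()[0];
      List<String> match = new ArrayList<>(); // segments positioned on this term
      while (!queue.isEmpty() && queue.peek()[0].equals(term)) {
        String[] top = queue.poll();
        match.add(top[1]);
        Iterator<String> it = segments.get(Integer.parseInt(top[1]));
        if (it.hasNext())
          queue.add(new String[] {it.next(), top[1]}); // restore queue
      }
      System.out.println(term + " <- segments " + match); // one merged entry
    }
  }
}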
The core line marked @ leads into the mergeTermInfo method:
private final void mergeTermInfo(SegmentMergeInfo[] smis, int n)
        throws IOException {
  long freqPointer = freqOutput.getFilePointer();
  long proxPointer = proxOutput.getFilePointer();
  int df = appendPostings(smis, n); // append posting data (the core step)
  long skipPointer = writeSkip(); // flush the buffered skip data (see writeSkip below)
  if (df > 0) {
    // add an entry to the dictionary with pointers to prox and freq files;
    // this entry is the term's entry point into the index files
    termInfo.set(df, freqPointer, proxPointer, (int) (skipPointer - freqPointer));
    termInfosWriter.add(smis[0].term, termInfo);
  }
}
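To make the dictionary entry concrete, here are some made-up numbers (purely illustrative, not taken from a real index):

public class TermInfoEntryDemo {
  public static void main(String[] args) {
    int df = 7;              // the term occurs in 7 documents
    long freqPointer = 1000; // its entries start at offset 1000 in the .frq file
    long proxPointer = 5000; // its positions start at offset 5000 in the .prx file
    long skipPointer = 1042; // writeSkip() appended the skip data at .frq offset 1042
    int skipOffset = (int) (skipPointer - freqPointer); // stored relative to freqPointer
    System.out.println(df + " " + freqPointer + " " + proxPointer + " " + skipOffset);
  }
}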
The core of the above is the appendPostings method. Its main job is to rewrite the frequency and position files, i.e. to write the merged term's posting information back into the index. The code is as follows:
/** Process postings from multiple segments all positioned on the
 * same term. Writes out merged entries into freqOutput and
 * the proxOutput streams.
 *
 * @param smis array of segments
 * @param n number of cells in the array actually occupied
 * @return number of documents across all segments where this term was found
 */
private final int appendPostings(SegmentMergeInfo[] smis, int n)
        throws IOException {
  int lastDoc = 0;
  int df = 0; // number of docs w/ term
  resetSkip();
  for (int i = 0; i < n; i++) {
    SegmentMergeInfo smi = smis[i];
    TermPositions postings = smi.getPositions();
    int base = smi.base;
    int[] docMap = smi.getDocMap();
    postings.seek(smi.termEnum);
    while (postings.next()) {
      int doc = postings.doc();
      if (docMap != null)
        doc = docMap[doc]; // map around deletions
      doc += base; // convert to merged space
      if (doc < 0 || (df > 0 && doc <= lastDoc))
        throw new IllegalStateException("docs out of order (" + doc +
                " <= " + lastDoc + " )");
      /**
       * Count how many documents contain this term; all the postings
       * processed here belong to the same term, collected once from
       * each document that contains it.
       */
      df++;
      /**
       * Every skipInterval documents, buffer a skip pointer into the
       * frequency and position files.
       */
      if ((df % skipInterval) == 0) {
        bufferSkip(lastDoc);
      }
      /**
       * Compute docCode: the delta between the current and the previous
       * document number, shifted left by one bit (multiplied by 2).
       */
      int docCode = (doc - lastDoc) << 1; // use low bit to flag freq=1
      lastDoc = doc;
      /**
       * freq is the number of times the term occurs in this document.
       */
      int freq = postings.freq();
      if (freq == 1) {
        // Set the low bit of docCode (making it odd) and store only that.
        freqOutput.writeVInt(docCode | 1); // write doc & freq=1
      } else {
        freqOutput.writeVInt(docCode); // write doc (even value)
        freqOutput.writeVInt(freq); // write frequency in doc
      }
      int lastPosition = 0; // write position deltas
      for (int j = 0; j < freq; j++) {
        int position = postings.nextPosition();
        // Write the position delta.
        proxOutput.writeVInt(position - lastPosition);
        lastPosition = position;
      }
    }
  }
  return df;
}
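The doc/freq encoding deserves a worked example. A small self-contained demo of the same scheme (toy code, writing plain ints instead of VInts):

import java.util.*;

public class DocFreqEncodingDemo {
  public static void main(String[] args) {
    int[][] postings = { {5, 1}, {8, 3}, {9, 1} }; // {doc, freq} pairs
    List<Integer> out = new ArrayList<>();
    int lastDoc = 0;
    for (int[] p : postings) {
      int docCode = (p[0] - lastDoc) << 1; // doubled delta leaves the low bit free
      lastDoc = p[0];
      if (p[1] == 1) {
        out.add(docCode | 1); // odd value: freq == 1 is implied
      } else {
        out.add(docCode);     // even value: the freq follows separately
        out.add(p[1]);
      }
    }
    System.out.println(out); // [11, 6, 3, 3] = doc 5 freq 1, doc 8 freq 3, doc 9 freq 1
  }
}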
The skip entries are buffered in memory, in the field

private RAMOutputStream skipBuffer = new RAMOutputStream();

The implementation of bufferSkip is as follows:
private void bufferSkip(int doc) throws IOException {
  long freqPointer = freqOutput.getFilePointer();
  long proxPointer = proxOutput.getFilePointer();
  // Write the delta between this doc number and the last skip point's.
  skipBuffer.writeVInt(doc - lastSkipDoc);
  // Write how far the frequency and position file pointers have advanced
  // since the last skip point.
  skipBuffer.writeVInt((int) (freqPointer - lastSkipFreqPointer));
  skipBuffer.writeVInt((int) (proxPointer - lastSkipProxPointer));
  lastSkipDoc = doc;
  lastSkipFreqPointer = freqPointer;
  lastSkipProxPointer = proxPointer;
}
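So each skip point is a triple of VInt deltas. With made-up skip points at docs 12, 27 and 40 (hypothetical file positions):

public class SkipDeltaDemo {
  public static void main(String[] args) {
    int[] docs = {12, 27, 40};        // hypothetical skip-point doc numbers
    long[] freqPtr = {100, 230, 305}; // .frq positions at those points
    long[] proxPtr = {400, 620, 700}; // .prx positions at those points
    int lastDoc = 0;
    long lastFreq = 0, lastProx = 0;
    for (int i = 0; i < docs.length; i++) {
      System.out.printf("VInts: %d %d %d%n", docs[i] - lastDoc,
          freqPtr[i] - lastFreq, proxPtr[i] - lastProx);
      lastDoc = docs[i]; lastFreq = freqPtr[i]; lastProx = proxPtr[i];
    }
    // Prints: 12 100 400, then 15 130 220, then 13 75 80
  }
}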
The skip data accumulated by bufferSkip lives only in memory; flushing it to disk is the job of the writeSkip method:
private long writeSkip() throws IOException {
  long skipPointer = freqOutput.getFilePointer(); // where the skip data starts in .frq
  skipBuffer.writeTo(freqOutput); // append the buffered skip data after the postings
  return skipPointer;
}
RAMOutputStream's writeTo method:
/** Copy the current contents of this buffer to the named output. */
public void writeTo(IndexOutput out) throws IOException {
  flush();
  final long end = file.length;
  long pos = 0;
  int buffer = 0;
  while (pos < end) {
    int length = BUFFER_SIZE;
    long nextPos = pos + length;
    if (nextPos > end) { // at the last buffer
      length = (int)(end - pos);
    }
    out.writeBytes((byte[])file.buffers.get(buffer++), length);
    pos = nextPos;
  }
}
[A small question]

How does the content of skipBuffer get into freqOutput? Is file just skipBuffer? Reading the code again: writeTo is invoked on skipBuffer itself, so file is skipBuffer's own in-memory RAMFile, and the loop copies its byte buffers one by one into freqOutput.
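The pattern in isolation, as a hedged sketch (the directory variable and the output name are placeholders, not taken from the SegmentMerger code):

import org.apache.lucene.store.*;

public class SkipFlushSketch {
  static long flushSkip(Directory directory) throws java.io.IOException {
    RAMOutputStream buffer = new RAMOutputStream(); // backed by an in-memory RAMFile
    buffer.writeVInt(12);                           // accumulates in file.buffers
    buffer.writeVInt(100);
    IndexOutput out = directory.createOutput("_demo.frq"); // placeholder name
    long start = out.getFilePointer();              // the future skipPointer
    buffer.writeTo(out);                            // flush() + copy the buffers to out
    out.close();
    return start;
  }
}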
Finally, the merged normalization factors (norms) file is written; the mergeNorms method is as follows:
private void mergeNorms() throws IOException {
  byte[] normBuffer = null;
  IndexOutput output = null;
  try {
    for (int i = 0; i < fieldInfos.size(); i++) {
      FieldInfo fi = fieldInfos.fieldInfo(i);
      if (fi.isIndexed && !fi.omitNorms) {
        if (output == null) {
          output = directory.createOutput(segment + "." + IndexFileNames.NORMS_EXTENSION);
          output.writeBytes(NORMS_HEADER, NORMS_HEADER.length);
        }
        for (int j = 0; j < readers.size(); j++) {
          IndexReader reader = (IndexReader) readers.elementAt(j);
          int maxDoc = reader.maxDoc();
          if (normBuffer == null || normBuffer.length < maxDoc) {
            // the buffer is too small for the current segment
            normBuffer = new byte[maxDoc];
          }
          reader.norms(fi.name, normBuffer, 0);
          if (!reader.hasDeletions()) { // this segment has no deletions
            // optimized case for segments without deleted docs
            output.writeBytes(normBuffer, maxDoc);
          } else {
            // this segment has deleted docs, so we have to
            // check for every doc if it is deleted or not
            for (int k = 0; k < maxDoc; k++) {
              if (!reader.isDeleted(k)) { // this doc is not deleted
                output.writeByte(normBuffer[k]);
              }
            }
          }
        }
      }
    }
  } finally {
    if (output != null) {
      output.close();
    }
  }
}
As for output.writeByte(normBuffer[k]): it writes the single norm byte normBuffer[k] into the output stream, so only the norms of surviving documents are copied.
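A toy model of what mergeNorms produces for one field: one norm byte per live document, concatenated across segments (not Lucene code, data made up):

import java.io.ByteArrayOutputStream;

public class MergeNormsDemo {
  public static void main(String[] args) {
    byte[][] norms = { {10, 20, 30}, {40, 50} };                    // one array per segment
    boolean[][] deleted = { {false, true, false}, {false, false} }; // made-up flags
    ByteArrayOutputStream merged = new ByteArrayOutputStream();
    for (int s = 0; s < norms.length; s++)
      for (int d = 0; d < norms[s].length; d++)
        if (!deleted[s][d])
          merged.write(norms[s][d]); // same per-byte copy as mergeNorms
    System.out.println(java.util.Arrays.toString(merged.toByteArray()));
    // [10, 30, 40, 50]: the norm of deleted doc 1 in segment 0 is dropped
  }
}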