索引的创建
Lucene创建索引的步骤非常简单,只需要创建一个IndexWriter实例即可,这个实例就代表一个索引,它负责创建和维护索引。因此要了解索引创建的详情就需要从IndexWriter的构造函数开始。
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
if (d instanceof FSDirectory && ((FSDirectory) d).checkPendingDeletions()) {
throw new IllegalArgumentException("Directory " + d + " still has pending deleted files; cannot initialize IndexWriter");
}
conf.setIndexWriter(this); // prevent reuse by other instances
config = conf;
infoStream = config.getInfoStream();
// obtain the write.lock. If the user configured a timeout,
// we wrap with a sleeper and this might take some time.
writeLock = d.obtainLock(WRITE_LOCK_NAME);
boolean success = false;
try {
directoryOrig = d;
directory = new LockValidatingDirectoryWrapper(d, writeLock);
analyzer = config.getAnalyzer();
mergeScheduler = config.getMergeScheduler();
mergeScheduler.setInfoStream(infoStream);
codec = config.getCodec();
bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
poolReaders = config.getReaderPooling();
OpenMode mode = config.getOpenMode();
boolean create;
if (mode == OpenMode.CREATE) {
create = true;
} else if (mode == OpenMode.APPEND) {
create = false;
} else {
// CREATE_OR_APPEND - create only if an index does not exist
create = !DirectoryReader.indexExists(directory);
}
// If index is too old, reading the segments will throw
// IndexFormatTooOldException.
boolean initialIndexExists = true;
String[] files = directory.listAll();
// Set up our initial SegmentInfos:
IndexCommit commit = config.getIndexCommit();
// Set up our initial SegmentInfos:
StandardDirectoryReader reader;
if (commit == null) {
reader = null;
} else {
reader = commit.getReader();
}
if (create) {
if (config.getIndexCommit() != null) {
// We cannot both open from a commit point and create:
if (mode == OpenMode.CREATE) {
throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() with OpenMode.CREATE");
} else {
throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() when index has no commit");
}
}
// Try to read first. This is to allow create
// against an index that's currently open for
// searching. In this case we write the next
// segments_N file with no segments:
SegmentInfos sis = null;
try {
sis = SegmentInfos.readLatestCommit(directory);
sis.clear();
} catch (IOException e) {
// Likely this means it's a fresh directory
initialIndexExists = false;
sis = new SegmentInfos();
}
segmentInfos = sis;
rollbackSegments = segmentInfos.createBackupSegmentInfos();
// Record that we have a change (zero out all
// segments) pending:
changed();
} else if (reader != null) {
// Init from an existing already opened NRT or non-NRT reader:
if (reader.directory() != commit.getDirectory()) {
throw new IllegalArgumentException("IndexCommit's reader must have the same directory as the IndexCommit");
}
if (reader.directory() != directoryOrig) {
throw new IllegalArgumentException("IndexCommit's reader must have the same directory passed to IndexWriter");
}
if (reader.segmentInfos.getLastGeneration() == 0) {
// TODO: maybe we could allow this? It's tricky...
throw new IllegalArgumentException("index must already have an initial commit to open from reader");
}
// Must clone because we don't want the incoming NRT reader to "see" any changes this writer now makes:
segmentInfos = reader.segmentInfos.clone();
SegmentInfos lastCommit;
try {
lastCommit = SegmentInfos.readCommit(directoryOrig, segmentInfos.getSegmentsFileName());
} catch (IOException ioe) {
throw new IllegalArgumentException("the provided reader is stale: its prior commit file \"" + segmentInfos.getSegmentsFileName() + "\" is missing from index");
}
if (reader.writer != null) {
// The old writer better be closed (we have the write lock now!):
assert reader.writer.closed;
// In case the old writer wrote further segments (which we are now dropping),
// update SIS metadata so we remain write-once:
segmentInfos.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
lastCommit.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
}
rollbackSegments = lastCommit.createBackupSegmentInfos();
if (infoStream.isEnabled("IW")) {
infoStream.message("IW", "init from reader " + reader);
messageState();
}
} else {
// Init from either the latest commit point, or an explicit prior commit point:
String lastSegmentsFile = SegmentInfos.getLastCommitSegmentsFileName(files);
if (lastSegmentsFile == null) {
throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
}
// Do not use SegmentInfos.read(Directory) since the spooky
// retrying it does is not necessary here (we hold the write lock):
segmentInfos = SegmentInfos.readCommit(directoryOrig, lastSegmentsFile);
if (commit != null) {
// Swap out all segments, but, keep metadata in
// SegmentInfos, like version & generation, to
// preserve write-once. This is important if
// readers are open against the future commit
// points.
if (commit.getDirectory() != directoryOrig) {
throw new IllegalArgumentException("IndexCommit's directory doesn't match my directory, expected=" + directoryOrig + ", got=" + commit.getDirectory());
}
SegmentInfos oldInfos = SegmentInfos.readCommit(directoryOrig, commit.getSegmentsFileName());
segmentInfos.replace(oldInfos);
changed();
if (infoStream.isEnabled("IW")) {
infoStream.message("IW", "init: loaded commit \"" + commit.getSegmentsFileName() + "\"");
}
}
rollbackSegments = segmentInfos.createBackupSegmentInfos();
}
commitUserData = new HashMap<String,String>(segmentInfos.getUserData()).entrySet();
pendingNumDocs.set(segmentInfos.totalMaxDoc());
// start with previous field numbers, but new FieldInfos
// NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
globalFieldNumberMap = getFieldNumberMap();
validateIndexSort();
config.getFlushPolicy().init(config);
docWriter = new DocumentsWriter(this, config, directoryOrig, directory);
eventQueue = docWriter.eventQueue();
// Default deleter (for backwards compatibility) is
// KeepOnlyLastCommitDeleter:
// Sync'd is silly here, but IFD asserts we sync'd on the IW instance:
synchronized(this) {
deleter = new IndexFileDeleter(files, directoryOrig, directory,
config.getIndexDeletionPolicy(),
segmentInfos, infoStream, this,
initialIndexExists, reader != null);
// We incRef all files when we return an NRT reader from IW, so all files must exist even in the NRT case:
assert create || filesExist(segmentInfos);
}
if (deleter.startingCommitDeleted) {
// Deletion policy deleted the "head" commit point.
// We have to mark ourself as changed so that if we
// are closed w/o any further changes we write a new
// segments_N file.
changed();
}
if (reader != null) {
// Pre-enroll all segment readers into the reader pool; this is necessary so
// any in-memory NRT live docs are correctly carried over, and so NRT readers
// pulled from this IW share the same segment reader:
List<LeafReaderContext> leaves = reader.leaves();
assert segmentInfos.size() == leaves.size();
for (int i=0;i<leaves.size();i++) {
LeafReaderContext leaf = leaves.get(i);
SegmentReader segReader = (SegmentReader) leaf.reader();
SegmentReader newReader = new SegmentReader(segmentInfos.info(i), segReader, segReader.getLiveDocs(), segReader.numDocs());
readerPool.readerMap.put(newReader.getSegmentInfo(), new ReadersAndUpdates(this, newReader));
}
// We always assume we are carrying over incoming changes when opening from reader:
segmentInfos.changed();
changed();
}
if (infoStream.isEnabled("IW")) {
infoStream.message("IW", "init: create=" + create);
messageState();
}
success = true;
} finally {
if (!success) {
if (infoStream.isEnabled("IW")) {
infoStream.message("IW", "init: hit exception on init; releasing write lock");
}
IOUtils.closeWhileHandlingException(writeLock);
writeLock = null;
}
}
}
这是一段比较长的代码,可想而知Lucene在创建索引的过程中确实做了不少事情。
参数
首先说一下构造函数的两个参数Directory d,IndexWriterConfig conf。
- d是这个索引所在的文件夹
- conf是这个索引的配置信息
检查pendingDeletes中的文件是否已经被删除
public boolean checkPendingDeletions() throws IOException {
deletePendingFiles();
return pendingDeletes.isEmpty() == false;
}
public synchronized void deletePendingFiles() throws IOException {
if (pendingDeletes.isEmpty() == false) {
// TODO: we could fix IndexInputs from FSDirectory subclasses to call this when they are closed?
// Clone the set since we mutate it in privateDeleteFile:
for(String name : new HashSet<>(pendingDeletes)) {
privateDeleteFile(name, true);
}
}
}
pendingDeletes 的存在可以说是专门问windows操作系统服务的,因为windows下,删除一个文件可能因为这个文件正在被其他进程打开而失败,对于这种文件,Lucene会将它添加到pendingDeletes中。Lucene在d文件夹中建立索引之前需要保证这个文件夹下的pendingDeletes都已经被删除,否则索引初始化将失败。
config持有本IndexWriter实例
conf.setIndexWriter(this);
这么做是为了防止这个config被其他索引实例获取,因为一个config实例只能服务与一个索引实例。
获取文件锁
Lucene使用文件锁的方式保证线程安全,Lucene会尝试在d文件夹下创建一个锁文件write.lock。如果这个文件已经存在,说明已经有其他线程在操作索引,需要等待释放。如果write.lock还没被创建,则创建文件,并且返回持有write.lock创建时间的锁对象,表示获取到了文件锁。
获取配置
- analyzer 分析器
- mergeScheduler 合并策略
- codec 编码器
- mode 打开方式,有三种:
1、CREATE(创建一个新的索引,或者覆盖旧的索引),
2、APPEND(打开一个已经存在的索引),
3、 CREATE_OR_APPEND(如果索引不存在则创建,如果已经存在则打开后追加)
初始化段信息
对于第一次创建索引,此时索引目录下还没有记录段信息的segments.*文件。会做以下3件事:
- 1、new 一个SegmentInfos实例,
- 2、然后再复制一个rollbackSegments作为段信息的备份。
- 3、将段信息的版本号加1
- 4、生成段信息的文件名segments
从段信息中获取的数据
pendingNumDocs.set(segmentInfos.totalMaxDoc());
// start with previous field numbers, but new FieldInfos
// NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
globalFieldNumberMap = getFieldNumberMap();
- 1、 获取这个索引下所有段中文档的总数量
- 2、 获取域的相关信息
初始化DocumentsWriter
DocumentsWriter主要是负责添加文档和写段文件。
创建IndexFileDeleter实例
deleter = new IndexFileDeleter(files, directoryOrig, directory,
config.getIndexDeletionPolicy(),
segmentInfos, infoStream, this,
initialIndexExists, reader != null);
IndexFileDeleter的作用是维护内存中的SegmentInfos实例,与文件段信息文件segmentN的映射关系。SegmentInfos存在于内存中,随着索引内容的变化不断更新,每当进行commit操作SegmentInfos的内容就会保存到segmentN文件中。
总结:
Lucene创建索引的步骤如下:
- 1、检查pendingDeletes中的文件是否已经被删除
- 2、获取文件锁
- 3、获取配置
- 4、初始化段信息
- 5、从段信息中获取文档总数和域信息数据
- 6、创建IndexFileDeleter实例用于管理段文件