A Deep Dive into Lucene 3.0 Index Segments

datree

A Lucene index consists of one or more segments. Each segment consists of documents, each document consists of fields, and each field consists of terms.
The code that generates the index:


		// Create two Document objects
		File f1 = new File("d:/lucene/demo1.txt");
		File f2 = new File("d:/lucene/demo2.txt");
		Document doc1 = new Document();
		doc1.add(new Field("path", f1.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc1.add(new Field("content", new FileReader(f1)));
		Document doc2 = new Document();
		doc2.add(new Field("path", f2.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc2.add(new Field("content", new FileReader(f2)));
		// Create the index writer
		IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath),
				new StandardAnalyzer(Version.LUCENE_30), true,
				IndexWriter.MaxFieldLength.LIMITED);
		// Whether to use the compound file format
		writer.setUseCompoundFile(false);
		writer.addDocument(doc1);
		writer.addDocument(doc2);
		writer.optimize();
		writer.close();

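Index files generated in the test: _0.fdt, _0.fdx, _0.fnm, _0.frq, _0.nrm, _0.prx, _0.tii, _0.tis, segments.gen, segments_2
Index files generated in the test with the compound format enabled: _0.cfs, _0.cfx, segments.gen, segments_2
Whether or not the compound format is used, the two files whose names start with "segments" have the same format. They store the detailed segment information and are the main subject of the discussion below.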

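1. The segments.gen file
The format of this file is simple:
version: the format version, 4 bytes. The current value is -2
gen0: the generation, 8 bytes
gen1: the same generation written a second time, 8 bytes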

The code that defines the version number:
See line 61 of the org.apache.lucene.index.SegmentInfos class
public static final int FORMAT_LOCKLESS = -2; // declared final, so it cannot be modified

The code that reads this file:
See lines 594 - 604 of the org.apache.lucene.index.SegmentInfos class


       int version = genInput.readInt();
       if (version == FORMAT_LOCKLESS) {
         long gen0 = genInput.readLong();
         long gen1 = genInput.readLong();
         message("fallback check: " + gen0 + "; " + gen1);
         if (gen0 == gen1) {
           // The file is consistent.
           genB = gen0;
           break;
         }
       }

The code that writes this file:
See lines 849 - 856 of the org.apache.lucene.index.SegmentInfos class


      IndexOutput genOutput = dir.createOutput(IndexFileNames.SEGMENTS_GEN);
      try {
        genOutput.writeInt(FORMAT_LOCKLESS);
        genOutput.writeLong(generation);
        genOutput.writeLong(generation);
      } finally {
        genOutput.close();
      }

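The hexadecimal content of the segments.gen file generated in the test has three parts:
1. FFFFFFFE is the version number, 4 bytes, -2 in decimal
2. 0000000000000002 is gen0, 8 bytes, 2 in decimal
3. 0000000000000002 is gen1, 8 bytes
So the file size is 20 bytes in total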
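
The 20-byte layout above can be reproduced without Lucene. The sketch below is only an illustration: the class name SegmentsGenDemo and its helper methods are made up for this post, and it assumes the big-endian byte order that the hex dump shows. It uses java.nio.ByteBuffer in place of Lucene's IndexOutput and IndexInput.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: mimics the segments.gen layout with plain JDK classes.
final class SegmentsGenDemo {
    static final int FORMAT_LOCKLESS = -2; // same value as SegmentInfos.FORMAT_LOCKLESS

    // version (4 bytes) + generation (8 bytes) + generation again (8 bytes), big-endian
    static byte[] encode(long generation) {
        return ByteBuffer.allocate(20)
                .putInt(FORMAT_LOCKLESS)
                .putLong(generation)
                .putLong(generation)
                .array();
    }

    // Returns the generation if the two copies agree, otherwise -1.
    static long decode(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        if (in.getInt() != FORMAT_LOCKLESS) {
            return -1;
        }
        long gen0 = in.getLong();
        long gen1 = in.getLong();
        return gen0 == gen1 ? gen0 : -1;
    }

    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = encode(2);
        System.out.println(toHex(data) + " (" + data.length + " bytes), generation " + decode(data));
    }
}
```

Writing the generation twice is what lets a reader detect a half-written file: if the two copies differ, the file is treated as inconsistent.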

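2. The segments_N file
The format of this file is more complex:
FORMAT: the version of the index file format. An int, 4 bytes.
version: the version of the index. It is incremented each time IndexWriter commits changes to the index files, and its initial value is the current time. A long, 8 bytes.
counter: used to generate the name of the next new segment. An int, 4 bytes.
infos: the number of segments. An int, 4 bytes.
info: the information for each segment:
name: the segment name. The first byte is the number of bytes that follow, so the size depends on the length of the name.
docCount: the number of documents in the segment. An int, 4 bytes.
delGen: the generation of the .del file. A long, 8 bytes.
docStoreOffset: if the segment shares stored fields and term vectors with other segments, this is the offset into the shared store; otherwise it is -1. An int, 4 bytes.
docStoreSegment: the name of the segment that holds the shared stored fields and term vectors. It is written only when docStoreOffset is not -1, and its size depends on the length of the name.
docStoreIsCompoundFile: whether the shared data is stored in a *.cfx file. 1 byte, written only when docStoreOffset is not -1.
hasSingleNormFile: whether the norms are stored in a single file. 1 byte.
normGen: if each field has its own norm file, this array holds the generation of each file. A 4-byte count comes first, followed by 8 bytes per file, so the size depends on the number of files.
IsCompoundFile: whether the segment is stored as a compound file. 1 byte.
delCount: the number of deleted documents in the segment. An int, 4 bytes.
hasProx: 1 if at least one field in the segment has omitTf set to false, that is, term frequency (and position) data must be stored; otherwise 0. 1 byte.
diagnostics: diagnostic information, stored as a map. The size depends on the number of entries; an empty map takes only the 4 bytes of a zero count.
userData: user data, stored as a map. The size depends on the number of entries; an empty map takes only the 4 bytes of a zero count.
checksum: the checksum. A long, 8 bytes.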

The code that writes the segment information:
1. See lines 338 - 347 of the org.apache.lucene.index.SegmentInfos class


      segnOutput.writeInt(CURRENT_FORMAT); // write FORMAT
      segnOutput.writeLong(++version); // every write changes
                                   // the index
      segnOutput.writeInt(counter); // write counter
      segnOutput.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        info(i).write(segnOutput); // see item 2
      }
      segnOutput.writeStringStringMap(userData);// see item 4
      segnOutput.prepareCommit();// the long checksum is written here

2. See lines 540 - 564 of the org.apache.lucene.index.SegmentInfo class


  void write(IndexOutput output)
    throws IOException {
    output.writeString(name);// see item 3
    output.writeInt(docCount);
    output.writeLong(delGen);
    output.writeInt(docStoreOffset);
    if (docStoreOffset != -1) {
      output.writeString(docStoreSegment);
      output.writeByte((byte) (docStoreIsCompoundFile ? 1:0));
    }

    output.writeByte((byte) (hasSingleNormFile ? 1:0));
    if (normGen == null) {
      output.writeInt(NO);
    } else {
      output.writeInt(normGen.length);
      for(int j = 0; j < normGen.length; j++) {
        output.writeLong(normGen[j]);
      }
    }
    output.writeByte(isCompoundFile);
    output.writeInt(delCount);
    output.writeByte((byte) (hasProx ? 1:0));
    output.writeStringStringMap(diagnostics); // see items 4 and 5
  }

3. See lines 103 - 107 of the org.apache.lucene.store.IndexOutput class


  public void writeString(String s) throws IOException {
    UnicodeUtil.UTF16toUTF8(s, 0, s.length(), utf8Result);
    writeVInt(utf8Result.length);// write the length of the name
    writeBytes(utf8Result.result, 0, utf8Result.length);// write the bytes of the name, utf8Result.length bytes in total
  }
  public void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {// is any bit above the low 7 bits set?
      writeByte((byte)((i & 0x7f) | 0x80));// set the 8th bit to 1 to signal that more bytes follow
      i >>>= 7;// unsigned (logical) right shift by 7 bits
    }
    writeByte((byte)i);
  }
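
To see what these two methods produce, here is a small stand-alone sketch. The class VIntDemo and its methods are hypothetical names used only for this illustration; the loop is the same as in writeVInt above, and the string encoding assumes that standard UTF-8 gives the same bytes as UnicodeUtil.UTF16toUTF8 for ordinary text such as a segment name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: a VInt and length-prefixed string encoder on plain byte arrays.
final class VIntDemo {
    // Same loop as writeVInt: 7 payload bits per byte, high bit set while more bytes follow.
    static byte[] encodeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
        return out.toByteArray();
    }

    // Inverse of encodeVInt: rebuild the value from the low 7 bits of each byte.
    static int decodeVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    // Length as a VInt, then the UTF-8 bytes, as in writeString.
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] length = encodeVInt(utf8.length);
        out.write(length, 0, length.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeVInt(2)));      // 02
        System.out.println(toHex(encodeVInt(300)));    // AC02
        System.out.println(toHex(encodeString("_0"))); // 025F30
    }
}
```

Values below 128 fit in a single byte, which is why the length prefix of a short segment name takes only one byte.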

4. See lines 214 - 223 of the org.apache.lucene.store.IndexOutput class


  public void writeStringStringMap(Map<String,String> map) throws IOException {
    if (map == null) {
      writeInt(0);
    } else {
      writeInt(map.size());
      for(final Map.Entry<String, String> entry: map.entrySet()) {
        writeString(entry.getKey()); // see item 3
        writeString(entry.getValue());
      }
    }
  }
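
The map layout is therefore a 4-byte entry count followed by key and value strings. The following sketch imitates it with JDK classes only; StringMapDemo and the sample entry ("os" mapped to "Linux") are made up for illustration and are not taken from the test index.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class: writes a String-to-String map in the layout shown above.
final class StringMapDemo {
    static void writeInt(ByteArrayOutputStream out, int value) {
        out.write(ByteBuffer.allocate(4).putInt(value).array(), 0, 4); // big-endian int
    }

    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A null map and an empty map both end up as a single zero count.
    static byte[] encode(Map<String, String> map) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (map == null) {
            writeInt(out, 0);
        } else {
            writeInt(out, map.size());
            for (Map.Entry<String, String> entry : map.entrySet()) {
                writeString(out, entry.getKey());
                writeString(out, entry.getValue());
            }
        }
        return out.toByteArray();
    }

    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> sample = new LinkedHashMap<String, String>();
        sample.put("os", "Linux"); // made-up entry, for illustration only
        System.out.println(toHex(encode(null)));   // 00000000
        System.out.println(toHex(encode(sample))); // 00000001026F73054C696E7578
    }
}
```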

5. See lines 4159 - 4170 of the org.apache.lucene.index.IndexWriter class


    Map<String,String> diagnostics = new HashMap<String,String>();
    diagnostics.put("source", source);
    diagnostics.put("lucene.version", Constants.LUCENE_VERSION); // take a look at the Constants class: the OS and Java values below come from Java system properties
    diagnostics.put("os", Constants.OS_NAME+"");
    diagnostics.put("os.arch", Constants.OS_ARCH+"");
    diagnostics.put("os.version", Constants.OS_VERSION+"");
    diagnostics.put("java.version", Constants.JAVA_VERSION+"");
    diagnostics.put("java.vendor", Constants.JAVA_VENDOR+"");
    if (details != null) {
      diagnostics.putAll(details);
    }
    info.setDiagnostics(diagnostics);

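The hexadecimal content of the segments_2 file generated in the test is as follows.
First comes the information common to all segments
1. FFFFFFF7 is the version of the index file format, -9 in decimal
2. 00000130A66E4ECA is the version of the index; the conversion below shows that it is the time the index was created
// drop the leading zeros and declare the value as a long
long i = 0x130A66E4ECAL;
// convert to a date; prints 2011-6-19 13:45:04
System.out.println(new Date(i).toLocaleString());
3. 00000001 is the counter used to name the next new segment; at this point there is only one segment, named _0
4. 00000001 is the number of segments in the index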
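
Date.toLocaleString() is deprecated and its output depends on the machine's locale and time zone. As an alternative, the sketch below decodes the same value with java.time and a fixed offset. The class IndexVersionDemo is a hypothetical name, and the UTC+8 offset is an assumption inferred from the 13:45:04 output above, not something recorded in the index.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class: decodes the index version as epoch milliseconds.
final class IndexVersionDemo {
    // Format an epoch-millisecond value at a fixed UTC offset given in hours.
    static String format(long epochMillis, int offsetHours) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .format(Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.ofHours(offsetHours)));
    }

    public static void main(String[] args) {
        long version = 0x130A66E4ECAL;
        System.out.println(format(version, 0)); // 2011-06-19 05:45:04 (UTC)
        System.out.println(format(version, 8)); // 2011-06-19 13:45:04 (UTC+8, assumed)
    }
}
```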

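Next comes the detailed information for each segment
5. 025F30 is the segment name: 02 is the number of bytes that follow, and 5F30 is the UTF-8 encoding of _0
6. 00000002 is the number of documents in the segment; the test uses 2 documents
7. FFFFFFFFFFFFFFFF is the generation of the .del file (a long, 8 bytes); it defaults to -1 when no documents have been deleted
8. 00000000 is docStoreOffset, the offset into the stored fields and term vectors shared with other segments.
9. 025F30 is docStoreSegment, the name of the segment that holds the shared data: 02 is the number of bytes that follow, and 5F30 is the UTF-8 encoding of _0
10. 00 is docStoreIsCompoundFile: the shared data above is not stored in a compound file
11. 01 is hasSingleNormFile: the norms are stored in a single file
12. FFFFFFFF is the number of normGen entries. normGen is null in this test (there are no separate per-field norm files), so -1 is written
13. FF is IsCompoundFile, whether this segment is stored as a compound file; -1 means no
14. 00000000 is the number of deleted documents; nothing was deleted in the test, so it is 0
15. 01 is hasProx: term frequency data must be stored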

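Next come the diagnostics and the user data
16. 00000007 is the number of diagnostic entries, which are stored in a Map. Here it is 7, followed by the keys and values of the 7 entries
Viewed in UltraEdit, the last byte of this item is 2E, which is the final period of "Sun Microsystems Inc."
17. 00000000 is the number of user data entries; user data has the same structure as the diagnostics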

Finally, the checksum
18. 00000000B171E8F7 is the checksum of the segments_2 file
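
As a cross-check of the walk-through, the fields can be parsed back with a few lines of plain Java. This is only a sketch: SegmentsHeaderDemo is a hypothetical class, and its HEX constant is reassembled by hand from the values listed above (with delGen taken as a full 8-byte long), not copied from the real file. The diagnostics, user data and checksum are left out because their raw bytes are not listed in this post, so only the common header and the first segment are read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: parses the segments_N fields in the order SegmentInfo.write emits them.
final class SegmentsHeaderDemo {
    // Reassembled from the values in the walk-through; diagnostics, userData and checksum omitted.
    static final String HEX =
            "FFFFFFF7" + "00000130A66E4ECA" + "00000001" + "00000001"   // FORMAT, version, counter, infos
          + "025F30" + "00000002" + "FFFFFFFFFFFFFFFF"                  // name, docCount, delGen
          + "00000000" + "025F30" + "00"                                // docStoreOffset, docStoreSegment, docStoreIsCompoundFile
          + "01" + "FFFFFFFF" + "FF" + "00000000" + "01";               // hasSingleNormFile, normGen count, IsCompoundFile, delCount, hasProx

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[readVInt(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Reads the common header and the first segment only.
    static String describe(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data); // big-endian by default
        StringBuilder sb = new StringBuilder();
        sb.append("format=").append(in.getInt());
        sb.append(" version=").append(in.getLong());
        sb.append(" counter=").append(in.getInt());
        sb.append(" segments=").append(in.getInt());
        sb.append(" name=").append(readString(in));
        sb.append(" docCount=").append(in.getInt());
        sb.append(" delGen=").append(in.getLong());
        int docStoreOffset = in.getInt();
        sb.append(" docStoreOffset=").append(docStoreOffset);
        if (docStoreOffset != -1) { // the next two fields exist only for a shared doc store
            sb.append(" docStoreSegment=").append(readString(in));
            sb.append(" docStoreIsCompoundFile=").append(in.get());
        }
        sb.append(" hasSingleNormFile=").append(in.get());
        int numNormGen = in.getInt();
        sb.append(" numNormGen=").append(numNormGen);
        for (int j = 0; j < numNormGen; j++) {
            in.getLong(); // skip the per-field norm generations
        }
        sb.append(" isCompoundFile=").append(in.get());
        sb.append(" delCount=").append(in.getInt());
        sb.append(" hasProx=").append(in.get());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(fromHex(HEX)));
    }
}
```

The reads mirror the writes in SegmentInfo.write one for one, which makes it easy to see why docStoreSegment and docStoreIsCompoundFile appear in this file: docStoreOffset is 0 rather than -1.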
