Learn-as-You-Go Notes (8): Lucene Index Structure in Detail, Part 5 (_N.tis, _N.tii)

Article link: https://blog.csdn.net/a276202460/article/details/5651471

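_N.tis stores the term information for this segment. Because Lucene uses an inverted index, the terms produced by analysis are stored in the .tis file. Each entry records, among other things, the document frequency of the term (how many documents contain it), and the term itself carries the number of the field it came from (FieldNum).

Structure of the .tis file: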

TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos

TIVersion --> UInt32

TermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

TermInfos --> <TermInfo>^TermCount

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

Suffix --> String

PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
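Most fields in a TermInfo are VInts, so it helps to see how one is decoded. The following is a minimal sketch of Lucene's documented VInt encoding (seven data bits per byte, low-order group first, high bit set when more bytes follow). The class name `VIntSketch` and the use of `ByteBuffer` are assumptions for illustration; this is not Lucene's own reader.

```java
import java.nio.ByteBuffer;

class VIntSketch {
    /** Decodes one VInt starting at the buffer's current position. */
    static int readVInt(ByteBuffer buf) {
        byte b = buf.get();
        int value = b & 0x7F;                       // low 7 bits of the first byte
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf.get();                          // high bit was set: another byte follows
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // 128 is encoded as 0x80 0x01, 16 as the single byte 0x10
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {(byte) 0x80, 0x01, 0x10});
        System.out.println(readVInt(buf)); // prints 128
        System.out.println(readVInt(buf)); // prints 16
    }
}
```

Small values such as the prefix lengths and document counts in this article each fit in a single byte.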

Reading the contents of the .tis file (the documentation does not spell out every storage detail; see the org.apache.lucene.index.TermInfosWriter class):

/****************
 *
 *Create Class:ReadTerminfo.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;
public class ReadTerminfo {
	/**
	 * Dumps the header and every TermInfo entry of a .tis file.
	 * IndexFileInput is the reader class of this package (not shown here).
	 * @param args
	 * @throws Exception 
	 */
	public static void main(String[] args) throws Exception {
		 String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
		 IndexFileInput input = null;
		 try{
			 input = new IndexFileInput(indexfile);
			 // header: TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels
			 System.out.println("term index version:"+input.readInt());
			 long termcount = input.readLong();
			 System.out.println("term count:"+termcount);
			 System.out.println("term IndexInterval:"+input.readInt());
			 int skipinterval = input.readInt();
			 System.out.println("term SkipInterval:"+skipinterval);
			 System.out.println("term MaxSkipLevels:"+input.readInt());
			 for(long i = 0 ;i < termcount;i++){
				 System.out.println("*****read term info["+i+"]******");
				 System.out.println("the term share prefixlength is :"+input.readVInt());
				 // the suffix is decoded here as a String on its own
				 System.out.println("term's own stuffix is:"+input.readString());
				 System.out.println("exists this term's field number is:"+input.readVInt());
				 int doccount = input.readVInt();
				 System.out.println("the doc count contain this term is:"+doccount);
				 System.out.println("the position of this term's TermFreqs within the .frq file is:"+input.readVLong());
				 System.out.println("the position of this term's TermPositions within the .prx file is:"+input.readVLong());
				 // SkipDelta is only stored when DocFreq >= SkipInterval
				 if(doccount >= skipinterval)
					 System.out.println("the position of this term's SkipData within the .frq file is:"+input.readVInt());
			 }
			 
		 }finally{
			 // input stays null if the constructor threw
			 if(input != null)
				 input.close();
		 }
   
	}
}

Output:

term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's own stuffix is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's own stuffix is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's own stuffix is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's own stuffix is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's own stuffix is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's own stuffix is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's own stuffix is:g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1

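The output looks partly right and partly wrong. Why are some terms garbled? If prefixes were compared character by character, with Java's String.charAt and an equality check, two different Chinese characters could never share a prefix.

After reading the documentation and then the source code, I found that Lucene computes the shared prefix of two consecutive terms over their UTF-8 bytes.

For example, "做" converted to a UTF-8 byte array is: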

-27
-127
-102

"全" converted to a UTF-8 byte array is:

-27
-123
-88

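Comparing the term '做' with the term '全', the value of byte[0] is the same. Because '做' is stored as the first term, its value is stored as

[3][-27][-127][-102], which is the normal String storage format (a length followed by the bytes). The adjacent term '全' shares byte[0] with it,

so the suffix string of '全' is [2][-123][-88]. It is still stored in String format, but in UTF-8 these two bytes on their own cannot represent a Chinese character, so decoding the suffix alone produces garbage. With pure English (ASCII) text the problem does not appear.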
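The byte values and the shared prefix above can be checked with a few lines of Java. This is a small sketch; the class name `SharedPrefixSketch` and the helper `sharedPrefixLength` are illustrative names, not Lucene API. The characters are written as Unicode escapes (`\u505A` is 做, `\u5168` is 全) so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SharedPrefixSketch {
    /** Number of leading bytes that two byte arrays have in common. */
    static int sharedPrefixLength(byte[] previous, byte[] current) {
        int limit = Math.min(previous.length, current.length);
        int i = 0;
        while (i < limit && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] zuo = "\u505A".getBytes(StandardCharsets.UTF_8);
        byte[] quan = "\u5168".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(zuo));   // prints [-27, -127, -102]
        System.out.println(Arrays.toString(quan));  // prints [-27, -123, -88]

        int shared = sharedPrefixLength(zuo, quan);
        System.out.println(shared);                 // prints 1

        // Decoding only the two suffix bytes does not give back the character.
        byte[] suffix = Arrays.copyOfRange(quan, shared, quan.length);
        String decodedAlone = new String(suffix, StandardCharsets.UTF_8);
        System.out.println(decodedAlone.equals("\u5168")); // prints false
    }
}
```

The same helper gives 11 for the two URL terms in the output, since "http://www." is 11 bytes long.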

The modified code:

/****************
 *
 *Create Class:ReadTerminfo.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;
import java.nio.charset.StandardCharsets;
public class ReadTerminfo {
	/**
	 * Dumps a .tis file, rebuilding each term from the shared prefix of the
	 * previous term plus its own suffix bytes before decoding it as UTF-8.
	 * @param args
	 * @throws Exception 
	 */
	public static void main(String[] args) throws Exception {
		 String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
		 IndexFileInput input = null;
		 try{
			 input = new IndexFileInput(indexfile);
			 System.out.println("term index version:"+input.readInt());
			 long termcount = input.readLong();
			 System.out.println("term count:"+termcount);
			 System.out.println("term IndexInterval:"+input.readInt());
			 int skipinterval = input.readInt();
			 System.out.println("term SkipInterval:"+skipinterval);
			 System.out.println("term MaxSkipLevels:"+input.readInt());
			 int doccount = 0;
			 int prefixlength = 0;
			 String termvalue = null;
			 // UTF-8 bytes of the previous term; empty before the first term
			 byte[] lasttermbyte = new byte[0];
			 int suffixlength;
			 for(long i = 0 ;i < termcount;i++){
				 System.out.println("*****read term info["+i+"]******");
				 prefixlength = input.readVInt();
				 System.out.println("the term share prefixlength is :"+prefixlength);
				 // the suffix is a VInt byte length followed by raw UTF-8 bytes
				 suffixlength = input.readVInt();
				 byte[] suffixbyte = new byte[suffixlength];
				 input.readBytes(suffixbyte, 0, suffixlength);
				 // full term = first prefixlength bytes of the previous term + suffix
				 byte[] termbyte = new byte[prefixlength+suffixlength];
				 System.arraycopy(lasttermbyte, 0, termbyte, 0, prefixlength);
				 System.arraycopy(suffixbyte, 0, termbyte, prefixlength, suffixlength);
				 termvalue = new String(termbyte, StandardCharsets.UTF_8);
				 lasttermbyte = termbyte;
				 System.out.println("term's value is:"+termvalue);
				 System.out.println("exists this term's field number is:"+input.readVInt());
				 doccount = input.readVInt();
				 System.out.println("the doc count contain this term is:"+doccount);
				 System.out.println("the position of this term's TermFreqs within the .frq file is:"+input.readVLong());
				 System.out.println("the position of this term's TermPositions within the .prx file is:"+input.readVLong());
				 // SkipDelta is only stored when DocFreq >= SkipInterval
				 if(doccount >= skipinterval)
					 System.out.println("the position of this term's SkipData within the .frq file is:"+input.readVInt());
			 }
			 
		 }finally{
			 // input stays null if the constructor threw
			 if(input != null)
				 input.close();
		 }
   
	}
}

Output:

term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's value is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's value is:全
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's value is:内
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's value is:国
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's value is:大
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's value is:度
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's value is:引
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's value is:擎
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's value is:最
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's value is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's value is:百
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's value is:的
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's value is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's value is:歌
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's value is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's value is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's value is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's value is:http://www.g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
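The term reconstruction can also be tried without an index file. The sketch below feeds a few (PrefixLength, Suffix) pairs taken from the output above into the same byte-level concatenation. `PrefixDecodeSketch` and `decode` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PrefixDecodeSketch {
    /** Rebuilds full terms from per-term shared-prefix lengths and suffix bytes. */
    static List<String> decode(int[] prefixLengths, byte[][] suffixes) {
        List<String> terms = new ArrayList<>();
        byte[] previous = new byte[0];              // no previous term yet
        for (int i = 0; i < prefixLengths.length; i++) {
            int prefix = prefixLengths[i];
            byte[] term = new byte[prefix + suffixes[i].length];
            System.arraycopy(previous, 0, term, 0, prefix);
            System.arraycopy(suffixes[i], 0, term, prefix, suffixes[i].length);
            terms.add(new String(term, StandardCharsets.UTF_8));
            previous = term;                        // next term shares bytes with this one
        }
        return terms;
    }

    public static void main(String[] args) {
        int[] prefixLengths = {0, 1, 0, 11};
        byte[][] suffixes = {
            {-27, -127, -102},                                        // full bytes of the first term
            {-123, -88},                                              // last two bytes of the second term
            "http://www.baidu.com".getBytes(StandardCharsets.UTF_8),
            "g.cn".getBytes(StandardCharsets.UTF_8),
        };
        for (String term : decode(prefixLengths, suffixes)) {
            System.out.println(term);
        }
    }
}
```

Decoding happens only after the bytes are joined, which is why the second term comes out as a complete character here.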

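.tis: stores the detailed information for each term

.tii: the index over the term information (a page index into the .tis entries; one index entry is created in the .tii file for every 128 terms)

Both files have the same header:

TIVersion --> UInt32: the format version number of the file

TermCount --> UInt64: the number of terms stored in the file. In .tis this is the number of all distinct terms in this segment, regardless of which field they come from. In .tii it is also the number of terms recorded in that file, but that is not all terms: only the last term of each page is recorded (the first entry is empty and the last page has no entry; a page is 128 (IndexInterval) terms).

IndexInterval --> UInt32: the number of terms stored per page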
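To make the paging concrete, here is a rough model of which .tis terms end up with a .tii entry. It assumes the behaviour observed in this article: an empty first entry, then the last term of each full page, and nothing for the final page. `TermIndexSketch` and `indexedOrdinals` are made-up names, and the value -1 is just this sketch's marker for the empty entry.

```java
import java.util.ArrayList;
import java.util.List;

class TermIndexSketch {
    /**
     * For each .tii entry, returns the ordinal of the .tis term it refers to.
     * -1 marks the empty first entry.
     */
    static List<Long> indexedOrdinals(long termCount, int indexInterval) {
        List<Long> entries = new ArrayList<>();
        // An entry is emitted whenever the number of terms written so far
        // is a multiple of indexInterval; it refers to the previous term.
        for (long written = 0; written < termCount; written += indexInterval) {
            entries.add(written - 1);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(indexedOrdinals(22, 128));   // prints [-1]
        System.out.println(indexedOrdinals(300, 128));  // prints [-1, 127, 255]
    }
}
```

With only 22 terms, as in the sample index here, the model yields a single empty entry.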

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

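The meaning of SkipInterval and MaxSkipLevels is tied to how other files are stored, and I do not know the details yet. For reading the .tis and .tii files they matter in only one place: SkipDelta is stored only when DocFreq is at least SkipInterval (16 in this index). I will verify what these values mean when I study the structure of the .frq and .prx files.

After the header comes the detailed information for each term: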
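The five header fields take 24 bytes in total and are stored high-order byte first, so they can be read with plain `java.nio`. The sketch below builds a header in memory from the values printed earlier (-4, 22, 128, 16, 10) and parses it back. `TisHeaderSketch` is a hypothetical class for illustration only.

```java
import java.nio.ByteBuffer;

class TisHeaderSketch {
    final int version;
    final long termCount;
    final int indexInterval;
    final int skipInterval;
    final int maxSkipLevels;

    /** Reads the header fields in file order; ByteBuffer is big-endian by default. */
    TisHeaderSketch(ByteBuffer buf) {
        version = buf.getInt();        // TIVersion
        termCount = buf.getLong();     // TermCount
        indexInterval = buf.getInt();  // IndexInterval
        skipInterval = buf.getInt();   // SkipInterval
        maxSkipLevels = buf.getInt();  // MaxSkipLevels
    }

    public static void main(String[] args) {
        // Build a 24-byte header with the values from the sample index.
        ByteBuffer buf = ByteBuffer.allocate(24);
        buf.putInt(-4).putLong(22L).putInt(128).putInt(16).putInt(10);
        buf.flip();

        TisHeaderSketch header = new TisHeaderSketch(buf);
        System.out.println(header.version + " " + header.termCount + " "
                + header.indexInterval + " " + header.skipInterval + " "
                + header.maxSkipLevels);   // prints -4 22 128 16 10
    }
}
```

For a real file, the buffer could be obtained by wrapping the bytes returned by `java.nio.file.Files.readAllBytes`.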

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

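PrefixLength: the number of bytes (UTF-8) that this term shares with the preceding term

Suffix: the part of the term that is its own. The documentation calls it a string; for English that is unambiguous, but for Chinese it may be an incomplete character. It is stored as a length followed by the UTF-8 bytes of the suffix.

FieldNum: the number of the field the term came from

DocFreq: the number of documents in which this term occurs

FreqDelta: the position of this term's data in the .frq file, stored as the difference from the previous term's position (what is stored at that position will be covered when looking at the .frq file next)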
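Because the Lucene file-format documentation defines FreqDelta (and ProxDelta) as the difference from the previous term's pointer, the absolute positions are a running sum of the values printed above. A minimal sketch, using the first eight deltas from the output; `DeltaSketch` and `toAbsolute` are illustrative names.

```java
import java.util.Arrays;

class DeltaSketch {
    /** Turns per-term deltas into absolute file positions by accumulating them. */
    static long[] toAbsolute(long[] deltas) {
        long[] positions = new long[deltas.length];
        long running = 0;                   // the first term is relative to position 0
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            positions[i] = running;
        }
        return positions;
    }

    public static void main(String[] args) {
        long[] freqDeltas = {0, 1, 1, 1, 1, 2, 1, 3};   // terms 0..7 in the output
        System.out.println(Arrays.toString(toAbsolute(freqDeltas)));
        // prints [0, 1, 2, 3, 4, 6, 7, 10]
    }
}
```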