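Learning Notes (8): Lucene Index Structure in Detail, Part 5 (_N.tis, _N.tii)

_N.tis stores the term information for this segment. Because Lucene uses an inverted index, the terms produced by analysis are stored in the .tis file. Each term entry records the term's document frequency (the number of documents that contain it), the number of the field the term came from (FieldNum), and related information.

Structure of the .tis file: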

TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos

TIVersion --> UInt32

TermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

TermInfos --> <TermInfo>^TermCount

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

Suffix --> String

PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
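
Most fields above are VInts, Lucene's variable-length integers: each byte carries seven data bits, low-order group first, and the high bit is set on every byte except the last. The following is only a minimal sketch of a decoder, assuming the data comes from a plain java.io.InputStream; the class VIntDemo and its method are hypothetical names for illustration, not Lucene APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Lucene.
class VIntDemo {
    // Decodes one VInt: 7 data bits per byte, high bit means "more bytes follow".
    static int readVInt(InputStream in) throws IOException {
        int value = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("truncated VInt");
            }
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    public static void main(String[] args) throws IOException {
        // 128 is encoded as the two bytes 0x80 0x01; 16 fits in one byte.
        byte[] data = {(byte) 0x80, 0x01, 0x10};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(readVInt(in)); // 128
        System.out.println(readVInt(in)); // 16
    }
}
```

Small values such as a DocFreq of 1 therefore take a single byte on disk.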

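Reading the contents of the .tis file (the documentation does not spell out every detail of how the file is written; see the org.apache.lucene.index.TermInfosWriter class):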

 

/****************
 *
 *Create Class:ReadTerminfo.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;
public class ReadTerminfo {
	/**
	 * Dumps the header and every TermInfo entry of a .tis file.
	 * @param args
	 * @throws Exception 
	 */
	public static void main(String[] args) throws Exception {
		 String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
		 IndexFileInput input = null;
		 try{
			 input = new IndexFileInput(indexfile);
			 // Header: TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels
			 System.out.println("term index version:"+input.readInt());
			 long termcount = input.readLong();
			 System.out.println("term count:"+termcount);
			 System.out.println("term IndexInterval:"+input.readInt());
			 int skipinterval = input.readInt();
			 System.out.println("term SkipInterval:"+skipinterval);
			 System.out.println("term MaxSkipLevels:"+input.readInt());
			 for(long i = 0 ;i < termcount;i++){
				 System.out.println("*****read term info["+i+"]******");
				 // Term: PrefixLength, Suffix, FieldNum
				 System.out.println("the term share prefixlength is :"+input.readVInt());
				 System.out.println("term's own stuffix is:"+input.readString());
				 System.out.println("exists this term's field number is:"+input.readVInt());
				 int doccount = input.readVInt();
				 System.out.println("the doc count contain this term is:"+doccount);
				 // FreqDelta and ProxDelta are deltas against the previous term's pointers
				 System.out.println("the position of this term's TermFreqs within the .frq file is:"+input.readVLong());
				 System.out.println("the position of this term's TermPositions within the .prx file is:"+input.readVLong());
				 // SkipDelta is only stored when DocFreq >= SkipInterval (was hard-coded as 16)
				 if(doccount >= skipinterval){
					 System.out.println("the position of this term's SkipData within the .frq file is:"+input.readVInt());
				 }
			 }
			 
		 }finally{
			 // input stays null if the constructor throws
			 if(input != null){
				 input.close();
			 }
		 }
   
	}
}

Output:

 

term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's own stuffix is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's own stuffix is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's own stuffix is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's own stuffix is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's own stuffix is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's own stuffix is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's own stuffix is:g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1

 

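The output looks partly right and partly wrong: why are some of the terms garbled? If the prefix were computed on Java String characters (charAt plus an equality check), two different Chinese characters could never share a prefix.

After rereading the documentation and then the source, I found that Lucene computes the common prefix of two adjacent terms on their UTF-8 bytes.

For example, "做" converted to a UTF-8 byte array is:

-27
-127
-102

"全" converted to a UTF-8 byte array is:

-27
-123
-88

Comparing term '做' with term '全', byte[0] is the same. Because '做' is stored as the first term, its value is written as

[3][-27][-127][-102], which is the ordinary String storage format (a length followed by the bytes). The adjacent term '全' shares byte[0] with it,

so the suffix stored for '全' is [2][-123][-88]. It is still written in String format, but these are two UTF-8 continuation bytes without their lead byte, so on their own they cannot be decoded into a Chinese character and print as ??. Terms in plain English do not have this problem, because every ASCII character is exactly one byte.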
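
The byte-level comparison described above can be reproduced outside Lucene. This is only a sketch under that description: SharedPrefixDemo and sharedPrefixLength are hypothetical names, and the two characters are written as Unicode escapes (\u505A is 做, \u5168 is 全) so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Lucene.
class SharedPrefixDemo {
    // Number of leading bytes that two byte arrays have in common.
    static int sharedPrefixLength(byte[] previous, byte[] current) {
        int limit = Math.min(previous.length, current.length);
        int i = 0;
        while (i < limit && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] previous = "\u505A".getBytes(StandardCharsets.UTF_8); // 做
        byte[] current = "\u5168".getBytes(StandardCharsets.UTF_8);  // 全
        int prefix = sharedPrefixLength(previous, current);
        byte[] suffix = Arrays.copyOfRange(current, prefix, current.length);

        System.out.println(Arrays.toString(previous)); // [-27, -127, -102]
        System.out.println(Arrays.toString(current));  // [-27, -123, -88]
        System.out.println(prefix);                    // 1
        System.out.println(Arrays.toString(suffix));   // [-123, -88]
    }
}
```

The two-byte suffix is exactly what the first version of the reader tried to print as a string.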

 

The modified code:

 

/****************
 *
 *Create Class:ReadTerminfo.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;
public class ReadTerminfo {
	/**
	 * Dumps a .tis file, rebuilding each term from the shared UTF-8 prefix.
	 * @param args
	 * @throws Exception 
	 */
	public static void main(String[] args) throws Exception {
		 String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
		 IndexFileInput input = null;
		 try{
			 input = new IndexFileInput(indexfile);
			 System.out.println("term index version:"+input.readInt());
			 long termcount = input.readLong();
			 System.out.println("term count:"+termcount);
			 System.out.println("term IndexInterval:"+input.readInt());
			 int skipinterval = input.readInt();
			 System.out.println("term SkipInterval:"+skipinterval);
			 System.out.println("term MaxSkipLevels:"+input.readInt());
			 int doccount = 0;
			 int prefixlength = 0;
			 String termvalue = null;
			 // UTF-8 bytes of the previous term; empty before the first term
			 byte[] lasttermbyte = new byte[0];
			 int suffixlength;
			 for(long i = 0 ;i < termcount;i++){
				 System.out.println("*****read term info["+i+"]******");
				 prefixlength = input.readVInt();
				 System.out.println("the term share prefixlength is :"+prefixlength);
				 // Suffix is a VInt byte length followed by that many UTF-8 bytes
				 suffixlength = input.readVInt();
				 byte[] suffixbyte = new byte[suffixlength];
				 input.readBytes(suffixbyte, 0, suffixlength);
				 // Full term = first prefixlength bytes of the previous term + suffix bytes
				 byte[] termbyte = new byte[prefixlength+suffixlength];
				 System.arraycopy(lasttermbyte, 0, termbyte, 0, prefixlength);
				 System.arraycopy(suffixbyte, 0, termbyte, prefixlength, suffixlength);
				 termvalue = new String(termbyte,"UTF-8");
				 lasttermbyte = termbyte;
				 System.out.println("term's value is:"+termvalue);
				 System.out.println("exists this term's field number is:"+input.readVInt());
				 doccount = input.readVInt();
				 System.out.println("the doc count contain this term is:"+doccount);
				 System.out.println("the position of this term's TermFreqs within the .frq file is:"+input.readVLong());
				 System.out.println("the position of this term's TermPositions within the .prx file is:"+input.readVLong());
				 // SkipDelta is only stored when DocFreq >= SkipInterval (was hard-coded as 16)
				 if(doccount >= skipinterval){
					 System.out.println("the position of this term's SkipData within the .frq file is:"+input.readVInt());
				 }
			 }
			 
		 }finally{
			 // input stays null if the constructor throws
			 if(input != null){
				 input.close();
			 }
		 }
   
	}
}

 

Output:

 

term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's value is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's value is:全
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's value is:内
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's value is:国
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's value is:大
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's value is:度
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's value is:引
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's value is:擎
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's value is:最
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's value is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's value is:百
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's value is:的
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's value is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's value is:歌
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's value is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's value is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's value is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's value is:http://www.g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1

 

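Tis: stores the detailed information for each term

Tii: the index file over the term details (a page index into the .tis file; one index entry is created in .tii for every 128 terms)

Both files have the same header:

TIVersion --> UInt32   format version number of the file

TermCount --> UInt64   number of terms stored in the file. In .tis this is the number of all distinct terms in this segment, regardless of which field they come from. In .tii it is also the number of terms in that file, but .tii holds only a subset: the last term of each page, where a page is 128 (IndexInterval) terms. The entry for the first page is an empty term, and the last page has no entry.

IndexInterval --> UInt32   number of terms per page

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

The meaning of SkipInterval and MaxSkipLevels is tied to other index files, and I do not know the details yet. For reading the .tis and .tii files they matter in only one place: SkipDelta is stored only when DocFreq is not smaller than SkipInterval (16 in this index). I will check what these values mean when I get to the structure of the .frq and .prx files.

After the header comes the detailed information for each term:

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

PrefixLength: the number of leading bytes (UTF-8) that this term shares with the previous term

Suffix: the part that belongs to this term only. The documentation calls it a String, which is unproblematic for English, but for Chinese it may be an incomplete character. What is stored is a length followed by the UTF-8 bytes of the suffix.

FieldNum: the number of the field the term comes from

DocFreq: the number of documents that contain this term

FreqDelta: the position of this term's data in the .frq file, stored as the difference from the previous term's position (the first term is relative to 0). What is stored at that position will be covered when looking at the .frq file.

ProxDelta works like FreqDelta, but for the .prx file. SkipDelta is position information too: it locates this term's skip data within the .frq file. Giving a position amounts to setting up a pointer to the data at that position, so these fields are indexes as well.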
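
Because FreqDelta and ProxDelta are stored relative to the previous term, the values printed by the reader have to be summed to get absolute file positions. A minimal sketch of that accumulation, assuming (as the Lucene file-format documentation describes) that each delta is relative to the previous term's pointer and the first one is relative to 0; DeltaDemo and toAbsolute are hypothetical names, and the sample input is the first six FreqDelta values from the output above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Lucene.
class DeltaDemo {
    // Turns a sequence of pointer deltas into absolute file positions.
    static long[] toAbsolute(long[] deltas) {
        long[] pointers = new long[deltas.length];
        long last = 0; // the first delta is relative to position 0
        for (int i = 0; i < deltas.length; i++) {
            last += deltas[i];
            pointers[i] = last;
        }
        return pointers;
    }

    public static void main(String[] args) {
        // FreqDelta values of terms 0..5 in the output above
        long[] freqDeltas = {0, 1, 1, 1, 1, 2};
        System.out.println(Arrays.toString(toAbsolute(freqDeltas)));
        // [0, 1, 2, 3, 4, 6]
    }
}
```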

 

Content layout of the two files:

[Figure: content layout of the .tis and .tii files]

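In the figure, when the first term is written to .tis, an empty term entry is written to .tii.

If .tis holds exactly 128*n terms, the last term of the final page is not recorded in the .tii file. Once the .frq, .prx, and .nrm files have been read as well, the whole search process and the structure created at indexing time should become clear.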
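
The paging rule can be sketched in a few lines. This follows only the behaviour described in this post (an index entry is written just before terms 0, 128, 256, ... are added to .tis, and it holds the previously written term); TiiPagingDemo and indexedOrdinals are hypothetical names, and -1 stands for the empty term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
class TiiPagingDemo {
    // Ordinals of the .tis terms that get an entry in .tii; -1 is the empty term.
    static List<Long> indexedOrdinals(long termCount, int indexInterval) {
        List<Long> ordinals = new ArrayList<>();
        for (long size = 0; size < termCount; size += indexInterval) {
            // The entry written before term number `size` holds the previous term.
            ordinals.add(size - 1);
        }
        return ordinals;
    }

    public static void main(String[] args) {
        System.out.println(indexedOrdinals(22, 128));  // [-1]
        System.out.println(indexedOrdinals(256, 128)); // [-1, 127]
        System.out.println(indexedOrdinals(257, 128)); // [-1, 127, 255]
    }
}
```

With exactly 256 terms, term 255 (the last term of the final page) gets no entry, which matches the 128*n case above.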

These notes are written to the blog as I learn; corrections and criticism are welcome.

 
