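_N.tis stores the term information for this segment. Because Lucene uses an inverted index, the terms produced by analysis are stored in the .tis file. Each term's entry records its document frequency (how many documents contain the term), and each term's own data includes the number of the field it came from (fieldnum).
Structure of the .tis file: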
TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
TermInfos --> <TermInfo> TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
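Most of the fields in the grammar above are VInts. As a minimal sketch of how such a value can be decoded, assuming the encoding described in the Lucene file-format documentation (seven data bits per byte, low-order group first, high bit set while more bytes follow): the class and method names here are hypothetical and are not the author's IndexFileInput.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Lucene or of the author's code.
class VIntSketch {

    /** Decodes one variable-length int starting at the buffer's current position. */
    static int readVInt(ByteBuffer buf) {
        byte b = buf.get();
        int value = b & 0x7F;                      // low seven bits of the first byte
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf.get();                         // high bit was set: another byte follows
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Convenience wrapper so the examples can be written as plain byte values. */
    static int decode(int... bytes) {
        ByteBuffer buf = ByteBuffer.allocate(bytes.length);
        for (int v : bytes) {
            buf.put((byte) v);
        }
        buf.flip();
        return readVInt(buf);
    }

    public static void main(String[] args) {
        System.out.println(decode(0x7F));          // 127 fits in one byte
        System.out.println(decode(0x80, 0x01));    // 128 needs two bytes
        System.out.println(decode(0xAC, 0x02));    // 300
    }
}
```

Small values such as the prefix lengths and document counts in this index therefore take a single byte each.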
Reading the contents of the .tis file (the documentation does not spell out every storage detail; see the org.apache.lucene.index.TermInfosWriter class):
/****************
 *
 *Create Class:ReadTermIndex.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;

// IndexFileInput is used without an import, so it lives in this same package;
// its source is not shown here.
public class ReadTerminfo {
    /**
     * Dumps the header and every TermInfo entry of one .tis file.
     * @param args unused
     * @throws Exception if the file cannot be opened or read
     */
    public static void main(String[] args) throws Exception {
        String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
        IndexFileInput input = null;
        try {
            input = new IndexFileInput(indexfile);
            // Header: TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels
            System.out.println("term index version:" + input.readInt());
            long termcount = input.readLong();
            System.out.println("term count:" + termcount);
            System.out.println("term IndexInterval:" + input.readInt());
            int skipinterval = input.readInt();
            System.out.println("term SkipInterval:" + skipinterval);
            System.out.println("term MaxSkipLevels:" + input.readInt());
            for (long i = 0; i < termcount; i++) {
                System.out.println("*****read term info[" + i + "]******");
                // Term: PrefixLength, Suffix, FieldNum
                System.out.println("the term share prefixlength is :" + input.readVInt());
                System.out.println("term's own stuffix is:" + input.readString());
                System.out.println("exists this term's field number is:" + input.readVInt());
                int doccount = input.readVInt();
                System.out.println("the doc count contain this term is:" + doccount);
                // FreqDelta and ProxDelta
                System.out.println("the position of this term's TermFreqs within the .frq file is:" + input.readVLong());
                System.out.println("the position of this term's TermPositions within the .prx file is:" + input.readVLong());
                // SkipDelta is written only when DocFreq >= SkipInterval (16 in this index)
                if (doccount >= skipinterval) {
                    System.out.println("the position of this term's SkipData within the .frq file is:" + input.readVInt());
                }
            }
        } finally {
            // input is still null if the constructor threw
            if (input != null) {
                input.close();
            }
        }
    }
}
Output:
term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's own stuffix is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's own stuffix is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's own stuffix is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's own stuffix is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's own stuffix is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's own stuffix is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's own stuffix is:g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
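The output looks right at first glance, yet also wrong: why is some of it garbled? If prefixes were compared the way Java strings are, with charAt and an equality check, two different Chinese characters could never share a prefix.
After rereading the documentation and then the source code, I found that Lucene computes the common prefix of two adjacent terms on their UTF-8 bytes.
For example, "做" encoded as UTF-8 is the byte array:
-27
-127
-102
"全" encoded as UTF-8 is:
-27
-123
-88
Comparing term '做' with term '全', byte[0] is the same. Because '做' is stored as the first term, its value is written as
[3][-27][-127][-102], which is the ordinary String storage format (a length followed by the bytes). The adjacent term '全' shares byte[0],
so its suffix string is written as [2][-123][-88]. That is still String format, but in UTF-8 those two bytes on their own cannot form a Chinese character, so decoding them directly produces garbage. With pure English text the problem does not show up, because every character is a single byte.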
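The byte-level comparison described above can be reproduced in a few lines. This is a sketch only, with hypothetical class and method names; it assumes nothing about Lucene beyond what the post states, namely that the shared prefix is counted in UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative, not Lucene API.
class Utf8PrefixSketch {

    /** Number of leading bytes that two byte arrays have in common. */
    static int sharedPrefix(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** Rebuilds a term from the previous term's bytes, the prefix length and the stored suffix. */
    static String rebuild(byte[] previous, int prefix, byte[] suffix) {
        byte[] full = new byte[prefix + suffix.length];
        System.arraycopy(previous, 0, full, 0, prefix);
        System.arraycopy(suffix, 0, full, prefix, suffix.length);
        return new String(full, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] prev = "\u505A".getBytes(StandardCharsets.UTF_8); // the term 做
        byte[] cur = "\u5168".getBytes(StandardCharsets.UTF_8);  // the term 全
        System.out.println(Arrays.toString(prev));               // [-27, -127, -102]
        System.out.println(Arrays.toString(cur));                // [-27, -123, -88]

        int prefix = sharedPrefix(prev, cur);
        byte[] suffix = Arrays.copyOfRange(cur, prefix, cur.length);
        System.out.println(prefix);                              // 1

        // Decoding the suffix alone is what produced the garbled output.
        System.out.println(new String(suffix, StandardCharsets.UTF_8));
        // Decoding prefix + suffix together restores the term.
        System.out.println(rebuild(prev, prefix, suffix));
    }
}
```

The two suffix bytes are UTF-8 continuation bytes without a lead byte, which is why they cannot be decoded on their own.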
The revised code:
/****************
 *
 *Create Class:ReadTermIndex.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;

// IndexFileInput is used without an import, so it lives in this same package;
// its source is not shown here.
public class ReadTerminfo {
    /**
     * Dumps a .tis file, rebuilding each term from the previous term's UTF-8 bytes.
     * @param args unused
     * @throws Exception if the file cannot be opened or read
     */
    public static void main(String[] args) throws Exception {
        String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
        IndexFileInput input = null;
        try {
            input = new IndexFileInput(indexfile);
            // Header: TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels
            System.out.println("term index version:" + input.readInt());
            long termcount = input.readLong();
            System.out.println("term count:" + termcount);
            System.out.println("term IndexInterval:" + input.readInt());
            int skipinterval = input.readInt();
            System.out.println("term SkipInterval:" + skipinterval);
            System.out.println("term MaxSkipLevels:" + input.readInt());
            int doccount = 0;
            int prefixlength = 0;
            String termvalue = null;
            byte[] lasttermbyte = null;   // full UTF-8 bytes of the previous term
            int stufflenth;
            for (long i = 0; i < termcount; i++) {
                System.out.println("*****read term info[" + i + "]******");
                prefixlength = input.readVInt();
                System.out.println("the term share prefixlength is :" + prefixlength);
                // Read the suffix as raw bytes (VInt length, then bytes) instead of as a String
                stufflenth = input.readVInt();
                byte[] stuffbyte = new byte[stufflenth];
                input.readBytes(stuffbyte, 0, stufflenth);
                if (prefixlength == 0) {
                    // Nothing shared: the suffix is the whole term
                    termvalue = new String(stuffbyte, "UTF-8");
                    lasttermbyte = stuffbyte;
                } else {
                    // Shared prefix bytes from the previous term + this term's suffix bytes
                    byte[] termbyte = new byte[prefixlength + stufflenth];
                    System.arraycopy(lasttermbyte, 0, termbyte, 0, prefixlength);
                    System.arraycopy(stuffbyte, 0, termbyte, prefixlength, stufflenth);
                    termvalue = new String(termbyte, "UTF-8");
                    lasttermbyte = termbyte;
                }
                System.out.println("term's value is:" + termvalue);
                System.out.println("exists this term's field number is:" + input.readVInt());
                doccount = input.readVInt();
                System.out.println("the doc count contain this term is:" + doccount);
                // FreqDelta and ProxDelta
                System.out.println("the position of this term's TermFreqs within the .frq file is:" + input.readVLong());
                System.out.println("the position of this term's TermPositions within the .prx file is:" + input.readVLong());
                // SkipDelta is written only when DocFreq >= SkipInterval (16 in this index)
                if (doccount >= skipinterval) {
                    System.out.println("the position of this term's SkipData within the .frq file is:" + input.readVInt());
                }
            }
        } finally {
            // input is still null if the constructor threw
            if (input != null) {
                input.close();
            }
        }
    }
}
Output:
term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's value is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's value is:全
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's value is:内
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's value is:国
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's value is:大
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's value is:度
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's value is:引
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's value is:擎
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's value is:最
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's value is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's value is:百
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's value is:的
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's value is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's value is:歌
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's value is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's value is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's value is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's value is:http://www.g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
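Tis - stores the detailed information for each term
Tii - an index file over the term details (a page index into the details: one index entry is created in the tii file for every 128 terms)
Both files have the same header: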
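TIVersion --> UInt32 The format version number of the file
TermCount --> UInt64 The number of terms stored in the file. In tis this is the count of all terms produced by tokenization in this segment, whichever field they come from. In tii it is also a term count, but not of every term: tii records only the last term of each page (the first page's entry is empty and the last page gets no entry; one page is 128 (IndexInterval) terms)
IndexInterval --> UInt32 The number of terms stored per page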
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
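The meaning of SkipInterval and MaxSkipLevels depends on how the other files are stored. I do not know exactly what they mean yet, but they do not affect reading the structure of the TIS and TII files; I will check what these values mean later, when studying the structure of the frq and prx files.
After the header comes the detailed information for each term: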
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
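PrefixLength The number of bytes (UTF-8) that this term shares with the previous term
Suffix The part unique to this term. The documentation calls it a string; for English that is unambiguous, but for Chinese it may be an incomplete character. It is stored as a length followed by the suffix's UTF-8 bytes
FieldNum The number of the field the term came from
DocFreq The number of documents that contain this term
FreqDelta The position of this term's data in the frq file, stored as the difference from the previous term's position (what exactly is stored at that position has to wait until we look at the frq file)
ProxDelta, SkipDelta Similar to FreqDelta: they are also position information. A stored position is effectively a pointer to the data at that position, so it is an index as well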
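Because FreqDelta and ProxDelta are deltas (the Lucene file-format documentation describes each as the difference from the previous term's position, or zero for the first term), the absolute file positions come from a running sum. The sketch below is only an illustration with hypothetical names; the sample values are the FreqDelta numbers printed for the first six terms in the output above.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Lucene or of the author's reader.
class PointerDeltaSketch {

    /** Turns per-term deltas into absolute file positions by accumulating them. */
    static long[] toAbsolute(long[] deltas) {
        long[] positions = new long[deltas.length];
        long current = 0;                     // the first term's delta is relative to zero
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        long[] freqDeltas = {0, 1, 1, 1, 1, 2};   // terms 0..5 from the output above
        System.out.println(Arrays.toString(toAbsolute(freqDeltas))); // [0, 1, 2, 3, 4, 6]
    }
}
```

This also explains why the printed "positions" stay small and repeat: each one is the distance from the previous term's data, not an offset from the start of the file.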
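Content layout of the two files:
In the figure, when tis stores its first term, tii stores an empty term entry.
If tis holds exactly 128*n terms, the last term of the final page is not recorded in the tii file. Once the frq, prx, and nrm files have been read as well, Lucene's whole search process and the structure built at indexing time should be clear.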
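The paging rule can be turned into a small calculation. This is a sketch under the assumptions stated in this post only (one empty entry written along with the first term, then one entry for the last term of every full page except the final one); the class and method names are hypothetical.

```java
// Hypothetical helper; it encodes the rule described in the text, not a Lucene API.
class TiiEntryCountSketch {

    /** Number of tii entries expected for a tis file holding termCount terms. */
    static long indexEntries(long termCount, int indexInterval) {
        if (termCount <= 0) {
            return 0;                              // no terms written, so no empty entry either
        }
        // One empty entry, plus one entry per page that is followed by at least one more term.
        return (termCount - 1) / indexInterval + 1;
    }

    public static void main(String[] args) {
        System.out.println(indexEntries(22, 128));   // 1: only the empty entry
        System.out.println(indexEntries(128, 128));  // 1: the 128th term is not recorded
        System.out.println(indexEntries(129, 128));  // 2: the 128th term is recorded
    }
}
```

For the 22-term segment dumped above, this predicts a tii file with a single, empty entry.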
I write these posts as I learn, so corrections and criticism are welcome.