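One of the biggest changes in Lucene 4 is its pluggable codec architecture, which lets you define the index structure yourself: terms, postings lists, stored fields, term vectors, deleted documents, segment info, and field infos.
About codecs:
Lucene 4 already ships with several codec implementations:
Lucene40: the default codec. Lucene40Codec
Lucene3x: read-only. It can read indexes created with 3.x, but it cannot be used to create an index. Lucene3xCodec
SimpleText: stores the index as plain text. Good for learning, not recommended for production. SimpleTextCodec
Appending: for file systems that are written in append-only fashion, such as HDFS. AppendingCodec
......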
About formats:
A codec is in fact made up of a set of formats. One codec contains eight formats in total:
PostingsFormat, DocValuesFormat, StoredFieldsFormat, TermVectorsFormat, FieldInfosFormat, SegmentInfoFormat, NormsFormat, LiveDocsFormat
For example, StoredFieldsFormat handles stored fields and TermVectorsFormat handles term vectors. In Lucene 4 you can supply your own implementation of each format.
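To picture "a codec is a bundle of formats", here is a toy model in plain Java. It is only a sketch and does not use Lucene's API: ToyCodec, its methods, and the component names "postings" and "storedFields" are all made up for illustration. It shows the idea the test code below relies on, namely that a codec can keep its default formats and hand out a different postings format for a particular field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of "a codec is a bundle of formats". None of this is a Lucene class. */
final class ToyCodec {
    // component (hypothetical names such as "postings") -> name of the format handling it
    private final Map<String, String> formats = new LinkedHashMap<>();
    // optional per-field override for the postings component
    private final Map<String, String> postingsByField = new LinkedHashMap<>();

    /** Registers the format used for one index component. */
    ToyCodec format(String component, String formatName) {
        formats.put(component, formatName);
        return this;
    }

    /** Overrides the postings format for a single field. */
    ToyCodec postingsForField(String field, String formatName) {
        postingsByField.put(field, formatName);
        return this;
    }

    /** Returns the format name for a component, or null if none was registered. */
    String formatFor(String component) {
        return formats.get(component);
    }

    /** Returns the per-field override if present, otherwise the codec-wide postings format. */
    String postingsFormatFor(String field) {
        String override = postingsByField.get(field);
        return override != null ? override : formats.get("postings");
    }

    public static void main(String[] args) {
        ToyCodec codec = new ToyCodec()
                .format("postings", "Lucene40")
                .format("storedFields", "Lucene40")
                .postingsForField("GUID", "BloomFilter");
        System.out.println("GUID    -> " + codec.postingsFormatFor("GUID"));
        System.out.println("Content -> " + codec.postingsFormatFor("Content"));
    }
}
```

In real Lucene 4 code the equivalent hook is the getPostingsFormatForField override used in the test code of this post.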
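Lucene 4 also currently provides several PostingsFormat implementations:
Memory: loads all terms and postings lists into an in-memory FST. MemoryPostingsFormat
Direct: writes with the default Lucene40PostingsFormat, then loads the terms and postings lists into memory at read time. DirectPostingsFormat
Pulsing: terms with a frequency of 1 or less (the default cutoff) have their postings stored inline in the terms dictionary. PulsingPostingsFormat
BloomFilter: adds a Bloom filter to a specified field on each segment. It gives a "fast-fail" check for whether a segment contains a given key. The best fit is a Bloom filter on the primary key when the index holds many documents and also has many segments. BloomFilteringPostingsFormat has to be layered on top of another PostingsFormat. A Bloom filter benchmark is available here: https://docs.google.com/spreadsheet/ccc?key=0AsKVSn5SGg_wdFNpNTl3R1cxLTluTTcya2hDRnlfdHc#gid=3
Block: provides index compression and also improves search performance, and it may become the default PostingsFormat in a future release. Anyone using this format now should note that it is still experimental in this version, so backward compatibility of the index format is not guaranteed. Unlike Lucene40, BlockPostingsFormat does not create .frq and .prx files; it creates .doc and .pos files instead.
....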
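To make the "fast-fail" idea behind the Bloom filter concrete, here is a minimal sketch in plain Java. It is not how BloomFilteringPostingsFormat is implemented. The class SegmentBloomSketch, its two hash functions, the filter size, and the HashMap standing in for a segment's terms dictionary are all assumptions made for illustration. The point is only that a negative answer from the filter lets a primary-key lookup skip a segment without touching its dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy per-segment Bloom filter. Illustrates fast-fail only, not Lucene's implementation. */
final class SegmentBloomSketch {
    private final BitSet bits;
    private final int size;
    // Hypothetical stand-in for the segment's terms dictionary: primary key -> doc id.
    private final Map<String, Integer> termDictionary = new HashMap<>();
    // Counts how often the (expensive) dictionary was actually consulted.
    int dictionaryLookups = 0;

    SegmentBloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash functions; an arbitrary choice for this sketch.
    private int hash1(String key) {
        return Math.floorMod(key.hashCode(), size);
    }

    private int hash2(String key) {
        int h = 17;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = h * 131 + b;
        }
        return Math.floorMod(h, size);
    }

    /** Indexes a key: sets its filter bits and records it in the dictionary. */
    void add(String key, int docId) {
        bits.set(hash1(key));
        bits.set(hash2(key));
        termDictionary.put(key, docId);
    }

    /** False means the key is definitely absent; true means it may be present. */
    boolean mightContain(String key) {
        return bits.get(hash1(key)) && bits.get(hash2(key));
    }

    /** Returns the doc id or -1. The dictionary is consulted only if the filter passes. */
    int lookup(String key) {
        if (!mightContain(key)) {
            return -1; // fast-fail: skip this segment entirely
        }
        dictionaryLookups++;
        Integer doc = termDictionary.get(key);
        return doc == null ? -1 : doc;
    }

    /** Searches segments in order and returns the first matching doc id, or -1. */
    static int findFirst(List<SegmentBloomSketch> segments, String key) {
        for (SegmentBloomSketch segment : segments) {
            int doc = segment.lookup(key);
            if (doc >= 0) {
                return doc;
            }
        }
        return -1;
    }
}
```

With many segments, most of them fail the filter check for any given primary key, which is why the post recommends this format for that case. A filter can still report a false positive, so the dictionary lookup remains the final answer.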
Test code:
package test;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.appending.AppendingCodec;
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
import org.apache.lucene.codecs.lucene3x.Lucene3xCodec;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
/**
* lucene codec
*
* @author wuwen
* @date 2013-1-14 04:54:17 PM
*
*/
public class LuceneCodecTest {

    /** Returns the codec for the given name, or null if the name is unknown. */
    static Codec getCodec(String codecname) {
        Codec codec = null;
        if ("Lucene40".equals(codecname)) {
            codec = new Lucene40Codec();
        } else if ("Lucene3x".equals(codecname)) {
            // Read-only codec: writing with it throws
            // UnsupportedOperationException("this codec can only be used for reading").
            codec = new Lucene3xCodec();
        } else if ("SimpleText".equals(codecname)) {
            codec = new SimpleTextCodec();
        } else if ("Appending".equals(codecname)) {
            codec = new AppendingCodec();
        } else if ("Pulsing40".equals(codecname)) {
            // The remaining cases keep Lucene40Codec and swap only the postings format.
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Pulsing40");
                }
            };
        } else if ("Memory".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Memory");
                }
            };
        } else if ("BloomFilter".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // The Bloom filter wraps another postings format.
                    return new BloomFilteringPostingsFormat(new Lucene40PostingsFormat());
                }
            };
        } else if ("Direct".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Direct");
                }
            };
        } else if ("Block".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Block");
                }
            };
        }
        return codec;
    }
    public static void main(String[] args) {
        String[] codecs = {"Lucene40", "Lucene3x", "SimpleText", "Appending", "Pulsing40", "Memory", "BloomFilter", "Direct", "Block"};
        // Base directory; each codec writes its index into its own subdirectory.
        String basePath = "E:\\lucene\\codec\\";
        for (String codecname : codecs) {
            String indexPath = basePath + codecname;
            Codec codec = getCodec(codecname);
            Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_40);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
            config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            config.setCodec(codec); // set the codec
            IndexWriter writer = null;
            try {
                Directory luceneDir = FSDirectory.open(new File(indexPath));
                writer = new IndexWriter(luceneDir, config);
                List<Document> list = new ArrayList<Document>();
                Document doc = new Document();
                doc.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));
                doc.add(new TextField("Content", "北京时间1月14日04:00(西班牙当地时间13日21:00),2012/13赛季西班牙足球甲级联赛第19轮一场焦点战在纳瓦拉国王球场展开争夺.", Field.Store.YES));
                list.add(doc);
                Document doc1 = new Document();
                doc1.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));
                doc1.add(new TextField("Content", "巴萨超皇马18分毁了西甲?媒体惊呼 克鲁伊夫看不下去.", Field.Store.YES));
                list.add(doc1);
                Document doc2 = new Document();
                doc2.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));
                doc2.add(new TextField("Content", "what changes in lucene4.", Field.Store.YES));
                list.add(doc2);
                writer.addDocuments(list);
            } catch (Exception e) {
                // Expected for Lucene3x, which is read-only and cannot write an index.
                e.printStackTrace();
            } finally {
                if (writer != null) {
                    try {
                        writer.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }
}
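Once the test has run, a simple way to compare the codecs is to look at which files each one wrote, for example to check the .frq/.prx versus .doc/.pos difference described for Block. The following is a small sketch using only the JDK. IndexFileLister and countByExtension are hypothetical helper names, and the default path is just the one used in the test code above.

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: summarizes the files in each index directory by extension. */
final class IndexFileLister {

    /** Counts regular files in a directory by extension, e.g. "doc" -> 2. */
    static Map<String, Integer> countByExtension(File dir) {
        Map<String, Integer> counts = new TreeMap<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return counts; // not a directory, or it could not be read
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            String name = f.getName();
            int dot = name.lastIndexOf('.');
            // Files without a dot (such as segments_N) are grouped under "(none)".
            String ext = dot < 0 ? "(none)" : name.substring(dot + 1);
            counts.merge(ext, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Pass the base directory as the first argument, or fall back to the test's path.
        File base = new File(args.length > 0 ? args[0] : "E:\\lucene\\codec");
        File[] subdirs = base.listFiles(File::isDirectory);
        if (subdirs == null) {
            System.out.println("No index directories found under " + base);
            return;
        }
        for (File dir : subdirs) {
            System.out.println(dir.getName() + " -> " + countByExtension(dir));
        }
    }
}
```

Running it against the base directory prints one line per codec subdirectory, which makes differences in the generated file set easy to spot.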