IKAnalyzer Source Code Analysis: Initialization
This chapter begins the analysis of the IKAnalyzer source code. Download the IKAnalyzer source from GitHub, import it into Eclipse, pull in the dependencies via Maven, then create a main function and add the following code:
IKAnalyzer ikAnalyzer = new IKAnalyzer();
TokenStream tokenStream = ikAnalyzer.createComponents("").getTokenStream();
tokenStream.reset();
while (tokenStream.incrementToken()) {}
For this test, modify the IKTokenizer source so that its constructor hard-codes some Chinese text as the input to be indexed (this is for testing only; when IKAnalyzer is driven by the Lucene framework, Lucene supplies the input Reader for it). Note also that createComponents is protected, so the snippet above must run from a subclass or from the same package. The test code calls createComponents to obtain a TokenStream and then calls its incrementToken method to start tokenizing.
This chapter analyzes the createComponents function; the next chapter turns to the TokenStream's incrementToken function.
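For reference, here is a minimal sketch that drives the analyzer through the standard Analyzer.tokenStream API instead of modifying IKTokenizer. The field name "content" and the sample text are arbitrary test values, not part of the IK source:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKDemo {
    public static void main(String[] args) throws Exception {
        IKAnalyzer analyzer = new IKAnalyzer();
        // tokenStream internally calls createComponents and sets the input Reader
        TokenStream ts = analyzer.tokenStream("content", new StringReader("这是一个中文分词的例子"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // required before the first incrementToken
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}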
IKAnalyzer::createComponents
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer _IKTokenizer = new IKTokenizer(this.useSmart());
    return new TokenStreamComponents(_IKTokenizer);
}
The createComponents function creates an IKTokenizer, wraps it in a TokenStreamComponents, and returns it.
IKAnalyzer::createComponents->IKTokenizer::IKTokenizer
public IKTokenizer(boolean useSmart) {
    this._IKImplement = new IKSegmenter(this.input, useSmart);
}

public IKSegmenter(Reader input, boolean useSmart) {
    this.input = input;
    this.cfg = DefaultConfig.getInstance();
    this.cfg.setUseSmart(useSmart);
    this.init();
}
The IKTokenizer constructor creates an IKSegmenter. The IKSegmenter constructor in turn creates a DefaultConfig, which reads the IKAnalyzer.cfg.xml configuration file, and then calls init to perform the initialization.
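As an aside, IKSegmenter can also be driven directly, without Lucene. A minimal sketch using the IK core API (the sample text is arbitrary):

import java.io.IOException;
import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class SegmenterDemo {
    public static void main(String[] args) throws IOException {
        // true enables smart mode, which resolves ambiguities more aggressively
        IKSegmenter segmenter = new IKSegmenter(new StringReader("中华人民共和国"), true);
        Lexeme lexeme;
        while ((lexeme = segmenter.next()) != null) {
            System.out.println(lexeme.getLexemeText());
        }
    }
}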
IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->DefaultConfig::getInstance
public static Configuration getInstance() {
    return new DefaultConfig();
}

private DefaultConfig() {
    props = new Properties();
    InputStream input = this.getClass().getClassLoader().getResourceAsStream(FILE_NAME);
    if (input != null) {
        try {
            props.loadFromXML(input);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The constant FILE_NAME is IKAnalyzer.cfg.xml. The DefaultConfig constructor reads and parses this file, storing its entries in props.
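For reference, a typical IKAnalyzer.cfg.xml looks like the following; ext_dict and ext_stopwords are the keys IK reads, while ext.dic and stopword.dic are placeholder file names:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionaries, semicolon-separated classpath resources -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- extension stopword dictionaries -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>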
With the DefaultConfig in place, the init function initializes the rest of IKAnalyzer.
IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init
private void init() {
    Dictionary.initial(this.cfg);
    this.context = new AnalyzeContext(this.cfg);
    this.segmenters = this.loadSegmenters();
    this.arbitrator = new IKArbitrator();
}
Dictionary.initial creates the word dictionaries; AnalyzeContext stores the intermediate state and results of the analysis; loadSegmenters loads the sub-segmenters; finally, an IKArbitrator is created to resolve ambiguities. Let us first look at loadSegmenters.
IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init->loadSegmenters
private List<ISegmenter> loadSegmenters() {
    List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
    segmenters.add(new LetterSegmenter());
    segmenters.add(new CN_QuantifierSegmenter());
    segmenters.add(new CJKSegmenter());
    return segmenters;
}
loadSegmenters creates a LetterSegmenter to handle letters, a CN_QuantifierSegmenter to handle Chinese quantifiers, and a CJKSegmenter to handle CJK characters, and adds them to the segmenters list. When text is processed later, these segmenters are invoked in turn, character by character, as the sketch below illustrates.
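Conceptually, the main loop inside IKSegmenter (analyzed in the next chapter) offers every character of the input to each segmenter in order. A simplified sketch, not the verbatim source:

do {
    for (ISegmenter segmenter : segmenters) {
        // each segmenter inspects the character under the cursor and
        // emits any recognized lexemes into the shared AnalyzeContext
        segmenter.analyze(context);
    }
} while (context.moveCursor()); // advance to the next character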
IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init->Dictionary::initial
public static Dictionary initial(Configuration cfg) {
    singleton = new Dictionary(cfg);
    return singleton;
}

private Dictionary(Configuration cfg) {
    this.cfg = cfg;
    this.loadMainDict();
    this.loadStopWordDict();
    this.loadQuantifierDict();
}
loadMainDict loads the main dictionary, which contains a large number of Chinese words. loadStopWordDict loads the stopword dictionary, words such as “也”, “的”, and “了” that should not be indexed. loadQuantifierDict loads the quantifier dictionary, for which no extension is configured by default. This chapter analyzes loadMainDict; the other dictionaries are loaded in essentially the same way.
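The dictionary files themselves are plain UTF-8 text with one word per line. Illustrative entries (not a verbatim excerpt of the shipped main dictionary):

中华
中华人民
中华人民共和国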
IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init->Dictionary::initial->Dictionary->loadMainDict
private void loadMainDict() {
    // the root node of the dictionary trie; (char)0 marks the root
    _MainDict = new DictSegment((char) 0);
    InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getMainDictionary());
    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                // insert the word into the trie, lowercased, one char per node
                _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
            }
        } while (theWord != null);
    } catch (IOException e) {
        e.printStackTrace();
    }
    this.loadExtDict();
}
loadMainDict first creates a DictSegment, the root of the trie that holds the dictionary. It then opens the default main dictionary from the classpath, reads it line by line, and calls DictSegment.fillSegment to store each word in the trie. Finally, loadExtDict loads the extension dictionaries whose paths are configured in IKAnalyzer.cfg.xml; its loading process closely mirrors loadMainDict's, so we focus on fillSegment below.
Dictionary::loadMainDict->fillSegment
void fillSegment(char[] charArray) {
    this.fillSegment(charArray, 0, charArray.length, 1);
}

private synchronized void fillSegment(char[] charArray, int begin, int length, int enabled) {
    Character beginChar = new Character(charArray[begin]);
    Character keyChar = charMap.get(beginChar);
    if (keyChar == null) {
        charMap.put(beginChar, beginChar);
        keyChar = beginChar;
    }
    DictSegment ds = lookforSegment(keyChar, enabled);
    if (ds != null) {
        if (length > 1) {
            // more characters remain: recurse into the child segment
            ds.fillSegment(charArray, begin + 1, length - 1, enabled);
        } else if (length == 1) {
            // last character: mark the child as the end of a word
            ds.nodeState = enabled;
        }
    }
}
The charArray parameter is one word read from the dictionary file; charMap caches every Character stored so far. fillSegment first fetches (or interns) the word's current character from charMap, then calls lookforSegment to locate or insert the corresponding child node in the trie, and finally recurses to store the remaining characters.
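For illustration, if the dictionary contains the two words “中国” and “中华”, the trie built by fillSegment looks like this:

root((char)0)
  └── 中
       ├── 国   nodeState = 1, marks the end of the word "中国"
       └── 华   nodeState = 1, marks the end of the word "中华"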
Dictionary::loadMainDict->fillSegment->lookforSegment
private DictSegment lookforSegment(Character keyChar, int create) {
    DictSegment ds = null;
    if (this.storeSize <= ARRAY_LENGTH_LIMIT) {
        // children are stored in a small sorted array
        DictSegment[] segmentArray = getChildrenArray();
        DictSegment keySegment = new DictSegment(keyChar);
        int position = Arrays.binarySearch(segmentArray, 0, this.storeSize, keySegment);
        if (position >= 0) {
            ds = segmentArray[position];
        }
        if (ds == null && create == 1) {
            ds = keySegment;
            if (this.storeSize < ARRAY_LENGTH_LIMIT) {
                // the array still has room: append and keep it sorted
                segmentArray[this.storeSize] = ds;
                this.storeSize++;
                Arrays.sort(segmentArray, 0, this.storeSize);
            } else {
                // the array is full: migrate all children into a Map
                Map<Character, DictSegment> segmentMap = getChildrenMap();
                migrate(segmentArray, segmentMap);
                segmentMap.put(keyChar, ds);
                this.storeSize++;
                this.childrenArray = null;
            }
        }
    } else {
        // children are already stored in a Map
        Map<Character, DictSegment> segmentMap = getChildrenMap();
        ds = (DictSegment) segmentMap.get(keyChar);
        if (ds == null && create == 1) {
            ds = new DictSegment(keyChar);
            segmentMap.put(keyChar, ds);
            this.storeSize++;
        }
    }
    return ds;
}

private void migrate(DictSegment[] segmentArray, Map<Character, DictSegment> segmentMap) {
    for (DictSegment segment : segmentArray) {
        if (segment != null) {
            segmentMap.put(segment.nodeChar, segment);
        }
    }
}
lookforSegment first obtains the node's children via getChildrenArray. Children are initially kept in the sorted array segmentArray, each one a DictSegment wrapping the stored character keyChar; the array capacity ARRAY_LENGTH_LIMIT defaults to 3. When the array already holds ARRAY_LENGTH_LIMIT children and another must be inserted, the node switches to a Map: the migrate function copies the existing entries from segmentArray into the Map, whose keys are the stored chars and whose values are the wrapping DictSegments.
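A worked example of this switch, assuming ARRAY_LENGTH_LIMIT = 3 and hypothetical children inserted in the order 'c', 'a', 'b', 'd':

insert 'c' -> array ['c'], storeSize = 1
insert 'a' -> array ['a','c'], storeSize = 2 (re-sorted after appending)
insert 'b' -> array ['a','b','c'], storeSize = 3
insert 'd' -> array is full: migrate 'a','b','c' into the Map, put 'd', storeSize = 4, childrenArray is set to null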