IKAnalyzer Source Code Analysis - 1

IKAnalyzer Source Code Analysis: Initialization

This chapter begins the analysis of the IKAnalyzer source code. Download the IKAnalyzer source from GitHub, import it into Eclipse, fetch the dependencies via Maven, then create a main function and add the following code:

        IKAnalyzer ikAnalyzer = new IKAnalyzer();
        TokenStreamComponents components = ikAnalyzer.createComponents("");
        TokenStream tokenStream = components.getTokenStream();
        tokenStream.reset();
        while (tokenStream.incrementToken()) {}

Modify the IKTokenizer source so that its constructor supplies a piece of Chinese text as the input to tokenize (this is for testing only; when IKAnalyzer is invoked by the Lucene framework, Lucene sets the input for it). IKAnalyzer's createComponents function creates a TokenStream, and calling its incrementToken method starts the tokenization.

This chapter analyzes the createComponents function; the next chapter starts on the TokenStream's incrementToken function.

IKAnalyzer::createComponents

    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer _IKTokenizer = new IKTokenizer(this.useSmart());
        return new TokenStreamComponents(_IKTokenizer);
    }

The createComponents function creates an IKTokenizer, wraps it in a TokenStreamComponents, and returns it.

IKAnalyzer::createComponents->IKTokenizer::IKTokenizer

    public IKTokenizer(boolean useSmart) {
        this._IKImplement = new IKSegmenter(this.input, useSmart);
    }

    public IKSegmenter(Reader input , boolean useSmart){
        this.input = input;
        this.cfg = DefaultConfig.getInstance();
        this.cfg.setUseSmart(useSmart);
        this.init();
    }

The IKTokenizer constructor creates an IKSegmenter. The IKSegmenter constructor in turn creates a DefaultConfig, which reads the IKAnalyzer.cfg.xml configuration file, and then calls init to perform the initialization.

IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->DefaultConfig::getInstance

    public static Configuration getInstance(){
        return new DefaultConfig();
    }

    private DefaultConfig(){        
        props = new Properties();

        InputStream input = this.getClass().getClassLoader().getResourceAsStream(FILE_NAME);
        if(input != null){
            try {
                props.loadFromXML(input);
            } catch (IOException e) {
                // the config file is optional; a parse failure leaves props empty
                e.printStackTrace();
            }
        }
    }

The file name FILE_NAME is IKAnalyzer.cfg.xml. The DefaultConfig constructor reads and parses this file and loads its entries into props.
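
Since the file is loaded with Properties.loadFromXML, IKAnalyzer.cfg.xml is a Java XML properties file. A typical configuration looks like the following (the dictionary file names are illustrative; the paths are semicolon-separated classpath resources):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionaries -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- extension stop-word dictionaries -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```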

Once the DefaultConfig has been created, the init function initializes the whole IKAnalyzer.

IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init

    private void init(){
        Dictionary.initial(this.cfg);
        this.context = new AnalyzeContext(this.cfg);
        this.segmenters = this.loadSegmenters();
        this.arbitrator = new IKArbitrator();
    }

Dictionary's initial function creates the dictionary (Dictionary), the AnalyzeContext stores the analysis results, loadSegmenters loads the segmenters, and finally an IKArbitrator is created to resolve ambiguities. Let's look at the loadSegmenters function first.

IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init->loadSegmenters

    private List<ISegmenter> loadSegmenters(){
        List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
        segmenters.add(new LetterSegmenter()); 
        segmenters.add(new CN_QuantifierSegmenter());
        segmenters.add(new CJKSegmenter());
        return segmenters;
    }

The loadSegmenters function creates a LetterSegmenter for letters, a CN_QuantifierSegmenter for Chinese quantifiers, and a CJKSegmenter for Chinese (CJK) characters, and adds them to the segmenters list. When input text is processed later, these segmenters are invoked in turn, character by character.

IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init->Dictionary::initial

    public static Dictionary initial(Configuration cfg){
        singleton = new Dictionary(cfg);
        return singleton;
    }

    private Dictionary(Configuration cfg){
        this.cfg = cfg;
        this.loadMainDict();
        this.loadStopWordDict();
        this.loadQuantifierDict();
    }

The loadMainDict function loads the main dictionary, which contains a large number of Chinese words. loadStopWordDict loads the stop-word dictionary, containing characters such as "也", "的", and "了" that should not be indexed. loadQuantifierDict loads the quantifier dictionary, which has no extra configuration by default. This chapter analyzes loadMainDict; the other loading functions work essentially the same way.

IKAnalyzer::createComponents->IKTokenizer::IKTokenizer->IKSegmenter::IKSegmenter->init->Dictionary::initial->Dictionary->loadMainDict

    private void loadMainDict(){
        _MainDict = new DictSegment((char)0);
        InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getMainDictionary());
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(is , "UTF-8"), 512);
            String theWord = null;
            do {
                theWord = br.readLine();
                if (theWord != null && !"".equals(theWord.trim())) {
                    _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
                }
            } while (theWord != null);
        } catch (IOException e) {
            // InputStreamReader/readLine may throw IOException
            e.printStackTrace();
        }

        this.loadExtDict();
    }   

First a DictSegment is created to hold the dictionary. The default main dictionary resource is then read line by line, and each word is stored into a tree structure via DictSegment's fillSegment function. Finally, loadExtDict loads the extension dictionaries, whose paths are configured in IKAnalyzer.cfg.xml; its loading process is similar to loadMainDict's. Now let's focus on the fillSegment function.

Dictionary::loadMainDict->fillSegment

    void fillSegment(char[] charArray){
        this.fillSegment(charArray, 0 , charArray.length , 1); 
    }

    private synchronized void fillSegment(char[] charArray , int begin , int length , int enabled){
        Character beginChar = new Character(charArray[begin]);
        Character keyChar = charMap.get(beginChar);

        if(keyChar == null){
            charMap.put(beginChar, beginChar);
            keyChar = beginChar;
        }

        DictSegment ds = lookforSegment(keyChar , enabled);
        if(ds != null){
            if(length > 1){
                ds.fillSegment(charArray, begin + 1, length - 1 , enabled);
            }else if (length == 1){
                ds.nodeState = enabled;
            }
        }

    }

The charArray parameter is one Chinese word read from the dictionary file, and charMap caches the stored Character objects. fillSegment first looks up the current character in charMap (inserting it if absent), then calls lookforSegment to find or create the corresponding child node in the tree, and finally calls fillSegment recursively to store the remaining characters.
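
To make the recursion concrete, here is a minimal self-contained sketch of the same idea (an illustrative class, not IK's actual DictSegment): each character of a word descends one level of a trie, and the node reached by the last character is marked as a complete word, mirroring the `ds.nodeState = enabled` step above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie mirroring DictSegment.fillSegment's recursion.
class MiniTrie {
    private final Map<Character, MiniTrie> children = new HashMap<>();
    private boolean isWord = false; // plays the role of nodeState == 1

    // Insert the characters of a word starting at index begin.
    void fill(char[] chars, int begin) {
        if (begin == chars.length) {
            isWord = true;              // last character reached: mark a full word
            return;
        }
        // lookforSegment equivalent: find or create the child for this character
        MiniTrie child = children.computeIfAbsent(chars[begin], c -> new MiniTrie());
        child.fill(chars, begin + 1);   // recurse on the remaining characters
    }

    // True only if the exact word was inserted (prefixes alone don't match).
    boolean contains(String word) {
        MiniTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}
```

The key property, shared with DictSegment, is that a prefix of an inserted word creates intermediate nodes but is not itself matched as a word unless it was also inserted.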

Dictionary::loadMainDict->fillSegment->lookforSegment

    private DictSegment lookforSegment(Character keyChar ,  int create){

        DictSegment ds = null;

        if(this.storeSize <= ARRAY_LENGTH_LIMIT){
            DictSegment[] segmentArray = getChildrenArray();            
            DictSegment keySegment = new DictSegment(keyChar);
            int position = Arrays.binarySearch(segmentArray, 0 , this.storeSize, keySegment);
            if(position >= 0){
                ds = segmentArray[position];
            }

            if(ds == null && create == 1){
                ds = keySegment;
                if(this.storeSize < ARRAY_LENGTH_LIMIT){
                    segmentArray[this.storeSize] = ds;
                    this.storeSize++;
                    Arrays.sort(segmentArray , 0 , this.storeSize);
                }else{
                    Map<Character , DictSegment> segmentMap = getChildrenMap();
                    migrate(segmentArray ,  segmentMap);
                    segmentMap.put(keyChar, ds);
                    this.storeSize++;
                    this.childrenArray = null;
                }

            }           

        }else{
            Map<Character , DictSegment> segmentMap = getChildrenMap();
            ds = (DictSegment)segmentMap.get(keyChar);
            if(ds == null && create == 1){
                ds = new DictSegment(keyChar);
                segmentMap.put(keyChar , ds);
                this.storeSize ++;
            }
        }

        return ds;
    }

    private void migrate(DictSegment[] segmentArray , Map<Character , DictSegment> segmentMap){
        for(DictSegment segment : segmentArray){
            if(segment != null){
                segmentMap.put(segment.nodeChar, segment);
            }
        }
    }

getChildrenArray returns the array used to store the current node's children. A child (a DictSegment wrapping the character keyChar) is first stored in this sorted array, whose capacity ARRAY_LENGTH_LIMIT defaults to 3. Once more children would need to be stored than the array can hold, the node switches to a Map structure: the migrate function moves the DictSegments already in segmentArray into the Map, whose key is the stored char and whose value is the wrapping DictSegment.
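
This hybrid storage strategy can be sketched in isolation as follows (a simplified illustration, not IK's real classes; LIMIT plays the role of ARRAY_LENGTH_LIMIT, and the map values are irrelevant to the sketch): children live in a small sorted array searched with Arrays.binarySearch until the count would exceed the limit, at which point they all migrate to a HashMap.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of DictSegment's hybrid child storage: a small sorted array
// for up to LIMIT children, then a HashMap once it would grow past that.
class HybridChildren {
    static final int LIMIT = 3;              // mirrors ARRAY_LENGTH_LIMIT
    Character[] array = new Character[LIMIT];
    Map<Character, Boolean> map = null;      // null while the array is in use
    int storeSize = 0;

    void add(Character c) {
        if (map != null) {                   // already migrated: plain map insert
            map.put(c, Boolean.TRUE);
            storeSize = map.size();
            return;
        }
        if (Arrays.binarySearch(array, 0, storeSize, c) >= 0) {
            return;                          // already present in the array
        }
        if (storeSize < LIMIT) {
            array[storeSize++] = c;
            Arrays.sort(array, 0, storeSize); // keep sorted for binarySearch
        } else {
            // migrate(): move existing children into the map, then insert
            map = new HashMap<>();
            for (int i = 0; i < storeSize; i++) {
                map.put(array[i], Boolean.TRUE);
            }
            map.put(c, Boolean.TRUE);
            storeSize = map.size();
            array = null;                    // like this.childrenArray = null
        }
    }

    boolean contains(Character c) {
        if (map != null) {
            return map.containsKey(c);
        }
        return Arrays.binarySearch(array, 0, storeSize, c) >= 0;
    }
}
```

The design trades a tiny sorted array (cache-friendly, no per-entry object overhead) for a map only on the minority of nodes that actually have many children, which keeps the dictionary trie compact.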
