IK Analyzer Source Code Study: Sub-Segmenters and Ambiguity Resolution

Scroll5165

Original link: http://www.cnblogs.com/sunshineKID/p/3446546.html

IK Analyzer source code study, sub-segmenters: http://blog.chinaunix.net/uid-20761674-id-3424176.html

IK Analyzer source code study, ambiguity resolution: http://blog.chinaunix.net/uid-20761674-id-3424553.html

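When the ik object is created, the constructor of the IKSegmenter class is called to perform initialization:

IKSegmenter ik=new IKSegmenter(sr, false);  // passing true would enable smart segmentation; false, as here, selects fine-grained segmentation

The constructor looks like this: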

public IKSegmenter(Reader input , boolean useSmart){
        this.input = input;
        this.cfg = DefaultConfig.getInstance();
        this.cfg.setUseSmart(useSmart);
        this.init();
    }

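this.cfg = DefaultConfig.getInstance(); loads the configuration file. Note that, as written below, getInstance() builds a new DefaultConfig on every call rather than returning a shared instance: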

public static Configuration getInstance(){
        return new DefaultConfig();
    }
        
    /*
     * Load the configuration file
     */
    private DefaultConfig(){        
        props = new Properties();
        
        InputStream input = this.getClass().getClassLoader().getResourceAsStream(FILE_NAME);
        if(input != null){
            try {
                props.loadFromXML(input);
            } catch (InvalidPropertiesFormatException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
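
To make the loading step concrete, here is a minimal, self-contained sketch of reading an XML properties document with java.util.Properties.loadFromXML, the same call the constructor above uses. It is not IK's code: the class name XmlConfigSketch, the in-memory sample document, and the ext_dict key are assumptions chosen for illustration, and the sketch reads from a byte array instead of the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in for a configuration loader; not part of IK Analyzer.
class XmlConfigSketch {

    // Sample document in the java.util.Properties XML format.
    // The key "ext_dict" is an assumption made for this illustration.
    static final String SAMPLE_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>sample configuration</comment>\n"
        + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
        + "</properties>\n";

    // Mirrors the null check in the constructor above: a missing resource
    // simply yields an empty Properties object.
    static Properties load(InputStream input) {
        Properties props = new Properties();
        if (input == null) {
            return props;
        }
        try (InputStream in = input) {
            props.loadFromXML(in);
        } catch (IOException e) {
            // InvalidPropertiesFormatException is a subclass of IOException.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    static Properties loadSample() {
        byte[] bytes = SAMPLE_XML.getBytes(StandardCharsets.UTF_8);
        return load(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) {
        System.out.println(loadSample().getProperty("ext_dict"));
    }
}
```

Unlike the constructor above, the sketch rethrows a malformed document as an unchecked exception instead of printing the stack trace, so a broken configuration is not silently ignored. That is a design choice of the sketch, not something the original code does.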

this.cfg.setUseSmart(useSmart); sets the useSmart flag:

/**
     * Set the useSmart flag.
     * useSmart = true: the segmenter uses the smart segmentation strategy; useSmart = false: it uses fine-grained segmentation
     * @param useSmart
     */
    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }


this.init() performs the initialization:

private void init(){
        // Initialize the dictionary singleton
        Dictionary.initial(this.cfg);
        // Initialize the analysis context
        this.context = new AnalyzeContext(this.cfg);
        // Load the sub-segmenters
        this.segmenters = this.loadSegmenters();
        // Load the ambiguity arbitrator
        this.arbitrator = new IKArbitrator();
    }

Dictionary.initial(this.cfg);

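/**
     * Dictionary initialization.
     * IK Analyzer initializes its dictionaries through a static method of the Dictionary class,
     * so the dictionaries are loaded only when the Dictionary class is first actually used,
     * which lengthens the first segmentation operation.
     * This method provides a way to initialize the dictionary while the application is starting up.
     * @return Dictionary
     */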
    public static Dictionary initial(Configuration cfg){
        if(singleton == null){
            synchronized(Dictionary.class){
                if(singleton == null){
                    singleton = new Dictionary(cfg);
                    return singleton;
                }
            }
        }
        return singleton;
    }
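
The method above is an instance of double-checked locking. The sketch below shows the pattern in isolation, assuming a hypothetical class LazyDictSketch that is not part of IK. The excerpt does not show how singleton is declared in Dictionary; under the Java memory model the pattern is only safe when that field is volatile, so the sketch declares it that way.

```java
// Hypothetical illustration of lazy, thread-safe singleton initialization.
class LazyDictSketch {

    // volatile is what makes double-checked locking safe: without it another
    // thread could observe a partially constructed instance.
    private static volatile LazyDictSketch singleton;

    private final String configName;

    private LazyDictSketch(String configName) {
        this.configName = configName;
    }

    static LazyDictSketch initial(String configName) {
        LazyDictSketch local = singleton;          // first check, no lock taken
        if (local == null) {
            synchronized (LazyDictSketch.class) {
                local = singleton;                 // second check, under the lock
                if (local == null) {
                    local = new LazyDictSketch(configName);
                    singleton = local;
                }
            }
        }
        return local;
    }

    String configName() {
        return configName;
    }

    public static void main(String[] args) {
        LazyDictSketch a = initial("first");
        LazyDictSketch b = initial("second");
        // Only the first call constructs the instance; later arguments are ignored.
        System.out.println(a == b);
    }
}
```

One consequence worth noticing, and one that applies to Dictionary.initial(cfg) as shown as well: the configuration passed on any call after the first has no effect, because the instance already exists.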

private Dictionary(Configuration cfg){
        this.cfg = cfg;
        // Create the main dictionary instance
        _MainDict = new DictSegment((char)0);
        this.loadMainDict(_MainDict);
        /*_StopWordDict = new DictSegment((char)0);
        this.loadStopWordDict(_StopWordDict);*/
        this.loadQuantifierDict();
        this.loadCharFreqDict();
    }

private void loadMainDict(DictSegment dstDicSegment){

        // Read the main dictionary file
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream(cfg.getMainDictionary());
        if(inputStream == null){
            throw new RuntimeException("Main Dictionary not found!!!");
        }

        System.out.println("test加载主字典");   // debug output: "test: loading main dictionary"
        this.loadWords2DictSegment(inputStream,dstDicSegment);

        System.out.println("test加载扩展字典"); // debug output: "test: loading extension dictionary"
        this.loadExtDict(dstDicSegment);

    }

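During initialization, the main dictionary and the extension dictionaries are both loaded into _MainDict and are matched against together. If you want to load only one of them, this is the place to change. However, because the loading happens during initialization inside the constructor, there is no way to decide at run time whether to load a dictionary, or which dictionary to load.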
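
The following sketch illustrates why the two word sources become indistinguishable once loaded: both are inserted into the same prefix tree. It is a simplified stand-in, not IK's DictSegment; the class names TrieNode and SharedDictSketch and the loadExt flag are hypothetical, with the flag showing one possible way to make the extension load optional.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One node of a character trie (a simplified stand-in for a dictionary segment).
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;
}

class SharedDictSketch {
    private final TrieNode root = new TrieNode();

    // Insert every non-blank word into the shared trie.
    void loadWords(List<String> words) {
        for (String raw : words) {
            String word = raw.trim();
            if (word.isEmpty()) {
                continue;
            }
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new TrieNode());
            }
            node.isWord = true;
        }
    }

    // True only when the exact word was inserted (a mere prefix does not count).
    boolean contains(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    // Hypothetical flag: the caller decides whether extension words are merged in.
    static SharedDictSketch build(List<String> mainWords, List<String> extWords, boolean loadExt) {
        SharedDictSketch dict = new SharedDictSketch();
        dict.loadWords(mainWords);
        if (loadExt) {
            dict.loadWords(extWords);
        }
        return dict;
    }
}
```

Because the decision is taken in build(), it still happens once, at construction time. Choosing per request would require either separate tries or a tag on each word recording where it came from.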

private void loadQuantifierDict(){
        // Create the quantifier dictionary instance
        _QuantifierDict = new DictSegment((char)0);
        // Read the quantifier dictionary file
        InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getQuantifierDicionary());
        if(is == null){
            throw new RuntimeException("Quantifier Dictionary not found!!!");
        }
        loadWords2DictSegment(is, _QuantifierDict);
        System.out.println("加载量词词典"); // debug output: "loading quantifier dictionary"
    }

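Once the dictionaries are loaded, initialization of the dictionary singleton is complete. Execution moves on to initializing the analysis context and then loads the sub-segmenters with this.segmenters = this.loadSegmenters();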

private List<ISegmenter> loadSegmenters(){
        List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
        // Sub-segmenter that handles letters
        segmenters.add(new LetterSegmenter()); 
        // Sub-segmenter that handles Chinese numerals and quantifiers
        segmenters.add(new CN_QuantifierSegmenter());
        // Sub-segmenter that handles Chinese words
        segmenters.add(new CJKSegmenter());
        return segmenters;
    }
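
As a rough picture of how a list of sub-segmenters can cooperate, the sketch below feeds every character position to each segmenter in turn and lets each one emit the tokens it recognizes. It is a deliberately simplified model: MiniSegmenter, RunSegmenter, and SegmenterChainSketch are hypothetical names, the two segmenters only recognize runs of ASCII letters and ASCII digits, and no dictionary or ambiguity handling is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified counterpart of a sub-segmenter interface.
interface MiniSegmenter {
    // Inspect the character at cursor; append any completed token to out.
    void analyze(char[] text, int cursor, List<String> out);
}

// Emits maximal runs of ASCII letters (letters == true) or ASCII digits.
class RunSegmenter implements MiniSegmenter {
    private final boolean letters;
    private int start = -1; // start index of the run in progress, -1 if none

    RunSegmenter(boolean letters) {
        this.letters = letters;
    }

    private boolean accepts(char c) {
        return letters
            ? (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            : (c >= '0' && c <= '9');
    }

    @Override
    public void analyze(char[] text, int cursor, List<String> out) {
        boolean hit = accepts(text[cursor]);
        if (hit && start < 0) {
            start = cursor;                       // a new run begins here
        }
        boolean last = cursor == text.length - 1;
        if (start >= 0 && (!hit || last)) {
            int end = hit ? cursor + 1 : cursor;  // exclusive end of the run
            out.add(new String(text, start, end - start));
            start = -1;
        }
    }
}

class SegmenterChainSketch {
    static List<String> segment(String input) {
        List<MiniSegmenter> segmenters = new ArrayList<>(2);
        segmenters.add(new RunSegmenter(true));   // letters
        segmenters.add(new RunSegmenter(false));  // digits
        List<String> out = new ArrayList<>();
        char[] text = input.toCharArray();
        for (int cursor = 0; cursor < text.length; cursor++) {
            for (MiniSegmenter s : segmenters) {
                s.analyze(text, cursor, out);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("ab12 c")); // prints [ab, 12, c]
    }
}
```

In this toy model the segmenters never produce overlapping tokens. In a real analyzer several sub-segmenters can claim the same characters, which is why an arbitration step is needed afterwards.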

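The last step of init() loads the ambiguity arbitrator: this.arbitrator = new IKArbitrator();

This completes IKSegmenter ik=new IKSegmenter(sr, false);

Next comes the call to ik.next(), which enters the main segmentation flow (see "IK Analyzer Source Code Study: Overall Flow").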

Reposted from: https://www.cnblogs.com/sunshineKID/p/3446546.html
