Evaluating the Segmentation Quality of the word, Ansj, MMSeg4j, and IK Analyzer Segmenters (Repost)

keke_Xin

Categories: Data Structures and Algorithms, Segmenters. Tags: java, big data, artificial intelligence

Article link: https://blog.csdn.net/keke_Xin/article/details/84596332

Reposted from: http://yangshangchuan.iteye.com/blog/2056537 (the code can be downloaded there)

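word is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names, and organization names. It also provides plugins for Lucene, Solr, and ElasticSearch.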

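The evaluation of the word segmenter covers the following seven segmentation algorithms:

Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

All of the bidirectional algorithms use an n-gram model for disambiguation, and the evaluation is run separately with bigram and trigram models.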
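
To make the matching and disambiguation ideas concrete, here is a minimal sketch of forward and reverse maximum matching, with a toy bigram score used to choose between the two candidates. It is not the word library's implementation: the class `MatchingSketch`, the four-word dictionary, and the bigram counts are all made up for illustration, and the library's real n-gram scoring may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MatchingSketch {

    /** Forward maximum matching: from the left, always take the longest dictionary word. */
    static List<String> forward(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink from the right until a dictionary word or a single character remains.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    /** Reverse maximum matching: the same idea, scanning from the right. */
    static List<String> reverse(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLen);
            // Shrink from the left until a dictionary word or a single character remains.
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            words.add(0, text.substring(start, end));
            end = start;
        }
        return words;
    }

    /** Toy bigram score: sum of the counts of adjacent word pairs (0 if unseen). */
    static int bigramScore(List<String> words, Map<String, Integer> bigramCounts) {
        int score = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            Integer count = bigramCounts.get(words.get(i) + " " + words.get(i + 1));
            score += (count == null) ? 0 : count;
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary and bigram counts, for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        Map<String, Integer> bigrams = new HashMap<>();
        bigrams.put("研究 生命", 3);
        bigrams.put("生命 起源", 5);

        List<String> f = forward("研究生命起源", dict, 3);
        List<String> r = reverse("研究生命起源", dict, 3);
        // Keep the candidate with the higher bigram score.
        List<String> best = bigramScore(r, bigrams) > bigramScore(f, bigrams) ? r : f;
        System.out.println(f + " / " + r + " -> " + best);
    }
}
```

With this toy dictionary, forward matching yields 研究生 / 命 / 起源 while reverse matching yields 研究 / 生命 / 起源. The bigram score favours the second candidate, which is the kind of ambiguity the bidirectional algorithms are meant to resolve.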

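The test text has 2,533,709 lines and 28,374,490 characters in total. The reference (standard) text corresponds to the test text line by line, and the words in the reference text are separated by spaces. The criterion is strict equality: a line counts as perfect only if the segmenter's output is identical to the reference line. The core evaluation code is as follows:
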
/**
 * Evaluate segmentation quality
 * @param resultText path of the file holding the actual segmentation result
 * @param standardText path of the file holding the reference segmentation
 * @return evaluation result
 */
public static EvaluationResult evaluation(String resultText, String standardText) {
    int perfectLineCount=0;
    int wrongLineCount=0;
    int perfectCharCount=0;
    int wrongCharCount=0;
    try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
        BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
        String result;
        while( (result = resultReader.readLine()) != null ){
            result = result.trim();
            String standard = standardReader.readLine();
            if(standard == null){
                // The reference file has fewer lines than the result file: stop comparing
                break;
            }
            standard = standard.trim();
            if(result.equals("")){
                continue;
            }
            if(result.equals(standard)){
                // The segmentation result is identical to the reference
                perfectLineCount++;
                perfectCharCount+=standard.replaceAll("\\s+", "").length();
            }else{
                // The segmentation result differs from the reference
                wrongLineCount++;
                wrongCharCount+=standard.replaceAll("\\s+", "").length();
            }
        }
    } catch (IOException ex) {
        // LOGGER belongs to the enclosing class, which is not shown in this excerpt
        LOGGER.error("Segmentation evaluation failed: ", ex);
    }
    int totalLineCount = perfectLineCount+wrongLineCount;
    int totalCharCount = perfectCharCount+wrongCharCount;
    EvaluationResult er = new EvaluationResult();
    er.setPerfectCharCount(perfectCharCount);
    er.setPerfectLineCount(perfectLineCount);
    er.setTotalCharCount(totalCharCount);
    er.setTotalLineCount(totalLineCount);
    er.setWrongCharCount(wrongCharCount);
    er.setWrongLineCount(wrongLineCount);
    return er;
}
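
As a small worked example of the strict criterion, the sketch below applies the same counting rules to two in-memory lines instead of files. The class and method names (`StrictMatchSketch`, `score`) are hypothetical, and the sample sentences are invented.

```java
import java.util.Arrays;
import java.util.List;

public class StrictMatchSketch {

    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] score(List<String> result, List<String> standard) {
        int[] counts = new int[4];
        for (int i = 0; i < result.size() && i < standard.size(); i++) {
            String r = result.get(i).trim();
            String s = standard.get(i).trim();
            if (r.isEmpty()) {
                continue; // empty result lines are not counted at all
            }
            // Characters are counted on the reference line, ignoring the separators.
            int chars = s.replaceAll("\\s+", "").length();
            if (r.equals(s)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> standard = Arrays.asList("研究 生命 起源", "中文 分词");
        List<String> result = Arrays.asList("研究生 命 起源", "中文 分词");
        System.out.println(Arrays.toString(score(result, standard)));
    }
}
```

One misplaced boundary in the first line makes all six of its characters count as wrong, so this sample scores a line perfect rate of 50% and a character perfect rate of 40%.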

Java code for the EvaluationResult class:

/**
 * Evaluation result for Chinese word segmentation
 * @author 杨尚川
 */
public class EvaluationResult implements Comparable{  
    private int totalLineCount;  
    private int perfectLineCount;  
    private int wrongLineCount;  
    private int totalCharCount;  
    private int perfectCharCount;  
    private int wrongCharCount;  
  
      
    public float getLinePerfectRate(){  
        return perfectLineCount/(float)totalLineCount*100;  
    }  
    public float getLineWrongRate(){  
        return wrongLineCount/(float)totalLineCount*100;  
    }  
    public float getCharPerfectRate(){  
        return perfectCharCount/(float)totalCharCount*100;  
    }  
    public float getCharWrongRate(){  
        return wrongCharCount/(float)totalCharCount*100;  
    }  
    public int getTotalLineCount() {  
        return totalLineCount;  
    }  
    public void setTotalLineCount(int totalLineCount) {  
        this.totalLineCount = totalLineCount;  
    }  
    public int getPerfectLineCount() {  
        return perfectLineCount;  
    }  
    public void setPerfectLineCount(int perfectLineCount) {  
        this.perfectLineCount = perfectLineCount;  
    }  
    public int getWrongLineCount() {  
        return wrongLineCount;  
    }  
    public void setWrongLineCount(int wrongLineCount) {  
        this.wrongLineCount = wrongLineCount;  
    }  
    public int getTotalCharCount() {  
        return totalCharCount;  
    }  
    public void setTotalCharCount(int totalCharCount) {  
        this.totalCharCount = totalCharCount;  
    }  
    public int getPerfectCharCount() {  
        return perfectCharCount;  
    }  
    public void setPerfectCharCount(int perfectCharCount) {  
        this.perfectCharCount = perfectCharCount;  
    }  
    public int getWrongCharCount() {  
        return wrongCharCount;  
    }  
    public void setWrongCharCount(int wrongCharCount) {  
        this.wrongCharCount = wrongCharCount;  
    }  
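    @Override
    public String toString(){
        // segmentationAlgorithm and segSpeed are fields of the full class;
        // their declarations and setters are omitted from this listing.
        return segmentationAlgorithm.name()+" ("+segmentationAlgorithm.getDes()+"):"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Char perfect rate: "+getCharPerfectRate()+"%"
                +" Char error rate: "+getCharWrongRate()+"%"
                +" Total chars: "+totalCharCount
                +" Perfect chars: "+perfectCharCount
                +" Wrong chars: "+wrongCharCount;
    }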
    @Override
    public int compareTo(Object o) {
        EvaluationResult other = (EvaluationResult)o;
        // Sort in descending order of line perfect rate
        return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
    }
}  

Evaluation results for the word segmenter using trigram:

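BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 265.62566 chars/ms
Line perfect rate: 55.352688%  Line error rate: 44.647312%  Total lines: 2533709  Perfect lines: 1402476  Wrong lines: 1131233
Char perfect rate: 46.23227% Char error rate: 53.76773% Total chars: 28374490 Perfect chars: 13118171 Wrong chars: 15256319

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 335.62155 chars/ms
Line perfect rate: 50.16934%  Line error rate: 49.83066%  Total lines: 2533709  Perfect lines: 1271145  Wrong lines: 1262564
Char perfect rate: 40.692997% Char error rate: 59.307003% Total chars: 28374490 Perfect chars: 11546430 Wrong chars: 16828060

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 686.71045 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598% Char error rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 733.9535 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861% Char error rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 432.87375 chars/ms
Line perfect rate: 45.863907%  Line error rate: 54.136093%  Total lines: 2533709  Perfect lines: 1162058  Wrong lines: 1371651
Char perfect rate: 35.942123% Char error rate: 64.05788% Total chars: 28374490 Perfect chars: 10198395 Wrong chars: 18176095

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 1033.58636 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978% Char error rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1175.4431 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812% Char error rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156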

Evaluation results for the word segmenter using bigram:

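BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 233.49121 chars/ms
Line perfect rate: 55.31531%  Line error rate: 44.68469%  Total lines: 2533709  Perfect lines: 1401529  Wrong lines: 1132180
Char perfect rate: 45.834396% Char error rate: 54.165604% Total chars: 28374490 Perfect chars: 13005277 Wrong chars: 15369213

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 303.59401 chars/ms
Line perfect rate: 52.007233%  Line error rate: 47.992767%  Total lines: 2533709  Perfect lines: 1317712  Wrong lines: 1215997
Char perfect rate: 42.424194% Char error rate: 57.575806% Total chars: 28374490 Perfect chars: 12037649 Wrong chars: 16336841

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 349.67215 chars/ms
Line perfect rate: 46.766422%  Line error rate: 53.23358%  Total lines: 2533709  Perfect lines: 1184925  Wrong lines: 1348784
Char perfect rate: 36.52718% Char error rate: 63.47282% Total chars: 28374490 Perfect chars: 10364401 Wrong chars: 18010089

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 598.04272 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598% Char error rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 676.7993 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861% Char error rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 806.9586 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978% Char error rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1020.9208 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812% Char error rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156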

Evaluation results for Ansj 0.9:

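Ansj ToAnalysis precise segmentation:
Segmentation speed: 495.9188 chars/ms
Line perfect rate: 58.609295%  Line error rate: 41.390705%  Total lines: 2533709  Perfect lines: 1484989  Wrong lines: 1048720
Char perfect rate: 50.97614% Char error rate: 49.023857% Total chars: 28374490 Perfect chars: 14464220 Wrong chars: 13910270

Ansj NlpAnalysis NLP segmentation:
Segmentation speed: 350.7527 chars/ms
Line perfect rate: 58.60353%  Line error rate: 41.396465%  Total lines: 2533709  Perfect lines: 1484843  Wrong lines: 1048866
Char perfect rate: 50.75546% Char error rate: 49.244545% Total chars: 28374490 Perfect chars: 14401602 Wrong chars: 13972888

Ansj BaseAnalysis basic segmentation:
Segmentation speed: 532.65424 chars/ms
Line perfect rate: 54.028584%  Line error rate: 45.97142%  Total lines: 2533709  Perfect lines: 1368927  Wrong lines: 1164782
Char perfect rate: 46.84512% Char error rate: 53.15488% Total chars: 28374490 Perfect chars: 13292064 Wrong chars: 15082426

Ansj IndexAnalysis index-oriented segmentation:
Segmentation speed: 564.6103 chars/ms
Line perfect rate: 53.510803%  Line error rate: 46.489197%  Total lines: 2533709  Perfect lines: 1355808  Wrong lines: 1177901
Char perfect rate: 46.355087% Char error rate: 53.644913% Total chars: 28374490 Perfect chars: 13153019 Wrong chars: 15221471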

Evaluation results for Ansj 1.4:

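Ansj ToAnalysis precise segmentation:
Segmentation speed: 581.7306 chars/ms
Line perfect rate: 58.60302%  Line error rate: 41.39698%  Total lines: 2533709  Perfect lines: 1484830  Wrong lines: 1048879
Char perfect rate: 50.968987% Char error rate: 49.031013% Total chars: 28374490 Perfect chars: 14462190 Wrong chars: 13912300

Ansj NlpAnalysis NLP segmentation:
Segmentation speed: 138.81165 chars/ms
Line perfect rate: 58.1515%  Line error rate: 41.8485%  Total lines: 2533687  Perfect lines: 1473377  Wrong lines: 1060310
Char perfect rate: 49.806484% Char error rate: 50.19352% Total chars: 28374398 Perfect chars: 14132290 Wrong chars: 14242108

Ansj BaseAnalysis basic segmentation:
Segmentation speed: 627.68475 chars/ms
Line perfect rate: 55.3174%  Line error rate: 44.6826%  Total lines: 2533709  Perfect lines: 1401582  Wrong lines: 1132127
Char perfect rate: 48.177986% Char error rate: 51.822014% Total chars: 28374490 Perfect chars: 13670258 Wrong chars: 14704232

Ansj IndexAnalysis index-oriented segmentation:
Segmentation speed: 715.55176 chars/ms
Line perfect rate: 50.89444%  Line error rate: 49.10556%  Total lines: 2533709  Perfect lines: 1289517  Wrong lines: 1244192
Char perfect rate: 42.965115% Char error rate: 57.034885% Total chars: 28374490 Perfect chars: 12191132 Wrong chars: 16183358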

The Ansj evaluation program is as follows:

import java.io.BufferedReader;  
import java.io.BufferedWriter;  
import java.io.FileInputStream;  
import java.io.FileOutputStream;  
import java.io.IOException;  
import java.io.InputStreamReader;  
import java.io.OutputStreamWriter;  
import java.nio.file.Files;  
import java.nio.file.Paths;  
import java.util.ArrayList;  
import java.util.Collections;  
import java.util.List;  
import org.ansj.domain.Term;  
import org.ansj.splitWord.analysis.BaseAnalysis;  
import org.ansj.splitWord.analysis.IndexAnalysis;  
import org.ansj.splitWord.analysis.NlpAnalysis;  
import org.ansj.splitWord.analysis.ToAnalysis;  
  
/**
 * Segmentation quality evaluation for the Ansj segmenter
 * @author 杨尚川
 */
public class AnsjEvaluation {  
  
    public static void main(String[] args) throws Exception{  
        // The test file d:/test-text.txt and the reference segmentation file d:/standard-text.txt
        // can be downloaded from: http://pan.baidu.com/s/1hqihzjY

        List<EvaluationResult> list = new ArrayList<>();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj BaseAnalysis basic segmentation");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj ToAnalysis precise segmentation");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj NlpAnalysis NLP segmentation");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj IndexAnalysis index-oriented segmentation");
        result.setSegSpeed(rate);
        list.add(result);

        // Print the evaluation results
        Collections.sort(list);  
        System.out.println("");  
        for(EvaluationResult r : list){  
            System.out.println(r+"\n");  
        }  
    }  
    private static float seg(final String input, final String output, final String type) throws Exception{  
        float rate = 0;  
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));  
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){  
            long size = Files.size(Paths.get(input));  
            System.out.println("size:"+size);  
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;  
            int progress=0;  
            long start = System.currentTimeMillis();  
            String line = null;  
            while((line = reader.readLine()) != null){  
                if("".equals(line.trim())){  
                    writer.write("\n");  
                    continue;  
                }  
                textLength += line.length();  
                switch(type){  
                    case "BaseAnalysis":  
                        for(Term term : BaseAnalysis.parse(line)){  
                            writer.write(term.getName()+" ");  
                        }  
                        break;  
                    case "ToAnalysis":  
                        for(Term term : ToAnalysis.parse(line)){  
                            writer.write(term.getName()+" ");  
                        }  
                        break;  
                    case "NlpAnalysis":  
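                        try{
                            for(Term term : NlpAnalysis.parse(line)){
                                writer.write(term.getName()+" ");
                            }
                        }catch(Exception e){
                            // Exceptions from NlpAnalysis are swallowed: the line is written partial
                            // or empty, and evaluation() skips empty result lines, so the total
                            // line count can drop.
                        }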
                        break;  
                    case "IndexAnalysis":  
                        for(Term term : IndexAnalysis.parse(line)){  
                            writer.write(term.getName()+" ");  
                        }  
                        break;  
                }                  
                writer.write("\n");  
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // Rough estimate: 2.99 appears to approximate the bytes per character of the UTF-8 input
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Characters: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }  
        return rate;  
    }  
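    /**
     * Evaluate segmentation quality
     * @param resultText path of the file holding the actual segmentation result
     * @param standardText path of the file holding the reference segmentation
     * @return evaluation result
     */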
    private static EvaluationResult evaluation(String resultText, String standardText) {  
        int perfectLineCount=0;  
        int wrongLineCount=0;  
        int perfectCharCount=0;  
        int wrongCharCount=0;  
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));  
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){  
            String result;  
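            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine();
                if(standard == null){
                    // The reference file has fewer lines than the result file: stop comparing
                    break;
                }
                standard = standard.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the reference
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the reference
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }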
        int totalLineCount = perfectLineCount+wrongLineCount;  
        int totalCharCount = perfectCharCount+wrongCharCount;  
        EvaluationResult er = new EvaluationResult();  
        er.setPerfectCharCount(perfectCharCount);  
        er.setPerfectLineCount(perfectLineCount);  
        er.setTotalCharCount(totalCharCount);  
        er.setTotalLineCount(totalLineCount);  
        er.setWrongCharCount(wrongCharCount);  
        er.setWrongLineCount(wrongLineCount);       
        return er;  
    }  
    /**
     * Segmentation evaluation result
     */
    private static class EvaluationResult implements Comparable{  
        private String analyzer;  
        private float segSpeed;  
        private int totalLineCount;  
        private int perfectLineCount;  
        private int wrongLineCount;  
        private int totalCharCount;  
        private int perfectCharCount;  
        private int wrongCharCount;  
  
        public String getAnalyzer() {  
            return analyzer;  
        }  
        public void setAnalyzer(String analyzer) {  
            this.analyzer = analyzer;  
        }  
        public float getSegSpeed() {  
            return segSpeed;  
        }  
        public void setSegSpeed(float segSpeed) {  
            this.segSpeed = segSpeed;  
        }  
        public float getLinePerfectRate(){  
            return perfectLineCount/(float)totalLineCount*100;  
        }  
        public float getLineWrongRate(){  
            return wrongLineCount/(float)totalLineCount*100;  
        }  
        public float getCharPerfectRate(){  
            return perfectCharCount/(float)totalCharCount*100;  
        }  
        public float getCharWrongRate(){  
            return wrongCharCount/(float)totalCharCount*100;  
        }  
        public int getTotalLineCount() {  
            return totalLineCount;  
        }  
        public void setTotalLineCount(int totalLineCount) {  
            this.totalLineCount = totalLineCount;  
        }  
        public int getPerfectLineCount() {  
            return perfectLineCount;  
        }  
        public void setPerfectLineCount(int perfectLineCount) {  
            this.perfectLineCount = perfectLineCount;  
        }  
        public int getWrongLineCount() {  
            return wrongLineCount;  
        }  
        public void setWrongLineCount(int wrongLineCount) {  
            this.wrongLineCount = wrongLineCount;  
        }  
        public int getTotalCharCount() {  
            return totalCharCount;  
        }  
        public void setTotalCharCount(int totalCharCount) {  
            this.totalCharCount = totalCharCount;  
        }  
        public int getPerfectCharCount() {  
            return perfectCharCount;  
        }  
        public void setPerfectCharCount(int perfectCharCount) {  
            this.perfectCharCount = perfectCharCount;  
        }  
        public int getWrongCharCount() {  
            return wrongCharCount;  
        }  
        public void setWrongCharCount(int wrongCharCount) {  
            this.wrongCharCount = wrongCharCount;  
        }  
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +" Char error rate: "+getCharWrongRate()+"%"
                    +" Total chars: "+totalCharCount
                    +" Perfect chars: "+perfectCharCount
                    +" Wrong chars: "+wrongCharCount;
        }
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            // Sort in descending order of line perfect rate
            return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
        }
    }  
}  
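
The evaluation programs in this article differ mainly in how one line is turned into words. As a possible simplification (not part of the original code), the per-line step could be passed in as a function. The sketch below assumes Java 8 or later; `SegHarnessSketch`, `segmentLines`, and the one-character-per-word stand-in segmenter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SegHarnessSketch {

    /** Applies a segmenter to each line and joins the words with single spaces. */
    static List<String> segmentLines(List<String> lines, Function<String, List<String>> segmenter) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // Keep an empty line so line numbers stay aligned with the reference file.
                out.add("");
                continue;
            }
            out.add(String.join(" ", segmenter.apply(line)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in segmenter for illustration only: one word per character.
        Function<String, List<String>> perChar = line -> Arrays.asList(line.split(""));
        System.out.println(segmentLines(Arrays.asList("中文分词", "", "评估"), perChar));
    }
}
```

A real segmenter would be plugged in the same way, for example by collecting `term.getName()` for each `Term` returned by `ToAnalysis.parse(line)`.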

Evaluation results for MMSeg4j 1.9.1:

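MMSeg4j ComplexSeg:
Segmentation speed: 794.24805 chars/ms
Line perfect rate: 38.817604%  Line error rate: 61.182396%  Total lines: 2533688  Perfect lines: 983517  Wrong lines: 1550171
Char perfect rate: 29.604435% Char error rate: 70.39557% Total chars: 28374428 Perfect chars: 8400089 Wrong chars: 19974339

MMSeg4j SimpleSeg:
Segmentation speed: 1026.1058 chars/ms
Line perfect rate: 37.570095%  Line error rate: 62.429905%  Total lines: 2533688  Perfect lines: 951909  Wrong lines: 1581779
Char perfect rate: 28.455273% Char error rate: 71.54473% Total chars: 28374428 Perfect chars: 8074021 Wrong chars: 20300407

MMSeg4j MaxWordSeg:
Segmentation speed: 813.0676 chars/ms
Line perfect rate: 34.27573%  Line error rate: 65.72427%  Total lines: 2533688  Perfect lines: 868440  Wrong lines: 1665248
Char perfect rate: 25.20896% Char error rate: 74.79104% Total chars: 28374428 Perfect chars: 7152898 Wrong chars: 21221530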

The MMSeg4j 1.9.1 evaluation program is as follows:

import com.chenlb.mmseg4j.ComplexSeg;  
import com.chenlb.mmseg4j.Dictionary;  
import com.chenlb.mmseg4j.MMSeg;  
import com.chenlb.mmseg4j.MaxWordSeg;  
import com.chenlb.mmseg4j.Seg;  
import com.chenlb.mmseg4j.SimpleSeg;  
import com.chenlb.mmseg4j.Word;  
import java.io.BufferedReader;  
import java.io.BufferedWriter;  
import java.io.FileInputStream;  
import java.io.FileOutputStream;  
import java.io.IOException;  
import java.io.InputStreamReader;  
import java.io.OutputStreamWriter;  
import java.io.StringReader;  
import java.nio.file.Files;  
import java.nio.file.Paths;  
import java.util.ArrayList;  
import java.util.Collections;  
import java.util.List;  
  
/**
 * Segmentation quality evaluation for the MMSeg4j segmenter
 * @author 杨尚川
 */
public class MMSeg4jEvaluation {  
  
    public static void main(String[] args) throws Exception{  
        // The test file d:/test-text.txt and the reference segmentation file d:/standard-text.txt
        // can be downloaded from: http://pan.baidu.com/s/1hqihzjY

        List<EvaluationResult> list = new ArrayList<>();
        Dictionary dic = Dictionary.getInstance();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j ComplexSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j SimpleSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j MaxWordSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // Print the evaluation results
        Collections.sort(list);  
        System.out.println("");  
        for(EvaluationResult r : list){  
            System.out.println(r+"\n");  
        }  
    }  
    private static float seg(final String input, final String output, final Seg seg) throws Exception{  
        float rate = 0;  
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));  
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){  
            long size = Files.size(Paths.get(input));  
            System.out.println("size:"+size);  
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;  
            int progress=0;  
            long start = System.currentTimeMillis();  
            String line = null;  
            while((line = reader.readLine()) != null){  
                if("".equals(line.trim())){  
                    writer.write("\n");  
                    continue;  
                }  
                textLength += line.length();  
                writer.write(seg(line, seg));  
                writer.write("\n");  
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // Rough estimate: 2.99 appears to approximate the bytes per character of the UTF-8 input
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Characters: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }  
        return rate;  
    }  
    private static String seg(String text, Seg seg) throws IOException {  
        StringBuilder result = new StringBuilder();  
        MMSeg mmSeg = new MMSeg(new StringReader(text), seg);  
        Word word = null;  
        while((word=mmSeg.next())!=null) {  
            result.append(word.getString()).append(" ");              
        }  
        return result.toString().trim();  
    }  
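    /**
     * Evaluate segmentation quality
     * @param resultText path of the file holding the actual segmentation result
     * @param standardText path of the file holding the reference segmentation
     * @return evaluation result
     */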
    private static EvaluationResult evaluation(String resultText, String standardText) {  
        int perfectLineCount=0;  
        int wrongLineCount=0;  
        int perfectCharCount=0;  
        int wrongCharCount=0;  
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));  
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){  
            String result;  
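            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine();
                if(standard == null){
                    // The reference file has fewer lines than the result file: stop comparing
                    break;
                }
                standard = standard.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the reference
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the reference
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }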
        int totalLineCount = perfectLineCount+wrongLineCount;  
        int totalCharCount = perfectCharCount+wrongCharCount;  
        EvaluationResult er = new EvaluationResult();  
        er.setPerfectCharCount(perfectCharCount);  
        er.setPerfectLineCount(perfectLineCount);  
        er.setTotalCharCount(totalCharCount);  
        er.setTotalLineCount(totalLineCount);  
        er.setWrongCharCount(wrongCharCount);  
        er.setWrongLineCount(wrongLineCount);       
        return er;  
    }  
    /**
     * Segmentation evaluation result
     */
    private static class EvaluationResult implements Comparable{  
        private String analyzer;  
        private float segSpeed;  
        private int totalLineCount;  
        private int perfectLineCount;  
        private int wrongLineCount;  
        private int totalCharCount;  
        private int perfectCharCount;  
        private int wrongCharCount;  
  
        public String getAnalyzer() {  
            return analyzer;  
        }  
        public void setAnalyzer(String analyzer) {  
            this.analyzer = analyzer;  
        }  
        public float getSegSpeed() {  
            return segSpeed;  
        }  
        public void setSegSpeed(float segSpeed) {  
            this.segSpeed = segSpeed;  
        }  
        public float getLinePerfectRate(){  
            return perfectLineCount/(float)totalLineCount*100;  
        }  
        public float getLineWrongRate(){  
            return wrongLineCount/(float)totalLineCount*100;  
        }  
        public float getCharPerfectRate(){  
            return perfectCharCount/(float)totalCharCount*100;  
        }  
        public float getCharWrongRate(){  
            return wrongCharCount/(float)totalCharCount*100;  
        }  
        public int getTotalLineCount() {  
            return totalLineCount;  
        }  
        public void setTotalLineCount(int totalLineCount) {  
            this.totalLineCount = totalLineCount;  
        }  
        public int getPerfectLineCount() {  
            return perfectLineCount;  
        }  
        public void setPerfectLineCount(int perfectLineCount) {  
            this.perfectLineCount = perfectLineCount;  
        }  
        public int getWrongLineCount() {  
            return wrongLineCount;  
        }  
        public void setWrongLineCount(int wrongLineCount) {  
            this.wrongLineCount = wrongLineCount;  
        }  
        public int getTotalCharCount() {  
            return totalCharCount;  
        }  
        public void setTotalCharCount(int totalCharCount) {  
            this.totalCharCount = totalCharCount;  
        }  
        public int getPerfectCharCount() {  
            return perfectCharCount;  
        }  
        public void setPerfectCharCount(int perfectCharCount) {  
            this.perfectCharCount = perfectCharCount;  
        }  
        public int getWrongCharCount() {  
            return wrongCharCount;  
        }  
        public void setWrongCharCount(int wrongCharCount) {  
            this.wrongCharCount = wrongCharCount;  
        }  
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +" Char error rate: "+getCharWrongRate()+"%"
                    +" Total chars: "+totalCharCount
                    +" Perfect chars: "+perfectCharCount
                    +" Wrong chars: "+wrongCharCount;
        }
        @Override  
        public int compareTo(Object o) {  
            EvaluationResult other = (EvaluationResult)o;  
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){  
                return 1;  
            }  
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){  
                return -1;  
            }  
            return 0;  
        }  
    }  
}   
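
The `compareTo` method above takes `Object` because `EvaluationResult` implements the raw `Comparable` type. As a rough sketch of a type-safe alternative, the same descending order by line perfect rate can be written against `Comparable<T>`. The class name `RankedResult`, its fields, and the sample counts are hypothetical and only serve the illustration; they are not part of the original program.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RankedResult implements Comparable<RankedResult> {
    final String analyzer;        // hypothetical label for the segmenter
    final int perfectLineCount;   // lines identical to the standard
    final int totalLineCount;     // all evaluated lines

    RankedResult(String analyzer, int perfectLineCount, int totalLineCount) {
        this.analyzer = analyzer;
        this.perfectLineCount = perfectLineCount;
        this.totalLineCount = totalLineCount;
    }

    // Same formula as getLinePerfectRate() in the listing
    float getLinePerfectRate() {
        return perfectLineCount / (float) totalLineCount * 100;
    }

    @Override
    public int compareTo(RankedResult other) {
        // Arguments are swapped so that the higher rate sorts first
        return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
    }

    public static void main(String[] args) {
        List<RankedResult> list = new ArrayList<>();
        list.add(new RankedResult("segmenter A", 1, 4)); // made-up counts
        list.add(new RankedResult("segmenter B", 3, 4)); // made-up counts
        Collections.sort(list);
        for (RankedResult r : list) {
            System.out.println(r.analyzer + ": " + r.getLinePerfectRate() + "%");
        }
    }
}
```

With the typed interface the cast from `Object` disappears and the compiler rejects comparisons against unrelated types.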

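The evaluation results for ik-analyzer2012_u6 are as follows:

IKAnalyzer smart segmentation:  
Segmentation speed: 178.3516 chars/ms  
Line perfect rate: 37.55943%  Line error rate: 62.440567%  Total lines: 2533686  Perfect lines: 951638  Error lines: 1582048  
Char perfect rate: 27.978464% Char error rate: 72.02154% Total chars: 28374416 Perfect chars: 7938726 Error chars: 20435690  
  
IKAnalyzer fine-grained segmentation:  
Segmentation speed: 182.97859 chars/ms  
Line perfect rate: 18.872742%  Line error rate: 81.12726%  Total lines: 2533686  Perfect lines: 478176  Error lines: 2055510  
Char perfect rate: 10.936535% Char error rate: 89.06347% Total chars: 28374416 Perfect chars: 3103178 Error chars: 25271238  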
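
The percentages above follow directly from the reported counts. As a small sanity check, the sketch below recomputes the two line perfect rates with the same float arithmetic that the `EvaluationResult` getters use. The class name `RateCheck` and the helper `rate` are hypothetical names introduced only for this illustration.

```java
final class RateCheck {
    // Same formula as getLinePerfectRate()/getCharPerfectRate(): count / (float) total * 100
    static float rate(int count, int total) {
        return count / (float) total * 100;
    }

    public static void main(String[] args) {
        int totalLines = 2533686; // total evaluated lines, from the reported results

        // Perfect line counts for the two IKAnalyzer modes, from the reported results
        System.out.println("smart:        " + rate(951638, totalLines) + "%");
        System.out.println("fine-grained: " + rate(478176, totalLines) + "%");

        // Perfect and error counts always add up to the total
        System.out.println(951638 + 1582048 == totalLines);
    }
}
```

The smart mode matches the standard on roughly twice as many lines as the fine-grained mode, which is expected under a strict-equality criterion because fine-grained mode emits additional overlapping sub-words.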

The evaluation program for ik-analyzer2012_u6 is as follows:

import java.io.BufferedReader;  
import java.io.BufferedWriter;  
import java.io.FileInputStream;  
import java.io.FileOutputStream;  
import java.io.IOException;  
import java.io.InputStreamReader;  
import java.io.OutputStreamWriter;  
import java.io.StringReader;  
import java.nio.file.Files;  
import java.nio.file.Paths;  
import java.util.ArrayList;  
import java.util.Collections;  
import java.util.List;  
import org.wltea.analyzer.core.IKSegmenter;  
import org.wltea.analyzer.core.Lexeme;  
  
/** 
 * Evaluates the segmentation quality of the IKAnalyzer segmenter. 
 * @author 杨尚川 
 */  
public class IKAnalyzerEvaluation {  
  
    public static void main(String[] args) throws Exception{  
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:  
        // http://pan.baidu.com/s/1hqihzjY  
          
        List<EvaluationResult> list = new ArrayList<>();  
          
        // Segment the text in smart mode (useSmart = true)  
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);  
        // Evaluate the segmentation result against the standard  
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");  
        result.setAnalyzer("IKAnalyzer smart segmentation");  
        result.setSegSpeed(rate);  
        list.add(result);  
          
        // Segment the text in fine-grained mode (useSmart = false)  
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);  
        // Evaluate the segmentation result against the standard  
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");  
        result.setAnalyzer("IKAnalyzer fine-grained segmentation");  
        result.setSegSpeed(rate);  
        list.add(result);  
          
        // Print the evaluation results, highest line perfect rate first  
        Collections.sort(list);  
        System.out.println("");  
        for(EvaluationResult r : list){  
            System.out.println(r+"\n");  
        }  
    }  
    /** 
     * Segments the input file line by line and writes one segmented line per input line. 
     * @return segmentation speed in characters per millisecond 
     */  
    private static float seg(final String input, final String output, final boolean useSmart) throws Exception{  
        float rate = 0;  
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));  
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){  
            long size = Files.size(Paths.get(input));  
            System.out.println("size:"+size);  
            System.out.println("File size: "+(float)size/1024/1024+" MB");  
            int textLength=0;  
            int progress=0;  
            long start = System.currentTimeMillis();  
            String line = null;  
            while((line = reader.readLine()) != null){  
                // Keep empty lines so that result and standard files stay aligned line by line  
                if("".equals(line.trim())){  
                    writer.write("\n");  
                    continue;  
                }  
                textLength += line.length();  
                writer.write(seg(line, useSmart));  
                writer.write("\n");  
                progress += line.length();  
                // Report progress roughly every 500000 characters  
                if( progress > 500000){  
                    progress = 0;  
                    // size is in bytes; 2.99 appears to approximate the average UTF-8 bytes per character of the test file  
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");  
                }  
            }  
            long cost = System.currentTimeMillis() - start;  
            rate = textLength/(float)cost;  
            System.out.println("Character count: "+textLength);  
            System.out.println("Segmentation time: "+cost+" ms");  
            System.out.println("Segmentation speed: "+rate+" chars/ms");  
        }  
        return rate;  
    }  
    private static String seg(String text, boolean useSmart) throws IOException {  
        StringBuilder result = new StringBuilder();  
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);  
        Lexeme word = null;  
        while((word=ik.next())!=null) {  
            result.append(word.getLexemeText()).append(" ");              
        }  
        return result.toString().trim();  
    }  
    /** 
     * Evaluates segmentation quality. 
     * @param resultText path of the file containing the actual segmentation result 
     * @param standardText path of the file containing the standard segmentation result 
     * @return the evaluation result 
     */  
    private static EvaluationResult evaluation(String resultText, String standardText) {  
        int perfectLineCount=0;  
        int wrongLineCount=0;  
        int perfectCharCount=0;  
        int wrongCharCount=0;  
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));  
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){  
            String result;  
            while( (result = resultReader.readLine()) != null ){  
                result = result.trim();  
                String standardLine = standardReader.readLine();  
                if(standardLine == null){  
                    // The standard file has fewer lines than the result file: stop instead of throwing a NullPointerException  
                    break;  
                }  
                String standard = standardLine.trim();  
                if(result.equals("")){  
                    continue;  
                }  
                if(result.equals(standard)){  
                    // The segmentation result is identical to the standard  
                    perfectLineCount++;  
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();  
                }else{  
                    // The segmentation result differs from the standard  
                    wrongLineCount++;  
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();  
                }  
            }  
        } catch (IOException ex) {  
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());  
        }  
        int totalLineCount = perfectLineCount+wrongLineCount;  
        int totalCharCount = perfectCharCount+wrongCharCount;  
        EvaluationResult er = new EvaluationResult();  
        er.setPerfectCharCount(perfectCharCount);  
        er.setPerfectLineCount(perfectLineCount);  
        er.setTotalCharCount(totalCharCount);  
        er.setTotalLineCount(totalLineCount);  
        er.setWrongCharCount(wrongCharCount);  
        er.setWrongLineCount(wrongLineCount);       
        return er;  
    }  
    /** 
     * Evaluation result 
     */  
    private static class EvaluationResult implements Comparable{  
        private String analyzer;  
        private float segSpeed;  
        private int totalLineCount;  
        private int perfectLineCount;  
        private int wrongLineCount;  
        private int totalCharCount;  
        private int perfectCharCount;  
        private int wrongCharCount;  
  
        public String getAnalyzer() {  
            return analyzer;  
        }  
        public void setAnalyzer(String analyzer) {  
            this.analyzer = analyzer;  
        }  
        public float getSegSpeed() {  
            return segSpeed;  
        }  
        public void setSegSpeed(float segSpeed) {  
            this.segSpeed = segSpeed;  
        }  
        public float getLinePerfectRate(){  
            return perfectLineCount/(float)totalLineCount*100;  
        }  
        public float getLineWrongRate(){  
            return wrongLineCount/(float)totalLineCount*100;  
        }  
        public float getCharPerfectRate(){  
            return perfectCharCount/(float)totalCharCount*100;  
        }  
        public float getCharWrongRate(){  
            return wrongCharCount/(float)totalCharCount*100;  
        }  
        public int getTotalLineCount() {  
            return totalLineCount;  
        }  
        public void setTotalLineCount(int totalLineCount) {  
            this.totalLineCount = totalLineCount;  
        }  
        public int getPerfectLineCount() {  
            return perfectLineCount;  
        }  
        public void setPerfectLineCount(int perfectLineCount) {  
            this.perfectLineCount = perfectLineCount;  
        }  
        public int getWrongLineCount() {  
            return wrongLineCount;  
        }  
        public void setWrongLineCount(int wrongLineCount) {  
            this.wrongLineCount = wrongLineCount;  
        }  
        public int getTotalCharCount() {  
            return totalCharCount;  
        }  
        public void setTotalCharCount(int totalCharCount) {  
            this.totalCharCount = totalCharCount;  
        }  
        public int getPerfectCharCount() {  
            return perfectCharCount;  
        }  
        public void setPerfectCharCount(int perfectCharCount) {  
            this.perfectCharCount = perfectCharCount;  
        }  
        public int getWrongCharCount() {  
            return wrongCharCount;  
        }  
        public void setWrongCharCount(int wrongCharCount) {  
            this.wrongCharCount = wrongCharCount;  
        }  
        @Override  
        public String toString(){  
            return analyzer+":"  
                    +"\n"  
                    +"Segmentation speed: "+segSpeed+" chars/ms"  
                    +"\n"  
                    +"Line perfect rate: "+getLinePerfectRate()+"%"  
                    +"  Line error rate: "+getLineWrongRate()+"%"  
                    +"  Total lines: "+totalLineCount  
                    +"  Perfect lines: "+perfectLineCount  
                    +"  Error lines: "+wrongLineCount  
                    +"\n"  
                    +"Char perfect rate: "+getCharPerfectRate()+"%"  
                    +" Char error rate: "+getCharWrongRate()+"%"  
                    +" Total chars: "+totalCharCount  
                    +" Perfect chars: "+perfectCharCount  
                    +" Error chars: "+wrongCharCount;  
        }  
        @Override  
        public int compareTo(Object o) {  
            // Sort in descending order of line perfect rate (best result first)  
            EvaluationResult other = (EvaluationResult)o;  
            return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());  
        }  
    }  
}  
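
The strict line-by-line criterion used by `evaluation` can also be tried on small in-memory inputs. The following is a minimal sketch under stated assumptions, not the author's program: the class `StrictLineEvaluation`, the method `evaluate`, and the sample lines are hypothetical, ASCII letters stand in for Chinese characters, and the loop simply stops at the shorter of the two inputs.

```java
import java.util.Arrays;
import java.util.List;

final class StrictLineEvaluation {
    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> resultLines, List<String> standardLines) {
        int perfectLines = 0, wrongLines = 0, perfectChars = 0, wrongChars = 0;
        // Only compare lines that exist in both inputs
        int n = Math.min(resultLines.size(), standardLines.size());
        for (int i = 0; i < n; i++) {
            String result = resultLines.get(i).trim();
            String standard = standardLines.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines are skipped, as in the listing
            }
            // Characters are counted on the standard line with whitespace removed
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                perfectLines++;
                perfectChars += chars;
            } else {
                wrongLines++;
                wrongChars += chars;
            }
        }
        return new int[] {perfectLines, wrongLines, perfectChars, wrongChars};
    }

    public static void main(String[] args) {
        // Words are separated by spaces; the first line is over-segmented
        List<String> result = Arrays.asList("ab cd e", "fg hi", "");
        List<String> standard = Arrays.asList("ab cde", "fg hi", "");
        System.out.println(Arrays.toString(evaluate(result, standard))); // prints [1, 1, 4, 5]
    }
}
```

A single misplaced word boundary marks the whole line, and all of its characters, as wrong, which is why the character-level rates reported in this post are lower than a per-word precision or recall would be.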

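The evaluation programs for ansj, mmseg4j, and ik-analyzer can be downloaded from the attachments. To evaluate the word segmenter, simply run the evaluation.bat script in the project root directory.

References:

1. Test data set and standard data set for the word segmenter evaluation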
