A Summary of Common Word Segmentation Tools


mpk_no1


Category: Natural Language Processing (NLP)

Article link: https://blog.csdn.net/mpk_no1/article/details/75201505


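Unlike English, Chinese has no natural delimiter between words, so the first step in Chinese natural language processing is usually to segment the corpus into words. Below I summarize some of the common Chinese word segmentation tools I have used.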

1. Stanford NLP

This uses the Stanford University word segmenter, which can be downloaded from http://nlp.stanford.edu/software/segmenter.shtml

Once the library is set up in your project, you need to load a Properties file, which holds the parameter settings for the pipeline.
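To make the role of the Properties file more concrete, here is a minimal sketch that uses only `java.util.Properties` and does not call CoreNLP itself. The class name `PropertiesSketch`, its helper methods, and the two settings shown are illustrative assumptions. A real Chinese configuration also needs model paths, and those depend on the CoreNLP version you have installed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class PropertiesSketch {

    // Parse .properties-style text into a java.util.Properties object.
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Split the comma-separated "annotators" value into a trimmed list.
    public static List<String> annotators(Properties props) {
        String raw = props.getProperty("annotators", "");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed, illustrative settings only; the author's actual
        // CoreNLP-chinese.properties file is not shown in the article.
        String text = "annotators = tokenize, ssplit, pos, ner\n"
                    + "tokenize.language = zh\n";
        Properties props = parse(text);
        System.out.println(annotators(props));
        System.out.println(props.getProperty("tokenize.language"));
    }
}
```

The same key/value pairs could equally live in a file on the classpath, which is how the example in this article loads them.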

Let's look at an example:

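package com.sectong.application;

import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class CoreNLPSegment {

    public static void main(String[] args) {

        // Load the custom Properties file
        StanfordCoreNLP pipeline = new StanfordCoreNLP("CoreNLP-chinese.properties");

        // Initialize an Annotation with some text; the text is the constructor argument
        Annotation annotation = new Annotation("我爱北京天安门，天安门上太阳升。");

        // Run all the selected annotators on this text
        pipeline.annotate(annotation);

        // Get the list of CoreMaps (sentences) from the annotation and take element 0
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        CoreMap sentence = sentences.get(0);

        // Get the list of CoreLabels (tokens) from the CoreMap and print them one by one
        List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);

        // NOTE: the original listing is cut off inside the next statement. Everything
        // from here on is a reconstruction based on the imports above, and it assumes
        // the Properties file enables the pos and ner annotators.
        System.out.println("Char/Word" + "\t" + "POS" + "\t" + "NER");
        for (CoreLabel token : tokens) {
            String word = token.getString(TextAnnotation.class);
            String pos = token.getString(PartOfSpeechAnnotation.class);
            String ner = token.getString(NamedEntityTagAnnotation.class);
            System.out.println(word + "\t" + pos + "\t" + ner);
        }
    }
}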
