Stanford CoreNLP Source Code Study (1)

zxye

Category: nlp

Article link: https://blog.csdn.net/qq_20576847/article/details/53561899

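This is the first post in a series studying the Stanford CoreNLP source code. It looks at the core components and key algorithms to help readers understand how natural language processing is implemented in practice.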

Code


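// openie depends on tokenize, ssplit, pos, lemma, depparse and natlog
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class Try1 {
    public static void main(String[] args){
        Properties props = new Properties();         // props is a map-like structure of key/value settings
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse, natlog, openie");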
        /*   tokenize     Tokenizes the text into a sequence of tokens. For Chinese this means segmenting a sentence into words; English is presumably simpler.
           * ssplit       Splits a sequence of tokens into sentences.
           * cleanxml     Removes most or all XML tags from the document.
           * truecase     Determines the likely true case of tokens in text.
           * pos          Part of speech: labels tokens with their POS tag (CC, DT, JJR, TO, VB, and so on).
           * lemma        Lemmatization: gives the base form of each word, e.g. sings -> sing, your -> you, is -> be.
           * gender       Adds likely gender information to names.
           * ner          Named entity recognizer: labels entities such as ORGANIZATION and LOCATION.
           *              Seven types: Time, Location, Organization, Person, Money, Percent, Date.
           * parse        Finds the syntactic structure of a sentence: which words group together, and which words are the subject or object of a verb.
           * depparse     Neural network dependency parser (a more powerful parse?).
           * sentiment    Sentiment analysis with a compositional model over trees using deep learning.
           * natlog       Natural logic, e.g. "some cute rabbits are small" entails "some rabbits are small".
           * dcoref       Coreference resolution: implements mention detection and both pronominal and nominal coreference resolution.
           * openie       Open information extraction: extracts relation triples.
           * */
        /*
         * Choose one of the following two ways to initialize the pipeline.
         */
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);                  // initialize the pipeline from props
//      Alternative (requires importing edu.stanford.nlp.util.PropertiesUtils):
//      StanfordCoreNLP pipeline = new StanfordCoreNLP(
//              PropertiesUtils.asProperties(
//                  "annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie",
//                  "ssplit.isOneSentence", "true",
//                  "parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
//                  "tokenize.language", "en"));

        String text = "Stanford University is located in Stanford which is one of best good universities in 2015.\n";
        Annotation doc = new Annotation(text);                                 // wrap the string in an Annotation
        pipeline.annotate(doc);
        /*  StanfordCoreNLP.annotate(Annotation) runs all the annotators configured above on the text;
         *  afterwards doc holds the annotated result.
         */
        int sentNo = 0;
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        // sentence is a CoreMap: classes serve as keys, and values can be of custom types
        for(CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)){            // for each sentence in doc
            System.out.println("Sentence #" + ++sentNo + ": " + sentence.get(TextAnnotation.class)); // print the original text of the sentence
            // How do we get the result of sentence splitting?
            System.out.println("word\tpos\tlemma\tne");
            // a CoreLabel is a CoreMap with additional token-specific methods
            for(CoreLabel token : sentence.get(TokensAnnotation.class)){                       // for each token of the sentence (after tokenization)
                String word = token.get(TextAnnotation.class);                                 // the token text
                String lema = token.get(LemmaAnnotation.class);                                // the token's lemma
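
The comments in the listing describe a CoreMap as a map that uses class objects as keys and stores values of different types. The following is a minimal, self-contained sketch of that idea using only the Java standard library. It is not CoreNLP's implementation: TypedMapSketch, Key, TextKey, IndexKey and TypedMap are hypothetical names made up for illustration, with TextKey playing a role loosely similar to TextAnnotation.

```java
import java.util.HashMap;
import java.util.Map;

public class TypedMapSketch {
    // Hypothetical marker interface: a key class declares the type V of its value.
    interface Key<V> {}

    // Hypothetical key classes; each one fixes the value type stored under it.
    static final class TextKey implements Key<String> {}
    static final class IndexKey implements Key<Integer> {}

    // A minimal class-keyed map (an illustration, not CoreNLP's CoreMap).
    static final class TypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        // The compiler checks that the value matches the type declared by the key class.
        <V> void set(Class<? extends Key<V>> key, V value) {
            values.put(key, value);
        }

        // The cast is safe because set() only stores values of the key's declared type.
        @SuppressWarnings("unchecked")
        <V> V get(Class<? extends Key<V>> key) {
            return (V) values.get(key);
        }
    }

    public static void main(String[] args) {
        TypedMap token = new TypedMap();
        token.set(TextKey.class, "universities");
        token.set(IndexKey.class, 12);

        // No casts at the call site: the key class determines the result type.
        String text = token.get(TextKey.class);
        Integer index = token.get(IndexKey.class);
        System.out.println(text + " @ " + index);
    }
}
```

This is why calls such as token.get(TextAnnotation.class) in the listing can be assigned to a String without a cast: the key class carries the value type.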
