Stanford CoreNLP Source Code Study (1)

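This is the first post in a series on studying the Stanford CoreNLP source code. It looks at the core components and key algorithms to help readers understand how the natural language processing is implemented.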

Code


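import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.PropertiesUtils;   // only needed for the commented-out alternative below

// openie depends on tokenize, ssplit, pos, lemma, depparse and natlog
public class Try1 {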
    public static void main(String[] args){
        Properties props = new Properties();         // props is a map-like structure of string keys and values
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse, natlog, openie");
        /*   tokenize     Tokenizes the text into a sequence of tokens. For Chinese this means segmenting a sentence into words; for English it is comparatively simple.
           * ssplit       Splits a sequence of tokens into sentences.
           * cleanxml     Removes most or all XML tags from the document.
           * truecase     Determines the likely true case of tokens in text.
           * pos          Part-of-speech tagging: labels tokens with their POS tag (CC, DT, JJR, TO, VB and so on).
           * lemma        Lemmatization: gives the base form of each word, e.g. sings -> sing, your -> you, is -> be.
           * gender       Adds likely gender information to names.
           * ner          Named entity recognition: tags entities such as ORGANIZATION and LOCATION.
           *              The 7 classes are Time, Location, Organization, Person, Money, Percent, Date.
           * parse        Finds the syntactic structure of a sentence: which words group together, and which words are the subject or object of a verb.
           * depparse     Neural network dependency parser; produces dependency parses directly and is typically much faster than parse.
           * sentiment    Sentiment analysis with a compositional model over trees using deep learning.
           * natlog       Natural logic, e.g. "some cute rabbits are small" entails "some rabbits are small".
           * dcoref       Coreference resolution: implements mention detection and both pronominal and nominal coreference resolution.
           * openie       Open information extraction: extracts relation triples.
           * */
        /*
         * Use either one of the following two ways to initialize the pipeline
         */
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);                  // initialize the pipeline from props
//      StanfordCoreNLP pipeline = new StanfordCoreNLP(
//              PropertiesUtils.asProperties(
//                  "annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie",
//                  "ssplit.isOneSentence", "true",
//                  "parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
//                  "tokenize.language", "en"));

        String text = "Stanford University is located in Stanford which is one of best good universities in 2015.\n";
        Annotation doc = new Annotation(text);                                 // wrap the string in an Annotation
        pipeline.annotate(doc);
        /*  StanfordCoreNLP.annotate(Annotation) runs every configured annotator over the text;
         *  afterwards doc holds the annotated results.
         */
        int sentNo = 0;
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
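        // each sentence is a CoreMap: the key is an annotation class, and the value has whatever type that key declares
        for(CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)){            // for each sentence in doc
            System.out.println("Sentence #" + ++sentNo + ": " + sentence.get(TextAnnotation.class)); // print the original text of the sentence
            // next, read the per-token annotations of this sentence
            System.out.println("word\tpos\tlemma\tne");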
            // a CoreLabel is a CoreMap with additional token-specific methods
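            for(CoreLabel token : sentence.get(TokensAnnotation.class)){                       // for each token of the sentence (after tokenization)
                String word = token.get(TextAnnotation.class);                                 // the token text
                String lemma = token.get(LemmaAnnotation.class);                               // the token's lemma
                // The original listing is cut off at this point. The lines below are an assumed
                // minimal completion that fills the remaining header columns; they are not the author's code.
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + "\t" + pos + "\t" + lemma + "\t" + ne);
            }
        }
    }
}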
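
The comment in the listing describes a CoreMap as a map that uses class objects as keys and has values with custom types. To make that idea concrete without needing the CoreNLP jars, here is a minimal sketch of the same pattern in plain Java. Every name in it (`TypedMapDemo`, `TypedKey`, `TextKey`, `IndexKey`, `TypedMap`) is hypothetical and made up for illustration; it only mimics the shape of calls such as `token.get(TextAnnotation.class)` and is not how CoreNLP is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: none of these types come from CoreNLP.
final class TypedMapDemo {

    // A key type whose type parameter V fixes the type of the value stored under it.
    interface TypedKey<V> {}

    // Hypothetical keys, loosely analogous to annotation classes.
    static final class TextKey implements TypedKey<String> {}
    static final class IndexKey implements TypedKey<Integer> {}

    // A map keyed by Class objects; the key class determines the value type.
    static class TypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <V> void set(Class<? extends TypedKey<V>> key, V value) {
            values.put(key, value);
        }

        @SuppressWarnings("unchecked") // safe because set() only stores a V under a TypedKey<V> class
        <V> V get(Class<? extends TypedKey<V>> key) {
            return (V) values.get(key);
        }
    }

    public static void main(String[] args) {
        TypedMap token = new TypedMap();
        token.set(TextKey.class, "located");
        token.set(IndexKey.class, 4);

        // No casts at the call site: the compiler infers the value type from the key class.
        String text = token.get(TextKey.class);
        Integer index = token.get(IndexKey.class);
        System.out.println(text + " @ " + index); // prints "located @ 4"
    }
}
```

The point of the design is that one container can hold values of many different types while each lookup still comes back statically typed, which is why the listing can assign `token.get(...)` straight to a `String` without a cast.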