[Stanford CoreNLP] An Overview of the Annotators in the Stanford CoreNLP Pipeline (2021-02-10)

Official documentation (each annotator name there links to a more detailed page):
https://stanfordnlp.github.io/CoreNLP/annotators.html
CoreNLP version: 4.2.2

| annotator | Description | My understanding |
| --- | --- | --- |
| tokenize | Tokenizes the text. This splits the text into roughly “words”, using rules or methods suitable for the language being processed. Sometimes the tokens split up surface words in ways suitable for further NLP-processing, for example “isn’t” becomes “is” and “n’t”. The tokenizer saves the beginning and end character offsets of each token in the input text. | Putting tokenize in the pipeline splits the text into individual tokens. Each token holds its original text and character offsets; later annotators attach further fields such as the lemma. |
| cleanxml | Remove xml tokens from the document. May use them to mark sentence ends or to extract metadata. | Makes the pipeline ignore XML tags in the input text. This is a preprocessing step that may help with directly scraped HTML; I have not tried it in practice. |
| docdate | Allows user to specify dates for documents. | Lets you assign a fixed date to a document. |
| ssplit | Splits a sequence of tokens into sentences. | Sentence splitting. |
| pos | Labels tokens with their POS tag. For more details see this page. | Part-of-speech tagging. |
| lemma | Generates the word lemmas for all tokens in the corpus. | Generates lemmas, reducing each word to its base form. |
| ner | Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities. Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, such as ACE and MUC. Numerical entities are recognized using a rule-based system. Numerical entities that require normalization, e.g., dates, are normalized to NormalizedNamedEntityTagAnnotation. For more details on the CRF tagger see this page. Sub-annotators: docdate, regexner, tokensregex, entitymentions, and sutime | Tags entities. Each token gets an entity field: O (the letter) if the token is not part of an entity, otherwise one of many classes such as number, time, location, or person. Which classes are produced can be configured when building the pipeline. |
| entitymentions | Group NER tagged tokens together into mentions. Run as part of: ner | I have not used this yet. Judging by the description, it merges adjacent NER-tagged tokens into a single mention and runs as part of ner. |
| regexner | Implements a simple, rule-based NER over token sequences using Java regular expressions. The goal of this Annotator is to provide a simple framework to incorporate NE labels that are not annotated in traditional NL corpora. For example, the default list of regular expressions that we distribute in the models file recognizes ideologies (IDEOLOGY), nationalities (NATIONALITY), religions (RELIGION), and titles (TITLE). Here is a simple example of how to use RegexNER. For more complex applications, you might consider TokensRegex. | A simplified counterpart of tokensregex: define custom entity labels with regular-expression rules. |
| tokensregex | Runs a TokensRegex pipeline within a full NLP pipeline. | Define custom entity information with regular-expression patterns over tokens. |
| parse | Provides full syntactic analysis, using both the constituent and the dependency representations. The constituent-based output is saved in TreeAnnotation. We generate three dependency-based outputs, as follows: basic, uncollapsed dependencies, saved in BasicDependenciesAnnotation; collapsed dependencies saved in CollapsedDependenciesAnnotation; and collapsed dependencies with processed coordinations, in CollapsedCCProcessedDependenciesAnnotation. Most users of our parser will prefer the latter representation. For more details on the parser, please see this page. For more details about the dependencies, please refer to this page. | Syntactic parsing (constituency and dependencies). |
| depparse | Provides a fast syntactic dependency parser. We generate three dependency-based outputs, as follows: basic, uncollapsed dependencies, saved in BasicDependenciesAnnotation; collapsed dependencies saved in CollapsedDependenciesAnnotation; and collapsed dependencies with processed coordinations, in CollapsedCCProcessedDependenciesAnnotation. Most users of our parser will prefer the latter representation. For details about the dependency software, see this page. For more details about dependency parsing in general, see this page. | Dependency parsing. It is fast and works on a different principle from parse. |
| coref | Performs coreference resolution on a document, building links between entity mentions that refer to the same entity. Has a variety of modes, including rule-based, statistical, and neural. Sub-annotators: coref.mention | Coreference resolution; the newer system, introduced in 2015. |
| dcoref | Implements both pronominal and nominal coreference resolution. The entire coreference graph (with head words of mentions as nodes) is saved in CorefChainAnnotation. For more details on the underlying coreference resolution algorithm, see this page. | The pre-2015 coreference resolution system; not as good as coref. |
| relation | Stanford relation extractor is a Java implementation to find relations between two entities. The current relation extraction model is trained on the relation types (except the ‘kill’ relation) and data from the paper Roth and Yih, Global inference for entity and relation identification via a linear programming formulation, 2007, except instead of using the gold NER tags, we used the NER tags predicted by Stanford NER classifier to improve generalization. The default model predicts relations Live_In, Located_In, OrgBased_In, Work_For, and None. For more details of how to use and train your own model, see this page. | Finds the relation between two entities. |
| natlog | Marks quantifier scope and token polarity, according to natural logic semantics. Places an OperatorAnnotation on tokens which are quantifiers (or other natural logic operators), and a PolarityAnnotation on all tokens in the sentence. | |
| openie | Extract open-domain relation triples. System description in this paper | Produces triples consisting of the subject, the relation, and the object. |
| entitylink | Link entity mentions to Wikipedia entities | Links entities to Wikipedia. |
| kbp | Extracts (subject, relation, object) triples from sentences, using a combination of a statistical model, patterns over tokens, and patterns over dependencies. Extracts TAC-KBP relations. Details about models and rules can be found in our write up for the TAC-KBP 2016 competition. | Finds relation triples, using the TAC-KBP relation types. |
| quote | Deterministically picks out quotes delimited by “ or ‘ from a text. All top-level quotes are supplied by the top level annotation for a text. If a QuotationAnnotation corresponds to a quote that contains embedded quotes, these quotes will appear as embedded QuotationAnnotations that can be accessed from the QuotationAnnotation that they are embedded in. The QuoteAnnotator can handle multi-line and cross-paragraph quotes, but any embedded quotes must be delimited by a different kind of quotation mark than its parents. Does not depend on any other annotators. Support for unicode quotes is not yet present. Sub-annotators: quote.attribution | Extracts quotation marks and the quoted content. |
| quote.attribution | Attribute quotes to speakers in the document. Run as part of: quote | Extracts quotes and attributes them, for example to the person who said them. Added by default since version 3.9.1. |
| sentiment | Implements Socher et al’s sentiment model. Attaches a binarized tree of the sentence to the sentence level CoreMap. The nodes of the tree then contain the annotations from RNNCoreAnnotations indicating the predicted class and scores for that subtree. See the sentiment page for more information about this project. | Sentiment analysis. |
| truecase | Recognizes the true case of tokens in text where this information was lost, e.g., all upper case text. This is implemented with a discriminative model implemented using a CRF sequence tagger. The true case label, e.g., INIT_UPPER is saved in TrueCaseAnnotation. The token text adjusted to match its true case is saved as TrueCaseTextAnnotation. | https://stanfordnlp.github.io/CoreNLP/truecase.html explains this in detail. Roughly: adding this annotator before pos and ner lets entities be tagged more accurately. |
| udfeats | Labels tokens with their Universal Dependencies universal part of speech (UPOS) and features. | Provides UPOS tags. Currently English only, and future CoreNLP versions may not keep supporting it. |
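
The annotators above are chosen through the pipeline's `annotators` property, a comma-separated list in which each annotator should come after the ones it relies on. The following is a minimal sketch of assembling that value and checking the order. `build_annotators_property` is a hypothetical helper, not part of CoreNLP, and the prerequisite map is a simplified assumption for illustration; the official annotators page has the authoritative dependency list.

```python
# Hypothetical helper, not part of CoreNLP. The prerequisite map is a
# simplified assumption for illustration only; check the official
# annotators page for the real dependencies of each annotator.
ASSUMED_PREREQUISITES = {
    "tokenize": [],
    "ssplit": ["tokenize"],
    "pos": ["tokenize", "ssplit"],
    "lemma": ["tokenize", "ssplit", "pos"],
    "ner": ["tokenize", "ssplit", "pos", "lemma"],
    "depparse": ["tokenize", "ssplit", "pos"],
}


def build_annotators_property(annotators):
    """Return the comma-separated value for the `annotators` property.

    Raises ValueError if an annotator is listed before one it is assumed
    to need. Annotators absent from the map are accepted without checks.
    """
    seen = set()
    for name in annotators:
        required = ASSUMED_PREREQUISITES.get(name, [])
        missing = [p for p in required if p not in seen]
        if missing:
            raise ValueError(f"{name} is listed before: {', '.join(missing)}")
        seen.add(name)
    return ",".join(annotators)


if __name__ == "__main__":
    # The dictionary mirrors what would go into a pipeline properties file.
    props = {
        "annotators": build_annotators_property(
            ["tokenize", "ssplit", "pos", "lemma", "ner"]
        )
    }
    print(props)
```

The resulting value (`tokenize,ssplit,pos,lemma,ner`) is what would be set as `annotators` in a properties file or passed when constructing the pipeline. Heavier annotators such as parse or coref can be left out when their output is not needed.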