Coreference resolution is the task of finding all expressions in a text that refer to the same entity. The Stanford CoreNLP coreference resolution system is a state-of-the-art system for resolving coreference in text. To use the system, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. Sometimes, however, we use other tools for preprocessing, especially when working on a specific domain. In those cases, we need a standalone coreference resolution system. This post shows how to create such a system using Stanford CoreNLP.
Load properties
In general, we can just create an empty Properties object, because the Stanford CoreNLP tools automatically load the default properties from the model jar file, located under edu.stanford.nlp.pipeline.
In other cases, we want to use specific properties. The following code shows an example that loads a properties file from the classpath.
private static final String PROPS_SUFFIX = ".properties";

private Properties loadProperties(String name) {
  return loadProperties(name,
      Thread.currentThread().getContextClassLoader());
}

private Properties loadProperties(String name, ClassLoader loader) {
  if (name.endsWith(PROPS_SUFFIX))
    name = name.substring(0, name.length() - PROPS_SUFFIX.length());
  name = name.replace('.', '/');
  name += PROPS_SUFFIX;
  Properties result = null;
  // Returns null on lookup failures
  System.err.println("Searching for resource: " + name);
  InputStream in = loader.getResourceAsStream(name);
  try {
    if (in != null) {
      InputStreamReader reader = new InputStreamReader(in, "utf-8");
      result = new Properties();
      result.load(reader); // Can throw IOException
    }
  } catch (IOException e) {
    result = null;
  } finally {
    IOUtils.closeIgnoringExceptions(in);
  }
  return result;
}
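The name-to-resource-path normalization above (strip the ".properties" suffix if present, turn package-style dots into slashes, then re-append the suffix) can be sketched in isolation. The class and method below are hypothetical helpers that only mirror that string manipulation:

```java
public class ResourceNameDemo {
    private static final String PROPS_SUFFIX = ".properties";

    // Mirrors the normalization done in loadProperties (hypothetical helper)
    static String toResourcePath(String name) {
        if (name.endsWith(PROPS_SUFFIX))
            name = name.substring(0, name.length() - PROPS_SUFFIX.length());
        name = name.replace('.', '/');
        return name + PROPS_SUFFIX;
    }

    public static void main(String[] args) {
        // A dotted name becomes a slash-separated classpath resource path
        System.out.println(toResourcePath("edu.stanford.nlp.demo"));   // edu/stanford/nlp/demo.properties
        // An existing ".properties" suffix is not duplicated
        System.out.println(toResourcePath("coref.properties"));        // coref.properties
    }
}
```

Note that stripping the suffix before replacing dots is what keeps a name like "coref.properties" from being mangled into "coref/properties.properties".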
Initialize the system
Once we have the properties, we can initialize the coreference resolution system. For example,
try {
  corefSystem = new SieveCoreferenceSystem(new Properties());
  mentionExtractor = new MentionExtractor(corefSystem.dictionaries(),
      corefSystem.semantics());
} catch (Exception e) {
  System.err.println("ERROR: cannot create DeterministicCorefAnnotator!");
  e.printStackTrace();
  throw new RuntimeException(e);
}
Annotation
To feed the resolution system, we first need to understand the structure of an Annotation, which represents a span of text in a document. This is the trickiest part of this post because, as far as I know, there is no documentation that explains it in detail. The Annotation class itself is just an implementation of Map.
Basically, an annotation contains a sequence of sentences (each of which is another map). For each sentence, we need to provide the token sequence (a List of CoreLabel), the parse tree (Tree), and the dependency graph (SemanticGraph).
Annotation
  CoreAnnotations.SentencesAnnotation -> sentences
    CoreAnnotations.TokensAnnotation -> tokens
    TreeCoreAnnotations.TreeAnnotation -> Tree
    SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation -> SemanticGraph
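Since an Annotation is essentially a map keyed by annotation classes, the nesting above can be imitated with plain collections. The sketch below uses a java.util.HashMap with Class keys; SentencesKey and TokensKey are hypothetical stand-ins for the real annotation key classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnnotationSketch {
    // Hypothetical stand-ins for CoreAnnotations.SentencesAnnotation etc.
    interface SentencesKey {}
    interface TokensKey {}

    // Builds a toy "annotation": a map from key classes to values,
    // holding one sentence map that in turn holds its token list
    static Map<Class<?>, Object> buildAnnotation() {
        Map<Class<?>, Object> sentence = new HashMap<>();
        sentence.put(TokensKey.class, List.of("Stanford", "CoreNLP"));

        List<Map<Class<?>, Object>> sentences = new ArrayList<>();
        sentences.add(sentence);

        Map<Class<?>, Object> annotation = new HashMap<>();
        annotation.put(SentencesKey.class, sentences);
        return annotation;
    }

    public static void main(String[] args) {
        Map<Class<?>, Object> ann = buildAnnotation();
        // Values are looked up by key class, mirroring
        // annotation.get(CoreAnnotations.SentencesAnnotation.class)
        List<?> sentences = (List<?>) ann.get(SentencesKey.class);
        Map<?, ?> first = (Map<?, ?>) sentences.get(0);
        System.out.println(first.get(TokensKey.class)); // prints [Stanford, CoreNLP]
    }
}
```

The real Annotation adds type safety on top of this idea: each key class declares its value type, so get() needs no casts.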
Tokens
The sequence of tokens represents the text of one sentence. Each token is an instance of CoreLabel, which stores the word, tag (part of speech), lemma, named entity, normalized named entity, and so on.
List<CoreLabel> tokens = new ArrayList<>();
for (int i = 0; i < n; i++) {
  // create a token
  CoreLabel token = new CoreLabel();
  token.setWord(word);
  token.setTag(tag);
  token.setNer(ner);
  ...
  tokens.add(token);
}
ann.set(TokensAnnotation.class, tokens);
Parse tree
A parse tree is an instance of Tree. If your trees are in Penn Treebank style, the Stanford CoreNLP tools provide an easy way to parse that format.
Tree tree = Tree.valueOf(getText());
ann.set(TreeAnnotation.class, tree);
Semantic graph
A semantic graph can be created from the typed dependencies of the tree, which are derived by rule. However, the code is not that straightforward.
GrammaticalStructureFactory grammaticalStructureFactory =
    new EnglishGrammaticalStructureFactory();
GrammaticalStructure gs = grammaticalStructureFactory
    .newGrammaticalStructure(tree);
SemanticGraph semanticGraph =
    new SemanticGraph(gs.typedDependenciesCollapsed());
Note that Stanford CoreNLP provides several kinds of dependencies. Among them, the coreference system requires the "collapsed dependencies", so to set the annotation you can write
ann.set(
    CollapsedDependenciesAnnotation.class,
    new SemanticGraph(gs.typedDependenciesCollapsed()));
Resolve coreference
Finally, you can feed the annotation to the system. The following code is an example. It is a bit long, but easy to understand.
private void annotate(Annotation annotation) {
  try {
    List<Tree> trees = new ArrayList<Tree>();
    List<List<CoreLabel>> sentences = new ArrayList<List<CoreLabel>>();
    // extract trees and sentence words
    // we are only supporting the new annotation standard for this Annotator!
    if (annotation.containsKey(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap sentence : annotation
          .get(CoreAnnotations.SentencesAnnotation.class)) {
        List<CoreLabel> tokens = sentence
            .get(CoreAnnotations.TokensAnnotation.class);
        sentences.add(tokens);
        Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
        trees.add(tree);
        MentionExtractor.mergeLabels(tree, tokens);
        MentionExtractor.initializeUtterance(tokens);
      }
    } else {
      System.err
          .println("ERROR: this coreference resolution system requires SentencesAnnotation!");
      return;
    }
    // extract all possible mentions
    // this is created for each new annotation because it is not threadsafe
    RuleBasedCorefMentionFinder finder = new RuleBasedCorefMentionFinder();
    List<List<Mention>> allUnprocessedMentions = finder
        .extractPredictedMentions(annotation, 0, corefSystem.dictionaries());
    // add the relevant info to mentions and order them for coref
    Document document = mentionExtractor.arrange(
        annotation, sentences, trees, allUnprocessedMentions);
    List<List<Mention>> orderedMentions = document.getOrderedMentions();
    if (VERBOSE) {
      for (int i = 0; i < orderedMentions.size(); i++) {
        System.err.printf("Mentions in sentence #%d:\n", i);
        for (int j = 0; j < orderedMentions.get(i).size(); j++) {
          System.err.println("\tMention #" + j + ": "
              + orderedMentions.get(i).get(j).spanToString());
        }
      }
    }
    Map<Integer, CorefChain> result = corefSystem.coref(document);
    annotation.set(CorefCoreAnnotations.CorefChainAnnotation.class, result);
    // for backward compatibility
    if (OLD_FORMAT) {
      List<Pair<IntTuple, IntTuple>> links = SieveCoreferenceSystem
          .getLinks(result);
      if (VERBOSE) {
        System.err.printf("Found %d coreference links:\n", links.size());
        for (Pair<IntTuple, IntTuple> link : links) {
          System.err.printf("LINK (%d, %d) -> (%d, %d)\n",
              link.first.get(0), link.first.get(1),
              link.second.get(0), link.second.get(1));
        }
      }
      // save the coref output as CorefGraphAnnotation -- the raw links
      // found by the coref system
      List<Pair<IntTuple, IntTuple>> graph =
          new ArrayList<Pair<IntTuple, IntTuple>>();
      for (Pair<IntTuple, IntTuple> link : links) {
        // Note: all offsets in the graph start at 1 (not at 0!)
        // we do this for consistency reasons, as indices for syntactic
        // dependencies start at 1
        int srcSent = link.first.get(0);
        int srcTok = orderedMentions.get(srcSent - 1).get(
            link.first.get(1) - 1).headIndex + 1;
        int dstSent = link.second.get(0);
        int dstTok = orderedMentions.get(dstSent - 1).get(
            link.second.get(1) - 1).headIndex + 1;
        IntTuple src = new IntTuple(2);
        src.set(0, srcSent);
        src.set(1, srcTok);
        IntTuple dst = new IntTuple(2);
        dst.set(0, dstSent);
        dst.set(1, dstTok);
        graph.add(new Pair<IntTuple, IntTuple>(src, dst));
      }
      annotation.set(CorefCoreAnnotations.CorefGraphAnnotation.class, graph);
      for (CorefChain corefChain : result.values()) {
        if (corefChain.getMentionsInTextualOrder().size() < 2)
          continue;
        Set<CoreLabel> coreferentTokens = Generics.newHashSet();
        for (CorefMention mention : corefChain.getMentionsInTextualOrder()) {
          CoreMap sentence = annotation.get(
              CoreAnnotations.SentencesAnnotation.class).get(mention.sentNum - 1);
          CoreLabel token = sentence.get(
              CoreAnnotations.TokensAnnotation.class).get(mention.headIndex - 1);
          coreferentTokens.add(token);
        }
        for (CoreLabel token : coreferentTokens) {
          token.set(CorefCoreAnnotations.CorefClusterAnnotation.class,
              coreferentTokens);
        }
      }
    }
  } catch (RuntimeException e) {
    throw e;
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}
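The 1-based link-to-head-token conversion in the OLD_FORMAT loop above can be illustrated in isolation. The array below is a hypothetical stand-in for orderedMentions and the mentions' headIndex fields:

```java
public class OffsetDemo {
    // headIndexes[sent][mention] stands in for
    // orderedMentions.get(sent).get(mention).headIndex
    // (the 0-based token index of a mention's head word)
    static int[][] headIndexes = {
        {0, 3},   // sentence 0: two mentions, head tokens 0 and 3
        {1}       // sentence 1: one mention, head token 1
    };

    // A coref link endpoint is (sentence, mentionIndex), both starting at 1;
    // convert it to a (sentence, headToken) pair that is also 1-based,
    // matching the convention used for syntactic dependencies
    static int[] toHeadOffset(int sent, int mention) {
        int tok = headIndexes[sent - 1][mention - 1] + 1;
        return new int[] { sent, tok };
    }

    public static void main(String[] args) {
        int[] p = toHeadOffset(1, 2); // second mention of the first sentence
        System.out.println(p[0] + "," + p[1]); // prints 1,4
    }
}
```

The subtraction by 1 moves from the 1-based link coordinates into the 0-based Java lists, and the final +1 moves the head token index back into the 1-based convention.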
Translated from: https://www.javacodegeeks.com/2015/02/resolve-coreference-using-stanford-corenlp.html