Coreference resolution is the task of finding all expressions in a text that refer to the same entity. The Stanford CoreNLP coreference resolution system is a state-of-the-art system for resolving coreference in text. To use the system, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. Sometimes, however, we use other tools for preprocessing, especially when working on a specific domain. In those cases, we need a standalone coreference resolution system. This post shows how to create such a system using Stanford CoreNLP.
Load properties
In general, we can just create an empty Properties object, because the Stanford CoreNLP tools automatically load the default properties from the model jar file, located under edu.stanford.nlp.pipeline.
In other cases, we want to use specific properties. The following code shows an example that loads a properties file from the classpath.
private static final String PROPS_SUFFIX = ".properties";

private Properties loadProperties(String name) {
  return loadProperties(name,
      Thread.currentThread().getContextClassLoader());
}

private Properties loadProperties(String name, ClassLoader loader) {
  if (name.endsWith(PROPS_SUFFIX))
    name = name.substring(0, name.length() - PROPS_SUFFIX.length());
  name = name.replace('.', '/');
  name += PROPS_SUFFIX;
  Properties result = null;
  // Returns null on lookup failures
  System.err.println("Searching for resource: " + name);
  InputStream in = loader.getResourceAsStream(name);
  try {
    if (in != null) {
      InputStreamReader reader = new InputStreamReader(in, "utf-8");
      result = new Properties();
      result.load(reader); // Can throw IOException
    }
  } catch (IOException e) {
    result = null;
  } finally {
    IOUtils.closeIgnoringExceptions(in);
  }
  return result;
}
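The name-to-resource-path normalization above (strip the ".properties" suffix if present, turn package-style dots into slashes, then re-append the suffix) can be sketched in isolation. The class and method below are hypothetical helpers that only mirror that string manipulation:

```java
public class ResourceNameDemo {
    private static final String PROPS_SUFFIX = ".properties";

    // Mirrors the normalization done in loadProperties (hypothetical helper)
    static String toResourcePath(String name) {
        if (name.endsWith(PROPS_SUFFIX))
            name = name.substring(0, name.length() - PROPS_SUFFIX.length());
        name = name.replace('.', '/');
        return name + PROPS_SUFFIX;
    }

    public static void main(String[] args) {
        // A dotted name becomes a slash-separated classpath resource path
        System.out.println(toResourcePath("edu.stanford.nlp.demo"));   // edu/stanford/nlp/demo.properties
        // An existing ".properties" suffix is not duplicated
        System.out.println(toResourcePath("coref.properties"));        // coref.properties
    }
}
```

Note that stripping the suffix before replacing dots is what keeps a name like "coref.properties" from being mangled into "coref/properties.properties".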
Initialize the system
Once we have the properties, we can initialize the coreference resolution system. For example,
try {
  corefSystem = new SieveCoreferenceSystem(new Properties());
  mentionExtractor = new MentionExtractor(corefSystem.dictionaries(),
      corefSystem.semantics());
} catch (Exception e) {
  System.err.println("ERROR: cannot create DeterministicCorefAnnotator!");
  e.printStackTrace();
  throw new RuntimeException(e);
}
Annotation
To feed the resolution system, we first need to understand the structure of an Annotation, which represents a span of text in a document. This is the trickiest part of this post because, as far as I know, there is no documentation that explains it in detail. The Annotation class itself is just an implementation of Map.
Basically, an annotation contains a sequence of sentences (each of which is another map). For each sentence, we need to provide the token sequence (a List of CoreLabel), the parse tree (Tree), and the dependency graph (SemanticGraph).
Annotation
  CoreAnnotations.SentencesAnnotation -> sentences
    CoreAnnotations.TokensAnnotation -> tokens
    TreeCoreAnnotations.TreeAnnotation -> Tree
    SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation -> SemanticGraph
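Since an Annotation is essentially a map keyed by annotation classes, the nesting above can be imitated with plain collections. The sketch below uses a java.util.HashMap with Class keys; SentencesKey and TokensKey are hypothetical stand-ins for the real annotation key classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnnotationSketch {
    // Hypothetical stand-ins for CoreAnnotations.SentencesAnnotation etc.
    interface SentencesKey {}
    interface TokensKey {}

    // Builds a toy "annotation": a map from key classes to values,
    // holding one sentence map that in turn holds its token list
    static Map<Class<?>, Object> buildAnnotation() {
        Map<Class<?>, Object> sentence = new HashMap<>();
        sentence.put(TokensKey.class, List.of("Stanford", "CoreNLP"));

        List<Map<Class<?>, Object>> sentences = new ArrayList<>();
        sentences.add(sentence);

        Map<Class<?>, Object> annotation = new HashMap<>();
        annotation.put(SentencesKey.class, sentences);
        return annotation;
    }

    public static void main(String[] args) {
        Map<Class<?>, Object> ann = buildAnnotation();
        // Values are looked up by key class, mirroring
        // annotation.get(CoreAnnotations.SentencesAnnotation.class)
        List<?> sentences = (List<?>) ann.get(SentencesKey.class);
        Map<?, ?> first = (Map<?, ?>) sentences.get(0);
        System.out.println(first.get(TokensKey.class)); // prints [Stanford, CoreNLP]
    }
}
```

The real Annotation adds type safety on top of this idea: each key class declares its value type, so get() needs no casts.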
Tokens
The sequence of tokens represents the text of one sentence. Each token is an instance of CoreLabel, which stores the word, tag (part of speech), lemma, named entity, normalized named entity, and so on.
List<CoreLabel> tokens = new ArrayList<>();
for (int i = 0; i < n; i++) {
  // create a token
  CoreLabel token = new CoreLabel();
  token.setWord(word);
  token.setTag(tag);
  token.setNer(ner);
  ...
  tokens.add(token);
}
ann.set(TokensAnnotation.class, tokens);
Parse tree
A parse tree is an instance of Tree. If your trees are in Penn Treebank style, the Stanford CoreNLP tools provide an easy way to parse that format.
Tree tree = Tree.valueOf(getText());
ann.set(TreeAnnotation.class, tree);
Semantic graph
A semantic graph can be created from the typed dependencies of the tree, which are derived by rule. However, the code is not that straightforward.
GrammaticalStructureFactory grammaticalStructureFactory =
    new EnglishGrammaticalStructureFactory();
GrammaticalStructure gs = grammaticalStructureFactory
    .newGrammaticalStructure(tree);
SemanticGraph semanticGraph =
    new SemanticGraph(gs.typedDependenciesCollapsed());
Note that Stanford CoreNLP provides several kinds of dependencies. Among them, the coreference system requires the "collapsed dependencies", so to set the annotation you can write
ann.set(
    CollapsedDependenciesAnnotation.class,
    new SemanticGraph(gs.typedDependenciesCollapsed()));
Resolve coreference
Finally, you can feed the annotation to the system. The following code is an example. It is a bit long, but easy to understand.
private void annotate(Annotation annotation) {
  try {
    List<Tree> trees = new ArrayList<Tree>();
    List<List<CoreLabel>> sentences = new ArrayList<List<CoreLabel>>();
    // extract trees and sentence words
    // we are only supporting the new annotation standard for this Annotator!
    if (annotation.containsKey(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap sentence : annotation
          .get(CoreAnnotations.SentencesAnnotation.class)) {
        List<CoreLabel> tokens = sentence
            .get(CoreAnnotations.TokensAnnotation.class);
        sentences.add(tokens);
        Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
        trees.add(tree);
        MentionExtractor.mergeLabels(tree, tokens);
        MentionExtractor.initializeUtterance(tokens);
      }
    } else {
      System.err
          .println("ERROR: this coreference resolution system requires SentencesAnnotation!");
      return;
    }
    // extract all possible mentions
    // this is created for each new annotation because it is not threadsafe
    RuleBasedCorefMentionFinder finder = new RuleBasedCorefMentionFinder();
    List<List<Mention>> allUnprocessedMentions = finder
        .extractPredictedMentions(annotation, 0, corefSystem.dictionaries());
    // add the relevant info to mentions and order them for coref
    Document document = mentionExtractor.arrange(
        annotation, sentences, trees, allUnprocessedMentions);
    List<List<Mention>> orderedMentions = document.getOrderedMentions();
    if (VERBOSE) {
      for (int i = 0; i < orderedMentions.size(); i++) {
        System.err.printf("Mentions in sentence #%d:\n", i);
        for (int j = 0; j < orderedMentions.get(i).size(); j++) {
          System.err.println("\tMention #" + j + ": "
              + orderedMentions.get(i).get(j).spanToString());
        }
      }
    }
    Map<Integer, CorefChain> result = corefSystem.coref(document);
    annotation.set(CorefCoreAnnotations.CorefChainAnnotation.class, result);
    // for backward compatibility
    if (OLD_FORMAT) {
      List<Pair<IntTuple, IntTuple>> links = SieveCoreferenceSystem
          .getLinks(result);
      if (VERBOSE) {
        System.err.printf("Found %d coreference links:\n", links.size());
        for (Pair<IntTuple, IntTuple> link : links) {
          System.err.printf("LINK (%d, %d) -> (%d, %d)\n",
              link.first.get(0), link.first.get(1),
              link.second.get(0), link.second.get(1));
        }
      }
      // save the coref output as CorefGraphAnnotation -- the raw links
      // found by the coref system
      List<Pair<IntTuple, IntTuple>> graph =
          new ArrayList<Pair<IntTuple, IntTuple>>();
      for (Pair<IntTuple, IntTuple> link : links) {
        // Note: all offsets in the graph start at 1 (not at 0!)
        // we do this for consistency reasons, as indices for syntactic
        // dependencies start at 1
        int srcSent = link.first.get(0);
        int srcTok = orderedMentions.get(srcSent - 1).get(
            link.first.get(1) - 1).headIndex + 1;
        int dstSent = link.second.get(0);
        int dstTok = orderedMentions.get(dstSent - 1).get(
            link.second.get(1) - 1).headIndex + 1;
        IntTuple src = new IntTuple(2);
        src.set(0, srcSent);
        src.set(1, srcTok);
        IntTuple dst = new IntTuple(2);
        dst.set(0, dstSent);
        dst.set(1, dstTok);
        graph.add(new Pair<IntTuple, IntTuple>(src, dst));
      }
      annotation.set(CorefCoreAnnotations.CorefGraphAnnotation.class, graph);
      for (CorefChain corefChain : result.values()) {
        if (corefChain.getMentionsInTextualOrder().size() < 2)
          continue;
        Set<CoreLabel> coreferentTokens = Generics.newHashSet();
        for (CorefMention mention : corefChain.getMentionsInTextualOrder()) {
          CoreMap sentence = annotation.get(
              CoreAnnotations.SentencesAnnotation.class).get(mention.sentNum - 1);
          CoreLabel token = sentence.get(
              CoreAnnotations.TokensAnnotation.class).get(mention.headIndex - 1);
          coreferentTokens.add(token);
        }
        for (CoreLabel token : coreferentTokens) {
          token.set(CorefCoreAnnotations.CorefClusterAnnotation.class,
              coreferentTokens);
        }
      }
    }
  } catch (RuntimeException e) {
    throw e;
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}
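The 1-based link-to-head-token conversion in the OLD_FORMAT loop above can be illustrated in isolation. The array below is a hypothetical stand-in for orderedMentions and the mentions' headIndex fields:

```java
public class OffsetDemo {
    // headIndexes[sent][mention] stands in for
    // orderedMentions.get(sent).get(mention).headIndex
    // (the 0-based token index of a mention's head word)
    static int[][] headIndexes = {
        {0, 3},   // sentence 0: two mentions, head tokens 0 and 3
        {1}       // sentence 1: one mention, head token 1
    };

    // A coref link endpoint is (sentence, mentionIndex), both starting at 1;
    // convert it to a (sentence, headToken) pair that is also 1-based,
    // matching the convention used for syntactic dependencies
    static int[] toHeadOffset(int sent, int mention) {
        int tok = headIndexes[sent - 1][mention - 1] + 1;
        return new int[] { sent, tok };
    }

    public static void main(String[] args) {
        int[] p = toHeadOffset(1, 2); // second mention of the first sentence
        System.out.println(p[0] + "," + p[1]); // prints 1,4
    }
}
```

The subtraction by 1 moves from the 1-based link coordinates into the 0-based Java lists, and the final +1 moves the head token index back into the 1-based convention.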
Translated from: https://www.javacodegeeks.com/2015/02/resolve-coreference-using-stanford-corenlp.html