Java named entity extraction: how to extract named entities plus verbs from text


白澤 SaMa


Tags: Java named entity extraction

My aim is to extract each named entity (NE) of type Person, together with the verbs connected to it, from a text. For example, I have this text:

Dumbledore turned and walked back down the street. Harry Potter rolled over inside his blankets without waking up.

Ideally, I should get:

Dumbledore turned walked; Harry Potter rolled

I use Stanford NER to find and mark persons, then I delete all sentences that don't contain an NE. So in the end I have a 'pure' text that consists only of sentences with names of characters.

After that I use Stanford Dependencies. As the result I get smth like this (CONLLU output-format):

1 Dumbledore _ _ NN _ 2 nsubj _ _
2 turned _ _ VBD _ 0 root _ _
3 and _ _ CC _ 2 cc _ _
4 walked _ _ VBD _ 2 conj _ _
5 back _ _ RB _ 4 advmod _ _
6 down _ _ IN _ 8 case _ _
7 the _ _ DT _ 8 det _ _
8 street _ _ NN _ 4 nmod _ _
9 . _ _ . _ 2 punct _ _

1 Harry _ _ NNP _ 2 compound _ _
2 Potter _ _ NNP _ 3 nsubj _ _
3 rolled _ _ VBD _ 0 root _ _
4 over _ _ IN _ 3 compound:prt _ _
5 inside _ _ IN _ 7 case _ _
6 his _ _ PRP$ _ 7 nmod:poss _ _
7 blankets _ _ NNS _ 3 nmod _ _
8 without _ _ IN _ 9 mark _ _
9 waking _ _ VBG _ 3 advcl _ _
10 up _ _ RP _ 9 compound:prt _ _
11 . _ _ . _ 3 punct _ _

And that's where all my problems start. I know the person and the verb, but I have no idea how to extract them from this format.

I guess I can do it this way: find an NN/NNP token in the table, find its 'parent' (the head it points to), and then extract all of that parent's 'child' words. Theoretically it should work. Theoretically.
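The table-walking idea above can be sketched in plain Java without any NLP library. This is only a minimal sketch under several assumptions: rows have the whitespace-separated column layout shown in the question (ID, FORM, two placeholders, POS, placeholder, HEAD, DEPREL, ...), a token with ID 1 starts a new sentence, and a "person" is approximated as any NN/NNP token with an nsubj relation (a real pipeline would also check the NER tag). The class name ConlluSubjectVerbs and its methods are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ConlluSubjectVerbs {

    // The rows from the question, one token per line (demo data only).
    static final String SAMPLE =
        "1 Dumbledore _ _ NN _ 2 nsubj _ _\n" +
        "2 turned _ _ VBD _ 0 root _ _\n" +
        "3 and _ _ CC _ 2 cc _ _\n" +
        "4 walked _ _ VBD _ 2 conj _ _\n" +
        "5 back _ _ RB _ 4 advmod _ _\n" +
        "6 down _ _ IN _ 8 case _ _\n" +
        "7 the _ _ DT _ 8 det _ _\n" +
        "8 street _ _ NN _ 4 nmod _ _\n" +
        "9 . _ _ . _ 2 punct _ _\n" +
        "1 Harry _ _ NNP _ 2 compound _ _\n" +
        "2 Potter _ _ NNP _ 3 nsubj _ _\n" +
        "3 rolled _ _ VBD _ 0 root _ _\n" +
        "4 over _ _ IN _ 3 compound:prt _ _\n" +
        "5 inside _ _ IN _ 7 case _ _\n" +
        "6 his _ _ PRP$ _ 7 nmod:poss _ _\n" +
        "7 blankets _ _ NNS _ 3 nmod _ _\n" +
        "8 without _ _ IN _ 9 mark _ _\n" +
        "9 waking _ _ VBG _ 3 advcl _ _\n" +
        "10 up _ _ RP _ 9 compound:prt _ _\n" +
        "11 . _ _ . _ 3 punct _ _\n";

    /** One table row: token id, word form, POS tag, head id, dependency relation. */
    static final class Token {
        final int id; final String form; final String pos; final int head; final String rel;
        Token(String[] c) {
            id = Integer.parseInt(c[0]); form = c[1]; pos = c[4];
            head = Integer.parseInt(c[6]); rel = c[7];
        }
    }

    /** Splits the table into sentences and collects "subject verb [verb ...]" strings. */
    static List<String> extract(String conllu) {
        List<String> results = new ArrayList<>();
        List<Token> sentence = new ArrayList<>();
        for (String line : conllu.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            // Skip blank lines, comments, and rows without plain integer ID/HEAD columns.
            if (cols.length < 8 || !cols[0].matches("\\d+") || !cols[6].matches("\\d+")) continue;
            Token t = new Token(cols);
            if (t.id == 1 && !sentence.isEmpty()) {   // assumption: ID 1 starts a new sentence
                results.addAll(pairs(sentence));
                sentence = new ArrayList<>();
            }
            sentence.add(t);
        }
        if (!sentence.isEmpty()) results.addAll(pairs(sentence));
        return results;
    }

    /** For every nominal subject: its compound parts, its head verb, and verbs conjoined to that head. */
    static List<String> pairs(List<Token> s) {
        List<String> out = new ArrayList<>();
        for (Token subj : s) {
            if (!subj.rel.equals("nsubj") || !subj.pos.startsWith("NN")) continue;
            if (subj.head < 1 || subj.head > s.size()) continue;
            Token verb = s.get(subj.head - 1);        // ids are 1-based and assumed contiguous
            if (!verb.pos.startsWith("VB")) continue;
            StringBuilder sb = new StringBuilder();
            for (Token t : s) {                       // e.g. "Harry" is a compound child of "Potter"
                if (t.head == subj.id && t.rel.equals("compound")) sb.append(t.form).append(' ');
            }
            sb.append(subj.form).append(' ').append(verb.form);
            for (Token t : s) {                       // e.g. "walked" is a conj child of "turned"
                if (t.head == verb.id && t.rel.equals("conj") && t.pos.startsWith("VB")) {
                    sb.append(' ').append(t.form);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: Dumbledore turned walked; Harry Potter rolled
        System.out.println(String.join("; ", extract(SAMPLE)));
    }
}
```

Note that this sketch goes from the subject up to its head and then only follows conj children of that head, rather than collecting every child of the verb, since most children (adverbs, prepositions, punctuation) are not actions.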

My question is whether anyone can suggest another way to get a person and its action from the text, or whether there is a more rational way to do it.

I'll be very grateful for any help!

Solution

Here is some sample code to help with your problem:

import java.io.*;
import java.util.*;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.util.*;

public class NERAndVerbExample {

  public static void main(String[] args) throws IOException {
    // Build a pipeline that runs NER, dependency parsing and entity mention detection
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "John Smith went to the store.";
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    System.out.println("---");
    System.out.println("text: " + text);
    System.out.println("");
    System.out.println("dependency edges:");
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
      // One line per edge: governor word,index,POS,NER - relation -> dependent word,index,POS,NER
      for (SemanticGraphEdge sge : sg.edgeListSorted()) {
        System.out.println(
            sge.getGovernor().word() + "," + sge.getGovernor().index() + "," + sge.getGovernor().tag() + ","
            + sge.getGovernor().ner()
            + " - " + sge.getRelation().getLongName()
            + " -> "
            + sge.getDependent().word() + ","
            + sge.getDependent().index() + "," + sge.getDependent().tag() + "," + sge.getDependent().ner());
      }
      System.out.println();
      System.out.println("entity mentions:");
      // One line per mention: mention text, index of its last token, NER tag
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        int lastTokenIndex = entityMention.get(CoreAnnotations.TokensAnnotation.class).size() - 1;
        System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class)
            + "\t"
            + entityMention.get(CoreAnnotations.TokensAnnotation.class)
                .get(lastTokenIndex).get(CoreAnnotations.IndexAnnotation.class)
            + "\t"
            + entityMention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
      }
    }
  }
}

I'm hoping to add some syntactic sugar to Stanford CoreNLP 3.8.0 to assist with working with the entity mentions.

To explain this code a bit, basically the entitymentions annotator goes through and groups tokens with the same NER tag together. So "John Smith" gets marked as an entity mention.
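To make the grouping idea concrete, here is a simplified, library-free illustration. It is not the entitymentions annotator's actual implementation, which may apply further rules; it only merges runs of consecutive tokens that share the same non-"O" tag. The class name NerGrouping and the hand-written tags in main are assumptions for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NerGrouping {

    /** Merges consecutive tokens with the same non-"O" tag into "text<TAB>tag" strings. */
    static List<String> group(String[] words, String[] tags) {
        List<String> mentions = new ArrayList<>();
        int start = 0;                                   // first token of the current run
        for (int i = 1; i <= words.length; i++) {
            boolean runEnds = i == words.length || !tags[i].equals(tags[start]);
            if (runEnds) {
                if (!tags[start].equals("O")) {          // "O" marks tokens outside any entity
                    String text = String.join(" ", Arrays.copyOfRange(words, start, i));
                    mentions.add(text + "\t" + tags[start]);
                }
                start = i;
            }
        }
        return mentions;
    }

    public static void main(String[] args) {
        // Hand-written tags for the sample sentence, not real NER output.
        String[] words = {"John", "Smith", "went", "to", "the", "store", "."};
        String[] tags  = {"PERSON", "PERSON", "O", "O", "O", "O", "O"};
        for (String mention : group(words, tags)) {
            System.out.println(mention);
        }
    }
}
```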

If you go through the dependency graph, you can get the index of each word.

Likewise if you access the list of tokens for an entity mention, you can also find the index of each word for the entity mention.

With a little more code you can link those together and form entity mention verb pairs as you were requesting.
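One possible way to do that linking is sketched below, again without the CoreNLP dependency so it runs on its own. The Edge and Mention classes are hypothetical stand-ins for the values the sample code prints: in a real program they would be filled from sge.getGovernor().word(), index() and tag(), sge.getRelation().getShortName(), sge.getDependent().index(), and from each entity mention's text, last token index and NER tag. The sketch assumes the mention's last token is the one carrying the nsubj relation, as in "John Smith", and it does not follow coordinated verbs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MentionVerbLinker {

    /** Stand-in for one dependency edge (hypothetical type, not a CoreNLP class). */
    static final class Edge {
        final String govWord; final int govIndex; final String govTag;
        final String relation; final int depIndex;
        Edge(String govWord, int govIndex, String govTag, String relation, int depIndex) {
            this.govWord = govWord; this.govIndex = govIndex; this.govTag = govTag;
            this.relation = relation; this.depIndex = depIndex;
        }
    }

    /** Stand-in for one entity mention (hypothetical type, not a CoreNLP class). */
    static final class Mention {
        final String text; final int lastTokenIndex; final String nerTag;
        Mention(String text, int lastTokenIndex, String nerTag) {
            this.text = text; this.lastTokenIndex = lastTokenIndex; this.nerTag = nerTag;
        }
    }

    /** Pairs each PERSON mention with the verb that governs its last token via nsubj. */
    static List<String> link(List<Mention> mentions, List<Edge> edges) {
        List<String> pairs = new ArrayList<>();
        for (Mention m : mentions) {
            if (!"PERSON".equals(m.nerTag)) continue;
            for (Edge e : edges) {
                boolean subjectOfVerb = e.depIndex == m.lastTokenIndex
                        && e.relation.equals("nsubj")
                        && e.govTag.startsWith("VB");
                if (subjectOfVerb) pairs.add(m.text + " " + e.govWord);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hand-written edges and mention for "John Smith went to the store.", not real parser output.
        List<Edge> edges = Arrays.asList(
            new Edge("went", 3, "VBD", "nsubj", 2),
            new Edge("Smith", 2, "NNP", "compound", 1),
            new Edge("went", 3, "VBD", "punct", 7));
        List<Mention> mentions = Arrays.asList(new Mention("John Smith", 2, "PERSON"));
        System.out.println(link(mentions, edges));   // prints: [John Smith went]
    }
}
```

Matching on token indexes rather than on word strings avoids confusing two occurrences of the same word within one sentence.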

As you can see in the current code it is quite cumbersome to access info for an entity mention, so I am going to try to improve that in 3.8.0.
