Using a Parser to Extract Relationships

Parsing

Parsing is the process of creating a parse tree for a textual unit, such as a sentence.

A parse tree is a hierarchical data structure that represents the syntactic structure of a sentence.

Parsing is used for many tasks, including:

  • Machine translation of languages
  • Synthesizing speech from text
  • Speech recognition
  • Grammar checking
  • Information extraction

Coreference occurs when two or more expressions in a text refer to the same person or thing; coreference resolution is the task of finding those expressions.

Relationship types

An interesting site that contains a multitude of relationships is Freebase (https://www.freebase.com/), a database of people, places, and things organized by category. (Freebase has since been shut down; its data was migrated to Wikidata.) The WordNet thesaurus (http://wordnet.princeton.edu/) also contains a number of relationships.

Relationship      Example
--------------    ----------------------------------------
Personal          father-of, sister-of, girlfriend-of
Organizational    subsidiary-of, subcommittee-of
Spatial           near-to, northeast-of, under
Physical          part-of, composed-of
Interactions      bonds-with, associates-with, reacts-with

Two types of parsing:

  • Dependency: This focuses on the relationship between words
  • Phrase structure: This deals with phrases and their recursive structure

Dependency parses use labels such as subject, determiner, and preposition to identify
relationships between words.

Parsing techniques include shift-reduce, spanning tree, and cascaded chunking.

Understanding parse trees

Parse trees represent hierarchical relationships between elements of text. For example, a dependency tree shows the relationship between the grammatical elements of a sentence.

(ROOT
    (S
    (NP (DT The) (NN cow))
    (VP (VBD jumped)
        (PP (IN over)
            (NP (DT the) (NN moon))))
    (. .)))

Using extracted relationships

Think about it: the underlying principle of the knowledge graphs that academia and industry are currently focused on should be a model of the relationships within a language, and between a language and the background knowledge it connects to.

Relationships extracted can be used for a number of purposes including:

  • Building knowledge bases
  • Creating directories
  • Product searches
  • Patent analysis
  • Stock analysis
  • Intelligence analysis

There are many databases built from Wikipedia that extract relationships and information,
such as:
- Resource Description Framework (RDF): This uses triples such as Yosemite-location-California, where location is the relation. It can be found at http://www.w3.org/RDF/.
- DBPedia: This holds over one billion triples and is an example of a knowledge base created from Wikipedia. This can be found at http://dbpedia.org/About.
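As a minimal illustration of the triple idea (the class name and fields here are my own, not part of any RDF library), the Yosemite-location-California example can be modeled as a simple value class:

```java
// A minimal subject-predicate-object triple, mirroring the
// Yosemite-location-California example above.
public class Triple {
    public final String subject;
    public final String predicate;
    public final String object;

    public Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public String toString() {
        // Render as subject-predicate-object
        return subject + "-" + predicate + "-" + object;
    }

    public static void main(String[] args) {
        Triple t = new Triple("Yosemite", "location", "California");
        System.out.println(t); // Yosemite-location-California
    }
}
```

A knowledge base such as DBPedia is, at heart, a very large collection of such triples.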

Extracting relationships

There are a number of techniques available to extract relationships. These can be grouped as follows:

  • Hand-built patterns
  • Supervised methods
  • Semi-supervised or unsupervised methods
    • Bootstrapping methods
    • Distant supervision methods
    • Unsupervised methods

Hand-built patterns are used when we have no training data.

If only a little training data is available, then the Naive Bayes classifier is a good choice. When more data is available, techniques such as SVM, regularized logistic regression, and random forest can be used.
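A hand-built pattern can be as simple as a regular expression. The sketch below (the pattern and class name are illustrative choices, not from any library) extracts an "X is the Y of Z" relation as a triple:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtractor {
    // Matches sentences of the form "X is the Y of Z",
    // e.g. "Paris is the capital of France".
    private static final Pattern IS_THE_OF =
            Pattern.compile("(\\w+) is the (\\w+) of (\\w+)");

    // Returns a subject-relation-object string, or null if no match.
    public static String extract(String sentence) {
        Matcher m = IS_THE_OF.matcher(sentence);
        if (m.find()) {
            return m.group(1) + "-" + m.group(2) + "-of-" + m.group(3);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Paris is the capital of France"));
        // Paris-capital-of-France
    }
}
```

Real systems use many such patterns, and their brittleness is exactly why the supervised and semi-supervised methods listed above exist.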

//OpenNLP

String fileLocation = getModelDir() + "/en-parser-chunking.bin";
try (InputStream modelInputStream = new FileInputStream(fileLocation)) 
{
    ParserModel model = new ParserModel(modelInputStream);
    Parser parser = ParserFactory.create(model);

    String sentence = "The cow jumped over the moon";
    //return the top three parses
    Parse[] parses = ParserTool.parseLine(sentence, parser, 3);
    for(Parse parse : parses) 
    {
        parse.show();
        parse.showCodeTree();
        System.out.println("Probability: " + parse.getProb());

        Parse[] children = parse.getChildren();
        for (Parse parseElement : children) 
        {
            System.out.println(parseElement.getText());
            System.out.println(parseElement.getType());
            Parse[] tags = parseElement.getTagNodes();
            System.out.println("Tags");
            for (Parse tag : tags) 
            {
                System.out.println("[" + tag + "]" + " type: " + tag.getType() + " Probability: " + tag.getProb() + " Label: " + tag.getLabel());
            }
        }
    }


} 
catch (IOException ex) 
{
    // Handle exceptions
}
//StanfordNLP

String parserModel = ".../models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);

String[] sentenceArray = {"The", "cow", "jumped", "over", "the", "moon", "."};
List<CoreLabel> words = Sentence.toCoreLabelList(sentenceArray);


Tree parseTree = lexicalizedParser.apply(words);

parseTree.pennPrint();

// Print the collapsed typed dependencies instead of the Penn Treebank form

TreePrint treePrint = new TreePrint("typedDependenciesCollapsed");
treePrint.printTree(parseTree);

Finding word dependencies using the GrammaticalStructure class

//StanfordNLP
//

String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
parseTree = lexicalizedParser.apply(wordList);

TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();

System.out.println(tdl);

// This information can also be extracted using the gov, reln, and dep methods,
// which return the governor word, the relation, and the dependent element, respectively.

for(TypedDependency dependency : tdl) 
{
    System.out.println("Governor Word: [" + dependency.gov() + "] Relation: [" + dependency.reln().getLongName() 
    + "] Dependent Word: [" + dependency.dep() + "]");
}

Finding coreference resolution entities

//StanfordNLP
//
String sentence = "He took his cash and she took her change "+ "and together they bought their lunch.";

Properties props = new Properties();
props.put("annotators","tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(sentence);
pipeline.annotate(annotation);

Map<Integer, CorefChain> corefChainMap = annotation.get(CorefChainAnnotation.class);

Set<Integer> set = corefChainMap.keySet();
Iterator<Integer> setIterator = set.iterator();
while(setIterator.hasNext()) 
{
    CorefChain corefChain = corefChainMap.get(setIterator.next());
    System.out.println("CorefChain: " + corefChain);

    System.out.print("ClusterId: " + corefChain.getChainID());
    CorefMention mention = corefChain.getRepresentativeMention();
    System.out.println(" CorefMention: " + mention + " Span: [" + mention.mentionSpan + "]");
    List<CorefMention> mentionList = corefChain.getMentionsInTextualOrder();
    Iterator<CorefMention> mentionIterator = mentionList.iterator();
    while(mentionIterator.hasNext()) 
    {
        CorefMention cfm = mentionIterator.next();
        System.out.println("\tMention: " + cfm + " Span: [" + cfm.mentionSpan + "]");
        System.out.print("\tMention Type: " + cfm.mentionType + " Gender: " + cfm.gender);
        System.out.println(" Start: " + cfm.startIndex + " End: " + cfm.endIndex);
    }
    System.out.println();
}

Extracting relationships for a question-answer system

This process consists of several steps:
1. Finding word dependencies
2. Identifying the type of questions
3. Extracting its relevant components
4. Searching for the answer
5. Presenting the answer

//StanfordNLP
String question = "Who is the 32nd president of the United States?";

String parserModel = ".../englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = tokenizerFactory.getTokenizer(new StringReader(question));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
for (TypedDependency dependency : tdl) 
{
    System.out.println("Governor Word: [" + dependency.gov() + "] Relation: [" + dependency.reln().getLongName() + "] Dependent Word: [" + dependency.dep() + "]");
}

//Determining the question type
for (TypedDependency dependency : tdl) 
{
    if ("nominal subject".equals(dependency.reln().getLongName())
    && "who".equalsIgnoreCase(dependency.gov().originalText())) 
    {
        processWhoQuestion(tdl);
    }
}
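The processWhoQuestion method called above is not shown. As a simplified stand-in (the class name, the lookup map, and the string-matching logic are all illustrative assumptions, not the dependency-based implementation the code implies), a lookup-based sketch might be:

```java
import java.util.HashMap;
import java.util.Map;

public class WhoQuestionProcessor {
    // Stand-in knowledge base; a real system would query DBPedia,
    // a search engine, or another external source.
    private static final Map<String, String> KNOWN_ANSWERS = new HashMap<>();
    static {
        KNOWN_ANSWERS.put("32nd president of the united states",
                          "Franklin D. Roosevelt");
    }

    // Strip the question word and punctuation, then look up the remainder.
    public static String processWhoQuestion(String question) {
        String key = question.toLowerCase()
                             .replaceFirst("^who is the ", "")
                             .replace("?", "")
                             .trim();
        return KNOWN_ANSWERS.get(key);
    }

    public static void main(String[] args) {
        System.out.println(processWhoQuestion(
                "Who is the 32nd president of the United States?"));
        // Franklin D. Roosevelt
    }
}
```

A production version would instead walk the TypedDependency list to pull out the entity and relation, then issue a query against a knowledge base; the steps numbered 3-5 above are where most of the real work lies.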