Using a Parser to Extract Relationships

Parsing

Parsing is the process of creating a parse tree for a textual unit, such as a sentence.

A parse tree is a hierarchical data structure that represents the syntactic structure of a sentence.

Parsing is used for many tasks, including:

  • Machine translation of languages
  • Synthesizing speech from text
  • Speech recognition
  • Grammar checking
  • Information extraction

Coreference occurs when two or more expressions in a text refer to the same person or thing; coreference resolution is the task of finding those expressions.

Relationship types

An interesting site that contains a multitude of relationships is Freebase (https://www.freebase.com/), a database of people, places, and things organized by category. (Freebase has since been shut down; its data was migrated to Wikidata.) The WordNet thesaurus (http://wordnet.princeton.edu/) also contains a number of relationships.

Relationship      Example
--------------    ----------------------------------------
Personal          father-of, sister-of, girlfriend-of
Organizational    subsidiary-of, subcommittee-of
Spatial           near-to, northeast-of, under
Physical          part-of, composed-of
Interactions      bonds-with, associates-with, reacts-with

Two types of parsing:

  • Dependency: This focuses on the relationship between words
  • Phrase structure: This deals with phrases and their recursive structure

Dependency parses use labels such as subject, determiner, and preposition to identify
relationships between words.

Parsing techniques include shift-reduce, spanning tree, and cascaded chunking.

Understanding parse trees

Parse trees represent hierarchical relationships between elements of text. For example, a dependency tree shows the relationship between the grammatical elements of a sentence.

(ROOT
    (S
    (NP (DT The) (NN cow))
    (VP (VBD jumped)
        (PP (IN over)
            (NP (DT the) (NN moon))))
    (. .)))

Using extracted relationships

Think about it: the underlying principle of the knowledge graphs that academia and industry are currently focused on should be a model of the relationships within a language, and between a language and the background knowledge it connects to.

Relationships extracted can be used for a number of purposes including:

  • Building knowledge bases
  • Creating directories
  • Product searches
  • Patent analysis
  • Stock analysis
  • Intelligence analysis

There are many databases built from Wikipedia that extract relationships and information,
such as:
- Resource Description Framework (RDF): This uses triples such as Yosemite-location-California, where location is the relation. It can be found at http://www.w3.org/RDF/.
- DBPedia: This holds over one billion triples and is an example of a knowledge base created from Wikipedia. This can be found at http://dbpedia.org/About.
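As a minimal illustration of the triple idea (the class name and fields here are my own, not part of any RDF library), the Yosemite-location-California example can be modeled as a simple value class:

```java
// A minimal subject-predicate-object triple, mirroring the
// Yosemite-location-California example above.
public class Triple {
    public final String subject;
    public final String predicate;
    public final String object;

    public Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public String toString() {
        // Render as subject-predicate-object
        return subject + "-" + predicate + "-" + object;
    }

    public static void main(String[] args) {
        Triple t = new Triple("Yosemite", "location", "California");
        System.out.println(t); // Yosemite-location-California
    }
}
```

A knowledge base such as DBPedia is, at heart, a very large collection of such triples.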

Extracting relationships

There are a number of techniques available to extract relationships. These can be grouped as follows:

  • Hand-built patterns
  • Supervised methods
  • Semi-supervised or unsupervised methods
    • Bootstrapping methods
    • Distant supervision methods
    • Unsupervised methods

Hand-built patterns are used when we have no training data.

If only a little training data is available, then the Naive Bayes classifier is a good choice. When more data is available, techniques such as SVM, regularized logistic regression, and random forest can be used.
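A hand-built pattern can be as simple as a regular expression. The sketch below (the pattern and class name are illustrative choices, not from any library) extracts an "X is the Y of Z" relation as a triple:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtractor {
    // Matches sentences of the form "X is the Y of Z",
    // e.g. "Paris is the capital of France".
    private static final Pattern IS_THE_OF =
            Pattern.compile("(\\w+) is the (\\w+) of (\\w+)");

    // Returns a subject-relation-object string, or null if no match.
    public static String extract(String sentence) {
        Matcher m = IS_THE_OF.matcher(sentence);
        if (m.find()) {
            return m.group(1) + "-" + m.group(2) + "-of-" + m.group(3);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Paris is the capital of France"));
        // Paris-capital-of-France
    }
}
```

Real systems use many such patterns, and their brittleness is exactly why the supervised and semi-supervised methods listed above exist.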

//OpenNLP

String fileLocation = getModelDir() + "/en-parser-chunking.bin";
try (InputStream modelInputStream = new FileInputStream(fileLocation)) 
{
    ParserModel model = new ParserModel(modelInputStream);
    Parser parser = ParserFactory.create(model);

    String sentence = "The cow jumped over the moon";
    //return the top three parses
    Parse[] parses = ParserTool.parseLine(sentence, parser, 3);
    for(Parse parse : parses) 
    {
        parse.show();
        parse.showCodeTree();
        System.out.println("Probability: " + parse.getProb());

        Parse[] children = parse.getChildren();
        for (Parse parseElement : children) 
        {
            System.out.println(parseElement.getText());
            System.out.println(parseElement.getType());
            Parse[] tags = parseElement.getTagNodes();
            System.out.println("Tags");
            for (Parse tag : tags) 
            {
                System.out.println("[" + tag + "]" + " type: " + tag.getType() + " Probability: " + tag.getProb() + " Label: " + tag.getLabel());
            }
        }
    }


} 
catch (IOException ex) 
{
    // Handle exceptions
}
//StanfordNLP

String parserModel = ".../models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);

String[] sentenceArray = {"The", "cow", "jumped", "over", "the", "moon", "."};
List<CoreLabel> words = Sentence.toCoreLabelList(sentenceArray);


Tree parseTree = lexicalizedParser.apply(words);

parseTree.pennPrint();

// Print the collapsed typed dependencies instead of the Penn Treebank form

TreePrint treePrint = new TreePrint("typedDependenciesCollapsed");
treePrint.printTree(parseTree);

Finding word dependencies using the GrammaticalStructure class

//StanfordNLP
//

String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
parseTree = lexicalizedParser.apply(wordList);

TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();

System.out.println(tdl);

// This information can also be extracted using the gov, reln, and dep methods,
// which return the governor word, the relation, and the dependent element, respectively.

for(TypedDependency dependency : tdl) 
{
    System.out.println("Governor Word: [" + dependency.gov() + "] Relation: [" + dependency.reln().getLongName() 
    + "] Dependent Word: [" + dependency.dep() + "]");
}

Finding coreference resolution entities

//StanfordNLP
//
String sentence = "He took his cash and she took her change "+ "and together they bought their lunch.";

Properties props = new Properties();
props.put("annotators","tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(sentence);
pipeline.annotate(annotation);

Map<Integer, CorefChain> corefChainMap = annotation.get(CorefChainAnnotation.class);

Set<Integer> set = corefChainMap.keySet();
Iterator<Integer> setIterator = set.iterator();
while(setIterator.hasNext()) 
{
    CorefChain corefChain = corefChainMap.get(setIterator.next());
    System.out.println("CorefChain: " + corefChain);

    System.out.print("ClusterId: " + corefChain.getChainID());
    CorefMention mention = corefChain.getRepresentativeMention();
    System.out.println(" CorefMention: " + mention + " Span: [" + mention.mentionSpan + "]");
    List<CorefMention> mentionList = corefChain.getMentionsInTextualOrder();
    Iterator<CorefMention> mentionIterator = mentionList.iterator();
    while(mentionIterator.hasNext()) 
    {
        CorefMention cfm = mentionIterator.next();
        System.out.println("\tMention: " + cfm + " Span: [" + cfm.mentionSpan + "]");
        System.out.print("\tMention Type: " + cfm.mentionType + " Gender: " + cfm.gender);
        System.out.println(" Start: " + cfm.startIndex + " End: " + cfm.endIndex);
    }
    System.out.println();
}

Extracting relationships for a question-answer system

This process consists of several steps:
1. Finding word dependencies
2. Identifying the type of questions
3. Extracting its relevant components
4. Searching for the answer
5. Presenting the answer

//StanfordNLP
String question = "Who is the 32nd president of the United States?";

String parserModel = ".../englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = tokenizerFactory.getTokenizer(new StringReader(question));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
for (TypedDependency dependency : tdl) 
{
    System.out.println("Governor Word: [" + dependency.gov() + "] Relation: [" + dependency.reln().getLongName() + "] Dependent Word: [" + dependency.dep() + "]");
}

//Determining the question type
for (TypedDependency dependency : tdl) 
{
    if ("nominal subject".equals(dependency.reln().getLongName())
    && "who".equalsIgnoreCase(dependency.gov().originalText())) 
    {
        processWhoQuestion(tdl);
    }
}
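The processWhoQuestion method called above is not shown. As a simplified stand-in (the class name, the lookup map, and the string-matching logic are all illustrative assumptions, not the dependency-based implementation the code implies), a lookup-based sketch might be:

```java
import java.util.HashMap;
import java.util.Map;

public class WhoQuestionProcessor {
    // Stand-in knowledge base; a real system would query DBPedia,
    // a search engine, or another external source.
    private static final Map<String, String> KNOWN_ANSWERS = new HashMap<>();
    static {
        KNOWN_ANSWERS.put("32nd president of the united states",
                          "Franklin D. Roosevelt");
    }

    // Strip the question word and punctuation, then look up the remainder.
    public static String processWhoQuestion(String question) {
        String key = question.toLowerCase()
                             .replaceFirst("^who is the ", "")
                             .replace("?", "")
                             .trim();
        return KNOWN_ANSWERS.get(key);
    }

    public static void main(String[] args) {
        System.out.println(processWhoQuestion(
                "Who is the 32nd president of the United States?"));
        // Franklin D. Roosevelt
    }
}
```

A production version would instead walk the TypedDependency list to pull out the entity and relation, then issue a query against a knowledge base; the steps numbered 3-5 above are where most of the real work lies.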