Parsing
Parsing is the process of creating a parse tree for a textual unit.
A parse tree is a hierarchical data structure that represents the syntactic structure of a sentence.
Parsing is used for many tasks, including:
- Machine translation of languages
- Synthesizing speech from text
- Speech recognition
- Grammar checking
- Information extraction
Coreference resolution is the task of determining when two or more expressions in a text refer to the same person or thing.
Relationship types
An interesting site that contains a multitude of relationships is Freebase (https://www.freebase.com/), a database of people, places, and things organized by categories. The WordNet thesaurus (http://wordnet.princeton.edu/) also contains a number of relationships.
| Relationship | Example |
|---|---|
| Personal | father-of, sister-of, girlfriend-of |
| Organizational | subsidiary-of, subcommittee-of |
| Spatial | near-to, northeast-of, under |
| Physical | part-of, composed-of |
| Interactions | bonds-with, associates-with, reacts-with |
Two types of parsing:
- Dependency: This focuses on the relationship between words
- Phrase structure: This deals with phrases and their recursive structure
Dependencies can use labels such as subject, determiner, and preposition to find relationships.
Parsing techniques include shift-reduce, spanning tree, and cascaded chunking.
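Of these techniques, shift-reduce is the easiest to illustrate in miniature. The following is a minimal sketch, not code from any NLP library: the grammar, class, and method names are invented for illustration. It shifts POS tags onto a stack and greedily reduces whenever the top two items match a rule, which works for this toy grammar (real shift-reduce parsers use a trained model to choose between shifting and reducing).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShiftReduceDemo {
    // Toy grammar (assumed for illustration): RHS -> LHS
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("DT NN", "NP");
        RULES.put("IN NP", "PP");
        RULES.put("VBD PP", "VP");
        RULES.put("NP VP", "S");
    }

    static String parse(List<String> tags) {
        Deque<String> stack = new ArrayDeque<>();
        for (String tag : tags) {
            stack.push(tag);   // shift the next tag onto the stack
            reduceAll(stack);  // reduce greedily while a rule applies
        }
        return String.join(" ", stack);
    }

    // Repeatedly replace the top two stack items with a rule's LHS.
    private static void reduceAll(Deque<String> stack) {
        boolean reduced = true;
        while (reduced && stack.size() >= 2) {
            String top = stack.pop();
            String second = stack.pop();
            String lhs = RULES.get(second + " " + top);
            if (lhs != null) {
                stack.push(lhs);
            } else {
                stack.push(second);
                stack.push(top);
                reduced = false;
            }
        }
    }

    public static void main(String[] args) {
        // POS tags for "The cow jumped over the moon"
        System.out.println(parse(List.of("DT", "NN", "VBD", "IN", "DT", "NN"))); // prints S
    }
}
```

The tag sequence reduces step by step to `NP VBD IN NP`, then `NP VBD PP`, then `NP VP`, and finally the single start symbol `S`, mirroring the nested structure of the parse tree shown later in this section.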
Understanding parse trees
Parse trees represent hierarchical relationships between elements of text. For example, a dependency tree shows the relationships between the grammatical elements of a sentence, while a phrase-structure tree, such as the following, shows how a sentence breaks down into nested phrases:
(ROOT
(S
(NP (DT The) (NN cow))
(VP (VBD jumped)
(PP (IN over)
(NP (DT the) (NN moon))))
(. .)))
Using extracted relationships
Think about the knowledge graphs that academia and industry are currently focused on: their underlying principle should be a model of the relationships within language, and between language and the background knowledge it connects to.
Relationships extracted can be used for a number of purposes including:
- Building knowledge bases
- Creating directories
- Product searches
- Patent analysis
- Stock analysis
- Intelligence analysis
There are a number of frameworks and databases for representing extracted relationships and information, such as:
- Resource Description Framework (RDF): This uses triples such as Yosemite-location-California, where location is the relation. This can be found at http://www.w3.org/RDF/.
- DBpedia: This holds over one billion triples and is an example of a knowledge base created from Wikipedia. This can be found at http://dbpedia.org/About.
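A subject-relation-object triple like the RDF example above is easy to model directly. The following is a minimal sketch, not the RDF API: the `Triple` record and `objectsOf` helper are invented for illustration.

```java
import java.util.List;

public class TripleDemo {
    // A minimal subject-relation-object triple,
    // as in Yosemite-location-California.
    record Triple(String subject, String relation, String object) {}

    // Query the knowledge base: all objects for a given subject and relation.
    static List<String> objectsOf(List<Triple> triples, String subject, String relation) {
        return triples.stream()
                .filter(t -> t.subject().equals(subject) && t.relation().equals(relation))
                .map(Triple::object)
                .toList();
    }

    public static void main(String[] args) {
        List<Triple> kb = List.of(
                new Triple("Yosemite", "location", "California"),
                new Triple("California", "location", "United States"));
        System.out.println(objectsOf(kb, "Yosemite", "location")); // prints [California]
    }
}
```

Real triple stores such as DBpedia index these three positions so that any of them can be queried efficiently; this sketch only shows the shape of the data.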
Extracting relationships
There are a number of techniques available to extract relationships. These can be grouped as follows:
- Hand-built patterns
- Supervised methods
- Semi-supervised or unsupervised methods, including:
  - Bootstrapping methods
  - Distant supervision methods
  - Unsupervised methods
Hand-built patterns are used when we have no training data. If only a little training data is available, then the Naive Bayes classifier is a good choice. When more data is available, techniques such as SVM, regularized logistic regression, and random forest can be used.
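A hand-built pattern can be as simple as a regular expression. The following is a minimal sketch of a Hearst-style "X such as Y" pattern; the class and `extract` method are invented for illustration, and a real extractor would use many such patterns over tokenized, tagged text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtractor {
    // Hand-built Hearst-style pattern: "<hypernym> such as <hyponym>"
    private static final Pattern SUCH_AS =
            Pattern.compile("(\\w+)\\s+such as\\s+(\\w+)");

    // Return (hyponym, is-a, hypernym) triples found in the text.
    static List<String[]> extract(String text) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = SUCH_AS.matcher(text);
        while (m.find()) {
            pairs.add(new String[]{m.group(2), "is-a", m.group(1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "He studied languages such as Java and parsers such as OpenNLP.";
        for (String[] p : extract(text)) {
            System.out.println(p[0] + " " + p[1] + " " + p[2]);
        }
    }
}
```

Patterns like this are precise but brittle, which is why the supervised and bootstrapping methods listed above are preferred once training data becomes available.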
//OpenNLP
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

String fileLocation = getModelDir() + "/en-parser-chunking.bin";
try (InputStream modelInputStream = new FileInputStream(fileLocation))
{
    ParserModel model = new ParserModel(modelInputStream);
    Parser parser = ParserFactory.create(model);
    String sentence = "The cow jumped over the moon";
    // Return the top three parses
    Parse[] parses = ParserTool.parseLine(sentence, parser, 3);
    for (Parse parse : parses)
    {
        parse.show();
        parse.showCodeTree();
        System.out.println("Probability: " + parse.getProb());
        Parse[] children = parse.getChildren();
        for (Parse parseElement : children)
        {
            System.out.println(parseElement.getText());
            System.out.println(parseElement.getType());
            Parse[] tags = parseElement.getTagNodes();
            System.out.println("Tags");
            for (Parse tag : tags)
            {
                System.out.println("[" + tag + "]" + " type: " + tag.getType()
                        + " Probability: " + tag.getProb() + " Label: " + tag.getLabel());
            }
        }
    }
}
catch (IOException ex)
{
    // Handle exceptions
}
//StanfordNLP
String parserModel = ".../models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
String[] sentenceArray = {"The", "cow", "jumped", "over", "the", "moon", "."};
List<CoreLabel> words = Sentence.toCoreLabelList(sentenceArray);
Tree parseTree = lexicalizedParser.apply(words);
parseTree.pennPrint();
// Print the collapsed typed dependencies for the same parse tree
TreePrint treePrint = new TreePrint("typedDependenciesCollapsed");
treePrint.printTree(parseTree);
Finding word dependencies using the GrammaticalStructure class
//StanfordNLP
String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
//This information can also be extracted using the gov(), reln(), and dep() methods,
//which return the governor word, the relationship, and the dependent element, respectively
for(TypedDependency dependency : tdl)
{
System.out.println("Governor Word: [" + dependency.gov() + "] Relation: [" + dependency.reln().getLongName()
+ "] Dependent Word: [" + dependency.dep() + "]");
}
Finding coreference resolution entities
//StanfordNLP
String sentence = "He took his cash and she took her change "+ "and together they bought their lunch.";
Properties props = new Properties();
props.put("annotators","tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(sentence);
pipeline.annotate(annotation);
Map<Integer, CorefChain> corefChainMap = annotation.get(CorefChainAnnotation.class);
Set<Integer> set = corefChainMap.keySet();
Iterator<Integer> setIterator = set.iterator();
while (setIterator.hasNext())
{
    CorefChain corefChain = corefChainMap.get(setIterator.next());
    System.out.println("CorefChain: " + corefChain);
    System.out.print("ClusterId: " + corefChain.getChainID());
    CorefMention mention = corefChain.getRepresentativeMention();
    System.out.println(" CorefMention: " + mention + " Span: [" + mention.mentionSpan + "]");
    List<CorefMention> mentionList = corefChain.getMentionsInTextualOrder();
    Iterator<CorefMention> mentionIterator = mentionList.iterator();
    while (mentionIterator.hasNext())
    {
        CorefMention cfm = mentionIterator.next();
        System.out.println("\tMention: " + cfm + " Span: [" + cfm.mentionSpan + "]");
        System.out.print("\tMention Type: " + cfm.mentionType + " Gender: " + cfm.gender);
        System.out.println(" Start: " + cfm.startIndex + " End: " + cfm.endIndex);
    }
    System.out.println();
}
Extracting relationships for a question-answer system
This process consists of several steps:
1. Finding word dependencies
2. Identifying the type of questions
3. Extracting the relevant components
4. Searching for the answer
5. Presenting the answer
//StanfordNLP
String question = "Who is the 32nd president of the United States?";
String parserModel = ".../englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer = tokenizerFactory.getTokenizer(new StringReader(question));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
for (TypedDependency dependency : tdl)
{
System.out.println("Governor Word: [" + dependency.gov() + "] Relation: [" + dependency.reln().getLongName() + "] Dependent Word: [" + dependency.dep() + "]");
}
//Determining the question type
for (TypedDependency dependency : tdl)
{
if ("nominal subject".equals(dependency.reln().getLongName())
&& "who".equalsIgnoreCase(dependency.gov().originalText()))
{
processWhoQuestion(tdl);
}
}
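The snippet above calls a `processWhoQuestion` method that is not defined here. The following is one possible sketch of it, simplified to take the sentence tokens rather than the dependency list; the lookup table of presidents is a hypothetical stand-in for a real knowledge base such as the triple stores described earlier.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WhoQuestionDemo {
    // Hypothetical knowledge source: ordinal -> president.
    private static final Map<String, String> PRESIDENTS = new HashMap<>();
    static {
        PRESIDENTS.put("31st", "Herbert Hoover");
        PRESIDENTS.put("32nd", "Franklin D. Roosevelt");
        PRESIDENTS.put("33rd", "Harry S. Truman");
    }

    // Find the ordinal in the question and look up the answer.
    static String processWhoQuestion(List<String> tokens) {
        for (String token : tokens) {
            String answer = PRESIDENTS.get(token.toLowerCase());
            if (answer != null) {
                return answer;
            }
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Who", "is", "the", "32nd", "president",
                "of", "the", "United", "States", "?");
        System.out.println(processWhoQuestion(tokens)); // prints Franklin D. Roosevelt
    }
}
```

A fuller implementation would use the typed dependencies already extracted, for example, matching the numeric modifier of "president", rather than scanning raw tokens, but the control flow is the same: identify the question's key components, then query a knowledge source for the answer.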