Parsing human language
Rather different from computer languages
- No types for words (variable, comment, …)
- No brackets around phrases
- Ambiguity
- words
- parses
- Implied information
Parsing
- Parsing means associating tree structures with a sentence, given a grammar (often a CFG)
- There may be 0, 1, or more than 1 such tree structures for the given sentence
- Grammars are declarative
- They don’t specify how the parse tree will be constructed
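The point that a sentence may have 0, 1, or more than 1 tree under a given grammar can be checked with a small sketch. The grammar, lexicon, and the function name `count_parses` below are hypothetical and chosen only for illustration; the grammar is assumed to be in Chomsky normal form so that a CKY-style chart can count trees. Because the grammar is declarative, the same rules could be used by any other parsing strategy.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> categories it can take.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` rooted in `start` (CKY-style)."""
    n = len(words)
    # chart[(i, j)][A] = number of trees rooted in A that span words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] += 1
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]


print(count_parses("saw the man I".split()))                     # 0
print(count_parses("I saw the man".split()))                     # 1
print(count_parses("I saw the man with the telescope".split()))  # 2
```

The last sentence gets two trees because the PP "with the telescope" can attach either to the verb phrase or to the noun phrase.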
Syntactic ambiguities
- PP attachment
- Gaps
- Mary likes Physics but hates Chemistry.
- Coordination scope
- Small boys and girls are playing.
- Particles vs. prepositions
- She ran up a large hill.
- Gerund vs. adjective
- Frightening kids can cause trouble.
Applications of parsing
- Grammar checking
- Question answering
- Machine translation
- Information extraction
- Speech generation
- Speech understanding
- Interpretation
Context-free grammars
- A CFG is a 4-tuple (N, Σ, R, S)
- N: non-terminal symbols
- Σ: terminal symbols (disjoint from N)
- R: rules of the form A → β, where A is in N and β is a string in (Σ ∪ N)* (a finite, possibly empty, sequence of terminals and non-terminals)
- S: start symbol in N
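As a minimal sketch, the 4-tuple can be written down directly as a data structure and the constraints in the definition checked mechanically. The class name `CFG`, the helper `is_well_formed`, and the toy grammar are assumptions made for illustration, not part of any particular library.

```python
from typing import NamedTuple


class CFG(NamedTuple):
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    rules: tuple             # R: pairs (A, β), with β a tuple of symbols; () is the empty string
    start: str               # S


def is_well_formed(g):
    """Check the constraints stated in the definition of a CFG."""
    if g.nonterminals & g.terminals:      # Σ must be disjoint from N
        return False
    if g.start not in g.nonterminals:     # S must be in N
        return False
    symbols = g.nonterminals | g.terminals
    # Every rule A → β needs A in N and every symbol of β in Σ ∪ N.
    return all(
        lhs in g.nonterminals and all(sym in symbols for sym in rhs)
        for lhs, rhs in g.rules
    )


# Hypothetical toy grammar for "the dog barks".
toy = CFG(
    nonterminals=frozenset({"S", "NP", "VP", "Det", "Noun", "V"}),
    terminals=frozenset({"the", "dog", "barks"}),
    rules=(
        ("S", ("NP", "VP")),
        ("NP", ("Det", "Noun")),
        ("VP", ("V",)),
        ("Det", ("the",)),
        ("Noun", ("dog",)),
        ("V", ("barks",)),
    ),
    start="S",
)

print(is_well_formed(toy))  # True
```

Representing β as a tuple keeps the empty right-hand side (`()`) distinct from a missing rule, which matches the "possibly empty string" reading of (Σ ∪ N)*.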
Phrase structure grammar
- Sentences are not just bags of words! (Again!)
- Context-free view of language
- A PP looks the same whether it is part of the subject NP or part of the VP
- Constituent order
- SVO (subject verb object)
- SOV
- …
- Auxiliary verbs
- Imperative sentences
- Interrogative sentences
- Negative sentences
Leftmost derivation
- A leftmost derivation is a sequence of strings s1, s2, ..., sn
- s1 = S, the start symbol
- sn includes only terminal symbols
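A leftmost derivation can be traced with a short sketch. It assumes the standard reading in which each string is obtained from the previous one by rewriting its leftmost non-terminal with one rule. The function name `leftmost_derivation` and the toy grammar are hypothetical and used only for illustration.

```python
def leftmost_derivation(start, nonterminals, rule_choices):
    """Return the strings s1, ..., sn produced by applying `rule_choices` in order.

    Each choice is a pair (A, β) and must rewrite the leftmost non-terminal
    of the current string; otherwise a ValueError is raised.
    """
    current = [start]
    steps = [tuple(current)]  # s1 = S
    for lhs, rhs in rule_choices:
        # Position of the leftmost non-terminal, if any.
        index = next((i for i, sym in enumerate(current) if sym in nonterminals), None)
        if index is None:
            raise ValueError("no non-terminal left to expand")
        if current[index] != lhs:
            raise ValueError(f"leftmost non-terminal is {current[index]!r}, not {lhs!r}")
        current[index:index + 1] = list(rhs)  # replace A by β
        steps.append(tuple(current))
    if any(sym in nonterminals for sym in current):
        raise ValueError("sn must contain only terminal symbols")
    return steps


# Hypothetical toy grammar and rule sequence for "the dog barks".
NONTERMINALS = {"S", "NP", "VP", "Det", "Noun", "V"}
choices = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "Noun")),
    ("Det", ("the",)),
    ("Noun", ("dog",)),
    ("VP", ("V",)),
    ("V", ("barks",)),
]

for step in leftmost_derivation("S", NONTERMINALS, choices):
    print(" ".join(step))
```

The first printed line is the start symbol and the last is the terminal string "the dog barks"; choosing `("VP", ("V",))` before `("Det", ("the",))` would be rejected, because VP is not the leftmost non-terminal at that point.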