Part-of-speech (POS) tagging
- Open class
- nouns, non-modal verbs, adjectives, adverbs
- Closed class
- prepositions, modal verbs, conjunctions, particles, determiners, pronouns
Penn Treebank tag set: the tag IN covers all prepositions except "to", which gets its own tag TO whether it is used as a preposition or as the infinitive particle.
Some observations
- ambiguity: the same word can take different tags depending on context, and sometimes even its pronunciation changes with the tag (e.g., "record" as a noun vs. as a verb)
Useful for parsing, machine translation, word sense disambiguation, etc.
Main techniques
- rule-based
- machine learning (CRFs, maximum entropy models, Markov models)
- transformation-based
Sources of information
- Knowledge about individual words (unigram)
- lexical information
- spelling (-or, -er)
- capitalization (IBM)
- Knowledge about neighboring words
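The information sources above can be turned into features for a tagger. The following is a minimal sketch, not a fixed recipe: the function name `word_features`, the feature names, and the boundary markers `<s>` and `</s>` are all assumptions chosen for illustration.

```python
def word_features(tokens, i):
    """Return a dict of illustrative features for the token at position i."""
    word = tokens[i]
    return {
        # Knowledge about the individual word
        "word": word.lower(),                  # lexical identity
        "suffix2": word[-2:].lower(),          # spelling cue, e.g. -or, -er
        "is_capitalized": word[:1].isupper(),  # capitalization cue
        "is_all_caps": word.isupper(),         # e.g. IBM
        # Knowledge about neighboring words (hypothetical boundary markers)
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


tokens = ["IBM", "hired", "a", "new", "director"]
first = word_features(tokens, 0)
last = word_features(tokens, len(tokens) - 1)
```

A feature dict like this could be fed to a maximum entropy or CRF tagger; which features actually help depends on the language and the tag set.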
Evaluation
- Baseline (relatively high)
- tag each word with its most likely tag
- tag each OOV word as a noun
- accuracy around 90%
- current accuracy
- around 97% for English
- around 98% for human performance
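The baseline described above can be sketched in a few lines. This is an illustration under stated assumptions: the toy training data, the function names, and the choice of `NN` as the noun tag for OOV words are hypothetical, and the accuracy on such a tiny sample says nothing about the 90% figure.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: tag_counts.most_common(1)[0][0] for w, tag_counts in counts.items()}


def tag_baseline(model, tokens, oov_tag="NN"):
    """Tag known words with their most likely tag and OOV words as nouns."""
    return [(tok, model.get(tok.lower(), oov_tag)) for tok in tokens]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


# Toy training data (hypothetical), using Penn Treebank style tags.
train = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("run", "NN"), ("ends", "VBZ")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("they", "PRP"), ("run", "VBP")],
]
model = train_baseline(train)

gold = [("the", "DT"), ("cat", "NN"), ("run", "NN")]
predicted = tag_baseline(model, [word for word, _ in gold])
score = accuracy(predicted, gold)
```

Here "cat" is out of vocabulary and falls back to the noun tag, while "run" is tagged `VBP` because that is its most frequent tag in the toy data, which is exactly the kind of error a context-free baseline makes.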
Rule-based tagging
- use dictionary or finite-state transducers to find all possible POS
- use disambiguation rules
- e.g., ART + V (an article followed by a verb is never allowed)
- hundreds of rules can be designed
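The two steps above (dictionary lookup, then disambiguation) might look like the following minimal sketch. The lexicon, the tag names `ART`, `N`, `V`, the function names, and the fallback of treating unknown words as nouns are assumptions for illustration; a real system would use a much larger dictionary or a finite-state transducer and hundreds of rules.

```python
# Hypothetical toy dictionary: each word maps to all of its possible tags.
LEXICON = {
    "the": {"ART"},
    "can": {"N", "V"},
    "rusts": {"V"},
}


def candidate_tags(tokens, lexicon):
    """Step 1: look up every possible tag for each token.

    Unknown words are assumed to be nouns here (an illustrative choice).
    """
    return [set(lexicon.get(tok.lower(), {"N"})) for tok in tokens]


def apply_art_v_rule(candidates):
    """Step 2: apply the ART + V rule.

    If the previous word can only be an article, remove V from the
    candidates of the current word, as long as another tag remains.
    """
    for i in range(1, len(candidates)):
        if candidates[i - 1] == {"ART"} and len(candidates[i]) > 1:
            candidates[i].discard("V")
    return candidates


tokens = ["The", "can", "rusts"]
before = candidate_tags(tokens, LEXICON)
after = apply_art_v_rule(candidate_tags(tokens, LEXICON))
```

In this example "can" starts out ambiguous between `N` and `V`, and the single rule is enough to leave only `N` after the article.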
Rule examples
Useful for languages with no annotated training data