Introduction
- polysemy
- words have multiple senses
- homonymy
- POS ambiguity
Word sense disambiguation
- Task
- given a word
- and its context
- determine which sense of the word is intended
- Use for machine translation
- e.g., translate ‘play’ into Spanish
- play the violin = tocar el violín
- play tennis = jugar al tenis
- Other uses
- accent restoration
- text to speech generation
- spell correction
- capitalization restoration
Dictionary methods (Lesk)
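- match the context sentence against the dictionary definition of each sense
- choose the sense whose definition shares the most words with the context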
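The dictionary-overlap idea can be sketched as a simplified Lesk procedure. This is a minimal illustration, not the original algorithm in full: the glosses in `BASS_GLOSSES`, the stopword list, and the function name `simplified_lesk` are all made up for the example.

```python
def simplified_lesk(context_words, sense_definitions, stopwords=frozenset()):
    """Return the sense whose definition shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        gloss = {w.lower() for w in definition.split()} - stopwords
        overlap = len(context & gloss)
        # Ties keep the sense that was listed first.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical toy glosses, not taken from a real dictionary.
BASS_GLOSSES = {
    "bass1": "an edible fish found in fresh water or the sea",
    "bass2": "a low pitched musical instrument or the lowest part in music",
}
STOPWORDS = frozenset({"a", "an", "the", "in", "or", "of", "on", "he", "she", "and"})

fishing = "he caught a huge bass while fishing in the sea".split()
band = "she plays bass in a music band".split()
print(simplified_lesk(fishing, BASS_GLOSSES, STOPWORDS))
print(simplified_lesk(band, BASS_GLOSSES, STOPWORDS))
```

Exact word overlap is brittle (here "fishing" does not match "fish"), so a practical version would probably stem or lemmatize both the context and the glosses.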
Decision lists (Yarowsky)
- look at only 2 senses at a time
- ordered rules: collocation gives sense
- Formula
$$\log\left(\frac{p(\mathrm{sense}_a \mid \mathrm{collocation}_i)}{p(\mathrm{sense}_b \mid \mathrm{collocation}_i)}\right)$$
Example
To disambiguate the word bass, we take collocations from a fixed-size window of words around it:
- fish within window -> bass1
- striped bass -> bass1
- guitar within the window -> bass2
- bass player -> bass2
- Play/V bass -> bass2
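A decision list of this kind can be sketched as follows: score each collocation by the log ratio of the two sense probabilities, sort the rules by the magnitude of that score, and let the first matching rule decide. The smoothing constant `alpha`, the function names, and the toy training examples are assumptions added for the illustration.

```python
import math
from collections import defaultdict


def build_decision_list(examples, sense_a, sense_b, alpha=0.1):
    """examples: iterable of (collocations, sense). Returns rules, strongest first."""
    counts = defaultdict(lambda: {sense_a: 0, sense_b: 0})
    for collocations, sense in examples:
        for c in set(collocations):
            counts[c][sense] += 1
    rules = []
    for c, n in counts.items():
        # Smoothed count ratio; it equals p(sense_a|c) / p(sense_b|c)
        # because both probabilities share the same denominator.
        score = math.log((n[sense_a] + alpha) / (n[sense_b] + alpha))
        rules.append((abs(score), c, sense_a if score > 0 else sense_b))
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules


def classify(collocations, rules, default):
    """Apply the first (strongest) rule whose collocation is present."""
    present = set(collocations)
    for _, collocation, sense in rules:
        if collocation in present:
            return sense
    return default


# Hypothetical labelled examples for the word "bass".
examples = [
    (["fish", "striped"], "bass1"),
    (["fish", "river"], "bass1"),
    (["guitar", "player"], "bass2"),
    (["player", "play/V"], "bass2"),
    (["river", "player"], "bass2"),
]
rules = build_decision_list(examples, "bass1", "bass2")
print(classify(["fish", "guitar"], rules, default="bass1"))
```

Only the single strongest matching rule is used; evidence from weaker collocations is not combined.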
Classification features
- Adjacent words (collocations)
- Position
- Adjacent POS
- Nearby words
- Syntactic information
- Topic of the text
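As a rough sketch, some of these features can be extracted from a POS-tagged sentence like this. The feature names, the `window` parameter, and the input format (a list of `(word, pos)` pairs) are assumptions for the example; syntactic and topic features are left out because they need a parser or document-level information.

```python
def extract_features(tagged_tokens, target_index, window=3):
    """tagged_tokens: list of (word, pos) pairs. Returns a feature dict."""
    feats = {}
    n = len(tagged_tokens)
    # Adjacent words and their POS tags, keyed by position relative to the target.
    for offset in (-1, 1):
        i = target_index + offset
        if 0 <= i < n:
            word, pos = tagged_tokens[i]
            feats[f"word[{offset:+d}]"] = word.lower()
            feats[f"pos[{offset:+d}]"] = pos
    # Nearby words: unordered bag of words inside the window.
    lo = max(0, target_index - window)
    hi = min(n, target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index:
            feats[f"near={tagged_tokens[i][0].lower()}"] = True
    return feats


# Hypothetical pre-tagged sentence; the target word "bass" is at index 2.
tokens = [("He", "PRP"), ("plays", "VBZ"), ("bass", "NN"), ("guitar", "NN")]
features = extract_features(tokens, target_index=2)
print(features)
```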
Classification methods
- KNN (memory-based)
- using Euclidean distance
- find the k most similar examples and return the majority class for them
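A minimal sketch of the k-nearest-neighbour classifier described above, assuming each example has already been turned into a numeric feature vector. The toy vectors (indicating the presence of fish, guitar, player) and the name `knn_classify` are made up for the illustration.

```python
import math
from collections import Counter


def knn_classify(query, training, k=3):
    """training: list of (vector, label); all vectors have the same length."""
    # Sort stored examples by Euclidean distance to the query.
    by_distance = sorted(training, key=lambda ex: math.dist(query, ex[0]))
    # Majority vote among the k closest examples.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical binary features: [fish, guitar, player]
training = [
    ([1, 0, 0], "bass1"),
    ([1, 0, 1], "bass1"),
    ([0, 1, 1], "bass2"),
    ([0, 1, 0], "bass2"),
    ([0, 0, 1], "bass2"),
]
print(knn_classify([1, 0, 0], training, k=3))
```

Nothing is learned ahead of time: the training examples are simply stored and compared at query time, which is why the method is called memory-based.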
Bootstrapping
- for an ambiguous word, start with 2 senses and choose 1 strong, indicative seed collocation for each sense
- e.g., plant1: leaf, plant2: factory
- Label each occurrence of the ambiguous word in the training set that appears near a seed, e.g., label it plant1 if it is near leaf. This labels only a small portion of the data at first, say 1% as plant1 and 1% as plant2, leaving 98% unlabelled.
- From the data labelled so far, look for additional collocations that indicate one of the senses.
- e.g., plant1: living
- Repeat the labelling step with the enlarged set of collocations.
- Continue this process until all of the training set is labelled.
- 2 principles:
- one sense per collocation
- one sense per discourse
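The bootstrapping loop can be sketched as below. This is a simplified illustration, not Yarowsky's full algorithm: a new collocation is accepted only if it co-occurs with a single sense at least `min_count` times (an assumed threshold), the one-sense-per-discourse constraint is not applied, and the loop stops when nothing new can be labelled, so some contexts may stay unlabelled.

```python
from collections import Counter, defaultdict


def bootstrap(contexts, seeds, min_count=2, max_rounds=10):
    """contexts: list of word sets around the ambiguous word.
    seeds: dict mapping collocation -> sense.
    Returns (labels, collocations)."""
    colloc = dict(seeds)
    labels = [None] * len(contexts)
    for _ in range(max_rounds):
        # Label unlabelled contexts whose known collocations agree on one sense.
        changed = False
        for i, ctx in enumerate(contexts):
            if labels[i] is None:
                senses = {colloc[w] for w in ctx if w in colloc}
                if len(senses) == 1:
                    labels[i] = senses.pop()
                    changed = True
        if not changed:
            break
        # Learn new collocations seen with exactly one sense, often enough.
        counts = defaultdict(Counter)
        for ctx, sense in zip(contexts, labels):
            if sense is not None:
                for w in ctx:
                    counts[w][sense] += 1
        for w, per_sense in counts.items():
            if w not in colloc and len(per_sense) == 1:
                sense, n = next(iter(per_sense.items()))
                if n >= min_count:
                    colloc[w] = sense
    return labels, colloc


# Hypothetical contexts for the word "plant".
contexts = [
    {"leaf", "living"},
    {"leaf", "living", "green"},
    {"living", "species"},
    {"factory", "workers"},
    {"factory", "workers", "industrial"},
    {"workers", "strike"},
    {"unrelated"},
]
labels, colloc = bootstrap(contexts, {"leaf": "plant1", "factory": "plant2"})
print(labels)
```

In this toy run the seeds label four contexts, "living" and "workers" are then learned as new collocations, and they label two more contexts in the next round.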
Training data for WSD
- Senseval/Semcor
- Pseudo-word
- Multilingual corpora
Senseval-1 Evaluation
- Metric
- A = number of assigned senses
- C = number of words assigned correct senses
- T = total number of test words
- Precision = C/A, Recall = C/T
- Result
- Best: 77P/77R
- human lexicographer: 97P/96R
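The precision and recall definitions above can be computed as in this small sketch. It assumes the system may abstain on some test words (represented here by a missing key or `None`), which is what makes A differ from T; the function name and the toy data are illustrative only.

```python
def precision_recall(predictions, gold):
    """predictions: dict word_id -> sense, or None / missing if the system abstained.
    gold: dict word_id -> correct sense."""
    assigned = sum(1 for w in gold if predictions.get(w) is not None)   # A
    correct = sum(
        1 for w in gold
        if predictions.get(w) is not None and predictions[w] == gold[w]
    )                                                                   # C
    total = len(gold)                                                   # T
    precision = correct / assigned if assigned else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# Hypothetical test set of four word occurrences; the system skips w4.
gold = {"w1": "bass1", "w2": "bass2", "w3": "bass1", "w4": "bass2"}
predictions = {"w1": "bass1", "w2": "bass2", "w3": "bass2"}
p, r = precision_recall(predictions, gold)
print(p, r)
```

When the system assigns a sense to every test word, A equals T and precision equals recall.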