Lecture 1: Introduction & Vector Representations of Text
1. Why is NLP challenging?
a. Natural languages are not designed; they evolve.
- new words appear constantly
- syntactic rules are flexible
- ambiguity is inherent
b. World knowledge is necessary for interpretation
c. So many languages
2. NLP vs. ML
- NLP is a confluence of computer science, artificial intelligence and linguistics
- ML provides statistical techniques for solving problems by learning from data
- ML is often used to model NLP tasks
3. NLP vs. Computational Linguistics
- Both mostly use text as data
- In Computational Linguistics (CL), computational/statistical methods are used to support the study of linguistic phenomena and theories
- In NLP, the scope is more general: computational methods are used for translating text, extracting information, answering questions, etc.
4. Vectors and Vector Space
Vector: a one-dimensional array of numbers
Vector space: a collection of vectors; in practice stored as a matrix, with one vector per row
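A minimal sketch of these two definitions in NumPy, using made-up toy numbers:

```python
import numpy as np

# A vector: a one-dimensional array of numbers
v = np.array([1.0, 0.0, 2.0])

# A vector space as used in these notes: a matrix whose rows are
# vectors, e.g. one row per word or per document (toy values)
V = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
])
```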
5. Vector similarity
Dot (inner) product: takes two equal-length sequences of numbers (i.e. vectors) and returns a single value
Cosine similarity: normalises the dot product by dividing by the vectors' lengths (or magnitudes, or norms) |x||y|; for non-negative count vectors the result lies in [0,1]
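Both similarity measures above can be sketched in a few lines of NumPy; the vectors here are arbitrary toy counts, not real data:

```python
import numpy as np

# Two toy count vectors of equal length (made-up values)
x = np.array([1.0, 2.0, 0.0, 3.0])
y = np.array([2.0, 1.0, 0.0, 3.0])

# Dot (inner) product: element-wise products summed into one scalar
dot = np.dot(x, y)

# Cosine similarity: dot product divided by the product of the
# vectors' lengths (norms), |x||y|
cos = dot / (np.linalg.norm(x) * np.linalg.norm(y))
```

Because both vectors are non-negative counts, `cos` falls in [0,1]; for vectors with negative components it would range over [-1,1].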
6. Why do we need vector representations of text?
for semantic similarity
for document retrieval
clustering/classification algorithms operate on vectors
7. How to deal with the high dimensionality and sparsity of count-based matrices
Dimensionality reduction to the rescue:
- find the most important dimensions of the dataset
- Singular Value Decomposition (SVD)
- low-rank approximation
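The SVD-based reduction above can be sketched as follows, using a made-up word-context count matrix (the numbers and the choice of k are illustrative only):

```python
import numpy as np

# Toy word-context count matrix: rows are words, columns are contexts
M = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
])

# Full SVD factorisation: M = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values -> best rank-k approximation
k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Dense k-dimensional word vectors: scaled left singular vectors
word_vecs = U[:, :k] * S[:k]
```

Each row of `word_vecs` is now a dense k-dimensional representation of a word, replacing its sparse high-dimensional count row in `M`.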
8. Limitations of word vectors
polysemy: a single vector conflates all senses of a word
antonyms: words appearing in similar contexts may be synonyms or antonyms, and the two are hard to distinguish
compositionality: hard to obtain the meaning of a sequence of words
9. Evaluation of word vectors
word similarity
improved performance in a downstream task
10. Evaluation of document vectors
Document similarity
information retrieval
text classification
plagiarism detection
11. Limitations of document vectors
word order is ignored