An introduction to NLTK
NLTK: Natural Language Toolkit
Open source library in Python
>>> import nltk
Frequency of words
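>>> from nltk.book import text7  # assumption: text7 is the sample text from nltk.book
>>> dist = nltk.FreqDist(text7)
>>> len(dist)  # number of distinct tokens
>>> vocab1 = dist.keys()
>>> freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]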
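As a rough illustration of the same pattern on data small enough to check by hand, the sketch below builds a frequency distribution over a made-up token list. The token list and the thresholds (length > 3, count > 1) are assumptions chosen for this toy input, not values from the text7 example.

```python
import nltk

# Hypothetical toy input; the example above uses text7 instead.
tokens = "the market fell and the market rose and the traders watched".split()

dist = nltk.FreqDist(tokens)   # maps each distinct token to its count
vocab = dist.keys()            # the distinct tokens

# Keep words longer than 3 characters that occur more than once.
freqwords = [w for w in vocab if len(w) > 3 and dist[w] > 1]

print(len(dist))     # number of distinct tokens
print(dist["the"])   # count of a single token
print(freqwords)
```

FreqDist behaves like a dictionary of counts, so the list comprehension can filter on word length and frequency at the same time.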
Normalization and Stemming
Different forms of the same ‘word’
>>> input1 = 'List listed lists listing listings'
>>> words1 = input1.lower().split(' ')
Stemming finds the root form (stem) of a given word
>>> porter = nltk.PorterStemmer()
>>> [porter.stem(t) for t in words1]
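Put together as a standalone script, the normalization and stemming steps above might look like this. The variable name stems is introduced here for illustration.

```python
import nltk

input1 = "List listed lists listing listings"
words1 = input1.lower().split(" ")   # normalize case, then split on spaces

porter = nltk.PorterStemmer()
stems = [porter.stem(t) for t in words1]

print(stems)   # all five forms reduce to the same stem, 'list'
```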
Lemmatization
Goal: the words that come out should be actual, meaningful words.
>>> udhr = nltk.corpus.udhr.words('English-Latin1')
udhr: the Universal Declaration of Human Rights corpus.
Lemmatization: like stemming, but the resulting stems are all valid words
>>> WNlemma = nltk.WordNetLemmatizer()
>>> [WNlemma.lemmatize(t) for t in udhr[:20]]
Tokenization
Recall splitting a sentence into words/tokens
NLTK has a built-in tokenizer
>>> nltk.word_tokenize(text11)
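To make the difference from a plain split concrete, the sketch below tokenizes one sentence both ways. It uses nltk.tokenize.TreebankWordTokenizer, a rule-based tokenizer in NLTK, as a stand-in for nltk.word_tokenize so that it runs without downloading tokenizer models. That substitution is an assumption made for illustration; the sample sentence is the one assigned to text11 later in these notes.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "Children shouldn't drink a sugary drink before bed."

# Naive approach: split on spaces; punctuation stays glued to the words.
naive = text.split(" ")

# Rule-based tokenizer: separates the final period and splits the contraction.
tokens = TreebankWordTokenizer().tokenize(text)

print(naive)
print(tokens)
```

The plain split keeps "bed." as one token, while the tokenizer treats the period and the "n't" of the contraction as tokens of their own.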
Sentence splitting
How would you split a long text string into sentences?
Periods do not always end a sentence, e.g. $2.99, U.S.
NLTK has a built-in sentence splitter too!
>>> sentences = nltk.sent_tokenize(text12)
>>> len(sentences)
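To see why a dedicated sentence splitter helps, a naive rule can be sketched with the standard library alone. The sample text below is made up for illustration, and the regular expression is an assumed "obvious" rule, not what nltk.sent_tokenize does internally.

```python
import re

# Made-up sample text containing an abbreviation and a price.
text = ("This is the first sentence. A gallon of milk in the U.S. "
        "costs $2.99. Is this the third sentence? Yes, it is!")

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for s in naive_sentences:
    print(repr(s))
```

The naive rule yields five pieces instead of four because it cuts after "U.S."; the decimal point in $2.99 survives only because no whitespace follows it. Cases like these are what a trained splitter is meant to handle.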
NLP Tasks
Counting words, counting frequency of words
Finding sentence boundaries
Part of speech tagging
Parsing the sentence structure
Identifying semantic roles
Named entity recognition
Co-reference and pronoun resolution
Part-of-speech (POS) Tagging
>>> import nltk
>>> nltk.help.upenn_tagset('MD')
POS tagging with NLTK
Recall splitting a sentence into words/tokens
>>> text11 = "Children shouldn't drink a sugary drink before bed."
>>> text13 = nltk.word_tokenize(text11)  # NLTK's tokenizer
>>> nltk.pos_tag(text13)
Ambiguity in POS Tagging
Ambiguity is common in English
Parsing Sentence Structure
Making sense of sentences is easy if they follow a well-defined grammatical structure
Ambiguity may exist even if sentences are grammatically correct!
Ambiguity in Parsing
>>> text16 = nltk.word_tokenize('I saw the man with a telescope')
>>> grammar1 = nltk.data.load('mygrammar1.cfg')
>>> grammar1
<Grammar with 13 productions>
>>> parser = nltk.ChartParser(grammar1)
>>> trees = parser.parse_all(text16)
>>> for tree in trees:
...     print(tree)
NLTK and Parse Tree Collection
>>> from nltk.corpus import treebank
>>> text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> print(text17)
POS Tagging & Parsing Complexity
Uncommon usages of words
>>> text18 = nltk.word_tokenize('The old man the boat')
>>> nltk.pos_tag(text18)
[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]
Not correct: here 'old' is a noun and 'man' is a verb.
Well-formed sentences may still be meaningless
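One way to see why an uncommon usage such as the verb "man" in "The old man the boat" gets mis-tagged is a tagger that only remembers each word's most frequent tag. The sketch below trains nltk.UnigramTagger on a tiny hand-made corpus. The training sentences are invented for illustration, and this is not the model behind nltk.pos_tag.

```python
import nltk

# Hypothetical training data: 'man' appears as a noun twice and as a verb once.
train_sents = [
    [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("man", "NN"), ("left", "VBD")],
    [("they", "PRP"), ("man", "VBP"), ("the", "DT"), ("boat", "NN")],
]

# A unigram tagger assigns each word the tag it carried most often in training.
tagger = nltk.UnigramTagger(train_sents)

tagged = tagger.tag("the old man the boat".split())
print(tagged)
```

Because "man" is most often a noun in the training data, the tagger labels it NN here as well, even though the sentence only makes sense with "man" as the verb.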
Take Home Concepts
POS tagging provides insights into the word classes/types in a sentence.
Parsing the grammatical structures helps derive meaning.
Both tasks are difficult; linguistic ambiguity increases the difficulty even more.
NLTK provides access to tools and data for training.