More NLTK and Corpus Tools
n-grams
Conditional frequency distribution: by preceding word
What are the most common words following ‘shall’?
Treat ‘shall’ as the condition and count which words follow it: a conditional frequency distribution.
Stats can be compiled from a list of bigrams (w1, w2).
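As a sketch of the idea in plain Python (illustrative only; the NLTK equivalent, `nltk.ConditionalFreqDist`, appears in the code below): conditional counts can be accumulated from a list of (w1, w2) bigrams with a dict of Counters, where w1 is the condition.

```python
from collections import Counter, defaultdict

words = "we shall not fail and we shall not falter".split()
bigrams = list(zip(words, words[1:]))   # (w1, w2) pairs

# condition on w1: each preceding word maps to a Counter of following words
cfd = defaultdict(Counter)
for w1, w2 in bigrams:
    cfd[w1][w2] += 1

print(cfd["shall"].most_common())       # most common words after 'shall'
```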
Conditional frequency distribution: count per year
Are words such as ‘freedom’, ‘liberty’, ‘god’ more frequent or less over time?
We will try out NLTK’s book chapter on the Inaugural corpus: http://www.nltk.org/book/ch02.html#inaugural-address-corpus
Plotting/visualization
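The per-year idea can be sketched without the corpus files (a toy stand-in here, since the real recipe in the NLTK book chapter builds `nltk.ConditionalFreqDist` from `(target, fileid[:4])` pairs and calls `.plot()` to visualize the trend): condition on the target word, count per year.

```python
from collections import Counter, defaultdict

# Toy stand-in for the Inaugural corpus: year -> tokenized speech (made-up data)
speeches = {
    1789: "liberty and freedom under god".split(),
    1861: "freedom freedom liberty".split(),
    1961: "liberty god".split(),
}

targets = ["freedom", "liberty", "god"]

# condition = target word; value = Counter over years
cfd = defaultdict(Counter)
for year, tokens in speeches.items():
    for w in tokens:
        if w in targets:
            cfd[w][year] += 1

for w in targets:
    print(w, sorted(cfd[w].items()))
```

With NLTK and the real corpus installed, the analogous `ConditionalFreqDist` object has a `.plot()` method that draws one line per condition across the years.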
nltk.Text object and other corpus tools
NLTK’s Text object class provides a concordancer and other classic corpus tools
A Text object can be built from a token list
exercise:
# In[1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/lichun/Desktop/inaugural' # Use your own path; Windows paths start with a drive letter such as C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt') # all files ending in 'txt'
# In[2]:
get_ipython().run_line_magic('pprint', '')
inaug.fileids()
# In[3]:
print(inaug.words()[:50])
# In[4]:
chom = 'colorless green ideas sleep furiously'.split()
chom
# In[5]:
nltk.bigrams(chom)
# function returns a "generator" object: it is memory-efficient but won't let us take a peek
# In[6]:
# generator object works well in a loop environment
for x in nltk.bigrams(chom):
    print(x)
# In[7]:
# Force it into a list type
list(nltk.bigrams(chom))
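A plain-Python sketch of the same pairing (not NLTK's actual implementation, but the same idea): `zip` the token list against itself shifted by one. `zip` returns a lazy iterator, which also demonstrates why a generator can only be consumed once.

```python
chom = 'colorless green ideas sleep furiously'.split()

# zip produces a lazy iterator, much like nltk.bigrams
pairs = zip(chom, chom[1:])
print(list(pairs))   # [('colorless', 'green'), ('green', 'ideas'), ...]

# like any iterator, it is exhausted after one pass
print(list(pairs))   # [] -- a second list() call yields nothing
```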
# In[8]:
# trigram function also available
list(nltk.trigrams(chom))
# In[9]:
# let's build a bigram list of the entire inaugural corpus
inaug_bigrams = list(nltk.bigrams(inaug.words()))
inaug_bigrams[:10]
# In[10]:
# last 10 bigrams
inaug_bigrams[-10:]
# In[11]:
# What are the most frequent bigrams?
inaug_bigrams_fd = nltk.FreqDist(inaug_bigrams)
inaug_bigrams_fd.most_common(30)
# In[12]:
inaug_bigrams_fd[('of', 'the')]
# In[13]:
# What functions are available with this object?
dir(inaug_bigrams_fd)
# In[14]:
# over 1% of all bigrams are 'of the'!
inaug_bigrams_fd.freq(('of', 'the'))
# In[15]:
# cfd is built from bigrams: a list of (w1, w2)
inaug_bigrams_cfd = nltk.ConditionalFreqDist(inaug_bigrams)
# In[16]:
# 'shall' as the w1 condition. Value is a FreqDist!
inaug_bigrams_cfd['shall']
# In[17]:
inaug_bigrams_cfd['shall']['not']
# In[18]:
# total count of 'shall'
inaug_bigrams_cfd['shall'].N()
# In[19]:
# likelihood of 'not' following 'shall'
inaug_bigrams_cfd['shall'].freq('not')
# In[20]:
inaug_bigrams_cfd['shall'].most_common(10)
# In[21]:
inaug_Text = nltk.Text(inaug.words())
inaug_Text.concordance("shall")
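The core of a concordancer is simple to sketch in plain Python (a toy illustration, not NLTK's implementation): find each occurrence of the keyword and show a window of tokens on either side.

```python
def concordance(tokens, keyword, window=3):
    """Return one 'left [hit] right' line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

toks = "we shall fight and we shall never surrender".split()
for line in concordance(toks, "shall"):
    print(line)
```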
# In[25]:
help(inaug_Text.concordance)
# In[26]:
# What other handy functions are available?
dir(inaug_Text)
# In[27]:
# More info on the method. Doesn't say what stats are used...
help(inaug_Text.collocations)
# In[28]:
# common context (surrounding words) shared by a list of words
inaug_Text.common_contexts(['shall', 'will'])
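The idea behind `common_contexts` can be sketched in plain Python (a rough toy, not NLTK's actual method): collect the (previous, next) word pairs around each target word, then intersect the sets.

```python
def contexts(tokens, word):
    """Set of (previous, next) word pairs surrounding each occurrence of word."""
    return {(tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1) if tokens[i] == word}

toks = "we shall go and we will go but they shall stay".split()
shared = contexts(toks, "shall") & contexts(toks, "will")
print(shared)   # contexts shared by 'shall' and 'will'
```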