Day 2: NLTK, processing a text file
"C-Span Inaugural Address Corpus", available on NLTK's corpora page: http://www.nltk.org/nltk_data/
Using NLTK
Processing a single text file
Reading in a text file
Tokenize text, compile frequency count
Average sentence length, frequency of long words
Exercise:
# In[4]:
import nltk
# In[9]:
nltk.download('popular')
# In[10]:
# Tokenizing function: turns a text (a single string) into a list of word & symbol tokens
greet = "Hello, world!"
nltk.word_tokenize(greet)
# In[11]:
help(nltk.word_tokenize)
# In[12]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)
# In[13]:
# First "Rose" is capitalized. How to lowercase?
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)
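The lowercasing question above is answered by calling .lower() on the string before tokenizing. A minimal sketch using plain string methods instead of NLTK's tokenizer (so punctuation stays attached to the last token):

```python
sent = 'Rose is a rose is a rose is a rose.'
toks = sent.lower().split()  # crude whitespace split, not nltk.word_tokenize
print(toks.count('rose'))    # counts 3: the final 'rose.' keeps its period
```

With nltk.word_tokenize the period would be split off as its own token, so the count would be 4.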
# In[14]:
freq = nltk.FreqDist(toks)
freq
# In[16]:
freq.most_common()
# In[18]:
freq['rose']
# In[19]:
len(freq)
# In[20]:
freq.keys()
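FreqDist behaves much like the standard library's collections.Counter. The cells above can be sketched without NLTK as follows (a rough stand-in, not NLTK's implementation):

```python
from collections import Counter

# Lowercased tokens of "Rose is a rose is a rose is a rose ."
toks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
freq = Counter(toks)
print(freq['rose'])          # 4
print(len(freq))             # 4 unique tokens: rose, is, a, .
print(freq.most_common(2))   # most frequent tokens first
```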
# In[22]:
myfile = '/Users/lichun/Desktop/inaugural/1789-Washington.txt' # Use your own path; on Windows, start with a drive letter such as C:
wtxt = open(myfile).read()
print(wtxt)
# In[24]:
len(wtxt) # Number of characters in text
# In[25]:
'American' in wtxt # substring test; next, try the plural "Americans"
# In[26]:
'Americans' in wtxt # substring test with the plural
# In[27]:
'th' in wtxt
# In[28]:
# Turn off/on pretty printing (prints too many lines)
get_ipython().run_line_magic('pprint', '')
# In[29]:
# Tokenize text
nltk.word_tokenize(wtxt)
# In[30]:
wtokens = nltk.word_tokenize(wtxt.lower())
len(wtokens) # Number of tokens (words and symbols) in text
# In[31]:
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']
# In[32]:
'Fellow-Citizens' in wfreq
# In[33]:
len(wfreq) # Number of unique words in text
# In[34]:
wfreq.most_common(30) # 30 most common words
# In[35]:
# dir() prints out all functions defined on the type of object.
dir(wfreq)
# In[36]:
# Hmm. Wonder what .freq does... let's find out.
help(wfreq.freq)
# In[37]:
wfreq.freq('the')
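.freq() returns the relative frequency of a word: its count divided by the total number of tokens. The same computation, sketched with collections.Counter on a hand-made token list (illustration only, not NLTK):

```python
from collections import Counter

toks = ['the', 'cat', 'sat', 'on', 'the', 'mat']
freq = Counter(toks)
# relative frequency = count of the word / total number of tokens
rel = freq['the'] / sum(freq.values())
print(rel)  # 2/6 ≈ 0.33
```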
# In[38]:
len(wfreq.hapaxes())
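hapaxes() returns the words that occur exactly once in the text. The same idea, sketched with a list comprehension over a Counter (not NLTK's implementation):

```python
from collections import Counter

toks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
freq = Counter(toks)
hapaxes = [w for w, c in freq.items() if c == 1]  # tokens with count exactly 1
print(hapaxes)  # only '.' occurs once here
```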
# In[41]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!'] # Assuming every sentence ends with ., ! or ?
print(sentcount)
# In[42]:
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]
# In[44]:
wtokens_nosym = [t for t in wtokens if t.isalnum()] # alpha-numeric tokens only
len(wtokens_nosym)
# In[45]:
# Try "n't", "20th", "."
"n't".isalnum()
# In[46]:
# Try "n't", "20th", "."
"20th".isalnum()
# In[47]:
# First 50 tokens, alpha-numeric tokens only:
wtokens_nosym[:50]
# In[48]:
len(wtokens_nosym)/sentcount # Average sentence length in number of words
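Putting the pieces together: sentence count from end-of-sentence punctuation, word count from the alphanumeric tokens, and their ratio as average sentence length. A tiny self-contained sketch on a hypothetical token list (standing in for wtokens):

```python
toks = ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
sentcount = toks.count('.') + toks.count('?') + toks.count('!')  # 2 sentences
words = [t for t in toks if t.isalnum()]                         # 5 word tokens
print(len(words) / sentcount)  # 2.5 words per sentence
```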
# In[49]:
[w for w in wfreq if len(w) >= 13] # all 13+ character words
# In[50]:
long = [w for w in wfreq if len(w) >= 13]
# sort long alphabetically using sorted()
for w in sorted(long):
    print(w, len(w), wfreq[w])