Day 2: NLTK, processing a text file
"C-Span Inaugural Address Corpus", available on NLTK's corpora page: http://www.nltk.org/nltk_data/
Using NLTK
Processing a single text file
Reading in a text file
Tokenize text, compile frequency count
Average sentence length, frequency of long words
Exercise:
# In[4]:
import nltk
# In[9]:
nltk.download('popular')
# In[10]:
# Tokenizing function: turns a text (a single string) into a list of word & symbol tokens
greet = "Hello, world!"
nltk.word_tokenize(greet)
# In[11]:
help(nltk.word_tokenize)
# In[12]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)
# In[13]:
# First "Rose" is capitalized. How to lowercase?
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)
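The lowercasing question above is answered by calling .lower() on the string before tokenizing. A minimal sketch using plain string methods instead of NLTK's tokenizer (so punctuation stays attached to the last token):

```python
sent = 'Rose is a rose is a rose is a rose.'
toks = sent.lower().split()  # crude whitespace split, not nltk.word_tokenize
print(toks.count('rose'))    # counts 3: the final 'rose.' keeps its period
```

With nltk.word_tokenize the period would be split off as its own token, so the count would be 4.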
# In[14]:
freq = nltk.FreqDist(toks)
freq
# In[16]:
freq.most_common()
# In[18]:
freq['rose']
# In[19]:
len(freq)
# In[20]:
freq.keys()
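FreqDist behaves much like the standard library's collections.Counter. The cells above can be sketched without NLTK as follows (a rough stand-in, not NLTK's implementation):

```python
from collections import Counter

# Lowercased tokens of "Rose is a rose is a rose is a rose ."
toks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
freq = Counter(toks)
print(freq['rose'])          # 4
print(len(freq))             # 4 unique tokens: rose, is, a, .
print(freq.most_common(2))   # most frequent tokens first
```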
# In[22]:
myfile = '/Users/lichun/Desktop/inaugural/1789-Washington.txt' # Use your own path; on Windows, start with a drive letter such as C:
wtxt = open(myfile).read()
print(wtxt)
# In[24]:
len(wtxt) # Number of characters in text
# In[25]:
'American' in wtxt # substring test; next, try the plural "Americans"
# In[26]:
'Americans' in wtxt # substring test with the plural
# In[27]:
'th' in wtxt
# In[28]:
# Turn off/on pretty printing (prints too many lines)
get_ipython().run_line_magic('pprint', '')
# In[29]:
# Tokenize text
nltk.word_tokenize(wtxt)
# In[30]:
wtokens = nltk.word_tokenize(wtxt.lower())
len(wtokens) # Number of tokens (words and symbols) in text
# In[31]:
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']
# In[32]:
'Fellow-Citizens' in wfreq
# In[33]:
len(wfreq) # Number of unique words in text
# In[34]:
wfreq.most_common(30) # 30 most common words
# In[35]:
# dir() prints out all functions defined on the type of object.
dir(wfreq)
# In[36]:
# Hmm. Wonder what .freq does... let's find out.
help(wfreq.freq)
# In[37]:
wfreq.freq('the')
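.freq() returns the relative frequency of a word: its count divided by the total number of tokens. The same computation, sketched with collections.Counter on a hand-made token list (illustration only, not NLTK):

```python
from collections import Counter

toks = ['the', 'cat', 'sat', 'on', 'the', 'mat']
freq = Counter(toks)
# relative frequency = count of the word / total number of tokens
rel = freq['the'] / sum(freq.values())
print(rel)  # 2/6 ≈ 0.33
```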
# In[38]:
len(wfreq.hapaxes())
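hapaxes() returns the words that occur exactly once in the text. The same idea, sketched with a list comprehension over a Counter (not NLTK's implementation):

```python
from collections import Counter

toks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
freq = Counter(toks)
hapaxes = [w for w, c in freq.items() if c == 1]  # tokens with count exactly 1
print(hapaxes)  # only '.' occurs once here
```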
# In[41]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!'] # Assuming every sentence ends with ., ! or ?
print(sentcount)
# In[42]:
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]
# In[44]:
wtokens_nosym = [t for t in wtokens if t.isalnum()] # alpha-numeric tokens only
len(wtokens_nosym)
# In[45]:
# Try "n't", "20th", "."
"n't".isalnum()
# In[46]:
# Try "n't", "20th", "."
"20th".isalnum()
# In[47]:
# First 50 tokens, alpha-numeric tokens only:
wtokens_nosym[:50]
# In[48]:
len(wtokens_nosym)/sentcount # Average sentence length in number of words
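Putting the pieces together: sentence count from end-of-sentence punctuation, word count from the alphanumeric tokens, and their ratio as average sentence length. A tiny self-contained sketch on a hypothetical token list (standing in for wtokens):

```python
toks = ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
sentcount = toks.count('.') + toks.count('?') + toks.count('!')  # 2 sentences
words = [t for t in toks if t.isalnum()]                         # 5 word tokens
print(len(words) / sentcount)  # 2.5 words per sentence
```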
# In[49]:
[w for w in wfreq if len(w) >= 13] # all 13+ character words
# In[50]:
long = [w for w in wfreq if len(w) >= 13]
# sort long alphabetically using sorted()
for w in sorted(long):
    print(w, len(w), wfreq[w])