More NLTK and Corpus Tools
n-grams
Conditional frequency distribution: by preceding word
What are the most common words following ‘shall’?
Treat ‘shall’ as the condition and count which words follow it: a conditional frequency distribution.
Stats can be compiled from a list of bigrams (w1, w2).
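As a sketch of the idea in plain Python (illustrative only; the NLTK equivalent, `nltk.ConditionalFreqDist`, appears in the code below): conditional counts can be accumulated from a list of (w1, w2) bigrams with a dict of Counters, where w1 is the condition.

```python
from collections import Counter, defaultdict

words = "we shall not fail and we shall not falter".split()
bigrams = list(zip(words, words[1:]))   # (w1, w2) pairs

# condition on w1: each preceding word maps to a Counter of following words
cfd = defaultdict(Counter)
for w1, w2 in bigrams:
    cfd[w1][w2] += 1

print(cfd["shall"].most_common())       # most common words after 'shall'
```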
Conditional frequency distribution: count per year
Are words such as ‘freedom’, ‘liberty’, ‘god’ more frequent or less over time?
We will try out NLTK’s book chapter on the Inaugural corpus: http://www.nltk.org/book/ch02.html#inaugural-address-corpus
Plotting/visualization
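The per-year idea can be sketched without the corpus files (a toy stand-in here, since the real recipe in the NLTK book chapter builds `nltk.ConditionalFreqDist` from `(target, fileid[:4])` pairs and calls `.plot()` to visualize the trend): condition on the target word, count per year.

```python
from collections import Counter, defaultdict

# Toy stand-in for the Inaugural corpus: year -> tokenized speech (made-up data)
speeches = {
    1789: "liberty and freedom under god".split(),
    1861: "freedom freedom liberty".split(),
    1961: "liberty god".split(),
}

targets = ["freedom", "liberty", "god"]

# condition = target word; value = Counter over years
cfd = defaultdict(Counter)
for year, tokens in speeches.items():
    for w in tokens:
        if w in targets:
            cfd[w][year] += 1

for w in targets:
    print(w, sorted(cfd[w].items()))
```

With NLTK and the real corpus installed, the analogous `ConditionalFreqDist` object has a `.plot()` method that draws one line per condition across the years.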
nltk.Text object and other corpus tools
NLTK’s Text object class provides a concordancer and other classic corpus tools
A Text object can be built from a token list
exercise:
# In[1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/lichun/Desktop/inaugural' # Use your own path; Windows paths start with a drive letter such as C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt') # all files ending in 'txt'
# In[2]:
get_ipython().run_line_magic('pprint', '')
inaug.fileids()
# In[3]:
print(inaug.words()[:50])
# In[4]:
chom = 'colorless green ideas sleep furiously'.split()
chom
# In[5]:
nltk.bigrams(chom)
# function returns a "generator" object: it is memory-efficient but won't let us take a peek
# In[6]:
# generator object works well in a loop environment
for x in nltk.bigrams(chom):
    print(x)
# In[7]:
# Force it into a list type
list(nltk.bigrams(chom))
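A plain-Python sketch of the same pairing (not NLTK's actual implementation, but the same idea): `zip` the token list against itself shifted by one. `zip` returns a lazy iterator, which also demonstrates why a generator can only be consumed once.

```python
chom = 'colorless green ideas sleep furiously'.split()

# zip produces a lazy iterator, much like nltk.bigrams
pairs = zip(chom, chom[1:])
print(list(pairs))   # [('colorless', 'green'), ('green', 'ideas'), ...]

# like any iterator, it is exhausted after one pass
print(list(pairs))   # [] -- a second list() call yields nothing
```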
# In[8]:
# trigram function also available
list(nltk.trigrams(chom))
# In[9]:
# let's build a bigram list of the entire inaugural corpus
inaug_bigrams = list(nltk.bigrams(inaug.words()))
inaug_bigrams[:10]
# In[10]:
# last 10 bigrams
inaug_bigrams[-10:]
# In[11]:
# What are the most frequent bigrams?
inaug_bigrams_fd = nltk.FreqDist(inaug_bigrams)
inaug_bigrams_fd.most_common(30)
# In[12]:
inaug_bigrams_fd[('of', 'the')]
# In[13]:
# What functions are available with this object?
dir(inaug_bigrams_fd)
# In[14]:
# over 1% of all bigrams are 'of the'!
inaug_bigrams_fd.freq(('of', 'the'))
# In[15]:
# cfd is built from bigrams: a list of (w1, w2)
inaug_bigrams_cfd = nltk.ConditionalFreqDist(inaug_bigrams)
# In[16]:
# 'shall' as the w1 condition. Value is a FreqDist!
inaug_bigrams_cfd['shall']
# In[17]:
inaug_bigrams_cfd['shall']['not']
# In[18]:
# total count of 'shall'
inaug_bigrams_cfd['shall'].N()
# In[19]:
# likelihood of 'not' following 'shall'
inaug_bigrams_cfd['shall'].freq('not')
# In[20]:
inaug_bigrams_cfd['shall'].most_common(10)
# In[21]:
inaug_Text = nltk.Text(inaug.words())
inaug_Text.concordance("shall")
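The core of a concordancer is simple to sketch in plain Python (a toy illustration, not NLTK's implementation): find each occurrence of the keyword and show a window of tokens on either side.

```python
def concordance(tokens, keyword, window=3):
    """Return one 'left [hit] right' line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

toks = "we shall fight and we shall never surrender".split()
for line in concordance(toks, "shall"):
    print(line)
```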
# In[25]:
help(inaug_Text.concordance)
# In[26]:
# What other handy functions are available?
dir(inaug_Text)
# In[27]:
# More info on the method. Doesn't say what stats are used...
help(inaug_Text.collocations)
# In[28]:
# common context (surrounding words) shared by a list of words
inaug_Text.common_contexts(['shall', 'will'])
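The idea behind `common_contexts` can be sketched in plain Python (a rough toy, not NLTK's actual method): collect the (previous, next) word pairs around each target word, then intersect the sets.

```python
def contexts(tokens, word):
    """Set of (previous, next) word pairs surrounding each occurrence of word."""
    return {(tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1) if tokens[i] == word}

toks = "we shall go and we will go but they shall stay".split()
shared = contexts(toks, "shall") & contexts(toks, "will")
print(shared)   # contexts shared by 'shall' and 'will'
```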