NASSLLI 2018 Corpus Linguistics (Day 5)

Advanced processing: lemmatization
NLTK’s WordNet lemmatizer
It works well for nouns. Verbs are trickier: the default POS is 'noun', so verbs must be explicitly specified as such.
For a better, knowledge-rich, context-aware solution, you may need to venture outside Python/NLTK and try a full-scale NLP suite such as Stanford CoreNLP.

Advanced processing: POS tagging
nltk.pos_tag is NLTK’s default POS tagger.
Default tagset is the Penn Treebank (‘wsj’) tagset.
A word of warning: it is not state-of-the-art, as it was trained on limited data.

Bring Your Own Corpora: Treebanks
Treebanks are syntactically annotated sentences.
They are used in training POS-taggers and syntactic parsers.
NLTK includes a sample section of the Penn English Treebank (3,914 sentences, about 10% of the full corpus).
For more details on treebanks and how to interact with tree structures, see the NLTK book's chapter on analyzing sentence structure.

Non-English Treebanks
A sample of the Sinica Treebank (Chinese) is available as part of NLTK's data.
You will need to download it first.

Exercise:

# In[1]:


import nltk
wnl = nltk.WordNetLemmatizer()   # initialize a lemmatizer


# In[2]:


# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('cats')


# In[3]:


# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('geese')


# In[4]:


# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('walks')


# In[5]:


# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('walked')


# In[6]:


# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('walking')


# In[7]:


wnl.lemmatize('walking', 'v')


# In[8]:


# From this page: http://www.pitt.edu/~naraehan/python3/text-samples.txt
moby = """Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation."""


# In[9]:


get_ipython().run_line_magic('pprint', '')
nltk.word_tokenize(moby)


# In[10]:


[wnl.lemmatize(t) for t in nltk.word_tokenize(moby)]
# The output isn't very intelligent: without the correct POS for each token,
#   verbs are left unlemmatized. Any way to identify verbs?


# In[11]:


chom = 'colorless green ideas sleep furiously'.split()
chom


# In[12]:


nltk.pos_tag(chom)


# In[13]:


nltk.pos_tag(nltk.word_tokenize(moby))


# In[14]:


help(nltk.pos_tag)


# In[16]:


from nltk.corpus import treebank
treebank.words()


# In[17]:


treebank.sents()


# In[18]:


treebank.tagged_sents()


# In[19]:


treebank.parsed_sents()


# In[20]:


t = treebank.parsed_sents()[0]


# In[21]:


# Trees are composed of subtrees, each of which is itself a Tree. 
print(t)


# In[22]:


# Opens up a new window. Close it before moving to next cell. 
t.draw()


# In[23]:


# "said" is a verb (VBD) that takes a clausal complement (S). 
#   The nodes are children of a VP node. 
print(treebank.parsed_sents()[7])


# In[24]:


# myfilter returns True/False: is the current Tree a VP node with an S child? 
# You can define your own functions with the def keyword. 

def myfilter(tree):
    child_nodes = [child.label() for child in tree if isinstance(child, nltk.Tree)]
    return  (tree.label() == 'VP') and ('S' in child_nodes)


# In[25]:


# For every full tree in the Treebank, recurse through its subtrees, 
#    keeping only those that match the configuration. 
# Searching the first 50 sentences only: remove [:50] for a full search. 

get_ipython().run_line_magic('pprint', '')
[subtree for tree in treebank.parsed_sents()[:50]
             for subtree in tree.subtrees(myfilter)]


# In[26]:


nltk.download('sinica_treebank')


# In[27]:


from nltk.corpus import sinica_treebank as chtb
print(chtb.parsed_sents()[3450])


# In[28]:


chtb.parsed_sents()[3450].draw()    # Opens a new window
