Python and Natural Language Processing: Reading Notes 2

Control structures

if statements
for loops

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print(xyzzy)
...
Call
Ishmael
>>>
>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky: # loop over the words containing 'cie' or 'cei'
...     print(word, end=' ') # end=' ' separates the words with a space instead of a newline
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>
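The same filter-and-sort pattern works on any collection of words. A minimal self-contained sketch (text2 requires the NLTK book data, so a made-up toy vocabulary stands in for set(text2) here):

```python
# Toy vocabulary standing in for set(text2); the words are arbitrary examples.
vocab = {'ceiling', 'deceive', 'ancient', 'science', 'receipt', 'house'}

# Keep only words containing 'cie' or 'cei', then sort alphabetically.
tricky = sorted(w for w in vocab if 'cie' in w or 'cei' in w)
print(' '.join(tricky))  # → ancient ceiling deceive receipt science
```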

Natural Language Understanding

This part is mainly an introduction to:
word sense disambiguation
pronoun resolution
generating language output, including question answering and machine translation

Chapter 2

Source URL
Questions this chapter answers:

What are some useful text corpora and lexical resources, and how can we access them with Python?
Which Python constructs are most helpful for this work?
How do we avoid repeating ourselves when writing Python code?

Task: write a program that computes the average word length, average sentence length, and the average number of times each word appears (the lexical diversity score).

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid)) # raw() returns the file contents before any linguistic processing
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)
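The three statistics can also be checked without downloading the Gutenberg data. A minimal sketch on a toy pair of sentences (the sample data is made up; it only mimics the shapes returned by gutenberg.sents(), words(), and raw()):

```python
# Toy data standing in for gutenberg.sents()/words(); the sentences are made up.
sents = [['Call', 'me', 'Ishmael', '.'],
         ['It', 'was', 'the', 'best', 'of', 'times', '.']]
words = [w for s in sents for w in s]
raw = ' '.join(words)  # rough stand-in for gutenberg.raw(): raw characters, spaces included

num_chars = len(raw)
num_words = len(words)
num_sents = len(sents)
num_vocab = len(set(w.lower() for w in words))

print(int(num_chars / num_words),   # average word length (here including separators)
      int(num_words / num_sents),   # average sentence length in words
      int(num_words / num_vocab))   # lexical diversity: avg occurrences per word type
```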

1. The gutenberg corpus, which represents the Literature genre

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()

Counting word frequencies here works differently from Chapter 1, where we called text1.concordance() directly. The code:

# Method 1: call words() through nltk.corpus
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
# Method 2: import the corpus first
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
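To see what concordance() is doing conceptually, here is a rough pure-Python sketch of the idea (the function, window size, and sample tokens are all invented for illustration; NLTK's real implementation is more elaborate and requires wrapping the word list in nltk.Text first):

```python
def concordance(tokens, target, window=3):
    """Return each occurrence of target with `window` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f'{left} [{tok}] {right}')
    return lines

toks = ['I', 'was', 'not', 'a', 'little', 'surprized', 'to', 'see', 'her']
print('\n'.join(concordance(toks, 'surprized')))  # → not a little [surprized] to see her
```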

Web Text

This section has nothing to do with web crawling; it covers the web text corpora bundled with NLTK.
Source:

NLTK’s small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65], '\n')