Python and Natural Language Processing - Reading Notes 2

Control structures

The if control structure
The for loop

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print(xyzzy)
...
Call
Ishmael
>>>
>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky: # loop over the words containing 'cie' or 'cei'
...     print(word, end=' ') # end=' ' prints the words separated by spaces rather than newlines
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>

Natural language understanding

Mainly an introduction to:
word sense disambiguation
pronoun resolution
generating language output, including question answering and machine translation

Chapter 2

Source URL
The chapter sets out to answer the following questions:

What are some useful text corpora and lexical resources, and how can we access them with Python?
Which Python constructs are most helpful for this work?
How do we avoid repeating ourselves when writing Python code?

Task: write a program that computes the average word length, the average sentence length, and the average number of times each word appears (a lexical diversity score).

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid)) # raw() gives the file contents before any linguistic processing
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)

1. The Gutenberg corpus, representative of the Literature genre

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()

Accessing the corpus and running a concordance: this differs from Chapter 1, where text1.concordance() could be called directly. Code below:

# Method 1: access the corpus via nltk.corpus.gutenberg.words()
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
# Method 2: import the gutenberg corpus, then call it directly
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

Web text

This section has nothing to do with web crawling; it covers the web text collections included in NLTK.
Source:

NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65], '...')
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]

The Brown Corpus

List of genres

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

>>> from nltk.corpus import brown # load the corpus
>>> news_text = brown.words(categories='news') # select the news category and get its words
>>> fdist = nltk.FreqDist(w.lower() for w in news_text) # frequency distribution, case-insensitive (every word lowercased)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will'] # the modal verbs to search for, as a list
>>> for m in modals: # loop over the modal verbs
...     print(m + ':', fdist[m], end=' ') # end=' ' keeps all the results on one line
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Exercise: pick any genre and count the frequencies of the wh-words.

>>> humor_text = brown.words(categories="humor")
>>> fdist = nltk.FreqDist(w.lower() for w in humor_text)
>>> wh_words = ['what', 'when', 'where', 'who', 'why']
>>> for x in wh_words:
...     print(x + ':', fdist[x], end=' ')
...

Extended exercise: count the use of modal verbs in student compositions.

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r"D:\composition" # directory containing the student compositions
>>> scomposition = PlaintextCorpusReader(corpus_root, '.*')
>>> mytest = scomposition.words()
>>> fdist = nltk.FreqDist([w.lower() for w in mytest])
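
To complete the exercise, the same modal-verb loop used above for the Brown news category can be applied to this distribution (a minimal sketch, reusing the fdist just built):

>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')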

Conditional frequency distributions: frequency counts for particular words within particular genres. This is pretty cool!!!

>>> cfd = nltk.ConditionalFreqDist((genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals) # conditions and samples restrict which rows (genres) and columns (words) are tabulated, e.g. conditions=['news', 'religion']
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13

The Reuters Corpus

Corpus overview: news texts, token count, classification scheme. The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”.
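
A minimal sketch of how this corpus can be queried in NLTK; the document id 'training/9865' and the topic 'barley' are illustrative values taken from the book's examples:

>>> from nltk.corpus import reuters
>>> reuters.fileids()        # document ids, split between 'test/...' and 'training/...'
>>> reuters.categories()     # the 90 topic labels
>>> reuters.categories('training/9865')    # topics assigned to a single document
>>> reuters.words(categories='barley')     # words from all documents on a given topic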

Inaugural Address Corpus

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]
>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target)) # w.lower() converts the word to lowercase; startswith() checks whether it begins with one of the two targets above
>>> cfd.plot()

Annotated Text Corpora

http://nltk.org/data (data download)
http://nltk.org/howto

1.9 Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict' # set the location of the corpus directory
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') # read the file list; the second argument is a regular expression (see http://www.nltk.org/book/ch03.html#sec-regular-expressions-word-patterns)
>>> wordlists.fileids() # list the file names
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives') # list the words in one of the files
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

Question: which file formats does this kind of loading support? Can Excel or Word files be used?

A downloaded copy of the Penn Treebank can be loaded with the BracketParseCorpusReader tool:

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj" 
>>> file_pattern = r".*/wsj_.*\.mrg" 
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]

Functions

A function is just a named block of code that performs some well-defined task, as we saw in Chapter 1. A function is usually defined to take some inputs, using special variables known as parameters, and it may produce a result, also known as a return value.

For example, a function that turns a word into its plural form:

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
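
A quick check of the function (these two calls and their results follow directly from the rules above):

>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'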

Modules

Question: I did not fully understand this section.

It makes life a lot easier if you can collect your work into a single place, and access previously defined functions without making copies.

A collection of variable and function definitions in a file is called a Python module. A collection of related modules is called a package.
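
A minimal sketch of what this means in practice (the file name textproc.py is just an example, not prescribed by the book): save the plural() function from above in its own file, and other code can then reuse it without copying it.

# textproc.py  (example module holding the plural() function defined earlier)
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

Then, from the interpreter in the same directory:

>>> from textproc import plural
>>> plural('wish')
'wishes'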

4. Lexical resources

4.1 Wordlist corpora

# Natural Language Toolkit: code_unusual
# Find the words of a text that are not in the standard English wordlist
# (unusual or possibly misspelled words).
import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses',
'accents', 'accepting', 'accommodations', 'accompanied', 'accounted', 'accounts',
'accustomary', 'aches', 'acknowledging', 'acknowledgment', 'acknowledgments', ...]
>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack',
'acros', 'actualy', 'adams', 'adds', 'adduser', 'adjusts', 'adoted', 'adreniline',
'ads', 'adults', 'afe', 'affairs', 'affari', 'affects', 'afk', 'agaibn', 'ages', ...]

The stopwords corpus

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
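
A common use of the stopword list, along the lines of the book's example: compute what fraction of a text's tokens are content words, i.e. not stopwords (a minimal sketch; nltk is assumed to be imported and the reuters corpus downloaded):

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())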

Study the following code.
Names that appear in both the male and female name lists:

>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]

A conditional frequency distribution over the last letter of each name:

>>> cfd = nltk.ConditionalFreqDist(
...           (fileid, name[-1])
...           for fileid in names.fileids()
...           for name in names.words(fileid))
>>> cfd.plot()

4.2 A pronouncing dictionary

Skipped; not read.

4.3 Comparative wordlists

4.4 Shoebox and Toolbox Lexicons

5 WordNet

Further reading

Significant sources of published corpora are the Linguistic Data Consortium (LDC) and the European Language Resources Agency (ELRA). Hundreds of annotated text and speech corpora are available in dozens of languages. Non-commercial licences permit the data to be used in teaching and research. For some corpora, commercial licenses are also available (but for a higher fee).

A good tool for creating annotated text corpora is called Brat, available from http://brat.nlplab.org/.

These and many other language resources have been documented using OLAC Metadata, and can be searched via the OLAC homepage at http://www.language-archives.org/. Corpora List is a mailing list for discussions about corpora, and you can find resources by searching the list archives or posting to the list.

The most complete inventory of the world's languages is Ethnologue, http://www.ethnologue.com/. Of 7,000 languages, only a few dozen have substantial digital resources suitable for use in NLP.

This chapter has touched on the field of Corpus Linguistics. Other useful books in this area include (Biber, Conrad, & Reppen, 1998), (McEnery, 2006), (Meyer, 2002), (Sampson & McCarthy, 2005), (Scott & Tribble, 2006). Further readings in quantitative data analysis in linguistics are: (Baayen, 2008), (Gries, 2009), (Woods, Fletcher, & Hughes, 1986).
