[Natural Language Processing with Python, 2nd Edition] Reading Notes 2: Accessing Text Corpora and Lexical Resources



This chapter deals with large bodies of linguistic data, i.e. corpora.

I. Accessing Text Corpora

1. The Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books.

(1) Listing the file identifiers in the corpus

import nltk
# list the file identifiers in the corpus
print(nltk.corpus.gutenberg.fileids())

Output

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

(2) Word counts and concordance

from nltk.corpus import gutenberg

emma = gutenberg.words('austen-emma.txt')
print(len(emma))

emma = nltk.Text(gutenberg.words('austen-emma.txt'))
# concordance() prints its matches itself (and returns None), so no print() wrapper
emma.concordance("surprize")

Output

192427
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

(3) Text statistics

for fileid in gutenberg.fileids():
	# raw(): the contents of the file, without any linguistic processing
	num_chars = len(gutenberg.raw(fileid))
	num_words = len(gutenberg.words(fileid))
	# sents() divides the text into sentences, each a list of words
	num_sents = len(gutenberg.sents(fileid))
	num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
	print('avg word length:', round(num_chars/num_words), 'avg sentence length:', round(num_words/num_sents), 'avg occurrences per word:', round(num_words/num_vocab), 'from:', fileid)

Output

avg word length: 5 avg sentence length: 25 avg occurrences per word: 26 from: austen-emma.txt
avg word length: 5 avg sentence length: 26 avg occurrences per word: 17 from: austen-persuasion.txt
avg word length: 5 avg sentence length: 28 avg occurrences per word: 22 from: austen-sense.txt
avg word length: 4 avg sentence length: 34 avg occurrences per word: 79 from: bible-kjv.txt
avg word length: 5 avg sentence length: 19 avg occurrences per word: 5 from: blake-poems.txt
avg word length: 4 avg sentence length: 19 avg occurrences per word: 14 from: bryant-stories.txt
avg word length: 4 avg sentence length: 18 avg occurrences per word: 12 from: burgess-busterbrown.txt
avg word length: 4 avg sentence length: 20 avg occurrences per word: 13 from: carroll-alice.txt
avg word length: 5 avg sentence length: 20 avg occurrences per word: 12 from: chesterton-ball.txt
avg word length: 5 avg sentence length: 23 avg occurrences per word: 11 from: chesterton-brown.txt
avg word length: 5 avg sentence length: 18 avg occurrences per word: 11 from: chesterton-thursday.txt
avg word length: 4 avg sentence length: 21 avg occurrences per word: 25 from: edgeworth-parents.txt
avg word length: 5 avg sentence length: 26 avg occurrences per word: 15 from: melville-moby_dick.txt
avg word length: 5 avg sentence length: 52 avg occurrences per word: 11 from: milton-paradise.txt
avg word length: 4 avg sentence length: 12 avg occurrences per word: 9 from: shakespeare-caesar.txt
avg word length: 4 avg sentence length: 12 avg occurrences per word: 8 from: shakespeare-hamlet.txt
avg word length: 4 avg sentence length: 12 avg occurrences per word: 7 from: shakespeare-macbeth.txt
avg word length: 5 avg sentence length: 36 avg occurrences per word: 12 from: whitman-leaves.txt

Three statistics are shown for each text: average word length, average sentence length, and the average number of times each vocabulary item appears in the text (our lexical diversity score). Average word length appears to be a general property of English, coming out as 4 or 5 everywhere here. (In fact the true average word length is one less than shown, because the num_chars variable also counts space characters.) By contrast, average sentence length and lexical diversity look like characteristics of individual authors.
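
A minimal sketch of that correction, reusing the gutenberg reader from the code above: strip whitespace before counting characters.

raw = gutenberg.raw('austen-emma.txt')
num_chars = len([c for c in raw if not c.isspace()])   # drop spaces, tabs, newlines
num_words = len(gutenberg.words('austen-emma.txt'))
print(round(num_chars / num_words, 2))   # noticeably lower than the rounded value above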

2. Web and Chat Text

  • from nltk.corpus import webtext: a small collection of web text
  • from nltk.corpus import nps_chat: a corpus of instant messaging chat sessions
# small collection of web text
from nltk.corpus import webtext
for fileid in webtext.fileids():
	print(fileid, webtext.raw(fileid)[:65], '...')

# instant messaging chat corpus
# 10-19-20s_706posts.xml contains 706 posts collected on 19 October 2006
# from the 20s chat room
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123])

Output

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

3. The Brown Corpus

Created at Brown University in 1961, this was the first million-word electronic corpus of English, containing text from 500 different sources.

(1) First look

Sampling text from each part of the Brown Corpus:

from nltk.corpus import brown
print(brown.categories())

print(brown.words(categories='news'))
print(brown.words(fileids=['cg22']))
print(brown.sents(categories=['news', 'editorial', 'reviews']))

Output

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
[['Assembly', 'session', 'brought', 'much', 'good'], ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.'], ...]

(2) Comparing modal verb usage across genres

Counting modal verbs within a single genre:

import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
	print(m + ':', fdist[m], end=' ')

Output

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Counting the frequency of words of interest across genres:

# conditional frequency distribution over (genre, word) pairs
cfd = nltk.ConditionalFreqDist(
			(genre, word) 
			for genre in brown.categories()
			for word in brown.words(categories=genre))
# the genres we want to display
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
# the words we want to count
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
cfd.plot(conditions=genres, samples=modals)

Output

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 

[Figure: plot of the modal verb counts tabulated above]

4. The Reuters Corpus

10,788 news documents (1.3 million words in total), classified into 90 topics and split into two groups, "training" and "test".

(1) First look

from nltk.corpus import reuters
print(reuters.fileids())
print(reuters.categories())		# topics
print("\n")

Output

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', ..., 'test/21567', 'test/21568', 'test/21570', 'test/21571', 'test/21573', 'test/21574', 'test/21575', 'test/21576', 'training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', ..., 'training/9988', 'training/9989', 'training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', ..., 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']

(2) Looking up topics by fileid, and fileids by topic

The categories in the Reuters Corpus overlap with each other, since a news story often covers multiple topics.

# the topics covered by a given fileid
print(reuters.categories('training/9865'))
print(reuters.categories(['training/9865', 'training/9880']))
print("\n")
# the fileids that belong to a given topic
print(reuters.fileids('barley'))
print(reuters.fileids(['barley', 'corn']))
print("\n")

Output

['barley', 'corn', 'grain', 'wheat']
['barley', 'corn', 'grain', 'money-fx', 'wheat']


['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', ..., 'training/8257', 'training/8759', 'training/9865', 'training/9958']
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648',...,'training/9058', 'training/9093', 'training/9094', 'training/934', 'training/9470', 'training/9521', 'training/9667', 'training/97', 'training/9865', 'training/9958', 'training/9989']

(3) Retrieving words and sentences by document or by category

print(reuters.words('training/9865')[:14])
print(reuters.words(['training/9865', 'training/9880']))
print(reuters.words(categories='barley'))
print(reuters.words(categories=['barley', 'corn']))

Output

['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

5. The Inaugural Address Corpus

(1) First look

import nltk
from nltk.corpus import inaugural

print(inaugural.fileids())
print([fileid[:4] for fileid in inaugural.fileids()])

Output

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', ...,  '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

(2) Conditional frequency distribution plot

cfd = nltk.ConditionalFreqDist(
	(target, fileid[:4])
	for fileid in inaugural.fileids()
	for w in inaugural.words(fileid)
	for target in ['america', 'citizen']
	if w.lower().startswith(target))
cfd.plot()

Output: a conditional frequency distribution plot, counting all words in the Inaugural Address Corpus that begin with america or citizen.
[Figure: counts of america/citizen word forms plotted by year]

6. Annotated Text Corpora
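
The original notes leave this section as a heading only. As a minimal pointer: many NLTK corpora also ship in annotated form, e.g. the POS-tagged version of the Brown Corpus. A small sketch:

import nltk
# the Brown Corpus with part-of-speech annotations
print(nltk.corpus.brown.tagged_words()[:4])
# e.g. [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL')]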

7. Corpora in Other Languages

print(nltk.corpus.cess_esp.words())
print(nltk.corpus.floresta.words())
print(nltk.corpus.indian.words('hindi.pos'))
print(nltk.corpus.udhr.fileids())
print(nltk.corpus.udhr.words('Javanese-Latin1')[11:])

Using a conditional frequency distribution to examine differences in word length across language versions of the Universal Declaration of Human Rights (udhr) corpus:

from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 
		'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
		(lang, len(word))
		for lang in languages
		for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

[Figure: cumulative word-length distributions for the six languages]

8. The Structure of Text Corpora

[Figure: common structures for text corpora]
Common corpus structures (the basic reader methods these corpora share are sketched after the list):

  • isolated: a collection of isolated texts with no particular organization;
  • categorized: texts organized by category;
  • overlapping: categories that overlap, such as topics (the Reuters Corpus);
  • temporal: language use changing over time (the Inaugural Address Corpus).
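
A quick reference sketch of the basic corpus-reader methods used throughout this chapter, shown on the Gutenberg reader:

from nltk.corpus import gutenberg
gutenberg.fileids()                   # the files of the corpus
gutenberg.raw('austen-emma.txt')      # raw text of the file
gutenberg.words('austen-emma.txt')    # the file as a list of words
gutenberg.sents('austen-emma.txt')    # the file as a list of sentences
gutenberg.abspath('austen-emma.txt')  # location of the file on disk
gutenberg.encoding('austen-emma.txt') # encoding of the file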

9. Loading Your Own Corpus

(1) PlaintextCorpusReader

PlaintextCorpusReader is suited to plain text files, e.g. loading the files under corpus_root as a corpus:

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('american-english'))

Output

['README.select-wordlist', 'american-english', 'british-english', 'cracklib-small', 'words', 'words.pre-dictionaries-common']
['A', 'A', "'", 's', 'AMD', 'AMD', "'", 's', 'AOL', ...]

(2) BracketParseCorpusReader

BracketParseCorpusReader is suited to corpora that have already been parsed into bracketed treebank format:

from nltk.corpus import BracketParseCorpusReader

corpus_root = ''  # path to your local copy of the parsed corpus
file_pattern = r".*/wsj_.*\.mrg" # pattern matching the files to load
# initialize the reader with the corpus directory and file pattern (encoding defaults to utf8)
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
print(ptb.fileids())
print(len(ptb.sents()))
print(ptb.sents(fileids='20/wsj_2013.mrg')[19])

II. Conditional Frequency Distributions

1. Conditions and Events

Each pair has the form (condition, event). If we process the whole Brown Corpus by genre, there are 15 conditions (one per genre) and 1,161,192 events (one per word).

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

2. Counting Words by Genre

# build (genre, word) pairs
genre_word = [(genre, word)
				for genre in ['news', 'romance']
				for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:], '\n')

# the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions(), '\n')

print(cfd['news'])
print(cfd['romance'])
print(cfd['romance'].most_common(20))
print(cfd['romance']['could'])

Output

170576
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] 

<ConditionalFreqDist with 2 conditions>
['news', 'romance'] 

<FreqDist with 14394 samples and 100554 outcomes>
<FreqDist with 8452 samples and 70022 outcomes>
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
193

3. Plotting and Tabulating Distributions

from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
			(target, fileid[:4])
			for fileid in inaugural.fileids()
			for w in inaugural.words(fileid)
			for target in ['america', 'citizen']
			if w.lower().startswith(target))
cfd.plot()


from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 
		'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
		(lang, len(word))
		for lang in languages
		for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
		samples=range(10), cumulative=True)
cfd.plot(cumulative=True)

4. Generating Random Text with Bigrams

Building a generation model from bigrams:

def generate_model(cfdist, word, num=15):
	# greedily emit the most likely successor of the current word
	for i in range(num):
		print(word, end=" ")
		word = cfdist[word].max()

text = nltk.corpus.genesis.words("english-kjv.txt")
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
print(cfd)
print(list(cfd))
print(cfd["so"])
print(cfd["living"])

generate_model(cfd, "so")
generate_model(cfd, "living")

Output

<ConditionalFreqDist with 2789 conditions>
['In', 'the', 'beginning', 'God', 'created', 'heaven', 'and', 'earth', '.', 'And', 'was', 'without', 'form', ',', 'void', ';', 'darkness', 'upon', 'face', 'of', 'deep', 'Spirit', 'moved', 'waters', 'said', 'Let', 'there', 'be', 'light', ':', 'saw', 'that', 'it', 'good', 'divided', 'from', 'called', 'Day', 'he', ..., 'embalmed', 'past', 'elders', 'chariots', 'horsemen', 'threshingfloor', 'Atad', 'lamentati', 'floor', 'Egyptia', 'Abelmizraim', 'requite', 'messenger', 'Forgive', 'forgive', 'meant', 'Machir', 'visit', 'coffin']

FreqDist({'that': 8, '.': 7, ',': 4, 'the': 3, 'I': 2, 'doing': 2, 'much': 2, ':': 2, 'did': 1, 'Noah': 1, ...})
FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1})

so that he said , and the land of the land of the land of
living creature that he said , and the land of the land of 
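
The greedy max() choice quickly falls into a loop ("the land of the land of ..."). A small variant sketch that samples the next word in proportion to its bigram frequency avoids this (random.choices needs Python 3.6+; cfd is the distribution built above):

import random

def generate_random(cfdist, word, num=15):
	for i in range(num):
		print(word, end=" ")
		# sample the successor, weighted by bigram frequency
		successors = list(cfdist[word])
		weights = [cfdist[word][s] for s in successors]
		word = random.choices(successors, weights=weights)[0]

generate_random(cfd, "living")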

Common methods of conditional frequency distributions:

  • cfdist = ConditionalFreqDist(pairs): create a conditional frequency distribution from a list of pairs
  • cfdist.conditions(): the conditions, sorted alphabetically
  • cfdist[condition]: the frequency distribution for this condition
  • cfdist[condition][sample]: frequency of the given sample under this condition
  • cfdist.tabulate(): tabulate the conditional frequency distribution
  • cfdist.tabulate(samples, conditions): tabulation limited to the specified samples and conditions
  • cfdist.plot(): graphical plot of the conditional frequency distribution
  • cfdist.plot(samples, conditions): plot limited to the specified samples and conditions
  • cfdist1 < cfdist2: test whether samples in cfdist1 occur less frequently than in cfdist2

III. Code Reuse (Python)

  • Functions and methods
  • Module: a collection of variable and function definitions in a single file; your own functions become reusable by importing that file.
  • Package: a collection of related modules.

Note: when Python imports a module, it first looks in the current directory (folder).
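
A minimal sketch of module reuse (the file name text_proc.py and the plural function are illustrative, not from the original notes):

# text_proc.py -- save this in the current directory
def plural(word):
	# naive pluralization rule, purely for illustration
	if word.endswith('y'):
		return word[:-1] + 'ies'
	return word + 's'

# then, from another script or the interpreter in the same directory:
# from text_proc import plural
# plural('fairy')   # 'fairies'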

IV. Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information. It accompanies a text, and is typically created and enriched with the help of a text.

[Figure: lexicon terminology]
Lexicon terminology: two lexical entries with the same spelling but different meanings (homonyms), each consisting of a headword (also called a lemma) together with additional information such as the part of speech and a gloss.

1. Wordlist Corpora

One wordlist corpus is the Unix file /usr/share/dict/words, used by some spell checkers. We can use it to find unusual or misspelled words in a text corpus.

(1) Filtering a text

This program computes the vocabulary of a text, then removes every item that appears in an existing wordlist, leaving only rare or misspelled words.

def unusual_words(text):
	text_vocab = set(w.lower() for w in text if w.isalpha())
	english_vocab = set(w.lower() for w in nltk.corpus.words.words())
	unusual = text_vocab - english_vocab
	return sorted(unusual)

un_words1 = unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
print(un_words1, '\n')
un_words2 = unusual_words(nltk.corpus.nps_chat.words())
print(un_words2)

Output

['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents', 'accepting', 'accommodations', ..., 'wiping', 'wisest', 'wishes', 'withdrew', 'witnessed', 'witnesses', 'witnessing', 'witticisms', 'wittiest', 'wives', 'women', 'wondered', 'woods', 'words', 'workmen', 'worlds', 'wrapt', 'writes', 'yards', 'years', 'yielded', 'youngest'] 

['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy', ...,'yuuuuuuuuuuuummmmmmmmmmmm', 'yvw', 'yw', 'zebrahead', 'zoloft', 'zyban', 'zzzzzzzing', 'zzzzzzzz']

(2) The stopwords corpus

Stopwords typically carry little lexical content, e.g. the, to, and also.

from nltk.corpus import stopwords

print(stopwords.words('english'))

# compute the fraction of words in the text that are not on the stopword list
def content_fraction(text):
	stopwords = nltk.corpus.stopwords.words('english')
	content = [w for w in text if w.lower() not in stopwords]
	return len(content)/len(text)

frac = content_fraction(nltk.corpus.reuters.words())
print(frac)

Output

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",..., 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
0.735240435097661
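
So roughly a quarter of the Reuters tokens are stopwords. As a small follow-up sketch (same corpora as above), filtering them out before a frequency count brings content words to the top:

import nltk
stop = set(nltk.corpus.stopwords.words('english'))
content = [w.lower() for w in nltk.corpus.reuters.words()
		if w.isalpha() and w.lower() not in stop]
print(nltk.FreqDist(content).most_common(10))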

(3) A letter puzzle

In a grid of randomly chosen letters, form words from the letters in the grid; this puzzle is known as "Target".
[Figure: the letter grid for the puzzle]
Rules:

  • each word must be at least 6 letters long;
  • every word must include the center letter;
  • each letter of the grid may be used only once per word.
import nltk

puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
print([w for w in wordlist if len(w) >= 6
			and obligatory in w
			and nltk.FreqDist(w) <= puzzle_letters])

Output

['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']

FreqDist comparison: this lets us check that the frequency of each letter in a candidate word is less than or equal to that letter's frequency in the puzzle, as the small demo below shows.
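
A tiny demonstration of that subset-style comparison, reusing puzzle_letters from above (the words are illustrative):

print(nltk.FreqDist('rove') <= puzzle_letters)    # True: every letter fits within the grid
print(nltk.FreqDist('rrove') <= puzzle_letters)   # False: 'r' is needed twice but occurs once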

(4) The Names Corpus

Contains 8,000 first names categorized by gender, with male and female names stored in separate files.

names = nltk.corpus.names
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')

# names shared by males and females
print([w for w in male_names if w in female_names])

# conditional frequency distribution over the final letters of male and female names
cfd = nltk.ConditionalFreqDist(
				(fileid, name[-1])
				for fileid in names.fileids()
				for name in names.words(fileid))
cfd.plot()

Output

['female.txt', 'male.txt']
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',..., 'Ted', 'Teddie', 'Teddy', 'Terri', 'Terry', 'Theo', 'Tim', 'Timmie', 'Timmy', 'Tobe', 'Tobie', 'Toby', 'Tommie', 'Tommy', 'Tony', 'Torey', 'Trace', 'Tracey', 'Tracie', 'Tracy', 'Val', 'Vale', 'Valentine', 'Van', 'Vin', 'Vinnie', 'Vinny', 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']

[Figure: conditional frequency distribution of name-final letters]
Most names ending in a, e, or i are female; names ending in h or l occur about equally often for males and females; names ending in k, o, r, s, or t are more likely male.

2. A Pronouncing Dictionary

The CMU Pronouncing Dictionary for US English, designed for use by speech synthesizers.

entries = nltk.corpus.cmudict.entries()
print(len(entries))
for entry in entries[39943:39951]:
	print(entry)

# find three-phone words pronounced P-?-T and print each with its middle phone
for word, pron in entries:
	if len(pron) == 3:
		ph1, ph2, ph3 = pron
		if ph1 == 'P' and ph3 == 'T':
			print(word, ph2, end=' ')

Output

133737
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])

pait EY1
pat AE1
pate EY1
patt AE1
peart ER1
peat IY1
peet IY1
peete IY1
pert ER1
pet EH1
pete IY1
pett EH1
piet IY1
piette IY1
pit IH1
pitt IH1
pot AA1
pote OW1
pott AA1
pout AW1
puett UW1
purt ER1
put UH1
putt AH1

Finding all words whose pronunciation ends like nicks:

syllable = ['N', 'IH0', 'K', 'S']
print([word for word, pron in entries if pron[-4:] == syllable])

Output

["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']

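The digits attached to the phones encode stress (1 primary, 2 secondary, 0 unstressed), so a short sketch, following the book and reusing entries from above, finds words matching a given stress pattern:

def stress(pron):
	# keep only the stress digits attached to each phone
	return [char for phone in pron for char in phone if char.isdigit()]

# words with the stress pattern 0-1-0-2-0, e.g. 'abbreviated'
print([w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']][:10])
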
3. Comparative Wordlists

The Swadesh list of core vocabulary:

from nltk.corpus import swadesh

print(swadesh.fileids())
print(swadesh.words('en'))

# entries(): pass a list of languages to access cognate words across those languages
fr2en = swadesh.entries(['fr', 'en'])
print(fr2en)
translate = dict(fr2en)
print(translate['chien'])
print(translate['jeter'])

de2en = swadesh.entries(['de', 'en'])		# German-English
es2en = swadesh.entries(['es', 'en']) 		# Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))
print(translate['Hund'])
print(translate['perro'])

languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
	print(swadesh.entries(languages)[i])

Output

['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']

['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']

[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('comment', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ... , ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'), ('nom', 'name')]

dog
throw

dog
dog

('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')

V. WordNet

A semantically oriented dictionary of English, containing 155,287 words and 117,659 synsets (sets of synonyms).

1. Senses and Synonyms

from nltk.corpus import wordnet as wn

print(wn.synsets('motorcar'))
# the set of words (or "lemmas") that share this meaning
print(wn.synset('car.n.01').lemma_names())
# the definition of this synset
print(wn.synset('car.n.01').definition())
# example sentences for this synset
print(wn.synset('car.n.01').examples())

# all lemmas of the given synset
print(wn.synset('car.n.01').lemmas())
# look up a particular lemma
print(wn.lemma('car.n.01.automobile'))
# the synset a lemma belongs to
print(wn.lemma('car.n.01.automobile').synset())
# the "name" of a lemma
print(wn.lemma('car.n.01.automobile').name(), '\n')


print(wn.synsets('car'))
for synset in wn.synsets('car'):
	print(synset.lemma_names())

print(wn.lemmas('car'))

Output

[Synset('car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
a motor vehicle with four wheels; usually propelled by an internal combustion engine
['he needs a car to get to work']

[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
Lemma('car.n.01.automobile')
Synset('car.n.01')
automobile 

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']

[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

2. The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, which do not always have corresponding English words. The concepts are linked to one another in a hierarchy.
[Figure: fragment of the WordNet concept hierarchy]

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
type_of_motorcar = motorcar.hyponyms()
print(type_of_motorcar[26])

# lemma names of all the hyponyms
print(sorted(
	[lemma.name()
	for synset in type_of_motorcar
	for lemma in synset.lemmas()]))

# hypernyms
print(motorcar.hypernyms())
paths = motorcar.hypernym_paths()
print(len(paths))
print([synset.name() for synset in paths[0]])
print([synset.name() for synset in paths[1]])

# the most general (root) hypernym synsets
print(motorcar.root_hypernyms())

Output

Synset('stanley_steamer.n.01')

['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']

[Synset('motor_vehicle.n.01')]
2
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

[Synset('entity.n.01')]
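
A related sketch: Synset.closure() follows a relation transitively, so the whole hypernym chain of motorcar can be listed without walking hypernym_paths() by hand:

# transitively follow hypernyms() upwards from car.n.01
hyper = lambda s: s.hypernyms()
print(list(motorcar.closure(hyper)))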

3. More Lexical Relations

  • part_meronyms(): the parts of something, e.g. the parts of a tree are its trunk, crown, and so on;
  • substance_meronyms(): the substance something is made of, e.g. the substance of a tree includes heartwood and sapwood;
  • member_holonyms(): the whole formed by members, e.g. a collection of trees forms a forest;
  • entailments(): entailment relations between verbs;
  • antonyms(): antonyms;
  • dir(): inspect the other methods defined on synsets and lexical relations.
print(wn.synset('tree.n.01').part_meronyms())
print(wn.synset('tree.n.01').substance_meronyms())
print(wn.synset('tree.n.01').member_holonyms())

print(wn.synset('mint.n.04').part_holonyms())
print(wn.synset('mint.n.04').substance_holonyms())

for synset in wn.synsets('mint', wn.NOUN):
	print(synset.name() + ':', synset.definition())
print('-----------' * 4)

# entailments between verbs
print(wn.synset('walk.v.01').entailments())
print(wn.synset('eat.v.01').entailments())
print(wn.synset('tease.v.03').entailments())
print('-----------' * 4)

# antonyms
print(wn.lemma('supply.n.02.supply').antonyms())
print(wn.lemma('rush.v.01.rush').antonyms())
print(wn.lemma('horizontal.a.01.horizontal').antonyms())
print(wn.lemma('staccato.r.01.staccato').antonyms())

# dir(): list the other methods defined on a synset
print(dir(wn.synset('harmony.n.02')))

Output

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
[Synset('forest.n.01')]
[Synset('mint.n.02')]
[Synset('mint.n.05')]
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
--------------------------------------------
[Synset('step.v.01')]
[Synset('chew.v.01'), Synset('swallow.v.01')]
[Synset('arouse.v.07'), Synset('disappoint.v.01')]
--------------------------------------------
[Lemma('demand.n.02.demand')]
[Lemma('linger.v.04.linger')]
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]
[Lemma('legato.r.01.legato')]
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']

4. Semantic Similarity

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

# lowest common hypernyms
print(right.lowest_common_hypernyms(minke))
print(right.lowest_common_hypernyms(orca))
print(right.lowest_common_hypernyms(tortoise))
print(right.lowest_common_hypernyms(novel))

# quantify how general each synset is by its minimum depth
print(wn.synset('baleen_whale.n.01').min_depth())
print(wn.synset('whale.n.02').min_depth())
print(wn.synset('vertebrate.n.01').min_depth())
print(wn.synset('entity.n.01').min_depth())

# path_similarity: a score in the range 0-1, based on the shortest path that
# connects the concepts in the hypernym hierarchy (None if there is no path)
print(right.path_similarity(minke))
print(right.path_similarity(orca))
print(right.path_similarity(tortoise))
print(right.path_similarity(novel))

Output

[Synset('baleen_whale.n.01')]
[Synset('whale.n.02')]
[Synset('vertebrate.n.01')]
[Synset('entity.n.01')]

14
13
8
0

0.25
0.16666666666666666
0.07692307692307693
0.043478260869565216
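
The dir() listing earlier also showed other similarity measures (lch_similarity, wup_similarity, and so on); a quick sketch on the same pair of synsets:

# Leacock-Chodorow and Wu-Palmer similarity (both defined for same-POS synsets)
print(right.lch_similarity(minke))
print(right.wup_similarity(minke))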