Study Notes on Natural Language Processing with Python (2nd edition), by Steven Bird et al.: Chapter 2, Accessing Text Corpora and Lexical Resources

  1. What are some useful text corpora and lexical resources, and how can we access them with Python?
  2. Which Python constructs are most helpful for this work?
  3. How can we avoid repeating ourselves when writing Python code?

2.1 Accessing Text Corpora

Gutenberg Corpus

import nltk
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
nltk.corpus.gutenberg.fileids()
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
emma = nltk.corpus.gutenberg.words('austen-emma.txt') # Jane Austen's Emma
len(emma)
192427
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

This calls the words() function of the gutenberg object in NLTK's corpus package. Because typing such a long name every time is tedious, Python provides another form of the import statement:

from nltk.corpus import gutenberg 
gutenberg.fileids()
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
emma = gutenberg.words("austen-emma.txt")
# This program displays three statistics for each text: average word length, average sentence length, and the average number of times each word appears in the text (our lexical diversity score)
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid)) 
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid)) 
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt

Average word length appears to be a general property of English, since its value here is consistently 4 or 5. (In fact, the true average word length is about one less, because num_chars counts the spaces between words as well.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
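As a quick check, here is a minimal sketch (my own, not from the book) that counts only the characters inside each token, so spaces are excluded:

# Recompute average word length without counting the spaces between words
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = sum(len(w) for w in gutenberg.words(fileid))  # characters inside tokens only
    num_words = len(gutenberg.words(fileid))
    print(round(num_chars / num_words), fileid)

The resulting averages should come out roughly one lower than before, in line with the note above.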

len(gutenberg.raw('blake-poems.txt')) # raw() gives the contents of the file with no linguistic processing, so len() here counts characters, including the spaces between words.
38153
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt') # sents() divides the text into sentences, where each sentence is a list of words.
macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
macbeth_sentences[1603]
['The', 'hart', 'is', 'sorely', 'charg', "'", 'd']
longest_len = max(len(s) for s in macbeth_sentences)
longest_len_sent = [s for s in macbeth_sentences if len(s) == longest_len]
print(longest_len_sent)
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]

Web and Chat Text

from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...') # a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Carribean, personal ads, and wine reviews
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
from nltk.corpus import nps_chat  
chatroom = nps_chat.posts('10-19-20s_706posts.xml') # e.g. 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.
print(chatroom[123])
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on.

Table 2-1. Example documents from each section of the Brown Corpus

ID   File  Genre            Description
A16  ca16  news             Chicago Tribune: Society Reportage
B02  cb02  editorial        Christian Science Monitor: Editorials
C17  cc17  reviews          Time Magazine: Reviews
D12  cd12  religion         Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies          Norling: Renting a Car in Europe
F25  cf25  lore             Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15  ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17  cj19  learned          Mosteller: Probability with Statistical Applications
K04  ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13  cl13  mystery          Hitchens: Footsteps in the Night
M01  cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14  cn15  adventure        Field: Rattlesnake Ridge
P12  cp12  romance          Callaghan: A Passion in Rome
R06  cr06  humor            Thurber: The Future, If Any, of Comedy
from nltk.corpus import brown
brown.categories()
['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']
brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

Let's compare genres in their usage of modal verbs:

from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist([w.lower() for w in news_text])
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389 
wh_words = ['what', 'when', 'where', 'who', 'why']
for m in wh_words:
    print(m + ':', fdist[m], end=' ')
what: 95 when: 169 where: 59 who: 268 why: 14 

Next, we obtain counts for each genre of interest, using NLTK's conditional frequency distribution:

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could:

cfd.tabulate(conditions=genres, samples=modals)  
                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 

Reuters Corpus

from nltk.corpus import reuters
reuters_fileids = reuters.fileids()
reuters_fileids[1:10]
['test/14828',
 'test/14829',
 'test/14832',
 'test/14833',
 'test/14839',
 'test/14840',
 'test/14841',
 'test/14842',
 'test/14843']
reuters_categories = reuters.categories()
reuters_categories[:5]
['acq', 'alum', 'barley', 'bop', 'carcass']
reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
reuters_fileids = reuters.fileids('barley')
reuters_fileids[:5]
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871']
reuters_fileids = reuters.fileids(['barley', 'corn'])
reuters_fileids[:5]
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106']
reuters.words('training/9865')[:5]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT']

Inaugural Address Corpus

from nltk.corpus import inaugural
inaugural_fileids = inaugural.fileids()
inaugural_fileids[:5]
['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt']

The year of each text appears in its filename. To get the year out of a filename, we extract the first four characters with fileid[:4]:

print([fileid[:4] for fileid in inaugural.fileids()])
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

Let's look at how the words america and citizen are used over time:

from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()


Figure 2-1. Plot of a conditional frequency distribution: all words in the Inaugural Address Corpus beginning with america or citizen are counted; separate counts are kept for each address, so that usage trends over time can be observed; the counts are not normalized for document length.
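Since the counts are not normalized, longer addresses inflate the curves. A minimal sketch (my own variation, not from the book) that reuses the cfd and inaugural objects from above and divides each count by the length of the corresponding address:

# Relative frequency per address instead of raw counts
doc_lengths = {fileid[:4]: len(inaugural.words(fileid)) for fileid in inaugural.fileids()}
for target in cfd.conditions():
    for year in sorted(cfd[target]):
        print(target, year, round(cfd[target][year] / doc_lengths[year], 4))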

Annotated Text Corpora

Table 2-2 in the book lists a number of these corpora. They are free to download for teaching and research; see http://www.nltk.org/data for download information.

Corpora in Other Languages

from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)


Figure 2-2. Cumulative word length distributions for six translations of the Universal Declaration of Human Rights

raw_text = udhr.raw('Chinese_Mandarin-GB2312')
nltk.FreqDist(raw_text).plot() # this plot crashes... (see the workaround sketch below)
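A likely cause (my own guess, not from the book) is that the raw text yields thousands of distinct Chinese characters and matplotlib may have no CJK font configured. A minimal workaround is to inspect the distribution as text, or plot only the most frequent samples:

# Inspect the character distribution without plotting every sample
fdist = nltk.FreqDist(udhr.raw('Chinese_Mandarin-GB2312'))
print(fdist.most_common(20))  # the 20 most frequent characters and their counts
# fdist.plot(20)              # or plot just the top 20 (the labels still need a CJK-capable font)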


The Structure of Text Corpora

NLTK's corpus readers support efficient access to a wide range of corpora, and can be used to work with new corpora.
Common structures of text corpora:

  • Isolated corpora: the simplest kind of corpus is a collection of isolated texts with no particular organization;
  • Categorized corpora: some corpora are organized into categories, such as genre (Brown Corpus);
  • Overlapping categories: in some corpora the categories overlap, such as topic categories (Reuters Corpus);
  • Temporal corpora: other corpora represent language use over time (Inaugural Address Corpus).

Table 2-3. Basic corpus functionality defined in NLTK

Example                      Description
fileids()                    The files of the corpus
fileids([categories])        The files of the corpus corresponding to these categories
categories()                 The categories of the corpus
categories([fileids])        The categories of the corpus corresponding to these files
raw()                        The raw content of the corpus
raw(fileids=[f1,f2,f3])      The raw content of the specified files
raw(categories=[c1,c2])      The raw content of the specified categories
words()                      The words of the whole corpus
words(fileids=[f1,f2,f3])    The words of the specified files
words(categories=[c1,c2])    The words of the specified categories
sents()                      The sentences of the whole corpus
sents(fileids=[f1,f2,f3])    The sentences of the specified files
sents(categories=[c1,c2])    The sentences of the specified categories
abspath(fileid)              The location of the given file on disk
encoding(fileid)             The encoding of the file (if known)
open(fileid)                 Open a stream for reading the given corpus file
root()                       The path to the root of the locally installed corpus
raw = gutenberg.raw("burgess-busterbrown.txt")
raw[1:20]
'The Adventures of B'
words = gutenberg.words("burgess-busterbrown.txt")
words[1:5]
['The', 'Adventures', 'of', 'Buster']
sents = gutenberg.sents("burgess-busterbrown.txt")
sents[1:3]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING']]
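The remaining entries in Table 2-3 can be tried the same way; a minimal sketch (the printed paths depend on where your nltk_data is installed; note that in current NLTK versions root is a property rather than a method):

# File-level metadata from the Gutenberg corpus reader
print(gutenberg.abspath('burgess-busterbrown.txt'))   # location of the file on disk
print(gutenberg.encoding('burgess-busterbrown.txt'))  # file encoding, if known
print(gutenberg.root)                                 # root directory of the locally installed corpus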

Loading Your Own Corpus

If you have your own collection of text files that you would like to access using the methods discussed above, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example we assume your files are under /usr/share/dict. Whatever the location, set the variable corpus_root to that directory. The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt'.

# from nltk.corpus import PlaintextCorpusReader
# corpus_root = '/usr/share/dict'
# wordlists = PlaintextCorpusReader(corpus_root, '.*')
# wordlists.fileids()
# wordlists.words(' ...')

Chinese NLP Corpora / Datasets

https://github.com/SophonPlus/ChineseNlpCorpus

Sentiment / Opinion / Review Polarity Analysis

1. ChnSentiCorp_htl_all dataset
■ Overview: 7,000+ hotel reviews, about 5,000 positive and about 2,000 negative
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb

2. waimai_10k dataset

■ Overview: user reviews collected from a food-delivery platform, about 4,000 positive and about 8,000 negative
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb

3. online_shopping_10_cats dataset
■ Overview: 10 categories, 60,000+ reviews in total, roughly 30,000 positive and 30,000 negative, covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu dairy, clothing, computers, and hotels
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/online_shopping_10_cats/intro.ipynb

4. weibo_senti_100k dataset
■ Overview: 100,000+ sentiment-labeled Sina Weibo posts, roughly 50,000 positive and 50,000 negative
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb

5. simplifyweibo_4_moods dataset
■ Overview: 360,000+ sentiment-labeled Sina Weibo posts covering 4 emotions: about 200,000 joy, and about 50,000 each of anger, disgust, and sadness
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods/intro.ipynb

6. dmsc_v2 dataset
■ Overview: 28 movies, 700,000+ users, 2,000,000+ ratings/reviews
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2/intro.ipynb

7. yf_dianping dataset
■ Overview: 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb

8. yf_amazon dataset
■ Overview: 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon/intro.ipynb

Chinese Named Entity Recognition

dh_msra dataset

■ Overview: 50,000+ items of Chinese named entity annotation data (covering locations, organizations, and persons)

■ Download:

https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dh_msra/intro.ipynb

Recommender Systems


1. ez_douban dataset
■ Overview: 50,000+ movies (30,000+ with titles, 20,000+ without), 28,000 users, 2.8 million ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ez_douban/intro.ipynb

2. dmsc_v2 dataset
■ Overview: 28 movies, 700,000+ users, 2,000,000+ ratings/reviews
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2/intro.ipynb

3. yf_dianping dataset
■ Overview: 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb

4. yf_amazon dataset
■ Overview: 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon/intro.ipynb

2.2 Conditional Frequency Distributions

A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition, so instead of processing a sequence of words, we have to process a sequence of pairs.

Conditions and Events

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...] # each pair has the form (condition, event)

Counting Words by Genre

FreqDist() takes a simple list as input, while ConditionalFreqDist() takes a list of pairs.
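To make the difference concrete, here is a tiny sketch with made-up data (not from the book):

# FreqDist counts plain events; ConditionalFreqDist counts (condition, event) pairs
import nltk
fd = nltk.FreqDist(['the', 'cat', 'the'])
cfd_demo = nltk.ConditionalFreqDist([('news', 'the'), ('news', 'cat'), ('romance', 'the')])
print(fd['the'])                 # 2
print(cfd_demo['news']['the'])   # 1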

import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genre_word = [(genre, word) 
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)]
len(genre_word)
170576
genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
genre_word[-3:]
[('romance', 'not'), ('romance', "''"), ('romance', '.')]
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
<ConditionalFreqDist with 2 conditions>
cfd.conditions()
['news', 'romance']
print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>
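Individual counts can be read off with double indexing, condition first and then sample (the exact numbers depend on the corpus data):

# How often does 'could' occur in each genre?
print(cfd['news']['could'])
print(cfd['romance']['could'])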

Plotting and Tabulating Distributions

from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word)) 
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9 
       English    0  185  525  883  997 1166 1283 1440 1558 1638 
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275 

Generating Random Text with Bigrams

sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven','and', 'the', 'earth', '.']
list(nltk.bigrams(sent))
[('In', 'the'),
 ('the', 'beginning'),
 ('beginning', 'God'),
 ('God', 'created'),
 ('created', 'the'),
 ('the', 'heaven'),
 ('heaven', 'and'),
 ('and', 'the'),
 ('the', 'earth'),
 ('earth', '.')]

Example 2-1. Generating random text: this program obtains all bigrams from the text of the book of Genesis, then constructs a conditional frequency distribution to record which words are most likely to follow a given word; e.g., the most likely word to follow living is creature; the generate_model() function uses this data, and a seed word, to generate text.

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()
        
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
cfd['living']
FreqDist({',': 1,
          '.': 1,
          'creature': 7,
          'soul': 1,
          'substance': 2,
          'thing': 4})
generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land 
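Because generate_model() always picks the single most likely successor, it soon falls into a loop ("of the land of the land ..."). A hedged variant (my own, not from the book) that instead samples the next word in proportion to its bigram frequency:

import random

def generate_model_random(cfdist, word, num=15):
    # sample the next word according to the observed bigram counts instead of always taking the max
    for i in range(num):
        print(word, end=' ')
        successors = list(cfdist[word])
        weights = [cfdist[word][w] for w in successors]
        word = random.choices(successors, weights=weights)[0]

generate_model_random(cfd, 'living')

Each run now produces a different word sequence.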

Table 2-4. Conditional frequency distributions in NLTK: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counts

Example                                Description
cfdist = ConditionalFreqDist(pairs)    Create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    Alphabetically sorted list of conditions
cfdist[condition]                      The frequency distribution for this condition
cfdist[condition][sample]              Frequency of the given sample for this condition
cfdist.tabulate()                      Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   Tabulation limited to the specified samples and conditions
cfdist.plot()                          Graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       Graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      Test if samples in cfdist1 occur less frequently than in cfdist2

2.3 More Python: Reusing Code

Creating Programs with a Text Editor

  • Using IDLE
  • Spyder

Functions

def lexical_diversity(text):
    return len(text) / len(set(text)) # the keyword return indicates the value that the function produces as output
def lexical_diversity(my_text_data):
    # local variables; they cannot be accessed outside the function body
    word_count = len(my_text_data)    
    vocab_size = len(set(my_text_data))
    diversity_score = word_count / vocab_size
    return diversity_score
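For example, applying it to one of the Gutenberg texts (a minimal usage sketch; the exact score depends on the text):

from nltk.corpus import gutenberg
emma_words = gutenberg.words('austen-emma.txt')
print(lexical_diversity(emma_words))  # average number of times each word type is used in Emma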

Example 2-2. A Python function: this function tries to work out the plural form of any English noun. The keyword def (define) is followed by the function name, then the parameters in parentheses and a colon; the body of the function is the indented block of code; it tries to recognize patterns within the word and process the word accordingly; e.g., if the word ends in y, delete the y and add ies.

def plural(word):
    if word.endswith('y'): # endswith() is called by giving the object's name, a dot, then the function name; such functions are usually called methods.
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
plural('fairy')
'fairies'
plural('woman')
'women'

Modules

Save your plural function in a file called textproc.py:

from textproc import plural
plural('woman')
'women'

2.4 Lexical Resources

Wordlist Corpora

Example 2-3. Filtering a text: this program computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or misspelled words.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)
len(unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')))
1601
len(unusual_words(nltk.corpus.nps_chat.words()))
2095

There is also a corpus of stopwords, that is, high-frequency words like the and to, which we sometimes want to filter out of a document before further processing.

from nltk.corpus import stopwords
len(stopwords.words('english'))
179

Let's define a function to compute what fraction of words in a text are not in the stopwords list:

def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)
content_fraction(nltk.corpus.reuters.words())
0.735240435097661

Figure 2-6. A word puzzle: choose letters from a grid of randomly chosen letters to make words; this puzzle is known as "Target". The instructions in the figure read: How many words of four letters or more can you make from those shown here? Each letter may be used once per word. Each word must contain the centre letter and there must be at least one nine-letter word. No plurals ending in "s"; no foreign words; no proper names. 21 words, good; 32 words, very good; 42 words, excellent.

puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6 
    and obligatory in w 
    and nltk.FreqDist(w) <= puzzle_letters][:5]
['glover', 'gorlin', 'govern', 'grovel', 'ignore']
names = nltk.corpus.names
names.fileids()
['female.txt', 'male.txt']
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names][:5]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian']
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()


A Pronouncing Dictionary

entries = nltk.corpus.cmudict.entries()
len(entries)
133737
for entry in entries[39943:39951]:
    print(entry)
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])
for word, pron in entries: 
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1 
syllable = ['N', 'IH0', 'K', 'S']
[word for word, pron in entries if pron[-4:] == syllable][:5]  # in this way we can find rhyming words
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics']
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n'][:5]
['autumn', 'column', 'condemn', 'damn', 'goddamn']
sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
['gn', 'kn', 'mn', 'pn']
def stress(pron):
    return [char for phone in pron for char in phone if char.isdigit()]
[w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']][:5]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating']
[w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']][:5]
['abbreviation',
 'abbreviations',
 'abomination',
 'abortifacient',
 'abortifacients']
p3 = [(pron[0]+'-'+pron[2], word) 
    for (word, pron) in entries
    if pron[0] == 'P' and len(pron) == 3]
cfd = nltk.ConditionalFreqDist(p3)
for template in cfd.conditions():
    if len(cfd[template]) > 10:
        words = cfd[template].keys()
        wordlist = ' '.join(words)
        print(template, wordlist[:70] + "...")
P-S pus peace pesce pass puss purse piece perse pease perce pasts poss pos...
P-N pain penn pyne pinn poon pine pin paine penh paign peine pawn pun pane...
P-T pat peart pout put pit purt putt piet pert pet pote pate patt piette p...
P-UW1 pru plue prue pshew prugh prew peru peugh pugh pew plew...
P-K pique pack paque paek perk poke puck pik polk purk peak poch pake perc...
P-Z poe's pas pei's pows pao's pose purrs peas paiz pies pays pause p.s pa...
P-CH perch petsch piech petsche piche peach pautsch pouch pietsch pitsch pu...
P-P pipp papp pep pope paup pop popp pup poop pape poppe pip paap paape pe...
P-R pour poor poore parr porr pear peer pore pier paar por pare pair par...
P-L pal pall peil pehl peele paille pile poll pearl peale perle pull pill ...
prondict = nltk.corpus.cmudict.dict()
prondict['fire'] # look up the dictionary by giving its name followed by a key (e.g. the word fire) inside square brackets
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
prondict['blog']
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-175-f0ffb282ba9a> in <module>()
----> 1 prondict['blog']


KeyError: 'blog'
prondict['blog'] = [['B', 'L', 'AA1', 'G']]
prondict['blog']
[['B', 'L', 'AA1', 'G']]
text = ['natural', 'language', 'processing']
[ph for w in text for ph in prondict[w][0]][:5]
['N', 'AE1', 'CH', 'ER0', 'AH0']
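If a word is missing from the dictionary (as blog was above, before we added it), the lookup raises a KeyError. A small defensive sketch (my own, not from the book) simply skips unknown words; 'nosuchword' is a made-up out-of-vocabulary token:

# Convert text to phones, skipping words the CMU dictionary does not contain
text = ['natural', 'language', 'processing', 'nosuchword']
phones = [ph for w in text if w in prondict for ph in prondict[w][0]]
print(phones)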

Comparative Wordlists

The Swadesh wordlists (lists of core vocabulary)

from nltk.corpus import swadesh
swadesh.fileids()[:5]
['be', 'bg', 'bs', 'ca', 'cs']
swadesh.words('en')[:5]
['I', 'you (singular), thou', 'he', 'we', 'you (plural)']
fr2en = swadesh.entries(['fr', 'en'])
fr2en[:5]
[('je', 'I'),
 ('tu, vous', 'you (singular), thou'),
 ('il', 'he'),
 ('nous', 'we'),
 ('vous', 'you (plural)')]
translate = dict(fr2en)
translate['chien']
'dog'
translate['jeter']
'throw'
de2en = swadesh.entries(['de', 'en']) # German-English
es2en = swadesh.entries(['es', 'en']) # Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))
translate['Hund']
'dog'
translate['perro']
'dog'
languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')

Lexicon Tools: Toolbox and Shoebox

Toolbox, formerly known as Shoebox:

from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[:1]
[('kaa',
  [('ps', 'V'),
   ('pt', 'A'),
   ('ge', 'gag'),
   ('tkp', 'nek i pas'),
   ('dcsv', 'true'),
   ('vx', '1'),
   ('sc', '???'),
   ('dt', '29/Oct/2005'),
   ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
   ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   ('xe', 'Apoka is gagging from food while talking.')])]

2.5 WordNet

NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.

Senses and Synonyms

from nltk.corpus import wordnet as wn
wn.synsets('motorcar') # car.n.01 is called a synset, or "synonym set": a collection of words (or "lemmas") that share the same meaning
[Synset('car.n.01')]
wn.synset('car.n.01').lemma_names() # synonymous words
['car', 'auto', 'automobile', 'machine', 'motorcar']
wn.synset('car.n.01').definition() # the definition of the synset
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
wn.synset('car.n.01').examples()
['he needs a car to get to work']
wn.synset('car.n.01').lemmas()
[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]
wn.lemma('car.n.01.automobile')
Lemma('car.n.01.automobile')
wn.lemma('car.n.01.automobile').synset()
Synset('car.n.01')
wn.lemma('car.n.01.automobile').name()
'automobile'
wn.synsets('car')
[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]
for synset in wn.synsets('car'):
    print(synset.lemma_names())
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']
wn.lemmas('car')
[Lemma('car.n.01.car'),
 Lemma('car.n.02.car'),
 Lemma('car.n.03.car'),
 Lemma('car.n.04.car'),
 Lemma('cable_car.n.01.car')]

The WordNet Hierarchy

motorcar = wn.synset('car.n.01') # hyponyms (more specific concepts) are explored below
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar[26]
Synset('stanley_steamer.n.01')
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())[:5]
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance']
motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
paths = motorcar.hypernym_paths()
len(paths)
2
[synset.name() for synset in paths[0]]
['entity.n.01',
 'physical_entity.n.01',
 'object.n.01',
 'whole.n.02',
 'artifact.n.01',
 'instrumentality.n.03',
 'container.n.01',
 'wheeled_vehicle.n.01',
 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01',
 'car.n.01']
[synset.name() for synset in paths[1]]
['entity.n.01',
 'physical_entity.n.01',
 'object.n.01',
 'whole.n.02',
 'artifact.n.01',
 'instrumentality.n.03',
 'conveyance.n.03',
 'vehicle.n.01',
 'wheeled_vehicle.n.01',
 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01',
 'car.n.01']
motorcar.root_hypernyms()
[Synset('entity.n.01')]

More Lexical Relations

wn.synset('tree.n.01').part_meronyms() # the parts of a tree are its trunk, crown, and so on; these are the part_meronyms()
[Synset('burl.n.02'),
 Synset('crown.n.07'),
 Synset('limb.n.02'),
 Synset('stump.n.01'),
 Synset('trunk.n.01')]
wn.synset('tree.n.01').substance_meronyms()  # the substance of a tree includes its heartwood and sapwood; these are the substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
wn.synset('tree.n.01').member_holonyms() # a collection of trees forms a forest; this is the member_holonyms()
[Synset('forest.n.01')]
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

wn.synset('walk.v.01').entailments()
[Synset('step.v.01')]
wn.synset('eat.v.01').entailments()
[Synset('chew.v.01'), Synset('swallow.v.01')]
wn.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, for example, antonymy:

wn.lemma('supply.n.02.supply').antonyms()
[Lemma('demand.n.02.demand')]
wn.lemma('rush.v.01.rush').antonyms()
[Lemma('linger.v.04.linger')]
wn.lemma('horizontal.a.01.horizontal').antonyms()
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]
wn.lemma('staccato.r.01.staccato').antonyms()
[Lemma('legato.r.01.legato')]

Semantic Similarity

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')
right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
wn.synset('baleen_whale.n.01').min_depth()
14
wn.synset('whale.n.02').min_depth()
13
wn.synset('vertebrate.n.01').min_depth()
8
wn.synset('entity.n.01').min_depth()
0
right.path_similarity(minke)
0.25
right.path_similarity(orca)
0.16666666666666666
right.path_similarity(tortoise)
0.07692307692307693
right.path_similarity(novel)
0.043478260869565216

Several other similarity measures are available; NLTK also includes VerbNet, a hierarchically organized verb lexicon linked to WordNet.
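For instance, Wu-Palmer and Leacock-Chodorow similarity are available directly on synset objects; a minimal sketch reusing the synsets defined above (exact scores depend on the WordNet version, so none are shown):

# Two of the other built-in WordNet similarity measures
print(right.wup_similarity(minke))   # Wu-Palmer: based on the depth of the lowest common hypernym
print(right.lch_similarity(minke))   # Leacock-Chodorow: based on shortest path length and taxonomy depth
print(right.wup_similarity(novel))   # unrelated synsets score much lower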

2.6 Summary

  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
  • Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
  • A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
  • Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
  • Python functions permit you to associate a name with a particular block of code, and reuse that code as often as necessary.
  • Some functions, known as "methods", are associated with an object; we give the object name followed by a period followed by the method name, like this: x.funct(y) or word.isalpha().
  • To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
  • WordNet is a semantically oriented dictionary of English, consisting of synonym sets, or synsets, organized into a network.
  • Some functions are not available by default and must be accessed using Python's import statement.

Acknowledgements
Natural Language Processing with Python [1][2][3][4], by Steven Bird, Ewan Klein & Edward Loper, is a highly practical introduction, first published in 2009 with a second edition in 2015. These study notes draw on both editions and extend and practice parts of the material. I share them here in the hope that they will help others; you are welcome to add me on WeChat (verification message: NLP) to study and discuss together, and corrections are welcome.

References


  1. http://nltk.org/

  2. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2009

  3. Bird, Klein & Loper, Python自然语言处理 (Chinese edition of Natural Language Processing with Python), Southeast University Press, 2010

  4. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2015
