"Natural Language Processing with Python" Study Notes (3)

Latest recommended article published 2023-01-31 09:47:25

LucyGill

Category: Python. Tags: python, natural language processing, nlp, text corpora

Article link: https://blog.csdn.net/LucyGill/article/details/54383152

Now on to Chapter 2 of the book, "Accessing Text Corpora and Lexical Resources".

I. Accessing Text Corpora

1. The Gutenberg corpus (gutenberg)

Contents: NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which holds roughly 36,000 free e-books.

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

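nltk.corpus.gutenberg.fileids() returns the file identifiers of the Gutenberg texts bundled with NLTK.

corpus: a collection of written or spoken texts

fileid: file identifier

So the call reads naturally: from the Gutenberg part of NLTK's corpus collection, list every file identifier (one per book).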
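The file identifiers in the listing above follow a prefix-title.txt pattern, so they can be grouped with plain string operations. The following is a minimal sketch in Python 3 (the sessions in these notes are Python 2). It assumes a hand-copied subset of the listing rather than a live call to fileids(), and group_by_prefix is a hypothetical helper, not part of NLTK.

```python
from collections import defaultdict

# A subset of the file IDs shown in the listing above, copied by hand.
fileids = [
    'austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
    'bible-kjv.txt',
    'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
]

def group_by_prefix(ids):
    """Group file IDs by the part before the first hyphen (hypothetical helper)."""
    groups = defaultdict(list)
    for fileid in ids:
        prefix, _, rest = fileid.partition('-')
        # Drop the '.txt' extension from the title part, if present.
        title = rest[:-len('.txt')] if rest.endswith('.txt') else rest
        groups[prefix].append(title)
    return dict(groups)

grouped = group_by_prefix(fileids)
for prefix in sorted(grouped):
    print(prefix, grouped[prefix])
```

With the real corpus installed, the same function should accept the list returned by nltk.corpus.gutenberg.fileids() directly.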

Example 1

>>> emma=nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)

Example 2:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
    num_chars=len(gutenberg.raw(fileid))
    num_words=len(gutenberg.words(fileid))
    num_sents=len(gutenberg.sents(fileid))
    num_vocab=len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words),int(num_words/num_sents),int(num_words/num_vocab),fileid

    
4 24 26 austen-emma.txt
4 26 16 austen-persuasion.txt
4 28 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 20 12 carroll-alice.txt
4 20 11 chesterton-ball.txt
4 22 11 chesterton-brown.txt
4 18 10 chesterton-thursday.txt
4 20 24 edgeworth-parents.txt
4 25 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 36 12 whitman-leaves.txt

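raw() returns the text as one string, so its length counts characters (spaces included); words() returns the text as a list of word tokens; sents() (short for sentences) returns it as a list of sentences, each of which is a list of words. The three printed columns are therefore the average word length, the average sentence length, and the average number of times each vocabulary item occurs.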
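The arithmetic behind Example 2 can be tried without downloading any corpus. This is a minimal sketch in Python 3, assuming a tiny hand-made sample: text_stats and the sample_* variables are hypothetical names, and the three inputs merely stand in for what gutenberg.raw(), gutenberg.words() and gutenberg.sents() return.

```python
def text_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item).

    Mirrors the three integer ratios printed in Example 2.
    """
    num_chars = len(raw)                            # characters, spaces included
    num_words = len(words)                          # word and punctuation tokens
    num_sents = len(sents)                          # sentences
    num_vocab = len(set(w.lower() for w in words))  # distinct lowercased tokens
    return (num_chars // num_words,
            num_words // num_sents,
            num_words // num_vocab)

# Hand-made stand-ins for raw(), sents() and words() on one small "file".
sample_raw = "The cat sat. The cat ran."
sample_sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
sample_words = [w for sent in sample_sents for w in sent]

print(text_stats(sample_raw, sample_words, sample_sents))  # (3, 4, 1)
```

Integer division is used here because the original session truncates each ratio with int().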

2. Web and chat text (webtext)

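NLTK's web text collection includes content from a Firefox discussion forum, conversations overheard in New York, the Pirates of the Caribbean movie script, personal advertisements, and wine reviews.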

Example 1:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
	print fileid,webtext.raw(fileid)[:65],'...'

	
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

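Exercise from p. 63 of the book:

Your turn: pick a language of interest from udhr.fileids() and define a variable raw_text = udhr.raw(+). Then use nltk.FreqDist(raw_text).plot() to plot the letter frequency distribution of the text.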

>>> from nltk.corpus import udhr
>>> def search_word(word):
	for w in udhr.fileids():
		if word in w.lower():
			return w

		
>>> search_word('english')
u'English-Latin1'
>>> raw_text=udhr.raw('English-Latin1')

>>> nltk.FreqDist(raw_text).plot()
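Passing a string to nltk.FreqDist counts its individual characters, and in NLTK 3 FreqDist is a subclass of collections.Counter. The counting step can therefore be sketched with Counter alone. This is a minimal illustration in Python 3, assuming a one-sentence sample in place of the full udhr text; letter_frequencies is a hypothetical helper, and unlike the exercise's call it keeps only alphabetic characters and ignores case.

```python
from collections import Counter

def letter_frequencies(raw_text):
    """Count alphabetic characters only, ignoring case (hypothetical helper)."""
    return Counter(ch.lower() for ch in raw_text if ch.isalpha())

# One-sentence stand-in for udhr.raw('English-Latin1').
sample = "All human beings are born free and equal in dignity and rights."

freq = letter_frequencies(sample)
# Print the three most frequent letters with their counts.
for letter, count in freq.most_common(3):
    print(letter, count)
```

Plotting is left out of the sketch; with NLTK installed, nltk.FreqDist(raw_text).plot() draws the chart as in the session above.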

3. The instant-messaging chat session corpus (nps_chat)

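The corpus is split into 15 files, each holding several hundred posts collected on a given date from a chat room for a given age group (teens, 20s, 30s, 40s, plus a generic adults chat room). For example, 10-19-20s_706posts.xml contains 706 posts collected from the 20s chat room on October 19, 2006.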
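The naming convention described above (month, day, age group, number of posts) can be unpacked with a regular expression. This is a minimal sketch in Python 3; parse_chat_fileid and FILEID_PATTERN are hypothetical names, not part of NLTK, and the pattern is only an assumption based on the one file name quoted here.

```python
import re

# month-day-agegroup_Nposts.xml, e.g. 10-19-20s_706posts.xml
FILEID_PATTERN = re.compile(r"^(\d{2})-(\d{2})-([A-Za-z0-9]+)_(\d+)posts\.xml$")

def parse_chat_fileid(fileid):
    """Split an nps_chat-style file ID into its parts (hypothetical helper)."""
    match = FILEID_PATTERN.match(fileid)
    if match is None:
        raise ValueError("unexpected file ID: %r" % fileid)
    month, day, age_group, num_posts = match.groups()
    return {
        "month": int(month),
        "day": int(day),
        "age_group": age_group,
        "num_posts": int(num_posts),
    }

info = parse_chat_fileid("10-19-20s_706posts.xml")
print(info)
```

The year is not encoded in the file name, so the sketch does not return one.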

Example 1:

>>> from nltk.corpus import nps_chat
>>> chatroom=nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
[u'i', u'do', u"n't", u'want', u'hot', u'pics', u'of', u'a', u'female', u',', u'I', u'can', u'look', u'in', u'a', u'mirror', u'.']

Note: the operations shown for the two previous corpora also work on this one, for example nps_chat.words('10-19-20s_706posts.xml').

4. The Brown corpus (brown)

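The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 sources, categorized by genre, such as news and editorial.

The categories are: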

Example 1:

>>> from nltk.corpus import brown
>>> brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']

Note that categories() is available only on categorized corpora such as brown; none of the three corpora above provides it.

Example 2: works on corpora in general