Notes on "Natural Language Processing with Python": Chapter 2, Accessing Text Corpora and Lexical Resources

无限大地NLP_空木

Category: Python natural language processing and related topics

Article link: https://blog.csdn.net/u010454729/article/details/22316157

2.1 Accessing Text Corpora

Gutenberg Corpus

import nltk
nltk.corpus.gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
emma = nltk.corpus.gutenberg.words(u'austen-emma.txt')
len(emma)
192427

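raw(): returns the contents of the file without any linguistic processing. For example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words.

sents(): divides the text into sentences, where each sentence is a list of words.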

from nltk.corpus import gutenberg
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')   # split the text into sentences
macbeth_sentences[1037]
['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';',
'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']
longest_len = max(len(s) for s in macbeth_sentences)   # length of the longest sentence
[s for s in macbeth_sentences if len(s) == longest_len]
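
The longest-sentence idiom works on any list of tokenized sentences. Here is a minimal sketch on a hand-made toy list, so it runs without downloading any NLTK data. The sentences and the helper name longest_sentences are illustrative assumptions, not part of NLTK or the corpus:

```python
# Toy stand-in for gutenberg.sents(...): each sentence is a list of tokens.
# These sentences are made up for illustration; they are not corpus output.
toy_sentences = [
    ['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';'],
    ['Fire', 'burne', ',', 'and', 'Cauldron', 'bubble'],
    ['Exeunt'],
    ['When', 'shall', 'we', 'three', 'meet', 'againe', '?', '!'],
]

def longest_sentences(sentences):
    """Return the maximum length and every sentence that has that length."""
    longest_len = max(len(s) for s in sentences)
    return longest_len, [s for s in sentences if len(s) == longest_len]

longest_len, longest = longest_sentences(toy_sentences)
print(longest_len, len(longest))
```

Returning a list rather than a single sentence keeps ties, which is why the original idiom filters on len(s) == longest_len instead of calling max with key=len.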

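Web and Chat Text

NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.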

from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:60], '...')

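The corpus of instant messaging chat sessions contains over 10,000 posts. It is divided into 15 files, each containing several hundred posts collected on a given date from a chatroom for a specific age group; the filename encodes the date, the chatroom, and the number of posts.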

from nltk.corpus import nps_chat
chatroom=nps_chat.posts('10-19-20s_706posts.xml')
chatroom[123]

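Brown Corpus

The Brown Corpus contains text from 500 different sources, categorized by genre, such as news and editorial.

Example document for each section of the Brown Corpus
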

ID	File	Genre	Description
A16	ca16	news	Chicago Tribune: Society Reportage
B02	cb02	editorial	Christian Science Monitor: Editorials
C17	cc17	reviews	Time Magazine: Reviews
D12	cd12	religion	Underwood: Probing the Ethics of Realtors
E36	ce36	hobbies	Norling: Renting a Car in Europe
F25	cf25	lore	Boroff: Jewish Teenage Culture
G22	cg22	belles_lettres	Reiner: Coping with Runaway Technology
H15	ch15	government	US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17	cj19	learned	Mosteller: Probability with Statistical Applications
K04	ck04	fiction	W.E.B. Du Bois: Worlds of Color
L13	cl13	mystery	Hitchens: Footsteps in the Night
M01	cm01	science_fiction	Heinlein: Stranger in a Strange Land
N14	cn15	adventure	Field: Rattlesnake Ridge
P12	cp12	romance	Callaghan: A Passion in Rome
R06	cr06	humor	Thurber: The Future, If Any, of Comedy

from nltk.corpus import brown
brown.categories()              # list the genre categories of the corpus
brown.words(categories='news')  # read by specifying particular categories or files
brown.words(fileids=['cg22'])
brown.sents(categories=['news', 'editorial', 'reviews'])

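The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

Let's compare how modal verbs are used in different genres.

Step 1: produce the counts for a particular genre: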

import nltk
from nltk.corpus import brown
news_text = brown.words(categories='news')  # words in the news genre
fdist = nltk.FreqDist(w.lower() for w in news_text)  # frequency distribution over lowercased words
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

from nltk.corpus import brown
import nltk
cfd = nltk.ConditionalFreqDist((genre, word)
                          for genre in brown.categories()
                          for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

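# Note the layout of the cfd expression: it is split across lines, with the two for clauses each on their own line. The output is as follows: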

                 can could  may might must will 
           news   93   86   66   38   50  389 
       religion   82   59   78   12   54   71 
        hobbies  268   58  131   22   83  264 
science_fiction   16   49    4   12    8   16 
        romance   74  193   11   51   45   43 
          humor   16   30    8    8    9   13

What the results show: the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could.
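
To make the idea behind a conditional frequency distribution concrete, here is a minimal standard-library sketch: one Counter per condition (genre). It is not NLTK's implementation; the toy (genre, word) pairs and the helper most_common_modal are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Toy (genre, word) pairs standing in for the Brown Corpus; illustrative only.
pairs = [
    ('news', 'will'), ('news', 'Will'), ('news', 'can'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
]

# One Counter per condition: the core idea of a conditional frequency distribution.
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word.lower()] += 1

def most_common_modal(cfd, genre, modals):
    """Return the modal with the highest count in the given genre."""
    return max(modals, key=lambda m: cfd[genre][m])

modals = ['can', 'could', 'may', 'might', 'must', 'will']
for genre in ['news', 'romance']:
    # Print one row of counts per genre, similar in spirit to tabulate().
    print(genre, [cfd[genre][m] for m in modals], most_common_modal(cfd, genre, modals))
```

A Counter returns 0 for samples it has never seen, so modals that do not occur in a genre simply show up as zero in the row.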

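Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and grouped into two sets, "training" and "test".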

from nltk.corpus import reuters
reuters.fileids()       # document IDs, prefixed with 'test/' or 'training/'
reuters.categories()    # the categories (topics) of the Reuters Corpus


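# Look up the topics covered by one or more documents, or the documents included in one or more categories. The corpus methods accept either a single fileid or a list of fileids.
reuters.categories('training/9865')
reuters.categories(['training/9865', 'training/9880'])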

reuters.fileids('barley')
reuters.fileids(['barley','corn'])
# The words or sentences we want can be retrieved by document or by category
reuters.words('training/9865')[:14]
reuters.words(['training/9865', 'training/9880'])

reuters.words(categories='barley')
reuters.words(categories=['barley','corn'])
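
Because Reuters categories overlap, lookups go in both directions: from documents to topics and from topics to documents. The sketch below mimics that behaviour with plain dictionaries. The document IDs, the topic assignments, and the helper names categories_of and fileids_of are hypothetical, chosen only to illustrate the idea; this is not how NLTK stores the corpus.

```python
from collections import defaultdict

# Hypothetical document-to-topic assignments; a document may have several topics.
doc_topics = {
    'training/0001': ['barley', 'corn'],
    'training/0002': ['corn'],
    'test/0003': ['barley'],
}

# Invert the mapping so we can also go from a topic to its documents.
topic_docs = defaultdict(list)
for fileid, topics in doc_topics.items():
    for topic in topics:
        topic_docs[topic].append(fileid)

def _as_list(x):
    """Accept either a single string or a list of strings."""
    return [x] if isinstance(x, str) else list(x)

def categories_of(fileids):
    """Topics covered by one file ID or by a list of file IDs."""
    return sorted({t for f in _as_list(fileids) for t in doc_topics[f]})

def fileids_of(categories):
    """Documents that belong to one category or to any of several."""
    return sorted({f for c in _as_list(categories) for f in topic_docs[c]})

print(categories_of('training/0001'))
print(fileids_of(['barley', 'corn']))
```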

Inaugural Address Corpus

Let's look at how the words america and citizen are used over time:

from nltk.corpus import inaugural
import nltk
#inaugural.fileids()     
#[fileid[:4] for fileid in inaugural.fileids()]
cfd=nltk.ConditionalFreqDist((target,fileid[:4])
                             for fileid in inaugural.fileids()
                             for w in inaugural.words(fileid)
                             for target in ['america','citizen']
                             if w.lower().startswith(target))
cfd.plot()

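Running the code produces a plot. Note the layout of the cfd expression: the three for clauses and the if clause are aligned. With the matplotlib package installed, the plot is drawn even without importing the package explicitly.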
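
The counting step behind that plot can be sketched without NLTK or matplotlib. The year is taken from the first four characters of the file ID, and a word counts toward a target when its lowercased form starts with that target. The token lists below are made up for illustration and are not the real contents of those addresses.

```python
from collections import Counter, defaultdict

# Toy stand-in for the Inaugural Address Corpus: file ID -> tokens (made-up tokens).
toy_corpus = {
    '1789-Washington.txt': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '1861-Lincoln.txt': ['citizens', 'of', 'America', ',', 'American', 'people'],
}
targets = ['america', 'citizen']

# counts[target][year] mirrors the (condition, sample) pairs fed to the cfd.
counts = defaultdict(Counter)
for fileid, words in toy_corpus.items():
    year = fileid[:4]  # the file ID begins with the year of the address
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                counts[target][year] += 1

for target in targets:
    print(target, sorted(counts[target].items()))
```

Prefix matching means that forms such as American or Citizens are counted as well, which is the behaviour of the startswith test in the original expression.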

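Annotated Text Corpora

Many text corpora contain linguistic annotations, such as part-of-speech tags, named entities, syntactic structures, and semantic roles.

Corpora in Other Languages

Text Corpus Structure

Common structures for text corpora:

The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories such as genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent changes in language use over time (Inaugural Address Corpus).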

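Example	Description
fileids()	the files of the corpus
fileids([categories])	the files of the corpus corresponding to these categories
categories()	the categories of the corpus
categories([fileids])	the categories of the corpus corresponding to these files
raw()	the raw content of the corpus
raw(fileids=[f1,f2,f3])	the raw content of the specified files
raw(categories=[c1,c2])	the raw content of the specified categories
words()	the words of the whole corpus
words(fileids=[f1,f2,f3])	the words of the specified files
words(categories=[c1,c2])	the words of the specified categories
sents()	the sentences of the whole corpus
sents(fileids=[f1,f2,f3])	the sentences of the specified files
sents(categories=[c1,c2])	the sentences of the specified categories
abspath(fileid)	the location of the given file on disk
encoding(fileid)	the encoding of the file (if known)
open(fileid)	open a stream for reading the given corpus file
root()	the path to the root of the locally installed corpus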

Differences between the corpus access methods:

from nltk.corpus import gutenberg
raw = gutenberg.raw('burgess-busterbrown.txt')      # character level
raw[1:20]
words = gutenberg.words('burgess-busterbrown.txt')  # word level
words[1:20]
sents = gutenberg.sents('burgess-busterbrown.txt')  # sentence level
sents[1:20]
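
The same three levels can be seen on a tiny string with nothing but the standard library. This is a rough sketch: the regular expressions only approximate tokenization and sentence splitting, NLTK's own tokenizers behave differently in many cases, and the sample text is made up.

```python
import re

# Made-up sample text; not taken from the corpus.
raw = "Buster Bear yawned. He was hungry!"

TOKEN = r"\w+|[^\w\s]+"  # runs of word characters, or runs of punctuation

# Word level: a flat list of tokens.
words = re.findall(TOKEN, raw)

# Sentence level: split after ., ! or ? and tokenize each piece.
sents = [re.findall(TOKEN, s) for s in re.split(r"(?<=[.!?])\s+", raw) if s]

print(len(raw), len(words), len(sents))
print(raw[1:5])     # slicing raw yields characters
print(words[1:3])   # slicing words yields tokens
print(sents[1:2])   # slicing sents yields lists of tokens
```

The three len values differ because each level counts a different unit: characters, tokens, and sentences.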

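Loading Your Own Corpus

Check the location of your files on the file system; in the example below, we assume your files are in the /usr/share/dict directory. Whatever the location, set the variable corpus_root to this directory. The PlaintextCorpusReader initializer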