NLTK: Accessing Text Corpora and Lexical Resources

Pinaceae

Category: NLP. Tags: nlp

Article link: https://blog.csdn.net/Pinaceae/article/details/78311873

1. Accessing Text Corpora

1.1 Gutenberg Corpus

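import nltk
from nltk.corpus import gutenberg

gutenberg.fileids()  # file identifiers of the Gutenberg corpus
emma = gutenberg.words('austen-emma.txt')  # words() of the gutenberg object in NLTK's corpus package
emma = nltk.Text(gutenberg.words('austen-emma.txt'))  # wrap the words in nltk.Text so that concordance() is available
emma.concordance("surprize")  # show every occurrence of "surprize" with its context

macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')  # sents() splits the text into sentences, each one a list of words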
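Once a text is available as a word list and a sentence list, a few summary figures follow directly from their lengths. The sketch below is only an illustration: the helper name `text_stats` and the toy data are my own assumptions, not part of NLTK. With NLTK and the Gutenberg data installed, the same function could be given `gutenberg.words(fileid)` and `gutenberg.sents(fileid)` instead.

```python
def text_stats(words, sents):
    """Return simple statistics for a tokenized text.

    `words` is a sequence of tokens and `sents` a sequence of sentences,
    each sentence being a list of tokens (the shapes that words() and
    sents() return).
    """
    vocab = {w.lower() for w in words}  # case-folded vocabulary
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sents),
        "tokens_per_type": len(words) / len(vocab),
    }


# Toy data standing in for a corpus file (made up for this example).
toy_sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
toy_words = [w for sent in toy_sents for w in sent]

stats = text_stats(toy_words, toy_sents)
print(stats)
```

Punctuation marks count as tokens here, just as they do in the word lists returned by the corpus readers.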


1.2 Web and Chat Text

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65], '...')
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
1.3 Brown Corpus
>>> import nltk
>>> from nltk.corpus import brown
>>> brown.categories()  # the corpus holds 500 texts from different sources, grouped by genre (news, editorial, ...)
>>> brown.words(categories='news')  # returned as a list of words
>>> brown.sents(categories=['news', 'editorial', 'reviews'])  # returned as a list of sentences
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                  can could   may might  must  will
           news    93    86    66    38    50   389
       religion    82    59    78    12    54    71
        hobbies   268    58   131    22    83   264
science_fiction    16    49     4    12     8    16
        romance    74   193    11    51    45    43
          humor    16    30     8     8     9    13
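The table above is the result of counting (genre, word) pairs separately for each genre. As a rough picture of that bookkeeping, here is a minimal standard-library sketch. It is not how NLTK implements `ConditionalFreqDist`; the function name `conditional_counts` and the miniature corpus are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Count samples separately for each condition.

    `pairs` is an iterable of (condition, sample) tuples, the same input
    shape that nltk.ConditionalFreqDist accepts.
    """
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


# Made-up miniature "corpus": genre -> list of words.
toy_corpus = {
    "news": ["it", "will", "rain", "and", "it", "will", "stop"],
    "romance": ["she", "could", "not", "say", "what", "could", "be"],
}

counts = conditional_counts(
    (genre, word) for genre, words in toy_corpus.items() for word in words
)

modals = ["can", "could", "will"]
for genre in toy_corpus:
    # A Counter returns 0 for samples that never occurred.
    print(genre, [counts[genre][m] for m in modals])
```

Each row printed corresponds to one condition (genre) and each column to one sample (modal verb), which is the layout `tabulate()` uses.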


1.4 Reuters Corpus
>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',
'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]
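# Unlike the Brown corpus, the categories of the Reuters corpus overlap,
# simply because a news story often covers several topics. We can look up
# the topics covered by one or more documents, or the documents that belong
# to one or more categories. For convenience, the corpus methods accept
# either a single fileid or a list of fileids as an argument.
>>> reuters.categories('training/9865')
>>> reuters.fileids('barley')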
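The two-way lookup between documents and overlapping categories can be pictured with a plain dictionary. The sketch below only illustrates the idea: the document ids, the topic assignments, and the helper names are all made up, and it does not reproduce the Reuters data or NLTK's implementation.

```python
# Made-up document ids and topic labels, for illustration only.
doc_topics = {
    "doc1": ["barley", "corn"],
    "doc2": ["corn"],
    "doc3": ["coffee"],
}


def as_list(value):
    """Accept a single string or a list of strings, like the corpus methods."""
    return [value] if isinstance(value, str) else list(value)


def categories_of(fileids):
    """Topics covered by one or more documents."""
    wanted = as_list(fileids)
    return sorted({t for f in wanted for t in doc_topics[f]})


def fileids_of(categories):
    """Documents that carry at least one of the given topics."""
    wanted = set(as_list(categories))
    return sorted(f for f, topics in doc_topics.items() if wanted & set(topics))


print(categories_of("doc1"))
print(categories_of(["doc2", "doc3"]))
print(fileids_of("corn"))
print(fileids_of(["barley", "coffee"]))
```

Because "doc1" carries two topics, it shows up under both "barley" and "corn", which is exactly the overlap that the Brown corpus categories do not have.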


1.5 Inaugural Address Corpus
from nltk.corpus import inaugural
inaugural.fileids()
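The file identifiers of this corpus start with the year of the address, so the year can be read off the first four characters. A minimal sketch, assuming identifiers of the form `YYYY-Name.txt`; the helper name `years_from_fileids` is my own:

```python
def years_from_fileids(fileids):
    """Extract the leading four-digit year from names like '1789-Washington.txt'."""
    return [int(fileid[:4]) for fileid in fileids]


# Three of the file identifiers in the inaugural corpus.
sample_ids = ["1789-Washington.txt", "1793-Washington.txt", "1797-Adams.txt"]
print(years_from_fileids(sample_ids))
```

With the corpus installed, `years_from_fileids(inaugural.fileids())` would give one year per address, which is handy as the condition when counting word usage over time.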


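Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus that belong to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus that these files belong to
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified files
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified files
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root                         the path to the root of the locally installed corpus (an attribute in NLTK 3, not a method call)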

Here we illustrate some of the differences between the corpus access methods:
>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]
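The session above slices the same text three times, but each slice counts a different unit: characters, tokens, or sentences. The standalone sketch below reproduces that difference without NLTK. The sample text is made up, and the regular-expression tokenizer and the split after every "." are deliberate simplifications, not NLTK's tokenizers.

```python
import re

raw = "Buster Bear yawned. He lay on his bed."  # made-up sample text

# Character view: a plain string, so slicing yields characters.
print(raw[:11])

# Word view: a list of tokens (words and punctuation), so slicing yields tokens.
words = re.findall(r"\w+|[^\w\s]+", raw)
print(words[:4])

# Sentence view: a list of token lists, so slicing yields whole sentences.
# Ending a sentence at every "." is a deliberate simplification.
sents, current = [], []
for token in words:
    current.append(token)
    if token == ".":
        sents.append(current)
        current = []
print(sents[:1])
```

The same index range therefore covers very different amounts of text depending on which access method produced the object.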


1.6 Loading Your Own Corpus
>>> from nltk.corpus import PlaintextCorpusReader
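>>> corpus_root = '/usr/share/dict'  # in an absolute Windows path, remember to write each backslash as \\
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')  # '.*' is a regular expression that matches every file name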
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
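The second argument to `PlaintextCorpusReader` decides which files under the root directory become part of the corpus. To make the effect of the pattern concrete, here is a standard-library approximation of that selection step. It is a sketch of the idea, not NLTK's own code; the helper name `matching_fileids` and the file names are assumptions for this example.

```python
import os
import re
import tempfile


def matching_fileids(root, pattern):
    """List files under `root` whose relative path fully matches `pattern`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")  # use forward slashes on every platform
            if re.fullmatch(pattern, rel):
                found.append(rel)
    return sorted(found)


# Build a throwaway directory with three files (names are made up).
with tempfile.TemporaryDirectory() as root:
    for name in ["a.txt", "b.txt", "notes.md"]:
        with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
            fh.write("hello world\n")

    all_files = matching_fileids(root, r".*")
    txt_files = matching_fileids(root, r".*\.txt")

print(all_files)
print(txt_files)
```

A pattern such as `.*\.txt` is a common way to keep stray files like a README out of the corpus, whereas `.*` takes everything in the directory.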
