Building a Chatbot (Part 2): Corpora and Lexical Resources

Article link: https://blog.csdn.net/zsj470785068/article/details/77717982

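Modern natural language processing is largely statistical, and statistics needs a large number of samples, so corpora and lexical resources are indispensable.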

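1. NLTK corpora

NLTK provides access to many corpora, for example the Gutenberg corpus. The corpus data has to be downloaded once before use, e.g. with nltk.download('gutenberg'). To list its files: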

import nltk
nltk.corpus.gutenberg.fileids()

nltk.corpus.gutenberg: the corpus reader
nltk.corpus.gutenberg.raw('chesterton-brown.txt'): returns the raw text of chesterton-brown.txt
nltk.corpus.gutenberg.words('chesterton-brown.txt'): returns chesterton-brown.txt as a list of words
nltk.corpus.gutenberg.sents('chesterton-brown.txt'): returns chesterton-brown.txt as a list of sentences, each sentence being a list of words

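Similar corpora include:

from nltk.corpus import webtext: the web text corpus, with web and chat text
from nltk.corpus import brown: the Brown corpus, 500 texts from different sources, categorized by genre
from nltk.corpus import reuters: the Reuters corpus, more than 10,000 news documents
from nltk.corpus import inaugural: the inaugural address corpus, 55 presidential inaugural addresses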

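1.1 General structure of a corpus

Corpora are typically organized in one of these ways:
- Isolated (a collection of independent texts)
- Categorized (organized by category, with no overlap between categories)
- Overlapping (one text may belong to several categories)
- Temporal (language use changes over time)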
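
The difference between the categorized and overlapping layouts is perhaps easiest to see as a mapping from files to categories. The sketch below uses plain Python dictionaries; the file names, category names, and the helpers files_in and is_disjoint are all made up for illustration and do not come from any NLTK corpus.

```python
# Hypothetical file names and categories, for illustration only.
# Categorized: every file belongs to exactly one category.
categorized = {
    "a.txt": ["news"],
    "b.txt": ["fiction"],
}
# Overlapping: a file may belong to several categories.
overlapping = {
    "a.txt": ["grain", "trade"],
    "b.txt": ["trade"],
}


def files_in(mapping, category):
    """Return the sorted file names tagged with the given category."""
    return sorted(f for f, cats in mapping.items() if category in cats)


def is_disjoint(mapping):
    """True when no file carries more than one category."""
    return all(len(cats) == 1 for cats in mapping.values())


print(files_in(overlapping, "trade"))
print(is_disjoint(categorized), is_disjoint(overlapping))
```

In an overlapping corpus, looking up one category can return files that also show up under another category, which is why the two layouts need slightly different handling.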

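1.2 Common corpus interface

fileids(): returns the files in the corpus
categories(): returns the categories of the corpus
raw(): returns the raw content of the corpus
words(): returns the words of the corpus
sents(): returns the sentences of the corpus
abspath(fileid): returns the location of the given file on disk
open(fileid): opens a stream for reading the given corpus file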

1.3 Loading your own corpus

Collect your own corpus (plain text files) under some directory (for example /tmp), then run:

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/tmp'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

This lists the files in your own corpus. You can then call methods such as wordlists.sents('a.txt') and wordlists.words('a.txt') to get the sentences and words of a file.
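
For a version that can be tried without touching /tmp, the sketch below builds a throwaway directory first. The file names and contents are made up, and the pattern is narrowed to .txt files as an assumption about what one usually wants; sents() is left out because it appears to rely on a sentence tokenizer model that has to be downloaded separately.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a throwaway corpus directory; file names and contents are made up.
corpus_root = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello world. This is a test.",
    "b.txt": "Another small file.",
}
for name, content in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as f:
        f.write(content)

# Match only .txt files rather than everything under the root.
wordlists = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = wordlists.fileids()              # files found under corpus_root
words_a = list(wordlists.words("a.txt"))    # tokens of a single file
raw_b = wordlists.raw("b.txt")              # untokenized text of a single file

print(file_ids)
print(words_a)
print(raw_b)
```

Note that words() splits punctuation into separate tokens, so the full stop after "world" comes back as its own item.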

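1.4 Conditional frequency distributions

In natural language processing, a conditional frequency distribution is the frequency distribution of an event under a given condition.

For example, to output the frequency of each word under each category of the Brown corpus: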

import nltk
from nltk.corpus import brown

# List comprehension: genre iterates over the Brown corpus categories,
# word over the words in that category, giving (genre, word) pairs
genre_word = [(genre, word)
              for genre in brown.categories()
              for word in brown.words(categories=genre)]

# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)
# Plot the chosen conditions and samples
cfd.plot(conditions=['news', 'adventure'], samples=['stock', 'sunbonnet'])
# Tabulate the chosen conditions and samples
cfd.tabulate(conditions=['news', 'adventure'], samples=['stock', 'sunbonnet'])
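
To see what such a distribution holds without downloading the Brown corpus, here is a minimal standard-library sketch. It is a stand-in built from collections.Counter, not NLTK's implementation, and the (condition, sample) pairs are made up.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs standing in for (genre, word).
pairs = [
    ("news", "stock"), ("news", "stock"), ("news", "market"),
    ("adventure", "sunbonnet"), ("adventure", "stock"),
]

# One Counter per condition: condition -> {sample: count}.
cfd = defaultdict(Counter)
for condition, sample in pairs:
    cfd[condition][sample] += 1

print(cfd["news"]["stock"])           # count of "stock" given "news"
print(cfd["adventure"]["sunbonnet"])  # count of "sunbonnet" given "adventure"
print(cfd["news"]["sunbonnet"])       # unseen sample under this condition
```

Each condition gets its own frequency table, which is the same shape of data that the plot and tabulate calls read from.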

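We can also use a conditional frequency distribution over bigrams to generate text: starting from a seed word, repeatedly pick the word that most often follows the current one. Because the most likely successor is always chosen, the output is deterministic rather than truly random, and it tends to fall into a loop.

The word pairs come straight from the bigrams() function, which yields consecutive word pairs (a generator in NLTK 3, a list in older versions).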

import nltk


# Loop num times: print the current word, then move to its most likely successor in cfdist
def generate_model(cfdist, word, num=10):
    for i in range(num):
        print(word)
        word = cfdist[word].max()

# Load the corpus
text = nltk.corpus.genesis.words('english-kjv.txt')
# Build the bigrams
bigrams = nltk.bigrams(text)
# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(bigrams)

# Generate a word sequence starting from 'the'
generate_model(cfd, 'the')
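
The same greedy idea can be sketched with the standard library alone, which makes its behaviour easy to inspect on a tiny input. The toy sentence and the names followers and generate are made up for illustration; this is a stand-in for, not a copy of, what NLTK does internally.

```python
from collections import Counter, defaultdict

# Toy text, made up for illustration.
words = "the cat saw the dog and the cat saw the bird".split()

# Count followers for each word: word -> Counter of next words.
followers = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    followers[current][nxt] += 1


def generate(followers, word, num=6):
    """Greedily follow the most frequent next word, num times.

    Assumes every word reached has at least one follower; a word seen
    only at the very end of the text would raise IndexError here.
    """
    out = []
    for _ in range(num):
        out.append(word)
        word = followers[word].most_common(1)[0][0]
    return out


print(generate(followers, "the"))
```

On this input the sequence cycles through "the cat saw" and never reaches "dog" or "bird", which illustrates why always taking the most likely successor gets stuck in a loop.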