Natural Language Processing with Python, Chapter 2: Accessing Text Corpora and Lexical Resources

CopperDong

Article link: https://blog.csdn.net/QFire/article/details/78561986

#Gutenberg Corpus----literary works

Project Gutenberg

import nltk

nltk.corpus.gutenberg.fileids()

emma = nltk.corpus.gutenberg.words('austen-emma.txt') #"Emma" by Jane Austen

len(emma)

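Three statistics for a text: average word length, average sentence length, and the average number of times each word occurs

The sents() function divides a text into sentences, where each sentence is a list of words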

macbeth_sentences = nltk.corpus.gutenberg.sents("shakespeare-macbeth.txt")

macbeth_sentences

[[u'[', u'The', u'Tragedie', u'of', u'Macbeth', u'by', u'William', u'Shakespeare', u'1603', u']'], [u'Actus', u'Primus', u'.'], ...]
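The three statistics listed above can be computed directly from a list of sentences. The following is a minimal sketch rather than the book's exact code: `text_stats` is a name chosen here for illustration, and the two toy sentences stand in for the output of sents() so that the snippet runs without any corpus download. Punctuation tokens are counted as words, as they are in the corpus word lists.

```python
def text_stats(sents):
    """Return (average word length, average sentence length,
    average number of occurrences per distinct word).

    `sents` is a list of sentences, each a list of tokens, i.e. the
    same shape that gutenberg.sents(fileid) returns.
    """
    words = [w for sent in sents for w in sent]
    vocab = set(w.lower() for w in words)  # case-folded distinct tokens
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sents)
    avg_occurrences = len(words) / len(vocab)
    return avg_word_len, avg_sent_len, avg_occurrences


# Toy data (an assumption for illustration, not taken from the corpus)
toy_sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
avg_word_len, avg_sent_len, avg_occurrences = text_stats(toy_sents)
print(round(avg_word_len, 2), avg_sent_len, avg_occurrences)
```

With the NLTK data installed, the same function should accept nltk.corpus.gutenberg.sents('austen-emma.txt') directly.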

#Web and chat text---forums, movie scripts

from nltk.corpus import webtext

for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

from nltk.corpus import nps_chat

chatroom = nps_chat.posts('10-19-20s_706posts.xml')

chatroom[123]

#Brown Corpus----an electronic corpus of English, created in 1961, containing news, editorials, etc.

from nltk.corpus import brown

brown.categories()

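The Brown Corpus is a convenient resource for studying systematic differences between genres. Let's compare how modal verbs are used in different genres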

news_text = brown.words(categories='news')

fdist = nltk.FreqDist([w.lower() for w in news_text])

modals = ['can', 'could', 'may', 'might', 'must', 'will']

for m in modals:
    print(m + ':', fdist[m], end=' ')

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Conditional frequency distribution

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

modals = ['can', 'could', 'may', 'might', 'must', 'will']

cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13

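The table shows that the most frequent modal verb in the news genre is will, while the most frequent modal verb in the romance genre is could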
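To make the idea of a conditional frequency distribution concrete, here is a minimal sketch using only the standard library. It is not NLTK's implementation: `conditional_freq` is a hypothetical helper, and the (genre, word) pairs are invented for illustration rather than taken from the Brown Corpus. Each condition (here a genre) gets its own counter of samples (here words).

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Map each condition to a Counter of the samples seen under it."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Toy (genre, word) pairs; invented for illustration, not corpus data.
toy_pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]
toy_cfd = conditional_freq(toy_pairs)

# Report the most frequent sample under each condition.
for genre in ("news", "romance"):
    word, count = toy_cfd[genre].most_common(1)[0]
    print(genre, word, count)
```

nltk.ConditionalFreqDist is built from the same kind of (condition, sample) pairs and adds conveniences such as tabulate().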

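#Reuters Corpus----10,788 news documents, 90 topics

The documents are split into two sets, "training" and "test"; for example, the document with fileid "test/14826" belongs to the test set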

from nltk.corpus import reuters

reuters.fileids()

reuters.categories()

Unlike the Brown Corpus, the categories in the Reuters Corpus overlap with each other, because a news story often covers several topics

reuters.categories('training/9865')

[u'barley', u'corn', u'grain', u'wheat']
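Because one document can carry several categories, the document-to-category mapping can be inverted to find every document under a topic. The sketch below assumes a plain dictionary in place of the corpus reader; only the first entry comes from the output above, and the other two fileids are hypothetical.

```python
from collections import defaultdict

# Document -> categories. The first entry is the one shown above;
# the other two are hypothetical, made up for illustration.
doc_categories = {
    "training/9865": ["barley", "corn", "grain", "wheat"],
    "example/0001": ["grain", "wheat"],
    "example/0002": ["corn"],
}


def invert(doc_to_cats):
    """Build a category -> list of documents mapping."""
    cat_to_docs = defaultdict(list)
    for doc, cats in doc_to_cats.items():
        for cat in cats:
            cat_to_docs[cat].append(doc)
    return cat_to_docs


category_docs = invert(doc_categories)
print(sorted(category_docs["grain"]))
```

In NLTK itself the reverse lookup is already provided: reuters.fileids('barley') returns the documents in that category.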

reuters.words('training/9865')[:14] #the first few words are the title

#Inaugural Address Corpus

from nltk.corpus import inaugural

inaugural.fileids()

[u'1789-Washington.txt',
 u'1793-Washington.txt',
 u'1797-Adams.txt',
 u'1801-Jefferson.txt',
 u'1805-Jefferson.txt',
 u'1809-Madison.txt',
 u'1813-Madison.txt',
 u'1817-Monroe.txt',
 u'1821-Monroe.txt',
 u'1825-Adams.txt', ...]

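#Annotated text corpora

Many text corpora contain linguistic annotations, such as part-of-speech tags, named entities, syntactic structures, and semantic roles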

http://www.nltk.org/data

http://www.nltk.org/howto

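#Corpora in other languages

#Structure of text corpora

#Loading your own corpus

2. Conditional Frequency Distributions

4. Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information.

#Wordlist corpora

Filter out uncommon or misspelled words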

def unusual_words(text):
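The body of unusual_words is not shown above. A minimal sketch of the idea follows, assuming the reference vocabulary is passed in explicitly: the second parameter and the toy word lists are assumptions made here so the snippet runs without corpus data. With the NLTK data installed, nltk.corpus.words.words() could supply the vocabulary instead.

```python
def unusual_words(text, known_vocab):
    """Return the alphabetic words in `text` that are absent from `known_vocab`."""
    text_vocab = set(w.lower() for w in text if w.isalpha())
    known = set(w.lower() for w in known_vocab)
    return sorted(text_vocab - known)


# Toy inputs, invented for illustration
toy_vocab = ["the", "cat", "sat", "on", "mat"]
toy_text = ["The", "cat", "szat", "on", "the", "matt", "."]
unusual = unusual_words(toy_text, toy_vocab)
print(unusual)
```

The set difference does the filtering: anything left over is either a rare word or a misspelling.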

Stopwords corpus

from nltk.corpus import stopwords

stopwords.words('english')
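One common use of a stopword list is to measure what fraction of a text is made up of content words. The sketch below assumes a small hand-picked stopword list as a stand-in for stopwords.words('english'), and the toy sentence is invented for illustration.

```python
def content_fraction(text, stop_words):
    """Fraction of tokens in `text` that are not stopwords."""
    stop_set = set(stop_words)
    content = [w for w in text if w.lower() not in stop_set]
    return len(content) / len(text)


# Hand-picked stand-in for the full NLTK list (assumption for illustration)
toy_stopwords = ["the", "of", "and", "to", "a", "in", "is"]
toy_text = ["The", "cat", "is", "in", "the", "garden"]
fraction = content_fraction(toy_text, toy_stopwords)
print(round(fraction, 2))
```

With the NLTK data installed, the second argument could be stopwords.words('english').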

Names corpus

names = nltk.corpus.names

names.fileids() #male and female names

[u'female.txt', u'male.txt']

#Pronouncing dictionary

entries = nltk.corpus.cmudict.entries()

len(entries)

entries[1]

(u'a.', [u'EY1'])

Each word is paired with a list of phonetic codes. The symbols in the CMU Pronouncing Dictionary come from Arpabet; see http://en.wikipedia.org/wiki/Arpabet
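Since each entry is a (word, phone list) pair, simple list operations are enough to query pronunciations. The sketch below uses a few hand-typed entries in that shape instead of loading cmudict; `stress` and `words_ending_with` are illustrative helper names, not part of NLTK. In Arpabet, the digit on a vowel marks stress (1 primary, 2 secondary, 0 unstressed).

```python
# Entries in the same (word, phones) shape as cmudict.entries().
toy_entries = [
    ("a.", ["EY1"]),                # the entry shown above
    ("fir", ["F", "ER1"]),          # hand-typed for illustration
    ("fire", ["F", "AY1", "ER0"]),  # hand-typed for illustration
]


def stress(pron):
    """Extract the stress digits from a pronunciation."""
    return [char for phone in pron for char in phone if char.isdigit()]


def words_ending_with(entries, ending):
    """Return words whose pronunciation ends with the given phone sequence."""
    n = len(ending)
    return [word for word, pron in entries if pron[-n:] == ending]


stress_patterns = {word: stress(pron) for word, pron in toy_entries}
print(stress_patterns)
print(words_ending_with(toy_entries, ["ER1"]))
```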

#Comparative wordlists

Lists of about 200 common words in several languages

#Lexical tools: Toolbox and Shoebox

Download from http://www.sil.org/computing/toolbox

5. WordNet

A semantically oriented dictionary of English, with 155,287 words and 117,659 synonym sets (synsets)

#Senses and synonyms

from nltk.corpus import wordnet as wn

wn.synsets('motorcar') #synsets that contain this word

[Synset('car.n.01')]

wn.synset('car.n.01').lemma_names() #the synonyms in this synset

[u'car', u'auto', u'automobile', u'machine', u'motorcar']

wn.synset('car.n.01').definition() #definition

u'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

#The WordNet hierarchy

A unique root synset

Hypernyms and hyponyms

Antonyms

Semantic similarity
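The hierarchy and similarity ideas above can be illustrated on a tiny hand-made taxonomy. This is a sketch under stated assumptions, not WordNet's data or NLTK's code: the `HYPERNYM` table is invented, every node has a single parent, and the similarity score is taken to be 1 / (path length + 1), so it shrinks as two nodes sit further apart in the tree.

```python
# Toy hypernym links (child -> parent); hand-made, NOT actual WordNet data.
HYPERNYM = {
    "vehicle": "entity",
    "animal": "entity",
    "car": "vehicle",
    "truck": "vehicle",
    "dog": "animal",
}


def hypernym_path(node):
    """Return the chain from `node` up to the unique root."""
    path = [node]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Return the most specific node that is an ancestor of both a and b."""
    ancestors_b = set(hypernym_path(b))
    for node in hypernym_path(a):
        if node in ancestors_b:
            return node
    return None


def path_similarity(a, b):
    """Score in (0, 1]: 1 / (number of edges between a and b + 1)."""
    common = lowest_common_hypernym(a, b)
    if common is None:
        return None
    distance = hypernym_path(a).index(common) + hypernym_path(b).index(common)
    return 1.0 / (distance + 1)


print(lowest_common_hypernym("car", "truck"), path_similarity("car", "truck"))
print(lowest_common_hypernym("car", "dog"), path_similarity("car", "dog"))
```

On real data, the corresponding NLTK synset methods are hypernyms(), hyponyms(), lowest_common_hypernyms() and path_similarity().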

7. Further Reading

Ethnologue maintains the most complete inventory of the world's languages: http://www.ethnologue.com