Study Notes on Natural Language Processing with Python: chap2

foursight

Category: NLP. Tags: python, natural language processing

Article link: https://blog.csdn.net/fouronesight/article/details/71155319

Chapter 2. Accessing Text Corpora and Lexical Resources

1. Corpora and related resources
2. Conditional frequency distributions
3. WordNet

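Gutenberg corpus: Project Gutenberg

import nltk
from nltk.corpus import gutenberg
gutenberg.fileids()  # list the file identifiers in the corpus
# gutenberg.words(fileid) returns the text as a list of word tokens
# gutenberg.raw(fileid) returns the raw file contents as one string, whitespace included
# gutenberg.sents(fileid) splits the text into sentences, each a list of words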

Web and chat text

from nltk.corpus import webtext
from nltk.corpus import nps_chat

Brown corpus

from nltk.corpus import brown

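It can be used to study systematic differences between genres,
for example by comparing which words different genres choose to express the same meaning.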

Reuters corpus

from nltk.corpus import reuters

Inaugural Address corpus

from nltk.corpus import inaugural

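Annotated text corpora
For further details, see the NLTK link.

Structure of text corpora
Link: basic corpus functionality
Loading your own local corpus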

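# plain-text (.txt) files
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/address/'  # placeholder: directory that holds the files
wordlists = PlaintextCorpusReader(corpus_root, '.*')  # '.*' matches every file name
# parsed (bracketed treebank) corpus
from nltk.corpus import BracketParseCorpusReader
corpus_root = '/address/'  # placeholder: directory that holds the parsed files
file_pattern = '[regular expression]'  # placeholder: a regex matching the corpus file names
ptb = BracketParseCorpusReader(corpus_root, file_pattern)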

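Conditions and events
We usually look at the word frequency distribution within each text category (the condition) using nltk.ConditionalFreqDist, which is built from a list of (condition, event) pairs.
e.g. pairs = [('news', 'word1'), ('romance', 'word2'), ('news', 'word2'), ...]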

import nltk
from nltk.corpus import brown
# one (genre, word) pair per token: the genre is the condition, the word is the event
cfd = nltk.ConditionalFreqDist((genre, word)
                        for genre in brown.categories()
                        for word in brown.words(categories=genre))
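
To make the (condition, event) idea concrete, here is a minimal sketch of what a conditional frequency distribution stores, written with the standard library only rather than NLTK. The function name `conditional_freq_dist` and the toy pairs are assumptions made for illustration; nltk.ConditionalFreqDist builds a richer interface (plotting, tabulation) on the same idea.

```python
from collections import Counter, defaultdict

def conditional_freq_dist(pairs):
    """Group (condition, event) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd

# Toy data, made up for illustration (not taken from the Brown corpus).
pairs = [('news', 'could'), ('romance', 'could'), ('news', 'will'),
         ('news', 'will'), ('romance', 'might')]
cfd = conditional_freq_dist(pairs)
print(sorted(cfd))          # the conditions: ['news', 'romance']
print(cfd['news']['will'])  # count of 'will' under 'news': 2
```

Each condition maps to its own frequency distribution, so comparing genres amounts to looking up the same word under different conditions.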

Plotting and tabulating distributions
Generating random text with bigrams

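# Collect all bigrams of the Genesis text, then generate text from a seed word
# by repeatedly following the most frequent next word.
import nltk
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
def generate_word(cfd, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfd[word].max()  # most likely successor, so the output is deterministic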

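
The same bigram idea can be sketched without NLTK or the Genesis corpus. The helper names `bigram_model` and `generate` and the toy sentence below are assumptions for illustration only. Like the code above, the sketch always follows the most frequent successor, so it is deterministic rather than truly random.

```python
from collections import Counter, defaultdict

def bigram_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        model[first][second] += 1
    return model

def generate(model, word, num=5):
    """Return up to num words, following the most frequent successor each time."""
    out = []
    for _ in range(num):
        out.append(word)
        if not model[word]:  # dead end: nothing ever followed this word
            break
        word = model[word].most_common(1)[0][0]
    return out

# Toy sentence, made up for illustration.
words = 'the cat saw the dog and the cat saw the bird'.split()
model = bigram_model(words)
print(' '.join(generate(model, 'the')))  # the cat saw the cat
```

The output quickly falls into a loop, which is the usual weakness of always choosing the single most likely next word.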
Wordlist corpora
Words corpus (nltk.corpus.words)
Stopwords corpus (nltk.corpus.stopwords)
Names corpus (…….names)
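
One common use of a stopword list is to measure how much of a text is content words. The sketch below assumes a tiny hand-picked stopword set and a hypothetical helper `content_fraction`; in practice the list would come from nltk.corpus.stopwords.words('english').

```python
def content_fraction(words, stopwords):
    """Fraction of tokens that are not stopwords (case-insensitive)."""
    content = [w for w in words if w.lower() not in stopwords]
    return len(content) / len(words)

# A tiny hand-picked stopword set, assumed for illustration only.
toy_stopwords = {'the', 'on', 'a', 'of'}
tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(content_fraction(tokens, toy_stopwords))  # 0.5
```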

Using WordNet
1. Finding synonyms

from nltk.corpus import wordnet as wn
wn.synsets('motorcar')  # output: [Synset('car.n.01')]
wn.synset('car.n.01').lemma_names()  # output: 'car', 'machine', etc.
# also available: wn.synset('car.n.01').definition() and .examples()

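WordNet is organized as a hierarchy of synsets that descends from root synsets.
Its relations include hypernyms/hyponyms, parts/wholes, entailments, antonyms, etc.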
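
The hypernym/hyponym relation can be pictured as a tree walked upward or downward. The sketch below uses a hand-written toy hierarchy, not data loaded from WordNet, and the helper names are assumptions. Real WordNet synsets may have more than one hypernym, and NLTK exposes these relations through methods such as hypernyms(), hyponyms() and hypernym_paths() on a synset.

```python
# Hand-written toy hierarchy (child -> parent), NOT loaded from WordNet.
toy_hypernyms = {
    'car': 'motor_vehicle',
    'truck': 'motor_vehicle',
    'motor_vehicle': 'vehicle',
    'vehicle': 'entity',
}

def hypernym_path(word, hypernyms):
    """Walk up from a word to the root, returning the path root-first."""
    path = [word]
    while path[-1] in hypernyms:
        path.append(hypernyms[path[-1]])
    return path[::-1]

def hyponyms(word, hypernyms):
    """Direct children of a word: the inverse of the hypernym relation."""
    return sorted(child for child, parent in hypernyms.items() if parent == word)

print(hypernym_path('car', toy_hypernyms))       # ['entity', 'vehicle', 'motor_vehicle', 'car']
print(hyponyms('motor_vehicle', toy_hypernyms))  # ['car', 'truck']
```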