NLTK Study Notes

qiqzhang

Reference book: http://nltk.googlecode.com/svn/trunk/doc/book/

1. Downloading data through a proxy

import nltk

nltk.set_proxy("**.com:80")

nltk.download()

2. Calling sents(fileid) fails with: Resource 'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource:

import nltk

nltk.download()

In the downloader window, select the 'Models' tab, find 'punkt' in the 'Identifier' column, and click Download to install that data package.
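The same package can usually be fetched without opening the downloader window. The sketch below assumes the resource named in the error message above; `ensure_resource` is a hypothetical helper name introduced here for illustration, not part of NLTK.

```python
import nltk


def ensure_resource(resource_path, package):
    """Download `package` only if `resource_path` is not installed yet.

    Illustrative helper (not an NLTK function).
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return True
    except LookupError:
        # nltk.download() returns True on success, False otherwise
        return nltk.download(package)
```

Calling `ensure_resource("tokenizers/punkt", "punkt")` once before `sents()` should avoid the error. If the message on your installation names a different resource, pass that path and identifier instead.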

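3. Functions for accessing corpus elements

from nltk.corpus import webtext

webtext.fileids() # ids of all files in the corpus

webtext.raw(fileid) # raw text (all characters) of the given file

webtext.words(fileid) # all words of the given file

webtext.sents(fileid) # all sentences of the given file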

Example	Description
`fileids()`	the files of the corpus
`fileids([categories])`	the files of the corpus corresponding to these categories
`categories()`	the categories of the corpus
`categories([fileids])`	the categories of the corpus corresponding to these files
`raw()`	the raw content of the corpus
`raw(fileids=[f1,f2,f3])`	the raw content of the specified files
`raw(categories=[c1,c2])`	the raw content of the specified categories
`words()`	the words of the whole corpus
`words(fileids=[f1,f2,f3])`	the words of the specified fileids
`words(categories=[c1,c2])`	the words of the specified categories
`sents()`	the sentences of the whole corpus
`sents(fileids=[f1,f2,f3])`	the sentences of the specified fileids
`sents(categories=[c1,c2])`	the sentences of the specified categories
`abspath(fileid)`	the location of the given file on disk
`encoding(fileid)`	the encoding of the file (if known)
`open(fileid)`	open a stream for reading the given corpus file
`root()`	the path to the root of locally installed corpus
`readme()`	the contents of the README file of the corpus
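To try these accessors without downloading webtext, the sketch below swaps in a `PlaintextCorpusReader` over a temporary directory. The file names and contents are made up for illustration. `sents()` is left out because it needs the punkt data from step 2.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Two tiny files stand in for a real corpus (made-up content).
samples = {
    "a.txt": "Hello world. NLTK is fun.",
    "b.txt": "Second file here.",
}

with tempfile.TemporaryDirectory() as root:
    for name, content in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as f:
            f.write(content)

    # Treat every .txt file under root as one corpus file.
    corpus = PlaintextCorpusReader(root, r".*\.txt")

    file_ids = corpus.fileids()            # ids of all files in the corpus
    raw_a = corpus.raw("a.txt")            # content of one file as a string
    words_a = list(corpus.words("a.txt"))  # tokens of one file
    words_all = list(corpus.words())       # tokens of the whole corpus

print(file_ids)
print(words_a)
```

The corpus views are read lazily, so they are converted to lists while the temporary directory still exists.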

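4. Common text-processing functions

Assume text is a list of words. The methods from collocations() onward need an nltk.Text object, which also supports the list-style calls.

len(text) # number of words

set(text) # remove duplicates

sorted(text) # sort

text.count('a') # number of occurrences of the given word

text.index('a') # position of the first occurrence of the given word

FreqDist(text) # words and their frequencies; keys() gives the words, indexing with [key] gives the count

FreqDist(text).plot(50,cumulative=True) # cumulative frequency plot of the 50 most frequent words

bigrams(text) # all adjacent word pairs

text.collocations() # adjacent word pairs that occur together frequently

text.concordance("word") # occurrences of the given word with their context

text.similar("word") # words that appear in contexts similar to the given word

text.common_contexts(["a", "b"]) # contexts shared by the two words

text.dispersion_plot(['a','b','c',...]) # plot comparing where the words occur in the text

text.generate() # generate a random piece of text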
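A few of the calls above, run on a toy token list, might look like the following sketch. The sentence is made up. In NLTK 3, `bigrams()` returns a generator, so it is wrapped in `list()` here.

```python
import nltk

# Made-up sample sentence, split on whitespace.
tokens = "the cat sat on the mat and the dog sat too".split()

n_tokens = len(tokens)            # number of words
vocab = sorted(set(tokens))       # unique words, sorted
the_count = tokens.count("the")   # occurrences of "the"
first_sat = tokens.index("sat")   # position of the first "sat"

fdist = nltk.FreqDist(tokens)     # word -> frequency
top_two = fdist.most_common(2)    # two most frequent words with counts

pairs = list(nltk.bigrams(tokens))  # all adjacent word pairs

# Wrapping the tokens in nltk.Text enables concordance() and friends.
text = nltk.Text(tokens)
text.concordance("cat")           # prints each match with its context
```

`collocations()` is omitted because it relies on the stopwords corpus, which has to be downloaded first.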

NLTK's Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

Example	Description
`cfdist = ConditionalFreqDist(pairs)`	create a conditional frequency distribution from a list of pairs
`cfdist.conditions()`	alphabetically sorted list of conditions
`cfdist[condition]`	the frequency distribution for this condition
`cfdist[condition][sample]`	frequency for the given sample for this condition
`cfdist.tabulate()`	tabulate the conditional frequency distribution
`cfdist.tabulate(samples, conditions)`	tabulation limited to the specified samples and conditions
`cfdist.plot()`	graphical plot of the conditional frequency distribution
`cfdist.plot(samples, conditions)`	graphical plot limited to the specified samples and conditions
`cfdist1 < cfdist2`	test if samples in `cfdist1` occur less frequently than in `cfdist2`
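The table can be exercised on a handful of (condition, sample) pairs, as in the sketch below. The genre labels and words are toy data. The conditions are sorted explicitly, since the order returned by `conditions()` may depend on the NLTK version.

```python
import nltk

# (condition, sample) pairs; genres and words are made-up toy data.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

cfdist = nltk.ConditionalFreqDist(pairs)

conditions = sorted(cfdist.conditions())  # all conditions, sorted here
news_the = cfdist["news"]["the"]          # count of "the" under "news"
romance_love = cfdist["romance"]["love"]  # count of "love" under "romance"

# Print a table restricted to the chosen conditions and samples.
cfdist.tabulate(conditions=["news", "romance"], samples=["the", "love"])
```

Each `cfdist[condition]` is itself a `FreqDist`, so a sample never seen under a condition simply counts as 0.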

to be continued
