NLTK Study Notes

Reference book: http://nltk.googlecode.com/svn/trunk/doc/book/


1. Downloading data through a proxy

nltk.set_proxy("**.com:80")

nltk.download()


2. Calling sents(fileid) raises: Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource:

import nltk

nltk.download()

In the downloader window, select the 'Models' tab, find 'punkt' in the 'Identifier' column, and click Download to install the package.
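
The same download can also be done without the GUI. The following is a minimal sketch, assuming you want to fetch a package only when it is missing. The helper name ensure_resource and its downloader parameter are made up for illustration; nltk.data.find and nltk.download are the real NLTK calls. Newer NLTK releases may name the required package differently, so use whatever identifier the error message asks for.

```python
import nltk


def ensure_resource(resource_path, package_id, downloader=nltk.download):
    """Return True if the resource is available, downloading it if needed.

    ensure_resource is a hypothetical helper, not part of NLTK.
    """
    try:
        # nltk.data.find raises LookupError when the resource is not installed.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return bool(downloader(package_id))
```

Typical usage would be ensure_resource('tokenizers/punkt', 'punkt') before the first call to sents().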


3. Corpus access functions

from nltk.corpus import webtext

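webtext.fileids()      # list of all file ids in the corpus

webtext.raw(fileid)    # raw text of the given file as a single string

webtext.words(fileid)  # all words of the given file

webtext.sents(fileid)  # all sentences of the given file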

Example                     Description
fileids()                   the files of the corpus
fileids([categories])       the files of the corpus corresponding to these categories
categories()                the categories of the corpus
categories([fileids])       the categories of the corpus corresponding to these files
raw()                       the raw content of the corpus
raw(fileids=[f1,f2,f3])     the raw content of the specified files
raw(categories=[c1,c2])     the raw content of the specified categories
words()                     the words of the whole corpus
words(fileids=[f1,f2,f3])   the words of the specified fileids
words(categories=[c1,c2])   the words of the specified categories
sents()                     the sentences of the whole corpus
sents(fileids=[f1,f2,f3])   the sentences of the specified fileids
sents(categories=[c1,c2])   the sentences of the specified categories
abspath(fileid)             the location of the given file on disk
encoding(fileid)            the encoding of the file (if known)
open(fileid)                open a stream for reading the given corpus file
root()                      the path to the root of locally installed corpus
readme()                    the contents of the README file of the corpus
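
To try these accessors without downloading anything, here is a small sketch that builds a throwaway corpus from two text files. The file names and contents are invented, and the regex sentence splitter is an assumption that stands in for the punkt model, so sentence boundaries will be cruder than with the real tokenizer.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Build a tiny two-file corpus in a temporary directory (made-up data).
corpus_root = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello world. This is NLTK.",
    "b.txt": "A second file.",
}
for name, content in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf8") as handle:
        handle.write(content)

# Assumption: a simple regex splitter replaces the default punkt model,
# so sents() works without any downloaded data.
sentence_splitter = RegexpTokenizer(r"[^.!?]+[.!?]")
corpus = PlaintextCorpusReader(
    corpus_root, r".*\.txt", sent_tokenizer=sentence_splitter
)

file_ids = corpus.fileids()                         # ids of all files
raw_a = corpus.raw("a.txt")                         # one string
words_a = list(corpus.words("a.txt"))               # list of tokens
sents_a = [list(s) for s in corpus.sents("a.txt")]  # list of token lists
all_words = list(corpus.words())                    # tokens of the whole corpus

print(file_ids)
print(words_a)
print(sents_a)
```

The same method names work on the bundled corpora such as webtext once their data has been downloaded.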

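4. Common text-processing functions

Assume text is a list of words. The methods from collocations() onward belong to nltk.Text, so wrap the word list as nltk.Text(words) before calling them.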

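len(text)  # number of words

set(text)  # remove duplicates

sorted(text) # sort

text.count('a') # number of occurrences of the given word

text.index('a') # position of the first occurrence of the given word

FreqDist(text) # words and their frequencies: keys() gives the words, fdist[key] gives the count

FreqDist(text).plot(50,cumulative=True) # cumulative frequency plot of the 50 most frequent words

bigrams(text) # all pairs of adjacent words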

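text.collocations() # find word pairs that frequently occur next to each other

text.concordance("word") # show every occurrence of the given word with its context

text.similar("word") # find words that appear in contexts similar to the given word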

text.common_contexts(["a", "b"]) # find the contexts shared by the two words

text.dispersion_plot(['a','b','c',...]) # plot where each word occurs across the text, for comparison

text.generate() # generate a random piece of text
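
A few of the functions above can be tried on a toy word list. This is only a sketch: the sentence is invented, and only calls that need no downloaded data are used (collocations(), for example, is left out for that reason).

```python
from nltk import FreqDist, bigrams
from nltk.text import Text

# Made-up sample sentence, split into a plain list of words.
words = "the cat sat on the mat and the dog sat too".split()

n_tokens = len(words)            # number of words
vocabulary = sorted(set(words))  # distinct words, sorted

fdist = FreqDist(words)          # word -> count
the_count = fdist["the"]
top_two = fdist.most_common(2)

# bigrams() returns a generator, so materialise it to inspect the pairs.
pairs = list(bigrams(words))

# Text-only methods such as concordance() need an nltk.Text wrapper.
text = Text(words)
first_sat = text.index("sat")
sat_count = text.count("sat")
text.concordance("sat")

print(n_tokens, the_count, top_two)
print(pairs[:3])
```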


NLTK's Conditional Frequency Distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
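
The table can be exercised with a handful of (condition, sample) pairs. A minimal sketch, assuming invented genre/word pairs in place of a real corpus:

```python
from nltk import ConditionalFreqDist

# Made-up (condition, sample) pairs: the condition is a genre, the sample a word.
pairs = [
    ("news", "the"),
    ("news", "market"),
    ("news", "the"),
    ("romance", "love"),
    ("romance", "the"),
]

cfdist = ConditionalFreqDist(pairs)

conditions = sorted(cfdist.conditions())  # sorted here to be explicit
news_the = cfdist["news"]["the"]          # count of 'the' under 'news'
news_total = cfdist["news"].N()           # total samples under 'news'

# Print a count table limited to the chosen conditions and samples.
cfdist.tabulate(conditions=["news", "romance"], samples=["the", "love"])
```

Each cfdist[condition] is an ordinary FreqDist, so everything that works on a FreqDist also works per condition.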

to be continued