Chapter 1: Python and NLTK

Article link: https://blog.csdn.net/oXiaChuan/article/details/47952929

Main usage

import nltk

from nltk import *

f=open("2008.txt")

raw=f.read()

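words = raw.split()  # whitespace splitting tokenizes poorly: punctuation stays attached to the words

set(w.lower() for w in words)

Convert every word to lowercase and collect the distinct forms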

tokens = nltk.word_tokenize(raw)
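To see why whitespace splitting falls short, here is a small sketch comparing `str.split()` with a simple regular-expression tokenizer. The regex is only a stand-in chosen for illustration, not what `nltk.word_tokenize` does internally, and the sample sentence is made up.

```python
import re

raw = "Hello, world! It's 2008."

# Whitespace splitting leaves punctuation glued to the words.
split_words = raw.split()

# Hypothetical stand-in tokenizer: runs of word characters, or a single
# character that is neither a word character nor whitespace (punctuation).
regex_tokens = re.findall(r"\w+|[^\w\s]", raw)

print(split_words)
print(regex_tokens)
```

A real tokenizer handles many more cases (contractions, abbreviations, quotes), which is the reason to prefer `nltk.word_tokenize` over either approach here.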

Two ways to write the output

f = open("out.txt", "w")

1) print(word, file=f)

2) g = ' '.join(words)

f.write(g)
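Both output styles can be tried end to end with the sketch below. The word list and the file names are made up for illustration, and the files go into a temporary directory so nothing in the working directory is overwritten.

```python
import os
import tempfile

words = ["more", "is", "said", "than", "done"]

# Hypothetical output locations inside a temporary directory.
out_dir = tempfile.mkdtemp()
path_lines = os.path.join(out_dir, "out_lines.txt")
path_joined = os.path.join(out_dir, "out_joined.txt")

# 1) print() with file=: writes one word per line.
with open(path_lines, "w", encoding="utf-8") as f:
    for word in words:
        print(word, file=f)

# 2) Join the words into one string, then write it in a single call.
with open(path_joined, "w", encoding="utf-8") as f:
    f.write(" ".join(words))

# Read both files back to compare the results.
with open(path_lines, encoding="utf-8") as f:
    lines_text = f.read()
with open(path_joined, encoding="utf-8") as f:
    joined_text = f.read()
```

The `with` blocks close each file automatically, which a bare `open()` does not.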

*Searching text

import nltk

Import the package

nltk.download()

Download the required data (corpora and models)

from nltk.book import *

from nltk import *

Load all the example texts from the book and NLTK's top-level names

text1.concordance("monstrous")

Keyword search: show every occurrence of the word together with its surrounding context
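The idea behind a concordance can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: it uses a window of whole tokens, the function name `concordance` and the `width` parameter are made up for this example, and the sample tokens are invented.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context on each side."""
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits

tokens = "a most monstrous size and a monstrous fine picture".split()
hits = concordance(tokens, "monstrous", width=2)
for line in hits:
    print(line)
```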

text1.similar("monstrous")

Find words that appear in similar contexts

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Draw a lexical dispersion plot, showing where each word occurs across the text
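The data behind a dispersion plot is just the list of positions at which each word occurs. A minimal sketch, assuming a made-up token list and a hypothetical helper named `offsets`:

```python
def offsets(tokens, target):
    """Return the token indices at which target occurs."""
    return [i for i, tok in enumerate(tokens) if tok == target]

tokens = "freedom and duties , freedom and democracy , freedom".split()
positions = {w: offsets(tokens, w) for w in ["freedom", "democracy", "duties"]}
print(positions)
```

Plotting each list of indices along a horizontal axis, one row per word, gives the kind of picture `dispersion_plot` draws.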

*Counting vocabulary

len(text3)

Length of the text in tokens (words and punctuation marks)

set(text3)

Get the vocabulary: the set of distinct tokens

sorted(set(text3))

Sort the vocabulary

len(set(text3))

Size of the vocabulary

from __future__ import division

Use floating-point division (needed only on Python 2; Python 3 divides this way by default)

text3.count("smote")

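Count how many times a word occurs

len(text3) / len(set(text3))

Lexical diversity: the average number of times each distinct token is used

100 * text4.count('a') / len(text4)

Percentage of the text taken up by a specific word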
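The two ratios above can be wrapped in small helper functions. This is a sketch on a made-up token list; the function names are chosen here for illustration.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100 * count / total

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

diversity = lexical_diversity(tokens)                     # 8 tokens over 6 distinct tokens
share_the = percentage(tokens.count("the"), len(tokens))  # 3 of 8 tokens

print(diversity, share_the)
```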

sent1.append('some')

Append 'some' to the end of sent1

Simple statistics

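fdist1 = FreqDist(text1)

Build a frequency distribution of the words in text1

vocabulary1 = list(fdist1.keys())

Collect the samples into a list (NLTK 2 returned them sorted by decreasing frequency; NLTK 3 does not, so use most_common() for that ordering)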

fdist1.most_common(50)

vocabulary1[:50]

fdist1['whale']

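Show the results: the 50 most frequent samples, the first 50 vocabulary entries, and the count of 'whale'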

fdist1.plot(50, cumulative=True)

Cumulative plot: the share of all tokens accounted for by the 50 most frequent words
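The numbers behind the cumulative plot can be computed by hand. The sketch below uses `collections.Counter` (which NLTK 3's `FreqDist` subclasses) as a stand-in so that it runs without the NLTK corpora; the token list is made up.

```python
from collections import Counter
from itertools import accumulate

tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

# The two most frequent samples with their counts.
top2 = counts.most_common(2)

# Running share of all tokens covered by the samples, most frequent first.
total = sum(counts.values())
ordered_counts = [c for _, c in counts.most_common()]
cumulative = [running / total for running in accumulate(ordered_counts)]

print(top2)
print(cumulative)
```

The last value is always 1.0, since the whole vocabulary covers every token.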

V=set(text1)

long_words=[w for w in V if len(w)>15]

Find the words in the text that are longer than 15 characters

fdist5 = FreqDist(text5)

sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])

Words longer than 7 characters that occur more than 7 times
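The same two-condition filter on a tiny sample, as a sketch: `Counter` stands in for `FreqDist`, the tokens are invented, and the thresholds are scaled down from 7 and 7 so that the small sample produces a result.

```python
from collections import Counter

tokens = ["chat", "question", "chat", "question", "lol", "everyone", "question", "lol"]
counts = Counter(tokens)

# Hypothetical thresholds, lowered to suit the tiny sample.
min_len, min_count = 5, 2

frequent_long = sorted(
    w for w in set(tokens) if len(w) > min_len and counts[w] > min_count
)
print(frequent_long)
```

"everyone" is long enough but too rare, and "chat" is frequent but too short, so only "question" passes both tests.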

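list(bigrams(['more', 'is', 'said', 'than', 'done']))

text4.collocations()

Find bigrams (adjacent word pairs) and collocations (bigrams that occur unusually often); in NLTK 3 bigrams() returns a generator, hence the list() call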
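What `bigrams` produces can be reproduced with `zip`. This is a minimal sketch with a hypothetical helper name, not NLTK's own code:

```python
def bigrams_of(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

pairs = bigrams_of(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

A sequence of n tokens yields n - 1 bigrams.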

w=[len(w) for w in text1]

Build the list of lengths of every word in the text

fdist = FreqDist([len(w) for w in text1])

Count how often each word length occurs

fdist.keys()

The distinct word lengths

fdist.items()

The (length, count) pairs

fdist.max()

The most frequent word length

fdist[3]

Number of words of length 3

fdist.freq(3)

Proportion of all words in the book that have length 3
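The word-length statistics above, worked through on a short made-up token list. `Counter` again stands in for `FreqDist`, so `most_common(1)` plays the role of `max()` and a manual division plays the role of `freq()`.

```python
from collections import Counter

tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", ",", "never", "mind"]

lengths = [len(w) for w in tokens]
length_counts = Counter(lengths)

most_common_length = length_counts.most_common(1)[0][0]    # analogue of fdist.max()
count_len4 = length_counts[4]                              # analogue of fdist[4]
freq_len4 = count_len4 / sum(length_counts.values())       # analogue of fdist.freq(4)

print(sorted(length_counts.items()))
print(most_common_length, count_len4, freq_len4)
```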

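fdist = FreqDist(samples)  # create a frequency distribution from the given samples
fdist.inc(sample)  # increment the count for this sample (NLTK 2; in NLTK 3 write fdist[sample] += 1)
fdist['monstrous']  # number of times the given sample occurred
fdist.freq('monstrous')  # frequency (proportion) of the given sample
fdist.N()  # total number of samples
fdist.keys()  # the samples, sorted by decreasing frequency (NLTK 2; NLTK 3 does not sort the keys, use most_common())
for sample in fdist:  # iterate over the samples in decreasing frequency order (NLTK 2 behavior)
fdist.max()  # the sample with the greatest count
fdist.tabulate()  # print the frequency distribution as a table
fdist.plot()  # plot the frequency distribution
fdist.plot(cumulative=True)  # plot the cumulative frequency distribution
fdist1 < fdist2  # test whether samples in fdist1 occur less frequently than in fdist2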
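Incremental counting, which older NLTK versions did with `inc`, can be sketched with `collections.Counter`, the class that NLTK 3's `FreqDist` is built on. The sample words are made up, and the total and frequency are computed by hand here in place of `N()` and `freq()`.

```python
from collections import Counter

counts = Counter(["monstrous", "whale", "whale"])

# Item assignment replaces the older inc(sample) call.
counts["monstrous"] += 1
counts["sea"] += 1  # a missing key starts at 0

n_samples = sum(counts.values())                   # analogue of fdist.N()
freq_monstrous = counts["monstrous"] / n_samples   # analogue of fdist.freq('monstrous')

print(n_samples, freq_monstrous)
```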

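s.startswith(t)  # test whether s starts with t
s.endswith(t)  # test whether s ends with t
t in s  # test whether s contains t
s.islower()  # test whether all cased characters in s are lowercase (and there is at least one)
s.isupper()  # test whether all cased characters in s are uppercase (and there is at least one)
s.isalpha()  # test whether all characters in s are letters
s.isalnum()  # test whether all characters in s are letters or digits
s.isdigit()  # test whether all characters in s are digits
s.istitle()  # test whether s is titlecased (every word in s starts with a capital letter)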
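A quick way to get a feel for these predicates is to run them over a few sample strings. The samples below are made up; the sketch records which predicates each string satisfies.

```python
samples = ["whale", "WHALE", "Whale", "2008", "x86", "!"]

# Names of the string predicates to try on each sample.
predicates = ["islower", "isupper", "istitle", "isalpha", "isalnum", "isdigit"]

# Map each sample to the predicates that return True for it.
table = {s: [p for p in predicates if getattr(s, p)()] for s in samples}

for s, names in table.items():
    print(repr(s), names)

# startswith / endswith / in work on substrings rather than character classes.
ends_in_ness = [w for w in ["kindness", "nest", "Darkness"] if w.endswith("ness")]
```

Note that "x86" counts as lowercase, because digits are not cased characters, while "2008" is neither lowercase nor uppercase.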

[w.upper() for w in text1]

Convert to uppercase

[word.lower() for word in text1]

Convert to lowercase

set([word.lower() for word in text1 if word.isalpha()])

Drop numbers and punctuation, keeping only the lowercased alphabetic tokens
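The effect of this normalization on vocabulary size shows up even on a tiny made-up token list:

```python
tokens = ["The", "whale", ",", "the", "Whale", "!", "1851", "sea"]

# Raw vocabulary: case variants, punctuation and numbers all count separately.
vocab_raw = set(tokens)

# Normalized vocabulary: alphabetic tokens only, folded to lowercase.
vocab_normalized = set(w.lower() for w in tokens if w.isalpha())

print(len(vocab_raw), len(vocab_normalized))
print(sorted(vocab_normalized))
```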

if len(word) < 5:
    print('word length is less than 5')

Test the condition and print a message when word is shorter than 5 characters

for word in ['a', 'b', 'c']:
    print(word)

Print each item in a loop

for word in sent1:
    if word.endswith('l'):
        print(word)

A loop with a condition, combining for and if

for word in sent1:
    if word.islower():
        print(word, 'is a lowercase word')
    elif word.istitle():
        print(word, 'is a titlecase word')
    else:
        print(word, 'is punctuation')

Using for together with if, elif and else
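The same branching can be put into a function, which makes it easy to check what ends up in the final branch. This is a sketch with a hypothetical function name and made-up tokens; the last branch is labelled "other" because it catches more than punctuation.

```python
def classify(token):
    """Label a token by its capitalization pattern."""
    if token.islower():
        return "lowercase"
    elif token.istitle():
        return "titlecase"
    else:
        return "other"

sent = ["Call", "me", "Ishmael", "."]
labels = [(w, classify(w)) for w in sent]
print(labels)

# All-caps words and numbers also fall through to the final branch.
print(classify("USA"), classify("1851"))
```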