Computing with Language: Simple Statistics

fessigy

Article link: https://blog.csdn.net/fessigy/article/details/73605542

Frequency Distributions

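# Load the example texts (text1 ... text9) together with FreqDist
from nltk.book import *
# Build a frequency distribution over the tokens of text1
fdist1 = FreqDist(text1)
# Display a summary of the distribution
fdist1
# The 50 most frequent tokens with their counts
fdist1.most_common(50)
# Number of occurrences of 'whale'
fdist1['whale']
# Cumulative frequency plot of the 50 most frequent tokens
fdist1.plot(50, cumulative=True)
# Hapaxes: tokens that occur exactly once
fdist1.hapaxes()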
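
The same operations can be tried without downloading the NLTK corpora. The following is a minimal sketch that uses the standard library's collections.Counter (which FreqDist subclasses in NLTK 3) on a made-up token list. The `tokens` list and the `hapaxes` helper are illustrative assumptions, not part of NLTK.

```python
from collections import Counter

# Hypothetical toy token list standing in for text1
tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

# Counter plays the role of FreqDist in this sketch
fdist = Counter(tokens)

# The two most frequent tokens with their counts
top_two = fdist.most_common(2)

# Number of occurrences of 'whale'
whale_count = fdist["whale"]


def hapaxes(counts):
    """Return the tokens that occur exactly once (stand-in for FreqDist.hapaxes())."""
    return [token for token, count in counts.items() if count == 1]


print(top_two)
print(whale_count)
print(hapaxes(fdist))
```

On the real corpus the same calls apply to fdist1; only the data differs.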

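# V is the vocabulary of text1; note that it is a set, not a list
V = set(text1)
# Words in V that are longer than 15 characters
long_words = [w for w in V if len(w) > 15]
# Sort them alphabetically
sorted(long_words)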

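Python's list comprehension here closely resembles mathematical set-builder notation. Compared with the Java I am currently using, it reads much more like the language of mathematics.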
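
To make the correspondence concrete: the long_words expression reads like the set {w | w ∈ V and len(w) > 15}. The sketch below applies the same comprehension to a small made-up vocabulary, so it runs without NLTK; the contents of `V` here are an assumption for illustration only.

```python
# Hypothetical toy vocabulary standing in for set(text1)
V = {"whale", "uncharitableness", "sea", "circumnavigation", "characteristically"}

# Set-builder notation {w | w in V and len(w) > 15} written as a comprehension,
# then sorted alphabetically
long_words = sorted(w for w in V if len(w) > 15)

print(long_words)
```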

# Words longer than 7 characters that occur more than 7 times
# (frequent words that are characteristic of the text)
fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

Collocations and Bigrams

Bigrams (pairs of adjacent words)

bigrams(['more','is','said','than','done'])

Running the code above directly raises an error:

Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
bigrams(['more','is','said','than','done'])
NameError: name 'nltk' is not defined

NLTK has to be imported first:

from nltk import *

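After the import the call runs, but the pairs are not displayed: bigrams returns a generator object, so it has to be wrapped in list() to see the result.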

list(bigrams(['more','is','said','than','done']))
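
Why list() is needed can be seen in a small pure-Python sketch. The `simple_bigrams` function below is a hypothetical stand-in for nltk.bigrams, written only to illustrate the lazy behaviour; it is not NLTK's implementation.

```python
def simple_bigrams(sequence):
    """Yield adjacent pairs lazily (hypothetical stand-in for nltk.bigrams)."""
    items = list(sequence)
    # zip pairs each item with its successor
    yield from zip(items, items[1:])


pairs = simple_bigrams(["more", "is", "said", "than", "done"])

# pairs is a generator object, so printing it does not show the pairs
print(pairs)

# list() consumes the generator and materializes the pairs
pair_list = list(pairs)
print(pair_list)
```

A generator can be consumed only once, so store the list if the pairs are needed again.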

The collocations() method finds the bigrams in a text that occur together unusually often:

text4.collocations()
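
As a rough intuition for what a collocation finder starts from, the sketch below simply counts adjacent pairs in a made-up token list and keeps the repeated ones. This is a deliberate simplification: NLTK ranks candidate pairs with a statistical association measure and filters out very short words and stopwords, which raw counting does not do. The `tokens` list is an assumption for illustration.

```python
from collections import Counter

# Hypothetical toy token list standing in for text4
tokens = ["united", "states", "of", "america", "and",
          "the", "united", "states", "senate"]

# Count each adjacent pair of tokens
pair_counts = Counter(zip(tokens, tokens[1:]))

# Keep the pairs seen at least twice, most frequent first
repeated_pairs = [pair for pair, count in pair_counts.most_common() if count >= 2]

print(repeated_pairs)
```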

Counting other things

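# Frequency distribution of word lengths in text1
fdist = FreqDist([len(w) for w in text1])
# The distinct word lengths that occur
fdist.keys()
# The (length, count) pairs stored in the distribution
fdist.items()
# The most frequent word length
fdist.max()
# Number of words of length 3
fdist[3]
# Proportion of all words that have length 3
fdist.freq(3)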
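
The idea of counting a derived property (here, word length) instead of the words themselves can be sketched with the standard library alone. The `tokens` list is a made-up example, and the two computed values are stand-ins for fdist.max() and fdist.freq(), under the assumption that Counter behaves like FreqDist for plain counting.

```python
from collections import Counter

# Hypothetical toy token list standing in for text1
tokens = ["call", "me", "ishmael", "some", "years", "ago"]

# Count how many tokens have each length
length_counts = Counter(len(w) for w in tokens)

# Most frequent length (stand-in for fdist.max())
most_common_length = length_counts.most_common(1)[0][0]

# Relative frequency of length 4 (stand-in for fdist.freq(4))
freq_of_four = length_counts[4] / sum(length_counts.values())

print(most_common_length)
print(freq_of_four)
```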

Functions defined for NLTK's frequency distribution class

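Example	Description
fdist = FreqDist(samples)	Create a frequency distribution containing the given samples
fdist[sample] += 1	Increment the count for this sample (replaces fdist.inc(sample), which was removed in NLTK 3)
fdist['monstrous']	Number of times the given sample occurred
fdist.freq('monstrous')	Relative frequency of the given sample
fdist.N()	Total number of samples
fdist.most_common(n)	The n most frequent samples with their counts, in decreasing order of frequency (in NLTK 3, fdist.keys() is no longer sorted)
for sample in fdist:	Iterate over the samples
fdist.max()	Sample with the greatest count
fdist.tabulate()	Tabulate the frequency distribution
fdist.plot()	Plot the frequency distribution
fdist.plot(cumulative=True)	Plot the cumulative frequency distribution
fdist1 < fdist2	Test whether samples in fdist1 occur less frequently than in fdist2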
