FreqDist
NLTK's FreqDist class counts how many times each word occurs in a list.
import nltk

text = ['hadoop','spark','hive','hadoop','hadoop'
,'spark','lucene','hadoop','spark','hive'
,'hadoop','hadoop','spark','pig','zookeeper'
,'flume','stream','hadoop','hadoop','spark'
,'pig','zookeeper','flume','stream','hadoop'
,'hadoop','spark','pig','zookeeper','flume'
,'stream','hadoop','hadoop','spark','pig'
,'zookeeper','flume','stream','hadoop','hadoop'
,'spark','pig','zookeeper','flume','stream']
fdist = nltk.FreqDist(text)
# Iterate over the distinct words and print each one with its count
for k in fdist:
    print(k + " " + str(fdist[k]))
hadoop 14
spark 8
hive 2
lucene 1
pig 5
zookeeper 5
flume 5
stream 5
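FreqDist is a subclass of Python's collections.Counter, so the counting step can be sketched with the standard library alone. The short word list below is a made-up illustration, not the list used above.

```python
from collections import Counter

# Hypothetical short word list, used only for illustration.
words = ['hadoop', 'spark', 'hive', 'hadoop', 'spark', 'hadoop']

# Counter tallies each distinct item, which is the core of what FreqDist does.
counts = Counter(words)

# Items are visited in the order they were first seen.
for word, count in counts.items():
    print(f"{word} {count}")
```

Because FreqDist builds on Counter, ordinary dictionary-style lookups such as fdist['hadoop'] work the same way on both.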
FreqDist::plot(n)
Takes a parameter n and draws a line chart of the n most frequent items. Plotting requires matplotlib to be installed.
fdist.plot(4)
FreqDist::tabulate(n)
Takes a parameter n and prints the n most frequent items as a table.
fdist.tabulate(5)
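FreqDist::most_common(n)
Takes a parameter n and returns the n most frequent items as a list of (item, count) pairs, ordered from most to least frequent.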
print(fdist.most_common(3))
[('hadoop', 14), ('spark', 8), ('pig', 5)]
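In the result above, 'pig' takes third place even though three other words also occur 5 times. A likely explanation, assuming FreqDist inherits most_common from collections.Counter unchanged, is that Counter returns tied items in the order they were first encountered. A minimal sketch with made-up data:

```python
from collections import Counter

# Hypothetical data: 'pig' and 'flume' both occur twice, and 'pig' is seen first.
tied = Counter(['hadoop', 'hadoop', 'hadoop', 'pig', 'flume', 'pig', 'flume'])

# Items with equal counts come back in first-encountered order.
top_two = tied.most_common(2)
print(top_two)
```

If the tie order matters, it is safer to sort the pairs explicitly than to rely on insertion order.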
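FreqDist::hapaxes()
Returns the items that occur exactly once (hapax legomena). This is not the same as the least frequent items: if every item occurs at least twice, the result is empty.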
print(fdist.hapaxes())
['lucene']
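The behaviour of hapaxes() can be sketched as a filter on items whose count is exactly 1. The helper below is a hypothetical stand-in written against collections.Counter, not NLTK's own code.

```python
from collections import Counter

def hapaxes(counts):
    """Return the items whose count is exactly 1 (hypothetical helper)."""
    return [item for item, count in counts.items() if count == 1]

# 'lucene' is the only word seen once here.
with_hapax = hapaxes(Counter(['hadoop', 'hadoop', 'lucene', 'spark', 'spark']))

# Every word occurs twice, so there are no hapaxes even though 2 is the minimum count.
without_hapax = hapaxes(Counter(['hadoop', 'hadoop', 'spark', 'spark']))

print(with_hapax)
print(without_hapax)
```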
FreqDist::max()
Returns the item with the highest frequency.
print(fdist.max())
hadoop