from nltk.book import *
Searching Text
Concordance lines
text1.concordance("big")
At most 25 matching lines are shown by default.
Q1: How can more lines be displayed?
Q2: Can other forms of "big", such as "bigger", also be shown?
Q3: How can wildcards be used?
Questions to answer after finishing this chapter.
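For Q1: concordance() takes optional width and lines arguments, so more lines can be requested, for example:
text1.concordance("big", lines=100)  # show up to 100 matching lines
For Q2 and Q3: the match is on the exact token (case-insensitive), so "bigger" needs its own call, and concordance() has no wildcard support; regex-based searching with findall() comes later in the book.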
Similar words
text1.similar("big")
This does not mean words with the same meaning, but words that appear in the same contexts, somewhat like the paradigmatic relations of language described by Saussure.
For example, we saw that monstrous occurred in contexts such as the ___ pictures and a ___ size. What other words appear in a similar range of contexts?
text2.common_contexts(["monstrous", "very"])
Dispersion plot
We can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
== Note: NumPy and Matplotlib must be installed ==
See here for reference.
1. Check in the Python prompt whether they are already installed:
import numpy
2. If they are not installed:
Win+R, cmd, pip install numpy (and likewise pip install matplotlib)
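A slightly more robust check, as a sketch; it reports whichever package is missing:
# Run inside the Python prompt; an ImportError means the package
# still needs to be installed with pip.
try:
    import numpy
    import matplotlib
    print("numpy and matplotlib are both installed")
except ImportError as e:
    print("missing package:", e.name)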
Counting
len(text3)  # count the tokens
This counts tokens: words and punctuation symbols.
>>> set(text3)  # list all the types
>>> sorted(set(text2))  # types in alphabetical order; punctuation sorts first
>>> len(set(text2))  # count the types
>>> len(set(text2)) / len(text2)  # type/token ratio
Note: to count the words of a Gutenberg text, or of a locally loaded text, assign it to a variable and tokenize it with words().
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)
# To avoid the long dotted name, this can be shortened to:
from nltk.corpus import gutenberg
gutenberg.fileids()
emma = gutenberg.words('austen-emma.txt')
Count the words in each of the texts.
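A minimal sketch of that exercise, looping over every file in the Gutenberg corpus:
from nltk.corpus import gutenberg
# Print the token count of every text in the corpus.
for fileid in gutenberg.fileids():
    print(fileid, len(gutenberg.words(fileid)))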
The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together.
How often a particular word occurs:
text3.count("big")
Capture the computations above as reusable functions:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total
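They are then called like any other function:
lexical_diversity(text3)  # type/token ratio of text3
percentage(text4.count('a'), len(text4))  # what percentage of text4 is 'a'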
String slicing
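Texts behave like Python lists, so ordinary indexing and slicing apply, for example:
text4[173]             # the token at index 173
text5[16715:16735]     # a slice of twenty tokens
' '.join(text1[0:10])  # join the first ten tokens into one string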
Simple statistics
Frequency distribution
use a FreqDist to find the most frequent words
1. Find the most frequent words.
2. Draw a cumulative frequency plot: fdist1.plot(50, cumulative=True)
3. hapaxes: words that occur only once: fdist1.hapaxes()
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ...]
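Continuing in the same session, steps 2 and 3 from the list above (the plot call opens a matplotlib window):
>>> fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the top 50
>>> fdist1.hapaxes()[:10]             # a few of the words that occur only once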