Getting Started with NLTK

Latest recommended article published on 2024-04-22 02:39:54

子_非_鱼

Tags: NLP

Article link: https://blog.csdn.net/longzhinuhou/article/details/83277281


Searching text

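To display every occurrence of a given word together with its surrounding context, use the concordance function. For example, to look up the word monstrous in Moby Dick (text1):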
```
from nltk.book import *  # loads the sample texts text1 ... text9
text1.concordance("monstrous")
```
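To make the idea concrete, here is a minimal pure-Python sketch of a concordance search. It is not NLTK's implementation: the function name `concordance_lines`, the `width` parameter, and the toy token list are assumptions made for illustration only.

```python
def concordance_lines(tokens, target, width=3):
    """Return (left context, match, right context) for each occurrence of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():  # case-insensitive match (an assumption of this sketch)
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy token list standing in for a real corpus such as text1.
sample = "the monstrous whale rose and a most monstrous wave followed it".split()
for left, word, right in concordance_lines(sample, "monstrous"):
    print(f"{left:>25} [{word}] {right}")
```

Each match is printed with up to three tokens of context on either side, which is the same kind of view that concordance gives for a full text.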
To find other words that appear in contexts similar to those of a given word:
```
text1.similar("monstrous")
```

Contexts shared by two or more words

text2.common_contexts(["monstrous", "very"])
-> be_glad am_glad a_pretty is_pretty a_lucky
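The notion of a "context" can be sketched in plain Python as the pair of words on either side of a token. The helper `contexts_by_word` and the toy sentence are hypothetical, and NLTK's own ranking of contexts is more involved than this simple set intersection.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous, next) pairs it appears between."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts

# Toy token list; a real run would use the tokens of a text such as text2.
sample = ("i am very glad and a very pretty day is monstrous pretty "
          "so i am monstrous glad").split()
ctx = contexts_by_word(sample)

# Contexts in which both words occur, written in the prev_next style shown above.
shared = ctx["monstrous"] & ctx["very"]
print(sorted(f"{p}_{n}" for p, n in shared))
```

Words whose context sets overlap heavily are the ones a similarity search would tend to report as similar.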

Dispersion plot of words in a text

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
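A dispersion plot shows where in the text each word occurs. The sketch below computes only the underlying data, the token offsets of each word, without drawing anything; `word_offsets` and the sample sentence are hypothetical names and data used for illustration.

```python
def word_offsets(tokens, targets):
    """Return, for each target word, the token positions at which it occurs."""
    return {t: [i for i, w in enumerate(tokens) if w == t] for t in targets}

# Toy token list standing in for a long text such as text4.
sample = "freedom and duties of citizens protect freedom for all citizens".split()
offsets = word_offsets(sample, ["citizens", "freedom", "democracy"])
print(offsets)
```

Plotting each list of offsets along a horizontal axis, one row per word, gives the dispersion picture.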

Lexical richness of a text: the average number of times each distinct word is used

from __future__ import division  # needed only on Python 2; in Python 3, / is already true division
len(text3) / len(set(text3))
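The same ratio can be wrapped in a small reusable function. This is a sketch on a toy token list; the name `average_uses_per_word` is an assumption chosen for this example.

```python
def average_uses_per_word(tokens):
    """Total number of tokens divided by the number of distinct tokens."""
    return len(tokens) / len(set(tokens))

# 10 tokens, 8 distinct words ("to" and "be" each appear twice).
sample = "to be or not to be that is the question".split()
print(average_uses_per_word(sample))
```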

Count how many times a word occurs in a text, and compute what percentage of the text a particular word accounts for.
```
text3.count("smote")
100 * text4.count('a') / len(text4)
```
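Because a tokenized text behaves like a list of strings, the same calculation works on any Python list. The helper `percentage` and the toy sentence below are illustrative assumptions.

```python
def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

sample = "a rose is a rose is a rose".split()
rose_count = sample.count("rose")                      # occurrences of one word
a_share = percentage(sample.count("a"), len(sample))   # share of the text taken up by "a"
print(rose_count, a_share)
```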

All words longer than 7 characters that occur more than 7 times:

fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
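The same filter can be tried without the NLTK corpora by using `collections.Counter` in place of FreqDist. The function name, its parameters, and the toy data are assumptions for this sketch.

```python
from collections import Counter

def long_frequent_words(tokens, min_len=7, min_count=7):
    """Words strictly longer than min_len that occur strictly more than min_count times."""
    counts = Counter(tokens)
    return sorted(w for w in counts if len(w) > min_len and counts[w] > min_count)

# "question" is long and frequent; "football" is long but rare; "lol" is frequent but short.
sample = ["question"] * 8 + ["football"] * 3 + ["lol"] * 20
print(long_frequent_words(sample))
```

Iterating over the counter's keys plays the role of `set(text5)` in the original expression, since each distinct word appears exactly once.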

Bigrams: bigrams returns a generator, so wrap the call in list() to see the pairs as a list.

list(bigrams(['more', 'is', 'said', 'than', 'done']))
->[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
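The lazy behaviour is easy to reproduce in plain Python. The generator below is a stand-in written for illustration (the name `bigrams_lazy` is hypothetical), not NLTK's own function.

```python
def bigrams_lazy(tokens):
    """Yield adjacent token pairs one at a time."""
    for pair in zip(tokens, tokens[1:]):
        yield pair

pairs = bigrams_lazy(['more', 'is', 'said', 'than', 'done'])
print(pairs)        # a generator object, not the pairs themselves
print(list(pairs))  # list() consumes the generator and materializes the pairs
```

A generator can be consumed only once, so calling list(pairs) a second time would return an empty list.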

To find the bigrams that occur frequently in a text, use the collocations() function:

text4.collocations()
->United States; fellow citizens; years ago; Federal Government;......
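As a rough illustration, frequent bigrams can be found by simply counting adjacent pairs. This is a deliberate simplification: it ranks by raw frequency only, whereas a collocation finder also takes into account how common the individual words are. The function name and the toy data are assumptions.

```python
from collections import Counter

def frequent_bigrams(tokens, n=2):
    """Return the n most frequent adjacent word pairs, joined with a space."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, _ in pair_counts.most_common(n)]

sample = ("united states and fellow citizens , united states or "
          "fellow citizens ; united states").split()
print("; ".join(frequent_bigrams(sample)))
```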

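Commonly used FreqDist operations (NLTK 3 forms, with the older NLTK 2 form noted where it differs):

fdist = FreqDist(samples)	create a frequency distribution containing the given samples
fdist[sample] += 1	increment the count for a sample (fdist.inc(sample) in NLTK 2)
fdist['monstrous']	number of times the given sample occurred
fdist.freq('monstrous')	relative frequency of the given sample
fdist.N()	total number of samples
fdist.most_common(n)	the n most frequent samples with their counts, in decreasing order of frequency (NLTK 2 returned this ordering from fdist.keys())
for sample in fdist:	iterate over the samples (in decreasing order of frequency only in NLTK 2)
fdist.max()	sample with the greatest count
fdist.tabulate()	print the frequency distribution as a table
fdist.plot()	plot the frequency distribution
fdist.plot(cumulative=True)	plot the cumulative frequency distribution
fdist1 < fdist2	test whether samples in fdist1 occur less frequently than in fdist2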
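In NLTK 3, FreqDist is a subclass of `collections.Counter`, so several of these operations can be tried with the standard library alone. The sketch below uses a plain Counter on a toy token list; the lines computing `total`, `freq_the`, and `top_sample` are hand-written stand-ins for fdist.N(), fdist.freq(), and fdist.max().

```python
from collections import Counter

tokens = "the whale the sea the monstrous whale".split()
counts = Counter(tokens)                  # analogous to FreqDist(tokens)
counts["monstrous"] += 1                  # increment the count for one sample

total = sum(counts.values())              # stand-in for fdist.N()
freq_the = counts["the"] / total          # stand-in for fdist.freq('the')
top_sample = max(counts, key=counts.get)  # stand-in for fdist.max()

print(counts["monstrous"], total, freq_the)
print(counts.most_common(1), top_sample)
```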

[w for w in text if condition]

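Here condition is the test that each word must satisfy. It is usually built from string methods that test a word's properties, such as w.startswith(t), w.endswith(t), w.islower(), w.isalpha(), or w.isdigit().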

To apply an operation to every word, use an expression of the form

[f(w) for ...] or [w.f() for ...]

For example:

[w.upper() for w in text1]
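Both patterns, filtering with a condition and applying an operation, can be combined in one comprehension. The word list below is a small hand-made sample rather than a real NLTK text.

```python
words = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "1851"]

# Filter: keep only words that satisfy a condition.
title_words = [w for w in words if w.istitle()]

# Operation plus filter: uppercase every purely alphabetic word.
upper_alpha = [w.upper() for w in words if w.isalpha()]

# Function form f(w): the length of every token.
lengths = [len(w) for w in words]

print(title_words)
print(upper_alpha)
print(lengths)
```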
