from nltk.book import *
Searching Text
Concordance lines
text1.concordance("big")
At most 25 matching lines are shown by default.
Q1: How can more lines be displayed?
Q2: Can other forms of "big", such as "bigger", also be shown?
Q3: How can wildcards be used?
Questions to answer after finishing this chapter.
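For Q1: concordance() takes optional width and lines arguments, so more lines can be requested, for example:
text1.concordance("big", lines=100)  # show up to 100 matching lines
For Q2 and Q3: the match is on the exact token (case-insensitive), so "bigger" needs its own call, and concordance() has no wildcard support; regex-based searching with findall() comes later in the book.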
Similar words
text1.similar("big")
This does not mean words with the same meaning, but words that appear in the same contexts, somewhat like the paradigmatic relations of language described by Saussure.
For example, we saw that monstrous occurred in contexts such as the ___ pictures and a ___ size. What other words appear in a similar range of contexts?
text2.common_contexts(["monstrous", "very"])
Dispersion plot
We can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
== Note: NumPy and Matplotlib must be installed ==
See here for reference.
1. Check in the Python prompt whether they are already installed:
import numpy
2. If they are not installed:
Win+R, cmd, pip install numpy (and likewise pip install matplotlib)
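A slightly more robust check, as a sketch; it reports whichever package is missing:
# Run inside the Python prompt; an ImportError means the package
# still needs to be installed with pip.
try:
    import numpy
    import matplotlib
    print("numpy and matplotlib are both installed")
except ImportError as e:
    print("missing package:", e.name)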
Counting
len(text3)  # count the tokens
This counts tokens: words and punctuation symbols.
>>> set(text3)  # list all the types
>>> sorted(set(text2))  # types in alphabetical order; punctuation sorts first
>>> len(set(text2))  # count the types
>>> len(set(text2)) / len(text2)  # type/token ratio
Note: to count the words of a Gutenberg text, or of a locally loaded text, assign it to a variable and tokenize it with words().
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)
# To avoid the long dotted name, this can be shortened to:
from nltk.corpus import gutenberg
gutenberg.fileids()
emma = gutenberg.words('austen-emma.txt')
Count the words in each of the texts.
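A minimal sketch of that exercise, looping over every file in the Gutenberg corpus:
from nltk.corpus import gutenberg
# Print the token count of every text in the corpus.
for fileid in gutenberg.fileids():
    print(fileid, len(gutenberg.words(fileid)))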
The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together.
How often a particular word occurs:
text3.count("big")
Capture the computations above as reusable functions:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total
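They are then called like any other function:
lexical_diversity(text3)  # type/token ratio of text3
percentage(text4.count('a'), len(text4))  # what percentage of text4 is 'a'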
String slicing
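Texts behave like Python lists, so ordinary indexing and slicing apply, for example:
text4[173]             # the token at index 173
text5[16715:16735]     # a slice of twenty tokens
' '.join(text1[0:10])  # join the first ten tokens into one string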
Simple statistics
Frequency distribution
use a FreqDist to find the most frequent words
1. Find the most frequent words.
2. Draw a cumulative frequency plot: fdist1.plot(50, cumulative=True)
3. hapaxes: words that occur only once: fdist1.hapaxes()
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ...]
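Continuing in the same session, steps 2 and 3 from the list above (the plot call opens a matplotlib window):
>>> fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the top 50
>>> fdist1.hapaxes()[:10]             # a few of the words that occur only once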