NLTK Modules


Eason-Sun


Article link: https://blog.csdn.net/weixin_36228538/article/details/88323987


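NLTK provides a basic toolkit for NLP programming in Python. It offers basic classes for representing data relevant to natural language processing, standard interfaces for tasks such as part-of-speech tagging, syntactic parsing, and text classification, and standard implementations of each task that can be combined to solve complex problems.
Language processing tasks, the corresponding NLTK modules, and what they provide: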

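Language processing task	NLTK modules	Functionality
Accessing corpora	corpus	Standardized interfaces to corpora and lexicons
String processing	tokenize, stem	Tokenizers, sentence tokenizers, stemmers
Collocation discovery	collocations	t-test, chi-squared, pointwise mutual information (PMI)
Part-of-speech tagging	tag	n-gram, backoff, Brill, HMM, TnT
Machine learning	classify, cluster, tbl	Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking	chunk	Regular expression, n-gram, named entity
Parsing	parse, ccg	Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation	sem, inference	Lambda calculus, first-order logic, model checking
Evaluation metrics	metrics	Precision, recall, agreement coefficients
Probability and estimation	probability	Frequency distributions, smoothed probability distributions
Applications	app, chat	Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork	toolbox	Manipulating data in SIL Toolbox format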

1. The nltk.book module

# Requires the NLTK book data; run nltk.download('book') once beforehand
from nltk.book import *
# Search the text
text1.concordance('monstrous')

1.1 Searching text

A concordance view shows every occurrence of a given word, together with some context. Use text1.concordance() to look up the word monstrous in Moby Dick:

>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>
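
To make the idea of a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name concordance_lines and the width parameter are made up for illustration, and context is measured in tokens rather than characters.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of `target` with `width` tokens of context on each side."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        # Match case-insensitively, so "Monstrous" and "monstrous" both count
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical toy token list; a real text would come from a tokenizer or a corpus
tokens = "a most monstrous size and the monstrous pictures of whales".split()
for left, word, right in concordance_lines(tokens, "monstrous", width=2):
    print(f"{left:>12} {word} {right}")
```

Right-aligning the left context keeps the target word in one column, which is what makes a concordance easy to scan.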

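A concordance lets us see words in context. For example, we see that monstrous occurs in contexts such as the monstrous pictures and a monstrous size. What other words appear in similar contexts? The similar() method lists them: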

>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>

The common_contexts() method lets us examine the contexts shared by two or more words, such as monstrous and very. Its argument is a list of words:

>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>
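
The output above reads as "word before _ word after". A rough sketch of the underlying idea, assuming a context is simply the pair of neighbouring words, could look like this. The helper names are hypothetical, and the first and last tokens are skipped for simplicity, which NLTK's own code may handle differently.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    # Skip the first and last token because they lack a neighbour on one side
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts

def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    ctx = contexts_by_word(tokens)
    return set.intersection(*(ctx[w.lower()] for w in words))

# Hypothetical toy sentence, not taken from any NLTK corpus
tokens = "she is very glad and he is monstrous glad but a very pretty day".split()
print(shared_contexts(tokens, ["monstrous", "very"]))
```

Here both words occur between is and glad, so that pair is the only shared context.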

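The dispersion_plot() method shows where words occur in a text. A word's position is the number of words that precede it, counting from the start of the text. This positional information is displayed as a dispersion plot, in which each stripe marks one occurrence of a word and each row represents the whole text.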

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>
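
The positions behind such a plot can be computed without any plotting library. The sketch below is an illustration only: word_offsets is a hypothetical helper, and the text rows printed at the end merely stand in for the graphical stripes.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions (counted from the start) where it occurs."""
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Hypothetical toy token list
tokens = "freedom for citizens means duties for citizens and freedom".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])

# Draw one row per word: "|" where the word occurs, "." elsewhere
for word, positions in offsets.items():
    row = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>10} {row}")
```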

1.2 Frequency distributions (the NLTK class FreqDist)

FreqDist() counts how many times each word token occurs in a text, giving its frequency distribution:

>>> text = ['sun','it','ha','ha','sun','is','ha','.']
>>> len(text)
8
>>> fd=FreqDist(text)
>>> print(fd)
<FreqDist with 5 samples and 8 outcomes>
>>> fd.most_common(5)
[('ha', 3), ('sun', 2), ('is', 1), ('.', 1), ('it', 1)]
>>> fd['sun']
2
>>>

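The expression most_common(5) returns the 5 most frequent word types in the text. Words with equal counts may be listed in a different order depending on the Python version.
The following plots the cumulative frequency of these words: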

>>> fd.plot(5,cumulative=True)

The result is shown in the figure below:
[Figure: cumulative frequency plot of the 5 most frequent words]
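
The numbers behind a cumulative plot are running totals of the counts, taken in order of decreasing frequency. The sketch below uses collections.Counter, which FreqDist subclasses in NLTK 3, so nothing needs to be downloaded; it restricts itself to the top two words to avoid the tie among the words that occur once.

```python
from collections import Counter
from itertools import accumulate

text = ['sun', 'it', 'ha', 'ha', 'sun', 'is', 'ha', '.']
counts = Counter(text)

# Take the two most frequent words; their counts (3 and 2) are not tied
top = counts.most_common(2)
words = [w for w, _ in top]
# Running total of the counts: this is what the cumulative curve plots
cumulative = list(accumulate(c for _, c in top))

for word, total in zip(words, cumulative):
    print(word, total)
```

The last running total divided by len(text) gives the share of the text covered by those words, here 5 of 8 tokens.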

Suppose we want to find the words in the text's vocabulary that are longer than 15 characters:

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>>

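These very long words are often hapaxes (that is, they occur only once), so it may be more useful to find long words that occur frequently. This looks more promising because it excludes both short high-frequency words (such as the) and long low-frequency words (such as antiphilosophists). Here are all the words in the chat corpus that are longer than 7 characters and occur more than 7 times: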

>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>

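1.3 Collocations and bigrams

A collocation is a sequence of words that occur together unusually often. To find collocations, we start by extracting the word pairs, or bigrams, from a text. The bigrams() function makes this easy: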

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
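
For intuition, the same pairs can be produced in plain Python by zipping a list with itself shifted by one position. This is only a sketch of the idea; adjacent_pairs is a hypothetical name, not an NLTK function.

```python
def adjacent_pairs(words):
    """Return the list of adjacent word pairs (bigrams) in `words`."""
    # zip stops at the shorter input, so the last word is never paired with nothing
    return list(zip(words, words[1:]))

print(adjacent_pairs(['more', 'is', 'said', 'than', 'done']))
```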

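In particular, we want bigrams that occur more often than the frequencies of the individual words would lead us to expect. The collocations() method does this for us: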

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>
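
One way to quantify "more often than expected" is pointwise mutual information (PMI), one of the measures listed for the collocations module in the table at the top. The sketch below is an assumption-laden illustration and not necessarily how collocations() ranks pairs: pmi_scores and min_count are hypothetical names, and probabilities are estimated simply as count divided by the number of tokens.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score each bigram seen at least `min_count` times by pointwise mutual information."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count >= min_count:
            # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), each probability estimated as count / n
            scores[(w1, w2)] = math.log2(count * n / (unigrams[w1] * unigrams[w2]))
    return scores

# Hypothetical toy token list
tokens = "the united states and the united nations thanked the people".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

The min_count filter matters because PMI is unreliable for rare pairs: a bigram seen once between two rare words gets a very high score by chance.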

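1.4 Counting other things

Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text by building a FreqDist from a long list of numbers, where each number is the length of the corresponding word in the text: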

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
  8: 9966, 9: 6428, 10: 3528, ...})
>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>

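From this we see that the most frequent word length is 3, and that words of length 3 account for more than 50,000 tokens (about 20% of all the words in the book).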
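
The same calculation can be tried on a small list without the book corpora. This sketch uses collections.Counter in place of FreqDist; the share computed at the end corresponds to what fdist.freq() reports above.

```python
from collections import Counter

# Hypothetical toy sentence
words = "the cat sat on the mat and the dog ran off".split()

# Count word lengths instead of the words themselves
length_counts = Counter(len(w) for w in words)
most_common_length, count = length_counts.most_common(1)[0]

# Relative frequency: occurrences of this length divided by the total number of words
share = count / sum(length_counts.values())
print(most_common_length, count, round(share, 3))
```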

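Python syntax: list comprehensions

[w for w in text if condition]
This yields a list of the elements of text that satisfy condition, where condition is a Python test that evaluates to true or false. The test can be a numeric comparison or a call to a test method; the methods in the table below test various properties of a word.

Function	Meaning
s.startswith(t)	Test whether s starts with t
s.endswith(t)	Test whether s ends with t
t in s	Test whether t is a substring of s
s.islower()	Test whether s contains cased characters and all of them are lowercase
s.isupper()	Test whether s contains cased characters and all of them are uppercase
s.isalpha()	Test whether s is non-empty and all characters in s are alphabetic
s.isalnum()	Test whether s is non-empty and all characters in s are alphanumeric
s.isdigit()	Test whether s is non-empty and all characters in s are digits
s.istitle()	Test whether s is titlecased (every word in s has an initial capital)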

Examples:

>>> sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
>>>

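We can also build more complex conditions. If c is a condition, then not c is also a condition. Given two conditions c1 and c2, we can combine them with conjunction and disjunction to form new conditions: c1 and c2, and c1 or c2.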

>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
['Abelmizraim', 'Allonbachuth', 'Beerlahairoi', 'Canaanitish', 'Chedorlaomer', 'Girgashites', 'Hazarmaveth', 'Hazezontamar', 'Ishmeelites', 'Jegarsahadutha', 'Jehovahjireh', 'Kirjatharba', 'Melchizedek', 'Mesopotamia', 'Peradventure', 'Philistines', 'Zaphnathpaaneah']
>>>

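[f(w) for …] or [w.f() for …]
Here f is a function applied to every element of the list. In the examples below, each word in text1 is assigned in turn to the variable w, and the given operation is performed on it.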

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
>>>
# Filter out all non-alphabetic items, removing numbers and punctuation from the vocabulary, and ignore case:
>>> len(set(word.lower() for word in text1 if word.isalpha()))
16948
>>>
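
The effect of each step is easier to see on a short list. This is a small sketch on made-up tokens, so it runs without the book corpora:

```python
# Hypothetical toy token list with mixed case, punctuation, and a number
tokens = ['The', 'whale', ',', 'the', 'Whale', '!', '1851', 'sea']

# isalpha() drops ',', '!' and '1851'; lower() merges 'The'/'the' and 'whale'/'Whale'
vocabulary = sorted(set(t.lower() for t in tokens if t.isalpha()))
print(vocabulary)       # ['sea', 'the', 'whale']
print(len(vocabulary))  # 3
```

Filtering before lowercasing or after makes no difference here, since lower() does not change whether a string is alphabetic.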

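We build a list of the words containing cie or cei, then loop over it and print each item. Note the extra argument in the print call: end=' '. It tells Python to print a space after each word instead of the default newline.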

>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky:
...     print(word, end=' ')
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>
