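Frequency distributions
Counting how often each token occurs in a text
This is how Natural Language Processing with Python (《Python自然语言处理》) does it:
FreqDist()  # word frequencies
Usage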
>>> fdist1 = FreqDist(text1)
>>> fdist1
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
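But this raises an error: in Python 3, keys() returns a view object that does not support slicing.
After checking the documentation, I corrected it to: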
>>> vocabulary1 = list(fdist1.keys())
>>> vocabulary1[:100]
['head', 'supposition', 'Rig', 'commence', 'inspection', 'swim', 'mansion', 'strained', 'bowsman', 'strangers', 'investigators', 'OCTAVO', 'bare', 'observest', 'adorned', 'maintains', 'Gone', 'monstrous', 'unread', 'bedsteads', 'wriggles', 'rears', 'compacted', 'thump', 'LASHINGS', 'Prodigies', 'useful', 'dubiously', 'ticklish', 'flour', 'yes', 'mackerel', 'rate', 'knit', 'occasions', 'imperative', 'abating', 'neutral', 'reading', 'stalk', 'prosecution', 'complimentary', 'hearse', 'Canada', 'unobstructed', 'Capting', 'impatience', 'layers', 'CHORUS', 'Scripture', 'caudam', 'ineffably', 'RESPECTABLE', 'naturae', 'clue', 'NANTUCKET', 'pike', 'steps', 'without', 'students', 'tore', 'hides', 'slave', 'oaths', 'incognita', 'darts', 'unmistakable', '"', 'stronger', 'Imprimis', 'aromatic', 'mists', 'piers', 'everlasting', 'Sway', 'temporarily', 'shirts', 'chivalric', 'unwillingness', 'Coffins', 'merchants', 'mallet', 'rounding', 'soliloquizer', 'suicide', 'smack', 'ruling', 'inexpressible', 'Fates', 'etherial', 'giant', 'obstructed', 'wharf', 'fuel', 'grounded', 'graceful', 'Lowering', 'correspondence', 'resent', 'pagans']
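Note that the word list above is not sorted by frequency. In NLTK 3, FreqDist is a subclass of collections.Counter, so keys() follows ordinary dict ordering rather than the frequency ordering the book describes. Here is a minimal sketch of the difference. It uses Counter as a stand-in for FreqDist and a made-up token list, both of which are assumptions for illustration and not the Moby Dick data:

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = "the whale and the sea and the ship".split()
fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

# keys() is a view in Python 3, so slicing it raises TypeError
try:
    fdist.keys()[:2]
    sliced_ok = True
except TypeError:
    sliced_ok = False

# Converting to a list works, but the order is first-seen, not by frequency
first_seen = list(fdist.keys())[:2]

# most_common(n) returns (word, count) pairs ordered by decreasing count
top_two = [word for word, _ in fdist.most_common(2)]

print(sliced_ok, first_seen, top_two)
```

On a real FreqDist, fdist1.most_common(50) should give the 50 most frequent tokens, which is probably what the book's vocabulary1[:50] was meant to show.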
Counting how often one particular word occurs
>>> fdist1['giant']
2
>>> fdist1['reading']
8
>>> fdist1['whale']
906
>>> fdist1['head']
335
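The lookups above return absolute counts. A small sketch of how a count relates to a relative frequency, again using Counter as a stand-in for FreqDist and an invented token list (assumptions for illustration):

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = "the whale saw the ship and the ship saw the whale".split()
fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

whale_count = fdist["whale"]       # absolute count of one word
missing_count = fdist["giant"]     # unseen words give 0 instead of KeyError
total = sum(fdist.values())        # total number of tokens counted
whale_share = whale_count / total  # relative frequency of "whale"

print(whale_count, missing_count, total, round(whale_share, 3))
```

On a real FreqDist the same two quantities are available as fdist1.N() (total token count) and fdist1.freq('whale') (relative frequency).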
Cumulative frequency plot of the vocabulary
>>> fdist1.plot(50,cumulative=True)
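To see what the cumulative plot is showing without opening a plot window, the running totals can be computed by hand. This is only a sketch of the idea on made-up tokens, with Counter standing in for FreqDist (assumptions for illustration), not a reproduction of NLTK's plotting code:

```python
from collections import Counter
from itertools import accumulate

# Made-up tokens for illustration only
tokens = "the whale saw the ship and the ship saw the whale".split()
fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

top = fdist.most_common(3)             # the 3 most frequent words
counts = [count for _, count in top]   # their individual counts
cumulative = list(accumulate(counts))  # running total, the y-values of the curve

# Share of all tokens covered by the top 3 words
coverage = cumulative[-1] / sum(fdist.values())

print(counts, cumulative, round(coverage, 3))
```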
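Fine-grained selection of words (that is, a list of words filtered by a condition)
Set notation
Math: {w | w ∈ V ∧ p(w)}
Python: [w for w in V if p(w)]
The Python expression produces a list, so its elements are not guaranteed to be unique (they are unique here only because V is built as a set).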
>>> V = set(text1)  # build the vocabulary
>>> long_words = [w for w in V if len(w)>15]
>>> sorted(long_words)  # sort alphabetically
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
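A small sketch of why the vocabulary is built with set() first, and of how the condition p(w) can combine length with frequency. The token list is invented and Counter stands in for FreqDist (assumptions for illustration):

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = "the whale saw the whale and the whaleboat".split()
fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)
V = set(tokens)          # vocabulary: each distinct word once

# Filtering the raw tokens keeps duplicates
from_tokens = [w for w in tokens if len(w) > 3]

# Filtering the vocabulary gives each matching word once
from_vocab = sorted(w for w in V if len(w) > 3)

# p(w) can be any boolean condition, e.g. long AND frequent
frequent_long = sorted(w for w in V if len(w) > 3 and fdist[w] > 1)

print(from_tokens, from_vocab, frequent_long)
```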
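Collocations and bigrams
The book defines a collocation as a sequence of words that occur together unusually often, e.g. red wine is a collocation, whereas the wine is not. In addition, the words in a collocation cannot be swapped for similar words, e.g. grey wine sounds very odd.
# bigrams() gets the bigrams, but newer NLTK versions return a generator instead of a list
list(bigrams())  # wrap the call in list() to see the bigrams of a given word sequence (new-style usage!)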
>>> bigrams(['more','is','said'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'bigrams' is not defined
>>> from nltk import *
>>> bigrams(['more','is','said'])
<generator object bigrams at 0x105302fc0>
>>> list(bigrams(['more','is','said']))
[('more', 'is'), ('is', 'said')]
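For intuition, a bigram list is just every word paired with the word that follows it. A minimal pure-Python sketch of that idea follows. simple_bigrams is a hypothetical helper written for this note, not NLTK's implementation:

```python
def simple_bigrams(words):
    """Pair each word with its successor (hypothetical helper, not NLTK's)."""
    # zip stops at the shorter sequence, so the last word has no pair of its own
    return list(zip(words, words[1:]))

pairs = simple_bigrams(['more', 'is', 'said'])
print(pairs)
```

Unlike nltk's bigrams(), this returns a list directly, so no list() wrapper is needed.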
>>> text1.collocations()  # get the bigrams that occur most often
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
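A rough sketch of the simplest possible approach, counting raw bigram frequencies on an invented token list (an assumption for illustration). This is not how collocations() works internally. As far as I know, NLTK also filters out stopwords and ranks pairs with an association measure, which this sketch does not attempt:

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = "sperm whale and white whale and sperm whale".split()

# Count every adjacent word pair
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Keep the pairs that occur more than once
repeated = [pair for pair, n in bigram_counts.most_common() if n > 1]

print(repeated)
```

Raw counts also pick up pairs like ('whale', 'and'), which is frequent only because "and" is frequent everywhere. That is the reason a collocation finder needs more than a plain frequency count.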