import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[2]: True
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1
Out[5]: <Text: Moby Dick by Herman Melville 1851>
text2
Out[6]: <Text: Sense and Sensibility by Jane Austen 1811>
text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
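A rough sketch of the idea behind a concordance, assuming a plain list of tokens instead of an NLTK Text: find every occurrence of a word, ignoring case, and show it with a few tokens of context on each side. The helper name `concordance_lines` and the `width` parameter are made up for illustration and are not NLTK's API.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits

# A short hand-made sample standing in for text1
sample = "of a most monstrous size . Touching that monstrous bulk of the whale".split()
lines = concordance_lines(sample, "monstrous")
for line in lines:
    print(line)
```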
text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])
text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty a_lucky am_glad be_glad
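The output above lists contexts of the form left_right that both words share. A minimal sketch of that idea, assuming a plain token list; the helper `contexts` is hypothetical and much simpler than what NLTK does internally.

```python
def contexts(tokens, word):
    """Collect 'left_right' strings for every interior occurrence of `word`."""
    return {
        "{}_{}".format(tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

# Hand-made sample sentence, not taken from text2
tokens = "she is very pretty and he is monstrous pretty but a very lucky man".split()

# Contexts in which both words occur
shared = contexts(tokens, "very") & contexts(tokens, "monstrous")
print(shared)
```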
# The length of a text from start to finish, counted in words and punctuation symbols; these units are called tokens
len(text1)
Out[11]: 260819
set(text1)  # a set counts each repeated item only once, so duplicate words are dropped
len(set(text1))
Out[13]: 19317
# Wrapping set(text1) in sorted() gives a sorted list of the vocabulary items
from __future__ import division
len(text1)/len((set(text1)))
Out[16]: 13.502044830977896
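# On average, each vocabulary item is used about 14 times
text1.count("think")  # count how many times "think" occurs in text1
Out[17]: 111
100*text1.count("think")/len(text1)  # the same count as a percentage of all tokens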
Out[18]: 0.04255824920730468
def lexical_deversity(text):
    return len(text)/len(set(text))
def percentage(count,total):
    return 100*count/total
lexical_deversity(text1)  # same result as len(text1)/len((set(text1)))
Out[20]: 13.502044830977896
percentage(text1.count("think"),len(text1))  # same result as 100*text1.count("think")/len(text1)
Out[22]: 0.04255824920730468
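To check both measures by hand, here is a small sketch on a made-up token list, written for Python 3 (where `/` is already true division, so the `__future__` import is unnecessary). The function is spelled `lexical_diversity` here; the sample list is an assumption, not data from the book.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

# 8 tokens, 6 distinct types ("the" occurs three times)
tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]

ratio = lexical_diversity(tokens)                     # 8 / 6
share = percentage(tokens.count("the"), len(tokens))  # 100 * 3 / 8
print(ratio, share)
```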
1.2 A Closer Look at Python: Texts as Lists of Words
Lists
sent1 = ["Call", "me", "Ishmael", "."]
len(sent1)
Out[28]: 4
lexical_deversity(sent1)
Out[29]: 1.0
sent2 = ["The", "family", "of", "Dashwood", "had", "long", "been", "settled", "in", "Sussex", "."]
sent3 = ["In", "the", "beginning", "God", "created", "the", "heaven", "and", "the", "earth", "."]
sent4 = ["Fellow", "-", "Citizens", "of", "the", "Senate", "and", "of", "the", "House", "of",
"Representatives", ":"]
# List concatenation
sent1 + sent4
Out[33]: ['Call',
'me',
'Ishmael',
'.',
'Fellow',
'-',
'Citizens',
'of',
'the',
'Senate',
'and',
'of',
'the',
'House',
'of',
'Representatives',
':']
# Append a single element to a list:
sent1.append("Some")
sent1
Out[35]: ['Call', 'me', 'Ishmael', '.', 'Some']
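One point the session above implies but does not state: `+` builds a new list and leaves its operands alone, while `append` changes the list in place. A short sketch with shortened sample lists:

```python
sent_a = ["Call", "me", "Ishmael", "."]
sent_b = ["Fellow", "-", "Citizens"]

combined = sent_a + sent_b   # new list; sent_a and sent_b are unchanged
sent_a.append("Some")        # modifies sent_a in place and returns None

print(combined)
print(sent_a)
```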
Indexing lists: an index gives the position of a word
text4[173]
Out[36]: u'awaken'
text4.index('awaken')
Out[38]: 173
# Slicing: extract a sublist, i.e. any segment of a large text
text5[16715:16735]
Out[40]:
[u'U86',
u'thats',
u'why',
u'something',
u'like',
u'gamefly',
u'is',
u'so',
u'good',
u'because',
u'you',
u'can',
u'actually',
u'play',
u'a',
u'full',
u'game',
u'without',
u'buying',
u'it']
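Indexing and slicing behave the same on any Python list, so they can be tried on a short sample without loading a corpus. The token list below is a hand-picked fragment of the slice shown above; positions start at 0 and the end of a slice is excluded.

```python
tokens = ["thats", "why", "something", "like", "gamefly", "is", "so", "good"]

first = tokens[0]                 # indexing starts at 0
position = tokens.index("like")   # position of the first occurrence
segment = tokens[2:5]             # items 2, 3 and 4; the end index is excluded
last_two = tokens[-2:]            # negative indices count from the end

print(first, position, segment, last_two)
```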
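Variables: a name must begin with a letter or underscore, with any digits coming after; abc12 is valid, while 12abc raises a SyntaxError.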
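In Python 3 the naming rule can be checked without triggering an error, using the built-in `str.isidentifier` method together with the standard `keyword` module. This is only a quick way to test candidate names, not something the original session used.

```python
import keyword

candidates = ["abc12", "12abc", "my_sent", "not"]

# A usable variable name is a valid identifier that is not a reserved word
usable = [name for name in candidates
          if name.isidentifier() and not keyword.iskeyword(name)]
print(usable)
```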
Strings:
# Join the words of a list into a single string, or split a string into a list
''.join(["Monty", "Python"])
Out[42]: 'MontyPython'
'Monty Python'.split()
Out[43]: ['Monty', 'Python']
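The string before `.join` is the separator placed between the items. The session above joins with an empty string, which glues the words together; joining with a single space gives a string that `split()` turns back into the original list. A small sketch:

```python
words = ["Monty", "Python"]

glued = "".join(words)     # empty separator: no space between the words
spaced = " ".join(words)   # single space between the words
round_trip = spaced.split()  # split on whitespace restores the list

print(glued, spaced, round_trip)
```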
Computing with Language: Simple Statistics
saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]
Out[47]: ['said', 'than']
# Frequency distributions: the FreqDist class
fdist1 = FreqDist(text1)
fdist1
vocabulary1 = fdist1.keys()  # list of the distinct word types in the text
vocabulary1[:50]  # slice to view the first 50 items
Out[52]:
[u'funereal',
u'unscientific',
u'divinely',
u'foul',
u'four',
u'gag',
u'prefix',
u'woods',
u'clotted',
u'Duck',
u'hanging',
u'plaudits',
u'woody',
u'Until',
u'marching',
u'disobeying',
u'canes',
u'granting',
u'advantage',
u'Westers',
u'insertion',
u'DRYDEN',
u'formless',
u'Untried',
u'superficially',
u'Western',
u'portentous',
u'meadows',
u'sinking',
u'Ding',
u'Spurn',
u'treasuries',
u'churned',
u'oceans',
u'invasion',
u'powders',
u'tinkerings',
u'tantalizing',
u'yellow',
u'bolting',
u'uncertain',
u'stabbed',
u'bringing',
u'elevations',
u'ferreting',
u'wooded',
u'songster',
u'uttering',
u'scholar',
u'Less']
fdist1['whale']
Out[53]: 906
fdist1.plot(50,cumulative = True)
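The key listing above is not in frequency order. A minimal sketch of a frequency distribution using the standard library's `collections.Counter` on the `saying` list from earlier; as far as I know, NLTK 3's FreqDist is built on Counter, so `most_common` should carry over, but treat that as an assumption to verify against your installed version. The running totals are the numbers a cumulative plot would draw.

```python
from collections import Counter
from itertools import accumulate

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

fdist = Counter(saying)      # maps each word to its count
top = fdist.most_common(3)   # the three most frequent (word, count) pairs

# Running total of counts, from most to least frequent word
cumulative = list(accumulate(count for _, count in fdist.most_common()))

print(fdist['said'], top, cumulative)
```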
Fine-grained selection of words:
V = set(text1)
long_words = [w for w in V if len(w) > 15]  # keep the words w that are longer than 15 characters
sorted(long_words)
Out[57]:
[u'CIRCUMNAVIGATION',
u'Physiognomically',
u'apprehensiveness',
u'cannibalistically',
u'characteristically',
u'circumnavigating',
u'circumnavigation',
u'circumnavigations',
u'comprehensiveness',
u'hermaphroditical',
u'indiscriminately',
u'indispensableness',
u'irresistibleness',
u'physiognomically',
u'preternaturalness',
u'responsibilities',
u'simultaneousness',
u'subterraneousness',
u'supernaturalness',
u'superstitiousness',
u'uncomfortableness',
u'uncompromisedness',
u'undiscriminating',
u'uninterpenetratingly']
fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])  # words longer than 7 characters that occur more than 7 times
Out[61]:
[u'#14-19teens',
u'#talkcity_adults',
u'((((((((((',
u'........',
u'Question',
u'actually',
u'anything',
u'computer',
u'cute.-ass',
u'everyone',
u'football',
u'innocent',
u'listening',
u'remember',
u'seriously',
u'something',
u'together',
u'tomorrow',
u'watching']
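The same two-condition filter can be tried on a toy list. This sketch uses `collections.Counter` as a stand-in for FreqDist and a lower frequency threshold (more than 1) so that the small made-up sample produces a result.

```python
from collections import Counter

# Made-up chat-like tokens, not taken from text5
tokens = ["something", "lol", "something", "actually", "hi",
          "actually", "something", "tomorrow"]
counts = Counter(tokens)

# Keep words that are both long (more than 7 characters) and repeated
frequent_long = sorted(w for w in set(tokens)
                       if len(w) > 7 and counts[w] > 1)
print(frequent_long)
```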
Collocations and Bigrams
text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
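Collocations are built from bigrams, that is, pairs of adjacent words; a collocation is roughly a bigram that turns up unusually often. The sketch below only shows how to form and count the pairs with the standard library; the helper `bigram_pairs` is hypothetical, and the statistical scoring that `collocations()` applies is not reproduced.

```python
from collections import Counter

def bigram_pairs(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = ["more", "is", "said", "than", "done"]
pairs = bigram_pairs(tokens)
pair_counts = Counter(pairs)   # how often each adjacent pair occurs

print(pairs)
```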
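# To see the distribution of word lengths in a text, build a FreqDist from a long list of numbers, where each number is the length of the corresponding word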
[len(w) for w in text1]
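A small sketch of that word-length distribution, using `collections.Counter` in place of FreqDist and the short `sent3` list from earlier in place of text1. Each key is a word length and each value is the number of tokens with that length.

```python
from collections import Counter

sent3 = ["In", "the", "beginning", "God", "created", "the",
         "heaven", "and", "the", "earth", "."]

lengths = [len(w) for w in sent3]   # one number per token
length_counts = Counter(lengths)    # length -> number of tokens

most_common_length, how_many = length_counts.most_common(1)[0]
print(length_counts, most_common_length, how_many)
```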
sent7 = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join", "the",
"board", "as", "a", "nonexecutive", "director", "Nov", "29", "."]
[w for w in sent7 if len(w) < 4]
Out[67]: [',', '61', 'old', ',', 'the', 'as', 'a', 'Nov', '29', '.']
[w for w in sent7 if len(w) <= 4]
Out[68]: [',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov', '29', '.']
[w for w in sent7 if len(w) != 4]
Out[71]:
['Pierre',
'Vinken',
',',
'61',
'years',
'old',
',',
'the',
'board',
'as',
'a',
'nonexecutive',
'director',
'Nov',
'29',
'.']
Automatic Natural Language Understanding