Getting Started with NLTK
from matplotlib import pyplot as plt
from nltk import book
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
book.text1
<Text: Moby Dick by Herman Melville 1851>
# Search for a word and show every occurrence with its surrounding context
book.text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
# Find words that appear in similar contexts, for example "the ___ pictures" and "the ___ size": words that share the same surrounding words.
book.text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
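# common_contexts finds the contexts shared by two or more words. In the output, _ stands for the word itself and separates the word before it from the word after it.
book.text2.common_contexts(["monstrous", "very"])
print('-' * 100)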
book.text2.common_contexts(["monstrous"])
a_pretty is_pretty a_lucky am_glad be_glad
----------------------------------------------------------------------------------------------------
a_pretty was_happy is_fond a_lucky a_deal am_glad is_pretty be_glad
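The idea behind similar and common_contexts can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: the helper names build_context_index, common_contexts, and similar_words, and the toy sentence, are assumptions made for the example. A word's context is taken to be the pair (previous word, next word).

```python
from collections import defaultdict

def build_context_index(tokens):
    """Map each lowercased word to the set of (left, right) neighbours it occurs between."""
    index = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    for left, word, right in zip(lowered, lowered[1:], lowered[2:]):
        index[word].add((left, right))
    return index

def common_contexts(index, words):
    """Contexts shared by every word in `words`, formatted as left_right."""
    shared = set.intersection(*(index[w] for w in words))
    return sorted("{}_{}".format(left, right) for left, right in shared)

def similar_words(index, word):
    """Other words sharing at least one context with `word`, most shared contexts first."""
    target = index[word]
    scores = {w: len(ctx & target) for w, ctx in index.items() if w != word}
    return sorted((w for w, s in scores.items() if s > 0),
                  key=lambda w: (-scores[w], w))

# Toy text (hypothetical), chosen so that "monstrous" shares contexts with other words.
tokens = ("the monstrous size and the huge size of a very pretty "
          "and a monstrous pretty thing").split()
index = build_context_index(tokens)

print(common_contexts(index, ["monstrous", "very"]))  # ['a_pretty']
print(similar_words(index, "monstrous"))              # ['huge', 'very']
```

With a single word, common_contexts simply lists all of that word's contexts, which is why the second call above prints more entries than the first.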
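Plot where each of the given words occurs in the text. The x-axis is the word's position (offset) in the text, so the plot shows how each word is distributed across the document.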
book.text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
plt.show()
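The data behind a dispersion plot is just a list of token positions per word. The sketch below computes those positions without plotting; word_offsets and the sample sentence are hypothetical, and matching here is exact and case-sensitive by assumption.

```python
def word_offsets(tokens, targets):
    """Return {target: [positions]} where each position is a token index in the text."""
    offsets = {t: [] for t in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy text (hypothetical); a real run would pass the tokens of book.text4.
tokens = "we the citizens value freedom and the citizens defend freedom".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "democracy"])

for word, positions in offsets.items():
    print(word, positions)
```

Each position would become one tick mark on the row for that word; a word that never occurs, such as "democracy" here, gets an empty row.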
Count how many times a word occurs
book.text3.count("smote")
5
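Frequency statistics. The resulting fd is not sorted; to get the most frequent words, call most_common. In essence, fd is a dictionary that maps each word to its count.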
from nltk import probability
fd = probability.FreqDist(book.text1)
fd
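# keys() is unordered; wrap it in list() so it can be sliced
words = list(fd.keys())
print(words[0:50])
# Build a string with the counts of the first 10 words
w_str = ''
for w in words[0:10]:
    w_str += str(fd[w]) + ' '
print(w_str)
# Look up the count of a single word
print(fd['whale'])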
fd.most_common(50)
[u'funereal', u'unscientific', u'divinely', u'foul', u'four', u'gag', u'prefix', u'woods', u'clotted', u'Duck', u'hanging', u'plaudits', u'woody', u'Until', u'marching', u'disobeying', u'canes', u'granting', u'advantage', u'Westers', u'insertion', u'DRYDEN', u'formless', u'Untried', u'superficially', u'Western', u'portentous', u'beacon', u'meadows', u'sinking', u'Ding', u'Spurn', u'treasuries', u'churned', u'oceans', u'powders', u'tinkerings', u'tantalizing', u'yellow', u'bolting', u'uncertain', u'stabbed', u'bringing', u'elevations', u'ferreting', u'believers', u'wooded', u'songster', u'uttering', u'scholar']
1 1 2 11 74 2 1 9 2 2
906
[(u',', 18713),
(u'the', 13721),
(u'.', 6862),
(u'of', 6536),
(u'and', 6024),
(u'a', 4569),
(u'to', 4542),
(u';', 4072),
(u'in', 3916),
(u'that', 2982),
(u"'", 2684),
(u'-', 2552),
(u'his', 2459),
(u'it', 2209),
(u'I', 2124),
(u's', 1739),
(u'is', 1695),
(u'he', 1661),
(u'with', 1659),
(u'was', 1632),
(u'as', 1620),
(u'"', 1478),
(u'all', 1462),
(u'for', 1414),
(u'this', 1280),
(u'!', 1269),
(u'at', 1231),
(u'by', 1137),
(u'but', 1113),
(u'not', 1103),
(u'--', 1070),
(u'him', 1058),
(u'from', 1052),
(u'be', 1030),
(u'on', 1005),
(u'so', 918),
(u'whale', 906),
(u'one', 889),
(u'you', 841),
(u'had', 767),
(u'have', 760),
(u'there', 715),
(u'But', 705),
(u'or', 697),
(u'were', 680),
(u'now', 646),
(u'which', 640),
(u'?', 637),
(u'me', 627),
(u'like', 624)]
fd.plot(50)
plt.show()
fd.plot(50, cumulative=True)
plt.show()
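In NLTK 3, FreqDist is built on collections.Counter, so the ranking and the cumulative curve can be reproduced with the standard library alone. The sketch below uses a toy sentence (an assumption for illustration) instead of book.text1.

```python
from collections import Counter
from itertools import accumulate

# Toy text (hypothetical); FreqDist(book.text1) counts tokens the same way.
tokens = "the whale and the sea and the ship".split()
fd = Counter(tokens)

# The two most frequent words, highest count first
top = fd.most_common(2)
print(top)  # [('the', 3), ('and', 2)]

# Running total of counts over the ranked words: the values a cumulative plot draws
cumulative = list(accumulate(count for _, count in fd.most_common()))
print(cumulative)  # [3, 5, 6, 7, 8]
```

The last cumulative value equals the total number of tokens, which is why the cumulative plot flattens out as rarer words are added.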
n-gram
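Use collocations to find word pairs (n-grams) that frequently occur together. The call below uses the default setting, which yields 2-grams (bigrams).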
book.text4.collocations(window_size=2)
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
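A bigram is a pair of adjacent tokens. The sketch below extracts and counts them with the standard library; it only counts raw pairs, whereas NLTK's collocations also filters and scores candidates, so the rankings will differ. The bigrams helper and the sample sentence are assumptions for illustration.

```python
from collections import Counter

def bigrams(tokens):
    """Return every pair of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))

# Toy text (hypothetical); a real run would pass the tokens of book.text4.
tokens = "fellow citizens of the united states , fellow citizens".split()
pair_counts = Counter(bigrams(tokens))

print(pair_counts.most_common(1))  # [(('fellow', 'citizens'), 2)]
```

Raw counts alone tend to favor pairs of very common words such as ("of", "the"), which is the reason a collocation finder adds filtering and an association score on top.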