1.1 Computing with Language: Texts and Words
>>> from __future__ import division
>>> 1/3
0.33333333333333331
>>> import nltk
>>> from nltk.book import *
>>> text2.concordance("world")
Displaying 25 of 93 matches:
wn to the payment of one for all the world ." " It is certainly an unpleasant t
d have left almost everything in the world to THEM ." This argument was irresis
Alt-p recalls a previously entered command.
>>> text1.similar("good")
Building word-context index...
great much large small the common in it long that white certain close
considerable important old sharp such well whale
>>> text1.common_contexts(['good','great'])
a_christian a_deal a_long a_man a_whale as_a so_a the_god too_a
>>> text1.dispersion_plot(['good','great'])
>>> text1.generate()
Building ngram index...
[ Moby Dick ?' "' Nay , could steer a ship of good omen , and tried to
open that part is still retained , but whose mysteries not even the
tokens: the total number of items in the text (repeated words are counted every time they occur)
word types: repeated words are counted only once
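The tokens-versus-types distinction above can be sketched without NLTK, using an ordinary word list (the sentence here is made up for illustration):

```python
# Tokens count every occurrence; types collapse duplicates via set().
words = ['the', 'dog', 'saw', 'the', 'cat']
tokens = len(words)       # 5: 'the' is counted twice
types = len(set(words))   # 4: 'the' is counted once
print(tokens, types)
```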
>>> from __future__ import division
>>> len(set(text1))/ len(text1)
0.074062855850225637
>>> text1.count("good")/len(set(text1))
0.0099394315887560182
>>> def percentage(count, total):
...     return 100 * count / total
>>> percentage(4,8)
50.0
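The `len(set(text1)) / len(text1)` ratio computed above can be wrapped in a small helper in the same spirit as `percentage` (a minimal sketch; the sample word list is made up):

```python
def lexical_diversity(text):
    # ratio of distinct word types to total tokens
    return len(set(text)) / len(text)

print(lexical_diversity(['a', 'b', 'a', 'c']))  # 0.75: 3 types, 4 tokens
```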
1.2 Texts as Lists of Words
>>> aa = "The dog is outside the house."  # string representation
>>> len(aa)
29
>>> aa = ['The','dog','is','outside','the','house']  # list representation
>>> len(aa)
6
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> sent1+sent2
['Call', 'me', 'Ishmael', '.', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
>>> sent1.append('En')
>>> sent1
['Call', 'me', 'Ishmael', '.', 'En']
>>> sent1[:2]
['Call', 'me']
>>> sent1[2:]
['Ishmael', '.', 'En']
>>> sent1[2:4]=['Stupid','Tom']
>>> sent1
['Call', 'me', 'Stupid', 'Tom', 'En']
>>> aa = ['the','dog'
'is','there']
>>> aa
['the', 'dogis', 'there']
Note: the missing comma after 'dog' makes Python concatenate the adjacent string literals into 'dogis'.
>>> ''.join(['Hi','Mike'])
'HiMike'
>>> 'Hi Mike'.split()
['Hi', 'Mike']
1.3 Simple Statistics
>>> freq5 = FreqDist(text5)
>>> freq5
<FreqDist with 45010 outcomes>
>>> ss = freq5.keys()
>>> ss[:50]
['.', 'JOIN', 'PART', '?', 'lol', 'to', 'i', 'the', 'you', ',', 'I', 'a', 'hi', 'me', '...', 'is', '..', 'in', 'ACTION', '!', 'and', 'it', 'that', 'hey', 'my', 'of', 'u', "'s", 'for', 'on', 'what', 'here', 'are', '....', 'not', 'do', 'all', 'have', 'up', 'like', 'no', 'with', 'chat', 'was', "n't", 'so', 'your', "'m", '/', 'good']
>>> freq5['JOIN']
1021
>>> freq5.plot(30, cumulative=True)
>>> freq5.hapaxes()
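Hapaxes are words that occur exactly once in a text. A minimal sketch of the idea, using `collections.Counter` in place of NLTK's `FreqDist` (the word list is made up for illustration):

```python
from collections import Counter

words = ['to', 'be', 'or', 'not', 'to', 'be', 'maybe']
freq = Counter(words)
# Words whose count is exactly 1 are the hapaxes.
hapaxes = [w for w, n in freq.items() if n == 1]
print(sorted(hapaxes))  # ['maybe', 'not', 'or']
```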
>>> text1_freq = FreqDist(text1)
>>> V = set(text1)
>>> long_freq_words = [w for w in V if len(w) > 15 and text1_freq[w] > 1]
>>> text1.collocations()
Building collocations list
>>> aa_freq = FreqDist([len(w) for w in text1])
>>> aa_freq.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
>>> aa_freq.items()
[(3, 50223), (1, 47933), ...]
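The word-length distribution above can be sketched without the corpus, again with `collections.Counter` standing in for `FreqDist` (the word list is made up):

```python
from collections import Counter

words = ['call', 'me', 'ishmael', 'now', 'book']
# Count how often each word length occurs.
length_freq = Counter(len(w) for w in words)
print(length_freq[4])  # 2: 'call' and 'book' both have length 4
```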
>>> aa = set([w.capitalize() for w in text1 if w.isalpha()])
>>> len(aa)
16948
>>> len(text1)
260819
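The vocabulary count above normalizes case (via `capitalize`) and drops non-alphabetic tokens, so it is smaller than `len(set(text1))`. A minimal sketch of the same normalization on a made-up token list, lowercasing instead of capitalizing:

```python
text = ['The', 'dog', 'saw', 'the', 'Dog', '.', '!']
# Keep alphabetic tokens only, and fold case so 'The'/'the' count once.
vocab = set(w.lower() for w in text if w.isalpha())
print(sorted(vocab))  # ['dog', 'saw', 'the']
print(len(vocab))     # 3
```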