NLP with Python 1: Language Processing and Python

1.1 Computing with Language: Texts and Words

>>> from __future__ import division
>>> 1/3
0.33333333333333331

Under Python 2, / between two integers truncates; the __future__ import switches it to true division. (Python 3 behaves this way by default, so the import is unnecessary there.)


>>> import nltk
>>> from nltk.book import *


>>> text2.concordance("world")
Displaying 25 of 93 matches:
wn to the payment of one for all the world ." " It is certainly an unpleasant t
d have left almost everything in the world to THEM ." This argument was irresis

Alt-p recalls previously entered commands (command history in the interactive shell).


>>> text1.similar("good")
Building word-context index...
great much large small the common in it long that white certain close
considerable important old sharp such well whale
>>> text1.common_contexts(['good','great'])
a_christian a_deal a_long a_man a_whale as_a so_a the_god too_a


>>> text1.dispersion_plot(['good','great'])


>>> text1.generate()
Building ngram index...
[ Moby Dick ?' "' Nay , could steer a ship of good omen , and tried to
open that part is still retained , but whose mysteries not even the


tokens: the total number of word tokens in the text (repeated words are counted every time they occur)

word types: the distinct words of the text; each repeated word is counted only once

>>> from __future__ import division
>>> len(set(text1))/ len(text1)
0.074062855850225637
>>> text1.count("good")/len(set(text1))
0.0099394315887560182


>>> def percentage(count, total):
...     return 100 * count / total

>>> percentage(4,8)
50.0
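
The lexical-diversity ratio computed above can be packaged the same way. A minimal sketch in the same spirit (the name lexical_diversity follows the book's own later usage); the output matches the value computed for text1 above:

>>> def lexical_diversity(text):
...     return len(set(text)) / len(text)

>>> lexical_diversity(text1)
0.074062855850225637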


1.2 Texts as Lists of Words

>>> aa = "The dog is outside the house."字符串表示
>>> len(aa)
29
>>> aa = ['The','dog','is','outside','the','house']  # a list: len() counts elements
>>> len(aa)
6


>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> sent1+sent2
['Call', 'me', 'Ishmael', '.', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
>>> sent1.append('En')
>>> sent1
['Call', 'me', 'Ishmael', '.', 'En']


>>> sent1[:2]
['Call', 'me']
>>> sent1[2:]
['Ishmael', '.', 'En']
>>> sent1[2:4]=['Stupid','Tom']
>>> sent1
['Call', 'me', 'Stupid', 'Tom', 'En']


>>> aa = ['the','dog'
...       'is','there']
>>> aa
['the', 'dogis', 'there']
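
The merged 'dogis' is not a list bug: the comma after 'dog' is missing, so Python silently applies implicit concatenation of adjacent string literals. A minimal demonstration:

>>> 'dog' 'is'
'dogis'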


>>> ''.join(['Hi','Mike'])
'HiMike'
>>> 'Hi Mike'.split()
['Hi', 'Mike']


1.3 Simple Statistics

>>> freq5 = FreqDist(text5)
>>> freq5
<FreqDist with 45010 outcomes>
>>> ss = freq5.keys()
>>> ss[:50]
['.', 'JOIN', 'PART', '?', 'lol', 'to', 'i', 'the', 'you', ',', 'I', 'a', 'hi', 'me', '...', 'is', '..', 'in', 'ACTION', '!', 'and', 'it', 'that', 'hey', 'my', 'of', 'u', "'s", 'for', 'on', 'what', 'here', 'are', '....', 'not', 'do', 'all', 'have', 'up', 'like', 'no', 'with', 'chat', 'was', "n't", 'so', 'your', "'m", '/', 'good']
>>> freq5['JOIN']
1021

(This transcript is from NLTK 2.x, where keys() returns the samples sorted by decreasing frequency; in NLTK 3, use freq5.most_common(50) instead.)


>>> freq5.plot(30, cumulative=True)  # cumulative frequency plot of the 30 most common tokens


>>> freq5.hapaxes()  # hapaxes: words that occur only once in the text
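
To confirm that reading of the method, a quick sketch (manual is an illustrative name) that rebuilds the hapax list by hand from the frequency counts:

>>> manual = [w for w in freq5 if freq5[w] == 1]
>>> set(manual) == set(freq5.hapaxes())
True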


>>> V = set(text1)
>>> text1_freq = FreqDist(text1)
>>> long_freq_words = [w for w in V if len(w) > 15 and text1_freq[w] > 1]


>>> text1.collocations()
Building collocations list
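
Collocations are pairs of words that occur together unusually often; the underlying building block is the bigram. NLTK exposes a bigrams() helper for producing the pairs (in NLTK 3 it returns a generator, hence the list()):

>>> from nltk import bigrams
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]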


>>> aa_freq = FreqDist([len(w) for w in text1])  # distribution of word lengths
>>> aa_freq.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
>>> aa_freq.items()
[(3, 50223), (1, 47933), ...]
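
A few more FreqDist methods summarize this distribution; the outputs below follow from the counts above (50223 three-letter tokens out of the 260819 tokens of text1):

>>> aa_freq.max()    # the most common word length
3
>>> aa_freq[3]       # number of tokens of length 3
50223
>>> aa_freq.freq(3)  # their share of all tokens: 50223/260819 ≈ 0.1926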





>>> aa = set([w.capitalize() for w in text1 if w.isalpha()])
>>> len(aa)
16948
>>> len(text1)
260819
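
The snippet above counts the vocabulary case-insensitively: capitalize() maps 'the', 'The', and 'THE' onto one key, and isalpha() drops punctuation and numbers. An equivalent, perhaps more common, spelling uses lower(), which induces the same equivalence classes on alphabetic tokens:

>>> len(set(w.lower() for w in text1 if w.isalpha()))
16948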


