1.1 Computing with Language: Texts and Words
>>> from __future__ import division
>>> 1/3
0.33333333333333331
>>> import nltk
>>> from nltk.book import *
>>> text2.concordance("world")
Displaying 25 of 93 matches:
wn to the payment of one for all the world ." " It is certainly an unpleasant t
d have left almost everything in the world to THEM ." This argument was irresis
Alt-p recalls a previously entered command.
>>> text1.similar("good")
Building word-context index...
great much large small the common in it long that white certain close
considerable important old sharp such well whale
>>> text1.common_contexts(['good','great'])
a_christian a_deal a_long a_man a_whale as_a so_a the_god too_a
>>> text1.dispersion_plot(['good','great'])
>>> text1.generate()
Building ngram index...
[ Moby Dick ?' "' Nay , could steer a ship of good omen , and tried to
open that part is still retained , but whose mysteries not even the
tokens: the total number of items in the text (repeated words are counted every time they occur)
word types: repeated words are counted only once
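The tokens-versus-types distinction above can be sketched without NLTK, using an ordinary word list (the sentence here is made up for illustration):

```python
# Tokens count every occurrence; types collapse duplicates via set().
words = ['the', 'dog', 'saw', 'the', 'cat']
tokens = len(words)       # 5: 'the' is counted twice
types = len(set(words))   # 4: 'the' is counted once
print(tokens, types)
```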
>>> from __future__ import division
>>> len(set(text1))/ len(text1)
0.074062855850225637
>>> text1.count("good")/len(set(text1))
0.0099394315887560182
>>> def percentage(count, total):
...     return 100 * count / total
>>> percentage(4,8)
50.0
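The `len(set(text1)) / len(text1)` ratio computed above can be wrapped in a small helper in the same spirit as `percentage` (a minimal sketch; the sample word list is made up):

```python
def lexical_diversity(text):
    # ratio of distinct word types to total tokens
    return len(set(text)) / len(text)

print(lexical_diversity(['a', 'b', 'a', 'c']))  # 0.75: 3 types, 4 tokens
```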
1.2 Texts as Lists of Words
>>> aa = "The dog is outside the house."  # string representation
>>> len(aa)
29
>>> aa = ['The','dog','is','outside','the','house']  # list representation
>>> len(aa)
6
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> sent1+sent2
['Call', 'me', 'Ishmael', '.', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
>>> sent1.append('En')
>>> sent1
['Call', 'me', 'Ishmael', '.', 'En']
>>> sent1[:2]
['Call', 'me']
>>> sent1[2:]
['Ishmael', '.', 'En']
>>> sent1[2:4]=['Stupid','Tom']
>>> sent1
['Call', 'me', 'Stupid', 'Tom', 'En']
>>> aa = ['the','dog'
'is','there']
>>> aa
['the', 'dogis', 'there']
Note: the missing comma after 'dog' makes Python concatenate the adjacent string literals into 'dogis'.
>>> ''.join(['Hi','Mike'])
'HiMike'
>>> 'Hi Mike'.split()
['Hi', 'Mike']
1.3 Simple Statistics
>>> freq5 = FreqDist(text5)
>>> freq5
<FreqDist with 45010 outcomes>
>>> ss = freq5.keys()
>>> ss[:50]
['.', 'JOIN', 'PART', '?', 'lol', 'to', 'i', 'the', 'you', ',', 'I', 'a', 'hi', 'me', '...', 'is', '..', 'in', 'ACTION', '!', 'and', 'it', 'that', 'hey', 'my', 'of', 'u', "'s", 'for', 'on', 'what', 'here', 'are', '....', 'not', 'do', 'all', 'have', 'up', 'like', 'no', 'with', 'chat', 'was', "n't", 'so', 'your', "'m", '/', 'good']
>>> freq5['JOIN']
1021
>>> freq5.plot(30, cumulative=True)
>>> freq5.hapaxes()
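Hapaxes are words that occur exactly once in a text. A minimal sketch of the idea, using `collections.Counter` in place of NLTK's `FreqDist` (the word list is made up for illustration):

```python
from collections import Counter

words = ['to', 'be', 'or', 'not', 'to', 'be', 'maybe']
freq = Counter(words)
# Words whose count is exactly 1 are the hapaxes.
hapaxes = [w for w, n in freq.items() if n == 1]
print(sorted(hapaxes))  # ['maybe', 'not', 'or']
```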
>>> text1_freq = FreqDist(text1)
>>> V = set(text1)
>>> long_freq_words = [w for w in V if len(w) > 15 and text1_freq[w] > 1]
>>> text1.collocations()
Building collocations list
>>> aa_freq = FreqDist([len(w) for w in text1])
>>> aa_freq.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
>>> aa_freq.items()
[(3, 50223), (1, 47933), ...]
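The word-length distribution above can be sketched without the corpus, again with `collections.Counter` standing in for `FreqDist` (the word list is made up):

```python
from collections import Counter

words = ['call', 'me', 'ishmael', 'now', 'book']
# Count how often each word length occurs.
length_freq = Counter(len(w) for w in words)
print(length_freq[4])  # 2: 'call' and 'book' both have length 4
```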
>>> aa = set([w.capitalize() for w in text1 if w.isalpha()])
>>> len(aa)
16948
>>> len(text1)
260819
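The vocabulary count above normalizes case (via `capitalize`) and drops non-alphabetic tokens, so it is smaller than `len(set(text1))`. A minimal sketch of the same normalization on a made-up token list, lowercasing instead of capitalizing:

```python
text = ['The', 'dog', 'saw', 'the', 'Dog', '.', '!']
# Keep alphabetic tokens only, and fold case so 'The'/'the' count once.
vocab = set(w.lower() for w in text if w.isalpha())
print(sorted(vocab))  # ['dog', 'saw', 'the']
print(len(vocab))     # 3
```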