Natural Language Processing with Python - 1. Language Processing and Python

Natural Language Processing with Python is an introductory book that combines natural language processing with Python. A second edition is in the works and is expected to be finished in 2016; it targets Python 3, so many parts need revising. Link to the original book: Natural Language Processing with Python.

I will skip the installation and corpus download and get straight to hands-on practice:
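
For reference, everything below assumes NLTK is installed and the book's sample texts (text1 through text9) are loaded; a minimal setup sketch (the download step is only needed once):

import nltk
nltk.download('book')     # one-time download of the corpora used by the book
from nltk.book import *   # defines text1 ... text9 and sent1 ... sent9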

1. Searching text

1.1 Use a text's concordance method to look up a word. Make sure you have run from nltk.book import * first (as in the setup above). Example:

>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>

1.2 Words that occur in similar contexts usually have similar meanings, so we can use a text's similar method to find the words that are used most like a given word. Example:

>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>

1.3 To see which contexts are shared by two or more words in the same text, use the common_contexts method. Example:

>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>

1.4 The dispersion_plot method draws a dispersion plot showing where each of the given words occurs across the text. Example:

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>

[Figure: dispersion plot of the five words across text4]

1.5 The generate method produces random text. At the moment it does not work under Python 3, and another method may eventually replace it; a rough bigram-based workaround is sketched after the book's example output below.

>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>
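
As a stand-in while generate is unavailable, you can build a conditional frequency distribution over bigrams and repeatedly follow the most frequent successor. This is only a sketch of the general idea, not what the book's generate does; the seed word 'In' is my own arbitrary choice:

import nltk
from nltk.book import text3

# For each word, count the words that follow it in the text.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(text3))

word = 'In'                      # arbitrary seed word (not from the book)
words = [word]
for _ in range(20):
    word = cfd[word].max()       # greedily take the most frequent successor
    words.append(word)
print(' '.join(words))           # greedy choice can loop; sample randomly for more variety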

2. Counting vocabulary

2.1 Use len to get the length of a text. This is not the word count in the everyday sense, because punctuation marks are counted along with words, so "tokens" (词例) is the more accurate unit.

>>> len(text3)
44764
>>>

2.2 Use set to get the vocabulary of a text and sorted to order it. Wrapping the whole thing in len gives the number of distinct words, which NLP calls "types" (词型), as opposed to tokens.

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>

2.3 Dividing len(set(...)) by len(...) measures lexical diversity. The example below shows that distinct words account for only about 6% of all the words in text3; put the other way around, each word is used about 16 times on average. A small reusable helper is sketched after the example.

>>> len(set(text3)) / len(text3)
0.06230453042623537
>>>
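
If you compute this ratio often, it is handy to wrap it in a small function, along the lines of the lexical_diversity helper the book defines later in the chapter:

def lexical_diversity(text):
    """Proportion of distinct tokens (types) in a text."""
    return len(set(text)) / len(text)

lexical_diversity(text3)   # about 0.062, as computed above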

2.4 Use the count method to count how often a particular word occurs; a tiny helper for the percentage part follows the example.

>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>
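
The last line is just a percentage calculation, which can likewise be pulled into a tiny helper (in the spirit of the book's percentage function):

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

percentage(text4.count('a'), len(text4))   # about 1.46, as above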

3. Frequency distributions

3.1 Use FreqDist to build the frequency distribution of all tokens in a text, then call most_common(50) to get the 50 most frequent tokens. You can also index the distribution with a word to look up its count, which largely duplicates the count method.

Note: to use FreqDist I had to run from nltk.probability import * first; the book has not been updated on this point, and following it verbatim the call simply fails.
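
If FreqDist is not in scope in your setup, either of these explicit imports works:

from nltk import FreqDist                # re-exported at the package top level
# or, equivalently:
# from nltk.probability import FreqDist  # the module where it is defined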

>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>>

3.2 Use the plot method to draw the frequency distribution; it can also draw a cumulative version, e.g. fdist1.plot(50, cumulative=True).

[Figure: cumulative frequency plot of the 50 most common tokens]
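
Note that plot draws with matplotlib, so matplotlib must be installed. A minimal sketch of the plain and cumulative variants:

fdist1 = FreqDist(text1)
fdist1.plot(50)                    # frequencies of the 50 most common tokens
fdist1.plot(50, cumulative=True)   # cumulative version, as in the figure above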

3.3 Frequency distribution of word lengths

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
  8: 9966, 9: 6428, 10: 3528, ...})
>>>
>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>
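
freq(3) above is simply the relative frequency: the count of the sample divided by the total number of outcomes, which FreqDist exposes as N(). A quick check:

fdist = FreqDist(len(w) for w in text1)
fdist.freq(3)          # 0.1925...
fdist[3] / fdist.N()   # same value: 50223 / 260819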

4. Fine-grained selection of words

4.1 Words longer than 15 characters

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>>

4.2 Words longer than 7 characters that occur more than 7 times

>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>

5. Collocations and bigrams

5.1 Wrapping the bigrams method in list gives the sequence of bigrams; without list, the result is a generator rather than a list, as illustrated in the sketch after the example below.

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
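
The generator point is easy to check: once consumed, a generator yields nothing on a second pass (a small sketch using the explicit import instead of the star import):

from nltk import bigrams

pairs = bigrams(['more', 'is', 'said', 'than', 'done'])
print(list(pairs))   # the four bigrams shown above
print(list(pairs))   # [] -- the generator is already exhausted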

5.2 Getting the collocations of a text

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>

The chapter ends with some exercises you can use to practice what you have just learned!
