Python Natural Language Processing: Chapter 1

import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

Out[2]: True

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

text1
Out[5]: <Text: Moby Dick by Herman Melville 1851>

text2
Out[6]: <Text: Sense and Sensibility by Jane Austen 1811>

text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
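The dispersion plot requires matplotlib. The common_contexts query below compares "monstrous" with "very"; the motivation is that running the same similar() query on text2 lists intensifiers such as "very" among the words Austen uses in similar contexts (a sketch; output omitted here):

text2.similar("monstrous")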

text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty a_lucky am_glad be_glad
# len() measures the text from start to end, counting words and punctuation symbols; each such unit is called a token
len(text1)
Out[11]: 260819
set(text1) # a set keeps only one copy of each element, so every repeated word is collapsed into a single entry
len(set(text1))
Out[13]: 19317
# wrapping set(text1) in sorted() gives a sorted list of the vocabulary items
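A minimal illustration on the short sentence sent1 loaded by nltk.book, where the whole sorted vocabulary fits on one line:

sorted(set(sent1)) # ['.', 'Call', 'Ishmael', 'me'] -- punctuation sorts before uppercase letters, uppercase before lowercase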

from __future__ import division

len(text1)/len((set(text1)))
Out[16]: 13.502044830977896
# on average, each word is used about 14 times
text1.count("think") # count the occurrences of "think" in text1
Out[17]: 111

100*text1.count("think")/len(text1) # the percentage of the text taken up by "think"
Out[18]: 0.04255824920730468

def lexical_diversity(text):
    return len(text)/len(set(text))

def percentage(count,total):
    return 100*count/total

lexical_diversity(text1) # same result as len(text1)/len(set(text1))
Out[20]: 13.502044830977896

percentage(text1.count("think"), len(text1)) # same result as 100*text1.count("think")/len(text1)
Out[22]: 0.04255824920730468
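Both helpers accept any list of tokens, so they can be applied to the other loaded texts as well; a small sketch (exact values depend on your NLTK data version):

lexical_diversity(text2) # average number of uses per word in Sense and Sensibility
percentage(text4.count('a'), len(text4)) # roughly 1.5% of the Inaugural Address Corpus is the word 'a'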

1.2 A Closer Look at Python: Texts as Lists of Words

Lists

sent1 = ["Call", "me", "Ishmael", "."]

len(sent1)
Out[28]: 4

lexical_diversity(sent1)
Out[29]: 1.0

sent2 = ["The", "family", "of", "Dashwood", "had", "long", "been", "settled", "in", "Sussex", "."]
sent3 = ["In", "the", "beginning", "God", "created", "the", "heaven", "and", "the", "earth", "."]
sent4 = ["Fellow", "-", "Citizens", "of", "the", "Senate", "and", "of", "the", "House", "of",
         "Representatives", ":"]
# list concatenation
sent1 + sent4
Out[33]: 
['Call',
 'me',
 'Ishmael',
 '.',
 'Fellow',
 '-',
 'Citizens',
 'of',
 'the',
 'Senate',
 'and',
 'of',
 'the',
 'House',
 'of',
 'Representatives',
 ':']

# append an element to the end of a list
sent1.append("Some")
sent1
Out[35]: ['Call', 'me', 'Ishmael', '.', 'Some']

Indexing a list: an index gives a word's position in the text

text4[173]
Out[36]: u'awaken'
text4.index('awaken')
Out[38]: 173
# slicing: pull out a sublist, i.e. an arbitrary segment of a larger text

text5[16715:16735]
Out[40]: 
[u'U86',
 u'thats',
 u'why',
 u'something',
 u'like',
 u'gamefly',
 u'is',
 u'so',
 u'good',
 u'because',
 u'you',
 u'can',
 u'actually',
 u'play',
 u'a',
 u'full',
 u'game',
 u'without',
 u'buying',
 u'it']
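A slice may omit its start or end index; a minimal sketch using sent3 from above:

sent3[:3] # ['In', 'the', 'beginning'] -- an omitted start defaults to 0
sent3[8:] # ['the', 'earth', '.'] -- an omitted end runs to the end of the list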

Variables: a name must start with a letter and may be followed by digits; abc12 is valid, but 12abc is a syntax error.

Strings:

# a list of words can be joined into a single string, and a string can be split back into a list

''.join(["Monty", "Python"])
Out[42]: 'MontyPython'

'Monty Python'.split()

Out[43]: ['Monty', 'Python']
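The separator string is inserted between the joined elements, so joining with a space makes the round trip exact:

' '.join(['Monty', 'Python']) # 'Monty Python'
'Monty Python'.split(' ') # ['Monty', 'Python'] -- back to the original list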
 
Computing with Language: Simple Statistics

saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]
Out[47]: ['said', 'than']
# frequency distributions: the FreqDist function
fdist1 = FreqDist(text1)
fdist1
vocabulary1 = fdist1.keys() # a list of all the distinct word types in the text
vocabulary1[:50] # slice to inspect the first 50 items
Out[52]: 
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck',
 u'hanging',
 u'plaudits',
 u'woody',
 u'Until',
 u'marching',
 u'disobeying',
 u'canes',
 u'granting',
 u'advantage',
 u'Westers',
 u'insertion',
 u'DRYDEN',
 u'formless',
 u'Untried',
 u'superficially',
 u'Western',
 u'portentous',
 u'meadows',
 u'sinking',
 u'Ding',
 u'Spurn',
 u'treasuries',
 u'churned',
 u'oceans',
 u'invasion',
 u'powders',
 u'tinkerings',
 u'tantalizing',
 u'yellow',
 u'bolting',
 u'uncertain',
 u'stabbed',
 u'bringing',
 u'elevations',
 u'ferreting',
 u'wooded',
 u'songster',
 u'uttering',
 u'scholar',
 u'Less']
fdist1['whale']
Out[53]: 906
fdist1.plot(50, cumulative=True)
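A FreqDist can also list the hapaxes, the words that occur exactly once; a minimal sketch:

fdist1.hapaxes()[:10] # a sample of the words appearing only once in Moby Dick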

Fine-grained selection of words:

V = set(text1)

long_words = [w for w in V if len(w) > 15] # keep words w that are longer than 15 characters

sorted(long_words)
Out[57]: 
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical',
 u'indiscriminately',
 u'indispensableness',
 u'irresistibleness',
 u'physiognomically',
 u'preternaturalness',
 u'responsibilities',
 u'simultaneousness',
 u'subterraneousness',
 u'supernaturalness',
 u'superstitiousness',
 u'uncomfortableness',
 u'uncompromisedness',
 u'undiscriminating',
 u'uninterpenetratingly']

fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7]) # words longer than 7 characters that also occur more than 7 times
Out[61]: 
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u'........',
 u'Question',
 u'actually',
 u'anything',
 u'computer',
 u'cute.-ass',
 u'everyone',
 u'football',
 u'innocent',
 u'listening',
 u'remember',
 u'seriously',
 u'something',
 u'together',
 u'tomorrow',
 u'watching']
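The chat output above still contains punctuation-only tokens such as '((((((((((' ; adding an isalpha() test keeps alphabetic words only -- a sketch:

sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7 and w.isalpha()])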

Collocations and Bigrams

text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
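Collocations are essentially word pairs (bigrams) that occur together unusually often. The raw bigrams of any token list can be extracted with nltk.bigrams -- a minimal sketch (list() is needed on recent NLTK versions, where bigrams returns a generator):

from nltk import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]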
# to inspect the distribution of word lengths in a text, build a FreqDist over a list of numbers, where each number is the length of the corresponding word
fdist = FreqDist([len(w) for w in text1])
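The resulting distribution can be queried like any other FreqDist; a short sketch:

fdist.max() # the most frequent word length (3 for this text)
fdist[3] # the number of three-character tokens
fdist.freq(3) # their proportion of the whole text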
sent7 = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join", "the",
         "board", "as", "a", "nonexecutive", "director", "Nov", "29", "."]

[w for w in sent7 if len(w) < 4]
Out[67]: [',', '61', 'old', ',', 'the', 'as', 'a', 'Nov', '29', '.']

[w for w in sent7 if len(w) <= 4]
Out[68]: [',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov', '29', '.']

[w for w in sent7 if len(w) != 4]
Out[71]: 
['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov',
 '29',
 '.']
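The same filtering pattern works with any word test, not just len(w); for example, with string methods such as endswith() or istitle() -- a brief sketch:

sorted([w for w in set(text1) if w.endswith('ableness')]) # words ending in 'ableness'
sorted([w for w in set(sent7) if w.istitle()]) # title-cased words: ['Nov', 'Pierre', 'Vinken']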


Automatic Natural Language Understanding
