NLTK02 "Natural Language Processing with Python" — code 01: Language Processing and Python

# -*- coding: utf-8 -*-
# win10 python3.5.3/python3.6.1 nltk3.2.4
# "Natural Language Processing with Python" 01 Language Processing and Python

# Install the nltk library
# pip3 install nltk==3.2.4

# Download the NLTK data (nltk_data)
'''
import nltk
nltk.download()
# When the NLTK Downloader dialog appears, set the [Download Directory] path and
# click [Download]. If the download fails or stalls, simply retry it.
'''

# 1.1 Computing with Language: Texts and Words
from __future__ import division
from nltk.book import *
'''
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
'''
print(text1)
'''<Text: Moby Dick by Herman Melville 1851>'''
print(text2)
'''<Text: Sense and Sensibility by Jane Austen 1811>'''

# Searching text
text1.concordance("monstrous")  # concordance() prints its matches and returns None
'''
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
'''
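concordance() prints each match centered in its context window. The underlying idea can be sketched in a few lines of plain Python (a simplified toy helper for illustration, not NLTK's implementation):

```python
def toy_concordance(tokens, target, width=3):
    """Collect a window of `width` tokens on each side of every match (toy sketch)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(' '.join(left + [tok] + right))
    return hits

tokens = ['a', 'most', 'monstrous', 'size', '.', 'that', 'monstrous', 'bulk']
for line in toy_concordance(tokens, 'monstrous'):
    print(line)
# a most monstrous size . that
# size . that monstrous bulk
```

Unlike the real concordance(), this returns the matches instead of printing a fixed-width display.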
text1.similar("monstrous")  # similar() also prints directly and returns None
'''
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
'''

text2.similar("monstrous")
'''
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
'''

text2.common_contexts(["monstrous", "very"])
'''a_pretty am_glad a_lucky is_pretty be_glad'''

# Dispersion plot (requires matplotlib)
#text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

# Text generation: Text.generate() was removed in NLTK 3.x (reintroduced in 3.4+
# with a different API), so this first-edition call no longer works here.
#text3.generate()

# Counting vocabulary
print(len(text3)) # 44764

print(sorted(set(text3)))
'''
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 
...
 'your', 'yourselves', 'youth']
'''
print(len(set(text3))) # 2789
print(len(text3)/len(set(text3))) # 16.050197203298673
print(text3.count("smote")) # 5
print(100*text4.count('a')/len(text4)) # 1.4643016433938312

def lexical_diversity(text):
    return len(text)/len(set(text))

def percentage(count, total):
    return 100*count/total

print(lexical_diversity(text3)) # 16.050197203298673
print(lexical_diversity(text5)) # 7.420046158918563
print(percentage(4, 5)) # 80.0
print(percentage(text4.count('a'), len(text4))) # 1.4643016433938312
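One caveat worth noting: lexical_diversity divides by the number of distinct tokens, so it raises ZeroDivisionError on an empty text. A defensive variant (a sketch, not from the book):

```python
def safe_lexical_diversity(text):
    # Return 0.0 for empty input instead of raising ZeroDivisionError
    vocab = set(text)
    return len(text) / len(vocab) if vocab else 0.0

print(safe_lexical_diversity(['more', 'is', 'said', 'than', 'done']))  # 1.0
print(safe_lexical_diversity([]))  # 0.0
```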

# 1.2 A Closer Look at Python: Texts as Lists of Words
# Lists
sent1 = ['Call', 'me', 'Ishmael', '.']
print(sent1, len(sent1))
'''['Call', 'me', 'Ishmael', '.'] 4'''
print(lexical_diversity(sent1)) # 1.0
print(sent2) # ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
print(sent3) # ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']

l1 = ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
print(l1) # ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
l2 = sent4 + sent1
print(l2) # ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
sent1.append("Some")
print(sent1) # ['Call', 'me', 'Ishmael', '.', 'Some']

# Indexing lists
print(text4[173]) # awaken
print(text4.index('awaken')) # 173
print(text5[16715:16735])
# ['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
print(text6[1600:1625])
# ['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']

sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']
print(sent[0], sent[9]) # word1 word10

# print(sent[10]) # IndexError: list index out of range

print(sent[5:8]) # ['word6', 'word7', 'word8']
print(sent[5]) # word6
print(sent[6]) # word7
print(sent[7]) # word8
print(sent[:3]) # ['word1', 'word2', 'word3']
print(text2[141565:]) # ['themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.', 'THE', 'END']

sent[0] = 'First'
sent[9] = 'Last'
print(len(sent)) # 10
sent[1:9] = ['Second', 'Third']
print(sent) # ['First', 'Second', 'Third', 'Last']
# print(sent[9]) # IndexError: list index out of range

# Variables
sent1 = ['Call', 'me', 'Ishmael', '.']
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]
print(noun_phrase) # ['bold', 'Sir', 'Robin']
wOrDs = sorted(noun_phrase)
print(wOrDs) # ['Robin', 'Sir', 'bold']

# Keywords cannot be used as variable names
# not = 'Camelot' # SyntaxError: invalid syntax

vocab = set(text1)
vocab_size = len(vocab)
print(vocab_size) # 19317

# Strings
name = 'Monty'
print("name[0]:", name[0], "\nname:", name, "\nname[:4]:", name[:4])
'''
name[0]: M 
name: Monty 
name[:4]: Mont
'''
print(name*2) # MontyMonty
print(name + '!') # Monty!
print(' '.join(['Monty', 'Python'])) # Monty Python
print('Monty Python'.split()) # ['Monty', 'Python']

# 1.3 Computing with Language: Simple Statistics
saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
print(tokens[-2:]) # ['said', 'than']
# Frequency distributions
fdist1 = FreqDist(text1)
print(fdist1) # <FreqDist with 19317 samples and 260819 outcomes>
vocabulary1 = list(fdist1.keys())
print(vocabulary1[:5]) # ['[', 'Moby', 'Dick', 'by', 'Herman']
print(fdist1['whale'])
# fdist1.plot(50, cumulative=True) # cumulative frequency plot of the 50 most common words
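FreqDist is, at its core, a counting dictionary. Its basic operations can be mimicked with the standard library's collections.Counter on a toy token list (an analogy for illustration, not NLTK's actual implementation):

```python
from collections import Counter

tokens = ['more', 'is', 'said', 'than', 'done', 'more', 'is', 'done']
fd = Counter(tokens)

print(fd['more'])                     # count of one sample, like fdist['whale'] -> 2
print(sum(fd.values()))               # total number of outcomes, like fdist.N() -> 8
print(fd.most_common(3))              # the three samples tied at count 2
print(fd['more'] / sum(fd.values()))  # relative frequency, like fdist.freq('more') -> 0.25
```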

# Fine-grained selection of words
V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))
'''
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 
'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 
'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 
'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
'''
fdist5 = FreqDist(text5)
l5 = sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
print(l5)
'''
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 
'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 
'something', 'together', 'tomorrow', 'watching']
'''

# Collocations and bigrams
from nltk import bigrams
bis = bigrams(['more', 'is', 'said', 'than', 'done'])
print(list(bis)) # [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
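A bigram list is just each token paired with its successor, which plain zip can produce (a sketch of the idea behind nltk.bigrams, not its implementation):

```python
def toy_bigrams(tokens):
    # Pair each token with the token that follows it
    return list(zip(tokens, tokens[1:]))

print(toy_bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```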

text4.collocations()
'''
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
'''
text8.collocations()
'''
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
'''
# Counting other things
print([len(w) for w in text1])
'''
[1, 4, 4, 2, 6, ..., 5, 7, 6, 1]
'''
fdist = FreqDist([len(w) for w in text1])
print(fdist) # <FreqDist with 19 samples and 260819 outcomes>
print(list(fdist.keys())) # [1, 4, 2, 6, 8, 9, 11, 5, 7, 3, 10, 12, 13, 14, 16, 15, 17, 18, 20]
print(fdist.items())
'''
dict_items([(1, 47933), (4, 42345), (2, 38513), (6, 17111), (8, 9966), (9, 6428), (11, 1873), 
(5, 26597), (7, 14399), (3, 50223), (10, 3528), (12, 1053), (13, 567), (14, 177), (16, 22), 
(15, 70), (17, 12), (18, 1), (20, 1)])
'''
print(fdist.max()) # 3
print(fdist[3]) # 50223
print(fdist.freq(3)) # 0.19255882431878046
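The same bookkeeping over word lengths (max, count, relative frequency) can be reproduced on any token list with collections.Counter, no corpus download needed (a self-contained sketch):

```python
from collections import Counter

tokens = ['Call', 'me', 'Ishmael', '.', 'Call']
lengths = Counter(len(w) for w in tokens)    # {4: 2, 2: 1, 7: 1, 1: 1}

print(lengths.most_common(1)[0][0])          # like fdist.max()   -> 4
print(lengths[4])                            # like fdist[4]      -> 2
print(lengths[4] / sum(lengths.values()))    # like fdist.freq(4) -> 0.4
```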

# 1.4 Back to Python: Making Decisions and Taking Control
print([w for w in sent7 if len(w) < 4]) # [',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
print([w for w in sent7 if len(w) <= 4]) # [',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
print([w for w in sent7 if len(w) == 4]) # ['will', 'join', 'Nov.']
print([w for w in sent7 if len(w) != 4])
# ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']

print(sorted([w for w in set(text1) if w.endswith('ableness')]))
'''
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 
'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
'''
print(sorted([term for term in set(text4) if 'gnt' in term])) # ['Sovereignty', 'sovereignties', 'sovereignty']
print(sorted([item for item in set(text6) if item.istitle()]))
'''
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', 'Aaah', 'Aaauggh', 'Aaaugh', 
'Aaauugh', 'Aagh', 'Aah', 'Aauuggghhh', 'Aauuugh', 'Aauuuuugh', 'Aauuuves', 'Action', 'Actually', 
'African', 'Ages', 'Aggh', 'Agh', 'Ah', 'Ahh', 'Alice', 'All', 'Allo', 'Almighty', 'Alright', 
...
'Yeah', 'Yes', 'You', 'Your', 'Yup', 'Zoot']
'''
print(sorted([item for item in set(sent7) if item.isdigit()])) # ['29', '61']

# Operating on every element
print([len(w) for w in text1]) # [1, 4, 4, 2, 6, ... 5, 7, 6, 1]
print([w.upper() for w in text1]) # IN', 'THE', 'FORECASTLES', 'OF', ... 'FOUND', 'ANOTHER', 'ORPHAN', '.']
print(len(text1), len(set(text1)), len(set([word.lower() for word in text1]))) # 260819 19317 17231
print(len(set([word.lower() for word in text1 if word.isalpha()]))) # 16948
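The three vocabulary sizes shrink step by step because lowercasing merges case variants and isalpha() removes punctuation and numbers. The same pipeline on a toy token list:

```python
tokens = ['The', 'cat', 'saw', 'the', 'Cat', '.', '42']
print(len(set(tokens)))                                    # raw types        -> 7
print(len(set(w.lower() for w in tokens)))                 # case-folded      -> 5
print(len(set(w.lower() for w in tokens if w.isalpha())))  # alphabetic only  -> 3
```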

# Nested code blocks
word = 'cat'
if len(word) < 5 :
    print('word length is less than 5')
else:
    print('word >= 5')
'''word length is less than 5'''

for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)
'''
Call
me
Ishmael
.
'''

# Looping with conditions
sent1 = ['Call', 'me', 'Ishmael', '.']
for xyzzy in sent1:
    if xyzzy.endswith('l'):
        print(xyzzy)
'''
Call
Ishmael
'''

for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')
'''
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
'''

tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])
for word in tricky:
    print(word)
'''
ancient
ceiling
conceit
...
sufficiently
undeceive
undeceiving
'''

# 1.5 Automatic Natural Language Understanding
# Word sense disambiguation
# Pronoun (anaphora) resolution
# Generating language output (question answering, machine translation)
# Machine translation
# Note: the babelfish module was removed in NLTK 3.x and the Babelfish online
# translation service has been discontinued, so the session below no longer works.
#import nltk.misc.babelfish as babelfish
#babelfish.babelize_shell()
#Babel>how long before the next flight to Alice Springs?
#Babel>german
#Babel>run

# Spoken dialogue systems
import nltk.chat as chat
chat.chatbots()
'''
Which chatbot would you like to talk to?
  1: Eliza (psycho-babble)
  2: Iesha (teen anime junky)
  3: Rude (abusive bot)
  4: Suntsu (Chinese sayings)
  5: Zen (gems of wisdom)

Enter a number in the range 1-5: 1
'''
'''
Table 1-2. Functions defined for NLTK's frequency distributions
Example                      Description
fdist = FreqDist(samples)    Create a frequency distribution containing the given samples
fdist.inc(sample)            Increment the count for this sample
fdist['monstrous']           Count of the number of times the sample 'monstrous' occurred
fdist.freq('monstrous')      Frequency of the sample 'monstrous'
fdist.N()                    Total number of samples
list(fdist.keys())           The samples, sorted in order of decreasing frequency
for sample in fdist:         Iterate over the samples in order of decreasing frequency
fdist.max()                  Sample with the greatest count
fdist.tabulate()             Tabulate the frequency distribution
fdist.plot()                 Graphical plot of the frequency distribution
fdist.plot(cumulative=True)  Cumulative plot of the frequency distribution
fdist1 < fdist2              Test if the samples in fdist1 occur less frequently than in fdist2

Table 1-4. Some word comparison operators
Function         Meaning
s.startswith(t)  Test if s starts with t
s.endswith(t)    Test if s ends with t
t in s           Test if s contains t
s.islower()      Test if all cased characters in s are lowercase
s.isupper()      Test if all cased characters in s are uppercase
s.isalpha()      Test if all characters in s are alphabetic
s.isalnum()      Test if all characters in s are alphanumeric
s.isdigit()      Test if all characters in s are digits
s.istitle()      Test if s is titlecased (each word in s has an initial capital)
'''
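The operators in Table 1-4 can be tried on concrete strings without any corpus; a quick self-contained check:

```python
s = 'Ishmael'
print(s.startswith('Ish'))  # True
print(s.endswith('ael'))    # True
print('hma' in s)           # True
print(s.islower())          # False: the initial 'I' is uppercase
print(s.isalpha())          # True
print(s.istitle())          # True
print('29'.isdigit())       # True
print('NLTK3'.isalnum())    # True
```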