Python 3 Natural Language Processing: 3.8 Segmentation


Chapter 3: Processing Raw Text

3.8 Segmentation

Sentence Segmentation

Processing text at the word level usually assumes that we can divide the text into individual sentences. Some corpora already provide access at the sentence level. For example, we can compute the average number of words per sentence in the Brown Corpus:

import nltk

# Average number of words per sentence in the Brown Corpus
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

20.250994070456922
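As a quick follow-up, the same ratio can be computed per category. This is a minimal sketch, assuming the standard categories()/words(categories=...)/sents(categories=...) methods of NLTK's Brown corpus reader:

# Average sentence length for a few Brown categories
for category in nltk.corpus.brown.categories()[:3]:
    words = nltk.corpus.brown.words(categories=category)
    sents = nltk.corpus.brown.sents(categories=category)
    print(category, len(words) / len(sents))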

import pprint

# Load the pre-trained Punkt sentence tokenizer for English
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = sent_tokenizer.tokenize(text)
pprint.pprint(sents[171:181])

['"Nonsense!','" said Gregory, who was very rational when anyone else\nattempted paradox.','"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,…','I will\ntell you.','It is because they know that the train is going right.','It\nis because they know that whatever place they have taken a ticket\nfor that …','It is because after they have\npassed Sloane Square they know that the next stat…','Oh, their wild rapture!','oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w…''"\n\n"It is you who are unpoetical," replied the poet Syme.']

Word Segmentation

In Chinese, the three-character string 爱国人 (ai4 "love" [verb], guo2 "country", ren2 "person") can be segmented as 爱国/人, "country-loving person", or as 爱/国人, "love country-person".
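One way to make this ambiguity concrete is to enumerate every segmentation licensed by a dictionary. The sketch below is illustrative only: segmentations() is a hypothetical helper, and the four-entry lexicon is an assumption made for this example.

def segmentations(chars, lexicon):
    # Return every way to split chars into words drawn from lexicon
    if not chars:
        return [[]]
    results = []
    for i in range(1, len(chars) + 1):
        word = chars[:i]
        if word in lexicon:
            for rest in segmentations(chars[i:], lexicon):
                results.append([word] + rest)
    return results

lexicon = {'爱', '爱国', '人', '国人'}   # toy lexicon (assumption)
print(segmentations('爱国人', lexicon))
# [['爱', '国人'], ['爱国', '人']]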

Example 1-1: Reconstructing segmented text from the segmentation strings seg1 and seg2. seg1 and seg2 represent the initial and final segmentations of some hypothetical child speech. The segment() function uses them to reproduce the segmented text.

def segment(text, segs):
    # segs is a string of 0s and 1s; a '1' at position i marks a
    # word boundary after character i of text
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i + 1
    words.append(text[last:])
    return words

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

seg1 = "0000000000000001000000000010000000000000000100000000000"

seg2 = "0100100100100001001001000010100100010010000100010010000"

print(segment(text, seg1))

print(segment(text, seg2))

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
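Going in the other direction is straightforward: each non-final word contributes len(word) - 1 zeros and one 1 to the boundary string. The encode() function below is a hypothetical inverse of segment(), not part of the original example:

def encode(words):
    # Hypothetical helper: rebuild the 0/1 boundary string from a word list
    segs = ''
    for word in words[:-1]:
        segs += '0' * (len(word) - 1) + '1'
    return segs + '0' * (len(words[-1]) - 1)

print(encode(segment(text, seg2)) == seg2)   # True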

Example 1-2: Computing the cost of storing the lexicon and reconstructing the source text.

def evaluate(text, segs):
    # Objective: number of tokens plus the total size of the lexicon
    # (the unique words joined by spaces); smaller is better
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = len(' '.join(list(set(words))))
    return text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

seg1 = "0000000000000001000000000010000000000000000100000000000"

seg2 = "0100100100100001001001000010100100010010000100010010000"

seg3 = "0000100100000011001000000110000100010000001100010000001"

print(segment(text, seg3))

print(evaluate(text, seg3))

print(evaluate(text, seg2))

print(evaluate(text, seg1))

['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']

46

47

63
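To see where those totals come from, the objective can be split into its two components. breakdown() is a small diagnostic helper added here for illustration, not part of the original example:

def breakdown(text, segs):
    # (token count, size of the space-joined lexicon)
    words = segment(text, segs)
    return len(words), len(' '.join(set(words)))

for segs in (seg1, seg2, seg3):
    print(breakdown(text, segs))
# seg1 -> (4, 59), seg2 -> (16, 31), seg3 -> (14, 32)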

Example 1-3: Non-deterministic search using simulated annealing: begin searching with phrase segmentations only; randomly perturb the zeros and ones in proportion to the "temperature"; with each iteration the temperature is lowered, and the amount of boundary perturbation is reduced.

from random import randint

def flip(segs, pos):
    # Flip the 0/1 boundary marker at position pos
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    # Flip n randomly chosen boundary markers
    for i in range(n):
        segs = flip(segs, randint(0, len(segs) - 1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            # Perturb more boundaries while the temperature is high
            guess = flip_n(segs, int(round(temperature)))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    return segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

seg1 = "0000000000000001000000000010000000000000000100000000000"

anneal(text, seg1, 5000, 1.2)

60 ['doyouseetheki', 'tty', 'see', 'thedoggy', 'doyouliketh', 'ekittylike', 'thedoggy']

58 ['doy', 'ouseetheki', 'ttysee', 'thedoggy', 'doy', 'o', 'ulikethekittylike', 'thedoggy']

56 ['doyou', 'seetheki', 'ttysee', 'thedoggy', 'doyou', 'liketh', 'ekittylike', 'thedoggy']

54 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'likethekittylike', 'thedoggy']

53 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']

51 ['doyou', 'seethekittysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']

42 ['doyou', 'see', 'thekitty', 'see', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']

'0000100100000001001000000010000100010000000100010000000'
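Because anneal() perturbs boundaries at random, each run can print a different trajectory and end in a different local optimum. A minimal sketch for repeatable runs, assuming it is acceptable to seed Python's global random number generator:

import random

random.seed(0)   # fix the RNG so flip_n() makes the same choices every run
anneal(text, seg1, 5000, 1.2)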
