python - How to split text without spaces into a list of words? - Stack Overflow

Before pasting his code, let me first explain why Norvig's method is more accurate (even though it is a bit slower and longer as code).

1) The data is better, both in size and in precision: he uses actual word counts rather than a simple rank ordering.
2) More importantly, it is the logic behind n-grams that really makes the approach so accurate.

The example he gives in the book is the problem of splitting the string 'sitdown'. A non-bigram splitting method would consider p('sit') * p('down'), and if that product is smaller than p('sitdown'), which happens quite often, it will not split, even though splitting is what we want (most of the time).

If you have a bigram model, however, you can compare p('sit down') as a bigram against p('sitdown'), and the former wins. Basically, without bigrams the model treats the probabilities of the split words as independent, which is not the case: some words are far more likely to follow one another. Unfortunately, those are exactly the words that often get glued together and confuse the splitter.
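
To make this concrete, here is a minimal sketch of the two scoring rules disagreeing on 'sitdown'. The probabilities below are invented purely for illustration; the real model uses counts from the Google n-gram data linked next.

# Toy probabilities, invented for illustration only.
p1 = {'sit': 1e-4, 'down': 2e-4, 'sitdown': 1e-7}   # unigram probabilities
p2 = {('sit', 'down'): 5e-5}                        # joint bigram probability

# Unigram scoring treats the split words as independent:
split_score = p1['sit'] * p1['down']   # 2e-08
whole_score = p1['sitdown']            # 1e-07
print(split_score < whole_score)       # True -> the unigram model refuses to split

# Bigram scoring uses p('sit') * p('down' | 'sit') instead:
p_down_given_sit = p2[('sit', 'down')] / p1['sit']   # 0.5
split_score2 = p1['sit'] * p_down_given_sit          # 5e-05
print(split_score2 > whole_score)      # True -> the bigram model splits correctly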

Here is the link to the data (it is the data for 3 separate problems, and segmentation is only one of them; please read the chapter for details): http://norvig.com/ngrams/
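
For reference, the count files are plain tab-separated word/count lines, which is exactly what the datafile() helper in the code below parses. A couple of illustrative lines in the format of count_1w.txt (treat the exact numbers as an assumption, not a quote):

the	23135851162
of	13151942776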

The links have been up for a while, but I'll copy-paste the segmentation part of the code here anyway (note that it is Norvig's original Python 2 code):

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
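
## Note (added comment): avoid_long_words penalizes an unseen word by a
## factor of 10 per letter, so the segmenter prefers splitting a long
## unknown string into shorter known words over keeping it whole.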

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo
def segment2(text, prev=''):
    "Return (log P(words), words), where words is the best segmentation."
    if not text: return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first,rem in splits(text)]
    return max(candidates)

def combine(Pfirst, first, (Prem, rem)):
    "Combine first and rem results into one (probability, words) pair."
    return Pfirst+Prem, [first]+rem
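
A minimal usage sketch, assuming count_1w.txt and count_2w.txt have been downloaded from http://norvig.com/ngrams/ into the working directory. Keep in mind the code above is Python 2 (print statement, file(), itervalues(), a tuple parameter in combine); to run it under Python 3 you would need to adapt those, e.g. open() instead of file(), self.values() instead of self.itervalues(), functools.reduce, and unpacking (Prem, rem) inside the body of combine.

print segment('choosespain')      # e.g. ['choose', 'spain']

score, words = segment2('sitdown')
print words                       # expected: ['sit', 'down'], per the bigram argument above
print score                       # log10 probability of that segmentation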
