python - How to split text without spaces into a list of words? - Stack Overflow

Before pasting his code, let me first explain why Norvig's method is more accurate (even though it is a bit slower and longer as code).

1) The data is better, both in size and in precision: he uses actual word counts rather than a simple rank ordering.
2) More importantly, it is the logic behind n-grams that really makes the approach so accurate.

The example he gives in the book is the problem of splitting the string 'sitdown'. A non-bigram splitting method would consider p('sit') * p('down'), and if that product is smaller than p('sitdown'), which happens quite often, it will not split, even though splitting is what we want (most of the time).

If you have a bigram model, however, you can compare p('sit down') as a bigram against p('sitdown'), and the former wins. Basically, without bigrams the model treats the probabilities of the split words as independent, which is not the case: some words are far more likely to follow one another. Unfortunately, those are exactly the words that often get glued together and confuse the splitter.
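
To make this concrete, here is a minimal sketch of the two scoring rules disagreeing on 'sitdown'. The probabilities below are invented purely for illustration; the real model uses counts from the Google n-gram data linked next.

# Toy probabilities, invented for illustration only.
p1 = {'sit': 1e-4, 'down': 2e-4, 'sitdown': 1e-7}   # unigram probabilities
p2 = {('sit', 'down'): 5e-5}                        # joint bigram probability

# Unigram scoring treats the split words as independent:
split_score = p1['sit'] * p1['down']   # 2e-08
whole_score = p1['sitdown']            # 1e-07
print(split_score < whole_score)       # True -> the unigram model refuses to split

# Bigram scoring uses p('sit') * p('down' | 'sit') instead:
p_down_given_sit = p2[('sit', 'down')] / p1['sit']   # 0.5
split_score2 = p1['sit'] * p_down_given_sit          # 5e-05
print(split_score2 > whole_score)      # True -> the bigram model splits correctly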

Here is the link to the data (it is the data for 3 separate problems, and segmentation is only one of them; please read the chapter for details): http://norvig.com/ngrams/
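
For reference, the count files are plain tab-separated word/count lines, which is exactly what the datafile() helper in the code below parses. A couple of illustrative lines in the format of count_1w.txt (treat the exact numbers as an assumption, not a quote):

the	23135851162
of	13151942776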

The links have been up for a while, but I'll copy-paste the segmentation part of the code here anyway (note that it is Norvig's original Python 2 code):

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
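
## Note (added comment): avoid_long_words penalizes an unseen word by a
## factor of 10 per letter, so the segmenter prefers splitting a long
## unknown string into shorter known words over keeping it whole.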

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo
def segment2(text, prev=''):
    "Return (log P(words), words), where words is the best segmentation."
    if not text: return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first,rem in splits(text)]
    return max(candidates)

def combine(Pfirst, first, (Prem, rem)):
    "Combine first and rem results into one (probability, words) pair."
    return Pfirst+Prem, [first]+rem
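
A minimal usage sketch, assuming count_1w.txt and count_2w.txt have been downloaded from http://norvig.com/ngrams/ into the working directory. Keep in mind the code above is Python 2 (print statement, file(), itervalues(), a tuple parameter in combine); to run it under Python 3 you would need to adapt those, e.g. open() instead of file(), self.values() instead of self.itervalues(), functools.reduce, and unpacking (Prem, rem) inside the body of combine.

print segment('choosespain')      # e.g. ['choose', 'spain']

score, words = segment2('sitdown')
print words                       # expected: ['sit', 'down'], per the bigram argument above
print score                       # log10 probability of that segmentation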
