Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

I have some text:

s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought there'd be something online that already does this; am I wrong?

Solution

Greedy approach using a trie

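Try this using Biopython (pip install biopython). Note that Bio.trie was deprecated and later removed from Biopython, so this needs an older release that still ships the module: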

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Build a trie holding every word in the dictionary file.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Drop non-ASCII characters so every key is plain ASCII.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Greedy step: try the longest prefix of s first and shrink it
    # until the prefix is a dictionary word.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here; stop instead of looping forever.
            break
        print(word)


if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
...
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

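Because this approach is greedy, there are degenerate cases in English that it will not handle: the longest matching prefix can leave a remainder that cannot be split into words, even though a shorter prefix would have worked. You need backtracking to deal with those, but this should get you started.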
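To make the backtracking idea concrete, here is a minimal sketch, not part of the original answer. It assumes the dictionary is a plain Python set rather than a Bio.trie, and the names segment and VOCAB and the toy word list are hypothetical. It still tries the longest candidate first, but falls back to shorter ones when the rest of the string cannot be segmented; memoizing on the start position makes it the dynamic program mentioned in the question.

```python
from functools import lru_cache


def segment(s, vocab):
    """Return one split of s into words from vocab, or None if none exists."""
    max_len = max(map(len, vocab)) if vocab else 0

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(s):
            return ()  # reached the end: an empty tail is a valid split
        # Try the longest candidate first, like the greedy version ...
        for end in range(min(len(s), start + max_len), start, -1):
            word = s[start:end]
            if word in vocab:
                rest = solve(end)
                if rest is not None:
                    return (word,) + rest
                # ... otherwise back up and try a shorter word.
        return None

    result = solve(0)
    return list(result) if result is not None else None


# Toy vocabulary (an assumption for illustration, not a real dictionary).
VOCAB = frozenset(["the", "theme", "meeting", "me", "met"])

print(segment("themeeting", VOCAB))  # ['the', 'meeting']
```

A purely greedy pass would take "theme" first and then get stuck on "eting". Note that the recursion depth grows with the number of words, so a very long input may need an iterative formulation instead.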

Obligatory test

>>> main("expertsexchange")
experts
exchange
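If Bio.trie is not available in your Biopython install, the same greedy longest-prefix idea can be sketched with a trie built from nested dicts. This is an illustrative stand-in under stated assumptions, not the Biopython API: build_trie, longest_prefix, greedy_split, and the toy VOCAB list are all hypothetical names.

```python
_END = object()  # sentinel key marking "a word ends at this node"


def build_trie(words):
    """Build a nested-dict trie: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def longest_prefix(root, s):
    """Return the longest word in the trie that is a prefix of s, or None."""
    node = root
    best = None
    for i, ch in enumerate(s):
        node = node.get(ch)
        if node is None:
            break  # no word in the trie continues with this character
        if _END in node:
            best = s[:i + 1]
    return best


def greedy_split(root, s):
    """Repeatedly strip the longest dictionary prefix from s."""
    words = []
    while s:
        word = longest_prefix(root, s)
        if word is None:
            words.append(s)  # keep the unmatched remainder and stop
            break
        words.append(word)
        s = s[len(word):]
    return words


# Toy vocabulary (an assumption; a real run would load a dictionary file).
VOCAB = ["image", "classification", "methods", "can", "be",
         "expert", "experts", "exchange", "sex", "change"]
trie_root = build_trie(VOCAB)

print(greedy_split(trie_root, "expertsexchange"))  # ['experts', 'exchange']
```

Walking the trie finds the longest prefix in a single pass over the string, instead of testing every prefix length separately as get_trie_word does.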
