Python longest common contiguous substring: longest common substring without cutting up words

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
    # m[x][y] holds the length of the common run ending at s1[x - 1] and s2[y - 1]
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

print(longest_common_substring(s1, s2))

[out]:

foo bar

But how do I ensure that the longest common substring respects English word boundaries and doesn't cut up a word? For example, the following sentences:

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"

print(longest_common_substring(s1, s2))

output the following, which is NOT desired since it breaks up the word kappa from s2:

a foo bar

The desired output is still:

foo bar

I've also tried an n-gram approach to get the longest common substring that respects word boundaries, but is there another way that works on the strings without calculating n-grams? (see answer)

Solution

This is quite simple. I used your code to do 75% of the job.

I first split each sentence into words, then pass the word lists to your function to get the longest common substring (in this case, the longest run of consecutive words). Your function gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.

Here is a working copy for you to test, verify, and fiddle with:

def longest_common_substring(s1, s2):
    # Works on any indexable sequence: a string (characters) or a list (words)
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    # Compare word by word instead of character by character
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')
    return ' '.join(longest_common_substring(s1_words, s2_words))

s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'

common_sentence = longest_common_sentence(s1, s2)
print(common_sentence)

[out]:

foo bar
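As an aside, the same word-level idea can be sketched with the standard library's difflib instead of the hand-written table. This is an alternative technique, not the method used in the answer above; the function name longest_common_words is made up for illustration, and it uses split() rather than split(' '), so repeated spaces are collapsed.

```python
from difflib import SequenceMatcher


def longest_common_words(s1, s2):
    """Return the longest run of consecutive words shared by s1 and s2."""
    a = s1.split()
    b = s2.split()
    # autojunk=False turns off difflib's heuristic that ignores very frequent items
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # match.a is the start index in a, match.size the number of matching words
    return ' '.join(a[match.a:match.a + match.size])


print(longest_common_words('this is a foo bar sentence .',
                           'what a kappa foo bar black sheep ?'))
```

With the two example sentences this prints foo bar, and it returns an empty string when the sentences share no word.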

Edge cases

'.' and '?' are treated as valid words here, as in your example, because there is a space between the last word and the punctuation mark. If you don't leave a space, they are counted as part of the last word, so 'sheep' and 'sheep?' are no longer the same word. It's up to you to decide what to do with such characters before calling the function. For example, to strip them:

import re

s1 = re.sub(r'[.?]', '', s1)
s2 = re.sub(r'[.?]', '', s2)

and then continue as usual.
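One possible way to handle punctuation more generally is to tokenize on word characters before comparing, so that 'sheep?' and 'sheep' become the same word. The sketch below is only an illustration under that assumption: tokenize and longest_common_word_run are hypothetical names, the sentences are made-up variants without a space before the punctuation, and \w+ also splits contractions such as "don't" into two tokens, which may not be what you want.

```python
import re


def tokenize(sentence):
    # Keep only runs of word characters, so 'sheep?' and 'sheep' both yield 'sheep'
    return re.findall(r"\w+", sentence)


def longest_common_word_run(words1, words2):
    # Same dynamic-programming table as the answer's function, applied to word lists
    m = [[0] * (1 + len(words2)) for _ in range(1 + len(words1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(words1)):
        for y in range(1, 1 + len(words2)):
            if words1[x - 1] == words2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
    return words1[x_longest - longest:x_longest]


s1 = 'this is a foo bar sentence.'
s2 = 'what a kappa foo bar sentence?'
print(' '.join(longest_common_word_run(tokenize(s1), tokenize(s2))))
```

Here the trailing '.' and '?' are dropped by the tokenizer, so the shared run is foo bar sentence.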
