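Problem
Oh no! You accidentally removed all the spaces and punctuation from a long article and turned every uppercase letter into lowercase. A sentence like "I reset the computer. It still didn't boot!" has become "iresetthecomputeritstilldidntboot". Before dealing with punctuation and capitalization, you first have to split it into words. Of course, you have a thick dictionary, dictionary, but some words are not in it. Given the article as sentence, design an algorithm that splits the article so that the number of unrecognized characters is as small as possible, and return that number.
Note: this problem is slightly modified from the original; you only need to return the number of unrecognized characters.
Example:
Input: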
dictionary = ["looked","just","like","her","brother"]
sentence = "jesslookedjustliketimherbrother"
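Output: 7
Explanation: after splitting, the sentence is "jess looked just like tim her brother", which has 7 unrecognized characters.
Constraints:
0 <= len(sentence) <= 1000
The total number of characters in dictionary does not exceed 150000.
You may assume that dictionary and sentence contain only lowercase letters.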
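Approach
I had no idea at first. The hint says recursion can be used.
Recursive version:
For every word in dictionary, find the index of its first occurrence in sentence, then recurse on the sentence with that word removed.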
The time complexity should be about $O(n_{sentence} \cdot n_{dictionary}) \approx 10^8$, and it timed out.
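As a point of comparison, the recursive idea can be kept while avoiding rebuilt strings by recursing on suffix start positions and caching the results. The following is only a sketch under that assumption, not the version that was submitted; the names respace_memo and unrecognized are made up for illustration.

```python
from functools import lru_cache
from typing import List


def respace_memo(dictionary: List[str], sentence: str) -> int:
    """Top-down variant: recurse on suffix start positions, with memoization."""
    words = set(dictionary)
    # Longest dictionary word; a longer substring can never match.
    max_len = max((len(w) for w in words), default=0)
    n = len(sentence)

    @lru_cache(maxsize=None)
    def unrecognized(start: int) -> int:
        # Fewest unrecognized characters in sentence[start:].
        if start == n:
            return 0
        # Option 1: count sentence[start] as unrecognized.
        best = 1 + unrecognized(start + 1)
        # Option 2: consume a dictionary word that begins at start.
        for end in range(start + 1, min(n, start + max_len) + 1):
            if sentence[start:end] in words:
                best = min(best, unrecognized(end))
        return best

    # Warm the cache from the end so the recursion depth stays shallow
    # even for sentences of 1000 characters.
    for start in range(n, -1, -1):
        unrecognized(start)
    return unrecognized(0)
```

Because each start position is solved once, the work is bounded by the sentence length times the longest word length, at the cost of one cache entry per position.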
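The DP version, after reading the official solution:
Let dp[i] be the minimum number of unrecognized characters in the first i characters of sentence. Then:
- If the character at position i, together with the characters before it, forms a substring that is in the dictionary, then dp[i] = dp[j], where j is the index at which that substring starts (that is, sentence[j:i] is a dictionary word). Several dictionary words may end at position i, in which case take the smallest dp[j].
- If the character at position i cannot form a dictionary word with any of the preceding characters, then dp[i] = dp[i - 1] + 1.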
The time complexity is $O(n_{sentence}^2) \approx 10^6$, counting each dictionary lookup as one step.
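The two cases above can be folded into a single recurrence. This is a restatement in the same dp notation, assuming sentence[j:i] means the characters from index j up to but not including index i, and that the inner minimum is simply skipped when no dictionary word ends at position i.

```latex
dp[0] = 0, \qquad
dp[i] = \min\Bigl( dp[i-1] + 1,\;
        \min_{\substack{0 \le j < i \\ sentence[j:i] \,\in\, dictionary}} dp[j] \Bigr),
\quad 1 \le i \le n_{sentence}
```

The answer is $dp[n_{sentence}]$. On the example, the words looked, just, like, her and brother each carry dp forward unchanged, while the 4 letters of jess and the 3 letters of tim each add 1, giving 7.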
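DP + Trie version:
Looking at the DP, most of the time goes into checking whether the substrings ending at position i are in the dictionary. A Trie can speed this up.
It is noticeably faster: the DP version took 8436 ms, and adding the Trie brought it down to 3592 ms.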
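One common refinement, sketched here as an assumption rather than as the code that produced the timings above, is to store each word reversed in the Trie. The inner loop can then walk backward from position i one character at a time and stop as soon as no dictionary word can match, without slicing any substrings. The names ReversedTrieNode and respace_reversed_trie are hypothetical.

```python
from typing import Dict, List


class ReversedTrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "ReversedTrieNode"] = {}
        self.is_word = False


def respace_reversed_trie(dictionary: List[str], sentence: str) -> int:
    # Insert every word back to front, so a walk from the root
    # matches words by their last character first.
    root = ReversedTrieNode()
    for word in dictionary:
        node = root
        for ch in reversed(word):
            node = node.children.setdefault(ch, ReversedTrieNode())
        node.is_word = True

    n = len(sentence)
    # dp[i]: fewest unrecognized characters in the first i characters.
    dp = [0] * (n + 1)
    for i in range(1, n + 1):
        # Default: the character at index i - 1 is unrecognized.
        dp[i] = dp[i - 1] + 1
        node = root
        # Walk backward; sentence[j:i] is the candidate word.
        for j in range(i - 1, -1, -1):
            node = node.children.get(sentence[j])
            if node is None:
                break  # no dictionary word ends here with this suffix
            if node.is_word:
                dp[i] = min(dp[i], dp[j])
    return dp[n]
```

With this layout the inner loop runs at most as many steps as the longest dictionary word, and each step is a single dictionary lookup on one character.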
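Code
Recursive version (not accepted):
from typing import List

class Solution:
    def respace(self, dictionary: List[str], sentence: str) -> int:
        # Map each dictionary word found in sentence to its first index.
        value_index = {}
        for word in dictionary:
            index = sentence.find(word)
            if index != -1:
                value_index[word] = index
        # No dictionary word occurs: every character is unrecognized.
        if not value_index:
            return len(sentence)
        min_missing = len(sentence)
        for word, index in value_index.items():
            # Remove the word and recurse on what is left.
            # Note: joining the two remaining pieces can create matches that
            # span the gap, so this may undercount in addition to timing out.
            rest = sentence[:index] + sentence[index + len(word):]
            min_missing = min(min_missing, self.respace(dictionary, rest))
        return min_missing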
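DP version:
from typing import List

class Solution:
    def respace(self, dictionary: List[str], sentence: str) -> int:
        # dp[i]: fewest unrecognized characters in the first i characters.
        # Start from the worst case, where every character is unrecognized.
        dp = [0] + [index + 1 for index in range(len(sentence))]
        for index in range(len(sentence)):
            for begin_index in range(index, -1, -1):
                # dictionary is a list, so this membership test is a linear scan.
                if sentence[begin_index: index + 1] in dictionary:
                    dp[index + 1] = min(dp[index + 1], dp[begin_index])
            # Otherwise the current character counts as unrecognized.
            dp[index + 1] = min(dp[index + 1], dp[index] + 1)
        return dp[-1]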
DP + Trie:
from typing import List

class TrieNode:
    def __init__(self):
        self.node_dict = {}    # child character -> TrieNode
        self.end_flag = False  # True if a dictionary word ends at this node

class TrieTree:
    def __init__(self, dictionary: List[str]):
        self.root = TrieNode()
        for word in dictionary:
            node = self.root
            for each_char in word:
                if each_char not in node.node_dict:
                    node.node_dict[each_char] = TrieNode()
                node = node.node_dict[each_char]
            node.end_flag = True

    def search(self, word: str) -> bool:
        # Return True only if word is a complete dictionary word.
        node = self.root
        for each_char in word:
            if each_char not in node.node_dict:
                return False
            node = node.node_dict[each_char]
        return node.end_flag

class Solution:
    def respace(self, dictionary: List[str], sentence: str) -> int:
        trie = TrieTree(dictionary)
        # dp[i]: fewest unrecognized characters in the first i characters.
        dp = [0] + [index + 1 for index in range(len(sentence))]
        for index in range(len(sentence)):
            for begin_index in range(index, -1, -1):
                # The Trie lookup replaces: sentence[begin_index: index + 1] in dictionary
                if trie.search(sentence[begin_index: index + 1]):
                    dp[index + 1] = min(dp[index + 1], dp[begin_index])
            dp[index + 1] = min(dp[index + 1], dp[index] + 1)
        return dp[-1]