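Problem description: Oh no! You accidentally deleted all the spaces and punctuation from a long article, and the uppercase letters were turned into lowercase. A sentence like "I reset the computer. It still didn't boot!" has become "iresetthecomputeritstilldidntboot". Before dealing with punctuation and capitalization, you first have to split it into words. Of course, you have a thick dictionary, dictionary, but some words are not in it. Given the article as sentence, design an algorithm that splits the article so that the number of unrecognized characters is minimized, and return that number.
Note: this problem is slightly modified from the original; you only need to return the number of unrecognized characters.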
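To pin down what is being asked, here is a minimal, unoptimized sketch that checks every substring against a plain Python set. The function name respace_baseline and the sample dictionary and sentence are illustrative assumptions, not part of the problem statement.

```python
from typing import List


def respace_baseline(dictionary: List[str], sentence: str) -> int:
    """Fewest unrecognized characters, checking every substring against a set."""
    words = set(dictionary)
    n = len(sentence)
    # dp[i] = fewest unrecognized characters in the prefix sentence[:i]
    dp = [0] * (n + 1)
    for i in range(1, n + 1):
        # Worst case: the i-th character is left unrecognized.
        dp[i] = dp[i - 1] + 1
        for j in range(i):
            # If sentence[j:i] is a word, it adds no unrecognized characters.
            if sentence[j:i] in words:
                dp[i] = min(dp[i], dp[j])
    return dp[n]


# "jess" (4 characters) and "tim" (3 characters) are not in the dictionary.
print(respace_baseline(["looked", "just", "like", "her", "brother"],
                       "jesslookedjustliketimherbrother"))  # 7
```

Slicing and hashing each substring makes this roughly cubic in the sentence length, which is what the two approaches below try to avoid.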
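Approach 1: string hashing + dynamic programming. Let dp[i] be the minimum number of unrecognized characters in the first i characters of sentence. dp[i] can be derived from the results for shorter prefixes: it is at most dp[i-1]+1, which treats the i-th character as unrecognized. Then, starting at the current position, extend a substring sentence[j:i] backward one character at a time and check whether it matches a dictionary word; whenever it does, update dp[i] = min(dp[i], dp[j]), where dp[j] is the number of unrecognized characters in the prefix that precedes the matched substring. I do not fully understand why the hash function is designed the way it is, though. The code is as follows: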
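from typing import List


class Solution:
    def respace(self, dictionary: List[str], sentence: str) -> int:
        BASE = 41
        # Parentheses are required: `1 << 31 - 1` parses as `1 << 30`.
        P = (1 << 31) - 1

        def get_hashcode(s):
            # Polynomial hash that consumes the characters from last to first.
            val = 0
            size = len(s)
            for i in range(size-1, -1, -1):
                val = val * BASE + ord(s[i]) - 97 + 1
                val %= P
            return val

        # Hash codes of all dictionary words.
        code = set()
        for d in dictionary:
            code.add(get_hashcode(d))
        size = len(sentence)
        # dp[i] = fewest unrecognized characters in sentence[:i]
        dp = [size] * (size+1)
        dp[0] = 0
        for i in range(1, size+1):
            dp[i] = dp[i-1] + 1
            c = 0
            for j in range(i-1, -1, -1):
                # c is now the hash of sentence[j:i], extended by one character.
                c = c * BASE + ord(sentence[j]) - 97 + 1
                c %= P
                if c in code:
                    dp[i] = min(dp[i], dp[j])
        return dp[-1]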
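On the question of why the hash is designed this way, one plausible reading is the following. Each letter is mapped to 1..26 rather than 0..25, so that "a" and "aa" do not both hash to 0. The base 41 is larger than 26, so before the modulus is applied each string corresponds to a distinct base-41 number. Because characters are consumed from last to first, the hash of sentence[j:i] can be extended to the hash of sentence[j-1:i] in O(1) while the inner loop walks backward. The sketch below checks that property. It assumes the modulus 2**31 - 1, and the helper names char_code and suffix_hashes are hypothetical. Two different strings can still collide after the modulus, so a match on hash codes alone is not a strict proof that the substring is in the dictionary.

```python
BASE = 41
P = (1 << 31) - 1  # assumed modulus: 2**31 - 1


def char_code(ch: str) -> int:
    # Map 'a'..'z' to 1..26 so that no letter contributes a zero digit.
    return ord(ch) - ord("a") + 1


def get_hashcode(s: str) -> int:
    # Hash a whole string, consuming its characters from last to first.
    val = 0
    for ch in reversed(s):
        val = (val * BASE + char_code(ch)) % P
    return val


def suffix_hashes(sentence: str, i: int) -> list:
    """Hashes of sentence[j:i] for j = i-1, ..., 0, each built from the previous one."""
    out = []
    c = 0
    for j in range(i - 1, -1, -1):
        c = (c * BASE + char_code(sentence[j])) % P
        out.append(c)
    return out


sentence = "thecat"
i = len(sentence)
incremental = suffix_hashes(sentence, i)
direct = [get_hashcode(sentence[j:i]) for j in range(i - 1, -1, -1)]
print(incremental == direct)  # True
```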
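Approach 2: trie + dynamic programming. The idea is the same as in Approach 1, except that matching is done with a trie, which lets the matching stop early: if the suffix that ends at the current position does not follow any path in the trie, there is no point in extending it further backward. To make this work, each dictionary word is stored in the trie in reverse order. The code is as follows: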
from typing import List


class Solution:
    def respace(self, dictionary: List[str], sentence: str) -> int:
        class Trie():
            def __init__(self):
                self.root = {}
                self.end_word = -1  # sentinel key marking the end of a word

            def insert(self, word):
                curnode = self.root
                for c in word:
                    if c not in curnode:
                        curnode[c] = {}
                    curnode = curnode[c]
                curnode[self.end_word] = True

        # Store every word reversed so that suffixes can be matched backward.
        root = Trie()
        for d in dictionary:
            root.insert(d[::-1])
        size = len(sentence)
        # dp[i] = fewest unrecognized characters in sentence[:i]
        dp = [size] * (size+1)
        dp[0] = 0
        for i in range(1, size+1):
            dp[i] = dp[i-1] + 1
            curnode = root.root
            for j in range(i-1, -1, -1):
                if sentence[j] not in curnode:
                    break  # no dictionary word ends with sentence[j:i]
                curnode = curnode[sentence[j]]
                if -1 in curnode:
                    dp[i] = min(dp[i], dp[j])
                    if dp[i] == 0:
                        break  # cannot do better than zero
        return dp[-1]
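The early termination can be looked at in isolation. The sketch below builds a reversed trie with nested dicts and lists every start index j for which sentence[j:i] is a dictionary word, stopping as soon as the backward walk leaves the trie. The helper names build_reversed_trie and word_starts and the sample words are hypothetical; the -1 sentinel follows the convention used in the solution above.

```python
from typing import List

END = -1  # sentinel key marking the end of a (reversed) word


def build_reversed_trie(dictionary: List[str]) -> dict:
    root: dict = {}
    for word in dictionary:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def word_starts(root: dict, sentence: str, i: int) -> List[int]:
    """Return every j such that sentence[j:i] is a dictionary word."""
    starts = []
    node = root
    for j in range(i - 1, -1, -1):
        if sentence[j] not in node:
            break  # no dictionary word ends with sentence[j:i]; stop early
        node = node[sentence[j]]
        if END in node:
            starts.append(j)
    return starts


trie = build_reversed_trie(["her", "brother", "like"])
print(word_starts(trie, "herbrother", 10))  # [7, 3]
print(word_starts(trie, "herbrothex", 10))  # []
```

In the second call the walk stops after a single character because no stored word ends in "x", whereas the hashing approach would still scan all the way back to the start of the sentence.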