[Garen's Problem-Solving Notes] LeetCode 17.13 Re-Space

Latest recommended article published on 2022-05-03 23:07:39

Garen_Hou

Category: LeetCode Problem-Solving Notes

Article link: https://blog.csdn.net/Garen2994/article/details/107235503

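This post discusses LeetCode problem 17.13: after the spaces, punctuation, and capitalization of a document have been removed, split the text back into words so that the number of unrecognized characters is minimized. Two solutions are presented: the first uses dynamic programming, and the second optimizes it with a prefix tree (Trie). The post gives the reasoning, the code, and a complexity analysis.

(Summary generated automatically by CSDN.)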

LeetCode 17.13 Re-Space

Problem

Oh, no! You have accidentally removed all spaces, punctuation, and capitalization in a lengthy document. A sentence like "I reset the computer. It still didn't boot!" became "iresetthecomputeritstilldidntboot". You'll deal with the punctuation and capitalization later; right now you need to re-insert the spaces. Most of the words are in a dictionary, but a few are not. Given a dictionary (a list of strings) and the document (a string), design an algorithm to unconcatenate the document in a way that minimizes the number of unrecognized characters. Return the number of unrecognized characters.
Note: This problem is slightly different from the original one in the book; only the number of unrecognized characters needs to be returned.

Input:
dictionary = ["looked","just","like","her","brother"]
sentence = "jesslookedjustliketimherbrother"
Output: 7
Explanation: After unconcatenating, we get "jess looked just like tim her brother", which contains 7 unrecognized characters.

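Solution 1

Approach

The problem asks for the minimum number of unmatched characters, which immediately suggests dynamic programming. Start with a brute-force version.
Create an array dp[] to record the results and scan sentence from front to back. dp[i] is the minimum number of unrecognized characters among the first i characters of the sentence, and dp[0]=0 because an empty prefix has no unrecognized characters.
Next comes the state transition. The first i characters, i.e. the substring [0,i), may consist of a shorter prefix [0,j) followed by a dictionary word occupying [j,i), in which case dp[i]=dp[j] with j<i. Alternatively, no dictionary word ends at position i, and the result is that of the first i-1 characters plus one unmatched i-th character, i.e. dp[i]=dp[i-1]+1. Note that even when a matching word exists, there is no guarantee which choice leaves the fewest unrecognized characters, so the minimum must be taken on every iteration. When the substring [j,i) is found in the dictionary, the transition is:
$dp[i] = \min(dp[i], dp[j])$
When no word is found:
$dp[i] = dp[i-1] + 1$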
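
To make the recurrence concrete, here is a small hand trace on a made-up input (not from the original problem): dictionary = ["ab"], sentence = "xab". It assumes the indexing used in the text, where a matched word occupies the half-open range [j,i) and contributes dp[j].

```latex
\begin{aligned}
dp[0] &= 0 \\
dp[1] &= dp[0] + 1 = 1 && \text{no dictionary word ends at position 1} \\
dp[2] &= dp[1] + 1 = 2 && \text{no dictionary word ends at position 2} \\
dp[3] &= \min(dp[2] + 1,\ dp[1]) = \min(3,\ 1) = 1 && \text{``ab'' occupies } [1,3)
\end{aligned}
```

The answer is dp[3] = 1: only the leading "x" remains unrecognized.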

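Code

import java.util.HashSet;
import java.util.Set;

class Solution {
    public int respace(String[] dictionary, String sentence) {
        // Store the dictionary in a hash set for fast membership checks.
        Set<String> dic = new HashSet<>();
        for (String s : dictionary) {
            dic.add(s);
        }
        int n = sentence.length();
        // dp[i] = minimum number of unrecognized characters in sentence[0, i).
        int[] dp = new int[n + 1];
        dp[0] = 0;
        for (int i = 1; i <= n; i++) {
            dp[i] = dp[i - 1] + 1; // start by treating the i-th character as unrecognized
            for (int j = 0; j < i; j++) {
                if(dic.contains(sentence.substring(j,i))){
                    dp[i] = Math.min(dp[j],dp[i]); // keep the smaller of "word ends here" and the current best
                }
            }
        }
        return dp[n];
    }
}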

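Solution 2

Approach

The brute-force solution compares many substrings that cannot possibly be in the dictionary, which suggests a prefix tree: a Trie avoids most of those pointless comparisons. Here the dictionary words are inserted into the Trie in reverse order, and the characters of sentence are likewise traversed in reverse during lookup. See the figure (an animation borrowed from the LeetCode solution to help my own understanding; animations really do make this easy to follow, and I want to learn to make them too!)
[Figure: animation from the LeetCode solution]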
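
The reversed lookup can be sketched in isolation as follows. This is a minimal illustration, not the post's solution: the class `ReverseTrieDemo` and the helper `wordLengthsEndingAt` are hypothetical names introduced here. It assumes lowercase input only, and shows that a single backward walk from position i finds every dictionary word that ends at i.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class (not from the original post): a trie that stores
// dictionary words back to front, so words ending at a given position can
// be found by walking the sentence backward.
class ReverseTrieDemo {
    static class Node {
        Node[] next = new Node[26]; // assumes lowercase letters 'a'..'z' only
        boolean isEnd;
    }

    // Insert a word starting from its last character.
    static void insert(Node root, String word) {
        Node cur = root;
        for (int k = word.length() - 1; k >= 0; k--) {
            int t = word.charAt(k) - 'a';
            if (cur.next[t] == null) {
                cur.next[t] = new Node();
            }
            cur = cur.next[t];
        }
        cur.isEnd = true;
    }

    // Lengths of all dictionary words equal to sentence[i - len, i),
    // collected in one backward walk that stops at the first mismatch.
    static List<Integer> wordLengthsEndingAt(Node root, String sentence, int i) {
        List<Integer> lengths = new ArrayList<>();
        Node cur = root;
        for (int j = i; j >= 1; j--) {
            cur = cur.next[sentence.charAt(j - 1) - 'a'];
            if (cur == null) {
                break; // no dictionary word has this suffix
            }
            if (cur.isEnd) {
                lengths.add(i - j + 1);
            }
        }
        return lengths;
    }

    public static void main(String[] args) {
        Node root = new Node();
        insert(root, "her");
        insert(root, "brother");
        // Both "her" and "brother" end at position 13.
        System.out.println(wordLengthsEndingAt(root, "timherbrother", 13)); // [3, 7]
        // Only "her" ends at position 6; the walk stops at 'm'.
        System.out.println(wordLengthsEndingAt(root, "timherbrother", 6));  // [3]
    }
}
```

Because "brother" ends in "her", the two reversed words share the path r, e, h, which is why one walk reports both matches.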

Code

class Solution {

    public class TrieNode{
        public TrieNode[] next;
        public boolean isEnd;
        
        public TrieNode(){
            next = new TrieNode[26];
            isEnd = false;
        }
        // Insert the word back to front, so a path from the root spells a reversed word.
        public void insert(String s){
            TrieNode curPos = this;
            for (int i = s.length() - 1; i >= 0; --i) {
                int t = s.charAt(i) - 'a';
                if(curPos.next[t] == null){
                    curPos.next[t] = new TrieNode();
                }
                curPos = curPos.next[t];
            }
            curPos.isEnd = true;
        }
    }
    
    public int respace(String[] dictionary, String sentence) {
        int n = sentence.length();
        TrieNode root = new TrieNode();
        for (String word : dictionary) {
            root.insert(word);
        }
        int[] dp = new int[n + 1];
        dp[0] = 0;
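        for (int i = 1; i <= n; i++) {
            // Default: the i-th character is unrecognized.
            dp[i] = dp[i - 1] + 1;
            TrieNode curPos = root;
            // Walk backward from position i, following reversed words in the trie.
            for (int j = i; j >= 1; --j) {
                int t = sentence.charAt(j - 1) - 'a';
                if(curPos.next[t] == null){
                    // No dictionary word ends with sentence[j-1, i): stop early.
                    break;
                } else if(curPos.next[t].isEnd){
                    // sentence[j-1, i) is a dictionary word.
                    dp[i] = Math.min(dp[i],dp[j - 1]);
                }
                if(dp[i] == 0){
                    // Zero cannot be improved, so stop searching.
                    break;
                }
                curPos = curPos.next[t];
            }
        }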
        return dp[n];
    }
}

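Complexity Analysis

Time complexity: $O(|dictionary| + n^2)$, where $|dictionary|$ is the total number of characters in the dictionary and $n = sentence.length$. Building the trie takes time proportional to the total number of characters in the words, i.e. $O(|dictionary|)$. The dp array has n+1 states, and each transition takes $O(n)$ time in the worst case, so the DP costs $O(n^2)$.
Space complexity: $O(|dictionary| \cdot S + n)$, where S is the alphabet size; only lowercase letters appear here, so S=26. To bound the space asymptotically: if the trie has $|node|$ nodes and the alphabet size is S, its space cost is $O(|node| \cdot S)$. The number of non-root nodes cannot exceed the total number of characters in the dictionary, so $O(|node| \cdot S) = O(|dictionary| \cdot S)$. The dp array takes $O(n)$ space.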
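
The bound on the node count can be checked with a small experiment. The sketch below is an illustration only: `TrieNodeCount`, `countNodes`, and `totalChars` are hypothetical names introduced here, and it assumes lowercase words inserted in reverse as in Solution 2. It counts the non-root nodes created and compares them with the total character count.

```java
// Hypothetical helper (not from the original post): counts the trie nodes
// created when dictionary words are inserted back to front.
class TrieNodeCount {
    static class Node {
        Node[] next = new Node[26]; // assumes lowercase letters 'a'..'z' only
    }

    // Number of non-root nodes created by inserting every word in reverse.
    static int countNodes(String[] dictionary) {
        Node root = new Node();
        int created = 0;
        for (String word : dictionary) {
            Node cur = root;
            for (int k = word.length() - 1; k >= 0; k--) {
                int t = word.charAt(k) - 'a';
                if (cur.next[t] == null) {
                    cur.next[t] = new Node();
                    created++; // at most one new node per character
                }
                cur = cur.next[t];
            }
        }
        return created;
    }

    // Total number of characters in the dictionary, i.e. |dictionary|.
    static int totalChars(String[] dictionary) {
        int total = 0;
        for (String word : dictionary) {
            total += word.length();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] dictionary = {"looked", "just", "like", "her", "brother"};
        // "her" and "brother" share the reversed path r, e, h,
        // so three characters reuse existing nodes.
        System.out.println(countNodes(dictionary)); // 21
        System.out.println(totalChars(dictionary)); // 24
    }
}
```

Each character creates at most one node, so shared suffixes can only reduce the count below $|dictionary|$.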