面试题 17.13-恢复空格(Re-Space LCCI)

原文链接: https://leetcode-cn.com/problems/re-space-lcci

英文原文

Oh, no! You have accidentally removed all spaces, punctuation, and capitalization in a lengthy document. A sentence like "I reset the computer. It still didn't boot!" became "iresetthecomputeritstilldidntboot''. You'll deal with the punctuation and capitalization later; right now you need to re-insert the spaces. Most of the words are in a dictionary but a few are not. Given a dictionary (a list of strings) and the document (a string), design an algorithm to unconcatenate the document in a way that minimizes the number of unrecognized characters. Return the number of unrecognized characters.

Note: This problem is slightly different from the original one in the book.

Example:

Input: 
dictionary = ["looked","just","like","her","brother"]
sentence = "jesslookedjustliketimherbrother"
Output:  7
Explanation:  After unconcatenating, we got "jess looked just like tim her brother", which containing 7 unrecognized characters.

Note:

0 <= len(sentence) <= 1000
The total number of characters in dictionary is less than or equal to 150000.
There are only lowercase letters in dictionary and sentence.

中文题目

哦，不！你不小心把一个长篇文章中的空格、标点都删掉了，并且大写也弄成了小写。像句子"I reset the computer. It still didn’t boot!"已经变成了"iresetthecomputeritstilldidntboot"。在处理标点符号和大小写之前，你得先把它断成词语。当然了，你有一本厚厚的词典dictionary，不过，有些词没在词典里。假设文章用sentence表示，设计一个算法，把文章断开，要求未识别的字符最少，返回未识别的字符数。

注意：本题相对原题稍作改动，只需返回未识别的字符数

示例：

输入：
dictionary = ["looked","just","like","her","brother"]
sentence = "jesslookedjustliketimherbrother"
输出： 7
解释： 断句后为"jess looked just like tim her brother"，共7个未识别字符。

提示：

0 <= len(sentence) <= 1000
dictionary中总字符数不超过 150000。
你可以认为dictionary和sentence中只包含小写字母。

通过代码

高赞题解

🙋 今日打卡～

哦，不！工作日打卡题不全都是 EASY 嘛？！🤷‍♀️

题目分析

题意：给定字符串，尽可能多地匹配字典内的单词，即最少未匹配数。

分：拿到题目，先想贪心能不能做，显然不行。比如给定字符串：abcdef，字典：["abcd", "ab", "def"]，贪心匹配是 abcd ef，有2个字符未匹配。显然有更优匹配策略： ab c def ，只有 1 个字符未匹配。所以这题需要用动态规划来解决（怎么思考一个题目可不可以用 dp 来解决，先假设 dp 表达含义，再想转移方程，很多时候自然而然就出来了）。

1、状态定义：

dp[i] 表示字符串的前 i 个字符的最少未匹配数。

2、状态转移：

假设当前我们已经考虑完了前 i 个字符了，对于前 i + 1 个字符对应的最少未匹配数：

第 i + 1 个字符未匹配，则 dp[i + 1] = dp[i] + 1，即不匹配数加 1;
遍历前 i 个字符，若以其中某一个下标 idx 为开头、以第 i + 1 个字符为结尾的字符串正好在词典里，则 dp[i] = min(dp[i], dp[idx]) 更新 dp[i]。

于是，有了解法一。

3、解法一代码：

时间复杂度是 $O(n^2)$，$n$ 为待匹配字符串的长度。

class Solution {
    public int respace(String[] dictionary, String sentence) {
        Set<String> dict = new HashSet<>(Arrays.asList(dictionary));
        int n = sentence.length();
        int[] dp = new int[n + 1];
        for (int i = 1; i <= n; i++) {
            dp[i] = dp[i - 1] + 1;
            for (int idx = 0; idx < i; idx++) {
                if (dict.contains(sentence.substring(idx, i))) {
                    dp[i] = Math.min(dp[i], dp[idx]);
                }
            }
        }
        return dp[n];
    }
}

4、使用 Trie 进行优化：

对于上述解法，计算 dp[i + 1]时，我们需要用 idx 来遍历前 i 个字符，逐个判断以 idx 为开头，以第 i + 1 个字符为结尾的字符串是否在字典里。
这一步可以利用字典树来加速，通过字典树我们可以查询以第 i + 1 个字符为结尾的单词有哪些（构建字典树时将单词逆序插入即可）。
关于字典树的学习，可以看下这篇题解，这就是解法二。
时间复杂度是 $O(m + n^2)$，$m$ 是字典长度，$n$ 为待匹配字符串的长度。为什么还是 $n^2$ 呢？因为有可能状态转移的时候，每个位置都需要转移，这是最坏情况，绝大多数情况下远小于 $n$，所以解法二最终耗时会远小于解法一。

5、解法二代码：

class Solution {
    public int respace(String[] dictionary, String sentence) {
        // 构建字典树
        Trie trie = new Trie();
        for (String word: dictionary) {
            trie.insert(word);
        }
        // 状态转移，dp[i] 表示字符串的前 i 个字符的最少未匹配数
        int n = sentence.length();
        int[] dp = new int[n + 1];
        for (int i = 1; i <= n; i++) {
            dp[i] = dp[i - 1] + 1;
            for (int idx: trie.search(sentence, i - 1)) {
                dp[i] = Math.min(dp[i], dp[idx]);
            }
        }
        return dp[n];
    }
}

class Trie {
    TrieNode root;

    public Trie() {
        root = new TrieNode();
    }

    // 将单词倒序插入字典树
    public void insert(String word) {
        TrieNode cur = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            int c = word.charAt(i) - 'a';
            if (cur.children[c] == null) {
                cur.children[c] = new TrieNode();
            }
            cur = cur.children[c];
        }
        cur.isWord = true;
    }

    // 找到 sentence 中以 endPos 为结尾的单词，返回这些单词的开头下标。
    public List<Integer> search(String sentence, int endPos) {
        List<Integer> indices = new ArrayList<>(); 
        TrieNode cur = root;
        for (int i = endPos; i >= 0; i--) {
            int c = sentence.charAt(i) - 'a';
            if (cur.children[c] == null) {
                break;
            }
            cur = cur.children[c];
            if (cur.isWord) {
                indices.add(i);
            }  
        }
        return indices;
    }
}

class TrieNode {
    boolean isWord;
    TrieNode[] children = new TrieNode[26];

    public TrieNode() {}
}

统计信息

通过次数	提交次数	AC比率
25006	45285	55.2%

提交历史

提交时间	提交结果	执行时间	内存消耗	语言

凤起丶宛南

从来没有真正的绝境，只有心灵的迷途