LeetCode 819. Most Common Word

wenyq7

Article link: https://blog.csdn.net/qq_37333947/article/details/102838467

Problem:

Given a paragraph and a list of banned words, return the most frequent word that is not in the list of banned words. It is guaranteed there is at least one word that isn't banned, and that the answer is unique.

Words in the list of banned words are given in lowercase, and free of punctuation. Words in the paragraph are not case sensitive. The answer is in lowercase.

Example:

Input: 
paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
banned = ["hit"]
Output: "ball"
Explanation: 
"hit" occurs 3 times, but it is a banned word.
"ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph. 
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as "ball,"), 
and that "hit" isn't the answer even though it occurs more because it is banned.

Note:

1 <= paragraph.length <= 1000.
0 <= banned.length <= 100.
1 <= banned[i].length <= 10.
The answer is unique, and written in lowercase (even if its occurrences in paragraph may have uppercase symbols, and even if it is a proper noun.)
paragraph only consists of letters, spaces, or the punctuation symbols !?',;.
There are no hyphens or hyphenated words.
Words only consist of letters, never apostrophes or other punctuation symbols.

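The problem gives a string and a list of banned words, and asks for the word that occurs most often in the string and is not on the banned list (the words are separated by all sorts of non-letter delimiters). It is actually a very simple problem, but it requires fluency with C++ string splitting and with HashMap and HashSet usage.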

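Because the input may contain non-letter delimiters, first scan the string once, converting every letter to lowercase and replacing every non-letter character with a space. Then split the string into individual words and count them in a hashmap. The banned list is stored in a hashset in case it contains duplicate words; ideally the banned words should be normalized to one case as well. Finally, iterate over the hashmap once to find the word with the largest count that is not on the banned list.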
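
The steps just described can be sketched in Java roughly as follows. This is a minimal sketch rather than the code submitted for this post: the class name `MostCommonWordSketch` and the local variable names are made up for illustration, and it deliberately keeps the separate final pass over the map so that it matches the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class name, used only for this illustration.
class MostCommonWordSketch {
    static String mostCommonWord(String paragraph, String[] banned) {
        // Step 1: lowercase every letter and turn every non-letter into a space.
        StringBuilder cleaned = new StringBuilder();
        for (char c : paragraph.toCharArray()) {
            cleaned.append(Character.isLetter(c) ? Character.toLowerCase(c) : ' ');
        }

        // Step 2: store the banned words in a set, normalized to lowercase.
        Set<String> bannedSet = new HashSet<>();
        for (String b : banned) {
            bannedSet.add(b.toLowerCase());
        }

        // Step 3: split on runs of whitespace and count every word.
        Map<String, Integer> counts = new HashMap<>();
        for (String word : cleaned.toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Step 4: scan the map for the most frequent word that is not banned.
        String best = "";
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount && !bannedSet.contains(entry.getKey())) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String paragraph = "Bob hit a ball, the hit BALL flew far after it was hit.";
        System.out.println(mostCommonWord(paragraph, new String[] {"hit"})); // prints "ball"
    }
}
```

The `trim()` and `isEmpty()` guards are an extra precaution for paragraphs that begin with a delimiter; they are not part of the description above.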

Update, 2020-09-08

The C++ versions further down are not worth reading; they are a waste of time.

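Writing it in Java this time, the main problems were still unfamiliarity with the syntax and a few small details I did not think of. At first I went off track and planned to check each string for non-letter characters while splitting. In fact the whole string can be cleaned with a regex before the split, which is where I learned this regex. The next step, the split itself, cannot simply use " " as the separator because there may be consecutive spaces, so the split has to use a regex too. After that I almost did something pointless and sorted the map by value... hopeless. One more thing can be optimized in the approach described above: the maximum can be updated during the word count itself, so no separate pass over the map is needed at the end. The solutions also show a true one-pass method, but it did not seem necessary, so I gave up on it.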
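
To make the consecutive-spaces problem concrete, here is a small sketch comparing `split(" ")` with `split("\\s+")` after the regex cleanup. The class name `SplitDemo` and the helper `normalize` are hypothetical names for this illustration, and the regex here keeps letters and spaces only, a slight simplification of the one used in the solution.

```java
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
class SplitDemo {
    // Replace everything that is not a letter or a space with a space, then lowercase.
    static String normalize(String s) {
        return s.replaceAll("[^a-zA-Z ]", " ").toLowerCase();
    }

    public static void main(String[] args) {
        // "ball," becomes "ball " so the cleaned string has two spaces in a row.
        String normalized = normalize("Bob hit a ball, the hit BALL");

        // Splitting on a single space leaves an empty token between the two spaces.
        System.out.println(Arrays.toString(normalized.split(" ")));
        // prints [bob, hit, a, ball, , the, hit, ball]

        // Splitting on runs of whitespace does not.
        System.out.println(Arrays.toString(normalized.split("\\s+")));
        // prints [bob, hit, a, ball, the, hit, ball]
    }
}
```

One caveat: `split("\\s+")` still returns a leading empty string when the input starts with whitespace, so a paragraph that begins with punctuation needs a `trim()` or an empty-string check.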

Runtime: 17 ms, faster than 50.19% of Java online submissions for Most Common Word.

Memory Usage: 39.8 MB, less than 57.48% of Java online submissions for Most Common Word.

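import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Solution {
    public String mostCommonWord(String paragraph, String[] banned) {
        // Replace every character that is not a letter, digit, or space with a space, then lowercase.
        String normalized = paragraph.replaceAll("[^a-zA-Z0-9 ]", " ").toLowerCase();
        // Split on runs of whitespace so consecutive spaces do not produce empty words.
        String[] words = normalized.split("\\s+");

        Set<String> bannedSet = new HashSet<>();
        for (String word : banned) {
            bannedSet.add(word);
        }

        Map<String, Integer> wordToCount = new HashMap<>();
        int maxCount = 0;
        String maxWord = "";
        for (String word : words) {
            // A paragraph that starts with a delimiter yields a leading "" token; skip it.
            if (!word.isEmpty() && !bannedSet.contains(word)) {
                int count = wordToCount.getOrDefault(word, 0);
                count++;
                wordToCount.put(word, count);
                // Track the maximum while counting, so no second pass over the map is needed.
                if (count > maxCount) {
                    maxCount = count;
                    maxWord = word;
                }
            }
        }

        return maxWord;
    }
}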
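
The update notes mention a true one-pass method without showing it. The sketch below is one possible reading of that idea, not necessarily what the LeetCode solutions do: it walks the characters once, builds each word in a `StringBuilder`, and updates the count and the running maximum whenever a word ends. The class name `OnePassSketch` and all variable names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class name, used only for this illustration.
class OnePassSketch {
    static String mostCommonWord(String paragraph, String[] banned) {
        Set<String> bannedSet = new HashSet<>(Arrays.asList(banned));
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder current = new StringBuilder();
        String best = "";
        int bestCount = 0;

        // Go one index past the end so the final word is flushed as well.
        for (int i = 0; i <= paragraph.length(); i++) {
            if (i < paragraph.length() && Character.isLetter(paragraph.charAt(i))) {
                // Still inside a word: append the lowercased letter.
                current.append(Character.toLowerCase(paragraph.charAt(i)));
                continue;
            }
            // A delimiter or the end of input: finish the current word, if any.
            if (current.length() > 0) {
                String word = current.toString();
                current.setLength(0);
                if (!bannedSet.contains(word)) {
                    // merge returns the updated count for this word.
                    int count = counts.merge(word, 1, Integer::sum);
                    if (count > bestCount) {
                        bestCount = count;
                        best = word;
                    }
                }
            }
        }
        return best;
    }
}
```

Compared with the regex version, this avoids building the cleaned string and the array of words, at the cost of slightly more bookkeeping.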

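This problem was mainly practice for the following pieces of syntax:

1. Building a set from a vector:

unordered_set<int> set(vec.begin(), vec.end());

2. In a hashmap, a key that is not yet present gets the value 0 the first time it is accessed with operator[], so map[word]++ can be used directly: for a word that was not in the map it inserts the word with count = 1, and for a word that was already in the map it simply does count++.

3. Splitting a string on a custom delimiter (this does not coalesce consecutive delimiters; it splits strictly at each one), using stringstream and while (getline(ss, item, delimiter)):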

    string s = "This is to test splitting a string";
    vector<string> s_vec;
    stringstream ss(s);
    string item;
    while (getline(ss, item, ' '))
    {
        s_vec.push_back(item);
    }

4. Splitting directly on whitespace, where consecutive spaces are effectively merged into one and discarded, using istringstream and while (iss >> item):

    istringstream iss(paragraph);
    vector<string> s_vec;
    string item;
    while (iss >> item) {
        s_vec.push_back(item);
    }
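
For readers following the Java solution rather than the C++ one, items 1 and 2 above have rough Java counterparts, sketched below. The class and method names (`IdiomCounterparts`, `toSet`, `countWords`) are hypothetical; only `HashSet`, `Arrays.asList`, and `Map.merge` are standard library calls.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class name, used only for this illustration.
class IdiomCounterparts {
    // Counterpart of item 1: build a set directly from an array, without an explicit loop.
    static Set<String> toSet(String[] items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Counterpart of item 2: Java maps have no default value, but merge
    // inserts 1 for a missing key and otherwise adds 1 to the existing count.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

Items 3 and 4 correspond to the difference between `split(" ")` and `split("\\s+")` in Java.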

The original code is below. Runtime 12 ms (faster than 10+% of submissions), memory 9.2 MB (less than 60+%):

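#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        // Lowercase every letter and replace every non-letter with a space.
        for (size_t i = 0; i < paragraph.size(); i++) {
            unsigned char c = static_cast<unsigned char>(paragraph[i]);
            if (isalpha(c)) {
                paragraph[i] = static_cast<char>(tolower(c));
            }
            else {
                paragraph[i] = ' ';
            }
        }
        string item;
        unordered_map<string, int> words;
        // Alternative kept for reference: getline on ' ' also yields empty tokens.
        //stringstream ss(paragraph);
        istringstream iss(paragraph);
        // operator>> skips runs of whitespace, so no empty words are produced.
        while (iss >> item) {
        //while (getline(ss, item, ' ')) {
            if (words.count(item) == 0) {
                words[item] = 1;
            }
            else {
                words[item]++;
            }
        }
        // Copy the banned words into a set one by one.
        unordered_set<string> banned_set;
        for (size_t i = 0; i < banned.size(); i++) {
            banned_set.insert(banned[i]);
        }
        // Find the most frequent word that is not banned.
        string max_word = "";
        int max_count = 0;
        unordered_map<string, int>::iterator it;
        for (it = words.begin(); it != words.end(); it++) {
            if (it->second > max_count && banned_set.count(it->first) == 0) {
                max_count = it->second;
                max_word = it->first;
            }
        }
        return max_word;
    }
};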

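The improved version follows the same idea but streamlines a few operations (such as how the hashmap and hashset are used), and it suddenly ran much faster: 4 ms (faster than 96.79%), 9 MB (less than 92.31%):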

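#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        // Lowercase every letter and replace every non-letter with a space.
        for (size_t i = 0; i < paragraph.size(); i++) {
            unsigned char c = static_cast<unsigned char>(paragraph[i]);
            if (isalpha(c)) {
                paragraph[i] = static_cast<char>(tolower(c));
            }
            else {
                paragraph[i] = ' ';
            }
        }
        string item;
        unordered_map<string, int> words;
        istringstream iss(paragraph);
        // operator[] inserts a missing key with value 0, so ++ handles both cases.
        while (iss >> item) {
            words[item]++;
        }
        // Build the banned set directly from the vector's iterator range.
        unordered_set<string> banned_set(banned.begin(), banned.end());
        string max_word = "";
        int max_count = 0;
        // Iterate by const reference to avoid copying each key/value pair.
        for (const auto& word : words) {
            if (word.second > max_count && banned_set.count(word.first) == 0) {
                max_count = word.second;
                max_word = word.first;
            }
        }
        return max_word;
    }
};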