How to Find the Most Common Word in a String with a Banned List?

时间：2020-09-20 13:49:13
分类：网络文摘
阅读：86 次

Given a paragraph and a list of banned words, return the most frequent word that is not in the list of banned words. It is guaranteed there is at least one word that isn’t banned, and that the answer is unique.

Words in the list of banned words are given in lowercase, and free of punctuation. Words in the paragraph are not case sensitive. The answer is in lowercase.

Example:

Input:
paragraph = “Bob hit a ball, the hit BALL flew far after it was hit.”
banned = [“hit”]
Output: “ball”

Explanation:
“hit” occurs 3 times, but it is a banned word.
“ball” occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph.
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as “ball,”),
and that “hit” isn’t the answer even though it occurs more because it is banned.

Note:
1 <= paragraph.length <= 1000.
0 <= banned.length <= 100.
1 <= banned[i].length <= 10.
The answer is unique, and written in lowercase (even if its occurrences in paragraph may have uppercase symbols, and even if it is a proper noun.)
paragraph only consists of letters, spaces, or the punctuation symbols !?’,;.
There are no hyphens or hyphenated words.
Words only consist of letters, never apostrophes or other punctuation symbols.

We may use a Hash map which is unordered_map in C++ to store the frequencies for the valid words. This is O(N) space complexity where N is the number of words in the paragraph.

We may actively construct the current word when the current letter falls into a valid word-character i.e. upper or lower case. If it is a word boundary, we have the current word, and need to check if appears in the banned list, we can do list using std::find, but this will take O(N) to complete.

After the O(N^2), we need to go through the frequency hash table to find the word that appears the most-frequently.

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        unordered_map<string, int> freq;
        for (int i = 0; i < paragraph.length(); ++ i) {
            string word = "";
            while (i < paragraph.length() && 
                        ((paragraph[i] >= 'A' && paragraph[i] <= 'Z') || 
                         (paragraph[i] >= 'a' && paragraph[i] >= 'z'))) {
                word.push_back(std::tolower(paragraph[i]));
                i ++;
            }
            if (word.length() > 0) {
                if (std::find(banned.begin(), banned.end(), word) == banned.end()) {
                    if (freq.find(word) != freq.end()) { // we don't need this actually
                        freq[word] += 1;  /  
                    } else {
                        freq[word] = 1;
                    }                    
                }            
            }
        }
        string target;
        int cnt = 0;
        for (const auto &n: freq) { // iterate the items in the hash map to find the maximum value
            if (n.second > cnt) {
                cnt = n.second;
                target = n.first;
            }
        }
        return target;
    }
};

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        unordered_map<string, int> freq;
        for (int i = 0; i < paragraph.length(); ++ i) {
            string word = "";
            while (i < paragraph.length() && 
                        ((paragraph[i] >= 'A' && paragraph[i] <= 'Z') || 
                         (paragraph[i] >= 'a' && paragraph[i] >= 'z'))) {
                word.push_back(std::tolower(paragraph[i]));
                i ++;
            }
            if (word.length() > 0) {
                if (std::find(banned.begin(), banned.end(), word) == banned.end()) {
                    if (freq.find(word) != freq.end()) { // we don't need this actually
                        freq[word] += 1;  /  
                    } else {
                        freq[word] = 1;
                    }                    
                }            
            }
        }
        string target;
        int cnt = 0;
        for (const auto &n: freq) { // iterate the items in the hash map to find the maximum value
            if (n.second > cnt) {
                cnt = n.second;
                target = n.first;
            }
        }
        return target;
    }
};

We can use modern C++ features to make the above algorithm concise. First, we can convert the vector to the set, which helps us to check if a word is banned using O(1).

1	unordered_set<string> words(begin(banned), end(banned));

unordered_set<string> words(begin(banned), end(banned));

Then we can use isalpha function to check if a character is either uppercase or lowercase. Then we can use std::transform to convert the word into lowercase using the following:

1
2
3

std::transform(begin(cur), end(cur), begin(cur), [](auto c) {
    return std::tolower(c);
});

std::transform(begin(cur), end(cur), begin(cur), [](auto c) {
    return std::tolower(c);
});

Then, we can use max_element to find the most-frequently-occurred item in an unordered_map

1
2
3

auto ele = max_element(begin(count), end(count), [](auto &a, auto &b) {
    return a.second < b.second;
});

auto ele = max_element(begin(count), end(count), [](auto &a, auto &b) {
    return a.second < b.second;
});

The following C++ code does look shorter and better!

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        int i = 0;
        int len = paragraph.size();
        unordered_set<string> words(begin(banned), end(banned));
        unordered_map<string, int> count;
        while (i < len) {
            // skip non-word characters
            while ((i < len) && !isalpha(paragraph[i])) i ++;
            int j = i;
            // find next word boundary
            while ((j < len) && isalpha(paragraph[j])) j ++;
            string cur = (paragraph.substr(i, j - i));
            // convert the word to lowercase
            std::transform(begin(cur), end(cur), begin(cur), [](auto c) {
                return std::tolower(c);
            });
            i = j + 1;
            // appears in the banned list?
            if (words.count(cur)) continue;
            count[cur] ++;
        }
        // find the max value in the hash table
        auto ele = max_element(begin(count), end(count), [](auto &a, auto &b) {
            return a.second < b.second;
        });
        return ele->first;
    }
};

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        int i = 0;
        int len = paragraph.size();
        unordered_set<string> words(begin(banned), end(banned));
        unordered_map<string, int> count;
        while (i < len) {
            // skip non-word characters
            while ((i < len) && !isalpha(paragraph[i])) i ++;
            int j = i;
            // find next word boundary
            while ((j < len) && isalpha(paragraph[j])) j ++;
            string cur = (paragraph.substr(i, j - i));
            // convert the word to lowercase
            std::transform(begin(cur), end(cur), begin(cur), [](auto c) {
                return std::tolower(c);
            });
            i = j + 1;
            // appears in the banned list?
            if (words.count(cur)) continue;
            count[cur] ++;
        }
        // find the max value in the hash table
        auto ele = max_element(begin(count), end(count), [](auto &a, auto &b) {
            return a.second < b.second;
        });
        return ele->first;
    }
};

–EOF (The Ultimate Computing & Technology Blog) —

评论列表

添加评论: 姓名：　验证码：