LintCode-549: Top K Frequent Words (Map Reduce) (System Design problem)

  549. Top K Frequent Words (Map Reduce)

Find the top k frequent words using the map reduce framework.

The mapper’s key is the document id and its value is the content of the document. Words in a document are separated by spaces.

For the reducer, the output should be at most k key-value pairs: the top k words and their frequencies in this reducer. The judge takes care of merging the results of different reducers into the global top k frequent words, so you don’t need to handle that part.

The k is given in the constructor of the TopK class.

Example
Example 1

Input:
document A = “lintcode is the best online judge
I love lintcode” and
document B = “lintcode is an online judge for coding interview
you can test your code online at lintcode”

Output:
“lintcode”, 4
“online”, 3

Example 2

Input:
document A = “a a a b b b” and
document B = “a a a b b b”

Output:
“a”, 6
“b”, 6

Notice
For words with the same frequency, rank them alphabetically.

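Solution 1:
I use the C++ STL std::priority_queue as a max heap to get the K most frequent words. The heap is unbounded, so it holds every word the reducer sees, and the top K are popped from it at the end.
Notes

  1. The heap is maintained on the Reducer side.
  2. priority_queue needs a custom operator < because the element type is user-defined. If it returns this->count < pair2.count, the result is a max heap (the same as the default for int). If it returns this->count > pair2.count, the result is a min heap. On equal counts, returning this->key > pair2.key puts the alphabetically smaller word on top.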
        bool operator < (const Pair & pair2) const {   //max heap
            if (this->count == pair2.count) {
                return this->key > pair2.key;
            }
            return this->count < pair2.count; 
        }
  3. priority_queue does not seem to support a fixed capacity. You can add a counter so that only the K largest elements are output, or wrap the queue in a wrapper class.
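The last note above mentions wrapping priority_queue to cap its size. Here is a minimal sketch of what such a wrapper could look like. The names `WordCount`, `ranksBefore`, `WorstOnTop`, and `BoundedTopK` are hypothetical and do not appear in the solution below. Unlike the max heap used in the solution, this sketch keeps a min heap of at most k items, so the worst retained item sits on top and is evicted when a better one arrives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical record type, used only for this illustration.
struct WordCount {
    std::string key;
    int count;
};

// True when a ranks ahead of b: higher count first, then alphabetical order.
inline bool ranksBefore(const WordCount& a, const WordCount& b) {
    if (a.count != b.count) return a.count > b.count;
    return a.key < b.key;
}

// Comparator that makes priority_queue keep the worst-ranked item at top().
struct WorstOnTop {
    bool operator()(const WordCount& a, const WordCount& b) const {
        return ranksBefore(a, b);
    }
};

// Hypothetical wrapper that never stores more than k items.
class BoundedTopK {
public:
    explicit BoundedTopK(std::size_t k) : k_(k) {}

    void push(const WordCount& wc) {
        if (k_ == 0) return;
        if (heap_.size() < k_) {
            heap_.push(wc);
        } else if (ranksBefore(wc, heap_.top())) {
            // The new item beats the worst retained one, so replace it.
            heap_.pop();
            heap_.push(wc);
        }
    }

    // Empties the heap and returns the retained items, best first.
    std::vector<WordCount> drainSorted() {
        std::vector<WordCount> out;
        while (!heap_.empty()) {
            out.push_back(heap_.top());
            heap_.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    std::size_t k_;
    std::priority_queue<WordCount, std::vector<WordCount>, WorstOnTop> heap_;
};
```

With this layout the reducer would hold at most k entries instead of one entry per distinct word, at the cost of a slightly less obvious comparator.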
/**
 * Definition of Input:
 * template<class T>
 * class Input {
 * public:
 *     bool done(); 
 *         // Returns true if the iteration has no more elements, false otherwise.
 *     void next();
 *         // Move to the next element in the iteration
 *         // Runtime error if the iteration has no more elements
 *     T value();
 *        // Get the current element, Runtime error if
 *        // the iteration has no more elements
 * }
 * Definition of Document:
 * class Document {
 * public:
 *     int id; // document id
 *     string content; // document content
 * }
 */
class TopKFrequentWordsMapper: public Mapper {
public:
//Map does not need to touch heap, it is Reducer's duty to manage the heap.
    void Map(Input<Document>* input) {
        // Please directly use func 'output' to output 
        // the results into output buffer.
        // void output(string &key, int value);
        while(!input->done()) {
            stringstream ss;
            string word;
            ss << input->value().content;
            while(ss >> word) output(word, 1);
            input->next();
        }
        
    }
};


class TopKFrequentWordsReducer: public Reducer {
public:
    void setUp(int k) {
        // initialize your data structure here
        this->k = k;
    }

    void Reduce(string &key, Input<int>* input) {
        // Write your code here
        int count = 0;
        while(!input->done()) {
            count += input->value();
            input->next();
        }
        Pair pair(key, count);
        pq.push(pair);
    }

    void cleanUp() {
        // Please directly use func 'output' to output 
        // the top k pairs <word, times> into output buffer.
        // void output(string &key, int &value);
        int num = 0;
        while(!pq.empty() && num < k) {
            Pair tempPair = pq.top();
            output(tempPair.key, tempPair.count);
            num++;
            pq.pop();
        }

    }
private:
    int k;
    
    struct Pair{
        string key;
        int count;
        Pair(string k, int c) : key(k), count(c) {}
        bool operator < (const Pair & pair2) const {   //max heap
            if (this->count == pair2.count) {
                return this->key > pair2.key;
            }
            return this->count < pair2.count; 
        }
    };
    
    priority_queue<Pair> pq;   //Max-Heap
};
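
The Mapper and Reducer classes above only compile inside the LintCode judge. To check the ordering logic locally, one option is a small stand-in like the sketch below. The function `topKLocal` is hypothetical: it assumes a single reducer and replaces the framework's shuffle step with an unordered_map, while the `Pair` comparator is the same as in the solution.

```cpp
#include <cassert>
#include <queue>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Same ordering as the Pair struct in the solution: the larger count wins,
// and ties go to the alphabetically smaller word.
struct Pair {
    std::string key;
    int count;
    Pair(std::string k, int c) : key(std::move(k)), count(c) {}
    bool operator<(const Pair& other) const {
        if (count == other.count) return key > other.key;
        return count < other.count;
    }
};

// Hypothetical local stand-in for Map + shuffle + Reduce + cleanUp.
std::vector<std::pair<std::string, int>> topKLocal(
        const std::vector<std::string>& documents, int k) {
    // "Map" and "shuffle": split on whitespace and group counts by word.
    std::unordered_map<std::string, int> counts;
    for (const std::string& doc : documents) {
        std::istringstream ss(doc);
        std::string word;
        while (ss >> word) ++counts[word];
    }
    // "Reduce": one Pair per distinct word goes into the max heap.
    std::priority_queue<Pair> pq;
    for (const auto& entry : counts) {
        pq.push(Pair(entry.first, entry.second));
    }
    // "cleanUp": pop at most k pairs, best first.
    std::vector<std::pair<std::string, int>> result;
    while (!pq.empty() && static_cast<int>(result.size()) < k) {
        result.emplace_back(pq.top().key, pq.top().count);
        pq.pop();
    }
    return result;
}
```

Calling `topKLocal` with the two documents from Example 1 and k = 2 should give lintcode with 4 and online with 3, matching the expected output in the problem statement.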