- Top K Frequent Words (Map Reduce)
Find the top k frequent words using the MapReduce framework.
The mapper's key is the document id and its value is the content of the document; words in a document are separated by spaces.
Each reducer should output at most k key-value pairs: the top k words seen by that reducer and their frequencies. The judge takes care of merging the different reducers' results into the global top k frequent words, so you don't need to handle that part.
k is given in the constructor of the TopK class.
Example
Example1
Input:
document A = “lintcode is the best online judge
I love lintcode” and
document B = “lintcode is an online judge for coding interview
you can test your code online at lintcode”
Output:
“lintcode”, 4
“online”, 3
Example2
Input:
document A = “a a a b b b” and
document B = “a a a b b b”
Output:
“a”, 6
“b”, 6
Notice
For words with the same frequency, rank them alphabetically.
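To make the expected output concrete, here is a minimal single-machine sketch of the ranking rule (higher count first, ties broken alphabetically). It does not use the MapReduce framework at all; `topKWords` is a hypothetical helper name introduced only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the LintCode framework): counts words
// across all documents and returns the k most frequent ones, breaking
// ties alphabetically.
std::vector<std::pair<std::string, int>> topKWords(
        const std::vector<std::string>& documents, std::size_t k) {
    std::map<std::string, int> counts;
    for (const std::string& doc : documents) {
        std::istringstream in(doc);
        std::string word;
        // operator>> splits on any whitespace, including newlines.
        while (in >> word) ++counts[word];
    }

    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;  // higher count first
                  return a.first < b.first;                              // then alphabetical
              });
    if (ranked.size() > k) ranked.resize(k);
    return ranked;
}
```

Calling `topKWords` with the two documents of Example1 and k = 2 should yield ("lintcode", 4) followed by ("online", 3), which is a handy reference when checking a MapReduce solution locally.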
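Solution 1:
I use the C++ std::priority_queue as a max heap that orders words by frequency, so the K most frequent words can be popped off the top.
Notes
- The heap is maintained on the Reducer side.
- priority_queue needs a custom operator < (because the element is a user-defined type). If its body is this->count < pair2.count, the result is a max heap (the same as the default behavior for int). If it is this->count > pair2.count, the result is a min heap.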
    bool operator < (const Pair & pair2) const { // max heap
        if (this->count == pair2.count) {
            return this->key > pair2.key;
        }
        return this->count < pair2.count;
    }
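- priority_queue has no built-in way to fix its size. You can add a counter and stop after outputting the K largest elements, or wrap it in another layer with a small wrapper class.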
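The wrapper-class idea from the notes above could look roughly like the following sketch. It is not the code used in this solution: `WordCount`, `RanksAbove`, and `BoundedTopK` are hypothetical names introduced here, and the comparator is deliberately reversed so that the heap behaves as a min heap whose top is the weakest entry currently kept.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

struct WordCount {
    std::string key;
    int count;
};

// Returns true when a should be ranked above b:
// higher count first, ties broken alphabetically.
struct RanksAbove {
    bool operator()(const WordCount& a, const WordCount& b) const {
        if (a.count != b.count) return a.count > b.count;
        return a.key < b.key;
    }
};

// Hypothetical wrapper that never holds more than k entries.
class BoundedTopK {
public:
    explicit BoundedTopK(std::size_t k) : k_(k) {}

    void push(const WordCount& wc) {
        if (k_ == 0) return;
        if (heap_.size() < k_) {
            heap_.push(wc);
        } else if (RanksAbove()(wc, heap_.top())) {
            // The new entry beats the weakest kept entry, so replace it.
            heap_.pop();
            heap_.push(wc);
        }
    }

    // Empties the heap and returns the kept entries, best first.
    std::vector<WordCount> drain() {
        std::vector<WordCount> out;
        while (!heap_.empty()) {
            out.push_back(heap_.top());  // weakest entry comes out first
            heap_.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    std::size_t k_;
    // Using RanksAbove as the "less than" comparator puts the
    // worst-ranked kept entry at top(), i.e. a min heap on rank.
    std::priority_queue<WordCount, std::vector<WordCount>, RanksAbove> heap_;
};
```

Compared with pushing every word into a max heap and popping K times, this variant keeps at most k entries in memory, at the cost of having to reverse the drained entries to get best-first order.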
/**
 * Definition of Input:
 * template<class T>
 * class Input {
 * public:
 *     bool done();
 *         // Returns true once the iteration has no more elements, false otherwise.
 *     void next();
 *         // Move to the next element in the iteration
 *         // Runtime error if the iteration has no more elements
 *     T value();
 *         // Get the current element, Runtime error if
 *         // the iteration has no more elements
 * }
 * Definition of Document:
 * class Document {
 * public:
 *     int id; // document id
 *     string content; // document content
 * }
 */
class TopKFrequentWordsMapper: public Mapper {
public:
    // Map does not need to touch the heap; managing the heap is the Reducer's job.
    void Map(Input<Document>* input) {
        // Please directly use func 'output' to output
        // the results into output buffer.
        // void output(string &key, int value);
        while (!input->done()) {
            // Split the document content on whitespace and emit <word, 1>.
            stringstream ss;
            string word;
            ss << input->value().content;
            while (ss >> word) output(word, 1);
            input->next();
        }
    }
};
class TopKFrequentWordsReducer: public Reducer {
public:
    void setUp(int k) {
        // Remember how many pairs cleanUp() may output.
        this->k = k;
    }
    void Reduce(string &key, Input<int>* input) {
        // Sum the counts emitted for this word, then push the total onto the heap.
        int count = 0;
        while (!input->done()) {
            count += input->value();
            input->next();
        }
        Pair pair(key, count);
        pq.push(pair);
    }
    void cleanUp() {
        // Please directly use func 'output' to output
        // the top k pairs <word, times> into output buffer.
        // void output(string &key, int &value);
        int num = 0;
        while (!pq.empty() && num < k) {
            Pair tempPair = pq.top();
            output(tempPair.key, tempPair.count);
            num++;
            pq.pop();
        }
    }
private:
    int k;
    struct Pair {
        string key;
        int count;
        Pair(string k, int c) : key(k), count(c) {}
        bool operator < (const Pair & pair2) const { // max heap
            if (this->count == pair2.count) {
                return this->key > pair2.key;
            }
            return this->count < pair2.count;
        }
    };
    priority_queue<Pair> pq; // Max-Heap
};