Find the top k most frequent words using the MapReduce framework.
The mapper's key is the document id and its value is the content of the document; words in a document are separated by spaces.
For the reducer, the output should be at most k key-value pairs: the top k words seen by this reducer and their frequencies. The judge takes care of merging the results of the different reducers into the global top k frequent words, so you don't need to handle that part.
k is given in the constructor of the TopK class.
Example
Example 1
Input:
document A = "lintcode is the best online judge
I love lintcode" and
document B = "lintcode is an online judge for coding interview
you can test your code online at lintcode"
Output:
"lintcode", 4
"online", 3
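To check the expected counts by hand, the map and reduce phases can be simulated locally. The following is only a sketch: the class name WordCountSketch and its helper methods are hypothetical and not part of the LintCode template, and it assumes the line break inside each document is ordinary whitespace.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local simulation; not part of the LintCode template.
class WordCountSketch {

    // Map phase: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String content) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(content);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle and reduce phase: group the pairs by word and sum the ones.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String a = "lintcode is the best online judge\nI love lintcode";
        String b = "lintcode is an online judge for coding interview\n"
                 + "you can test your code online at lintcode";
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(map(a));
        pairs.addAll(map(b));
        Map<String, Integer> counts = reduce(pairs);
        System.out.println("lintcode=" + counts.get("lintcode")); // lintcode=4
        System.out.println("online=" + counts.get("online"));     // online=3
    }
}
```

With k = 2 these two words are the ones reported, since every other word occurs at most twice.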
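Approach: the key to this problem is the reduce step. Use a min-heap to collect the k nodes with the highest frequency. In the reduce method, compare nodes with NodeComparator's compare() rather than comparing times alone, so that ties in frequency are broken by word.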
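The bounded min-heap idea can be tried in isolation. The sketch below is an illustration under assumptions: TopKHeapSketch and topK are hypothetical names, the input is an already-summed count map, and it uses the "offer, then evict the head when the heap exceeds k" variant instead of comparing against peek() first. The ordering mirrors the solution's comparator: lower count is weaker, and on equal counts the lexicographically larger word is weaker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical standalone illustration of top-k selection with a min-heap.
class TopKHeapSketch {

    static List<String> topK(Map<String, Integer> counts, int k) {
        // Weakest entry first: lower count, then lexicographically larger word.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            if (!a.getValue().equals(b.getValue())) {
                return Integer.compare(a.getValue(), b.getValue());
            }
            return b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the weakest so at most k entries remain
            }
        }
        // The heap drains weakest first, so reverse to get descending order.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("lintcode", 4);
        counts.put("online", 3);
        counts.put("judge", 2);
        counts.put("is", 2);
        System.out.println(topK(counts, 3)); // [lintcode, online, is]
    }
}
```

"judge" and "is" both occur twice; comparing counts alone would leave the choice between them to chance, while the word tie-break keeps "is" deterministically.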
/**
 * Definition of OutputCollector:
 * class OutputCollector<K, V> {
 *     public void collect(K key, V value);
 *     // Adds a key/value pair to the output buffer
 * }
 * Definition of Document:
 * class Document {
 *     public int id;
 *     public String content;
 * }
 */
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.StringTokenizer;

public class TopKFrequentWords {

    public static class Map {
        // The key (document id) is unused; only the content is tokenized.
        // Note: "_" is not a valid identifier in Java 9+, hence the name "key".
        public void map(String key, Document value,
                        OutputCollector<String, Integer> output) {
            // Emit <word, 1> for every whitespace-separated token.
            StringTokenizer tokenizer = new StringTokenizer(value.content);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                output.collect(word, 1);
            }
        }
    }

    public static class Reduce {
        // A word together with its total count in this reducer.
        private static class Node {
            public int times;
            public String word;

            public Node(String word, int times) {
                this.word = word;
                this.times = times;
            }
        }

        // Orders nodes from weakest to strongest: lower count first;
        // on equal counts the lexicographically larger word is weaker.
        private static class NodeComparator implements Comparator<Node> {
            @Override
            public int compare(Node a, Node b) {
                if (a.times != b.times) {
                    return Integer.compare(a.times, b.times);
                }
                return b.word.compareTo(a.word);
            }
        }

        private final NodeComparator nodeCmp = new NodeComparator();
        private PriorityQueue<Node> pq; // min-heap: weakest kept node at the head
        private int k;

        public void setup(int k) {
            pq = new PriorityQueue<>(k, nodeCmp);
            this.k = k;
        }

        public void reduce(String key, Iterator<Integer> values) {
            // Sum all partial counts emitted for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            Node node = new Node(key, sum);
            if (pq.size() < k) {
                pq.offer(node);
            } else if (nodeCmp.compare(node, pq.peek()) > 0) {
                // The new node beats the weakest kept node, so replace it.
                pq.poll();
                pq.offer(node);
            }
        }

        public void cleanup(OutputCollector<String, Integer> output) {
            // The heap yields nodes weakest first, so drain it and reverse
            // to output the top k pairs <word, times> in descending order.
            List<Node> list = new ArrayList<>();
            while (!pq.isEmpty()) {
                list.add(pq.poll());
            }
            Collections.reverse(list);
            for (Node node : list) {
                output.collect(node.word, node.times);
            }
        }
    }
}