Top K Frequent Words (Map Reduce)

flyatcmu

Article link: https://blog.csdn.net/u013325815/article/details/104243503

Find the top k most frequent words using the MapReduce framework.

The mapper's key is the document id and its value is the document's content. Words in a document are separated by spaces.

Each reducer should output at most k key-value pairs: the top k words seen by that reducer and their frequencies. The judge takes care of merging the results from different reducers into the global top k frequent words, so you don't need to handle that part.

The value of k is given in the constructor of the TopK class.
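
To make the merge step concrete, here is a minimal sketch of how per-reducer results could be combined. It is not the judge's actual implementation, which the problem does not describe. `MergeSketch`, `mergeTopK`, and the sample reducer outputs are hypothetical names and data chosen for illustration. The sketch relies on the usual MapReduce guarantee that all counts for a given word reach the same reducer, so every globally top-k word must already appear in some reducer's local top k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // Combines per-reducer top-k lists into one global top-k list.
    // Assumes each word appears in at most one reducer's output.
    static List<Map.Entry<String, Integer>> mergeTopK(
            List<List<Map.Entry<String, Integer>>> perReducer, int k) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : perReducer) {
            all.addAll(part);
        }
        // Higher count first; on equal counts, the smaller word first.
        all.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(k, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two reducers, each with k = 2.
        List<Map.Entry<String, Integer>> reducer1 =
            List.of(Map.entry("lintcode", 4), Map.entry("is", 2));
        List<Map.Entry<String, Integer>> reducer2 =
            List.of(Map.entry("online", 3), Map.entry("judge", 2));
        for (Map.Entry<String, Integer> e
                : mergeTopK(List.of(reducer1, reducer2), 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because at most k pairs arrive from each reducer, the merge only has to sort a small list, whatever the size of the original documents.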

Example

Example 1

Input:
document A = "lintcode is the best online judge
I love lintcode" and 
document B = "lintcode is an online judge for coding interview
you can test your code online at lintcode"

Output: 
"lintcode", 4
"online", 3
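
As a sanity check on the example above, the following single-process sketch counts the words of both documents and prints the two most frequent ones. It assumes k = 2, which is inferred from the two output lines rather than stated in the example, and it uses a plain sort instead of MapReduce. `WordCountExample`, `countWords`, and `topK` are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCountExample {
    // Counts whitespace-separated words across all documents.
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String word : doc.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the k most frequent entries; ties go to the smaller word.
    static List<Map.Entry<String, Integer>> topK(
            Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(k, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "lintcode is the best online judge\nI love lintcode",
            "lintcode is an online judge for coding interview\n"
                + "you can test your code online at lintcode");
        // Assumption: k = 2, matching the two lines of expected output.
        for (Map.Entry<String, Integer> e : topK(countWords(docs), 2)) {
            System.out.println("\"" + e.getKey() + "\", " + e.getValue());
        }
    }
}
```

Across the two documents "lintcode" occurs 4 times and "online" 3 times, while every other word occurs at most twice, so those two words form the top 2.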

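Approach: the key to this problem is the reduce step. Use a min-heap to collect the k nodes with the highest frequency. Note that reduce must compare nodes with nodeCmp.compare() rather than comparing times alone, because the comparator also breaks ties between words with equal counts by comparing the words themselves.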
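
The bounded min-heap idea can be sketched in isolation, outside the MapReduce classes. This is an illustrative sketch, not the solution itself: `MinHeapTopK`, `WordCount`, and `WORST_FIRST` are hypothetical names, and the tie-breaking rule (the lexicographically smaller word ranks higher) is taken from the comparator in the solution code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MinHeapTopK {
    // Hypothetical pair type used only in this sketch.
    static final class WordCount {
        final String word;
        final int count;
        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Orders "worst first": lower count first; on equal counts, the
    // lexicographically larger word first.
    static final Comparator<WordCount> WORST_FIRST = (a, b) -> {
        if (a.count != b.count) {
            return Integer.compare(a.count, b.count);
        }
        return b.word.compareTo(a.word);
    };

    // Keeps at most k items in the heap; the heap's head is always the
    // weakest of the current top k, so it is the one to evict.
    static List<WordCount> topK(List<WordCount> items, int k) {
        List<WordCount> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        PriorityQueue<WordCount> heap = new PriorityQueue<>(k, WORST_FIRST);
        for (WordCount item : items) {
            if (heap.size() < k) {
                heap.offer(item);
            } else if (WORST_FIRST.compare(item, heap.peek()) > 0) {
                heap.poll();
                heap.offer(item);
            }
        }
        // The heap drains worst-first, so reverse to get best-first.
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

With the items ("b", 2), ("a", 2), ("c", 1) and k = 1, comparing counts alone cannot decide between "a" and "b". The full comparator keeps "a", which is why the eviction test has to use the same ordering as the heap.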

/**
 * Definition of OutputCollector:
 * class OutputCollector<K, V> {
 *     public void collect(K key, V value);
 *         // Adds a key/value pair to the output buffer
 * }
 * Definition of Document:
 * class Document {
 *     public int id;
 *     public String content;
 * }
 */
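import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.StringTokenizer;

public class TopKFrequentWords {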

    public static class Map {
        // The document key is unused. It is named `key` because `_` is
        // not a legal identifier in Java 9 and later.
        public void map(String key, Document value,
                        OutputCollector<String, Integer> output) {
            // Emit (word, 1) for every whitespace-separated token.
            StringTokenizer tokenizer = new StringTokenizer(value.content);
            while(tokenizer.hasMoreTokens()){
                String word = tokenizer.nextToken();
                output.collect(word, 1);
            }
        }
    }
    
    public static class Reduce {
        
        private class Node {
            public int times;
            public String word;
            public Node(String word, int times) {
                this.word = word;
                this.times = times;
            }
        }
    
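        // Orders nodes "worst first": lower count first; on equal counts,
        // the lexicographically larger word first.
        private class NodeComparator implements Comparator<Node> {
            @Override
            public int compare(Node a, Node b) {
                if (a.times != b.times) {
                    // Integer.compare avoids the overflow risk of subtraction.
                    return Integer.compare(a.times, b.times);
                } else {
                    return b.word.compareTo(a.word);
                }
            }
        }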

        private PriorityQueue<Node> pq;
        private int k;

        public void setup(int k) {
            // Min-heap whose head is the weakest of the current top k.
            pq = new PriorityQueue<>(k, new NodeComparator());
            this.k = k;
        }

        public void reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while(values.hasNext()){
                sum += values.next();
            }
            
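            Node node = new Node(key, sum);
            if (pq.size() < k) {
                // Heap not full yet: keep every word.
                pq.offer(node);
            } else {
                // Heap full: replace the weakest node only if the new one
                // ranks higher under the heap's own comparator, so ties on
                // count are resolved exactly as the heap orders them.
                Node peek = pq.peek();
                if (pq.comparator().compare(node, peek) > 0) {
                    pq.poll();
                    pq.offer(node);
                }
            }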
        }

        public void cleanup(OutputCollector<String, Integer> output) {
            // Output the top k pairs <word, times> into the output buffer.
            // The min-heap yields the weakest node first, so drain it and
            // reverse the list to emit from most to least frequent.
            List<Node> list = new ArrayList<Node>();
            while (!pq.isEmpty()) {
                list.add(pq.poll());
            }
            Collections.reverse(list);

            for (Node node : list) {
                output.collect(node.word, node.times);
            }
        }
    }
}