Top K Frequent Words (Map Reduce)

Find the top k most frequent words using the MapReduce framework.

The mapper's key is the document id and its value is the content of the document. Words in a document are separated by spaces.

Each reducer should output at most k key-value pairs: the top k words seen by that reducer and their frequencies. The judge merges the results from the different reducers into the global top k frequent words, so you do not need to handle that step.

The value of k is given in the constructor of the TopK class (in the Java template used here, it is passed to Reduce.setup(int k)).

Example

Example 1

Input:
document A = "lintcode is the best online judge
I love lintcode" and 
document B = "lintcode is an online judge for coding interview
you can test your code online at lintcode"

Output: 
"lintcode", 4
"online", 3
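
As a sanity check on the expected output, the counts can be reproduced locally. The following is a minimal sketch, not part of the LintCode template: ExampleCountSketch and countWords are hypothetical names, and it collapses the map and shuffle steps into a single in-memory HashMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical local check; not part of the LintCode template.
class ExampleCountSketch {

    // Counts whitespace-separated words across all documents. This is what
    // emitting <word, 1> per token and then summing per word amounts to.
    static Map<String, Integer> countWords(String... documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String content : documents) {
            StringTokenizer tokenizer = new StringTokenizer(content);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(
            "lintcode is the best online judge\nI love lintcode",
            "lintcode is an online judge for coding interview\n"
                + "you can test your code online at lintcode");
        System.out.println("lintcode: " + counts.get("lintcode"));
        System.out.println("online: " + counts.get("online"));
    }
}
```

With k = 2, "lintcode" (4 occurrences) and "online" (3 occurrences) rank above every other word, since no other word appears more than twice.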

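Approach: the key part of this problem is the reduce step. Use a min-heap to collect the k nodes with the highest frequency. In reduce, compare nodes with nodeCmp.compare() rather than comparing times alone, so that ties between equal counts are broken by word.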
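
The min-heap idea can be tried in isolation, outside the MapReduce template. The sketch below assumes the word counts are already available in a Map. TopKHeapSketch, WordCount, and topK are hypothetical names, and the tie-break (for equal counts, the alphabetically smaller word ranks higher) is chosen to match the comparator in the solution.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; not part of the LintCode template.
class TopKHeapSketch {

    static class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Orders entries from weakest to strongest: lower count first, and for
    // equal counts the lexicographically larger word first.
    static final Comparator<WordCount> WEAKEST_FIRST = (a, b) -> {
        if (a.count != b.count) {
            return Integer.compare(a.count, b.count);
        }
        return b.word.compareTo(a.word);
    };

    // Returns the k strongest entries, strongest first.
    static List<WordCount> topK(Map<String, Integer> counts, int k) {
        List<WordCount> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        PriorityQueue<WordCount> heap = new PriorityQueue<>(k, WEAKEST_FIRST);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            WordCount candidate = new WordCount(entry.getKey(), entry.getValue());
            if (heap.size() < k) {
                heap.offer(candidate);
            } else if (WEAKEST_FIRST.compare(candidate, heap.peek()) > 0) {
                // The candidate beats the weakest kept entry, so swap them.
                heap.poll();
                heap.offer(candidate);
            }
        }
        // The heap yields the weakest entry first; reverse for descending order.
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Because the heap never holds more than k entries, each of the n words costs at most O(log k), which is why the heap is preferred here over sorting all n counts.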

/**
 * Definition of OutputCollector:
 * class OutputCollector<K, V> {
 *     public void collect(K key, V value);
 *         // Adds a key/value pair to the output buffer
 * }
 * Definition of Document:
 * class Document {
 *     public int id;
 *     public String content;
 * }
 */
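import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.StringTokenizer;

public class TopKFrequentWords {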

    public static class Map {
        // key is the document id and is not used here. (LintCode's template
        // names it "_", which Java 9+ rejects as an identifier.)
        public void map(String key, Document value,
                        OutputCollector<String, Integer> output) {
            // Emit <word, 1> for every whitespace-separated token in the document.
            StringTokenizer tokenizer = new StringTokenizer(value.content);
            while(tokenizer.hasMoreTokens()){
                String word = tokenizer.nextToken();
                output.collect(word, 1);
            }
        }
    }
    
    public static class Reduce {
        
        private class Node {
            public int times;
            public String word;
            public Node(String word, int times) {
                this.word = word;
                this.times = times;
            }
        }
    
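        // Orders nodes from weakest to strongest: fewer occurrences first; for
        // equal counts, the lexicographically larger word first. The heap root
        // is therefore the weakest of the current top-k candidates.
        private class NodeComparator implements Comparator<Node> {
            @Override
            public int compare(Node a, Node b) {
                if (a.times != b.times) {
                    return Integer.compare(a.times, b.times);
                }
                return b.word.compareTo(a.word);
            }
        }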

        private PriorityQueue<Node> pq;
        private int k;

        public void setup(int k) {
            // Min-heap under NodeComparator: the root is the weakest candidate.
            pq = new PriorityQueue<Node>(k, new NodeComparator());
            this.k = k;
        }

        public void reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while(values.hasNext()){
                sum += values.next();
            }
            
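            // The heap root is the weakest of the current top-k candidates, so a
            // new node replaces it only if it ranks higher under the full
            // comparator (count first, then word).
            Node node = new Node(key, sum);
            if (pq.size() < k) {
                pq.offer(node);
            } else {
                Node peek = pq.peek();
                NodeComparator nodeCmp = new NodeComparator();
                if (nodeCmp.compare(node, peek) > 0) {
                    pq.poll();
                    pq.offer(node);
                }
            }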
        }

        public void cleanup(OutputCollector<String, Integer> output) {
            // Output the top k <word, times> pairs into the output buffer.
            // The min-heap yields the weakest node first, so each polled node is
            // inserted at the front to produce descending order.
            List<Node> list = new ArrayList<Node>();
            while(!pq.isEmpty()){
                list.add(0,pq.poll());
            }
            
            for(int i = 0; i < list.size(); i++){
                Node node = list.get(i);
                output.collect(node.word, node.times);
            }
        }
    }
}

 
