Word-count MapReduce in Python

ukakasu

Article link: https://blog.csdn.net/ukakasu/article/details/47354711

Map function

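#!/usr/bin/env python3
# mapper.py: emit "<word>\t1" for every word read from stdin.
import sys

for line in sys.stdin:
    # split() with no arguments splits on any whitespace and drops empty strings
    for word in line.split():
        print("%s\t%s" % (word, 1))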

Reduce function

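#!/usr/bin/env python3
# reducer.py: sum the counts of consecutive identical words.
# The input must be sorted by word so that equal words are adjacent.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    word, _, count = line.partition("\t")
    count = int(count) if count else 1  # the mapper above always emits 1

    if word == current_word:
        # Same word as the previous line: accumulate its count.
        current_count += count
    else:
        # A new word starts: emit the finished word first, then reset.
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_word = word
        current_count = count

# Emit the last word, unless the input was empty.
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))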
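
The same "sum runs of adjacent identical keys" idea can also be written with itertools.groupby from the standard library. This is only a sketch of an alternative, not part of the original scripts; the helper names parse and reduce_counts are made up for illustration, and the sample lines stand in for sorted mapper output.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-separated mapper output lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, _, count = line.partition("\t")
        yield word, int(count)


def reduce_counts(lines):
    """Sum the counts of adjacent identical words; input must be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Hypothetical sample: sorted mapper output for the test sentence in this post.
sorted_lines = ["Hadoop\t1", "hello\t1", "hello\t1", "map\t1", "reduce\t1", "word\t1"]
for word, total in reduce_counts(sorted_lines):
    print("%s\t%s" % (word, total))
```

To use it as a streaming reducer, pass sys.stdin to reduce_counts instead of the sample list.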

Test:

echo "hello word hello Hadoop map reduce" | ./mapper.py |sort -k1,1| ./reducer.py

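The Python reducer only counts correctly when identical words arrive next to each other, which is why the local test pipes the mapper output through sort. On Hadoop, the framework sorts the mapper output by key before passing it to the reducer.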
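
To see why the sort step matters, here is a small self-contained sketch that applies the reducer's counting logic to an in-memory word list, once unsorted and once sorted. The function name count_adjacent is hypothetical and exists only for this illustration.

```python
def count_adjacent(words):
    """Count runs of identical adjacent words, as the streaming reducer does."""
    results = []
    current_word, current_count = None, 0
    for word in words:
        if word == current_word:
            current_count += 1
        else:
            # A new run starts: store the finished one (if any) and reset.
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, 1
    if current_word is not None:
        results.append((current_word, current_count))
    return results


words = "hello word hello Hadoop map reduce".split()

# Unsorted: the two "hello" entries are not adjacent, so each is reported separately.
print(count_adjacent(words))

# Sorted: equal words are adjacent, so "hello" appears once with a total of 2.
print(count_adjacent(sorted(words)))
```

Without sorting, "hello" shows up twice with a count of 1 each; after sorting it shows up once with a count of 2.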

Running on Hadoop:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file test/code/mapper.py -mapper test/code/mapper.py \
-file test/code/reducer.py -reducer test/code/reducer.py \
-input /user/rte/hdfs_in/* -output /user/rte/hdfs_out