Writing MapReduce in Three Languages with Hadoop Streaming

sdankle

Category: Big Data. Tags: big data, mapreduce, hadoop

Article link: https://blog.csdn.net/sdankle/article/details/51100528

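This article shows how to use Hadoop Streaming to write a word-count MapReduce program in Python, C++, and Shell. For each language it walks through the mapper and reducer steps, including setting file permissions, testing locally, and running the job on Hadoop.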

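I won't explain how Streaming works here, partly because I don't fully understand it myself. All I know is that Streaming passes data to the Mapper and Reducer through standard input. For the details, see the official Hadoop website. This post is just a hands-on memo.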
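
To make the "everything goes through standard input" idea concrete, here is a minimal sketch that imitates the map, sort, reduce flow entirely in memory. It is not how Hadoop itself is implemented; `mapper`, `reducer`, and `run_local` are hypothetical names made up for this illustration, and the sort-by-key step stands in for the shuffle that the framework performs between the two phases.

```python
import itertools


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    keyed = (record.split("\t", 1) for record in sorted_records)
    for word, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort by key, then reduce."""
    mapped = mapper(lines)
    # Sorting by key guarantees that equal words arrive at the reducer together.
    shuffled = sorted(mapped, key=lambda record: record.split("\t", 1)[0])
    return list(reducer(shuffled))


if __name__ == "__main__":
    for record in run_local(["foo bar foo", "bar baz"]):
        print(record)
```

Run on the two sample lines, this should print `bar` with 2, `baz` with 1, and `foo` with 2, each as a tab-separated pair. In a real job the mapper and reducer are separate executables and Hadoop handles the sorting, but the data contract is the same: lines in, tab-delimited lines out.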

1. Python with Streaming
Step 1: Write mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
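
The tab in the mapper's output matters because, as far as I understand the default Streaming settings, the text before the first tab on a line is treated as the key and everything after it as the value. The sketch below shows that splitting rule; `split_key_value` is a hypothetical helper written for this note and is not part of Hadoop.

```python
def split_key_value(line):
    """Split one Streaming record into (key, value) at the first tab.

    Assumption: default Streaming behavior, where the text before the first
    tab is the key and the rest is the value. A line without any tab is
    treated as a key with no value (represented as None here).
    """
    line = line.rstrip("\n")
    key, sep, value = line.partition("\t")
    return (key, value if sep else None)


if __name__ == "__main__":
    print(split_key_value("hello\t1\n"))
    print(split_key_value("no tab here"))
```

Only the first tab counts, so a value may itself contain tabs. This is the same reason the reducer splits each line with a maximum of one split.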

Step 2: Write reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py and convert the
    # count (currently a string) to int
    try:
        word, count = line.split('\t', 1)
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the off