Hadoop-streaming

Devin01213

Article link: https://blog.csdn.net/ym01213/article/details/102830861


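Hadoop exposes several APIs for MapReduce, so the framework can be used from many programming languages rather than only from Java. With Hadoop Streaming you can write a MapReduce program in any language, as long as that language can read from standard input and write to standard output.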

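Streaming is a natural fit for text processing, and it is best suited to plain text: for scenarios that depend on objects and serialization, Hadoop Streaming offers little help. Its aim is to let us process large volumes of text files quickly using all kinds of scripting languages. Some characteristics of streaming: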

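The map function receives its input line by line from standard input. (With the Java API, by contrast, an InputFormat class preprocesses the input so that the map function receives a key and a value.)
The map function writes its output as key-value pairs, with the key and the value separated by \t. (The MapReduce framework must sort and partition the intermediate map output, that is, perform the shuffle.)
The reduce function's input is the map function's output: key-value pairs as well, with the key and the value separated by \t.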
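
To make the key/value convention concrete, here is a minimal sketch of how one tab-separated line breaks down: everything before the first tab is the key, and the rest is the value. The helper name `split_key_value` is hypothetical and exists only for illustration; it is not part of Hadoop.

```python
def split_key_value(line, sep="\t"):
    """Split one streaming line into (key, value) at the first separator.

    Hypothetical helper for illustration only. A line without a separator
    is treated as a key with no value (None here).
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(sep)
    return (key, value) if found else (key, None)


if __name__ == "__main__":
    for raw in ["hadoop\t1\n", "a\tb\tc\n", "no-tab-here\n"]:
        print(split_key_value(raw))
```

Note that only the first tab matters in this sketch: in the line "a\tb\tc" the key is "a" and the value keeps its own embedded tab.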

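Common parameters

-input: required; a file or directory name, the input that the mapper reads
-output: required; a directory name, the location where the reducer writes its output
-mapper: required; the mapper command
-reducer: required; the reducer command
-file: optional; a local file (such as the mapper or reducer script) to ship with the job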

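-D stream.num.map.output.key.fields=1
stream.num.map.output.key.fields=N tells streaming to treat the first N tab-separated columns of each map output line as the key, so the shuffle stage sorts and groups on those columns. The related setting num.key.fields.for.partition=N controls how many of those leading key columns are used for partitioning; it takes effect when KeyFieldBasedPartitioner is the partitioner.
For the wordcount program, the map output is "word\t1" and the shuffle keys on word, so N=1.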
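
As a rough illustration of the idea, the sketch below takes the first N tab-separated fields of a map output line as the key and assigns the line to one of several reducers from that key alone. The names `key_of` and `partition_for` are hypothetical, and the CRC32 hash is a stand-in chosen for reproducibility; Hadoop's own partitioners use a different hash.

```python
import zlib


def key_of(line, num_key_fields=1, sep="\t"):
    """Return the first num_key_fields separator-delimited fields as the key."""
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields[:num_key_fields])


def partition_for(line, num_reducers, num_key_fields=1):
    """Pick a reducer index from the key only (toy hash, not Hadoop's)."""
    key = key_of(line, num_key_fields)
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    for record in ["word\t1", "word\t1", "other\t1"]:
        print(record.split("\t")[0], "->", partition_for(record, num_reducers=3))
```

Every line that shares a key lands on the same reducer, which is what lets a wordcount reducer see all the counts for one word together.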

Python

Map

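File map.py

#!/usr/bin/python

import sys

# Emit "word<TAB>1" for every whitespace-separated word read from stdin.
for line in sys.stdin:
    # split() with no argument also drops empty strings from repeated spaces
    for word in line.split():
        print('\t'.join([word, '1']))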

Reduce

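File reduce.py

#!/usr/bin/python

import sys

crt_word = None  # word currently being accumulated
total = 0        # running count for crt_word

# Input arrives sorted by word, so each word's lines are contiguous.
for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue  # skip malformed lines
    word, count = ss

    if crt_word is None:
        crt_word = word
    if crt_word != word:
        # The word changed: flush the finished word and reset the counter.
        print('\t'.join([crt_word, str(total)]))
        crt_word = word
        total = 0
    total += int(count)

# Flush the last word, unless the input was empty.
if crt_word is not None:
    print('\t'.join([crt_word, str(total)]))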

Local debugging

# Local debugging flow: cat | mapper | sort | reducer
cat sample.data | python map.py | sort -k1 | python reduce.py > result.local
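
The same cat | mapper | sort | reducer flow can also be exercised entirely inside Python, which is handy for unit tests. The sketch below is a stand-in under stated assumptions, not the scripts from this post: `mapper`, `reducer`, and `run_local` are hypothetical names, `sorted` plays the role of the shuffle, and `itertools.groupby` replaces the manual current-word bookkeeping.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(sorted_pairs):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (shuffle stand-in) -> reduce in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["the man of property", "the man"]
    for word, total in run_local(sample):
        print("\t".join([word, str(total)]))
```

Like the shell pipeline, this only works because the pairs are sorted before reduction; feeding unsorted pairs to `reducer` would split one word into several groups.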

Run the job with bash run.sh

Contents of run.sh

HADOOP_CMD="/usr/local/src/hadoop-2.6.5/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"
# Input path on HDFS
INPUT_FILE_PATH="/Test/python_wordcount/The_man_of_property.txt"
# Output path on HDFS
OUTPUT_PATH="/Test/python_wordcount/output"
# Delete any existing data under the output path (the job fails if it already exists)
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH
# Run the job
$HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -file ./map.py \
        -reducer "python reduce.py" \
        -file ./reduce.py
