Hadoop Streaming uses standard input and output, so it can be combined with any language. You can debug on a single machine with `cat inputfile | mapper | sort | reducer > output`.
However, Streaming can only process text data, not binary data, and because everything passes through standard input and output, performance suffers.
1. By default, the first tab-separated field is the key and the rest is the value; if there is no tab separator, the entire line is treated as the key and the value is null.
Usage
hadoop jar /usr/local/share/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming.jar
Note that the Generic Command Options must come before the Hadoop streaming command options.
Generic Command Options
Parameter | Optional/Required | Description |
---|---|---|
-conf configuration_file | Optional | Specify an application configuration file |
-D property=value | Optional | Use value for given property, e.g. `-D mapred.temp.dir=/tmp/temp`. For more properties see http://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml |
-fs host:port or local | Optional | Specify a namenode |
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
-libjars | Optional | Specify comma-separated jar files to include in the classpath |
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |
-D
1. Map-only job: `-D mapreduce.job.reduces=0` has the same effect as specifying `-reducer NONE`.
2. Set the number of reducers: `-D mapreduce.job.reduces=2`.
3. Control how map output is split into key/value: `-D stream.map.output.field.separator=.` together with `-D stream.num.map.output.key.fields=4` sets the field separator to `.` and treats the first four fields as the key.
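The effect of these two settings can be mimicked in plain Python. This is an illustrative sketch, not Hadoop code — the function name and the sample line are made up for the example:

```python
def split_map_output(line, separator=".", num_key_fields=4):
    """Mimic how Streaming splits a map output line into key and value."""
    fields = line.split(separator)
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value

# "11.12.1.2" becomes the key, "hello" the value
print(split_map_output("11.12.1.2.hello"))  # ('11.12.1.2', 'hello')
```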
4. -files: distribute comma-separated files to the cluster.
5. -archives: distribute comma-separated archives, unpacked on the compute nodes.
6. Choose which key fields to partition on: `-D mapreduce.partition.keypartitioner.options=-k1,2` together with `-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner` partitions on the first two fields of the key.
7. comparator: `KeyFieldBasedComparator` lets the framework sort on selected key fields.
8. aggregate: passing `-reducer aggregate` uses Hadoop's built-in aggregator library.
9. Hadoop Field Selection Class: `org.apache.hadoop.mapred.lib.FieldSelectionMapReduce` selects key/value fields, similar to Unix cut.
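For item 8, the aggregate library expects the mapper to emit `function:key<TAB>value` records; with `-reducer aggregate`, Hadoop then aggregates the values per key. A minimal sketch of such a mapper (the generator wrapper is my own structure, added so the logic is testable):

```python
import sys

def aggregate_mapper(lines):
    """Yield records for Hadoop's built-in aggregate reducer."""
    for line in lines:
        for word in line.strip().split():
            # "LongValueSum" tells -reducer aggregate to sum the values per key
            yield "LongValueSum:%s\t%s" % (word, 1)

if __name__ == "__main__":
    for record in aggregate_mapper(sys.stdin):
        print(record)
```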
Hadoop streaming command
Parameter | Optional/Required | Description |
---|---|---|
-input directoryname or filename | Required | Input path, e.g. -input myInputDirs |
-output directoryname | Required | Output path, e.g. -output myOutputDir |
-mapper executable or JavaClassName | Optional | Specify the mapper, e.g. -mapper myPythonScript.py |
-reducer executable or JavaClassName | Optional | Specify the reducer, e.g. -reducer /usr/bin/wc |
-file filename | Optional | Ship mapper.py, reducer.py and any dependencies to every node, e.g. -file myPythonScript.py -file myDictionary.txt |
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputformat is used as the default |
-partitioner JavaClassName | Optional | Specify the partitioner |
-combiner streamingCommand or JavaClassName | Optional | Specify the combiner |
-cmdenv name=value | Optional | Pass environment variables, e.g. -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ |
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
-verbose | Optional | Verbose output |
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write |
-numReduceTasks | Optional | Number of reducers |
-mapdebug | Optional | Script to call when map task fails |
-reducedebug | Optional | Script to call when reduce task fails |
Python example: wordcount
1. submit.py
import os

# The job fails if the output directory already exists, so remove it first
os.system("hadoop fs -rmr /output")
# Generic -D options must come before the streaming options (-input, -mapper, ...);
# '\\0' is escaped so Python does not embed a raw NUL byte in the command string
sym_command = "hadoop jar /usr/local/share/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming.jar " \
              "-D stream.map.output.field.separator='\\0' " \
              "-D map.output.key.field.separator='\\0' " \
              "-D stream.num.map.output.key.fields=1 " \
              "-D num.key.fields.for.partition=1 " \
              "-D mapred.map.tasks=200 " \
              "-D mapred.reduce.tasks=100 " \
              "-D mapred.temp.dir=/tmp/temp " \
              "-D mapred.job.name='wang' " \
              "-input '/input/' " \
              "-output '/output/' " \
              "-mapper 'sh mapper.sh' " \
              "-reducer 'sh reducer.sh' " \
              "-file mapper.sh " \
              "-file reducer.sh " \
              "-file mapper.py " \
              "-file reducer.py "
print(sym_command)
os.system(sym_command)
mapper.sh
The advantage of a shell wrapper is that you can set arbitrary environment variables; using the -cmdenv
option works as well.
#!/usr/bin/env bash
python3 mapper.py
mapper.py
import sys

# Emit "<word> 1" for every word read from standard input
for line in sys.stdin:
    words = line.strip().split(" ")
    for word in words:
        print("%s %s" % (word, 1))
reducer.sh
#!/usr/bin/env bash
python3 reducer.py
reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    word, count = line.split(' ', 1)
    count = int(count)
    # Input is sorted, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        # A new word starts: emit the finished group (skipped on the very first line)
        if current_word:
            print("%s %s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the last group, which the loop above never flushes
if current_word == word and current_word is not None:
    print("%s %s" % (current_word, current_count))
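As noted at the top, a Streaming job can be smoke-tested without a cluster via `cat input | mapper | sort | reducer`. The same flow can be simulated in pure Python; this sketch copies the mapper/reducer logic above into generators (the sample input is made up for illustration):

```python
def mapper(lines):
    # Same logic as mapper.py: emit "<word> 1" per word
    for line in lines:
        for word in line.strip().split(" "):
            yield "%s %s" % (word, 1)

def reducer(lines):
    # Same logic as reducer.py: sum consecutive counts per word
    current_word, current_count = None, 0
    word = None
    for line in lines:
        word, count = line.split(" ", 1)
        count = int(count)
        if current_word == word:
            current_count += count
        else:
            if current_word:
                yield "%s %s" % (current_word, current_count)
            current_word, current_count = word, count
    # Flush the last group
    if current_word == word and current_word is not None:
        yield "%s %s" % (current_word, current_count)

# cat input | mapper | sort | reducer
sample = ["hello world\n", "hello hadoop\n"]
for out in reducer(sorted(mapper(sample))):
    print(out)  # hadoop 1, hello 2, world 1
```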
Reference:
http://hadoop.apache.org/docs/r2.8.0/hadoop-streaming/HadoopStreaming.html