Hadoop Streaming

Hadoop Streaming talks to the mapper and reducer through standard input and output, so it can be combined with any language. You can also debug locally on a single machine: cat inputfile | mapper | sort | reducer > output
However, Streaming can only process text data, not binary data, and routing everything through stdin/stdout adds overhead compared with native Java jobs.

1. By default, everything before the first tab in a line is the key and the rest is the value. If a line contains no tab, the whole line is treated as the key and the value is null.
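A minimal Python sketch of that splitting rule (split_key_value is a hypothetical helper for illustration, not part of Hadoop):

```python
def split_key_value(line, sep="\t"):
    # Streaming's default: everything before the first tab is the key
    if sep in line:
        key, value = line.split(sep, 1)
    else:
        key, value = line, None  # no tab: the whole line is the key
    return key, value

print(split_key_value("k1\tf2\tf3"))  # → ('k1', 'f2\tf3')
print(split_key_value("whole line"))  # → ('whole line', None)
```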

Usage

hadoop jar /usr/local/share/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming.jar

Note: the Generic Command Options must come before the Hadoop streaming command options.

Generic Command Options

| Parameter | Optional/Required | Description |
| --- | --- | --- |
| -conf configuration_file | Optional | Specify an application configuration file |
| -D property=value | Optional | Use value for the given property, e.g. -D mapred.temp.dir=/tmp/temp. For more properties see http://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml |
| -fs host:port or local | Optional | Specify a namenode |
| -files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
| -libjars | Optional | Specify comma-separated jar files to include in the classpath |
| -archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |

-D

1. Map-only job: -D mapreduce.job.reduces=0, which has the same effect as -reducer NONE
2. Set the number of reducers: -D mapreduce.job.reduces=2
3. Set the map output key/value split: -D stream.map.output.field.separator=. \ -D stream.num.map.output.key.fields=4 uses "." as the separator and treats the first four fields as the key
4. files
5. archives
6. Set the partitioner key: -D mapreduce.partition.keypartitioner.options=-k1,2 \ -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
7. comparator
8. aggregate
9. Hadoop Field Selection Class
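How stream.map.output.field.separator and stream.num.map.output.key.fields combine (item 3 above) can be shown in plain Python; this is an illustrative sketch of the semantics, not Hadoop's actual code:

```python
def split_map_output(line, sep=".", num_key_fields=4):
    # Mirrors -D stream.map.output.field.separator=. together with
    # -D stream.num.map.output.key.fields=4: the first four
    # dot-separated fields form the key, the remainder is the value
    fields = line.split(sep)
    key = sep.join(fields[:num_key_fields])
    value = sep.join(fields[num_key_fields:])
    return key, value

print(split_map_output("10.0.0.1.payload"))  # → ('10.0.0.1', 'payload')
```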

Hadoop streaming command

| Parameter | Optional/Required | Description |
| --- | --- | --- |
| -input directoryname or filename | Required | Input path, e.g. -input myInputDirs |
| -output directoryname | Required | Output path, e.g. -output myOutputDir |
| -mapper executable or JavaClassName | Optional | Mapper, e.g. -mapper myPythonScript.py |
| -reducer executable or JavaClassName | Optional | Reducer, e.g. -reducer /usr/bin/wc |
| -file filename | Optional | Ship mapper.py, reducer.py, and any dependencies to every node, e.g. -file myPythonScript.py \ -file myDictionary.txt |
| -inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
| -outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
| -partitioner JavaClassName | Optional | Partitioner class |
| -combiner streamingCommand or JavaClassName | Optional | Combiner |
| -cmdenv name=value | Optional | Pass an environment variable, e.g. -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ |
| -inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
| -verbose | Optional | Verbose output |
| -lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write |
| -numReduceTasks | Optional | Number of reducers |
| -mapdebug | Optional | Script to call when map task fails |
| -reducedebug | Optional | Script to call when reduce task fails |

Python example: wordcount

1. submit

import os

# Clear the previous output directory (the job fails if it already exists)
os.system("hadoop fs -rm -r /output")

# Generic options (-D ...) must come before the streaming command options;
# \0 is written as \\0 so Python does not embed a literal NUL byte
sym_command = "hadoop jar /usr/local/share/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming.jar " \
              "-D stream.map.output.field.separator=\\0 " \
              "-D map.output.key.field.separator=\\0 " \
              "-D stream.num.map.output.key.fields=1 " \
              "-D num.key.fields.for.partition=1 " \
              "-D mapred.map.tasks=200 " \
              "-D mapred.reduce.tasks=100 " \
              "-D mapred.temp.dir=/tmp/temp " \
              "-D mapred.job.name='wang' " \
              "-input '/input/' " \
              "-output '/output/' " \
              "-mapper 'sh mapper.sh' " \
              "-reducer 'sh reducer.sh' " \
              "-file mapper.sh " \
              "-file reducer.sh " \
              "-file mapper.py " \
              "-file reducer.py "
print(sym_command)
os.system(sym_command)

mapper.sh
The advantage of a shell wrapper is that you can set any environment variables you need; the -cmdenv option works as well.

#!/usr/bin/env bash
python3 mapper.py

mapper.py

import sys

# Emit one "word 1" pair per word; split() without an argument also
# discards empty tokens produced by repeated spaces
for line in sys.stdin:
    for word in line.split():
        print("%s %s" % (word, 1))

reducer.sh

#!/usr/bin/env bash
python3 reducer.py

reducer.py

import sys

current_word = None
current_count = 0

# Input arrives sorted by key, so equal words are always adjacent
for line in sys.stdin:
    word, count = line.split(' ', 1)
    count = int(count)

    if current_word == word:
        # Same key as the previous line: keep accumulating
        current_count += count
    else:
        # Key changed: flush the finished group. Skipped on the very
        # first line, when current_word is still None.
        if current_word:
            print("%s %s" % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the last group, which the loop above never prints
if current_word is not None:
    print("%s %s" % (current_word, current_count))
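To sanity-check the mapper/reducer pair without a cluster, the same logic can be driven in plain Python; map_words and reduce_sorted are hypothetical names that mirror mapper.py and reducer.py above, and sorted() stands in for the shuffle/sort step:

```python
# Local simulation of the streaming pipeline:
# cat input | mapper | sort | reducer

def map_words(lines):
    # Same logic as mapper.py: one "word 1" pair per word
    for line in lines:
        for word in line.split():
            yield "%s %s" % (word, 1)

def reduce_sorted(lines):
    # Same logic as reducer.py: sum counts of adjacent equal keys
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.split(' ', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s %s" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s %s" % (current_word, current_count)

pairs = sorted(map_words(["hello world", "hello hadoop"]))  # shuffle/sort
print(list(reduce_sorted(pairs)))  # → ['hadoop 1', 'hello 2', 'world 1']
```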


Reference:
http://hadoop.apache.org/docs/r2.8.0/hadoop-streaming/HadoopStreaming.html
