Hadoop Streaming uses standard input and output, so it can be combined with any language. You can debug on a single machine with `cat inputfile | mapper | sort | reducer > output`.
However, Streaming can only process text data, not binary data, and because everything passes through standard input and output, performance suffers.
1. By default, the first tab-separated field is the key and the rest is the value; if there is no tab separator, the entire line is treated as the key and the value is null.
Usage
hadoop jar /usr/local/share/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming.jar
Note that the Generic Command Options must come before the Hadoop streaming command options.
Generic Command Options
Parameter | Optional/Required | Description |
---|---|---|
-conf configuration_file | Optional | Specify an application configuration file |
-D property=value | Optional | Use value for given property, e.g. `-D mapred.temp.dir=/tmp/temp`. For more properties see http://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml |
-fs host:port or local | Optional | Specify a namenode |
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
-libjars | Optional | Specify comma-separated jar files to include in the classpath |
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |
-D
1. Map-only job: `-D mapreduce.job.reduces=0` has the same effect as specifying `-reducer NONE`.
2. Set the number of reducers: `-D mapreduce.job.reduces=2`.
3. Control how map output is split into key/value: `-D stream.map.output.field.separator=.` together with `-D stream.num.map.output.key.fields=4` sets the field separator to `.` and treats the first four fields as the key.
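The effect of these two settings can be mimicked in plain Python. This is an illustrative sketch, not Hadoop code — the function name and the sample line are made up for the example:

```python
def split_map_output(line, separator=".", num_key_fields=4):
    """Mimic how Streaming splits a map output line into key and value."""
    fields = line.split(separator)
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value

# "11.12.1.2" becomes the key, "hello" the value
print(split_map_output("11.12.1.2.hello"))  # ('11.12.1.2', 'hello')
```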
4. -files: distribute comma-separated files to the cluster.
5. -archives: distribute comma-separated archives, unpacked on the compute nodes.
6. Choose which key fields to partition on: `-D mapreduce.partition.keypartitioner.options=-k1,2` together with `-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner` partitions on the first two fields of the key.
7. comparator: `KeyFieldBasedComparator` lets the framework sort on selected key fields.
8. aggregate: passing `-reducer aggregate` uses Hadoop's built-in aggregator library.
9. Hadoop Field Selection Class: `org.apache.hadoop.mapred.lib.FieldSelectionMapReduce` selects key/value fields, similar to Unix cut.
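For item 8, the aggregate library expects the mapper to emit `function:key<TAB>value` records; with `-reducer aggregate`, Hadoop then aggregates the values per key. A minimal sketch of such a mapper (the generator wrapper is my own structure, added so the logic is testable):

```python
import sys

def aggregate_mapper(lines):
    """Yield records for Hadoop's built-in aggregate reducer."""
    for line in lines:
        for word in line.strip().split():
            # "LongValueSum" tells -reducer aggregate to sum the values per key
            yield "LongValueSum:%s\t%s" % (word, 1)

if __name__ == "__main__":
    for record in aggregate_mapper(sys.stdin):
        print(record)
```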
Hadoop streaming command
Parameter | Optional/Required | Description |
---|---|---|
-input directoryname or filename | Required | Input path, e.g. -input myInputDirs |
-output directoryname | Required | Output path, e.g. -output myOutputDir |
-mapper executable or JavaClassName | Optional | Specify the mapper, e.g. -mapper myPythonScript.py |
-reducer executable or JavaClassName | Optional | Specify the reducer, e.g. -reducer /usr/bin/wc |
-file filename | Optional | Ship mapper.py, reducer.py and any dependencies to every node, e.g. -file myPythonScript.py -file myDictionary.txt |
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputformat is used as the default |
-partitioner JavaClassName | Optional | Specify the partitioner |
-combiner streamingCommand or JavaClassName | Optional | Specify the combiner |
-cmdenv name=value | Optional | Pass environment variables, e.g. -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ |
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
-verbose | Optional | Verbose output |
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write |
-numReduceTasks | Optional | Number of reducers |
-mapdebug | Optional | Script to call when map task fails |
-reducedebug | Optional | Script to call when reduce task fails |
Python example: wordcount
1. submit.py
import os

# The job fails if the output directory already exists, so remove it first
os.system("hadoop fs -rmr /output")
# Generic -D options must come before the streaming options (-input, -mapper, ...);
# '\\0' is escaped so Python does not embed a raw NUL byte in the command string
sym_command = "hadoop jar /usr/local/share/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming.jar " \
              "-D stream.map.output.field.separator='\\0' " \
              "-D map.output.key.field.separator='\\0' " \
              "-D stream.num.map.output.key.fields=1 " \
              "-D num.key.fields.for.partition=1 " \
              "-D mapred.map.tasks=200 " \
              "-D mapred.reduce.tasks=100 " \
              "-D mapred.temp.dir=/tmp/temp " \
              "-D mapred.job.name='wang' " \
              "-input '/input/' " \
              "-output '/output/' " \
              "-mapper 'sh mapper.sh' " \
              "-reducer 'sh reducer.sh' " \
              "-file mapper.sh " \
              "-file reducer.sh " \
              "-file mapper.py " \
              "-file reducer.py "
print(sym_command)
os.system(sym_command)
mapper.sh
The advantage of a shell wrapper is that you can set arbitrary environment variables; using the -cmdenv
option works as well.
#!/usr/bin/env bash
python3 mapper.py
mapper.py
import sys

# Emit "<word> 1" for every word read from standard input
for line in sys.stdin:
    words = line.strip().split(" ")
    for word in words:
        print("%s %s" % (word, 1))
reducer.sh
#!/usr/bin/env bash
python3 reducer.py
reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    word, count = line.split(' ', 1)
    count = int(count)
    # Input is sorted, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        # A new word starts: emit the finished group (skipped on the very first line)
        if current_word:
            print("%s %s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the last group, which the loop above never flushes
if current_word == word and current_word is not None:
    print("%s %s" % (current_word, current_count))
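As noted at the top, a Streaming job can be smoke-tested without a cluster via `cat input | mapper | sort | reducer`. The same flow can be simulated in pure Python; this sketch copies the mapper/reducer logic above into generators (the sample input is made up for illustration):

```python
def mapper(lines):
    # Same logic as mapper.py: emit "<word> 1" per word
    for line in lines:
        for word in line.strip().split(" "):
            yield "%s %s" % (word, 1)

def reducer(lines):
    # Same logic as reducer.py: sum consecutive counts per word
    current_word, current_count = None, 0
    word = None
    for line in lines:
        word, count = line.split(" ", 1)
        count = int(count)
        if current_word == word:
            current_count += count
        else:
            if current_word:
                yield "%s %s" % (current_word, current_count)
            current_word, current_count = word, count
    # Flush the last group
    if current_word == word and current_word is not None:
        yield "%s %s" % (current_word, current_count)

# cat input | mapper | sort | reducer
sample = ["hello world\n", "hello hadoop\n"]
for out in reducer(sorted(mapper(sample))):
    print(out)  # hadoop 1, hello 2, world 1
```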
Reference:
http://hadoop.apache.org/docs/r2.8.0/hadoop-streaming/HadoopStreaming.html