Source:
https://hadoop.apache.org/docs/r1.2.1/streaming.html#Generic+Command+Options
The documentation really is worth reading carefully; many of the problems I ran into along the way are covered in it. On a first read nothing stood out, but after hitting the problems and coming back, it all made sense.
============================================================
1. Options
Hadoop Streaming takes two kinds of options: Generic Command Options and Streaming Command Options.
Note: the genericOptions must be placed before the streamingOptions, i.e.:
bin/hadoop command [genericOptions] [streamingOptions]
Generic Command Options
Parameter | Optional/Required | Description |
---|---|---|
-conf configuration_file | Optional | Specify an application configuration file |
-D property=value | Optional | Use value for given property |
-fs host:port or local | Optional | Specify a namenode |
-jt host:port or local | Optional | Specify a job tracker |
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
-libjars | Optional | Specify comma-separated jar files to include in the classpath |
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |
Streaming Command Options
Parameter | Optional/Required | Description |
---|---|---|
-input directoryname or filename | Required | Input location for mapper |
-output directoryname | Required | Output location for reducer |
-mapper executable or JavaClassName | Required | Mapper executable |
-reducer executable or JavaClassName | Required | Reducer executable |
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes |
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to |
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output |
-cmdenv name=value | Optional | Pass environment variable to streaming commands |
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
-verbose | Optional | Verbose output |
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write) |
-numReduceTasks | Optional | Specify the number of reducers |
-mapdebug | Optional | Script to call when map task fails |
-reducedebug | Optional | Script to call when reduce task fails |
Notes:
1. The -files option (a generic option) vs. the -file option (a streaming option): the former takes multiple comma-separated paths and accepts HDFS file URLs; the latter uploads a file from the local machine to the job's working directory. When using -files, it must be placed before the streaming options (-input, -output, etc.).
2. The -cmdenv option passes environment variables to the streaming commands, e.g.:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
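As a minimal sketch, a streaming mapper can pick up such a variable with Python's standard os module. EXAMPLE_DIR is just the illustrative name from the option above, and the tab-separated output format is streaming's default:

```python
#!/usr/bin/env python
# Sketch of a streaming mapper that reads a variable passed via -cmdenv.
# EXAMPLE_DIR is the illustrative name from the example above.
import os
import sys

def map_line(line, example_dir):
    """Turn one input line into a key<TAB>value record (streaming's default format)."""
    word = line.strip()
    if not word:
        return None          # skip blank lines
    return "%s\t%s" % (word, example_dir)

def main():
    example_dir = os.getenv("EXAMPLE_DIR", "")  # set on each task by -cmdenv
    for line in sys.stdin:
        record = map_line(line, example_dir)
        if record is not None:
            print(record)

# Inside a real streaming task this script would end with:
# if __name__ == "__main__":
#     main()
```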
2. Using HDFS file paths
Upload the data file to HDFS first; after that it can be used directly.
Pass it with:
-files hdfs://host:fs_port/user/testfile.txt
Notes:
1. host:fs_port is the value of fs.defaultFS in /usr/local/hadoop/etc/hadoop/core-site.xml.
2. In mapper.py, just call open("testfile.txt"). Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks, pointing to the local copy of the file.
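A sketch of how a mapper.py might use the shipped file: the relative open() works because of the symlink Hadoop creates. The one-key-per-line file format and the filtering logic are assumptions for illustration:

```python
#!/usr/bin/env python
# Sketch: load the dictionary file shipped with -files. Hadoop symlinks it
# as testfile.txt in the task's working directory, so a relative path works.
import sys

def load_keys(path="testfile.txt"):
    """Read one key per line into a set (the file format here is an assumption)."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main():
    keys = load_keys()
    for line in sys.stdin:
        word = line.strip()
        if word in keys:              # keep only words found in the dictionary
            print("%s\t1" % word)

# Inside a real streaming task this script would end with:
# if __name__ == "__main__":
#     main()
```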
3. "No space left on device" error
Hadoop Streaming packages the -file files into a jar and uploads it, so large files produce a large jar. The jar is built in /tmp by default; if /tmp runs out of space, this error is raised.
To change the directory:
-D stream.tmpdir=/export/bigspace/...
4. Reading job parameters in mapper.py
Given a setting such as:
-D mapred.reduce.tasks=1
To read it inside mapper.py, replace each dot "." in the property name with an underscore "_":
reducer_tasks = os.getenv("mapred_reduce_tasks")
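As a small self-contained sketch of the dot-to-underscore rule (the helper names to_env_name and get_conf are mine, and the environment variable is set manually here to simulate what Hadoop does inside a task):

```python
#!/usr/bin/env python
# Sketch: streaming exposes job configuration to tasks as environment
# variables, with dots in property names replaced by underscores.
import os

def to_env_name(prop):
    """Map a Hadoop property name to its environment-variable form."""
    return prop.replace(".", "_")

def get_conf(prop, default=None):
    """Look up a job property from the task environment."""
    return os.getenv(to_env_name(prop), default)

# Simulate what Hadoop sets inside a real task for -D mapred.reduce.tasks=1:
os.environ["mapred_reduce_tasks"] = "1"
reducer_tasks = get_conf("mapred.reduce.tasks")
```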