Source:
https://hadoop.apache.org/docs/r1.2.1/streaming.html#Generic+Command+Options
The documentation really is worth reading carefully; many of the problems I ran into along the way are covered in it. On a first read nothing stood out, but after hitting the problems and coming back, it all made sense.
============================================================
1. Options
Hadoop Streaming takes two kinds of options: Generic Command Options and Streaming Command Options.
Note: the genericOptions must be placed before the streamingOptions, i.e.:
bin/hadoop command [genericOptions] [streamingOptions]
Generic Command Options
Parameter | Optional/Required | Description |
---|---|---|
-conf configuration_file | Optional | Specify an application configuration file |
-D property=value | Optional | Use value for given property |
-fs host:port or local | Optional | Specify a namenode |
-jt host:port or local | Optional | Specify a job tracker |
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
-libjars | Optional | Specify comma-separated jar files to include in the classpath |
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |
Streaming Command Options
Parameter | Optional/Required | Description |
---|---|---|
-input directoryname or filename | Required | Input location for mapper |
-output directoryname | Required | Output location for reducer |
-mapper executable or JavaClassName | Required | Mapper executable |
-reducer executable or JavaClassName | Required | Reducer executable |
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes |
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to |
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output |
-cmdenv name=value | Optional | Pass environment variable to streaming commands |
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
-verbose | Optional | Verbose output |
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write) |
-numReduceTasks | Optional | Specify the number of reducers |
-mapdebug | Optional | Script to call when map task fails |
-reducedebug | Optional | Script to call when reduce task fails |
Notes:
1. The -files option (a generic option) vs. the -file option (a streaming option): the former takes multiple comma-separated paths and accepts HDFS file URLs; the latter uploads a file from the local machine to the job's working directory. When using -files, it must be placed before the streaming options (-input, -output, etc.).
2. The -cmdenv option passes environment variables to the streaming commands, e.g.:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
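As a minimal sketch, a streaming mapper can pick up such a variable with Python's standard os module. EXAMPLE_DIR is just the illustrative name from the option above, and the tab-separated output format is streaming's default:

```python
#!/usr/bin/env python
# Sketch of a streaming mapper that reads a variable passed via -cmdenv.
# EXAMPLE_DIR is the illustrative name from the example above.
import os
import sys

def map_line(line, example_dir):
    """Turn one input line into a key<TAB>value record (streaming's default format)."""
    word = line.strip()
    if not word:
        return None          # skip blank lines
    return "%s\t%s" % (word, example_dir)

def main():
    example_dir = os.getenv("EXAMPLE_DIR", "")  # set on each task by -cmdenv
    for line in sys.stdin:
        record = map_line(line, example_dir)
        if record is not None:
            print(record)

# Inside a real streaming task this script would end with:
# if __name__ == "__main__":
#     main()
```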
2. Using HDFS file paths
Upload the data file to HDFS first; after that it can be used directly.
Pass it with:
-files hdfs://host:fs_port/user/testfile.txt
Notes:
1. host:fs_port is the value of fs.defaultFS in /usr/local/hadoop/etc/hadoop/core-site.xml.
2. In mapper.py, just call open("testfile.txt"). Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks, pointing to the local copy of the file.
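A sketch of how a mapper.py might use the shipped file: the relative open() works because of the symlink Hadoop creates. The one-key-per-line file format and the filtering logic are assumptions for illustration:

```python
#!/usr/bin/env python
# Sketch: load the dictionary file shipped with -files. Hadoop symlinks it
# as testfile.txt in the task's working directory, so a relative path works.
import sys

def load_keys(path="testfile.txt"):
    """Read one key per line into a set (the file format here is an assumption)."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main():
    keys = load_keys()
    for line in sys.stdin:
        word = line.strip()
        if word in keys:              # keep only words found in the dictionary
            print("%s\t1" % word)

# Inside a real streaming task this script would end with:
# if __name__ == "__main__":
#     main()
```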
3. "No space left on device" error
Hadoop Streaming packages the -file files into a jar and uploads it, so large files produce a large jar. The jar is built in /tmp by default; if /tmp runs out of space, this error is raised.
To change the directory:
-D stream.tmpdir=/export/bigspace/...
4. Reading job parameters in mapper.py
Given a setting such as:
-D mapred.reduce.tasks=1
To read it inside mapper.py, replace each dot "." in the property name with an underscore "_":
reducer_tasks = os.getenv("mapred_reduce_tasks")
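As a small self-contained sketch of the dot-to-underscore rule (the helper names to_env_name and get_conf are mine, and the environment variable is set manually here to simulate what Hadoop does inside a task):

```python
#!/usr/bin/env python
# Sketch: streaming exposes job configuration to tasks as environment
# variables, with dots in property names replaced by underscores.
import os

def to_env_name(prop):
    """Map a Hadoop property name to its environment-variable form."""
    return prop.replace(".", "_")

def get_conf(prop, default=None):
    """Look up a job property from the task environment."""
    return os.getenv(to_env_name(prop), default)

# Simulate what Hadoop sets inside a real task for -D mapred.reduce.tasks=1:
os.environ["mapred_reduce_tasks"] = "1"
reducer_tasks = get_conf("mapred.reduce.tasks")
```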