Hadoop Streaming: sorting and partitioning

weixin_30482181

Tags: big data, python

Original link: http://www.cnblogs.com/cs-jack-cheng/p/4120380.html

A simple example:

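# Comments cannot follow a trailing "\" (they would end the command), so they are collected here.
# mapreduce.job.queuename: on Hadoop 2.0 a queue name must be specified.
# mapreduce.min.split.size=1073741824 (1 GiB): useful for input that is small in volume
#   but split into many partitions; it improves efficiency considerably.
# Note: Hadoop 2 documents the reducer count and minimum split size as
#   mapreduce.job.reduces and mapreduce.input.fileinputformat.split.minsize;
#   if the names below have no effect on your cluster, try those.
hadoop jar ${hdstreaming} \
-D mapreduce.job.queuename=mapreduce.normal \
-D mapreduce.job.name='UserFeature::Predict' \
-D stream.num.map.output.key.fields=2 \
-D num.key.fields.for.partition=2 \
-D mapreduce.reduce.tasks=90 \
-D mapreduce.min.split.size=1073741824 \
-mapper "getPredictMap.py" \
-reducer "getPredictReduce.py" \
-file getPredictMap.py \
-file getPredictReduce.py \
-file file_used_in_program \
-input "${inputdir}" \
-output "${outputdir}" \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner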
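
The contents of getPredictReduce.py are not shown in the post. As a rough sketch only, a streaming reducer for a job with a two-field key could look like the following. The function names and the per-key record count are hypothetical, chosen just to show how a reducer can detect key boundaries in its sorted input.

```python
from itertools import groupby


def parse(line, num_key_fields=2, sep="\t"):
    """Split a line into (key tuple, list of value fields)."""
    parts = line.rstrip("\n").split(sep)
    return tuple(parts[:num_key_fields]), parts[num_key_fields:]


def reduce_stream(lines, num_key_fields=2):
    """Yield (key, record count) for each run of consecutive equal keys.

    Relies on the input already being sorted by key, which is what a
    streaming reducer receives. Counting is only a placeholder for the
    real per-key logic (an assumption, not the author's reducer).
    """
    records = (parse(line, num_key_fields) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, sum(1 for _ in group)


def format_output(key, count, sep="\t"):
    """Format one reducer output line: key fields, then the count."""
    return sep.join(key) + sep + str(count)
```

In an actual reducer script one would pass `sys.stdin` to `reduce_stream` and print each `format_output(key, count)` result.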

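By default, Hadoop Streaming uses the tab character ('\t') as the separator: everything on a line before the first '\t' is the key, and the rest of the line is the value. If a line contains no '\t', the whole line becomes the key. This is how, for example, the key/value pairs emitted by the map phase are read; they are then sorted by key and passed to the reduce phase as input.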
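
The splitting rule can be mimicked in a few lines of Python. This is an illustrative sketch, not Hadoop's own code; the function name `split_key_value` and its `num_key_fields` parameter are assumptions made for the example (with `num_key_fields=1` it corresponds to the default behavior).

```python
def split_key_value(line, num_key_fields=1, sep="\t"):
    """Split one output line into (key, value) on the separator.

    The first `num_key_fields` fields form the key and the remainder is
    the value. If the line has too few separators, the whole line is the
    key and the value is empty.
    """
    line = line.rstrip("\n")
    parts = line.split(sep)
    if len(parts) <= num_key_fields:
        return line, ""
    key = sep.join(parts[:num_key_fields])
    value = sep.join(parts[num_key_fields:])
    return key, value


print(split_key_value("user1\titem9\t0.75"))     # key is "user1"
print(split_key_value("no separator here"))      # whole line is the key
print(split_key_value("user1\titem9\t0.75", 2))  # key is "user1\titem9"
```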

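Two parameters are commonly used here: -D stream.num.map.output.key.fields and -D num.key.fields.for.partition.

The difference is that the former specifies how many leading fields of the map output make up the key used for sorting, while the latter specifies how many fields are used for partitioning (possibly only part of the key).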
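
To make the distinction concrete, here is a small simulation, assuming a two-field sort key and a one-field partition key. It is a simplified model rather than Hadoop's implementation: `zlib.crc32` stands in for Hadoop's own hash, Python tuple ordering stands in for its key comparator, and the sample records are made up.

```python
import zlib

SEP = "\t"


def leading_fields(line, n):
    """Return the first n tab-separated fields of a line as a tuple."""
    return tuple(line.rstrip("\n").split(SEP)[:n])


def simulate_shuffle(lines, sort_fields, partition_fields, num_reducers):
    """Assign each line to a reducer by its first `partition_fields` fields,
    then sort each reducer's lines by their first `sort_fields` fields."""
    partitions = {r: [] for r in range(num_reducers)}
    for line in lines:
        part_key = SEP.join(leading_fields(line, partition_fields))
        # crc32 is a stand-in hash (assumption); Hadoop uses its own.
        reducer = zlib.crc32(part_key.encode("utf-8")) % num_reducers
        partitions[reducer].append(line)
    for bucket in partitions.values():
        bucket.sort(key=lambda l: leading_fields(l, sort_fields))
    return partitions


# Made-up map output: user, date, event.
map_output = [
    "u2\t2014-11-02\tclick",
    "u1\t2014-11-03\tview",
    "u1\t2014-11-01\tclick",
    "u2\t2014-11-01\tview",
    "u1\t2014-11-02\tbuy",
]
result = simulate_shuffle(map_output, sort_fields=2,
                          partition_fields=1, num_reducers=3)
for reducer, bucket in sorted(result.items()):
    print(reducer, bucket)
```

With these settings every record of a given user lands on the same reducer, and within that reducer the records arrive ordered by user and then by date.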

Reposted from: https://www.cnblogs.com/cs-jack-cheng/p/4120380.html