When writing MapReduce jobs in Python with Hadoop Streaming, you usually need to pass parameters in from the shell script. There are two common ways to do this.
1. First approach: read parameters via sys.argv
Mapper code: count_mapper.py
import sys

# Positional arguments passed on the -mapper command line
arg1 = sys.argv[1]
arg2 = sys.argv[2]

for line in sys.stdin:
    line = line.strip()
    item, count = line.split(',')
    print('%s\t%s' % (item, count))
Note that the argument indices in sys.argv start at 1; sys.argv[0] is the script name itself.
Command-line configuration:
hadoop jar hadoop-streaming.jar \
    -mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
    -reducer 'count_reducer.py arg3' -file count_reducer.py \
    ...
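The sys.argv approach can be verified without a cluster by piping sample input into the mapper, the same way Hadoop Streaming feeds it stdin. A minimal local-test sketch (the sample data and the harness are illustrative, not part of the original job):

```python
import os
import subprocess
import sys
import textwrap

# Write a standalone copy of the mapper so this test is self-contained.
mapper_src = textwrap.dedent("""\
    import sys

    # Positional arguments passed on the -mapper command line
    arg1 = sys.argv[1]
    arg2 = sys.argv[2]

    for line in sys.stdin:
        line = line.strip()
        item, count = line.split(',')
        print('%s\\t%s' % (item, count))
""")

with open('count_mapper.py', 'w') as f:
    f.write(mapper_src)

# Simulate: cat input | python count_mapper.py argA argB
sample_input = 'apple,3\nbanana,5\n'
result = subprocess.run(
    [sys.executable, 'count_mapper.py', 'argA', 'argB'],
    input=sample_input, capture_output=True, text=True,
)
print(result.stdout)  # tab-separated key/value pairs, one per input line
os.remove('count_mapper.py')
```

This mirrors what the streaming framework does: the quoted string after -mapper becomes the mapper's command line, so anything after the script name lands in sys.argv.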
2. Second approach: read parameters via os.environ.get
Mapper code:
#!/usr/bin/env python
# vim: set fileencoding=utf-8
import sys
import os

def main():
    # Parameters arrive as environment variables, set with -cmdenv
    card_start = os.environ.get('card_start')
    card_last = os.environ.get('card_last')
    trans_at = float(os.environ.get('trans_at'))
    for line in sys.stdin:
        detail = line.strip().split(',')
        card = detail[0]
        money = float(detail[17])
        if trans_at == money and card_start == card[1:7] and card_last == card[-4:]:
            print('%s\t%s' % (line.strip(), detail[1]))

if __name__ == '__main__':
    main()
Here each value is retrieved by its parameter name. Command-line configuration:
hadoop jar ./hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar \
-input $1 \
-output trans_record/result \
-file map.py \
-file reduce.py \
-mapper "python map.py" \
-reducer "python reduce.py" \
-jobconf mapred.reduce.tasks=1 \
-jobconf mapred.job.name="qianjc_trans_record" \
-cmdenv "card_start=$2" \
-cmdenv "card_last=$3" \
-cmdenv "trans_at=$4"
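Each -cmdenv "name=value" option exports an environment variable into the task's environment, which is what os.environ.get reads. You can reproduce this locally by setting the variables yourself before running the filter logic. A sketch (the sample transaction row and its 18-column layout are assumptions made up for illustration):

```python
import io
import os

# Hypothetical sample row: field 0 is the card number, field 17 the amount,
# matching the indices the mapper uses (18 comma-separated columns total).
row = ['6222021234567890'] + ['x'] * 16 + ['99.50']
sample = ','.join(row) + '\n'

# Simulate -cmdenv "card_start=$2" etc. by setting the variables directly.
os.environ['card_start'] = '222021'   # card[1:7]
os.environ['card_last'] = '7890'      # card[-4:]
os.environ['trans_at'] = '99.50'

card_start = os.environ.get('card_start')
card_last = os.environ.get('card_last')
trans_at = float(os.environ.get('trans_at'))

# Same filter as the mapper, with io.StringIO standing in for sys.stdin.
matches = []
for line in io.StringIO(sample):
    detail = line.strip().split(',')
    card = detail[0]
    money = float(detail[17])
    if trans_at == money and card_start == card[1:7] and card_last == card[-4:]:
        matches.append('%s\t%s' % (line.strip(), detail[1]))

print(matches)  # the row passes all three filters, so it is emitted
```

Compared with sys.argv, named environment variables are self-documenting and order-independent, which helps when a job takes several parameters.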