Hadoop Streaming lets you write map and reduce programs in languages other than Java. The example below uses Python.
1. Write mapper.py
#!/usr/bin/env python
import sys

# Emit one "word<TAB>1" pair for every whitespace-separated token on stdin.
for line in sys.stdin:
    for word in line.split():
        print('%s\t%s' % (word, 1))
2. Test mapper.py
cat file | python mapper.py
3. Write reducer.py
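#!/usr/bin/env python
import sys

# Input arrives sorted by key, so all counts for one word are adjacent.
cur_key = None
cur_count = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    key, value = line.split('\t', 1)
    if key == cur_key:
        cur_count += int(value)
    else:
        # Key changed: flush the finished word before starting the next one.
        if cur_key is not None:
            print('%s\t%s' % (cur_key, cur_count))
        cur_key = key
        cur_count = int(value)
# Flush the last word, unless there was no input at all.
if cur_key is not None:
    print('%s\t%s' % (cur_key, cur_count))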
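The reducer only works because Hadoop sorts the mapper output by key before handing it over. As a rough sketch of that flow, the snippet below runs the map, sort, and reduce stages in a single Python process. The function names (map_words, reduce_counts, word_count) and the sample input are made up for illustration and are not part of Hadoop; sorted() merely stands in for the framework's shuffle.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Yield (word, 1) for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Sum the counts of adjacent pairs that share a key.

    Like reducer.py, this assumes the pairs are already sorted by key.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for the sort Hadoop performs between map and reduce.
    shuffled = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(shuffled))


if __name__ == "__main__":
    sample = ["a b a", "c b a"]  # made-up input for illustration
    for word, count in word_count(sample):
        print("%s\t%s" % (word, count))
```

The same check can be done on the command line by placing sort between the two scripts: cat file | python mapper.py | sort | python reducer.py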
4. Submit the job
hadoop jar mydir/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input largedata -output output
5. View the results
hadoop dfs -cat output/part-00000 | head -n 5
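Note that head -n 5 shows the first five lines of the part file, which is ordered by word, not the five most frequent words. A minimal sketch for ranking by count follows, assuming each output line has the form word<TAB>count as written by reducer.py. The function name top_words and the sample lines are hypothetical.

```python
import heapq


def top_words(lines, n=5):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        pairs.append((word, int(count)))
    # Order by count descending, then by word ascending to break ties.
    return heapq.nsmallest(n, pairs, key=lambda p: (-p[1], p[0]))


if __name__ == "__main__":
    sample = ["a\t3\n", "b\t5\n", "c\t1\n"]  # made-up reducer output
    for word, count in top_words(sample, n=2):
        print("%s\t%s" % (word, count))
```

To use it on real output, the function could be fed sys.stdin in place of the sample list and placed at the end of the hadoop dfs -cat pipeline.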
References: