Writing Hadoop Programs in Python

1. First, set up a working Hadoop cluster.
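
A quick way to check that the cluster daemons are actually up before submitting anything is the JDK's jps tool (the daemon names below assume a typical Hadoop 0.20.x layout with the NameNode and JobTracker on the master):

jps

On the master this should list NameNode, SecondaryNameNode, and JobTracker; on the slaves, DataNode and TaskTracker.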

2. Write the Python map program

mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
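
Because Hadoop Streaming just pipes text through stdin and stdout, the mapper can be sanity-checked locally before going anywhere near the cluster. For example (assuming mapper.py has been made executable with chmod +x):

echo "foo foo quux labs foo bar quux" | ./mapper.py

Each word should come back on its own line as word<TAB>1.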

Then write the reduce program, reducer.py:

#!/usr/bin/env python

import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word:
    print '%s\t%s' % (current_word, current_count)
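
The whole job can also be simulated on a single machine: piping the mapper's output through sort stands in for Hadoop's shuffle phase, which groups identical keys together before they reach the reducer. A local dry run against one of the data files from step 3 looks like:

cat data01 | ./mapper.py | sort | ./reducer.py

If this prints sensible word counts, the scripts are ready for the cluster.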


3. Copy the two data files whose words we want to count into HDFS:
hadoop dfs -mkdir input2
hadoop dfs -copyFromLocal data01 data02 input2
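
To confirm the upload worked, list the directory in HDFS:

hadoop dfs -ls input2

Both data01 and data02 should show up.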

4. Run the MapReduce job
Write a wrapper script, runwork.sh:

# Note the -file options: I left them out at first and the job was
# killed; once I added them (they ship the scripts to the task nodes)
# everything worked.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
-file /home/hadoop-user/dopython/wordcount/mapper.py \
-mapper /home/hadoop-user/dopython/wordcount/mapper.py \
-file /home/hadoop-user/dopython/wordcount/reducer.py \
-reducer /home/hadoop-user/dopython/wordcount/reducer.py \
-input input2/* \
-output pywordcount
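
Both Python scripts are run via their #!/usr/bin/env python shebang lines, so they need the execute bit set (and so does runwork.sh if you invoke it as ./runwork.sh). If tasks die with permission errors, something like this usually cures it:

chmod +x /home/hadoop-user/dopython/wordcount/mapper.py /home/hadoop-user/dopython/wordcount/reducer.py runwork.sh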

5. Run it:
./runwork.sh

Watch the output:
packageJobJar: [/home/hadoop-user/dopython/wordcount/mapper.py, /home/hadoop-user/dopython/wordcount/reducer.py, /home/hadoop-user/tmp/hadoop-unjar6462492368598143954/] [] /tmp/streamjob6590748737962071984.jar tmpDir=null
12/07/14 21:03:38 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/14 21:03:38 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop-user/tmp/mapred/local]
12/07/14 21:03:38 INFO streaming.StreamJob: Running job: job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: To kill this job, run:
12/07/14 21:03:38 INFO streaming.StreamJob: /home/hadoop-user/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=192.168.201.128:9001 -kill job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201207142005_0008
12/07/14 21:03:39 INFO streaming.StreamJob: map 0% reduce 0%
12/07/14 21:03:49 INFO streaming.StreamJob: map 50% reduce 0%
12/07/14 21:03:51 INFO streaming.StreamJob: map 100% reduce 0%
12/07/14 21:04:06 INFO streaming.StreamJob: map 100% reduce 100%
12/07/14 21:04:11 INFO streaming.StreamJob: Job complete: job_201207142005_0008
12/07/14 21:04:11 INFO streaming.StreamJob: Output: pywordcount

6. View the results
hadoop dfs -cat pywordcount/part-00000
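
If the job had run with more than one reducer there would be several part-xxxxx files; they can be listed, or merged into one local file (wordcount-result.txt is just an illustrative name), with:

hadoop dfs -ls pywordcount
hadoop dfs -getmerge pywordcount wordcount-result.txt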

Note: different data files will of course give different counts. Also notice the curly quotes stuck to “Hello and Hadoop”: the mapper splits only on whitespace, so punctuation stays attached to the words.
Bye 1
Goodbye 1
Hadoop 1
Hadoop” 1
World 1
World” 1
“Hello 2

Reposted from the ITPUB blog: http://blog.itpub.net/27202748/viewspace-738371/
