Writing Hadoop Programs in Python


1. First, set up a working Hadoop cluster.

2. Write the Python map program:

mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
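The mapper can be sanity-checked locally before Hadoop is involved at all, simply by piping a line of text into it (assuming you run this from the directory containing mapper.py, and that python points at a Python 2 interpreter, as the print syntax requires):

echo "Hello World Bye World" | python mapper.py

This should print each word followed by a tab and the count 1, one pair per line.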

Write the reduce program: reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
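The whole chain can also be simulated locally; the sort command stands in for Hadoop's shuffle phase, which is exactly what the IF-switch above relies on to see its input grouped by key (a sketch, again assuming both scripts sit in the current directory):

echo "Hello World Bye World" | python mapper.py | sort -k1,1 | python reducer.py

If the counts look right here, the scripts should behave the same way under Hadoop Streaming.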


3. Copy the two data files whose words are to be counted into HDFS:
hadoop dfs -mkdir input2
hadoop dfs -copyFromLocal data01 data02 input2
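Judging from the counts in step 6 below, the two data files presumably held quoted sentences along these lines (this is a reconstruction from the output, not the author's exact files); to reproduce the run, they could be created before the copy, and the upload verified afterwards:

echo '“Hello World Bye World”' > data01
echo '“Hello Hadoop Goodbye Hadoop”' > data02
hadoop dfs -ls input2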

4. Run the MapReduce job.
Write a wrapper script: runwork.sh

# Note the -file options: at first I left them out and the job got killed;
# adding them fixed it (they ship the scripts out to the task nodes).
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
 -file /home/hadoop-user/dopython/wordcount/mapper.py \
 -mapper /home/hadoop-user/dopython/wordcount/mapper.py \
 -file /home/hadoop-user/dopython/wordcount/reducer.py \
 -reducer /home/hadoop-user/dopython/wordcount/reducer.py \
 -input input2/* \
 -output pywordcount
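One detail worth checking before the first run: Streaming invokes the scripts directly on the task nodes, so they should be executable, with the #!/usr/bin/env python shebang as their first line (paths as in the script above):

chmod +x /home/hadoop-user/dopython/wordcount/mapper.py /home/hadoop-user/dopython/wordcount/reducer.py
chmod +x runwork.sh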

5. Run it:
./runwork.sh

Observe the output:
packageJobJar: [/home/hadoop-user/dopython/wordcount/mapper.py, /home/hadoop-user/dopython/wordcount/reducer.py, /home/hadoop-user/tmp/hadoop-unjar6462492368598143954/] [] /tmp/streamjob6590748737962071984.jar tmpDir=null
12/07/14 21:03:38 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/14 21:03:38 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop-user/tmp/mapred/local]
12/07/14 21:03:38 INFO streaming.StreamJob: Running job: job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: To kill this job, run:
12/07/14 21:03:38 INFO streaming.StreamJob: /home/hadoop-user/hadoop-0.20.2/bin/../bin/hadoop job  -Dmapred.job.tracker=192.168.201.128:9001 -kill job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201207142005_0008
12/07/14 21:03:39 INFO streaming.StreamJob:  map 0%  reduce 0%
12/07/14 21:03:49 INFO streaming.StreamJob:  map 50%  reduce 0%
12/07/14 21:03:51 INFO streaming.StreamJob:  map 100%  reduce 0%
12/07/14 21:04:06 INFO streaming.StreamJob:  map 100%  reduce 100%
12/07/14 21:04:11 INFO streaming.StreamJob: Job complete: job_201207142005_0008
12/07/14 21:04:11 INFO streaming.StreamJob: Output: pywordcount

6. View the results:
hadoop dfs -cat pywordcount/part-00000

Note: different data files naturally produce different counts; with these two files the result is:
Bye     1
Goodbye 1
Hadoop  1
Hadoop” 1
World   1
World”  1
“Hello  2
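If the results are needed outside HDFS, the part file can be copied back to the local filesystem with the counterpart of the -copyFromLocal command used in step 3:

hadoop dfs -copyToLocal pywordcount/part-00000 result.txt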


Source: ITPUB blog, http://blog.itpub.net/20498361/viewspace-735482/. Please credit the source when reposting.
