Writing a Map-Reduce Program in Python

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python is a well-written article, but some of the Python libraries it uses were simply not installed on my server, so I made small modifications to the code. Python is better suited to rapid development than Java, and learning how to write Map-Reduce programs in Python is well worth the effort.
First, write a Python program that implements the map function. The code is as follows:
#!/usr/bin/env python
 
import sys
 
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

This program is very simple: it reads input line by line from standard input, splits each line into words, and writes each word to standard output with a count of 1. Next comes a Python program implementing the reduce function. The original article sorts the keys with itemgetter, but the operator module of the Python installed on my test server has no itemgetter, so I simply dropped the sorting. The modified code is as follows:
#!/usr/bin/env python
 
#from operator import itemgetter
import sys
 
# maps words to their counts
word2count = {}
 
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
 
    # parse the input we got from mapper.py
    try:
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
        #print '%s\t%s' % (word2count[word], count)
    except ValueError:
        # line had no tab, or count was not a number, so
        # silently ignore/discard this line
        pass
 
# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
#sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
 
# write the results to STDOUT (standard output)
for word, count in word2count.items():
    print '%s\t%s' % (word, count)
This program is also very intuitive: it builds a dictionary whose keys are the words and whose values are the number of times each word occurs. When the counting is complete, a loop prints every entry in the dictionary to standard output.
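As an aside, the missing itemgetter suggests the server was running a Python release older than 2.4, since both operator.itemgetter and the sorted() builtin first appeared in 2.4. If you want the sorted output of the original article on such a system, sorting the key list in place works on any Python version. A minimal sketch (my addition, not part of the original post), reusing the word2count dictionary built above:

# Restore the lexicographic sort without operator.itemgetter or
# sorted(), both of which require Python 2.4+.
words = word2count.keys()
words.sort()
for word in words:
    print '%s\t%s' % (word, word2count[word])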
Following the original article, I also tested the two programs individually. The process is as follows:
1. Testing mapper.py
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py
foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1
2. Testing reducer.py
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py | sort | /home/henshao/python_hadoop/reducer.py       
labs    1
quux    2
foo     3
bar     1
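Note that the sort between the two scripts stands in for Hadoop's shuffle phase, which guarantees that the reducer sees all pairs for a given key on consecutive input lines. A reducer that relies on this guarantee does not need to hold a dictionary of every word in memory. Here is a minimal sketch of that idea (my own variant, not the script tested above):

#!/usr/bin/env python

import sys

# the word currently being accumulated and its running total
current_word = None
current_count = 0

# input is sorted by key, so all lines for one word arrive together
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # malformed line; silently ignore/discard it
        continue
    if word == current_word:
        current_count += count
    else:
        # a new key begins, so emit the finished one first
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# emit the last word, if any
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)

With sorted input this produces the same counts while using constant memory, which matters once the vocabulary no longer fits in RAM.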
Next, generate some test data and upload the file to HDFS:
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" > element.txt
[henshao@test208011 python_hadoop]$ cat element.txt 
foo foo quux labs foo bar quux
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -put element.txt /home/python_test/
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python_test/element.txt
foo foo quux labs foo bar quux
The command for running the job is as follows (be sure to include the "-file" options, which package mapper.py and reducer.py with the job so the task nodes can find them; otherwise the job will fail):
~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar  -file  /home/henshao/python_hadoop/mapper.py -mapper mapper.py  -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
The output printed during the test run is as follows:
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
packageJobJar: [/home/henshao/python_hadoop/mapper.py, /home/henshao/python_hadoop/reducer.py, /home/henshao/hadoop-datastore/hadoop-henshao/hadoop-unjar5362045099634515320/] [] /tmp/streamjob7670340198799210833.jar tmpDir=null
10/01/21 19:00:51 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/21 19:00:51 INFO streaming.StreamJob: getLocalDirs(): [/home/henshao/hadoop-datastore/hadoop-henshao/mapred/local]
10/01/21 19:00:51 INFO streaming.StreamJob: Running job: job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: To kill this job, run:
10/01/21 19:00:51 INFO streaming.StreamJob: /home/henshao/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=127.0.0.1:9001 -kill job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001211801_0013
10/01/21 19:00:52 INFO streaming.StreamJob:  map 0%  reduce 0%
10/01/21 19:00:56 INFO streaming.StreamJob:  map 50%  reduce 0%
10/01/21 19:00:57 INFO streaming.StreamJob:  map 100%  reduce 0%
10/01/21 19:01:03 INFO streaming.StreamJob:  map 100%  reduce 100%
10/01/21 19:01:04 INFO streaming.StreamJob: Job complete: job_201001211801_0013
10/01/21 19:01:04 INFO streaming.StreamJob: Output: /home/python
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -ls /home/python/part-00000
Found 1 items
-rw-r--r--   1 henshao supergroup         26 2010-01-21 19:01 /home/python/part-00000
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python/part-00000
labs    1
quux    2
foo     3
bar     1

Saving this here to learn from.

Reposted from http://blog.163.com/ecy_fu/blog/static/4445126201002191329467/
