Implementing MapReduce in Python

韩王-信

Article link: https://blog.csdn.net/weihongrao/article/details/11002187

1. Take a look at the local test data:

[root@hadoop Desktop]# cat tour.txt
air:23;hotel:34;nation:CHINA
air:35;hotel:46;nation:USA
air:36;hotel:47;nation:USA
air:26;hotel:37;nation:CHINA
air:33;hotel:44;nation:USA
air:34;hotel:45;nation:USA
air:25;hotel:36;nation:CHINA
air:24;hotel:35;nation:CHINA

2. Write the mapper

[root@hadoop Desktop]# cat mapper.py
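#!/usr/bin/python
# Hadoop Streaming mapper: reads records such as
# "air:23;hotel:34;nation:CHINA" from stdin and emits one
# "<name> <value>" line for the hotel field and one for the air field.

import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    fields = line.split(";")
    air = fields[0].split(":")      # e.g. ["air", "23"]
    hotel = fields[1].split(":")    # e.g. ["hotel", "34"]
    # A single-argument print(...) behaves the same on Python 2 and 3.
    print(hotel[0] + " " + hotel[1])
    print(air[0] + " " + air[1])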
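The same parsing can be pulled into a function so it can be checked without piping a file through the script. What follows is a minimal sketch, not the code used in this post: `map_line` is a hypothetical helper name, it assumes Python 3, and it assumes every record has the `name:value;name:value;...` shape seen in tour.txt. Unlike the script above, it emits every numeric field in record order (air before hotel) and skips non-numeric fields such as nation.

```python
def map_line(line):
    """Return (name, value) pairs for the numeric fields of one record.

    Hypothetical helper: fields look like "name:value" and are joined by ";".
    Non-numeric fields such as nation:CHINA are skipped.
    """
    pairs = []
    for field in line.strip().split(";"):
        name, sep, value = field.partition(":")
        if sep and value.isdigit():
            pairs.append((name, int(value)))
    return pairs


# Small demonstration on one record from tour.txt.
for name, value in map_line("air:23;hotel:34;nation:CHINA\n"):
    print(name, value)
```

Keeping the parsing in a function like this makes it easy to try malformed or empty lines before running a job on the cluster.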

3. Write the reducer

[root@hadoop Desktop]# cat reducer.py
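#!/usr/bin/python
# Hadoop Streaming reducer: reads "<name> <value>" lines from stdin
# and prints the sum of the values for each name.

import sys

allsum = {}  # name -> running total

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 2:
        continue  # skip blank or malformed lines
    product = parts[0]
    allsum[product] = allsum.get(product, 0) + int(parts[1])

for key, value in allsum.items():
    # A single-argument print(...) behaves the same on Python 2 and 3.
    print(key + " " + str(value))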

4. Make mapper.py and reducer.py executable (for example, chmod +x mapper.py reducer.py)

5. Test the mapper

[root@hadoop Desktop]# cat tour.txt | ./mapper.py
hotel 34
air 23
hotel 46
air 35
hotel 47
air 36
hotel 37
air 26
hotel 44
air 33
hotel 45
air 34
hotel 36
air 25
hotel 35
air 24

6. Test the reducer

[root@hadoop Desktop]# cat tour.txt | ./mapper.py | ./reducer.py
hotel 324
air 236
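On the cluster, Hadoop Streaming sorts the mapper output by key before the reducer reads it, a step the local pipeline above leaves out (it makes no difference here because the reducer accumulates totals in a dictionary). The sketch below mimics the map, sort, reduce flow in a single Python 3 process on the sample data. It re-implements the logic instead of calling the two scripts, and the function names `mapper`, `reducer`, and `run_local` are hypothetical.

```python
SAMPLE = """\
air:23;hotel:34;nation:CHINA
air:35;hotel:46;nation:USA
air:36;hotel:47;nation:USA
air:26;hotel:37;nation:CHINA
air:33;hotel:44;nation:USA
air:34;hotel:45;nation:USA
air:25;hotel:36;nation:CHINA
air:24;hotel:35;nation:CHINA
"""


def mapper(lines):
    """Yield (name, value) for the first two fields (air, hotel) of each record."""
    for line in lines:
        fields = line.strip().split(";")
        for field in fields[:2]:
            name, value = field.split(":")
            yield name, int(value)


def reducer(pairs):
    """Sum the values per name."""
    totals = {}
    for name, value in pairs:
        totals[name] = totals.get(name, 0) + value
    return totals


def run_local(text):
    """Map, sort (standing in for the shuffle), then reduce."""
    mapped = sorted(mapper(text.splitlines()))
    return reducer(mapped)


totals = run_local(SAMPLE)
for name in sorted(totals):
    print(name, totals[name])
```

The totals agree with the shell pipeline above: 236 for air and 324 for hotel.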

7. Upload the test data to HDFS, then check that it is there

[root@hadoop Desktop]# hadoop fs -ls /usr/egencia/tour/tour.txt
/usr/hadoop/hadoop-1.2.1/libexec/../conf/hadoop-env.sh: line 59: export: `mapred.tasktracker.reduce.tasks.maximum=4': not a valid identifier
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r-- 1 root supergroup 224 2013-08-27 05:22 /usr/egencia/tour/tour.txt

8. Run the MapReduce job on Hadoop

[root@hadoop Desktop]# hadoop jar /usr/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -mapper /root/Desktop/mapper.py -reducer /root/Desktop/reducer.py -input /usr/egencia/tour -output /usr/egencia/tour/out

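Note: when specifying the mapper and reducer for the job, you cannot use ./mapper.py and ./reducer.py as in the local test. The tasks do not necessarily run in the Desktop directory used for local testing, so a relative path would not resolve there. The absolute paths above work because the scripts exist at that location on this node; on a cluster with several nodes, the scripts would also have to be made available on the other nodes.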
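One way to avoid depending on a fixed local path is the streaming jar's -file option, which ships a local file along with the job so that the script can be referred to by its bare file name. The sketch below only assembles the argument list and prints it; it does not launch Hadoop. `build_streaming_command` is a hypothetical helper, the paths are the ones used in this post, and this variant of the command was not run as part of the original walkthrough.

```python
import os


def build_streaming_command(jar, mapper, reducer, input_path, output_path):
    """Assemble a `hadoop jar` argument list that ships both scripts with -file."""
    return [
        "hadoop", "jar", jar,
        "-file", mapper,
        "-file", reducer,
        # Shipped files land in the task's working directory,
        # so the scripts are referenced by base name.
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
        "-input", input_path,
        "-output", output_path,
    ]


cmd = build_streaming_command(
    "/usr/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar",
    "/root/Desktop/mapper.py",
    "/root/Desktop/reducer.py",
    "/usr/egencia/tour",
    "/usr/egencia/tour/out",
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.check_call(cmd)` on a machine where the `hadoop` command is on the PATH.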

9. View the result on HDFS:

[root@hadoop Desktop]# hadoop fs -cat /usr/egencia/tour/out/part-00000
/usr/hadoop/hadoop-1.2.1/libexec/../conf/hadoop-env.sh: line 59: export: `mapred.tasktracker.reduce.tasks.maximum=4': not a valid identifier
Warning: $HADOOP_HOME is deprecated.

hotel 324
air 236

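As you can see, the result is the same as in the local test. How to implement a partitioner, a combiner, and so on in the Python version still needs some thought. The reducer above also shows that, unlike in Java, the script is not handed typed key/value pairs: Streaming simply exchanges lines of text over stdin and stdout (by default the text up to the first tab is treated as the key), so the line format is largely up to you. Nor are the values for one key automatically grouped and passed to the reducer as a single iteration, so the reducer above has to control the iteration itself. These questions remain open.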
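On the question of iterating per key: Streaming delivers the reducer's input sorted by key, so lines that start with the same name arrive next to each other, and `itertools.groupby` can recover one group per name. Below is a minimal Python 3 sketch under that assumption. `reduce_sorted` is a hypothetical name, and the lines use the space-separated format from this post.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (name, total) for input lines that are already sorted by name."""
    parsed = (line.split() for line in lines if line.strip())
    for name, group in groupby(parsed, key=lambda parts: parts[0]):
        yield name, sum(int(parts[1]) for parts in group)


# Input must already be sorted, as it is when Streaming feeds a reducer.
sorted_lines = ["air 23", "air 35", "hotel 34", "hotel 46"]
for name, total in reduce_sorted(sorted_lines):
    print(name, total)
```

This version holds only one group in memory at a time instead of a dictionary of all names. Since addition is associative, a summing step of this kind could in principle also serve as a combiner. One caveat: with space-separated output and no tab, Streaming treats the whole line as the key, so with more than one reducer the lines for one name may be spread across reducers; emitting the name and the value separated by a tab avoids that.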