Writing Hadoop MapReduce Programs in Python

Latest recommended article published 2023-07-18 17:17:03

VIP article · Wild_Elegance_k


Category: Hadoop. Tags: Hadoop, writing MapReduce in Python

Article link: https://blog.csdn.net/wild_elegance_k/article/details/47912715


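Hadoop MapReduce programs are normally written in Java, but Java has an obvious drawback: every change means coding, packaging, uploading, and running again, which is genuinely tedious. I wanted a simpler way to use Hadoop's computing power and a less complicated way to write MapReduce programs.

After thinking it over, Python fits this need well. I searched for how to write MapReduce programs in Python, read a tutorial, and am writing this post as a record.

The Hadoop framework is developed in Java and supports Java natively, but it also provides API support for other languages such as Python, C++, Perl, and Ruby. The tutorial I followed, on writing basic Hadoop programs in scripting languages, relies on one tool: Hadoop Streaming. As the name suggests, Streaming works through pipe operations.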

Prerequisites:

Python environment

Hadoop environment (single node or cluster)

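The simplest Hadoop programming model is writing a Mapper and a Reducer. This model greatly reduces how much we have to deal with concurrency, synchronization, fault tolerance, and consistency ourselves: you write your business logic and submit the job. Then have a cup of tea and the results come out, provided your business logic is correct.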
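To make the Mapper/Reducer split concrete, here is a minimal in-memory sketch of the three stages for word counting. It is not Hadoop code: the function names and the sample input are hypothetical, and the grouping step only stands in for what the framework does between map and reduce.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


sample_lines = ["foo bar foo", "bar baz"]  # hypothetical input
counts = reduce_phase(shuffle_phase(map_phase(sample_lines)))
print(sorted(counts.items()))
```

Only `map_phase` and `reduce_phase` correspond to code you would write yourself; the middle step is the part the framework takes off your hands.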

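Hadoop Streaming lets us use the pipe model. The neat part of using Python is that the script reads its input from STDIN and writes its output to STDOUT; Hadoop Streaming takes over everything else and turns it into a MapReduce job.

We use the wordcount example again, without explaining it in detail. Let's first look at the mapper code: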
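Because everything travels as plain text lines, it helps to know how a line is read as a key/value pair. By default, Hadoop Streaming treats the text up to the first tab as the key and the rest of the line as the value, and a line without a tab is a key with a null value. The helper below is a hypothetical illustration of that rule, not part of any Hadoop API.

```python
def split_key_value(line, separator="\t"):
    """Split one text line into (key, value) at the first separator.

    A line with no separator becomes a key with no value (None here).
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    return (key, value) if found else (key, None)


print(split_key_value("hadoop\t1\n"))
print(split_key_value("a\tb\tc\n"))
print(split_key_value("no-tab-here\n"))
```

This is why the word count mapper writes the word, a tab, and then the count: the word becomes the key that the reducer input is grouped and sorted by.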

   
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
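The mapper's comments say its tab-delimited output becomes the input of reducer.py, which is not shown here. The following is only a sketch of what such a reducer might look like, assuming its input arrives sorted by word (which Hadoop guarantees for reducer input). The names `parse` and `reduce_counts` and the sample data are hypothetical, and a list stands in for STDIN so the example runs on its own.

```python
from itertools import groupby


def parse(lines):
    """Turn 'word<TAB>count' lines into (word, int) pairs, skipping bad lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if sep and count.strip().isdigit():
            yield word, int(count)


def reduce_counts(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    Relies on the input being sorted by key, so equal words are adjacent.
    """
    for word, group in groupby(parse(sorted_lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Stand-in for mapper output; a real reducer.py would pass sys.stdin
# to reduce_counts instead of this hypothetical list.
mapper_output = ["bar\t1\n", "foo\t1\n", "bar\t1\n", "foo\t1\n", "foo\t1\n"]
for word, total in reduce_counts(sorted(mapper_output)):
    print("%s\t%d" % (word, total))
```

The `sorted(...)` call plays the role of the sort that happens between the map and reduce steps; `groupby` only merges adjacent equal keys, so the sketch would give wrong totals on unsorted input.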
