传统MapReduce框架

传统的MapReduce框架是google于2004年在论文:“MapReduce: Simplified Data Processing on Large Clusters”提出的,该框架把一些数据密集型应用的数据处理过程简化抽象成map和reduce两个阶段,用户在设计分布式程序时,只要实现 map()和reduce()两个函数,至于其它细节,例如数据分片,任务调度,机器容错,机器间通信等,都交由MapReduce框架处理。随着技术的 发展,在传统MapReduce框架的基础上,出现了一些针对特殊应用的MapReduce框架,主要有以下几种:

(1)       支持迭代MapReduce的Twister和Haloop(参见我的博文:迭代式MapReduce框架介绍).

(2)       支持多阶段流式计算的Sector/Sphere(参见我的博文:流式MapReduce实现Sector/Sphere ‎).

(3)       支持DAG(Directed Acyclic Graph)的Dryad和Cascading(参见文章:DryadCascading 以及 Cascading的主页).

(4)        MapReduce与Database结合的产物:HadoopDBgreenplum.

本文主要讲解当下较为出名的传统MapReduce开源实现。现在有非常多的开源MapReduce框架实现, 最出名的莫过于Java实现版本Hadoop。 实际上,它属于重量级的实现版本(代码量大),要理解其细节或者对其进行改进需要很大工作量。 为了克服重量级实现存在的缺陷,一些轻量级的版本出现了,如erlang实现版本Disco,Python实现版本micemeat,bash版本 bashreduce等。

本文主要介绍Disco,粗略讲解micemeat和bashreduce。

传统MapReduce实现之Disco

1、概述

Disco是一个轻量级的MapReduce框架实现,核心模块使用Erlang语言实现,外部接口为易于编程的Python。同Hadoop一 样,拥也有自己的分布式文件系统DDFS,不过DDFS是与计算框架高度耦合的。 Disco由诺基亚研究中心开发,用于处理实际应用中的大规模数据。

2、Disco的总体设计架构

Disco由分布式存储系统DDFS(Disco Distributed File System)和MapReduce框架组成,本节主要介绍Disco的总体设计架构,下面一节介绍DDFS。

Disco也是master/slave架构:

Disco master从client端接收作业,并将它们添加到作业队列中,以便进行调度。

Client processes是一些python程序,它们使用函数disco.job()向master提交作业。

Worker supervision是由master启动的,每个节点启动一个,用于监控该节点上python worker的运行情况。

Python worker用于执行用户提交的作业。

输入文件是通过http获取的,但若要读取的文件在本地,直接从磁盘上获取即可。为了能够从个远程节点上获取数据,每个节点上进行一个httpd后台进程。

3、DDFS的架构

DDFS是嵌入到Disco中的,目前只有一个master节点(存在单点故障)。每个存储节点由一组磁盘或者卷宗组成(vol0..volN), 它们分别挂载在$DDFS_ROOT/vol0 … $DDFS_ROOT/volN。每个卷宗下面有两个文件,分别为tag 和 blob,分别用于存储标记(tag,相当于key)和标记对应的值(value)。DDFS会监控每个节点上的磁盘使用情况,并每隔一段时间进行负载均 衡。

4、分布式索引Discodex

Discodex是专门为Disco设计的分布式索引系统。

Discodex实际上是一个分布式key/value存储系统。通过某一个key,可以检索出与该key相关的所有value。Discodex对外提供了一些ReST API,用户通过这些API检索数据。

当我们使用Discodex时,实际上是运行了一个HTTP 服务器,它把ReSTful url映射成Disco作业。        Discodex在DDFS上存储key和value值,其中每个文件都是分布式的,称为index,chunks或者ichunks。

5、参考资料

(1)官方网址:http://discoproject.org/

(2)安装方法:http://blog.csdn.net/socrates/archive/2009/05/26/4217641.aspx

传统MapReduce实现之micemeat

1、介绍

micemeat是MapReduce的python实现,整个代码由一个python文件构成(<13KB),它仅依赖于python标准库,非常容易部署,同时它还支持以下功能:

(1)       容错:任何一个slave可以随时加入或者离开集群而不会影响其他slave。

(2)       安全:micemeat会对每一条连接进行授权验证,防止未经授权的代码被执行。

2、参考资料:

官方网址:http://remembersaurus.com/mincemeatpy/

传统MapReduce实现之bashreduce

1、  介绍

Bashreduce采用bash脚本语言编写,整合了常用的shell命令,如sort, awk, ssh, netcat等。目前仅在ubuntu/debian系统上进行了测试。

2、  参考资料

(1)       https://github.com/erikfrey/bashreduce

另外,还有Perl实现RobotArmy和ruby实现skyner

参考资料:

(1)    RobotArmy官方主页:http://bulletsweetp.github.com/robotarmy/

(2)    RobotArmy论文:RobotArmy: A Casual Framework for Massive Distributed Processing

(3)    Skynet官方主页:http://skynet.rubyforge.org/

 

转载于:https://www.cnblogs.com/Donal/archive/2011/04/16/2018390.html

最近一直在学coursera上面web intelligence and big data这门课,上周五印度老师布置了一个家庭作业,要求写一个mapreduce程序,用python来实现。 具体描述如下: Programming Assignment for HW3 Homework 3 (Programming Assignment A) Download data files bundled as a .zip file from hw3data.zip Each file in this archive contains entries that look like: journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2. that represent bibliographic information about publications, formatted as follows: paper-id:::author1::author2::…. ::authorN:::title Your task is to compute how many times every term occurs across titles, for each author. For example, the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2. Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms. The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python. These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively. I strongly recommend mincemeat.py which is much faster than Octo,py even though the latter was covered first in the lecture video as an example. Both are very similar. Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author. Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won’t learn anything about map-reduce. Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值