A Simplified Distributed WordCount Program

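This program applies the MapReduce (Mapper/Reducer) idea to count the words in strings entered in real time. The model is shown below (I drew the figure myself with a drawing tool):
[Figure: Mapper/Reducer model diagram]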

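Introduction:
The Driver is the interface through which a user invokes the MapReduce mechanism: the user builds a job and submits it for distributed processing.
The Mapper splits the job into pieces; the Reducer computes on those pieces and aggregates the results.
Program notes: the program has three components: Driver, Mapper, and Reducer. Each component is an Actor, and the Actors talk to each other through inter-process communication to simulate distributed communication. The code is split into five files: one for each of the three components, a configuration file conf that supplies each Actor's IP and port, and a file Task that holds the job tasks passed among the three components.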

Code
Task:

// MapperTask: the input is line, one line of data (a String); Mfunc turns that line into a list of (word, count) tuples
case class MapperTask(line: String, Mfunc: String => List[(String, Int)])
// ReducerTask: carries one tuple (key-value pair) from the Mapper to be counted; rfunc adds the pair's value to the running total
case class ReducerTask(kv: (String, Int), rfunc: (Int, Int) => Int)
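To see what these two messages carry, the sketch below rebuilds them locally and applies their functions directly, with no actors involved. The object name `TaskDemo` and the helper `countLine` are made up for this illustration; the function bodies are the same ones the Driver and Mapper pass in later.

```scala
// Local sketch: the two task messages, exercised without Akka.
// TaskDemo and countLine are illustrative names, not part of the original program.
object TaskDemo {
  case class MapperTask(line: String, Mfunc: String => List[(String, Int)])
  case class ReducerTask(kv: (String, Int), rfunc: (Int, Int) => Int)

  // Apply the map function, wrap every tuple in a ReducerTask,
  // then fold the ReducerTasks into per-word totals.
  def countLine(task: MapperTask): Map[String, Int] = {
    val reducerTasks = task.Mfunc(task.line).map(kv => ReducerTask(kv, (a, b) => a + b))
    reducerTasks.foldLeft(Map.empty[String, Int]) {
      case (acc, ReducerTask((key, value), rfunc)) =>
        acc.updated(key, rfunc(acc.getOrElse(key, 0), value))
    }
  }

  def main(args: Array[String]): Unit = {
    val task = MapperTask("hello spark hello", line => line.split(" ").toList.map(x => (x, 1)))
    println(countLine(task))
  }
}
```

In the actor version the same two steps happen in different processes: the Mapper runs `Mfunc`, and the Reducer runs `rfunc` once per tuple.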

Driver file:
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory
import akka.actor.Props
import akka.actor.Actor
import java.util.Scanner

// The Driver actor simply prints any String message it receives
class Driver extends Actor {
  override def receive = {
    case msg: String => println(msg)
  }
}


conf file

trait conf {
  // Builds the Akka remoting settings for the given hostname and port
  def getconf(hostname: String, Port: String) = {
    val list = new java.util.ArrayList[String]()
    list.add("akka.remote.netty.tcp")
    val conf = new java.util.HashMap[String, Object]()
    conf.put("akka.actor.provider", "akka.remote.RemoteActorRefProvider")
    conf.put("akka.remote.enabled-transports", list)
    conf.put("akka.remote.netty.tcp.hostname", hostname)
    conf.put("akka.remote.netty.tcp.port", Port)
    conf
  }
}

Driver file (continued)

object Driver extends conf {
  def main(args: Array[String]): Unit = {
    // Set up the actor system
    val sys = ActorSystem("Driver", ConfigFactory.parseMap(getconf("127.0.0.1", "3000")))
    // Register the Driver actor
    sys.actorOf(Props[Driver])
    val sc = new Scanner(System.in) // read input from the keyboard
    // Connect to the Mapper
    val map = sys.actorSelection("akka.tcp://Mapper@127.0.0.1:3001/user/mapper")
    while (sc.hasNextLine()) {
      val line = sc.nextLine()
      // Build a MapperTask with the case class constructor from the Task file.
      // The second argument implements Mfunc: split the line on spaces,
      // convert the result to a List, then turn each word into a (word, 1) tuple.
      val mapperTask = MapperTask(line, (line: String) => line.split(" ").toList.map(x => (x, 1)))
      map ! mapperTask // submit the job
    }
  }
}
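One detail of the map function above: `String.split(" ")` produces empty tokens when a line has leading or repeated spaces, and those empty strings would then be counted as words. A possible more tolerant variant is sketched below, assuming that whitespace-separated words are what should be counted. The names `Tokenize`, `naive`, and `robust` are made up for this illustration.

```scala
// Sketch: the Driver's map function next to a whitespace-tolerant variant.
// Tokenize, naive and robust are illustrative names only.
object Tokenize {
  // The map function as written in the Driver.
  val naive: String => List[(String, Int)] =
    line => line.split(" ").toList.map(x => (x, 1))

  // Variant: trim the line, split on runs of whitespace, drop empty tokens.
  val robust: String => List[(String, Int)] =
    line => line.trim.split("\\s+").toList.filter(_.nonEmpty).map(x => (x, 1))

  def main(args: Array[String]): Unit = {
    println(naive("hello  world"))  // contains an empty-string "word"
    println(robust("hello  world")) // only the two real words
  }
}
```

Either function has the type `MapperTask` expects, so `robust` could be passed in place of the inline lambda without changing the Mapper.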

Mapper file:

import akka.actor.Actor
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory
import akka.actor.Props
import akka.actor.ActorSelection

class Mapper extends Actor {
  override def receive = {
    case MapperTask(line, mfunc) => {
      val result = mfunc(line) // apply mfunc, which was defined in the Driver; result is a List of tuples
      // Forward each (word, 1) tuple to the Reducer, which accumulates the values per key
      for (r <- result) {
        println(r)
        // ReducerTask is handled on the Reducer side; (a, b) => a + b implements rfunc
        Mapper.getreducer() ! ReducerTask(r, (a, b) => a + b)
      }
    }
  }
}
object Mapper extends conf {
  def getreducer() = {
    reducer
  }
  var reducer: ActorSelection = null
  def main(args: Array[String]): Unit = {
    // Set up the actor system
    val sys = ActorSystem("Mapper", ConfigFactory.parseMap(getconf("127.0.0.1", "3001")))
    // Register the actor; the Driver connects to the Mapper, so the Mapper needs a name
    sys.actorOf(Props[Mapper], "mapper")
    // Mapper address: akka.tcp://Mapper@127.0.0.1:3001/user/mapper
    // Connect the Mapper to the Reducer
    reducer = sys.actorSelection("akka.tcp://Reducer@127.0.0.1:3002/user/reducer")
  }
}

Reducer file:

import akka.actor.Actor
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory
import akka.actor.Props

class Reducer extends Actor {
  import scala.collection.mutable.Map
  val combine = Map[String, Int]()
  override def receive = {
    // Handle a ReducerTask
    case ReducerTask((key, value), rfunc) => {
      // getOrElse returns the value stored for key if combine contains it, otherwise 0
      val v = combine.getOrElse(key, 0)
      combine += key -> rfunc(v, value) // call rfunc to add to the running total
      println(combine)
    }
  }
}

object Reducer extends conf {
  def main(args: Array[String]): Unit = {
    // Set up the actor system
    val sys = ActorSystem("Reducer", ConfigFactory.parseMap(getconf("127.0.0.1", "3002")))
    // Register the actor under the name "reducer"
    sys.actorOf(Props[Reducer], "reducer")
    // Reducer address: akka.tcp://Reducer@127.0.0.1:3002/user/reducer
  }
}

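Start the Reducer first, then the Mapper, and finally the Driver.

After startup, type a string such as "hello world" into the Driver console. The Mapper console shows that the Mapper has split the string into words and emitted a (word, 1) tuple for each, and the Reducer console shows that the Reducer has counted the words.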

// The string hello world is entered in the Driver console
[INFO] [04/08/2018 14:53:20.805] [main] [Remoting] Starting remoting
[INFO] [04/08/2018 14:53:21.337] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Driver@127.0.0.1:3000]
[INFO] [04/08/2018 14:53:21.337] [main] [Remoting] Remoting now listens on addresses: [akka.tcp://Driver@127.0.0.1:3000]
hello world
// The Mapper splits it
[INFO] [04/08/2018 14:53:09.877] [main] [Remoting] Starting remoting
[INFO] [04/08/2018 14:53:10.487] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Mapper@127.0.0.1:3001]
[INFO] [04/08/2018 14:53:10.487] [main] [Remoting] Remoting now listens on addresses: [akka.tcp://Mapper@127.0.0.1:3001]
(hello,1)
(world,1)
// The Reducer counts it
[INFO] [04/08/2018 14:52:50.527] [main] [Remoting] Starting remoting
[INFO] [04/08/2018 14:52:51.262] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Reducer@127.0.0.1:3002]
[INFO] [04/08/2018 14:52:51.262] [main] [Remoting] Remoting now listens on addresses: [akka.tcp://Reducer@127.0.0.1:3002]
Map(hello -> 1)
Map(world -> 1, hello -> 1)

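If you then enter another string in the Driver console, such as hello spark hello Hadoop, and look at the Mapper and Reducer consoles again, you can see that both have split and counted the new input in real time.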

[INFO] [04/08/2018 14:53:20.805] [main] [Remoting] Starting remoting
[INFO] [04/08/2018 14:53:21.337] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Driver@127.0.0.1:3000]
[INFO] [04/08/2018 14:53:21.337] [main] [Remoting] Remoting now listens on addresses: [akka.tcp://Driver@127.0.0.1:3000]
hello world
hello spark hello Hadoop      // a second string is entered in the Driver console
[INFO] [04/08/2018 14:53:09.877] [main] [Remoting] Starting remoting
[INFO] [04/08/2018 14:53:10.487] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Mapper@127.0.0.1:3001]
[INFO] [04/08/2018 14:53:10.487] [main] [Remoting] Remoting now listens on addresses: [akka.tcp://Mapper@127.0.0.1:3001]
(hello,1)
(world,1)
(hello,1)        // the Mapper splits the new line
(spark,1)
(hello,1)
(Hadoop,1)
[INFO] [04/08/2018 14:52:50.527] [main] [Remoting] Starting remoting
[INFO] [04/08/2018 14:52:51.262] [main] [Remoting] Remoting started; listening on addresses :[akka.tcp://Reducer@127.0.0.1:3002]
[INFO] [04/08/2018 14:52:51.262] [main] [Remoting] Remoting now listens on addresses: [akka.tcp://Reducer@127.0.0.1:3002]
Map(hello -> 1)
Map(world -> 1, hello -> 1)
Map(world -> 1, hello -> 2)          // the Reducer updates its counts
Map(spark -> 1, world -> 1, hello -> 2)
Map(spark -> 1, world -> 1, hello -> 3)
Map(spark -> 1, Hadoop -> 1, world -> 1, hello -> 3)
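
The Reducer output above can be cross-checked without starting any actors. The sketch below replays both input lines through the same split and the same getOrElse-then-add update, recording the counts after every tuple, just as the Reducer prints its map after every message. `ReplayDemo` and `replay` are hypothetical names, and an immutable Map stands in for the Reducer's mutable one, so the printed key order may differ from the log.

```scala
// Sketch: replay the two console inputs locally and record each intermediate state.
// ReplayDemo and replay are illustrative names, not part of the original program.
object ReplayDemo {
  // Returns the state of the counts after each (word, 1) tuple is applied.
  def replay(lines: List[String]): List[Map[String, Int]] = {
    val tuples = lines.flatMap(line => line.split(" ").toList.map(x => (x, 1)))
    tuples.scanLeft(Map.empty[String, Int]) {
      case (acc, (key, value)) => acc.updated(key, acc.getOrElse(key, 0) + value)
    }.tail // drop the initial empty map
  }

  def main(args: Array[String]): Unit =
    replay(List("hello world", "hello spark hello Hadoop")).foreach(println)
}
```

Six tuples go in, so six states come out, matching the six Map lines in the Reducer console.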

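Things to improve in this program:

  • There should be multiple Mappers, to increase parallelism
  • There should be multiple Reducers as well, in part to increase parallelism
  • Mapper output with the same key should be sent to the same Reducer
  • The Reducer's function should be defined on the Driver side
  • There is no Master role responsible for allocating resources across the cluster and scheduling tasks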
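
For the third point, one common way to route equal keys to the same Reducer is to hash the key modulo the number of Reducers. The following is a minimal sketch of that idea, assuming a fixed and known number of Reducers; `Partitioner`, `reducerIndex`, and `partition` are hypothetical names and are not part of the program above.

```scala
// Sketch of hash partitioning for map output.
// Partitioner, reducerIndex and partition are illustrative names only.
object Partitioner {
  // floorMod keeps the index non-negative even when hashCode is negative.
  def reducerIndex(key: String, numReducers: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numReducers)

  // Group the Mapper's tuples by the index of the Reducer that should receive them.
  def partition(kvs: List[(String, Int)], numReducers: Int): Map[Int, List[(String, Int)]] =
    kvs.groupBy { case (key, _) => reducerIndex(key, numReducers) }

  def main(args: Array[String]): Unit = {
    val kvs = "hello spark hello Hadoop".split(" ").toList.map(x => (x, 1))
    partition(kvs, 3).foreach(println)
  }
}
```

In an actor version, the Mapper might hold one ActorSelection per Reducer and pick the target with `reducerIndex` before sending each ReducerTask; that wiring is left out here.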

Reposted from: https://my.oschina.net/u/3801367/blog/1791205
