Implementing a Simple MapReduce Program in C++

    Hadoop provides a Java API for MapReduce programming: you write your own mapper and reducer by extending Mapper and Reducer and overriding the map and reduce methods, then build a Job in the main method, set the mapper and reducer on it, and submit the job. Hadoop can also run MapReduce programs written in C++, and there are several ways to do so; this article uses hadoop-streaming-xxx.jar, which launches ordinary executables as the mapper and reducer and exchanges records with them through standard input and standard output.

    We need to write two C++ programs, one for the map task and one for the reduce task. Taking word count as an example, the mapper emits a <word, 1> pair for every word it reads, and the reducer adds up the counts for each word and outputs <word, sum>.

    The source code is given below:

    mapper.cpp

#include <iostream>
#include <string>
using namespace std;

int main() {
  string word;
  // cin >> word splits the input on any whitespace; emit a
  // tab-separated <word, 1> pair per word, one per line.
  while (cin >> word) {
    cout << word << "\t" << "1" << endl;
  }
  return 0;
}

    reducer.cpp

#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main() {
  string key, num;
  // The shuffle phase feeds the reducer tab-separated
  // <word, count> lines, sorted by key, on standard input.
  map<string, int> count;
  while (cin >> key >> num) {
    // Add the emitted value instead of a constant 1, so the reducer
    // still works if a combiner has already pre-aggregated counts.
    count[key] += atoi(num.c_str());
  }
  for (map<string, int>::iterator it = count.begin(); it != count.end(); ++it) {
    cout << it->first << "\t" << it->second << endl;
  }
  return 0;
}
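
    Because the shuffle phase delivers the reducer's input sorted by key, an alternative reducer (a sketch, not part of the original program) can aggregate on the fly and keep only the current word in memory, instead of buffering every distinct word in a std::map; this matters when the number of distinct keys is large. The rest of this article uses the map-based version above.

#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

// Streaming-style reducer: relies on the input being sorted by key,
// so a change of key means the previous word is complete and its
// total can be emitted immediately.
int main() {
  string key, num, prev;
  long long sum = 0;
  while (cin >> key >> num) {
    if (!prev.empty() && key != prev) {
      cout << prev << "\t" << sum << endl;
      sum = 0;
    }
    prev = key;
    sum += atoi(num.c_str());
  }
  if (!prev.empty()) {
    cout << prev << "\t" << sum << endl;
  }
  return 0;
}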

    Compile mapper.cpp and reducer.cpp:

[root@client project]# g++ mapper.cpp -o mapper
[root@client project]# g++ reducer.cpp -o reducer
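
    Before submitting the job to the cluster, it is worth sanity-checking the two executables locally. Piping the input through the mapper, then sort (which stands in for the shuffle-and-sort phase), then the reducer simulates the whole pipeline on one machine; test.txt here is just an assumed local copy of the input data:

[root@client project]# cat test.txt | ./mapper | sort | ./reducer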

    The input file prepared on HDFS:

[root@client project]# hdfs dfs -cat /user/root/wordcount/input/*
hadoop framework include hdfs and mapreduce
mapreduce is a distributed framework
hdfs is a hadoop distributed file system

    Run the MapReduce job via hadoop-streaming-2.7.7.jar (note that generic options such as -D must come before the streaming-specific options):

[root@client project]# hadoop jar /home/software/hadoop-2.7.7/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar -D mapred.job.name="wordcount" -input /user/root/wordcount/input -output /user/root/wordcount/output --mapper ./mapper --reducer ./reducer -file mapper -file reducer
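
    As the log below points out, the -file option is deprecated in favor of the generic -files option. An equivalent invocation (a sketch following that hint, not verified here) would be:

hadoop jar /home/software/hadoop-2.7.7/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar \
    -D mapred.job.name="wordcount" \
    -files mapper,reducer \
    -input /user/root/wordcount/input \
    -output /user/root/wordcount/output \
    -mapper ./mapper \
    -reducer ./reducer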

    The run prints the following:

[root@client project]# hadoop jar /home/software/hadoop-2.7.7/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar -D mapred.job.name="wordcount" -input /user/root/wordcount/input -output /user/root/wordcount/output --mapper ./mapper --reducer ./reducer -file mapper -file reducer
19/09/21 21:47:03 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper, reducer, /tmp/hadoop-unjar4749452695736673048/] [] /tmp/streamjob4372833674036516407.jar tmpDir=null
19/09/21 21:47:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/21 21:47:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/21 21:47:07 INFO mapred.FileInputFormat: Total input paths to process : 1
19/09/21 21:47:08 INFO mapreduce.JobSubmitter: number of splits:2
19/09/21 21:47:08 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
19/09/21 21:47:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569071641619_0002
19/09/21 21:47:08 INFO impl.YarnClientImpl: Submitted application application_1569071641619_0002
19/09/21 21:47:08 INFO mapreduce.Job: The url to track the job: http://client:8088/proxy/application_1569071641619_0002/
19/09/21 21:47:08 INFO mapreduce.Job: Running job: job_1569071641619_0002
19/09/21 21:47:23 INFO mapreduce.Job: Job job_1569071641619_0002 running in uber mode : false
19/09/21 21:47:23 INFO mapreduce.Job:  map 0% reduce 0%
19/09/21 21:48:54 INFO mapreduce.Job:  map 100% reduce 0%
19/09/21 21:49:43 INFO mapreduce.Job:  map 100% reduce 100%
19/09/21 21:49:48 INFO mapreduce.Job: Job job_1569071641619_0002 completed successfully
19/09/21 21:49:50 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=200
		FILE: Number of bytes written=378513
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=419
		HDFS: Number of bytes written=95
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=198469
		Total time spent by all reduces in occupied slots (ms)=22521
		Total time spent by all map tasks (ms)=198469
		Total time spent by all reduce tasks (ms)=22521
		Total vcore-milliseconds taken by all map tasks=198469
		Total vcore-milliseconds taken by all reduce tasks=22521
		Total megabyte-milliseconds taken by all map tasks=203232256
		Total megabyte-milliseconds taken by all reduce tasks=23061504
	Map-Reduce Framework
		Map input records=3
		Map output records=18
		Map output bytes=158
		Map output materialized bytes=206
		Input split bytes=236
		Combine input records=0
		Combine output records=0
		Reduce input groups=11
		Reduce shuffle bytes=206
		Reduce input records=18
		Reduce output records=11
		Spilled Records=36
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=12487
		CPU time spent (ms)=8250
		Physical memory (bytes) snapshot=412422144
		Virtual memory (bytes) snapshot=6238203904
		Total committed heap usage (bytes)=263348224
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=183
	File Output Format Counters 
		Bytes Written=95
19/09/21 21:49:50 INFO streaming.StreamJob: Output directory: /user/root/wordcount/output

    View the result; each word appears with its total count, sorted by key:

[root@client project]# hdfs dfs -cat /user/root/wordcount/output/*
a	2
and	1
distributed	2
file	1
framework	2
hadoop	2
hdfs	2
include	1
is	2
mapreduce	2
system	1

 
