Hadoop provides a Java MapReduce API: you write a custom mapper and reducer by extending Mapper and Reducer and overriding the map and reduce methods, then build a Job in main, set the mapper and reducer classes, and submit the job. Hadoop also supports MapReduce programs written in C++. There are several ways to run C++ MapReduce on Hadoop; this post covers running a C++ program through hadoop-streaming-xxx.jar.
We need two C++ files, one for the map task and one for the reduce task. Taking word count as the example: the mapper emits a <word, 1> pair for each word it reads from standard input, and the reducer sums the counts for each word and emits <word, sum>.
The source code is given below:
mapper.cpp
#include <iostream>
#include <string>
using namespace std;

int main() {
    string word;
    // Emit "<word>\t1" for every whitespace-separated word on stdin.
    while (cin >> word) {
        cout << word << "\t" << "1" << endl;
    }
    return 0;
}
reducer.cpp
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main() {
    string key, num;
    map<string, int> count;
    map<string, int>::iterator it;
    // Each input line is "<word>\t<count>"; tally the occurrences per word.
    while (cin >> key >> num) {
        it = count.find(key);
        if (it != count.end()) {
            it->second++;
        } else {
            count.insert(make_pair(key, 1));
        }
    }
    // Print the final <word, sum> pairs.
    for (it = count.begin(); it != count.end(); it++) {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}
Compile mapper.cpp and reducer.cpp. (You can also simulate the streaming pipeline locally with "cat file | ./mapper | sort | ./reducer", where sort stands in for the framework's shuffle/sort phase.)
[root@client project]# g++ mapper.cpp -o mapper
[root@client project]# g++ reducer.cpp -o reducer
The prepared input file on HDFS:
[root@client project]# hdfs dfs -cat /user/root/wordcount/input/*
hadoop framework include hdfs and mapreduce
mapreduce is a distributed framework
hdfs is a hadoop distributed file system
Run the MapReduce job through hadoop-streaming-2.7.7.jar:
[root@client project]# hadoop jar /home/software/hadoop-2.7.7/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar -D mapred.job.name="wordcount" -input /user/root/wordcount/input -output /user/root/wordcount/output --mapper ./mapper --reducer ./reducer -file mapper -file reducer
Console output from the run:
[root@client project]# hadoop jar /home/software/hadoop-2.7.7/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar -D mapred.job.name="wordcount" -input /user/root/wordcount/input -output /user/root/wordcount/output --mapper ./mapper --reducer ./reducer -file mapper -file reducer
19/09/21 21:47:03 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper, reducer, /tmp/hadoop-unjar4749452695736673048/] [] /tmp/streamjob4372833674036516407.jar tmpDir=null
19/09/21 21:47:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/21 21:47:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/21 21:47:07 INFO mapred.FileInputFormat: Total input paths to process : 1
19/09/21 21:47:08 INFO mapreduce.JobSubmitter: number of splits:2
19/09/21 21:47:08 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
19/09/21 21:47:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569071641619_0002
19/09/21 21:47:08 INFO impl.YarnClientImpl: Submitted application application_1569071641619_0002
19/09/21 21:47:08 INFO mapreduce.Job: The url to track the job: http://client:8088/proxy/application_1569071641619_0002/
19/09/21 21:47:08 INFO mapreduce.Job: Running job: job_1569071641619_0002
19/09/21 21:47:23 INFO mapreduce.Job: Job job_1569071641619_0002 running in uber mode : false
19/09/21 21:47:23 INFO mapreduce.Job: map 0% reduce 0%
19/09/21 21:48:54 INFO mapreduce.Job: map 100% reduce 0%
19/09/21 21:49:43 INFO mapreduce.Job: map 100% reduce 100%
19/09/21 21:49:48 INFO mapreduce.Job: Job job_1569071641619_0002 completed successfully
19/09/21 21:49:50 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=200
        FILE: Number of bytes written=378513
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=419
        HDFS: Number of bytes written=95
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=198469
        Total time spent by all reduces in occupied slots (ms)=22521
        Total time spent by all map tasks (ms)=198469
        Total time spent by all reduce tasks (ms)=22521
        Total vcore-milliseconds taken by all map tasks=198469
        Total vcore-milliseconds taken by all reduce tasks=22521
        Total megabyte-milliseconds taken by all map tasks=203232256
        Total megabyte-milliseconds taken by all reduce tasks=23061504
    Map-Reduce Framework
        Map input records=3
        Map output records=18
        Map output bytes=158
        Map output materialized bytes=206
        Input split bytes=236
        Combine input records=0
        Combine output records=0
        Reduce input groups=11
        Reduce shuffle bytes=206
        Reduce input records=18
        Reduce output records=11
        Spilled Records=36
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=12487
        CPU time spent (ms)=8250
        Physical memory (bytes) snapshot=412422144
        Virtual memory (bytes) snapshot=6238203904
        Total committed heap usage (bytes)=263348224
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=183
    File Output Format Counters
        Bytes Written=95
19/09/21 21:49:50 INFO streaming.StreamJob: Output directory: /user/root/wordcount/output
View the results:
[root@client project]# hdfs dfs -cat /user/root/wordcount/output/*
a 2
and 1
distributed 2
file 1
framework 2
hadoop 2
hdfs 2
include 1
is 2
mapreduce 2
system 1