mapper.cpp:
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string word;
    // operator>> splits stdin on whitespace, so each token is one word
    while (cin >> word)
    {
        cout << word << "\t" << 1 << endl;
    }
    return 0;
}
reduce.cpp:
#include <map>
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string key;
    string value;
    map<string, int> word_count;
    map<string, int>::iterator it;
    // the mapper emits "word\t1"; operator>> treats the tab as
    // whitespace, so each iteration consumes one key and its count
    while (cin >> key)
    {
        cin >> value;
        it = word_count.find(key);
        if (it != word_count.end())
        {
            ++(it->second);
        }
        else
        {
            word_count.insert(make_pair(key, 1));
        }
    }
    for (it = word_count.begin(); it != word_count.end(); ++it)
    {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}
Test input files:

file_1:
hello hadoop helloworld

file_2:
fu zu xian hadoop
The runjob.sh script:
#!/bin/bash
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -file map -file reduce \
    -input /data/demon/wordcount_c -output /data/demon/wordcount_c/output \
    -mapper /home/hadoop/project/wordcount_c/wordcount_c++/map \
    -reducer /home/hadoop/project/wordcount_c/wordcount_c++/reduce

Make the script executable:
chmod +x runjob.sh
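The two errors documented below (-reduce vs. -reducer, and a pre-existing output directory) suggest a slightly more defensive version of the script. A sketch using the non-deprecated -files generic option and clearing the output path first (paths as in the original; if the directory is absent, the -rm simply reports an error and the script continues):

```shell
#!/bin/bash
# remove stale output first: MapReduce refuses to overwrite it
/usr/local/hadoop/bin/hadoop fs -rm -r /data/demon/wordcount_c/output

# -files (a generic option, so it must come before the streaming
# options) replaces the deprecated -file; shipped files land in the
# task's working directory, so relative names suffice
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -files map,reduce \
    -input /data/demon/wordcount_c \
    -output /data/demon/wordcount_c/output \
    -mapper ./map \
    -reducer ./reduce
```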
The job ran successfully; the results:
hadoop@Master:~$ hadoop fs -ls /data/demon/wordcount_c/output
Found 2 items
-rw-r--r-- 2 hadoop supergroup 0 2018-03-22 16:47 /data/demon/wordcount_c/output/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 47 2018-03-22 16:47 /data/demon/wordcount_c/output/part-00000
hadoop@Master:~$ hadoop fs -cat /data/demon/wordcount_c/output/part-00000
fu 1
hadoop 2
hello 1
helloworld 1
xian 1
zu 1
Reference:
Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop
Errors encountered along the way:
The first run failed:
hadoop@Master:~/project/wordcount_c/wordcount_c++$ ./runjob.sh
18/03/22 16:12:58 ERROR streaming.StreamJob: Unrecognized option: -reduce
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step.
-output <path> DFS output directory for the Reduce step.
-mapper <cmd|JavaClassName> Optional. Command to be run as mapper.
-combiner <cmd|JavaClassName> Optional. Command to be run as combiner.
-reducer <cmd|JavaClassName> Optional. Command to be run as reducer.
-file <file> Optional. File/dir to be shipped in the Job jar file.
Deprecated. Use generic option "-files" instead.
-inputformat <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
Optional. The input format class.
-outputformat <TextOutputFormat(default)|JavaClassName>
Optional. The output format class.
-partitioner <JavaClassName> Optional. The partitioner class.
-numReduceTasks <num> Optional. Number of reduce tasks.
-inputreader <spec> Optional. Input recordreader spec.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands.
-mapdebug <cmd> Optional. To run this script when a map task fails.
-reducedebug <cmd> Optional. To run this script when a reduce task fails.
-io <identifier> Optional. Format to use for input to and output
from mapper/reducer commands
-lazyOutput Optional. Lazily create Output.
-background Optional. Submit the job and don't wait till it completes.
-verbose Optional. Print verbose output.
-info Optional. Print detailed usage.
-help Optional. Print help message.
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For more details about these options:
Use $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar -info
Try -help for more information
Streaming Command Failed!
hadoop@Master:~/project/wordcount_c/wordcount_c++$
A maddening error... the script had passed -reduce, but as the usage listing shows, the option is -reducer.
After fixing that, it still failed:
hadoop@Master:~/project/wordcount_c/wordcount_c++$ ./runjob.sh
18/03/22 16:16:19 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [map, reduce, /tmp/hadoop-unjar2530400161100689843/] [] /tmp/streamjob7826754138722798286.jar tmpDir=null
18/03/22 16:16:20 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:16:20 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:16:20 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://Master:9000/data/wordcout already exists
Streaming Command Failed!
hadoop@Master:~/project/wordcount_c/wordcount_c++$
Deleted the old directory, created a new one, and re-uploaded the input files (the paths in the script had to be updated accordingly):
hadoop@Master:~$ hadoop fs -mkdir /data/demon/wordcount_c
hadoop@Master:~$ hadoop fs -put /home/hadoop/project/wordcount_c/wordcount_c++/file_2 /data/demon/wordcount_c
hadoop@Master:~$ hadoop fs -put /home/hadoop/project/wordcount_c/wordcount_c++/file_1 /data/demon/wordcount_c
An unnecessary step :)
hadoop@Master:~$ hadoop fs -mkdir /data/demon/wordcount_c/output
This directory must not exist beforehand and should not be created — the job creates it itself, and fails if it is already there...
hadoop@Master:~/project/wordcount_c/wordcount_c++$ ./runjob.sh
18/03/22 16:42:43 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [map, reduce, /tmp/hadoop-unjar217718645459140512/] [] /tmp/streamjob6276961858876481540.jar tmpDir=null
18/03/22 16:42:44 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:42:44 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:42:44 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://Master:9000/data/demon/wordcount_c/output already exists
Streaming Command Failed!
hadoop@Master:~/project/wordcount_c/wordcount_c++$
Delete the /data/demon/wordcount_c/output directory and re-run — after that the job succeeds, producing the output shown above.
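The deletion itself is a single command:

```shell
# remove only the stale output; the job recreates this directory,
# while the input files under /data/demon/wordcount_c are kept
hadoop fs -rm -r /data/demon/wordcount_c/output
```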