mapper.cpp:
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string word;
    // operator>> splits stdin on whitespace, so each token is one word
    while (cin >> word)
    {
        cout << word << "\t" << 1 << endl;
    }
    return 0;
}
reduce.cpp:
#include <map>
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string key;
    string value;
    map<string, int> word_count;
    map<string, int>::iterator it;
    // the mapper emits "word\t1"; operator>> treats the tab as
    // whitespace, so each iteration consumes one key and its count
    while (cin >> key)
    {
        cin >> value;
        it = word_count.find(key);
        if (it != word_count.end())
        {
            ++(it->second);
        }
        else
        {
            word_count.insert(make_pair(key, 1));
        }
    }
    for (it = word_count.begin(); it != word_count.end(); ++it)
    {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}
Test input files:

file_1:
hello hadoop helloworld

file_2:
fu zu xian hadoop
The runjob.sh script:
#!/bin/bash
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -file map -file reduce \
    -input /data/demon/wordcount_c -output /data/demon/wordcount_c/output \
    -mapper /home/hadoop/project/wordcount_c/wordcount_c++/map \
    -reducer /home/hadoop/project/wordcount_c/wordcount_c++/reduce

Make the script executable:
chmod +x runjob.sh
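The two errors documented below (-reduce vs. -reducer, and a pre-existing output directory) suggest a slightly more defensive version of the script. A sketch using the non-deprecated -files generic option and clearing the output path first (paths as in the original; if the directory is absent, the -rm simply reports an error and the script continues):

```shell
#!/bin/bash
# remove stale output first: MapReduce refuses to overwrite it
/usr/local/hadoop/bin/hadoop fs -rm -r /data/demon/wordcount_c/output

# -files (a generic option, so it must come before the streaming
# options) replaces the deprecated -file; shipped files land in the
# task's working directory, so relative names suffice
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -files map,reduce \
    -input /data/demon/wordcount_c \
    -output /data/demon/wordcount_c/output \
    -mapper ./map \
    -reducer ./reduce
```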
The job ran successfully; the results:
hadoop@Master:~$ hadoop fs -ls /data/demon/wordcount_c/output
Found 2 items
-rw-r--r-- 2 hadoop supergroup 0 2018-03-22 16:47 /data/demon/wordcount_c/output/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 47 2018-03-22 16:47 /data/demon/wordcount_c/output/part-00000
hadoop@Master:~$ hadoop fs -cat /data/demon/wordcount_c/output/part-00000
fu 1
hadoop 2
hello 1
helloworld 1
xian 1
zu 1
Reference:
Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop
Errors encountered along the way:
The first run failed:
hadoop@Master:~/project/wordcount_c/wordcount_c++$ ./runjob.sh
18/03/22 16:12:58 ERROR streaming.StreamJob: Unrecognized option: -reduce
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step.
-output <path> DFS output directory for the Reduce step.
-mapper <cmd|JavaClassName> Optional. Command to be run as mapper.
-combiner <cmd|JavaClassName> Optional. Command to be run as combiner.
-reducer <cmd|JavaClassName> Optional. Command to be run as reducer.
-file <file> Optional. File/dir to be shipped in the Job jar file.
Deprecated. Use generic option "-files" instead.
-inputformat <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
Optional. The input format class.
-outputformat <TextOutputFormat(default)|JavaClassName>
Optional. The output format class.
-partitioner <JavaClassName> Optional. The partitioner class.
-numReduceTasks <num> Optional. Number of reduce tasks.
-inputreader <spec> Optional. Input recordreader spec.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands.
-mapdebug <cmd> Optional. To run this script when a map task fails.
-reducedebug <cmd> Optional. To run this script when a reduce task fails.
-io <identifier> Optional. Format to use for input to and output
from mapper/reducer commands
-lazyOutput Optional. Lazily create Output.
-background Optional. Submit the job and don't wait till it completes.
-verbose Optional. Print verbose output.
-info Optional. Print detailed usage.
-help Optional. Print help message.
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For more details about these options:
Use $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar -info
Try -help for more information
Streaming Command Failed!
hadoop@Master:~/project/wordcount_c/wordcount_c++$
A maddening error... the script had passed -reduce, but as the usage listing shows, the option is -reducer.
After fixing that, it still failed:
hadoop@Master:~/project/wordcount_c/wordcount_c++$ ./runjob.sh
18/03/22 16:16:19 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [map, reduce, /tmp/hadoop-unjar2530400161100689843/] [] /tmp/streamjob7826754138722798286.jar tmpDir=null
18/03/22 16:16:20 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:16:20 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:16:20 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://Master:9000/data/wordcout already exists
Streaming Command Failed!
hadoop@Master:~/project/wordcount_c/wordcount_c++$
Deleted the old directory, created a new one, and re-uploaded the input files (the paths in the script had to be updated accordingly):
hadoop@Master:~$ hadoop fs -mkdir /data/demon/wordcount_c
hadoop@Master:~$ hadoop fs -put /home/hadoop/project/wordcount_c/wordcount_c++/file_2 /data/demon/wordcount_c
hadoop@Master:~$ hadoop fs -put /home/hadoop/project/wordcount_c/wordcount_c++/file_1 /data/demon/wordcount_c
An unnecessary step :)
hadoop@Master:~$ hadoop fs -mkdir /data/demon/wordcount_c/output
This directory must not exist beforehand and should not be created — the job creates it itself, and fails if it is already there...
hadoop@Master:~/project/wordcount_c/wordcount_c++$ ./runjob.sh
18/03/22 16:42:43 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [map, reduce, /tmp/hadoop-unjar217718645459140512/] [] /tmp/streamjob6276961858876481540.jar tmpDir=null
18/03/22 16:42:44 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:42:44 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.1.110:8032
18/03/22 16:42:44 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://Master:9000/data/demon/wordcount_c/output already exists
Streaming Command Failed!
hadoop@Master:~/project/wordcount_c/wordcount_c++$
Delete the /data/demon/wordcount_c/output directory and re-run — after that the job succeeds, producing the output shown above.
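The deletion itself is a single command:

```shell
# remove only the stale output; the job recreates this directory,
# while the input files under /data/demon/wordcount_c are kept
hadoop fs -rm -r /data/demon/wordcount_c/output
```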