The Files
You need three files to run the wordcount example:
- a C++ file containing the map and reduce functions,
- a data file containing some text, such as Ulysses, and
- a Makefile to compile the C++ file.
wordcount.cpp
- The wordcount program is shown below
- It contains two classes, one for the map, one for the reduce
- It makes use of several Hadoop classes, one of which, StringUtils, contains useful methods for converting between strings and other types.
#include <algorithm>
#include <limits>
#include <string>
#include <vector>
#include "stdint.h"  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word, "1")
  // to the reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words = HadoopUtils::splitString( line, " " );

    //--- emit each tuple (word, "1") ---
    for ( unsigned int i = 0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer( HadoopPipes::TaskContext& context ) {
  }

  // reduce function: receives all the "1" values emitted
  // for one word, and adds them up
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their number ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit( context.getInputKey(), HadoopUtils::toString( count ) );
  }
};

int main( int argc, char *argv[] ) {
  return HadoopPipes::runTask( HadoopPipes::TemplateFactory<
                               WordCountMapper,
                               WordCountReducer >() );
}
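- In main(), the TemplateFactory registers the mapper and reducer classes with the framework, and runTask() hands control to Hadoop. The factory also has optional template slots for a partitioner and a combiner. The variant of main() below is a hypothetical sketch (not part of the original program) that reuses the reducer as a combiner to pre-sum counts on the map side; it assumes the partitioner slot comes third and the combiner slot fourth, as in TemplateFactory.hh:

int main( int argc, char *argv[] ) {
  // hypothetical sketch: void fills the unused partitioner slot,
  // and WordCountReducer doubles as the combiner
  return HadoopPipes::runTask( HadoopPipes::TemplateFactory<
                               WordCountMapper,
                               WordCountReducer,
                               void,
                               WordCountReducer >() );
}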
Makefile
Before you create the Makefile, you need to figure out whether your computer has a 32-bit or a 64-bit processor, so that you can pick the matching Hadoop Pipes library. To find out, run the following command:
uname -a
To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need the Linux-i386-32 library. An architecture string containing 64, such as x86_64, indicates a 64-bit machine, for which you use the Linux-amd64-64 library instead.
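For comparison, a 64-bit machine answers with a line containing x86_64, along these lines (host name and kernel details here are hypothetical):

Linux hadoop7 ... x86_64 GNU/Linux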
Once you have this information, create the Makefile (make sure to spell it with an uppercase M, and note that the command line under the wordcount target must start with a tab):
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@
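- If uname reported a 64-bit machine instead, only the platform and word-size settings change; a minimal sketch, assuming the same Hadoop installation path as above:

PLATFORM = Linux-amd64-64
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include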
- Note: Users have reported that in some cases the command above returns errors, and that adding -lssl to the list of libraries gets rid of them. Thanks for the tip! -- D. Thiebaut 08:16, 25 February 2011 (EST)
Data File
- We'll assume that you already have some large text files in HDFS, in a directory called dft1. If you don't, the commands below show one way to upload one.
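- For example, to create the directory and upload a local text file (ulysses.txt is a hypothetical file name; any large text file will do):

hadoop dfs -mkdir dft1
hadoop dfs -put ulysses.txt dft1/ulysses.txt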
Compiling and Running
- You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed by typing g++ at the prompt. If it is not installed yet, install it:
sudo apt-get install g++
- Compile the code:
make wordcount
- and fix any errors you get.
- Copy the executable file (wordcount) to the bin directory in HDFS:
hadoop dfs -mkdir bin        (Note: it should already exist!)
hadoop dfs -put wordcount bin/wordcount
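- To double-check that the executable made it into HDFS, list the bin directory:

hadoop dfs -ls bin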
- Run the program! The two -D options tell Hadoop to use its standard Java classes for reading and writing records, which is what Pipes programs normally rely on:
hadoop pipes -D hadoop.pipes.java.recordreader=true \
       -D hadoop.pipes.java.recordwriter=true \
       -input dft1 -output dft1-out \
       -program bin/wordcount
- Verify that you have gotten the right output:
hadoop dfs -text dft1-out/part-00000

"Come          1
"Defects,"     1
"I             1
"Information   1
"J"            1
"Plain         2
...
zodiacal       2
zoe)_          1
zones:         1
zoo.           1
zoological     1
zouave's       1
zrads,         2
zrads.         1
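- If you want the complete result file on the local file system, -get copies it out of HDFS (wordcount-output.txt is a name chosen for this example):

hadoop dfs -get dft1-out/part-00000 wordcount-output.txt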