The Files
You need three files to run the wordcount example:
- a C++ file containing the map and reduce functions,
- a data file containing some text, such as Ulysses, and
- a Makefile to compile the C++ file.
wordcount.cpp
- The wordcount program is shown below
- It contains two classes, one for the map, one for the reduce
- It makes use of several Hadoop classes, one of which, StringUtils, contains useful methods for converting between strings and other types.
#include <algorithm>
#include <limits>
#include <string>
#include <vector>
#include "stdint.h"  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word, "1")
  // to the reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words = HadoopUtils::splitString( line, " " );

    //--- emit each tuple (word, "1") ---
    for ( unsigned int i = 0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer( HadoopPipes::TaskContext& context ) {
  }

  // reduce function: receives all the "1" values emitted
  // for one word, and adds them up
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their number ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit( context.getInputKey(), HadoopUtils::toString( count ) );
  }
};

int main( int argc, char *argv[] ) {
  return HadoopPipes::runTask( HadoopPipes::TemplateFactory<
                               WordCountMapper,
                               WordCountReducer >() );
}
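- In main(), the TemplateFactory registers the mapper and reducer classes with the framework, and runTask() hands control to Hadoop. The factory also has optional template slots for a partitioner and a combiner. The variant of main() below is a hypothetical sketch (not part of the original program) that reuses the reducer as a combiner to pre-sum counts on the map side; it assumes the partitioner slot comes third and the combiner slot fourth, as in TemplateFactory.hh:

int main( int argc, char *argv[] ) {
  // hypothetical sketch: void fills the unused partitioner slot,
  // and WordCountReducer doubles as the combiner
  return HadoopPipes::runTask( HadoopPipes::TemplateFactory<
                               WordCountMapper,
                               WordCountReducer,
                               void,
                               WordCountReducer >() );
}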
Makefile
Before you create the Makefile, you need to figure out whether your computer has a 32-bit or a 64-bit processor, so that you can pick the matching Hadoop Pipes library. To find out, run the following command:
uname -a
To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need the Linux-i386-32 library. An architecture string containing 64, such as x86_64, indicates a 64-bit machine, for which you use the Linux-amd64-64 library instead.
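For comparison, a 64-bit machine answers with a line containing x86_64, along these lines (host name and kernel details here are hypothetical):

Linux hadoop7 ... x86_64 GNU/Linux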
Once you have this information, create the Makefile (make sure to spell it with an uppercase M, and note that the command line under the wordcount target must start with a tab):
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@
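- If uname reported a 64-bit machine instead, only the platform and word-size settings change; a minimal sketch, assuming the same Hadoop installation path as above:

PLATFORM = Linux-amd64-64
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include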
- Note: Users have reported that in some cases the command above returns errors, and that adding -lssl to the list of libraries gets rid of them. Thanks for the tip! -- D. Thiebaut 08:16, 25 February 2011 (EST)
Data File
- We'll assume that you already have some large text files in HDFS, in a directory called dft1. If you don't, the commands below show one way to upload one.
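- For example, to create the directory and upload a local text file (ulysses.txt is a hypothetical file name; any large text file will do):

hadoop dfs -mkdir dft1
hadoop dfs -put ulysses.txt dft1/ulysses.txt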
Compiling and Running
- You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed by typing g++ at the prompt. If it is not installed yet, install it:
sudo apt-get install g++
- Compile the code:
make wordcount
- and fix any errors you get.
- Copy the executable file (wordcount) to the bin directory in HDFS:
hadoop dfs -mkdir bin        (Note: it should already exist!)
hadoop dfs -put wordcount bin/wordcount
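- To double-check that the executable made it into HDFS, list the bin directory:

hadoop dfs -ls bin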
- Run the program! The two -D options tell Hadoop to use its standard Java classes for reading and writing records, which is what Pipes programs normally rely on:
hadoop pipes -D hadoop.pipes.java.recordreader=true \
       -D hadoop.pipes.java.recordwriter=true \
       -input dft1 -output dft1-out \
       -program bin/wordcount
- Verify that you have gotten the right output:
hadoop dfs -text dft1-out/part-00000

"Come          1
"Defects,"     1
"I             1
"Information   1
"J"            1
"Plain         2
...
zodiacal       2
zoe)_          1
zones:         1
zoo.           1
zoological     1
zouave's       1
zrads,         2
zrads.         1
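- If you want the complete result file on the local file system, -get copies it out of HDFS (wordcount-output.txt is a name chosen for this example):

hadoop dfs -get dft1-out/part-00000 wordcount-output.txt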