Counting the Number of Words in a Text File with Hadoop

The Files

You need three files to run the wordcount example:

  • a C++ file containing the map and reduce functions,
  • a data file containing some text, such as Ulysses, and
  • a Makefile to compile the C++ file.

wordcount.cpp

  • The wordcount program is shown below.
  • It contains two classes: one for the map, one for the reduce.
  • It makes use of several Hadoop classes, one of which, StringUtils, provides useful methods for converting between strings and other types.


#include <algorithm>
#include <limits>
#include <string>
#include <vector>    // for the vector of words returned by splitString

#include "stdint.h"  // <--- to prevent uint64_t errors!
 
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
 
using namespace std;
 
class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }
 
  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();
 
    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );
 
    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};
 
class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {
  }
 
  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;
 
    //--- get all tuples with the same key, and count their numbers ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }
 
    //--- emit (word, count) ---
    context.emit(context.getInputKey(), HadoopUtils::toString( count ));
  }
};
 
int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}
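
To see the dataflow without a Hadoop cluster, here is a minimal standalone sketch (not part of the original example, and it does not use Hadoop at all): the "map" phase emits (word, 1) pairs, and the "reduce" phase groups them by key and sums the counts, just as the two classes above do. It uses only the standard library and compiles with g++ -std=c++11.

// simulate_wordcount.cpp -- standalone sketch of the map/reduce dataflow
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
  // stand-in for the input split the mapper receives, one line at a time
  std::vector<std::string> lines = { "the quick brown fox", "the lazy dog" };

  // "map" phase: emit (word, 1) for every word of every line
  std::vector< std::pair<std::string, int> > emitted;
  for (const std::string& line : lines) {
    std::istringstream iss(line);
    std::string word;
    while (iss >> word) {
      emitted.push_back(std::make_pair(word, 1));
    }
  }

  // "shuffle + reduce" phase: group by key and sum the counts per word
  std::map<std::string, int> counts;
  for (const auto& kv : emitted) {
    counts[kv.first] += kv.second;
  }

  // print (word, count) pairs, as the reducer emits them
  for (const auto& kv : counts) {
    std::cout << kv.first << "\t" << kv.second << "\n";
  }
  return 0;
}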



Makefile

Before you create the Makefile, you need to find out whether your computer has a 32-bit or a 64-bit processor, so that you can pick the right library. To find out, run the following command:

  uname -a

To which the OS responds:

  Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux

The i686 indicates a 32-bit machine, for which you need the Linux-i386-32 library. Anything containing 64 (such as x86_64) indicates a 64-bit machine, for which you use the Linux-amd64-64 library.
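
If you only want the architecture string, uname -m prints just that field:

  uname -m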

Once you have this information, create the Makefile (make sure to spell it with an uppercase M). Note that the command lines inside the wordcount: rule must be indented with a tab character, not spaces:

CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@

Note: Users have reported that in some cases the command above returns errors, and that adding -lssl to the link flags helps get rid of them. Thanks for the tip! -- D. Thiebaut 08:16, 25 February 2011 (EST)
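
For a 64-bit machine, the Makefile would plausibly differ only in the platform and word-size settings; a sketch, assuming the same Hadoop install layout as above:

PLATFORM = Linux-amd64-64
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include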

Data File

  • We'll assume that you have some large text files already in HDFS, in a directory called dft1 (one way to put them there is shown below).
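
If the directory is not there yet, it can be created and populated along these lines (a sketch; ulysses.txt is a hypothetical local file standing in for whatever text you have):

  hadoop dfs -mkdir dft1
  hadoop dfs -put ulysses.txt dft1/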

Compiling and Running

  • You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it:
  sudo apt-get install g++
  • Compile the code:
  make wordcount
and fix any errors you get.
  • Copy the executable file (wordcount) to the bin directory in HDFS:
  hadoop dfs -mkdir bin                    (Note: it should already exist!)
  hadoop dfs -put  wordcount   bin/wordcount

  • Run the program!
  hadoop pipes -D hadoop.pipes.java.recordreader=true \
               -D hadoop.pipes.java.recordwriter=true \
               -input dft1 -output dft1-out \
               -program bin/wordcount
  • Verify that you have gotten the right output:
  hadoop dfs -text dft1-out/part-00000
 
  "Come   1
  "Defects,"      1
  "I      1
  "Information    1
  "J"     1
  "Plain  2
  ...
  zodiacal        2
  zoe)_   1
  zones:  1
  zoo.    1
  zoological      1
  zouave's        1
  zrads,  2
  zrads.  1
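
To pull the full result file out of HDFS and inspect it locally, hadoop dfs -get copies an HDFS file to the local filesystem (wordcount-output.txt is just a hypothetical local name):

  hadoop dfs -get dft1-out/part-00000 wordcount-output.txt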


