Hadoop Pipes Programming

Latest recommended article published 2019-01-29 14:25:44

GarfieldEr007

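1. Introduction to Hadoop Pipes Programming

Hadoop Pipes lets C++ programmers write MapReduce programs. Users can mix C++ and Java implementations of five components: RecordReader, Mapper, Partitioner, Reducer, and RecordWriter. For the design ideas behind Hadoop Pipes, see my article "Hadoop Pipes Design Principles".

This article introduces the basics of Hadoop Pipes programming, walks through several examples, and closes with more advanced techniques: how to load a dictionary in a MapReduce job, how to pass parameters, and how to improve efficiency.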

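2. A First Look at Hadoop Pipes Programming

The Hadoop 0.20.2 source tree ships with three Pipes examples in the directory src/examples/pipes/impl: wordcount-simple.cc, wordcount-part.cc, and wordcount-nopipe.cc. A brief description of each follows.

(1) wordcount-simple.cc: the Mapper and Reducer are written in C++; the RecordReader, Partitioner, and RecordWriter are written in Java. The RecordReader is LineRecordReader (the reader used by TextInputFormat; it reads the input line by line, with the byte offset of the line as the key and the text of the line as the value), the Partitioner is PipesPartitioner, and the RecordWriter is LineRecordWriter (defined in TextOutputFormat; the output format is "key\tvalue\n").

(2) wordcount-part.cc: the Mapper, Partitioner, and Reducer are written in C++; the remaining components are written in Java.

(3) wordcount-nopipe.cc: the RecordReader, Mapper, Reducer, and RecordWriter are written in C++.

The following shows how to compile and run wordcount-simple.cc.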

From the Hadoop installation directory, run:

`ant -Dcompile.c++=yes examples`

The executable wordcount-simple built from wordcount-simple.cc is written to build/c++-examples/Linux-amd64-64/bin/. Upload it to a directory on HDFS, for example /user/XXX/bin:

`bin/hadoop fs -put build/c++-examples/Linux-amd64-64/bin/wordcount-simple /user/XXX/bin/`

Upload some input data to /user/XXX/pipes_test_data on HDFS:

`bin/hadoop fs -put data.txt /user/XXX/pipes_test_data`

Then submit the job with the following command:

    bin/hadoop pipes \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -D mapred.job.name=wordcount \
        -input /user/XXX/pipes_test_data \
        -output /user/XXX/pipes_test_output \
        -program /user/XXX/bin/wordcount-simple

3. Hadoop Pipes Programming Methods

We start with the two most basic components, Mapper and Reducer.

(1) Writing a Mapper

To implement a Mapper, inherit from the abstract base class HadoopPipes::Mapper, which is defined as follows:

    class Mapper: public Closable {
    public:
      virtual void map(MapContext& context) = 0;
    };

The user must implement the map function. Its parameter is a MapContext, which is declared as follows:

    class MapContext: public TaskContext {
    public:
      virtual const std::string& getInputSplit() = 0;
      virtual const std::string& getInputKeyClass() = 0;
      virtual const std::string& getInputValueClass() = 0;
    };

TaskContext, in turn, is declared as follows:

    class TaskContext {
    public:
      class Counter {
        ……
      public:
        Counter(int counterId) : id(counterId) {}
        Counter(const Counter& counter) : id(counter.id) {}
        ……
      };
      virtual const JobConf* getJobConf() = 0;
      virtual const std::string& getInputKey() = 0;
      virtual const std::string& getInputValue() = 0;
      virtual void emit(const std::string& key, const std::string& value) = 0;
      virtual void progress() = 0;
      …….
    };

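From the context argument, the user can obtain the current key and value, the input split, and other data, and can report progress through progress(). The user can also call emit to pass results back to the Java side.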
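As a rough illustration of how map() uses the context, here is a minimal word-count style sketch. It does not include the Hadoop headers: SketchMapContext is a hypothetical local stand-in that mirrors only getInputValue() and emit() from the declarations above, so that the logic can be compiled and tried on its own. In a real Pipes program the same loop would sit inside a class derived from HadoopPipes::Mapper.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the parts of HadoopPipes::MapContext used below.
// It is NOT the real Hadoop class; it only records what was emitted.
class SketchMapContext {
public:
  explicit SketchMapContext(const std::string& value) : value_(value) {}

  // Mirrors TaskContext::getInputValue(): the current input value (one line).
  const std::string& getInputValue() { return value_; }

  // Mirrors TaskContext::emit(): here the pair is stored instead of being
  // sent to the Java side.
  void emit(const std::string& key, const std::string& value) {
    emitted.push_back(std::make_pair(key, value));
  }

  std::vector<std::pair<std::string, std::string> > emitted;

private:
  std::string value_;
};

// Word-count map step: split the input line on whitespace and emit
// (word, "1") for every token.
void wordCountMap(SketchMapContext& context) {
  std::istringstream line(context.getInputValue());
  std::string word;
  while (line >> word) {
    context.emit(word, "1");
  }
}
```

Calling wordCountMap on a context holding one line of text leaves one (word, "1") pair per token in emitted, in input order.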

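The Mapper constructor takes a HadoopPipes::TaskContext parameter, through which the user can register global counters. These are very useful for debugging and for tracking job progress.

To register global counters, add code like the following to the constructor:

    WordCountMap(HadoopPipes::TaskContext& context) {
      inputWords1 = context.getCounter("group", "counter1");
      inputWords2 = context.getCounter("group", "counter2");
    }

To increment a counter:

    context.incrementCounter(inputWords1, 1);
    context.incrementCounter(inputWords2, 1);

The two arguments to getCounter are the group name and the name of the counter within that group; one group can contain several counters.

User-defined counters are printed to the console when the job finishes, and they can also be viewed through the web UI.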

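(2) Writing a Reducer

A Reducer is written in much the same way as a Mapper; it inherits from the abstract base class HadoopPipes::Reducer.

The only difference from the Mapper is that the reduce function takes a HadoopPipes::ReduceContext, which provides a nextValue() method. This lets the user iterate over the list of values for the current key and process them one by one.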
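The nextValue() iteration pattern can be sketched as follows. As an assumption for illustration, SketchReduceContext is a hypothetical local stand-in (not the real HadoopPipes::ReduceContext) that holds one key and its values in memory. The reduce logic itself, which sums the counts for a key, is the part that would carry over to a class derived from HadoopPipes::Reducer.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for HadoopPipes::ReduceContext; NOT the real class.
class SketchReduceContext {
public:
  SketchReduceContext(const std::string& key,
                      const std::vector<std::string>& values)
      : key_(key), values_(values), pos_(0) {}

  const std::string& getInputKey() { return key_; }

  // Valid only after nextValue() has returned true.
  const std::string& getInputValue() { return values_[pos_ - 1]; }

  // Advances to the next value of the current key; false when exhausted.
  bool nextValue() {
    if (pos_ >= values_.size()) return false;
    ++pos_;
    return true;
  }

  void emit(const std::string& key, const std::string& value) {
    emitted.push_back(std::make_pair(key, value));
  }

  std::vector<std::pair<std::string, std::string> > emitted;

private:
  std::string key_;
  std::vector<std::string> values_;
  std::vector<std::string>::size_type pos_;
};

// Word-count reduce step: add up all values for the current key and
// emit (key, total).
void wordCountReduce(SketchReduceContext& context) {
  long sum = 0;
  while (context.nextValue()) {
    sum += std::strtol(context.getInputValue().c_str(), NULL, 10);
  }
  std::ostringstream total;
  total << sum;
  context.emit(context.getInputKey(), total.str());
}
```

Note that nextValue() is called before the first getInputValue(); the loop ends as soon as the values for the key are exhausted.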

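Next come the RecordReader, Partitioner, and RecordWriter:

(3) Writing a RecordReader

A user-defined RecordReader class must inherit from the abstract base class HadoopPipes::RecordReader, which is declared as follows:

    class RecordReader: public Closable {
    public:
      virtual bool next(std::string& key, std::string& value) = 0;
      virtual float getProgress() = 0;
    };

The user must implement two methods, next and getProgress.

The constructor of a user-defined RecordReader can take a parameter of type HadoopPipes::MapContext. Calling getInputSplit() on it returns the serialized InputSplit object. The serialized format varies with the InputFormat used on the Java side, but most InputSplit objects provide at least three pieces of information: the name of the file containing the split, the split's offset within that file, and its length. With these three values, the user can read the file through libhdfs and so implement the next method.

The following shows how to deserialize the InputSplit object: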

[1] If the Java side uses WordCountInputFormat as its InputFormat:

    class XXXReader: public HadoopPipes::RecordReader {
    public:
      XXXReader(HadoopPipes::MapContext& context) {
        std::string filename;
        HadoopUtils::StringInStream stream(context.getInputSplit());
        HadoopUtils::deserializeString(filename, stream);
        ……
      }
    };

[2] If the Java side uses TextInputFormat as its InputFormat:

    class XXXReader: public HadoopPipes::RecordReader {
    public:
      XXXReader(HadoopPipes::MapContext& context) {
        std::string filename;
        HadoopUtils::StringInStream stream(context.getInputSplit());
        readString(filename, stream);
        // the offset and the length are each serialized as an 8-byte integer
        int start = (int)readLong(stream);
        int len = (int)readLong(stream);
        ……
      }

    private:
      // Reads a 2-byte big-endian length. The original listing calls readShort
      // without showing it; this body is an assumed implementation.
      int readShort(HadoopUtils::StringInStream& stream) {
        unsigned char b[2];
        stream.read(b, 2);
        return (b[0] << 8) | b[1];
      }

      void readString(std::string& t, HadoopUtils::StringInStream& stream) {
        int len = readShort(stream);
        if (len > 0) {
          // resize the string to the right length
          t.resize(len);
          // read into the string in 64k chunks
          const int bufSize = 65536;
          int offset = 0;
          char buf[bufSize];
          while (len > 0) {
            int chunkLength = len > bufSize ? bufSize : len;
            stream.read(buf, chunkLength);
            t.replace(offset, chunkLength, buf, chunkLength);
            offset += chunkLength;
            len -= chunkLength;
          }
        } else {
          t.clear();
        }
      }

      // Reads an 8-byte big-endian integer, most significant byte first.
      // A 64-bit type is used so the shifts are also safe on 32-bit builds.
      long long readLong(HadoopUtils::StringInStream& stream) {
        unsigned long long n = 0;
        char b;
        for (int i = 0; i < 8; ++i) {
          stream.read(&b, 1);
          n = (n << 8) | (unsigned long long)(b & 0xff);
        }
        return (long long)n;
      }
    };
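To make the byte layout concrete, the sketch below decodes a serialized split from a plain std::string, without any Hadoop headers. It assumes the layout implied by the readShort/readLong calls above: a 2-byte big-endian name length, the file name bytes, then two 8-byte big-endian integers for the offset and the length. The actual layout depends on the Java InputFormat and the Hadoop version, and SplitInfo, readBigEndian, and parseSplit are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <stdint.h>
#include <string>

// Hypothetical holder for the three fields most splits provide.
struct SplitInfo {
  std::string filename;
  int64_t start;
  int64_t length;
};

// Reads `count` bytes starting at `pos` as a big-endian unsigned integer
// and advances `pos`. Throws if the buffer is too short.
static uint64_t readBigEndian(const std::string& buf, std::size_t& pos,
                              std::size_t count) {
  if (count > buf.size() - pos) {
    throw std::runtime_error("serialized split is truncated");
  }
  uint64_t n = 0;
  for (std::size_t i = 0; i < count; ++i) {
    n = (n << 8) | static_cast<unsigned char>(buf[pos + i]);
  }
  pos += count;
  return n;
}

// Decodes: [2-byte name length][name bytes][8-byte start][8-byte length].
// This layout is an assumption; see the note above.
SplitInfo parseSplit(const std::string& raw) {
  std::size_t pos = 0;
  SplitInfo info;
  std::size_t nameLen = static_cast<std::size_t>(readBigEndian(raw, pos, 2));
  if (nameLen > raw.size() - pos) {
    throw std::runtime_error("file name is truncated");
  }
  info.filename = raw.substr(pos, nameLen);
  pos += nameLen;
  info.start = static_cast<int64_t>(readBigEndian(raw, pos, 8));
  info.length = static_cast<int64_t>(readBigEndian(raw, pos, 8));
  return info;
}
```

In a RecordReader constructor, the string returned by context.getInputSplit() would be the input to a function like parseSplit. Bounds are checked explicitly here because a wrong guess about the layout would otherwise read past the end of the buffer.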

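(4) Writing a Partitioner

A user-defined Partitioner class must inherit from the abstract base class HadoopPipes::Partitioner, which is declared as follows:

    class Partitioner {
    public:
      virtual int partition(const std::string& key, int numOfReduces) = 0;
      virtual ~Partitioner() {}
    };

The user must implement the partition method, and can override the virtual destructor if any cleanup is needed.

The framework calls partition with two arguments, the key and the number of reduce tasks numOfReduces. The method only needs to return a value between 0 and numOfReduces-1.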
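One common way to meet the 0 to numOfReduces-1 contract is to hash the key and take the remainder. The sketch below is a free function with the same parameters as partition(); the name hashPartition and the choice of the djb2 string hash are assumptions made for illustration, not what the Hadoop examples use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Maps a key to a reduce task index in [0, numOfReduces).
// Uses the djb2 string hash; any deterministic hash would do, as long as
// equal keys always land on the same reduce task.
int hashPartition(const std::string& key, int numOfReduces) {
  assert(numOfReduces > 0);
  unsigned long hash = 5381;
  for (std::size_t i = 0; i < key.size(); ++i) {
    hash = hash * 33 + static_cast<unsigned char>(key[i]);
  }
  return static_cast<int>(hash % static_cast<unsigned long>(numOfReduces));
}
```

Inside a class derived from HadoopPipes::Partitioner, partition() could simply forward to a function like this. The hash is computed in an unsigned type so the remainder can never be negative.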

(5) Writing a RecordWriter

A user-defined RecordWriter class must inherit from the abstract base class HadoopPipes::RecordWriter, which is declared as follows:

    class RecordWriter: public Closable {
    public:
      virtual void emit(const std::string& key,
                        const std::string& value) = 0;
    };

The constructor of a user-defined RecordWriter can take a parameter of type HadoopPipes::ReduceContext. Its getJobConf() method returns a HadoopPipes::JobConf object, from which the user can read the settings of the current reduce task, such as the task's partition number (useful for choosing the output file name) and the task's output directory.

    class MyWriter: public HadoopPipes::RecordWriter {
    public:
      MyWriter(HadoopPipes::ReduceContext& context) {
        const HadoopPipes::JobConf* job = context.getJobConf();
        int part = job->getInt("mapred.task.partition");
        std::string outDir = job->get("mapred.work.output.dir");
        ……
      }
      // emit() goes here
    };

The user must implement the emit method, which writes the data to a file.
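The two pieces of work a RecordWriter does, choosing a per-task file name and writing records, can be sketched without the Hadoop headers. In the example below, partFileName and writeRecord are hypothetical helpers. The zero-padded "part-00003" naming is an assumption modeled on the usual Hadoop output naming, and the "key\tvalue\n" layout follows the LineRecordWriter format described earlier.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Builds an output path such as "/out/part-00003" from the output
// directory and the reduce task's partition number.
std::string partFileName(const std::string& outDir, int part) {
  std::ostringstream name;
  name << outDir << "/part-";
  name.width(5);   // applies to the next inserted value only
  name.fill('0');
  name << part;
  return name.str();
}

// Writes one record as "key<TAB>value<NEWLINE>" to any output stream.
void writeRecord(std::ostream& out, const std::string& key,
                 const std::string& value) {
  out << key << '\t' << value << '\n';
}
```

In a writer class, the constructor would open a stream on partFileName(outDir, part), and emit() would call writeRecord on it. Taking std::ostream rather than a file handle keeps the formatting logic easy to exercise with an in-memory stream.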

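4. Hadoop Pipes Programming Example

Many people online suspect that wordcount-nopipe.cc, one of the examples shipped with Hadoop Pipes, cannot be run at all, and the question comes up on many forums. This section explains how the program is designed and how to run it.

Running it requires the following:

(1) The InputFormat must be WordCountInputFormat, which is in the package org.apache.hadoop.mapred.pipes under src/test/.

(2) The input and output directories must be on the local disk of each datanode, in the form file:///home/xxx/pipes_test. (Note: the Hadoop file system interfaces accept both local paths and HDFS paths. An HDFS path is written as hdfs://host:9000/user/xxx, meaning /user/xxx on the HDFS whose namenode is host. A local path is written as file:///home/xxx/pipes_test, meaning /home/xxx/pipes_test on the local disk. For example, bin/hadoop fs -ls file:///home/xxx/pipes_test lists the files under /home/xxx/pipes_test on the local disk.)

Once the input data /home/xxx/pipes_test/data.txt is in place on the local disk of every datanode, first upload the executable to HDFS:

`bin/hadoop fs -put build/c++-examples/Linux-amd64-64/bin/wordcount-nopipe /user/XXX/bin/`

Then submit the job with the following command: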

    bin/hadoop pipes \
        -D hadoop.pipes.java.recordreader=false \
        -D hadoop.pipes.java.recordwriter=false \
        -D mapred.job.name=wordcount \
        -D mapred.input.format.class=org.apache.hadoop.mapred.pipes.WordCountInputFormat \
        -libjars hadoop-0.20.2-test.jar \
        -input file:///home/xxx/pipes_test/data.txt \
        -output file:///home/xxx/pipes_output \
        -program /user/XXX/bin/wordcount-nopipe

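5. Advanced Hadoop Pipes Programming

If a MapReduce job needs to load a dictionary or receive parameters, proceed as follows:

(1) When submitting the job, use the -files option to ship the dictionary to each datanode (parameters can be placed in a configuration file and shipped the same way). For example:

    bin/hadoop pipes \
        -D hadoop.pipes.java.recordreader=false \
        -D hadoop.pipes.java.recordwriter=false \
        -D mapred.job.name=wordcount \
        -files dic.txt \
        ….

(2) In the constructor of the Mapper or Reducer, open the dictionary as a local file and store its contents in a map or set, then use it in the map() or reduce() function. For example:

    WordCountMap(HadoopPipes::TaskContext& context) {
      file = fopen("dic.txt", "r"); // C library function
      …….
    }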
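The "store its contents in a set" step might look like the sketch below. It assumes a dictionary file with one entry per line, which the article does not specify, and loadDictionary is a hypothetical helper name. It reads from any std::istream, so it works with a std::ifstream opened on dic.txt as well as with an in-memory stream.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Reads one dictionary entry per line into a set.
// Empty lines are skipped and a trailing '\r' (Windows line ending)
// is stripped, so "word\r\n" and "word\n" give the same entry.
std::set<std::string> loadDictionary(std::istream& in) {
  std::set<std::string> words;
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line[line.size() - 1] == '\r') {
      line.erase(line.size() - 1);
    }
    if (!line.empty()) {
      words.insert(line);
    }
  }
  return words;
}
```

In the Mapper constructor, the file shipped with -files could be opened with std::ifstream in("dic.txt") and passed to loadDictionary, with the resulting set kept as a member for lookups in map(). Loading once in the constructor avoids re-reading the file for every record.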

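For better performance, the RecordReader and RecordWriter are best implemented in Java (or taken from the ones that ship with Hadoop). The reason is that libhdfs, the native library Hadoop provides for C and C++ programs, is implemented through JNI and still calls the Java interfaces underneath, which makes it slow. In addition, libhdfs may not cope well with binary files or other non-text input.

6. Summary

Hadoop Pipes makes it possible for C++ programmers to write MapReduce jobs. It is simple to use and provides most of the functionality users need.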

Original article; when reposting, please credit the source: reposted from Dong's blog (董的博客)

Permalink: http://dongxicheng.org/mapreduce/hadoop-pipes-programming/
