4 Developing MapReduce Applications

ALL--IN

Category: Hadoop Technology Study Notes

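System Parameter Configuration

A Configuration object is populated from resources. Each resource is an XML file that holds a series of property/value pairs, for example: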

configuration-default.xml

configuration-site.xml

Configuration conf = new Configuration();
conf.addResource("configuration-default.xml");
conf.addResource("configuration-site.xml");

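Properties from a resource added later override the values of the same properties from resources added earlier, unless the earlier property is marked final.

By default Hadoop configures itself from two resources, loaded in order: core-default.xml and core-site.xml.

The former defines the system defaults; the latter holds the site-specific overrides.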
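
The override rule can be illustrated with a small plain-Java model. This is only a sketch of the semantics described above, not Hadoop's Configuration class: the LayeredConfig class and the property names color, size, and weight are hypothetical, and each resource is represented as an in-memory map instead of an XML file.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of layered configuration; not Hadoop's Configuration class. */
class LayeredConfig {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finalKeys = new HashSet<>();

    /** Applies one resource: later values win unless the key was marked final earlier. */
    void addResource(Map<String, String> properties, Set<String> finalInThisResource) {
        for (Map.Entry<String, String> e : properties.entrySet()) {
            if (finalKeys.contains(e.getKey())) {
                continue; // an earlier resource locked this property
            }
            values.put(e.getKey(), e.getValue());
        }
        finalKeys.addAll(finalInThisResource);
    }

    String get(String name) {
        return values.get(name);
    }

    /** Builds a two-resource example (hypothetical property names). */
    static LayeredConfig example() {
        LayeredConfig conf = new LayeredConfig();
        // "default" resource: weight is marked final
        conf.addResource(Map.of("color", "yellow", "size", "10", "weight", "heavy"), Set.of("weight"));
        // "site" resource: tries to override size and weight
        conf.addResource(Map.of("size", "12", "weight", "light"), Set.of());
        return conf;
    }

    public static void main(String[] args) {
        LayeredConfig conf = example();
        System.out.println(conf.get("color") + " " + conf.get("size") + " " + conf.get("weight"));
    }
}
```

In this model size is taken from the second resource, while weight keeps the value from the first resource because it was marked final there.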

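Performance Tuning

Once a job produces correct results, the goal is to make it run in as little time and use as little space as possible.

Use large input files

For example, a job over 1,000 files of 2.3 MB each ran for 33 minutes; after the files were merged into a single 2.2 GB file, it ran for 3 minutes.

Alternatively, use Hadoop's CombineFileInputFormat, which packs multiple files into a single input split so that each Map task processes more data.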
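
To see why packing helps, the following plain-Java sketch groups small files into splits with a simple greedy rule and compares the number of Map tasks. It is a simplified model, not the CombineFileInputFormat implementation (which also takes data locality into account); the SplitPacker class and the 128 MB split limit are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of packing many small files into fewer input splits. */
class SplitPacker {
    /** Greedily packs files (sizes in bytes) into splits of at most maxSplitBytes. */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // close the current split when the next file would not fit
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** Returns n files of the same size. */
    static List<Long> uniformFiles(int n, long sizeBytes) {
        List<Long> files = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            files.add(sizeBytes);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Long> files = uniformFiles(1000, 2_300_000L); // 1,000 files of about 2.3 MB
        long maxSplit = 128L * 1024 * 1024;                // assumed 128 MB split limit
        System.out.println("map tasks without packing: " + files.size());
        System.out.println("map tasks with packing:    " + pack(files, maxSplit).size());
    }
}
```

With one split per file, the job needs one Map task per small file; after packing, the same data is covered by far fewer tasks, so much less time goes into task start-up.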

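Compress files

Compressing the Map output reduces the space needed to store files, speeds up transfer over the network, and reduces the time spent moving data between memory and disk.

Set mapred.compress.map.output to true to compress the Map output.

Set mapred.map.output.compression.codec to choose the compression codec. (In Hadoop 2 and later these two properties are named mapreduce.map.output.compress and mapreduce.map.output.compress.codec.)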
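
The size reduction is easy to observe outside a cluster. The sketch below uses the JDK's gzip streams as a stand-in for a Hadoop compression codec and compresses some word-count style Map output; the MapOutputCompressionDemo class and the sample data are made up for illustration, and real ratios depend on the data and the codec chosen.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Compresses sample map output with the JDK's gzip (a stand-in for a Hadoop codec). */
class MapOutputCompressionDemo {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds word-count style map output: many repeated "word<TAB>1" lines. */
    static byte[] sampleMapOutput() {
        StringBuilder sb = new StringBuilder();
        String[] words = {"hadoop", "map", "reduce", "shuffle"};
        for (int i = 0; i < 10_000; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput();
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes: " + raw.length + ", gzip bytes: " + compressed.length);
    }
}
```

Map output tends to be repetitive, which is why it usually compresses well; the trade-off is the extra CPU time spent compressing and decompressing.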

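Modify job properties

Edit the following properties in the configuration files under the conf directory:

mapred.tasktracker.map.tasks.maximum

mapred.tasktracker.reduce.tasks.maximum

They set the number of Map and Reduce task slots; both default to 2.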

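MapReduce Workflows

When the processing becomes more complex, it can be expressed through more elaborate Map and Reduce functions, or through additional MapReduce jobs.

Complex Map and Reduce functions

A basic MapReduce job only inherits from the base classes Mapper and Reducer and overrides their core function, map or reduce.

The other functions of the base classes are introduced below; with them you can write Map and Reduce functions that do more and give finer control.

The setup function

The source code is as follows: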

/**
 * Called once at the start of the task.
 */
protected void setup(Context context) throws IOException, InterruptedException {
    // NOTHING
}

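This function is called once when the task starts.

Each task uses the Map class or the Reduce class as its processing body and an input split as its input; once its split has been processed, the task is destroyed.

setup is called exactly once, after the task starts and before any data is processed, whereas the overridden map function is called once for every key/value pair in the input split and the overridden reduce function once for every key.

Typical uses of setup:

Move work that map or reduce would otherwise repeat on every call into setup;

Initialize global variables that map or reduce will use during processing;

Read global variables from the job information;

Monitor the start of the task.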

The cleanup function

/**
 * Called once at the end of the task.
 */
protected void cleanup(Context context) throws IOException, InterruptedException {
    // NOTHING
}

cleanup is similar to setup; the difference is that cleanup runs just before the task is destroyed.
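
One common way to combine the two hooks is to initialize task-wide state in setup, update it in map, and emit the result once in cleanup. The sketch below models that pattern in plain Java. CountingTask is a hypothetical stand-in, not the Hadoop Mapper class, and a list takes the place of the context's output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a task with setup/map/cleanup hooks; not the Hadoop Mapper class. */
class CountingTask {
    private Map<String, Integer> counts;            // task-wide state
    final List<String> output = new ArrayList<>();  // stands in for the context's output

    void setup() {
        counts = new TreeMap<>(); // initialized once, before any record is seen
    }

    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    void cleanup() {
        // emit the aggregated totals once, just before the task ends
        counts.forEach((word, n) -> output.add(word + "\t" + n));
    }

    /** Drives the hooks in the order described in the text. */
    static List<String> runTask(List<String> split) {
        CountingTask task = new CountingTask();
        task.setup();
        for (String line : split) {
            task.map(line);
        }
        task.cleanup();
        return task.output;
    }

    public static void main(String[] args) {
        System.out.println(runTask(List.of("a b a", "b a")));
    }
}
```

Because the totals are emitted only in cleanup, the task writes one record per distinct word instead of one per occurrence.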

The run function

/**
 * Expert users can override this method for more complete control over the execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

This function is the entry point that drives the map function (the Reducer class has a corresponding run that drives reduce).

Override it when you need more complete control over Map or Reduce.
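
As an illustration of what overriding run makes possible, the following plain-Java sketch processes only the first few records of a split and guarantees that cleanup still runs. ModelMapper and SamplingMapper are hypothetical classes that mirror the structure of the source above; they do not use the Hadoop API, and an Iterator stands in for the Context.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Plain-Java model of the setup/map/cleanup template; not the Hadoop Mapper class. */
class ModelMapper {
    final List<String> output = new ArrayList<>();

    void setup() { }
    void map(String record) { output.add(record.toUpperCase()); }
    void cleanup() { output.add("<done>"); }

    /** Default driver: setup, one map call per record, cleanup. */
    void run(Iterator<String> records) {
        setup();
        while (records.hasNext()) {
            map(records.next());
        }
        cleanup();
    }
}

/** Overrides run() to process only the first `limit` records of the split. */
class SamplingMapper extends ModelMapper {
    private final int limit;

    SamplingMapper(int limit) { this.limit = limit; }

    @Override
    void run(Iterator<String> records) {
        setup();
        try {
            int seen = 0;
            while (seen < limit && records.hasNext()) {
                map(records.next());
                seen++;
            }
        } finally {
            cleanup(); // still runs if map throws
        }
    }
}

class RunDemo {
    public static void main(String[] args) {
        SamplingMapper mapper = new SamplingMapper(2);
        mapper.run(List.of("a", "b", "c", "d").iterator());
        System.out.println(mapper.output);
    }
}
```

The map logic itself is untouched; only the loop that drives it changes, which is the kind of control the run hook is meant to offer.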

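Sharing Global Data in MapReduce

1. Read and write HDFS files

This is done with Hadoop's Java API.

Note: write operations from multiple Map or Reduce tasks can conflict and overwrite existing data.

Advantage: supports both reading and writing, and is straightforward.

Disadvantage: sharing even a very small piece of global data requires I/O, which consumes system resources and adds to the cost of completing the job.

2. Set job properties

When the job is being set up, use set(String name, String value) of the Configuration class to store simple global data in the job's configuration properties;

then, inside a task, use get(String name) of the Configuration class to read that data back.

Advantage: simple, with low resource consumption.

Disadvantage: poorly suited to large amounts of shared data.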
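
Since both the property name and its value are strings, anything shared this way has to be encoded on the driver side and decoded in the task. The sketch below shows that round trip for a small stop-word list. It uses java.util.Properties as a stand-in for the job's Configuration, and the property name myapp.stopwords is a hypothetical example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

/** Models sharing a small value through string-valued job properties. */
class JobPropertySharing {
    static final String STOP_WORDS_KEY = "myapp.stopwords"; // hypothetical property name

    /** Driver side: flatten the small shared value into a string property. */
    static void publish(Properties jobConf, Set<String> stopWords) {
        jobConf.setProperty(STOP_WORDS_KEY, String.join(",", stopWords));
    }

    /** Task side (typically called from setup): read the property back and decode it. */
    static Set<String> load(Properties jobConf) {
        String raw = jobConf.getProperty(STOP_WORDS_KEY, "");
        Set<String> words = new LinkedHashSet<>();
        for (String w : raw.split(",")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties(); // stand-in for the job's Configuration
        publish(jobConf, new LinkedHashSet<>(Arrays.asList("the", "a", "of")));
        System.out.println(load(jobConf));
    }
}
```

The need to flatten everything into one string is also why this approach becomes awkward once the shared data grows.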

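3. Use DistributedCache

DistributedCache is a facility that provides read-only cached files to an application; it can cache text files, archives, and jar files.

To use it, specify the URL of a local or HDFS file in the job configuration to register it as a shared cache file.

After the job starts and before its tasks start, the MapReduce framework copies the cache files that may be needed to the local disk of each node that runs tasks.

Advantage: each shared file is copied only once per job after the job starts, so the approach suits large amounts of shared data.

Disadvantage: read-only.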

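import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;

// Configuration (in the driver). DistributedCache is deprecated in Hadoop 2 and later.
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/myapp/lookup"), conf);

// Usage in the Map class:
public static class Map extends Mapper<...> {
    private Path[] localArchives;
    private Path[] localFiles;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        localArchives = DistributedCache.getLocalCacheArchives(conf);
        localFiles = DistributedCache.getLocalCacheFiles(conf);
    }

    @Override
    public void map(K key, V value, Context context) throws IOException, InterruptedException {
        // Use the data obtained from the cached files;
        // k and v stand for the output key and value.
        context.write(k, v);
    }
}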
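
What "using the data from the cached files" looks like depends on the file format. As one possibility, the sketch below loads a tab-separated lookup file from the local disk into a HashMap, which is the kind of work setup would do with one of the local cache paths. The LookupLoader class, the file format, and the sample contents are assumptions, and the Path here is java.nio.file.Path, not Hadoop's Path class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Loads a local "key<TAB>value" file into memory, as setup might do with a cached file. */
class LookupLoader {
    /** Parses "key<TAB>value" lines into a map; lines without a tab are skipped. */
    static Map<String, String> load(Path localFile) {
        Map<String, String> lookup = new HashMap<>();
        try {
            for (String line : Files.readAllLines(localFile, StandardCharsets.UTF_8)) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    lookup.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    /** Writes a small sample file so the sketch can run without a cluster. */
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("lookup", ".tsv");
            Files.write(file, List.of("cn\tChina", "fr\tFrance", "broken line"), StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> lookup = load(writeSample());
        System.out.println(lookup.get("cn") + ", " + lookup.get("fr"));
    }
}
```

Loading the file once in setup and keeping the map in a field lets every map call do an in-memory lookup instead of re-reading the file.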

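Chaining MapReduce Jobs

When a problem cannot be solved by a single MapReduce job, several jobs have to be arranged into a workflow so that they cooperate to complete the task automatically, without the user starting each job by hand.

1. Linear MapReduce job flow

The simplest approach is to define several jobs in a fixed order: each job takes the output of the previous job as its input, processes it, and passes its own output on to the next job.

This is very simple to implement: make the launch code of each job run only after the previous job has finished, and set the job's input path to the previous job's output path.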
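
The driver logic for such a flow can be sketched in plain Java. In this model each stage is a function that receives an input path and an output path and reports success, standing in for configuring a job and waiting for it to complete. The LinearJobFlow class, the stage names, and the /tmp/flow path layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Plain-Java model of a linear job flow; each stage stands in for one MapReduce job. */
class LinearJobFlow {
    /** A stage reads inputPath, writes outputPath, and reports success. */
    static final class Stage {
        final String name;
        final BiPredicate<String, String> body;

        Stage(String name, BiPredicate<String, String> body) {
            this.name = name;
            this.body = body;
        }
    }

    /** Runs the stages in order and returns one trace entry per completed stage. */
    static List<String> run(String input, List<Stage> stages) {
        List<String> completed = new ArrayList<>();
        String currentInput = input;
        for (Stage stage : stages) {
            String output = "/tmp/flow/" + stage.name; // hypothetical path layout
            if (!stage.body.test(currentInput, output)) {
                break; // a failed stage stops the rest of the flow
            }
            completed.add(stage.name + ":" + currentInput + "->" + output);
            currentInput = output; // the next stage reads what this one wrote
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Stage> stages = List.of(
                new Stage("clean", (in, out) -> true),
                new Stage("count", (in, out) -> true),
                new Stage("sort", (in, out) -> true));
        run("/data/raw", stages).forEach(System.out::println);
    }
}
```

Stopping at the first failed stage matters in practice: a later job started on missing or partial output would either fail or silently produce wrong results.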

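2. Complex MapReduce job flow

The linear approach still cannot meet the requirements of some complex tasks.

For example, Job3 may need to combine the outputs of Job1 and Job2. Job3 can then start only after both Job1 and Job2 have completed, while Job1 and Job2 do not depend on each other.

For such cases the MapReduce framework provides an API for organizing jobs into complex flows: the ControlledJob and JobControl classes, both in the org.apache.hadoop.mapreduce.lib.jobcontrol package.

The procedure is:

Configure each job as usual;

wrap each configured job in its own ControlledJob object;

declare the dependencies with ControlledJob's addDependingJob();

instantiate a JobControl object and add all the jobs to it with addJob();

finally, start the job flow with the JobControl object's run method.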
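
The scheduling idea behind this procedure, running a job only once everything it depends on has finished, can be modeled in a few lines of plain Java. The JobGraph class below is a hypothetical illustration that borrows the method names addJob and addDependingJob from the text. It is not the JobControl implementation and only computes an execution order instead of launching jobs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of dependency-ordered job execution; not Hadoop's JobControl. */
class JobGraph {
    private final Map<String, Set<String>> dependsOn = new LinkedHashMap<>();

    void addJob(String job) {
        dependsOn.putIfAbsent(job, new LinkedHashSet<>());
    }

    /** Declares that `job` may start only after `prerequisite` has finished. */
    void addDependingJob(String job, String prerequisite) {
        addJob(job);
        addJob(prerequisite);
        dependsOn.get(job).add(prerequisite);
    }

    /** Returns an order in which every job runs after all of its prerequisites. */
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> finished = new LinkedHashSet<>();
        while (finished.size() < dependsOn.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> e : dependsOn.entrySet()) {
                // a job is ready once all of its prerequisites have finished
                if (!finished.contains(e.getKey()) && finished.containsAll(e.getValue())) {
                    finished.add(e.getKey());
                    order.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("dependency cycle among jobs");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        JobGraph graph = new JobGraph();
        graph.addDependingJob("job3", "job1");
        graph.addDependingJob("job3", "job2");
        System.out.println(graph.executionOrder());
    }
}
```

In the Job1/Job2/Job3 example, job1 and job2 have no prerequisites and are scheduled first, and job3 follows once both have finished.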

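3. Pre-processing and post-processing steps for a job

These are implemented with the ChainMapper and ChainReducer classes in the org.apache.hadoop.mapred.lib package, which are used through their static methods. Their documentation describes them as follows.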

The ChainMapper class allows the use of multiple Mapper classes within a single Map task.

The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output.

The ChainReducer class allows multiple Mapper classes to be chained after a Reducer within the Reducer task.

For each record output by the Reducer, the Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output.

The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed in a chain. This enables reusable, specialized Mappers that can be combined to perform composite operations within a single task.

Special care has to be taken when creating chains that the key/values output by a Mapper are valid for the following Mapper in the chain. It is assumed that all Mappers and the Reducer in the chain use matching output and input key and value classes, as no conversion is done by the chaining code.

Using the ChainMapper and ChainReducer classes, it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit of this pattern is a dramatic reduction in disk IO.

IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper; this is done by the addMapper call for the last mapper in the chain.

ChainMapper usage pattern:

...
conf.setJobName("chain");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

JobConf mapAConf = new JobConf(false);
...
ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, mapAConf);

JobConf mapBConf = new JobConf(false);
...
ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, mapBConf);

JobConf reduceConf = new JobConf(false);
...
ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, reduceConf);

ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, null);

ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
    LongWritable.class, LongWritable.class, true, null);

FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);
...

JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);
...
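
The piping behaviour described above, where each record flows through all mappers in memory before anything is written, can be modeled in plain Java. The MapperChain class below is a hypothetical illustration built on java.util.function.Function; it does not use the ChainMapper API, and the three example mappers are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of chained mappers inside one task; not Hadoop's ChainMapper. */
class MapperChain {
    private final List<Function<String, List<String>>> mappers = new ArrayList<>();

    /** Appends a mapper; each mapper turns one record into zero or more records. */
    MapperChain addMapper(Function<String, List<String>> mapper) {
        mappers.add(mapper);
        return this;
    }

    /** Pipes one input record through every mapper in order, entirely in memory. */
    List<String> process(String record) {
        List<String> current = List.of(record);
        for (Function<String, List<String>> mapper : mappers) {
            List<String> next = new ArrayList<>();
            for (String r : current) {
                next.addAll(mapper.apply(r));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        MapperChain chain = new MapperChain()
                // tokenize the line into words
                .addMapper(line -> List.of(line.split(" ")))
                // drop words of one or two characters
                .addMapper(word -> word.length() > 2 ? List.of(word) : List.<String>of())
                // normalize to upper case
                .addMapper(word -> List.of(word.toUpperCase()));
        System.out.println(chain.process("to be or not to be that is"));
    }
}
```

Each mapper only sees records of the type the previous one produced, which mirrors the requirement that output and input key/value classes match along the chain; no intermediate result touches the disk.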
