Hadoop Multi-File Output: A Deep Dive into MultipleOutputFormat and MultipleOutputs

草藤木屋


Original blog post: http://www.iteblog.com/archives/848

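Because this article is fairly long, it has been split into two parts. Before reading this part, please read the first part, "Hadoop Multi-File Output: A Deep Dive into MultipleOutputFormat and MultipleOutputs (Part 1)". Apologies for the inconvenience.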

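Unlike MultipleOutputFormat, MultipleOutputs can produce outputs of different types. The MultipleOutputs discussed at this point is still the old-API class; the enhanced MultipleOutputs of the new API is covered later. Rather than asking for a file name for every record, the old MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and its own key/value types, and the MapReduce program decides what to emit to each OutputCollector (see the English documentation quoted earlier). That may sound confusing, so let's look at some code. The following program stores the geography-related fields in files whose names start with geo, and the time-related fields in files whose names start with chrono: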

 
    package com.wyp;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;
    import org.apache.hadoop.util.GenericOptionsParser;

    import java.io.IOException;

    /**
     * User: http://www.iteblog.com/
     * Date: 13-11-27
     * Time: 10:32 PM
     */
    public class OldMulOutput {

        public static class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, NullWritable, Text> {

            private MultipleOutputs mos;
            private OutputCollector<NullWritable, Text> collector;

            @Override
            public void configure(JobConf conf) {
                mos = new MultipleOutputs(conf);
            }

            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<NullWritable, Text> output,
                            Reporter reporter) throws IOException {
                // Limit -1 keeps trailing empty fields.
                String[] arr = value.toString().split(",", -1);
                String chrono = arr[1] + "," + arr[2];
                String geo = arr[4] + "," + arr[5];

                // Each named output has its own collector.
                collector = mos.getCollector("chrono", reporter);
                collector.collect(NullWritable.get(), new Text(chrono));
                collector = mos.getCollector("geo", reporter);
                collector.collect(NullWritable.get(), new Text(geo));
            }

            @Override
            public void close() throws IOException {
                mos.close();
            }
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            String[] remainingArgs =
                    new GenericOptionsParser(conf, args).getRemainingArgs();

            if (remainingArgs.length != 2) {
                System.err.println("Error!");
                System.exit(1);
            }

            JobConf job = new JobConf(conf, OldMulOutput.class);
            Path in = new Path(remainingArgs[0]);
            Path out = new Path(remainingArgs[1]);
            FileInputFormat.setInputPaths(job, in);
            FileOutputFormat.setOutputPath(job, out);

            job.setJobName("MultiFile");
            job.setMapperClass(MapClass.class);
            job.setInputFormat(TextInputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            // Map-only job.
            job.setNumReduceTasks(0);

            // Register the two named outputs used by the mapper.
            MultipleOutputs.addNamedOutput(job,
                    "chrono",
                    TextOutputFormat.class,
                    NullWritable.class,
                    Text.class);

            MultipleOutputs.addNamedOutput(job,
                    "geo",
                    TextOutputFormat.class,
                    NullWritable.class,
                    Text.class);

            JobClient.runJob(job);
        }
    }
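The column-based split that map() performs can be tried without a cluster. The sketch below is a plain-Java illustration and not part of the original program: the class name ColumnSplitSketch and the sample line are made up, and the line only imitates the comma-separated shape the mapper expects (fields 1 and 2 for time, fields 4 and 5 for geography).

```java
import java.util.Arrays;

// Hypothetical helper: mirrors the field selection in MapClass.map() without Hadoop.
class ColumnSplitSketch {

    // Returns {chrono, geo} for one comma-separated input line.
    static String[] project(String line) {
        // Limit -1 keeps trailing empty fields, as in the mapper.
        String[] arr = line.split(",", -1);
        String chrono = arr[1] + "," + arr[2];
        String geo = arr[4] + "," + arr[5];
        return new String[] {chrono, geo};
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real data set.
        String line = "1000001,1963,1096,,\"BE\",\"\",,1";
        System.out.println(Arrays.toString(project(line)));
    }
}
```

In the real job, the first string would go to the chrono collector and the second to the geo collector.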

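The program above comes from Hadoop in Action. As before, package it into a jar file (the packaging steps are omitted here) and run it on Hadoop 2.2.0 (the test data can be downloaded from http://pan.baidu.com/s/1td8xN):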

      
    /home/q/hadoop-2.2.0/bin/hadoop jar \
        /export1/tmp/wyp/OutputText.jar com.wyp.OldMulOutput \
        /home/wyp/apat63_99.txt \
        /home/wyp/out5

After the job finishes, look at the results in the /home/wyp/out5 directory:

      
    [wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
        -ls /home/wyp/out5
    Found 7 items
    -rw-r--r--   3 wyp sg          0 2013-11-26 14:57 /home/wyp/out5/_SUCCESS
    -rw-r--r--   3 wyp sg      31243 2013-11-26 15:57 /home/wyp/out5/chrono-m-00000
    -rw-r--r--   3 wyp sg      22719 2013-11-26 15:57 /home/wyp/out5/chrono-m-00001
    -rw-r--r--   3 wyp sg      29922 2013-11-26 15:57 /home/wyp/out5/geo-m-00000
    -rw-r--r--   3 wyp sg      20429 2013-11-26 15:57 /home/wyp/out5/geo-m-00001
    -rw-r--r--   3 wyp sg          0 2013-11-26 14:57 /home/wyp/out5/part-m-00000
    -rw-r--r--   3 wyp sg          0 2013-11-26 14:57 /home/wyp/out5/part-m-00001

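As you can see, the output also contains files whose names start with part, but they are empty. These are the job's default output files. A collector cannot be named part, because that name is already taken by the default output.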
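The file names in the listing follow a simple pattern: the collector name, the task type (m for a map task), and a zero-padded task number. The sketch below reproduces that pattern together with the rule that part is reserved. It is an illustration written for this article, not Hadoop's own code: NamedOutputSketch and its methods are hypothetical names, and the letters-and-digits check is an assumption about how the old API validates names.

```java
// Hypothetical illustration of named-output naming; not Hadoop source code.
class NamedOutputSketch {

    // Rejects names that a named output is assumed not to accept.
    static void checkName(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        for (char ch : name.toCharArray()) {
            boolean ok = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9');
            if (!ok) {
                throw new IllegalArgumentException("illegal character: " + ch);
            }
        }
        if (name.equals("part")) {
            // "part" is the prefix of the default output files.
            throw new IllegalArgumentException("name cannot be 'part'");
        }
    }

    // Builds a file name in the shape seen in the listing, e.g. chrono-m-00000.
    static String fileName(String name, char taskType, int taskId) {
        checkName(name);
        return String.format("%s-%c-%05d", name, taskType, taskId);
    }

    public static void main(String[] args) {
        System.out.println(fileName("chrono", 'm', 0));
        System.out.println(fileName("geo", 'm', 1));
    }
}
```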
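The program above shows that the old MultipleOutputs can split a file by column. If you want to split by row instead, you have to implement that yourself, which is inconvenient. To address this, the new-API MultipleOutputs merges the functionality of the old MultipleOutputs and of MultipleOutputFormat. Accordingly, the new API no longer has a MultipleOutputFormat class, since MultipleOutputs already covers what it did. Here is what the official documentation says: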

　　The MultipleOutputs class simplifies writing output data to multiple outputs
　　Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.
　　Case two: to write data to different files provided by user

And here is what Hadoop: The Definitive Guide (3rd edition, Early Release), p. 251, says:

　　In the old MapReduce API there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API.

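In other words, the new MultipleOutputs combines the features of the old MultipleOutputs and MultipleOutputFormat, and the new API lives in the mapreduce package. Since the new MultipleOutputs includes what the old MultipleOutputFormat could do, how do we use it to get the multi-file output that the old MultipleOutputFormat produced, that is, what the first program of this article (in part one) did? Take a look at the following code.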

 
    package com.wyp;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    import java.io.IOException;

    /**
     * User: http://www.iteblog.com/
     * Date: 13-11-26
     * Time: 2:27 PM
     */
    public class MulOutput {

        public static class MapClass
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            private MultipleOutputs<NullWritable, Text> mos;

            @Override
            protected void setup(Context context)
                    throws IOException, InterruptedException {
                super.setup(context);
                mos = new MultipleOutputs<>(context);
            }

            @Override
            protected void map(LongWritable key,
                               Text value,
                               Context context)
                    throws IOException, InterruptedException {
                // The third argument is the base output path for this record.
                mos.write(NullWritable.get(), value,
                        generateFileName(value));
            }

            // Uses field 4 (assumed to be a quoted country code) as a directory name.
            private String generateFileName(Text value) {
                String[] split = value.toString().split(",", -1);
                String country = split[4].substring(1, 3);
                return country + "/";
            }

            @Override
            protected void cleanup(Context context)
                    throws IOException, InterruptedException {
                super.cleanup(context);
                mos.close();
            }
        }

        public static void main(String[] args)
                throws IOException, ClassNotFoundException,
                InterruptedException {
            Configuration conf = new Configuration();
            String[] remainingArgs =
                    new GenericOptionsParser(conf, args)
                            .getRemainingArgs();

            if (remainingArgs.length != 2) {
                System.err.println("Error!");
                System.exit(1);
            }

            // Job.getInstance copies conf, so create the job after the
            // generic options have been parsed into it.
            Job job = Job.getInstance(conf, "MulOutput");
            Path in = new Path(remainingArgs[0]);
            Path out = new Path(remainingArgs[1]);

            FileInputFormat.setInputPaths(job, in);
            FileOutputFormat.setOutputPath(job, out);

            job.setJarByClass(MulOutput.class);
            job.setMapperClass(MapClass.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

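The program above initializes the MultipleOutputs object in setup(Context context), and the map function calls the write method of MultipleOutputs to send each record to a directory chosen from the record's value (computed by generateFileName). MultipleOutputs has several overloads of write, with the following signatures: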

      
    public <K, V> void write(String namedOutput, K key, V value)
            throws IOException, InterruptedException

    public <K, V> void write(String namedOutput, K key, V value,
            String baseOutputPath) throws IOException, InterruptedException

    public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
            throws IOException, InterruptedException

Choose whichever write overload fits your needs.
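For the three-argument overload, the job controls only baseOutputPath; the task suffix is appended to it. The following Hadoop-free sketch shows what that means for the generateFileName logic used in MulOutput. BaseOutputPathSketch, the sample line, and the -m-00000 suffix format are illustrative assumptions based on the file names seen in this article, not code from Hadoop.

```java
// Hypothetical illustration of how a base output path turns into a file path.
class BaseOutputPathSketch {

    // Same logic as generateFileName(): field 4 is assumed to be a quoted
    // two-letter code such as "BE", so substring(1, 3) strips the quotes.
    static String baseOutputPath(String line) {
        String[] split = line.split(",", -1);
        String country = split[4].substring(1, 3);
        return country + "/";
    }

    // Assumed naming: <baseOutputPath>-m-<five-digit map task number>.
    static String relativeFile(String basePath, int mapTask) {
        return String.format("%s-m-%05d", basePath, mapTask);
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real data set.
        String line = "1000001,1963,1096,,\"BE\",\"\",,1";
        String base = baseOutputPath(line);
        System.out.println(base);
        System.out.println(relativeFile(base, 0));
    }
}
```

Because the base path ends with a slash, each country becomes its own directory under the job's output path, and the file inside it is named only by the task suffix.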
Now let's run the program above and look at the result:

      
    /home/q/hadoop-2.2.0/bin/hadoop jar \
        /export1/tmp/wyp/OutputText.jar com.wyp.MulOutput \
        /home/wyp/apat63_99.txt \
        /home/wyp/out11

The test data is the same as above. Let's see what is in the /home/wyp/out11 directory:

      
    [wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
        -ls /home/wyp/out11
    ............................. many entries omitted .............................
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/VN
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:41 /home/wyp/out11/VU
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/YE
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/YU
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/ZA
    ............................. many entries omitted .............................
    -rw-r--r--   3 wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/_SUCCESS
    -rw-r--r--   3 wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/part-m-00000
    -rw-r--r--   3 wyp supergroup          0 2013-11-26 19:42 /home/wyp/out11/part-m-00001

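The result now looks much like what the old MultipleOutputFormat produced. However, the output still contains two files whose names start with part, and both are empty. Why? As with the second test program, these are the job's default output files. Can we keep the job from producing them? Certainly, and it is simple: add the following line to the main function of the code above: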

      
    LazyOutputFormat.setOutputFormatClass(job,
            TextOutputFormat.class);

If you add the line above, also comment out the following line in your code, if it is present:

    job.setOutputFormatClass(TextOutputFormat.class);
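The idea behind LazyOutputFormat is that the real output is created only when the first record is written, so a task that never writes to the default output leaves no empty part file behind. The sketch below illustrates that idea in plain Java. It is a conceptual model with hypothetical names (LazyWriterSketch, LazyWriter), using an in-memory list in place of a file, and is not how Hadoop implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of lazy output creation; not Hadoop source code.
class LazyWriterSketch {

    // Wraps a factory and defers creating the real sink until the first write.
    static class LazyWriter {
        private final Supplier<List<String>> factory;
        private List<String> sink; // stays null until something is written

        LazyWriter(Supplier<List<String>> factory) {
            this.factory = factory;
        }

        void write(String record) {
            if (sink == null) {
                sink = factory.get(); // the "file" is created here, on demand
            }
            sink.add(record);
        }

        boolean created() {
            return sink != null;
        }
    }

    public static void main(String[] args) {
        LazyWriter unused = new LazyWriter(ArrayList::new);
        LazyWriter used = new LazyWriter(ArrayList::new);
        used.write("record");
        System.out.println("unused created: " + unused.created());
        System.out.println("used created: " + used.created());
    }
}
```

In the MulOutput job, every record goes through MultipleOutputs, so the default output corresponds to the writer that is never used.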

Now look at the output again:

      
    [wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
        -ls /home/wyp/out12
    ............................. many entries omitted .............................
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/VU
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/YE
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/YU
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/ZA
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/ZM
    drwxr-xr-x   - wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/ZW
    ............................. many entries omitted .............................
    -rw-r--r--   3 wyp supergroup          0 2013-11-26 19:44 /home/wyp/out12/_SUCCESS