1. Old API:
org.apache.hadoop.mapred.lib.MultipleOutputFormat / MultipleInputFormat
and org.apache.hadoop.mapred.lib.MultipleOutputs / MultipleInputs
MultipleOutputFormat allows writing the output data to different output files.
MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and its own key/value types; your MapReduce program decides what to emit to each OutputCollector.
2. New API:
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs and org.apache.hadoop.mapreduce.lib.input.MultipleInputs
These combine the functionality of the two old-API classes above; there is no MultipleOutputFormat / MultipleInputFormat anymore.
MultipleInputs:
By default, a job calls job.setInputFormatClass once and therefore uses a single InputFormat to process one kind of data. If a single job needs to read files of different formats from different directories at the same time, you could implement your own combined InputFormat, but Hadoop already ships MultipleInputs, which binds an InputFormat and a matching Mapper class to each input path.
Mapper1<LongWritable, Text, Text, CustomType>
Mapper2<LongWritable, Text, Text, CustomType>
Reducer<Text, CustomType, Text, Text>
(CustomType stands in for your own intermediate value class.)
public static void main(String[] args) throws Exception {
    // args[0]: input file1, handled by MapA
    String file_1 = args[0];
    // args[1]: input file2, handled by MapB
    String file_2 = args[1];
    // args[2]: output path
    String outPath = args[2];

    Configuration conf = new Configuration();
    Job job = new Job(conf, "multiple inputs");

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(CustomType.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(outPath));

    // bind each input path to its own InputFormat and Mapper
    MultipleInputs.addInputPath(job, new Path(file_1), TextInputFormat.class, MapA.class);
    MultipleInputs.addInputPath(job, new Path(file_2), TextInputFormat.class, MapB.class);
    ...
MultipleOutputs:
1. Writing to multiple files or directories:
The driver needs no extra changes; just add the following to your Mapper or Reducer class:
private MultipleOutputs<Text, IntWritable> mos;

public void setup(Context context) throws IOException, InterruptedException {
    mos = new MultipleOutputs<Text, IntWritable>(context);
}

public void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
}
Then, inside the Mapper or Reducer,
mos.write(Key key, Value value, String baseOutputPath) can be used in place of context.write(key, value).
The default part-m-00* or part-r-00* files are still created, but they are empty (size 0); calling LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver suppresses them.
Also note that only the default part-m-00* output (whatever goes through context.write) is passed on to the Reduce phase; records written with mos.write bypass the shuffle.
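The file a given write lands in follows a simple naming rule: a named output "name" produces name-m-00000 (or name-r-00000 on the reduce side), and the 3-argument write treats baseOutputPath as the file's base name, so a trailing "/" puts the part files into a subdirectory. The helper below is only my own sketch of that convention for illustration, not a Hadoop API:

```java
public class MultipleOutputsNaming {
    // Sketch (not a Hadoop API): predicts the file name MultipleOutputs
    // derives from a base name, a task type ('m' or 'r') and a partition number.
    static String outputFileName(String base, char taskType, int partition) {
        return String.format("%s-%c-%05d", base, taskType, partition);
    }

    public static void main(String[] args) {
        // mos.write(key, value, "AB/part") in the first map task writes to:
        System.out.println(outputFileName("AB/part", 'm', 0)); // AB/part-m-00000
        // a named output "MOSInt" in the first reduce task writes to:
        System.out.println(outputFileName("MOSInt", 'r', 0));  // MOSInt-r-00000
    }
}
```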
2. Writing in multiple formats:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestwithMultipleOutputs extends Configured implements Tool {

    public static class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {
        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] tokens = line.split("-");
            // (1) write to the named output "MOSInt"
            mos.write("MOSInt", new Text(tokens[0]), new IntWritable(Integer.parseInt(tokens[1])));
            // (2) write to the named output "MOSText"
            mos.write("MOSText", new Text(tokens[0]), new Text(tokens[2]));
            // (3) the extra baseOutputPath argument also lets you write to a
            //     specific file or directory (here: a subdirectory named after the key)
            mos.write("MOSText", new Text(tokens[0]), new Text(line), tokens[0] + "/");
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "word count with MultipleOutputs");
        job.setJarByClass(TestwithMultipleOutputs.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setMapperClass(MapClass.class);
        job.setNumReduceTasks(0); // map-only job

        // register one named output per desired format
        MultipleOutputs.addNamedOutput(job, "MOSInt", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "MOSText", TextOutputFormat.class, Text.class, Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TestwithMultipleOutputs(), args);
        System.exit(res);
    }
}
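To make the three writes in map() concrete, here is a plain-Java rehearsal of the parsing the mapper does; the sample line is made up, matching the "name-count-comment" shape that split("-") assumes:

```java
public class ParseDemo {
    public static void main(String[] args) {
        // a made-up input line in the shape the mapper expects
        String line = "hello-3-world";
        String[] tokens = line.split("-");
        // "MOSInt" receives (hello, 3); "MOSText" receives (hello, world)
        System.out.println(tokens[0]);                   // hello
        System.out.println(Integer.parseInt(tokens[1])); // 3
        System.out.println(tokens[2]);                   // world
        // the third write uses tokens[0] + "/" as baseOutputPath,
        // so that record lands under a subdirectory named "hello/"
        System.out.println(tokens[0] + "/");             // hello/
    }
}
```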