Hadoop MultipleInputs: Specifying a Different InputFormat and Mapper per Input
Introduction to MultipleInputs
By default, a MapReduce job's input may consist of multiple files, but all of them are processed by the same InputFormat and the same Mapper. This means the files must share the same format, so that one Mapper can handle all of their contents.
When the input files have different data formats, however, using a single Mapper is no longer appropriate.
MultipleInputs handles this situation cleanly: it lets you specify an InputFormat and a Mapper for each input path.
The Reducer only sees the aggregated map output; it does not know the output came from different mappers.
Example
1. Files to process:
- trade_info1.txt
zhangsan@163.com 6000 0 2014-02-20
lisi@163.com 2000 0 2014-02-20
lisi@163.com 0 100 2014-02-20
zhangsan@163.com 3000 0 2014-02-20
wangwu@126.com 9000 0 2014-02-20
wangwu@126.com 0 200 2014-02-20
- trade_info.txt
zhangsan@163.com,6000,0,2014-02-20
lisi@163.com,2000,0,2014-02-20
lisi@163.com,0,100,2014-02-20
zhangsan@163.com,3000,0,2014-02-20
wangwu@126.com,9000,0,2014-02-20
wangwu@126.com,0,200,2014-02-20
2. Code:
The key lines for handling the multiple, differently formatted inputs:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SumStepByToolMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SumStepByToolWithCommaMapper.class);
The only difference between the two Mappers is the delimiter each one uses to split a record line into fields:
String line = value.toString();
String[] fields = line.split("\t");
String line = value.toString();
String[] fields = line.split(",");
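Since the two Mappers differ only in the delimiter, an alternative design (not used in the original post) is a single Mapper that reads its delimiter from the job Configuration. The shared parsing logic is sketched below as a plain Java class so it can run outside Hadoop; the class name `TradeLineParser` is hypothetical, and the field layout (account, income, expense, date) follows the sample files above.

```java
import java.util.Arrays;

// Hypothetical helper: line-parsing logic that both mappers could
// delegate to. Because the delimiter is a parameter, one Mapper class
// configured via the job Configuration could replace the two near-
// duplicate Mapper classes shown above.
public class TradeLineParser {

    // Splits one record line into its fields:
    // [0] account, [1] income, [2] expense, [3] date (per the sample data).
    public static String[] parse(String line, String delimiter) {
        return line.split(delimiter);
    }

    public static void main(String[] args) {
        // The same record in both input formats from the example files.
        String tabLine = "zhangsan@163.com\t6000\t0\t2014-02-20";
        String commaLine = "zhangsan@163.com,6000,0,2014-02-20";
        System.out.println(Arrays.toString(parse(tabLine, "\t")));
        System.out.println(Arrays.toString(parse(commaLine, ",")));
    }
}
```

With this factoring, the mapper would call `parse(value.toString(), delimiter)` and the delimiter could be set per input path via a configuration property instead of hard-coding two Mapper classes.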
package mapreduce.mr;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import mapreduce.bean.InfoBeanMy;
public class SumStepByTool extends Configured implements Tool{
public static class SumStepByToolMapper extends Mapper<LongWritable, Text, Text, InfoBeanMy>{
private InfoBeanMy outBean = new InfoBeanMy();
private Text k = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line = value.toString();
String[] fields = line.split("\t");
String account = fields[0];
double income = Double.parseDouble(fields[1]);
double expense = Double.parseDouble(fields[2]);
outBean.setFields(account, income, expense);
k.set(account);
context.write(k, outBean);
}
}
public static class SumStepByToolWithCommaMapper extends Mapper<LongWritable, Text, Text, InfoBeanMy>{
private InfoBeanMy outBean = new InfoBeanMy();
private Text k = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line = value.toString();
String[] fields = line.split(",");
String account = fields[0];
double income = Double.parseDouble(fields[1]);
double expense = Double.parseDouble(fields[2]);
outBean.setFields(account, income, expense);
k.set(account);
context.write(k, outBean);
}
}
public static class SumStepByToolReducer extends Reducer<Text, InfoBeanMy, Text, InfoBeanMy>{
private InfoBeanMy outBean = new InfoBeanMy();
@Override
protected void reduce(Text key, Iterable<InfoBeanMy> values, Context context) throws IOException, InterruptedException{
double income_sum = 0;
double expense_sum = 0;
for(InfoBeanMy infoBeanMy : values)
{
income_sum += infoBeanMy.getIncome();
expense_sum += infoBeanMy.getExpense();
}
outBean.setFields("", income_sum, expense_sum);
context.write(key, outBean);
}
}
public static class SumStepByToolPartitioner extends Partitioner<Text, InfoBeanMy>{
private static Map<String, Integer> accountMap = new HashMap<String, Integer>();
static {
accountMap.put("zhangsan", 1);
accountMap.put("lisi", 2);
accountMap.put("wangwu", 3);
}
@Override
public int getPartition(Text key, InfoBeanMy value, int numPartitions) {
String keyString = key.toString();
String name = keyString.substring(0, keyString.indexOf("@"));
Integer part = accountMap.get(name);
if (part == null )
{
part = 0;
}
return part;
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
//conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);
Job job = Job.getInstance(conf);
job.setJarByClass(this.getClass());
job.setJobName("SumStepByTool");
//job.setInputFormatClass(TextInputFormat.class); // the default input format
//job.setInputFormatClass(KeyValueTextInputFormat.class); // treats the first field of each line as the key, the rest as the value
//job.setInputFormatClass(NLineInputFormat.class);
// job.setMapperClass(SumStepByToolMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(InfoBeanMy.class);
job.setReducerClass(SumStepByToolReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(InfoBeanMy.class);
job.setNumReduceTasks(4); // partitions 0-3: the Partitioner returns 1-3 for known accounts and 0 otherwise, so 4 reduce tasks are needed
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SumStepByToolMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SumStepByToolWithCommaMapper.class);
// FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
return job.waitForCompletion(true) ? 0:-1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SumStepByTool(),args);
System.exit(exitCode);
}
}
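The listing imports `mapreduce.bean.InfoBeanMy`, which the post does not show. Below is a plausible sketch of that bean, inferred from how it is used above (the `setFields(account, income, expense)` call and the `getIncome()`/`getExpense()` getters); the `surplus` field and everything else are assumptions. In the real project the class would declare `implements org.apache.hadoop.io.Writable`; the `write`/`readFields` methods below already match that interface's signatures, but are written against plain `java.io` types so the sketch compiles without Hadoop on the classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of the InfoBeanMy value class used in the job above. In the
// actual project it would implement org.apache.hadoop.io.Writable.
public class InfoBeanMy {
    private String account;
    private double income;
    private double expense;
    private double surplus; // assumed derived field: income - expense

    public void setFields(String account, double income, double expense) {
        this.account = account;
        this.income = income;
        this.expense = expense;
        this.surplus = income - expense;
    }

    // Writable contract: serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(account);
        out.writeDouble(income);
        out.writeDouble(expense);
        out.writeDouble(surplus);
    }

    // ...and deserialize them in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        this.account = in.readUTF();
        this.income = in.readDouble();
        this.expense = in.readDouble();
        this.surplus = in.readDouble();
    }

    public double getIncome() { return income; }

    public double getExpense() { return expense; }

    @Override
    public String toString() {
        // toString() determines how the reducer output is written by TextOutputFormat
        return income + "\t" + expense + "\t" + surplus;
    }
}
```

The strict field ordering in `write`/`readFields` is what lets Hadoop shuffle these values between map and reduce tasks without any per-record metadata.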
3. Running the job:
The job takes three arguments: the first two are the input paths and the last is the output path:
[root@hadoop1 tmp]# hadoop jar sortscore.jar mapreduce.mr.SumStepByTool /tradeinfoIn/trade_info1.txt /tradeinfoIn/trade_info.txt /tradeinfoOut/
Note
- Without MultipleInputs, input paths are specified with FileInputFormat. Once MultipleInputs is used, it takes over that role, but FileOutputFormat is still used to specify the output path.