Hadoop MultipleInputs: Specifying a Different InputFormat and Mapper per Input
Introduction to MultipleInputs
By default, a MapReduce job's input may consist of multiple files, but all of them are processed by the same InputFormat and the same Mapper. This means the files must share the same format, so that one Mapper can handle all of their contents.
When the input files have different data formats, however, using a single Mapper is no longer appropriate.
MultipleInputs handles this situation cleanly: it lets you specify an InputFormat and a Mapper for each input path.
The Reducer only sees the aggregated map output; it does not know the output came from different mappers.
Example
1. Files to process:
- trade_info1.txt
zhangsan@163.com 6000 0 2014-02-20
lisi@163.com 2000 0 2014-02-20
lisi@163.com 0 100 2014-02-20
zhangsan@163.com 3000 0 2014-02-20
wangwu@126.com 9000 0 2014-02-20
wangwu@126.com 0 200 2014-02-20
- trade_info.txt
zhangsan@163.com,6000,0,2014-02-20
lisi@163.com,2000,0,2014-02-20
lisi@163.com,0,100,2014-02-20
zhangsan@163.com,3000,0,2014-02-20
wangwu@126.com,9000,0,2014-02-20
wangwu@126.com,0,200,2014-02-20
2. Code:
The key lines for handling the multiple, differently formatted inputs:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SumStepByToolMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SumStepByToolWithCommaMapper.class);
The only difference between the two Mappers is the delimiter each one uses to split a record line into fields:
String line = value.toString();
String[] fields = line.split("\t");
String line = value.toString();
String[] fields = line.split(",");
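Since the two Mappers differ only in the delimiter, an alternative design (not used in the original post) is a single Mapper that reads its delimiter from the job Configuration. The shared parsing logic is sketched below as a plain Java class so it can run outside Hadoop; the class name `TradeLineParser` is hypothetical, and the field layout (account, income, expense, date) follows the sample files above.

```java
import java.util.Arrays;

// Hypothetical helper: line-parsing logic that both mappers could
// delegate to. Because the delimiter is a parameter, one Mapper class
// configured via the job Configuration could replace the two near-
// duplicate Mapper classes shown above.
public class TradeLineParser {

    // Splits one record line into its fields:
    // [0] account, [1] income, [2] expense, [3] date (per the sample data).
    public static String[] parse(String line, String delimiter) {
        return line.split(delimiter);
    }

    public static void main(String[] args) {
        // The same record in both input formats from the example files.
        String tabLine = "zhangsan@163.com\t6000\t0\t2014-02-20";
        String commaLine = "zhangsan@163.com,6000,0,2014-02-20";
        System.out.println(Arrays.toString(parse(tabLine, "\t")));
        System.out.println(Arrays.toString(parse(commaLine, ",")));
    }
}
```

With this factoring, the mapper would call `parse(value.toString(), delimiter)` and the delimiter could be set per input path via a configuration property instead of hard-coding two Mapper classes.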
package mapreduce.mr;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import mapreduce.bean.InfoBeanMy;
public class SumStepByTool extends Configured implements Tool{
public static class SumStepByToolMapper extends Mapper<LongWritable, Text, Text, InfoBeanMy>{
private InfoBeanMy outBean = new InfoBeanMy();
private Text k = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line = value.toString();
String[] fields = line.split("\t");
String account = fields[0];
double income = Double.parseDouble(fields[1]);
double expense = Double.parseDouble(fields[2]);
outBean.setFields(account, income, expense);
k.set(account);
context.write(k, outBean);
}
}
public static class SumStepByToolWithCommaMapper extends Mapper<LongWritable, Text, Text, InfoBeanMy>{
private InfoBeanMy outBean = new InfoBeanMy();
private Text k = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line = value.toString();
String[] fields = line.split(",");
String account = fields[0];
double income = Double.parseDouble(fields[1]);
double expense = Double.parseDouble(fields[2]);
outBean.setFields(account, income, expense);
k.set(account);
context.write(k, outBean);
}
}
public static class SumStepByToolReducer extends Reducer<Text, InfoBeanMy, Text, InfoBeanMy>{
private InfoBeanMy outBean = new InfoBeanMy();
@Override
protected void reduce(Text key, Iterable<InfoBeanMy> values, Context context) throws IOException, InterruptedException{
double income_sum = 0;
double expense_sum = 0;
for(InfoBeanMy infoBeanMy : values)
{
income_sum += infoBeanMy.getIncome();
expense_sum += infoBeanMy.getExpense();
}
outBean.setFields("", income_sum, expense_sum);
context.write(key, outBean);
}
}
public static class SumStepByToolPartitioner extends Partitioner<Text, InfoBeanMy>{
private static Map<String, Integer> accountMap = new HashMap<String, Integer>();
static {
accountMap.put("zhangsan", 1);
accountMap.put("lisi", 2);
accountMap.put("wangwu", 3);
}
@Override
public int getPartition(Text key, InfoBeanMy value, int numPartitions) {
String keyString = key.toString();
String name = keyString.substring(0, keyString.indexOf("@"));
Integer part = accountMap.get(name);
if (part == null )
{
part = 0;
}
return part;
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
//conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);
Job job = Job.getInstance(conf);
job.setJarByClass(this.getClass());
job.setJobName("SumStepByTool");
//job.setInputFormatClass(TextInputFormat.class); // the default input format
//job.setInputFormatClass(KeyValueTextInputFormat.class); // treats the first field of each line as the key, the rest as the value
//job.setInputFormatClass(NLineInputFormat.class);
// job.setMapperClass(SumStepByToolMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(InfoBeanMy.class);
job.setReducerClass(SumStepByToolReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(InfoBeanMy.class);
job.setNumReduceTasks(4); // partitions 0-3: the Partitioner returns 1-3 for known accounts and 0 otherwise, so 4 reduce tasks are needed
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SumStepByToolMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SumStepByToolWithCommaMapper.class);
// FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
return job.waitForCompletion(true) ? 0:-1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SumStepByTool(),args);
System.exit(exitCode);
}
}
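The listing imports `mapreduce.bean.InfoBeanMy`, which the post does not show. Below is a plausible sketch of that bean, inferred from how it is used above (the `setFields(account, income, expense)` call and the `getIncome()`/`getExpense()` getters); the `surplus` field and everything else are assumptions. In the real project the class would declare `implements org.apache.hadoop.io.Writable`; the `write`/`readFields` methods below already match that interface's signatures, but are written against plain `java.io` types so the sketch compiles without Hadoop on the classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of the InfoBeanMy value class used in the job above. In the
// actual project it would implement org.apache.hadoop.io.Writable.
public class InfoBeanMy {
    private String account;
    private double income;
    private double expense;
    private double surplus; // assumed derived field: income - expense

    public void setFields(String account, double income, double expense) {
        this.account = account;
        this.income = income;
        this.expense = expense;
        this.surplus = income - expense;
    }

    // Writable contract: serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(account);
        out.writeDouble(income);
        out.writeDouble(expense);
        out.writeDouble(surplus);
    }

    // ...and deserialize them in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        this.account = in.readUTF();
        this.income = in.readDouble();
        this.expense = in.readDouble();
        this.surplus = in.readDouble();
    }

    public double getIncome() { return income; }

    public double getExpense() { return expense; }

    @Override
    public String toString() {
        // toString() determines how the reducer output is written by TextOutputFormat
        return income + "\t" + expense + "\t" + surplus;
    }
}
```

The strict field ordering in `write`/`readFields` is what lets Hadoop shuffle these values between map and reduce tasks without any per-record metadata.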
3. Running the job:
The job takes three arguments: the first two are the input paths and the last is the output path:
[root@hadoop1 tmp]# hadoop jar sortscore.jar mapreduce.mr.SumStepByTool /tradeinfoIn/trade_info1.txt /tradeinfoIn/trade_info.txt /tradeinfoOut/
Note
- Without MultipleInputs, input paths are specified with FileInputFormat. Once MultipleInputs is used, it takes over that role, but FileOutputFormat is still used to specify the output path.