The number of splits equals the number of map tasks: one map task per split. The number of reduce tasks, however, has no necessary relationship to splits or map tasks; it is tied to the number of partitions. To control it, define a new partitioner class that overrides getPartition(), and call job.setNumReduceTasks(num).
Given an input file, a small file yields a single split while a large one yields several. This example uses three small input files, so there are three splits and therefore three map tasks. The number of reduce tasks can be set explicitly.
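For context, when no custom partitioner is set, Hadoop's default HashPartitioner decides which reduce task a key goes to. A minimal plain-Java sketch of that formula (the class name here is mine, not part of the job code):

```java
public class HashPartitionDemo {
    // Mirrors Hadoop's default HashPartitioner: mask off the sign bit,
    // then take the remainder modulo the reduce count, so every key
    // lands in a partition in the range [0, numReduceTasks).
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String fruit : new String[] {"apple", "banana", "orange"}) {
            System.out.println(fruit + " -> partition " + getPartition(fruit, numReduceTasks));
        }
    }
}
```

This is why the reduce count and the partition count go hand in hand: the partitioner must return a value below numReduceTasks.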
The goal is to process the text below, using the first two columns as a composite key, and compute the total count for each fruit, e.g. apple 40.
Input text:
apple big 3
apple little 2
apple medium 4
orange big 5
orange little 1
orange medium 2
apple big 5
apple little 1
apple medium 3
banana big 3
banana little 2
banana medium 4
apple big 3
orange little 2
apple medium 4
banana big 3
apple little 2
apple medium 4
apple big 3
banana little 2
apple medium 4
banana big 3
apple little 2
orange medium 4
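Before running the job, the expected per-fruit totals can be verified with a short plain-Java snippet (the class name and the hard-coded copy of the data are only for this illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FruitTotals {
    static final String[] DATA = {
        "apple big 3", "apple little 2", "apple medium 4",
        "orange big 5", "orange little 1", "orange medium 2",
        "apple big 5", "apple little 1", "apple medium 3",
        "banana big 3", "banana little 2", "banana medium 4",
        "apple big 3", "orange little 2", "apple medium 4",
        "banana big 3", "apple little 2", "apple medium 4",
        "apple big 3", "banana little 2", "apple medium 4",
        "banana big 3", "apple little 2", "orange medium 4"
    };

    // Sum the third column per fruit name (the first column).
    static Map<String, Integer> totals(String[] lines) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(" ");
            sums.merge(f[0], Integer.parseInt(f[2]), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        totals(DATA).forEach((fruit, sum) -> System.out.println(fruit + "\t" + sum));
    }
}
```

With this data the desired output is apple 40, banana 17, orange 14.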
The code below sets three reduce tasks, so the job produces three output files: part-r-00000, part-r-00001, and part-r-00002.
The intended result was for each part-r-xxxxx file to hold a single line with a fruit name and its total count, but the code actually outputs the fruit name together with a separate count for each fruit type.
After asking someone more experienced, the cause became clear: the reducer aggregates values per key, and the key here is the composite ("fruit name, fruit type"). So even though records are partitioned by fruit name alone, each partition still contains three distinct composite keys, and the reducer therefore emits three sums per partition.
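The effect can be reproduced without Hadoop: grouping by the full "fruit type" composite key, as the reducer below effectively does, yields three groups per fruit (class name and the inlined data are illustrative only):

```java
import java.util.Map;
import java.util.TreeMap;

public class CompositeKeyGroups {
    static final String[] DATA = {
        "apple big 3", "apple little 2", "apple medium 4",
        "orange big 5", "orange little 1", "orange medium 2",
        "apple big 5", "apple little 1", "apple medium 3",
        "banana big 3", "banana little 2", "banana medium 4",
        "apple big 3", "orange little 2", "apple medium 4",
        "banana big 3", "apple little 2", "apple medium 4",
        "apple big 3", "banana little 2", "apple medium 4",
        "banana big 3", "apple little 2", "orange medium 4"
    };

    // Group by the full composite key "fruit type": identical composite keys
    // form one reduce group, so each fruit produces three separate sums.
    static Map<String, Integer> groupByCompositeKey(String[] lines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(" ");
            sums.merge(f[0] + " " + f[1], Integer.parseInt(f[2]), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Print only the fruit name, mirroring the reducer's output format below.
        groupByCompositeKey(DATA).forEach((k, v) ->
            System.out.println(k.split(" ")[0] + "\t" + v));
    }
}
```

With this data, the apple partition would therefore print three lines (apple 14, apple 7, apple 19) instead of the single total apple 40.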
package secondarySort;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job; // the mapreduce Job class, not JobControl
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class SecondarySort_My {
// Custom composite key type
public static class StringPair implements WritableComparable<StringPair>{
String first;
String second;
public StringPair(){ // Writable keys need a public no-arg constructor for deserialization
}
public void set(String left, String right) {
first = left;
second = right;
}
public String getFirst() {
return first;
}
public String getSecond() {
return second;
}
//The following three overrides are required (an IDE can generate the stubs)
@Override
public void readFields(DataInput in) throws IOException {
first = in.readUTF(); // DataInput.readUTF() reads a (modified) UTF-8 encoded string and returns it as a String
second = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(first);
out.writeUTF(second);
}
@Override
public int compareTo(StringPair o) {
if(!this.first.equals(o.first)){
return this.first.compareTo(o.first);
}else{
return this.second.compareTo(o.second);
}
}
//hashCode() and equals() should also be overridden if this key is ever used
//with the default HashPartitioner (a custom partitioner is used here)
}
public static class FirstPartitioner extends Partitioner<StringPair, IntWritable>{
@Override
public int getPartition(StringPair key, IntWritable value, int numPartitions){
//return Math.abs(key.getFirst().hashCode() * 127) % numPartitions; // hash-based alternative; numPartitions is set later in main()
//Here there is only one split and one map task; the reduce count is set in main()
int i = 0; // default partition; any value returned must lie in [0, numPartitions)
switch(key.getFirst()){
case "apple":
i = 0 % numPartitions;
break;
case "banana":
i = 1 % numPartitions;
break;
case "orange":
i = 2 % numPartitions;
break;
}
return i;
}
}
//map
public static class MyMapper extends Mapper<LongWritable, Text, StringPair, IntWritable>{
private final StringPair stringPair = new StringPair();
public void map(LongWritable key, Text value,Context context) throws IOException, InterruptedException{
String[] str = value.toString().split(" ");
stringPair.set(str[0], str[1]);
context.write(stringPair, new IntWritable(Integer.valueOf(str[2])));
}
}
//reduce
public static class MyReducer extends Reducer<StringPair, IntWritable, Text, IntWritable>{
public void reduce(StringPair nameAndType, Iterable<IntWritable> nums, Context context) throws IOException, InterruptedException{
int sum = 0;
for(IntWritable i:nums){
sum += i.get();
}
IntWritable i = new IntWritable();
i.set(sum);
Text t = new Text();
t.set(nameAndType.first);
context.write(t, i);
}
}
//main
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Got " + otherArgs.length + " arguments. Usage: secondarysort <in> <out>");
System.exit(2); // terminates the running JVM
}
Job job = Job.getInstance(conf, "secondary sort");
job.setJarByClass(SecondarySort_My.class);
job.setMapperClass(MyMapper.class);
job.setPartitionerClass(FirstPartitioner.class);
job.setNumReduceTasks(3);
job.setMapOutputKeyClass(StringPair.class);
job.setMapOutputValueClass(IntWritable.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
//waitForCompletion() submits the job and blocks until it finishes
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
How can we compute each fruit's total without changing the composite key? This is where the grouping (auxiliary sorting) part of secondary sort comes in.
Although the job partitions by the first field of the composite key, within a partition the reducer still groups records by the full key (here, the composite key). The reducer's aggregation runs once per group, and the default grouping puts only identical keys in the same group. To sum all types of a fruit together, records must instead be grouped by the first field of the composite key, which requires a custom grouping comparator.
Concrete implementation
First, add the grouping comparator class below to the code above (it also requires importing org.apache.hadoop.io.WritableComparator):
public static class GroupingComparator extends WritableComparator {
protected GroupingComparator() {
super(StringPair.class, true); // true: instantiate keys for deserialization
}
//Compare two WritableComparables: group composite keys by the first (natural) key only
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
StringPair ip1 = (StringPair) w1;
StringPair ip2 = (StringPair) w2;
return ip1.getFirst().compareTo(ip2.getFirst());
}
}
Then register it in main(): job.setGroupingComparatorClass(GroupingComparator.class);
This yields three part-r-xxxxx files, holding apple, banana, and orange with their respective totals.
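The mechanism can be sketched in plain Java: sort the records by the composite key (as the shuffle does), then close a reduce group only when the first field changes, which is exactly what the grouping comparator tells the framework to do. Class and method names here are illustrative, not Hadoop API:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupingDemo {
    static final String[] DATA = {
        "apple big 3", "apple little 2", "apple medium 4",
        "orange big 5", "orange little 1", "orange medium 2",
        "apple big 5", "apple little 1", "apple medium 3",
        "banana big 3", "banana little 2", "banana medium 4",
        "apple big 3", "orange little 2", "apple medium 4",
        "banana big 3", "apple little 2", "apple medium 4",
        "apple big 3", "banana little 2", "apple medium 4",
        "banana big 3", "apple little 2", "orange medium 4"
    };

    // Sort by composite key, then start a new reduce group only when the
    // FIRST field changes -- mirroring GroupingComparator.compare(), which
    // ignores the second field. Each group yields one output line.
    static Map<String, Integer> reduceWithGrouping(String[] lines) {
        String[] sorted = lines.clone();
        Arrays.sort(sorted); // lexicographic: fruit name, then type
        Map<String, Integer> out = new LinkedHashMap<>();
        String current = null;
        int sum = 0;
        for (String line : sorted) {
            String[] f = line.split(" ");
            if (current != null && !current.equals(f[0])) {
                out.put(current, sum); // group closed: emit fruit total
                sum = 0;
            }
            current = f[0];
            sum += Integer.parseInt(f[2]);
        }
        if (current != null) out.put(current, sum);
        return out;
    }

    public static void main(String[] args) {
        reduceWithGrouping(DATA).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

One reduce call now sees all values for a fruit, so each output file holds a single line with the fruit's total.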
A good article on secondary sort: https://www.cnblogs.com/codeOfLife/p/5568786.html