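Parallel algorithms can compute far more than counts. wordCount is just one of the simpler examples; for many others, see the MapReduce-based parallel algorithm designs I uploaded.
Today I'll implement a simple sorting example. I'll keep the walkthrough short, because the detailed flow is already spelled out in the comments of my wordCount example.
The input is a set of files (file1, file2, ...) that contain numbers. The idea is to first split the numbers into ranges, for example 100-200 together, then 200-300, and so on, and hand each range to its own worker to sort.
Once every range is sorted, concatenating the results in range order gives the fully sorted output.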
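The range-then-concatenate idea can be sketched in plain Java, without Hadoop. This is only an illustration under stated assumptions: the class `BucketSortSketch`, the helper `bucketSort`, and the parameters `maxValue` and `numBuckets` are hypothetical names, and the values are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BucketSortSketch {
    // Hypothetical helper: split values into numBuckets contiguous ranges,
    // sort each range on its own, then concatenate the ranges in order.
    // Assumes every value is non-negative.
    static List<Integer> bucketSort(List<Integer> values, int maxValue, int numBuckets) {
        int width = maxValue / numBuckets + 1; // size of each value range
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int v : values) {
            // Values above maxValue are clamped into the last bucket
            int idx = Math.min(v / width, numBuckets - 1);
            buckets.get(idx).add(v);
        }
        List<Integer> result = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket); // each bucket could be sorted by a separate worker
            result.addAll(bucket);    // bucket order already matches value order
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(250, 120, 310, 199, 205, 101);
        // prints [101, 120, 199, 205, 250, 310]
        System.out.println(bucketSort(input, 399, 4));
    }
}
```

Because every value in one bucket is smaller than every value in the next, no merge step is needed after the per-bucket sorts.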
No more rambling, here is the code.
map
package sort;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* Created by zhangguanlong on 2017/11/15.
*/
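public class SortMapper extends Mapper<Object, Text, IntWritable, IntWritable>{
    // Reused for every record so that map() does not allocate new objects
    private final IntWritable data = new IntWritable();
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line must hold exactly one integer with no surrounding whitespace
        String line = value.toString();
        data.set(Integer.parseInt(line));
        // The number itself is the key, so the shuffle phase sorts it; the value 1 is a placeholder
        context.write(data, ONE);
    }
}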
reduce
package sort;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* Created by zhangguanlong on 2017/11/15.
*/
public class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>{
    // Running rank of the current number within this reducer's output
    private final IntWritable linenum = new IntWritable(1);
    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Keys arrive in ascending order; write one line per occurrence so duplicates are kept
        for (IntWritable val : values) {
            context.write(linenum, key);
            linenum.set(linenum.get() + 1);
        }
    }
}
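To see why the reducer needs no explicit sort call, here is a small Hadoop-free sketch of what happens on a single reducer. It is a simulation under assumptions, not the framework's actual code: the class `ShuffleSketch` and the method `run` are hypothetical, and a `TreeMap` stands in for the shuffle phase, which delivers keys to the reducer in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // Hypothetical stand-in for map + shuffle + reduce on one reducer.
    static List<String> run(List<Integer> numbers) {
        // "Map" and "shuffle": every number is emitted as (number, 1) and grouped by key.
        // TreeMap keeps keys in ascending order, mimicking what the reducer sees.
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int n : numbers) {
            grouped.computeIfAbsent(n, k -> new ArrayList<>()).add(1);
        }
        // "Reduce": one output line per grouped value, so duplicates are preserved.
        List<String> out = new ArrayList<>();
        int lineNum = 1;
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
            for (int ignored : e.getValue()) {
                out.add(lineNum + "\t" + e.getKey());
                lineNum++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList(5, 3, 5, 1))) {
            System.out.println(line);
        }
    }
}
```

For the input 5, 3, 5, 1 this yields four lines that pair ranks 1 to 4 with the values 1, 3, 5, 5, which is why the reducer loops over `values` instead of writing each key once.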
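Runner. For convenience the partitioner lives in this file too. By good design practice it should be split out, but I wrote it this way because it feels more comfortable. Maybe I'm a fake programmer 0.0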
package sort;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
* Created by zhangguanlong on 2017/11/15.
*/
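public class SortRunner {
    // Range partitioner: splits [0, MaxNumber] into numPartitions equal-width ranges,
    // so every key in partition i is smaller than every key in partition i + 1.
    public static class Partition extends Partitioner<IntWritable, IntWritable> {
        @Override
        public int getPartition(IntWritable key, IntWritable value,
                int numPartitions) {
            int MaxNumber = 65223; // assumed upper bound of the input values
            int bound = MaxNumber / numPartitions + 1;
            int keynumber = key.get();
            // i runs up to numPartitions inclusive so that the last partition is reachable
            for (int i = 1; i <= numPartitions; i++) {
                if (keynumber < bound * i)
                    return i - 1;
            }
            // Keys above MaxNumber go to the last partition to preserve the global order
            return numPartitions - 1;
        }
    }
    /**
     * @param args unused; the input and output paths are hard-coded below
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor (Hadoop 2.x and later)
        Job job = Job.getInstance(conf, "Sort");
        job.setJarByClass(SortRunner.class);
        job.setMapperClass(SortMapper.class);
        job.setPartitionerClass(Partition.class);
        job.setReducerClass(SortReducer.class);
        // The partitioner only matters with more than one reducer,
        // which can be requested via job.setNumReduceTasks(n)
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // Both paths are HDFS paths; the output directory must not exist yet
        FileInputFormat.addInputPath(job, new Path("/wc/sort1/"));
        FileOutputFormat.setOutputPath(job, new Path("/wc/sort2/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}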
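The range arithmetic behind the partitioner is easy to check on its own. Below is a loop-free sketch of the same mapping using integer division; the class `RangePartitionSketch` and the method `partitionFor` are hypothetical names, and clamping out-of-range keys into the first or last partition is an assumption made for this sketch.

```java
class RangePartitionSketch {
    // Hypothetical helper: keys in [0, maxNumber] are split into numPartitions
    // equal-width ranges, and the index of the range holding the key is returned.
    static int partitionFor(int key, int maxNumber, int numPartitions) {
        int bound = maxNumber / numPartitions + 1; // width of one range
        int idx = key / bound;
        // Clamp so that negative or oversized keys still map to a valid partition
        return Math.max(0, Math.min(idx, numPartitions - 1));
    }

    public static void main(String[] args) {
        int max = 65223;
        int parts = 4; // each range is then 16306 wide
        int[] samples = {0, 16305, 16306, 48917, 48918, 65223};
        for (int k : samples) {
            System.out.println(k + " -> " + partitionFor(k, max, parts));
        }
    }
}
```

With four partitions the sample keys land in partitions 0, 0, 1, 2, 3 and 3, so both the first and the last partition receive keys and the boundaries fall at multiples of 16306.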
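The paths in the code refer to the HDFS file system, so we need to upload the input files to HDFS.
Note that the files originally sat under the Linux root directory; if this is unfamiliar, look up the Hadoop and Linux shell commands. Then run: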
[hadoop@zhang ~]$ hadoop jar SProject.jar sort.SortRunner
Note that the output directory is hard-coded, so the job fails with an error if that directory already exists.
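A successful run looks like this:
Also note that the files must not contain any spaces and every line must be a number, because this is only a simple demo.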
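The restriction above comes from `Integer.parseInt`, which throws `NumberFormatException` on blank lines, surrounding whitespace, or non-numeric text. A minimal sketch of a more tolerant parser, assuming you would rather skip bad lines than fail the job; the class `LineParseSketch` and the method `parseOrNull` are hypothetical and not part of the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineParseSketch {
    // Hypothetical helper: returns the parsed value, or null when the line
    // is blank or is not a single clean integer.
    static Integer parseOrNull(String line) {
        String trimmed = line.trim(); // tolerate leading and trailing whitespace
        if (trimmed.isEmpty()) {
            return null;
        }
        try {
            return Integer.valueOf(trimmed);
        } catch (NumberFormatException e) {
            return null; // e.g. "4 2" or "abc"
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("42", " 7 ", "", "4 2", "abc");
        List<Integer> kept = new ArrayList<>();
        for (String line : lines) {
            Integer v = parseOrNull(line);
            if (v != null) {
                kept.add(v);
            }
        }
        // prints [42, 7]
        System.out.println(kept);
    }
}
```

In a mapper, the same check would go before `data.set(...)`, returning early for lines that yield null.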
After the run finishes, check the result.
over