A Simple Total-Order Sort Implementation in Hadoop

qwurey

Article link: https://blog.csdn.net/yeruby/article/details/21233661

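For my graduation project I used Hadoop's total-order sorting to process large data sets. I have been working with Hadoop for two months now and progress has been slow, which made me realize how important it is to join a good team and work through problems together. This post is a memento of the senior-year project I did on my own. Without further ado: I implemented total-order sorting for both integer and string keys.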

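Background:

1. The TeraSort idea:

There are many articles about TeraSort, but I could not find the classic original one. For the general idea, see: http://hi.baidu.com/dt_zhangwei/item/c2a80032c7dbc5ff96f88dbf

My understanding:

(1) If there is only one reducer, the output is a single file (part-r-00000), and Hadoop guarantees that the records in it are already sorted.

In that case, if the key type is Text, the records are sorted lexicographically;

if the key type is IntWritable, they are sorted numerically.

(2) If there is more than one reducer, each reducer's output is sorted, but there is no ordering guarantee across reducers. To get a total order we only need to ensure that the last record sent to reducer 0 is no greater than the first record sent to reducer 1, and so on, up to the last record sent to reducer n-2 being no greater than the first record sent to reducer n-1 (assuming job.setNumReduceTasks(n), that is, n reduce tasks, and ascending order).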
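To make the condition in (2) concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The class `RangePartitionSketch`, its methods, and the sample keys and split points are all made up for illustration: keys are routed into buckets by range, each bucket is sorted on its own (as each reducer would do), and the buckets are then read back in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RangePartitionSketch {

    /** Index of the first split point that is >= key, or splitPoints.length if there is none. */
    static int partitionOf(int key, int[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++) {
            if (key <= splitPoints[i]) {
                return i;
            }
        }
        return splitPoints.length;
    }

    /** Routes keys by range, sorts each bucket independently, then concatenates the buckets in order. */
    static List<Integer> totalSort(int[] keys, int[] splitPoints) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int key : keys) {
            buckets.get(partitionOf(key, splitPoints)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket); // stands in for the per-reducer sort
            out.addAll(bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] keys = {42, 7, 19, 3, 88, 61, 25, 7};
        int[] splitPoints = {10, 50}; // 2 split points, so 3 "reducers"
        System.out.println(totalSort(keys, splitPoints)); // [3, 7, 7, 19, 25, 42, 61, 88]
    }
}
```

Because every key in bucket i is no greater than every key in bucket i+1, concatenating the individually sorted buckets gives a globally sorted sequence.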

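So how do we achieve this?

In two steps: sampling, then using a Partitioner to tag each record (that is, to decide which reducer it is sent to).

2. Sampling

How it works: sampling runs on the JobClient side. The goal is to obtain n-1 sorted sample keys, which divide the key space into n ranges, one per reducer. During partitioning, comparing the key of the current key/value pair against these samples tells us which reducer the pair should go to.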
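A rough sketch of how n-1 split points can be picked from a set of sampled keys, in plain Java with int keys. `SplitPointSketch` and `choose` are illustrative names, and taking keys at an even stride through the sorted sample is one simple choice among several.

```java
import java.util.Arrays;

final class SplitPointSketch {

    /** Sorts the sampled keys and picks numPartitions - 1 of them at an even stride. */
    static int[] choose(int[] samples, int numPartitions) {
        if (numPartitions > samples.length) {
            throw new IllegalArgumentException("Requested more partitions than sampled keys");
        }
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        float step = sorted.length / (float) numPartitions; // stride through the sorted sample
        int[] result = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            result[i - 1] = sorted[Math.round(step * i)];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] samples = {50, 10, 90, 30, 70, 20, 80, 40, 60};
        // 9 samples, 3 partitions: stride 3.0, so indexes 3 and 6 of the sorted sample
        System.out.println(Arrays.toString(choose(samples, 3))); // [40, 70]
    }
}
```

The quality of the split points depends on how representative the sample is: a skewed sample gives uneven reducer loads, although the output is still totally ordered.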

So we need to write our own sampler class:

    // Requires: java.util.ArrayList, org.apache.hadoop.io.IntWritable,
    // org.apache.hadoop.util.IndexedSortable, org.apache.hadoop.util.QuickSort
    static class TextSampler implements IndexedSortable {

        // All sampled keys
        public ArrayList<IntWritable> records = new ArrayList<IntWritable>();

        @Override
        public int compare(int arg0, int arg1) {
            IntWritable left = records.get(arg0);
            IntWritable right = records.get(arg1);
            return left.compareTo(right);
        }

        @Override
        public void swap(int arg0, int arg1) {
            IntWritable left = records.get(arg0);
            IntWritable right = records.get(arg1);
            records.set(arg0, right);
            records.set(arg1, left);
        }

        public void addKey(IntWritable key) {
            records.add(key);
        }

        public IntWritable[] createPartitions(int numPartitions) {
            int numRecords = records.size();
            if (numPartitions > numRecords) {
                throw new IllegalArgumentException("Requested more partitions than input keys (" + numPartitions +
                        " > " + numRecords + ")");
            }
            new QuickSort().sort(this, 0, records.size());
            float stepSize = numRecords / (float) numPartitions; // stride used to pick split points
            IntWritable[] result = new IntWritable[numPartitions - 1];
            for (int i = 1; i < numPartitions; ++i) {
                // Pick n-1 split points out of the full sorted sample
                result[i - 1] = records.get(Math.round(stepSize * i));
            }
            return result;
        }
    }

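Notes: the class implements IndexedSortable, the abstract interface Hadoop defines for sortable collections. Any collection that is to be sorted must provide two methods: one that compares any two of its items, and one that swaps any two of its items. Once this interface is implemented, we can sort with Hadoop's built-in quicksort, as above: new QuickSort().sort(this, 0, records.size());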
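The point that a compare method and a swap method are all a sorter needs can be shown without Hadoop. In the sketch below, `Sortable` is a hypothetical stand-in for IndexedSortable, and a plain insertion sort is used in place of Hadoop's QuickSort to keep the example short; neither is Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CompareSwapSketch {

    /** Hypothetical stand-in for Hadoop's IndexedSortable: compare and swap by index. */
    interface Sortable {
        int compare(int i, int j);
        void swap(int i, int j);
    }

    /** Insertion sort over [from, to) that touches the data only through compare and swap. */
    static void sort(Sortable s, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            for (int j = i; j > from && s.compare(j - 1, j) > 0; j--) {
                s.swap(j - 1, j);
            }
        }
    }

    /** Returns a sorted copy of the input, using only the Sortable view of the data. */
    static List<Integer> sorted(List<Integer> input) {
        final List<Integer> data = new ArrayList<>(input);
        sort(new Sortable() {
            @Override
            public int compare(int i, int j) {
                return data.get(i).compareTo(data.get(j));
            }

            @Override
            public void swap(int i, int j) {
                Collections.swap(data, i, j);
            }
        }, 0, data.size());
        return data;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(5, 2, 9, 1))); // [1, 2, 5, 9]
    }
}
```

The sorter never sees the element type, which is why the same sorting code can be reused for any key type that supplies these two operations.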

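So where do the samples come from?

We take them from the input splits. The n-1 sample keys must be available before the job starts, so we need control over how the input is read, which means writing a custom class that implements the InputFormat interface. InputFormat does two things:

(1) InputSplit[] getSplits(JobConf job, int numSplits) throws IOException; computes the splits.

(2) RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException; processes one split and produces the key/value pairs for its data.

The split logic does not need to be overridden. What we do need is a custom class that implements the RecordReader interface.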

    // Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.io.*,
    // org.apache.hadoop.mapred.{FileSplit, LineRecordReader, RecordReader}
    static class TeraRecordReader implements RecordReader<IntWritable, Text> {

        private LineRecordReader in;                     // reads the split line by line
        private LongWritable junk = new LongWritable();  // byte offset of each line, discarded
        private Text line = new Text();

        public TeraRecordReader(Configuration job, FileSplit split) throws IOException {
            in = new LineRecordReader(job, split);
        }

        @Override
        public void close() throws IOException {
            in.close();
        }

        @Override
        public IntWritable createKey() {
            return new IntWritable();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return in.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return in.getProgress();
        }

        @Override
        public boolean next(IntWritable key, Text value) throws IOException {
            if (in.next(junk, line)) {
                // The line content becomes the integer key; the value is left empty.
                key.set(Integer.parseInt(line.toString().trim()));
                value.clear();
                return true;
            } else {
                return false;
            }
        }
    } // end RecordReader

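By default, every line of every split yields a key/value pair of the form <key = byte offset of the start of the line: LongWritable, value = content of the line: Text>. We need to turn it into the form we want, <key = content of the line: IntWritable, value = empty string: Text>, which is why next() is overridden as shown above.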
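The conversion can be illustrated in plain Java, without any Hadoop classes. The sketch assumes one integer per line, `\n` line endings, and ASCII input (so character counts equal byte offsets); `LineToKeySketch` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class LineToKeySketch {

    /** Keys a default line reader would produce: the offset at which each line starts. */
    static List<Long> offsetsOf(String fileContents) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            offsets.add(offset);
            offset += line.length() + 1; // +1 for the newline character
        }
        return offsets;
    }

    /** Keys after the conversion: the line content parsed as an integer. */
    static List<Integer> keysOf(String fileContents) {
        List<Integer> keys = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            keys.add(Integer.parseInt(line.trim()));
        }
        return keys;
    }

    public static void main(String[] args) {
        String contents = "42\n7\n1900\n";
        System.out.println(offsetsOf(contents)); // [0, 3, 5]
        System.out.println(keysOf(contents));    // [42, 7, 1900]
    }
}
```

Sorting on the offsets would only reproduce the input order, which is why the key has to be replaced by the parsed line content before the framework sorts it.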

At this point we can read key/value pairs in the desired format from the RecordReader. Next, we pick the records we want to use as samples out of the data we read:

    public static void writePartitionFile(JobConf conf, Path partFile) throws IOException {
        SamplerInputFormat inputFormat = new SamplerInputFormat();
        TextSampler sampler = new TextSampler();
        int partitions = conf.getNumReduceTasks(); // number of reduce tasks
        long sampleSize = conf.getLong(SAMPLE_SIZE, 100); // total number of keys to sample
        InputSplit[] splits = inputFormat.getSplits(conf, conf.getNumMapTasks()); // input splits
        int samples = Math.min(10, splits.length); // number of splits to sample from
        long recordsPerSample = sampleSize / samples; // keys to take from each sampled split
        int sampleStep = splits.length / samples; // stride between sampled splits
        long records = 0;
        IntWritable key = new IntWritable();
        Text value = new Text();
        for (int i = 0; i < samples; i++) {
            // Build a record reader for the selected split
            RecordReader<IntWritable, Text> reader =
                    inputFormat.getRecordReader(splits[sampleStep * i], conf, Reporter.NULL);
            try {
                while (reader.next(key, value)) {
                    sampler.addKey(key);
                    key = new IntWritable(); // fresh object: the sampler keeps a reference to the old one
                    value = new Text();
                    records += 1;
                    if ((i + 1) * recordsPerSample <= records) {
                        break;
                    }
                }
            } finally {
                reader.close(); // release the split's input stream
            }
        }
        FileSystem outFs = partFile.getFileSystem(conf);
        if (outFs.exists(partFile)) {
            outFs.delete(partFile, false);
        }
        SequenceFile.Writer writer = SequenceFile.createWriter(outFs, conf, partFile, IntWritable.class, NullWritable.class);
        NullWritable nullValue = NullWritable.get();
        for (IntWritable split : sampler.createPartitions(partitions)) {
            writer.append(split, nullValue);
        }
        writer.close();
    }

As shown above, the writer stores the (n-1) sample keys in a temporary partition file. Now the job can be started.
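The arithmetic that decides which splits are sampled and how many keys each contributes can be worked through in isolation. This is a plain-Java sketch of that arithmetic only; `SamplePlanSketch`, its method names, and the example numbers (25 splits, a cap of 10 sampled splits, 100 keys in total) are assumptions for illustration.

```java
import java.util.Arrays;

final class SamplePlanSketch {

    /** Indexes of the splits that get sampled: at most maxSampledSplits, spread at an even stride. */
    static int[] sampledSplitIndexes(int numSplits, int maxSampledSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("Need at least one input split");
        }
        int samples = Math.min(maxSampledSplits, numSplits);
        int step = numSplits / samples; // integer stride between sampled splits
        int[] indexes = new int[samples];
        for (int i = 0; i < samples; i++) {
            indexes[i] = step * i;
        }
        return indexes;
    }

    /** Number of keys each sampled split is expected to contribute. */
    static long recordsPerSplit(long sampleSize, int numSplits, int maxSampledSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("Need at least one input split");
        }
        return sampleSize / Math.min(maxSampledSplits, numSplits);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sampledSplitIndexes(25, 10))); // [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
        System.out.println(recordsPerSplit(100, 25, 10)); // 10
    }
}
```

Note that the cap in writePartitionFile is cumulative: the check compares the running total against (i + 1) * recordsPerSample, so a split that runs out of records early is made up for by reading more from the next sampled split. An input with zero splits would make the division fail, which is why the sketch rejects it explicitly.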

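3. Using the Partitioner to tag each record (that is, to decide which reducer handles it)

In the map-reduce flow, the partitioner is responsible for "telling" each record which reducer it belongs to. Here we decide where each record goes based on the temporary partition file written above, so we need a custom class that implements the Partitioner interface: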

    // Custom Partitioner
    public static class TotalOrderPartitioner implements Partitioner<IntWritable, NullWritable> {

        private IntWritable[] splitPoints;

        public TotalOrderPartitioner() {
        }

        @Override
        public int getPartition(IntWritable key, NullWritable value, int numReduceTasks) {
            return findPartition(key);
        }

        @Override
        public void configure(JobConf conf) {
            try {
                FileSystem fs = FileSystem.get(conf);
                Path partFile = new Path(SamplerInputFormat.PARTITION_FILENAME);
                splitPoints = readPartitions(fs, partFile, conf, splitPoints); // read the sampled split points
            } catch (IOException ie) {
                throw new IllegalArgumentException("can't read partitions file", ie);
            }
        }

        // Locate the partition by finding the interval that contains the key:
        // the first split point >= key gives the partition index, and a key
        // larger than every split point goes to the last partition.
        public int findPartition(IntWritable key) {
            int len = splitPoints.length;
            for (int i = 0; i < len; i++) {
                if (key.compareTo(splitPoints[i]) <= 0) {
                    return i;
                }
            }
            return len;
        }

        private static IntWritable[] readPartitions(FileSystem fs, Path p, JobConf job, IntWritable[] splitPoints) throws IOException {
            // The partition file is located through the DistributedCache; the p argument is not used.
            URI[] uris = DistributedCache.getCacheFiles(fs.getConf());
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(uris[0]), job);
            ArrayList<IntWritable> parts = new ArrayList<IntWritable>();
            IntWritable key = new IntWritable();
            NullWritable value = NullWritable.get();
            while (reader.next(key, value)) {
                parts.add(key);
                key = new IntWritable(); // fresh key object for the next record
            }
            reader.close();
            return parts.toArray(new IntWritable[parts.size()]);
        }
    }

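As shown above, a custom Partitioner only needs to implement two methods: getPartition() and configure().

(1) getPartition() returns an int between 0 and (number of reducers - 1) that determines which reducer the <key, value> pair is sent to.

(2) configure() uses the Hadoop job configuration to set up the partitioner and reads the sample data.

With this we control which data goes to which reducer, and that control preserves order across reducers. Hadoop automatically sorts the data within each reducer, so taken together the output is totally ordered.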
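The partitioner above scans the split points linearly, which is fine for a handful of reducers. With many reducers a binary search gives the same answer in logarithmic time. The sketch below is a plain-Java alternative on int keys, not the code from this post; `BinarySearchPartitionSketch` is an illustrative name, and it assumes the split points are sorted and distinct.

```java
import java.util.Arrays;

final class BinarySearchPartitionSketch {

    /**
     * Returns the index of the first split point that is >= key,
     * or splitPoints.length if the key is larger than all of them.
     */
    static int findPartition(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // A hit returns the index itself; a miss returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] splitPoints = {10, 50}; // 3 partitions: key <= 10, 10 < key <= 50, key > 50
        for (int key : new int[]{3, 10, 11, 50, 51}) {
            System.out.println(key + " -> partition " + findPartition(key, splitPoints));
        }
    }
}
```

If the split points contain duplicates, the binary search may land on any of the equal entries, so the chosen partition can differ from the linear scan while still respecting the order between partitions.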

The above covers total-order sorting of integers; total-order sorting of strings works in much the same way.
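One thing to keep in mind when switching key types is that the same input sorts differently as strings than as integers, as noted earlier for Text versus IntWritable keys. The plain-Java sketch below uses String and int as stand-ins for those Hadoop types; for ASCII digits, as assumed here, Java's String ordering matches a byte-wise comparison. `KeyOrderSketch` is an illustrative name.

```java
import java.util.Arrays;

final class KeyOrderSketch {

    /** Sorts the keys as strings (lexicographic order). */
    static String[] sortAsStrings(String[] keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Parses the keys as integers and sorts them numerically. */
    static int[] sortAsInts(String[] keys) {
        int[] values = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            values[i] = Integer.parseInt(keys[i]);
        }
        Arrays.sort(values);
        return values;
    }

    public static void main(String[] args) {
        String[] keys = {"10", "9", "100", "2"};
        System.out.println(Arrays.toString(sortAsStrings(keys))); // [10, 100, 2, 9]
        System.out.println(Arrays.toString(sortAsInts(keys)));    // [2, 9, 10, 100]
    }
}
```

The split points have to be compared with the same ordering the reducers sort by, otherwise the ranges and the per-reducer order disagree.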

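Note: in my pseudo-distributed setup the number of reducers could only be 0 or 1, and setting any other value had no effect. This limit is characteristic of the local job runner in older Hadoop releases, so check that the job is not falling back to local execution if you need more than one reducer.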