Hadoop and Spark Algorithm Analysis (2): Sorting

Latest recommended article published on 2022-04-11 20:15:00

tmac1027

Category: Big Data. Tags: spark, hadoop

Article link: https://blog.csdn.net/tmac1027/article/details/78454994

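Sorting data is an important step in real jobs, because it lays the groundwork for later processing.

1. Experiment Setup

In this experiment, each number is stored on its own line of the input file. The input file is generated randomly by a Linux shell script, makeNumber.sh, whose content is as follows: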

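#!/bin/bash
# Usage: ./makeNumber.sh <line-count> <output-file>
# Appends <line-count> random integers (Bash $RANDOM, range 0-32767), one per line.
for i in $(seq 1 "$1")
do
    echo "$RANDOM" >> "$2"
done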

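The first argument is the number of lines to generate, and the second is the path of the input file to create. After writing the script, make it executable:

$ chmod a+x ./makeNumber.sh

Run the script to generate an input file. For example, to generate a file of 10 random numbers at ~/testFile:

$ ./makeNumber.sh 10 ~/testFile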

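2. Hadoop Implementation

As the WordCount example shows, MapReduce sorts by key by default, and this sorting happens separately for each reduce task. The data within one reducer is therefore ordered, but nothing orders the reducers relative to each other. To get a globally sorted result we define a custom Partition class. It divides the largest value that may appear in the input by the number of partitions and uses the quotient as the width of each partition's value range, so that every key in partition i is smaller than every key in partition i+1 and the reducer outputs are ordered as a whole. In the reduce phase, the key received from the map side has to be written out as the value. The implementation is as follows: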

package org.hadoop.test;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Sort {
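    public static class SortMapper
            extends Mapper<Object, Text, IntWritable, IntWritable> {
        private final IntWritable data = new IntWritable();
        private final IntWritable one = new IntWritable(1);

        // Emit each number as the key so that the shuffle sorts it; the value is a placeholder.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                return; // skip blank lines instead of failing in parseInt
            }
            data.set(Integer.parseInt(line));
            context.write(data, one);
        }
    }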

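    public static class SortReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        // Running line number. It restarts at 1 in every reducer, so it is a rank
        // within one output file, not across the whole job.
        private final IntWritable linenum = new IntWritable(1);

        // Keys arrive in ascending order; write the key once per occurrence so duplicates are kept.
        @Override
        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable value : values) {
                context.write(linenum, key);
                linenum.set(linenum.get() + 1);
            }
        }
    }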

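    public static class Partition
            extends Partitioner<IntWritable, IntWritable> {

        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            // Assumed upper bound on the input values. Bash $RANDOM never exceeds 32767,
            // so max could be lowered for that input to balance the partitions.
            int max = 999999;
            // Width of the value range owned by each partition.
            int bound = max / numPartitions + 1;
            int index = key.get() / bound;
            // Clamp so that keys outside [0, max] still map to a valid partition index.
            return Math.max(0, Math.min(index, numPartitions - 1));
        }
    }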

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: sort <in> <out>");
            System.exit(2);
        }

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "Sort");
        job.setJarByClass(Sort.class);
        job.setMapperClass(SortMapper.class);
        job.setPartitionerClass(Partition.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
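The range-partitioning idea behind the Partition class can be checked without a cluster. The following is a minimal plain-Java sketch, not part of the original job: the class name RangePartitionSketch and both method names are made up for illustration, and the shuffle is simulated with in-memory lists. It assigns each value to a partition by integer division, sorts each partition independently, and concatenates the partitions in index order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class RangePartitionSketch {
    // Same boundary rule as the Partition class: each partition owns `bound` consecutive values.
    public static int partitionFor(int value, int max, int numPartitions) {
        int bound = max / numPartitions + 1;
        int index = value / bound;
        // Clamp so that out-of-range input still maps to a valid partition.
        return Math.max(0, Math.min(index, numPartitions - 1));
    }

    // Simulates shuffle plus per-reducer sort, then concatenates the partitions in index order.
    public static List<Integer> partitionedSort(List<Integer> input, int max, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int v : input) {
            parts.get(partitionFor(v, max, numPartitions)).add(v);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : parts) {
            Collections.sort(p); // each "reducer" sorts only its own range
            out.addAll(p);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(31000, 12, 20480, 7, 16383, 16384, 32767);
        // max = 32767 matches the range of Bash $RANDOM; 4 partitions is an arbitrary choice.
        System.out.println(partitionedSort(input, 32767, 4));
    }
}
```

Because every value in partition i is below every value in partition i+1, concatenating the independently sorted partitions gives a globally sorted list. The balance between partitions depends on how well max matches the real data range.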

3. Spark Implementation

A Spark RDD is sorted with the sortBy operator. The implementation is as follows:

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by rose on 16-4-27.
  */
object Sort {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("Usage: <in> <out>")
      return
    }

    val conf = new SparkConf().setAppName("Sort")
    val sc = new SparkContext(conf)
    // Each input line holds one integer; drop blank lines before parsing.
    val textRDD = sc.textFile(args(0))
    val result = textRDD
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.toInt)
      .sortBy(x => x, ascending = true) // sort by the value itself, ascending
    result.saveAsTextFile(args(1))
    sc.stop()
  }
}
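Unlike the fixed boundaries in the Hadoop version, Spark's sortBy relies on a range partitioner that derives its boundaries from a sample of the keys. The plain-Java sketch below only illustrates that idea and is far simpler than Spark's actual code: the class SampledBoundsSketch and its methods are hypothetical, and the sample is assumed to be non-empty and at least as large as the number of partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, for illustration only; not Spark's implementation.
public class SampledBoundsSketch {
    // Picks numPartitions - 1 split points at evenly spaced ranks of the sorted sample.
    // Assumes the sample is non-empty.
    public static int[] splitPoints(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] bounds = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            bounds[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return bounds;
    }

    // A value belongs to the first partition whose upper split point is greater than it.
    public static int partitionFor(int value, int[] bounds) {
        int p = 0;
        while (p < bounds.length && value >= bounds[p]) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        List<Integer> sample = Arrays.asList(5, 1, 9, 3, 7, 2, 8, 4);
        int[] bounds = splitPoints(sample, 4);
        System.out.println(Arrays.toString(bounds));
        System.out.println(partitionFor(6, bounds));
    }
}
```

Boundaries taken from a sample follow the actual distribution of the data, so the partitions stay roughly balanced even when the value range is not known in advance.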

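4. Running the Programs

1) Upload the local files to HDFS
Create the input directory on HDFS:

$ hadoop fs -mkdir -p sort/input

Upload the local files to the input directory on the cluster:

$ hadoop fs -put ~/file* sort/input

List the directory on the cluster:

$ hadoop fs -ls sort/input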
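2) Run the programs
Package the Sort program as a jar file, Sort.jar, and change to the directory that contains it. The commands below use one input file named file and one output directory named output as an example.
Run the Hadoop program with the following command:

$ hadoop jar ~/hadoop/Sort.jar org.hadoop.test.Sort sort/input/file sort/hadoop/output

Run the Spark program with the following command:

$ spark-submit --master yarn-client --class Sort ~/spark/Sort.jar hdfs://master:9000/sort/input/file hdfs://master:9000/sort/spark/output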
3) View the results
View the Hadoop output:

$ hadoop fs -ls sort/hadoop/output

View the Spark output:

$ hadoop fs -ls sort/spark/output
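Listing the output directories only shows which files were written. To confirm that the content is actually ordered, a small checker such as the sketch below can help. It is an assumption of this example, not part of the original programs: the class SortedOutputCheck is hypothetical, it reads lines from standard input, and it treats the last whitespace-separated field of each line as the sorted value, which fits both the "line number, value" lines of the Hadoop job and the bare numbers written by Spark.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class SortedOutputCheck {
    // Returns the last whitespace-separated field of a line as an int.
    public static int valueOf(String line) {
        String[] fields = line.trim().split("\\s+");
        return Integer.parseInt(fields[fields.length - 1]);
    }

    // True if the values never decrease from one line to the next.
    public static boolean isNonDecreasing(List<String> lines) {
        for (int i = 1; i < lines.size(); i++) {
            if (valueOf(lines.get(i - 1)) > valueOf(lines.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    lines.add(line); // ignore blank lines
                }
            }
        }
        System.out.println(isNonDecreasing(lines) ? "sorted" : "NOT sorted");
    }
}
```

One way to use it is to pipe the concatenated output files of a job into the program with hadoop fs -cat. The files must be concatenated in partition order, and their names depend on the job.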

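5. Test Comparison

The figure shows the comparison of the sorting tests. The two frameworks sort a single-file dataset with similar efficiency. Hadoop sorts the data partition by partition and its running time is more stable, while Spark sorts with the sortBy operator and leaves more room for optimization.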
