Spark Training Series: Beginner Notes 23

一只懒得睁眼的猫

Category: Spark. Tags: spark, secondary sort

Article link: https://blog.csdn.net/a2011480169/article/details/53694417

Topics covered:
1. The basic sorting algorithm in Spark
2. The secondary sort algorithm in Spark
3. Some thoughts on sorting in Spark

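Today we look at secondary sort in Spark. As usual, we start with the simplest case, a basic sort.
Where sorting stands: sorting is very important, but it is not the most frequently used operation. Sorting on more than three dimensions is uncommon, and in practice even going beyond a secondary (two-column) sort is rare.
Example program 1: a simple sort on a single key
Note: sortByKey is only available on RDDs of key-value pairs, so the data must be arranged as (key, value) before sorting.
Input data: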

Hello Spark Hello Scala
Hello Hadoop
Hello Hbase
Spark Hadoop 
Java Spark
Hello Hadoop
Hello Hbase

Straight to the code:

package com.appache.spark.app

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by hp on 2016/12/15.
  * Purpose: the simplest possible sort, based on a single key.
*/

object App2
{
  def main(args: Array[String]): Unit =
  {
      val conf = new SparkConf()
      conf.setAppName("SimpleSort")
      conf.setMaster("local")

      val sc = new SparkContext(conf)
      val lines:RDD[String] = sc.textFile("C:\\word.txt")
      val words:RDD[String] = lines.flatMap(line =>  // one line of the input file
      {
          val splited:Array[String] = line.split(" ")
          splited
      })
      val pairs:RDD[(String,Int)] = words.map( word =>
      {
        (word,1)
      })
      val wordCounts:RDD[(String,Int)] = pairs.reduceByKey((a,b)=>
      {
        a + b
      })
      // Swap to (count, word) so that sortByKey sorts by count in descending order, then swap back
      val result:RDD[(String,Int)] = wordCounts.map(wordnum => (wordnum._2,wordnum._1)).sortByKey(false)
        .map(wordnum => (wordnum._2,wordnum._1))

      result.collect.foreach(wordnum => println(wordnum._1+"\t"+wordnum._2))
      sc.stop()
  }
}

Output:

Hello   6
Spark   3
Hadoop  3
Hbase   2
Java    1
Scala   1
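
The swap, sort, swap-back pattern used above can be tried without a cluster. The following is only a sketch with plain Scala collections (no Spark involved); the object and method names are made up for illustration, and the input is the sample data shown earlier.

```scala
// Plain Scala collections, no Spark: mirrors the
// (word, count) -> (count, word) -> sort descending -> (word, count) pipeline.
object LocalSortByCount {
  // Same sample lines as the input data above
  val lines: Seq[String] = Seq(
    "Hello Spark Hello Scala",
    "Hello Hadoop",
    "Hello Hbase",
    "Spark Hadoop ",
    "Java Spark",
    "Hello Hadoop",
    "Hello Hbase"
  )

  def sortedCounts(input: Seq[String]): Seq[(String, Int)] = {
    // Count the occurrences of each word
    val counts: Seq[(String, Int)] = input
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq

    counts
      .map { case (word, n) => (n, word) }   // swap so the count becomes the key
      .sortBy(_._1)(Ordering.Int.reverse)    // sort by count, descending
      .map { case (n, word) => (word, n) }   // swap back to (word, count)
  }

  def main(args: Array[String]): Unit =
    sortedCounts(lines).foreach { case (word, n) => println(word + "\t" + n) }
}
```

Words with the same count (for example Spark and Hadoop) may come out in either order, since only the count takes part in the comparison.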

Next, let's run the same program again in spark-shell:
Output:

res1: Array[(String, Int)] = Array((Hello,6), (Spark,3), (Hadoop,3), (Hbase,2), (Java,1), (Scala,1))

The DAG (directed acyclic graph) of the job during execution:
[Figure: DAG of the job]

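Analysis: the run log shows 2 tasks in the first stage. The reason is that the input for this run consisted of 2 files, each smaller than 128 MB, so each file maps to one input split and each split maps to one task. The second stage has only one task because the parallelism was set to 1 in the reduceByKey call; in other words, every task of the first stage writes its output into a single partition, so the second stage sees one partition and runs one task. The third stage also has a single task because the preceding parallelism is 1 and parallelism is inherited by the stages that follow.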
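Next comes secondary sort. Secondary sort means that two dimensions are taken into account when sorting. (The sort above looked at a single column; now we need to look at two.)
Approach: define a custom key for the secondary sort
Steps:
1> Implement a custom sort key that extends Ordered and Serializable
2> Load the file to be sorted and build an RDD of (key, value) pairs
3> Call sortByKey to sort on the custom key
4> Drop the sort key and keep only the sorted result
Note: interviews almost always ask about secondary sort and about how RangePartitioner works under the hood.
Now for the actual secondary sort:
Input data (each line holds two space-separated integers):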

Sort requirement: sort by the first column in descending order; when the first column ties, sort by the second column in ascending order.
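
To make the rule concrete, here is a small worked example in plain Scala (no Spark). The rows are hypothetical sample values, not the article's input file, and the object name is made up for illustration.

```scala
object SecondarySortRule {
  // Hypothetical sample rows: (first column, second column)
  val rows: Seq[(Int, Int)] = Seq((2, 3), (4, 1), (3, 6), (4, 3), (2, 1), (4, 2))

  // First column descending; ties are broken by the second column ascending
  val rule: Ordering[(Int, Int)] = Ordering.Tuple2(Ordering.Int.reverse, Ordering.Int)

  def main(args: Array[String]): Unit =
    rows.sorted(rule).foreach { case (a, b) => println(s"$a $b") }
}
```

With these rows the three entries whose first column is 4 come first, ordered 1, 2, 3 by their second column, followed by (3, 6), then (2, 1) and (2, 3).
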
Implementation:

package com.appache.spark.app

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

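/**
  * Created by hp on 2016/12/15.
  * Purpose: secondary sort.
  * Sort rule: first column descending; when the first column ties, second column ascending.
*/

// The key step is defining a custom key for the secondary sort
class  SecondarySortKey(val first:Int,val second:Int) extends Ordered[SecondarySortKey] with Serializable
{
  override def compare(other: SecondarySortKey):Int =
  {
     // Compare the first column first.
     // Integer.compare is used instead of subtraction, which can overflow for large values.
     if(this.first != other.first)
       Integer.compare(other.first, this.first)   // descending
     else // first column is equal
       Integer.compare(this.second, other.second) // ascending
  }
}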
object App2
{
  def main(args: Array[String]): Unit =
  {
      val conf = new SparkConf()
      conf.setMaster("local")
      conf.setAppName("SecondarySort")

      val sc:SparkContext = new SparkContext(conf)
      val lines:RDD[String] = sc.textFile("C:\\word2.txt")
      val pairWithSortKey:RDD[(SecondarySortKey,String)] = lines.map(line =>  // one line of the input file
      {
          val splited:Array[String] = line.split(" ")
          val first = splited(0)
          val second = splited(1)
          (new SecondarySortKey(first.toInt,second.toInt),line)
      }) // each element of pairWithSortKey is a (key, line) tuple

      val sorted:RDD[(SecondarySortKey,String)] = pairWithSortKey.sortByKey()
      // Finally, drop the sort key and keep only the sorted lines
      val sortResult:RDD[String] = sorted.map(sortedLine => sortedLine._2)

      // Print the result
      sortResult.collect.foreach(println)

      sc.stop()
  }
}

Output:

Next, let's test a few key points:
1> The custom key class does not implement the Serializable interface
Result: an exception is thrown

Serialization stack:
    - object not serializable (class: com.appache.spark.app.SecondarySortKey, value: com.appache.spark.app.SecondarySortKey@49ce8d1b)
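
The exception is consistent with how a sort works in Spark: sortByKey shuffles records between tasks, and with Spark's default Java serializer every key that is shuffled has to implement java.io.Serializable. The check below is only a sketch in plain Scala (no Spark); PlainKey, SerializableKey and canSerialize are hypothetical names introduced for this illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical key classes used only for this illustration
class PlainKey(val first: Int, val second: Int)
class SerializableKey(val first: Int, val second: Int) extends Serializable

object SerializationCheck {
  // Returns true when plain Java serialization accepts the object
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    println(canSerialize(new PlainKey(1, 2)))        // false
    println(canSerialize(new SerializableKey(1, 2))) // true
  }
}
```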

2> The custom key class implements the Comparable interface instead of the Ordered trait

class  SecondarySortKey(val first:Int,val second:Int) extends Comparable[SecondarySortKey] with Serializable
{
  override def compareTo(other: SecondarySortKey):Int =
  {
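    // Compare the first column first (Integer.compare avoids the overflow that subtraction can cause)
    if(this.first != other.first)
      Integer.compare(other.first, this.first)   // descending
    else // first column is equal
      Integer.compare(this.second, other.second) // ascending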
  }
}

Result: it works!
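
A likely explanation for why Comparable is enough: sortByKey asks for an implicit Ordering of the key type, and the Scala standard library can derive one for any class that implements java.lang.Comparable (Ordered itself extends Comparable). The sketch below shows that derivation in plain Scala without Spark; ComparableKey and the demo object are hypothetical names.

```scala
// Hypothetical key class for illustration; it implements only java.lang.Comparable
class ComparableKey(val first: Int, val second: Int) extends Comparable[ComparableKey] {
  override def compareTo(other: ComparableKey): Int =
    if (first != other.first) Integer.compare(other.first, first) // first column descending
    else Integer.compare(second, other.second)                    // second column ascending

  override def toString: String = s"$first $second"
}

object ComparableOrderingDemo {
  // Compiles only because an Ordering can be derived from Comparable
  val derived: Ordering[ComparableKey] = implicitly[Ordering[ComparableKey]]

  def sortedKeys(keys: Seq[ComparableKey]): Seq[String] =
    keys.sorted(derived).map(_.toString)

  def main(args: Array[String]): Unit =
    sortedKeys(Seq(new ComparableKey(1, 5), new ComparableKey(3, 2), new ComparableKey(3, 1)))
      .foreach(println)
}
```
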
3> Try the more verbose, Java-style way of writing the class:

class  SecondarySortKey extends Comparable[SecondarySortKey] with Serializable
{
  var first:Int = _
  var second:Int = _
  def this(first:Int,second:Int)
  {
    this()
    this.first = first
    this.second = second
  }
  override def compareTo(other: SecondarySortKey):Int =
  {
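    // Compare the first column first (Integer.compare avoids the overflow that subtraction can cause)
    if(this.first != other.first)
      Integer.compare(other.first, this.first)   // descending
    else // first column is equal
      Integer.compare(this.second, other.second) // ascending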
  }
}
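
One caveat about comparators in general: a comparison is sometimes written as a subtraction such as this.first - other.first. That shortcut is fine for small values but wraps around when the true difference does not fit in an Int, which reverses the sign. The sketch below contrasts it with Integer.compare; the object and method names are made up for illustration.

```scala
object CompareOverflow {
  // Subtraction-style comparison: wraps around when the difference does not fit in an Int
  def bySubtraction(a: Int, b: Int): Int = a - b

  // Overflow-safe comparison from the Java standard library
  def bySafeCompare(a: Int, b: Int): Int = Integer.compare(a, b)

  def main(args: Array[String]): Unit = {
    val small = Int.MinValue
    val large = 1
    // small < large, so a correct comparator must return a negative number
    println(bySubtraction(small, large) < 0) // false: Int.MinValue - 1 wraps to Int.MaxValue
    println(bySafeCompare(small, large) < 0) // true
  }
}
```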

Of course, the program can also be written like this:

package com.appache.spark.app

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

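/**
  * Created by hp on 2016/12/15.
  * Purpose: secondary sort.
  * Sort rule: first column ascending; when the first column ties, second column also ascending.
*/

// The key step is defining a custom key for the secondary sort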
class SecondarySortKey extends Comparable[SecondarySortKey] with Serializable
{
    var first:Int = _
    var second:Int = _
    def this(first:Int,second:Int)
    {
      this()
      this.first = first
      this.second = second
    }
    override def compareTo(other: SecondarySortKey):Int =
    {
       if(this.first != other.first)
          Integer.compare(this.first, other.first)   // first column ascending
       else
          Integer.compare(this.second, other.second) // second column ascending
    }

    override def toString =
    {
      this.first+"\t"+this.second
    }
}
object App2
{
  def main(args: Array[String]): Unit =
  {
     val conf = new SparkConf()
     conf.setMaster("local")
     conf.setAppName("SecondSort")

     val sc = new SparkContext(conf)
     val lines:RDD[String] = sc.textFile("C:\\word2.txt")
     val pair:RDD[(SecondarySortKey,Null)] = lines.map(line =>  // one line of the input file
     {
         val splited:Array[String] = line.split(" ")
         val first = splited(0)
         val second = splited(1)
         val keyvalue = (new SecondarySortKey(first.toInt,second.toInt),null)
         keyvalue
     })
     val results:RDD[(SecondarySortKey,Null)] = pair.sortByKey()
     results.collect().foreach(wordline => println(wordline._1))  // print only the key; the value is not needed

     sc.stop()
  }
}

Output:

Now let's implement the same functionality with MapReduce:
Input data:

Sort rule: first column ascending, second column descending
Example code:

package com.appache.seconadarySort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;



// Purpose: implement secondary sort in MapReduce
class SecondarySortKey implements WritableComparable<SecondarySortKey>
{
    public int first;
    public int second;
    public SecondarySortKey(){}
    public SecondarySortKey(int first,int second)
    {
        this.first = first;
        this.second = second;
    }
    // As the code shows, both MapReduce and Spark need serialization and deserialization
    @Override
    public void write(DataOutput fw) throws IOException 
    {
        fw.writeInt(first);  // serialization: write the object's fields to the byte output stream
        fw.writeInt(second);
    }
    @Override
    public void readFields(DataInput fr) throws IOException 
    {
        this.first = fr.readInt();  // deserialization: read the fields back from the byte input stream
        this.second = fr.readInt();
    }
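    @Override
    public int compareTo(SecondarySortKey other) 
    {
        // Matches the stated rule: first column ascending, second column descending.
        // Integer.compare avoids the overflow that subtraction can cause.
        if(this.first != other.first)
            return Integer.compare(this.first, other.first);
        else
            return Integer.compare(other.second, this.second);
    }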
    @Override
    public String toString()
    {
        return this.first+"\t"+this.second;
    }
}
public class Sort2
{
   public static String path1 = "C:\\word2.txt";
   public static String path2 = "C:\\dirout\\";
   public static void main(String[] args) throws Exception
   {
      Configuration conf = new Configuration(); 
      FileSystem fileSystem = FileSystem.get(conf);

      if(fileSystem.exists(new Path(path2)))  // delete the output path if it already exists
      {
          fileSystem.delete(new Path(path2), true);
      }

      Job job = Job.getInstance(conf, "SortSecond");
      job.setJarByClass(Sort2.class);


      // Configure the driver
      FileInputFormat.setInputPaths(job, new Path(path1));
      job.setInputFormatClass(TextInputFormat.class);
      job.setMapperClass(MyMapper.class);
      job.setMapOutputKeyClass(SecondarySortKey.class);
      job.setMapOutputValueClass(NullWritable.class);
      // shuffle phase
      job.setReducerClass(MyReducer.class);
      job.setOutputKeyClass(SecondarySortKey.class);
      job.setOutputValueClass(NullWritable.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileOutputFormat.setOutputPath(job, new Path(path2));   
      // Submit the job to the JobTracker and wait for it to finish
      job.waitForCompletion(true);
      // Print the program's output
      FSDataInputStream fr = fileSystem.open(new Path("C:\\dirout\\part-r-00000"));
      IOUtils.copyBytes(fr,System.out,1024,true);
   }
   public static class MyMapper extends Mapper<LongWritable, Text, SecondarySortKey, NullWritable>
   {
       @Override
       protected void map(LongWritable k1,Text v1,Context context)throws IOException, InterruptedException 
       {
          String[] splited = v1.toString().split(" ");
          String first = splited[0];
          String second = splited[1];
          SecondarySortKey k2 = new SecondarySortKey(Integer.valueOf(first), Integer.valueOf(second));
          context.write(k2, NullWritable.get());
       }
   }
   // partitioning, sorting, grouping: the shuffle phase
   public static class MyReducer extends Reducer<SecondarySortKey, NullWritable, SecondarySortKey, NullWritable>
   {
       @Override
       protected void reduce(SecondarySortKey k2,Iterable<NullWritable> v2s,Context context)throws IOException, InterruptedException 
       {
           for(NullWritable v2 : v2s)  // e.g. <{20 10},NullWritable>; writes one record per duplicate key
           {
               context.write(k2, NullWritable.get());
           }
       } 
   }
}

Output:

Heh, MapReduce really does feel like a lot more work!
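To close, a few open questions. Sorting data on a distributed cluster raises a problem: with the data spread across hundreds or thousands of machines, how do you know which range of keys each machine holds? Does the system really walk through every record one by one? How exactly does RangePartitioner divide the data into partitions, and what really happens inside a cluster-level sort?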
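
As a starting point for these questions, the general idea behind range partitioning can be sketched in a few lines: sample some keys, pick boundary keys from the sorted sample, route every key to the partition whose range contains it, sort each partition locally, and read the partitions in order. The code below is a simplified illustration in plain Scala under those assumptions, not Spark's actual RangePartitioner implementation; all names and the sample values are hypothetical.

```scala
object RangePartitionSketch {
  // Pick numPartitions - 1 boundary keys from a sorted sample of the data
  def boundaries(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    require(sample.nonEmpty && numPartitions > 0, "need a non-empty sample and at least one partition")
    val sorted = sample.sorted.toVector
    (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions)).toVector
  }

  // A key goes to the first partition whose upper boundary is >= the key;
  // keys above every boundary go to the last partition
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx >= 0) idx else bounds.length
  }

  // Route keys to partitions, sort each partition locally, then concatenate in partition order
  def rangeSort(data: Seq[Int], sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val bounds = boundaries(sample, numPartitions)
    val byPartition = data.groupBy(partitionOf(_, bounds))
    (0 until numPartitions).flatMap(p => byPartition.getOrElse(p, Seq.empty[Int]).sorted)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(42, 7, 19, 88, 3, 61, 25, 70, 14, 55)
    val sample = Seq(7, 88, 25, 55, 14) // pretend these keys were sampled from the data
    println(boundaries(sample, 3))
    println(rangeSort(data, sample, 3))
  }
}
```

Because every key in partition i is no larger than every key in partition i + 1, sorting inside each partition is enough to obtain a globally sorted result; no machine has to look at all of the data.
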
OK, keep at it!