[Implementing Secondary Sort in Spark]

fazhi-bb

Categories: Spark, scala, Data Algorithms: Big Data Processing with Spark, Advanced Spark
Tags: secondary sort

Article link: https://blog.csdn.net/luofazha2012/article/details/80587128

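How secondary sort works

Secondary sort orders the rows by a first field and then orders the rows that share the same first-field value by a second field. The second pass must not disturb the order produced by the first.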
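As a minimal sketch of this definition, the snippet below sorts a handful of rows by both fields using plain Scala collections rather than Spark. The object name `TwoFieldSort` and the sample rows are made up for illustration.

```scala
object TwoFieldSort {
  // Hypothetical sample rows: (first field, second field).
  val rows: Seq[(String, Int)] = Seq(("b", 2), ("a", 3), ("b", 1), ("a", 1))

  // Tuples compare element by element, so sorting on the pair orders the rows
  // by the first field and breaks ties with the second field.
  val sorted: Seq[(String, Int)] =
    rows.sortBy { case (first, second) => (first, second) }

  def main(args: Array[String]): Unit =
    println(sorted.mkString(", "))
}
```

Rows with the same first field stay together, and only their relative order is decided by the second field.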

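The secondary sort technique

Suppose the key K has the following values:

(K, V1), (K, V2), …, (K, Vn)

Suppose further that each Vi is a tuple of m attributes:

(Ai1, Ai2, …, Aim)

Here we want the reducer's values sorted by Ai1. Writing Ai for Ai1 and Ri for the remaining attributes (Ai2, …, Aim), the reducer's values can be written as:

(K, (A1, R1)), (K, (A2, R2)), …, (K, (An, Rn))

To sort the reducer's values by Ai, we create a composite key (K, Ai). For the key K, the new mapper emits the key-value pairs shown in the table below: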

Key	Value
(K, A1)	(A1, R1)
(K, A2)	(A2, R2)
…	…
(K, An)	(An, Rn)

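As the table shows, the composite key is (K, Ai) and the natural key is K: the composite key is the natural key extended with the attribute Ai. When secondary sort runs on the MapReduce framework, partitioning is done on the natural key (K) while sorting is done on the composite key, so all records for one K reach the same reducer already ordered by Ai.

[Figure: sort order of the natural key and the composite key]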
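The interplay of the two keys can be sketched without a cluster. The snippet below simulates the technique on plain Scala collections: it promotes A into the key, partitions on the natural key only, and sorts each partition by the composite key. The object name `CompositeKeyDemo`, the records and the partition count are illustrative assumptions, not part of the original article.

```scala
object CompositeKeyDemo {
  // Hypothetical records (K, (A, R)): natural key, sort attribute, other attributes.
  val records: Seq[(String, (Int, String))] = Seq(
    ("k1", (3, "r3")), ("k2", (5, "r5")), ("k1", (1, "r1")),
    ("k2", (4, "r4")), ("k1", (2, "r2"))
  )

  // Mapper step: build the composite key, giving ((K, A), (A, R)).
  val keyed: Seq[((String, Int), (Int, String))] =
    records.map { case (k, (a, r)) => ((k, a), (a, r)) }

  // Partition on the natural key only, then sort each partition by the composite key.
  def partitionAndSort(numPartitions: Int): Map[Int, Seq[((String, Int), (Int, String))]] =
    keyed
      .groupBy { case ((k, _), _) => math.abs(k.hashCode % numPartitions) }
      .map { case (p, rows) => p -> rows.sortBy(_._1) }

  // What each reducer sees: the A values per natural key, in arrival order.
  def reducerView(numPartitions: Int): Map[String, Seq[Int]] =
    partitionAndSort(numPartitions).values.flatten.toSeq
      .groupBy { case ((k, _), _) => k }
      .map { case (k, rows) => k -> rows.map { case (_, (a, _)) => a } }

  def main(args: Array[String]): Unit =
    reducerView(2).foreach(println)
}
```

Because the partition is chosen from K alone, adding A to the key changes the order of the records but never which reducer receives them.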

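Consider the following secondary sort problem. A scientific experiment produces temperature readings like the ones below (the columns are year, month, day and temperature):

2015 1 1 10
2015 1 2 11
2015 1 3 12
2015 1 4 13
…
2015 2 1 22
2015 2 2 23
2015 2 3 24
2015 2 4 25
…
2015 3 1 20
2015 3 2 21
2015 3 3 22
2015 3 4 23

Suppose we want to output the temperatures for each [year-month], with the values sorted in ascending order.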
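To make the target output concrete, here is a small in-memory sketch on a few of the sample rows, deliberately shuffled. It groups by year-month and sorts each group in memory, which is fine for a toy input but is the step that secondary sort hands over to the framework for large data. The object name `ExpectedOutputDemo` is hypothetical.

```scala
object ExpectedOutputDemo {
  // A few rows from the sample, out of order: year, month, day, temperature.
  val lines: Seq[String] = Seq(
    "2015 1 3 12", "2015 2 4 25", "2015 1 1 10", "2015 2 1 22",
    "2015 1 4 13", "2015 2 3 24", "2015 1 2 11", "2015 2 2 23"
  )

  // Group by "year-month" and sort each group's temperatures in ascending order.
  val result: Map[String, Seq[Int]] =
    lines
      .map(_.trim.split("\\s+"))
      .map(f => (s"${f(0)}-${f(1)}", f(3).toInt))
      .groupBy(_._1)
      .map { case (yearMonth, pairs) => yearMonth -> pairs.map(_._2).sorted }

  def main(args: Array[String]): Unit =
    result.toSeq.sortBy(_._1).foreach { case (yearMonth, temps) =>
      println(s"$yearMonth\t${temps.mkString(",")}")
    }
}
```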

Implementing secondary sort in Spark

1. A custom partitioner:

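import org.apache.spark.Partitioner

/**
  * Custom partitioner: assigns a record to a partition using only the
  * natural key (the year-month string) of the composite key.
  **/
class SortPartitioner(partitions: Int) extends Partitioner {

    require(partitions > 0, s"The number of partitions ($partitions) must be greater than zero.")

    def numPartitions: Int = partitions

    def getPartition(key: Any): Int = key match {
        // Composite key (year-month, temperature): hash the natural key only
        case (k: String, _: Int) => math.abs(k.hashCode % numPartitions)
        case null => 0
        case _ => math.abs(key.hashCode % numPartitions)
    }

    override def equals(other: Any): Boolean = other match {
        case o: SortPartitioner => o.numPartitions == numPartitions
        case _ => false
    }

    override def hashCode: Int = numPartitions
}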
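The arithmetic in `getPartition` can be tried out without Spark. The sketch below copies it into a plain function, hypothetically named `partitionFor`, and checks the two properties the job depends on: the index is always within range, and two composite keys with the same year-month land in the same partition whatever their temperature.

```scala
object PartitionIndexDemo {
  // Same arithmetic as SortPartitioner.getPartition, reduced to a plain function.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case (naturalKey: String, _: Int) => math.abs(naturalKey.hashCode % numPartitions)
    case null                         => 0
    case other                        => math.abs(other.hashCode % numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val n = 4
    // Same year-month, different temperatures: the partition index is identical.
    println(partitionFor(("2015-1", 10), n) == partitionFor(("2015-1", 22), n))
    // The remainder is taken before abs, so the index stays in 0 until n.
    println(Seq(("2015-1", 10), ("2015-2", 22), ("2015-3", 20)).map(partitionFor(_, n)))
  }
}
```

Taking the remainder before `math.abs` keeps the result between 0 and `numPartitions - 1` even when `hashCode` is negative.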

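2. The secondary sort job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

/**
  * Secondary sort in Spark
  **/
object SparkSecondarySort {
    def main(args: Array[String]): Unit = {
        if (args.length != 3) {
            println("Usage: SparkSecondarySort <partitions> <input path> <output path>")
            sys.exit(1)
        }

        // Number of partitions
        val partitions: Int = args(0).toInt
        // Input file path
        val inputPath: String = args(1)
        // Output path
        val outputPath: String = args(2)
        val config: SparkConf = new SparkConf()
        config.setMaster("local[1]").setAppName("SparkSecondarySort")
        // Create the Spark context
        val sc: SparkContext = SparkSession.builder().config(config).getOrCreate().sparkContext
        // Read the input file (fields are tab-separated: year, month, day, temperature)
        val input: RDD[String] = sc.textFile(inputPath)
        // Build ((year-month, temperature), temperature): the temperature is copied
        // into the key so that it takes part in the sort
        val valueToKey: RDD[((String, Int), Int)] = input.map(x => {
            val line: Array[String] = x.split("\t")
            ((line(0) + "-" + line(1), line(3).toInt), line(3).toInt)
        })

        // Ascending order on the composite key: year-month first, then temperature
        implicit val tupleOrderingAsc: Ordering[(String, Int)] = new Ordering[(String, Int)] {
            override def compare(x: (String, Int), y: (String, Int)): Int = {
                val byYearMonth = x._1.compare(y._1)
                if (byYearMonth != 0) byYearMonth else x._2.compare(y._2)
            }
        }

        // Partition by year-month and sort each partition by the composite key
        val sorted: RDD[((String, Int), Int)] = valueToKey.repartitionAndSortWithinPartitions(new SortPartitioner(partitions))
        // Drop the temperature from the key and join each year-month's values,
        // which arrive already sorted within a single partition
        val result = sorted.map {
            case (k, v) => (k._1, v.toString)
        }.reduceByKey(_ + "," + _)
        result.saveAsTextFile(outputPath)
        // Release Spark resources
        sc.stop()
    }
}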
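The final `map` plus `reduceByKey` step introduces another shuffle after the data has already been partitioned and sorted. One possible alternative, sketched here as an assumption rather than as the article's method, is to collapse consecutive records with the same year-month while iterating over each sorted partition. The helper below is plain Scala; the names `ConsecutiveGrouping` and `joinRuns` are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

object ConsecutiveGrouping {
  // Collapse runs of equal natural keys in an iterator that is already sorted
  // by composite key, joining each run's values with commas.
  def joinRuns(rows: Iterator[((String, Int), Int)]): Iterator[(String, String)] = {
    val buffered = rows.buffered
    new Iterator[(String, String)] {
      def hasNext: Boolean = buffered.hasNext

      def next(): (String, String) = {
        val naturalKey = buffered.head._1._1
        val values = ArrayBuffer.empty[Int]
        // Consume records until the natural key changes or the input ends.
        while (buffered.hasNext && buffered.head._1._1 == naturalKey) {
          values += buffered.next()._2
        }
        (naturalKey, values.mkString(","))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator((("2015-1", 10), 10), (("2015-1", 12), 12), (("2015-2", 22), 22))
    joinRuns(partition).foreach(println)
  }
}
```

In the job above this could be applied as `sorted.mapPartitions(ConsecutiveGrouping.joinRuns)` in place of the `map` and `reduceByKey` calls. It relies on every record of a given year-month sitting in one partition, which `SortPartitioner` provides.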

Output:

(2015-1,5,6,7,8,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,20,20,21,21,22,22)
(2015-3,18,19,20,20,20,21,21,21,22,22,22,23,23,23,24,24,24,25,25,25,26,26,26,27,27,27,28,28,28,29,30)
(2015-2,12,13,14,15,16,17,18,19,20,21,22,22,23,23,24,24,25,25,26,26,27,28,29,30,30,30,31,32)
