Spark's sortByKey API accepts a custom comparator, which makes it possible to implement custom secondary sorting, tertiary sorting, and so on.
Let's first look at how sortByKey is implemented in the Spark source (JavaPairRDD):
def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)

def sortByKey(ascending: Boolean): JavaPairRDD[K, V] = {
  val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
  sortByKey(comp, ascending)
}

def sortByKey(comp: Comparator[K], ascending: Boolean): JavaPairRDD[K, V] = {
  implicit val ordering = comp  // Allow implicit conversion of Comparator to Ordering.
  fromRDD(new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey(ascending))
}

class OrderedRDDFunctions[K : Ordering : ClassTag,
                          V: ClassTag,
                          P <: Product2[K, V] : ClassTag] @DeveloperApi() (
    self: RDD[P])
  extends Logging with Serializable
From this code we can see that to implement a custom secondary sort, the key class must implement Scala's Ordered trait (scala.math.Ordered, used by Spark's ordering) and Java's Serializable interface.
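Before looking at the Spark key class, the comparison logic itself can be sketched in plain Java without any Spark dependency: order by the first field, then break ties with the second field. The Pair class and its field names below are illustrative, not part of the original implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A minimal sketch of secondary-sort ordering: compare by `first`,
// and fall back to `second` only when the `first` values are equal.
public class SecondarySortDemo {
    static final class Pair {
        final int first;
        final int second;
        Pair(int first, int second) { this.first = first; this.second = second; }
        @Override public String toString() { return "(" + first + "," + second + ")"; }
    }

    public static void main(String[] args) {
        List<Pair> pairs = new ArrayList<>();
        pairs.add(new Pair(3, 5));
        pairs.add(new Pair(1, 9));
        pairs.add(new Pair(3, 1));
        pairs.add(new Pair(1, 2));

        // This is the ordering a secondary-sort key's compare/compareTo would define.
        pairs.sort(Comparator.comparingInt((Pair p) -> p.first)
                             .thenComparingInt(p -> p.second));

        System.out.println(pairs);  // [(1,2), (1,9), (3,1), (3,5)]
    }
}
```

With sortByKey, the same ordering is expressed by the key class itself: the Ordered methods the key implements play the role of the chained comparator above.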
Java implementation:
First, the custom key class:
import scala.math.Ordered;
import java.io.Serializable;

/**
 * Custom key for secondary sorting.
 * Created by Administrator on 2016/8/14 0014.
 */
public class SecondarySortKey implements Ordered<SecondarySortKey>, Serializable {
    // The two sort fields: compare by first, then break ties with second.
    private int first;
    private int second;

    public int getFirst() {
        return first;
    }
    public int getSecond() {
        return second;
    }
    public void setFirst(int first) {
        this.first = first;
    }
    public void setSecond(int second) {
        this.second = second;
    }
@Override
public boolean