spark更改分区_Spark自定义分区(Partitioner)

Spark提供了HashPartitioner和RangePartitioner两种分区策略

,这两种分区策略在很多情况下都适合我们的场景。但是有些情况下,Spark内部不能符合咱们的需求

,这时候我们就可以自定义分区策略。

为此,Spark提供了相应的接口,我们只需要扩展Partitioner抽象类,然后实现里面的方法。

Partitioner类如下

/*** An object that defines how the elements in a key-value pair RDD are partitioned by key.

* Maps each key to a partition ID, from 0 to `numPartitions - 1`.*/

abstract class Partitioner extendsSerializable {//这个方法返回你要创建分区的个数;

def numPartitions: Int//这个方法对输入的key做计算,返回该key对应的分区ID,范围是0到numPartitions-1

def getPartition(key: Any): Int

}

spark默认的实现是hashPartitioner,看一下它的实现方法:

/*** A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using

* Java's `Object.hashCode`.

*

* Java arrays have hashCodes that are based on the arrays' identities rather than their contents,

* so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will

* produce an unexpected or incorrect result.*/

class HashPartitioner(partitions: Int) extendsPartitioner {

require(partitions>= 0, s"Number of partitions ($partitions) cannot be negative.")

def numPartitions: Int=partitions

def getPartition(key: Any): Int=key match {case null => 0

case _ =>Utils.nonNegativeMod(key.hashCode, numPartitions)

}//这个是Java标准的判断相等的函数,这个函数是因为Spark内部会比较两个RDD的分区是否一样。

override def equals(other: Any): Boolean =other match {case h: HashPartitioner =>h.numPartitions==numPartitionscase _ =>

false}

override def hashCode: Int=numPartitions

}

nonNegativeMod方法:

/*Calculates 'x' modulo 'mod', takes to consideration sign of x,

* i.e. if 'x' is negative, than 'x' % 'mod' is negative too

* so function return (x % mod) + mod in that case.*/def nonNegativeMod(x: Int, mod: Int): Int={

val rawMod= x %mod

rawMod+ (if (rawMod < 0) mod else 0)

}

举个例子

//将jack、world相关的元素分到单独的分区中

JavaRDD javaRDD =jsc.parallelize(Arrays.asList("jack1", "jack2", "jack3","world1", "world2", "world3"));

自定义partitioner

importorg.apache.spark.Partitioner;/*** 自定义Partitioner*/

public class MyPartitioner extendsPartitioner {private intnumPartitions;public MyPartitioner(intnumPartitions){this.numPartitions =numPartitions;

}

@Overridepublic intnumPartitions() {returnnumPartitions;

}

@Overridepublic intgetPartition(Object key) {if(key == null){return 0;

}

String str=key.toString();int hashCode = str.substring(0, str.length() - 1).hashCode();returnnonNegativeMod(hashCode,numPartitions);

}public booleanequals(Object obj) {if (obj instanceofMyPartitioner) {return ((MyPartitioner) obj).numPartitions ==numPartitions;

}return false;

}//Utils.nonNegativeMod(key.hashCode, numPartitions)

private int nonNegativeMod(int hashCode,intnumPartitions){int rawMod = hashCode %numPartitions;if(rawMod < 0){

rawMod= rawMod +numPartitions;

}returnrawMod;

}

}

然后我们在partitionBy()方法里面使用自定义的partitioner,测试示例:

//将jack、world相关的元素分到单独的分区中

JavaRDD javaRDD =jsc.parallelize(Arrays.asList("jack1", "jack2", "jack3","world1", "world2", "world3"));//自定义partitioner需要在pairRDD的基础上调用

JavaPairRDD pairRDD = javaRDD.mapToPair(s -> new Tuple2<>(s, 1));

JavaPairRDD pairRDD1 = pairRDD.partitionBy(new MyPartitioner(2));

System.out.println("指定分区之后的分区数:"+pairRDD1.getNumPartitions());

pairRDD1.mapPartitionsWithIndex((v1, v2)->{

ArrayList result = new ArrayList<>();while(v2.hasNext()){

result.add(v1+"_"+v2.next());

}returnresult.iterator();

},true).foreach(s -> System.out.println(s));

输出

指定分区之后的分区数:20_(world1,1)

0_(world2,1)

0_(world3,1)

1_(jack1,1)

1_(jack2,1)

1_(jack3,1)

参考:https://my.oschina.net/u/939952/blog/1863372

参考:https://www.iteblog.com/archives/1368.html

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值