Changing Spark partitions: custom partitioning in Spark (Partitioner)

Latest recommended article published on 2024-05-07 22:53:47

weixin_39796839


Article tags: changing Spark partitions

Article link: https://blog.csdn.net/weixin_39796839/article/details/111963149


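Spark provides two partitioning strategies, HashPartitioner and RangePartitioner, and they fit most scenarios. In some cases, however, the built-in partitioners do not meet our requirements, and we can then define a custom partitioning strategy.

For this purpose Spark provides an interface: we only need to extend the Partitioner abstract class and implement its methods.

The Partitioner class is as follows: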

    /**
     * An object that defines how the elements in a key-value pair RDD are partitioned by key.
     * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
     */
    abstract class Partitioner extends Serializable {
      // Returns the number of partitions to create.
      def numPartitions: Int

      // Computes the partition ID for the given key; the result ranges from 0 to numPartitions - 1.
      def getPartition(key: Any): Int
    }

Spark's default implementation is HashPartitioner. Here is how it is implemented:

    /**
     * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
     * Java's `Object.hashCode`.
     *
     * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
     * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
     * produce an unexpected or incorrect result.
     */
    class HashPartitioner(partitions: Int) extends Partitioner {
      require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

      def numPartitions: Int = partitions

      def getPartition(key: Any): Int = key match {
        case null => 0
        case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
      }

      // The standard Java equality method. It is needed because Spark internally
      // compares whether two RDDs are partitioned in the same way.
      override def equals(other: Any): Boolean = other match {
        case h: HashPartitioner =>
          h.numPartitions == numPartitions
        case _ =>
          false
      }

      override def hashCode: Int = numPartitions
    }
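To make the hashing rule concrete, here is a minimal standalone Java sketch that mirrors the logic of getPartition without depending on Spark. The class name HashPartitionSketch and its helper methods are hypothetical and exist only for illustration; this is not Spark's own code.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; it mirrors the hashing rule, not Spark's actual class.
final class HashPartitionSketch {

    // Same idea as Utils.nonNegativeMod: keep the result in [0, mod).
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    // A null key goes to partition 0; any other key is placed by its hashCode.
    static int partitionFor(Object key, int numPartitions) {
        return key == null ? 0 : nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(null, 4));   // null key, prints 0
        System.out.println(partitionFor(42, 4));     // Integer.hashCode() is the value itself, prints 2
        System.out.println(partitionFor(-1, 4));     // negative hash is shifted into range, prints 3

        // Arrays hash by identity, so equal contents do not guarantee the same partition.
        int[] a = {1, 2, 3};
        int[] b = {1, 2, 3};
        System.out.println(Arrays.equals(a, b));                       // true: same contents
        System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b));  // true: content-based hash
        System.out.println(a.hashCode() == b.hashCode());              // identity-based, usually false
    }
}
```

The last three lines illustrate the array caveat from the comment in the Spark source: two arrays with identical contents normally have different identity hash codes, so they may be assigned to different partitions.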

The nonNegativeMod method:

    /* Calculates 'x' modulo 'mod', takes to consideration sign of x,
     * i.e. if 'x' is negative, than 'x' % 'mod' is negative too
     * so function return (x % mod) + mod in that case.
     */
    def nonNegativeMod(x: Int, mod: Int): Int = {
      val rawMod = x % mod
      rawMod + (if (rawMod < 0) mod else 0)
    }
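The sign handling is easy to check by hand. The following standalone Java sketch (the class name NonNegativeModDemo is hypothetical) ports the method and compares it with the plain % operator and with Math.floorMod from the Java standard library, which gives the same result for a positive modulus.

```java
// Hypothetical demo class; a Java port of the Scala helper for illustration only.
final class NonNegativeModDemo {

    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 3);                 // prints -1: % keeps the sign of the dividend
        System.out.println(nonNegativeMod(-7, 3));  // prints 2: shifted into [0, 3)
        System.out.println(Math.floorMod(-7, 3));   // prints 2: same result for a positive modulus
        System.out.println(nonNegativeMod(7, 3));   // prints 1: non-negative input is unchanged

        // Math.abs is not a safe substitute: Math.abs(Integer.MIN_VALUE) is still negative.
        System.out.println(Math.abs(Integer.MIN_VALUE) % 3);       // prints -2
        System.out.println(nonNegativeMod(Integer.MIN_VALUE, 3));  // prints 1
    }
}
```

Since hashCode can return any int, including negative values, a partitioner needs this kind of adjustment to keep partition IDs between 0 and numPartitions - 1.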

For example:

    // Put the elements related to "jack" and "world" into separate partitions
    JavaRDD<String> javaRDD = jsc.parallelize(Arrays.asList("jack1", "jack2", "jack3", "world1", "world2", "world3"));

A custom partitioner:

    import org.apache.spark.Partitioner;

    /**
     * Custom Partitioner
     */
    public class MyPartitioner extends Partitioner {
        private final int numPartitions;

        public MyPartitioner(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        @Override
        public int numPartitions() {
            return numPartitions;
        }

        @Override
        public int getPartition(Object key) {
            if (key == null) {
                return 0;
            }
            // Drop the last character so that "jack1", "jack2", ... hash by the shared prefix "jack"
            String str = key.toString();
            int hashCode = str.substring(0, str.length() - 1).hashCode();
            return nonNegativeMod(hashCode, numPartitions);
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof MyPartitioner) {
                return ((MyPartitioner) obj).numPartitions == numPartitions;
            }
            return false;
        }

        // Keep hashCode consistent with equals
        @Override
        public int hashCode() {
            return numPartitions;
        }

        // Same logic as Utils.nonNegativeMod(key.hashCode, numPartitions)
        private int nonNegativeMod(int hashCode, int numPartitions) {
            int rawMod = hashCode % numPartitions;
            if (rawMod < 0) {
                rawMod = rawMod + numPartitions;
            }
            return rawMod;
        }
    }
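The partitioning rule can be checked without a Spark cluster. The sketch below is a standalone Java approximation of the getPartition logic: the class name PrefixPartitionSketch is hypothetical, the empty-key guard is an added assumption (the original would throw on an empty string), and Math.floorMod stands in for the nonNegativeMod helper.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch of the prefix-based rule; it does not use Spark.
final class PrefixPartitionSketch {

    // Hash everything except the last character, then map the hash into [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (key == null || key.isEmpty()) {
            return 0; // assumption: treat empty keys like null keys
        }
        int hash = key.substring(0, key.length() - 1).hashCode();
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("jack1", "jack2", "jack3", "world1", "world2", "world3");
        for (String key : keys) {
            // With 2 partitions, the "jack" keys map to 1 and the "world" keys map to 0.
            System.out.println(key + " -> " + partitionFor(key, 2));
        }
    }
}
```

Keys that share a prefix always land in the same partition because only the prefix is hashed. Two different prefixes can still collide when their hashes agree modulo numPartitions, so this rule groups related keys but does not guarantee one prefix per partition in general.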

We then pass the custom partitioner to the partitionBy() method. A test example:

    // Put the elements related to "jack" and "world" into separate partitions
    JavaRDD<String> javaRDD = jsc.parallelize(Arrays.asList("jack1", "jack2", "jack3", "world1", "world2", "world3"));
    // A custom partitioner can only be applied to a pair RDD
    JavaPairRDD<String, Integer> pairRDD = javaRDD.mapToPair(s -> new Tuple2<>(s, 1));
    JavaPairRDD<String, Integer> pairRDD1 = pairRDD.partitionBy(new MyPartitioner(2));
    System.out.println("Number of partitions after partitioning: " + pairRDD1.getNumPartitions());
    // v1 is the partition index, v2 iterates over the elements of that partition
    pairRDD1.mapPartitionsWithIndex((v1, v2) -> {
        ArrayList<String> result = new ArrayList<>();
        while (v2.hasNext()) {
            result.add(v1 + "_" + v2.next());
        }
        return result.iterator();
    }, true).foreach(s -> System.out.println(s));

Output:

    Number of partitions after partitioning: 2
    0_(world1,1)
    0_(world2,1)
    0_(world3,1)
    1_(jack1,1)
    1_(jack2,1)
    1_(jack3,1)

Reference: https://my.oschina.net/u/939952/blog/1863372

Reference: https://www.iteblog.com/archives/1368.html
