In Spark, only RDDs of (key, value) pairs have a partitionBy method.
RDD computations are carried out partition by partition, so a sensible partition assignment can dramatically speed up a job.
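As a quick illustration, here is a minimal sketch (the object name, data, and partition count are made up for this example) showing that partitionBy only becomes available once the RDD holds (key, value) elements:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitionBy-demo").setMaster("local[2]"))
    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    // words.partitionBy(...) would not compile: a plain RDD[String] has no partitionBy.
    val pairs = words.map(w => (w, 1))                    // RDD[(String, Int)] gains partitionBy
    val repartitioned = pairs.partitionBy(new HashPartitioner(3))
    println(repartitioned.partitioner)                    // prints Some(...HashPartitioner...)
    sc.stop()
  }
}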
Another blog post describes it this way:
We all know that Spark provides two built-in partitioning strategies, HashPartitioner and RangePartitioner (a code walkthrough of both can be found in《Spark分区器HashPartitioner和RangePartitioner代码详解》). These cover most scenarios, but sometimes the built-in partitioners cannot satisfy our needs, and that is when we can define a custom partitioning strategy. Spark provides an interface for exactly this: extend the Partitioner abstract class and implement its three methods:
package org.apache.spark

/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
def numPartitions: Int: this method returns the number of partitions you want to create;
def getPartition(key: Any): Int: this function computes a partition ID from the given key; the returned value must lie in the range 0 to numPartitions - 1;
equals(): the standard Java equality method. Users are asked to implement it because Spark internally compares the partitioners of two RDDs to decide whether they are partitioned in the same way.
package partitionTest

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

/**
 * Created by Administrator on 2017-07-27.
 */
class myPartition(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  // Keys whose string form contains "2" go to partition 1, everything else to partition 0.
  override def getPartition(key: Any): Int = {
    if (key.toString.contains("2")) 1 else 0
  }

  override def equals(other: Any): Boolean = other match {
    case iteblog: myPartition => iteblog.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}

object myStart {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val logFile = "F:\\testData\\spark\\wordcount.txt"          // local file path
    val conf = new SparkConf().setAppName("Simple Application") // name the application
    conf.setMaster("local[4]")
    val sc = new SparkContext(conf)                             // create the SparkContext
    val logData = sc.textFile(logFile).flatMap(f => f.split(' ')).map(f => (f, 8)).cache()

    println("kaishi") // "start": dump the original partition layout (from the input splits)
    val oldPartition = logData.mapPartitionsWithIndex { (id, irt) =>
      var res = List[(String, Int)]()
      irt.foreach(tmp => res = (tmp._1, id) :: res)
      res.iterator
    }
    oldPartition.foreach(println)

    println("xinde") // "new": dump the layout after applying the custom partitioner
    val newPartition = logData.partitionBy(new myPartition(2)).mapPartitionsWithIndex { (id, irt) =>
      var res = List[(String, Int)]()
      irt.foreach(tmp => res = (tmp._1, id) :: res)
      res.iterator
    }
    newPartition.foreach(println)
    sc.stop()
  }
}

Similarly, defining your own partitioning strategy in Java works just like in Scala: extend org.apache.spark.Partitioner and implement its methods.
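As a sanity check on the Scala example above: after partitionBy(new myPartition(2)), every word whose text contains the character "2" should print with partition index 1 and all other words with index 0, whereas the initial layout simply follows the input file splits.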