What is it?
partitioner is a property of RDD. Its default value is None, and subclasses can override it.
@transient val partitioner: Option[Partitioner] = None
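To illustrate the point about subclasses overriding the property, here is a minimal sketch. It assumes a local Spark setup; the object name PartitionerOverrideDemo, the helper partitionerDefined, and the sample data are hypothetical. The RDD returned by parallelize keeps the default None, while the RDD returned by reduceByKey (a shuffle) reports a defined partitioner.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionerOverrideDemo {
  // Hypothetical helper: reports whether a partitioner is defined
  // before and after a shuffle operation (reduceByKey).
  def partitionerDefined(sc: SparkContext): (Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val reduced = pairs.reduceByKey(_ + _)
    (pairs.partitioner.isDefined, reduced.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioner-override").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = partitionerDefined(sc)
      println(s"before shuffle: $before, after shuffle: $after")
    } finally {
      sc.stop()
    }
  }
}
```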
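What does it do?
It determines how the RDD is partitioned, that is, the concrete partitioning scheme. Let's test it.
An RDD read from a text file has the default partitioner, None: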
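import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDTest extends App {
  val conf = new SparkConf().setAppName("wordcount").setMaster("local")
  val sc = new SparkContext(conf)
  // Read the directory as text, requesting at least 2 partitions
  val lines: RDD[String] = sc.textFile("D:\\tmp", 2)
  println(lines.partitioner) // None
  sc.stop()
}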
An RDD created with parallelize also has the default partitioner, None:
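import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDTest extends App {
  val conf = new SparkConf().setAppName("wordcount").setMaster("local")
  val sc = new SparkContext(conf)
  // Build an RDD from a local collection
  val rdd: RDD[Int] = sc.parallelize(Array(1, 2, 3))
  println(rdd.partitioner) // None
  sc.stop()
}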
A key-value RDD can be repartitioned with partitionBy, as follows:
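import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDTest extends App {
  val conf = new SparkConf().setAppName("wordcount").setMaster("local")
  val sc = new SparkContext(conf)
  val rdd: RDD[Int] = sc.parallelize(Array(1, 2, 3))
  // Map each element to a (key, value) pair, then repartition by key hash into 3 partitions
  val value: RDD[(Int, Int)] = rdd.map((x: Int) => (x, 1)).partitionBy(new HashPartitioner(3))
  println(value.partitioner) // Some(org.apache.spark.HashPartitioner@3)
  sc.stop()
}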
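The printed value only shows that a HashPartitioner is attached; it does not show where each key ends up. The plain-Scala sketch below illustrates the idea behind hash partitioning: a key goes to the partition given by its hashCode modulo the number of partitions, adjusted to be non-negative, with null keys going to partition 0. The object and helper names (HashPartitionSketch, nonNegativeMod, partitionFor) are hypothetical and written here for illustration only; this is not Spark's own code and needs no Spark dependency.

```scala
object HashPartitionSketch {
  // Hypothetical helper: modulo that never returns a negative result.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hypothetical helper: pick a partition index for a key by its hash code.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k    => nonNegativeMod(k.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Same keys and partition count as the partitionBy example above.
    Seq(1, 2, 3).foreach { k =>
      println(s"key $k -> partition ${partitionFor(k, 3)}")
    }
  }
}
```

With the keys 1, 2, 3 and 3 partitions, each key lands in its own partition: 1 in partition 1, 2 in partition 2, and 3 in partition 0.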