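- repartition(numPartitions)
Reshuffles the data in the RDD and redistributes it into more or fewer partitions, balancing the data across them. This always shuffles all of the data over the network.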
- coalesce(numPartitions)
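Reduces the number of partitions in the RDD to numPartitions. By default it merges existing partitions without a full shuffle, which makes later operations run more efficiently after a large dataset has been filtered down.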
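import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable.ListBuffer

// Student is not defined in the original snippet; this shape is assumed from how it is used below.
case class Student(id: Int, name: String, age: Int)

object PartitionTestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("PartitionTestApp")
    val sc = new SparkContext(sparkConf)

    // Build the test data on the driver (0 to 1000000 inclusive, i.e. 1,000,001 records).
    val students = ListBuffer[Student]()
    for (i <- 0 to 1000000) {
      students.append(Student(i, "student" + i, 39))
    }
    // toList keeps the call compiling on both Scala 2.12 and 2.13.
    val studentRDD = sc.parallelize(students.toList)

    val rpRdd = studentRDD.repartition(4) // full shuffle into 4 partitions
    val coaRdd = rpRdd.coalesce(3)        // merges the 4 partitions into 3 without a shuffle

    // Cache both RDDs in serialized form; count() forces them to be computed.
    rpRdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rpRdd.count()
    coaRdd.persist(StorageLevel.MEMORY_ONLY_SER)
    coaRdd.count()

    // Keep the application alive for 20 seconds, presumably to leave time to look at the Spark UI.
    Thread.sleep(1000 * 20)
    sc.stop()
  }
}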
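
The partition counts can also be checked in code rather than in the Spark UI. The following is a minimal sketch, not part of the original test: the object name `PartitionCountSketch`, the helper `partitionCounts`, and the small integer dataset are all made up for illustration. It relies on `getNumPartitions` and on the fact that `coalesce` without a shuffle cannot increase the number of partitions, while `coalesce(n, shuffle = true)` (which is what `repartition(n)` does) can.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Hypothetical helper: returns the partition counts after
  // repartition(4), coalesce(3), coalesce(8) and coalesce(8, shuffle = true).
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    val base = sc.parallelize(1 to 1000, numSlices = 2)
    val rp = base.repartition(4)                      // shuffle: 2 -> 4 partitions
    val coa = rp.coalesce(3)                          // no shuffle: 4 -> 3 partitions
    val grow = rp.coalesce(8)                         // no shuffle: cannot grow, stays at 4
    val growShuffled = rp.coalesce(8, shuffle = true) // with shuffle: 4 -> 8 partitions
    (rp.getNumPartitions, coa.getNumPartitions,
      grow.getNumPartitions, growShuffled.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("PartitionCountSketch")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc)) // prints (4,3,4,8)

      // Number of records in each partition after repartition(4);
      // the exact split is not asserted here.
      val sizes = sc.parallelize(1 to 1000, numSlices = 2)
        .repartition(4)
        .glom()
        .map(_.length)
        .collect()
      println(sizes.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

If the goal is only to shrink the number of partitions, `coalesce` is usually the cheaper choice because it avoids the shuffle; `repartition` is the one to use when the data has to be rebalanced or the partition count has to grow.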
- Result
2.