SparkCore08
1. Broadcast variables
A broadcast variable ships a Driver-side value to every Executor; all Tasks running in that Executor share the same copy.
Because the copy is shared across tasks, a broadcast variable is read-only.
package com.hpznyf.spark.broadcast

import org.apache.spark.{SparkConf, SparkContext}

object joinApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("broadcastJoinApp")
    val sc = new SparkContext(conf)
    join(sc)
    sc.stop()
  }

  def join(sc: SparkContext): Unit = {
    // Small side: collect to the driver as a Map, then broadcast it
    val info = sc.parallelize(Array(("111", "csz"), ("222", "lihao"))).collectAsMap()
    val bc = sc.broadcast(info)

    // Large side: key each record by its id
    val detail = sc.parallelize(Array(("111", "school1", "bj"), ("112", "school2", "sh"), ("113", "school3", "sz")))
      .map(x => (x._1, x))

    // Map-side join: look each key up in the broadcast map, no shuffle needed
    detail.mapPartitions { partition =>
      val broadcastValue = bc.value
      for ((k, v) <- partition if broadcastValue.contains(k))
        yield (k, broadcastValue.getOrElse(k, ""), v.toString())
    }.foreach(println)
  }
}
2. cache() and persist()
cache() is a lazy operation; nothing is stored until an action runs.
Caching works at partition granularity.
rdd.unpersist(true) clears the cache (blocking until the blocks are removed).
RDD cache ==> stored per partition:
if an RDD has 3 partitions but memory can only hold 2.5 of them, only 2 partitions end up cached.
Difference between cache and persist:
cache : calls persist with the default storage level
persist : defaults to MEMORY_ONLY
MEMORY_ONLY : memory + deserialized(true), i.e. not serialized
MEMORY_ONLY_SER : memory + deserialized(false), i.e. serialized
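As a minimal sketch of the distinction above (assuming Spark is on the classpath; the object name is illustrative), the StorageLevel flags show directly which levels keep data deserialized:

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    // MEMORY_ONLY keeps deserialized objects in memory: fast to read, large footprint
    println(StorageLevel.MEMORY_ONLY.useMemory)        // true
    println(StorageLevel.MEMORY_ONLY.deserialized)     // true
    // MEMORY_ONLY_SER keeps serialized bytes: slower to read, more compact
    println(StorageLevel.MEMORY_ONLY_SER.deserialized) // false
    // Note: rdd.cache() is simply rdd.persist(StorageLevel.MEMORY_ONLY)
  }
}
```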
3. Serialization
Spark uses Java serialization by default.
Kryo serialization is faster and more compact, but not every type is supported out of the box, so custom classes should be registered.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

import scala.collection.mutable.ArrayBuffer

object CacheApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[3]")
      .setAppName("KryoSerAPP")
      // Switch to Kryo and register the classes that will be serialized
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[person]))
    val sc = new SparkContext(conf)

    val persons = new ArrayBuffer[person]()
    for (i <- 1 to 1000000) {
      persons += person("csz" + i, i, "sh" + i)
    }

    // Observed cache sizes for the same data under different settings:
    //   MEMORY_ONLY                 ~140
    //   MEMORY_ONLY_SER (Java)      ~30
    //   MEMORY_ONLY_SER (Kryo)      ~22.7
    val rdd = sc.parallelize(persons)
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()

    // Keep the app alive so the cached RDD can be inspected in the web UI
    Thread.sleep(100000)
    sc.stop()
  }

  case class person(name: String, age: Int, address: String)
}
4. spark-submit settings
Defaults:
--executor-cores 1
--num-executors 2
--executor-memory 1g
With 100 tasks and the defaults (1 core x 2 executors, 1g memory), only 2 tasks run concurrently, so the job takes 50 waves.
With 4 executors it takes only 25 waves.
With 2 cores per executor it likewise takes only 25 waves.
More executors -> higher parallelism -> fewer waves.
More memory -> more room for shuffle data -> fewer memory-to-disk exchanges and less GC.
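The wave arithmetic above can be sketched in a few lines of plain Scala (the helper name is illustrative):

```scala
object WaveMath {
  // Number of scheduling waves = ceil(tasks / concurrent slots),
  // where slots = numExecutors * executorCores
  def waves(tasks: Int, numExecutors: Int, executorCores: Int): Int = {
    val slots = numExecutors * executorCores
    (tasks + slots - 1) / slots
  }

  def main(args: Array[String]): Unit = {
    println(waves(100, 2, 1)) // defaults: 2 slots -> 50 waves
    println(waves(100, 4, 1)) // 4 executors -> 25 waves
    println(waves(100, 2, 2)) // 2 cores each -> 25 waves
  }
}
```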
--driver-memory
collect() returns the RDD's elements as an array on the driver, so it is only suitable for small data; for a large result, raise --driver-memory or avoid collecting everything at once.
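As a small sketch of the safer alternative (assuming a local Spark install; object and app names are illustrative), take(n) fetches only the first n elements instead of the whole RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectVsTake {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("collectVsTake"))
    val rdd = sc.parallelize(1 to 1000000)
    // collect() would materialize ALL elements on the driver -- risky for big data;
    // take(n) pulls back only n elements, scanning as few partitions as possible
    println(rdd.take(5).mkString(","))
    sc.stop()
  }
}
```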