Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Broadcast variables
val num: Any = xxx
val numBC: Broadcast[Any] = sc.broadcast(num)
Read the broadcast value with:
val n = numBC.value
Note that num must be serializable, since it is shipped to the executors.
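As a minimal, self-contained sketch (the object name BroadcastSketch and the lookup map are illustrative only), broadcasting a small Map and reading it on the executors looks like this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch").setMaster("local[*]"))
    // a small, serializable lookup table built on the driver
    val idToName: Map[Int, String] = Map(1 -> "a", 2 -> "b")
    // shipped to every executor once
    val bc: Broadcast[Map[Int, String]] = sc.broadcast(idToName)
    sc.parallelize(Seq(1, 2, 3))
      .map(id => (id, bc.value.getOrElse(id, "unknown"))) // read-only access on the executors
      .foreach(println)
    sc.stop()
  }
}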
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object Demo5 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("_01BroadcastVariableOps")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val stus = List(
      Student(1, "唐玉峰", "安徽·合肥"),
      Student(2, "李梦", "山东·济宁"),
      Student(3, "胡国权", "甘肃·白银"),
      Student(4, "陈延年", "甘肃·张掖"),
      Student(5, "马惠", "辽宁·葫芦岛"),
      Student(10086, "刘炳文", "吉林·通化")
    )
    val scoreRDD = sc.parallelize(List(
      Score(1, "chinese", 95.5),
      Score(2, "english", 55.5),
      Score(3, "math", 20.5),
      Score(4, "pe", 32.5),
      Score(5, "physical", 59),
      Score(10000, "Chemistry", 99.5)
    ))
    // joinOps(stus, scoreRDD)
    broadcastOps(stus, scoreRDD)
    sc.stop()
  }

  /*
    Use a broadcast variable to perform the join below:
    map join --> large table + small table
   */
  def broadcastOps(stus: List[Student], scoreRDD: RDD[Score]): Unit = {
    val id2Stu = stus.map(stu => (stu.id, stu)).toMap
    // build the broadcast variable from the small table
    val bc: Broadcast[Map[Int, Student]] = scoreRDD.sparkContext.broadcast(id2Stu)
    scoreRDD.foreach(score => {
      val id = score.sid
      val stu = bc.value.getOrElse(id, Student(-1, null, null))
      println(s"${stu.id}\t${stu.name}\t${stu.province}\t${score.course}\t${score.score}")
    })
  }

  def joinOps(stus: List[Student], scoreRDD: RDD[Score]): Unit = {
    val stuRDD = scoreRDD.sparkContext.parallelize(stus)
    val id2Stu: RDD[(Int, Student)] = stuRDD.map(stu => (stu.id, stu))
    val id2Score: RDD[(Int, Score)] = scoreRDD.map(score => (score.sid, score))
    val joinedRDD: RDD[(Int, (Student, Score))] = id2Stu.join(id2Score)
    joinedRDD.foreach { case (id, (stu, score)) =>
      println(s"Student with id $id: $stu, score record: $score")
    }
  }
}
case class Student(id: Int, name:String, province: String)
case class Score(sid: Int, course: String, score: Double)
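Compared with joinOps, broadcastOps avoids a shuffle: the small student table is shipped to every executor once as a broadcast variable and each Score record is looked up locally, whereas RDD.join typically has to repartition both sides by key first. This is what the map join (large table + small table) pattern refers to.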
Accumulators
An accumulator is similar in spirit to the counter in MapReduce: it tallies records that match some condition. Its benefit is that the counting piggybacks on a job the program already runs, so you do not have to change the business logic or trigger a separate action job just to count; without an accumulator you would need extra logic plus an additional action, which makes the accumulator approach perform better (see the sketch after the usage steps below).
Using an accumulator:
Build an accumulator
val accu = sc.longAccumulator()
Add to it
accu.add(value)
Read the result; it is only populated after an action has been triggered
val ret = accu.value
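A minimal sketch of the contrast described above (object and variable names are illustrative): counting "hello" with an extra filter + count job versus letting a longAccumulator ride along with the action the job runs anyway.
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorVsExtraJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorVsExtraJob").setMaster("local[*]"))
    val words = sc.parallelize(Seq("hello", "you", "hello", "me"))

    // without an accumulator: a separate filter + count, i.e. one extra action job
    val helloCount1 = words.filter(_ == "hello").count()

    // with an accumulator: the count piggybacks on the action we run anyway
    val helloAccu = sc.longAccumulator("hello")
    words.foreach(word => if (word == "hello") helloAccu.add(1))
    val helloCount2 = helloAccu.value

    println(s"filter+count: $helloCount1, accumulator: $helloCount2")
    sc.stop()
  }
}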
Built-in accumulators
The add calls run on the executors; the driver only sees the merged result once an action has been triggered, so reading accu.value before any action has run just returns the initial value.
import org.apache.spark.{SparkConf, SparkContext}

object Demo1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Demo1")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val array = sc.parallelize(Array(
      "hello you",
      "hello me",
      "hello you",
      "hello you",
      "hello me",
      "hello you"
    ), 2)
    val count = sc.longAccumulator("count") // accumulator named "count"
    val value = array.flatMap(_.split("\\s+")).map(word => {
      if (word == "hello") {
        count.add(1)
      }
      (word, 1)
    })
    value.foreach(println)
    println(s"accumulated count of hello: ${count.value}")
    count.reset() // reset the accumulator
    println("---use the accumulator again-----")
    value.count()
    println(s"accumulated count of hello: ${count.value}")
    count.reset() // reset the accumulator
    sc.stop()
  }
}
Notes
- accumulator.value must be read after an action; the accumulator is only populated once an action has been triggered.
- When the same accumulator is reused several times, reset it as soon as one round of use is finished, via accumulator.reset.
- Always give an accumulator a name, so it is easy to find on the web UI.
Custom accumulators
import scala.collection.mutable
import org.apache.spark.util.AccumulatorV2

/*
  Custom accumulator
  AccumulatorV2[IN, OUT]
  IN  is the type of the argument passed to accumulator.add(...)
  OUT is the type returned by accumulator.value
 */
class MyAccumulator extends AccumulatorV2[String, Map[String, Int]] {
  // mutable map that stores the running counts
  private var map: mutable.Map[String, Int] = mutable.Map()

  // whether the accumulator is still in its zero (empty) state
  override def isZero: Boolean = map.isEmpty

  // Spark copies the accumulator when shipping it to each task; return an independent copy
  override def copy(): AccumulatorV2[String, Map[String, Int]] = {
    val myaccu = new MyAccumulator
    myaccu.map ++= map
    myaccu
  }

  // clear the accumulator
  override def reset(): Unit = map.clear()

  // per-task (local) accumulation
  override def add(v: String): Unit = {
    map.put(v, map.getOrElse(v, 0) + 1)
  }

  // merge the per-partition results (global accumulation)
  override def merge(other: AccumulatorV2[String, Map[String, Int]]): Unit = {
    for ((word, count) <- other.value) {
      map.put(word, map.getOrElse(word, 0) + count)
    }
  }

  // expose the accumulated result
  override def value: Map[String, Int] = map.toMap
}
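Roughly speaking, Spark calls copy and reset when shipping the registered accumulator out to each task, add on the executors for every element, merge on the driver to combine the per-partition maps, and value to expose the final result; this is why copy must return an independent instance rather than sharing the underlying map.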
import org.apache.spark.{SparkConf, SparkContext}

object Demo2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Demo1")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val array = sc.parallelize(Array(
      "hello you",
      "hello me",
      "hello you",
      "hello you",
      "hello me",
      "hello you"
    ), 2)
    // register the custom accumulator
    val myaccu = new MyAccumulator()
    sc.register(myaccu, "myaccu")
    val value = array.flatMap(_.split("\\s+")).map(word => {
      if (word == "hello" || word == "me") {
        myaccu.add(word)
      }
      (word, 1)
    })
    value.foreach(println)
    println(s"accumulated result: ${myaccu.value}")
    myaccu.reset() // reset the accumulator
    sc.stop()
  }
}
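With the sample data above, "hello" appears six times and "me" twice, so the value printed after the foreach action should contain hello -> 6 and me -> 2.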