The right table's join keys are collected into a HashSet; the left table is then traversed, checking whether each of its join keys finds a match.

case class LeftSemiJoinHash(
    leftKeys: Seq[Expression],
    rightKeys: Seq[Expression],
    left: SparkPlan,
    right: SparkPlan) extends BinaryNode with HashJoin {

  val buildSide = BuildRight // the build side is the right table

  override def requiredChildDistribution =
    ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: Nil

  override def output = left.output

  def execute() = {
    // Executing the right table's physical plan yields an RDD; zipPartitions
    // pairs its partitions with the left table's, and the HashSet approach
    // described above is applied within each pair.
    buildPlan.execute().zipPartitions(streamedPlan.execute()) { (buildIter, streamIter) =>
      val hashSet = new java.util.HashSet[Row]()
      var currentRow: Row = null

      // Create a HashSet of the build-side keys
      while (buildIter.hasNext) {
        currentRow = buildIter.next()
        val rowKey = buildSideKeyGenerator(currentRow)
        if (!rowKey.anyNull) {
          val keyExists = hashSet.contains(rowKey)
          if (!keyExists) {
            hashSet.add(rowKey)
          }
        }
      }

      val joinKeys = streamSideKeyGenerator()
      streamIter.filter(current => {
        !joinKeys(current).anyNull && hashSet.contains(joinKeys.currentValue)
      })
    }
  }
}
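
To make the build-then-probe pattern concrete, here is a minimal standalone sketch in plain Scala, without SparkPlan or Row (all names here are illustrative, not Spark's API): the right side's keys go into a HashSet, and the left side is filtered against it, just as execute() does within each partition.

object LeftSemiJoinHashSketch {
  // Generic left semi join: keep left rows whose key appears on the right.
  def leftSemiJoin[K, L, R](
      left: Iterator[L],
      right: Iterator[R],
      leftKey: L => K,
      rightKey: R => K): Iterator[L] = {
    // Build phase: collect the right side's join keys into a HashSet.
    val keySet = new java.util.HashSet[K]()
    right.foreach(r => keySet.add(rightKey(r)))
    // Stream phase: keep only left rows whose key is in the set.
    left.filter(l => keySet.contains(leftKey(l)))
  }

  def main(args: Array[String]): Unit = {
    val left  = Iterator(("a", 1), ("b", 2), ("c", 3))
    val right = Iterator("a", "c", "d")
    // Prints (a,1) and (c,3): the left rows with a matching right-side key.
    leftSemiJoin(left, right, (p: (String, Int)) => p._1, identity[String])
      .foreach(println)
  }
}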
2.2. BroadcastHashJoin

As the name suggests, a broadcast hash join; it implements the inner hash join. It uses a concurrent future to broadcast, asynchronously, the RDD produced by executing the build-side plan. Once the broadcast table is available, streamedPlan is matched against it: the implementation calls mapPartitions on the streamed RDD and runs HashJoin's joinIterators in each partition to produce the join result.

case class BroadcastHashJoin(
    leftKeys: Seq[Expression],
    rightKeys: Seq[Expression],
    buildSide: BuildSide,
    left: SparkPlan,
    right: SparkPlan)(@transient sqlContext: SQLContext) extends BinaryNode with HashJoin {

  override def otherCopyArgs = sqlContext :: Nil

  override def outputPartitioning: Partitioning = left.outputPartitioning

  override def requiredChildDistribution =
    UnspecifiedDistribution :: UnspecifiedDistribution :: Nil

  @transient
  lazy val broadcastFuture = future { // broadcast the collected build-side rows via SparkContext
    sqlContext.sparkContext.broadcast(buildPlan.executeCollect())
  }

  def execute() = {
    val broadcastRelation = Await.result(broadcastFuture, 5.minute)
    streamedPlan.execute().mapPartitions { streamedIter =>
      joinIterators(broadcastRelation.value.iterator, streamedIter) // join each partition against the broadcast table
    }
  }
}
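
The interesting part is the future: broadcasting the build side is kicked off asynchronously, and execute() blocks only when the result is actually needed. Here is a minimal sketch of that pattern using a plain scala.concurrent.Future in place of Spark's broadcast (the table contents and object name are made up for illustration):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    // Start "collecting" the small build-side table early, in the background.
    val buildFuture: Future[Map[String, Int]] =
      Future { Map("a" -> 1, "c" -> 3) }

    val streamed = Seq("a", "b", "c")

    // Block only at the point where the build side is needed (cf. Await.result above).
    val buildTable = Await.result(buildFuture, 5.minutes)

    // Inner join: probe each streamed key against the build-side map.
    val joined = streamed.flatMap(k => buildTable.get(k).map(v => (k, v)))
    println(joined) // List((a,1), (c,3))
  }
}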
2.3. ShuffledHashJoin

ShuffledHashJoin, as the name implies, needs to shuffle its input. Its outputPartitioning is the left child's Partitioning, and the data is shuffled according to that Partitioning; the RDD zipPartitions method then zips the corresponding partitions of the two sides. The requiredChildDistribution here is ClusteredDistribution, which is matched against HashPartitioning. Partitioning itself is not covered here; see Partitioning under org.apache.spark.sql.catalyst.plans.physical.

case class ShuffledHashJoin(
    leftKeys: Seq[Expression],
    rightKeys: Seq[Expression],
    buildSide: BuildSide,
    left: SparkPlan,
    right: SparkPlan) extends BinaryNode with HashJoin {

  override def outputPartitioning: Partitioning = left.outputPartitioning

  override def requiredChildDistribution =
    ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: Nil

  def execute() = {
    buildPlan.execute().zipPartitions(streamedPlan.execute()) {
      (buildIter, streamIter) => joinIterators(buildIter, streamIter)
    }
  }
}
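
For reference, here is a small self-contained Spark sketch of the same mechanism: both sides are hash-partitioned identically so that equal keys land in the same partition index, zipPartitions pairs the co-located partitions, and a simplified hash-table probe stands in for joinIterators (that simplification is mine; the real HashJoin builds a multi-map per key and handles null keys):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ZipPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zip-sketch").setMaster("local[2]"))

    // Partition both sides with the same partitioner, as the ClusteredDistribution
    // requirement ensures in the real plan.
    val build  = sc.parallelize(Seq(("a", 1), ("c", 3))).partitionBy(new HashPartitioner(2))
    val stream = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z"))).partitionBy(new HashPartitioner(2))

    val joined = build.zipPartitions(stream) { (buildIter, streamIter) =>
      // Build phase: hash this partition's build side.
      val table = buildIter.toMap
      // Stream phase: probe each streamed row against the table.
      streamIter.flatMap { case (k, v) => table.get(k).map(b => (k, (b, v))) }
    }

    joined.collect().foreach(println) // (a,(1,x)) and (c,(3,z))
    sc.stop()
  }
}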
To be continued :)
This is an original article; please credit it when reposting:
Reposted from OopsOutOfMemory's blog. Author: OopsOutOfMemory
Permalink: http://blog.csdn.net/oopsoom/article/details/38274621
Note: this article is released under the Attribution-NonCommercial-NoDerivs 2.5 China Mainland (CC BY-NC-ND 2.5 CN) license. Reposting, sharing, and comments are welcome, but please keep the author attribution and the link to this article. For commercial use or other licensing matters, please contact me.