Spark Execution Flow and Principles

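This article walks through the Spark SQL execution flow: from the Parsed Logical Plan to the Analyzed Logical Plan, then the Optimized Logical Plan, and finally the Physical Plan. It focuses on WholeStageCodegen in the Physical Plan, a performance optimization introduced in Spark 2.x that generates executable code automatically to reduce virtual function calls in CPU-intensive operations. The gain there is significant, but the technique does little for IO-intensive operations such as Shuffle. Using one SQL query as an example, the article shows the code that WholeStageCodegen generates and explains why virtual calls are costly.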

 

Spark execution plan analysis:

https://blog.csdn.net/zyzzxycj/article/details/82704713

-----------

First, an overall flow diagram of SQL parsing:

The diagram may not make much sense at first sight, so start with a simple SQL statement:

select * from heguozi.payinfo where pay = 0 limit 10

Between submitting this sqlText and getting the final result, which execution plans are produced?

explain extended select * from heguozi.payinfo where pay = 0 limit 10

Four execution plans are shown:

== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('pay = 0)
         +- 'UnresolvedRelation `heguozi`.`payinfo`

The Parsed Logical Plan corresponds to the Unresolved LogicalPlan in the diagram.

== Analyzed Logical Plan ==
pay_id: string, totalpay_id: string, kindpay_id: string, kindpayname: string, fee: double, operator: string, operator_name: string, pay_time: bigint, pay: double, charge: double, is_valid: int, entity_id: string, create_time: bigint, op_time: bigint, last_ver: bigint, opuser_id: string, card_id: string, card_entity_id: string, online_bill_id: string, type: int, code: string, waitingpay_id: string, load_time: int, modify_time: int, ... 8 more fields
GlobalLimit 10
+- LocalLimit 10
   +- Project [pay_id#10079, totalpay_id#10080, kindpay_id#10081, kindpayname#10082, fee#10083, operator#10084, operator_name#10085, pay_time#10086L, pay#10087, charge#10088, is_valid#10089, entity_id#10090, create_time#10091L, op_time#10092L, last_ver#10093L, opuser_id#10094, card_id#10095, card_entity_id#10096, online_bill_id#10097, type#10098, code#10099, waitingpay_id#10100, load_time#10101, modify_time#10102, ... 8 more fields]
      +- Filter (pay#10087 = cast(0 as double))
         +- SubqueryAlias payinfo
            +- Relation[pay_id#10079,totalpay_id#10080,kindpay_id#10081,kindpayname#10082,fee#10083,operator#10084,operator_name#10085,pay_time#10086L,pay#10087,charge#10088,is_valid#10089,entity_id#10090,create_time#10091L,op_time#10092L,last_ver#10093L,opuser_id#10094,card_id#10095,card_entity_id#10096,online_bill_id#10097,type#10098,code#10099,waitingpay_id#10100,load_time#10101,modify_time#10102,... 8 more fields] parquet

The Analyzed Logical Plan corresponds to the Resolved LogicalPlan in the diagram.

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(pay#10087) && (pay#10087 = 0.0))
      +- Relation[pay_id#10079,totalpay_id#10080,kindpay_id#10081,kindpayname#10082,fee#10083,operator#10084,operator_name#10085,pay_time#10086L,pay#10087,charge#10088,is_valid#10089,entity_id#10090,create_time#10091L,op_time#10092L,last_ver#10093L,opuser_id#10094,card_id#10095,card_entity_id#10096,online_bill_id#10097,type#10098,code#10099,waitingpay_id#10100,load_time#10101,modify_time#10102,... 8 more fields] parquet

The Optimized Logical Plan corresponds to the Optimized LogicalPlan in the diagram.
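Comparing the two Filter nodes above shows what the optimizer did to the predicate: `pay = cast(0 as double)` became `isnotnull(pay) && (pay = 0.0)`. The following is a minimal pure-Python sketch of that kind of rewrite, not Catalyst's actual implementation. The expression classes and the functions `fold_constants`, `add_not_null`, and `show` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


# Hypothetical mini expression tree; Catalyst's real classes are far richer.
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class Cast:
    child: object
    to: str


@dataclass(frozen=True)
class Eq:
    left: object
    right: object


@dataclass(frozen=True)
class IsNotNull:
    child: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def fold_constants(e):
    """Evaluate casts of numeric literals once, at planning time."""
    if isinstance(e, Cast):
        child = fold_constants(e.child)
        if isinstance(child, Lit) and isinstance(child.value, (int, float)) and e.to == "double":
            return Lit(float(child.value))
        return Cast(child, e.to)
    if isinstance(e, Eq):
        return Eq(fold_constants(e.left), fold_constants(e.right))
    return e


def add_not_null(e):
    """An equality against a literal can only hold for non-null values."""
    if isinstance(e, Eq) and isinstance(e.left, Attr) and isinstance(e.right, Lit):
        return And(IsNotNull(e.left), e)
    return e


def show(e):
    """Render an expression roughly the way the plans above print it."""
    if isinstance(e, Attr):
        return e.name
    if isinstance(e, Lit):
        return str(e.value)
    if isinstance(e, Cast):
        return f"cast({show(e.child)} as {e.to})"
    if isinstance(e, Eq):
        return f"({show(e.left)} = {show(e.right)})"
    if isinstance(e, IsNotNull):
        return f"isnotnull({show(e.child)})"
    if isinstance(e, And):
        return f"({show(e.left)} && {show(e.right)})"
    raise TypeError(f"unknown expression: {e!r}")


analyzed = Eq(Attr("pay"), Cast(Lit(0), "double"))
optimized = add_not_null(fold_constants(analyzed))
print(show(analyzed))
print(show(optimized))
```

Folding the cast at planning time means it is not re-evaluated for every row, and the explicit not-null check is what later appears in PushedFilters as IsNotNull(pay).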

== Physical Plan ==
CollectLimit 10
+- *(1) LocalLimit 10
   +- *(1) Project [pay_id#10079, totalpay_id#10080, kindpay_id#10081, kindpayname#10082, fee#10083, operator#10084, operator_name#10085, pay_time#10086L, pay#10087, charge#10088, is_valid#10089, entity_id#10090, create_time#10091L, op_time#10092L, last_ver#10093L, opuser_id#10094, card_id#10095, card_entity_id#10096, online_bill_id#10097, type#10098, code#10099, waitingpay_id#10100, load_time#10101, modify_time#10102, ... 8 more fields]
      +- *(1) Filter (isnotnull(pay#10087) && (pay#10087 = 0.0))
         +- *(1) FileScan parquet heguozi.payinfo[pay_id#10079,totalpay_id#10080,kindpay_id#10081,kindpayname#10082,fee#10083,operator#10084,operator_name#10085,pay_time#10086L,pay#10087,charge#10088,is_valid#10089,entity_id#10090,create_time#10091L,op_time#10092L,last_ver#10093L,opuser_id#10094,card_id#10095,card_entity_id#10096,online_bill_id#10097,type#10098,code#10099,waitingpay_id#10100,load_time#10101,modify_time#10102,... 8 more fields] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://cluster-cdh/user/flume/heguozi/payinfo], PartitionCount: 0, PartitionFilters: [], PushedFilters: [IsNotNull(pay), EqualTo(pay,0.0)], ReadSchema: struct<pay_id:string,totalpay_id:string,kindpay_id:string,kindpayname:string,fee:double,operator:...

The Physical Plan is the final, executable PhysicalPlan.
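In the Physical Plan above, every operator except CollectLimit carries a `*(1)` prefix. As a small aid for reading larger plans, the sketch below groups operators by that number. It is plain text processing, not a Spark API; the function name `codegen_stages` is hypothetical, and the sample plan is abbreviated from the output above.

```python
import re
from collections import defaultdict

# Matches tree-drawing characters, then "*(n)", then the operator name.
STAGE = re.compile(r"^[\s:+\-|]*\*\((\d+)\)\s+(\w+)")


def codegen_stages(plan_text):
    """Group operator names by the number inside their *(n) prefix."""
    stages = defaultdict(list)
    for line in plan_text.splitlines():
        m = STAGE.match(line)
        if m:
            stages[int(m.group(1))].append(m.group(2))
    return dict(stages)


# Abbreviated copy of the Physical Plan shown above (column lists shortened).
plan = """CollectLimit 10
+- *(1) LocalLimit 10
   +- *(1) Project [pay_id#10079, totalpay_id#10080]
      +- *(1) Filter (isnotnull(pay#10087) && (pay#10087 = 0.0))
         +- *(1) FileScan parquet heguozi.payinfo[pay_id#10079,totalpay_id#10080]
"""

print(codegen_stages(plan))
```

Operators without the prefix, such as CollectLimit here, are simply skipped.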

 

As explained in Spark execution plan analysis (https://blog.csdn.net/zyzzxycj/article/details/82704713), the *(n) in the Physical Plan is the WholeStageCodegenId. So what exactly is WholeStageCodegen?

(WholeStageCodegen stands for whole-stage code generation.)

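It is a technique introduced in Spark 2.x. It fuses the operators of a stage in a Spark job into a single generated function: Spark emits Java source code for the stage and compiles it at runtime, rather than relying on Scala reflection. The fused code avoids the per-row virtual function calls between operators, which makes it faster than the Volcano Iterator Model used in Spark 1.x. Whole-stage code generation only tunes CPU-intensive work, though. It cannot improve IO-intensive operations, for example the disk reads and writes produced by Shuffle.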
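To make the contrast concrete, here is a minimal pure-Python sketch of the two execution styles for a scan, filter, and limit pipeline. It is an analogy only: Spark generates Java, not Python, and the classes `Operator`, `Scan`, `Filter`, `Limit`, the functions `run_volcano` and `run_fused`, and the sample rows are all hypothetical.

```python
from typing import Callable, List, Optional


class Operator:
    """Volcano-style operator: the parent pulls one row at a time via next()."""

    def next(self) -> Optional[dict]:
        raise NotImplementedError


class Scan(Operator):
    def __init__(self, rows: List[dict]):
        self._it = iter(rows)

    def next(self) -> Optional[dict]:
        return next(self._it, None)


class Filter(Operator):
    def __init__(self, child: Operator, pred: Callable[[dict], bool]):
        self.child, self.pred = child, pred

    def next(self) -> Optional[dict]:
        while True:
            row = self.child.next()  # dynamically dispatched call, once per row
            if row is None or self.pred(row):
                return row


class Limit(Operator):
    def __init__(self, child: Operator, n: int):
        self.child, self.remaining = child, n

    def next(self) -> Optional[dict]:
        if self.remaining <= 0:
            return None
        row = self.child.next()
        if row is not None:
            self.remaining -= 1
        return row


def run_volcano(rows, limit):
    pred = lambda r: r["pay"] is not None and r["pay"] == 0.0
    plan = Limit(Filter(Scan(rows), pred), limit)
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    return out


# Source text "generated" for the same pipeline, fused into a single loop.
FUSED_SOURCE = """
def fused(rows, limit):
    out = []
    for row in rows:
        if len(out) >= limit:
            break
        pay = row["pay"]
        if pay is not None and pay == 0.0:
            out.append(row)
    return out
"""

# Compile the generated source at runtime and fetch the resulting function.
namespace = {}
exec(compile(FUSED_SOURCE, "<generated>", "exec"), namespace)
run_fused = namespace["fused"]

# Made-up sample data: pay cycles through 0.0, None, 1.5.
rows = [
    {"pay_id": str(i), "pay": 0.0 if i % 3 == 0 else (None if i % 3 == 1 else 1.5)}
    for i in range(30)
]
assert run_volcano(rows, 10) == run_fused(rows, 10)
```

In the first style each row crosses three `next()` boundaries; in the second the whole pipeline is one loop with no calls between operators, which is the effect the fused code is after.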

 

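Taking the SQL above as an example, here is the code generated for *(1):

The code is fairly long, but on closer inspection almost all of it is repetitive.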

Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*(1) LocalLimit 10
+- *(1) Project [pay_id#10263, totalpay_id#10264, kindpay_id#10265, kindpayname#10266, fee#10267, operator#10268, operator_name#10269, pay_time#10270L, pay#10271, charge#10272, is_valid#10273, entity_id#10274, create_time#10275L, op_time#10276L, last_ver#10277L, opuser_id#10278, card_id#10279, card_entity_id#10280, online_bill_id#10281, type#10282, code#10283, waitingpay_id#10284, load_time#10285, modify_time#10286, ... 8 more fields]
   +- *(1) Filter (isnotnull(pay#10271) && (pay#10271 = 0.0))
      +- *(1) FileScan parquet heguozi.payinfo[pay_id#10263,totalpay_id#10264,kindpay_id#10265,kindpayname#10266,fee#10267,operator#10268,operator_name#10269,pay_time#10270L,pay#10271,charge#10272,is_valid#10273,entity_id#10274,create_time#10275L,op_time#10276L,last_ver#10277L,opuser_id#10278,card_id#10279,card_entity_id#10280,online_bill_id#10281,type#10282,code#10283,waitingpay_id#10284,load_time#10285,modify_time#10286,... 8 more fields] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://cluster-cdh/user/flume/heguozi/payinfo], PartitionCount: 0, PartitionFilters: [], PushedFilters: [IsNotNull(pay), EqualTo(pay,0.0)], ReadSchema: struct<pay_id:string,totalpay_id:string,kindpay_id:string,kindpayname:string,fee:dou