1. Before we begin
Here is the SQL statement that the rest of this article revolves around. It uses the TPC-DS table schema.
- SQL:
SQL> select
avg(cs_ext_discount_amt)
from
catalog_sales, date_dim
where
d_date between '1999-02-22'
and
cast('1999-05-22' as date)
and
d_date_sk = cs_sold_date_sk
group by cs_sold_date_sk;
- Logical plan:
Aggregate [cs_sold_date_sk#24], [cast((avg(UnscaledValue(cs_ext_discount_amt#46)) / 100.0) as decimal(11,6)) AS avg(cs_ext_discount_amt)#149]
+- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46]
   +- Join Inner, (d_date_sk#58 = cs_sold_date_sk#24)
      :- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46]
      :  +- Filter isnotnull(cs_sold_date_sk#24)
      :     +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields]
      +- Project [d_date_sk#58]
         +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58))
            +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
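One detail in the Filter node is worth a quick check: the two bounds of the BETWEEN predicate look different. A plausible reading, assuming Spark stores DateType values internally as the number of days since 1970-01-01, is that the upper bound (written as cast('1999-05-22' as date)) was folded into the integer 10733, while the lower bound is a plain string literal, so d_date is cast to string for that comparison. The small sketch below uses only java.time, not Spark, to confirm that 10733 does correspond to 1999-05-22. The object name EpochDayCheck is made up for this illustration.

```scala
import java.time.LocalDate

object EpochDayCheck {
  // Days elapsed between 1970-01-01 and the upper bound of the BETWEEN predicate.
  val upperBound: Long = LocalDate.parse("1999-05-22").toEpochDay

  // Converting the integer seen in the plan back into a calendar date.
  val decoded: LocalDate = LocalDate.ofEpochDay(10733L)

  def main(args: Array[String]): Unit = {
    println(upperBound) // 10733
    println(decoded)    // 1999-05-22
  }
}
```

If the intent is to compare dates on both sides, writing both bounds with an explicit cast to date would probably make the two halves of the filter look alike.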
2. Physical plan source code analysis
2.1 Physical strategies
def strategies: Seq[Strategy] =
  extraStrategies ++ (
    FileSourceStrategy ::
    DataSourceStrategy ::
    DDLStrategy ::
    SpecialLimits ::
    Aggregation ::
    JoinSelection ::
    InMemoryScans ::
    BasicOperators :: Nil)
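Here, extraStrategies is an extension point that lets users plug in their own strategies; they are placed ahead of the built-in ones. The code that invokes these strategies is: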
// Collect physical plan candidates.
val candidates = strategies.iterator.flatMap(_(plan))
Each strategy is applied to the logical plan in turn and the results are flattened, yielding an iterator of PhysicalPlan candidates. So what does each strategy do?
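Before going through the strategies one by one, the planner loop above can be sketched with toy types. This is only a minimal model, not Spark's implementation: LogicalPlan, PhysicalPlan, Relation, Filter and the three strategies are hypothetical stand-ins, and the sketch ignores the fact that a real planner also has to plan the children of each node.

```scala
object ToyPlanner {
  // Hypothetical stand-ins for Spark's logical and physical plan classes.
  sealed trait LogicalPlan
  case class Relation(name: String) extends LogicalPlan
  case class Filter(condition: String, child: LogicalPlan) extends LogicalPlan

  case class PhysicalPlan(description: String)

  // A strategy maps a logical plan to zero or more physical candidates.
  type Strategy = LogicalPlan => Seq[PhysicalPlan]

  // Fires only for a bare relation; returns Nil for everything else.
  val scanStrategy: Strategy = {
    case Relation(name) => Seq(PhysicalPlan(s"Scan($name)"))
    case _              => Nil
  }

  // Fires only for a filter node.
  val filterStrategy: Strategy = {
    case Filter(cond, _) => Seq(PhysicalPlan(s"FilterExec($cond)"))
    case _               => Nil
  }

  // A user-supplied strategy (hypothetical) that only handles date_dim.
  val customScan: Strategy = {
    case Relation("date_dim") => Seq(PhysicalPlan("CustomScan(date_dim)"))
    case _                    => Nil
  }

  // User strategies come first, mirroring `extraStrategies ++ (...)`.
  val extraStrategies: Seq[Strategy] = Seq(customScan)
  val strategies: Seq[Strategy] =
    extraStrategies ++ (scanStrategy :: filterStrategy :: Nil)

  // Same shape as the planner line quoted above: apply every strategy, flatten.
  def candidates(plan: LogicalPlan): Iterator[PhysicalPlan] =
    strategies.iterator.flatMap(_(plan))

  def main(args: Array[String]): Unit = {
    candidates(Relation("date_dim")).foreach(println)
    candidates(Filter("isnotnull(d_date_sk)", Relation("date_dim"))).foreach(println)
  }
}
```

Because the result is an iterator, strategies are only evaluated as candidates are consumed, and a strategy that does not recognise a node simply contributes nothing.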
2.1.1 FileSourceStrategy
A strategy aimed at Hadoop file systems. It applies when the underlying Relation of the plan is a HadoopFsRelation, and it plans the scan over the files.
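The idea of a strategy that only fires for one kind of relation can be sketched as follows. This is a simplified illustration, not the real FileSourceStrategy: FileRelation, JdbcRelation, LogicalRelation and the path are all hypothetical, with FileRelation standing in for HadoopFsRelation.

```scala
object ToyFileSourceStrategy {
  // Hypothetical relation kinds; FileRelation stands in for HadoopFsRelation.
  sealed trait BaseRelation
  case class FileRelation(paths: Seq[String]) extends BaseRelation
  case class JdbcRelation(url: String) extends BaseRelation

  // Hypothetical logical leaf node wrapping a relation.
  case class LogicalRelation(relation: BaseRelation)

  // Produces a file scan only for file-based relations and yields no
  // candidate otherwise, leaving the node to the other strategies.
  def apply(plan: LogicalRelation): Seq[String] = plan match {
    case LogicalRelation(FileRelation(paths)) =>
      Seq(s"FileScan ${paths.mkString(",")}")
    case _ => Nil
  }

  def main(args: Array[String]): Unit = {
    println(apply(LogicalRelation(FileRelation(Seq("/data/catalog_sales")))))
    println(apply(LogicalRelation(JdbcRelation("jdbc:example"))))
  }
}
```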
2.1.2 DataSourceStrategy
Spark predefines four scan interfaces for DataSource: TableScan, PrunedScan, PrunedFilteredScan, and CatalystScan (of which Catalys