1. Constant Folding
Replaces expressions that can be evaluated statically with their computed values.
Example SQL:
select 1+2+3 from t1
Optimization process:
scala> sqlContext.sql("select 1+2+3 from t1")
17/07/25 16:50:21 INFO parse.ParseDriver: Parsing command: select 1+2+3 from t1
17/07/25 16:50:21 INFO parse.ParseDriver: Parse Completed
res27: org.apache.spark.sql.DataFrame = [_c0: int]
scala> res27.queryExecution
res28: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(((1 + 2) + 3))]
+- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
_c0: int
Project [((1 + 2) + 3) AS _c0#19]
+- Subquery t1
   +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
      +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
Project [6 AS _c0#19]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [6 AS _c0#19]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]
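After optimization, the expression in the logical plan's Project has been replaced by 6 (the result of 1+2+3), so the physical plan returns 6 directly.
The rule is implemented as follows: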
/**
 * Replaces expressions that can be statically evaluated.
 */
object ConstantFolding extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      // Transform the expressions of the plan.
      // A literal is returned as-is, which avoids re-evaluating it
      // (Literal is itself foldable).
      case l: Literal => l
      // Fold a foldable expression by calling eval, and return the result as a literal.
      case e if e.foldable => Literal.create(e.eval(EmptyRow), e.dataType)
    }
  }
}
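To make the two cases of the rule easier to follow, here is a minimal sketch of the same idea on a toy expression tree. It does not use Spark: `Expr`, `Lit`, `Col`, `Add` and `ToyConstantFolding` are hypothetical names invented for this illustration, and a row is modeled as a plain `Map`.

```scala
// Toy model only: these classes are hypothetical stand-ins, not Spark's Catalyst types.
sealed trait Expr {
  // An expression is foldable when it can be evaluated without any input row.
  def foldable: Boolean
  // Evaluate against a row, modeled here as a map from column name to value.
  def eval(row: Map[String, Int]): Int
}

final case class Lit(value: Int) extends Expr {
  def foldable: Boolean = true
  def eval(row: Map[String, Int]): Int = value
}

final case class Col(name: String) extends Expr {
  def foldable: Boolean = false
  def eval(row: Map[String, Int]): Int = row(name)
}

final case class Add(left: Expr, right: Expr) extends Expr {
  // An addition is foldable only when both operands are.
  def foldable: Boolean = left.foldable && right.foldable
  def eval(row: Map[String, Int]): Int = left.eval(row) + right.eval(row)
}

object ToyConstantFolding {
  // Top-down rewrite that mirrors the two cases of the rule above.
  def fold(e: Expr): Expr = e match {
    case l: Lit          => l                       // already a literal: nothing to compute
    case f if f.foldable => Lit(f.eval(Map.empty))  // static subtree: evaluate it once
    case Add(l, r)       => Add(fold(l), fold(r))   // otherwise recurse into the children
    case other           => other                   // e.g. a column reference
  }
}
```

In this toy model, `ToyConstantFolding.fold(Add(Add(Lit(1), Lit(2)), Lit(3)))` yields `Lit(6)`, matching the `1+2+3` example. A tree such as `Add(Add(Col("x"), Lit(1)), Lit(2))` is left unchanged, because neither `Add` node is foldable on its own.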
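2. Simplify Filters
If a filter always evaluates to true, the filter is removed (e.g. where 2>1).
If a filter always evaluates to false, the plan returns an empty result directly (e.g. where 2<1). The rule treats a null condition the same way.
Example SQL: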
select name from t1 where 2 > 1
Optimization process:
scala> sqlContext.sql("select name from t1 where 2 > 1")
17/07/25 15:50:25 INFO parse.ParseDriver: Parsing command: select name from t1 where 2 > 1
17/07/25 15:50:25 INFO parse.ParseDriver: Parse Completed
res23: org.apache.spark.sql.DataFrame = [name: string]
scala> res23.queryExecution
res24: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
+- 'Filter (2 > 1)
   +- 'UnresolvedRelation `t1`, None
== Analyzed Logical Plan ==
name: string
Project [name#5]
+- Filter (2 > 1)
   +- Subquery t1
      +- Project [_1#0 AS name#5,_2#1 AS date#6,_3#2 AS cate#7,_4#3 AS amountSpent#8,_5#4 AS time#9]
         +- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Optimized Logical Plan ==
Project [_1#0 AS name#5]
+- LogicalRDD [_1#0,_2#1,_3#2,_4#3,_5#4], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:27
== Physical Plan ==
Project [_1#0 AS name#5]
+- Scan ExistingRDD[_1#0,_2#1,_3#2,_4#3,_5#4]
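After optimization, the filter 2 > 1, which is always true, has been removed from the logical plan. The rule itself only matches a filter whose condition is already a literal, so the comparison 2 > 1 must first be folded into the literal true by constant folding.
The rule is implemented as follows: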
object SimplifyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // If the filter condition always evaluate to true, remove the filter.
    case Filter(Literal(true, BooleanType), child) => child
    // If the filter condition always evaluate to null or false,
    // replace the input with an empty relation.
    case Filter(Literal(null, _), child) => LocalRelation(child.output, data = Seq.empty)
    case Filter(Literal(false, BooleanType), child) => LocalRelation(child.output, data = Seq.empty)
  }
}
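The three cases of the rule can be sketched on a toy plan tree. This is only an illustration under simplifying assumptions, not Spark code: `Cond`, `Plan`, `EmptyRelation`, `ToySimplifyFilters` and the other names are hypothetical, and the empty relation here does not carry the child's output schema the way `LocalRelation(child.output, ...)` does.

```scala
// Toy model only: hypothetical stand-ins for Catalyst's expression and plan nodes.
sealed trait Cond
case object TrueLit extends Cond
case object FalseLit extends Cond
case object NullLit extends Cond                                // a NULL condition
final case class Gt(left: Int, right: Int) extends Cond         // constant comparison, e.g. 2 > 1
final case class ColGt(column: String, bound: Int) extends Cond // depends on the row

sealed trait Plan
final case class Scan(table: String) extends Plan
case object EmptyRelation extends Plan                          // simplified: no schema attached
final case class Filter(cond: Cond, child: Plan) extends Plan
final case class Project(columns: List[String], child: Plan) extends Plan

object ToySimplifyFilters {
  // Stand-in for constant folding: reduce a constant comparison to a literal.
  def foldCond(c: Cond): Cond = c match {
    case Gt(l, r) => if (l > r) TrueLit else FalseLit
    case other    => other
  }

  // Drop always-true filters; replace always-false or null filters with an empty relation.
  def simplify(p: Plan): Plan = p match {
    case Filter(c, child) =>
      foldCond(c) match {
        case TrueLit            => simplify(child)
        case FalseLit | NullLit => EmptyRelation
        case kept               => Filter(kept, simplify(child))
      }
    case Project(cols, child) => Project(cols, simplify(child))
    case leaf                 => leaf
  }
}
```

In this toy model, `simplify(Project(List("name"), Filter(Gt(2, 1), Scan("t1"))))` yields `Project(List("name"), Scan("t1"))`, which corresponds to the optimized plan in the example: the filter is gone and the projection reads straight from the scan. A filter that depends on a column, such as `ColGt("amountSpent", 10)`, is kept.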
3. Simplify Casts
If an expression's data type is already the same as the target type of the cast, the Cast is removed.
Example SQL:
select
cast
(
|