1. Overall execution flow
The code below walks through the Spark SQL pipeline, showing the states a LogicalPlan passes through and clarifying the overall execution flow.
<code class="scala">
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: case classes in Scala 2.10 can support only up to 22 fields. To work
// around this limit, you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("/examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
</code>
(1) Inspect the schema of teenagers
<code>
scala> teenagers.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
</code>
(2) Inspect the execution flow
<code>
scala> teenagers.queryExecution
res3: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]
Code Generation: true
</code>
QueryExecution captures the end-to-end Spark SQL pipeline. As the output above shows, executing a SQL statement goes through the following steps:
<code>
== (1) Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== (2) Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== (3) Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== (4) Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]
// Dynamic bytecode generation (code generation, CG) is enabled to speed up queries
Code Generation: true
</code>
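Each of these stages can also be inspected individually through the fields of QueryExecution (field names as of the Spark 1.5.x API used in this article):

```scala
// Assumes the teenagers DataFrame from the example above (Spark 1.5.x API).
// QueryExecution exposes each pipeline stage as a lazy val, so the plans
// can be examined one at a time:
teenagers.queryExecution.logical       // parsed (unresolved) logical plan
teenagers.queryExecution.analyzed      // analyzed logical plan (attributes resolved)
teenagers.queryExecution.optimizedPlan // optimized logical plan (after Catalyst rules)
teenagers.queryExecution.sparkPlan     // chosen physical plan
teenagers.queryExecution.executedPlan  // physical plan ready for execution

// explain(true) prints all stages at once:
teenagers.explain(true)
```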
2. Full-table scan execution flow
Statement:
<code class="scala">
val all = sqlContext.sql("SELECT * FROM people")
</code>
Execution flow:
<code>
scala> all.queryExecution
res9: org.apache.spark.sql.SQLContext#QueryExecution =
// Note that * is parsed as unresolvedalias(*)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'UnresolvedRelation [people], None

// unresolvedalias(*) is analyzed into all the fields of the schema;
// UnresolvedRelation [people] is analyzed into Subquery people
== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Subquery people
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Scan PhysicalRDD[name#0,age#1]
Code Generation: true
</code>
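Notice that the Project node from the analyzed plan disappears in the optimized plan: a projection that outputs exactly its child's columns is a no-op, so the optimizer removes it. A toy model of such a rule (a conceptual sketch, not Spark's actual Catalyst code):

```scala
// Toy model of no-op projection elimination (not Spark's Catalyst code):
// when a Project outputs exactly the columns its child already produces,
// the Project node can be dropped from the plan.
sealed trait Plan { def output: Seq[String] }
case class Relation(output: Seq[String]) extends Plan
case class Project(output: Seq[String], child: Plan) extends Plan

def removeNoOpProject(plan: Plan): Plan = plan match {
  case Project(cols, child) if cols == child.output => removeNoOpProject(child)
  case Project(cols, child)                         => Project(cols, removeNoOpProject(child))
  case other                                        => other
}

val people = Relation(Seq("name", "age"))
removeNoOpProject(Project(Seq("name", "age"), people))
// collapses to Relation(List(name, age))
```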
3. Filter query execution flow
Statement:
<code>
scala> val filterQuery = sqlContext.sql("SELECT * FROM people WHERE age >= 13 AND age <= 19")
filterQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]
</code>
Execution flow:
<code>
scala> filterQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 // An extra Filter node appears here (likewise in the plans below)
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20

== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]
Code Generation: true
</code>
4. Join query execution flow
Statement:
<code class="scala">
val joinQuery = sqlContext.sql("SELECT * FROM people a, people b where a.age=b.age")
</code>
Inspect the overall execution flow:
<code>
scala> joinQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
// Note the Filter node
// and the Join Inner node
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter ('a.age = 'b.age)
  'Join Inner, None
   'UnresolvedRelation [people], Some(a)
   'UnresolvedRelation [people], Some(b)

== Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#2,age#3]
 Filter (age#1 = age#3)
  Join Inner, None
   Subquery a
    Subquery people
     LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
   Subquery b
    Subquery people
     LogicalRDD [name#2,age#3], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Project [name#0,age#1,name#2,age#3]
 Join Inner, Some((age#1 = age#3))
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4]...

// Inspect its Physical Plan
scala> joinQuery.queryExecution.sparkPlan
res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#2,age#3]
 SortMergeJoin [age#1], [age#3]
  Scan PhysicalRDD[name#0,age#1]
  Scan PhysicalRDD[name#2,age#3]
</code>
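The physical plan chooses SortMergeJoin: both sides are sorted on the join key and then merged in a single pass. A toy, in-memory illustration of the merge phase (a simplified sketch, not Spark's implementation):

```scala
// Toy illustration of the merge phase of a sort-merge equi-join.
// Both inputs must already be sorted by the join key (here: age).
// This is a simplified in-memory model, not Spark's SortMergeJoin operator.
def sortMergeJoin[A](left: Seq[(Int, A)], right: Seq[(Int, A)]): Seq[(A, A)] = {
  val out = scala.collection.mutable.Buffer[(A, A)]()
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val (lk, lv) = left(i)
    val (rk, _)  = right(j)
    if (lk < rk) i += 1        // advance the side with the smaller key
    else if (lk > rk) j += 1
    else {
      // Keys match: emit this left row paired with every right row sharing the key
      var jj = j
      while (jj < right.length && right(jj)._1 == lk) {
        out += ((lv, right(jj)._2))
        jj += 1
      }
      i += 1
    }
  }
  out.toSeq
}

val a = Seq((13, "Ann"), (15, "Bob"), (19, "Cat")) // sorted by age
val b = Seq((15, "Ben"), (19, "Cam"), (20, "Dan"))
sortMergeJoin(a, b) // pairs the rows whose ages match
```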
The previous example is equivalent to the one below; only the resulting plan differs slightly. Statement:
<code>
scala> val innerQuery = sqlContext.sql("SELECT * FROM people a inner join people b on a.age=b.age")
innerQuery: org.apache.spark.sql.DataFrame = [name: string, age: int, name: string, age: int]
</code>
Inspect the overall execution flow:
<code>
scala> innerQuery.queryExecution
res2: org.apache.spark.sql.SQLContext#QueryExecution =
// Note the Join Inner node;
// unlike the previous example, there is no separate Filter here
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Join Inner, Some(('a.age = 'b.age))
  'UnresolvedRelation [people], Some(a)
  'UnresolvedRelation [people], Some(b)

== Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#4,age#5]
 Join Inner, Some((age#1 = age#5))
  Subquery a
   Subquery people
    LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
  Subquery b
   Subquery people
    LogicalRDD [name#4,age#5], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

// Note that the Optimized Logical Plan is essentially the same as the
// Analyzed Logical Plan; no special optimization happens here. This is
// worth highlighting for comparison with the subquery example later,
// where Analyzed and Optimized do differ.
== Optimized Logical Plan ==
Project [name#0,age#1,name#4,age#5]
 Join Inner, Some((age#1 = age#5))
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder ...

// Inspect its Physical Plan
scala> innerQuery.queryExecution.sparkPlan
res14: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#6,age#7]
 SortMergeJoin [age#1], [age#7]
  Scan PhysicalRDD[name#0,age#1]
  Scan PhysicalRDD[name#6,age#7]
</code>
5. Subquery execution flow
Statement:
<code>
scala> val subQuery = sqlContext.sql("SELECT * FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19")
subQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]
</code>
Inspect the overall execution flow:
<code>
scala> subQuery.queryExecution
res4: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter ('a.age <= 19)
  'Subquery a
   'Project [unresolvedalias(*)]
    'Filter ('age >= 13)
     'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter (age#1 <= 19)
  Subquery a
   Project [name#0,age#1]
    Filter (age#1 >= 13)
     Subquery people
      LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

// Note the difference between Optimized and Analyzed here:
// the two Filters have been merged into a single one
== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]
Code Generation: true
</code>
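The optimization seen above — two stacked Filter nodes merged into one Filter whose predicate is the conjunction of both — is the filter-combining idea. A toy version (a conceptual sketch, not Spark's actual Catalyst rule):

```scala
// Toy model of the filter-combining optimization seen above:
// adjacent Filter nodes collapse into one Filter whose predicate is
// the conjunction of both. Not Spark's actual Catalyst rule.
sealed trait Expr
case class Pred(s: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr

sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan

def combineFilters(plan: Plan): Plan = plan match {
  case Filter(c1, Filter(c2, child)) => combineFilters(Filter(And(c1, c2), child))
  case Filter(c, child)              => Filter(c, combineFilters(child))
  case other                         => other
}

combineFilters(Filter(Pred("age <= 19"), Filter(Pred("age >= 13"), Scan("people"))))
// => Filter(And(Pred(age <= 19),Pred(age >= 13)),Scan(people))
```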
6. Aggregate query execution flow
Statement:
<code>
scala> val aggregateQuery = sqlContext.sql("SELECT a.name,sum(a.age) FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19 group by a.name")
aggregateQuery: org.apache.spark.sql.DataFrame = [name: string, _c1: bigint]
</code>
Inspect the execution flow:
<code>
scala> aggregateQuery.queryExecution
res6: org.apache.spark.sql.SQLContext#QueryExecution =
// Note 'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]:
// group by a.name is parsed into the 'Aggregate node's grouping expressions
== Parsed Logical Plan ==
'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]
 'Filter ('a.age <= 19)
  'Subquery a
   'Project [unresolvedalias(*)]
    'Filter ('age >= 13)
     'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, _c1: bigint
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
 Filter (age#1 <= 19)
  Subquery a
   Project [name#0,age#1]
    Filter (age#1 >= 13)
     Subquery people
      LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  LogicalRDD [name#0,age#1], MapPartitions...

// Inspect its Physical Plan
scala> aggregateQuery.queryExecution.sparkPlan
res10: org.apache.spark.sql.execution.SparkPlan =
TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Final,isDistinct=false)], output=[name#0,_c1#14L])
 TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Partial,isDistinct=false)], output=[name#0,currentSum#17L])
  Filter ((age#1 >= 13) && (age#1 <= 19))
   Scan PhysicalRDD[name#0,age#1]
</code>
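Note the two TungstenAggregate operators: a Partial aggregation runs on each partition first, and a Final aggregation merges the partial sums after the shuffle. The idea can be sketched in plain Scala (a conceptual model, not Spark's operators):

```scala
// Conceptual model of Partial/Final aggregation (not Spark's operators):
// each partition first computes partial sums locally, then the partial
// results are grouped by key across partitions and merged into final sums.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq(("alice", 13), ("bob", 15)),
  Seq(("alice", 19), ("bob", 14))
)

// Partial phase: one sum per key per partition
val partial: Seq[(String, Long)] = partitions.flatMap { part =>
  part.groupBy(_._1).map { case (k, rows) => (k, rows.map(_._2.toLong).sum) }
}

// Final phase: merge the partial sums across partitions
val finalSums: Map[String, Long] =
  partial.groupBy(_._1).map { case (k, ps) => (k, ps.map(_._2).sum) }

// finalSums: Map(alice -> 32, bob -> 29)
```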
You can inspect the execution flow of other SQL statements in the same way, to get a grasp of the basic ideas behind Spark SQL's implementation.