The Spark SQL Parsing Process

bigdataCoding

Article link: https://blog.csdn.net/UnionIBM/article/details/78027744

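1. Join types in Spark SQL

Spark SQL currently supports three join algorithms: shuffle hash join, broadcast hash join, and sort merge join. The first two are both hash joins at heart; they differ only in whether the data is shuffled or broadcast before the hash join runs. In broadcast join mode, a table smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB) is broadcast to the other compute nodes. The join then skips the shuffle step entirely, which makes it more efficient.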
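
To make the difference between the two hash join variants concrete, here is a minimal pure-Python sketch. It is not Spark code: rows are plain dicts, partitions are plain lists, and the function names (hash_join, broadcast_hash_join, shuffle_hash_join) are hypothetical names chosen for illustration. It only shows that both variants end in the same build-and-probe hash join and differ in how the data gets to each partition.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner hash join: build a hash table on one side, probe it with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def broadcast_hash_join(small_rows, large_partitions, key):
    """Each partition of the large side joins against a full copy of the small side.

    The large side is never repartitioned, which models "no shuffle".
    """
    out = []
    for partition in large_partitions:
        out.extend(hash_join(small_rows, partition, key))
    return out


def shuffle_hash_join(left_rows, right_rows, key, num_partitions=2):
    """Repartition both sides by the hash of the key, then hash join each pair."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for row in left_rows:
        left_parts[hash(row[key]) % num_partitions].append(row)
    for row in right_rows:
        right_parts[hash(row[key]) % num_partitions].append(row)
    out = []
    for left_part, right_part in zip(left_parts, right_parts):
        out.extend(hash_join(left_part, right_part, key))
    return out


if __name__ == "__main__":
    users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    orders = [[{"id": 1, "v": 10}, {"id": 3, "v": 30}], [{"id": 2, "v": 20}]]
    print(broadcast_hash_join(users, orders, "id"))
```

Both functions return the same set of joined rows; the broadcast variant simply trades a copy of the small table per partition for not having to move the large table at all.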

2. Using Catalyst in Spark SQL
Catalyst optimizer execution process
We use Catalyst’s general tree transformation framework in four phases, as shown below:
(1) analyzing a logical plan to resolve references,
(2) logical plan optimization,
(3) physical planning, and
(4) code generation to compile parts of the query to Java bytecode.

In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators. We now describe each of these phases.
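
The idea of a rule-based tree transformation can be illustrated with a small pure-Python sketch. This is not Catalyst's actual API (Catalyst is written in Scala); the node classes (Literal, Attribute, Add), the two rules, and the helpers transform_up and run_to_fixed_point are all hypothetical names chosen for illustration. The sketch applies rules bottom-up over an expression tree and repeats the batch until the tree stops changing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Apply `rule` to the children first, then to the node itself."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def constant_fold(node):
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node


def simplify_add_zero(node):
    """Rule: x + 0 and 0 + x both simplify to x."""
    if isinstance(node, Add):
        if node.right == Literal(0):
            return node.left
        if node.left == Literal(0):
            return node.right
    return node


def run_to_fixed_point(plan, rules, max_iterations=100):
    """Apply the batch of rules repeatedly until the tree no longer changes."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = transform_up(new_plan, rule)
        if new_plan == plan:
            return plan
        plan = new_plan
    return plan


if __name__ == "__main__":
    expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    print(run_to_fixed_point(expr, [constant_fold, simplify_add_zero]))
```

Each rule only needs to recognize one local pattern; composing several small rules and running them to a fixed point is what lets the rule-based phases stay simple while still reaching a fully simplified tree.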

Spark SQL parsing process

http://hbasefly.com/2017/03/01/sparksql-catalyst/
https://spark-packages.org/package/neo4j-contrib/neo4j-spark-connector
http://blog.csdn.net/qq_21050291/article/details/78037883?locationNum=10&fps=1