Apache Sedona (GeoSpark) Spatial Join: Source Code Analysis

KD_

Article link: https://blog.csdn.net/qq_41775852/article/details/115462917

Apache Sedona (GeoSpark) Spatial Join

Sedona spatial operators fully support the Apache SparkSQL query optimizer. Sedona provides the following query optimization features:

Automatically optimizes range join query and distance join query.
Automatically performs predicate pushdown.

Range join

Introduction: Find geometries from A and geometries from B such that each geometry pair satisfies a certain predicate. Most predicates supported by SedonaSQL can trigger a range join.

Spark SQL Example:

SELECT *
FROM polygondf, pointdf
WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)

SELECT *
FROM polygondf, pointdf
WHERE ST_Intersects(polygondf.polygonshape,pointdf.pointshape)

SELECT *
FROM pointdf, polygondf
WHERE ST_Within(pointdf.pointshape, polygondf.polygonshape)

Spark SQL Physical plan:

== Physical Plan ==
RangeJoin polygonshape#20: geometry, pointshape#43: geometry, false
:- Project [st_polygonfromenvelope(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), cast(_c2#2 as decimal(24,20)), cast(_c3#3 as decimal(24,20)), mypolygonid) AS polygonshape#20]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#31 as decimal(24,20)), cast(_c1#32 as decimal(24,20)), myPointId) AS pointshape#43]
   +- *FileScan csv

!!!note
All join queries in SedonaSQL are inner joins

Distance join

Introduction: Find geometries from A and geometries from B such that the internal Euclidean distance of each geometry pair is less than or equal to a certain distance.

Spark SQL Example:

Only consider pairs fully within a certain distance:

SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) < 2

Also consider pairs that intersect the boundary of a certain distance:

SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) <= 2
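
The difference between the two predicates only shows up for pairs that sit exactly at the given distance. The following is a minimal plain-Python sketch of that semantics, not Sedona code: `distance_join` and its `inclusive` flag are hypothetical names, and the nested loop is only the naive baseline that an optimized distance join avoids.

```python
import math

def distance_join(left, right, radius, inclusive):
    """Naive nested-loop distance join over (x, y) tuples.

    inclusive=False models `ST_Distance(a, b) < radius`,
    inclusive=True models `ST_Distance(a, b) <= radius`.
    """
    pairs = []
    for a in left:
        for b in right:
            d = math.dist(a, b)  # Euclidean distance, in the unit of the coordinates
            if d < radius or (inclusive and d == radius):
                pairs.append((a, b))
    return pairs

left = [(0.0, 0.0)]
right = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

strict = distance_join(left, right, 2.0, inclusive=False)
closed = distance_join(left, right, 2.0, inclusive=True)
```

Here `strict` keeps only the point at distance 1, while `closed` also keeps the point lying exactly at distance 2.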

Spark SQL Physical plan:

== Physical Plan ==
DistanceJoin pointshape1#12: geometry, pointshape2#33: geometry, 2.0, true
:- Project [st_point(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), myPointId) AS pointshape1#12]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#21 as decimal(24,20)), cast(_c1#22 as decimal(24,20)), myPointId) AS pointshape2#33]
   +- *FileScan csv

!!!warning
Sedona doesn't control the distance's unit (degree or meter). The unit is the same as that of the geometry's coordinates. To change the geometry's unit, transform the coordinate reference system; see ST_Transform.

Source code analysis

SedonaSQLRegistrator.registerAll(sparkSession)

After initializing the SparkSession, call SedonaSQLRegistrator.registerAll(sparkSession) to register SedonaSQL's user-defined types, user-defined functions, and optimized join query strategy.
JoinQueryDetector is the planning strategy for spatial joins. UdtRegistrator.registerAll() registers GeometryUDT and IndexUDT, and UdfRegistrator.registerAll(sqlContext) registers the custom UDFs, UDAFs, and so on.

JoinQueryDetector

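JoinQueryDetector extends Strategy, which converts logical plans into physical plans. As the apply method shows, JoinQueryDetector matches Join logical plan nodes and, based on the type of the Join's condition, decides which kind of join to generate, RangeJoinExec or DistanceJoinExec, passing in leftShape and rightShape, the two expressions that represent the geometry columns.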
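
To make the dispatch concrete, here is a toy model in plain Python. It is a sketch under assumptions, not the Scala source: the expression classes and `plan_join` are hypothetical stand-ins for Catalyst expressions and the strategy's apply method, only three predicates are modelled, and the trailing boolean mirrors the last field printed in the RangeJoin and DistanceJoin physical plans.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class STContains:
    left: Column
    right: Column

@dataclass(frozen=True)
class STIntersects:
    left: Column
    right: Column

@dataclass(frozen=True)
class STDistance:
    left: Column
    right: Column

@dataclass(frozen=True)
class LessThan:
    expr: object
    radius: float

@dataclass(frozen=True)
class LessThanOrEqual:
    expr: object
    radius: float

def plan_join(condition):
    """Pick a physical join based on the type of the join condition."""
    if isinstance(condition, (STContains, STIntersects)):
        intersects = isinstance(condition, STIntersects)
        return ("RangeJoinExec", condition.left.name, condition.right.name, intersects)
    if isinstance(condition, (LessThan, LessThanOrEqual)) and isinstance(condition.expr, STDistance):
        d = condition.expr
        intersects = isinstance(condition, LessThanOrEqual)
        return ("DistanceJoinExec", d.left.name, d.right.name, condition.radius, intersects)
    return None  # not a spatial join: leave it to the other planning strategies

range_plan = plan_join(STContains(Column("polygonshape"), Column("pointshape")))
dist_plan = plan_join(LessThanOrEqual(STDistance(Column("pointshape1"), Column("pointshape2")), 2.0))
other_plan = plan_join(LessThan(Column("a"), 1.0))
```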

[Figure: Spark Join logical plan]

planSpatialJoin

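This method generates the RangeJoinExec physical plan. It first calls matchExpressionsToPlans to check whether the outputSet of the left and right child logical plans contains the geometry columns that the expressions refer to.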
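
The check can be pictured as a set-membership test. The sketch below is plain Python under assumptions: a plan is reduced to the set of column names it outputs, an expression to the set of columns it references, and the swap branch reflects my reading of how the two shapes get paired with the two children; the function names are illustrative only.

```python
def matches(expr_refs, plan_output):
    """True if the plan outputs at least one column the expression refers to."""
    return any(ref in plan_output for ref in expr_refs)

def match_expressions_to_plans(refs_a, refs_b, plan_a, plan_b):
    """Pair each shape expression with the child plan that can evaluate it.

    Returns the plans in the order (plan for A, plan for B), or None when
    the expressions cannot be split between the two children.
    """
    if matches(refs_a, plan_a) and matches(refs_b, plan_b):
        return (plan_a, plan_b)
    if matches(refs_a, plan_b) and matches(refs_b, plan_a):
        return (plan_b, plan_a)  # arguments were written in the opposite order
    return None

pointdf = frozenset({"pointshape"})
polygondf = frozenset({"polygonshape"})

# FROM pointdf, polygondf WHERE ST_Contains(polygonshape, pointshape)
swapped = match_expressions_to_plans({"polygonshape"}, {"pointshape"}, pointdf, polygondf)
# A condition that touches only one side is not a join between the two children.
unmatched = match_expressions_to_plans({"pointshape"}, {"pointshape"}, pointdf, polygondf)
```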

planDistanceJoin

This method generates DistanceJoinExec; its logic is largely the same as that of planSpatialJoin.

TraitJoinQueryExec

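[Figures: RangeJoinExec and DistanceJoinExec definitions]

The actual implementation logic of both DistanceJoinExec and RangeJoinExec lives in TraitJoinQueryExec.

[Figure: TraitJoinQueryExec definition]
TraitJoinQueryExec is a trait that is mixed into SparkPlan implementations.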

doExecute

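1. Construct the SpatialRDDs

In doExecute, BindReferences.bindReference is called first to bind leftShape and rightShape to the output of the left and right child physical plans. The eval method of the resulting bound expressions can then fetch the geometry column directly from the InternalRows of left and right.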
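
Binding essentially replaces a column name with a position in the row. A toy illustration in plain Python, assuming rows are tuples and a plan's output is a list of column names; `bind_reference` and this `BoundReference` are simplified stand-ins, not Spark's classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundReference:
    """An expression that reads a fixed position of a row."""
    ordinal: int

    def eval(self, row):
        return row[self.ordinal]

def bind_reference(attribute_name, output):
    """Resolve a named attribute against the child plan's output columns."""
    return BoundReference(output.index(attribute_name))

output = ["id", "pointshape"]            # output columns of the child plan
bound = bind_reference("pointshape", output)
value = bound.eval((7, "POINT (1 2)"))   # no name lookup needed per row
```

Because the lookup by name happens once at planning time, evaluating the bound expression per row is just an index access.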

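Next, the execute methods of left and right are called to obtain the child RDDs, and toSpatialRddPair is called to build the SpatialRDDs (the internal structure of SpatialRDD is not covered here): the geometry column is read from each UnsafeRow and converted into a Geometry object.

toSpatialRdd uses Sedona's custom geometry serializer, via GeometrySerializer.deserialize, to convert the extracted geometry column into geometry objects.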

2. doSpatialPartitioning

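To perform a spatial join, the two SpatialRDDs must be partitioned identically. The method first determines the JoinSparitionDominantSide and then numPartitions.

In doSpatialPartitioning, dominantShapes is partitioned with the spatial partitioning scheme selected in sedonaConf. followerShapes then takes the partitioner from dominantShapes and applies the same spatial partitioning.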
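
The dominant/follower idea can be sketched with a uniform grid in plain Python. This is an assumption-laden illustration, not Sedona's partitioner: `GridPartitioner` and `spatial_partition` are hypothetical, geometries are reduced to bounding boxes `(min_x, min_y, max_x, max_y)`, and the envelope is assumed to have non-zero width and height.

```python
class GridPartitioner:
    """Uniform grid over an envelope (min_x, min_y, max_x, max_y)."""

    def __init__(self, envelope, cells_per_side):
        self.min_x, self.min_y, max_x, max_y = envelope
        self.n = cells_per_side
        self.cell_w = (max_x - self.min_x) / self.n
        self.cell_h = (max_y - self.min_y) / self.n

    def _clamp(self, i):
        # Geometries on or outside the envelope border fall into the edge cells.
        return min(max(i, 0), self.n - 1)

    def partitions_for(self, box):
        """Ids of every grid cell that the bounding box overlaps."""
        x0, y0, x1, y1 = box
        c0 = self._clamp(int((x0 - self.min_x) // self.cell_w))
        c1 = self._clamp(int((x1 - self.min_x) // self.cell_w))
        r0 = self._clamp(int((y0 - self.min_y) // self.cell_h))
        r1 = self._clamp(int((y1 - self.min_y) // self.cell_h))
        return [r * self.n + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def spatial_partition(boxes, partitioner):
    """Group boxes by partition id; a box spanning several cells is duplicated."""
    parts = {}
    for box in boxes:
        for pid in partitioner.partitions_for(box):
            parts.setdefault(pid, []).append(box)
    return parts

dominant = [(0, 0, 4, 4), (6, 6, 10, 10)]             # e.g. polygon envelopes
follower = [(1, 1, 1, 1), (7, 8, 7, 8), (5, 2, 5, 2)]  # e.g. points as degenerate boxes

partitioner = GridPartitioner((0, 0, 10, 10), cells_per_side=2)  # built from the dominant side
dominant_parts = spatial_partition(dominant, partitioner)
follower_parts = spatial_partition(follower, partitioner)        # reuses the same partitioner
```

Since both sides use the same partitioner, any pair of geometries that can match is guaranteed to meet in at least one partition with the same id.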

3. spatialJoin

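First, a JoinParams object is constructed; it determines whether the join uses an index, whether boundary intersections are considered, the index type, and the joinBuildSide.

JoinQuery.spatialJoin is then called to perform the spatial join.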

JoinQuery.spatialJoin

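First, it checks that the CRS and the partitioning of the two SpatialRDDs match.

A JoinJudgement is then constructed. It implements the FlatMapFunction2 interface and is passed to the zipPartitions operator, where it defines how the elements in the same partition of the two SpatialRDDs are spatially joined.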
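
A judgement is a function from two partition iterators to an iterator of matching pairs. The plain-Python sketch below models that shape under simplifying assumptions: partitions are dict entries keyed by partition id, `zip_partitions` is a local stand-in for Spark's operator, and `nested_loop_judgement` is a hypothetical index-free judgement using a box-contains-point test.

```python
def zip_partitions(left_parts, right_parts, judgement):
    """Run the judgement on each pair of partitions that share an id."""
    results = []
    for pid in sorted(left_parts.keys() & right_parts.keys()):
        results.extend(judgement(iter(left_parts[pid]), iter(right_parts[pid])))
    return results

def contains(box, point):
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1

def nested_loop_judgement(boxes, points):
    """Compare every box with every point of the same partition."""
    points = list(points)  # the right side is scanned once per box
    for box in boxes:
        for point in points:
            if contains(box, point):
                yield (box, point)

left_parts = {0: [(0, 0, 4, 4)], 3: [(6, 6, 10, 10)]}
right_parts = {0: [(1, 1)], 1: [(5, 2)], 3: [(7, 8), (9, 9)]}

joined = zip_partitions(left_parts, right_parts, nested_loop_judgement)
```

The point in partition 1 is never compared with anything, which is exactly the work that identical spatial partitioning saves.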

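For example:

RightIndexLookupJudgement: leftRDD.spatialPartitionedRDD.zipPartitions(rightRDD.indexedRDD, judgement) uses the per-partition spatial index on rightRDD.indexedRDD. It iterates over the records in leftRDD.spatialPartitionedRDD one by one, queries the spatial index with each record, and collects the record pairs that can be spatially joined.
DynamicIndexLookupJudgement builds the per-partition index itself before performing the join.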
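
The index-lookup pattern is filter-then-refine: query the index for candidates, then test each candidate exactly. Below is a plain-Python sketch under assumptions: `SortedXIndex` is a deliberately simple one-dimensional stand-in for a real spatial index, and the two judgement functions are hypothetical analogues of the classes named above, not their actual code.

```python
import bisect

class SortedXIndex:
    """Toy index: points sorted by x, range-queried with bisect."""

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]

    def query(self, box):
        """Candidates whose x lies within the box; y is not checked here."""
        x0, _, x1, _ = box
        lo = bisect.bisect_left(self.xs, x0)
        hi = bisect.bisect_right(self.xs, x1)
        return self.points[lo:hi]

def right_index_lookup_judgement(left_boxes, right_index):
    """The right partition arrives already indexed; probe it with each left record."""
    for box in left_boxes:
        for point in right_index.query(box):   # filter step: index lookup
            if box[1] <= point[1] <= box[3]:   # refine step: exact predicate
                yield (box, point)

def dynamic_index_lookup_judgement(left_boxes, right_points):
    """The right partition arrives raw; build the index first, then probe it."""
    return right_index_lookup_judgement(left_boxes, SortedXIndex(right_points))

boxes = [(0, 0, 4, 4)]
points = [(1, 1), (2, 9), (7, 8), (3, 3)]

prebuilt = list(right_index_lookup_judgement(iter(boxes), SortedXIndex(points)))
dynamic = list(dynamic_index_lookup_judgement(iter(boxes), iter(points)))
```

In this example the index returns three candidates for the box, and the refine step drops `(2, 9)`, whose x is in range but whose y is not.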