Series Contents
Getting Started with GeoSpark, Part 1: Environment Setup
Getting Started with GeoSpark, Part 2: Spatial RDDs
Getting Started with GeoSpark, Part 3: Spatial Queries
1. Spatial Range Query
A spatial range query, as the name suggests, takes a given range (the query window) and returns the geometry objects contained within it.
1.1 Data Preparation
Create checkin1.csv at the path data/checkin1.csv:
Note that the bar coordinates have been deliberately modified here.
-88.331492,32.324142,hotel
-88.175933,32.360763,gas
-99.388954,32.357073,bar
-88.221102,32.35078,restaurant
1.2 Code Example
The considerBoundaryIntersection parameter controls whether the query includes geometry objects lying on the boundary of the query window.
package com.suddev.bigdata.query

import com.vividsolutions.jts.geom.Envelope
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.{SparkConf, SparkContext}
import org.datasyslab.geospark.enums.FileDataSplitter
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geospark.spatialOperator.RangeQuery
import org.datasyslab.geospark.spatialRDD.PointRDD

/**
 * Spatial Range Query
 * @author Rand
 * @date 2020/4/16
 */
object SpatialRangeQueryApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setAppName("SpatialRangeQueryApp").setMaster("local[*]").
      set("spark.serializer", classOf[KryoSerializer].getName).
      set("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
    implicit val sc = new SparkContext(conf)
    val objectRDD = createPointRDD
    objectRDD.rawSpatialRDD.rdd.collect().foreach(println)
    // Define the query window
    val rangeQueryWindow = new Envelope(-90.01, -80.01, 30.01, 40.01)
    // Whether to include geometries on the window boundary
    val considerBoundaryIntersection = false
    val usingIndex = false
    val queryResult = RangeQuery.SpatialRangeQuery(objectRDD, rangeQueryWindow, considerBoundaryIntersection, usingIndex)
    queryResult.rdd.collect().foreach(println)
  }

  def createPointRDD(implicit sc: SparkContext): PointRDD = {
    val pointRDDInputLocation = "data/checkin1.csv"
    // Which columns hold longitude and latitude; here they are columns 0 and 1, so the offset is 0
    val pointRDDOffset = 0
    val pointRDDSplitter = FileDataSplitter.CSV
    // Whether to carry extra user-defined attributes besides the coordinates
    val carryOtherAttributes = true
    val objectRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
    objectRDD
  }
}
🔥 Besides an Envelope, the rangeQueryWindow can also be a Point, Polygon, or LineString.
Point → create a Point query window:
val geometryFactory = new GeometryFactory()
val pointObject = geometryFactory.createPoint(new Coordinate(-84.01, 34.01))
Polygon → create a Polygon query window:
val geometryFactory = new GeometryFactory()
val coordinates = new Array[Coordinate](5)
coordinates(0) = new Coordinate(0,0)
coordinates(1) = new Coordinate(0,4)
coordinates(2) = new Coordinate(4,4)
coordinates(3) = new Coordinate(4,0)
coordinates(4) = coordinates(0) // The last coordinate is the same as the first coordinate in order to compose a closed ring
val polygonObject = geometryFactory.createPolygon(coordinates)
LineString → create a LineString query window:
val geometryFactory = new GeometryFactory()
val coordinates = new Array[Coordinate](4) // 4 points; a LineString does not need to form a closed ring
coordinates(0) = new Coordinate(0,0)
coordinates(1) = new Coordinate(0,4)
coordinates(2) = new Coordinate(4,4)
coordinates(3) = new Coordinate(4,0)
val linestringObject = geometryFactory.createLineString(coordinates)
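Any of these geometries can then be passed to the range query in place of the Envelope. A minimal sketch, assuming the RangeQuery.SpatialRangeQuery overload that accepts a JTS Geometry as the query window (and reusing the objectRDD from the example above); here the Polygon covers the same area as the earlier Envelope:

```scala
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory}
import org.datasyslab.geospark.spatialOperator.RangeQuery

val geometryFactory = new GeometryFactory()
// Build a closed ring matching the Envelope(-90.01, -80.01, 30.01, 40.01) above
val coordinates = Array(
  new Coordinate(-90.01, 30.01),
  new Coordinate(-90.01, 40.01),
  new Coordinate(-80.01, 40.01),
  new Coordinate(-80.01, 30.01),
  new Coordinate(-90.01, 30.01) // close the ring
)
val polygonWindow = geometryFactory.createPolygon(coordinates)
// Same flags as the example above, but with a Polygon window instead of an Envelope
val queryResult = RangeQuery.SpatialRangeQuery(objectRDD, polygonWindow, false, false)
queryResult.rdd.collect().foreach(println)
```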
1.3 Running the Results
The query result contains hotel, gas, and restaurant, but not bar:
POINT (-88.331492 32.324142) hotel
POINT (-88.175933 32.360763) gas
POINT (-99.388954 32.357073) bar
POINT (-88.221102 32.35078) restaurant
-------------------------------
POINT (-88.331492 32.324142) hotel
POINT (-88.175933 32.360763) gas
POINT (-88.221102 32.35078) restaurant
-------------------------------
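The usingIndex flag was left false above. To speed the query up with an on-the-fly spatial index, a hedged sketch (assuming GeoSpark's buildIndex method and IndexType enum, and reusing the variables from the example above):

```scala
import org.datasyslab.geospark.enums.IndexType

// Build a quadtree index; the second argument selects whether to index the
// spatially partitioned RDD (false here: index the raw RDD)
objectRDD.buildIndex(IndexType.QUADTREE, false)
val usingIndex = true
val queryResult = RangeQuery.SpatialRangeQuery(objectRDD, rangeQueryWindow, considerBoundaryIntersection, usingIndex)
queryResult.rdd.collect().foreach(println)
```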
2. Spatial KNN Query
A spatial KNN query takes the coordinates of a center point and finds the K geometry objects nearest to that point.
2.1 Data Preparation
Create checkin2.csv at the path data/checkin2.csv:
-88.331492,32.324142,hotel
-88.175933,32.360763,gas1
-88.176033,32.360763,gas2
-88.175833,32.360763,gas3
-88.388954,32.357073,bar
-88.221102,32.35078,restaurant
2.2 Code Example
The k parameter limits the query to k results.
🙃 A gripe: if only 5 objects exist but we set k greater than 5, a NullPointerException is thrown. Couldn't it just return however many it finds?
🙃 Another gripe: this design can only query one point at a time. In production you usually need to KNN-match one batch of points against another batch, and this API does not support a two-RDD query. If you're interested in KNN matching between two RDDs, leave me a comment and I'll write a separate article.
package com.suddev.bigdata.query

import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.{SparkConf, SparkContext}
import org.datasyslab.geospark.enums.FileDataSplitter
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geospark.spatialOperator.KNNQuery
import org.datasyslab.geospark.spatialRDD.PointRDD
import scala.collection.JavaConversions._

/**
 * SpatialKNNQueryApp
 * @author Rand
 * @date 2020/4/16
 */
object SpatialKNNQueryApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setAppName("SpatialKNNQueryApp").setMaster("local[*]").
      set("spark.serializer", classOf[KryoSerializer].getName).
      set("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
    implicit val sc = new SparkContext(conf)
    val objectRDD = createPointRDD
    objectRDD.rawSpatialRDD.rdd.collect().foreach(println)
    val geometryFactory = new GeometryFactory()
    // Center point of the KNN query
    val pointObject = geometryFactory.createPoint(new Coordinate(-84.01, 34.01))
    val K = 2 // K nearest neighbors
    val usingIndex = false
    val result = KNNQuery.SpatialKnnQuery(objectRDD, pointObject, K, usingIndex)
    println("-----------------------------------")
    // Remember to import scala.collection.JavaConversions._, otherwise this line won't compile
    result.foreach(println)
  }

  def createPointRDD(implicit sc: SparkContext): PointRDD = {
    val pointRDDInputLocation = "data/checkin2.csv"
    // Which columns hold longitude and latitude; here they are columns 0 and 1, so the offset is 0
    val pointRDDOffset = 0
    val pointRDDSplitter = FileDataSplitter.CSV
    // Whether to carry extra user-defined attributes besides the coordinates
    val carryOtherAttributes = true
    val objectRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
    objectRDD
  }
}
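One pragmatic workaround for the NullPointerException gripe above is to cap k at the dataset size before querying. A sketch reusing the variables from the example (note that counting triggers a Spark job, so cache the RDD first if it is reused):

```scala
// Cap k so it never exceeds the number of points in the RDD,
// avoiding the NPE when k is larger than the dataset
val total = objectRDD.rawSpatialRDD.count()
val safeK = math.min(K.toLong, total).toInt
val result = KNNQuery.SpatialKnnQuery(objectRDD, pointObject, safeK, usingIndex)
```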
2.3 Running the Results
The query result contains the two points gas3 and gas1:
POINT (-88.331492 32.324142) hotel
POINT (-88.175933 32.360763) gas1
POINT (-88.176033 32.360763) gas2
POINT (-88.175833 32.360763) gas3
POINT (-88.388954 32.357073) bar
POINT (-88.221102 32.35078) restaurant
-----------------------------------
POINT (-88.175833 32.360763) gas3
POINT (-88.175933 32.360763) gas1
3. Spatial Join Query
A spatial join query is similar to a database join: given Spatial RDDs A and B, it traverses the geometries in A and matches the geometries in B that cover or intersect them.
3.1 Data Preparation
Create checkin3.csv at the path data/checkin3.csv:
-88.331492,32.324142,1.hotel
-88.175933,32.360763,1.gas
-88.388954,32.357073,1.bar
-88.588954,32.357073,1.spark
Create checkin4.csv at the path data/checkin4.csv:
-88.175933,32.360763,2.gas
-88.388954,32.357073,2.bar
-88.221102,32.35078,2.restaurant
-88.321102,32.35078,2.bus
3.2 Code Example
package com.suddev.bigdata.query

import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.{SparkConf, SparkContext}
import org.datasyslab.geospark.enums.{FileDataSplitter, GridType}
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.PointRDD

/**
 * SpatialJoinQueryApp
 *
 * @author Rand
 * @date 2020/4/16
 */
object SpatialJoinQueryApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setAppName("SpatialJoinQueryApp").setMaster("local[*]").
      set("spark.serializer", classOf[KryoSerializer].getName).
      set("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
    implicit val sc = new SparkContext(conf)
    // Prepare the data
    val objectRDD = createObjectRDD
    objectRDD.rawSpatialRDD.rdd.collect().foreach(println)
    val queryWindowRDD = createQueryWindowRDD
    println("---------------------------")
    queryWindowRDD.rawSpatialRDD.rdd.collect().foreach(println)
    println("---------------------------")
    objectRDD.analyze()
    // objectRDD and queryWindowRDD must both be spatially partitioned, and:
    // 1. both must use the same, non-null spatial partitioner
    // 2. both must have the same number of partitions
    objectRDD.spatialPartitioning(GridType.KDBTREE)
    queryWindowRDD.spatialPartitioning(objectRDD.getPartitioner)
    val considerBoundaryIntersection = false
    val usingIndex = false
    val result = JoinQuery.SpatialJoinQuery(objectRDD, queryWindowRDD, usingIndex, considerBoundaryIntersection)
    result.rdd.foreach(println)
  }

  def createObjectRDD(implicit sc: SparkContext): PointRDD = {
    val pointRDDInputLocation = "data/checkin3.csv"
    val pointRDDOffset = 0
    val pointRDDSplitter = FileDataSplitter.CSV
    val carryOtherAttributes = true
    val objectRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
    objectRDD
  }

  def createQueryWindowRDD(implicit sc: SparkContext): PointRDD = {
    val pointRDDInputLocation = "data/checkin4.csv"
    val pointRDDOffset = 0
    val pointRDDSplitter = FileDataSplitter.CSV
    val carryOtherAttributes = true
    val objectRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
    objectRDD
  }
}
3.3 Running the Results
The gas and bar records on both sides are joined:
POINT (-88.331492 32.324142) 1.hotel
POINT (-88.175933 32.360763) 1.gas
POINT (-88.388954 32.357073) 1.bar
POINT (-88.588954 32.357073) 1.spark
---------------------------
POINT (-88.175933 32.360763) 2.gas
POINT (-88.388954 32.357073) 2.bar
POINT (-88.221102 32.35078) 2.restaurant
POINT (-88.321102 32.35078) 2.bus
---------------------------
(POINT (-88.175933 32.360763) 2.gas,[POINT (-88.175933 32.360763) 1.gas])
(POINT (-88.388954 32.357073) 2.bar,[POINT (-88.388954 32.357073) 1.bar])
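The result above groups all matched objects into a set per query window. If flat pairs are preferred, there is also a flat variant. A sketch, assuming the JoinQuery.SpatialJoinQueryFlat method from the same API (it returns one (window, object) pair per match) and reusing the partitioned RDDs from the example above:

```scala
import org.datasyslab.geospark.spatialOperator.JoinQuery

// Same partitioned RDDs and flags as above; emits one pair per matched
// object instead of one set per query window
val flatResult = JoinQuery.SpatialJoinQueryFlat(objectRDD, queryWindowRDD, usingIndex, considerBoundaryIntersection)
flatResult.rdd.collect().foreach(println)
```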
4. Distance Join Query
A distance join query takes two Spatial RDDs A and B plus a distance as input. For each geometry object in A, it finds the geometry objects in B that lie within the given distance.
⚠️ A note on distance:
GeoSpark does not control the coordinate units (degree-based or meter-based) of the geometries in a SpatialRDD. All distance-related units in GeoSpark are the same as the units (degrees or meters) of the geometries in the SpatialRDD.
Code to transform the Coordinate Reference System (CRS):
val sourceCrsCode = "epsg:4326" // WGS84, the most common degree-based CRS
val targetCrsCode = "epsg:3857" // The most common meter-based CRS
objectRDD.CRSTransform(sourceCrsCode, targetCrsCode)
References:
GIS Basics - Coordinate Systems, Projections, EPSG:4326, EPSG:3857
4.1 Data Preparation
Create checkin5.csv at the path data/checkin5.csv:
-89.331492,32.324142,1.hotel
-88.1760,32.360763,1.gas
-88.3890,32.357073,1.bar
-89.588954,32.357073,1.spark
Create checkin6.csv at the path data/checkin6.csv:
-88.175933,32.360763,2.gas
-88.388954,32.357073,2.bar
-88.221102,32.35078,2.restaurant
-88.321102,32.35078,2.bus
4.2 Code Example
package com.suddev.bigdata.query

import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.{SparkConf, SparkContext}
import org.datasyslab.geospark.enums.{FileDataSplitter, GridType}
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.{CircleRDD, PointRDD}

/**
 * DistanceJoinQueryApp
 *
 * @author Rand
 * @date 2020/4/16
 */
object DistanceJoinQueryApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setAppName("DistanceJoinQueryApp").setMaster("local[*]").
      set("spark.serializer", classOf[KryoSerializer].getName).
      set("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
    implicit val sc = new SparkContext(conf)
    // Prepare the data
    val objectRddA = createObjectRDDA
    objectRddA.rawSpatialRDD.rdd.collect().foreach(println)
    val objectRddB = createObjectRDDB
    println("---------------------------")
    objectRddB.rawSpatialRDD.rdd.collect().foreach(println)
    println("---------------------------")
    // Set the distance: wrap each point of A in a circle of the given radius
    val circleRDD = new CircleRDD(objectRddA, 0.1)
    circleRDD.analyze()
    circleRDD.spatialPartitioning(GridType.KDBTREE)
    objectRddB.spatialPartitioning(circleRDD.getPartitioner)
    val considerBoundaryIntersection = false // Only return geometries fully covered by each query window
    val usingIndex = false
    val result = JoinQuery.DistanceJoinQueryFlat(objectRddB, circleRDD, usingIndex, considerBoundaryIntersection)
    result.rdd.foreach(println)
  }

  def createObjectRDDA(implicit sc: SparkContext): PointRDD = {
    val pointRDDInputLocation = "data/checkin5.csv"
    val pointRDDOffset = 0
    val pointRDDSplitter = FileDataSplitter.CSV
    val carryOtherAttributes = true
    val objectRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
    objectRDD
  }

  def createObjectRDDB(implicit sc: SparkContext): PointRDD = {
    val pointRDDInputLocation = "data/checkin6.csv"
    val pointRDDOffset = 0
    val pointRDDSplitter = FileDataSplitter.CSV
    val carryOtherAttributes = true
    val objectRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
    objectRDD
  }
}
4.3 Running the Results
1.gas matches the two points 2.gas and 2.restaurant, and 1.bar matches the two points 2.bar and 2.bus:
POINT (-89.331492 32.324142) 1.hotel
POINT (-88.176 32.360763) 1.gas
POINT (-88.389 32.357073) 1.bar
POINT (-89.588954 32.357073) 1.spark
---------------------------
POINT (-88.175933 32.360763) 2.gas
POINT (-88.388954 32.357073) 2.bar
POINT (-88.221102 32.35078) 2.restaurant
POINT (-88.321102 32.35078) 2.bus
---------------------------
(POINT (-88.176 32.360763) 1.gas,POINT (-88.175933 32.360763) 2.gas)
(POINT (-88.176 32.360763) 1.gas,POINT (-88.221102 32.35078) 2.restaurant)
(POINT (-88.389 32.357073) 1.bar,POINT (-88.388954 32.357073) 2.bar)
(POINT (-88.389 32.357073) 1.bar,POINT (-88.321102 32.35078) 2.bus)
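DistanceJoinQueryFlat emits one pair per match. A grouped variant also exists; a hedged sketch assuming the JoinQuery.DistanceJoinQuery method from the same API (it returns each object together with the set of centers whose circles cover it), reusing the partitioned RDDs from the example above:

```scala
import org.datasyslab.geospark.spatialOperator.JoinQuery

// Same circleRDD / objectRddB and flags as above; groups matches into a
// set per geometry instead of emitting flat pairs
val grouped = JoinQuery.DistanceJoinQuery(objectRddB, circleRDD, usingIndex, considerBoundaryIntersection)
grouped.rdd.collect().foreach(println)
```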