Clipping Analysis
Overview
Layer clipping: draw a clip extent over a layer. In typical business scenarios the clip extent is expressed as GeoJSON, though it can just as well come from a shapefile.
Clipping is very similar to a layer intersection; the differences lie in how attributes are handled and in the data volume of each layer, and the amount of data determines the concrete implementation logic and approach. The per-feature rule is sketched below.
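To make the rule concrete before bringing in Spark: a feature fully inside the clip polygon is kept unchanged, and a feature that only partially overlaps it is cut down with an intersection. A minimal single-feature sketch in plain JTS; clipPolygon and featureGeometry are hypothetical inputs, not names from the code below:
import org.locationtech.jts.geom.Geometry

//Clip one feature geometry against a clip polygon.
//Returns None when the two geometries are disjoint.
def clipOne(clipPolygon: Geometry, featureGeometry: Geometry): Option[Geometry] = {
  if (clipPolygon.contains(featureGeometry)) Some(featureGeometry) //fully inside: keep as-is
  else if (clipPolygon.intersects(featureGeometry)) Some(clipPolygon.intersection(featureGeometry)) //partial overlap: cut
  else None //no overlap: drop the feature
}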
Preparation
First, add the dependencies to the pom file.
gt-shapefile and gt-geojson are needed later for converting between shapefiles and GeoJSON.
The spark-sql package is there because this time Spark is initialized in a different style.
<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-shapefile</artifactId>
    <version>${gt.version}</version>
</dependency>
<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-geojson</artifactId>
    <version>${gt.version}</version>
</dependency>
<dependency>
    <groupId>org.locationtech.geomesa</groupId>
    <artifactId>geomesa-spark-sql_2.11</artifactId>
    <version>${geomesa.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
Create the GeomesaClipShp object
Initialize the Spark context: set the serializer, enable JTS support, and so on.
The initialization code follows the official documentation:
https://www.geomesa.org/documentation/stable/user/spark/spark_jts.html
Although JTS is initialized here, none of its functions are used later on; this is simply an alternative way of writing the setup.
val sparkSession = SparkSession.builder()
.appName("GeomesaClipShp")
.config("spark.sql.crossJoin.enabled", "true")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName)
.master("local[6]")
.getOrCreate()
.withJTS
SQLTypes.init(sparkSession.sqlContext)
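For reference, withJTS registers GeoMesa's spatial user-defined types and functions on the session, so spatial SQL becomes available even though this post does not use it. A small illustration using st_geomFromWKT and st_contains, both part of GeoMesa's spatial SQL function set:
val df = sparkSession.sql(
  "SELECT st_contains(st_geomFromWKT('POLYGON((0 0,10 0,10 10,0 10,0 0))'), st_geomFromWKT('POINT(5 5)')) AS inside")
df.show() //prints true: the point lies inside the polygon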
Load the shapefile data source to be clipped
Loading the data source yields a SpatialRDD; all subsequent operations work on it.
Define a function that opens a shapefile:
def loadShpRdd(sparkContext: SparkContext, shpPath: String, name: String): SpatialRDD = {
  var params: Map[String, String] = Map()
  params += ("url" -> new File(shpPath).toURI.toURL.toString)
  params += ("geotools" -> "true") //select GeoMesa's GeoTools-backed SpatialRDDProvider
  params += ("charset" -> "UTF-8")
  GeoMesaSpark(params).rdd(new Configuration(), sparkContext, params, new Query(name)) //the query type name is the shapefile's base name
}
Open the data sources. rdd is the layer being clipped; inputRdd is the clip-region layer.
val rdd = loadShpRdd(sparkSession.sparkContext, "D:\\work\\bigdata\\data\\buildings\\gis_osm_buildings_a_free_1.shp", "gis_osm_buildings_a_free_1")
val inputRdd = loadShpRdd(sparkSession.sparkContext, "D:\\work\\bigdata\\data\\interset\\interset-4326.shp", "interset-4326")
Collect the clip-region data. The clip data must not stay partitioned; it has to be shipped to every partition of the layer being clipped.
val inputArray = inputRdd.collect()
//validate the layer's topology before use; the GeometryUtils.validate(geometry) function fixes the vast majority of invalid-topology problems
for (simpleFeature <- inputArray) {
  val geometry: Geometry = simpleFeature.getDefaultGeometry.asInstanceOf[Geometry]
  if (!geometry.isValid) {
    simpleFeature.setDefaultGeometry(GeometryUtils.validate(geometry))
  }
}
//broadcast the clip-region data: because of Spark's distributed execution model, data accessed inside other partitions must be shared first
val broadcastRdd = sparkSession.sparkContext.broadcast(inputArray)
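GeometryUtils lives in this project (hello.geomesa.utils) and its source is not included in the post. A minimal sketch of what validate might look like, assuming it relies on JTS's buffer(0) repair trick:
import org.locationtech.jts.geom.Geometry

//Hypothetical sketch of the project's GeometryUtils helper; the real
//implementation is not shown in this post.
object GeometryUtils {
  def validate(geometry: Geometry): Geometry =
    if (geometry.isValid) geometry
    else geometry.buffer(0) //rebuilds polygon topology and fixes most self-intersections
}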
Run the clip task:
clipFeature(sparkSession.sparkContext, rdd, broadcastRdd)
val endTime = System.currentTimeMillis
println(((endTime - startTime) / 1000).toString().concat(" seconds"))
/**
 * Clip the layer
 *
 * @param sparkContext the Spark context
 * @param rdd          the SpatialRDD being clipped
 * @param inputRdd     the broadcast clip-region features
 */
def clipFeature(sparkContext: SparkContext, rdd: SpatialRDD, inputRdd: Broadcast[Array[SimpleFeature]]): Unit = {
  val resultPath = "D:\\work\\bigdata\\clip\\".concat(sparkContext.getConf.getAppId) //output directory for the current application
  val directory = Directory(resultPath)
  directory.createDirectory(true, false) //create the directory
  //clip the data of every partition and return the resulting rdd
  val resultRdd = rdd.mapPartitions(partition => {
    partition.flatMap(item => {
      var originGeometry: Geometry = item.getDefaultGeometry.asInstanceOf[Geometry]
      val intersectArray = ListBuffer[SimpleFeature]()
      if (originGeometry != null) {
        if (!originGeometry.isValid) {
          originGeometry = GeometryUtils.validate(originGeometry)
        }
        for (simpleFeature <- inputRdd.value) {
          var targetGeometry: Geometry = simpleFeature.getDefaultGeometry.asInstanceOf[Geometry]
          if (targetGeometry != null) {
            if (targetGeometry.getEnvelope.intersects(originGeometry)) { //cheap envelope pre-check
              if (!targetGeometry.isValid) {
                targetGeometry = GeometryUtils.validate(targetGeometry)
              }
              //from here on the geometry handling is the same as in the ordinary, non-distributed case
              var resultGeometry: Geometry = null
              if (originGeometry.contains(targetGeometry)) {
                resultGeometry = targetGeometry
              } else if (originGeometry.intersects(targetGeometry)) {
                resultGeometry = originGeometry.intersection(targetGeometry)
              }
              if (resultGeometry != null) {
                val copySimpleFeature = ScalaSimpleFeatureFactory.copyFeature(item.getFeatureType, item, item.getID)
                copySimpleFeature.setDefaultGeometry(resultGeometry)
                intersectArray += copySimpleFeature
              }
            }
          }
        }
      }
      intersectArray
    })
  })
  //repartition the clip results at 50,000 features per partition, so that sparse results do not end up as shapefiles holding only a handful of features
  val reResultRdd = repartRdd(resultRdd, 50000)
  //iterate over the partitions and save the clipped data to disk
  reResultRdd.foreachPartition(partition => {
    import java.io.{File, Serializable}
    import java.util
    val path = resultPath.concat("\\").concat(UUID.randomUUID.toString).concat("_").concat(TaskContext.getPartitionId.toString).concat(".shp")
    val file = new File(path)
    val featureparams: util.Map[String, Serializable] = new util.HashMap[String, Serializable]
    featureparams.put(ShapefileDataStoreFactory.URLP.key, file.toURI.toURL)
    var featurewriter: FeatureWriter[SimpleFeatureType, SimpleFeature] = null
    var featureshapefileDataStore: ShapefileDataStore = null
    partition.foreach(item => {
      try {
        if (featurewriter == null) { //lazily create the store so empty partitions produce no file
          featureshapefileDataStore = new ShapefileDataStoreFactory().createNewDataStore(featureparams).asInstanceOf[ShapefileDataStore]
          featureshapefileDataStore.createSchema(item.getFeatureType)
          featurewriter = featureshapefileDataStore.getFeatureWriterAppend(Transaction.AUTO_COMMIT)
        }
        val simpleFeature = featurewriter.next()
        simpleFeature.setAttributes(item.getAttributes)
        simpleFeature.setDefaultGeometry(item.getDefaultGeometry)
        featurewriter.write()
      } catch {
        case e: Exception => e.printStackTrace()
      }
    })
    if (featurewriter != null) {
      featurewriter.close()
      featureshapefileDataStore.dispose()
    }
  })
}
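The repartRdd helper called above is defined later in the full listing; it is reproduced here so the walkthrough reads in one pass. As a worked example, 1,500,000 clipped features at 50,000 per partition gives 30 partitions, and therefore 30 output shapefiles.
def repartRdd(rdd: RDD[SimpleFeature], count: Int): RDD[SimpleFeature] = {
  println("default partition count: ".concat(rdd.getNumPartitions.toString))
  val shpCount = rdd.count //note: this triggers a full count of the clipped features
  var rddCount = shpCount / count
  if (shpCount - rddCount * count > 0) rddCount = rddCount + 1 //round up when there is a remainder
  if (rddCount < 1) rddCount = 1
  val reRdd = rdd.repartition(rddCount.toInt)
  println("new partition count: ".concat(reRdd.getNumPartitions.toString))
  reRdd
}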
The code structure is simple; if you understand how Spark executes tasks, it is easy to follow.
As for the GeoTools code, go read more of the GeoTools material; everything used here is basic, as the read-back sketch below shows.
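As a hedged illustration of that basic usage, here is a minimal snippet that reads the features back out of one of the generated shapefiles; the path is hypothetical:
import java.io.File
import java.nio.charset.StandardCharsets
import org.geotools.data.shapefile.ShapefileDataStore

//Open one result shapefile and print every feature id and geometry.
val store = new ShapefileDataStore(new File("D:\\work\\bigdata\\clip\\result.shp").toURI.toURL)
store.setCharset(StandardCharsets.UTF_8)
val features = store.getFeatureSource.getFeatures.features()
try {
  while (features.hasNext) {
    val feature = features.next()
    println(feature.getID.concat(" -> ").concat(String.valueOf(feature.getDefaultGeometry)))
  }
} finally {
  features.close()
  store.dispose()
}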
Full code
package hello.geomesa.shp
import java.io.File
import java.util.UUID
import hello.geomesa.utils.GeometryUtils
import org.apache.hadoop.conf.Configuration
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLTypes, SparkSession}
import org.apache.spark.{SparkContext, TaskContext}
import org.geotools.data.shapefile.{ShapefileDataStore, ShapefileDataStoreFactory}
import org.geotools.data.{FeatureWriter, Query, Transaction}
import org.locationtech.geomesa.features.ScalaSimpleFeatureFactory
import org.locationtech.geomesa.spark.jts.SparkSessionWithJTS
import org.locationtech.geomesa.spark.{GeoMesaSpark, GeoMesaSparkKryoRegistrator, SpatialRDD}
import org.locationtech.jts.geom.Geometry
import org.opengis.feature.simple.{SimpleFeature, SimpleFeatureType}
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
import scala.reflect.io.Directory
object GeomesaClipShp {
def main(args: Array[String]): Unit = {
val startTime = System.currentTimeMillis()
val sparkSession = SparkSession.builder()
.appName("GeomesaClipShp")
.config("spark.sql.crossJoin.enabled", "true")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName)
.master("local[6]")
.getOrCreate()
.withJTS
SQLTypes.init(sparkSession.sqlContext)
//the data source being clipped
val rdd = loadShpRdd(sparkSession.sparkContext, "D:\\work\\bigdata\\data\\buildings\\gis_osm_buildings_a_free_1.shp", "gis_osm_buildings_a_free_1")
//the clip-region data source
val inputRdd = loadShpRdd(sparkSession.sparkContext, "D:\\work\\bigdata\\data\\interset\\interset-4326.shp", "interset-4326")
val inputArray = inputRdd.collect()
//validate the layer's topology before use; the GeometryUtils.validate(geometry) function fixes the vast majority of invalid-topology problems
for (simpleFeature <- inputArray) {
  val geometry: Geometry = simpleFeature.getDefaultGeometry.asInstanceOf[Geometry]
  if (!geometry.isValid) {
    simpleFeature.setDefaultGeometry(GeometryUtils.validate(geometry))
  }
}
//broadcast the clip-region data: because of Spark's distributed execution model, data accessed inside other partitions must be shared first
val broadcastRdd = sparkSession.sparkContext.broadcast(inputArray)
clipFeature(sparkSession.sparkContext, rdd, broadcastRdd)
val endTime = System.currentTimeMillis
println(((endTime - startTime) / 1000).toString().concat(" seconds"))
}
def loadShpRdd(sparkContext: SparkContext, shpPath: String, name: String): SpatialRDD = {
  var params: Map[String, String] = Map()
  params += ("url" -> new File(shpPath).toURI.toURL.toString)
  params += ("geotools" -> "true") //select GeoMesa's GeoTools-backed SpatialRDDProvider
  params += ("charset" -> "UTF-8")
  GeoMesaSpark(params).rdd(new Configuration(), sparkContext, params, new Query(name)) //the query type name is the shapefile's base name
}
def repartRdd(rdd: RDD[SimpleFeature], count: Int): RDD[SimpleFeature] = {
  println("default partition count: ".concat(rdd.getNumPartitions.toString))
  val shpCount = rdd.count //note: this triggers a full count of the clipped features
  var rddCount = shpCount / count
  if (shpCount - rddCount * count > 0) rddCount = rddCount + 1 //round up when there is a remainder
  if (rddCount < 1) rddCount = 1
  val reRdd = rdd.repartition(rddCount.toInt)
  println("new partition count: ".concat(reRdd.getNumPartitions.toString))
  reRdd
}
/**
 * Clip the layer
 *
 * @param sparkContext the Spark context
 * @param rdd          the SpatialRDD being clipped
 * @param inputRdd     the broadcast clip-region features
 */
def clipFeature(sparkContext: SparkContext, rdd: SpatialRDD, inputRdd: Broadcast[Array[SimpleFeature]]): Unit = {
  val resultPath = "D:\\work\\bigdata\\clip\\".concat(sparkContext.getConf.getAppId) //output directory for the current application
  val directory = Directory(resultPath)
  directory.createDirectory(true, false) //create the directory
  //clip the data of every partition and return the resulting rdd
  val resultRdd = rdd.mapPartitions(partition => {
    partition.flatMap(item => {
      var originGeometry: Geometry = item.getDefaultGeometry.asInstanceOf[Geometry]
      val intersectArray = ListBuffer[SimpleFeature]()
      if (originGeometry != null) {
        if (!originGeometry.isValid) {
          originGeometry = GeometryUtils.validate(originGeometry)
        }
        for (simpleFeature <- inputRdd.value) {
          var targetGeometry: Geometry = simpleFeature.getDefaultGeometry.asInstanceOf[Geometry]
          if (targetGeometry != null) {
            if (targetGeometry.getEnvelope.intersects(originGeometry)) { //cheap envelope pre-check
              if (!targetGeometry.isValid) {
                targetGeometry = GeometryUtils.validate(targetGeometry)
              }
              //from here on the geometry handling is the same as in the ordinary, non-distributed case
              var resultGeometry: Geometry = null
              if (originGeometry.contains(targetGeometry)) {
                resultGeometry = targetGeometry
              } else if (originGeometry.intersects(targetGeometry)) {
                resultGeometry = originGeometry.intersection(targetGeometry)
              }
              if (resultGeometry != null) {
                val copySimpleFeature = ScalaSimpleFeatureFactory.copyFeature(item.getFeatureType, item, item.getID)
                copySimpleFeature.setDefaultGeometry(resultGeometry)
                intersectArray += copySimpleFeature
              }
            }
          }
        }
      }
      intersectArray
    })
  })
  //repartition the clip results at 50,000 features per partition, so that sparse results do not end up as shapefiles holding only a handful of features
  val reResultRdd = repartRdd(resultRdd, 50000)
  //iterate over the partitions and save the clipped data to disk
  reResultRdd.foreachPartition(partition => {
    import java.io.{File, Serializable}
    import java.util
    val path = resultPath.concat("\\").concat(UUID.randomUUID.toString).concat("_").concat(TaskContext.getPartitionId.toString).concat(".shp")
    val file = new File(path)
    val featureparams: util.Map[String, Serializable] = new util.HashMap[String, Serializable]
    featureparams.put(ShapefileDataStoreFactory.URLP.key, file.toURI.toURL)
    var featurewriter: FeatureWriter[SimpleFeatureType, SimpleFeature] = null
    var featureshapefileDataStore: ShapefileDataStore = null
    partition.foreach(item => {
      try {
        if (featurewriter == null) { //lazily create the store so empty partitions produce no file
          featureshapefileDataStore = new ShapefileDataStoreFactory().createNewDataStore(featureparams).asInstanceOf[ShapefileDataStore]
          featureshapefileDataStore.createSchema(item.getFeatureType)
          featurewriter = featureshapefileDataStore.getFeatureWriterAppend(Transaction.AUTO_COMMIT)
        }
        val simpleFeature = featurewriter.next()
        simpleFeature.setAttributes(item.getAttributes)
        simpleFeature.setDefaultGeometry(item.getDefaultGeometry)
        featurewriter.write()
      } catch {
        case e: Exception => e.printStackTrace()
      }
    })
    if (featurewriter != null) {
      featurewriter.close()
      featureshapefileDataStore.dispose()
    }
  })
}
}
Task execution view
The job ran for about 51 seconds; the analyzed data source holds 1.5 million features, and the clip data source has three polygons.
Test data (Baidu Cloud):
Link: https://pan.baidu.com/s/1-00M3tqYkYzhBO_YwonwdQ
Extraction code: pl8z
Clipping results