Dataset[String] to Dataset[case class] (Dataset[String] to Dataset[ThrDynamicRowV001])
`val ds: Dataset[ThrDynamicRowV001] = spark.read.textFile(inputThrFile).map(row => {
  // split with limit -1 so trailing empty fields are kept
  val split_str = row.split(",", -1)
  // replace empty numeric fields with "-1" so the conversions below do not throw
  for (i <- 0 to 13) {
    if (split_str(i).isEmpty) {
      split_str(i) = "-1"
    }
  }
  val uniqueId = split_str(0).toLong
  val acqTime = split_str(1).toLong
  val targetType = split_str(2).toShort
  val dataSupplier = split_str(3).toShort
  val dataSource = split_str(4).toShort
  val status = split_str(5).toShort
  val longitude = split_str(6).toLong
  val latitude = split_str(7).toLong
  val areaId = split_str(8).toLong
  val speed: Int = split_str(9).toInt
  val convertion: Double = split_str(10).toDouble
  val cog: Int = split_str(11).toInt
  val trueHead: Int = split_str(12).toInt
  val power: Int = split_str(13).toInt
  val ext: String = split_str(14)
  val extend: String = split_str(15)
  ThrDynamicRowV001(
    uniqueId, acqTime, targetType, dataSupplier, dataSource, status,
    longitude, latitude, areaId, speed, convertion, cog, trueHead,
    power, ext, extend
  )
})`
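The per-row parsing above can be exercised without a Spark session on a plain String; a minimal sketch (the sample line is illustrative, not from the original data):

```scala
// Sketch of the parsing step on one plain CSV line instead of a Dataset[String].
// split(",", -1) keeps trailing empty fields, which split(",") would drop,
// so empty numeric fields can safely be replaced with "-1" before conversion.
val row = "42,1440246594,,1,0,,-5760000,129434402,7,12,0.5,90,91,1,a,b"
val split_str = row.split(",", -1)
for (i <- 0 to 13) {
  if (split_str(i).isEmpty) {
    split_str(i) = "-1"
  }
}
val uniqueId = split_str(0).toLong
val targetType = split_str(2).toShort // the empty field became "-1"
println(uniqueId)   // 42
println(targetType) // -1
```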
spark dataframe or dataset GroupBy flatMapGroups
- groupByKey
- Similar to the SQL "GROUP BY" clause, the Spark groupBy() function is used to collect identical data into groups on a DataFrame/Dataset and perform aggregate functions on the grouped data.
- Syntax: calling groupBy() on a Spark DataFrame returns a RelationalGroupedDataset object, which provides the aggregate functions.
- groupBy(col1 : scala.Predef.String, cols : scala.Predef.String*) : org.apache.spark.sql.RelationalGroupedDataset
- flatMapGroups
- Applies the given function to each group of data: the UDF is applied to each group and returns a new Dataset[case class]; finally, all the per-group Datasets returned by the UDF are flattened into one large Dataset.
- Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new [[Dataset]].
- groupByKey(key1).flatMapGroups((key,iters) => {function1})
demo:
` val shipData = readBooksDF.groupByKey(_.mmsi)
  .flatMapGroups((_, group) => {
    var i = 0
    var list1: List[AISship] = List()
    // e.g. group.toList gives List(AISship(309787000,1440246594,-5760000,129434402),…)
    val pointList = group.toList.sortBy(_.acqtime)
    var lastTime = pointList.head.acqtime
    var index = 0
    for (row <- pointList) {
      val nTime = row.acqtime
      // difference between the two timestamps, converted to hours
      val timeDiff = getDiff(nTime, lastTime) / 3600.0
      if ((timeDiff > 48) && ((index + 3 + 1) <= pointList.length)) {
        // take the next three consecutive points and check whether the gap between
        // each consecutive pair is under 30 minutes; if so, treat this as a new batch
        val tmp = pointList.apply(index + 1).acqtime
        val tmp2 = pointList.apply(index + 2).acqtime
        val tmp3 = pointList.apply(index + 3).acqtime
        val diff1 = getDiff(tmp, nTime) / 60.0
        val diff2 = getDiff(tmp2, tmp) / 60.0
        val diff3 = getDiff(tmp3, tmp2) / 60.0
        if ((diff1 < 30) && (diff2 < 30) && (diff3 < 30)) {
          i += 1
        }
      }
      list1 = list1 :+ AISship(row.mmsi, row.acqtime, row.latitude, row.longitude, i.toString)
      lastTime = nTime
      index += 1
    }
    list1
  })`
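The batch-splitting logic inside flatMapGroups can be checked without Spark on a plain sorted list of timestamps; a reduced sketch (getDiff is assumed here to be a plain difference in seconds, and a list of longs stands in for the sorted AISship points):

```scala
// Reduced sketch of the grouping logic above: after a gap of more than 48 hours,
// if the next three points each follow within 30 minutes, a new batch index starts.
def getDiff(a: Long, b: Long): Long = a - b // assumed: timestamp difference in seconds

val pointList = List(0L, 600L, 1200L, 200000L, 200300L, 200600L, 200900L)

var i = 0
var batches: List[(Long, Int)] = List()
var lastTime = pointList.head
var index = 0
for (nTime <- pointList) {
  val timeDiff = getDiff(nTime, lastTime) / 3600.0 // hours
  if ((timeDiff > 48) && ((index + 3 + 1) <= pointList.length)) {
    val t1 = pointList(index + 1)
    val t2 = pointList(index + 2)
    val t3 = pointList(index + 3)
    val close = getDiff(t1, nTime) / 60.0 < 30 &&
      getDiff(t2, t1) / 60.0 < 30 &&
      getDiff(t3, t2) / 60.0 < 30
    if (close) i += 1 // start a new batch
  }
  batches = batches :+ ((nTime, i))
  lastTime = nTime
  index += 1
}
println(batches.map(_._2)) // List(0, 0, 0, 1, 1, 1, 1)
```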
Spark SQL
Spark write to HBase table
`df.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()`
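The catalog passed above is a JSON string that maps DataFrame columns to HBase column families and qualifiers; a hypothetical example (table, family, and column names are illustrative, not from the original notes):

```scala
// Hypothetical hbase-spark catalog: a "shipTable" with the mmsi column as row key
// and the remaining columns in an "info" column family.
val catalog: String =
  s"""{
     |  "table":   {"namespace": "default", "name": "shipTable"},
     |  "rowkey":  "key",
     |  "columns": {
     |    "mmsi":    {"cf": "rowkey", "col": "key",     "type": "string"},
     |    "acqtime": {"cf": "info",   "col": "acqtime", "type": "long"},
     |    "lat":     {"cf": "info",   "col": "lat",     "type": "long"},
     |    "lon":     {"cf": "info",   "col": "lon",     "type": "long"}
     |  }
     |}""".stripMargin
println(catalog)
```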
- store the DataFrame to an HBase table using the save() function on the DataFrame. format takes the "org.apache.spark.sql.execution.datasources.hbase" DataSource defined in the "hbase-spark" API, which enables us to use DataFrames with HBase tables. df.write.options takes the catalog and specifies that 4 regions should be used in the cluster. Finally, save() writes it to the HBase table.
Spark Write to Cassandra keyspace table Dataset[case class]
`formatChange.write
  .option("keyspace", keyspace).option("table", table)
  .option("spark.cassandra.output.consistency.level", consistencyLevel)
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .save`
demo: store a Dataset[CassandraAisDynamicCompressRow] to a Cassandra keyspace table
spark read HBase table
`val hbaseDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()`
spark read cassandra table
`val readBooksDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .load`
spark dataframe where or filter
- where and filter are functionally identical; the usages below are equivalent
- The Spark filter() or where() function is used to filter rows from a DataFrame or Dataset based on one or more given conditions or a SQL expression.
- Spark DataFrame filter or where syntaxes:
- filter(condition: Column): Dataset[T]
- filter(conditionExpr: String): Dataset[T] // using SQL expression
- filter(func: T => Boolean): Dataset[T]
- filter(func: FilterFunction[T]): Dataset[T]
- Using the first signature you can refer to column names with one of the following syntaxes: $"colname", col("colname"), 'colname, or df("colname"), together with a condition expression
- The second signature is used to provide a SQL expression to filter rows
- The third signature is used with Scala functions, where the function is applied to each row
- The fourth signature is used with the FilterFunction class
- demo1: DataFrame where() with Column condition
- find all rows where the state column equals "OH"
`df.where(df("state") === "OH").show(false)
df.where('state === "OH").show(false)
df.where($"state" === "OH").show(false)
df.where(col("state") === "OH").show(false)`
- demo2: where with multiple conditions; find all rows where the state column equals "OH" and the gender column equals "M"
- df.where(df("state") === "OH" && df("gender") === "M")
- demo3: find all rows whose ymxy value appears in the cqlQuery list
- df.where($"ymxy" isInCollection cqlQuery.toList)
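Without a Spark session, the same predicates can be exercised on plain Scala collections; the third filter signature (func: T => Boolean) behaves like collection filter (Person and its fields are illustrative names, not from the original notes):

```scala
// Plain-Scala analogue of the where/filter demos above.
case class Person(state: String, gender: String)

val rows = List(Person("OH", "M"), Person("OH", "F"), Person("CA", "M"))

// analogue of df.where(df("state") === "OH")
val ohRows = rows.filter(_.state == "OH")

// analogue of df.where(df("state") === "OH" && df("gender") === "M")
val ohMales = rows.filter(r => r.state == "OH" && r.gender == "M")

// analogue of df.where($"state" isInCollection list)
val states = List("OH", "CA")
val inList = rows.filter(r => states.contains(r.state))

println(ohRows.size)  // 2
println(ohMales.size) // 1
println(inList.size)  // 3
```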
spark select multiple columns
- select only the specified columns
- demo: df.select("mmsi", "acqtime", "latitude", "longitude", "ext")
Spark dataframe to Dataset
- Spark DataFrame to Dataset[AISship]; precondition: the DataFrame's column names and data types match those of the case class
- demo: val readBookDataset = dataframe.as[AISship]
spark dataset write to csv files
- demo: IoPolyEvent has type Dataset[IoPortEventRow]
`val outputIoPolyEventPath = "data6.csv"
IoPolyEvent
  .repartition(1)
  .write
  .option("header", "true")
  .csv(outputIoPolyEventPath)`
Spark examples:https://sparkbyexamples.com/
https://www.javatpoint.com/apache-spark-groupbykey-function
DataSet
- Advantages of DataSet
- Combines the advantages of RDD and DataFrame; supports both structured and unstructured data
- Like RDD, supports storing user-defined objects
- Like DataFrame, supports SQL queries over structured data
- Stores data in off-heap memory, which is GC-friendly
- Type-safe conversions, friendlier code
DataSet
- A DataSet is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
- Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
- Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
- To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (String) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. The binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation of the data, use the schema function.
RDD
(figure: structure of RDD[Person] vs. DataFrame)
- The figure above shows the difference between DataFrame and RDD intuitively. Although the RDD[Person] on the left takes Person as its type parameter, the Spark framework itself knows nothing about the internal structure of the Person class. The DataFrame on the right, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains, and the name and type of each column. A DataFrame carries extra structural information about the data, namely the schema. An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides offering richer operators than RDD, DataFrame more importantly improves execution efficiency, reduces data reads, and optimizes execution plans, e.g. filter pushdown and column pruning.
RDD vs DataSet
- A DataSet is represented as a Catalyst logical plan, and its data is stored in an encoded binary form, so operations such as sorting and shuffle can run without deserialization
- Creating a Dataset requires an explicit Encoder, which serializes objects to binary and maps the object's schema to Spark SQL types, whereas RDD relies on a runtime reflection mechanism
- Because of the two points above, DataSet performs much better than RDD
DataSet vs DataFrame
- A DataFrame is a special case of DataSet: each record of a Dataset stores a strongly typed value (a case class), while each record of a DataFrame stores a Row
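The typed-versus-untyped distinction above can be illustrated without Spark: a case class record is checked at compile time, while a Row-like record is accessed by name at runtime (a plain-Scala analogy, not Spark's actual Row API):

```scala
// Typed record: field access is checked at compile time, like a Dataset[Ship].
case class Ship(mmsi: Long, speed: Int)
val typed = Ship(309787000L, 12)
val s1: Int = typed.speed

// Untyped "row": access by name at runtime, type recovered via a cast,
// analogous to how a DataFrame stores Row objects.
val row: Map[String, Any] = Map("mmsi" -> 309787000L, "speed" -> 12)
val s2: Int = row("speed").asInstanceOf[Int]

println(s1 == s2) // true
```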
spark dataset
http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
sparkRDD、DataFrame、DataSet的区别
https://www.jianshu.com/p/711ded043053
spark rdd api
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#groupByKey