Dataset[String] to Dataset[case class] (Dataset[String] to Dataset[ThrDynamicRowV001])
`val ds: Dataset[ThrDynamicRowV001] = spark.read.textFile(inputThrFile).map(row => {
  // split with limit -1 so trailing empty fields are kept
  val split_str = row.split(",", -1)
  // replace empty numeric fields with "-1" so the conversions below do not throw
  for (i <- 0 to 13) {
    if (split_str(i).isEmpty) {
      split_str(i) = "-1"
    }
  }
  val uniqueId = split_str(0).toLong
  val acqTime = split_str(1).toLong
  val targetType = split_str(2).toShort
  val dataSupplier = split_str(3).toShort
  val dataSource = split_str(4).toShort
  val status = split_str(5).toShort
  val longitude = split_str(6).toLong
  val latitude = split_str(7).toLong
  val areaId = split_str(8).toLong
  val speed: Int = split_str(9).toInt
  val convertion: Double = split_str(10).toDouble
  val cog: Int = split_str(11).toInt
  val trueHead: Int = split_str(12).toInt
  val power: Int = split_str(13).toInt
  val ext: String = split_str(14)
  val extend: String = split_str(15)
  ThrDynamicRowV001(
    uniqueId, acqTime, targetType, dataSupplier, dataSource, status,
    longitude, latitude, areaId, speed, convertion, cog, trueHead,
    power, ext, extend
  )
})`
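The per-row parsing above can be exercised without a Spark session on a plain String; a minimal sketch (the sample line is illustrative, not from the original data):

```scala
// Sketch of the parsing step on one plain CSV line instead of a Dataset[String].
// split(",", -1) keeps trailing empty fields, which split(",") would drop,
// so empty numeric fields can safely be replaced with "-1" before conversion.
val row = "42,1440246594,,1,0,,-5760000,129434402,7,12,0.5,90,91,1,a,b"
val split_str = row.split(",", -1)
for (i <- 0 to 13) {
  if (split_str(i).isEmpty) {
    split_str(i) = "-1"
  }
}
val uniqueId = split_str(0).toLong
val targetType = split_str(2).toShort // the empty field became "-1"
println(uniqueId)   // 42
println(targetType) // -1
```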
spark dataframe or dataset GroupBy flatMapGroups
- groupByKey
- Similar to the SQL "GROUP BY" clause, the Spark groupBy() function is used to collect identical data into groups on a DataFrame/Dataset and perform aggregate functions on the grouped data.
- Syntax: calling groupBy() on a Spark DataFrame returns a RelationalGroupedDataset object, which provides the aggregate functions.
- groupBy(col1 : scala.Predef.String, cols : scala.Predef.String*) : org.apache.spark.sql.RelationalGroupedDataset
- flatMapGroups
- Applies the given function to each group of data: the UDF is applied to each group and returns a new Dataset[case class]; finally, all the per-group Datasets returned by the UDF are flattened into one large Dataset.
- Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new [[Dataset]].
- groupByKey(key1).flatMapGroups((key,iters) => {function1})
demo:
` val shipData = readBooksDF.groupByKey(_.mmsi)
  .flatMapGroups((_, group) => {
    var i = 0
    var list1: List[AISship] = List()
    // e.g. group.toList gives List(AISship(309787000,1440246594,-5760000,129434402),…)
    val pointList = group.toList.sortBy(_.acqtime)
    var lastTime = pointList.head.acqtime
    var index = 0
    for (row <- pointList) {
      val nTime = row.acqtime
      // difference between the two timestamps, converted to hours
      val timeDiff = getDiff(nTime, lastTime) / 3600.0
      if ((timeDiff > 48) && ((index + 3 + 1) <= pointList.length)) {
        // take the next three consecutive points and check whether the gap between
        // each consecutive pair is under 30 minutes; if so, treat this as a new batch
        val tmp = pointList.apply(index + 1).acqtime
        val tmp2 = pointList.apply(index + 2).acqtime
        val tmp3 = pointList.apply(index + 3).acqtime
        val diff1 = getDiff(tmp, nTime) / 60.0
        val diff2 = getDiff(tmp2, tmp) / 60.0
        val diff3 = getDiff(tmp3, tmp2) / 60.0
        if ((diff1 < 30) && (diff2 < 30) && (diff3 < 30)) {
          i += 1
        }
      }
      list1 = list1 :+ AISship(row.mmsi, row.acqtime, row.latitude, row.longitude, i.toString)
      lastTime = nTime
      index += 1
    }
    list1
  })`
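The batch-splitting logic inside flatMapGroups can be checked without Spark on a plain sorted list of timestamps; a reduced sketch (getDiff is assumed here to be a plain difference in seconds, and a list of longs stands in for the sorted AISship points):

```scala
// Reduced sketch of the grouping logic above: after a gap of more than 48 hours,
// if the next three points each follow within 30 minutes, a new batch index starts.
def getDiff(a: Long, b: Long): Long = a - b // assumed: timestamp difference in seconds

val pointList = List(0L, 600L, 1200L, 200000L, 200300L, 200600L, 200900L)

var i = 0
var batches: List[(Long, Int)] = List()
var lastTime = pointList.head
var index = 0
for (nTime <- pointList) {
  val timeDiff = getDiff(nTime, lastTime) / 3600.0 // hours
  if ((timeDiff > 48) && ((index + 3 + 1) <= pointList.length)) {
    val t1 = pointList(index + 1)
    val t2 = pointList(index + 2)
    val t3 = pointList(index + 3)
    val close = getDiff(t1, nTime) / 60.0 < 30 &&
      getDiff(t2, t1) / 60.0 < 30 &&
      getDiff(t3, t2) / 60.0 < 30
    if (close) i += 1 // start a new batch
  }
  batches = batches :+ ((nTime, i))
  lastTime = nTime
  index += 1
}
println(batches.map(_._2)) // List(0, 0, 0, 1, 1, 1, 1)
```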
Spark SQL
Spark write to HBase table
`df.write.options(
Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()`
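The catalog passed above is a JSON string that maps DataFrame columns to HBase column families and qualifiers; a hypothetical example (table, family, and column names are illustrative, not from the original notes):

```scala
// Hypothetical hbase-spark catalog: a "shipTable" with the mmsi column as row key
// and the remaining columns in an "info" column family.
val catalog: String =
  s"""{
     |  "table":   {"namespace": "default", "name": "shipTable"},
     |  "rowkey":  "key",
     |  "columns": {
     |    "mmsi":    {"cf": "rowkey", "col": "key",     "type": "string"},
     |    "acqtime": {"cf": "info",   "col": "acqtime", "type": "long"},
     |    "lat":     {"cf": "info",   "col": "lat",     "type": "long"},
     |    "lon":     {"cf": "info",   "col": "lon",     "type": "long"}
     |  }
     |}""".stripMargin
println(catalog)
```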
- store the DataFrame to an HBase table using the save() function on the DataFrame. format takes the "org.apache.spark.sql.execution.datasources.hbase" DataSource defined in the "hbase-spark" API, which enables us to use DataFrames with HBase tables. df.write.options takes the catalog and specifies that 4 regions should be used in the cluster. Finally, save() writes it to the HBase table.
Spark Write to Cassandra keyspace table Dataset[case class]
`formatChange.write
  .option("keyspace", keyspace).option("table", table)
  .option("spark.cassandra.output.consistency.level", consistencyLevel)
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .save`
demo: store a Dataset[CassandraAisDynamicCompressRow] to a Cassandra keyspace table
spark read HBase table
`val hbaseDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()`
spark read cassandra table
`val readBooksDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .load`
spark dataframe where or filter
- where and filter are functionally identical; the usages below are equivalent
- The Spark filter() or where() function is used to filter rows from a DataFrame or Dataset based on one or more given conditions or a SQL expression.
- Spark DataFrame filter or where syntaxes:
- filter(condition: Column): Dataset[T]
- filter(conditionExpr: String): Dataset[T] // using SQL expression
- filter(func: T => Boolean): Dataset[T]
- filter(func: FilterFunction[T]): Dataset[T]
- Using the first signature you can refer to column names with one of the following syntaxes: $"colname", col("colname"), 'colname, or df("colname"), together with a condition expression
- The second signature is used to provide a SQL expression to filter rows
- The third signature is used with Scala functions, where the function is applied to each row
- The fourth signature is used with the FilterFunction class
- demo1: DataFrame where() with Column condition
- find all rows where the state column equals "OH"
`df.where(df("state") === "OH").show(false)
df.where('state === "OH").show(false)
df.where($"state" === "OH").show(false)
df.where(col("state") === "OH").show(false)`
- demo2: where with multiple conditions; find all rows where the state column equals "OH" and the gender column equals "M"
- df.where(df("state") === "OH" && df("gender") === "M")
- demo3: find all rows whose ymxy value appears in the cqlQuery list
- df.where($"ymxy" isInCollection cqlQuery.toList)
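Without a Spark session, the same predicates can be exercised on plain Scala collections; the third filter signature (func: T => Boolean) behaves like collection filter (Person and its fields are illustrative names, not from the original notes):

```scala
// Plain-Scala analogue of the where/filter demos above.
case class Person(state: String, gender: String)

val rows = List(Person("OH", "M"), Person("OH", "F"), Person("CA", "M"))

// analogue of df.where(df("state") === "OH")
val ohRows = rows.filter(_.state == "OH")

// analogue of df.where(df("state") === "OH" && df("gender") === "M")
val ohMales = rows.filter(r => r.state == "OH" && r.gender == "M")

// analogue of df.where($"state" isInCollection list)
val states = List("OH", "CA")
val inList = rows.filter(r => states.contains(r.state))

println(ohRows.size)  // 2
println(ohMales.size) // 1
println(inList.size)  // 3
```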
spark select multiple columns
- select only the specified columns
- demo: df.select("mmsi", "acqtime", "latitude", "longitude", "ext")
Spark dataframe to Dataset
- Spark DataFrame to Dataset[AISship]; precondition: the DataFrame's column names and data types match those of the case class
- demo: val readBookDataset = dataframe.as[AISship]
spark dataset write to csv files
- demo: IoPolyEvent has type Dataset[IoPortEventRow]
`val outputIoPolyEventPath = "data6.csv"
IoPolyEvent
  .repartition(1)
  .write
  .option("header", "true")
  .csv(outputIoPolyEventPath)`
Spark examples:https://sparkbyexamples.com/
https://www.javatpoint.com/apache-spark-groupbykey-function
DataSet
- Advantages of DataSet
- Combines the advantages of RDD and DataFrame; supports both structured and unstructured data
- Like RDD, supports storing user-defined objects
- Like DataFrame, supports SQL queries over structured data
- Stores data in off-heap memory, which is GC-friendly
- Type-safe conversions, friendlier code
DataSet
- A DataSet is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
- Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
- Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
- To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (String) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. The binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation of the data, use the schema function.
RDD
(figure: structure of RDD[Person] vs. DataFrame)
- The figure above shows the difference between DataFrame and RDD intuitively. Although the RDD[Person] on the left takes Person as its type parameter, the Spark framework itself knows nothing about the internal structure of the Person class. The DataFrame on the right, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains, and the name and type of each column. A DataFrame carries extra structural information about the data, namely the schema. An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides offering richer operators than RDD, DataFrame more importantly improves execution efficiency, reduces data reads, and optimizes execution plans, e.g. filter pushdown and column pruning.
RDD vs DataSet
- A DataSet is represented as a Catalyst logical plan, and its data is stored in an encoded binary form, so operations such as sorting and shuffle can run without deserialization
- Creating a Dataset requires an explicit Encoder, which serializes objects to binary and maps the object's schema to Spark SQL types, whereas RDD relies on a runtime reflection mechanism
- Because of the two points above, DataSet performs much better than RDD
DataSet vs DataFrame
- A DataFrame is a special case of DataSet: each record of a Dataset stores a strongly typed value (a case class), while each record of a DataFrame stores a Row
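The typed-versus-untyped distinction above can be illustrated without Spark: a case class record is checked at compile time, while a Row-like record is accessed by name at runtime (a plain-Scala analogy, not Spark's actual Row API):

```scala
// Typed record: field access is checked at compile time, like a Dataset[Ship].
case class Ship(mmsi: Long, speed: Int)
val typed = Ship(309787000L, 12)
val s1: Int = typed.speed

// Untyped "row": access by name at runtime, type recovered via a cast,
// analogous to how a DataFrame stores Row objects.
val row: Map[String, Any] = Map("mmsi" -> 309787000L, "speed" -> 12)
val s2: Int = row("speed").asInstanceOf[Int]

println(s1 == s2) // true
```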
spark dataset
http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
sparkRDD、DataFrame、DataSet的区别
https://www.jianshu.com/p/711ded043053
spark rdd api
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#groupByKey