【大数据分析】Spark SQL查询：DataFrame

sword_csdn

已于 2022-08-11 12:40:49 修改

阅读量1.7k

点赞数 1

分类专栏： Spark 文章标签： spark 数据分析 sql

于 2020-04-11 16:48:00 首次发布

本文链接：https://blog.csdn.net/sword_csdn/article/details/105428110

版权

Spark 专栏收录该内容

38 篇文章 6 订阅

订阅专栏

这篇文章的所有数据可以通过以下链接获取

链接：https://pan.baidu.com/s/1-Vlo3S8tjMyM6E0kwGALwQ
提取码：nlri

DataFrame可处理结构化数据（按行和列组织的数据，其中每列仅包含特定类型的值）。原本SQL是处理这类数据最常用的方法，而Spark的SparkSQL组件集成了SQL的用法。

1、DataFrame介绍

RDD代表了一种低级，直接的方式来处理Spark中的数据，它是Spark运行时的核心，而DataFrame API用于类似于表的形式处理结构化的分布式数据。DataFrame的灵感来自几种语言：Python的Pandas包中的DataFrame，R中的DataFrame和Julia语言的DataFrame。Spark的不同之处在于它们的分布式特性和Spark的Catalyst，它可以基于可插拔数据源、规则和数据类型实时优化资源使用。

假设有一个需要连接两种类型数据的查询，Spark SQL可以通过在DataFrame中加载这两个数据源来实现。如下图所示

两个客户端（使用DataFrame DSL的Spark应用程序和通过JDBC连接的非Spark客户端应用程序）执行连接两份数据的查询，这两份数据来自不同的数据源，一个是关系型数据库，另一个是HDFS中的Parquet文件。图中Hive是在Hadoop MapReduce之上构建的抽象层的分布式仓库，被广泛应用于数据查询和分析。SparkSQL允许将DataFrame注册到表目录中的表中，这些表不保存数据，而保存有关如何访问结构化数据的信息。

2、基于RDD创建DataFrame

基于RDD创建DataFrame的创建有3种方式。

（1）tuple RDD--->DataFrame

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args:Array[String]):Unit={
    val sparkAppName = "Basic"
    val spark = SparkSession.builder().appName("SparkInAction")
      .master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    val itPostsRows = sc.textFile("/ch05/italianPosts.csv")
    val itPostsSplit = itPostsRows.map(x => x.split("~"))
    val itPostsRDD = itPostsSplit.map(x => (x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12)))
    val itPostsDFrame = itPostsRDD.toDF()
    itPostsDFrame.show(10)
  }
}

+---+--------------------+---+--------------------+---+--------------------+----+--------------------+--------------------+----+----+---+----+
| _1|                  _2| _3|                  _4| _5|                  _6|  _7|                  _8|                  _9| _10| _11|_12| _13|
+---+--------------------+---+--------------------+---+--------------------+----+--------------------+--------------------+----+----+---+----+
|  4|2013-11-11 18:21:...| 17|&lt;p&gt;The infi...| 23|2013-11-10 19:37:...|null|                    |                    |null|null|  2|1165|
|  5|2013-11-10 20:31:...| 12|&lt;p&gt;Come cre...|  1|2013-11-10 19:44:...|  61|Cosa sapreste dir...| &lt;word-choice&gt;|   1|null|  1|1166|
|  2|2013-11-10 20:31:...| 17|&lt;p&gt;Il verbo...|  5|2013-11-10 19:58:...|null|                    |                    |null|null|  2|1167|
|  1|2014-07-25 13:15:...|154|&lt;p&gt;As part ...| 11|2013-11-10 22:03:...| 187|Ironic constructi...|&lt;english-compa...|   4|1170|  1|1168|
|  0|2013-11-10 22:15:...| 70|&lt;p&gt;&lt;em&g...|  3|2013-11-10 22:15:...|null|                    |                    |null|null|  2|1169|
|  2|2013-11-10 22:17:...| 17|&lt;p&gt;There's ...|  8|2013-11-10 22:17:...|null|                    |                    |null|null|  2|1170|
|  1|2013-11-11 09:51:...| 63|&lt;p&gt;As other...|  3|2013-11-11 09:51:...|null|                    |                    |null|null|  2|1171|
|  1|2013-11-12 23:57:...| 63|&lt;p&gt;The expr...|  1|2013-11-11 10:09:...|null|                    |                    |null|null|  2|1172|
|  9|2014-01-05 11:13:...| 63|&lt;p&gt;When I w...|  5|2013-11-11 10:28:...| 122|Is &quot;scancell...|&lt;usage&gt;&lt;...|   3|1181|  1|1173|
|  0|2013-11-11 10:58:...| 18|&lt;p&gt;Wow, wha...|  5|2013-11-11 10:58:...|null|                    |                    |null|null|  2|1174|
+---+--------------------+---+--------------------+---+--------------------+----+--------------------+--------------------+----+----+---+----+

italianPosts.csv文件是 Stack Exchange在2009年发布的，包含社区中的所有问题/答案的匿名数据。它包含意大利语的问题（和答案）与以下字段（用波浪符号“~”分隔）

commentCount：与问题/答案相关的评论数

lastActivityDate：上次修改的日期和时间

ownerUserId：所有者的用户ID

body：问题/答案的文本内容

score：基于upvote和downvotes的总分

creationDate：创建的日期和时间

viewCount：浏览计数

title：问题的标题

tags：已标记问题的标签集

answerCount：相关答案的数量

acceptedAnswerId：如果问题包含其接受的答案的ID

postTypeId：帖子的类型：1是问题，2为答案

id：帖子的类型：1是问题，2为答案

以上show方法如果没有传入参数，则会默认返回20条。可以在RDD转化为DataFrame时指定列名如下，

val itPostsDF = itPostsRDD.toDF("commentCount", "lastActivityDate", "ownerUserId", "body", "score", "creationDate", "viewCount", "title", "tags", "answerCount", "acceptedAnswerId", "postTypeId", "id")

itPostsDF.printSchema

root
 |-- commentCount: string (nullable = true)
 |-- lastActivityDate: string (nullable = true)
 |-- ownerUserId: string (nullable = true)
 |-- body: string (nullable = true)
 |-- score: string (nullable = true)
 |-- creationDate: string (nullable = true)
 |-- viewCount: string (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answerCount: string (nullable = true)
 |-- acceptedAnswerId: string (nullable = true)
 |-- postTypeId: string (nullable = true)
 |-- id: string (nullable = true)

printSchema方法显示DataFrame有关列的信息。这里有几个问题，比如说ID应该是long类型，date列应为时间戳。这种不足会在后面两个方法得以解决。

（2）case class 将RDD转换为DataFrame

将RDD转换为Data的第二种方法是将RDD中的每一行映射到案例类，然后再使用toDF方法。首先需要定义将保存数据的case class如下

import java.sql.Timestamp
case class Post (
  commentCount:Option[Int], 
  lastActivityDate:Option[java.sql.Timestamp],
  ownerUserId:Option[Long], 
  body:String, score:Option[Int], 
  creationDate:Option[java.sql.Timestamp],
  viewCount:Option[Int], 
  title:String, 
  tags:String, 
  answerCount:Option[Int],
  acceptedAnswerId:Option[Long], 
  postTypeId:Option[Long], 
  id:Long
)

接着定义一个隐式类

object StringImplicits {
   implicit class StringImprovements(val s: String) {
      import scala.util.control.Exception.catching
      def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
      def toLongSafe = catching(classOf[NumberFormatException]) opt s.toLong
      def toTimestampSafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
   }
}

这个隐式类的作用是将字符串安全地转换成整形、长整形和时间戳，如果转不了则返回None，这里值得一提的是 implicit发挥的作用，直观一点解释，就是StringImprovements给String类型数据新增了toIntSafe，toLongSafe，和toTimestampSafe方法，依据是被implicit修饰的class的属性类型。当一个字符串调用里面的方法（如：toIntSafe）时，opt方法可用于将指定函数（如s.toInt）的结果映射到Option对象，如果发生异常，则返回None。

定义从字符串到Post的转换方法并调用如下：

def stringToPost(row:String):Post = {
  val r = row.split("~")
  Post(r(0).toIntSafe,
    r(1).toTimestampSafe,
    r(2).toLongSafe,
    r(3),
    r(4).toIntSafe,
    r(5).toTimestampSafe,
    r(6).toIntSafe,
    r(7),
    r(8),
    r(9).toIntSafe,
    r(10).toLongSafe,
    r(11).toLongSafe,
    r(12).toLong)
}
val itPostsDFCase = itPostsRows.map(x => stringToPost(x)).toDF()
itPostsDFCase.printSchema

（3）使用指定模式将RDD转换为DataFrame

将RDD转换为DataFrame的最后一种方法是使用SparkSession的createDataFrame方法。该方法使用包含Row和StructType类型的对象的RDD。首先是构建StructType

import org.apache.spark.sql.types._
val postSchema = StructType(Seq(
  StructField("commentCount", IntegerType, true),
  StructField("lastActivityDate", TimestampType, true),
  StructField("ownerUserId", LongType, true),
  StructField("body", StringType, true),
  StructField("score", IntegerType, true),
  StructField("creationDate", TimestampType, true),
  StructField("viewCount", IntegerType, true),
  StructField("title", StringType, true),
  StructField("tags", StringType, true),
  StructField("answerCount", IntegerType, true),
  StructField("acceptedAnswerId", LongType, true),
  StructField("postTypeId", LongType, true),
  StructField("id", LongType, false))
  )

在Spark SQL中，StructType表示一个模式，它包含一个或多个StructType来描述列。DataFrame支持主要关系数据库支持的常规类型：字符串、整型、短整型、浮点数、双精度、字节、日期、时间戳和二进制值（BLOB）同时也支持几种复杂类型：（1）数组包含相同类型的多个值。（2）键是基本数据类型的映射包含键值对。（3）结构体包含嵌套列定义。

接着构建Row类型的RDD，然后转换为DataFrame

import org.apache.spark.sql.Row
def stringToRow(row:String):Row = {
  val r = row.split("~")
  Row(r(0).toIntSafe.getOrElse(null),
    r(1).toTimestampSafe.getOrElse(null),
    r(2).toLongSafe.getOrElse(null),
    r(3),
    r(4).toIntSafe.getOrElse(null),
    r(5).toTimestampSafe.getOrElse(null),
    r(6).toIntSafe.getOrElse(null),
    r(7),
    r(8),
    r(9).toIntSafe.getOrElse(null),
    r(10).toLongSafe.getOrElse(null),
    r(11).toLongSafe.getOrElse(null),
    r(12).toLong)
}
val rowRDD = itPostsRows.map(row => stringToRow(row))
val itPostsDFStruct = spark.createDataFrame(rowRDD, postSchema)

最后可以使用columns方法返回列名列表，dtypes方法返回元组列表

itPostsDFStruct.columns
itPostsDFStruct.dtypes

3、DataFrame API基础

DataFrame使用DSL来操作数据，DSL跟SQL函数类似，是使用Spark SQL的基础。值得一提的是，Spark ML（Spark的机器学习库）也依赖于DataFrame。

（1）选择数据

val postsDf = itPostsDFStruct
val postsIdBody = postsDf.select("id", "body")
val postsIdBody = postsDf.select(postsDf.col("id"), postsDf.col("body"))
val postsIdBody = postsDf.select(Symbol("id"), Symbol("body"))
val postsIdBody = postsDf.select('id, 'body)
val postsIdBody = postsDf.select($"id", $"body")

代码中出现的单引号（'id和'body）表达式其实是Symbol对象，另一个隐式方法（称为$）将字符串转换为ColumnName对象。

（2）过滤数据

可以使用where和filter函数过滤DataFrame数据。

postsIdBody.filter('body contains "Italiano").count()
val noAnswer = postsDf.filter(('postTypeId === 1) and ('acceptedAnswerId isNull))
val firstTenQs = postsDf.filter('postTypeId === 1).limit(10)

（3）添加和重命名列

val firstTenQsRn = firstTenQs.withColumnRenamed("ownerUserId", "owner")
postsDf.filter('postTypeId === 1).withColumn("ratio", 'viewCount / 'score).where('ratio < 35).show()

withColumn为DataFrame添加了一个列，这个列的值计算方法是用已有的列计算得到的，计算表达式是：'viewCount/'score

4、使用SQL函数执行数据计算

Spark SQL支持大量的SQL函数，这些函数可通过DataFrame API和SQL表达式获得。

（1）使用内置标量和聚合函数

Spark提供了很多标量函数来执行数学计算、字符串、日期时间等操作。例如，如果想从数据中知道活跃时长最长的那条问题，可以使用下面的代码实现

postsDf.filter('postTypeId === 1).withColumn("activePeriod", datediff('lastActivityDate, 'creationDate)).orderBy('activePeriod desc).head.getString(3).replace("&lt;","<").replace("&gt;",">")

再例如，如果想找出所有问题的平均得分、最好得分以及问题总数，可以使用下面的代码

postsDf.select(avg('score), max('score), count('score)).show

（2）窗口函数

窗口函数的代码规则一般需要符合：

分析函数.over(Window.partitionBy(子句)[.orderBy(子句)])

比如需要显示所有用户问题的最高分数，并且显示每一个问题与这个最高分之间的差，可以使用以下代码：

postsDf.filter('postTypeId === 1).select('ownerUserId, 'acceptedAnswerId, 'score, max('score).over(Window.partitionBy('ownerUserId)) as "maxPerUser").withColumn("toMax", 'maxPerUser - 'score).show(10)

稍微解释一下这行代码，这条语句的窗口函数Window.partitionBy('ownerUserId)是在'score.max('score).over()里面的，意思是先根据ownerUserId分组，然后在每个分组里面计算出一个最大值，这个最大值的列名为maxPerUser。

（3）用户定义的函数

很多情况下，Spark SQL可能不会再特定时刻提供需要的特定函数。UDF可以扩展Spark SQL的内置函数。

5、使用缺失值

数据在使用前，可能会包含null或空值，又或者是一些与之等效的字符串常量（如“N/A”或“unknow”），这在一般的程序代码中可能会造成一些没必要的困扰。但是Spark SQL有能力读取这些“空值”，比如DataFrameNaFunctions类可以通过DataFrames na字段访问。

如果要从postsDF中删除包含了null或NAN值的数据，可以使用如下代码：

val cleanPosts = postsDf.na.drop()

上面这条语句会检测所有列，只要这条数据某一个列“空值”，就被删除，跟drop("any")的作用相同，如果你想所有列都为“空”的数据才被删除，可以drop("all")

如果你不想将具有“空值”的数据删除，而是希望将空值替换成某个值，比如"0"，可以使用下面这条语句

postsDf.na.fill(Map("viewCount" -> 0))

最后是replace函数可以替换特定列中的某些值为不同的值，例如，假设数据导出有问题，需要将帖子ID为1177的数据更改ID为3000，可以用下面的方法

val postsDfCorrected = postsDf.na.replace(Array("id", "acceptedAnswerId"), Map(1177 -> 3000))

6、将DataFrame转换为RDD

如果需要将DataFrame转换成RDD，这个过程也不复杂，可以使用如下代码：

val postsRdd = postsDf.rdd

但是生成的RDD是包含org.apache.apark.sql.Row类型的元素，Row有各种get*函数，用于按照列索引（getString(index)、getInt(index)、getMap(index)等）访问列值。但是一般不会将以转成DataFrame的数据再转回RDD，但是为了加深对API的理解，可以思考下面这段代码：

val postsMapped = postsDf.rdd.map(row => Row.fromSeq(
  row.toSeq.updated(3, row.getString(3).replace("&lt;","<").replace("&gt;",">")).
    updated(8, row.getString(8).replace("&lt;","<").replace("&gt;",">"))))
val postsDfNew = spark.createDataFrame(postsMapped, postsDf.schema)

7、分组数据

跟SQL的groupBy类似

postsDfNew.groupBy('ownerUserId, 'tags, 'postTypeId).count.orderBy('ownerUserId desc).show(10)

进一步可以结合agg函数来形成几个聚合，如下

postsDfNew.groupBy('ownerUserId).agg(max('lastActivityDate), max('score)).show(10)
postsDfNew.groupBy('ownerUserId).agg(Map("lastActivityDate" -> "max", "score" -> "max")).show(10)

8、连接数据

如果需要在两个DataFrame中连接相关数据，可以使用连接操作。跟SQL类似，Spark SQL包含了inner、outer、left_outer、right_outer和leftsemi等连接类型。

val itVotesRaw = sc.textFile("first-edition/ch05/italianVotes.csv").map(x => x.split("~"))
val itVotesRows = itVotesRaw.map(row => Row(row(0).toLong, row(1).toLong, row(2).toInt, Timestamp.valueOf(row(3))))
val votesSchema = StructType(Seq(
  StructField("id", LongType, false),
  StructField("postId", LongType, false),
  StructField("voteTypeId", IntegerType, false),
  StructField("creationDate", TimestampType, false))
  )
val votesDf = spark.createDataFrame(itVotesRows, votesSchema)
val postsVotes = postsDf.join(votesDf, postsDf("id") === votesDf("postId"))
val postsVotesOuter = postsDf.join(votesDf, postsDf("id") === votesDf("postId"), "outer")