Spark SQL Application
Objectives
Gain a deeper understanding of, and hands-on practice with, common DataFrame operations and programming methods; learn how to solve practical problems with Spark SQL.
Requirements
- Set up a Maven-based Scala and Spark SQL development environment;
- Master DataFrame query methods.
Contents
- Upload the Online Retail.csv file from Lab 2 to HDFS.
- Configure the Spark SQL environment in Maven by adding to pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.1</version>
</dependency>
If your installed Spark is not 3.0.1, look up the spark-sql artifact version that matches your installation.
- Write code that reads the csv file into a DataFrame with the following Schema:
Table 1. Schema
Column Name    Type           Nullable
InvoiceNo      StringType     false
StockCode      StringType     false
Description    StringType     false
Quantity       IntegerType    false
InvoiceDate    DateType       false
UnitPrice      DoubleType     false
CustomerID     StringType     false
Country        StringType     false
import java.text.SimpleDateFormat

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// The SparkSession creates its own SparkContext, so no separate
// SparkConf/SparkContext is needed.
val spark = SparkSession.builder().master("local[*]").appName("Spark SQL").getOrCreate()
val df = spark.read.format("csv")    // Spark's built-in csv source
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("hdfs://<host-or-ip>:<port>/<path-to-file>")
val df_null = df.na.drop()           // drop rows containing null values
// Triple-quoted strings keep the backslashes in the regexes literal.
val intRegx = """^\d+$""".r
val timeRegx = """^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}$""".r
val doubleRegx = """^\d+(\.\d+)?$""".r
val timeFormat = new SimpleDateFormat("M/d/yyyy H:m")
val rdd = df_null.rdd
  .map(x => (x.getString(0), x.getString(1), x.getString(2), x.getString(3),
             x.getString(4), x.getString(5), x.getString(6), x.getString(7)))
  .filter(x => intRegx.findFirstIn(x._4).isDefined &&     // Quantity must be an integer
               timeRegx.findFirstIn(x._5).isDefined &&    // InvoiceDate must match the date pattern
               doubleRegx.findFirstIn(x._6).isDefined)    // UnitPrice must be a valid number
  .map(x => (x._1, x._2, x._3, x._4.toInt,
             new java.sql.Date(timeFormat.parse(x._5).getTime),
             x._6.toDouble, x._7, x._8))
val schema = StructType(Array(
  StructField("InvoiceNo", StringType, false),
  StructField("StockCode", StringType, false),
  StructField("Description", StringType, false),
  StructField("Quantity", IntegerType, false),
  StructField("InvoiceDate", DateType, false),
  StructField("UnitPrice", DoubleType, false),
  StructField("CustomerID", StringType, false),
  StructField("Country", StringType, false)
))
val df_final = spark.createDataFrame(rdd.map(x => Row.fromTuple(x)), schema)
df_final.write.json("hdfs://<host-or-ip>:<port>/<output-path>")
(1) Requirement: no field may be null. Rows that contain null values, or values that cannot be converted to the target type, are treated as invalid and dropped.
val rdd = df_null.rdd
  .map(x => (x.getString(0), x.getString(1), x.getString(2), x.getString(3),
             x.getString(4), x.getString(5), x.getString(6), x.getString(7)))
  .filter(x => intRegx.findFirstIn(x._4).isDefined &&     // Quantity must be an integer
               timeRegx.findFirstIn(x._5).isDefined &&    // InvoiceDate must match the date pattern
               doubleRegx.findFirstIn(x._6).isDefined)    // UnitPrice must be a valid number
  .map(x => (x._1, x._2, x._3, x._4.toInt,
             new java.sql.Date(timeFormat.parse(x._5).getTime),
             x._6.toDouble, x._7, x._8))
(2) Reading the file: either use Spark SQL's csv reader to load the columns directly into a DataFrame and then take its RDD; or read the csv as a plain text file into an RDD and split each line on commas. In the latter case, note that a field may itself contain commas and be wrapped in double quotes; a regular expression can be used to identify the individual fields.
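The second approach can be sketched with a small regex-based splitter. This is only an illustrative sketch, not part of the lab code (the function name `splitCsvLine` and the sample line are assumptions): it splits on commas that lie outside double quotes, using the common lookahead trick of requiring an even number of quote characters between the comma and the end of the line, then strips the enclosing quotes. It does not handle escaped quotes inside fields.

```scala
// Illustrative sketch: split one csv line on commas that are outside
// double quotes, then strip the enclosing quotes from each field.
// A comma is a separator only if an even number of quote characters
// follows it up to the end of the line.
def splitCsvLine(line: String): Array[String] =
  line.split(""",(?=(?:[^"]*"[^"]*")*[^"]*$)""", -1)
    .map(_.stripPrefix("\"").stripSuffix("\""))

// splitCsvLine("536365,\"WHITE HANGING HEART, T-LIGHT HOLDER\",2.55")
// yields Array("536365", "WHITE HANGING HEART, T-LIGHT HOLDER", "2.55")
```

The `-1` limit keeps trailing empty fields, which matters when the last columns of a row are blank.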
(3) Data conversion: use the RDD to convert the data types and drop empty or invalid rows, then create a Schema and convert the RDD into a DataFrame according to it.
val df_final = spark.createDataFrame(rdd.map(x => Row.fromTuple(x)), schema)
- Save the converted DataFrame as JSON from within the program.
- Start spark-shell.
(1) Load the file saved in step 4 into a DataFrame
val df = spark.read.json("hdfs://<host-or-ip>:<port>/<file>")
(2) Run the following queries on the DataFrame, each implemented both with transformation operators and with a SQL statement, and print the results with show
① Query all products whose unit price is below 0.2
First register a temporary view for the SQL queries (createTempView returns Unit, so there is no value to bind):
df.createTempView("data")
SQL:
spark.sql("select Description, UnitPrice from data where UnitPrice<0.2").show()
Operators:
df.selectExpr("Description", "UnitPrice").where("UnitPrice<0.2").show() // unit price below 0.2
② Query the customers of invoices 551845 to 551850
SQL:
spark.sql("select CustomerID, InvoiceNo from data where InvoiceNo>=551845 and InvoiceNo<=551850").show()
Operators:
df.selectExpr("CustomerID", "InvoiceNo").where("InvoiceNo>=551845 and InvoiceNo<=551850").show() // customers of invoices 551845 to 551850
③ Count how many distinct invoices the data contains
SQL:
spark.sql("select count(distinct InvoiceNo) from data").show()
Operators:
df.select(countDistinct("InvoiceNo")).show()
④ Find the largest invoice amount, the largest number of items on a single invoice, and the largest number of distinct products on a single invoice
Largest invoice amount (sum the line totals per invoice, sort descending; the first row is the maximum)
SQL:
spark.sql("select InvoiceNo, sum(Quantity*UnitPrice) as total from data group by InvoiceNo order by total desc").show()
Operators:
df.groupBy("InvoiceNo").agg(expr("sum(Quantity*UnitPrice) as total")).orderBy(desc("total")).show()
Largest number of items on a single invoice (sum Quantity per invoice)
SQL:
spark.sql("select InvoiceNo, sum(Quantity) as totalQty from data group by InvoiceNo order by totalQty desc").show()
Operators:
df.groupBy("InvoiceNo").agg(sum("Quantity") as "totalQty").orderBy(desc("totalQty")).show()
Largest number of distinct products on a single invoice (count distinct StockCode per invoice)
SQL:
spark.sql("select InvoiceNo, count(distinct StockCode) as kinds from data group by InvoiceNo order by kinds desc").show()
Operators:
df.groupBy("InvoiceNo").agg(countDistinct("StockCode") as "kinds").orderBy(desc("kinds")).show()