Spark SQL Application
Objectives
Gain a deeper understanding of, and hands-on practice with, common DataFrame operations and programming methods; learn how to solve practical problems with Spark SQL.
Requirements
- Set up a Maven-based Scala and Spark SQL development environment;
- Master DataFrame query methods.
Contents
- Upload the Online Retail.csv file from Lab 2 to HDFS.
- Configure the Spark SQL environment in Maven by adding to pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.1</version>
</dependency>
If your installed Spark is not 3.0.1, look up the spark-sql artifact version that matches your installation.
- Write code that reads the csv file into a DataFrame with the following Schema:
Table 1. Schema
Column Name    Type           Nullable
InvoiceNo      StringType     false
StockCode      StringType     false
Description    StringType     false
Quantity       IntegerType    false
InvoiceDate    DateType       false
UnitPrice      DoubleType     false
CustomerID     StringType     false
Country        StringType     false
import java.text.SimpleDateFormat

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// The SparkSession creates its own SparkContext, so no separate
// SparkConf/SparkContext is needed.
val spark = SparkSession.builder().master("local[*]").appName("Spark SQL").getOrCreate()
val df = spark.read.format("csv")    // Spark's built-in csv source
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("hdfs://<host-or-ip>:<port>/<path-to-file>")
val df_null = df.na.drop()           // drop rows containing null values
// Triple-quoted strings keep the backslashes in the regexes literal.
val intRegx = """^\d+$""".r
val timeRegx = """^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}$""".r
val doubleRegx = """^\d+(\.\d+)?$""".r
val timeFormat = new SimpleDateFormat("M/d/yyyy H:m")
val rdd = df_null.rdd
  .map(x => (x.getString(0), x.getString(1), x.getString(2), x.getString(3),
             x.getString(4), x.getString(5), x.getString(6), x.getString(7)))
  .filter(x => intRegx.findFirstIn(x._4).isDefined &&     // Quantity must be an integer
               timeRegx.findFirstIn(x._5).isDefined &&    // InvoiceDate must match the date pattern
               doubleRegx.findFirstIn(x._6).isDefined)    // UnitPrice must be a valid number
  .map(x => (x._1, x._2, x._3, x._4.toInt,
             new java.sql.Date(timeFormat.parse(x._5).getTime),
             x._6.toDouble, x._7, x._8))
val schema = StructType(Array(
  StructField("InvoiceNo", StringType, false),
  StructField("StockCode", StringType, false),
  StructField("Description", StringType, false),
  StructField("Quantity", IntegerType, false),
  StructField("InvoiceDate", DateType, false),
  StructField("UnitPrice", DoubleType, false),
  StructField("CustomerID", StringType, false),
  StructField("Country", StringType, false)
))
val df_final = spark.createDataFrame(rdd.map(x => Row.fromTuple(x)), schema)
df_final.write.json("hdfs://<host-or-ip>:<port>/<output-path>")
(1) Requirement: no field may be null. Rows that contain null values, or values that cannot be converted to the target type, are treated as invalid and dropped.
val rdd = df_null.rdd
  .map(x => (x.getString(0), x.getString(1), x.getString(2), x.getString(3),
             x.getString(4), x.getString(5), x.getString(6), x.getString(7)))
  .filter(x => intRegx.findFirstIn(x._4).isDefined &&     // Quantity must be an integer
               timeRegx.findFirstIn(x._5).isDefined &&    // InvoiceDate must match the date pattern
               doubleRegx.findFirstIn(x._6).isDefined)    // UnitPrice must be a valid number
  .map(x => (x._1, x._2, x._3, x._4.toInt,
             new java.sql.Date(timeFormat.parse(x._5).getTime),
             x._6.toDouble, x._7, x._8))
(2) Reading the file: either use Spark SQL's csv reader to load the columns directly into a DataFrame and then take its RDD; or read the csv as a plain text file into an RDD and split each line on commas. In the latter case, note that a field may itself contain commas and be wrapped in double quotes; a regular expression can be used to identify the individual fields.
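The second approach can be sketched with a small regex-based splitter. This is only an illustrative sketch, not part of the lab code (the function name `splitCsvLine` and the sample line are assumptions): it splits on commas that lie outside double quotes, using the common lookahead trick of requiring an even number of quote characters between the comma and the end of the line, then strips the enclosing quotes. It does not handle escaped quotes inside fields.

```scala
// Illustrative sketch: split one csv line on commas that are outside
// double quotes, then strip the enclosing quotes from each field.
// A comma is a separator only if an even number of quote characters
// follows it up to the end of the line.
def splitCsvLine(line: String): Array[String] =
  line.split(""",(?=(?:[^"]*"[^"]*")*[^"]*$)""", -1)
    .map(_.stripPrefix("\"").stripSuffix("\""))

// splitCsvLine("536365,\"WHITE HANGING HEART, T-LIGHT HOLDER\",2.55")
// yields Array("536365", "WHITE HANGING HEART, T-LIGHT HOLDER", "2.55")
```

The `-1` limit keeps trailing empty fields, which matters when the last columns of a row are blank.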
(3) Data conversion: use the RDD to convert the data types and drop empty or invalid rows, then create a Schema and convert the RDD into a DataFrame according to it.
val df_final = spark.createDataFrame(rdd.map(x => Row.fromTuple(x)), schema)
- Save the converted DataFrame as JSON from within the program.
- Start spark-shell.
(1) Load the file saved in step 4 into a DataFrame
val df = spark.read.json("hdfs://<host-or-ip>:<port>/<file>")
(2) Run the following queries on the DataFrame, each implemented both with transformation operators and with a SQL statement, and print the results with show
① Query all products whose unit price is below 0.2
First register a temporary view for the SQL queries (createTempView returns Unit, so there is no value to bind):
df.createTempView("data")
SQL:
spark.sql("select Description, UnitPrice from data where UnitPrice<0.2").show()
Operators:
df.selectExpr("Description", "UnitPrice").where("UnitPrice<0.2").show() // unit price below 0.2
② Query the customers of invoices 551845 to 551850
SQL:
spark.sql("select CustomerID, InvoiceNo from data where InvoiceNo>=551845 and InvoiceNo<=551850").show()
Operators:
df.selectExpr("CustomerID", "InvoiceNo").where("InvoiceNo>=551845 and InvoiceNo<=551850").show() // customers of invoices 551845 to 551850
③ Count how many distinct invoices the data contains
SQL:
spark.sql("select count(distinct InvoiceNo) from data").show()
Operators:
df.select(countDistinct("InvoiceNo")).show()
④ Find the largest invoice amount, the largest number of items on a single invoice, and the largest number of distinct products on a single invoice
Largest invoice amount (sum the line totals per invoice, sort descending; the first row is the maximum)
SQL:
spark.sql("select InvoiceNo, sum(Quantity*UnitPrice) as total from data group by InvoiceNo order by total desc").show()
Operators:
df.groupBy("InvoiceNo").agg(expr("sum(Quantity*UnitPrice) as total")).orderBy(desc("total")).show()
Largest number of items on a single invoice (sum Quantity per invoice)
SQL:
spark.sql("select InvoiceNo, sum(Quantity) as totalQty from data group by InvoiceNo order by totalQty desc").show()
Operators:
df.groupBy("InvoiceNo").agg(sum("Quantity") as "totalQty").orderBy(desc("totalQty")).show()
Largest number of distinct products on a single invoice (count distinct StockCode per invoice)
SQL:
spark.sql("select InvoiceNo, count(distinct StockCode) as kinds from data group by InvoiceNo order by kinds desc").show()
Operators:
df.groupBy("InvoiceNo").agg(countDistinct("StockCode") as "kinds").orderBy(desc("kinds")).show()