I. Preparation
1. Dependency
Choose the artifact that matches your Scala version:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
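If you build with sbt instead of Maven, the equivalent dependency is shown below (a sketch; the %% operator automatically appends your project's Scala binary version, so on Scala 2.12 it resolves to the same spark-sql_2.12 artifact as above):

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1"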
2. Structured data
test.data
(Data description: the number of occurrences of each word at each point in time)
id | timestamp | word | count
{"id":1, "timestamp": 1620262800000, "word": "java", "count": 8}
{"id":2, "timestamp": 1620262800000, "word": "scala", "count": 10}
{"id":3, "timestamp": 1620262800001, "word": "c++", "count": 15}
{"id":4, "timestamp": 1620262800001, "word": "java", "count": 20}
{"id":5, "timestamp": 1620262800002, "word": "python", "count": 3}
{"id":6, "timestamp": 1620262800002, "word": "php", "count": 12}
{"id":7, "timestamp": 1620262800003, "word": "python", "count": 11}
{"id":8, "timestamp": 1620262800003, "word": "java", "count": 9}
{"id":9, "timestamp": 1620262800003, "word": "scala", "count": 10}
{"id":10, "timestamp": 1620262800004, "word": "go", "count": 16}
II. Data Processing
1. Print the schema and data
package ace.gjh.sql

import org.apache.spark.sql.SparkSession

/**
 * Spark SQL application.
 *
 * @author ACE_GJH
 * @date 2021/5/6
 */
object SqlAnalysis {
  def main(args: Array[String]): Unit = {
    // Create the session
    val session = SparkSession.builder()
      .appName("sql-analysis")
      .master("local[*]")
      .getOrCreate()
    // Set the log level
    session.sparkContext.setLogLevel("ERROR")
    // Read the JSON data from a file
    val dataFrame = session.read.json("file:///E:/spark/test.data")
    // Print the schema
    dataFrame.printSchema()
    // Print the table data
    dataFrame.show()
    // Release resources
    session.close()
  }
}
Output:
root
|-- count: long (nullable = true)
|-- id: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- word: string (nullable = true)
+-----+---+-------------+------+
|count| id| timestamp| word|
+-----+---+-------------+------+
| 8| 1|1620262800000| java|
| 10| 2|1620262800000| scala|
| 15| 3|1620262800001| c++|
| 20| 4|1620262800001| java|
| 3| 5|1620262800002|python|
| 12| 6|1620262800002| php|
| 11| 7|1620262800003|python|
| 9| 8|1620262800003| java|
| 10| 9|1620262800003| scala|
| 16| 10|1620262800004| go|
+-----+---+-------------+------+
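Note that JSON schema inference orders the columns alphabetically, which is why count appears first. To keep the original column order (and skip the extra inference pass over the file), you can supply an explicit schema; a minimal sketch, assuming the field types shown in the data above:

import org.apache.spark.sql.types._

// Explicit schema matching test.data; the declared column order is preserved
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("timestamp", LongType),
  StructField("word", StringType),
  StructField("count", LongType)
))
val typedFrame = session.read.schema(schema).json("file:///E:/spark/test.data")
typedFrame.printSchema() // columns appear as id, timestamp, word, count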
2. Count distinct words and the total word count per timestamp
// Count the number of distinct words and the total word count at each timestamp
dataFrame
  .groupBy("timestamp")
  .agg(("word", "count"), ("count", "sum"))
  .withColumnRenamed("count(word)", "typeCount")
  .withColumnRenamed("sum(count)", "total")
  // asc() comes from the SQL functions import: import org.apache.spark.sql.functions._
  .orderBy(asc("timestamp"))
  .show()
Output:
+-------------+---------+-----+
| timestamp|typeCount|total|
+-------------+---------+-----+
|1620262800000| 2| 18|
|1620262800001| 2| 35|
|1620262800002| 2| 15|
|1620262800003| 3| 30|
|1620262800004| 1| 16|
+-------------+---------+-----+
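The same aggregation can be written with typed column expressions from org.apache.spark.sql.functions._, which makes the withColumnRenamed calls unnecessary; a sketch:

// Equivalent aggregation using column expressions with aliases
dataFrame
  .groupBy("timestamp")
  .agg(count("word").as("typeCount"), sum("count").as("total"))
  .orderBy(asc("timestamp"))
  .show()

Strictly speaking, count("word") counts rows rather than distinct words; it matches the output above only because each word appears at most once per timestamp in this data set. Use countDistinct("word") if a word can repeat within a timestamp.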
3. Pivot word totals across all timestamps
This uses the pivot function.
// Pivot the occurrence counts of each word across all timestamps
dataFrame
  .groupBy("timestamp")
  .pivot("word")
  .sum("count")
  .orderBy(asc("timestamp"))
  .show()
Output:
+-------------+----+----+----+----+------+-----+
| timestamp| c++| go|java| php|python|scala|
+-------------+----+----+----+----+------+-----+
|1620262800000|null|null| 8|null| null| 10|
|1620262800001| 15|null| 20|null| null| null|
|1620262800002|null|null|null| 12| 3| null|
|1620262800003|null|null| 9|null| 11| 10|
|1620262800004|null| 16|null|null| null| null|
+-------------+----+----+----+----+------+-----+
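When the pivot values are known ahead of time, you can pass them explicitly so Spark does not need an extra pass over the data to collect the distinct words; a sketch with the six words from this data set hard-coded:

// Pivot with an explicit value list; produces the same table as above
dataFrame
  .groupBy("timestamp")
  .pivot("word", Seq("c++", "go", "java", "php", "python", "scala"))
  .sum("count")
  .orderBy(asc("timestamp"))
  .show()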