1. Environment Preparation
- Accessing Hive from code requires the spark-hive dependency and the MySQL JDBC driver used by the metastore.
- Hive's configuration file (hive-site.xml) must be on the classpath, e.g. under src/main/resources; otherwise Spark falls back to its own built-in embedded metastore.
- spark-shell enables Hive support by default; in local code it is not enabled, so it must be turned on manually with .enableHiveSupport().
pom dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>
2. Writing the Code
import org.apache.spark.sql.SparkSession

object SparkSql2Hive {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark: SparkSession = SparkSession.builder().appName("SparkSql2Hive")
      .master("local[2]")
      .enableHiveSupport() // enable Hive support
      .getOrCreate()

    spark.sql("show tables").show()

    spark.close()
  }
}
3. Computing Metrics on the userinfo Table (building on the code in Section 2)
spark.sql("use default")
// 都是懒执行的只有在调用行动算子后才会执行
spark.sql("select * from userinfo").show()
The data looks like this:
+---+---------+----+
|uid|       dt| num|
+---+---------+----+
|u01|2017/1/21| 5|
|u02|2017/1/23| 6|
|u03|2017/1/22| 8|
|u04|2017/1/20| 3|
|u01|2017/1/23| 6|
|u01|2017/2/21| 8|
|U02|2017/1/23| 6|
|U01|2017/2/22| 4|
| | null|null|
+---+---------+----+
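For reference, here is a minimal sketch of a table definition matching this data; the column names uid, dt and num come from the query below, while the types and the field delimiter are assumptions:

// Hypothetical DDL for the userinfo table (types and delimiter are assumed)
spark.sql(
  """
    |create table if not exists default.userinfo(
    |  uid string,
    |  dt  string,
    |  num int
    |)
    |row format delimited fields terminated by '\t'
  """.stripMargin)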
Requirements:
① Filter out the empty/null rows.
② For each user, compute the running total of the recharge amount up to the current date.
Approach: use a window function for the cumulative sum: partition by user id, order by date, and set the window frame from unbounded preceding to the current row.
③ Implementation:
// Clean the data: lowercase the user id, normalize the date separator, drop null rows
spark.sql(
  """
    |select
    |  lower(uid) as uid,
    |  regexp_replace(dt, '/', '-') as dayTime,
    |  num
    |from userinfo
    |where dt is not null
  """.stripMargin
).createOrReplaceTempView("t1")
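Optionally, the intermediate view can be inspected to verify the cleaning step:

spark.sql("select * from t1").show()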
// Running total per user: partition by uid, order by date,
// frame from the start of the partition up to the current row
spark.sql(
  """
    |select
    |  uid,
    |  dayTime,
    |  num,
    |  sum(num) over(partition by uid order by dayTime asc rows between unbounded preceding and current row)
    |from t1
  """.stripMargin).show()
Result of the metric calculation:
+---+---------+---+------------------------------------------------------------------------------------------------------------------+
|uid| dayTime|num|sum(num) OVER (PARTITION BY uid ORDER BY dayTime ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
+---+---------+---+------------------------------------------------------------------------------------------------------------------+
|u01|2017-1-21| 5| 5|
|u01|2017-1-23| 6| 11|
|u01|2017-2-21| 8| 19|
|u01|2017-2-22| 4| 23|
|u02|2017-1-23| 6| 6|
|u02|2017-1-23| 6| 12|
|u04|2017-1-20| 3| 3|
|u03|2017-1-22| 8| 8|
+---+---------+---+------------------------------------------------------------------------------------------------------------------+
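The same running total can also be written with the DataFrame API; the sketch below assumes the t1 view created above, and the output column name cumulative_num is arbitrary:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Same window as in the SQL: partition by uid, order by dayTime,
// accumulate from the first row of the partition up to the current row
val w = Window.partitionBy("uid")
  .orderBy("dayTime")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

spark.table("t1")
  .withColumn("cumulative_num", sum("num").over(w))
  .show()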
4. Summary
When creating the SparkSession, call .enableHiveSupport() to enable Hive support.
Note:
Creating a Hive database from code this way can fail unless the warehouse location is set explicitly with an extra config:
.config("spark.sql.warehouse.dir", "hdfs://hadoop102:9000/user/hive/warehouse")
The full statement for creating the SparkSession then becomes:
val spark: SparkSession = SparkSession.builder().appName("SparkSql2Hive")
  .master("local[2]")
  .enableHiveSupport() // enable Hive support
  .config("spark.sql.warehouse.dir", "hdfs://hadoop102:9000/user/hive/warehouse")
  .getOrCreate()
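With the warehouse directory configured, databases created from code land under the HDFS warehouse path; a minimal sketch (the database name test_db is only an example):

// Objects created now are stored under the configured warehouse path on HDFS
spark.sql("create database if not exists test_db")
spark.sql("show databases").show()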