1. Data Preparation
1.1 Copy hive-site.xml into the spark/conf directory:
Analysis: the error message makes it clear that Spark has no way to find the location of Hive's metadata, so it cannot instantiate the corresponding metastore client.
The fix is to copy hive-site.xml into the spark/conf directory.
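For reference, the setting Spark needs most from hive-site.xml is the metastore address. A minimal sketch of the relevant property, assuming a metastore service on a host named master (hostname and port are placeholders for your environment):
<configuration>
    <property>
        <!-- tells Spark where the Hive metastore service is running; placeholder host/port -->
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
    </property>
</configuration>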
1.2 If the test code does not call sc.stop(), the following error appears:
ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
Adding sc.stop() as the last line of the code resolves the problem.
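A minimal skeleton of the fix, with the stop call as the last statement of main:
import org.apache.spark.{SparkConf, SparkContext}
object StopExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StopExample"))
    // ... job logic ...
    sc.stop() // shuts down the listener bus cleanly, avoiding the EventLoggingListener error
  }
}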
1.3 Create the database spark_sql_hive in the Hive warehouse in advance:
hive (default)> create database spark_sql_hive;
OK
Time taken: 5.502 seconds
Prepare two text files, people.txt and peoplescores.txt (fields separated by a tab, matching the '\t' delimiter in the table definitions below).
Contents of people.txt (person table: name, age):
Michael 29
Andy 30
Justin 19
Contents of peoplescores.txt (score table: name, score):
Michael 99
Andy 97
Justin 68
1.4 Add the dependencies to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>sparkandhive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <spark.version>2.4.4</spark.version>
        <scala.version>2.11.12</scala.version>
        <hadoop.version>2.6.0</hadoop.version>
        <hive.version>2.3.0</hive.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <!-- spark-hive is a Spark module, so it must use the Spark version, not the Hive version -->
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <recompileMode>incremental</recompileMode>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
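With the assembly plugin bound to the package phase, mvn package produces both the regular jar and a *-jar-with-dependencies.jar suitable for spark-submit; marking the Spark dependencies as provided keeps them out of that assembly, since the cluster supplies them at runtime.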
2. Loading Data into Hive with Spark
Note: a Hive-enabled SparkSession is built as follows:
val spark = SparkSession
  .builder() //.master("local[*]")
  .appName("spark_readOrwrite_hive")
  // Enable Hive support
  .enableHiveSupport()
  // Configure the Hive warehouse location
  .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
  // If metastore authentication is set to CUSTOM, a username and password must be supplied; with NONE they can be omitted
  .config("username", "sibat")
  .config("password", "sibat706")
  .getOrCreate()
2.1 Filter the Hive table student and save the result into a table named person
package com.sibat.applications
import org.apache.spark.sql.{SaveMode, SparkSession}
object SparkWriteHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder() //.master("local[*]")
      .appName("spark_write_hive")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse") // this line is optional
      .enableHiveSupport()
      .getOrCreate()
    // student is an existing Hive table; register it as a temporary view named person
    spark.table("student").createOrReplaceTempView("person")
    val tt = spark.sql("select * from person")
    tt.show()
    tt.write.mode(SaveMode.Overwrite).saveAsTable("person")
    spark.stop() // see section 1.2
  }
}
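saveAsTable above creates (or overwrites) the table definition itself. If the target table already exists and only new rows should be added, insertInto is an alternative; note that it matches columns by position, not by name. A minimal sketch under the same session:
// append rows into the existing Hive table, matching its column order
tt.write.mode(SaveMode.Append).insertInto("person")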
2.2 Load people.txt and peoplescores.txt into the Hive tables people and peoplescores, join the two tables, and select everyone who scored above 90
package com.sibat.applications
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Using the Hive data warehouse from a Spark cluster.
 */
object SparkSQLOnHive { // submit to the cluster to run
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object
    val sparkConf = new SparkConf()
      .setMaster("local[*]") // for local testing; drop this when submitting to the cluster
      .setAppName("SparkSQLOnHive") // set the Spark application name
    // Create the SparkContext
    val sparkContext = new SparkContext(sparkConf)
    /**
     * First: a DataFrame can be saved directly into the Hive warehouse via saveAsTable.
     * Second: HiveContext.table can load a Hive table directly into a DataFrame.
     */
    val hiveContext = new HiveContext(sparkContext)
    hiveContext.sql("CREATE DATABASE IF NOT EXISTS spark_sql_hive") // the test database created in section 1.3
    hiveContext.sql("use spark_sql_hive")
    hiveContext.sql("drop table if exists people") // drop any existing table with the same name
    hiveContext.sql("create table if not exists people(name STRING,age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'") // create the custom table
    /**
     * Loading local data into the Hive warehouse actually copies the data behind the scenes.
     * LOAD DATA INPATH (without LOCAL) can also load data already on HDFS into Hive; in that case the data is moved, not copied.
     */
    hiveContext.sql("load data local inpath '/home/pro/people.txt' into table people")
    hiveContext.sql("drop table if exists peoplescores") // drop peoplescores
    hiveContext.sql("create table if not exists peoplescores(name STRING,score INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'")
    hiveContext.sql("load data local inpath '/home/pro/peoplescores.txt' into table peoplescores")
    /**
     * Join the two Hive tables directly through HiveContext to get the
     * name, age, and score of everyone who scored above 90.
     */
    val resultDF = hiveContext.sql("select pi.name,pi.age,ps.score from people pi join peoplescores ps on pi.name=ps.name" +
      " where ps.score > 90")
    /**
     * resultDF holds the records (name, age, score) of everyone above 90. Save it to the Hive
     * table peopleinformationresult through the DataFrameWriter.
     */
    hiveContext.sql("drop table if exists peopleinformationresult")
    // Insert the data; this is the Spark 2.0.x+ style (Spark 1.x: resultDF.saveAsTable("peopleinformationresult"))
    resultDF.write.mode(SaveMode.Overwrite).saveAsTable("peopleinformationresult")
    val dataFrameHive = hiveContext.table("peopleinformationresult")
    dataFrameHive.show()
    sparkContext.stop()
  }
}
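The same join can also be written with the DataFrame API instead of a SQL string. A minimal sketch, assuming a Hive-enabled SparkSession named spark and the two tables created above:
import org.apache.spark.sql.functions.col
val people = spark.table("spark_sql_hive.people")
val scores = spark.table("spark_sql_hive.peoplescores")
val result = people
  .join(scores, Seq("name"))   // inner join on the shared name column
  .filter(col("score") > 90)   // keep only scores above 90
  .select("name", "age", "score")
result.show()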
3. Reading Hive Data with Spark
3.1 Read the rows of table ajm_idcard in database ods where fileDate is 20210608
package com.sibat.applications
import org.apache.spark.sql.{DataFrame, SparkSession}
object SparkReadHive {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("spark_read_hive").enableHiveSupport().getOrCreate()
    val df: DataFrame = spark.sql("SELECT * FROM ods.ajm_idcard WHERE fileDate='20210608'")
      .select("serverReceiveTimestamp", "idno", "capTime", "gbNo")
    df.show(10, false)
    spark.stop() // see section 1.2
  }
}
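Equivalently, the filter can be expressed through the DataFrame API rather than a SQL string; Spark pushes the predicate down either way. A minimal sketch against the same table, assuming the same spark session:
import org.apache.spark.sql.functions.col
val df = spark.table("ods.ajm_idcard")
  .filter(col("fileDate") === "20210608") // column/partition filter
  .select("serverReceiveTimestamp", "idno", "capTime", "gbNo")
df.show(10, false)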
3.2 Using HiveContext
(1) Option 1:
package com.sibat.applications
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object SparkUseHiveText {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SparkHiveText")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    hc.sql("select * from ods.sibat_face").show()
    sc.stop()
  }
}
(2) Option 2:
package com.sibat.applications
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.table("person") // reference the table in db.table format
      .registerTempTable("person") // register it as a temporary table
    sqlContext.sql(
      """
        | select *
        | from person
        | limit 10
      """.stripMargin).show()
    sc.stop()
  }
}
(3) HiveContext is deprecated as of Spark 2.0.0 with the message "Use SparkSession.builder.enableHiveSupport instead"; the relevant source:
package org.apache.spark.sql.hive
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.internal.Logging
import org.apache.spark.sql.{SparkSession, SQLContext}
/**
* An instance of the Spark SQL execution engine that integrates with data stored in Hive.
* Configuration for Hive is read from hive-site.xml on the classpath.
*/
@deprecated("Use SparkSession.builder.enableHiveSupport instead", "2.0.0")
class HiveContext private[hive](_sparkSession: SparkSession)
  extends SQLContext(_sparkSession) with Logging {
  self =>

  def this(sc: SparkContext) = {
    this(SparkSession.builder().sparkContext(HiveUtils.withHiveExternalCatalog(sc)).getOrCreate())
  }

  def this(sc: JavaSparkContext) = this(sc.sc)

  /**
   * Returns a new HiveContext as new session, which will have separated SQLConf, UDF/UDAF,
   * temporary tables and SessionState, but sharing the same CacheManager, IsolatedClientLoader
   * and Hive client (both of execution and metadata) with existing HiveContext.
   */
  override def newSession(): HiveContext = {
    new HiveContext(sparkSession.newSession())
  }

  /**
   * Invalidate and refresh all the cached the metadata of the given table. For performance reasons,
   * Spark SQL or the external data source library it uses might cache certain metadata about a
   * table, such as the location of blocks. When those change outside of Spark SQL, users should
   * call this function to invalidate the cache.
   *
   * @since 1.3.0
   */
  def refreshTable(tableName: String): Unit = {
    sparkSession.catalog.refreshTable(tableName)
  }
}
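Following that deprecation notice, option 2 above maps directly onto SparkSession. A minimal sketch of the migrated version:
import org.apache.spark.sql.SparkSession
object AppOnSparkSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("test")
      .master("local[2]")
      .enableHiveSupport() // replaces new HiveContext(sc)
      .getOrCreate()
    spark.table("person").createOrReplaceTempView("person") // replaces registerTempTable
    spark.sql("select * from person limit 10").show()
    spark.stop()
  }
}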