一. Embedded Hive
If you use the Hive embedded in Spark, no extra setup is needed; it can be used directly.
Hive's metadata is stored in Derby, and the default warehouse location is $SPARK_HOME/spark-warehouse.
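Before creating anything, you can check which warehouse the shell is actually using. A minimal sketch from spark-shell, assuming the standard spark.sql.warehouse.dir key is readable at runtime (the exact path will differ on your machine):
scala> spark.conf.get("spark.sql.warehouse.dir")   // e.g. file:/usr/hadoop/spark-2.1.1/bin/spark-warehouse
scala> spark.catalog.listDatabases.show            // the embedded catalog starts out with only the default database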
(1) List tables
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
(2) Use a temporary view
scala> val df=spark.read.json("file:///usr/hadoop/spark-2.1.1/bin/data/user.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, username: string]
scala> df.create
createGlobalTempView createOrReplaceTempView createTempView
scala> df.createOrReplaceTempView("user")
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | user| true|
+--------+---------+-----------+
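Note that createOrReplaceTempView registers the view only for the current SparkSession. If a view should be visible across sessions, a global temporary view can be used instead; it lives in the reserved global_temp database. A minimal sketch (the view name user_g is just illustrative):
scala> df.createGlobalTempView("user_g")
scala> spark.sql("select * from global_temp.user_g").show               // must be qualified with global_temp
scala> spark.newSession().sql("select * from global_temp.user_g").show  // also visible from a new session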
(3) Create your own table
scala> spark.sql("create table user(id int)")
(4) Load local data into the table
Create user.txt under the data directory of the Spark installation, containing the following (one number per line): 1 2 3 4
scala> spark.sql("load data local inpath '/usr/hadoop/spark-2.1.1/data/user.txt' into table user")
21/02/03 11:21:58 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res3: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from user").show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| user| false|
| | user| true|
+--------+---------+-----------+
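Besides LOAD DATA, a DataFrame can also be written directly into a managed table of the embedded catalog. A minimal sketch reusing the df created from user.json above (the table name user_json is just illustrative):
scala> df.write.mode("overwrite").saveAsTable("user_json")   // creates or overwrites a managed table under spark-warehouse
scala> spark.sql("select username, age from user_json").show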
二. External Hive
To connect to an already deployed external Hive, the following steps are needed:
➢ For Spark to take over Hive, copy hive-site.xml into the conf/ directory (a minimal example is sketched after this list)
➢ Copy the MySQL driver (mysql-connector-java-xxxx-bin.jar) into Spark's jars/ directory. (If you installed MySQL and Hive yourself, this driver can be found under lib in the Hive installation directory.)
➢ If HDFS cannot be reached, also copy core-site.xml and hdfs-site.xml into the conf/ directory
➢ Restart spark-shell
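For reference, a minimal hive-site.xml for this setup usually only needs to point Spark at the Hive metastore service; the host below is a placeholder and must match your own environment:
<configuration>
  <property>
    <!-- address of the metastore service started with: hive --service metastore (9083 is the default port) -->
    <name>hive.metastore.uris</name>
    <value>thrift://your-hive-host:9083</value>
  </property>
</configuration>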
1. Startup error
My own first launch failed and Hive was not picked up. The startup output looked the same as a plain spark-shell launch, and I noticed the following warnings:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/02/03 13:02:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/02/03 13:02:18 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Following a method found online, I set JAVA_LIBRARY_PATH in conf/spark-env.sh under the Spark installation directory:
export LD_LIBRARY_PATH=$JAVA_LIBRARY_PATH
where $JAVA_LIBRARY_PATH is the lib/native directory under the Hadoop installation directory.
For details see: https://blog.csdn.net/someby/article/details/82930140
Alternatively, the cause may have been that I had not started the Hive metastore service: hive --service metastore
I did both, so I am not sure which one fixed it, but afterwards the warnings shown when running spark-shell changed and the connection to Hive succeeded:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/02/03 13:11:01 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
21/02/03 13:11:01 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
21/02/03 13:11:02 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.137.128:4040
Spark context available as 'sc' (master = local[*], app id = local-1612329052731).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.1
/_/
Make sure the following services are running:
- Hive metastore service: hive --service metastore
- Hadoop services: sh $HADOOP_HOME/sbin/start-all.sh
- Spark services: sh $SPARK_HOME/sbin/start-all.sh
2. Operating on Hive
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@ac6da02
scala> sqlContext.sql("create table user(id int)")
21/02/03 13:28:31 WARN metastore.HiveMetaStore: Location: file:/usr/hadoop/spark-2.1.1/spark-warehouse/user specified for non-external table:user
res3: org.apache.spark.sql.DataFrame = []
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| user| false|
+--------+---------+-----------+
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
+------------+
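Once connected to the external Hive, its tables can be created, populated, and queried with plain SQL from the shell. A minimal sketch, assuming a comma-delimited file has already been uploaded to HDFS at /data/emp.txt (the table name emp and the path are both illustrative):
scala> spark.sql("create table if not exists emp(id int, name string) row format delimited fields terminated by ','")
scala> spark.sql("load data inpath '/data/emp.txt' into table emp")
scala> spark.sql("select count(*) from emp").show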
三. Operating Hive from IDEA
1. Import dependencies
My installed versions are Hive 2.3.5 and Spark 2.1.1 (hive-exec stays at 1.2.1 below because that is the Hive version Spark 2.1.1 is built against).
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.32</version>
</dependency>
2. Copy configuration files
It is best to copy the three files (hive-site.xml, core-site.xml, and hdfs-site.xml) into both the resources and classes directories.
3. Write the code
Note: a database created from the IDE uses the local warehouse by default; the warehouse location can be changed with a config parameter:
config("spark.sql.warehouse.dir", "hdfs://linux1:8020/user/hive/warehouse")
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object SparkSQL_Hive {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    // TODO create the SparkSQL runtime environment
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkSQL")
    val spark = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate()

    spark.sql("show tables").show

    // TODO close the environment
    spark.close()
  }
}
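If the warehouse should live on HDFS as noted above, the config call can be added to the same builder, and the session can then both read and write Hive tables. A minimal sketch, assuming the hdfs://linux1:8020 address from the note above and the user table created earlier (the object name and the user_backup table are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkSQL_HiveOnHdfs {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkSQL")
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config(sparkConf)
      .config("spark.sql.warehouse.dir", "hdfs://linux1:8020/user/hive/warehouse") // warehouse on HDFS
      .getOrCreate()

    // read a Hive table and write a derived result back as a new managed table
    val users = spark.sql("select * from user")
    users.write.mode("overwrite").saveAsTable("user_backup")

    spark.close()
  }
}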
If you get the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'
see my previous post: https://blog.csdn.net/smallworldxyl/article/details/113608967
If an error occurs while running operations (typically an HDFS permission error for the current user), it can be fixed by adding the following line at the very beginning of the code:
System.setProperty("HADOOP_USER_NAME", "root")
Replace root here with your own Hadoop user name.