A first experiment with Spark operating on Hive
Setting up a Spark-on-Hive environment in IDEA with Maven
The pom.xml configuration is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>study.dashuju.spark</groupId>
    <artifactId>spark-hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <!-- Configuration properties shared by sub-modules -->
    <properties>
        <spark.version>2.1.1</spark.version>
        <scala.version>2.11.8</scala.version>
        <log4j.version>1.2.17</log4j.version>
        <slf4j.version>1.7.22</slf4j.version>
    </properties>
    <dependencies>
        <!-- Spark-related jars -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <!-- With scope "provided" the jar is only available at compile time; it is not present
                 on the runtime classpath and is not packaged into the final artifact -->
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.6</version>
        </dependency>
        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.27</version>
        </dependency>
        <!-- Logging facades and bridges -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>jcl-over-slf4j</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <!-- Concrete logging implementation -->
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <!-- Logging End -->
    </dependencies>
</project>
Scala code:
package outer

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ArrayBuffer

object HiveSupport {

  def search(spark: SparkSession, sql: String): Unit = {
    spark.sql(sql).show()
  }

  def createTable(spark: SparkSession, tableName: String, dataDF: DataFrame): Unit = {
    spark.sql("DROP TABLE IF EXISTS " + tableName)
    dataDF.write.saveAsTable(tableName)
  }

  def main(args: Array[String]): Unit = {
    // Create the Spark configuration
    val sparkConf = new SparkConf().setAppName("HiveSupport").setMaster("local[*]")
    // Create the Spark SQL session with Hive support enabled
    val spark: SparkSession = SparkSession.builder().config(sparkConf)
      .enableHiveSupport().getOrCreate()
    // Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    // Import Spark SQL's implicit conversions (needed for toDF)
    import spark.implicits._
    // Generate some data
    val users: ArrayBuffer[TempUser] = ArrayBuffer[TempUser]()
    for (i <- 1 to 15) {
      users += TempUser(i, s"name${i}", 20 + i, i % 2)
    }
    // Convert the buffer to an Array before building the RDD
    // (makeRDD accepts any Seq, so this step is mainly for clarity)
    val rows: Array[TempUser] = users.toArray
    // Build the RDD from the data
    val usersRdd: RDD[TempUser] = sc.makeRDD(rows)
    // Convert to a DataFrame. Under the hood this is a Dataset carrying both the values and
    // the schema, and that schema is what gets written to the table when saving.
    val usersDF: DataFrame = usersRdd.toDF()
    // Insert data
    println("insert")
    createTable(spark, "tempusers", usersDF)
    // Query the data
    println("select")
    search(spark, "select * from tempusers")
    // Truncate the table
    println("truncate")
    spark.sql("truncate table tempusers")
    // Query again; nothing is returned now
    println("select")
    search(spark, "select * from tempusers")
    // Drop the table
    println("drop table")
    spark.sql("DROP TABLE IF EXISTS tempusers")
    spark.stop()
  }
}

// Case class describing the rows
case class TempUser(id: Int, name: String, age: Int, gender: Int)
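As a side note, the manual DROP TABLE in createTable can be avoided by writing with SaveMode.Overwrite, and the table can be read back as a typed Dataset. The following is a minimal sketch, not the original program; it reuses the TempUser case class and the tempusers table name from the example above.

import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}

object HiveSupportAlt {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveSupportAlt")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Build the same sample data as in the example above
    val usersDF = spark.sparkContext
      .makeRDD((1 to 15).map(i => TempUser(i, s"name$i", 20 + i, i % 2)))
      .toDF()

    // Overwrite replaces the table if it already exists, so no explicit DROP is needed
    usersDF.write.mode(SaveMode.Overwrite).saveAsTable("tempusers")

    // Read the Hive table back as a typed Dataset[TempUser]
    val readBack: Dataset[TempUser] = spark.table("tempusers").as[TempUser]
    readBack.show()

    spark.stop()
  }
}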
By default this uses Spark's built-in Hive (an embedded metastore).
To use an external Hive, put hive-site.xml into the resources folder; simply copy the file straight out of the Hive installation into that folder.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://hadoop102:9000/user/hive/warehouse</value>
        <description>Warehouse location. This must be set; otherwise a temporary warehouse folder is created under the project root, so data inserted from the IDE cannot be queried from the Hive shell, and dropping the table later removes the table but leaves the data behind in HDFS.</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>hadoop102,hadoop103,hadoop104</value>
        <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
        <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
</configuration>
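With this hive-site.xml on the classpath, the same SparkSession code shown earlier connects to the external metastore. A small sketch, assuming the configuration above, for checking which warehouse directory and which tables the session actually sees (exact warehouse-dir propagation can vary slightly by Spark version):

import org.apache.spark.sql.SparkSession

object HiveCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveCheck")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // Should print the HDFS warehouse path picked up from hive-site.xml,
    // not a local ./spark-warehouse folder under the project
    println(spark.conf.get("spark.sql.warehouse.dir"))

    // Lists the tables known to the external metastore (default database)
    spark.catalog.listTables().show()

    spark.stop()
  }
}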
Summary of errors encountered along the way
Running the Spark program in IDEA throws the following error:
Exception in thread "main" java.lang.VerifyError: class scala.collection.mutable.WrappedArray overrides final method toBuffer.()Lscala/collection/mutable/Buffer;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at rdd.RddMap$.main(RddMap.scala:10)
at rdd.RddMap.main(RddMap.scala)
Solution
This error indicates a jar conflict; check for conflicting dependencies and for a Spark/Scala version mismatch.
In my case, Spark was compiled against Scala 2.11.8, but Scala 2.13 was installed, hence the error.
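A quick way to see which Scala and Spark versions are actually on the runtime classpath, useful for spotting this kind of mismatch (a minimal sketch, not part of the original project):

// Prints the Scala version the program runs with and the Spark version on the classpath;
// both should match the _2.11 artifacts and ${spark.version} declared in the pom.
object VersionCheck {
  def main(args: Array[String]): Unit = {
    println(scala.util.Properties.versionString) // e.g. "version 2.11.8"
    println(org.apache.spark.SPARK_VERSION)      // e.g. "2.1.1"
  }
}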
After creating a table with Spark SQL from IDEA, dropping it with drop table in Hive leaves the data in HDFS
Solution
Not only can the data not be deleted; when the table's location is wrong, the data cannot be queried from the Hive CLI either (a quick way to check a table's location is sketched after the property snippet below).
Configuring hive.metastore.warehouse.dir in hive-site.xml fixes this. By default a warehouse folder is created under the current project directory, while the data sits in HDFS, so Hive cannot find the data and therefore never deletes it.
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://hadoop102:9000/user/hive/warehouse</value>
    <description>hive.metastore.warehouse.dir</description>
</property>
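To verify whether the warehouse setting took effect, the table's location can be inspected from Spark SQL. A one-line sketch, assuming the SparkSession (spark) and the tempusers table from the example above:

// Shows the table metadata, including the "Location" row; for a managed table this should point
// at hdfs://hadoop102:9000/user/hive/warehouse/..., not a local spark-warehouse folder.
spark.sql("DESCRIBE FORMATTED tempusers").show(100, truncate = false)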
Dropping a managed table in Hive throws an error
2019-10-28 10:33:06,411 WARN --- [ main] org.apache.hadoop.hive.metastore.MetaStoreDirectSql (line: 1606) : Failed to execute [select "PARTITIONS"."PART_ID" from "PARTITIONS" inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? order by "PART_NAME" asc] with parameters [tempusers, default]
javax.jdo.JDODataStoreException: Error executing SQL query "select "PARTITIONS"."PART_ID" from "PARTITIONS" inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? order by "PART_NAME" asc".
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'OPTION SQL_SELECT_LIMIT=DEFAULT' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Roughly, this means the MySQL statement generated by the metastore is invalid and the driver version should be checked; different driver versions presumably emit slightly different SQL. (I have not verified this, but that is what the error message suggests.)
Solution
This problem is caused by a mismatched MySQL driver version.
Change the mysql-connector-java version in the pom to the JDBC driver version found under the Hive installation's lib/ folder.
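A rough way to confirm which Connector/J version is actually being loaded is to read the driver jar's manifest; a hedged sketch (the manifest usually records the version, but may not in every build):

object DriverVersionCheck {
  def main(args: Array[String]): Unit = {
    // Implementation-Version from the Connector/J jar's manifest (may be null if absent);
    // it should match the driver version shipped under the Hive installation's lib/ folder.
    val version = Option(classOf[com.mysql.jdbc.Driver].getPackage.getImplementationVersion)
    println(version.getOrElse("unknown"))
  }
}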
In the end I got it running with Hive 2.3.6, Spark 2.1.1, and MySQL Connector/J 5.1.27.
The Hive version probably matters little; I switched between a few versions and saw no difference in behavior.
That said, each Spark release has a recommended matching Hive version;
see the Hive-on-Spark version compatibility table at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started