I. Prerequisites
1. Set up the Java and Scala environments on Linux
2. Set up a Hadoop + Spark + Hive distributed cluster on Linux
Hadoop distributed cluster setup: see the Hadoop distributed cluster setup guide
Spark distributed cluster setup: see the Spark distributed cluster setup guide
Hive distributed cluster setup: to be completed
II. Project Implementation
1. Requirement
Query data from a Hive database through Spark SQL.
Database: bi_ods
Table: owms_m_locator
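Before writing any Spark code, it helps to confirm from the Hive CLI that the database and table are actually visible (a quick sanity check, run on a node where Hive is configured):
hive -e "SELECT * FROM bi_ods.owms_m_locator LIMIT 10;"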
2. Maven project setup
Create a new Maven project.
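For reference, the project skeleton can also be generated from the command line with the quickstart archetype (the groupId and artifactId match the pom.xml below; the archetype choice itself is just one convenient option):
mvn archetype:generate -DgroupId=com.lm.hive -DartifactId=SparkHive -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false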
3. Implementation code
package com.lm.hive.SparkHive;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

/**
 * Fetch Hive data through Spark SQL.
 */
public class App
{
    public static void main(String[] args)
    {
        SparkConf sparkConf = new SparkConf().setAppName("SparkHive").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Use HiveContext, not SQLContext: with SQLContext the deployed job
        // cannot resolve Hive databases and tables (see section IV).
        HiveContext hiveContext = new HiveContext(sc);

        // Query the first 10 rows of the table
        hiveContext.sql("select * from bi_ods.owms_m_locator limit 10").show();

        sc.stop();
    }
}
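The same query can also be written with the DataFrame API instead of a raw SQL string (a sketch equivalent to the call above under Spark 1.6; table() and limit() are standard DataFrame methods):
// equivalent DataFrame-API form of the query above
hiveContext.table("bi_ods.owms_m_locator").limit(10).show();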
4. The pom.xml file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.lm.hive</groupId>
    <artifactId>SparkHive</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>SparkHive</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.39</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.22</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.6.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <shadedClassifierName>allinone</shadedClassifierName>
                            <artifactSet>
                                <includes>
                                    <include>*:*</include>
                                </includes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.handlers</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.schemas</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <manifestEntries>
                                        <!-- main class matches the one passed to spark-submit in section III -->
                                        <Main-Class>com.lm.hive.SparkHive.App</Main-Class>
                                    </manifestEntries>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
III. Deployment
1. Build the Maven project into a jar
The build in this article produces SparkHive-0.0.1-SNAPSHOT-allinone.jar.
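Because the shade plugin is bound to the package phase in the pom above, a standard build produces the fat jar (the allinone classifier comes from shadedClassifierName):
mvn clean package
# output: target/SparkHive-0.0.1-SNAPSHOT-allinone.jar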
2. Upload the jar to the Linux server
Upload it with the rz command.
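If rz (lrzsz) is not installed, scp does the same job (user, host, and target directory here are placeholders; java_jar matches the path used in the spark-submit command below):
scp target/SparkHive-0.0.1-SNAPSHOT-allinone.jar user@spark-master:~/java_jar/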
3. Run the jar with spark-submit
Command (the --files option ships hive-site.xml with the job so Spark SQL can locate the Hive metastore):
sh bin/spark-submit --class com.lm.hive.SparkHive.App \
    --master yarn \
    --files /home/winit/spark-1.6.0/conf/hive-site.xml \
    java_jar/SparkHive-0.0.1-SNAPSHOT-allinone.jar
Result: the first 10 rows of bi_ods.owms_m_locator are printed to the console.
IV. Exceptions
Many examples found online query Hive data through a SQLContext instance. When I did the same, the following exceptions appeared.
1. Exception in thread "main" org.apache.spark.sql.AnalysisException: Table not found: `bi_ods`.`owms_m_locator`;
Fix: replace
SQLContext sqlContext = new SQLContext(sc);
with:
HiveContext hiveContext = new HiveContext(sc);
2. Exception in thread "main" java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
Fix: the error is caused by javax.servlet classes being pulled in more than once.
Exclude the javax.servlet artifacts from hadoop-client (or hadoop-common):
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.4</version>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
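To check which dependencies still pull in javax.servlet classes, filter the dependency tree (a standard maven-dependency-plugin invocation; the pattern filters by groupId):
mvn dependency:tree -Dincludes=javax.servlet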
V. SQLContext and HiveContext: Differences and Relationship
The main entry point of a Spark SQL program is the SQLContext class or one of its subclasses. Creating a basic SQLContext requires nothing more than a SparkContext.
Beyond the basic SQLContext you can also create a HiveContext. The differences and relationship between the two:
SQLContext currently supports only the basic SQL parser (SQL-92 syntax).
HiveContext supports both the SQL parser and the HiveQL parser. HiveQL is the default, and you can switch to the SQL parser via configuration in order to run syntax that HiveQL does not support.
With a HiveContext you can use Hive UDFs and read and write Hive tables, among other Hive operations; a SQLContext cannot operate on Hive at all.
Future Spark SQL releases will keep enriching SQLContext so that the capabilities of SQLContext and HiveContext converge; eventually the two may be unified into a single context.
HiveContext wraps the Hive dependencies. Keeping it separate means a basic Spark deployment does not need the Hive dependency jars; they are only added when a HiveContext is actually used.
The SQL parser is selected with the spark.sql.dialect setting. A SQLContext can only use the "sql" parser provided by Spark SQL; a HiveContext defaults to the "hiveql" parser and also supports "sql".
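A minimal Java sketch of the dialect switch on Spark 1.6 (assuming the JavaSparkContext sc from section II; setConf is the standard SQLContext configuration setter):
// HiveContext parses queries with "hiveql" by default
HiveContext hiveContext = new HiveContext(sc);
// switch to the basic SQL-92 parser
hiveContext.setConf("spark.sql.dialect", "sql");
// switch back to the HiveQL parser
hiveContext.setConf("spark.sql.dialect", "hiveql");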
For details see: Differences and relationship between SQLContext and HiveContext
Code download: Java + Spark SQL + Hive + Maven sample code