I. SparkSQL Functionality
SparkSQL needs the metadata of Hive tables. If a metastore service is configured (hive.metastore.uris), SparkSQL fetches the metadata from that service directly; otherwise it falls back to reading the metadata straight from the MySQL database that backs the metastore.
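In code, this is what enableHiveSupport() on a SparkSession switches on. A minimal sketch (Spark 2.x API; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// With Hive support enabled, Spark reads table definitions from the Hive
// metastore: via the thrift service if hive.metastore.uris is set,
// otherwise directly from the backing MySQL database.
val spark = SparkSession.builder()
  .appName("hive-metadata-demo")
  .enableHiveSupport()
  .getOrCreate()

// The table metadata listed here comes from the metastore, not from Spark itself.
spark.catalog.listTables("default").show()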
II. Configuring the Hive Integration
1. Copy or symlink Hive's hive-site.xml into Spark's conf directory
cd /opt/modules/spark-2.1.0-bin-2.7.3/conf/
ln -s /opt/modules/hive-1.2.1/conf/hive-site.xml
2. Hive's configuration file, hive-site.xml
<configuration>
<!-- Let Hive SQL run locally instead of being submitted to the cluster -->
<property>
<name>hive.exec.mode.local.auto</name>
<value>true</value>
<description> Let Hive determine whether to run in local mode automatically </description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://bigdata.ibeifeng.com:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
<description>Whether to print the names of the columns in query output.</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://bigdata.ibeifeng.com:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.exec.max.dynamic.partitions.pernode</name>
<value>100</value>
<description>Maximum number of dynamic partitions allowed to be created in each mapper/reducer node.</description>
</property>
<property>
<name>hive.exec.max.dynamic.partitions</name>
<value>1000</value>
<description>Maximum number of dynamic partitions allowed to be created in total.</description>
</property>
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
<description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
<description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
<property>
<name>hive.support.sql11.reserved.keywords</name>
<value>false</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>bigdata.ibeifeng.com</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>bigdata.ibeifeng.com</value>
</property>
</configuration>
3. Choose the integration approach based on the value of hive.metastore.uris in hive-site.xml
(1) If the parameter is not set (the default)
Add the JDBC driver jar of the database backing the Hive metastore (MySQL here) to Spark's classpath, e.g. by copying it into Spark's jars directory or passing it with --driver-class-path, and the integration is complete.
(2) If it is set to the node(s) where the metastore service runs (non-empty value; this is the more common setup)
(a) Start Hive's metastore service (from /opt/modules/hive-1.2.1):
bin/hive --service metastore &
(b) With the metastore service running and hive-site.xml in Spark's conf directory (step 1), the integration is complete; a quick smoke test is sketched below.
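Whichever case applies, a quick smoke test from spark-shell confirms that Spark can reach the metastore; listing the databases works even before any test tables exist:

scala> spark.sql("show databases").show()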
III. Testing the SparkSQL and Hive Integration
1. Create tables in Hive and load test data
[Employee table]
create table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp;
To reload and replace the existing data, use overwrite:
load data local inpath '/opt/datas/emp.txt' overwrite into table emp;
[Department table]
create table dept(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/dept.txt' into table dept;
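These tables also make it easy to see what the dynamic-partition settings in hive-site.xml (section II) enable: an insert can derive partition values from the query itself. A sketch from spark-shell, using a hypothetical emp_part table and assuming the nonstrict settings above are picked up:

scala> spark.sql("create table emp_part (empno int, ename string, sal double) partitioned by (deptno int)")
scala> // nonstrict mode: no static partition value is needed; each distinct deptno becomes its own partition
scala> spark.sql("insert overwrite table emp_part partition (deptno) select empno, ename, sal, deptno from default.emp")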
2. Test via both entry points
(1) spark-sql test
-> Launch
./bin/spark-sql
-> Run queries
spark-sql (default)> select * from default.emp;
spark-sql (default)> select * from default.emp a join default.dept b on a.deptno = b.deptno;
spark-sql (default)> explain select * from default.emp a join default.dept b on a.deptno = b.deptno;
(2) spark-shell test
-> Launch
./bin/spark-shell
-> Run queries
scala> spark.sqlContext
scala> spark.sqlContext.sql("select * from default.emp a join default.dept b on a.deptno = b.deptno").show()
(In Spark 2.x, spark.sql(...) is equivalent; sqlContext is retained for backward compatibility.)
Result:
+-----+------+---------+----+----------+------+------+------+------+----------+--------+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|deptno| dname| loc|
+-----+------+---------+----+----------+------+------+------+------+----------+--------+
| 7369| SMITH| CLERK|7902|1980-12-17| 800.0| null| 20| 20| RESEARCH| DALLAS|
| 7499| ALLEN| SALESMAN|7698| 1981-2-20|1600.0| 300.0| 30| 30| SALES| CHICAGO|
| 7521| WARD| SALESMAN|7698| 1981-2-22|1250.0| 500.0| 30| 30| SALES| CHICAGO|
| 7566| JONES| MANAGER|7839| 1981-4-2|2975.0| null| 20| 20| RESEARCH| DALLAS|
| 7654|MARTIN| SALESMAN|7698| 1981-9-28|1250.0|1400.0| 30| 30| SALES| CHICAGO|
| 7698| BLAKE| MANAGER|7839| 1981-5-1|2850.0| null| 30| 30| SALES| CHICAGO|
| 7782| CLARK| MANAGER|7839| 1981-6-9|2450.0| null| 10| 10|ACCOUNTING|NEW YORK|
| 7788| SCOTT| ANALYST|7566| 1987-4-19|3000.0| null| 20| 20| RESEARCH| DALLAS|
| 7839| KING|PRESIDENT|null|1981-11-17|5000.0| null| 10| 10|ACCOUNTING|NEW YORK|
| 7844|TURNER| SALESMAN|7698| 1981-9-8|1500.0| 0.0| 30| 30| SALES| CHICAGO|
| 7876| ADAMS| CLERK|7788| 1987-5-23|1100.0| null| 20| 20| RESEARCH| DALLAS|
| 7900| JAMES| CLERK|7698| 1981-12-3| 950.0| null| 30| 30| SALES| CHICAGO|
| 7902| FORD| ANALYST|7566| 1981-12-3|3000.0| null| 20| 20| RESEARCH| DALLAS|
| 7934|MILLER| CLERK|7782| 1982-1-23|1300.0| null| 10| 10|ACCOUNTING|NEW YORK|
+-----+------+---------+----+----------+------+------+------+------+----------+--------+
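The same join can also be written with the DataFrame API instead of a SQL string; joining on the shared deptno column keeps a single deptno in the output:

scala> spark.table("default.emp").join(spark.table("default.dept"), "deptno").show()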
IV. Code Implementation
1. Reference blog: https://blog.csdn.net/u010886217/article/details/82916520. Note that when accessing the cluster from an external network, the HDFS client must be configured to address datanodes by hostname (typically dfs.client.use.datanode.hostname=true); otherwise the data cannot be reached.
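2. As a starting point, a minimal standalone sketch of the same join as in section III (the object name is illustrative; it assumes hive-site.xml is on the application classpath, as configured above):

import org.apache.spark.sql.SparkSession

object SparkSQLHiveDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport makes the Hive metastore the source of table metadata
    val spark = SparkSession.builder()
      .appName("SparkSQLHiveDemo")
      .master("local[2]") // local mode for testing; drop this when submitting to a cluster
      .enableHiveSupport()
      .getOrCreate()

    // Same join as in section III
    spark.sql("select * from default.emp a join default.dept b on a.deptno = b.deptno").show()

    spark.stop()
  }
}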