MySQL side
We use the dept table; its contents are as follows:
mysql> select * from dept;
+--------+------------+----------+
| deptno | dname      | loc      |
+--------+------------+----------+
|     10 | ACCOUNTING | NEW YORK |
|     20 | RESEARCH   | DALLAS   |
|     30 | SALES      | CHICAGO  |
|     40 | OPERATIONS | BOSTON   |
+--------+------------+----------+
Hive side
1. Create the empspark table and load the data
hive> CREATE TABLE empspark (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> salary double,
> comm double,
> deptno int
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
OK
Time taken: 0.635 seconds
hive> LOAD DATA LOCAL INPATH "/home/hadoop/data/emp" OVERWRITE INTO TABLE empspark;
Loading data to table default.empspark
Table default.empspark stats: [numFiles=1, numRows=0, totalSize=820, rawDataSize=0]
OK
Time taken: 0.973 seconds
hive> select * from empspark;
OK
369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
Time taken: 0.679 seconds, Fetched: 14 row(s)
Spark side
1. With the data prepared, we first need to make Spark SQL able to access Hive tables. The two steps I took here were:
- Copy Hive's conf/hive-site.xml into Spark's conf/ directory
- Copy the mysql-connector-java-5.1.45-bin.jar JDBC driver into Spark's jars/ directory
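The two copy steps above can be sketched as shell commands. This is a sketch only: the HIVE_HOME/SPARK_HOME variables and the location of the downloaded driver jar are assumptions, so adjust them to your installation.

```shell
# Assumed layout -- set HIVE_HOME and SPARK_HOME to your actual install paths.
cp "$HIVE_HOME/conf/hive-site.xml" "$SPARK_HOME/conf/"
# Driver version matches the one mentioned above; the source path is assumed.
cp mysql-connector-java-5.1.45-bin.jar "$SPARK_HOME/jars/"
```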
2. Create a DataFrame for each of the two tables
- MySQL table:
val jdbcDF = spark.read.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306")
.option("dbtable", "wl.dept")
.option("user", "username")
.option("password", "password").load()
- Hive table:
val hivedf = spark.table("empspark")
- Perform the join:
jdbcDF.join(hivedf,jdbcDF.col("deptno")===hivedf.col("deptno")).show
- Result:
+------+----------+--------+-----+------+---------+----+-------------------+------+------+------+
|deptno| dname| loc|empno| ename| job| mgr| hiredate|salary| comm|deptno|
+------+----------+--------+-----+------+---------+----+-------------------+------+------+------+
| 10|ACCOUNTING|NEW YORK| 7934|MILLER| CLERK|7782|1982-01-23 00:00:00|1300.0| null| 10|
| 10|ACCOUNTING|NEW YORK| 7839| KING|PRESIDENT|null|1981-11-17 00:00:00|5000.0| null| 10|
| 10|ACCOUNTING|NEW YORK| 7782| CLARK| MANAGER|7839|1981-06-09 00:00:00|2450.0| null| 10|
| 20| RESEARCH| DALLAS| 7902| FORD| ANALYST|7566|1981-12-03 00:00:00|3000.0| null| 20|
| 20| RESEARCH| DALLAS| 7876| ADAMS| CLERK|7788|1983-01-12 00:00:00|1100.0| null| 20|
| 20| RESEARCH| DALLAS| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.0| null| 20|
| 20| RESEARCH| DALLAS| 7566| JONES| MANAGER|7839|1981-04-02 00:00:00|2975.0| null| 20|
| 20| RESEARCH| DALLAS| 369| SMITH| CLERK|7902|1980-12-17 00:00:00| 800.0| null| 20|
| 30| SALES| CHICAGO| 7900| JAMES| CLERK|7698|1981-12-03 00:00:00| 950.0| null| 30|
| 30| SALES| CHICAGO| 7844|TURNER| SALESMAN|7698|1981-09-08 00:00:00|1500.0| 0.0| 30|
| 30| SALES| CHICAGO| 7698| BLAKE| MANAGER|7839|1981-05-01 00:00:00|2850.0| null| 30|
| 30| SALES| CHICAGO| 7654|MARTIN| SALESMAN|7698|1981-09-28 00:00:00|1250.0|1400.0| 30|
| 30| SALES| CHICAGO| 7521| WARD| SALESMAN|7698|1981-02-22 00:00:00|1250.0| 500.0| 30|
| 30| SALES| CHICAGO| 7499| ALLEN| SALESMAN|7698|1981-02-20 00:00:00|1600.0| 300.0| 30|
+------+----------+--------+-----+------+---------+----+-------------------+------+------+------+
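Note that the result above carries two deptno columns, one from each side. A minimal sketch, assuming the same jdbcDF and hivedf DataFrames created earlier: joining on a column-name Seq keeps a single deptno, and the same join can also be written in SQL by registering the JDBC DataFrame as a temp view (the view name dept_mysql is made up for illustration).

```scala
// Joining on a Seq of column names deduplicates the join key,
// so the result has only one deptno column.
val joined = jdbcDF.join(hivedf, Seq("deptno"))
joined.show()

// Equivalent SQL: expose the JDBC DataFrame as a temp view and join it
// against the Hive table by name.
jdbcDF.createOrReplaceTempView("dept_mysql")
spark.sql(
  """SELECT d.deptno, d.dname, d.loc, e.empno, e.ename, e.job
    |FROM dept_mysql d
    |JOIN empspark e ON d.deptno = e.deptno""".stripMargin).show()
```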