This article covers Spark join operations, described in Java: the three operators join, leftOuterJoin, and rightOuterJoin, illustrated with a concrete example. The walkthrough has three steps:
1. Data preparation
2. HQL version
3. Spark version

1. Data preparation
We prepare two Hive tables, orders and drivers, which are joined on the driver_id column. The data:
orders
hive (gulfstream_test)> select * from orders;
OK
orders.order_id    orders.driver_id
1000    5000
1001    5001
1002    5002
Time taken: 0.387 seconds, Fetched: 3 row(s)
drivers
hive (gulfstream_test)> select * from drivers;
OK
drivers.driver_id    drivers.car_id
5000    100
5003    103
Time taken: 0.036 seconds, Fetched: 2 row(s)
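The original post does not show how these tables were created. For readers who want to reproduce the example, here is a minimal sketch that registers the same sample rows and saves them as a Hive table. The STRING column types are my assumption (chosen to match the casts used later in Join.java), the temp-table name orders_tmp is hypothetical, and javaSparkContext/hiveContext refer to the fields set up in the Spark program of section 3:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical setup; the original post does not show the table DDL.
StructType orderSchema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("order_id", DataTypes.StringType, false),
        DataTypes.createStructField("driver_id", DataTypes.StringType, false)));
JavaRDD<Row> orderRows = javaSparkContext.parallelize(Arrays.asList(
        RowFactory.create("1000", "5000"),
        RowFactory.create("1001", "5001"),
        RowFactory.create("1002", "5002")));
// Register the rows as a temp table, then persist them into Hive.
hiveContext.createDataFrame(orderRows, orderSchema).registerTempTable("orders_tmp");
hiveContext.sql("CREATE TABLE gulfstream_test.orders AS SELECT * FROM orders_tmp");
// The drivers table is created the same way, with columns (driver_id, car_id)
// and rows ("5000", "100") and ("5003", "103").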
2. HQL version
JOIN
Inner join: outputs only the records whose join keys match on both sides.
hive (gulfstream_test)> select * from orders t1 join drivers t2 on (t1.driver_id = t2.driver_id);
OK
t1.order_id    t1.driver_id    t2.driver_id    t2.car_id
1000    5000    5000    100
Time taken: 36.079 seconds, Fetched: 1 row(s)
LEFT OUTER JOIN
Left outer join: outputs the matching records, and every row of the left table appears in the output whether or not it matches.
hive (gulfstream_test)> select * from orders t1 left outer join drivers t2 on (t1.driver_id = t2.driver_id);
OK
t1.order_id    t1.driver_id    t2.driver_id    t2.car_id
1000    5000    5000    100
1001    5001    NULL    NULL
1002    5002    NULL    NULL
Time taken: 36.063 seconds, Fetched: 3 row(s)
RIGHT OUTER JOIN
Right outer join: outputs the matching records, and every row of the right table appears in the output whether or not it matches.
hive (gulfstream_test)> select * from orders t1 right outer join drivers t2 on (t1.driver_id = t2.driver_id);
OK
t1.order_id    t1.driver_id    t2.driver_id    t2.car_id
1000    5000    5000    100
NULL    NULL    5003    103
Time taken: 30.089 seconds, Fetched: 2 row(s)
3. Spark version
Join.java
Spark implements joins through RDD operators as well, and provides the three corresponding operators: join, leftOuterJoin, and rightOuterJoin.
In the example below, we read the Hive tables through spark-hive and convert each DataFrame into an RDD.
After each join, collect() pulls the results to the driver, where they are printed to standard output.
Two points are worth noting:
1) the join operators (join, leftOuterJoin, rightOuterJoin) are only available on a pair RDD (JavaPairRDD); see the standalone sketch after this list;
2) in the Tuple2 that the operators work on, the first element is the join key (I have only tried Integer and String), while the second element is flexible and can even be an entire Row.
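To make point 1) concrete before the full program, here is a minimal standalone sketch. The variable names and inline sample data are mine, mirroring the tables above, and javaSparkContext is assumed to be a live JavaSparkContext:

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// join/leftOuterJoin/rightOuterJoin are defined on JavaPairRDD, not on a plain JavaRDD;
// each dataset must first be keyed, here by driver_id.
JavaPairRDD<String, String> orders = javaSparkContext.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("5000", "1000"),
        new Tuple2<String, String>("5001", "1001"),
        new Tuple2<String, String>("5002", "1002")));
JavaPairRDD<String, String> drivers = javaSparkContext.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("5000", "100"),
        new Tuple2<String, String>("5003", "103")));
// Inner join keeps only driver_id 5000, matching the HQL result above.
JavaPairRDD<String, Tuple2<String, String>> joined = orders.join(drivers);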
package com.kangaroo.studio.algorithms.join;

import com.google.common.base.Optional;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Iterator;

/*
 * spark-submit --queue=root.zhiliangbu_prod_datamonitor spark-join-1.0-SNAPSHOT-jar-with-dependencies.jar
 */
public class Join implements Serializable {

    private transient JavaSparkContext javaSparkContext;
    private transient HiveContext hiveContext;

    /*
     * Initialization: create the SparkContext and the HiveContext.
     */
    public Join() {
        initSparkContext();
        initHiveContext();
    }

    /*
     * Create the SparkContext.
     */
    private void initSparkContext() {
        String warehouseLocation = System.getProperty("user.dir");
        SparkConf sparkConf = new SparkConf()
                .setAppName("spark-join")
                .set("spark.sql.warehouse.dir", warehouseLocation)
                .setMaster("yarn-client");
        javaSparkContext = new JavaSparkContext(sparkConf);
    }

    /*
     * Create the HiveContext, used to read data from Hive.
     */
    private void initHiveContext() {
        hiveContext = new HiveContext(javaSparkContext);
    }

    public void join() {
        /* Build rdd1: (driver_id, order_id) pairs from the orders table. */
        String query1 = "select * from gulfstream_test.orders";
        DataFrame rows1 = hiveContext.sql(query1).select("order_id", "driver_id");
        JavaPairRDD<String, String> rdd1 = rows1.toJavaRDD().mapToPair(new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String orderId = (String) row.get(0);
                String driverId = (String) row.get(1);
                // driver_id is the join key, so it goes first in the pair.
                return new Tuple2<String, String>(driverId, orderId);
            }
        });
        /* Build rdd2: (driver_id, car_id) pairs from the drivers table. */
        String query2 = "select * from gulfstream_test.drivers";
        DataFrame rows2 = hiveContext.sql(query2).select("driver_id", "car_id");
        JavaPairRDD<String, String> rdd2 = rows2.toJavaRDD().mapToPair(new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String driverId = (String) row.get(0);
                String carId = (String) row.get(1);
                return new Tuple2<String, String>(driverId, carId);
            }
        });
        /* join */
        System.out.println(" ****************** join *******************");
        JavaPairRDD<String, Tuple2<String, String>> joinRdd = rdd1.join(rdd2);
        Iterator<Tuple2<String, Tuple2<String, String>>> it1 = joinRdd.collect().iterator();
        while (it1.hasNext()) {
            Tuple2<String, Tuple2<String, String>> item = it1.next();
            System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + item._2._2);
        }
        /* leftOuterJoin: the right side may be absent, hence Optional. */
        System.out.println(" ****************** leftOuterJoin *******************");
        JavaPairRDD<String, Tuple2<String, Optional<String>>> leftOuterJoinRdd = rdd1.leftOuterJoin(rdd2);
        Iterator<Tuple2<String, Tuple2<String, Optional<String>>>> it2 = leftOuterJoinRdd.collect().iterator();
        while (it2.hasNext()) {
            Tuple2<String, Tuple2<String, Optional<String>>> item = it2.next();
            System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + item._2._2);
        }
        /* rightOuterJoin: the left side may be absent, hence Optional. */
        System.out.println(" ****************** rightOuterJoin *******************");
        JavaPairRDD<String, Tuple2<Optional<String>, String>> rightOuterJoinRdd = rdd1.rightOuterJoin(rdd2);
        Iterator<Tuple2<String, Tuple2<Optional<String>, String>>> it3 = rightOuterJoinRdd.collect().iterator();
        while (it3.hasNext()) {
            Tuple2<String, Tuple2<Optional<String>, String>> item = it3.next();
            System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + item._2._2);
        }
    }

    public static void main(String[] args) {
        Join sj = new Join();
        sj.join();
    }
}
pom.xml
Dependencies: only the spark-core and spark-hive artifacts are needed, both provided by the cluster at runtime.

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

Packaging: maven-assembly-plugin builds a jar-with-dependencies with Join as the main class, and maven-compiler-plugin compiles for Java 1.6. Running mvn package then produces the spark-join-1.0-SNAPSHOT-jar-with-dependencies.jar used in the spark-submit command at the top of Join.java.

<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest><mainClass>com.kangaroo.studio.algorithms.join.Join</mainClass></manifest>
                </archive>
                <descriptorRefs><descriptorRef>jar-with-dependencies</descriptorRef></descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals><goal>single</goal></goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.6</source>
                <target>1.6</target>
            </configuration>
        </plugin>
    </plugins>
</build>
Execution results
In the output, Optional.absent() stands for null; the results match the HQL results above.
Application ID is application_1508228032068_2746260, trackingURL: http://10.93.21.21:4040
 ****************** join *******************
driver_id:5000, order_id:1000, car_id:100
 ****************** leftOuterJoin *******************
driver_id:5001, order_id:1001, car_id:Optional.absent()
driver_id:5002, order_id:1002, car_id:Optional.absent()
driver_id:5000, order_id:1000, car_id:Optional.of(100)
 ****************** rightOuterJoin *******************
driver_id:5003, order_id:Optional.absent(), car_id:103
driver_id:5000, order_id:Optional.of(1000), car_id:100
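A note on reading that output: in the Spark 1.x Java API, leftOuterJoin and rightOuterJoin wrap the side that may be missing in Guava's com.google.common.base.Optional, which is why the raw toString() shows Optional.absent() and Optional.of(...). To print plain values like the HQL NULLs instead, the value can be unwrapped; a sketch for the leftOuterJoin loop above:

// Inside the leftOuterJoin loop in Join.java: unwrap the Guava Optional before printing.
Optional<String> maybeCarId = item._2._2;
String carId = maybeCarId.isPresent() ? maybeCarId.get() : "NULL"; // equivalently: maybeCarId.or("NULL")
System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + carId);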