Notes on Spark's Three Kinds of Join

1、Preparing the data

orders (order table, t1):

order_id    driver_id
1000        5000
1001        5001
1002        5002

drivers (driver table, t2):

driver_id   car_id
5000        100
5003        103

2、The three joins in Hive

join: an inner join; only the records whose join keys match are output.
Test:
hive (gulfstream_test)> select * from orders t1 join drivers t2 on (t1.driver_id = t2.driver_id);
Result:
t1.order_id    t1.driver_id    t2.driver_id    t2.car_id
1000           5000            5000            100

left outer join: matched records are joined, and every row of the left table is output whether it matches or not; unmatched right-side columns come back NULL.
Test:
hive (gulfstream_test)> select * from orders t1 left outer join drivers t2 on (t1.driver_id = t2.driver_id);
Result:
t1.order_id    t1.driver_id    t2.driver_id    t2.car_id
1000           5000            5000            100
1001           5001            NULL            NULL
1002           5002            NULL            NULL

right outer join: the mirror image; every row of the right table is output whether it matches or not.
Test:
hive (gulfstream_test)> select * from orders t1 right outer join drivers t2 on (t1.driver_id = t2.driver_id);
Result:
t1.order_id    t1.driver_id    t2.driver_id    t2.car_id
1000           5000            5000            100
NULL           NULL            5003            103

3、Spark description

Spark also implements joins through RDD operators, and likewise provides three of them: join, leftOuterJoin, and rightOuterJoin.

In the example below we read the orders and drivers tables from Hive via spark-hive, at which point the data is in DataFrame form. To use the join operators:
1) First convert the DataFrame to a JavaRDD.
2) JavaRDD itself has no join operator, so the JavaRDD must then be turned into a JavaPairRDD with the mapToPair operator; only then is join available.

Two things are worth pointing out:
1) The join operators (join, leftOuterJoin, rightOuterJoin) can only be used on a PairRDD.
2) In the Tuple2<Object1, Object2> that the join operators work on, Object1 is the join key; I have only tried Integer and String for it. Object2 is more flexible and can even be an entire Row.

Here driver_id is the join key, which is why driver_id is placed first when the Tuple2 is emitted.

import java.io.Serializable;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

// Spark 1.x returns Guava's Optional from the outer joins;
// Spark 2.x replaced it with org.apache.spark.api.java.Optional.
import com.google.common.base.Optional;

import scala.Tuple2;

public class Join implements Serializable {

    private transient JavaSparkContext javaSparkContext;
    private transient HiveContext hiveContext;

    /*
     * Initialization:
     * create the SparkContext and the HiveContext.
     */
    public Join() {
        initSparkContext();
        initHiveContext();
    }

    /*
     * Create the SparkContext.
     */
    private void initSparkContext() {
        String warehouseLocation = System.getProperty("user.dir");
        SparkConf sparkConf = new SparkConf()
                .setAppName("spark-join")
                .set("spark.sql.warehouse.dir", warehouseLocation)
                .setMaster("yarn-client");
        javaSparkContext = new JavaSparkContext(sparkConf);
    }

    /*
     * Create the HiveContext, used to read data from Hive.
     */
    private void initHiveContext() {
        hiveContext = new HiveContext(javaSparkContext);
    }

    public void join() {
        /*
         * Build rdd1 from the orders table: (driver_id, order_id) pairs.
         */
        String query1 = "select * from gulfstream_test.orders";
        DataFrame rows1 = hiveContext.sql(query1).select("order_id", "driver_id");
        JavaPairRDD<String, String> rdd1 = rows1.toJavaRDD().mapToPair(new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String orderId = (String) row.get(0);
                String driverId = (String) row.get(1);
                return new Tuple2<String, String>(driverId, orderId);
            }
        });
        /*
         * Build rdd2 from the drivers table: (driver_id, car_id) pairs.
         */
        String query2 = "select * from gulfstream_test.drivers";
        DataFrame rows2 = hiveContext.sql(query2).select("driver_id", "car_id");
        JavaPairRDD<String, String> rdd2 = rows2.toJavaRDD().mapToPair(new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String driverId = (String) row.get(0);
                String carId = (String) row.get(1);
                return new Tuple2<String, String>(driverId, carId);
            }
        });
        /*
         * join
         */
        System.out.println(" ****************** join *******************");
        JavaPairRDD<String, Tuple2<String, String>> joinRdd = rdd1.join(rdd2);
        Iterator<Tuple2<String, Tuple2<String, String>>> it1 = joinRdd.collect().iterator();
        while (it1.hasNext()) {
            Tuple2<String, Tuple2<String, String>> item = it1.next();
            System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + item._2._2);
        }
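        /*
         * For the two outer joins below, the side that may be missing comes back
         * wrapped in Optional (Guava's Optional in Spark 1.x), so the value type
         * changes to Tuple2<String, Optional<String>> for leftOuterJoin and
         * Tuple2<Optional<String>, String> for rightOuterJoin.
         */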
        /*
         * leftOuterJoin
         */
        System.out.println(" ****************** leftOuterJoin *******************");
        JavaPairRDD<String, Tuple2<String, Optional<String>>> leftOuterJoinRdd = rdd1.leftOuterJoin(rdd2);
        Iterator<Tuple2<String, Tuple2<String, Optional<String>>>> it2 = leftOuterJoinRdd.collect().iterator();
        while (it2.hasNext()) {
            Tuple2<String, Tuple2<String, Optional<String>>> item = it2.next();
            System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + item._2._2);
        }
        /*
         * rightOuterJoin
         */
        System.out.println(" ****************** rightOuterJoin *******************");
        JavaPairRDD<String, Tuple2<Optional<String>, String>> rightOuterJoinRdd = rdd1.rightOuterJoin(rdd2);
        Iterator<Tuple2<String, Tuple2<Optional<String>, String>>> it3 = rightOuterJoinRdd.collect().iterator();
        while (it3.hasNext()) {
            Tuple2<String, Tuple2<Optional<String>, String>> item = it3.next();
            System.out.println("driver_id:" + item._1 + ", order_id:" + item._2._1 + ", car_id:" + item._2._2);
        }
    }

    public static void main(String[] args) {
        Join sj = new Join();
        sj.join();
    }
}
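For reference, with the sample tables above the program should print something like the following (RDD ordering is not guaranteed, and the unmatched sides render through Guava Optional's toString):

 ****************** join *******************
driver_id:5000, order_id:1000, car_id:100
 ****************** leftOuterJoin *******************
driver_id:5000, order_id:1000, car_id:Optional.of(100)
driver_id:5001, order_id:1001, car_id:Optional.absent()
driver_id:5002, order_id:1002, car_id:Optional.absent()
 ****************** rightOuterJoin *******************
driver_id:5000, order_id:Optional.of(1000), car_id:100
driver_id:5003, order_id:Optional.absent(), car_id:103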
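If you only want to see the semantics of the three operators, no Hive deployment is needed; they can be reproduced in local mode. Below is a minimal sketch under my own assumptions (the class name JoinLocalDemo and the local[2] master are illustrative, not from the original program, and a Spark 1.x dependency is assumed on the classpath). parallelizePairs builds the JavaPairRDDs directly, so the DataFrame-to-PairRDD conversion step is unnecessary here.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JoinLocalDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("join-local-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // orders as (driver_id, order_id) pairs, same sample data as above
        JavaPairRDD<String, String> orders = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, String>("5000", "1000"),
                new Tuple2<String, String>("5001", "1001"),
                new Tuple2<String, String>("5002", "1002")));
        // drivers as (driver_id, car_id) pairs
        JavaPairRDD<String, String> drivers = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, String>("5000", "100"),
                new Tuple2<String, String>("5003", "103")));

        // collect() returns a List; its toString shows the (key, (left, right)) structure,
        // with Optional wrapping whichever side can be absent in the outer joins.
        System.out.println("join: " + orders.join(drivers).collect());
        System.out.println("leftOuterJoin: " + orders.leftOuterJoin(drivers).collect());
        System.out.println("rightOuterJoin: " + orders.rightOuterJoin(drivers).collect());

        sc.stop();
    }
}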