Differences between Spark union and Hive union
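A Spark DataFrame has both a union and a unionAll operator, and neither removes duplicate rows (unionAll is simply an alias for union).
This differs from Hive SQL, where UNION ALL keeps duplicates and UNION removes them.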
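Example
// Run in spark-shell, where sc is predefined and spark.implicits._ is already imported (toDF needs it).
val df3: DataFrame = sc.makeRDD(Seq((1, "xm"), (2, "xl"))).toDF("id", "name")
val df4: DataFrame = sc.makeRDD(Seq((1, "xm"), (2, "xl"), (3, "xw"))).toDF("id", "name")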
df3.union(df4).show(false)
+---+----+
|id |name|
+---+----+
|1  |xm  |
|2  |xl  |
|1  |xm  |
|2  |xl  |
|3  |xw  |
+---+----+
df3.unionAll(df4).show(false)
+---+----+
|id |name|
+---+----+
|1  |xm  |
|2  |xl  |
|1  |xm  |
|2  |xl  |
|3  |xw  |
+---+----+
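To get the same result as Hive's UNION, apply the distinct operator after the union (the row order of the result is not guaranteed):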
df3.union(df4).distinct().show(false)
+---+----+
|id |name|
+---+----+
|1  |xm  |
|3  |xw  |
|2  |xl  |
+---+----+
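The comparison above can also be checked by row counts instead of printed tables. The following is a minimal sketch, not a definitive implementation: the object name UnionDemo, the helper counts, the temp view names t3 and t4, and the local master setting are all assumptions made for illustration. It also runs the equivalent queries through spark.sql, where UNION removes duplicates just as it does in Hive SQL, so the non-deduplicating behavior is specific to the DataFrame union/unionAll methods.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object; names are illustrative, not from the original post.
object UnionDemo {
  // Returns (DataFrame union, union + distinct, SQL UNION, SQL UNION ALL) row counts.
  def counts(spark: SparkSession): (Long, Long, Long, Long) = {
    import spark.implicits._

    val df3 = Seq((1, "xm"), (2, "xl")).toDF("id", "name")
    val df4 = Seq((1, "xm"), (2, "xl"), (3, "xw")).toDF("id", "name")

    // DataFrame API: union keeps duplicates; distinct() removes them.
    val apiUnion = df3.union(df4).count()
    val apiDistinct = df3.union(df4).distinct().count()

    // SQL: register temp views (names t3/t4 are assumptions) and compare.
    df3.createOrReplaceTempView("t3")
    df4.createOrReplaceTempView("t4")
    val sqlUnion = spark.sql(
      "SELECT id, name FROM t3 UNION SELECT id, name FROM t4").count()
    val sqlUnionAll = spark.sql(
      "SELECT id, name FROM t3 UNION ALL SELECT id, name FROM t4").count()

    (apiUnion, apiDistinct, sqlUnion, sqlUnionAll)
  }

  def main(args: Array[String]): Unit = {
    // Local session purely for the demo; master and app name are assumptions.
    val spark = SparkSession.builder()
      .appName("union-demo")
      .master("local[*]")
      .getOrCreate()
    try {
      val (apiUnion, apiDistinct, sqlUnion, sqlUnionAll) = counts(spark)
      println(s"df.union: $apiUnion, df.union.distinct: $apiDistinct, " +
        s"SQL UNION: $sqlUnion, SQL UNION ALL: $sqlUnionAll")
    } finally {
      spark.stop()
    }
  }
}
```

With the two sample DataFrames, the DataFrame union and SQL UNION ALL each yield 5 rows, while union followed by distinct and SQL UNION each yield 3.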