Spark SQL DataFrame: union, reduce, and reduce(_ union _)

Latest recommended article published on 2025-03-15 09:38:40

nefu-ljw

Category: Learning Big Data from Scratch
Tags: spark sql dataframe reduce union

Article link: https://blog.csdn.net/ljw_study_in_CSDN/article/details/128537012

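This article introduces the DataFrame union function in Spark, which combines the rows of two datasets and is equivalent to UNION ALL in SQL. It merges data by column position rather than by column name. It also covers the reduce function, which merges multiple DataFrames by way of reduceLeft. The example shows how to use reduce(_ union _) to merge three DataFrames into one.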

The union function

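// toDF needs the session implicits; spark-shell imports them automatically
import spark.implicits._

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val join12 = df1.union(df2)
join12.show(false)
// Result of join12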
+----+----+----+
|col0|col1|col2|
+----+----+----+
|1   |2   |3   |
|1   |2   |3   |
+----+----+----+

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
val df3 = Seq((1, 2, 3)).toDF("col1", "col2", "col0")
val join12 = df1.union(df2)
val join123 = join12.union(df3)
join123.show(false)
// Result of join123
+----+----+----+
|col0|col1|col2|
+----+----+----+
|1   |2   |3   |
|4   |5   |6   |
|1   |2   |3   |
+----+----+----+

union returns a new Dataset containing the union of the rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
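The difference between UNION ALL semantics and a deduplicated set union can be sketched as follows. This is a minimal sketch, assuming a local SparkSession; the names left, right, unionAll, and unionSet are illustrative and do not come from the original article.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell this returns the existing one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("union-distinct-sketch")
  .getOrCreate()
import spark.implicits._

// Two DataFrames holding the same single row (illustrative data)
val left  = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val right = Seq((1, 2, 3)).toDF("col0", "col1", "col2")

// union alone behaves like UNION ALL: the duplicate row is kept
val unionAll = left.union(right)

// union followed by distinct behaves like SQL UNION: duplicates are removed
val unionSet = left.union(right).distinct()

println(unionAll.count()) // 2
println(unionSet.count()) // 1
```

Note that distinct introduces a shuffle, so it is worth applying only when deduplication is actually needed.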

Also as standard in SQL, this function resolves columns by position (not by name):

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+
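Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName to resolve columns by field name in the typed objects.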
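To make the contrast between positional and name-based resolution concrete, here is a small sketch that runs both on the same inputs. It assumes Spark 2.3 or later (where unionByName is available) and a local SparkSession; the names a, b, byPosition, and byName are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell this returns the existing one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("union-by-name-sketch")
  .getOrCreate()
import spark.implicits._

// Same column names, but declared in a different order
val a = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val b = Seq((4, 5, 6)).toDF("col1", "col2", "col0")

// union: b's first column lands under col0, regardless of its name
val byPosition = a.union(b)   // rows: (1, 2, 3) and (4, 5, 6)

// unionByName: b's col0 value (6) lands under col0
val byName = a.unionByName(b) // rows: (1, 2, 3) and (6, 4, 5)

byPosition.show()
byName.show()
```

unionByName fails with an analysis error when the two sides do not share the same set of column names, so it only applies when the names really do line up.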

The reduce function

On a Seq, reduce calls reduceLeft, which is documented as follows:

Applies a binary operator to all elements of this traversable or iterator, going left to right. Note: will not terminate for infinite-sized collections. Note: might return different results for different runs, unless the underlying collection type is ordered or the operator is associative and commutative.
Parameters:
op - the binary operator.
Type parameters:
B - the result type of the binary operator.
Returns:
the result of inserting op between consecutive elements of this traversable or iterator, going left to right:
op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
where x_1, ..., x_n are the elements of this traversable or iterator.
Throws:
UnsupportedOperationException - if this traversable or iterator is empty.
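The left-to-right grouping and the empty-collection behavior described above can be checked on plain Scala collections, without Spark. This is a minimal sketch; the value names are illustrative, and subtraction is chosen only because it is not associative, which makes the grouping visible.

```scala
import scala.util.Try

val nums = Seq(1, 2, 3, 4)

// Grouped left to right: ((1 - 2) - 3) - 4
val leftToRight = nums.reduceLeft(_ - _)
println(leftToRight) // -8

// Reducing an empty collection throws UnsupportedOperationException
val onEmpty = Try(Seq.empty[Int].reduceLeft(_ - _))
println(onEmpty.isFailure) // true

// reduceOption returns None instead of throwing on an empty collection
val safe = Seq.empty[Int].reduceOption(_ + _)
println(safe) // None
```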

reduce(_ union _) example

https://stackoverflow.com/a/37612978/17434375

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

val dfs = Seq(df1, df2, df3)
val result_df = dfs.reduce(_ union _) // reduce{(x,y) => (x union y)}
result_df.show(false)

+---+----+
|id |x   |
+---+----+
|1  |10  |
|2  |20  |
|3  |30  |
|4  |40  |
|1  |100 |
|2  |200 |
|3  |300 |
|4  |400 |
|1  |1000|
|2  |2000|
|3  |3000|
|4  |4000|
+---+----+
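Two details of the example above are easy to miss: the result keeps the column names of the first DataFrame (id, x), and reduce throws if the Seq is empty. The following is one possible sketch of a guard against the empty case, assuming a local SparkSession; unionAllOption is a hypothetical helper name, not a Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session; in spark-shell this returns the existing one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("reduce-union-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: None for an empty Seq instead of an exception
def unionAllOption(dfs: Seq[DataFrame]): Option[DataFrame] =
  dfs.reduceOption(_ union _)

val first  = Seq((1, 10)).toDF("id", "x")
val second = Seq((1, 100)).toDF("id", "y")

val merged = unionAllOption(Seq(first, second))
val empty  = unionAllOption(Seq.empty[DataFrame])

// Column names come from the first DataFrame because union is positional
merged.foreach(df => println(df.columns.mkString(","))) // id,x
println(empty.isDefined) // false
```

Because union is positional, the y values end up under the x column here; if the DataFrames are meant to align by name, reduce(_ unionByName _) is the variant to consider, provided all inputs share the same column names.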