Everyone knows that collect_list and collect_set aggregate multiple rows of the same group into a single row, but how do you perform the inverse operation and expand one row back into multiple rows of a group?
First, create a simple DataFrame:
// assumes a SparkSession named `spark` (spark-shell provides one automatically)
import spark.implicits._
val x = Seq(
  ("li", "1,2,3"),
  ("bo", "10,20,30")
).toDF("name", "time")
x.show()
The initial table is:
+----+--------+
|name| time|
+----+--------+
| li| 1,2,3|
| bo|10,20,30|
+----+--------+
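As an aside, the round trip can be sketched with plain Scala collections (a minimal analogy, no Spark required; the groupBy/flatMap pair here plays the role of collect_list/explode):

```scala
object RoundTrip {
  // One row per (name, value) pair, like the exploded form.
  val rows = Seq(("li", 1), ("li", 2), ("li", 3), ("bo", 10))

  // "collect_list": group by key and gather the values into one list per key.
  val collected: Map[String, Seq[Int]] =
    rows.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // Inverse ("explode"): flatten each list back into one row per element.
  val exploded: Seq[(String, Int)] =
    collected.toSeq.flatMap { case (k, vs) => vs.map(v => (k, v)) }
}
```

The set of rows survives the round trip unchanged, which is exactly the property the rest of this post relies on.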
To make joins with other tables easier later on, add an index column to the initial table.
Adapted from: https://blog.csdn.net/xiligey1/article/details/82498389
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("name")
val result = x.withColumn("index", row_number().over(w))
result.show()
The result is:
+----+--------+-----+
|name| time|index|
+----+--------+-----+
| bo|10,20,30| 1|
| li| 1,2,3| 2|
+----+--------+-----+
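Note that a Window with orderBy but no partitionBy pulls all rows into a single partition (Spark logs a warning to that effect), which is fine for small tables like this one. The numbering itself behaves like a sort followed by zipWithIndex, sketched here in plain Scala:

```scala
object RowNumberSketch {
  // Plain-Scala analogy of row_number() over Window.orderBy("name"):
  // sort by the key, then number the rows starting at 1.
  val names = Seq("li", "bo")
  val indexed: Seq[(String, Int)] =
    names.sorted.zipWithIndex.map { case (n, i) => (n, i + 1) }
}
```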
Here name is a column from my initial table; substitute your own column name.
Now expand the time values into separate rows:
Reference: https://blog.csdn.net/baifanwudi/article/details/86700400
// Note: the Dataset.explode method is deprecated; use functions.explode() instead.
// Note: in Scala, String.split has the signature split(regex: String, limit: Int), which works on plain strings, so for a Column you need functions.split().
// Note: col here means functions.col().
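A quick plain-Scala check of the String.split note above (no Spark needed; the values come from the example table):

```scala
object SplitCheck {
  // String.split on a plain Scala/Java string returns Array[String];
  // Spark's functions.split operates on a Column instead.
  val parts = "10,20,30".split(",")

  // The two-argument overload limits the number of resulting pieces:
  val limited = "10,20,30".split(",", 2)
}
```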
import org.apache.spark.sql.functions
val finalResult = result.withColumn("newtime", functions.explode(functions.split(functions.col("time"), ",")))
finalResult.show()
The result is:
+----+--------+-----+-------+
|name| time|index|newtime|
+----+--------+-----+-------+
| bo|10,20,30| 1| 10|
| bo|10,20,30| 1| 20|
| bo|10,20,30| 1| 30|
| li| 1,2,3| 2| 1|
| li| 1,2,3| 2| 2|
| li| 1,2,3| 2| 3|
+----+--------+-----+-------+
A perfect inverse operation!