Method notes:
- union: merges two DataFrames by column position rather than by column name; the result keeps the first table's column names and order (a.union(b) uses a's schema)
- unionAll: same behavior as union; deprecated since Spark 2.0 in favor of union (hence the deprecation warning in the output below)
- unionByName: merges by matching column names instead of position
Example: table b is built with the id_num and CST_NO columns in swapped order:
val a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
a.show()
val b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("CST_NO", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
b.show()
val r = a.union(b)       // merges by position
r.show()
val p = a.unionAll(b)    // deprecated alias of union, also by position
p.show()
val t = a.unionByName(b) // merges by column name
t.show()
The output:
a: org.apache.spark.sql.DataFrame = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
b: org.apache.spark.sql.DataFrame = [CST_NO: string, id_num: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|CST_NO|id_num|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ming|787878| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
r: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ming|787878| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
warning: there was one deprecation warning; re-run with -deprecation for details
p: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ke| 9999| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| ming|787878| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
t: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution| dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 1| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 200| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
| 9999| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
| 9999| ke| hb|2019-09-04 21:15:15| 1001| 192.196| mac| 43| ATM|
|787878| ming| hlj|2019-09-06 17:15:15| 2002| 192.196| win7| 13| ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
As the output shows, r and p were merged purely by position, so b's id_num values end up under a's CST_NO column (and vice versa); t matched columns by name and is the correct result.
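If unionByName is not available (it was added in Spark 2.3), you can get the same correct result from a plain positional union by first reordering the second table's columns with select. A minimal sketch, reusing the a and b tables above (the names bAligned and fixed are made up for illustration):

import org.apache.spark.sql.functions.col
// Project b's columns into a's order so the positional union lines up correctly.
val bAligned = b.select(a.columns.map(col): _*)
val fixed = a.union(bAligned)
fixed.show() // same rows as t above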
If the two tables' column names do not match exactly, unionByName throws an error:
val a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
val b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("vvv", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE") // CST_NO renamed to vvv
val t = a.unionByName(b)
t.show()
Error: org.apache.spark.sql.AnalysisException: Cannot resolve column name "CST_NO" among (vvv, id_num, distribution, dayId, AMOUNT_cnt, CLIENT_IP, CLIENT_MAC, PAYER_CODE_num, CHANNEL_CODE);
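If you are on Spark 3.1 or later, unionByName also accepts an allowMissingColumns flag that tolerates such mismatches instead of erroring. A sketch under that version assumption:

// Spark 3.1+: columns present on only one side are kept in the result and
// filled with null for rows coming from the side that lacks them, so the
// result here has both a CST_NO and a vvv column.
val t2 = a.unionByName(b, allowMissingColumns = true)
t2.show()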
Note: all three methods require the two DataFrames to have the same number of columns (the number of rows may differ); if the column counts differ, Spark throws an AnalysisException immediately, as shown below.
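A quick way to see the column-count check in the shell (c is a throwaway two-column table made up for this demonstration):

val c = Seq(("1", "ke")).toDF("id_num", "CST_NO")
// Throws org.apache.spark.sql.AnalysisException, because a has 9 columns
// while c has only 2; the row counts are irrelevant.
a.union(c)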