Spark DataFrame union, unionAll, and unionByName methods explained

Method overview:

  • union: merges two DataFrames by column position rather than by column name; the result takes its column names from the left-hand DataFrame (a.union(b) uses a's column order and names).
  • unionAll: an alias of union (deprecated since Spark 2.0 in favor of union, hence the warning in the output below).
  • unionByName: merges by column name instead of by position.
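
A minimal sketch of the difference, using two hypothetical two-column DataFrames df1 and df2 that hold the same values with the columns reversed:

    val df1 = Seq(("1", "a")).toDF("x", "y")
    val df2 = Seq(("a", "1")).toDF("y", "x")
    df1.union(df2).show()       // by position: "a" lands in column x
    df1.unionByName(df2).show() // by name: "1" lands in column x, as intended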

Example:

Swap the order of the id_num and CST_NO columns in table b:
    // Assumes a SparkSession in scope as `spark`; its implicits provide toDF.
    import spark.implicits._

    val a = Seq(
      ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
    ).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
    a.show()

    // b names CST_NO before id_num, the reverse of a's column order.
    val b = Seq(
      ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
    ).toDF("CST_NO", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
    b.show()

    val r = a.union(b)       // merge by position; column names follow a
    r.show()
    val p = a.unionAll(b)    // deprecated alias of union
    p.show()
    val t = a.unionByName(b) // merge by column name
    t.show()
The output is:

a: org.apache.spark.sql.DataFrame = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

b: org.apache.spark.sql.DataFrame = [CST_NO: string, id_num: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|CST_NO|id_num|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

r: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

warning: there was one deprecation warning; re-run with -deprecation for details
p: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+

t: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|  9999|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  9999|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|787878|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
As the output shows, r and p are identical (unionAll is just a deprecated alias of union, which is what the warning above is about): both match columns purely by position, so b's CST_NO values end up under a's id_num column. Only t, produced by unionByName, lines the columns up correctly.
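
The output also shows that union keeps duplicate rows: it behaves like SQL UNION ALL, not SQL UNION. If deduplicated UNION semantics are wanted, a minimal sketch (reusing a and b from above) is to chain distinct() after the merge:

    // union/unionByName keep duplicates (SQL UNION ALL semantics);
    // .distinct() removes them, giving SQL UNION semantics.
    val deduped = a.unionByName(b).distinct()
    deduped.show() // the two identical "ke" rows collapse into one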


unionByName throws an error if the two DataFrames' column names do not match exactly:
    val a = Seq(
      ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
    ).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")

    // b's first column is now named "vvv", so b has no CST_NO column.
    val b = Seq(
      ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
      ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
    ).toDF("vvv", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")

    val t = a.unionByName(b)
    t.show()
Error: org.apache.spark.sql.AnalysisException: Cannot resolve column name "CST_NO" among (vvv, id_num, distribution, dayId, AMOUNT_cnt, CLIENT_IP, CLIENT_MAC, PAYER_CODE_num, CHANNEL_CODE);
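
Two ways around this, sketched under the assumption that vvv really is b's CST_NO under another name: rename the column before merging, or (on Spark 3.1+) pass allowMissingColumns = true so unionByName pads columns missing from either side with nulls:

    // Option 1: rename so the column sets match, then merge by name.
    val fixed = a.unionByName(b.withColumnRenamed("vvv", "CST_NO"))
    fixed.show()

    // Option 2 (Spark 3.1+ only): tolerate the mismatch; the result has
    // both CST_NO and vvv columns, null where a side lacked the column.
    val padded = a.unionByName(b, allowMissingColumns = true)
    padded.show()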


Note: all three methods require the two DataFrames to have the same number of columns; if the column counts differ, Spark throws an error immediately.
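
For instance (a hypothetical variation on the DataFrames above), dropping one column from b makes the counts differ, and union fails at analysis time:

    // 8 columns vs a's 9, so union cannot line the tables up.
    val bShort = b.drop("CHANNEL_CODE")
    // a.union(bShort) // throws org.apache.spark.sql.AnalysisException with a
    //                 // message along the lines of "Union can only be
    //                 // performed on tables with the same number of columns"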
