PySpark SQL: the USING clause keeps only the left table's join column

We have two tables, employee and department, which need to be joined on dept_id:

emp = [(1,"Smith",-1,"2018",10,"M",3000), \
    (2,"Rose",1,"2010",20,"M",4000), \
    (3,"Williams",1,"2010",10,"M",1000), \
    (4,"Jones",2,"2005",10,"F",2000), \
    (5,"Brown",2,"2010",40,"",-1), \
      (6,"Brown",2,"2010",50,"",-1) \
  ]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
       "dept_id","gender","salary"]

empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(100, False)
empDF.createOrReplaceTempView('employee')

dept = [("Finance",10), \
    ("Marketing",20), \
    ("Sales",30), \
    ("IT",40) \
  ]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(100, False)
deptDF.createOrReplaceTempView('department')
root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- year_joined: string (nullable = true)
 |-- dept_id: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+------+--------+---------------+-----------+-------+------+------+
|emp_id|name    |superior_emp_id|year_joined|dept_id|gender|salary|
+------+--------+---------------+-----------+-------+------+------+
|1     |Smith   |-1             |2018       |10     |M     |3000  |
|2     |Rose    |1              |2010       |20     |M     |4000  |
|3     |Williams|1              |2010       |10     |M     |1000  |
|4     |Jones   |2              |2005       |10     |F     |2000  |
|5     |Brown   |2              |2010       |40     |      |-1    |
|6     |Brown   |2              |2010       |50     |      |-1    |
+------+--------+---------------+-----------+-------+------+------+

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

In my experience, joining with an ON clause produces a result that contains the join column from both tables:

spark.sql("""
SELECT *
FROM employee a
INNER JOIN department b
ON a.dept_id = b.dept_id
""").show(100, False)

The SQL above yields two dept_id columns:

+------+--------+---------------+-----------+-------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10     |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10     |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10     |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40     |      |-1    |IT       |40     |
|2     |Rose    |1              |2010       |20     |M     |4000  |Marketing|20     |
+------+--------+---------------+-----------+-------+------+------+---------+-------+
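
The DataFrame API behaves the same way when the join condition is written as an expression. Below is a minimal sketch against the empDF and deptDF created above:

# Joining on an expression keeps dept_id from both sides,
# just like the SQL ON clause.
joined = empDF.join(deptDF, empDF.dept_id == deptDF.dept_id, "inner")
print(joined.columns)
# ['emp_id', 'name', 'superior_emp_id', 'year_joined', 'dept_id',
#  'gender', 'salary', 'dept_name', 'dept_id']
# Selecting "dept_id" from this result would raise an ambiguity error.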

With a USING clause, the result keeps only a single copy of the join column:

spark.sql("""
SELECT *
FROM employee a
INNER JOIN department b
USING(dept_id)
""").show(100, False)
+-------+------+--------+---------------+-----------+------+------+---------+
|dept_id|emp_id|name    |superior_emp_id|year_joined|gender|salary|dept_name|
+-------+------+--------+---------------+-----------+------+------+---------+
|10     |1     |Smith   |-1             |2018       |M     |3000  |Finance  |
|10     |4     |Jones   |2              |2005       |F     |2000  |Finance  |
|10     |3     |Williams|1              |2010       |M     |1000  |Finance  |
|40     |5     |Brown   |2              |2010       |      |-1    |IT       |
|20     |2     |Rose    |1              |2010       |M     |4000  |Marketing|
+-------+------+--------+---------------+-----------+------+------+---------+
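
The DataFrame API gives the same deduplicated result when the join key is passed by name instead of as an expression. A minimal sketch:

# Passing the join key as a column name (the USING equivalent)
# keeps a single dept_id, placed first in the result.
joined = empDF.join(deptDF, on="dept_id", how="inner")
print(joined.columns)
# ['dept_id', 'emp_id', 'name', 'superior_emp_id', 'year_joined',
#  'gender', 'salary', 'dept_name']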

The two queries below show clearly that the USING clause keeps only the left table's join column:

spark.sql("""
SELECT a.*
FROM employee a
INNER JOIN department b
USING(dept_id)
""").show(100, False)

spark.sql("""
SELECT b.*
FROM employee a
INNER JOIN department b
USING(dept_id)
""").show(100, False)
+-------+------+--------+---------------+-----------+------+------+
|dept_id|emp_id|name    |superior_emp_id|year_joined|gender|salary|
+-------+------+--------+---------------+-----------+------+------+
|10     |3     |Williams|1              |2010       |M     |1000  |
|10     |1     |Smith   |-1             |2018       |M     |3000  |
|10     |4     |Jones   |2              |2005       |F     |2000  |
|40     |5     |Brown   |2              |2010       |      |-1    |
|20     |2     |Rose    |1              |2010       |M     |4000  |
+-------+------+--------+---------------+-----------+------+------+

The second query (b.*) does not:

+---------+
|dept_name|
+---------+
|Finance  |
|Finance  |
|Finance  |
|IT       |
|Marketing|
+---------+
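
A quick way to confirm which side the surviving dept_id comes from is to project a.* and b.* together and inspect the resulting columns. This sketch reuses the temp views created above:

# With USING, a.* still contains dept_id while b.* does not,
# so the combined projection ends up with a single dept_id.
joined = spark.sql("""
SELECT a.*, b.*
FROM employee a
INNER JOIN department b
USING (dept_id)
""")
print(joined.columns)
# ['dept_id', 'emp_id', 'name', 'superior_emp_id', 'year_joined',
#  'gender', 'salary', 'dept_name']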