We have two tables, employee and department, that need to be joined on dept_id.
emp = [(1, "Smith", -1, "2018", 10, "M", 3000),
       (2, "Rose", 1, "2010", 20, "M", 4000),
       (3, "Williams", 1, "2010", 10, "M", 1000),
       (4, "Jones", 2, "2005", 10, "F", 2000),
       (5, "Brown", 2, "2010", 40, "", -1),
       (6, "Brown", 2, "2010", 50, "", -1)]
empColumns = ["emp_id", "name", "superior_emp_id", "year_joined",
              "dept_id", "gender", "salary"]
empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.printSchema()
empDF.show(100, False)
empDF.createOrReplaceTempView('employee')
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(100, False)
deptDF.createOrReplaceTempView('department')
root
|-- emp_id: long (nullable = true)
|-- name: string (nullable = true)
|-- superior_emp_id: long (nullable = true)
|-- year_joined: string (nullable = true)
|-- dept_id: long (nullable = true)
|-- gender: string (nullable = true)
|-- salary: long (nullable = true)
+------+--------+---------------+-----------+-------+------+------+
|emp_id|name |superior_emp_id|year_joined|dept_id|gender|salary|
+------+--------+---------------+-----------+-------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-------+------+------+
root
|-- dept_name: string (nullable = true)
|-- dept_id: long (nullable = true)
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
As experience shows, joining with an ON clause keeps the join column from both tables in the result.
spark.sql("""
SELECT *
FROM employee a
INNER JOIN department b
ON a.dept_id = b.dept_id
""").show(100, False)
The SQL above produces two dept_id columns:
+------+--------+---------------+-----------+-------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-------+------+------+---------+-------+
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
+------+--------+---------------+-----------+-------+------+------+---------+-------+
With a USING clause, the result contains only a single copy of the join column:
spark.sql("""
SELECT *
FROM employee a
INNER JOIN department b
USING(dept_id)
""").show(100, False)
+-------+------+--------+---------------+-----------+------+------+---------+
|dept_id|emp_id|name |superior_emp_id|year_joined|gender|salary|dept_name|
+-------+------+--------+---------------+-----------+------+------+---------+
|10 |1 |Smith |-1 |2018 |M |3000 |Finance |
|10 |4 |Jones |2 |2005 |F |2000 |Finance |
|10 |3 |Williams|1 |2010 |M |1000 |Finance |
|40 |5 |Brown |2 |2010 | |-1 |IT |
|20 |2 |Rose |1 |2010 |M |4000 |Marketing|
+-------+------+--------+---------------+-----------+------+------+---------+
The two queries below make this explicit: with USING, only the left table keeps the join column, so a.* still includes dept_id while b.* does not.
spark.sql("""
SELECT a.*
FROM employee a
INNER JOIN department b
USING(dept_id)
""").show(100, False)
spark.sql("""
SELECT b.*
FROM employee a
INNER JOIN department b
USING(dept_id)
""").show(100, False)
+-------+------+--------+---------------+-----------+------+------+
|dept_id|emp_id|name |superior_emp_id|year_joined|gender|salary|
+-------+------+--------+---------------+-----------+------+------+
|10 |3 |Williams|1 |2010 |M |1000 |
|10 |1 |Smith |-1 |2018 |M |3000 |
|10 |4 |Jones |2 |2005 |F |2000 |
|40 |5 |Brown |2 |2010 | |-1 |
|20 |2 |Rose |1 |2010 |M |4000 |
+-------+------+--------+---------------+-----------+------+------+
+---------+
|dept_name|
+---------+
|Finance |
|Finance |
|Finance |
|IT |
|Marketing|
+---------+