参考网址: join on 和 where的区别
- 定义df
d1 = {'name1':["A","B","C"], 'height':[165,170,160]}
d2 = {'name2':["B","C","D"], 'age':[45,43,50]}
df1 = spark.createDataFrame(pd.DataFrame(d1))
df2 = spark.createDataFrame(pd.DataFrame(d2))
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
df1 df2
+-----+------+ +-----+---+
|name1|height| |name2|age|
+-----+------+ +-----+---+
| A| 165| | B| 45|
| B| 170| | C| 43|
| C| 160| | D| 50|
+-----+------+ +-----+---+
- join是一个笛卡尔积操作,形成一张临时表
>>>spark.sql(
"""
select * from table1
left join table2
"""
).show()
# 输出
+-----+------+-----+---+
|name1|height|name2|age|
+-----+------+-----+---+
| A| 165| B| 45|
| A| 165| C| 43|
| A| 165| D| 50|
| B| 170| B| 45|
| B| 170| C| 43|
| B| 170| D| 50|
| C| 160| B| 45|
| C| 160| C| 43|
| C| 160| D| 50|
+-----+------+-----+---+
ON
on条件是在生成临时表时使用的条件,他不管on的条件是否为真,都会返回左边表所有的记录
- (1) 自己理解: 如果条件为左表条件,返回on中满足条件的左表记录对应右表的所有记录,返回不满足条件的左表记录对应右表null,不管on的条件是否为真,都会返回左边表所有的记录
>>>spark.sql(
"""
select * from table1
left join table2
on table1.name1 in ('A','B')
"""
).show()
# 输出
+-----+------+-----+----+
|name1|height|name2| age|
+-----+------+-----+----+
| A| 165| B| 45|
| A| 165| C| 43|
| A| 165| D| 50|
| B| 170| B| 45|
| B| 170| C| 43|
| B| 170| D| 50|
| C| 160| null|null|
+-----+------+-----+----+
- (2)自己理解: 如果条件为右表条件,返回on中满足条件的右表记录对应的左表所有记录。
>>>spark.sql(
"""
select * from table1
left join table2
on table2.name2 in ('B','C')
"""
).show()
# 输出
+-----+------+-----+---+
|name1|height|name2|age|
+-----+------+-----+---+
| A| 165| B| 45|
| A| 165| C| 43|
| B| 170| B| 45|
| B| 170| C| 43|
| C| 160| B| 45|
| C| 160| C| 43|
+-----+------+-----+---+
- (3)自己理解: 如果条件为右表条件,但没有满足on中条件的右表记录,则也返回左表记录
>>>spark.sql(
"""
select * from table1
left join table2
on table2.name2 in ('A')
"""
).show()
# 输出
+-----+------+-----+----+
|name1|height|name2| age|
+-----+------+-----+----+
| A| 165| null|null|
| B| 170| null|null|
| C| 160| null|null|
+-----+------+-----+----+
- (4)自己理解: 如果条件为左表条件等于右表条件,则在1的基础上,过滤右表条件
>>>spark.sql(
"""
select * from table1
left join table2
on table1.name1=table2.name2
"""
).show()
# 输出
+-----+------+-----+----+
|name1|height|name2| age|
+-----+------+-----+----+
| A| 165| null|null|
| B| 170| B| 45|
| C| 160| C| 43|
+-----+------+-----+----+
WHERE
where条件是在临时表生成好后,再对临时表进行过滤的条件。这时已经没有left join的含义了,where后提交不为真的就全部过滤掉.
- 在笛卡尔积的基础上,直接按照where过滤
>>>spark.sql(
"""
select * from table1
left join table2
where table1.name1='A'
"""
).show()
# 输出
+-----+------+-----+---+
|name1|height|name2|age|
+-----+------+-----+---+
| A| 165| B| 45|
| A| 165| C| 43|
| A| 165| D| 50|
+-----+------+-----+---+