过滤条件on:指的是join后面on的具体内容,例如: on t1.age>=20
过滤条件where:指的是where条件后面的内容,例如:where t1.age>=20
我们来看具体结果,到底有什么不同?
案例1:过滤条件on
spark-sql> with test1 as (
> select 1 as user_id,'xiaoming' as name
> union all
> select 2 as user_id,'xiaolan' as name
> union all
> select 3 as user_id,'xiaoxin' as name),
>
> test2 as (
> select 1 as user_id,19 as age
> union all
> select 2 as user_id,20 as age
> union all
> select 3 as user_id,21 as age)
>
> select
> t1.user_id,
> t1.name,
> t2.user_id,
> t2.age
> from
> (select
> user_id,
> name
> from test1) t1
>
> left join
> (select
> user_id,
> age
> from test2) t2
> on t1.user_id=t2.user_id and t1.user_id<=2;
1 xiaoming 1 19
2 xiaolan 2 20
3 xiaoxin NULL NULL
Time taken: 6.106 seconds, Fetched 3 row(s)
注:从结果能看出来,on t1.user_id<=2的条件只是让user_id<=2的数据不参与join操作,这样右表数据就关联不上了,但是左边中的数据并不会被过滤掉
案例2:过滤条件where
spark-sql> with test1 as (
> select 1 as user_id,'xiaoming' as name
> union all
> select 2 as user_id,'xiaolan' as name
> union all
> select 3 as user_id,'xiaoxin' as name),
>
> test2 as (
> select 1 as user_id,19 as age
> union all
> select 2 as user_id,20 as age
> union all
> select 3 as user_id,21 as age)
>
> select
> t1.user_id,
> t1.name,
> t2.user_id,
> t2.age
> from
> (select
> user_id,
> name
> from test1) t1
>
> left join
> (select
> user_id,
> age
> from test2) t2
> on t1.user_id=t2.user_id
> where t1.user_id<=2;
1 xiaoming 1 19
2 xiaolan 2 20
Time taken: 0.279 seconds, Fetched 2 row(s)
注:从结果能看出来,where t1.user_id<=2的条件让user_id<=2的数据直接过滤掉了,但是有一个点没有考虑优化的是左表中user_id<=2会参与join操作后才会被过滤掉
总结:
1.过滤条件on起到的作用是让user_id<=2的数据不参与join操作,并不能将我们不需要的数据过滤掉
2.过滤条件where起到了过滤的作用,把我们不需要的数据直接过滤掉
3.也可以这么理解,过滤条件on起到了局部过滤作用,过滤条件where起到了全局过滤作用