目前在做hive上表关联的时候,觉得有些以前oracle sql的写法效率特低。尤其是左表的限制条件是写里面,还是写外面的问题,hive和另外两种sql还是有区别的:
oracle:
--写外面
explain plan for select count(*) as cnt
2 from cj_0711_tmp1 a
3 left outer join
4 cj_0711_tmp2 b
5 on (b.stat_date=to_date('20120701')
6 and a.object_id=b.object_id)
7 where a.stat_date=to_date('20120702');
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 44 | 61 (2)| 00:00:02 | | |
| 1 | SORT AGGREGATE | | 1 | 44 | | | | |
| 2 | HASH JOIN OUTER | | 10 | 440 | 61 (2)| 00:00:02 | | |
| 3 | PARTITION RANGE SINGLE| | 10 | 220 | 30 (0)| 00:00:01 | KEY | KEY |
| 4 | TABLE ACCESS FULL | CJ_0711_TMP1 | 10 | 220 | 30 (0)| 00:00:01 | KEY | KEY |
| 5 | PARTITION RANGE SINGLE| | 2 | 44 | 30 (0)| 00:00:01 | KEY | KEY |
| 6 | TABLE ACCESS FULL | CJ_0711_TMP2 | 2 | 44 | 30 (0)| 00:00:01 | KEY | KEY |
---------------------------------------------------------------------------------------------------------
--写里面
explain plan for select count(*) as cnt
from cj_0711_tmp1 a
left outer join
cj_0711_tmp2 b
on (b.stat_date=to_date('20120701')
and a.object_id=b.object_id
and a.stat_date=to_date('20120702'))
;
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 863K (1)| 04:18:56 | | |
| 1 | SORT AGGREGATE | | 1 | 22 | | | | |
| 2 | NESTED LOOPS OUTER | | 28718 | 616K| 863K (1)| 04:18:56 | | |
| 3 | PARTITION RANGE ALL | | 28718 | 616K| 613 (1)| 00:00:12 | 1 | 23 |
| 4 | TABLE ACCESS FULL | CJ_0711_TMP1 | 28718 | 616K| 613 (1)| 00:00:12 | 1 | 23 |
| 5 | VIEW | | 1 | | 30 (0)| 00:00