Hive map-side join
If all but one of the tables being joined are small, the join can be performed as a map-only job. Such a query
does not need a reducer. For every mapper of the large table a, the small table b is read completely. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
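In other words, when some of the tables being joined are small, a map-side join can be used: the join is optimized to run only a map job, with no reduce job afterwards. The limitation is that a full/right outer join against b is not supported.
It works roughly like this: the small table is first cached in memory, and the cached small table is then joined against the large table. For example: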
step 1:
Read the small table's data from HDFS into memory (reading only the small table's key column is enough)
step 2:
On the map side:
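for (bigTable.row) {
    for (smallTable.row) {
        if (bigTable.key == smallTable.key) { out(bigTable.row) }
    }
}
// This is why a right outer join or full outer join is not possible: there is only a map phase,
// each mapper sees only its own split of the large table and emits only large-table rows,
// so no mapper can tell which small-table rows have no match anywhere.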
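
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not Hive's actual implementation. The table contents, the split layout and the function names (build_small_table_index, map_side_join) are all made up for the example. It also swaps the nested loop of the pseudocode for a hash lookup on the cached small table, which gives the same matches, and it emits the small table's value alongside the large-table row.

```python
from collections import defaultdict


def build_small_table_index(small_rows):
    """Step 1: load the small table into an in-memory hash map keyed by join key."""
    index = defaultdict(list)
    for key, value in small_rows:
        index[key].append(value)
    return index


def map_side_join(big_split, small_index):
    """Step 2: one mapper joins its split of the large table against the cached small table."""
    for key, value in big_split:
        # .get() avoids creating empty entries in the defaultdict for missing keys
        for small_value in small_index.get(key, ()):
            yield (key, value, small_value)


# Hypothetical sample data: rows are (key, value) pairs.
small_table = [(1, "s1"), (2, "s2"), (3, "s3")]
big_splits = [
    [(1, "b1"), (4, "b4")],   # split read by mapper 0
    [(2, "b2"), (1, "b1x")],  # split read by mapper 1
]

# Every mapper loads the complete small table; one shared index stands in for that here.
small_index = build_small_table_index(small_table)

# Inner join: each mapper works independently and no reduce step is needed.
joined = [row for split in big_splits for row in map_side_join(split, small_index)]

# Why RIGHT/FULL OUTER JOIN fails: each mapper can only say which small-table keys
# are unmatched within its own split, not across the whole large table.
unmatched_per_mapper = [
    sorted(set(small_index) - {key for key, _ in split}) for split in big_splits
]

print(joined)
print(unmatched_per_mapper)
```

In this sample only key 3 is truly unmatched in the small table, yet mapper 0 would also report key 2 as unmatched because the matching row sits in mapper 1's split. Resolving that requires combining information across mappers, which is exactly what a map-only job cannot do.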