The demo from the official site:
FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt ) map_output
INSERT OVERWRITE TABLE pv_users_reduced SELECT TRANSFORM(map_output.dt, map_output.uid) USING 'reduce_script' AS date, count;
The MAP and REDUCE keywords are aliases for SELECT TRANSFORM. The equivalent code below is a little clearer to read:
FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' AS dt, uid CLUSTER BY dt ) map_output
INSERT OVERWRITE TABLE pv_users_reduced REDUCE map_output.dt, map_output.uid USING 'reduce_script' AS date, count;
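In the map stage, SELECT TRANSFORM() is equivalent to the MAP keyword;
in the reduce stage, SELECT TRANSFORM() is equivalent to the REDUCE keyword. Both are purely syntactic aliases: writing MAP or REDUCE does not by itself force a map or reduce phase to run.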
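The demo does not show what 'map_script' and 'reduce_script' contain. As a rough sketch only, assuming Hive's default behavior of passing each row to the script as a tab-separated line on stdin and reading tab-separated lines back from stdout, the two stages could look like the following Python. The function names map_rows and reduce_rows are hypothetical, as is the choice of counting rows per dt.

```python
from itertools import groupby


def map_rows(lines):
    """Hypothetical map_script logic: turn (userid, date) into (dt, uid)."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"


def reduce_rows(lines):
    """Hypothetical reduce_script logic: count rows per dt.

    Assumes CLUSTER BY dt has already made rows with the same dt adjacent,
    so a single streaming pass with groupby is enough.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(parsed, key=lambda cols: cols[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"


# In an actual script the input would come from stdin, for example:
#     import sys
#     for out in reduce_rows(sys.stdin):
#         print(out)
```

The reduce side only works as a streaming pass because of CLUSTER BY dt in the inner query. Without it, rows for one dt could arrive interleaved with others and would be counted as separate groups.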
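CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same columns. The two can be thought of as corresponding to Hadoop's partition and sort steps. If the partition key and the sort key differ, specify them separately with DISTRIBUTE BY and SORT BY. For example: distribute by a.user_id sort by a.user_id, a.begintime (all rows with the same user_id go to the same reducer and are ordered by begintime ascending, so each reducer sees every one of its users as a contiguous, time-ordered trajectory).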
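To make the partition-then-sort behavior concrete, here is a small in-memory Python simulation of distribute by user_id sort by user_id, begintime. It is an illustration only: the function name distribute_and_sort is hypothetical, and crc32 is a stand-in for whatever hash Hive actually uses to assign rows to reducers.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers):
    """Mimic DISTRIBUTE BY user_id SORT BY user_id, begintime.

    rows: iterable of (user_id, begintime) tuples.
    Returns a dict mapping reducer index to that reducer's sorted rows.
    """
    buckets = defaultdict(list)
    for user_id, begintime in rows:
        # DISTRIBUTE BY: the key alone decides which reducer gets the row.
        # crc32 is an illustrative stand-in, not Hive's real hash function.
        reducer = zlib.crc32(user_id.encode("utf-8")) % num_reducers
        buckets[reducer].append((user_id, begintime))
    # SORT BY: rows are ordered within each reducer, not globally.
    return {r: sorted(b) for r, b in buckets.items()}
```

A reducer may receive several users, which is why user_id stays in the sort key: it keeps each user's rows together before begintime orders them.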