Analysis of How Hive Runs Custom MapReduce Scripts

     1.  How Hive runs custom mapred scripts

        1.1) Syntax and examples for using custom mapred scripts in HQL

Syntax:

FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
 
  FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)? 
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
HQL usage examples:

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  SELECT TRANSFORM(map_output.dt, map_output.uid)
  USING 'reduce_script'
  AS date, count;
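
The two statements above reference 'map_script' and 'reduce_script' without showing them. A minimal sketch of what they could look like, assuming Hive's default TRANSFORM row format (tab-separated columns, one row per line); the scripts below are illustrative and are not part of the original example:

#!/usr/bin/python
# map_script (illustrative): reads "userid<TAB>date" rows from stdin and
# emits "date<TAB>userid", matching the AS dt, uid clause above.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 2:
        continue
    userid, dt = fields
    sys.stdout.write(dt + '\t' + userid + '\n')

#!/usr/bin/python
# reduce_script (illustrative): input arrives grouped by dt (CLUSTER BY dt),
# so a running count per dt can be emitted as "date<TAB>count".
import sys

cur_dt = None
cnt = 0
for line in sys.stdin:
    dt, uid = line.rstrip('\n').split('\t')
    if dt != cur_dt:
        if cur_dt is not None:
            sys.stdout.write(cur_dt + '\t' + str(cnt) + '\n')
        cur_dt = dt
        cnt = 0
    cnt += 1
if cur_dt is not None:
    sys.stdout.write(cur_dt + '\t' + str(cnt) + '\n')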
1.2) How Hive executes custom mapred scripts

The MAP and REDUCE keywords are merely syntactic variants of SELECT TRANSFORM. In other words, a script invoked with the MAP keyword does not necessarily run in the map phase, and a script invoked with the REDUCE keyword does not necessarily run in the reduce phase.

Consider the following examples:

# red.py source code
#!/usr/bin/python
# coding=utf-8
# Echoes at most the first 10 input lines to stdout; logs a start timestamp
# and any exception to stderr so they show up in the task logs.

import sys
import time
import traceback

sys.stderr.write("we are running at " + time.strftime("%Y-%m-%d %H:%M:%S") + "\n")

num = 10   # maximum number of lines to emit
i = 0

try:
    for l in sys.stdin:
        if i < num:
            print l[:-1]   # strip the trailing newline; print re-adds one
            i += 1
except:
    traceback.print_exc(file=sys.stderr)
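
Before a script such as red.py can be referenced in a USING clause, it has to be shipped to the task nodes; in Hive this is done with ADD FILE, which places the file in the distributed cache. A minimal sketch, assuming red.py sits in the Hive client's working directory and testhive is the table used in the examples below:

ADD FILE red.py;

FROM testhive
SELECT TRANSFORM(testhive.channel, testhive.pv)
USING 'red.py'
AS c, p;

-- USING 'red.py' relies on the script's shebang line; USING 'python red.py'
-- also works and avoids shebang/line-ending issues.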

Example 1: this HQL statement generates one mapred job; the job runs only a map phase, and the number of reducers is zero.

FROM (
  FROM testhive
  MAP testhive.channel, testhive.pv
  USING '/bin/cat'
  AS dt, uid) map_output
REDUCE map_output.dt, map_output.uid
  USING '/bin/cat'
  AS date, count;
Example 2: each of the following two HQL statements generates a mapred job that runs only a map phase; the number of reducers is zero.

FROM testhive
MAP testhive.channel, testhive.pv
USING 'red.py'
AS c, p;

FROM testhive
REDUCE testhive.channel, testhive.pv
USING 'red.py'
AS c, p;
Example 3: this HQL statement generates a mapred job that runs both the map and reduce phases.

FROM (
  FROM testhive
  MAP testhive.channel, testhive.pv
  USING 'red.py'
  AS dt, uid
  CLUSTER BY dt) map_output
REDUCE map_output.dt, map_output.uid
  USING 'red.py'
  AS date, count;
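
Rewriting Example 3 with SELECT TRANSFORM yields the same job, which illustrates the point above that MAP and REDUCE are only syntactic variants of TRANSFORM:

FROM (
  FROM testhive
  SELECT TRANSFORM(testhive.channel, testhive.pv)
  USING 'red.py'
  AS dt, uid
  CLUSTER BY dt) map_output
SELECT TRANSFORM(map_output.dt, map_output.uid)
  USING 'red.py'
  AS date, count;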

Summary: when a custom script is embedded in HQL, the phase in which it runs is determined by the script's syntactic position in the statement. In other words, when Hive compiles the HQL into a mapred job, if the statement contains clauses that require a reduce phase (GROUP BY, CLUSTER BY, SORT BY, and so on) and the MAP or REDUCE clause appears after such a clause, the custom script runs on the reduce side; otherwise it runs on the map side.
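
Which phase a script will actually run in can be checked before launching the job by running EXPLAIN on the statement: in the printed plan the Transform Operator appears either in the map-side operator tree or in the reduce-side operator tree (the exact plan layout depends on the Hive version). A minimal sketch using Example 3:

EXPLAIN
FROM (
  FROM testhive
  MAP testhive.channel, testhive.pv
  USING 'red.py'
  AS dt, uid
  CLUSTER BY dt) map_output
REDUCE map_output.dt, map_output.uid
  USING 'red.py'
  AS date, count;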

Note: if a mapred script was developed on Windows, run dos2unix on it after uploading it to Linux; otherwise the carriage-return line endings can cause errors such as the following:

Error reported by the Hadoop mapred framework:
java.io.IOException: Cannot run program "/data9/mapred/local/taskTracker/zhaoxiuxiang/jobcache/job_201311120949_47441/attempt_201311120949_47441_m_000003_2/work/././map.py": error=2, No such file or directory
Error when running the script locally:
-bash: ./map.py: /usr/bin/python^M: bad interpreter: No such file or directory
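
A quick way to fix the line endings before adding the script with ADD FILE, assuming it is named map.py as in the error above:

dos2unix map.py
# equivalent when dos2unix is not available:
sed -i 's/\r$//' map.py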


Note: use the following parameters to enable reporter and debug output from the script's stderr:

stream.stderr.reporter.enabled=true
stream.stderr.reporter.prefix=reporter:
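
These are Hive session properties, so they can be set right before the TRANSFORM query. With the prefix enabled, stderr lines from the script that start with the prefix are treated as reporter updates (following Hadoop Streaming's reporter:counter:<group>,<counter>,<amount> convention), while other stderr output goes to the task log. A minimal sketch (the counter group and name below are illustrative):

SET stream.stderr.reporter.enabled=true;
SET stream.stderr.reporter.prefix=reporter:;

-- the script can then increment a counter by writing a line such as
--   reporter:counter:my_script,bad_rows,1
-- to stderr.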